Why is your CFO angry? Because when they ask, "Why did our AI bill go up 20% this month?", your answer is probably weak. You might say "Traffic is up," but the CFO checks the analytics and sees traffic is only up 5%. The numbers don't add up. The trust evaporates.
This discrepancy happens because engineering teams often track Infrastructure Metrics (Token Count, Latency, Error Rate) while finance teams track Business Metrics (CAC, LTV, Margins). In the AI era, we need a bridge metric.
Why "Monthly Spend" is Useless
Tracking high-level monthly spend on OpenAI or Anthropic tells you nothing about the health of your feature. The bill might have gone up because:
Positive: Users are engaging more deeply with the product.
Negative: A prompt engineering change caused models to become verbose, outputting 2x more tokens for the same answer.
Negative: Agents are getting stuck in loops, retrying tasks 5 times before succeeding.
Negative: Users are asking harder questions that require more expensive reasoning.
"Total Spend" hides all these variables. It is a vanity metric for billing, not a diagnostic metric for engineering.
The Holy Grail: Cost Per Solved Task (CPST)
You need to normalize your spend against value. The metric you must adopt is Cost Per Solved Task (CPST).
CPST = Total Spend on Agents (Inference + Tools) / Number of Successful Outcomes
This metric is powerful because it punishes failure. If your bill doubles but your successful resolutions triple, your CPST has gone down, and you are a hero. If your bill stays flat but success rate drops (perhaps due to a new model update breaking logic), your CPST spikes, alerting you immediately that you are burning money on bad outcomes.
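The formula, and the "bill doubles but resolutions triple" scenario, can be sanity-checked in a few lines. This is a minimal illustrative helper, not a library API:

```python
def cpst(total_spend: float, successful_outcomes: int) -> float:
    """Cost Per Solved Task: total agent spend divided by successful outcomes."""
    if successful_outcomes == 0:
        return float("inf")  # all spend, zero value delivered
    return total_spend / successful_outcomes

# Bill doubles but successful resolutions triple: CPST goes DOWN.
before = cpst(1000.0, 2000)  # $0.50 per solved task
after = cpst(2000.0, 6000)   # ~$0.33 per solved task
```

Note that a flat bill with a falling success rate moves the denominator, not the numerator, which is exactly why CPST spikes while "Total Spend" looks healthy.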
How to Instrument CPST
Step 1: Programmatic Definition of Success
This is the hardest part. You must define what a "win" looks like for your agent. It cannot be subjective.
Support Bot: "User did not request human escalation within 10 minutes" OR "User clicked 'Thumbs Up'."
Coding Agent: "The generated code passed the unit test suite."
Search Agent: "The user clicked on one of the provided citations."
Workflow Agent: "The API call to the downstream system returned 200 OK."
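The definitions above can be encoded as a single predicate over your trace data. A sketch, assuming a hypothetical trace-record shape (the field names here are illustrative, not a real schema):

```python
def is_success(agent_type: str, trace: dict) -> bool:
    """Programmatic 'win' definition per agent type. Field names are hypothetical."""
    if agent_type == "support_bot":
        return bool(trace.get("thumbs_up")) or not trace.get("escalated_within_10m", False)
    if agent_type == "coding_agent":
        return trace.get("unit_tests_passed", False)
    if agent_type == "search_agent":
        return trace.get("citation_clicked", False)
    if agent_type == "workflow_agent":
        return trace.get("downstream_status") == 200
    raise ValueError(f"No success definition for agent type: {agent_type}")
```

The important property is that every branch is machine-checkable: no branch asks a human whether the answer "seemed good."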
Step 2: Trace Tagging with OpenTelemetry
This step relies on tracing. Use OpenTelemetry (OTEL) to instrument your AI chains, and at the end of every chain execution append a tag:
outcome: success or outcome: failure
You also log the total_cost (calculated from token usage) on the trace span.
Step 3: The Dashboard (Grafana / Arize)
Now, build a query that sums the cost of all traces and divides by the count of success traces. Plot this over time.
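Whatever the tool, the query underneath is a grouped sum-over-count. A library-free sketch over exported trace rows (the record shape is illustrative):

```python
from collections import defaultdict

def daily_cpst(traces: list[dict]) -> dict[str, float]:
    """Per day: sum cost of ALL traces, divide by count of success traces."""
    spend = defaultdict(float)
    wins = defaultdict(int)
    for t in traces:
        day = t["timestamp"][:10]      # "YYYY-MM-DD"
        spend[day] += t["total_cost"]  # every trace costs money...
        if t["outcome"] == "success":
            wins[day] += 1             # ...but only successes count as value
    return {day: spend[day] / wins[day] for day in spend if wins[day]}
```

Note that failed traces contribute to the numerator but not the denominator, which is the "punishes failure" property in action.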
Using CPST as a Leading Indicator
We often see that CPST spikes before user complaints roll in.
Example: You deploy a new prompt that encourages the agent to be "more thorough."
Day 1: Error rates remain low (0%). Users are getting answers.
Day 2: You notice CPST jumps from $0.50 to $0.75.
Investigation: You dig into the traces. You find that the "thorough" agent is now doing 3 Google searches instead of 1. The success rate hasn't changed, but the efficiency has collapsed. You catch this billing leak immediately, rather than waiting for the end-of-month invoice.
The "Cheap Model" Fallacy
CPST also helps you argue for better models. A manager might say, "GPT-4o is too expensive, switch to GPT-4o-mini."
GPT-4o-mini: Cost per run $0.01. Success Rate 60%. CPST = $0.0167 (plus the cost of user churn).
GPT-4o: Cost per run $0.10. Success Rate 99%. CPST = $0.101.
While the CPST is higher for the big model, you can now quantify exactly how much (6x). You can then ask: "Is a satisfied customer worth the extra 8 cents?" Usually, the answer is yes. Without CPST, you only see the 10x token cost difference, which looks terrifying.
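The arithmetic behind that comparison: a run that succeeds 60% of the time costs, on average, 1/0.60 runs per solved task.

```python
def cost_per_solved(cost_per_run: float, success_rate: float) -> float:
    # On average you pay for 1/success_rate attempts per solved task.
    return cost_per_run / success_rate

mini = cost_per_solved(0.01, 0.60)  # ~$0.0167 per solved task
big = cost_per_solved(0.10, 0.99)   # ~$0.101 per solved task
ratio = big / mini                  # ~6x, not the 10x the raw token prices suggest
```

This simple model ignores retries and churn, both of which make the cheap model look even worse in practice.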
Conclusion
Stop optimizing for cheaper tokens. Optimize for cheaper solutions. By tracking Cost Per Solved Task, you align Engineering, Finance, and Product around a single source of truth that represents value, not just volume.
All in One Place
Atler Pilot decodes your cloud spend story by bringing monitoring, automation, and intelligent insights together for faster and better cloud operations.

