The AI bill rarely explodes for one reason. It usually grows quietly: slightly longer contexts, a few percent of retries, embeddings on every query, a system prompt appended to every call. After a quarter, the company pays twice as much and can’t pinpoint what’s eating the budget. FinOps for LLMs starts with one assumption: you can’t optimize spending you haven’t broken down.
Why You Track Cost per Token, Function, and User
The global cloud bill only tells you that it’s expensive; it doesn’t say what is actually costly. Meaningful monitoring breaks spending down along three axes. Per token shows the ratio of input to output tokens and catches bloated context. Per function assigns cost to a specific task: classification, summarization, RAG response, report generation. Per user (or per client, per team) reveals that 5% of accounts generate 40% of the bill.
Only these three views together enable decisions. If one function consumes half the budget, you optimize that, not the entire system equally. It’s the same logic we use in AI agent quality monitoring—without per-task attribution, measurement is decoration, not a tool.
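As a rough illustration of the three views, here is a minimal Python sketch. The prices, the `CallRecord` fields, and the sample records are all assumptions made up for the example; in practice the token counts would come from your provider's usage data and the prices from its current rate card.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; substitute your provider's real rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


@dataclass
class CallRecord:
    function: str        # task label, e.g. "classification" or "rag_answer"
    user: str            # user, client, or team identifier
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        """Cost of this single call in USD."""
        return (self.input_tokens * PRICE_PER_M_INPUT
                + self.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def breakdown(records, key):
    """Sum cost per value of `key` ("function" or "user"), most expensive first."""
    totals = defaultdict(float)
    for r in records:
        totals[getattr(r, key)] += r.cost
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


def input_output_ratio(records):
    """Input tokens divided by output tokens; a rising value hints at bloated context."""
    total_in = sum(r.input_tokens for r in records)
    total_out = sum(r.output_tokens for r in records)
    return total_in / total_out if total_out else float("inf")


# Made-up sample data for illustration only.
records = [
    CallRecord("rag_answer", "acme", 8_000, 400),
    CallRecord("classification", "acme", 300, 5),
    CallRecord("rag_answer", "globex", 6_000, 350),
]
print(breakdown(records, "function"))
print(breakdown(records, "user"))
print(round(input_output_ratio(records), 1))
```

The point of the sketch is that all three views come from the same per-call record, so the only hard requirement is that every call is logged with its function, its user, and both token counts.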
Where Costs Hide
Most wasted budget doesn’t come from “too large a model,” but from places no one looks. Here are the most common.
| Hidden Cost Source | Mechanism | Typical Share of Bill |
|---|---|---|
| Long contexts | Entire conversation history or full document appended to every call | 20-40% |
| System prompts | Extensive instruction with every query, multiplied by traffic | 10-25% |
| Retries and timeouts | Repeats after errors or slow responses, billed double | 5-15% |
| Embeddings | Vectorization of queries and documents on every indexing and search | 5-20% |
| No cache | Same questions processed from scratch in every session | 15-50% |
The most insidious are long contexts and system prompts—they’re invisible, appearing as “normal” calls while quietly multiplying input tokens per request. A shorter system prompt and context trimmed to what’s actually needed often saves more than switching models.
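To make the effect of input tokens repeated on every call concrete, here is a small sketch. Every number in it (traffic, price, prompt sizes) is an illustrative assumption, and `trim_history` uses a word count as a crude stand-in for a real tokenizer.

```python
def monthly_input_cost(tokens_per_call: int, calls_per_month: int,
                       price_per_m_input: float) -> float:
    """Monthly cost of input tokens that are repeated on every call."""
    return tokens_per_call * calls_per_month * price_per_m_input / 1_000_000


def trim_history(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit within max_tokens.

    count_tokens defaults to a word count, which only approximates
    real tokenization; pass your tokenizer's counter in practice.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        n = count_tokens(msg)
        if used + n > max_tokens:
            break
        kept.append(msg)
        used += n
    return list(reversed(kept))


# Illustrative assumptions, not measurements.
CALLS_PER_MONTH = 2_000_000
PRICE_PER_M_INPUT = 3.00   # USD per 1M input tokens (hypothetical)

before = monthly_input_cost(1_500, CALLS_PER_MONTH, PRICE_PER_M_INPUT)
after = monthly_input_cost(600, CALLS_PER_MONTH, PRICE_PER_M_INPUT)
print(f"system prompt before: ${before:,.0f}/month")
print(f"system prompt after:  ${after:,.0f}/month")
print(f"saved:                ${before - after:,.0f}/month")
```

The same arithmetic applies to conversation history: every token you keep in the context is paid for again on each subsequent call.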
Semantic Caching and Routing to a Cheaper Model
Two levers deliver the fastest return. The first is semantic caching: questions with similar meaning are served from the cache instead of being sent to the model. In FAQ and customer service scenarios, repeatability is high, so hit rates can be significant: instead of running full inference, you respond from cache. The rule is simple: pay once, answer many times.
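The idea can be sketched in a few lines of Python. This is a toy version under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the 0.8 similarity threshold is arbitrary, and the linear scan over entries would be replaced by a vector index at any real scale.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold   # minimum similarity to count as a hit
        self.entries = []            # list of (embedding, answer) pairs
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, question, compute):
        """Return a cached answer for a similar question, else call compute()."""
        query = embed(question)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            self.hits += 1
            return best[1]
        self.misses += 1
        answer = compute(question)
        self.entries.append((query, answer))
        return answer


# Usage with a fake model so the example runs without any API.
model_calls = []


def fake_model(question):
    model_calls.append(question)
    return f"answer to: {question}"


cache = SemanticCache(threshold=0.8)
cache.get_or_compute("how do i reset my password", fake_model)
cache.get_or_compute("how do i reset my password please", fake_model)
cache.get_or_compute("what is your refund policy", fake_model)
print(cache.hits, cache.misses, len(model_calls))
```

The threshold is the main design choice: set it too low and users get answers to a different question, set it too high and the cache rarely hits. It is worth tuning against real traffic rather than picking a value up front.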
The second is routing to a cheaper, sufficient model. Not every task needs the largest model. Ticket classification, form field extraction, or short summaries work fine on a small, cheap model; save the powerful one for tasks that truly require it. That’s the role of an LLM router—a single decision that often cuts variable cost by 40-70% with no noticeable quality drop. How to choose the right model for the task is broken down in a separate guide.
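At its simplest, such a router is a lookup on the task label. The sketch below assumes hypothetical model names and task labels; a production router would likely also consider input length, confidence, or a fallback path, none of which is shown here.

```python
# Hypothetical model names and task labels; not tied to any provider.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"
SIMPLE_TASKS = {"classification", "field_extraction", "short_summary"}


def route(task: str) -> str:
    """Pick the cheapest model expected to be sufficient for the task.

    Unknown tasks default to the strong model, so a missing label
    costs money rather than quality.
    """
    return CHEAP_MODEL if task in SIMPLE_TASKS else STRONG_MODEL


for task in ("classification", "report_generation"):
    print(task, "->", route(task))
```

Defaulting unknown tasks to the stronger model is a deliberate choice in this sketch: it keeps quality safe while attribution data shows which tasks can be moved down.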
Budgets, Alerts, and Attribution in Practice
Optimization without guardrails is only half the job. The other half is budgets and alerts that stop the bill before it gets expensive. In practice, we implement a few simple mechanisms:
- Daily and monthly budgets per function and per client, with a soft threshold (alert) and a hard one (degradation to a cheaper model or throttling).
- Anomaly alerts—a sudden cost spike per user usually means a retry loop, abuse, or a leaking context.
- Tagging every call with a function ID and tenant so attribution is automatic, not manual.
- Kill switch at the observability level that cuts off an expensive function without bringing down the whole system.
These mechanisms don’t replace optimization—they ensure one bug in the code or unusual traffic spike doesn’t turn a good month into a blown budget. Unit cost is calculated the same way as in AI agent pricing: per completed task, not per abstract “month of AI.”
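The soft threshold, hard threshold, and kill switch described above can be sketched as a small guard object. This is a simplified illustration: the class name, the 80% soft ratio, and the action labels are assumptions, and the daily reset, persistence, and alert delivery are left out.

```python
from collections import defaultdict


class BudgetGuard:
    """Tracks spend per (function, tenant) and returns an action for each call."""

    def __init__(self, daily_limit: float, soft_ratio: float = 0.8):
        self.daily_limit = daily_limit                # hard threshold in USD
        self.soft_limit = daily_limit * soft_ratio    # soft threshold in USD
        self.spent = defaultdict(float)               # (function, tenant) -> USD
        self.killed = set()                           # functions switched off

    def kill(self, function: str) -> None:
        """Kill switch: block one function without touching the others."""
        self.killed.add(function)

    def record(self, function: str, tenant: str, cost: float) -> None:
        """Add the cost of a completed call to the running total."""
        self.spent[(function, tenant)] += cost

    def check(self, function: str, tenant: str) -> str:
        """Decide what to do before making the next call."""
        if function in self.killed:
            return "blocked"
        spent = self.spent.get((function, tenant), 0.0)
        if spent >= self.daily_limit:
            return "degrade"   # hard threshold: cheaper model or throttling
        if spent >= self.soft_limit:
            return "alert"     # soft threshold: notify, keep serving
        return "ok"


guard = BudgetGuard(daily_limit=10.0)
print(guard.check("rag_answer", "acme"))
guard.record("rag_answer", "acme", 8.5)
print(guard.check("rag_answer", "acme"))
guard.record("rag_answer", "acme", 2.0)
print(guard.check("rag_answer", "acme"))
guard.kill("report_generation")
print(guard.check("report_generation", "acme"))
```

Keying the spend on the function and tenant pair is what makes the tagging from the list above pay off: the same tags that drive attribution also drive enforcement.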
Self-Hosting vs. Cloud: When to Switch
The question “self-hosting or cloud API” is about volume, not ideology. With low or irregular traffic, the cloud wins: you pay for usage, with no entry cost and zero maintenance. With steady, high load, self-hosting starts to win: hardware amortization and your own inference deliver a lower and, more importantly, predictable unit cost.
| Option | When It Pays Off | Cost Profile |
|---|---|---|
| Cloud / API | Low or variable volume, quick start, no MLOps team | Low fixed, grows linearly with traffic |
| Self-hosting | High, steady volume, data residency requirement | High upfront, low and flat unit cost |
| Hybrid | Base load local, spikes to cloud | Medium, optimized for volume |
The crossover point depends on monthly call volume and whether you have someone to maintain the infrastructure. We don’t give a single number because it would be made up—the threshold shifts with every hardware generation and cloud pricing change. That’s why we match the option to real load, measuring unit cost in both before making any permanent switch.
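The comparison itself is simple arithmetic once you have measured both unit costs. The sketch below assumes a linear cost model (a fixed monthly cost plus a per-call cost) and uses placeholder inputs only; they are not benchmarks, and the maintenance effort has to be folded into the fixed cost by you.

```python
def api_monthly_cost(calls: int, cost_per_call: float) -> float:
    """Cloud/API cost: no fixed part, grows linearly with traffic."""
    return calls * cost_per_call


def self_hosted_monthly_cost(calls: int, fixed_monthly: float,
                             marginal_cost_per_call: float) -> float:
    """Self-hosting cost: amortized hardware and upkeep plus a small per-call cost."""
    return fixed_monthly + calls * marginal_cost_per_call


def break_even_calls(fixed_monthly: float, api_cost_per_call: float,
                     marginal_cost_per_call: float):
    """Monthly call volume above which self-hosting is cheaper; None if never."""
    saving_per_call = api_cost_per_call - marginal_cost_per_call
    if saving_per_call <= 0:
        return None
    return fixed_monthly / saving_per_call


# Placeholder inputs; replace them with your own measured unit costs.
FIXED_MONTHLY = 1_000.0       # hardware amortization + maintenance, USD
API_COST_PER_CALL = 0.50      # measured average API cost per call, USD
MARGINAL_COST_PER_CALL = 0.25 # measured self-hosted cost per call, USD

threshold = break_even_calls(FIXED_MONTHLY, API_COST_PER_CALL, MARGINAL_COST_PER_CALL)
print(f"break-even at about {threshold:,.0f} calls per month")
```

If the function returns `None`, self-hosting never pays off on cost alone under this model, and the decision rests on other factors such as data residency.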
FAQ
Where do I start with LLM cost monitoring?
With attribution: tagging every call by function, tenant, and input/output token count. Without this, you only see the global bill and don’t know what to optimize. Once you have cost per function and per user, choosing levers (cache, routing, shorter context) becomes obvious.
What most often burns the AI budget?
Most often, not the model itself, but long contexts and system prompts appended to every call, lack of caching on repeat questions, and retry loops. These costs are hidden because each individual call looks normal; they add up only at scale.
How much can caching and routing realistically cut the bill?
In our experience, routing to a cheaper, sufficient model typically cuts variable cost by 40-70%, and semantic caching in scenarios with repeat questions can handle 15-50% of calls. The combined effect depends on traffic profile, so treat these ranges as guidance, not a promise.
Do I need a dedicated tool for LLM FinOps?
Not at first. Consistent per-call cost logging and a simple dashboard with budgets and alerts are enough. Dedicated tools make sense with multiple models, teams, and high volume. The key is attribution discipline, not the tool’s brand.
When is self-hosting cheaper than the cloud?
With steady, high volume, and where cost predictability and data residency matter. With low or irregular traffic, the cloud usually wins because there is no entry cost. The threshold is set by your volume, so calculate unit cost for both options before deciding.