A company launches a pilot with an AI assistant. Initial tests work great. After a month in production, the API budget is exceeded by 280%. The cause isn’t technical—it’s a design decision from the pilot phase: the system sent full PDFs to the model instead of fragments extracted by RAG, and every query included a 2400-token system instruction copied to each call.
LLM token cost isn’t linear relative to the number of queries. It’s linear relative to the number of tokens. The difference between these two statements is the difference between a maintainable budget and one that forces system shutdown.
How to count tokens: not all are priced equally
#A token is the basic billing unit for language models. One token is usually 3-4 characters in English, 2-3 characters in Polish (Polish diacritics and longer words break into more tokens). For Polish text, assume 30-40% more tokens than for the English equivalent.
Output tokens (throughput) usually cost more than input tokens. The table below shows a typical cost structure for different model classes (indicative prices—verify in the current pricing of your chosen provider).
| Model Class | Input Tokens | Output Tokens | Out/In Price Ratio |
|---|---|---|---|
| Small model (7B-13B, local) | 0 PLN (self-hosting) | 0 PLN (self-hosting) | — |
| Mid-class API model | 0.15-0.60 USD / 1M | 0.60-2.50 USD / 1M | 3-5× |
| Premium API model | 1.50-5.00 USD / 1M | 6.00-20.00 USD / 1M | 3-6× |
| Long-context model | 3.00-10.00 USD / 1M (>100k tok) | 10.00-30.00 USD / 1M | 3-4× |
Key observation: Premium models cost 10-30 times more than mid-class models. But if a mid-class model requires twice the prompt length to achieve the same result, the real difference is smaller. Before choosing a model, measure both parameters: price per token and required context length to achieve acceptable quality.
Calculate for your own parameters using the inference calculator, where you can input real volume and get the monthly cost in PLN.
Where the biggest costs hide in practice
#In production projects we’ve analyzed, 70-80% of token costs come from three sources rarely discussed during the pilot phase.
System prompt copied to every query. The system instruction describing assistant behavior is usually 500-3000 tokens. With 10,000 queries per day and a 1500-token prompt, this amounts to 15 million tokens monthly just for system context—before the model even reads a single user question. Most API providers don’t automatically cache system prompts between calls unless you use prompt caching.
Sending full documents instead of fragments. An agent receiving a PDF invoice as full text (2000-8000 tokens) instead of 3-5 fragments extracted by RAG (150-400 tokens) may use 10-30 times more tokens per operation. The difference is dramatic at high document volumes.
Conversation history without trimming. Chat interfaces that pass the entire conversation history to the model grow linearly with session length. A 20-exchange conversation can have 15,000 tokens of context for the last query, even if the user asks something simple. A sliding window (last N messages) or summarizing older history reduces this cost by 60-80%.
Prompt caching: the biggest quick win
#Most major API providers offer prompt caching, which reduces the cost of a repeatedly used prompt prefix (system prompt, reference documents, instructions) by 70-90%. The mechanism works by hashing the prefix and storing the model’s internal state calculations. Second and subsequent calls with the same prefix pay a fraction of the price.
Conditions that must be met for caching to work:
- The prefix must be byte-identical. One character change invalidates the cache.
- The prefix must exceed the minimum length (usually 1024-2048 tokens, depending on the provider).
- Calls must occur within a time window (usually a few minutes to an hour).
In practice, this means: system prompts and instructions should be at the start of the context, before dynamic parts (user question, RAG results). Dynamic elements should be at the end to avoid invalidating the prefix.
For a system with a 2000-token system prompt and 10,000 daily calls, prompt caching reduces input token costs by 50-65% without any changes to application logic.
RAG as a token-limiting strategy
#RAG is often described as a technique for improving response quality. That’s true, but in the context of token costs, RAG is primarily a context selection strategy.
The difference between a system with and without RAG:
- Without RAG: Entire company documents (10-50 pages, 8000-40,000 tokens) go into every query.
- With RAG: Semantic search extracts 3-5 most relevant fragments (300-800 tokens), and only those go to the model.
Good reranking after the search phase further reduces the number of fragments passed to the model while maintaining high relevance. The retrieve-rerank-trim pattern (retrieve 20 fragments, rerank, send top 3-5) reduces context tokens by 70-80% compared to naive retrieve-all.
The article company GPT based on knowledge describes RAG architecture in detail. For pipelines with high document volumes, read how to prepare company data for AI, which covers chunking strategies directly affecting the number of tokens per query.
Model router: not every query needs a premium model
#An LLM router is a layer that classifies a query and routes it to the cheapest model sufficient for the task. In a production customer service system, the typical query distribution looks like this:
| Query Type | Example | Required Model | Relative Cost |
|---|---|---|---|
| Simple FAQ, single-sentence answer | “What are your opening hours?” | Small model / local | 1× |
| Information extraction from document | “What’s in paragraph 3 of this contract?” | Mid-class model | 3-5× |
| Multi-document analysis | “Compare these two offers” | Premium model | 10-20× |
| Reasoning, complex inference | “What error is hidden in this argument?” | Premium model or thinking mode | 15-40× |
Routing based on intent classification (a small model classifies, a large one executes only when needed) reduces costs by 50-70% in systems with heterogeneous query types. Requires A/B testing to confirm that response quality with routing doesn’t drop below an acceptable threshold.
Our OpenClaw router infrastructure applies this pattern by default, routing simple queries to a local model and complex ones to cloud models while maintaining audit logs for every call.
Monitoring and alerts: token budget as SLO
#Without measuring token costs, they remain invisible until the API provider’s invoice exceeds the budget. Treat token usage as an operational metric, similar to latency and availability.
Minimum metrics to track in production:
- Input and output tokens per endpoint or feature (not just total).
- Cost per user session or per business transaction.
- Percentage of system prompt in input tokens.
- Cache hit rate for prompt caching (if used).
- Distribution of model response length (long responses may signal unnecessary verbosity).
Alerts should operate at two levels: a warning at 70% of the daily budget and a hard limit at 90% with automatic throttling or degradation to a cheaper model. Monitoring AI agent quality describes a broader observability context, including observability metrics for the AI layer.
Output strategies: shorter responses without quality loss
#Output tokens are more expensive. A few patterns that shorten model responses without degrading quality:
Precise format instructions. “Answer in a maximum of 3 sentences” works, but better: “List a maximum of 3 points as a list, without introduction or summary.” Models without instructions tend to generate ceremonial openings and closings that add no value.
Structured output. When expecting data to be processed by code, structured output (JSON schema) eliminates narrative wrapping. Extracting 5 fields from a document as JSON takes 80-120 output tokens, while the same extraction as narrative takes 300-600 tokens.
Temperature vs. length. Higher temperature doesn’t lengthen responses, but temperature=0 with explicit length in the prompt yields more predictable, shorter responses for deterministic tasks.
Stop sequences. Define a stop token for the model (e.g., ### or a specific JSON separator) so the model doesn’t continue after the actual response ends. Particularly useful for generating lists with a limited number of items.
Self-hosting as a strategy for high volume
#At volumes above 5-10 million tokens per day, self-hosting a local model can be cheaper than API, even after accounting for infrastructure costs. The break-even point depends on the model, hardware, and query profile.
For tasks that don’t require a frontier model (classification, data extraction, FAQ, simple summaries), local 7B-34B class models achieve acceptable quality at near-zero cost per token. The article local vs API LLM cost describes a break-even calculator and typical usage profiles.
The self-hosting decision isn’t just about cost. It also involves data residency (data doesn’t leave the infrastructure), compliance with RODO for sensitive data, and eliminating dependency on an external provider. The article self-hosted LLM and RODO covers these aspects in detail.
Try it live
#Describe your AI system’s architecture (daily query volume, system prompt length, whether you use RAG) and the model will identify where the biggest token optimization potential lies (playground: PII masked, zero retention):
FAQ
#Does switching to a cheaper model always reduce costs?
#Not always. A cheaper model often requires a longer, more precise prompt to achieve the same result as a premium model. If the cost per token is 5 times lower, but the prompt needs to be 3 times longer and the model generates twice as many output tokens to avoid errors, the actual savings are small or nonexistent. Before changing models, measure the total cost per task (not per token) on a representative test set of at least 200 examples.
What is prompt caching, and when does it really work?
#Prompt caching is a mechanism where the API provider retains the model’s internal state for a repeatedly used prompt prefix. Each subsequent call with an identical prefix pays for cached tokens instead of full input tokens, which is usually 70-90% cheaper. Condition: The prefix must be byte-identical between calls and exceed the minimum length threshold (usually 1024 tokens). Changing even one character invalidates the cache. In practice, it works great for static system prompts and contextual instructions that don’t change between user queries.
How does RAG reduce token cost compared to sending full documents?
#RAG replaces expensive full-document context with cheap fragment context. Instead of sending an entire document (2000-40,000 tokens) to every query, RAG uses semantic search to select 3-5 relevant fragments (150-600 tokens total). With 10,000 daily queries and a 5000-token document, the difference is 50 million vs. 2-3 million tokens monthly. The cost of embeddings and search is negligible compared to savings on model tokens. More on building RAG pipelines in the article semantic search and embeddings in the company.
Is self-hosting a local model more cost-effective than API?
#Self-hosting eliminates per-token costs but has its own expenses: GPU hardware or rented servers, maintenance, model updates, and engineering time. At low volumes (below 1 million tokens per day), API is usually cheaper after accounting for operational costs. Self-hosting becomes cost-effective at high, consistent volumes and for tasks that don’t require a frontier model. Use the ROI calculator to compare both scenarios for your usage profile.
How to measure token cost per feature, not just total?
#Track tokens at the call level, not the session. For every model call, log: the feature or endpoint that triggered it, input and output token counts, and whether cache was used. Aggregate by feature in a dashboard (e.g., Grafana with Prometheus metrics). This reveals which feature accounts for 60% of costs, allowing you to prioritize optimization. The implementation pattern is described in monitoring AI agent quality under the LLM call telemetry section.