A customer service assistant with an 1,800-token system prompt and a 600-token knowledge base context sends a total of 2,400 tokens of "static" input per each of 15,000 monthly queries. That’s 36 million tokens monthly for text that doesn’t change. Prompt caching eliminates this cost without a single line of business code.
What is prompt caching and how the prefix mechanism works
#Language models process every call from scratch unless the provider implements a prefix cache mechanism. On the first call with a given prefix, the model computes the so-called attention keys and values (KV-cache) for those tokens and saves them on the server. Each subsequent call starting with an identical prefix reads the KV-cache instead of recomputing it.
The cache hit condition is precise: the prefix must be byte-identical from the start of the prompt to the defined split point. One changed character, a reordered instruction, or differently formatted date in the system prompt breaks the match. This is a fundamental difference from semantic cache, which operates on semantic similarity and tolerates paraphrasing.
Practical consequences of this mechanism: the static system prompt must be at the beginning of the prompt (before variable user context), and the variable part (query, fresh RAG context, conversation history) must follow the block you want to cache. API providers offering this feature (including Anthropic, Google Gemini, some OpenAI variants) require a minimum cached fragment length, typically 1,024 tokens, to make caching cost-effective on the infrastructure side.
The cost of a cache hit depends on the provider: Anthropic charges 10% of the normal input token price for a cache hit, while Google Gemini in context caching mode can drop to 25% of the base price. In both cases, writing to the cache (first call) costs 100-125% of the normal input price, so profitability starts from the second query with the same prefix.
How prompt caching differs from semantic cache
#These two mechanisms operate at different layers and solve different problems. They’re worth comparing because companies implementing AI often confuse them or choose one where both are needed.
Prompt caching works at the inference level: it eliminates the cost of processing the immutable input part. It doesn’t touch the model’s response or cache results. Each query still goes to the model and receives a unique response. The benefit is solely the reduction of input token costs for the static prefix.
Semantic cache works at the response level: it captures the (query, response) pair and returns the stored response for semantically similar subsequent queries without model involvement. The benefit includes both input and output token costs, and even complete elimination of model latency (10-30 ms instead of 300-1500 ms).
The two mechanisms complement each other: semantic cache handles repetitive FAQ questions without the model, while prompt caching reduces the cost of all remaining queries that still go to the model with a static prefix. In systems with extensive RAG, it’s worth considering both simultaneously, as we describe in the article on LLM token cost optimization.
When prompt caching delivers the biggest savings
#Savings are directly proportional to two factors: the share of the static prefix in the total prompt length and query volume. Scenarios with high multiplication of these factors are natural priorities.
Assistant with a long system prompt and large RAG context. If the system prompt is 2,000 tokens and you attach 1,500 tokens of documents from the knowledge base as static product context to each call, the prefix totals 3,500 tokens. With 10,000 monthly queries, that’s 35 million input tokens to eliminate. At $0.003/1k tokens, savings reach $80-100 monthly; for a premium model ($0.015/1k), it exceeds $450 monthly.
Batch document processing. Analyzing a long document (report, contract, dataset) split into multiple analytical calls is a classic case: the document is the prefix, and analytical questions change. Prompt caching reduces the cost of repeated analysis of the same file by 60-80%.
Multi-step agent pipelines. In an agent architecture performing a dozen steps in one session, where each step sees the same system prompt and history of previous actions as a prefix, the cache accumulates savings with each additional step.
When the effect is marginal: systems with short prompts (below 1,024 tokens of static prefix don’t qualify for cache with most providers), systems with high context variability (where the "static" prefix changes every few queries), and single-shot calls where each prompt is unique.
How to structure your prompt to hit the cache
#The order of blocks in a prompt is an engineering decision with a direct impact on system TCO. Practical rule: from most static to most variable.
Cache-optimized structure:
- System prompt (role instructions, tone, rules, guardrails) — most static, changes once every few weeks or months.
- Static knowledge context (product documents, FAQ, company glossary) — changes when the knowledge base updates, not with every query.
- Conversation history (previous turns) — changes every few turns but grows gradually within a session.
- User query — variable, at the end.
Structural errors that break cache: injecting a date or timestamp into the system prompt (changes every second, prefix never hits cache), placing variable user context before static instructions, dynamically reformatting static blocks with each call.
Use token separators between blocks if the API allows it to define the split point. Some SDKs (Anthropic Python SDK from version 0.28) support cache_control: {"type": "ephemeral"} annotation at the message level, letting you mark the exact split point between static and variable segments.
Detailed patterns for building system prompts, including element order and instruction structure, are described in the article on enterprise prompt engineering.
Table: scenario vs. savings vs. cache hit condition
#| Scenario | Static prefix share | Estimated input token savings | Cache hit condition |
|---|---|---|---|
| Assistant with extensive system prompt (2,000+ tok.) | 60-80% of prompt | 50-70% of input costs | Prefix byte-identical, min. 1,024 tok. |
| RAG with large product context (1,500+ tok. documents) | 40-65% of prompt | 35-55% of input costs | Document block before user query |
| Long document analysis (batch, multiple questions) | 70-90% of prompt | 60-80% of input costs | Document as prefix, questions as suffix |
| Multi-step agent (dozen steps/session) | 50-75% of prompt per step | 45-65% of input costs | System prompt + history as cache, new step as suffix |
| Short system prompt (< 500 tok.) | < 30% of prompt | < 15% of input costs | Below minimum threshold for most providers |
| Prompt with dynamic date/timestamp in prefix | 0% (prefix always different) | 0% | Doesn’t hit cache, requires refactoring |
Prompt caching and self-hosting with local models
#Prompt caching is a server-side infrastructure feature. Models run locally via Ollama, vLLM, or llama.cpp may support this mechanism, but it depends on the specific inference server implementation.
vLLM (from version 0.4.0) supports automatic prefix caching for all models if enable_prefix_caching=True is set at startup. llama.cpp with the --cache-prompt parameter saves and loads KV-cache between sessions in the same process. Ollama (as of 2026) doesn’t expose this option through the public API, but the mechanism exists in the llama.cpp layer it’s built on.
For self-hosting, the key additional requirement is memory: KV-cache for a 2,000-token prefix of a 70B model occupies 0.5-2 GB VRAM (depending on quantization precision). Before enabling prefix caching on local infrastructure, check if you have VRAM headroom. Savings on GPU time and throughput can be significant with high query volume to the same prefix.
Analyze the cost-effectiveness threshold between API and self-hosting, accounting for hardware costs and prompt caching on both sides, in the article on AI agent maintenance costs.
Try it yourself: prompt caching cost-effectiveness analysis for your system
#FAQ
#Does prompt caching require changes to application code?
#It depends on the provider. With Anthropic, you must mark cache_control blocks in the SDK or directly in the API request, which requires a few lines of code. Google Gemini Context Caching has a separate endpoint for saving context before the call. OpenAI (selected models) caches prefixes automatically without annotations but only when reusing the exact same prefix. In any case, changes affect the API call layer, not the assistant’s business logic.
How long does the provider keep the prefix cache?
#Cache lifetime (TTL) varies by provider. Anthropic sets TTL at 5 minutes for standard cache, with the option to refresh via subsequent calls. Google Gemini Context Caching lets you set a custom TTL (from minutes to hours) and charges a separate fee for cache storage per hour. With low query volume (below a few per minute), a 5-minute TTL can cause frequent cache misses and reduce effective savings.
Prompt caching and GDPR (RODO): does data stay in the provider’s infrastructure longer?
#Yes, the prefix cache is stored on the provider’s side for the TTL duration. If the system prompt or RAG context contains personal or confidential company data, using prompt caching means this data is stored on the provider’s servers longer than a single call. Before enabling cache, review the DPA (Data Processing Agreement) with the provider and assess whether the prefix content requires a DPIA. If sensitive data can’t leave local infrastructure, consider self-hosting with vLLM prefix caching.
Yes, and it’s one of the measurable side benefits. Eliminating KV-cache recomputation for a large prefix reduces time to first token by 15-40% for typical system prompt sizes. The effect is most visible with prefixes over 2,000 tokens and models with long context (100k+), where full prefix processing takes hundreds of milliseconds.
Can you combine prompt caching with a model router?
#Yes, this is a common optimization pattern. A model router directs queries to different model classes (cheap/fast vs. expensive/accurate). Enable prompt caching separately for each model class because the system prefix differs for each model in the router. Routers built on n8n or custom code can pass the model identifier to the prompt management layer so the cache key includes the model version and avoids hits between different models using similar system prompts.