A legal team at a law firm deployed a RAG assistant to search through 2,000 contracts. After a few weeks, they noticed the assistant accurately answered questions about clauses from the first and last pages of documents but consistently ignored provisions in the middle sections. The model wasn’t broken. The problem was that 20-30 text fragments were fed into each query without any sorting, and those in the middle of the window were given lower priority by the model.
This is a classic example of poor context engineering. It’s not about how much text you feed the model. It’s about what you feed it, in what order, and whether the model has a real chance to use it effectively.
Accurate retrieval instead of cramming the entire document
#The most common mistake is passing the entire document to the LLM instead of precisely selected fragments. For documents spanning 50 or more pages, the context window quickly fills up, costs rise, and you still have no guarantee the model will focus on the right section.
The RAG approach works differently: instead of the whole document, the model receives 5-10 fragments selected by semantic search and optionally a reranker. The quality of these fragments depends on chunking, which we cover in the article on document chunking for RAG. In practice, implementations clearly show that the quality of retrieved fragments has a greater impact on response accuracy than the model’s parameters themselves.
Human decision point: The number of retrieved fragments (top-k) and the threshold for rejecting weak results (score threshold) require calibration on a test set specific to your document and domain. Setting these parameters is an engineering task, not an algorithmic one.
The lost-in-the-middle effect and fragment order
#In 2023, researchers from Stanford University published an observation that AI implementation practitioners already knew from their own projects: language models distinctly remember information from the beginning and end of the context window better than from its middle. The effect grows with window length. With a 16,000-token window, information placed between positions 6,000 and 10,000 may be practically ignored by the model.
Practical implications for context insertion order:
- System instruction: Always at the beginning, before document fragments.
- Most relevant RAG fragments: At the start of the list, not the end.
- Conversation history: After the system instruction but before document fragments.
- Low-relevance fragments: If you must include them, place them in the middle. Better to filter them out.
The reranking result can be used for sorting: the fragment with the highest score goes first, the lowest last (if at all). At Cashcrown, we recommend keeping fewer than 8 fragments in a single context window for fact-based queries. More rarely improves quality but always increases costs.
Compressing conversation history
#In conversational assistants, conversation history grows with each exchange. After 10-15 dialogue turns, it can occupy 3,000-6,000 tokens before you even add fragments from the knowledge base. Two compression strategies we use in practice:
Incremental summarization: After every n-th turn (e.g., every 5 exchanges), the agent generates a summary of the previous part of the conversation and replaces the full transcripts of those turns. The summary takes up 200-400 tokens instead of 1,500-2,500. Downside: precise facts mentioned in the middle of the conversation may be simplified in the summary. Require source citation in the summary or pause for decisions requiring precision.
Vector memory with selection: Instead of rewriting history, you store it in a vector database (a pattern described in the article on AI agent memory) and retrieve only those fragments relevant to the current query. History from a week-old session doesn’t clog the window unless needed.
The choice between these approaches depends on the nature of the dialogue. For multi-turn conversations about a single document, incremental summarization works better. For assistants serving customers over many weeks, vector memory is a necessity. The decision on the compression method should be made by someone who understands which pieces of history are critical for the given use case.
Token budget vs. cost and latency
#Every token in the context window costs money and affects response time. For cloud models, the cost of input tokens ranges from $0.50 to $15 per million tokens, depending on the model. For self-hosting, it’s GPU time and latency. Full costs and optimization patterns are covered in the article on LLM token cost.
The table below shows how different context-filling strategies affect token budget, cost, and the risk of quality degradation:
| Strategy | Typical context size (tokens) | Relative cost | Lost-in-middle risk |
|---|---|---|---|
| Full document without filtering | 8,000-128,000 | high | high (above 16k) |
| RAG top-5 without reranking | 1,500-3,000 | low | low |
| RAG top-10 with reranking | 2,500-5,000 | medium | low (sorted) |
| Full 15-turn history | 3,000-6,000 | medium | medium |
| Compressed history | 800-1,500 | low | low |
| Full history + full doc | over 20,000 | very high | very high |
Good design rule: Every context element should be justifiable by a concrete benefit to response quality. If you can’t explain why a given fragment is there, remove it.
Compression and formatting of context content
#The form of the context matters as much as its content. A raw PDF after OCR with artifacts, repeating headers, and footnotes is worse context than the same text cleaned and structured.
A few concrete patterns that improve response quality:
Source labeling: Each RAG fragment preceded by a label [Source: Contract XYZ, §3.2] allows the model to cite sources and lets guardrail systems verify that responses don’t go beyond provided facts. Citing sources in responses is a basic mechanism for limiting hallucinations. More on this pattern in the article on limiting AI hallucinations.
Decontextualized fragments: A fragment like “In this matter, the deadline is 14 days” without specifying what matter it refers to is useless. During chunking, it’s worth adding the section header as a prefix to each fragment so the model has context about its position in the document.
Negative instruction in the prompt: Explicitly instructing the model to respond “I don’t know” instead of speculating when fragments don’t contain the answer reduces the number of fabricated responses. This is part of prompt engineering covered in more detail in the article on prompt engineering for businesses.
Human decision point: For high-risk outputs (legal, medical, financial decisions), the system should always indicate the source fragment from which the answer is derived. Human verification of this citation is a mandatory gate before acting on the model’s response.
Test your own scenario
#FAQ
#How many RAG fragments should be included in a single query?
#The optimal number depends on fragment length and the model, but in practice, 5-8 well-reranked fragments yield better results than 15-20 fragments without selection. Above 10 fragments, the risk of the lost-in-the-middle effect increases, and inference costs rise linearly. Start with top-5, measure quality on a test set, and increase cautiously.
Does a longer context window replace well-designed RAG?
#No. Models with 128,000+ token windows tempt you to insert entire documents without filtering, but response precision for questions about details in the middle of the document is lower than with careful retrieval. A large context window is a useful fallback or for one-off analysis, not a substitute for RAG architecture in production systems.
How to handle questions where RAG fragments don’t provide an answer?
#The system instruction should explicitly direct the model to return “I don’t have sufficient information in the available documents” instead of speculating. It’s worth measuring the rate of such responses (the “I don’t know” rate) on a test set: a value below 5-10% for a well-indexed knowledge base is a good benchmark. A higher rate may indicate issues with chunking or retrieval.
How does conversation history compression affect long dialogue consistency?
#Incremental summarization may lose precise facts mentioned in the compressed part of the conversation. Safeguard: Store key facts (numbers, dates, party names) as a separate structured record alongside the summary. The agent can update this record after each turn. For sensitive data, the decision on what goes into permanent memory should be approved by a human-validated retention policy.
Is context engineering a one-time setup or an ongoing process?
#Ongoing. The distribution of user queries changes over time, and with it, the optimal retrieval and context order configuration. We recommend a monthly review of the golden test set: if the faithfulness or accuracy metric drops by more than 5 percentage points, investigate whether the query distribution or document structure has changed. Such quality audits are a continuous process, not a one-time system setup.
