AI for summarizing long documents: strategies and limitations

The legal department receives a 180-page joint venture agreement for review before signing. There is no week to spare; there is one day. The question comes quickly: can AI summarize this?

The answer is: yes, with an important caveat. The model will reduce the time to orient yourself in the document from hours to minutes. But the summary does not replace reading when it comes to clauses that determine liability, compensation, and termination conditions. These are two different use cases that should not be confused.

Problem: document longer than the context window

Language models have a limited context window. Even a 128,000-token window is finite, and accuracy tends to drop as the window fills up. A 180-page contract, an 8-hour board meeting transcript, or a 300-page annual report often exceeds this limit or comes close enough that summary quality noticeably declines.

Two architectural solutions to this problem have different properties and different failure modes.

Map-reduce (hierarchical summarization): The document is split into fragments, each fragment is summarized separately (map phase), and then the summaries are synthesized into a whole (reduce phase). Multi-level hierarchies can be built: first paragraphs into sections, then sections into chapters, then chapters into the whole. The advantage is scalability: the document can be of any length. The disadvantage is that dependencies between fragments may be lost. A clause in chapter 3 defines a term used in chapter 12; if the chapter 12 fragment is summarized without knowing how the term is defined in chapter 3, the model will either guess or miss it.
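The map and reduce phases can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `summarize` stands in for whatever model call you use, and `split_into_fragments`, `map_reduce_summary`, `max_chars`, and `glossary` are hypothetical names introduced here. Passing the contract’s definitions as `glossary` to every map call is one possible way to soften the cross-fragment dependency problem described above; it does not remove it.

```python
from typing import Callable, List


def split_into_fragments(text: str, max_chars: int) -> List[str]:
    """Greedily pack whole paragraphs into fragments of at most max_chars.

    A single paragraph longer than max_chars is kept whole, not cut.
    """
    fragments: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            fragments.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        fragments.append(current)
    return fragments


def map_reduce_summary(
    text: str,
    summarize: Callable[[str], str],  # assumption: wraps your model call
    max_chars: int = 2000,
    glossary: str = "",
) -> str:
    """Summarize each fragment (map), then summarize the summaries (reduce)."""
    fragments = split_into_fragments(text, max_chars)
    # Map phase: one call per fragment. The optional glossary is prepended so
    # that terms defined elsewhere in the document are visible to every call.
    partials = [
        summarize(f"{glossary}\n\n{frag}" if glossary else frag)
        for frag in fragments
    ]
    if len(partials) == 1:
        return partials[0]
    combined = "\n\n".join(partials)
    # Reduce phase: finish with one call once the partial summaries fit, or if
    # another level would not shrink the text (guards against endless recursion).
    if len(combined) <= max_chars or len(combined) >= len(text):
        return summarize(combined)
    # Otherwise add another level to the hierarchy.
    return map_reduce_summary(combined, summarize, max_chars, glossary)
```

Each recursion level corresponds to one level of the hierarchy (paragraphs into sections, sections into chapters, and so on).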

RAG with citations: Instead of summarizing the entire document sequentially, the system answers specific questions through semantic search. The query “what are the termination conditions” retrieves the most relevant fragments, which the model synthesizes with the obligation to cite the page number and paragraph. The advantage is a higher level of trust: every answer has a source. The disadvantage is the need for precise questions and the lack of a holistic overview without iteration. The article AI for document analysis describes this pipeline in detail.
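A toy version of the retrieve-then-cite step might look like the sketch below. To stay self-contained it scores fragments by keyword overlap, which is only a stand-in for real semantic (embedding-based) search. `Fragment`, `retrieve`, and `build_prompt` are hypothetical names, and the prompt wording is an assumption, not a tested template.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Fragment:
    page: int
    paragraph: int
    text: str


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, fragments: List[Fragment], top_k: int = 3) -> List[Fragment]:
    """Rank fragments by shared words with the query (stand-in for semantic search)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(f.text)), f) for f in fragments]
    hits = [(score, f) for score, f in scored if score > 0]
    # sort() is stable, so ties keep their original document order.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in hits[:top_k]]


def build_prompt(query: str, hits: List[Fragment]) -> Optional[str]:
    """Build a prompt that obliges the model to cite page and paragraph."""
    if not hits:
        # No hits: hand this to a human instead of letting the model
        # answer without any source text.
        return None
    sources = "\n".join(f"[p.{f.page}, para {f.paragraph}] {f.text}" for f in hits)
    return (
        "Answer using ONLY the sources below. After every statement, cite the "
        "source as [p.<page>, para <paragraph>]. If the sources do not contain "
        "the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )
```

Returning `None` on an empty result set makes the “clause outside the query” case visible to the caller rather than hiding it behind a confident answer.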

Structured summary strategies

The most useful summaries in a business context are not free narrative text but structures filled by the model according to a schema. Three formats that work in practice:

Key points with location. A list of 5-15 findings with mandatory page and section references. The format forces the model to anchor each point in the text and facilitates human verification: the reader checks not the whole document, but specific places.

Risk summary. A list of items with risk type, description, and document location. Useful for lawyers and due diligence analysts who want to quickly find clauses requiring attention. The model fills the schema via structured output, making integration with risk management systems easier.

Action list. From meeting minutes, project briefs, and audit reports, the model can extract action points with an assigned person and a deadline. Condition: the source document must state these elements explicitly. If it does not, the model will infer them, which increases the risk of error.

All three formats can be validated with a JSON schema before passing results downstream. The article LLM output validation discusses how to design this layer.
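As an illustration, here is a hand-rolled check for the risk summary format using only the standard library. The field names (`risk_type`, `description`, `page`, `section`) and the allowed risk types are assumptions made for this sketch, not a fixed schema; in a real deployment a JSON Schema validator library would likely replace the manual checks.

```python
import json
from typing import List

# Hypothetical schema for one risk item: field name -> expected Python type.
REQUIRED_FIELDS = {"risk_type": str, "description": str, "page": int, "section": str}
# Hypothetical list of allowed risk types.
RISK_TYPES = {"liability", "termination", "payment", "compliance", "other"}


def validate_risk_items(raw_json: str) -> List[str]:
    """Return a list of validation errors; an empty list means the output passes."""
    try:
        items = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(items, list):
        return ["top level must be a list of risk items"]

    errors: List[str] = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"item {i}: must be an object")
            continue
        for name, expected in REQUIRED_FIELDS.items():
            value = item.get(name)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                errors.append(f"item {i}: '{name}' missing or not {expected.__name__}")
        risk_type = item.get("risk_type")
        if isinstance(risk_type, str) and risk_type not in RISK_TYPES:
            errors.append(f"item {i}: unknown risk_type '{risk_type}'")
        page = item.get("page")
        if isinstance(page, int) and not isinstance(page, bool) and page < 1:
            errors.append(f"item {i}: page must be 1 or greater")
    return errors
```

A result with a non-empty error list would be rejected or sent back to the model before anything reaches a downstream system.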

Strategy comparison: when to use which

Strategy	Best for	Risk	Required verification
Map-reduce	long reports, transcripts, narrative documents	loss of dependencies between sections	spot-check of key sections
RAG with questions	contracts, due diligence, Q&A about the document	missing clauses outside the query	confirm that “no hits” means the clause is truly absent
Structured output	tabular extraction, checklists, KPIs	hallucination of numerical values	every number and date
Hierarchical (3 levels)	very long documents (300+ pages)	coherence degradation at the top of the hierarchy	human synthesis of the whole

Choosing the strategy depends on the summary’s purpose, the document’s sensitivity, and how much time the human verifier has. For high-stakes legal or financial documents, there is no strategy that eliminates the need for verification.

Failure modes to be aware of

At Cashcrown, we observe two failure modes that disproportionately occur when summarizing long documents.

Missed clause. In map-reduce, a clause may be missed if the fragment it’s in lacks sufficient context for the model to deem it relevant. This happens with clauses embedded in seemingly standard sections (e.g., a change-of-law clause in “Final Provisions”). No currently available architecture guarantees 100% recall for critical clauses, and without a dedicated golden test set there is no way to measure how close a system actually gets.

Hallucination of a fact not present in the source. The model fills gaps with probable text. When summarizing a contract, it may “complete” a missing payment term with a value typical for such agreements. When summarizing a report, it may provide cumulative KPIs the report didn’t contain but that sound reasonable. Citing the source for every summary point is the most effective defense: a point without a citation signals the model may have guessed.

The article how to limit AI hallucinations describes defense layers in detail. Key takeaway: hallucinations cannot be eliminated by a better model. An architecture with citations and a confidence threshold reduces them to an acceptable level.

The boundary: when a summary isn’t enough

For legal and financial documents, there is a hard boundary that cannot be crossed.

An AI summary is a navigation tool: it lets you quickly find which sections require attention, on which pages critical clauses are located, and what is non-standard compared to the template. It is not and should not be the final interpretation of the content on which decisions about signing, accepting terms, or assuming liability are made.

Human oversight for legal and financial documents means something specific: a lawyer or analyst verifies critical clauses against the source, not against the summary. The summary speeds up this process by pointing out where to look. It does not replace looking.

For documents covered by professional secrecy or containing personal data, the architecture should include self-hosting the model or masking PII before sending to external APIs. The article company GPT based on knowledge discusses deployment variants with different data risk profiles.
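The masking step can be sketched as reversible placeholder substitution. The two regular expressions below are deliberately narrow assumptions (email addresses, and phone numbers written in international format with three groups of three digits); they will miss names, addresses, ID numbers, and other phone formats, so treat this as an illustration of the mechanism rather than a sufficient PII filter. `mask_pii` and `unmask` are hypothetical names.

```python
import re
from typing import Dict, Tuple

# Assumed patterns for this sketch; real PII detection needs far more coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+\d{1,3}(?:[\s-]?\d{3}){3}")  # e.g. +48 601 234 567


def mask_pii(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace matches with numbered placeholders; return text and the mapping."""
    mapping: Dict[str, str] = {}

    def replacer(label: str):
        def _substitute(match: "re.Match[str]") -> str:
            token = f"[{label}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)
            return token
        return _substitute

    text = EMAIL.sub(replacer("EMAIL"), text)
    text = PHONE.sub(replacer("PHONE"), text)
    return text, mapping


def unmask(text: str, mapping: Dict[str, str]) -> str:
    """Restore original values in the model's response, locally."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Only the masked text would be sent to the external API; the mapping stays on your side and is applied to the response afterwards.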

Chunking and verification: two conditions for a good summary

The quality of a summary largely depends on how the document is split into fragments before processing. Fragments that are too small lose context from the previous paragraph. Fragments that are too large reduce precision and increase cost per query.

A few rules that have worked in our deployments:

Chunk boundaries should align with paragraph or section boundaries, not be set mechanically every 512 tokens.
Each chunk should contain metadata: page number, section header, document identifier. Without this metadata, citation is impossible.
For map-reduce, it’s worth using a 10-15% overlap between adjacent chunks so clauses spanning page breaks don’t lose context.
For documents with tables (financial reports, contracts with payment schedules), tables require a separate chunking strategy: an entire table row as one chunk with column headers in each fragment.
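The first three rules can be combined in one small chunker, sketched below under simplifying assumptions: the document has already been parsed into paragraphs with known page and section, token counts are approximated by whitespace-separated words, and tables are handled elsewhere. `Paragraph`, `Chunk`, and `chunk_paragraphs` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paragraph:
    text: str
    page: int
    section: str


@dataclass
class Chunk:
    doc_id: str
    section: str
    pages: List[int]
    text: str


def approx_tokens(text: str) -> int:
    # Assumption: word count as a rough proxy; a real tokenizer would differ.
    return len(text.split())


def chunk_paragraphs(doc_id: str, paragraphs: List[Paragraph],
                     max_tokens: int = 512, overlap_ratio: float = 0.12) -> List[Chunk]:
    """Pack whole paragraphs into chunks that never cross a section boundary."""
    chunks: List[Chunk] = []
    current: List[Paragraph] = []
    carried = 0  # how many paragraphs in `current` are overlap from the previous chunk

    def flush() -> None:
        if len(current) > carried:  # skip chunks that contain only overlap
            chunks.append(Chunk(
                doc_id=doc_id,
                section=current[0].section,
                pages=sorted({p.page for p in current}),
                text="\n\n".join(p.text for p in current),
            ))

    for para in paragraphs:
        size = sum(approx_tokens(p.text) for p in current)
        if current and para.section != current[0].section:
            flush()  # section boundary: start fresh, no overlap across sections
            current, carried = [], 0
        elif current and size + approx_tokens(para.text) > max_tokens:
            # Carry trailing paragraphs into the next chunk, up to the overlap budget.
            budget = int(max_tokens * overlap_ratio)
            tail: List[Paragraph] = []
            for prev in reversed(current):
                cost = approx_tokens(prev.text)
                if cost > budget:
                    break
                budget -= cost
                tail.insert(0, prev)
            flush()
            current, carried = tail, len(tail)
        current.append(para)
    flush()
    return chunks
```

A paragraph longer than `max_tokens` ends up in an oversized chunk rather than being cut mid-sentence; whether that trade-off is acceptable depends on the model’s limits.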

The article document chunking for RAG describes chunking strategies in detail.

FAQ

Does map-reduce guarantee no clause will be missed?

No. Map-reduce improves scalability but does not guarantee full coverage. Clauses placed in sections the model deems less important during the map phase may not make it into the synthesis. The only way to empirically measure coverage is a golden set: collecting pre-labeled critical clauses and checking how many the system correctly identifies. A target above 95% recall for critical clauses is achievable after calibration but requires iteration with real documents.
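Measuring recall against a golden set is simple arithmetic once clauses have stable identifiers. The sketch below assumes each critical clause is labeled with an ID (the IDs in the usage example are made up) and that the system’s output can be mapped to the same IDs; `clause_recall` is a hypothetical helper.

```python
from typing import List, Set, Tuple


def clause_recall(golden: Set[str], found: Set[str]) -> Tuple[float, List[str]]:
    """Return (recall, missed clause IDs) for one document or a whole test set."""
    if not golden:
        raise ValueError("golden set must contain at least one labeled clause")
    missed = sorted(golden - found)
    recall = (len(golden) - len(missed)) / len(golden)
    return recall, missed


# Usage with made-up clause IDs: three of four critical clauses were surfaced.
golden_clauses = {"liability-cap", "termination", "change-of-law", "payment-term"}
surfaced = {"liability-cap", "termination", "payment-term", "governing-law"}
recall, missed = clause_recall(golden_clauses, surfaced)
# recall is 0.75 and missed is ["change-of-law"]
```

The list of missed clauses is usually more useful than the number itself, because it shows which sections the chunking or prompts need to be adjusted for.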

How can you tell whether the model is citing the source or hallucinating a citation?

In a well-designed system, every sentence in the summary is linked to a fragment identifier (page number, section, sentence). Verification involves going to the indicated location and confirming the text actually exists there. A system without a citation mechanism at the paragraph or sentence level provides no tool for verification and is unsuitable for legal or financial use cases. The output validation layer should block responses with a low anchoring ratio to the source.
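One way to make the anchoring ratio concrete is to require a verbatim quote with every citation and check that the quote really appears on the cited page. The sketch below makes that assumption; `SummaryPoint`, `anchoring_ratio`, `passes_gate`, and the 0.9 threshold are hypothetical and would need calibration on real documents.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SummaryPoint:
    text: str
    page: Optional[int] = None
    quote: Optional[str] = None  # verbatim span the model claims to rely on


def normalize(text: str) -> str:
    """Collapse whitespace and case so line breaks do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_anchored(point: SummaryPoint, pages: Dict[int, str]) -> bool:
    """True only if the quoted span exists on the cited page."""
    if point.page is None or not point.quote:
        return False  # no citation at all
    page_text = pages.get(point.page)
    if page_text is None:
        return False  # cited page does not exist
    return normalize(point.quote) in normalize(page_text)


def anchoring_ratio(points: List[SummaryPoint], pages: Dict[int, str]) -> float:
    if not points:
        return 0.0
    return sum(is_anchored(p, pages) for p in points) / len(points)


def passes_gate(points: List[SummaryPoint], pages: Dict[int, str],
                threshold: float = 0.9) -> bool:
    # Assumed threshold; block the response when too few points are anchored.
    return anchoring_ratio(points, pages) >= threshold
```

Exact-substring matching catches invented quotes and wrong page numbers, but it cannot tell whether a genuine quote actually supports the summary sentence; that part still needs a human reader.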

Can AI summarize documents in multiple languages simultaneously?

Yes, modern multilingual models support mixed-language summaries. The practical problem is specialized terminology: legal and financial clauses have precise meanings that don’t always translate directly between languages. For bilingual documents (e.g., a Polish contract with a working English translation), it’s better to build separate indexes per language and compare results cross-linguistically rather than relying on automatic translation in the summary layer.

How many tokens does summarizing a 100-page document cost?

It depends on the strategy. Map-reduce on 100 pages with 500-token chunks and 20% overlap generates about 250 fragments (assuming roughly 1,000 tokens per page). Each fragment is one model call in the map phase, and the reduce phase adds at least one more. With a model priced at 1-3 USD per million tokens, the cost of one summary ranges from a few cents to a few dollars. For large volumes (dozens of documents weekly), consider a model router: a cheaper model for the map phase, a stronger one for the reduce phase and questions about critical clauses.
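The arithmetic behind the fragment count can be written out as a small estimator. Everything beyond the figures quoted in the answer is an assumption of this sketch: about 1,000 tokens per page (which is what 250 fragments at a 400-token stride implies), 100 output tokens per fragment summary, a 500-token final summary, a single reduce call, and one flat price for input and output tokens.

```python
import math
from typing import Dict, Union


def estimate_map_reduce_cost(
    pages: int,
    tokens_per_page: int = 1000,       # assumption
    chunk_tokens: int = 500,
    overlap_percent: int = 20,
    summary_tokens: int = 100,         # assumption: output per fragment
    final_summary_tokens: int = 500,   # assumption: output of the reduce call
    price_per_million_usd: float = 1.0,
) -> Dict[str, Union[int, float]]:
    total_tokens = pages * tokens_per_page
    # With overlap, each new chunk advances by less than its full length.
    stride = chunk_tokens * (100 - overlap_percent) // 100
    fragments = math.ceil(total_tokens / stride)

    map_input = fragments * chunk_tokens
    map_output = fragments * summary_tokens
    reduce_input = map_output          # the reduce call reads all partial summaries
    billed_tokens = map_input + map_output + reduce_input + final_summary_tokens

    return {
        "fragments": fragments,
        "model_calls": fragments + 1,  # one per fragment plus one reduce call
        "billed_tokens": billed_tokens,
        "cost_usd": billed_tokens * price_per_million_usd / 1_000_000,
    }


estimate = estimate_map_reduce_cost(pages=100)
# fragments: 250, model_calls: 251, billed_tokens: 175500
```

Under these assumptions a single pass costs roughly 0.18 USD at 1 USD per million tokens and about three times that at 3 USD. Longer per-fragment summaries, extra hierarchy levels, or retries push the figure up.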

Can AI summaries be treated as evidence in a legal dispute?

No. A summary is a product of a language model and may contain errors, omissions, or incorrect interpretation of legal context. Evidence in a dispute is the content of the original document. The summary can be used as an internal tool for orientation and triage but does not replace the original or a legal opinion. Where the EU AI Act’s human oversight obligations apply to such a system, a human must be able to verify and override every model recommendation; designing for that is a sensible default however a particular deployment is classified.