A law firm receives a mandate to review a package of contracts as part of an M&A transaction. Five people spend three days searching through 400 documents. They look for non-compete clauses, termination periods, change-of-control clauses, and indemnification. Most of that time goes to reading and searching, not to thinking.
This is the kind of work AI does quickly and consistently. It doesn’t replace lawyers during negotiations or assess the business risk of a transaction. But it can compress three days of document review into a few hours, leaving specialists time for what truly requires their expertise.
How the document analysis pipeline works
Document analysis consists of several layers operating sequentially. Each has different technical requirements and different points where it can fail.
Layer 1: Ingestion and OCR. Documents come in as PDF, DOCX, XLSX, scans, or even photos from a phone. OCR converts scans and images into text. For digital documents (text-based PDFs), this step is trivial. For low-quality scans, it’s one of the main risk points: a misread digit in a penalty clause has consequences.
Layer 2: Chunking and indexing. Text is split into chunks and converted into embeddings by a model like BGE-M3. Chunks are stored in a vector database. The key decisions are chunk size and whether each chunk preserves its paragraph, section, and document context. Chunks that are too small lose context; chunks that are too large reduce search precision.
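A minimal sketch of context-preserving chunking, to make the trade-off concrete. It assumes headings are lines starting with "§" and that paragraphs are separated by blank lines; the `Chunk` type, the `chunk_by_paragraph` function, and the `max_chars` limit are hypothetical names chosen for illustration, not part of any specific library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    section: str  # heading the text belongs to, kept as context
    text: str


def chunk_by_paragraph(doc_id: str, text: str, max_chars: int = 800) -> list[Chunk]:
    """Split text on paragraph boundaries and attach the current section heading.

    Assumption: a block starting with "§" is a section heading.
    """
    chunks: list[Chunk] = []
    section = ""
    buffer = ""
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("§"):
            # Never let a chunk span two sections: flush before switching.
            if buffer:
                chunks.append(Chunk(doc_id, section, buffer))
                buffer = ""
            section = block
            continue
        # Start a new chunk if this paragraph would push us over the limit.
        if buffer and len(buffer) + len(block) + 2 > max_chars:
            chunks.append(Chunk(doc_id, section, buffer))
            buffer = ""
        buffer = f"{buffer}\n\n{block}" if buffer else block
    if buffer:
        chunks.append(Chunk(doc_id, section, buffer))
    return chunks
```

Because every chunk carries its document ID and section heading, a later citation can point back to the right place even when the chunk text itself is short.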
Layer 3: Search and reranking. A user query (e.g., “find all change-of-control clauses”) is converted into an embedding and compared with chunks in the database. Hybrid search combines vector search with full-text search, improving recall for precise legal terms. Reranking sorts results by relevance before passing them to the model.
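The text above does not say how vector and full-text results are combined. One common option is reciprocal rank fusion (RRF), sketched below under that assumption. The function name, the chunk IDs, and the constant `k = 60` are illustrative choices; a separate reranking model would still run on the fused list.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) per list it appears in; chunks found by
    both vector and full-text search accumulate a higher total.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score (descending); break ties by ID for a stable result.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c3", "c1", "c7"]    # hypothetical semantic search results
fulltext_hits = ["c1", "c9", "c3"]  # hypothetical keyword search results
fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
```

Chunks that appear in both lists rise to the top, which is the behavior hybrid search relies on for exact legal terms.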
Layer 4: Generating answers with citations. The model generates answers based solely on retrieved chunks, always including the document number, page, and paragraph. An answer without citations is a warning sign: the model may be hallucinating instead of referencing actual content.
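A citation check can be sketched as follows, assuming the model returns its citations as structured data alongside the answer. The `Citation` type and `validate_citations` function are hypothetical; the idea is only that every citation must match a chunk that was actually retrieved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    doc_id: str
    page: int
    paragraph: str


def validate_citations(citations: list[Citation], retrieved: set[Citation]) -> list[str]:
    """Return a list of problems; an empty list means the answer may be shown."""
    problems: list[str] = []
    if not citations:
        problems.append("answer has no citations")
    for c in citations:
        # A citation that matches no retrieved chunk was likely made up.
        if c not in retrieved:
            problems.append(
                f"citation not in retrieved context: {c.doc_id} p.{c.page} {c.paragraph}"
            )
    return problems
```

An answer that produces any problem would be blocked or routed to a human rather than shown to the user.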
Layer 5: Structured output. For data extraction (tables from contracts, KPIs from reports), the model returns structured output in JSON format, ready for import. Schema validation occurs before data is passed further.
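As an illustration of validating before passing data on, here is a small sketch using only the standard library. The field names (`metric`, `value`, `unit`, `page`) are assumptions; a production pipeline would more likely use a dedicated schema library, but the principle is the same.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"metric": str, "value": (int, float), "unit": str, "page": int}


def validate_record(raw: str) -> dict:
    """Parse model output and raise ValueError if it does not fit the schema."""
    record = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    return record
```

Records that fail validation never reach the import step; they are logged and sent back for re-extraction or manual handling.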
Contract review: what AI detects and what it doesn’t
Contract review is one of the best-fitting applications for AI document analysis. Agreements have predictable structures, repetitive clauses, and defined terms, which are exactly the conditions where semantic models perform best.
What AI effectively detects:
- Clauses with specific scope: non-compete, penalty clauses, termination periods, warranty conditions, confidentiality clauses. Semantic search finds clauses even if they use different wording than the query.
- Discrepancies between documents: the same contract party has different contact details in two places, or payment terms in the preamble don’t match the paragraph content. AI compares chunks from different parts of a document or across a set of documents.
- Missing elements: if the template for a given contract type defines 12 required sections, the system flags documents missing one or more of them.
- Standard vs. non-standard clauses: if you have a database of your own contract templates, the system compares a clause from the document with the template and reports deviations and their scale.
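The last two checks can be sketched with the standard library alone. This is a simplified illustration: it assumes section titles have already been extracted, and it measures deviation with a word-level diff (`difflib.SequenceMatcher`) rather than with embeddings. Function names are hypothetical.

```python
from difflib import SequenceMatcher


def missing_sections(required: list[str], found: list[str]) -> list[str]:
    """Return required section titles that do not appear in the document."""
    found_normalized = {s.strip().lower() for s in found}
    return [s for s in required if s.strip().lower() not in found_normalized]


def clause_deviation(template_clause: str, document_clause: str) -> float:
    """Return 0.0 for identical wording, approaching 1.0 for unrelated text."""
    ratio = SequenceMatcher(
        None, template_clause.split(), document_clause.split()
    ).ratio()
    return round(1.0 - ratio, 2)
```

A word-level diff catches changed numbers and swapped terms but not paraphrases, so in practice it would complement semantic comparison, not replace it.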
What AI doesn’t replace:
- Assessing legal risk in the context of a transaction and jurisdiction. This requires knowledge of law, precedents, and the specifics of the parties.
- Negotiations and advisory. AI doesn’t know the parties’ intentions, relationship history, or the client’s business priorities.
- Interpreting disputed clauses. When meaning depends on interpretation, a lawyer is needed, not a model.
Guardrails should block answers that the retrieved document content doesn’t adequately support, where the model falls back on general legal knowledge instead.
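One crude way to approximate such a guardrail is a lexical grounding score: the share of answer words that also occur in the retrieved chunks. The sketch below assumes that heuristic and an arbitrary threshold of 0.7; real systems typically use stronger checks (for example an entailment model), and both function names are hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def grounding_score(answer: str, chunks: list[str]) -> float:
    """Share of distinct answer words that also appear in the retrieved chunks."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(_tokens(c) for c in chunks))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def guard(answer: str, chunks: list[str], threshold: float = 0.7) -> str:
    """Pass the answer through only if it is sufficiently grounded."""
    if grounding_score(answer, chunks) < threshold:
        return "ESCALATE: insufficient support in retrieved documents"
    return answer
```

The threshold is a tuning parameter: too low and general legal commentary slips through, too high and legitimate paraphrases get escalated.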
Data extraction from financial reports
Financial reports are the second major use case. An analyst reviews quarterly reports from 15 portfolio companies. From each, they extract the same 20 metrics, such as revenue, EBITDA, net debt, capex, and headcount. Manually, this takes several hours per reporting cycle.
AI reduces this process to validation instead of extraction:
- The system reads the document (PDF report, XLSX, CSV).
- It identifies tables and narrative sections containing metrics.
- It maps metrics to a standardized schema and returns JSON with values, units, and page numbers.
- The analyst verifies items flagged by the system as low-confidence or where values deviate from the previous period by more than a defined threshold.
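The review rule in the last step can be sketched as a small function. The thresholds (`min_confidence = 0.9`, `max_change = 0.25`) are placeholders to be calibrated per metric, and the function name is hypothetical.

```python
from typing import Optional


def needs_review(
    value: float,
    previous: Optional[float],
    confidence: float,
    min_confidence: float = 0.9,
    max_change: float = 0.25,
) -> bool:
    """Decide whether an extracted metric goes to the analyst's review queue."""
    if confidence < min_confidence:
        return True
    # Without a usable baseline there is nothing to compare against.
    if previous is None or previous == 0:
        return True
    relative_change = abs(value - previous) / abs(previous)
    return relative_change > max_change
```

Everything that returns `False` is accepted automatically; the analyst only sees the exceptions.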
Key challenges in report extraction:
- Different formats between issuers. EBITDA in one report is a table row; in another, it’s only in the narrative section. The system must handle both patterns.
- Accounting transformations. Reports often present adjusted EBITDA only. Calculating EBITDA from raw data takes several steps, which demands either predefined extraction rules or a model with a verifiable reasoning chain.
- Currencies and units. One report lists amounts in thousands of PLN; another in millions of EUR. Normalization must be explicit and auditable.
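Explicit, auditable normalization might look like the sketch below. It assumes exchange rates are supplied by the caller (the rate used in the usage example is made up), uses `Decimal` to avoid floating-point drift, and records how each result was derived. All names are hypothetical.

```python
from decimal import Decimal

# Multipliers for the scale stated in the report header.
SCALE = {
    "units": Decimal(1),
    "thousands": Decimal(1000),
    "millions": Decimal(1_000_000),
}


def normalize(
    value: str,
    scale: str,
    currency: str,
    fx_to_target: dict[str, Decimal],
    target: str = "EUR",
) -> dict:
    """Convert a reported amount to whole units of the target currency.

    Unknown scales or currencies raise KeyError instead of being guessed.
    """
    base = Decimal(value) * SCALE[scale]
    rate = Decimal(1) if currency == target else fx_to_target[currency]
    return {
        "amount": base * rate,
        "currency": target,
        # Human-readable trail of the conversion for later audit.
        "audit": f"{value} {scale} {currency} x {rate} -> {target}",
    }


# Usage with an invented PLN-to-EUR rate of 0.25, for illustration only.
result = normalize("1200", "thousands", "PLN", {"PLN": Decimal("0.25")})
```

Failing loudly on an unknown unit is deliberate: a silently assumed scale is exactly the kind of error that is hard to spot in a portfolio summary.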
For large volumes (dozens of companies per reporting cycle), ROI is quick. For one-off analyses, a small-scale pilot helps assess how many hours extraction actually saves with your specific document formats.
Due diligence: AI as the first filter
Legal and financial due diligence often involves analyzing hundreds of documents in a short time window. Classic problem: too much to read, too little time, high stakes for errors.
AI doesn’t conduct due diligence in place of a lawyer or advisor. It serves as a first filter that:
- Classifies documents by category (contracts, licenses, administrative decisions, corporate documents) and assigns them to the right specialists.
- Flags high-risk clauses by category: change of control, penalties above a threshold, non-market clauses, off-balance-sheet obligations.
- Generates a list of questions for the seller based on missing documents or detected discrepancies.
- Creates thematic summaries with citations: “Contracts with change-of-control clauses: 14 documents, list below with page numbers.”
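A thematic summary of this kind is essentially a grouping of flagged clauses by category. The sketch below assumes flags arrive as `(category, doc_id, page)` tuples from the earlier layers; the function name and category labels are hypothetical.

```python
from collections import defaultdict


def thematic_summary(flags: list[tuple[str, str, int]]) -> dict[str, list[str]]:
    """Group flagged clauses by category, with de-duplicated document references."""
    by_category: dict[str, set[tuple[str, int]]] = defaultdict(set)
    for category, doc_id, page in flags:
        by_category[category].add((doc_id, page))
    return {
        category: [f"{doc_id} p.{page}" for doc_id, page in sorted(refs)]
        for category, refs in sorted(by_category.items())
    }
```

The output maps each category to a sorted list of references, which can be rendered directly as the "list below with page numbers" part of the summary.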
The difference between AI as a filter and AI as an analyst: filtering means organizing documents and highlighting what needs attention. Analysis means assessing significance and making recommendations. AI does the first well. The second requires a human.
In practice, a due diligence pilot usually starts with one document category (e.g., only contracts with key suppliers) and one type of question (e.g., termination clauses). The scope expands after verifying result quality on this narrow case.
Comparison of architectural approaches
The choice of architecture depends on data sensitivity, document volume, and precision requirements.
| Architecture | Use Case | Data Sensitivity | Precision | Infrastructure Cost |
|---|---|---|---|---|
| RAG on cloud model | Public reports, non-NDA documents | Low | High | Low (pay-per-use) |
| Local RAG (self-hosted LLM) | Contracts, transaction documents, NDAs | High | High | Higher (own server) |
| Hybrid RAG + full-text | Large document sets with specialized terminology | Any | Highest | Medium-high |
| OCR + structured output pipeline | Tabular extraction from reports | Any | Depends on OCR quality | Low-medium |
| Agent with tool-use | Complex DD with cross-document comparison | High | Requires verification | High |
Self-hosting a model is justified when documents are covered by NDAs, professional secrecy, or contain personal data of transaction parties. PII should be masked before being sent to any external API, even if the provider claims zero retention. More on this pattern in the article PII anonymization before AI.
GDPR, AI Act, and data sensitivity in document analysis
Documents in legal and transactional processes often contain personal data: names of parties, PESEL numbers (Polish national identification numbers), contact details, employment information. GDPR imposes obligations for data minimization and purpose limitation.
Two technical requirements that must be met before launching the pipeline:
PII masking before indexing. Personal data is identified and masked or tokenized in the ingestion layer before chunks enter the vector database. The model sees “PARTY_A” instead of a specific name. The token-to-real-data mapping is stored separately, outside the index.
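A minimal sketch of tokenization in the ingestion layer, assuming party names are known from case metadata and that PESEL numbers and email addresses can be found with simple patterns. It does not validate PESEL checksums and would miss many PII types that a real detector has to cover. Names and token formats are hypothetical.

```python
import re

PESEL_RE = re.compile(r"\b\d{11}\b")  # 11 digits; checksum not verified here
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def mask_pii(text: str, party_names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace PII with stable tokens; return masked text and token mapping."""
    mapping: dict[str, str] = {}  # token -> original value (store outside the index)
    reverse: dict[str, str] = {}  # original value -> token
    counters: dict[str, int] = {}

    def token_for(kind: str, value: str) -> str:
        # The same value always maps to the same token.
        if value not in reverse:
            counters[kind] = counters.get(kind, 0) + 1
            token = f"{kind}_{counters[kind]}"
            reverse[value] = token
            mapping[token] = value
        return reverse[value]

    for name in party_names:
        if name in text:
            text = text.replace(name, token_for("PARTY", name))
    text = PESEL_RE.sub(lambda m: token_for("PESEL", m.group()), text)
    text = EMAIL_RE.sub(lambda m: token_for("EMAIL", m.group()), text)
    return text, mapping
```

Only the masked text goes on to chunking and embedding; the mapping is written to a separate, access-controlled store so that authorized users can restore the original values in the final report.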
Isolation per project or client. Each case (transaction, client, project) has its own separate index. A query for one project never accesses documents from another. This is an architectural requirement, not just configuration.
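The difference between architecture and configuration can be illustrated with a toy in-memory store in which a cross-project query cannot even be expressed. In a real deployment the same idea would map to separate collections or databases per project; the class and its substring search are simplifications made for this sketch.

```python
class ProjectIndexes:
    """Keeps one separate index per project; there is no global search method."""

    def __init__(self) -> None:
        self._indexes: dict[str, list[str]] = {}

    def add(self, project_id: str, chunk: str) -> None:
        self._indexes.setdefault(project_id, []).append(chunk)

    def search(self, project_id: str, term: str) -> list[str]:
        # An unknown project is an error, never a fallback to other data.
        if project_id not in self._indexes:
            raise KeyError(f"no index for project: {project_id}")
        needle = term.lower()
        return [c for c in self._indexes[project_id] if needle in c.lower()]
```

Because every call requires a `project_id` and the store has no method that spans projects, a bug in query construction cannot leak another client's documents.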
For high-risk due diligence processes (acquisitions in regulated sectors, sensitive data), a DPIA is required before implementation. Systematic document analysis by AI may qualify as “large-scale processing” under GDPR. A detailed review of regulatory obligations is in the article AI Act and GDPR 2026.
Under the AI Act, document analysis systems generally fall into the low- or limited-risk categories as long as humans make the decisions based on AI output. If the system generates recommendations directly affecting financial or legal decisions without human verification, the classification may change.
Result quality: what to measure and how to verify
A document analysis system that works in a pilot often reveals issues when first exposed to real client documents. Three metrics indicate whether the system is ready for production deployment:
- Recall of critical clauses: What percentage of clauses from a pre-labeled test set the system correctly identified. Target: above 95% for critical clauses (penalties, deadlines, change of control). Recall below 90% indicates a chunking problem or overly narrow semantic search.
- Citation precision: What percentage of cited chunks actually come from the specified page and paragraph. Incorrect citations (model provides a page number, but the chunk comes from elsewhere) signal indirect hallucination. Target: 100%.
- Escalation rate: What percentage of queries the system forwards for human verification instead of answering itself. Too low a rate (system answers everything) means guardrails are missing. Too high (system escalates everything) means the system isn’t delivering value.
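All three metrics reduce to simple ratios once a labeled test set exists. The sketch below assumes clause IDs from the labeled set, a manual true/false check per citation, and counters of escalated and total queries; function names are hypothetical.

```python
def recall(expected: set[str], found: set[str]) -> float:
    """Share of labeled critical clauses the system identified."""
    return len(expected & found) / len(expected) if expected else 1.0


def citation_precision(citation_checks: list[bool]) -> float:
    """Share of citations confirmed to point at the stated page and paragraph."""
    return sum(citation_checks) / len(citation_checks) if citation_checks else 1.0


def escalation_rate(escalated: int, total: int) -> float:
    """Share of queries forwarded to a human instead of answered."""
    return escalated / total if total else 0.0
```

Tracking these per document category, not only in aggregate, makes it easier to see which part of the pipeline a regression comes from.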
Monitoring agent quality covers the methodology for measurement, alerts, and quality drift for AI systems in production.
FAQ
Can AI read and analyze contracts in Polish?
Yes, modern multilingual models handle Polish without special fine-tuning for contracts. Semantic search works correctly for legal terminology in Polish, though precision for highly specialized clauses (e.g., terminology from real estate development or transport law) is higher when the vector database contains documents from the same domain. Semantic search and embeddings covers selecting an embedding model for Polish.
How does AI handle scanned documents and photos?
Modern OCR systems with vision models support scans and photos, but quality depends on resolution and legibility. Documents with handwritten notes, low-quality scans, or damaged paper originals reduce extraction confidence. The pattern is always the same: low confidence in OCR for a given fragment means forwarding it to a manual queue instead of automatic extraction. Assess the OCR readiness of your documents with the readiness assessment tool.
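The routing pattern can be sketched in a few lines, assuming the OCR engine reports a confidence between 0 and 1 for each fragment. The threshold of 0.85 and the function name are placeholders.

```python
def route_fragments(
    fragments: list[tuple[str, float]], min_confidence: float = 0.85
) -> tuple[list[str], list[str]]:
    """Split OCR fragments into automatic extraction and a manual queue."""
    automatic: list[str] = []
    manual: list[str] = []
    for text, confidence in fragments:
        # Anything below the threshold is read by a person, not the model.
        (automatic if confidence >= min_confidence else manual).append(text)
    return automatic, manual
```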
Is data from contracts and due diligence documents secure when using AI?
Security depends on architecture, not the use of AI. Documents covered by NDAs or professional secrecy should be processed locally (self-hosted model) or with PII masking before being sent to external APIs. Each project should have an isolated index in the vector database so queries from one case don’t access documents from another. Every operation must be logged (what the system read, what it proposed, who approved) so that it can be reconstructed later. Technical requirements are detailed in the article AI agent security.
How long does it take to implement a document analysis system?
A pilot for one document category and one query type usually takes 4-6 weeks: one week for ingestion and indexing a test set, one week for guardrail configuration and calibration, 2-4 weeks for quality verification with real users. Full implementation covering multiple document categories, ERP or DMS integration, and advanced extraction pipelines takes 2-4 months depending on scope. The ROI calculator estimates payback time based on actual document volume and specialist hourly rates.
Can AI compare multiple documents against each other?
Yes, this is one of the most useful patterns in due diligence. An agent with tool-use can perform multiple queries to the vector database sequentially and compare results: “Contract A contains clause X, contracts B and C do not.” Complex comparisons across large document sets require careful pipeline design and clear guardrails blocking answers generated without basis in the text. AI agent vs chatbot explains the difference between a simple assistant and an agent capable of multi-step reasoning over documents.