A company that indexes 4,000 PDF documents with the default setting of "500 characters per chunk" ends up with an assistant that quotes the correct paragraph but cannot tell which contract it comes from. Retrieval precision in this scenario can drop by 30–50% compared to a properly calibrated pipeline. The problem lies not in the language model or the vector database but in the granularity of the fragments.
Three basic strategies and when to use them
Fixed-size chunking divides a document into fragments of a predetermined number of tokens or characters, with optional overlap. This is the simplest approach, fast to implement, and predictable in indexing costs. It works well when documents have a uniform structure, such as system logs or product descriptions written from a single template. The main drawback: the splitter does not recognize sentence boundaries, often cutting a thought in half.
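A minimal sketch of fixed-size chunking with overlap, assuming whitespace-separated words stand in for tokens (a real pipeline would use the embedding model's tokenizer). The function name and the sample sentence are illustrative only.

```python
def fixed_size_chunks(tokens, size, overlap=0):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting replaces a real tokenizer.
tokens = "the supplier shall deliver the goods within thirty days of the order date".split()
chunks = fixed_size_chunks(tokens, size=6, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Note how the window boundaries ignore sentence structure entirely, which is the drawback described above.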
A recursive character splitter (e.g., RecursiveCharacterTextSplitter in LangChain) tries to preserve semantic boundaries by splitting first on "\n\n", then on "\n", then on ". ", and only then by character. For most Polish business documents written in prose (instructions, regulations, reports), this is the default choice, offering a good balance between quality and effort.
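The idea can be illustrated with a simplified, dependency-free sketch. It is not LangChain's implementation: it only shows the principle of trying the coarsest separator first and descending to finer ones for pieces that are still too long. The separator list follows the order given above; the function name and sample text are assumptions.

```python
SEPARATORS = ["\n\n", "\n", ". ", ""]  # "" means: fall back to a hard character cut


def recursive_split(text, max_chars, separators=SEPARATORS):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_chars, finer)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


text = (
    "Section one. It is short.\n\n"
    "Section two is a much longer paragraph. It has several sentences. "
    "Each one adds detail."
)
chunks = recursive_split(text, max_chars=50)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch the separator that triggers a split is dropped from the chunk boundary, and sizes are measured in characters rather than tokens.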
Semantic chunking groups sentences with similar meanings based on the cosine distance of consecutive embeddings. Fragment sizes are variable, complicating token count predictions and inference costs, but retrieval precision for long, dense documents (e.g., multi-page contracts) can be 10–25% higher than with fixed-size chunking.
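A rough sketch of the mechanism: a new chunk starts wherever the cosine distance between consecutive sentence embeddings exceeds a threshold. The `toy_embed` function is a hypothetical stand-in (plain word counts) so the example runs without a model, and the 0.6 threshold is an arbitrary value chosen for this toy data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences, threshold, embed=toy_embed):
    """Start a new chunk when consecutive sentences are further apart than `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(previous, current) > threshold:
            groups.append([sentence])
        else:
            groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


sentences = [
    "The tenant pays the rent monthly.",
    "The rent is due on the first day.",
    "Either party may terminate the agreement in writing.",
    "To terminate the agreement a party gives notice in writing.",
]
chunks = semantic_chunks(sentences, threshold=0.6)
for chunk in chunks:
    print(chunk)
```

With a real embedding model the threshold would need to be calibrated on the actual corpus, since distance distributions differ between models.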
Choosing fragment size and overlap
There is no one-size-fits-all chunk size. It depends on three factors: the embedding model, the nature of user queries, and the average expected response length from the assistant.
Practical ranges for Polish business documents:
- Fact-based questions (dates, names, amounts): 256–512 tokens, overlap 50–80 tokens.
- Questions requiring procedural context: 512–1024 tokens, overlap 100–150 tokens.
- Narrative documents, analyses, reports: 1024–2048 tokens, overlap 150–200 tokens.
Overlap is not "wasted space." If fragment A ends mid-paragraph and fragment B starts 100 tokens earlier, an answer that sits exactly on the split boundary still appears whole in at least one fragment, so semantic search can find it.
Embedding models have a context window limit: BGE-M3 supports up to 8192 tokens, but its quality optimum lies in the 512–1024 token range per fragment. Beyond this threshold, the semantic signal spreads across the entire window, reducing retrieval precision.
Chunking tables and code: a separate path
Tables and code are structures that standard character-based splitters destroy. When a table is split between two fragments, the first keeps the column headers but only part of the data, and the second holds data with no headers, so the model cannot link values to their meaning.
Solution: Detect tables and code blocks during document parsing (e.g., using pdfplumber, camelot, or Unstructured.io), then:
- Tables: Convert to a text format (Markdown or JSON row-by-row) and save as a separate chunk with metadata type: table. If the table exceeds 512 tokens, split it into blocks of 5–10 rows, repeating the header in each block.
- Code: Preserve as a single chunk no longer than 1024 tokens. If the code snippet is longer, split by function blocks (lines starting with def, function, class), not by an arbitrary number of characters.
Metadata type: table and type: code enable filtering in hybrid queries. More on this process in the article how to prepare company data for AI.
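The two rules for tables and code might look like the sketch below. It assumes the parser has already produced a header and a list of rows, and that code arrives as a plain string; the chunk dictionary layout and function names are illustrative, and token-length checks are left out for brevity.

```python
import re


def split_table(header, rows, rows_per_block=5):
    """Render a table as Markdown chunks, repeating the header in every block."""
    head = "| " + " | ".join(header) + " |"
    rule = "|" + "---|" * len(header)
    chunks = []
    for start in range(0, len(rows), rows_per_block):
        body = [
            "| " + " | ".join(str(cell) for cell in row) + " |"
            for row in rows[start:start + rows_per_block]
        ]
        chunks.append({"text": "\n".join([head, rule, *body]), "metadata": {"type": "table"}})
    return chunks


# Unindented lines that start with def, function, or class open a new block.
DEFINITION = re.compile(r"^(def|function|class)\b", re.MULTILINE)


def split_code(source):
    """Split source code before each top-level definition."""
    starts = [match.start() for match in DEFINITION.finditer(source)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep a preamble (imports, comments) as its own chunk
    bounds = starts + [len(source)]
    pieces = [source[a:b] for a, b in zip(bounds, bounds[1:])]
    return [{"text": p.rstrip(), "metadata": {"type": "code"}} for p in pieces if p.strip()]


rows = [[f"SKU-{i}", i * 10] for i in range(12)]
table_chunks = split_table(["sku", "price"], rows, rows_per_block=5)

source = "import os\n\ndef a():\n    return 1\n\nclass B:\n    def m(self):\n        return 2\n"
code_chunks = split_code(source)
print(table_chunks[-1]["text"])
print(len(code_chunks))
```

Because the pattern only matches unindented lines, methods stay together with their class instead of becoming separate chunks.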
Strategy vs. document type: reference table
| Document Type | Recommended Strategy | Chunk Size (tokens) | Overlap (tokens) |
|---|---|---|---|
| FAQ, Q&A lists | Semantic by question | 128–256 | 0–30 |
| Contracts, regulations (clauses) | Recursive by section header | 512–768 | 80–120 |
| User manuals, procedures | Standard recursive | 512–1024 | 100–150 |
| Analytical reports, narratives | Semantic or fixed-size | 1024–2048 | 150–200 |
| Data tables (price lists, registers) | Table row-by-row + header | 256–512 | 0 (repeat header) |
| Code snippets, scripts | Fixed by function boundary | 512–1024 | 0 |
| Emails, short messages | Fixed-size or no splitting | 128–256 | 0–20 |
Chunk metadata as a secondary retrieval signal
The fragment text alone is not enough. Each chunk should carry metadata that the vector database can filter on before the vector search: source filename, page number, document type, update date, client or project identifier (if applicable).
Filtering by metadata before vector search reduces the search space and removes outdated or irrelevant documents from results. For a company whose 4,000 documents are categorized (contracts, manuals, FAQs), this step can shorten the candidate list by 60–80% before semantic ranking.
A RAG pipeline with metadata filters works as follows: user query → query embedding → filter (e.g., type: contract AND client_id: X) → vector search on subset → reranking top-k → response generation. The effect is measurable: response precision improves, and hallucinations decrease because the model only receives relevant fragments.
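The filter-then-search step can be sketched in plain Python. The in-memory list stands in for a vector database, the two-dimensional vectors are made-up values, and reranking and generation are omitted; a real database would apply the filter inside its own query API.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vector, chunks, filters, top_k=3):
    """Keep only chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda chunk: cosine_similarity(query_vector, chunk["vector"]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical index: ids, vectors, and metadata are illustrative only.
index = [
    {"id": "c1", "vector": [0.9, 0.1], "metadata": {"type": "contract", "client_id": "X"}},
    {"id": "c2", "vector": [0.8, 0.2], "metadata": {"type": "manual", "client_id": "X"}},
    {"id": "c3", "vector": [0.1, 0.9], "metadata": {"type": "contract", "client_id": "X"}},
    {"id": "c4", "vector": [1.0, 0.0], "metadata": {"type": "contract", "client_id": "Y"}},
]
hits = filtered_search([1.0, 0.0], index, {"type": "contract", "client_id": "X"}, top_k=2)
print([hit["id"] for hit in hits])
```

Chunk c4 is the closest vector to the query but never reaches the ranking stage, because it belongs to a different client.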
More on retrieval quality evaluation in the article RAG: evaluating response quality.
FAQ
How many tokens should a single chunk have for legal documents?
For contracts and regulations, the optimum lies in the 512–768 token range with an overlap of 80–120 tokens. More important than size is the split point: divide by section headers (§ 1, § 2, "General Provisions"), not by an arbitrary number of characters. A clause that fits entirely into one chunk will be retrieved more accurately than the same clause split across two fragments.
Does greater overlap always improve retrieval quality?
No. Overlap exceeding 20–25% of the chunk size increases indexing costs and may introduce redundancy that disrupts ranking. If two fragments with 50% shared content appear in the top-5 results, the model receives duplicated context instead of different perspectives. An overlap of 10–15% of the chunk size is a safe starting point for most documents.
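As a quick arithmetic aid, the percentages can be turned into token counts and back. The helper names are made up for this example.

```python
def overlap_tokens(chunk_size, ratio=0.10):
    """Overlap in tokens for a given chunk size and overlap ratio."""
    if not 0 <= ratio < 1:
        raise ValueError("ratio must satisfy 0 <= ratio < 1")
    return round(chunk_size * ratio)


def overlap_ratio(chunk_size, overlap):
    """Share of the chunk that is repeated from its neighbour."""
    return overlap / chunk_size


low = overlap_tokens(512, 0.10)
high = overlap_tokens(512, 0.15)
print(f"512-token chunks: start with {low} to {high} tokens of overlap")
print(f"128 tokens of overlap on 512 is {overlap_ratio(512, 128):.0%}")
```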
How does chunking affect hybrid search (BM25 + vectors)?
Chunking impacts both components: fragments that are too short lose keywords for BM25, while those that are too long dilute the vector signal. For hybrid search, the optimum usually lies in the 512–1024 token range, where both signals work effectively. Conduct A/B tests on a representative set of queries before finalizing the chunk size.
Should I re-index documents after changing the chunking strategy?
Yes, full re-indexing is necessary after changing the strategy or chunk size. Embeddings generated from differently split fragments are not comparable to previous ones. In practice: prepare a new collection alongside the old one, run quality tests on a golden set, and only switch traffic to the new collection after achieving better results. More on this process in the article RAG knowledge updates and versioning.
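The switch-over rule can be expressed as a small gate. This sketch assumes a golden set of (query, expected document id) pairs and two search callables that return ranked document ids; the hit-rate metric, the stub functions, and all ids are illustrative choices, not a prescribed evaluation method.

```python
def hit_rate(search, golden_set, k=5):
    """Share of golden queries whose expected document appears in the top-k results."""
    hits = sum(1 for query, expected in golden_set if expected in search(query)[:k])
    return hits / len(golden_set)


def choose_collection(search_old, search_new, golden_set, k=5):
    """Switch to the new collection only if it scores strictly better."""
    old_score = hit_rate(search_old, golden_set, k)
    new_score = hit_rate(search_new, golden_set, k)
    return ("new" if new_score > old_score else "old"), old_score, new_score


# Hypothetical stubs standing in for queries against two collections.
def search_old(query):
    return {"notice period": ["faq_1", "contract_12"], "reset password": ["faq_7"]}[query]


def search_new(query):
    return {"notice period": ["contract_12"], "reset password": ["manual_3", "faq_7"]}[query]


golden_set = [("notice period", "contract_12"), ("reset password", "manual_3")]
decision = choose_collection(search_old, search_new, golden_set)
print(decision)
```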
How to handle PDFs with mixed structure (text, tables, images)?
Use a structural parser like Unstructured.io or pdfplumber to detect element types before splitting. Process continuous text with a recursive splitter, convert tables to Markdown row-by-row with repeated headers, and describe images (charts, diagrams) using a vision model, attaching the description as a separate chunk with metadata type: image_caption. This segmentation preserves the document's semantic structure without losing information in any layer.
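A sketch of the routing step, assuming the parser output has already been normalized into dictionaries with a hypothetical "kind" field. The element layout, the `fake_caption` stub (standing in for a vision model), and the sample data are all assumptions; text is passed through whole here, where a real pipeline would hand it to the recursive splitter.

```python
def to_markdown(header, rows):
    """Render a parsed table as a Markdown table."""
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


def route_elements(elements, describe_image):
    """Turn parsed elements into chunks tagged with a metadata type."""
    chunks = []
    for element in elements:
        kind = element["kind"]
        if kind == "text":
            text, chunk_type = element["content"], "text"
        elif kind == "table":
            text, chunk_type = to_markdown(element["header"], element["rows"]), "table"
        elif kind == "image":
            text, chunk_type = describe_image(element["content"]), "image_caption"
        else:
            continue  # unknown element kinds are skipped in this sketch
        chunks.append({"text": text, "metadata": {"type": chunk_type, "page": element.get("page")}})
    return chunks


def fake_caption(image_bytes):
    """Hypothetical stub; a real pipeline would call a vision model here."""
    return f"Image of {len(image_bytes)} bytes (description placeholder)"


elements = [
    {"kind": "text", "content": "Prices are valid until the end of the quarter.", "page": 1},
    {"kind": "table", "header": ["item", "price"], "rows": [["A", 10], ["B", 20]], "page": 1},
    {"kind": "image", "content": b"\x89PNG", "page": 2},
]
chunks = route_elements(elements, fake_caption)
print([chunk["metadata"]["type"] for chunk in chunks])
```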