A company integrates three systems: Confluence, SharePoint, and an old FAQ database in Excel. After the first RAG indexing, responses are vague or cite conflicting procedures. There’s no error in the model or Qdrant configuration. The problem lies earlier: the same documents appeared in the index multiple times, in slightly different versions, and now each version competes for the same spot in search results.
Deduplication and data cleaning happen before indexing, not after. When skipped, symptoms emerge gradually: the assistant responds inconsistently, token costs rise, and quality metrics don’t improve despite prompt tuning.
Why dirty data silently breaks RAG
RAG works on ranking: the system retrieves fragments that best match the query semantically and provides them to the model as context. If the index contains five versions of the same procedure, three of them end up in the context instead of three different, complementary documents. The model receives less information than it could from a clean corpus.
Duplicates also generate concrete costs. Each fragment occupies tokens in the context window. Whether the limit is 8,000 or 128,000 tokens, repeated content wastes space that is either paid for or limits what the model can analyze at once.
The third effect is inconsistency. A 2022 procedure version and a 2024 version may describe the same step differently. The model doesn’t know which is current and cites both or arbitrarily chooses one. From the outside, this looks like a “hallucination,” but the cause is the knowledge base, not the model.
At Cashcrown, we observe that in projects where clients come with disorganized corpora, a significant portion of preparatory work falls on deduplication, not model configuration. We can’t provide exact numbers on how much this improves response quality, because it depends on the state of the data. You know there’s a problem when faithfulness and relevance metrics don’t rise despite prompt changes.
Three layers of duplicate detection
No single method detects all types of duplicates. In practice, three layers are applied sequentially, as each subsequent one is computationally more expensive.
Layer 1: Exact matching. Compute the SHA-256 or MD5 hash of the entire document or its fragment after normalization (lowercasing, whitespace removal). If two files have identical hashes, one is discarded without further analysis. This is the fastest layer and involves no LLM cost. It is effective for files copied between folders or synchronized by two integrations simultaneously.
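A minimal sketch of this layer, assuming documents are already extracted to plain text and held in a dictionary keyed by an illustrative document id. The names `normalize`, `content_hash`, and `drop_exact_duplicates` are hypothetical helpers, and collapsing whitespace runs to a single space is one possible reading of "whitespace removal":

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse every whitespace run to a single space."""
    return " ".join(text.lower().split())


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def drop_exact_duplicates(
    docs: dict[str, str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (kept, dropped); dropped maps a discarded id to the id it duplicates."""
    first_seen: dict[str, str] = {}  # hash -> id of the first document with that hash
    kept: dict[str, str] = {}
    dropped: dict[str, str] = {}
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if digest in first_seen:
            dropped[doc_id] = first_seen[digest]
        else:
            first_seen[digest] = doc_id
            kept[doc_id] = text
    return kept, dropped
```

Returning the `dropped` mapping instead of silently discarding keeps a record of which copy replaced which, which is useful for the audit trail discussed later in the article.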
Layer 2: Fuzzy matching. Algorithms like Levenshtein distance, Jaccard similarity on n-grams, or MinHash detect documents that differ slightly: one paragraph added, a typo corrected, formatting changed. The similarity threshold is set empirically; typically, a Jaccard score above 0.85–0.90 suggests a duplicate candidate. The decision to merge or discard is left to human confirmation.
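As an illustration of one of these options, the sketch below computes Jaccard similarity on word trigrams and lists pairs above a threshold. The helper names, the trigram size, and the default threshold of 0.85 are assumptions for the example; the pairwise loop is quadratic, so a larger corpus would likely need MinHash or another approximate method instead:

```python
from itertools import combinations


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams; texts shorter than n words become one shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Size of the shingle intersection divided by the size of the union."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def fuzzy_candidates(
    docs: dict[str, str], threshold: float = 0.85, n: int = 3
) -> list[tuple[str, str, float]]:
    """Pairs whose Jaccard score reaches the threshold, for human review."""
    candidates = []
    for (id_a, text_a), (id_b, text_b) in combinations(docs.items(), 2):
        score = jaccard(text_a, text_b, n)
        if score >= threshold:
            candidates.append((id_a, id_b, score))
    return candidates
```

The function only returns candidates; it deliberately does not merge or delete anything.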
Layer 3: Semantic detection (embedding-based). Generate embeddings for each fragment and measure cosine similarity. Fragments with similarity above a threshold (typically 0.92–0.95, calibrated on a golden test set) are candidates for merging. This layer catches cases where two documents say the same thing differently: changed terminology, a different translation, or a rewritten procedure with the same meaning. Vector similarity is covered in more detail in the article semantic search and embeddings in enterprise.
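A sketch of the comparison step, assuming the embeddings have already been produced by whatever model the project uses and are available as plain lists of floats. The function names and the default threshold of 0.92 are illustrative, and the toy two-dimensional vectors in the usage note stand in for real embeddings:

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_candidates(
    embeddings: dict[str, list[float]], threshold: float = 0.92
) -> list[tuple[str, str, float]]:
    """Fragment pairs whose cosine similarity reaches the threshold."""
    pairs = []
    for (id_a, u), (id_b, v) in combinations(embeddings.items(), 2):
        score = cosine(u, v)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs
```

For example, `semantic_candidates({"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]})` reports only the pair `("a", "b")`. In a production pipeline the vector database itself would usually run this search rather than a Python loop.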
Each higher layer requires more expensive verification. That’s why the sequence makes sense: filter first with what’s cheap, and apply semantics only where the previous two layers didn’t yield a clear answer.
Normalization: before you even look for duplicates
Deduplication without prior normalization generates false distinctions. The same document saved once as PROCEDURA_v3_final.docx and once as procedura v3 final.docx can produce different hashes when the copies differ in casing, whitespace, or embedded metadata. Text normalization removes these superficial differences before comparison.
| Normalization operation | What it does | When mandatory |
|---|---|---|
| Lowercase | Converts all letters to lowercase | Always for hash and fuzzy |
| Whitespace removal | Leading spaces, indentation, double spaces | Always |
| Unicode normalization (NFC/NFD) | Unifies representation of letters with diacritics (ą, ę, ź) | For Polish texts from mixed sources |
| File metadata removal | Author headers, modification dates, comments | For hash duplicates |
| Template header stripping | Footers, repeating company headers | For fuzzy and semantic |
| Number and date normalization | 01.01.2024 vs 1 stycznia 2024 vs 2024-01-01 | For procedural document comparison |
Normalization should be deterministic and reversible for audit purposes: don’t overwrite originals, write to a separate pipeline preparing data for indexing. Originals remain unchanged in the source.
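Several rows of the table can be combined into one pure function. The sketch below covers Unicode normalization, lowercasing, whitespace cleanup, and template-line stripping; `TEMPLATE_LINES` is a hypothetical placeholder for boilerplate that a real pipeline would define per source system, and date normalization is left out. Because the function returns a new string, the original text is never overwritten:

```python
import re
import unicodedata

# Hypothetical boilerplate lines (already lowercased); illustrative only.
TEMPLATE_LINES = {"confidential - internal use only"}


def normalize_for_comparison(text: str) -> str:
    """Return a normalized copy of the text for hashing and fuzzy comparison."""
    # Unify composed and decomposed forms of letters such as "ą".
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Collapse whitespace inside each line and trim indentation.
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]
    # Drop empty lines and known template headers or footers.
    lines = [line for line in lines if line and line not in TEMPLATE_LINES]
    return "\n".join(lines)
```

The same input always yields the same output, which is the deterministic property the pipeline depends on when hashes are compared across runs.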
Near-duplicates and search quality
A near-duplicate is a harder problem than an identical duplicate because it’s less visible. Two procedural document fragments differing by one paragraph about a special case may look like duplicates but be complementary.
In semantic search, near-duplicates disrupt ranking in a specific way: fragment A and its almost-identical copy B reinforce each other in results, as the search engine assumes the topic is “well represented.” Other, less repeated content drops in ranking, even if it could be more useful.
Practical approach to near-duplicates in a RAG corpus:
- Group candidates by semantic similarity (clusters from embeddings).
- For each group: keep the version with the latest modification date or the most complete content.
- Move older versions to an archive, not the active index.
- Mark versions with unclear relationships (e.g., differing by a key technical detail) for manual review.
The semantic threshold for near-duplicates is usually set lower than for identical ones, as you want to be conservative: better to flag a candidate for review than automatically merge two documents that turn out to have important differences.
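The grouping and keep-latest steps from the list above could be sketched as follows. All names here are hypothetical. Two simplifications are assumed: groups are formed transitively from candidate pairs with a small union-find, and text length stands in for "most complete content" when dates tie. Unclear cases would still need the manual review queue, which this sketch does not model:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Version:
    doc_id: str
    modified: date
    text: str


def group_pairs(pairs: list[tuple[str, str]]) -> list[set[str]]:
    """Merge candidate pairs into groups of connected document ids."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: dict[str, set[str]] = {}
    for doc_id in list(parent):
        groups.setdefault(find(doc_id), set()).add(doc_id)
    return list(groups.values())


def resolve_group(group: list[Version]) -> tuple[Version, list[Version]]:
    """Return (keep, archive): newest version first, longer text breaks ties."""
    ranked = sorted(group, key=lambda v: (v.modified, len(v.text)), reverse=True)
    return ranked[0], ranked[1:]
```

The second return value of `resolve_group` is the list destined for the archive, not for deletion.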
PII as a mandatory cleaning step
Data cleaning for AI isn’t just deduplication. In parallel, personally identifiable information (PII) must be detected and handled in every document before indexing.
Typical PII types in enterprise corpora: customer names, email addresses, phone numbers, PESEL and NIP numbers, residential addresses, order numbers linked to individuals, session identifiers. All are subject to GDPR and require a legal basis for processing, minimization, and handling of the right to erasure.
In practice, you have three paths:
Masking before indexing. PII is replaced with a token ([EMAIL_1], [PESEL_1]) in the data preparation pipeline, before generating embeddings. The index receives text with tokens, not real values. The masking architecture is described in detail in the article PII anonymization and masking before AI.
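A rule-based sketch of the token replacement, limited to e-mail addresses and PESEL numbers. The regular expressions are deliberately simple and will miss edge cases, so this is an illustration of the token scheme rather than a complete detector. The PESEL check uses the standard checksum weights; the function names and the decision to return a token-to-value mapping are assumptions, and such a mapping would itself have to be stored under access control or discarded:

```python
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PESEL_RE = re.compile(r"\b\d{11}\b")
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def is_valid_pesel(value: str) -> bool:
    """Check the control digit of an 11-digit PESEL candidate."""
    digits = [int(ch) for ch in value]
    weighted = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits[:10]))
    return (10 - weighted % 10) % 10 == digits[10]


def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace e-mails and valid PESELs with numbered tokens.

    Returns the masked text and a token -> original value mapping.
    """
    mapping: dict[str, str] = {}
    seen: dict[tuple[str, str], str] = {}  # (kind, value) -> token
    counters = {"EMAIL": 0, "PESEL": 0}

    def token_for(kind: str, value: str) -> str:
        key = (kind, value)
        if key not in seen:
            counters[kind] += 1
            seen[key] = f"[{kind}_{counters[kind]}]"
            mapping[seen[key]] = value
        return seen[key]

    text = EMAIL_RE.sub(lambda m: token_for("EMAIL", m.group()), text)
    text = PESEL_RE.sub(
        lambda m: token_for("PESEL", m.group()) if is_valid_pesel(m.group()) else m.group(),
        text,
    )
    return text, mapping
```

Repeated occurrences of the same value receive the same token, so the masked text stays internally consistent for retrieval.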
Excluding the document from the index. If a document contains essential PII that can’t be anonymized without losing value (e.g., a customer call transcript), it doesn’t enter the main index. It may go to a dedicated collection with access control and a strict legal basis.
Data extraction and statistical anonymization. For documents used in analysis (reports, surveys), extract aggregates instead of individual data. The model then works with statistics, not personal data.
Tools for automatic PII detection in Polish texts: Microsoft Presidio (with a Polish NER model), spaCy with the pl_core_news_lg model, regex rules for PESEL/NIP/phone patterns. None provide 100% recall on raw enterprise data, especially with proper names that resemble general text. That’s why automatic PII detection is the first step, not the only one, particularly for special category personal data (Article 9 GDPR).
More on preparing data for GDPR and the AI Act can be found in the article how to prepare enterprise data for AI.
Where humans must make decisions
Automation speeds up deduplication and cleaning but doesn’t eliminate the need for human decisions. There are several points where automatic action is inadmissible or too risky.
Merging customer records in CRM. A fuzzy matching algorithm might find that “Jan Kowalski, ul. Lipowa 5” and “Jan Kowalski, ul. Lipowa 5a” are likely the same person. Likely, not certainly. Merging the order history, contacts, and quotes of two different customers with similar data is an error with real business and legal consequences (GDPR requires data accuracy). Automation proposes merge candidates; a human approves or rejects each case.
Deleting document versions. An older procedure version may be needed as evidence in an audit or complaint. Automatically deleting an “older duplicate version” might erase material the company is required to retain. Archive, don’t delete, and delegate the decision on physical deletion to someone who knows the document’s legal context.
Sensitive data under the AI Act. For documents in high-risk areas (HR, credit scoring, recruitment systems), every change to the training or indexed dataset requires documentation and human approval, per Article 10 of the AI Act (data quality and governance). Automatic deduplication without an audit trail is non-compliant here.
Semantic certainty threshold. If two fragments have vector similarity between 0.85 and 0.92 (gray zone), the semantic model isn’t confident enough to decide alone. Such pairs go to a manual verification queue. This threshold is calibrated once on a golden set of examples before pipeline launch.
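The gray-zone routing can be expressed as a small triage function. The two default boundaries come from the paragraph above; the names and the choice to treat each boundary as inclusive on the lower side are assumptions to be settled during calibration:

```python
from enum import Enum


class Decision(Enum):
    DUPLICATE_CANDIDATE = "duplicate_candidate"
    MANUAL_REVIEW = "manual_review"
    DISTINCT = "distinct"


def triage(
    similarity: float, review_floor: float = 0.85, auto_floor: float = 0.92
) -> Decision:
    """Route a fragment pair based on its cosine similarity."""
    if review_floor > auto_floor:
        raise ValueError("review_floor must not exceed auto_floor")
    if similarity >= auto_floor:
        return Decision.DUPLICATE_CANDIDATE
    if similarity >= review_floor:
        return Decision.MANUAL_REVIEW  # gray zone: a person decides
    return Decision.DISTINCT
```

Even the top band yields only a candidate, consistent with the rule that automation proposes and a person approves.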
The pattern we use: automation proposes, a decision list goes to the person responsible for the database, and the decision is logged with a date and justification. More on managing decisions on datasets in the article data governance for AI.
FAQ
What’s the difference between deduplication and data normalization before AI?
Normalization removes superficial differences in the representation of the same text: case changes, spaces, Unicode encoding of diacritical characters. Deduplication detects and handles actual content repetitions at three levels: identical, fuzzy, and semantic. Without normalization, the same document in two encodings might not be recognized as a duplicate by hash, so normalization is always the first step.
How to set the similarity threshold for semantic deduplication?
Calibrate the threshold empirically: prepare 100–200 document pairs manually labeled “duplicate / not duplicate,” run the embedding model, and measure precision and recall at different values. A typical starting point is cosine similarity 0.92–0.95 for short fragments, lower for longer documents. Choose a threshold that minimizes false merges, as merging two different records is harder to reverse than leaving a candidate for manual review.
Is deduplication required for every RAG implementation?
For small, homogeneous databases (a few hundred documents from one source), duplicates are rare, and the impact on quality is minor. For databases with thousands of documents from multiple systems (CRM, SharePoint, Confluence, emails), duplicates almost always appear because the same information circulates in many places. A sign that deduplication is needed: identical test questions receive different answers on subsequent runs, or the assistant cites conflicting procedures in the same response.
How to handle the right to erasure under GDPR in a deduplicated index?
A deletion request applies to all instances of an individual’s data, including the vector index. Fragments built from documents containing that person’s PII must be deleted. The deduplication pipeline should maintain a provenance map: which fragment comes from which source document. Without it, complete deletion isn’t possible. If fragments were merged with other documents, the merged version requires manual review before deletion.
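One possible shape for such a provenance map, kept in memory for illustration. The class and method names are hypothetical; a real pipeline would persist the map alongside the index. Given the source documents covered by an erasure request, it separates fragments that can be deleted outright from merged fragments that need review:

```python
from collections import defaultdict


class ProvenanceMap:
    """Tracks which source documents each indexed fragment was built from."""

    def __init__(self) -> None:
        self._sources_by_fragment: dict[str, set[str]] = defaultdict(set)

    def record(self, fragment_id: str, source_doc_id: str) -> None:
        """Register that a fragment contains content from a source document."""
        self._sources_by_fragment[fragment_id].add(source_doc_id)

    def erasure_plan(self, source_doc_ids: set[str]) -> tuple[set[str], set[str]]:
        """Return (delete, review) fragment ids for an erasure request.

        delete: fragments built only from the affected documents.
        review: merged fragments that also draw on other documents.
        """
        delete: set[str] = set()
        review: set[str] = set()
        for fragment_id, sources in self._sources_by_fragment.items():
            if not sources & source_doc_ids:
                continue  # fragment is unrelated to the request
            if sources <= source_doc_ids:
                delete.add(fragment_id)
            else:
                review.add(fragment_id)
        return delete, review
```

Recording provenance at merge time is what makes this lookup possible; reconstructing it afterwards from the index alone is generally not feasible.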
How long does deduplication and cleaning take before the first RAG deployment?
Time depends primarily on the state of the input data and the number of sources. For a single, structured system (e.g., Confluence with a few hundred pages), exact and fuzzy deduplication plus normalization take one to three days, including a few hours of manual candidate review. For three or more systems with years of history, inconsistent terminology, and duplicates across systems, the work takes one to three weeks, with a substantial portion spent on decisions that require business context, not just technical skill. A detailed plan for preparing enterprise data for AI is described in a separate article on this blog.
