Companies that deployed an internal AI assistant too quickly often report the same issue: the model responds imprecisely or cites outdated procedures from two years ago. The cause is rarely a bad model. The cause is a disorganized input database. Before the selected LLM gains access to company knowledge, that knowledge must be properly cleaned, segmented, and indexed. This is pre-deployment work, not post-deployment.
Why data quality determines AI quality
RAG operates on a simple principle: when a user asks a question, the system searches for semantically matching fragments of your knowledge, then provides them to the model as context for formulating a response. The model is instructed to answer from the retrieved fragments rather than from its own memory, so its answers can only be as good as what it finds in the index.
The consequence is direct: if the index contains conflicting information (e.g., two versions of the same procedure, one of which is outdated), the model may cite either depending on which fragment ranks higher. If a document is a scanned PDF without a text layer, the model won’t see it at all. If price tables are in Excel sheets not converted to text, the assistant won’t be able to provide values from those tables.
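The effect of conflicting versions can be shown with a toy retriever. This is a minimal sketch, not the production pipeline: it uses bag-of-words cosine similarity as a stand-in for a real embedding model, and the fragment texts, the `id` values, and the `retrieve` helper are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; a crude stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, fragments: list[dict], top_k: int = 2) -> list[dict]:
    """Return the top_k fragments most similar to the query."""
    q = tokenize(query)
    ranked = sorted(fragments, key=lambda f: cosine(q, tokenize(f["text"])), reverse=True)
    return ranked[:top_k]

# Hypothetical index: two versions of the same policy plus an unrelated fragment.
FRAGMENTS = [
    {"id": "returns-2022", "text": "Customers can return a product within 14 days of delivery."},
    {"id": "returns-2024", "text": "Customers can return a product within 30 days of delivery."},
    {"id": "shipping", "text": "Orders ship from the warehouse on business days."},
]

hits = retrieve("How many days does a customer have to return a product?", FRAGMENTS)
context = "\n".join(f["text"] for f in hits)  # this is what the model would receive
```

Both versions of the policy score identically for this query, so both end up in the context and the model has no way of knowing which one is current. Removing or demoting the outdated fragment is a data task, not a model task.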
Good data preparation solves these problems before you even launch the model. It’s an upfront investment with long-term benefits: an indexed, clean knowledge base works for years, provided it is updated regularly.
Step 1: Audit and inventory of sources
Before cleaning, you need to know what you have. A typical AI data audit answers five questions:
- What formats? DOCX, PDF, Excel/Sheets spreadsheets, Confluence/Notion pages, emails, CRM conversation logs, FAQ database. Each format requires a different parser.
- How current? Documents older than 12-18 months require verification. Outdated procedures in the index are a source of incorrect responses.
- What access scope? Some documents are confidential (contracts, HR data). They shouldn’t go into an index accessible to all users — or must be indexed in a separate collection with access control.
- Are there duplicates? The same policy in five versions, three of which are outdated, isn’t a knowledge base — it’s noise.
- Is there personal data? Documents containing PII require either anonymization before indexing or local-only hosting with an appropriate legal basis under GDPR (RODO in Polish).
The audit result is a list of sources with an assessment: index / index after cleaning / exclude / index in a separate collection with ACL.
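One way to keep the audit result consistent is to record the answers to the five questions per source and derive the assessment from them. The sketch below is illustrative only: the `Source` fields, the `assess` function, the order in which the rules are applied, and the 540-day (roughly 18-month) threshold are assumptions, not a fixed methodology.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Source:
    name: str
    last_reviewed: date
    confidential: bool = False    # contracts, HR data
    contains_pii: bool = False    # needs anonymization before indexing
    superseded: bool = False      # outdated duplicate of another document
    needs_cleaning: bool = False  # scans, broken tables, mixed content

def assess(source: Source, today: date, max_age_days: int = 540) -> str:
    """Map audit answers to one of the four outcomes used in the audit list."""
    if source.superseded:
        return "exclude"
    if source.confidential:
        return "index in a separate collection with ACL"
    stale = (today - source.last_reviewed).days > max_age_days
    if source.contains_pii or source.needs_cleaning or stale:
        return "index after cleaning"
    return "index"

today = date(2025, 6, 1)
inventory = [
    Source("Old returns policy v2", date(2023, 1, 10), superseded=True),
    Source("Employment contracts", date(2025, 3, 1), confidential=True),
    Source("Scanned price list", date(2025, 1, 15), needs_cleaning=True),
    Source("Customer service FAQ", date(2025, 4, 1)),
]
report = {s.name: assess(s, today) for s in inventory}
```

In this sketch, stale documents land in "index after cleaning" because they need verification first; a real audit may prefer a separate "verify" status.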
Step 2: Parsing and text extraction
The RAG engine needs clean text. Not a PDF file, not an image: text. Parsing converts each source format into flat or hierarchical text with metadata.
Typical challenges:
- Scanned PDFs without OCR — without a text layer, these are images. They require OCR before indexing. OCR quality directly affects search quality.
- PDFs with multi-column layouts: a naive parser reads straight across the columns line by line instead of finishing one column before starting the next, producing unreadable word sequences. A layout-aware parser is needed.
- Tables in PDF or Word — most parsers lose table structure. Tables with numerical data (prices, technical parameters, schedules) require a separate extraction path to Markdown or JSON.
- Excel spreadsheets — each sheet and data range is a separate fragment. Column header context must be preserved, otherwise values lose meaning.
- Confluence/Notion pages via API — usually return HTML or JSON with rich structure. Headers H2/H3 must be preserved as hierarchy signals during chunking.
At the output, each document becomes a set of paragraphs or sections with metadata: document title, date, department, category, source format.
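For HTML returned by a wiki API, the heading hierarchy can be kept while extracting text. The following is a minimal sketch using only Python’s standard `html.parser`; the `SectionParser` class, the `parse_page` helper, and the metadata keys are hypothetical, and real Confluence or Notion exports contain far more markup than this handles.

```python
from html.parser import HTMLParser

class SectionParser(HTMLParser):
    """Collects body text under each H2/H3 heading, remembering the heading path."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[dict] = []
        self._path: dict[str, str] = {}  # current "h2" / "h3" titles
        self._heading = None             # heading tag currently being read, if any
        self._buffer: list[str] = []     # body text since the last heading

    def flush(self) -> None:
        """Store buffered body text as a section under the current headings."""
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.sections.append(
                {"h2": self._path.get("h2"), "h3": self._path.get("h3"), "text": text}
            )
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self.flush()                     # close the previous section first
            self._heading = tag
            self._path[tag] = ""
            if tag == "h2":
                self._path.pop("h3", None)   # a new H2 resets the H3 level

    def handle_endtag(self, tag):
        if tag == self._heading:
            self._path[tag] = self._path[tag].strip()
            self._heading = None

    def handle_data(self, data):
        if self._heading:
            self._path[self._heading] += data
        else:
            self._buffer.append(data)

def parse_page(html: str, **metadata) -> list[dict]:
    """Return one dict per section, each carrying the document-level metadata."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [{**metadata, **section} for section in parser.sections]
```

Each returned section carries its `h2` and `h3` titles next to the document-level metadata, so the chunking step can split at headings without re-parsing the source.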
Step 3: Chunking — splitting into fragments
Chunking is one of the most critical steps, with the biggest impact on search accuracy. A poor chunking strategy can ruin a deployment even with perfect documents.
Basic principles:
- Fragment size 256-512 tokens for most use cases. Too short (50 tokens) loses context. Too long (2000 tokens) fills the model’s context window with one document, leaving no room for others.
- Overlap 10-20% between adjacent fragments preserves semantic continuity at split boundaries. Without overlap, a sentence like “continuing the previous point” loses its preceding context.
- Respect semantic boundaries: split at H2/H3 headers, paragraphs, list items. Splitting mid-sentence almost always yields worse results than splitting at a header.
- Each fragment carries metadata — document title, section header, date, category. In hybrid search, metadata filters allow narrowing results to a specific department or date.
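The size, overlap, and boundary rules above can be sketched as a small chunker. This is an illustration under simplifying assumptions: tokens are approximated by whitespace-separated words (a real tokenizer will count differently), sentences are split with a naive regular expression, and the `chunk_text` name and its defaults are hypothetical.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def count_tokens(sentences: list[str]) -> int:
    """Approximate token count as the number of whitespace-separated words."""
    return sum(len(s.split()) for s in sentences)

def chunk_text(text: str, max_tokens: int = 512, overlap_ratio: float = 0.15) -> list[str]:
    """Pack whole sentences into chunks and repeat trailing sentences as overlap."""
    overlap_budget = int(max_tokens * overlap_ratio)
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        if current and count_tokens(current) + len(sentence.split()) > max_tokens:
            chunks.append(" ".join(current))
            # Carry over as many trailing sentences as fit in the overlap budget.
            carried: list[str] = []
            for previous in reversed(current):
                if count_tokens(carried) + len(previous.split()) > overlap_budget:
                    break
                carried.insert(0, previous)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks are built from whole sentences, a split never lands mid-sentence; a single sentence longer than `max_tokens` is kept intact and simply produces an oversized chunk.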
Special case: step-by-step procedural documents. Here, the optimal chunk is one step with full procedure context in metadata. A step that doesn’t state it is step 3 of 7 in a complaint procedure is of little use in isolation.
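A short sketch of the procedural case, assuming the steps have already been extracted as a list of strings. The `chunk_procedure` function and the metadata keys are hypothetical names chosen for illustration.

```python
def chunk_procedure(title: str, steps: list[str]) -> list[dict]:
    """One chunk per step, with the procedure name and position repeated in each."""
    total = len(steps)
    return [
        {
            # The position is also written into the text so it gets embedded.
            "text": f"{title}, step {number} of {total}: {step}",
            "metadata": {"procedure": title, "step": number, "total_steps": total},
        }
        for number, step in enumerate(steps, start=1)
    ]

complaint_chunks = chunk_procedure(
    "Complaint procedure",
    ["Register the complaint in the CRM.", "Confirm receipt to the customer.", "Assign a case owner."],
)
```

Putting the position into both the text and the metadata is a design choice: the text version helps retrieval, the metadata version allows fetching neighbouring steps by number.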
| Document type | Chunking strategy | Size |
|---|---|---|
| Procedural documents (instructions, SOPs) | By steps / section headers | 300-500 tokens |
| FAQ / question database | By question-answer pairs | 150-300 tokens |
| Legal documents / contracts | By articles and paragraphs | 400-600 tokens |
| Product descriptions / catalogs | One chunk per product | 200-400 tokens |
| Long reports / analyses | Sliding window with 20% overlap | 512 tokens |
| Data tables | Header + row(s) per fragment | Depends on density |
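For the last row of the table, "header + row(s) per fragment" can look like the sketch below. It assumes the table has already been exported to CSV; the `chunk_table` function, the table name, and the sample prices are made up for illustration.

```python
import csv
import io

def chunk_table(csv_text: str, table_name: str, rows_per_chunk: int = 1) -> list[str]:
    """Turn each row into 'column: value' text so values keep their header context."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = ["; ".join(f"{column}: {value}" for column, value in row.items()) for row in reader]
    return [
        table_name + "\n" + "\n".join(lines[start:start + rows_per_chunk])
        for start in range(0, len(lines), rows_per_chunk)
    ]

# Hypothetical price list exported from a spreadsheet.
price_csv = "Product,Price (PLN),Lead time\nWidget A,120,3 days\nWidget B,95,5 days\n"
price_chunks = chunk_table(price_csv, "Price list 2025")
```

Each fragment repeats the column names, so a retrieved row such as `Product: Widget A; Price (PLN): 120; Lead time: 3 days` is readable without the rest of the sheet.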
Step 4: Generating embeddings
Each fragment must be converted into an embedding, a vector of numbers representing the text’s meaning. This vector goes into the vector database and forms the basis for semantic search.
Key decisions at this stage:
Embedding model selection. We use BGE-M3 run locally via Ollama. It supports Polish without translation, produces 1024-dimensional vectors, and runs on the company server’s CPU. No content leaves the infrastructure during indexing. For public content without sensitive data, cloud-based embedding models are acceptable if PII has been masked beforehand.
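A local embedding call might look roughly like this. Treat it as a sketch under assumptions: it presumes an Ollama server on the default local port with a model pulled under the name `bge-m3`, and that the server exposes an `/api/embed` endpoint accepting `model` and `input` and returning an `embeddings` list. Verify the endpoint and field names against the documentation of your Ollama version; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # assumed default local endpoint

def build_embed_request(texts: list[str], model: str = "bge-m3") -> urllib.request.Request:
    """Prepare the HTTP request; nothing is sent until it is opened."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts: list[str], model: str = "bge-m3") -> list[list[float]]:
    """Send chunks to the local server and return one vector per chunk."""
    with urllib.request.urlopen(build_embed_request(texts, model), timeout=60) as response:
        body = json.load(response)
    return body["embeddings"]  # assumed response field
```

Calling `embed(["Complaints are answered within 14 days."])` requires the server to be running; the request goes to `localhost` only, which is the point of the local setup.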
Incremental indexing. The first full indexing may take minutes to hours depending on volume. Each document update should trigger reindexing of only the changed fragments — not the entire corpus.
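A common way to find the changed fragments is to store a content hash next to each chunk and compare it on every run. The sketch below assumes chunk IDs are stable between runs; `fingerprint` and `plan_reindex` are hypothetical helper names.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash of a chunk."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(stored: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare hashes stored in the index with the chunks produced now.

    stored:  chunk_id -> hash saved at the last indexing run
    current: chunk_id -> chunk text from the latest parse
    """
    current_hashes = {chunk_id: fingerprint(text) for chunk_id, text in current.items()}
    return {
        "embed": sorted(c for c, h in current_hashes.items() if stored.get(c) != h),
        "delete": sorted(c for c in stored if c not in current_hashes),
        "keep": sorted(c for c, h in current_hashes.items() if stored.get(c) == h),
    }
```

Only the IDs under `embed` are sent to the embedding model; IDs under `delete` are removed from the vector database, and the rest is left untouched.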
Versioning. Vectors produced by different embedding models are not comparable, so changing the embedding model requires full reindexing of the entire database. This is important for planning: choose a model for the long term, as migration has real costs.
Step 5: GDPR (RODO), PII, and data security
Preparing data for AI is simultaneously a data protection exercise. Every stage of the pipeline is a potential leak point.
Obligations for personal data:
- Legal basis for processing data for AI purposes must exist before data enters the index, for example consent or legitimate interest for the specific use case.
- Minimization — index only data necessary for the assistant’s purpose. A full CRM database with customer history shouldn’t go into an assistant answering internal procedure questions.
- PII masking — before sending fragments to an external generative model via our LLM router, we automatically detect and mask personal data: names, phone numbers, PESEL, email addresses.
- Data residency and self-hosting — for data covered by professional secrecy (legal, medical, financial) or sensitive HR data, the entire pipeline (embeddings, vector database, generative model) should run locally.
- Right to erasure: when GDPR requires deleting an individual’s data, their documents must be removed from the index. The pipeline should support selective fragment deletion from the vector database.
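The masking step from the list above can be approximated with pattern matching for the structured identifiers. This is a simplified sketch, not the masking component described in the article: the regular expressions and placeholder tokens are assumptions, the phone pattern covers only nine-digit Polish numbers, and personal names are left untouched because detecting them reliably needs a named-entity model rather than a regex.

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE = re.compile(r"(?<!\d)(?:\+48[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\d)")
PESEL = re.compile(r"(?<!\d)\d{11}(?!\d)")

def is_valid_pesel(digits: str) -> bool:
    """Check the PESEL control digit (weights 1-3-7-9 repeated over ten digits)."""
    weights = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
    weighted_sum = sum(w * int(d) for w, d in zip(weights, digits))
    return (10 - weighted_sum % 10) % 10 == int(digits[10])

def mask_pii(text: str) -> str:
    """Replace e-mail addresses, valid PESEL numbers and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    # Only 11-digit sequences that pass the checksum are treated as PESEL.
    text = PESEL.sub(lambda m: "[PESEL]" if is_valid_pesel(m.group()) else m.group(), text)
    text = PHONE.sub("[PHONE]", text)
    return text

masked = mask_pii("Contact Jan at jan.kowalski@example.com or +48 601 234 567, PESEL 44051401359.")
```

The checksum test reduces false positives on order numbers and other 11-digit identifiers. Note that "Jan" survives in the output, which is why name detection needs a separate component.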
For HR, employee evaluation, or financial projects, which can fall into the high-risk categories of the AI Act, a DPIA (Data Protection Impact Assessment) under GDPR is typically required before deployment. Details in the article AI Act and RODO 2026.
Step 6: Maintaining and updating the index
Data preparation isn’t a one-time project. The knowledge base is dynamic: documents change, processes evolve, new products enter the offering.
Best practices for maintenance:
- Document versioning with expiration dates. Each chunk carries metadata with the document date. The system can deprioritize fragments older than X months or require manual confirmation of currency.
- Automated reindexing pipeline. A file change in the knowledge repository (Confluence, SharePoint) triggers automatic recalculation of affected fragments. Not once a month — with every change.
- Response quality monitoring. If the assistant starts responding “I don’t know” more often than usual, it’s a signal the database needs updating — not that the model is broken.
- Thematic gaps. Observability tools track questions for which RAG found no fragment. This is a list of topics to supplement in the knowledge base.
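Deprioritizing old fragments can be done as a small re-ranking step after retrieval. The sketch below assumes each hit carries a similarity `score` (higher is better) and a `doc_date` from the chunk metadata; the 540-day threshold and the size of the penalty are arbitrary illustrative values to be tuned on real queries.

```python
from datetime import date

def rerank_by_freshness(hits: list[dict], today: date,
                        max_age_days: int = 540, penalty: float = 0.1) -> list[dict]:
    """Sort hits by score, subtracting a fixed penalty from stale fragments."""
    def adjusted(hit: dict) -> float:
        stale = (today - hit["doc_date"]).days > max_age_days
        return hit["score"] - (penalty if stale else 0.0)
    return sorted(hits, key=adjusted, reverse=True)

hits = [
    {"id": "returns-2022", "score": 0.82, "doc_date": date(2022, 3, 1)},
    {"id": "returns-2024", "score": 0.80, "doc_date": date(2024, 11, 5)},
]
ranked = rerank_by_freshness(hits, today=date(2025, 6, 1))
```

With these numbers the newer fragment moves to the top even though its raw score is slightly lower; a clearly better match from an old document would still win.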
Cycle: index → collect unanswered questions → supplement the base → reindex. After a few iterations, the assistant covers the vast majority of real user questions.
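The "collect unanswered questions" part of the cycle can start as a simple counter. In this sketch a question counts as a gap when retrieval returns nothing or the best similarity score falls below a threshold; the `GapLog` class and the 0.5 threshold are hypothetical and depend on the embedding model and scoring used.

```python
from collections import Counter

class GapLog:
    """Counts questions for which retrieval found no sufficiently similar fragment."""

    def __init__(self, min_score: float = 0.5) -> None:
        self.min_score = min_score
        self.unanswered: Counter = Counter()

    def record(self, question: str, scores: list[float]) -> bool:
        """Register one query; return True if it counts as a knowledge gap."""
        if not scores or max(scores) < self.min_score:
            self.unanswered[question.strip().lower()] += 1
            return True
        return False

    def top_gaps(self, n: int = 10) -> list[tuple[str, int]]:
        """Most frequent unanswered questions, the to-do list for new content."""
        return self.unanswered.most_common(n)
```

Questions are normalized only by trimming and lowercasing here; grouping paraphrases of the same question would need clustering on embeddings.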
Cost and duration
Time and cost depend primarily on the state of the input data. Clean, structured documents (Word, Confluence, well-formatted PDFs) shorten the project severalfold compared to databases full of scans and unordered spreadsheets.
| Stage | Time (typical pilot) | Notes |
|---|---|---|
| Audit and inventory | 1-3 days | Depends on number of source systems |
| Parsing and cleaning | 2-5 days | Scans, OCR, and tables extend this |
| Chunking and configuration | 1-2 days | Iteration after quality tests |
| Indexing (embeddings + Qdrant) | Hours to 1 day | Local BGE-M3 on CPU |
| Quality tests and corrections | 2-4 days | Test questions, error analysis |
| Total pilot (one knowledge area) | 1-3 weeks | With clean data, closer to 1 week |
Infrastructure costs for local hosting are mainly server and deployment time. Cloud embeddings cost a few cents per million tokens. Assess your project with the ROI calculator or use the AI readiness assessment, which evaluates your knowledge base’s state among other factors.
We prepare a full project quote after a data audit. Contact us via the contact form.
Try it live
This sandbox runs the same pipeline as our deployments: paste a fragment of your knowledge base, and the model will respond solely based on the provided text. PII is masked before the text reaches the model, and nothing is retained.
FAQ
Where to start preparing data for AI when the database is chaotic?
Start with an audit: list all systems where company knowledge resides (SharePoint, Confluence, emails, CRM, shared drive), then assess how current each dataset is and what format it is in. Choose one thematic area with relatively clean data, such as customer service FAQ or employee onboarding procedures, and run a pilot on that subset. It’s better to launch an assistant with 200 good documents than 2000 chaotic ones. More on the phased approach in the article where to start AI deployment.
Does company data have to leave the company to build RAG?
No. With a local embedding model (e.g., BGE-M3 via Ollama) and a local vector database (Qdrant on-prem), the entire indexing pipeline runs in your infrastructure. Only the query text and a few selected context fragments go to an external generative model, after automatic PII masking. For sensitive HR, legal, and financial data, the generative model can be self-hosted as well, so that nothing leaves your infrastructure.
What to do with outdated documents in the knowledge base?
Don’t delete them immediately if you’re unsure which versions are current. Instead, mark them with dates and deprioritize them in ranking via metadata: fragments older than 18 months rank lower when scores are otherwise close, and the assistant can inform the user that the document comes from an older version of the procedure. Systematically organizing the knowledge base is a project in itself that pays off many times over, as it improves not only AI but also employee search.
How often should the index be updated after the first deployment?
It depends on how often the knowledge base changes. For actively updated procedures, the optimal approach is an incremental pipeline triggered by document changes (webhook from Confluence or SharePoint to the indexing engine). For more stable bases, a weekly or monthly scan with automatic change detection via file checksum is sufficient. Lack of index updates when documents change is the most common cause of assistant quality degradation after a few months of operation.
Is data preparation required when using a ready-made AI SaaS tool?
Yes, in every scenario. Even ready-made RAG platforms (SaaS) require documents to be machine-readable, current, and logically segmented. The difference is that ready-made SaaS tools offer their own interface for uploading and managing documents, but the quality of chunking and data structure still depends on you. For sensitive data or large volumes, your own infrastructure gives full control over what is processed and where. Comparison of approaches in the article custom AI assistant or ready-made.