We see this regularly: a team builds a company assistant, indexes "everything possible," and then discovers that contracts with confidentiality clauses, HR data, and customer correspondence history have ended up in the vector database — accessible to any employee who asks the right question. The model works correctly. The problem is the lack of governance over what it is even allowed to show. Data governance is work that must be done before the first document enters the index, not an audit conducted after an incident.
Classification: before you index anything
Every document you consider as a source for AI must be assigned a sensitivity class. This is the foundation: all other decisions about access, retention, and hosting depend on it. Without classification, you cannot meaningfully answer the question "can this file be sent to a cloud model?"
We typically work with four levels. The higher the level, the stricter the processing rules and the greater the likelihood that the data must remain local.
| Class | Examples | Rule for AI |
|---|---|---|
| Public | Offers, FAQ, marketing materials | Any model, including cloud |
| Internal | Procedures, SOPs, technical documentation | PII masking; requests go through a controlled LLM router |
| Confidential | Contracts, business data, plans | Separate collection with ACL, data residency |
| Sensitive / Special | HR data, health data, legal data | Self-hosting only, DPIA required |
Classification doesn’t have to be perfect from day one. It’s enough that it’s unambiguous and assigned automatically where possible — based on location in the source system, a label in SharePoint, or a rule at the folder level. A document without a class is treated by default as confidential, never as public. This is a safer fallback.
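The automatic assignment described above can be sketched in a few lines. This is a minimal illustration, not a finished classifier: the folder prefixes, the label names, and the `classify` function are hypothetical, and the choice to keep the strictest class when two rules disagree is our assumption. The only rule taken directly from the text is the fallback to confidential.

```python
from enum import IntEnum
from pathlib import PurePosixPath
from typing import Optional


class Sensitivity(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    SENSITIVE = 4


# Hypothetical folder-level rules: path prefix -> sensitivity class.
FOLDER_RULES = {
    "/marketing": Sensitivity.PUBLIC,
    "/procedures": Sensitivity.INTERNAL,
    "/contracts": Sensitivity.CONFIDENTIAL,
    "/hr": Sensitivity.SENSITIVE,
}

# Hypothetical labels as they might arrive from the source system.
LABEL_RULES = {
    "public": Sensitivity.PUBLIC,
    "internal": Sensitivity.INTERNAL,
    "confidential": Sensitivity.CONFIDENTIAL,
    "sensitive": Sensitivity.SENSITIVE,
}


def classify(path: str, label: Optional[str] = None) -> Sensitivity:
    """Return the class of a document; anything unclassified is CONFIDENTIAL."""
    candidates = []
    if label is not None and label.lower() in LABEL_RULES:
        candidates.append(LABEL_RULES[label.lower()])
    document = PurePosixPath(path)
    for prefix, level in FOLDER_RULES.items():
        if document.is_relative_to(prefix):
            candidates.append(level)
    # Assumption: if rules disagree, the strictest class wins.
    # If no rule matches, fall back to CONFIDENTIAL, never PUBLIC.
    return max(candidates, default=Sensitivity.CONFIDENTIAL)
```

A file under `/marketing` that carries a "Confidential" label ends up confidential, and a file in an unknown folder with no label does too.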
Role-based access control
The most common mistake in RAG implementations is a "flat" index, a single vector set that everyone has identical access to. The assistant then inherits the broadest possible scope: a cleverly phrased question is enough to extract a document fragment that the questioner should never see.
The correct pattern is to transfer permissions from the source system to the search layer. Each fragment carries metadata about who can see it (department, role, access level), and the query filters results based on the user’s identity before passing them to the model. The model never receives context that the questioner is not entitled to — so it cannot reveal it even under pressure from a clever prompt.
In practice, this means three things: separate collections or metadata filters for different sensitivity classes, mapping company roles to index permissions, and the principle of "deny if unknown." If the system cannot determine a user’s permissions, it does not show sensitive fragments. We cover how we integrate such restrictions into the model layer itself in the article on AI agent security.
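The filtering step can be sketched as follows. All names here (`Chunk`, `User`, `visible_chunks`, the numeric clearance scale) are hypothetical, and a real vector database would apply the same condition as a metadata filter inside the query rather than in application code. Treating an unresolved user as entitled to public fragments only is our reading of "deny if unknown."

```python
from dataclasses import dataclass
from typing import List, Optional

PUBLIC = 1  # assumed scale: 1 = public ... 4 = sensitive


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_roles: frozenset  # roles copied from the source system's permissions
    sensitivity: int


@dataclass(frozen=True)
class User:
    user_id: str
    roles: frozenset
    clearance: int  # highest sensitivity level this user may read


def visible_chunks(chunks: List[Chunk], user: Optional[User]) -> List[Chunk]:
    """Filter retrieved chunks by identity before they reach the model."""
    if user is None:
        # Deny if unknown: without resolved permissions only public text passes.
        return [c for c in chunks if c.sensitivity == PUBLIC]
    return [
        c
        for c in chunks
        if c.sensitivity <= user.clearance
        and (c.sensitivity == PUBLIC or bool(c.allowed_roles & user.roles))
    ]
```

Because the filter runs before prompt assembly, the model never holds text the user is not entitled to, so no prompt can make it reveal that text.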
Retention: how long and why
Retention is a question most companies don’t have a ready answer to in the context of AI. Input data (documents in the index) and operational data (conversation logs, queries, generated responses) are subject to separate policies and carry separate risks.
Input data should live only as long as it is current and needed. An outdated procedure from three years ago in the index means not only lower-quality responses but also a legal risk if it contains personal data whose processing basis has expired. Conversation logs are a separate topic: by default, we apply zero or minimal retention for query content, because users paste data into the assistant that no one anticipated.
A good retention policy answers four questions for each data type:
- Why are we keeping this? The processing purpose must be concrete — "it might be useful" is not a purpose compliant with minimization.
- How long? A specific period with an expiration date, not "indefinitely."
- What happens after the term? Automatic deletion from the index and vector database, not just the source system.
- How do we implement the right to erasure? Selective deletion of fragments related to a specific person — vectors must also be deleted, not just the source file.
The last point is a surprisingly common oversight. Deleting a file from SharePoint does not remove its embeddings from the vector database. If the pipeline does not support selective deletion, a GDPR erasure request remains technically unfulfilled.
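A rough sketch of the two deletion paths, expiry and erasure, is shown below. `InMemoryVectorStore` is a hypothetical stand-in for a real vector database, and the metadata fields (`doc_id`, `person_ids`, `expires_on`) are assumptions. The point is only that both paths delete vectors by metadata, independently of what happens to the source file.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List


@dataclass
class StoredVector:
    vector_id: str
    doc_id: str
    person_ids: frozenset  # people whose personal data appears in the fragment
    expires_on: date       # end of the retention period


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database with metadata deletion."""

    def __init__(self) -> None:
        self._items: Dict[str, StoredVector] = {}

    def add(self, item: StoredVector) -> None:
        self._items[item.vector_id] = item

    def delete_where(self, predicate: Callable[[StoredVector], bool]) -> List[str]:
        doomed = [vid for vid, item in self._items.items() if predicate(item)]
        for vid in doomed:
            del self._items[vid]
        return doomed  # returned so the governance log can record the deletion

    def __len__(self) -> int:
        return len(self._items)


def purge_expired(store: InMemoryVectorStore, today: date) -> List[str]:
    """Scheduled job: remove every vector whose retention period has ended."""
    return store.delete_where(lambda v: v.expires_on <= today)


def erase_person(store: InMemoryVectorStore, person_id: str) -> List[str]:
    """Erasure request: remove every vector linked to one person."""
    return store.delete_where(lambda v: person_id in v.person_ids)
```

Both functions return the deleted vector IDs, which gives you something concrete to record as evidence that the request was carried out.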
Lineage: where every answer comes from
Lineage (data origin) is the ability to trace which document, in which version and from which source, each fragment in the index comes from, and ultimately every sentence in the assistant’s response. Without this, you cannot answer two critical questions: "why did the model say this?" and "is this information still current?"
In practice, every fragment in the vector database should carry origin metadata: source document identifier, version, date, origin system, and sensitivity class. When the assistant cites a fragment, it can point to the specific source, which is the foundation of user trust and auditability. Lineage is also the basis for compliance with the accountability principle in GDPR: you must be able to demonstrate what personal data is processed by the system and where it comes from.
| Lineage Element | Purpose | Where It Is Stored |
|---|---|---|
| Document ID and version | Updates and withdrawal | Fragment metadata |
| Date and source system | Freshness, origin audit | Fragment metadata |
| Sensitivity class | Access filtering | Fragment metadata |
| Citation in response | Trust, verifiability | Generation layer (RAG) |
| Operation log (who/when indexed) | Accountability, DPIA | Governance log |
Solid lineage also pays off operationally: when the source document changes, you know exactly which fragments to reindex, instead of recalculating the entire corpus. Lineage only works on clean input data; we describe that foundation in the article on how to prepare company data for AI.
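As an illustration, the origin metadata and the targeted reindex could look like this. The field names follow the list of metadata above, but the types and the helper functions `stale_fragments` and `cite` are hypothetical. Comparing stored versions against a map of current versions is one simple way to find fragments that need work, assuming the source system exposes a version number.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Lineage:
    doc_id: str
    version: int
    document_date: date
    source_system: str
    sensitivity: str


@dataclass(frozen=True)
class Fragment:
    fragment_id: str
    text: str
    lineage: Lineage


def stale_fragments(
    fragments: List[Fragment], current_versions: Dict[str, int]
) -> List[Fragment]:
    """Fragments whose source document changed or was withdrawn.

    A document missing from current_versions counts as withdrawn.
    """
    return [
        f
        for f in fragments
        if current_versions.get(f.lineage.doc_id) != f.lineage.version
    ]


def cite(fragment: Fragment) -> str:
    """Build a citation string the generation layer can attach to an answer."""
    origin = fragment.lineage
    return (
        f"{origin.doc_id} v{origin.version} "
        f"({origin.source_system}, {origin.document_date.isoformat()})"
    )
```

Only the fragments returned by `stale_fragments` need to be re-embedded or removed; the rest of the corpus stays untouched.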
PII minimization and DPIA
Minimization is the principle that saves the most projects: only what is necessary for the purpose enters the AI pipeline. An assistant answering questions about internal procedures doesn’t need the full CRM database with customer history. The less personal data in the pipeline, the lower the risk and the simpler the compliance.
In practice, PII minimization works on two levels. The first is input selection: we don’t index data that isn’t needed for the assistant’s task. The second is on-the-fly masking: before a fragment is sent to an external generative model, we automatically detect and replace names, phone numbers, PESEL numbers, and email addresses. For data covered by professional secrecy, the entire pipeline can run locally, which eliminates the problem of sending data outside; we describe this in the article on self-hosted LLM and GDPR.
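A minimal sketch of on-the-fly masking with regular expressions follows. The patterns are simplified assumptions: the phone pattern covers only nine-digit numbers with an optional +48 prefix, and the PESEL pattern matches any standalone 11-digit number without validating the checksum. Names are deliberately not handled here, because detecting them reliably takes a named-entity recognition model rather than a regex.

```python
import re

# Simplified, assumed patterns; tune and verify them on a sample of real data.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"(?<!\w)(?:\+48[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\w)")
PESEL = re.compile(r"\b\d{11}\b")

# Order matters: phones are masked before PESEL so that "+48600700800"
# is not mistaken for an 11-digit identifier.
RULES = [(EMAIL, "[EMAIL]"), (PHONE, "[PHONE]"), (PESEL, "[PESEL]")]


def mask_pii(text: str) -> str:
    """Replace emails, phone numbers, and PESEL-like numbers with placeholders."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text


sample = "Contact Jan at jan.kowalski@example.com or +48 600 700 800, PESEL 44051401359."
print(mask_pii(sample))
# Contact Jan at [EMAIL] or [PHONE], PESEL [PESEL].
```

The first name survives in the output, which is exactly the gap a sample-based verification of the masking step should catch.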
A DPIA (Data Protection Impact Assessment) is required when processing may pose a high risk to individuals, and AI implementations on HR, health, or financial data usually fall into this category. A DPIA is not a formality: it’s an exercise that forces answers to questions about purpose, legal basis, data residency, and security measures before the system launches. Done well, it often reveals gaps that are cheaper to fix at the design stage than after deployment. For full context on the obligations, see the article AI Act and GDPR 2026.
Data governance checklist for AI
This is the practical checklist we go through before every indexing run. If you can’t check an item, the data doesn’t enter the index until the gap is closed.
- Classification — every source has an assigned sensitivity class; no class = treated as confidential.
- Legal basis — exists and is documented for every processing purpose.
- Role-based access — permissions from the source system mapped in index filters; "deny if unknown" principle.
- Minimization — only data necessary for the assistant’s task is indexed.
- PII masking — automatic before sending to an external model; verified on a sample.
- Retention — storage period and automatic deletion defined for input data and logs.
- Right to erasure — pipeline supports selective deletion of fragments, including vectors.
- Lineage — every fragment carries origin metadata; answers can be traced to the source.
- DPIA — performed for high-risk areas before deployment.
- Hosting — sensitivity class determines whether data can leave the infrastructure.
This isn’t a one-time checklist. We return to it with every change in data scope, because a new source means new risk. Keeping governance alive costs less than handling an incident.
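One way to keep the checklist enforceable is to encode it as a gate in the indexing pipeline. The sketch below is hypothetical: the `GovernanceChecklist` fields simply mirror the ten items above, and how each flag gets set (manual sign-off, automated check) is left open.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class GovernanceChecklist:
    """One flag per checklist item; everything starts unchecked."""

    classification: bool = False
    legal_basis: bool = False
    role_based_access: bool = False
    minimization: bool = False
    pii_masking: bool = False
    retention: bool = False
    right_to_erasure: bool = False
    lineage: bool = False
    dpia: bool = False  # True means performed, or documented as not required
    hosting: bool = False


def open_gaps(checklist: GovernanceChecklist) -> List[str]:
    """Names of the items that are still unchecked."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def may_index(checklist: GovernanceChecklist) -> bool:
    """A source enters the index only when no gap remains."""
    return not open_gaps(checklist)
```

Because a new source starts with every flag set to `False`, adding it to the pipeline forces the checklist to be walked through again.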
FAQ
How does data governance for AI differ from a standard information security policy?
A classic information security policy focuses on protecting data at rest and in transit: encryption, access, backups. Data governance for AI adds a layer specific to models: control over what the model can "see" through RAG, PII masking before sending to the model, and answer lineage. It’s an extension of the existing policy, not a replacement.
Is a DPIA required for every AI implementation?
Not for every one. A DPIA is required when processing may pose a high risk to individuals’ rights, typically for HR, health, or financial data, or for large-scale profiling. An assistant for a public FAQ doesn’t require it. If in doubt, it’s better to perform a simplified assessment and document the decision, as this itself is evidence of the accountability required by GDPR.
How do we implement the right to erasure in a RAG system?
The key is that deleting the source file does not remove its embeddings from the vector database. The pipeline must support selective deletion of fragments linked to a specific person or document, and it’s lineage that allows finding them. Without origin metadata, an erasure request is technically unfeasible, which is a direct violation of GDPR obligations.
Can sensitive data be sent to a cloud model after PII masking?
It depends on the data class and legal basis, but masking alone is usually insufficient for special category data. Masking reduces risk for internal data, but for HR, legal, or health data, we recommend full self-hosting and data residency in the company’s infrastructure. Then no content leaves your environment; details are in the article on self-hosted LLM and GDPR.
Where to start if we have no data classification?
Start with one area you want to feed into AI and classify only that, not the entire company at once. Go folder by folder, assign each source one of four classes, and apply the rule that a document without a class is confidential by default. A pilot on one clean, well-classified area is safer and faster than trying to organize everything. It is the same phased approach we recommend for preparing company data for AI.