A company connects an AI system to an email inbox. The assistant is supposed to answer customer questions, classify requests, and suggest responses. Initial tests go well. A few weeks later, someone asks: How did you know there were PESEL numbers in those emails? Does all of this go to the cloud?
This is the moment when most AI implementations face their first serious data security question, and it is not an academic one. RODO (the Polish acronym for the GDPR) imposes a data minimization obligation: only the information necessary to perform the task should reach the model. A customer’s name, order number, PESEL number (the Polish national identification number), IP address, or email address: the model needs none of these values to classify the query’s intent or suggest a response. All of them are PII and should be replaced with tokens before sending.
What is PII and why does it matter in AI?
PII (Personally Identifiable Information) is any information that can identify or distinguish a specific individual, directly or indirectly. Under Polish law, the scope is regulated by RODO, which includes:
- first and last name,
- PESEL, NIP, ID number,
- email address and phone number,
- residential address,
- IP address and session identifiers,
- biometric and medical data (special category, Art. 9 RODO),
- combinations of data that together identify a person, even if none do so individually.
Language models do not "remember" data in the sense of a database. However, they process it on the provider’s side: text enters the provider’s computational environment, is tokenized, processed by the model, and a response is returned. Depending on the agreement and configuration, data may be used to improve the model or logged for a certain period. This is enough for lawyers and auditors to question the legal basis for such transfer.
Masking PII before sending solves this problem at the source: the model sees [NAME_1], [EMAIL_1], [PESEL_1] instead of real values. The classification or response suggestion task is completed. Personal data never leaves the infrastructure.
Three patterns: masking, pseudonymization, anonymization
Not all approaches are equivalent. It’s worth distinguishing three concepts because the legal and technical consequences differ.
| Pattern | How it works | Reversibility | RODO status |
|---|---|---|---|
| Masking (tokenization) | PII replaced with token [TYPE_N] locally, token-map stored locally | reversible (local map) | personal data still processed locally; pseudonym sent to model |
| Pseudonymization | PII replaced with a fixed, deterministic hash or code; same email → same token | reversible (key) | RODO: still personal data, but risk reduced |
| Anonymization | removal or generalization to make re-identification impossible | irreversible | RODO no longer applies to such processed data |
In a typical RAG flow, masking is used: PII is replaced with a token, the model’s response is returned with tokens, and a local layer restores the original context where needed for action (e.g., inserting a name into an email response). The token-value map never leaves your server.
Full anonymization is difficult and rarely needed in operational flows. It’s useful for logging, reporting, and model training: a dataset without re-identification possibilities is not subject to RODO and can be used freely.
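The difference between the three patterns can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names and `SECRET_KEY` are hypothetical, HMAC-SHA256 is one possible choice of deterministic code (re-linking a pseudonym to a person then relies on a lookup kept together with the key), and generalizing a single field does not by itself guarantee anonymization.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would come from a secret store.
SECRET_KEY = b"replace-with-a-key-from-your-secret-store"


def mask(value: str, kind: str, token_map: dict[str, str]) -> str:
    """Masking: per-document token, reversible through the local token map."""
    for token, original in token_map.items():
        if original == value:
            return token  # the same value always gets the same token
    n = sum(1 for t in token_map if t.startswith(f"[{kind}_")) + 1
    token = f"[{kind}_{n}]"
    token_map[token] = value
    return token


def pseudonymize(value: str, kind: str) -> str:
    """Pseudonymization: keyed, deterministic code; same input, same output."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{kind}_{digest[:12]}"


def anonymize_birth_year(year: int) -> str:
    """Anonymization by generalization: keep only the decade, drop the rest."""
    return f"{year // 10 * 10}s"
```

Masking tokens are only meaningful together with the map of one document or session, while the pseudonym stays stable across documents, which is what makes it useful for joining records and also why it remains personal data.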
What does the PII masking layer look like in practice?
A standard implementation consists of four components operating locally, before any network traffic to an external LLM.
1. PII Detector. A library or model (e.g., spaCy NER, Microsoft Presidio, custom regex rules) scans the text and identifies all fragments that resemble PII. For Polish text, this means recognizing PESEL (11-digit pattern with checksum), NIP (10-digit), phone numbers (9-digit sequences with prefixes), email addresses (RFC 5322 regex), and proper names (NER).
2. Tokenizer. Each detected PII value is replaced with a token that preserves its type and ordinal number in the document: Jan Kowalski → [PERSON_1], an email address → [EMAIL_1], 48123456789 → [PHONE_1]. If the same email appears three times, all three instances receive the same token.
3. Reversal Map. The token-map ({"[PERSON_1]": "Jan Kowalski"}) is stored locally (in session memory or an encrypted database) and is never sent to the model.
4. De-tokenizer. After the model’s response returns, the local layer optionally substitutes the original values back where needed for action (e.g., addressing a response). If the response goes to an audit log, it remains tokenized.
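The four components can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: only two detectors are shown (email and PESEL), the email regex is a pragmatic approximation rather than full RFC 5322, proper names would need an NER model and are omitted, and the names `mask_text` and `unmask_text` are hypothetical.

```python
import re

PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def is_valid_pesel(candidate: str) -> bool:
    """Check the PESEL control digit (11 digits, weighted sum modulo 10)."""
    if len(candidate) != 11 or not (candidate.isascii() and candidate.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(candidate, PESEL_WEIGHTS))
    return (10 - total % 10) % 10 == int(candidate[10])


# 1. Detector: illustrative patterns only.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PESEL": re.compile(r"\b\d{11}\b"),
}


def mask_text(text: str) -> tuple[str, dict[str, str]]:
    """2. Tokenizer + 3. Reversal map: returns masked text and token -> value."""
    token_map: dict[str, str] = {}
    seen: dict[tuple[str, str], str] = {}
    counters: dict[str, int] = {}

    def make_sub(kind: str):
        def _sub(match: re.Match) -> str:
            value = match.group(0)
            if kind == "PESEL" and not is_valid_pesel(value):
                return value  # 11 digits that fail the checksum stay untouched
            key = (kind, value)
            if key not in seen:
                counters[kind] = counters.get(kind, 0) + 1
                seen[key] = f"[{kind}_{counters[kind]}]"
                token_map[seen[key]] = value
            return seen[key]
        return _sub

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(make_sub(kind), text)
    return text, token_map


def unmask_text(text: str, token_map: dict[str, str]) -> str:
    """4. De-tokenizer: put original values back where an action needs them."""
    for token, value in token_map.items():
        text = text.replace(token, value)
    return text
```

Only the first element returned by `mask_text` would be sent to the model; the map stays in local memory and is passed to `unmask_text` once the response comes back.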
In our architecture, this layer operates as middleware in the LLM router. Every model call passes through it automatically, without changes on the application side.
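A middleware of this kind can be approximated as a wrapper around whatever function calls the model. The sketch below is an assumption about the shape of such a layer, not the router's actual code: `with_pii_masking` and `call_model` are hypothetical names, and only email addresses are masked to keep the example short.

```python
import re
from typing import Callable

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def with_pii_masking(call_model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model call so that masking and de-tokenizing happen around it."""

    def wrapped(prompt: str) -> str:
        token_map: dict[str, str] = {}  # lives only for this call, never sent

        def to_token(match: re.Match) -> str:
            value = match.group(0)
            for token, original in token_map.items():
                if original == value:
                    return token
            token = f"[EMAIL_{len(token_map) + 1}]"
            token_map[token] = value
            return token

        masked_prompt = EMAIL_RE.sub(to_token, prompt)  # pre-processor
        response = call_model(masked_prompt)            # model sees tokens only
        for token, value in token_map.items():          # post-processor
            response = response.replace(token, value)
        return response

    return wrapped
```

The application keeps calling one function with plain text and receives plain text back, which is why no changes are needed on its side.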
Which data to mask: priorities for Polish companies
Not every word requires masking. Prioritization by risk:
| PII Category | Examples | Masking Priority |
|---|---|---|
| Legal identifiers | PESEL, NIP, ID number, passport number | critical — always |
| Contact data | email, phone, address | high — always |
| Financial data | account number, card number, amounts + name | high — always |
| Health/biometric data | diagnosis, test result | critical (Art. 9 RODO) |
| First and last name | in a corporate document context | high |
| IP address | in logs, request headers | medium — context-dependent |
| Company names (B2B clients) | in contract content | medium — NDA-dependent |
Practical rule: if data goes to inference in a cloud model, mask at least the first three categories unconditionally. Health and biometric data are a special RODO category (Art. 9) — they require a separate DPIA and, in many cases, exclusively local processing.
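The table and the practical rule can be expressed as a small policy function. The category keys, the `action_for_cloud` name, and the choices to treat health and biometric data as local-only and to mask unknown categories by default are assumptions made for this sketch; a real policy would follow your own DPIA.

```python
from enum import IntEnum


class Priority(IntEnum):
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# Mirrors the priority table above; the keys are hypothetical category labels.
MASKING_PRIORITY = {
    "LEGAL_ID": Priority.CRITICAL,
    "CONTACT": Priority.HIGH,
    "FINANCIAL": Priority.HIGH,
    "HEALTH_BIOMETRIC": Priority.CRITICAL,
    "PERSON_NAME": Priority.HIGH,
    "IP_ADDRESS": Priority.MEDIUM,
    "COMPANY_NAME": Priority.MEDIUM,
}

# Assumption: the strictest reading of the rule, special categories stay local.
LOCAL_ONLY = {"HEALTH_BIOMETRIC"}


def action_for_cloud(category: str, mask_medium: bool = False) -> str:
    """Return 'keep_local', 'mask', or 'pass' for data bound for a cloud model."""
    if category in LOCAL_ONLY:
        return "keep_local"
    priority = MASKING_PRIORITY.get(category)
    if priority is None:
        return "mask"  # unknown category: fail closed
    if priority >= Priority.HIGH:
        return "mask"
    return "mask" if mask_medium else "pass"
```

The `mask_medium` flag stands in for the context-dependent rows of the table, for example an NDA that covers company names.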
Self-hosting as a security boundary
PII masking reduces risk but doesn’t eliminate all vectors. For particularly sensitive data (employee records, medical data, contracts with confidentiality clauses, children’s data), the appropriate response is self-hosting: the language model and embedding model run on your hardware, and no queries leave the internal network.
In our stack, a local BGE-M3 handles the embedding layer for RAG without any outgoing traffic. Generative models can run via Ollama on a local GPU or through an LLM router with a fallback policy: sensitive modes → local model, standard modes → cloud with masked PII.
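The routing policy described above can be sketched as a pure function. The mode names, the `Route` type, and the decision to fall back to the local model when the cloud is unavailable are assumptions for illustration; the one property the sketch is meant to show is that a sensitive mode never resolves to the cloud.

```python
from dataclasses import dataclass

# Hypothetical mode labels for flows that must stay on local hardware.
SENSITIVE_MODES = {"hr", "medical", "contracts", "minors"}


@dataclass(frozen=True)
class Route:
    backend: str    # "local" or "cloud"
    mask_pii: bool  # whether the masking layer runs before the call


def choose_route(mode: str, cloud_available: bool = True) -> Route:
    """Sensitive modes go to the local model; standard modes go to the cloud
    with masked PII, falling back to the local model if the cloud is down."""
    if mode in SENSITIVE_MODES:
        return Route(backend="local", mask_pii=False)
    if cloud_available:
        return Route(backend="cloud", mask_pii=True)
    return Route(backend="local", mask_pii=False)
```

Keeping the decision in one small function makes it easy to test and to show to an auditor.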
This solution has one cost: local models are less advanced than large cloud models. The difference is acceptable for classification, extraction, and semantic search. It’s more noticeable for complex reasoning. The answer to "is self-hosting enough?" depends on the task scope — discussed further in the article about company knowledge-based GPT.
RODO, AI Act, and DPIA: what must be documented
Implementing AI that processes personal data requires several legal-formal steps, regardless of the technical masking quality.
Data Protection Impact Assessment (DPIA). If processing is systematic, large-scale, or involves special categories of data (health, racial or ethnic origin, biometric data), a DPIA is mandatory under Art. 35 RODO. Automated processing of customer correspondence containing personal data usually qualifies.
Processing activities register. New AI flows must be included in the register: purpose, legal basis, data categories, recipients, retention period, security measures.
AI Act (obligations phasing in from 2025 onward). Systems that classify individuals, assess credibility, or make decisions affecting individual rights are treated as high-risk systems. They require human oversight, technical documentation, and registration in the EU database. Auxiliary systems (suggestions for a consultant that a human approves) have lower requirements. Detailed requirements are discussed in the article AI Act and RODO 2026.
Data Processing Agreement (DPA). If you use a cloud model, even with PII masking, you formally entrust processing to the provider. A DPA with the provider is required by RODO Art. 28.
Pitfalls: what PII masking doesn’t solve
PII masking is a necessary foundation, not a complete security solution. Three areas require separate solutions:
Inference attacks. Even masked data can allow re-identification if the surrounding context is unique enough. A document describing [PERSON_1], director of a company in a small town in Silesia employing 12 people, may be identifiable without a name. Mitigation: generalizing context in logs, limiting granularity.
Prompt injection. Malicious instructions hidden in input data can attempt to extract the token-map or bypass masking. The solution is guardrails operating before and after the model, described in the article AI agent security.
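Two simple checks around the model call illustrate what "before and after" can mean for the masking layer specifically. They are a hedged sketch with hypothetical names (`pre_send_check`, `post_receive_check`) and cover only token hygiene; they do not replace dedicated prompt-injection guardrails.

```python
import re

TOKEN_RE = re.compile(r"\[[A-Z]+_\d+\]")


def pre_send_check(masked_prompt: str, token_map: dict[str, str]) -> list[str]:
    """Before the model: list original values that still appear in the
    outgoing prompt, which would mean masking was bypassed somewhere."""
    return [value for value in token_map.values() if value in masked_prompt]


def post_receive_check(response: str, token_map: dict[str, str]) -> list[str]:
    """After the model: list token-shaped strings that were never issued for
    this request, so the de-tokenizer does not act on invented tokens."""
    return [t for t in TOKEN_RE.findall(response) if t not in token_map]
```

A non-empty result from either function is a reason to stop the request or send it to manual review rather than to continue silently.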
Model memorization. If your data is used to train the provider’s model, there’s a risk the model "memorized" fragments of your documents. For highly sensitive data, the only sure protection is disabling training options (opt-out in the provider’s API) or self-hosting. Check the provider’s policy before implementation.
FAQ
Is PII masking required by RODO?
RODO does not explicitly mandate masking, but Art. 5(1)(c) (data minimization) and Art. 25 (privacy by design) together require that only necessary data be processed. Sending full personal data to a cloud model when the task only requires intent classification violates minimization. PII masking is the simplest technique to meet this requirement. For special categories of data (health, biometrics), a DPIA is also required.
Which libraries are used for PII detection in Polish?
The most popular options are Microsoft Presidio (open source, extensible with Polish NER and regex patterns), spaCy with the pl_core_news_lg model (good for recognizing proper names and NER), and custom regex rules for deterministic identifiers (PESEL, NIP, IBAN account number). In practice, a combination is used: regex for identifiers with known patterns + NER for names and place names. None of these libraries are perfect, so it’s worth verifying random samples after masking.
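For deterministic identifiers, a regex combined with a checksum keeps false positives down. The sketch below does this for NIP using only the standard library; the pattern accepts the plain 10-digit form and one common hyphenated layout, which is an assumption about your input, and `find_nips` is a hypothetical helper name.

```python
import re

NIP_WEIGHTS = (6, 5, 7, 2, 3, 4, 5, 6, 7)

# Matches 1234563218 and 123-456-32-18; other layouts would need more patterns.
NIP_RE = re.compile(r"\b\d{3}-?\d{3}-?\d{2}-?\d{2}\b")


def is_valid_nip(candidate: str) -> bool:
    """Check the NIP control digit: weighted sum of the first 9 digits mod 11."""
    digits = candidate.replace("-", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    control = sum(int(d) * w for d, w in zip(digits, NIP_WEIGHTS)) % 11
    # A remainder of 10 can never equal a single digit, so it is rejected here.
    return control == int(digits[9])


def find_nips(text: str) -> list[str]:
    """Return substrings that look like a NIP and pass the checksum."""
    return [m.group(0) for m in NIP_RE.finditer(text) if is_valid_nip(m.group(0))]
```

The same regex-plus-checksum structure applies to PESEL and IBAN, while names and place names still need an NER model.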
What about data in PDF files and images sent to vision models?
PDFs and images are more challenging because PII can be embedded in scans, logos, footers, or handwritten signatures. An OCR (optical character recognition) layer should first extract the text, and only then should the PII detector process it before anything is sent to the model. For highly sensitive documents (contracts, medical records), a self-hosted vision model operating without external network access is safer.
How does PII masking affect model response quality?
With well-designed masking, the impact is minimal. Intent classification, structure extraction, semantic search, and response generation generally work as well with tokens as with real data because the model relies on structure and context, not on the specific values. Problems arise when business logic needs the values themselves (e.g., calculating age from PESEL). In that case, the calculation should run locally first and masking should be applied afterwards.
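The age example can be sketched as follows: the value is derived locally from the real PESEL, and only the derived number plus a token reach the prompt. The function names and the prompt wording are hypothetical, and the code assumes the PESEL has already passed checksum validation.

```python
from datetime import date

# PESEL encodes the century as an offset added to the month number.
CENTURY_BY_MONTH_OFFSET = {0: 1900, 20: 2000, 40: 2100, 60: 2200, 80: 1800}


def birth_date_from_pesel(pesel: str) -> date:
    """Decode the birth date from the first six digits of a PESEL."""
    yy, mm, dd = int(pesel[0:2]), int(pesel[2:4]), int(pesel[4:6])
    offset = (mm - 1) // 20 * 20
    return date(CENTURY_BY_MONTH_OFFSET[offset] + yy, mm - offset, dd)


def age_on(pesel: str, today: date) -> int:
    """Full years completed on the given day."""
    born = birth_date_from_pesel(pesel)
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))


def prepare_prompt(pesel: str, today: date) -> str:
    """Compute first, mask second: the raw PESEL never enters the prompt."""
    age = age_on(pesel, today)  # computed locally from the real value
    return f"Customer [PESEL_1] is {age} years old. Check eligibility."
```

Passing `today` explicitly instead of calling `date.today()` inside keeps the calculation deterministic and easy to test.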
How to implement PII masking in an existing AI system without rebuilding?
The fastest path is middleware in the LLM router layer: every model call passes through a PII pre-processor and a de-tokenizing post-processor. The application doesn’t need to know about masking. Implementing such a layer usually takes from a few days to a few weeks, depending on flow complexity. Scope and cost can be estimated with the ROI calculator. If you’re just exploring possibilities, start with the readiness assessment.