Embeddings for Polish: How to Choose a Model for RAG

A professional services company implemented RAG on 4,000 contracts and internal procedures. Initial tests with a popular cloud model yielded a recall@5 of 58%. After switching to a locally hosted BGE-M3, recall rose to 79%. The difference came down to one factor: Polish inflection. The words "zamówienia" (order, genitive) and "zamówienie" (order, nominative) look like two different tokens to a simple model, but land as two close points in vector space for BGE-M3.

Why Polish is a Challenge for Embeddings

Polish has rich inflection: nouns decline through 7 cases, and verbs conjugate for person, number, and gender. A single concept appears in documents as "dostawca" (supplier, nominative), "dostawcy" (supplier, genitive), "dostawcą" (supplier, instrumental), "dostawcy" (suppliers, nominative pl.). Models trained primarily on English tend to treat each form as a separate token and learn only the forms they encountered often enough in the training corpus.

Diacritics add another layer. Users often search for "zazolc" instead of "zażółć" or "wez" instead of "weź". A good model for Polish should place both forms close together in vector space, not treat them as different words.

A third issue is mixed-language documents. Contracts, ISO procedures, and business correspondence contain English terms (compliance, vendor, deliverable), abbreviations (NDA, SLA, KPI), and Polish sentences. A Polish monolingual model may not recognize these terms in an English context. A multilingual model handles this better because it has seen both languages in the same sentence.
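
The diacritics issue described above can be tested on your own data by generating a diacritic-free variant of each query and checking whether retrieval results stay stable. The following is a minimal sketch using only the Python standard library; the function names `fold_diacritics` and `query_variants` are hypothetical helpers, not part of any embedding library. Note that "ł" has no Unicode decomposition, so it needs an explicit mapping.

```python
import unicodedata

# Polish "ł"/"Ł" do not decompose into base letter + combining mark,
# so they are mapped explicitly before normalization.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def fold_diacritics(text: str) -> str:
    """Return text with Polish diacritics replaced by ASCII base letters."""
    decomposed = unicodedata.normalize("NFD", text.translate(_NO_DECOMPOSITION))
    # Drop the combining marks (category Mn) left after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def query_variants(query: str) -> list[str]:
    """Original query plus its diacritic-free form, without duplicates."""
    folded = fold_diacritics(query)
    return [query] if folded == query else [query, folded]


print(query_variants("zażółć"))
print(query_variants("wez"))
```

Running both variants of every golden-set question against each candidate model shows how much recall drops when users skip diacritics.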

Criteria for Choosing an Embedding Model for Polish RAG

When selecting a model, check five parameters.

Multilingualism with a rich Slavic corpus. Not every "multilingual" model was trained on Polish in equal proportion. BGE-M3 and multilingual-e5-large have documented training corpora that cover European languages. Models based on mBERT or older versions of LaBSE tend to perform worse with Polish inflection.

Vector dimension and memory. Higher dimensions (1536 vs 768) don’t always yield better results but always require more RAM in the vector database. For 100,000 chunks with 1024 dimensions, an HNSW index occupies about 400-500 MB. With 1536 dimensions, it’s 600-800 MB.

Model context length. Embedding converts an entire chunk into one vector. If the model supports only 512 tokens (about 350 words) and your documents are contract pages, you lose information from the latter part of the chunk. BGE-M3 accepts inputs of up to 8192 tokens, though a serving setup may be configured with a lower limit, so verify the effective maximum. E5-large: 512. For long documents, the chunking strategy matters.

Self-hosting and GDPR. If you process contracts, HR data, or business correspondence, self-hosting is the default choice. A local model (via Ollama or a direct ONNX server) doesn’t send document chunks to external APIs. This is less a preference than a compliance matter: an external embedding API that receives personal data acts as a processor, which requires a data processing agreement under Article 28 GDPR.

Indexing speed. For 50,000 chunks, a model running on a server CPU (e.g., BGE-M3 via Ollama) takes 40-90 minutes for initial indexing. A cloud API model: 5-15 minutes. For incremental reindexing the difference is smaller, but it is still worth accounting for in the architecture.
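
The memory figures quoted for the vector-dimension criterion can be approximated with simple arithmetic. The sketch below assumes float32 vectors (4 bytes per value) and an HNSW graph with roughly 2·M neighbour links per vector on the base layer, with M = 16; both are illustrative assumptions, and real vector databases add their own overhead for upper layers, payloads, and metadata.

```python
def vector_index_mb(num_vectors: int, dim: int, bytes_per_value: int = 4,
                    hnsw_m: int = 16, bytes_per_link: int = 4) -> float:
    """Rough index size in MB: raw vectors plus base-layer HNSW links."""
    raw_vectors = num_vectors * dim * bytes_per_value
    # Assumption: about 2*M neighbour links per vector on the base layer;
    # upper layers and payload storage are ignored.
    graph_links = num_vectors * 2 * hnsw_m * bytes_per_link
    return (raw_vectors + graph_links) / 1_000_000


for dim in (768, 1024, 1536):
    print(dim, round(vector_index_mb(100_000, dim)), "MB")
```

Under these assumptions the raw vectors dominate: 100,000 vectors of 1024 dimensions come to about 410 MB before graph overhead, which is in line with the 400-500 MB range given above.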

Model Comparison: Multilingual vs. Specialized

Criterion	BGE-M3 (multilingual, local)	multilingual-e5-large (multilingual, local)	text-embedding-3-small (multilingual, cloud)	PL-only models (sdadas/Polish-RoBERTa)
Polish inflection	very good	good	good	very good
Mixed PL/EN documents	very good	good	very good	poor
Vector dimension	1024	1024	1536	768
Max context length	8192	512	8191	512
Self-hosting	yes (Ollama/ONNX)	yes (Ollama/HF)	no (OpenAI API)	yes (HF)
GDPR (no data egress)	yes	yes	no (data to cloud)	yes
Operational cost	server resources	server resources	pay per token	server resources
Maturity for PL	high	medium	high	high, but PL-only

Practical conclusions: For GDPR-compliant projects with mixed-language documents, locally hosted BGE-M3 is the first choice. For projects without personal data and needing a quick start, text-embedding-3-small via API shortens deployment by 2-4 weeks. PL-only models are worth considering only for purely Polish-language indexes without English terminology.

How to Evaluate a Model on Your Own Data

Evaluation on public benchmarks (MTEB, BEIR) says little about how a model will perform on your contracts or ISO procedures. You need your own golden set.

Build a set of 50-100 pairs: a question representative of real user queries plus a list of 3-5 documents that should appear in the results. Collect questions from actual system users rather than inventing them at your desk.

Then measure two metrics:

Recall@5: the fraction of the expected documents that appear in the top 5 results. A good result for corporate documents: 70-80%. Below 60%, the model isn’t suitable without hybrid BM25 support.

MRR (Mean Reciprocal Rank): how high in the results list the first relevant document appears, averaged over all questions as 1/rank. The higher the MRR, the less context needs to be sent to the generative model, which reduces token costs.

It’s worth evaluating two or three models in parallel: index the same set of chunks, query with the same set of questions, compare numbers. Tools for this: RAGAS (open source, Python) or a custom script calculating recall and MRR in 50 lines of code.
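
A custom evaluation script of the kind mentioned above might look like the following minimal sketch. It assumes a golden set given as a list of `(question, expected_ids)` pairs and a `search` callable that returns a ranked list of document IDs; both shapes are assumptions made for illustration, so adapt them to whatever your vector database client returns.

```python
from statistics import mean


def recall_at_k(retrieved: list[str], expected: set[str], k: int = 5) -> float:
    """Fraction of expected documents found in the top-k results."""
    if not expected:
        return 0.0
    return len(set(retrieved[:k]) & expected) / len(expected)


def reciprocal_rank(retrieved: list[str], expected: set[str]) -> float:
    """1/rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def evaluate(golden_set, search, k: int = 5) -> dict[str, float]:
    """golden_set: list of (question, expected_ids); search: question -> ranked ids."""
    recalls, reciprocal_ranks = [], []
    for question, expected in golden_set:
        ranked = search(question)
        recalls.append(recall_at_k(ranked, expected, k))
        reciprocal_ranks.append(reciprocal_rank(ranked, expected))
    return {"recall@k": mean(recalls), "mrr": mean(reciprocal_ranks)}


# Toy data standing in for a real index (hypothetical IDs).
golden = [("q1", {"a", "b"}), ("q2", {"c"})]
fake_results = {"q1": ["a", "x", "y", "z", "w", "b"], "q2": ["x", "c"]}
print(evaluate(golden, fake_results.get))
```

To compare models, build one `search` function per candidate over the same chunks and run `evaluate` with the same golden set for each.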

We discuss this evaluation pattern in detail in the article on RAG evaluation.

Practical Pipeline: From Document to Search

After selecting a model, the pipeline looks like this.

Document parsing: PDF, DOCX, emails via OCR or native parser. Remove headers/footers. For contracts, extract numbered sections as separate chunks.

Chunking: 300-600 tokens per chunk with 10-15% overlap. For long contract clauses, 20% overlap reduces the risk of losing context at chunk boundaries. Details in the article on preparing data for AI.

Embeddings: pass chunks to the model (local or via API), save vectors in a vector database (Qdrant, pgvector).

Search: for a user query, compute the query embedding, perform ANN search, optionally combine with BM25 (hybrid search), filter through a reranker.

Answer generation: pass the top 3-5 chunks as context to the LLM. Verify if the answer is grounded in the context (faithfulness).
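
The chunking stage of the pipeline above can be sketched as a sliding window with overlap. This is a minimal illustration, assuming the text has already been split into tokens; in the usage example whitespace-separated words stand in for model tokens, which is a simplification, since real token counts come from the embedding model's tokenizer. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 400,
                 overlap_ratio: float = 0.15) -> list[list[str]]:
    """Split tokens into windows of chunk_size that overlap by overlap_ratio."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be positive and 0 <= overlap_ratio < 1")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace words as a stand-in for model tokens (assumption).
words = ("Dostawca zobowiązuje się do realizacji zamówienia " * 200).split()
chunks = chunk_tokens(words, chunk_size=400, overlap_ratio=0.15)
print(len(chunks), [len(c) for c in chunks])
```

For long contract clauses, raising `overlap_ratio` to 0.2 corresponds to the 20% overlap suggested in the chunking step.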

For a full discussion of semantic search in a company, see the article Semantic Search and Embeddings in Business.
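
The hybrid option in the search stage needs a way to merge the ANN ranking with the BM25 ranking. The article does not prescribe a fusion method; the sketch below uses reciprocal rank fusion (RRF), one common choice, with the conventional constant k = 60. It works on ranked lists of document IDs only, so it is independent of how the two scores are scaled. The document IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical ANN results
bm25_hits = ["b", "d", "a"]    # hypothetical BM25 results
print(reciprocal_rank_fusion([dense_hits, bm25_hits]))
```

Documents found by both retrievers rise to the top, and the fused list can then be passed to the reranker.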

FAQ

Does BGE-M3 really handle Polish inflection better than English models?

BGE-M3 was trained on a multilingual corpus including Slavic languages, such as Polish. In tests on Polish legal and technical documents, it achieves recall@5 10-20 percentage points higher than models based solely on English or older mBERT. The difference is particularly noticeable for queries in case forms (genitive, instrumental), which users write naturally, while the document contains the nominative form.

When should I choose a monolingual (PL-only) model instead of a multilingual one?

When your index contains only Polish documents without English terminology, and you need the highest precision for simple factual queries. PL-only models (e.g., sdadas/Polish-RoBERTa-base-finetuned-polish-question-answering) may yield slightly better results with pure Polish text. For mixed PL/EN documents or correspondence with English abbreviations, the advantage quickly disappears.

How long does initial indexing take with BGE-M3 locally?

On a typical server with 16 GB RAM and CPU only (no GPU), indexing 10,000 chunks of 400 tokens takes 15-30 minutes. For 50,000 chunks: 60-90 minutes. Embeddings are computed once per chunk at indexing time; subsequent queries are fast (tens of milliseconds). If faster initial indexing is critical, consider a GPU server or a cloud API (for data not subject to GDPR).

Can I use different embedding models for different document collections?

Yes, but each collection must consistently use the same model during indexing and querying. Mixing models in one index is not allowed: vectors from different models are not geometrically comparable. If you want to change the model, you must reindex the entire collection. Therefore, the model decision should precede building a production index.
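
One way to enforce the one-model-per-collection rule is to record the model next to each collection and check it at query time. The following is a sketch under assumptions: `CollectionSpec` and `check_query_embedding` are hypothetical names, and the model identifiers are plain labels rather than calls into any real client library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionSpec:
    """Hypothetical record of which model built a collection."""
    name: str
    model: str
    dim: int


def check_query_embedding(spec: CollectionSpec, model: str,
                          vector: list[float]) -> None:
    """Raise ValueError if the query embedding cannot be compared with the index."""
    if model != spec.model:
        raise ValueError(
            f"collection {spec.name!r} was indexed with {spec.model!r}, not {model!r}"
        )
    if len(vector) != spec.dim:
        raise ValueError(f"expected {spec.dim} dimensions, got {len(vector)}")


contracts = CollectionSpec(name="contracts", model="bge-m3", dim=1024)
check_query_embedding(contracts, "bge-m3", [0.0] * 1024)  # passes silently
```

Checking the model name and not just the dimension matters: BGE-M3 and multilingual-e5-large both produce 1024-dimensional vectors, so a dimension check alone would not catch a mix-up between them.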

How does GDPR affect the choice of an embedding model?

If your documents contain personal data (names, PESEL numbers, addresses in correspondence), sending them to an external embedding API requires a data processing agreement with the provider (Article 28 GDPR) and a DPIA for high-risk scenarios. Self-hosting the model locally eliminates this issue: document chunks never leave your infrastructure. Before deployment, also check your organization’s AI readiness assessment.
