A professional services company implemented RAG on 4,000 contracts and internal procedures. Initial tests with a popular cloud model yielded a recall@5 of 58%. After switching to a locally hosted BGE-M3, recall@5 rose to 79%. The difference came down to one thing: Polish inflection. The words "zamówienia" (order, genitive singular) and "zamówienie" (order, nominative) are two different tokens for a simple model, but two close points in vector space for BGE-M3.
Why Polish is a Challenge for Embeddings
Polish has rich inflection: nouns decline through 7 cases, and verbs conjugate for person, number, and gender. A single concept appears in documents as "dostawca" (supplier, nominative), "dostawcy" (supplier, genitive), "dostawcą" (supplier, instrumental), "dostawcy" (suppliers, nominative pl.). Models trained primarily on English treat each form as a separate token and learn only the forms they encountered frequently enough in the corpus.
Diacritics add another layer. Users often type queries without them, for example "wez" instead of "weź". A good model for Polish should place both spellings close together in vector space rather than treat them as different words.
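One way to check this on your own data is to query with both spellings and compare the results. The sketch below only generates the diacritic-free variants; the function names are illustrative, and the explicit mapping for "ł" is needed because that letter has no Unicode decomposition.

```python
import unicodedata

# "ł"/"Ł" have no canonical decomposition, so they are mapped explicitly.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def strip_diacritics(text: str) -> str:
    """Return text with Polish diacritics removed, e.g. 'weź' -> 'wez'."""
    decomposed = unicodedata.normalize("NFD", text.translate(_NO_DECOMPOSITION))
    # Drop the combining marks (acute, ogonek, dot above) left by NFD.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def with_plain_variants(queries: list[str]) -> list[tuple[str, str]]:
    """Pair each query with its diacritic-free spelling, skipping unchanged ones."""
    pairs = []
    for query in queries:
        plain = strip_diacritics(query)
        if plain != query:
            pairs.append((query, plain))
    return pairs
```

Running both members of each pair through the same search and comparing the top results gives a quick signal of how sensitive a candidate model is to missing diacritics.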
A third issue is mixed-language documents. Contracts, ISO procedures, and business correspondence mix Polish sentences with English terms (compliance, vendor, deliverable) and abbreviations (NDA, SLA, KPI). A monolingual Polish model may not represent these English terms well. A multilingual model handles them better because it has seen both languages in the same sentence.
Criteria for Choosing an Embedding Model for Polish RAG
When selecting a model, check five parameters.
Multilingualism with a rich Slavic corpus. Not every "multilingual" model was trained on Polish in equal proportion. BGE-M3 and multilingual-e5-large have a documented European corpus. Models based on mBERT or older versions of LaBSE perform worse with Polish inflection.
Vector dimension and memory. Higher dimensions (1536 vs 768) don’t always yield better results but always require more RAM in the vector database. For 100,000 chunks with 1024 dimensions, an HNSW index occupies about 400-500 MB. With 1536 dimensions, it’s 600-800 MB.
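The figures above can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes float32 vectors, an HNSW parameter M of 16, up to 2*M neighbour ids of 4 bytes each on the base layer, and ignores payloads and upper graph layers. These are assumptions for illustration, not the layout of any particular vector database.

```python
def estimate_hnsw_megabytes(num_vectors: int, dim: int, m: int = 16,
                            bytes_per_value: int = 4) -> float:
    """Rough lower bound for an in-memory HNSW index, in megabytes (10^6 bytes)."""
    # Raw vector storage: one float32 per dimension (assumed).
    vector_bytes = num_vectors * dim * bytes_per_value
    # Base-layer graph links: up to 2*M neighbour ids per vector, 4 bytes each (assumed).
    link_bytes = num_vectors * 2 * m * 4
    return (vector_bytes + link_bytes) / 1_000_000


small = estimate_hnsw_megabytes(100_000, 1024)
large = estimate_hnsw_megabytes(100_000, 1536)
```

Under these assumptions the estimate is roughly 422 MB for 1024 dimensions and 627 MB for 1536, which sits at the lower end of the ranges quoted above; real deployments add payload and bookkeeping overhead on top.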
Model context length. An embedding model converts an entire chunk into one vector. If the model accepts only 512 tokens (about 350 words) and your chunks are whole contract pages, everything past the limit is truncated and lost. BGE-M3 accepts up to 8192 tokens, though deployments are often configured with a shorter limit such as 512; multilingual-e5-large is limited to 512. For long documents, the chunking strategy therefore matters.
Self-hosting and RODO. RODO is the Polish name for the GDPR. If you process contracts, HR data, or business correspondence, self-hosting is the default choice. A local model (via Ollama or a direct ONNX server) doesn’t send document chunks to external APIs. This is a compliance matter rather than a preference: sending personal data to an external API requires a data processing agreement under Article 28.
Indexing speed. For 50,000 chunks, a model running on a server CPU (e.g., BGE-M3 via Ollama) takes 40-90 minutes for initial indexing, while a cloud API model takes 5-15 minutes. For incremental reindexing the difference is smaller, but it is still worth factoring into the architecture.
Model Comparison: Multilingual vs. Specialized
| Criterion | BGE-M3 (multilingual, local) | multilingual-e5-large (multilingual, local) | text-embedding-3-small (multilingual, cloud) | PL-only models (sdadas/Polish-RoBERTa) |
|---|---|---|---|---|
| Polish inflection | very good | good | good | very good |
| Mixed PL/EN documents | very good | good | very good | poor |
| Vector dimension | 1024 | 1024 | 1536 | 768 |
| Max context length (tokens) | 8192 | 512 | 8191 | 512 |
| Self-hosting | yes (Ollama/ONNX) | yes (Ollama/HF) | no (OpenAI API) | yes (HF) |
| RODO (no data egress) | yes | yes | no (data to cloud) | yes |
| Operational cost | server resources | server resources | payment per tokens | server resources |
| Maturity for PL | high | medium | high | high, but PL-only |
Practical conclusions: For RODO-compliant projects with mixed-language documents, locally hosted BGE-M3 is the first choice. For projects without personal data and needing a quick start, text-embedding-3-small via API shortens deployment by 2-4 weeks. PL-only models are worth considering only for purely Polish-language indexes without English terminology.
How to Evaluate a Model on Your Own Data
Evaluation on public benchmarks (MTEB, BEIR) says little about how a model will perform on your contracts or ISO procedures. You need your own golden set.
Build a set of 50-100 pairs, each consisting of a question representative of real user queries and a list of 3-5 documents that should appear in the results. Collect the questions from actual system users; don’t invent them at your desk.
Then measure two metrics:
Recall@5: the share of expected documents that appear in the top 5 results. A good result for corporate documents is 70-80%. Below 60%, the model isn’t suitable without hybrid BM25 support.
MRR (Mean Reciprocal Rank): how high in the results list the first relevant document appears. The higher the MRR, the less context needs to be sent to the generative model, which reduces token costs.
It’s worth evaluating two or three models in parallel: index the same set of chunks, query with the same set of questions, compare numbers. Tools for this: RAGAS (open source, Python) or a custom script calculating recall and MRR in 50 lines of code.
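A custom script of this kind could look roughly like the sketch below. It assumes the golden set is a list of (question, expected document ids) pairs and that `search` is a hypothetical callable you supply, returning ranked document ids for a question; neither name comes from a specific library.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], expected: set[str], k: int = 5) -> float:
    """Share of expected documents found in the top k results."""
    if not expected:
        return 0.0
    hits = len(set(retrieved[:k]) & expected)
    return hits / len(expected)


def reciprocal_rank(retrieved: list[str], expected: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def evaluate(golden_set: list[tuple[str, set[str]]],
             search: Callable[[str], list[str]],
             k: int = 5) -> dict[str, float]:
    """Average recall@k and MRR over the whole golden set."""
    if not golden_set:
        return {"recall@k": 0.0, "mrr": 0.0}
    recalls, ranks = [], []
    for question, expected in golden_set:
        retrieved = search(question)  # ranked document ids, best first
        recalls.append(recall_at_k(retrieved, expected, k))
        ranks.append(reciprocal_rank(retrieved, expected))
    n = len(golden_set)
    return {"recall@k": sum(recalls) / n, "mrr": sum(ranks) / n}
```

To compare models, run `evaluate` once per candidate with the same golden set and a `search` function backed by that model's index.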
We discuss this evaluation pattern in detail in the article on RAG evaluation.
Practical Pipeline: From Document to Search
After selecting a model, the pipeline looks like this.
Document parsing: PDF, DOCX, emails via OCR or native parser. Remove headers/footers. For contracts, extract numbered sections as separate chunks.
Chunking: 300-600 tokens per chunk with 10-15% overlap. For long contract clauses, 20% overlap reduces the risk of losing context at chunk boundaries. Details in the article on preparing data for AI.
Embeddings: pass chunks to the model (local or via API), save vectors in a vector database (Qdrant, pgvector).
Search: for a user query, compute the query embedding, perform ANN search, optionally combine with BM25 (hybrid search), filter through a reranker.
Answer generation: pass the top 3-5 chunks as context to the LLM. Verify if the answer is grounded in the context (faithfulness).
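The chunking step above can be sketched as a sliding window over tokens. This is a minimal illustration that takes an already tokenized document; in practice the tokens should come from the embedding model's own tokenizer, and the default of 400 tokens with 15% overlap is just one point inside the ranges mentioned above.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 400,
                 overlap_ratio: float = 0.15) -> list[list[str]]:
    """Split tokens into windows of chunk_size that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = min(round(chunk_size * overlap_ratio), chunk_size - 1)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

For contracts split into numbered sections, the same function can be applied per section so that a chunk never straddles two clauses.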
For a full discussion of semantic search in a company, see the article Semantic Search and Embeddings in Business.
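For the optional hybrid step in the search stage, the article does not prescribe how to merge the vector and BM25 result lists. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than the method used here; the constant 60 is a widely used default, and ties are broken alphabetically only to keep the output deterministic.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties resolved by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical ids from the vector search
bm25_hits = ["b", "d", "a"]    # hypothetical ids from BM25
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Because RRF uses only ranks, it does not require the cosine scores and BM25 scores to be on a comparable scale.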
FAQ
Does BGE-M3 really handle Polish inflection better than English models?
BGE-M3 was trained on a multilingual corpus including Slavic languages, such as Polish. In tests on Polish legal and technical documents, it achieves recall@5 10-20 percentage points higher than models based solely on English or older mBERT. The difference is particularly noticeable for queries in case forms (genitive, instrumental), which users write naturally, while the document contains the nominative form.
When should I choose a monolingual (PL-only) model instead of a multilingual one?
When your index contains only Polish documents without English terminology, and you need the highest precision for simple factual queries. PL-only models (e.g., sdadas/Polish-RoBERTa-base-finetuned-polish-question-answering) may yield slightly better results with pure Polish text. For mixed PL/EN documents or correspondence with English abbreviations, the advantage quickly disappears.
How long does initial indexing take with BGE-M3 locally?
On a typical server with 16 GB RAM and CPU (no GPU), indexing 10,000 chunks of 400 tokens takes 15-30 minutes. For 50,000 chunks: 60-90 minutes. Embedding recalculation happens once; subsequent queries are fast (tens of milliseconds). If faster initial indexing is critical, consider a GPU server or cloud API (for data not subject to RODO).
Can I use different embedding models for different document collections?
Yes, but each collection must use the same model for both indexing and querying. Mixing models in one index does not work: vectors from different models are not geometrically comparable. If you want to change the model, you must reindex the entire collection. The model decision should therefore precede building a production index.
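A simple guard can enforce this rule in application code. The sketch below is hypothetical: `CollectionConfig`, `ModelMismatchError`, and `check_query_model` are illustrative names, not part of any vector database client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionConfig:
    """Records which embedding model a collection was indexed with."""
    name: str
    model: str
    dim: int


class ModelMismatchError(ValueError):
    """Raised when a query is embedded with a different model than the index."""


def check_query_model(config: CollectionConfig, query_model: str, query_dim: int) -> None:
    """Fail fast instead of silently comparing vectors from different models."""
    if query_model != config.model or query_dim != config.dim:
        raise ModelMismatchError(
            f"collection '{config.name}' was indexed with {config.model} "
            f"({config.dim} dims), got {query_model} ({query_dim} dims)"
        )
```

Calling the check before every search makes a forgotten reindex show up as an explicit error rather than as quietly degraded results.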
How does RODO affect the choice of an embedding model?
If your documents contain personal data (names, PESEL numbers, addresses in correspondence), sending them to an external embedding API requires a data processing agreement with the provider (Article 28 RODO) and a DPIA for high-risk scenarios. Self-hosting the model locally eliminates this issue: document chunks never leave your infrastructure. Before deployment, also check your organization’s AI readiness assessment.