Most company knowledge bases are still searched by full-text matching: find the documents that contain these words. That works when employees know the terminology. It fails when customers phrase things differently from the documentation, when a question involves concepts described in scattered fragments, or when different departments use different names for the same process. Semantic search addresses exactly this problem, without rewriting any documents.
What is an embedding and why it works
An embedding is a representation of text as a vector of numbers. An embedding model (a neural network trained on a massive corpus) maps each sentence into a multi-dimensional space so that sentences with similar meanings land close together and sentences with different meanings land farther apart. This isn’t search by hashes or n-grams; it’s the geometry of meaning.
Practical effect: the sentence “how to cancel an order” and the sentence “procedure for withdrawing from the contract” end up in neighboring points in space, even though they share no words. A classic full-text search treats these queries as disjoint. A semantic engine treats them as near-synonyms.
Mechanism in brief:
- Each fragment of the knowledge base is converted into an embedding once (during indexing).
- The user’s query is converted into an embedding in real time.
- The engine calculates cosine similarity between the query vector and document vectors.
- Returns fragments with the highest semantic alignment, regardless of the words used.
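The ranking step can be sketched in a few lines of plain Python. This is a minimal illustration, not our production code: the three-dimensional vectors and fragment names are invented for the example, and a real engine would use an ANN index instead of scoring every fragment.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_fragments(query_vec, fragment_vecs, top_k=3):
    """Return (fragment_id, score) pairs, best match first."""
    scored = [
        (fragment_id, cosine_similarity(query_vec, vec))
        for fragment_id, vec in fragment_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors invented for illustration; real embeddings have
# hundreds or thousands of dimensions.
fragments = {
    "cancel-order": [0.9, 0.1, 0.0],
    "withdraw-from-contract": [0.8, 0.2, 0.1],
    "office-parking": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]

ranking = rank_fragments(query, fragments, top_k=2)
print(ranking)
```

The two fragments about cancelling come out on top, while the unrelated parking fragment is dropped, even though no words were compared at any point.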
Embedding models: what we run locally
The choice of embedding model directly impacts search quality, speed, and whether data leaves your infrastructure at all. Basic security rule: internal content (contracts, procedures, customer data) should be embedded locally before anything goes to an external generative model.
We use BGE-M3, run locally via Ollama. BGE-M3 produces 1024-dimensional vectors, handles multilingual content (including Polish) without translating queries, and runs on the CPU of a company server, so no content leaves your network during indexing.
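As a rough sketch of what a local embedding call looks like, the snippet below talks to Ollama's HTTP API using only the standard library. It assumes Ollama is listening on its default local port, that the model has been pulled under the tag `bge-m3`, and that the `/api/embed` endpoint is available in your Ollama version; the helper names are ours, chosen for illustration.

```python
import json
import urllib.request

# Assumption: default address of a locally running Ollama instance.
OLLAMA_URL = "http://localhost:11434/api/embed"

def build_embed_request(texts, model="bge-m3"):
    """Build (but do not send) the HTTP request for a batch of texts."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts, model="bge-m3", timeout=60):
    """Send the request and return one vector per input text."""
    request = build_embed_request(texts, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["embeddings"]

# Building the request needs no network; calling embed(...) does.
request = build_embed_request(["how to cancel an order"])
print(request.full_url, request.get_method())
```

Because the request goes to `localhost`, the text being embedded stays on the machine that runs the model.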
| Model | Dimensions | Languages | Hosting | Data leaves your network |
|---|---|---|---|---|
| BGE-M3 (local) | 1024 | multilingual (PL, EN, DE...) | own server | no |
| text-embedding-3-small | 1536 | multilingual | cloud | yes |
| multilingual-e5-large | 1024 | multilingual | own server | no |
| nomic-embed-text | 768 | mainly EN | own server / cloud | depends on hosting |
When data is confidential or covered by GDPR (RODO), we choose local hosting. For public content (e.g., product catalogs without personal data), a cloud endpoint is acceptable, provided PII masking is applied before sending.
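To make "PII masking before sending" concrete, here is a deliberately simple sketch that replaces email addresses and nine-digit phone numbers with placeholders. The patterns and placeholder tokens are assumptions made for this example; real masking also has to cover names, addresses, national ID numbers, and other identifiers, which regular expressions alone do not handle reliably.

```python
import re

# Illustrative patterns only; they are not exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Nine digits in groups of three, with an optional country prefix.
PHONE = re.compile(r"(?<!\d)(?:\+?\d{1,3}[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\d)")

def mask_pii(text):
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

masked = mask_pii("Write to jan.kowalski@example.com or call 600 700 800.")
print(masked)
```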
Vector database: where embeddings live
Vectors are stored in a vector database. In our stack, that is Qdrant running on your own server (local storage, no outgoing connections). Qdrant supports:
- ANN search (approximate nearest neighbor) with HNSW index—millions of vectors in tens of milliseconds,
- payload filtering—search semantically only in documents of the “HR procedures” category or only active products,
- named vectors—the same document has one embedding for search and another for reranking.
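Payload filtering is easiest to see in the request itself. The sketch below only builds the JSON body for a filtered nearest-neighbor query; it assumes the REST query endpoint of recent Qdrant versions (`POST /collections/<name>/points/query`) and a payload field called `category`, which is a hypothetical name for this example.

```python
import json

def build_filtered_query(query_vector, category, limit=5):
    """Body for a nearest-neighbor query restricted to one category."""
    return {
        "query": query_vector,
        "filter": {
            # Only points whose payload field "category" equals the
            # given value are considered during the vector search.
            "must": [
                {"key": "category", "match": {"value": category}}
            ]
        },
        "limit": limit,
        "with_payload": True,
    }

# Short toy vector; a BGE-M3 query vector would have 1024 values.
body = build_filtered_query([0.12, -0.03, 0.44], "HR procedures")
print(json.dumps(body, indent=2))
```

The filter is applied as part of the search rather than afterwards, so a narrow category still returns up to `limit` results.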
Alternatives include pgvector (Postgres extension—good if you want one database for everything) and Weaviate (full platform with its own schema). We choose Qdrant for projects requiring high query throughput and data isolation at the collection level.
Hybrid search: when semantic alone isn’t enough
Hybrid search runs full-text (BM25) and semantic retrieval side by side, then fuses and reranks the results. This is key for:
- queries about codes or numbers—“act 2025/0048” is precisely matched by BM25, not semantics,
- queries about proper names—embedding models struggle with unique product names, SKUs, or surnames,
- short single-word queries—too little context for semantics to provide an advantage.
In practice, a hybrid engine runs BM25 and ANN searches in parallel and merges the two result lists with Reciprocal Rank Fusion (RRF). A reranker (cross-encoder) then rescores the merged candidates using the full query and fragment text together. The result is higher quality across varied query types than either mechanism delivers alone.
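Reciprocal Rank Fusion itself is small enough to show in full. In this sketch each document receives 1 / (k + rank) from every list it appears in, and the sums decide the final order. The constant k = 60 is a commonly used default, and the document IDs are invented for the example; the reranking step is not shown.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical result lists from the two retrievers.
bm25_hits = ["act-2025-0048", "returns-policy", "hr-leave"]
ann_hits = ["returns-policy", "withdrawal-procedure", "act-2025-0048"]

fused = reciprocal_rank_fusion([bm25_hits, ann_hits])
print(fused)
```

RRF uses only rank positions, so BM25 scores and cosine similarities never need to be put on a common scale.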
More on this pattern in the article RAG vs fine-tuning—hybrid is one reason RAG scales so well on diverse knowledge bases.
RAG: embeddings as the foundation of a corporate assistant
RAG (retrieval-augmented generation) is an architecture in which the generative model doesn’t answer “from memory”: it first receives fragments retrieved from your knowledge base, then constructs a response with source citations. Embeddings and the vector database are the retrieval step at the heart of RAG.
Practical pipeline:
- The document is split into fragments (chunking—typically 256–512 tokens with overlap).
- Each fragment is embedded and stored in Qdrant with metadata (category, date, department).
- The user’s query is embedded and searched in Qdrant.
- Top-k fragments are passed as context to the LLM via a model router.
- The model answers solely based on these fragments, providing citations.
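The first step of the pipeline, chunking with overlap, can be sketched as a sliding window. This example assumes the text is already tokenized into a list; the sizes of 384 and 64 tokens are illustrative values, and the `w0`, `w1`, ... tokens stand in for real tokenizer output.

```python
def chunk_tokens(tokens, size=384, overlap=64):
    """Split a token list into windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the document.
        if start + size >= len(tokens):
            break
    return chunks

# Placeholder tokens standing in for a tokenized document.
tokens = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of embedding some tokens twice.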
If the search finds no fragment above the similarity threshold, the system says “I don’t know” and escalates to a human. This human handoff is by design, not a flaw in the architecture.
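A minimal sketch of that decision follows. The threshold of 0.6 is a hypothetical value that would need tuning per model and dataset, and the result format (a dictionary with `action` and `context`) is invented for the example.

```python
def answer_or_handoff(hits, threshold=0.6, top_k=3):
    """Decide whether retrieved fragments are good enough to answer.

    `hits` is a list of (fragment_text, similarity) pairs.
    """
    relevant = sorted(
        (hit for hit in hits if hit[1] >= threshold),
        key=lambda hit: hit[1],
        reverse=True,
    )[:top_k]
    if not relevant:
        # Nothing is close enough: escalate instead of guessing.
        return {"action": "handoff", "context": []}
    return {"action": "answer", "context": [text for text, _ in relevant]}

weak = answer_or_handoff([("Parking rules", 0.31)])
strong = answer_or_handoff([("Returns policy", 0.82), ("Parking rules", 0.31)])
print(weak["action"], strong["action"])
```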
We describe the full pipeline in the article where to start AI implementation—semantic search is one of the fastest-ROI implementations because it works on existing company knowledge.
When to implement semantic search
Not every knowledge base needs embeddings. Implement semantic search when:
- Users don’t find the right documents even though they exist—because they ask differently than the documentation is written.
- The company has over a few hundred documents and different departments name the same concepts differently.
- You want to build an assistant that answers questions based on internal data (RAG).
- Data spans multiple languages or customers write informally (customer service, e-commerce).
Don’t implement semantic search as a first step if the knowledge base doesn’t exist or is chaotic. Embeddings reflect the quality of input data. Organizing a narrow slice for a specific process is faster and yields better results than embedding inconsistent files.
Check organizational readiness with the AI readiness assessment—one dimension directly concerns the state of the knowledge base.
Costs and implementation time
Cost depends on document volume, model choice, and target architecture. Approximate parameters for a pilot project:
- Indexing up to a few thousand fragments on a standard CPU server takes minutes.
- BGE-M3 locally: no licensing cost; you pay only for hardware or a VPS.
- Cloud embeddings: a few cents per million tokens (under $1 for a typical SME knowledge base).
- Qdrant self-hosted: no licensing cost (open source); you pay only for hosting.
ROI is easiest to calculate in customer service: if the semantic system resolves 30% of queries without human involvement, and an agent costs N PLN/h, you have a simple calculator. Calculate it yourself with the ROI calculator.
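The arithmetic behind such a calculator can be sketched as follows. Only the 30% resolution rate comes from the text above; the query volume, handling time, and hourly cost are placeholder figures to be replaced with your own.

```python
def monthly_savings(queries_per_month, resolved_share, minutes_per_query, agent_cost_per_hour):
    """Agent cost avoided per month, in the currency of the hourly cost."""
    resolved_queries = queries_per_month * resolved_share
    hours_saved = resolved_queries * minutes_per_query / 60
    return hours_saved * agent_cost_per_hour

# Placeholder inputs: 2000 queries a month, 6 minutes each,
# 50 PLN per agent hour, 30% resolved without a human.
savings = monthly_savings(2000, 0.30, 6, 50.0)
print(round(savings, 2))
```

With these inputs, 600 queries are resolved automatically, which corresponds to 60 agent hours a month. Comparing that figure with hosting and maintenance costs gives a first estimate of payback.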
We estimate a full project (indexing + RAG + interface + guardrails) after a data audit. A pilot on one knowledge area usually fits within weeks. Contact us via the contact form to discuss scope.
Try it live
The sandbox below runs the same semantic mechanism as our implementations: paste a document fragment and ask a question. The model answers solely based on your text, not its own memory. PII is masked before the text reaches the model, and nothing is retained.
FAQ
How does semantic search differ from full-text?
Full-text search (e.g., Elasticsearch, PostgreSQL FTS) looks for documents containing the given words or their variations. Semantic search converts the query into an embedding and searches for documents with similar meaning, regardless of the words used. In practice: a customer asking about a “complaint” lands on a procedure described as a “report,” which a classic search engine wouldn’t connect.
Do embeddings send company data to the cloud?
Only if you choose a cloud embedding model. With local BGE-M3 (Ollama), content doesn’t leave your infrastructure during indexing. Only the retrieved fragments go to the generative model as context, after our router has masked PII in them. Sensitive data can remain entirely local throughout the pipeline.
How many documents are needed for it to make sense?
Semantic search starts providing a clear advantage over full-text with just a few dozen documents when queries are linguistically diverse. Below a dozen documents, plain BM25 is simpler and sufficient. Above a few thousand fragments, chunking and metadata matter more: how you split documents affects answer quality more than the choice of embedding model.
How long does it take to implement an embedding-based RAG assistant?
A pilot on one knowledge area (e.g., customer service FAQ or HR procedures) typically takes weeks, depending on data state and volume. Inconsistent knowledge bases requiring cleanup or integration with external systems (CRM, ERP) extend the timeline. More on implementation stages in the article where to start AI implementation.
Is semantic search compliant with GDPR (RODO)?
Semantic search itself doesn’t violate GDPR. The key questions are what you index and where you store the vectors. If documents contain personal data, GDPR applies: legal basis, data minimization, right to erasure. With local hosting (Qdrant on-prem, BGE-M3 locally), data doesn’t leave your infrastructure. Legal details are covered in the article AI Act and RODO 2026.