A backend team in a SaaS company has 180,000 lines of code, 40 runbooks in Confluence, and an API reference automatically generated from Swagger. A new developer spends their first week mostly asking questions like: “where is this error handled?”, “how does authorization work for endpoint X?”, and “what does this flag do?”. Senior colleagues know the answers. The problem is that this knowledge isn’t recorded anywhere in a searchable form.
At Cashcrown, we research how RAG over repositories and technical documentation works in practical deployments. This article covers how it differs from RAG over prose, how to chunk by symbol, what hybrid search contributes, and where a developer’s judgment is still required.
Why code and technical documentation are a different challenge than prose
HR documents, procedures, or FAQs have a linear structure. A character- or token-based splitter produces sensible chunks because a sentence taken out of context is still a sentence.
Code has a symbolic structure. The function calculate_discount(order_id, user_tier) is a semantic unit. Breaking its definition in the middle leaves you with a fragment missing the signature or a fragment missing the body—each useless on its own. The same applies to classes, methods, try/except blocks, and decorators.
Technical documentation is a hybrid: a runbook contains a list of steps referencing specific commands and environment variables, while an API wiki describes endpoints with exact parameter types. Keyword-based search works significantly better here than with free-form prose. On top of that, there’s volatility: a RAG system that reindexes once a month can cite signatures that stopped existing two weeks earlier.
Symbol-based chunking, not paragraph-based
A good code chunking strategy relies on symbol boundaries, not token counts. An AST (Abstract Syntax Tree) parser extracts functions, classes, and methods as closed units. Libraries like tree-sitter support over 40 languages and report the boundaries of each syntax node as byte offsets and row/column positions, from which line numbers follow directly.
Practical pattern: chunk = one symbol with docstring, signature, and body. Mandatory metadata includes file_path, symbol_name, start_line, end_line, last_commit_sha. For technical documentation, the rule is simpler: split by section headings (H2/H3), not by token count. A runbook split by steps yields procedurally self-contained chunks.
The table below compares chunking approaches for different resource types:
| Resource Type | Chunking Strategy | Mandatory Metadata | Approximate Size |
|---|---|---|---|
| Source code (Python, TS, Go) | AST by symbol boundaries | file_path, symbol_name, start_line, end_line, last_commit_sha | 50–300 lines per symbol |
| API Reference (OpenAPI/Swagger) | One endpoint = one chunk | method, path, operationId, version | 200–600 tokens |
| Runbook / Operational Procedure | Recursive by H2/H3 headings | section_title, service_name, last_modified | 300–800 tokens |
| Internal Wiki (technical prose) | Recursive with 10–15% overlap | page_title, page_url, author, last_modified | 512–1024 tokens |
| Changelog / Commit Messages | Fixed-size or entire entry | commit_sha, author, date, branch | 128–256 tokens |
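For the runbook row, heading-based splitting might look like the sketch below. It assumes the runbook has been exported to Markdown with `##` and `###` headings (an assumption, since the export format is not stated here); SectionChunk and split_by_headings are hypothetical names.

```python
import re
from dataclasses import dataclass

# Matches H2 and H3 headings only; H1 and H4+ stay inside the current chunk.
HEADING = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")
FENCE = "`" * 3  # Markdown code fence marker


@dataclass
class SectionChunk:
    section_title: str
    service_name: str
    last_modified: str
    text: str


def split_by_headings(markdown: str, service_name: str, last_modified: str) -> list[SectionChunk]:
    """Split a Markdown runbook into one chunk per H2/H3 section."""
    chunks: list[SectionChunk] = []
    title = ""  # text before the first heading gets an empty title
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(SectionChunk(title, service_name, last_modified, text))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never headings (e.g. shell comments).
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title = match.group(2)
            body = [line]
        else:
            body.append(line)
    flush()
    return chunks
```

The heading line is kept inside the chunk text so that each retrieved section still names the step it describes.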
For more on chunking strategies in general, see the article chunking documents for RAG.
Why hybrid search is essential here
For prose documents, pure vector search performs well because queries are semantic: “how to handle a complaint?” hits the right paragraph even without exact word matches.
For code, queries often contain exact identifiers: UserRepository.findByEmail, POST /v1/orders/{id}/cancel, CASHCROWN_API_JWT_SECRET. Semantic search (embedding-based) struggles with rare tokens. The model sees UserRepository as a string without broad semantic meaning and hits generic fragments instead of the class definition.
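BM25 is a lexical ranking function, usually computed over an inverted index, that scores a chunk by how often the query terms occur in it and how rare those terms are across the corpus. For code identifiers, it works like a full-text search: it finds UserRepository.findByEmail exactly where the method is defined or called, provided identifiers are tokenized the same way at index time and at query time. Combining BM25 with vector search via hybrid search (Reciprocal Rank Fusion, RRF, or linear result weighting) improves recall for queries with exact names without degrading results for semantic queries.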
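A minimal sketch of RRF, assuming each retriever returns an ordered list of chunk IDs. The function name and the chunk IDs in the usage note are hypothetical, and k = 60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); absent chunks contribute 0.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Calling `reciprocal_rank_fusion([bm25_ids, vector_ids])` rewards chunks that both retrievers rank highly, while a chunk found by only one retriever still appears lower in the fused list. Because RRF uses ranks instead of raw scores, BM25 and cosine similarity values do not need to be normalized against each other.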
In Cashcrown’s internal tests on a dataset of 12,000 symbols from a commercial project, recall@5 for pure vector search was 61%. After adding BM25 in hybrid mode, it rose to 79%. The difference was most visible for queries containing method signatures and file paths. These are indicative numbers from one project, not a guarantee of repeatability on other codebases.
After retrieval, it’s worth adding a reranking layer to re-sort the top-k fragments for relevance to the specific query. For code, this matters because BM25 may promote fragments with lexically identical names but from a different module or namespace. The reranker evaluates the full context, not just token presence. For a comparison of vector databases that natively support hybrid search, see the article how to choose a vector database.
Index freshness: a problem that can’t be ignored
Code has a different lifecycle than an HR procedure. A leave policy changes once a year. A method signature in an active project changes every sprint.
A RAG over code that doesn’t refresh its index produces answers with deprecated APIs. A developer who gets an example using a function removed two commits ago wastes time debugging an error generated by the assistant.
Three levels of refresh we use depending on the project’s rate of change:
Incremental reindexing on commit. A GitLab/GitHub webhook (or post-receive hook) triggers reindexing only for files changed in the commit. git diff --name-only HEAD~1 HEAD provides the file list. Each file is parsed again with AST. Symbols whose content and boundaries haven’t changed remain in the index; changed ones replace previous versions, and symbols that no longer exist are removed.
Chunk versioning by commit SHA. Each chunk stores last_commit_sha in its metadata. If the branch name is stored alongside it, queries can be filtered by branch (branch: main), allowing answers about both current code and specific historical versions.
TTL for documentation. Runbooks and wikis lack webhooks. We set a TTL after which the document is automatically reindexed: 24–48 hours for actively changing documentation, 7–14 days for stable specifications.
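The incremental step can be sketched as follows. This is an illustration under assumptions: the index is assumed to store a content hash per symbol, and changed_files, content_hash, and plan_reindex are hypothetical helper names. Only the git diff --name-only invocation comes from the text above.

```python
import hashlib
import subprocess


def changed_files(repo_dir: str, old_rev: str = "HEAD~1", new_rev: str = "HEAD") -> list[str]:
    """List paths touched between two revisions (includes deleted files)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", old_rev, new_rev],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def content_hash(text: str) -> str:
    """Stable fingerprint of a symbol's source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed: dict[str, str], current: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare one file's indexed symbols with its freshly parsed symbols.

    indexed maps symbol_name -> content hash stored in the index.
    current maps symbol_name -> source text from the new parse.
    Returns (symbols to upsert, symbols to delete).
    """
    upsert = sorted(
        name for name, text in current.items()
        if indexed.get(name) != content_hash(text)
    )
    delete = sorted(name for name in indexed if name not in current)
    return upsert, delete
```

Hashing only the symbol text means a function that merely moved within the file is left untouched, so its start_line metadata can drift. If line numbers in citations must stay exact, one option is to include start_line in the hashed string.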
The assistant should disclose the document’s last indexing date in the citation. “Source: src/auth/user_service.py, line 142, indexed 2 hours ago” gives the developer a signal whether to trust the answer. Omitting this information is a hidden risk.
Citing the source as a requirement, not an option
For prose documents, citing the filename and page number is a best practice. For code, it’s a security requirement.
A developer who gets the answer “the calculate_discount function is in the pricing module” can’t apply it without verification. There may be multiple pricing modules in the monorepo. The method may have changed its signature in the last commit. The assistant may have confused a similarly named method from another module.
A technically useful answer looks like this: “calculate_discount(order_id: int, user_tier: str) -> Decimal (file: src/billing/pricing.py, line 87, commit a3f9c12, indexed 40 minutes ago)”. The developer clicks the file link, sees the current code, verifies the signature, and only then applies it.
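A citation in this shape can be assembled from chunk metadata. The sketch below is one possible formatter; Citation, humanize_age, and format_citation are hypothetical names, and the field values in the usage note are the example values from the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Citation:
    signature: str
    file_path: str
    start_line: int
    commit_sha: str
    indexed_at: datetime


def _ago(count: int, unit: str) -> str:
    return f"{count} {unit}{'' if count == 1 else 's'} ago"


def humanize_age(age: timedelta) -> str:
    """Render an index age as minutes, hours, or days."""
    minutes = int(age.total_seconds() // 60)
    if minutes < 1:
        return "just now"
    if minutes < 60:
        return _ago(minutes, "minute")
    hours = minutes // 60
    if hours < 48:
        return _ago(hours, "hour")
    return _ago(hours // 24, "day")


def format_citation(citation: Citation, now: datetime) -> str:
    """Build the citation string; now and indexed_at must both be naive or both aware."""
    age = humanize_age(now - citation.indexed_at)
    return (
        f"{citation.signature} (file: {citation.file_path}, "
        f"line {citation.start_line}, commit {citation.commit_sha[:7]}, indexed {age})"
    )
```

With signature `calculate_discount(order_id: int, user_tier: str) -> Decimal`, path `src/billing/pricing.py`, line 87, commit `a3f9c12`, and an index timestamp 40 minutes before `now`, the function reproduces the citation quoted above.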
Guardrails for RAG over code should block two classes of answers:
No coverage in the index. If retrieval returns no fragments with similarity above the threshold (approximately 0.65–0.75 as a starting point; the right value depends on the embedding model and similarity metric), the assistant responds: “I didn’t find this signature in the indexed version of the code”. This is better than attempting to reconstruct from the model’s memory, which would produce non-existent APIs.
Stale index. If the chunk’s metadata indicates the file hasn’t been reindexed in more than X hours (threshold depends on the project’s rate of change), the answer includes a warning about potential staleness.
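Both guardrails can be expressed as a filter between retrieval and generation. The sketch below is a rough illustration: Hit, GuardrailResult, and apply_guardrails are hypothetical names, the 0.7 default is simply a value inside the range mentioned above, and max_age has no default because the staleness threshold is project-specific.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

NO_COVERAGE = "I didn't find this signature in the indexed version of the code"


@dataclass
class Hit:
    chunk_id: str
    similarity: float
    indexed_at: datetime


@dataclass
class GuardrailResult:
    hits: list[Hit]
    refusal: Optional[str] = None
    warnings: list[str] = field(default_factory=list)


def apply_guardrails(
    hits: list[Hit],
    now: datetime,
    max_age: timedelta,
    min_similarity: float = 0.7,
) -> GuardrailResult:
    """Refuse when nothing clears the threshold; warn about stale chunks."""
    covered = [hit for hit in hits if hit.similarity >= min_similarity]
    if not covered:
        # No coverage: answer with a fixed message instead of calling the model.
        return GuardrailResult(hits=[], refusal=NO_COVERAGE)

    max_hours = max_age.total_seconds() / 3600
    warnings = [
        f"{hit.chunk_id} was indexed more than {max_hours:g} hours ago and may be stale"
        for hit in covered
        if now - hit.indexed_at > max_age
    ]
    return GuardrailResult(hits=covered, warnings=warnings)
```

Returning the refusal as data, not raising an exception, lets the caller decide whether to suggest an alternative query alongside the fixed message.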
For more on how RAG over a knowledge base handles citations and guardrails, see the article company GPT based on knowledge.
Use cases: where the assistant helps the most
Three cases we encounter most often in projects, and the human’s role in each:
Developer onboarding. A new team member asks: “where is OAuth authorization handled?”. The assistant returns a fragment from src/auth/oauth_handler.py with the starting line, docstring, and the committer’s name. The developer reads the code. The assistant shortens the time to the first meaningful question to a senior colleague but doesn’t replace code review.
“Where is X implemented?” questions. During code review or debugging: “where is the retry mechanism for 503 errors implemented?”. The assistant returns a list of code locations that match semantically and lexically. The developer decides which is the correct one for the given context. The assistant knows nothing about the architectural intent in the architect’s mind.
Answering support questions from documentation. A support client asks about an API parameter. The assistant searches the indexed OpenAPI reference and returns the description with the specification version number. Support verifies in Swagger UI before responding. The assistant reduces search time but doesn’t eliminate verification.
In all three cases, the human’s role is irreplaceable: architectural decisions, applying code in production, and answers where errors have real consequences always require human judgment. A RAG assistant over code is a navigational tool, not an authority.
For broader context on hybrid search for technical knowledge bases, see the article hybrid search: BM25 and vectors.
FAQ
Can RAG over code suggest ready-to-paste snippets?
It can return an existing code fragment from the repository as a citation. This differs from generating new code: the assistant shows how a given pattern is already implemented in the project, with the file and line number. The developer assesses whether the fragment is reusable in the new context. The assistant shouldn’t generate code that isn’t in the index because the risk of fabricating a non-existent signature is high.
How to handle questions about code that isn’t yet indexed?
When retrieval returns no results or results below the similarity threshold, the guardrail should respond directly: “I didn’t find coverage for this query in the indexed version of the repository”. The assistant can suggest an alternative query or indicate that the file may not have been indexed yet. Guessing a function’s signature based on the model’s general knowledge is the most common cause of incorrect answers in such systems.
How often should I reindex technical documentation in Confluence?
It depends on the rate of change. Runbooks for actively developed services should be reindexed every 24 hours. A stable architectural specification can have a TTL of 7–14 days. The key is to have the assistant state the last indexing date in every answer so the user knows whether the cited content is current. A stale index without warning is worse than no index.
Why isn’t pure vector search sufficient for code?
The embedding model represents semantic meaning, but function, class, and variable names are rare tokens without broad contextual meaning in the general corpus. Pure vector search will promote generic fragments about authentication instead of the specific AuthService class. BM25 treats identifiers as keywords and hits them exactly. Combining both via hybrid search handles both semantic and lexical queries.
What to do when the assistant cites a signature that no longer exists?
This signals that the index is stale. The response should be twofold: on the user side, verify the source file before applying, as this is always required. On the system side, shorten the reindexing TTL for changing files and add a warning in the answer when metadata indicates an old commit SHA. Never apply code returned by the assistant without checking the current version in the repository.