The controlling department asks the assistant: “What was the net revenue of the Gdańsk branch in Q3 2024?” The model returns a number. Problem: no one knows which sheet it came from, whether it’s the post-correction version or a draft, and whether the model read it correctly or rounded it based on similar rows. Structured data in RAG is a separate class of problem, significantly harder than retrieval over text documents.
Why standard chunking destroys tables
#A standard text splitter sees a table as a continuous string of characters and divides it by token count. If a table has 40 rows, the splitter cuts it roughly in half. Fragment A gets column headers and rows 1-20. Fragment B gets rows 21-40 without headers.
The consequences are serious. The retrieval model pulls fragment B and doesn’t know that the third column is “Net Revenue [PLN],” not “Gross Margin [%]”. Generating an answer from such a fragment is essentially guessing. For questions about specific values, the percentage error here is often 100%, because the model takes a number from the wrong column.
This problem affects spreadsheets (XLSX, CSV), exports from ERP and CRM systems, financial reports in tabular format, and SQL query results saved as files. In each case, the semantic consistency of a row depends on the header, which a standard splitter severs.
Two retrieval modes for structured data
#There’s no single approach that works for all questions about tabular data. At Cashcrown, we use two modes and choose between them during query classification.
Text-to-SQL (or text-to-query) works well for aggregative or point questions: “What was the total revenue in Q3?”, “Show 5 customers with the highest debt”, “How many orders were worth over 50,000 PLN?”. The LLM translates natural language into a SQL query executed on a real database or an in-memory schema. The result is deterministic and has unambiguous provenance: a specific table, specific columns, a specific WHERE condition.
Text-to-SQL limitations are real: the model must know the schema, questions about data scattered across multiple tables require complex JOINs, and a poorly generated query returns the wrong number without warning. Every generated query should be reviewed by a human before first execution on production data, and the system should log every executed query with a timestamp and user.
Semantic retrieval on denormalized rows works for contextual and pattern questions: “Which customer has a profile similar to company X?”, “Find products with similar sales cycles”, “Which cost categories grew disproportionately last year?”. Here, the focus isn’t on a single number but on relationships, anomalies, and similarities that can’t be expressed with a simple SQL query.
Schema-aware chunking: how to preserve headers
#Regardless of the chosen mode, the data preparation stage is critical. For tabular data, we use a schema-aware approach that differs from standard chunking described in the article on document chunking for RAG.
The principle is simple: every table row indexed in the vector database must carry column headers. Not in a separate fragment, but inline, in the same chunk. Markdown format works well because it’s readable for the model and preserves the column-value relationship:
Branch: Gdańsk | Period: Q3 2024 | Net Revenue [PLN]: 1,247,890 | Gross Margin [%]: 34.2 | Version: post-correction 2024-10-15
Each such row-chunk gets metadata: source file name, sheet or table name, row number in the original file, document version date, data type (financial, sales, HR). Metadata enables filtering before vector search, as detailed in hybrid BM25 and vector search.
For large tables (over 500 rows), we don’t index everything. We index rows with potential to answer business questions—rows with non-zero values, exception flags, or department/period-level aggregations. Granular rows go into a SQL database accessible via text-to-SQL.
Hybrid approach: when to combine both modes
#| Question Type | Recommended Mode | Example | Human Validation |
|---|---|---|---|
| Specific value from a single cell | Text-to-SQL | “Revenue of branch X in March” | Query review before production |
| Conditional aggregation | Text-to-SQL | “Sum of orders over 10,000 in Q2” | Query review + result verification |
| Similarity, pattern | Semantic retrieval | “Customers similar to segment A” | Top-k results evaluation |
| Anomaly, deviation | Semantic retrieval + SQL | “Which costs grew faster than revenue?” | Both steps verification |
| Mixed question | Hybrid (retrieval + SQL join) | “Summarize Q3 and identify similar historical periods” | Full audit trail |
Mixed questions are the hardest. The assistant must first fetch specific numbers via SQL, then enrich them with semantic context from similar periods or products. Such architecture requires an orchestrator to merge both results before generation. A human gate on such a complex result is mandatory if the answer feeds a management decision.
Number provenance: cite, don’t approximate
#This is a hard architectural requirement enforced at the system prompt level and verified in the golden evaluation set at Cashcrown.
Every number returned by the assistant must have attribution to a specific source. Provenance format depends on the mode:
For text-to-SQL: “1,247,890 PLN (source: report_results_Q3_2024.xlsx, sheet Branches, row 47, version post-correction 2024-10-15, query: SELECT net_revenue FROM branches WHERE name='Gdańsk' AND period='Q3_2024')”.
For semantic retrieval: “34.2% (source: row 47 fragment from report_results_Q3_2024.xlsx, cosine distance 0.97)”.
The model should never round or interpolate a number it didn’t find in the index. If the data isn’t available, the answer is: “I don’t have this number in the database. Check file X or ask an analyst for an export.” This is better than a number that looks credible but comes from the nearest similar row.
More on building AI systems responsible for company data in the article about enterprise GPT on knowledge base.
Try it: RAG over financial data in your company
#Filtering and security for structured data
#Financial, HR, and sales data rarely should be accessible to all assistant users. In RAG architecture over structured data, access filtering is implemented via chunk metadata: each row carries an access_level label (e.g., “management”, “controlling”, “sales”) and tenant_id in multi-company systems.
The vector query filters by these metadata before semantic ranking. This way, a controller doesn’t see HR data, and a salesperson doesn’t see net margins. Implementation details of access isolation are described in the article on vector database and decision criteria.
The second security dimension is data freshness. Financial sheets are versioned. A vector index built on a draft from October 12 will return different numbers than one built on the post-correction version from October 15. The system must store the document version as metadata for each chunk and expose it to the user with every answer. The decision on which version to consider official belongs to humans, not the assistant.
For regulatory reports (e.g., financial statements, reports to KNF), we recommend a read-only approach with logging of every query and response. AI Act requirements for documenting high-risk systems may cover such assistants when they influence financial decisions. More on data analysis and BI in the article AI for data analysis and BI.
FAQ
#Is text-to-SQL safe for production data?
#Text-to-SQL should run on a read-only replica or schemas with SELECT permissions only for the assistant account. Every generated query should be logged with a timestamp and session ID. Before production deployment, we recommend reviewing at least 50 sample queries by an analyst or developer familiar with the schema. A poorly generated JOIN or incorrect WHERE condition can return a number that looks credible but is calculated on the wrong subset.
What to do with tables where columns have ambiguous names?
#The data preparation stage should include mapping technical column names to business descriptions. A column rev_net_adj_eur in the schema becomes “Net Revenue post-correction [EUR]” in chunk metadata and the text-to-SQL schema. This is analytical work that must be done by someone familiar with the domain; the LLM can suggest mappings, but the final decision on the business name belongs to the department using the data.
How to handle tables with merged cells (merge cells) in XLSX?
#Merged cells are one of the most common issues when parsing XLSX. Libraries like openpyxl return values only for the top-left cell of a merged area, the rest are empty. During parsing, propagate the value of the merged cell to all rows it covers before data goes to chunking. Without this propagation, the model sees rows without group headers, destroying context and leading to incorrect attributions.
How to evaluate RAG quality over tabular data?
#The golden set for tabular data should include questions with exact, verifiable answers: a specific number from a specific cell. The evaluation metric isn’t just semantic relevance but numerical accuracy: whether the returned value matches the source value to the penny (for financial amounts, tolerance should be zero). For RAG over mixed data, build separate golden sets for SQL and semantic modes, as they differ in success metrics.
Does hybrid search improve precision for numerical data?
#To a limited extent. BM25 handles exact text matching well, e.g., branch names, product codes, contract numbers. For numbers alone, the BM25 signal is weak because the same digits appear in many rows. Hybrid search helps with mixed questions where the user provides a name and asks about a number. For purely aggregative questions, text-to-SQL is more reliable than any form of vector retrieval.