Synthetic data for AI training and testing: 2026 guide

An insurance company wants to train a model to detect fraud. It has one hundred thousand transaction records, of which three hundred are fraud cases. A classifier trained on such a dataset can learn to predict "no fraud" for every record and reach 99.7% accuracy without detecting anything useful. The problem does not lie in the algorithm. It lies in the data.
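
The arithmetic behind this trap can be checked in a few lines. The sketch below uses the figures from the example and a hypothetical baseline that always answers "no fraud"; the variable names are illustrative only.

```python
# Figures from the example: 100,000 records, 300 of them fraud.
total = 100_000
fraud = 300

# Hypothetical baseline that labels every record "no fraud".
true_negatives = total - fraud   # legitimate records labeled correctly
true_positives = 0               # no fraud case is ever flagged

accuracy = (true_negatives + true_positives) / total
recall = true_positives / fraud  # share of fraud cases actually caught

print(f"accuracy = {accuracy:.1%}")  # accuracy = 99.7%
print(f"recall   = {recall:.1%}")    # recall   = 0.0%
```

Accuracy alone hides the failure; recall on the rare class exposes it.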

Synthetic data is one of the answers to this problem. It is not the only one, nor always the right one, but by 2026 it has become a standard tool for teams building AI systems for businesses. Below, I describe when this tool makes sense, how to use it safely, and which boundaries are not worth crossing.

What synthetic data is and what it is not

Synthetic data is data generated by a model or algorithm based on patterns learned from source data. Key point: it is neither a copy nor an anonymized version of the original data. It is a new population of records that mimics the structure and distribution of the original.

Three classes of synthetic data that appear in projects:

Synthetic tabular data: rows of a table resembling real transactions, patients, customers, or events, generated by models such as CTGAN, TVAE, or Gaussian Copula. Each row is new. None corresponds to a specific individual.

Synthetic texts and documents: content generated by an LLM from schemas (e.g., synthetic invoices, complaint emails, inspection reports) for training and testing RAG-based or data-extraction systems.

Synthetic images and unstructured data: photos, scans, and recordings generated by generative models, used when data is scarce in computer vision or OCR systems.

Synthetic data is not the same as anonymized data. Anonymization removes or masks PII in real records. A synthetic row is not derived from any single real record, although the model that generates it was trained on real data. This difference has legal and architectural implications.

When synthetic data makes sense and when it does not

Not every data problem is solved by synthesis. The table below compares signals that suggest synthetic data with those that should exclude it.

Signal	Synthetic Data	Alternative
Rare classes (fraud, failure, rejection) below 1%	Yes — augmentation of rare class	Over-sampling (SMOTE) for simpler cases
Data contains PII and cannot be shared with external provider	Yes — instead of masking while preserving distribution	Local self-hosting of training model
Production data needed in test environment	Yes — test database without leakage risk	Subset with full anonymization
No data at all (new product, new market)	Caution — model has nothing to learn distribution from	Pilot collection of real data, expert rules
High-risk AI Act Annex III system	Yes, but with full documentation of training pipeline	Real data with DPIA and legal basis
Model must detect subtle behavioral patterns (e.g., relationship-based fraud)	No — synthesis loses higher-order relationships	Real data, possibly federated learning

Decision criterion: synthetic data works for statistical problems (class imbalance, missing data for test environments, privacy). It does not work for problems that require real pattern variability which the generating model never saw in the source data.

Methods for generating tabular data

For tabular data (the most common case in businesses), the choice of method depends on the complexity of dependencies in the data.

Gaussian Copula models dependencies between columns using a multivariate normal distribution. Fast, interpretable, handles simple correlations well. Fails with strongly nonlinear or categorical data with rare combinations.

CTGAN (Conditional Tabular GAN) learns conditional distributions through a generative adversarial network. Better for data with many column types and nonlinear dependencies. Requires more source data for training (roughly a few thousand rows minimum) and is harder to calibrate.

TVAE (Tabular Variational Autoencoder) is similar to CTGAN but based on a variational autoencoder. It is often more stable in training but handles very rare value combinations worse.

LLM-based methods are a newer approach in which an LLM generates synthetic rows from a schema description and examples. They work with small datasets (few-shot) and are slower and more expensive for millions of records, but provide high realism for textual or mixed data.
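
To make the Gaussian Copula idea from the list above concrete, here is a minimal two-column sketch in plain Python. It is not the implementation used by any library: it assumes two numeric columns, estimates one correlation on rank-based normal scores, and maps samples back through the empirical quantiles. All function names and the toy data are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def to_normal_scores(values):
    """Map a column to standard-normal scores via its empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1)
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores

def correlation(a, b):
    """Pearson correlation of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def empirical_quantile(sorted_values, p):
    """Value at probability p in an already sorted column."""
    index = min(int(p * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]

def sample_gaussian_copula(col_a, col_b, n_rows, seed=0):
    """Draw rows whose dependence follows a bivariate normal copula."""
    rng = random.Random(seed)
    rho = correlation(to_normal_scores(col_a), to_normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_rows):
        z1 = rng.gauss(0, 1)
        # Second normal variable correlated with the first by rho
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * rng.gauss(0, 1)
        rows.append((
            empirical_quantile(sorted_a, STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_b, STD_NORMAL.cdf(z2)),
        ))
    return rows

# Toy source data (made up): order amount and a correlated item count proxy.
toy_rng = random.Random(42)
amounts = [toy_rng.uniform(10, 500) for _ in range(200)]
items = [a / 50 + toy_rng.uniform(0, 2) for a in amounts]
synthetic = sample_gaussian_copula(amounts, items, n_rows=1000)
```

Because this sketch reuses observed marginal values, each synthetic cell is a value that exists in the source column; only the row combinations are new. Production generators fit parametric marginals and handle categorical columns, which this sketch does not.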

The choice of method should be preceded by evaluation on a validation set: train the target model once on synthetic data, once on real data, compare metrics. A difference below 5% on the same test set is a good signal. A difference above 15% suggests that synthesis loses important patterns.
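
The thresholds above can be turned into a small check. The sketch assumes the "difference" is relative to the metric of the model trained on real data, which the text does not specify; the function name and labels are hypothetical.

```python
def synthesis_verdict(metric_real: float, metric_synth: float) -> str:
    """Classify the gap between models trained on real vs. synthetic data.

    Both metrics must come from the same real test set. The gap is taken
    relative to the real-data metric (an assumption of this sketch).
    """
    gap = abs(metric_real - metric_synth) / metric_real
    if gap < 0.05:
        return "good"
    if gap > 0.15:
        return "losing patterns"
    return "inconclusive"

# Illustrative F1 scores, not measured results.
print(synthesis_verdict(0.80, 0.78))  # good
print(synthesis_verdict(0.80, 0.72))  # inconclusive
print(synthesis_verdict(0.80, 0.60))  # losing patterns
```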

Validating synthetic data quality

Generating data is half the work. The other half is confirming that it is useful and safe. Three validation dimensions:

Statistical fidelity. Compare distributions of each column: mean, std, quantiles, mode for categorical. Check the correlation matrix (Pearson for numerical, Cramér's V for categorical) between real and synthetic data. Libraries like sdmetrics or ydata-profiling generate such reports automatically.
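
For categorical pairs, Cramér's V can be computed by hand to see what such a report measures. This is a plain-Python sketch of the uncorrected statistic; the column values are made-up toy data in which the synthetic set has lost an association present in the real set.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two categorical columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Toy real data: rejections happen only in the "app" channel.
real_channel = ["web", "web", "app", "app"] * 25
real_status = ["ok", "ok", "ok", "rejected"] * 25

# Toy synthetic data: channel and status are independent.
synth_channel = ["web", "web", "app", "app"] * 25
synth_status = ["ok", "rejected", "ok", "rejected"] * 25

real_v = cramers_v(real_channel, real_status)
synth_v = cramers_v(synth_channel, synth_status)
print(f"real V = {real_v:.2f}, synthetic V = {synth_v:.2f}")
```

A large gap between the two values for the same column pair signals that the generator did not preserve that dependency.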

Utility (Train on Synthetic, Test on Real — TSTR). Train a model on synthetic data. Test on real data. Compare with a model trained on real data (TRTR). A TSTR/TRTR metric ratio close to 1.0 means synthesis preserves patterns important for the model. If it drops below 0.85, revisit the generator parameters.

Privacy (Privacy Metrics). Key metrics: Distance to Closest Record (DCR) and Nearest Neighbor Adversarial Accuracy (NNAA). DCR measures how close each synthetic record is to its nearest counterpart in the real data. Records too close to originals may violate privacy through membership inference attacks—detecting whether a specific person was in the training set.
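
DCR itself is simple to compute. The brute-force sketch below assumes all columns are numeric and already scaled to a common range, and uses Euclidean distance; the rows are toy values and the function name is hypothetical.

```python
import math
from statistics import median

def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]

# Toy rows with two features scaled to [0, 1].
real = [(0.10, 0.20), (0.50, 0.90), (0.80, 0.30)]
synthetic = [(0.10, 0.20), (0.55, 0.85), (0.30, 0.60)]

dcr = distance_to_closest_record(synthetic, real)
exact_copies = sum(d == 0 for d in dcr)

print("median DCR:", round(median(dcr), 4))
print("exact copies of real rows:", exact_copies)
```

The first synthetic row duplicates a real record, so its DCR is zero. On real datasets the quadratic loop should be replaced with a nearest-neighbor index.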

Observability of the generation process is as important as observability of the production model. Log generator parameters, source data version, and validation metric results with each generation.
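
One way to log a generation run is a single JSON line per run. The sketch below is an assumption about what such a record could contain; the field names, the parameter values, and the metric values are placeholders, not measured results.

```python
import hashlib
import json
from datetime import datetime, timezone

def generation_log_entry(generator, params, source_bytes, metrics):
    """Build one log record describing a synthetic data generation run."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "generator": generator,
        "params": params,
        # A content hash pins the exact source data version.
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "metrics": metrics,
    }

entry = generation_log_entry(
    generator="ctgan",
    params={"epochs": 300, "batch_size": 500},       # placeholder values
    source_bytes=b"id,amount\n1,10.0\n",              # stand-in for the source file
    metrics={"tstr_trtr_ratio": 0.93, "median_dcr": 0.07},  # placeholder values
)
log_line = json.dumps(entry, sort_keys=True)
print(log_line)
```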

GDPR and AI Act: what applies to synthetic data

Synthetic data is not automatically outside the scope of GDPR. The European Data Protection Supervisor (EDPS) and the European Data Protection Board (EDPB) clarify that if the model generating the data was trained on personal data, and the generated records allow re-identification (e.g., through a combination of rare features), synthetic data may still qualify as personal data under Article 4(1) GDPR.

Requirements depend on the risk assessment of re-identification:

If DCR and NNAA indicate low re-identification risk, and the data was generated from aggregates (not specific records), the synthetic data can be treated analogously to anonymized data when choosing a legal basis for processing.

If synthetic data is generated in the context of a high-risk system under the AI Act (e.g., credit scoring, recruitment, medical systems), documentation of the training pipeline must include a description of the generation method, privacy metrics, and the result of a DPIA. This follows from Article 10 of the AI Act on data and data governance.

Practical rule: generate a validation report before each use of synthetic data in production or in systems subject to the AI Act. Store the report with the model. For human oversight in high-risk systems, auditors must have access to the history of training data, including synthetic data.

Synthetic data for testing and debugging AI agents

A separate use case that does not require model training or full statistical validation is synthetic data for testing environments and AI agents.

An agent handling orders must be tested on scenarios that rarely occur in production: an order with a missing address, twenty items in the cart from the same category, a currency other than PLN, a delivery date in the past. Such cases appear five times per million transactions in production data. In a test database, they can be generated in any quantity.

This type of synthetic data is generated using simple scripts or by an LLM instructed to create edge test cases. It does not require CTGAN or TVAE. It does require well-documented edge cases, typically recorded during requirements analysis.
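
A "simple script" for the order scenarios mentioned above might look like the following sketch. The order schema, field names, and values are hypothetical; a real agent's input format will differ.

```python
import random
from datetime import date, timedelta

def base_order(rng):
    """A valid baseline order; every field value here is made up."""
    return {
        "address": "ul. Testowa 1, 00-001 Warszawa",
        "currency": "PLN",
        "delivery_date": date.today() + timedelta(days=rng.randint(1, 14)),
        "items": [{"sku": "SKU-1", "category": "books"}],
    }

# Each mutation turns a valid order into one documented edge case.
def missing_address(order, rng):
    order["address"] = None

def twenty_items_same_category(order, rng):
    order["items"] = [{"sku": f"SKU-{i}", "category": "books"} for i in range(1, 21)]

def non_pln_currency(order, rng):
    order["currency"] = rng.choice(["EUR", "USD", "CZK"])

def delivery_date_in_past(order, rng):
    order["delivery_date"] = date.today() - timedelta(days=rng.randint(1, 30))

EDGE_CASES = [
    missing_address,
    twenty_items_same_category,
    non_pln_currency,
    delivery_date_in_past,
]

def generate_edge_cases(n_per_case, seed=0):
    """Produce n_per_case orders for every edge case, labeled by scenario."""
    rng = random.Random(seed)
    orders = []
    for mutate in EDGE_CASES:
        for _ in range(n_per_case):
            order = base_order(rng)
            mutate(order, rng)
            order["scenario"] = mutate.__name__
            orders.append(order)
    return orders

orders = generate_edge_cases(n_per_case=3)
print(len(orders), "edge-case orders generated")
```

Keeping one function per edge case makes the list easy to trace back to the requirements document.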

When building guardrails for an agent, synthetic test data allows for automated regression testing: every change in the system prompt or guardrail logic runs through a set of synthetic test scenarios. This is the same pattern as unit tests in software engineering, but adapted for nondeterministic AI systems. More on monitoring agent quality in the article monitoring AI agent quality.
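
The regression pattern can be sketched as a table of synthetic scenarios with expected outcomes, replayed after every change. The `guardrail` function below is a hypothetical, deterministic stand-in; for an LLM-backed check, each scenario would typically be run several times and compared against a pass-rate threshold.

```python
from datetime import date, timedelta

def guardrail(order):
    """Hypothetical guardrail: returns the reasons an order must be blocked."""
    reasons = []
    if not order.get("address"):
        reasons.append("missing_address")
    if order.get("currency") != "PLN":
        reasons.append("unsupported_currency")
    if order["delivery_date"] < date.today():
        reasons.append("delivery_date_in_past")
    return reasons

def make_order(**overrides):
    """Valid synthetic order with selected fields overridden."""
    order = {
        "address": "ul. Testowa 1",
        "currency": "PLN",
        "delivery_date": date.today() + timedelta(days=3),
    }
    order.update(overrides)
    return order

# (scenario name, synthetic input, expected guardrail output)
SCENARIOS = [
    ("valid order", make_order(), []),
    ("missing address", make_order(address=None), ["missing_address"]),
    ("foreign currency", make_order(currency="EUR"), ["unsupported_currency"]),
    ("past delivery",
     make_order(delivery_date=date.today() - timedelta(days=1)),
     ["delivery_date_in_past"]),
]

def run_regression(check, scenarios):
    """Replay every scenario and collect mismatches."""
    failures = []
    for name, order, expected in scenarios:
        actual = check(order)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures

failures = run_regression(guardrail, SCENARIOS)
print("failures:", failures)
```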

Integration with RAG pipeline and fine-tuning

Synthetic data fits into two places in the AI pipeline: as training data (fine-tuning) and as documents expanding the RAG knowledge base.

For fine-tuning, synthetic question-answer pairs based on company documents allow model specialization without sending sensitive documents to an external provider. Schema: a local LLM generates questions and answers from documents (which you have the right to process). These synthetic pairs form the training set for fine-tuning. Original documents never leave the environment. When this variant makes sense and when it is better to stick with RAG alone is described in the article when fine-tuning makes sense.
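
The schema above can be sketched as a small pipeline that writes one JSON record per question-answer pair. The `generate_qa_pairs` function is a placeholder for a call to a locally hosted LLM, and the record fields are assumptions; the actual training format depends on the fine-tuning toolchain.

```python
import json

def generate_qa_pairs(document_text):
    """Placeholder for a local LLM call (hypothetical).

    Here it only turns the first sentence into one trivial pair so the
    sketch runs without a model.
    """
    first_sentence = document_text.split(".")[0].strip()
    return [{
        "question": "What does the document state first?",
        "answer": first_sentence + ".",
    }]

def build_training_lines(documents):
    """Create JSON lines for fine-tuning; documents never leave this process."""
    lines = []
    for doc_id, text in documents.items():
        for pair in generate_qa_pairs(text):
            record = {"source_doc": doc_id, "synthetic": True, **pair}
            lines.append(json.dumps(record, ensure_ascii=False))
    return lines

# Made-up document for illustration.
docs = {"policy-001": "Claims must be filed within 30 days. Late claims need approval."}
training_lines = build_training_lines(docs)
print(training_lines[0])
```

Storing the source document identifier with each pair keeps the training set traceable during audits.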

For RAG, synthetic data supplements the knowledge base with scenarios not covered by real documentation: example customer dialogues, sample inquiry requests, sample inspection reports. This gives the model context for more accurate responses without revealing real customer data.

Key limitation: synthetic documents for RAG must be clearly marked as synthetic in the vector metadata. Mixing synthetic and real data without labeling complicates audits and makes debugging hallucinations harder. More on managing the RAG knowledge base in the article RAG knowledge updates and versioning.
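
Labeling can be enforced at ingestion time. The sketch below uses an in-memory list instead of a real vector store, so the `Chunk` class and the metadata keys are assumptions; most vector databases accept an equivalent metadata dictionary per vector.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def ingest(texts, *, synthetic, source):
    """Wrap raw texts as chunks, stamping provenance into the metadata."""
    return [Chunk(t, {"synthetic": synthetic, "source": source}) for t in texts]

def provenance_summary(retrieved):
    """Count real vs. synthetic chunks among retrieval hits, for debugging."""
    n_synthetic = sum(c.metadata["synthetic"] for c in retrieved)
    return {"synthetic": n_synthetic, "real": len(retrieved) - n_synthetic}

# Made-up content for illustration.
index = (
    ingest(["Returns are accepted within 14 days."],
           synthetic=False, source="policy.pdf")
    + ingest(["Customer: Can I return an opened item?"],
             synthetic=True, source="llm-dialogues-v1")
)

# An audit view that excludes everything generated.
audit_view = [c for c in index if not c.metadata["synthetic"]]
print(provenance_summary(index))
```

When an answer looks hallucinated, the provenance summary of the retrieved chunks shows at once whether it was grounded in synthetic material.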

FAQ

Is synthetic data GDPR-compliant by definition?

No. Synthetic data may still qualify as personal data if the generating model was trained on personal data and the generated records allow re-identification of specific individuals through a combination of rare features. A re-identification risk assessment using DCR and NNAA metrics should precede every use of synthetic data in systems processing information about natural persons. For high-risk systems under the AI Act, a full DPIA is required.

How much source data is needed to generate synthetic data?

It depends on the method. Gaussian Copula works with a few hundred rows and simple dependencies. CTGAN and TVAE require roughly a few thousand rows for stable generator training, and more when there are many categorical columns with rare values. LLM-based generation (few-shot) works with a few dozen examples, but statistical quality is lower than with dedicated generative models. With very small datasets (below 500 records), synthetic data may paradoxically fail to improve the model, as the generator learns noise instead of patterns.

How to check that synthetic data does not "leak" original records?

Calculate Distance to Closest Record (DCR) for each synthetic record relative to the source set. If the median DCR is close to zero, the generator is copying original rows instead of creating new ones. Supplement this with the Nearest Neighbor Adversarial Accuracy (NNAA) test: a nearest-neighbor classifier asked to tell real from synthetic records should have accuracy close to 0.5 (random), not close to 1.0 (able to distinguish). A score well below 0.5 is also a warning sign, because it means synthetic records sit closer to real ones than real records sit to each other. Libraries such as sdmetrics provide ready-made privacy metrics; check the documentation of your installed version for the exact metrics available.
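
For intuition, NNAA can be written out directly from nearest-neighbor distances. This brute-force sketch follows the commonly cited definition (average of two leave-one-out comparisons) and assumes numeric, comparably scaled features; the function names and toy points are hypothetical.

```python
import math

def nearest_distance(point, others, skip_index=None):
    """Distance from point to its nearest neighbor in others.

    skip_index excludes the point itself when others is its own dataset.
    """
    return min(
        math.dist(point, other)
        for j, other in enumerate(others)
        if j != skip_index
    )

def nnaa(real, synthetic):
    """Nearest Neighbor Adversarial Accuracy; about 0.5 is the target."""
    # Share of real rows whose nearest synthetic row is farther away
    # than their nearest other real row.
    real_side = sum(
        nearest_distance(r, synthetic) > nearest_distance(r, real, skip_index=i)
        for i, r in enumerate(real)
    ) / len(real)
    # The same comparison from the synthetic side.
    synth_side = sum(
        nearest_distance(s, real) > nearest_distance(s, synthetic, skip_index=i)
        for i, s in enumerate(synthetic)
    ) / len(synthetic)
    return 0.5 * (real_side + synth_side)

real = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
far_away = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

print(nnaa(real, far_away))    # 1.0: trivially distinguishable
print(nnaa(real, list(real)))  # 0.0: synthetic rows are copies
```

The two toy extremes bracket the healthy range: values near 1.0 indicate poor fidelity, values near 0.0 indicate memorized records.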

Will synthetic data replace collecting real data?

Not entirely. Synthetic data is a valuable complement, not a replacement. The generating model learns patterns from real data, so it cannot produce phenomena that were absent from that data. For a new product without history, a new market, or a new type of event, synthetic data cannot replace the process of collecting pilot data. That is why it is worth assessing data readiness before every AI project using the readiness assessment tool and ROI calculator.

How to estimate a synthetic data implementation project?

The scope of work depends on the complexity of the data schema, the number of columns with PII, the required level of validation (statistical vs. TSTR vs. full privacy), and whether the data will enter a high-risk AI Act system. Rough estimate: pilot projects with method evaluation on an existing dataset and implementation of a synthesis pipeline take several to a dozen weeks. A full estimate is prepared after analysis using the ROI calculator or direct contact.
