4 posts
How to evaluate an end-to-end RAG system in 2026: recall@k and precision for retrieval, faithfulness and source attribution, building a golden set, and offline vs. online evaluation.
How to evaluate embedding model quality on your own data in 2026: recall@k, MRR, nDCG, building a golden set, and pitfalls of offline and online evaluation.
LLM-as-a-judge in 2026: when automated quality assessment works, what systematic biases it introduces, and how to calibrate the judge before trusting it with production decisions.
How to manage changes in an AI system in 2026: prompt and model versioning, regression testing on a golden set, safe updates, changelogs, and rollback.