RAG evaluation

RAG evaluation is how you check whether a grounded AI system is actually useful and safe. It tests questions against expected sources, retrieved context, and generated answers. Good evaluation asks whether the right documents were found, whether the answer used them correctly, whether citations support the response, and whether the answer is complete for the user’s need. It gives teams a way to improve chunking, retrieval, prompts, and models without arguing from a few hand-picked demos.

Aliases: retrieval augmented generation evaluation, RAG evaluators
Difficulty: intermediate
CLI mappings: 5
Last verified: 2026-05-21

Microsoft Learn

RAG evaluation measures whether a retrieval augmented generation workflow returns relevant context and produces accurate, complete, grounded answers. In Microsoft Foundry, built-in and custom evaluators can assess response quality, retrieval quality, safety, and dataset-based performance.

Microsoft Learn: Retrieval-Augmented Generation evaluators (2026-05-21)

Technical context

In Azure architecture, RAG evaluation sits across AI development, observability, governance, and release management. It uses datasets of questions, expected answers, retrieved context, model outputs, and scoring metrics. Microsoft Foundry and the Azure AI evaluation tooling can run built-in and custom evaluators for groundedness, relevance, completeness, retrieval quality, and safety. Results should connect to prompt versions, index versions, model deployments, source snapshots, Application Insights traces, and approval gates before a RAG application reaches production users.
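A minimal sketch of what one exported evaluation record might carry, so that scores stay linked to the versions that produced them. The class and field names are hypothetical and are not taken from Microsoft Foundry or any Azure SDK.

```python
from dataclasses import dataclass, field, asdict

@dataclass(frozen=True)
class RunContext:
    """Versions behind a run (hypothetical field names, not a Foundry schema)."""
    prompt_version: str
    index_version: str
    model_deployment: str
    source_snapshot: str

@dataclass
class EvalRow:
    """One benchmark question with its retrieval and generation outputs."""
    question: str
    expected_sources: list[str]
    retrieved_sources: list[str]
    answer: str
    scores: dict[str, float] = field(default_factory=dict)

def to_export(row: EvalRow, ctx: RunContext) -> dict:
    """Flatten a row and its run context into one exportable record."""
    return {**asdict(row), **asdict(ctx)}

# Illustrative values only.
ctx = RunContext("prompt-v7", "index-2026-05", "chat-deploy-a", "snapshot-041")
row = EvalRow(
    question="What is the claim limit for water damage?",
    expected_sources=["policy-12#s3"],
    retrieved_sources=["policy-12#s3", "policy-09#s1"],
    answer="The limit is stated in section 3 of policy 12.",
    scores={"groundedness": 4.0, "relevance": 5.0},
)
record = to_export(row, ctx)
```

Keeping the context on every row means a single failed question can be traced to a prompt, index, and deployment without consulting a separate run log.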

Why it matters

RAG evaluation matters because fluent answers can still be wrong, incomplete, unsupported, or based on the wrong source. Teams that only demo a few successful questions miss the long tail of ambiguous language, outdated documents, permission edge cases, and retrieval failures. Evaluation creates an evidence base for release decisions: what improved, what regressed, and which user scenarios are still unsafe. It also helps separate model problems from search problems. When business owners ask whether an AI assistant is ready, RAG evaluation provides measurable quality signals instead of confidence theater. That evidence helps leaders approve, delay, or narrow a launch responsibly.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

Microsoft Foundry evaluation screens show datasets, evaluator selections, metric results, run history, safety signals, and failed rows that owners must review before release approval.

Signal 02

Evaluation result exports contain questions, retrieved context, generated answers, metric scores, pass-or-fail status, and identifiers for prompt, model, or dataset versions in each controlled run.

Signal 03

CI/CD pipelines or release checklists reference RAG evaluation thresholds that block deployment when groundedness, relevance, citation coverage, or safety scores regress.
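A regression gate of this kind can be sketched in a few lines. This assumes scores are normalized to the range 0 to 1 and that a baseline scorecard from the last approved run is available. The function name, metric keys, and tolerance are hypothetical.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.01) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than `tolerance`."""
    drops = {}
    for metric, old in baseline.items():
        # A metric missing from the candidate run is treated as a score of 0.
        new = candidate.get(metric, 0.0)
        if old - new > tolerance:
            drops[metric] = round(old - new, 4)
    return drops

# Illustrative scorecards, not real results.
baseline = {"groundedness": 0.91, "relevance": 0.88,
            "citation_coverage": 0.94, "safety": 0.99}
candidate = {"groundedness": 0.86, "relevance": 0.90,
             "citation_coverage": 0.94, "safety": 0.99}

failed = find_regressions(baseline, candidate)
gate_passed = not failed
```

In a pipeline, a non-empty `failed` would typically end the job with a non-zero exit code so the deployment stage never starts.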

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

Block a chatbot release when new prompts improve tone but reduce groundedness or citation support on regulated questions.
Compare retrieval strategies, such as vector, hybrid, semantic, or agentic retrieval, using the same benchmark dataset.
Turn production complaints into repeatable test questions so previously wrong answers cannot silently return after tuning.
Measure whether source updates, index rebuilds, or embedding changes improved answer quality without increasing unsafe responses.
Provide business owners with measurable readiness evidence before expanding a RAG assistant to more users or departments.
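Comparing retrieval strategies on one benchmark usually comes down to scoring each strategy's results against the same expected sources. The sketch below uses recall at k as one possible metric. The document IDs and result lists are made up for illustration.

```python
def recall_at_k(expected: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of expected sources found in the top-k retrieved results."""
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    return len(expected & set(retrieved[:k])) / len(expected)

def mean_recall(benchmark: dict[str, set[str]],
                results: dict[str, list[str]], k: int = 3) -> float:
    """Average recall@k over every benchmark question; missing results score 0."""
    scores = [recall_at_k(expected, results.get(qid, []), k)
              for qid, expected in benchmark.items()]
    return sum(scores) / len(scores)

# One shared benchmark, two hypothetical strategies.
benchmark = {"q1": {"manual-4.2"}, "q2": {"bulletin-17", "manual-9.1"}}
vector_results = {"q1": ["manual-4.2", "manual-1.0"],
                  "q2": ["manual-9.1", "manual-2.2"]}
hybrid_results = {"q1": ["manual-4.2"],
                  "q2": ["bulletin-17", "manual-9.1"]}

vector_recall = mean_recall(benchmark, vector_results)
hybrid_recall = mean_recall(benchmark, hybrid_results)
```

Because both strategies are scored against the same `benchmark`, the difference between the two numbers reflects retrieval behavior rather than a change in the questions.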

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Insurance claims assistant earns release approval

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

SilverPath Mutual built a claims-assistance chatbot for adjusters, but early demos answered confidently from outdated policy endorsements. Leaders required measurable proof before allowing production use.

Business/Technical Objectives

Measure groundedness and completeness against real adjuster questions.
Identify whether failures came from retrieval, prompts, or stale documents.
Block release if severe unsupported answers remained.
Create a repeatable benchmark for future policy updates.

Solution Using RAG evaluation

The team created a RAG evaluation dataset with 180 questions covering exclusions, claim limits, documentation rules, and ambiguous no-answer cases. Each row included expected source documents and acceptable answer criteria. Microsoft Foundry evaluation runs scored generated responses for groundedness, relevance, completeness, and safety, while Application Insights linked each row to retrieved document IDs and model deployment names. CLI evidence captured the AI account, search service, diagnostic settings, and deployment state for every major run. Failures showed that several policy PDFs were chunked too broadly, so engineers changed chunk boundaries around endorsement tables and reran the benchmark before release.

Results & Business Impact

Groundedness improved from 72% to 91% after chunking and retrieval filters were corrected.
Severe unsupported answers fell from 14 cases to two, both routed to human review.
Release approval time dropped by 35% because scorecards replaced subjective demo debates.
The benchmark became a mandatory gate for quarterly policy document refreshes.

Key Takeaway for Glossary Readers

RAG evaluation turns model readiness from a persuasive demo into a measurable release decision.

Case study 02

Manufacturer compares retrieval strategies before rollout

Scenario

HelioFab Systems planned a field-service assistant for turbine technicians. Search results looked plausible, but engineers disagreed about whether hybrid search or vector-only retrieval worked better.

Business/Technical Objectives

Compare retrieval modes using the same service-question benchmark.
Measure answer quality and technician-facing latency together.
Preserve citations to exact manual sections and bulletins.
Choose a retrieval design before training field teams.

Solution Using RAG evaluation

The AI team built an evaluation dataset from 220 historical service tickets and mapped each question to expected manual sections. They tested vector-only, hybrid, and hybrid plus semantic ranking configurations in Azure AI Search, then generated answers with the same Azure OpenAI deployment. RAG evaluation measured retrieval relevance, groundedness, citation support, and answer completeness. Operators used CLI to capture search service SKU, replicas, AI deployment names, and diagnostic settings so performance differences were tied to known resource states. The final design used hybrid retrieval with metadata filters for equipment model and bulletin date, because it handled fault codes and descriptive symptoms better than vector-only search.

Results & Business Impact

Hybrid retrieval improved relevant-context recall by 19 percentage points over vector-only testing.
Median answer latency increased by 420 milliseconds, which field teams accepted for better citations.
Citation support reached 94% on benchmark questions after metadata filtering was added.
The chosen design avoided a costly production rollback after field training materials were prepared.

Key Takeaway for Glossary Readers

RAG evaluation helps teams choose retrieval architecture using evidence instead of preference or a few impressive examples.

Case study 03

Legal research product controls unsafe expansion

Scenario

BriefHarbor, a legal technology startup, wanted to expand a RAG assistant from contract summaries to litigation research. The new domain introduced higher risk around jurisdiction, dates, and unsupported conclusions.

Business/Technical Objectives

Detect unsupported legal assertions before feature expansion.
Test no-answer behavior when source material was incomplete.
Track quality by jurisdiction and document type.
Keep evaluation outputs restricted to the product and compliance teams.

Solution Using RAG evaluation

The product team created separate evaluation datasets for statutes, case summaries, procedural rules, and no-answer scenarios. The RAG application used Azure AI Search with jurisdiction and effective-date metadata, while Azure OpenAI generated answers only from retrieved context. Microsoft Foundry evaluators scored groundedness, relevance, completeness, and safety; custom review labels marked jurisdiction mismatches as high severity. CLI scripts captured AI account and search service configuration for each candidate release, while Key Vault and private endpoint settings protected evaluation data. Failed rows were routed back to legal content owners, who improved source tagging rather than asking engineers to hide gaps with prompt language.

Results & Business Impact

Jurisdiction mismatch failures dropped from 11% to 2.4% after metadata and prompt changes.
No-answer behavior passed 93% of incomplete-source test cases before rollout.
Compliance review time fell by 28% because high-severity failures were clearly separated.
The company delayed one risky feature area until evaluation data showed acceptable grounding.

Key Takeaway for Glossary Readers

RAG evaluation is especially valuable when the cost of a confident but unsupported answer is high.

Why use Azure CLI for this?

As an Azure engineer with ten years of delivery experience, I use Azure CLI around RAG evaluation to prove the environment behind the scores. The evaluation report is only meaningful if I know which search service, index, model deployment, dataset, application version, and diagnostic settings were used. CLI helps capture those resource states before and after a run, compare staging and production, and export evidence for approval. There may be SDK or portal steps for running evaluations, but CLI is still the fastest way to inventory the Azure resources that make the evaluation reproducible. That context protects score comparisons from drift.

CLI use cases

Capture the AI account, model deployment, search service, and diagnostic configuration associated with an evaluation run.
Compare staging and production resource settings before trusting differences in RAG evaluation results.
Export deployment names, SKUs, locations, and tags for governance evidence tied to evaluator scorecards.
Check monitoring resources so failed evaluations can be correlated with application traces and retrieved document IDs.
Inventory evaluation-related resources before cleaning up test environments that still store datasets or logs.
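One way to script the evidence capture is to wrap the read-only commands mapped on this page in a small Python helper. This is a sketch: it assumes the Azure CLI is installed and logged in, and the helper names are hypothetical. The runner is injectable so the command list can be checked without touching Azure.

```python
import json
import subprocess

def evidence_commands(ai_account: str, search_service: str,
                      resource_group: str) -> dict[str, list[str]]:
    """Build read-only az commands (all three appear in this page's mappings)."""
    return {
        "ai_account": ["az", "cognitiveservices", "account", "show",
                       "--name", ai_account, "--resource-group", resource_group],
        "deployments": ["az", "cognitiveservices", "account", "deployment", "list",
                        "--name", ai_account, "--resource-group", resource_group],
        "search_service": ["az", "search", "service", "show",
                           "--name", search_service, "--resource-group", resource_group],
    }

def run_az(argv: list[str]):
    """Run one az command and parse its JSON output."""
    done = subprocess.run(argv + ["--output", "json"],
                          capture_output=True, text=True, check=True)
    return json.loads(done.stdout)

def capture_evidence(commands: dict[str, list[str]], runner=run_az) -> dict:
    """Run every command and collect the results under the same keys."""
    return {name: runner(argv) for name, argv in commands.items()}

# Placeholder resource names; nothing is executed until capture_evidence runs.
commands = evidence_commands("my-ai-account", "my-search", "rg-rag-staging")
```

Calling `capture_evidence(commands)` before and after an evaluation run gives two snapshots that can be stored next to the scorecard. On Windows the `az` entry point is a `.cmd` script, so the `subprocess` call may need adjusting.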

Before you run CLI

Confirm tenant, subscription, resource group, Foundry project or workspace, search service, AI account, and model deployment used by the evaluation.
Know whether the command is inspecting resources, reading logs, changing model deployments, or deleting evaluation data.
Check permissions for management-plane resources and for any data-plane datasets that contain sensitive questions or retrieved context.
Avoid exporting unredacted evaluation rows into shared locations when they include customer data, private documents, or safety failures.
Use consistent output formats so resource evidence can be attached to scorecards and release approvals.

What output tells you

AI account and deployment fields identify which model endpoint produced or judged the evaluated answers.
Search service fields show whether the evaluation used the intended retrieval capacity, region, and index environment.
Diagnostic setting output indicates whether traces are available to investigate low groundedness or missing citation failures.
Tags and resource IDs connect evaluation results to application version, owner, environment, cost center, and release record.
SKU, region, and quota fields explain whether evaluation performance issues come from resource capacity rather than prompt quality.
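To show how such output can be reduced to the fields that matter for a scorecard, here is a sketch that summarizes a deployment list. The sample input is a trimmed, illustrative stand-in for CLI output, and the field paths are an assumption about its shape. The code reads them defensively so missing fields become None rather than errors.

```python
def summarize_deployments(deployments: list[dict]) -> list[dict]:
    """Pull deployment name, model, version, and SKU out of each entry."""
    rows = []
    for item in deployments:
        model = (item.get("properties") or {}).get("model") or {}
        sku = item.get("sku") or {}
        rows.append({
            "deployment": item.get("name"),
            "model": model.get("name"),
            "model_version": model.get("version"),
            "sku": sku.get("name"),
            "capacity": sku.get("capacity"),
        })
    return rows

# Illustrative input only; real output contains many more fields.
sample = [
    {"name": "chat-prod",
     "properties": {"model": {"name": "example-model", "version": "1"}},
     "sku": {"name": "Standard", "capacity": 10}},
    {"name": "judge"},  # entry with missing details
]
summary = summarize_deployments(sample)
```

Recording which deployment generated answers and which one judged them makes it easier to explain a score change that follows a model update.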

Mapped Azure CLI commands

RAG evaluation evidence collection

az cognitiveservices account show --name <ai-account> --resource-group <resource-group>

az cognitiveservices account deployment list --name <ai-account> --resource-group <resource-group>

az search service show --name <search-service> --resource-group <resource-group>

az monitor diagnostic-settings list --resource <resource-id>

az resource list --resource-group <resource-group> --tag workload=rag --output table

Architecture context

A seasoned architect treats RAG evaluation as a release gate and feedback loop. The architecture should maintain test datasets for common questions, edge cases, sensitive content, no-answer scenarios, and newly changed sources. Each evaluation run should reference the prompt version, index schema, source snapshot, embedding model, generation model, retrieval mode, and scoring configuration. Results should be stored with deployment records so teams can compare quality over time. Evaluation also shapes design decisions: whether chunking is too broad, semantic ranking helps, citations are reliable, or user permissions are causing retrieval gaps. Keep those artifacts available for post-release review and rollback after deployment approval.
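One lightweight way to store results with deployment records is to fingerprint the run manifest, so two scorecards are compared only when they were produced under the same configuration. This is a sketch with hypothetical manifest keys and a hypothetical helper name.

```python
import hashlib
import json

def manifest_fingerprint(manifest: dict) -> str:
    """Short, order-independent hash of a run manifest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Illustrative manifest covering the items listed above.
manifest = {
    "prompt_version": "v7",
    "index_schema": "schema-3",
    "source_snapshot": "snapshot-041",
    "embedding_model": "embed-a",
    "generation_model": "chat-a",
    "retrieval_mode": "hybrid",
    "scoring_config": "default-thresholds",
}
fingerprint = manifest_fingerprint(manifest)
```

Sorting the keys before hashing means the fingerprint changes only when a value changes, not when the manifest is written in a different order.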

Security

Security impact is direct because evaluation datasets often contain realistic user questions, retrieved snippets, and generated answers. Those records can reveal sensitive documents, customer issues, policy gaps, or prompt-injection behavior. Store evaluation data in controlled projects, restrict who can view traces, and avoid using production secrets or unredacted personal data. Evaluators should include safety checks for data leakage, unsupported claims, unsafe instructions, and responses that ignore access boundaries. The most dangerous result is a high average score that hides a few severe exposure cases. Review failures by severity, not only by aggregate score. Include privacy reviewers when datasets contain realistic customer questions.
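The point about averages hiding severe cases can be made concrete. The sketch below assumes each row carries a score between 0 and 1 and an optional severity label assigned by a reviewer or evaluator. The threshold and label names are hypothetical.

```python
from statistics import mean

def release_decision(rows: list[dict], min_average: float = 0.9,
                     blocking_severities: tuple = ("high",)) -> dict:
    """Approve only if the average passes AND no blocking-severity row exists."""
    average = mean([r["score"] for r in rows])
    severe = [r["id"] for r in rows if r.get("severity") in blocking_severities]
    return {
        "average": round(average, 3),
        "severe_ids": severe,
        "approved": average >= min_average and not severe,
    }

# 48 clean rows plus 2 severe exposure cases (illustrative data).
rows = [{"id": f"q{i}", "score": 1.0, "severity": None} for i in range(48)]
rows += [{"id": "q48", "score": 0.2, "severity": "high"},
         {"id": "q49", "score": 0.2, "severity": "high"}]

decision = release_decision(rows)
```

Here the average is well above the threshold, yet the release is not approved because two rows are marked high severity.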

Cost

RAG evaluation has direct cost because every run can consume model tokens, search queries, evaluator model calls, storage, and logging. Larger datasets give more confidence in the scores but increase spend and runtime. FinOps teams should track cost per evaluation run, scheduled frequency, model choice, context size, and whether full test suites are needed for every change. A practical pattern is a small smoke set for frequent releases and a broader benchmark for major source, prompt, or model updates. The savings come from catching bad RAG behavior before it creates support load, compliance reviews, or production rework. Also track runs wasted because of environment drift.
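A rough token-based estimate shows why the smoke set and the full benchmark differ so much in cost. Every number below is a placeholder, including the price, which is not real Azure pricing. The estimate also ignores search queries, storage, and logging.

```python
def estimate_run_cost(rows: int, gen_tokens_per_row: int,
                      judge_tokens_per_row: int, evaluators: int,
                      price_per_1k_tokens: float) -> float:
    """Estimate token spend: one generation plus one judge call per evaluator."""
    total_tokens = rows * (gen_tokens_per_row + evaluators * judge_tokens_per_row)
    return round(total_tokens / 1000 * price_per_1k_tokens, 2)

PLACEHOLDER_PRICE = 0.002  # hypothetical price per 1,000 tokens

smoke_cost = estimate_run_cost(40, 3000, 2000, 4, PLACEHOLDER_PRICE)
full_cost = estimate_run_cost(400, 3000, 2000, 4, PLACEHOLDER_PRICE)
```

With these placeholder inputs the full benchmark costs ten times the smoke set, because cost scales linearly with row count. The number of evaluators multiplies the judge-token share in the same way.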

Reliability

Reliability impact is indirect but important. RAG evaluation does not keep the application running, but it prevents unstable answer quality from reaching production. Repeated evaluation runs can detect regressions after source updates, index rebuilds, prompt edits, model changes, or retrieval tuning. Reliable programs keep a stable benchmark set, run it before releases, and add new questions from production failures. They also track score variance because non-deterministic generation can make small improvements look larger than they are. A release should not pass only because one evaluation run happened to score well. Treat evaluation history as part of production change control and rollback planning.
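Score variance can be checked with repeated runs. The sketch below uses a simple heuristic, not a formal statistical test: treat a gain as real only when it exceeds a multiple of the run-to-run standard deviation. The multiplier and the sample scores are hypothetical.

```python
from statistics import mean, stdev

def is_real_improvement(baseline_runs: list[float],
                        candidate_runs: list[float],
                        noise_factor: float = 2.0) -> bool:
    """True when the mean gain is larger than noise_factor x the worst stdev."""
    gain = mean(candidate_runs) - mean(baseline_runs)
    noise = max(stdev(baseline_runs), stdev(candidate_runs))
    return gain > noise_factor * noise

# Three repeated runs per configuration (illustrative groundedness scores).
baseline_runs = [0.86, 0.88, 0.87]
small_gain_runs = [0.88, 0.87, 0.89]
large_gain_runs = [0.93, 0.94, 0.92]

small_is_real = is_real_improvement(baseline_runs, small_gain_runs)
large_is_real = is_real_improvement(baseline_runs, large_gain_runs)
```

The small gain is about the same size as the spread between repeated runs, so it is not counted as an improvement. The larger gain clears the noise threshold.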

Performance

Performance impact is mostly diagnostic. RAG evaluation can measure answer quality and also reveal latency side effects from retrieval settings, context size, reranking, and model choice. A change that improves groundedness but doubles response time may be unacceptable for a support chatbot. Evaluation should therefore record search latency, model latency, token counts, retrieved chunk count, and time to first response where possible. In large test suites, evaluation runtime itself matters, so teams may parallelize runs, sample datasets, or run quick smoke checks before the full benchmark. Quality and speed must be reviewed together. Review benchmark runtime before scheduling frequent automated gates.
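Reviewing quality and speed together can be as simple as summarizing both per configuration and applying a latency budget alongside a minimum quality gain. The row values, budget, and minimum gain below are hypothetical.

```python
from statistics import mean, median

def summarize(rows: list[dict]) -> dict:
    """Mean groundedness and median latency for one configuration."""
    return {
        "groundedness": round(mean([r["groundedness"] for r in rows]), 3),
        "median_latency_ms": median([r["latency_ms"] for r in rows]),
    }

# Illustrative per-question measurements for two retrieval settings.
vector_rows = [{"groundedness": 0.80, "latency_ms": 800},
               {"groundedness": 0.85, "latency_ms": 900},
               {"groundedness": 0.75, "latency_ms": 1000}]
hybrid_rows = [{"groundedness": 0.90, "latency_ms": 1200},
               {"groundedness": 0.95, "latency_ms": 1320},
               {"groundedness": 0.85, "latency_ms": 1500}]

before, after = summarize(vector_rows), summarize(hybrid_rows)
latency_delta_ms = after["median_latency_ms"] - before["median_latency_ms"]
quality_gain = after["groundedness"] - before["groundedness"]

# Accept only if quality improves enough and latency stays within budget.
acceptable = quality_gain >= 0.05 and latency_delta_ms <= 500
```

The median is used for latency because a few slow outliers would distort a mean. A percentile such as p95 could be added where tail latency matters.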

Operations

Operators support RAG evaluation by keeping datasets current, triggering evaluation runs, storing results, and linking failures to owners. They should know how to inspect evaluation inputs, retrieved context, answer output, model deployment, and index version for each failed question. Dashboards should show score trends, high-severity failures, unsupported-answer rate, citation coverage, and latency or token side effects. Runbooks should define when a regression blocks release, when a prompt rollback is required, and how production feedback becomes new evaluation cases. Evaluation is operational work, not a one-time lab exercise. Operators should also archive scorecards, exceptions, and remediation notes after each run and approval review.
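Turning production feedback into evaluation cases can be sketched as an append with duplicate detection. The case fields, the ticket ID format, and the helper names are hypothetical.

```python
def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so near-identical questions match."""
    return " ".join(question.lower().split())

def add_regression_case(benchmark: list[dict], question: str,
                        expected_sources: list[str], ticket_id: str) -> bool:
    """Append a complaint as a new case; return False if it already exists."""
    key = normalize(question)
    if any(normalize(case["question"]) == key for case in benchmark):
        return False
    benchmark.append({
        "question": question.strip(),
        "expected_sources": expected_sources,
        "origin": f"complaint:{ticket_id}",
    })
    return True

# Illustrative benchmark with one existing case.
benchmark = [{"question": "What is the claim limit?",
              "expected_sources": ["policy-12#s3"], "origin": "seed"}]

added = add_regression_case(
    benchmark, "Does the policy cover flood damage?", ["policy-14#s2"], "T-1001")
duplicate = add_regression_case(
    benchmark, "  does the policy cover FLOOD damage? ", ["policy-14#s2"], "T-1002")
```

Recording the origin of each case makes it possible to report later how many benchmark questions came from real production failures.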

Common mistakes

Evaluating only happy-path questions and missing ambiguous, adversarial, stale, or no-answer scenarios.
Changing prompts, indexes, and models at the same time, then being unable to explain why scores changed.
Trusting aggregate averages while ignoring a small number of severe data leakage or unsupported-answer failures.
Running evaluation against staging resources that do not match production search indexes, model deployments, or network paths.
Treating evaluation as a launch task instead of adding production failures back into the benchmark dataset.
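The second mistake in this list is easy to detect mechanically. The sketch below assumes each run stores a small manifest with hypothetical keys. It compares two manifests and flags a run where more than one factor changed.

```python
def changed_factors(previous: dict, current: dict) -> dict:
    """Return {factor: (old, new)} for every value that differs between runs."""
    keys = previous.keys() | current.keys()
    return {key: (previous.get(key), current.get(key))
            for key in sorted(keys)
            if previous.get(key) != current.get(key)}

# Illustrative manifests for the last passing run and the current run.
previous_run = {"prompt_version": "v7", "index_version": "2026-04",
                "model_deployment": "chat-a"}
current_run = {"prompt_version": "v8", "index_version": "2026-05",
               "model_deployment": "chat-a"}

diff = changed_factors(previous_run, current_run)
confounded = len(diff) > 1  # more than one change makes attribution unreliable
```

When `confounded` is true, a score change cannot be attributed to a single cause. Rerunning with one change at a time is usually the cleanest fix.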

Operator quick checks

Confirm every evaluation run records dataset version, prompt version, model deployment, index version, and retrieval mode.
Review failed rows manually before accepting a score improvement as real progress.
Check that safety and no-answer cases are included, not only common factual questions.
Compare latency and token usage alongside groundedness and relevance scores.
Make sure evaluation data storage follows the same sensitivity rules as source documents.
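The first check in this list can be automated. The sketch below assumes run records are plain dictionaries. The key names mirror the list above but are otherwise hypothetical.

```python
REQUIRED_FIELDS = ("dataset_version", "prompt_version", "model_deployment",
                   "index_version", "retrieval_mode")

def missing_fields(run_record: dict) -> list[str]:
    """List required fields that are absent, None, or blank."""
    return [name for name in REQUIRED_FIELDS
            if not str(run_record.get(name) or "").strip()]

# Illustrative record with one absent field and one blank field.
run_record = {"dataset_version": "bench-12", "prompt_version": "v7",
              "model_deployment": "chat-a", "retrieval_mode": "  "}

gaps = missing_fields(run_record)
```

Rejecting a run with a non-empty `gaps` list before scoring starts avoids scorecards that cannot be reproduced later.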

Questions to ask

What score threshold blocks release, and who can approve an exception?
Which metric matters most for this workload: retrieval relevance, groundedness, completeness, safety, or latency?
How are production complaints converted into new evaluation cases?
What changed between the previous passing run and the current failing run?
Where are evaluation datasets, outputs, and traces stored, and who can read them?
