Prompt evaluation

Prompt evaluation is how teams test whether a prompt actually works instead of relying on a good demo. You run the AI behavior against a dataset of representative inputs, expected answers, safety cases, or business rules. Evaluators score quality, grounding, safety, formatting, and task success. In Azure and Microsoft Foundry, prompt evaluation helps decide whether a prompt, model, agent, or retrieval change is ready for production. It turns subjective “looks good” reviews into measurable evidence.

Aliases: No aliases mapped yet
Difficulty: intermediate
CLI mappings: 6
Last verified: 2026-05-20

Microsoft Learn

Prompt evaluation measures how well prompts, models, agents, or AI applications perform against test data. In Microsoft Foundry, evaluations can use built-in and custom evaluators to assess quality, safety, groundedness, relevance, task success, and other metrics before or after deployment.

Microsoft Learn: Run evaluations from the Microsoft Foundry portal (2026-05-20)

Technical context

In Azure architecture, prompt evaluation spans the AI application lifecycle: it connects prompt design, model deployment, RAG configuration, and agent tools to monitoring and release gates. Microsoft Foundry supports evaluations that run against datasets and use built-in or custom evaluators. Results can inform model selection, pre-production testing, production monitoring, and CI/CD quality checks. Evaluations may require Azure OpenAI deployments for AI-assisted scoring, Foundry project roles, datasets, traces, and telemetry from Application Insights or related monitoring systems.
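
The dataset-plus-evaluators flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Foundry SDK: `DATASET`, `exact_match`, `valid_json`, `run_evaluation`, and `fake_app` are hypothetical names, and `fake_app` stands in for the real prompt and model call.

```python
import json
from statistics import mean
from typing import Callable

# Hypothetical dataset rows: an input plus the answer we expect.
DATASET = [
    {"input": "status of bag 123", "expected": '{"status": "delayed"}'},
    {"input": "status of bag 456", "expected": '{"status": "delivered"}'},
]

def exact_match(output: str, row: dict) -> float:
    """Custom evaluator: 1.0 when the output equals the expected answer."""
    return 1.0 if output.strip() == row["expected"] else 0.0

def valid_json(output: str, row: dict) -> float:
    """Formatting evaluator: 1.0 when the output parses as JSON."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def run_evaluation(
    app: Callable[[str], str],
    evaluators: dict[str, Callable[[str, dict], float]],
) -> dict[str, float]:
    """Run every dataset row through the app and average each evaluator's score."""
    per_metric: dict[str, list[float]] = {name: [] for name in evaluators}
    for row in DATASET:
        output = app(row["input"])
        for name, evaluator in evaluators.items():
            per_metric[name].append(evaluator(output, row))
    return {name: mean(values) for name, values in per_metric.items()}

def fake_app(question: str) -> str:
    # Stand-in for the real prompt + model call (assumption for this sketch).
    return '{"status": "delayed"}'

scores = run_evaluation(fake_app, {"exact_match": exact_match, "valid_json": valid_json})
```

Keeping each evaluator as a small function that returns a number makes it straightforward to add rubric-based or AI-assisted scorers later without changing the loop.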

Why it matters

Prompt evaluation matters because prompt changes can look harmless but alter accuracy, safety, cost, latency, or formatting. A team cannot judge production readiness from a few hand-picked examples. Evaluation creates repeatable evidence across common questions, edge cases, harmful inputs, missing grounding, and expected output structures. It also helps teams compare model versions, prompt variants, retrieval strategies, and agent tool behavior. In business terms, evaluation reduces release risk and creates audit evidence for responsible AI decisions. In engineering terms, it prevents regressions from reaching users unnoticed and gives operators a measurable signal when quality starts drifting after deployment. This evidence is critical for governance.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

In Microsoft Foundry, evaluation pages show datasets, evaluator choices, run status, metric scores, failed examples, thresholds, approval notes, and links back to models or agents.

Signal 02

CI/CD pipelines may run prompt evaluation jobs before deployment and block release when groundedness, safety, formatting, task-completion, latency, or quality scores miss approved thresholds.
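
A release gate of this kind might look like the sketch below. The metric names and threshold values are assumptions chosen for illustration (the four-second latency figure echoes the airline case study later on this page); a real pipeline would read scores from its own evaluation output and exit with the returned code.

```python
# Hypothetical thresholds; higher is better for quality metrics,
# lower is better for latency.
MIN_THRESHOLDS = {"groundedness": 0.85, "safety": 1.0}
MAX_THRESHOLDS = {"p95_latency_s": 4.0}

def failed_metrics(scores: dict[str, float]) -> list[str]:
    """Return the metrics that miss their threshold; a missing metric counts as a failure."""
    failures = []
    for metric, minimum in MIN_THRESHOLDS.items():
        if scores.get(metric, float("-inf")) < minimum:
            failures.append(metric)
    for metric, maximum in MAX_THRESHOLDS.items():
        if scores.get(metric, float("inf")) > maximum:
            failures.append(metric)
    return failures

def gate_exit_code(scores: dict[str, float]) -> int:
    """Return 0 when the release may proceed and 1 when it should be blocked."""
    failures = failed_metrics(scores)
    for metric in failures:
        print(f"BLOCKED: {metric} = {scores.get(metric)}")
    return 1 if failures else 0
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when an evaluator did not run gives false assurance.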

Signal 03

Application Insights and Foundry observability dashboards connect evaluation failures with traces, token consumption, latency, tool calls, content-safety events, quality scores, and production monitoring after deployment.

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Block a prompt release when groundedness scores fall below threshold for regulated customer-support answers.
  • Compare prompt variants against the same dataset before changing a production AI assistant’s behavior.
  • Validate that a model upgrade preserves JSON formatting, refusal behavior, and task completion quality.
  • Add incident examples to a regression dataset so a fixed failure does not quietly return later.
  • Measure safety, relevance, and tool-call accuracy together before approving an agent for broader traffic.
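
Comparing prompt variants against the same dataset, as in the second scenario above, works best as a paired comparison over per-example scores. The following is a minimal sketch; `compare_variants` and the example IDs are hypothetical.

```python
from statistics import mean

def compare_variants(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Compare per-example scores keyed by example ID.

    Both runs must cover exactly the same examples, otherwise the
    comparison is not like-for-like.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("variants were scored on different examples")
    regressed = sorted(k for k in baseline if candidate[k] < baseline[k])
    return {
        "baseline_mean": mean(baseline.values()),
        "candidate_mean": mean(candidate.values()),
        "regressed_examples": regressed,
    }

# Illustrative scores: the means match, yet one example got worse.
result = compare_variants(
    {"q1": 1.0, "q2": 0.0, "q3": 1.0},
    {"q1": 1.0, "q2": 1.0, "q3": 0.0},
)
```

Listing regressed examples alongside the means shows when a candidate with the same or better average still breaks a case that used to pass.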

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Airline baggage assistant earns release gate

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

An airline operations team prepared an AI assistant for baggage claim questions. Manual demos looked promising, but leaders needed evidence before exposing it to travelers.

Business/Technical Objectives

  • Measure answer accuracy for common baggage scenarios.
  • Validate refusal behavior for compensation promises outside policy.
  • Keep response time acceptable during airport disruption events.
  • Create a release gate that support leaders could understand.

Solution Using Prompt evaluation

The AI team built a prompt evaluation dataset from historical baggage questions, policy articles, delayed-flight scenarios, and adversarial refund requests. Microsoft Foundry evaluations measured relevance, groundedness, safety, and formatting against expected outputs. Failed cases were reviewed with baggage policy owners, then added to the regression set. The team also tracked latency and token use because a highly detailed answer was not useful during disruption traffic. Azure CLI captured the OpenAI deployment, region, diagnostics, and resource IDs for the approval packet, so executives knew the evaluation environment matched production assumptions. The dataset was kept under support-team ownership so new disruption scenarios could be added quickly.

Results & Business Impact

  • Groundedness improved from 76% to 91% after two prompt revisions.
  • Unsafe compensation promises fell to zero in the release-blocking test set.
  • P95 response latency stayed under the four-second target.
  • Support leadership approved rollout with a documented rollback threshold.

Key Takeaway for Glossary Readers

Prompt evaluation gives non-AI stakeholders measurable evidence for deciding whether an AI behavior is safe to release.

Case study 02

University tutor checks explanation quality

Scenario

A university built a tutoring assistant for introductory statistics courses. Faculty wanted helpful explanations without giving direct answers to graded homework.

Business/Technical Objectives

  • Measure whether responses guided students instead of solving assignments outright.
  • Detect hallucinated formulas and unsupported definitions.
  • Compare two prompt variants before semester launch.
  • Keep evaluation results accessible to faculty reviewers.

Solution Using Prompt evaluation

The academic technology team created an evaluation dataset from practice questions, homework-like prompts, misconception examples, and accessibility requests. Microsoft Foundry evaluators measured coherence, task adherence, groundedness, and harmful academic-integrity behavior. A custom rubric evaluator scored whether the assistant used hints, questions, and step-by-step reasoning without providing final graded answers. Prompt variants were tested against the same dataset, while Azure CLI documented the deployment and monitoring settings used during evaluation. Faculty reviewed failed examples and added new cases for common student misconceptions before the production release. The dataset was versioned by course term so future changes could be compared fairly.

Results & Business Impact

  • The selected prompt improved tutoring-rubric score by 24% over the original version.
  • Direct-answer violations dropped from 19% to 4% in homework-like cases.
  • Faculty review time fell by 35% because failed examples were grouped by metric.
  • The assistant launched with a published evaluation threshold for future changes.

Key Takeaway for Glossary Readers

Prompt evaluation helps educators balance helpful AI guidance with policy boundaries and measurable learning goals.

Case study 03

Energy analyst prevents report hallucinations

Scenario

An energy trading desk used an AI assistant to summarize market briefings. Analysts noticed that prompt changes sometimes inserted unsupported price drivers into morning reports.

Business/Technical Objectives

  • Reduce unsupported claims in generated market summaries.
  • Compare model and prompt changes before analyst distribution.
  • Track latency and token cost during high-volume report windows.
  • Keep an auditable trail for compliance review.

Solution Using Prompt evaluation

The engineering group created prompt evaluations using prior market briefings, approved analyst notes, and deliberately conflicting source snippets. Microsoft Foundry evaluators scored groundedness, relevance, citation quality, and output structure. Reports with unsupported price drivers were marked as release blockers. The team ran evaluation suites whenever prompts, retrieval filters, or model deployments changed. Azure CLI exports captured the resource group, deployment, diagnostic settings, and region for each evaluation run. Application Insights traces helped investigators connect low scores to retrieval gaps, prompt wording, or model behavior. Analysts also tagged each failed example with the trading desk process it affected.

Results & Business Impact

  • Unsupported market-driver statements fell by 47% across the regression set.
  • Prompt review time before daily release dropped from 90 minutes to 38 minutes.
  • Token cost stayed within the approved budget after shorter prompt variants were chosen.
  • Compliance reviewers received evaluation evidence within one business day.

Key Takeaway for Glossary Readers

Prompt evaluation is a practical control for AI workflows where unsupported wording can create business and compliance risk.

Why use Azure CLI for this?

As an Azure engineer with ten years in production environments, I use Azure CLI around prompt evaluation to verify the platform facts behind evaluation scores. The evaluation result is only meaningful if I know which project, model deployment, endpoint, region, storage, identity, and diagnostic settings were used. CLI helps capture those facts repeatably for release reviews, audits, and incident timelines. Portal evaluation screens are helpful for analysis, but scripts make it easier to compare environments, export evidence, and confirm that failed scores were not caused by deployment drift or missing monitoring. I also use it to preserve evaluation context when teams revisit old release decisions.

CLI use cases

  • Inventory Foundry, Azure OpenAI, and monitoring resources before running a formal evaluation batch.
  • Show the model deployment used by AI-assisted evaluators so scores can be reproduced later.
  • Check diagnostic settings when evaluation failures require traces, token metrics, or safety-event evidence.
  • Export resource IDs and configuration for audit records attached to responsible AI release approval.
  • Compare staging and production deployment settings when the same prompt receives different evaluation scores.
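
For the last use case above, one approach is to save the JSON from two environments (for example with `az ml workspace show --name <workspace> --resource-group <resource-group> --output json`) and diff the settings offline. The sketch below assumes that workflow; `flatten`, `config_drift`, and the sample documents are hypothetical, and the sample fields are illustrative, not a full reproduction of real CLI output.

```python
import json

def flatten(obj, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted-path keys so settings can be compared."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    else:
        # Lists and scalars are treated as leaf values.
        flat[prefix.rstrip(".")] = obj
    return flat

def config_drift(staging_json: str, production_json: str) -> dict:
    """Return {setting: (staging_value, production_value)} for every difference."""
    staging = flatten(json.loads(staging_json))
    production = flatten(json.loads(production_json))
    return {
        key: (staging.get(key), production.get(key))
        for key in sorted(staging.keys() | production.keys())
        if staging.get(key) != production.get(key)
    }

# Illustrative documents only.
STAGING = '{"name": "ws-staging", "location": "eastus", "tags": {"eval-suite": "v3"}}'
PRODUCTION = '{"name": "ws-prod", "location": "westeurope", "tags": {"eval-suite": "v3"}}'

drift = config_drift(STAGING, PRODUCTION)
```

Expected differences such as the resource name can be filtered out afterwards; the useful signal is any unexpected entry, such as a different region or a missing setting.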

Before you run CLI

  • Confirm tenant, subscription, resource group, Foundry project or Azure OpenAI account, deployment, region, and role assignment.
  • Check whether evaluation datasets contain sensitive prompts, outputs, customer records, or regulated examples before export.
  • Understand cost risk from large test datasets, AI-assisted scoring, repeated runs, and high-token prompts.
  • Verify provider registration and permissions for Azure OpenAI, Monitor, storage, and any connected search resources.
  • Use JSON output and record evaluation run IDs, timestamps, thresholds, and model versions for reproducibility.

What output tells you

  • Resource output confirms the tenant, project, deployment, and region used to generate or score evaluation responses.
  • Diagnostic settings reveal whether traces and metrics exist to explain low evaluation scores or failed test cases.
  • Metric output can show whether latency, throttling, or token growth contributed to a quality regression.
  • Storage and workspace IDs identify where datasets, traces, and evidence may need access review or retention controls.
  • Deployment details show whether a model change, rather than a prompt change, influenced the evaluation result.

Mapped Azure CLI commands

ML operations

Mapping type: direct

az ml workspace list --resource-group <resource-group>
    Command group: az ml workspace | Action: discover | Category: AI and Machine Learning
az ml workspace show --name <workspace> --resource-group <resource-group>
    Command group: az ml workspace | Action: discover | Category: AI and Machine Learning
az ml workspace create --name <workspace> --resource-group <resource-group> --location <region>
    Command group: az ml workspace | Action: provision | Category: AI and Machine Learning
az ml compute list --workspace-name <workspace> --resource-group <resource-group>
    Command group: az ml compute | Action: discover | Category: AI and Machine Learning
az ml model list --workspace-name <workspace> --resource-group <resource-group>
    Command group: az ml model | Action: discover | Category: AI and Machine Learning
az ml online-endpoint list --workspace-name <workspace> --resource-group <resource-group>
    Command group: az ml online-endpoint | Action: discover | Category: AI and Machine Learning

Architecture context

As an Azure architect, I design prompt evaluation as a control plane for AI quality. The evaluation dataset, evaluators, model deployment, retrieval sources, and metrics must be versioned together, otherwise scores become hard to trust. Evaluation should run before production release, after major model or prompt changes, and periodically against live failure patterns. For RAG systems, I want groundedness, relevance, citation quality, and unsupported-answer metrics. For agents, I want tool-call accuracy and task completion. Evaluation is also connected to observability: traces, token usage, latency, and safety events help explain why a score changed instead of just reporting that it changed.
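
One way to version the dataset, evaluators, deployment, and retrieval sources together is to record them in a single manifest and derive a fingerprint that is stored with every score. This is a minimal sketch; the manifest fields and `evaluation_fingerprint` are hypothetical, not a Foundry feature.

```python
import hashlib
import json

def evaluation_fingerprint(manifest: dict) -> str:
    """Hash a canonical JSON form of the manifest so key order does not matter."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical manifest describing what an evaluation run used.
manifest = {
    "dataset_version": "baggage-v3",
    "prompt_version": "p-12",
    "model_deployment": "deploy-a",
    "evaluators": ["groundedness", "relevance"],
    "retrieval_index": "policies-2026-05",
}
fingerprint = evaluation_fingerprint(manifest)
```

Two runs with the same fingerprint were evaluated under the same recorded conditions, so their scores can be compared directly; a different fingerprint signals that something besides the prompt may have changed.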

Security

Security impact is direct because evaluations can test jailbreaks, prompt injection, unsafe content, data leakage, and tool misuse before users encounter them. Evaluation datasets may contain sensitive examples, so access control, redaction, storage security, and retention matter, and access reviews should cover datasets, exported results, and traces. AI-assisted evaluators can send test inputs and outputs to a model deployment, so teams must know which tenant, project, region, and model are used. Secure evaluation also verifies that prompts refuse disallowed requests, respect grounding boundaries, and do not reveal secrets. Scores should not replace authorization checks, but they provide evidence that application controls are working under realistic adversarial and edge-case conditions.
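
Redacting evaluation examples before export can start with simple pattern substitution. The sketch below is illustrative only: the two patterns and the `redact` helper are assumptions, and they cover far less than a real PII detection service would.

```python
import re

# Hypothetical patterns: email addresses and long digit runs
# (for example account or booking numbers).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{6,}\b")

def redact(text: str) -> str:
    """Replace emails and long numbers with placeholders before export."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)

sample = redact("Contact jane.doe@example.com about bag 12345678")
```

Short numbers such as gate or seat identifiers are left untouched by design, which keeps examples readable but also means this is not sufficient on its own for regulated data.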

Cost

Cost impact comes from evaluation runs, model calls used for scoring, dataset size, storage, traces, and engineering review time. Large evaluation suites can become expensive if they run frequently against high-cost deployments or include oversized prompts. However, weak evaluation can cost more by allowing poor answers, support escalations, rework, and unsafe releases. FinOps owners should track evaluation frequency, test-case count, input and output tokens, evaluator model choice, and failed-release avoidance. Teams can control cost by tiering datasets into smoke, regression, and full-release suites and by running expensive evaluators only when risk justifies them. Clear thresholds prevent expensive full evaluations from running when smoke tests already fail.
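
The tiering and cost-tracking ideas above can be sketched as follows. The per-token prices are placeholder assumptions, not published rates, and `estimate_cost` and `run_tiers` are hypothetical helpers.

```python
from typing import Callable

# Hypothetical prices per 1,000 tokens; substitute the rates that apply
# to your evaluator deployment.
PRICE_IN_PER_1K = 0.01
PRICE_OUT_PER_1K = 0.03

def estimate_cost(cases: int, avg_input_tokens: int, avg_output_tokens: int) -> float:
    """Estimate model-call cost for one evaluation run."""
    per_case = (avg_input_tokens / 1000) * PRICE_IN_PER_1K \
        + (avg_output_tokens / 1000) * PRICE_OUT_PER_1K
    return cases * per_case

def run_tiers(tiers: list[str], run_suite: Callable[[str], bool]) -> list[str]:
    """Run suites from cheapest to most expensive and stop at the first failure.

    Returns the names of the suites that were actually executed.
    """
    executed = []
    for name in tiers:
        executed.append(name)
        if not run_suite(name):
            break
    return executed
```

With these placeholder prices, 200 cases averaging 1,000 input and 500 output tokens come to about 5.00 in the chosen currency per run; if the smoke tier fails, the regression and full tiers never incur that cost.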

Reliability

Reliability impact is significant because evaluation catches behavioral regressions that infrastructure health checks miss. The service can be online while the AI feature returns unsupported answers, malformed JSON, wrong tool calls, or incorrect refusal behavior. Reliable teams maintain stable evaluation datasets, add new cases from incidents, and compare scores before release. They also track evaluator drift, model version changes, and dataset freshness. Evaluations should include normal, edge, and failure inputs so operators understand blast radius before deployment. When production monitoring shows quality degradation, evaluation reruns help determine whether the root cause is prompt wording, retrieval, model behavior, or tool integration. Baseline scores should be preserved so a rollback can be verified against them.
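
Two of the practices above, adding incident cases to the regression set and comparing scores with a preserved baseline, can be sketched like this. The helper names, the `id` field, and the 0.02 tolerance are assumptions for illustration.

```python
def add_incident_case(dataset: list[dict], case: dict) -> bool:
    """Append an incident-derived case unless its ID is already present."""
    if any(existing["id"] == case["id"] for existing in dataset):
        return False
    dataset.append(case)
    return True

def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple]:
    """Return metrics that dropped by more than the tolerance, or went missing."""
    return {
        metric: (baseline[metric], current.get(metric))
        for metric in baseline
        if current.get(metric) is None or current[metric] < baseline[metric] - tolerance
    }

regression_set = [{"id": "inc-001", "input": "refund for lost bag", "expected": "refusal"}]
added = add_incident_case(
    regression_set,
    {"id": "inc-002", "input": "promise compensation", "expected": "refusal"},
)
```

The tolerance absorbs small run-to-run noise from AI-assisted evaluators, so the check flags meaningful drops without alerting on every minor fluctuation.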

Performance

Performance impact is mostly diagnostic, but it affects runtime decisions. Evaluations can measure latency, token usage, output length, and tool-call count alongside quality scores. A prompt variant with better wording may still be unacceptable if it doubles response time or requires too many retries. Evaluation helps teams compare quality against performance tradeoffs before users feel the change. Operators should separate evaluator runtime from application runtime, because a slow evaluation job does not always mean the production endpoint is slow. The key performance value is making speed, cost, and quality visible in the same release decision.
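
The quality-versus-latency tradeoff described above can be made explicit by computing a P95 latency per variant and selecting the best-scoring variant that stays within budget. This sketch uses the nearest-rank percentile method; `p95`, `pick_variant`, and the variant records are hypothetical.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]

def pick_variant(variants: list[dict], max_p95_s: float):
    """Return the name of the highest-quality variant within the latency budget."""
    eligible = [v for v in variants if p95(v["latencies_s"]) <= max_p95_s]
    if not eligible:
        return None
    return max(eligible, key=lambda v: v["quality"])["name"]

# Illustrative measurements: variant A scores higher but is too slow.
variants = [
    {"name": "A", "quality": 0.92, "latencies_s": [3.0, 5.0, 6.0]},
    {"name": "B", "quality": 0.88, "latencies_s": [1.0, 2.0, 3.0]},
]
choice = pick_variant(variants, max_p95_s=4.0)
```

The latencies here should come from the application under test, not from the evaluation job itself, in line with the separation of evaluator runtime and application runtime.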

Operations

Operators use prompt evaluation to create release gates, incident evidence, and continuous quality feedback. Work includes preparing datasets, selecting evaluators, running evaluations, reviewing metrics, investigating failed examples, and updating runbooks. Azure CLI supports the surrounding work by verifying the Foundry or Azure OpenAI resources, deployments, monitoring configuration, and storage locations involved. Teams should document who owns datasets, who can approve thresholds, how results are retained, and how failures trigger rollback. Practical operations also include comparing scores across environments, exporting reports for compliance, and turning real production failures into new regression tests. That makes evaluation a repeatable operational control, not a one-time experiment.

Common mistakes

  • Judging a prompt from manual demos instead of representative datasets and measurable thresholds.
  • Mixing model, retrieval, and prompt changes in one evaluation so the failed variable is unclear.
  • Using sensitive production examples in evaluation exports without redaction, access control, or retention rules.
  • Treating one high aggregate score as proof that edge cases, safety, and formatting are acceptable.
  • Ignoring cost and latency metrics when a prompt improves quality but becomes too slow or expensive.
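
The aggregate-score mistake above is easy to demonstrate: a healthy overall mean can hide a slice that fails completely. A minimal sketch, assuming each result carries a hypothetical `slice` label such as `common` or `edge_case`:

```python
from collections import defaultdict
from statistics import mean

def scores_by_slice(results: list[dict]) -> dict[str, float]:
    """Average scores separately for each labeled slice of the dataset."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for result in results:
        buckets[result["slice"]].append(result["score"])
    return {name: mean(values) for name, values in buckets.items()}

# Illustrative results: eight common cases pass, two edge cases fail.
results = (
    [{"slice": "common", "score": 1.0} for _ in range(8)]
    + [{"slice": "edge_case", "score": 0.0} for _ in range(2)]
)
overall = mean(r["score"] for r in results)
by_slice = scores_by_slice(results)
```

Here the overall mean is 0.8 while every edge case fails, which is why thresholds are better set per slice than only on the aggregate.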