Prompt flow evaluation

Prompt flow evaluation is a way to test a Prompt Flow run by feeding its outputs into another flow that calculates scores or metrics. Instead of reading every answer manually, teams define evaluation inputs, ground truth, scoring logic, and aggregation. This helps compare prompt variants, measure quality, and decide whether a flow meets its quality thresholds. In 2026, it is especially relevant for teams maintaining legacy Prompt Flow workloads while planning migration before the announced retirement date.

Aliases: No aliases mapped yet
Difficulty: intermediate
CLI mappings: 4
Last verified: 2026-05-20

Microsoft Learn

Prompt flow evaluation uses a special evaluation flow to score outputs from another prompt flow, often during batch runs. It can calculate metrics for accuracy, groundedness, quality, or task fit. Prompt Flow workloads also face the April 2027 retirement timeline.

Microsoft Learn: Evaluation flow and metrics in prompt flow (2026-05-20)

Technical context

In Azure architecture, prompt flow evaluation belongs to the Azure Machine Learning and Foundry classic workflow around Prompt Flow. An evaluation flow is a special flow type that consumes outputs from a standard flow or batch run and produces scores, metrics, and aggregate results. It may use ground-truth columns, generated answers, questions, context, custom Python logic, or model-assisted scoring. It connects to datasets, workspaces, compute, storage, identities, and run history. Current designs should account for Prompt Flow retirement and migration planning.
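
The two halves of an evaluation flow described above, per-row scoring and run-level aggregation, can be sketched in plain Python. This is a minimal illustration, not the product's implementation: the function names, the exact-match rule, and the sample rows are assumptions, and the Prompt Flow tool decorators and metric logging are omitted so the sketch runs on its own.

```python
from typing import Dict, List


def grade_row(answer: str, ground_truth: str) -> str:
    """Per-row scoring: case- and whitespace-insensitive exact match."""
    same = answer.strip().lower() == ground_truth.strip().lower()
    return "Correct" if same else "Incorrect"


def aggregate(grades: List[str]) -> Dict[str, float]:
    """Aggregation: fold per-row grades into run-level metrics."""
    total = len(grades)
    correct = sum(1 for grade in grades if grade == "Correct")
    return {
        "row_count": float(total),
        "accuracy": correct / total if total else 0.0,
    }


# Illustrative rows: generated answers paired with ground-truth values.
rows = [
    {"answer": "Paris", "ground_truth": "paris"},
    {"answer": "Lyon", "ground_truth": "Marseille"},
]
grades = [grade_row(row["answer"], row["ground_truth"]) for row in rows]
metrics = aggregate(grades)
print(grades, metrics)
```

Keeping the per-row step separate from the aggregation step mirrors the split between row-level scores and aggregate results, and makes it possible to inspect failed rows without recomputing the whole run.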

Why it matters

Prompt flow evaluation matters because it turns a flow run into measurable evidence. A Prompt Flow application might look acceptable in a small manual test but fail on edge cases, return malformed output, miss grounding, or produce poor classifications at scale. Evaluation flows help teams score each row of a batch dataset and aggregate results across the full run. That supports prompt tuning, variant comparison, release gates, and migration validation. With Prompt Flow retirement approaching, evaluation results also help teams prove that a replacement workflow behaves as well as the legacy flow before the old asset is frozen or removed.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

Prompt Flow run history shows evaluation runs, scored outputs, logged metrics, failed rows, batch datasets, aggregate results, thresholds, baselines, and links to the tested flow.

Signal 02

Azure Machine Learning workspace assets expose compute, storage, connections, identities, jobs, datasets, and artifacts used to run evaluation flows and preserve metric evidence before approvals.

Signal 03

Migration dashboards or release reports compare legacy Prompt Flow evaluation baselines against replacement workflows before traffic is moved, endpoints are retired, or parity is accepted.

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Score a legacy Prompt Flow batch run against ground-truth answers before approving a prompt variant.
  • Create a baseline metric set so a replacement workflow can prove parity before Prompt Flow retirement.
  • Detect failed rows where a flow returns malformed output, missing context, or incorrect classifications.
  • Aggregate quality, groundedness, and accuracy metrics across large datasets instead of reviewing outputs manually.
  • Troubleshoot evaluation failures caused by missing columns, broken connections, compute issues, or evaluator drift.
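
Two of the situations above, missing columns and malformed output, can be caught with a pre-scoring check. The sketch below is one possible approach under stated assumptions: the column names, the expectation that the flow output is a JSON object with a `label` field, and the sample rows are all hypothetical.

```python
import json
from typing import Dict, Optional

# Hypothetical column names; use whatever the evaluation dataset defines.
REQUIRED_COLUMNS = ("question", "ground_truth", "output")


def check_row(row: Dict[str, str]) -> Optional[str]:
    """Return a failure reason for the row, or None if it can be scored."""
    missing = [col for col in REQUIRED_COLUMNS if not row.get(col)]
    if missing:
        return "missing columns: " + ", ".join(missing)
    try:
        parsed = json.loads(row["output"])
    except json.JSONDecodeError:
        return "malformed output: not valid JSON"
    if not isinstance(parsed, dict) or "label" not in parsed:
        return "malformed output: no 'label' field"
    return None


rows = [
    {"question": "q1", "ground_truth": "billing", "output": '{"label": "billing"}'},
    {"question": "q2", "ground_truth": "refund", "output": "billing"},
    {"question": "q3", "output": '{"label": "refund"}'},
]

# Map row index to failure reason so failed rows can be reviewed separately.
failed = {}
for index, row in enumerate(rows):
    reason = check_row(row)
    if reason is not None:
        failed[index] = reason
print(failed)
```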

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Media localization team compares variants

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A media localization group used Prompt Flow to translate short promotional descriptions into multiple languages. Editors needed a measurable way to compare prompt variants before a seasonal launch.

Business/Technical Objectives
  • Score translation quality across eight target languages.
  • Identify malformed outputs before descriptions reached publishing tools.
  • Compare two legacy flow variants against the same dataset.
  • Capture a baseline for future migration planning.

Solution Using Prompt flow evaluation

The team built a prompt flow evaluation that consumed outputs from each translation flow run and compared them with editor-approved examples. The evaluation flow calculated formatting pass rate, terminology consistency, and a reviewer rubric score. It also logged failed rows where required fields were missing. Azure CLI was used to record the Azure Machine Learning workspace, compute, endpoint, and storage configuration tied to each run. Editors reviewed only the lowest-scoring examples instead of every output. The best variant became the release candidate, and its metrics were saved as the baseline for replacement testing before Prompt Flow retirement.

Results & Business Impact
  • Editor review workload dropped by 46% for the launch dataset.
  • Malformed output rate fell from 12% to 2% after the winning variant was selected.
  • Terminology consistency improved by 21% across priority languages.
  • A migration baseline was created six months before planned platform replacement.

Key Takeaway for Glossary Readers

Prompt flow evaluation helps teams compare legacy flow variants with evidence instead of opinion.

Case study 02

Mining safety flow validates incident triage

Scenario

A mining company used a Prompt Flow workflow to classify safety reports from remote sites. The operations team needed confidence that changes would not miss severe incidents.

Business/Technical Objectives
  • Measure classification accuracy for severity and incident type.
  • Catch missing required fields before reports entered the safety system.
  • Keep evaluation evidence for compliance reviews.
  • Use baseline results to plan migration away from Prompt Flow.

Solution Using Prompt flow evaluation

Engineers created an evaluation flow that received the triage flow’s output, the original report text, and ground-truth labels from safety officers. The evaluator scored severity accuracy, incident-category accuracy, required-field completeness, and escalation correctness. Failed rows were grouped by mine site and incident type so safety leaders could review patterns. Azure CLI documented the workspace, compute, storage, and diagnostic settings used by evaluation runs. The team also saved baseline metrics before rebuilding the workflow in a new orchestration pattern, ensuring the replacement would not weaken high-severity detection. Safety officers owned threshold approval so the evaluation reflected field risk, not only engineering preference.

Results & Business Impact
  • High-severity recall improved from 88% to 96% after prompt and rule changes.
  • Required-field completeness reached 99% in the final evaluation run.
  • Compliance evidence packages were produced in hours instead of several days.
  • Replacement design had a clear parity target before migration started.

Key Takeaway for Glossary Readers

Prompt flow evaluation is valuable when migration cannot compromise safety-critical behavior.

Case study 03

Automotive warranty team proves migration parity

Scenario

An automotive manufacturer depended on a Prompt Flow workflow to summarize warranty claims. A migration program needed proof that the new service matched the legacy flow’s behavior.

Business/Technical Objectives
  • Capture baseline quality metrics from the existing flow.
  • Compare replacement workflow outputs against the same warranty dataset.
  • Detect regressions in part-number extraction and claim routing.
  • Avoid running duplicate systems longer than necessary.

Solution Using Prompt flow evaluation

Before migration, the team ran a prompt flow evaluation against a dataset of historical warranty claims, approved summaries, part numbers, and routing decisions. The evaluation flow scored summary accuracy, part-number extraction, routing match, and JSON validity. The replacement application then processed the same dataset, and engineers compared results against the baseline. Azure CLI exports recorded the workspace, compute, deployment, and storage environment used by the legacy evaluation. Failed cases were reviewed by warranty analysts, who found that one new prompt missed regional warranty exceptions and required correction before cutover. The migration team reran the baseline after every significant replacement-service change.

Results & Business Impact
  • Replacement summary accuracy matched the legacy baseline within the approved 2% tolerance.
  • Part-number extraction improved from 91% to 95%.
  • Duplicate runtime was reduced by four weeks because parity evidence was ready.
  • One regional warranty regression was fixed before any traffic moved.

Key Takeaway for Glossary Readers

Prompt flow evaluation can turn a risky legacy migration into a measurable parity exercise.

Why use Azure CLI for this?

As an Azure engineer with ten years of operational work, I use Azure CLI around prompt flow evaluation because a score is only trustworthy if the workspace and run context are clear. CLI helps confirm the Azure Machine Learning workspace, compute, endpoints, storage, identity, and monitoring configuration used by the evaluation. It also supports migration reporting: I can export facts about the legacy flow environment and document baseline evidence before the flow is compared with a replacement, migrated, or retired. The portal is useful for reviewing results, but CLI gives repeatable inventory and evidence for release gates, audits, and troubleshooting.

CLI use cases

  • Inspect the Azure Machine Learning workspace and compute used to run evaluation flows.
  • List jobs, endpoints, or deployments around the tested flow before comparing evaluation results.
  • Check storage and identity dependencies when evaluation datasets or outputs cannot be accessed.
  • Export resource configuration for migration parity evidence between legacy and replacement workflows.
  • Validate diagnostic settings before investigating failed rows, slow evaluation runs, or missing metrics.

Before you run CLI

  • Confirm tenant, subscription, resource group, workspace, compute, dataset location, and permissions for evaluation assets.
  • Check whether evaluation inputs or outputs contain sensitive prompts, generated text, or ground-truth labels.
  • Understand cost risk before launching large batch evaluations or model-assisted scoring jobs.
  • Avoid destructive endpoint or workspace commands while investigating a flow that may still serve production traffic.
  • Use JSON output and record flow version, evaluation flow version, dataset name, run ID, and threshold date.
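
The last item above, recording run identifiers alongside JSON output, could look like the following sketch. The record shape and field names are assumptions chosen to match the list, the angle-bracket values are placeholders, and `threshold_date` is read here as the date the thresholds in force were approved, which is only one interpretation.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvaluationRunRecord:
    """Hypothetical record of the identifiers that make a score traceable."""
    flow_version: str
    evaluation_flow_version: str
    dataset_name: str
    run_id: str
    threshold_date: str  # assumed meaning: date the thresholds were approved


record = EvaluationRunRecord(
    flow_version="<flow-version>",
    evaluation_flow_version="<evaluation-flow-version>",
    dataset_name="<dataset-name>",
    run_id="<run-id>",
    threshold_date="<yyyy-mm-dd>",
)

# Sorted keys keep the manifest stable, so two exports can be diffed.
manifest = json.dumps(asdict(record), indent=2, sort_keys=True)
print(manifest)
```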

What output tells you

  • Workspace and compute output show where evaluation flows run and whether capacity may explain slow or failed jobs.
  • Job or run metadata identifies the dataset, timestamps, status, and artifacts tied to each evaluation result.
  • Identity and storage fields reveal access problems when an evaluation flow cannot read inputs or write metrics.
  • Endpoint and deployment details show which tested flow version produced outputs being scored by the evaluator.
  • Diagnostic configuration indicates whether failed rows, node errors, and timing details can be investigated later.

Mapped Azure CLI commands

AI operations (adjacent)

az cognitiveservices account list --resource-group <resource-group>
Command group: az cognitiveservices account; action: discover; category: AI and Machine Learning

az cognitiveservices account show --name <account-name> --resource-group <resource-group>
Command group: az cognitiveservices account; action: discover; category: AI and Machine Learning

az cognitiveservices account create --name <account-name> --resource-group <resource-group> --kind <kind> --sku S0 --location <region>
Command group: az cognitiveservices account; action: provision; category: AI and Machine Learning

az cognitiveservices account delete --name <account-name> --resource-group <resource-group>
Command group: az cognitiveservices account; action: remove; category: AI and Machine Learning
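
When JSON output from the `show` command is saved as evidence, a small script can reduce it to the fields worth recording. This is a sketch under assumptions: the field names (`kind`, `location`, `sku.name`, `properties.provisioningState`) reflect the usual shape of Azure resource JSON and should be checked against real output, and the sample document is illustrative rather than captured from a live account.

```python
import json
from typing import Any, Dict


def summarize_account(raw_json: str) -> Dict[str, Any]:
    """Pick out inventory fields from saved `account show` JSON output.

    Field names are assumptions; missing fields come back as None.
    """
    account = json.loads(raw_json)
    properties = account.get("properties") or {}
    sku = account.get("sku") or {}
    return {
        "name": account.get("name"),
        "kind": account.get("kind"),
        "location": account.get("location"),
        "sku": sku.get("name"),
        "provisioning_state": properties.get("provisioningState"),
    }


# Illustrative sample only, using the same placeholders as the commands.
sample = json.dumps({
    "name": "<account-name>",
    "kind": "<kind>",
    "location": "<region>",
    "sku": {"name": "S0"},
    "properties": {"provisioningState": "Succeeded"},
})
summary = summarize_account(sample)
print(summary)
```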

Architecture context

As an Azure architect, I treat prompt flow evaluation as both a quality tool and a migration safety net. The evaluation flow depends on the tested flow’s outputs, dataset shape, scoring logic, compute, model connections, and metric logging. If any of those change without versioning, the score loses meaning. For legacy workloads, I want a baseline evaluation before migration, then the same or equivalent dataset run against the replacement architecture. Evaluation flows also need data governance because ground truth, generated answers, and traces can include sensitive content. The architectural goal is reproducible scoring, not just another notebook full of one-off checks.
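
The baseline-then-replacement comparison described above can be reduced to a small parity check. The sketch below assumes higher values are better for every metric and that a single absolute tolerance applies; the metric names and numbers are illustrative, not results from a real run.

```python
from typing import Dict, List


def parity_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float,
) -> List[str]:
    """List metrics where the candidate falls more than `tolerance` below baseline."""
    regressions = []
    for name, base_value in sorted(baseline.items()):
        cand_value = candidate.get(name)
        if cand_value is None:
            regressions.append(f"{name}: missing from candidate run")
        elif cand_value < base_value - tolerance:
            regressions.append(
                f"{name}: {cand_value:.3f} vs baseline {base_value:.3f}"
            )
    return regressions


# Hypothetical metric sets for a legacy flow and its replacement.
baseline = {"summary_accuracy": 0.92, "routing_match": 0.95, "json_validity": 0.99}
candidate = {"summary_accuracy": 0.91, "routing_match": 0.90, "json_validity": 0.99}

regressions = parity_regressions(baseline, candidate, tolerance=0.02)
print(regressions)
```

An empty list means the replacement is within tolerance on every baseline metric; a metric missing from the candidate run is reported rather than silently skipped.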

Security

Security impact is indirect but important. Prompt flow evaluation often processes prompts, generated answers, retrieved context, and ground-truth labels, which can contain confidential or regulated data. Access to evaluation datasets, run outputs, logs, and metric artifacts should be limited to approved reviewers. If an evaluation flow uses model-assisted scoring, teams must understand where evaluation content is sent and which deployment handles it. Security tests should include prompt injection, unsafe answers, data leakage, and tool misuse where relevant. During migration, evaluation evidence should not be exported into unmanaged files or scripts that bypass normal data protection controls. Reviewers should handle outputs as protected evidence.

Cost

Cost impact comes from evaluation compute, model calls, storage, logs, and reviewer time. A large batch evaluation can consume significant tokens if each row calls a model for answer generation or scoring. Custom evaluators can also require compute that sits idle between runs. Still, evaluation can prevent larger costs by catching poor releases, reducing manual review, and validating migration before duplicate systems run for months. FinOps reviews should track dataset size, run frequency, evaluator model, compute duration, storage retention, and repeated failed runs. Teams should keep lightweight smoke evaluations separate from full regression suites. Smaller smoke datasets help catch obvious failures before full regression runs begin.
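
A rough token-cost estimate makes the difference between a smoke suite and a full regression suite concrete. The arithmetic below is a sketch: the row counts, calls per row, tokens per call, and especially the price per 1,000 tokens are hypothetical placeholders, not published rates.

```python
def estimate_model_cost(
    rows: int,
    calls_per_row: int,
    tokens_per_call: int,
    price_per_1k_tokens: float,
) -> float:
    """Estimate model spend for one evaluation run from its token volume."""
    total_tokens = rows * calls_per_row * tokens_per_call
    return total_tokens / 1000 * price_per_1k_tokens


# Hypothetical inputs: one generation call plus one scoring call per row.
PRICE_PER_1K = 0.002  # placeholder price, replace with the actual rate

smoke_cost = estimate_model_cost(
    rows=50, calls_per_row=2, tokens_per_call=1500, price_per_1k_tokens=PRICE_PER_1K
)
full_cost = estimate_model_cost(
    rows=5000, calls_per_row=2, tokens_per_call=1500, price_per_1k_tokens=PRICE_PER_1K
)
print(f"smoke suite: {smoke_cost:.2f}, full suite: {full_cost:.2f}")
```

Multiplying the per-run figure by run frequency gives the number a FinOps review would track; compute and storage costs are outside this sketch.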

Reliability

Reliability impact is strong for change control. Evaluation flows help detect when a standard flow stops meeting quality thresholds after prompt edits, tool changes, model updates, or migration. However, the evaluation itself must be reliable: datasets need stable schema, ground truth must be maintained, compute must run consistently, and metrics should be logged in a reproducible way. Teams should handle failed evaluation nodes, missing columns, partial batch results, and evaluator version changes. For Prompt Flow retirement, baseline and replacement evaluations reduce the risk of cutting over to a workflow that is technically live but behaviorally worse. That makes cutover decisions safer and auditable.
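
A release gate that checks both metric thresholds and the share of failed rows is one way to apply the points above. This is a sketch with hypothetical metric names and limits; it treats a missing metric and a partial batch as failures rather than passing them silently.

```python
from typing import Dict, List


def gate_failures(
    metrics: Dict[str, float],
    thresholds: Dict[str, float],
    failed_rows: int,
    total_rows: int,
    max_failed_fraction: float,
) -> List[str]:
    """Return reasons the run should not pass; an empty list means it passes."""
    if total_rows == 0:
        return ["no rows were evaluated"]
    problems = []
    if failed_rows / total_rows > max_failed_fraction:
        problems.append(
            f"failed rows: {failed_rows}/{total_rows} exceeds the allowed fraction"
        )
    for name, minimum in sorted(thresholds.items()):
        value = metrics.get(name)
        if value is None:
            problems.append(f"{name}: metric missing")
        elif value < minimum:
            problems.append(f"{name}: {value:.3f} below threshold {minimum:.3f}")
    return problems


# Illustrative run: accuracy passes, groundedness was never logged,
# and 3% of rows failed against a 1% allowance.
problems = gate_failures(
    metrics={"accuracy": 0.97},
    thresholds={"accuracy": 0.95, "groundedness": 0.90},
    failed_rows=30,
    total_rows=1000,
    max_failed_fraction=0.01,
)
print(problems)
```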

Performance

Performance impact is mostly about evaluation throughput and diagnostic speed. Evaluation flows process rows, execute scoring logic, and may call models or tools, so large datasets can take time. Slow evaluations delay releases and migration decisions. Operators should measure run duration, per-node timing, compute queueing, token volume, and failed-row counts. If performance is poor, they can reduce unnecessary model scoring, partition datasets, right-size compute, or create smaller smoke suites. Production runtime performance is evaluated indirectly: the tested flow’s latency, output length, and tool-call behavior can become metrics in the evaluation result. Fast smoke suites keep release pipelines useful while full suites run less frequently.
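
One way to build a smaller smoke suite is to pick rows deterministically by hashing a stable row identifier, so the same rows are selected on every run. The sketch below assumes each row has such an identifier; the `row-N` format and the 10% sample size are hypothetical.

```python
import hashlib
from typing import List


def in_smoke_suite(row_id: str, percent: int) -> bool:
    """Deterministically assign roughly `percent`% of rows to the smoke suite."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def smoke_subset(row_ids: List[str], percent: int) -> List[str]:
    """Keep only the rows that fall into the smoke suite, preserving order."""
    return [row_id for row_id in row_ids if in_smoke_suite(row_id, percent)]


row_ids = [f"row-{i}" for i in range(1000)]
smoke = smoke_subset(row_ids, percent=10)
print(len(smoke), "of", len(row_ids), "rows selected for the smoke suite")
```

Because selection depends only on the row identifier, adding new rows to the dataset does not reshuffle which existing rows belong to the smoke suite, which keeps smoke results comparable between runs.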

Operations

Operators use prompt flow evaluation by preparing datasets, running batch tests, reviewing per-row scores, checking aggregate metrics, and comparing results across variants or releases. They also troubleshoot missing inputs, failed nodes, compute errors, connection failures, and inconsistent metric logging. Azure CLI helps inspect the Azure Machine Learning workspace, compute resources, endpoints, storage, and monitoring settings that support those evaluations. Runbooks should include dataset location, expected columns, evaluator owners, metric thresholds, retention rules, and migration milestones. For legacy flows, every important evaluation should be tagged to the tested flow version and replacement candidate. That keeps evaluation evidence reproducible when migration teams compare old and new behavior.
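
Reviewing per-row scores is easier when the lowest-scoring rows surface first. The following sketch ranks rows for human review; the row identifiers, scores, and review limit are illustrative assumptions.

```python
from typing import Dict, List


def rows_for_review(scores: Dict[str, float], limit: int) -> List[str]:
    """Return up to `limit` row IDs, lowest score first (ties broken by row ID)."""
    ranked = sorted(scores.items(), key=lambda item: (item[1], item[0]))
    return [row_id for row_id, _ in ranked[:limit]]


# Hypothetical per-row scores from one evaluation run.
scores = {"r1": 0.9, "r2": 0.2, "r3": 0.5, "r4": 0.2}
review_queue = rows_for_review(scores, limit=2)
print(review_queue)
```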

Common mistakes

  • Treating prompt flow evaluation as general Foundry evaluation without recognizing the legacy Prompt Flow context.
  • Comparing scores from different datasets, evaluators, or model versions as if they were equivalent.
  • Running large evaluations without checking compute cost, token cost, and storage retention.
  • Ignoring failed rows because the aggregate score still looks acceptable.
  • Migrating a flow without first capturing baseline evaluation metrics from the legacy implementation.
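
The second mistake above, comparing scores produced under different conditions, can be guarded against by fingerprinting the evaluation context before two runs are compared. This is a sketch: the context keys and placeholder values are hypothetical, and a real context would include whatever affects the score.

```python
import hashlib
import json
from typing import Dict


def evaluation_fingerprint(context: Dict[str, str]) -> str:
    """Hash the evaluation context so differing setups are easy to detect."""
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def comparable(run_a: Dict[str, str], run_b: Dict[str, str]) -> bool:
    """Two runs are comparable only if their evaluation contexts match."""
    return evaluation_fingerprint(run_a) == evaluation_fingerprint(run_b)


# Hypothetical contexts; key order does not affect the fingerprint.
run_a = {"dataset": "<dataset-name>", "evaluator": "v1", "scoring_model": "<model-version>"}
run_b = {"scoring_model": "<model-version>", "evaluator": "v1", "dataset": "<dataset-name>"}
run_c = {"dataset": "<dataset-name>", "evaluator": "v2", "scoring_model": "<model-version>"}

print(comparable(run_a, run_b), comparable(run_a, run_c))
```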