Synapse pipeline

A Synapse pipeline is the runbook for a data workflow. Instead of asking people to copy files, run notebooks, start SQL scripts, and check outputs manually, the pipeline connects those steps into an ordered process. It can branch, retry, wait, pass parameters, call notebooks, move data, and record each activity result. For learners, the key idea is simple: a pipeline does not usually do all computation itself; it coordinates activities that use storage, Spark, SQL, linked services, and integration runtimes.

Aliases: Azure Synapse pipeline, Synapse Analytics pipeline, pipeline in Synapse, Synapse data integration pipeline
Difficulty: fundamentals
CLI mappings: 8
Last verified: 2026-05-27T07:24:06Z

Microsoft Learn

A Synapse pipeline is a data-driven workflow in Azure Synapse Analytics that groups activities for moving, transforming, and orchestrating data. It can run manually or from triggers, pass parameters between activities, and record pipeline and activity run history for monitoring.

Microsoft Learn: Pipelines and activities in Azure Data Factory and Azure Synapse Analytics (2026-05-27T07:24:06Z)

Technical context

Synapse pipelines live inside a Synapse workspace and share the data integration model used by Azure Data Factory. They use activities such as Copy, Notebook, SQL script, Data Flow, Lookup, ForEach, and Execute Pipeline to orchestrate work. Triggers start them on schedules, events, or tumbling windows, and operators can also start them manually. Linked services define external connections, datasets describe data structures, and run history records activity state. Pipelines sit in the control plane for artifact management and in the operational plane for execution, monitoring, retries, and dependencies.
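
As a rough illustration of the activity and dependency model described above, the sketch below holds a pipeline definition as a Python dictionary and derives one valid run order from its dependsOn entries. The pipeline name, activity names, and the loadDate parameter are hypothetical, the typeProperties that real activities require are omitted, and the layout follows the general shape of an exported definition rather than a complete schema.

```python
from graphlib import TopologicalSorter

# Hypothetical definition; real activities also carry typeProperties.
pipeline = {
    "name": "pl_daily_load",
    "properties": {
        "parameters": {"loadDate": {"type": "String"}},
        "activities": [
            {"name": "LookupExpectedFiles", "type": "Lookup", "dependsOn": []},
            {
                "name": "CopyToLanding",
                "type": "Copy",
                "dependsOn": [
                    {"activity": "LookupExpectedFiles",
                     "dependencyConditions": ["Succeeded"]}
                ],
            },
            {
                "name": "EnrichNotebook",
                "type": "SynapseNotebook",
                "dependsOn": [
                    {"activity": "CopyToLanding",
                     "dependencyConditions": ["Succeeded"]}
                ],
            },
        ],
    },
}


def run_order(definition: dict) -> list[str]:
    """Return one execution order that respects every dependsOn entry."""
    graph = {
        activity["name"]: {dep["activity"] for dep in activity["dependsOn"]}
        for activity in definition["properties"]["activities"]
    }
    # TopologicalSorter expects node -> predecessors and yields predecessors first.
    return list(TopologicalSorter(graph).static_order())


print(run_order(pipeline))
```

The point of the sketch is that the pipeline itself only states order and conditions; the Lookup, Copy, and notebook work happens in the services each activity calls.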

Why it matters

Synapse pipelines matter because analytics systems fail most often between steps, not inside one perfect query. A business process may require a file to arrive, a validation to pass, a notebook to enrich data, a SQL load to complete, and a dashboard refresh to happen in order. Without a pipeline, teams rely on manual timing, tribal knowledge, or hidden scripts. A good pipeline makes dependency, retry, parameter, and monitoring behavior visible. It also gives operators a single run record to inspect when a missed load, bad file, permission change, or downstream delay affects business reporting. It turns operational promises into evidence that platform and business teams can inspect together, so nobody has to argue over which step actually missed its promise.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

In the Synapse Studio Integrate hub, pipelines show activities, dependency arrows, parameters, variables, debug controls, publish status, trigger associations, and validation messages before publishing.

Signal 02

In pipeline run history, each run shows a run ID, status, start time, duration, activity results, input parameters, retry count, and errors during operator triage.

Signal 03

In deployment repositories, pipeline JSON defines activities, linked-service references, parameters, variables, policy settings, annotations, and trigger bindings reviewed during pull requests and security audit reviews.

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

Orchestrate file arrival, validation, transformation, SQL loading, and dashboard readiness as one monitored business workflow.
Parameterize the same data movement pattern for many source systems without duplicating one-off scripts.
Call Synapse notebooks or Spark jobs only after source partitions pass freshness and schema checks.
Recover from failed activities by rerunning a known pipeline segment instead of rebuilding the entire data day manually.
Promote data workflows between workspaces by exporting pipeline JSON and comparing deployed definitions.

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Airline maintenance analytics stops missing overnight readiness windows

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

An airline maintenance planning group combined aircraft sensor files, work-order exports, and technician notes every night. Manual orchestration caused morning planners to start with incomplete risk scores twice a week.

Business/Technical Objectives

Deliver complete maintenance risk tables before 04:45 local time.
Stop processing when a required aircraft file or work-order extract is missing.
Make reruns safe after partial failures without duplicating maintenance events.
Give operations a single run ID for overnight support handoff.

Solution Using Synapse pipeline

The team built a Synapse pipeline with validation activities for file freshness, lookup activities for expected fleet counts, copy activities into a landing schema, and a notebook activity for scoring. Parameters carried the maintenance date and fleet region. Outputs were written to date-partitioned folders, and SQL loads used merge logic to keep reruns idempotent. Azure CLI exported the pipeline JSON for release review and queried run history during overnight support. Failed validation stopped the workflow before Spark compute started, while alerts included the failed activity name and run ID.
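
The case study relies on SQL merge logic to keep reruns idempotent. The sketch below illustrates the same idea in plain Python rather than the team's actual SQL: rows are upserted by a key, so replaying the same batch leaves the target unchanged. The field names (event_id, tail_number, risk_score) are hypothetical.

```python
def merge_events(target: dict, batch: list[dict]) -> dict:
    """Upsert rows keyed by event_id; replaying a batch does not add rows."""
    merged = dict(target)
    for row in batch:
        merged[row["event_id"]] = row
    return merged


# Hypothetical nightly batch of maintenance events.
nightly_batch = [
    {"event_id": "WO-1001", "tail_number": "N123", "risk_score": 0.42},
    {"event_id": "WO-1002", "tail_number": "N456", "risk_score": 0.77},
]

first_load = merge_events({}, nightly_batch)
rerun_load = merge_events(first_load, nightly_batch)  # simulated rerun after a partial failure
print(len(first_load), len(rerun_load))  # prints: 2 2
```

An append-only load would have produced four rows after the rerun, which is the duplication the merge approach avoids.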

Results & Business Impact

Complete readiness data arrived before 04:45 on 96% of nights, up from 71% before orchestration.
Duplicate maintenance events from reruns fell from about 1,400 per month to zero.
Average support triage time dropped from 38 minutes to 9 minutes because every issue began with a run ID.
Spark cost for failed nights decreased 29% after freshness checks ran before notebook execution.

Key Takeaway for Glossary Readers

A Synapse pipeline is the control layer that turns scattered data tasks into a recoverable business process.

Case study 02

Water utility improves regulatory reporting after storm events

Scenario

A city water utility had to report storm overflow measurements within strict deadlines. During heavy rain, CSV drops, lab results, and sensor corrections arrived out of order and overwhelmed a manual checklist.

Business/Technical Objectives

Ingest field sensor, lab, and correction files with visible dependency rules.
Prevent draft reports from publishing until quality thresholds pass.
Cut storm-event reporting preparation from six hours to under two hours.
Keep evidence of each activity result for regulatory review.

Solution Using Synapse pipeline

Engineers designed a Synapse pipeline with separate branches for telemetry files, lab uploads, and late corrections. A control activity waited for mandatory sources or a documented cutoff time, then a SQL script activity applied quality rules. If thresholds failed, the pipeline wrote exceptions to an operations table and stopped before publishing. If they passed, a final copy activity moved certified data to a reporting zone. CLI run queries exported status, activity duration, and failure messages for the compliance packet. Parameters identified the storm event, watershed, and reporting deadline.
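
The two gates in this design (wait for mandatory sources or a cutoff, then publish only if quality thresholds pass) can be sketched as small decision functions. This is a hypothetical Python illustration of the logic, not the utility's pipeline expressions; the source names, cutoff, and threshold are assumptions.

```python
from datetime import datetime, timedelta, timezone


def ready_to_process(arrived: set[str], mandatory: set[str],
                     now: datetime, cutoff: datetime) -> bool:
    """Proceed when every mandatory source has arrived or the cutoff has passed."""
    return mandatory <= arrived or now >= cutoff


def publish_decision(failed_rows: int, total_rows: int, max_failure_rate: float) -> str:
    """Return 'publish' when the failure rate is within the threshold, else 'hold'."""
    if total_rows == 0:
        return "hold"  # nothing to certify
    return "publish" if failed_rows / total_rows <= max_failure_rate else "hold"


cutoff = datetime(2026, 5, 27, 6, 0, tzinfo=timezone.utc)
before_cutoff = cutoff - timedelta(hours=1)

print(ready_to_process({"telemetry"}, {"telemetry", "lab"}, before_cutoff, cutoff))
print(publish_decision(failed_rows=3, total_rows=1000, max_failure_rate=0.01))
```

Keeping the cutoff as an explicit input mirrors the "documented cutoff time" in the scenario: the pipeline proceeds with known gaps instead of waiting indefinitely.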

Results & Business Impact

Median report preparation time fell from 6.4 hours to 1.7 hours during the next storm season.
Unapproved draft reports sent to analysts dropped from nine incidents to one, which required only a minor correction.
Regulatory evidence assembly shrank from two business days to less than three hours.
Late correction handling became visible, reducing status calls between field teams and data engineers by 62%.

Key Takeaway for Glossary Readers

Synapse pipelines make timing, dependency, and quality decisions explicit when data arrives under real-world pressure.

Case study 03

Streaming media provider contains a runaway personalization rebuild

Scenario

A streaming platform rebuilt recommendation features through several Synapse activities. A bad trigger configuration launched overlapping rebuilds that competed for Spark capacity during a major content premiere.

Business/Technical Objectives

Identify and stop overlapping pipeline runs before they exhausted compute capacity.
Separate emergency cancellation from permanent pipeline fixes.
Compare deployed pipeline JSON against the approved release definition.
Reduce premiere-night recommendation delay below fifteen minutes.

Solution Using Synapse pipeline

The operations team used Azure CLI to query pipeline runs by workspace and UTC window, then cancelled duplicate run IDs while leaving the oldest valid run active. They exported the production pipeline definition and found a trigger parameter mismatch introduced during a portal hotfix. Engineers corrected the JSON in source control, redeployed with a non-overlap guard, and added an activity that checked for active rebuild runs before starting Spark work. Monitoring alerts were updated to include concurrent run count and expected premiere windows.
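
The "cancel duplicates, keep the oldest valid run" step can be sketched as a filter over run records. The example assumes records shaped like JSON output from a pipeline-run query, with runId, pipelineName, status, and runStart fields; treat those field names, the status value, and the sample data as assumptions to verify against real output before relying on them.

```python
from datetime import datetime


def runs_to_cancel(runs: list[dict], pipeline_name: str) -> list[str]:
    """Return run IDs of in-progress duplicates, keeping the oldest run active."""
    active = [
        run for run in runs
        if run["pipelineName"] == pipeline_name and run["status"] == "InProgress"
    ]
    active.sort(key=lambda run: datetime.fromisoformat(run["runStart"]))
    return [run["runId"] for run in active[1:]]


# Hypothetical run records for one incident window.
sample_runs = [
    {"runId": "run-c", "pipelineName": "pl_rebuild_features",
     "status": "InProgress", "runStart": "2026-05-27T01:10:00+00:00"},
    {"runId": "run-a", "pipelineName": "pl_rebuild_features",
     "status": "InProgress", "runStart": "2026-05-27T01:00:00+00:00"},
    {"runId": "run-b", "pipelineName": "pl_rebuild_features",
     "status": "InProgress", "runStart": "2026-05-27T01:05:00+00:00"},
    {"runId": "run-d", "pipelineName": "pl_rebuild_features",
     "status": "Succeeded", "runStart": "2026-05-27T00:00:00+00:00"},
    {"runId": "run-x", "pipelineName": "pl_other",
     "status": "InProgress", "runStart": "2026-05-27T01:02:00+00:00"},
]

print(runs_to_cancel(sample_runs, "pl_rebuild_features"))
```

The function only selects candidates; the actual cancellation would still go through a reviewed command for each run ID.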

Results & Business Impact

Five duplicate rebuilds were cancelled within twelve minutes, preventing an estimated 7,800 extra Spark core-minutes.
Recommendation freshness during the premiere stayed within eleven minutes instead of the projected forty-five-minute delay.
The trigger mismatch was found in 23 minutes because exported JSON proved production drift.
Overlapping rebuild incidents dropped to zero across the next eight high-traffic releases.

Key Takeaway for Glossary Readers

Pipeline run visibility and CLI-driven control are essential when orchestration mistakes threaten live customer experience.

Why use Azure CLI for this?

With a decade of Azure operations behind me, I reach for Azure CLI when Synapse pipelines need promotion, evidence, or fast incident triage. The portal is fine for design, but CLI is better for exporting definitions, listing pipelines across workspaces, querying run history, and cancelling a bad run before it fans out. It also helps compare environments without trusting memory. For regulated teams, pipeline JSON from CLI becomes release evidence. For support teams, pipeline-run queries provide exact run IDs, timestamps, statuses, and parameters without clicking through crowded monitoring screens. That discipline replaces emergency guesswork with a searchable timeline and keeps urgent recovery work from becoming tribal knowledge.

CLI use cases

Export or show a pipeline definition before deployment to confirm activities, expressions, and linked services match review.
Create or update a pipeline from JSON as part of CI/CD instead of manually rebuilding the canvas.
Query pipeline runs during an incident to find failed runs, long durations, and affected time windows.
Cancel a runaway run before it launches expensive Spark, SQL, or copy activity fan-out.
List pipelines across workspaces to find duplicate or retired workflows that still have active triggers.

Before you run CLI

Confirm the active tenant and subscription, then verify the workspace name because pipeline names are not globally unique.
Treat create, set, delete, and cancel as change-controlled actions with rollback notes and current JSON exports.
Check the caller has Synapse artifact permissions plus access to linked services, storage, SQL pools, and monitoring outputs.
Use UTC timestamps for run queries and choose a narrow window so incident output stays readable.
Review whether cancelling a parent pipeline should also stop child runs or downstream activities.
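
The advice on UTC timestamps and narrow windows can be sketched as a helper that builds the argument list for a run query. The flags are the ones listed in the command mappings on this page; the workspace name, the two-hour default, and the timestamp format are assumptions for illustration. The helper only builds the list and does not call Azure.

```python
from datetime import datetime, timedelta, timezone


def query_args(workspace: str, end: datetime, hours: int = 2) -> list[str]:
    """Build az arguments for a narrow, UTC-based pipeline-run query window."""
    if end.tzinfo is None:
        raise ValueError("end must be timezone-aware to avoid local-time mistakes")
    end_utc = end.astimezone(timezone.utc)
    start_utc = end_utc - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return [
        "az", "synapse", "pipeline-run", "query-by-workspace",
        "--workspace-name", workspace,
        "--last-updated-after", start_utc.strftime(fmt),
        "--last-updated-before", end_utc.strftime(fmt),
    ]


# Hypothetical incident time reported in a UTC+2 local zone.
local_end = datetime(2026, 5, 27, 9, 0, tzinfo=timezone(timedelta(hours=2)))
print(" ".join(query_args("ws-demo", local_end)))
```

Rejecting naive datetimes is a deliberate choice here: a local timestamp silently treated as UTC shifts the window and can hide the run being investigated.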

What output tells you

Pipeline show output exposes the deployed activity graph, expressions, policy settings, folder path, and linked artifact references.
Pipeline-run query output shows run IDs, status, start and end times, duration, and high-level failure state for triage.
A cancelled run proves orchestration was stopped, but downstream compute or partial outputs may still need cleanup checks.
Differences between exported JSON and repository JSON indicate deployment drift or portal edits outside the release path.
Missing pipelines usually indicate the wrong workspace, unpublished artifact state, permission limits, or a failed deployment.
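
Drift between exported JSON and repository JSON can be spotted with a small recursive comparison. The sketch below is a minimal, hypothetical helper that reports the paths where two definitions differ; the sample definitions are invented, and lists are compared as whole values, which is coarse for long activity arrays.

```python
def diff_paths(deployed, approved, path: str = "") -> list[str]:
    """Return the JSON paths at which two definitions differ."""
    if isinstance(deployed, dict) and isinstance(approved, dict):
        differences = []
        for key in sorted(set(deployed) | set(approved)):
            child = f"{path}/{key}"
            if key not in deployed or key not in approved:
                differences.append(child)  # key exists on one side only
            else:
                differences.extend(diff_paths(deployed[key], approved[key], child))
        return differences
    return [] if deployed == approved else [path or "/"]


# Hypothetical definitions: the deployed copy carries a portal hotfix.
approved_definition = {
    "properties": {"concurrency": 1, "parameters": {"region": {"type": "String"}}}
}
deployed_definition = {
    "properties": {
        "concurrency": 4,
        "parameters": {"region": {"type": "String"}},
        "annotations": ["hotfix"],
    }
}

print(diff_paths(deployed_definition, approved_definition))
```

A non-empty result is a prompt to investigate, not proof of an error: some fields can legitimately differ between environments.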

Mapped Azure CLI commands

Synapse pipeline artifact operations

direct

az synapse pipeline list --workspace-name <workspace-name>

az synapse pipeline | discover | Analytics

az synapse pipeline show --workspace-name <workspace-name> --name <pipeline-name>

az synapse pipeline | discover | Analytics

az synapse pipeline create --workspace-name <workspace-name> --name <pipeline-name> --file @<pipeline.json>

az synapse pipeline | provision | Analytics

az synapse pipeline set --workspace-name <workspace-name> --name <pipeline-name> --file @<pipeline.json>

az synapse pipeline | configure | Analytics

az synapse pipeline delete --workspace-name <workspace-name> --name <pipeline-name>

az synapse pipeline | remove | Analytics

Synapse pipeline run operations

direct

az synapse pipeline-run query-by-workspace --workspace-name <workspace-name> --last-updated-after <utc-start> --last-updated-before <utc-end>

az synapse pipeline-run | discover | Analytics

az synapse pipeline-run show --workspace-name <workspace-name> --run-id <run-id>

az synapse pipeline-run | discover | Analytics

az synapse pipeline-run cancel --workspace-name <workspace-name> --run-id <run-id> --yes

az synapse pipeline-run | remove | Analytics

Architecture context

A Synapse pipeline is the orchestration layer of a Synapse data platform. It should sit above storage, compute, and integration resources while keeping those dependencies explicit. In a healthy architecture, pipelines are small enough to troubleshoot, parameterized for environment promotion, and built around clear data contracts. Long chains should use child pipelines or activities with meaningful boundaries. Retry policy, timeout, concurrency, trigger design, and failure routing need the same review as code. The pipeline should not hide credentials or business rules in random expressions; linked services, Key Vault, Git, and monitoring should carry that responsibility. Keeping those concerns out of expressions prevents routine data movement from creating surprise side effects. Keep orchestration boring, observable, and easy to rerun safely.

Security

Pipeline security depends on artifact permissions, linked service credentials, managed identity, and the data stores each activity can reach. A user who can edit a pipeline may redirect outputs, call a notebook, or run a copy that exposes sensitive data. Use Synapse roles for artifact access, Azure RBAC for workspace resources, Key Vault for secrets, and least-privilege access on storage and databases. Review pipeline parameters because they can carry paths, table names, or connection choices. Logs may contain row counts, file names, or error messages that reveal regulated data locations, so monitoring access matters too. Treat orchestration changes as security-relevant because they decide which data moves where. Sensitive parameter values should be masked or moved into secure references.
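
One way to picture the masking advice is a helper that redacts sensitive-looking parameter values before they are written to logs or tickets. This is a hypothetical, name-based heuristic in Python, not a Synapse feature; the hint list and parameter names are assumptions, and a real deployment would prefer secure references over pattern matching.

```python
# Hypothetical substrings that suggest a parameter holds a secret.
SENSITIVE_HINTS = ("secret", "password", "token", "connection")


def mask_parameters(parameters: dict) -> dict:
    """Return a copy with values masked when the parameter name looks sensitive."""
    return {
        name: "***" if any(hint in name.lower() for hint in SENSITIVE_HINTS) else value
        for name, value in parameters.items()
    }


run_parameters = {
    "loadDate": "2026-05-27",
    "sourcePath": "landing/fleet/2026-05-27",
    "sasToken": "example-not-a-real-token",
}
print(mask_parameters(run_parameters))
```

Note that sourcePath is left visible even though the paragraph above points out that paths can reveal regulated data locations; which parameters count as sensitive is a policy decision the heuristic cannot make.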

Cost

Pipeline cost is usually indirect but very real. Activities can start Spark pools, run data flows, scan storage, execute SQL, move data across regions, and generate logs. A retry loop can multiply compute spend, and an hourly trigger that should have been daily can burn budget quietly. Copy activity staging, integration runtime choices, and verbose diagnostic retention also affect cost. FinOps review should map each pipeline to run frequency, average duration, data volume, compute resources, and owner, and should tie schedule frequency to actual business freshness needs. Cost improves when pipelines fail fast on validation, avoid duplicate runs, and scale compute only when needed. Every trigger should have a cost owner and an expected run volume; chargeback tags and run ownership make that spend visible.
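
The effect of trigger frequency and retries on compute time is simple arithmetic, sketched below. The 20-minute run duration, 30-day month, and retry count are hypothetical inputs, and the result is compute hours rather than currency because pricing is not covered here.

```python
def monthly_compute_hours(runs_per_day: float, minutes_per_run: float,
                          retry_attempts: int = 0, days: int = 30) -> float:
    """Worst-case compute hours per month, assuming every retry is consumed."""
    attempts_per_run = 1 + retry_attempts
    return runs_per_day * days * attempts_per_run * minutes_per_run / 60


daily = monthly_compute_hours(runs_per_day=1, minutes_per_run=20)
hourly = monthly_compute_hours(runs_per_day=24, minutes_per_run=20)
daily_with_retries = monthly_compute_hours(runs_per_day=1, minutes_per_run=20,
                                           retry_attempts=2)
print(daily, hourly, daily_with_retries)  # prints: 10.0 240.0 30.0
```

Under these assumptions, an hourly trigger that should have been daily uses 24 times the compute, and two retries on every run triple it.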

Reliability

A reliable pipeline makes dependencies and failure behavior explicit. It should verify source availability, use retries only where the action is safe, set realistic timeouts, and write outputs idempotently. Do not let a failed validation activity continue into a destructive load. Triggers should match business freshness requirements and avoid overlapping runs unless the workflow is designed for concurrency. For recovery, operators need run IDs, activity output, parameter values, and a known restart point. Pipelines that copy partial files, rerun non-idempotent SQL, or skip failure notifications create a larger blast radius than the original data issue. Idempotent writes and a documented restart point are what make reruns predictable after partial failure. The design should make the next safe action obvious.
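
The rule "use retries only where the action is safe" can be sketched as a wrapper that retries idempotent actions and runs everything else exactly once. This is a hypothetical Python illustration of the policy, not how the activity retry setting is implemented; the function and parameter names are assumptions.

```python
import time


def run_with_policy(action, *, idempotent: bool, max_retries: int = 2,
                    delay_seconds: float = 0.0):
    """Run action; retry on failure only when the action is marked idempotent."""
    attempts = 1 + (max_retries if idempotent else 0)
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(delay_seconds)


def make_flaky(failures_before_success: int):
    """Build a demo action that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def action():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("transient failure")
        return f"ok after {state['calls']} call(s)"

    return action, state


safe_action, _ = make_flaky(failures_before_success=2)
print(run_with_policy(safe_action, idempotent=True))
```

A non-idempotent action passed with idempotent=False fails on the first error instead of retrying, which leaves the decision about cleanup and rerun to an operator.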

Performance

Pipeline performance is end-to-end timing, not just one activity speed. Bottlenecks may come from source throttling, integration runtime placement, Spark pool startup, SQL pool concurrency, sequential dependencies, retry storms, or excessive validation waits. A pipeline that performs well has parallelism where safe, clear activity boundaries, and parameters that avoid unnecessary full refreshes. Operators should track duration by activity, queue time, source wait, and downstream compute time instead of only total run duration. The best tuning often removes unnecessary waits before changing compute size. Test with realistic data volume, then adjust concurrency carefully so a faster workflow does not overload a source system.
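
Tracking duration by activity rather than by total run can be sketched as a small aggregation. The example assumes activity run records with activityName and durationInMs fields; treat those field names and the sample values as assumptions to check against real monitoring output.

```python
def slowest_activities(activity_runs: list[dict], top: int = 3) -> list[tuple[str, float]]:
    """Sum duration per activity and return the slowest ones in seconds."""
    totals: dict[str, int] = {}
    for run in activity_runs:
        name = run["activityName"]
        totals[name] = totals.get(name, 0) + run["durationInMs"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, milliseconds / 1000) for name, milliseconds in ranked[:top]]


# Hypothetical activity runs; the copy appears twice because it was retried.
sample_activity_runs = [
    {"activityName": "ValidateFreshness", "durationInMs": 4_000},
    {"activityName": "CopyToLanding", "durationInMs": 90_000},
    {"activityName": "CopyToLanding", "durationInMs": 85_000},
    {"activityName": "EnrichNotebook", "durationInMs": 120_000},
]

print(slowest_activities(sample_activity_runs, top=2))
```

In this sample the retried copy, not the notebook, dominates the run, which is the kind of finding that total run duration alone would hide.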

Operations

Operators manage Synapse pipelines through artifact review, trigger control, run history, activity diagnostics, and deployment evidence. Daily work includes checking failed runs, long-running activities, integration runtime pressure, parameter mistakes, and source-system delays. Pipeline JSON should be versioned, promoted through CI/CD, and tied to an owner because production workflows become business commitments. During an incident, the operator starts with the pipeline run ID, then drills into activity output and dependency timing. Good operations practice retires unused triggers, documents safe manual rerun steps, and baselines normal duration before schedule changes. The on-call guide should name owners, validation queries, and alert routes for every promoted pipeline.

Common mistakes

Relying on portal-only edits and then discovering test, production, and Git versions no longer match.
Adding retries around non-idempotent copy or SQL activities, which duplicates rows or overwrites good data after transient failures.
Using broad run-history windows during incidents and missing the specific failed run among hundreds of entries.
Forgetting that pipeline parameters can change data paths, dates, and behavior even when the pipeline definition is unchanged.
Deleting or disabling a pipeline without checking triggers, parent pipelines, dashboards, and downstream business schedules.

Operator quick checks

Can you show the deployed pipeline JSON and match it to the approved repository version?
Are retries, timeouts, and concurrency settings safe for every activity that writes data?
Does the pipeline fail before expensive work when source freshness or schema checks fail?
Can operators find the last successful run ID and rerun instructions in under five minutes?
Are triggers aligned with business freshness rather than a default schedule someone forgot to revisit?

Questions to ask

What data contract does this pipeline enforce before it starts expensive transformation work?
Who can edit, publish, trigger, cancel, or delete this pipeline in production?
Which activity causes the highest cost or longest recovery time when it fails?
What outputs must be cleaned up before rerunning after a partial failure?
Which alert tells the business that this pipeline missed its useful reporting window?
