Synapse Apache Spark notebook

A Synapse Apache Spark notebook is an interactive workbook for Spark code inside a Synapse workspace. Data engineers and analysts use cells to read data, transform it, visualize samples, and test logic before turning it into a scheduled job or pipeline step. The notebook itself is not the compute; it attaches to a Spark pool that runs the work. This makes it useful for exploration and production preparation, but it also needs source control, parameter discipline, access review, and cost awareness.

Aliases: Synapse Spark notebook, Apache Spark notebook in Synapse, Synapse notebook, Spark notebook
Difficulty: intermediate
CLI mappings: 6
Last verified: 2026-05-27T00:59:56Z

Microsoft Learn

A Synapse Apache Spark notebook is a notebook artifact in Azure Synapse Analytics whose code cells run on a serverless Apache Spark pool. It supports interactive development, data exploration, transformation logic, and pipeline-driven execution in governed analytics workflows.

Microsoft Learn: Create, develop, and maintain Synapse notebooks (2026-05-27T00:59:56Z)

Technical context

Technically, a Synapse Apache Spark notebook is a workspace artifact that stores code, metadata, parameters, and Spark configuration references. When executed, it creates or uses a Spark session on a selected Spark pool. Notebooks can be run interactively in Synapse Studio, imported or exported with Azure CLI, and orchestrated from Synapse pipelines. They interact with linked services, managed identities, data lake files, libraries, monitoring logs, and Spark UI. The notebook sits between human exploration and automated data engineering workflows.
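As a rough illustration of what the artifact holds, the sketch below reads a notebook document in the Jupyter .ipynb JSON layout and summarizes its cells. The sample document is hypothetical and heavily trimmed, and the assumption that a parameters cell carries a "parameters" tag in its cell metadata should be checked against a real export from your workspace.

```python
import json

# Hypothetical, trimmed notebook document in .ipynb (nbformat) layout.
SAMPLE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "metadata": {"kernelspec": {"name": "synapse_pyspark"}},
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "source": ["run_date = '2024-01-01'\n", "region = 'emea'\n"]},
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data\n"]},
        {"cell_type": "code", "metadata": {}, "source": ["print(run_date)\n"]},
    ],
})


def summarize_notebook(raw: str) -> dict:
    """Return cell counts and the source of any cell tagged 'parameters'."""
    doc = json.loads(raw)
    cells = doc.get("cells", [])
    # Assumption: parameter cells are marked with a 'parameters' tag.
    parameter_cells = [
        "".join(cell.get("source", []))
        for cell in cells
        if "parameters" in cell.get("metadata", {}).get("tags", [])
    ]
    return {
        "code_cells": sum(1 for c in cells if c.get("cell_type") == "code"),
        "markdown_cells": sum(1 for c in cells if c.get("cell_type") == "markdown"),
        "parameter_cells": parameter_cells,
    }


summary = summarize_notebook(SAMPLE_NOTEBOOK)
print(summary["code_cells"], summary["markdown_cells"], len(summary["parameter_cells"]))
```

A summary like this can be a quick review aid: it shows whether a notebook declares its inputs in one place before anyone attaches it to a Spark pool.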

Why it matters

Synapse Apache Spark notebooks matter because many data transformations begin as investigation before becoming reliable pipelines. A notebook lets an engineer inspect messy data, test joins, validate schemas, and explain results in one place. That speed is powerful, but notebooks become risky if production logic lives only in someone's workspace, secrets are typed into cells, or expensive Spark sessions are left running. Used well, notebooks bridge discovery and operations: they document reasoning, become parameterized pipeline activities, and give teams a shared artifact for reviewing data logic before it supports reports, models, or decisions.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

In Synapse Studio, Spark notebooks appear under workspace artifacts with code cells, parameters, attached Spark pools, run controls, and output previews.

Signal 02

In Azure CLI output, the az synapse notebook list and show commands reveal notebook names, workspace ownership, folders, and deployment status for audits and release reviews.

Signal 03

In Synapse pipeline history, a notebook activity shows parameters, Spark pool execution, duration, failure messages, and links to run details, which helps after scheduled runs and during incidents.

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Explore raw data and turn the discovered transformation logic into a reviewed Synapse pipeline activity.
  • Parameterize Spark processing so the same notebook can run for different dates, tenants, or data zones.
  • Export notebooks for source control before promoting analytics logic between development and production workspaces.
  • Debug failed Spark transformations with code, narrative context, sample outputs, and Spark run details together.
  • Standardize data-engineering notebooks that prepare curated lake data for reporting, search, or machine learning.
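The parameterization idea in the list above can be sketched in plain Python. This is a minimal illustration, not a Synapse API: the parameter names, the allowed zones, and the folder layout are all hypothetical, and the storage base path is left as a placeholder.

```python
from datetime import date

# Parameter defaults; a pipeline run would normally override these.
# All names and values here are hypothetical.
run_date = "2024-01-31"
tenant = "contoso"
zone = "curated"

ALLOWED_ZONES = {"raw", "curated", "reporting"}


def validate_params(run_date: str, tenant: str, zone: str) -> dict:
    """Parse and check parameters, raising ValueError on bad input."""
    parsed = date.fromisoformat(run_date)  # ValueError for malformed dates
    if not tenant.isalnum():
        raise ValueError(f"tenant must be alphanumeric, got {tenant!r}")
    if zone not in ALLOWED_ZONES:
        raise ValueError(f"zone must be one of {sorted(ALLOWED_ZONES)}, got {zone!r}")
    return {"run_date": parsed, "tenant": tenant.lower(), "zone": zone}


def output_path(params: dict,
                base: str = "abfss://<container>@<account>.dfs.core.windows.net") -> str:
    """Build a deterministic folder so a rerun targets the same location."""
    d = params["run_date"]
    return (
        f"{base}/{params['zone']}/tenant={params['tenant']}"
        f"/year={d.year}/month={d.month:02d}/day={d.day:02d}"
    )


params = validate_params(run_date, tenant, zone)
print(output_path(params))
```

Validating parameters in the first cell means a pipeline run with a missing or malformed value fails immediately instead of partway through a transformation.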

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Nonprofit turns donor analysis into a governed pipeline

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A global nonprofit used ad hoc notebooks to analyze donor behavior after campaigns. Results were useful, but every analyst ran a slightly different version and leadership questioned which numbers were official.

Business/Technical Objectives
  • Standardize donor-segmentation logic across campaign teams.
  • Protect personally identifiable donor data during exploration and production runs.
  • Reduce monthly reporting preparation from days to hours.
  • Promote notebook changes through review instead of manual copy-paste.
Solution Using Synapse Apache Spark notebook

The data team converted the best exploratory Synapse Apache Spark notebook into a parameterized production notebook. It accepted campaign month and region parameters, read donor data through managed identity, masked sensitive fields in sample outputs, and wrote curated aggregates to a controlled reporting zone. Azure CLI exported the notebook into source control, and release pipelines imported approved versions into production. A Synapse pipeline ran the notebook on a schedule against a Spark pool and stored run metadata for audit review.

Results & Business Impact
  • Monthly donor reporting preparation fell from 3.5 days to 6 hours.
  • Leadership received one official segmentation output instead of five spreadsheet variants.
  • PII sampling issues found in two cells were removed before production promotion.
  • Notebook promotion became a reviewed pull request with export and import evidence.
Key Takeaway for Glossary Readers

A Synapse Apache Spark notebook becomes much more powerful when exploration is promoted into a governed, parameterized artifact.

Case study 02

Telecom operations debug network quality faster

Scenario

A telecommunications provider investigated intermittent 5G quality drops across several cities. Engineers had logs, tower telemetry, and customer reports, but correlation work took too long during regional incidents.

Business/Technical Objectives
  • Correlate tower telemetry, device events, and customer reports within one investigation workflow.
  • Cut incident analysis time during regional quality degradations.
  • Avoid granting every investigator broad storage-account access.
  • Capture the final diagnostic logic for future incidents.
Solution Using Synapse Apache Spark notebook

The analytics team built a Synapse Apache Spark notebook that accepted city, time window, and tower-cluster parameters. The notebook joined telemetry and support data from curated lake folders, displayed only aggregated samples, and wrote incident-specific summaries to a restricted operations zone. Investigators ran it interactively during incidents, while a reviewed version could be triggered from a pipeline for recurring checks. Azure CLI exports captured each approved notebook version, and Spark session output helped operators distinguish code problems from pool capacity delays.

Results & Business Impact
  • Median incident analysis time dropped from 4.2 hours to 58 minutes.
  • Investigators used managed identity access instead of shared storage keys.
  • Recurring quality checks reused the same reviewed notebook logic weekly.
  • Spark session evidence reduced false platform escalations by 37%. Operations reused the notebook during the next regional maintenance drill.
Key Takeaway for Glossary Readers

Synapse Apache Spark notebooks are effective incident tools when they combine repeatable parameters, safe data access, and exportable logic.

Case study 03

Insurance actuaries make feature engineering reproducible

Scenario

An insurance analytics group prepared claim-risk features for pricing models in isolated notebooks. Model reviewers could not reproduce past feature sets, and expensive reruns sometimes produced different outputs.

Business/Technical Objectives
  • Make feature engineering reproducible for model governance reviews.
  • Reduce rerun cost for large claim-history transformations.
  • Separate development experiments from approved production logic.
  • Create a rollback path for notebook changes that affect model inputs.
Solution Using Synapse Apache Spark notebook

The team refactored the feature-engineering work into a Synapse Apache Spark notebook with explicit parameters for valuation date, business line, and input snapshot. The notebook wrote deterministic outputs to versioned curated folders and logged row counts, schema checks, and feature summaries. Development notebooks stayed in a separate workspace, while approved notebooks were exported with Azure CLI and imported into production through a release pipeline. The production pipeline pinned the Spark pool and library versions used for model-ready datasets.

Results & Business Impact
  • Feature-set reproduction time fell from two days to under four hours.
  • Rerun compute cost dropped 33% after idempotent outputs avoided full recomputation.
  • Model reviewers traced every approved feature set to a notebook version and input snapshot.
  • A bad feature change was rolled back by importing the prior exported notebook.
Key Takeaway for Glossary Readers

A Synapse Apache Spark notebook can satisfy analytics speed and governance when parameters, outputs, and promotion paths are deliberate.

Why use Azure CLI for this?

Azure CLI is useful for Synapse Apache Spark notebooks because it turns notebook management into a repeatable artifact workflow. Engineers can list notebooks, export them for code review, import approved versions, inspect the target workspace, and check related Spark pools. CLI is also valuable during incidents: operators can confirm whether the notebook exists, whether it was recently promoted, and which pool or parameters a run should use. It does not replace Synapse Studio for interactive debugging, but it makes notebook inventory, deployment, and evidence capture much cleaner. That discipline prevents notebooks from becoming invisible production dependencies.

CLI use cases

  • List notebooks in a Synapse workspace to inventory artifacts before a release or cleanup review.
  • Show a notebook and confirm it exists in the expected workspace and folder.
  • Export notebooks into source control or incident evidence before making production changes.
  • Import an approved notebook version into a target workspace during a deployment pipeline.
  • Check related Spark pools and sessions when a notebook activity fails or runs longer than expected.

Before you run CLI

  • Confirm tenant, subscription, Synapse workspace, notebook name, folder path, and intended environment before modifying artifacts.
  • Know whether you are listing, exporting, importing, setting, or deleting because notebook commands can change production logic.
  • Check workspace roles, source-control expectations, and approval status before importing a notebook into production.
  • Verify the target Spark pool, executor settings, parameters, libraries, and data access paths used by the notebook.
  • Use output folders and JSON files carefully so exported notebooks do not overwrite reviewed source artifacts accidentally.
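One way to act on the last point is to give every export its own timestamped folder and refuse to reuse an existing one. The sketch below is a hypothetical helper, not part of Azure CLI; the folder layout (environment, workspace, UTC timestamp) and the names "prod" and "ws-demo" are assumptions for illustration.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def export_folder(root: str, environment: str, workspace: str, now: datetime) -> Path:
    """Return a per-export folder path keyed by environment, workspace, and time."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return Path(root) / environment / workspace / stamp


def prepare_export_folder(root: str, environment: str, workspace: str,
                          now: datetime = None) -> Path:
    """Create the folder, failing if it already exists so nothing is overwritten."""
    now = now or datetime.now(timezone.utc)
    folder = export_folder(root, environment, workspace, now)
    folder.mkdir(parents=True, exist_ok=False)  # FileExistsError on reuse
    return folder


# Demonstration in a throwaway directory with a fixed timestamp.
with tempfile.TemporaryDirectory() as root:
    fixed = datetime(2024, 1, 31, 12, 0, 0, tzinfo=timezone.utc)
    first = prepare_export_folder(root, "prod", "ws-demo", fixed)
    try:
        prepare_export_folder(root, "prod", "ws-demo", fixed)
        reused = True
    except FileExistsError:
        reused = False
    print(first.name, reused)
```

The resulting folder can then be passed as the --output-folder value of an export command, keeping the reviewed source tree untouched.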

What output tells you

  • Notebook list output shows which artifacts exist in the workspace and whether naming or folder conventions are followed.
  • Notebook show output confirms the specific artifact, workspace, metadata, and sometimes Spark pool references or settings.
  • Export output provides a file that can be compared against source control or attached to an incident review.
  • Import or set command results show whether the approved notebook version was accepted by the target workspace.
  • Related Spark pool and session output helps separate notebook-code failures from compute capacity or environment problems.
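Comparing an exported file against source control is easier when run-specific noise is removed first. The sketch below assumes the export is a JSON document in .ipynb layout whose cells may carry "outputs" and "execution_count" fields; it strips those, canonicalizes the JSON, and compares hashes. The sample documents are hypothetical.

```python
import hashlib
import json


def normalized_digest(raw: str) -> str:
    """Hash a notebook after dropping cell outputs and execution counters."""
    doc = json.loads(raw)
    for cell in doc.get("cells", []):
        cell.pop("outputs", None)
        cell.pop("execution_count", None)
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_logic(exported_raw: str, committed_raw: str) -> bool:
    """True when two notebooks differ only in outputs or execution counts."""
    return normalized_digest(exported_raw) == normalized_digest(committed_raw)


# Hypothetical documents: committed version, a re-run export, and a hotfix.
committed = json.dumps({"cells": [{"cell_type": "code", "source": ["x = 1\n"],
                                   "outputs": [], "execution_count": None}]})
exported = json.dumps({"cells": [{"cell_type": "code", "source": ["x = 1\n"],
                                  "outputs": [{"text": "1"}], "execution_count": 3}]})
hotfixed = json.dumps({"cells": [{"cell_type": "code", "source": ["x = 2\n"],
                                  "outputs": [], "execution_count": None}]})

print(same_logic(exported, committed), same_logic(hotfixed, committed))
```

A mismatch after normalization suggests the workspace copy contains logic that was never committed, which is worth resolving before any import.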

Mapped Azure CLI commands

Synapse notebook artifact operations

direct
az synapse notebook list --workspace-name <workspace-name>
az synapse notebook · discover · Analytics
az synapse notebook show --workspace-name <workspace-name> --name <notebook-name>
az synapse notebook · discover · Analytics
az synapse notebook export --workspace-name <workspace-name> --name <notebook-name> --output-folder <folder>
az synapse notebook · operate · Analytics
az synapse notebook import --workspace-name <workspace-name> --name <notebook-name> --file @<path-to-notebook>
az synapse notebook · provision · Analytics
az synapse spark pool show --name <spark-pool-name> --workspace-name <workspace-name> --resource-group <resource-group>
az synapse spark pool · discover · Analytics
az synapse spark session list --workspace-name <workspace-name> --spark-pool-name <spark-pool-name>
az synapse spark session · discover · Analytics
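For scripted use, the notebook commands above can be assembled as argument lists in Python. This is a minimal sketch: the helper function names and the sample workspace, notebook, and folder values are hypothetical, only the flags shown in the mapped commands are used, and the run helper is defined but not called because it needs an installed, signed-in Azure CLI.

```python
import shlex
import subprocess


def notebook_list(workspace: str) -> list:
    """Arguments for listing notebooks in a workspace."""
    return ["az", "synapse", "notebook", "list", "--workspace-name", workspace]


def notebook_export(workspace: str, name: str, output_folder: str) -> list:
    """Arguments for exporting one notebook to a local folder."""
    return ["az", "synapse", "notebook", "export", "--workspace-name", workspace,
            "--name", name, "--output-folder", output_folder]


def notebook_import(workspace: str, name: str, file_path: str) -> list:
    """Arguments for importing a notebook file; the CLI expects an @ prefix."""
    return ["az", "synapse", "notebook", "import", "--workspace-name", workspace,
            "--name", name, "--file", f"@{file_path}"]


def run(cmd: list) -> str:
    """Run a command and return stdout (requires Azure CLI and a login)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Print the commands instead of running them; all values are placeholders.
for cmd in (notebook_list("ws-demo"),
            notebook_export("ws-demo", "donor_segments", "exports"),
            notebook_import("ws-prod", "donor_segments", "exports/donor_segments.ipynb")):
    print(shlex.join(cmd))
```

Passing an argument list to subprocess avoids shell quoting problems with notebook names or paths that contain spaces.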

Architecture context

As an Azure architect, I treat Synapse Apache Spark notebooks as code artifacts in the analytics platform, not personal scratchpads. Development notebooks can be exploratory, but production notebooks should be parameterized, source-controlled, reviewed, and executed by pipelines with managed identities. They should avoid embedded secrets, hard-coded storage paths, and environment-specific assumptions. The Spark pool, libraries, storage permissions, and monitoring should be defined consistently across environments. Notebook output should land in controlled data zones with idempotent paths. This architecture lets teams keep the speed of notebooks without sacrificing deployment hygiene, governance, or recovery.

Security

Security impact is direct because notebooks can read, transform, display, and write sensitive data. A careless cell can expose secrets, sample protected rows, or write curated data to the wrong location. Access to create, edit, run, import, and export notebooks should align with workspace roles and data privileges. Managed identities and linked services should replace pasted credentials. Operators should review notebooks for hard-coded keys, broad storage paths, debug prints of sensitive data, and unapproved package sources. Exported notebooks should be protected like code because they may reveal data logic, endpoints, parameters, and operational assumptions.
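A review for hard-coded keys can be partly automated. The sketch below scans cell source text with a few regular expressions; the patterns are illustrative assumptions that will miss many secret formats and may flag harmless text, and the sample cells contain fake values only.

```python
import re

# Illustrative patterns only; a real review needs a broader rule set.
SECRET_PATTERNS = {
    "storage account key": re.compile(r"AccountKey\s*=\s*[A-Za-z0-9+/=]{20,}"),
    "SAS signature": re.compile(r"[?&]sig=[A-Za-z0-9%+/=]{20,}"),
    "inline credential": re.compile(
        r"(?i)\b(?:password|pwd|secret|token)\s*=\s*['\"][^'\"]+['\"]"
    ),
}


def scan_cells(cell_sources: list) -> list:
    """Return (cell_index, finding_label) pairs for suspicious cells."""
    findings = []
    for index, source in enumerate(cell_sources):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(source):
                findings.append((index, label))
    return findings


# Hypothetical cell sources with fake values; the text is scanned, not executed.
SAMPLE_CELLS = [
    "df = spark.read.parquet(input_path)",
    'conn = "AccountName=demo;AccountKey=' + "A" * 32 + '"',
    'password = "not-a-real-password"',
]

print(scan_cells(SAMPLE_CELLS))
```

A scan like this fits naturally in a pull request check on exported notebooks, with a human still reviewing anything it flags.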

Cost

Cost impact is indirect through the Spark sessions the notebook starts and the data it processes. The notebook artifact itself is not the expensive part; attached Spark pools, long-running interactive sessions, oversized executors, repeated test runs, and inefficient reads create spend. Development teams can accidentally leave sessions active or rerun heavy transformations while experimenting. Production pipelines can multiply cost if notebooks are not idempotent and must be rerun after partial failures. Cost controls include pool auto-pause and session timeouts, right-sized executor settings, scheduling discipline, sampled development data, and monitoring for idle sessions before they become a hidden platform tax.
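Idle-session monitoring can be as simple as filtering a list of session records by state and last activity. The record shape below (name, state, last_activity) is a hypothetical simplification, not the actual output schema of az synapse spark session list, and the 30-minute threshold is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone


def idle_sessions(sessions: list, now: datetime,
                  max_idle: timedelta = timedelta(minutes=30)) -> list:
    """Return names of sessions that are idle for longer than max_idle."""
    flagged = []
    for session in sessions:
        if session["state"] != "idle":
            continue  # busy or starting sessions are doing work
        last = datetime.fromisoformat(session["last_activity"])
        if now - last > max_idle:
            flagged.append(session["name"])
    return flagged


# Hypothetical records; field names are assumptions for illustration.
NOW = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
SESSIONS = [
    {"name": "explore-a", "state": "idle", "last_activity": "2024-01-31T11:00:00+00:00"},
    {"name": "explore-b", "state": "idle", "last_activity": "2024-01-31T11:50:00+00:00"},
    {"name": "nightly-job", "state": "busy", "last_activity": "2024-01-31T09:00:00+00:00"},
]

print(idle_sessions(SESSIONS, NOW))
```

In practice the records would be mapped from real session output, and the flagged names would feed a report or a reminder rather than an automatic kill.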

Reliability

Reliability impact is high when notebooks become pipeline steps. Interactive code that works once can fail in production because parameters are missing, libraries differ, sessions time out, data schemas drift, or output paths are not idempotent. Reliable notebooks declare inputs, validate schemas, handle empty data, write deterministic outputs, and fail loudly with useful errors. Pipelines should retry only safe operations and alert on failed runs. Operators should keep known-good exports and avoid editing production notebooks directly. Reliability improves when notebooks are promoted through environments like any other deployment artifact.
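The "validate schemas, handle empty data, fail loudly" habits can be shown without Spark. The sketch below checks plain Python dictionaries standing in for rows; the column names, expected types, and the InputValidationError class are hypothetical, and a real notebook would apply the same idea to a DataFrame schema.

```python
# Hypothetical expected columns and the Python types each must have.
EXPECTED_COLUMNS = {"claim_id": str, "amount": (int, float), "valuation_date": str}


class InputValidationError(ValueError):
    """Raised when input data does not match the declared contract."""


def validate_rows(rows: list, expected: dict = EXPECTED_COLUMNS) -> int:
    """Check rows against the expected columns; return the row count."""
    if not rows:
        raise InputValidationError("input is empty; refusing to write an empty output")
    for index, row in enumerate(rows):
        missing = sorted(expected.keys() - row.keys())
        if missing:
            raise InputValidationError(f"row {index} is missing columns: {missing}")
        for column, types in expected.items():
            if not isinstance(row[column], types):
                raise InputValidationError(
                    f"row {index} column {column!r} has unexpected type "
                    f"{type(row[column]).__name__}"
                )
    return len(rows)


good_rows = [{"claim_id": "c-1", "amount": 120.5, "valuation_date": "2024-01-31"}]
row_count = validate_rows(good_rows)

try:
    validate_rows([])
except InputValidationError as exc:
    empty_error = str(exc)

try:
    validate_rows([{"claim_id": "c-2", "amount": 10}])
except InputValidationError as exc:
    drift_error = str(exc)

print(row_count, empty_error, drift_error, sep="\n")
```

Raising a specific error with the row and column in the message gives the pipeline run history something actionable instead of a generic failure.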

Performance

Performance impact comes from notebook code and Spark execution choices. Slow notebooks often suffer from poor partitioning, repeated full data scans, unnecessary collects to the driver, small-file problems, unpersisted intermediate data, or under-sized pools. Interactive users may also hide performance issues by running only small samples. Operators should review Spark stages, input sizes, shuffle behavior, executor use, and pipeline duration before adding compute. Parameterized notebooks should be tested with production-sized data before they become a daily dependency. Good notebook performance means the logic remains understandable while using Spark parallelism efficiently enough to finish within scheduled analytics windows.

Operations

Operations teams manage Synapse Apache Spark notebooks through inventory, export, import, source control, pipeline orchestration, run monitoring, and incident review. They list notebooks in a workspace, confirm naming and folder conventions, check Spark pool references, review recent runs, and capture failed-session evidence. Runbooks should explain how to rerun a notebook with parameters, export the current version, compare it with source control, and identify whether a failure came from code, pool capacity, library changes, or data. Good operations also include cleanup of abandoned notebooks that would otherwise confuse support and release teams.

Common mistakes

  • Treating production notebooks as personal scratchpads and editing them directly during incidents.
  • Hard-coding storage paths, dates, secrets, or Spark pool names instead of using parameters and secure services.
  • Exporting notebooks after a hotfix but never committing the changed artifact back to source control.
  • Testing notebook logic only on tiny samples and discovering performance failures during scheduled production runs.
  • Deleting or overwriting a notebook without first exporting evidence and checking pipeline dependencies.