Databases Database complete template-specs-five-use-cases template-specs-five-use-cases-three-case-studies

Slow query

Slow query is not just a query that feels sluggish; it is a database statement whose duration, frequency, CPU use, I/O, waits, or memory pressure affects users, batch windows, or service capacity. In Azure, you usually find slow queries through Azure SQL Query Store, PostgreSQL query statistics, MySQL slow query logs, Application Insights dependency telemetry, or Log Analytics. The fix might be indexing, query rewriting, parameter review, scaling, cache design, or removing unnecessary calls. A good slow-query investigation connects the database symptom to the application path that users actually experience.

Back to glossary browser Open Microsoft Learn source

Aliases: Long-running query, slow SQL query, query regression, expensive query
Difficulty: advanced
CLI mappings: 2
Last verified: 2026-05-04

Microsoft Learn

Microsoft Learn database guidance uses slow query analysis across Azure SQL, Azure Database for PostgreSQL, Azure Database for MySQL, and related services to find statements whose duration, resource use, waits, or frequency hurt users or capacity. Evidence usually comes from Query Store, diagnostic logs, metrics, or engine-specific views.

Microsoft Learn: Azure database documentation2026-05-04

Technical context

In Azure architecture, slow query analysis sits across the data plane, application telemetry, and monitoring layer. The database engine records execution duration, plans, waits, reads, writes, and frequency, while Azure Monitor and Application Insights show how those statements affect request latency and dependency calls. Azure SQL often uses Query Store; PostgreSQL can use pg_stat_statements and diagnostic logging; MySQL can use slow query logs and performance metrics. CLI is usually used to discover resources, diagnostic settings, SKU, metrics, and log routing, while SQL or KQL tools perform deeper statement analysis.

Why it matters

Slow queries matter because they are one of the most common causes of cloud applications feeling unreliable even when every Azure resource reports healthy. One inefficient statement can hold locks, saturate CPU, burn I/O, increase DTU or vCore pressure, extend batch windows, and trigger application timeouts. Scaling the database may hide the issue temporarily while increasing spend. Good teams identify whether the problem is query shape, missing index, stale statistics, parameter sensitivity, chatty application design, or insufficient capacity. The value is not only faster SQL; it is fewer incidents, cleaner capacity planning, and evidence-based decisions about tuning versus scaling during peak demand.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

Azure SQL Query Store surfaces slow query text, plans, duration, CPU, reads, waits, and regression patterns inside the database troubleshooting workflow for production reviews and audit evidence.

Signal 02

Application Insights dependency telemetry shows database calls that dominate request duration, helping teams connect slow SQL behavior to user-facing latency for a specific endpoint or workflow.

Signal 03

Log Analytics workbooks or database diagnostic logs reveal repeated long-running statements, deadlocks, timeouts, or high resource usage during incident windows and tuning reviews after releases.

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

Find the specific SQL statement causing checkout, login, reporting, or API latency instead of scaling blindly.
Distinguish a missing-index problem from a capacity problem before upgrading database tiers or adding replicas.
Correlate Application Insights dependency latency with Query Store evidence during a production performance incident.
Detect release-related query regressions after schema changes, new filters, parameter changes, or ORM-generated SQL updates.
Build a tuning backlog from recurring high-total-duration queries that quietly consume compute every business day.

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Energy trading desk fixes quote-screen latency

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A renewable-energy trading desk used Azure SQL Database for intraday quote screens. Traders saw random freezes, but CPU graphs alone did not explain why some screens took 12 seconds to refresh.

Business/Technical Objectives

Identify the statement causing p95 quote refresh latency above three seconds.
Avoid a permanent vCore increase unless evidence proved capacity was the constraint.
Reduce lock contention during market-open reporting queries.
Create a repeatable tuning record for compliance review.

Solution Using Slow query

The database engineer used Azure CLI to confirm the production database, SKU, region, diagnostic settings, and ownership tags, then reviewed Query Store for the incident windows. The evidence showed one reporting query scanning historical positions while quote screens needed current positions. The fix added a filtered index, rewrote the report predicate, and moved a heavy export outside market-open hours. Application Insights dependency telemetry confirmed that quote-screen calls were the affected user path. A temporary scale-up was used only during the first morning after rollout, then removed once metrics proved the fix held.

Results & Business Impact

p95 quote refresh latency fell from 12.4 seconds to 1.8 seconds.
A proposed permanent four-vCore increase was avoided, saving about 31 percent on that database.
Lock wait incidents during market open dropped from daily to one minor event in a month.
Compliance received a tuning record with before-and-after Query Store evidence.

Key Takeaway for Glossary Readers

Slow-query evidence helped the team fix the real bottleneck instead of buying capacity for a bad access pattern.

Case study 02

Library consortium rescues nightly catalog indexing

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A public-library consortium ran a shared catalog on Azure Database for PostgreSQL Flexible Server. Nightly indexing drifted past opening time, leaving patrons with stale availability data.

Business/Technical Objectives

Cut nightly catalog indexing from six hours to under two hours.
Find whether the delay came from SQL, application batching, or database capacity.
Keep patron search availability stable during the tuning rollout.
Give each member library a clear report on the changed data path.

Solution Using Slow query

Operators used Azure CLI to verify the PostgreSQL server, storage, compute tier, diagnostic settings, and resource tags, then reviewed pg_stat_statements and query logs for the batch window. The slowest query joined circulation history without a selective date predicate, causing huge reads as data grew. The development team rewrote the query, added a composite index, and split indexing into smaller batches with progress logging. Azure Monitor metrics and Application Insights dependency telemetry were compared before and after the change. The runbook documented how to check top total-duration queries each month before the catalog window drifted again.

Results & Business Impact

Nightly indexing time dropped from 6.2 hours to 94 minutes.
Patron stale-availability complaints fell 63 percent in the next month.
CPU peaks during indexing dropped from 92 percent to 54 percent.
Monthly review now identifies top total-duration queries before they affect opening hours.

Key Takeaway for Glossary Readers

Slow-query analysis turned a creeping batch problem into a measurable, maintainable database practice.

Case study 03

Veterinary SaaS stops retry storms from one query

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A veterinary practice SaaS hosted appointment data in Azure SQL Database. Clinics reported intermittent booking failures whenever morning reminder jobs started.

Business/Technical Objectives

Find the database call that triggered application retries and booking timeouts.
Reduce p99 booking latency below two seconds during reminder processing.
Avoid disabling reminders, which reduced no-show rates for clinics.
Separate application retry behavior from database tuning actions.

Solution Using Slow query

The SRE team started with Azure CLI to verify the database, SKU, diagnostic settings, and recent deployment timeline. Application Insights showed booking requests waiting on a dependency call to check appointment availability. Query Store revealed a parameter-sensitive plan that performed well for small clinics but scanned heavily for large clinics. The team added a targeted index, changed the stored procedure to reduce parameter sensitivity, and throttled reminder-job concurrency. They also adjusted application retries to avoid multiplying pressure during database waits. Post-change dashboards tracked p99 booking latency, Query Store regressions, and reminder-job duration together.

Results & Business Impact

p99 booking latency during reminders fell from 8.7 seconds to 1.6 seconds.
Morning booking failures dropped from 4.8 percent to 0.3 percent.
Reminder delivery stayed on schedule, preserving the clinics no-show reduction program.
Retry volume during the peak window fell 71 percent after application and query fixes.

Key Takeaway for Glossary Readers

A slow query can become an application outage when retries multiply it; tuning must follow the whole path.

Why use Azure CLI for this?

From ten years of Azure engineering work, I use Azure CLI in slow-query investigations to establish the Azure facts before going deep in SQL tools. CLI quickly shows the database, server, SKU, region, diagnostic settings, metrics, private endpoint context, and resource tags. That prevents a common mistake: tuning the wrong database or blaming the engine when telemetry is not enabled. CLI also helps export evidence for change records, compare production with staging, and confirm whether recent scaling, firewall, or monitoring changes happened near the regression. The actual plan review may happen in T-SQL or KQL, but CLI frames the investigation.

CLI use cases

Show the database or server SKU, region, and resource ID before starting a slow-query investigation.
List diagnostic settings to confirm query or performance logs are routed to the expected workspace.
Pull database metrics such as CPU, DTU, storage, connection count, or deadlocks for the incident window.
Export resource tags and ownership metadata so the right application and database teams join the review.
Compare staging and production database settings when a query regression appears only in one environment.

Before you run CLI

Confirm the tenant, subscription, resource group, server, and database so you do not investigate the wrong environment.
Use read-only commands for evidence collection unless an approved emergency action requires scaling or configuration changes.
Check whether you have database-level permissions for Query Store or DMVs in addition to Azure resource permissions.
Define the incident time window and output format before collecting metrics or exporting settings.
Treat query text, parameters, logs, and connection strings as sensitive evidence that may require restricted handling.

What output tells you

Resource output confirms the database engine, tier, SKU, region, and ownership metadata for the affected workload.
Metric output shows whether CPU, DTU, I/O, storage, connections, or deadlocks aligned with the slow-query symptom.
Diagnostic settings show whether logs are being captured and where query-related evidence should be searched.
Recent configuration output helps identify scaling, firewall, backup, or monitoring changes near the performance regression.
Tag and resource group data show which application team owns the workload and should approve tuning changes.

Mapped Azure CLI commands

Adjacent discovery commands

adjacent

az resource list --resource-group <resource-group> --output table

az resourcediscoverDatabases

az resource show --ids <resource-id>

az resourcediscoverManagement and Governance

Architecture context

Architecturally, a slow query is a symptom that crosses application, database, and observability boundaries. The query text runs in the data plane, but the user impact appears in application latency, queue buildup, retry storms, or failed jobs. Azure SQL, PostgreSQL, MySQL, Cosmos DB, and analytics engines expose different evidence, so the architecture must identify which engine owns the workload and where diagnostics are retained. I expect a production design to include Query Store or equivalent telemetry, Log Analytics routing where appropriate, index and schema ownership, and an escalation path between app and database teams. Without that contract, slow-query incidents turn into blame loops.

Security

Security matters because slow-query evidence can expose sensitive business data, table names, tenant identifiers, predicates, stored procedure names, or customer behavior. Operators should avoid pasting full query text into public tickets or broad chat channels. Access to Query Store, database DMVs, diagnostic logs, and Log Analytics workspaces should follow least privilege. If the fix requires adding an index, changing a stored procedure, or enabling logs, the change should follow normal review because it can affect data exposure, retention, and compliance. Slow-query troubleshooting should also confirm that retry storms are not masking authentication failures or blocked network paths. Evidence handling should be as deliberate as code or schema handling.

Cost

Slow queries create cost through wasted compute, larger SKUs, longer job runtimes, excess logging, and engineering time. Teams often scale from General Purpose to Business Critical or raise vCores when the cheaper fix is a targeted index, query rewrite, or reduced call frequency. In serverless databases, slow queries can extend active compute time. In provisioned systems, they consume capacity that could serve other work. Cost analysis should compare tuning effort, index storage, plan stability, and scaling benefit. The FinOps lesson is simple: do not buy permanent capacity for a query problem until evidence proves capacity is the real constraint. That evidence also supports capacity requests when tuning alone is insufficient.

Reliability

Reliability impact is direct. Slow queries can exhaust connection pools, hold locks, delay transactions, miss batch windows, and cause upstream services to retry until a small database issue becomes a platform incident. A single reporting query can block operational writes if isolation, indexing, or workload separation is poor. Operators should correlate query duration with application errors, deadlocks, CPU, I/O, storage, and connection metrics. Reliable recovery may involve killing a session, scaling temporarily, applying an index, rolling back a release, or moving a report. The runbook should define safe emergency actions and post-incident tuning ownership. Recovery steps should be rehearsed before the busiest business window.

Performance

Performance is the visible symptom and the diagnostic target. Slow-query work looks at duration, CPU time, logical reads, physical reads, waits, plan choice, parameter values, blocking, and frequency. Azure telemetry should connect the statement to user-facing latency, p95 or p99 response time, and dependency calls. A query that runs once for ten seconds may be less damaging than one that runs thousands of times for 300 milliseconds. Performance tuning may require indexes, statistics updates, plan forcing, query rewriting, caching, batching, or application changes. Operators should measure before and after, not rely on explain-plan optimism alone. The final proof should be measured under representative workload conditions.

Operations

Operations teams handle slow queries by collecting evidence before guessing. They identify the affected database, time window, workload path, query signature, plan, waits, and related application dependency calls. Azure CLI helps confirm resource configuration and diagnostic routing; SQL tools, Query Store, logs, and KQL identify the expensive statements. Operators document whether the issue is new, recurring, release-related, data-volume-related, or capacity-related. Changes should be tested with representative data and monitored after deployment. Good operations also keep a known list of critical queries, expected durations, and rollback steps for index or procedure changes. They preserve before-and-after metrics so improvements are not anecdotal.

Common mistakes

Scaling the database before checking whether one query, plan regression, or missing index is consuming capacity.
Looking only at average duration and ignoring high-frequency queries with the largest total resource cost.
Sharing full query text or parameters in broad incident channels where sensitive data may be exposed.
Tuning in development with tiny data volumes and assuming the result will hold in production.
Forgetting that application retry behavior can multiply the impact of one slow dependency call.

Operator quick checks

Confirm which database, server, application endpoint, and user journey are affected before reviewing query plans.
Check Query Store, dependency telemetry, and metrics for the same time window.
Compare top queries by total duration, CPU, reads, frequency, and recent regression status.
Verify whether a recent deployment, schema change, statistics change, or data-volume spike preceded the issue.
Measure performance after the proposed fix using production-like parameters and realistic row counts.

Questions to ask

Is this query slow because of plan choice, missing index, blocking, data volume, or insufficient capacity?
Which user journey or batch process is affected, and what latency target is being violated?
Who owns the query text, schema, indexes, and deployment path for the fix?
What emergency mitigation is safe if the slow query starts causing connection exhaustion or lock contention?
How will the team prove the fix reduced p95 latency, total duration, or resource consumption after rollout?

Related terms

No related terms mapped yet.

Graph connections

Graph edges are queued for this term.

Learn next

Use related terms, graph links, command groups, and comparison cards to keep moving through Azure without losing context.

Open relationship graph