Azure SQL failover group

An Azure SQL failover group is a disaster-recovery wrapper for one or more Azure SQL databases. It keeps databases replicated to a secondary logical server in another region and gives applications listener names that can move during failover. Instead of rebuilding connection strings during a regional incident, teams design applications to connect through the failover group listener. It is especially useful when several databases must fail over together, when recovery drills need repeatable steps, and when business continuity depends on a documented regional escape route.

Aliases: Azure SQL failover group, SQL failover group, auto-failover group, failover group, geo-failover group
Difficulty: intermediate
CLI mappings: 5
Last verified: 2026-06-02

Microsoft Learn

Microsoft Learn describes Azure SQL failover groups as a declarative way to manage geo-replication and coordinated failover for databases on a logical server to another logical server in a different region. Listener endpoints help applications reconnect without changing connection strings.

Microsoft Learn: Failover groups overview and best practices - Azure SQL Database, 2026-06-02

Technical context

Technically, a failover group sits at the Azure SQL logical-server layer and uses active geo-replication underneath. It contains one or more user databases, a partner server in another region, a failover policy, a grace period, and both a read-write and a read-only listener endpoint. It does not replace backups, zone redundancy, or application retry logic. Operators configure it together with matching firewall, private endpoint, DNS, identity, and monitoring choices on both servers so applications can authenticate and route correctly after planned or unplanned failover.
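As a small illustration of the listener endpoints mentioned above, the sketch below derives the two listener host names from a failover group name. It assumes the Azure public cloud DNS suffix for Azure SQL Database (sovereign clouds use other suffixes), and the group name fg-example is hypothetical.

```python
# Sketch only: assumes the Azure public cloud suffix for Azure SQL Database.
PUBLIC_CLOUD_SUFFIX = "database.windows.net"


def listener_endpoints(group_name: str, suffix: str = PUBLIC_CLOUD_SUFFIX) -> dict:
    """Return the read-write and read-only listener host names for a group."""
    name = group_name.strip().lower()
    if not name:
        raise ValueError("failover group name must not be empty")
    return {
        "read_write": f"{name}.{suffix}",
        "read_only": f"{name}.secondary.{suffix}",
    }


# "fg-example" is a hypothetical failover group name.
endpoints = listener_endpoints("fg-example")
print(endpoints["read_write"])
print(endpoints["read_only"])
```

Both names follow the current primary and secondary roles, which is why applications that use them do not need new connection strings after a failover.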

Why it matters

Azure SQL failover groups matter because regional resilience is not automatic just because a database is managed. If an application is hard-coded to one server name, a regional outage can turn into a manual connection-string scramble. Failover groups give architects a cleaner recovery pattern: replicate critical databases, use stable listener endpoints, test failover before a crisis, and define who can initiate the switch. They also expose uncomfortable dependencies, such as firewall rules, private DNS, application retry settings, and background jobs that assume the primary region never changes. For regulated workloads, they provide evidence that recovery planning is operational, not just documented.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

In the Azure SQL logical server business continuity blade, failover groups show partner server, database membership, replication state, failover policy, and listener endpoints for recovery drills.

Signal 02

In Azure CLI output from az sql failover-group show, operators see the group name, the read-write and read-only endpoint policies, the replication role and state, and partner server metadata during DR reviews.

Signal 03

In application configuration and secret stores, failover readiness reviews check that the connection string references the failover group listener rather than a region-specific logical server name.
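A readiness review of connection strings can be partly automated. The sketch below is one possible check, assuming ADO.NET-style strings with a Server or Data Source entry and the Azure public cloud suffix; the group, server, and database names are hypothetical, and the parser is deliberately simple (it does not handle quoted values that contain semicolons).

```python
def server_host(connection_string: str) -> str:
    """Extract the host from a 'Server=' or 'Data Source=' entry."""
    for part in connection_string.split(";"):
        key, _, value = part.partition("=")
        if key.strip().lower() in {"server", "data source"}:
            host = value.strip()
            if host.lower().startswith("tcp:"):
                host = host[4:]
            # Drop an optional ",1433" port suffix.
            return host.split(",")[0].strip().lower()
    raise ValueError("no Server or Data Source entry found")


def uses_listener(connection_string: str, group_name: str) -> bool:
    """True when the string targets the group's listener, not a server name."""
    prefix = group_name.strip().lower()
    return server_host(connection_string) in {
        f"{prefix}.database.windows.net",
        f"{prefix}.secondary.database.windows.net",
    }


# Hypothetical connection strings for illustration.
good = "Server=tcp:fg-example.database.windows.net,1433;Database=reservations;Encrypt=True;"
bad = "Server=tcp:sql-eastus-example.database.windows.net,1433;Database=reservations;Encrypt=True;"
print(uses_listener(good, "fg-example"))
print(uses_listener(bad, "fg-example"))
```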

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

Provide a stable listener endpoint for applications that must survive an Azure SQL regional failover.
Coordinate failover for several related user databases instead of managing each geo-replica separately.
Run planned disaster-recovery exercises that prove DNS, firewall, identity, and application retry behavior.
Support read-only reporting against a secondary endpoint when workload design and consistency requirements allow it.
Move a production database group during a regional incident without editing application connection strings under pressure.

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Travel booking regional recovery

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A travel booking platform ran reservations and partner availability databases in one Azure region. A severe regional networking incident threatened checkout availability during a holiday travel surge.

Business/Technical Objectives

Recover reservation writes in another region within thirty minutes
Avoid changing application connection strings during the incident
Protect partner availability reads from stale routing assumptions
Produce an incident timeline for executive and partner review

Solution Using Azure SQL failover group

The architecture team had already grouped the reservation and availability databases into an Azure SQL failover group with a partner logical server in a second region. Applications used the read-write listener from Key Vault-backed configuration, and background workers had retry logic for transient SQL errors. During the incident, operators used Azure CLI to show group state, confirm database membership, and initiate the approved failover. Network engineers validated private DNS and firewall rules from the app subnet before traffic was fully reopened. After failover, the team ran checkout, cancellation, and partner-search smoke tests against the listener.

Results & Business Impact

Reservation write capability returned in twenty-one minutes after failover approval
No application connection string changes were required during the incident
Checkout error rate dropped from 18 percent to under 1 percent after connection pools reset
The post-incident report included CLI evidence for group state, timing, and database membership

Key Takeaway for Glossary Readers

A failover group is valuable because it turns regional database recovery into a rehearsed routing change instead of a frantic application reconfiguration effort.

Case study 02

Tax filing DR audit

Scenario

A state tax agency needed to prove that its filing portal could survive loss of the primary database region before the annual filing deadline.

Business/Technical Objectives

Demonstrate regional database recovery within the approved RTO
Keep taxpayer portal configuration stable during planned failover
Validate audit, firewall, and private endpoint behavior in both regions
Document clear authority for failover and failback decisions

Solution Using Azure SQL failover group

The platform team created a failover group for the portal, payment staging, and correspondence databases, using a secondary logical server in a compliant region. Application settings referenced the listener endpoint, while private DNS zones and firewall rules were mirrored and reviewed. During a scheduled exercise, operators used CLI commands to capture the initial group state, perform planned failover, and verify role reversal. Security staff checked audit-log delivery and Microsoft Entra administrator configuration after the switch. The team then ran taxpayer submission, payment validation, and staff review workflows before failing back in a second controlled window.

Results & Business Impact

The planned failover completed in fourteen minutes, beating the thirty-minute RTO
All three portal workflows passed smoke tests without connection-string edits
Audit delivery continued from the secondary server with no logging gap observed
The runbook gained named approvers, rollback criteria, and CLI evidence requirements

Key Takeaway for Glossary Readers

Failover groups make disaster-recovery audits practical because teams can prove not only replication, but the surrounding operational path.

Case study 03

Industrial telemetry continuity

Scenario

An industrial equipment maker stored fleet telemetry summaries in Azure SQL and used them for customer dashboards and maintenance alerts across thousands of deployed machines.

Business/Technical Objectives

Keep customer dashboards available during regional maintenance
Use the read-only listener for reporting without endangering failover readiness
Confirm alert workers reconnect after database role changes
Reduce manual recovery steps for the operations center

Solution Using Azure SQL failover group

The data platform team placed the telemetry summary and maintenance rules databases in a failover group. Customer dashboards used the read-write listener for normal operations, while a reporting job used the read-only listener after consistency expectations were documented. Before a regional maintenance event, operators verified partner server capacity, failover policy, listener DNS, and private endpoint connectivity. They performed a planned failover, restarted connection pools, and checked that maintenance alert workers resumed against the new primary. Query latency and replication health were monitored throughout the event.

Results & Business Impact

Dashboard availability stayed above 99.95 percent during the maintenance window
Manual database recovery steps dropped from seventeen checklist items to six
Reporting traffic used secondary capacity without increasing primary CPU during normal weeks
Maintenance alert workers reconnected within four minutes after role reversal

Key Takeaway for Glossary Readers

A well-designed failover group protects the database tier, but its real value appears when applications, jobs, and network paths are ready to follow it.

Why use Azure CLI for this?

With a decade of Azure operations behind me, I prefer Azure CLI for failover groups because disaster recovery cannot depend on portal muscle memory. CLI lets the team list every group, prove the partner server, show endpoint configuration, capture replication state, and run scripted validation before a real outage. It also keeps recovery drills honest: the same commands used in a tabletop exercise can be used in production with controlled approvals. Screenshots are weak evidence; command output with timestamps, resource IDs, and subscription context is stronger. For failover groups, that repeatability is the difference between a tested runbook and hopeful documentation.

CLI use cases

List failover groups across resource groups to confirm every critical database has the expected regional partner.
Show read-write and read-only listener endpoints before updating application configuration or secret references.
Run a planned failover during a DR exercise and capture before-and-after group state for audit evidence.
Update failover policy or grace period through reviewed automation instead of undocumented portal changes.

Before you run CLI

Confirm tenant, subscription, resource group, primary server, partner server, database list, and approved recovery window.
Verify SQL Server Contributor or equivalent RBAC, plus any change-management approval required to fail over production.
Check private DNS, firewall rules, application retry behavior, and monitoring in both regions before initiating failover.
Understand whether the command is read-only, mutating, or failover-triggering, and choose explicit output formats for evidence.
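One way to make the last check concrete is a small lookup that labels a failover-group command before anyone runs it. This is a sketch under the assumption that the verb alone determines the risk level; the labels are this example's own and are not an Azure CLI feature.

```python
# Risk labels are illustrative; they are not part of Azure CLI.
RISK_BY_VERB = {
    "list": "read-only",
    "show": "read-only",
    "create": "mutating",
    "update": "mutating",
    "delete": "mutating",
    "set-primary": "failover-triggering",
}


def classify(command: str) -> str:
    """Label an 'az sql failover-group <verb> ...' command by its risk."""
    tokens = command.split()
    if len(tokens) < 4 or tokens[:3] != ["az", "sql", "failover-group"]:
        raise ValueError("not an az sql failover-group command")
    return RISK_BY_VERB.get(tokens[3], "unknown")


print(classify("az sql failover-group show --name fg-example --server sql-example --resource-group rg-example"))
print(classify("az sql failover-group set-primary --name fg-example --server sql-example --resource-group rg-example"))
```

A runbook script could refuse to run anything labeled failover-triggering unless an approval reference is supplied.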

What output tells you

Failover-group output identifies the primary and partner servers, current replication role, group membership, and listener names applications should use.
Policy fields show whether failover is manual or automatic and what grace period applies before automatic failover can occur.
Database membership confirms whether all required user databases are protected or whether a dependent database was left outside the group.
Post-failover output confirms role reversal and helps operators decide whether application smoke tests are checking the right region.
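The points above can be sketched as a short parser for the JSON that az sql failover-group show returns with --output json. The sample document is hypothetical and trimmed; the field names are the ones I expect from the failover group resource, so verify them against real output before relying on this.

```python
import json

# Hypothetical, trimmed sample; real output contains more fields.
SAMPLE_OUTPUT = """
{
  "name": "fg-example",
  "replicationRole": "Primary",
  "replicationState": "CATCH_UP",
  "partnerServers": [
    {"id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-example/providers/Microsoft.Sql/servers/sql-secondary-example",
     "location": "West US",
     "replicationRole": "Secondary"}
  ],
  "readWriteEndpoint": {"failoverPolicy": "Automatic",
                        "failoverWithDataLossGracePeriodMinutes": 60},
  "readOnlyEndpoint": {"failoverPolicy": "Disabled"},
  "databases": [
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-example/providers/Microsoft.Sql/servers/sql-primary-example/databases/reservations",
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-example/providers/Microsoft.Sql/servers/sql-primary-example/databases/availability"
  ]
}
"""


def summarize(group: dict, required_databases: set) -> dict:
    """Reduce show output to the fields a DR review usually asks about."""
    read_write = group.get("readWriteEndpoint") or {}
    # Database and server names are the last segment of each resource ID.
    protected = {db_id.rsplit("/", 1)[-1] for db_id in group.get("databases") or []}
    return {
        "role": group.get("replicationRole"),
        "partners": [p["id"].rsplit("/", 1)[-1] for p in group.get("partnerServers") or []],
        "policy": read_write.get("failoverPolicy"),
        "grace_minutes": read_write.get("failoverWithDataLossGracePeriodMinutes"),
        "unprotected": sorted(required_databases - protected),
    }


summary = summarize(json.loads(SAMPLE_OUTPUT), {"reservations", "availability", "partner-cache"})
print(summary)
```

In this sample the hypothetical partner-cache database is reported as unprotected, which is the "dependent database left outside the group" case described above.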

Mapped Azure CLI commands

Azure SQL failover group operations

direct

az sql failover-group list --server <server-name> --resource-group <resource-group>

az sql failover-group | discover | Databases

az sql failover-group show --name <group-name> --server <server-name> --resource-group <resource-group>

az sql failover-group | discover | Databases

az sql failover-group create --name <group-name> --server <primary-server> --partner-server <secondary-server> --resource-group <resource-group> --failover-policy Manual

az sql failover-group | provision | Databases

az sql failover-group update --name <group-name> --server <server-name> --resource-group <resource-group> --failover-policy Automatic --grace-period <hours>

az sql failover-group | configure | Databases

az sql failover-group set-primary --name <group-name> --server <secondary-server> --resource-group <resource-group>

az sql failover-group | operate | Databases
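For drills that need before-and-after evidence, the show and set-primary commands above can be wrapped in a short script. The sketch below is one possible shape, not a prescribed runbook: it assumes az is on the PATH and already logged in, that the secondary server lives in the given resource group, and that the caller has approval to fail over. The runner parameter exists so the flow can be rehearsed with a fake runner.

```python
import json
import subprocess
from datetime import datetime, timezone


def run_az(args: list) -> str:
    """Run an az command and return stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(["az", *args], check=True, capture_output=True, text=True)
    return completed.stdout


def show_args(group: str, server: str, resource_group: str) -> list:
    return ["sql", "failover-group", "show", "--name", group,
            "--server", server, "--resource-group", resource_group, "--output", "json"]


def set_primary_args(group: str, secondary_server: str, resource_group: str) -> list:
    return ["sql", "failover-group", "set-primary", "--name", group,
            "--server", secondary_server, "--resource-group", resource_group]


def planned_failover(group: str, secondary_server: str, resource_group: str, runner=run_az) -> dict:
    """Capture group state, fail over to the secondary, then capture state again."""
    def snapshot() -> dict:
        raw = runner(show_args(group, secondary_server, resource_group))
        return {"at": datetime.now(timezone.utc).isoformat(), "state": json.loads(raw)}

    before = snapshot()   # viewed from the secondary server before the switch
    runner(set_primary_args(group, secondary_server, resource_group))
    after = snapshot()    # same server, expected to report the primary role now
    return {"before": before, "after": after}


# Example call (hypothetical names); this would trigger a real failover:
# evidence = planned_failover("fg-example", "sql-secondary-example", "rg-example")
```

Storing the returned dictionary alongside the change ticket gives timestamps and role values from both sides of the switch.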

Architecture context

In architecture, I use Azure SQL failover groups when the workload needs regional recovery for a group of databases and the application can tolerate asynchronous replication behavior. The design starts with the recovery objective, not the feature checkbox: define RPO, RTO, failover authority, data-loss tolerance, network routing, and post-failover validation. The secondary server must have compatible configuration, identity, firewall, private endpoint, auditing, and monitoring. Applications should connect to the listener, implement retry logic, and avoid region-specific assumptions. Failover groups fit well with deployment runbooks, DR exercises, and read-only reporting patterns, but they require testing because database failover alone does not move every dependency.

Security

Security for failover groups centers on making the secondary path as controlled as the primary path. Server-level firewall rules, private endpoints, private DNS zones, Microsoft Entra authentication, SQL logins, auditing, Defender settings, and RBAC all need review. A failover that lands on a server with weaker network restrictions or missing audit configuration creates a quiet security gap. Operators should restrict who can create, update, or fail over the group because those actions affect production routing. Secrets and connection strings should use the listener name, and emergency access should be tested without granting broad standing privileges to the disaster-recovery team. Review this configuration on both servers quarterly.

Cost

Failover groups have cost impact because the secondary databases and their compute, storage, backup, monitoring, and networking footprint are real production capacity, not free insurance. The right design balances recovery objectives against idle standby spend. Read-only workloads can sometimes use the secondary listener to extract value from replicated capacity, but that must be weighed against performance and failover readiness. Costs also appear in private endpoints, cross-region traffic, logs, DR testing, and staff time. The expensive mistake is paying for a secondary region that cannot actually serve traffic because identity, DNS, or application dependencies were never tested. Tested readiness protects that investment.

Reliability

Reliability is the main purpose of a failover group, but it depends on more than replication. The group can coordinate database failover and keep listener endpoints stable, yet applications still need retry logic, connection pooling behavior, idempotent writes, and region-aware dependencies. Planned failovers should be rehearsed so teams understand DNS behavior, read-only workloads, job schedulers, and post-failover health checks. RPO is not zero for asynchronous geo-replication, so the business must understand potential data loss. Reliable operation also means testing failback, monitoring replication lag, and documenting when manual failover is allowed during an ambiguous regional incident. Practice makes the promise real.
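The retry logic mentioned above might look like the following sketch, which uses exponential backoff with jitter. TransientError is a placeholder defined here for illustration; a real application would catch the transient error types of its own database driver, and the retried operation needs to be idempotent.

```python
import random
import time


class TransientError(Exception):
    """Placeholder for a driver-specific transient error (hypothetical)."""


def with_retry(operation, attempts: int = 5, base_delay: float = 1.0,
               max_delay: float = 30.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads reconnects so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Demo: an operation that fails twice (as during a failover) and then succeeds.
outcomes = iter([TransientError("connection reset"), TransientError("login failed"), "committed"])


def flaky_write():
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


# A no-op sleep keeps the demo instant; production code would use time.sleep.
print(with_retry(flaky_write, sleep=lambda seconds: None))
```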

Performance

Failover groups affect performance through replication lag, read-only routing, connection retries, and the capacity chosen for the secondary server. Latency-sensitive application traffic should use the read-write listener, while read-only listener patterns can offload reporting when the design supports it. During failover, applications may see transient connection failures, stale DNS cache behavior, or slower performance if the secondary has a smaller service objective. Performance testing should include failover drills, connection-pool reset behavior, and post-failover workload validation. The group does not tune queries or indexes, but it changes where database traffic lands when regional recovery is exercised. Measure latency and replication lag during every exercise.

Operations

Operators manage failover groups by checking replication state, partner server health, failover policy, grace period, listener names, and database membership. They run planned failover drills, review alerting, validate that applications use the listener, and confirm that firewall and private DNS settings work from both regions. During incidents, the runbook should identify decision authority, business sign-off, data-loss tolerance, application smoke tests, and failback steps. Azure CLI is useful for listing groups, showing state, triggering planned changes, and producing evidence for recovery reviews. Operations teams should also monitor jobs and integrations that do not automatically follow the database listener. Keep evidence with every drill.
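Checking replication state can be scripted once the link status is available as rows. The sketch below assumes the rows come from the sys.dm_geo_replication_link_status view queried on the primary database and already loaded into dictionaries; the database names and the RPO value are hypothetical, and no database connection is made here.

```python
def lag_violations(rows: list, rpo_seconds: int) -> list:
    """Return (database, reason) pairs for links that put the RPO at risk."""
    problems = []
    for row in rows:
        database = row["partner_database"]
        state = row.get("replication_state_desc")
        lag = row.get("replication_lag_sec")
        if state != "CATCH_UP":
            # Anything other than CATCH_UP means the secondary is not yet in sync.
            problems.append((database, f"state is {state}"))
        elif lag is None:
            problems.append((database, "lag unavailable"))
        elif lag > rpo_seconds:
            problems.append((database, f"lag {lag}s exceeds RPO {rpo_seconds}s"))
    return problems


# Hypothetical rows shaped like the view's columns.
rows = [
    {"partner_database": "telemetry", "replication_state_desc": "CATCH_UP", "replication_lag_sec": 4},
    {"partner_database": "rules", "replication_state_desc": "SEEDING", "replication_lag_sec": None},
    {"partner_database": "alerts", "replication_state_desc": "CATCH_UP", "replication_lag_sec": 120},
]
for database, reason in lag_violations(rows, rpo_seconds=30):
    print(database, reason)
```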

Common mistakes

Pointing application connection strings at the logical server instead of the failover group listener, defeating the recovery design.
Testing database failover but forgetting private DNS, firewall, identity, jobs, and dependent services in the secondary region.
Assuming asynchronous geo-replication guarantees zero data loss, instead of agreeing on an RPO and business approval rules.
Sizing the secondary database too small, so the recovered application works but performs poorly.

Operator quick checks

List failover groups and confirm every protected database appears in the expected group.
Resolve the listener from the application network path before a disaster-recovery drill.
Verify alerts exist for replication health, failover activity, and application connection failures.
Run a planned failover in a controlled window and record application validation results.
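The listener resolution check in this list can be sketched with the standard library. The example assumes it runs from a host on the application's network path, so that private DNS zones are evaluated the same way the application sees them; the listener name is hypothetical, and the resolver parameter allows a dry run without network access.

```python
import socket


def resolve_listener(host: str, port: int = 1433, resolver=socket.getaddrinfo) -> list:
    """Return the sorted IP addresses the listener resolves to, or [] on failure."""
    try:
        results = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in results})


# Hypothetical listener name; an empty list means resolution failed from this host.
addresses = resolve_listener("fg-example.database.windows.net")
print(addresses or "listener did not resolve from this network path")
```

With private endpoints in place, a public IP address in the result is usually a sign that the private DNS zone is not linked to the network the check ran from.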

Questions to ask

Who can authorize failover, and what level of data loss is acceptable for this workload?
Do all applications, jobs, and integrations use the listener rather than a region-specific server name?
What breaks outside the database when the primary region changes?
How will the team validate, communicate, and eventually fail back after recovery?
