Disaster recovery and failover for Azure Files


Overview

Azure recommends having a disaster recovery (DR) plan for regional outages. An important DR action is failing over a storage account to its secondary region when the primary endpoint becomes unavailable.
Azure File Sync requires the Storage Sync Service to also be failed over because it must be in the same region as the storage account. If only the storage account is failed over, sync and cloud tiering will fail until the Storage Sync Service is failed over.

Applies to

The document enumerates supported combinations of management model, billing model, media tier and redundancy for SMB and NFS access (table in original).

Key concepts

Customer-managed failover: You (the customer) can initiate failover of a geo-redundant storage account (GRS/GZRS) to the secondary region if the primary becomes unavailable. After failover completes, the secondary becomes the new primary and clients can use it.
Azure-managed failover: In extreme disasters, Microsoft may initiate a regional failover. No action is required from you; during the failover you might not have write access.

RPO, RTO, and costs

DR planning requires understanding two targets: the recovery point objective (RPO), which is how much data loss is tolerable, and the recovery time objective (RTO), which is how quickly services must be restored.
A lower RPO or RTO generally means a higher DR cost.

Choosing redundancy

Azure Files supports LRS and ZRS for all shares. GRS and GZRS are supported for HDD file shares and enable account failover for regional outages.
SSD (premium) file shares do not support GRS/GZRS. To get geographic redundancy for SSD shares, consider syncing between two file shares (link in original).
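
The redundancy rules above can be expressed as a small lookup. This is a minimal sketch that only encodes what this section states; the names SUPPORTED_REDUNDANCY, is_supported, and supports_account_failover are hypothetical and are not part of any Azure SDK.

```python
# Redundancy options per media tier, as summarized in this section.
# These tables are illustrative; check the current Azure Files
# documentation before relying on them.
SUPPORTED_REDUNDANCY = {
    "HDD": {"LRS", "ZRS", "GRS", "GZRS"},
    "SSD": {"LRS", "ZRS"},
}
GEO_REDUNDANT = {"GRS", "GZRS"}


def is_supported(media_tier: str, redundancy: str) -> bool:
    """Return True if the media tier supports the redundancy option."""
    options = SUPPORTED_REDUNDANCY.get(media_tier.upper(), set())
    return redundancy.upper() in options


def supports_account_failover(media_tier: str, redundancy: str) -> bool:
    """Account failover needs a supported, geo-redundant configuration."""
    return is_supported(media_tier, redundancy) and redundancy.upper() in GEO_REDUNDANT
```

For example, supports_account_failover("HDD", "GRS") is true, while any SSD configuration returns false because SSD shares offer only LRS and ZRS.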

How account failover works

Under normal operation, data is written to the primary region and replicated asynchronously to the secondary region.
If the primary endpoint becomes unavailable, you can initiate a failover. The failover updates the storage account's DNS entries so that the secondary endpoint becomes the new primary endpoint. It typically takes about an hour; suspend workloads for the duration if possible.
After failover, file handles and leases are not retained, so clients must unmount and remount the shares.
Important: after failover, the storage account in the new primary region is configured as locally redundant. To resume geo-replication, reconfigure geo-redundancy (which incurs time and cost).
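
The state change described above can be modeled in a few lines. This is a sketch for reasoning about the sequence, not a call into Azure: AccountState, fail_over, and enable_geo_redundancy are hypothetical names, and the region names used with them are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccountState:
    primary_region: str
    secondary_region: Optional[str]
    redundancy: str  # "LRS", "ZRS", "GRS", or "GZRS"


def fail_over(state: AccountState) -> AccountState:
    """Model a customer-managed failover as described in this section."""
    if state.redundancy not in ("GRS", "GZRS") or state.secondary_region is None:
        raise ValueError("account failover requires a geo-redundant account")
    # The secondary becomes the new primary, and the account is left
    # locally redundant with no secondary until it is reconfigured.
    return AccountState(
        primary_region=state.secondary_region,
        secondary_region=None,
        redundancy="LRS",
    )


def enable_geo_redundancy(
    state: AccountState, secondary_region: str, redundancy: str = "GRS"
) -> AccountState:
    """Model reconfiguring geo-redundancy after a failover."""
    return AccountState(state.primary_region, secondary_region, redundancy)
```

Starting from AccountState("eastus", "westus", "GRS"), fail_over yields an LRS account whose primary is "westus"; geo-replication resumes only after the separate enable_geo_redundancy step.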

Anticipate data loss

Because replication is asynchronous, recent writes not yet replicated to the secondary can be lost when you force failover.
All data replicated to the secondary is retained; any primary-only writes not yet replicated are lost.

Last Sync Time (LST)

The Last Sync Time indicates the most recent point in time up to which data from the primary is guaranteed to have been replicated to the secondary. Use LST to estimate potential data loss before failing over or failing back.
A system snapshot is made every ~15 minutes and replicated, but the snapshot in the secondary might be older due to geo-lag.
You can query LST via Azure PowerShell, Azure CLI, or client libraries (link in original).
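
Once you have the LST value, estimating the exposure is simple date arithmetic. The sketch below assumes the LST has already been retrieved by one of the methods above and is passed in as a timezone-aware datetime; potential_data_loss_window and meets_rpo are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def potential_data_loss_window(
    last_sync_time: datetime, now: Optional[datetime] = None
) -> timedelta:
    """Upper bound on the span of writes that a failover could lose."""
    if last_sync_time.tzinfo is None:
        raise ValueError("last_sync_time must be timezone-aware")
    if now is None:
        now = datetime.now(timezone.utc)
    # Clamp at zero in case of clock skew between the two timestamps.
    return max(now - last_sync_time, timedelta(0))


def meets_rpo(
    last_sync_time: datetime, rpo: timedelta, now: Optional[datetime] = None
) -> bool:
    """True if the current replication lag is within the tolerated RPO."""
    return potential_data_loss_window(last_sync_time, now) <= rpo
```

The window is an upper bound: only writes made after the LST are at risk, so an idle share loses nothing even when the lag is large.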

Failback caution

After failing over, the account becomes LRS in the new primary region. When you re-enable geo-redundancy, the new primary starts copying its data back to the old primary, which is now the new secondary. If you fail back before that replication has completed, any writes that have not yet reached the new secondary are lost.
Always check Last Sync Time and compare with recent writes before failing back.
After failback, you can reconfigure redundancy (LRS→GRS or ZRS→GZRS as appropriate); this takes time and incurs cost.
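
The pre-failback check can be sketched as a comparison between the LST and the timestamps of recent writes. How you obtain those write timestamps is outside this summary, so they are plain inputs here; writes_at_risk and safe_to_fail_back are hypothetical names.

```python
from datetime import datetime
from typing import Iterable, List


def writes_at_risk(
    last_sync_time: datetime, write_times: Iterable[datetime]
) -> List[datetime]:
    """Return the writes not guaranteed to be on the secondary, oldest first.

    Conservative assumption: only writes strictly before the LST count as
    replicated, so a write stamped exactly at the LST is treated as at risk.
    """
    return sorted(t for t in write_times if t >= last_sync_time)


def safe_to_fail_back(
    last_sync_time: datetime, write_times: Iterable[datetime]
) -> bool:
    """True only if every known write predates the Last Sync Time."""
    return not writes_at_risk(last_sync_time, write_times)
```

If safe_to_fail_back returns false, the list from writes_at_risk shows which writes to wait for, or to copy manually, before failing back.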

How to initiate failover

You can initiate an account failover from the Azure portal, Azure PowerShell, Azure CLI, or the Azure Storage resource provider API (link in original).