[SQL in Azure] High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines

http://msdn.microsoft.com/en-us/library/azure/jj870962.aspx

Microsoft Azure virtual machines (VMs) with SQL Server can help lower the cost of a high availability and disaster recovery (HADR) database solution. Most SQL Server HADR solutions are supported in Azure virtual machines, both as Azure-only and as hybrid solutions. In an Azure-only solution, the entire HADR system runs in Azure. In a hybrid configuration, part of the solution runs in Azure and the other part runs on-premises in your organization. The flexibility of the Azure environment enables you to move partially or completely to Azure to satisfy the budget and HADR requirements of your SQL Server database systems.

Need for SQL Server HADR Solution in Azure Virtual Machines

It is up to you to ensure that your database system possesses the HADR capabilities that the service-level agreement (SLA) requires. The fact that Azure provides high availability mechanisms, such as service healing for cloud services and failure recovery detection for the Virtual Machines, does not itself guarantee you can meet the desired SLA. These mechanisms protect the high availability of the VMs but not the high availability of SQL Server running inside the VMs. It is possible for the SQL Server instance to fail while the VM is online and healthy. Moreover, even the high availability mechanisms provided by Azure allow for downtime of the VMs due to events such as recovery from software or hardware failures and operating system upgrades.

In addition, Geo Redundant Storage (GRS) in Azure, which is implemented with a feature called geo-replication, may not be an adequate disaster recovery solution for your databases. Because geo-replication sends data asynchronously, recent updates can be lost in the event of disaster. More information regarding geo-replication limitations are covered in the Geo-replication not supported for data and log files on separate disks section.

HADR Deployment Architectures

SQL Server HADR technologies that are supported in Azure include:

It is possible to combine the technologies together to implement a SQL Server solution that has both high availability and disaster recovery capabilities. Depending on the technology you use, a hybrid deployment may require a VPN tunnel with the Azure virtual network. The sections below show you some of the example deployment architectures.

Azure-Only: High Availability Solutions

You can have a high availability solution for your SQL Server databases in Azure using AlwaysOn Availability Groups or database mirroring.

Technology	Example Architectures
AlwaysOn Availability Groups	All availability replicas running in Azure VMs for high availability within the same region. You need to configure a domain controller in addition to the SQL Server virtual machines because Windows Server Failover Clustering (WSFC) requires an Active Directory domain. For more information, see Tutorial: AlwaysOn Availability Groups in Azure (GUI).
Database Mirroring	Principal, mirror, and witness servers all running in the same Azure datacenter for high availability. You can deploy using a domain controller. You can also deploy the same database mirroring configuration without a domain controller by using server certificates instead. For more information, see Tutorial: Database Mirroring for High Availability in Azure.

Technology

Example Architectures

AlwaysOn Availability Groups

All availability replicas running in Azure VMs for high availability within the same region. You need to configure a domain controller in addition to the SQL Server virtual machines because Windows Server Failover Clustering (WSFC) requires an Active Directory domain.

For more information, see Tutorial: AlwaysOn Availability Groups in Azure (GUI).

Database Mirroring

Principal, mirror, and witness servers all running in the same Azure datacenter for high availability. You can deploy using a domain controller.

You can also deploy the same database mirroring configuration without a domain controller by using server certificates instead.

For more information, see Tutorial: Database Mirroring for High Availability in Azure.

Azure-Only: Disaster Recovery Solutions

You can have a disaster recovery solution for your SQL Server databases in Azure using AlwaysOn Availability Groups, database mirroring, or backup and restore with storage blobs.

Technology	Example Architectures
AlwaysOn Availability Groups	Availability replicas running across multiple datacenters in Azure VMs for disaster recovery. This cross-region solution protects against complete site outage. Within a region, all replicas should be within the same cloud service and the same VNet. Because each region will have a separate VNet, these solutions require VNet to VNet connectivity. For more information, see Configure VNet to VNet Connectivity.
Database Mirroring	Principal and mirror and servers running in different datacenters for disaster recovery. You must deploy using server certificates because an active directory domain cannot span multiple datacenters. For more information, see Tutorial: Database Mirroring for Disaster Recovery in Azure.
Backup and Restore with Azure Blob Storage Service	Production databases backed up directly to blob storage in a different datacenter for disaster recovery. For more information, see Backup and Restore for SQL Server in Azure Virtual Machines.

Technology

Example Architectures

AlwaysOn Availability Groups

Availability replicas running across multiple datacenters in Azure VMs for disaster recovery. This cross-region solution protects against complete site outage.

Within a region, all replicas should be within the same cloud service and the same VNet. Because each region will have a separate VNet, these solutions require VNet to VNet connectivity. For more information, see Configure VNet to VNet Connectivity.

Database Mirroring

Principal and mirror and servers running in different datacenters for disaster recovery. You must deploy using server certificates because an active directory domain cannot span multiple datacenters.

For more information, see Tutorial: Database Mirroring for Disaster Recovery in Azure.

Backup and Restore with Azure Blob Storage Service

Production databases backed up directly to blob storage in a different datacenter for disaster recovery.

For more information, see Backup and Restore for SQL Server in Azure Virtual Machines.

Hybrid IT: Disaster Recovery Solutions

You can have a disaster recovery solution for your SQL Server databases in a hybrid-IT environment using AlwaysOn Availability Groups, database mirroring, log shipping, and backup and restore with Azure blog storage.

Technology	Example Architectures
AlwaysOn Availability Groups	Some availability replicas running in Azure VMs and other replicas running on-premises for cross-site disaster recovery. The production site can be either on-premises or in an Azure datacenter. Because all availability replicas must be in the same WSFC cluster, the WSFC cluster must span both networks (a multi-subnet WSFC cluster). This configuration requires a VPN connection between Azure and the on-premises network. For successful disaster recovery of your databases, you should also install a replica domain controller at the disaster recovery site. For more information, see Tutorial: AlwaysOn Availability Groups in Hybrid IT.
Database Mirroring	One partner running in an Azure VM and the other running on-premises for cross-site disaster recovery using server certificates. Partners do not need to be in the same Active Directory domain, and no VPN connection is required. For more information, see Tutorial: Database Mirroring for Disaster Recovery in Hybrid IT. One partner running in an Azure VM and the other running on-premises in the same Active Directory domain for cross-site disaster recovery. A VPN connection between the Azure virtual network and the on-premises network is required. For successful disaster recovery of your databases, you should also install a replica domain controller at the disaster recovery site.
Log Shipping	One server running in an Azure VM and the other running on-premises for cross-site disaster recovery. Log shipping depends on Windows file sharing, so a VPN connection between the Azure virtual network and the on-premises network is required. For successful disaster recovery of your databases, you should also install a replica domain controller at the disaster recovery site. For more information, see Tutorial: Log Shipping for Disaster Recovery in Hybrid IT.
Backup and Restore with Azure Blob Storage Service	On-premises production databases backed up directly to Azure blob storage for disaster recovery. For more information, see Backup and Restore for SQL Server in Azure Virtual Machines.

Important Considerations for SQL Server HADR in Azure

Azure VMs, storage, and networking have different operational characteristics than an on-premises, non-virtualized IT infrastructure. A successful implementation of a HADR SQL Server solution in Azure requires that you understand these differences and design your solution to accommodate them.

High availability nodes in an availability set

Availability sets in Azure enable you to place the high availability nodes into separate Fault Domains (FDs) and Update Domains (UDs). For Azure VMs to be placed in the same availability set, you must deploy them in the same cloud service. Only nodes in the same cloud service can participate in the same availability set. For more information, see Manage the Availability of Virtual Machines.

WSFC cluster behavior in Azure networking

The non-RFC-compliant DHCP service in Azure can cause the creation of certain WSFC cluster configurations to fail, due to the cluster network name being assigned a duplicate IP address, such as the same IP address as one of the cluster nodes. This is an issue when you implement AlwaysOn Availability Groups, which depends on the WSFC feature.

Consider the scenario when a two-node cluster is created and brought online:

The cluster comes online, then NODE1 requests a dynamically assigned IP address for the cluster network name.
No IP address other than NODE1’s own IP address is given by the DHCP service, since the DHCP service recognizes that the request comes from NODE1 itself.
Windows detects that a duplicate address is assigned both to NODE1 and to the cluster network name, and the default cluster group fails to come online.
The default cluster group moves to NODE2, which treats NODE1’s IP address as the cluster IP address and brings the default cluster group online.
When NODE2 attempts to establish connectivity with NODE1, packets directed at NODE1 never leave NODE2 because it resolves NODE1’s IP address to itself. NODE2 cannot establish connectivity with NODE1, then loses quorum and shuts down the cluster.
In the meantime, NODE1 can send packets to NODE2, but NODE2 cannot reply. NODE1 loses quorum and shuts down the cluster.

This scenario can be avoided by assigning an unused static IP address, such as a link-local IP address like 169.254.1.1, to the cluster network name in order to bring the cluster network name online. To simplify this process, see Configuring Windows Failover Cluster in Azure for AlwaysOn Availability Groups.

The following tutorials for AlwaysOn Availability Groups demonstrate how to configure an availability group in different scenarios.

Availability group listener support

Availability group listeners are supported on Azure VMs running Windows Server 2008 R2, Windows Server 2012, and Windows Server 2012 R2. This support is made possible by the use of load-balanced endpoints with direct server return (DSR) enabled on the Azure VMs that are availability group nodes. You must follow special configuration steps for the listeners to work for both client applications that are running in Azure as well as those running on-premises. For instructions on setting up a listener, see Tutorial: Listener Configuration for AlwaysOn Availability Groups.

You can still connect to each availability replica separately by connecting directly to the service instance. Also, since AlwaysOn Availability Groups are backward compatible with database mirroring clients, you can connect to the availability replicas like database mirroring partners as long as the replicas are configured similar to database mirroring:

One primary replica and one secondary replica
The secondary replica is configured as non-readable (Readable Secondary option set to No)

An example client connection string that corresponds to this database mirroring-like configuration using ADO.NET or SQL Server Native Client is below:

Transact-SQL

Copy

Data Source=ReplicaServer1;Failover Partner=ReplicaServer2;Initial Catalog=AvailabilityDatabase;

For more information on client connectivity, see:

Network Latency in Hybrid IT

You should deploy your HADR solution with the assumption that there may be periods of time with high network latency between your on-premises network and Azure. When deploying replicas to Azure, you should use asynchronous commit instead of synchronous commit for the synchronization mode. When deploying database mirroring servers both on-premises and in Azure, use the high performance mode instead of the high safety mode.

Geo-replication not supported for data and log files on separate disks

Geo-replication in Azure disks does not support the data file and log file of the same database to be stored on separate disks. GRS replicates changes on each disk independently and asynchronously. This mechanism guarantees the write order within a single disk on the geo-replicated copy, but not across geo-replicated copies of multiple disks. If you configure a database to store its data file and its log file on separate disks, the recovered disks after a disaster may contain a more up-to-date copy of the data file than the log file, which breaks the write-ahead log in SQL Server and the ACID properties of transactions. If you do not have the option to disable geo-replication on the storage account, you should keep all data and log files for a given database on the same disk. If you must use more than one disk due to the size of the database, you need to deploy one of the disaster recovery solutions listed above to ensure data redundancy.