一个磁盘I/O故障导致的AlwaysOn FailOver 过程梳理和分析

下面是我们在使用AlwaysOn过程中遇到的一个切换案例。这个案例发生在2014年8月，虽然时间相对久远了，但是对我们学习理解AlwaysOn的FailOver原理和过程还是很有帮助的。本次FailOver的触发原因是系统I/O问题。大家需要理解，操作系统I/O出现了问题不一定立即触发SQL Server发生漂移，因为坏的槽点可能不在SQL Server实例所用到的位置，但是随着时间持续和数据堆积，问题槽点可能扩大升级。我们可以看到在本例中，第一次出现I/O问题到SQL Server 漂移间隔了16分钟，所以大家不要奇怪。我们重点可以FailOver的过程和触发条件设置上，即文章的第二和第三部分。

一 . 系统 I/O 异常 Log追踪

1.1 10:36:12 发现I/O异常

1.2 10:45:43 显示个别读写花费时间较长

1.3 10:45:28 看似I/O严重

1.4 10:52:20 出现个别连接Fail现象

（查看表中的最后一笔数据显示为10:53:17）

二 . AlwaysOn FailOver 过程

2.1 系统提示需要FailOver

2.2 高可用性组的本地副本需要离线。

（相关知识：Lease expired event from the cluster. Possible causes include loss of lease, possible network issues and sp_server_diagnostic query timeout. ）

2.3 错误提示信息显示，SQL Instance和WSFC连接异常。

2.4 可用性副本的角色发生变换。

2.5 角色为RESOLVING无法访问DB

（相关知识：When the role of an availability replica is indeterminate, such as during a failover, its databases are temporarily in a NOT SYNCHRONIZING state. Their role is set to RESOLVING until the role of the availability replica has resolved.）

此时：通过SSMS管理器，连接数据也是不可以访问的，显示状态为不同步了。

三 . 相关知识点

3.1 什么是resourceDell？resourceDell的用途？

由于AlwaysOn可用性组是建立在Windows故障转移群集之上的，Alwayson可用性组需要一个群集resourceDell来连接Windows群集和SQLServer实例。由于可用性组是一个群集资源，Windows群集需要透过AlwaysOn的resourceDell来控制资源的上线/离线，检查资源是否失败，更改资源的状态和属性，以及发生各种命令给可用性副本实例。（AlwaysOn可用性组的资源类型是“SQLServer Availability Group”）

AlwaysOn通过sp_server_diagnostics来检查可用性组的健康状况，不断地获得诊断信息。sp_server_diagnostics的评估结果会被用来和AlwaysOn可用性组的FailureConditionLevel设置相比较，来约定是否符合发生故障转移的条件。一旦条件满足，则可用性组就被切换到新的可用性副本上。

3.2 HealthCheckTimeout

The HealthCheckTimeout setting is used to specify the length of time, in milliseconds, that the SQL Server resource DLL should wait for information returned by the sp_server_diagnostics stored procedure before reporting the AlwaysOn Failover Cluster Instance (FCI) as unresponsive. Changes that are made to the timeout settings are effective immediately and do not require a restart of the SQL Server resource.

The resource DLL determines the responsiveness of the SQL instance using a health check timeout. The HealthCheckTimeout property defines how long the resource DLL should wait for the sp_server_diagnostics stored procedure before it reports the SQL instance as unresponsive to the WSFC service.

The following items describe how this property affects timeout and repeat interval settings:

The resource DLL calls the sp_server_diagnostics stored procedure and sets the repeat interval to one-third of the HealthCheckTimeout setting.
If the sp_server_diagnostics stored procedure is slow or is not returning information, the resource DLL will wait for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive.
If the dedicated connection is lost, the resource DLL will retry the connection to the SQL instance for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive.

3.3 FailureConditionLevel

The SQL Server Database Engine resource DLL determines whether the detected health status is a condition for failure using the FailureConditionLevel property. The FailureConditionLevel property defines which detected health statuses cause restarts or failovers.

Review sp_server_diagnostics (Transact-SQL) as this system stored procedure plays in important role in the failure condition levels.

Level	Condition	Description
0	No automatic failover or restart	Indicates that no failover or restart will be triggered automatically on any failure conditions. This level is for system maintenance purposes only.
1	Failover or restart on server down	Indicates that a server restart or failover will be triggered if the following condition is raised: SQL Server service is down.
2	Failover or restart on server unresponsive	Indicates that a server restart or failover will be triggered if any of the following conditions are raised: SQL Server service is down. SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).
3	Failover or restart on critical server errors	Indicates that a server restart or failover will be triggered if any of the following conditions are raised: SQL Server service is down. SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings). System stored procedure sp_server_diagnostics returns ‘system error’.
4	Failover or restart on moderate server errors	Indicates that a server restart or failover will be triggered if any of the following conditions are raised: SQL Server service is down. SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings). System stored procedure sp_server_diagnostics returns ‘system error’. System stored procedure sp_server_diagnostics returns ‘resource error’.
5	Failover or restart on any qualified failure conditions	Indicates that a server restart or failover will be triggered if any of the following conditions are raised: SQL Server service is down. SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings). System stored procedure sp_server_diagnostics returns ‘system error’. System stored procedure sp_server_diagnostics returns ‘resource error’. System stored procedure sp_server_diagnostics returns ‘query_processing error’.

3.4 通过SQL更改相关配置。

The following example sets the HealthCheckTimeout option to 15,000 milliseconds (15 seconds).

ALTER SERVER CONFIGURATION

SET FAILOVER CLUSTER PROPERTY HealthCheckTimeout = 15000;

The following example sets the FailureConditionLevel property to 0, indicating that failover or restart will not be triggered automatically on any failure conditions.

ALTER SERVER CONFIGURATION SET FAILOVER CLUSTER PROPERTY FailureConditionLevel = 0;

四 . 结语

可用性副本的FailOver不仅仅取决于Availability Mode 和FailOver Mode,还要受限于FailureConditionLevel。

本文版权归作者所有，未经作者同意不得转载,谢谢配合！！！