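Walkthrough and analysis of an AlwaysOn failover caused by a disk I/O failure
The following is a failover case we ran into while using AlwaysOn. It happened in August 2014. That is some time ago, but the case is still very helpful for learning and understanding the principles and process of an AlwaysOn failover. The failover was triggered by a system I/O problem. Keep in mind that an I/O problem at the operating-system level does not necessarily trigger a SQL Server failover right away, because the bad spots on the disk may not lie in the areas the SQL Server instance uses. As time passes and data accumulates, however, the affected area may grow and the problem may escalate. In this case, 16 minutes elapsed between the first I/O problem and the SQL Server failover, so a delay like this should not come as a surprise. The focus here is on the failover process and on the settings that control when a failover is triggered, that is, parts 2 and 3 of this article.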
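1. Tracing the system I/O anomaly in the logs
1.1 10:36:12 An I/O anomaly is first detected
1.2 10:45:43 Individual reads and writes are reported as taking a long time
1.3 10:45:28 The I/O problem appears severe
1.4 10:52:20 Individual connections begin to fail
(The last row of data in the table is timestamped 10:53:17.)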
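To check whether slow I/O of this kind is reaching SQL Server's own files, one option is to look at per-file stall times. The query below is a minimal sketch and is not taken from the original incident. It assumes VIEW SERVER STATE permission, and because the counters in sys.dm_io_virtual_file_stats are cumulative, the averages cover the time since the instance started rather than the moment of the fault.

```sql
-- Average read/write stall per database file since instance startup.
-- Files on a failing volume tend to show much higher averages than the rest.
SELECT
    DB_NAME(vfs.database_id)                             AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_stall_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id     = vfs.file_id
ORDER BY vfs.io_stall DESC;  -- total stall time, worst files first
```

Taking two snapshots a few minutes apart and comparing them gives a better picture of current latency than the lifetime averages do.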
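2. The AlwaysOn failover process
2.1 The system reports that a failover is needed.
2.2 The local replica of the availability group needs to go offline.
(Background: Lease expired event from the cluster. Possible causes include loss of lease, possible network issues and sp_server_diagnostic query timeout.)
2.3 The error messages show that the connection between the SQL Server instance and WSFC is abnormal.
2.4 The role of the availability replica changes.
2.5 While the role is RESOLVING, the databases cannot be accessed.
(Background: When the role of an availability replica is indeterminate, such as during a failover, its databases are temporarily in a NOT SYNCHRONIZING state. Their role is set to RESOLVING until the role of the availability replica has resolved.)
At this point, connecting through SSMS does not give access to the databases either; their state is shown as not synchronizing.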
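The role and synchronization state described above can also be read from the AlwaysOn dynamic management views instead of the SSMS tree. This is a minimal sketch, assuming the queries run in master on the affected instance with VIEW SERVER STATE permission; it was not part of the original investigation.

```sql
-- Role and health of each replica this instance knows about.
SELECT
    ag.name                       AS availability_group,
    ar.replica_server_name,
    ars.role_desc,                -- PRIMARY, SECONDARY or RESOLVING
    ars.operational_state_desc,
    ars.connected_state_desc,
    ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = ars.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = ars.group_id;

-- Synchronization state of each availability database on the local replica.
SELECT
    DB_NAME(drs.database_id)      AS database_name,
    drs.synchronization_state_desc,   -- e.g. NOT SYNCHRONIZING
    drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;
```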
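3. Related concepts
3.1 What is the resource DLL, and what is it used for?
Because AlwaysOn availability groups are built on top of Windows Server Failover Clustering (WSFC), an availability group needs a cluster resource DLL to connect the Windows cluster with the SQL Server instance. Since the availability group is a cluster resource, the Windows cluster goes through the AlwaysOn resource DLL to bring the resource online or offline, check whether the resource has failed, change the resource's state and properties, and send commands to the availability replica instance. (The resource type of an AlwaysOn availability group is "SQL Server Availability Group".)
AlwaysOn uses sp_server_diagnostics to check the health of the availability group, continuously collecting diagnostic information. The result reported by sp_server_diagnostics is compared against the availability group's FailureConditionLevel setting to decide whether the conditions for a failover are met. Once they are met, the availability group fails over to a new availability replica.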
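To see the kind of information the resource DLL receives, sp_server_diagnostics can also be run by hand. The sketch below captures one result set into a temporary table; the table name #diag is an arbitrary choice for illustration, and the call assumes VIEW SERVER STATE permission.

```sql
-- Capture a single health snapshot (no repeat interval, so the call returns once).
CREATE TABLE #diag (
    create_time    datetime,
    component_type sysname,
    component_name sysname,
    state          int,
    state_desc     sysname,
    data           varchar(max)   -- XML text with the component details
);

INSERT INTO #diag
EXEC sp_server_diagnostics;

-- One row per component: system, resource, query_processing, io_subsystem, events.
SELECT component_name, state_desc, create_time
FROM #diag
ORDER BY component_name;

DROP TABLE #diag;
```

The state_desc values of the system, resource and query_processing components are the ones that the failure condition levels in section 3.3 refer to.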
3.2 HealthCheckTimeout
The HealthCheckTimeout setting is used to specify the length of time, in milliseconds, that the SQL Server resource DLL should wait for information returned by the sp_server_diagnostics stored procedure before reporting the AlwaysOn Failover Cluster Instance (FCI) as unresponsive. Changes that are made to the timeout settings are effective immediately and do not require a restart of the SQL Server resource.
The resource DLL determines the responsiveness of the SQL instance using a health check timeout. The HealthCheckTimeout property defines how long the resource DLL should wait for the sp_server_diagnostics stored procedure before it reports the SQL instance as unresponsive to the WSFC service.
The following items describe how this property affects timeout and repeat interval settings:
- The resource DLL calls the sp_server_diagnostics stored procedure and sets the repeat interval to one-third of the HealthCheckTimeout setting.
- If the sp_server_diagnostics stored procedure is slow or is not returning information, the resource DLL will wait for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive.
- If the dedicated connection is lost, the resource DLL will retry the connection to the SQL instance for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive.
3.3 FailureConditionLevel
The SQL Server Database Engine resource DLL determines whether the detected health status is a condition for failure using the FailureConditionLevel property. The FailureConditionLevel property defines which detected health statuses cause restarts or failovers.
Review sp_server_diagnostics (Transact-SQL), as this system stored procedure plays an important role in the failure condition levels.
| Level | Condition | Description |
| --- | --- | --- |
| 0 | No automatic failover or restart | |
| 1 | Failover or restart on server down | Indicates that a server restart or failover will be triggered if the following condition is raised: SQL Server service is down. |
| 2 | Failover or restart on server unresponsive | Indicates that a server restart or failover will be triggered if any of the following conditions are raised: SQL Server service is down. SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings). |
| 3 | Failover or restart on critical server errors | Indicates that a server restart or failover will be triggered if any of the following conditions are raised: SQL Server service is down. SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings). System stored procedure sp_server_diagnostics returns 'system error'. |
| 4 | Failover or restart on moderate server errors | Indicates that a server restart or failover will be triggered if any of the following conditions are raised: SQL Server service is down. SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings). System stored procedure sp_server_diagnostics returns 'system error'. System stored procedure sp_server_diagnostics returns 'resource error'. |
| 5 | Failover or restart on any qualified failure conditions | Indicates that a server restart or failover will be triggered if any of the following conditions are raised: SQL Server service is down. SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings). System stored procedure sp_server_diagnostics returns 'system error'. System stored procedure sp_server_diagnostics returns 'resource error'. System stored procedure sp_server_diagnostics returns 'query_processing error'. |
3.4 Changing these settings with T-SQL
The following example sets the HealthCheckTimeout option to 15,000 milliseconds (15 seconds).
ALTER SERVER CONFIGURATION
SET FAILOVER CLUSTER PROPERTY HealthCheckTimeout = 15000;
The following example sets the FailureConditionLevel property to 0, indicating that failover or restart will not be triggered automatically on any failure conditions.
ALTER SERVER CONFIGURATION SET FAILOVER CLUSTER PROPERTY FailureConditionLevel = 0;
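The statements above use the FAILOVER CLUSTER PROPERTY form, which addresses the failover cluster instance resource. For an availability group, the corresponding settings are properties of the group itself and are changed with ALTER AVAILABILITY GROUP. The sketch below assumes a hypothetical group named MyAG and is meant to be run on the primary replica. The values shown are the documented defaults, used here only as an illustration, and for an availability group the level ranges from 1 to 5.

```sql
-- MyAG is a placeholder name; replace it with the real availability group.

-- Wait up to 30 seconds for sp_server_diagnostics before treating
-- the instance as unresponsive.
ALTER AVAILABILITY GROUP [MyAG]
SET (HEALTH_CHECK_TIMEOUT = 30000);

-- Fail over on server down, server unresponsive, or critical server errors.
ALTER AVAILABILITY GROUP [MyAG]
SET (FAILURE_CONDITION_LEVEL = 3);
```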
4. Conclusion
Whether an availability replica fails over depends not only on the availability mode and the failover mode; it is also constrained by the FailureConditionLevel.
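All of the settings named in this conclusion can be read side by side from the catalog views. This is a minimal sketch that assumes it is run on an instance hosting a replica of the availability group.

```sql
-- Failover-related settings per availability group and replica.
SELECT
    ag.name                   AS availability_group,
    ag.failure_condition_level,
    ag.health_check_timeout,  -- milliseconds
    ar.replica_server_name,
    ar.availability_mode_desc,  -- SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT
    ar.failover_mode_desc       -- AUTOMATIC or MANUAL
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
ORDER BY ag.name, ar.replica_server_name;
```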
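Copyright of this article belongs to the author. Please do not repost it without the author's consent. Thank you for your cooperation!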