What is Split Brain in Oracle Clusterware and Real Application Cluster (文档 ID 1425586.1)
In this Document
Purpose |
Scope |
Details |
1. Clusterware layer |
2. Real Application Cluster (database) layer |
Known Issues |
References |
APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.1.0.2 and later
Information in this document applies to any platform.
PURPOSE
This note is to explain what is split brain in an Oracle Real Application cluster and what errors/consequences are associated with it.
SCOPE
For DBA and Support engineer.
DETAILS
In generic term, split-brain indicates data inconsistencies originating from the maintenance of two separate data sets with overlap in scope, either because of servers in a network design, or a failure condition based on servers not communicating and unifying their data to each other.
There are two components in Oracle Real Application Cluster implementation could experience split brain.
1. Clusterware layer
Cluster nodes maintain their heartbeat via private network and voting disk. When there is a private network disruption, cluster nodes can not communicate to each other via private network for the time period of misscount setting, split brain will happen. In such case, voting disk will be used to determine which node(s) survive and which node(s) will be evicted. The common voting result will be:
a. The group with more cluster nodes survive
b. The group with lower node member in case of same number of node(s) available in each group
c. Some improvement has been made to ensure node(s) with lower load survive in case the eviction is caused by high system load.
Commonly, one will see messages similar to the followings in ocssd.log when split brain happens:
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR:
###################################
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################
Above messages indicate the communication from node 2 to node 1 is not working, hence node 2 only sees 1 node, but node 1 is working fine and it can see two nodes in the cluster. To avoid splitbrain, node 2 aborted itself.
Solution: Please engage network administrator to check private network layer to eliminate any network fault.
2. Real Application Cluster (database) layer
To ensure data consistency, each instance of a RAC database needs to keep heartbeat with the other instances. The heartbeat is maintained by background processes like LMON, LMD, LMS and LCK. Any of these processes experience IPC Send time out will incur communication reconfiguration and instance eviction to avoid split brain. Controlfile is used similarly to voting disk in clusterware layer to determine which instance(s) survive and which instance(s) evict. The voting result is similar to clusterware voting result. As the result, 1 or more instance(s) will be evicted.
Common messages in instance alert log are similar to:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave:
2
...
alert log of instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file
/u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file
/u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc
(incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in:
/u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc
In above example, instance 2 LMD0 (pid 29940) is the receiver in IPC Send timeout. There could be various reasons causing IPC Send timeout. For example:
a. Network problem
b. Process hang
c. Bug etc
Please see Top 5 issues for Instance Eviction Document 1374110.1 for more information.
In case of instance eviction, alert log and all background traces need to be checked to determine the root cause.
Known Issues
1. Bug 7653579 - IPC send timeout in RAC after only short period Document 7653579.8
Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC Document 761717.1
Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch 22 on Windows
2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load
Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 Document 1373749.1
Fixed in: 11.2.0.1
3. Bug 8365141 - DRM quiesce step hang causes instance eviction Document 8365141.8
Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch 25 for Windows and 11.2.0.1
4. Bug 7587008 - Hung RAC instance not evicted from cluster Document 7587008.8
Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1, one-off patch available for various 11.1.0.7 release
5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits Document 11890804.8
Fixed in 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 Patch 10 on Windows
6. BUG:13732226 - NODE GETS EVICTED WITH REASON CODE 0X2
BUG:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
BUG:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
Refer: 11gR2: LMON received an instance eviction notification from instance n Document 1440892.1
Fixed in: 11.2.0.4 and some merge patch available for 11.2.0.2 and 11.2.0.3
What is Split Brain in Oracle Clusterware and Real Application Cluster (文档 ID 1425586.1)的更多相关文章
- Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (文档 ID 1546004.1)
In this Document Purpose Scope Details What does "split brain" mean? Why is this ...
- oracle数据库 PSU,SPU(CPU),Bundle Patches 和 Patchsets 补丁号码快速参考 (文档 ID 1922396.1)
数据库 PSU,SPU(CPU),Bundle Patches 和 Patchsets 补丁号码快速参考 (文档 ID 1922396.1) 文档内容 用途 详细信息 Patchsets ...
- Troubleshooting 10g and 11.1 Clusterware Reboots (文档 ID 265769.1)
Troubleshooting 10g and 11.1 Clusterware Reboots (文档 ID 265769.1) This document is intended for DBA' ...
- xtts v4for oracle 11g&12c(文档ID 2471245
xtts v4for oracle 11g&12c(文档ID 2471245.1) 序号 主机 操作项目 操作内容 备注: 阶段一:初始阶段 1.1 源端 环境验证 migrate_check ...
- Oracle client客户端简易安装网上文档一
Oracle client客户端简易安装网上文档一-------------------------------------------------------------------------一. ...
- Oracle版本发布规划 (文档 ID 742060.1)
Oracle Database Release Schedule of Current Database Releases (文档 ID 742060.1) Oracle Database RoadM ...
- How to Modify Public Network Information including VIP in Oracle Clusterware (文档 ID 276434.1)
APPLIES TO: Oracle Database - Enterprise Edition - Version 11.2.0.3 to 12.1.0.2 [Release 11.2 to 12. ...
- Clusterware 和 RAC 中的域名解析的配置校验和检查 (文档 ID 1945838.1)
适用于: Oracle Database - Enterprise Edition - 版本 10.1.0.2 到 12.1.0.1 [发行版 10.1 到 12.1]Oracle Database ...
- 在Oracle电子商务套件版本12.2中创建自定义应用程序(文档ID 1577707.1)
在本文档中 本笔记介绍了在Oracle电子商务套件版本12.2中创建自定义应用程序所需的基本步骤.如果您要创建新表单,报告等,则需要自定义应用程序.它们允许您将自定义编写的文件与Oracle电子商务套 ...
随机推荐
- Entity Framework菜鸟初飞
Entity Framework菜鸟初飞 http://blog.csdn.net/zezhi821/article/details/7235134
- TTL电平和CMOS电平总结
TTL电平和CMOS电平总结 1,TTL电平: 输出高电平>2.4V,输出低电平<0.4V.在室温下,一般输出高电平是3.5V,输出低电平是0.2V.最小输入高电平和低电 ...
- IOS开发-手势简单使用及手势不响应处理办法
1.点击 2.长按 3.拖拽 4.轻扫.捏合.旋转 5.使用手势需要注意的地方 1.注意处理轻扫和拖拽的冲突 //那个时间短的话 就让那个先执行 //处理 拖拽和轻扫 两个手势的冲突 //需要轻扫手势 ...
- IOS开发-UITextField代理常用的方法总结
1.//当用户全部清空的时候的时候 会调用 -(BOOL)textFieldShouldClear:(UITextField *)textField: 2.//可以得到用户输入的字符 -(BOOL)t ...
- 【jmter】逻辑控制器(Logic Controller)
1. Jmeter官网对逻辑控制器的解释是:“Logic Controllers determine the order in which Samplers are processed.”.意思是说, ...
- nova分析(7)—— nova-scheduler
Nova-Scheduler主要完成虚拟机实例的调度分配任务,创建虚拟机时,虚拟机该调度到哪台物理机上,迁移时若没有指定主机,也需要经过scheduler.资源调度是云平台中的一个很关键问题,如何做到 ...
- CSS常用样式整理
元素边框显示圆角:-moz-border-radius适用于火狐浏览器,-webkit-border-radius适用于Safari和Chrome两种浏览器. 浏览器渐变背景颜色: FILTER: p ...
- BestCoder Round #85 hdu5778 abs(素数筛+暴力)
abs 题意: 问题描述 给定一个数x,求正整数y,使得满足以下条件: 1.y-x的绝对值最小 2.y的质因数分解式中每个质因数均恰好出现2次. 输入描述 第一行输入一个整数T 每组数据有一行,一个整 ...
- POJ 1410 Intersection(计算几何)
题目大意:题目意思很简单,就是说有一个矩阵是实心的,给出一条线段,问线段和矩阵是否相交解题思路:用到了线段与线段是否交叉,然后再判断线段是否在矩阵里面,这里要注意的是,他给出的矩阵的坐标明显不是左上和 ...
- C语言sizeof
一.关于sizeof 1.它是C的关键字.是一个运算符,不是函数: 2.一般用法为sizeof 变量或sizeof(数据类型):后边这种写法会让人误认为是函数,但这种写法是为了防止和C中类型修饰符(s ...