大约是一周前,一位资深的Oracle工程师向我和客户介绍RAC中脑裂的处理过程,据他介绍脑裂发生时通过各节点对voting disk(投票磁盘)的抢夺,那些争抢到(n/2+1)数量voting disk的节点就可以survive(幸存)下来,而没有争抢到voting disk的节点则被evicted踢出节点。

不得不说以上这番观点,来得太过随意了,一位从Oracle 6就开始从事维护工作的老工程师也会犯这样的概念性错误,只能说Oracle技术的更新过于日新月异了。

在理解脑裂(Brain Split)处理过程前,有必要介绍一下Oracle RAC Css(Cluster Synchronization Services)的工作框架:

Oracle RAC CSS提供2种后台服务包括群组管理(Group Managment简称GM)和节点监控(Node Monitor简称NM),其中GM管理组(group)和锁(lock)服务。在集群中任意时刻总有一个节点会充当GM主控节点(master node)。集群中的其他节点串行地将GM请求发送到主控节点(master node),而master node将集群成员变更信息广播给集群中的其他节点。组成员关系(group membership)在每次发生集群重置(cluster reconfiguration)时发生同步。每一个节点独立地诠释集群成员变化信息。

而节点监控NM服务则负责通过skgxn(skgxn-libskgxn.a,提供节点监控的库)与其他厂商的集群软件保持节点信息的一致性。此外NM还提供对我们熟知的网络心跳(Network heartbeat)和磁盘心跳(Disk heartbeat)的维护以保证节点始终存活着。当集群成员没有正常Network heartbeat或Disk heartbeat时NM负责将成员踢出集群,被踢出集群的节点将发生节点重启(reboot)。

NM服务通过OCR中的记录(OCR中记录了Interconnect的信息)来了解其所需要监听和交互的端点,将心跳信息通过网络发送到其他集群成员。同时它也监控来自所有其他集群成员的网络心跳Network heartbeat,每一秒钟都会发生这样的网络心跳,若某个节点的网络心跳在misscount(by the way:10.2.0.1中Linux上默认misscount为60s,其他平台为30s,若使用了第三方vendor clusterware则为600s,但10.2.0.1中未引入disktimeout;10.2.0.4以后misscount为60s,disktimeout为200s;11.2以后misscount为30s:CRS-4678: Successful get misscount 30 for Cluster Synchronization Services,CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services)指定的秒数中都没有被收到的话,该节点被认为已经”死亡”了。NM还负责当其他节点加入或离开集群时初始化集群的重置(Initiates cluster reconfiguration)。

在解决脑裂的场景中,NM还会监控voting disk以了解其他的竞争子集群(subclusters)。关于子集群我们有必要介绍一下,试想我们的环境中存在大量的节点,以Oracle官方构建过的128个节点的环境为我们的想象空间,当网络故障发生时存在多种的可能性,一种可能性是全局的网络失败,即128个节点中每个节点都不能互相发生网络心跳,此时会产生多达128个的信息”孤岛”子集群。另一种可能性是局部的网络失败,128个节点中被分成多个部分,每个部分中包含多于一个的节点,这些部分就可以被称作子集群(subclusters)。当出现网络故障时子集群内部的多个节点仍能互相通信传输投票信息(vote mesg),但子集群或者孤岛节点之间已经无法通过常规的Interconnect网络交流了,这个时候NM Reconfiguration就需要用到voting disk投票磁盘。

因为NM要使用voting disk来解决因为网络故障造成的通信障碍,所以需要保证voting disk在任意时刻都可以被正常访问。在正常状态下,每个节点都会进行磁盘心跳活动,具体来说就是会到投票磁盘的某个块上写入disk心跳信息,这种活动每一秒钟都会发生,同时CSS还会每秒读取一种称作”kill block”的”赐死块”,当”kill block”的内容表示本节点被驱逐出集群时,CSS会主动重启节点。

为了保证以上的磁盘心跳和读取”kill block”的活动始终正常运作CSS要求保证至少(N/2+1)个投票磁盘要被节点正常访问,这样就保证了每2个节点间总是至少有一个投票磁盘是它们都可以正常访问的,在正常情况下(注意是风平浪静的正常情况)只要节点所能访问的在线voting disk多于无法访问的voting disk,该节点都能幸福地活下去,当无法访问的voting disk多于正常的voting disk时,Cluster Communication Service进程将失败并引起节点重启。所以有一种说法认为voting disk只要有2个足以保证冗余度就可以了,没有必要有3个或以上voting disk,这种说法是错误的。Oracle推荐集群中至少要有3个voting disks。

补充1:

Question:
有同学问那么voting disk  必须是奇数个呢?

Answer:
实际上我们仅仅是推荐使用奇数个vote disk ,而非必须是奇数个。10gR2中vote disk的数目上限是32个。

Question
我们可以使用2或4个vote disk吗?

Answer:
可以的。 但是2、4这样的数目在“至少(N/2+1)个投票磁盘要被节点正常访问”这一disk heartbeat的硬性算法下是不利的:
当我们使用2个vote disk 时,不能发生任意个vote disk的心跳失败
当我们使用3个vote disk 时,不能发生大于1个的vote disk心跳失败
当我们使用4个vote disk 时,不能发生大于1个的vote disk心跳失败 ,这和3个时的容错率是一样,但是因为我们有更多的vote disk,这会导致管理成本和引入的风险增长
当我们使用5个vote disk 时,不能发生大于2个的vote disk心跳失败
当我们使用6个vote disk 时,仍然不能发生大于2个的vote disk心跳失败, 同样的因为比5时多出一个 ,也会引入不合理的管理成本和风险

补充2:
Question:
若节点间的网络心跳正常,且节点所能正常心跳的vote disk 大于 不能正常访问的 ,如3个votedisk 时 恰巧有1个vote disk 的disk heartbeat 超时,此时Brain split 会发生吗?

Answer:
这种情况即不会触发Brain Split,也不会引发节点驱逐协议(eviction protocol)。 当单个或小于(N/2+1)个的voting disk心跳失败(disk heartbeat failure)时,这种心跳失败可能是由于短期内节点访问voting disk发生I/O error错误而引起的,此时css会立刻将这些失败的voting disk标记为OFFLINE。虽然有一定数量的voting disk OFFLINE了,但是我们仍有至少(N/2+1)个投票磁盘可用,这保证了eviction protocol不会被调用,所以没有节点会被reboot重启。紧接着node monitor模块的Disk ping Monitor Thread(DPMT-clssnmDiskPMT)会重复尝试访问这些失败的OFFLINE voting disk,若这些投票磁盘变得再次可I/O访问且经过验证其上的数据也没有讹误,那么css会再次将此voting disk标记为ONLINE;但是如果在45s( 这里的45s是基于misscount和 内部算法获得的) 内仍不能正常访问相关的voting disk,那么DMPT将在cssd.log中生成警告信息,如:

CSSD]2011-11-11 20:11:20.668 >
WARNING: clssnmDiskPMT: long disk latency >(45940 ms) to voting disk (0//dev/asm-votedisk1)

假设以上发生clssnmDiskPMT警告的RAC场景共有3个voting disk,现已有一个asm-votedisk1因为I/O error或其他原因而被标记为OFFLINE,若此时再有一个votedisk也出现了问题并disk heartbeat 失败,那么节点会因为少于规定数目(2)的votedisk而引发eviction protocol,进而重启reboot。

单个或小于(N/2+1)个的voting disk心跳失败都仅仅生成警告(Warning),而非致命的错误。因为仍有绝大数量的vote disk可被访问,因此生成的警告都是非致命的,eviction protocol将不会被触发。

当实际的NM Reconfiguration集群重置情况发生时所有的active节点和正在加入集群的节点都会参与到reconfig中,那些没有应答(ack)的节点都将不再被归入新的集群关系中。实际上reconfig重置包括多个阶段:
1.初始化阶段 — reconfig manager(由集群成员号最低的节点担任)向其他节点发送启动reconfig的信号
2.投票阶段 — 节点向reconfig manager发送该节点所了解的成员关系
3.脑裂检查阶段 — reconfig manager检查是否脑裂
4.驱逐阶段 — reconfig manager驱逐非成员节点
5.更新阶段 — reconfig manager向成员节点发送权威成员关系信息

在脑裂检查阶段Reconfig Manager会找出那些没有Network Heartbeat而有Disk Heartbeat的节点,并通过Network Heartbeat(如果可能的话)和Disk Heartbeat的信息来计算所有竞争子集群(subcluster)内的节点数目,并依据以下2种因素决定哪个子集群应当存活下去:

  1. 拥有最多节点数目的子集群(Sub-cluster with largest number of Nodes)
  2. 若子集群内数目相等则为拥有最低节点号的子集群(Sub-cluster with lowest node number),举例来说在一个2节点的RAC环境中总是1号节点会获胜。
  3. 采用Stonith algorithm 的IO fencing(remote power reset)

补充:

STONITH算法是一种常用的I/O Fencing algorithm,是RAC中必要的远程关闭节点的接口。其想法十分简单,当某个节点上运行的软件希望确保本集群内的其他节点不能使用某种资源时,拔出其他节点的插座即可。这是一种简单、可靠且有些残酷的算法。Stonith 的优势是其没有特定的硬件需求,也不会限制集群的扩展性。

Oracle Clusterware的Process Monitor模块负责实现IO fencing,保证不会因节点/实例的不协调工作而产生corruption。Process Monitor的工作具体由hangcheck timer或者oprocd 完成, 在Linux平台上10.2.0.4 之前都没有oprocd的存在(其他Unix平台在10.2.0.1就有了),在安装RAC之前需要额外安装hangcheck timer软件以保证IO fencing, 到10.2.0.4 时Linux上也有了oprocd,具体见<Know about RAC Clusterware Process OPROCD> 一文。 这些负责IO fencing的进程一般都会被锁定在内存中(locked in memory)、实时运行(Real time)、休眠固定的时间(Sleep a fixed time)、以root身份运行;若进程唤醒后发现时间已经太晚,那么它会强制reboot;若这些进程发生失败,则也会重启,所以在RAC环境中oprocd是一个很重要的进程,不要失去手动去kill这个进程。

在完成脑裂检查后进入驱逐阶段,被驱逐节点会收到发送给它们的驱逐信息(如果网络可用的话),若无法发送信息则会通过写出驱逐通知到voting disk上的”kill block”来达到驱逐通知的目的。同时还会等待被驱逐节点表示其已收到驱逐通知,这种表示可能是通过网络通信的方式也可能是投票磁盘上的状态信息。

可以看到Oracle CSS中Brain Split Check时会尽可能地保证最大的一个子集群存活下来以保证RAC系统具有最高的可用性,而并不如那位资深工程师所说的在Cluster Reconfiguration阶段会通过节点对投票磁盘的抢占来保证哪个节点存活下来。

以下为一个三节点RAC环境中的2个示例场景:

1.1号节点网络失败,2,3号节点形成子集群;2,3节点通过voting disk向1号节点发起驱逐:

以下为1号节点的ocssd.log日志:
[ CSSD]2011-04-23 17:11:42.943 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 3.280308e-268artbeat fatal, eviction in 29.610 seconds
[ CSSD]2011-04-23 17:11:42.943 [3042950032] >TRACE: clssnmPollingThread: node vrh2 (2) is impending reconfig, flag 1037, misstime 30390
[ CSSD]2011-04-23 17:11:42.943 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 3.280308e-268artbeat fatal, eviction in 29.150 seconds 对2,3号节点发起misscount计时 [ CSSD]2011-04-23 17:11:42.943 [3042950032] >TRACE: clssnmPollingThread: node vrh3 (3) is impending reconfig, flag 1037, misstime 30850
[ CSSD]2011-04-23 17:11:42.943 [3042950032] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2011-04-23 17:11:44.368 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 3.280308e-268artbeat fatal, eviction in 28.610 seconds
[ CSSD]2011-04-23 17:12:04.778 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 3.280308e-268artbeat fatal, eviction in 14.580 seconds
[ CSSD]2011-04-23 17:12:04.779 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 3.280308e-268artbeat fatal, eviction in 14.120 seconds
[ CSSD]2011-04-23 17:12:06.207 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 3.280308e-268artbeat fatal, eviction in 13.580 seconds
[ CSSD]2011-04-23 17:12:17.719 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 5.560 seconds
[ CSSD]2011-04-23 17:12:17.719 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 5.100 seconds
[ CSSD]2011-04-23 17:12:19.165 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 4.560 seconds
[ CSSD]2011-04-23 17:12:19.165 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 4.100 seconds
[ CSSD]2011-04-23 17:12:20.642 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 3.560 seconds
[ CSSD]2011-04-23 17:12:20.642 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 3.100 seconds
[ CSSD]2011-04-23 17:12:22.139 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 2.560 seconds
[ CSSD]2011-04-23 17:12:22.139 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 2.100 seconds
[ CSSD]2011-04-23 17:12:23.588 [3042950032] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 3.280308e-268artbeat fatal, eviction in 1.550 seconds
[ CSSD]2011-04-23 17:12:23.588 [3042950032] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 3.280308e-268artbeat fatal, eviction in 1.090 seconds

2号节点的ocssd.log日志:

[    CSSD]2011-04-23 17:11:53.054 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 50 8.910601e-269artbeat fatal, eviction in 29.800 seconds
[ CSSD]2011-04-23 17:11:53.054 [3053439888] >TRACE: clssnmPollingThread: node vrh1 (1) is impending reconfig, flag 1037, misstime 30200
[ CSSD]2011-04-23 17:11:53.054 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2011-04-23 17:11:54.516 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 50 8.910601e-269artbeat fatal, eviction in 28.790 seconds
[ CSSD]2011-04-23 17:12:14.826 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 75 8.910601e-269artbeat fatal, eviction in 14.800 seconds
[ CSSD]2011-04-23 17:12:16.265 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 75 8.910601e-269artbeat fatal, eviction in 13.800 seconds
[ CSSD]2011-04-23 17:12:27.755 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 5.800 seconds
[ CSSD]2011-04-23 17:12:29.197 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 4.800 seconds
[ CSSD]2011-04-23 17:12:30.658 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 3.800 seconds
[ CSSD]2011-04-23 17:12:32.133 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 2.800 seconds
[ CSSD]2011-04-23 17:12:33.602 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 1.790 seconds
[ CSSD]2011-04-23 17:12:35.126 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 8.910601e-269artbeat fatal, eviction in 0.800 seconds [ CSSD]2011-04-23 17:12:35.399 [117574544] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:12:35.399 [117574544] >TRACE: clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[21] sync[10] clssnmHandleSyn应答3号节点发送的同步信息 [ CSSD]2011-04-23 17:12:35.399 [5073104] >USER: NMEVENT_SUSPEND [00][00][00][0e] 发生Node Monitoring SUSPEND事件 [ CSSD]2011-04-23 17:12:35.405 [117574544] >TRACE: clssnmSendVoteInfo: node(3) syncSeqNo(10) 通过clssnmSendVoteInfo向3号节点发送投票信息Vote mesg [ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmUpdateNodeState: node 1, state (3/0) unique (1303592601/1303592601) prevConuni(0) birth (9/9) (old/new)
[ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmDiscHelper: vrh1, node(1) connection failed, con (0xb7e80ae8), probe((nil))
[ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmDeactivateNode: node 1 (vrh1) left cluster 确认1号节点离开了集群cluster [ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new)
[ CSSD]2011-04-23 17:12:35.415 [117574544] >TRACE: clssnmUpdateNodeState: node 3, state (3/3) unique (1303591326/1303591326) prevConuni(0) birth (3/3) (old/new)
[ CSSD]2011-04-23 17:12:35.415 [117574544] >USER: clssnmHandleUpdate: SYNC(10) from node(3) completed
[ CSSD]2011-04-23 17:12:35.416 [117574544] >USER: clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER
[ CSSD]2011-04-23 17:12:35.416 [117574544] >USER: clssnmHandleUpdate: NODE 3 (vrh3) IS ACTIVE MEMBER OF CLUSTER
[ CSSD]2011-04-23 17:12:35.416 [117574544] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms
[ CSSD]2011-04-23 17:12:35.416 [3021970320] >TRACE: clssgmReconfigThread: started for reconfig (10)
[ CSSD]2011-04-23 17:12:35.416 [3021970320] >USER: NMEVENT_RECONFIG [00][00][00][0c]
[ CSSD]2011-04-23 17:12:35.417 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock crs_version type 2
[ CSSD]2011-04-23 17:12:35.417 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(crs_version) birth(9/9)
[ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3
[ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(EVMDMAIN) birth(9/9)
[ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[ CSSD]2011-04-23 17:12:35.418 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(CRSDMAIN) birth(9/9)
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh1 type 3
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh2 type 3
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh3 type 3
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(ocr_crs) birth(9/9)
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(#CSS_CLSSOMON) birth(9/9)
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 10
[ CSSD]2011-04-23 17:12:35.419 [3063929744] >TRACE: clssgmPeerDeactivate: node 1 (vrh1), death 10, state 0x80000000 connstate 0xa
[ CSSD]2011-04-23 17:12:35.419 [3063929744] >TRACE: clssgmPeerListener: connects done (2/2)
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmEstablishMasterNode: MASTER for 10 is node(2) birth(2)
[ CSSD]2011-04-23 17:12:35.419 [3021970320] >TRACE: clssgmMasterCMSync: Synchronizing group/lock status
[ CSSD]2011-04-23 17:12:35.428 [3021970320] >TRACE: clssgmMasterSendDBDone: group/lock status synchronization complete
[ CSSD]CLSS-3000: reconfiguration successful, incarnation 10 with 2 nodes [ CSSD]CLSS-3001: local node number 2, master node number 2 完成reconfiguration [ CSSD]2011-04-23 17:12:35.440 [3021970320] >TRACE: clssgmReconfigThread: completed for reconfig(10), with status(1)

以下为3号节点的ocssd.log:

[    CSSD]2011-04-23 17:12:36.303 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 50 1.867300e-268artbeat fatal, eviction in 29.220 seconds
[ CSSD]2011-04-23 17:12:36.303 [3053439888] >TRACE: clssnmPollingThread: node vrh1 (1) is impending reconfig, flag 1037, misstime 30780
[ CSSD]2011-04-23 17:12:36.303 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2011-04-23 17:12:57.889 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 75 1.867300e-268artbeat fatal, eviction in 14.220 seconds
[ CSSD]2011-04-23 17:13:10.674 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 5.220 seconds
[ CSSD]2011-04-23 17:13:12.115 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 4.220 seconds
[ CSSD]2011-04-23 17:13:13.597 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 3.210 seconds
[ CSSD]2011-04-23 17:13:15.024 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 2.220 seconds
[ CSSD]2011-04-23 17:13:16.504 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 1.220 seconds
[ CSSD]2011-04-23 17:13:17.987 [3053439888] >WARNING: clssnmPollingThread: node vrh1 (1) at 90 1.867300e-268artbeat fatal, eviction in 0.220 seconds
[ CSSD]2011-04-23 17:13:18.325 [3053439888] >TRACE: clssnmPollingThread: Eviction started for node vrh1 (1), flags 0x040d, state 3, wt4c 0
[ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmDoSyncUpdate: Initiating sync 10
[ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ALIVE
[ CSSD]2011-04-23 17:13:18.326 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ALIVE
[ CSSD]2011-04-23 17:13:18.327 [3032460176] >TRACE: clssnmSendSync: syncSeqNo(10)
[ CSSD]2011-04-23 17:13:18.329 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(2)
[ CSSD]2011-04-23 17:13:18.329 [89033616] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:13:18.329 [89033616] >TRACE: clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[21] sync[10]
[ CSSD]2011-04-23 17:13:18.330 [8136912] >USER: NMEVENT_SUSPEND [00][00][00][0e]
[ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmDoSyncUpdate: Terminating node 1, vrh1, misstime(60010) state(5)
[ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
[ CSSD]2011-04-23 17:13:18.332 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ACTIVE
[ CSSD]2011-04-23 17:13:18.334 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(2)
[ CSSD]2011-04-23 17:13:18.335 [89033616] >TRACE: clssnmSendVoteInfo: node(3) syncSeqNo(10)
[ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(13) 以上完成了2-3节点间的Vote mesg通信,这些信息包含Node identifier,GM peer to peer listening endpoint以及
View of cluster membership。 [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmCheckDskInfo: Checking disk info... 开始检测voting disk上的信息 [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmCheckDskInfo: node 1, vrh1, state 5 with leader 1 has smaller cluster size 1; my cluster size 2 with leader 2 发现其他子集群,包含1号节点且1号节点为该子集群的leader,为最小子集群;3号与2号节点组成最大子集群,2号节点为leader节点 [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmEvict: Start
[ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmEvict: Evicting node 1, vrh1, birth 9, death 10, impendingrcfg 1, stateflags 0x40d
发起对1号节点的驱逐 [ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmSendShutdown: req to node 1, kill time 443294
[ CSSD]2011-04-23 17:13:18.339 [3032460176] >TRACE: clssnmDiscHelper: vrh1, node(1) connection failed, con (0xb7eaf220), probe((nil))
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmWaitOnEvictions: Start
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmWaitOnEvictions: node 1, vrh1, undead 1
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmCheckKillStatus: Node 1, vrh1, down, LATS(443144),timeout(150) clssnmCheckKillStatus检查1号节点是否down了 [ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (15)
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ACTIVE
[ CSSD]2011-04-23 17:13:18.340 [3032460176] >TRACE: clssnmSendUpdate: syncSeqNo(10)
[ CSSD]2011-04-23 17:13:18.341 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(15), ackCount(2)
[ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 1, state (5/0) unique (1303592601/1303592601) prevConuni(1303592601) birth (9/9) (old/new)
[ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmDeactivateNode: node 1 (vrh1) left cluster [ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new)
[ CSSD]2011-04-23 17:13:18.341 [89033616] >TRACE: clssnmUpdateNodeState: node 3, state (3/3) unique (1303591326/1303591326) prevConuni(0) birth (3/3) (old/new)
[ CSSD]2011-04-23 17:13:18.342 [89033616] >USER: clssnmHandleUpdate: SYNC(10) from node(3) completed
[ CSSD]2011-04-23 17:13:18.342 [89033616] >USER: clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER
[ CSSD]2011-04-23 17:13:18.342 [89033616] >USER: clssnmHandleUpdate: NODE 3 (vrh3) IS ACTIVE MEMBER OF CLUSTER
[ CSSD]2011-04-23 17:13:18.342 [89033616] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms
[ CSSD]2011-04-23 17:13:18.347 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(15)
[ CSSD]2011-04-23 17:13:18.348 [3032460176] >TRACE: clssnmDoSyncUpdate: Sync 10 complete!
[ CSSD]2011-04-23 17:13:18.350 [3021970320] >TRACE: clssgmReconfigThread: started for reconfig (10)
[ CSSD]2011-04-23 17:13:18.350 [3021970320] >USER: NMEVENT_RECONFIG [00][00][00][0c]
[ CSSD]2011-04-23 17:13:18.351 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock crs_version type 2
[ CSSD]2011-04-23 17:13:18.352 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(crs_version) birth(9/9)
[ CSSD]2011-04-23 17:13:18.353 [3063929744] >TRACE: clssgmDispatchCMXMSG(): got message type(7) src(2) incarn(10) during incarn(9/9)
[ CSSD]2011-04-23 17:13:18.354 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3
[ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(EVMDMAIN) birth(9/9)
[ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(CRSDMAIN) birth(9/9)
[ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh1 type 3
[ CSSD]2011-04-23 17:13:18.355 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh2 type 3
[ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh3 type 3
[ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(ocr_crs) birth(9/9)
[ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2
[ CSSD]2011-04-23 17:13:18.356 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(1) grock(#CSS_CLSSOMON) birth(9/9)
[ CSSD]2011-04-23 17:13:18.357 [3021970320] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 10
[ CSSD]2011-04-23 17:13:18.366 [3063929744] >TRACE: clssgmPeerDeactivate: node 1 (vrh1), death 10, state 0x80000000 connstate 0xa
[ CSSD]2011-04-23 17:13:18.367 [3063929744] >TRACE: clssgmHandleDBDone(): src/dest (2/65535) size(68) incarn 10
[ CSSD]2011-04-23 17:13:18.367 [3063929744] >TRACE: clssgmPeerListener: connects done (2/2)
[ CSSD]2011-04-23 17:13:18.369 [3021970320] >TRACE: clssgmEstablishMasterNode: MASTER for 10 is node(2) birth(2) 更新阶段 [ CSSD]CLSS-3000: reconfiguration successful, incarnation 10 with 2 nodes [ CSSD]CLSS-3001: local node number 3, master node number 2 [ CSSD]2011-04-23 17:13:18.372 [3021970320] >TRACE: clssgmReconfigThread: completed for reconfig(10), with status(1)

2.另一场景为1号节点未加入集群,2号节点的网络失败,因2号节点的member number较小故其通过voting disk向3号节点发起驱逐

以下为2号节点的ocssd.log日志

[    CSSD]2011-04-23 17:41:48.643 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 8.910601e-269artbeat fatal, eviction in 29.890 seconds
[ CSSD]2011-04-23 17:41:48.643 [3053439888] >TRACE: clssnmPollingThread: node vrh3 (3) is impending reconfig, flag 1037, misstime 30110
[ CSSD]2011-04-23 17:41:48.643 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2011-04-23 17:41:50.132 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 50 8.910601e-269artbeat fatal, eviction in 28.890 seconds
[ CSSD]2011-04-23 17:42:10.533 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 8.910601e-269artbeat fatal, eviction in 14.860 seconds
[ CSSD]2011-04-23 17:42:11.962 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 75 8.910601e-269artbeat fatal, eviction in 13.860 seconds
[ CSSD]2011-04-23 17:42:23.523 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 5.840 seconds
[ CSSD]2011-04-23 17:42:24.989 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 4.840 seconds
[ CSSD]2011-04-23 17:42:26.423 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 3.840 seconds
[ CSSD]2011-04-23 17:42:27.890 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 2.840 seconds
[ CSSD]2011-04-23 17:42:29.382 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 1.840 seconds
[ CSSD]2011-04-23 17:42:30.832 [3053439888] >WARNING: clssnmPollingThread: node vrh3 (3) at 90 8.910601e-269artbeat fatal, eviction in 0.830 seconds
[ CSSD]2011-04-23 17:42:32.020 [3053439888] >TRACE: clssnmPollingThread: Eviction started for node vrh3 (3), flags 0x040d, state 3, wt4c 0
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmDoSyncUpdate: Initiating sync 13
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ALIVE
[ CSSD]2011-04-23 17:42:32.020 [3032460176] >TRACE: clssnmSendSync: syncSeqNo(13)
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1)
[ CSSD]2011-04-23 17:42:32.021 [117574544] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:42:32.021 [117574544] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] srcName[vrh2] seq[13] sync[13]
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmDoSyncUpdate: Terminating node 3, vrh3, misstime(60000) state(5)
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
[ CSSD]2011-04-23 17:42:32.021 [5073104] >USER: NMEVENT_SUSPEND [00][00][00][0c]
[ CSSD]2011-04-23 17:42:32.021 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2011-04-23 17:42:32.022 [117574544] >TRACE: clssnmSendVoteInfo: node(2) syncSeqNo(13)
[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(13)
[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmCheckDskInfo: node 3, vrh3, state 5 with leader 3 has smaller cluster size 1; my cluster size 1 with leader 2 检查voting disk后发现子集群3为最小"子集群"(3号节点的node number较2号大);2号节点为最大子集群 [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmEvict: Start [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmEvict: Evicting node 3, vrh3, birth 3, death 13, impendingrcfg 1, stateflags 0x40d [ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmSendShutdown: req to node 3, kill time 1643084 发起对3号节点的驱逐和shutdown request [ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmDiscHelper: vrh3, node(3) connection failed, con (0xb7e79bb0), probe((nil))
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmWaitOnEvictions: Start
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmWaitOnEvictions: node 3, vrh3, undead 1
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmCheckKillStatus: Node 3, vrh3, down, LATS(1642874),timeout(210)
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (15)
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
[ CSSD]2011-04-23 17:42:32.023 [3032460176] >TRACE: clssnmSendUpdate: syncSeqNo(13)
[ CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(15), ackCount(1)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 1, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 2, state (3/3) unique (1303591210/1303591210) prevConuni(0) birth (2/2) (old/new)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmUpdateNodeState: node 3, state (5/0) unique (1303591326/1303591326) prevConuni(1303591326) birth (3/3) (old/new)
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmDeactivateNode: node 3 (vrh3) left cluster [ CSSD]2011-04-23 17:42:32.024 [117574544] >USER: clssnmHandleUpdate: SYNC(13) from node(2) completed
[ CSSD]2011-04-23 17:42:32.024 [117574544] >USER: clssnmHandleUpdate: NODE 2 (vrh2) IS ACTIVE MEMBER OF CLUSTER
[ CSSD]2011-04-23 17:42:32.024 [117574544] >TRACE: clssnmHandleUpdate: diskTimeout set to (200000)ms
[ CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(15)
[ CSSD]2011-04-23 17:42:32.024 [3032460176] >TRACE: clssnmDoSyncUpdate: Sync 13 complete!
[ CSSD]2011-04-23 17:42:32.024 [3021970320] >TRACE: clssgmReconfigThread: started for reconfig (13)
[ CSSD]2011-04-23 17:42:32.024 [3021970320] >USER: NMEVENT_RECONFIG [00][00][00][04]
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock crs_version type 2
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(2) grock(crs_version) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_FAILOVER type 3
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_FAILOVER) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(EVMDMAIN) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(CRSDMAIN) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh1 type 3
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_MEMBER_vrh1) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock _ORA_CRS_MEMBER_vrh3 type 3
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(_ORA_CRS_MEMBER_vrh3) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(ocr_crs) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupGrocks: cleaning up grock #CSS_CLSSOMON type 2
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmCleanupOrphanMembers: cleaning up remote mbr(3) grock(#CSS_CLSSOMON) birth(3/3)
[ CSSD]2011-04-23 17:42:32.025 [3021970320] >TRACE: clssgmEstablishConnections: 1 nodes in cluster incarn 13
[ CSSD]2011-04-23 17:42:32.026 [3063929744] >TRACE: clssgmPeerDeactivate: node 3 (vrh3), death 13, state 0x0 connstate 0xf
[ CSSD]2011-04-23 17:42:32.026 [3063929744] >TRACE: clssgmPeerListener: connects done (1/1)
[ CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE: clssgmEstablishMasterNode: MASTER for 13 is node(2) birth(2)
[ CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE: clssgmMasterCMSync: Synchronizing group/lock status
[ CSSD]2011-04-23 17:42:32.026 [3021970320] >TRACE: clssgmMasterSendDBDone: group/lock status synchronization complete
[ CSSD]CLSS-3000: reconfiguration successful, incarnation 13 with 1 nodes [ CSSD]CLSS-3001: local node number 2, master node number 2 完成reconfiguration [ CSSD]2011-04-23 17:42:32.027 [3021970320] >TRACE: clssgmReconfigThread: completed for reconfig(13), with status(1)

以下为3号节点的ocssd.log日志:

[    CSSD]2011-04-23 17:42:33.204 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 50 1.867300e-268artbeat fatal, eviction in 29.360 seconds
[ CSSD]2011-04-23 17:42:33.204 [3053439888] >TRACE: clssnmPollingThread: node vrh2 (2) is impending reconfig, flag 1039, misstime 30640
[ CSSD]2011-04-23 17:42:33.204 [3053439888] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2011-04-23 17:42:55.168 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 75 1.867300e-268artbeat fatal, eviction in 14.330 seconds
[ CSSD]2011-04-23 17:43:08.182 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 5.310 seconds
[ CSSD]2011-04-23 17:43:09.661 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 4.300 seconds
[ CSSD]2011-04-23 17:43:11.144 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 3.300 seconds
[ CSSD]2011-04-23 17:43:12.634 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 2.300 seconds
[ CSSD]2011-04-23 17:43:14.053 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 1.300 seconds
[ CSSD]2011-04-23 17:43:15.467 [3053439888] >WARNING: clssnmPollingThread: node vrh2 (2) at 90 1.867300e-268artbeat fatal, eviction in 0.300 seconds
[ CSSD]2011-04-23 17:43:15.911 [3053439888] >TRACE: clssnmPollingThread: Eviction started for node vrh2 (2), flags 0x040f, state 3, wt4c 0
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmDoSyncUpdate: Initiating sync 13
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ALIVE
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmSendSync: syncSeqNo(13)
[ CSSD]2011-04-23 17:43:15.911 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1)
[ CSSD]2011-04-23 17:43:15.912 [89033616] >TRACE: clssnmHandleSync: diskTimeout set to (57000)ms
[ CSSD]2011-04-23 17:43:15.912 [89033616] >TRACE: clssnmHandleSync: Acknowledging sync: src[3] srcName[vrh3] seq[29] sync[13]
[ CSSD]2011-04-23 17:43:15.912 [8136912] >USER: NMEVENT_SUSPEND [00][00][00][0c]
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmDoSyncUpdate: Terminating node 2, vrh2, misstime(60010) state(5)
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmSetupAckWait: node(3) is ACTIVE
[ CSSD]2011-04-23 17:43:15.913 [89033616] >TRACE: clssnmSendVoteInfo: node(3) syncSeqNo(13)
[ CSSD]2011-04-23 17:43:15.912 [3032460176] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2011-04-23 17:43:15.913 [3032460176] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain. [ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: : my node(3), Leader(3), Size(1) VS Node(2), Leader(2), Size(1) 读取voting disk后发现kill block,为避免split brain,自我aborting! [ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: ###################################
[ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: clssscExit: CSSD aborting
[ CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR: ###################################
[ CSSD]--- DUMP GROCK STATE DB ---
[ CSSD]----------
[ CSSD] type 2, Id 4, Name = (crs_version)
[ CSSD] flags: 0x1000
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD] Member Count =2, master 2
[ CSSD] . . . . .
[ CSSD] memberNo =2, seq 2
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 3, nodeBirth 3
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD] . . . . .
[ CSSD] memberNo =1, seq 12
[ CSSD] flags = 0x1000, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 2, nodeBirth 2
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 3, Id 11, Name = (_ORA_CRS_FAILOVER)
[ CSSD] flags: 0x0
[ CSSD] grant: count=1, type 3, wait 1
[ CSSD] Member Count =1, master -3
[ CSSD] . . . . .
[ CSSD] memberNo =0, seq 0
[ CSSD] flags = 0x12, granted 1
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 3, nodeBirth 3
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 2, Id 2, Name = (EVMDMAIN)
[ CSSD] flags: 0x1000
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD] Member Count =2, master 2
[ CSSD] . . . . .
[ CSSD] memberNo =2, seq 1
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 2, nodeBirth 2
[ CSSD] privateDataSize = 508
[ CSSD] publicDataSize = 504
[ CSSD] . . . . .
[ CSSD] memberNo =3, seq 2
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 3, nodeBirth 3
[ CSSD] privateDataSize = 508
[ CSSD] publicDataSize = 504
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 2, Id 5, Name = (CRSDMAIN)
[ CSSD] flags: 0x1000
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD] Member Count =1, master 3
[ CSSD] . . . . .
[ CSSD] memberNo =3, seq 2
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 3, nodeBirth 3
[ CSSD] privateDataSize = 128
[ CSSD] publicDataSize = 128
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 3, Id 12, Name = (_ORA_CRS_MEMBER_vrh1)
[ CSSD] flags: 0x0
[ CSSD] grant: count=1, type 3, wait 1
[ CSSD] Member Count =1, master -3
[ CSSD] . . . . .
[ CSSD] memberNo =0, seq 0
[ CSSD] flags = 0x12, granted 1
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 3, nodeBirth 3
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 3, Id 12, Name = (_ORA_CRS_MEMBER_vrh3)
[ CSSD] flags: 0x0
[ CSSD] grant: count=1, type 3, wait 1
[ CSSD] Member Count =1, master -3
[ CSSD] . . . . .
[ CSSD] memberNo =0, seq 0
[ CSSD] flags = 0x12, granted 1
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 3, nodeBirth 3
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 2, Id 3, Name = (ocr_crs)
[ CSSD] flags: 0x1000
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD] Member Count =2, master 3
[ CSSD] . . . . .
[ CSSD] memberNo =3, seq 2
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 3, nodeBirth 3
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 32
[ CSSD] . . . . .
[ CSSD] memberNo =2, seq 12
[ CSSD] flags = 0x1000, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 2, nodeBirth 2
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 32
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 2, Id 1, Name = (#CSS_CLSSOMON)
[ CSSD] flags: 0x1000
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD] Member Count =2, master 2
[ CSSD] . . . . .
[ CSSD] memberNo =2, seq 1
[ CSSD] flags = 0x1000, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 2, nodeBirth 2
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD] . . . . .
[ CSSD] memberNo =3, seq 2
[ CSSD] flags = 0x1000, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 3, nodeBirth 3
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD]----------
[ CSSD]--- END OF GROCK STATE DUMP ---
[ CSSD]------- Begin Dump -------

Oracle RAC Brain Split Resolution的更多相关文章

  1. Oracle rac集群环境中的特殊问题

    备注:本文摘抄于张晓明<大话Oracle RAC:集群 高可用性 备份与恢复> 因为集群环境需要多个计算机协同工作,要达到理想状态,必须要考虑在集群环境下面临的新挑战. 1.并发控制 在集 ...

  2. oracle rac 巡检过程详解

    一 RAC环境 RAC架构,2节点信息 节点1 SQL> show parameter instance NAME                                 TYPE    ...

  3. ORACLE RAC 11.2.0.4 CentOS release 6.9 静默安装1.0版本

    RAC11.2.0.4静默安装 1.0版本,20180613 #本文档IP地址使用X隐藏,个人可按照自己的当前环境IP进行适当修改 1. 清除原环境中的单实例软件 #清除原环境: 删除/etc/ora ...

  4. Optimizing Oracle RAC

    Oracle Real Application Clusters (RAC) databases form an increasing proportion of Oracle database sy ...

  5. GitHub: Oracle RAC Database on Docker 未测试 改天试试

    https://github.com/oracle/docker-images/blob/master/OracleDatabase/RAC/OracleRealApplicationClusters ...

  6. 【Oracle 集群】Linux下Oracle RAC集群搭建之Oracle DataBase安装(八)

    Oracle 11G RAC数据库安装(八) 概述:写下本文档的初衷和动力,来源于上篇的<oracle基本操作手册>.oracle基本操作手册是作者研一假期对oracle基础知识学习的汇总 ...

  7. 【Oracle 集群】Linux下Oracle RAC集群搭建之基本测试与使用(九)

    Oracle 11G RAC数据库安装(九) 概述:写下本文档的初衷和动力,来源于上篇的<oracle基本操作手册>.oracle基本操作手册是作者研一假期对oracle基础知识学习的汇总 ...

  8. Oracle RAC客户端tnsnames.ora相关配置及测试

    1.Oracle RAC服务端/etc/hosts部分内容如下 2.查看服务端的local_listener和remote_listener参数 3.客户端tnsnames.ora配置参考 3.1 1 ...

  9. Oracle RAC安装部署文档

    1.    部署环境步骤 1.1 软件环境 操作系统:CentOS release 6.5(推荐使用5.*的系统)192.168.1.151    racnode1 192.168.1.152    ...

随机推荐

  1. [转]C/C++获取当前系统时间

    原文转自:http://www.cnblogs.com/mfryf/archive/2012/02/13/2349360.html 个人觉得第二种还是比较实用的,而且也是最常用的~ 不过当计算算法耗时 ...

  2. bootstrap下拉框式标签页

    <ul id="myTab" class="nav nav-tabs"> <li class="active"> & ...

  3. 数组,for语句(补10.11)

    1.数组定义:一系列通数据类型的数据集合. 2.数组赋值的两种方法: 先定义后赋值:(赋值从0开始) var aa = new Arrey(); aa[0] = 1; aa[1] = 2; 定义并赋值 ...

  4. 在Sql2005中,向表中插入数据时遇到uniqueidentifier列,如何插入数据?

    Sql2005中,提供了uniqueidentifier 数据类型.说白了,就是个GUID,这种类型开发时倒是很有必要的. 今天程序中遇到了这个问题:表里定义了一个uniqueidentifier 列 ...

  5. Mathematics Base - Tensor

    以下是我对张量的理解,备注是具体解释,Xmind导出的图片没法显示出来,主要还是将张量间的关系画出来,方便理解. 图1 张量

  6. 图解SynchronousQueue原理详解-非公平模式

    SynchronousQueue原理详解-非公平模式 开篇 说明:本文分析采用的是jdk1.8 约定:下面内容中Ref-xxx代表的是引用地址,引用对应的节点 前面已经讲解了公平模式的内容,今天来讲解 ...

  7. [转] iOS开发-搜索栏UISearchBar和UISearchController

    原文网址: http://www.cnblogs.com/xiaofeixiang/p/4273620.html?utm_source=tuicool iOS中UISearchDisplayContr ...

  8. 小白入门AI教程:教你快速搭建大数据平台『Hadoop+Spark』

    Apache Spark 简介 Apache Spark 是专为大规模数据处理而设计的快速通用的计算引擎.Spark是UC Berkeley AMP lab (加州大学伯克利分校的AMP实验室)所开源 ...

  9. 一个小时学会 MySQL 数据库

    随着移动互联网的结束与人工智能的到来大数据变成越来越重要,下一个成功者应该是拥有海量数据的,数据与数据库你应该知道. 一.数据库概要 数据库(Database)是存储与管理数据的软件系统,就像一个存入 ...

  10. 题解 P1006 传纸条

    传送门 其实我觉得这个跟P1004挺类似(又是动规) 题解P1004 #include<iostream> #include<cstdio> #include<cstri ...