经纪xx系统节点VIP案例介绍和深入分析异常
系统环境 硬件平台 & 操作
IBM 570
操作系统版本号 AIX 5.3
物理内存 32G
Oracle 产品及版本号 10.2.0.5 RAC
业务类型 OLTP
背景概述
交易系统在xx月xx
日。节点二VIP异常下线导致节点二数据库服务失
效。接到请求后。第一时间进行连线处理。故障发生在凌晨 3点,并且
AIX(errpt)、Oracle DB(alert.log
)、CRS (crsd.log
、ocssd.log、vip.log、
coredump )等均没有留下太多有效信息,情况很复杂。
问题具体诊断及分析
一、检查errpt日志:
节点二VIP异常下线时,节点二上无报错产生。
二、检查CRS日志:
节点一:
2013-04-29 03:42:33.180: [ CRSRES][11376]32startRunnable: setting CLIvalues
— 说明:节点一上仅有一行信息反映出故障期间节点一上CRS以前运行过命令。
节点二:
2013-04-29 03:41:12.308: [ CRSAPP][11263]32CheckResource error forora.xxxxdb02.vip error code =1
2013-04-29 03:41:12.335: [ CRSRES][11263]32In stateChanged,ora.xxxxdb02.vip target is ONLINE
2013-04-29 03:41:12.335: [ CRSRES][11263]32ora.xxxxdb02.vip on xxxxdb02went OFFLINE unexpectedly
2013-04-29 03:41:12.335: [ CRSRES][11263]32StopResource: setting CLIvalues
2013-04-29 03:41:12.340: [ CRSRES][11263]32Attempting to stop`ora.xxxxdb02.vip` on member
`xxxxdb02`
2013-04-29 03:41:12.893: [ CRSRES][11269]32In stateChanged,ora.xxxxdb.xxxxdb1.xxxxdb2.srv target
is ONLINE
2013-04-29 03:41:12.894: [ CRSRES][11269]32ora.xxxxdb.xxxxdb1.xxxxdb2.srvon xxxxdb02 went OFFLINE
unexpectedly
2013-04-29 03:41:12.894: [ CRSRES][11269]32StopResource: setting CLIvalues
2013-04-29 03:41:12.899: [ CRSRES][11269]32Attempting to stop`ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on
member `xxxxdb02`
2013-04-29 03:41:12.958: [ CRSRES][11263]32Stop of `ora.xxxxdb02.vip` onmember `xxxxdb02`
succeeded.
2013-04-29 03:41:12.971: [ CRSRES][11263]32ora.xxxxdb02.vipRESTART_COUNT=0 RESTART_ATTEMPTS=0
2013-04-29 03:41:12.976: [ CRSRES][11263]32ora.xxxxdb02.vip failed onxxxxdb02 relocating.
2013-04-29 03:41:13.025: [ CRSRES][11263]32StopResource: setting CLIvalues
2013-04-29 03:41:13.029: [ CRSRES][11263]32Attempting to stop
`ora.xxxxdb02.LISTENER_XXXXDB02.lsnr` onmember `xxxxdb02`
2013-04-29 03:41:13.146: [ CRSRES][11269]32Stop of`ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on member
`xxxxdb02` succeeded.
2013-04-29 03:41:13.146: [ CRSRES][11269]32ora.xxxxdb.xxxxdb1.xxxxdb2.srv RESTART_COUNT=0
RESTART_ATTEMPTS=1
2013-04-29 03:41:13.159: [ CRSRES][11269]32Restartingora.xxxxdb.xxxxdb1.xxxxdb2.srv on xxxxdb02
2013-04-29 03:41:13.164: [ CRSRES][11269]32startRunnable: setting CLIvalues
2013-04-29 03:41:13.164: [ CRSRES][11269]32Attempting to start `ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on
member `xxxxdb02`
2013-04-29 03:41:45.618: [ CRSAPP][11269]32StartResource error forora.xxxxdb.xxxxdb1.xxxxdb2.srv
error code = 1
2013-04-29 03:41:45.799: [ CRSRES][11269]32Start of`ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on member`xxxxdb02` failed.
2013-04-29 03:41:45.820: [ CRSRES][11269]32ora.xxxxdb.xxxxdb1.xxxxdb2.srv failed on xxxxdb02 relocating.
2013-04-29 03:41:45.885: [ CRSRES][11269]32Cannot relocate ora.xxxxdb.xxxxdb1.xxxxdb2.srvStopping dependents
2013-04-29 03:41:45.897: [ CRSRES][11269]32StopResource: setting CLIvalues
2013-04-29 03:42:29.483: [ CRSRES][11263]32Stop of `ora.xxxxdb02.LISTENER_XXXXDB02.lsnr` on member`xxxxdb02` succeeded.
2013-04-29 03:42:29.496: [ CRSRES][11263]32Attempting to start`ora.xxxxdb02.vip` on member`xxdb01np5`
2013-04-29 03:42:32.036: [ CRSRES][11263]32Start of `ora.xxxxdb02.vip`on member `xxdb01np5`succeeded.
— 说明:节点二是故障节点,上面信息反映出,因为VIP检測异常导致节点二上VIP被强
制OFFLINE。同一时候CRS将基于VIP
资源的监听和数据库服务也进行了OFFLINE操作。
三、分析CRS状态:
因为故障期间客户非常好的保留了现场环境。因此我们得以现场分析到CRS
状态。
Name Type Target State Host
------------------------------------------------------------
ora.xxxxdb.db application ONLINE ONLINE xxdb01np5
ora....sdb1.cs application ONLINE OFFLINE
ora....db1.srv application ONLINE ONLINE xxdb01np5
ora....db2.srv application ONLINE ONLINE xxxxdb02
ora....b1.inst application ONLINE ONLINE xxdb01np5
ora....b2.inst application ONLINE ONLINE xxxxdb02
ora....P5.lsnr application ONLINE ONLINE xxdb01np5
ora....np5.gsd application ONLINE ONLINE xxdb01np5
ora....np5.ons application ONLINE ONLINE xxdb01np5
ora....np5.vip application ONLINE ONLINE xxdb01np5
ora....P5.lsnr application ONLINE OFFLINE
ora....np5.gsd application ONLINE ONLINE xxxxdb02
ora....np5.ons application ONLINE ONLINE xxxxdb02
ora....np5.vip application ONLINE ONLINE xxdb01np5
— 说明:故障期间。节点二上的数据库(实例)资源仍然在执行,Nodeapp资源正常,
唯一被OFFLINE的是一个Service和listener
资源。
可是我们注意到节点二上VIP
资源
已经被节点一接管过去。这说明。节点二的VIP(xxx.xxx.xxx.4
)在排查问题的时候是
没问题的。
四、分析数据库告警日志
Sun Apr 29 03:41:13 BEIST 2012
ALTER SYSTEM SET service_names='xxxxdb'SCOPE=MEMORY SID='xxxxdb2';
Sun Apr 29 03:41:55 BEIST 2012
ALTER SYSTEM SET service_names=''SCOPE=MEMORY SID='xxxxdb2';
Sun Apr 29 03:41:55 BEIST 2012
Immediate Kill Session#: 1659, Serial#: 2
Immediate Kill Session: sess:70000038f773bb8 OS pid: 1180972
Sun Apr 29 03:41:55 BEIST 2012
Process OS id : 1180972 alive after kill
Errors in file
Immediate Kill Session#: 1660, Serial#: 2
Immediate Kill Session: sess:7000003847623c0 OS pid: 1179690
Sun Apr 29 03:41:56 BEIST 2012
Process OS id : 1179690 alive after kill
Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc
Immediate Kill Session#: 1661, Serial#: 2
Immediate Kill Session: sess:7000003837620c0 OS pid: 1300264
Sun Apr 29 03:41:56 BEIST 2012
Process OS id : 1300264 alive after kill
Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc
Immediate Kill Session#: 1662, Serial#: 2
Immediate Kill Session: sess:70000038275bc80 OS pid: 1155624
Sun Apr 29 03:41:57 BEIST 2012
Process OS id : 1155624 alive after kill
Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc
Immediate Kill Session#: 1663, Serial#: 2
Immediate Kill Session: sess:700000387762ed0 OS pid: 1000484
Sun Apr 29 03:41:57 BEIST 2012
Process OS id : 1000484 alive after kill
Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc
Immediate Kill Session#: 1664, Serial#: 2
Immediate Kill Session: sess:7000003867923d8 OS pid: 1175586
Sun Apr 29 03:41:58 BEIST 2012
Process OS id : 1175586 alive after kill
Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc
Immediate Kill Session#: 1665, Serial#: 2
Immediate Kill Session: sess:7000003857a9d30 OS pid: 1296160
Sun Apr 29 03:41:58 BEIST 2012
Process OS id : 1296160 alive after kill
Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc
Immediate Kill Session#: 1666, Serial#: 3
Immediate Kill Session: sess:70000038f775130 OS pid: 1176862
Sun Apr 29 03:41:59 BEIST 2012
Process OS id : 1176862 alive after kill
Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc
Immediate Kill Session#: 1667, Serial#: 2
Immediate Kill Session: sess:700000384763938 OS pid: 1151516
Sun Apr 29 03:42:00 BEIST 2012
Process OS id : 1151516 alive after kill
Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc
Immediate Kill Session#: 1668, Serial#: 2
Immediate Kill Session: sess:700000383763638 OS pid: 693786
Sun Apr 29 03:42:01 BEIST 2012
Process OS id : 693786 alive after kill
Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc
Immediate Kill Session#: 1669, Serial#: 2
Immediate Kill Session: sess:70000038275d1f8 OS pid: 1292056 6
— 说明:故障期间,节点二上数据库的Service被OFFLINE,这一点正常。
但随后CRS
将连接到节点二上的全部client连接都强行Kill掉了。该问题和Oracle Bug 6955040有
关。
五、分析RACGVIP
因为问题原因定位到 VIP资源异常上,而VIP切换到节点一上是正常的,
因此极有可能是因为VIP检查的时候出现异常导致CRS
将VIP OFFLINE。因为
Oracle CRS在进行VIP检測的时候是通过RACGVIP这个 SHELL脚本实现功能
的,因此我们分析和測试了RACGVIP脚本。
我们注意到当中一段脚本,例如以下:
#Check the status of the interface thro' pinging gateway
if[ -n "$DEFAULTGW" ]
then
_RET=1
#get base IP address of the interface
tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
#get RX packets numbers
_O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5;exit}}"`
x=$CHECK_TIMES
while [ $x -gt 0 ]
do
if [ -n "$tmpIP" ]
then
$PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
else
$PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
fi
_O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5;exit}}"`
if [ "$_O1" != "$_O2" ]
then
# RX packets numbers changed
_RET=0
break
fi
$SLEEP 1
x=`$EXPR $x - 1`
done
if [ $_RET -ne 0 ]
then
logx "checkIf: RX packets checked if=$_IF failed"
else
logx "checkIf: RX packets checked if=$_IF OK"
fi
else
logx "checkIf: Default gateway is not defined(host=$HOSTNAME)"
if [ $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]
then
_RET=1
else
_RET=0
fi
fi
— 说明:这段脚本说明。CRS在做VIP check的时候机制例如以下,定期向缺省网关(Default
Gateway)发包,假设发包前后检測到网卡上有包流量(output值前后不一样)则说
明检測网络正常。VIP check通过。发包的操作通过AIX
命令运行,还原全部变量后的
命令例如以下
— ping –S xxx.xxx.xxx.2 –c 1–w 1 xxx.xxx.xxx.254
— 參数含义例如以下:
-S hostname/IP addr
Uses the IP address as the source addressin outgoing ping packets.
-c Count
Specifies the number of echo requests, asindicated by the Count
variable, to be sent (and received).
-w timeout
This option works only with the -c option.It causes ping to wait
for a maximum of 'timeout' seconds for areply (after sending the
last packet).
— PING_TIMEOUT參数控制了-c
和-w的值,RACGVIP
脚本中设置例如以下:
# timeout of ping in number of loops
PING_TIMEOUT=" -c 1 -w 1"
— 我们检查了10.2.0.3版本号下的设置,发如今10.2.0.3版本号下,PING_TIMEOUT參数设置
为例如以下值:
# timeout of ping in number of loops
PING_TIMEOUT=" -c 1 -w 3"
— 当前PING_TIMEOUT的设置意味着假设在pingdefault gateway
的1秒内没有成功向
外发送出一个包,那么CRS会经过一次SLEEP后再次尝试ping操作。假设还是没有成
功。CRS即宣布网络异常,VIP OFFLINE。
结论
Oracle 10.2.0.5在RACGVIP脚本上的细微改变说明,Oracle觉得10.2.0.5版本号
下对网络性能及稳定性更高了。
可是将PING_TIMEOUT
由1秒更改为3秒,会大
大添加CRS
的敏感度,使得细微的网络异常即会造成CRS 中的 VIP检測失败。
依据上述分析,我们觉得节点二故障原因例如以下:
1、
升级到10.2.0.5后,CRS
对网络延时更加敏感
2、
在4月29 日凌晨3点故障期间,网络出现了及其短暂的延时导致VIP检測
失败,CRS
将VIP以及依赖于VIP的资源OFFLINE。
3、 CRS
在OFFLINE数据库服务的时候命中Bug6955040造成client连接被
所有杀光。
故障重现
我方在生产环境模拟了这一段异常情况:
步骤流程
1 打开CRS VIPdebug
2
节点二上找出VIP(xxx.xxx.xxx.4
)使用的网络设备(en10)
以及相应的ServiceIP(xxx.xxx.xxx.2
)
3 交换机上找出相应的port
4 Block该port一小段时间
5 观察CRS
和数据库日志
我方我们观察到当中 CRS
日志和数据库日志信息和故障期间全然一致,这
充分证明了前面的推断。
同一时候。VIP的debug
信息显演示样例如以下内容,我们能够从中清晰的看到,因为
网络问题导致发包失败后,CRS
马上将VIP置于 OFFLINE状态:
2013-05-01 20:36:22.560: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:18 BEIST 2012 [ 1139710 ] Checkinginterface existance
Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] Calling getifbyip
Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] getifbyip: started for xxx.xxx.xxx.4
2013-05-01 20:36:22.560: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:18 BEIST 2012 [ 1139710 ] getifbyip:checking if failover is happening (en10)
Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] getifbyip: failover is not happening
(en10)
Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] getifbyip: returning IP en10
2013-05-01 20:36:22.560: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:18 BEIST 2012 [ 1139710 ] Completedgetifbyip en10
Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] Completed with initial interface test
Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] Broadcast = 130.0.255.255
2013-05-01 20:36:22.560: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:18 BEIST 2012 [ 1139710 ] checkIf:start for if=en10
Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] IsIfAlive: start for if=en10
Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] defaultgw: started
2013-05-01 20:36:22.560: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:18 BEIST 2012 [ 1139710 ]defaultgw: completed withxxx.xxx.xxx.254
Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] About to execute com mand:/usr/sbin/ping
-S xxx.xxx.xxx.2 -c 1 -w 1 xxx.xxx.xxx.254
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:20 BEIST 2012 [ 1139710 ] About toexecute command: /usr/sbin/ping -S
xxx.xxx.xxx.2 -c 1 -w 1 xxx.xxx.xxx.254
Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] IsIfAlive: RX packets checked if=en10 failed
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:22 BEIST 2012 [ 1139710 ] Interfaceen10 checked failed (host=xxxxdb02)
Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] IsIfAlive: end for if=en10
Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] checkIf: end for if=en10
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:22 BEIST 2012 [ 1139710 ] PerformingCRS_STAT testing
Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] Completed CRS_STAT testing
Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] Completed second gateway test
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:22 BEIST 2012 [ 1139710 ] Interfacetests Invalid parameters, or failed to bring upVIP (host=xxxxdb02)
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:
clsrcexecut: envORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs_1
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:
clsrcexecut: cmd =/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=5 54
/oracle/product/10.2.0/crs_1/bin/racgvipcheck xxxxdb02
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:
clsrcexecut: rc = 1, time = 4.304s
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:
clsrcaction: restyp=0 act_typ=2 stat=1
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:
clsrcaction: init 0.000s
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:
clsrcaction: action failed, 4.438s
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:
clsrcaction: post 0.000s
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:
clsrcaction: all 4.438s
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:end for
resource = ora.xxxxdb02.vip, action =check, status = 1, time = 4.442s
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:clsrccln:
exiting ora.xxxxdb02.vip refcount=1
2013-05-01 20:36:22.561: [ RACG][1] [1291828][1][ora.xxxxdb02.vip]:
clsrcprsrgter:gctx->prsrcfgref_clsrcgctx = 0
2013-05-01 20:36:22.664: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcprsrgini:gctx->prsrcfgref_clsrcgctx = 0
2013-05-01 20:36:22.664: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcini_ext: starting ora.xxxxdb02.viprefcount=1 global
2013-05-01 20:36:22.665: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:begin for
resource = ora.xxxxdb02.vip, action = stop
2013-05-01 20:36:22.902: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrccssgetnodename: all 0.238s
2013-05-01 20:36:22.902: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcnodeapps: calling FAILSRVSA
2013-05-01 20:36:22.907: [ RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:
clsrcrundetach: cmd =/oracle/product/10.2.0/crs_1/bin/racgmain ora.xxxxdb02.vip
rundetach 1 failsrvsa xxxxdb02, rc = 0, time = 0.005s
2013-05-01 20:36:22.908: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcnodeapps: Posting PNWDOWN_EVENT
2013-05-01 20:36:22.908: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrccssgetnodename: all 0.000s
2013-05-01 20:36:22.908: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrccssgetnodename: all 0.000s
2013-05-01 20:36:22.908: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcssgetrhost: using cached local hostname
2013-05-01 20:36:22.909: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrccssgetlhost: all 0.001s
2013-05-01 20:36:22.935: [ RACG][1] [889364][1][ora.xxxxdb02.vip]:
clsrcprsrgini:gctx->prsrcfgref_clsrcgctx = 0
2013-05-01 20:36:22.935: [ RACG][1] [889364][1][ora.xxxxdb02.vip]:
clsrcini_ext: starting ora.xxxxdb02.viprefcount=1 global
2013-05-01 20:36:22.935: [ RACG][1] [889364][1][ora.xxxxdb02.vip]:begin for
resource = ora.xxxxdb02.vip, action =rundetach
2013-05-01 20:36:22.964: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = RESTART_COUNT=0
2013-05-01 20:36:22.982: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = STATE=ONLINE on xxxxdb02
2013-05-01 20:36:22.982: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf =
2013-05-01 20:36:22.982: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = NAME=ora.xxxxdb.xxxxdb1.xxxxdb2.srv
2013-05-01 20:36:22.982: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = FAILURE_THRESHOLD=0
2013-05-01 20:36:22.982: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = TARGET=ONLINE
2013-05-01 20:36:22.982: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = STATE=ONLINE on xxxxdb02
2013-05-01 20:36:22.982: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf =
2013-05-01 20:36:22.982: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = RESTART_COUNT=0
2013-05-01 20:36:22.982: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = FAILURE_THRESHOLD=0
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = TARGET=ONLINE
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = STATE=ONLINE on xxxxdb02
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = RESTART_ATTEMPTS=5
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = NAME=ora.xxxxdb02.ons
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = TYPE=application
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = RESTART_ATTEMPTS=3
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = RESTART_COUNT=0
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = FAILURE_THRESHOLD=0
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = TARGET=ONLINE
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = STATE=ONLINE on xxxxdb02 15
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf =
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = NAME=ora.xxxxdb02.vip
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = TYPE=application
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = RESTART_ATTEMPTS=0
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = RESTART_COUNT=0
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = FAILURE_THRESHOLD=0
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = TARGET=ONLINE
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = STATE=ONLINE on xxxxdb02
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf =
2013-05-01 20:36:22.983: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcqryapi:
resname = NULL, host = xxxxdb02, time =0.007s
2013-05-01 20:36:23.075: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcpostevt: EvmEventPost 1 0.000s
2013-05-01 20:36:23.075: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcposthaevt: forward to EVM 0.166s
2013-05-01 20:36:23.080: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcnodeapp: prsr num_env = 0
2013-05-01 20:36:23.080: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcnodeapp: settingORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs_1
2013-05-01 20:36:23.111: [ RACG][1] [889364][1][ora.xxxxdb02.vip]:
clsrcpostevt: EvmEventPost 1 0.008s
2013-05-01 20:36:23.114: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]: clsrcstatcb:
buf = NAME=ora.xxxxdb02.vip
2013-05-01 20:36:23.114: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = TYPE=application
2013-05-01 20:36:23.114: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = RESTART_ATTEMPTS=0
2013-05-01 20:36:23.114: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = RESTART_COUNT=0
2013-05-01 20:36:23.114: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = FAILURE_THRESHOLD=0
2013-05-01 20:36:23.114: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = TARGET=ONLINE
2013-05-01 20:36:23.114: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf = STATE=ONLINE on xxxxdb02
2013-05-01 20:36:23.115: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:
buf =
2013-05-01 20:36:23.115: [ RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcqryapi:
resname = ora.xxxxdb02.vip, host = NULL,time = 0.004s
2013-05-01 20:36:23.284: [ RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:23 BEIST 2012 [ 921812 ] Checkinginterface existance
Tue May 1 20:36:23 BEIST 2012 [ 921812 ] Calling getifbyip
Tue May 1 20:36:23 BEIST 2012 [ 921812 ] getifbyip: started for xxx.xxx.xxx.4
2013-05-01 20:36:23.284: [ RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:23 BEIST 2012 [ 921812 ] getifbyip:checking if failover is happening ()
Tue May 1 20:36:23 BEIST 2012 [ 921812 ] getifbyip: failover i s not happening()
Tue May 1 20:36:23 BEIST 2012 [ 921812 ] Completed getifbyip
2013-05-01 20:36:23.284: [ RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:Tue May 1
20:36:23 BEIST 2012 [ 921812 ] Completedwith initial interface test
2013-05-01 20:36:23.284: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcexecut: envORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs_1
2013-05-01 20:36:23.284: [ RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:
clsrcexecut: cmd =/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=5 54
/oracle/product/10.2.0/crs_1/bin/racgvipstop xxxxdb02
2013-05-01 20:36:23.284: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcexecut: rc = 0, time = 0.204s
2013-05-01 20:36:23.284: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcposthaevt: reason = failure
2013-05-01 20:36:23.285: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:clsrccln:
exiting ora.xxxxdb02.vip refcount=1
2013-05-01 20:36:23.286: [ RACG][1] [1115450][1][ora.xxxxdb02.vip]:
clsrcprsrgter:gctx->prsrcfgref_clsrcgctx = 0
解决方式
依据分析结果,我们觉得,10.2.0.5中CRS
对网络过于敏感,出现网络延时
的时候会对数据库集群产生较大影响。针对眼下的情况。我们建议例如以下:
一、
详查网络问题,极偶然的丢包或延时在网络层面也属于常见情况。
有可能是线缆问题。也可能是交换机、server网卡、网络配置等
问题,须要具体检查网络情况。
二、
改动过于敏感的CRS 配置,将发包超时设置为3秒( 10.2.0.5之前的
值):
改动$ORA_CRS_HOME/bin/racgvip
脚本例如以下部分
# timeout of ping in number of loops
PING_TIMEOUT=" -c 1 -w 1"
改动成例如以下内容:
# timeout of ping in number of loops
PING_TIMEOUT=" -c 1 -w 3"
三、
因为Bug 6955040是VIP异常后被触发。眼下优先解决VIP异常问
题。该Bug
能够忽略。
版权声明:本文博主原创文章,博客,未经同意不得转载。
经纪xx系统节点VIP案例介绍和深入分析异常的更多相关文章
- 集群监控系统Ganglia应用案例
集群监控系统Ganglia应用案例 --我们把集群系统投入生产环境后,这时就需要一套可视化的工具来监视集群系统,这将有助于我们迅速地了解机群的整体配置情况,准确地把握机群各个监控节点的信息,全面地察看 ...
- (转) SLAM系统的研究点介绍 与 Kinect视觉SLAM技术介绍
首页 视界智尚 算法技术 每日技术 来打我呀 注册 SLAM系统的研究点介绍 本文主要谈谈SLAM中的各个研究点,为研究生们(应该是博客的多数读者吧)作一个提纲挈领的摘要.然后,我 ...
- xx系统属性分析
在本周的课程学习当中,我们简单了解到系统的一些属性,同时在课下也对<大型网站技术架构:核心原理与案例分析>进行了初步的阅读. 在书籍中我看到了许多其他的知识,也对课堂学习的知识有了巩固,现 ...
- 25 Zabbix系统数据表结构介绍
点击返回:自学Zabbix之路 点击返回:自学Zabbix4.0之路 点击返回:自学zabbix集锦 25 Zabbix系统数据表结构介绍 自学Zabbix之路15.1 Zabbix数据库表结构简单解 ...
- 3星|《AI极简经济学》:AI的预测、决策、战略等方面的应用案例介绍
AI极简经济学 主要内容是AI的各种应用案例介绍.作者把这些案例分到五个部分介绍:预测.决策.工具.战略.社会. 看书名和介绍以为会从经济学的角度解读AI,有更多的新鲜的视角和观点,读后比较失望,基本 ...
- kafka关于修改副本数和分区的数的案例实战(也可用作leader节点均衡案例)
kafka关于修改副本数和分区的数的案例实战(也可用作leader节点均衡案例) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.关于topic分区数的修改 1>.创建1分 ...
- Asp.net博客系统收集和简单介绍
国内Asp.net博客系统收集和简单介绍 [转载文章,仅供个人参考,引自http://www.soyaoo.com/Blog/post/92.html] 1.ZJ-Blog程序简介:基于A ...
- 针对XX系统的可用性方面的相关想法(结合书)
在开始对此系统进行再次分析之前,再回顾下可用性.首先,可用性是与系统故障有关的一个质量属性,是指系统正常运行的时间的比例,一般通过两次故障之间的时间长度或在系统崩溃情况下能恢复正常运行的速度来衡量,同 ...
- 针对XX系统的可用性和易用性构想
可用性是与系统故障有关的一个质量属性,是指系统正常运行的时间的比例,一般通过两次故障之间的时间长度或在系统崩溃情况下能恢复正常运行的速度来衡量,同时此概念涉及一个公式的计算,就是系统正常运行时间的百分 ...
随机推荐
- linux su,sudo命令
linux su 命令 建议大家切换用户的时候 使用 su - root 这样,否则可能发现某些命令执行不了 关于su .su - 及 sudo的区别 请往下看 1.命令作用 su的作用是变更为其它使 ...
- wp天气预报
ak url http://developer.baidu.com/map/index.php?title=car 后台cs using System; using System.Collectio ...
- PHP草根论之设计模式-訪问者模式
关于模式本身的概念,请參考网上其他文章 此处仅仅讨论在PHP实际开发过程中的应用 此模式适用范围极为受限,适用情景: 1.适用于项目维护过程,不适用于项目开发过程 2.新增需求,要求为一个/多个类添加 ...
- java注解(一)
虽然平时有使用注解,不过没有深入了解,今天无聊,重新从基础深入了解整理下: java注解是附加在代码中的一些元信息,用于一些工具在编译.运行时进行解析和使用,起到说明.配置的功能. 注解不会也 ...
- New Hire Training Experience
Game Description: 1. In a closed door, there will be a circle, and 30 numbers in the circle. 2. Each ...
- abstract修饰方法总结
abstract这种方法修饰,主要用在抽象类和抽象方法. 抽象的类是不可实例化的比如 public abstract class Test{ } 他能够含有抽象的方法 public abstract ...
- Python3.4 邮件(包含附件与中国)
import smtplib import os from email.mime.text import MIMEText from email.mime.multipart import MIMEM ...
- MIT 操作系统实验 MIT JOS lab2
MIT JOS lab2 首先把内存分布理清楚,由/boot/main.c可知这里把kernel的img的ELF header读入到物理地址0x10000处 这里能够回想JOS lab1的一个小问.当 ...
- 【原创】leetCodeOj --- Copy List with Random Pointer 解题报告
题目地址: https://oj.leetcode.com/problems/copy-list-with-random-pointer/ 题目内容: A linked list is given s ...
- linux下改动内核參数进行Tcp性能调优 -- 高并发
前言: Tcp/ip协议对网络编程的重要性,进行过网络开发的人员都知道,我们所编写的网络程序除了硬件,结构等限制,通过改动Tcp/ip内核參数也能得到非常大的性能提升, 以下就列举一些Tcp/ip内核 ...