ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
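In the early morning I got a call from a colleague reporting that the application was getting errors when accessing the Oracle database. The symptoms confirmed on site at the time were:
1. The application could not reach the database. A test with SQL Developer failed as well, reporting ORA-12570 TNS:packet reader failure.
2. lsnrctl status, run to check the listener, never returned. This is very rare.
3. The database status was OPEN. nmon showed that CPU utilization was not high but CPU Wait% was very high. That points to abnormal I/O, probably I/O waits and contention.
(CPU Wait%: the share of the sampling interval during which the CPUs were idle while waiting for I/O to complete. Wait% is a form of idle time: when a CPU is idle while some process is in the D state (uninterruptible sleep), that time is counted toward Wait%. Wait% is a ratio, not an amount of time, so for the same amount of I/O wait, the more CPUs a server has, the lower Wait% is. It reflects the balance between I/O and computation; I/O-intensive applications generally show a higher Wait%.)
4. My mailbox held a large number of emails sent by the job that monitors the alert log. The alert log itself contained many ORA errors, for example: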
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
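When the alert log repeats the same error many times, it helps to confirm that every occurrence names the same holder. The following is a minimal Python sketch, not part of the original troubleshooting; the function name summarize_cf_blockers is hypothetical, and the pattern assumes the message layout quoted above.

```python
import re
from collections import Counter

# Matches the two control file enqueue errors quoted above and captures
# the instance number and OS process id of the reported holder.
CF_ERROR = re.compile(r"(ORA-00494|ORA-00239):.*?'inst (\d+), osid (\d+)'")


def summarize_cf_blockers(lines):
    """Count CF enqueue timeout errors per (error code, instance, osid)."""
    counts = Counter()
    for line in lines:
        match = CF_ERROR.search(line)
        if match:
            code = match.group(1)
            inst = int(match.group(2))
            osid = int(match.group(3))
            counts[(code, inst, osid)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "ORA-00494: enqueue [CF] held for too long (more than 900 seconds) "
        "by 'inst 1, osid 5166'",
        "ORA-00239: timeout waiting for control file enqueue: held by "
        "'inst 1, osid 5166' for more than 900 seconds",
    ]
    for key, count in summarize_cf_blockers(sample).items():
        print(key, count)
```

If every key carries the same osid, as in this incident (osid 5166), a single process is holding the CF enqueue for the whole period.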
Let's first look at the description of "ORA-00494: enqueue [CF] held for too long (more than 900 seconds)...":
[oracle@DB-Server ~]$ oerr ora 494
00494, 00000, "enqueue %s held for too long (more than %s seconds) by 'inst %s, osid %s'"
// *Cause: The specified process did not release the enqueue within
// the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
// Services with the incident information.
ORA-00494 is typically followed by an instance crash. See the Oracle Support note Database Crashes With ORA-00494 (Doc ID 753290.1):
This error can also be accompanied by ORA-600 [2103] which is basically the same problem - a process was unable to obtain the CF enqueue within the specified timeout (default 900 seconds).
This behavior can be correlated with server high load and high concurrency on resources, IO waits and contention, which keep the Oracle background processes from receiving the necessary resources.
Cause#1: The lgwr has killed the ckpt process, causing the instance to crash.
From the alert.log we can see:
The database has waited too long for a CF enqueue, so the next error is reported:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 38356'
Then the LGWR killed the blocker, which was in this case the CKPT process which then causes the instance to crash.
Checking the alert.log further we can see that the frequency of redo log files switch is very high (almost every 1 min).
Cause#2: Checking the I/O State in the AWR report we find that:
Average Read per ms (Av Rd(ms)) for the database files which are located on this mount point " /oracle/oa1l/data/" is facing I/O issue as per the data collection which was perform
Cause#3: The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'
and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'
The ORA-00494 error occurs during periods of super-high stress, activity to the point there the server becomes unresponsive due to overloaded disk I/O, CPU or RAM.
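Based on the analysis above, all three causes are possible, but more information and evidence are needed to determine which one actually caused the ORA-00494 error and the instance crash.
1: The alert log contains "ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'". CF stands for the Control file schema global enqueue. If a process cannot obtain the CF enqueue within the specified time (900 seconds by default), the process holding the CF enqueue is killed. The timeout is set by the hidden parameter _controlfile_enqueue_timeout: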
SQL> COL NAME FOR A45 ;
SQL> COL VALUE FOR A32 ;
SQL> COL DESCR FOR A80 ;
SQL> SELECT x.ksppinm NAME
2 , y.ksppstvl VALUE
3 , x.ksppdesc DESCR
4 FROM SYS.x$ksppi x, SYS.x$ksppcv y
5 WHERE x.indx = y.indx
6 AND x.ksppinm LIKE '%&par%';
Enter value for par: controlfile_enqueue
old 6: AND x.ksppinm LIKE '%&par%'
new 6: AND x.ksppinm LIKE '%controlfile_enqueue%'
NAME VALUE DESCR
--------------------------------------------- -------------------------------- --------------------------------------------------------------------------------
_controlfile_enqueue_timeout 900 control file enqueue timeout in seconds
_controlfile_enqueue_holding_time 120 control file enqueue max holding time in seconds
_controlfile_enqueue_dump FALSE dump the system states after controlfile enqueue timeout
_kill_controlfile_enqueue_blocker TRUE enable killing controlfile enqueue blocker on timeout
SQL>
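The redo log switch frequency (query below) was very low both from 00:00 to 02:00 on 2016-11-09 and from 22:00 to 24:00 on 2016-11-08, which rules out heavy DML activity. Even so, the analysis so far does not let us rule out Cause#1 completely, so we continue with the other evidence.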
SELECT
TO_CHAR(FIRST_TIME,'YYYY-MM-DD') DAY,
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'00',1,0)),'99') "00",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'01',1,0)),'99') "01",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'02',1,0)),'99') "02",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'03',1,0)),'99') "03",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'04',1,0)),'99') "04",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'05',1,0)),'99') "05",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'06',1,0)),'99') "06",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'07',1,0)),'99') "07",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'08',1,0)),'99') "08",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'09',1,0)),'99') "09",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'10',1,0)),'99') "10",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'11',1,0)),'99') "11",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'12',1,0)),'99') "12",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'13',1,0)),'99') "13",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'14',1,0)),'99') "14",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'15',1,0)),'99') "15",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'16',1,0)),'99') "16",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'17',1,0)),'99') "17",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'18',1,0)),'99') "18",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'19',1,0)),'99') "19",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'20',1,0)),'99') "20",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'21',1,0)),'99') "21",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'22',1,0)),'99') "22",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'23',1,0)),'99') "23"
FROM
V$LOG_HISTORY
GROUP BY
TO_CHAR(FIRST_TIME,'YYYY-MM-DD')
ORDER BY 1 DESC;
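The query pivots V$LOG_HISTORY.FIRST_TIME into one row per day with 24 hourly counters. The same bucketing can be sketched outside the database in Python, for example when only an export of the switch timestamps is available. The function name switches_per_hour and the sample timestamps are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def switches_per_hour(first_times):
    """Map 'YYYY-MM-DD' to a list of 24 per-hour log switch counts."""
    table = defaultdict(lambda: [0] * 24)
    for ts in first_times:
        table[ts.strftime("%Y-%m-%d")][ts.hour] += 1
    return dict(table)


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    sample = [
        datetime(2016, 11, 8, 22, 15),
        datetime(2016, 11, 8, 23, 40),
        datetime(2016, 11, 9, 0, 5),
    ]
    for day, hours in sorted(switches_per_hour(sample).items(), reverse=True):
        print(day, hours)
```

A handful of switches per hour, as seen in this incident, argues against a redo-heavy workload; a switch roughly every minute, as in the Oracle Support note's Cause#1, would show up as counts near 60.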
2: Regarding Cause#3: The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'
and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'
The alert log contains ORA-00239 but no errors such as ORA-603 or ORA-00470. According to the Oracle Support note Disk I/O Contention/Slow Can Lead to ORA-239 and Instance Crash (Doc ID 1068799.1):
I/O contention or slowness leads to control file enqueue timeout.
One particular situation that can be seen is LGWR timeout while waiting for control file enqueue, and the blocker is CKPT :
From the AWR:
1) high "log file parallel write" and "control file sequential read" waits
2) Very slow Tablespace I/O, Av Rd(ms) of 1000-4000 ms (when lower than 20 ms is acceptable)
3) very high %iowait : 98.57%.
4) confirmed IO peak during that time
Please note: Remote archive destination is also a possible cause. Networking issues can also cause this type of issue when a remote archive destination is in use for a standby database.
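This server had been running normally for many years, so we lean toward an I/O problem rather than an upgrade-related bug. Together with the very high CPU Wait% at the time, this suggests severe I/O waits and contention.
3: Next, the reports generated for this period by the monitoring tool OSWatcher. CPU resources were largely idle: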
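As a rough illustration of how a Wait% figure is derived on Linux, the sketch below computes the I/O wait share from two samples of the aggregate cpu line in /proc/stat. This is an assumption about the mechanism, not the actual implementation of nmon or OSWatcher; the function names and sample counters are hypothetical.

```python
def parse_cpu_line(line):
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Order: user nice system idle iowait irq softirq steal.
    # Guest time is already included in user/nice, so it is ignored here.
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Share of elapsed CPU ticks spent idle while waiting for I/O."""
    start = parse_cpu_line(before)
    end = parse_cpu_line(after)
    deltas = [e - s for s, e in zip(start, end)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


if __name__ == "__main__":
    # Hypothetical samples taken a few seconds apart.
    first = "cpu 100 0 50 800 50 0 0 0 0 0"
    second = "cpu 110 0 60 820 110 0 0 0 0 0"
    print(round(iowait_percent(first, second), 2))
```

Because the denominator is the sum of ticks over all CPUs, the same amount of I/O wait yields a smaller percentage on a machine with more CPUs, which is the point made in the note on CPU Wait% earlier.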
Operating System CPU Utilization
CPU time spent waiting for I/O (Wait IO) also rises after 10:45 PM (22:45). CPU utilization stays low throughout, peaking at a little over 20%.
Operating System CPU Other
Now Operating System I/O: starting at 11 PM the system's I/O devices are extremely busy. From this we can conclude that an I/O anomaly very likely caused the ORA-00494 errors.
Operating System I/O
Operating System I/O Throughput
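Then we check the operating system log:
The messages "INFO: task kjournald:xxx blocked for more than 120 seconds." begin at 23:22 and then appear in large numbers. They were caused by a PlateSpin replication job (later confirmed to have started at 22:40). At this point we lean toward Cause#2 as the reason for the instance crash. The system administrator later confirmed that the PlateSpin replication job itself had also failed. All things considered, we strongly suspect that the PlateSpin job caused the I/O anomaly, and that the database instance crashed when I/O stopped responding, whether briefly or for a long time.
Follow-up and resolution
At this point shutdown immediate could not shut the database down; it did not respond at all. The only option was shutdown abort followed by starting the instance, but startup failed with the following errors: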
ORA-01102: cannot mount database in EXCLUSIVE mode
ORA-00205: error in identifying control file, check alert log for more info
ORA-00202: control file: '/u01/app/oracle/oradata/epps/control01.ctl'
ORA-27086: unable to lock file - already in use
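This error is not covered in detail here; see ORA-01102: cannot mount database in EXCLUSIVE mode. After killing most of the leftover processes, three processes remained that kill -9 could not terminate.
kill -9 sends SIGKILL to terminate a process, but it has no effect in two cases:
a. The process is a zombie (shown as defunct by ps). It has already released all its resources but has not yet been reaped by its parent. A zombie stays in the process table until its parent reaps it, or until the parent exits and init reaps it, but it does not affect system performance.
b. The process is in kernel mode, waiting on a resource that is not available (uninterruptible sleep, state D). A process in this state does not handle signals until the wait ends, so if the resource never becomes available, the only way to get rid of the process is to reboot the system. A Linux process runs in either user mode or kernel mode; a signal such as SIGKILL only takes effect once the process leaves the uninterruptible wait.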
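To tell which of the two cases applies to a process that survives kill -9, its state letter can be read from /proc/<pid>/stat on Linux. The sketch below is a minimal illustration under that assumption; the function names and the sample stat lines are hypothetical.

```python
def parse_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so the state is the first field after the LAST ')'.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    return rest[0]


def sigkill_has_no_immediate_effect(stat_text):
    """True for zombies (Z) and processes in uninterruptible sleep (D)."""
    return parse_state(stat_text) in ("D", "Z")


def read_stat(pid):
    """Read /proc/<pid>/stat for a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical stat lines, truncated to the leading fields.
    blocked = "5166 (oracle) D 1 5166 5166 0 -1"
    defunct = "6001 (my (odd) proc) Z 5166 6001 6001 0 -1"
    print(parse_state(blocked), sigkill_has_no_immediate_effect(blocked))
    print(parse_state(defunct), sigkill_has_no_immediate_effect(defunct))
```

A state of D that never changes between samples is consistent with case b; a state of Z is case a.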
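These processes were stuck in kernel mode, unlikely to wake up on their own, and not accepting signals, so there was no choice but to reboot the system. After the reboot the problem was resolved. We then agreed with the system administrator to suspend the PlateSpin job for now, take a new full backup on Sunday, and continue to watch the I/O impact.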