ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
凌晨收到同事电话,反馈应用程序访问Oracle数据库时报错,当时现场现象确认:
1. 应用程序访问不了数据库,使用SQL Developer测试发现访问不了数据库。报ORA-12570 TNS:packet reader failure
2. 使用lsnrctl status检查监听,一直没有响应,这个是极少见的情况。
3. 检查数据库状态为OPEN,使用nmon检查系统资源。如下一张截图所示,CPU利用率不高,但是CPU Wait%非常高。这意味着I/O不正常。可能出现了IO等待和争用(IO waits and contention)
CPU Wait%:显示采集间隔内所有CPU处于空闲且等待I/O完成的时间比例,Wait%是CPU空闲状态的一种,当CPU处于空闲状态而又有进程处于D状态(不可中断睡眠)时,系统会统计这时的时间,并计算到Wait%里,Wait%不是一个时间值,而是时间的比例,因此在同样I/O Wait时间下,服务器CPU越多,Wait%越低,它体现了I/O操作与计算操作之间的比例。对I/O密集型的应用来说一般Wait%较高.)
4.打开邮件发现收到大量的监控告警日志作业发出的邮件,检查告警日志,发现里面有大量ORA错误信息,部分内容如下:
3 | | ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
10 | | ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
17 | | ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
24 | | ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
31 | | ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
38 | | ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
41 | | ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
48 | | ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
关于“ORA-00494: enqueue [CF] held for too long (more than 900 seconds).....”这个错误,我们先看看这个错误的相关描述:
[oracle@DB-Server ~]$ oerr ora 494
00494, 00000, "enqueue %s held for too long (more than %s seconds) by 'inst %s, osid %s'"
// *Cause: The specified process did not release the enqueue within
// the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
// Services with the incident information.
出现ORA-00494 意味这Instance Crash了,可以参考官方文档 Database Crashes With ORA-00494 (文档 ID 753290.1):
This error can also be accompanied by ORA-600 [2103] which is basically the same problem - a process was unable to obtain the CF enqueue within the specified timeout (default 900 seconds).
This behavior can be correlated with server high load and high concurrency on resources, IO waits and contention, which keep the Oracle background processes from receiving the necessary resources.
Cause#1: The lgwr has killed the ckpt process, causing the instance to crash.
From the alert.log we can see:
The database has waited too long for a CF enqueue, so the next error is reported:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 38356'
Then the LGWR killed the blocker, which was in this case the CKPT process which then causes the instance to crash.
Checking the alert.log further we can see that the frequency of redo log files switch is very high (almost every 1 min).
Cause#2: Checking the I/O State in the AWR report we find that:
Average Read per ms (Av Rd(ms)) for the database files which are located on this mount point " /oracle/oa1l/data/" is facing I/O issue as per the data collection which was perform
Cause#3: The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'
and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'
The ORA-00494 error occurs during periods of super-high stress, activity to the point there the server becomes unresponsive due to overloaded disk I/O, CPU or RAM.
从上面分析看,这三种原因都存在可能性。但是需要跟多的信息和证据来确认到底是什么原因导致ORA-00494错误, 以至数据库实例Crash。
1:告警日志里面有“ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'” 错误,CF指Control file schema global enqueue。如果一个进程在指定的时间(默认900秒)内无法获得CF锁,则CF锁的执行进程会被kill。这个参数为_controlfile_enqueue_timeout
SQL> COL NAME FOR A45 ;
SQL> COL VALUE FOR A32 ;
SQL> COL DESCR FOR A80 ;
SQL> SELECT x.ksppinm NAME
2 , y.ksppstvl VALUE
3 , x.ksppdesc DESCR
4 FROM SYS.x$ksppi x, SYS.x$ksppcv y
5 WHERE x.indx = y.indx
6 AND x.ksppinm LIKE '%&par%';
Enter value for par: controlfile_enqueue
old 6: AND x.ksppinm LIKE '%&par%'
new 6: AND x.ksppinm LIKE '%controlfile_enqueue%'
NAME VALUE DESCR
--------------------------------------------- -------------------------------- --------------------------------------------------------------------------------
_controlfile_enqueue_timeout 900 control file enqueue timeout in seconds
_controlfile_enqueue_holding_time 120 control file enqueue max holding time in seconds
_controlfile_enqueue_dump FALSE dump the system states after controlfile enqueue timeout
_kill_controlfile_enqueue_blocker TRUE enable killing controlfile enqueue blocker on timeout
SQL>
检查redo log的切换频率,发现在2016-11-09 零点到2点,以及2016-11-08 22:00~ 24:00的redo log 切换频率都很低。排除有大量DML操作的可能性, 根据以上一些分析,我们还不能完全排除Cause#1。我们接着分析其他信息
SELECT
TO_CHAR(FIRST_TIME,'YYYY-MM-DD') DAY,
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'00',1,0)),'99') "00",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'01',1,0)),'99') "01",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'02',1,0)),'99') "02",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'03',1,0)),'99') "03",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'04',1,0)),'99') "04",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'05',1,0)),'99') "05",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'06',1,0)),'99') "06",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'07',1,0)),'99') "07",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'08',1,0)),'99') "08",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'09',1,0)),'99') "09",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'10',1,0)),'99') "10",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'11',1,0)),'99') "11",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'12',1,0)),'99') "12",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'13',1,0)),'99') "13",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'14',1,0)),'99') "14",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'15',1,0)),'99') "15",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'16',1,0)),'99') "16",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'17',1,0)),'99') "17",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'18',1,0)),'99') "18",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'19',1,0)),'99') "19",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'20',1,0)),'99') "20",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'21',1,0)),'99') "21",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'22',1,0)),'99') "22",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'23',1,0)),'99') "23"
FROM
V$LOG_HISTORY
GROUP BY
TO_CHAR(FIRST_TIME,'YYYY-MM-DD')
ORDER BY 1 DESC;
2:关于 The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'
and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'
告警日志里面出现ORA-00239,但是没有出现ORA-603、ORA-00470之类的错误。按照官方文档Disk I/O Contention/Slow Can Lead to ORA-239 and Instance Crash (文档 ID 1068799.1)
I/O contention or slowness leads to control file enqueue timeout.
One particular situation that can be seen is LGWR timeout while waiting for control file enqueue, and the blocker is CKPT :
From the AWR:
1) high "log file parallel write" and "control file sequential read" waits
2) Very slow Tablespace I/O, Av Rd(ms) of 1000-4000 ms (when lower than 20 ms is acceptable)
3) very high %iowait : 98.57%.
4) confirmed IO peak during that time
Please note: Remote archive destination is also a possible cause. Networking issues can also cause this type of issue when a remote archive destination is in use for a standby database.
这台服务器已经正常运行了很多年,所以我们更倾向是IO问题导致。结合当时CPU Wait%非常高。这意味着可能出现了严重的IO等待和争用(IO waits and contention)
3:我们来看看监控工具OSWather生成这段时间的一些报告,如下,CPU资源非常空闲
Operating System CPU Utilization
CPU等待IO资源(Wait IO)也是从10:45 PM(22:45)之后变大。CPU利用率一直不高,最多20%多的样子。
Operating System CPU Other
然后,我们看看Operating System I/O吧,如下截图所示,可以看出在11点开始,系统IO设备非常繁忙 由此我们可以判断IO异常导致数据库出现ORA-00494错误的可能性很大。
Operating System I/O
Operating System I/O Throughput
然后我们检查一下操作系统的日志,如下所示:
如下截图所示,“INFO: task kjournald:xxx blocked for more than 120 seconds.”从23:22开始,在这之前,出现大量这类日志信息。这个是因为PlateSpin的作业复制导致(后面确认该作业在22:40启动)。所以至此,我们更倾向是因为第二个源于引起数据库Instance Crash。后面和系统管理员确认,PlateSpin的复制作业也是失败了。所以种种分析,非常怀疑是PlateSpin的作业引起了IO异常。而IO发生短暂或长时间停止响应的时候,就导致数据库实例崩溃。
后续处理解决
此时使用shutdown immediate关闭不了数据库,没有任何响应。只能shutdown abort,然后启动数据库实例,但是在startup时出现异常,报下面一些错误
ORA-01102: cannot mount database in EXCLUSIVE mode
ORA-00205: error in identifying control file, check alert log for more info
ORA-00202: control file: '/u01/app/oracle/oradata/epps/control01.ctl'
ORA-27086: unable to lock file - already in use
关于这个错误,此处不做展开,可以参考ORA-01102: cannot mount database in EXCLUSIVE mode,kill掉大部分进程后,发现有三个进程使用kill -9 kill不掉,如下截图所示:
kill -9发送SIGKILL信号将其终止,但是以下两种情况不起作用:
a、该进程处于”Zombie”状态(使用ps命令返回defunct的进程)。此时进程已经释放所有资源,但还未得到其父进程的确认。”Zombie”进程要等到下次重启时才会消失,但它的存在不会影响系统性能。
b、 该进程处于”kernel mode”(核心态)且在等待不可获得的资源。处于核心态的进程忽略所有信号处理,因此对于这些一直处于核心态的进程只能通过重启系统实现。进程在Linux中会处于两种状态,即用户态和核心态。只有处于用户态的进程才可以用“kill”命令将其终止。
由于这些进程已经陷入核心态,而且很难自动唤醒,又不接受信号指令。不得已只能reboot系统了。 重启后问题解决。后面和系统管理员协商暂时停用PlateSpin作业,待周日重新做一个完整备份后,继续观察IO影响。
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'的更多相关文章
- 记录一则ASM实例阻塞,rbal进程异常的案例
1.故障现象描述 2.确认故障现象 3.排查ASM层面 4.解决问题 1.故障现象描述 环境:AIX 7.1 + Standalone Oracle 11.2.0.4 现象:客户反映某11g版本的AD ...
- Oracle 性能之 Enq: CF - contention
Oracle 性能之 Enq: CF - contention Table of Contents 1. 原因 2. 解决问题 2.1. 针对持有锁进程类型处理 2.1.1. 查看持有锁会话的进程类型 ...
- Oracle11g版本中未归档隐藏参数
In this post, I will give a list of all undocumented parameters in Oracle 11g. Here is a query to se ...
- ORA-01591 锁定已被有问题的分配事务处理--解决方法(转)
转载自love wife & love life —Roger 的Oracle技术博客 本文链接地址: ORA-01591: lock held by in-doubt distributed ...
- Oracle EBS R12的启停脚本
以下脚本用root用户登录执行: 一.DB启停使用EBS提供的脚本ebs_start.shsu - oraprod -c "/d01/oracle/PROD/db/tech_st/10.2. ...
- REORG TABLESPACE on z/os
这个困扰了我两天的问题终于解决了,在运行这个job时:总是提示 A REQUIRED DD CARD OR TEMPLATE IS MISSING NAME=SYSDISC A REQUIRED DD ...
- Centos6.5里安装Hbase(伪分布式)
首先我们到官方网站下载Hbase,而我使用的版本是hbase-0.94.27.tar.gz 解压下来: tar zxvf hbase-.tar.gz 寻找java安装路径 [root@localhos ...
- hbase 使用
hbase shell命令的使用 再使用hbase 命令之前先检查一下hbase是否运行正常 hadoop@Master:/usr/hbase/bin$ jps HMaster NameNode Se ...
- Hbase学习记录(2)| Shell操作
查看表结构 describe '表名' 查看版本 get '表名','zhangsan'{COLUMN=>'info:age',VERSIONS=>3} 删除整行 deleteall '表 ...
随机推荐
- CSharpGL(9)解析OBJ文件并用CSharpGL渲染
CSharpGL(9)解析OBJ文件并用CSharpGL渲染 2016-08-13 由于CSharpGL一直在更新,现在这个教程已经不适用最新的代码了.CSharpGL源码中包含10多个独立的Demo ...
- Java总结篇系列:Java泛型
一. 泛型概念的提出(为什么需要泛型)? 首先,我们看下下面这段简短的代码: 1 public class GenericTest { 2 3 public static void main(Stri ...
- 编程之美—烙饼排序问题(JAVA)
一.问题描述 星期五的晚上,一帮同事在希格玛大厦附近的"硬盘酒吧"多喝了几杯.程序员多喝了几杯之后谈什么呢?自然是算法问题.有个同事说:"我以前在餐 馆打工,顾 ...
- JavaScript权威设计--JavaScript函数(简要学习笔记十一)
1.函数调用的四种方式 第三种:构造函数调用 如果构造函数调用在圆括号内包含一组实参列表,先计算这些实参表达式,然后传入函数内.这和函数调用和方法调用是一致的.但如果构造函数没有形参,JavaScri ...
- ASP.NET MVC5+EF6+EasyUI 后台管理系统(45)-工作流设计-设计步骤
系列目录 步骤设计很重要,特别是规则的选择. 我这里分为几个规则 1.按自行选择(在起草时候自行选审批人,比较灵活) 2.按上级(无需指定,当时需要知道用户的上司是谁,可以在职位管理设置,或者在用户表 ...
- ASP.NET MVC5+EF6+EasyUI 后台管理系统(55)-工作流设计-表单布局
系列目录 前言:这一节比较有趣.基本纯UI,但是不是很复杂 有了实现表单的打印和更加符合流程表单方式,我们必须自定义布局来适合业务场景打印!我们想要什么效果?看下图 (我们没有布局之前的表单和设置布局 ...
- angularJS实用的开发技巧
一.开端 真的是忙完这一阵子就可以忙下一阵子了啊... 最近在做一个angularJS+Ionic的移动端项目...记录一些技巧,方便自己以后查阅,也方便需要的人可以看一看...^_^ 二.基础原则了 ...
- Linux内核启动过程概述
版权声明:本文原创,转载需声明作者ID和原文链接地址. Hi!大家好,我是CrazyCatJack.今天给大家带来的是Linux内核启动过程概述.希望能够帮助大家更好的理解Linux内核的启动,并且创 ...
- SELECT TOP 1 比不加TOP 1 慢的原因分析以及SELECT TOP 1语句执行计划预估原理
本文出处:http://www.cnblogs.com/wy123/p/6082338.html 现实中遇到过到这么一种情况: 在某些特殊场景下:进行查询的时候,加了TOP 1比不加TOP 1要慢(而 ...
- 3.JAVA之GUI编程Frame窗口
创建图形化界面思路: 1.创建frame窗体: 2.对窗体进行基本设置: 比如大小.位置.布局 3.定义组件: 4.将组件通过add方法添加到窗体中: 5.让窗体显示,通过setVisible(tur ...













