Revisiting the Oracle OPROCD Process

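Here is a list of key points about oprocd.

oprocd was introduced by Oracle in RAC to perform I/O fencing.

On Unix systems, oprocd exists only when no third-party (non-Oracle) cluster software is in use.

On Linux, oprocd exists only from version 10.2.0.4 onward.

On Windows there is no oprocd process. Instead, a service named OraFenceService provides the same function. It relies on Windows-specific technology and works differently from oprocd.

oprocd can run in one of two modes: fatal and non-fatal. In fatal mode, if the system hangs or oprocd is triggered for some other reason, oprocd reboots the server automatically. In non-fatal mode, under the same conditions, oprocd writes a warning to its log but does not reboot the system.

oprocd takes two parameters: timeout, which specifies the interval at which oprocd is woken up, and margin, which specifies the allowed time deviation. If the deviation exceeds margin, oprocd either reboots the system or records an error in its log.

The oprocd log files are located in /etc/oracle/oprocd or /var/opt/oracle/oprocd.

oprocd is spawned by the cssd process and runs as the root user.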

[root@node2 init.d]# ps -ef | grep oprocd

root      5109 11227  0 20:37 pts/0    00:00:00 grep oprocd

root      5758  4849  0 19:14 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oprocd

root      6084  5758  0 19:14 ?        00:00:00 /u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f

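If a node hangs for a long time, the other nodes in the cluster evict it. In that situation the hung node has to be rebooted so that I/O fencing is achieved. oprocd is configured with two parameters, timeout and margin. The process is woken up once every timeout interval. If the time elapsed between the current wakeup and the previous one exceeds timeout + margin, oprocd concludes that the node has hung and either reboots it automatically or writes a warning to its log.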
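The wakeup check described above can be sketched in a few lines of Python. This is only an illustration of the timeout + margin rule, not oprocd's actual implementation. The function names and the sample wakeup timestamps are made up for the example, and all times are assumed to be in milliseconds.

```python
def find_hangs(wake_times_ms, timeout_ms, margin_ms):
    """Return (previous, current, gap) for each wakeup gap above timeout + margin."""
    limit = timeout_ms + margin_ms
    hangs = []
    for prev, cur in zip(wake_times_ms, wake_times_ms[1:]):
        gap = cur - prev
        if gap > limit:
            hangs.append((prev, cur, gap))
    return hangs


def react(hangs, fatal):
    """Mirror the two modes: fatal reboots, non-fatal only logs a warning."""
    if not hangs:
        return "ok"
    return "reboot" if fatal else "log warning"


# Hypothetical wakeup timestamps: the process was delayed between 2000 and 4100.
wakeups = [0, 1000, 2000, 4100, 5100]
hangs = find_hangs(wakeups, timeout_ms=1000, margin_ms=500)
print(hangs)                      # [(2000, 4100, 2100)]
print(react(hangs, fatal=True))   # reboot
print(react(hangs, fatal=False))  # log warning
```

With a larger margin, for example 10000 ms, the same 2100 ms gap would be tolerated and nothing would be reported.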

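In general, the reasons why oprocd reboots a system fall into four categories:

1: Operating system scheduling problems

2: Hardware or driver problems in the operating system

3: Heavy system load that prevents the scheduler from running oprocd in time

4: Oracle bugs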

Bug 5015469 - OPROCD may reboot the node whenever the system date is moved backwards. Fixed in 10.2.0.3+

Bug 4206159 - Oprocd is prone to time regression due to current API used (AIX only). Fixed in 10.1.0.3 + One off patch for Bug 4206159.

Diagnostic Fixes (VERY NECESSARY IN MOST CASES):

Bug 5137401 - Oprocd logfile is cleared after a reboot. Fixed in 10.2.0.4+

Bug 5037858 - Increase the warning levels if a reboot is approaching. Fixed in 10.2.0.3+

The default values of the two oprocd parameters, timeout and margin, are set in the init.cssd file, for example:

[root@node2 init.d]# cat init.cssd | grep ^OPROCD_DEFAULT_

OPROCD_DEFAULT_TIMEOUT=1000

OPROCD_DEFAULT_MARGIN=500

OPROCD_DEFAULT_HISTORGRAM=

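So by default, if the interval between two oprocd wakeups exceeds 1.5 s (1000 ms timeout + 500 ms margin), oprocd reboots the system. That is often too aggressive. Manually changing the default values in init.cssd requires approval from Oracle Support.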
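As a small worked example of where the 1.5 s figure comes from, the sketch below parses the OPROCD_DEFAULT_ lines shown in the grep output and adds timeout and margin. The helper name is hypothetical, and the values are assumed to be milliseconds, which is what the 1.5 s total implies.

```python
def parse_oprocd_defaults(text):
    """Collect OPROCD_DEFAULT_* assignments; empty values become None."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("OPROCD_DEFAULT_") or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        raw = raw.strip()
        values[name] = int(raw) if raw.isdigit() else None
    return values


# The three lines printed by the grep on init.cssd.
sample = """OPROCD_DEFAULT_TIMEOUT=1000
OPROCD_DEFAULT_MARGIN=500
OPROCD_DEFAULT_HISTORGRAM=
"""

defaults = parse_oprocd_defaults(sample)
limit_ms = defaults["OPROCD_DEFAULT_TIMEOUT"] + defaults["OPROCD_DEFAULT_MARGIN"]
print(limit_ms / 1000)  # 1.5
```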

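To get past the 1.5 s limit, we can change the values that init.cssd hands to oprocd through two CSS parameters: reboottime and diagwait (both in seconds). If diagwait > reboottime, then margin = diagwait - reboottime. Before setting diagwait, all clusterware processes on all nodes in the cluster must be stopped, otherwise data corruption may result. The change only needs to be made on one node of the RAC cluster. The recommended value for diagwait is 13.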

[root@node2 bin]# ./crsctl get css reboottime

3

[root@node2 bin]# ./crsctl get css diagwait

13

[root@node2 bin]# ./crsctl set css diagwait 13 -force
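The relationship margin = diagwait - reboottime can be cross-checked against the oprocd.bin command line from the earlier ps output. The sketch below is an illustration under some assumptions: that -t and -m carry timeout and margin in milliseconds, that diagwait and reboottime are in seconds, and that the default margin of 500 ms applies when diagwait is not greater than reboottime. The function names are hypothetical.

```python
import shlex


def oprocd_margin_ms(diagwait_s, reboottime_s, default_margin_ms=500):
    """margin = diagwait - reboottime when diagwait > reboottime.

    Falling back to the default margin otherwise is an assumption.
    """
    if diagwait_s > reboottime_s:
        return (diagwait_s - reboottime_s) * 1000
    return default_margin_ms


def parse_oprocd_args(cmdline):
    """Extract the -t and -m values from an oprocd.bin command line."""
    tokens = shlex.split(cmdline)
    opts = {}
    for flag, key in (("-t", "timeout_ms"), ("-m", "margin_ms")):
        if flag in tokens:
            opts[key] = int(tokens[tokens.index(flag) + 1])
    return opts


cmdline = "/u01/app/crs_home/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f"
args = parse_oprocd_args(cmdline)

print(oprocd_margin_ms(diagwait_s=13, reboottime_s=3))  # 10000
print(args["margin_ms"])                                # 10000
```

With diagwait 13 and reboottime 3, the computed margin of 10000 ms matches the -m 10000 argument seen on the running process, so a reboot would require a wakeup gap of more than 11 s under these assumptions.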

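From 11.2.0.1 onward, changing diagwait is no longer necessary because the architecture has changed.

diagwait can also be changed on Windows, but unlike on Linux, changing it there does not produce the effect described above.

Next, let's look at hangcheck_timer. hangcheck_timer and oprocd can provide the same function, but there is no inherent connection between the two.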

Hangcheck-Timer Module

Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux

Starting with release 9.2.0.2, Oracle RAC environments required a new I/O fencing model, the hangcheck-timer module. This module was implemented to replace the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.

Hangcheck-timer should be loaded at boot time, and monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. This is done by setting a timer, then checking when the timer fires as to whether it was delayed by more than the allowed margin of error. If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. Hangcheck-timer will not cause reboots to occur due to CPU starvation.

Hangcheck-timer requires three configuration parameters:

    hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.

    hangcheck_margin - defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.

    hangcheck_reboot - determines if the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, then the hangcheck-timer module restarts the system. If the hangcheck_reboot parameter is set to zero, then the hangcheck-timer module will not reboot the node, even if a hang is detected. The default value varies by kernel version. In the 2.4 kernel, the default is 1. In 2.6 kernels, the default is 0.
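How the three parameters interact can be sketched as a small decision function. This is a model of the rule described above, not the kernel module's code. The function name is hypothetical, the defaults are the ones quoted in the text (60 s tick, 180 s margin), and hangcheck_reboot defaults to 0 as stated for 2.6 kernels.

```python
def hangcheck_action(delay_s, tick_s=60, margin_s=180, reboot=0):
    """Decide what happens when the timer fires delay_s seconds after being set."""
    if delay_s <= tick_s + margin_s:
        return "ok"
    # Past tick + margin: restart only if hangcheck_reboot >= 1.
    return "restart" if reboot >= 1 else "log only"


print(hangcheck_action(240))             # ok
print(hangcheck_action(250))             # log only
print(hangcheck_action(250, reboot=1))   # restart
```

With the defaults, a node must be unresponsive for more than 240 seconds before anything is flagged, which is why the reboot setting matters in practice.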

Hangcheck-timer will provide message logging to the system messages log when a failure is detected, and a node restart is initiated by the module:

    When Hangcheck-timer reboots it may leave "Hangcheck: hangcheck is restarting the machine" message in /var/log/messages

    If you see the following message in /var/log/messages: "Hangcheck: hangcheck value past margin!" this means a reboot was required but was not performed, because hangcheck_reboot was not set to 1. If this message is seen, you must reload the hangcheck module as described earlier in this note, with the hangcheck_reboot value set to 1.
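To tell the two situations apart when reviewing logs, a short script can scan for the two messages quoted above. This is a minimal sketch: the function name is hypothetical, and the timestamps and host names in the sample lines are invented for illustration. Only the message texts come from the note.

```python
RESTART_MSG = "Hangcheck: hangcheck is restarting the machine"
PAST_MARGIN_MSG = "Hangcheck: hangcheck value past margin!"


def classify_hangcheck_lines(lines):
    """Label each log line that contains one of the two hangcheck messages."""
    events = []
    for line in lines:
        if RESTART_MSG in line:
            events.append(("restart", line.rstrip()))
        elif PAST_MARGIN_MSG in line:
            # A hang was detected but hangcheck_reboot was not set to 1.
            events.append(("past_margin_no_reboot", line.rstrip()))
    return events


# Invented sample lines; real entries in /var/log/messages will differ.
sample_lines = [
    "Jan  1 10:00:00 node2 kernel: Hangcheck: hangcheck value past margin!\n",
    "Jan  1 10:05:00 node2 kernel: unrelated message\n",
    "Jan  2 11:00:00 node2 kernel: Hangcheck: hangcheck is restarting the machine\n",
]

for kind, text in classify_hangcheck_lines(sample_lines):
    print(kind, "|", text)
```

To run it against a real system, the lines would be read from /var/log/messages instead of the sample list.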

Note: Hangcheck-timer is not required starting with Oracle Clusterware 11gR2.
