Database version: 11.2.0.4, RAC environment

Operating system version: Asianux Server release 7.3

Database error analysis

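  1. The business side reported that the application was unreachable. After checking the logs, the developers found that it could not connect to the database.
  2. I checked for database processes and found none running. The command was: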
ps -ef | grep smon
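A plain `ps -ef | grep smon` also lists the grep process itself, which is easy to misread during an incident. The sketch below is a hypothetical helper (`count_smon` is not part of the original post). It counts database SMON processes in `ps -ef` output supplied on standard input, assuming the usual `ora_smon_<SID>` process naming.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count database SMON processes in `ps -ef` output (stdin).
# Assumption: instance SMON processes are named ora_smon_<SID>;
# ASM instances (asm_smon_<SID>) are deliberately not counted.
count_smon() {
    # Anchor the process name at the end of the line so that the
    # "grep smon" command line itself is never counted.
    # grep -c exits non-zero on zero matches, hence the "|| true".
    grep -cE '(^|[[:space:]])ora_smon_[^[:space:]]+[[:space:]]*$' || true
}

# Usage on a live host:
#   ps -ef | count_smon     # 0 means no database instance is running
```

A result of 0 corresponds to the situation described in this step: no instance process is left on the node.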
  3. I checked the database alert log, which showed the following errors:
PMON (ospid: 248987): terminating the instance due to error 484
Wed Feb 15 08:53:54 2023
System state dump requested by (instance=1, osid=248987 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /opt/oracle/oracle/diag/rdbms/xxxx/xxxxx/trace/xxxxx_diag_248998_20230215085354.trc
Instance terminated by PMON, pid = 248987
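To pull the terminating process and error number out of a long alert log, a small filter can help. This is a minimal sketch: `termination_errors` is a hypothetical name, and the pattern assumes the message format shown in the excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<process> <ospid> <error>" for every
# "terminating the instance due to error N" line in an alert log.
# Reads the files given as arguments, or stdin when none are given.
termination_errors() {
    sed -nE 's/^([A-Z0-9]+) \(ospid: ([0-9]+)\): terminating the instance due to error ([0-9]+).*$/\1 \2 \3/p' "$@"
}

# Usage (the alert log path is site-specific):
#   termination_errors /path/to/alert_<SID>.log
```

For the excerpt above this yields `PMON 248987 484`, which gives the process ID and time window to look for in the operating system logs.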
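  4. There were no other ORA- errors. The trc file (only an excerpt is shown) contained just the system state at that moment and no actual error either.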
SYSTEM STATE (level=10)
------------
System global information:
processes: base 0x4a857feee0, size 23040, cleanup 0x4a45a00830
allocation: free sessions 0x4ae673e408, free calls (nil)
control alloc errors: 1092 (process), 1092 (session), 1092 (call)
PMON latch cleanup depth: 1
seconds since PMON's last scan for dead processes: 2
system statistics:
0 OS CPU Qt wait time
2218880 Requests to/from client
2123 logons cumulative
377 logons current
316895 opened cursors cumulative
4953 opened cursors current
12983 user commits
0 user rollbacks
2360027 user calls
1544560 recursive calls
2824397 recursive cpu usage
139 pinned cursors current
503 user logons cumulative
206 user logouts cumulative
916157130 session logical reads
0 session logical reads in local numa group
0 session logical reads in remote numa group
0 session stored procedure space
1814442 CPU used when call started
4648429 CPU used by this session
13669191 DB time
14290758662 cluster wait time
21153141850 concurrency wait time
458454660 application wait time
11148959323 user I/O wait time
9 scheduler wait time
87484635410 non-idle wait time
54628931 non-idle wait count
6541375703609 in call idle wait time
1676394729 session connect time
1676422142 process last non-idle time
511617120 session uga memory
29236137136 session uga memory max
2167152 messages sent
2167153 messages received
1542351 background timeouts
1 remote Oradebug requests
4044554832 session pga memory
33068958032 session pga memory max
0 recursive system API invocations
27720 enqueue timeouts
35968 enqueue waits
0 enqueue deadlocks
5063524 enqueue requests
129584 enqueue conversions
5035305 enqueue releases
1496013 global enqueue gets sync
29624 global enqueue gets async
536653 global enqueue get time
1378253 global enqueue releases
12359025 physical read total IO requests
2788577 physical read total multi block requests
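  5. Because yesterday's outage had been caused by an MTU-related network problem, the analysis was sidetracked: I kept looking in the direction of node eviction and went through the grid logs and others. The earlier error records and the fix were as follows:
# Fix:
# Add the following to sysctl
net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640
# Errors in the log: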
Fri Feb 10 22:00:14 2023
Archived Log entry 91604 added for thread 1 sequence 48514 ID 0xef764f9c dest 1:
Fri Feb 10 22:33:21 2023
IPC Send timeout detected. Receiver ospid 7259 [
Fri Feb 10 22:33:21 2023
Errors in file /opt/oracle/oracle/diag/rdbms/xxx/xxx/trace/xxx1_lms6_7259.trc:
Fri Feb 10 22:34:11 2023
Detected an inconsistent instance membership by instance 2
Fri Feb 10 22:34:11 2023
Received an instance abort message from instance 2
Fri Feb 10 22:34:11 2023
Received an instance abort message from instance 2
Please check instance 2 alert and LMON trace files for detail.
Please check instance 2 alert and LMON trace files for detail.
Fri Feb 10 22:34:11 2023
System state dump requested by (instance=1, osid=7235 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /opt/oracle/oracle/diag/rdbms/xxx/xxx/trace/xxx1_diag_7221_20230210223411.trc
LMS0 (ospid: 7235): terminating the instance due to error 481
Instance terminated by LMS0, pid = 7235
  6. Eventually I searched MOS and found this article: PMON Terminating the Instance Due To Error 822 (Doc ID 2342018.1)
The collected OS error log shows "Out of memory" and "Free swap = 0kB" errors, and it proved that mman process got killed due to "Out of memory" problem:
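# In short: the system ran out of memory and the kernel killed an Oracle background process (MMAN in the MOS note), after which PMON terminated the instance.
  7. Configuring HugePages is one way to guard against this kind of memory exhaustion, but HugePages parameters had already been configured on this system (see the appendix for the procedure), so this error should not have occurred. It still had to be ruled out, so I looked at the operating system's messages log.
view /var/log/messages
# Found the entries below; it was indeed an out-of-memory kill.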
Feb 15 08:53:53 jcsjdb1 kernel: Out of memory: Kill process 249044 (oracle) score 8 or sacrifice child
Feb 15 08:53:53 jcsjdb1 kernel: Killed process 249044 (oracle) total-vm:317602672kB, anon-rss:161904kB, file-rss:3704kB, shmem-rss:8386616kB
Feb 15 08:53:53 jcsjdb1 rtkit-daemon[4520]: Demoted 1 threads.
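Scrolling through the messages log by hand is slow, so the OOM kills can be extracted directly. A minimal sketch follows: `oom_kills` is a hypothetical name, and the pattern assumes the syslog line layout shown above (the kernel's wording can differ between versions).

```shell
#!/usr/bin/env bash
# Hypothetical helper: list processes killed by the kernel OOM killer.
# Prints "<timestamp> pid=<pid> comm=<command>" per "Killed process" line.
# Reads the files given as arguments, or stdin when none are given.
oom_kills() {
    sed -nE 's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}) .*kernel:.*Killed process ([0-9]+) \(([^)]*)\).*$/\1 pid=\2 comm=\3/p' "$@"
}

# Usage:
#   oom_kills /var/log/messages
```

The timestamp of the kill (08:53:53) is one second before PMON's termination message in the alert log (08:53:54), which ties the two events together.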
  8. Check the HugePages configuration
# It is indeed configured
sysctl -a | grep hugepage
vm.hugepages_treat_as_movable = 0
vm.nr_hugepages = 327680
vm.nr_hugepages_mempolicy = 327680
vm.nr_overcommit_hugepages = 0
  9. Once HugePages is configured, Oracle reports at startup whether it actually uses them. From the startup section of the alert log:
************************ Large Pages Information *******************
Per process system memlock (soft) limit = 64 KB # note this value; compare with step 10
Total Shared Global Region in Large Pages = 0 KB (0%) # note this value; a correct configuration shows 100%
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 327680 (640 GB)
Large Pages configured system wide = 327680 (640 GB)
Large Page size = 2048 KB
RECOMMENDATION:
Total System Global Area size is 303 GB. For optimal performance,
prior to the next instance restart:
1. Large pages are automatically locked into physical memory.
Increase the per process memlock (soft) limit to at least 303 GB to lock
100% System Global Area's large pages into physical memory
********************************************************************
  10. Compare with the startup information from an earlier start
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED # note this value; compare with step 9
Total Shared Global Region in Large Pages = 303 GB (100%)
Large Pages used by this instance: 154881 (303 GB)
Large Pages unused system wide = 172799 (337 GB)
Large Pages configured system wide = 327680 (640 GB)
Large Page size = 2048 KB
********************************************************************
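The comparison between the two startup banners can be made mechanical. The sketch below uses a hypothetical helper name (`large_pages_summary`) and assumes the alert log prints each "Large Pages Information" field on its own line, as Oracle writes it. For every startup it prints the memlock limit and the share of the SGA placed in large pages.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise each "Large Pages Information" banner.
# Assumption: every field is on its own line, as in a real alert log.
# Reads the files given as arguments, or stdin when none are given.
large_pages_summary() {
    awk '
        /Per process system memlock \(soft\) limit =/ {
            sub(/^.*limit = */, "")      # keep only the value
            sub(/[ \t\r]+$/, "")         # trim trailing whitespace
            memlock = $0
        }
        /Total Shared Global Region in Large Pages =/ {
            # The percentage is the number inside the parentheses.
            if (match($0, /\([0-9]+%\)/)) {
                pct = substr($0, RSTART + 1, RLENGTH - 3)
                print "memlock=" memlock " sga_in_large_pages=" pct "%"
            }
        }
    ' "$@"
}

# Usage (the alert log path is site-specific):
#   large_pages_summary /path/to/alert_<SID>.log
```

Any startup that reports less than 100% is worth investigating before the instance goes back into service.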
  11. Finally I checked limits.conf and found that the memlock parameter was not configured.
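A quick way to confirm a finding like this is to list every active memlock entry in the limits files. This is a minimal sketch: `memlock_entries` is a hypothetical name, and the file layout assumed is the standard `<domain> <type> <item> <value>` format of limits.conf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the active (non-comment) memlock entries.
# limits.conf fields are: <domain> <type> <item> <value>.
# Reads the files given as arguments, or stdin when none are given.
memlock_entries() {
    awk '!/^[ \t]*#/ && $3 == "memlock"' "$@"
}

# Usage:
#   memlock_entries /etc/security/limits.conf /etc/security/limits.d/*.conf
# The limit a running process actually has can be read from procfs:
#   grep 'Max locked memory' /proc/<pid>/limits
```

No output means that no memlock limit is configured, which matches what was found here.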

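Summary

The root cause was yesterday's restart: when the database started, the memlock limit had dropped from UNLIMITED to the default of 64 KB, so the instance could not use HugePages. The startup log shows the consequence: 640 GB stayed reserved for HugePages and unused, while the 303 GB SGA was allocated from ordinary memory, which most likely led to the out-of-memory kill. The fix is to configure memlock and then restart the database in the upgrade window.

Reflection

During this investigation I forgot to check the operating system's messages log, partly because yesterday's outage had misled me. In future troubleshooting, remember to check the operating system logs as well.

Appendix: configuring HugePages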

  1. Stop the database instance

  2. Check whether HugePages is currently configured

# In the output below all HugePages counters are 0, which means HugePages is not configured; Hugepagesize is 2 MB.
$ grep Huge /proc/meminfo
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
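  3. Change the user's memlock limit
# This is done by editing the /etc/security/limits.conf configuration file
# The value is usually set slightly below the installed system memory; for example, with 600 GB of RAM you could use: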
* soft memlock 500000000
* hard memlock 500000000
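# The values above are in KB, and this setting does not degrade system performance. At a minimum it must be slightly larger than the sum of all SGAs on the system.
# Verify the setting with ulimit -l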
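The rule that memlock must exceed the total SGA size can be checked with a few lines of shell. The function name and the whole-GB input below are assumptions made for illustration; `ulimit -l` reports the limit in KB, or the word `unlimited`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when a memlock limit (in KB, or the word
# "unlimited", as printed by `ulimit -l`) covers an SGA of the given
# size in whole GB.
memlock_covers_sga() {
    local limit_kb=$1 sga_gb=$2
    if [ "$limit_kb" = "unlimited" ]; then
        return 0
    fi
    # Convert GB to KB with integer arithmetic and compare.
    [ "$limit_kb" -ge $(( sga_gb * 1024 * 1024 )) ]
}

# Usage, run as the database owner in a fresh login session:
#   memlock_covers_sga "$(ulimit -l)" 303 && echo "ok" || echo "memlock too low"
```

With the 64 KB default seen in this incident the check fails. With the 500000000 KB example above it passes for a 303 GB SGA.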
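  4. Calculate the value of vm.nr_hugepages
  • For Hugepagesize, the default of 2 MB is fine.
  • The number of huge pages must be sized appropriately, i.e. HugePages_Total * Hugepagesize > SGA size. With SGA = 350 GB and Hugepagesize = 2 MB, HugePages_Total must be at least 350*1024/2 = 179200. Add the portion used by ASM as well (same calculation).
  • It is recommended to set the database parameter use_large_pages to ONLY: the instance then starts only if enough huge pages are configured for the whole SGA, and refuses to start otherwise.
  • Setting the number of huge pages too high brings no benefit either, because memory reserved for huge pages is not available to ordinary process allocations.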
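The sizing arithmetic above can be written as a small function. It is a sketch with a hypothetical name (`calc_nr_hugepages`); it takes the SGA size in whole GB and the huge page size in KB (default 2048) and rounds up so that the pool is never smaller than the SGA.

```shell
#!/usr/bin/env bash
# Hypothetical helper: minimum number of huge pages needed for an SGA.
#   $1  SGA size in whole GB
#   $2  huge page size in KB (defaults to 2048, i.e. 2 MB)
calc_nr_hugepages() {
    local sga_gb=$1 page_kb=${2:-2048}
    local sga_kb=$(( sga_gb * 1024 * 1024 ))
    # Round up so that pages * page size >= SGA size.
    echo $(( (sga_kb + page_kb - 1) / page_kb ))
}

# Usage:
#   calc_nr_hugepages 350     # prints 179200
```

Run the same calculation for the ASM instance and add the results. A value such as 184320 pages corresponds to 360 GB, which leaves some headroom above the 350 GB minimum.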
  5. Edit /etc/sysctl.conf to set the vm.nr_hugepages parameter
#hugepage
vm.nr_hugepages=184320
$ sysctl -w vm.nr_hugepages=184320
$ sysctl -p
  6. Check the HugePages configuration
grep Huge /proc/meminfo
AnonHugePages: 3172352 kB
HugePages_Total: 184320
HugePages_Free: 16072
HugePages_Rsvd: 10953
HugePages_Surp: 0
Hugepagesize: 2048 kB
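The counters in /proc/meminfo are easier to interpret once combined. HugePages_Free still includes pages that are reserved but not yet touched, so the pool actually committed is Total - Free + Rsvd. The sketch below (`hugepages_usage` is a hypothetical name) computes that from meminfo-style input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise huge page usage from /proc/meminfo-style
# input. Reads the files given as arguments, or stdin when none are given.
hugepages_usage() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^HugePages_Rsvd:/  { rsvd = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END {
            in_use = total - free        # pages already faulted in
            committed = in_use + rsvd    # plus pages reserved but untouched
            printf "in_use=%d reserved=%d committed=%d committed_mb=%d\n",
                   in_use, rsvd, committed, committed * size_kb / 1024
        }
    ' "$@"
}

# Usage:
#   hugepages_usage /proc/meminfo
```

For the output above this gives 179201 committed pages, about 350 GB, which suggests an SGA of roughly the size used in the sizing example is placed in huge pages.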
  7. Start the database instance and watch the alert log
