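Background

Here is a case. A customer's standby database went down unexpectedly. The cluster log showed only that a primary/standby switchover had occurred, and repeated attempts to recover the standby failed. The database log contained the following error: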

terminating connection because of crash of another server process

DETAIL: The kingbase has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

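Based on the error message, we suspected that concurrency was too high at the time, or that a heavy workload left shared_buffers insufficient, which in turn brought the database down. Because KWR reports cannot be collected on the V8R3 version, this hypothesis is hard to confirm directly.

Analysis

Let's reproduce the problem with an experiment.

Test environment: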


shared_buffers set to 16MB
max_wal_size set to 32MB

TEST=# create table test01(id integer, val char(1024));
CREATE TABLE
TEST=# insert into test01 values(generate_series(1,2888600),repeat( chr(int4(random()*26)+65),1024));
(waiting...)
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


The ps command lists every server process; process 13674 was consuming a large amount of memory.

The database log warns of possibly corrupted shared memory. The instance crashed and then restarted.

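Before the crash, a large number of checkpoints were triggered. This is expected, because max_wal_size had been set very small: pages must be written out continually so that enough shared_buffers remain available for the insert.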
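To put the test sizes in perspective, here is a rough back-of-the-envelope calculation. It counts only the raw column payload (a 4-byte integer plus 1024 bytes of char data per row) and ignores tuple headers, page overhead, and WAL record overhead, so the real on-disk and WAL volumes would be somewhat larger. The function name is only for illustration.

```python
# Rough size estimate for the test insert (raw payload only, overhead ignored).
ROWS = 2_888_600          # generate_series(1, 2888600)
INT_BYTES = 4             # id integer
CHAR_BYTES = 1024         # val char(1024), single-byte characters assumed
MIB = 1024 * 1024


def raw_payload_bytes(rows=ROWS, row_bytes=INT_BYTES + CHAR_BYTES):
    """Return the raw data volume of the insert, without any storage overhead."""
    return rows * row_bytes


payload = raw_payload_bytes()
shared_buffers = 16 * MIB   # setting used in the test
max_wal_size = 32 * MIB     # setting used in the test

print(f"raw payload: {payload / 1024**3:.2f} GiB")
print(f"payload / shared_buffers: {payload / shared_buffers:.0f}x")
print(f"payload / max_wal_size:   {payload / max_wal_size:.0f}x")
```

Under these assumptions the single statement carries roughly 2.77 GiB of data, about 177 times shared_buffers and about 88 times max_wal_size, which is consistent with checkpoints firing almost every second.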

The database also gave a reasonable hint: increase the configuration parameter "max_wal_size".


2022-05-25 15:38:04 CST HINT:  Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:05 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:05 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:05 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:05 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:06 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:06 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:07 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:07 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:07 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:07 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:07 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:07 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:09 CST LOG: checkpoints are occurring too frequently (2 seconds apart)
2022-05-25 15:38:09 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:09 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:09 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:10 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:10 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:12 CST LOG: checkpoints are occurring too frequently (2 seconds apart)
2022-05-25 15:38:12 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:12 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:12 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:12 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:12 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:13 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:13 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:14 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:14 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:15 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:15 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:15 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:15 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:15 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:15 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:16 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:16 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:17 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:17 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:17 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:17 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:18 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:18 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:18 CST LOG: checkpoints are occurring too frequently (1 second apart)
2022-05-25 15:38:18 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:18 CST LOG: checkpoints are occurring too frequently (0 seconds apart)
2022-05-25 15:38:18 CST HINT: Consider increasing the configuration parameter "max_wal_size".
2022-05-25 15:38:19 CST LOG: server process (PID 13674) was terminated by signal 9: Killed
2022-05-25 15:38:19 CST DETAIL: Failed process was running: insert into test01 values(generate_series(1,2888600),repeat( chr(int4(random()*26)+65),1024));
2022-05-25 15:38:19 CST LOG: terminating any other active server processes
2022-05-25 15:38:19 CST WARNING: terminating connection because of crash of another server process
2022-05-25 15:38:19 CST DETAIL: The kingbase has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-05-25 15:38:19 CST HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-05-25 15:38:19 CST LOG: all server processes terminated; reinitializing
2022-05-25 15:38:19 CST LOG: database system was interrupted; last known up at 2022-05-25 15:38:19 CST
2022-05-25 15:38:19 CST LOG: database system was not properly shut down; automatic recovery in progress
2022-05-25 15:38:19 CST LOG: redo starts at 0/8F050338
2022-05-25 15:38:19 CST LOG: redo wal segment count 1
2022-05-25 15:38:19 CST LOG: invalid record length at 0/8FA6C178: wanted 24, got 0
2022-05-25 15:38:19 CST LOG: complete: 1/1
2022-05-25 15:38:19 CST LOG: redo done at 0/8FA6C108
2022-05-25 15:38:19 CST LOG: MultiXact member wraparound protections are now enabled
2022-05-25 15:38:19 CST LOG: redo done at 0/8FA6C108
2022-05-25 15:38:19 CST LOG: MultiXact member wraparound protections are now enabled
2022-05-25 15:38:19 CST LOG: database system is ready to accept connections
2022-05-25 15:38:19 CST LOG: autovacuum launcher started
2022-05-25 15:38:19 CST LOG: starting syslogical supervisor
2022-05-25 15:38:19 CST LOG: starting syslogical database manager for database TEST
2022-05-25 15:38:19 CST LOG: manager worker [13929] at slot 0 generation 1 detaching cleanly
2022-05-25 15:38:20 CST LOG: starting syslogical database manager for database TEMPLATE1
2022-05-25 15:38:20 CST LOG: manager worker [13930] at slot 0 generation 2 detaching cleanly
2022-05-25 15:38:20 CST LOG: starting syslogical database manager for database TEMPLATE2
2022-05-25 15:38:20 CST LOG: manager worker [13932] at slot 0 generation 3 detaching cleanly
2022-05-25 15:38:20 CST LOG: starting syslogical database manager for database SAMPLES
2022-05-25 15:38:20 CST LOG: manager worker [13935] at slot 0 generation 4 detaching cleanly
2022-05-25 15:38:20 CST LOG: starting syslogical database manager for database SECURITY
2022-05-25 15:38:20 CST LOG: manager worker [13940] at slot 0 generation 5 detaching cleanly
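When a log like the one above is long, a small script can pull out the two signals that matter here: how often the checkpoint warning fired, and which backend was terminated by which signal. The following is a minimal sketch; the function name and the returned dictionary layout are illustrative assumptions, and the regular expressions match only the message wording seen in this log.

```python
import re

# Patterns match the message text seen in the log above.
CHECKPOINT_RE = re.compile(
    r"checkpoints are occurring too frequently \((\d+) seconds? apart\)"
)
SIGNAL_RE = re.compile(
    r"server process \(PID (\d+)\) was terminated by signal (\d+)"
)


def summarize_log(lines):
    """Count checkpoint warnings and collect (pid, signal) pairs of killed backends."""
    intervals = []
    killed = []
    for line in lines:
        m = CHECKPOINT_RE.search(line)
        if m:
            intervals.append(int(m.group(1)))
            continue
        m = SIGNAL_RE.search(line)
        if m:
            killed.append((int(m.group(1)), int(m.group(2))))
    return {
        "checkpoint_warnings": len(intervals),
        "max_interval_s": max(intervals, default=None),
        "killed": killed,
    }


# Three lines copied from the log above.
sample = [
    "2022-05-25 15:38:05 CST LOG: checkpoints are occurring too frequently (1 second apart)",
    "2022-05-25 15:38:05 CST LOG: checkpoints are occurring too frequently (0 seconds apart)",
    "2022-05-25 15:38:19 CST LOG: server process (PID 13674) was terminated by signal 9: Killed",
]
summary = summarize_log(sample)
print(summary)
```

Signal 9 is SIGKILL, which a process cannot catch. On Linux it is often sent by the kernel's out-of-memory killer, so when this pattern shows up it may be worth checking the kernel log (for example with dmesg) for a matching entry.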


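Checking again, the process that was using so much memory has been killed (the log shows PID 13674 terminated by signal 9).

Note that in the same environment I could not reproduce the problem on KingbaseES V8R6, and no crash occurred. The insert was slow, but the process's memory usage was nowhere near as high.

Summary:

Sudden bursts of concurrency that push a database to its resource limits are common. We should work with the business side to keep the workload stable. Before launching a new workload, evaluate its memory, CPU, and I/O usage, and confirm that spare memory is available to add. Otherwise the database can easily crash as in the example above.

Upgrade to a newer version where possible to avoid this problem, or cap resource consumption at the operating-system level.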
