KingbaseES R6 集群主机锁冲突导致的主备切换案例
案例说明:
主库在业务高峰期间,客户执行建表等DDL操作,主库产生“AccessExclusiveLock ”锁,导致大量的事务产生锁冲突,大量的会话堆积,客户端session访问主库失败。备库和主库之间的PQping的心跳通讯测试也受到阻塞,在指定阈值(10次)范围内,尝试连接失败后,触发主备切换。
一、案例故障信息
1)主库sys_log日志
=如下图所示,在主库上产生了“AccessExclusiveLock ”锁,出现大量锁冲突=
2022-03-23 11:36:00 CST [3165572]: [2241-1] user=bjoa,db=gzwwwoa,app=Kingbase8 JDBC Driver,client=192.3.89.9LOG: process 3165572 still waiting for AccessExclusiveLock on tuple (551717,56) of relation 20499 of database 16461 after 1012.724 ms
2022-03-23 11:36:00 CST [3165572]: [2242-1] user=bjoa,db=gzwwwoa,app=Kingbase8 JDBC Driver,client=192.3.89.9DETAIL: Process holding the lock: 3219816. Wait queue: 3172465, 3220570, 3211000, 3194976, 3195098, 3189287, 3194983, 3148868, 3219348, 3219365, 3219380, 3219672, 3220580, 3218107, 3219360, 3217745, 3211003, 3219456, 3219687, 3217717, 3211892, 3079421, 3218763, 3220067, 3217706, 3217630, 3217372, 3217375, 3194995, 3220143, 3124556, 3117826, 3194934, 3194832, 3218969, 3217628, 3119221, 3219787, 3219385, 3219345, 3217769, 3216508, 3214193, 3218108, 3216459, 3220223, 3218058, 3220627, 3220413, 3215490, 3220591, 3215493, 3120204, 3218892, 3168511, 3194835, 3220588, 3220546, 3217718, 3217629, 3217386, 3220586, 3217778, 3220178, 3218789, 3217707, 3218839, 3150225, 3167040, 3219539, 3169071, 3194836, 3195048, 3194988, 3220523, 3220100, 3217711, 3215655, 3218068, 3168514, 3220589, 3220583, 3218721, 3133538, 3216464, 3218089, 3219051, 3212018, 3161515, 3220060, 3216453, 3217761, 3220176, 3216469, 3218843, 3218492, 3212020, 3210997, 3217715, 3172444, 3217674, 3157246, 3217682, 3220212, 3191504, 3216431, 3220410, 3194889, 3219677, 3220099, 3218810, 3219364, 3218099, 3182051, 3220574, 3218029, 3212166, 3219418, 3220106, 3219772, 3216447, 3215484, 3214194, 3219366, 3195081, 3170956, 3159849, 3170355, 3219405, 3218967, 3217626, 3218115, 3173965, 3220056, 3216482, 3220381, 3219349, 3194845, 3220389, 3217064, 3161902, 3220057, 3218055, 3211895, 3217770, 3220119, 3220227, 3220529, 3220052, 3219065, 3176922, 3194369, 3218054, 3220065, 3212136, 3218105, 3217775, 3218060, 3195091, 3058330, 3220571, 3217739, 3220511, 3220059, 3219400, 3219412, 3219379, 3211893, 3220192, 3220557, 3219676, 3220112, 3219420, 3165574, 3220216, 3217767, 3219533, 3220602, 3194828, 3220097, 3219783, 3219327, 3219329, 3212247, 3220053, 3219766, 3192177, 3219673, 3212162, 3220568, 3212106, 3220527, 3215487, 3220046, 3220537, 3121709, 3153617, 3217422, 3218073, 3218804, 3216434, 3220168, 3140776, 3156690, 3220199, 3212201, 3194380, 3194821, 3220516, 3217371, 3212015, 3194824, 3219346, 3216427, 3216506, 3219347, 3218102, 3215687, 3194816, 3220600, 3210998, 3218483, 3220593, 3220364, 3220407, 3220396, 3218046, 3220544, 3219417, 3218968, 3220380, 3136699, 3220555, 3218110, 3171869, 3220111, 3220084, 3220569, 3195041, 3217768, 3212151, 3179922, 3220167, 3219388, 3215489, 3194699, 3220162, 3160067, 3194809, 3220115, 3220285, 3218771, 3218837, 3218104, 3216437, 3219769, 3217676, 3215481, 3215645, 3218768, 3220418, 3218812, 3175055, 3219799, 3218777, 3153616, 3215488, 3217771, 3218770, 2983432, 3218806, 3216483, 3220130, 3215674, 3218772, 3215672, 3219746, 3211897, 3217374, 3218823, 3220133, 3220228, 3219745, 3220181, 3217765, 3219810, 3215686, 3217727, 3170682, 3219536, 3194808, 3217390, 3212163, 3179926, 3218059, 3218026, 3218824, 3216432, 3212128, 3220506, 3169074, 3220556, 3172449, 3220522, 3220366, 3233821, 3218795, 3218808, 3220287, 3220338, 3219780, 3165572, 3218832, 3220538, 3220499, 3220224, 3215654, 3218841, 3217776, 3220374, 3219675, 3218756, 3215677, 3195087, 3219754.
2022-03-23 11:36:00 CST [3165572]: [2243-1] user=bjoa,db=gzwwwoa,app=Kingbase8 JDBC Driver,client=192.3.89.9CONTEXT: SQL statement "DELETE FROM tbtemp_gettodolistmsg1"
PL/SQL function fngettodolistmsg(integer,integer,refcursor) line 268 at SQL statement
2)备库hamgr.log日志
=如以下日志所示,备库在通过PQping() 函数和主库执行心跳通讯测试,在超过阈值(10次)后,触发主备切换。=
[2022-03-23 11:36:51] [WARNING] unable to ping "host=192.3.89.16 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-23 11:36:51] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-23 11:36:51] [WARNING] unable to connect to upstream node "node16" (ID: 1)
[2022-03-23 11:36:51] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-23 11:36:57] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-23 11:37:07] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.3.89.16 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-23 11:37:07] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-23 11:37:07] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-23 11:38:57] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-23 11:39:07] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.3.89.16 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-23 11:39:07] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-23 11:39:07] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-23 11:39:07] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-23 11:39:07] [INFO] sleeping 5 seconds
[2022-03-23 11:39:12] [NOTICE] killing WAL receiver with PID 1794980
[2022-03-23 11:39:13] [INFO] WAL receiver with pid 1794980 killed
[2022-03-23 11:39:14] [NOTICE] killing WAL receiver with PID 1657239
[2022-03-23 11:39:15] [INFO] WAL receiver with pid 1657239 killed
[2022-03-23 11:39:16] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-23 11:39:16] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-23 11:39:16] [INFO] 0 active sibling nodes registered
[2022-03-23 11:39:16] [INFO] primary and this node have the same location ("default")
[2022-03-23 11:39:16] [INFO] no other sibling nodes - we win by default
[2022-03-23 11:39:16] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-23 11:39:16] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-23 11:39:16] [INFO] try to ping the trusted_servers "192.3.89.1" before execute promote_command
[2022-03-23 11:39:18] [NOTICE] PING 192.3.89.1 (192.3.89.1) 56(84) bytes of data.
--- 192.3.89.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.152/0.159/0.166/0.007 ms
[2022-03-23 11:39:18] [NOTICE] successfully ping one or more of the trusted_servers "192.3.89.1"
[2022-03-23 11:39:20] [NOTICE] try to stop old primary db (host: "192.3.89.16")
[2022-03-23 11:39:20] [NOTICE] PING 192.3.89.60 (192.3.89.60) 56(84) bytes of data.
--- 192.3.89.60 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1023ms
rtt min/avg/max/mdev = 0.423/0.575/0.728/0.152 ms
[2022-03-23 11:39:20] [WARNING] the virtual ip is already on other host, try to release it on old primary node (host: "192.3.89.16")
[2022-03-23 11:39:21] [INFO] stop database ...
[2022-03-23 11:39:21] [INFO] ES connection to host "192.3.89.16" succeeded, ready to release vip on it
[2022-03-23 11:39:23] [INFO] kill signal 9 to pid 2091799
[2022-03-23 11:39:23] [INFO] stop db done.
[2022-03-23 11:39:23] [NOTICE] old primary node (host: "192.3.89.16") does not hold the virtual ip 192.3.89.60/24 before we release it
[2022-03-23 11:39:23] [NOTICE] will acquire the virtual ip again
[2022-03-23 11:39:25] [NOTICE] PING 192.3.89.60 (192.3.89.60) 56(84) bytes of data.
--- 192.3.89.60 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1001ms
[2022-03-23 11:39:25] [WARNING] ping host"192.3.89.60" failed
[2022-03-23 11:39:25] [DETAIL] average RTT value is not greater than zero
[2022-03-23 11:39:25] [INFO] loadvip result: 1, arping result: 1
[2022-03-23 11:39:25] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.3.89.60/24 success
[2022-03-23 11:39:25] [INFO] promote_command is:
"/home/kingbase/cluster/cluster/KingbaseCluster/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/cluster/KingbaseCluster/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node17" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node17" (ID: 2) was successfully promoted to primary
三、案例分析
1)AccessExclusiveLock 锁,一般是因为执行了以下操作(DDL)产生
2)PQping()函数
repmgr备库对主库状态的监控是通过PQping()函数,PQping()函数是libpq接口下的函数:libpq库是PostgreSQL的C接口。它是一组库函数,允许客户端程序与PostgreSQL交互。简单地说,Libpq 可做三件事:与数据库建立连接、发送 操作数据库的语句并处理结果。
3) 主备切换分析
主备之间通过libpq接口的PQping()函数做心跳测试,由于主库产生锁冲突,导致访问主库的会话堆积,致使备库连接主库PQping()失败,在超过阈值(10次)后,发生主备切换。
4)确定现场业务操作
四、总结
本次主备切换可以判断为,客户在业务高峰期,执行DDL操作,主库大量的锁冲突,会话阻塞,导致主备之间心跳通讯测试失败而引起。在生产环境业务高峰期间,应禁止执行DDL等操作,应该安排在业务低峰期间执行。
KingbaseES R6 集群主机锁冲突导致的主备切换案例的更多相关文章
- KingbaseES R6 集群物理copy方式手工添加新备库节点
案例说明: 对于主库数据量比较大的环境,在添加新节点是可以采用在线clone方式创建新的备库节点,也可以在离线的状态下,直接拷贝其中一个备库的所有集群相关目录来创建新的备库节点.本案例介绍了通过离线物 ...
- KingbaseES R6 集群创建流复制只读副本库案例
一.环境概述 [kingbase@node2 bin]$ ./ksql -U system test ksql (V8.0) Type "help" for help. test= ...
- KingbaseES R6 集群修改物理IP和VIP案例
在用户的实际环境里,可能有时需要修改主机的IP,这就涉及到集群的配置修改.以下以例子的方式,介绍下KingbaseES R6集群如何修改IP. 一.案例测试环境 操作系统: [KINGBASE@nod ...
- KingbaseES R6 集群 recovery 参数对切换的影响
案例说明:在KingbaseES R6集群中,主库节点出现宕机(如重启或关机),会产生主备切换,但是当主库节点系统恢复正常后,如何对原主库节点进行处理,保证集群数据的一致性和安全,可以通过对repmg ...
- KingbaseES R6 集群通过备库clone在线添加新节点
案例说明: KingbaseES R6集群可以通过图形化方式在线添加新节点,但是在添加新节点clone环节时,是从主库copy数据到新的节点,这样在生产环境,如果数据量大,将会对主库的网络I/O造成压 ...
- KingbaseES R6 集群repmgr.conf参数'recovery'测试案例(一)
KingbaseES R6集群repmgr.conf参数'recovery'测试案例(一) 案例说明: 在KingbaseES R6集群中,主库节点出现宕机(如重启或关机),会产生主备切换,但是当主库 ...
- KingbaseES R6 集群修改data目录
案例说明: 本案例是在部署完成KingbaseES R6集群后,由于业务的需求,集群需要修改data(数据存储)目录的测试.本案例分两种修改方式,第一种是离线修改data目录,即关闭整个集群后,修改数 ...
- KingbaseES R6 集群启动‘incorrect command permissions for the virtual ip’故障案例
案例说明: KingbaseES R6集群启动时,出现"incorrect command permissions for the virtual ip"故障,本案例介绍了如何分析 ...
- KingbaseES R6 集群sys_monitor.sh change_password一键修改集群用户密码
案例说明: kingbaseES R6集群用户密码修改,需要修改两处: 1)修改数据库用户密码(alter user): 2)修改.encpwd文件中用户密码: 可以通过sys_monitor.sh ...
随机推荐
- 重学ES系列之Set实现数组去重、交集、并集、差集
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8&quo ...
- Python爬虫常用:谷歌浏览器驱动——Chromedriver 插件安装教程
我们在做爬虫的时候经常要使用谷歌浏览器驱动,今天分享下这个Chromedriver 插件的安装方法. 第一步:打开谷歌浏览器打开设置面板 嫌枯燥的小伙伴可以点击此处找管理员小姐姐领取免费资料 第二步: ...
- NC24622 Brownie Slicing
NC24622 Brownie Slicing 题目 题目描述 Bessie has baked a rectangular brownie that can be thought of as an ...
- 记一次 .NET 某电厂Web系统 内存泄漏分析
一:背景 1. 讲故事 前段时间有位朋友找到我,说他的程序内存占用比较大,寻求如何解决,截图就不发了,分析下来我感觉除了程序本身的问题之外,.NET5 在内存管理方面做的也不够好,所以有必要给大家分享 ...
- day06 Socket_线程API_线程并发安全
使用多线程实现多客户端连接服务端 流程图 服务端代码改造: package socket; import java.io.BufferedReader; import java.io.IOExcept ...
- 02 MySQL_数据库相关的SQL
数据库相关的SQL 1. 查看所有数据库 show databases; 2. 创建数据库 格式:create database 数据库名称: 示例: create database db1; 3. ...
- day01 Java_基础
1.类型 int:整数 long:长整数 double:浮点数 char:字符型--单引号只能有一个 boolean:布尔型--真假 string : 字符串--双引号,可以0-多个 2.算数运算符: ...
- Git上传仓库
上传代码到gitee 方法1 1. 将远程仓库克隆到本地 git clone https://gitee.com/abc/aaa.git 2. 添加或修改本地文件 3. 将本地代码push到远程仓库 ...
- DFS序和7种模型
DFS序就是将树的节点按照先根的顺序遍历得到的节点顺序 性质:一个子树全在一个连续的区间内,可以与线段树和树状数组搭配使用 很好写,只需在dfs中加几行代码即可. 代码: void dfs(ll u, ...
- Linux系统上传公钥不生效问题
Authentication refused: bad ownership or modes for file /home/yanbo.xu/.ssh/authorized_keys 原因: sshd ...