Database version:

test=# select version();
version
----------------------------------------------------------------------------------------------------------------------
KingbaseES V008R006C005B0041 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
(1 row)

Host node information:

[kingbase@node101 bin]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.101 node101 # original primary
192.168.1.102 node102 # original standby

Cluster node information:

 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node101 | primary | * running |          | running | 11180 | no      | n/a
 2  | node102 | standby |   running | node101  | running | 9242  | no      | 0 second(s) ago

I. Check cluster status and configuration

1. Cluster node status

[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 1        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 1        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
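
When this check is repeated before and after the test, it can help to read the role of each node programmatically rather than by eye. The following is a minimal Python sketch, not part of the original procedure: it assumes the output of repmgr cluster show has been captured as text, and the helper name parse_cluster_show is hypothetical. The connection strings in the sample are trimmed for brevity.

```python
def parse_cluster_show(text):
    """Parse the pipe-delimited table printed by `repmgr cluster show`.

    Returns one dict per node, keyed by the column headers.
    """
    header = None
    rows = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the shell prompt and the ----+---- separator line
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Sample copied from the output above (connection strings trimmed).
SAMPLE = """\
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node101 | primary | * running |          | default  | 100      | 1        | host=192.168.1.101 user=system dbname=esrep port=54321
 2  | node102 | standby |   running | node101  | default  | 100      | 1        | host=192.168.1.102 user=system dbname=esrep port=54321
"""

rows = parse_cluster_show(SAMPLE)
# A running primary is shown with Status "* running".
primaries = [r["Name"] for r in rows
             if r["Role"] == "primary" and r["Status"].endswith("running")]
print(primaries)
```

Running the same parser on the output captured after the failover should list node102 as the only primary, which is a quick way to confirm the switch.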

2. Cluster configuration

II. Test: bring down the primary's network interface

1. Bring down the NIC on the primary

[root@node101 ~]# ifconfig enp0s3 down

2. messages log on the primary

Mar 29 16:39:47 node101 avahi-daemon[782]: Interface enp0s3.IPv4 no longer relevant for mDNS.
Mar 29 16:39:47 node101 avahi-daemon[782]: Leaving mDNS multicast group on interface enp0s3.IPv4 with address 192.168.1.101.
Mar 29 16:39:47 node101 avahi-daemon[782]: Withdrawing address record for fe80::a00:27ff:fe73:47f6 on enp0s3.
Mar 29 16:39:47 node101 avahi-daemon[782]: Withdrawing address record for 192.168.1.101 on enp0s3.

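3. hamgr.log on the standby

=== The log below falls into two parts ===

1) PQping checks from the standby to the primary fail; once the failure threshold is exceeded, a primary/standby failover is triggered.
2) Because recovery='automatic', the cluster then runs recovery on the original primary so that it rejoins the cluster as a standby. While the primary's NIC is down, the standby (now the new primary) keeps retrying the connection to the original primary; recovery succeeds once that node's network is back.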
[2022-03-29 16:36:52] [INFO] node "node102" (ID: 2) monitoring upstream node "node101" (ID: 1) in normal state
[2022-03-29 16:39:58] [WARNING] unable to ping "host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-29 16:39:58] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-29 16:39:58] [WARNING] unable to connect to upstream node "node101" (ID: 1)
[2022-03-29 16:39:58] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-29 16:40:04] [INFO] checking state of node 1, 1 of 10 attempts
......
[2022-03-29 16:41:52] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-29 16:41:53] [WARNING] unable to ping "user=system connect_timeout=10 dbname=esrep host=192.168.1.101 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-29 16:41:53] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-29 16:41:53] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-29 16:41:53] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-29 16:41:53] [WARNING] wal receiver not running
[2022-03-29 16:41:53] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-29 16:41:53] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-29 16:41:53] [INFO] 0 active sibling nodes registered
[2022-03-29 16:41:53] [INFO] primary and this node have the same location ("default")
[2022-03-29 16:41:53] [INFO] no other sibling nodes - we win by default
[2022-03-29 16:41:53] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-29 16:41:53] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-29 16:41:53] [INFO] try to ping the trusted_servers "192.168.1.1" before execute promote_command
[2022-03-29 16:41:55] [NOTICE] PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
--- 192.168.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.204/0.237/0.296/0.045 ms
[2022-03-29 16:41:55] [NOTICE] successfully ping one or more of the trusted_servers "192.168.1.1"
[2022-03-29 16:41:55] [INFO] promote_command is:
"/home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6HA/kha/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node102" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node102" (ID: 2) was successfully promoted to primary
[2022-03-29 16:41:57] [INFO] switching to primary monitoring mode
[2022-03-29 16:41:57] [NOTICE] monitoring cluster primary "node102" (ID: 2)
[2022-03-29 16:41:57] [INFO] create a thread 0x7f9ae9028700 to check the cluster status
[2022-03-29 16:41:57] [INFO] child node: 1; attached: no
[2022-03-29 16:41:57] [INFO] check node status again, try 1 / 10 times
......
[2022-03-29 16:42:15] [INFO] check node status again, try 10 / 10 times
[2022-03-29 16:42:17] [WARNING] [thread 0x7f9ae9028700] unable to connect via ES to host "192.168.1.101"
[2022-03-29 16:42:17] [INFO] child node: 1; attached: no
[2022-03-29 16:42:17] [INFO] found node down, recovery will be triggered after recovery delay time 20s
[2022-03-29 16:42:19] [INFO] child node: 1; attached: no
.......
[2022-03-29 16:42:27] [INFO] child node: 1; attached: no
[2022-03-29 16:42:29] [WARNING] [thread 0x7f9ae9028700] unable to connect via ES to host "192.168.1.101"
[2022-03-29 16:42:29] [INFO] child node: 1; attached: no
.......
[2022-03-29 16:42:37] [INFO] child node: 1; attached: no
[2022-03-29 16:42:37] [INFO] recovery delay time reached. can do recovery now.
[2022-03-29 16:42:37] [INFO] [thread pid:2995] do_nodes_recovery thread begin. The pthread_t tid is 0x7f9ae9829700
[2022-03-29 16:42:37] [NOTICE] [thread pid:2995] node (ID: 1; host: "192.168.1.101") is not attached, ready to auto-recovery
[2022-03-29 16:42:41] [NOTICE] [thread pid:2995] Now, the primary host ip: 192.168.1.102
[2022-03-29 16:42:47] [WARNING] unable to connect to remote host "192.168.1.101" via ES
[2022-03-29 16:42:47] [NOTICE] [thread pid:2995] node "node101" (ID: 1) auto-recovery failed: unable to connect via ES to host "192.168.1.101", user "", do nothing
[2022-03-29 16:42:47] [WARNING] [thread pid:2995] node (ID: 1) auto-recovery failed: unable to connect via ES to host "192.168.1.101", user "", do nothing
[2022-03-29 16:42:47] [INFO] [thread pid:2995] Is standby node "node101" (ID: 1) ready for connection?
[2022-03-29 16:42:53] [ERROR] [thread pid:2995] standby node "node101" (ID: 1) connected ... FAILED
[2022-03-29 16:42:53] [DETAIL] [thread pid:2995] could not connect to server: No route to host
Is the server running on host "192.168.1.101" and accepting
TCP/IP connections on port 54321?
......
[2022-03-29 16:44:05] [ERROR] [thread pid:3259] standby node "node101" (ID: 1) connected ... FAILED
[2022-03-29 16:44:05] [DETAIL] [thread pid:3259] could not connect to server: No route to host
Is the server running on host "192.168.1.101" and accepting
TCP/IP connections on port 54321?
[2022-03-29 16:44:05] [INFO] [thread pid:3259] do_nodes_recovery thread ends. The pthread_t tid is 0x7f9ae9829700
[2022-03-29 16:44:05] [WARNING] [thread 0x7f9ae9028700] unable to connect via ES to host "192.168.1.101"
[2022-03-29 16:44:06] [INFO] thread tid:0x7f9ae9829700 is not running
[2022-03-29 16:44:06] [INFO] the recovery thread was exited, reset tid
[2022-03-29 16:44:06] [INFO] child node: 1; attached: no
[2022-03-29 16:44:06] [INFO] found node down, recovery will be triggered after recovery delay time 20s
[2022-03-29 16:44:07] [NOTICE] [thread 0x7f9ae9028700] the TimeLineID (1) of node (ID: 1) is smaller than the TimeLineID (2) of local node (ID: 2)
[2022-03-29 16:44:07] [NOTICE] [thread 0x7f9ae9028700] try to stop primary db on node (ID: 1, host: "192.168.1.101")
[2022-03-29 16:44:06] [INFO] stop database ...
[2022-03-29 16:44:07] [INFO] stop db done.
[2022-03-29 16:44:08] [NOTICE] [thread 0x7f9ae9028700] success to stop primary db on node (ID: 1, host: "192.168.1.101")
[2022-03-29 16:44:08] [INFO] child node: 1; attached: no
[2022-03-29 16:44:10] [INFO] child node: 1; attached: no
[2022-03-29 16:44:11] [INFO] node (ID: 1): no server running
[2022-03-29 16:44:11] [INFO] [thread 0x7f9ae9028700] the cluster has no other running primary node, exit
[2022-03-29 16:44:12] [INFO] child node: 1; attached: no
.......
[2022-03-29 16:44:26] [INFO] child node: 1; attached: no
[2022-03-29 16:44:26] [INFO] recovery delay time reached. can do recovery now.
[2022-03-29 16:44:26] [INFO] [thread pid:3662] do_nodes_recovery thread begin. The pthread_t tid is 0x7f9ae9028700
[2022-03-29 16:44:26] [NOTICE] [thread pid:3662] node (ID: 1; host: "192.168.1.101") is not attached, ready to auto-recovery
[2022-03-29 16:44:26] [NOTICE] [thread pid:3662] Now, the primary host ip: 192.168.1.102
[2022-03-29 16:44:27] [INFO] [thread pid:3662] ES connection to host "192.168.1.101" succeeded, ready to do auto-recovery
[2022-03-29 16:44:27] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
### Run "repmgr node rejoin --force-rewind" to recover the original primary.
[2022-03-29 16:44:27] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr --dbname="host=192.168.1.102 dbname=esrep user=system port=54321" node rejoin --force-rewind"
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 2 forked off current database system timeline 1 before current recovery point 0/7000028
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' --source-server='host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/6001210 on timeline 1
sys_rewind: rewinding from last common checkpoint at 0/5000498 on timeline 1
sys_rewind: find last common checkpoint start time from 2022-03-29 16:44:27.204902 CST to 2022-03-29 16:44:27.308161 CST, in "0.103259" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/600FE20', minRecoveryPointTLI is '2', and database state is 'in archive recovery'
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6HA/kha/kingbase/data/sys_replslot/repmgr_slot_2.rewind' and all the file/dir in it.
sys_rewind: rewind start wal location 0/5000468 (file 000000010000000000000005), end wal location 0/600FE20 (file 000000020000000000000006). time from 2022-03-29 16:44:27.204902 CST to 2022-03-29 16:44:28.615051 CST, in "1.410149" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6HA/kha/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-29 16:44:28.628597
NOTICE: starting server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-29 16:44:29.034402
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[2022-03-29 16:44:29] [NOTICE] kbha: node (ID: 1) rejoin success.
[2022-03-29 16:44:29] [NOTICE] [thread pid:3662] node "node101" (ID: 1) auto-recovery success
[2022-03-29 16:44:29] [INFO] [thread pid:3662] Is standby node "node101" (ID: 1) ready for connection?
[2022-03-29 16:44:29] [INFO] [thread pid:3662] the standby node "node101" (ID: 1) connected ... OK
[2022-03-29 16:44:29] [INFO] [thread pid:3662] do_nodes_recovery thread ends. The pthread_t tid is 0x7f9ae9028700
[2022-03-29 16:44:30] [INFO] SET synchronous TO "quorum" on primary host
[2022-03-29 16:44:30] [INFO] thread tid:0x7f9ae9028700 is not running
[2022-03-29 16:44:30] [INFO] the recovery thread was exited, reset tid
[2022-03-29 16:44:30] [NOTICE] Some nodes reconnect, all standby nodes are OK now
[2022-03-29 16:44:34] [NOTICE] new standby "node101" (ID: 1) has connected
[2022-03-29 16:46:57] [INFO] monitoring primary node "node102" (ID: 2) in normal state
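
The two phases in this log can be timed directly from its timestamps. The sketch below is an illustration in Python rather than part of the original test: it assumes hamgr.log lines begin with a bracketed timestamp as shown above, the helper name first_time is hypothetical, and the three marker strings are simply messages picked from this particular log.

```python
import re
from datetime import datetime

# hamgr.log lines in this test start with "[YYYY-MM-DD HH:MM:SS]".
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")


def first_time(lines, needle):
    """Return the timestamp of the first log line containing `needle`."""
    for line in lines:
        match = STAMP.match(line)
        if match and needle in line:
            return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return None


# Four lines copied from the log above; a real run would read the whole file.
LOG = """\
[2022-03-29 16:39:58] [WARNING] unable to connect to upstream node "node101" (ID: 1)
[2022-03-29 16:41:53] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-29 16:41:57] [INFO] switching to primary monitoring mode
[2022-03-29 16:44:29] [NOTICE] [thread pid:3662] node "node101" (ID: 1) auto-recovery success
""".splitlines()

detected = first_time(LOG, "unable to connect to upstream node")
promoted = first_time(LOG, "switching to primary monitoring mode")
rejoined = first_time(LOG, "auto-recovery success")

# Phase 1: first failed check until node102 is monitoring itself as primary.
failover_s = (promoted - detected).total_seconds()
# Phase 2: promotion until node101 has rejoined as a standby.
rejoin_s = (rejoined - promoted).total_seconds()
print(failover_s, rejoin_s)
```

For this run the failover phase took 119 seconds. The rejoin phase took 152 seconds, and that figure mostly reflects how long the NIC stayed down, since recovery could not reach node101 until its network returned.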

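III. Primary NIC restored (up)

1. Check hamgr.log on the standby

= Once the original primary's NIC is back up, recovery is run on the original primary because recovery='automatic', and it rejoins the cluster as the new standby. The second half of the standby's hamgr.log above (from 16:44:26 onward) shows this. =

2. Check cluster status

=== As the figure below shows, the primary and standby have switched roles, and the original primary has rejoined the cluster as the new standby. ===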

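IV. Summary

1. When the primary's NIC goes down, the standby can no longer communicate with the primary; once the failure threshold is exceeded, the cluster triggers a primary/standby failover.

2. With recovery='automatic' configured, once the original primary's NIC is back up the cluster runs recovery and the original primary rejoins the cluster as the new standby.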
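
The "threshold" in point 1 can be read off the standby's log: it reports 10 reconnection attempts with a 6-second sleep between them. The short Python sketch below compares that nominal window with what the timestamps actually show. It is only an illustration. In upstream repmgr these two values correspond to reconnect_attempts and reconnect_interval, but the configuration of this cluster is not shown in the article, so treat that mapping as an assumption.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Values read from the hamgr.log excerpt in this article.
attempts = 10   # "checking state of node 1, 10 of 10 attempts"
sleep_s = 6     # "sleeping 6 seconds until next reconnection attempt"
first_attempt = datetime.strptime("2022-03-29 16:40:04", FMT)  # attempt 1 of 10
last_attempt = datetime.strptime("2022-03-29 16:41:52", FMT)   # attempt 10 of 10

# Nominal window if every probe failed instantly.
nominal_s = attempts * sleep_s

# Observed time per attempt, averaged over the 9 gaps between 10 attempts.
cycle_s = (last_attempt - first_attempt).total_seconds() / (attempts - 1)

print(nominal_s, cycle_s)
```

In this run each cycle took about twice the configured sleep, presumably because every probe also had to wait for the unreachable host to time out. When estimating how long a failover takes, the sleep interval alone is therefore likely to underestimate it.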
