Case 2: Testing 'recovery = automatic'

1. Check the cluster node status:

  [kingbase@node1 bin]$ ./repmgr cluster show
  ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
  ----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
  1 | node243 | primary | * running | | default | 100 | 3 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
  2 | node248 | standby | running | node243 | default | 100 | 3 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
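When scripting a test like this one, it can help to read the current primary's name from the table rather than by eye. The following is a minimal sketch, assuming the pipe-separated column order shown above (ID | Name | Role | ...); the function name `primary_node` is hypothetical and not part of repmgr.

```shell
# Print the Name column of every row whose Role column is "primary".
# Reads `repmgr cluster show` output on stdin and assumes the
# "ID | Name | Role | Status | ..." layout shown above.
primary_node() {
  awk -F'|' '
    {
      name = $2; role = $3
      # trim the padding around each field
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", role)
      if (role == "primary") print name
    }'
}
```

It could be used as `./repmgr cluster show | primary_node`, which for the table above would print the name of the node listed with the primary role.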

2. Configure the recovery parameter

  [kingbase@node3 bin]$ cat ../etc/repmgr.conf |egrep -i 'recovery|failover'
  failover='automatic'
  recovery='automatic'
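Before rebooting the primary, the same check can be made scriptable so that it fails loudly when either parameter is not set as expected. This is a rough sketch, not a KingbaseES tool: the function name `check_auto_recovery` is hypothetical, and it assumes one `key='value'` setting per line, as in the output above. It does not handle a key that appears more than once in the file.

```shell
# Return 0 only if both failover and recovery are set to automatic in the
# given repmgr.conf. Quotes and surrounding whitespace are optional, and a
# trailing "# comment" is tolerated.
check_auto_recovery() {
  conf=$1
  for key in failover recovery; do
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*'?automatic'?[[:space:]]*(#.*)?\$" "$conf" || {
      echo "${key} is not set to automatic in ${conf}" >&2
      return 1
    }
  done
}
```

For example, `check_auto_recovery ../etc/repmgr.conf && echo "ready to test"` would print the message only when both settings match.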

3. Reboot the primary node to test

[root@node3 ~]# reboot

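4. Check the hamgr log on the standby

As the log below shows, after the primary node went down the cluster performed a primary/standby switchover, and once the primary node's system was back to normal, the original primary was automatically rejoined to the cluster as the new standby.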

  [2022-03-01 14:38:09] [NOTICE] starting monitoring of node "node248" (ID: 2)
  [2022-03-01 14:38:09] [INFO] "connection_check_type" set to "ping"
  [2022-03-01 14:38:10] [INFO] monitoring connection to upstream node "node243" (ID: 1)
  [2022-03-01 14:38:10] [NOTICE] try to change wal catched_up state to 1
  [2022-03-01 14:38:10] [INFO] primary flush lsn is 0/17000578, local flush lsn is 0/170004C0
  [2022-03-01 14:38:10] [NOTICE] try to change streaming_sync state to TRUE
  [2022-03-01 14:43:11] [INFO] node "node248" (ID: 2) monitoring upstream node "node243" (ID: 1) in normal state
  [2022-03-01 14:46:42] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
  [2022-03-01 14:46:42] [DETAIL] PQping() returned "PQPING_REJECT"
  [2022-03-01 14:46:42] [WARNING] unable to connect to upstream node "node243" (ID: 1)
  [2022-03-01 14:46:42] [INFO] sleeping 6 seconds until next reconnection attempt
  [2022-03-01 14:46:48] [INFO] checking state of node 1, 1 of 10 attempts
  [2022-03-01 14:46:58] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
  [2022-03-01 14:46:58] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2022-03-01 14:46:58] [INFO] sleeping 6 seconds until next reconnection attempt
  ......
  [2022-03-01 14:48:59] [INFO] checking state of node 1, 10 of 10 attempts
  [2022-03-01 14:48:59] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
  [2022-03-01 14:48:59] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2022-03-01 14:48:59] [WARNING] unable to reconnect to node 1 after 10 attempts
  [2022-03-01 14:48:59] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
  [2022-03-01 14:49:00] [WARNING] wal receiver not running
  [2022-03-01 14:49:00] [NOTICE] WAL receiver disconnected on all sibling nodes
  [2022-03-01 14:49:00] [INFO] WAL receiver disconnected on all 0 sibling nodes
  [2022-03-01 14:49:00] [INFO] 0 active sibling nodes registered
  [2022-03-01 14:49:00] [INFO] primary and this node have the same location ("default")
  [2022-03-01 14:49:00] [INFO] no other sibling nodes - we win by default
  [2022-03-01 14:49:00] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
  [2022-03-01 14:49:00] [NOTICE] this node is the only available candidate and will now promote itself
  [2022-03-01 14:49:00] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
  [2022-03-01 14:49:02] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.
  --- 192.168.7.1 ping statistics ---
  2 packets transmitted, 2 received, 0% packet loss, time 1002ms
  rtt min/avg/max/mdev = 2.345/22.599/42.853/20.254 ms
  [2022-03-01 14:49:02] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
  [2022-03-01 14:49:04] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.
  --- 192.168.7.241 ping statistics ---
  3 packets transmitted, 0 received, 100% packet loss, time 1999ms
  [2022-03-01 14:49:04] [WARNING] ping host"192.168.7.241" failed
  [2022-03-01 14:49:04] [DETAIL] average RTT value is not greater than zero
  [2022-03-01 14:49:04] [INFO] loadvip result: 1, arping result: 1
  [2022-03-01 14:49:04] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
  [2022-03-01 14:49:04] [INFO] promote_command is:
  "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
  NOTICE: promoting standby to primary
  DETAIL: promoting server "node248" (ID: 2) using sys_promote()
  NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
  INFO: SET synchronous TO "async" on primary host
  [2022-03-01 14:49:07] [NOTICE] try to stop old primary db (host: "192.168.7.243")
  NOTICE: STANDBY PROMOTE successful
  DETAIL: server "node248" (ID: 2) was successfully promoted to primary
  [2022-03-01 14:49:11] [INFO] switching to primary monitoring mode
  [2022-03-01 14:49:11] [NOTICE] monitoring cluster primary "node248" (ID: 2)
  [2022-03-01 14:49:11] [INFO] create a thread 0x7f1b4b125700 to check the cluster status
  [2022-03-01 14:49:11] [INFO] child node: 1; attached: no
  [2022-03-01 14:49:11] [INFO] check node status again, try 1 / 10 times
  [2022-03-01 14:49:12] [INFO] node (ID: 1): no server running
  .......
  [2022-03-01 14:49:29] [INFO] check node status again, try 10 / 10 times
  [2022-03-01 14:49:31] [INFO] child node: 1; attached: no
  [2022-03-01 14:49:31] [INFO] found node down, recovery will be triggered after recovery delay time 20s
  [2022-03-01 14:49:33] [INFO] child node: 1; attached: no
  ......
  [2022-03-01 14:49:52] [INFO] child node: 1; attached: no
  [2022-03-01 14:49:52] [INFO] recovery delay time reached. can do recovery now.
  [2022-03-01 14:49:52] [INFO] [thread pid:11778] do_nodes_recovery thread begin. The pthread_t tid is 0x7f1b4b125700
  [2022-03-01 14:49:52] [NOTICE] [thread pid:11778] node (ID: 1; host: "192.168.7.243") is not attached, ready to auto-recovery
  [2022-03-01 14:49:52] [NOTICE] [thread pid:11778] Now, the primary host ip: 192.168.7.248
  [2022-03-01 14:49:52] [INFO] [thread pid:11778] ES connection to host "192.168.7.243" succeeded, ready to do auto-recovery
  [2022-03-01 14:49:53] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
  [2022-03-01 14:49:53] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr --dbname="host=192.168.7.248 dbname=esrep user=esrep port=54321" node rejoin --force-rewind"
  NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
  DETAIL: rejoin target server's timeline 8 forked off current database system timeline 7 before current recovery point 0/18000028
  NOTICE: executing sys_rewind
  DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
  sys_rewind: servers diverged at WAL location 0/17000680 on timeline 7
  sys_rewind: rewinding from last common checkpoint at 0/160007C8 on timeline 7
  sys_rewind: find last common checkpoint start time from 2022-03-01 14:49:53.170681 CST to 2022-03-01 14:49:53.296332 CST, in "0.125651" seconds.
  sys_rewind: update the control file: minRecoveryPoint is '0/1700DE58', minRecoveryPointTLI is '8', and database state is 'in archive recovery'
  sys_rewind: we will remove the dir '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data/sys_replslot/repmgr_slot_2.rewind' and all the file/dir in it.
  sys_rewind: rewind start wal location 0/16000798 (file 000000070000000000000016), end wal location 0/1700DE58 (file 000000080000000000000017). time from 2022-03-01 14:49:53.170681 CST to 2022-03-01 14:50:06.920859 CST, in "13.750178" seconds.
  sys_rewind: Done!
  NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
  NOTICE: setting node 1's upstream to node 2
  WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
  DETAIL: PQping() returned "PQPING_NO_RESPONSE"
  NOTICE: begin to start server at 2022-03-01 14:50:07.530887
  NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
  NOTICE: start server finish at 2022-03-01 14:50:08.952996
  NOTICE: NODE REJOIN successful
  DETAIL: node 1 is now attached to node 2
  [2022-03-01 14:50:09] [NOTICE] kbha: node (ID: 1) rejoin success.
  [2022-03-01 14:50:10] [NOTICE] [thread pid:11778] node "node243" (ID: 1) auto-recovery success
  [2022-03-01 14:50:10] [INFO] [thread pid:11778] do_nodes_recovery thread ends. The pthread_t tid is 0x7f1b4b125700
  [2022-03-01 14:50:10] [INFO] SET synchronous TO "sync" on primary host
  [2022-03-01 14:50:10] [INFO] thread tid:0x7f1b4b125700 is not running
  [2022-03-01 14:50:10] [INFO] the recovery thread was exited, reset tid
  [2022-03-01 14:50:10] [NOTICE] Some nodes reconnect, all standby nodes are OK now
  [2022-03-01 14:50:12] [NOTICE] new standby "node243" (ID: 1) has connected
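To put numbers on the sequence in this log, the gap between two milestones can be computed from the timestamps. The helper below is an illustrative sketch only: the name `elapsed_between` is hypothetical, it relies on the "[YYYY-MM-DD HH:MM:SS]" prefix seen in the log, and it assumes both events fall on the same day. Lines without a timestamp prefix (such as the repmgr NOTICE/DETAIL output) are skipped.

```shell
# Print the seconds elapsed between the first timestamped line containing
# START ($1) and the first later timestamped line containing END ($2).
# Reads the log on stdin; both arguments are plain substrings, not regexes.
elapsed_between() {
  awk -v start="$1" -v end="$2" '
    function secs(line,   t) {
      # characters 13-20 of "[YYYY-MM-DD HH:MM:SS]" hold HH:MM:SS
      split(substr(line, 13, 8), t, ":")
      return t[1] * 3600 + t[2] * 60 + t[3]
    }
    /^\[[0-9-]+ [0-9:]+\]/ {
      if (s == "" && index($0, start)) { s = secs($0); next }
      if (s != "" && index($0, end))   { print secs($0) - s; exit }
    }'
}
```

Run against the log above (the file name `hamgr.log` here is a placeholder), `elapsed_between 'unable to ping' 'switching to primary monitoring mode' < hamgr.log` gives 149 seconds from the first failed ping (14:46:42) to the standby entering primary monitoring mode (14:49:11), and using `'auto-recovery success'` as the end marker gives 208 seconds until the old primary was back as a standby (14:50:10).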

5. Check the database processes and cluster status on the standby

  [kingbase@node3 bin]$ ps -ef |grep kingbase
  kingbase 2654 1 0 14:49 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
  kingbase 3462 1 0 14:50 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kingbase -D /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
  kingbase 3463 3462 0 14:50 ? 00:00:00 kingbase: logger
  kingbase 3464 3462 0 14:50 ? 00:00:00 kingbase: startup recovering 000000080000000000000017
  kingbase 3465 3462 0 14:50 ? 00:00:00 kingbase: checkpointer
  kingbase 3466 3462 0 14:50 ? 00:00:00 kingbase: background writer
  kingbase 3467 3462 0 14:50 ? 00:00:00 kingbase: stats collector
  kingbase 3468 3462 0 14:50 ? 00:00:00 kingbase: walreceiver streaming 0/1700F160
  kingbase 3471 3462 0 14:50 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(57348) idle
  kingbase 3522 1 0 14:50 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
  kingbase 3523 3462 0 14:50 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(57351) idle
  [kingbase@node3 bin]$ ./repmgr cluster show
  ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
  ----+---------+---------+-----------+----------+----------+----------+----------+--------------------------
  1 | node243 | standby | running | node248 | default | 100 | 7 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
  2 | node248 | primary | * running | | default | 100 | 8 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
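The final state can also be asserted in a script instead of being read off the table. The sketch below assumes the same column order as the output above and treats any Status other than "running" (with or without the leading "*" that marks the primary) as a failure; `all_nodes_running` is a hypothetical helper name, not a repmgr subcommand.

```shell
# Succeed only if at least one node row is present and every node row has a
# Status of "running". Reads `repmgr cluster show` output on stdin.
all_nodes_running() {
  awk -F'|' '
    $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {            # data rows start with a numeric ID
      rows++
      status = $4
      gsub(/^[ \t*]+|[ \t]+$/, "", status)    # drop padding and the "*" marker
      if (status != "running") bad = 1
    }
    END { exit (rows == 0 || bad) }'
}
```

A possible use after the rejoin completes: `./repmgr cluster show | all_nodes_running && echo "cluster healthy"`.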

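The information above shows that once the original primary node's system returned to normal, the cluster automatically rejoined it as the new standby.

To be continued.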

