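Case description: In a KingbaseES R6 cluster, when the primary node goes down (for example, a reboot or shutdown), a primary/standby failover occurs. Once the original primary's operating system is back to normal, how that node is handled so that cluster data stays consistent and safe is controlled by the recovery parameter in repmgr.conf.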

This case records how the cluster recovers after the primary host reboots under each of the three settings of the 'recovery' parameter.

Note: older KingbaseES R6 versions support only 'manual' and 'automatic' for the recovery parameter.

Database version:

Cluster architecture:

Cluster node information:

Case 1: Testing 'recovery = standby'

I. Failover test

1. Configure the recovery parameter (on all nodes):
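A minimal sketch of how the setting might be applied on each node, assuming repmgr.conf uses the key='value' form that the egrep output in Case 2 shows. The helper name `set_recovery_mode` is hypothetical, and whether the HA daemons must be restarted for the change to take effect is not covered here.

```shell
# Hypothetical helper: set recovery='<mode>' in a repmgr.conf-style file.
# Only the three values discussed in this article are accepted.
set_recovery_mode() {
    local conf="$1" mode="$2"
    case "$mode" in
        standby|automatic|manual) ;;
        *) echo "unsupported recovery mode: $mode" >&2; return 1 ;;
    esac
    if grep -Eq '^[[:space:]]*recovery[[:space:]]*=' "$conf"; then
        # Rewrite the existing line; go through a temp file instead of sed -i
        # so the sketch does not depend on GNU sed.
        sed "s/^[[:space:]]*recovery[[:space:]]*=.*/recovery='${mode}'/" "$conf" > "${conf}.tmp" \
            && mv "${conf}.tmp" "$conf"
    else
        # No recovery line yet: append one.
        printf "recovery='%s'\n" "$mode" >> "$conf"
    fi
}
```

Example call, with the path being an assumption: `set_recovery_mode ../etc/repmgr.conf standby`. It has to be run on every node so that all nodes agree on the setting.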

2. Check the cluster node status

[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
1 | node243 | primary | * running | | default | 100 | 3 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | standby | running | node243 | default | 100 | 3 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

3. Reboot the primary node's operating system

[root@node3 ~]# reboot

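4. Check the hamgr log on the standby

The hamgr log shows that after the original primary went down, the cluster failed over and the original standby was promoted to primary.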

[kingbase@node1 log]$ tail -n 100 -f hamgr.log
[2022-03-01 13:12:23] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2022-03-01 13:12:23] [INFO] connecting to database "host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
INFO: set_repmgrd_pid(): provided pidfile is /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/hamgrd.pid
[2022-03-01 13:12:23] [NOTICE] starting monitoring of node "node248" (ID: 2)
[2022-03-01 13:12:23] [INFO] "connection_check_type" set to "ping"
[2022-03-01 13:12:23] [INFO] monitoring connection to upstream node "node243" (ID: 1)
[2022-03-01 13:12:23] [NOTICE] try to change wal catched_up state to 1
[2022-03-01 13:12:23] [INFO] primary flush lsn is 0/12000900, local flush lsn is 0/12000848
[2022-03-01 13:12:23] [NOTICE] try to change streaming_sync state to TRUE
[2022-03-01 13:17:24] [INFO] node "node248" (ID: 2) monitoring upstream node "node243" (ID: 1) in normal state
[2022-03-01 13:20:00] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-01 13:20:00] [DETAIL] PQping() returned "PQPING_REJECT"
[2022-03-01 13:20:00] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-01 13:20:00] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-01 13:20:06] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-01 13:20:16] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 13:20:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 13:20:16] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-01 13:21:23] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-01 13:21:23] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 13:21:23] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 13:21:23] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-01 13:21:23] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-01 13:21:23] [WARNING] wal receiver not running
[2022-03-01 13:21:23] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-01 13:21:23] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-01 13:21:23] [INFO] 0 active sibling nodes registered
[2022-03-01 13:21:23] [INFO] primary and this node have the same location ("default")
[2022-03-01 13:21:23] [INFO] no other sibling nodes - we win by default
[2022-03-01 13:21:23] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-01 13:21:23] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-01 13:21:23] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
[2022-03-01 13:21:25] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.
--- 192.168.7.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 2.324/6.238/10.152/3.914 ms
[2022-03-01 13:21:25] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
[2022-03-01 13:21:26] [NOTICE] try to stop old primary db (host: "192.168.7.243")
[2022-03-01 13:21:26] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.
--- 192.168.7.241 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms
[2022-03-01 13:21:26] [WARNING] ping host"192.168.7.241" failed
[2022-03-01 13:21:26] [DETAIL] average RTT value is not greater than zero
[2022-03-01 13:21:26] [INFO] loadvip result: 1, arping result: 1
[2022-03-01 13:21:26] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
[2022-03-01 13:21:26] [INFO] promote_command is:
"/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node248" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node248" (ID: 2) was successfully promoted to primary
[2022-03-01 13:21:30] [INFO] switching to primary monitoring mode
[2022-03-01 13:21:30] [NOTICE] monitoring cluster primary "node248" (ID: 2)
[2022-03-01 13:21:30] [INFO] create a thread 0x7fe7dbe15700 to check the cluster status
[2022-03-01 13:21:30] [INFO] node (ID: 1): no server running
[2022-03-01 13:21:31] [INFO] [thread 0x7fe7dbe15700] the cluster has no other running primary node, exit
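The log above is long, and only a handful of lines mark the actual failover milestones. The following is a small sketch for pulling those lines out of a hamgr.log; the function name `failover_events` is hypothetical, and the patterns are simply copied from the messages in the log above, so they may need adjusting for other versions.

```shell
# Hypothetical helper: print only the log lines that mark failover milestones.
# Reads the files given as arguments, or standard input when none are given.
failover_events() {
    grep -E 'unable to reconnect to node|will now promote itself|acquire the virtual ip|STANDBY PROMOTE successful|switching to primary monitoring mode' "$@"
}
```

Example call: `failover_events hamgr.log`. For the log above this would list the failed reconnection, the promotion decision, the virtual IP takeover, the successful promotion and the switch to primary monitoring mode.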

II. Rejoining the original primary to the cluster after its system recovers

1. Create the replication slots on the new primary

test=# select sys_create_physical_replication_slot('repmgr_slot_1');
sys_create_physical_replication_slot
--------------------------------------
(repmgr_slot_1,)
(1 row)
test=# select sys_create_physical_replication_slot('repmgr_slot_2');
sys_create_physical_replication_slot
--------------------------------------
(repmgr_slot_2,)
(1 row)
test=# select * from sys_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
repmgr_slot_1 | | physical | | | f | f | | | | |
repmgr_slot_2 | | physical | | | f | f | | | | |
(2 rows)

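2. After the original primary's operating system has finished booting:

1) Back up the data directory of the new standby node (the original primary)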

[kingbase@node3 kingbase]$ cp -r data data.bk

2) Create the standby signal file under data (important)

[kingbase@node3 data]$ touch standby.signal

3) Check the connection string settings of the new standby

[kingbase@node3 data]$ cat kingbase.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
job_queue_processes = '5'
primary_conninfo = 'user=esrep connect_timeout=10 host=192.168.7.248 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node243'
recovery_target_timeline = 'latest'
primary_slot_name = 'repmgr_slot_1'
wal_retrieve_retry_interval = '5000'
synchronous_standby_names = '1 (*)'
wal_retrieve_retry_interval = '5000'
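Before starting the database, the two preconditions from steps 2) and 3) can be checked in one go. The sketch below assumes the file names shown above (standby.signal and kingbase.auto.conf inside the data directory); the function name `check_standby_ready` is hypothetical.

```shell
# Hypothetical pre-start check for the node that is about to run as a standby.
# $1: data directory, $2: host of the new primary.
# Note: dots in the host are treated as regex wildcards, which is good enough
# for a sanity check but not a strict match.
check_standby_ready() {
    local data_dir="$1" primary_host="$2"
    if [ ! -f "${data_dir}/standby.signal" ]; then
        echo "missing ${data_dir}/standby.signal" >&2
        return 1
    fi
    if ! grep -Eq "^primary_conninfo[[:space:]]*=.*host=${primary_host}([[:space:]']|\$)" \
            "${data_dir}/kingbase.auto.conf"; then
        echo "primary_conninfo does not point at ${primary_host}" >&2
        return 1
    fi
    return 0
}
```

Example call for this cluster: `check_standby_ready ../data 192.168.7.248`. A non-zero exit status means the node should not be started yet.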

4) Start the database service on the new standby

[kingbase@node3 bin]$ ./sys_ctl start -D ../data
......
NOTICE: standby node "node243" (ID: 1) successfully registered

5) Check the current cluster node status

[kingbase@node3 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+----------------------+----------+----------+----------+----------+----------------
1 | node243 | primary | ! running as standby | | default | 100 | 3 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | standby | ! running as primary | | default | 100 | 4 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
WARNING: following issues were detected
- node "node243" (ID: 1) is registered as primary but running as standby
- node "node248" (ID: 2) is registered as standby but running as primary
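In the output above, a status that starts with "!" marks a node whose registered role differs from the role it is actually running in. A rough sketch for picking out such nodes from the `repmgr cluster show` table follows; the function name `role_mismatches` is hypothetical, and the parsing assumes the nine-column, pipe-separated layout shown above.

```shell
# Hypothetical helper: list nodes whose Status column starts with "!".
# Reads `repmgr cluster show` output from files or standard input.
role_mismatches() {
    awk -F'|' 'NF >= 9 && $4 ~ /^[ \t]*!/ {
        name = $2;   gsub(/^[ \t]+|[ \t]+$/, "", name)
        status = $4; gsub(/^[ \t]+|[ \t]+$/, "", status)
        print name ": " status
    }' "$@"
}
```

Example call: `./repmgr cluster show | role_mismatches`. For the table above it would print one line for node243 and one for node248; empty output means no mismatch was reported.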

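6) The cluster automatically recovers the new standby

As the hamgr log below shows, once the database service on the new standby is started, the cluster automatically runs recovery on it and rejoins the original primary to the cluster as a standby.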

[2022-03-01 13:26:31] [INFO] monitoring primary node "node248" (ID: 2) in normal state
[2022-03-01 13:27:28] [INFO] child node: 1; attached: no
[2022-03-01 13:27:28] [INFO] check node status again, try 1 / 10 times
[2022-03-01 13:27:30] [INFO] child node: 1; attached: no
.....
[2022-03-01 13:27:46] [INFO] check node status again, try 10 / 10 times
[2022-03-01 13:27:48] [INFO] child node: 1; attached: no
[2022-03-01 13:27:48] [INFO] found node down, recovery will be triggered after recovery delay time 20s
[2022-03-01 13:27:50] [INFO] child node: 1; attached: no
......
[2022-03-01 13:28:08] [INFO] child node: 1; attached: no
[2022-03-01 13:28:08] [INFO] recovery delay time reached. can do recovery now.
[2022-03-01 13:28:09] [NOTICE] mark node "node243" (ID: 1) as inactive
[2022-03-01 13:28:09] [INFO] [thread pid:30763] do_nodes_recovery thread begin. The pthread_t tid is 0x7fe7dbe15700
[2022-03-01 13:28:09] [NOTICE] [thread pid:30763] node (ID: 1; host: "192.168.7.243") is not attached, ready to auto-recovery
[2022-03-01 13:28:09] [NOTICE] [thread pid:30763] Now, the primary host ip: 192.168.7.248
[2022-03-01 13:28:10] [INFO] [thread pid:30763] ES connection to host "192.168.7.243" succeeded, ready to do auto-recovery
[2022-03-01 13:28:10] [NOTICE] kbha: node (ID: 1) is running as standby, stop it and do rejoin.
[2022-03-01 13:28:15] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
[2022-03-01 13:28:15] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr --dbname="host=192.168.7.248 dbname=esrep user=esrep port=54321" node rejoin --force-rewind"
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/130000A0
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/12000A08 on timeline 3
sys_rewind: rewinding from last common checkpoint at 0/11000058 on timeline 3
sys_rewind: find last common checkpoint start time from 2022-03-01 13:28:15.600702 CST to 2022-03-01 13:28:16.200048 CST, in "0.599346" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/12011F70', minRecoveryPointTLI is '4', and database state is 'in archive recovery'
sys_rewind: rewind start wal location 0/11000028 (file 000000030000000000000011), end wal location 0/12011F70 (file 000000040000000000000012). time from 2022-03-01 13:28:15.600702 CST to 2022-03-01 13:28:36.045129 CST, in "20.444427" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-01 13:28:36.437003
NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-01 13:28:37.367954
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[2022-03-01 13:28:38] [NOTICE] kbha: node (ID: 1) rejoin success.
[2022-03-01 13:28:38] [NOTICE] [thread pid:30763] node "node243" (ID: 1) auto-recovery success
[2022-03-01 13:28:38] [INFO] [thread pid:30763] do_nodes_recovery thread ends. The pthread_t tid is 0x7fe7dbe15700
[2022-03-01 13:28:39] [INFO] SET synchronous TO "sync" on primary host
[2022-03-01 13:28:39] [INFO] thread tid:0x7fe7dbe15700 is not running
[2022-03-01 13:28:39] [INFO] the recovery thread was exited, reset tid
[2022-03-01 13:28:39] [NOTICE] Some nodes reconnect, all standby nodes are OK now
[2022-03-01 13:28:41] [NOTICE] new standby "node243" (ID: 1) has connected
[2022-03-01 13:31:31] [INFO] monitoring primary node "node248" (ID: 2) in normal state
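The central step in the log above is the `repmgr ... node rejoin --force-rewind` call that the daemon issues after stopping the database on node 1. As an illustration only, the sketch below assembles the same command line from the two values that vary; it prints the command and does not run it. The function name `build_rejoin_cmd` is hypothetical, and the database name, user and port are copied from the log rather than taken from any configuration.

```shell
# Hypothetical helper: print (not execute) the rejoin command seen in the log.
# $1: directory containing the repmgr binary, $2: host of the current primary.
build_rejoin_cmd() {
    local bindir="$1" primary_host="$2"
    printf '%s/repmgr --dbname="host=%s dbname=esrep user=esrep port=54321" node rejoin --force-rewind\n' \
        "$bindir" "$primary_host"
}
```

Example call: `build_rejoin_cmd /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin 192.168.7.248`. Per the log, the database on the rejoining node was stopped before this command ran.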

7) Check the database processes on the standby

8) The original primary has rejoined the cluster as the new standby

[kingbase@node3 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
1 | node243 | standby | running | node248 | default | 100 | 5 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | primary | * running | | default | 100 | 6 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

9) Query streaming replication information on the primary

test=# select * from sys_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+-------------
repmgr_slot_1 | | physical | | | f | t | 30928 | 1437 | | 0/120130A8 |
repmgr_slot_2 | | physical | | | f | f | | | | |
(2 rows)
test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+-----------
30928 | 16384 | esrep | node243 | 192.168.7.243 | | 10817 | 2022-03-01 13:28:37.941077+08 | | streaming | 0/120130A8 | 0/120130A8 | 0/120130A8 | 0/120130A8 | | | | 1 | sync | 2022-03-01 13:32:08.445325+08
(1 row)
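The LSN columns above are all 0/120130A8, which means the standby has received, flushed and replayed everything the primary has sent. When the values differ, the gap in bytes can be worked out by treating an LSN of the form HI/LO as two hexadecimal numbers, with HI counting units of 2^32 bytes. A minimal sketch, with hypothetical function names:

```shell
# Convert an LSN such as 0/12000900 into a byte position.
lsn_to_bytes() {
    local hi="${1%%/*}" lo="${1##*/}"
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Byte distance between two LSNs (first minus second).
lsn_diff() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

For example, the first hamgr log in this article reports a primary flush LSN of 0/12000900 against a local flush LSN of 0/12000848; `lsn_diff 0/12000900 0/12000848` gives 184, so the standby was 184 bytes behind at that moment.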

Case 2: Testing 'recovery = automatic'

1. Check the cluster node status:

[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
1 | node243 | primary | * running | | default | 100 | 3 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | standby | running | node243 | default | 100 | 3 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2. Configure the recovery parameter

[kingbase@node3 bin]$ cat ../etc/repmgr.conf |egrep -i 'recovery|failover'
failover='automatic'
recovery='automatic'

3. Reboot the primary node

[root@node3 ~]# reboot

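4. Check the hamgr log on the standby

As the log below shows, after the primary node went down the cluster failed over, and once the primary node's system was back to normal, the original primary was automatically rejoined to the cluster as the new standby.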

[2022-03-01 14:38:09] [NOTICE] starting monitoring of node "node248" (ID: 2)
[2022-03-01 14:38:09] [INFO] "connection_check_type" set to "ping"
[2022-03-01 14:38:10] [INFO] monitoring connection to upstream node "node243" (ID: 1)
[2022-03-01 14:38:10] [NOTICE] try to change wal catched_up state to 1
[2022-03-01 14:38:10] [INFO] primary flush lsn is 0/17000578, local flush lsn is 0/170004C0
[2022-03-01 14:38:10] [NOTICE] try to change streaming_sync state to TRUE
[2022-03-01 14:43:11] [INFO] node "node248" (ID: 2) monitoring upstream node "node243" (ID: 1) in normal state
[2022-03-01 14:46:42] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-01 14:46:42] [DETAIL] PQping() returned "PQPING_REJECT"
[2022-03-01 14:46:42] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-01 14:46:42] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-01 14:46:48] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-01 14:46:58] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 14:46:58] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 14:46:58] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-01 14:48:59] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-01 14:48:59] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 14:48:59] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 14:48:59] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-01 14:48:59] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-01 14:49:00] [WARNING] wal receiver not running
[2022-03-01 14:49:00] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-01 14:49:00] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-01 14:49:00] [INFO] 0 active sibling nodes registered
[2022-03-01 14:49:00] [INFO] primary and this node have the same location ("default")
[2022-03-01 14:49:00] [INFO] no other sibling nodes - we win by default
[2022-03-01 14:49:00] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-01 14:49:00] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-01 14:49:00] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
[2022-03-01 14:49:02] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.
--- 192.168.7.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 2.345/22.599/42.853/20.254 ms
[2022-03-01 14:49:02] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
[2022-03-01 14:49:04] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.
--- 192.168.7.241 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms
[2022-03-01 14:49:04] [WARNING] ping host"192.168.7.241" failed
[2022-03-01 14:49:04] [DETAIL] average RTT value is not greater than zero
[2022-03-01 14:49:04] [INFO] loadvip result: 1, arping result: 1
[2022-03-01 14:49:04] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
[2022-03-01 14:49:04] [INFO] promote_command is:
"/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node248" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
INFO: SET synchronous TO "async" on primary host
[2022-03-01 14:49:07] [NOTICE] try to stop old primary db (host: "192.168.7.243")
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node248" (ID: 2) was successfully promoted to primary
[2022-03-01 14:49:11] [INFO] switching to primary monitoring mode
[2022-03-01 14:49:11] [NOTICE] monitoring cluster primary "node248" (ID: 2)
[2022-03-01 14:49:11] [INFO] create a thread 0x7f1b4b125700 to check the cluster status
[2022-03-01 14:49:11] [INFO] child node: 1; attached: no
[2022-03-01 14:49:11] [INFO] check node status again, try 1 / 10 times
[2022-03-01 14:49:12] [INFO] node (ID: 1): no server running
.......
[2022-03-01 14:49:29] [INFO] check node status again, try 10 / 10 times
[2022-03-01 14:49:31] [INFO] child node: 1; attached: no
[2022-03-01 14:49:31] [INFO] found node down, recovery will be triggered after recovery delay time 20s
[2022-03-01 14:49:33] [INFO] child node: 1; attached: no
......
[2022-03-01 14:49:52] [INFO] child node: 1; attached: no
[2022-03-01 14:49:52] [INFO] recovery delay time reached. can do recovery now.
[2022-03-01 14:49:52] [INFO] [thread pid:11778] do_nodes_recovery thread begin. The pthread_t tid is 0x7f1b4b125700
[2022-03-01 14:49:52] [NOTICE] [thread pid:11778] node (ID: 1; host: "192.168.7.243") is not attached, ready to auto-recovery
[2022-03-01 14:49:52] [NOTICE] [thread pid:11778] Now, the primary host ip: 192.168.7.248
[2022-03-01 14:49:52] [INFO] [thread pid:11778] ES connection to host "192.168.7.243" succeeded, ready to do auto-recovery
[2022-03-01 14:49:53] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
[2022-03-01 14:49:53] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr --dbname="host=192.168.7.248 dbname=esrep user=esrep port=54321" node rejoin --force-rewind"
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 8 forked off current database system timeline 7 before current recovery point 0/18000028
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/17000680 on timeline 7
sys_rewind: rewinding from last common checkpoint at 0/160007C8 on timeline 7
sys_rewind: find last common checkpoint start time from 2022-03-01 14:49:53.170681 CST to 2022-03-01 14:49:53.296332 CST, in "0.125651" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/1700DE58', minRecoveryPointTLI is '8', and database state is 'in archive recovery'
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data/sys_replslot/repmgr_slot_2.rewind' and all the file/dir in it.
sys_rewind: rewind start wal location 0/16000798 (file 000000070000000000000016), end wal location 0/1700DE58 (file 000000080000000000000017). time from 2022-03-01 14:49:53.170681 CST to 2022-03-01 14:50:06.920859 CST, in "13.750178" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-01 14:50:07.530887
NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-01 14:50:08.952996
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[2022-03-01 14:50:09] [NOTICE] kbha: node (ID: 1) rejoin success.
[2022-03-01 14:50:10] [NOTICE] [thread pid:11778] node "node243" (ID: 1) auto-recovery success
[2022-03-01 14:50:10] [INFO] [thread pid:11778] do_nodes_recovery thread ends. The pthread_t tid is 0x7f1b4b125700
[2022-03-01 14:50:10] [INFO] SET synchronous TO "sync" on primary host
[2022-03-01 14:50:10] [INFO] thread tid:0x7f1b4b125700 is not running
[2022-03-01 14:50:10] [INFO] the recovery thread was exited, reset tid
[2022-03-01 14:50:10] [NOTICE] Some nodes reconnect, all standby nodes are OK now
[2022-03-01 14:50:12] [NOTICE] new standby "node243" (ID: 1) has connected
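The timestamps in this log make it possible to work out how long each phase took. The sketch below subtracts two HH:MM:SS values taken from the same day; the function names are hypothetical, and it deliberately ignores dates, so it is only suitable for events that do not cross midnight.

```shell
# Convert HH:MM:SS into seconds since midnight.
# A single leading zero is stripped so that 08 and 09 are not read as octal.
to_seconds() {
    local h="${1%%:*}" rest="${1#*:}"
    local m="${rest%%:*}" s="${rest#*:}"
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Seconds elapsed from the first timestamp to the second (same day only).
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}
```

Applied to the log above: the first failed ping is logged at 14:46:42 and the switch to primary monitoring mode at 14:49:11, so `elapsed_seconds 14:46:42 14:49:11` gives 149 seconds for the failover. The new standby reconnects at 14:50:12, and `elapsed_seconds 14:46:42 14:50:12` gives 210 seconds from the first failure to a fully restored two-node cluster.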

5. Check the database processes and cluster status on the standby

[kingbase@node3 bin]$ ps -ef |grep kingbase
kingbase 2654 1 0 14:49 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase 3462 1 0 14:50 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kingbase -D /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
kingbase 3463 3462 0 14:50 ? 00:00:00 kingbase: logger
kingbase 3464 3462 0 14:50 ? 00:00:00 kingbase: startup recovering 000000080000000000000017
kingbase 3465 3462 0 14:50 ? 00:00:00 kingbase: checkpointer
kingbase 3466 3462 0 14:50 ? 00:00:00 kingbase: background writer
kingbase 3467 3462 0 14:50 ? 00:00:00 kingbase: stats collector
kingbase 3468 3462 0 14:50 ? 00:00:00 kingbase: walreceiver streaming 0/1700F160
kingbase 3471 3462 0 14:50 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(57348) idle
kingbase 3522 1 0 14:50 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase 3523 3462 0 14:50 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(57351) idle
[kingbase@node3 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+--------------------------
1 | node243 | standby | running | node248 | default | 100 | 7 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | primary | * running | | default | 100 | 8 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

Case 3: Testing 'recovery = manual'

1. Check the cluster node status:

[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
1 | node243 | primary | * running | | default | 100 | 3 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | standby | running | node243 | default | 100 | 3 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2. Check the recovery configuration
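One way to read the current value is a small lookup over repmgr.conf, sketched below under the assumption that the file uses the key='value' form seen in Case 2. The function name `get_conf_value` is hypothetical; when a key appears more than once, the last uncommented occurrence is reported.

```shell
# Hypothetical helper: print the value of one key from a repmgr.conf-style file.
# $1: path to the file, $2: key name (for example: recovery).
get_conf_value() {
    local conf="$1" key="$2"
    awk -F= -v key="$key" '
        {
            k = $1
            gsub(/[ \t]/, "", k)            # key with surrounding blanks removed
        }
        k == key {
            v = $0
            sub(/^[^=]*=/, "", v)           # drop everything up to the first "="
            gsub(/^[ \t]+|[ \t]+$/, "", v)  # trim blanks
            gsub(/^\047|\047$/, "", v)      # strip surrounding single quotes
            val = v
        }
        END { if (val != "") print val }
    ' "$conf"
}
```

Example call, with the path being an assumption: `get_conf_value ../etc/repmgr.conf recovery`. For this case the result would be expected to be manual.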

3. Reboot the primary host

[root@node3 ~]# reboot

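4. Check the hamgr log on the standby

The log below shows that after the primary's system went down, the cluster failed over and the standby was promoted to primary.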

  1. [2022-03-02 10:32:38] [NOTICE] starting monitoring of node "node248" (ID: 2)
  2. [2022-03-02 10:32:38] [INFO] "connection_check_type" set to "ping"
  3. [2022-03-02 10:32:38] [INFO] monitoring connection to upstream node "node243" (ID: 1)
  4. [2022-03-02 10:32:38] [NOTICE] try to change wal catched_up state to 1
  5. [2022-03-02 10:32:38] [INFO] primary flush lsn is 0/1F000D40, local flush lsn is 0/1F000D40
  6. [2022-03-02 10:32:38] [NOTICE] try to change streaming_sync state to TRUE
  7. [2022-03-02 10:34:24] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
  8. [2022-03-02 10:34:24] [DETAIL] PQping() returned "PQPING_REJECT"
  9. [2022-03-02 10:34:24] [WARNING] unable to connect to upstream node "node243" (ID: 1)
  10. [2022-03-02 10:34:24] [INFO] sleeping 6 seconds until next reconnection attempt
  11. [2022-03-02 10:34:30] [INFO] checking state of node 1, 1 of 10 attempts
  12. [2022-03-02 10:34:40] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
  13. [2022-03-02 10:34:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  14. [2022-03-02 10:34:40] [INFO] sleeping 6 seconds until next reconnection attempt
  15. ......
  16. [2022-03-02 10:35:47] [INFO] checking state of node 1, 10 of 10 attempts
  [2022-03-02 10:35:47] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
  [2022-03-02 10:35:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  [2022-03-02 10:35:47] [WARNING] unable to reconnect to node 1 after 10 attempts
  [2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
  [2022-03-02 10:35:47] [WARNING] wal receiver not running
  [2022-03-02 10:35:47] [NOTICE] WAL receiver disconnected on all sibling nodes
  [2022-03-02 10:35:47] [INFO] WAL receiver disconnected on all 0 sibling nodes
  [2022-03-02 10:35:47] [INFO] 0 active sibling nodes registered
  [2022-03-02 10:35:47] [INFO] primary and this node have the same location ("default")
  [2022-03-02 10:35:47] [INFO] no other sibling nodes - we win by default
  [2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
  [2022-03-02 10:35:48] [NOTICE] this node is the only available candidate and will now promote itself
  [2022-03-02 10:35:48] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
  [2022-03-02 10:35:50] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.
  --- 192.168.7.1 ping statistics ---
  2 packets transmitted, 2 received, 0% packet loss, time 1008ms
  rtt min/avg/max/mdev = 2.473/2.535/2.598/0.080 ms
  [2022-03-02 10:35:50] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
  [2022-03-02 10:35:51] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.
  --- 192.168.7.241 ping statistics ---
  2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms
  [2022-03-02 10:35:51] [WARNING] ping host"192.168.7.241" failed
  [2022-03-02 10:35:51] [DETAIL] average RTT value is not greater than zero
  [2022-03-02 10:35:51] [INFO] loadvip result: 1, arping result: 1
  [2022-03-02 10:35:51] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
  [2022-03-02 10:35:51] [INFO] promote_command is:
  "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
  NOTICE: promoting standby to primary
  DETAIL: promoting server "node248" (ID: 2) using sys_promote()
  NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
  [2022-03-02 10:35:51] [NOTICE] try to stop old primary db (host: "192.168.7.243")
  INFO: SET synchronous TO "async" on primary host
  NOTICE: STANDBY PROMOTE successful
  DETAIL: server "node248" (ID: 2) was successfully promoted to primary
  [2022-03-02 10:35:56] [INFO] 0 followers to notify
  [2022-03-02 10:35:56] [INFO] switching to primary monitoring mode
  [2022-03-02 10:35:56] [NOTICE] monitoring cluster primary "node248" (ID: 2)
  [2022-03-02 10:35:56] [INFO] create a thread 0x7fdeaa4b9700 to check the cluster status
  [2022-03-02 10:35:57] [INFO] node (ID: 1): no server running
  [2022-03-02 10:35:57] [INFO] [thread 0x7fdeaa4b9700] the cluster has no other running primary node, exit
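
The failover milestones are easy to lose in a long hamgr.log. The sketch below filters a log stream down to the lines that mark the lost connection, the VIP takeover, and the promotion result. The function name `failover_events` is hypothetical, and the patterns are copied from the excerpt above, so other KingbaseES versions may word these messages differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the failover milestones from a hamgr.log
# stream on stdin. Patterns are taken from the log excerpt above (assumption:
# other versions may phrase these messages differently).
failover_events() {
    grep -E 'unable to reconnect to node|will now promote itself|acquire the virtual ip|STANDBY PROMOTE successful|switching to primary monitoring mode'
}

# Usage, from the directory that holds the log:
#   failover_events < hamgr.log
```

Reading from stdin keeps the helper independent of where the log is stored.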

5. The original primary host boots normally

1) Check the cluster status from the new primary

=The output below shows that the cluster is now in a 'dual-primary' state: both nodes are registered as primary, but the original primary is marked 'failed' and cannot be reached.=

  [kingbase@node1 bin]$ ./repmgr cluster show
   ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
  ----+---------+---------+-----------+----------+----------+----------+----------+-------------------
   1  | node243 | primary | - failed  |          | default  | 100      | ?        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
   2  | node248 | primary | * running |          | default  | 100      | 10       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
  WARNING: following issues were detected
  - unable to connect to node "node243" (ID: 1)
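
A 'dual-primary' registration like the one above can also be detected by a script rather than by eye. The following is a minimal sketch that reads `repmgr cluster show` output on stdin. The function names `count_primaries` and `failed_nodes` are hypothetical, and the column positions (Name second, Role third, Status fourth) are assumed from the table layout shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that parse `repmgr cluster show` output from stdin.
# Assumption: columns are '|' separated in the order ID, Name, Role, Status.

# Print how many rows have "primary" in the Role column.
count_primaries() {
    awk -F'|' '
        { role = $3; gsub(/[[:space:]]/, "", role) }
        role == "primary" { n++ }
        END { print n + 0 }
    '
}

# Print the names of nodes whose Status column contains "failed".
failed_nodes() {
    awk -F'|' 'NF >= 4 && $4 ~ /failed/ {
        name = $2; gsub(/[[:space:]]/, "", name); print name
    }'
}

# Usage, from the bin directory:
#   ./repmgr cluster show > status.txt
#   [ "$(count_primaries < status.txt)" -gt 1 ] && failed_nodes < status.txt
```

A count above one, together with a non-empty list of failed nodes, corresponds to the situation in this step: an old primary that still needs to be rejoined as a standby.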

2) Create replication slots on the new primary (the former standby)

  # Create the replication slots
  test=# select sys_create_physical_replication_slot('repmgr_slot_1');
   sys_create_physical_replication_slot
  --------------------------------------
   (repmgr_slot_1,)
  (1 row)
  test=# select sys_create_physical_replication_slot('repmgr_slot_2');
   sys_create_physical_replication_slot
  --------------------------------------
   (repmgr_slot_2,)
  (1 row)
  test=# select * from sys_replication_slots;
     slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
  ---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
   repmgr_slot_1 |        | physical  |        |          | f         | f      |            |      |              |             |
   repmgr_slot_2 |        | physical  |        |          | f         | f      |            |      |              |             |
  (2 rows)
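
With more nodes, typing one statement per slot becomes repetitive. As a sketch, the statements can be generated from a list of node IDs. The function name `slot_sql` is hypothetical, and the `repmgr_slot_<ID>` naming simply follows the two slots created above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one slot-creation statement per node ID.
# Assumption: slots are named repmgr_slot_<node ID>, as in the session above.
slot_sql() {
    local id
    for id in "$@"; do
        printf "select sys_create_physical_replication_slot('repmgr_slot_%s');\n" "$id"
    done
}

# Usage: review the generated statements, then paste them into a ksql session.
#   slot_sql 1 2
```

Printing the SQL instead of executing it keeps the step reviewable before anything runs against the new primary.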

3) Run the following recovery steps on the original primary (the future standby)

  # Back up the data directory
  [kingbase@node3 kingbase]$ cp data data.bk -r
  # Create the standby marker file
  [kingbase@node3 kingbase]$ cd data
  [kingbase@node3 data]$ touch standby.signal
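
The backup and marker-file steps can be wrapped so that an existing backup is never silently merged into. This is a minimal sketch: the function name `prepare_standby` is hypothetical, the `.bk` suffix follows the commands above, and it assumes the database on this node is stopped when it runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a data directory to <dir>.bk, then create
# standby.signal in the original. Assumption: the database is stopped.
prepare_standby() {
    local data_dir=$1
    local backup_dir="${data_dir}.bk"

    [ -d "$data_dir" ] || { echo "no such data directory: $data_dir" >&2; return 1; }
    # Refuse to reuse an old backup: cp -r into an existing directory would nest the copy.
    [ ! -e "$backup_dir" ] || { echo "backup already exists: $backup_dir" >&2; return 1; }

    cp -r "$data_dir" "$backup_dir" && touch "$data_dir/standby.signal"
}

# Usage, from the directory that contains data:
#   prepare_standby data
```

The marker file is created only after the copy succeeds, so the backup holds the directory as it was before the node was marked as a standby.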

4) Run repmgr node rejoin on the original primary to rejoin it to the cluster

  [kingbase@node3 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep --force-rewind
  NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
  DETAIL: rejoin target server's timeline 10 forked off current database system timeline 9 before current recovery point 0/200000A0
  NOTICE: executing sys_rewind
  DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
  sys_rewind: servers diverged at WAL location 0/1F000D70 on timeline 9
  sys_rewind: rewinding from last common checkpoint at 0/1E000A70 on timeline 9
  sys_rewind: find last common checkpoint start time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:52:34.358066 CST, in "0.225008" seconds.
  sys_rewind: update the control file: minRecoveryPoint is '0/1F011AD0', minRecoveryPointTLI is '10', and database state is 'in archive recovery'
  sys_rewind: rewind start wal location 0/1E000A40 (file 00000009000000000000001E), end wal location 0/1F011AD0 (file 0000000A000000000000001F). time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:53:06.442270 CST, in "32.309212" seconds.
  sys_rewind: Done!
  NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
  NOTICE: setting node 1's upstream to node 2
  WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
  DETAIL: PQping() returned "PQPING_NO_RESPONSE"
  NOTICE: begin to start server at 2022-03-02 10:53:06.588331
  NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
  NOTICE: start server finish at 2022-03-02 10:53:07.313294
  NOTICE: NODE REJOIN successful
  DETAIL: node 1 is now attached to node 2
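
When the rejoin is run from a script, its saved output can be checked for the success marker and for the point where the two servers diverged. The sketch below reads captured output on stdin. The function names `rejoin_succeeded` and `divergence_point` are hypothetical, and the matched strings are taken from the transcript above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that inspect captured `repmgr node rejoin` output.
# Assumption: message wording matches the transcript above.

# Succeed only if the output contains the success marker.
rejoin_succeeded() {
    grep -q 'NODE REJOIN successful'
}

# Print "<LSN> <timeline>" for the point where the servers diverged.
divergence_point() {
    sed -n 's|.*servers diverged at WAL location \([0-9A-F/]*\) on timeline \([0-9]*\).*|\1 \2|p'
}

# Usage, from the bin directory on the node being rejoined:
#   ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep --force-rewind 2>&1 | tee rejoin.log
#   rejoin_succeeded < rejoin.log && divergence_point < rejoin.log
```

Keeping the full output in a file also preserves the sys_rewind details for later review.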

5) Check the database service on the new standby (node rejoin has already started it)

  [kingbase@node3 bin]$ ps -ef |grep kingbase
  kingbase 3218 1 0 10:36 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
  kingbase 5817 1 0 10:49 ? 00:00:01 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
  kingbase 6730 1 0 10:53 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kingbase -D /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
  kingbase 6731 6730 0 10:53 ? 00:00:00 kingbase: logger
  kingbase 6732 6730 0 10:53 ? 00:00:00 kingbase: startup recovering 0000000A000000000000001F
  kingbase 6736 6730 0 10:53 ? 00:00:00 kingbase: checkpointer
  kingbase 6737 6730 0 10:53 ? 00:00:00 kingbase: background writer
  kingbase 6738 6730 0 10:53 ? 00:00:00 kingbase: stats collector
  kingbase 6739 6730 0 10:53 ? 00:00:00 kingbase: walreceiver streaming 0/1F012A78
  kingbase 6743 6730 0 10:53 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(55941) idle
  kingbase 6750 6730 0 10:53 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(55947) idle
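
The line that matters in this process list is the `walreceiver streaming` entry, which shows the node is receiving WAL from its upstream. As a sketch, the check can be reduced to one function that reads `ps -ef` output on stdin. The name `walreceiver_lsn` is hypothetical, and the process title is assumed to match the listing above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the LSN of a streaming walreceiver found in
# `ps -ef` output on stdin; return non-zero if none is listed.
# The [r] in the pattern keeps this awk command from matching its own
# entry in the process list.
walreceiver_lsn() {
    awk '/kingbase: walreceive[r] streaming/ { print $NF; found = 1 }
         END { exit !found }'
}

# Usage:
#   ps -ef | walreceiver_lsn || echo "walreceiver is not streaming"
```

A non-zero exit status means no streaming walreceiver was found, which would suggest the rejoin did not leave the node attached.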

6) Check the cluster node status

  [kingbase@node3 bin]$ ./repmgr cluster show
   ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
  ----+---------+---------+-----------+----------+----------+----------+----------+-------------------
   1  | node243 | standby |   running | node248  | default  | 100      | 9        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
   2  | node248 | primary | * running |          | default  | 100      | 10       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

7) Restart the cluster as a test (optional)

  [kingbase@node3 bin]$ ./sys_monitor.sh restart
  2022-03-02 10:55:26 Ready to stop all DB ...
  ....
  server started
  2022-03-02 10:55:52 execute to start DB on "[192.168.7.248]" success, connect to check it.
  2022-03-02 10:55:53 DB on "[192.168.7.248]" start success.
   ID | Name    | Role    | Status    | Upstream  | Location | Priority | Timeline | Connection string
  ----+---------+---------+-----------+-----------+----------+----------+----------+-------------------
   1  | node243 | standby |   running | ! node248 | default  | 100      | 10       | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
   2  | node248 | primary | * running |           | default  | 100      | 10       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
  WARNING: following issues were detected
  - node "node243" (ID: 1) is not attached to its upstream node "node248" (ID: 2)
  2022-03-02 10:55:53 The primary DB is started.
  ......
  2022-03-02 10:56:15 repmgrd on "[192.168.7.248]" start success.
   ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
  ----+---------+---------+-----------+----------+---------+-------+---------+--------------------
   1  | node243 | standby |   running | node248  | running | 9500  | no      | 1 second(s) ago
   2  | node248 | primary | * running |          | running | 27881 | no      | n/a
  [2022-03-02 10:56:18] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"
  [2022-03-02 10:56:20] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"
  2022-03-02 10:56:22 Done.

=The output above shows that, after repmgr node rejoin was run manually, the original primary rejoined the cluster as the new standby.=
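
During the restart, node243 is briefly listed with `! node248` in the Upstream column, which the accompanying warning explains as the node not yet being attached to its upstream. The sketch below picks such nodes out of the status table. The function name `unattached_nodes` is hypothetical, and it assumes Upstream is the fifth `|`-separated column, as in both tables above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: from a repmgr status table on stdin, print the names
# of nodes whose Upstream column starts with "!" (not attached to upstream).
# Assumption: Name is the 2nd column and Upstream the 5th.
unattached_nodes() {
    awk -F'|' 'NF >= 5 && $5 ~ /^[[:space:]]*!/ {
        name = $2; gsub(/[[:space:]]/, "", name); print name
    }'
}

# Usage, from the bin directory:
#   ./repmgr cluster show | unattached_nodes
```

An empty result matches the final table in the restart output, where node243 shows node248 as its upstream without the marker.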

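Summary:

  1. With recovery=standby, after the primary host goes down the cluster fails over. The original primary must be manually configured as a standby and its database service started, after which the cluster adds it back automatically.
  2. With recovery=automatic, after the primary host goes down the cluster fails over. No manual intervention is needed: the original primary rejoins the cluster automatically as the new standby.
  3. With recovery=manual, after the primary host goes down the cluster fails over. Manual intervention is required: run 'repmgr node rejoin' on the original primary to rejoin it to the cluster as the new standby.
  4. For production environments without routine DBA monitoring, consider setting recovery to automatic to improve the reliability of the cluster.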
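
Since the behaviour after a failover depends on this single setting, it helps to confirm what each node is configured with before testing. The sketch below prints the value of `recovery` from a repmgr.conf file. The function name `recovery_mode` is hypothetical, and it assumes the usual `name=value` or `name='value'` line format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the "recovery" parameter from a
# repmgr.conf file. Assumptions: one name=value setting per line, optional
# quotes around the value, commented-out lines start with '#'. If the
# parameter appears more than once, the last occurrence is printed.
recovery_mode() {
    sed -n "s/^[[:space:]]*recovery[[:space:]]*=[[:space:]]*['\"]\{0,1\}\([A-Za-z]*\).*/\1/p" "$1" | tail -n 1
}

# Usage, with the configuration path used in this environment:
#   recovery_mode /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf
```

Running it on every node is a quick way to spot a mismatch, since the parameter is meant to be set on all nodes.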
