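Case description:

This case observes a failover in a KingbaseES V8R3 cluster and analyzes how the WAL logs change after the primary and standby switch roles. It should help users understand the failover process in KingbaseES V8R3 (and R6).

The following comes from a field case:

WAL log information on the primary and standby after the failover:

Database service startup failure on the new primary (sys_log):

As sys_log shows, after the new primary ran startup and streaming replication was established, replication started from WAL segment "00000004000000050000002A", which caused replication to fail.

Applicable versions:

KingbaseES V8R3/R6

Node information:

Cluster node status: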

[kingbase@node102 bin]$ ./ksql -U SYSTEM -W 123456 TEST -p 9999
ksql (V008R003C002B0290)
Type "help" for help.
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | standby | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | primary | 0 | false | 0
(2 rows)

I. Node information before the failover

1. WAL logs on the original primary

[kingbase@node102 sys_xlog]$ ls -lh
.......
-rw------- 1 kingbase kingbase 16M Jul 7 15:25 00000009000000000000002D.partial
-rw------- 1 kingbase kingbase 339 Jul 7 15:09 00000009.history
-rw------- 1 kingbase kingbase 16M Jul 29 10:56 0000000A000000000000002D
-rw------- 1 kingbase kingbase 16M Jul 29 16:32 0000000A000000000000002E
-rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002F
-rw------- 1 kingbase kingbase 382 Jul 7 15:25 0000000A.history
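The segment names in this listing can be decoded by hand, which makes the later timeline change easier to spot. The sketch below assumes KingbaseES V8R3 follows the PostgreSQL naming convention (8 hex digits of timeline, 8 of log ID, 8 of segment number) and the default 16 MB segment size, which matches the 16M files shown above. The helper name `parse_wal_name` is made up for illustration.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments


def parse_wal_name(name):
    """Return (timeline, start LSN) for a WAL segment file name."""
    base = name.split(".")[0]  # drop a suffix such as ".partial"
    if len(base) != 24:
        raise ValueError(f"not a WAL segment name: {name}")
    timeline = int(base[0:8], 16)   # first 8 hex digits: timeline ID
    log_id = int(base[8:16], 16)    # middle 8 hex digits: high 32 bits of the LSN
    seg_id = int(base[16:24], 16)   # last 8 hex digits: segment number within the log
    offset = seg_id * SEGMENT_SIZE  # byte offset where the segment starts
    return timeline, f"{log_id:X}/{offset:X}"


print(parse_wal_name("0000000A000000000000002F"))  # (10, '0/2F000000')
```

Under these assumptions, the newest segment on the original primary belongs to timeline 10 and starts at LSN 0/2F000000.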

2. Control file information on the original primary

[kingbase@node102 bin]$ ./sys_controldata -D ../data
sys_control version number: 830
Catalog version number: 201608131
Database system identifier: 7080416207291699599
Database cluster state: in production
sys_control last modified: Wed 03 Aug 2022 10:26:57 AM CST
Latest checkpoint location: 0/2F000108
Prior checkpoint location: 0/2F000028
Latest checkpoint's REDO location: 0/2F0000D0
Latest checkpoint's REDO WAL file: 0000000A000000000000002F
Latest checkpoint's TimeLineID: 10
Latest checkpoint's PrevTimeLineID: 10
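The control file reports both a REDO location and the WAL file that contains it, and the second can be derived from the first. This is a minimal sketch of that mapping, assuming the PostgreSQL file-naming rule and 16 MB segments; `wal_file_name` is a hypothetical helper, not a KingbaseES tool.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments


def wal_file_name(timeline, lsn):
    """Build the WAL segment file name that contains the given LSN."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    segment = lo // SEGMENT_SIZE  # which 16 MB segment the offset falls in
    return f"{timeline:08X}{hi:08X}{segment:08X}"


# REDO location and timeline taken from the sys_controldata output above
print(wal_file_name(10, "0/2F0000D0"))  # 0000000A000000000000002F
```

The result agrees with the "Latest checkpoint's REDO WAL file" field reported by sys_controldata.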

3. WAL logs on the standby

[kingbase@node101 bin]$ ls -lh ../data/sys_xlog
.......
-rw------- 1 kingbase kingbase 16M Jul 7 15:25 00000009000000000000002D
-rw------- 1 kingbase kingbase 339 Jun 22 16:15 00000009.history
-rw------- 1 kingbase kingbase 16M Jul 29 16:14 0000000A000000000000002D
-rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002E
-rw------- 1 kingbase kingbase 16M Aug 3 10:27 0000000A000000000000002F
-rw------- 1 kingbase kingbase 382 Jul 29 10:33 0000000A.history

4. Control file information on the standby

[kingbase@node101 bin]$ ./sys_controldata -D ../data
sys_control version number: 830
Catalog version number: 201608131
Database system identifier: 7080416207291699599
Database cluster state: in archive recovery
sys_control last modified: Wed 03 Aug 2022 10:26:55 AM CST
Latest checkpoint location: 0/2F000028
Prior checkpoint location: 0/2E0002C8
Latest checkpoint's REDO location: 0/2F000028
Latest checkpoint's REDO WAL file: 0000000A000000000000002F
Latest checkpoint's TimeLineID: 10
Latest checkpoint's PrevTimeLineID: 10
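Before the failover, the standby's latest checkpoint location (0/2F000028) equals the primary's prior checkpoint location, so the standby is one checkpoint behind in its control file. The distance between the two checkpoint records can be computed by converting each LSN to a byte position. The sketch assumes the usual "high/low" hexadecimal LSN notation; `lsn_to_int` is an illustrative name.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '0/2F000108' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


primary_ckpt = "0/2F000108"  # primary: Latest checkpoint location
standby_ckpt = "0/2F000028"  # standby: Latest checkpoint location

gap = lsn_to_int(primary_ckpt) - lsn_to_int(standby_ckpt)
print(gap)  # 224
```

The two checkpoint records are only 224 bytes apart, which fits a quiet cluster where the nodes are nearly in sync.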

II. Performing the failover (stopping the primary database service)

1. Stop the primary database service

[kingbase@node102 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down....... done
server stopped

III. Primary and standby status after the failover

1. WAL logs on the new primary

[kingbase@node101 bin]$ ls -lh ../data/sys_xlog
.......
-rw------- 1 kingbase kingbase 339 Jun 22 16:15 00000009.history
-rw------- 1 kingbase kingbase 16M Jul 29 16:14 0000000A000000000000002D
-rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002E
-rw------- 1 kingbase kingbase 16M Aug 3 10:30 0000000A000000000000002F
-rw------- 1 kingbase kingbase 16M Aug 3 10:30 0000000A0000000000000030.partial
-rw------- 1 kingbase kingbase 382 Jul 29 10:33 0000000A.history
-rw------- 1 kingbase kingbase 16M Aug 3 10:31 0000000B0000000000000030
-rw------- 1 kingbase kingbase 426 Aug 3 10:30 0000000B.history

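After the failover completes, the timeline has switched from 10 (0x0A) to 11 (0x0B): the listing now contains 0000000B.history and the new segment 0000000B0000000000000030, while the last timeline-10 segment is kept as 0000000A0000000000000030.partial.

View the timeline history file: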
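A timeline history file is plain text. In PostgreSQL-derived systems each entry holds the parent timeline ID, the LSN at which the switch happened, and a reason, separated by tabs; this sketch assumes KingbaseES V8R3 uses the same layout. The contents of 0000000B.history are not reproduced in this case, so the sample text below is made up purely to exercise the parser, and `parse_history` is a hypothetical helper.

```python
def parse_history(text):
    """Parse timeline history text into (parent_tli, switch_lsn, reason) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        parts = line.split("\t", 2)
        reason = parts[2] if len(parts) > 2 else ""
        entries.append((int(parts[0]), parts[1], reason))
    return entries


# Made-up sample, NOT the real contents of 0000000B.history
sample = (
    "9\t0/2D000098\tno recovery target specified\n"
    "\n"
    "10\t0/30000098\tno recovery target specified\n"
)

for parent_tli, switch_lsn, reason in parse_history(sample):
    print(parent_tli, switch_lsn, reason)
```

On a real node the same function could be fed the text of `0000000B.history` from the sys_xlog directory; the last entry would then describe the switch away from timeline 10.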

2. Control file information on the new primary

[kingbase@node101 bin]$ ./sys_controldata -D ../data
sys_control version number: 830
Catalog version number: 201608131
Database system identifier: 7080416207291699599
Database cluster state: in production
sys_control last modified: Wed 03 Aug 2022 10:35:48 AM CST
Latest checkpoint location: 0/3005E110
Prior checkpoint location: 0/30004BD8
Latest checkpoint's REDO location: 0/3005B370
Latest checkpoint's REDO WAL file: 0000000B0000000000000030
Latest checkpoint's TimeLineID: 11
Latest checkpoint's PrevTimeLineID: 11
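Comparing the control file of node101 before and after the failover by eye is error-prone, so a small diff can help. The sketch below parses "key: value" lines as printed by sys_controldata and reports the fields that changed. Only three fields from the outputs above are pasted in to keep it short, and `parse_controldata` is an illustrative name.

```python
def parse_controldata(text):
    """Turn 'key: value' lines into a dict, splitting at the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            fields[key.strip()] = value.strip()
    return fields


before = """Database cluster state: in archive recovery
Latest checkpoint's REDO WAL file: 0000000A000000000000002F
Latest checkpoint's TimeLineID: 10"""

after = """Database cluster state: in production
Latest checkpoint's REDO WAL file: 0000000B0000000000000030
Latest checkpoint's TimeLineID: 11"""

old, new = parse_controldata(before), parse_controldata(after)
changed = {k: (old[k], new.get(k)) for k in old if old[k] != new.get(k)}
for key, (was, now) in changed.items():
    print(f"{key}: {was} -> {now}")
```

For node101 this shows the state moving from "in archive recovery" to "in production" and the timeline moving from 10 to 11.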

3. WAL logs on the new standby

[kingbase@node102 bin]$ ls -lh ../data/sys_xlog
.......
-rw------- 1 kingbase kingbase 16M Jul 29 10:56 0000000A000000000000002D
-rw------- 1 kingbase kingbase 16M Jul 29 16:32 0000000A000000000000002E
-rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000A000000000000002F
-rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000A0000000000000030.partial
-rw------- 1 kingbase kingbase 382 Aug 3 10:34 0000000A.history
-rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000B0000000000000030
-rw------- 1 kingbase kingbase 426 Aug 3 10:34 0000000B.history

4. Control file information on the new standby

[kingbase@node102 bin]$ ./sys_controldata -D ../data
sys_control version number: 830
Catalog version number: 201608131
Database system identifier: 7080416207291699599
Database cluster state: in archive recovery
sys_control last modified: Wed 03 Aug 2022 10:35:42 AM CST
Latest checkpoint location: 0/30004BD8
Prior checkpoint location: 0/30004BD8
Latest checkpoint's REDO location: 0/30004BA0
Latest checkpoint's REDO WAL file: 0000000B0000000000000030
Latest checkpoint's TimeLineID: 11
Latest checkpoint's PrevTimeLineID: 11

IV. Rejoining the original primary to the cluster as a standby

1. Create recovery.conf in the original primary's data directory

[kingbase@node102 data]$ cp ../etc/recovery.done ./recovery.conf

2. Check recovery.log

primary node/Im node status is changed, primary ip[192.168.1.101], recovery.conf NEED_CHANGE [0] (0 is need ), I,m status is [1] (1 is down), I will be in recovery.
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
if recover node up, let it down , for rewind
2022-08-03 10:34:35 sys_rewind...
sys_rewind --target-data=/home/kingbase/cluster/R3HA/db/data --source-server="host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
datadir_source = /home/kingbase/cluster/R3HA/db/data
rewinding from last common checkpoint at 0/2F000108 on timeline 10
find last common checkpoint start time from 2022-08-03 10:34:35.349563 CST to 2022-08-03 10:34:35.405349 CST, in "0.055786" seconds.
reading source file list
reading target file list
reading WAL in target
Rewind datadir file from source
update the control file: minRecoveryPoint is '0/3004D0C8', minRecoveryPointTLI is '11', and database state is 'in archive recovery'
rewind start wal location 0/2F0000D0 (file 0000000A000000000000002F), end wal location 0/3004D0C8 (file 0000000B0000000000000030). time from 2022-08-03 10:34:37.349563 CST to 2022-08-03 10:34:37.872586 CST, in "2.523023" seconds.
Done!
sed conf change #synchronous_standby_names
2022-08-03 10:34:39 file operate
cp recovery.conf...
change recovery.conf ip -> primary.ip
2022-08-03 10:34:39 change recovery.conf
delete pid file if exist
del the replication_slots if exis
drop the slot [slot_node101].
drop the slot [slot_node102].
2022-08-03 10:34:40 start up the kingbase...
waiting for server to start....LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "/home/kingbase/cluster/R3HA/db/data/sys_log".
done
server started
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node101,)
(1 row)
2022-08-03 10:34:42 create the slot [slot_node101] success.
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node102,)
(1 row)
2022-08-03 10:34:42 create the slot [slot_node102] success.
2022-08-03 10:34:42 start up standby successful!
can not get the replication of myself
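The two most useful facts in this log are the last common checkpoint that sys_rewind starts from and the minRecoveryPoint it writes to the control file. The sketch below pulls them out with regular expressions. The two log lines are copied from the output above, and the patterns simply assume the message wording stays as shown.

```python
import re

log = """rewinding from last common checkpoint at 0/2F000108 on timeline 10
update the control file: minRecoveryPoint is '0/3004D0C8', minRecoveryPointTLI is '11', and database state is 'in archive recovery'"""

LSN = r"[0-9A-F]+/[0-9A-F]+"  # LSN in high/low hexadecimal notation

common = re.search(rf"last common checkpoint at ({LSN}) on timeline (\d+)", log)
min_rec = re.search(rf"minRecoveryPoint is '({LSN})', minRecoveryPointTLI is '(\d+)'", log)

print("common checkpoint:", common.group(1), "timeline", common.group(2))
print("minRecoveryPoint:", min_rec.group(1), "timeline", min_rec.group(2))
```

The common checkpoint 0/2F000108 on timeline 10 is the same value the original primary reported as its latest checkpoint location before the failover, and the recovery target lies on the new timeline 11.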

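The recovery.log output above shows the recovery process.

V. Summary

When a cluster performs a failover, reading the WAL logs, recovery.log, and the control file together shows in detail how the WAL logs change during the switch and how the sys_rewind tool recovers the original primary.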
