使用MHA实现MySQL主从复制高可用

一、MHA简介
MHA（Master High Availability）目前在MySQL高可用方面是一个相对成熟的解决方案，它由日本DeNA公司的youshimaton（现就职于Facebook公司）开发，是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件。在MySQL故障切换过程中，MHA能做到在0~30秒之内自动完成数据库的故障切换操作，并且在进行故障切换的过程中，MHA能在最大程度上保证数据的一致性，以达到真正意义上的高可用。

该软件由两部分组成：MHA Manager（管理节点）和MHA Node（数据节点）。MHA Manager可以单独部署在一台独立的机器上管理多个master-slave集群，也可以部署在一台slave节点上。MHA Node运行在每台MySQL服务器上，MHA Manager会定时探测集群中的master节点，当master出现故障时，它可以自动将最新数据的slave提升为新的master，然后将所有其他的slave重新指向新的master。整个故障转移过程对应用程序完全透明。

在MHA自动故障切换过程中，MHA试图从宕机的主服务器上保存二进制日志，最大程度的保证数据的不丢失，但这并不总是可行的。例如，如果主服务器硬件故障或无法通过ssh访问，MHA没法保存二进制日志，只进行故障转移而丢失了最新的数据。使用MySQL 5.5的半同步复制，可以大大降低数据丢失的风险。MHA可以与半同步复制结合起来。如果只有一个slave已经收到了最新的二进制日志，MHA可以将最新的二进制日志应用于其他所有的slave服务器上，因此可以保证所有节点的数据一致性。

目前MHA主要支持一主多从的架构。要搭建MHA，要求一个复制集群中必须最少有三台数据库服务器，一主二从，即一台充当master，一台充当备用master，另外一台充当从库。因为至少需要三台服务器，出于机器成本的考虑，淘宝也在该基础上进行了改造，目前淘宝TMHA已经支持一主一从。（出自：《深入浅出MySQL(第二版)》）从代码层面看，MHA就是一套Perl脚本，那么相信以阿里系的技术实力，将MHA改成支持一主一从也并非难事。

图1所示为MHA架构：

图1
MHA工作原理总结为以下几条：

从宕机崩溃的master保存二进制日志事件（binlog events）；
识别含有最新更新的slave；
应用差异的中继日志（relay log）到其他slave；
应用从master保存的二进制日志事件（binlog events）；
提升一个slave为新master；
使用其他的slave连接新的master进行复制。
官方介绍：https://code.google.com/archive/p/mysql-master-ha/

二、实验架构设计
1. 基本环境
操作系统版本：CentOS Linux release 7.2.1511 (Core)
MySQL版本：5.6.14
VIP（虚IP）：172.16.1.100
主机信息：见表1
角色

主机名

网卡

server_id

功能

Monitor Host

172.16.1.124

hdp1

监控复制组

Master

172.16.1.127

hdp4

ens160

127

响应写请求

Candidate Master

172.16.1.126

hdp3

ens32

126

响应读请求

Slave

172.16.1.125

hdp2

ens32

125

响应读请求

表1

2. 架构设计
实验架构如图2所示。

图2
hdp1作为MHA Manager，其它三台主机构成MySQL一主二从复制集群，作为MHA Node。

三、MHA安装配置
1. 配置主从复制
MySQL主从复制的配置较为简单，具体过程可参考MySQL官方文档，这里从略。如果是全新搭建的复制，只要打开Master的binlog，然后将Slave change master到指定的file和pos，再start slave即可。如果是为已经存在且正在使用的数据库搭建从库，有两种方式，一是用mysqldump master-data参数记录master的file和pos，但可能卡库；比较好的方法是用innobackupex联机搭建从库，过程如下：
（1）前置条件

主从都安装好依赖包：
yum install perl perl-DBI perl-DBD-MySQL perl-IO-Socket-SSL perl-Time-HiRes
主从都安装percona-xtrabackup
设置PATH环境变量，如：
.:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/home/mysql/mysql-5.6.14/bin:/home/mysql/percona-xtrabackup-2.2.4-Linux-x86_64/bin:/home/mysql/bin
（2）配置主到从的SSH免密码连接
在主上用mysql用户执行：

ssh-keygen
... 一路回车 ...
ssh-copy-id slave的IP或主机名
（3）备份并传输
例如，在主上用mysql用户执行：

innobackupex --user root --password 123456 --defaults-file=/home/mysql/mysql-5.6.14/my.cnf --no-lock --socket=/home/mysql/mysql-5.6.14/mysql.sock --port 3306 --stream=tar ./ | ssh mysql@172.16.1.126 \ "cat - > /home/mysql/backup.tar"
（4）恢复备份
在从上用mysql用户执行：

# 解压缩
tar -ixvf backup.tar -C /home/mysql/mysql-5.6.14/data
# 应用日志
innobackupex --apply-log /home/mysql/mysql-5.6.14/data/

# 查看binlog日志文件的位置值
cat /home/mysql/mysql-5.6.14/data/xtrabackup_binlog_info

# 编辑my.cnf
vi /etc/my.cnf

# 启动MySQL，目录要和主保持一致
service mysql start

mysql -uroot -p123456 -P3306 -h127.0.0.1

# 配置复制
reset master;
reset slave all;

change master to
master_host='172.16.1.127',
master_port=3306,
master_user='repl',
master_password='123456',
master_log_file='mysql-bin.000001',
master_log_pos=120;

# 其中master_log_file和master_log_pos赋予/home/mysql/mysql5.6.14/data/xtrabackup_binlog_info中的值。

# 启动slave
start slave;

# 查看slave状态
show slave status\G
（5）后续工作
备份my.cnf、bat文件和crontab等。

2. 安装Perl等依赖模块
用root用户在所有四个节点执行下面的操作。

# 安装一个epel源
wget -O /etc/yum.repos.d/epel-7.repo http://mirrors.aliyun.com/repo/epel-7.repo

# 用yum安装依赖包
yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y

3. 配置SSH登录无密码验证
在hdp1 172.16.1.124（Monitor）上用root用户执行：

ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.125
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.126
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.127
在hdp4 172.16.1.127（Master）上用root用户执行：

ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.125
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.126
在hdp3 172.16.1.126（slave1）上用root用户执行：

ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.125
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.127
在hdp2 172.16.1.125（slave2）上用root用户执行：

ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.126
ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.16.1.127

4. 安装MHA Node
下载地址：https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads
在hdp2、hdp3、hdp4上用root用户执行下面的操作。

rpm -ivh mha4mysql-node-0.56-0.el6.noarch.rpm
安装完成后，在/usr/bin/目录下有如下MHA相关文件：

apply_diff_relay_logs
filter_mysqlbinlog
purge_relay_logs
save_binary_logs
这些脚本工具通常由MHA Manager的脚本触发，无需人为操作。脚本说明：

apply_diff_relay_logs：识别差异的中继日志事件并将其差异的事件应用于其它slave。
filter_mysqlbinlog：去除不必要的ROLLBACK事件（MHA已不再使用这个工具）。
purge_relay_logs：清除中继日志（不会阻塞SQL线程）。
save_binary_logs：保存和复制master的二进制日志。
5. 安装MHA Manager
下载地址：https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads
在hdp1上用root用户执行下面的操作。

rpm -ivh mha4mysql-manager-0.56-0.el6.noarch.rpm
安装完成后，在/usr/bin/目录下有如下MHA相关文件：

masterha_check_repl
masterha_check_ssh
masterha_check_status
masterha_conf_host
masterha_manager
masterha_master_monitor
masterha_master_switch
masterha_secondary_check
masterha_stop
apply_diff_relay_logs
filter_mysqlbinlog
purge_relay_logs
save_binary_logs
6. 配置MHA
在hdp1上用root用户执行下面（1）、（2）、（3）的操作。

（1）建立配置文件目录

mkdir -p /etc/masterha
（2）创建配置文件/etc/masterha/app1.cnf，内容如下：

[server default]
manager_log=/var/log/masterha/app1/manager.log
manager_workdir=/var/log/masterha/app1.log
master_binlog_dir=/data
master_ip_failover_script=/usr/bin/master_ip_failover
master_ip_online_change_script=/usr/bin/master_ip_online_change
password=123456
ping_interval=1
remote_workdir=/tmp
repl_password=123456
repl_user=repl
secondary_check_script=/usr/bin/masterha_secondary_check -s hdp4 -s hdp3 --user=root --master_host=hdp4 --master_ip=172.16.1.127 --master_port=3306
shutdown_script=""
ssh_user=root
user=root

[server1]
hostname=172.16.1.127
port=3306

[server2]
candidate_master=1
check_repl_delay=0
hostname=172.16.1.126
port=3306

[server3]
hostname=172.16.1.125
port=3306
server default段是manager的一些基本配置参数，server1、server2、server3分别对应复制中的master、第一个slave、第二个slave。该文件的语法要求严格，变量值后不要有多余的空格。主要配置项说明如下。

manager_log：设置manager的日志文件。
manager_workdir：设置manager的工作目录。
master_binlog_dir：设置master保存binlog的位置，以便MHA可以找到master的日志，这里的也就是mysql的数据目录。
master_ip_failover_script：设置自动failover时候的切换脚本。
master_ip_online_change_script：设置手动切换时候的切换脚本。
password：设置mysql中root用户的密码。
ping_interval：设置监控主库，发送ping包的时间间隔，默认是3秒，尝试三次没有回应的时候自动进行railover。
remote_workdir：设置远端mysql在发生切换时binlog的保存位置。
repl_password：设置复制用户的密码。
repl_user：设置复制环境中的复制用户名
secondary_check_script：一旦MHA到hdp4的监控之间出现问题，MHA Manager将会尝试从hdp3登录到hdp4。
shutdown_script：设置故障发生后关闭故障主机脚本。该脚本的主要作用是关闭主机放在发生脑裂，这里没有使用。
ssh_user：设置ssh的登录用户名。
user：设置监控用户为root。
candidate_master：设置为候选master。设置该参数以后，发生主从切换以后将会将此从库提升为主库，即使这个主库不是集群中事件最新的slave。
check_repl_delay：默认情况下如果一个slave落后master 100M的relay logs的话，MHA将不会选择该slave作为一个新的master，因为对于这个slave的恢复需要花费很长时间，通过设置check_repl_delay=0，MHA触发切换在选择一个新的master的时候将会忽略复制延时，这个参数对于设置了candidate_master=1的主机非常有用，因为这个候选主在切换的过程中一定是新的master。
（3）建立软连接

ln -s /home/mysql/mysql-5.6.14/bin/mysqlbinlog /usr/bin/mysqlbinlog
ln -s /home/mysql/mysql-5.6.14/bin/mysql /usr/bin/mysql
（4）设置复制中Slave的relay_log_purge参数
在hdp3和hdp2上用mysql用户执行：

mysql -uroot -p123456 -e "set global relay_log_purge=0"
注意，MHA在发生切换的过程中，从库的恢复过程中依赖于relay log的相关信息，所以这里要将relay log的自动清除设置为OFF，采用手动清除relay log的方式。默认情况下，从服务器上的中继日志会在SQL线程执行完毕后被自动删除。但是在MHA环境中，这些中继日志在恢复其他从服务器时可能会被用到，因此需要禁用中继日志的自动删除功能。定期清除中继日志需要考虑到复制延时的问题。在ext3的文件系统下，删除大的文件需要一定的时间，会导致严重的复制延时。为了避免复制延时，需要暂时为中继日志创建硬链接，因为在linux系统中通过硬链接删除大文件速度会很快。（在mysql数据库中，删除大表时，通常也采用建立硬链接的方式）

7. 创建相关脚本
（1）创建定期清理relay脚本
在hdp3、hdp2两台slave上建立/root/purge_relay_log.sh文件，内容如下：

#!/bin/bash
. /home/mysql/.bashrc

user=root
passwd=123456
port=3306
log_dir='/data'
work_dir='/data'
purge='/usr/bin/purge_relay_logs'

if [ ! -d $log_dir ]
then
mkdir $log_dir -p
fi

$purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1
purge_relay_logs的参数说明：

user mysql：MySQL用户名。
password mysql：MySQL用户密码。
port：MySQL端口号。
workdir：指定创建relay log的硬链接的位置，默认是/var/tmp。由于系统不同分区创建硬链接文件会失败，故需要执行硬链接具体位置，成功执行脚本后，硬链接的中继日志文件被删除。
disable_relay_log_purge：默认情况下，如果relay_log_purge=1，脚本会什么都不清理，自动退出。通过设定这个参数，当relay_log_purge=1的情况下会将relay_log_purge设置为0。清理relay log之后，最后将参数设置为OFF。
改模式为可执行：

chmod 755 purge_relay_log.sh
手工执行/root/purge_relay_log.sh，在控制台输出：

2018-07-31 12:45:20: purge_relay_logs script started.
Found relay_log.info: /data/relay-log.info
Opening /data/hdp2-relay-bin.000001 ..
Opening /data/hdp2-relay-bin.000002 ..
Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log
files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.
2018-07-31 12:45:23: All relay log purging operations succeeded.
添加到crontab中：

0 4 * * * /bin/bash /root/purge_relay_log.sh
（2）创建自动failover脚本
在hdp1上创建/usr/bin/master_ip_failover文件，内容如下：

#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
$command, $ssh_user, $orig_master_host, $orig_master_ip,
$orig_master_port, $new_master_host, $new_master_ip, $new_master_port
);

my $vip = '172.16.1.100'; # Virtual IP
my $key = "1";
my $ssh_start_vip = "/sbin/ifconfig ens32:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig ens160:$key down";

GetOptions(
'command=s' => \$command,
'ssh_user=s' => \$ssh_user,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
);

exit &main();

sub main {

print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

if ( $command eq "stop" || $command eq "stopssh" ) {

# $orig_master_host, $orig_master_ip, $orig_master_port are passed.
# If you manage master ip address at global catalog database,
# invalidate orig_master_ip here.
my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {

# all arguments are passed.
# If you manage master ip address at global catalog database,
# activate new_master_ip here.
# You can also grant write access (create user, set read_only=0, etc) here.
my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
`ssh $ssh_user\@$orig_master_host \" $ssh_start_vip \"`;
exit 0;
}
else {
&usage();
exit 1;
}
}

# A simple system call that enable the VIP on the new master
sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
注意脚本中VIP漂移的部分。

（3）创建手动failover脚本
在hdp1上创建/usr/bin/master_ip_online_change文件，内容如下：

#!/usr/bin/env perl

## Note: This is a sample script and is notcomplete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
# use Time::HiRes qw( sleep gettimeofdaytv_interval );
use Time::HiRes qw(sleep gettimeofday tv_interval);
use Data::Dumper;

my $_tstart;
my $_running_interval = 0.1;
my (
$command, $orig_master_host, $orig_master_ip,
$orig_master_port, $orig_master_user,
$new_master_host, $new_master_ip, $new_master_port,
$new_master_user,
);

my $vip = '172.16.1.100'; # Virtual IP
my $key = "1";
my $ssh_start_vip = "/sbin/ifconfig ens32:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig ens160:$key down";
my $ssh_user = "root";
my $new_master_password = "123456";
my $orig_master_password = "123456";

GetOptions(
'command=s' =>\$command,
#'ssh_user=s' => \$ssh_user,
'orig_master_host=s' =>\$orig_master_host,
'orig_master_ip=s' =>\$orig_master_ip,
'orig_master_port=i' =>\$orig_master_port,
'orig_master_user=s' =>\$orig_master_user,
#'orig_master_password=s' => \$orig_master_password,
'new_master_host=s' =>\$new_master_host,
'new_master_ip=s' =>\$new_master_ip,
'new_master_port=i' =>\$new_master_port,
'new_master_user=s' =>\$new_master_user,
#'new_master_password=s' =>\$new_master_password,
);

exit &main();

sub current_time_us {
my ($sec, $microsec ) = gettimeofday();
my$curdate = localtime($sec);
return $curdate . " " . sprintf( "%06d", $microsec);
}

sub sleep_until {
my$elapsed = tv_interval($_tstart);
if ($_running_interval > $elapsed ) {
sleep( $_running_interval - $elapsed );
}
}

sub get_threads_util {
my$dbh = shift;
my$my_connection_id = shift;
my$running_time_threshold = shift;
my$type = shift;
$running_time_threshold = 0 unless ($running_time_threshold);
$type = 0 unless($type);
my@threads;

my$sth = $dbh->prepare("SHOW PROCESSLIST");
$sth->execute();

while ( my $ref = $sth->fetchrow_hashref() ) {
my$id = $ref->{Id};
my$user = $ref->{User};
my$host = $ref->{Host};
my$command = $ref->{Command};
my$state = $ref->{State};
my$query_time = $ref->{Time};
my$info = $ref->{Info};
$info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
next if ( $my_connection_id == $id );
next if ( defined($query_time) && $query_time <$running_time_threshold );
next if ( defined($command) && $command eq "Binlog Dump" );
next if ( defined($user) && $user eq "system user" );
next
if ( defined($command)
&& $command eq "Sleep"
&& defined($query_time)
&& $query_time >= 1 );

if( $type >= 1 ) {
next if ( defined($command) && $command eq "Sleep" );
nextif ( defined($command) && $command eq "Connect" );
}

if( $type >= 2 ) {
next if ( defined($info) && $info =~ m/^select/i );
next if ( defined($info) && $info =~ m/^show/i );
}

push @threads, $ref;
}
return @threads;
}

sub main {
if ($command eq "stop" ) {
##Gracefully killing connections on the current master
#1. Set read_only= 1 on the new master
#2. DROP USER so that no app user can establish new connections
#3. Set read_only= 1 on the current master
#4. Kill current queries
#* Any database access failure will result in script die.
my$exit_code = 1;
eval {
## Setting read_only=1 on the new master (to avoid accident)
my $new_master_handler = new MHA::DBHelper();

# args: hostname, port, user, password, raise_error(die_on_error)_or_not
$new_master_handler->connect( $new_master_ip, $new_master_port,
$new_master_user, $new_master_password, 1 );
print current_time_us() . " Set read_only on the new master..";
$new_master_handler->enable_read_only();
if ( $new_master_handler->is_read_only() ) {
print "ok.\n";
}
else {
die "Failed!\n";
}
$new_master_handler->disconnect();

# Connecting to the orig master, die if any database error happens
my $orig_master_handler = new MHA::DBHelper();
$orig_master_handler->connect( $orig_master_ip, $orig_master_port,
$orig_master_user, $orig_master_password, 1 );

## Drop application user so that nobodycan connect. Disabling per-session binlog beforehand
#$orig_master_handler->disable_log_bin_local();
#print current_time_us() . " Drpping app user on the origmaster..\n";
#FIXME_xxx_drop_app_user($orig_master_handler);

## Waiting for N * 100 milliseconds so that current connections can exit
my $time_until_read_only = 15;
$_tstart = [gettimeofday];
my @threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
while ( $time_until_read_only > 0 && $#threads >= 0 ) {
if ( $time_until_read_only % 5 == 0 ) {
printf "%s Waiting all running %d threads aredisconnected.. (max %d milliseconds)\n",
current_time_us(), $#threads + 1, $time_until_read_only * 100;
if ( $#threads < 5 ) {
print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump ."\n"
foreach (@threads);
}
}
sleep_until();
$_tstart = [gettimeofday];
$time_until_read_only--;
@threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
}

## Setting read_only=1 on the current master so that nobody(exceptSUPER) can write
print current_time_us() . " Set read_only=1 on the orig master..";
$orig_master_handler->enable_read_only();
if ( $orig_master_handler->is_read_only() ) {
print "ok.\n";
}
else {
die "Failed!\n";
}

## Waiting for M * 100 milliseconds so that current update queries cancomplete
my $time_until_kill_threads = 5;
@threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
if ( $time_until_kill_threads % 5 == 0 ) {
printf "%s Waiting all running %d queries aredisconnected.. (max %d milliseconds)\n",
current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
if ( $#threads < 5 ) {
print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump ."\n"
foreach (@threads);
}
}
sleep_until();
$_tstart = [gettimeofday];
$time_until_kill_threads--;
@threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
}

print "Disabling the VIPon old master: $orig_master_host \n";
&stop_vip();

## Terminating all threads
print current_time_us() . " Killing all applicationthreads..\n";
$orig_master_handler->kill_threads(@threads) if ( $#threads >= 0);
print current_time_us() . " done.\n";
#$orig_master_handler->enable_log_bin_local();
$orig_master_handler->disconnect();

## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
$exit_code = 0;
};
if($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
##Activating master ip on the new master
#1. Create app user with write privileges
#2. Moving backup script if needed
#3. Register new master's ip to the catalog database

# We don't return error even thoughactivating updatable accounts/ip failed so that we don't interrupt slaves'recovery.
# If exit code is 0 or 10, MHA does notabort
my$exit_code = 10;
eval{
my $new_master_handler = new MHA::DBHelper();

# args: hostname, port, user, password, raise_error_or_not
$new_master_handler->connect( $new_master_ip, $new_master_port,
$new_master_user, $new_master_password, 1 );

## Set read_only=0 on the new master
#$new_master_handler->disable_log_bin_local();
print current_time_us() . " Set read_only=0 on the newmaster.\n";
$new_master_handler->disable_read_only();

## Creating an app user on the new master
#print current_time_us() . " Creating app user on the newmaster..\n";
#FIXME_xxx_create_app_user($new_master_handler);
#$new_master_handler->enable_log_bin_local();
$new_master_handler->disconnect();

## Update master ip on the catalog database, etc
print "Enabling the VIP -$vip on the new master - $new_master_host \n";
&start_vip();
$exit_code = 0;
};
if($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {

#do nothing
exit 0;
}
else{
&usage();
exit 1;
}
}

# A simple system call that enable the VIPon the new master
sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIPon the old_master
sub stop_vip() {
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
print
"Usage: master_ip_online_change --command=start|stop|status--orig_master_host=host --orig_master_ip=ip --orig_master_port=port--new_master_host=host --new_master_ip=ip --new_master_port=port\n";
die;
}
注意脚本中VIP漂移的部分。

四、检查MHA配置
1. 检查SSH配置
在hdp1上用root用户操作。

[root@hdp1~]#masterha_check_ssh --conf=/etc/masterha/app1.cnf
Tue Jul 31 12:50:22 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Jul 31 12:50:22 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Tue Jul 31 12:50:22 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Tue Jul 31 12:50:22 2018 - [info] Starting SSH connection tests..
Tue Jul 31 12:50:23 2018 - [debug]
Tue Jul 31 12:50:22 2018 - [debug] Connecting via SSH from root@172.16.1.127(172.16.1.127:22) to root@172.16.1.126(172.16.1.126:22)..
Tue Jul 31 12:50:22 2018 - [debug] ok.
Tue Jul 31 12:50:22 2018 - [debug] Connecting via SSH from root@172.16.1.127(172.16.1.127:22) to root@172.16.1.125(172.16.1.125:22)..
Tue Jul 31 12:50:23 2018 - [debug] ok.
Tue Jul 31 12:50:24 2018 - [debug]
Tue Jul 31 12:50:23 2018 - [debug] Connecting via SSH from root@172.16.1.126(172.16.1.126:22) to root@172.16.1.127(172.16.1.127:22)..
Tue Jul 31 12:50:23 2018 - [debug] ok.
Tue Jul 31 12:50:23 2018 - [debug] Connecting via SSH from root@172.16.1.126(172.16.1.126:22) to root@172.16.1.125(172.16.1.125:22)..
Tue Jul 31 12:50:23 2018 - [debug] ok.
Tue Jul 31 12:50:25 2018 - [debug]
Tue Jul 31 12:50:23 2018 - [debug] Connecting via SSH from root@172.16.1.125(172.16.1.125:22) to root@172.16.1.127(172.16.1.127:22)..
Tue Jul 31 12:50:23 2018 - [debug] ok.
Tue Jul 31 12:50:23 2018 - [debug] Connecting via SSH from root@172.16.1.125(172.16.1.125:22) to root@172.16.1.126(172.16.1.126:22)..
Tue Jul 31 12:50:24 2018 - [debug] ok.
Tue Jul 31 12:50:25 2018 - [info] All SSH connection tests passed successfully.
[root@hdp1~]#
可以看到各个节点ssh验证都是ok的。

2. 检查整个复制环境状况
在hdp1上用root用户操作。

[root@hdp1~]#masterha_check_repl --conf=/etc/masterha/app1.cnf
Tue Jul 31 12:52:19 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Jul 31 12:52:19 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Tue Jul 31 12:52:19 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Tue Jul 31 12:52:19 2018 - [info] MHA::MasterMonitor version 0.56.
Tue Jul 31 12:52:21 2018 - [info] GTID failover mode = 0
Tue Jul 31 12:52:21 2018 - [info] Dead Servers:
Tue Jul 31 12:52:21 2018 - [info] Alive Servers:
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.127(172.16.1.127:3306)
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.126(172.16.1.126:3306)
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.125(172.16.1.125:3306)
Tue Jul 31 12:52:21 2018 - [info] Alive Slaves:
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.126(172.16.1.126:3306) Version=5.6.14-log (oldest major version between slaves) log-bin:enabled
Tue Jul 31 12:52:21 2018 - [info] Replicating from 172.16.1.127(172.16.1.127:3306)
Tue Jul 31 12:52:21 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Jul 31 12:52:21 2018 - [info] 172.16.1.125(172.16.1.125:3306) Version=5.6.14-log (oldest major version between slaves) log-bin:enabled
Tue Jul 31 12:52:21 2018 - [info] Replicating from 172.16.1.127(172.16.1.127:3306)
Tue Jul 31 12:52:21 2018 - [info] Current Alive Master: 172.16.1.127(172.16.1.127:3306)
Tue Jul 31 12:52:21 2018 - [info] Checking slave configurations..
Tue Jul 31 12:52:21 2018 - [info] read_only=1 is not set on slave 172.16.1.126(172.16.1.126:3306).
Tue Jul 31 12:52:21 2018 - [info] Checking replication filtering settings..
Tue Jul 31 12:52:21 2018 - [info] binlog_do_db= , binlog_ignore_db=
Tue Jul 31 12:52:21 2018 - [info] Replication filtering check ok.
Tue Jul 31 12:52:21 2018 - [info] GTID (with auto-pos) is not supported
Tue Jul 31 12:52:21 2018 - [info] Starting SSH connection tests..
Tue Jul 31 12:52:23 2018 - [info] All SSH connection tests passed successfully.
Tue Jul 31 12:52:23 2018 - [info] Checking MHA Node version..
Tue Jul 31 12:52:24 2018 - [info] Version check ok.
Tue Jul 31 12:52:24 2018 - [info] Checking SSH publickey authentication settings on the current master..
Tue Jul 31 12:52:24 2018 - [info] HealthCheck: SSH to 172.16.1.127 is reachable.
Tue Jul 31 12:52:24 2018 - [info] Master MHA Node version is 0.56.
Tue Jul 31 12:52:24 2018 - [info] Checking recovery script configurations on 172.16.1.127(172.16.1.127:3306)..
Tue Jul 31 12:52:24 2018 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000001
Tue Jul 31 12:52:24 2018 - [info] Connecting to root@172.16.1.127(172.16.1.127:22)..
Creating /tmp if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /data, up to mysql-bin.000001
Tue Jul 31 12:52:25 2018 - [info] Binlog setting check done.
Tue Jul 31 12:52:25 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Tue Jul 31 12:52:25 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=172.16.1.126 --slave_ip=172.16.1.126 --slave_port=3306 --workdir=/tmp --target_version=5.6.14-log --manager_version=0.56 --relay_log_info=/data/relay-log.info --relay_dir=/data/ --slave_pass=xxx
Tue Jul 31 12:52:25 2018 - [info] Connecting to root@172.16.1.126(172.16.1.126:22)..
Checking slave recovery environment settings..
Opening /data/relay-log.info ... ok.
Relay log found at /data, up to hdp3-relay-bin.000003
Temporary relay log file is /data/hdp3-relay-bin.000003
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Tue Jul 31 12:52:25 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=172.16.1.125 --slave_ip=172.16.1.125 --slave_port=3306 --workdir=/tmp --target_version=5.6.14-log --manager_version=0.56 --relay_log_info=/data/relay-log.info --relay_dir=/data/ --slave_pass=xxx
Tue Jul 31 12:52:25 2018 - [info] Connecting to root@172.16.1.125(172.16.1.125:22)..
Checking slave recovery environment settings..
Opening /data/relay-log.info ... ok.
Relay log found at /data, up to hdp2-relay-bin.000003
Temporary relay log file is /data/hdp2-relay-bin.000003
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Tue Jul 31 12:52:25 2018 - [info] Slaves settings check done.
Tue Jul 31 12:52:25 2018 - [info]
172.16.1.127(172.16.1.127:3306) (current master)
+--172.16.1.126(172.16.1.126:3306)
+--172.16.1.125(172.16.1.125:3306)

Tue Jul 31 12:52:25 2018 - [info] Checking replication health on 172.16.1.126..
Tue Jul 31 12:52:25 2018 - [info] ok.
Tue Jul 31 12:52:25 2018 - [info] Checking replication health on 172.16.1.125..
Tue Jul 31 12:52:25 2018 - [info] ok.
Tue Jul 31 12:52:25 2018 - [info] Checking master_ip_failover_script status:
Tue Jul 31 12:52:25 2018 - [info] /usr/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=172.16.1.127 --orig_master_ip=172.16.1.127 --orig_master_port=3306

IN SCRIPT TEST====/sbin/ifconfig ens160:1 down==/sbin/ifconfig ens32:1 172.16.1.100===

Checking the Status of the script.. OK
SIOCSIFADDR: No such device
ens32:1: ERROR while getting interface flags: No such device
Tue Jul 31 12:52:25 2018 - [info] OK.
Tue Jul 31 12:52:25 2018 - [warning] shutdown_script is not defined.
Tue Jul 31 12:52:25 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.
没有明显报错，只有几个警告而已，复制显示正常。

3. 检查MHA Manager的状态
在hdp1上用root用户操作。

[root@hdp1~]#masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[root@hdp1~]#
显示"NOT_RUNNING"，这代表MHA监控没有开启。执行下面的命令后台启动MHA。

mkdir -p /var/log/masterha/app1/
nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
启动参数说明：

remove_dead_master_conf：该参数代表当发生主从切换后，老的主库的ip将会从配置文件中移除。
manger_log：日志存放位置。
ignore_last_failover：在缺省情况下，如果MHA检测到连续发生宕机，且两次宕机间隔不足8小时的话，则不会进行Failover，之所以这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件，默认情况下，MHA发生切换后会在日志目录，也就是上面设置的/data产生app1.failover.complete文件，下次再次切换的时候如果发现该目录下存在该文件将不允许触发切换，除非在第一次切换后收到删除该文件。为了方便，这里设置为--ignore_last_failover。
再次检查MHA Manager的状态：

[root@hdp1~]#masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:298237) is running(0:PING_OK), master:172.16.1.127
[root@hdp1~]#
可以看见已经在监控了，而且master的主机为172.16.1.127。

4. 查看启动日志
在hdp1上用root用户操作。

[root@hdp1~]#tail -n20 /var/log/masterha/app1/manager.log
Tue Jul 31 12:57:06 2018 - [info]
172.16.1.127(172.16.1.127:3306) (current master)
+--172.16.1.126(172.16.1.126:3306)
+--172.16.1.125(172.16.1.125:3306)

Tue Jul 31 12:57:06 2018 - [info] Checking master_ip_failover_script status:
Tue Jul 31 12:57:06 2018 - [info] /usr/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=172.16.1.127 --orig_master_ip=172.16.1.127 --orig_master_port=3306

IN SCRIPT TEST====/sbin/ifconfig ens160:1 down==/sbin/ifconfig ens32:1 172.16.1.100===

Checking the Status of the script.. OK
SIOCSIFADDR: No such device
ens32:1: ERROR while getting interface flags: No such device
Tue Jul 31 12:57:06 2018 - [info] OK.
Tue Jul 31 12:57:06 2018 - [warning] shutdown_script is not defined.
Tue Jul 31 12:57:06 2018 - [info] Set master ping interval 1 seconds.
Tue Jul 31 12:57:06 2018 - [info] Set secondary check script: /usr/bin/masterha_secondary_check -s hdp4 -s hdp3 --user=root --master_host=hdp4 --master_ip=172.16.1.127 --master_port=3306
Tue Jul 31 12:57:06 2018 - [info] Starting ping health check on 172.16.1.127(172.16.1.127:3306)..
Tue Jul 31 12:57:06 2018 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
[root@hdp1~]#
五、功能测试
1. 初始绑定VIP
在hdp4 172.16.1.127（master）上用root用户执行：

/sbin/ifconfig ens160:1 172.16.1.100/24
查看VIP：

[root@hdp4~]#ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a5:49:7f brd ff:ff:ff:ff:ff:ff
inet 172.16.1.127/24 brd 172.16.1.255 scope global ens160
valid_lft forever preferred_lft forever
inet 172.16.1.100/16 brd 172.16.255.255 scope global ens160:1
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea5:497f/64 scope link
valid_lft forever preferred_lft forever
[root@hdp4~]#
2. 测试自动切换
（1）在slave1库（172.16.1.126）上停掉slave IO线程，模拟主从延时：

mysql -uroot -p123456 -e "stop slave io_thread;"
（2）在master库（172.16.1.127）安装sysbench，进行sysbench数据生成，在sbtest库下生成sbtest表，共10W记录。

# 用root用户安装sysbench
yum install sysbench -y

# 用mysql用户建立sbtest 数据库
mysql -uroot -p123456 -e "create database sbtest;"

# 用mysql用户执行sysbench生成数据
sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-user=root --mysql-password=123456 --oltp-test-mode=complex --oltp-tables-count=10 --oltp-table-size=10000 --threads=10 --time=120 --report-interval=10 --db-driver=mysql prepare
（3）用root用户停止master的mysql服务。

service mysql stop
（4）验证VIP漂移。
在hdp3上用root用户操作。

[root@hdp3~]#ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:50:56:a5:0f:77 brd ff:ff:ff:ff:ff:ff
inet 172.16.1.126/24 brd 172.16.1.255 scope global ens32
valid_lft forever preferred_lft forever
inet 172.16.1.100/16 brd 172.16.255.255 scope global ens32:1
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea5:f77/64 scope link
valid_lft forever preferred_lft forever
[root@hdp3~]#
在hdp4上用root用户操作。

[root@hdp4~]#ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a5:49:7f brd ff:ff:ff:ff:ff:ff
inet 172.16.1.127/24 brd 172.16.1.255 scope global ens160
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea5:497f/64 scope link
valid_lft forever preferred_lft forever
[root@hdp4~]#
可以看到VIP已经从hdp4 172.16.1.127（master）漂移到了hdp3 172.16.1.126（slave1）。

（5）客户端用VIP访问数据库

C:\WINDOWS\system32>
在还没创建sbtest库的时候，172.16.1.126就停了slave sql线程。在新的Master 172.16.1.126上查看数据，可以看到落后的数据也同步过来了，数据没有丢失。

（6）查看复制的主从切换

C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.125 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.1.126
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 19093607
Relay_Log_File: hdp2-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 19093607
Relay_Log_Space: 455
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 126
Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56
Master_Info_File: /data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0

C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.126 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.

C:\WINDOWS\system32>
可以看到，172.16.1.126称为新的master，而172.16.1.125也指向了这个新的master。

（7）检查MHA Manager的状态
在hdp1上用root用户执行下面的操作。

[root@hdp1~]#masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[1]+ Done nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1
[root@hdp1~]#
发现在执行了一次自动failover后，MHA Manager进程停止了。官网上对这种情况的解释如下：

意思是安装一个进程工具，通过该工具结合脚本来管理进程。

3. 测试手工切换
首先要还原环境。

还原数据库复制：

-- 在hdp4、hdp3、hdp2上重置master、slave
stop slave;
drop database sbtest;
reset master;
reset slave all;

-- 在hdp3、hdp2上重新指向hdp4为master
change master to
master_host='172.16.1.127',
master_port=3306,
master_user='repl',
master_password='123456',
master_log_file='mysql-bin.000001',
master_log_pos=120;

start slave;
show slave status\G
还原VIP绑定：

# 在hdp3上用root用户执行
/sbin/ifconfig ens32:1 down

# 在hdp4上用root用户执行
/sbin/ifconfig ens160:1 172.16.1.100
还原配置文件：
编辑在hdp1上/etc/masterha/app1.cnf，将[server1]段添加回去。

启动MHA Manage：

# 在hdp1上用root用户执行
nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
至此环境还原完毕，可以开始测试手工切换。当主服务器故障时，人工手动调用MHA来进行故障切换操作，步骤如下。
（1）停止MHA Manage
在hdp1上用root用户操作。

masterha_stop --conf=/etc/masterha/app1.cnf
（2）关闭master
在hdp4上用root用户操作。

service mysql stop
（3）执行手工切换
在hdp1上用root用户操作。

masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=172.16.1.127 --dead_master_port=3306 --new_master_host=172.16.1.126 --new_master_port=3306 --ignore_last_failover
（4）验证VIP漂移到172.16.1.126

C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.125 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.1.126
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 120
Relay_Log_File: hdp2-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 120
Relay_Log_Space: 455
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 126
Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56
Master_Info_File: /data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0

C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.126 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.

C:\WINDOWS\system32>
（6）验证客户端VIP访问

C:\WINDOWS\system32>
4. 测试在线切换
在许多情况下，需要将现有的主服务器迁移到另外一台服务器上。比如主服务器硬件故障，RAID控制卡需要重建，将主服务器移到性能更好的服务器上等等。维护主服务器引起性能下降，导致停机时间至少无法写入数据。另外，阻塞或杀掉当前运行的会话会导致主主之间数据不一致的问题发生。MHA 提供快速切换和优雅的阻塞写入，这个切换过程只需要 0.5-2s 的时间，这段时间内数据是无法写入的。在很多情况下，0.5-2s 的阻塞写入是可以接受的。因此切换主服务器不需要计划分配维护时间窗口。

MHA在线切换的大概过程：

检测复制设置和确定当前主服务器
确定新的主服务器
阻塞写入到当前主服务器
等待所有从服务器赶上复制
授予写入到新的主服务器
重新设置从服务器
注意，在线切换的时候应用架构需要考虑以下两个问题：

自动识别master和slave的问题（master的机器可能会切换），如果采用了vip的方式，基本可以解决这个问题。
负载均衡的问题（可以定义大概的读写比例，每台机器可承担的负载比例，当有机器离开集群时，需要考虑这个问题）
为了保证数据完全一致性，在最快的时间内完成切换，MHA的在线切换必须满足以下条件才会切换成功，否则会切换失败。

所有slave的IO线程都在运行
所有slave的SQL线程都在运行
所有的show slave status的输出中Seconds_Behind_Master参数小于或者等于running_updates_limit秒。如果在切换过程中不指定running_updates_limit，那么默认情况下running_updates_limit为1秒。
在master端，通过show processlist输出，没有一个更新花费的时间大于running_updates_limit秒。
在测试前，先按照上面“测试手工切换”测试前的步骤执行还原环境（手工切换不用修改/etc/masterha/app1.cnf配置文件），然后按以下步骤测试线切换：

（1）停止MHA Manage
在hdp1上用root用户操作。

masterha_stop --conf=/etc/masterha/app1.cnf
（2）执行在线切换命令
在hdp1上用root用户操作。

masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.16.1.126 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
（3）验证复制关系
在hdp2、hdp3、hdp4查看slave status：

C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.125 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.1.126
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 120
Relay_Log_File: hdp2-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 120
Relay_Log_Space: 455
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 126
Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56
Master_Info_File: /data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0

C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.126 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.

C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.127 -e "show slave status\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.1.126
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 120
Relay_Log_File: hdp4-relay-bin.000002
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 120
Relay_Log_Space: 455
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 126
Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56
Master_Info_File: /data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0

C:\WINDOWS\system32>
可以看到hdp3 172.16.1.126成为新的master，而hdp2 172.16.1.125和hdp4 172.16.1.127 成为指向新master的slave。

（4）验证VIP自动漂移

C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.100 -e "show variables like 'server_id'"
mysql: [Warning] Using a password on the command line interface can be insecure.
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id | 126 |
+---------------+-------+

C:\WINDOWS\system32>
5. 修复宕机的Master
通常情况下自动切换以后，原master可能已经废弃掉，待原master主机修复后，如果数据完整的情况下，可能想把原来master重新作为新主库的slave。这时我们可以借助当时自动切换时刻的MHA日志来完成对原master的修复。下面是提取相关日志的命令：

grep -i "All other slaves should start" /var/log/masterha/app1/manager.log
可以看到类似下面的信息：

All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.1.126', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000005', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='123456';
意思是说，如果Master主机修复好了，可以在修复好后的Master执行CHANGE MASTER操作，作为新的slave库。
————————————————
版权声明：本文为CSDN博主「wzy0623」的原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接及本声明。
原文链接：https://blog.csdn.net/wzy0623/article/details/81304654

使用MHA实现MySQL主从复制高可用的更多相关文章

分布式数据存储 - MySQL主从复制高可用方案
前面几篇文章说道MySQL数据库的高可用方案主从复制.主从复制的延迟产生原因.延迟检测及延迟解决方案(并未从根本上解决),这种主从复制方案保证数据的冗余的同时可以做读写分离来分担系统压力但是并非是高可 ...
zabbix监控nginx+php-fpm，mysql+主从复制+高可用，tomcat，redis web状态
zabbix监控对象区分使用SNMP监控交换使用IPMI监控服务器硬件使用Agent监控服务器使用JMX监控JAVA SNMP监控流程交换机上开启snmp 在zabbix上添加监控(设置SN ...
MHA实现MySQL的高可用
一:软件简介 MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件. 在 ...
mysql实现高可用架构之MHA
一.简介 MHA(Master HA)是一款开源的 MySQL 的高可用程序,它为 MySQL 主从复制架构提供了 automating master failover 功能.MHA 在监控到 mas ...
MySQL之高可用MHA部署
先说一下大概原理虚拟机A ip为10.0.3.92 作为master 虚拟机B ip为10.0.3.102 作为slave1 虚拟机C ip为10.0.3.103 作为 ...
美团点评MySQL数据库高可用架构从MMM到MHA+Zebra以及MHA+Proxy的演进
本文介绍最近几年美团点评MySQL数据库高可用架构的演进过程,以及我们在开源技术基础上做的一些创新.同时,也和业界其它方案进行综合对比,了解业界在高可用方面的进展,和未来我们的一些规划和展望. MMM ...
技术实战：基于 MHA 方式实现 MySQL 的高可用(转)
转自:http://os.51cto.com/art/201307/401702_all.htm MHA故障转移可以很好的帮我们解决从库数据的一致性问题,同时最大化挽回故障发生后的数据.本文分享了基于 ...
基于keepalived搭建MySQL的高可用集群
MySQL的高可用方案一般有如下几种: keepalived+双主,MHA,MMM,Heartbeat+DRBD,PXC,Galera Cluster 比较常用的是keepalived+双主,MHA和 ...
Keepalived+MySQL实现高可用
MySQL的高可用方案有很多,比如Cluster,MMM,MHA,DRBD等,这些都比较复杂,我前面的文章也有介绍.最近Oracle官方也推出了Fabric.有时我们不需要这么复杂的环境,这些方案各有 ...

随机推荐

从架构开始谈dubbo(一)
架构发展史一.单体应用架构当网站流量很小时,所有的功能写在一个项目中,打包部署在tomcat中. 例如:公司管理系统,超市的收银系统也可以将单体应用部 ...
【大数据】0002---MongoDB集群自动分离创建新集群
场景:MongoDB集群运行后,自动产生新的集群解答:可能没有理解问题,理清思路如下 1.日志分析 2.配置文件检查(日期.版本) 3.网络IP变化 4.github 5.stack overflo ...
python语言(七)面向对象、异常处理
一.异常处理 python解释器检测到错误,触发异常(也允许程序员自己触发异常).程序员编写特定的代码,专门用来捕捉这个异常(这段代码与程序逻辑无关,与异常处理有关).如果捕捉成功则进入另外一个处理分 ...
ANSI Common lisp1
lisp(本文专指common lisp)语言简介 lisp程序员能够并且经常编写一些能够写程序的程序,对于程序生成程序的这种特性, 因为lisp是主流语言中唯一一个提供一些方便的抽象来让你完成这个任 ...
ESA2GJK1DH1K基础篇: 移植源码中SmartConfig程序到自己的项目
前言: 说明两个地方 1.点击按钮启动SmartConfig的地方 2.最终跳转的地方说明这节说明一下,如何把本人源码里面的SmartConfig程序移植到自己的工程我要把所有的全部拿出来讲,让 ...
haproxy 配置文件详解之 listen
配置示例: listen admin_stats bind mode http log 127.0.0.1 local0 err stats refresh 30s stats uri /haprox ...
postcss 将px转换成rem vuecli3+vant+vue+postcss
1.安装 npm install postcss-pxtorem --save 2.找到postcss.config.js 默认是这样 module.exports = { "plugins ...
px转rem vue vscode
1.vscode中安装px2rem 2.打开settings.json ,新增 "px2rem.rootFontSize": 75, 3.重启vscode 4.可以转换了
DDD(Domain Driven Design) 架构设计
一.为什么要分层分层架构是所有架构的鼻祖,分层的作用就是隔离,不过,我们有时候有个误解,就是把层和程序集对应起来,就比如简单三层架构中,在你的解决方案中,一般会有三个程序集项目:XXUI.dll.X ...
mysql创建存储过程动态SQL语句
DROP PROCEDURE IF EXISTS x.`wk`; DELIMITER $$ CREATE PROCEDURE `x`.`wk`() BEGIN ); ); SET t = CONCAT ...

使用MHA实现MySQL主从复制高可用

使用MHA实现MySQL主从复制高可用的更多相关文章

随机推荐

热门专题