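Background

TiKV uses the Raft protocol to synchronize replicas. Every key-value write is replicated through Raft to three replicas on different machines. Raft itself guarantees strong consistency among replicas, but any system can have bugs. If a program bug causes the replicas to diverge, we need a mechanism that can detect it, and this consistency check must not interfere with the normal operation of the system.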

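Basic idea

Every TiKV process in the cluster runs a check thread. Periodically, the check thread picks, from all local replicas, the Leader replica whose last check is the oldest, and writes a raft log whose command type is AdminCmdType::ComputeHash.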

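The Leader and the Followers take a snapshot when they apply (on_apply) this log, which guarantees that all of them snapshot at the same log position. The raft log index of this log is used as the id that identifies this round of verification.
Each replica then computes a checksum from the snapshot asynchronously and saves it in the in-memory Peer object.
When the asynchronous computation finishes, the Leader writes another command, of type AdminCmdType::VerifyHash, whose content is the checksum computed by the Leader together with the id.
When a Follower receives the AdminCmdType::VerifyHash command and the id matches the id it saved locally, it compares the checksum parsed from the raft log with its locally saved checksum. If the two differ, the replicas are inconsistent.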
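
The round described above can be sketched in a few lines of Rust. This is a minimal illustration, not TiKV's code: the Replica type, its two methods, and the use of the standard library's DefaultHasher as the checksum are all assumptions made for the example. The article does not say which checksum algorithm TiKV uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical toy replica: an ordered key-value map standing in for one region's data.
#[derive(Default)]
pub struct Replica {
    pub data: BTreeMap<Vec<u8>, Vec<u8>>,
    /// (round id, checksum) saved by the last ComputeHash round, if any.
    pub saved: Option<(u64, u64)>,
}

impl Replica {
    /// Stands in for applying ComputeHash: checksum the data as it is when the log
    /// at `index` is applied, and remember the result under that index.
    pub fn apply_compute_hash(&mut self, index: u64) {
        let mut hasher = DefaultHasher::new();
        // BTreeMap iterates in key order, so equal data yields an equal checksum.
        for (key, value) in &self.data {
            key.hash(&mut hasher);
            value.hash(&mut hasher);
        }
        self.saved = Some((index, hasher.finish()));
    }

    /// Stands in for applying VerifyHash: Some(true) if the checksums agree,
    /// Some(false) if they differ, None if the round id does not match.
    pub fn apply_verify_hash(&self, index: u64, leader_checksum: u64) -> Option<bool> {
        match self.saved {
            Some((saved_index, checksum)) if saved_index == index => {
                Some(checksum == leader_checksum)
            }
            _ => None,
        }
    }
}
```

A leader and a follower that hold the same data and call apply_compute_hash with the same index end up with the same saved pair, so the follower's apply_verify_hash returns Some(true). A follower whose data differs returns Some(false), which corresponds to the inconsistency case.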

Source code analysis

Consistency check timer: on_consistency_check_tick

fn on_consistency_check_tick(&mut self, event_loop: &mut EventLoop<Self>) {
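    // The check scans RocksDB. To avoid affecting normal reads and writes, the replica check
    // for the next region is started only after the previous checksum computation has finished.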
    if self.consistency_check_worker.is_busy() {
        self.register_consistency_check_tick(event_loop);
        return;
    }

    // Pick the leader region whose last check is the oldest.
    let (mut candidate_id, mut candidate_check_time) = (0, Instant::now());
    for (&region_id, peer) in &mut self.region_peers {
        if !peer.is_leader() {
            continue;
        }
        if peer.consistency_state.last_check_time < candidate_check_time {
            candidate_id = region_id;
            candidate_check_time = peer.consistency_state.last_check_time;
        }
    }

    // If there is one, write a raft log whose command type is AdminCmdType::ComputeHash.
    if candidate_id != 0 {
        let peer = &self.region_peers[&candidate_id];

        info!("{} scheduling consistent check", peer.tag);
        let msg = Msg::new_raft_cmd(new_compute_hash_request(candidate_id, peer.peer.clone()),
                                    Box::new(|_| {}));

        if let Err(e) = self.sendch.send(msg) {
            error!("{} failed to schedule consistent check: {:?}", peer.tag, e);
        }
    }

    // Re-register the timer.
    self.register_consistency_check_tick(event_loop);
}
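
The selection loop can also be written with iterator adapters. The following is a standalone sketch under assumed names (PeerInfo and pick_candidate are hypothetical and not part of TiKV). It returns an Option instead of using region id 0 as the "no candidate" marker.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical, trimmed-down stand-in for the per-region peer state.
pub struct PeerInfo {
    pub is_leader: bool,
    pub last_check_time: Instant,
}

/// Returns the id of the leader region whose last consistency check is the oldest,
/// or None when this store holds no leader at all.
pub fn pick_candidate(peers: &HashMap<u64, PeerInfo>) -> Option<u64> {
    peers
        .iter()
        // Only leaders may propose ComputeHash, so followers are skipped.
        .filter(|(_, peer)| peer.is_leader)
        // The smallest Instant is the check that happened longest ago.
        .min_by_key(|(_, peer)| peer.last_check_time)
        .map(|(&region_id, _)| region_id)
}
```

Because last_check_time is refreshed whenever a region computes its hash, repeatedly picking the oldest leader cycles through the leader regions over time rather than checking the same region again and again.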

After this log is committed, applying it triggers on_ready_compute_hash on both the leader and the followers.

fn on_ready_compute_hash(&mut self, region: metapb::Region, index: u64, snap: EngineSnapshot) {
    let region_id = region.get_id();
    self.region_peers.get_mut(&region_id).unwrap().consistency_state.last_check_time =
        Instant::now();

    // Start the asynchronous checksum computation. It must be asynchronous so that the
    // raft thread is not blocked.
    let task = ConsistencyCheckTask::compute_hash(region, index, snap);
    info!("[region {}] schedule {}", region_id, task);
    if let Err(e) = self.consistency_check_worker.schedule(task) {
        error!("[region {}] schedule failed: {:?}", region_id, e);
    }
}
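
Why the computation is handed to a worker can be illustrated with a plain thread and a channel. This is only a sketch under stated assumptions: schedule_compute_hash is a hypothetical helper, the snapshot is modelled as a Vec of key-value pairs, and DefaultHasher stands in for the real checksum. TiKV's own worker and snapshot types are not reproduced here. The result struct mirrors the fields of Msg::ComputeHashResult shown in the article, except that its hash is a u64 rather than a Vec<u8>.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Result message sent back to the caller once the checksum is ready.
pub struct ComputeHashResult {
    pub region_id: u64,
    pub index: u64,
    pub hash: u64,
}

/// Hands the snapshot to a background thread and returns immediately, so the
/// calling thread (the raft thread in the article) is never blocked by the scan.
pub fn schedule_compute_hash(
    region_id: u64,
    index: u64,
    snapshot: Vec<(Vec<u8>, Vec<u8>)>,
) -> Receiver<ComputeHashResult> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut hasher = DefaultHasher::new();
        for (key, value) in &snapshot {
            key.hash(&mut hasher);
            value.hash(&mut hasher);
        }
        let result = ComputeHashResult { region_id, index, hash: hasher.finish() };
        // If the receiver has been dropped, the result is simply discarded.
        let _ = tx.send(result);
    });
    rx
}
```

The caller keeps the returned Receiver and picks up the result later, which plays the same role as the Msg::ComputeHashResult message delivered to notify.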

When the asynchronous checksum computation finishes, fn notify(&mut self, event_loop: &mut EventLoop<Self>, msg: Msg) is called back. It calls on_hash_computed, passing in the computed checksum.

fn notify(&mut self, event_loop: &mut EventLoop<Self>, msg: Msg) {
    match msg {
        Msg::RaftMessage(data) => {
            if let Err(e) = self.on_raft_message(data) {
                error!("{} handle raft message err: {:?}", self.tag, e);
            }
        }
        Msg::RaftCmd { send_time, request, callback } => {
            self.raft_metrics
                .propose
                .request_wait_time
                .observe(duration_to_sec(send_time.elapsed()) as f64);
            self.propose_raft_command(request, callback)
        }
        Msg::SnapshotStats => self.store_heartbeat_pd(),

        // The asynchronous checksum computation has finished: hand the result to on_hash_computed.
        Msg::ComputeHashResult { region_id, index, hash } => {
            self.on_hash_computed(region_id, index, hash);
        }
    }
}

on_hash_computed saves the computed checksum. On the leader it also sends a raft log whose command type is AdminCmdType::VerifyHash, carrying the computed checksum value.

fn on_hash_computed(&mut self, region_id: u64, index: u64, hash: Vec<u8>) {
    let (state, peer) = match self.region_peers.get_mut(&region_id) {
        None => {
            warn!("[region {}] receive stale hash at index {}",
                  region_id,
                  index);
            return;
        }
        Some(p) => (&mut p.consistency_state, &p.peer),
    };

    // Save the computed checksum together with the index (the raft log index).
    // Note that a checksum verification may also happen here; this is explained later.
    if !verify_and_store_hash(region_id, state, index, hash) {
        return;
    }

    // Then send a raft log whose command type is AdminCmdType::VerifyHash, carrying the
    // computed checksum and index (only the leader sends this log).
    let msg = Msg::new_raft_cmd(new_verify_hash_request(region_id, peer.clone(), state),
                                Box::new(|_| {}));
    if let Err(e) = self.sendch.send(msg) {
        error!("[region {}] failed to schedule verify command for index {}: {:?}",
               region_id,
               index,
               e);
    }
}

When a follower, in on_apply, receives the raft log whose command type is AdminCmdType::VerifyHash, on_ready_verify_hash is triggered, which in turn calls verify_and_store_hash to verify the checksum.

fn on_ready_verify_hash(&mut self,
                        region_id: u64,
                        expected_index: u64,
                        expected_hash: Vec<u8>) {
    let state = match self.region_peers.get_mut(&region_id) {
        None => {
            warn!("[region {}] receive stale hash at index {}",
                  region_id,
                  expected_index);
            return;
        }
        Some(p) => &mut p.consistency_state,
    };

    // The verification logic is triggered inside this function.
    verify_and_store_hash(region_id, state, expected_index, expected_hash);
}

verify_and_store_hash

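// Note that this function is called by both on_hash_computed and on_ready_verify_hash,
// so there are two places where the checksum may be verified.
//
// Verification in on_ready_verify_hash is the normal flow and is easy to understand:
//     1. The leader and the follower finish computing the checksum, and the follower saves
//        the index and checksum locally. The leader then sends a raft log whose command
//        type is AdminCmdType::VerifyHash.
//     2. The follower receives the command and parses the checksum and index from the log.
//        If the parsed index equals the locally saved index, it verifies the checksum.
//
// When does on_hash_computed verify the checksum?
//     1. The leader finishes computing before the follower does and sends
//        AdminCmdType::VerifyHash to the follower.
//     2. The follower receives the command, finds that its index is larger than the local
//        one, and simply saves the checksum and index from the log locally.
//     3. When the follower's own computation finishes, it verifies the computed result
//        against the locally saved checksum.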

fn verify_and_store_hash(region_id: u64,
                         state: &mut ConsistencyState,
                         expected_index: u64,
                         expected_hash: Vec<u8>)
                         -> bool {
    if expected_index < state.index {
        REGION_HASH_COUNTER_VEC.with_label_values(&["verify", "miss"]).inc();
        warn!("[region {}] has scheduled a new hash: {} > {}, skip.",
              region_id,
              state.index,
              expected_index);
        return false;
    }

    // The index passed in is the index of the last ComputeHash command. The region
    // consistency check is performed only when the two indexes are equal.
    if state.index == expected_index {
        if state.hash != expected_hash {
            // Replica inconsistency detected!
            panic!("[region {}] hash at {} not correct, want {}, got {}!!!",
                   region_id,
                   state.index,
                   escape(&expected_hash),
                   escape(&state.hash));
        }
        REGION_HASH_COUNTER_VEC.with_label_values(&["verify", "matched"]).inc();
        state.hash = vec![];
        return false;
    }
    if state.index != INVALID_INDEX && !state.hash.is_empty() {
        // Maybe computing is too slow or computed result is dropped due to channel full.
        // If computing is too slow, miss count will be increased twice.
        REGION_HASH_COUNTER_VEC.with_label_values(&["verify", "miss"]).inc();
        warn!("[region {}] hash belongs to index {}, but we want {}, skip.",
              region_id,
              state.index,
              expected_index);
    }
    state.index = expected_index;
    state.hash = expected_hash;
    true
}
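
The two arrival orders can be exercised with a reduced model of this function. The sketch below is an assumption-laden simplification, not the TiKV implementation: State, Outcome and verify_or_store are hypothetical names, a mismatch is reported as a return value instead of a panic, the metrics and logging are left out, and index 0 is assumed to mean "no round yet".

```rust
/// What happened to an incoming (index, hash) pair.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The pair belongs to an older round than the one already held.
    Stale,
    /// Same round, equal hashes.
    Matched,
    /// Same round, different hashes: the replicas have diverged.
    Mismatch,
    /// First pair seen for a newer round; remembered for a later comparison.
    Stored,
}

/// Reduced stand-in for the consistency state kept per region.
#[derive(Default)]
pub struct State {
    pub index: u64,
    pub hash: Vec<u8>,
}

/// Whichever of the local result and the leader's VerifyHash arrives first is
/// stored; the one that arrives second, at the same index, is compared against it.
pub fn verify_or_store(state: &mut State, index: u64, hash: Vec<u8>) -> Outcome {
    if index < state.index {
        return Outcome::Stale;
    }
    if index == state.index {
        if state.hash != hash {
            return Outcome::Mismatch;
        }
        // The round is finished, so the stored hash is cleared.
        state.hash.clear();
        return Outcome::Matched;
    }
    state.index = index;
    state.hash = hash;
    Outcome::Stored
}
```

With a fresh State, two calls at the same index and with the same hash yield Stored and then Matched, no matter which of the two represents the local computation and which the leader's command. A second call at the same index with a different hash yields Mismatch, which is the point where the original code panics.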


