




1 The Paxos Algorithm



Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.

1.1 Roles

Paxos describes the actions of the processors by their roles in the protocol: client, acceptor, proposer, learner, and leader. In typical implementations, a single processor may play one or more roles at the same time. This does not affect the correctness of the protocol—it is usual to coalesce roles to improve the latency and/or number of messages in the protocol.

  • Client

    • The Client issues a request to the distributed system, and waits for a response. For instance, a write request on a file in a distributed file server.
  • Acceptor (Voters) 
    • The Acceptors act as the fault-tolerant "memory" of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum.
  • Proposer 
    • A Proposer advocates a client request, attempting to convince the Acceptors to agree on it, and acting as a coordinator to move the protocol forward when conflicts occur.
  • Learner 
    • Learners act as the replication factor for the protocol. Once a Client request has been agreed on by the Acceptors, the Learner may take action (i.e., execute the request and send a response to the client). To improve availability of processing, additional Learners can be added.
  • Leader 
    • Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved in that case.




1.2 Basic Paxos

  • Phase 1a: Prepare

  • Phase 1b: Promise

  • Phase 2a: Accept Request

  • Phase 2b: Accepted
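The four phases above can be sketched from the acceptor's side. This is a minimal illustration, not a full implementation: the class `AcceptorSketch`, its `Promise` type, and the use of `-1` for "nothing accepted yet" are assumptions made for the example, and networking, persistence, and the proposer are left out.

```java
/** Hypothetical single-value Paxos acceptor; all names are illustrative. */
final class AcceptorSketch {
    /** Phase 1b reply. acceptedN is -1 when nothing has been accepted yet. */
    static final class Promise {
        final boolean ok;
        final long acceptedN;
        final String acceptedValue;

        Promise(boolean ok, long acceptedN, String acceptedValue) {
            this.ok = ok;
            this.acceptedN = acceptedN;
            this.acceptedValue = acceptedValue;
        }
    }

    private long promisedN = -1;   // highest proposal number promised so far
    private long acceptedN = -1;   // number of the last accepted proposal
    private String acceptedValue;  // value of the last accepted proposal

    /** Phase 1a/1b: on Prepare(n), promise to ignore proposals numbered below n. */
    synchronized Promise onPrepare(long n) {
        boolean ok = n > promisedN;
        if (ok) {
            promisedN = n;
        }
        // The reply carries the last accepted proposal so the proposer can adopt its value.
        return new Promise(ok, acceptedN, acceptedValue);
    }

    /** Phase 2a/2b: on Accept(n, v), accept unless a higher number was promised. */
    synchronized boolean onAccept(long n, String value) {
        if (n < promisedN) {
            return false;
        }
        promisedN = n;
        acceptedN = n;
        acceptedValue = value;
        return true;
    }
}
```

A proposer would send Prepare(n) to a quorum of such acceptors, wait for a quorum of successful promises, and only then send the Accept request.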

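First, the roles of the legislators are divided into proposers, acceptors, and learners (one participant may hold several roles). Proposers issue proposals, each consisting of a proposal number and a proposed value. An acceptor that receives a proposal may accept it; if a proposal is accepted by a majority of acceptors, it is said to be chosen. Learners can only "learn" proposals that have been chosen. With the roles defined, the problem can be stated more precisely: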

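  1. A value can be chosen only after it has been proposed by a proposer (a value that has not yet been chosen is called a "proposal");
  2. Within a single instance of the Paxos algorithm, only one value is chosen;
  3. Learners can only learn a value that has been chosen.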

The author derives the Paxos algorithm by successively strengthening these three constraints (mainly the second).

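P1: An acceptor must accept the first proposal it receives.
P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.
P2a: If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v.
P2b: If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v.
P2c: If a proposal numbered n has value v, then there is a majority of acceptors such that either none of them has accepted any proposal numbered less than n, or v is the value of the highest-numbered proposal among all proposals numbered less than n that they have accepted.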
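One way a proposer can satisfy P2c is to look at the replies from a majority of acceptors and adopt the value of the highest-numbered proposal any of them has already accepted, falling back to its own value only when none has accepted anything. The sketch below illustrates that rule; `ValueSelectionSketch`, `Reply`, and the `-1` marker for "nothing accepted" are hypothetical names chosen for the example.

```java
import java.util.List;

/** Hypothetical helper showing how a proposer could pick a value that satisfies P2c. */
final class ValueSelectionSketch {
    /** Reply from one acceptor in the majority; acceptedN is -1 if it accepted nothing. */
    static final class Reply {
        final long acceptedN;
        final String acceptedValue;

        Reply(long acceptedN, String acceptedValue) {
            this.acceptedN = acceptedN;
            this.acceptedValue = acceptedValue;
        }
    }

    /**
     * Returns the value of the highest-numbered accepted proposal among the replies,
     * or ownValue when no acceptor in the majority has accepted any proposal.
     */
    static String chooseValue(List<Reply> majorityReplies, String ownValue) {
        Reply best = null;
        for (Reply r : majorityReplies) {
            if (r.acceptedN >= 0 && (best == null || r.acceptedN > best.acceptedN)) {
                best = r;
            }
        }
        return best == null ? ownValue : best.acceptedValue;
    }
}
```

The caller is assumed to pass replies from a majority only, and to have sent each acceptor a Prepare with a number higher than any it returned.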

2 ZooKeeper Leader Election


Vote (proposal number and proposed value): myid, zxid, epoch


currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());


HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
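To show how a map like `recvset` might be used, the following sketch counts how many received votes match the current proposal and checks whether they form a majority. It is an illustration under assumptions, not ZooKeeper's code: the `Vote` class here is a simplified stand-in, and `hasQuorum` and `clusterSize` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tally over received votes; not the ZooKeeper implementation. */
final class VoteTallySketch {
    /** Simplified stand-in for a vote: proposed leader id, zxid, epoch. */
    static final class Vote {
        final long leaderId;
        final long zxid;
        final long epoch;

        Vote(long leaderId, long zxid, long epoch) {
            this.leaderId = leaderId;
            this.zxid = zxid;
            this.epoch = epoch;
        }

        boolean sameAs(Vote other) {
            return leaderId == other.leaderId && zxid == other.zxid && epoch == other.epoch;
        }
    }

    /** True when more than half of the cluster has voted for the given proposal. */
    static boolean hasQuorum(Map<Long, Vote> recvset, Vote proposal, int clusterSize) {
        int agree = 0;
        for (Vote v : recvset.values()) {
            if (v.sameAs(proposal)) {
                agree++;
            }
        }
        return agree > clusterSize / 2;
    }
}
```

The map is keyed by the sender's server id, so a later vote from the same server replaces its earlier one instead of being counted twice.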


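  • First cluster start: all servers have the same zxid and epoch, so once more than half of the servers are up, the one with the largest myid becomes leader. For example, with five servers whose ids are 1/2/3/4/5: if they are started in the order 1/2/3/4/5, server 3 becomes leader as soon as it starts, and servers 4 and 5 join as followers; if they are started in the order 5/4/3/2/1, server 5 becomes leader once server 3 has started;
  • Cluster restart: once more than half of the servers are up, the one with the largest epoch becomes leader (after each successful election the leader increments epoch by 1); if all epochs are equal, the one with the largest zxid (that is, the one holding the most recent data) becomes leader; if the zxids also tie, the largest myid wins.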
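The two startup scenarios can be checked with a small sketch that orders votes by epoch, then zxid, then server id, the same ordering `totalOrderPredicate` in `FastLeaderElection` applies (the weight check is omitted here). The names `ElectionOrderSketch`, `beats`, and `electedLeader` are hypothetical, and the real election exchanges notifications between servers over several rounds rather than comparing a static array.

```java
/** Hypothetical illustration of the epoch, zxid, id ordering; not ZooKeeper code. */
final class ElectionOrderSketch {
    /** Simplified vote: server id, last logged zxid, current epoch. */
    static final class Vote {
        final long id;
        final long zxid;
        final long epoch;

        Vote(long id, long zxid, long epoch) {
            this.id = id;
            this.zxid = zxid;
            this.epoch = epoch;
        }
    }

    /** True if a should replace b: higher epoch, then higher zxid, then higher id. */
    static boolean beats(Vote a, Vote b) {
        return a.epoch > b.epoch
                || (a.epoch == b.epoch
                    && (a.zxid > b.zxid || (a.zxid == b.zxid && a.id > b.id)));
    }

    /** Returns the winning server id, or -1 while no majority of the cluster is up. */
    static long electedLeader(Vote[] started, int clusterSize) {
        if (started.length <= clusterSize / 2) {
            return -1; // fewer than a majority have started
        }
        Vote best = started[0];
        for (Vote v : started) {
            if (beats(v, best)) {
                best = v;
            }
        }
        return best.id;
    }
}
```

With five fresh servers (equal zxid and epoch), starting 1, 2, 3 yields leader 3, while starting 5, 4, 3 yields leader 5. After a restart, a server with a higher epoch wins regardless of its id.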



org.apache.zookeeper.server.quorum.FastLeaderElection implements Election









    public boolean containsQuorum(HashSet<Long> set){
        return (set.size() > half);
    }


    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */

        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }



