8. SOFAJRaft源码分析— 如何实现日志复制的pipeline机制?
前言
前几天和腾讯的大佬一起吃饭聊天,说起我对SOFAJRaft的理解,我自然以为我是很懂了的,但是大佬问起了我那SOFAJRaft集群之间的日志是怎么复制的?
我当时哑口无言,说不出是怎么实现的,所以这次来分析一下SOFAJRaft中日志复制是怎么做的。
Leader发送探针获取Follower的LastLogIndex
Leader 节点在通过 Replicator 和 Follower 建立连接之后,要发送一个 Probe 类型的探针请求,目的是知道 Follower 已经拥有的的日志位置,以便于向 Follower 发送后续的日志。
大致的流程如下:
NodeImpl#becomeLeader->replicatorGroup#addReplicator->Replicator#start->Replicator#sendEmptyEntries
最后会通过调用Replicator的sendEmptyEntries方法来发送探针来获取Follower的LastLogIndex
Replicator#sendEmptyEntries
private void sendEmptyEntries(final boolean isHeartbeat,
final RpcResponseClosure<AppendEntriesResponse> heartBeatClosure) {
final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
//将集群配置设置到rb中,例如Term,GroupId,ServerId等
if (!fillCommonFields(rb, this.nextIndex - 1, isHeartbeat)) {
// id is unlock in installSnapshot
installSnapshot();
if (isHeartbeat && heartBeatClosure != null) {
Utils.runClosureInThread(heartBeatClosure, new Status(RaftError.EAGAIN,
"Fail to send heartbeat to peer %s", this.options.getPeerId()));
}
return;
}
try {
final long monotonicSendTimeMs = Utils.monotonicMs();
final AppendEntriesRequest request = rb.build();
if (isHeartbeat) {
....//省略心跳代码
} else {
//statInfo这个类没看到哪里有用到,
// Sending a probe request.
//leader发送探针获取Follower的LastLogIndex
this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
//将lastLogIndex设置为比firstLogIndex小1
this.statInfo.firstLogIndex = this.nextIndex;
this.statInfo.lastLogIndex = this.nextIndex - 1;
this.appendEntriesCounter++;
//设置当前Replicator为发送探针
this.state = State.Probe;
final int stateVersion = this.version;
//返回reqSeq,并将reqSeq加一
final int seq = getAndIncrementReqSeq();
final Future<Message> rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(),
request, -1, new RpcResponseClosureAdapter<AppendEntriesResponse>() {
@Override
public void run(final Status status) {
onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request,
getResponse(), seq, stateVersion, monotonicSendTimeMs);
}
});
//Inflight 是对批量发送出去的 logEntry 的一种抽象,他表示哪些 logEntry 已经被封装成日志复制 request 发送出去了
//这里是将logEntry封装到Inflight中
addInflight(RequestType.AppendEntries, this.nextIndex, 0, 0, seq, rpcFuture);
}
LOG.debug("Node {} send HeartbeatRequest to {} term {} lastCommittedIndex {}", this.options.getNode()
.getNodeId(), this.options.getPeerId(), this.options.getTerm(), request.getCommittedIndex());
} finally {
this.id.unlock();
}
}
在调用sendEmptyEntries方法的时候,会传入isHeartbeat为false和heartBeatClosure为null,因为我们这个方法主要是发送探针获取Follower的位移。
首先调用fillCommonFields方法,将任期,groupId,ServerId,PeerIdLogIndex等设置到rb中,如:
private boolean fillCommonFields(final AppendEntriesRequest.Builder rb, long prevLogIndex, final boolean isHeartbeat) {
final long prevLogTerm = this.options.getLogManager().getTerm(prevLogIndex);
....
rb.setTerm(this.options.getTerm());
rb.setGroupId(this.options.getGroupId());
rb.setServerId(this.options.getServerId().toString());
rb.setPeerId(this.options.getPeerId().toString());
rb.setPrevLogIndex(prevLogIndex);
rb.setPrevLogTerm(prevLogTerm);
rb.setCommittedIndex(this.options.getBallotBox().getLastCommittedIndex());
return true;
}
注意prevLogIndex是nextIndex-1,表示当前的index
继续往下走,会设置statInfo实例里面的属性,但是statInfo这个对象我没看到哪里有用到过。
然后向该Follower发送一个AppendEntriesRequest请求,onRpcReturned负责响应请求。
发送完请求后调用addInflight初始化一个Inflight实例,加入到inflights集合中,如下:
private void addInflight(final RequestType reqType, final long startIndex, final int count, final int size,
final int seq, final Future<Message> rpcInfly) {
this.rpcInFly = new Inflight(reqType, startIndex, count, size, seq, rpcInfly);
this.inflights.add(this.rpcInFly);
this.nodeMetrics.recordSize("replicate-inflights-count", this.inflights.size());
}
Inflight 是对批量发送出去的 logEntry 的一种抽象,他表示哪些 logEntry 已经被封装成日志复制 request 发送出去了,这里是将logEntry封装到Inflight中。
Leader批量的发送日志给Follower
Replicator#sendEntries
private boolean sendEntries(final long nextSendingIndex) {
final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
//填写当前Replicator的配置信息到rb中
if (!fillCommonFields(rb, nextSendingIndex - 1, false)) {
// unlock id in installSnapshot
installSnapshot();
return false;
}
ByteBufferCollector dataBuf = null;
//获取最大的size为1024
final int maxEntriesSize = this.raftOptions.getMaxEntriesSize();
//这里使用了类似对象池的技术,避免重复创建对象
final RecyclableByteBufferList byteBufList = RecyclableByteBufferList.newInstance();
try {
//循环遍历出所有的logEntry封装到byteBufList和emb中
for (int i = 0; i < maxEntriesSize; i++) {
final RaftOutter.EntryMeta.Builder emb = RaftOutter.EntryMeta.newBuilder();
//nextSendingIndex代表下一个要发送的index,i代表偏移量
if (!prepareEntry(nextSendingIndex, i, emb, byteBufList)) {
break;
}
rb.addEntries(emb.build());
}
//如果EntriesCount为0的话,说明LogManager里暂时没有新数据
if (rb.getEntriesCount() == 0) {
if (nextSendingIndex < this.options.getLogManager().getFirstLogIndex()) {
installSnapshot();
return false;
}
// _id is unlock in _wait_more
waitMoreEntries(nextSendingIndex);
return false;
}
//将byteBufList里面的数据放入到rb中
if (byteBufList.getCapacity() > 0) {
dataBuf = ByteBufferCollector.allocateByRecyclers(byteBufList.getCapacity());
for (final ByteBuffer b : byteBufList) {
dataBuf.put(b);
}
final ByteBuffer buf = dataBuf.getBuffer();
buf.flip();
rb.setData(ZeroByteStringHelper.wrap(buf));
}
} finally {
//回收一下byteBufList
RecycleUtil.recycle(byteBufList);
}
final AppendEntriesRequest request = rb.build();
if (LOG.isDebugEnabled()) {
LOG.debug(
"Node {} send AppendEntriesRequest to {} term {} lastCommittedIndex {} prevLogIndex {} prevLogTerm {} logIndex {} count {}",
this.options.getNode().getNodeId(), this.options.getPeerId(), this.options.getTerm(),
request.getCommittedIndex(), request.getPrevLogIndex(), request.getPrevLogTerm(), nextSendingIndex,
request.getEntriesCount());
}
//statInfo没找到哪里有用到过
this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
this.statInfo.firstLogIndex = rb.getPrevLogIndex() + 1;
this.statInfo.lastLogIndex = rb.getPrevLogIndex() + rb.getEntriesCount();
final Recyclable recyclable = dataBuf;
final int v = this.version;
final long monotonicSendTimeMs = Utils.monotonicMs();
final int seq = getAndIncrementReqSeq();
final Future<Message> rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(),
request, -1, new RpcResponseClosureAdapter<AppendEntriesResponse>() {
@Override
public void run(final Status status) {
//回收资源
RecycleUtil.recycle(recyclable);
onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request, getResponse(), seq,
v, monotonicSendTimeMs);
}
});
//添加Inflight
addInflight(RequestType.AppendEntries, nextSendingIndex, request.getEntriesCount(), request.getData().size(),
seq, rpcFuture);
return true;
}
- 首先会调用fillCommonFields方法,填写当前Replicator的配置信息到rb中;
- 调用prepareEntry,根据当前的I和nextSendingIndex计算出当前的偏移量,然后去LogManager找到对应的LogEntry,再把LogEntry里面的属性设置到emb中,并把LogEntry里面的数据加入到RecyclableByteBufferList中;
- 如果LogEntry里面没有新的数据,那么EntriesCount会为0,那么就返回;
- 遍历byteBufList里面的数据,将数据添加到rb中,这样rb里面的数据就是前面是任期、类型、数据长度等信息,rb后面就是真正的数据;
- 新建AppendEntriesRequest实例发送请求;
- 添加 Inflight 到队列中。Leader 维护一个 queue,每发出一批 logEntry 就向 queue 中 添加一个代表这一批 logEntry 的 Inflight,这样当它知道某一批 logEntry 复制失败之后,就可以依赖 queue 中的 Inflight 把该批次 logEntry 以及后续的所有日志重新复制给 follower。既保证日志复制能够完成,又保证了复制日志的顺序不变
其中RecyclableByteBufferList采用对象池进行实例化,对象池的相关信息可以看我这篇:7. SOFAJRaft源码分析—如何实现一个轻量级的对象池?
下面我们详解一下sendEntries里面的具体方法。
prepareEntry填充emb属性
Replicator#prepareEntry
boolean prepareEntry(final long nextSendingIndex, final int offset, final RaftOutter.EntryMeta.Builder emb,
final RecyclableByteBufferList dateBuffer) {
if (dateBuffer.getCapacity() >= this.raftOptions.getMaxBodySize()) {
return false;
}
//设置当前要发送的index
final long logIndex = nextSendingIndex + offset;
//如果这个index已经在LogManager中找不到了,那么直接返回
final LogEntry entry = this.options.getLogManager().getEntry(logIndex);
if (entry == null) {
return false;
}
//下面就是把LogEntry里面的属性设置到emb中
emb.setTerm(entry.getId().getTerm());
if (entry.hasChecksum()) {
emb.setChecksum(entry.getChecksum()); //since 1.2.6
}
emb.setType(entry.getType());
if (entry.getPeers() != null) {
Requires.requireTrue(!entry.getPeers().isEmpty(), "Empty peers at logIndex=%d", logIndex);
for (final PeerId peer : entry.getPeers()) {
emb.addPeers(peer.toString());
}
if (entry.getOldPeers() != null) {
for (final PeerId peer : entry.getOldPeers()) {
emb.addOldPeers(peer.toString());
}
}
} else {
Requires.requireTrue(entry.getType() != EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION,
"Empty peers but is ENTRY_TYPE_CONFIGURATION type at logIndex=%d", logIndex);
}
final int remaining = entry.getData() != null ? entry.getData().remaining() : 0;
emb.setDataLen(remaining);
//把LogEntry里面的数据放入到dateBuffer中
if (entry.getData() != null) {
// should slice entry data
dateBuffer.add(entry.getData().slice());
}
return true;
}
- 对比一下传入的dateBuffer的容量是否已经超过了系统设置的容量(512 * 1024),如果超过了则返回false
- 根据给定的起始的index和偏移量offset计算logIndex,然后去LogManager里面根据index获取LogEntry,如果返回的为则说明找不到了,那么就直接返回false,外层的if判断会执行break跳出循环
- 然后将LogEntry里面的属性设置到emb对象中,最后将LogEntry里面的数据添加到dateBuffer,这里要做到数据和属性分离
Follower处理Leader发送的日志复制请求
在leader发送完AppendEntriesRequest请求之后,请求的数据会在Follower中被AppendEntriesRequestProcessor所处理
具体的处理方法是processRequest0
public Message processRequest0(final RaftServerService service, final AppendEntriesRequest request,
final RpcRequestClosure done) {
final Node node = (Node) service;
//默认使用pipeline
if (node.getRaftOptions().isReplicatorPipeline()) {
final String groupId = request.getGroupId();
final String peerId = request.getPeerId();
//获取请求的次数,以groupId+peerId为一个维度
final int reqSequence = getAndIncrementSequence(groupId, peerId, done.getBizContext().getConnection());
//Follower处理leader发过来的日志请求
final Message response = service.handleAppendEntriesRequest(request, new SequenceRpcRequestClosure(done,
reqSequence, groupId, peerId));
//正常的数据只返回null,异常的数据会返回response
if (response != null) {
sendSequenceResponse(groupId, peerId, reqSequence, done.getAsyncContext(), done.getBizContext(),
response);
}
return null;
} else {
return service.handleAppendEntriesRequest(request, done);
}
}
调用service的handleAppendEntriesRequest会调用到NodeIml的handleAppendEntriesRequest方法中,handleAppendEntriesRequest方法只是异常情况和leader没有发送数据时才会返回,正常情况是返回null
处理响应日志复制请求
NodeIml#handleAppendEntriesRequest
public Message handleAppendEntriesRequest(final AppendEntriesRequest request, final RpcRequestClosure done) {
boolean doUnlock = true;
final long startMs = Utils.monotonicMs();
this.writeLock.lock();
//获取entryLog个数
final int entriesCount = request.getEntriesCount();
try {
//校验当前节点是否活跃
if (!this.state.isActive()) {
LOG.warn("Node {} is not in active state, currTerm={}.", getNodeId(), this.currTerm);
return RpcResponseFactory.newResponse(RaftError.EINVAL, "Node %s is not in active state, state %s.",
getNodeId(), this.state.name());
}
//校验传入的serverId是否能被正常解析
final PeerId serverId = new PeerId();
if (!serverId.parse(request.getServerId())) {
LOG.warn("Node {} received AppendEntriesRequest from {} serverId bad format.", getNodeId(),
request.getServerId());
return RpcResponseFactory.newResponse(RaftError.EINVAL, "Parse serverId failed: %s.",
request.getServerId());
}
//校验任期
// Check stale term
if (request.getTerm() < this.currTerm) {
LOG.warn("Node {} ignore stale AppendEntriesRequest from {}, term={}, currTerm={}.", getNodeId(),
request.getServerId(), request.getTerm(), this.currTerm);
return AppendEntriesResponse.newBuilder() //
.setSuccess(false) //
.setTerm(this.currTerm) //
.build();
}
// Check term and state to step down
//当前节点如果不是Follower节点的话要执行StepDown操作
checkStepDown(request.getTerm(), serverId);
//这说明请求的节点不是当前节点的leader
if (!serverId.equals(this.leaderId)) {
LOG.error("Another peer {} declares that it is the leader at term {} which was occupied by leader {}.",
serverId, this.currTerm, this.leaderId);
// Increase the term by 1 and make both leaders step down to minimize the
// loss of split brain
stepDown(request.getTerm() + 1, false, new Status(RaftError.ELEADERCONFLICT,
"More than one leader in the same term."));
return AppendEntriesResponse.newBuilder() //
.setSuccess(false) //
.setTerm(request.getTerm() + 1) //
.build();
}
updateLastLeaderTimestamp(Utils.monotonicMs());
//校验是否正在生成快照
if (entriesCount > 0 && this.snapshotExecutor != null && this.snapshotExecutor.isInstallingSnapshot()) {
LOG.warn("Node {} received AppendEntriesRequest while installing snapshot.", getNodeId());
return RpcResponseFactory.newResponse(RaftError.EBUSY, "Node %s:%s is installing snapshot.",
this.groupId, this.serverId);
}
//传入的是发起请求节点的nextIndex-1
final long prevLogIndex = request.getPrevLogIndex();
final long prevLogTerm = request.getPrevLogTerm();
final long localPrevLogTerm = this.logManager.getTerm(prevLogIndex);
//发起请求的节点prevLogIndex对应的任期和当前节点的index所对应的任期不匹配
if (localPrevLogTerm != prevLogTerm) {
final long lastLogIndex = this.logManager.getLastLogIndex();
LOG.warn(
"Node {} reject term_unmatched AppendEntriesRequest from {}, term={}, prevLogIndex={}, prevLogTerm={}, localPrevLogTerm={}, lastLogIndex={}, entriesSize={}.",
getNodeId(), request.getServerId(), request.getTerm(), prevLogIndex, prevLogTerm, localPrevLogTerm,
lastLogIndex, entriesCount);
return AppendEntriesResponse.newBuilder() //
.setSuccess(false) //
.setTerm(this.currTerm) //
.setLastLogIndex(lastLogIndex) //
.build();
}
//响应心跳或者发送的是sendEmptyEntry
if (entriesCount == 0) {
// heartbeat
final AppendEntriesResponse.Builder respBuilder = AppendEntriesResponse.newBuilder() //
.setSuccess(true) //
.setTerm(this.currTerm)
// 返回当前节点的最新的index
.setLastLogIndex(this.logManager.getLastLogIndex());
doUnlock = false;
this.writeLock.unlock();
// see the comments at FollowerStableClosure#run()
this.ballotBox.setLastCommittedIndex(Math.min(request.getCommittedIndex(), prevLogIndex));
return respBuilder.build();
}
// Parse request
long index = prevLogIndex;
final List<LogEntry> entries = new ArrayList<>(entriesCount);
ByteBuffer allData = null;
if (request.hasData()) {
allData = request.getData().asReadOnlyByteBuffer();
}
//获取所有数据
final List<RaftOutter.EntryMeta> entriesList = request.getEntriesList();
for (int i = 0; i < entriesCount; i++) {
final RaftOutter.EntryMeta entry = entriesList.get(i);
index++;
if (entry.getType() != EnumOutter.EntryType.ENTRY_TYPE_UNKNOWN) {
//给logEntry属性设值
final LogEntry logEntry = new LogEntry();
logEntry.setId(new LogId(index, entry.getTerm()));
logEntry.setType(entry.getType());
if (entry.hasChecksum()) {
logEntry.setChecksum(entry.getChecksum()); // since 1.2.6
}
//将数据填充到logEntry
final long dataLen = entry.getDataLen();
if (dataLen > 0) {
final byte[] bs = new byte[(int) dataLen];
assert allData != null;
allData.get(bs, 0, bs.length);
logEntry.setData(ByteBuffer.wrap(bs));
}
if (entry.getPeersCount() > 0) {
//只有配置类型的entry才有多个Peer
if (entry.getType() != EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION) {
throw new IllegalStateException(
"Invalid log entry that contains peers but is not ENTRY_TYPE_CONFIGURATION type: "
+ entry.getType());
}
final List<PeerId> peers = new ArrayList<>(entry.getPeersCount());
for (final String peerStr : entry.getPeersList()) {
final PeerId peer = new PeerId();
peer.parse(peerStr);
peers.add(peer);
}
logEntry.setPeers(peers);
if (entry.getOldPeersCount() > 0) {
final List<PeerId> oldPeers = new ArrayList<>(entry.getOldPeersCount());
for (final String peerStr : entry.getOldPeersList()) {
final PeerId peer = new PeerId();
peer.parse(peerStr);
oldPeers.add(peer);
}
logEntry.setOldPeers(oldPeers);
}
} else if (entry.getType() == EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION) {
throw new IllegalStateException(
"Invalid log entry that contains zero peers but is ENTRY_TYPE_CONFIGURATION type");
}
// Validate checksum
if (this.raftOptions.isEnableLogEntryChecksum() && logEntry.isCorrupted()) {
long realChecksum = logEntry.checksum();
LOG.error(
"Corrupted log entry received from leader, index={}, term={}, expectedChecksum={}, " +
"realChecksum={}",
logEntry.getId().getIndex(), logEntry.getId().getTerm(), logEntry.getChecksum(),
realChecksum);
return RpcResponseFactory.newResponse(RaftError.EINVAL,
"The log entry is corrupted, index=%d, term=%d, expectedChecksum=%d, realChecksum=%d",
logEntry.getId().getIndex(), logEntry.getId().getTerm(), logEntry.getChecksum(),
realChecksum);
}
entries.add(logEntry);
}
}
//存储日志,并回调返回response
final FollowerStableClosure closure = new FollowerStableClosure(request, AppendEntriesResponse.newBuilder()
.setTerm(this.currTerm), this, done, this.currTerm);
this.logManager.appendEntries(entries, closure);
// update configuration after _log_manager updated its memory status
this.conf = this.logManager.checkAndSetConfiguration(this.conf);
return null;
} finally {
if (doUnlock) {
this.writeLock.unlock();
}
this.metrics.recordLatency("handle-append-entries", Utils.monotonicMs() - startMs);
this.metrics.recordSize("handle-append-entries-count", entriesCount);
}
}
handleAppendEntriesRequest方法写的很长,但是实际上做了很多校验的事情,具体的处理逻辑不多
- 校验当前的Node节点是否还处于活跃状态,如果不是的话,那么直接返回一个error的response
- 校验请求的serverId的格式是否正确,不正确则返回一个error的response
- 校验请求的任期是否小于当前的任期,如果是那么返回一个AppendEntriesResponse类型的response
- 调用checkStepDown方法检测当前节点的任期,以及状态,是否有leader等
- 如果请求的serverId和当前节点的leaderId是不是同一个,用来校验是不是leader发起的请求,如果不是返回一个AppendEntriesResponse
- 校验是否正在生成快照
- 获取请求的Index在当前节点中对应的LogEntry的任期是不是和请求传入的任期相同,不同的话则返回AppendEntriesResponse
- 如果传入的entriesCount为零,那么leader发送的可能是心跳或者发送的是sendEmptyEntry,返回AppendEntriesResponse,并将当前任期和最新index封装返回
- 请求的数据不为空,那么遍历所有的数据
- 实例化一个logEntry,并且将数据和属性设置到logEntry实例中,最后将logEntry放入到entries集合中
- 调用logManager将数据批量提交日志写入 RocksDB
发送响应给leader
最终发送给leader的响应是通过AppendEntriesRequestProcessor的sendSequenceResponse来发送的
void sendSequenceResponse(final String groupId, final String peerId, final int seq,
final AsyncContext asyncContext, final BizContext bizContext, final Message msg) {
final Connection connection = bizContext.getConnection();
//获取context,维度是groupId和peerId
final PeerRequestContext ctx = getPeerRequestContext(groupId, peerId, connection);
final PriorityQueue<SequenceMessage> respQueue = ctx.responseQueue;
assert (respQueue != null);
synchronized (Utils.withLockObject(respQueue)) {
//将要响应的数据放入到优先队列中
respQueue.add(new SequenceMessage(asyncContext, msg, seq));
//校验队列里面的数据是否超过了256
if (!ctx.hasTooManyPendingResponses()) {
while (!respQueue.isEmpty()) {
final SequenceMessage queuedPipelinedResponse = respQueue.peek();
//如果序列对应不上,那么就不发送响应
if (queuedPipelinedResponse.sequence != getNextRequiredSequence(groupId, peerId, connection)) {
// sequence mismatch, waiting for next response.
break;
}
respQueue.remove();
try {
//发送响应
queuedPipelinedResponse.sendResponse();
} finally {
//序列加一
getAndIncrementNextRequiredSequence(groupId, peerId, connection);
}
}
} else {
LOG.warn("Closed connection to peer {}/{}, because of too many pending responses, queued={}, max={}",
ctx.groupId, peerId, respQueue.size(), ctx.maxPendingResponses);
connection.close();
// Close the connection if there are too many pending responses in queue.
removePeerRequestContext(groupId, peerId);
}
}
}
这个方法会将要发送的数据依次压入到PriorityQueue优先队列中进行排序,然后获取序列号最小的元素和nextRequiredSequence比较,如果不相等,那么则是出现了乱序的情况,那么就不发送请求
Leader处理日志复制的Response
Leader收到Follower发过来的Response响应之后会调用Replicator的onRpcReturned方法
static void onRpcReturned(final ThreadId id, final RequestType reqType, final Status status, final Message request,
final Message response, final int seq, final int stateVersion, final long rpcSendTime) {
if (id == null) {
return;
}
final long startTimeMs = Utils.nowMs();
Replicator r;
if ((r = (Replicator) id.lock()) == null) {
return;
}
//检查版本号,因为每次resetInflights都会让version加一,所以检查一下
if (stateVersion != r.version) {
LOG.debug(
"Replicator {} ignored old version response {}, current version is {}, request is {}\n, and response is {}\n, status is {}.",
r, stateVersion, r.version, request, response, status);
id.unlock();
return;
}
//使用优先队列按seq排序,最小的会在第一个
final PriorityQueue<RpcResponse> holdingQueue = r.pendingResponses;
//这里用一个优先队列是因为响应是异步的,seq小的可能响应比seq大慢
holdingQueue.add(new RpcResponse(reqType, seq, status, request, response, rpcSendTime));
//默认holdingQueue队列里面的数量不能超过256
if (holdingQueue.size() > r.raftOptions.getMaxReplicatorInflightMsgs()) {
LOG.warn("Too many pending responses {} for replicator {}, maxReplicatorInflightMsgs={}",
holdingQueue.size(), r.options.getPeerId(), r.raftOptions.getMaxReplicatorInflightMsgs());
//重新发送探针
//清空数据
r.resetInflights();
r.state = State.Probe;
r.sendEmptyEntries(false);
return;
}
boolean continueSendEntries = false;
final boolean isLogDebugEnabled = LOG.isDebugEnabled();
StringBuilder sb = null;
if (isLogDebugEnabled) {
sb = new StringBuilder("Replicator ").append(r).append(" is processing RPC responses,");
}
try {
int processed = 0;
while (!holdingQueue.isEmpty()) {
//取出holdingQueue里seq最小的数据
final RpcResponse queuedPipelinedResponse = holdingQueue.peek();
//如果Follower没有响应的话就会出现次序对不上的情况,那么就不往下走了
//sequence mismatch, waiting for next response.
if (queuedPipelinedResponse.seq != r.requiredNextSeq) {
// 如果之前存在处理,则到此直接break循环
if (processed > 0) {
if (isLogDebugEnabled) {
sb.append("has processed ").append(processed).append(" responses,");
}
break;
} else {
//Do not processed any responses, UNLOCK id and return.
continueSendEntries = false;
id.unlock();
return;
}
}
//走到这里说明seq对的上,那么就移除优先队列里面seq最小的数据
holdingQueue.remove();
processed++;
//获取inflights队列里的第一个元素
final Inflight inflight = r.pollInflight();
//发起一个请求的时候会将inflight放入到队列中
//如果为空,那么就忽略
if (inflight == null) {
// The previous in-flight requests were cleared.
if (isLogDebugEnabled) {
sb.append("ignore response because request not found:").append(queuedPipelinedResponse)
.append(",\n");
}
continue;
}
//seq没有对上,说明顺序乱了,重置状态
if (inflight.seq != queuedPipelinedResponse.seq) {
// reset state
LOG.warn(
"Replicator {} response sequence out of order, expect {}, but it is {}, reset state to try again.",
r, inflight.seq, queuedPipelinedResponse.seq);
r.resetInflights();
r.state = State.Probe;
continueSendEntries = false;
// 锁住节点,根据错误类别等待一段时间
r.block(Utils.nowMs(), RaftError.EREQUEST.getNumber());
return;
}
try {
switch (queuedPipelinedResponse.requestType) {
case AppendEntries:
//处理日志复制的response
continueSendEntries = onAppendEntriesReturned(id, inflight, queuedPipelinedResponse.status,
(AppendEntriesRequest) queuedPipelinedResponse.request,
(AppendEntriesResponse) queuedPipelinedResponse.response, rpcSendTime, startTimeMs, r);
break;
case Snapshot:
//处理快照的response
continueSendEntries = onInstallSnapshotReturned(id, r, queuedPipelinedResponse.status,
(InstallSnapshotRequest) queuedPipelinedResponse.request,
(InstallSnapshotResponse) queuedPipelinedResponse.response);
break;
}
} finally {
if (continueSendEntries) {
// Success, increase the response sequence.
r.getAndIncrementRequiredNextSeq();
} else {
// The id is already unlocked in onAppendEntriesReturned/onInstallSnapshotReturned, we SHOULD break out.
break;
}
}
}
} finally {
if (isLogDebugEnabled) {
sb.append(", after processed, continue to send entries: ").append(continueSendEntries);
LOG.debug(sb.toString());
}
if (continueSendEntries) {
// unlock in sendEntries.
r.sendEntries();
}
}
}
- 检查版本号,因为每次resetInflights都会让version加一,所以检查一下是不是同一批的数据
- 获取Replicator的pendingResponses队列,然后将当前响应的数据封装成RpcResponse实例加入到队列中
- 校验队列里面的元素是否大于256,大于256则清空数据重新同步
- 校验holdingQueue队列里面的seq最小的序列数据序列和当前的requiredNextSeq是否相同,不同的话如果是刚进入循环那么直接break退出循环
- 获取inflights队列中第一个元素,如果seq没有对上,说明顺序乱了,重置状态
- 调用onAppendEntriesReturned方法处理日志复制的response
- 如果处理成功,那么则调用sendEntries继续发送复制日志到Follower
Replicator#onAppendEntriesReturned
private static boolean onAppendEntriesReturned(final ThreadId id, final Inflight inflight, final Status status,
final AppendEntriesRequest request,
final AppendEntriesResponse response, final long rpcSendTime,
final long startTimeMs, final Replicator r) {
//校验数据序列有没有错
if (inflight.startIndex != request.getPrevLogIndex() + 1) {
LOG.warn(
"Replicator {} received invalid AppendEntriesResponse, in-flight startIndex={}, request prevLogIndex={}, reset the replicator state and probe again.",
r, inflight.startIndex, request.getPrevLogIndex());
r.resetInflights();
r.state = State.Probe;
// unlock id in sendEmptyEntries
r.sendEmptyEntries(false);
return false;
}
//度量
// record metrics
if (request.getEntriesCount() > 0) {
r.nodeMetrics.recordLatency("replicate-entries", Utils.monotonicMs() - rpcSendTime);
r.nodeMetrics.recordSize("replicate-entries-count", request.getEntriesCount());
r.nodeMetrics.recordSize("replicate-entries-bytes", request.getData() != null ? request.getData().size()
: 0);
}
final boolean isLogDebugEnabled = LOG.isDebugEnabled();
StringBuilder sb = null;
if (isLogDebugEnabled) {
sb = new StringBuilder("Node "). //
append(r.options.getGroupId()).append(":").append(r.options.getServerId()). //
append(" received AppendEntriesResponse from "). //
append(r.options.getPeerId()). //
append(" prevLogIndex=").append(request.getPrevLogIndex()). //
append(" prevLogTerm=").append(request.getPrevLogTerm()). //
append(" count=").append(request.getEntriesCount());
}
//如果follower因为崩溃,RPC调用失败等原因没有收到成功响应
//那么需要阻塞一段时间再进行调用
if (!status.isOk()) {
// If the follower crashes, any RPC to the follower fails immediately,
// so we need to block the follower for a while instead of looping until
// it comes back or be removed
// dummy_id is unlock in block
if (isLogDebugEnabled) {
sb.append(" fail, sleep.");
LOG.debug(sb.toString());
}
//如果注册了Replicator状态监听器,那么通知所有监听器
notifyReplicatorStatusListener(r, ReplicatorEvent.ERROR, status);
if (++r.consecutiveErrorTimes % 10 == 0) {
LOG.warn("Fail to issue RPC to {}, consecutiveErrorTimes={}, error={}", r.options.getPeerId(),
r.consecutiveErrorTimes, status);
}
r.resetInflights();
r.state = State.Probe;
// unlock in in block
r.block(startTimeMs, status.getCode());
return false;
}
r.consecutiveErrorTimes = 0;
//响应失败
if (!response.getSuccess()) {
// Leader 的切换,表明可能出现过一次网络分区,从新跟随新的 Leader
if (response.getTerm() > r.options.getTerm()) {
if (isLogDebugEnabled) {
sb.append(" fail, greater term ").append(response.getTerm()).append(" expect term ")
.append(r.options.getTerm());
LOG.debug(sb.toString());
}
// 获取当前本节点的表示对象——NodeImpl
final NodeImpl node = r.options.getNode();
r.notifyOnCaughtUp(RaftError.EPERM.getNumber(), true);
r.destroy();
// 调整自己的 term 任期值
node.increaseTermTo(response.getTerm(), new Status(RaftError.EHIGHERTERMRESPONSE,
"Leader receives higher term heartbeat_response from peer:%s", r.options.getPeerId()));
return false;
}
if (isLogDebugEnabled) {
sb.append(" fail, find nextIndex remote lastLogIndex ").append(response.getLastLogIndex())
.append(" local nextIndex ").append(r.nextIndex);
LOG.debug(sb.toString());
}
if (rpcSendTime > r.lastRpcSendTimestamp) {
r.lastRpcSendTimestamp = rpcSendTime;
}
// Fail, reset the state to try again from nextIndex.
r.resetInflights();
//如果Follower最新的index小于下次要发送的index,那么设置为Follower响应的index
// prev_log_index and prev_log_term doesn't match
if (response.getLastLogIndex() + 1 < r.nextIndex) {
LOG.debug("LastLogIndex at peer={} is {}", r.options.getPeerId(), response.getLastLogIndex());
// The peer contains less logs than leader
r.nextIndex = response.getLastLogIndex() + 1;
} else {
// The peer contains logs from old term which should be truncated,
// decrease _last_log_at_peer by one to test the right index to keep
if (r.nextIndex > 1) {
LOG.debug("logIndex={} dismatch", r.nextIndex);
r.nextIndex--;
} else {
LOG.error("Peer={} declares that log at index=0 doesn't match, which is not supposed to happen",
r.options.getPeerId());
}
}
//响应失败需要重新获取Follower的日志信息,用来重新同步
// dummy_id is unlock in _send_heartbeat
r.sendEmptyEntries(false);
return false;
}
if (isLogDebugEnabled) {
sb.append(", success");
LOG.debug(sb.toString());
}
// success
//响应成功检查任期
if (response.getTerm() != r.options.getTerm()) {
r.resetInflights();
r.state = State.Probe;
LOG.error("Fail, response term {} dismatch, expect term {}", response.getTerm(), r.options.getTerm());
id.unlock();
return false;
}
if (rpcSendTime > r.lastRpcSendTimestamp) {
r.lastRpcSendTimestamp = rpcSendTime;
}
// 本次提交的日志数量
final int entriesSize = request.getEntriesCount();
if (entriesSize > 0) {
// 节点确认提交
r.options.getBallotBox().commitAt(r.nextIndex, r.nextIndex + entriesSize - 1, r.options.getPeerId());
if (LOG.isDebugEnabled()) {
LOG.debug("Replicated logs in [{}, {}] to peer {}", r.nextIndex, r.nextIndex + entriesSize - 1,
r.options.getPeerId());
}
} else {
// The request is probe request, change the state into Replicate.
r.state = State.Replicate;
}
r.nextIndex += entriesSize;
r.hasSucceeded = true;
r.notifyOnCaughtUp(RaftError.SUCCESS.getNumber(), false);
// dummy_id is unlock in _send_entries
if (r.timeoutNowIndex > 0 && r.timeoutNowIndex < r.nextIndex) {
r.sendTimeoutNow(false, false);
}
return true;
}
onAppendEntriesReturned方法也非常的长,但是我们要有点耐心往下看
- 校验数据序列有没有错
- 进行度量和拼接日志操作
- 判断一下返回的状态如果不是正常的,那么就通知监听器,进行重置操作并阻塞一定时间后再发送
- 如果返回Success状态为false,那么校验一下任期,因为Leader 的切换,表明可能出现过一次网络分区,需要重新跟随新的 Leader;如果任期没有问题那么就进行重置操作,并根据Follower返回的最新的index来重新设值nextIndex
- 如果各种校验都没有问题的话,那么进行日志提交确认,更新最新的日志提交位置索引
8. SOFAJRaft源码分析— 如何实现日志复制的pipeline机制?的更多相关文章
- 3. SOFAJRaft源码分析— 是如何进行选举的?
开篇 在上一篇文章当中,我们讲解了NodeImpl在init方法里面会初始化话的动作,选举也是在这个方法里面进行的,这篇文章来从这个方法里详细讲一下选举的过程. 由于我这里介绍的是如何实现的,所以请大 ...
- 4. SOFAJRaft源码分析— RheaKV初始化做了什么?
前言 由于RheaKV要讲起来篇幅比较长,所以这里分成几个章节来讲,这一章讲一讲RheaKV初始化做了什么? 我们先来给个例子,我们从例子来讲: public static void main(fin ...
- 6. SOFAJRaft源码分析— 透过RheaKV看线性一致性读
开篇 其实这篇文章我本来想在讲完选举的时候就开始讲线性一致性读的,但是感觉直接讲没头没尾的看起来比比较困难,所以就有了RheaKV的系列,这是RheaKV,终于可以讲一下SOFAJRaft的线性一致性 ...
- DolphinScheduler源码分析之任务日志
DolphinScheduler源码分析之任务日志 任务日志打印在调度系统中算是一个比较重要的功能,下面就简要分析一下其打印的逻辑和前端页面查询的流程. AbstractTask 所有的任务都会继承A ...
- NIO 源码分析(04) 从 SelectorProvider 看 JDK SPI 机制
目录 一.SelectorProvider SPI 二.SelectorProvider 加载过程 2.1 SelectorProvider 加载 2.2 Windows 下 DefaultSelec ...
- 9. SOFAJRaft源码分析— Follower如何通过Snapshot快速追上Leader日志?
前言 引入快照机制主要是为了解决两个问题: JRaft新节点加入后,如何快速追上最新的数据 Raft 节点出现故障重新启动后如何高效恢复到最新的数据 Snapshot 源码分析 生成 Raft 节点的 ...
- 2. SOFAJRaft源码分析—JRaft的定时任务调度器是怎么做的?
看完这个实现之后,感觉还是要多看源码,多研究.其实JRaft的定时任务调度器是基于Netty的时间轮来做的,如果没有看过Netty的源码,很可能并不知道时间轮算法,也就很难想到要去使用这么优秀的定时调 ...
- 源码分析之spring-JdbcTemplate日志打印sql语句
对于开源的项目来说的好处就是我们遇到什么问题可以通过看源码来解决. 比如近期有个同事问我说,为啥JdbcTemplate中只有在Error的时候才打印出sql语句呢.我一想,这和log的配置有关系吧. ...
- 7. SOFAJRaft源码分析—如何实现一个轻量级的对象池?
前言 我在看SOFAJRaft的源码的时候看到了使用了对象池的技术,看了一下感觉要吃透的话还是要新开一篇文章来讲,内容也比较充实,大家也可以学到之后运用到实际的项目中去. 这里我使用Recyclabl ...
随机推荐
- MIT线性代数:7.主变量,特解,求解AX=0
- Vue基础系列(三)——Vue模板中的数据绑定语法
写在前面的话: 文章是个人学习过程中的总结,为方便以后回头在学习. 文章中会参考官方文档和其他的一些文章,示例均为亲自编写和实践,若有写的不对的地方欢迎大家和我一起交流. VUE基础系列目录 < ...
- OTA升级详解(三)
君子知夫不全不粹之不足以为美也, 故诵数以贯之, 思索以通之, 为其人以处之, 除其害者以持养之: 出自荀子<劝学篇> 终于OTA的升级过程的详解来了,之前的两篇文章OTA升级详解(一)与 ...
- 防火墙firewalld的基础操作
防火墙Firewalld.iptables 1.systemctl模式 systemctl status firewalld #查看状态 2 systemctl start firewalld #启动 ...
- Python 基础之 I/O 模型
一.I/O模型 IO在计算机中指Input/Output,也就是输入和输出.由于程序和运行时数据是在内存中驻留,由CPU这个超快的计算核心来执行,涉及到数据交换的地方,通常是磁盘.网络等,就需要IO接 ...
- 关于设备与canvas画不出来的解决办法
连续四天解决一个在三星手机上面画canvas的倒计时饼图不出来的问题,困惑了很久,用了很多办法,甚至重写了那个方法,还是没有解决,大神给的思路是给父级加 "overflow: visible ...
- activmq点对点(简单写法)
开发环境 我们使用的是ActiveMQ 5.11.1 Release的Windows版,官网最新版是ActiveMQ 5.12.0 Release,大家可以自行下载,下载地址. 需要注意的是,开发时候 ...
- chrome中安装Vue调试工具vue-devtools
一.前言 vue-devtools是一款基于浏览器的插件,用来调试vue应用.本篇文章将要总结的是如何在chrome中安装Vue的调试工具vue-devtools. 首先能想到的第一种方法就是直接在c ...
- Lab8:文件系统
文件系统的概念 文件系统是操作系统中管理持久性数据的子系统,提供数据存储和访问功能 文件是具有符号名,由字节序列构成的数据项集合 文件系统的功能 分配文件磁盘空间 管理文件块(位置和顺序) 管理空闲空 ...
- Hadoop之HDFS读写原理
一.HDFS基本概念 HDFS全称是Hadoop Distributed System.HDFS是为以流的方式存取大文件而设计的.适用于几百MB,GB以及TB,并写一次读多次的场合.而对于低延时数据访 ...