HDFS Source Code Analysis: Reading Operations from the EditLog

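In "HDFS Source Code Analysis: Obtaining the Edit Log Input Stream", we looked in detail at how to obtain the edit log input stream, EditLogInputStream. Once we have that stream, the natural next step is to read data from it and process it. In "HDFS Source Code Analysis: EditLogTailer", when discussing how edit logs are tailed and synchronized, we also described the following two consecutive steps:

4. Obtain the collection of edit log input streams, streams, from the edit log editLog; the streams start at the latest transaction ID plus 1.
5. Call loadEdits() on the FSImage instance image to load the edits from streams into the FSImage of the target namesystem, and obtain the number of edits loaded, editsLoaded.

So once we have the collection streams of EditLogInputStream instances, FSImage's loadEdits() is called to load the edits into the target namesystem's FSImage. But how does HDFS actually read data from an edit log input stream? That is what this article explores.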

First, in FSEditLogLoader, the main class for loading edit logs, the core method loadEditRecords() contains the following code:

while (true) {
  try {
    FSEditLogOp op;
    try {
      // Read an operation op from the edit log input stream in
      op = in.readOp();
      // A null op means end of stream: leave the loop
      if (op == null) {
        break;
      }
    } catch (Throwable e) {
      // ... code omitted
    }
    // ... code omitted
    try {
      // ... code omitted
      long inodeId = applyEditLogOp(op, fsDir, startOpt,
          in.getVersion(true), lastInodeId);
      if (lastInodeId < inodeId) {
        lastInodeId = inodeId;
      }
    } catch (RollingUpgradeOp.RollbackException e) {
      // ... code omitted
    } catch (Throwable e) {
      // ... code omitted
    }
    // ... code omitted
  } catch (RollingUpgradeOp.RollbackException e) {
    // ... code omitted
  } catch (MetaRecoveryContext.RequestStopException e) {
    // ... code omitted
  }
}

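It reads an operation op from the edit log input stream in, then calls applyEditLogOp() to apply that operation to the in-memory metadata, FSNamesystem. The question, then, is how the operation is read from the stream and parsed.

Let's see how an operation is read from the edit log input stream EditLogInputStream, starting with its readOp() method: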

/**
 * Read an operation from the stream
 * @return an operation from the stream or null if at end of stream
 * @throws IOException if there is an error reading from the stream
 */
public FSEditLogOp readOp() throws IOException {
  FSEditLogOp ret;
  // If cachedOp is not null, return it and clear the cache
  if (cachedOp != null) {
    ret = cachedOp;
    cachedOp = null;
    return ret;
  }
  // Otherwise delegate to nextOp()
  return nextOp();
}

The logic is simple: if cachedOp is not null, return it and clear the cache; otherwise call nextOp(). In EditLogInputStream, nextOp() is abstract, so we need to look at a subclass implementation. Taking EditLogFileInputStream as an example, its nextOp() method is:
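The cache-then-read shape of readOp() is a one-slot look-ahead buffer. The sketch below is not HDFS code: PeekableOpStream, its String ops, and the peekOp() helper that fills the cache are all hypothetical, introduced only to show why a reader would drain cachedOp before pulling from the underlying source.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for EditLogInputStream: a one-slot look-ahead over ops. */
final class PeekableOpStream {
    private final Iterator<String> source; // stands in for the underlying stream
    private String cachedOp;               // at most one op read ahead

    PeekableOpStream(List<String> ops) {
        this.source = ops.iterator();
    }

    /** Stand-in for the abstract nextOp(): returns null at end of stream. */
    private String nextOp() {
        return source.hasNext() ? source.next() : null;
    }

    /** Hypothetical helper: look at the next op without consuming it. */
    String peekOp() {
        if (cachedOp == null) {
            cachedOp = nextOp();
        }
        return cachedOp;
    }

    /** Same shape as readOp() above: drain the cache first, then read. */
    String readOp() {
        if (cachedOp != null) {
            String ret = cachedOp;
            cachedOp = null;
            return ret;
        }
        return nextOp();
    }
}
```

With this shape, a caller can inspect the next op and still have readOp() return that same op afterwards, which is the property the cache exists to provide.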

@Override
protected FSEditLogOp nextOp() throws IOException {
  return nextOpImpl(false);
}

Following the call into nextOpImpl():

private FSEditLogOp nextOpImpl(boolean skipBrokenEdits) throws IOException {
  FSEditLogOp op = null;
  // Dispatch on the state of the edit log file input stream
  switch (state) {
  case UNINIT: // not yet initialized
    try {
      // Initialize the stream
      init(true);
    } catch (Throwable e) {
      LOG.error("caught exception initializing " + this, e);
      if (skipBrokenEdits) {
        return null;
      }
      Throwables.propagateIfPossible(e, IOException.class);
    }
    // After init() the state must no longer be UNINIT
    Preconditions.checkState(state != State.UNINIT);
    // Call nextOpImpl() again now that the stream is initialized
    return nextOpImpl(skipBrokenEdits);
  case OPEN: // stream is open
    // Read the operation through FSEditLogOp.Reader's readOp()
    op = reader.readOp(skipBrokenEdits);
    if ((op != null) && (op.hasTransactionId())) {
      long txId = op.getTransactionId();
      if ((txId >= lastTxId) &&
          (lastTxId != HdfsConstants.INVALID_TXID)) {
        //
        // Sometimes, the NameNode crashes while it's writing to the
        // edit log. In that case, you can end up with an unfinalized edit log
        // which has some garbage at the end.
        // JournalManager#recoverUnfinalizedSegments will finalize these
        // unfinished edit logs, giving them a defined final transaction
        // ID. Then they will be renamed, so that any subsequent
        // readers will have this information.
        //
        // Since there may be garbage at the end of these "cleaned up"
        // logs, we want to be sure to skip it here if we've read everything
        // we were supposed to read out of the stream.
        // So we force an EOF on all subsequent reads.
        //
        long skipAmt = log.length() - tracker.getPos();
        if (skipAmt > 0) {
          if (LOG.isDebugEnabled()) {
            LOG.debug("skipping " + skipAmt + " bytes at the end " +
                "of edit log '" + getName() + "': reached txid " + txId +
                " out of " + lastTxId);
          }
          tracker.clearLimit();
          IOUtils.skipFully(tracker, skipAmt);
        }
      }
    }
    break;
  case CLOSED: // stream is closed: fall through and return null
    break; // return null
  }
  return op;
}

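nextOpImpl() dispatches on the state of the edit log file input stream:

1. UNINIT (uninitialized): call init() to initialize the stream, check that the state is no longer UNINIT, then call nextOpImpl() again.

2. OPEN: call readOp() on the FSEditLogOp.Reader to read the operation op. If the op's transaction ID has reached lastTxId (and lastTxId is valid), skip any remaining bytes in the file so that later reads hit EOF.

3. CLOSED: return null.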
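The three-state dispatch can be sketched in isolation. This is a toy under stated assumptions, not HDFS code: LazyStream, its int payload, and its init() are hypothetical, and only the UNINIT/OPEN/CLOSED names and the "initialize, verify, re-enter" shape are taken from the method above.

```java
/** Hypothetical lazily initialized stream with the same three states. */
final class LazyStream {
    enum State { UNINIT, OPEN, CLOSED }

    private State state = State.UNINIT;
    private final int[] data; // stands in for the file contents
    private int pos;

    LazyStream(int... data) {
        this.data = data;
    }

    /** Stand-in for init(): the first read opens the stream. */
    private void init() {
        pos = 0;
        state = State.OPEN;
    }

    void close() {
        state = State.CLOSED;
    }

    /** Returns the next value, or null at end of data or once closed. */
    Integer next() {
        switch (state) {
            case UNINIT:
                init();
                // Mirrors Preconditions.checkState(state != State.UNINIT)
                if (state == State.UNINIT) {
                    throw new IllegalStateException("init() left the stream UNINIT");
                }
                return next(); // re-enter, now in the OPEN branch
            case OPEN:
                return pos < data.length ? Integer.valueOf(data[pos++]) : null;
            case CLOSED:
            default:
                return null;
        }
    }
}
```

Deferring init() to the first read means constructing the stream is cheap and any open failure surfaces where the caller is already prepared to handle read errors.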

The key part is FSEditLogOp.Reader's readOp() method:

/**
 * Read an operation from the input stream.
 *
 * Note that the objects returned from this method may be re-used by future
 * calls to the same method.
 *
 * @param skipBrokenEdits If true, attempt to skip over damaged parts of
 * the input stream, rather than throwing an IOException
 * @return the operation read from the stream, or null at the end of the
 * file
 * @throws IOException on error. This function should only throw an
 * exception when skipBrokenEdits is false.
 */
public FSEditLogOp readOp(boolean skipBrokenEdits) throws IOException {
  while (true) {
    try {
      // Decode one operation
      return decodeOp();
    } catch (IOException e) {
      in.reset();
      if (!skipBrokenEdits) {
        throw e;
      }
    } catch (RuntimeException e) {
      // FSEditLogOp#decodeOp is not supposed to throw RuntimeException.
      // However, we handle it here for recovery mode, just to be more
      // robust.
      in.reset();
      if (!skipBrokenEdits) {
        throw e;
      }
    } catch (Throwable e) {
      in.reset();
      if (!skipBrokenEdits) {
        throw new IOException("got unexpected exception " +
            e.getMessage(), e);
      }
    }
    // Move ahead one byte and re-try the decode process.
    if (in.skip(1) < 1) {
      return null;
    }
  }
}
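The recovery path in readOp(boolean) relies on mark/reset: a failed decode rewinds to where the attempt began, then the reader advances one byte and tries again. A minimal sketch of that resynchronization follows, assuming a toy record format (a marker byte followed by an int) that is not the HDFS format; ResyncReader, MAGIC, and MAX_RECORD_SIZE are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical reader that resynchronizes after corrupt bytes via mark/reset. */
final class ResyncReader {
    private static final int MAX_RECORD_SIZE = 16; // toy read-ahead limit
    private static final byte MAGIC = 0x7F;        // toy record marker
    private final DataInputStream in;

    ResyncReader(byte[] bytes) {
        // ByteArrayInputStream supports mark/reset, which the retry loop needs
        this.in = new DataInputStream(new ByteArrayInputStream(bytes));
    }

    /** Toy decode: MAGIC then an int payload; null on EOF at a record boundary. */
    private Integer decode() throws IOException {
        in.mark(MAX_RECORD_SIZE);
        int first = in.read();
        if (first < 0) {
            return null;
        }
        if ((byte) first != MAGIC) {
            throw new IOException("bad record marker " + first);
        }
        return in.readInt();
    }

    /** Same control flow as readOp(skipBrokenEdits), reduced to IOException. */
    Integer read(boolean skipBroken) throws IOException {
        while (true) {
            try {
                return decode();
            } catch (IOException e) {
                in.reset(); // rewind to the start of the failed attempt
                if (!skipBroken) {
                    throw e;
                }
            }
            // Move ahead one byte and retry the decode
            if (in.skip(1) < 1) {
                return null;
            }
        }
    }
}
```

Given the bytes 0x00, 0x55, 0x7F, 0, 0, 0, 42, a call to read(true) skips the two garbage bytes and returns 42, while read(false) on the same input throws on the first byte.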

Following the call into decodeOp():

/**
 * Read an opcode from the input stream.
 *
 * @return the opcode, or null on EOF.
 *
 * If an exception is thrown, the stream's mark will be set to the first
 * problematic byte. This usually means the beginning of the opcode.
 */
private FSEditLogOp decodeOp() throws IOException {
  limiter.setLimit(maxOpSize);
  in.mark(maxOpSize);
  if (checksum != null) {
    checksum.reset();
  }
  byte opCodeByte;
  try {
    // Read one byte, opCodeByte, from the input stream in
    opCodeByte = in.readByte();
  } catch (EOFException eof) {
    // EOF at an opcode boundary is expected.
    return null;
  }
  // Convert opCodeByte into an FSEditLogOpCodes value, opCode
  FSEditLogOpCodes opCode = FSEditLogOpCodes.fromByte(opCodeByte);
  if (opCode == OP_INVALID) {
    verifyTerminator();
    return null;
  }
  // Look up the FSEditLogOp object for opCode in the cache
  FSEditLogOp op = cache.get(opCode);
  if (op == null) {
    throw new IOException("Read invalid opcode " + opCode);
  }
  // If the edit log length is supported, read an int from the stream
  if (supportEditLogLength) {
    in.readInt();
  }
  if (NameNodeLayoutVersion.supports(
      LayoutVersion.Feature.STORED_TXIDS, logVersion)) {
    // Read the txid
    // Transaction IDs are supported: read a long and set it on op
    op.setTransactionId(in.readLong());
  } else {
    // Transaction IDs are not supported: use INVALID_TXID (-12345)
    op.setTransactionId(HdfsConstants.INVALID_TXID);
  }
  // Read the remaining fields from in into op
  op.readFields(in, logVersion);
  validateChecksum(in, checksum, op.txid);
  return op;
}

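The logic of decodeOp() is straightforward:

1. Read one byte, opCodeByte, from the input stream in; it identifies the operation type.

2. Convert opCodeByte into an FSEditLogOpCodes value, opCode.

3. Look up the FSEditLogOp object op for opCode in cache; this is the operation object.

4. If the edit log length is supported, read an int from the input stream.

5. If transaction IDs are supported, read a long and set it as op's transaction ID; otherwise set the transaction ID to HdfsConstants.INVALID_TXID (-12345).

6. Call op.readFields() to read the remaining fields from in into op.

Two details sit around these steps: an EOF or an OP_INVALID opcode at an op boundary ends the stream and yields null, and validateChecksum() is called after the fields are read.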
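The header-then-fields layout can be illustrated with a round trip through java.io data streams. The sketch below assumes a simplified record (opcode byte, int length, long txid, raw payload) and is not the actual HDFS wire format: OpRecordDemo, Record, the OP_INVALID value of -1, and the choice to let the length count only the payload bytes are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

/** Hypothetical record codec: opcode byte, int length, long txid, payload. */
final class OpRecordDemo {
    static final byte OP_INVALID = -1; // assumed terminator value for this toy

    static final class Record {
        final byte opCode;
        final long txId;
        final byte[] payload;

        Record(byte opCode, long txId, byte[] payload) {
            this.opCode = opCode;
            this.txId = txId;
            this.payload = payload;
        }
    }

    static byte[] encode(byte opCode, long txId, byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(opCode);
        out.writeInt(payload.length); // toy meaning: payload bytes only
        out.writeLong(txId);
        out.write(payload);
        out.flush();
        return bos.toByteArray();
    }

    /** Returns null on EOF at a record boundary or on the terminator opcode. */
    static Record decode(DataInputStream in) throws IOException {
        byte opCode;
        try {
            opCode = in.readByte();
        } catch (EOFException eof) {
            return null; // EOF at an opcode boundary is expected
        }
        if (opCode == OP_INVALID) {
            return null;
        }
        int length = in.readInt();
        long txId = in.readLong();
        byte[] payload = new byte[length];
        in.readFully(payload); // stands in for op.readFields(in, logVersion)
        return new Record(opCode, txId, payload);
    }
}
```

Reading the fixed-size header first is what lets the reader pick the right op object before it knows anything about the variable part of the record.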

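Next, let's look at readFields() on the operation object. Different kinds of operations carry different attributes, so their readFields() implementations differ. Taking AddCloseOp as an example, its readFields() method is: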

@Override
void readFields(DataInputStream in, int logVersion)
    throws IOException {
  // Length: read an int into length only for layouts that do NOT
  // support EDITLOG_OP_OPTIMIZATION
  if (!NameNodeLayoutVersion.supports(
      LayoutVersion.Feature.EDITLOG_OP_OPTIMIZATION, logVersion)) {
    this.length = in.readInt();
  }
  // Inode ID: if ADD_INODE_ID is supported, read a long into inodeId;
  // otherwise use INodeId.GRANDFATHER_INODE_ID
  if (NameNodeLayoutVersion.supports(
      LayoutVersion.Feature.ADD_INODE_ID, logVersion)) {
    this.inodeId = in.readLong();
  } else {
    // The inodeId should be updated when this editLogOp is applied
    this.inodeId = INodeId.GRANDFATHER_INODE_ID;
  }
  // Layout-version compatibility check on length
  if ((-17 < logVersion && length != 4) ||
      (logVersion <= -17 && length != 5 && !NameNodeLayoutVersion.supports(
          LayoutVersion.Feature.EDITLOG_OP_OPTIMIZATION, logVersion))) {
    throw new IOException("Incorrect data format." +
        " logVersion is " + logVersion +
        " but writables.length is " +
        length + ". ");
  }
  // Path: read a String into path
  this.path = FSImageSerialization.readString(in);
  // Replication and mtime: both branches read a short and a long;
  // only the helper used differs by layout version
  if (NameNodeLayoutVersion.supports(
      LayoutVersion.Feature.EDITLOG_OP_OPTIMIZATION, logVersion)) {
    this.replication = FSImageSerialization.readShort(in);
    this.mtime = FSImageSerialization.readLong(in);
  } else {
    this.replication = readShort(in);
    this.mtime = readLong(in);
  }
  // Access time: if FILE_ACCESS_TIME is supported, read a long into atime;
  // otherwise atime is 0
  if (NameNodeLayoutVersion.supports(
      LayoutVersion.Feature.FILE_ACCESS_TIME, logVersion)) {
    if (NameNodeLayoutVersion.supports(
        LayoutVersion.Feature.EDITLOG_OP_OPTIMIZATION, logVersion)) {
      this.atime = FSImageSerialization.readLong(in);
    } else {
      this.atime = readLong(in);
    }
  } else {
    this.atime = 0;
  }
  // Block size: always read a long into blockSize; the helper used
  // differs by layout version
  if (NameNodeLayoutVersion.supports(
      LayoutVersion.Feature.EDITLOG_OP_OPTIMIZATION, logVersion)) {
    this.blockSize = FSImageSerialization.readLong(in);
  } else {
    this.blockSize = readLong(in);
  }
  // Read the blocks into the array blocks via readBlocks()
  this.blocks = readBlocks(in, logVersion);
  // Read the permissions into permissions
  this.permissions = PermissionStatus.read(in);
  // For OP_ADD, additionally read the ACL entries, extended attributes,
  // clientName, clientMachine, the overwrite flag, the storage policy ID
  // and the RPC IDs
  if (this.opCode == OP_ADD) {
    aclEntries = AclEditLogUtil.read(in, logVersion);
    this.xAttrs = readXAttrsFromEditLog(in, logVersion);
    this.clientName = FSImageSerialization.readString(in);
    this.clientMachine = FSImageSerialization.readString(in);
    if (NameNodeLayoutVersion.supports(
        NameNodeLayoutVersion.Feature.CREATE_OVERWRITE, logVersion)) {
      this.overwrite = FSImageSerialization.readBoolean(in);
    } else {
      this.overwrite = false;
    }
    if (NameNodeLayoutVersion.supports(
        NameNodeLayoutVersion.Feature.BLOCK_STORAGE_POLICY, logVersion)) {
      this.storagePolicyId = FSImageSerialization.readByte(in);
    } else {
      this.storagePolicyId = BlockStoragePolicySuite.ID_UNSPECIFIED;
    }
    // read clientId and callId
    readRpcIds(in, logVersion);
  } else {
    this.clientName = "";
    this.clientMachine = "";
  }
}

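There is not much to explain here: the method reads, in order, the attributes the operation needs as they appear in the stream. Each read is gated on whether logVersion supports the corresponding feature, so older log layouts still parse.
One method still deserves a closer look: readBlocks(), which reads the blocks: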

private static Block[] readBlocks(
    DataInputStream in,
    int logVersion) throws IOException {
  // Read the block count numBlocks (one int)
  int numBlocks = in.readInt();
  // Validate numBlocks: it must be between 0 and MAX_BLOCKS
  // (1024 * 1024 * 64) inclusive
  if (numBlocks < 0) {
    throw new IOException("invalid negative number of blocks");
  } else if (numBlocks > MAX_BLOCKS) {
    throw new IOException("invalid number of blocks: " + numBlocks +
        ". The maximum number of blocks per file is " + MAX_BLOCKS);
  }
  // Allocate the array blocks with numBlocks entries
  Block[] blocks = new Block[numBlocks];
  // Read numBlocks blocks from the input stream
  for (int i = 0; i < numBlocks; i++) {
    // Construct a Block instance blk
    Block blk = new Block();
    // Populate blk from the stream via Block's readFields()
    blk.readFields(in);
    // Store blk in the array
    blocks[i] = blk;
  }
  // Return the array of blocks
  return blocks;
}

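This is simple: read the block count numBlocks from the stream, allocate the array blocks of that size, then read numBlocks blocks. Each iteration constructs a Block instance blk, calls its readFields() to populate it from the stream, and stores it in blocks. Once all blocks are read, the array is returned.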
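The important habit in readBlocks() is validating an untrusted count before allocating an array of that size. A minimal sketch of the same count-prefixed pattern, using plain long values instead of Block objects; CountPrefixedReader and its MAX_ITEMS cap of 1024 are hypothetical and chosen only for the example.

```java
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical reader for "int count, then count longs" with a sanity cap. */
final class CountPrefixedReader {
    static final int MAX_ITEMS = 1024; // toy cap, far below the real MAX_BLOCKS

    static long[] readLongs(DataInputStream in) throws IOException {
        int count = in.readInt();
        // Reject impossible counts before allocating anything
        if (count < 0) {
            throw new IOException("invalid negative number of items");
        } else if (count > MAX_ITEMS) {
            throw new IOException("invalid number of items: " + count
                + ". The maximum is " + MAX_ITEMS);
        }
        long[] items = new long[count];
        for (int i = 0; i < count; i++) {
            items[i] = in.readLong();
        }
        return items;
    }
}
```

Without the cap, a corrupt or hostile count such as Integer.MAX_VALUE would trigger a huge allocation before a single element had been read.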

Now Block's readFields() method:

@Override // Writable
public void readFields(DataInput in) throws IOException {
  readHelper(in);
}

And readHelper():

final void readHelper(DataInput in) throws IOException {
  // Read a long as the block ID, blockId
  this.blockId = in.readLong();
  // Read a long as the block size, numBytes
  this.numBytes = in.readLong();
  // Read a long as the block's generation stamp, generationStamp
  this.generationStamp = in.readLong();
  // Validate: numBytes must be greater than or equal to 0
  if (numBytes < 0) {
    throw new IOException("Unexpected block size: " + numBytes);
  }
}

It reads blockId, numBytes, and generationStamp from the stream in that order; all three are long values.
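Three consecutive long values mean each block occupies 24 bytes in this encoding. The round trip below is a sketch, not the Hadoop Block class: BlockRecord and its write(), toBytes(), and fromBytes() helpers are hypothetical, and only the field order and the numBytes check are taken from readHelper().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical block record: three longs in the order readHelper() reads them. */
final class BlockRecord {
    long blockId;
    long numBytes;
    long generationStamp;

    /** Hypothetical writer: the mirror image of readFields(). */
    void write(DataOutput out) throws IOException {
        out.writeLong(blockId);
        out.writeLong(numBytes);
        out.writeLong(generationStamp);
    }

    void readFields(DataInput in) throws IOException {
        blockId = in.readLong();
        numBytes = in.readLong();
        generationStamp = in.readLong();
        // Same validation as readHelper(): a block cannot have negative size
        if (numBytes < 0) {
            throw new IOException("Unexpected block size: " + numBytes);
        }
    }

    static byte[] toBytes(BlockRecord b) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        b.write(out);
        out.flush();
        return bos.toByteArray();
    }

    static BlockRecord fromBytes(byte[] bytes) throws IOException {
        BlockRecord b = new BlockRecord();
        b.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return b;
    }
}
```

Because the reader and writer agree only on field order, swapping two of the reads would still parse without error and silently produce wrong values, which is why the order in readHelper() matters.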
