Preface

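In Hadoop, all metadata is stored on the NameNode. Every time the cluster restarts, Hadoop has to restore that data into memory from the persisted files, and the image and edit log files are then scanned and merged periodically. Anyone with a passing knowledge of Hadoop knows this much: it is the job of the SecondaryNameNode. But many people only know the surface of this mechanism, and probably not everyone has dug into how it is implemented. Would you have guessed that writes to the edit log use a double buffer to increase write concurrency? That the author performs multiple validation steps during writes to keep operations consistent? Have you thought about how interrupted operations are handled? If you cannot answer these questions well, no matter. Below I share what I have learned.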

Namespace Image

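The namespace image is referred to here as the FsImage. I first heard the word "image" in the context of restoring virtual machine images, which is a powerful feature; the image here is smaller in scope and is used only to restore the directory tree. The path where the image is saved is controlled by the following configuration property: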

${dfs.name.dir}

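You can leave it unset, of course, since it has a default value. Within the namespace image, the FSImage class plays the leading role and manages the life cycle of the storage space. The basic field definitions of this class are as follows: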

/**

 * FSImage handles checkpointing and logging of the namespace edits.

 * The FSImage class.

 */

public class FSImage extends Storage {

  // Standard date format

  private static final SimpleDateFormat DATE_FORM =

    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

  //

  // The filenames used for storing the images

  // File names that may be used in the namespace image

  //

  enum NameNodeFile {

    IMAGE     ("fsimage"),

    TIME      ("fstime"),

    EDITS     ("edits"),

    IMAGE_NEW ("fsimage.ckpt"),

    EDITS_NEW ("edits.new");

    private String fileName = null;

    private NameNodeFile(String name) {this.fileName = name;}

    String getName() {return fileName;}

  }

  // checkpoint states

  // The checkpoint states

  enum CheckpointStates{START, ROLLED_EDITS, UPLOAD_START, UPLOAD_DONE; }

  /**

   * Implementation of StorageDirType specific to namenode storage

   * A Storage directory could be of type IMAGE which stores only fsimage,

   * or of type EDITS which stores edits or of type IMAGE_AND_EDITS which

   * stores both fsimage and edits.

   * Storage directory types of the name node.

   */

  static enum NameNodeDirType implements StorageDirType {

    // Four storage directory types are defined for the name node

    UNDEFINED,

    IMAGE,

    EDITS,

    IMAGE_AND_EDITS;

    public StorageDirType getStorageDirType() {

      return this;

    }

    // Check the storage type

    public boolean isOfType(StorageDirType type) {

      if ((this == IMAGE_AND_EDITS) && (type == IMAGE || type == EDITS))

        return true;

      return this == type;

    }

  }

  protected long checkpointTime = -1L;

  // The edit log is maintained internally and works together with the image

  protected FSEditLog editLog = null;

  private boolean isUpgradeFinalized = false;

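The first thing you notice is the file names: edits, edits.new, fsimage.ckpt. They are explained in detail in the metadata backup section later; for now, think of them as the names of temporary files. Traversal and lookup over the storage directories are implemented by the following class: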

  // Directory iterator

  private class DirIterator implements Iterator<StorageDirectory> {

    // Storage directory type

    StorageDirType dirType;

    // Previous index, used by remove()

    int prevIndex; // for remove()

    // Next index

    int nextIndex; // for next()

    DirIterator(StorageDirType dirType) {

      this.dirType = dirType;

      this.nextIndex = 0;

      this.prevIndex = 0;

    }

    public boolean hasNext() {

      ....

    }

    public StorageDirectory next() {

      StorageDirectory sd = getStorageDir(nextIndex);

      prevIndex = nextIndex;

      nextIndex++;

      if (dirType != null) {

        while (nextIndex < storageDirs.size()) {

          if (getStorageDir(nextIndex).getStorageDirType().isOfType(dirType))

            break;

          nextIndex++;

        }

      }

      return sd;

    }

    public void remove() {

      ...

    }

  }

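Different directories are returned depending on the directory type passed in. These storage directories are the ones that hold the edit log and fsimage files, and they share some common information, shown below: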

/**

 * Common class for storage information.

 * Shared class that describes storage information.

 * TODO namespaceID should be long and computed as hash(address + port)

 * The namespace ID should be of type long, computed as a hash of IP address + port.

 */

public class StorageInfo {

  // Layout version of the stored data

  public int   layoutVersion;  // Version read from the stored file.

  // Namespace ID

  public int   namespaceID;    // namespace id of the storage

  // Creation time of the storage

  public long  cTime;          // creation timestamp

  public StorageInfo () {

    // Default constructor: all fields are 0

    this(0, 0, 0L);

  }

We use a method that saves the image as the entry point:

/**

   * Save the contents of the FS image to the file.

   * Saves the image file.

   */

  void saveFSImage(File newFile) throws IOException {

    FSNamesystem fsNamesys = FSNamesystem.getFSNamesystem();

    FSDirectory fsDir = fsNamesys.dir;

    long startTime = FSNamesystem.now();

    //

    // Write out data

    //

    DataOutputStream out = new DataOutputStream(

                                                new BufferedOutputStream(

                                                                         new FileOutputStream(newFile)));

    try {

      // Write the layout version

      out.writeInt(FSConstants.LAYOUT_VERSION);

      // Write the namespace ID

      out.writeInt(namespaceID);

      // Write the total number of items in the directory tree

      out.writeLong(fsDir.rootDir.numItemsInTree());

      // Write the generation stamp

      out.writeLong(fsNamesys.getGenerationStamp());

      byte[] byteStore = new byte[4*FSConstants.MAX_PATH_LENGTH];

      ByteBuffer strbuf = ByteBuffer.wrap(byteStore);

      // save the root

      saveINode2Image(strbuf, fsDir.rootDir, out);

      // save the rest of the nodes

      saveImage(strbuf, 0, fsDir.rootDir, out);

      fsNamesys.saveFilesUnderConstruction(out);

      fsNamesys.saveSecretManagerState(out);

      strbuf = null;

    } finally {

      out.close();

    }

    LOG.info("Image file of size " + newFile.length() + " saved in "

        + (FSNamesystem.now() - startTime)/1000 + " seconds.");

  }

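From the lines above you can see that a complete image file header contains the layout version, the namespace ID, the number of files, and the block generation stamp, followed by the detailed file information. The files under construction and the security information are saved here as well. To save the directory tree, saveINode2Image() first writes the directory's own information, and saveImage() then writes the information for its children, since saveImage() itself calls saveINode2Image().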
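To make the header layout concrete, here is a minimal sketch that writes the four header fields in the order saveFSImage() emits them and reads them back. The class ImageHeaderDemo and its helpers are hypothetical names made up for illustration, not part of Hadoop, and the field values are arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Hadoop.
class ImageHeaderDemo {
    // The four header fields, in the order saveFSImage() writes them.
    static final class Header {
        final int layoutVersion;
        final int namespaceId;
        final long numItems;
        final long generationStamp;

        Header(int layoutVersion, int namespaceId, long numItems, long generationStamp) {
            this.layoutVersion = layoutVersion;
            this.namespaceId = namespaceId;
            this.numItems = numItems;
            this.generationStamp = generationStamp;
        }
    }

    // Serializes the header: int, int, long, long (24 bytes in total).
    static byte[] write(Header h) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(h.layoutVersion);
            out.writeInt(h.namespaceId);
            out.writeLong(h.numItems);
            out.writeLong(h.generationStamp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in exactly the same order.
    static Header read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int layoutVersion = in.readInt();
            int namespaceId = in.readInt();
            long numItems = in.readLong();
            long generationStamp = in.readLong();
            return new Header(layoutVersion, namespaceId, numItems, generationStamp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is only that the reader must consume fields in the same order and with the same widths as the writer; the real image continues with the inode records after these 24 bytes.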

/*

   * Save one inode's attributes to the image.

   * Saves one node's attributes to the image.

   */

  private static void saveINode2Image(ByteBuffer name,

                                      INode node,

                                      DataOutputStream out) throws IOException {

    int nameLen = name.position();

    out.writeShort(nameLen);

    out.write(name.array(), name.arrayOffset(), nameLen);

    if (!node.isDirectory()) {  // write file inode

      INodeFile fileINode = (INodeFile)node;

      // Attributes written: replication, modification time, access time, preferred block size

      out.writeShort(fileINode.getReplication());

      out.writeLong(fileINode.getModificationTime());

      out.writeLong(fileINode.getAccessTime());

      out.writeLong(fileINode.getPreferredBlockSize());

      Block[] blocks = fileINode.getBlocks();

      out.writeInt(blocks.length);

      for (Block blk : blocks)

        // Write the block information as well

        blk.write(out);

      FILE_PERM.fromShort(fileINode.getFsPermissionShort());

      PermissionStatus.write(out, fileINode.getUserName(),

                             fileINode.getGroupName(),

                             FILE_PERM);

    } else {   // write directory inode

      // For a directory, the node's quota limits are written as well

      out.writeShort(0);  // replication

      out.writeLong(node.getModificationTime());

      out.writeLong(0);   // access time

      out.writeLong(0);   // preferred block size

      out.writeInt(-1);    // # of blocks

      out.writeLong(node.getNsQuota());

      out.writeLong(node.getDsQuota());

      FILE_PERM.fromShort(node.getFsPermissionShort());

      PermissionStatus.write(out, node.getUserName(),

                             node.getGroupName(),

                             FILE_PERM);

    }

  }

This writes more detailed information about files and directories. The method is also called from saveImage(), which recurses over the tree:

/**

   * Save file tree image starting from the given root.

   * This is a recursive procedure, which first saves all children of

   * a current directory and then moves inside the sub-directories.

   * Saves the image starting from the given node; each directory is traversed recursively.

   */

  private static void saveImage(ByteBuffer parentPrefix,

                                int prefixLength,

                                INodeDirectory current,

                                DataOutputStream out) throws IOException {

    int newPrefixLength = prefixLength;

    if (current.getChildrenRaw() == null)

      return;

    for(INode child : current.getChildren()) {

      // print all children first

      parentPrefix.position(prefixLength);

      parentPrefix.put(PATH_SEPARATOR).put(child.getLocalNameBytes());

      saveINode2Image(parentPrefix, child, out);

      ....

    }

The method that writes a file under construction is static and is called from outside the class:

// Helper function that writes an INodeUnderConstruction

  // into the input stream

  // Writes the information of a file that is currently being written

  //

  static void writeINodeUnderConstruction(DataOutputStream out,

                                           INodeFileUnderConstruction cons,

                                           String path)

                                           throws IOException {

    writeString(path, out);

    out.writeShort(cons.getReplication());

    out.writeLong(cons.getModificationTime());

    out.writeLong(cons.getPreferredBlockSize());

    int nrBlocks = cons.getBlocks().length;

    out.writeInt(nrBlocks);

    for (int i = 0; i < nrBlocks; i++) {

      cons.getBlocks()[i].write(out);

    }

    cons.getPermissionStatus().write(out);

    writeString(cons.getClientName(), out);

    writeString(cons.getClientMachine(), out);

    out.writeInt(0); //  do not store locations of last block

  }

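While we are here, let us look at the formatting methods. Formatting is performed before you start using HDFS, and it generates a new layout version and namespace ID. How is that implemented in code?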

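public void format() throws IOException {

    this.layoutVersion = FSConstants.LAYOUT_VERSION;

    ....

    for (Iterator<StorageDirectory> it = dirIterator(); it.hasNext();) {

      StorageDirectory sd = it.next();

      // Format each storage directory

      format(sd);

    }

  }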

  /** Create new dfs name directory.  Caution: this destroys all files

   * in this filesystem.

   * The format operation creates a dfs/name directory. */

  void format(StorageDirectory sd) throws IOException {

    sd.clearDirectory(); // create currrent dir

    sd.lock();

    try {

      saveCurrent(sd);

    } finally {

      sd.unlock();

    }

    LOG.info("Storage directory " + sd.getRoot()

             + " has been successfully formatted.");

  }

The operation is simple: clear the existing directory and create a new one.
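As a rough illustration of "clear the existing directory and create a new one", the sketch below deletes a `current` subdirectory and recreates it empty using java.nio.file. FormatSketch, clearDirectory, and the temporary directory layout are assumptions made for this example; this is not the body of Hadoop's StorageDirectory.clearDirectory().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch, not Hadoop code.
class FormatSketch {
    // Deletes root/current (if present) and recreates it as an empty directory.
    static Path clearDirectory(Path root) {
        Path current = root.resolve("current");
        try {
            if (Files.exists(current)) {
                try (Stream<Path> walk = Files.walk(current)) {
                    // Reverse order visits children before their parent directory.
                    walk.sorted(Comparator.reverseOrder()).forEach(FormatSketch::delete);
                }
            }
            return Files.createDirectories(current);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void delete(Path p) {
        try {
            Files.delete(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage example: formats a throwaway directory that already holds an old image.
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("name-dir");
            Path current = Files.createDirectories(root.resolve("current"));
            Files.write(current.resolve("fsimage"), new byte[] {1, 2, 3});
            Path fresh = clearDirectory(root);
            try (Stream<Path> entries = Files.list(fresh)) {
                return Files.isDirectory(fresh) && entries.count() == 0;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the article's code, saveCurrent(sd) then populates the fresh directory while the storage directory lock is held.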

Edit Log

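Now for the other major class, the edit log, called EditLog in the code, which contains a number of elegant design choices. Open the class and the first thing you see is a few dozen opcodes: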

/**

 * FSEditLog maintains a log of the namespace modifications.

 * The edit log class records every modification made to the namespace.

 */

public class FSEditLog {

  // Opcodes

  private static final byte OP_INVALID = -1;

  // File operations

  private static final byte OP_ADD = 0;

  private static final byte OP_RENAME = 1;  // rename

  private static final byte OP_DELETE = 2;  // delete

  private static final byte OP_MKDIR = 3;   // create directory

  private static final byte OP_SET_REPLICATION = 4; // set replication

  //the following two are used only for backward compatibility :

  @Deprecated private static final byte OP_DATANODE_ADD = 5;

  @Deprecated private static final byte OP_DATANODE_REMOVE = 6;

  // The next two relate to permission settings

  private static final byte OP_SET_PERMISSIONS = 7;

  private static final byte OP_SET_OWNER = 8;

  private static final byte OP_CLOSE = 9;    // close after write

  private static final byte OP_SET_GENSTAMP = 10;    // store genstamp

  /* The following two are not used any more. Should be removed once

   * LAST_UPGRADABLE_LAYOUT_VERSION is -17 or newer. */

  // Quota settings

  private static final byte OP_SET_NS_QUOTA = 11; // set namespace quota

  private static final byte OP_CLEAR_NS_QUOTA = 12; // clear namespace quota

  private static final byte OP_TIMES = 13; // sets mod & access time on a file

  private static final byte OP_SET_QUOTA = 14; // sets name and disk quotas.

  // Delegation token operations

  private static final byte OP_GET_DELEGATION_TOKEN = 18; //new delegation token

  private static final byte OP_RENEW_DELEGATION_TOKEN = 19; //renew delegation token

  private static final byte OP_CANCEL_DELEGATION_TOKEN = 20; //cancel delegation token

  private static final byte OP_UPDATE_MASTER_KEY = 21; //update master key

That is a lot of operations. Only after them come the basic field definitions:

  // Buffer size used when flushing the log: 512 KB

  private static int sizeFlushBuffer = 512*1024;

  // The edit log has several output streams at the same time

  private ArrayList<EditLogOutputStream> editStreams = null;

  // Holds a reference to the image class in order to interact with it

  private FSImage fsimage = null;

  // a monotonically increasing counter that represents transactionIds.

  // ID of the current transaction

  private long txid = 0;

  // stores the last synced transactionId.

  // ID of the last transaction that has been synced

  private long synctxid = 0;

  // the time of printing the statistics to the log file.

  private long lastPrintTime;

  // is a sync currently running?

  // Whether a log sync is currently in progress

  private boolean isSyncRunning;

  // these are statistics counters.

  // Transaction statistics

  // Total number of transactions

  private long numTransactions;        // number of transactions

  // Number of transactions that were not synced immediately but batched into another sync

  private long numTransactionsBatchedInSync;

  // Total time spent on all transactions

  private long totalTimeTransactions;  // total time for all transactions

  private NameNodeInstrumentation metrics;

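The variables txid, synctxid and so on appear frequently in the sync operations later. To keep the transaction IDs of different threads from interfering with each other, the author uses a ThreadLocal so that each thread tracks its own transaction ID: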

  // Transaction ID holder that wraps a long txid value

  private static class TransactionId {

    // Transaction ID of the operation

    public long txid;

    TransactionId(long value) {

      this.txid = value;

    }

  }

  // stores the most current transactionId of this thread.

  // Thread-private state is kept in a ThreadLocal

  private static final ThreadLocal<TransactionId> myTransactionId = new ThreadLocal<TransactionId>() {

    protected synchronized TransactionId initialValue() {

      return new TransactionId(Long.MAX_VALUE);

    }

  };
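The effect of the ThreadLocal can be seen in a small standalone sketch: a shared counter hands out transaction IDs under a lock, while each thread remembers only the last ID it was given. ThreadLocalTxidDemo, logEdit(), and myTxid() here are simplified stand-ins written for this example, not the FSEditLog methods themselves.

```java
// Hypothetical demo class, not part of Hadoop.
class ThreadLocalTxidDemo {
    private static final class TransactionId {
        long txid;
        TransactionId(long value) { this.txid = value; }
    }

    // Each thread gets its own holder, initialised to Long.MAX_VALUE as in the article.
    private static final ThreadLocal<TransactionId> MY_TXID =
        ThreadLocal.withInitial(() -> new TransactionId(Long.MAX_VALUE));

    // Shared, monotonically increasing counter guarded by the class lock.
    private static long txid = 0;

    // Allocates the next transaction ID and records it for the calling thread only.
    static synchronized long logEdit() {
        txid++;
        MY_TXID.get().txid = txid;
        return txid;
    }

    // Returns the last transaction ID logged by the calling thread.
    static long myTxid() {
        return MY_TXID.get().txid;
    }

    // Usage example: two workers log concurrently; each must see only its own last ID.
    static boolean demo() {
        final boolean[] ok = new boolean[2];
        Thread[] workers = new Thread[2];
        for (int i = 0; i < workers.length; i++) {
            final int slot = i;
            workers[i] = new Thread(() -> {
                long last = -1;
                for (int n = 0; n < 1000; n++) {
                    last = logEdit();
                }
                ok[slot] = (myTxid() == last);
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return ok[0] && ok[1];
    }
}
```

A thread that has never logged an edit still sees Long.MAX_VALUE, which matters later when logSync() compares the thread's ID against synctxid.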

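In the edit log, all file operations go through dedicated EditLog input and output streams, each built on an abstract parent class. Take EditLogOutputStream as an example (I have simplified the code a little):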

/**

 * A generic abstract class to support journaling of edits logs into

 * a persistent storage.

 */

abstract class EditLogOutputStream extends OutputStream {

  // these are statistics counters

  // Two statistics follow

  // Number of syncs to disk, i.e. the number of times the buffer was written out

  private long numSync;        // number of sync(s) to disk

  // Total time spent syncing

  private long totalTimeSync;  // total time to sync

  EditLogOutputStream() throws IOException {

    numSync = totalTimeSync = 0;

  }

  abstract String getName();

  abstract public void write(int b) throws IOException;

  abstract void write(byte op, Writable ... writables) throws IOException;

  abstract void create() throws IOException;

  abstract public void close() throws IOException;

  abstract void setReadyToFlush() throws IOException;

  abstract protected void flushAndSync() throws IOException;

  /**

   * Flush data to persistent store.

   * Collect sync metrics.

   * Flushes the buffered data and records how long it took.

   */

  public void flush() throws IOException {

    // Increment the sync count

    numSync++;

    long start = FSNamesystem.now();

    // flushAndSync() is abstract and implemented by subclasses

    flushAndSync();

    long end = FSNamesystem.now();

    // Accumulate the elapsed time

    totalTimeSync += (end - start);

  }

  abstract long length() throws IOException;

  long getTotalSyncTime() {

    return totalTimeSync;

  }

  long getNumSync() {

    return numSync;

  }

}

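Some design work went into the sync-related operations here, including a few statistics counters.

The input stream is similar, so I will not go through it. The edit log does not use this abstract class directly, though; it defines a richer subclass inside itself, EditLogFileOutputStream: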

  /**

   * An implementation of the abstract class {@link EditLogOutputStream},

   * which stores edits in a local file.

   * All writes to the edit log file go through this output stream.

   */

  static private class EditLogFileOutputStream extends EditLogOutputStream {

    private File file;

    // A file output stream is maintained internally

    private FileOutputStream fp;    // file stream for storing edit logs

    private FileChannel fc;         // channel of the file stream for sync

    // Double-buffer design that greatly improves concurrency; bufCurrent receives incoming writes

    private DataOutputBuffer bufCurrent;  // current buffer for writing

    // bufReady holds the data being flushed to the file

    private DataOutputBuffer bufReady;    // buffer ready for flushing

    static ByteBuffer fill = ByteBuffer.allocateDirect(512); // preallocation

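Note the double buffering here, a design that many other well-regarded systems also use. Now let us start from how the edit log opens its files for writing: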

/**

   * Create empty edit log files.

   * Initialize the output stream for logging.

   *

   * @throws IOException

   */

  public synchronized void open() throws IOException {

    // Reset all counters to 0 when the files are opened

    numTransactions = totalTimeTransactions = numTransactionsBatchedInSync = 0;

    if (editStreams == null) {

      editStreams = new ArrayList<EditLogOutputStream>();

    }

    // Get an iterator for the given directory type

    Iterator<StorageDirectory> it = fsimage.dirIterator(NameNodeDirType.EDITS);

    while (it.hasNext()) {

      StorageDirectory sd = it.next();

      File eFile = getEditFile(sd);

      try {

        // Open the file in the storage directory and get an output stream

        EditLogOutputStream eStream = new EditLogFileOutputStream(eFile);

        editStreams.add(eStream);

      } catch (IOException ioe) {

        fsimage.updateRemovedDirs(sd, ioe);

        it.remove();

      }

    }

    exitIfNoStreams();

  }

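This adds a new output stream to the editStreams field for every edits directory.

So what does a standard write look like? Take the close method as an example, since closing triggers one final write of whatever data remains: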

  /**

   * Shutdown the file store.

   * Close operation.

   */

  public synchronized void close() throws IOException {

    while (isSyncRunning) {

      // If a sync is in progress, wait 1 s

      try {

        wait(1000);

      } catch (InterruptedException ie) {

      }

    }

    if (editStreams == null) {

      return;

    }

    printStatistics(true);

    // Reset the counters when the files are closed

    numTransactions = totalTimeTransactions = numTransactionsBatchedInSync = 0;

    for (int idx = 0; idx < editStreams.size(); idx++) {

      EditLogOutputStream eStream = editStreams.get(idx);

      try {

        // On close, flush the remaining data out of the buffer

        eStream.setReadyToFlush();

        eStream.flush();

        eStream.close();

      } catch (IOException ioe) {

        removeEditsAndStorageDir(idx);

        idx--;

      }

    }

    editStreams.clear();

  }

The key part is two lines in the middle. setReadyToFlush() swaps the buffers:

/**

     * All data that has been written to the stream so far will be flushed.

     * New data can be still written to the stream while flushing is performed.

     */

    @Override

    void setReadyToFlush() throws IOException {

      assert bufReady.size() == 0 : "previous data is not flushed yet";

      write(OP_INVALID);           // insert end-of-file marker

      // Swap the two buffers

      DataOutputBuffer tmp = bufReady;

      bufReady = bufCurrent;

      bufCurrent = tmp;

    }

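bufCurrent buffers data coming in from external writes, while bufReady holds the data about to be written to the file. The method that does the real work is flush(), which is defined in the parent class: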

  /**

   * Flush data to persistent store.

   * Collect sync metrics.

   * Flushes the buffered data and records how long it took.

   */

  public void flush() throws IOException {

    // Increment the sync count

    numSync++;

    long start = FSNamesystem.now();

    // flushAndSync() is abstract and implemented by subclasses

    flushAndSync();

    long end = FSNamesystem.now();

    // Accumulate the elapsed time

    totalTimeSync += (end - start);

  }

It calls the sync method:

/**

     * Flush ready buffer to persistent store.

     * currentBuffer is not flushed as it accumulates new log records

     * while readyBuffer will be flushed and synced.

     */

    @Override

    protected void flushAndSync() throws IOException {

      preallocate();            // preallocate file if necessary

      // Write the data in the ready buffer to the file

      bufReady.writeTo(fp);     // write data to file

      bufReady.reset();         // erase all data in the buffer

      fc.force(false);          // metadata updates not needed because of preallocation

      // Step back over the end-of-file marker, since the marker is written on every flush

      fc.position(fc.position()-1); // skip back the end-of-file marker

    }
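The swap-then-flush idea can be reduced to a few lines. The sketch below is single-threaded and uses in-memory byte arrays in place of the edits file; DoubleBufferDemo and fileContents() are names invented for this example, and the end-of-file marker, preallocation, and fc.force() from the real class are deliberately left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical single-threaded sketch of the double-buffer idea, not Hadoop code.
class DoubleBufferDemo {
    private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream(); // receives new writes
    private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();   // waiting to be flushed
    private final ByteArrayOutputStream file = new ByteArrayOutputStream(); // stands in for the edits file

    // New records always land in bufCurrent.
    void write(String record) {
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        bufCurrent.write(bytes, 0, bytes.length);
    }

    // Swaps the buffers so that everything written so far becomes flushable.
    void setReadyToFlush() {
        if (bufReady.size() != 0) {
            throw new IllegalStateException("previous data is not flushed yet");
        }
        ByteArrayOutputStream tmp = bufReady;
        bufReady = bufCurrent;
        bufCurrent = tmp;
    }

    // Writes only bufReady to the "file"; bufCurrent is left untouched.
    void flushAndSync() {
        byte[] data = bufReady.toByteArray();
        file.write(data, 0, data.length);
        bufReady.reset();
    }

    String fileContents() {
        return new String(file.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For example, after write("a"), setReadyToFlush(), write("b"), flushAndSync(), the file holds only "a"; the record "b" stays in bufCurrent until the next swap. That separation is what lets other threads keep logging while a slow disk write is in progress.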

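You may be thinking that, for a simple file write, the design is rather elaborate. Now recall the few dozen opcodes at the top of the file, which represent all kinds of operations. How are they invoked? The obvious approach is for the caller to pass in parameter values and for the matching code to handle the operation, and EditLog follows exactly this idea.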

/**

   * Add set replication record to edit log

   */

  void logSetReplication(String src, short replication) {

    logEdit(OP_SET_REPLICATION,

            new UTF8(src),

            FSEditLog.toLogReplication(replication));

  }

  /** Add set namespace quota record to edit log

   *

   * @param src the string representation of the path to a directory

   * @param quota the directory size limit

   */

  void logSetQuota(String src, long nsQuota, long dsQuota) {

    logEdit(OP_SET_QUOTA, new UTF8(src),

            new LongWritable(nsQuota), new LongWritable(dsQuota));

  }

  /**  Add set permissions record to edit log */

  void logSetPermissions(String src, FsPermission permissions) {

    logEdit(OP_SET_PERMISSIONS, new UTF8(src), permissions);

  }

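There are many more logSet*-style methods like these. They all pass an opcode, the target of the operation, and any additional parameters to the lower-level logEdit method, which is the one that finally writes the operation record.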

/**

   * Write an operation to the edit log. Do not sync to persistent

   * store yet.

   * Writes an operation to the edit log.

   */

  synchronized void logEdit(byte op, Writable ... writables) {

    if (getNumEditStreams() < 1) {

      throw new AssertionError("No edit streams to log to");

    }

    long start = FSNamesystem.now();

    for (int idx = 0; idx < editStreams.size(); idx++) {

      EditLogOutputStream eStream = editStreams.get(idx);

      try {

        // Write the operation to every output stream

        eStream.write(op, writables);

      } catch (IOException ioe) {

        removeEditsAndStorageDir(idx);

        idx--;

      }

    }

    exitIfNoStreams();

    // get a new transactionId

    // Get a new transaction ID

    txid++;

    //

    // record the transactionId when new data was written to the edits log

    //

    TransactionId id = myTransactionId.get();

    id.txid = txid;

    // update statistics

    long end = FSNamesystem.now();

    // Every logEdit call adds to the transaction count and the elapsed time

    numTransactions++;

    totalTimeTransactions += (end-start);

    if (metrics != null) // Metrics is non-null only when used inside name node

      metrics.addTransaction(end-start);

  }

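Every new operation gets a new transaction ID here, and statistics such as the time taken to write into the buffer are collected. At this point the record has only been written to the output stream's buffer, not to the file, because multithreaded operation has to be taken into account. The actual sync is done by logSync():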

//

  // Sync all modifications done by this thread.

  //

  public void logSync() throws IOException {

    ArrayList<EditLogOutputStream> errorStreams = null;

    long syncStart = 0;

    // Fetch the transactionId of this thread.

    long mytxid = myTransactionId.get().txid;

    ArrayList<EditLogOutputStream> streams = new ArrayList<EditLogOutputStream>();

    boolean sync = false;

    try {

      synchronized (this) {

        printStatistics(false);

        // if somebody is already syncing, then wait

        while (mytxid > synctxid && isSyncRunning) {

          try {

            wait(1000);

          } catch (InterruptedException ie) {

          }

        }

        //

        // If this transaction was already flushed, then nothing to do

        //

        if (mytxid <= synctxid) {

          // This transaction was already covered by an earlier sync; just count it

          numTransactionsBatchedInSync++;

          if (metrics != null) // Metrics is non-null only when used inside name node

            metrics.incrTransactionsBatchedInSync();

          return;

        }

        // now, this thread will do the sync

        syncStart = txid;

        isSyncRunning = true;

        sync = true;

        // swap buffers

        exitIfNoStreams();

        for(EditLogOutputStream eStream : editStreams) {

          try {

            // Swap the buffers

            eStream.setReadyToFlush();

            streams.add(eStream);

          } catch (IOException ie) {

            FSNamesystem.LOG.error("Unable to get ready to flush.", ie);

            //

            // remember the streams that encountered an error.

            //

            if (errorStreams == null) {

              errorStreams = new ArrayList<EditLogOutputStream>(1);

            }

            errorStreams.add(eStream);

          }

        }

      }

      // do the sync

      long start = FSNamesystem.now();

      for (EditLogOutputStream eStream : streams) {

        try {

          // After the buffers are swapped, flush the data out

          eStream.flush();

       ....

  }
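Putting logEdit() and logSync() together gives a group-commit pattern: many threads append records, but only one at a time performs the slow flush, and threads whose records were covered by someone else's flush return immediately. The sketch below is a simplified model written for this article, with lists standing in for the buffers and the disk. The final block that publishes synctxid and wakes the waiters is elided in the excerpt above, so that part of the sketch is an assumption about what such a scheme needs, not a copy of Hadoop's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical group-commit model, not Hadoop code.
class GroupCommitDemo {
    private long txid = 0;                 // last transaction ID handed out
    private long synctxid = 0;             // last transaction ID known to be on "disk"
    private boolean isSyncRunning = false; // true while one thread is flushing
    private final ThreadLocal<long[]> myTxid =
        ThreadLocal.withInitial(() -> new long[] {Long.MAX_VALUE});

    private List<String> bufCurrent = new ArrayList<>();
    private List<String> bufReady = new ArrayList<>();
    private final List<String> disk = new ArrayList<>();
    private int numSyncs = 0;

    // Appends a record to the current buffer and remembers its ID for this thread.
    synchronized void logEdit(String record) {
        bufCurrent.add(record);
        txid++;
        myTxid.get()[0] = txid;
    }

    // Returns once the calling thread's last record has been flushed.
    void logSync() throws InterruptedException {
        long mytxid = myTxid.get()[0];
        long syncStart;
        List<String> toFlush;
        synchronized (this) {
            // Someone else is flushing and has not covered our record yet: wait.
            while (mytxid > synctxid && isSyncRunning) {
                wait(1000);
            }
            // Our record was flushed as part of another thread's sync.
            if (mytxid <= synctxid) {
                return;
            }
            // This thread becomes the syncer: swap buffers under the lock.
            syncStart = txid;
            isSyncRunning = true;
            toFlush = bufCurrent;
            bufCurrent = bufReady;
            bufReady = toFlush;
        }
        // Slow I/O happens outside the lock; only the single syncer touches these lists.
        disk.addAll(toFlush);
        toFlush.clear();
        synchronized (this) {
            // Assumed completion step: publish progress and wake waiting threads.
            synctxid = syncStart;
            isSyncRunning = false;
            numSyncs++;
            notifyAll();
        }
    }

    // Usage example: returns {records on disk, number of flushes performed}.
    static int[] demo(int threads, int editsPerThread) {
        GroupCommitDemo log = new GroupCommitDemo();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                try {
                    for (int n = 0; n < editsPerThread; n++) {
                        log.logEdit("t" + id + "-" + n);
                        log.logSync();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        synchronized (log) {
            return new int[] {log.disk.size(), log.numSyncs};
        }
    }
}
```

With several threads the number of flushes is usually smaller than the number of records, which corresponds to what numTransactionsBatchedInSync counts in the real class.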

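OK, the whole write path is finally clear. Once writing is done, how does the edit log class read the edit log file back and restore the in-memory metadata? The whole process is essentially a decoding process: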

/**

   * Load an edit log, and apply the changes to the in-memory structure

   * This is where we apply edits that we've been writing to disk all

   * along.

   * Loads the edit log file and rebuilds the current state in memory.

   */

  static int loadFSEdits(EditLogInputStream edits) throws IOException {

    FSNamesystem fsNamesys = FSNamesystem.getFSNamesystem();

    // FSDirectory is a facade: every operation is dispatched from this class to its subsystems

    FSDirectory fsDir = fsNamesys.dir;

    int numEdits = 0;

    int logVersion = 0;

    String clientName = null;

    String clientMachine = null;

    String path = null;

    int numOpAdd = 0, numOpClose = 0, numOpDelete = 0,

        numOpRename = 0, numOpSetRepl = 0, numOpMkDir = 0,

        numOpSetPerm = 0, numOpSetOwner = 0, numOpSetGenStamp = 0,

        numOpTimes = 0, numOpGetDelegationToken = 0,

        numOpRenewDelegationToken = 0, numOpCancelDelegationToken = 0,

        numOpUpdateMasterKey = 0, numOpOther = 0;

    long startTime = FSNamesystem.now();

    DataInputStream in = new DataInputStream(new BufferedInputStream(edits));

    try {

      // Read log file version. Could be missing.

      in.mark(4);

      // If edits log is greater than 2G, available method will return negative

      // numbers, so we avoid having to call available

      boolean available = true;

      try {

        // Read the log version first

        logVersion = in.readByte();

      } catch (EOFException e) {

        available = false;

      }

      if (available) {

        in.reset();

        logVersion = in.readInt();

        if (logVersion < FSConstants.LAYOUT_VERSION) // future version

          throw new IOException(

                          "Unexpected version of the file system log file: "

                          + logVersion + ". Current version = "

                          + FSConstants.LAYOUT_VERSION + ".");

      }

      assert logVersion <= Storage.LAST_UPGRADABLE_LAYOUT_VERSION :

                            "Unsupported version " + logVersion;

      while (true) {

        ....

        // Apply the change according to the operation type

        switch (opcode) {

        case OP_ADD:

        case OP_CLOSE: {

          ...

          break;

        }

        case OP_SET_REPLICATION: {

          numOpSetRepl++;

          path = FSImage.readString(in);

          short replication = adjustReplication(readShort(in));

          fsDir.unprotectedSetReplication(path, replication, null);

          break;

        }

        case OP_RENAME: {

          numOpRename++;

          int length = in.readInt();

          if (length != 3) {

            throw new IOException("Incorrect data format. "

                                  + "Mkdir operation.");

          }

          String s = FSImage.readString(in);

          String d = FSImage.readString(in);

          timestamp = readLong(in);

          HdfsFileStatus dinfo = fsDir.getFileInfo(d);

          fsDir.unprotectedRenameTo(s, d, timestamp);

          fsNamesys.changeLease(s, d, dinfo);

          break;

        }

        ...

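The function is very long; it is enough to follow the idea.

Many of the operations are implemented in FSDirectory. You can think of that class as a facade that wraps the related subsystems.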
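The decode-and-apply loop is easier to see at toy scale. The sketch below replays a tiny log against an in-memory set of paths. The opcode values are the ones listed earlier in the article, but the record layout (one opcode byte followed by writeUTF strings) and the class EditReplayDemo are made up for this example and do not match the real on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical replay sketch with an invented record layout, not Hadoop code.
class EditReplayDemo {
    static final byte OP_INVALID = -1;
    static final byte OP_RENAME = 1;
    static final byte OP_DELETE = 2;
    static final byte OP_MKDIR = 3;

    // Builds a small log: mkdir /a, mkdir /b, rename /a -> /c, delete /b, end marker.
    static byte[] sampleLog() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(OP_MKDIR);
            out.writeUTF("/a");
            out.writeByte(OP_MKDIR);
            out.writeUTF("/b");
            out.writeByte(OP_RENAME);
            out.writeUTF("/a");
            out.writeUTF("/c");
            out.writeByte(OP_DELETE);
            out.writeUTF("/b");
            out.writeByte(OP_INVALID);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads opcodes one by one and applies each change to the in-memory namespace.
    static Set<String> replay(byte[] log) {
        Set<String> namespace = new TreeSet<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(log))) {
            while (true) {
                byte opcode;
                try {
                    opcode = in.readByte();
                } catch (EOFException e) {
                    break; // no more records
                }
                if (opcode == OP_INVALID) {
                    break; // end-of-file marker
                }
                switch (opcode) {
                    case OP_MKDIR:
                        namespace.add(in.readUTF());
                        break;
                    case OP_DELETE:
                        namespace.remove(in.readUTF());
                        break;
                    case OP_RENAME: {
                        String src = in.readUTF();
                        String dst = in.readUTF();
                        if (namespace.remove(src)) {
                            namespace.add(dst);
                        }
                        break;
                    }
                    default:
                        throw new IOException("Unknown opcode " + opcode);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return namespace;
    }
}
```

Replaying sampleLog() leaves a namespace containing only "/c". In loadFSEdits the set is replaced by FSDirectory, and each case calls one of its unprotected* methods.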

NameNode Metadata Backup Mechanism

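With the methods above as groundwork, the metadata backup mechanism can be implemented flexibly: it just calls these basic methods to copy and rename the various files. Overall, the operations that change file state are: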

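1. The original current image directory --> lastcheckpoint.tmp

2. After the secondary name node uploads the new image file, fsimage.ckpt --> fsimage, and a new current directory is created

3. lastcheckpoint.tmp becomes previous.checkpoint

4. The log file edits.new --> edits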

Those are the four main steps.
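The two file-level renames in that list (fsimage.ckpt to fsimage, and the new edits file to edits) can be tried out in a scratch directory. The sketch below uses Files.move with REPLACE_EXISTING; the class CheckpointRenameSketch, the promote() helper, and the file contents are assumptions for illustration, and the directory-level steps (lastcheckpoint.tmp, previous.checkpoint) are not modelled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the file promotions, not Hadoop code.
class CheckpointRenameSketch {
    // Replaces dir/liveName with dir/newName, overwriting the old live file.
    static void promote(Path dir, String newName, String liveName) {
        try {
            Files.move(dir.resolve(newName), dir.resolve(liveName),
                       StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void write(Path dir, String name, String content) throws IOException {
        Files.write(dir.resolve(name), content.getBytes(StandardCharsets.UTF_8));
    }

    private static String read(Path dir, String name) throws IOException {
        return new String(Files.readAllBytes(dir.resolve(name)), StandardCharsets.UTF_8);
    }

    // Usage example: sets up old and new files, promotes both, and checks the result.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("current");
            write(dir, "fsimage", "old image");
            write(dir, "fsimage.ckpt", "merged image");
            write(dir, "edits", "old edits");
            write(dir, "edits.new", "new edits");

            promote(dir, "fsimage.ckpt", "fsimage");
            promote(dir, "edits.new", "edits");

            return read(dir, "fsimage").equals("merged image")
                && read(dir, "edits").equals("new edits")
                && !Files.exists(dir.resolve("fsimage.ckpt"))
                && !Files.exists(dir.resolve("edits.new"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The rollFSImage() code shown later in this article uses File.renameTo() and deletes the destination before retrying, because renameTo can fail on Windows when the target already exists; REPLACE_EXISTING covers the same case in this sketch.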

Image backup always starts from the secondary name node's periodic checkpoint check:

//

  // The main work loop

  //

  public void doWork() {

    long period = 5 * 60;              // 5 minutes

    long lastCheckpointTime = 0;

    if (checkpointPeriod < period) {

      period = checkpointPeriod;

    }

    // Main loop

    while (shouldRun) {

      try {

        Thread.sleep(1000 * period);

      } catch (InterruptedException ie) {

        // do nothing

      }

      if (!shouldRun) {

        break;

      }

      try {

        // We may have lost our ticket since last checkpoint, log in again, just in case

        if(UserGroupInformation.isSecurityEnabled())

          UserGroupInformation.getCurrentUser().reloginFromKeytab();

        long now = System.currentTimeMillis();

        long size = namenode.getEditLogSize();

        if (size >= checkpointSize ||

            now >= lastCheckpointTime + 1000 * checkpointPeriod) {

          // Call the checkpoint method periodically

          doCheckpoint();

          ...

    }

  }
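The trigger logic in doWork() comes down to two small calculations: how long to sleep between checks, and whether either the size threshold or the time threshold has been reached. They are pulled out below as pure functions so they can be tried with concrete numbers. CheckpointTrigger and its method names are hypothetical; the variable names and formulas follow the excerpt above.

```java
// Hypothetical helper that mirrors the conditions in doWork(), not Hadoop code.
class CheckpointTrigger {
    // The loop wakes up every 5 minutes, or sooner if the checkpoint period is shorter.
    static long sleepSeconds(long checkpointPeriodSeconds) {
        return Math.min(5 * 60, checkpointPeriodSeconds);
    }

    // A checkpoint starts when the edit log is large enough or enough time has passed.
    static boolean shouldCheckpoint(long editLogSize, long checkpointSize,
                                    long nowMillis, long lastCheckpointMillis,
                                    long checkpointPeriodSeconds) {
        return editLogSize >= checkpointSize
            || nowMillis >= lastCheckpointMillis + 1000 * checkpointPeriodSeconds;
    }
}
```

For instance, with an illustrative period of 3600 seconds and a size threshold of 64 units, an edit log of size 100 triggers a checkpoint immediately, while an edit log of size 10 triggers one only after 3,600,000 ms have elapsed since the last checkpoint.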

Next we look at the doCheckpoint() method:

/**

   * Create a new checkpoint

   */

  void doCheckpoint() throws IOException {

    // Do the required initialization of the merge work area.

    // Initialize the checkpoint work area

    startCheckpoint();

    // Tell the namenode to start logging transactions in a new edit file

    // Retuns a token that would be used to upload the merged image.

    CheckpointSignature sig = (CheckpointSignature)namenode.rollEditLog();

    // error simulation code for junit test

    if (ErrorSimulator.getErrorSimulation(0)) {

      throw new IOException("Simulating error0 " +

                            "after creating edits.new");

    }

    // Fetch the current image and edit log from the name node

    downloadCheckpointFiles(sig);   // Fetch fsimage and edits

    // Merge the image and the edits

    doMerge(sig);                   // Do the merge

    //

    // Upload the new image into the NameNode. Then tell the Namenode

    // to make this new uploaded image as the most current image.

    // Upload the merged image back to the name node

    putFSImage(sig);

    // error simulation code for junit test

    if (ErrorSimulator.getErrorSimulation(1)) {

      throw new IOException("Simulating error1 " +

                            "after uploading new image to NameNode");

    }

    // Tell the name node to replace the image: edits.new is renamed to edits and fsimage.ckpt to fsimage

    namenode.rollFsImage();

    checkpointImage.endCheckpoint();

    LOG.info("Checkpoint done. New Image Size: "

              + checkpointImage.getFsImageName().length());

  }

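This method lays out the backup mechanism very clearly.

Now let us look at the file replacement method, namenode.rollFsImage, which ends up calling the method of the same name in FSImage.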

/**

   * Moves fsimage.ckpt to fsImage and edits.new to edits

   * Reopens the new edits file.

   * Renames the two files.

   */

  void rollFSImage() throws IOException {

    if (ckptState != CheckpointStates.UPLOAD_DONE) {

      throw new IOException("Cannot roll fsImage before rolling edits log.");

    }

    //

    // First, verify that edits.new and fsimage.ckpt exists in all

    // checkpoint directories.

    //

    if (!editLog.existsNew()) {

      throw new IOException("New Edits file does not exist");

    }

    Iterator<StorageDirectory> it = dirIterator(NameNodeDirType.IMAGE);

    while (it.hasNext()) {

      StorageDirectory sd = it.next();

      File ckpt = getImageFile(sd, NameNodeFile.IMAGE_NEW);

      if (!ckpt.exists()) {

        throw new IOException("Checkpoint file " + ckpt +

                              " does not exist");

      }

    }

    editLog.purgeEditLog(); // renamed edits.new to edits

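The first half of the method is clear enough: it checks that both new files exist and then renames edits.new to edits. The two kinds of files are replaced in turn, and the image comes next: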

//

    // Renames new image

    // Rename the new image

    //

    it = dirIterator(NameNodeDirType.IMAGE);

    while (it.hasNext()) {

      StorageDirectory sd = it.next();

      File ckpt = getImageFile(sd, NameNodeFile.IMAGE_NEW);

      File curFile = getImageFile(sd, NameNodeFile.IMAGE);

      // renameTo fails on Windows if the destination file

      // already exists.

      if (!ckpt.renameTo(curFile)) {

        curFile.delete();

        if (!ckpt.renameTo(curFile)) {

          editLog.removeEditsForStorageDir(sd);

          updateRemovedDirs(sd);

          it.remove();

        }

      }

    }

    editLog.exitIfNoStreams();

The middle part renames the new image fsimage.ckpt to the current name, fsimage. Finally, the old files are deleted:

//

    // Updates the fstime file on all directories (fsimage and edits)

    // and write version file

    //

    this.layoutVersion = FSConstants.LAYOUT_VERSION;

    this.checkpointTime = FSNamesystem.now();

    it = dirIterator();

    while (it.hasNext()) {

      StorageDirectory sd = it.next();

      // delete old edits if sd is the image only the directory

      if (!sd.getStorageDirType().isOfType(NameNodeDirType.EDITS)) {

        File editsFile = getImageFile(sd, NameNodeFile.EDITS);

        editsFile.delete();

      }

      // delete old fsimage if sd is the edits only the directory

      if (!sd.getStorageDirType().isOfType(NameNodeDirType.IMAGE)) {

        File imageFile = getImageFile(sd, NameNodeFile.IMAGE);

        imageFile.delete();

      }

This process is fairly complex, so a diagram from the book shows it best; the figure may be a bit large.

Summary

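That concludes this article. I have skipped over many details and mainly followed the basic operations step by step to trace the overall thread. To verify the whole mechanism, I suggest watching the files under the name and edits directories while Hadoop is running, or running an upgrade test directly.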

For the full code analysis, see

https://github.com/linyiqun/hadoop-hdfs

More analysis of other parts of the HDFS code will follow.

References

Cai Bin et al., 《Hadoop技术内幕：HDFS结构设计与实现原理》 (Hadoop Internals: HDFS Architecture Design and Implementation Principles).

