Reading the LevelDB Source (4): The Compaction Operation

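LevelDB stores data following the LSM (log-structured merge) approach, which turns random writes into sequential ones. Each write is first recorded in an operation log; as soon as the log record has been appended to disk, the write is reported as successful, and a background thread later applies the logged writes to the existing on-disk files to produce new on-disk data. In memory, LevelDB maintains a data structure called the memtable, implemented as a skiplist, which holds the data currently being written. Once the memtable reaches a certain size, it becomes a read-only in-memory table, the immutable memtable. New writes go to a fresh memtable, while the immutable memtable is written to a data file by the background thread. LevelDB's data files are organized in levels; the default configuration has 7 levels (config::kNumLevels), i.e. level 0, level 1, ..., level 6. The immutable memtable is written to level 0 by default. For every level i other than level 0, the key ranges of the data files within level i do not overlap one another. When certain conditions are met, data files in level i are merged with data files in level i+1, producing new level i+1 files. This merging of disk files, together with the dumping of the immutable memtable, is called compaction, and LevelDB performs it in a single dedicated background thread.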

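Compaction is triggered under the following conditions:

1. A new immutable memtable has been produced and needs to be written to a data file.

2. The amount of data in some level has grown too large.

3. Some file has had too many unproductive lookups (a lookup of a key in file i that fails to find the key counts as an unproductive lookup for file i).

4. A manual compaction is requested.

When any of these conditions holds, the compaction process is started. The rest of this post walks through that process in detail.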

The entry point for compaction in LevelDB is DBImpl::MaybeScheduleCompaction in db/db_impl.cc. This function may be called on any read or write operation. Its source is as follows:

void DBImpl::MaybeScheduleCompaction() {

  mutex_.AssertHeld();

  if (bg_compaction_scheduled_) {

    // Already scheduled

  } else if (shutting_down_.Acquire_Load()) {

    // DB is being deleted; no more background compactions

  } else if (!bg_error_.ok()) {

    // Already got an error; no more changes

  } else if (imm_ == NULL &&

             manual_compaction_ == NULL &&

             !versions_->NeedsCompaction()) {

    // No work to be done

  } else {

    bg_compaction_scheduled_ = true;

    env_->Schedule(&DBImpl::BGWork, this);  // Create a background task and schedule it

  }

}

Among its checks, it calls NeedsCompaction(), defined in db/version_set.h, to decide whether a compaction task needs to be started. The source is:

// Returns true iff some level needs a compaction.

  bool NeedsCompaction() const {

    Version* v = current_;

    return (v->compaction_score_ >= 1) || (v->file_to_compact_ != NULL);

  }

compaction_score_ is computed in version_set.cc as follows:

void VersionSet::Finalize(Version* v) {

  // Precomputed best level for next compaction

  int best_level = -1;

  double best_score = -1;

  for (int level = 0; level < config::kNumLevels-1; level++) {

    double score;

    if (level == 0) {

      // We treat level-0 specially by bounding the number of files

      // instead of number of bytes for two reasons:

      //

      // (1) With larger write-buffer sizes, it is nice not to do too

      // many level-0 compactions.

      //

      // (2) The files in level-0 are merged on every read and

      // therefore we wish to avoid too many files when the individual

      // file size is small (perhaps because of a small write-buffer

      // setting, or very high compression ratios, or lots of

      // overwrites/deletions).

      score = v->files_[level].size() /

          static_cast<double>(config::kL0_CompactionTrigger);

    } else {

      // Compute the ratio of current size to size limit.

      const uint64_t level_bytes = TotalFileSize(v->files_[level]);

      score = static_cast<double>(level_bytes) / MaxBytesForLevel(level);

    }

    if (score > best_score) {

      best_level = level;

      best_score = score;

    }

  }

  v->compaction_level_ = best_level;

  v->compaction_score_ = best_score;

}

Note that this also precomputes the best level for the next compaction.
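The scoring rule in Finalize() can be tried out in isolation. The following is a minimal sketch, not LevelDB's code: LevelInfo and PickBestLevel are hypothetical names, and the constants (7 levels, a level-0 trigger of 4 files, a 10 MB limit for level 1 that grows tenfold per level) are assumptions based on LevelDB's defaults.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed constants, modeled on LevelDB's defaults.
constexpr int kNumLevels = 7;
constexpr int kL0CompactionTrigger = 4;

// Assumed size limit per level: 10 MB for level 1, 10x more per level.
double MaxBytesForLevel(int level) {
  double result = 10.0 * 1048576.0;
  while (level > 1) {
    result *= 10;
    level--;
  }
  return result;
}

// Hypothetical per-level summary: file count and total bytes.
struct LevelInfo {
  int num_files;
  uint64_t total_bytes;
};

// Returns {best_level, best_score}, scanning levels 0 .. kNumLevels-2.
std::pair<int, double> PickBestLevel(const std::vector<LevelInfo>& levels) {
  assert(levels.size() >= static_cast<std::size_t>(kNumLevels));
  int best_level = -1;
  double best_score = -1;
  for (int level = 0; level < kNumLevels - 1; level++) {
    double score;
    if (level == 0) {
      // Level 0 is scored by file count, not by bytes.
      score = levels[level].num_files /
              static_cast<double>(kL0CompactionTrigger);
    } else {
      // Other levels are scored by current size over size limit.
      score = static_cast<double>(levels[level].total_bytes) /
              MaxBytesForLevel(level);
    }
    if (score > best_score) {
      best_level = level;
      best_score = score;
    }
  }
  return {best_level, best_score};
}
```

Under these assumptions, two files in level 0 score 0.5 and 15 MB in level 1 scores 1.5, so level 1 would be chosen, and NeedsCompaction() would return true because the score is at least 1.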

Once a compaction is known to be needed, PosixEnv::Schedule in util/env_posix.cc is called to start the compaction process.

void PosixEnv::Schedule(void (*function)(void*), void* arg) {

  PthreadCall("lock", pthread_mutex_lock(&mu_));

  // Start background thread if necessary

  if (!started_bgthread_) {

    started_bgthread_ = true;

    PthreadCall(

        "create thread",

        pthread_create(&bgthread_, NULL,  &PosixEnv::BGThreadWrapper, this));

  }

  // If the queue is currently empty, the background thread may currently be

  // waiting.

  if (queue_.empty()) {

    PthreadCall("signal", pthread_cond_signal(&bgsignal_));

  }

  // Add to priority queue

  queue_.push_back(BGItem());

  queue_.back().function = function;

  queue_.back().arg = arg;

  PthreadCall("unlock", pthread_mutex_unlock(&mu_));

}

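If the background thread has not been started yet, Schedule creates it. If the queue is currently empty, it signals the condition variable, because the background thread may be waiting. It then appends a new background task, a BGItem, to the task queue. The background thread's entry point is PosixEnv::BGThreadWrapper: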

static void* BGThreadWrapper(void* arg) {

    reinterpret_cast<PosixEnv*>(arg)->BGThread();

    return NULL;

  }

BGThreadWrapper calls PosixEnv::BGThread, which loops forever, taking tasks from the background queue and running them:

void PosixEnv::BGThread() {

  while (true) {

    // Wait until there is an item that is ready to run

    PthreadCall("lock", pthread_mutex_lock(&mu_));

    while (queue_.empty()) {

      PthreadCall("wait", pthread_cond_wait(&bgsignal_, &mu_));

    }

    void (*function)(void*) = queue_.front().function;

    void* arg = queue_.front().arg;

    queue_.pop_front();

    PthreadCall("unlock", pthread_mutex_unlock(&mu_));

    (*function)(arg);

  }

}
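The Schedule()/BGThread() pair is a classic single-consumer work queue. Below is a rough sketch of the same pattern using the C++ standard library instead of raw pthreads. BackgroundQueue is a hypothetical class, and unlike PosixEnv it adds a stopping flag so that the worker can be joined in the destructor.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for PosixEnv's background queue.
class BackgroundQueue {
 public:
  ~BackgroundQueue() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopping_ = true;
    }
    cv_.notify_all();
    if (worker_.joinable()) worker_.join();
  }

  // Like Schedule(): lazily start the worker, enqueue the task, signal.
  void Schedule(std::function<void()> fn) {
    assert(fn != nullptr);
    std::lock_guard<std::mutex> lock(mu_);
    if (!started_) {
      started_ = true;
      worker_ = std::thread([this] { Run(); });
    }
    queue_.push_back(std::move(fn));
    cv_.notify_one();
  }

 private:
  // Like BGThread(): wait for work, pop one task, run it without the lock.
  void Run() {
    for (;;) {
      std::function<void()> fn;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stopping, and the queue is drained
        fn = std::move(queue_.front());
        queue_.pop_front();
      }
      fn();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool started_ = false;
  bool stopping_ = false;
  std::thread worker_;
};
```

As in BGThread(), the task runs after the mutex has been released, so a long-running compaction does not block callers that only want to enqueue more work.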

Back to DBImpl::MaybeScheduleCompaction. For convenience, its source is repeated here:

void DBImpl::MaybeScheduleCompaction() {

  mutex_.AssertHeld();

  if (bg_compaction_scheduled_) {

    // Already scheduled

  } else if (shutting_down_.Acquire_Load()) {

    // DB is being deleted; no more background compactions

  } else if (!bg_error_.ok()) {

    // Already got an error; no more changes

  } else if (imm_ == NULL &&

             manual_compaction_ == NULL &&

             !versions_->NeedsCompaction()) {

    // No work to be done

  } else {

    bg_compaction_scheduled_ = true;

    env_->Schedule(&DBImpl::BGWork, this);  // Create a background task and schedule it

  }

}

The scheduling done by env_->Schedule was analyzed above. Now let us look at DBImpl::BGWork, which actually performs the background work. DBImpl::BGWork is also in db/db_impl.cc.

void DBImpl::BGWork(void* db) {

  reinterpret_cast<DBImpl*>(db)->BackgroundCall();

}

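DBImpl::BGWork calls DBImpl::BackgroundCall(). After a compaction finishes, some level may end up with too many files, so BackgroundCall() calls MaybeScheduleCompaction() again to decide whether another compaction is needed.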

void DBImpl::BackgroundCall() {

  MutexLock l(&mutex_);

  assert(bg_compaction_scheduled_);

  if (shutting_down_.Acquire_Load()) {

    // No more background work when shutting down.

  } else if (!bg_error_.ok()) {

    // No more background work after a background error.

  } else {

    BackgroundCompaction();

  }

  bg_compaction_scheduled_ = false;

  // Previous compaction may have produced too many files in a level,

  // so reschedule another compaction if needed.

  MaybeScheduleCompaction();

  bg_cv_.SignalAll();

}

DBImpl::BackgroundCall() calls BackgroundCompaction(), which handles three kinds of compaction: compacting the memtable, a trivial compaction (moving a file directly to the next level), and a regular merge, which is implemented by DoCompactionWork().

void DBImpl::BackgroundCompaction() {

  mutex_.AssertHeld();

  if (imm_ != NULL) {

    CompactMemTable();  // 1. Compact the memtable

    return;

  }

  Compaction* c;

  bool is_manual = (manual_compaction_ != NULL);  // manual_compaction_ is NULL by default, so is_manual defaults to false

  InternalKey manual_end;

  if (is_manual) {  // Get the manual compaction object

    ManualCompaction* m = manual_compaction_;

    c = versions_->CompactRange(m->level, m->begin, m->end);

    m->done = (c == NULL);

    if (c != NULL) {

      manual_end = c->input(0, c->num_input_files(0) - 1)->largest;

    }

    Log(options_.info_log,

        "Manual compaction at level-%d from %s .. %s; will stop at %s\n",

        m->level,

        (m->begin ? m->begin->DebugString().c_str() : "(begin)"),

        (m->end ? m->end->DebugString().c_str() : "(end)"),

        (m->done ? "(end)" : manual_end.DebugString().c_str()));

  } else {  // Get the automatic compaction object

    c = versions_->PickCompaction();

  }

  Status status;

  if (c == NULL) {

    // Nothing to do

  } else if (!is_manual && c->IsTrivialMove()) {  // 2. Trivial compaction: IsTrivialMove() returned true, so just move the file into level+1

    // Move file to next level

    assert(c->num_input_files(0) == 1);

    FileMetaData* f = c->input(0, 0);

    c->edit()->DeleteFile(c->level(), f->number);

    c->edit()->AddFile(c->level() + 1, f->number, f->file_size,

                       f->smallest, f->largest);

    status = versions_->LogAndApply(c->edit(), &mutex_);

    if (!status.ok()) {

      RecordBackgroundError(status);

    }

    VersionSet::LevelSummaryStorage tmp;

    Log(options_.info_log, "Moved #%lld to level-%d %lld bytes %s: %s\n",

        static_cast<unsigned long long>(f->number),

        c->level() + 1,

        static_cast<unsigned long long>(f->file_size),

        status.ToString().c_str(),

        versions_->LevelSummary(&tmp));

  } else {  // 3. Regular merge

    CompactionState* compact = new CompactionState(c);

    status = DoCompactionWork(compact);  // Run the compaction

    if (!status.ok()) {

      RecordBackgroundError(status);

    }

    CleanupCompaction(compact);

    c->ReleaseInputs();      // Decrement the reference count held on the inputs

    DeleteObsoleteFiles();   // Delete files that are no longer needed

  }

  delete c;

  if (status.ok()) {

    // Done

  } else if (shutting_down_.Acquire_Load()) {

    // Ignore compaction errors found during shutting down

  } else {

    Log(options_.info_log,

        "Compaction error: %s", status.ToString().c_str());

  }

  if (is_manual) {

    ManualCompaction* m = manual_compaction_;   // Mark the manual compaction task as finished

    if (!status.ok()) {

      m->done = true;

    }

    if (!m->done) {

      // We only compacted part of the requested range.  Update *m

      // to the range that is left to be compacted.

      m->tmp_storage = manual_end;

      m->begin = &m->tmp_storage;

    }

    manual_compaction_ = NULL;

  }

}

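The first line is mutex_.AssertHeld(). The default implementation of Mutex::AssertHeld is empty, yet it is called in many functions. Its purpose is explained as follows: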

As you have observed it does nothing in the default implementation. The function seems to be a placeholder for checking whether a particular thread holds a mutex and optionally abort if it doesn’t. This would be equivalent to the normal asserts we use for variables but applied on mutexes.
I think the reason it is not implemented yet is we don’t have an equivalent light weight function to assert whether a thread holds a lock in pthread_mutex_t used in the default implementation. Some platforms which has that capability could fill this implementation as part of porting process. Searching online I did find some implementation for this function in the windows port of leveldb. I can see one way to implement it using a wrapper class over pthread_mutex_t and setting some sort of a thread id variable to indicate which thread(s) currently holds the mutex, but it will have to be carefully implemented given the race conditions that can arise.

Compacting the Memtable

Compaction first checks imm_, so that a full memtable is written to an on-disk sstable file promptly. Memtable compaction is done by DBImpl::CompactMemTable():

void DBImpl::CompactMemTable() {

  mutex_.AssertHeld();

  assert(imm_ != NULL);  // imm_ must not be NULL

  VersionEdit edit;

  Version* base = versions_->current();

  base->Ref();

  Status s = WriteLevel0Table(imm_, &edit, base);  // Convert the memtable to an .sst file, write it as a level-0 table, and record it in edit

  base->Unref();

  if (s.ok()) {

    edit.SetPrevLogNumber(0);

    edit.SetLogNumber(logfile_number_);  // Earlier logs no longer needed

    s = versions_->LogAndApply(&edit, &mutex_);  // Apply the changes recorded in edit to produce a new version

  }

  if (s.ok()) {

      // Commit to the new state

    imm_->Unref();

    imm_ = NULL;

    has_imm_.Release_Store(NULL);

    DeleteObsoleteFiles();

  } else {

    RecordBackgroundError(s);

  }

}

CompactMemTable() mainly calls two functions: WriteLevel0Table() and versions_->LogAndApply().

CompactMemTable() first calls WriteLevel0Table(), whose source is:

Status DBImpl::WriteLevel0Table(MemTable* mem, VersionEdit* edit,

                                Version* base) {

  mutex_.AssertHeld();

  FileMetaData meta;

  meta.number = versions_->NewFileNumber();  // Get the number for the new .sst file

  pending_outputs_.insert(meta.number);

  Iterator* iter = mem->NewIterator();  // Used to iterate over the data in the memtable

  Status s;

  {

    mutex_.Unlock();

    s = BuildTable(dbname_, env_, options_, table_cache_, iter, &meta);  // Create the .sst file and record its metadata in meta

    mutex_.Lock();

  }

  delete iter;  // iter must be deleted once it is no longer needed

  pending_outputs_.erase(meta.number);

  int level = 0;

  if (s.ok() && meta.file_size > 0) {

    const Slice min_user_key = meta.smallest.user_key();

    const Slice max_user_key = meta.largest.user_key();

    if (base != NULL) {

      level = base->PickLevelForMemTableOutput(min_user_key, max_user_key);  // Choose a suitable level for the output file

    }

    edit->AddFile(level, meta.number, meta.file_size, meta.smallest, meta.largest);  // Add the new .sst file to that level

  }

  return s;

}

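WriteLevel0Table() first calls BuildTable() to write all the data in the immutable memtable into a single .sst file, and records the file's information (file number, key range, file size) in meta. Because the memtable is built on a skiplist and is therefore ordered, keys are written to the .sst file in ascending order. Note that when a memtable is converted to an SSTable, every record is written, including deletion markers. The actual removal happens in later compactions at higher levels. As a result, level 0 may contain several records with the same key.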

Status BuildTable(const std::string& dbname,

                  Env* env,

                  const Options& options,

                  TableCache* table_cache,

                  Iterator* iter,

                  FileMetaData* meta) {

  Status s;

  meta->file_size = 0;

  iter->SeekToFirst();

  std::string fname = TableFileName(dbname, meta->number);  // Get the name of the new table file

  if (iter->Valid()) {

    WritableFile* file;

    s = env->NewWritableFile(fname, &file);   // Create the new table file; data is written to it below

    if (!s.ok()) {

      return s;

    }

    TableBuilder* builder = new TableBuilder(options, file);  // Create the TableBuilder

    meta->smallest.DecodeFrom(iter->key());

    for (; iter->Valid(); iter->Next()) {    // Add each key/value pair to the builder

      Slice key = iter->key();

      meta->largest.DecodeFrom(key);

      builder->Add(key, iter->value());

    }

    // Finish and check for builder errors

    s = builder->Finish();  // Build the index and metaindex handles and write them to the file

    if (s.ok()) {

      meta->file_size = builder->FileSize();

      assert(meta->file_size > 0);

    }

    delete builder;

    // Finish and check for file errors

    if (s.ok()) {

      s = file->Sync();  // Sync the file to disk

    }

    if (s.ok()) {

      s = file->Close();

    }

    delete file;

    file = NULL;

    if (s.ok()) {

      // Verify that the table is usable

      Iterator* it = table_cache->NewIterator(ReadOptions(),

                                              meta->number,

                                              meta->file_size);  // Loads the table into the table cache

      s = it->status();

      delete it;

    }

  }

  // Check for input iterator errors

  if (!iter->status().ok()) {

    s = iter->status();

  }

  if (s.ok() && meta->file_size > 0) {

    // Keep it

  } else {

    env->DeleteFile(fname);

  }

  return s;

}

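This function uses iter to add key/value pairs to the TableBuilder, writes and syncs the file, and then opens the newly built table through the table cache so that it is cached for later use.

The table_builder files are in the table directory. TableBuilder::Add works as follows: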

void TableBuilder::Add(const Slice& key, const Slice& value) {

  Rep* r = rep_;

  assert(!r->closed);

  if (!ok()) return;

  if (r->num_entries > 0) {

    assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);

  }

  if (r->pending_index_entry) {  // A new block is starting

    assert(r->data_block.empty());

    r->options.comparator->FindShortestSeparator(&r->last_key, key);

    std::string handle_encoding;

    r->pending_handle.EncodeTo(&handle_encoding);

    r->index_block.Add(r->last_key, Slice(handle_encoding));

    r->pending_index_entry = false;

  }

  // Update the filter

  if (r->filter_block != NULL) {

    r->filter_block->AddKey(key);

  }

  // Add to the block builder

  r->last_key.assign(key.data(), key.size());

  r->num_entries++;

  r->data_block.Add(key, value);

  // When the block exceeds the configured size (4 KB by default), finish it, write it out, and start a new block.

  const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();

  if (estimated_block_size >= r->options.block_size) {

    Flush();

  }

}

TableBuilder::WriteBlock, which writes a block to the file, works as follows:

void TableBuilder::WriteBlock(BlockBuilder* block, BlockHandle* handle) {

  // File format contains a sequence of blocks where each block has:

  //    block_data: uint8[n]

  //    type: uint8

  //    crc: uint32

  assert(ok());

  Rep* r = rep_;

  Slice raw = block->Finish();  // Get the block's serialized data

  Slice block_contents;

  // Read the configured compression option

  CompressionType type = r->options.compression;

  // TODO(postrelease): Support more compression options: zlib?

  switch (type) {

    case kNoCompression:

      block_contents = raw;

      break;

    case kSnappyCompression: {

      std::string* compressed = &r->compressed_output;

      if (port::Snappy_Compress(raw.data(), raw.size(), compressed) &&

          compressed->size() < raw.size() - (raw.size() / 8u)) {

        block_contents = *compressed;

      } else {

        // Snappy not supported, or compressed less than 12.5%, so just

        // store uncompressed form

        block_contents = raw;

        type = kNoCompression;

      }

      break;

    }

  }

  // After optional compression, write to the file: block data + type + crc32

  WriteRawBlock(block_contents, type, handle);

  r->compressed_output.clear();

  block->Reset();

}
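The size test in WriteBlock() is easy to misread, so here is a small sketch of just that decision. ChooseStoredType is a hypothetical helper, and the compress_ok flag stands in for the return value of port::Snappy_Compress; the enum mirrors the two cases handled in the switch above.

```cpp
#include <cassert>
#include <cstddef>

enum CompressionType { kNoCompression = 0, kSnappyCompression = 1 };

// Decides which form of a block to store. The compressed form is kept
// only if compression succeeded and saved more than 12.5% of the raw size.
CompressionType ChooseStoredType(bool compress_ok, std::size_t raw_size,
                                 std::size_t compressed_size) {
  if (compress_ok && compressed_size < raw_size - (raw_size / 8u)) {
    return kSnappyCompression;
  }
  // Compression unavailable or not worthwhile: store the raw bytes.
  return kNoCompression;
}
```

For a 4096-byte block the threshold is 4096 - 512 = 3584 bytes, so a compressed size of 3583 is stored compressed while 3584 falls back to the uncompressed form.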

TableBuilder::Finish is defined as follows:

Status TableBuilder::Finish() {

  Rep* r = rep_;

  Flush();  // Write out the current data block, which may not be full

  assert(!r->closed);

  r->closed = true;

  BlockHandle filter_block_handle, metaindex_block_handle, index_block_handle;

  // Write filter block

  if (ok() && r->filter_block != NULL) {

    WriteRawBlock(r->filter_block->Finish(), kNoCompression,

                  &filter_block_handle);

  }

  // Write metaindex block

  if (ok()) {

    BlockBuilder meta_index_block(&r->options);

    if (r->filter_block != NULL) {

      // Add mapping from "filter.Name" to location of filter data

      std::string key = "filter.";

      key.append(r->options.filter_policy->Name());

      std::string handle_encoding;

      filter_block_handle.EncodeTo(&handle_encoding);

      meta_index_block.Add(key, handle_encoding);

    }

    // TODO(postrelease): Add stats and other meta blocks

    WriteBlock(&meta_index_block, &metaindex_block_handle);

  }

  // Write index block

  if (ok()) {

    if (r->pending_index_entry) {

      r->options.comparator->FindShortSuccessor(&r->last_key);

      std::string handle_encoding;

      r->pending_handle.EncodeTo(&handle_encoding);

      r->index_block.Add(r->last_key, Slice(handle_encoding));

      r->pending_index_entry = false;

    }

    WriteBlock(&r->index_block, &index_block_handle);

  }

  // Write footer

  if (ok()) {

    Footer footer;

    footer.set_metaindex_handle(metaindex_block_handle);

    footer.set_index_handle(index_block_handle);

    std::string footer_encoding;

    footer.EncodeTo(&footer_encoding);

    r->status = r->file->Append(footer_encoding);

    if (r->status.ok()) {

      r->offset += footer_encoding.size();

    }

  }

  return r->status;

}

The Flush() called in the code above is defined as follows:

void TableBuilder::Flush() {

  Rep* r = rep_;

  assert(!r->closed);

  if (!ok()) return;

  if (r->data_block.empty()) return;

  assert(!r->pending_index_entry);

  WriteBlock(&r->data_block, &r->pending_handle);

  if (ok()) {

    r->pending_index_entry = true;

    r->status = r->file->Flush();

  }

  if (r->filter_block != NULL) {

    r->filter_block->StartBlock(r->offset);

  }

}

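WriteLevel0Table() then calls PickLevelForMemTableOutput() to choose a suitable level for the memtable's output file, and calls edit->AddFile() to add the new .sst file to that level.

After WriteLevel0Table() returns, CompactMemTable() calls versions_->LogAndApply(), in db/version_set.cc, to derive a new version from the current version and the changes in edit. versions_->LogAndApply() is analyzed later in this post.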

Trivial Compaction

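As shown earlier, is_manual is false by default, so PickCompaction() is called to choose the level to compact and the corresponding input files. If c->IsTrivialMove() holds, the file is simply moved to the next level.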

  c = versions_->PickCompaction();

  Status status;

  if (c == NULL) {

    // Nothing to do

  } else if (!is_manual && c->IsTrivialMove()) {

    // Move file to next level

    assert(c->num_input_files(0) == 1);

    FileMetaData* f = c->input(0, 0);

    c->edit()->DeleteFile(c->level(), f->number);  // Remove the file from this level

    c->edit()->AddFile(c->level() + 1, f->number, f->file_size,   // Add the file to the next level

                       f->smallest, f->largest);

    status = versions_->LogAndApply(c->edit(), &mutex_);  // Apply the change and create a new Version

  }

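First, VersionSet::PickCompaction() in db/version_set.cc prepares the input for the upcoming compaction. As the earlier analysis of the compaction data structures showed, an automatic compaction is triggered in one of two ways: a level holds too much data (compaction_score_ >= 1), or a file has been looked up more times than allowed. When both apply, the size-triggered case takes priority.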

Compaction* VersionSet::PickCompaction() {

  Compaction* c;

  int level;

  const bool size_compaction = (current_->compaction_score_ >= 1);  // A level holds too much data

  const bool seek_compaction = (current_->file_to_compact_ != NULL);  // Some file has been looked up too many times

  if (size_compaction) {  // The size-triggered case takes priority

    level = current_->compaction_level_;  // The level to compact

    c = new Compaction(level);

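    // Each level has a compact_pointer_ recording a compaction key, so that compactions rotate through the key space

    for (size_t i = 0; i < current_->files_[level].size(); i++) {  // Pick a suitable file from the level to compact

      FileMetaData* f = current_->files_[level][i];  // The i-th file in this level

      if (compact_pointer_[level].empty() ||  // compact_pointer_ holds the key at which the next compaction starts; if it is empty, any file qualifies

          icmp_.Compare(f->largest.Encode(), compact_pointer_[level]) > 0) {  // or f's largest key is past that starting key

        c->inputs_[0].push_back(f);  // so this file can be compacted; add it to the level's inputs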

        break;

      }

    }

    if (c->inputs_[0].empty()) {  // If no file was chosen, wrap around to the first file in the level

      c->inputs_[0].push_back(current_->files_[level][0]);

    }

  } else if (seek_compaction) {  // Otherwise consider the too-many-seeks case

    level = current_->file_to_compact_level_;

    c = new Compaction(level);

    c->inputs_[0].push_back(current_->file_to_compact_);  // Use the file to compact as the level's input

  } else {

    return NULL;

  }

  c->input_version_ = current_;

  c->input_version_->Ref();

  // Keys in level 0 may repeat, so file key ranges can overlap; find all overlapping files and compact them together

  if (level == 0) {

    InternalKey smallest, largest;

    GetRange(c->inputs_[0], &smallest, &largest);  // Key range of the level's files to compact

    current_->GetOverlappingInputs(0, &smallest, &largest, &c->inputs_[0]);

    assert(!c->inputs_[0].empty());

  }

  SetupOtherInputs(c);  // Collect the inputs from level+1

  return c;

}
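The round-robin file choice driven by compact_pointer_ can be sketched on its own. In the simplified version below, FileRange and PickNextFile are hypothetical names, and plain std::string keys compared with operator> stand in for InternalKey and icmp_.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified file metadata for one level.
struct FileRange {
  int number;
  std::string smallest;
  std::string largest;
};

// Returns the index of the file to compact next. `files` is sorted by key
// range; `compact_pointer` is the starting key for the next compaction in
// this level (empty if the level has not been compacted yet).
std::size_t PickNextFile(const std::vector<FileRange>& files,
                         const std::string& compact_pointer) {
  assert(!files.empty());
  for (std::size_t i = 0; i < files.size(); i++) {
    // Take the first file whose largest key lies past the pointer.
    if (compact_pointer.empty() || files[i].largest > compact_pointer) {
      return i;
    }
  }
  return 0;  // Every file lies before the pointer: wrap around.
}
```

With files covering [a, c], [d, f] and [g, i], a pointer of "c" selects the second file, and a pointer of "i" wraps around to the first, so successive compactions sweep across the whole key range of the level.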

It then checks whether this is a trivial compaction. If so, the file only needs to be moved from level to level+1:

bool Compaction::IsTrivialMove() const {

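  return (num_input_files(0) == 1 &&   // exactly one file at level

          num_input_files(1) == 0 &&   // no overlapping files at level+1

          TotalFileSize(grandparents_) <= kMaxGrandParentOverlapBytes);  // the overlapping level+2 files are not too large in total; otherwise a later merge would be very expensive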

}
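To make the three conditions concrete, here is a standalone sketch of the same test. CompactionInputs is a hypothetical struct holding only file sizes, and the constants (a 2 MB target file size and a grandparent overlap cap of ten such files) are assumptions based on LevelDB's defaults.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed defaults.
constexpr int64_t kTargetFileSize = 2 * 1048576;
constexpr int64_t kMaxGrandParentOverlapBytes = 10 * kTargetFileSize;

// Hypothetical view of a compaction's inputs: just the file sizes.
struct CompactionInputs {
  std::vector<int64_t> level_files;        // inputs at level
  std::vector<int64_t> next_level_files;   // overlapping files at level+1
  std::vector<int64_t> grandparent_files;  // overlapping files at level+2
};

int64_t TotalFileSize(const std::vector<int64_t>& sizes) {
  int64_t sum = 0;
  for (int64_t s : sizes) {
    assert(s >= 0);
    sum += s;
  }
  return sum;
}

// True when the single input file can be moved down without rewriting it.
bool IsTrivialMove(const CompactionInputs& c) {
  return c.level_files.size() == 1 &&
         c.next_level_files.empty() &&
         TotalFileSize(c.grandparent_files) <= kMaxGrandParentOverlapBytes;
}
```

The grandparent check exists because a file moved into level+1 will eventually be merged with everything it overlaps in level+2; refusing the move when that overlap is large keeps the later merge cheap.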

Finally, the compaction is completed:

c->edit()->DeleteFile(c->level(), f->number);

c->edit()->AddFile(c->level() + 1, f->number, f->file_size, f->smallest, f->largest);

status = versions_->LogAndApply(c->edit(), &mutex_);

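Regular Merge

A regular merge is carried out by DBImpl::DoCompactionWork(). The compact object comes from VersionSet::PickCompaction(), just as in the trivial compaction case. Records with the same key may exist at different levels, but with different sequence numbers (seq). As shown earlier, the newest data lives in the lower levels, so its seq is always larger than that of a record with the same key in level+1. Therefore, when several records share a key, only the first one needs to be kept and the rest can be discarded. Level 0 may also contain records with the same key and different seq values. The newer the data, the larger its seq, and records are ordered by ascending user_key and descending seq, so records with the same user_key are adjacent and run from newest to oldest. During compaction, only the first record seen for a user_key needs to be kept; later records with the same user_key can be dropped, provided no snapshot still needs them, which is what the smallest_snapshot check in the code enforces. In that case the merged level+1 files contain no duplicate keys. Deletions are also carried out at this point: a deletion marker is dropped rather than written to the higher-level file, as long as no higher level can still hold data for that key.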

Status DBImpl::DoCompactionWork(CompactionState* compact) {

  if (snapshots_.empty()) {

    compact->smallest_snapshot = versions_->LastSequence();

  } else {

    compact->smallest_snapshot = snapshots_.oldest()->number_;

  }

  mutex_.Unlock();

  // Build an iterator over the data to be compacted

  Iterator* input = versions_->MakeInputIterator(compact->compaction);  // Iterates over every input file

  input->SeekToFirst();

  Status status;

  ParsedInternalKey ikey;

  std::string current_user_key;

  bool has_current_user_key = false;

  SequenceNumber last_sequence_for_key = kMaxSequenceNumber;

  for (; input->Valid() && !shutting_down_.Acquire_Load(); ) {

    if (has_imm_.NoBarrier_Load() != NULL) {  // The immutable memtable has the highest priority

      mutex_.Lock();

      if (imm_ != NULL) {   // If imm_ is non-NULL, compact the memtable

        CompactMemTable();

        bg_cv_.SignalAll();  // Wakeup MakeRoomForWrite() if necessary

      }

      mutex_.Unlock();

    }

    Slice key = input->key();

    if (compact->compaction->ShouldStopBefore(key) &&   // Should the current output be cut here? Finishing the file early avoids too much overlap with level N+2 files

        compact->builder != NULL) {

      status = FinishCompactionOutputFile(compact, input);

    }

    bool drop = false;

    if (!ParseInternalKey(key, &ikey)) {

      current_user_key.clear();

      has_current_user_key = false;

      last_sequence_for_key = kMaxSequenceNumber;

    } else {

      if (!has_current_user_key ||    // Track the current user_key and sequence

          user_comparator()->Compare(ikey.user_key,

          Slice(current_user_key)) != 0) {  // Records may share a key but differ in seq

        // First occurrence of this user key

        current_user_key.assign(ikey.user_key.data(), ikey.user_key.size());

        has_current_user_key = true;

        last_sequence_for_key = kMaxSequenceNumber;  // Set seq to the maximum to mark the first occurrence

      }

      if (last_sequence_for_key <= compact->smallest_snapshot) {  // The key has been seen before; otherwise this would still be kMaxSequenceNumber

        drop = true;    // (A)   // A newer record with the same key already exists; drop this one

      } else if (ikey.type == kTypeDeletion &&   // This record is a deletion marker

              ikey.sequence <= compact->smallest_snapshot &&  // Its sequence is no newer than the oldest snapshot

              compact->compaction->IsBaseLevelForKey(ikey.user_key)) {  // No higher level holds data for this key

        drop = true;   // So the marker itself can be dropped

      }

      last_sequence_for_key = ikey.sequence;  // Sequence of the record just seen, used when the same key appears again

    }

    if (!drop) {   // The record is kept

      if (compact->builder == NULL) {

        status = OpenCompactionOutputFile(compact);  // Create an .sst file for the merged data if none is open

      }

      if (compact->builder->NumEntries() == 0) {

        compact->current_output()->smallest.DecodeFrom(key);

      }

      compact->current_output()->largest.DecodeFrom(key);

      compact->builder->Add(key, input->value());  // Write the record to the .sst file

      if (compact->builder->FileSize() >=

          compact->compaction->MaxOutputFileSize()) {   // The .sst file has reached its maximum size

        status = FinishCompactionOutputFile(compact, input);  // Finish the current output file

      }

    }

    input->Next();  // Move on to the next record

  }

  if (status.ok() && compact->builder != NULL) {

    status = FinishCompactionOutputFile(compact, input);

  }

  if (status.ok()) {

    status = input->status();

  }

  delete input;

  input = NULL;

  // Update compaction statistics

  CompactionStats stats;

  stats.micros = env_->NowMicros() - start_micros - imm_micros;

  for (int which = 0; which < 2; which++) {

    for (int i = 0; i < compact->compaction->num_input_files(which); i++) {

      stats.bytes_read += compact->compaction->input(which, i)->file_size;

    }

  }

  for (size_t i = 0; i < compact->outputs.size(); i++) {

    stats.bytes_written += compact->outputs[i].file_size;

  }

  mutex_.Lock();

  stats_[compact->compaction->level() + 1].Add(stats);

  if (status.ok()) {

    status = InstallCompactionResults(compact);  // Install the results of the merge

  }

  if (!status.ok()) {

    RecordBackgroundError(status);

  }

  VersionSet::LevelSummaryStorage tmp;

  Log(options_.info_log,

      "compacted to: %s", versions_->LevelSummary(&tmp));

  return status;

}
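The drop logic in the loop above is the heart of the merge, and it can be exercised without any files. The sketch below is a simplification under stated assumptions: Entry and FilterEntries are hypothetical, the input is assumed to be sorted by ascending user_key and descending sequence, and a single is_base_level flag replaces the per-key IsBaseLevelForKey() check.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum ValueType { kTypeDeletion = 0, kTypeValue = 1 };
constexpr uint64_t kMaxSequenceNumber = (1ull << 56) - 1;

// Hypothetical decoded record.
struct Entry {
  std::string user_key;
  uint64_t sequence;
  ValueType type;
};

// Returns the entries that survive compaction. `input` must be sorted by
// user_key ascending, then sequence descending.
std::vector<Entry> FilterEntries(const std::vector<Entry>& input,
                                 uint64_t smallest_snapshot,
                                 bool is_base_level) {
  std::vector<Entry> kept;
  std::string current_user_key;
  bool has_current_user_key = false;
  uint64_t last_sequence_for_key = kMaxSequenceNumber;
  for (const Entry& e : input) {
    assert(e.sequence <= kMaxSequenceNumber);
    if (!has_current_user_key || e.user_key != current_user_key) {
      // First occurrence of this user key.
      current_user_key = e.user_key;
      has_current_user_key = true;
      last_sequence_for_key = kMaxSequenceNumber;
    }
    bool drop = false;
    if (last_sequence_for_key <= smallest_snapshot) {
      // Hidden by a newer entry that every snapshot can already see.
      drop = true;
    } else if (e.type == kTypeDeletion &&
               e.sequence <= smallest_snapshot && is_base_level) {
      // A deletion marker with nothing left underneath it to hide.
      drop = true;
    }
    last_sequence_for_key = e.sequence;
    if (!drop) kept.push_back(e);
  }
  return kept;
}
```

Note that an older record is dropped only when the record just before it for the same key is itself visible to the oldest snapshot. With a snapshot at sequence 6, for example, records a@9 and a@5 would both survive while a@3 would be dropped, because a reader of that snapshot still needs a@5.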

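The records that survive are first written to .sst files, with the related information stored in compact. InstallCompactionResults() is then called to add the changes to the VersionEdit, and it in turn calls LogAndApply() to obtain the new version.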

Status DBImpl::InstallCompactionResults(CompactionState* compact) {

  mutex_.AssertHeld();

  Log(options_.info_log,  "Compacted %d@%d + %d@%d files => %lld bytes",

      compact->compaction->num_input_files(0),

      compact->compaction->level(),

      compact->compaction->num_input_files(1),

      compact->compaction->level() + 1,

      static_cast<long long>(compact->total_bytes));

  // Add compaction outputs

  compact->compaction->AddInputDeletions(compact->compaction->edit());

  const int level = compact->compaction->level();

  for (size_t i = 0; i < compact->outputs.size(); i++) {

    const CompactionState::Output& out = compact->outputs[i];

    compact->compaction->edit()->AddFile(

        level + 1,

        out.number, out.file_size, out.smallest, out.largest);

  }

  return versions_->LogAndApply(compact->compaction->edit(), &mutex_);

}

LogAndApply()

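In all three kinds of compaction above, once the VersionEdit describing the changes to the current version is complete, VersionSet::LogAndApply() is called to apply the changes and create a new version. The edit records the files to be deleted from and added to level and level+1.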

Status VersionSet::LogAndApply(VersionEdit* edit, port::Mutex* mu) {

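  Version* v = new Version(this);  // Create a new Version

  {

    Builder builder(this, current_);  // Create a builder based on the current Version

    builder.Apply(edit);  // Record the files that edit adds and deletes in the builder

    builder.SaveTo(v);  // Save the result into the newly created Version, which yields the new version

  }

  Finalize(v);  // Score each level's files to determine whether further compaction is needed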

  std::string new_manifest_file;

  Status s;

  if (descriptor_log_ == NULL) {   // Only entered on the first call

    assert(descriptor_file_ == NULL);

    new_manifest_file = DescriptorFileName(dbname_, manifest_file_number_);  // Name of the new MANIFEST file to create

    edit->SetNextFile(next_file_number_);

    s = env_->NewWritableFile(new_manifest_file, &descriptor_file_);

    if (s.ok()) {

      descriptor_log_ = new log::Writer(descriptor_file_);

      s = WriteSnapshot(descriptor_log_);  // Snapshot: record the complete state of the database at startup

    }

  }

  {

    mu->Unlock();

    if (s.ok()) {

      std::string record;

      edit->EncodeTo(&record);

      s = descriptor_log_->AddRecord(record);  // Record the change to the database in the MANIFEST file

      if (s.ok()) {

        s = descriptor_file_->Sync();

      }

    }

    if (s.ok() && !new_manifest_file.empty()) {

      s = SetCurrentFile(env_, dbname_, manifest_file_number_);

    }

    mu->Lock();

  }

  if (s.ok()) {

    AppendVersion(v);  // Append the new Version to the tail of the doubly linked list of all Versions

    log_number_ = edit->log_number_;

    prev_log_number_ = edit->prev_log_number_;

  }


  return s;

}

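To be able to restore the database's previous state after a restart, the history of changes to the database must be recorded, and this is what the MANIFEST file holds. To save space and time, LevelDB writes the complete state of the database once at startup (WriteSnapshot()) and afterwards records only the changes, i.e. the contents of each VersionEdit (descriptor_log_->AddRecord()). On recovery, replaying the MANIFEST step by step restores the last state.

VersionSet::LogAndApply first creates a new Version, then calls builder.Apply(edit) to record the numbers of all files that edit deletes or adds. Its source is: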

  // Apply all of the edits in *edit to the current state.

  void Apply(VersionEdit* edit) {

    // Update each level's starting key for the next compaction

    for (size_t i = 0; i < edit->compact_pointers_.size(); i++) {

      const int level = edit->compact_pointers_[i].first;

      vset_->compact_pointer_[level] =

          edit->compact_pointers_[i].second.Encode().ToString();

    }

    // Add every deleted file to levels_[level].deleted_files

    const VersionEdit::DeletedFileSet& del = edit->deleted_files_;

    for (VersionEdit::DeletedFileSet::const_iterator iter = del.begin();

         iter != del.end();++iter) {

      const int level = iter->first;

      const uint64_t number = iter->second;

      levels_[level].deleted_files.insert(number);

    }

    // Add every new file to levels_[level].added_files

    for (size_t i = 0; i < edit->new_files_.size(); i++) {

      const int level = edit->new_files_[i].first;

      FileMetaData* f = new FileMetaData(edit->new_files_[i].second);

      f->refs = 1;

      f->allowed_seeks = (f->file_size / 16384);

      if (f->allowed_seeks < 100) f->allowed_seeks = 100;

      levels_[level].deleted_files.erase(f->number);

      levels_[level].added_files->insert(f);

    }

  }
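The allowed_seeks budget set in Apply() is what later drives seek-triggered compaction: each new file gets one allowed seek per 16 KB of data, with a floor of 100. A tiny sketch of that arithmetic follows; AllowedSeeks is a hypothetical helper, not a LevelDB function.

```cpp
#include <cassert>
#include <cstdint>

// One allowed seek per 16 KB of file data, but never fewer than 100.
int AllowedSeeks(uint64_t file_size) {
  int allowed = static_cast<int>(file_size / 16384);
  if (allowed < 100) allowed = 100;
  return allowed;
}
```

A 2 MB file therefore gets 2097152 / 16384 = 128 allowed seeks, while any file smaller than about 1.6 MB gets the minimum of 100. Once the budget is used up, the file becomes the file_to_compact_ candidate checked by NeedsCompaction().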

VersionSet::LogAndApply then calls builder.SaveTo(v) to save the changes into the new Version. Its source is:

  void SaveTo(Version* v) {

    BySmallestKey cmp;

    cmp.internal_comparator = &vset_->icmp_;

    for (int level = 0; level < config::kNumLevels; level++) {

      const std::vector<FileMetaData*>& base_files = base_->files_[level];  // The .sst files already in this level of the current Version

      std::vector<FileMetaData*>::const_iterator base_iter = base_files.begin();

      std::vector<FileMetaData*>::const_iterator base_end = base_files.end();

      const FileSet* added = levels_[level].added_files;  // Files newly added to this level

      v->files_[level].reserve(base_files.size() + added->size());

      for (FileSet::const_iterator added_iter = added->begin();

           added_iter != added->end();++added_iter) {

        // Add all base files that sort before the added file to the new Version

        for (std::vector<FileMetaData*>::const_iterator bpos

                 = std::upper_bound(base_iter, base_end, *added_iter, cmp);

             base_iter != bpos;++base_iter) {

          MaybeAddFile(v, level, *base_iter);

        }

        MaybeAddFile(v, level, *added_iter);  // Then add the new file itself

      }

      for (; base_iter != base_end; ++base_iter) {

        MaybeAddFile(v, level, *base_iter);  // Finally add the remaining base files

      }

    }

  }

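bpos = std::upper_bound(base_iter, base_end, *added_iter, cmp) returns the first iterator in [base_iter, base_end) whose element compares greater than *added_iter. The comparator, BySmallestKey, orders files by their smallest key and uses the file number only to break ties. To illustrate, label each file by its position in that order: suppose the existing files are 1, 3, 4, 6, 8 and the added files are 2, 5, 7. In the first iteration bpos points at 3, so base_iter advances over a single element and file 1 is added to the new Version. In general, for each added file, the base files that sort before it are added first, then the file itself, and then the loop moves on to the next added file. The final order is 1, 2, 3, 4, 5, 6, 7, 8, so the .sst files in each level end up sorted. This yields all the files of every level in the new Version.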
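The walk through base files and added files can be reproduced with plain integers. This is only an illustrative sketch: MergeFiles is a hypothetical function, ints stand in for FileMetaData pointers, the default operator< stands in for BySmallestKey, and the deleted-file filtering done by MaybeAddFile() is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Merges two sorted sequences the way Builder::SaveTo() walks them.
std::vector<int> MergeFiles(const std::vector<int>& base,
                            const std::vector<int>& added) {
  assert(std::is_sorted(base.begin(), base.end()));
  assert(std::is_sorted(added.begin(), added.end()));
  std::vector<int> out;
  out.reserve(base.size() + added.size());
  std::vector<int>::const_iterator base_iter = base.begin();
  const std::vector<int>::const_iterator base_end = base.end();
  for (int a : added) {
    // Emit every remaining base element that sorts before `a`...
    std::vector<int>::const_iterator bpos =
        std::upper_bound(base_iter, base_end, a);
    for (; base_iter != bpos; ++base_iter) out.push_back(*base_iter);
    // ...then `a` itself.
    out.push_back(a);
  }
  // Emit whatever is left of the base sequence.
  for (; base_iter != base_end; ++base_iter) out.push_back(*base_iter);
  return out;
}
```

Because base_iter only ever moves forward, each base element is visited once, and the upper_bound searches run over a shrinking range.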
