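LevelDB stores data following the LSM (log-structured merge) approach, which turns random writes into sequential ones. Each write is first recorded in an operation log; once the log record has been appended to disk, the write is reported as successful, and a background thread later applies the logged writes to the existing disk files to produce new on-disk data. In memory, LevelDB maintains a memtable, implemented as a skiplist, that holds the most recently written data. When the memtable reaches a certain size it becomes a read-only immutable memtable; new writes go to a fresh memtable, and the background thread writes the immutable memtable out to a data file.

LevelDB's data files are organized in levels. The default configuration has 7 levels (config::kNumLevels), namely level0, level1, …, level6. An immutable memtable is written to level0 by default, although, as the code below shows, PickLevelForMemTableOutput() may choose a deeper level. In every level other than level0, the key ranges of the data files are pairwise disjoint. When certain conditions are met, data files of level i are merged with data files of level i+1 to produce new level i+1 files. Both this merging of disk files and the dumping of the immutable memtable are called Compaction, and LevelDB performs them in a single dedicated background thread.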

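A Compaction is performed under the following conditions:

1. A new immutable memtable has been produced and must be written to a data file

2. The amount of data in some level has grown too large

3. Some file has had too many fruitless lookups (a lookup that searches file i for a key and does not find it there counts as a fruitless lookup for file i)

4. A manual compaction has been requested

When any of these conditions holds, the Compaction process is started. The rest of this article walks through that process in detail.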

The entry point for Compaction in LevelDB is DBImpl::MaybeScheduleCompaction in db_impl.cc under the db directory. This function may be called on any read or write operation. Its source is as follows:

void DBImpl::MaybeScheduleCompaction() {
  mutex_.AssertHeld();
  if (bg_compaction_scheduled_) {
    // Already scheduled
  } else if (shutting_down_.Acquire_Load()) {
    // DB is being deleted; no more background compactions
  } else if (!bg_error_.ok()) {
    // Already got an error; no more changes
  } else if (imm_ == NULL &&
             manual_compaction_ == NULL &&
             !versions_->NeedsCompaction()) {
    // No work to be done
  } else {
    bg_compaction_scheduled_ = true;
    env_->Schedule(&DBImpl::BGWork, this);  // Create a background task and schedule it
  }
}

It first calls NeedsCompaction(), defined in version_set.h under the db directory, to decide whether a compaction task needs to be started. Its source is as follows:

// Returns true iff some level needs a compaction.
bool NeedsCompaction() const {
  Version* v = current_;
  return (v->compaction_score_ >= 1) || (v->file_to_compact_ != NULL);
}

compaction_score_ is computed in version_set.cc as follows:

void VersionSet::Finalize(Version* v) {
  // Precomputed best level for next compaction
  int best_level = -1;
  double best_score = -1;

  for (int level = 0; level < config::kNumLevels-1; level++) {
    double score;
    if (level == 0) {
      // We treat level-0 specially by bounding the number of files
      // instead of number of bytes for two reasons:
      //
      // (1) With larger write-buffer sizes, it is nice not to do too
      // many level-0 compactions.
      //
      // (2) The files in level-0 are merged on every read and
      // therefore we wish to avoid too many files when the individual
      // file size is small (perhaps because of a small write-buffer
      // setting, or very high compression ratios, or lots of
      // overwrites/deletions).
      score = v->files_[level].size() /
          static_cast<double>(config::kL0_CompactionTrigger);
    } else {
      // Compute the ratio of current size to size limit.
      const uint64_t level_bytes = TotalFileSize(v->files_[level]);
      score = static_cast<double>(level_bytes) / MaxBytesForLevel(level);
    }

    if (score > best_score) {
      best_level = level;
      best_score = score;
    }
  }

  v->compaction_level_ = best_level;
  v->compaction_score_ = best_score;
}

Note that the best level for the next compaction is precomputed here as well.
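The scoring rule can be tried out in isolation. The following is a minimal sketch, not LevelDB's code: LevelInfo and PickBestLevel are hypothetical names, and the constants (a level-0 trigger of 4 files, a 10 MB limit for level 1 that grows tenfold per level) are assumed from LevelDB's usual defaults.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed constants, mirroring the LevelDB defaults discussed in the text.
constexpr int kNumLevels = 7;
constexpr int kL0CompactionTrigger = 4;

// Assumed size limit: 10 MB for level 1, growing tenfold per level.
double MaxBytesForLevel(int level) {
  double result = 10.0 * 1048576.0;
  while (level > 1) {
    result *= 10;
    level--;
  }
  return result;
}

// Hypothetical summary of one level: number of files and their total size.
struct LevelInfo {
  std::size_t num_files;
  std::uint64_t total_bytes;
};

// Returns {best_level, best_score}, following the structure of Finalize().
std::pair<int, double> PickBestLevel(const std::vector<LevelInfo>& levels) {
  assert(levels.size() == static_cast<std::size_t>(kNumLevels));
  int best_level = -1;
  double best_score = -1;
  for (int level = 0; level < kNumLevels - 1; level++) {
    double score;
    if (level == 0) {
      // Level 0 is scored by file count.
      score = levels[level].num_files /
              static_cast<double>(kL0CompactionTrigger);
    } else {
      // Other levels are scored by total bytes against the level limit.
      score = static_cast<double>(levels[level].total_bytes) /
              MaxBytesForLevel(level);
    }
    if (score > best_score) {
      best_level = level;
      best_score = score;
    }
  }
  return {best_level, best_score};
}
```

With 2 files in level 0 (score 0.5) and 15 MB in level 1 (score 1.5), level 1 is selected, and since its score is at least 1 a compaction would be needed.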

Once a compaction is known to be needed, PosixEnv::Schedule in env_posix.cc under the util directory is called to start the compaction process.

void PosixEnv::Schedule(void (*function)(void*), void* arg) {
  PthreadCall("lock", pthread_mutex_lock(&mu_));

  // Start background thread if necessary
  if (!started_bgthread_) {
    started_bgthread_ = true;
    PthreadCall(
        "create thread",
        pthread_create(&bgthread_, NULL, &PosixEnv::BGThreadWrapper, this));
  }

  // If the queue is currently empty, the background thread may currently be
  // waiting.
  if (queue_.empty()) {
    PthreadCall("signal", pthread_cond_signal(&bgsignal_));
  }

  // Add to priority queue
  queue_.push_back(BGItem());
  queue_.back().function = function;
  queue_.back().arg = arg;

  PthreadCall("unlock", pthread_mutex_unlock(&mu_));
}

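Schedule creates the background thread if it has not been started yet. It then signals the condition variable if the queue is empty (the background thread may be waiting on it) and pushes a new background task, a BGItem, onto the task queue. PosixEnv::BGThreadWrapper is the entry function of the background thread: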

static void* BGThreadWrapper(void* arg) {
  reinterpret_cast<PosixEnv*>(arg)->BGThread();
  return NULL;
}

BGThreadWrapper calls PosixEnv::BGThread, which repeatedly takes a task from the background task queue and runs it:

void PosixEnv::BGThread() {
  while (true) {
    // Wait until there is an item that is ready to run
    PthreadCall("lock", pthread_mutex_lock(&mu_));
    while (queue_.empty()) {
      PthreadCall("wait", pthread_cond_wait(&bgsignal_, &mu_));
    }

    void (*function)(void*) = queue_.front().function;
    void* arg = queue_.front().arg;
    queue_.pop_front();

    PthreadCall("unlock", pthread_mutex_unlock(&mu_));
    (*function)(arg);
  }
}
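The same pattern, a lazily started thread draining a locked queue, can be sketched with the C++ standard library. This is an illustration under assumptions rather than a port of PosixEnv: BackgroundWorker is a hypothetical class, and unlike LevelDB's background thread it supports shutdown so that the example terminates cleanly.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-threaded work queue, loosely modeled on
// PosixEnv::Schedule / PosixEnv::BGThread.
class BackgroundWorker {
 public:
  ~BackgroundWorker() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopping_ = true;
    }
    cv_.notify_all();
    if (thread_.joinable()) thread_.join();  // remaining tasks are drained first
  }

  void Schedule(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mu_);
    if (!started_) {  // start the background thread lazily
      started_ = true;
      thread_ = std::thread(&BackgroundWorker::Run, this);
    }
    // If the queue is empty, the worker may be waiting: wake it up.
    if (queue_.empty()) cv_.notify_one();
    queue_.push_back(std::move(task));
  }

 private:
  void Run() {
    while (true) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stopping and fully drained
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();  // run the task outside the lock, as BGThread does
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool started_ = false;
  bool stopping_ = false;
  std::thread thread_;
};
```

As in BGThread, the task runs after the queue lock has been released, so a long compaction does not block callers of Schedule.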

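The scheduling done by env_->Schedule(&DBImpl::BGWork, this) in DBImpl::MaybeScheduleCompaction was analyzed above. Next comes DBImpl::BGWork, which performs the actual background work. It is also located in db_impl.cc under the db directory.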

void DBImpl::BGWork(void* db) {
  reinterpret_cast<DBImpl*>(db)->BackgroundCall();
}

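DBImpl::BGWork calls DBImpl::BackgroundCall(). Since a finished compaction may leave too many files in some level, BackgroundCall() calls MaybeScheduleCompaction() again to decide whether another compaction is needed.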

void DBImpl::BackgroundCall() {
  MutexLock l(&mutex_);
  assert(bg_compaction_scheduled_);
  if (shutting_down_.Acquire_Load()) {
    // No more background work when shutting down.
  } else if (!bg_error_.ok()) {
    // No more background work after a background error.
  } else {
    BackgroundCompaction();
  }

  bg_compaction_scheduled_ = false;

  // Previous compaction may have produced too many files in a level,
  // so reschedule another compaction if needed.
  MaybeScheduleCompaction();
  bg_cv_.SignalAll();
}

DBImpl::BackgroundCall() calls BackgroundCompaction(), which handles three kinds of Compaction: compacting the memtable, a trivial Compaction (moving a file directly to the next level), and a general merge, which is implemented by DoCompactionWork().

void DBImpl::BackgroundCompaction() {
  mutex_.AssertHeld();

  if (imm_ != NULL) {
    CompactMemTable();  // 1. Compact the immutable memtable
    return;
  }

  Compaction* c;
  // manual_compaction_ is NULL by default, so is_manual is false by default
  bool is_manual = (manual_compaction_ != NULL);
  InternalKey manual_end;
  if (is_manual) {  // Build the Compaction object for a manual compaction
    ManualCompaction* m = manual_compaction_;
    c = versions_->CompactRange(m->level, m->begin, m->end);
    m->done = (c == NULL);
    if (c != NULL) {
      manual_end = c->input(0, c->num_input_files(0) - 1)->largest;
    }
    Log(options_.info_log,
        "Manual compaction at level-%d from %s .. %s; will stop at %s\n",
        m->level,
        (m->begin ? m->begin->DebugString().c_str() : "(begin)"),
        (m->end ? m->end->DebugString().c_str() : "(end)"),
        (m->done ? "(end)" : manual_end.DebugString().c_str()));
  } else {  // Pick the Compaction object for an automatic compaction
    c = versions_->PickCompaction();
  }

  Status status;
  if (c == NULL) {
    // Nothing to do
  } else if (!is_manual && c->IsTrivialMove()) {
    // 2. Trivial compaction: IsTrivialMove() returned true, so the file
    // is simply moved into level + 1
    // Move file to next level
    assert(c->num_input_files(0) == 1);
    FileMetaData* f = c->input(0, 0);
    c->edit()->DeleteFile(c->level(), f->number);
    c->edit()->AddFile(c->level() + 1, f->number, f->file_size,
                       f->smallest, f->largest);
    status = versions_->LogAndApply(c->edit(), &mutex_);
    if (!status.ok()) {
      RecordBackgroundError(status);
    }
    VersionSet::LevelSummaryStorage tmp;
    Log(options_.info_log, "Moved #%lld to level-%d %lld bytes %s: %s\n",
        static_cast<unsigned long long>(f->number),
        c->level() + 1,
        static_cast<unsigned long long>(f->file_size),
        status.ToString().c_str(),
        versions_->LevelSummary(&tmp));
  } else {  // 3. General merge
    CompactionState* compact = new CompactionState(c);
    status = DoCompactionWork(compact);  // Perform the compaction
    if (!status.ok()) {
      RecordBackgroundError(status);
    }
    CleanupCompaction(compact);
    c->ReleaseInputs();  // Drop the reference held on the input version
    DeleteObsoleteFiles();  // Delete files that are no longer needed
  }
  delete c;

  if (status.ok()) {
    // Done
  } else if (shutting_down_.Acquire_Load()) {
    // Ignore compaction errors found during shutting down
  } else {
    Log(options_.info_log,
        "Compaction error: %s", status.ToString().c_str());
  }

  if (is_manual) {
    ManualCompaction* m = manual_compaction_;  // Mark the manual compaction task as handled
    if (!status.ok()) {
      m->done = true;
    }
    if (!m->done) {
      // We only compacted part of the requested range. Update *m
      // to the range that is left to be compacted.
      m->tmp_storage = manual_end;
      m->begin = &m->tmp_storage;
    }
    manual_compaction_ = NULL;
  }
}

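The first line is mutex_.AssertHeld(). The default implementation of Mutex::AssertHeld is empty, yet it is called in many functions. Its purpose is explained as follows: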

As you have observed it does nothing in the default implementation. The function seems to be a placeholder for checking whether a particular thread holds a mutex and optionally abort if it doesn’t. This would be equivalent to the normal asserts we use for variables but applied on mutexes.
I think the reason it is not implemented yet is we don’t have an equivalent light weight function to assert whether a thread holds a lock in pthread_mutex_t used in the default implementation. Some platforms which has that capability could fill this implementation as part of porting process. Searching online I did find some implementation for this function in the windows port of leveldb. I can see one way to implement it using a wrapper class over pthread_mutex_t and setting some sort of a thread id variable to indicate which thread(s) currently holds the mutex, but it will have to be carefully implemented given the race conditions that can arise.
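The wrapper idea mentioned in the quote can be sketched as follows. This is only one possible shape and not code from LevelDB or any of its ports: CheckedMutex and HeldByCurrentThread are hypothetical names, and the sketch assumes std::atomic can be instantiated with std::thread::id, which the standard allows because that type is trivially copyable.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical wrapper that records the owning thread so that
// AssertHeld() has something to check.
class CheckedMutex {
 public:
  void Lock() {
    mu_.lock();
    // Only the thread that holds the lock ever stores its own id here.
    owner_.store(std::this_thread::get_id(), std::memory_order_relaxed);
  }

  void Unlock() {
    // Clear the owner before releasing, while we still hold the lock.
    owner_.store(std::thread::id(), std::memory_order_relaxed);
    mu_.unlock();
  }

  bool HeldByCurrentThread() const {
    return owner_.load(std::memory_order_relaxed) ==
           std::this_thread::get_id();
  }

  // Aborts in debug builds when the calling thread does not hold the lock.
  void AssertHeld() const { assert(HeldByCurrentThread()); }

 private:
  std::mutex mu_;
  std::atomic<std::thread::id> owner_{std::thread::id()};
};
```

Relaxed ordering is enough for this check: a thread can only read back its own id if its most recent store was the one made in Lock(), that is, while it still holds the mutex.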

Compacting the memtable

Compaction first checks imm_, so that a full memtable is written to an on-disk sstable file promptly. The memtable compaction is done by DBImpl::CompactMemTable():

void DBImpl::CompactMemTable() {
  mutex_.AssertHeld();
  assert(imm_ != NULL);  // imm_ must not be NULL
  VersionEdit edit;
  Version* base = versions_->current();
  base->Ref();
  // Convert the memtable into an .sst file and record the new file in edit
  Status s = WriteLevel0Table(imm_, &edit, base);
  base->Unref();
  if (s.ok()) {
    edit.SetPrevLogNumber(0);
    edit.SetLogNumber(logfile_number_);  // Earlier logs no longer needed
    s = versions_->LogAndApply(&edit, &mutex_);  // Apply the changes in edit to produce a new version
  }

  if (s.ok()) {
    // Commit to the new state
    imm_->Unref();
    imm_ = NULL;
    has_imm_.Release_Store(NULL);
    DeleteObsoleteFiles();
  } else {
    RecordBackgroundError(s);
  }
}

CompactMemTable() mainly calls two functions: WriteLevel0Table() and versions_->LogAndApply().

CompactMemTable() first calls WriteLevel0Table(), whose source is as follows:

Status DBImpl::WriteLevel0Table(MemTable* mem, VersionEdit* edit,
                                Version* base) {
  mutex_.AssertHeld();
  FileMetaData meta;
  meta.number = versions_->NewFileNumber();  // Number of the .sst file about to be created
  pending_outputs_.insert(meta.number);
  Iterator* iter = mem->NewIterator();  // Iterates over the data in the memtable

  Status s;
  {
    mutex_.Unlock();
    // Create the .sst file and record its information in meta
    s = BuildTable(dbname_, env_, options_, table_cache_, iter, &meta);
    mutex_.Lock();
  }

  delete iter;  // iter must be deleted once it is no longer needed
  pending_outputs_.erase(meta.number);

  int level = 0;
  if (s.ok() && meta.file_size > 0) {
    const Slice min_user_key = meta.smallest.user_key();
    const Slice max_user_key = meta.largest.user_key();
    if (base != NULL) {
      // Choose a suitable level for the output file
      level = base->PickLevelForMemTableOutput(min_user_key, max_user_key);
    }
    // Add the generated .sst file to that level
    edit->AddFile(level, meta.number, meta.file_size,
                  meta.smallest, meta.largest);
  }
  return s;
}

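WriteLevel0Table() first calls BuildTable() to write all the data in the immutable memtable to a single .sst file, and records the file's information (file number, key range, file size) in the variable meta. Because the memtable is based on a skiplist and is therefore sorted, keys are written to the .sst file in ascending order. Note that converting the memtable to an SSTable writes every record, deletion markers included; the actual deletion happens later, during compactions into deeper levels. As a result, level 0 may contain several records with the same key.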

Status BuildTable(const std::string& dbname,
                  Env* env,
                  const Options& options,
                  TableCache* table_cache,
                  Iterator* iter,
                  FileMetaData* meta) {
  Status s;
  meta->file_size = 0;
  iter->SeekToFirst();
  std::string fname = TableFileName(dbname, meta->number);  // Name of the new table file
  if (iter->Valid()) {
    WritableFile* file;
    s = env->NewWritableFile(fname, &file);  // Create the new table file to write into
    if (!s.ok()) {
      return s;
    }
    TableBuilder* builder = new TableBuilder(options, file);  // Create the TableBuilder
    meta->smallest.DecodeFrom(iter->key());
    for (; iter->Valid(); iter->Next()) {  // Add each key/value pair to the builder
      Slice key = iter->key();
      meta->largest.DecodeFrom(key);
      builder->Add(key, iter->value());
    }

    // Finish and check for builder errors
    s = builder->Finish();  // Writes the filter, metaindex and index blocks and the footer
    if (s.ok()) {
      meta->file_size = builder->FileSize();
      assert(meta->file_size > 0);
    }
    delete builder;

    // Finish and check for file errors
    if (s.ok()) {
      s = file->Sync();  // Force the file contents to disk
    }
    if (s.ok()) {
      s = file->Close();
    }
    delete file;
    file = NULL;

    if (s.ok()) {
      // Verify that the table is usable
      Iterator* it = table_cache->NewIterator(ReadOptions(),
                                              meta->number,
                                              meta->file_size);  // Opens the table and adds it to the table cache
      s = it->status();
      delete it;
    }
  }

  // Check for input iterator errors
  if (!iter->status().ok()) {
    s = iter->status();
  }

  if (s.ok() && meta->file_size > 0) {
    // Keep it
  } else {
    env->DeleteFile(fname);
  }
  return s;
}

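This function uses iter to add key/value pairs to a TableBuilder, writes the file and syncs it, and then opens the new table through the table cache so that it is cached for later use.

The table_builder files are in the table directory. TableBuilder::Add works as follows: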

void TableBuilder::Add(const Slice& key, const Slice& value) {
  Rep* r = rep_;
  assert(!r->closed);
  if (!ok()) return;
  if (r->num_entries > 0) {
    assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
  }

  if (r->pending_index_entry) {
    // A new data block is starting: emit the index entry for the previous block
    assert(r->data_block.empty());
    r->options.comparator->FindShortestSeparator(&r->last_key, key);
    std::string handle_encoding;
    r->pending_handle.EncodeTo(&handle_encoding);
    r->index_block.Add(r->last_key, Slice(handle_encoding));
    r->pending_index_entry = false;
  }

  // Feed the key to the filter block
  if (r->filter_block != NULL) {
    r->filter_block->AddKey(key);
  }

  // Add the entry to the data block builder
  r->last_key.assign(key.data(), key.size());
  r->num_entries++;
  r->data_block.Add(key, value);

  // If the block reaches the configured size (4 KB by default), finish it,
  // write it out and start a new block.
  const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
  if (estimated_block_size >= r->options.block_size) {
    Flush();
  }
}
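The index entry written when a new block starts uses a key that separates the last key of the previous block from the first key of the next one, shortened where possible. The sketch below shows how such a separator can be computed for plain byte strings. It is written in the spirit of the bytewise comparator and is not the code path taken above, where options.comparator is the internal-key comparator; ShortestSeparator is a hypothetical free function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Shortens *start to a string s with *start <= s < limit when a simple
// shortening exists; otherwise leaves *start unchanged.
void ShortestSeparator(std::string* start, const std::string& limit) {
  // Find the length of the common prefix.
  const std::size_t min_length = std::min(start->size(), limit.size());
  std::size_t diff_index = 0;
  while (diff_index < min_length &&
         (*start)[diff_index] == limit[diff_index]) {
    diff_index++;
  }
  if (diff_index >= min_length) {
    return;  // one string is a prefix of the other: nothing to shorten
  }
  const std::uint8_t diff_byte =
      static_cast<std::uint8_t>((*start)[diff_index]);
  if (diff_byte < 0xff &&
      diff_byte + 1 < static_cast<std::uint8_t>(limit[diff_index])) {
    // Bump the first differing byte and cut everything after it.
    (*start)[diff_index]++;
    start->resize(diff_index + 1);
    assert(*start < limit);
  }
}
```

For example, with a last key of "abcdefghij" and a next key of "abzz", the separator becomes "abd", which keeps the index block small while still routing lookups to the right data block.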

TableBuilder::WriteBlock, which writes a block to the file, works as follows:

void TableBuilder::WriteBlock(BlockBuilder* block, BlockHandle* handle) {
  // File format contains a sequence of blocks where each block has:
  //    block_data: uint8[n]
  //    type: uint8
  //    crc: uint32
  assert(ok());
  Rep* r = rep_;
  Slice raw = block->Finish();  // Get the serialized block contents

  Slice block_contents;
  // Read the configured compression type
  CompressionType type = r->options.compression;
  // TODO(postrelease): Support more compression options: zlib?
  switch (type) {
    case kNoCompression:
      block_contents = raw;
      break;

    case kSnappyCompression: {
      std::string* compressed = &r->compressed_output;
      if (port::Snappy_Compress(raw.data(), raw.size(), compressed) &&
          compressed->size() < raw.size() - (raw.size() / 8u)) {
        block_contents = *compressed;
      } else {
        // Snappy not supported, or compressed less than 12.5%, so just
        // store uncompressed form
        block_contents = raw;
        type = kNoCompression;
      }
      break;
    }
  }
  // After optional compression, write block data + type + crc32 to the file
  WriteRawBlock(block_contents, type, handle);
  r->compressed_output.clear();
  block->Reset();
}
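The 12.5% rule in the Snappy branch comes down to one comparison. A small sketch with a hypothetical helper name makes the threshold concrete:

```cpp
#include <cstddef>

// Returns true when the compressed form saves more than one eighth
// (12.5%) of the raw size and is therefore worth storing.
bool WorthStoringCompressed(std::size_t raw_size,
                            std::size_t compressed_size) {
  return compressed_size < raw_size - (raw_size / 8u);
}
```

For a 4096-byte block the threshold is 4096 - 512 = 3584 bytes: a compressed size of 3583 bytes is stored compressed, while 3584 bytes or more falls back to the uncompressed form.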

TableBuilder::Finish is defined as follows:

Status TableBuilder::Finish() {
  Rep* r = rep_;
  Flush();  // Write out the pending data block, which may not be full
  assert(!r->closed);
  r->closed = true;

  BlockHandle filter_block_handle, metaindex_block_handle, index_block_handle;

  // Write filter block
  if (ok() && r->filter_block != NULL) {
    WriteRawBlock(r->filter_block->Finish(), kNoCompression,
                  &filter_block_handle);
  }

  // Write metaindex block
  if (ok()) {
    BlockBuilder meta_index_block(&r->options);
    if (r->filter_block != NULL) {
      // Add mapping from "filter.Name" to location of filter data
      std::string key = "filter.";
      key.append(r->options.filter_policy->Name());
      std::string handle_encoding;
      filter_block_handle.EncodeTo(&handle_encoding);
      meta_index_block.Add(key, handle_encoding);
    }

    // TODO(postrelease): Add stats and other meta blocks
    WriteBlock(&meta_index_block, &metaindex_block_handle);
  }

  // Write index block
  if (ok()) {
    if (r->pending_index_entry) {
      r->options.comparator->FindShortSuccessor(&r->last_key);
      std::string handle_encoding;
      r->pending_handle.EncodeTo(&handle_encoding);
      r->index_block.Add(r->last_key, Slice(handle_encoding));
      r->pending_index_entry = false;
    }
    WriteBlock(&r->index_block, &index_block_handle);
  }

  // Write footer
  if (ok()) {
    Footer footer;
    footer.set_metaindex_handle(metaindex_block_handle);
    footer.set_index_handle(index_block_handle);
    std::string footer_encoding;
    footer.EncodeTo(&footer_encoding);
    r->status = r->file->Append(footer_encoding);
    if (r->status.ok()) {
      r->offset += footer_encoding.size();
    }
  }
  return r->status;
}

The Flush() called in the code above is as follows:

void TableBuilder::Flush() {
  Rep* r = rep_;
  assert(!r->closed);
  if (!ok()) return;
  if (r->data_block.empty()) return;
  assert(!r->pending_index_entry);
  WriteBlock(&r->data_block, &r->pending_handle);
  if (ok()) {
    r->pending_index_entry = true;
    r->status = r->file->Flush();
  }
  if (r->filter_block != NULL) {
    r->filter_block->StartBlock(r->offset);
  }
}

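WriteLevel0Table() then calls PickLevelForMemTableOutput() to choose a suitable level for the output file of the memtable compaction, and calls edit->AddFile() to add the generated .sst file to that level.

After WriteLevel0Table() returns, CompactMemTable() calls versions_->LogAndApply(), defined in version_set.cc under the db directory, to derive a new version from the current version and the changes in edit. versions_->LogAndApply() is analyzed later.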

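Trivial Compaction

As shown earlier, is_manual is false by default, so PickCompaction() is called to choose the level to compact and the corresponding input files. When c->IsTrivialMove() holds, the file is simply moved to the next level.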

  c = versions_->PickCompaction();

  Status status;
  if (c == NULL) {
    // Nothing to do
  } else if (!is_manual && c->IsTrivialMove()) {
    // Move file to next level
    assert(c->num_input_files(0) == 1);
    FileMetaData* f = c->input(0, 0);
    c->edit()->DeleteFile(c->level(), f->number);  // Remove the file from this level
    c->edit()->AddFile(c->level() + 1, f->number, f->file_size,  // Add the file to the next level
                       f->smallest, f->largest);
    status = versions_->LogAndApply(c->edit(), &mutex_);  // Apply the change and create a new Version
  }

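First, VersionSet::PickCompaction() in version_set.cc under the db directory prepares the input for the upcoming Compaction. As NeedsCompaction() above shows, there are two automatic triggers: a level whose compaction score reaches 1 (too many files in level 0, or too many bytes in any other level), and a file whose number of seeks exceeds its allowance. When both apply, the size-triggered case takes priority.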

Compaction* VersionSet::PickCompaction() {
  Compaction* c;
  int level;

  const bool size_compaction = (current_->compaction_score_ >= 1);  // Some level holds too much data
  const bool seek_compaction = (current_->file_to_compact_ != NULL);  // Some file has been sought too often
  if (size_compaction) {  // Size-triggered compaction takes priority
    level = current_->compaction_level_;  // The level to compact
    c = new Compaction(level);
    // Each level has a compact_pointer_ recording the key at which the next
    // compaction should start, so compactions rotate through the key space
    for (size_t i = 0; i < current_->files_[level].size(); i++) {  // Pick a suitable file from the level
      FileMetaData* f = current_->files_[level][i];  // The i-th file in this level
      if (compact_pointer_[level].empty() ||  // No start key recorded yet: any file will do
          icmp_.Compare(f->largest.Encode(), compact_pointer_[level]) > 0) {  // Or f's largest key lies past the start key
        c->inputs_[0].push_back(f);  // This file becomes the level input
        break;
      }
    }
    if (c->inputs_[0].empty()) {  // Nothing picked: wrap around to the first file of the level
      c->inputs_[0].push_back(current_->files_[level][0]);
    }
  } else if (seek_compaction) {  // Otherwise handle the seek-triggered case
    level = current_->file_to_compact_level_;
    c = new Compaction(level);
    c->inputs_[0].push_back(current_->file_to_compact_);  // The file to compact is the level input
  } else {
    return NULL;
  }

  c->input_version_ = current_;
  c->input_version_->Ref();

  // Keys may repeat across level-0 files, so their ranges can overlap each
  // other: collect every overlapping file and compact them together
  if (level == 0) {
    InternalKey smallest, largest;
    GetRange(c->inputs_[0], &smallest, &largest);  // Key range of the level inputs
    current_->GetOverlappingInputs(0, &smallest, &largest, &c->inputs_[0]);
    assert(!c->inputs_[0].empty());
  }
  SetupOtherInputs(c);  // Collect the level+1 inputs
  return c;
}
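The round-robin choice driven by compact_pointer_ can be illustrated on its own. In this sketch, FileMeta and PickFileIndex are hypothetical, and plain strings compared bytewise stand in for internal keys and the internal-key comparator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified file metadata.
struct FileMeta {
  int number;
  std::string smallest;
  std::string largest;
};

// Returns the index of the first file whose largest key lies beyond the
// compaction pointer, wrapping to index 0 when no such file exists.
// `files` must be non-empty and sorted by key range.
std::size_t PickFileIndex(const std::vector<FileMeta>& files,
                          const std::string& compact_pointer) {
  assert(!files.empty());
  for (std::size_t i = 0; i < files.size(); i++) {
    if (compact_pointer.empty() || files[i].largest > compact_pointer) {
      return i;
    }
  }
  return 0;  // wrap around to the beginning of the key space
}
```

With files covering [a, f], [g, m] and [n, z], a pointer of "f" selects the second file, "m" selects the third, and "z" wraps back to the first, so repeated compactions sweep across the whole level instead of hitting the same range each time.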

Next it checks whether this is a trivial Compaction. If so, the file only needs to be moved from level to level+1:

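bool Compaction::IsTrivialMove() const {
  return (num_input_files(0) == 1 &&  // Exactly one input file in level
          num_input_files(1) == 0 &&  // No input file in level+1
          // The overlapping level+2 files stay under the limit; otherwise the
          // later merge of the moved file would become very expensive
          TotalFileSize(grandparents_) <= kMaxGrandParentOverlapBytes);
}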
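The three conditions can be exercised with a small stand-alone sketch. CompactionInputs is a hypothetical structure that holds only file sizes, and the 20 MB grandparent limit is an assumption (ten times a 2 MB target file size, as in older LevelDB releases).

```cpp
#include <cstdint>
#include <vector>

// Assumed limit: 10 x a 2 MB target file size.
constexpr std::uint64_t kMaxGrandParentOverlapBytes = 10 * 2 * 1048576;

// Hypothetical description of a compaction, holding file sizes only.
struct CompactionInputs {
  std::vector<std::uint64_t> level_files;        // inputs from level L
  std::vector<std::uint64_t> next_level_files;   // inputs from level L+1
  std::vector<std::uint64_t> grandparent_files;  // overlapping files in L+2
};

// A move is trivial when a single file can slide down one level without
// overlapping anything there and without overlapping too much of L+2.
bool IsTrivialMove(const CompactionInputs& c) {
  std::uint64_t grandparent_bytes = 0;
  for (std::uint64_t size : c.grandparent_files) grandparent_bytes += size;
  return c.level_files.size() == 1 && c.next_level_files.empty() &&
         grandparent_bytes <= kMaxGrandParentOverlapBytes;
}
```

The grandparent check is what keeps a cheap move from creating an expensive compaction later: a file that overlaps 30 MB of level L+2 data is merged normally rather than moved.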

Finally, the Compaction is completed:

c->edit()->DeleteFile(c->level(), f->number);
c->edit()->AddFile(c->level() + 1, f->number, f->file_size, f->smallest, f->largest);
status = versions_->LogAndApply(c->edit(), &mutex_);

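General merge

A general merge is done by DBImpl::DoCompactionWork(). Its compact argument wraps the Compaction obtained from VersionSet::PickCompaction(), just as in the trivial case. Records with the same key may exist in different levels, but with different sequence numbers (seq). As analyzed earlier, the newest data lives in the lower-numbered levels, so its seq is always greater than that of records with the same key in level+1. When several records share a key, only the first one encountered needs to be kept and the rest can be dropped. Level 0 may also hold records with the same key and different seq.

The newer the data, the larger its seq, and records are ordered by ascending user_key and then by descending seq, so records with the same user_key are adjacent, newest first. During a compaction only the first record for each user_key needs to be kept; later records with the same user_key can be dropped, provided no snapshot still needs them (the code compares against smallest_snapshot). Consequently, when no snapshot is live, the output files in level+1 contain no duplicate keys. Deletions also take effect here: a deletion marker is dropped rather than written to the output once its sequence is at or below smallest_snapshot and no deeper level can contain the key (IsBaseLevelForKey()).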

Status DBImpl::DoCompactionWork(CompactionState* compact) {
  if (snapshots_.empty()) {
    compact->smallest_snapshot = versions_->LastSequence();
  } else {
    compact->smallest_snapshot = snapshots_.oldest()->number_;
  }
  mutex_.Unlock();
  // Build an iterator over all the data to be compacted
  Iterator* input = versions_->MakeInputIterator(compact->compaction);  // Walks every input file
  input->SeekToFirst();
  Status status;
  ParsedInternalKey ikey;
  std::string current_user_key;
  bool has_current_user_key = false;
  SequenceNumber last_sequence_for_key = kMaxSequenceNumber;
  for (; input->Valid() && !shutting_down_.Acquire_Load(); ) {
    if (has_imm_.NoBarrier_Load() != NULL) {  // Flushing the immutable memtable has the highest priority
      mutex_.Lock();
      if (imm_ != NULL) {  // imm_ is not NULL: compact the memtable first
        CompactMemTable();
        bg_cv_.SignalAll();  // Wakeup MakeRoomForWrite() if necessary
      }
      mutex_.Unlock();
    }

    Slice key = input->key();
    // Close the current output file early if needed, so that it does not
    // overlap too many level N+2 files
    if (compact->compaction->ShouldStopBefore(key) &&
        compact->builder != NULL) {
      status = FinishCompactionOutputFile(compact, input);
    }

    bool drop = false;
    if (!ParseInternalKey(key, &ikey)) {
      current_user_key.clear();
      has_current_user_key = false;
      last_sequence_for_key = kMaxSequenceNumber;
    } else {
      if (!has_current_user_key ||
          user_comparator()->Compare(ikey.user_key,
                                     Slice(current_user_key)) != 0) {
        // First occurrence of this user key (records may share a key but
        // differ in seq)
        current_user_key.assign(ikey.user_key.data(), ikey.user_key.size());
        has_current_user_key = true;
        last_sequence_for_key = kMaxSequenceNumber;  // Sentinel: no earlier record for this key
      }

      if (last_sequence_for_key <= compact->smallest_snapshot) {
        // A newer record for the same user key is visible to every snapshot,
        // so this one is hidden
        drop = true;    // (A)
      } else if (ikey.type == kTypeDeletion &&  // This is a deletion marker
                 ikey.sequence <= compact->smallest_snapshot &&  // No snapshot is older than it
                 compact->compaction->IsBaseLevelForKey(ikey.user_key)) {  // No deeper level holds this key
        drop = true;  // The deletion marker itself can be dropped
      }
      last_sequence_for_key = ikey.sequence;  // Remembered for the next record with the same key
    }

    if (!drop) {  // The record is kept
      if (compact->builder == NULL) {
        status = OpenCompactionOutputFile(compact);  // Open a new .sst output file if necessary
      }
      if (compact->builder->NumEntries() == 0) {
        compact->current_output()->smallest.DecodeFrom(key);
      }
      compact->current_output()->largest.DecodeFrom(key);
      compact->builder->Add(key, input->value());  // Write the record to the .sst file

      if (compact->builder->FileSize() >=
          compact->compaction->MaxOutputFileSize()) {  // The output file reached its maximum size
        status = FinishCompactionOutputFile(compact, input);  // Finish this output file
      }
    }
    input->Next();  // Move on to the next record
  }

  if (status.ok() && compact->builder != NULL) {
    status = FinishCompactionOutputFile(compact, input);
  }
  if (status.ok()) {
    status = input->status();
  }
  delete input;
  input = NULL;

  // Update the compaction statistics.
  // start_micros and imm_micros are set in parts of the function omitted from this excerpt.
  CompactionStats stats;
  stats.micros = env_->NowMicros() - start_micros - imm_micros;
  for (int which = 0; which < 2; which++) {
    for (int i = 0; i < compact->compaction->num_input_files(which); i++) {
      stats.bytes_read += compact->compaction->input(which, i)->file_size;
    }
  }
  for (size_t i = 0; i < compact->outputs.size(); i++) {
    stats.bytes_written += compact->outputs[i].file_size;
  }

  mutex_.Lock();
  stats_[compact->compaction->level() + 1].Add(stats);

  if (status.ok()) {
    status = InstallCompactionResults(compact);  // Install the results of the merge
  }
  if (!status.ok()) {
    RecordBackgroundError(status);
  }
  VersionSet::LevelSummaryStorage tmp;
  Log(options_.info_log,
      "compacted to: %s", versions_->LevelSummary(&tmp));
  return status;
}
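The drop rules in the loop above can be isolated into a small filter over already-sorted records. The sketch below is a simplification under stated assumptions: Entry and FilterEntries are hypothetical, the maximum uint64_t value stands in for kMaxSequenceNumber, and a single is_base_level flag replaces the per-key IsBaseLevelForKey() check.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

enum class ValueType { kDeletion, kValue };

// Hypothetical flattened record. Input must be sorted by user_key ascending
// and then by sequence descending, as the merging iterator yields it.
struct Entry {
  std::string user_key;
  std::uint64_t sequence;
  ValueType type;
};

// Returns the entries that survive the compaction.
std::vector<Entry> FilterEntries(const std::vector<Entry>& input,
                                 std::uint64_t smallest_snapshot,
                                 bool is_base_level) {
  const std::uint64_t kMaxSeq = std::numeric_limits<std::uint64_t>::max();
  std::vector<Entry> kept;
  bool has_current = false;
  std::string current_user_key;
  std::uint64_t last_sequence_for_key = kMaxSeq;
  for (const Entry& e : input) {
    if (!has_current || e.user_key != current_user_key) {
      current_user_key = e.user_key;  // first occurrence of this user key
      has_current = true;
      last_sequence_for_key = kMaxSeq;
    }
    bool drop = false;
    if (last_sequence_for_key <= smallest_snapshot) {
      drop = true;  // hidden by a newer entry that every snapshot can see
    } else if (e.type == ValueType::kDeletion &&
               e.sequence <= smallest_snapshot && is_base_level) {
      drop = true;  // tombstone with nothing left underneath to hide
    }
    last_sequence_for_key = e.sequence;
    if (!drop) kept.push_back(e);
  }
  return kept;
}
```

Take the records a@9, a@5, a@3, b@7 (deletion) and b@2. With smallest_snapshot = 10 at the base level, only a@9 survives. With smallest_snapshot = 6, a@5 must be kept because the snapshot at 6 cannot see a@9, and the deletion b@7 as well as b@2 are kept too, since the snapshot still needs b@2.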

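The records that are kept are first written to .sst files, with the related information stored in the variable compact. Then InstallCompactionResults() adds the changes to the VersionEdit and calls LogAndApply() to produce the new version.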

Status DBImpl::InstallCompactionResults(CompactionState* compact) {
  mutex_.AssertHeld();
  Log(options_.info_log, "Compacted %d@%d + %d@%d files => %lld bytes",
      compact->compaction->num_input_files(0),
      compact->compaction->level(),
      compact->compaction->num_input_files(1),
      compact->compaction->level() + 1,
      static_cast<long long>(compact->total_bytes));

  // Add compaction outputs
  compact->compaction->AddInputDeletions(compact->compaction->edit());
  const int level = compact->compaction->level();
  for (size_t i = 0; i < compact->outputs.size(); i++) {
    const CompactionState::Output& out = compact->outputs[i];
    compact->compaction->edit()->AddFile(
        level + 1,
        out.number, out.file_size, out.smallest, out.largest);
  }
  return versions_->LogAndApply(compact->compaction->edit(), &mutex_);
}

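LogAndApply()

In all three kinds of Compaction above, once every change to the current version has been collected in a VersionEdit, VersionSet::LogAndApply() is called to apply the changes and create a new version. The edit holds the files to be deleted from and added to level and level+1.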

Status VersionSet::LogAndApply(VersionEdit* edit, port::Mutex* mu) {
  Version* v = new Version(this);  // Create a new Version
  {
    Builder builder(this, current_);  // Create a Builder based on the current Version
    builder.Apply(edit);  // Record the files that edit adds and deletes in the builder
    builder.SaveTo(v);  // Save the result into the new Version
  }
  Finalize(v);  // Compute the compaction score and best level of the new Version

  std::string new_manifest_file;
  Status s;
  if (descriptor_log_ == NULL) {  // Only entered on the first call
    assert(descriptor_file_ == NULL);
    new_manifest_file = DescriptorFileName(dbname_, manifest_file_number_);  // Create a new Manifest file
    edit->SetNextFile(next_file_number_);
    s = env_->NewWritableFile(new_manifest_file, &descriptor_file_);
    if (s.ok()) {
      descriptor_log_ = new log::Writer(descriptor_file_);
      s = WriteSnapshot(descriptor_log_);  // Snapshot: a complete record of the database state at startup
    }
  }

  {
    mu->Unlock();
    if (s.ok()) {
      std::string record;
      edit->EncodeTo(&record);
      s = descriptor_log_->AddRecord(record);  // Log the change to the Manifest file
      if (s.ok()) {
        s = descriptor_file_->Sync();
      }
    }
    if (s.ok() && !new_manifest_file.empty()) {
      s = SetCurrentFile(env_, dbname_, manifest_file_number_);
    }
    mu->Lock();
  }

  if (s.ok()) {
    AppendVersion(v);  // Append the new Version to the tail of the doubly linked list of Versions
    log_number_ = edit->log_number_;
    prev_log_number_ = edit->prev_log_number_;
  }
  return s;
}

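To restore the database to its previous state after a restart, the history of changes to the database must be recorded; this information is stored in the Manifest file. To save space and time, LevelDB writes a complete description of the database once at startup (WriteSnapshot()) and afterwards records only the changes, that is, the contents of each VersionEdit (descriptor_log_->AddRecord()). On recovery, replaying the Manifest step by step restores the last state.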
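The snapshot-plus-edits scheme amounts to replaying a log of changes. A minimal sketch, assuming a heavily simplified edit that carries only (level, file number) pairs; Edit, LevelFiles and Replay are hypothetical names, not LevelDB types:

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified edit: files deleted from and added to levels.
struct Edit {
  std::vector<std::pair<int, int>> deleted_files;  // (level, file number)
  std::vector<std::pair<int, int>> new_files;      // (level, file number)
};

using LevelFiles = std::map<int, std::set<int>>;  // level -> file numbers

// Rebuilds the current state by applying the logged edits in order.
// The first edit plays the role of the full snapshot.
LevelFiles Replay(const std::vector<Edit>& manifest) {
  LevelFiles state;
  for (const Edit& edit : manifest) {
    for (const auto& d : edit.deleted_files) state[d.first].erase(d.second);
    for (const auto& n : edit.new_files) state[n.first].insert(n.second);
  }
  return state;
}
```

For instance, a snapshot listing file 5 in level 0 and file 3 in level 1, followed by an edit that records a trivial move of file 5 to level 1, replays to an empty level 0 and a level 1 containing files 3 and 5.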

VersionSet::LogAndApply first creates a new Version, then calls builder.Apply(edit) to record the numbers of all files that edit deletes or adds. Its source is as follows:

  // Apply all of the edits in *edit to the current state.
  void Apply(VersionEdit* edit) {
    // Update the key at which the next compaction of each level starts
    for (size_t i = 0; i < edit->compact_pointers_.size(); i++) {
      const int level = edit->compact_pointers_[i].first;
      vset_->compact_pointer_[level] =
          edit->compact_pointers_[i].second.Encode().ToString();
    }

    // Record every file to be deleted in levels_[level].deleted_files
    const VersionEdit::DeletedFileSet& del = edit->deleted_files_;
    for (VersionEdit::DeletedFileSet::const_iterator iter = del.begin();
         iter != del.end();
         ++iter) {
      const int level = iter->first;
      const uint64_t number = iter->second;
      levels_[level].deleted_files.insert(number);
    }

    // Record every newly added file in levels_[level].added_files
    for (size_t i = 0; i < edit->new_files_.size(); i++) {
      const int level = edit->new_files_[i].first;
      FileMetaData* f = new FileMetaData(edit->new_files_[i].second);
      f->refs = 1;
      f->allowed_seeks = (f->file_size / 16384);
      if (f->allowed_seeks < 100) f->allowed_seeks = 100;
      levels_[level].deleted_files.erase(f->number);
      levels_[level].added_files->insert(f);
    }
  }
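The seek allowance assigned in the last loop feeds the seek-triggered compaction: each file is granted one seek per 16 KB of data, with a floor of 100. The same arithmetic as a stand-alone helper (AllowedSeeks is a hypothetical name):

```cpp
#include <cstdint>

// One seek is budgeted per 16 KB of file data, with a floor of 100 seeks.
int AllowedSeeks(std::uint64_t file_size) {
  int allowed = static_cast<int>(file_size / 16384);
  if (allowed < 100) allowed = 100;
  return allowed;
}
```

A 2 MB file gets 2097152 / 16384 = 128 seeks, while a 1 MB file would get 64 and is therefore raised to the floor of 100.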

VersionSet::LogAndApply then calls builder.SaveTo(v) to save the changes into the new Version. Its source is as follows:

  void SaveTo(Version* v) {
    BySmallestKey cmp;
    cmp.internal_comparator = &vset_->icmp_;
    for (int level = 0; level < config::kNumLevels; level++) {
      // The .sst files this level already has in the current Version
      const std::vector<FileMetaData*>& base_files = base_->files_[level];
      std::vector<FileMetaData*>::const_iterator base_iter = base_files.begin();
      std::vector<FileMetaData*>::const_iterator base_end = base_files.end();
      const FileSet* added = levels_[level].added_files;  // Files newly added to this level
      v->files_[level].reserve(base_files.size() + added->size());
      for (FileSet::const_iterator added_iter = added->begin();
           added_iter != added->end();
           ++added_iter) {
        // Add all base files that sort before the added file (by smallest key)
        for (std::vector<FileMetaData*>::const_iterator bpos
                 = std::upper_bound(base_iter, base_end, *added_iter, cmp);
             base_iter != bpos;
             ++base_iter) {
          MaybeAddFile(v, level, *base_iter);
        }
        MaybeAddFile(v, level, *added_iter);  // Then add the new file itself
      }
      for (; base_iter != base_end; ++base_iter) {
        MaybeAddFile(v, level, *base_iter);  // Finally add the remaining base files
      }
    }
  }

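bpos = std::upper_bound(base_iter, base_end, *added_iter, cmp) returns the first iterator in [base_iter, base_end) whose element sorts after *added_iter. The comparator is BySmallestKey, so files are ordered by their smallest key (with ties broken by file number), not by file number alone. As an example, suppose the existing files have smallest keys 1, 3, 4, 6, 8 and the added files have smallest keys 2, 5, 7. In the first iteration bpos points at 3, so base_iter advances over a single element and file 1 is added to the new Version. For each added file, the base files that sort before it are added first, then the file itself, and the loop moves on to the next added file. The final order is 1, 2, 3, 4, 5, 6, 7, 8, so the .sst files of each level end up sorted. This yields the complete file list for every level of the new Version.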
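The merge can be reproduced with plain integers. In this sketch MergeSorted is a hypothetical helper, integers stand in for the files' smallest keys, and the MaybeAddFile() filtering of deleted files is left out.

```cpp
#include <algorithm>
#include <vector>

// Merges two sorted sequences the way SaveTo() walks them: before each added
// element, emit every base element that does not sort after it.
std::vector<int> MergeSorted(const std::vector<int>& base,
                             const std::vector<int>& added) {
  std::vector<int> out;
  out.reserve(base.size() + added.size());
  auto base_iter = base.begin();
  const auto base_end = base.end();
  for (int a : added) {
    // First base element that sorts after `a`.
    const auto bpos = std::upper_bound(base_iter, base_end, a);
    for (; base_iter != bpos; ++base_iter) out.push_back(*base_iter);
    out.push_back(a);
  }
  // Whatever is left of the base sequence goes last.
  for (; base_iter != base_end; ++base_iter) out.push_back(*base_iter);
  return out;
}
```

Calling MergeSorted with {1, 3, 4, 6, 8} and {2, 5, 7} yields 1 through 8 in order. Because base_iter only ever moves forward, each base element is emitted exactly once.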

