LevelDB的源码阅读(三) Put操作
在Linux上leveldb的安装和使用中我们写了这么一段测试代码,内容以及输出结果如下:
#include <iostream>
#include <string>
#include <assert.h>
#include "leveldb/db.h" using namespace std; int main(void)
{ leveldb::DB *db;
leveldb::Options options;
options.create_if_missing = true; // open
leveldb::Status status = leveldb::DB::Open(options,"/tmp/testdb", &db);
assert(status.ok()); string key = "name";
string value = "chenqi"; // write
status = db->Put(leveldb::WriteOptions(), key, value);
assert(status.ok()); // read
status = db->Get(leveldb::ReadOptions(), key, &value);
assert(status.ok()); cout<<value<<endl; // delete
status = db->Delete(leveldb::WriteOptions(), key);
assert(status.ok()); status = db->Get(leveldb::ReadOptions(),key, &value);
if(!status.ok()) {
cerr<<key<<" "<<status.ToString()<<endl;
} else {
cout<<key<<"==="<<value<<endl;
} // close
delete db; return ;
}
chenqi
name NotFound:
Leveldb的写数据流程入口为db文件夹下db_impl.cc文件中的DBImpl::Put和DBImpl::Delete,这两个文件是DBImpl::Write接口的封装,将写操作封装成WriteBatch传入DBImpl::Write进行操作,可见leveldb在内部是将单独的写操作也作为只有一个操作的批量写操作来进行的.源码内容如下:
// Convenience methods
Status DBImpl::Put(const WriteOptions& o, const Slice& key, const Slice& val) {
return DB::Put(o, key, val);
} Status DBImpl::Delete(const WriteOptions& options, const Slice& key) {
return DB::Delete(options, key);
} // Default implementations of convenience methods that subclasses of DB
// can call if they wish
Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
WriteBatch batch;
batch.Put(key, value);
return Write(opt, &batch);
} Status DB::Delete(const WriteOptions& opt, const Slice& key) {
WriteBatch batch;
batch.Delete(key);
return Write(opt, &batch);
}
DBImpl::Put和DBImpl::Delete最终调用 DBImpl::Write完成写操作,DBImpl::Write源码内容如下:
Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
Writer w(&mutex_);
w.batch = my_batch;
w.sync = options.sync;
w.done = false; MutexLock l(&mutex_);
writers_.push_back(&w);
while (!w.done && &w != writers_.front()) {
w.cv.Wait();
}
if (w.done) {
return w.status;
} // May temporarily unlock and wait.
Status status = MakeRoomForWrite(my_batch == NULL);
uint64_t last_sequence = versions_->LastSequence();
Writer* last_writer = &w;
if (status.ok() && my_batch != NULL) { // NULL batch is for compactions
WriteBatch* updates = BuildBatchGroup(&last_writer);
WriteBatchInternal::SetSequence(updates, last_sequence + );
last_sequence += WriteBatchInternal::Count(updates); // Add to log and apply to memtable. We can release the lock
// during this phase since &w is currently responsible for logging
// and protects against concurrent loggers and concurrent writes
// into mem_.
{
mutex_.Unlock();
status = log_->AddRecord(WriteBatchInternal::Contents(updates));
bool sync_error = false;
if (status.ok() && options.sync) {
status = logfile_->Sync();
if (!status.ok()) {
sync_error = true;
}
}
if (status.ok()) {
status = WriteBatchInternal::InsertInto(updates, mem_);
}
mutex_.Lock();
if (sync_error) {
// The state of the log file is indeterminate: the log record we
// just added may or may not show up when the DB is re-opened.
// So we force the DB into a mode where all future writes fail.
RecordBackgroundError(status);
}
}
if (updates == tmp_batch_) tmp_batch_->Clear(); versions_->SetLastSequence(last_sequence);
} while (true) {
Writer* ready = writers_.front();
writers_.pop_front();
if (ready != &w) {
ready->status = status;
ready->done = true;
ready->cv.Signal();
}
if (ready == last_writer) break;
} // Notify new head of write queue
if (!writers_.empty()) {
writers_.front()->cv.Signal();
} return status;
}
以下逐段进行分析
Writer w(&mutex_);// writer可以理解为一个任务
w.batch = my_batch;
w.sync = options.sync;
w.done = false; MutexLock l(&mutex_);//构造时上锁, 析构时解锁
writers_.push_back(&w);// 把w推进writers_队列
// 生产者消费者模型
while (!w.done && &w != writers_.front()) {
w.cv.Wait();//线程可能多次被唤醒
}
//写操作有可能被合并处理,因此有可能取到的时候写入已经完成。完成的话直接返回
if (w.done) {
return w.status;
}
mutex l上锁之后, 到了"w.cv.Wait()"的时候, 会先释放锁等待, 然后收到signal时再次上锁. 这段代码的作用就是多线程在提交任务的时候, 一个接一个push_back进队列. 但只有位于队首的线程有资格继续运行下去. 目的是把尽可能多的写请求((要求sync选项一致))合并成一个大batch提升效率.
Status status = MakeRoomForWrite(my_batch == NULL);
接下来是调用MakeRoomForWrite查看是否允许写,如果后台Compact失败,则返回错误,如果level0的文件数达到配置的SlowdownWritesTrigger(默认为8),则对每个写操作都延迟1ms,如果level0的文件数达到配置的kL0_StopWritesTrigger(默认为12),则阻塞写操作,等待后台Compact结束.如果memtable不满,则直接返回,可写.如果memtable已满,并且immutable table不为空,阻塞,等待Compact结束.否则,将memtable改为immutable table,新建memtable,返回可写.源码内容如下:
// REQUIRES: mutex_ is held
// REQUIRES: this thread is currently at the front of the writer queue
Status DBImpl::MakeRoomForWrite(bool force) {
mutex_.AssertHeld();
assert(!writers_.empty());
bool allow_delay = !force;
Status s;
while (true) {
if (!bg_error_.ok()) {
// Yield previous error
s = bg_error_;//后台Compact出错
break;
} else if (
allow_delay &&
versions_->NumLevelFiles() >= config::kL0_SlowdownWritesTrigger) {
// We are getting close to hitting a hard limit on the number of
// L0 files. Rather than delaying a single write by several
// seconds when we hit the hard limit, start delaying each
// individual write by 1ms to reduce latency variance. Also,
// this delay hands over some CPU to the compaction thread in
// case it is sharing the same core as the writer.
mutex_.Unlock();
env_->SleepForMicroseconds();
allow_delay = false; // Do not delay a single write more than once
mutex_.Lock();
} else if (!force &&
(mem_->ApproximateMemoryUsage() <= options_.write_buffer_size)) {
// There is room in current memtable 也就是说可以写
break;
} else if (imm_ != NULL) {
// We have filled up the current memtable, but the previous
// one is still being compacted, so we wait.
Log(options_.info_log, "Current memtable full; waiting...\n");
bg_cv_.Wait();
} else if (versions_->NumLevelFiles() >= config::kL0_StopWritesTrigger) {
// There are too many level-0 files.
Log(options_.info_log, "Too many L0 files; waiting...\n");
bg_cv_.Wait();
} else {
// Attempt to switch to a new memtable and trigger compaction of old
assert(versions_->PrevLogNumber() == );
uint64_t new_log_number = versions_->NewFileNumber();
WritableFile* lfile = NULL;
s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);
if (!s.ok()) {
// Avoid chewing through file number space in a tight loop.
versions_->ReuseFileNumber(new_log_number);
break;
}
delete log_;
delete logfile_;
logfile_ = lfile;
logfile_number_ = new_log_number;
log_ = new log::Writer(lfile);
imm_ = mem_;
has_imm_.Release_Store(imm_);
mem_ = new MemTable(internal_comparator_);
mem_->Ref();
force = false; // Do not force another compaction if have room
MaybeScheduleCompaction();//检查是否启动后台compact
}
}
return s;
}
这里有几点值得注意的优化:
1.在level 0的table数量快要接近阈值的时候, sleep 1ms.
因为文件系统表示写入完成并不一定真写到硬盘里了. 数量接近阈值说明快要进行下一次compaction了. 这时候如果文件系统的buffer积压了太多, 会造成硬盘一下子满负载. 还有可能已经在compact了, 这时候sleep就可以让CPU周期给更重要的任务.
2.log具有最高优先级无论如何都要写, 但immemtable只能一张一张写.
MakeRoomForWrite分析结束,我们重新回到DBImpl::Write往下看.写的时候一次性写入,首先写入log文件,然后写到memtable里.
uint64_t last_sequence = versions_->LastSequence();
Writer* last_writer = &w;
if (status.ok() && my_batch != NULL) { // NULL batch is for compactions
WriteBatch* updates = BuildBatchGroup(&last_writer);
WriteBatchInternal::SetSequence(updates, last_sequence + ); //设置writebatch的起始sequence
last_sequence += WriteBatchInternal::Count(updates); //写成功后的sequence // Add to log and apply to memtable. We can release the lock
// during this phase since &w is currently responsible for logging
// and protects against concurrent loggers and concurrent writes
// into mem_.
{
mutex_.Unlock();
status = log_->AddRecord(WriteBatchInternal::Contents(updates)); //写入日志文件
bool sync_error = false;
if (status.ok() && options.sync) {
status = logfile_->Sync();
if (!status.ok()) {
sync_error = true;
}
}
if (status.ok()) {
status = WriteBatchInternal::InsertInto(updates, mem_); //写入memtable
}
mutex_.Lock();
if (sync_error) {
// The state of the log file is indeterminate: the log record we
// just added may or may not show up when the DB is re-opened.
// So we force the DB into a mode where all future writes fail.
RecordBackgroundError(status);
}
}
if (updates == tmp_batch_) tmp_batch_->Clear(); versions_->SetLastSequence(last_sequence); //设置现在的最新更新sequence
}
这里调用了BuildBatchGroup,从等待写队列中获取尽可能多的写任务,BuildBatchGroup源码内容如下:
// REQUIRES: Writer list must be non-empty
// REQUIRES: First writer must have a non-NULL batch
WriteBatch* DBImpl::BuildBatchGroup(Writer** last_writer) {
assert(!writers_.empty());
Writer* first = writers_.front();
WriteBatch* result = first->batch;
assert(result != NULL); size_t size = WriteBatchInternal::ByteSize(first->batch); // Allow the group to grow up to a maximum size, but if the
// original write is small, limit the growth so we do not slow
// down the small write too much.
size_t max_size = << ;
if (size <= (<<)) {
max_size = size + (<<);
} *last_writer = first;
std::deque<Writer*>::iterator iter = writers_.begin();
++iter; // Advance past "first"
for (; iter != writers_.end(); ++iter) {
Writer* w = *iter;
if (w->sync && !first->sync) {
// Do not include a sync write into a batch handled by a non-sync write.
break;
} if (w->batch != NULL) {
size += WriteBatchInternal::ByteSize(w->batch);
if (size > max_size) {
// Do not make batch too big
break;
} // Append to *result
// 把合并的写请求保存在成员变量 tmp_batch_ 中,避免和调用者的写请求混淆在一起
if (result == first->batch) {
// Switch to temporary batch instead of disturbing caller's batch
result = tmp_batch_;
assert(WriteBatchInternal::Count(result) == );
WriteBatchInternal::Append(result, first->batch);
}
WriteBatchInternal::Append(result, w->batch);
}
*last_writer = w;
}
return result;
}
调用到的WriteBatchInternal相关代码在db文件夹下write_batch.cc中。
再次回到DBImpl::Write看最后一段代码,从头开始检查队列, 把完成的任务标记为done, 如果队列还有别的任务, 继续唤醒第一个.
while (true) {
Writer* ready = writers_.front();
writers_.pop_front();
if (ready != &w) {
ready->status = status;
ready->done = true;
ready->cv.Signal();
}
if (ready == last_writer) break;
} // Notify new head of write queue
if (!writers_.empty()) {
writers_.front()->cv.Signal();
} return status;
参考文献:
1.http://blog.csdn.net/joeyon1985/article/details/47154249
2.http://masutangu.com/2017/06/leveldb_1/
3.https://zhuanlan.zhihu.com/jimderestaurant?topic=LevelDB
LevelDB的源码阅读(三) Put操作的更多相关文章
- 25 BasicUsageEnvironment0基本使用环境基类——Live555源码阅读(三)UsageEnvironment
25 BasicUsageEnvironment0基本使用环境基类——Live555源码阅读(三)UsageEnvironment 25 BasicUsageEnvironment0基本使用环境基类— ...
- 24 UsageEnvironment使用环境抽象基类——Live555源码阅读(三)UsageEnvironment
24 UsageEnvironment使用环境抽象基类——Live555源码阅读(三)UsageEnvironment 24 UsageEnvironment使用环境抽象基类——Live555源码阅读 ...
- 26 BasicUsageEnvironment基本使用环境——Live555源码阅读(三)UsageEnvironment
26 BasicUsageEnvironment基本使用环境--Live555源码阅读(三)UsageEnvironment 26 BasicUsageEnvironment基本使用环境--Live5 ...
- LevelDB的源码阅读(三) Get操作
在Linux上leveldb的安装和使用中我们写了这么一段测试代码,内容以及输出结果如下: #include <iostream> #include <string> #inc ...
- LevelDB的源码阅读(四) Compaction操作
leveldb的数据存储采用LSM的思想,将随机写入变为顺序写入,记录写入操作日志,一旦日志被以追加写的形式写入硬盘,就返回写入成功,由后台线程将写入日志作用于原有的磁盘文件生成新的磁盘数据.Leve ...
- LevelDB的源码阅读(二) Open操作
在Linux上leveldb的安装和使用中我们写了一个测试代码,内容如下: #include "leveldb/db.h" #include <cassert> #in ...
- LevelDB的源码阅读(一)
源码下载 git clone https://github.com/google/leveldb.git 项目结构 db/, 数据库逻辑 doc/, MD文档 helpers/, LevelDB内存版 ...
- SparkSQL(源码阅读三)
额,没忍住,想完全了解sparksql,毕竟一直在用嘛,想一次性搞清楚它,所以今天再多看点好了~ 曾几何时,有一个叫做shark的东西,它改了hive的源码...突然有一天,spark Sql突然出现 ...
- SpringMVC源码阅读(三)
先理一下Bean的初始化路线 org.springframework.beans.factory.support.AbstractBeanDefinitionReader public int loa ...
随机推荐
- Vue深度学习(5)-过渡效果
简介 通过 Vue.js 的过渡系统,你可以轻松的为 DOM 节点被插入/移除的过程添加过渡动画效果.Vue 将会在适当的时机添加/移除 CSS 类名来触发 CSS3 过渡/动画效果,你也可以提供相应 ...
- 二、spring Boot构建的Web应用中,基于MySQL数据库的几种数据库连接方式进行介绍
包括JDBC.JPA.MyBatis.多数据源和事务. 一.JDBC 连接数据库 1.属性配置文件(application.properties) spring.datasource.url=jdbc ...
- python __getattr__ 巧妙应用
在之前的文章有提到__getattr__函数的作用: 如果属性查找(attribute lookup)在实例以及对应的类中(通过__dict__)失败, 那么会调用到类的__getattr__函数, ...
- .net WCF简单实例
最近看到网上招聘有许多都需要WCF技术的人员,我之前一直没接触过这个东西,以后工作中难免会遇到,所谓笨鸟先飞,于是我就一探究竟,便有了这边文章.由于是初学WCF没有深入研究其原理,只是写了一个demo ...
- 【java】HashSet
package com.tn.hashSet; public class Person { private int id; private String name; private String bi ...
- Java之数据类型,变量赋值
Java中的基础数据类型(四类八种): 1.整数型 byte----使用byte关键字来定义byte型变量,可以一次定义多个变量并对其进行赋值,也可以不进行赋值.byte型是整型中所分配的内存空间是最 ...
- iOS UIAlertController在Tableview中显示缓慢,迟钝,延迟
在UITableViewCell中弹窗Alert延迟.在cellForRow中:cell.selectionStyle = UITableViewCellSelectionStyleNone; 或者在 ...
- 快速自检电脑是否被黑客入侵过(Linux版)
之前写了一篇快速自检电脑是否被黑客入侵过(Windows版), 这次就来写写Linux版本的. 前言 严谨地说, Linux只是一个内核, GNU Linux才算完整的操作系统, 但在本文里还是用通俗 ...
- Intellij IDEA 像eclipse那样给maven添加依赖
打开pom.xml,在它里面使用快捷键:ALT+Insert ---->点击dependency 再输入想要添加的依赖关键字,比如:输个spring 出现下图: 根据需求选择版本,完成以后 ...
- bzoj 3999: [TJOI2015]旅游
Description 为了提高智商,ZJY准备去往一个新世界去旅游.这个世界的城市布局像一棵树.每两座城市之间只有一条路径可 以互达.每座城市都有一种宝石,有一定的价格.ZJY为了赚取最高利益,她会 ...