上篇笔记讲到了聚合函数的实现并且带大家看了聚合函数是如何注册到ClickHouse之中的并被调用使用的。这篇笔记,笔者会续上上篇的内容,将剖析一把ClickHouse聚合流程的整体实现。

第二篇文章,我们来一起看看聚合流程的实现~~ 上车!

1.基础知识的梳理

ClickHouse的实现接口
  • Block类

    前文我们聊到ClickHouse是一个列式存储数据库,在内存之中用IColumn接口来作为数据结构表示数据。 而Block则是这些列的集合,也就是说Block包含了一组列,而无数个Block就构成了我们通常理解的表了。

    在ClickHouse进行查询之中,数据的最小处理单位是 Block 。由下面代码可以看到,Block就是由一组列以及列名对应列的偏移map组成的。
class Block
{
private:
using Container = ColumnsWithTypeAndName;
using IndexByName = std::map<String, size_t>; Container data;
IndexByName index_by_name;

这是一个很重要的类,实现的也并不复杂。Block类作为ClickHouse的核心,后续的工作都是基于Block类展开的。

  • 抽象类IBlockInputStream

    由名字可以看出,IBlockInputStream是一个实现接口。

    这也同样是一个十分重要的接口,ClickHouse的调用模型就建立在IBlockInputStream接口之上。该接口最为核心的就是方法便是read函数,它返回一个被对应Stream处理过的Block。

    想必看到这里应该明白了,ClickHouse就是通过IBlockInputStream实现的火山模型,每一个不同的Stream处理不同的查询逻辑,最后层层迭代,完成最终输出流就是用户需要的结果了。

    IBlockInputStream类还有一个孪生兄弟IBlockoutputStream,顾名思义,需要进行写操作的时候就要用到它了。
class IBlockInputStream : public TypePromotion<IBlockInputStream>
{
friend struct BlockStreamProfileInfo; public:
IBlockInputStream() { info.parent = this; }
virtual ~IBlockInputStream() {} IBlockInputStream(const IBlockInputStream &) = delete;
IBlockInputStream & operator=(const IBlockInputStream &) = delete; /// To output the data stream transformation tree (query execution plan).
virtual String getName() const = 0; /** Get data structure of the stream in a form of "header" block (it is also called "sample block").
* Header block contains column names, data types, columns of size 0. Constant columns must have corresponding values.
* It is guaranteed that method "read" returns blocks of exactly that structure.
*/
virtual Block getHeader() const = 0; virtual const BlockMissingValues & getMissingValues() const
{
static const BlockMissingValues none;
return none;
} /// If this stream generates data in order by some keys, return true.
virtual bool isSortedOutput() const { return false; } /// In case of isSortedOutput, return corresponding SortDescription
virtual const SortDescription & getSortDescription() const; /** Read next block.
* If there are no more blocks, return an empty block (for which operator `bool` returns false).
* NOTE: Only one thread can read from one instance of IBlockInputStream simultaneously.
* This also applies for readPrefix, readSuffix.
*/
Block read();
  • AggregatingBlockInputStream类

    终于引出我们的主角了,AggregatingBlockInputStream类,作为上面IBlockInputStream的子类,也就是我们今天要重点分析的类。
class AggregatingBlockInputStream : public IBlockInputStream
{
public:
/** keys are taken from the GROUP BY part of the query
* Aggregate functions are searched everywhere in the expression.
* Columns corresponding to keys and arguments of aggregate functions must already be computed.
*/
AggregatingBlockInputStream(const BlockInputStreamPtr & input, const Aggregator::Params & params_, bool final_)
: params(params_), aggregator(params), final(final_)
{
children.push_back(input);
} String getName() const override { return "Aggregating"; } Block getHeader() const override; protected:
Block readImpl() override; Aggregator::Params params;
Aggregator aggregator;
bool final; bool executed = false; std::vector<std::unique_ptr<TemporaryFileStream>> temporary_inputs; /** From here we will get the completed blocks after the aggregation. */
std::unique_ptr<IBlockInputStream> impl;
};

首先看它的构造方法,参数有:

  • BlockInputStreamPtr: 这个很好理解,就是它的子流,也就是实际产生数据的流,后续的聚合计算将会在子流返回的结果上展开。
  • params: 聚合参数,这个参数十分重要。它记录了那些key属于聚合,调用那些聚合参数等核心信息。并且aggregator也就是执行聚合的类,也是通过该参数构造的,它是Aggregator的内部类。
  • final: 指明该Stream是否是最终结果,还是要继续进行计算。

这里最为核心的就是AggregatingBlockInputStream类通过继承override对应的readImpl()的接口来实现对应的具体逻辑。AggregatingBlockInputStream类还有一个孪生兄弟:ParallelAggregatingBlockInputStream类,通过并行化来进一步加快聚合流程的执行效率。(通过笔者进行的测试,在简单查询聚合查询下,并行化能够提高近一倍的效率~~)

  • Aggregator::Params类

    Aggregator::Params类Aggregator的内部类。这个类是整个聚合过程之中最重要的类,查询解析优化后生成聚合查询的执行计划。 而对应的执行计划的参数都通过Aggregator::Params类来初始化,比如那些列要进行聚合,选取的聚合算子等等,并传递给对应的Aggregator来实现对应的聚合逻辑。
 struct Params
{
/// Data structure of source blocks.
Block src_header;
/// Data structure of intermediate blocks before merge.
Block intermediate_header; /// What to count.
const ColumnNumbers keys;
const AggregateDescriptions aggregates;
const size_t keys_size;
const size_t aggregates_size; /// The settings of approximate calculation of GROUP BY.
const bool overflow_row; /// Do we need to put into AggregatedDataVariants::without_key aggregates for keys that are not in max_rows_to_group_by.
const size_t max_rows_to_group_by;
const OverflowMode group_by_overflow_mode; /// Settings to flush temporary data to the filesystem (external aggregation).
const size_t max_bytes_before_external_group_by; /// 0 - do not use external aggregation. /// Return empty result when aggregating without keys on empty set.
bool empty_result_for_aggregation_by_empty_set; VolumePtr tmp_volume; /// Settings is used to determine cache size. No threads are created.
size_t max_threads; const size_t min_free_disk_space;
Params(
const Block & src_header_,
const ColumnNumbers & keys_, const AggregateDescriptions & aggregates_,
bool overflow_row_, size_t max_rows_to_group_by_, OverflowMode group_by_overflow_mode_,
size_t group_by_two_level_threshold_, size_t group_by_two_level_threshold_bytes_,
size_t max_bytes_before_external_group_by_,
bool empty_result_for_aggregation_by_empty_set_,
VolumePtr tmp_volume_, size_t max_threads_,
size_t min_free_disk_space_)
: src_header(src_header_),
keys(keys_), aggregates(aggregates_), keys_size(keys.size()), aggregates_size(aggregates.size()),
overflow_row(overflow_row_), max_rows_to_group_by(max_rows_to_group_by_), group_by_overflow_mode(group_by_overflow_mode_),
group_by_two_level_threshold(group_by_two_level_threshold_), group_by_two_level_threshold_bytes(group_by_two_level_threshold_bytes_),
max_bytes_before_external_group_by(max_bytes_before_external_group_by_),
empty_result_for_aggregation_by_empty_set(empty_result_for_aggregation_by_empty_set_),
tmp_volume(tmp_volume_), max_threads(max_threads_),
min_free_disk_space(min_free_disk_space_)
{
} /// Only parameters that matter during merge.
Params(const Block & intermediate_header_,
const ColumnNumbers & keys_, const AggregateDescriptions & aggregates_, bool overflow_row_, size_t max_threads_)
: Params(Block(), keys_, aggregates_, overflow_row_, 0, OverflowMode::THROW, 0, 0, 0, false, nullptr, max_threads_, 0)
{
intermediate_header = intermediate_header_;
}
};
  • Aggregator类

    顾名思义,这个是一个实际进行聚合工作展开的类。它最为核心的方法是下面两个函数:

    • execute函数:将输入流的stream依照次序进行blcok迭代处理,将聚合的结果写入result之中。
    • mergeAndConvertToBlocks函数:将聚合的结果转换为输入流,并通过输入流的read函数将结果继续返回给上一层。

      通过上面两个函数的调用,我们就可以完成被聚合的数据输入-》 数据聚合 -》 数据输出的流程。具体的细节笔者会在下一章详细的进行剖析。
class Aggregator
{
public:
Aggregator(const Params & params_); /// Aggregate the source. Get the result in the form of one of the data structures.
void execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result); using AggregateColumns = std::vector<ColumnRawPtrs>;
using AggregateColumnsData = std::vector<ColumnAggregateFunction::Container *>;
using AggregateColumnsConstData = std::vector<const ColumnAggregateFunction::Container *>;
using AggregateFunctionsPlainPtrs = std::vector<IAggregateFunction *>; /// Process one block. Return false if the processing should be aborted (with group_by_overflow_mode = 'break').
bool executeOnBlock(const Block & block, AggregatedDataVariants & result,
ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, /// Passed to not create them anew for each block
bool & no_more_keys); bool executeOnBlock(Columns columns, UInt64 num_rows, AggregatedDataVariants & result,
ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, /// Passed to not create them anew for each block
bool & no_more_keys); /** Convert the aggregation data structure into a block.
* If overflow_row = true, then aggregates for rows that are not included in max_rows_to_group_by are put in the first block.
*
* If final = false, then ColumnAggregateFunction is created as the aggregation columns with the state of the calculations,
* which can then be combined with other states (for distributed query processing).
* If final = true, then columns with ready values are created as aggregate columns.
*/
BlocksList convertToBlocks(AggregatedDataVariants & data_variants, bool final, size_t max_threads) const; /** Merge several aggregation data structures and output the result as a block stream.
*/
std::unique_ptr<IBlockInputStream> mergeAndConvertToBlocks(ManyAggregatedDataVariants & data_variants, bool final, size_t max_threads) const;
ManyAggregatedDataVariants prepareVariantsToMerge(ManyAggregatedDataVariants & data_variants) const; /** Merge the stream of partially aggregated blocks into one data structure.
* (Pre-aggregate several blocks that represent the result of independent aggregations from remote servers.)
*/
void mergeStream(const BlockInputStreamPtr & stream, AggregatedDataVariants & result, size_t max_threads); using BucketToBlocks = std::map<Int32, BlocksList>;
/// Merge partially aggregated blocks separated to buckets into one data structure.
void mergeBlocks(BucketToBlocks bucket_to_blocks, AggregatedDataVariants & result, size_t max_threads); /// Merge several partially aggregated blocks into one.
/// Precondition: for all blocks block.info.is_overflows flag must be the same.
/// (either all blocks are from overflow data or none blocks are).
/// The resulting block has the same value of is_overflows flag.
Block mergeBlocks(BlocksList & blocks, bool final); std::unique_ptr<IBlockInputStream> mergeAndConvertToBlocks(ManyAggregatedDataVariants & data_variants, bool final, size_t max_threads) const; using CancellationHook = std::function<bool()>; /** Set a function that checks whether the current task can be aborted.
*/
void setCancellationHook(const CancellationHook cancellation_hook); /// Get data structure of the result.
Block getHeader(bool final) const;

2.聚合流程的实现

这里我们就从上文提到的Aggregator::execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result)函数作为起点来梳理一下ClickHouse的聚合实现:

void Aggregator::execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result)
{
Stopwatch watch; size_t src_rows = 0;
size_t src_bytes = 0; /// Read all the data
while (Block block = stream->read())
{
if (isCancelled())
return; src_rows += block.rows();
src_bytes += block.bytes(); if (!executeOnBlock(block, result, key_columns, aggregate_columns, no_more_keys))
break;
}

由上述代码可以看出,这里就是依次读取子节点流生成的Block,然后继续调用executeOnBlock方法来执行聚合流程处理每一个Block的聚合。接着我们按图索骥,继续看下去,这个函数比较长,我们拆分成几个部分,并且把无关紧要的代码先去掉:这部分主要完成的工作就是将param之中指定的key列与聚合列的指针作为参数提取出来,并且和聚合函数一起封装到AggregateFunctionInstructions的结构之中。

bool Aggregator::executeOnBlock(Columns columns, UInt64 num_rows, AggregatedDataVariants & result,
ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, bool & no_more_keys)
{
/// `result` will destroy the states of aggregate functions in the destructor
result.aggregator = this; /// How to perform the aggregation?
if (result.empty())
{
result.init(method_chosen);
result.keys_size = params.keys_size;
result.key_sizes = key_sizes;
LOG_TRACE(log, "Aggregation method: " << result.getMethodName());
} for (size_t i = 0; i < params.aggregates_size; ++i)
aggregate_columns[i].resize(params.aggregates[i].arguments.size()); /** Constant columns are not supported directly during aggregation.
* To make them work anyway, we materialize them.
*/
Columns materialized_columns; /// Remember the columns we will work with
for (size_t i = 0; i < params.keys_size; ++i)
{
materialized_columns.push_back(columns.at(params.keys[i])->convertToFullColumnIfConst());
key_columns[i] = materialized_columns.back().get(); if (!result.isLowCardinality())
{
auto column_no_lc = recursiveRemoveLowCardinality(key_columns[i]->getPtr());
if (column_no_lc.get() != key_columns[i])
{
materialized_columns.emplace_back(std::move(column_no_lc));
key_columns[i] = materialized_columns.back().get();
}
}
} AggregateFunctionInstructions aggregate_functions_instructions(params.aggregates_size + 1);
aggregate_functions_instructions[params.aggregates_size].that = nullptr; std::vector<std::vector<const IColumn *>> nested_columns_holder;
for (size_t i = 0; i < params.aggregates_size; ++i)
{
for (size_t j = 0; j < aggregate_columns[i].size(); ++j)
{
materialized_columns.push_back(columns.at(params.aggregates[i].arguments[j])->convertToFullColumnIfConst());
aggregate_columns[i][j] = materialized_columns.back().get(); auto column_no_lc = recursiveRemoveLowCardinality(aggregate_columns[i][j]->getPtr());
if (column_no_lc.get() != aggregate_columns[i][j])
{
materialized_columns.emplace_back(std::move(column_no_lc));
aggregate_columns[i][j] = materialized_columns.back().get();
}
} aggregate_functions_instructions[i].arguments = aggregate_columns[i].data();
aggregate_functions_instructions[i].state_offset = offsets_of_aggregate_states[i];
auto that = aggregate_functions[i];
/// Unnest consecutive trailing -State combinators
while (auto func = typeid_cast<const AggregateFunctionState *>(that))
that = func->getNestedFunction().get();
aggregate_functions_instructions[i].that = that;
aggregate_functions_instructions[i].func = that->getAddressOfAddFunction(); if (auto func = typeid_cast<const AggregateFunctionArray *>(that))
{
/// Unnest consecutive -State combinators before -Array
that = func->getNestedFunction().get();
while (auto nested_func = typeid_cast<const AggregateFunctionState *>(that))
that = nested_func->getNestedFunction().get();
auto [nested_columns, offsets] = checkAndGetNestedArrayOffset(aggregate_columns[i].data(), that->getArgumentTypes().size());
nested_columns_holder.push_back(std::move(nested_columns));
aggregate_functions_instructions[i].batch_arguments = nested_columns_holder.back().data();
aggregate_functions_instructions[i].offsets = offsets;
}
else
aggregate_functions_instructions[i].batch_arguments = aggregate_columns[i].data(); aggregate_functions_instructions[i].batch_that = that;
}

将需要准备的参数准备好了之后,后续就通过按部就班的调用executeImpl(*result.NAME, result.aggregates_pool, num_rows, key_columns, aggregate_functions_instructions.data(),

no_more_keys, overflow_row_ptr)
聚合运算了。我们来看看它的实现,它是一个模板函数,内部通过调用了 executeImplBatch(method, state, aggregates_pool, rows, aggregate_instructions)来实现的,数据库都会通过Batch的形式,一次性提交一组需要操作的数据来减少虚函数调用的开销。

template <typename Method>
void NO_INLINE Aggregator::executeImpl(
Method & method,
Arena * aggregates_pool,
size_t rows,
ColumnRawPtrs & key_columns,
AggregateFunctionInstruction * aggregate_instructions,
bool no_more_keys,
AggregateDataPtr overflow_row) const
{
typename Method::State state(key_columns, key_sizes, aggregation_state_cache); if (!no_more_keys)
executeImplBatch(method, state, aggregates_pool, rows, aggregate_instructions);
else
executeImplCase<true>(method, state, aggregates_pool, rows, aggregate_instructions, overflow_row);
}

那我们就继续看下去,executeImplBatch同样也是一个模板函数。

  • 首先,它构造了一个AggregateDataPtr的数组places,这里是这就是后续我们实际聚合结果存放的地方。这个数据的长度也就是这个Batch的长度,也就是说,聚合结果的指针也作为一组列式的数据,参与到后续的聚合运算之中。
  • 接下来,通过一个for循环,依次调用state.emplaceKey,计算每列聚合key的hash值,进行分类,并且将对应结果依次和places对应。
  • 最后,通过一个for循环,调用聚合函数的addBatch方法,(这个函数我们在上一篇之中介绍过)。每个AggregateFunctionInstruction都有一个制定的places_offset和对应属于进行聚合计算的value列,这里通过一个for循环调用AddBatch,将places之中对应的数据指针和聚合value列进行聚合,最终形成所有的聚合计算的结果。

到这里,整个聚合计算的核心流程算是完成了,后续就是将result的结果通过上面的convertToBlock的方式转换为BlockStream流,继续返回给上层的调用方。

template <typename Method>
void NO_INLINE Aggregator::executeImplBatch(
Method & method,
typename Method::State & state,
Arena * aggregates_pool,
size_t rows,
AggregateFunctionInstruction * aggregate_instructions) const
{
PODArray<AggregateDataPtr> places(rows); /// For all rows.
for (size_t i = 0; i < rows; ++i)
{
AggregateDataPtr aggregate_data = nullptr; auto emplace_result = state.emplaceKey(method.data, i, *aggregates_pool); /// If a new key is inserted, initialize the states of the aggregate functions, and possibly something related to the key.
if (emplace_result.isInserted())
{
/// exception-safety - if you can not allocate memory or create states, then destructors will not be called.
emplace_result.setMapped(nullptr); aggregate_data = aggregates_pool->alignedAlloc(total_size_of_aggregate_states, align_aggregate_states);
createAggregateStates(aggregate_data); emplace_result.setMapped(aggregate_data);
}
else
aggregate_data = emplace_result.getMapped(); places[i] = aggregate_data;
assert(places[i] != nullptr);
} /// Add values to the aggregate functions.
for (AggregateFunctionInstruction * inst = aggregate_instructions; inst->that; ++inst)
{
if (inst->offsets)
inst->batch_that->addBatchArray(rows, places.data(), inst->state_offset, inst->batch_arguments, inst->offsets, aggregates_pool);
else
inst->batch_that->addBatch(rows, places.data(), inst->state_offset, inst->batch_arguments, aggregates_pool);
}

3. 小结

好了,到这里也就把ClickHouse聚合流程的代码梳理完了。

除了聚合计算外,其他的物理执行操作符也是同样通过流的方式依次对接处理的,源码阅读的步骤也可以参照笔者的分析流程来参考。、

笔者是一个ClickHouse的初学者,对ClickHouse有兴趣的同学,欢迎多多指教,交流。

4. 参考资料

官方文档

ClickHouse源代码

ClickHouse源码笔记2:聚合流程的实现的更多相关文章

  1. ClickHouse源码笔记5:聚合函数的源码再梳理

    笔者在源码笔记1之中分析过ClickHouse的聚合函数的实现,但是对于各个接口函数的实际如何共同工作的源码,回头看并没有那么明晰,主要原因是没有结合Aggregator的类来一起分析聚合函数的是如果 ...

  2. ClickHouse源码笔记1:聚合函数的实现

    由于工作的需求,后续笔者工作需要和开源的OLAP数据库ClickHouse打交道.ClickHouse是Yandex在2016年6月15日开源了一个分析型数据库,以强悍的单机处理能力被称道. 笔者在实 ...

  3. ClickHouse源码笔记4:FilterBlockInputStream, 探寻where,having的实现

    书接上文,本篇继续分享ClickHouse源码中一个重要的流,FilterBlockInputStream的实现,重点在于分析Clickhouse是如何在执行引擎实现向量化的Filter操作符,而利用 ...

  4. ClickHouse源码笔记3:函数调用的向量化实现

    分享一下笔者研读ClickHouse源码时分析函数调用的实现,重点在于分析Clickhouse查询层实现的接口,以及Clickhouse是如何利用这些接口更好的实现向量化的.本文的源码分析基于Clic ...

  5. ClickHouse源码笔记6:探究列式存储系统的排序

    分析完成了聚合以及向量化过滤,向量化的函数计算之后.本篇,笔者将分析数据库的一个重要算子:排序.让我们从源码的角度来剖析ClickHouse作为列式存储系统是如何实现排序的. 本系列文章的源码分析基于 ...

  6. clickhouse源码Redhat系列机单机版安装踩坑笔记

    前情概要 由于工作需要用到clickhouse, 这里暂不介绍概念,应用场景,谷歌,百度一大把. 将安装过程踩下的坑记录下来备用 ClickHouse源码 git clone安装(直接下载源码包安装失 ...

  7. redis源码笔记(一) —— 从redis的启动到command的分发

    本作品采用知识共享署名 4.0 国际许可协议进行许可.转载联系作者并保留声明头部与原文链接https://luzeshu.com/blog/redis1 本博客同步在http://www.cnblog ...

  8. AsyncTask源码笔记

    AsyncTask源码笔记 AsyncTask在注释中建议只用来做短时间的异步操作,也就是只有几秒的操作:如果是长时间的操作,建议还是使用java.util.concurrent包中的工具类,例如Ex ...

  9. Tomcat8源码笔记(八)明白Tomcat怎么部署webapps下项目

    以前没想过这么个问题:Tomcat怎么处理webapps下项目,并且我访问浏览器ip: port/项目名/请求路径,以SSM为例,Tomcat怎么就能将请求找到项目呢,项目还是个文件夹类型的? Tom ...

随机推荐

  1. 黎活明8天快速掌握android视频教程--23_网络通信之网络图片查看器

    1.首先新建立一个java web项目的工程.使用的是myeclipe开发软件 图片的下载路径是http://192.168.1.103:8080/lihuoming_23/3.png 当前手机和电脑 ...

  2. elk2

    如果使用codec->json进行解码,表示输入到logstast中的input数据必须是json的格式,否则会解码失败 java中一句代码异常会抛出多条的堆栈日志,我们可以使用上面的mutil ...

  3. skywalking学习ppt

    和传统应用监控的区别,Dapper论文 监控图

  4. java scoket Blocking 阻塞IO socket通信四

    记住NIO在jdk1.7版本之前是同步非阻塞的,以前的inputsream是同步阻塞的,上面学习完成了Buffer现在我们来学习channel channel书双向的,以前阻塞的io的inputstr ...

  5. Riccati方程迭代法求解

    根据上述迭代法求解P,P为Riccati方程的解,然而用LQR需要计算K,再将K算出. (迭代过程中 ,我们可以将此算法和dlqr函数求解的参数进行对比,当误差小于我们设置的允许误差我们就可以把此算法 ...

  6. GraphicsLab Project 之 Screen Space Planar Reflection

    作者:i_dovelemon 日期:2020-06-23 主题:Screen Space Planar Reflection, Compute Shader 引言 前段时间,同事发来一篇讲述特化版本的 ...

  7. node+ajax实战案例(3)

    3.用户注册实现 3.1.注册用户功能的实现逻辑 1 用户在表单上输入注册信息 2 点击注册后,收集用户在表单上输入的注册信息并且发送给后台 3 后台接收用户发送过来的注册信息 4 后台需要处理数据并 ...

  8. 在Github上建立自己的个人主页

    目录 注册Github账号 登录Github账号 建立新仓库 选择个人主页的主题 注册Github账号 首先打开Github的主页(https://github.com/),点击右上角的sign up ...

  9. Oracle 闪回总结

    一.闪回查询(Flashback Query)1.闪回查询技术1.1 闪回查询机制    闪回查询是指利用数据库回滚段存放的信息查看指定表中过去某个时间点的数据信息,或过去某个时间段数据的变化情况,或 ...

  10. C#操作SharePoint文档库文档

    using (Stream file = spFile.OpenBinaryStream()) { //其余代码 }