Reposted from: https://cube.dev/blog/high-performance-data-analytics-with-cubejs-pre-aggregations/ — a good walkthrough of how pre-aggregations are processed.

This is an advanced tutorial. If you are just getting started with Cube.js, I recommend checking this tutorial first and then coming back here.

One of the most powerful features of Cube.js is pre-aggregations. Coupled with the data schema, it eliminates the need to organize, denormalize, and transform data before using it with Cube.js. The pre-aggregation engine builds a layer of aggregated data in your database at runtime and keeps it up to date.

Upon an incoming request, Cube.js will first look for a relevant pre-aggregation. If it cannot find any, it will build a new one. Once the pre-aggregation is built, all subsequent requests will go to the pre-aggregated layer instead of hitting the raw data. This can speed up response times by hundreds or even thousands of times.

Pre-aggregations are materialized query results persisted as tables. In order to start using pre-aggregations, Cube.js should have write access to the stb_pre_aggregations schema where pre-aggregation tables will be stored.

Cube.js also takes care of keeping the pre-aggregation up-to-date. It performs refresh checks and if it finds that a pre-aggregation is outdated, it schedules a refresh in the background.

Creating a Simple Pre-Aggregation

Let’s take a look at the example of how we can use pre-aggregations to improve query performance.

For testing purposes, we will use a Postgres database and will generate around ten million records using the generate_series function.

$ createdb cubejs_test

The following SQL creates a table, orders, and inserts a sample of generated records into it.

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  amount integer,
  created_at timestamp without time zone
);

CREATE INDEX orders_created_at_amount ON orders(created_at, amount);

INSERT INTO orders (created_at, amount)
SELECT
  created_at,
  floor((1000 + 500*random())*log(row_number() over())) as amount
FROM generate_series
  ( '1997-01-01'::date
  , '2017-12-31'::date
  , '1 minutes'::interval) created_at;

Next, create a new Cube.js application if you don’t have any.

$ npm install -g cubejs-cli
$ cubejs create test-app -d postgres

Change the content of .env in the project folder to the following.

CUBEJS_API_SECRET=SECRET
CUBEJS_DB_TYPE=postgres
CUBEJS_DB_NAME=cubejs_test

Finally, generate a schema for the orders table and start the Cube.js server.

$ cubejs generate -t orders
$ npm run dev
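
The query below references an Orders.amount measure. Depending on the Cube.js version, the generated schema/Orders.js may not include such a measure out of the box, so here is a minimal sketch of what the cube could look like with a sum measure over the amount column added by hand (the amount measure definition is an assumption for this walkthrough, not part of the generated file):

cube(`Orders`, {
  sql: `SELECT * FROM public.orders`,

  measures: {
    count: {
      type: `count`
    },

    // Assumed measure: the queries below reference Orders.amount,
    // so we aggregate the amount column with a sum.
    amount: {
      sql: `amount`,
      type: `sum`
    }
  },

  dimensions: {
    createdAt: {
      sql: `created_at`,
      type: `time`
    }
  }
});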

Now, we can send a query to Cube.js with the Orders.amount measure and the Orders.createdAt time dimension with granularity set to month.

curl \
  -H "Authorization: EXAMPLE-API-TOKEN" \
  -G \
  --data-urlencode 'query={
    "measures": ["Orders.amount"],
    "timeDimensions": [{
      "dimension": "Orders.createdAt",
      "granularity": "month",
      "dateRange": ["1997-01-01", "2017-01-01"]
    }]
  }' \
  http://localhost:4000/cubejs-api/v1/load

Cube.js will respond with Continue wait, because this query takes more than 5 seconds to process. Let’s look at Cube.js logs to see exactly how long it took for our Postgres to execute this query.

Performing query completed:
{
  "queueSize": 2,
  "duration": 6514,
  "queryKey": [
    "
      SELECT
        date_trunc('month', (orders.created_at::timestamptz at time zone 'UTC')) \"orders.created_at_month\",
        sum(orders.amount) \"orders.amount\"
      FROM
        public.orders AS orders
      WHERE (
        orders.created_at >= $1::timestamptz
        AND orders.created_at <= $2::timestamptz
      )
      GROUP BY 1
      ORDER BY 1 ASC limit 10000
    ",
    [
      "2000-01-01T00:00:00Z",
      "2017-01-01T23:59:59Z"
    ],
    []
  ]
}

It took 6,514 milliseconds (6.5 seconds) for Postgres to execute the above query. Although we have an index on the created_at and amount columns, it doesn't help a lot in this particular case since we're querying almost all the dates we have. The index would help if we query a smaller date range, but still, it would be a matter of seconds, not milliseconds.
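As a side note, the Continue wait response from the first request is not an error: Cube.js returns it while a long-running query (or a pre-aggregation build) is still in progress, and the client is expected to retry the same request. The official JavaScript client handles this for you; below is a minimal hand-rolled sketch of the same polling loop, assuming a fetch-capable environment (the URL, token, and one-second delay are just placeholders):

// A minimal sketch (not the official client) of calling the REST API and
// retrying while Cube.js responds with "Continue wait".
const query = {
  measures: ['Orders.amount'],
  timeDimensions: [{
    dimension: 'Orders.createdAt',
    granularity: 'month',
    dateRange: ['1997-01-01', '2017-01-01']
  }]
};

async function loadWithRetry(apiUrl, token, query) {
  for (;;) {
    const res = await fetch(
      `${apiUrl}/load?query=${encodeURIComponent(JSON.stringify(query))}`,
      { headers: { Authorization: token } }
    );
    const body = await res.json();

    // While the query (or the pre-aggregation build) is still running,
    // Cube.js answers with an "error" of "Continue wait" -- retry later.
    if (body.error === 'Continue wait') {
      await new Promise((resolve) => setTimeout(resolve, 1000));
      continue;
    }

    return body;
  }
}

loadWithRetry('http://localhost:4000/cubejs-api/v1', 'EXAMPLE-API-TOKEN', query)
  .then((result) => console.log(result.data));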

We can significantly speed up this query by adding a pre-aggregation layer. To do this, add the following preAggregations block to schema/Orders.js:

preAggregations: {
  amountByCreated: {
    type: `rollup`,
    measureReferences: [amount],
    timeDimensionReference: createdAt,
    granularity: `month`
  }
}

The block above instructs Cube.js to build and use a rollup type of pre-aggregation when the “Orders.amount” measure and “Orders.createdAt” time dimension (with “month” granularity) are requested together. You can read more about pre-aggregation options in the documentation reference.

Now, once we send the same request, Cube.js will detect the pre-aggregation declaration and will start building it. Once it's built, it will query it and send the result back. All the subsequent queries will go to the pre-aggregation layer.

Here is how querying the pre-aggregation looks in the Cube.js logs:

Performing query completed:
{
  "queueSize": 1,
  "duration": 5,
  "queryKey": [
    "
      SELECT
        \"orders.created_at_month\" \"orders.created_at_month\",
        sum(\"orders.amount\") \"orders.amount\"
      FROM
        stb_pre_aggregations.orders_amount_by_created
      WHERE (
        \"orders.created_at_month\" >= ($1::timestamptz::timestamptz AT TIME ZONE 'UTC')
        AND
        \"orders.created_at_month\" <= ($2::timestamptz::timestamptz AT TIME ZONE 'UTC')
      )
      GROUP BY 1 ORDER BY 1 ASC LIMIT 10000
    ",
    [
      "1995-01-01T00:00:00Z",
      "2017-01-01T23:59:59Z"
    ],
    [
      [
        "
          CREATE TABLE
            stb_pre_aggregations.orders_amount_by_created
          AS SELECT
            date_trunc('month', (orders.created_at::timestamptz AT TIME ZONE 'UTC')) \"orders.created_at_month\",
            sum(orders.amount) \"orders.amount\"
          FROM
            public.orders AS orders
          GROUP BY 1
        ",
        []
      ]
    ]
  ]
}

As you can see, it now takes only 5 milliseconds (about 1,300 times faster) to get the same data. Note also that the SQL has changed: it now queries data from stb_pre_aggregations.orders_amount_by_created, the table Cube.js generated to store the pre-aggregation for this query. The second query in the log is the DDL statement that created this pre-aggregation table.

Pre-Aggregations Refresh

Cube.js also takes care of keeping pre-aggregations up to date. On an incoming request, and at most once every two minutes, Cube.js will initiate a refresh check.

You can set up a custom refresh check strategy by using refreshKey. By default, pre-aggregations are refreshed every hour.

If the result of the refresh check differs from the previous one, Cube.js will rebuild the pre-aggregation in the background and then hot swap it in place of the old one.
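
For example, here is a sketch of how the rollup from above could get a custom refreshKey; the SQL-based key (checking the newest created_at value) is an assumption chosen for illustration, not something the original post prescribes:

preAggregations: {
  amountByCreated: {
    type: `rollup`,
    measureReferences: [amount],
    timeDimensionReference: createdAt,
    granularity: `month`,

    // Assumed refresh strategy: consider the rollup stale whenever
    // the newest order changes, instead of relying on the default hourly check.
    refreshKey: {
      sql: `SELECT max(created_at) FROM orders`
    }
  }
}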

Next Steps

This guide is the first step to learning about pre-aggregations and how to start using them in your project. But there is much more you can do with them. You can find the pre-aggregations documentation reference here.

Also, here are some highlights with useful links to help you along the way.

Pre-aggregate queries across multiple cubes

Pre-aggregations work not only for measures and dimensions inside a single cube, but also across multiple joined cubes. If you have joined cubes, you can reference measures and dimensions from any part of the join tree. The example below shows how the Users.country dimension can be used with the Orders.count and Orders.revenue measures.

cube(`Orders`, {
  sql: `select * from orders`,

  joins: {
    Users: {
      relationship: `belongsTo`,
      sql: `${CUBE}.user_id = ${Users}.id`
    }
  },

  // ...

  preAggregations: {
    categoryAndDate: {
      type: `rollup`,
      measureReferences: [count, revenue],
      dimensionReferences: [Users.country],
      timeDimensionReference: createdAt,
      granularity: `day`
    }
  }
});

Generate pre-aggregations dynamically

Since pre-aggregations are part of the data schema, which is basically JavaScript code, you can dynamically create all the required pre-aggregations. This guide covers how you can dynamically generate a Cube.js schema.
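
As a rough sketch of the idea (the table list, the capitalize helper, and the measure and dimension names are all hypothetical), you can build cube definitions, including their pre-aggregations, in a plain JavaScript loop instead of writing each one by hand:

// Hypothetical example: create one cube per table, each with the same
// daily rollup pre-aggregation.
const tables = [`orders`, `refunds`];
const capitalize = (name) => name.charAt(0).toUpperCase() + name.slice(1);

tables.forEach((table) => {
  cube(capitalize(table), {
    sql: `select * from ${table}`,

    measures: {
      count: {
        type: `count`
      }
    },

    dimensions: {
      createdAt: {
        sql: `created_at`,
        type: `time`
      }
    },

    preAggregations: {
      countByDay: {
        type: `rollup`,
        measureReferences: [count],
        timeDimensionReference: createdAt,
        granularity: `day`
      }
    }
  });
});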

Time partitioning

You can instruct Cube.js to partition pre-aggregations by time using the partitionGranularity option. Instead of a single table for the whole pre-aggregation, Cube.js will generate a set of smaller tables. This can reduce refresh time and, for databases such as BigQuery, cost.

Time partitioning documentation reference.

preAggregations: {
  categoryAndDate: {
    type: `rollup`,
    measureReferences: [count],
    timeDimensionReference: createdAt,
    granularity: `day`,
    partitionGranularity: `month`
  }
}

Data Cube Lattices

Cube.js can automatically build rollup pre-aggregations without the need to specify which measures and dimensions to use. It learns from query history and selects an optimal set of measures and dimensions for a given query. Under the hood it uses the Data Cube Lattices approach.

This is very useful if you need a lot of pre-aggregations and don't know ahead of time exactly which ones. Using autoRollup saves you from manually coding all the possible aggregations.

You can find documentation for auto rollup here.

cube(`Orders`, {
  sql: `select * from orders`,

  preAggregations: {
    main: {
      type: `autoRollup`
    }
  }
});