TimescaleDB1.3 的新特性——Continuous aggregates: faster queries with automatically maintained materialized views
One characteristic of time-series data workloads is that the dataset will grow very quickly. Without the proper data infrastructure, these large data volumes can cause slowdowns in primarily two areas: inserting the data into the database, and aggregating the data into summaries which are more useful to analyze.
The first we’ve discussed in detail in our blog posts about the underlying architectureof TimescaleDB: keeping storage and indexing data structures small, and minimizing memory usage, to support high ingest. As for the second, TimescaleDB 1.3 introduces new capabilities to make aggregation simple and easy.
In particular, TimescaleDB 1.3 introduces automated continuous aggregates, which can massively speed up workloads that need to process large amounts of data. In this blog post, we will describe what continuous aggregates are, how they work, and how you can use them to speed up your workloads.
(Special thanks to Gayathri Ayyappan and David Kohn for their work on this feature.)
What are continuous aggregates?
Say you have a table of temperature readings over time in a number of locations:
time | location | temperature
------------------+-------------------+-----------------
2019/01/01 1:00am | New York | 68 F
2019/01/01 1:00am | Stockholm | 66 F
2019/01/01 2:00am | New York | 70 F
2019/01/01 2:00am | Stockholm | 60 F
... | ... | ...
2019/01/02 1:00am | New York | 72 F
2019/01/02 1:00am | Stockholm | 66 F
... | ... | ...
And you want the average temperature read per-day in each location:
day | location | avg temperature
------------------+-------------------+-----------------
2019/01/01 | New York | 73 F
2019/01/01 | Stockholm | 70 F
2019/01/02 | New York | 72 F
2019/01/02 | Stockholm | 69 F
If you only need this average as a one-off, than you can simply calculate it with a query such as:
SELECT time_bucket(‘1 day’, time) as day,
location,
avg(temperature)
FROM temperatures
GROUP BY day, location;
But if you’re going to want to find the average temperature repeatedly, this is wasteful. Every time you perform the SELECT, the database will need to scan the entire table and recalculate the average. But most of the data has not changed, and so re-scanning it is redundant. Alternatively, you could store the full results of the query in another table (or materialized view). But this quickly becomes unwieldy, because updating this table efficiently is cumbersome and complex.
Continuous aggregates solve this problem: they automatically, and in the background, maintain the results from the query, and allow you to retrieve them as you would any other data. A continuous aggregate looks just like a regular view.
A continuous aggregate for the aforementioned query can be created as easily as:
CREATE VIEW daily_average WITH (timescaledb.continuous)
AS SELECT time_bucket(‘1 day’, time) as Day,
location,
avg(temperature)
FROM temperatures
GROUP BY day, location;
And queried just like any other view:
SELECT * FROM daily_average;
That’s it! Unlike a regular view, a continuous aggregate does not perform the average when queried, and unlike a materialized view, it does not need to be refreshed manually. The view will be refreshed automatically in the background as new data is added, or old data is modified. This latter capability is fairly unique to TimescaleDB, which properly tracks when previous data is updated, or delayed data points are backfilled in older time intervals; the continuous aggregate will be automatically recomputed on this older data. Further, since this is automatic, it doesn’t add any maintenance burden to your database, and since it runs in the background, continuous aggregates do not slow down INSERT operations.
Continuous aggregates work out of the box with a large number of aggregation functions [1], can work with any custom aggregation function as long as it is parallelizable, and you can even use more complex expressions on top of those aggregate functions, e.g., something like max(temperature)-min(temperature)
.
Continuous aggregates sound great, but how do they work?
At a very high level, a continuous aggregate consists of four parts:
- A materialization hypertable: to store the aggregated data in.
- A materialization engine: to aggregate data from the raw, underlying, table to the materialization table.
- An invalidation engine: to determine when data needs to be re-materialized, due to INSERTs, UPDATEs, or DELETEs within the materialized data.
- A query engine: to access the aggregated data.
Of course, all of these parts need to be performant, otherwise Continuous Aggregates wouldn’t be worth using. In this section, we describe these components, and how their design is used to ensure good performance. Due to the way in which they interact with each other, we will go through the components in order: materialization hypertable, query engine, invalidation engine, and materialization engine.
Materialization Table and Data Model
A continuous aggregate takes raw data from the original hypertable, aggregates it, and stores intermediate state in a materialization hypertable. When you query the continuous aggregate view, the state is returned to you as needed.
For our temperature case above, the materialization table would look something like:
day | location | chunk | avg temperature partial
-----------+---------------+------------+----------------------------
2019/01/01 | New York | 1 | {3, 219}
2019/01/01 | Stockholm | 1 | {4, 280}
2019/01/02 | New York | 2 | {3, 216}
2019/01/02 | Stockholm | 2 | {5, 345}
The data stored inside a materialization table consists of a column for each group-by clause in the query, a chunk column identifying the raw-data chunk this data came from, and a partial aggregate representation for each aggregate in the query. A partial is the intermediate form of an aggregation function, and it is what’s used internally to calculate the aggregate’s output. For instance, for avg
the partial consists of a {count, sum}
pair, representing the number of rows seen, and the sum of all their values.
For our purposes, the key feature of partials is that they can be combined with each other to create new partials spanning all of the old partials’ rows. This property is needed when combining groups that span multiple chunks. It is also key for additional features currently in development: creating aggregates at multiple time granularities and combining aggregates generated in the background with those created live from the raw data. For each query group originating from a given chunk, we will store one row with a partial representation for each aggregate in the query.
The materialization-table itself represents time-series data and is stored as a TimescaleDB hypertable, in order to take advantage of the scaling and query optimizations that hypertables offer over vanilla tables.
Query Engine
When you query the continuous aggregate view, the aggregate partials are combined into a single partial for each time range, and finalized into the value the user receives. In other words, to compute the average temperature, each partial sum is added up to the total sum, each partial count is added up to a total count, then the average is computed by total sum / total count.
In addition to this functionality, we are currently developing a version which always provides up-to-date aggregates by combining partials from the materialization table with partials calculated on-demand from the raw table, when needed.
Invalidation Engine
The Invalidation Engine is one of the core performance-critical pieces of the Continuous Aggregates. Any INSERT, UPDATE, or DELETE to a hypertable which has a continuous aggregate could potentially invalidate some materialized rows, and we need to ensure that the system does not become swamped with invalidations.
Fortunately, our data is time-series data, which has one important implication: nearly all INSERTs and UPDATEs happen near the portion of the data closest to the present. We design our invalidation engine around this assumption. We do not materialize all the way to the last inserted datapoint, but rather to some point behind that, called the materialization threshold.
This threshold is set so that the vast majority of INSERTs will contain timestamps greater than its value. These data points have never been materialized by the continuous aggregate, so there is no additional work needed to notify the continuous aggregate that they have been added. When the materializer next runs, it is responsible for determining how much new data can be materialized without risking the continuous aggregate will be invalidated. Having done this, it will materialize some of the more recent data and move the materialization threshold forward in time. This ensures that the threshold lags behind the point-in-time where data changes are common, and that most INSERTs do not require any extra writes.
When data is changed that lies below the threshold, we log the maximum and minimum timestamps whose rows were edited by the transaction. The materializer uses these values to determine which rows in the aggregation table need to be recalculated. The additional logging for old values does cause some write amplification, but since the materialization threshold lags behind the area of data that is currently changing, such writes are small and rare.
Materialization Engine
Materializing the continuous aggregate is a potentially long-running operation with two important goals: correctness and performance. In terms of correctness, we must ensure that all of our invalidations are logged when needed, and that our continuous aggregates will eventually reflect the latest data changes. On the other hand, materialization can take a long time, and data-modifying transactions must perform well even while the materialization is in progress.
We achieve this by having materialization use two transactions. In a quick first transaction, we block all INSERTs, UPDATEs, and DELETEs, determine the time period we will materialize, and update the invalidation threshold. In the second, other operations are unblocked as we perform the bulk of the work, materializing the aggregates. This ensures that the vast majority of the work does not interfere with other operations.
Why do we block data modification in the first transaction? For our invalidations to work, any data-modifying transaction must either be included in the materialized aggregation or be logged for the next materialization. Blocking data-modifying operations in the first transaction provides a convenient barrier we can use to decide which transactions need to be logged. It divides the transactions into two groups, those that happened before the threshold was updated, and those happened after. Those transactions that came before the threshold update will be included in the materialization and thus never require any additional work, while those that occur after must log their invalidations, and seeing the new threshold inform these transactions that they need to do so.
Using Continuous Aggregates
To test out continuous aggregates, follow our tutorial which uses a sample dataset. Before starting the tutorial, make sure you’ve upgraded to (or installed) TimescaleDB version 1.3.
Conclusion
After months of work, we are really excited to release continuous aggregates. For more information, take a look at our docs page.
If you are just getting started with TimescaleDB, check out our installation guide or try Timescale Cloud, which includes all community and enterprise features.
Have questions? Feel free to leave a comment in the section below or get in touch with us here.
[1] TimescaleDB’s Continuous Aggregates works with a wide range of built-in aggregate functions. We’ve tested it on the following functions (listed in alphabetical order), and users can also define their own custom, aggregation functions, which will be able to immediately leverage Continuous Aggregates, provided these functions are parallel safe:
avg,
bit_and,
bit_or,
bool_and,
bool_or,
corr,
count,
covar_pop,
covar_samp,
every,
first,
histogram
last,
max,
min,
regr_avgx,
regr_avgy,
regr_count,
regr_intercept,
regr_r2,
regr_slope,
regr_sxx,
regr_sxy,
regr_syy,
stddev,
stddev_pop,
stddev_samp,
sum,
variance,
var_pop,
var_samp,
TimescaleDB1.3 的新特性——Continuous aggregates: faster queries with automatically maintained materialized views的更多相关文章
- 微软SMB 3.0文件共享协议新特性介绍
SMB(*nix平台和Win NT4.0又称CIFS)协议是Windows平台标准文件共享协议.Linux平台通过samba来支持.SMB最新版本v3.0,在v2.0基础上针对WAN和分布式有改进.详 ...
- Xcode 8 的 Debug 新特性
Contents OverView Static Analyzer Localizability Instance Cleanup Nullablility Runtime Issue View De ...
- 升级本地部署的CRM到Dynamics 365及部分新特性介绍。
关注本人微信和易信公众号: 微软动态CRM专家罗勇 ,回复241或者20161226可方便获取本文,同时可以在第一间得到我发布的最新的博文信息,follow me!我的网站是 www.luoyong. ...
- Xcode 8 的 Debug 新特性 —- WWDC 2016 Session 410 & 412 学习笔记
Contents OverView Static Analyzer Localizability Instance Cleanup Nullablility Runtime Issue View De ...
- Atitit opencv3.0 3.1 3.2 新特性attilax总结
Atitit opencv3.0 3.1 3.2 新特性attilax总结 1. 3.0OpenCV 3 的改动在哪?1 1.1. 模块构成该看哪些模块?2 2. 3.1新特性 2015-12-21 ...
- Oracle 11gR2 RAC 新特性说明
最近接触了一下Oracle 11g R2 的RAC,发现变化很大. 所以在自己动手做实验之前还是先研究下它的新特性比较好. 一. 官网介绍 先看一下Oracle 的官网文档里对RAC 新特性的一 ...
- PHP7新特性 What will be in PHP 7/PHPNG
本文结合php官网和鸟哥相关文章总结: 官网:http://www.php7.ca/ https://wiki.php.net/phpng PHP7将在2015年10月正式发布,PHP7 ,将会是 ...
- PostgreSQL 9.5,带来 UPSERT 等新特性
PostgreSQL 9.5于2016年1月7日正式发布,此版本主要带来了以下几个方面的特性: UPSERT, Row Level Security, and Big Data 1)UPSERTUPS ...
- [转]深入了解 CSS3 新特性
简介 CSS 即层叠样式表(Cascading Stylesheet).Web 开发中采用 CSS 技术,可以有效地控制页面的布局.字体.颜色.背景和其它效果.只需要一些简单的修改,就可以改变网页的外 ...
随机推荐
- C++ 的多继承与虚继承
C++之多继承与虚继承 1. 多继承 1.1 多继承概念 一个类有多个直接基类的继承关系称为多继承 多继承声明语法 class 派生类名 : 访问控制 基类名1, 访问控制 基类名2, ... { ...
- tkinter基础-输入框、文本框
本节内容 了解输入框.文本框的使用方法 利用1制作简易界面 首先明确上面由几个元素组成:该界面由界面标题,输入框.两个按钮.文本框组成. 该界面我们需要实现的功能: 在输入框中输入文字,点击inser ...
- netcore 实现跨应用的分布式session
需求场景 网站a,域名为 a.site.com 网站b, 域名为 b.site.com 需要在a.b两个站点之间共享session 解决方案 使用redis作为分布式缓存存储 设置sessionId ...
- 关键字ref、out
通常,变量作为参数进行传递时,不论在方法内进行了什么操作,其原始初始化的值都不会被影响: 例如: public void TestFun1() { ; TestFun2(arg); Console.W ...
- shell编程必须要掌握的命令-xargs
一,说xargs命令前,说一下什么是shell编程 什么是shell编程呢,说白了就是按一定的规则把各种命令组织起来,完成一定的事情.纯属个人理解,哈哈.不管是交互式的shell,还是非交互的shel ...
- aria2 资料
https://www.jianshu.com/p/8124b5b6ef95https://quan.ithome.com/0/331/853.htmhttp://www.360doc.com/con ...
- 不安全的验证码Insecure CAPTCHA
没啥好讲的,当验证不合格时,通过burp抓包工具修改成符合要求的数据包.修改参数标志位.USER-AGENT之类的参数. 防御 加强验证,Anti-CSRF token机制防御CSRF攻击,利用PDO ...
- Privoxy搭建代理服务器
Privoxy搭建代理服务器 Docker Hub镜像地址 Dockerfile FROM alpine EXPOSE 8118 RUN apk --no-cache --update add pri ...
- Vue.js详解
vuejs介绍 Vue.js是当下很火的一个JavaScript MVVM库,它是以数据驱动和组件化的思想构建的.相比于Angular.js,Vue.js提供了更加简洁.更易于理解的API,使得我们能 ...
- Java 之 MyBatis(一)入门
一.Mybatis 框架概述 (1)mybatis 是一个优秀的基于 java 的持久层框架,它内部封装了 jdbc,使开发者只需要关注 sql 语句本身,而不需要花费精力去处理加载驱动.创建连接.创 ...