One characteristic of time-series workloads is that the dataset grows very quickly. Without the proper data infrastructure, these large data volumes can cause slowdowns primarily in two areas: inserting the data into the database, and aggregating the data into summaries that are more useful to analyze.

We’ve discussed the first in detail in our blog posts about the underlying architecture of TimescaleDB: keeping storage and indexing data structures small, and minimizing memory usage, to support high ingest rates. As for the second, TimescaleDB 1.3 introduces new capabilities to make aggregation simple and easy.

In particular, TimescaleDB 1.3 introduces automated continuous aggregates, which can massively speed up workloads that need to process large amounts of data. In this blog post, we will describe what continuous aggregates are, how they work, and how you can use them to speed up your workloads.

(Special thanks to Gayathri Ayyappan and David Kohn for their work on this feature.)

What are continuous aggregates?

Say you have a table of temperature readings over time in a number of locations:

  time              | location  | temperature
  ------------------+-----------+-------------
  2019/01/01 1:00am | New York  | 68 F
  2019/01/01 1:00am | Stockholm | 66 F
  2019/01/01 2:00am | New York  | 70 F
  2019/01/01 2:00am | Stockholm | 60 F
  ...               | ...       | ...
  2019/01/02 1:00am | New York  | 72 F
  2019/01/02 1:00am | Stockholm | 66 F
  ...               | ...       | ...

And you want the average temperature per day in each location:

  day        | location  | avg temperature
  -----------+-----------+-----------------
  2019/01/01 | New York  | 73 F
  2019/01/01 | Stockholm | 70 F
  2019/01/02 | New York  | 72 F
  2019/01/02 | Stockholm | 69 F

If you only need this average as a one-off, then you can simply calculate it with a query such as:

  SELECT time_bucket('1 day', time) AS day,
         location,
         avg(temperature)
  FROM temperatures
  GROUP BY day, location;

But if you want to find the average temperature repeatedly, this is wasteful. Every time you perform the SELECT, the database needs to scan the entire table and recalculate the average, even though most of the data has not changed; re-scanning it is redundant. Alternatively, you could store the full results of the query in another table (or a materialized view), but this quickly becomes unwieldy, because updating that table efficiently is cumbersome and complex.
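For comparison, here is a sketch of that do-it-yourself approach using a standard PostgreSQL materialized view (the view name is ours):

  CREATE MATERIALIZED VIEW daily_average_manual AS
  SELECT time_bucket('1 day', time) AS day,
         location,
         avg(temperature)
  FROM temperatures
  GROUP BY day, location;

  -- Each refresh recomputes the entire result from scratch:
  REFRESH MATERIALIZED VIEW daily_average_manual;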

Continuous aggregates solve this problem: they automatically, and in the background, maintain the results from the query, and allow you to retrieve them as you would any other data. A continuous aggregate looks just like a regular view.

A continuous aggregate for the aforementioned query can be created as easily as:

  CREATE VIEW daily_average WITH (timescaledb.continuous)
  AS SELECT time_bucket('1 day', time) AS day,
            location,
            avg(temperature)
  FROM temperatures
  GROUP BY day, location;

And queried just like any other view:

  SELECT * FROM daily_average;

That’s it! Unlike a regular view, a continuous aggregate does not recompute the average every time it is queried, and unlike a materialized view, it does not need to be refreshed manually. The view is refreshed automatically in the background as new data is added or old data is modified. This latter capability is fairly unique to TimescaleDB, which properly tracks when previous data is updated or delayed data points are backfilled into older time intervals, and automatically recomputes the continuous aggregate over this older data. Further, since this is automatic, it doesn’t add any maintenance burden to your database, and since it runs in the background, continuous aggregates do not slow down INSERT operations.
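For example (the values here are illustrative), a late-arriving reading for an older interval is picked up without any manual refresh:

  -- A delayed data point backfilled into an older time interval:
  INSERT INTO temperatures VALUES ('2019-01-01 03:00', 'New York', 69);
  -- On its next run, the background materializer recomputes the
  -- affected bucket, and daily_average reflects the new row.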

Continuous aggregates work out of the box with a large number of aggregation functions [1], can work with any custom aggregation function as long as it is parallelizable, and you can even use more complex expressions on top of those aggregate functions, e.g., something like max(temperature)-min(temperature).
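For instance, here is a sketch of a continuous aggregate built around such an expression (the view and column names are ours):

  CREATE VIEW daily_temperature_range WITH (timescaledb.continuous)
  AS SELECT time_bucket('1 day', time) AS day,
            location,
            max(temperature) - min(temperature) AS temperature_range
  FROM temperatures
  GROUP BY day, location;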

Continuous aggregates sound great, but how do they work?

At a very high level, a continuous aggregate consists of four parts:

  1. A materialization hypertable, to store the aggregated data in.
  2. A materialization engine, to aggregate data from the raw, underlying table into the materialization table.
  3. An invalidation engine, to determine when data needs to be re-materialized, due to INSERTs, UPDATEs, or DELETEs within the materialized data.
  4. A query engine, to access the aggregated data.

Of course, all of these parts need to be performant, otherwise continuous aggregates wouldn’t be worth using. In this section, we describe these components and how their design ensures good performance. Due to the way in which they interact with each other, we will go through the components in order: materialization hypertable, query engine, invalidation engine, and materialization engine.

Materialization Table and Data Model

A continuous aggregate takes raw data from the original hypertable, aggregates it, and stores intermediate state in a materialization hypertable. When you query the continuous aggregate view, the state is returned to you as needed.

For our temperature case above, the materialization table would look something like:

  day        | location  | chunk | avg temperature partial
  -----------+-----------+-------+-------------------------
  2019/01/01 | New York  |     1 | {3, 219}
  2019/01/01 | Stockholm |     1 | {4, 280}
  2019/01/02 | New York  |     2 | {3, 216}
  2019/01/02 | Stockholm |     2 | {5, 345}

The data stored inside a materialization table consists of a column for each group-by clause in the query, a chunk column identifying the raw-data chunk this data came from, and a partial aggregate representation for each aggregate in the query. A partial is the intermediate form of an aggregation function, and it is what’s used internally to calculate the aggregate’s output. For instance, for avg the partial consists of a {count, sum} pair, representing the number of rows seen, and the sum of all their values.
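To make this concrete, the information inside avg’s partial state can be computed explicitly from the raw data (illustration only; the real partial is opaque internal state):

  SELECT time_bucket('1 day', time) AS day,
         location,
         count(temperature) AS count_part,
         sum(temperature)   AS sum_part
  FROM temperatures
  GROUP BY day, location;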

For our purposes, the key feature of partials is that they can be combined with each other to create new partials spanning all of the old partials’ rows. This property is needed when combining groups that span multiple chunks. It is also key for additional features currently in development: creating aggregates at multiple time granularities and combining aggregates generated in the background with those created live from the raw data. For each query group originating from a given chunk, we will store one row with a partial representation for each aggregate in the query.
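As a sketch of why combinability matters, a coarser (e.g., weekly) average could be assembled from the stored daily partials instead of rescanning the raw data (the table and column names here are hypothetical):

  SELECT time_bucket('7 days', day) AS week,
         location,
         sum(sum_part) / sum(count_part) AS avg_temperature
  FROM daily_average_materialization
  GROUP BY week, location;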

The materialization table itself represents time-series data and is stored as a TimescaleDB hypertable, in order to take advantage of the scaling and query optimizations that hypertables offer over vanilla tables.

Query Engine

When you query the continuous aggregate view, the aggregate partials are combined into a single partial for each time range and finalized into the value the user receives. In other words, to compute the average temperature, the partial sums are added into a total sum, the partial counts are added into a total count, and the average is computed as total sum / total count.
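In terms of the example materialization rows above: the partial for 2019/01/01 in New York is {count: 3, sum: 219}, so the finalized average is 219 / 3 = 73 F, matching the earlier table. Conceptually, the view’s query does something like the following (names hypothetical):

  SELECT day,
         location,
         sum(sum_part) / sum(count_part) AS avg_temperature
  FROM daily_average_materialization
  GROUP BY day, location;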

In addition to this functionality, we are currently developing a version which always provides up-to-date aggregates by combining partials from the materialization table with partials calculated on-demand from the raw table, when needed.

Invalidation Engine

The invalidation engine is one of the core performance-critical pieces of continuous aggregates. Any INSERT, UPDATE, or DELETE to a hypertable which has a continuous aggregate could potentially invalidate some materialized rows, and we need to ensure that the system does not become swamped with invalidations.

Fortunately, our data is time-series data, which has one important implication: nearly all INSERTs and UPDATEs happen near the portion of the data closest to the present. We design our invalidation engine around this assumption. We do not materialize all the way to the last inserted datapoint, but rather to some point behind that, called the materialization threshold.

This threshold is set so that the vast majority of INSERTs will contain timestamps greater than its value. These data points have never been materialized by the continuous aggregate, so there is no additional work needed to notify the continuous aggregate that they have been added. When the materializer next runs, it is responsible for determining how much new data can be materialized without risking that the continuous aggregate will be invalidated. Having done this, it materializes some of the more recent data and moves the materialization threshold forward in time. This ensures that the threshold lags behind the point-in-time where data changes are common, and that most INSERTs do not require any extra writes.

When data that lies below the threshold is changed, we log the minimum and maximum timestamps of the rows edited by the transaction. The materializer uses these values to determine which rows in the aggregation table need to be recalculated. The additional logging for old values does cause some write amplification, but since the materialization threshold lags behind the area of data that is currently changing, such writes are small and rare.
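As an illustration, a backfill below the threshold produces a small log entry rather than an immediate recomputation (the logged entry shown is conceptual, not an actual catalog format):

  UPDATE temperatures
  SET temperature = 67
  WHERE location = 'Stockholm'
    AND time BETWEEN '2019-01-01 01:00' AND '2019-01-01 02:00';
  -- Conceptually, one invalidation entry is logged:
  --   (hypertable, lowest_modified = '2019-01-01 01:00',
  --    greatest_modified = '2019-01-01 02:00')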

Materialization Engine

Materializing the continuous aggregate is a potentially long-running operation with two important goals: correctness and performance. In terms of correctness, we must ensure that all of our invalidations are logged when needed, and that our continuous aggregates will eventually reflect the latest data changes. On the other hand, materialization can take a long time, and data-modifying transactions must perform well even while the materialization is in progress.

We achieve this by having materialization use two transactions. In a quick first transaction, we block all INSERTs, UPDATEs, and DELETEs, determine the time period we will materialize, and update the invalidation threshold. In the second, other operations are unblocked as we perform the bulk of the work, materializing the aggregates. This ensures that the vast majority of the work does not interfere with other operations.
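A conceptual sketch of the two-transaction scheme (pseudocode in SQL comments; not an actual API):

  BEGIN;  -- transaction 1: short; briefly blocks INSERT/UPDATE/DELETE
  -- read the invalidation log, choose the range to materialize,
  -- and advance the invalidation threshold
  COMMIT;

  BEGIN;  -- transaction 2: long; concurrent writes proceed normally
  -- aggregate the chosen range from the raw chunks into the
  -- materialization hypertable
  COMMIT;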

Why do we block data modification in the first transaction? For our invalidations to work, any data-modifying transaction must either be included in the materialized aggregation or be logged for the next materialization. Blocking data-modifying operations in the first transaction provides a convenient barrier we can use to decide which transactions need to be logged. It divides the transactions into two groups: those that happened before the threshold was updated, and those that happened after. Transactions that came before the threshold update will be included in the materialization and thus never require any additional work, while those that occur after must log their invalidations; seeing the new threshold informs these transactions that they need to do so.

Using Continuous Aggregates

To test out continuous aggregates, follow our tutorial which uses a sample dataset. Before starting the tutorial, make sure you’ve upgraded to (or installed) TimescaleDB version 1.3.
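You can check the installed extension version, and upgrade it from a fresh session, with:

  SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';
  ALTER EXTENSION timescaledb UPDATE;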

Conclusion

After months of work, we are really excited to release continuous aggregates. For more information, take a look at our docs page.

If you are just getting started with TimescaleDB, check out our installation guide or try Timescale Cloud, which includes all community and enterprise features.

Have questions? Feel free to leave a comment in the section below or get in touch with us here.


[1] TimescaleDB’s continuous aggregates work with a wide range of built-in aggregate functions. We’ve tested the feature on the following functions (listed in alphabetical order), and users can also define their own custom aggregation functions, which will immediately be able to leverage continuous aggregates, provided these functions are parallel safe (see the sketch after this list):

avg, bit_and, bit_or, bool_and, bool_or, corr, count, covar_pop, covar_samp, every, first, histogram, last, max, min, regr_avgx, regr_avgy, regr_count, regr_intercept, regr_r2, regr_slope, regr_sxx, regr_sxy, regr_syy, stddev, stddev_pop, stddev_samp, sum, var_pop, var_samp, variance.
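
As a sketch of what a parallel-safe custom aggregate looks like in PostgreSQL (the function and aggregate names are ours), the key is to provide a combine function and mark everything PARALLEL SAFE:

  -- State transition function: accumulate the sum of squares.
  CREATE FUNCTION sum_sq_sfunc(state double precision, val double precision)
  RETURNS double precision
  LANGUAGE sql IMMUTABLE PARALLEL SAFE
  AS $$ SELECT state + val * val $$;

  -- Combine two partial states (required for parallel aggregation).
  CREATE FUNCTION sum_sq_combine(s1 double precision, s2 double precision)
  RETURNS double precision
  LANGUAGE sql IMMUTABLE PARALLEL SAFE
  AS $$ SELECT s1 + s2 $$;

  CREATE AGGREGATE sum_squares(double precision) (
      SFUNC = sum_sq_sfunc,
      STYPE = double precision,
      COMBINEFUNC = sum_sq_combine,
      INITCOND = '0',
      PARALLEL = SAFE
  );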
