One characteristic of time-series data workloads is that the dataset will grow very quickly. Without the proper data infrastructure, these large data volumes can cause slowdowns in primarily two areas: inserting the data into the database, and aggregating the data into summaries which are more useful to analyze.

The first we’ve discussed in detail in our blog posts about the underlying architectureof TimescaleDB: keeping storage and indexing data structures small, and minimizing memory usage, to support high ingest. As for the second, TimescaleDB 1.3 introduces new capabilities to make aggregation simple and easy.

In particular, TimescaleDB 1.3 introduces automated continuous aggregates, which can massively speed up workloads that need to process large amounts of data. In this blog post, we will describe what continuous aggregates are, how they work, and how you can use them to speed up your workloads.

(Special thanks to Gayathri Ayyappan and David Kohn for their work on this feature.)

What are continuous aggregates?

Say you have a table of temperature readings over time in a number of locations:

      time        |      location     |   temperature
------------------+-------------------+-----------------
2019/01/01 1:00am | New York | 68 F
2019/01/01 1:00am | Stockholm | 66 F
2019/01/01 2:00am | New York | 70 F
2019/01/01 2:00am | Stockholm | 60 F
... | ... | ...
2019/01/02 1:00am | New York | 72 F
2019/01/02 1:00am | Stockholm | 66 F
... | ... | ...

And you want the average temperature read per-day in each location:

      day         |      location     |  avg temperature
------------------+-------------------+-----------------
2019/01/01 | New York | 73 F
2019/01/01 | Stockholm | 70 F
2019/01/02 | New York | 72 F
2019/01/02 | Stockholm | 69 F

If you only need this average as a one-off, than you can simply calculate it with a query such as:

SELECT time_bucket(‘1 day’, time) as day,
location,
avg(temperature)
FROM temperatures
GROUP BY day, location;

But if you’re going to want to find the average temperature repeatedly, this is wasteful. Every time you perform the SELECT, the database will need to scan the entire table and recalculate the average. But most of the data has not changed, and so re-scanning it is redundant. Alternatively, you could store the full results of the query in another table (or materialized view).  But this quickly becomes unwieldy, because updating this table efficiently is cumbersome and complex.

Continuous aggregates solve this problem: they automatically, and in the background, maintain the results from the query, and allow you to retrieve them as you would any other data. A continuous aggregate looks just like a regular view.

A continuous aggregate for the aforementioned query can be created as easily as:

CREATE VIEW daily_average WITH (timescaledb.continuous)
AS SELECT time_bucket(‘1 day’, time) as Day,
location,
avg(temperature)
FROM temperatures
GROUP BY day, location;

And queried just like any other view:

SELECT * FROM daily_average;

That’s it! Unlike a regular view, a continuous aggregate does not perform the average when queried, and unlike a materialized view, it does not need to be refreshed manually. The view will be refreshed automatically in the background as new data is added, or old data is modified. This latter capability is fairly unique to TimescaleDB, which properly tracks when previous data is updated, or delayed data points are backfilled in older time intervals; the continuous aggregate will be automatically recomputed on this older data.  Further, since this is automatic, it doesn’t add any maintenance burden to your database, and since it runs in the background, continuous aggregates do not slow down INSERT operations.

Continuous aggregates work out of the box with a large number of aggregation functions [1], can work with any custom aggregation function as long as it is parallelizable, and you can even use more complex expressions on top of those aggregate functions, e.g., something like max(temperature)-min(temperature).

Continuous aggregates sound great, but how do they work?

At a very high level, a continuous aggregate consists of four parts:

  1. materialization hypertable: to store the aggregated data in.
  2. materialization engine: to aggregate data from the raw, underlying, table to the materialization table.
  3. An invalidation engine: to determine when data needs to be re-materialized, due to INSERTs, UPDATEs, or DELETEs within the materialized data.
  4. query engine: to access the aggregated data.

Of course, all of these parts need to be performant, otherwise Continuous Aggregates wouldn’t be worth using. In this section, we describe these components, and how their design is used to ensure good performance. Due to the way in which they interact with each other, we will go through the components in order: materialization hypertable, query engine, invalidation engine, and materialization engine.

Materialization Table and Data Model

A continuous aggregate takes raw data from the original hypertable, aggregates it, and stores intermediate state in a materialization hypertable. When you query the continuous aggregate view, the state is returned to you as needed.

For our temperature case above, the materialization table would look something like:

    day    |    location   |    chunk   |   avg temperature partial
-----------+---------------+------------+----------------------------
2019/01/01 | New York | 1 | {3, 219}
2019/01/01 | Stockholm | 1 | {4, 280}
2019/01/02 | New York | 2 | {3, 216}
2019/01/02 | Stockholm | 2 | {5, 345}

The data stored inside a materialization table consists of a column for each group-by clause in the query, a chunk column identifying the raw-data chunk this data came from, and a partial aggregate representation for each aggregate in the query. A partial is the intermediate form of an aggregation function, and it is what’s used internally to calculate the aggregate’s output. For instance, for avg the partial consists of a {count, sum} pair, representing the number of rows seen, and the sum of all their values.

For our purposes, the key feature of partials is that they can be combined with each other to create new partials spanning all of the old partials’ rows. This property is needed when combining groups that span multiple chunks. It is also key for additional features currently in development: creating aggregates at multiple time granularities and combining aggregates generated in the background with those created live from the raw data. For each query group originating from a given chunk, we will store one row with a partial representation for each aggregate in the query.

The materialization-table itself represents time-series data and is stored as a TimescaleDB hypertable, in order to take advantage of the scaling and query optimizations that hypertables offer over vanilla tables.

Query Engine

When you query the continuous aggregate view, the aggregate partials are combined into a single partial for each time range, and finalized into the value the user receives. In other words, to compute the average temperature, each partial sum is added up to the total sum, each partial count is added up to a total count, then the average is computed by total sum / total count.

In addition to this functionality, we are currently developing a version which always provides up-to-date aggregates by combining partials from the materialization table with partials calculated on-demand from the raw table, when needed.

Invalidation Engine

The Invalidation Engine is one of the core performance-critical pieces of the Continuous Aggregates. Any INSERT, UPDATE, or DELETE to a hypertable which has a continuous aggregate could potentially invalidate some materialized rows, and we need to ensure that the system does not become swamped with invalidations.

Fortunately, our data is time-series data, which has one important implication: nearly all INSERTs and UPDATEs happen near the portion of the data closest to the present. We design our invalidation engine around this assumption. We do not materialize all the way to the last inserted datapoint, but rather to some point behind that, called the materialization threshold.

This threshold is set so that the vast majority of INSERTs will contain timestamps greater than its value. These data points have never been materialized by the continuous aggregate, so there is no additional work needed to notify the continuous aggregate that they have been added. When the materializer next runs, it is responsible for determining how much new data can be materialized without risking the continuous aggregate will be invalidated. Having done this, it will materialize some of the more recent data and move the materialization threshold forward in time. This ensures that the threshold lags behind the point-in-time where data changes are common, and that most INSERTs do not require any extra writes.

When data is changed that lies below the threshold, we log the maximum and minimum timestamps whose rows were edited by the transaction. The materializer uses these values to determine which rows in the aggregation table need to be recalculated. The additional logging for old values does cause some write amplification, but since the materialization threshold lags behind the area of data that is currently changing, such writes are small and rare.

Materialization Engine

Materializing the continuous aggregate is a potentially long-running operation with two important goals: correctness and performance. In terms of correctness, we must ensure that all of our invalidations are logged when needed, and that our continuous aggregates will eventually reflect the latest data changes. On the other hand, materialization can take a long time, and data-modifying transactions must perform well even while the materialization is in progress.

We achieve this by having materialization use two transactions. In a quick first transaction, we block all INSERTs, UPDATEs, and DELETEs, determine the time period we will materialize, and update the invalidation threshold. In the second, other operations are unblocked as we perform the bulk of the work, materializing the aggregates. This ensures that the vast majority of the work does not interfere with other operations.

Why do we block data modification in the first transaction? For our invalidations to work, any data-modifying transaction must either be included in the materialized aggregation or be logged for the next materialization. Blocking data-modifying operations in the first transaction provides a convenient barrier we can use to decide which transactions need to be logged. It divides the transactions into two groups, those that happened before the threshold was updated, and those happened after. Those transactions that came before the threshold update will be included in the materialization and thus never require any additional work, while those that occur after must log their invalidations, and seeing the new threshold inform these transactions that they need to do so.

Using Continuous Aggregates

To test out continuous aggregates, follow our tutorial which uses a sample dataset. Before starting the tutorial, make sure you’ve upgraded to (or installed) TimescaleDB version 1.3.

Conclusion

After months of work, we are really excited to release continuous aggregates. For more information, take a look at our docs page.

If you are just getting started with TimescaleDB, check out our installation guide or try Timescale Cloud, which includes all community and enterprise features.

Have questions? Feel free to leave a comment in the section below or get in touch with us here.


[1] TimescaleDB’s Continuous Aggregates works with a wide range of built-in aggregate functions.  We’ve tested it on the following functions (listed in alphabetical order), and users can also define their own custom, aggregation functions, which will be able to immediately leverage Continuous Aggregates, provided these functions are parallel safe:

avg, 
bit_and,
bit_or, 
bool_and,
bool_or,
corr,
count,
covar_pop,
covar_samp,
every,
first,
histogram
last,
max,
min,
regr_avgx,
regr_avgy,
regr_count,
regr_intercept,
regr_r2,
regr_slope,
regr_sxx,
regr_sxy,
regr_syy,
stddev,
stddev_pop,
stddev_samp,
sum,
variance,
var_pop,
var_samp,

TimescaleDB1.3 的新特性——Continuous aggregates: faster queries with automatically maintained materialized views的更多相关文章

  1. 微软SMB 3.0文件共享协议新特性介绍

    SMB(*nix平台和Win NT4.0又称CIFS)协议是Windows平台标准文件共享协议.Linux平台通过samba来支持.SMB最新版本v3.0,在v2.0基础上针对WAN和分布式有改进.详 ...

  2. Xcode 8 的 Debug 新特性

    Contents OverView Static Analyzer Localizability Instance Cleanup Nullablility Runtime Issue View De ...

  3. 升级本地部署的CRM到Dynamics 365及部分新特性介绍。

    关注本人微信和易信公众号: 微软动态CRM专家罗勇 ,回复241或者20161226可方便获取本文,同时可以在第一间得到我发布的最新的博文信息,follow me!我的网站是 www.luoyong. ...

  4. Xcode 8 的 Debug 新特性 —- WWDC 2016 Session 410 & 412 学习笔记

    Contents OverView Static Analyzer Localizability Instance Cleanup Nullablility Runtime Issue View De ...

  5. Atitit opencv3.0  3.1 3.2 新特性attilax总结

    Atitit opencv3.0  3.1 3.2 新特性attilax总结 1. 3.0OpenCV 3 的改动在哪?1 1.1. 模块构成该看哪些模块?2 2. 3.1新特性 2015-12-21 ...

  6. Oracle 11gR2 RAC 新特性说明

    最近接触了一下Oracle 11g R2 的RAC,发现变化很大. 所以在自己动手做实验之前还是先研究下它的新特性比较好. 一.    官网介绍 先看一下Oracle 的官网文档里对RAC 新特性的一 ...

  7. PHP7新特性 What will be in PHP 7/PHPNG

    本文结合php官网和鸟哥相关文章总结: 官网:http://www.php7.ca/   https://wiki.php.net/phpng PHP7将在2015年10月正式发布,PHP7 ,将会是 ...

  8. PostgreSQL 9.5,带来 UPSERT 等新特性

    PostgreSQL 9.5于2016年1月7日正式发布,此版本主要带来了以下几个方面的特性: UPSERT, Row Level Security, and Big Data 1)UPSERTUPS ...

  9. [转]深入了解 CSS3 新特性

    简介 CSS 即层叠样式表(Cascading Stylesheet).Web 开发中采用 CSS 技术,可以有效地控制页面的布局.字体.颜色.背景和其它效果.只需要一些简单的修改,就可以改变网页的外 ...

随机推荐

  1. mysql中length与char_length字符长度函数使用方法

    在mysql中length是计算字段的长度一个汉字是算三个字符,一个数字或字母算一个字符了,与char_length是有一点区别,本文章重点介绍第一个函数. mysql里面的length函数是一个用来 ...

  2. Java函数式编程

    函数式编程 从JDK1.8开始为了简化使用者进行代码的开发,专门提供有lambda表达式的支持,利用此操作形式可以实现函数式的编程,对于函数编程比较著名的语言是:haskell.Scala,利用函数式 ...

  3. asp.net core 3.1 中Synchronous operations are disallowed. Call FlushAsync or set AllowSynchronousIO to true instead

    在Action中解决措施: var syncIOFeature = HttpContext.Features.Get<IHttpBodyControlFeature>(); if (syn ...

  4. EF CodeFirst Dome学习

    创建ConsoleDome控制台应用程序 从NuGet包管理器安装EntityFramework 创建DbContextDome类并继承DbContext public class DbContext ...

  5. Python进阶----反射(四个方法),函数vs方法(模块types 与 instance()方法校验 ),双下方法的研究

    Python进阶----反射(四个方法),函数vs方法(模块types 与 instance()方法校验 ),双下方法的研究 一丶反射 什么是反射: ​ 反射的概念是由Smith在1982年首次提出的 ...

  6. webpack练手项目之easySlide(三):commonChunks

    Hello,大家好. 在之前两篇文章中: webpack练手项目之easySlide(一):初探webpack webpack练手项目之easySlide(二):代码分割 与大家分享了webpack的 ...

  7. iOS多线程GCD简介(一)

    之前讲过多线程之NSOperation,今天来讲讲代码更加简洁和高效的GCD.下面说的内容都是基于iOS6以后和ARC下. Grand Central Dispatch (GCD)简介 Grand C ...

  8. ABAP和Java里的单例模式攻击

    面向对象编程世界里的单例模式(Singleton)可能是设计模式里最简单的一种,大多数开发人员都觉得可以很容易掌握它的用法.单例模式保证一个类仅有一个实例,并提供一个访问它的全局访问点. 然而在某些场 ...

  9. CNN原理

    卷积神经网络(Convolutional Neural Network)的结构类似于神经网络,可以看做是对其的改进.它利用局部连接.权值共享.多核卷积.池化四个手段大大降低了参数的数目,使得网络的层数 ...

  10. python SqlServer操作

    python连接微软的sql server数据库用的第三方模块叫做pymssql(document:http://www.pymssql.org/en/stable/index.html).在官方文档 ...