This is the second post in a series discussing the architecture and implementation of massively parallel databases such as Vertica, BigQuery, or EventQL. The target audience is software and systems engineers with an interest in databases and distributed systems.

In the last post we saw that in order to execute interactive queries on a large dataset, we have to split the data up into smaller partitions and put each partition on its own server. This way we can utilize the combined processing power of all servers to answer the query rapidly.

The problem we'll discuss today is how exactly we're going to split up a given dataset into partitions and distribute them among servers.

Say we have a table containing a large number of rows, a couple billion or so. The total size of the table is around 100TB. Our task is to distribute the rows uniformly among 20 servers, i.e. put roughly 5TB of the table on each server.

Of course, solving that task is trivial: We read in our 100TB source table, write the first 5TB of rows to the first server, the next 5TB to the second server and so on.

While this simplistic scheme works well for a static dataset, we'll have to be more clever if we are to implement an entire database that supports adding and modifying rows.

Why? Consider this: If we want to modify a row in our naively partitioned table, we first have to figure out which server we put the row on when we split the table into pieces. Since we never recorded that anywhere, we have to search through the rows until we hit the correct one. In the worst case we would have to examine all rows on all servers to find any single row - the whole 100TB of data.

In more technical terms: Locating a row has linear complexity. Finding a row in a table containing one thousand rows takes one thousand times longer than finding it in a table containing just one row. It gets slower and slower as we add more rows.

Clearly, our simplistic solution will not scale: We need a more efficient way to figure out on which server a given row is stored.

To quickly tell the location of a specific row, we could store an index file somewhere that records the location of each row. We could then do a quick lookup into our index to find the correct server instead of searching through all the data.
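As a rough sketch (in Python, which we'll also use for the other illustrations in this post), such an index could be as simple as a map from row IDs to server numbers:

    # A hypothetical per-row index: one entry for every row in the table.
    row_index = {}

    def record_row_location(row_id, server_id):
        row_index[row_id] = server_id

    def find_row(row_id):
        # A single lookup instead of searching through all servers...
        return row_index[row_id]

    # ...but the index grows by one entry for every row we add.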

Sadly, this doesn't solve the problem. The index file would still have linear complexity. Although this time it wouldn't be linear in time, but in space: If we were storing a bazillion rows in our table, our index file would also have a bazillion entries.

Essentially, we would just be rephrasing the problem statement from "How do we partition a large table?" to "How do we partition a large index file?".

We'll have to come up with a partitioning scheme that allows us to find rows with less than linear complexity. That is, we have to find an algorithm that can correctly compute the location of any single row, but doesn't get slower and slower as we add more rows to the table.

Modulo Hashing

One such algorithm is called modulo hashing. The good thing about modulo hashing is that it's not only very efficient but also extremely simple to implement.

If we want to partition our input table using modulo hashing, we first have to assign an identifier ID to every row in the table. This ID is usually derived from the row itself, for example by designating one of the table's columns as the primary key. For illustration purposes, we will use numeric identifiers, but the same approach works with strings.

The only piece of information that modulo hashing keeps is a single variable N. This variable N contains the number of servers among which the table should be partitioned.

Now, to figure out on which server a given row belongs, we simply compute the remainder of dividing the row's ID by N. This use of the modulo operation is also where the algorithm derives its name from.

    find_row(row_id) {
        server_id := row_id % N;
        return server_id;
    }

If, for example, we wanted to locate the row with ID=123 in a table partitioned among 8 servers (N=8), the row would be stored on server number 3 (123 % 8 = 3).

Of course, this is assuming we have also used the algorithm to decide on which server to put each row while loading the input table in the first place.
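As a minimal end-to-end sketch (the row IDs and payloads are made up), loading and then locating rows with modulo hashing might look like this:

    N = 8  # number of servers

    def find_row(row_id):
        # A row's location depends only on its ID and on N.
        return row_id % N

    # Loading: place every row on the server the formula picks for it.
    rows = {123: "alice", 856: "bob", 923: "carol"}   # made-up IDs and payloads
    servers = [[] for _ in range(N)]
    for row_id, payload in rows.items():
        servers[find_row(row_id)].append((row_id, payload))

    # Locating: recompute the same formula - no index required.
    print(find_row(123))  # 3, matching 123 % 8 = 3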

Modulo hashing works out so that every possible ID is consistently mapped to a single server. The distribution of the rows among servers will be approximately uniform, i.e. every server will get roughly the same number of rows.

The modulo hashing algorithm is a huge improvement over our naive approach as it is constant in space and time: Locating a row takes the same amount of time regardless of the total number of rows in the table. And the only pieces of information we need to tell where a row goes are the row's ID and the total number of servers N, which will only take a few bytes to store. Not bad.

So are we done? I'm afraid we're not.

One thing modulo hashing can't handle well is a growing dataset. If we're continually adding more rows to the table, we will eventually have to add more capacity and increase the number of servers N. However, once we do that, the locations of almost all of the rows change, since changing N also changes the result of almost every modulo operation.

This means that in order to increase the number of servers N and still keep the rows where they belong, we would have to copy nearly every single row in the table to its new location every time we add or remove a server.
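To get a feeling for how bad this is, here is a small sketch that counts how many of one million synthetic row IDs would land on a different server after growing from 8 to 9 servers. The exact ratio depends on the IDs, but it is typically the vast majority:

    # Count the row IDs that map to a different server after N changes.
    old_n, new_n = 8, 9
    moved = sum(1 for row_id in range(1_000_000)
                if row_id % old_n != row_id % new_n)
    print(moved / 1_000_000)  # roughly 0.89 - almost every row has to move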

That's not exactly ideal: Even if we did not care about the massive overhead of copying every single row, we would eventually reach a point where our table grows faster than we can rebalance it.

Consistent Hashing

Consistent hashing is a more involved version of modulo hashing. Its main improvement is that it allows us to add and remove servers without relocating almost all of the rows.

At the heart of the consistent hashing algorithm is a so-called circular keyspace. Before we discuss what exactly that means, let's define the term keyspace:

As with modulo hashing, we need to assign an identifier ID to each row. Now, the keyspace is the range of all valid ID values. If the ID is numeric, the keyspace is the range from negative infinity to positive infinity.

A keyspace and the positions of three row identifiers within the keyspace.

Within the keyspace, identifiers are ordered. That means each ID has a successor and a predecessor. The successor of an ID is the ID that comes immediately after it and the predecessor is the one that comes immediately before it.

In the illustration above, the successor of green (ID=123) is red (ID=856), the successor of red (ID=856) is yellow (ID=923) and so on.

But what is the successor of yellow (ID=923)? Does it have one? The answer is not entirely clear. However, we will later have to come up with a successor for each possible position in the keyspace, so we have to define what the successor of the last position in our keyspace is.

Imagine we glued one end of our keyspace to the other end:

A circular keyspace and the positions of three row identifiers within the keyspace.

Now we can clearly say that, in clockwise order, yellow's (ID=923) successor is green (ID=123). Also, it finally looks like a circular keyspace!

Back to consistent hashing. Like we did with modulo hashing, we will choose an initial number of servers N. For each of the N servers, we will put a marker at a random position in the circular keyspace.

To decide on which server a given row belongs, we first locate the position of the row's ID in the circular keyspace and then search clockwise for the next server marker. In other words: each row goes onto the server whose marker immediately succeeds the row's ID position in the keyspace.

The illustration below shows a circular keyspace with eight server markers. In the illustration, the succeeding server marker for row 123 is server 1, while the succeeding server marker for rows 856 and 923 is server 3.

So row 123 gets stored on server 1 while rows 856 and 923 get stored on server 3.

A circular keyspace with three row identifiers and eight server markers.

As with the modulo hashing scheme, we still end up with an approximately uniform distribution of rows among servers and can quickly locate each row. The only information we have to store is the position of each server's marker in the keyspace.
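As a sketch of that lookup, we can keep the marker positions in a sorted list and binary-search it. The positions below are made up (assuming, for simplicity, a keyspace of 0 to 999), but chosen to reproduce the example above:

    import bisect

    # Made-up marker positions; in practice they might be random or derived
    # from hashing each server's name onto the ring.
    markers = [(50, "server 0"), (180, "server 1"), (310, "server 2"),
               (455, "server 4"), (600, "server 5"), (760, "server 6"),
               (840, "server 7"), (950, "server 3")]
    positions = [pos for pos, _ in markers]

    def find_server(row_id):
        # Find the first marker at or after the row's position; wrap around
        # to the first marker if the row lies past the last one on the ring.
        i = bisect.bisect_left(positions, row_id) % len(markers)
        return markers[i][1]

    print(find_server(123))  # server 1
    print(find_server(856))  # server 3
    print(find_server(923))  # server 3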

Additionally, any server marker that we add or remove will only affect the rows immediately between it and the previous marker: The locations of all other rows remain consistent, hence the name consistent hashing.

This means we can add or remove servers and only have to move a small subset of the rows to their new locations. The ratio of rows that needs to be moved is roughly 1/N, where N is the number of servers. So as we add more servers, the percentage of rows that needs to be moved actually gets smaller.
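We can check that claim with another small sketch: pick random marker positions, add one more marker, and count how many random row IDs end up on a different server. The exact fraction depends on where each new marker happens to land, but it stays in the neighborhood of 1/N:

    import bisect, random

    random.seed(7)
    keyspace = 2**32
    markers = sorted(random.randrange(keyspace) for _ in range(20))
    rows = [random.randrange(keyspace) for _ in range(20_000)]

    def owner(marker_list, row_id):
        i = bisect.bisect_left(marker_list, row_id) % len(marker_list)
        return marker_list[i]   # the marker's position identifies its server

    # Average over a few server additions to smooth out the randomness.
    moved, trials = 0, 20
    for _ in range(trials):
        new_markers = sorted(markers + [random.randrange(keyspace)])
        moved += sum(owner(markers, r) != owner(new_markers, r) for r in rows)
    print(moved / (trials * len(rows)))  # a few percent, vs. ~89% for modulo hashing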

Can we still do better? It depends on your use case. Consistent hashing is successfully employed in a number of popular key/value databases such as DynamoDB, Cassandra, or memcache. Nevertheless, here are two things that we could improve about consistent hashing:

Firstly, consistent hashing only supports an exact lookup operation. That is, we can only find a row quickly if we already know its ID. If we want to find the locations of all rows in a range of IDs, for example all rows with an ID between 100 and 200, we're back to scanning the full table.

Because of this, consistent hashing is particularly well suited for key/value databases where range scans are not required, but less so for OLAP systems like EventQL.

The other possible improvement is that we still have to copy roughly 1/N of the table's rows after changing the number of servers. If we had 100TB on 20 servers, that would mean we're - realistically - still copying at least 15TB for every server addition. Not too bad, but still a lot of overhead network traffic.

The BigTable Algorithm

The last algorithm we will look at today is best known for its publication in Google's BigTable paper.

The bigtable algorithm takes a completely different approach to the problem, but it also starts by defining a keyspace. Except this time it's not a circular keyspace, but a linear one.

The illustration below shows a bigtable keyspace and the positions of three rows with the identifiers 123, 856 and 923 within the keyspace.

The next thing bigtable does is to split up the keyspace into a number of partitions that are defined by their start and end positions, i.e. by the lowest and highest row identifiers that will still be contained in the partition.

The illustration below shows a keyspace that is split into five partitions A-E. In the illustration, the row with ID=123 goes into partition B and the rows with ID=856 and ID=923 both go into partition D.

Three record identifiers are mapped onto five partitions

Now, the clever bit about the bigtable algorithm is how it comes up with the partition boundaries. To see why it's clever we have to understand why we can't simply divide the keyspace into equal parts without knowing the exact distribution of the input data:

One reason is that if you split up the range from negative to positive infinity into equally sized, discrete partitions, you end up with an infinite number of partitions.

The other reason is that it's highly likely that the row identifiers will all be piled up in a small area of the keyspace. After all, the identifiers might be user-supplied, so we can't necessarily guarantee anything about their distribution.

Realistic distribution of record identifiers in the keyspace

So it could be that even though we have split up the keyspace into a large number of partitions, all rows actually end up in the same partition. And we can't solve the problem by making the partitions infinitesimally small either - that would be like going back to keeping track of every row's location individually, just a bit worse.
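A tiny sketch makes this concrete: if we cut a keyspace of 0 to 999 into ten equal-width partitions, but the (made-up) row identifiers are clustered - say, keys handed out by a counter - almost everything lands in a single partition:

    # Ten equal-width partitions over the keyspace 0..999.
    boundaries = [(i * 100, (i + 1) * 100) for i in range(10)]

    # Made-up, clustered row IDs, e.g. auto-incrementing keys.
    row_ids = list(range(420, 470))

    counts = [sum(lo <= r < hi for r in row_ids) for lo, hi in boundaries]
    print(counts)  # [0, 0, 0, 0, 50, 0, 0, 0, 0, 0] - everything in one partition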

Here's how bigtable solves the problem: Initially the table starts out with a single partition that covers the whole keyspace - from negative infinity to positive infinity.

As soon as this first partition has become too large, it will be split in two. The split point will be chosen so that it divides the data in the partition into two roughly equal parts. This continues recursively as partitions become too large. At a basic level, it's an application of the classic divide-and-conquer principle.

Partition D is split into partitions D1 and D2

This way, you always end up with a number of partitions that are roughly equal in size, even though the distribution of row identifiers in the keyspace is initially unknown.

And since each partition is defined in terms of its lowest and highest contained row identifier, we can easily implement efficient range scans: To find all rows in a given range of identifiers, we only have to scan the partitions with overlapping ranges.
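Here is a simplified sketch of such a partition map that ties the two ideas together. The split threshold is hypothetical (chosen tiny so the example actually splits; real systems split at much larger sizes), None stands in for the infinite outer bounds, and the second allocation layer mentioned below is left out entirely:

    SPLIT_THRESHOLD = 4  # hypothetical; real partitions hold far more data

    # Each partition covers [lo, hi) of the keyspace; None means +/- infinity.
    partitions = [{"lo": None, "hi": None, "rows": []}]

    def covers(p, row_id):
        return ((p["lo"] is None or row_id >= p["lo"]) and
                (p["hi"] is None or row_id < p["hi"]))

    def insert(row_id):
        p = next(p for p in partitions if covers(p, row_id))
        p["rows"].append(row_id)
        if len(p["rows"]) > SPLIT_THRESHOLD:       # partition became too large:
            p["rows"].sort()
            mid = p["rows"][len(p["rows"]) // 2]   # split point halves the data
            left = {"lo": p["lo"], "hi": mid, "rows": [r for r in p["rows"] if r < mid]}
            right = {"lo": mid, "hi": p["hi"], "rows": [r for r in p["rows"] if r >= mid]}
            i = partitions.index(p)
            partitions[i:i + 1] = [left, right]

    def range_scan(lo, hi):
        # Only partitions whose key ranges overlap [lo, hi) have to be read.
        hits = []
        for p in partitions:
            if (p["hi"] is None or p["hi"] > lo) and (p["lo"] is None or p["lo"] < hi):
                hits.extend(r for r in p["rows"] if lo <= r < hi)
        return hits

    for r in [123, 856, 923, 401, 77, 612, 250]:
        insert(r)
    print(range_scan(100, 200))  # [123] - only overlapping partitions are read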

Lastly, the bigtable scheme does require a second allocation layer to assign partitions to servers, which we didn't discuss here. Suffice it to say that this second allocation layer allows us to add new servers to a cluster without physically moving a single row. Of course, we still need to move some rows every time we split a partition.

Conclusion

So is the bigtable algorithm really "better" than consistent hashing? Again, it depends on the use case.

The upsides of the bigtable scheme are that it supports range scans and that we can add capacity to a cluster without copying rows. The major downside is that implementing the algorithm in a masterless system requires a fair amount of code and synchronization.

For EventQL, we still chose the bigtable algorithm as the clear winner. After reading this post, go have a look at the debug interface of EventQL where you can see the partition map for a given table. Hopefully it will make a lot more sense now:

Partition map for an EventQL table with a DATETIME primary key

That's all for today. In the next post we will discuss how to handle streaming updates on columnar files. You can subscribe to email updates for upcoming posts or the RSS feed in the sidebar.
