Reposted from: https://blog.yugabyte.com/distributed-postgresql-on-a-google-spanner-architecture-query-layer/

Our previous post dived into the details of the storage layer of YugaByte DB called DocDB, a distributed document store inspired by Google Spanner. This post focuses on YugaByte SQL (YSQL), a distributed, highly resilient, PostgreSQL-compatible SQL API layer powered by DocDB. A follow-up post will highlight the challenges faced and lessons learned when engineering such a database.

YSQL, Distributed PostgreSQL Made Real

YugaByte SQL (YSQL) is a distributed and highly resilient SQL layer, running across multiple nodes. It is compatible with the SQL dialect and wire protocol of PostgreSQL. This means that developers familiar with PostgreSQL can fully reuse their knowledge (and the standard PostgreSQL client drivers) to build an application powered by YSQL.

YSQL essentially transforms the monolithic PostgreSQL database into a DocDB-powered distributed database. To accomplish this, it reuses open source PostgreSQL’s query layer (written in C) as much as possible.

The following were the design goals we set for YSQL early on.

  • Reuse the open source, mature and feature-rich PostgreSQL query layer
  • Preserve existing PostgreSQL functionality and extend as necessary
  • Enable migrations to newer versions of PostgreSQL by implementing features in a modular approach

Relentless execution towards the above goals has paid rich dividends. YSQL now supports a wider range of existing PostgreSQL functionality than we had originally expected. This is evident from the v1.2 feature matrix, examples being:

  • DDL statements: CREATE, DROP and TRUNCATE tables
  • Data types: All primitive types including numeric types (integers and floats), text data types, byte arrays, date-time types, UUID, SERIAL, as well as JSONB
  • DML statements: Most statements such as INSERT, UPDATE, SELECT and DELETE. The bulk of core SQL functionality is now supported, including JOINs, WHERE clauses, GROUP BY, ORDER BY, LIMIT, OFFSET and SEQUENCES
  • Transactions: ABORT, ROLLBACK, BEGIN, END, and COMMIT
  • Expressions: Rich set of PostgreSQL built-in functions and operators
  • Other Features: VIEWs, EXPLAIN, PREPARE-BIND-EXECUTE, and JDBC support

As for the design goal of migrating to newer versions, YSQL started with PostgreSQL v10.4 and was recently rebased to PostgreSQL v11.2 in a matter of weeks!

How Does YSQL Work?

YSQL internals can be categorized into four distinct areas:

  • System catalog management
  • User table management
  • The read and write IO Path
  • Mapping SQL tables to a document store

The next sections detail each of the above areas. Before diving into the details, here’s a quick recap of DocDB from the first post of this series.

  • Every table in DocDB has the same schema: one key maps to one document.
  • As a distributed database, it replicates data on each write.
  • Offers single-key linearizability and multi-key snapshot isolation (serializable isolation is in the works).
  • Native support for secondary indexes on any document attribute.
  • Efficient querying and updating of a subset of attributes of any document.

System Catalog Management

The PostgreSQL documentation on system catalogs says that the system catalogs are regular tables where schema metadata is stored, such as information about tables and columns, and internal bookkeeping information. The initdb code path in PostgreSQL, which is completely different from the code path that deals with user tables, creates and initializes the system catalog tables. So, in order to make a distributed SQL database with no single points of failure, it is essential to replicate these system catalogs.

1. Initialize system catalog through initdb

When YSQL starts up for the first time, a modified initdb executes and creates the system catalog as a replicated, single-tablet system catalog table in DocDB. This is shown in the figure above.
The replicas of the system catalog tablet in DocDB form a Raft group, which replicates data onto a set of nodes and can tolerate failures. In the figure above, the system catalog tablet leader is shown with a solid border while the followers are shown with a dotted border. This ensures that PostgreSQL can still rely on the familiar system catalog in order to function.

2. Ready to serve apps

Once the system catalogs are created, YSQL can be used by applications. Since the data is replicated across nodes and persisted on disk, initdb is not needed on subsequent restarts of the cluster.

User Table Management

Now that the YSQL cluster is up and running, let us consider the scenario when a user creates a table. This happens in the following four steps.

1. Parse and analyze the query

Just as with PostgreSQL, the query is received by the PostgreSQL server process, which parses, analyzes and executes it.

2. Route query to tablet leader of DocDB system catalog

In the case of a regular PostgreSQL, the execution phase would add entries to the system catalog tables and create some directories and files on the local filesystem. In the case of YSQL, this update to the system catalog is sent to the tablet leader of the distributed system catalog table in DocDB.

3. Replicate system catalog entry across nodes in DocDB

The tablet leader of the distributed system catalog table in DocDB is responsible for replicating the update to the followers. This is done using Raft consensus, which ensures that the update is linearizable even in the presence of faults.

4. Create user table in DocDB

Now that the entry has been persisted in the system catalog, the next step of the execution phase is to create a distributed DocDB table. This involves creating a number of tablets (which have replicas) across a set of nodes. This is shown in the diagram below.

Once the above steps are complete, the table is ready to use.
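The ordering of these steps can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: the class and function names (CatalogReplica, replicate_catalog_entry, create_table) are hypothetical, and a simple majority-acknowledgement check stands in for real Raft consensus, which also involves logs, terms and leader election.

```python
class CatalogReplica:
    """Hypothetical stand-in for one replica of the system catalog tablet."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.entries = {}

    def append(self, table, schema):
        # A replica that is down cannot acknowledge the write.
        if not self.alive:
            return False
        self.entries[table] = schema
        return True


def replicate_catalog_entry(leader, followers, table, schema):
    """Return True once a majority of replicas (leader included) acknowledged.

    Simplification: unlike real Raft, this toy leader keeps the entry even
    when the majority check fails.
    """
    if not leader.append(table, schema):
        raise RuntimeError("catalog tablet leader is unavailable")
    acks = 1 + sum(f.append(table, schema) for f in followers)
    cluster_size = 1 + len(followers)
    return acks > cluster_size // 2


def create_table(leader, followers, table, schema, num_tablets=3):
    """Persist the catalog entry first, then create the table's tablets."""
    if not replicate_catalog_entry(leader, followers, table, schema):
        raise RuntimeError("catalog update was not replicated to a majority")
    # Tablet names are illustrative placeholders only.
    return [f"{table}-tablet-{i}" for i in range(num_tablets)]


leader = CatalogReplica("node-1")
followers = [CatalogReplica("node-2"), CatalogReplica("node-3", alive=False)]
tablets = create_table(leader, followers, "msgs", {"user_id": "TEXT"})
print(tablets)
```

With one of three replicas down, the catalog update still reaches a majority, so the table's tablets are created. With two replicas down, create_table raises instead of creating tablets.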

Read/Write IO Path

The read and write IO paths are quite similar. Let us walk through the write IO path, which involves replication of data in DocDB. The read IO path differs only in the last step, where data can be served directly from the leader of the tablet in DocDB.

1. Parse and analyze the query

Just as with PostgreSQL, the PostgreSQL server process receives the query. It then goes through the parser, analyzer, planner and the executor. Some of the planning, analysis and execution steps, however, are different to accommodate a distributed database instead of the local store.

2. Route the insert to the tablet leader

The SQL insert statement may end up updating a single row or multiple rows. Although DocDB can handle both cases natively, these two cases are detected and handled differently to improve the performance of YSQL. Single row inserts are routed directly to the tablet leader that owns the primary key of that row. Inserts affecting multiple rows are sent to a global transaction manager which performs a distributed transaction. The single-row insert case is shown below.
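The routing decision described above can be sketched as follows. This is a minimal illustration, not the YSQL implementation: the hash-based placement of keys onto tablets, the tablet count, and the names tablet_for_key and route_insert are all assumptions made for the example.

```python
import hashlib

NUM_TABLETS = 4  # hypothetical number of tablets for the table


def tablet_for_key(primary_key, num_tablets=NUM_TABLETS):
    """Map a primary key tuple to a tablet index (assumed hash placement)."""
    digest = hashlib.sha256(repr(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_tablets


def route_insert(rows, primary_key_cols):
    """Decide where an insert goes based on how many rows it touches."""
    keys = [tuple(row[col] for col in primary_key_cols) for row in rows]
    if len(keys) == 1:
        # Fast path: send the write straight to the leader of the owning tablet.
        return ("tablet_leader", tablet_for_key(keys[0]))
    # Multi-row path: hand off to the transaction manager, which coordinates
    # a distributed transaction across every tablet involved.
    return ("transaction_manager", sorted({tablet_for_key(k) for k in keys}))


single = route_insert([{"user_id": "user1", "msg_id": 10}], ("user_id", "msg_id"))
multi = route_insert(
    [{"user_id": "user1", "msg_id": 10}, {"user_id": "user2", "msg_id": 11}],
    ("user_id", "msg_id"),
)
print(single)
print(multi)
```

Detecting the single-row case up front lets the common path skip the transaction bookkeeping that multi-row writes need.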

3. Replicate the write through Raft

In the case of single-row inserts, the tablet leader replicates the data using the Raft protocol onto the followers. This simpler case is shown below. In the case of multi-row inserts, the global transaction manager writes multiple records (transaction status records, provisional records, etc.) across tablets (often on different nodes). Each of these writes is replicated using Raft consensus. The hybrid logical clock (HLC) tracking in the cluster serves as a coarsely synchronized, highly available global clock to coordinate writes. This makes the writes fault tolerant while keeping the system high-performance.
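To make the idea of a hybrid logical clock more concrete, here is a minimal sketch of the generic HLC update rules as commonly described in the literature. It is not YugaByte DB's implementation. The class and method names are hypothetical, and physical time is passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class HybridLogicalClock:
    logical: int = 0  # highest physical time observed so far
    counter: int = 0  # tie-breaker for events sharing the same logical value

    def now(self, physical_time):
        """Timestamp a local event or an outgoing message."""
        prev = self.logical
        self.logical = max(prev, physical_time)
        # If the physical clock did not advance, bump the counter instead.
        self.counter = self.counter + 1 if self.logical == prev else 0
        return (self.logical, self.counter)

    def update(self, physical_time, remote):
        """Merge the timestamp carried by an incoming message."""
        remote_logical, remote_counter = remote
        prev = self.logical
        self.logical = max(prev, remote_logical, physical_time)
        if self.logical == prev and self.logical == remote_logical:
            self.counter = max(self.counter, remote_counter) + 1
        elif self.logical == prev:
            self.counter += 1
        elif self.logical == remote_logical:
            self.counter = remote_counter + 1
        else:
            self.counter = 0
        return (self.logical, self.counter)


clock = HybridLogicalClock()
t1 = clock.now(100)               # first local event
t2 = clock.now(100)               # same physical time, so the counter advances
t3 = clock.update(90, (105, 3))   # remote node is ahead, so its time is adopted
t4 = clock.now(101)               # local physical clock still lags behind
print(t1, t2, t3, t4)
```

Timestamps compare as (logical, counter) pairs, so they keep increasing on a node even when its physical clock lags behind a peer's.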

Mapping SQL Tables to Documents

Each user table in YSQL maps to a corresponding DocDB table with multiple tablets. The YSQL tables come with their own schemas, while all the DocDB tables have the same schema, which is shown below. The actual schema enforcement is done using table schema metadata.

DocKey → { Document Value }

The combined set of primary key column values is used to construct the DocKey above. Each of the value columns (non-primary key columns) is mapped to one attribute in the Document Value above.

The various YSQL constructs are mapped to suitable DocDB equivalents. This is shown in the table below.

So how does this look in practice? Let us take an example. Consider the following rather simple table.

CREATE TABLE msgs (
    user_id TEXT,
    msg_id  INT,
    subject TEXT,
    msg     TEXT,
    PRIMARY KEY (user_id, msg_id)
);

This will correspond to a DocDB table that has a document-key-to-value schema. Now, let us perform the following insert at time T1.

T1: INSERT INTO msgs (user_id, msg_id, subject, msg)
      VALUES ('user1', 10, 'hello', 'hello world');

This will get translated into the following entries in the DocDB table.

DocKey ('user1', 10):
    {
        column_id (subject), T1 -> 'hello',
        column_id (msg), T1 ->  'hello world'
    }
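The translation from a SQL row to DocDB entries can be sketched in a few lines. This is an illustrative model only: the function name row_to_doc_entries is hypothetical, column names stand in for the internal column IDs, and the string "T1" stands in for a real hybrid timestamp.

```python
def row_to_doc_entries(row, primary_key_cols, write_time):
    """Split a SQL row into a DocKey and timestamped per-column values."""
    # The primary key column values, in order, form the DocKey.
    doc_key = tuple(row[col] for col in primary_key_cols)
    # Every non-primary-key column becomes one attribute of the document,
    # tagged with the time of the write.
    document = {
        (col, write_time): value
        for col, value in row.items()
        if col not in primary_key_cols
    }
    return doc_key, document


row = {"user_id": "user1", "msg_id": 10, "subject": "hello", "msg": "hello world"}
doc_key, document = row_to_doc_entries(row, ("user_id", "msg_id"), "T1")
print(doc_key)
print(document)
```

Because each column value is stored as its own timestamped attribute, a later update to a single column only needs to write one new entry under the same DocKey.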

YSQL Benefits

A YSQL cluster appears as a single logical PostgreSQL database to applications. All nodes in the YSQL layer are identical and application clients can connect to any node in order to read or write data. Along with maximum PostgreSQL compatibility, such an architecture delivers a number of benefits.

Horizontal Write Scalability

Since DocDB is capable of being scaled out on demand, a stateless YSQL tier makes it easy to add nodes on demand. This enables rapid scaling of the cluster when more resources (CPU, memory, storage capacity) are required.

Highly Resilient w/ Native Failover & Repair

The underlying DocDB cluster is fault-tolerant. This means that node failures do not affect the SQL application using this distributed SQL database. The application simply starts communicating with a new node. By contrast, with native PostgreSQL the common approach of master-slave replication inevitably leads to manual failover and/or an inability to serve recent commits.

Geo-Distribution w/ Multi-Region Deployments

DocDB supports geo-distributed deployments, meaning you can deploy a distributed SQL database across different geographic regions and zones.

Cloud Native Operations

DocDB allows dynamically changing nodes of the database with no app impact. Schema changes as well as infrastructure migrations are now zero downtime, even for a SQL database.

Summary

Bringing together two iconic database technologies such as Spanner and PostgreSQL into a new open source, cloud native database has been an immensely satisfying engineering achievement. However, we understand that a well-engineered database in its own right does not build trust in the minds of developers and architects. We have to earn that trust using the traditional means of communication, collaboration and sharing of success stories.

Through this series of posts, we explain our design principles, the tradeoffs associated with those principles, the actual implementation details and finally, the lessons learned, especially around some of the more challenging aspects. We intend to prove our claims through exhaustive correctness testing (such as Jepsen) as well as comprehensive performance benchmarking (including TPC-C). As we make rapid progress towards YSQL GA this summer, we are working closely with a few of our current users to highlight how YSQL can complement their existing investment in YugaByte DB. If your project can benefit from YSQL as well, don’t hesitate to reach out to us on our community Slack channel.

What’s Next?

  • Compare YugaByte DB in depth to databases like CockroachDB, Google Cloud Spanner and MongoDB.
  • Get started with YugaByte DB on macOS, Linux, Docker and Kubernetes.
  • Contact us to learn more about licensing, pricing or to schedule a technical overview.
