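At OSDI 2012, Google published the Spanner database. In my view, Spanner's handling of version control and transactional external consistency, and its use of TrueTime + timestamps to keep globally replicated data in sync, are all worth a close look. Understanding its timing logic matters a great deal when deploying a distributed DB over a wide area (typically nationwide to global) while keeping replication synchronized.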

Key points:

external consistency -> txn sequence

truetime + timestamp, sync & multi-version

global deployment

2PC 2PL

3 basic txns(RW, RO, snapshot)

Spanner: Globally-Distributed Database

Implementation

A Spanner deployment is called a universe; different environments run as separate universes:

test, development, production, ...

Hierarchy

  1. universe: global

    The universe master and the placement driver are currently singletons.

  2. zone: the unit of administrative deployment; logical & physical isolation

    zone master & location proxy

  3. spanserver
  4. tablet

Spanserver

software stack

1 leader and several replicas, in different data centers

all have:

  1. tablet

    $$
    (\text{key}: \text{string},\ \text{timestamp}: \text{int64}) \rightarrow \text{string}
    $$

  2. Colossus: a distributed filesystem like GFS

  3. Paxos state machine: supports replication by implementing a consistently replicated bag of mappings; the set of replicas is called a Paxos group

    Each state machine stores its metadata and log in its corresponding tablet. Paxos implementation supports long-lived leaders with time-based leader leases.

    Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.
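The tablet mapping `(key, timestamp) → string` is what makes multi-version reads possible. Below is a minimal sketch of such a store, not Spanner's actual implementation: the class name `Tablet` and its `write`/`read` methods are hypothetical, and timestamps are plain integers.

```python
import bisect
from collections import defaultdict


class Tablet:
    """Toy multi-version store: (key, timestamp) -> value. Hypothetical API."""

    def __init__(self):
        # Per key: a sorted list of timestamps and a parallel list of values.
        self._ts = defaultdict(list)
        self._vals = defaultdict(list)

    def write(self, key, timestamp, value):
        # Keep versions ordered by timestamp so reads can binary-search.
        i = bisect.bisect_right(self._ts[key], timestamp)
        self._ts[key].insert(i, timestamp)
        self._vals[key].insert(i, value)

    def read(self, key, timestamp):
        """Return the newest value written at or before `timestamp`, or None."""
        ts = self._ts.get(key, [])
        i = bisect.bisect_right(ts, timestamp)
        return self._vals[key][i - 1] if i > 0 else None


tablet = Tablet()
tablet.write("a", 10, "v1")
tablet.write("a", 20, "v2")
old = tablet.read("a", 15)     # sees the version written at timestamp 10
new = tablet.read("a", 25)     # sees the version written at timestamp 20
missing = tablet.read("a", 5)  # nothing was written yet at timestamp 5
```

Because old versions are kept, a read at a past timestamp never has to block writers, which is the basis for the lock-free read-only and snapshot reads described later.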

Paxos: the implementation is pipelined (to improve throughput under WAN latencies), but writes are applied in order

only the leader replica has:

  1. lock table: the state for two-phase locking
  2. transaction manager: for distributed transactions that span Paxos groups

Directories and Placement

On top of the key-value mappings, Spanner supports a bucketing abstraction called a directory, which is a set of contiguous keys that share a common prefix.
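To make "contiguous keys that share a common prefix" concrete, here is a small sketch over a sorted key list. The helper name `directory_keys` and the `users/...` key layout are assumptions made for illustration only.

```python
import bisect


def directory_keys(sorted_keys, prefix):
    """Return the contiguous slice of `sorted_keys` whose keys start with `prefix`."""
    # In sorted order, all keys sharing a prefix sit next to each other,
    # beginning at the first key that is >= the prefix itself.
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi].startswith(prefix):
        hi += 1
    return sorted_keys[lo:hi]


keys = sorted([
    "users/1/albums/1",
    "users/1/albums/2",
    "users/1/name",
    "users/2/name",
    "users/10/name",
])
user1_dir = directory_keys(keys, "users/1/")
```

Since a directory is one contiguous key range, it can be treated as a single unit of placement and moved between groups as a whole.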

tablet: unlike a Bigtable tablet, a Spanner tablet is a container that may encapsulate multiple partitions of the row space

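Movedir: the background task that moves directories between Paxos groups. It is not a single txn: it registers the fact that it is starting a move, copies the data in the background, and only uses a transaction to atomically move the last small amount of data and update the metadata. (What it actually moves are fragments, not whole large directories.)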

Data model

  1. schematized semi-relational tables
  2. a query language
  3. general-purpose transactions

Spanner’s data model is not purely relational, in that rows must have names.

hierarchies: declared in database schemas via INTERLEAVE IN, which tells the system about locality relationships between tables.


TrueTime

API:

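  • TT.now(): returns an interval [earliest, latest] that is guaranteed to contain the absolute time
  • TT.after(t): true if t has definitely passed
  • TT.before(t): true if t has definitely not arrived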

underlying time references: GPS and atomic clocks
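The three TrueTime calls can be sketched as follows. This is only an illustration: the class names, the fixed `epsilon` uncertainty bound, and the injectable `clock` are assumptions; the real service derives its bound from GPS and atomic-clock time masters.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a local clock plus an assumed fixed error bound `epsilon`."""

    def __init__(self, epsilon=0.004, clock=time.time):
        self.epsilon = epsilon  # assumed uncertainty in seconds (hypothetical value)
        self.clock = clock      # injectable so the sketch can be tested

    def now(self):
        # Absolute time is assumed to lie within [t - epsilon, t + epsilon].
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only if t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True only if t has definitely not arrived.
        return self.now().latest < t


tt = TrueTime(epsilon=1.0, clock=lambda: 100.0)  # frozen clock for the example
interval = tt.now()
```

Note that for a timestamp inside the uncertainty interval, both `after` and `before` return False: the caller cannot tell yet and has to wait.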


Concurrency Control

two-phase commit generates a Paxos write for the prepare phase that has no corresponding Spanner client write.

transactions:

  • read-write: uses locking (standalone writes are implemented as read-write txns)
  • read-only: no locking; runs on any replica that is sufficiently up-to-date
  • snapshot reads: reads in the past; no locking; run on any replica that is sufficiently up-to-date

Paxos leader lease:

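timed leases: make leadership long-lived (10 seconds by default); a potential leader requests timed lease votes from the replicas

lease interval: starts when the leader discovers it has a quorum of lease votes, and ends when it no longer has a quorum (because some votes have expired)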

Smax: the maximum timestamp used by a leader.

two-phase commit (2PC): a protocol that keeps all participants consistent (either all commit or none do); if any participant cannot prepare, the txn is rolled back

  1. prepare phase
  2. commit phase
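The two phases can be sketched as below. This is a generic, in-memory illustration of 2PC, not Spanner's code (in Spanner each participant is a Paxos group and each step is logged through Paxos); the `Participant` class and `two_phase_commit` function are hypothetical names.

```python
class Participant:
    """Toy 2PC participant; `can_commit` simulates whether prepare succeeds."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Vote yes only if this participant can guarantee a later commit.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the txn committed everywhere, False if it was rolled back."""
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit): commit only on a unanimous yes, otherwise roll back all.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


ok_group = [Participant("a"), Participant("b")]
ok = two_phase_commit(ok_group)

bad_group = [Participant("a"), Participant("b", can_commit=False)]
bad = two_phase_commit(bad_group)
```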

RW txn:

writes are buffered at the client until commit

wound-wait: used to avoid deadlocks

both kinds of participant leader acquire write locks:

  • non-coordinator participant leader: runs the prepare phase
  • coordinator leader: skips the prepare phase
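Wound-wait resolves lock conflicts by transaction age. The rule itself is standard; the function below is only a sketch, and the name `wound_wait` and the use of integer start timestamps as "age" are assumptions.

```python
def wound_wait(requester_ts, holder_ts):
    """Decide the outcome when a requester wants a lock that a holder owns.

    Smaller timestamp means older transaction.
    """
    if requester_ts < holder_ts:
        # An older requester "wounds" (aborts) the younger holder.
        return "wound"
    # A younger requester waits for the older holder to finish.
    return "wait"


older_asks = wound_wait(requester_ts=1, holder_ts=5)
younger_asks = wound_wait(requester_ts=5, holder_ts=1)
```

Deadlock is impossible under this rule because a transaction only ever waits for an older one, so the waits-for graph cannot contain a cycle.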

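RO txn:

execution flow:

  • assign a timestamp s_read
  • execute the transaction’s reads as snapshot reads at s_read.

The simplest choice is s_read = TT.now().latest. It preserves external consistency, but reads may block until the replica's safe time catches up, so a smaller timestamp is preferred where possible:

  • single Paxos group

    Define LastTS() to be the timestamp of the last committed write at a Paxos group. If there are no prepared transactions, s_read = LastTS() is sufficient.

  • multiple Paxos groups

    To avoid a negotiation round among the groups' leaders, the reads simply execute at s_read = TT.now().latest.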
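The choice of read timestamp can be summarized in a few lines. This is a simplified sketch of the rule as I understand it from the paper; the function name `choose_s_read` and its parameters are hypothetical.

```python
def choose_s_read(tt_now_latest, num_groups, last_ts=None, has_prepared_txns=True):
    """Pick the timestamp for a read-only txn (simplified sketch).

    tt_now_latest:     TT.now().latest at the time the txn starts
    num_groups:        number of Paxos groups the reads touch
    last_ts:           LastTS() of the group, if only one group is involved
    has_prepared_txns: whether that group has prepared-but-uncommitted txns
    """
    if num_groups == 1 and last_ts is not None and not has_prepared_txns:
        # Single group, nothing in flight: the last committed write is enough.
        return last_ts
    # Otherwise fall back to the always-safe (but possibly blocking) choice.
    return tt_now_latest


single = choose_s_read(tt_now_latest=100.0, num_groups=1, last_ts=90.0,
                       has_prepared_txns=False)
multi = choose_s_read(tt_now_latest=100.0, num_groups=3)
```

Picking the older `LastTS()` where it is allowed means the read is less likely to wait for a replica to catch up.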

Schema-Change Transactions

Discussion

Paxos Truetime consistency

strong consistency cross data centers

data model: not purely relational (but a SQL-like query language is available)

tablets are replicated, with concurrency coordinated by Paxos

txns with multiple Paxos groups --- 2PC coordination

leader

What is the actual difference compared with a classical distributed database?

consistent versions of the data

read-only access to the data

the core idea: timestamps & version control

time mechanism

global-time consistency: timestamps with no uncertainty

commit time: interval

given two txns, the system must be able to tell which one actually happened before the other

Participant leader -> Transaction manager -> Paxos group

three basic read/write operations; external consistency; global timestamps for syncing across regions and for a well-defined txn order

Concurrency control: done through timestamp management

timestamp -> multi-version -> snapshot

Almost all of the work in Spanner revolves around the ordering of timestamps!

condition: multiple data centers

target: external consistency ~= linearizability

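Two-phase locking (2PL):

  1. growing phase: acquire locks
  2. shrinking phase: release locks

2PC vs. 2PL:

  • 2PC: coordinates commit across the nodes of a distributed system (global management)
  • 2PL: on a single node, orders concurrent txns by controlling how they acquire and release resources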
TrueTime: turns local clocks into a global clock, which is essential for a globally distributed system because of its synchronization needs.

uncertainty interval [earliest, latest]: keep it as small as possible (higher accuracy) -> locks are held for less time -> higher efficiency

Thus, timestamps + TrueTime provide a globally accessible time service for applications around the world.

external-consistency invariant: if txn T2 starts after txn T1 commits, then s1 < s2 (where s_i is the commit timestamp of T_i)
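The invariant follows from the commit-wait rule: pick the commit timestamp s = TT.now().latest, then wait until TT.after(s) is true before making the commit visible. The simulation below is a sketch under stated assumptions: `FakeTrueTime` is a hypothetical stand-in with a manually advanced clock and a fixed `epsilon`, and `commit` ignores locking and replication entirely.

```python
class FakeTrueTime:
    """Simulated TrueTime with a manually advanced clock (hypothetical)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.t = 0.0  # simulated absolute time in seconds

    def now(self):
        return (self.t - self.epsilon, self.t + self.epsilon)

    def after(self, ts):
        # True only if ts has definitely passed.
        return self.now()[0] > ts

    def advance(self, dt):
        self.t += dt


def commit(tt, step=0.001):
    """Assign a commit timestamp, then commit-wait until it is safely in the past."""
    s = tt.now()[1]          # s = TT.now().latest
    while not tt.after(s):   # commit wait: roughly 2 * epsilon of simulated time
        tt.advance(step)
    return s


tt = FakeTrueTime(epsilon=0.004)
s1 = commit(tt)  # T1 commits
s2 = commit(tt)  # T2 starts only after T1's commit has returned
```

After the wait, even the earliest possible current time is greater than s1, so any later transaction necessarily receives a larger timestamp. It also shows why a smaller uncertainty interval matters: the wait grows with epsilon.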
