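At OSDI 2012, Google published the Spanner database. In my opinion, Spanner's version control, its handling of external consistency for transactions, and its use of TrueTime + timestamps for globally synchronized replication are all worth a close look. Understanding its timing logic matters when deploying a distributed DB over a large area (typically national to global) and keeping replicas in sync.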

key point:

external consistency -> txn sequence

truetime + timestamp, sync & multi-version

global deployment

2PC 2PL

3 basic txns(RW, RO, snapshot)

Spanner: Globally-Distributed Database

Implementation

Different environment: universe

test, development, production, ...

Hierarchy

universe: global

The universe master and the placement driver are currently singletons.
zone: the unit of administrative deployment; provides logical & physical isolation

zone master & location proxy
spanserver
tablet

Spanserver

software stack

1 leader, several replicas, in different data centers

all have:

tablet

$$
(\text{key}: \text{string},\ \text{timestamp}: \text{int64}) \rightarrow \text{string}
$$
Colossus: a distributed file system, the successor to GFS
Paxos state machine: supports replication by implementing a consistently replicated bag of mappings; the set of replicas is called a Paxos group

Each state machine stores its metadata and log in its corresponding tablet. Paxos implementation supports long-lived leaders with time-based leader leases.

Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.

Paxos: the implementation is pipelined, but writes are applied in order

leader uniquely has:

lock table: the state for two-phase locking
transaction manager: for distributed transactions, across Paxos group

Directories and Placement

Built on top of the key-value map is a bucketing abstraction called a directory: a set of contiguous keys that share a common prefix.

tablet: unlike a Bigtable tablet, a Spanner tablet is a container that may encapsulate multiple partitions of the row space

Movedir: a background task, not a single txn. It registers the fact that it is starting to move data, moves the data in the background, and then uses a transaction to atomically move the small remaining amount and update the metadata. (What actually moves is a fragment, not a whole large directory.)

Data model

schematized semi-relational tables
a query language
general-purpose transactions

Spanner’s data model is not purely relational, in that rows must have names.

hierarchies: declared in the database schema via INTERLEAVE IN, which lets clients describe locality relationships between tables.

TrueTime

API:

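TT.now(): returns an interval [earliest, latest] that is guaranteed to contain the absolute time
TT.after(t): true if t has definitely passed
TT.before(t): true if t has definitely not arrived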

underlying time references: GPS and atomic clocks
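
The three calls above can be sketched in a few lines. This is only a toy model, not Google's implementation: the class names `TTInterval` and `TrueTime`, the fixed `epsilon` error bound, and the injectable `clock` parameter are all assumptions made for illustration (the real error bound varies over time and comes from the GPS and atomic-clock masters).

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a local clock widened by an assumed error bound."""

    def __init__(self, epsilon=0.004, clock=time.time):
        self.epsilon = epsilon  # assumed uncertainty bound, in seconds
        self.clock = clock      # injectable so the sketch can be tested

    def now(self):
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only if t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True only if t has definitely not arrived yet.
        return self.now().latest < t
```

Note that `after` and `before` can both be false for the same `t`: that is exactly the case where `t` falls inside the current uncertainty interval.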

Concurrency Control

two-phase commit generates a Paxos write for the prepare phase that has no corresponding Spanner client write.

transactions:

read-write: uses locking (standalone writes are implemented as read-write txns)
read-only: no locking; runs at any replica that is sufficiently up-to-date
snapshot read: a read in the past; no locking; runs at any replica that is sufficiently up-to-date
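
To make the lock-free reads concrete, here is a minimal sketch of a multi-version store matching the (key, timestamp) -> value mapping of a tablet. The class `VersionedStore` and its method names are hypothetical; a real tablet is persisted on Colossus and replicated through Paxos, none of which is modeled here.

```python
import bisect
from collections import defaultdict


class VersionedStore:
    """Maps (key, timestamp) -> value, keeping every version of each key."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # key -> sorted commit timestamps
        self._values = defaultdict(list)      # key -> values, in the same order

    def write(self, key, timestamp, value):
        # Insert the new version at its sorted position.
        i = bisect.bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(i, timestamp)
        self._values[key].insert(i, value)

    def snapshot_read(self, key, timestamp):
        # Latest version whose commit timestamp is <= the read timestamp.
        i = bisect.bisect_right(self._timestamps.get(key, []), timestamp)
        return self._values[key][i - 1] if i else None
```

Because old versions are never overwritten, a read at timestamp t sees the same result no matter how many later writes arrive, which is why no lock is needed.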

Paxos leader lease:

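timed leases: make leadership long-lived; a leader holds its lease through lease votes

lease interval: starts when the leader discovers it has a quorum of lease votes, ends when it no longer has a quorum (because some votes have expired)

s_max: the maximum timestamp used by a leader. Before abdicating, a leader must wait until TT.after(s_max) is true.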

two-phase commit: a protocol that keeps all participants consistent; if any participant fails to prepare, the txn is rolled back

prepare phase
commit phase
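
The two phases can be sketched as follows. This is generic textbook 2PC under simplifying assumptions (no failures, no timeouts, no logging), not Spanner's version, where each participant is itself a Paxos group; the `Participant` class and `two_phase_commit` function are names made up for this sketch.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Vote yes only if this participant is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit): commit only on a unanimous yes, otherwise roll back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```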

RW txn:

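writes are buffered at the client until commit

wound-wait: used by reads inside RW txns to avoid deadlock

both kinds of participant leader acquire write locks:

non-coordinator participant leader: acquires write locks, chooses a prepare timestamp, logs a prepare record through Paxos
coordinator leader: acquires write locks but skips the prepare phase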
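
The wound-wait rule mentioned above fits in one function. This is the generic scheme, with transaction age represented by a start timestamp (smaller means older); the function name and return values are illustrative only.

```python
def wound_wait(requester_ts, holder_ts):
    """Decide what happens when a requester hits a lock held by another txn.

    A smaller timestamp means an older transaction.
    """
    if requester_ts < holder_ts:
        return "wound"  # the older requester aborts the younger holder
    return "wait"       # the younger requester waits for the older holder
```

Since only younger txns ever wait for older ones, the wait-for graph cannot contain a cycle, so no deadlock can form.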

RO txn:

execution flow:

assign a timestamp s_read
execute the transaction’s reads as snapshot reads at s_read.

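simplest choice: s_read = TT.now().latest (preserves external consistency, but the reads may block until the replica's safe time has advanced far enough)

single Paxos group

Define LastTS() to be the timestamp of the last committed write at a Paxos group. If there are no prepared txns, s_read = LastTS() is sufficient.
multiple Paxos groups

The reads simply execute at s_read = TT.now().latest, which avoids a negotiation round among the groups' leaders.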

Schema-Change Transactions

Discussion

Paxos Truetime consistency

strong consistency cross data centers

data model: not purely relational (but SQL can be used)

tablets are replicated; concurrency coordination is done by Paxos

txns with multiple Paxos groups --- 2PC coordination

leader

What is the actual difference compared with a classical distributed database?

consistent versions of the data

read-only access to the data

the core idea: timestamps & version control

time mechanism

globally consistent time: timestamps must order txns without ambiguity

commit time: interval

given two txns, the system must be able to tell which one actually happened before the other

Participant leader -> Transaction manager -> Paxos group

three basic txn types (RW, RO, snapshot read) provide external consistency; global timestamps keep regions in sync and give txns a definite order

Concurrency control: done through timestamp management

timestamp -> multi-version -> snapshot

Almost all the work in Spanner revolves around the ordering of timestamps!

condition: multiple data centers

target: external consistency ~= linearizability

Two phase locking:

growing phase: acquire lock
shrinking phase: release lock
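
The two phases can be enforced with a single flag. Below is a minimal single-threaded sketch, assuming exclusive locks only and a plain dict as the lock table; `TwoPhaseLockingTxn` and its methods are hypothetical names, and a real lock table would also handle shared locks and waiting.

```python
class TwoPhaseLockingTxn:
    def __init__(self):
        self.held = set()
        self.shrinking = False  # becomes True at the first release

    def acquire(self, lock_table, key):
        # Growing phase: locks may only be taken before any release.
        if self.shrinking:
            raise RuntimeError("cannot acquire a lock after releasing one")
        owner = lock_table.get(key)
        if owner is not None and owner is not self:
            return False  # held by another txn; caller must wait or retry
        lock_table[key] = self
        self.held.add(key)
        return True

    def release_all(self, lock_table):
        # Shrinking phase: release everything; no more acquires allowed.
        self.shrinking = True
        for key in self.held:
            del lock_table[key]
        self.held.clear()
```

Releasing all locks at once at the end, as done here, is the strict variant of 2PL.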

2PC: across the nodes of a distributed system; coordinates one global commit decision
2PL: within one node; manages how concurrent txns acquire and release locks

TrueTime: turns local clocks into a global clock with bounded uncertainty, which is essential for a globally distributed system because of its sync needs.

uncertainty interval [earliest, latest]: keep it as small as possible (higher accuracy) -> shorter commit wait, locks held for less time -> higher efficiency

Thus, timestamps + TrueTime provide a globally accessible time service for applications around the world.

external-consistency invariant: if txn T₂ starts after txn T₁ commits, then s₁ < s₂
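
The invariant follows from commit wait: a txn picks its timestamp at the top of the uncertainty interval and then waits until that timestamp has definitely passed before its commit becomes visible. A small simulation shows the effect. Everything here is an assumption for illustration: `FakeTrueTime` uses a manually advanced clock instead of real time, and `commit` stands in for the whole commit path.

```python
class FakeTrueTime:
    """Simulated TrueTime driven by a manual clock (illustration only)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.t = 0.0  # simulated absolute time

    def now(self):
        return (self.t - self.epsilon, self.t + self.epsilon)

    def after(self, ts):
        return self.now()[0] > ts

    def advance(self, dt):
        self.t += dt


def commit(tt, tick=0.001):
    # Pick the commit timestamp at the top of the uncertainty interval.
    s = tt.now()[1]
    # Commit wait: in a real system time passes on its own; here the
    # simulated clock is advanced until s has definitely passed.
    while not tt.after(s):
        tt.advance(tick)
    return s


tt = FakeTrueTime(epsilon=0.005)
s1 = commit(tt)  # T1 commits
s2 = commit(tt)  # T2 starts only after T1 has committed
print(s1 < s2)
```

The wait lasts roughly twice the uncertainty bound, which is why a smaller interval translates directly into lower commit latency.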

