Original Apache document: http://zookeeper.apache.org/doc/trunk/zookeeperOver.html

ZooKeeper

ZooKeeper: A Distributed Coordination Service for Distributed Applications

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, groups, and naming. It is designed to be easy to program to, and it uses a data model styled after the familiar directory tree structure of file systems.

Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.

Design Goals

ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared namespace organized much like a standard file system. The namespace consists of data registers, called znodes in ZooKeeper parlance, and these are similar to files and directories. Unlike typical systems designed for storage, ZooKeeper keeps its data in memory, which lets it achieve high throughput and low latency.

The ZooKeeper implementation puts a premium on high performance, high availability, and strictly ordered access. The performance means ZooKeeper can be used in large distributed systems. The reliability keeps it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.

ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is replicated over a set of hosts.

ZooKeeper Service

The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.

Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, receives watch events, and sends heartbeats. If the connection to the server breaks, the client will connect to a different server.

ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.
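The stamp is known as the zxid (ZooKeeper transaction id), and clients can observe it through the Stat structure exposed by the Java client. A minimal sketch, assuming a server at localhost:2181 and an existing znode /app (both placeholder values):

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZxidExample {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
            Stat stat = zk.exists("/app", false); // null if /app does not exist
            if (stat != null) {
                System.out.println("created at zxid:       " + stat.getCzxid());
                System.out.println("last modified at zxid: " + stat.getMzxid());
            }
            zk.close();
        }
    }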

ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

Data model and the hierarchical namespace

The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a path.

ZooKeeper's Hierarchical Namespace
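The figure itself is not reproduced here; the kind of hierarchy it depicts looks roughly like this (node names are illustrative):

    /
    ├── app1
    │   ├── p_1
    │   ├── p_2
    │   └── p_3
    └── app2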

Nodes and ephemeral nodes

Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.

Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
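The version numbers make optimistic, compare-and-swap style updates possible: a client passes the version it last read to set data, and the write fails if the znode changed in the meantime. A minimal sketch with the Java client, assuming a server at localhost:2181 and an existing /config znode (both placeholders):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConditionalUpdate {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
            Stat stat = new Stat();
            byte[] current = zk.getData("/config", false, stat); // fills stat
            System.out.println("read " + current.length + " bytes, version " + stat.getVersion());
            try {
                // Succeeds only if no other client wrote /config since our read.
                zk.setData("/config", "new-value".getBytes(), stat.getVersion());
            } catch (KeeperException.BadVersionException e) {
                // Another client updated the znode first; re-read and retry.
            }
            zk.close();
        }
    }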

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active. When the session ends the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
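A common use (group membership, for instance) is for each process to register itself by creating an ephemeral znode under a well-known parent; when a process dies, its session ends and the znode vanishes on its own. A minimal sketch, assuming a pre-existing /members parent (all names here are illustrative):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class EphemeralMember {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
            // EPHEMERAL_SEQUENTIAL also appends a unique increasing suffix,
            // e.g. /members/worker-0000000003.
            String path = zk.create("/members/worker-", "host:port".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            System.out.println("registered as " + path);
            Thread.sleep(10_000); // the znode exists only while this session lives
            zk.close();          // session ends, znode is deleted automatically
        }
    }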

Conditional updates and watches

ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered the client receives a packet saying that the znode has changed. And if the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [tbd].
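A minimal sketch of a watch with the Java client. Watches are one-shot, so the watcher re-registers itself after each event to keep watching (the /config znode and server address are placeholders):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class WatchExample implements Watcher {
        private ZooKeeper zk;

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.None) {
                return; // connection state change, nothing to re-arm
            }
            System.out.println("event: " + event.getType() + " on " + event.getPath());
            try {
                zk.getData("/config", this, new Stat()); // re-arm the watch
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) throws Exception {
            WatchExample watcher = new WatchExample();
            watcher.zk = new ZooKeeper("localhost:2181", 15000, watcher);
            watcher.zk.getData("/config", watcher, new Stat()); // first watch
            Thread.sleep(60_000); // stay alive to receive events
            watcher.zk.close();
        }
    }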

Guarantees

ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:

  • Sequential Consistency - Updates from a client will be applied in the order that they were sent.

  • Atomicity - Updates either succeed or fail. No partial results.

  • Single System Image - A client will see the same view of the service regardless of the server that it connects to.

  • Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.

  • Timeliness - The clients' view of the system is guaranteed to be up-to-date within a certain time bound.

For more information on these, and how they can be used, see [tbd]

Simple API

One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:

  • create - creates a node at a location in the tree

  • delete - deletes a node

  • exists - tests if a node exists at a location

  • get data - reads the data from a node

  • set data - writes data to a node

  • get children - retrieves a list of children of a node

  • sync - waits for data to be propagated

For a more in-depth discussion on these, and how they can be used to implement higher level operations, please refer to [tbd]
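As a rough end-to-end sketch of these operations with the Java client (the server address and paths are placeholder values; note that sync exists only in the asynchronous API, so it takes a callback):

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class BasicOps {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

            // create: makes a node at a location in the tree
            zk.create("/demo", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // exists: tests for a node (returns null if absent)
            Stat stat = zk.exists("/demo", false);

            // get data / set data: read and atomically replace all the bytes
            byte[] data = zk.getData("/demo", false, stat);
            System.out.println("read: " + new String(data));
            zk.setData("/demo", "v2".getBytes(), stat.getVersion());

            // get children: list children by name
            List<String> children = zk.getChildren("/", false);
            System.out.println("children of /: " + children);

            // sync: asynchronous only; waits for this server's view to catch up
            zk.sync("/demo", (rc, path, ctx) -> System.out.println("synced " + path), null);

            // delete: -1 skips the version check
            zk.delete("/demo", -1);
            zk.close();
        }
    }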

Implementation

ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.

ZooKeeper Components

The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server's database. Requests that change the state of the service, write requests, are processed by an agreement protocol.

As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.

ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.

Uses

The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher-order operations, such as synchronization primitives, group membership, ownership, etc. Some distributed applications have used it to: [tbd: add uses from white paper and video presentation.] For more information, see [tbd]

Performance

ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper's development team at Yahoo! Research indicate that it is. (See ZooKeeper Throughput as the Read-Write Ratio Varies.) It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)

ZooKeeper Throughput as the Read-Write Ratio Varies

The figure ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2Ghz Xeon and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. "Servers" indicate the size of the ZooKeeper ensemble, the number of servers that make up the service. Approximately 30 other servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.

Note

In version 3.2 r/w performance improved by ~2x compared to the previous 3.1 release.

Benchmarks also indicate that it is reliable, too. Reliability in the Presence of Errors shows how a deployment responds to various failures. The events marked in the figure are the following:

  1. Failure and recovery of a follower

  2. Failure and recovery of a different follower

  3. Failure of the leader

  4. Failure and recovery of two followers

  5. Failure of another leader

Reliability

To show the behavior of the system over time as failures are injected we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads.

Reliability in the Presence of Errors

There are a few important observations from this graph. First, if followers fail and recover quickly, then ZooKeeper is able to sustain a high throughput despite the failure. But maybe more importantly, the leader election algorithm allows for the system to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.

The ZooKeeper Project

ZooKeeper has been successfully used in many industrial applications. It is used at Yahoo! as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is used by the Fetching Service for the Yahoo! crawler, where it also manages failure recovery. A number of Yahoo! advertising systems also use ZooKeeper to implement reliable services.

All users and developers are encouraged to join the community and contribute their expertise. See the ZooKeeper Project on Apache for more information.
