[Big Data - ZooKeeper] ZooKeeper: A Distributed Coordination Service for Distributed Applications
ZooKeeper
ZooKeeper: A Distributed Coordination Service for Distributed Applications
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.
Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications the responsibility of implementing coordination services from scratch.
Design Goals
ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchal namespace which is organized similarly to a standard file system. The name space consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.
The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper means it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.
ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a sets of hosts called an ensemble.
| ZooKeeper Service |
![]() |
The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with a transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heart beats. If the TCP connection to the server breaks, the client will connect to a different server.
ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.
ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
Data model and the hierarchical namespace
The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a path.
| ZooKeeper's Hierarchical Namespace |
![]() |
Nodes and ephemeral nodes
Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file-system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
ZooKeeper also has the notion of ephemeral nodes. These znodes exists as long as the session that created the znode is active. When the session ends the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
Conditional updates and watches
ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the Zoo Keeper servers is broken, the client will receive a local notification. These can be used to [tbd].
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
For more information on these, and how they can be used, see [tbd]
Simple API
One of the design goals of ZooKeeper is provide a very simple programming interface. As a result, it supports only these operations:
- create
-
creates a node at a location in the tree
- delete
-
deletes a node
- exists
-
tests if a node exists at a location
- get data
-
reads the data from a node
- set data
-
writes data to a node
- get children
-
retrieves a list of children of a node
- sync
-
waits for data to be propagated
For a more in-depth discussion on these, and how they can be used to implement higher level operations, please refer to [tbd]
Implementation
ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.
| ZooKeeper Components |
![]() |
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Every ZooKeeper server services clients. Clients connect to exactly one server to submit irequests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.
Uses
The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher order operations, such as synchronizations primitives, group membership, ownership, etc. Some distributed applications have used it to: [tbd: add uses from white paper and video presentation.] For more information, see [tbd]
Performance
ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper's development team at Yahoo! Research indicate that it is. (See ZooKeeper Throughput as the Read-Write Ratio Varies.) It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)
| ZooKeeper Throughput as the Read-Write Ratio Varies |
![]() |
The figure ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2Ghz Xeon and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. "Servers" indicate the size of the ZooKeeper ensemble, the number of servers that make up the service. Approximately 30 other servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.
In version 3.2 r/w performance improved by ~2x compared to the previous 3.1 release.
Benchmarks also indicate that it is reliable, too. Reliability in the Presence of Errors shows how a deployment responds to various failures. The events marked in the figure are the following:
Failure and recovery of a follower
Failure and recovery of a different follower
Failure of the leader
Failure and recovery of two followers
Failure of another leader
Reliability
To show the behavior of the system over time as failures are injected we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads.
| Reliability in the Presence of Errors |
![]() |
The are a few important observations from this graph. First, if followers fail and recover quickly, then ZooKeeper is able to sustain a high throughput despite the failure. But maybe more importantly, the leader election algorithm allows for the system to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.
The ZooKeeper Project
ZooKeeper has been successfully usedin many industrial applications. It is used at Yahoo! as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is used by the Fetching Service for Yahoo! crawler, where it also manages failure recovery. A number of Yahoo! advertising systems also use ZooKeeper to implement reliable services.
All users and developers are encouraged to join the community and contribute their expertise. See the Zookeeper Project on Apachefor more information.
[Big Data - ZooKeeper] ZooKeeper: A Distributed Coordination Service for Distributed Applications的更多相关文章
- Zookeeper - zookeeper安装与配置
1.什么时Zookeeper ZooKeeper:分布式服务框架 Zookeeper -- 管理分布式环境中的数据. 2.安装 1>官网下载压缩包并解压zookeeper-3.4.14.zip ...
- Data Flow Diagram with Examples - Customer Service System
Data Flow Diagram with Examples - Customer Service System Data Flow Diagram (DFD) provides a visual ...
- wsse:InvalidSecurity Error When Testing FND_PROFILE Web Service in Oracle Applications R 12.1.2 from SOAP UI (Doc ID 1314946.1)
wsse:InvalidSecurity Error When Testing FND_PROFILE Web Service in Oracle Applications R 12.1.2 from ...
- ZooKeeper异常:Error connecting service It is probably not running
ZooKeeper安装后使用以下命令可以启动成功 bin/zkServer.sh start 但是使用下面命令查看启动状态,则报错误: bin/zkServer.sh status Error con ...
- centos7下安装zookeeper&zookeeper集群的搭建
一.centos7下安装zookeeper 1.zookeeper 下载地址 https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/ 2.安装步骤 ...
- [ ZooKeeper]ZooKeeper 的功能和原理
Zookeeper功能简介: ZooKeeper 是一个开源的分布式协调服务,由雅虎创建,是 Google Chubby 的开源实现.分布式应用程序可以基于 ZooKeeper 实现诸如数据发布/订阅 ...
- Chubby lock service for distributed system
Chubby lock service在分布式系统中的应用 Chubby lock service在分布式系统中提供粗粒度的锁服务, 以及可靠的存储. 相比高性能, 设计的重点在于高可靠性和高可用性. ...
- zookeeper Zookeeper
这是ZooKeeper客户端库的主要类.使用一个ZooKeeper服务,应用程序必须首先实例化ZooKeeper类的对象.所有的迭代都将通过调用ZooKeeper类的方法来完成.除非另有说明,该类的方 ...
- Zookeeper:Zookeeper集群概要
1.下载解压zookeeper 使用官网的(http://zookeeper.apache.org/releases.html#download)推荐下载镜像https://mirrors.tuna. ...
随机推荐
- MyEclipse里面如何把偏好设置导出
长时间使用Myeclipse,里面快捷键和代码风格以及其它设置都用习惯了,一旦需要重新安装,再次配置起来 就会很浪费时间,这里我们可以将自己的配置风格保留下来,下次重新安装时直接导入就可以了,不用再重 ...
- jstl标签比较格式化后的时间
c:set 里面不支持任何标签,这样写不好讲格式化的值放到bdateVar里面 <c:set var="bdateVar" value="<fmt:forma ...
- windows server 2003 安全加固(二)
windows server 2003 安全加固 关闭默认端口 我们知道远程桌面服务端口默认开启在3389端口,如果我们一定要用到,最好能换到另外的端口上,放到靠后的端口号上去,比如10001. 更改 ...
- Alpha冲刺随笔—:第一天
课程名称:软件工程1916|W(福州大学) 作业要求:项目Alpha冲刺(十天冲刺) 团队名称:葫芦娃队 作业目标:在十天冲刺里对每天的任务进行总结. 随笔汇总:https://www.cnblogs ...
- u3d 元件的克隆 Cloning of u3d components
u3d 元件的克隆 Cloning of u3d components 作者:韩梦飞沙 Author:han_meng_fei_sha 邮箱:313134555@qq.com E-mail: 3131 ...
- GIF录制
韩梦飞沙 韩亚飞 313134555@qq.com yue31313 han_meng_fei_sha ============= 快手电脑版_快手_gif快手电脑版 GIF动画录制工具|GI ...
- Scala:First Steps in Scala
var and val 简单来说,val声明的变量可以重新修改其引用,val则不行,见下面的例子: def max(x: Int, y: Int): Int = { if(x > y) x el ...
- MyEclipse 2016 破解详细过程
转自:https://blog.csdn.net/u012318074/article/details/71310553/ 文件资源 MyEclipse 和 破解程序可以到百度云下载:http://p ...
- docker使用大全 tomcat安装
um install docker #安装docker docker search tomcat docker pull docker.io/tomcat # 安装tomcat镜像 docker im ...
- 你真的会用Gson吗?Gson使用指南(4)
原文出处: 怪盗kidou 注:此系列基于Gson 2.4. 本次文章的主要内容: TypeAdapter JsonSerializer与JsonDeserializer TypeAdapterFac ...




