ZooKeeper-基础介绍

What is ZooKeeper?

ZooKeeper为分布式应用设计的高性能（使用在大的分布式系统）、高可用（防止单点失败）、严格地有序访问（客户端可以实现复杂的同步原语）的协同服务。

ZooKeeper提供的服务包括：maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Design Goals

ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchal namespace which is organized similarly to a standard file system.

ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a sets of hosts called an ensemble.

ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.

ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

The ZooKeeper Data Model

ZooKeeper在内存中维护一个由ZNode构成的层次树。ZNode可以类比成文件和目录。每个ZNode维护的数据结构中包括(1) version numbers for data changes(2)ACL that restricts who can do what.(3)timestamps.

The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of the data. And when a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn't match the actual version of the data, the update will fail. version number为 -1，则可以匹配任意znode的版本。

（1）数据访问是原子的。read一个znode的数据，只可能把该znode中的数据全部读取出来或者失败。write将替换掉znode中的所有数据或者失败。不存在部分读取和部分写入（ZooKeeper不支持append操作）。

（2）使用绝对路径来访问一个znode，路径使用java.lang.String来描述。

（3）每个ZNode最多存储1MB的数据，but the data should be much less than that on average。ZooKeeper was not designed to be a general database or large object store. Instead, it manages coordination data（configuration, status information, rendezvous）。如果需要存储大的数据，一般将数据存储到NFS或者HDFS，将指针存储到ZooKeeper中。

Ephemeral ZNodes & Persistent ZNodes

ZooKeeper中有两种类型的znode：ephemeral or persistent。ZooKeeper实例的create方法在创建znode的时候可以使用CreateMode指定znode的类型。

CreateMode可以指定的类型如下。

Sequence numbers can be used to impose a global ordering on events in a distributed system and may be used by the client to infer the ordering.

ZooKeeper Watches

ZooKeeper的读操作getData(), getChildren(), exists() 可以决定是否设置watch，写操作create() ，delete()， setData()可以触发watch。ACL的操作与watch无关。

ZooKeeper的Watcher对象有两个功能：（1）通知ZooKeeper状态的改变（2）通知ZNode的改变。

ZooKeeper对watch的定义：A watch event is one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes.

One-time trigger

One watch event will be sent to the client when the data has changed. For example, if a client does a getData("/znode1", true) and later the data for /znode1 is changed or deleted, the client will get a watch event for /znode1. If /znode1 changes again, no watch event will be sent unless the client has done another read that sets a new watch.
Sent to the client

This implies that an event is on the way to the client, but may not reach the client before the successful return code to the change operation reaches the client that initiated the change. Watches are sent asynchronously to watchers. ZooKeeper provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event. Network delays or other factors may cause different clients to see watches and return codes from updates at different times. The key point is that everything seen by the different clients will have a consistent order.
The data for which the watch was set

This refers to the different ways a node can change. It helps to think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches. getChildren() sets child watches. Alternatively, it may help to think of watches being set according to the kind of data returned. getData() and exists() return information about the data of the node, whereas getChildren() returns a list of children. Thus, setData() will trigger data watches for the znode being set (assuming the set is successful). A successful create() will trigger a data watch for the znode being created and a child watch for the parent znode. A successful delete() will trigger both a data watch and a child watch (since there can be no more children) for a znode being deleted as well as a child watch for the parent znode.

Watch Triggers

具体的说明如下：

• A watch set on an exists operation will be triggered when the znode being watched is created, deleted, or has its data updated.

• A watch set on a getData operation will be triggered when the znode being watched is deleted or has its data updated. No trigger can occur on creation because the znode must already exist for the getData operation to succeed.

• A watch set on a getChildren operation will be triggered when a child of the znode being watched is created or deleted, or when the znode itself is deleted. You can tell whether the znode or its child was deleted by looking at the watch event type: NodeDeleted shows the znode was deleted, and NodeChildrenChanged indicates that it was a child that was deleted.

What ZooKeeper Guarantees about Watches

With regard to watches, ZooKeeper maintains these guarantees:

Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensures that everything is dispatched in order.

A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.

The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

Things to Remember about Watches

Watches are one time triggers; if you get a watch event and you want to get notified of future changes, you must set another watch.

Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.)

A watch object, or function/context pair, will only be triggered once for a given notification. For example, if the same watch object is registered for an exists and a getData call for the same file and that file is then deleted, the watch object would only be invoked once with the deletion notification for the file.

When you disconnect from a server (for example, when the server fails), you will not get any watches until the connection is reestablished. For this reason session events are sent to all outstanding watch handlers. Use session events to go into a safe mode: you will not be receiving events while disconnected, so your process should act conservatively in that mode.

Operations

delete和setData需要指定znode的版本号（-1可以匹配任意znode的版本号，通过exists方法的返回值可以得到版本号），版本号不匹配，那么操作将失败。更新操作是非阻塞的。

ZooKeeoer中的read操作有可能读取不到最新的数据，client使用sync，则可以得到最新的数据。

multi操作： batch together multiple primitive operations into a single unit that either succeeds or fails in its entirety。

APIs

ZooKeeper提供Java和C的语言支持，提供同步和异步的API。

同步的exists

异步的exists

ACLs

A znode is created with a list of ACLs, which determine who can perform certain operations on it.

ZooKeeper提供的认证方案：

、

Clients may authenticate themselves after establishing a ZooKeeper session. Authentication is optional, although a znode’s ACL may require an authenticated client, in which case the client must authenticate itself to access the znode. Here is an example of using the digest scheme to authenticate with a username and password:

zk.addAuthInfo("digest", "tom:secret".getBytes());、

An ACL is the combination of an authentication scheme, an identity for that scheme, and a set of permissions. For example, if we wanted to give a client with the IP address 10.0.0.1 read access to a znode, we would set an ACL on the znode with the ip scheme, an ID of 10.0.0.1, and READ permission. In Java, we would create the ACL object as follows:

new ACL(Perms.READ,
　　new Id("ip", "10.0.0.1"));

Implementation

ZooKeeper的两种模式：

Standalone Mode：一个ZooKeeper的Server，测试环境下使用，不提供高可用和可靠性。

Replicated Mode：通过复制，实现高可用。只要ZooKeeper集群中超过半数Server可用，那么整个ZooKeeper服务就是可用的。Server的数量等于N，活着的Server数量需要大于等于N/2 + 1。假设N = 5，5/2 + 1 = 3，允许两个Server失败。假设N = 6，6/2 + 1 = 4，也允许两个Server失败。因此推荐使用奇数个Server构成ensemble（最少3个Server）。ZooKeeper runs in replicated mode on a cluster of machines called an ensemble.

ZooKeeper的核心理念：

all it has to do is ensure that every modification to the tree of znodes is replicated to a majority of the ensemble. If a minority of the machines fail, then a minimum of one machine will survive with the latest state. The other remaining replicas will eventually catch up with this state.

ZooKeeper使用Zab协议实现以上的理念，Zab协议包含两个阶段：

If the leader fails, the remaining machines hold another leader election and continue as before with the new leader. If the old leader later recovers, it then starts as a follower. Leader election is very fast, around 200 ms according to one published result, so performance does not noticeably degrade during an election. All machines in the ensemble write updates to disk before updating their in-memory copies of the znode tree. Read requests may be serviced from any machine, and because they involve only a lookup from memory, they are very fast.

Consistency

A follower may lag the leader by a number of updates.（因此ZooKeeper集群中将Server命名为Leader和Follower是比较恰当的）。This is a consequence of the fact that only a majority and not all members of the ensemble need to have persisted a change before it is committed.

Sessions

Zookeeper客户端拥有ZooKeeper集群中的服务器列表，客户端启动会连接列表中的一个Server。客户端连接上Server以后，Server为该客户端创建一个Session。客户端通过向Server发送ping请求（心跳）来维持Session的存活。

Time

ZooKeeper中基础的时间单位是tick time，其他的时间参数基于tick time设置（例如Session timeout）。

States

ZooKeeper对象在其生命周期拥有不同的State，新建的ZooKeeper实例的状态是CONNECTING，与Server建立起连接后，变成CONNECTED状态。如果ZooKeeper实例调用close方法或者session timeout，则变成CLOSED状态。通过调用getState方法可以获取到状态。

The most performance-critical part of ZooKeeper is the transaction log.

partial failure: when we don’t even know if an operation failed

参考：

https://zookeeper.apache.org/doc/current/index.html

Hadoop权威指南第4版

看了有收获的文章：

https://blog.csdn.net/u012152619/article/details/53053634

https://blog.csdn.net/liu857279611/article/details/70495413