ZooKeeper Principles and Curator Usage

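I am planning to implement distributed cluster-state consistency control on top of ZooKeeper. I don't know much about how ZooKeeper works, so this is a good chance to learn. I found a few articles online and am posting them here for now; once I have read the official documentation carefully, I will come back and add my own thoughts.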

-----------------------------divider line-------------------------------------

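I recently built rule management and cluster management for our company's risk-control system on top of ZK, and came away with a much deeper understanding of ZK and Curator. Below are notes on the pitfalls I ran into.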

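1. Curator has two notification mechanisms: one wraps ZK's native watcher, the other is Curator's own listener. Here come the pitfalls:

    a. A listener only receives events produced through the same client instance; it does not see changes made by other threads' clients or by other processes. The operation must also be issued with inBackground() for the listener to fire.

    b. A watcher wraps ZK's native watcher and works across processes. Note, however, that in my experience the watcher did not fire when the call was made with inBackground().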

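2. The ZK watcher defines four node event types (plus None, which is used for session state changes rather than node changes)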

public enum EventType {
    None (-1),
    NodeCreated (1),
    NodeDeleted (2),
    NodeDataChanged (3),
    NodeChildrenChanged (4);

}

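Here comes the pitfall.

How do you get the events you actually want?

    a. To watch NodeCreated, NodeDeleted, and NodeDataChanged, use checkExists or getData. checkExists is recommended, because getData throws an error if the node has not been created yet.

    b. To watch NodeChildrenChanged, you can only use getChildren. Note that it does not watch nested descendants: a watch on /test/1 is not notified of changes to /test/1/2/3, but it is notified of changes to /test/1/2. Also, the path in every event is always the path you are watching, so don't expect to learn the child's path from it.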
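The scope of a child watch described in point b can be illustrated without a running ZooKeeper. The following is only a sketch of the rule (a child watch fires when a direct child is created or deleted); the class and method names are made up for illustration and are not part of ZooKeeper or Curator.

```java
final class ChildWatchScope {
    /** Returns the parent of an absolute znode path, or null for the root. */
    static String parentOf(String path) {
        if (path.equals("/")) {
            return null;
        }
        int idx = path.lastIndexOf('/');
        return idx == 0 ? "/" : path.substring(0, idx);
    }

    /**
     * True if creating or deleting changedPath would fire a child watch
     * that was set on watchedPath (only direct children count).
     */
    static boolean childWatchFires(String watchedPath, String changedPath) {
        return watchedPath.equals(parentOf(changedPath));
    }

    public static void main(String[] args) {
        System.out.println(childWatchFires("/test/1", "/test/1/2"));   // prints true
        System.out.println(childWatchFires("/test/1", "/test/1/2/3")); // prints false
    }
}
```

To follow a deeper subtree you would need a separate getChildren watch on each level.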

Here is a good article: http://blog.csdn.net/lzx1104/article/details/6968802

http://liuqunying.blog.51cto.com/3984207/1407455

http://www.ibm.com/developerworks/cn/opensource/os-cn-zookeeper/

https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_advancedConfiguration

https://github.com/Netflix/curator

The ZooKeeper Data Model

ZooKeeper has a hierarchical name space, much like a distributed file system. The only difference is that each node in the namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. Paths to nodes are always expressed as canonical, absolute, slash-separated paths; there are no relative references. Any unicode character can be used in a path subject to the following constraints:

A ZK node can hold data as well as child nodes. The Unicode characters that are not allowed in a path are listed below:

The null character (\u0000) cannot be part of a path name. (This causes problems with the C binding.)
The following characters can't be used because they don't display well, or render in confusing ways: \u0001 - \u001F and \u007F - \u009F.
The following characters are not allowed: \ud800 - \uF8FF, \uFFF0 - \uFFFF.
The "." character can be used as part of another name, but "." and ".." cannot alone be used to indicate a node along a path, because ZooKeeper doesn't use relative paths. The following would be invalid: "/a/b/./c" or "/a/b/../c".
The token "zookeeper" is reserved.
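The constraints above can be turned into a small checker. This is a minimal sketch, not ZooKeeper's own validation code: the class name is hypothetical, the reserved "zookeeper" token is not checked, and rejecting empty segments and trailing slashes is my reading of "canonical, absolute" paths.

```java
final class ZnodePathCheck {
    /** Returns true if the character may appear in a znode path. */
    static boolean isAllowedChar(char c) {
        if (c == 0x0000) return false;                 // null character
        if (c >= 0x0001 && c <= 0x001F) return false;  // U+0001 to U+001F
        if (c >= 0x007F && c <= 0x009F) return false;  // U+007F to U+009F
        if (c >= 0xD800 && c <= 0xF8FF) return false;  // U+D800 to U+F8FF
        if (c >= 0xFFF0) return false;                 // U+FFF0 to U+FFFF
        return true;
    }

    /** Checks an absolute, slash-separated path against the listed rules. */
    static boolean isValid(String path) {
        if (path == null || path.isEmpty() || path.charAt(0) != '/') {
            return false; // paths are always absolute
        }
        if (path.equals("/")) {
            return true;  // the root itself
        }
        if (path.endsWith("/")) {
            return false; // assumption: no trailing slash in a canonical path
        }
        for (int i = 0; i < path.length(); i++) {
            if (!isAllowedChar(path.charAt(i))) {
                return false;
            }
        }
        // "." and ".." may be part of a name but not a whole segment.
        for (String segment : path.substring(1).split("/", -1)) {
            if (segment.isEmpty() || segment.equals(".") || segment.equals("..")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("/a/b/c"));    // prints true
        System.out.println(isValid("/a/b/./c"));  // prints false
        System.out.println(isValid("/a/b/../c")); // prints false
    }
}
```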

ZNodes

Every node in a ZooKeeper tree is referred to as a znode. Znodes maintain a stat structure that includes version numbers for data changes, acl changes. The stat structure also has timestamps. The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of the data. And when a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn't match the actual version of the data, the update will fail. (This behavior can be overridden. For more information see... )[tbd...]

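Every node in the ZK tree is called a znode. A znode maintains a stat structure that contains version numbers (for data changes and ACL, i.e. Access Control List, changes) and timestamps. The version number together with the timestamp is used to validate the ZK cache and to keep data consistent during updates.

Whenever a znode's data changes, its version number is incremented. For example, whenever a client retrieves data it also receives the version of that data; when the client later tries to update or delete, it must supply the version. If the supplied version does not match the one held by ZK, the update fails. [This is similar to optimistic locking in a database.]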
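The version check works like a compare-and-set. The sketch below imitates it with a plain in-memory object; it is not the ZooKeeper client API, and the class name and the boolean return value (instead of an exception) are simplifications of mine.

```java
final class VersionedNode {
    private String data;
    private int version = 0;

    VersionedNode(String data) {
        this.data = data;
    }

    synchronized String data() {
        return data;
    }

    synchronized int version() {
        return version;
    }

    /**
     * Replaces the data only if the caller's expected version matches the
     * current one; on success the version is incremented.
     */
    synchronized boolean setData(String newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false; // someone else changed the node in the meantime
        }
        data = newData;
        version++;
        return true;
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode("a");
        int seen = node.version();                    // both clients read version 0
        System.out.println(node.setData("b", seen));  // prints true
        System.out.println(node.setData("c", seen));  // prints false (stale version)
    }
}
```

A client whose update fails would normally re-read the data and version and then retry.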

Note

In distributed application engineering, the word node can refer to a generic host machine, a server, a member of an ensemble, a client process, etc. In the ZooKeeper documentation, znodes refer to the data nodes. Servers refer to machines that make up the ZooKeeper service; quorum peers refer to the servers that make up an ensemble; client refers to any host or process which uses a ZooKeeper service.

Znodes are the main entity that a programmer accesses. They have several characteristics that are worth mentioning here.

Watches

Clients can set watches on znodes. Changes to that znode trigger the watch and then clear the watch. When a watch triggers, ZooKeeper sends the client a notification. More information about watches can be found in the section ZooKeeper Watches.

Data Access

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.

ZooKeeper was not designed to be a general database or large object store. Instead, it manages coordination data. This data can come in the form of configuration, status information, rendezvous, etc. A common property of the various forms of coordination data is that they are relatively small: measured in kilobytes. The ZooKeeper client and the server implementations have sanity checks to ensure that znodes have less than 1M of data, but the data should be much less than that on average. Operating on relatively large data sizes will cause some operations to take much more time than others and will affect the latencies of some operations because of the extra time needed to move more data over the network and onto storage media. If large data storage is needed, the usual pattern of dealing with such data is to store it on a bulk storage system, such as NFS or HDFS, and store pointers to the storage locations in ZooKeeper.

Ephemeral Nodes

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted. Because of this behavior ephemeral znodes are not allowed to have children.

Sequence Nodes -- Unique Naming

When creating a znode you can also request that ZooKeeper append a monotonically increasing counter to the end of path. This counter is unique to the parent znode. The counter has a format of %010d -- that is 10 digits with 0 (zero) padding (the counter is formatted in this way to simplify sorting), i.e. "<path>0000000001". See Queue Recipe for an example use of this feature. Note: the counter used to store the next sequence number is a signed int (4bytes) maintained by the parent node, the counter will overflow when incremented beyond 2147483647 (resulting in a name "<path>-2147483647").
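The %010d format can be reproduced with String.format. This is only an illustration of the naming scheme; the prefix "/queue/item-" and the class name are hypothetical, and in a real deployment the counter is assigned by ZooKeeper, not by the client.

```java
import java.util.Locale;

final class SequenceNames {
    /** Appends the counter as 10 zero-padded digits, as described above. */
    static String sequentialName(String pathPrefix, int counter) {
        return String.format(Locale.ROOT, "%s%010d", pathPrefix, counter);
    }

    public static void main(String[] args) {
        System.out.println(sequentialName("/queue/item-", 1)); // prints /queue/item-0000000001
        // Zero padding makes lexicographic order agree with numeric order.
        String ninth = sequentialName("/queue/item-", 9);
        String tenth = sequentialName("/queue/item-", 10);
        System.out.println(ninth.compareTo(tenth) < 0);        // prints true
    }
}
```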

Time in ZooKeeper

ZooKeeper tracks time multiple ways:

Zxid

Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2.
Version numbers

Every change to a node will cause an increase to one of the version numbers of that node. The three version numbers are version (number of changes to the data of a znode), cversion (number of changes to the children of a znode), and aversion (number of changes to the ACL of a znode).
Ticks

When using multi-server ZooKeeper, servers use ticks to define timing of events such as status uploads, session timeouts, connection timeouts between peers, etc. The tick time is only indirectly exposed through the minimum session timeout (2 times the tick time); if a client requests a session timeout less than the minimum session timeout, the server will tell the client that the session timeout is actually the minimum session timeout.
Real time

ZooKeeper doesn't use real time, or clock time, at all except to put timestamps into the stat structure on znode creation and znode modification.
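The relationship between tick time and the minimum session timeout from the Ticks entry is simple arithmetic. The sketch below models only the lower bound described there; the class and method names are hypothetical and the tick time of 2000 ms is just an example value.

```java
final class SessionTimeouts {
    /** The minimum session timeout is 2 times the tick time. */
    static int minSessionTimeoutMs(int tickTimeMs) {
        return 2 * tickTimeMs;
    }

    /** A request below the minimum is raised to the minimum. */
    static int negotiate(int requestedMs, int tickTimeMs) {
        return Math.max(requestedMs, minSessionTimeoutMs(tickTimeMs));
    }

    public static void main(String[] args) {
        System.out.println(negotiate(1000, 2000));  // prints 4000
        System.out.println(negotiate(10000, 2000)); // prints 10000
    }
}
```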

ZooKeeper Stat Structure

The Stat structure for each znode in ZooKeeper is made up of the following fields:

czxid -- creation id

The zxid of the change that caused this znode to be created.
mzxid -- modification id

The zxid of the change that last modified this znode.
ctime -- creation time

The time in milliseconds from epoch when this znode was created.
mtime -- modification time

The time in milliseconds from epoch when this znode was last modified.
version

The number of changes to the data of this znode.
cversion

The number of changes to the children of this znode.
aversion

The number of changes to the ACL of this znode.
ephemeralOwner

The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.
dataLength

The length of the data field of this znode.
numChildren

The number of children of this znode.

ZooKeeper Watches

All of the read operations in ZooKeeper - getData(), getChildren(), and exists() - have the option of setting a watch as a side effect. Here is ZooKeeper's definition of a watch: a watch event is one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes. There are three key points to consider in this definition of a watch:

Every ZooKeeper read operation, getData(), getChildren(), and exists(), can set a watch. ZooKeeper defines it as follows: a watch event is a one-time trigger, sent to the client that set the watch, that fires when the watched data changes.

One-time trigger

One watch event will be sent to the client when the data has changed. For example, if a client does a getData("/znode1", true) and later the data for /znode1 is changed or deleted, the client will get a watch event for /znode1. If /znode1 changes again, no watch event will be sent unless the client has done another read that sets a new watch.

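One watch event is sent to the client when the data changes. For example, after a client calls getData("/znode1", true), it gets a watch event when the data of /znode1 changes. If /znode1 changes again, no event is sent unless the client reads the data again and sets a new watch.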
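The one-time behavior can be imitated with a tiny in-memory node. This is a sketch of the semantics only, not the ZooKeeper client: the class is hypothetical and a watch is modeled as a plain Runnable.

```java
import java.util.ArrayList;
import java.util.List;

final class OneShotWatchNode {
    private String data;
    private List<Runnable> dataWatches = new ArrayList<>();

    OneShotWatchNode(String data) {
        this.data = data;
    }

    /** Reads the data and, if watch is non-null, registers a one-time watch. */
    String getData(Runnable watch) {
        if (watch != null) {
            dataWatches.add(watch);
        }
        return data;
    }

    /** Changes the data, fires every registered watch once, then clears them. */
    void setData(String newData) {
        data = newData;
        List<Runnable> fired = dataWatches;
        dataWatches = new ArrayList<>();
        for (Runnable watch : fired) {
            watch.run();
        }
    }

    public static void main(String[] args) {
        OneShotWatchNode node = new OneShotWatchNode("v0");
        int[] events = {0};
        node.getData(() -> events[0]++);
        node.setData("v1");              // fires the watch
        node.setData("v2");              // no watch left, nothing fires
        System.out.println(events[0]);   // prints 1
        node.getData(() -> events[0]++); // set a new watch
        node.setData("v3");
        System.out.println(events[0]);   // prints 2
    }
}
```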

Sent to the client

This implies that an event is on the way to the client, but may not reach the client before the successful return code to the change operation reaches the client that initiated the change. Watches are sent asynchronously to watchers. ZooKeeper provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event. Network delays or other factors may cause different clients to see watches and return codes from updates at different times. The key point is that everything seen by the different clients will have a consistent order.

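ZK guarantees the ordering of watch events, so that network delays or other factors do not make different clients observe changes in an inconsistent order.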

The data for which the watch was set

This refers to the different ways a node can change. It helps to think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches. getChildren() sets child watches. Alternatively, it may help to think of watches being set according to the kind of data returned. getData() and exists() return information about the data of the node, whereas getChildren() returns a list of children. Thus, setData() will trigger data watches for the znode being set (assuming the set is successful). A successful create() will trigger a data watch for the znode being created and a child watch for the parent znode. A successful delete() will trigger both a data watch and a child watch (since there can be no more children) for a znode being deleted as well as a child watch for the parent znode.

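ZK maintains two lists of watches: data watches and child watches. getData() and exists() set data watches; getChildren() sets child watches.

It also helps to think in terms of the kind of data returned: getData() and exists() return information about the node itself, while getChildren() returns the list of children. So a successful setData() triggers the data watches of the znode being set. A successful create() triggers the data watch of the new znode and the child watch of its parent. A successful delete() triggers the data watch and the child watch of the deleted znode, as well as the child watch of its parent.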
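The rules in the last paragraph can be written down as a lookup table. The enum and method names below are my own shorthand, not ZooKeeper types; the mapping itself is taken directly from the text above.

```java
import java.util.EnumSet;
import java.util.Set;

final class WatchTriggers {
    /** Successful write operations on a znode. */
    enum Op { SET_DATA, CREATE, DELETE }

    /** Which watch list fires, and on which znode. */
    enum Fired { NODE_DATA_WATCH, NODE_CHILD_WATCH, PARENT_CHILD_WATCH }

    static Set<Fired> firedBy(Op op) {
        switch (op) {
            case SET_DATA:
                return EnumSet.of(Fired.NODE_DATA_WATCH);
            case CREATE:
                return EnumSet.of(Fired.NODE_DATA_WATCH, Fired.PARENT_CHILD_WATCH);
            case DELETE:
                return EnumSet.of(Fired.NODE_DATA_WATCH, Fired.NODE_CHILD_WATCH,
                        Fired.PARENT_CHILD_WATCH);
            default:
                throw new IllegalArgumentException("unknown operation: " + op);
        }
    }

    public static void main(String[] args) {
        for (Op op : Op.values()) {
            System.out.println(op + " -> " + firedBy(op));
        }
    }
}
```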

Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.

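Watches are maintained on the ZooKeeper server the client is connected to. A client reconnect does not invalidate them: previously registered watches are re-registered and triggered if needed, and all of this is transparent to the client. There is one exception: if a znode is created and then deleted while the connection is lost, a watch waiting for that znode to exist will miss the event.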

Semantics of Watches

We can set watches with the three calls that read the state of ZooKeeper: exists, getData, and getChildren. The following list details the events that a watch can trigger and the calls that enable them:

Created event:

Enabled with a call to exists.
Deleted event:

Enabled with a call to exists, getData, and getChildren.
Changed event:

Enabled with a call to exists and getData.
Child event:

Enabled with a call to getChildren.

Remove Watches

We can remove the watches registered on a znode with a call to removeWatches. Also, a ZooKeeper client can remove watches locally even if there is no server connection by setting the local flag to true. The following list details the events which will be triggered after the successful watch removal.

Child Remove event:

Watcher which was added with a call to getChildren.
Data Remove event:

Watcher which was added with a call to exists or getData.

What ZooKeeper Guarantees about Watches

With regard to watches, ZooKeeper maintains these guarantees:

Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensure that everything is dispatched in order.

A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.

The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

Things to Remember about Watches

Watches are one time triggers; if you get a watch event and you want to get notified of future changes, you must set another watch.

Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.)

A watch object, or function/context pair, will only be triggered once for a given notification. For example, if the same watch object is registered for an exists and a getData call for the same file and that file is then deleted, the watch object would only be invoked once with the deletion notification for the file.

When you disconnect from a server (for example, when the server fails), you will not get any watches until the connection is reestablished. For this reason session events are sent to all outstanding watch handlers. Use session events to go into a safe mode: you will not be receiving events while disconnected, so your process should act conservatively in that mode.

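1. Watches are delivered in order, and each watch is a one-time trigger.

2. If the same watch object is registered more than once on the same znode (for example through both exists and getData), it is invoked only once per notification.

3. When the connection between ZK and the client is lost, the client receives no watch events until the connection is re-established; for this reason session events are sent to all outstanding watch handlers.

4. Because watches are one-time, there is latency between receiving an event and sending the request that sets a new watch, so you cannot reliably see every change to a node. Be prepared to handle that case, or at least be aware that it can happen.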
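One way to notice the gap described in point 4 is to compare the data version read after each event with the last version you saw, since the version counts the changes to a znode's data. This is a sketch of that bookkeeping only; the class is hypothetical, and it assumes the version numbers come from the Stat returned by each read.

```java
final class VersionTracker {
    private int lastSeenVersion;

    VersionTracker(int initialVersion) {
        this.lastSeenVersion = initialVersion;
    }

    /**
     * Records the version read after a watch event and returns how many
     * intermediate data changes were never observed individually.
     */
    int observe(int currentVersion) {
        int missed = Math.max(0, currentVersion - lastSeenVersion - 1);
        lastSeenVersion = currentVersion;
        return missed;
    }

    public static void main(String[] args) {
        VersionTracker tracker = new VersionTracker(0);
        System.out.println(tracker.observe(1)); // prints 0 (saw every change)
        System.out.println(tracker.observe(4)); // prints 2 (versions 2 and 3 were skipped)
    }
}
```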

Gotchas: Common Problems and Troubleshooting

So now you know ZooKeeper. It's fast, simple, your application works, but wait ... something's wrong. Here are some pitfalls that ZooKeeper users fall into:

If you are using watches, you must look for the connected watch event. When a ZooKeeper client disconnects from a server, you will not receive notification of changes until reconnected. If you are watching for a znode to come into existence, you will miss the event if the znode is created and deleted while you are disconnected.

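When using watches, you must pay attention to the connection. If the client disconnects, you receive no events until it reconnects. If you are watching for a znode to come into existence and it is created and deleted while you are disconnected, you will miss the event.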

You must test ZooKeeper server failures. The ZooKeeper service can survive failures as long as a majority of servers are active. The question to ask is: can your application handle it? In the real world a client's connection to ZooKeeper can break. (ZooKeeper server failures and network partitions are common reasons for connection loss.) The ZooKeeper client library takes care of recovering your connection and letting you know what happened, but you must make sure that you recover your state and any outstanding requests that failed. Find out if you got it right in the test lab, not in production - test with a ZooKeeper service made up of several servers and subject them to reboots.

You must test ZK server failures and check whether your application still works correctly.

The list of ZooKeeper servers used by the client must match the list of ZooKeeper servers that each ZooKeeper server has. Things can work, although not optimally, if the client list is a subset of the real list of ZooKeeper servers, but not if the client lists ZooKeeper servers not in the ZooKeeper cluster.

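The list of ZK servers used by the client must match the list each ZK server is configured with; otherwise the client may end up listing servers that are not in the ZK cluster.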

Be careful where you put that transaction log. The most performance-critical part of ZooKeeper is the transaction log. ZooKeeper must sync transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, put trace files on NFS and increase the snapshotCount; it doesn't eliminate the problem, but it can mitigate it.

Set your Java max heap size correctly. It is very important to avoid swapping. Going to disk unnecessarily will almost certainly degrade your performance unacceptably. Remember, in ZooKeeper, everything is ordered, so if one request hits the disk, all other queued requests hit the disk.

To avoid swapping, try to set the heapsize to the amount of physical memory you have, minus the amount needed by the OS and cache. The best way to determine an optimal heap size for your configurations is to run load tests. If for some reason you can't, be conservative in your estimates and choose a number well below the limit that would cause your machine to swap. For example, on a 4G machine, a 3G heap is a conservative estimate to start with.

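Set the Java max heap size correctly; this is important for avoiding swapping. Going to disk unnecessarily degrades performance badly, and because everything in ZK is ordered, if one request hits the disk, all the requests queued behind it hit the disk too.

To avoid swapping, set the heap size to the physical memory minus what the OS and cache need. The best approach is to run load tests; if you can't, a reasonable starting point is a 3G heap on a 4G machine, roughly 3/4 of physical memory.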
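The 3/4 rule of thumb is easy to compute. The helper below is only a sketch of that starting estimate (the class and method names are hypothetical); load testing remains the better way to choose the value passed to the JVM's -Xmx option.

```java
final class HeapSizing {
    /** Starting estimate: about 3/4 of physical memory, in megabytes. */
    static long suggestedHeapMb(long physicalMb) {
        return physicalMb * 3 / 4;
    }

    /** Formats the estimate as a JVM max-heap option. */
    static String xmxFlag(long physicalMb) {
        return "-Xmx" + suggestedHeapMb(physicalMb) + "m";
    }

    public static void main(String[] args) {
        System.out.println(xmxFlag(4096)); // prints -Xmx3072m (3G heap on a 4G machine)
    }
}
```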