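I'm planning to implement distributed cluster state consistency control on top of ZooKeeper. I don't know much about how ZooKeeper works, so this is a good chance to learn. I found a few articles online and am posting them here first; once I have read the official documentation properly, I will come back and add my own notes.

----------------------------- divider -----------------------------

I recently built rule management and cluster management for our company's risk control system on top of ZooKeeper, and got to know ZooKeeper and Curator much better along the way. Below are the pitfalls I ran into.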

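1. Curator has two notification mechanisms: one wraps ZooKeeper's own watcher, the other is Curator's own listener. Here come the pitfalls:

  a. A listener only receives events from the client in the same thread; it does not work across threads or across processes. The operation must be issued in inBackground() mode for the listener to fire.

  b. A watcher wraps ZooKeeper's native watcher and works across processes, but note that it cannot be triggered when the operation is issued in inBackground mode.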

2. The ZooKeeper watcher defines four event types (plus None):

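public enum EventType {
    None(-1),
    NodeCreated(1),
    NodeDeleted(2),
    NodeDataChanged(3),
    NodeChildrenChanged(4);

    // Integer code carried by each event type
    private final int intValue;

    EventType(int intValue) { this.intValue = intValue; }

    public int getIntValue() { return intValue; }
}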

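Here come the pitfalls.

How do you get the events you actually want?

  a. To listen for NodeCreated, NodeDeleted, and NodeDataChanged, use checkExists or getData. checkExists is recommended, because getData raises an error if the node has not been created yet.

  b. To listen for NodeChildrenChanged, the only option is getChildren. Note that it does not see changes to nested descendants: a watch on /test/1 is notified of changes to /test/1/2 but not of changes to /test/1/2/3. Also, the path in the event is always the path you are watching, so do not expect to use it to obtain the child's path.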

This article is good: http://blog.csdn.net/lzx1104/article/details/6968802

http://liuqunying.blog.51cto.com/3984207/1407455

http://www.ibm.com/developerworks/cn/opensource/os-cn-zookeeper/

https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_advancedConfiguration

https://github.com/Netflix/curator

The ZooKeeper Data Model

ZooKeeper has a hierarchical name space, much like a distributed file system. The only difference is that each node in the namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. Paths to nodes are always expressed as canonical, absolute, slash-separated paths; there are no relative references. Any Unicode character can be used in a path subject to the following constraints:

A ZooKeeper node can hold data as well as child nodes. The Unicode characters that are not allowed in a path are listed below.

  • The null character (\u0000) cannot be part of a path name. (This causes problems with the C binding.)

  • The following characters can't be used because they don't display well, or render in confusing ways: \u0001 - \u001F and \u007F - \u009F.

  • The following characters are not allowed: \ud800 - \uF8FF, \uFFF0 - \uFFFF.

  • The "." character can be used as part of another name, but "." and ".." cannot alone be used to indicate a node along a path, because ZooKeeper doesn't use relative paths. The following would be invalid: "/a/b/./c" or "/a/b/../c".

  • The token "zookeeper" is reserved.
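The constraints above can be sketched as a small check. This is a toy validator written for these notes, not ZooKeeper's own validation code; the class name ZnodePathCheck and the wording of its messages are made up here. It also assumes a canonical path has no empty node names and no trailing slash, and it does not check the reserved "zookeeper" token.

```java
// Toy validator for the path rules listed above; not ZooKeeper's own code.
final class ZnodePathCheck {

    /** Returns null when the path passes the listed rules, otherwise a short reason. */
    static String problem(String path) {
        if (path == null || !path.startsWith("/")) {
            return "path must be absolute";
        }
        if (path.length() > 1 && path.endsWith("/")) {
            return "path must not end with '/'";
        }
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            boolean banned = c == 0x0000                // null character
                    || (c >= 0x0001 && c <= 0x001F)     // control characters
                    || (c >= 0x007F && c <= 0x009F)     // DEL and C1 controls
                    || (c >= 0xD800 && c <= 0xF8FF)     // surrogates and private use
                    || c >= 0xFFF0;                     // specials at the end of the plane
            if (banned) {
                return "character not allowed at index " + i;
            }
        }
        // Index 0 is the empty string before the leading slash, so start at 1.
        String[] parts = path.split("/");
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                return "empty node name";
            }
            if (parts[i].equals(".") || parts[i].equals("..")) {
                return "relative component '" + parts[i] + "'";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(problem("/a/b/.hidden")); // a dot inside a longer name is fine
        System.out.println(problem("/a/b/../c"));    // rejected: relative component
    }
}
```

A dot is only rejected when it is the whole node name, which matches the "/a/b/./c" and "/a/b/../c" examples in the list.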

ZNodes

Every node in a ZooKeeper tree is referred to as a znode. Znodes maintain a stat structure that includes version numbers for data changes, acl changes. The stat structure also has timestamps. The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of the data. And when a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn't match the actual version of the data, the update will fail. (This behavior can be overridden. For more information see... )[tbd...]

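Every node in the ZooKeeper tree is called a znode. A znode maintains a stat structure that holds version numbers (for data changes and for ACL, Access Control List, changes) and timestamps. The version number combined with the timestamp is used to validate ZooKeeper's cache and to keep data consistent during updates.

Each time a znode's data changes, its version number is incremented. For example, whenever a client reads data it also receives the version of that data, and when the client tries to update or delete, it must supply that version. If the supplied version does not match the one ZooKeeper holds, the update fails. [This is similar to optimistic locking in a database.]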
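The version check can be illustrated with a toy in-memory node. This is only a sketch of the optimistic-locking idea, not ZooKeeper client code: the class VersionedNode is hypothetical, and it reports a mismatch with a boolean, whereas the real client reports it as an error.

```java
// Toy model of a znode's data version; illustrates the compare-then-write rule only.
final class VersionedNode {
    private byte[] data;
    private int version = 0;

    VersionedNode(byte[] initial) {
        this.data = initial.clone();
    }

    synchronized int version() {
        return version;
    }

    synchronized byte[] data() {
        return data.clone();
    }

    /** Applies the write only if expectedVersion matches the current version. */
    synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false;          // someone else changed the node since we read it
        }
        data = newData.clone();
        version++;                 // every successful data change bumps the version
        return true;
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode(new byte[] {1});
        int seen = node.version();                                // both writers read version 0
        System.out.println(node.setData(new byte[] {2}, seen));   // first writer wins
        System.out.println(node.setData(new byte[] {3}, seen));   // stale version is rejected
    }
}
```

A writer whose update is rejected would typically re-read the data and version, reapply its change, and try again.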

Note

In distributed application engineering, the word node can refer to a generic host machine, a server, a member of an ensemble, a client process, etc. In the ZooKeeper documentation, znodes refer to the data nodes. Servers refer to machines that make up the ZooKeeper service; quorum peers refer to the servers that make up an ensemble; client refers to any host or process which uses a ZooKeeper service.

Znodes are the main entity that a programmer accesses. They have several characteristics that are worth mentioning here.

Watches

Clients can set watches on znodes. Changes to that znode trigger the watch and then clear the watch. When a watch triggers, ZooKeeper sends the client a notification. More information about watches can be found in the section ZooKeeper Watches.

Data Access

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.

ZooKeeper was not designed to be a general database or large object store. Instead, it manages coordination data. This data can come in the form of configuration, status information, rendezvous, etc. A common property of the various forms of coordination data is that they are relatively small: measured in kilobytes. The ZooKeeper client and the server implementations have sanity checks to ensure that znodes have less than 1M of data, but the data should be much less than that on average. Operating on relatively large data sizes will cause some operations to take much more time than others and will affect the latencies of some operations because of the extra time needed to move more data over the network and onto storage media. If large data storage is needed, the usual pattern of dealing with such data is to store it on a bulk storage system, such as NFS or HDFS, and store pointers to the storage locations in ZooKeeper.

Ephemeral Nodes

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted. Because of this behavior ephemeral znodes are not allowed to have children.

Sequence Nodes -- Unique Naming

When creating a znode you can also request that ZooKeeper append a monotonically increasing counter to the end of path. This counter is unique to the parent znode. The counter has a format of %010d -- that is 10 digits with 0 (zero) padding (the counter is formatted in this way to simplify sorting), i.e. "<path>0000000001". See Queue Recipe for an example use of this feature. Note: the counter used to store the next sequence number is a signed int (4bytes) maintained by the parent node, the counter will overflow when incremented beyond 2147483647 (resulting in a name "<path>-2147483647").
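The %010d format and the remark about sorting can be seen in a few lines. This is a sketch only: the class SequenceNames and the /queue/item- prefix are made up for illustration, and the counter is passed in by hand rather than maintained by a parent znode.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrates the zero-padded sequence suffix; the counter here is supplied by the caller.
final class SequenceNames {

    /** Appends the counter as 10 zero-padded digits, the %010d format quoted above. */
    static String name(String pathPrefix, int counter) {
        return pathPrefix + String.format(Locale.ROOT, "%010d", counter);
    }

    /** Builds names for the given counters and sorts them as plain strings. */
    static List<String> sortedNames(String pathPrefix, int... counters) {
        List<String> names = new ArrayList<>();
        for (int c : counters) {
            names.add(name(pathPrefix, c));
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Thanks to the padding, string order and numeric order agree.
        for (String n : sortedNames("/queue/item-", 10, 2, 1)) {
            System.out.println(n);
        }
    }
}
```

Without the padding, "item-10" would sort before "item-2" as a string, which is why a queue that picks the lowest child by name relies on the fixed width.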

Time in ZooKeeper

ZooKeeper tracks time multiple ways:

  • Zxid

    Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2.

  • Version numbers

    Every change to a node will cause an increase to one of the version numbers of that node. The three version numbers are version (number of changes to the data of a znode), cversion (number of changes to the children of a znode), and aversion (number of changes to the ACL of a znode).

  • Ticks

    When using multi-server ZooKeeper, servers use ticks to define timing of events such as status uploads, session timeouts, connection timeouts between peers, etc. The tick time is only indirectly exposed through the minimum session timeout (2 times the tick time); if a client requests a session timeout less than the minimum session timeout, the server will tell the client that the session timeout is actually the minimum session timeout.

  • Real time

    ZooKeeper doesn't use real time, or clock time, at all except to put timestamps into the stat structure on znode creation and znode modification.
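The minimum session timeout rule from the Ticks item is simple arithmetic. The sketch below models only that lower bound; the class SessionTimeouts is hypothetical, the tick time of 2000 ms is just an example value, and a real server may apply further limits that are not modeled here.

```java
// Models only the lower bound described above: minimum session timeout = 2 * tick time.
final class SessionTimeouts {

    /** Returns the timeout the client ends up with after the minimum is applied. */
    static int negotiated(int requestedMs, int tickTimeMs) {
        int minimumMs = 2 * tickTimeMs;
        return Math.max(requestedMs, minimumMs);
    }

    public static void main(String[] args) {
        System.out.println(negotiated(1000, 2000));   // request below the minimum is raised
        System.out.println(negotiated(10000, 2000));  // request above the minimum is kept
    }
}
```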

ZooKeeper Stat Structure

The Stat structure for each znode in ZooKeeper is made up of the following fields:

  • czxid -- creation id

    The zxid of the change that caused this znode to be created.

  • mzxid -- modification id

    The zxid of the change that last modified this znode.

  • ctime -- creation time

    The time in milliseconds from epoch when this znode was created.

  • mtime -- modification time

    The time in milliseconds from epoch when this znode was last modified.

  • version

    The number of changes to the data of this znode.

  • cversion

    The number of changes to the children of this znode.

  • aversion

    The number of changes to the ACL of this znode.

  • ephemeralOwner

    The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.

  • dataLength

    The length of the data field of this znode.

  • numChildren

    The number of children of this znode.

ZooKeeper Watches

All of the read operations in ZooKeeper - getData(), getChildren(), and exists() - have the option of setting a watch as a side effect. Here is ZooKeeper's definition of a watch: a watch event is one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes. There are three key points to consider in this definition of a watch:

Every ZooKeeper read operation, getData(), getChildren(), and exists(), can set a watch. ZooKeeper defines it as follows: a watch event is a one-time trigger, sent to the client that set the watch, fired when the watched data changes.

  • One-time trigger

    One watch event will be sent to the client when the data has changed. For example, if a client does a getData("/znode1", true) and later the data for /znode1 is changed or deleted, the client will get a watch event for /znode1. If /znode1 changes again, no watch event will be sent unless the client has done another read that sets a new watch.

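One watch event is sent to the client when the data changes. For example, after a client calls getData("/znode1", true), it receives an event when the data of /znode1 changes. If /znode1 changes again, no event is sent unless the client reads the data again and sets a new watch.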

  • Sent to the client

    This implies that an event is on the way to the client, but may not reach the client before the successful return code to the change operation reaches the client that initiated the change. Watches are sent asynchronously to watchers. ZooKeeper provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event. Network delays or other factors may cause different clients to see watches and return codes from updates at different times. The key point is that everything seen by the different clients will have a consistent order.

ZooKeeper guarantees the ordering of watch events, so network delays or other factors do not cause clients to see things out of order.

  • The data for which the watch was set

    This refers to the different ways a node can change. It helps to think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches. getChildren() sets child watches. Alternatively, it may help to think of watches being set according to the kind of data returned. getData() and exists() return information about the data of the node, whereas getChildren() returns a list of children. Thus, setData() will trigger data watches for the znode being set (assuming the set is successful). A successful create() will trigger a data watch for the znode being created and a child watch for the parent znode. A successful delete() will trigger both a data watch and a child watch (since there can be no more children) for a znode being deleted as well as a child watch for the parent znode.

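ZooKeeper maintains two lists of watches, data watches and child watches. getData() and exists() set data watches; getChildren() sets child watches.

It also helps to think in terms of what each call returns: getData() and exists() return information about the node itself, while getChildren() returns the list of children. So setData() triggers the data watch of the znode being set (provided the set succeeds). A successful create() triggers the data watch of the new znode and the child watch of its parent. A successful delete() triggers the data watch and child watch of the deleted znode and the child watch of its parent.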
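The two watch lists and the one-time trigger can be modeled with a toy in-memory tree. This is not the ZooKeeper API: ToyWatchTree and its method shapes are invented here, watches are keyed by path instead of by watcher object, and events are simply appended to a list as strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model: two sets of one-shot watches keyed by path, fired by create/setData/delete.
final class ToyWatchTree {
    private final Set<String> nodes = new HashSet<>();
    private final Set<String> dataWatches = new HashSet<>();
    private final Set<String> childWatches = new HashSet<>();
    final List<String> events = new ArrayList<>();

    ToyWatchTree() {
        nodes.add("/");
    }

    boolean exists(String path, boolean watch) {
        if (watch) dataWatches.add(path);           // exists() sets a data watch
        return nodes.contains(path);
    }

    List<String> getChildren(String path, boolean watch) {
        if (watch) childWatches.add(path);          // getChildren() sets a child watch
        List<String> children = new ArrayList<>();
        for (String n : nodes) {
            if (!n.equals("/") && parent(n).equals(path)) children.add(n);
        }
        return children;
    }

    void create(String path) {
        if (!nodes.add(path)) throw new IllegalStateException("node exists: " + path);
        fire(dataWatches, path, "NodeCreated");
        fire(childWatches, parent(path), "NodeChildrenChanged");
    }

    void setData(String path) {
        if (!nodes.contains(path)) throw new IllegalStateException("no node: " + path);
        fire(dataWatches, path, "NodeDataChanged");
    }

    void delete(String path) {
        if (!nodes.remove(path)) throw new IllegalStateException("no node: " + path);
        boolean hadData = dataWatches.remove(path);
        boolean hadChild = childWatches.remove(path);
        if (hadData || hadChild) events.add("NodeDeleted " + path);  // one notification
        fire(childWatches, parent(path), "NodeChildrenChanged");
    }

    // Removing the watch while firing it is what makes it a one-time trigger.
    private void fire(Set<String> watches, String path, String type) {
        if (watches.remove(path)) events.add(type + " " + path);
    }

    private static String parent(String path) {
        int i = path.lastIndexOf('/');
        return i == 0 ? "/" : path.substring(0, i);
    }
}
```

For example, after exists("/app", true) and getChildren("/", true), a create("/app") records a NodeCreated event for /app and a NodeChildrenChanged event for /. A following setData("/app") records nothing, because the data watch was already consumed and has not been set again.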

Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.

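Watches are maintained on the ZooKeeper server. A client reconnect does not invalidate them, and all of this is transparent to the client, with one exception: if a znode is created and then deleted while the connection is lost, a watch for that znode's existence will miss the event.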

Semantics of Watches

We can set watches with the three calls that read the state of ZooKeeper: exists, getData, and getChildren. The following list details the events that a watch can trigger and the calls that enable them:

  • Created event:

    Enabled with a call to exists.

  • Deleted event:

    Enabled with a call to exists, getData, and getChildren.

  • Changed event:

    Enabled with a call to exists and getData.

  • Child event:

    Enabled with a call to getChildren.

Remove Watches

We can remove the watches registered on a znode with a call to removeWatches. Also, a ZooKeeper client can remove watches locally even if there is no server connection by setting the local flag to true. The following list details the events which will be triggered after the successful watch removal.

  • Child Remove event:

    Watcher which was added with a call to getChildren.

  • Data Remove event:

    Watcher which was added with a call to exists or getData.

What ZooKeeper Guarantees about Watches

With regard to watches, ZooKeeper maintains these guarantees:

  • Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensure that everything is dispatched in order.

  • A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.

  • The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

Things to Remember about Watches

  • Watches are one time triggers; if you get a watch event and you want to get notified of future changes, you must set another watch.

  • Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.)

  • A watch object, or function/context pair, will only be triggered once for a given notification. For example, if the same watch object is registered for an exists and a getData call for the same file and that file is then deleted, the watch object would only be invoked once with the deletion notification for the file.

  • When you disconnect from a server (for example, when the server fails), you will not get any watches until the connection is reestablished. For this reason session events are sent to all outstanding watch handlers. Use session events to go into a safe mode: you will not be receiving events while disconnected, so your process should act conservatively in that mode.

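1. Watches are delivered in order, and a watch is a one-time trigger.

2. If the same watch is set on the same znode several times, it is still triggered only once.

3. When the connection between ZooKeeper and the client is lost, the client receives no watch events until the connection is reestablished, which is why session events are sent to all outstanding watch handlers.

4. Because watches are one-time triggers, there is a gap between receiving an event and sending the request that sets the next watch, so you will not necessarily see every change to the node. Be prepared to handle that case, or at least be aware of it.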
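One way to notice that changes slipped through the gap in point 4 is to compare the data version from the stat structure across re-reads. This is only a sketch of that idea under stated assumptions: the class ChangeTracker is hypothetical, the version is assumed to be the one returned by the read that also sets the new watch, and the count is meaningless if the znode was deleted and recreated in between.

```java
// Counts data updates that happened between two reads, using the znode's data version.
final class ChangeTracker {
    private int lastSeenVersion;

    ChangeTracker(int initialVersion) {
        this.lastSeenVersion = initialVersion;
    }

    /**
     * Call with the version returned by the re-read that sets the new watch.
     * Returns how many intermediate updates were never observed.
     */
    int onReRead(int currentVersion) {
        // One step is the change we were notified about; anything beyond it was skipped.
        int skipped = Math.max(0, currentVersion - lastSeenVersion - 1);
        lastSeenVersion = currentVersion;
        return skipped;
    }

    public static void main(String[] args) {
        ChangeTracker tracker = new ChangeTracker(3);
        System.out.println(tracker.onReRead(4)); // exactly the change we were told about
        System.out.println(tracker.onReRead(7)); // two updates happened before the re-read
    }
}
```

Whether skipped updates matter depends on the application: if only the latest value is needed, reading it together with the new watch is enough.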

Gotchas: Common Problems and Troubleshooting

So now you know ZooKeeper. It's fast, simple, your application works, but wait ... something's wrong. Here are some pitfalls that ZooKeeper users fall into:

  1. If you are using watches, you must look for the connected watch event. When a ZooKeeper client disconnects from a server, you will not receive notification of changes until reconnected. If you are watching for a znode to come into existence, you will miss the event if the znode is created and deleted while you are disconnected.

When using watches you must pay attention to connection events. If the client disconnects, you receive no events until it reconnects. If you are watching for a znode to come into existence and it is created and deleted while you are disconnected, you will miss the event.

  2. You must test ZooKeeper server failures. The ZooKeeper service can survive failures as long as a majority of servers are active. The question to ask is: can your application handle it? In the real world a client's connection to ZooKeeper can break. (ZooKeeper server failures and network partitions are common reasons for connection loss.) The ZooKeeper client library takes care of recovering your connection and letting you know what happened, but you must make sure that you recover your state and any outstanding requests that failed. Find out if you got it right in the test lab, not in production - test with a ZooKeeper service made up of several servers and subject them to reboots.

You must test ZooKeeper server failures and check that your application still works correctly.

  3. The list of ZooKeeper servers used by the client must match the list of ZooKeeper servers that each ZooKeeper server has. Things can work, although not optimally, if the client list is a subset of the real list of ZooKeeper servers, but not if the client lists ZooKeeper servers not in the ZooKeeper cluster.

The server list used by the client must match the list configured on the ZooKeeper servers themselves; otherwise the client may list a server that is not part of the cluster.

  4. Be careful where you put that transaction log. The most performance-critical part of ZooKeeper is the transaction log. ZooKeeper must sync transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, put trace files on NFS and increase the snapshotCount; it doesn't eliminate the problem, but it can mitigate it.

  5. Set your Java max heap size correctly. It is very important to avoid swapping. Going to disk unnecessarily will almost certainly degrade your performance unacceptably. Remember, in ZooKeeper, everything is ordered, so if one request hits the disk, all other queued requests hit the disk.

    To avoid swapping, try to set the heapsize to the amount of physical memory you have, minus the amount needed by the OS and cache. The best way to determine an optimal heap size for your configurations is to run load tests. If for some reason you can't, be conservative in your estimates and choose a number well below the limit that would cause your machine to swap. For example, on a 4G machine, a 3G heap is a conservative estimate to start with.

Setting the Java max heap size correctly is important for preventing swapping. Frequent swapping to disk hurts performance badly, and because ZooKeeper processes everything in order, once one request hits the disk, every request queued behind it does too.

To prevent swapping, set the heap size to the physical memory size minus some room for the OS and cache. The best approach is to run load tests; if that is not possible, the suggestion is a 3G heap on a 4G machine, roughly 3/4 of physical memory.
