mk-supervisor

(defserverfn mk-supervisor [conf shared-context ^ISupervisor isupervisor]
(log-message "Starting Supervisor with conf " conf)
(.prepare isupervisor conf (supervisor-isupervisor-dir conf)) ;;初始化supervisor-id,并存在localstate中(参考ISupervisor的实现)
(FileUtils/cleanDirectory (File. (supervisor-tmp-dir conf))) ;;清空本机的supervisor目录
(let [supervisor (supervisor-data conf shared-context isupervisor)
;;创建两个event-manager,用于在后台执行function
[event-manager processes-event-manager :as managers] [(event/event-manager false) (event/event-manager false)]
sync-processes (partial sync-processes supervisor) ;;partial sync-process
;;mk-synchronize-supervisor, mk-supervisor的主要工作,参考下面
synchronize-supervisor (mk-synchronize-supervisor supervisor sync-processes event-manager processes-event-manager)
;;定义生成supervisor hb的funciton
heartbeat-fn (fn [] (.supervisor-heartbeat!
(:storm-cluster-state supervisor)
(:supervisor-id supervisor)
(SupervisorInfo. (current-time-secs)
(:my-hostname supervisor)
(:assignment-id supervisor)
(keys @(:curr-assignment supervisor))
;; used ports
(.getMetadata isupervisor)
(conf SUPERVISOR-SCHEDULER-META)
((:uptime supervisor)))))]
;;先调用heartbeat-fn发送一次supervisor的hb
    ;;接着使用schedule-recurring去定期调用heartbeat-fn更新hb
    (heartbeat-fn)
;; should synchronize supervisor so it doesn't launch anything after being down (optimization)
(schedule-recurring (:timer supervisor)
0
(conf SUPERVISOR-HEARTBEAT-FREQUENCY-SECS)
heartbeat-fn))
 

mk-synchronize-supervisor

supervisor很简单, 主要管两件事,
当assignment发生变化时, 从nimbus同步topology的代码到本地
当assignment发生变化时, check workers状态, 保证被分配的work的状态都是valid

两个需求,
1. 当assignment发生变化时触发
    怎样通过zookeeper的watcher实现这个反复触发机制, 参考
Storm-源码分析- Storm中Zookeeper的使用

2. 因为比较耗时, 后台执行
    创建两个event-manager, 分别用于后台执行mk-synchronize-supervisor和sync-processes

mk-synchronize-supervisor, 比较特别的是内部用了一个有名字的匿名函数this来封装这个函数体
刚开始看到非常诧异, 其实目的是为了可以在sync-callback中将这个函数add到event-manager里面去
即每次被调用, 都需要再一次把sync-callback注册到zk, 以保证下次可以被继续触发

(defn mk-synchronize-supervisor [supervisor sync-processes event-manager processes-event-manager]
(fn this []
(let [conf (:conf supervisor)
storm-cluster-state (:storm-cluster-state supervisor)
^ISupervisor isupervisor (:isupervisor supervisor)
^LocalState local-state (:local-state supervisor) ;;本地缓存数据库
sync-callback (fn [& ignored] (.add event-manager this)) ;;生成callback函数(后台执行mk-synchronize-supervisor)
assignments-snapshot (assignments-snapshot storm-cluster-state sync-callback) ;;读取assignments,并注册callback,在zk->assignment发生变化时被触发
storm-code-map (read-storm-code-locations assignments-snapshot) ;;从哪儿下载topology code
downloaded-storm-ids (set (read-downloaded-storm-ids conf)) ;;已经下载了哪些topology
all-assignment (read-assignments ;;supervisor的port上被分配了哪些executors
assignments-snapshot
(:assignment-id supervisor)) ;;supervisor-id
new-assignment (->> all-assignment ;;new=all,因为confirmAssigned没有具体实现,always返回true
(filter-key #(.confirmAssigned isupervisor %)))
assigned-storm-ids (assigned-storm-ids-from-port-assignments new-assignment) ;;supervisor上被分配的topology id集合
existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)] ;;从local-state数据库里面读出当前保存的local assignments ;;下载新分配的topology代码
(doseq [[storm-id master-code-dir] storm-code-map]
(when (and (not (downloaded-storm-ids storm-id))
(assigned-storm-ids storm-id))
(download-storm-code conf storm-id master-code-dir)))
      (.put local-state     ;;把new-assignment存到local-state数据库中
LS-LOCAL-ASSIGNMENTS
new-assignment)
(reset! (:curr-assignment supervisor) new-assignment) ;;把new-assignment cache到supervisor对象中
      ;;删除无用的topology code 
;;remove any downloaded code that's no longer assigned or active
(doseq [storm-id downloaded-storm-ids]
(when-not (assigned-storm-ids storm-id)
(log-message "Removing code for storm id " storm-id)
(rmr (supervisor-stormdist-root conf storm-id))
))
      ;;后台执行sync-processes
(.add processes-event-manager sync-processes)
)))

sync-processes

sync-processes用于管理workers, 比如处理不正常的worker或dead worker, 并创建新的workers
首先从本地读出workers的hb, 来判断work状况, shutdown所有状态非valid的workers
并为被assignment, 而worker状态非valid的slot, 创建新的worker

(defn sync-processes [supervisor]
(let [conf (:conf supervisor)
^LocalState local-state (:local-state supervisor)
assigned-executors (defaulted (.get local-state LS-LOCAL-ASSIGNMENTS) {})
now (current-time-secs)
allocated (read-allocated-workers supervisor assigned-executors now) ;;1.读取当前worker的状况
keepers (filter-val ;;找出状态为valid的worker
(fn [[state _]] (= state :valid))
allocated)
keep-ports (set (for [[id [_ hb]] keepers] (:port hb))) ;;keepers的ports集合
         ;;select-keys-pred(pred map), 对map中的key使用pred进行过滤
         ;;找出assigned-executors中executor的port, 哪些不属于keep-ports, 
        ;;即找出新被assign的workers或那些虽被assign但状态不是valid的workers(dead或没有start)
;;这些executors需要从新分配到新的worker上去

reassign-executors (select-keys-pred (complement keep-ports) assigned-executors)
new-worker-ids (into
{}
(for [port (keys reassign-executors)] ;;为reassign-executors的port产生新的worker-id
[port (uuid)]))
]
;; 1. to kill are those in allocated that are dead or disallowed
;; 2. kill the ones that should be dead
;; - read pids, kill -9 and individually remove file
;; - rmr heartbeat dir, rmdir pid dir, rmdir id dir (catch exception and log)
;; 3. of the rest, figure out what assignments aren't yet satisfied
;; 4. generate new worker ids, write new "approved workers" to LS
;; 5. create local dir for worker id
;; 5. launch new workers (give worker-id, port, and supervisor-id)
;; 6. wait for workers launch
(doseq [[id [state heartbeat]] allocated]
(when (not= :valid state) ;;shutdown所有状态不是valid的worker
(shutdown-worker supervisor id)))
(doseq [id (vals new-worker-ids)]
(local-mkdirs (worker-pids-root conf id))) ;;为新的worker创建目录, 并加到local-state的LS-APPROVED-WORKERS中
(.put local-state LS-APPROVED-WORKERS ;;更新的approved worker, 状态为valid的 + new workers
(merge
(select-keys (.get local-state LS-APPROVED-WORKERS) ;;现有approved worker中状态为valid
(keys keepers))
(zipmap (vals new-worker-ids) (keys new-worker-ids)) ;;new workers
))
(wait-for-workers-launch ;;2.wait-for-workers-launch
conf
(dofor [[port assignment] reassign-executors]
(let [id (new-worker-ids port)]
(launch-worker supervisor
(:storm-id assignment)
port
id)
id)))
))

1. read-allocated-workers

(defn read-allocated-workers
"Returns map from worker id to worker heartbeat. if the heartbeat is nil, then the worker is dead (timed out or never wrote heartbeat)"
[supervisor assigned-executors now]
(let [conf (:conf supervisor)
^LocalState local-state (:local-state supervisor)
;从local-state中读出每个worker的hb, 当然每个worker进程会不断的更新本地hb
id->heartbeat (read-worker-heartbeats conf)
        approved-ids (set (keys (.get local-state LS-APPROVED-WORKERS)))] ;;从local-state读出approved的worker
(into
{}
(dofor [[id hb] id->heartbeat] ;;根据hb来判断worker的当前状态
(let [state (cond
(or (not (contains? approved-ids id))
(not (matches-an-assignment? hb assigned-executors)))
:disallowed ;;不被允许
(not hb)
:not-started ;;无hb,没有start
(> (- now (:time-secs hb))
(conf SUPERVISOR-WORKER-TIMEOUT-SECS))
:timed-out ;;超时,dead
true
:valid)]
(log-debug "Worker " id " is " state ": " (pr-str hb) " at supervisor time-secs " now)
[id [state hb]] ;;返回每个worker的当前state和hb
))
)))

2. wait-for-workers-launch

对reassign-executors中的每个new_work_id调用launch-worker

最终调用wait-for-workers-launch, 等待worder被成功launch

逻辑也比较简单, check hb, 如果没有就不停的sleep, 至到超时, 打印failed to start

(defn- wait-for-worker-launch [conf id start-time]
(let [state (worker-state conf id)]
(loop []
(let [hb (.get state LS-WORKER-HEARTBEAT)]
(when (and
(not hb)
(<
(- (current-time-secs) start-time)
(conf SUPERVISOR-WORKER-START-TIMEOUT-SECS)
))
(log-message id " still hasn't started")
(Time/sleep 500)
(recur)
)))
(when-not (.get state LS-WORKER-HEARTBEAT)
(log-message "Worker " id " failed to start")
))) (defn- wait-for-workers-launch [conf ids]
(let [start-time (current-time-secs)]
(doseq [id ids]
(wait-for-worker-launch conf id start-time))
))

Storm-源码分析-Topology Submit-Supervisor的更多相关文章

  1. Storm源码分析--Nimbus-data

    nimbus-datastorm-core/backtype/storm/nimbus.clj (defn nimbus-data [conf inimbus] (let [forced-schedu ...

  2. JStorm与Storm源码分析(四)--均衡调度器,EvenScheduler

    EvenScheduler同DefaultScheduler一样,同样实现了IScheduler接口, 由下面代码可以看出: (ns backtype.storm.scheduler.EvenSche ...

  3. JStorm与Storm源码分析(一)--nimbus-data

    Nimbus里定义了一些共享数据结构,比如nimbus-data. nimbus-data结构里定义了很多公用的数据,请看下面代码: (defn nimbus-data [conf inimbus] ...

  4. JStorm与Storm源码分析(三)--Scheduler,调度器

    Scheduler作为Storm的调度器,负责为Topology分配可用资源. Storm提供了IScheduler接口,用户可以通过实现该接口来自定义Scheduler. 其定义如下: public ...

  5. JStorm与Storm源码分析(二)--任务分配,assignment

    mk-assignments主要功能就是产生Executor与节点+端口的对应关系,将Executor分配到某个节点的某个端口上,以及进行相应的调度处理.代码注释如下: ;;参数nimbus为nimb ...

  6. storm源码分析之任务分配--task assignment

    在"storm源码分析之topology提交过程"一文最后,submitTopologyWithOpts函数调用了mk-assignments函数.该函数的主要功能就是进行topo ...

  7. storm源码分析之topology提交过程

    storm集群上运行的是一个个topology,一个topology是spouts和bolts组成的图.当我们开发完topology程序后将其打成jar包,然后在shell中执行storm jar x ...

  8. Nimbus<三>Storm源码分析--Nimbus启动过程

    Nimbus server, 首先从启动命令开始, 同样是使用storm命令"storm nimbus”来启动看下源码, 此处和上面client不同, jvmtype="-serv ...

  9. JStorm与Storm源码分析(五)--SpoutOutputCollector与代理模式

    本文主要是解析SpoutOutputCollector源码,顺便分析该类中所涉及的设计模式–代理模式. 首先介绍一下Spout输出收集器接口–ISpoutOutputCollector,该接口主要声明 ...

  10. Nimbus<二>storm启动nimbus源码分析-nimbus.clj

    nimbus是storm集群的"控制器",是storm集群的重要组成部分.我们可以通用执行bin/storm nimbus >/dev/null 2>&1 &a ...

随机推荐

  1. cookie,session,localStage,sessionStage区别

    Cookie和Session详解 1.什么是Cookie Cookie是存放在客户端浏览器的Name/Value键值对,访问服务器时,会自动传递给服务器. Cookie的生成方式有两种,服务器写入,客 ...

  2. -[__NSArrayI removeAllObjects]: unrecognized selector sent to instance 0x7fa8dc830110

    问题 今天做项目,遇到了这个问题 -[__NSArrayI removeAllObjects]: unrecognized selector sent to instance 0x7fa8dc8301 ...

  3. Ubuntu安装Sublime Text 2

    参考资料:http://www.technoreply.com/how-to-install-sublime-text-2-on-ubuntu-12-04-unity/ 1.去Sublime Text ...

  4. 安装CentOS版本的yum(转载)

    安装CentOS版本的yum 下载源:http://mirrors.163.com/centos/6/os/i386/Packages/ 材料准备: python-iniparse-0.3.1-2.1 ...

  5. php读取csv的问题

    csv文件要用utf-8 无bom格式保存 如果有英文外的字符,另外每项要用双引号,不用双引号不能保存非英文字符

  6. js获取字符串的实际长度并截断实际长度

    在项目中有这样一个需求,就是一个很长的字符串,需要截断成几组字符串,而这几组字符串里既包含汉字,又包含字母,下面提供了几种方法 1,获取字符串的长度 function getstrlength(str ...

  7. ES6学习笔记(1,let和const)

    在介绍let和const之前我们先复习一下相关的知识点. 关于函数作用域 开发过程中,在ES6(ECMA2015)标准推出之前,声明变量的方式一直都是var,而变量的作用域一般也只在函数内部,即函数作 ...

  8. Huber-Markov先验模型相关

    随机概率重建-MAP算法 随机概率重建:利用贝叶斯理论作为框架,理想图像的先验知识作为约束条件进行图像重建.常用的随机概率超分辨率重建包括最大后验概率估计法(MAP)和极大似然估计法(ML). MAP ...

  9. python之简单的get和post请求

    1.json 模块提供了一种很简单的方式来编码和解码JSON数据. 其中两个主要的函数是 json.dumps() 和 json.loads() , 要比其他序列化函数库如pickle的接口少得多. ...

  10. Mongodb默认开启与关闭

    默认启动:   $ ./mongodb   默认数据保存路径:/data/db/ 默认端口:27017   修改默认路径:   --dbpath $ ./mongdb --dbpath /mongod ...