The supervisor is a key component of a Storm cluster: it manages the "worker nodes". The supervisor talks to ZooKeeper and, through ZooKeeper's watch mechanism, learns whether there are new assignments to claim or whether existing assignments have been redistributed. A supervisor is started with bin/storm supervisor >/dev/null 2>&1 &. bin/storm is a Python script, and it defines a supervisor function:

The supervisor function
def supervisor(klass="backtype.storm.daemon.supervisor"):
   """Syntax: [storm supervisor]

Launches the supervisor daemon. This command should be run
   under supervision with a tool like daemontools or monit.

See Setting up a Storm cluster for more information.
   (https://github.com/nathanmarz/storm/wiki/Setting-up-a-Storm-cluster)
   """
   cppaths = [STORM_DIR + "/log4j", STORM_DIR + "/conf"]
   jvmopts = parse_args(confvalue("supervisor.childopts", cppaths)) + [
       "-Dlogfile.name=supervisor.log",
       "-Dlog4j.configuration=storm.log.properties",
   ]
   exec_storm_class(
       klass,
       jvmtype="-server",
       extrajars=cppaths,
       jvmopts=jvmopts)

The klass parameter defaults to backtype.storm.daemon.supervisor, which names a Java class. STORM_DIR is the Storm installation directory; cppaths holds the log4j configuration directory and the conf directory containing storm.yaml; jvmopts holds the options passed to the JVM: the options read from supervisor.childopts, plus the log file name (-Dlogfile.name=supervisor.log) and the log4j configuration file (-Dlog4j.configuration=storm.log.properties). The exec_storm_class function is straightforward; its implementation is as follows:

The exec_storm_class function
def exec_storm_class(klass, jvmtype="-server", jvmopts=[], extrajars=[], args=[], fork=False):  
   global CONFFILE  
   all_args = [  
       "java", jvmtype, get_config_opts(),  
       "-Dstorm.home=" + STORM_DIR,  
       "-Djava.library.path=" + confvalue("java.library.path", extrajars),  
       "-Dstorm.conf.file=" + CONFFILE,  
       "-cp", get_classpath(extrajars),  
   ] + jvmopts + [klass] + list(args)  
   print "Running: " + " ".join(all_args)  
   if fork:  
       os.spawnvp(os.P_WAIT, "java", all_args)  
   else:  
       os.execvp("java", all_args) # replaces the current process and never returns

get_config_opts() returns the default JVM options; confvalue("java.library.path", extrajars) returns the load path of the native libraries (JZMQ) that Storm uses; get_classpath(extrajars) returns the full classpath of all dependency jars. These pieces are concatenated into a java -cp command that runs the main method of klass. Since klass defaults to backtype.storm.daemon.supervisor, exec_storm_class ultimately invokes the main method of the backtype.storm.daemon.supervisor class.
The backtype.storm.daemon.supervisor class is defined in supervisor.clj:

The backtype.storm.daemon.supervisor class
(ns backtype.storm.daemon.supervisor
 (:import [backtype.storm.scheduler ISupervisor])
 (:use [backtype.storm bootstrap])
 (:use [backtype.storm.daemon common])
 (:require [backtype.storm.daemon [worker :as worker]])
 (:gen-class
   :methods [^{:static true} [launch [backtype.storm.scheduler.ISupervisor] void]]))

(bootstrap)

;; ... other functions elided ...

(defn -main []
  (-launch (standalone-supervisor)))

:gen-class tells Clojure to generate the Java class backtype.storm.daemon.supervisor and to declare a static method launch, which takes an instance implementing the backtype.storm.scheduler.ISupervisor interface. The argument that -main passes to launch is produced by the standalone-supervisor function, which returns an instance implementing ISupervisor; its definition follows the short reify sketch below.
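
Before going through standalone-supervisor itself, it helps to see how reify produces an anonymous implementation of a Java interface. The following is a minimal, self-contained sketch that uses java.util.Comparator instead of Storm's ISupervisor, purely to show the mechanism:

;; reify returns an object implementing the given interface; standalone-supervisor
;; uses the same mechanism to hand an ISupervisor instance to launch
(def desc-comparator
  (reify java.util.Comparator
    (compare [this a b]
      ;; reverse the arguments to get descending order
      (clojure.core/compare b a))))

;; usage: sort a vector with the reified comparator
(sort desc-comparator [3 1 2])   ;; => (3 2 1)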

The standalone-supervisor function
;; This function returns an instance implementing the ISupervisor interface.
(defn standalone-supervisor []
  (let [conf-atom (atom nil)
        id-atom (atom nil)]
    (reify ISupervisor
      ;; prepare creates a LocalState object, a disk-backed K/V database (see the LocalState class)
      (prepare [this conf local-dir]
        ;; conf-atom is an atom holding the Storm cluster configuration
        (reset! conf-atom conf)
        ;; state is bound to a LocalState object; local-dir is the root directory of the database on disk.
        ;; The database is essentially a HashMap that is serialized and stored under local-dir.
        (let [state (LocalState. local-dir)
              ;; LS-ID is the string "supervisor-id", defined in common.clj. If state already contains an id
              ;; for this supervisor, curr-id is bound to it; otherwise curr-id is bound to a freshly
              ;; generated 32-character UUID
              curr-id (if-let [id (.get state LS-ID)]
                        id
                        ;; generate-supervisor-id calls uuid to produce a 32-character id
                        (generate-supervisor-id))]
          ;; write the supervisor id back into state
          (.put state LS-ID curr-id)
          ;; id-atom is an atom holding this supervisor's id
          (reset! id-atom curr-id))
        )
      ;; always returns true
      (confirmAssigned [this port]
        true)
      ;; reads all supervisor slot ports from the configuration; because map returns a lazy sequence,
      ;; doall is needed to fully realize it (see the sketch after this block)
      (getMetadata [this]
        (doall (map int (get @conf-atom SUPERVISOR-SLOTS-PORTS))))
      ;; returns the supervisor id
      (getSupervisorId [this]
        @id-atom)
      ;; returns the assignment id, which is the same as the supervisor id
      (getAssignmentId [this]
        @id-atom)
      ;; killedWorker is a no-op
      (killedWorker [this port]
        )
      ;; assigned is a no-op
      (assigned [this ports]
        ))))
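
The doall in getMetadata matters because map returns a lazy sequence. A quick sketch in plain Clojure (no Storm code involved) shows the difference:

;; map is lazy: the println side effect does not run until the sequence is consumed
(def lazy-ports (map #(do (println "realizing" %) (int %)) [6700 6701 6702]))

;; doall walks the whole sequence immediately, so a fully realized list is handed
;; back to Java callers such as ISupervisor.getMetadata
(def realized-ports (doall (map int [6700 6701 6702])))
realized-ports   ;; => (6700 6701 6702)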

LocalState is a Java class (see LocalState.java); it holds a VersionedStore object (see VersionedStore.java). Both classes are implemented in Java and are fairly simple, so they are not analyzed in detail here.
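
As a rough usage sketch of the pattern seen in prepare above: this assumes LocalState lives in backtype.storm.utils (as in Storm 0.9.x) and that its get/put behave as the calls above suggest; the directory path and the stored value are made up for illustration.

(import 'backtype.storm.utils.LocalState)

;; LocalState is a disk-backed K/V store rooted at a local directory
;; ("/tmp/supervisor-demo" is a hypothetical path used only for this sketch)
(let [state (LocalState. "/tmp/supervisor-demo")]
  (when-not (.get state "supervisor-id")          ; read a key, nil if absent
    (.put state "supervisor-id" "demo-uuid"))     ; persist a value under the key
  (.get state "supervisor-id"))                   ; => "demo-uuid"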

The mk-supervisor function is defined as follows:

The mk-supervisor function
;; ... (the beginning of mk-supervisor is elided; the fragment below is the tail of a call that schedules
;;      sync-processes to be added to processes-event-manager every SUPERVISOR-MONITOR-FREQUENCY-SECS seconds) ...
                         (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
                         (fn [] (.add processes-event-manager sync-processes))))
   (log-message "Starting supervisor with id " (:supervisor-id supervisor) " at host " (:my-hostname supervisor))
   ;; return an instance implementing the Shutdownable interface and the SupervisorDaemon and DaemonCommon protocols
   (reify
    Shutdownable
    ;; shutting down the supervisor means releasing the resources it owns
    (shutdown [this]
              (log-message "Shutting down supervisor " (:supervisor-id supervisor))
              (reset! (:active supervisor) false)
              (cancel-timer (:timer supervisor))
              (.shutdown event-manager)
              (.shutdown processes-event-manager)
              (.disconnect (:storm-cluster-state supervisor)))
    SupervisorDaemon
    ;; returns the cluster configuration
    (get-conf [this]
      conf)
    ;; returns the supervisor-id
    (get-id [this]
      (:supervisor-id supervisor))
    ;; as the name says: shuts down all workers
    (shutdown-all-workers [this]
      (let [ids (my-worker-ids conf)]
        (doseq [id ids]
          (shutdown-worker supervisor id)
          )))
    DaemonCommon
    (waiting? [this]
      (or (not @(:active supervisor))
          (and
           ;; is the timer thread sleeping?
           (timer-waiting? (:timer supervisor))
           ;; waiting? on the event managers checks whether the event threads of event-manager and
           ;; processes-event-manager are sleeping; the memfn macro wraps a Java method so it can be
           ;; used as a Clojure function
           (every? (memfn waiting?) managers)))
          ))))

The supervisor-data function is defined as follows:

supervisor-data returns a map containing the supervisor's metadata.

The supervisor-data function
;; ... (earlier entries of the metadata map elided; only its tail is shown below) ...
                              ( "Error when processing an event")
                              ))
  ;; an atom caching the versioned assignment information
  :assignment-versions (atom {})
  })

The sync-processes function is defined as follows:

The sync-processes function
;; sync-processes manages workers: it handles abnormal or dead workers and launches new ones
;; supervisor is the supervisor metadata map
(defn sync-processes [supervisor]
 ;; conf is the Storm configuration map
 (let [conf (:conf supervisor)
       ;; local-state is the supervisor's LocalState instance
       ^LocalState local-state (:local-state supervisor)
       ;; the local assignment map (port -> LocalAssignment) read from LocalState; a LocalAssignment
       ;; wraps a storm-id and the executors assigned to it
       assigned-executors (defaulted (.get local-state LS-LOCAL-ASSIGNMENTS) {})
       ;; now is the current time in seconds
       now (current-time-secs)
       ;; allocated is a map of worker-id -> [state heartbeat]; see read-allocated-workers below
       allocated (read-allocated-workers supervisor assigned-executors now)
       ;; keep only the entries of allocated whose state is :valid and bind them to keepers
       keepers (filter-val
                (fn [[state _]] (= state :valid))
                allocated)
       ;; keep-ports is the set of ports appearing in the heartbeats of the keepers
       keep-ports (set (for [[id [_ hb]] keepers] (:port hb)))
       ;; reassign-executors keeps the entries of assigned-executors whose port is NOT in keep-ports,
       ;; i.e. assignments whose worker process has died and must be relaunched
       reassign-executors (select-keys-pred (complement keep-ports) assigned-executors)
       ;; new-worker-ids maps port -> freshly generated worker-id for the workers that must be (re)started
       new-worker-ids (into
                       {}
                       (for [port (keys reassign-executors)]
                         [port (uuid)]))
       ]
   ;; 1. to kill are those in allocated that are dead or disallowed
   ;; 2. kill the ones that should be dead
   ;;     - read pids, kill -9 and individually remove file
   ;;     - rmr heartbeat dir, rmdir pid dir, rmdir id dir (catch exception and log)
   ;; 3. of the rest, figure out what assignments aren't yet satisfied
   ;; 4. generate new worker ids, write new "approved workers" to LS
   ;; 5. create local dir for worker id
   ;; 5. launch new workers (give worker-id, port, and supervisor-id)
   ;; 6. wait for workers launch
 
   (log-debug "Syncing processes")
   (log-debug "Assigned executors: " assigned-executors)
   (log-debug "Allocated: " allocated)
   ;; allocated maps worker-id -> [state heartbeat]; id is the worker-id, state the worker state,
   ;; heartbeat the worker's heartbeat
   (doseq [[id [state heartbeat]] allocated]
     ;; shut down every worker whose state is not :valid
     (when (not= :valid state)
       (log-message
        "Shutting down and clearing state for id " id
        ". Current supervisor time: " now
        ". State: " state
        ", Heartbeat: " (pr-str heartbeat))
       ;; shutdown-worker kills the worker process; see its definition below
       (shutdown-worker supervisor id)
       ))
   ;; new-worker-ids holds the worker-ids that need a new process; create the local directory
   ;; "{storm.local.dir}/workers/{worker-id}" for each of them
   (doseq [id (vals new-worker-ids)]
     (local-mkdirs (worker-pids-root conf id)))
   ;; store the merged map back into local-state under LS-APPROVED-WORKERS
   (.put local-state LS-APPROVED-WORKERS
         ;; invert new-worker-ids from port -> worker-id to worker-id -> port and merge it with
         ;; the LS-APPROVED-WORKERS entry of local-state
         (merge
          ;; select-keys keeps only the LS-APPROVED-WORKERS entries whose key is in keepers; the result is a map
          (select-keys (.get local-state LS-APPROVED-WORKERS)
                       (keys keepers))
          ;; zipmap builds the worker-id -> port map from new-worker-ids
          (zipmap (vals new-worker-ids) (keys new-worker-ids))
          ))
   ;; wait-for-workers-launch waits until all workers have started; see its definition below
   (wait-for-workers-launch
    conf
    ;; assignment is the executor information to run on this port
    (dofor [[port assignment] reassign-executors]
      ;; id is the worker-id chosen for this port
      (let [id (new-worker-ids port)]
        (log-message "Launching worker with assignment "
                     (pr-str assignment)
                     " for this supervisor "
                     (:supervisor-id supervisor)
                     " on port "
                     port
                     " with id "
                     id
                     )
        ;; launch-worker starts the worker; worker startup is analyzed in a separate article
        (launch-worker supervisor
                       (:storm-id assignment)
                       port
                       id)
        id)))
   ))
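
The keepers / reassign-executors split can be reproduced with plain Clojure maps. This is a toy sketch, not Storm code: filter-val and select-keys-pred are Storm utility functions, approximated here with core functions, and the data is invented.

;; toy data: worker-id -> [state heartbeat], as produced by read-allocated-workers
(def allocated
  {"w-1" [:valid     {:port 6700}]
   "w-2" [:timed-out {:port 6701}]})

;; keep only the workers whose state is :valid (what filter-val does above)
(def keepers
  (into {} (filter (fn [[_ [state _]]] (= state :valid)) allocated)))

;; ports whose worker survives
(def keep-ports (set (for [[_ [_ hb]] keepers] (:port hb))))

;; assignments on every other port must be relaunched (what select-keys-pred does above)
(def assigned-executors {6700 :assignment-a, 6701 :assignment-b})
(def reassign-executors
  (into {} (remove (fn [[port _]] (keep-ports port)) assigned-executors)))
;; => {6701 :assignment-b}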

The read-allocated-workers function is defined as follows:

The read-allocated-workers function
;; returns a map of worker-id -> [state heartbeat]; a nil heartbeat means the worker is dead
(defn read-allocated-workers
 "Returns map from worker id to worker heartbeat. if the heartbeat is nil, then the worker is dead (timed out or never wrote heartbeat)"
 ;; supervisor is the supervisor metadata, assigned-executors the port -> LocalAssignment map for this
 ;; supervisor, now the current time
 [supervisor assigned-executors now]
 ;; conf is the cluster configuration
 (let [conf (:conf supervisor)
       ;; the supervisor's LocalState instance
       ^LocalState local-state (:local-state supervisor)
       ;; id->heartbeat maps worker-id -> heartbeat for the worker processes running on this supervisor
       id->heartbeat (read-worker-heartbeats conf)
       ;; approved-ids is the set of worker-ids stored under LS-APPROVED-WORKERS in LocalState
       approved-ids (set (keys (.get local-state LS-APPROVED-WORKERS)))]
   ;; build the worker-id -> [state hb] map
   (into
    {}
    (dofor [[id hb] id->heartbeat]
           ;; cond works like nested if...else
           (let [state (cond
                         ;; no heartbeat: the worker has not started yet
                         (not hb)
                           :not-started
                         ;; not in approved-ids, or matches-an-assignment? returns false: the worker is disallowed
                         (or (not (contains? approved-ids id))
                             ;; matches-an-assignment? compares the storm-id and the executor set of the heartbeat
                             ;; with those of the assignment to decide whether the worker is still assigned
                             (not (matches-an-assignment? hb assigned-executors)))
                           :disallowed
                         ;; current time minus last heartbeat time exceeds the heartbeat timeout: timed out
                         (> (- now (:time-secs hb))
                            (conf SUPERVISOR-WORKER-TIMEOUT-SECS))
                           :timed-out
                         ;; otherwise the worker is valid
                         true
                           :valid)]
             (log-debug "Worker " id " is " state ": " (pr-str hb) " at supervisor time-secs " now)
             [id [state hb]]
             ))
    )))
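
The state classification can be restated as a small standalone function. This is a toy version, not the Storm source: the timeout argument stands in for (conf SUPERVISOR-WORKER-TIMEOUT-SECS), and approved?/matches? replace the lookups done above.

(defn classify-worker
  "Toy restatement of the worker-state decision in read-allocated-workers."
  [hb approved? matches? now timeout-secs]
  (cond
    (not hb)                                 :not-started
    (or (not approved?) (not matches?))      :disallowed
    (> (- now (:time-secs hb)) timeout-secs) :timed-out
    :else                                    :valid))

;; usage
(classify-worker {:time-secs 100} true true 200 30)   ;; => :timed-out
(classify-worker {:time-secs 190} true true 200 30)   ;; => :valid
(classify-worker nil false false 200 30)              ;; => :not-started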

The read-worker-heartbeats function is defined as follows:

The read-worker-heartbeats function
;; builds the worker-id -> heartbeat map for the worker processes running on this supervisor
(defn read-worker-heartbeats
 "Returns map from worker id to heartbeat"
 [conf]
 ;; ids is the collection of worker-ids of the processes on this supervisor
 (let [ids (my-worker-ids conf)]
   ;; build the worker-id -> heartbeat map
   (into {}
     (dofor [id ids]
       ;; read-worker-heartbeat reads the heartbeat of the given worker-id from
       ;; "{storm.local.dir}/workers/{worker-id}/heartbeats" on the supervisor
       [id (read-worker-heartbeat conf id)]))
   ))

The my-worker-ids function is defined as follows:

The my-worker-ids function
;; returns the worker-ids of the processes running on this supervisor
(defn my-worker-ids [conf]
 ;; worker-root returns the supervisor-local directory "{storm.local.dir}/workers"; read-dir-contents lists
 ;; the file names under it, i.e. the worker-ids of all processes currently running on this supervisor
 (read-dir-contents (worker-root conf)))

The matches-an-assignment? function is defined as follows:

The matches-an-assignment? function
;; worker-heartbeat is the heartbeat; assigned-executors is the port -> LocalAssignment map of this supervisor
(defn matches-an-assignment? [worker-heartbeat assigned-executors]
 ;; take the port the process occupies from worker-heartbeat and look up the LocalAssignment in assigned-executors
 (let [local-assignment (assigned-executors (:port worker-heartbeat))]
   ;; return true if local-assignment is not nil, the storm-id of the heartbeat equals the storm-id of the
   ;; assignment, and the executor set of the heartbeat equals the executor set of the assignment; otherwise false
   (and local-assignment
        (= (:storm-id worker-heartbeat) (:storm-id local-assignment))
        ;; Constants/SYSTEM_EXECUTOR_ID is the executor id of the "system bolt"; besides the spouts and bolts
        ;; we define, a topology also contains some "system bolts"
        (= (disj (set (:executors worker-heartbeat)) Constants/SYSTEM_EXECUTOR_ID)
           (set (:executors local-assignment))))))
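
The executor-set comparison is easy to try with plain sets. Here :system is just a placeholder for Constants/SYSTEM_EXECUTOR_ID, not its real value:

;; the heartbeat reports the system executor in addition to the user executors,
;; so it is removed (disj) before comparing against the assignment
(def hb-executors       #{[1 1] [2 2] :system})
(def assigned-executors #{[1 1] [2 2]})

(= (disj hb-executors :system) assigned-executors)   ;; => true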

The shutdown-worker function is defined as follows:

The shutdown-worker function
;; ... (the beginning of shutdown-worker is elided; it looks up the worker's pids and first attempts a
;;      graceful "kill -15 pid") ...
   )) ;; allow 1 second for execution of cleanup threads on worker.
   ;; processes that did not exit after "kill -15 pid" are killed via force-kill-process,
   ;; which simply runs "kill -9 pid"
   (doseq [pid pids]
     (force-kill-process pid)
     (try
       ;; remove the pid file "{storm.local.dir}/workers/{worker-id}/pids/{pid}"
       (rmpath (worker-pid-path conf id pid))
       (catch Exception e))) ;; on windows, the supervisor may still holds the lock on the worker directory
   ;; try-cleanup-worker removes the worker's local directories; see its definition below
   (try-cleanup-worker conf id))
 (log-message "Shut down " (:supervisor-id supervisor) ":" id))

The try-cleanup-worker function is defined as follows:

The try-cleanup-worker function
;; removes the worker's local directories
(defn try-cleanup-worker [conf id]
 (try
   ;; remove the "{storm.local.dir}/workers/{worker-id}/heartbeats" directory
   (rmr (worker-heartbeats-root conf id))
   ;; this avoids a race condition with worker or subprocess writing pid around same time
   ;; remove the "{storm.local.dir}/workers/{worker-id}/pids" directory
   (rmpath (worker-pids-root conf id))
   ;; remove the "{storm.local.dir}/workers/{worker-id}" directory
   (rmpath (worker-root conf id))
 (catch RuntimeException e
   (log-warn-error e "Failed to cleanup worker " id ". Will retry later")
   )
 (catch java.io.FileNotFoundException e (log-message (.getMessage e)))
 (catch java.io.IOException e (log-message (.getMessage e)))
   ))

The wait-for-workers-launch function is defined as follows:

The wait-for-workers-launch function
;; wait-for-workers-launch waits until all the given workers have started
(defn- wait-for-workers-launch [conf ids]
 (let [start-time (current-time-secs)]
   (doseq [id ids]
     ;; delegate to wait-for-worker-launch for each worker
     (wait-for-worker-launch conf id start-time))
   ))

The wait-for-worker-launch function is defined as follows:

The wait-for-worker-launch function
;; ... (the beginning of wait-for-worker-launch is elided; only the tail of its polling loop is shown) ...
         )
         (recur)
         )))
   (when-not (.get state LS-WORKER-HEARTBEAT)
     (log-message "Worker " id " failed to start")
     )))

The mk-synchronize-supervisor function is defined as follows:

The mk-synchronize-supervisor function
;; mk-synchronize-supervisor returns a function named "this"
(defn mk-synchronize-supervisor [supervisor sync-processes event-manager processes-event-manager]
 (fn this []
   ;; conf is the cluster configuration
   (let [conf (:conf supervisor)
         ;; storm-cluster-state is the StormClusterState object
         storm-cluster-state (:storm-cluster-state supervisor)
         ;; isupervisor is the instance implementing the ISupervisor interface
         ^ISupervisor isupervisor (:isupervisor supervisor)
         ;; local-state is the LocalState instance
         ^LocalState local-state (:local-state supervisor)
         ;; sync-callback is an anonymous function that adds the "this" function defined above to event-manager,
         ;; so that "this" runs on a separate thread. On every run the callback has to be registered with
         ;; ZooKeeper again so that it can be triggered the next time the children of "/assignments" change
         sync-callback (fn [& ignored] (.add event-manager this))
         ;; assignment-versions is the cached, versioned assignment info: topology-id -> assignment info
         assignment-versions @(:assignment-versions supervisor)
         ;; assignments-snapshot is a map of topology-id -> AssignmentInfo (a snapshot of the current cluster
         ;; assignments); versions is the versioned assignment info. The assignments-snapshot function reads the
         ;; assignments from the "/assignments" node in ZooKeeper and registers the callback on that node;
         ;; see its definition below
         {assignments-snapshot :assignments versions :versions}  (assignments-snapshot
                                                                  storm-cluster-state sync-callback
                                                                  assignment-versions)
         ;; read-storm-code-locations returns a map of topology-id -> code directory of that topology on nimbus
         storm-code-map (read-storm-code-locations assignments-snapshot)
         ;; read-downloaded-storm-ids lists the topology-ids whose code jar has already been downloaded into
         ;; the local "{storm.local.dir}/supervisor/stormdist" directory
         downloaded-storm-ids (set (read-downloaded-storm-ids conf))
         ;; all-assignment is all the assignments for this supervisor, i.e. a map of port -> LocalAssignment
         all-assignment (read-assignments
                          assignments-snapshot
                          (:assignment-id supervisor))
         ;; confirmAssigned of isupervisor validates each key (port) of all-assignment; the entries that pass are
         ;; kept in new-assignment. isupervisor was created by standalone-supervisor, whose confirmAssigned simply
         ;; returns true, so new-assignment equals all-assignment
         new-assignment (->> all-assignment
                             (filter-key #(.confirmAssigned isupervisor %)))
         ;; assigned-storm-ids is the set of topology-ids assigned to this supervisor
         assigned-storm-ids (assigned-storm-ids-from-port-assignments new-assignment)
         ;; existing-assignment is the assignment already stored locally on this supervisor
         existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
     (log-debug "Synchronizing supervisor")
     (log-debug "Storm code map: " storm-code-map)
     (log-debug "Downloaded storm ids: " downloaded-storm-ids)
     (log-debug "All assignment: " all-assignment)
     (log-debug "New assignment: " new-assignment)
     
     ;; download code first
     ;; This might take awhile
     ;;   - should this be done separately from usual monitoring?
     ;; should we only download when topology is assigned to this supervisor?
     ;; storm-code-map maps every assigned topology-id in the cluster to the directory of its code jar on nimbus
     (doseq [[storm-id master-code-dir] storm-code-map]
       ;; if downloaded-storm-ids does not contain this storm-id but assigned-storm-ids does (i.e. the topology
       ;; must run on this supervisor and its code jar has not been downloaded from nimbus yet), download the code
       (when (and (not (downloaded-storm-ids storm-id))
                  (assigned-storm-ids storm-id))
         (log-message "Downloading code for storm id "
            storm-id
            " from "
            master-code-dir)
         ;; download the code jar, the serialized topology object and the runtime configuration for this storm-id
         ;; from nimbus and store them under "{storm.local.dir}/supervisor/stormdist/{storm-id}/"
         (download-storm-code conf storm-id master-code-dir)
         (log-message "Finished downloading code for storm id "
            storm-id
            " from "
            master-code-dir)
         ))

(log-debug "Writing new assignment "
                (pr-str new-assignment))
     ;; existing-assignment与new-assignment的差集表示不需要在该supervisor上运行的分配的集合,所以要把这些分配对应的worker关闭
     (doseq [p (set/difference (set (keys existing-assignment))
                               (set (keys new-assignment)))]
       ;; 当前storm版本0.9.2中,killedWorker为空实现,所以什么都没做
       (.killedWorker isupervisor (int p)))
     ;; assigned函数为空实现,什么也没有做
     (.assigned isupervisor (keys new-assignment))
     ;; 将最新分配信息new-assignment保存到local-state数据库中
     (.put local-state
           LS-LOCAL-ASSIGNMENTS
           new-assignment)
     ;; 将带有版本号的分配信息versions存入supervisor缓存:assignment-versions中
     (swap! (:assignment-versions supervisor) versions)
     ;; 重新设置supervisor缓存的:curr-assignment值为new-assignment,即保存当前storm集群上最新分配信息
     (reset! (:curr-assignment supervisor) new-assignment)
     ;; remove any downloaded code that's no longer assigned or active
     ;; important that this happens after setting the local assignment so that
     ;; synchronize-supervisor doesn't try to launch workers for which the
     ;; resources don't exist
     ;; if the supervisor host runs Windows ("Windows_NT"), call shutdown-disallowed-workers to shut down
     ;; the workers whose state is :disallowed
     (if on-windows? (shutdown-disallowed-workers supervisor))
     ;; iterate over downloaded-storm-ids, the set of topology-ids whose jar and related files have been downloaded
     (doseq [storm-id downloaded-storm-ids]
       ;; if the storm-id is not in assigned-storm-ids (the topologies that must run on this supervisor),
       ;; recursively delete the "{storm.local.dir}/supervisor/stormdist/{storm-id}" directory
       (when-not (assigned-storm-ids storm-id)
         (log-message "Removing code for storm id "
                      storm-id)
         (try
           (rmr (supervisor-stormdist-root conf storm-id))
           (catch Exception e (log-message (.getMessage e))))
         ))
     ;; add sync-processes to processes-event-manager so that it runs on a separate thread;
     ;; sync-processes is relatively slow, which is why it gets its own thread
     (.add processes-event-manager sync-processes)
     )))
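
The set/difference step near the end of the sync can be looked at in isolation with toy port numbers:

(require '[clojure.set :as set])

;; ports present in the previous local assignment but absent from the new one are
;; exactly the slots that were taken away from this supervisor
(def existing-ports #{6700 6701 6702})
(def new-ports      #{6700 6702})

(set/difference existing-ports new-ports)   ;; => #{6701}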

The assignments-snapshot function is defined as follows:

The assignments-snapshot function
;; assignments-snapshot reads the assignments from the "/assignments" node in ZooKeeper and registers the callback
;; on that node; assignment-versions is this supervisor's locally cached, versioned assignment info
(defn- assignments-snapshot [storm-cluster-state callback assignment-versions]
 ;; storm-ids is the set of assigned topology-ids, obtained by listing the children of /assignments. If callback
 ;; is not nil it is stored as assignments-callback and a node watch is set on /assignments, so the supervisor
 ;; notices when an assignment is added or removed
 (let [storm-ids (.assignments storm-cluster-state callback)]
   ;; new-assignments is the latest assignment information
   (let [new-assignments
         (->>
          ;; sid is a topology-id
          (dofor [sid storm-ids]
                 ;; recorded-version is the cached assignment version for this sid on this supervisor
                 (let [recorded-version (:version (get assignment-versions sid))]
                   ;; assignment-version is the data and version of the "/assignments/{sid}" node in ZooKeeper;
                   ;; the callback is registered on that node as well
                   (if-let [assignment-version (.assignment-version storm-cluster-state sid callback)]
                     ;; if the cached version equals the version read from ZooKeeper, return sid -> cached info;
                     ;; otherwise re-read the versioned assignment info from "/assignments/{sid}" and register the
                     ;; callback again, so the supervisor notices when an existing assignment is changed
                     (if (= assignment-version recorded-version)
                       {sid (get assignment-versions sid)}
                       {sid (.assignment-info-with-version storm-cluster-state sid callback)})
                     ;; if reading from ZooKeeper fails, the value is {sid nil}
                     {sid nil})))
          ;; merge the dofor results into one map of the form
          ;; {sid_1 {:data data_1 :version version_1}, ..., sid_n {:data data_n :version version_n}}
          (apply merge)
          ;; keep only the entries whose value is not nil
          (filter-val not-nil?))]
     ;; the returned map has the form
     ;; {:assignments {sid_1 data_1, ..., sid_n data_n},
     ;;  :versions {sid_1 {:data data_1 :version version_1}, ..., sid_n {:data data_n :version version_n}}}
     ;; where each data_x is an AssignmentInfo object containing the code directory on nimbus, the start time of
     ;; every task, and the mapping from each task to a node and port
     {:assignments (into {} (for [[k v] new-assignments] [k (:data v)]))
      :versions new-assignments})))
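
The final reshaping into the {:assignments ... :versions ...} map can be tried on toy data, with plain maps standing in for the AssignmentInfo objects:

;; toy new-assignments: topology-id -> {:data <assignment info> :version <n>}
(def new-assignments
  {"topo-1" {:data {:master-code-dir "/nimbus/topo-1"} :version 3}
   "topo-2" {:data {:master-code-dir "/nimbus/topo-2"} :version 7}})

;; strip the version wrapper for the :assignments view, keep the full map as :versions
{:assignments (into {} (for [[k v] new-assignments] [k (:data v)]))
 :versions    new-assignments}
;; => {:assignments {"topo-1" {:master-code-dir "/nimbus/topo-1"}, "topo-2" {...}}
;;     :versions    {"topo-1" {:data {...} :version 3}, "topo-2" {...}}}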

The read-assignments function is defined as follows:

The read-assignments function
(defn- read-assignments
 "Returns map from port to struct containing :storm-id and :executors"
 ;; assignments-snapshot is the topology-id -> AssignmentInfo map; assignment-id is the supervisor-id
 [assignments-snapshot assignment-id]
 ;; merge the per-topology results of read-my-executors and check that no two topologies are assigned to the same
 ;; port. The check is elegant: merge-with only calls the merge function when two maps share a key (a port), and
 ;; the anonymous function (fn [& ignored] ...) throws immediately in that case
 (->> (dofor [sid (keys assignments-snapshot)] (read-my-executors assignments-snapshot sid assignment-id))
      (apply merge-with (fn [& ignored] (throw-runtime "Should not have multiple topologies assigned to one port")))))
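
The port-conflict check via merge-with is worth seeing on toy maps; throw-runtime is replaced here by a plain RuntimeException:

;; each map is port -> assignment for one topology; a shared key means two topologies
;; claim the same port, and the merge function then throws immediately
(defn merge-port-maps [maps]
  (apply merge-with
         (fn [& _]
           (throw (RuntimeException.
                   "Should not have multiple topologies assigned to one port")))
         maps))

(merge-port-maps [{6700 :topo-a} {6701 :topo-b}])    ;; => {6700 :topo-a, 6701 :topo-b}
;; (merge-port-maps [{6700 :topo-a} {6700 :topo-b}]) ;; would throw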

The read-my-executors function is defined as follows:

The read-my-executors function
;; assignments-snapshot is the topology-id -> AssignmentInfo map, assignment-id is the supervisor-id,
;; storm-id is a topology-id
(defn- read-my-executors [assignments-snapshot storm-id assignment-id]
 (let [assignment (get assignments-snapshot storm-id)
       ;; my-executors is the executor information assigned to this supervisor, i.e. the part of the
       ;; executor -> node+port map whose node equals this supervisor's assignment-id
       my-executors (filter (fn [[_ [node _]]] (= node assignment-id))
                          (:executor->node+port assignment))
       ;; port-executors maps port -> collection of executor ids; merge-with concat concatenates the values of
       ;; entries that share the same port
       port-executors (apply merge-with
                         concat
                         (for [[executor [_ port]] my-executors]
                           {port [executor]}
                           ))]
   ;; return a map of port -> LocalAssignment; a LocalAssignment holds a topology-id and a collection of executor ids
   (into {} (for [[port executors] port-executors]
              ;; need to cast to int b/c it might be a long (due to how yaml parses things)
              ;; doall is to avoid serialization/deserialization problems with lazy seqs
              [(Integer. port) (LocalAssignment. storm-id (doall executors))]
              ))))
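
The port-executors grouping is plain merge-with concat. With invented executor ids and a made-up node name:

;; my-executors: executor-id -> [node port]; group the executor ids by port
(def my-executors {[1 1] ["supervisor-a" 6700]
                   [2 2] ["supervisor-a" 6700]
                   [3 3] ["supervisor-a" 6701]})

(apply merge-with concat
       (for [[executor [_ port]] my-executors]
         {port [executor]}))
;; => {6700 ([1 1] [2 2]), 6701 [[3 3]]}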

The download-storm-code function is defined as follows:

The download-storm-code function
(defmulti download-storm-code cluster-mode)
download-storm-code is a multimethod: the cluster-mode dispatch function returns either :distributed or :local, and its return value decides which implementation runs. In the :distributed case the following method is called.
(defmethod download-storm-code
   ;; master-code-dir is the path of this storm-id's code jar on the nimbus server
   :distributed [conf storm-id master-code-dir]
   ;; Downloading to permanent location is atomic
   ;; tmproot is the supervisor-local path "{storm.local.dir}/supervisor/tmp/{uuid}", a temporary location for the
   ;; files downloaded from nimbus
   (let [tmproot (str (supervisor-tmp-dir conf) file-path-separator (uuid))
         ;; stormroot is the final location of this storm-id's code on the supervisor:
         ;; "{storm.local.dir}/supervisor/stormdist/{storm-id}"
         stormroot (supervisor-stormdist-root conf storm-id)]
     ;; create the temporary directory tmproot
     (FileUtils/forceMkdir (File. tmproot))
     ;; download "{storm.local.dir}/nimbus/stormdist/{storm-id}/stormjar.jar" from the nimbus server into tmproot;
     ;; stormjar.jar contains all the code of this topology
     (Utils/downloadFromMaster conf (master-stormjar-path master-code-dir) (supervisor-stormjar-path tmproot))
     ;; download "{storm.local.dir}/nimbus/stormdist/{storm-id}/stormcode.ser" into tmproot;
     ;; stormcode.ser is the serialized topology object
     (Utils/downloadFromMaster conf (master-stormcode-path master-code-dir) (supervisor-stormcode-path tmproot))
     ;; download "{storm.local.dir}/nimbus/stormdist/{storm-id}/stormconf.ser" into tmproot;
     ;; stormconf.ser holds the configuration used to run this topology
     (Utils/downloadFromMaster conf (master-stormconf-path master-code-dir) (supervisor-stormconf-path tmproot))
     ;; RESOURCES-SUBDIR is the string "resources"; extract-dir-from-jar unpacks the jar and extracts every entry
     ;; whose path starts with "resources" into "{tmproot}/resources/..."
     (extract-dir-from-jar (supervisor-stormjar-path tmproot) RESOURCES-SUBDIR tmproot)
     ;; move the contents of tmproot into stormroot, so that "{storm.local.dir}/supervisor/stormdist/{storm-id}/"
     ;; ends up containing the resources directory, stormjar.jar, stormcode.ser and stormconf.ser
     (FileUtils/moveDirectory (File. tmproot) (File. stormroot))
     ))
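
For readers new to multimethods, here is a standalone defmulti/defmethod example of the same dispatch style. It is unrelated to Storm's actual cluster-mode function; the dispatch function simply returns a keyword:

;; the dispatch function plays the role of cluster-mode: it returns a keyword
(defmulti greet (fn [mode & _] mode))

(defmethod greet :distributed [_ name]
  (str "hello from the cluster, " name))

(defmethod greet :local [_ name]
  (str "hello from local mode, " name))

(greet :distributed "storm")   ;; => "hello from the cluster, storm"
(greet :local "storm")         ;; => "hello from local mode, storm"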

The extract-dir-from-jar function is defined as follows:

The extract-dir-from-jar function
;; jarpath is the path of the jar, dir is "resources", destdir is the "{tmproot}" path
(defn extract-dir-from-jar [jarpath dir destdir]
 (try-cause
   ;; the jar is opened with ZipFile; jarpath is rebound to the ZipFile object
   (with-open [jarpath (ZipFile. jarpath)]
     ;; entries returns an enumeration of the zip entries; enumeration-seq turns it into a seq
     (let [entries (enumeration-seq (.entries jarpath))]
       ;; iterate over the entries that are files and whose path starts with "resources"
       (doseq [file (filter (fn [entry](and (not (.isDirectory entry)) (.startsWith (.getName entry) dir))) entries)]
         ;; create the file's full parent path under destdir
         (.mkdirs (.getParentFile (File. destdir (.getName file))))
         ;; copy the file to "{tmproot}/{path inside the jar}"
         (with-open [out (FileOutputStream. (File. destdir (.getName file)))]
           (io/copy (.getInputStream jarpath file) out)))))
   (catch IOException e
     (log-message "Could not extract " dir " from " jarpath))))

This completes the walkthrough of how Storm starts a supervisor. Most of the startup work happens in mk-supervisor, so when reading this part of the source it is best to start from that function and then follow its control flow into each of the functions it calls.
