Azkaban3.45

一简介

1 官网

https://azkaban.github.io/

Azkaban was implemented at LinkedIn to solve the problem of Hadoop job dependencies. We had jobs that needed to run in order, from ETL jobs to data analytics products.

Initially a single server solution, with the increased number of Hadoop users over the years, Azkaban has evolved to be a more robust solution.

Azkaban是由LinkedIn为了解决Hadoop环境下任务依赖问题而开发的，LinkedIn团队有很多任务需要按照顺序运行，包括ETL任务以及数据分析任务；

Azkaban一开始是单server方案，现在已经演化为一个更健壮的方案；（可惜当前版本的WebServer还是单点）

Azkaban consists of 3 key components:

Relational Database (MySQL)
AzkabanWebServer
AzkabanExecutorServer

Azkaban有3个核心组件：Mysql、WebServer、ExecutorServer；

2 部署

3 数据库表结构

projects：项目

project_flows：工作流定义

execution_flows：工作流实例

execution_jobs：任务实例

triggers：调度定义

ps：表中很多数据都是编码的，enc_type是编码类型（对应的枚举为EncodingType），2是gzip编码，其他为无编码，2需要调用GZIPUtils.transformBytesToObject解析得到原始字符串；

4 概念

l Job：最小的执行单元，作为DAG的一个结点，即任务

l Flow：由多个Job组成，并通过dependent配置Job的依赖属性，即工作流

l Tirgger：根据指定Cron信息触发Flow，即调度

二代码解析

1 启动过程

Web Server

AzkabanWebServer.main

launch

prepareAndStartServer

configureRoutes

TriggerManager.start

FlowTriggerService.start

recoverIncompleteTriggerInstances

SELECT %s FROM execution_dependencies WHERE trigger_instance_id in (SELECT trigger_instance_id FROM execution_dependencies WHERE dep_status = %s or dep_status = %s or (dep_status = %s and flow_exec_id = %s))

FlowTriggerScheduler.start

ExecutorManager

setupExecutors

loadRunningFlows

QueueProcessorThread.run

ExecutingManagerUpdaterThread.run

Executor Server

AzkabanExecutorServer.main

launch

AzkabanExecutorServer.start

insertExecutorEntryIntoDB

2 工作流执行过程

Web Server两个入口：

ExecuteFlowAction.doAction

ExecutorServlet.ajaxExecuteFlow

Web Server分配任务：

ExecutorManager.submitExecutableFlow

JdbcExecutorLoader.uploadExecutableFlow

INSERT INTO execution_flows (project_id, flow_id, version, status, submit_time, submit_user, update_time) values (?,?,?,?,?,?,?)

ExecutorLoader.addActiveExecutableReference

INSERT INTO active_executing_flows (exec_id, update_time) values (?,?)

queuedFlows.enqueue

QueueProcessorThread.run

processQueuedFlows

ExecutorManager.selectExecutorAndDispatchFlow (get from queuedFlows)

selectExecutor

dispatch

JdbcExecutorLoader.assignExecutor

UPDATE execution_flows SET executor_id=? where exec_id=?

ExecutorApiGateway.callWithExecutable （调用Executor Server）

Executor Server执行任务：

ExecutorServlet.doGet

handleAjaxExecute

FlowRunnerManager.submitFlow

JdbcExecutorLoader.fetchExecutableFlow

SELECT exec_id, enc_type, flow_data FROM execution_flows WHERE exec_id=?

FlowPreparer.setup

FlowRunner.run

setupFlowExecution

updateFlow

UPDATE execution_flows SET status=?,update_time=?,start_time=?,end_time=?,enc_type=?,flow_data=? WHERE exec_id=?

runFlow

progressGraph

runReadyJob

runExecutableNode

JobRunner.run

uploadExecutableNode

INSERT INTO execution_jobs (exec_id, project_id, version, flow_id, job_id, start_time, end_time, status, input_params, attempt) VALUES (?,?,?,?,?,?,?,?,?,?)

prepareJob

runJob

Job.run (ProcessJob, JavaJob)

Web Server轮询流程状态：

ExecutingManagerUpdaterThread.run

getFlowToExecutorMap

ExecutorApiGateway.callWithExecutionId

updateExecution

3 调度执行过程

TriggerManager.start

loadTriggers

SELECT trigger_id, trigger_source, modify_time, enc_type, data FROM triggers

TriggerScannerThread.start

checkAllTriggers

onTriggerTrigger

TriggerAction.doAction

ExecuteFlowAction.doAction

PS：还有另一套完全独立的定时任务逻辑，通过azkaban.server.schedule.enable_quartz控制（默认false），以下为register job到quartz：

ProjectManagerServlet.ajaxHandleUpload

SELECT id, name, active, modified_time, create_time, version, last_modified_by, description, enc_type, settings_blob FROM projects WHERE name=? AND active=true

ProjectManager.loadAllProjectFlows

SELECT project_id, version, flow_id, modified_time, encoding_type, json FROM project_flows WHERE project_id=? AND version=?

FlowTriggerScheduler.scheduleAll

SELECT MAX(flow_version) FROM project_flow_files WHERE project_id=? AND project_version=? AND flow_name=?

SELECT flow_file FROM project_flow_files WHERE project_id=? AND project_version=? AND flow_name=? AND flow_version=?

registerJob

以下为quartz job执行：

FlowTriggerQuartzJob.execute

FlowTriggerService.startTrigger

TriggerInstanceProcessor.processSucceed

TriggerInstanceProcessor.executeFlowAndUpdateExecID

ExecutorManager.submitExecutableFlow

4 任务执行过程

Job是任务的核心接口，所有具体任务都是该接口的子类：

Job

AbstractJob

AbstractProcessJob

ProcessJob （Shell任务）

JavaProcessJob （Java任务）

JavaJob

【原创】大数据基础之Azkaban（1）简介、源代码解析的更多相关文章

【原创】大数据基础之Zookeeper（2）源代码解析
核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...
【原创】大数据基础之Impala（1）简介、安装、使用
impala2.12 官方:http://impala.apache.org/ 一简介 Apache Impala is the open source, native analytic datab ...
【原创】大数据基础之Benchmark（2）TPC-DS
tpc 官方:http://www.tpc.org/ 一简介 The TPC is a non-profit corporation founded to define transaction pr ...
【原创】大数据基础之词频统计Word Count
对文件进行词频统计,是一个大数据领域的hello word级别的应用,来看下实现有多简单: 1 Linux单机处理 egrep -o "\b[[:alpha:]]+\b" test ...
大数据基础知识：分布式计算、服务器集群[zz]
大数据中的数据量非常巨大,达到了PB级别.而且这庞大的数据之中,不仅仅包括结构化数据(如数字.符号等数据),还包括非结构化数据(如文本.图像.声音.视频等数据).这使得大数据的存储,管理和处理很难利用 ...
大数据基础知识问答----spark篇，大数据生态圈
Spark相关知识点 1.Spark基础知识 1.Spark是什么? UCBerkeley AMPlab所开源的类HadoopMapReduce的通用的并行计算框架 dfsSpark基于mapredu ...
大数据基础知识问答----hadoop篇
handoop相关知识点 1.Hadoop是什么? Hadoop是一个由Apache基金会所开发的分布式系统基础架构.用户可以在不了解分布式底层细节的情况下,开发分布式程序.充分利用集群的威力进行高速 ...
hadoop大数据基础框架技术详解
一.什么是大数据进入本世纪以来,尤其是2010年之后,随着互联网特别是移动互联网的发展,数据的增长呈爆炸趋势,已经很难估计全世界的电子设备中存储的数据到底有多少,描述数据系统的数据量的计量单位从MB ...
大数据基础总结---HDFS分布式文件系统
HDFS分布式文件系统文件系统的基本概述文件系统定义:文件系统是一种存储和组织计算机数据的方法,它使得对其访问和查找变得容易. 文件名:在文件系统中,文件名是用于定位存储位置. 元数据(Metad ...

随机推荐

写个.net开发者的Linux迁移指南
前言为什么要迁移到Linux 首先我个人还是有点软件洁癖,以前是穷酸学生的时候也是用盗版的用户,后来在知乎被洗脑终于有了点版权意识.然后便有了能用开源软件的就用开源,实在不能就选社区版或者免费版.于 ...
Python之使用转义序列 \n 和 \t 跟 expandtabs 函数输出表格
示例: text = "username\temail\tpassword\nashdfh\tfiodfh@q.com\ty567\nsdfiuh\tadfhisoj@163.com\t42 ...
PS调出清新风格社区街拍照片
原图: 首先呢,我们还是先看一下在直方图,但是呢,你会发现,这张照片的直方图毫无特色. 简直是标准得不能再标准的直方图了.所以各位那我们就跳过这步吧.你要真跳过这步你就完了.直方图还有三个儿子啊,通道 ...
多线程之：MESI－CPU缓存一致性协议
MESI(Modified Exclusive Shared Or Invalid)(也称为伊利诺斯协议,是因为该协议由伊利诺斯州立大学提出)是一种广泛使用的支持写回策略的缓存一致性协议,该协议被应用 ...
jsp篇之脚本元素
jsp的脚本元素 : 第一种:表达式 (类似输出语句) 表达式形式:<%= %> 看源码发现[翻译]到java文件中的位置: [out.print(..)]里面的参数. 所以System ...
[模板] 次短路 | bzoj1726-[Usaco2006Nov]Roadblocks第二短路
简介所谓次短路, 顾名思义, 就是第二短路. :P 1到n的次短路长度必然产生于:1到x的最短路 + edge(x,y) + y到n的最短路简单证明一下: 设 \(dis(i,j)\) 表示 \( ...
[SimplePlayer] 5. 向音频设备输出音频
两种SDL音频输出方式我们这里采用SDL来进行音频输出.SDL提供两种音频输出的方式: 如果在SDL_OpenAudio时不指定callback,那么可以调用SDL_QueueAudio主动地向音频 ...
Codeforces437 B. The Child and Set
题目类型:位运算传送门:>Here< 题意:给出\(sum和limit\),求一个集合\(S\),其中的元素互不相同且不超过\(limit\),他们的\(lowbit\)之和等于\(su ...
适配相关：viewpoint，@media，vw/vh，em/rem
从网易与淘宝的font-size思考前端设计稿与工作流: http://www.cnblogs.com/lyzg/p/4877277.html Rem布局的原理解析: https://yanhaiji ...
BZOJ 1671: [Usaco2005 Dec]Knights of Ni 骑士 (bfs)
题目: https://www.lydsy.com/JudgeOnline/problem.php?id=1671 题解: 按题意分别从贝茜和骑士bfs然后meet_in_middle.. 把一个逗号 ...

【原创】大数据基础之Azkaban（1）简介、源代码解析

一 简介