spark源码解析之基本概念
从两方面来阐述spark的组件,一个是宏观上,一个是微观上。
1. spark组件
要分析spark的源码,首先要了解spark是如何工作的。spark的组件:
了解其工作过程先要了解基本概念
官方罗列了一些概念:
Term | Meaning |
---|---|
Application | User program built on Spark. Consists of a driver program and executors on the cluster. |
Application jar | A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime. |
Driver program | The process running the main() function of the application and creating the SparkContext |
Cluster manager | An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN) |
Deploy mode | Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster. |
Worker node | Any node that can run application code in the cluster |
Executor | A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. |
Task | A unit of work that will be sent to one executor |
Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save ,collect ); you'll see this term used in the driver's logs. |
Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs. |
1.1 SparkContext
sparkContext:Main entry point for Spark functionality. A SparkContext represents the connection to a Spark,cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
1.2 Task
/**
* A unit of execution. We have two kinds of Task's in Spark:
*
* - [[org.apache.spark.scheduler.ShuffleMapTask]]
* - [[org.apache.spark.scheduler.ResultTask]]
*
* A Spark job consists of one or more stages. The very last stage in a job consists of multiple
* ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task
* and sends the task output back to the driver application. A ShuffleMapTask executes the task
* and divides the task output to multiple buckets (based on the task's partitioner).
*
* @param stageId id of the stage this task belongs to
* @param partitionId index of the number in the RDD
*/
1.3 ActiveJob
/**
* A running job in the DAGScheduler. Jobs can be of two types: a result job, which computes a
* ResultStage to execute an action, or a map-stage job, which computes the map outputs for a
* ShuffleMapStage before any downstream stages are submitted. The latter is used for adaptive
* query planning, to look at map output statistics before submitting later stages. We distinguish
* between these two types of jobs using the finalStage field of this class.
*
* Jobs are only tracked for "leaf" stages that clients directly submitted, through DAGScheduler's
* submitJob or submitMapStage methods. However, either type of job may cause the execution of
* other earlier stages (for RDDs in the DAG it depends on), and multiple jobs may share some of
* these previous stages. These dependencies are managed inside DAGScheduler.
*
* @param jobId A unique ID for this job.
* @param finalStage The stage that this job computes (either a ResultStage for an action or a
* ShuffleMapStage for submitMapStage).
* @param callSite Where this job was initiated in the user's program (shown on UI).
* @param listener A listener to notify if tasks in this job finish or the job fails.
* @param properties Scheduling properties attached to the job, such as fair scheduler pool name.
*/
1.4 Stage
/**
* A stage is a set of parallel tasks all computing the same function that need to run as part
* of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run
* by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the
* DAGScheduler runs these stages in topological order.
*
* Each Stage can either be a shuffle map stage, in which case its tasks' results are input for
* other stage(s), or a result stage, in which case its tasks directly compute a Spark action
* (e.g. count(), save(), etc) by running a function on an RDD. For shuffle map stages, we also
* track the nodes that each output partition is on.
*
* Each Stage also has a firstJobId, identifying the job that first submitted the stage. When FIFO
* scheduling is used, this allows Stages from earlier jobs to be computed first or recovered
* faster on failure.
*
* Finally, a single stage can be re-executed in multiple attempts due to fault recovery. In that
* case, the Stage object will track multiple StageInfo objects to pass to listeners or the web UI.
* The latest one will be accessible through latestInfo.
*
* @param id Unique stage ID
* @param rdd RDD that this stage runs on: for a shuffle map stage, it's the RDD we run map tasks
* on, while for a result stage, it's the target RDD that we ran an action on
* @param numTasks Total number of tasks in stage; result stages in particular may not need to
* compute all partitions, e.g. for first(), lookup(), and take().
* @param parents List of stages that this stage depends on (through shuffle dependencies).
* @param firstJobId ID of the first job this stage was part of, for FIFO scheduling.
* @param callSite Location in the user program associated with this stage: either where the target
* RDD was created, for a shuffle map stage, or where the action for a result stage was called.
*/
1.5 executor
/**
* Spark executor, backed by a threadpool to run tasks.
*
* This can be used with Mesos, YARN, and the standalone scheduler.
* An internal RPC interface (at the moment Akka) is used for communication with the driver,
* except in the case of Mesos fine-grained mode.
*/
2. spark核心
Spark建立在统一抽象的RDD之上,使得它可以以基本一致的方式应对不同的大数据处理场景,包括MapReduce,Streaming,SQL,Machine Learning以及Graph等。
要理解Spark,就需得理解RDD。
2.1 RDD是什么?
它的特性可以总结如下:
- 它是不变的数据结构存储
- 它是支持跨集群的分布式数据结构
- 可以根据数据记录的key对结构进行分区
- 提供了粗粒度的操作,且这些操作都支持分区
- 它将数据存储在内存中,从而提供了低延迟性
官方定义:
/**
* A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel. This class contains the
* basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
* [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
* pairs, such as `groupByKey` and `join`;
* [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
* Doubles; and
* [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
* can be saved as SequenceFiles.
* All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]
* through implicit.
*
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* [[http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Spark paper]] for more details
* on RDD internals.
*/
从上述描述可以知道:
1.rdd是可以并行操作的不可变,分区的元素集合。
2.rdd定义了所有类型rdd的操作如:
`map`, `filter`, and `persist`
其中,其它操作有:PairRDDFunctions,DoubleRDDFunctions,SequenceFileRDDFunctions
3. rdd主要由五部分组成:
一组分片(partition);计算每个分片的函数;其它rdd的依赖集合;可选的分区键(key-value rdd拥有);一个列表,存储存取每个partition的preferred位置。对于一个HDFS文件来说,存储每个partition所在的块的位置。
小结:
RDD,全称为Resilient Distributed Datasets,是一个容错的、并行的数据结构,可以让用户显式地将数据存储到磁盘和内存中,并能控制数据的分区。同时,RDD还提供了一组丰富的操作来操作这些数据。在这些操作中,诸如map、flatMap、filter等转换操作实现了monad模式,很好地契合了Scala的集合操作。除此之外,RDD还提供了诸如join、groupBy、reduceByKey等更为方便的操作(注意,reduceByKey是action,而非transformation),以支持常见的数据运算。
通常来讲,针对数据处理有几种常见模型,包括:Iterative Algorithms,Relational Queries,MapReduce,Stream Processing。例如Hadoop MapReduce采用了MapReduces模型,Storm则采用了Stream Processing模型。RDD混合了这四种模型,使得Spark可以应用于各种大数据处理场景。
RDD作为数据结构,本质上是一个只读的分区记录集合。一个RDD可以包含多个分区,每个分区就是一个dataset片段。RDD可以相互依赖。如果RDD的每个分区最多只能被一个Child RDD的一个分区使用,则称之为narrow dependency;若多个Child RDD分区都可以依赖,则称之为wide dependency。不同的操作依据其特性,可能会产生不同的依赖。例如map操作会产生narrow dependency,而join操作则产生wide dependency。
Spark之所以将依赖分为narrow与wide,基于两点原因。
首先,narrow dependencies可以支持在同一个cluster node上以管道形式执行多条命令,例如在执行了map后,紧接着执行filter。相反,wide dependencies需要所有的父分区都是可用的,可能还需要调用类似MapReduce之类的操作进行跨节点传递。
其次,则是从失败恢复的角度考虑。narrow dependencies的失败恢复更有效,因为它只需要重新计算丢失的parent partition即可,而且可以并行地在不同节点进行重计算。而wide dependencies牵涉到RDD各级的多个Parent Partitions。下图说明了narrow dependencies与wide dependencies之间的区别:
本图来自Matei Zaharia撰写的论文An Architecture for Fast and General Data Processing on Large Clusters。图中,一个box代表一个RDD,一个带阴影的矩形框代表一个partition。
2.2 RDD对容错的支持
支持容错通常采用两种方式:数据复制或日志记录。对于以数据为中心的系统而言,这两种方式都非常昂贵,因为它需要跨集群网络拷贝大量数据,毕竟带宽的数据远远低于内存。
RDD天生是支持容错的。首先,它自身是一个不变的(immutable)数据集,其次,它能够记住构建它的操作图(Graph of Operation),因此当执行任务的Worker失败时,完全可以通过操作图获得之前执行的操作,进行重新计算。由于无需采用replication方式支持容错,很好地降低了跨网络的数据传输成本。
不过,在某些场景下,Spark也需要利用记录日志的方式来支持容错。例如,在Spark Streaming中,针对数据进行update操作,或者调用Streaming提供的window操作时,就需要恢复执行过程的中间状态。此时,需要通过Spark提供的checkpoint机制,以支持操作能够从checkpoint得到恢复。
针对RDD的wide dependency,最有效的容错方式同样还是采用checkpoint机制。不过,似乎Spark的最新版本仍然没有引入auto checkpointing机制。
2.3 分片Partition
Partition是RDD中一个分片的标识。
/**
* An identifier for a partition in an RDD.
*/
trait Partition extends Serializable {
/**
* Get the partition's index within its parent RDD
*/
def index: Int // A better default implementation of HashCode
override def hashCode(): Int = index
}
2.4 分片函数Partitioner
分片函数Partitioner:An object that defines how the elements in a key-value pair RDD are partitioned by key.Maps each key to a partition ID, from 0 to `numPartitions - 1`.
分片函数的分类
默认分片函数defaultPartitioner:
/**
* Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
*
* If any of the RDDs already has a partitioner, choose that one.
*
* Otherwise, we use a default HashPartitioner. For the number of partitions, if
* spark.default.parallelism is set, then we'll use the value from SparkContext
* defaultParallelism, otherwise we'll use the max number of upstream partitions.
*
* Unless spark.default.parallelism is set, the number of partitions will be the
* same as the number of partitions in the largest upstream RDD, as this should
* be least likely to cause out-of-memory errors.
*
* We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
*/
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
return r.partitioner.get
}
if (rdd.context.conf.contains("spark.default.parallelism")) {
new HashPartitioner(rdd.context.defaultParallelism)
} else {
new HashPartitioner(bySize.head.partitions.size)
}
}
hash分片函数HashPartitioner:A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using, Java's `Object.hashCode`.Java arrays have hashCodes that are based on the arrays' identities rather than their contents,so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will produce an unexpected or incorrect result.
Range分片函数RangePartitioner:A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in. Note that the actual number of partitions created by the RangePartitioner might not be the same as the `partitions` parameter, in the case where the number of sampled records is less than the value of `partitions`.
2.5 依赖Dependency
依赖的分类
窄依赖NarrowDependency:Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
shuffle依赖ShuffleDependency:Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,the RDD is transient since we don't need it on the executor side.
一对一依赖OneToOneDependency:Represents a one-to-one dependency between partitions of the parent and child RDDs.
范围依赖RangeDependency:Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
参考文献:
【1】http://spark.apache.org/docs/latest/cluster-overview.html
【2】http://www.infoq.com/cn/articles/spark-core-rdd/
spark源码解析之基本概念的更多相关文章
- Spark 源码解析:TaskScheduler的任务提交和task最佳位置算法
上篇文章< Spark 源码解析 : DAGScheduler中的DAG划分与提交 >介绍了DAGScheduler的Stage划分算法. 本文继续分析Stage被封装成TaskSet, ...
- Spark 源码解析 : DAGScheduler中的DAG划分与提交
一.Spark 运行架构 Spark 运行架构如下图: 各个RDD之间存在着依赖关系,这些依赖关系形成有向无环图DAG,DAGScheduler对这些依赖关系形成的DAG,进行Stage划分,划分的规 ...
- Scala实战高手****第4课:零基础彻底实战Scala控制结构及Spark源码解析
1.环境搭建 基础环境配置 jdk+idea+maven+scala2.11.以上工具安装配置此处不再赘述. 2.源码导入 官网下载spark源码后解压到合适的项目目录下,打开idea,File-&g ...
- spark源码解析大全
第1章 Spark 整体概述 1.1 整体概念 Apache Spark 是一个开源的通用集群计算系统,它提供了 High-level 编程 API,支持 Scala.Java 和 Pytho ...
- Spark源码解析 - Spark-shell浅析
1.准备工作 1.1 安装spark,并配置spark-env.sh 使用spark-shell前需要安装spark,详情可以参考http://www.cnblogs.com/swordfall/p/ ...
- Scala实战高手****第7课:零基础实战Scala面向对象编程及Spark源码解析
/** * 如果有这些语法的支持,我们说这门语言是支持面向对象的语言 * 其实真正面向对象的精髓是不是封装.继承.多态呢? * --->肯定不是,封装.继承.多态,只不过是支撑面向对象的 * 一 ...
- spark源码解析总结
========== Spark 通信架构 ========== 1.spark 一开始使用 akka 作为网络通信框架,spark 2.X 版本以后完全抛弃 akka,而使用 netty 作为新的网 ...
- Scala实战高手****第5课:零基础实战Scala函数式编程及Spark源码解析
Scala函数式编程 ----------------------------------------------------------------------------------------- ...
- spark源码解析之scala基本语法
1. scala初识 spark由scala编写,要解析scala,首先要对scala有基本的了解. 1.1 class vs object A class is a blueprint for ob ...
随机推荐
- C++对象模型——Inline Functions(第四章)
4.5 Inline Functions 以下是Point class 的一个加法运算符的可能实现内容: class Point { friend Point operator+(const Poin ...
- 使用PHP来压缩CSS文件
这里将介绍使用PHP以一种简便的方式来压缩你的CSS文件.这种方法不需要命名你的.css文件和.php文件. 当前有许多方法需要将.css文件重命名成.php文件,然后在所有PHP文件中放置压缩代码. ...
- html --- rem 媒体查询
rem是一种相对长度单位,参考的基准是<html>标签定义的font-size. viewport 做移动端的h5,通常会在HTML文件中指定一个<meta>标签: <m ...
- POJ 3051 DFS
题意:判断连通块大小 水题 //By SiriusRen #include <cstdio> #include <cstring> #include <algorithm ...
- 为root账户更名
为root账户更名 处于安全考虑许多管理员想把root更名,具体方法如下: 1.先以root登陆系统 2.用vi 编辑/etc/passwd文件,将第一行的第一个root修改为你想要的账户名,然后保存 ...
- Pig源代码分析: 简析运行计划的生成
摘要 本文通过跟代码的方式,分析从输入一批Pig-latin到输出物理运行计划(与launcher引擎有关,通常是MR运行计划.也能够是Spark RDD的运行算子)的总体流程. 不会详细涉及AST怎 ...
- 分享一段php获取随意时间的前一天代码
<?php /** * 获取给定日期的前一天 * @param string $date * @return string $yesterday */ function getYesterday ...
- 重建一些被PHP7废弃的函数,
<?php if(!function_exists('ereg')) { function ereg($pattern, $subject, &$matches = []) { retu ...
- TextView -无法调节字体、边框的距离
今天调节一个字体边框距离,结果一直都实现不了,布局如下 <RelativeLayout xmlns:android="http://schemas.android.com/apk/re ...
- Python 标准库 csv —— csv 文件的读写
csv 文件,逗号分割文件. 0. 读取 csv 到 list from csv import reader def load_csv(csvfile): dataset = [] with open ...