Spark is described here from two angles: a macro view of its components and a micro view of its core abstraction.

1. Spark components

To analyze the Spark source code, we first need to understand how Spark works, starting with its components.

Understanding how Spark works begins with its basic concepts.

The official documentation lists the following terms:

Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; however, these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.

1.1 SparkContext

SparkContext: Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
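
A minimal sketch of this, assuming a Spark 1.x build running in local mode (the app name, master URL and data are arbitrary): the context is created from a SparkConf and then used to build an RDD and an accumulator before being stopped.

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    // Only one SparkContext may be active per JVM, so stop() it before creating another.
    val conf = new SparkConf().setAppName("context-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // The context is the entry point for RDDs, accumulators and broadcast variables.
    val rdd = sc.parallelize(1 to 100, 4)
    val acc = sc.accumulator(0, "processed")   // Spark 1.x accumulator API
    rdd.foreach(_ => acc += 1)
    println(s"count = ${rdd.count()}, processed = ${acc.value}")

    sc.stop()
  }
}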

1.2 Task

/**
* A unit of execution. We have two kinds of Task's in Spark:
*
* - [[org.apache.spark.scheduler.ShuffleMapTask]]
* - [[org.apache.spark.scheduler.ResultTask]]
*
* A Spark job consists of one or more stages. The very last stage in a job consists of multiple
* ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task
* and sends the task output back to the driver application. A ShuffleMapTask executes the task
* and divides the task output to multiple buckets (based on the task's partitioner).
*
* @param stageId id of the stage this task belongs to
* @param partitionId index of the number in the RDD
*/

1.3 ActiveJob

/**
* A running job in the DAGScheduler. Jobs can be of two types: a result job, which computes a
* ResultStage to execute an action, or a map-stage job, which computes the map outputs for a
* ShuffleMapStage before any downstream stages are submitted. The latter is used for adaptive
* query planning, to look at map output statistics before submitting later stages. We distinguish
* between these two types of jobs using the finalStage field of this class.
*
* Jobs are only tracked for "leaf" stages that clients directly submitted, through DAGScheduler's
* submitJob or submitMapStage methods. However, either type of job may cause the execution of
* other earlier stages (for RDDs in the DAG it depends on), and multiple jobs may share some of
* these previous stages. These dependencies are managed inside DAGScheduler.
*
* @param jobId A unique ID for this job.
* @param finalStage The stage that this job computes (either a ResultStage for an action or a
* ShuffleMapStage for submitMapStage).
* @param callSite Where this job was initiated in the user's program (shown on UI).
* @param listener A listener to notify if tasks in this job finish or the job fails.
* @param properties Scheduling properties attached to the job, such as fair scheduler pool name.
*/

1.4 Stage

/**
* A stage is a set of parallel tasks all computing the same function that need to run as part
* of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run
* by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the
* DAGScheduler runs these stages in topological order.
*
* Each Stage can either be a shuffle map stage, in which case its tasks' results are input for
* other stage(s), or a result stage, in which case its tasks directly compute a Spark action
* (e.g. count(), save(), etc) by running a function on an RDD. For shuffle map stages, we also
* track the nodes that each output partition is on.
*
* Each Stage also has a firstJobId, identifying the job that first submitted the stage. When FIFO
* scheduling is used, this allows Stages from earlier jobs to be computed first or recovered
* faster on failure.
*
* Finally, a single stage can be re-executed in multiple attempts due to fault recovery. In that
* case, the Stage object will track multiple StageInfo objects to pass to listeners or the web UI.
* The latest one will be accessible through latestInfo.
*
* @param id Unique stage ID
* @param rdd RDD that this stage runs on: for a shuffle map stage, it's the RDD we run map tasks
* on, while for a result stage, it's the target RDD that we ran an action on
* @param numTasks Total number of tasks in stage; result stages in particular may not need to
* compute all partitions, e.g. for first(), lookup(), and take().
* @param parents List of stages that this stage depends on (through shuffle dependencies).
* @param firstJobId ID of the first job this stage was part of, for FIFO scheduling.
* @param callSite Location in the user program associated with this stage: either where the target
* RDD was created, for a shuffle map stage, or where the action for a result stage was called.
*/
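
To make jobs, stages and tasks concrete, here is a small sketch (assuming an active SparkContext `sc`, for example the one created above; the data is arbitrary): an action such as collect() submits one job, which the DAGScheduler splits at the shuffle introduced by reduceByKey into an earlier stage of ShuffleMapTasks and a final stage of ResultTasks.

// assumes an active SparkContext `sc`
val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // shuffle boundary

// The lineage shows the stage split introduced by the shuffle dependency.
println(counts.toDebugString)

// collect() is an action: it triggers one job with two stages. The earlier stage
// runs ShuffleMapTasks (the map side of the shuffle), the last stage runs
// ResultTasks that send their output back to the driver.
println(counts.collect().mkString(", "))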

1.5 Executor

/**
* Spark executor, backed by a threadpool to run tasks.
*
* This can be used with Mesos, YARN, and the standalone scheduler.
* An internal RPC interface (at the moment Akka) is used for communication with the driver,
* except in the case of Mesos fine-grained mode.
*/

2. The Spark core

Spark is built on a single unified abstraction, the RDD, which lets it handle different big-data workloads (MapReduce-style batch jobs, streaming, SQL, machine learning, graph processing and so on) in an essentially uniform way.

To understand Spark, you must first understand the RDD.

2.1 What is an RDD?

Its characteristics can be summarized as follows (a short sketch follows the list):

  • It is an immutable data structure.
  • It is a distributed data structure spread across a cluster.
  • It can be partitioned by the key of each data record.
  • It offers coarse-grained operations, and these operations work on partitions.
  • It keeps data in memory, which gives low latency.
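
A tiny sketch of these characteristics, assuming an existing SparkContext `sc` (the data is arbitrary): transformations return new RDDs instead of mutating the original, and cache() keeps the computed partitions in memory.

// assumes an existing SparkContext `sc`
val base    = sc.parallelize(1 to 10)
val doubled = base.map(_ * 2)      // coarse-grained, per-partition operation; `base` itself is never modified
val evens   = doubled.filter(_ % 4 == 0)

evens.cache()                      // keep the computed partitions in memory after the first action
println(evens.count())             // computes the RDD and caches it
println(evens.reduce(_ + _))       // served from the cached partitions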

The official definition:

/**
* A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel. This class contains the
* basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
* [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
* pairs, such as `groupByKey` and `join`;
* [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
* Doubles; and
* [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
* can be saved as SequenceFiles.
* All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]
* through implicit conversions).
*
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* [[http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Spark paper]] for more details
* on RDD internals.
*/

From the description above we can see:

1. An RDD is an immutable, partitioned collection of elements that can be operated on in parallel.

2. The RDD class defines the operations available on all RDDs, such as:

`map`, `filter`, and `persist`

 Additional type-specific operations live in PairRDDFunctions, DoubleRDDFunctions and SequenceFileRDDFunctions.

3. An RDD is characterized by five main parts (inspected in the sketch below):

  a set of partitions; a function for computing each split; a list of dependencies on other RDDs; an optional Partitioner (for key-value RDDs); and a list of preferred locations for computing each partition. For an HDFS file, these are the locations of the blocks that hold each partition.
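
These five properties surface, directly or indirectly, on the RDD API. A small inspection sketch, assuming an existing SparkContext `sc` and Spark 1.3+ (where the pair-RDD implicits are in scope automatically); keys and partition counts are arbitrary:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
              .partitionBy(new HashPartitioner(4))

println(pairs.partitions.length)     // 1) the list of partitions
// 2) compute() is the per-split function; it is invoked by the scheduler, not by user code
println(pairs.dependencies)          // 3) dependencies on parent RDDs
println(pairs.partitioner)           // 4) Some(HashPartitioner) for this key-value RDD
println(pairs.preferredLocations(pairs.partitions(0)))   // 5) preferred locations (empty for parallelize)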

Summary:

  An RDD (Resilient Distributed Dataset) is a fault-tolerant, parallel data structure that lets users explicitly place data in memory or on disk and control how it is partitioned, and it provides a rich set of operations on that data. Transformations such as map, flatMap and filter follow the monad pattern and fit naturally with Scala's collection operations. Beyond these, RDDs offer more convenient operations such as join, groupBy and reduceByKey to support common data computations (note that reduceByKey, although it introduces a shuffle, is a transformation, not an action).

  Generally speaking, there are several common models for data processing: Iterative Algorithms, Relational Queries, MapReduce and Stream Processing. Hadoop MapReduce implements the MapReduce model and Storm the Stream Processing model; the RDD blends all four, which is what allows Spark to be applied across such a range of big-data scenarios.

  As a data structure, an RDD is essentially a read-only, partitioned collection of records. An RDD can contain multiple partitions, each of which is a fragment of the dataset. RDDs can depend on one another. If each partition of a parent RDD is used by at most one partition of a child RDD, the dependency is a narrow dependency; if multiple child partitions may depend on it, it is a wide dependency. Different operations produce different kinds of dependencies: map, for example, produces a narrow dependency, while join produces a wide dependency.

  Spark distinguishes narrow from wide dependencies for two reasons.

  First, narrow dependencies allow several operations to be pipelined on a single cluster node, for example running a filter right after a map. Wide dependencies, in contrast, require all parent partitions to be available and may need a MapReduce-like shuffle to move data across nodes.

  Second, there is failure recovery. Recovery under narrow dependencies is cheaper, because only the lost parent partitions need to be recomputed, and that recomputation can run in parallel on different nodes. A wide dependency, by contrast, involves many parent partitions across the lineage. The figure below illustrates the difference between narrow and wide dependencies:

  The figure comes from Matei Zaharia's dissertation "An Architecture for Fast and General Data Processing on Large Clusters". In it, each box represents an RDD and each shaded rectangle a partition.

2.2 Fault tolerance in RDDs

  Fault tolerance is usually achieved in one of two ways: data replication or logging. For a data-centric system, both are expensive, because they require copying large volumes of data across the cluster network, and network bandwidth is far lower than memory bandwidth.

  RDDs are fault tolerant by design. First, an RDD is itself an immutable dataset; second, it remembers the graph of operations that built it, so when a worker executing a task fails, the lost partitions can be recomputed from that operation graph. Because no replication is needed for fault tolerance, the cost of shipping data across the network stays low.

  In some scenarios, however, Spark still relies on log-like mechanisms. In Spark Streaming, for example, updating state or using the window operations requires recovering intermediate execution state, and this is handled through Spark's checkpoint mechanism, which lets an operation recover from a checkpoint.

  For wide dependencies, the most effective way to recover is likewise to checkpoint. Note, though, that even the latest Spark release does not appear to provide automatic checkpointing; checkpoints must be requested explicitly, as in the sketch below.
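
A minimal checkpointing sketch, assuming an existing SparkContext `sc` (the checkpoint directory and data are only illustrative):

// assumes an existing SparkContext `sc`; the directory is only an example
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val wide = sc.parallelize(1 to 1000)
             .map(x => (x % 10, x))
             .groupByKey()           // wide dependency: child partitions depend on many parents

wide.checkpoint()                    // only marks the RDD; data is written by the next action
wide.count()                         // materializes the RDD and saves it to the checkpoint directory

// Once checkpointed, the lineage above this RDD is truncated, so recovery
// no longer has to re-run the shuffle.
println(wide.isCheckpointed)
println(wide.toDebugString)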

2.3 Partition

A Partition identifies a single slice of an RDD. The source defines it as:
/**
 * An identifier for a partition in an RDD.
 */
trait Partition extends Serializable {
  /**
   * Get the partition's index within its parent RDD
   */
  def index: Int

  // A better default implementation of HashCode
  override def hashCode(): Int = index
}

2.4 Partitioner

Partitioner: An object that defines how the elements in a key-value pair RDD are partitioned by key. Maps each key to a partition ID, from 0 to `numPartitions - 1`.

Kinds of partitioners

The default partitioner, defaultPartitioner:

 /**
* Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
*
* If any of the RDDs already has a partitioner, choose that one.
*
* Otherwise, we use a default HashPartitioner. For the number of partitions, if
* spark.default.parallelism is set, then we'll use the value from SparkContext
* defaultParallelism, otherwise we'll use the max number of upstream partitions.
*
* Unless spark.default.parallelism is set, the number of partitions will be the
* same as the number of partitions in the largest upstream RDD, as this should
* be least likely to cause out-of-memory errors.
*
* We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
*/
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
  for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
    return r.partitioner.get
  }
  if (rdd.context.conf.contains("spark.default.parallelism")) {
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    new HashPartitioner(bySize.head.partitions.size)
  }
}

HashPartitioner: A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using Java's `Object.hashCode`. Java arrays have hashCodes that are based on the arrays' identities rather than their contents, so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will produce an unexpected or incorrect result.

RangePartitioner: A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in. Note that the actual number of partitions created by the RangePartitioner might not be the same as the `partitions` parameter, in the case where the number of sampled records is less than the value of `partitions`.
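
Both partitioners can be applied explicitly with partitionBy on a key-value RDD. A short sketch, assuming an existing SparkContext `sc` and Spark 1.3+ implicits (keys and partition counts are arbitrary):

import org.apache.spark.{HashPartitioner, RangePartitioner}

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

// Hash partitioning: the key's hashCode modulo numPartitions picks the partition id.
val hashed = pairs.partitionBy(new HashPartitioner(3))
println(hashed.partitioner)

// Range partitioning: boundaries are chosen by sampling the RDD, so the actual
// number of partitions may be smaller than requested when there are few records.
val ranged = pairs.partitionBy(new RangePartitioner(3, pairs))
println(ranged.partitioner.map(_.numPartitions))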

2.5 Dependency

Kinds of dependencies (a small inspection sketch follows the list):

NarrowDependency: Base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.

ShuffleDependency: Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle, the RDD is transient since we don't need it on the executor side.

OneToOneDependency: Represents a one-to-one dependency between partitions of the parent and child RDDs.

RangeDependency: Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
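
The concrete dependency type of an RDD can be inspected through rdd.dependencies. A small sketch, assuming an existing SparkContext `sc` (the data is arbitrary):

import org.apache.spark.{OneToOneDependency, ShuffleDependency}

val nums   = sc.parallelize(1 to 100, 4)
val mapped = nums.map(_ * 2)                              // narrow: OneToOneDependency
val merged = mapped.union(nums)                           // narrow: one RangeDependency per parent
val byKey  = mapped.map(x => (x % 10, x)).groupByKey()    // wide: ShuffleDependency

println(mapped.dependencies.head.isInstanceOf[OneToOneDependency[_]])     // true
println(merged.dependencies.map(_.getClass.getSimpleName))                // RangeDependency, RangeDependency
println(byKey.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]) // true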

References:

[1] http://spark.apache.org/docs/latest/cluster-overview.html

[2] http://www.infoq.com/cn/articles/spark-core-rdd/
