Spark GraphX图计算核心源码分析【图构建器、顶点、边】
一.图构建器
GraphX提供了几种从RDD或磁盘上的顶点和边的集合构建图形的方法。默认情况下,没有图构建器会重新划分图的边;相反,边保留在默认分区中。Graph.groupEdges要求对图进行重新分区,因为它假定相同的边将在同一分区上放置,因此在调用Graph.partitionBy之前必须要调用groupEdges。
源码如下:
package org.apache.spark.graphx import org.apache.spark.SparkContext
import org.apache.spark.graphx.impl.{EdgePartitionBuilder, GraphImpl}
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel /**
* Provides utilities for loading [[Graph]]s from files.
*/
object GraphLoader extends Logging { /**
* Loads a graph from an edge list formatted file where each line contains two integers: a source
* id and a target id. Skips lines that begin with `#`.
*/
def edgeListFile(
sc: SparkContext,
path: String,
canonicalOrientation: Boolean = false,
numEdgePartitions: Int = -1,
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY, //缓存级别
vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
: Graph[Int, Int] =
{
val startTime = System.currentTimeMillis // Parse the edge data table directly into edge partitions
val lines =
if (numEdgePartitions > 0) { // 加载文件数据
sc.textFile(path, numEdgePartitions).coalesce(numEdgePartitions)
} else {
sc.textFile(path)
} // 按照分区进行图构建
val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
val builder = new EdgePartitionBuilder[Int, Int]
iter.foreach { line =>
if (!line.isEmpty && line(0) != '#') { // 过滤注释行
val lineArray = line.split("\\s+")
if (lineArray.length < 2) { // 识别异常数据
throw new IllegalArgumentException("Invalid line: " + line)
}
val srcId = lineArray(0).toLong
val dstId = lineArray(1).toLong
if (canonicalOrientation && srcId > dstId) {
builder.add(dstId, srcId, 1)// 逐个添加边及权重
} else {
builder.add(srcId, dstId, 1)
}
}
}
Iterator((pid, builder.toEdgePartition))
}.persist(edgeStorageLevel).setName("GraphLoader.edgeListFile - edges (%s)".format(path))
edges.count() // 触发执行 logInfo("It took %d ms to load the edges".format(System.currentTimeMillis - startTime)) GraphImpl.fromEdgePartitions(edges, defaultVertexAttr = 1, edgeStorageLevel = edgeStorageLevel,
vertexStorageLevel = vertexStorageLevel)
} // end of edgeListFile }
源码分析:
GraphLoader.edgeListFile是从磁盘或HDFS类似的文件系统中加载图形数据,解析为(源顶点ID, 目标顶点ID)对的邻接列表,并跳过注释行。Graph从指定的边开始创建,然后自动创建和边相邻的任何节点。所有顶点和边属性均默认为1。参数canonicalOrientation允许沿正方向重新定向边,这是所有连接算法所必须的。
源码如下:
/**
* The Graph object contains a collection of routines used to construct graphs from RDDs.
*/
object Graph { /**
* Construct a graph from a collection of edges encoded as vertex id pairs.
*
* @param rawEdges a collection of edges in (src, dst) form
* @param defaultValue the vertex attributes with which to create vertices referenced by the edges
* @param uniqueEdges if multiple identical edges are found they are combined and the edge
* attribute is set to the sum. Otherwise duplicate edges are treated as separate. To enable
* `uniqueEdges`, a [[PartitionStrategy]] must be provided.
* @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
* @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
*
* @return a graph with edge attributes containing either the count of duplicate edges or 1
* (if `uniqueEdges` is `None`) and vertex attributes containing the total degree of each vertex.
*/
def fromEdgeTuples[VD: ClassTag](
rawEdges: RDD[(VertexId, VertexId)],
defaultValue: VD,
uniqueEdges: Option[PartitionStrategy] = None,
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int] =
{
val edges = rawEdges.map(p => Edge(p._1, p._2, 1))
val graph = GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
uniqueEdges match {
case Some(p) => graph.partitionBy(p).groupEdges((a, b) => a + b)
case None => graph
}
} /**
* Construct a graph from a collection of edges.
*
* @param edges the RDD containing the set of edges in the graph
* @param defaultValue the default vertex attribute to use for each vertex
* @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
* @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
*
* @return a graph with edge attributes described by `edges` and vertices
* given by all vertices in `edges` with value `defaultValue`
*/
def fromEdges[VD: ClassTag, ED: ClassTag](
edges: RDD[Edge[ED]],
defaultValue: VD,
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
} /**
* Construct a graph from a collection of vertices and
* edges with attributes. Duplicate vertices are picked arbitrarily and
* vertices found in the edge collection but not in the input
* vertices are assigned the default attribute.
*
* @tparam VD the vertex attribute type
* @tparam ED the edge attribute type
* @param vertices the "set" of vertices and their attributes
* @param edges the collection of edges in the graph
* @param defaultVertexAttr the default vertex attribute to use for vertices that are
* mentioned in edges but not in vertices
* @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
* @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
*/
def apply[VD: ClassTag, ED: ClassTag](
vertices: RDD[(VertexId, VD)],
edges: RDD[Edge[ED]],
defaultVertexAttr: VD = null.asInstanceOf[VD],
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)
} /**
* Implicitly extracts the [[GraphOps]] member from a graph.
*
* To improve modularity the Graph type only contains a small set of basic operations.
* All the convenience operations are defined in the [[GraphOps]] class which may be
* shared across multiple graph implementations.
*/
implicit def graphToGraphOps[VD: ClassTag, ED: ClassTag]
(g: Graph[VD, ED]): GraphOps[VD, ED] = g.ops
源码分析:
Graph.apply允许根据顶点和边的RDD创建图。选取任意重复的顶点,并在边RDD中找到对应的顶点,指定这些数据为顶点的默认属性。
Graph.fromEdges允许仅从边RDD创建图。若顶点数据不存在,则从边数据中提取。这些数据被指定为顶点的默认属性。
Graph.fromEdgeTuple允许仅从边RDD创建图。为边设置初始值为1,并自动创建Edge及相关顶点并指定默认值。它还支持对边进行去重,此时,必须传入PartitionStrategy作为参数uniqueEdges的值(例如:uniqueEdges=Some(PartitionStrategy.RandomVertexCut))。必须使用分区策略才能使相同的边放置到同一个分区上,以便进行重复数据删除。
二.顶点RDD
VertexRDD[A]继承RDD[(VertexId,A)]并增加了额外的限制,每个VertexId只能创建一次。此外,VertexRDD[A]表示一组顶点,每个顶点的类型都为A。在内部,这是通过将顶点属性存储在可重用的哈希映射数据结构中来实现的。如果两个VertexRDDs是从相同的基本VertexRDD派生出来的话,则可以在恒定时间内将它们连接在一起,而无需进行哈希评估。
源码如下:
/**
* @tparam VD the vertex attribute associated with each vertex in the set.
*/
abstract class VertexRDD[VD](
sc: SparkContext,
deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps) { implicit protected def vdTag: ClassTag[VD] private[graphx] def partitionsRDD: RDD[ShippableVertexPartition[VD]] override protected def getPartitions: Array[Partition] = partitionsRDD.partitions /**
* Provides the `RDD[(VertexId, VD)]` equivalent output.
*/
override def compute(part: Partition, context: TaskContext): Iterator[(VertexId, VD)] = {
firstParent[ShippableVertexPartition[VD]].iterator(part, context).next().iterator
} /**
* Construct a new VertexRDD that is indexed by only the visible vertices. The resulting
* VertexRDD will be based on a different index and can no longer be quickly joined with this
* RDD.
*/
def reindex(): VertexRDD[VD] /**
* Applies a function to each `VertexPartition` of this RDD and returns a new VertexRDD.
*/
private[graphx] def mapVertexPartitions[VD2: ClassTag](
f: ShippableVertexPartition[VD] => ShippableVertexPartition[VD2])
: VertexRDD[VD2] /**
* Restricts the vertex set to the set of vertices satisfying the given predicate. This operation
* preserves the index for efficient joins with the original RDD, and it sets bits in the bitmask
* rather than allocating new memory.
*
* It is declared and defined here to allow refining the return type from `RDD[(VertexId, VD)]` to
* `VertexRDD[VD]`.
*
* @param pred the user defined predicate, which takes a tuple to conform to the
* `RDD[(VertexId, VD)]` interface
*/
override def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD] =
this.mapVertexPartitions(_.filter(Function.untupled(pred))) /**
* Maps each vertex attribute, preserving the index.
*
* @tparam VD2 the type returned by the map function
*
* @param f the function applied to each value in the RDD
* @return a new VertexRDD with values obtained by applying `f` to each of the entries in the
* original VertexRDD
*/
def mapValues[VD2: ClassTag](f: VD => VD2): VertexRDD[VD2] /**
* Maps each vertex attribute, additionally supplying the vertex ID.
*
* @tparam VD2 the type returned by the map function
*
* @param f the function applied to each ID-value pair in the RDD
* @return a new VertexRDD with values obtained by applying `f` to each of the entries in the
* original VertexRDD. The resulting VertexRDD retains the same index.
*/
def mapValues[VD2: ClassTag](f: (VertexId, VD) => VD2): VertexRDD[VD2] /**
* For each VertexId present in both `this` and `other`, minus will act as a set difference
* operation returning only those unique VertexId's present in `this`.
*
* @param other an RDD to run the set operation against
*/
def minus(other: RDD[(VertexId, VD)]): VertexRDD[VD] /**
* For each VertexId present in both `this` and `other`, minus will act as a set difference
* operation returning only those unique VertexId's present in `this`.
*
* @param other a VertexRDD to run the set operation against
*/
def minus(other: VertexRDD[VD]): VertexRDD[VD] /**
* For each vertex present in both `this` and `other`, `diff` returns only those vertices with
* differing values; for values that are different, keeps the values from `other`. This is
* only guaranteed to work if the VertexRDDs share a common ancestor.
*
* @param other the other RDD[(VertexId, VD)] with which to diff against.
*/
def diff(other: RDD[(VertexId, VD)]): VertexRDD[VD] /**
* For each vertex present in both `this` and `other`, `diff` returns only those vertices with
* differing values; for values that are different, keeps the values from `other`. This is
* only guaranteed to work if the VertexRDDs share a common ancestor.
*
* @param other the other VertexRDD with which to diff against.
*/
def diff(other: VertexRDD[VD]): VertexRDD[VD] /**
* Left joins this RDD with another VertexRDD with the same index. This function will fail if
* both VertexRDDs do not share the same index. The resulting vertex set contains an entry for
* each vertex in `this`.
* If `other` is missing any vertex in this VertexRDD, `f` is passed `None`.
*
* @tparam VD2 the attribute type of the other VertexRDD
* @tparam VD3 the attribute type of the resulting VertexRDD
*
* @param other the other VertexRDD with which to join.
* @param f the function mapping a vertex id and its attributes in this and the other vertex set
* to a new vertex attribute.
* @return a VertexRDD containing the results of `f`
*/
def leftZipJoin[VD2: ClassTag, VD3: ClassTag]
(other: VertexRDD[VD2])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3] /**
* Left joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
* backed by a VertexRDD with the same index then the efficient [[leftZipJoin]] implementation is
* used. The resulting VertexRDD contains an entry for each vertex in `this`. If `other` is
* missing any vertex in this VertexRDD, `f` is passed `None`. If there are duplicates,
* the vertex is picked arbitrarily.
*
* @tparam VD2 the attribute type of the other VertexRDD
* @tparam VD3 the attribute type of the resulting VertexRDD
*
* @param other the other VertexRDD with which to join
* @param f the function mapping a vertex id and its attributes in this and the other vertex set
* to a new vertex attribute.
* @return a VertexRDD containing all the vertices in this VertexRDD with the attributes emitted
* by `f`.
*/
def leftJoin[VD2: ClassTag, VD3: ClassTag]
(other: RDD[(VertexId, VD2)])
(f: (VertexId, VD, Option[VD2]) => VD3)
: VertexRDD[VD3] /**
* Efficiently inner joins this VertexRDD with another VertexRDD sharing the same index. See
* [[innerJoin]] for the behavior of the join.
*/
def innerZipJoin[U: ClassTag, VD2: ClassTag](other: VertexRDD[U])
(f: (VertexId, VD, U) => VD2): VertexRDD[VD2] /**
* Inner joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
* backed by a VertexRDD with the same index then the efficient [[innerZipJoin]] implementation
* is used.
*
* @param other an RDD containing vertices to join. If there are multiple entries for the same
* vertex, one is picked arbitrarily. Use [[aggregateUsingIndex]] to merge multiple entries.
* @param f the join function applied to corresponding values of `this` and `other`
* @return a VertexRDD co-indexed with `this`, containing only vertices that appear in both
* `this` and `other`, with values supplied by `f`
*/
def innerJoin[U: ClassTag, VD2: ClassTag](other: RDD[(VertexId, U)])
(f: (VertexId, VD, U) => VD2): VertexRDD[VD2] /**
* Aggregates vertices in `messages` that have the same ids using `reduceFunc`, returning a
* VertexRDD co-indexed with `this`.
*
* @param messages an RDD containing messages to aggregate, where each message is a pair of its
* target vertex ID and the message data
* @param reduceFunc the associative aggregation function for merging messages to the same vertex
* @return a VertexRDD co-indexed with `this`, containing only vertices that received messages.
* For those vertices, their values are the result of applying `reduceFunc` to all received
* messages.
*/
def aggregateUsingIndex[VD2: ClassTag](
messages: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2] /**
* Returns a new `VertexRDD` reflecting a reversal of all edge directions in the corresponding
* [[EdgeRDD]].
*/
def reverseRoutingTables(): VertexRDD[VD] /** Prepares this VertexRDD for efficient joins with the given EdgeRDD. */
def withEdges(edges: EdgeRDD[_]): VertexRDD[VD] /** Replaces the vertex partitions while preserving all other properties of the VertexRDD. */
private[graphx] def withPartitionsRDD[VD2: ClassTag](
partitionsRDD: RDD[ShippableVertexPartition[VD2]]): VertexRDD[VD2] /**
* Changes the target storage level while preserving all other properties of the
* VertexRDD. Operations on the returned VertexRDD will preserve this storage level.
*
* This does not actually trigger a cache; to do this, call
* [[org.apache.spark.graphx.VertexRDD#cache]] on the returned VertexRDD.
*/
private[graphx] def withTargetStorageLevel(
targetStorageLevel: StorageLevel): VertexRDD[VD] /** Generates an RDD of vertex attributes suitable for shipping to the edge partitions. */
private[graphx] def shipVertexAttributes(
shipSrc: Boolean, shipDst: Boolean): RDD[(PartitionID, VertexAttributeBlock[VD])] /** Generates an RDD of vertex IDs suitable for shipping to the edge partitions. */
private[graphx] def shipVertexIds(): RDD[(PartitionID, Array[VertexId])]
源码分析:
基本的操作像filer,leftJoin,RightJoin和Spark SQL基本一致,用法也相同,只是处理的数据样式有所差别。另外,像独有的算子,例如:aggregateUsingIndex可以高效构建新的VertexRDD。从概念上讲,如果我们构建了VertexRDD[B]这一组数据,这是顶点A的超集,那么构建RDD[(VertexId,A)]就可以重用索引进行聚合,从而大大提高效率。
三.边RDD
边EdgeRDD[ED]其延伸至RDD[Edge[ED]],使用定义中的各种分区策略PatitionStrategy。在每个分区中,边属性和邻接结构分别存储,从而在更改属性值时可实现最大程度的重用。
源码如下:
abstract class EdgeRDD[ED](
sc: SparkContext,
deps: Seq[Dependency[_]]) extends RDD[Edge[ED]](sc, deps) { // scalastyle:off structural.type
private[graphx] def partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])] forSome { type VD }
// scalastyle:on structural.type override protected def getPartitions: Array[Partition] = partitionsRDD.partitions override def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = {
val p = firstParent[(PartitionID, EdgePartition[ED, _])].iterator(part, context)
if (p.hasNext) {
p.next()._2.iterator.map(_.copy())
} else {
Iterator.empty
}
} /**
* Map the values in an edge partitioning preserving the structure but changing the values.
*
* @tparam ED2 the new edge value type
* @param f the function from an edge to a new edge value
* @return a new EdgeRDD containing the new edge values
*/
def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): EdgeRDD[ED2] /**
* Reverse all the edges in this RDD.
*
* @return a new EdgeRDD containing all the edges reversed
*/
def reverse: EdgeRDD[ED] /**
* Inner joins this EdgeRDD with another EdgeRDD, assuming both are partitioned using the same
* [[PartitionStrategy]].
*
* @param other the EdgeRDD to join with
* @param f the join function applied to corresponding values of `this` and `other`
* @return a new EdgeRDD containing only edges that appear in both `this` and `other`,
* with values supplied by `f`
*/
def innerJoin[ED2: ClassTag, ED3: ClassTag]
(other: EdgeRDD[ED2])
(f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3] /**
* Changes the target storage level while preserving all other properties of the
* EdgeRDD. Operations on the returned EdgeRDD will preserve this storage level.
*
* This does not actually trigger a cache; to do this, call
* [[org.apache.spark.graphx.EdgeRDD#cache]] on the returned EdgeRDD.
*/
private[graphx] def withTargetStorageLevel(targetStorageLevel: StorageLevel): EdgeRDD[ED]
}
源码分析:
单独使用情况较少,一般EdgeRDD上的操作是通过图运算符完成的,或者依赖于基类RDD中定义的操作。
Spark GraphX图计算核心源码分析【图构建器、顶点、边】的更多相关文章
- 并发编程之 SynchronousQueue 核心源码分析
前言 SynchronousQueue 是一个普通用户不怎么常用的队列,通常在创建无界线程池(Executors.newCachedThreadPool())的时候使用,也就是那个非常危险的线程池 ^ ...
- iOS 开源库系列 Aspects核心源码分析---面向切面编程之疯狂的 Aspects
Aspects的源码学习,我学到的有几下几点 Objective-C Runtime 理解OC的消息分发机制 KVO中的指针交换技术 Block 在内存中的数据结构 const 的修饰区别 block ...
- HashMap的结构以及核心源码分析
摘要 对于Java开发人员来说,能够熟练地掌握java的集合类是必须的,本节想要跟大家共同学习一下JDK1.8中HashMap的底层实现与源码分析.HashMap是开发中使用频率最高的用于映射(键值对 ...
- SynchronousQueue核心源码分析
一.SynchronousQueue的介绍 SynchronousQueue是一个不存储元素的阻塞队列.每一个put操作必须等待一个take操作,否则不能继续添加元素.SynchronousQueue ...
- jquery事件核心源码分析
我们从绑定事件开始,一步步往下看: 以jquery.1.8.3为例,平时通过jquery绑定事件最常用的是on方法,大概分为下面3种类型: $(target).on('click',function( ...
- 迷你版jQuery——zepto核心源码分析
前言 zepto号称迷你版jQuery,并且成为移动端dom操作库的首选 事实上zepto很多时候只是借用了jQuery的名气,保持了与其基本一致的API,其内部实现早已面目全非! 艾伦分析了jQue ...
- FutureTask核心源码分析
本文主要介绍FutureTask中的核心方法,如果有错误,欢迎大家指出! 首先我们看一下在java中FutureTask的组织关系 我们看一下FutureTask中关键的成员变量以及其构造方法 //表 ...
- HTTP流量神器Goreplay核心源码详解
摘要:Goreplay 前称是 Gor,一个简单的 TCP/HTTP 流量录制及重放的工具,主要用 Go 语言编写. 本文分享自华为云社区<流量回放工具之 goreplay 核心源码分析> ...
- Java内存管理-掌握类加载器的核心源码和设计模式(六)
勿在流沙筑高台,出来混迟早要还的. 做一个积极的人 编码.改bug.提升自己 我有一个乐园,面向编程,春暖花开! 上一篇文章介绍了类加载器分类以及类加载器的双亲委派模型,让我们能够从整体上对类加载器有 ...
随机推荐
- Linux 怎样更改locale语言设置
推荐使用UTF8编码,因为这是国际标准,能兼容任何语言的编码.在CentOS VPS下修改语言编码: localedef -c -f UTF-8 -i zh_CN zh_CN.utf8 export ...
- 神兽、佛祖保佑,代码全程无bug
''' ━━━━━━神兽出没━━━━━━ ┏┓ ┏┓ ┏┛┻━━━━━┛┻┓ ┃ ┃ ┃ ━ ┃ ┃ ┳┛ ┗┳ ┃ ┃ ┃ ┃ ┻ ┃ ┃ ┃ ┗━┓ ┏━┛ Code is far away fr ...
- LeetCode237-Delete_Node_In_A_Linked_List
delete-node-in-a-linked-list public void deleteNode(ListNode node) { node.val = node.next.val; node. ...
- CF1076D Edge Deletion 最短路树
问题描述 Codeforces 洛谷(有翻译) 题解 最短路树,是一棵在最短路过程中构建的树. 在\(\mathrm{Dijkstra}\)过程中,如果最终点\(y\)是由点\(x\)转移得到的,则在 ...
- RHEL7 安装Docker-CE
rhel7官方有源可以直接使用,前提是需要订阅, 参考地址 通过添加CentOS7 源,进行安装: 通过添加CentOS7 源,进行安装 参考博客 安装container-selinux依赖(Requ ...
- [Taro] 解决 使用 Taro UI 小程序下 Iconfont 图标 不显示问题
Taro UI 配置 第三方 的 文档 配置即可解决 https://taro-ui.jd.com/#/docs/icon 解决问题: 之前 只有在H5下 才显示 Iconfont 图标 后来按照此文 ...
- 【转】Java代码编译过程简述
转载:https://blog.csdn.net/fuzhongmin05/article/details/54880257. 代码编译是由Javac编译器来完成,流程如下图1所示: 图1 Javac ...
- 从GopherChina 2019看当前的go语言
GopherChina 2019大会4月底刚刚结束,大会上使用的PPT也放了出来(大会情况及PPT在https://mp.weixin.qq.com/s/_oVpIcBMVIKVzQn6YrkAJw) ...
- Java Scala获取所有注解的类信息
要想获取使用指定注解的类信息,可借助工具: org.reflections.Reflections 此工具将Java反射进行了高级封装,Reflections 通过扫描 classpath,索引元数据 ...
- (十六)golang--匿名函数
Go支持匿名函数,如果我们某个函数只是使用一次,可以考虑使用匿名函数,匿名函数也可以实现多次调用: 匿名函数的使用方式:(1)在定义匿名函数的时候就直接调用,这种方式匿名函数只调用一次: (2)将匿名 ...