Spark GraphX Core Source Code Analysis [Graph Builders, Vertices, Edges]

1. Graph Builders

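GraphX provides several ways to build a graph from a collection of vertices and edges held in an RDD or on disk. By default, none of the graph builders repartitions the graph's edges; instead, edges are left in their default partitions. Graph.groupEdges requires the graph to be repartitioned because it assumes that identical edges are colocated on the same partition, so Graph.partitionBy must be called before groupEdges.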
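The ordering requirement between partitionBy and groupEdges can be illustrated with a small sketch. It assumes a local SparkContext and a hypothetical three-edge sample in which the edge 1 -> 2 appears twice; the object and method names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

object GroupEdgesSketch {
  /** Returns the raw edge count and the merged edges as sorted (src, dst, attr) triples. */
  def mergeDuplicates(sc: SparkContext): (Long, Seq[(Long, Long, Int)]) = {
    // Hypothetical sample data: the edge 1 -> 2 appears twice.
    val rawEdges = sc.parallelize(
      Seq(Edge(1L, 2L, 1), Edge(1L, 2L, 1), Edge(2L, 3L, 1)), numSlices = 2)
    val graph = Graph.fromEdges(rawEdges, defaultValue = 0)

    // Repartition first so identical edges share a partition, then merge them
    // by summing their attributes.
    val merged = graph
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .groupEdges(_ + _)

    val triples = merged.edges.collect().map(e => (e.srcId, e.dstId, e.attr)).sorted.toSeq
    (graph.edges.count(), triples)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("GroupEdgesSketch")
    val sc = new SparkContext(conf)
    try println(mergeDuplicates(sc)) finally sc.stop()
  }
}
```

RandomVertexCut is used here because it colocates all same-direction edges between two vertices; other strategies in PartitionStrategy could be substituted.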

The source code is as follows:

 package org.apache.spark.graphx

 import org.apache.spark.SparkContext

 import org.apache.spark.graphx.impl.{EdgePartitionBuilder, GraphImpl}

 import org.apache.spark.internal.Logging

 import org.apache.spark.storage.StorageLevel

 /**

  * Provides utilities for loading [[Graph]]s from files.

  */

 object GraphLoader extends Logging {

   /**

    * Loads a graph from an edge list formatted file where each line contains two integers: a source

    * id and a target id. Skips lines that begin with `#`.

    */

   def edgeListFile(

       sc: SparkContext,

       path: String,

       canonicalOrientation: Boolean = false,

       numEdgePartitions: Int = -1,

       edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY, // storage level used to cache the edges

       vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)

     : Graph[Int, Int] =

   {

     val startTime = System.currentTimeMillis

     // Parse the edge data table directly into edge partitions

     val lines =

         if (numEdgePartitions > 0) { // load the file with the requested number of partitions

         sc.textFile(path, numEdgePartitions).coalesce(numEdgePartitions)

       } else {

         sc.textFile(path)

       } // the edge partitions are built per input partition below

     val edges = lines.mapPartitionsWithIndex { (pid, iter) =>

       val builder = new EdgePartitionBuilder[Int, Int]

       iter.foreach { line =>

         if (!line.isEmpty && line(0) != '#') { // skip empty lines and comment lines

           val lineArray = line.split("\\s+")

           if (lineArray.length < 2) { // reject malformed lines

             throw new IllegalArgumentException("Invalid line: " + line)

           }

           val srcId = lineArray(0).toLong

           val dstId = lineArray(1).toLong

           if (canonicalOrientation && srcId > dstId) {

             builder.add(dstId, srcId, 1) // add the edge with attribute 1, reoriented so that src < dst

           } else {

             builder.add(srcId, dstId, 1)

           }

         }

       }

       Iterator((pid, builder.toEdgePartition))

     }.persist(edgeStorageLevel).setName("GraphLoader.edgeListFile - edges (%s)".format(path))

     edges.count() // force evaluation so the edge partitions are materialized

     logInfo("It took %d ms to load the edges".format(System.currentTimeMillis - startTime))

     GraphImpl.fromEdgePartitions(edges, defaultVertexAttr = 1, edgeStorageLevel = edgeStorageLevel,

       vertexStorageLevel = vertexStorageLevel)

   } // end of edgeListFile

 }

Source analysis:

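GraphLoader.edgeListFile loads a graph from an edge list stored on disk or on an HDFS-like file system. It parses each line as a (source vertex ID, destination vertex ID) pair and skips comment lines that begin with #. The graph is built from the specified edges, and any vertex mentioned by an edge is created automatically. All vertex and edge attributes default to 1. The canonicalOrientation parameter reorients edges in the positive direction (srcId < dstId), which the connected components algorithm requires. When numEdgePartitions is positive it sets the number of edge partitions; the default of -1 keeps the partitioning chosen by sc.textFile.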
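A minimal usage sketch of edgeListFile follows. It assumes a local SparkContext and writes a small hypothetical edge list to a temporary file; the file contents and the object name are illustrative only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object EdgeListFileSketch {
  /** Loads a tiny edge list and returns its sorted edges and vertices. */
  def load(sc: SparkContext): (Seq[(Long, Long, Int)], Seq[(Long, Int)]) = {
    // Hypothetical input: one comment line and three edges, two of them with src > dst.
    val file = Files.createTempFile("edges", ".txt")
    file.toFile.deleteOnExit()
    Files.write(file, "# src dst\n2 1\n1 3\n3 2\n".getBytes(StandardCharsets.UTF_8))

    // canonicalOrientation = true swaps endpoints whenever srcId > dstId.
    val graph = GraphLoader.edgeListFile(sc, file.toUri.toString, canonicalOrientation = true)

    val edges = graph.edges.collect().map(e => (e.srcId, e.dstId, e.attr)).sorted.toSeq
    val vertices = graph.vertices.collect().sorted.toSeq
    (edges, vertices)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("EdgeListFileSketch")
    val sc = new SparkContext(conf)
    try println(load(sc)) finally sc.stop()
  }
}
```

Every edge and vertex attribute in the loaded graph is 1, matching the defaults hard-coded in the loader.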

The source code is as follows:

 /**

  * The Graph object contains a collection of routines used to construct graphs from RDDs.

  */

 object Graph {

   /**

    * Construct a graph from a collection of edges encoded as vertex id pairs.

    *

    * @param rawEdges a collection of edges in (src, dst) form

    * @param defaultValue the vertex attributes with which to create vertices referenced by the edges

    * @param uniqueEdges if multiple identical edges are found they are combined and the edge

    * attribute is set to the sum.  Otherwise duplicate edges are treated as separate. To enable

    * `uniqueEdges`, a [[PartitionStrategy]] must be provided.

    * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary

    * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary

    *

    * @return a graph with edge attributes containing either the count of duplicate edges or 1

    * (if `uniqueEdges` is `None`) and vertex attributes containing the total degree of each vertex.

    */

   def fromEdgeTuples[VD: ClassTag](

       rawEdges: RDD[(VertexId, VertexId)],

       defaultValue: VD,

       uniqueEdges: Option[PartitionStrategy] = None,

       edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,

       vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int] =

   {

     val edges = rawEdges.map(p => Edge(p._1, p._2, 1))

     val graph = GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)

     uniqueEdges match {

       case Some(p) => graph.partitionBy(p).groupEdges((a, b) => a + b)

       case None => graph

     }

   }

   /**

    * Construct a graph from a collection of edges.

    *

    * @param edges the RDD containing the set of edges in the graph

    * @param defaultValue the default vertex attribute to use for each vertex

    * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary

    * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary

    *

    * @return a graph with edge attributes described by `edges` and vertices

    *         given by all vertices in `edges` with value `defaultValue`

    */

   def fromEdges[VD: ClassTag, ED: ClassTag](

       edges: RDD[Edge[ED]],

       defaultValue: VD,

       edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,

       vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {

     GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)

   }

   /**

    * Construct a graph from a collection of vertices and

    * edges with attributes.  Duplicate vertices are picked arbitrarily and

    * vertices found in the edge collection but not in the input

    * vertices are assigned the default attribute.

    *

    * @tparam VD the vertex attribute type

    * @tparam ED the edge attribute type

    * @param vertices the "set" of vertices and their attributes

    * @param edges the collection of edges in the graph

    * @param defaultVertexAttr the default vertex attribute to use for vertices that are

    *                          mentioned in edges but not in vertices

    * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary

    * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary

    */

   def apply[VD: ClassTag, ED: ClassTag](

       vertices: RDD[(VertexId, VD)],

       edges: RDD[Edge[ED]],

       defaultVertexAttr: VD = null.asInstanceOf[VD],

       edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,

       vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {

     GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)

   }

   /**

    * Implicitly extracts the [[GraphOps]] member from a graph.

    *

    * To improve modularity the Graph type only contains a small set of basic operations.

    * All the convenience operations are defined in the [[GraphOps]] class which may be

    * shared across multiple graph implementations.

    */

   implicit def graphToGraphOps[VD: ClassTag, ED: ClassTag]

       (g: Graph[VD, ED]): GraphOps[VD, ED] = g.ops

 } // end of Graph

Source analysis:

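Graph.apply builds a graph from an RDD of vertices and an RDD of edges. Duplicate vertices are picked arbitrarily, and vertices that appear in the edge RDD but not in the vertex RDD are assigned the default attribute.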

Graph.fromEdges builds a graph from an RDD of edges alone. Every vertex mentioned by an edge is created automatically and assigned the default value.

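Graph.fromEdgeTuples builds a graph from an RDD of (source, destination) vertex ID pairs alone. Each edge gets the attribute 1, and every vertex mentioned by an edge is created automatically with the default value. It also supports merging duplicate edges: pass a PartitionStrategy as the uniqueEdges argument (for example, uniqueEdges = Some(PartitionStrategy.RandomVertexCut)), and the attribute of each merged edge becomes the number of duplicates. A partition strategy is required so that identical edges land on the same partition, where they can be merged.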
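The three builders can be compared side by side in a short sketch. It assumes a local SparkContext; the sample vertices, edges, and default attribute strings are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

object GraphBuildersSketch {
  /** Returns the vertices from Graph.apply and Graph.fromEdges, and the edges from fromEdgeTuples. */
  def build(sc: SparkContext): (Map[Long, String], Map[Long, String], Seq[(Long, Long, Int)]) = {
    val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq((1L, "a"), (2L, "b")))
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 7), Edge(2L, 3L, 9)))

    // Graph.apply: vertex 3 occurs only in the edges, so it receives the default attribute.
    val g1 = Graph(vertices, edges, "missing")

    // Graph.fromEdges: all vertices are derived from the edges and share the default value.
    val g2 = Graph.fromEdges(edges, "default")

    // Graph.fromEdgeTuples with uniqueEdges: the duplicate pair (1, 2) is merged and counted.
    val tuples: RDD[(VertexId, VertexId)] = sc.parallelize(Seq((1L, 2L), (1L, 2L), (2L, 3L)))
    val g3 = Graph.fromEdgeTuples(tuples, defaultValue = 0,
      uniqueEdges = Some(PartitionStrategy.RandomVertexCut))

    val g3Edges = g3.edges.collect().map(e => (e.srcId, e.dstId, e.attr)).sorted.toSeq
    (g1.vertices.collect().toMap, g2.vertices.collect().toMap, g3Edges)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("GraphBuildersSketch")
    val sc = new SparkContext(conf)
    try println(build(sc)) finally sc.stop()
  }
}
```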

2. Vertex RDD

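VertexRDD[A] extends RDD[(VertexId, A)] and adds the constraint that each VertexId occurs only once. A VertexRDD[A] therefore represents a set of vertices, each with an attribute of type A. Internally, this is implemented by storing the vertex attributes in a reusable hash-map data structure. As a result, two VertexRDDs derived from the same base VertexRDD (for example, through filter or mapValues) can be joined in constant time without hash evaluations.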
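The index sharing described above can be sketched as follows: a filtered VertexRDD keeps the index of its parent, so the two can be combined with innerZipJoin. The sketch assumes a local SparkContext and a hypothetical set of ten vertices whose attribute equals their ID.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.VertexRDD

object VertexRddIndexSketch {
  /** Returns the size of the base set, the size of the filtered set, and the joined pairs. */
  def run(sc: SparkContext): (Long, Long, Seq[(Long, Int)]) = {
    // Hypothetical base set: vertices 0..9, each carrying its own ID as the attribute.
    val all: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 10L).map(id => (id, id.toInt)))

    // filter only flips bits in a mask; the result still shares the index of `all`.
    val evens: VertexRDD[Int] = all.filter { case (id, _) => id % 2 == 0 }

    // Because both sides share one index, the zip join needs no re-hashing.
    val joined: VertexRDD[Int] = all.innerZipJoin(evens)((id, a, b) => a + b)

    (all.count(), evens.count(), joined.collect().sorted.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("VertexRddIndexSketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```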

The source code is as follows:

 /**

  * @tparam VD the vertex attribute associated with each vertex in the set.

  */

 abstract class VertexRDD[VD](

     sc: SparkContext,

     deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps) {

   implicit protected def vdTag: ClassTag[VD]

   private[graphx] def partitionsRDD: RDD[ShippableVertexPartition[VD]]

   override protected def getPartitions: Array[Partition] = partitionsRDD.partitions

   /**

    * Provides the `RDD[(VertexId, VD)]` equivalent output.

    */

   override def compute(part: Partition, context: TaskContext): Iterator[(VertexId, VD)] = {

     firstParent[ShippableVertexPartition[VD]].iterator(part, context).next().iterator

   }

   /**

    * Construct a new VertexRDD that is indexed by only the visible vertices. The resulting

    * VertexRDD will be based on a different index and can no longer be quickly joined with this

    * RDD.

    */

   def reindex(): VertexRDD[VD]

   /**

    * Applies a function to each `VertexPartition` of this RDD and returns a new VertexRDD.

    */

   private[graphx] def mapVertexPartitions[VD2: ClassTag](

       f: ShippableVertexPartition[VD] => ShippableVertexPartition[VD2])

     : VertexRDD[VD2]

   /**

    * Restricts the vertex set to the set of vertices satisfying the given predicate. This operation

    * preserves the index for efficient joins with the original RDD, and it sets bits in the bitmask

    * rather than allocating new memory.

    *

    * It is declared and defined here to allow refining the return type from `RDD[(VertexId, VD)]` to

    * `VertexRDD[VD]`.

    *

    * @param pred the user defined predicate, which takes a tuple to conform to the

    * `RDD[(VertexId, VD)]` interface

    */

   override def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD] =

     this.mapVertexPartitions(_.filter(Function.untupled(pred)))

   /**

    * Maps each vertex attribute, preserving the index.

    *

    * @tparam VD2 the type returned by the map function

    *

    * @param f the function applied to each value in the RDD

    * @return a new VertexRDD with values obtained by applying `f` to each of the entries in the

    * original VertexRDD

    */

   def mapValues[VD2: ClassTag](f: VD => VD2): VertexRDD[VD2]

   /**

    * Maps each vertex attribute, additionally supplying the vertex ID.

    *

    * @tparam VD2 the type returned by the map function

    *

    * @param f the function applied to each ID-value pair in the RDD

    * @return a new VertexRDD with values obtained by applying `f` to each of the entries in the

    * original VertexRDD.  The resulting VertexRDD retains the same index.

    */

   def mapValues[VD2: ClassTag](f: (VertexId, VD) => VD2): VertexRDD[VD2]

   /**

    * For each VertexId present in both `this` and `other`, minus will act as a set difference

    * operation returning only those unique VertexId's present in `this`.

    *

    * @param other an RDD to run the set operation against

    */

   def minus(other: RDD[(VertexId, VD)]): VertexRDD[VD]

   /**

    * For each VertexId present in both `this` and `other`, minus will act as a set difference

    * operation returning only those unique VertexId's present in `this`.

    *

    * @param other a VertexRDD to run the set operation against

    */

   def minus(other: VertexRDD[VD]): VertexRDD[VD]

   /**

    * For each vertex present in both `this` and `other`, `diff` returns only those vertices with

    * differing values; for values that are different, keeps the values from `other`. This is

    * only guaranteed to work if the VertexRDDs share a common ancestor.

    *

    * @param other the other RDD[(VertexId, VD)] with which to diff against.

    */

   def diff(other: RDD[(VertexId, VD)]): VertexRDD[VD]

   /**

    * For each vertex present in both `this` and `other`, `diff` returns only those vertices with

    * differing values; for values that are different, keeps the values from `other`. This is

    * only guaranteed to work if the VertexRDDs share a common ancestor.

    *

    * @param other the other VertexRDD with which to diff against.

    */

   def diff(other: VertexRDD[VD]): VertexRDD[VD]

   /**

    * Left joins this RDD with another VertexRDD with the same index. This function will fail if

    * both VertexRDDs do not share the same index. The resulting vertex set contains an entry for

    * each vertex in `this`.

    * If `other` is missing any vertex in this VertexRDD, `f` is passed `None`.

    *

    * @tparam VD2 the attribute type of the other VertexRDD

    * @tparam VD3 the attribute type of the resulting VertexRDD

    *

    * @param other the other VertexRDD with which to join.

    * @param f the function mapping a vertex id and its attributes in this and the other vertex set

    * to a new vertex attribute.

    * @return a VertexRDD containing the results of `f`

    */

   def leftZipJoin[VD2: ClassTag, VD3: ClassTag]

       (other: VertexRDD[VD2])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]

   /**

    * Left joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is

    * backed by a VertexRDD with the same index then the efficient [[leftZipJoin]] implementation is

    * used. The resulting VertexRDD contains an entry for each vertex in `this`. If `other` is

    * missing any vertex in this VertexRDD, `f` is passed `None`. If there are duplicates,

    * the vertex is picked arbitrarily.

    *

    * @tparam VD2 the attribute type of the other VertexRDD

    * @tparam VD3 the attribute type of the resulting VertexRDD

    *

    * @param other the other VertexRDD with which to join

    * @param f the function mapping a vertex id and its attributes in this and the other vertex set

    * to a new vertex attribute.

    * @return a VertexRDD containing all the vertices in this VertexRDD with the attributes emitted

    * by `f`.

    */

   def leftJoin[VD2: ClassTag, VD3: ClassTag]

       (other: RDD[(VertexId, VD2)])

       (f: (VertexId, VD, Option[VD2]) => VD3)

     : VertexRDD[VD3]

   /**

    * Efficiently inner joins this VertexRDD with another VertexRDD sharing the same index. See

    * [[innerJoin]] for the behavior of the join.

    */

   def innerZipJoin[U: ClassTag, VD2: ClassTag](other: VertexRDD[U])

       (f: (VertexId, VD, U) => VD2): VertexRDD[VD2]

   /**

    * Inner joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is

    * backed by a VertexRDD with the same index then the efficient [[innerZipJoin]] implementation

    * is used.

    *

    * @param other an RDD containing vertices to join. If there are multiple entries for the same

    * vertex, one is picked arbitrarily. Use [[aggregateUsingIndex]] to merge multiple entries.

    * @param f the join function applied to corresponding values of `this` and `other`

    * @return a VertexRDD co-indexed with `this`, containing only vertices that appear in both

    *         `this` and `other`, with values supplied by `f`

    */

   def innerJoin[U: ClassTag, VD2: ClassTag](other: RDD[(VertexId, U)])

       (f: (VertexId, VD, U) => VD2): VertexRDD[VD2]

   /**

    * Aggregates vertices in `messages` that have the same ids using `reduceFunc`, returning a

    * VertexRDD co-indexed with `this`.

    *

    * @param messages an RDD containing messages to aggregate, where each message is a pair of its

    * target vertex ID and the message data

    * @param reduceFunc the associative aggregation function for merging messages to the same vertex

    * @return a VertexRDD co-indexed with `this`, containing only vertices that received messages.

    * For those vertices, their values are the result of applying `reduceFunc` to all received

    * messages.

    */

   def aggregateUsingIndex[VD2: ClassTag](

       messages: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]

   /**

    * Returns a new `VertexRDD` reflecting a reversal of all edge directions in the corresponding

    * [[EdgeRDD]].

    */

   def reverseRoutingTables(): VertexRDD[VD]

   /** Prepares this VertexRDD for efficient joins with the given EdgeRDD. */

   def withEdges(edges: EdgeRDD[_]): VertexRDD[VD]

   /** Replaces the vertex partitions while preserving all other properties of the VertexRDD. */

   private[graphx] def withPartitionsRDD[VD2: ClassTag](

       partitionsRDD: RDD[ShippableVertexPartition[VD2]]): VertexRDD[VD2]

   /**

    * Changes the target storage level while preserving all other properties of the

    * VertexRDD. Operations on the returned VertexRDD will preserve this storage level.

    *

    * This does not actually trigger a cache; to do this, call

    * [[org.apache.spark.graphx.VertexRDD#cache]] on the returned VertexRDD.

    */

   private[graphx] def withTargetStorageLevel(

       targetStorageLevel: StorageLevel): VertexRDD[VD]

   /** Generates an RDD of vertex attributes suitable for shipping to the edge partitions. */

   private[graphx] def shipVertexAttributes(

       shipSrc: Boolean, shipDst: Boolean): RDD[(PartitionID, VertexAttributeBlock[VD])]

   /** Generates an RDD of vertex IDs suitable for shipping to the edge partitions. */

   private[graphx] def shipVertexIds(): RDD[(PartitionID, Array[VertexId])]

 } // end of VertexRDD

Source analysis:

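Basic operations such as filter, leftJoin, and innerJoin behave much like the filter and join operations familiar from Spark SQL and are used the same way; what differs is the shape of the data they process, and the fact that they preserve or reuse the VertexRDD's internal index. Operators specific to VertexRDD, such as aggregateUsingIndex, build a new VertexRDD efficiently. Conceptually, if a VertexRDD[B] has been built over a set of vertices that is a superset of the vertices in some RDD[(VertexId, A)], its index can be reused both to aggregate that RDD[(VertexId, A)] and to index the result, which substantially improves efficiency.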
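The index reuse in aggregateUsingIndex can be sketched along the lines of the example in the GraphX programming guide. The sketch assumes a local SparkContext; the sizes and values are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

object AggregateUsingIndexSketch {
  /** Returns the message count, the aggregated vertex count, and the joined result. */
  def run(sc: SparkContext): (Long, Long, Seq[(Long, Double)]) = {
    // Base vertex set over IDs 0..9; its index is what gets reused below.
    val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 10L).map(id => (id, 1)))

    // Two messages per vertex, so rddB has 20 entries with duplicate IDs.
    val rddB: RDD[(VertexId, Double)] =
      sc.parallelize(0L until 10L).flatMap(id => List((id, 1.0), (id, 2.0)))

    // Messages with the same ID are summed, and the result is co-indexed with setA.
    val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)

    // setA and setB share an index, so this join can take the fast path.
    val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)

    (rddB.count(), setB.count(), setC.collect().sorted.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("AggregateUsingIndexSketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```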

3. Edge RDD

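EdgeRDD[ED] extends RDD[Edge[ED]] and organizes the edges in blocks partitioned with one of the strategies defined in PartitionStrategy. Within each partition, the edge attributes and the adjacency structure are stored separately, which allows maximum reuse when attribute values change.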

The source code is as follows:

 abstract class EdgeRDD[ED](

     sc: SparkContext,

     deps: Seq[Dependency[_]]) extends RDD[Edge[ED]](sc, deps) {

   // scalastyle:off structural.type

   private[graphx] def partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])] forSome { type VD }

   // scalastyle:on structural.type

   override protected def getPartitions: Array[Partition] = partitionsRDD.partitions

   override def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = {

     val p = firstParent[(PartitionID, EdgePartition[ED, _])].iterator(part, context)

     if (p.hasNext) {

       p.next()._2.iterator.map(_.copy())

     } else {

       Iterator.empty

     }

   }

   /**

    * Map the values in an edge partitioning preserving the structure but changing the values.

    *

    * @tparam ED2 the new edge value type

    * @param f the function from an edge to a new edge value

    * @return a new EdgeRDD containing the new edge values

    */

   def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): EdgeRDD[ED2]

   /**

    * Reverse all the edges in this RDD.

    *

    * @return a new EdgeRDD containing all the edges reversed

    */

   def reverse: EdgeRDD[ED]

   /**

    * Inner joins this EdgeRDD with another EdgeRDD, assuming both are partitioned using the same

    * [[PartitionStrategy]].

    *

    * @param other the EdgeRDD to join with

    * @param f the join function applied to corresponding values of `this` and `other`

    * @return a new EdgeRDD containing only edges that appear in both `this` and `other`,

    *         with values supplied by `f`

    */

   def innerJoin[ED2: ClassTag, ED3: ClassTag]

       (other: EdgeRDD[ED2])

       (f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]

   /**

    * Changes the target storage level while preserving all other properties of the

    * EdgeRDD. Operations on the returned EdgeRDD will preserve this storage level.

    *

    * This does not actually trigger a cache; to do this, call

    * [[org.apache.spark.graphx.EdgeRDD#cache]] on the returned EdgeRDD.

    */

   private[graphx] def withTargetStorageLevel(targetStorageLevel: StorageLevel): EdgeRDD[ED]

 }

Source analysis:

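EdgeRDD is rarely used on its own. Operations on edges are usually performed through the graph operators, or rely on operations defined in the base RDD class.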
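For comparison, the sketch below applies mapValues and reverse directly on an EdgeRDD and then performs the same attribute update through the graph operator mapEdges. It assumes a local SparkContext and a hypothetical two-edge graph.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeRDD, Graph}

object EdgeRddSketch {
  private def triples(edges: EdgeRDD[Int]): Seq[(Long, Long, Int)] =
    edges.collect().map(e => (e.srcId, e.dstId, e.attr)).sorted.toSeq

  /** Returns the doubled edges, the reversed edges, and the doubled edges obtained via mapEdges. */
  def run(sc: SparkContext): (Seq[(Long, Long, Int)], Seq[(Long, Long, Int)], Seq[(Long, Long, Int)]) = {
    // Hypothetical sample graph with two weighted edges.
    val graph = Graph.fromEdges(sc.parallelize(Seq(Edge(1L, 2L, 10), Edge(2L, 3L, 20))), 0)

    // Direct EdgeRDD operations: the structure is kept, only the attribute values change.
    val doubled: EdgeRDD[Int] = graph.edges.mapValues(e => e.attr * 2)
    val reversed: EdgeRDD[Int] = graph.edges.reverse

    // The more common route: the equivalent graph operator.
    val viaGraph: EdgeRDD[Int] = graph.mapEdges(e => e.attr * 2).edges

    (triples(doubled), triples(reversed), triples(viaGraph))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("EdgeRddSketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```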