Spark Graphx编程指南
1.GraphX提供了几种方式从RDD或者磁盘上的顶点和边集合构造图?
2.PageRank算法在图中发挥什么作用?
3.三角形计数算法的作用是什么?
Spark中文手册-编程指南
Spark之一个快速的例子Spark之基本概念
Spark之基本概念
Spark之基本概念(2)
Spark之基本概念(3)
Spark-sql由入门到精通
Spark-sql由入门到精通续
spark GraphX编程指南(1)
Pregel API
- class GraphOps[VD, ED] {
- def pregel[A]
- (initialMsg: A,
- maxIter: Int = Int.MaxValue,
- activeDir: EdgeDirection = EdgeDirection.Out)
- (vprog: (VertexId, VD, A) => VD,
- sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
- mergeMsg: (A, A) => A)
- : Graph[VD, ED] = {
- // Receive the initial message at each vertex
- var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
- // compute the messages
- var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
- var activeMessages = messages.count()
- // Loop until no messages remain or maxIterations is achieved
- var i = 0
- while (activeMessages > 0 && i < maxIterations) {
- // Receive the messages: -----------------------------------------------------------------------
- // Run the vertex program on all vertices that receive messages
- val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
- // Merge the new vertex values back into the graph
- g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
- // Send Messages: ------------------------------------------------------------------------------
- // Vertices that didn't receive a message above don't appear in newVerts and therefore don't
- // get to send messages. More precisely the map phase of mapReduceTriplets is only invoked
- // on edges in the activeDir of vertices in newVerts
- messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
- activeMessages = messages.count()
- i += 1
- }
- g
- }
- }
复制代码
- import org.apache.spark.graphx._
- // Import random graph generation library
- import org.apache.spark.graphx.util.GraphGenerators
- // A graph with edge attributes containing distances
- val graph: Graph[Int, Double] =
- GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
- val sourceId: VertexId = 42 // The ultimate source
- // Initialize the graph such that all vertices except the root have distance infinity.
- val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
- val sssp = initialGraph.pregel(Double.PositiveInfinity)(
- (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
- triplet => { // Send Message
- if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
- Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
- } else {
- Iterator.empty
- }
- },
- (a,b) => math.min(a,b) // Merge Message
- )
- println(sssp.vertices.collect.mkString("\n"))
复制代码
图构造者
- object GraphLoader {
- def edgeListFile(
- sc: SparkContext,
- path: String,
- canonicalOrientation: Boolean = false,
- minEdgePartitions: Int = 1)
- : Graph[Int, Int]
- }
复制代码
- # This is a comment
- 2 1
- 4 1
- 1 2
复制代码
- object Graph {
- def apply[VD, ED](
- vertices: RDD[(VertexId, VD)],
- edges: RDD[Edge[ED]],
- defaultVertexAttr: VD = null)
- : Graph[VD, ED]
- def fromEdges[VD, ED](
- edges: RDD[Edge[ED]],
- defaultValue: VD): Graph[VD, ED]
- def fromEdgeTuples[VD](
- rawEdges: RDD[(VertexId, VertexId)],
- defaultValue: VD,
- uniqueEdges: Option[PartitionStrategy] = None): Graph[VD, Int]
- }
复制代码
顶点和边RDDs
VertexRDDs
- class VertexRDD[VD] extends RDD[(VertexID, VD)] {
- // Filter the vertex set but preserves the internal index
- def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD]
- // Transform the values without changing the ids (preserves the internal index)
- def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
- def mapValues[VD2](map: (VertexId, VD) => VD2): VertexRDD[VD2]
- // Remove vertices from this set that appear in the other set
- def diff(other: VertexRDD[VD]): VertexRDD[VD]
- // Join operators that take advantage of the internal indexing to accelerate joins (substantially)
- def leftJoin[VD2, VD3](other: RDD[(VertexId, VD2)])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]
- def innerJoin[U, VD2](other: RDD[(VertexId, U)])(f: (VertexId, VD, U) => VD2): VertexRDD[VD2]
- // Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.
- def aggregateUsingIndex[VD2](other: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
- }
复制代码
- val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 100L).map(id => (id, 1)))
- val rddB: RDD[(VertexId, Double)] = sc.parallelize(0L until 100L).flatMap(id => List((id, 1.0), (id, 2.0)))
- // There should be 200 entries in rddB
- rddB.count
- val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)
- // There should be 100 entries in setB
- setB.count
- // Joining A and B should now be fast!
- val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)
复制代码
EdgeRDDs
- // Transform the edge attributes while preserving the structure
- def mapValues[ED2](f: Edge[ED] => ED2): EdgeRDD[ED2]
- // Revere the edges reusing both attributes and structure
- def reverse: EdgeRDD[ED]
- // Join two `EdgeRDD`s partitioned using the same partitioning strategy.
- def innerJoin[ED2, ED3](other: EdgeRDD[ED2])(f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]
复制代码
图算法
PageRank算法
- // Load the edges as a graph
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Run PageRank
- val ranks = graph.pageRank(0.0001).vertices
- // Join the ranks with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val ranksByUsername = users.join(ranks).map {
- case (id, (username, rank)) => (username, rank)
- }
- // Print the result
- println(ranksByUsername.collect().mkString("\n"))
复制代码
连通体算法
- / Load the graph as in the PageRank example
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Find the connected components
- val cc = graph.connectedComponents().vertices
- // Join the connected components with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val ccByUsername = users.join(cc).map {
- case (id, (username, cc)) => (username, cc)
- }
- // Print the result
- println(ccByUsername.collect().mkString("\n"))
复制代码
三角形计数算法
- // Load the edges in canonical order and partition the graph for triangle count
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(PartitionStrategy.RandomVertexCut)
- // Find the triangle count for each vertex
- val triCounts = graph.triangleCount().vertices
- // Join the triangle counts with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
- (username, tc)
- }
- // Print the result
- println(triCountByUsername.collect().mkString("\n"))
复制代码
例子
- // Connect to the Spark cluster
- val sc = new SparkContext("spark://master.amplab.org", "research")
- // Load my user data and parse into tuples of user id and attribute list
- val users = (sc.textFile("graphx/data/users.txt")
- .map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))
- // Parse the edge data which is already in userId -> userId format
- val followerGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Attach the user attributes
- val graph = followerGraph.outerJoinVertices(users) {
- case (uid, deg, Some(attrList)) => attrList
- // Some users may not have attributes so we set them as empty
- case (uid, deg, None) => Array.empty[String]
- }
- // Restrict the graph to users with usernames and names
- val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)
- // Compute the PageRank
- val pagerankGraph = subgraph.pageRank(0.001)
- // Get the attributes of the top pagerank users
- val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
- case (uid, attrList, Some(pr)) => (pr, attrList.toList)
- case (uid, attrList, None) => (0.0, attrList.toList)
- }
- println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))
复制代码
相关内容:
Spark中文手册1-编程指南
http://www.aboutyun.com/thread-11413-1-1.html
Spark中文手册2:Spark之一个快速的例子
http://www.aboutyun.com/thread-11484-1-1.html
Spark中文手册3:Spark之基本概念
http://www.aboutyun.com/thread-11502-1-1.html
Spark中文手册4:Spark之基本概念(2)
http://www.aboutyun.com/thread-11516-1-1.html
Spark中文手册5:Spark之基本概念(3)
http://www.aboutyun.com/thread-11535-1-1.html
Spark中文手册6:Spark-sql由入门到精通
http://www.aboutyun.com/thread-11562-1-1.html
Spark中文手册7:Spark-sql由入门到精通【续】
http://www.aboutyun.com/thread-11575-1-1.html
Spark中文手册8:spark GraphX编程指南(1)
http://www.aboutyun.com/thread-11589-1-1.html
Spark中文手册10:spark部署:提交应用程序及独立部署模式
http://www.aboutyun.com/thread-11615-1-1.html
Spark中文手册11:Spark 配置指南
http://www.aboutyun.com/thread-10652-1-1.html
Spark Graphx编程指南的更多相关文章
- Spark—GraphX编程指南
Spark系列面试题 Spark面试题(一) Spark面试题(二) Spark面试题(三) Spark面试题(四) Spark面试题(五)--数据倾斜调优 Spark面试题(六)--Spark资源调 ...
- GraphX编程指南
GraphX编程指南 概述 入门 属性图 属性图示例 图算子 算子摘要列表 属性算子 结构化算子 Join算子 最近邻聚集 汇总消息(aggregateMessages) Map Reduce三元 ...
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...
- <译>Spark Sreaming 编程指南
Spark Streaming 编程指南 Overview A Quick Example Basic Concepts Linking Initializing StreamingContext D ...
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...
- Spark Streaming编程指南
Overview A Quick Example Basic Concepts Linking Initializing StreamingContext Discretized Streams (D ...
- Spark SQL编程指南(Python)
前言 Spark SQL允许我们在Spark环境中使用SQL或者Hive SQL执行关系型查询.它的核心是一个特殊类型的Spark RDD:SchemaRDD. SchemaRDD类似于传统关 ...
- Spark SQL编程指南(Python)【转】
转自:http://www.cnblogs.com/yurunmiao/p/4685310.html 前言 Spark SQL允许我们在Spark环境中使用SQL或者Hive SQL执行关系型查询 ...
- Spark官方3 ---------Spark Streaming编程指南(1.5.0)
Design Patterns for using foreachRDD dstream.foreachRDD是一个强大的原语,允许将数据发送到外部系统.然而,了解如何正确有效地使用该原语很重要.避免 ...
随机推荐
- 枚举for/in
for/in循环可以遍历对象中所有可以枚举的属性(包括自有属性和继承属性).对象继承的内置方法不能枚举,凡是在代码中给对象自己或者继承的类添加的属性方法都是可枚举的,但是对象自有的内置属性可不可以枚举 ...
- 仿javascript中confirm()方法的小插件
10天没有写博客了,不知道为什么,心里感觉挺不舒服的,可能这是自己给自己规定要去完成的事情,没有按照计划执行,总会心里不怎么舒服.最近事情挺多的,终于今天抽空来更新一下博客了. 今天写的是一个小插件. ...
- 自动生成api文档
vs2010代码注释自动生成api文档 最近做了一些接口,提供其他人调用,要写个api文档,可是我想代码注释已经写了说明,能不能直接把代码注释生成api?于是找到以下方法 环境:vs2010 先下载安 ...
- 手机网站keyup解决方案
模糊搜索keyup无效,解决方案如下 //手机网站解决keyup的方法 $(function () { $('#repairsearch').bind('focus', filter_time); } ...
- 设置checkbox为只读(readOnly)
方式一: checkbox没有readOnly属性,如果使用disabled=“disabled”属性的话,会让checkbox变成灰色的,用户很反感这种样式可以这样让它保持只读:设置它的onclic ...
- NLP startup material
The Association for Computational Linguistics(ACL,URL:http://aclweb.org/) Computational Linguistics( ...
- PHP中的赋值-引用or传值?
直接上代码: <?php $num1 = 1; $num2 = $num1; $num1 = 2; echo $num2 . "\n"; $arr1 = array(1, 2 ...
- 【C#】调用DOS命令
public interface IRunConsole { void Run(); } public abstract class RunConsole:IRunConsole { public a ...
- tomcat配置数据池
1->配置servlet.xml 在 <GlobalNamingResources></GlobalNamingResources>中添加<Resource> ...
- sql 数据库还原脚本 (kill链接+独占
在开发过程中经常会碰到数据库还原,要是sql 连接没完全释放掉,那么还原就会受到阻碍.此脚本就是为了解决这个问题. USE [master] GO /****** Object: StoredProc ...