Spark GraphX Graph Computation Core Source Code Analysis [Graph Builders, Vertices, Edges]

1. Graph Builders

  GraphX provides several ways of building a graph from a collection of vertices and edges in an RDD or on disk. By default, none of the graph builders repartitions the graph's edges; instead, edges are left in their default partitions. Graph.groupEdges requires the graph to be repartitioned because it assumes that identical edges will be colocated on the same partition, so you must call Graph.partitionBy before calling groupEdges.
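
  As a minimal sketch of that ordering (the value `graph` is a hypothetical, pre-built Graph[Int, Int], and EdgePartition2D is just one of the available strategies), repartition first so that identical edges are colocated, then merge them:

import org.apache.spark.graphx._

// Hypothetical input: any Graph[Int, Int] built earlier in the session.
val merged: Graph[Int, Int] = graph
  .partitionBy(PartitionStrategy.EdgePartition2D) // colocate identical edges first
  .groupEdges((a, b) => a + b)                    // then merge duplicates, summing attributes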

The source code of GraphLoader is as follows:

package org.apache.spark.graphx

import org.apache.spark.SparkContext
import org.apache.spark.graphx.impl.{EdgePartitionBuilder, GraphImpl}
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel

/**
 * Provides utilities for loading [[Graph]]s from files.
 */
object GraphLoader extends Logging {

  /**
   * Loads a graph from an edge list formatted file where each line contains two integers: a source
   * id and a target id. Skips lines that begin with `#`.
   */
  def edgeListFile(
      sc: SparkContext,
      path: String,
      canonicalOrientation: Boolean = false,
      numEdgePartitions: Int = -1,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY, // storage level for caching
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
    : Graph[Int, Int] =
  {
    val startTime = System.currentTimeMillis

    // Parse the edge data table directly into edge partitions
    val lines =
      if (numEdgePartitions > 0) { // load the file with an explicit partition count
        sc.textFile(path, numEdgePartitions).coalesce(numEdgePartitions)
      } else {
        sc.textFile(path)
      }
    // Build the graph partition by partition
    val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
      val builder = new EdgePartitionBuilder[Int, Int]
      iter.foreach { line =>
        if (!line.isEmpty && line(0) != '#') { // skip comment lines
          val lineArray = line.split("\\s+")
          if (lineArray.length < 2) { // reject malformed lines
            throw new IllegalArgumentException("Invalid line: " + line)
          }
          val srcId = lineArray(0).toLong
          val dstId = lineArray(1).toLong
          if (canonicalOrientation && srcId > dstId) {
            builder.add(dstId, srcId, 1) // add each edge with attribute 1
          } else {
            builder.add(srcId, dstId, 1)
          }
        }
      }
      Iterator((pid, builder.toEdgePartition))
    }.persist(edgeStorageLevel).setName("GraphLoader.edgeListFile - edges (%s)".format(path))
    edges.count() // force evaluation

    logInfo("It took %d ms to load the edges".format(System.currentTimeMillis - startTime))

    GraphImpl.fromEdgePartitions(edges, defaultVertexAttr = 1, edgeStorageLevel = edgeStorageLevel,
      vertexStorageLevel = vertexStorageLevel)
  } // end of edgeListFile

}

Source code analysis:

  GraphLoader.edgeListFile loads graph data from a filesystem such as local disk or HDFS, parsing an adjacency list of (source vertex ID, target vertex ID) pairs and skipping comment lines. It creates a Graph from the specified edges, automatically creating any vertices adjacent to those edges. All vertex and edge attributes default to 1. The canonicalOrientation argument allows re-orienting edges in the positive direction (srcId < dstId), which is required by the connected components algorithm.
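
  A short usage sketch (the path and the SparkContext `sc` are placeholders for illustration):

import org.apache.spark.graphx.GraphLoader

// Each line of the file holds "srcId dstId"; lines starting with '#' are skipped.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
  canonicalOrientation = true, numEdgePartitions = 4)
println(s"vertices = ${graph.vertices.count()}, edges = ${graph.edges.count()}")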

The source code of the Graph builder object is as follows:

/**
 * The Graph object contains a collection of routines used to construct graphs from RDDs.
 */
object Graph {

  /**
   * Construct a graph from a collection of edges encoded as vertex id pairs.
   *
   * @param rawEdges a collection of edges in (src, dst) form
   * @param defaultValue the vertex attributes with which to create vertices referenced by the edges
   * @param uniqueEdges if multiple identical edges are found they are combined and the edge
   * attribute is set to the sum.  Otherwise duplicate edges are treated as separate. To enable
   * `uniqueEdges`, a [[PartitionStrategy]] must be provided.
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   *
   * @return a graph with edge attributes containing either the count of duplicate edges or 1
   * (if `uniqueEdges` is `None`) and vertex attributes containing the total degree of each vertex.
   */
  def fromEdgeTuples[VD: ClassTag](
      rawEdges: RDD[(VertexId, VertexId)],
      defaultValue: VD,
      uniqueEdges: Option[PartitionStrategy] = None,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int] =
  {
    val edges = rawEdges.map(p => Edge(p._1, p._2, 1))
    val graph = GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
    uniqueEdges match {
      case Some(p) => graph.partitionBy(p).groupEdges((a, b) => a + b)
      case None => graph
    }
  }

  /**
   * Construct a graph from a collection of edges.
   *
   * @param edges the RDD containing the set of edges in the graph
   * @param defaultValue the default vertex attribute to use for each vertex
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   *
   * @return a graph with edge attributes described by `edges` and vertices
   *         given by all vertices in `edges` with value `defaultValue`
   */
  def fromEdges[VD: ClassTag, ED: ClassTag](
      edges: RDD[Edge[ED]],
      defaultValue: VD,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
  }

  /**
   * Construct a graph from a collection of vertices and
   * edges with attributes.  Duplicate vertices are picked arbitrarily and
   * vertices found in the edge collection but not in the input
   * vertices are assigned the default attribute.
   *
   * @tparam VD the vertex attribute type
   * @tparam ED the edge attribute type
   * @param vertices the "set" of vertices and their attributes
   * @param edges the collection of edges in the graph
   * @param defaultVertexAttr the default vertex attribute to use for vertices that are
   *                          mentioned in edges but not in vertices
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   */
  def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)
  }

  /**
   * Implicitly extracts the [[GraphOps]] member from a graph.
   *
   * To improve modularity the Graph type only contains a small set of basic operations.
   * All the convenience operations are defined in the [[GraphOps]] class which may be
   * shared across multiple graph implementations.
   */
  implicit def graphToGraphOps[VD: ClassTag, ED: ClassTag]
      (g: Graph[VD, ED]): GraphOps[VD, ED] = g.ops
}

Source code analysis:

  Graph.apply allows creating a graph from RDDs of vertices and edges. Duplicate vertices are picked arbitrarily, and vertices found in the edge RDD but not in the vertex RDD are assigned the default attribute.
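
  For example (a hypothetical two-user graph; `sc` is an existing SparkContext):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
// Vertex 3L is mentioned only by an edge, so it receives the default attribute "unknown".
val graph: Graph[String, String] = Graph(users, follows, defaultVertexAttr = "unknown")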

  Graph.fromEdges allows creating a graph from only an RDD of edges. Vertices not supplied explicitly are derived from the edge data and assigned the default attribute.
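
  A minimal sketch under the same assumptions (reusing the imports and `sc` above):

val edges: RDD[Edge[Double]] =
  sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.5)))
// All three vertices are created implicitly, each with the default attribute 1.
val g: Graph[Int, Double] = Graph.fromEdges(edges, defaultValue = 1)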

  Graph.fromEdgeTuples allows creating a graph from only an RDD of edge tuples. It assigns the edges the value 1, automatically creating the Edge objects and any referenced vertices with the default value. It also supports deduplicating edges; to do so, a PartitionStrategy must be passed as the uniqueEdges parameter (for example, uniqueEdges = Some(PartitionStrategy.RandomVertexCut)). A partition strategy is necessary so that identical edges are colocated on the same partition, where they can be deduplicated. An example follows.
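
  With deduplication enabled (again reusing the imports and `sc` above; RandomVertexCut is one possible strategy):

val rawEdges: RDD[(VertexId, VertexId)] =
  sc.parallelize(Seq((1L, 2L), (1L, 2L), (2L, 3L)))
// The two (1L, 2L) tuples collapse into a single edge whose attribute becomes 2.
val g: Graph[Int, Int] = Graph.fromEdgeTuples(rawEdges, defaultValue = 0,
  uniqueEdges = Some(PartitionStrategy.RandomVertexCut))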

2. VertexRDD

  VertexRDD[A] extends RDD[(VertexId, A)] with the additional constraint that each VertexId occurs only once; it represents a set of vertices, each with an attribute of type A. Internally, this is achieved by storing the vertex attributes in a reusable hash-map data structure. As a consequence, if two VertexRDDs are derived from the same base VertexRDD, they can be joined in constant time without hash evaluations.
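
  A brief sketch of that constant-time join (assuming some `graph: Graph[Int, Int]` is in scope): mapValues preserves the index, so the derived VertexRDD can be zip-joined with the original without hashing or shuffling:

val verts: VertexRDD[Int] = graph.vertices
val doubled: VertexRDD[Int] = verts.mapValues(_ * 2) // same index, new values
// Both sides share one index, so this is a linear scan rather than a hash join.
val sums: VertexRDD[Int] = verts.innerZipJoin(doubled)((id, a, b) => a + b)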

The source code is as follows:

/**
 * @tparam VD the vertex attribute associated with each vertex in the set.
 */
abstract class VertexRDD[VD](
    sc: SparkContext,
    deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps) {

  implicit protected def vdTag: ClassTag[VD]

  private[graphx] def partitionsRDD: RDD[ShippableVertexPartition[VD]]

  override protected def getPartitions: Array[Partition] = partitionsRDD.partitions

  /**
   * Provides the `RDD[(VertexId, VD)]` equivalent output.
   */
  override def compute(part: Partition, context: TaskContext): Iterator[(VertexId, VD)] = {
    firstParent[ShippableVertexPartition[VD]].iterator(part, context).next().iterator
  }

  /**
   * Construct a new VertexRDD that is indexed by only the visible vertices. The resulting
   * VertexRDD will be based on a different index and can no longer be quickly joined with this
   * RDD.
   */
  def reindex(): VertexRDD[VD]

  /**
   * Applies a function to each `VertexPartition` of this RDD and returns a new VertexRDD.
   */
  private[graphx] def mapVertexPartitions[VD2: ClassTag](
      f: ShippableVertexPartition[VD] => ShippableVertexPartition[VD2])
    : VertexRDD[VD2]

  /**
   * Restricts the vertex set to the set of vertices satisfying the given predicate. This operation
   * preserves the index for efficient joins with the original RDD, and it sets bits in the bitmask
   * rather than allocating new memory.
   *
   * It is declared and defined here to allow refining the return type from `RDD[(VertexId, VD)]` to
   * `VertexRDD[VD]`.
   *
   * @param pred the user defined predicate, which takes a tuple to conform to the
   * `RDD[(VertexId, VD)]` interface
   */
  override def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD] =
    this.mapVertexPartitions(_.filter(Function.untupled(pred)))

  /**
   * Maps each vertex attribute, preserving the index.
   *
   * @tparam VD2 the type returned by the map function
   *
   * @param f the function applied to each value in the RDD
   * @return a new VertexRDD with values obtained by applying `f` to each of the entries in the
   * original VertexRDD
   */
  def mapValues[VD2: ClassTag](f: VD => VD2): VertexRDD[VD2]

  /**
   * Maps each vertex attribute, additionally supplying the vertex ID.
   *
   * @tparam VD2 the type returned by the map function
   *
   * @param f the function applied to each ID-value pair in the RDD
   * @return a new VertexRDD with values obtained by applying `f` to each of the entries in the
   * original VertexRDD.  The resulting VertexRDD retains the same index.
   */
  def mapValues[VD2: ClassTag](f: (VertexId, VD) => VD2): VertexRDD[VD2]

  /**
   * For each VertexId present in both `this` and `other`, minus will act as a set difference
   * operation returning only those unique VertexId's present in `this`.
   *
   * @param other an RDD to run the set operation against
   */
  def minus(other: RDD[(VertexId, VD)]): VertexRDD[VD]

  /**
   * For each VertexId present in both `this` and `other`, minus will act as a set difference
   * operation returning only those unique VertexId's present in `this`.
   *
   * @param other a VertexRDD to run the set operation against
   */
  def minus(other: VertexRDD[VD]): VertexRDD[VD]

  /**
   * For each vertex present in both `this` and `other`, `diff` returns only those vertices with
   * differing values; for values that are different, keeps the values from `other`. This is
   * only guaranteed to work if the VertexRDDs share a common ancestor.
   *
   * @param other the other RDD[(VertexId, VD)] with which to diff against.
   */
  def diff(other: RDD[(VertexId, VD)]): VertexRDD[VD]

  /**
   * For each vertex present in both `this` and `other`, `diff` returns only those vertices with
   * differing values; for values that are different, keeps the values from `other`. This is
   * only guaranteed to work if the VertexRDDs share a common ancestor.
   *
   * @param other the other VertexRDD with which to diff against.
   */
  def diff(other: VertexRDD[VD]): VertexRDD[VD]

  /**
   * Left joins this RDD with another VertexRDD with the same index. This function will fail if
   * both VertexRDDs do not share the same index. The resulting vertex set contains an entry for
   * each vertex in `this`.
   * If `other` is missing any vertex in this VertexRDD, `f` is passed `None`.
   *
   * @tparam VD2 the attribute type of the other VertexRDD
   * @tparam VD3 the attribute type of the resulting VertexRDD
   *
   * @param other the other VertexRDD with which to join.
   * @param f the function mapping a vertex id and its attributes in this and the other vertex set
   * to a new vertex attribute.
   * @return a VertexRDD containing the results of `f`
   */
  def leftZipJoin[VD2: ClassTag, VD3: ClassTag]
      (other: VertexRDD[VD2])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]

  /**
   * Left joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
   * backed by a VertexRDD with the same index then the efficient [[leftZipJoin]] implementation is
   * used. The resulting VertexRDD contains an entry for each vertex in `this`. If `other` is
   * missing any vertex in this VertexRDD, `f` is passed `None`. If there are duplicates,
   * the vertex is picked arbitrarily.
   *
   * @tparam VD2 the attribute type of the other VertexRDD
   * @tparam VD3 the attribute type of the resulting VertexRDD
   *
   * @param other the other VertexRDD with which to join
   * @param f the function mapping a vertex id and its attributes in this and the other vertex set
   * to a new vertex attribute.
   * @return a VertexRDD containing all the vertices in this VertexRDD with the attributes emitted
   * by `f`.
   */
  def leftJoin[VD2: ClassTag, VD3: ClassTag]
      (other: RDD[(VertexId, VD2)])
      (f: (VertexId, VD, Option[VD2]) => VD3)
    : VertexRDD[VD3]

  /**
   * Efficiently inner joins this VertexRDD with another VertexRDD sharing the same index. See
   * [[innerJoin]] for the behavior of the join.
   */
  def innerZipJoin[U: ClassTag, VD2: ClassTag](other: VertexRDD[U])
      (f: (VertexId, VD, U) => VD2): VertexRDD[VD2]

  /**
   * Inner joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
   * backed by a VertexRDD with the same index then the efficient [[innerZipJoin]] implementation
   * is used.
   *
   * @param other an RDD containing vertices to join. If there are multiple entries for the same
   * vertex, one is picked arbitrarily. Use [[aggregateUsingIndex]] to merge multiple entries.
   * @param f the join function applied to corresponding values of `this` and `other`
   * @return a VertexRDD co-indexed with `this`, containing only vertices that appear in both
   *         `this` and `other`, with values supplied by `f`
   */
  def innerJoin[U: ClassTag, VD2: ClassTag](other: RDD[(VertexId, U)])
      (f: (VertexId, VD, U) => VD2): VertexRDD[VD2]

  /**
   * Aggregates vertices in `messages` that have the same ids using `reduceFunc`, returning a
   * VertexRDD co-indexed with `this`.
   *
   * @param messages an RDD containing messages to aggregate, where each message is a pair of its
   * target vertex ID and the message data
   * @param reduceFunc the associative aggregation function for merging messages to the same vertex
   * @return a VertexRDD co-indexed with `this`, containing only vertices that received messages.
   * For those vertices, their values are the result of applying `reduceFunc` to all received
   * messages.
   */
  def aggregateUsingIndex[VD2: ClassTag](
      messages: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]

  /**
   * Returns a new `VertexRDD` reflecting a reversal of all edge directions in the corresponding
   * [[EdgeRDD]].
   */
  def reverseRoutingTables(): VertexRDD[VD]

  /** Prepares this VertexRDD for efficient joins with the given EdgeRDD. */
  def withEdges(edges: EdgeRDD[_]): VertexRDD[VD]

  /** Replaces the vertex partitions while preserving all other properties of the VertexRDD. */
  private[graphx] def withPartitionsRDD[VD2: ClassTag](
      partitionsRDD: RDD[ShippableVertexPartition[VD2]]): VertexRDD[VD2]

  /**
   * Changes the target storage level while preserving all other properties of the
   * VertexRDD. Operations on the returned VertexRDD will preserve this storage level.
   *
   * This does not actually trigger a cache; to do this, call
   * [[org.apache.spark.graphx.VertexRDD#cache]] on the returned VertexRDD.
   */
  private[graphx] def withTargetStorageLevel(
      targetStorageLevel: StorageLevel): VertexRDD[VD]

  /** Generates an RDD of vertex attributes suitable for shipping to the edge partitions. */
  private[graphx] def shipVertexAttributes(
      shipSrc: Boolean, shipDst: Boolean): RDD[(PartitionID, VertexAttributeBlock[VD])]

  /** Generates an RDD of vertex IDs suitable for shipping to the edge partitions. */
  private[graphx] def shipVertexIds(): RDD[(PartitionID, Array[VertexId])]
}

Source code analysis:

  Basic operations such as filter, leftJoin, and innerJoin behave much like their Spark SQL counterparts; the usage is the same, only the shape of the data differs. Beyond those, VertexRDD-specific operators such as aggregateUsingIndex can build a new VertexRDD efficiently. Conceptually, if we have constructed a VertexRDD[B] over a set of vertices that is a superset of the vertices in some RDD[(VertexId, A)], we can reuse the index of the former to aggregate the latter, which is dramatically more efficient than building a new index from scratch.
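
  A sketch along the lines of the official GraphX programming guide (`sc` is an existing SparkContext):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val setA: VertexRDD[Int] =
  VertexRDD(sc.parallelize((0L until 100L).map(id => (id, 1))))
val rddB: RDD[(VertexId, Double)] =
  sc.parallelize((0L until 100L).flatMap(id => List((id, 1.0), (id, 2.0))))
// rddB carries two messages per vertex; aggregateUsingIndex reuses setA's index to sum them,
// which is far cheaper than building a new VertexRDD (and its index) from scratch.
val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)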

3. EdgeRDD

  EdgeRDD[ED], which extends RDD[Edge[ED]], organizes the edges in blocks partitioned using one of the partitioning strategies defined in PartitionStrategy. Within each partition, edge attributes and adjacency structure are stored separately, enabling maximum reuse when attribute values change.

The source code is as follows:

abstract class EdgeRDD[ED](
    sc: SparkContext,
    deps: Seq[Dependency[_]]) extends RDD[Edge[ED]](sc, deps) {

  // scalastyle:off structural.type
  private[graphx] def partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])] forSome { type VD }
  // scalastyle:on structural.type

  override protected def getPartitions: Array[Partition] = partitionsRDD.partitions

  override def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = {
    val p = firstParent[(PartitionID, EdgePartition[ED, _])].iterator(part, context)
    if (p.hasNext) {
      p.next()._2.iterator.map(_.copy())
    } else {
      Iterator.empty
    }
  }

  /**
   * Map the values in an edge partitioning preserving the structure but changing the values.
   *
   * @tparam ED2 the new edge value type
   * @param f the function from an edge to a new edge value
   * @return a new EdgeRDD containing the new edge values
   */
  def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): EdgeRDD[ED2]

  /**
   * Reverse all the edges in this RDD.
   *
   * @return a new EdgeRDD containing all the edges reversed
   */
  def reverse: EdgeRDD[ED]

  /**
   * Inner joins this EdgeRDD with another EdgeRDD, assuming both are partitioned using the same
   * [[PartitionStrategy]].
   *
   * @param other the EdgeRDD to join with
   * @param f the join function applied to corresponding values of `this` and `other`
   * @return a new EdgeRDD containing only edges that appear in both `this` and `other`,
   *         with values supplied by `f`
   */
  def innerJoin[ED2: ClassTag, ED3: ClassTag]
      (other: EdgeRDD[ED2])
      (f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]

  /**
   * Changes the target storage level while preserving all other properties of the
   * EdgeRDD. Operations on the returned EdgeRDD will preserve this storage level.
   *
   * This does not actually trigger a cache; to do this, call
   * [[org.apache.spark.graphx.EdgeRDD#cache]] on the returned EdgeRDD.
   */
  private[graphx] def withTargetStorageLevel(targetStorageLevel: StorageLevel): EdgeRDD[ED]
}

Source code analysis:

  An EdgeRDD is rarely used on its own; operations on an EdgeRDD are generally performed through the graph operators, or rely on the operations defined in the base RDD class.
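
  For instance, rather than manipulating graph.edges directly, one would typically transform edge attributes through a Graph operator such as mapEdges (a minimal sketch, assuming `graph: Graph[Int, Int]` is in scope):

// Transform every edge attribute via the graph operator rather than the EdgeRDD itself.
val weighted: Graph[Int, Double] = graph.mapEdges(e => e.attr * 2.0)
// The underlying EdgeRDD remains reachable when needed, e.g. for a count:
println(s"edge count = ${weighted.edges.count()}")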

posted @ 2019-11-08 20:40  云山之巅