Spark GraphX Core Source Code Analysis: Graph Builders, Vertices, and Edges
I. Graph Builders
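GraphX provides several ways to build a graph from a collection of vertices and edges held in an RDD or stored on disk. None of the graph builders repartitions the graph's edges by default; the edges stay in their default partitions. Graph.groupEdges requires the graph to be repartitioned, because it assumes that identical edges are colocated on the same partition, so Graph.partitionBy must be called before groupEdges.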
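The ordering constraint between partitionBy and groupEdges can be sketched as follows. This is a minimal example, assuming Spark with GraphX on the classpath and a local SparkContext; the object name GroupEdgesSketch, the helper mergedEdges, and the three sample edges are hypothetical and do not come from the GraphX source.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

// Hypothetical helper object, for illustration only.
object GroupEdgesSketch {

  // Returns the merged edges as (src, dst, attr) triples, sorted for stable comparison.
  def mergedEdges(sc: SparkContext): Seq[(Long, Long, Int)] = {
    // Two parallel edges 1 -> 2 and one edge 2 -> 3, spread over three input partitions.
    val edges = sc.parallelize(
      Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 2L, 1)), numSlices = 3)
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    graph
      .partitionBy(PartitionStrategy.EdgePartition2D) // colocate identical edges first
      .groupEdges(_ + _)                               // then merge them, summing attributes
      .edges
      .map(e => (e.srcId, e.dstId, e.attr))
      .collect()
      .toSeq
      .sorted
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("GroupEdgesSketch"))
    try mergedEdges(sc).foreach(println) finally sc.stop()
  }
}
```

Skipping the partitionBy call would leave the two 1 -> 2 edges wherever the input placed them, so groupEdges would only merge them if they happened to share a partition.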
The source code is as follows:
package org.apache.spark.graphx

import org.apache.spark.SparkContext
import org.apache.spark.graphx.impl.{EdgePartitionBuilder, GraphImpl}
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel

/**
 * Provides utilities for loading [[Graph]]s from files.
 */
object GraphLoader extends Logging {

  /**
   * Loads a graph from an edge list formatted file where each line contains two integers: a source
   * id and a target id. Skips lines that begin with `#`.
   */
  def edgeListFile(
      sc: SparkContext,
      path: String,
      canonicalOrientation: Boolean = false,
      numEdgePartitions: Int = -1,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY, // storage level for caching
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
    : Graph[Int, Int] =
  {
    val startTime = System.currentTimeMillis

    // Parse the edge data table directly into edge partitions
    val lines =
      if (numEdgePartitions > 0) { // load the file data
        sc.textFile(path, numEdgePartitions).coalesce(numEdgePartitions)
      } else {
        sc.textFile(path)
      }
    // build the graph partition by partition
    val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
      val builder = new EdgePartitionBuilder[Int, Int]
      iter.foreach { line =>
        if (!line.isEmpty && line(0) != '#') { // skip comment lines
          val lineArray = line.split("\\s+")
          if (lineArray.length < 2) { // reject malformed lines
            throw new IllegalArgumentException("Invalid line: " + line)
          }
          val srcId = lineArray(0).toLong
          val dstId = lineArray(1).toLong
          if (canonicalOrientation && srcId > dstId) {
            builder.add(dstId, srcId, 1) // add edges one by one, each with weight 1
          } else {
            builder.add(srcId, dstId, 1)
          }
        }
      }
      Iterator((pid, builder.toEdgePartition))
    }.persist(edgeStorageLevel).setName("GraphLoader.edgeListFile - edges (%s)".format(path))
    edges.count() // force evaluation

    logInfo("It took %d ms to load the edges".format(System.currentTimeMillis - startTime))

    GraphImpl.fromEdgePartitions(edges, defaultVertexAttr = 1, edgeStorageLevel = edgeStorageLevel,
      vertexStorageLevel = vertexStorageLevel)
  } // end of edgeListFile

}
Source analysis:
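GraphLoader.edgeListFile loads a graph from an edge list file on local disk or on a file system such as HDFS. It parses each line into a (source vertex ID, destination vertex ID) pair and skips comment lines that begin with #. The graph is built from the listed edges, and every vertex mentioned by an edge is created automatically. All vertex and edge attributes default to 1. The canonicalOrientation parameter reorients edges in the positive direction (srcId < dstId), which the connected components algorithm requires.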
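A short usage sketch of the loader, assuming a local SparkContext. The temporary file, its two sample edges, and the names EdgeListFileSketch and loadCanonical are hypothetical; only GraphLoader.edgeListFile and its canonicalOrientation parameter come from the source above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Hypothetical helper object, for illustration only.
object EdgeListFileSketch {

  // Loads a tiny edge list with canonicalOrientation = true and returns
  // the edges as (src, dst, attr) triples, sorted for stable comparison.
  def loadCanonical(sc: SparkContext): Seq[(Long, Long, Int)] = {
    // Sample data: one comment line, one edge written "backwards" (2 1), one forward edge.
    val path = Files.createTempFile("edges", ".txt")
    Files.write(path, "# src dst\n2 1\n1 3\n".getBytes(StandardCharsets.UTF_8))

    val graph = GraphLoader.edgeListFile(
      sc, path.toUri.toString, canonicalOrientation = true)

    graph.edges.map(e => (e.srcId, e.dstId, e.attr)).collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("EdgeListFileSketch"))
    try loadCanonical(sc).foreach(println) finally sc.stop()
  }
}
```

With canonicalOrientation left at its default of false, the first edge would be kept as 2 -> 1 instead of being flipped to 1 -> 2.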
The source code is as follows:
/**
 * The Graph object contains a collection of routines used to construct graphs from RDDs.
 */
object Graph {

  /**
   * Construct a graph from a collection of edges encoded as vertex id pairs.
   *
   * @param rawEdges a collection of edges in (src, dst) form
   * @param defaultValue the vertex attributes with which to create vertices referenced by the edges
   * @param uniqueEdges if multiple identical edges are found they are combined and the edge
   * attribute is set to the sum. Otherwise duplicate edges are treated as separate. To enable
   * `uniqueEdges`, a [[PartitionStrategy]] must be provided.
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   *
   * @return a graph with edge attributes containing either the count of duplicate edges or 1
   * (if `uniqueEdges` is `None`) and vertex attributes containing the total degree of each vertex.
   */
  def fromEdgeTuples[VD: ClassTag](
      rawEdges: RDD[(VertexId, VertexId)],
      defaultValue: VD,
      uniqueEdges: Option[PartitionStrategy] = None,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int] =
  {
    val edges = rawEdges.map(p => Edge(p._1, p._2, 1))
    val graph = GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
    uniqueEdges match {
      case Some(p) => graph.partitionBy(p).groupEdges((a, b) => a + b)
      case None => graph
    }
  }

  /**
   * Construct a graph from a collection of edges.
   *
   * @param edges the RDD containing the set of edges in the graph
   * @param defaultValue the default vertex attribute to use for each vertex
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   *
   * @return a graph with edge attributes described by `edges` and vertices
   *         given by all vertices in `edges` with value `defaultValue`
   */
  def fromEdges[VD: ClassTag, ED: ClassTag](
      edges: RDD[Edge[ED]],
      defaultValue: VD,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
  }

  /**
   * Construct a graph from a collection of vertices and
   * edges with attributes. Duplicate vertices are picked arbitrarily and
   * vertices found in the edge collection but not in the input
   * vertices are assigned the default attribute.
   *
   * @tparam VD the vertex attribute type
   * @tparam ED the edge attribute type
   * @param vertices the "set" of vertices and their attributes
   * @param edges the collection of edges in the graph
   * @param defaultVertexAttr the default vertex attribute to use for vertices that are
   *                          mentioned in edges but not in vertices
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   */
  def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)
  }

  /**
   * Implicitly extracts the [[GraphOps]] member from a graph.
   *
   * To improve modularity the Graph type only contains a small set of basic operations.
   * All the convenience operations are defined in the [[GraphOps]] class which may be
   * shared across multiple graph implementations.
   */
  implicit def graphToGraphOps[VD: ClassTag, ED: ClassTag]
      (g: Graph[VD, ED]): GraphOps[VD, ED] = g.ops
}
Source analysis:
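Graph.apply creates a graph from an RDD of vertices and an RDD of edges. When a vertex appears more than once, one of the duplicates is picked arbitrarily; vertices that appear in the edge RDD but not in the vertex RDD are assigned the default attribute.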
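The default-attribute behavior of Graph.apply can be illustrated with a small sketch, assuming a local SparkContext. The names GraphApplySketch and vertexAttrs and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical helper object, for illustration only.
object GraphApplySketch {

  // Builds a graph whose edge list mentions vertex 3, which is absent from the vertex RDD.
  def vertexAttrs(sc: SparkContext): Map[Long, String] = {
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    // Vertex 3 only appears in an edge, so it receives defaultVertexAttr.
    val graph = Graph(vertices, edges, defaultVertexAttr = "unknown")
    graph.vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("GraphApplySketch"))
    try println(vertexAttrs(sc)) finally sc.stop()
  }
}
```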
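Graph.fromEdges creates a graph from an RDD of edges alone. The vertices are derived from the edges: every vertex mentioned by an edge is created automatically and assigned the default value.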
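A brief sketch of Graph.fromEdges, assuming a local SparkContext; the names FromEdgesSketch and derivedVertices and the sample edges are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical helper object, for illustration only.
object FromEdgesSketch {

  // Returns the vertices that GraphX derives from the edge list alone.
  def derivedVertices(sc: SparkContext): Map[Long, String] = {
    val edges = sc.parallelize(Seq(Edge(10L, 20L, 0.5), Edge(20L, 30L, 1.5)))

    // No vertex RDD is supplied; vertices 10, 20 and 30 all get the default value.
    val graph = Graph.fromEdges(edges, defaultValue = "n/a")
    graph.vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("FromEdgesSketch"))
    try println(derivedVertices(sc)) finally sc.stop()
  }
}
```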
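Graph.fromEdgeTuples creates a graph from an RDD of (source, destination) tuples alone. Each edge is given the attribute 1, and every vertex mentioned by an edge is created automatically and assigned the default value. It also supports deduplicating edges: pass a PartitionStrategy wrapped in Some as the uniqueEdges parameter (for example, uniqueEdges = Some(PartitionStrategy.RandomVertexCut)). A partition strategy is required so that identical edges land on the same partition, where they are merged and their attributes summed.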
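The effect of uniqueEdges can be compared side by side in a small sketch, assuming a local SparkContext. The names FromEdgeTuplesSketch and compare and the sample tuples are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, PartitionStrategy}

// Hypothetical helper object, for illustration only.
object FromEdgeTuplesSketch {

  // Returns (edge count without deduplication, merged edges with deduplication).
  def compare(sc: SparkContext): (Long, Seq[(Long, Long, Int)]) = {
    // The tuple (1, 2) appears twice.
    val raw = sc.parallelize(Seq((1L, 2L), (1L, 2L), (2L, 3L)))

    // uniqueEdges defaults to None: duplicates stay separate, each with attribute 1.
    val kept = Graph.fromEdgeTuples(raw, defaultValue = 0)

    // With a partition strategy, duplicates are colocated and merged by summing.
    val merged = Graph.fromEdgeTuples(
      raw, defaultValue = 0, uniqueEdges = Some(PartitionStrategy.RandomVertexCut))

    val mergedEdges =
      merged.edges.map(e => (e.srcId, e.dstId, e.attr)).collect().toSeq.sorted
    (kept.edges.count(), mergedEdges)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("FromEdgeTuplesSketch"))
    try println(compare(sc)) finally sc.stop()
  }
}
```

After deduplication the edge attribute doubles as a multiplicity count, since every input edge starts at 1.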
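II. Vertex RDD
VertexRDD[A] extends RDD[(VertexId, A)] and adds the constraint that each VertexId occurs only once. A VertexRDD[A] therefore represents a set of vertices, each carrying an attribute of type A. Internally, the vertex attributes are stored in a reusable hash-map data structure. As a result, two VertexRDDs derived from the same base VertexRDD (for example, by filter or mapValues) can be joined in constant time without any hash evaluations.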
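A minimal sketch of joining two VertexRDDs derived from the same base, assuming a local SparkContext. The names VertexJoinSketch and joinDerived and the five sample vertices are hypothetical; the sketch shows the API shape only and does not measure the performance claim.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.VertexRDD

// Hypothetical helper object, for illustration only.
object VertexJoinSketch {

  // Derives two VertexRDDs from one base and joins them on their shared index.
  def joinDerived(sc: SparkContext): Map[Long, Int] = {
    // Base set: vertices 1..5, each with its own id as the attribute.
    val base: VertexRDD[Int] =
      VertexRDD(sc.parallelize((1L to 5L).map(id => (id, id.toInt))))

    // Both derivations preserve the index of `base`.
    val evens = base.filter { case (_, v) => v % 2 == 0 }
    val squares = base.mapValues((v: Int) => v * v)

    // Only vertices present in both sides survive the inner join.
    val joined = evens.innerJoin(squares) { (_, v, sq) => v + sq }
    joined.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("VertexJoinSketch"))
    try println(joinDerived(sc)) finally sc.stop()
  }
}
```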
The source code is as follows:
/**
 * @tparam VD the vertex attribute associated with each vertex in the set.
 */
abstract class VertexRDD[VD](
    sc: SparkContext,
    deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps) {

  implicit protected def vdTag: ClassTag[VD]

  private[graphx] def partitionsRDD: RDD[ShippableVertexPartition[VD]]

  override protected def getPartitions: Array[Partition] = partitionsRDD.partitions

  /**
   * Provides the `RDD[(VertexId, VD)]` equivalent output.
   */
  override def compute(part: Partition, context: TaskContext): Iterator[(VertexId, VD)] = {
    firstParent[ShippableVertexPartition[VD]].iterator(part, context).next().iterator
  }

  /**
   * Construct a new VertexRDD that is indexed by only the visible vertices. The resulting
   * VertexRDD will be based on a different index and can no longer be quickly joined with this
   * RDD.
   */
  def reindex(): VertexRDD[VD]

  /**
   * Applies a function to each `VertexPartition` of this RDD and returns a new VertexRDD.
   */
  private[graphx] def mapVertexPartitions[VD2: ClassTag](
      f: ShippableVertexPartition[VD] => ShippableVertexPartition[VD2])
    : VertexRDD[VD2]

  /**
   * Restricts the vertex set to the set of vertices satisfying the given predicate. This operation
   * preserves the index for efficient joins with the original RDD, and it sets bits in the bitmask
   * rather than allocating new memory.
   *
   * It is declared and defined here to allow refining the return type from `RDD[(VertexId, VD)]` to
   * `VertexRDD[VD]`.
   *
   * @param pred the user defined predicate, which takes a tuple to conform to the
   * `RDD[(VertexId, VD)]` interface
   */
  override def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD] =
    this.mapVertexPartitions(_.filter(Function.untupled(pred)))

  /**
   * Maps each vertex attribute, preserving the index.
   *
   * @tparam VD2 the type returned by the map function
   *
   * @param f the function applied to each value in the RDD
   * @return a new VertexRDD with values obtained by applying `f` to each of the entries in the
   * original VertexRDD
   */
  def mapValues[VD2: ClassTag](f: VD => VD2): VertexRDD[VD2]

  /**
   * Maps each vertex attribute, additionally supplying the vertex ID.
   *
   * @tparam VD2 the type returned by the map function
   *
   * @param f the function applied to each ID-value pair in the RDD
   * @return a new VertexRDD with values obtained by applying `f` to each of the entries in the
   * original VertexRDD. The resulting VertexRDD retains the same index.
   */
  def mapValues[VD2: ClassTag](f: (VertexId, VD) => VD2): VertexRDD[VD2]

  /**
   * For each VertexId present in both `this` and `other`, minus will act as a set difference
   * operation returning only those unique VertexId's present in `this`.
   *
   * @param other an RDD to run the set operation against
   */
  def minus(other: RDD[(VertexId, VD)]): VertexRDD[VD]

  /**
   * For each VertexId present in both `this` and `other`, minus will act as a set difference
   * operation returning only those unique VertexId's present in `this`.
   *
   * @param other a VertexRDD to run the set operation against
   */
  def minus(other: VertexRDD[VD]): VertexRDD[VD]

  /**
   * For each vertex present in both `this` and `other`, `diff` returns only those vertices with
   * differing values; for values that are different, keeps the values from `other`. This is
   * only guaranteed to work if the VertexRDDs share a common ancestor.
   *
   * @param other the other RDD[(VertexId, VD)] with which to diff against.
   */
  def diff(other: RDD[(VertexId, VD)]): VertexRDD[VD]

  /**
   * For each vertex present in both `this` and `other`, `diff` returns only those vertices with
   * differing values; for values that are different, keeps the values from `other`. This is
   * only guaranteed to work if the VertexRDDs share a common ancestor.
   *
   * @param other the other VertexRDD with which to diff against.
   */
  def diff(other: VertexRDD[VD]): VertexRDD[VD]

  /**
   * Left joins this RDD with another VertexRDD with the same index. This function will fail if
   * both VertexRDDs do not share the same index. The resulting vertex set contains an entry for
   * each vertex in `this`.
   * If `other` is missing any vertex in this VertexRDD, `f` is passed `None`.
   *
   * @tparam VD2 the attribute type of the other VertexRDD
   * @tparam VD3 the attribute type of the resulting VertexRDD
   *
   * @param other the other VertexRDD with which to join.
   * @param f the function mapping a vertex id and its attributes in this and the other vertex set
   * to a new vertex attribute.
   * @return a VertexRDD containing the results of `f`
   */
  def leftZipJoin[VD2: ClassTag, VD3: ClassTag]
      (other: VertexRDD[VD2])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]

  /**
   * Left joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
   * backed by a VertexRDD with the same index then the efficient [[leftZipJoin]] implementation is
   * used. The resulting VertexRDD contains an entry for each vertex in `this`. If `other` is
   * missing any vertex in this VertexRDD, `f` is passed `None`. If there are duplicates,
   * the vertex is picked arbitrarily.
   *
   * @tparam VD2 the attribute type of the other VertexRDD
   * @tparam VD3 the attribute type of the resulting VertexRDD
   *
   * @param other the other VertexRDD with which to join
   * @param f the function mapping a vertex id and its attributes in this and the other vertex set
   * to a new vertex attribute.
   * @return a VertexRDD containing all the vertices in this VertexRDD with the attributes emitted
   * by `f`.
   */
  def leftJoin[VD2: ClassTag, VD3: ClassTag]
      (other: RDD[(VertexId, VD2)])
      (f: (VertexId, VD, Option[VD2]) => VD3)
    : VertexRDD[VD3]

  /**
   * Efficiently inner joins this VertexRDD with another VertexRDD sharing the same index. See
   * [[innerJoin]] for the behavior of the join.
   */
  def innerZipJoin[U: ClassTag, VD2: ClassTag](other: VertexRDD[U])
      (f: (VertexId, VD, U) => VD2): VertexRDD[VD2]

  /**
   * Inner joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
   * backed by a VertexRDD with the same index then the efficient [[innerZipJoin]] implementation
   * is used.
   *
   * @param other an RDD containing vertices to join. If there are multiple entries for the same
   * vertex, one is picked arbitrarily. Use [[aggregateUsingIndex]] to merge multiple entries.
   * @param f the join function applied to corresponding values of `this` and `other`
   * @return a VertexRDD co-indexed with `this`, containing only vertices that appear in both
   *         `this` and `other`, with values supplied by `f`
   */
  def innerJoin[U: ClassTag, VD2: ClassTag](other: RDD[(VertexId, U)])
      (f: (VertexId, VD, U) => VD2): VertexRDD[VD2]

  /**
   * Aggregates vertices in `messages` that have the same ids using `reduceFunc`, returning a
   * VertexRDD co-indexed with `this`.
   *
   * @param messages an RDD containing messages to aggregate, where each message is a pair of its
   * target vertex ID and the message data
   * @param reduceFunc the associative aggregation function for merging messages to the same vertex
   * @return a VertexRDD co-indexed with `this`, containing only vertices that received messages.
   * For those vertices, their values are the result of applying `reduceFunc` to all received
   * messages.
   */
  def aggregateUsingIndex[VD2: ClassTag](
      messages: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]

  /**
   * Returns a new `VertexRDD` reflecting a reversal of all edge directions in the corresponding
   * [[EdgeRDD]].
   */
  def reverseRoutingTables(): VertexRDD[VD]

  /** Prepares this VertexRDD for efficient joins with the given EdgeRDD. */
  def withEdges(edges: EdgeRDD[_]): VertexRDD[VD]

  /** Replaces the vertex partitions while preserving all other properties of the VertexRDD. */
  private[graphx] def withPartitionsRDD[VD2: ClassTag](
      partitionsRDD: RDD[ShippableVertexPartition[VD2]]): VertexRDD[VD2]

  /**
   * Changes the target storage level while preserving all other properties of the
   * VertexRDD. Operations on the returned VertexRDD will preserve this storage level.
   *
   * This does not actually trigger a cache; to do this, call
   * [[org.apache.spark.graphx.VertexRDD#cache]] on the returned VertexRDD.
   */
  private[graphx] def withTargetStorageLevel(
      targetStorageLevel: StorageLevel): VertexRDD[VD]

  /** Generates an RDD of vertex attributes suitable for shipping to the edge partitions. */
  private[graphx] def shipVertexAttributes(
      shipSrc: Boolean, shipDst: Boolean): RDD[(PartitionID, VertexAttributeBlock[VD])]

  /** Generates an RDD of vertex IDs suitable for shipping to the edge partitions. */
  private[graphx] def shipVertexIds(): RDD[(PartitionID, Array[VertexId])]
}
Source analysis:
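Basic operations such as filter, leftJoin, and innerJoin behave much like the corresponding operations in Spark SQL and are used the same way; only the shape of the data differs. VertexRDD also has operators of its own. For example, aggregateUsingIndex efficiently builds a new VertexRDD from an RDD[(VertexId, A)]. Conceptually, if a VertexRDD[B] has already been built over a set of vertices that is a superset of the vertices in some RDD[(VertexId, A)], its index can be reused both to aggregate and then to index that RDD[(VertexId, A)], which is considerably more efficient.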
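The index reuse described above can be sketched as follows, assuming a local SparkContext. The names AggregateUsingIndexSketch and aggregateAndJoin and the sample sizes are hypothetical; setA plays the role of the already-built VertexRDD and messages the role of the plain RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object AggregateUsingIndexSketch {

  // Returns (number of aggregated vertices, result of joining back onto the base set).
  def aggregateAndJoin(sc: SparkContext): (Long, Map[Long, Double]) = {
    // Base set over vertices 0..9, each with attribute 1.
    val setA: VertexRDD[Int] =
      VertexRDD(sc.parallelize(0L until 10L).map(id => (id, 1)))

    // Two messages for each of the vertices 0..4, a subset of the base set.
    val messages: RDD[(VertexId, Double)] =
      sc.parallelize(0L until 5L).flatMap(id => List((id, 1.0), (id, 2.0)))

    // Reuse setA's index to merge messages per vertex: each of 0..4 ends up with 3.0.
    val setB: VertexRDD[Double] = setA.aggregateUsingIndex(messages, _ + _)

    // setB is co-indexed with setA, so the join does not need to rebuild an index.
    val joined: VertexRDD[Double] = setA.innerJoin(setB)((_, a, b) => a + b)
    (setB.count(), joined.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("AggregateUsingIndexSketch"))
    try println(aggregateAndJoin(sc)) finally sc.stop()
  }
}
```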
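III. Edge RDD
EdgeRDD[ED] extends RDD[Edge[ED]] and organizes the edges into blocks partitioned with one of the strategies defined in PartitionStrategy. Within each partition, the edge attributes and the adjacency structure are stored separately, which allows maximum reuse when attribute values change.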
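How a partition strategy places edges can be probed directly through PartitionStrategy.getPartition, which needs no SparkContext. The sketch below assumes GraphX on the classpath; the object name PartitionStrategySketch, its members, and the choice of 8 partitions are hypothetical.

```scala
import org.apache.spark.graphx.PartitionStrategy

// Hypothetical helper object, for illustration only.
object PartitionStrategySketch {
  val numParts = 8

  // EdgePartition1D assigns a partition from the source vertex only,
  // so two edges leaving vertex 1 land on the same partition.
  def sameSourceColocated: Boolean =
    PartitionStrategy.EdgePartition1D.getPartition(1L, 2L, numParts) ==
      PartitionStrategy.EdgePartition1D.getPartition(1L, 99L, numParts)

  // CanonicalRandomVertexCut orders the endpoints before hashing,
  // so 1 -> 2 and 2 -> 1 land on the same partition.
  def bothDirectionsColocated: Boolean =
    PartitionStrategy.CanonicalRandomVertexCut.getPartition(1L, 2L, numParts) ==
      PartitionStrategy.CanonicalRandomVertexCut.getPartition(2L, 1L, numParts)

  def main(args: Array[String]): Unit = {
    println(s"same source colocated: $sameSourceColocated")
    println(s"both directions colocated: $bothDirectionsColocated")
  }
}
```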
The source code is as follows:
abstract class EdgeRDD[ED](
    sc: SparkContext,
    deps: Seq[Dependency[_]]) extends RDD[Edge[ED]](sc, deps) {

  // scalastyle:off structural.type
  private[graphx] def partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])] forSome { type VD }
  // scalastyle:on structural.type

  override protected def getPartitions: Array[Partition] = partitionsRDD.partitions

  override def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = {
    val p = firstParent[(PartitionID, EdgePartition[ED, _])].iterator(part, context)
    if (p.hasNext) {
      p.next()._2.iterator.map(_.copy())
    } else {
      Iterator.empty
    }
  }

  /**
   * Map the values in an edge partitioning preserving the structure but changing the values.
   *
   * @tparam ED2 the new edge value type
   * @param f the function from an edge to a new edge value
   * @return a new EdgeRDD containing the new edge values
   */
  def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): EdgeRDD[ED2]

  /**
   * Reverse all the edges in this RDD.
   *
   * @return a new EdgeRDD containing all the edges reversed
   */
  def reverse: EdgeRDD[ED]

  /**
   * Inner joins this EdgeRDD with another EdgeRDD, assuming both are partitioned using the same
   * [[PartitionStrategy]].
   *
   * @param other the EdgeRDD to join with
   * @param f the join function applied to corresponding values of `this` and `other`
   * @return a new EdgeRDD containing only edges that appear in both `this` and `other`,
   *         with values supplied by `f`
   */
  def innerJoin[ED2: ClassTag, ED3: ClassTag]
      (other: EdgeRDD[ED2])
      (f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]

  /**
   * Changes the target storage level while preserving all other properties of the
   * EdgeRDD. Operations on the returned EdgeRDD will preserve this storage level.
   *
   * This does not actually trigger a cache; to do this, call
   * [[org.apache.spark.graphx.EdgeRDD#cache]] on the returned EdgeRDD.
   */
  private[graphx] def withTargetStorageLevel(targetStorageLevel: StorageLevel): EdgeRDD[ED]
}
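Source analysis:
EdgeRDD is rarely used on its own. In most cases, operations on an EdgeRDD are carried out through the graph operators or rely on operations defined in the base RDD class.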
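Both routes can be seen in a short sketch, assuming a local SparkContext: one transformation goes through a graph operator (mapEdges) and one through a plain RDD operation (filter) on graph.edges. The names EdgeOpsSketch and run and the sample edges are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical helper object, for illustration only.
object EdgeOpsSketch {

  // Returns (edges after doubling every attribute, number of edges with attr > 15).
  def run(sc: SparkContext): (Seq[(Long, Long, Int)], Long) = {
    val graph = Graph.fromEdges(
      sc.parallelize(Seq(Edge(1L, 2L, 10), Edge(2L, 3L, 20))), defaultValue = 0)

    // Graph operator: rewrites edge attributes and yields a new graph.
    val doubled = graph
      .mapEdges((e: Edge[Int]) => e.attr * 2)
      .edges
      .map(e => (e.srcId, e.dstId, e.attr))
      .collect()
      .toSeq
      .sorted

    // Base RDD operation: EdgeRDD is an RDD[Edge[Int]], so filter and count apply directly.
    val heavy = graph.edges.filter(_.attr > 15).count()

    (doubled, heavy)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("EdgeOpsSketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```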