Learning Spark Reading Notes, Part 2

Working with Key/Value Pairs

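Key/value RDDs are commonly used to perform aggregations, and we often do some initial ETL (extract, transform, load) to get our data into key/value form.

With controllable partitioning, an application can reduce communication costs by ensuring that data that is accessed together is located on the same node.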

Creating Pair RDDs

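Some ways of loading a dataset return key/value data directly; in other cases we have to turn a regular RDD into a pair RDD ourselves. This can be done with map(), for example by keying each line on its first word: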

val pairs = lines.map(x => (x.split(" ")(0), x))

To create a pair RDD from an in-memory collection, just call SparkContext.parallelize() on a collection of pairs.

Transformations on Pair RDDs

Pair RDDs can use every transformation available to standard RDDs. The commonly used pair RDD transformations are listed below.

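reduceByKey(func) Combine values with the same key.
groupByKey() Group values with the same key.
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) Combine values with the same key, possibly into a different result type.
mapValues(func) Apply a function to each value of a pair RDD without changing the key.
flatMapValues(func) Apply a function that returns an iterator to each value, and emit one pair with the original key for each element returned.
keys() Return an RDD of just the keys.
values() Return an RDD of just the values.
sortByKey() Return an RDD sorted by key.
subtractByKey(other) Remove elements whose key is present in the other RDD.
join(other) Perform an inner join between two RDDs.
rightOuterJoin(other) Perform a join that keeps every key of the other RDD; values from this RDD are wrapped in Option.
leftOuterJoin(other) Perform a join that keeps every key of this RDD; values from the other RDD are wrapped in Option.
cogroup(other) Group data from both RDDs sharing the same key.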

Aggregations

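reduceByKey() is a transformation, not an action: because a dataset can have a very large number of keys, it returns a new RDD of each key and its reduced value instead of returning a value to the driver program. For example, to compute the per-key average:

rdd.mapValues(x => (x, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // per-key (sum, count)
  .mapValues { case (sum, count) => sum / count.toDouble } // per-key average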
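To make the semantics of this pipeline concrete, here is a minimal sketch that mimics it with plain Scala collections instead of Spark. The object name PerKeyAverage and the sample data are made up for illustration; a real RDD would do the same work in parallel across partitions.

```scala
object PerKeyAverage {
  // Plain-collection stand-in for mapValues(...).reduceByKey(...) followed by a division.
  def averages(data: Seq[(String, Int)]): Map[String, Double] =
    data
      .map { case (k, v) => (k, (v, 1)) } // like mapValues(x => (x, 1))
      .groupBy(_._1)                      // bring equal keys together
      .map { case (k, pairs) =>
        // like reduceByKey: fold the (sum, count) pairs of one key into one pair
        val (sum, count) = pairs.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
        (k, sum / count.toDouble)
      }

  def main(args: Array[String]): Unit =
    println(averages(Seq(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4))))
}
```

Note that the sketch groups first and reduces afterwards only because it runs on a local collection; in Spark, reduceByKey() exists precisely so that this grouping step is not materialized.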

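Note that reduceByKey() and foldByKey() automatically perform combining locally on each machine before computing the global result for each key; nothing has to be configured for this. The more general combineByKey() lets you customize the combining behavior.

combineByKey() is the most general of the per-key aggregation functions, and most of the other per-key combiners are implemented with it. To understand combineByKey(), it helps to know how it processes each element: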

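As combineByKey() goes through the elements in a partition, each element has a key that either has been seen before or is new.
If the key is new in that partition, combineByKey() calls our createCombiner() function to create the initial value of the accumulator for that key.
If the key has been seen before in that partition, it calls our mergeValue() function with the current accumulator value for that key and the new value.
Each partition is processed independently in this way, so the same key can end up with several accumulators. When the per-partition results are merged, our mergeCombiners() function is called to combine the accumulators of the same key.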
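The steps above can be sketched with plain Scala collections, treating each inner Seq as one partition. This is only a model of the described behavior, not Spark's implementation; the names CombineByKeyModel and combinePartition are hypothetical.

```scala
object CombineByKeyModel {
  // Process one partition; the accumulator map is local to that partition.
  def combinePartition[K, V, C](
      partition: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    partition.foldLeft(Map.empty[K, C]) { case (accs, (k, v)) =>
      accs.get(k) match {
        case None      => accs + (k -> createCombiner(v))  // key is new in this partition
        case Some(acc) => accs + (k -> mergeValue(acc, v)) // key was seen before
      }
    }

  // Combine every partition independently, then merge the per-partition accumulators.
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] =
    partitions
      .map(p => combinePartition(p, createCombiner, mergeValue))
      .foldLeft(Map.empty[K, C]) { (merged, partial) =>
        partial.foldLeft(merged) { case (m, (k, c)) =>
          m + (k -> m.get(k).map(mergeCombiners(_, c)).getOrElse(c))
        }
      }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(("a", 1), ("b", 4), ("a", 3)), Seq(("a", 5)))
    val sumCount = combineByKey[String, Int, (Int, Int)](
      partitions,
      v => (v, 1),
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2))
    println(sumCount)
  }
}
```

In the sample data the key "a" appears in both partitions, so createCombiner() runs twice for it and mergeCombiners() joins the two accumulators at the end.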

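Map-side combining is the local merging of values with the same key inside each partition, before the data is shuffled across the network (the first three steps above). If we want to disable map-side combines, for example because the aggregation does not shrink the data, we need to specify the partitioner.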

An example of combineByKey() that computes the per-key average:

val result = input.combineByKey(
  (v) => (v, 1), // createCombiner
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),    // mergeValue
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)    // mergeCombiners
  ).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().foreach(println)

Tuning the level of parallelism

When performing aggregations or grouping operations, we can specify the number of partitions to use. For example:

val data = Seq(("a", 3), ("b", 4), ("a", 1))
sc.parallelize(data).reduceByKey((x, y) => x + y) //Default parallelism
sc.parallelize(data).reduceByKey((x, y) => x + y, 10) // Custom parallelism

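Sometimes we want to change the partitioning of an RDD outside the context of grouping and aggregation operations. repartition() does this, but it shuffles the data across the network, which is expensive. When the new number of partitions is smaller than the current one, Spark offers an optimized alternative, coalesce(), which avoids that data movement. Before calling it, check the current number of partitions with rdd.partitions.size to make sure you really are reducing it.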

Grouping Data

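If an RDD has keys of type K and values of type V, applying groupByKey() yields an RDD of (K, Iterable[V]). Note that calling groupByKey() and then applying reduce or fold to the values is less efficient than directly using a per-key aggregation function such as reduceByKey().

cogroup() groups data sharing the same key from multiple RDDs. For two RDDs with values of type V and W it yields (K, (Iterable[V], Iterable[W])).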
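As a rough illustration of what cogroup() returns, the following sketch reproduces its result shape on two local sequences. It is a plain-collection model under the assumption that order does not matter; CogroupModel is a hypothetical name, and Spark itself would compute this with a shuffle rather than repeated scans.

```scala
object CogroupModel {
  // For every key found in either input, collect the values from each side.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val fromLeft = left.filter(_._1 == k).map(_._2)
      val fromRight = right.filter(_._1 == k).map(_._2)
      (k, (fromLeft, fromRight)) // a side with no match contributes an empty Seq
    }.toMap
  }

  def main(args: Array[String]): Unit =
    println(cogroup(Seq(("a", 1), ("b", 2), ("a", 3)), Seq(("a", "x"), ("c", "y"))))
}
```

A key that exists on only one side still appears in the result, paired with an empty collection for the other side. This is why the join variants can be built on top of cogroup().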

Joins

Inner join:

storeAddress = {
  (Store("Ritual"), "1026 Valencia St"), (Store("Philz"), "748 Van Ness Ave"),
  (Store("Philz"), "3101 24th St"), (Store("Starbucks"), "Seattle")}
storeRating = {
  (Store("Ritual"), 4.9), (Store("Philz"), 4.8)}
storeAddress.join(storeRating) == {
  (Store("Ritual"), ("1026 Valencia St", 4.9)),
  (Store("Philz"), ("748 Van Ness Ave", 4.8)),
  (Store("Philz"), ("3101 24th St", 4.8))}

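With leftOuterJoin(), the result has an entry for every key in the source RDD. The value for each key is a tuple of the source RDD's value and an Option of the other RDD's value: Some(value) if the other RDD has that key, and None if it does not. The Option type signals that the value may be missing.

rightOuterJoin() is the mirror image of leftOuterJoin(): every key of the other RDD is kept, and it is the source RDD's value that is wrapped in Option.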

storeAddress.leftOuterJoin(storeRating) ==
{(Store("Ritual"),("1026 Valencia St",Some(4.9))),
  (Store("Starbucks"),("Seattle",None)),
  (Store("Philz"),("748 Van Ness Ave",Some(4.8))),
  (Store("Philz"),("3101 24th St",Some(4.8)))}
storeAddress.rightOuterJoin(storeRating) ==
{(Store("Ritual"),(Some("1026 Valencia St"),4.9)),
  (Store("Philz"),(Some("748 Van Ness Ave"),4.8)),
  (Store("Philz"), (Some("3101 24th St"),4.8))}
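The difference between the inner and left outer join can be reproduced on local sequences. The sketch below is a plain-collection model that uses String keys in place of Store(...); JoinModel and its method bodies are illustrative assumptions, and unlike Spark the model preserves the order of the left input.

```scala
object JoinModel {
  // Inner join: only keys present on both sides survive.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))

  // Left outer join: every left element survives; a missing right value becomes None.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches = right.collect { case (k2, w) if k2 == k => w }
      if (matches.isEmpty) Seq((k, (v, Option.empty[W])))
      else matches.map(w => (k, (v, Some(w): Option[W])))
    }

  val storeAddress = Seq(
    ("Ritual", "1026 Valencia St"), ("Philz", "748 Van Ness Ave"),
    ("Philz", "3101 24th St"), ("Starbucks", "Seattle"))
  val storeRating = Seq(("Ritual", 4.9), ("Philz", 4.8))

  def main(args: Array[String]): Unit = {
    join(storeAddress, storeRating).foreach(println)
    leftOuterJoin(storeAddress, storeRating).foreach(println)
  }
}
```

A right outer join is the same computation with the two inputs swapped, so the Option ends up on the other side of the value tuple.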

Sorting Data

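sortByKey() accepts a custom comparison, so keys can be sorted in an order other than their natural one. In Scala this is done by providing an implicit Ordering for the key type.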
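A custom comparison can be illustrated without Spark by sorting local pairs with an explicit Ordering. The sketch below compares integer keys by their string form; SortByKeyModel and byStringForm are hypothetical names, and the local sortBy only stands in for what sortByKey() would do on an RDD.

```scala
object SortByKeyModel {
  // Compare integer keys as strings, so that 10 sorts before 9.
  val byStringForm: Ordering[Int] = new Ordering[Int] {
    override def compare(a: Int, b: Int): Int = a.toString.compareTo(b.toString)
  }

  // Local stand-in for sortByKey(): sort pairs by key with the given Ordering.
  def sortByKey[V](pairs: Seq[(Int, V)])(implicit ord: Ordering[Int]): Seq[(Int, V)] =
    pairs.sortBy(_._1)(ord)

  def main(args: Array[String]): Unit = {
    val pairs = Seq((9, "a"), (10, "b"), (1, "c"))
    println(sortByKey(pairs))               // natural integer order
    println(sortByKey(pairs)(byStringForm)) // string order of the keys
  }
}
```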

Actions Available on Pair RDDs

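All actions available on base RDDs can also be used on pair RDDs, and some additional actions take advantage of the key/value structure of the data.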

countByKey(): Count the number of elements for each key.
collectAsMap(): Collect the result as a map to provide easy lookup.
lookup(key): Return all values associated with the provided key.
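The results of these three actions can be pictured with local collections. The following is a plain Scala model, not Spark code; PairActionsModel is a made-up name, and the choice to keep the last value for a duplicate key in collectAsMap is an assumption of this sketch (a map can hold only one value per key).

```scala
object PairActionsModel {
  // Number of elements per key.
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.size.toLong) }

  // One value per key; in this model later duplicates overwrite earlier ones.
  def collectAsMap[K, V](pairs: Seq[(K, V)]): Map[K, V] = pairs.toMap

  // All values stored under the given key.
  def lookup[K, V](pairs: Seq[(K, V)], key: K): Seq[V] =
    pairs.collect { case (k, v) if k == key => v }

  def main(args: Array[String]): Unit = {
    val data = Seq((1, 2), (3, 4), (3, 6))
    println(countByKey(data))
    println(collectAsMap(data))
    println(lookup(data, 3))
  }
}
```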

Data Partitioning

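In a distributed program, Spark lets you control how an RDD is partitioned in order to reduce communication costs. Keep in mind that partitioning does not help in every application: it pays off only when a dataset is reused multiple times in key-oriented operations such as joins.

Spark's partitioning is available on all RDDs of key/value pairs, and it causes the system to group elements based on a function of each key. This guarantees that a given set of keys will appear together on some node. For example: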

val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...").persist()

def processNewLogs(logFileName: String): Unit = {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events) // RDD of (UserID, (UserInfo, LinkInfo)) pairs
  val offTopicVisits = joined.filter {
    case (userId, (userInfo, linkInfo)) => // Expand the tuple into its components
      !userInfo.topics.contains(linkInfo.topic)
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}

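The code above joins userData with events, but it is inefficient. join() runs every time processNewLogs() is called, and because it knows nothing about how the keys are distributed, it hashes the keys of both datasets and sends elements with the same key hash across the network to the same machine. userData is therefore hashed and shuffled on every call even though it has not changed, which causes heavy network traffic.

[Figure: each join of userData and events without partitionBy()]

Calling partitionBy() on userData reduces this communication cost: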

val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
                 .partitionBy(new HashPartitioner(100)) // Create 100 partitions
                 .persist()

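Now Spark knows how userData is partitioned, so join() shuffles only the events RDD, sending each event to the machine that holds the matching partition of userData.

[Figure: each join of userData and events with partitionBy()]

In fact, many operations automatically produce an RDD with known partitioning: for example, sortByKey() produces a range-partitioned RDD and groupByKey() a hash-partitioned one. Besides join(), other operations also take advantage of this partitioning information.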
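To see why a known partitioner lets the join leave userData in place, it helps to model how hash partitioning assigns a key to a partition. The sketch below is a plain Scala approximation of that idea (hash code modulo the number of partitions, made non-negative); HashPartitionModel, partitionOf, and the sample data are illustrative names, not Spark API.

```scala
object HashPartitionModel {
  // Modulo that never returns a negative index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val r = x % mod
    if (r < 0) r + mod else r
  }

  // The partition index depends only on the key and the number of partitions.
  def partitionOf(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  // Local stand-in for partitionBy: bucket the pairs by partition index.
  def partitionBy[K, V](pairs: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    pairs.groupBy { case (k, _) => partitionOf(k, numPartitions) }

  def main(args: Array[String]): Unit = {
    val users = Seq((1, "alice"), (2, "bob"), (3, "carol"))
    println(partitionBy(users, 2))
    // An event for user 3 is routed to the partition that already holds user 3.
    println(partitionOf(3, 2))
  }
}
```

Because the index is a pure function of the key, any later dataset hashed with the same number of partitions sends each key to the partition that already holds it, so only the new dataset has to move.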

Determining an RDD's Partitioner

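In Scala, you can inspect an RDD's partitioner property to see which partitioner it uses, if any. The property is an Option, so it is None when no partitioner is set.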

scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
pairs: spark.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> pairs.partitioner
res0: Option[spark.Partitioner] = None

scala> val partitioned = pairs.partitionBy(new spark.HashPartitioner(2))
partitioned: spark.RDD[(Int, Int)] = ShuffledRDD[1] at partitionBy at <console>:14

scala> partitioned.partitioner
res1: Option[spark.Partitioner] = Some(spark.HashPartitioner@5147788d)

Operations That Benefit from Partitioning

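Many operations that shuffle data by key across the network benefit from partitioning, such as cogroup(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup().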

Operations That Affect Partitioning

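Spark knows how each operation affects partitioning and automatically sets the partitioner on the resulting RDD. In general, the following operations produce an output RDD with a partitioner set: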

cogroup()
groupWith()
join()
leftOuterJoin()
rightOuterJoin()
groupByKey()
reduceByKey()
combineByKey()
partitionBy()
sortByKey()
mapValues() (if the parent RDD has a partitioner)
flatMapValues() (if the parent RDD has a partitioner)
filter() (if the parent RDD has a partitioner)

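Many of the methods above also accept an explicit partitioner or number of partitions, which lets you control the partitioning of their output.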

Custom Partitioners

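In addition to Spark's built-in HashPartitioner and RangePartitioner, you can define your own Partitioner. To do so, extend org.apache.spark.Partitioner and implement three methods:

numPartitions: Int Returns the number of partitions to create.
getPartition(key: Any): Int Returns the partition ID (from 0 to numPartitions - 1) for a given key.
equals() The standard Java equality method. Spark uses it to compare Partitioner objects when it needs to decide whether two RDDs are partitioned the same way.

For example, in PageRank the keys are URLs, but we would like pages under the same domain name to be grouped together, so the code looks like this: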

import org.apache.spark.Partitioner

class DomainNamePartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val domain = new java.net.URL(key.toString).getHost
    val code = domain.hashCode % numPartitions
    if (code < 0) {
      code + numPartitions // Make it non-negative
    } else {
      code
    }
  }

  // Java equals method to let Spark compare our Partitioner objects
  override def equals(other: Any): Boolean = other match {
    case dnp: DomainNamePartitioner =>
      dnp.numPartitions == numPartitions
    case _ =>
      false
  }

  // Keep hashCode consistent with equals
  override def hashCode: Int = numPartitions
}

Note that in our equals() method, the pattern match tests whether other is a DomainNamePartitioner and casts it if so; this is the same as using instanceof in Java.
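The partitioning rule itself can be tried out without a Spark cluster. The sketch below extracts the same host-based computation into a standalone function; DomainPartitionSketch, domainPartition, and the sample URLs are hypothetical and exist only to show that URLs from one domain land in the same partition.

```scala
import java.net.URL

object DomainPartitionSketch {
  // Same rule as getPartition above: hash the host name, then make the index non-negative.
  def domainPartition(url: String, numPartitions: Int): Int = {
    val code = new URL(url).getHost.hashCode % numPartitions
    if (code < 0) code + numPartitions else code
  }

  def main(args: Array[String]): Unit = {
    val urls = Seq("http://www.cnn.com/WORLD", "http://www.cnn.com/US", "http://www.example.org/")
    urls.foreach(u => println(u + " -> partition " + domainPartition(u, 20)))
  }
}
```

With the class above, the partitioner would be applied to a pair RDD keyed by URL through partitionBy(new DomainNamePartitioner(20)), where 20 is an arbitrary example value.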

posted @ 2016-11-22 16:09 传奇魔法师
