Spark Basic API Walkthrough
1. Spark Basic API Walkthrough
First, let's write a simple word-count program, look at the APIs that appear in it, and then summarize them:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/* Word count implemented in Java */
public class WCJava1 {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("wcjava");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // Load the file
        JavaRDD<String> rdd1 = jsc.textFile("hdfs://mycluster/wc.txt");
        // Flatten each line into words
        JavaRDD<String> rdd2 = rdd1.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) throws Exception {
                return Arrays.asList(s.split(" ")).iterator();
            }
        });
        // Map each word to a (word, 1) pair
        JavaPairRDD<String, Integer> rdd3 = rdd2.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        // Reduce by key to sum the counts
        JavaPairRDD<String, Integer> rdd4 = rdd3.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        List<Tuple2<String, Integer>> list = rdd4.collect();
        for (Tuple2<String, Integer> stringIntegerTuple2 : list) {
            System.out.println(stringIntegerTuple2);
        }
    }
}
From a syntax standpoint, the Scala version of this code is far more concise than the Java one; from a performance standpoint, Java, verbose as it is, still has the edge over Scala. The code above uses several APIs that come up constantly when doing RDD transformations; they are all covered in the official Apache RDD programming guide: http://spark.apache.org/docs/latest/rdd-programming-guide.html
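For comparison, here is a minimal Scala sketch of the same word count (not from the original post; it assumes the same wc.txt path and local master as the Java example):

import org.apache.spark.{SparkConf, SparkContext}

/* Word count in Scala, for comparison with the Java version above */
object WCScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("wcscala")
    val sc = new SparkContext(conf)
    sc.textFile("hdfs://mycluster/wc.txt")
      .flatMap(_.split(" "))    // flatten lines into words
      .map((_, 1))              // pair each word with a count of 1
      .reduceByKey(_ + _)       // sum the counts per word
      .collect()
      .foreach(println)
  }
}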
1.1 Core concepts of the Spark Core module
RDD: resilient distributed dataset
The RDD is a core concept in Spark programming. Unlike the real data sets that Hadoop stores on DataNodes, an RDD is a logical concept and holds no actual data. Once an RDD object has been created it is immutable, and "distributed" means its data is split into ranges that are computed on in parallel. Below is the class comment from the official Scala RDD source, followed by an interpretation:
Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
Internally, each RDD is characterized by five main properties (a short sketch after this list shows how to inspect some of them from the API):
1. A list of partitions: the range of data covered by each partition object
2. A function for computing each split: the operator, i.e. the algorithm applied to the data
3. A list of dependencies on other RDDs: these come in two flavors, wide and narrow
4. Optionally, a Partitioner for key-value RDDs
5. Optionally, a list of preferred locations on which to compute each split
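As a rough illustration (a sketch with made-up data, assuming an existing SparkContext named sc such as the one provided by the spark-shell), several of these properties can be inspected directly from the RDD API:

// Inspecting an RDD's partitions, dependencies, and partitioner.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 3)
val reduced = pairs.reduceByKey(_ + _)

println(pairs.partitions.length)    // property 1: the list of partitions (3 here)
println(reduced.dependencies)       // property 3: dependencies on the parent RDD
println(reduced.partitioner)        // property 4: Some(HashPartitioner) for this key-value RDD
// property 5: preferred locations matter mainly for sources such as HDFS blocks
reduced.partitions.foreach(p => println(reduced.preferredLocations(p)))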
Task
A task is the smallest unit of execution in Spark. Each partition of an RDD corresponds to one task, and each task runs in a thread on a worker node.
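A small sketch of that partition-to-task relationship (assuming an existing SparkContext named sc; not from the original text):

val rdd = sc.parallelize(1 to 100, 4)   // explicitly ask for 4 partitions
println(rdd.getNumPartitions)           // 4
rdd.map(_ * 2).count()                  // this stage is executed as 4 tasks, one per partition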
Dependency
As noted above, dependencies come in two kinds: wide and narrow.
Narrow dependency: each partition of the child RDD depends on a small number of partitions of the parent RDD (not necessarily just one).
NarrowDependency: an abstract class with three concrete subclasses: OneToOneDependency, PruneDependency, and RangeDependency.
Wide dependency: each partition of the child RDD depends on all partitions of the parent RDD.
ShuffleDependency: a concrete subclass of the abstract Dependency class; it is what we call a shuffle dependency or wide dependency.
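The distinction can be observed through an RDD's dependencies field; the following sketch uses made-up data and assumes an existing SparkContext sc:

val pairs    = sc.parallelize(Seq(("a", 1), ("b", 2)))
val mapped   = pairs.mapValues(_ + 1)       // narrow: each child partition reads one parent partition
val shuffled = mapped.reduceByKey(_ + _)    // wide: child partitions may read every parent partition

println(mapped.dependencies)    // List(org.apache.spark.OneToOneDependency@...)
println(shuffled.dependencies)  // List(org.apache.spark.ShuffleDependency@...)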
1.2 Summary of basic RDD API characteristics
Notes on aggregation methods such as groupByKey and reduceByKey:
Looking at the Scala source of groupByKey:
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
As this shows, many of the *ByKey methods are implemented on top of combineByKeyWithClassTag. In the body of groupByKey, mapSideCombine is set to false, so groupByKey performs no map-side aggregation.
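In practice this means that, for an aggregation like word count, reduceByKey (which keeps the default mapSideCombine = true) pre-aggregates values inside each map-side partition before the shuffle, while groupByKey ships every (word, 1) pair across the network. A sketch of the contrast, assuming an existing SparkContext sc and the same wc.txt path as above:

val words = sc.textFile("hdfs://mycluster/wc.txt").flatMap(_.split(" ")).map((_, 1))

val viaReduce = words.reduceByKey(_ + _)              // map-side combine, less data shuffled
val viaGroup  = words.groupByKey().mapValues(_.sum)   // no map-side combine, every pair shuffled

viaReduce.collect().foreach(println)   // both variants produce the same counts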
Notes on the basic aggregation method aggregateByKey:
Its source is as follows:
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
Clearly this is a curried function. In the first parameter list the user supplies a zeroValue, the "zero" U, which may be of any type. seqOp is a transformation of shape (U, V) => U: within a single partition the zero value is folded together with each value, which is also how the value type V can be turned into U. combOp, of shape (U, U) => U, then merges the U results computed by the individual partitions into the final U. The method has two main advantages:
1. It uses map-side aggregation, which improves efficiency.
2. It can change the value's data type, and the aggregation logic is supplied by the user, which makes it very flexible.
Now let's put this into practice and write a small demo that uses the method.
import org.apache.spark.{SparkConf, SparkContext}

/* A word-count-style demo using aggregateByKey */
object AggScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("aggscala")
    val sc = new SparkContext(conf)
    // Set the zero value
    val zeroValue = ""
    def seqOp(U: String, V: Int): String = {
      U + V + ","
    }
    def combOp(U1: String, U2: String): String = {
      U1 + ":" + U2
    }
    sc.textFile("hdfs://mycluster/wc.txt", 2).flatMap(e => e.split(" "))
      .map(e => (e, 1))
      .aggregateByKey(zeroValue)(seqOp, combOp)
      .collect
      .foreach(println(_))
  }
}
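To highlight advantage 2, the ability to change the value type, here is one more sketch (made-up data, assuming an existing SparkContext sc; not from the original post) that folds the Int values of each key into a Set[Int]:

// aggregateByKey changes the value type from Int to Set[Int].
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)), 2)
val distinctPerKey = pairs.aggregateByKey(Set.empty[Int])(
  (set, v) => set + v,    // seqOp: fold each value into the per-partition set
  (s1, s2) => s1 ++ s2)   // combOp: merge the sets produced by different partitions
distinctPerKey.collect().foreach(println)   // e.g. (a,Set(1, 2)) and (b,Set(3))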
Why is countByKey an action?
Let's look at the source of countByKey together:
def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
Reading the source, countByKey maps every value to 1L, calls reduceByKey to add the counts up, and finally calls collect() followed by toMap. Since collect() is an action, countByKey triggers a job and is therefore an action itself.
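As a usage note (a sketch with made-up data, assuming an existing SparkContext sc), the return value is a local Scala Map rather than an RDD, which is only possible because the job has already run:

val pairs = sc.parallelize(Seq(("a", 10), ("a", 20), ("b", 30)))
val counts: scala.collection.Map[String, Long] = pairs.countByKey()
println(counts)   // Map(a -> 2, b -> 1), computed eagerly with no further action needed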