Spark RDD Transformations and Actions
Contents
- Transformation operators
- Basic initialization
- 1. map, flatMap, mapPartitions, mapPartitionsWithIndex
- 1.1 map
- 1.2 flatMap
- 1.3 mapPartitions
- 1.4 mapPartitionsWithIndex
- 2. reduce, reduceByKey
- 2.1 reduce
- 2.2 reduceByKey
- 3. union, join and groupByKey
- 3.1 union
- 3.2 groupByKey
- 3.3 join
- 4. sample, cartesian
- 4.1 sample
- 4.2 cartesian
- 5. filter, distinct, intersection
- 5.1 filter
- 5.2 distinct
- 5.3 intersection
- 6. coalesce, repartition, repartitionAndSortWithinPartitions
- 6.1 coalesce
- 6.2 repartition
- 6.3 repartitionAndSortWithinPartitions
- 7. cogroup, sortByKey, aggregateByKey
- 7.1 cogroup
- 7.2 sortByKey
- 7.3 aggregateByKey
Transformation operators
Basic initialization
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer // used by the mapPartitions examples below

val config = new SparkConf().setAppName("MapPartitionsAPP").setMaster("local[2]")
val sc = new SparkContext(config) // create the Spark context
1. map, flatMap, mapPartitions, mapPartitionsWithIndex
1.1 map
def map(): Unit = {
  val list = List("spark", "hadoop", "sqoop", "hive", "storm")
  val listRDD = sc.parallelize(list) // the second parameter of parallelize can set the number of partitions
  /**
   * map applies the function to every element of the source RDD. Elements are passed in one by one, so the
   * operation is ordered: the new RDD keeps the same element order as the source RDD. This ordering leads
   * into flatMap, covered next.
   */
  val listMapRDD = listRDD.map(name => "hello world " + name)
  listMapRDD.foreach(println(_))
}
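As a quick check that map preserves input order, a minimal sketch (assuming the same sc as above); collect is used here because foreach on a multi-partition RDD may print in interleaved order:

val names = sc.parallelize(List("spark", "hadoop", "sqoop", "hive", "storm"))
// collect gathers the results to the driver in partition order, so the original order is visible
println(names.map("hello world " + _).collect().mkString(", "))
// hello world spark, hello world hadoop, hello world sqoop, hello world hive, hello world storm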
1.2 flatMap
def flatMap(): Unit = {
  val list = List("spark sparkSQL", "hadoop MapReduce", "sqoop", "hive", "storm")
  val listRDD = sc.parallelize(list)
  // flatMap first maps each element to a collection, then flattens all of those collections into one RDD
  val flatMapRDD = listRDD.flatMap(name => {
    name.split(" ").map(word => "hello world " + word)
  })
  flatMapRDD.foreach(println(_))
}
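To make the difference from map concrete, a small sketch (assuming the same sc): map produces one output element per input element (here an Array per line), while flatMap flattens those arrays into individual words:

val lines = sc.parallelize(List("spark sparkSQL", "hadoop MapReduce"))
// map: one Array per input line
lines.map(_.split(" ")).collect().foreach(arr => println(arr.mkString("[", ", ", "]")))
// [spark, sparkSQL]
// [hadoop, MapReduce]
// flatMap: the arrays are flattened into single words
lines.flatMap(_.split(" ")).collect().foreach(println(_))
// spark
// sparkSQL
// hadoop
// MapReduce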
1.3 mapPartitions
def mapPartitions(): Unit = {
  val list = List(1, 2, 3, 4, 5, 6)
  val listRDD = sc.parallelize(list, 2)
  /**
   * map and flatMap receive one element at a time. When an operation needs several elements of a partition
   * together (for example, an interest calculation where next month's interest is based on the principal plus
   * last month's interest), those operators are not enough. mapPartitions instead receives a whole partition
   * as an iterator and returns an iterator that becomes the new RDD, so this kind of per-partition logic is possible.
   */
  listRDD.mapPartitions(iterator => {
    val newList: ListBuffer[String] = ListBuffer()
    while (iterator.hasNext) {
      newList.append("hello " + iterator.next())
    }
    newList.toIterator
  }).foreach(name => println(name))
}
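To illustrate why having the whole partition available matters, a minimal sketch (assuming the same sc) that carries a running total across the elements of each partition, which map cannot do because it only ever sees one element at a time:

val nums = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
val runningTotals = nums.mapPartitions { iter =>
  var sum = 0
  // the state in sum is shared by all elements of the same partition
  iter.map { n =>
    sum += n
    sum
  }
}
runningTotals.foreach(println(_)) // partition [1,2,3] yields 1, 3, 6; partition [4,5,6] yields 4, 9, 15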
1.4 mapPartitionsWithIndex
mapPartitionsWithIndex processes one partition's data per call and also knows the index of the partition being processed.
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("MapPartitionsWithIndexAPP").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  /**
   * mapPartitionsWithIndex
   * processes one partition's data per call and is given that partition's index.
   */
  sc.parallelize(list).mapPartitionsWithIndex((index, iterator) => {
    val listBuffer: ListBuffer[String] = new ListBuffer
    while (iterator.hasNext) {
      listBuffer.append(index + "_" + iterator.next())
    }
    listBuffer.iterator
  }, true).foreach(println(_))
}
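mapPartitionsWithIndex is also handy for checking how parallelize distributes data; a small sketch (assuming the same sc) that counts the elements in each partition:

val sizes = sc.parallelize(1 to 10, 3).mapPartitionsWithIndex((index, iter) => Iterator((index, iter.size)))
sizes.collect().foreach { case (index, count) => println("partition " + index + " holds " + count + " elements") }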
2. reduce, reduceByKey
2.1 reduce
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("ReduceAPP").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val list = Array(1, 2, 3, 4, 5, 1, 2, 3)
  val listRDD = sc.parallelize(list)
  /**
   * reduce merges all elements of the RDD: the function receives two elements, and its return value is fed
   * back in together with the next element, repeating until only a single value is left.
   */
  val result = listRDD.reduce((x, y) => x + y)
  println(result)
}
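The function passed to reduce should be associative and commutative, because Spark first reduces inside each partition and then merges the per-partition results; addition satisfies this, and so does taking a maximum, shown in this small sketch (assuming the same sc):

val numbers = sc.parallelize(Array(1, 2, 3, 4, 5, 1, 2, 3))
// keep the larger of each pair until only one value remains
val maxValue = numbers.reduce((x, y) => if (x > y) x else y)
println(maxValue) // 5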
2.2 reduceByKey
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("ReduceByKeyAPP").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val list = List(("A", 99), ("B", 97), ("A", 89), ("B", 77))
  val mapRDD = sc.parallelize(list)
  /**
   * reduceByKey merges only the values whose key is the same among all (K, V) pairs in the RDD.
   */
  val resultRDD = mapRDD.reduceByKey(_ + _)
  resultRDD.foreach(tuple => println(tuple._1 + " -> " + tuple._2))
}
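reduceByKey combines values of the same type, so computing something like a per-key average takes a small trick: carry (sum, count) pairs as the value. A sketch (assuming the same sc and data as above):

val scores = sc.parallelize(List(("A", 99), ("B", 97), ("A", 89), ("B", 77)))
val averages = scores
  .mapValues(v => (v, 1))                              // (score, count)
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))   // sum the scores and the counts per key
  .mapValues { case (sum, count) => sum.toDouble / count }
averages.foreach { case (key, avg) => println(key + " -> " + avg) }
// A -> 94.0, B -> 87.0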
3. union, join and groupByKey
3.1 union
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("UnionAPP").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val list1 = List(1, 2, 3, 4)
  val list2 = List(2, 2, 3, 4)
  val rdd1 = sc.parallelize(list1)
  val rdd2 = sc.parallelize(list2)
  /**
   * union simply concatenates the two RDDs, much like List's addAll; with local[2] each RDD gets two partitions.
   */
  rdd1.union(rdd2).foreach(println(_))
}
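Note that union does not remove duplicates; a sketch (assuming the same rdd1 and rdd2) showing that distinct, covered later in section 5, can be chained on afterwards if deduplication is wanted:

// union keeps every element from both RDDs, duplicates included
println(rdd1.union(rdd2).collect().sorted.mkString(",")) // 1,2,2,2,3,3,4,4
// distinct removes the duplicates afterwards
println(rdd1.union(rdd2).distinct().collect().sorted.mkString(",")) // 1,2,3,4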
3.2 groupByKey
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("GroupByKeyAPP").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val list = List(("hadoop", "MapReduce"), ("hadoop", "hive"), ("Spark", "SparkSQL"), ("Spark", "SparkStreaming"))
  val listRDD = sc.parallelize(list)
  /**
   * groupByKey groups the elements of a pair RDD that share the same key.
   */
  val groupByKeyRDD = listRDD.groupByKey()
  groupByKeyRDD.foreach(tuple => {
    val key = tuple._1
    val valuesIter = tuple._2.iterator
    var values = ""
    while (valuesIter.hasNext) {
      values = values + valuesIter.next + " "
    }
    println(key + " -> " + values)
  })
}
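The while loop above just concatenates the grouped values; the same result can be obtained more idiomatically with mapValues and mkString, as in this sketch (assuming the same listRDD):

listRDD.groupByKey()
  .mapValues(values => values.mkString(" ")) // join the grouped values into one string (value order within a group may vary)
  .foreach { case (key, values) => println(key + " -> " + values) }
// hadoop -> MapReduce hive
// Spark -> SparkSQL SparkStreaming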
3.3 join
def join(): Unit = {
  val list1 = List((1, "Apache"), (2, "Nginx"), (3, "Tomcat"))
  val list2 = List((1, 99), (2, 98), (3, 97))
  val list1RDD = sc.parallelize(list1)
  val list2RDD = sc.parallelize(list2)
  /**
   * join merges two pair RDDs, grouping together the elements that share the same key; it can loosely be
   * thought of as a combination of groupByKey and union.
   */
  val joinRDD = list1RDD.join(list2RDD)
  joinRDD.foreach(t => {
    println("id: " + t._1 + " name: " + t._2._1 + " score: " + t._2._2)
  })
}
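join is an inner join: only keys present in both RDDs survive. A small sketch (assuming the same sc, with deliberately mismatched hypothetical keys):

val names = sc.parallelize(List((1, "Apache"), (2, "Nginx"), (4, "Jetty")))
val scores = sc.parallelize(List((1, 99), (2, 98), (3, 97)))
// only keys 1 and 2 appear in both RDDs, so keys 3 and 4 are dropped from the result
names.join(scores).foreach(println(_))
// (1,(Apache,99))
// (2,(Nginx,98))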