Spark union, intersection, and subtract

union, intersection, and subtract are all transformation operators: they are lazily evaluated and only execute when an action such as collect is triggered.
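
The snippets below assume an existing SparkSession named spark. A minimal local setup (the app name and master here are illustrative, not from the original post) could look like:

    import org.apache.spark.sql.SparkSession

    // Hypothetical local session; any SparkSession/SparkContext works the same way.
    val spark = SparkSession.builder()
      .appName("rdd-set-ops-demo")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("error")  // quiet the logs, as in the snippets below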

1. union merges two datasets. The two RDDs must have the same element type, and because the parent partitions are simply concatenated (no shuffle), the new RDD's partition count is the sum of the parents' partition counts:

    val kzc = spark.sparkContext.parallelize(
      List(("hive", 8), ("apache", 8), ("hive", 30), ("hadoop", 18)), 2)  // 2 partitions
    val bd = spark.sparkContext.parallelize(
      List(("hive", 18), ("test", 2), ("spark", 20)), 1)                  // 1 partition
    val result = bd.union(kzc)
    println(result.partitions.size)  // 1 + 2 = 3
    println("*******************")
    result.collect().foreach(println(_))

Result:

3
*******************
(hive,18)
(test,2)
(spark,20)
(hive,8)
(apache,8)
(hive,30)
(hadoop,18)
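
Unlike SQL's UNION, RDD.union keeps duplicates (it behaves like UNION ALL); note that (hive,18) and (hive,8) both survive above because union never compares elements. A minimal sketch of deduplicating afterwards, reusing the RDDs above (here the pairs are already distinct, so the output is unchanged):

    val deduped = bd.union(kzc).distinct()  // removes fully identical (key, value) pairs
    deduped.collect().foreach(println(_))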

2. intersection returns the elements common to both RDDs. It shuffles and deduplicates its output, and the new RDD's partition count matches whichever parent has more partitions (here max(2, 1) = 2):

    spark.sparkContext.setLogLevel("error")
    val kzc = spark.sparkContext.parallelize(
      List(("hive", 8), ("apache", 8), ("hive", 30), ("hadoop", 18)), 2)  // 2 partitions
    val bd = spark.sparkContext.parallelize(
      List(("hive", 8), ("test", 2), ("spark", 20)), 1)                   // 1 partition
    val result = bd.intersection(kzc)
    println(result.partitions.size)  // max(2, 1) = 2
    println("*******************")
    result.collect().foreach(println(_))  // only ("hive", 8) appears in both

Result:

2
*******************
(hive,8)
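
intersection also has an overload that takes an explicit target partition count, useful when the default (the larger parent's count) is not what you want. A minimal sketch:

    val result4 = bd.intersection(kzc, 4)  // request 4 output partitions
    println(result4.partitions.size)       // 4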

3. subtract removes from the left RDD every element that also appears in the other RDD (i.e., it subtracts the intersection). The new RDD's partition count matches the RDD that subtract is called on (the left operand, bd here):

    spark.sparkContext.setLogLevel("error")
    val kzc = spark.sparkContext.parallelize(
      List(("hive", 8), ("apache", 8), ("hive", 30), ("hadoop", 18)), 2)  // 2 partitions
    val bd = spark.sparkContext.parallelize(
      List(("hive", 8), ("test", 2), ("spark", 20)), 1)                   // 1 partition
    val result = bd.subtract(kzc)
    println(result.partitions.size)  // matches bd: 1
    println("*******************")
    result.collect().foreach(println(_))  // ("hive", 8) is removed

Result:

1
*******************
(test,2)
(spark,20)
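
Because these RDDs hold key-value pairs, it is worth contrasting subtract, which compares whole (key, value) elements, with subtractByKey, which drops a pair whenever its key appears in the other RDD regardless of the value. A minimal sketch reusing the RDDs above (here the outputs happen to coincide; they would differ if bd contained, say, ("hive", 99)):

    val byKey = bd.subtractByKey(kzc)  // "hive" is dropped because kzc has that key
    byKey.collect().foreach(println(_))
    // (test,2)
    // (spark,20)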
