bioamin

January 11, 2021

Summary: Accumulator schematic. Creating accumulators: sc.longAccumulator(""), sc.longAccumulator, sc.collectionAccumulator(), sc.collectionAccumulator, sc.doubleAccumulator(), sc.doubleAccumulat Read more
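
To make the three creation calls concrete, here is a minimal sketch that assumes a local SparkSession. The accumulator names and the input numbers are made up for illustration and do not come from the original post.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; on a cluster the master is set elsewhere.
val spark = SparkSession.builder().master("local[2]").appName("accumulator-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical accumulator names, chosen only for this example.
val evenCount = sc.longAccumulator("evenCount")
val total     = sc.doubleAccumulator("total")
val oddSeen   = sc.collectionAccumulator[Int]("oddSeen")

// Tasks only add to accumulators; the driver reads the merged value afterwards.
sc.parallelize(1 to 10, 2).foreach { n =>
  if (n % 2 == 0) evenCount.add(1L) else oddSeen.add(n)
  total.add(n.toDouble)
}

val evens = evenCount.value.longValue   // 5
val sum   = total.value.doubleValue     // 55.0
val odds  = oddSeen.value               // java.util.List[Int]; element order is not guaranteed
println(s"evens=$evens sum=$sum odds=$odds")

spark.stop()
```

Updating accumulators inside an action such as foreach is the safer choice: Spark applies each task's update exactly once there, while updates made inside transformations can be applied again if a task is re-executed.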

posted @ 2021-01-11 18:58 bioamin Views (317) Comments (0) Recommendations (0)

Spark broadcast variables and accumulators

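Summary: How should broadcast variables be understood? Use case: large variables, such as a collection of 100 MB or more. When an operator function uses an external variable, Spark by default copies that variable and ships a copy over the network to every task, so each task holds its own copy. If the variable is large (say 100 MB, or even 1 GB), the performance cost of sending that many copies over the network, and in each Read more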
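
As a rough illustration of the alternative, the sketch below wraps a lookup table in a broadcast variable so that it is shipped once per executor instead of once per task. The tiny Map stands in for a large collection, and all names and data here are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("broadcast-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical lookup table; in practice this would be the large (100 MB+) collection.
val lookup = Map("hive" -> "warehouse", "hadoop" -> "platform")

// Broadcast it once; tasks read it through .value instead of capturing `lookup` in the closure.
val lookupBc = sc.broadcast(lookup)

val words  = sc.parallelize(Seq("hive", "spark", "hadoop"), 2)
val tagged = words.map(w => (w, lookupBc.value.getOrElse(w, "unknown"))).collect()

tagged.foreach(println)   // (hive,warehouse), (spark,unknown), (hadoop,platform)

spark.stop()
```

A broadcast value should be treated as read-only inside tasks; when tasks need to report something back to the driver, an accumulator is the usual tool.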

posted @ 2021-01-11 17:02 bioamin Views (89) Comments (0) Recommendations (0)

Spark countByKey && countByValue

Summary: countByKey and countByValue are both action operators. Their results are returned to the driver, so no separate collect is needed before printing them. spark.sparkContext.setLogLevel("error") val bd=spark.sparkContext.parallelize(List Read more
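
The difference between the two is easiest to see on a small pair RDD. This is a sketch with made-up sample data (the original snippet is cut off, so its data is not reproduced here).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("count-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical sample data with one duplicated element, ("hive", 8).
val pairs = sc.parallelize(List(("hive", 8), ("apache", 8), ("hive", 30), ("hive", 8)))

// countByKey counts records per key and returns a local Map on the driver.
val byKey = pairs.countByKey()      // hive -> 3, apache -> 1

// countByValue counts identical elements; on a pair RDD the "value" is the whole (key, value) tuple.
val byValue = pairs.countByValue()  // (hive,8) -> 2, (apache,8) -> 1, (hive,30) -> 1

println(byKey)
println(byValue)

spark.stop()
```

Because both results are plain Maps held in driver memory, they are only a good fit when the number of distinct keys or elements is small.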

posted @ 2021-01-11 16:44 bioamin Views (262) Comments (0) Recommendations (0)

Spark zip && zipPartitions && zipWithIndex && zipWithUniqueId

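Summary: zip is a transformation operator that combines the elements of two RDDs (key-value or not) into a single key-value RDD. The two RDDs must have the same number of partitions and the same number of elements in each partition. spark.sparkContext.setLogLevel("error") spark.sparkContext.setLogLevel("er Read more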
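
A small sketch of zip, with zipWithIndex alongside for comparison. The two input lists are invented for the example and are deliberately the same length and split into the same number of partitions, which is what zip requires.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("zip-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical inputs: 4 elements each, 2 partitions each.
val names  = sc.parallelize(List("hive", "spark", "hadoop", "flink"), 2)
val scores = sc.parallelize(List(8, 30, 18, 5), 2)

// zip pairs elements position by position.
val zipped = names.zip(scores).collect()
// Array((hive,8), (spark,30), (hadoop,18), (flink,5))

// zipWithIndex pairs each element with its Long position in the RDD.
val indexed = names.zipWithIndex().collect()
// Array((hive,0), (spark,1), (hadoop,2), (flink,3))

zipped.foreach(println)
indexed.foreach(println)

spark.stop()
```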

posted @ 2021-01-11 16:34 bioamin Views (319) Comments (0) Recommendations (0)

Spark mapPartitionsWithIndex && repartition && coalesce

Summary: mapPartitionsWithIndex is a transformation operator. Each call receives the data of one partition together with that partition's index. spark.sparkContext.setLogLevel("error")val kzc=spark.sparkContext.parallelize(List(("hi Read more
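
The following sketch tags every element with the index of the partition it lives in, then changes the partition count with repartition and coalesce. The input list is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("partition-index-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical data, explicitly split into 2 partitions.
val rdd = sc.parallelize(List("hive", "spark", "hadoop", "flink"), 2)

// The function receives (partition index, iterator over that partition's elements).
val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(x => s"partition $idx: $x")
}.collect()
tagged.foreach(println)

// repartition always shuffles and can raise or lower the partition count.
val more = rdd.repartition(4).getNumPartitions    // 4

// coalesce avoids a shuffle by default, so it is normally used to reduce the count.
val fewer = rdd.coalesce(1).getNumPartitions      // 1

println(s"more=$more fewer=$fewer")

spark.stop()
```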

posted @ 2021-01-11 15:11 bioamin Views (226) Comments (0) Recommendations (0)

Spark foreachPartition

Summary: foreachPartition is an action operator. Whereas foreach receives one record per call, foreachPartition receives the data of a whole partition (as an iterator) per call. result2.foreachPartition(x=>{ println("**********") w Read more
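
To show that the function runs once per partition rather than once per record, this sketch counts both with accumulators. The RDD contents and accumulator names are assumptions for the example, not the truncated result2 code from the post.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("foreachPartition-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical data: 6 records in 2 partitions.
val rdd = sc.parallelize(1 to 6, 2)

val partitionCalls = sc.longAccumulator("partitionCalls")
val recordsSeen    = sc.longAccumulator("recordsSeen")

rdd.foreachPartition { iter =>
  println("**********")      // printed once per partition
  partitionCalls.add(1L)     // per-partition setup (e.g. opening a connection) would go here
  while (iter.hasNext) {
    println(iter.next())
    recordsSeen.add(1L)
  }
}

val calls   = partitionCalls.value.longValue   // 2
val records = recordsSeen.value.longValue      // 6
println(s"calls=$calls records=$records")

spark.stop()
```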

posted @ 2021-01-11 14:14 bioamin Views (498) Comments (0) Recommendations (0)

Spark mapPartitions

Summary: mapPartitions is a transformation operator, mainly suited to programs that need to establish a connection, such as writing data to a database. val kzc=spark.sparkContext.parallelize(List(("hive",8),("apache",8),("hive",30),("hadoop",1 Read more
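
The per-partition pattern can be sketched without a real database. Below, an accumulator stands in for "open one connection per partition"; the sample data, the accumulator name, and the string formatting are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("mapPartitions-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical sample data in 2 partitions.
val kzc = sc.parallelize(List(("hive", 8), ("apache", 8), ("hive", 30), ("hadoop", 18)), 2)

// Stand-in for a connection counter; a real job would open a database connection instead.
val connectionsOpened = sc.longAccumulator("connectionsOpened")

val written = kzc.mapPartitions { iter =>
  connectionsOpened.add(1L)                                // once per partition, not once per record
  val rows = iter.map { case (k, v) => s"$k=$v" }.toList   // materialize before the "connection" would be closed
  rows.iterator
}.collect()

val opened = connectionsOpened.value.longValue   // 2
written.foreach(println)
println(s"opened=$opened")

spark.stop()
```

Calling toList holds a whole partition in memory, which is acceptable for a sketch; for large partitions it is usually better to stream through the iterator and close the connection only after the last element has been consumed.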

posted @ 2021-01-11 10:28 bioamin Views (434) Comments (0) Recommendations (0)

January 5, 2021

Spark union, intersection, subtract

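Summary: union, intersection, and subtract are all transformation operators. 1. union merges two datasets. Both datasets must have the same element type, and the number of partitions of the new RDD is the sum of the partition counts of the RDDs being merged. val kzc=spark.sparkContext.parallelize(List(("hi Read more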
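
A short sketch of the three set-style operators on two small integer RDDs. The data and partition counts are chosen only to make the partition arithmetic visible.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("set-ops-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical inputs: 2 partitions and 3 partitions.
val a = sc.parallelize(List(1, 2, 3, 4), 2)
val b = sc.parallelize(List(3, 4, 5), 3)

// union keeps duplicates and concatenates the partitions of both inputs.
val u          = a.union(b)
val unionParts = u.getNumPartitions          // 2 + 3 = 5
val unionAll   = u.collect()                 // Array(1, 2, 3, 4, 3, 4, 5)

// intersection returns the distinct elements present in both RDDs (sorted here for stable printing).
val common = a.intersection(b).collect().sorted   // Array(3, 4)

// subtract returns the elements of a that do not appear in b.
val onlyA = a.subtract(b).collect().sorted        // Array(1, 2)

println(s"parts=$unionParts union=${unionAll.mkString(",")} common=${common.mkString(",")} onlyA=${onlyA.mkString(",")}")

spark.stop()
```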

posted @ 2021-01-05 17:37 bioamin Views (114) Comments (0) Recommendations (0)

Spark join operators

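Summary: join, leftOuterJoin, rightOuterJoin, and fullOuterJoin are all transformation operators that act on key-value RDDs. They join on the key K: (K,V) join (K,W) returns (K,(V,W)). With default settings, the joined RDD has as many partitions as the input RDD with more partitions. join val kzc= Read more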
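
Here is a sketch of how the result types differ between the join variants. The two pair RDDs are invented for the example, and the partition count shown assumes spark.default.parallelism has not been set.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("join-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical inputs with 2 and 3 partitions; "hadoop" and "flink" each appear on one side only.
val left  = sc.parallelize(List(("hive", 8), ("spark", 30), ("hadoop", 18)), 2)
val right = sc.parallelize(List(("hive", "sql"), ("spark", "engine"), ("flink", "stream")), 3)

// Inner join keeps only keys present on both sides: (K, (V, W)).
val inner = left.join(right).collectAsMap()
// hive -> (8,sql), spark -> (30,engine)

// Left outer join keeps every left key; the right side becomes an Option.
val leftOuter = left.leftOuterJoin(right).collectAsMap()
// hadoop -> (18,None)

// Full outer join keeps every key; both sides become Options.
val fullOuter = left.fullOuterJoin(right).collectAsMap()
// flink -> (None,Some(stream))

// Assumption: spark.default.parallelism is unset, so the larger input's partition count is used.
val joinParts = left.join(right).getNumPartitions   // 3 under that assumption

println(inner)
println(leftOuter)
println(fullOuter)
println(joinParts)

spark.stop()
```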

posted @ 2021-01-05 17:13 bioamin Views (387) Comments (0) Recommendations (0)

January 4, 2021

Spark action operators

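Summary: Action operators trigger Spark to actually run the computation and mark job boundaries: each action call triggers a job. Operators that involve a shuffle (data from one partition goes to multiple partitions), such as reduceByKey, mark stage boundaries. The action operators are: 1. count() returns the number of elements in the dataset. Once the result has been computed it is sent back to the Drive Read more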
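
The lazy-until-action behaviour can be sketched as follows. The word list is made up for the example; the transformations only describe the computation, and each action call then runs a job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("action-sketch").getOrCreate()
spark.sparkContext.setLogLevel("error")
val sc = spark.sparkContext

// Hypothetical input data.
val words = sc.parallelize(List("hive", "spark", "hive", "hadoop"), 2)

// Transformations only: nothing runs yet. reduceByKey introduces a shuffle, hence a stage boundary.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// Each call below is an action and triggers a job; results come back to the driver.
val distinctWords = counts.count()            // 3
val perWord       = counts.collect().toMap    // hive -> 2, spark -> 1, hadoop -> 1
val firstWord     = words.first()             // hive
val firstTwo      = words.take(2)             // Array(hive, spark)

println(s"$distinctWords $perWord $firstWord ${firstTwo.mkString(",")}")

spark.stop()
```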

posted @ 2021-01-04 19:14 bioamin Views (399) Comments (0) Recommendations (0)

Pursuing the dream of starting a business
