Spark Summary

Operator Summary

　　1. Transformations: filtering, mapping, deduplication, sorting, and partitioning

　　　　filter: filtering. It never triggers repartitioning.

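　　　　map, flatMap, flatMapValues, mapValues, mapPartitions, mapPartitionsWithIndex, mapPartitionsWithSplit, zip, zipWithIndex, zipWithUniqueId, reduceByKey

　　　　　　Mapping operations. Apart from reduceByKey, none of them shuffles, so the partition layout stays as it is. The preservesPartitioning parameter (default False; the second parameter of map, flatMap, mapPartitions, and mapPartitionsWithIndex) does not repartition anything: passing True only tells Spark that the function leaves the keys untouched, so the existing partitioner can be kept. mapValues and flatMapValues always keep the partitioner. map is a one-to-one transformation and flatMap is one-to-many. reduceByKey is the exception in this list: it merges the values of each key, normally shuffles, and accepts numPartitions. mapPartitionsWithSplit is deprecated in favor of mapPartitionsWithIndex.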

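　　　　distinct: deduplication. It shuffles, because equal elements from different partitions have to be brought together. Its only parameter, numPartitions, sets the number of partitions of the result; it has no preservesPartitioning parameter.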

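　　　　sortBy, sortByKey: sorting. Sorting redistributes the data into range partitions; the numPartitions parameter sets the number of partitions after the sort.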

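　　　　glom, coalesce, partitionBy, repartition: partitioning operations. glom turns each partition into a list. coalesce changes the number of partitions; its second parameter, shuffle, controls whether a shuffle is performed. Without a shuffle it simply merges existing partitions, so it can only reduce their number. repartition always shuffles, and partitionBy redistributes a key-value RDD by key.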
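
　　The difference between map and flatMap, and what coalesce does without a shuffle, can be sketched with a toy model in plain Python. This is not Spark code: the "RDD" is just a list of partitions, and the helper names (map_rdd, flat_map_rdd, coalesce_no_shuffle) are made up for illustration. Merging neighbouring partitions is an assumption of the model; Spark chooses which partitions to merge on its own.

```python
from itertools import chain

# Toy "RDD": a list of partitions, each partition a list of elements.
# Printing this structure is roughly what glom().collect() would show.
partitions = [[1, 2], [3], [4, 5, 6]]


def map_rdd(f, parts):
    # One output element per input element; partition boundaries unchanged.
    return [[f(x) for x in part] for part in parts]


def flat_map_rdd(f, parts):
    # Zero or more output elements per input element, flattened per partition.
    return [list(chain.from_iterable(f(x) for x in part)) for part in parts]


def coalesce_no_shuffle(parts, n):
    # Merge whole partitions into n groups; single elements are never moved.
    # Assumption of this model: neighbouring partitions are merged.
    n = max(1, min(n, len(parts)))  # without a shuffle the count cannot grow
    size, extra = divmod(len(parts), n)
    merged, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        merged.append(list(chain.from_iterable(parts[start:end])))
        start = end
    return merged


mapped = map_rdd(lambda x: x * 10, partitions)
flat = flat_map_rdd(lambda x: [x, x], partitions)
merged = coalesce_no_shuffle(partitions, 2)

print(mapped)  # [[10, 20], [30], [40, 50, 60]]
print(flat)    # [[1, 1, 2, 2], [3, 3], [4, 4, 5, 5, 6, 6]]
print(merged)  # [[1, 2, 3], [4, 5, 6]]
```

　　In the model, map and flatMap leave the three partitions in place, while coalesce only glues whole partitions together, which is why it is cheap compared with a shuffle.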

　　2. Operations across RDDs: union, join, grouping, intersection, and difference

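　　　　join, fullOuterJoin, leftOuterJoin, rightOuterJoin: inner join and outer joins on key-value RDDs. The second parameter, numPartitions, sets the number of partitions of the result.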

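　　　　groupBy, groupByKey, groupWith: grouping. groupBy takes numPartitions as its second parameter and partitionFunc (the partitioning function) as its third; groupByKey takes the same two as its first and second parameters. groupWith, an alias of cogroup that accepts several RDDs, takes only the RDDs to group with.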

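　　　　intersection, subtract, subtractByKey, union: intersection, difference, and union. union keeps duplicates, while intersection returns each common element only once.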
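
　　The four join variants listed above differ only in which keys survive and in where None fills a missing side. A minimal pure-Python sketch of that behaviour on lists of (key, value) pairs follows. The join helper and its how argument are hypothetical names for this sketch, not part of the Spark API, and real Spark does not guarantee any output order.

```python
from collections import defaultdict


def _group(pairs):
    # Collect all values that share a key, keeping first-seen key order.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def join(left, right, how="inner"):
    """Join two lists of (key, value) pairs; a missing side becomes None."""
    lhs, rhs = _group(left), _group(right)
    if how == "inner":      # like join
        keys = [k for k in lhs if k in rhs]
    elif how == "left":     # like leftOuterJoin
        keys = list(lhs)
    elif how == "right":    # like rightOuterJoin
        keys = list(rhs)
    elif how == "full":     # like fullOuterJoin
        keys = list(lhs) + [k for k in rhs if k not in lhs]
    else:
        raise ValueError(f"unknown join type: {how}")
    out = []
    for k in keys:
        # Every left value is paired with every right value of the same key.
        for lv in lhs.get(k, [None]):
            for rv in rhs.get(k, [None]):
                out.append((k, (lv, rv)))
    return out


left = [("a", 1), ("b", 2)]
right = [("a", 10), ("c", 30)]

print(join(left, right, "inner"))  # [('a', (1, 10))]
print(join(left, right, "left"))   # [('a', (1, 10)), ('b', (2, None))]
print(join(left, right, "right"))  # [('a', (1, 10)), ('c', (None, 30))]
print(join(left, right, "full"))   # [('a', (1, 10)), ('b', (2, None)), ('c', (None, 30))]
```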

　　3. Actions: fetching data, computing, and saving

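　　　　take, sample, takeSample, top, head, first: reading data. Note that sample is a transformation that returns a new RDD rather than results, and that head belongs to the DataFrame API rather than to RDD.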

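　　　　reduce, sum, stdev, sumApprox, variance, aggregate, fold, count: computations. count returns the number of elements, sum the total, stdev the standard deviation, and variance the variance; sumApprox returns an approximate sum within a timeout. reduce, aggregate, and fold apply user-defined functions: fold is a reduce with an initial value, and aggregate also takes an initial value but lets the result type differ from the element type of the RDD.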

　　　　saveAsTextFile, saveAsHadoopFile, saveAsSequenceFile, saveAsNewAPIHadoopFile: save the RDD to the local file system or to a Hadoop file system.
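
　　How fold and aggregate relate to reduce is easier to see in a small model. The sketch below is plain Python, not Spark: the "RDD" is a list of two partitions, and fold_rdd and aggregate_rdd are made-up helper names. It assumes the usual two-phase evaluation, in which each partition is reduced first and the partial results are merged afterwards, with the initial value used in both phases.

```python
import statistics
from functools import reduce

partitions = [[1, 2], [3, 4]]  # toy RDD with two partitions


def fold_rdd(parts, zero, op):
    # zero seeds every partition and then seeds the final merge once more.
    per_part = [reduce(op, part, zero) for part in parts]
    return reduce(op, per_part, zero)


def aggregate_rdd(parts, zero, seq_op, comb_op):
    # seq_op folds one element into an accumulator (per partition);
    # comb_op merges two accumulators (across partitions).
    per_part = [reduce(seq_op, part, zero) for part in parts]
    return reduce(comb_op, per_part, zero)


def add(a, b):
    return a + b


total = fold_rdd(partitions, 0, add)   # 10
skewed = fold_rdd(partitions, 1, add)  # 13: the 1 is added once per partition and once in the merge

# The accumulator is a (sum, count) tuple while the elements are ints,
# so the result type differs from the element type.
sum_count = aggregate_rdd(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = sum_count[0] / sum_count[1]

# stdev is the population standard deviation, here via the standard library.
flat = [x for part in partitions for x in part]
spread = statistics.pstdev(flat)

print(total, skewed, sum_count, mean)  # 10 13 (10, 4) 2.5
```

　　The skewed result is the reason the initial value of fold should be the neutral element of the operation (0 for addition, 1 for multiplication).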

Python Syntax Notes

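　　>>> nums = [1, 2, 3, 4]

　　>>> 2 if len(nums) > 1 else 3

　　2

　　Python has no C-style ?: ternary operator; the conditional expression above has the same effect.

　　>>> [i * 10 for i in nums]

　　[10, 20, 30, 40]

　　A list comprehension: the equivalent of a map operation over a collection.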
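
　　For comparison, the same transformation can be written with the built-in map, and a comprehension can also filter or embed a conditional expression. A short sketch with illustrative variable names:

```python
nums = [1, 2, 3, 4]

# map() is lazy in Python 3, so list() is needed to materialize the result.
by_map = list(map(lambda n: n * 10, nums))
by_comprehension = [n * 10 for n in nums]

# A trailing "if" filters elements, which plain map() cannot do.
even_scaled = [n * 10 for n in nums if n % 2 == 0]

# A conditional expression inside the comprehension picks a value per element.
labels = ["big" if n > 2 else "small" for n in nums]

print(by_map)       # [10, 20, 30, 40]
print(even_scaled)  # [20, 40]
print(labels)       # ['small', 'small', 'big', 'big']
```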

posted on 2019-04-16 11:39 杨杨09265
