Common Spark operators -- RDD batch processing

I. RDD operators:

These apply to plain RDDs or to RDDs of (K, V) pairs

1. Transformation operators

map: transforms each element; exactly one output element per input element (1:1)

filter(func): keeps only the elements for which func returns true

flatMap: also transforms each element, but each input element can produce zero or more output elements (1:N)
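The 1:1 versus 1:N distinction can be sketched in plain Python, without Spark. This is only a model of the semantics on an ordinary list; `lines` and the `*_model` helpers are illustrative names, not Spark API. In PySpark the corresponding calls would be `rdd.map`, `rdd.filter` and `rdd.flatMap`.

```python
from itertools import chain

# Illustrative input (assumption): a small list standing in for an RDD of lines.
lines = ["a b", "", "c d e"]

def map_model(func, data):
    # map: exactly one output element per input element (1:1).
    return [func(x) for x in data]

def filter_model(pred, data):
    # filter: keep only the elements for which pred is true.
    return [x for x in data if pred(x)]

def flat_map_model(func, data):
    # flatMap: func returns a sequence per element and the results
    # are flattened into a single sequence (1:N, where N may be 0).
    return list(chain.from_iterable(func(x) for x in data))

mapped = map_model(str.split, lines)           # one list per line, even if empty
kept = filter_model(lambda s: s != "", lines)  # the empty line is dropped
flat = flat_map_model(str.split, lines)        # all words in one flat list
```

Note that `mapped` still has three elements (one of them an empty list), while `flat` has five words and no trace of the empty line.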

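mapPartitions: applies the function once per partition rather than once per element; more efficient, but if a partition holds a lot of data there is a risk of running out of memory (OOM)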

mapPartitionsWithIndex(func): like mapPartitions, but func also receives the index of the partition
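A rough plain-Python model of the per-partition operators, assuming an RDD can be pictured as a list of partitions (a list of lists). The helper names are hypothetical. The point is that the function receives an iterator over a whole partition; consuming it lazily, as `sum_partition` does, avoids building the full partition in memory.

```python
# Illustrative data split into three "partitions" (assumption: a list of lists).
partitions = [[1, 2, 3], [4, 5], [6]]

def map_partitions_model(func, parts):
    # func is called once per partition with an iterator over its elements.
    return [list(func(iter(p))) for p in parts]

def map_partitions_with_index_model(func, parts):
    # Same, but func also receives the partition index.
    return [list(func(i, iter(p))) for i, p in enumerate(parts)]

def sum_partition(it):
    # Streams through the iterator without materializing a list.
    yield sum(it)

def tag_with_index(index, it):
    # Pairs every element with the index of the partition it lives in.
    for x in it:
        yield (index, x)

sums = map_partitions_model(sum_partition, partitions)
tagged = map_partitions_with_index_model(tag_with_index, partitions)
```

Here `sums` holds one total per partition and `tagged` shows which partition each element belongs to.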

sample: draws a random subset of the data

union: union of two datasets (duplicates are kept)

intersection: intersection of two datasets

distinct: removes duplicate elements
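The three set-like operators above can be modeled on plain Python lists. This is a sketch of the semantics only, with illustrative inputs `a` and `b`; one detail worth noting is that an RDD union keeps duplicates, whereas intersection and distinct do not. Results are sorted here just to make them deterministic; Spark itself gives no ordering guarantee.

```python
# Illustrative inputs (assumption): two small lists standing in for RDDs.
a = [1, 2, 2, 3]
b = [2, 3, 3, 4]

# union: simple concatenation, duplicates are kept.
union = a + b

# intersection: elements present in both inputs, without duplicates.
intersection = sorted(set(a) & set(b))

# distinct: each element once.
distinct = sorted(set(a))
```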

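groupByKey: on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs; there is no pre-aggregation, so a large amount of data goes through the shuffle and there is a risk of running out of memory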

reduceByKey: aggregates values by key, merging on the map side before the shuffle, similar to a combiner. The reduce function must be of type (V,V) => V.

aggregateByKey: aggregates by key as well, but the result type may differ from the value type
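The difference between the three by-key operators can be sketched in plain Python. This is a simplified model, assuming partitions are a list of lists and the "shuffle" is just the list of records that leave their partition; all helper names are hypothetical. Each function also returns how many records were shuffled, to show the effect of map-side merging.

```python
from collections import defaultdict

# Illustrative (K, V) data in two "partitions" (assumption: a list of lists).
partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]

def group_by_key_model(parts):
    # Every (k, v) record crosses the shuffle; nothing is merged beforehand.
    shuffled = [kv for p in parts for kv in p]
    groups = defaultdict(list)
    for k, v in shuffled:
        groups[k].append(v)
    return dict(groups), len(shuffled)

def aggregate_by_key_model(zero, seq_op, comb_op, parts):
    # Map side: fold each partition's values into one accumulator per key.
    # zero is assumed to be immutable (for example a number or a tuple).
    partials = []
    for p in parts:
        acc = {}
        for k, v in p:
            acc[k] = seq_op(acc.get(k, zero), v)
        partials.append(acc)
    # Only one record per key per partition is shuffled.
    shuffled = [kv for acc in partials for kv in acc.items()]
    # Reduce side: merge the per-partition accumulators.
    result = {}
    for k, a in shuffled:
        result[k] = comb_op(result[k], a) if k in result else a
    return result, len(shuffled)

def reduce_by_key_model(func, parts):
    # Same idea with a single (V, V) => V function and no zero value.
    partials = []
    for p in parts:
        acc = {}
        for k, v in p:
            acc[k] = func(acc[k], v) if k in acc else v
        partials.append(acc)
    shuffled = [kv for acc in partials for kv in acc.items()]
    result = {}
    for k, v in shuffled:
        result[k] = func(result[k], v) if k in result else v
    return result, len(shuffled)

groups, n_group = group_by_key_model(partitions)
sums, n_reduce = reduce_by_key_model(lambda x, y: x + y, partitions)
# (sum, count) per key: the accumulator type differs from the value type.
stats, n_agg = aggregate_by_key_model(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
    partitions,
)
```

In this toy input groupByKey shuffles all five records, while the two pre-aggregating variants shuffle four; on real data with many values per key the gap is much larger.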

sortByKey([ascending], [numTasks]): sorts by key

join: on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs

cogroup: on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples.
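The result shapes of join and cogroup are easier to see on a small example. The following is a plain-Python model, not Spark code; `left`, `right` and the helper names are illustrative assumptions. It shows that cogroup keeps every key from either side, while an inner join only emits keys present on both sides, one pair per (V, W) combination.

```python
from collections import defaultdict

# Illustrative inputs (assumption): (K, V) and (K, W) pairs as plain lists.
left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y")]

def cogroup_model(xs, ys):
    # For every key in either input: (list of left values, list of right values).
    groups = defaultdict(lambda: ([], []))
    for k, v in xs:
        groups[k][0].append(v)
    for k, w in ys:
        groups[k][1].append(w)
    return dict(groups)

def join_model(xs, ys):
    # Inner join: all (v, w) combinations for keys present on both sides.
    return [(k, (v, w))
            for k, (vs, ws) in cogroup_model(xs, ys).items()
            for v in vs
            for w in ws]

grouped = cogroup_model(left, right)
joined = join_model(left, right)
```

Keys "b" and "c" appear in `grouped` with one empty side each, and are absent from `joined`.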

cartesian(otherDataset): on datasets of types T and U, returns a dataset of (T, U) pairs (the Cartesian product)

coalesce(numPartitions): decreases the number of partitions; useful after a large dataset has been filtered down to far fewer elements.

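repartition(numPartitions): reshuffles the data into more or fewer partitions; typically used to increase the partition count, and always performs a full shuffle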
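A simplified plain-Python picture of the two repartitioning operators, assuming an RDD is a list of partitions. The helper names and the exact placement rules are illustrative assumptions, not how Spark assigns records internally. The idea it tries to capture: coalesce without a shuffle only merges whole partitions (so it cannot raise the count), while repartition redistributes individual records.

```python
# Illustrative data in four "partitions" (assumption: a list of lists).
partitions = [[1, 2], [3], [4, 5, 6], [7]]

def coalesce_model(parts, n):
    # No shuffle: whole partitions are merged, records are not redistributed,
    # so the partition count can only stay the same or go down.
    n = min(n, len(parts))
    merged = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        merged[i * n // len(parts)].extend(p)
    return merged

def repartition_model(parts, n):
    # Full shuffle: every record may move; here records are dealt round-robin.
    out = [[] for _ in range(n)]
    for i, x in enumerate(x for p in parts for x in p):
        out[i % n].append(x)
    return out

fewer = coalesce_model(partitions, 2)
unchanged = coalesce_model(partitions, 8)   # cannot grow without a shuffle
spread = repartition_model(partitions, 3)
```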

2. Action operators

reduce: aggregates the elements with a function and returns a single value

count: number of elements

collect: returns all elements to the driver as an array

first: returns the first element

take: returns an array with the first n elements

takeOrdered(n, [ordering]): returns the first n elements in natural order or according to a custom ordering
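The actions listed so far each return a plain value or array to the driver. A plain-Python sketch of what they compute, on an illustrative list `data` (an assumption standing in for an RDD); in PySpark the corresponding calls would be `rdd.reduce`, `rdd.count`, `rdd.first`, `rdd.take` and `rdd.takeOrdered`.

```python
import heapq
from functools import reduce
from itertools import islice

# Illustrative input (assumption): a small list standing in for an RDD.
data = [5, 3, 8, 1, 9, 2]

# reduce: fold all elements into a single value.
total = reduce(lambda a, b: a + b, data)

# count: number of elements.
n = len(data)

# first / take(3): elements in their existing order, no sorting involved.
first = next(iter(data))
head = list(islice(data, 3))

# takeOrdered(3): the 3 smallest elements in natural order ...
smallest = heapq.nsmallest(3, data)
# ... or by a custom ordering, here descending via a negated key.
largest = heapq.nsmallest(3, data, key=lambda x: -x)
```

The contrast between `head` and `smallest` is the main point: take returns elements as they come, takeOrdered sorts first.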

saveAsTextFile(path): writes the dataset as a text file (or a set of text files) into a directory (local filesystem, HDFS, ...)

countByKey(): on RDDs of (K, V) pairs, returns a hashmap of (K, Int) pairs with the number of elements for each key
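A small plain-Python model of countByKey, with an illustrative `pairs` list standing in for a (K, V) RDD. It counts how many records carry each key and ignores the values; it does not sum them.

```python
from collections import Counter

# Illustrative input (assumption): (K, V) pairs as a plain list.
pairs = [("a", 10), ("b", 20), ("a", 30)]

# Count occurrences of each key; the values 10, 20, 30 play no role.
counts = dict(Counter(k for k, _ in pairs))
```

Because the result is an ordinary map held on the driver, this is only suitable when the number of distinct keys is small.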

posted @ 2022-06-09 20:24 gaussen126


SAM's DATA RIVER