Comparison of Spark's by-key aggregation functions
combineByKey-->>aggregateByKey-->>foldByKey-->>reduceByKey-->>groupByKey-->>countByKey
0> combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=<function portable_hash>)
>>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
>>> sorted(x.combineByKey(lambda v: [v], lambda c, v: c + [v], lambda c1, c2: c1 + c2).collect())
[('a', [1, 2]), ('b', [1])]
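Conceptually, combineByKey runs createCombiner and mergeValue within each partition, then mergeCombiners across partitions. A pure-Python sketch of that flow (no SparkContext needed; the two-partition split is assumed for illustration):

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    # Phase 1: aggregate within each partition.
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            if k not in acc:
                acc[k] = create_combiner(v)   # first value seen for this key in this partition
            else:
                acc[k] = merge_value(acc[k], v)
        per_partition.append(acc)
    # Phase 2: merge the per-partition results (the shuffle side).
    result = {}
    for acc in per_partition:
        for k, c in acc.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

# Same data as above, split into two assumed partitions.
parts = [[("a", 1), ("b", 1)], [("a", 2)]]
out = combine_by_key(parts, lambda v: [v], lambda c, v: c + [v], lambda c1, c2: c1 + c2)
print(sorted(out.items()))  # [('a', [1, 2]), ('b', [1])]
```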
1> groupByKey(numPartitions=None, partitionFunc=<function portable_hash>)
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.groupByKey().mapValues(len).collect())
[('a', 2), ('b', 1)]
>>> sorted(rdd.groupByKey().mapValues(list).collect())
[('a', [1, 1]), ('b', [1])]
2> reduceByKey(func, numPartitions=None, partitionFunc=<function portable_hash>)
>>> from operator import add
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(add).collect())
[('a', 2), ('b', 1)]
3> aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None, partitionFunc=<function portable_hash>)
--seqFunc folds each value into the accumulator within a partition, starting from zeroValue;
--combFunc merges the accumulators produced by different partitions;
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> seqOp = (lambda x, y: x + y)
>>> combOp = (lambda x, y: x + y)
>>> sorted(rdd.aggregateByKey(0, seqOp, combOp).collect())
[('a', 2), ('b', 1)]
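The point of having separate seqFunc and combFunc is that the accumulator can have a different type from the values. A common illustration is a per-key average built from a (sum, count) accumulator; sketched here in pure Python, with an assumed two-partition split:

```python
def aggregate_by_key(partitions, zero, seq_func, comb_func):
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            # seqFunc folds a value into the accumulator, starting from zeroValue.
            acc[k] = seq_func(acc.get(k, zero), v)
        per_partition.append(acc)
    result = {}
    for acc in per_partition:
        for k, u in acc.items():
            # combFunc merges two accumulators from different partitions.
            result[k] = comb_func(result[k], u) if k in result else u
    return result

parts = [[("a", 1), ("b", 1)], [("a", 3)]]
seq_op = lambda acc, v: (acc[0] + v, acc[1] + 1)   # fold a value into (sum, count)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])  # merge two (sum, count) pairs
sums = aggregate_by_key(parts, (0, 0), seq_op, comb_op)
averages = {k: s / n for k, (s, n) in sums.items()}
print(averages)  # {'a': 2.0, 'b': 1.0}
```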
4> countByKey()
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.countByKey().items())
[('a', 2), ('b', 1)]
5> foldByKey(zeroValue, func, numPartitions=None, partitionFunc=<function portable_hash>)
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> from operator import add
>>> sorted(rdd.foldByKey(0, add).collect())
[('a', 2), ('b', 1)]
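foldByKey is reduceByKey with a starting zeroValue, and the zeroValue seeds each key once per partition, so it should be a neutral element for func (0 for add, 1 for multiply). A pure-Python sketch, assuming a two-partition split:

```python
def fold_by_key(partitions, zero, func):
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = func(acc.get(k, zero), v)  # zeroValue seeds each key per partition
        per_partition.append(acc)
    result = {}
    for acc in per_partition:
        for k, v in acc.items():
            result[k] = func(result[k], v) if k in result else v
    return result

parts = [[("a", 1), ("b", 1)], [("a", 1)]]
print(sorted(fold_by_key(parts, 0, lambda x, y: x + y).items()))
# [('a', 2), ('b', 1)]
# With a non-neutral zero such as 10, every partition adds it again:
print(sorted(fold_by_key(parts, 10, lambda x, y: x + y).items()))
# [('a', 22), ('b', 11)]
```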
----------------------------------------------------------
6> sortByKey(ascending=True, numPartitions=None, keyfunc=<function RDD.<lambda>>)
Sorts ascending (smallest key first) by default; pass ascending=True / False to choose the direction.
>>> tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
>>> tmp2.extend([('whose', 6), ('fleece', 7), ('was', 8), ('white', 9)])
>>> sc.parallelize(tmp2).sortByKey(True, 3, keyfunc=lambda k: k.lower()).collect()
[('a', 3), ('fleece', 7), ('had', 2), ('lamb', 5),...('white', 9), ('whose', 6)]
7> subtractByKey(other, numPartitions=None)
Removes every pair whose key appears in the other RDD; the values in other are ignored.
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtractByKey(y).collect())
[('b', 4), ('b', 5)]
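The same filtering can be sketched in pure Python: build the set of keys present in y, then keep only the pairs from x whose key is outside that set.

```python
def subtract_by_key(x, y):
    # Keys in y act purely as an exclusion set; their values are irrelevant.
    excluded = {k for k, _ in y}
    return [(k, v) for k, v in x if k not in excluded]

x = [("a", 1), ("b", 4), ("b", 5), ("a", 2)]
y = [("a", 3), ("c", None)]
print(sorted(subtract_by_key(x, y)))  # [('b', 4), ('b', 5)]
```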