4. Common RDD Operators: Transformations
RDD Operations
transformations: create a new dataset from an existing one
RDDA --> RDDB
actions: return a value to the driver program after running a computation on the dataset
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
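The lazy behaviour described above can be imitated in plain Python, where the built-in `map` is also lazy. This is only an analogy and does not use Spark: the `calls` list and the `double` function are illustrative names that record when the mapping function actually runs.

```python
from functools import reduce

calls = []  # records each time the mapping function actually runs

def double(x):
    calls.append(x)
    return x * 2

data = [1, 2, 3, 4, 5]

# Like a transformation: this only builds a lazy pipeline, nothing is computed yet.
mapped = map(double, data)
calls_before_action = len(calls)

# Like an action: consuming the pipeline forces the computation.
total = reduce(lambda a, b: a + b, mapped)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, total)  # 0 5 30
```

The mapping function is not called until `reduce` asks for values, which mirrors how Spark defers a transformation until an action needs its result.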
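from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # assumed setup; every function below uses sc

def my_map():
    data = [1, 2, 3, 4, 5]
    rdd1 = sc.parallelize(data)
    rdd2 = rdd1.map(lambda x: x * 2)  # double every element
    print(rdd2.collect())  # [2, 4, 6, 8, 10]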
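def my_filter():
    data = [1, 2, 3, 4, 5]
    # Step-by-step version of the chained call below:
    # rdd1 = sc.parallelize(data)
    # rdd2 = rdd1.map(lambda x: x * 2)
    # rdd3 = rdd2.filter(lambda x: x > 5)
    # print(rdd3.collect())
    # Double every element, then keep only the values greater than 5.
    print(sc.parallelize(data).map(lambda x: x * 2).filter(lambda x: x > 5).collect())  # [6, 8, 10]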
def my_flatMap():
    data = ["hello spark", "hello ming", "hello clay"]
    # flatMap splits each line into words and flattens them into a single RDD of words.
    print(sc.parallelize(data).flatMap(lambda line: line.split(" ")).collect())
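def my_reduceByKey():
    data = ["hello spark", "hello ming", "hello clay"]
    rdd = sc.parallelize(data)
    # Split lines into words, then pair each word with a count of 1.
    mapRdd = rdd.flatMap(lambda line: line.split(" ")).map(lambda x: (x, 1))
    # Sum the counts of each word; the order of the collected pairs is not guaranteed.
    my_reduceByKeyRdd = mapRdd.reduceByKey(lambda a, b: a + b)
    print(my_reduceByKeyRdd.collect())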
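To make the merge step of reduceByKey concrete, the sketch below reproduces the same word count in plain Python, with no Spark involved. The helper `reduce_by_key` is a hypothetical name that only models the per-key folding. Real Spark performs this in parallel across partitions and does not guarantee the order of the output.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key with func (hypothetical local model of reduceByKey)."""
    merged = {}
    for key, value in pairs:
        # The first value for a key is stored as-is; later ones are combined with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

lines = ["hello spark", "hello ming", "hello clay"]
# flatMap step: split every line and flatten the words into one list.
words = [word for line in lines for word in line.split(" ")]
# map step: pair each word with a count of 1.
pairs = [(word, 1) for word in words]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(sorted(counts))  # [('clay', 1), ('hello', 3), ('ming', 1), ('spark', 1)]
```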
union:
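A small sketch of what union returns, using plain Python lists as stand-ins for RDDs so that it runs without a cluster. In PySpark the corresponding call would be `sc.parallelize(a).union(sc.parallelize(b)).collect()`. The variable names are illustrative.

```python
a = [1, 2, 3]
b = [3, 4, 5]

# union combines both datasets and does not remove duplicates,
# so the value 3 appears twice in the result.
union_result = a + b
print(union_result)  # [1, 2, 3, 3, 4, 5]
```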
distinct:
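A small sketch of what distinct returns, again with a plain Python list standing in for an RDD. In PySpark the corresponding call would be `sc.parallelize(data).distinct().collect()`. The sample data is illustrative.

```python
data = [3, 4, 5, 3, 4, 6]

# distinct keeps one copy of each element. Spark does not guarantee the
# order of the result, so it is sorted here to get a stable printout.
distinct_result = sorted(set(data))
print(distinct_result)  # [3, 4, 5, 6]
```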
join:
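join works on RDDs of (key, value) pairs and behaves like an inner join: only keys present on both sides survive, and each result has the form (key, (left_value, right_value)). The sketch below models this in plain Python. `inner_join` is a hypothetical helper, not a Spark API, and the sample data is illustrative. In PySpark the corresponding call would be `sc.parallelize(a).join(sc.parallelize(b)).collect()`.

```python
def inner_join(left, right):
    """Pair up values whose keys match (hypothetical local model of join)."""
    return [(lk, (lv, rv)) for lk, lv in left for rk, rv in right if lk == rk]

a = [("A", "a1"), ("C", "c1"), ("D", "d1")]
b = [("A", "a2"), ("C", "c2"), ("C", "c3"), ("E", "e1")]

joined = inner_join(a, b)
# Keys "D" and "E" appear on one side only, so they are dropped.
print(sorted(joined))
# [('A', ('a1', 'a2')), ('C', ('c1', 'c2')), ('C', ('c1', 'c3'))]
```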