Computing word count with Python (PySpark)
1. Data preparation: the file words contains the following data:
hello spark
hello python
hello scala
hello spark
hello python
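To follow along, the input file has to exist in the working directory. The snippet below is a minimal sketch of one way to create it, assuming the file is named words and sits next to the script, as the path "./words" in the code suggests. The names SAMPLE_LINES and write_words are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path

# Sample data from the section above, one "hello <word>" pair per line.
SAMPLE_LINES = [
    "hello spark",
    "hello python",
    "hello scala",
    "hello spark",
    "hello python",
]


def write_words(path="words"):
    """Write the sample lines to `path` (hypothetical helper) and return it."""
    target = Path(path)
    target.write_text("\n".join(SAMPLE_LINES) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    # Creates ./words relative to the current working directory.
    print(write_words())
```

Run it once from the directory where the Spark script will be started, so that the relative path resolves to the same file.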
2. The Python implementation is as follows:
from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    # Configure a Spark application named "test" that runs in local mode.
    conf = SparkConf()
    conf.setAppName("test")
    conf.setMaster("local")
    sc = SparkContext(conf=conf)

    # Read the input file as an RDD with one element per line.
    lines = sc.textFile("./words")
    print("lines type is %s" % type(lines))

    # Split each line into words, pair each word with 1, then sum per word.
    words = lines.flatMap(lambda line: line.split(" "))
    pair_words = words.map(lambda word: (word, 1))
    reduce_result = pair_words.reduceByKey(lambda v1, v2: v1 + v2)

    # Sort by count in descending order and print each (word, count) pair.
    result = reduce_result.sortBy(lambda tp: tp[1], ascending=False)
    result.foreach(print)
3. The output is as follows:
('hello', 5)
('spark', 2)
('python', 2)
('scala', 1)
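As a cross-check on the output above, the same counts can be reproduced without Spark. This is a minimal sketch using collections.Counter from the standard library, assuming the five input lines shown in the data section. It is not part of the original program, and the names SAMPLE_LINES and word_count are hypothetical.

```python
from collections import Counter

# The five lines of the words file, copied from the data section.
SAMPLE_LINES = [
    "hello spark",
    "hello python",
    "hello scala",
    "hello spark",
    "hello python",
]


def word_count(lines):
    """Return (word, count) pairs sorted by count, highest first."""
    # Same split rule as the Spark job: break each line on single spaces.
    counts = Counter(word for line in lines for word in line.split(" "))
    # most_common() sorts by count descending; ties keep first-seen order.
    return counts.most_common()


if __name__ == "__main__":
    for pair in word_count(SAMPLE_LINES):
        print(pair)
```

The counts match the Spark result: hello 5, spark 2, python 2, scala 1. In the Spark job, spark and python tie at 2, so their relative order after sortBy may differ between runs.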