Computing a word count with Python (PySpark)

1. Data preparation: the file words contains the following data:

hello spark
hello python
hello scala
hello spark
hello python
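
If the input file does not exist yet, it can be created by hand or with a few lines of Python. The following is only a minimal sketch: the helper name write_words_file and the constant SAMPLE_LINES are made up for illustration, and the default path ./words is assumed to match the path that the Spark job reads.

```python
from pathlib import Path

# Sample lines copied from the listing above.
SAMPLE_LINES = [
    "hello spark",
    "hello python",
    "hello scala",
    "hello spark",
    "hello python",
]


def write_words_file(path="./words"):
    """Write the sample lines to `path`, one per line, and return the path.

    Hypothetical helper, not part of PySpark.
    """
    target = Path(path)
    target.write_text("\n".join(SAMPLE_LINES) + "\n", encoding="utf-8")
    return target
```

Calling write_words_file() from the directory where the job is started should leave a five-line file named words there.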

2. The Python implementation is as follows:

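from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    # Configure a local Spark application named "test"
    conf = SparkConf()
    conf.setAppName("test")
    conf.setMaster("local")
    sc = SparkContext(conf=conf)

    # Read the input file as an RDD with one element per line
    lines = sc.textFile("./words")
    print("lines type is %s" % type(lines))

    # Split each line on spaces and flatten the result into an RDD of words
    words = lines.flatMap(lambda line: line.split(" "))
    # Pair each word with an initial count of 1
    pair_words = words.map(lambda word: (word, 1))
    # Sum the counts of each word
    reduce_result = pair_words.reduceByKey(lambda v1, v2: v1 + v2)
    # Sort by count, largest first
    result = reduce_result.sortBy(lambda tp: tp[1], ascending=False)
    # Print each (word, count) pair
    result.foreach(print)

    # Release the resources held by the SparkContext
    sc.stop()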
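
To make the four transformations easier to follow, here is a rough plain-Python imitation of the same pipeline that runs without Spark. It is only a sketch for building intuition, not how Spark executes the job: the function name word_count is made up for illustration, and everything runs on an in-memory list instead of a distributed RDD.

```python
from collections import defaultdict


def word_count(lines):
    """Mimic flatMap -> map -> reduceByKey -> sortBy on a plain list of strings."""
    # flatMap: split every line on a single space and flatten into one list of words
    words = [word for line in lines for word in line.split(" ")]
    # map: pair each word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: add up the counts of all pairs that share the same word
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    # sortBy(..., ascending=False): order by count, largest first
    return sorted(totals.items(), key=lambda tp: tp[1], reverse=True)


sample = ["hello spark", "hello python", "hello scala", "hello spark", "hello python"]
print(word_count(sample))
```

On the sample data this prints a list of (word, count) pairs with 'hello' first, since it appears on every line.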

3. The output is as follows:

('hello', 5)
('spark', 2)
('python', 2)
('scala', 1)
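
Note that 'spark' and 'python' have the same count, and the sort key used above looks only at the count, so the key alone does not decide which of the two comes first. If a fixed order is wanted, one option is to sort on a composite key. The sketch below shows the idea on a plain Python list; the name sort_key is made up for illustration. It is an assumption, not something verified in this post, that passing the same function to sortBy (with its default ascending order) would give the same ordering in the Spark job.

```python
# (word, count) pairs taken from the output above
counts = [("hello", 5), ("spark", 2), ("python", 2), ("scala", 1)]


def sort_key(tp):
    # Negate the count for descending order, then break ties alphabetically by word
    return (-tp[1], tp[0])


ordered = sorted(counts, key=sort_key)
print(ordered)
```

With this key, words with equal counts are listed alphabetically, so 'python' comes before 'spark'.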
posted @ 2021-03-09 21:27 by 大数据程序员