Computing word count with PySpark

1. Data preparation: the file words contains the following data:

hello spark
hello python
hello scala
hello spark
hello python
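
If the input file does not exist yet, it can be generated with a few lines of Python. This is only a convenience sketch: the helper name write_words_file and the constant SAMPLE_LINES are made up for illustration, and the file name words is taken from the path the Spark job reads.

```python
# Sample data copied from the listing above.
SAMPLE_LINES = [
    "hello spark",
    "hello python",
    "hello scala",
    "hello spark",
    "hello python",
]


def write_words_file(path):
    """Write the sample lines to `path`, one per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(SAMPLE_LINES) + "\n")


if __name__ == "__main__":
    # The Spark job reads "./words", so write the file to the working directory.
    write_words_file("words")
```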

2. The Python implementation is as follows:

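from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    # Configure a local Spark application named "test"
    conf = SparkConf()
    conf.setAppName("test")
    conf.setMaster("local")
    sc = SparkContext(conf=conf)
    # Read the input file as an RDD with one element per line
    lines = sc.textFile("./words")
    print("lines type is %s" % type(lines))

    # Split each line on spaces and flatten the pieces into one RDD of words
    words = lines.flatMap(lambda line: line.split(" "))
    # Pair every word with an initial count of 1
    pair_words = words.map(lambda word: (word, 1))
    # Sum the counts for each distinct word
    reduce_result = pair_words.reduceByKey(lambda v1, v2: v1 + v2)
    # Sort the (word, count) pairs by count, highest first
    result = reduce_result.sortBy(lambda tp: tp[1], ascending=False)
    # Print each pair; in local mode the output appears on the console
    result.foreach(print)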
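
To see what each transformation produces without starting Spark, the same four steps can be mimicked in plain Python. This is only an illustration of the data flow, not how Spark executes the job; the list and dictionary below stand in for RDDs, and the variable names are chosen to mirror the listing above.

```python
from collections import defaultdict

# Stand-in for sc.textFile("./words"): one string per input line.
lines = ["hello spark", "hello python", "hello scala", "hello spark", "hello python"]

# flatMap: split every line and flatten the pieces into a single list.
words = [word for line in lines for word in line.split(" ")]

# map: pair each word with the count 1.
pair_words = [(word, 1) for word in words]

# reduceByKey: add up the counts that share the same word.
totals = defaultdict(int)
for word, count in pair_words:
    totals[word] += count

# sortBy(..., ascending=False): order the pairs by count, highest first.
result = sorted(totals.items(), key=lambda tp: tp[1], reverse=True)

for pair in result:
    print(pair)
```

Python's sorted is stable, so words with equal counts keep their first-seen order here. Spark makes no such promise for ties, so the relative order of spark and python may differ between runs or partitionings.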

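3. The output is as follows (the leading [Stage 0:> ...] text on the first line is Spark's console progress bar, not part of the result):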

[Stage 0:>  (0 + 1) / 1]('hello', 5)
('spark', 2)
('python', 2)
('scala', 1)
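
As a quick cross-check of these counts, the standard library's collections.Counter gives the same totals for the sample data. The function name count_words is hypothetical, and text.split() splits on any whitespace rather than on single spaces as the Spark job does, which makes no difference for this input.

```python
from collections import Counter


def count_words(text):
    """Return (word, count) pairs, most frequent first (hypothetical helper)."""
    return Counter(text.split()).most_common()


# Same content as the words file.
sample = "hello spark\nhello python\nhello scala\nhello spark\nhello python\n"

for pair in count_words(sample):
    print(pair)
```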

posted @ 2021-03-09 21:27 by 大数据程序员
