08 Distributed Computing with MapReduce: Word Frequency Count


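WordCount program task:

Program

WordCount

Input

A text file containing a large number of words

Output

Each word in the file and the number of times it occurs (its frequency),

sorted alphabetically by word,

with one word and its frequency per line, separated by whitespace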

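1. Using the programming environment you know best, write a non-distributed word frequency program.

  • Read the file
  • Tokenize (text.split, which returns a list)
  • Count per word (a dictionary: the key is the word, the value is its count)
  • Sort (list.sort on a list)
  • Output

import operator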

speech_etxt = '''
《You Have Only One Life》

  There are moments in life when you miss someone so much that you just want to pick them from your dreams and hug them for real! Dream what you want to dream;go where you want to go;be what you want to be,because you have only one life and one chance to do all the things you want to do.

  May you have enough happiness to make you sweet,enough trials to make you strong,enough sorrow to keep you human,enough hope to make you happy? Always put yourself in others’shoes.If you feel that it hurts you,it probably hurts the other person, too.

  The happiest of people don’t necessarily have the best of everything;they just make the most of everything that comes along their way.Happiness lies for those who cry,those who hurt, those who have searched,and those who have tried,for only they can appreciate the importance of people

  who have touched their lives.Love begins with a smile,grows with a kiss and ends with a tear.The brightest future will always be based on a forgotten past, you can’t go on well in lifeuntil you let go of your past failures and heartaches.

  When you were born,you were crying and everyone around you was smiling.Live your life so that when you die,you're the one who is smiling and everyone around you is crying.

  Please send this message to those people who mean something to you,to those who have touched your life in one way or another,to those who make you smile when you really need it,to those that make you see the brighter side of things when you are really down,to those who you want to let them know that you appreciate their friendship.And if you don’t, don’t worry,nothing bad will happen to you,you will just miss out on the opportunity to brighten someone’s day with this message.
'''
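# The text to analyze is defined above.
# Lowercase it first, then split on whitespace. Punctuation stays attached
# to words (for example, "dream;go" is counted as a single token).
speech = speech_etxt.lower().split()
# Count occurrences with a dictionary: key = word, value = count
dic = {}
for word in speech:
    if word not in dic:
        dic[word] = 1
    else:
        dic[word] = dic[word] + 1

# Sort by count, most frequent first. For the alphabetical order that the
# task statement asks for, use sorted(dic.items()) instead.
swd = sorted(dic.items(), key=operator.itemgetter(1), reverse=True)
print(swd)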

Run it on Ubuntu.

  • Prepare the txt file
  • Write the py file
  • Run the py file with python3 to analyze the txt file.
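
The three steps above can be sketched as a single script. This is only a minimal sketch, not the script used in the original run: the file names wordcount.py and speech.txt are hypothetical, and the helper names count_words and format_counts are made up for illustration. Unlike the inline version above, it reads the text from a file and prints one word and its count per line in alphabetical order, as the task statement asks.

```python
import sys
from collections import Counter


def count_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def format_counts(counts):
    """Return 'word<TAB>count' lines sorted alphabetically by word."""
    return [f"{word}\t{n}" for word, n in sorted(counts.items())]


# Only run when a file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        for line in format_counts(count_words(f.read())):
            print(line)
```

Assuming those file names, it would be run as python3 wordcount.py speech.txt. As in the inline version, punctuation is not stripped, so tokens such as "dream;go" are counted as they appear.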

2. Implement word frequency counting with MapReduce

2.1 Write the Map function

  • Write mapper.py
  • Grant it execute permission
  • Test mapper.py locally
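
The original post does not include the mapper source, so the following is only a sketch of what a Hadoop Streaming mapper for this task might look like. It assumes the usual Streaming convention of reading lines from standard input and writing tab-separated key/value pairs to standard output; the helper name map_line is made up for illustration.

```python
#!/usr/bin/env python3
import sys


def map_line(line):
    """Yield a (word, 1) pair for every whitespace-separated token in a line."""
    for word in line.strip().lower().split():
        yield word, 1


def main():
    # Emit one "word<TAB>1" record per token read from standard input.
    for line in sys.stdin:
        for word, count in map_line(line):
            print(f"{word}\t{count}")


if __name__ == "__main__":
    main()
```

The shebang line matters here: once the file is made executable (for example with chmod a+x mapper.py), it can be tested locally by piping a text file into it, and it should print one line per word.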

2.2 Write the Reduce function

  • Write reducer.py
  • Grant it execute permission
  • Test reducer.py locally
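
The reducer source is not shown in the original post either, so this is a sketch under the same assumptions: input arrives on standard input as "word<TAB>count" lines that are already sorted by word, which is what Hadoop Streaming delivers to a reducer and what a local sort step would produce. The helper names parse and reduce_pairs are made up for illustration.

```python
#!/usr/bin/env python3
import sys
from itertools import groupby


def parse(lines):
    """Turn 'word<TAB>count' lines into (word, int) pairs, skipping bad lines."""
    for line in lines:
        word, _, count = line.rstrip("\n").rpartition("\t")
        if word and count.isdigit():
            yield word, int(count)


def reduce_pairs(pairs):
    """Sum the counts of consecutive pairs that share the same word.

    The input must already be sorted by word, because groupby only
    merges adjacent items with equal keys.
    """
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def main():
    for word, total in reduce_pairs(parse(sys.stdin)):
        print(f"{word}\t{total}")


if __name__ == "__main__":
    main()
```

When testing locally, the mapper output has to be sorted before it reaches this script. Unsorted input would make the same word appear in several separate groups.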

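2.3 Run the bundled word count example in distributed mode

  • Start HDFS and YARN
  • Prepare the file to process and upload it to HDFS
  • Run the example jar hadoop-mapreduce-examples-2.7.1.jar
  • View the results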

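2.4 Run the self-written word count in distributed mode

  • Submit the MapReduce job with Streaming:
    • Locate the hadoop-streaming jar file, which is under /usr/local/hadoop/share/hadoop/tools/lib/
    • Configure the stream environment variable
    • Write the run script run.sh
    • Execute run.sh
  • View the results
  • Stop HDFS and YARN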
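
The contents of run.sh are not shown in the original post. To make the submission step concrete, here is a sketch in Python (rather than shell) that only assembles and prints the kind of hadoop jar command such a script would typically contain. Several things here are assumptions: the environment variable name STREAM, the placeholder jar name, the HDFS paths, and the helper name build_streaming_command. The flags -files, -mapper, -reducer, -input and -output are standard Hadoop Streaming options.

```python
import os
import shlex


def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper="mapper.py", reducer="reducer.py"):
    """Assemble the argument list for a Hadoop Streaming job submission."""
    return [
        "hadoop", "jar", streaming_jar,
        # -files is a generic option, so it goes before the streaming options;
        # it ships the local scripts to the cluster nodes.
        "-files", f"{mapper},{reducer}",
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


if __name__ == "__main__":
    # Assumption: STREAM holds the full path of the hadoop-streaming jar;
    # the fallback value is only a placeholder.
    jar = os.environ.get("STREAM", "hadoop-streaming.jar")
    # Hypothetical HDFS input and output paths.
    cmd = build_streaming_command(jar, "input/speech.txt", "output")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The script prints the command instead of running it, so it can be inspected before anything is submitted. Note that the output directory must not already exist in HDFS when the job starts.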

posted @ 2021-11-23 14:28  LBxl