5.RDD操作综合实例

一、词频统计

A. 分步骤实现

准备文件
1. 下载小说或长篇新闻稿
2. 上传到hdfs上
读文件创建RDD
分词
排除大小写lower()，map()
标点符号re.split(pattern,str)，flatMap(),
停用词,可网盘下载stopwords.txt,filter()，
长度小于2的词filter()
统计词频
按词频排序
输出到文件
查看结果

B. 一句话实现：文件入文件出

lines = sc.textFile("hdfs://localhost:9001/user/hadoop/RDD_5/words.txt") \
    .flatMap(lambda line: line.split()) \
    .flatMap(lambda line: re.split(r'\W', line)) \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: word.lower()) \
    .filter(lambda x: x not in stopwords) \
    .filter(lambda x: len(x) > 2) \
    .map(lambda a: (a, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda x: x[1], False)

C.和作业2的“二、Python编程练习：英文文本的词频统计 ”进行比较，理解Spark编程的特点。

import string

list_dict = {}  
data = []  
f = open('test.txt', 'r')  
content = f.read() 
f.close()  
content = content.replace('-', ' ') 
words = content.split()  

for i in range(len(words)):
    words[i] = words[i].strip(string.punctuation) 
    words[i] = words[i].lower() 
    if words[i] in list_dict:  
        list_dict[words[i]] = list_dict[words[i]] + 1  
    else:
        list_dict[words[i]] = 1  
# print(list_dict)  ）
for key, value in list_dict.items(): 
    temp = [value, key]  
    data.append(temp)  
data.sort(reverse=True)  
print(data)

二、求Top值

posted @ 2022-04-08 21:43 sunny_LF 阅读(39) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

5.RDD操作综合实例

公告