2. Installing Spark and Python Practice

I. Installing Spark

1. Check the base environment: Hadoop and the JDK


Start Hadoop, then run jps to confirm that it started successfully.


Check the JDK environment.
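The checks above are normally done in a shell. As a rough sketch, the same check could be scripted in Python by looking the commands up on PATH. The function name check_base_environment and the list of commands are assumptions made for illustration, not something from the original post.

```python
import os
import shutil


def check_base_environment(commands=("java", "hadoop", "jps")):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, path in check_base_environment().items():
        print(f"{name}: {path or 'not found on PATH'}")
    # JAVA_HOME is commonly set for Hadoop and Spark; prints None if it is unset
    print("JAVA_HOME:", os.environ.get("JAVA_HOME"))
```

This only tells you whether the commands can be found; it does not confirm that the Hadoop daemons are running, which is what jps shows.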

Download Spark.

Extract the archive, rename the folder, and set permissions.

Extract Spark

Rename the folder

Grant permissions

Configuration file

Environment variables

Test-run some Python code

II. Python programming exercise: word frequency count for an English text

Prepare the text file

Read the file

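# Open with a context manager so the file is closed automatically
# (UTF-8 is an assumption about how the text file is encoded)
with open("The_Flying_Inn.txt", 'r', encoding='UTF-8') as file:
    text = file.read()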

Preprocessing: case folding, punctuation, stop words, tokenization

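import re

# Split on every non-letter character to produce a list of words
words = re.split(r'[^a-zA-Z]', text)
# Drop empty strings
realWords0 = list(filter(None, words))
realWords1 = []
for word in realWords0:
    # Convert every word to lowercase
    realWords1.append(word.lower())
# Sort once, after all words have been collected
realWords1.sort()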
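The heading lists stop words, but the snippet above does not filter any. A minimal sketch of one way to add that step follows. The STOP_WORDS set and the preprocess helper are illustrative assumptions; a real stop-word list would be much longer.

```python
import re

# A tiny illustrative stop-word list (an assumption, not from the original post)
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on runs of non-letters, and drop stop words."""
    tokens = re.split(r"[^a-zA-Z]+", text.lower())
    # Skip the empty strings left by leading or trailing separators
    return [t for t in tokens if t and t not in stop_words]


sample_words = preprocess("The inn is flying; the Inn flies to the sea.")
print(sample_words)  # ['inn', 'flying', 'inn', 'flies', 'sea']
```

Lowercasing before the lookup means the stop-word set only needs lowercase entries.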

Count the occurrences of each word

dct = dict()
for word in realWords1:
    if word in dct:
        dct[word] += 1
    else:
        dct[word] = 1
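The standard library's collections.Counter does the same counting in one call and could be used in place of the manual loop. The sample list here is made-up data standing in for realWords1.

```python
from collections import Counter

sample = ["inn", "flying", "inn", "sea"]  # made-up stand-in for realWords1
counts = Counter(sample)

print(counts["inn"])      # 2
print(counts["missing"])  # 0, a Counter returns 0 for keys it has not seen
```

A Counter is a dict subclass, so later code that calls items() on the result works unchanged.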

Sort by word frequency

wclist = list(dct.items())
# Sort by count, highest first
wclist.sort(key=lambda x: x[1], reverse=True)
print(wclist)
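Python's sort is stable, so words with equal counts keep whatever order they had going in. As a sketch, the tie order can be made explicit, and the output trimmed to the top few entries, with a compound sort key. The rank_by_frequency helper and the sample counts are hypothetical.

```python
def rank_by_frequency(counts, top_n=None):
    """Sort (word, count) pairs by count descending, then by word ascending."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_n is None else ranked[:top_n]


sample_counts = {"sea": 1, "inn": 2, "flying": 1}  # made-up counts for illustration
print(rank_by_frequency(sample_counts, top_n=2))  # [('inn', 2), ('flying', 1)]
```

Negating the count in the key sorts frequencies in descending order while leaving the alphabetical tie-break ascending.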

Write the results to a file

# Use a context manager so the output is flushed and the file is closed
with open("output.txt", "w", encoding='UTF-8') as out:
    out.write(str(wclist))
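Writing str(wclist) puts the whole list on a single line in Python's list notation. If a line-per-word layout is preferred, a sketch like the following writes one tab-separated pair per line. The write_counts helper, the output.tsv name, and the sample data are assumptions for illustration.

```python
import os
import tempfile


def write_counts(pairs, path):
    """Write each (word, count) pair as 'word<TAB>count' on its own line."""
    with open(path, "w", encoding="UTF-8") as out:
        for word, count in pairs:
            out.write(f"{word}\t{count}\n")


# Demonstration with made-up data, written into a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "output.tsv")
write_counts([("inn", 2), ("sea", 1)], demo_path)
```

In the exercise itself, the call would be write_counts(wclist, "output.txt").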
posted @ 2022-03-07 18:41  空木蓮華