1. Write a WordCount program in Python and submit the job
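Program | WordCount
Input | A text file containing a large number of words
Output | Every word in the file with its number of occurrences (frequency), sorted alphabetically by word, one word and its count per line, separated by whitespace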
- Write the map function and the reduce function

    cd /home/hadoop/wc
    sudo gedit mapper.py

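Contents of mapper.py:

    #!/usr/bin/env python
    # map function: emit "<word>\t1" for every word read from standard input.
    # The shebang line lets the script be run directly; it assumes `python` is on the PATH.
    import sys

    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s' % (word, 1))
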
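Contents of reducer.py:

    #!/usr/bin/env python
    # reduce function: sum the counts of each word.
    # The input must already be sorted by word, so equal words arrive consecutively.
    import sys

    current_word = None
    current_count = 0
    word = None

    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # Skip lines whose count is not a number.
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # Emit the last word, if there was any input at all.
    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))
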
- Make the scripts executable

    chmod a+x /home/hadoop/wc/mapper.py
    chmod a+x /home/hadoop/wc/reducer.py

- Test the code locally

    echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py
    echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py | sort -k1,1 | /home/hadoop/wc/reducer.py

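The local test above chains the mapper, `sort`, and the reducer through shell pipes. The same flow can be mimicked inside a single Python process to check the logic before touching Hadoop. This is only a sketch: the function names `map_words`, `reduce_counts`, and `word_count` are made up for illustration and are not part of the original scripts.

```python
from itertools import groupby


def map_words(lines):
    """Yield (word, 1) for every whitespace-separated word, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Sum the counts of consecutive equal words, like reducer.py on sorted input."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() stands in for `sort -k1,1` (and for Hadoop's shuffle phase).
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for word, count in word_count(["foo foo quux labs foo bar quux"]):
        print("%s\t%s" % (word, count))
```

For the sample sentence this yields bar 1, foo 3, labs 1, quux 2, which is what the shell pipeline should print as well.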
- Run it on HDFS
- Upload the previously crawled text files to HDFS
- Submit the job with the Hadoop Streaming command
- View the results

    cd /home/hadoop/wc
    wget http://www.gutenberg.org/files/5000/5000-8.txt
    wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
    cd /usr/hadoop/wc
    hdfs dfs -put /home/hadoop/hadoop/gutenberg/*.txt /user/hadoop/input

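The listing stops at the upload step, so the submit and view steps are not shown. As a rough sketch, the Hadoop Streaming command line could be assembled as below. Everything environment-specific here is an assumption: the streaming jar location is a placeholder (`/path/to/hadoop-streaming.jar`), the output directory `/user/hadoop/output` is made up, and the helper names are hypothetical. Only the options `-files`, `-mapper`, `-reducer`, `-input`, `-output`, and `hdfs dfs -cat` are standard.

```python
import os


def build_streaming_command(streaming_jar, mapper, reducer, input_path, output_path):
    """Return the argument list for a Hadoop Streaming job (nothing is executed)."""
    return [
        "hadoop", "jar", streaming_jar,
        # -files is a generic option, so it comes before the streaming options.
        "-files", "%s,%s" % (mapper, reducer),
        # Shipped files are available to each task under their base names.
        "-mapper", os.path.basename(mapper),
        "-reducer", os.path.basename(reducer),
        "-input", input_path,
        "-output", output_path,
    ]


def build_cat_command(output_path):
    """Return the argument list that prints the job's result files."""
    return ["hdfs", "dfs", "-cat", output_path.rstrip("/") + "/part-*"]


if __name__ == "__main__":
    # Placeholder jar path and output directory: adjust both to your installation.
    submit = build_streaming_command(
        "/path/to/hadoop-streaming.jar",
        "/home/hadoop/wc/mapper.py",
        "/home/hadoop/wc/reducer.py",
        "/user/hadoop/input",
        "/user/hadoop/output",
    )
    print(" ".join(submit))
    print(" ".join(build_cat_command("/user/hadoop/output")))
```

The script only prints the two commands; pass the lists to `subprocess.run` once the paths are right. The output directory must not exist before the job starts, otherwise Hadoop refuses to run it.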
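2. Process a weather dataset with MapReduce
Write a program that finds the daily maximum and minimum temperatures, and the maximum and minimum temperatures over a date range.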
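- The weather dataset can be downloaded from: ftp://ftp.ncdc.noaa.gov/pub/data/noaa
- Download the data for the year and month given by the last three digits of your student ID (for example, student 201506110136 downloads the 2013 data whose names start with 6; adapt slightly depending on what data is actually available)
- Decompress the dataset and save it to a text file
- Parse the weather data format
- Write the map function and the reduce function
- Make the scripts executable
- Test the code locally
- Run it on HDFS
- Upload the previously crawled text files to HDFS
- Submit the job with the Hadoop Streaming command
- View the results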

    cd /usr/hadoop
    sudo mkdir qx
    cd /usr/hadoop/qx
    wget -D --accept-regex=REGEX -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2013/4*
    cd /usr/hadoop/qx/data/ftp.ncdc.noaa.gov/pub/data/noaa/2014
    sudo zcat 1*.gz >qxdata.txt
    cd /usr/hadoop/qx

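Contents of mapper.py:

    #!/usr/bin/env python
    # map function: emit "<date>\t<temperature>" for every weather record.
    # The shebang line lets the script be run directly; it assumes `python` is on the PATH.
    import sys

    for line in sys.stdin:
        line = line.strip()
        d = line[15:23]   # observation date, YYYYMMDD
        t = line[87:92]   # signed air temperature field
        print('%s\t%s' % (d, t))
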
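The mapper above forwards the raw temperature field as is. A slightly more defensive parse of one record might look like the sketch below. It uses the same character offsets as the mapper and additionally assumes the usual NOAA ISD conventions: the temperature is given in tenths of a degree Celsius, `+9999` marks a missing reading, and the character after the temperature is a quality code, filtered here with the commonly used set `01459`. The function name `parse_record` is made up for illustration.

```python
MISSING = 9999            # assumed ISD marker for a missing air temperature
GOOD_QUALITY = "01459"    # assumed set of acceptable quality codes


def parse_record(line):
    """Return (date, temperature in tenths of a degree Celsius), or None if unusable."""
    if len(line) < 93:
        return None
    date = line[15:23]            # YYYYMMDD, same slice as the mapper
    try:
        temp = int(line[87:92])   # int() accepts the leading '+' or '-'
    except ValueError:
        return None
    if temp == MISSING or line[92] not in GOOD_QUALITY:
        return None
    return date, temp


if __name__ == "__main__":
    # Synthetic record: only the date, temperature, and quality fields are meaningful.
    sample = "0" * 15 + "20130401" + "0" * 64 + "-0011" + "1"
    print(parse_record(sample))
```

Dropping missing readings in the mapper keeps a `9999` value from being reported as a day's maximum later on.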
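Contents of reducer.py:

    #!/usr/bin/env python
    # reduce function: keep the lowest temperature seen for each date.
    # The input must already be sorted by date, so equal dates arrive consecutively.
    import sys

    current_date = None
    current_min = 0
    date = None

    for line in sys.stdin:
        line = line.strip()
        date, temp = line.split('\t', 1)
        try:
            temp = int(temp)
        except ValueError:
            # Skip lines whose temperature is not a number.
            continue
        if current_date == date:
            if current_min > temp:
                current_min = temp
        else:
            if current_date:
                print('%s\t%s' % (current_date, current_min))
            current_min = temp
            current_date = date

    # Emit the last date, if there was any input at all.
    if current_date is not None:
        print('%s\t%s' % (current_date, current_min))
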
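The reducer above only tracks the minimum, while the task also asks for the daily maximum and for the extremes over a date range. One possible way to get all of them is sketched below; it assumes mapper output of the form `<date>\t<temperature>` sorted by date, and the function names are hypothetical.

```python
from itertools import groupby


def parse_pairs(lines):
    """Yield (date, temperature) from '<date>\\t<temp>' lines, skipping bad ones."""
    for line in lines:
        date, _, value = line.strip().partition("\t")
        try:
            yield date, int(value)
        except ValueError:
            continue


def daily_min_max(lines):
    """Yield (date, min, max) per day; the input must be sorted by date."""
    for date, group in groupby(parse_pairs(lines), key=lambda pair: pair[0]):
        temps = [temp for _, temp in group]
        yield date, min(temps), max(temps)


def range_min_max(daily, start, end):
    """Return (min, max) over start <= date <= end, or None if no day matches."""
    rows = [(low, high) for date, low, high in daily if start <= date <= end]
    if not rows:
        return None
    return min(low for low, _ in rows), max(high for _, high in rows)


if __name__ == "__main__":
    # Made-up sample lines in the mapper's output format.
    sample = ["20130401\t+0123", "20130401\t-0011", "20130402\t+0050"]
    daily = list(daily_min_max(sample))
    for date, low, high in daily:
        print("%s\t%s\t%s" % (date, low, high))
    print(range_min_max(daily, "20130401", "20130402"))
```

Because dates are `YYYYMMDD` strings, plain string comparison orders them chronologically, so the range filter needs no date parsing.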

    chmod a+x /usr/hadoop/qx/mapper.py
    chmod a+x /usr/hadoop/qx/reducer.py