Make-up assignment

Getting familiar with common HBase operations

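1. Convert the following relational database table and its data into a table suitable for HBase storage, then insert the data:

Student table (Student) (excluding the last column)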

Student No. (S_No)	Name (S_Name)	Sex (S_Sex)	Age (S_Age)	Course (course)
2015001	Zhangsan	male	23
2015002	Mary	female	22
2015003	Lisi	male	24	Math 85

create 'Student',{NAME=>'S_No',VERSIONS=>5},{NAME=>'S_Name',VERSIONS=>5},{NAME=>'S_Sex',VERSIONS=>5},{NAME=>'S_Age',VERSIONS=>5}
put 'Student','2015001','S_Name','Zhangsan'
put 'Student','2015001','S_Sex','male'
put 'Student','2015001','S_Age','23'
put 'Student','2015002','S_Name','Mary'
put 'Student','2015002','S_Sex','female'
put 'Student','2015002','S_Age','22'
put 'Student','2015003','S_Name','Lisi'
put 'Student','2015003','S_Sex','male'
put 'Student','2015003','S_Age','24'
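
The mapping from relational rows to HBase cells can be sketched in Python. This is only an illustration, not part of the assignment: the helper name rows_to_puts is hypothetical, and it assumes the student number becomes the row key while every other field is written to the column family of the same name declared in the create statement.

```python
def rows_to_puts(table, key_field, rows):
    """Return HBase shell `put` commands, one per non-key field of each row."""
    commands = []
    for row in rows:
        row_key = row[key_field]
        for column, value in row.items():
            if column == key_field:
                continue  # the key is stored as the row key, not as a cell
            commands.append(
                "put '%s','%s','%s','%s'" % (table, row_key, column, value)
            )
    return commands


# The three students from the table above (course column excluded).
students = [
    {"S_No": "2015001", "S_Name": "Zhangsan", "S_Sex": "male", "S_Age": "23"},
    {"S_No": "2015002", "S_Name": "Mary", "S_Sex": "female", "S_Age": "22"},
    {"S_No": "2015003", "S_Name": "Lisi", "S_Sex": "male", "S_Age": "24"},
]

for command in rows_to_puts("Student", "S_No", students):
    print(command)
```

Each relational row turns into one put per attribute, which is why three students with three attributes need nine put commands.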

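2. Use the HBase Shell commands provided by Hadoop to complete the following tasks:

List information about all tables in HBase; list
Print all records of the Student table to the terminal;
Add a course column family to the Student table;
Add a Math column to the course column family and record a score of 85;
Delete the course column;
Count the number of rows in the table; count 's1'
Clear all records from the specified table; truncate 's1'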

list
scan 'Student'
alter 'Student',NAME=>'course'
put 'Student','2015003','course:Math','85'
alter 'Student',NAME=>'course',METHOD=>'delete'
count 'Student'
truncate 'Student'

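Processing a weather dataset with MapReduce

Write a program that finds the daily maximum and minimum temperature, and the maximum and minimum temperature over a period

The weather dataset can be downloaded from: ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Download data for a year and month chosen by the last three digits of your student number (for example, student 201506110136 downloads the 2013 data files beginning with 6; adjust slightly depending on what data is actually available)
Decompress the dataset and save it to a text file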

cd /usr/hadoop
sudo mkdir qx
cd /usr/hadoop/qx

wget -P data -r -c 'ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2017/7*'

cd /usr/hadoop/qx/data/ftp.ncdc.noaa.gov/pub/data/noaa/2017
sudo zcat 7*.gz >qxdata.txt
cd /usr/hadoop/qx
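
For readers without zcat, the decompress-and-merge step can be approximated in Python. This is a minimal sketch using only the standard library; the function name concat_gz is hypothetical, and it assumes the archives are ordinary gzip files.

```python
import glob
import gzip
import shutil


def concat_gz(pattern, out_path):
    """Decompress every file matching `pattern` and append it to `out_path`.

    Returns the number of archives that were merged.
    """
    paths = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for path in paths:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(paths)


# Example call, run from the directory that holds the downloaded archives:
# concat_gz("7*.gz", "qxdata.txt")
```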

Parse the weather data format

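#!/usr/bin/env python
import sys

# Emit "date<TAB>temperature" for every record read from stdin.
for i in sys.stdin:
    i = i.strip()
    d = i[15:23]  # observation date, YYYYMMDD
    t = i[87:92]  # signed air temperature field
    print('%s\t%s' % (d, t))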
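
The mapper above passes the raw temperature field through unchanged. A slightly stricter parser might look like the sketch below. The date and temperature offsets come from the script above; everything else is an assumption based on how the NOAA record layout is commonly described: the value is in tenths of a degree Celsius, "+9999" marks a missing reading, and the character right after the temperature is a quality code. The name parse_record is hypothetical.

```python
MISSING = "+9999"                  # assumed sentinel for a missing temperature
GOOD_QUALITY = frozenset("01459")  # assumed quality codes for usable readings


def parse_record(line):
    """Return (date, temperature in degrees Celsius), or None if unusable."""
    date = line[15:23]
    raw_temp = line[87:92]
    quality = line[92:93]
    if len(raw_temp) < 5 or raw_temp == MISSING or quality not in GOOD_QUALITY:
        return None
    try:
        # The field is signed, e.g. "+0123" or "-0045"; scale by ten.
        return date, int(raw_temp) / 10.0
    except ValueError:
        return None
```

Filtering missing readings in the mapper keeps them from showing up as an absurdly high maximum later in the reducer.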

Write the map and reduce functions

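#!/usr/bin/env python
import sys

# Streaming reducer: input arrives sorted by key, one "date<TAB>temperature"
# pair per line. This version keeps the lowest temperature seen for each date.
current_word = None
current_count = 0
word = None

for i in sys.stdin:
    i = i.strip()
    word, count = i.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose temperature field is not an integer.
        continue

    if current_word == word:
        # Same date as the previous line: keep the smaller temperature.
        if current_count > count:
            current_count = count
    else:
        # A new date starts: emit the result for the previous one.
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the result for the last date.
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))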
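
The reducer above tracks only the lowest value per date, while the task asks for both the highest and the lowest. One possible way to get both in a single pass is sketched below with itertools.groupby. It is an alternative, not the author's method; the function names are hypothetical, and it assumes the input is already sorted by key, as Hadoop Streaming delivers it.

```python
from itertools import groupby


def parse_pairs(lines):
    """Yield (key, int value) for each well-formed 'key<TAB>value' line."""
    for line in lines:
        key, _, value = line.strip().partition("\t")
        try:
            yield key, int(value)
        except ValueError:
            continue  # skip malformed values, as the reducer above does


def min_max_by_key(lines):
    """Yield (key, lowest, highest) for lines that are sorted by key."""
    for key, group in groupby(parse_pairs(lines), key=lambda pair: pair[0]):
        values = [value for _, value in group]
        yield key, min(values), max(values)
```

In a streaming job the function would be fed sys.stdin and each tuple printed tab-separated. If missing readings are encoded as 9999, as is commonly described for this dataset, they should be filtered out first, otherwise they dominate the maximum.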

Make the scripts executable

chmod a+x /usr/hadoop/qx/mapper.py
chmod a+x /usr/hadoop/qx/reducer.py

Test the code on the local machine
Run it on HDFS
1. Upload the previously fetched text file to HDFS
2. Submit the job with a Hadoop Streaming command
View the results
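
The post does not show the commands for these steps. A local dry run is commonly done by piping the text file through the mapper, sort, and the reducer. For the cluster run, the sketch below only assembles the argument lists, so nothing is executed. The streaming jar location and the HDFS paths are placeholders that depend on the installation, and both helper names are hypothetical. The script paths are the ones used in the chmod step above.

```python
def build_upload_command(local_file, hdfs_dir):
    """Argument list for copying a local file into an HDFS directory."""
    return ["hdfs", "dfs", "-put", local_file, hdfs_dir]


def build_streaming_command(streaming_jar, input_path, output_path):
    """Argument list for a Hadoop Streaming job using mapper.py and reducer.py."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic option: ship both scripts to the task nodes.
        "-files", "/usr/hadoop/qx/mapper.py,/usr/hadoop/qx/reducer.py",
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-input", input_path,
        "-output", output_path,
    ]


# Placeholder values; substitute the real jar path and HDFS directories.
print(" ".join(build_upload_command("qxdata.txt", "/qx/input")))
print(" ".join(build_streaming_command("hadoop-streaming.jar", "/qx/input", "/qx/output")))
```

Either list can be handed to subprocess.run once the placeholders are replaced. The output directory must not exist before the job starts.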

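Basic Hive operations and applications

Complete WordCount with Hive on Hadoop

Start Hadoop

Create a folder on HDFS

Upload the file to HDFS

Start Hive

Create the raw document table

Load the file contents into the table docs and view them

Run the word-frequency count in HQL and store the result in the table word_count

View the results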
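
The HQL itself is not reproduced in this post. As a rough illustration of what the word-frequency step computes, here is the same logic in plain Python: split every line of docs into words, group identical words, and count each group. The function name word_count mirrors the result table but is otherwise hypothetical, and splitting on whitespace is an assumption.

```python
from collections import Counter


def word_count(lines):
    """Return (word, count) pairs sorted by word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # one entry per whitespace-separated word
    return sorted(counts.items())


for word, count in word_count(["hello world", "hello hive"]):
    print("%s\t%d" % (word, count))
```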

See here: http://www.cnblogs.com/FZW1874402927/p/9085246.html

posted @ 2018-05-25 16:12 157-符致伟
