Installing and Using Pig
1. References
Reference documentation:
http://pig.apache.org/docs/r0.17.0/start.html#build
2. Installation Environment
2.1. Environment
A pseudo-distributed environment running CentOS 7.4 and Hadoop 2.7.5.
Hostname | NameNode | SecondaryNameNode | DataNodes
centoshadoop.smartmap.com | 192.168.1.80 | 192.168.1.80 | 192.168.1.80
Hadoop is installed in /opt/hadoop/hadoop-2.7.5.
3. Installation
3.1. Downloading Pig
http://pig.apache.org/releases.html#Download
Download pig-0.17.0.tar.gz from the releases page above.
3.2. Extracting Pig
Extract the downloaded pig-0.17.0.tar.gz so that Pig ends up in /opt/hadoop/pig-0.17.0.
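The extraction step can be sketched as a small shell function. This is a minimal sketch, assuming the archive unpacks into a single top-level pig-0.17.0 directory; the function name extract_pig and its arguments are illustrative and not part of Pig.

```shell
# extract_pig TARBALL PARENT_DIR
# Unpacks a Pig release tarball under PARENT_DIR.
# Assumption: the archive contains one top-level pig-<version> directory.
extract_pig() {
    tarball="$1"
    parent_dir="$2"
    if [ ! -f "$tarball" ]; then
        echo "tarball not found: $tarball" >&2
        return 1
    fi
    mkdir -p "$parent_dir" || return 1
    tar -xzf "$tarball" -C "$parent_dir"
}
```

For this environment the call would be `extract_pig pig-0.17.0.tar.gz /opt/hadoop`, after which /opt/hadoop/pig-0.17.0/bin/pig should exist.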
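4. Configuration
4.1. Editing the profile File
vi /etc/profile
Append the following lines:
export PIG_HOME=/opt/hadoop/pig-0.17.0
export PATH=$PATH:$PIG_HOME/bin
Then reload the profile in the current shell:
source /etc/profile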
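If the profile is edited by a script rather than by hand, the two exports can be appended idempotently. The following is a minimal sketch; add_pig_env is a hypothetical helper name, and it only checks for an existing `export PIG_HOME=` line.

```shell
# add_pig_env PROFILE_FILE PIG_DIR
# Appends the PIG_HOME and PATH exports to PROFILE_FILE unless a PIG_HOME
# export is already present, so running it twice does not duplicate lines.
add_pig_env() {
    profile_file="$1"
    pig_dir="$2"
    if grep -q '^export PIG_HOME=' "$profile_file" 2>/dev/null; then
        return 0
    fi
    {
        echo "export PIG_HOME=$pig_dir"
        # Single quotes keep $PATH and $PIG_HOME literal in the profile.
        echo 'export PATH=$PATH:$PIG_HOME/bin'
    } >> "$profile_file"
}
```

A typical call would be `add_pig_env /etc/profile /opt/hadoop/pig-0.17.0`. After reloading the profile, `pig -version` should print the installed Pig version.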
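4.2. Upgrading the JDK to 1.8
Switch the JDK to version 1.8 and update every setting that depends on JAVA_HOME (for example in /etc/profile and in Hadoop's hadoop-env.sh).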
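One way to confirm which JDK is active after the switch is to parse the version string. The helper below is a sketch only; java_major is a hypothetical name, and it assumes version strings of the form "1.8.0_161" (old scheme) or "11.0.2" (new scheme).

```shell
# java_major VERSION_STRING
# Prints the major Java version: "1.8.0_161" -> 8, "11.0.2" -> 11.
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;   # old scheme: 1.<major>.<minor>
        *)   echo "$1" | cut -d. -f1 ;;   # new scheme: <major>.<minor>
    esac
}
```

Because `java -version` writes to standard error, the string can be captured with `java_major "$(java -version 2>&1 | awk -F'"' 'NR==1 {print $2}')"`, which should print 8 once JDK 1.8 is in use.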
4.3. Editing the Pig Configuration File
vi /opt/hadoop/pig-0.17.0/conf/pig.properties
exectype=mapreduce
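The same setting can be applied from a script. This is a minimal sketch, assuming pig.properties uses plain `key=value` lines; set_exectype is a hypothetical helper, not a Pig tool.

```shell
# set_exectype PROPERTIES_FILE MODE
# Replaces any active exectype= line with exectype=MODE.
# Commented-out lines (starting with #) and all other keys are kept.
set_exectype() {
    props_file="$1"
    mode="$2"
    [ -f "$props_file" ] || return 1
    tmp_file="$props_file.tmp.$$"
    # Copy everything except existing, uncommented exectype settings.
    grep -v '^[[:space:]]*exectype[[:space:]]*=' "$props_file" > "$tmp_file" || true
    echo "exectype=$mode" >> "$tmp_file"
    mv "$tmp_file" "$props_file"
}
```

For this setup the call would be `set_exectype /opt/hadoop/pig-0.17.0/conf/pig.properties mapreduce`. The mode can also be chosen per run on the command line, for example `pig -x local` for quick tests without the cluster.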
4.4. Editing mapred-site.xml to Enable the Job History Server
vi /opt/hadoop/hadoop-2.7.5/etc/hadoop/mapred-site.xml
Add the following property inside the <configuration> element:
<property>
<name>mapreduce.jobhistory.address</name>
<value>192.168.1.80:10020</value>
</property>
5. Starting Hadoop
5.1. Starting HDFS and YARN
cd /opt/hadoop/hadoop-2.7.5/sbin
./start-dfs.sh
./start-yarn.sh
5.2. Starting the Job History Server
cd /opt/hadoop/hadoop-2.7.5/sbin
./mr-jobhistory-daemon.sh start historyserver
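Before starting Pig it can help to confirm that all daemons are up. The sketch below reads `jps` output and reports which expected processes are missing. The daemon list is an assumption for a single-node pseudo-distributed setup like the one described here, and missing_daemons is a hypothetical helper name.

```shell
# missing_daemons: reads `jps` output on stdin and prints each expected
# daemon that does not appear in it; returns 1 if any are missing.
missing_daemons() {
    jps_output=$(cat)
    rc=0
    for daemon in NameNode DataNode SecondaryNameNode \
                  ResourceManager NodeManager JobHistoryServer; do
        # jps prints "<pid> <name>"; compare the name as a whole field so
        # that "NameNode" does not match "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" |
                awk -v d="$daemon" '$2 == d {found=1} END {exit !found}'; then
            echo "$daemon"
            rc=1
        fi
    done
    return $rc
}
```

Run it as `jps | missing_daemons`; empty output means every daemon in the list was found.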
6. Using Pig
6.1. Importing a File into HDFS
hadoop fs -mkdir -p /input/ncdc/micro-tab
hadoop fs -copyFromLocal sample.txt /input/ncdc/micro-tab/sample.txt
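The contents of sample.txt are not shown here. Given the schema used later (year, temperature, quality) and the fact that Pig's default loader splits fields on tabs, a small test file could be generated as sketched below. The rows are illustrative assumptions, not the original data, and make_sample is a hypothetical helper.

```shell
# make_sample FILE
# Writes a tiny tab-separated file: year, temperature, quality code.
# All rows are made-up test values. The last two are meant to be dropped by
# a filter on temperature != 9999 and quality IN (0, 1, 4, 5, 9).
make_sample() {
    printf '%s\t%s\t%s\n' \
        1950 0 1 \
        1950 22 1 \
        1950 -11 1 \
        1949 111 1 \
        1949 78 1 \
        1949 9999 1 \
        1950 40 2 > "$1"
}
```

Calling `make_sample sample.txt` before the copyFromLocal command gives a seven-line file to experiment with.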
6.2. Starting the Interactive Pig Shell (Grunt)
cd /opt/hadoop/pig-0.17.0/bin
pig
6.3. Running a Job
grunt> records = load '/input/ncdc/micro-tab/sample.txt' as (year:chararray, temperature:int, quality:int);
grunt> dump records;
6.4. Exiting
grunt> \q
6.5. Displaying the Schema
cd /opt/hadoop/pig-0.17.0/bin
pig
grunt> records = LOAD '/input/ncdc/micro-tab/sample.txt' as (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
grunt>
6.6. Filtering Data
grunt> filter_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
grunt> DUMP filter_records;
6.7. Grouping Records
grunt> grouped_records = GROUP filter_records BY year;
grunt> DUMP grouped_records;
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filter_records: {(year: chararray,temperature: int,quality: int)}}
grunt>
6.8. Computing the Maximum
grunt> max_temp = FOREACH grouped_records GENERATE group, MAX(filter_records.temperature);
grunt> DUMP max_temp;
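To sanity-check the result on a small input, the same filter, group, and MAX logic can be reproduced locally with awk. This is only a cross-check sketch, assuming a local tab-separated copy of the input file; max_temp_by_year is a hypothetical helper and does not involve Pig or Hadoop.

```shell
# max_temp_by_year FILE
# Local cross-check of the Pig pipeline: drop rows with temperature 9999 or
# a quality code outside {0,1,4,5,9}, then print the maximum temperature
# per year as "year<TAB>max", sorted by year.
max_temp_by_year() {
    awk -F'\t' '
        $2 != 9999 && ($3 == 0 || $3 == 1 || $3 == 4 || $3 == 5 || $3 == 9) {
            # "+ 0" forces a numeric comparison; the "in" test handles
            # years whose temperatures are all negative.
            if (!($1 in max) || $2 + 0 > max[$1]) max[$1] = $2 + 0
        }
        END { for (year in max) printf "%s\t%d\n", year, max[year] }
    ' "$1" | sort
}
```

Running `max_temp_by_year sample.txt` on the local file should list the same year and maximum pairs that `DUMP max_temp` prints as tuples.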
6.9. Inspecting the Execution Steps
grunt> ILLUSTRATE max_temp;
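The statements from sections 6.5 to 6.8 can also be saved in a script file and run without the interactive shell. The following is a minimal sketch; write_max_temp_script and the file name max_temp.pig are illustrative choices, not part of Pig.

```shell
# write_max_temp_script SCRIPT_FILE INPUT_PATH
# Writes the load, filter, group, and MAX statements into one Pig Latin
# script. INPUT_PATH is substituted into the LOAD statement.
write_max_temp_script() {
    cat > "$1" <<EOF
-- Same statements as in the Grunt session, collected into one script.
records = LOAD '$2' AS (year:chararray, temperature:int, quality:int);
filter_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filter_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filter_records.temperature);
DUMP max_temp;
EOF
}
```

After `write_max_temp_script max_temp.pig /input/ncdc/micro-tab/sample.txt`, the job can be submitted with `pig max_temp.pig`.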