Backing up crawled data to Hive

1: Create the table

CREATE EXTERNAL TABLE `tbl_spider`(
  `url` string,
  `html` string
) PARTITIONED BY (
  `site` string,
  `dt` string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 'hdfs://aaaa-offline/hive/warehouse/tbl_spider';


The EXTERNAL keyword means that when the table is dropped, the underlying data files are kept.
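In other words, dropping the table only removes Hive metadata; a sketch of what that looks like (paths as in this post):

```sql
-- With an EXTERNAL table, DROP TABLE removes only the metastore entry;
-- the files under LOCATION stay on HDFS, and the table can be recreated
-- over them later with the same CREATE EXTERNAL TABLE statement.
DROP TABLE tbl_spider;
-- afterwards, `hadoop fs -ls hdfs://aaaa-offline/hive/warehouse/tbl_spider`
-- still lists the data files.
```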


2: Build test data

Under /home/cphmvp, create a file named aaaa.com (e.g. with vim) containing two columns separated by \t:

http://www.aaaa.com/1.html	test
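A literal tab is easy to lose when copy-pasting into an editor; writing the test file with printf keeps it intact (the /tmp path here is just illustrative):

```shell
# Write one record; \t in printf produces a real tab character.
printf 'http://www.aaaa.com/1.html\ttest\n' > /tmp/aaaa.com

# Confirm the row splits into exactly two tab-separated fields.
awk -F'\t' '{print NF}' /tmp/aaaa.com
```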

Initial load:

load data local inpath '/home/cphmvp/aaaa.com' overwrite into table tbl_spider partition (site ='aaaa.com',dt='20160720');


3: Query
select * from tbl_spider where site='aaaa.com';

Options for uploading the backup data
1: load
2: put
2.1 Check whether the target directory exists; create it manually if not
2.2 Upload the file with put -f (overwrite), checking $? for success
2.3 Register the partition in Hive

=====================

demo:
1: Trial upload
hadoop fs -put aaaa.com hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719/
Better to test for the directory directly:

   hadoop fs -test -d hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719

2: Check the status with echo $?; if it is non-zero the test failed, so try creating the directory
3: Create the directory
hadoop fs -mkdir -p hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719/
Check the status with echo $? (note that -p creates missing parent directories recursively)
4: Actual upload: hadoop fs -put -f aaaa.com hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719/
echo $? (unless noted otherwise, every step's result should be checked and retried on failure); -f forces overwrite
5: Test query
select * from tbl_spider; still returns only one row (Hive must first be told about the new partition)
6: Register the partition in Hive
hive -e "alter table tbl_spider add partition (site='aaaa.com',dt='20160719') LOCATION 'hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719';"
echo $?
7: Verify the data
select * from tbl_spider;
returns both days of data
select * from tbl_spider where dt='20160719';
returns a single day of data
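Steps 2, 4 and 6 all follow the same check-$?-and-retry pattern; it can be factored into a helper (retry_once is a hypothetical function, not from the post):

```shell
# Run a command; on non-zero exit, log it and retry exactly once.
retry_once() {
    "$@" && return 0
    echo "retrying: $*" >&2
    "$@"
}

retry_once hadoop fs -put -f aaaa.com "$outpath" || echo "upload failed after one retry"
```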


Full demo

Usage: sh parseSpiderLog.sh test aaaa.com all 20160722

(Note: the schema below has gained an extra partition field, source_type, for the source page type. The earlier sections are not updated to match; treat them as a reference.)
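For reference, the table from step 1 would then need a matching third partition column; a hypothetical DDL (the post does not show it):

```sql
-- Assumed DDL matching the script's three partition keys (site, source_type, dt).
CREATE EXTERNAL TABLE `tbl_spider`(
  `url` string,
  `html` string
) PARTITIONED BY (
  `site` string,
  `source_type` string,
  `dt` string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 'hdfs://aaaa/hive/warehouse/tbl_spider';
```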


#!/bin/bash
if [ $# -ne 4 ]; then
    echo "USAGE: $0 filename site source_type dt"
    echo "e.g.: $0 test aaaa.com sj 20160722"
    exit 1
fi
filename="$1"
site="$2"
source_type="$3"
dt="$4"
basepath="hdfs://aaaa/hive/warehouse/tbl_spider/"
outpath="${basepath}site=$site/source_type=$source_type/dt=$dt"
log="$filename.upload_log"
echo "begin to deal logfile $filename"
echo "begin to deal logfile $filename" > "$log"
echo "outpath hdfs $outpath"
echo "outpath hdfs $outpath" >> "$log"
sleep 3
# Strip the multi-thread markers from the crawl log:
# replace the "  INFO  -" marker with a tab, drop the leading "[thread] " prefix
cat "$filename" | sed 's/  INFO  -/\t/g' | sed 's/^\[.*] //g' > "$filename.format"
# Check whether the target directory exists; create it if not
hadoop fs -test -d "$outpath"
if [ $? -eq 0 ]; then
    echo 'dir exists' >> "$log"
else
    hadoop fs -mkdir -p "$outpath"
    echo 'create dir' >> "$log"
fi
# Upload, retrying once on failure
hadoop fs -put -f "$filename.format" "$outpath"
if [ $? -ne 0 ]; then
    echo 'retry put' >> "$log"
    hadoop fs -put -f "$filename.format" "$outpath"
else
    echo 'put success' >> "$log"
fi
# Refresh the partition in Hive
HQL="
ALTER TABLE tbl_spider DROP IF EXISTS PARTITION (site='$site',source_type='$source_type',dt='$dt');
ALTER TABLE tbl_spider ADD PARTITION (site='$site',source_type='$source_type',dt='$dt') LOCATION '$outpath';
"
hive -e "$HQL"
status=$?
if [ $status -ne 0 ]; then
    echo 'retry alter' >> "$log"
    hive -e "$HQL"
    status=$?
else
    echo 'alter hive success' >> "$log"
fi
# Capture hive's exit code before logging, so the echo does not clobber $?
if [ $status -eq 0 ]; then
    echo 'rm log file' >> "$log"
    rm "$filename"
    rm "$filename.format"
else
    echo 'task fail' >> "$log"
fi
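The sed pipeline near the top of the script assumes log lines shaped like `[thread-name] url  INFO  -html`; a quick local check of that transformation (the sample line is made up, and `\t` in the s command requires GNU sed):

```shell
line='[pool-1-thread-3] http://www.aaaa.com/1.html  INFO  -<html>ok</html>'

# Same two substitutions as the script: marker -> tab, thread prefix removed.
echo "$line" | sed 's/  INFO  -/\t/g' | sed 's/^\[.*] //g'
```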


posted @ 2016-07-21 20:16  cphmvp
Online crawler testing tool: http://tool.haoshuju.cn/