Backing Up Crawled Data to Hive
1: Create the table
CREATE EXTERNAL TABLE `tbl_spider` (
  `url`  string,
  `html` string
)
PARTITIONED BY (`site` string, `dt` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
LOCATION 'hdfs://aaaa-offline/hive/warehouse/tbl_spider';
The point of EXTERNAL: when the table is dropped, the underlying data files are kept.
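For comparison, a sketch of what dropping the external table does (table and location as above): only the Hive metadata goes away, the HDFS files survive, and re-running the CREATE statement plus re-adding partitions restores access.

```sql
-- Only metadata is removed; files under LOCATION stay on HDFS.
DROP TABLE tbl_spider;
-- Re-create with the same CREATE EXTERNAL TABLE ... LOCATION statement,
-- then re-add each partition, and the old data is queryable again.
```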
2: Build test data
Under /home/cphmvp, use vim to create a file named aaaa.com with the following content, two columns separated by \t:
http://www.aaaa.com/1.html test
Initial load:
load data local inpath '/home/cphmvp/aaaa.com' overwrite into table tbl_spider partition (site ='aaaa.com',dt='20160720');
3: Query
select * from tbl_spider where site='aaaa.com';
Upload/backup options
1: load approach
2: put approach
2.1 Check whether the target directory exists; if not, create it manually
2.2 Upload the file with put, -f to overwrite (check $? for success)
2.3 Have Hive register the partition
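Every step in the put approach hinges on checking `$?` and retrying on failure. A minimal sketch of that idiom in plain shell, where `upload` is a hypothetical stand-in for the real `hadoop fs -put -f ...` call (here it simulates one failure followed by a successful retry):

```shell
#!/bin/sh
# "upload" is a placeholder for the real hadoop fs -put command.
attempts=0
upload() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 2 ]   # simulated: first call fails, retry succeeds
}

upload
if [ $? -ne 0 ]; then     # non-zero exit status means the command failed
  echo "put failed, retrying"
  upload || { echo "giving up"; exit 1; }
fi
echo "put success"
```

The same check-then-retry shape is used for mkdir, put, and the hive -e call in the full script below.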
=====================
demo:
1: Trial upload
   hadoop fs -put aaaa.com hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719/
   Better: test for the directory directly instead:
   hadoop fs -test -d hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719
2: Check the status with echo $?; non-zero means failure, so try creating the directory.
3: Create the directory:
   hadoop fs -mkdir -p hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719/
   then check echo $? again. [Note: -p creates parent directories recursively.]
4: Actual upload:
   hadoop fs -put -f aaaa.com hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719/
   echo $?
   [Unless stated otherwise, every step needs its exit status checked, with a retry on failure.] -f forces overwrite.
5: Test query: select * from tbl_spider; — still only one row (Hive must load the partition first).
6: Register the partition in Hive:
   hive -e "alter table tbl_spider add partition (site='aaaa.com',dt='20160719') LOCATION 'hdfs://aaaa-offline/hive/warehouse/tbl_spider/site=aaaa.com/dt=20160719';"
   echo $?
7: Verify the data:
   select * from tbl_spider;                      -- both days
   select * from tbl_spider where dt='20160719';  -- a single day
Full demo
Usage: sh parseSpiderLog.sh test aaaa.com all 20160722
[Note: the schema has grown: a source-type column source_type was added. The sections above are not updated to match; treat them as reference only.]
#!/bin/bash
filename="$1"
basepath="hdfs://aaaa/hive/warehouse/tbl_spider/"
site="$2"
source_type="$3"
dt="$4"

if [ $# != 4 ]; then
    echo "USAGE:$0 filename site source_type dt"
    echo "e.g.: $0 test aaaa.com sj 20160722"
    exit 1
fi

outpath=$basepath"site=$site/source_type=$source_type/dt=$dt"
log=$filename."upload_log"
echo "begin to deal logfile $filename"
echo "begin to deal logfile $filename" > $log
echo "outpath hdfs $outpath"
echo "outpath hdfs $outpath" >> $log
sleep 3

# strip the multi-thread markers from the crawler log
cat $filename | sed 's/ INFO -/\t/g' | sed 's/^\[.*] //g' > $filename.format

# check whether the target directory exists; create it if not
hadoop fs -test -d $outpath
if [ $? -eq 0 ]; then
    echo 'dir exist' >> $log
else
    hadoop fs -mkdir -p $outpath
    echo 'create dir' >> $log
fi

# actual upload, retrying once on failure
hadoop fs -put -f $filename.format $outpath
if [ $? -ne 0 ]; then
    echo 'retry put' >> $log
    hadoop fs -put -f $filename.format $outpath
else
    echo 'put success' >> $log
fi

# refresh the partition in Hive
HQL="
ALTER TABLE tbl_spider DROP IF EXISTS PARTITION (site='$site',source_type='$source_type',dt='$dt');
ALTER TABLE tbl_spider ADD PARTITION (site='$site',source_type='$source_type',dt='$dt') LOCATION '$outpath';
"
hive -e "$HQL"
if [ $? -ne 0 ]; then
    echo 'retry alter' >> $log
    hive -e "$HQL"
else
    echo 'alter hive success' >> $log
fi

if [ $? -eq 0 ]; then
    echo 'rm log file' >> $log
    rm $filename
    rm $filename.format
else
    echo 'task fail' >> $log
fi
Created by cphmvp
email: cphmvp@163.com
Crawler discussion QQ group (爬虫技术交流_crawler): 167047843