Sqoop
Introduction to the Sqoop framework (tool)
** RDBMS <-- Sqoop --> HDFS, Hive, HBase
** Existing data in the database cluster --> HDFS --> analysis
** Hive analysis results --> RDBMS (MySQL) --> report display
** sqoop.apache.org
** Apache release
** CDH release
====Installing the CDH release of Hadoop=====================================================
** The Apache release of Sqoop has compatibility issues with Apache Hadoop (missing jars)
** Hadoop is an open-source project with limited manpower to chase compatibility issues, whereas Cloudera is a company
** http://archive.cloudera.com/cdh5/
** Installing and deploying CDH is almost identical to the Apache release
1. Create a new directory for the CDH packages
$ mkdir cdh --> /opt/modules/cdh
$ cd cdh/
$ tar zxf /opt/softwares/hadoop-2.5.0-cdh5.3.6.tar.gz
2. Edit the Hadoop configuration files
2.1 Configure the JDK
# rpm -qa | grep -i java    -- check whether extra JDKs are installed
** .../etc/hadoop
Edit: hadoop-env.sh, yarn-env.sh, mapred-env.sh
export JAVA_HOME=/opt/modules/jdk1.7.0_67
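A minimal sketch for setting JAVA_HOME in all three env scripts at once, assuming they live under etc/hadoop of the CDH install (appending at the end of each file overrides any earlier JAVA_HOME line):
$ cd /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/etc/hadoop
$ for f in hadoop-env.sh yarn-env.sh mapred-env.sh; do
      # append the JDK path configured above
      echo 'export JAVA_HOME=/opt/modules/jdk1.7.0_67' >> "$f"
  done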
2.2 Configure the related files
** Copy the corresponding file contents over from the Apache Hadoop install
a) core-site.xml: set hadoop.tmp.dir; the directory is generated automatically when the namenode is formatted (a sketch follows this list)
b) mapred-site.xml (rename it, dropping the .template suffix)
   hdfs-site.xml, yarn-site.xml, mapred-site.xml: the contents of these three files need no changes
c) slaves: copy the contents over
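A minimal core-site.xml sketch for a single-node setup, run from the Hadoop install directory; the hostname, port, and tmp path are illustrative and should match your existing Apache Hadoop configuration:
$ cat > etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://blue01.mydomain:8020</value>
    </property>
    <property>
        <!-- generated automatically by namenode -format -->
        <name>hadoop.tmp.dir</name>
        <value>/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp</value>
    </property>
</configuration>
EOF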
3. Format and start
$ bin/hdfs namenode -format
$ sbin/start-dfs.sh    -- enter the password when prompted
$ sbin/start-yarn.sh
$ sbin/mr-jobhistory-daemon.sh start historyserver
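To confirm the daemons came up, a quick check (the list should include NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and JobHistoryServer):
$ jps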
PS:
** If it reports that the JobHistory server is already running:
$ ls /tmp/*.pid    -- look for the mapred-<user>-historyserver.pid file
$ rm -rf /tmp/mapred-<user>-historyserver.pid    -- delete it, then start again
** If some daemon did not start, check the corresponding log under logs/
$ cd logs
$ ls
4. Test:
http://192.168.158.100:50070/
$ bin/hdfs dfs -mkdir /input2          # create a directory on HDFS; there is no -cd-style option to change into a directory
$ hdfs dfs -mkdir -p /aaa/bbb/ccc      # create nested directories
$ hdfs dfs -ls /                       # list
$ hdfs dfs -put b.txt /input2          # copy a local file to HDFS
$ hdfs dfs -cat /input2/b.txt          # view a file
$ hdfs dfs -rm /input2/b.txt           # delete a file
$ hdfs dfs -rmr /input?                # recursive delete; the '-rm -r' form is preferred; '?' is a single-character wildcard
$ hdfs dfs -help
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar wordcount /input/ /output
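To inspect the word-count result (assuming the job wrote to /output as in the command above):
$ bin/hdfs dfs -cat /output/part-r-*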
====Installing the CDH release of Hive======================================
1. Unpack
$ tar zxvf /opt/softwares/hive-0.13.1-cdh5.3.6.tar.gz
2. Rename hive-env.sh (drop the .template suffix)
HADOOP_HOME=/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6
export HIVE_CONF_DIR=/opt/modules/cdh/hive-0.13.1-cdh5.3.6/conf
3. Rename to hive-site.xml (drop the .template suffix and change 'default' to 'site')
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://blue01.mydomain:3306/cdhmetastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
4. Rename hive-log4j.properties (drop the .template suffix)
5. Copy the JDBC driver
** Copy the driver jar into lib/
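For example (the source path of the connector jar below is an assumption; use wherever your copy of the driver lives):
$ cp /opt/softwares/mysql-connector-java-5.1.27-bin.jar /opt/modules/cdh/hive-0.13.1-cdh5.3.6/lib/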
6. Switch to the CDH Hadoop directory, create the warehouse directory, and adjust its permissions
** /user/hive/warehouse is the directory where Hive stores its data
$ bin/hdfs dfs -mkdir -p /user/hive/warehouse
$ bin/hadoop fs -chmod g+w /user/hive/warehouse
7. Start the client and use Hive
** Note: the template file has a bug; line 2781 of hive-site.xml is missing an opening <property> tag
** Comment out HIVE_HOME in the profile
$ bin/hive
hive> show databases;
====Installing the CDH release of ZooKeeper=========================================
ZooKeeper is a distributed, open-source coordination service for distributed applications.
It provides consistency services such as configuration maintenance, naming, distributed synchronization, and group services.
1. $ tar zxvf /opt/softwares/zookeeper-3.4.5-cdh5.3.6.tar.gz
2. Enter the zookeeper directory and create a directory to hold its data
   $ mkdir zkData
3. In the conf directory
   $ cp -a zoo_sample.cfg zoo.cfg
   Then set: dataDir=/opt/modules/cdh/zookeeper-3.4.5-cdh5.3.6/zkData
4. Start
   $ bin/zkServer.sh start
   Check its status: $ bin/zkServer.sh status
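Optionally, verify connectivity with the bundled client (assumes the default client port 2181 from zoo.cfg):
$ bin/zkCli.sh -server localhost:2181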
====Installing Sqoop=====================================================
1. Unpack
$ tar zxvf /opt/softwares/sqoop-1.4.5-cdh5.3.6.tar.gz
2. Configure conf/sqoop-env.sh
Rename it: $ cp -a sqoop-env-template.sh sqoop-env.sh
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6
#Set the path to where bin/hive is available
export HIVE_HOME=/opt/modules/cdh/hive-0.13.1-cdh5.3.6
#Set the path for where zookeper config dir is
export ZOOCFGDIR=/opt/modules/cdh/zookeeper-3.4.5-cdh5.3.6/conf
3. Copy the JDBC driver jar into the lib directory
$ cp /opt/modules/apache-hive-0.13.1-bin/lib/mysql-connector-java-5.1.27-bin.jar lib/
====Using Sqoop=============================================
Key points
RDBMS <-- sqoop --> HDFS
Import data from a relational database into HDFS
Export data from HDFS into a relational database
** Sqoop relies on the MapReduce model and runs on YARN
** Note: Sqoop import/export only needs map tasks; there is no reduce phase
(1) RDBMS:    -- JDBC connection essentials
    1) jdbc url
    2) username
    3) password
    4) table name
(2) Hadoop:
    1) HDFS: path
    2) Hive: database name, table name
(3) Direction:
    1) import
    2) export
** Whether it is import or export is decided with HDFS as the reference point
** If data is written into HDFS, Hive, or HBase --> import
** If data is written into the RDBMS --> export (a minimal command skeleton follows)
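A minimal sketch of the two directions; every value in angle brackets is a placeholder, and complete working examples follow below:
$ bin/sqoop import --connect jdbc:mysql://<host>:3306/<db> --username <user> --password <pwd> --table <table> --target-dir <hdfs-path>
$ bin/sqoop export --connect jdbc:mysql://<host>:3306/<db> --username <user> --password <pwd> --table <table> --export-dir <hdfs-path>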
Sqoop commands:
$ bin/sqoop help
$ bin/sqoop list-databases --help    ** show the help for a given subcommand; list-databases is a subcommand
-- list all databases on the target node, used to test whether the RDBMS (MySQL) is reachable; '\' marks a line break inside a single command
$ bin/sqoop list-databases \
--connect jdbc:mysql://lxl01.mydomain:3306 \
--username root \
--password root
====Importing from MySQL into HDFS===================================================
Case 1: use Sqoop to import data from MySQL into HDFS
1. Create a test table in MySQL
CREATE TABLE `my_user` (
`id` tinyint(4) NOT NULL AUTO_INCREMENT,
`account` varchar(255) DEFAULT NULL,
`passwd` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `my_user` VALUES ('1', 'admin', 'admin');
INSERT INTO `my_user` VALUES ('2', 'pu', '12345');
INSERT INTO `my_user` VALUES ('3', 'system', 'system');
INSERT INTO `my_user` VALUES ('4', 'zxh', 'zxh');
INSERT INTO `my_user` VALUES ('5', 'test', 'test');
INSERT INTO `my_user` VALUES ('6', 'pudong', 'pudong');
INSERT INTO `my_user` VALUES ('7', 'qiqi', 'qiqi');
INSERT INTO `my_user` VALUES ('8', 'congcong', 'congcong');
2. Use Sqoop to import the data into HDFS
$ bin/sqoop import --help
$ bin/sqoop import \
--connect jdbc:mysql://blue01.mydomain:3306/mydb \
--username root \
--password root \
--table my_user \
--target-dir /user/hive/warehouse/my_user \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
target-dir: the HDFS output directory
delete-target-dir: delete the output directory first if it already exists
num-mappers: set the number of map tasks to 1; the default is 4, which would produce 4 files in the output directory
fields-terminated-by "\t": set the column delimiter to \t
Check: $ hdfs dfs -cat /user/hive/warehouse/my_user/p*
--------------------------
1. Select specific columns (num-mappers is 2, producing 2 files)
$ bin/sqoop import \
--connect jdbc:mysql://blue01.mydomain:3306/mydb \
--username root \
--password root \
--table my_user \
--target-dir /user/hive/warehouse/my_user \
--delete-target-dir \
--num-mappers 2 \
--fields-terminated-by "\t" \
--columns id,passwd
---------------------------
2. Use where to specify a condition
$ bin/sqoop import \
--connect jdbc:mysql://blue01.mydomain:3306/mydb \
--username root \
--password root \
--table my_user \
--target-dir /user/hive/warehouse/my_user \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns id,passwd \
--where "id<=3"
--------------------------
3. Import the result of a select statement; the WHERE clause must contain '$CONDITIONS', otherwise Sqoop reports an error
--query "select id,account from my_user where id>=3 and $CONDITIONS"
$ bin/sqoop import \
--connect jdbc:mysql://blue01.mydomain:3306/mydb \
--username root \
--password root \
--target-dir /user/hive/warehouse/my_user \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query 'select id,account from my_user where id>=3 and $CONDITIONS'
=====Importing from MySQL into Hive=======================================================
Case 2: import data from MySQL into Hive
1. Create the Hive table first
hive> create database mydb;
hive> use mydb;
CREATE TABLE mydb.my_user (
id int,
account string,
passwd string
)row format delimited fields terminated by "\t";
2. Import the data into Hive
$ bin/sqoop import \
--connect jdbc:mysql://blue01.mydomain:3306/mydb \
--username root \
--password root \
--table my_user \
--num-mappers 1 \
--hive-import \
--hive-database mydb \
--hive-table my_user \
--fields-terminated-by "\t" \
--delete-target-dir \
--hive-overwrite
3. Check the result: hive> select * from my_user;
====Exporting from Hive or HDFS into MySQL=============================
Case 3: export data from Hive or HDFS into MySQL
1. Create a MySQL table first
CREATE TABLE `hive2mysql` (
`id` tinyint(4) PRIMARY KEY AUTO_INCREMENT,
`account` varchar(255),
`passwd` varchar(255)
);
2. Export from Hive or HDFS into the MySQL table
$ bin/sqoop export \
--connect jdbc:mysql://blue01.mydomain:3306/mydb \
--username root \
--password root \
--table hive2mysql \
--num-mappers 1 \
--export-dir /user/hive/warehouse/mydb.db/my_user \
--input-fields-terminated-by "\t"
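To verify the export on the MySQL side (a quick check using the credentials from the command above):
$ mysql -uroot -proot -e "select * from mydb.hive2mysql"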
====Options files (scripts)====================================================
1. Create an .opt options file (note: each option and its value must go on its own line)
vi job1.opt
export
--connect
jdbc:mysql://blue01.mydomain:3306/mydb
--username
root
--password
root
--table
hive2mysql
--num-mappers
1
--export-dir
/user/hive/warehouse/mydb.db/my_user
--input-fields-terminated-by
"\t"
2. Run the file with Sqoop
** Delete the rows already in the table first, to avoid duplicate primary keys
$ bin/sqoop --options-file job1.opt
====Case study=============================================================================
Compute a site's PV and UV for every hour of every day (see the diagram)
** Hive analysis workflow:
** 1. Upload the log files from the local Linux filesystem to HDFS
** 2. Load the uploaded files into the corresponding Hive table [table 1]
** 3. Use Hive to run the business analysis
** 4. Store the results of the Hive queries temporarily in a Hive table [table 2]
** 5. Use Sqoop to export the data from Hive table [table 2] into MySQL for permanent storage
Source data files:
2015082818
2015082819
1. Create the partitioned Hive table [table 1]
** create database if not exists mydb;
create table if not exists mydb.track_log(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
partitioned by (date string,hour string)
row format delimited fields terminated by '\t';
2. Upload the data to HDFS, then attach it to the Hive table's partitions
2.1 Create the directories
$ bin/hdfs dfs -mkdir -p /user/hive/warehouse/mydb.db/track_log/date=20150828/hour=18
$ bin/hdfs dfs -mkdir -p /user/hive/warehouse/mydb.db/track_log/date=20150828/hour=19
2.2 Upload the files into the corresponding directories
$ bin/hdfs dfs -put /home/tom/2015082818 /user/hive/warehouse/mydb.db/track_log/date=20150828/hour=18
$ bin/hdfs dfs -put /home/tom/2015082819 /user/hive/warehouse/mydb.db/track_log/date=20150828/hour=19
2.3 Associate the data with the partitioned table
alter table track_log add partition(date='20150828',hour='18') location "/user/hive/warehouse/mydb.db/track_log/date=20150828/hour=18";
alter table track_log add partition(date='20150828',hour='19') location "/user/hive/warehouse/mydb.db/track_log/date=20150828/hour=19";
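To confirm both partitions were registered (a quick check from the shell):
$ bin/hive -e "use mydb; show partitions track_log"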
3. Use Hive to compute PV and UV per day and hour
** Create a temporary table to hold the day's analysis results
create table mydb.tracklog_pvuv_hourly(
date string,
hour string,
pv string,
uv string
)row format delimited fields terminated by '\t';
** Save the analysis results into the temporary table
** url is the page link; guid is the unique visitor ID
insert overwrite table mydb.tracklog_pvuv_hourly
select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour;
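To eyeball the hourly result before exporting (a minimal check):
$ bin/hive -e "use mydb; select * from tracklog_pvuv_hourly"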
4. Use Sqoop to export the analysis results into a MySQL table
4.1 Create the table in MySQL:
create table mydb.tracklog_pvuv_hourly(
date int,
hour int,
pv bigint,
uv bigint
);
4.2 Create an .opt file
vi job2.opt
export
--connect
jdbc:mysql://blue01.mydomain:3306/mydb
--username
root
--password
root
--table
tracklog_pvuv_hourly
--num-mappers
1
--export-dir
/user/hive/warehouse/mydb.db/tracklog_pvuv_hourly
--input-fields-terminated-by
"\t"
4.3 Run the options file
$ bin/sqoop --options-file job2.opt
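To verify the exported rows on the MySQL side (credentials as configured in job2.opt):
$ mysql -uroot -proot -e "select * from mydb.tracklog_pvuv_hourly"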
----Extensions--------------------------------
I. Dimensions
Basic metrics: pv, uv, vv, ip, ...
Dimension metrics: time, region, product category, ...
Example:
Add the dimensions date, hour, provinceid (province ID)
select date,hour,provinceid,count(url),count(distinct guid) from track_log
where date='20150828' group by date,hour,provinceid;
II. Data cleansing
Usually, after the raw files to be analyzed are uploaded to HDFS, they may contain dirty data that has to be filtered out,
or some fields may need their format converted to meet the requirements (e.g. date types).
In that case we write our own MapReduce program to do the [data cleansing].
--- Once written, the program is usually exported as a jar, e.g. LogClean.jar
--- Run the cleansing program: bin/yarn jar LogClean.jar <input path> <output path>
--- The output path is the path backing the Hive table
Example:
1. Write the cleansing program and filter out the invalid records [dirty data]
2. Run the cleansing program
$ bin/yarn jar /home/tom/jars/logclean.jar /logclean /user/hive/warehouse/db_track.db/track_log/date=20150828/hour=19
====Appendix: session notes (Sqoop 1.4.6 on Hadoop 2.4.1, host alamps)=============================
mysql> alter user 'root'@'localhost' identified by 'Alamps@163';
Query OK, 0 rows affected (1.55 sec)
mysql> flush privileges;
DROP TABLE IF EXISTS `order_admin_type_mgmt`;
CREATE TABLE `order_admin_type_mgmt` (
  `id` bigint(20) NOT NULL,
  `alarm_type` varchar(128) NOT NULL COMMENT 'fault type',
  `alarm_property` varchar(1024) DEFAULT NULL COMMENT 'fault attributes',
  `alarm_desc` longtext COMMENT 'fault description',
  `ext` varchar(1024) DEFAULT NULL COMMENT 'extension field',
  PRIMARY KEY (`id`),
  KEY `idx_alarm_type` (`alarm_type`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
-- ----------------------------
-- Records of order_admin_type_mgmt
-- ----------------------------
INSERT INTO `order_admin_type_mgmt` VALUES ('1', 'Normal Fault', 'Normal Fault', 'please deal with as soon as possible', 'aaa');
INSERT INTO `order_admin_type_mgmt` VALUES ('2', 'Replace failed hardware', 'failed hardware', 'failed hardware', 'bbb');
INSERT INTO `order_admin_type_mgmt` VALUES ('3', 'Hardware is removed but noe replaced', 'Hardware is removed', 'Hardware is removed', 'ccc');
INSERT INTO `order_admin_type_mgmt` VALUES ('4', 'Hardware is lost and needs to be replaced', 'Hardware is lost', 'Hardware is lost', 'ddd');
INSERT INTO `order_admin_type_mgmt` VALUES ('5', 'Hardware is fault and there is engineering defects', 'Hardware is fault', 'Hardware is fault', 'eee');
INSERT INTO `order_admin_type_mgmt` VALUES ('6', 'No need to be prossessed', 'No need to be prossessed', 'No need to be prossessed', 'fff');
mysql> select * from order_admin_type_mgmt;
+----+----------------------------------------------------+--------------------------+--------------------------------------+------+
| id | alarm_type                                         | alarm_property           | alarm_desc                           | ext  |
+----+----------------------------------------------------+--------------------------+--------------------------------------+------+
|  1 | Normal Fault                                       | Normal Fault             | please deal with as soon as possible | aaa  |
|  2 | Replace failed hardware                            | failed hardware          | failed hardware                      | bbb  |
|  3 | Hardware is removed but noe replaced               | Hardware is removed      | Hardware is removed                  | ccc  |
|  4 | Hardware is lost and needs to be replaced          | Hardware is lost         | Hardware is lost                     | ddd  |
|  5 | Hardware is fault and there is engineering defects | Hardware is fault        | Hardware is fault                    | eee  |
|  6 | No need to be prossessed                           | No need to be prossessed | No need to be prossessed             | fff  |
+----+----------------------------------------------------+--------------------------+--------------------------------------+------+
6 rows in set (0.04 sec)
Start HDFS:
[hadoop@alamps sbin]$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [alamps]
alamps: starting namenode, logging to /home/hadoop/app/hadoop-2.4.1/logs/hadoop-hadoop-namenode-alamps.out
alamps: starting datanode, logging to /home/hadoop/app/hadoop-2.4.1/logs/hadoop-hadoop-datanode-alamps.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/app/hadoop-2.4.1/logs/hadoop-hadoop-secondarynamenode-alamps.out
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/app/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-alamps.out
alamps: starting nodemanager, logging to /home/hadoop/app/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-alamps.out
[hadoop@alamps sbin]$ jps
2464 DataNode
2790 ResourceManager
2374 NameNode
2586 SecondaryNameNode
2891 NodeManager
3099 Jps
-----------------------------------------------------------------
vi sqoop-env.sh:
export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop-2.4.1
export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop-2.4.1
export HIVE_HOME=/home/hadoop/app/apache-hive-0.14.0-bin
export HBASE_HOME=/home/hadoop/hbase-0.96.2
export ZOOCFGDIR=/home/hadoop/app/zookeeper-3.4.8/conf
if [ -z "${HADOOP_HOME}" ]; then
  HADOOP_HOME=/home/hadoop/app/hadoop-2.4.1
fi
if [ -z "${HBASE_HOME}" ]; then
  HBASE_HOME=/home/hadoop/hbase-0.96.2
fi
if [ -z "${ZOOKEEPER_HOME}" ]; then
  ZOOKEEPER_HOME=/home/hadoop/app/zookeeper-3.4.8
fi
The Hadoop version I had installed was vanilla hadoop-0.20.203.0, which Sqoop does not support. You can either use the CDH3 release of Hadoop, or copy the required jars into sqoop-1.2.0-CDH3B4/lib and keep using it.
Download the relevant files:
http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4.tar.gz
http://archive.cloudera.com/cdh/3/sqoop-1.2.0-CDH3B4.tar.gz
sqoop-1.2.0-CDH3B4 depends on hadoop-core-0.20.2-CDH3B4.jar, so download hadoop-0.20.2-CDH3B4.tar.gz, unpack it, and copy hadoop-0.20.2-CDH3B4/hadoop-core-0.20.2-CDH3B4.jar into sqoop-1.2.0-CDH3B4/lib.
In addition, importing MySQL data with Sqoop depends on mysql-connector-java-*.jar at runtime, so download mysql-connector-java-*.jar and copy it into sqoop-1.2.0-CDH3B4/lib as well.
------------------------------------------------------------------
<property>
    <name>sqoop.metastore.client.autoconnect.username</name>
    <value>root</value>
    <description>The username to bind to the metastore.
    </description>
</property>
<property>
    <name>sqoop.metastore.client.autoconnect.password</name>
    <value>root</value>
    <description>The password to bind to the metastore.
    </description>
</property>
-------------------------------------------------------------
[hadoop@alamps lib]$ sqoop list-databases --connect jdbc:mysql://alamps:3306/sqoop_mysql2hdfs --username root --password root
Warning: /home/hadoop/app/sqoop-1.4.6.bin/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/app/sqoop-1.4.6.bin/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/app/sqoop-1.4.6.bin/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
17/12/08 19:12:14 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
17/12/08 19:12:14 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/12/08 19:12:14 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
hive
mysql
sqoop_mysql2hdfs
[hadoop@alamps lib]$ sqoop list-tables --connect jdbc:mysql://alamps/sqoop_mysql2hdfs --username root --password root
Warning: /home/hadoop/app/sqoop-1.4.6.bin/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/app/sqoop-1.4.6.bin/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/app/sqoop-1.4.6.bin/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
17/12/08 19:14:13 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
17/12/08 19:14:13 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/12/08 19:14:13 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
order_admin_type_mgmt
-----------------------------------------------------------------------
sqoop import --connect jdbc:mysql://alamps:3306/sqoop_mysql2hdfs --username root --password root --table order_admin_type_mgmt --m 1 --target-dir /output/sqoop/
sqoop import --connect jdbc:mysql://alamps:3306/sqoop_mysql2hdfs --username root --password root --table order_admin_type_mgmt --target-dir /output/sqoop/
**************************************************************************************************************************************************************************************
Installing Sqoop under Hadoop 2.2.0
Here I am working in a CentOS environment for now.
My Hadoop cluster runs Hadoop 2.2.0, so the Sqoop version must match it; otherwise problems will show up later when using the Sqoop tool.
I chose sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz. Installing Sqoop is simple and takes just a few steps.
Step 1: place the downloaded sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz under /usr/java/, then unpack the archive, rename the directory, and change its ownership.
[root@singlehadoop java]# tar zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz      // unpack
[root@singlehadoop java]# rm sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz            // remove the archive
[root@singlehadoop java]# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop             // rename the install directory
[root@singlehadoop java]# chown -R hadoop:hadoop sqoop                             // give the hadoop user ownership of sqoop
Step 2: switch to the sqoop/conf directory and run the following commands.
[hadoop@singlehadoop java]$ cd sqoop/conf
[hadoop@singlehadoop java]$ cp sqoop-env-template.sh sqoop-env.sh
Then open the file with vi sqoop-env.sh and add the following content.
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/usr/java/hadoop-2.2.0-x64      (configured here because, for this introductory setup, the data transfer involves Hadoop)
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/usr/java/hadoop-2.2.0-x64      (configured here because, for this introductory setup, the data transfer involves Hadoop)
#set the path to where bin/hbase is available
#export HBASE_HOME=      (configure this later yourself)
#Set the path to where bin/hive is available
export HIVE_HOME=/usr/java/hive-1.0.0      (configure this later yourself)
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=      (configure this later yourself; it requires a 3- or 5-node distributed ensemble and is unnecessary for a single node)
If the data transfer does not involve HBase or Hive, the HBase and Hive settings can be left out; configure ZooKeeper only if the cluster has a standalone ZooKeeper ensemble, otherwise skip it.
Step 3: copy the required jars into the sqoop/lib directory.
Three core Hadoop 2.2.0 jars need to be imported (because, for now, my data transfer only involves Hadoop):
commons-cli-1.2.jar, hadoop-common-2.2.0.jar, and hadoop-mapreduce-client-core-2.2.0.jar.
The database driver jar must also be imported; since we use MySQL here, that is mysql-connector-java-5.1.21.jar.
[hadoop@singlehadoop lib]$ cp commons-cli-1.2.jar /usr/java/sqoop/lib
[hadoop@singlehadoop common]$ cp hadoop-common-2.2.0.jar /usr/java/sqoop/lib
[hadoop@singlehadoop mapreduce]$ cp hadoop-mapreduce-client-core-2.2.0.jar /usr/java/sqoop/lib
[hadoop@singlehadoop java]$ cp mysql-connector-java-5.1.21.jar /usr/java/sqoop/lib
Step 4: add the environment variables.
[hadoop@singlehadoop java]$ vi ~/.bash_profile
PATH=$PATH:$HOME/bin
export SQOOP_HOME=/usr/java/sqoop              // Sqoop install directory
export PATH=$PATH:$SQOOP_HOME/bin
After adding the variables, run the following command to make them take effect.
[hadoop@singlehadoop java]$ source ~/.bash_profile
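As a quick sanity check that the PATH change took effect (Sqoop provides a version subcommand):
$ sqoop version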
Step 5: test run
(1) Start the Hadoop cluster first, i.e. sbin/start-all.sh
(2) Run a test
[hadoop@singlehadoop java]$ sqoop list-databases \
> --connect jdbc:mysql://192.168.80.128:3306/db_hadoop \
> --username sqoop \
> --password sqoop
15/06/03 02:47:27 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
15/06/03 02:47:27 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/06/03 02:47:28 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
db_demo
db_hadoop
db_www
The sqoop command ran successfully, which means Sqoop has been installed successfully under Hadoop 2.2.0 (single node)!