Sqoop Usage in Detail
import-to-hdfs: moving data from a relational database into HDFS & Hive
Sqoop handles ETL between relational databases and the Hadoop family (HDFS, Hive, HBase).
Database → Hadoop family: sqoop import; Hadoop family → database: sqoop export
Download:
http://www.apache.org/dyn/closer.lua/sqoop/1.4.7
Official documentation:
https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html
Reference: https://www.cnblogs.com/yinzhengjie/p/9183069.html
Installing Sqoop 1.4.7
cd /usr/app
tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
mv sqoop-1.4.7.bin__hadoop-2.6.0/ sqoop-1.4.7
cd sqoop-1.4.7/conf/
mv sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
export HADOOP_COMMON_HOME=/usr/local/hadoop/hadoop-2.10.0
export HADOOP_MAPRED_HOME=/usr/local/hadoop/hadoop-2.10.0
export HBASE_HOME=/usr/app/hbase-1.6.0
export HIVE_HOME=/usr/app/apache-hive-2.3.7-bin
export ZOOCFGDIR=/usr/app/apache-zookeeper-3.6.1-bin/conf
cp mysql-connector-java-5.1.49-bin.jar /usr/app/sqoop-1.4.7/lib/
vi /etc/profile
export SQOOP_HOME=/usr/app/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin
source /etc/profile
Verify the installation:
sqoop-version
MySQL must allow connections from the Hadoop nodes (e.g. root@hadoop001):
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'mysql';
If you are running in a cluster environment and the import fails or hangs, try disabling the firewall first (a sketch follows after the command).
sqoop import --connect jdbc:mysql://192.168.10.3:3306/mydb --username root --password mysql --table employees -m 2
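A minimal sketch for disabling the firewall, assuming CentOS 7 nodes where firewalld is the active firewall (adapt to your distribution):
systemctl stop firewalld        # stop the firewall for the current session
systemctl disable firewalld     # keep it off after reboot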
Check the result (default path):
hadoop fs -ls /user/root/employees
hadoop fs -cat /user/root/employees/part-m-00000
Commands
- codegen Generate code to interact with database records
- create-hive-table Import a table definition into Hive
- eval Evaluate a SQL statement and display the results
- export Export an HDFS directory to a database table
- help List available commands
- import Import a table from a database to HDFS
- import-all-tables Import tables from a database to HDFS
- import-mainframe Import datasets from a mainframe server to HDFS
- job Work with saved jobs
- list-databases List available databases on a server
- list-tables List available tables in a database
- merge Merge results of incremental imports
- metastore Run a standalone Sqoop metastore
- version Display version information
Common arguments
- --connect <jdbc-uri> Specify JDBC connect string
- --connection-manager <class-name> Specify connection manager class to use
- --driver <class-name> Manually specify JDBC driver class to use
- --hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME
- --help Print usage instructions
- --password-file Set path for a file containing the authentication password (see the sketch after this list)
- -P Read password from console
- --password Set authentication password
- --username Set authentication username
- --verbose Print more information while working
- --connection-param-file Optional properties file that provides connection parameters
- --relaxed-isolation Set connection transaction isolation to read uncommitted for the mappers.
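For example, --password-file keeps the password off the command line and out of the shell history. A minimal sketch, assuming the file is stored on HDFS under /user/root (file name and location are arbitrary):
echo -n 'mysql' > mysql.pwd     # -n avoids a trailing newline being read as part of the password
hadoop fs -put mysql.pwd /user/root/mysql.pwd
hadoop fs -chmod 400 /user/root/mysql.pwd
sqoop import --connect jdbc:mysql://192.168.10.3:3306/mydb --username root --password-file /user/root/mysql.pwd --table employees -m 1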
Database connections
- mysql
sqoop import --connect jdbc:mysql://192.168.10.3:3306/mydb --username root --password mysql
The JDBC driver jar must be on Sqoop's classpath:
cp mysql-connector-java-5.1.49-bin.jar /usr/app/sqoop-1.4.7/lib/
Listing tables
sqoop list-tables --connect jdbc:mysql://192.168.10.3:3306/mydb --username root --password mysql
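The companion list-databases command works the same way, e.g.:
sqoop list-databases --connect jdbc:mysql://192.168.10.3:3306 --username root --password mysql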
Importing a single table into HDFS
The default import path is /user/<username>/<table name>, e.g. /user/hadoop/tablename1; --target-dir overrides it.
sqoop import --connect jdbc:mysql://192.168.10.3:3306/mydb?useSSL=false --username root --password mysql --table employees --target-dir /user/hadoop11/employees -m 1
Specifying the field delimiter, the split-by column, and the number of parallel map tasks
sqoop import \
--connect jdbc:mysql://hadoop1:3306/mysql \
--username root \
--password root \
--table help_keyword \
--target-dir /user/hadoop11/myhelp_keyword1 \
--fields-terminated-by '\t' \
-m 2
--fields-terminated-by '\t' sets the field delimiter in the output files.
-m sets how many map tasks run in parallel.
sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id \
--target-dir /user/foo/joinresults \
-m 10
--split-by chooses the column used to partition the data across mappers.
Suppose a table test is imported with --split-by 'id' and -m 10. Sqoop first reads the table's metadata, then decides the partitions by sending a boundary query to the relational database (MySQL here): select min(id), max(id) from test. The [min, max] range is divided into 10 equal intervals, and 10 map tasks run in parallel, each pulling the rows of one interval from the database.
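A worked sketch with assumed numbers: if the id column runs from 1 to 1000 and -m 10 is given, each mapper ends up issuing a query roughly like the ones below (the exact split boundaries are computed by Sqoop and may differ slightly):
SELECT MIN(id), MAX(id) FROM test;                   -- boundary query, returns 1 and 1000
SELECT * FROM test WHERE id >= 1 AND id < 101;       -- mapper 0
SELECT * FROM test WHERE id >= 901 AND id <= 1000;   -- mapper 9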
Incremental imports
Append mode (incremental by id)
sqoop import \
--connect jdbc:mysql://hadoop1:3306/mysql \
--username root \
--password root \
--table help_keyword \
--target-dir /user/hadoop/myimportadd \
--incremental append \
--check-column help_keyword_id \
--last-value 500 \
-m 1
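Instead of updating --last-value by hand before every run, the same import can be stored as a saved job (see the job command above); the Sqoop metastore then remembers the last imported value after each execution. A minimal sketch reusing the connection settings above (the job name is arbitrary):
sqoop job --create import_help_keyword -- import \
--connect jdbc:mysql://hadoop1:3306/mysql \
--username root \
--password root \
--table help_keyword \
--target-dir /user/hadoop/myimportadd \
--incremental append \
--check-column help_keyword_id \
--last-value 0 \
-m 1
sqoop job --exec import_help_keyword
Note that by default a saved job prompts for the database password on each run unless password storage is enabled in sqoop-site.xml (sqoop.metastore.client.record.password).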
Lastmodified mode (incremental by timestamp): rows whose check column is newer than --last-value are re-imported and merged with the existing data on the --merge-key column.
sqoop import \
--connect jdbc:mysql://hadoop1:3316/testdb \
--username root \
--password transwarp \
--query "select orderid, name from ordertable where \$CONDITIONS" \
--target-dir /user/root/orderall \
--split-by id \
-m 4 \
--incremental lastmodified \
--merge-key orderid \
--check-column time \
--last-value "2014-11-09 21:00:00"
Importing into Hive
sqoop import --connect jdbc:mysql://192.168.10.3:3306/mydb?useSSL=false --username root --password mysql --table employees --target-dir /user/sqoop/employees --hive-import -m 2
The table is created in Hive's default database. Check the result (default path):
hadoop fs -cat /user/sqoop/employees/part-m-00000
Hive import specifying the row and column delimiters, --hive-import, overwrite mode, automatic Hive table creation, an explicit table name, and deletion of the intermediate target directory.
First create a Hive database:
hive> create database mydb;
sqoop import \
--connect jdbc:mysql://192.168.10.3:3306/mydb?useSSL=false \
--username root \
--password mysql \
--table employees \
--fields-terminated-by "\t" \
--lines-terminated-by "\n" \
--hive-import \
--hive-overwrite \
--create-hive-table \
--delete-target-dir \
--hive-database mydb \
--hive-table employees
Check the result:
use mydb;
select * from employees limit 20;
- --create-hive-table // create the Hive table; if the target table already exists the job fails
- --external-table-dir <hdfs path> // create the Hive table as an external table at the given HDFS path
- --hive-database <database-name> // target Hive database
- --hive-import // import the table into Hive
- --hive-partition-key <partition-key> // partition key (see the sketch after this list)
- --hive-partition-value <partition-value> // partition value
- --hive-table <table-name> // target Hive table
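A minimal sketch of a partitioned Hive import, reusing the employees table from above; the target table employees_part, the partition key load_date, and its value are assumptions for illustration only:
sqoop import \
--connect jdbc:mysql://192.168.10.3:3306/mydb?useSSL=false \
--username root \
--password mysql \
--table employees \
--hive-import \
--hive-database mydb \
--hive-table employees_part \
--hive-partition-key load_date \
--hive-partition-value '2020-07-01' \
--delete-target-dir \
-m 1
All imported rows land in the single static partition load_date='2020-07-01'.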
HDFS export to MySQL
sqoop export --connect jdbc:mysql://192.168.1.59:3307/bigdata --username root --password mysql --export-dir /sqoop --table student1
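sqoop export does not create the target table, so student1 must already exist in MySQL with columns matching the files under /sqoop (the default input field delimiter is ','). The schema below is purely an assumption for illustration:
CREATE TABLE student1 (id INT, name VARCHAR(64));   -- hypothetical schema, adjust to the real HDFS data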
Hive export to MySQL
sqoop export --connect jdbc:mysql://192.168.1.59:3307/bigdata --username root --password mysql --table employee1 --export-dir /user/hive/warehouse/sqoop.db/employee --input-fields-terminated-by '\001' -m 3
Hive's default field delimiter for text tables is \001, hence --input-fields-terminated-by '\001'.
Generating Java code
sqoop codegen --connect jdbc:mysql://192.168.1.59:3307/bigdata --username root --password mysql --table employee1
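By default the generated .java/.class/.jar files are written to a temporary compile directory; the output locations and the class name can be controlled with --outdir, --bindir and --class-name. A sketch (the directories and class name are arbitrary choices):
sqoop codegen \
--connect jdbc:mysql://192.168.1.59:3307/bigdata \
--username root \
--password mysql \
--table employee1 \
--class-name Employee1 \
--outdir /tmp/sqoop-codegen/src \
--bindir /tmp/sqoop-codegen/bin
Sqoop generates and uses such a record class internally during import and export; codegen simply exposes it.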
Open-source projects currently maintained: https://gitee.com/475660