[Big Data / Data Integration] Apache Sqoop: Installation and Usage
1 Background
Project use case: importing data from an OLTP Oracle database into OLAP Hive
2 Sqoop Overview
- Apache Sqoop(TM) is an open-source ETL tool designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as traditional relational databases. It can import data from a relational database (e.g. MySQL, Oracle, PostgreSQL) into HDFS, and export data from HDFS back into a relational database.
- Project history: the Sqoop project started in 2009, initially as a third-party module for Hadoop. Later, to let users deploy it quickly and to let developers iterate faster, it entered the Apache Incubator and graduated in March 2012 to become an Apache top-level project. In June 2021, however, Apache Sqoop was retired from the Apache community and moved to the Attic.
- How it works (a short illustration follows this list):
  - Import and export commands are translated into MapReduce jobs for execution.
  - The core of the work is customizing MapReduce's InputFormat and OutputFormat.
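For example (a minimal sketch using the Sqoop1 command line; the connection string, credentials, table, and target directory below are all placeholders):
sqoop import \
--connect jdbc:mysql://db-host:3306/demo_db \
--username demo_user \
--password demo_pass \
--table demo_table \
--target-dir /tmp/demo_table \
-m 4
# Sqoop submits this as a map-only MapReduce job; each map task reads one slice of the
# table (split on the primary key unless --split-by is given) and writes one part file:
hdfs dfs -ls /tmp/demo_table
# expect part-m-00000 ... part-m-00003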
Sqoop2 architecture
- Community status: Apache Sqoop was retired from the Apache community to the Attic in June 2021.
Apache Sqoop moved into the Attic in June 2021. Apache Sqoop's mission was to create and maintain software related to bulk data transfer between Apache Hadoop and structured data stores.
The website, downloads, and issue tracker all remain open, although the issue tracker is read-only. For more information about Sqoop, see the website http://sqoop.apache.org.
As with any project in the Attic: if you choose to fork Sqoop outside of Apache, please let them know so that they can link to your project.
- Latest releases: 1.4.7 (Sqoop1) and 1.99.7 (Sqoop2)
- Official website: http://sqoop.apache.org
(This information was last updated at 2022/05/11 12:29 PM.)
3 Installing Sqoop2 (1.99.7)
step0 Deployment environment
jdk 1.8
hadoop 3.3.2
hive 3.1.3
sqoop 1.99.7
step1 Download the Sqoop2 package
[sqoop 1.99.7] As of 2022-05-10, this is the latest Sqoop2 release.
http://archive.apache.org/dist/sqoop/
http://archive.apache.org/dist/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz
step2 Extract the archive and create a symlink named sqoop
tar -xvf /mnt/share_data/Apache\ Sqoop/sqoop-1.99.7-bin-hadoop200.tar.gz -C /home/hadoop/
cd /home/hadoop
ln -s sqoop-1.99.7-bin-hadoop200 sqoop
ls -l /home/hadoop
ls -l /home/hadoop/sqoop
ls -l /home/hadoop/sqoop/
step3 Configure environment variables: HADOOP_HOME / SQOOP_HOME
vim /etc/profile
# HADOOP_HOME
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
# SQOOP_HOME
export SQOOP_HOME=/home/hadoop/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
source /etc/profile
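A quick sanity check that the new variables are visible in the current shell:
echo $HADOOP_HOME
echo $SQOOP_HOME
hadoop version | head -n 1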
[Supplementary: locating the hadoop directory]
[hadoop@node-d ~]$ whereis hadoop
hadoop: /usr/bin/hadoop /home/hadoop/hadoop-3.3.2/bin/hadoop /home/hadoop/hadoop-3.3.2/bin/hadoop.cmd
[hadoop@node-d ~]$ whereis sqoop
sqoop: /usr/bin/sqoop /home/hadoop/sqoop-1.99.7-bin-hadoop200/bin/sqoop.sh
# ll /usr/lib/hadoop
[Supplementary: if HADOOP_HOME is not configured, you will see the following error:]
# sqoop help
Error: /opt/software/sqoop-1.4.7/bin/../../hadoop-mapreduce does not exist!
Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.
step4 Install the MySQL JDBC driver
Place mysql-connector-java-5.1.*.jar into the $SQOOP_HOME/server/lib directory.
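For example (the source path of the driver jar is an assumption; use wherever you downloaded MySQL Connector/J):
cp /mnt/share_data/mysql-connector-java-5.1.49.jar $SQOOP_HOME/server/lib/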
ls -l $SQOOP_HOME/server/lib/
step5 Replace the Derby database driver
mv $SQOOP_HOME/server/lib/derby-10.8.2.2.jar $SQOOP_HOME/server/lib/derby-10.8.2.2.jar.bak
cp $HIVE_HOME/lib/derby-10.14.1.0.jar $SQOOP_HOME/server/lib/
This step avoids a Derby error when running sqoop2-tool verify (most likely caused by the bundled Derby version being too old):
Exception in thread "main" java.lang.ExceptionInInitializerError
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.sqoop.utils.ClassUtils.loadClassWithClassLoader(ClassUtils.java:106)
at org.apache.sqoop.utils.ClassUtils.loadClass(ClassUtils.java:70)
at org.apache.sqoop.repository.JdbcRepositoryProvider.initializeRepositoryHandler(JdbcRepositoryProvider.java:122)
at org.apache.sqoop.repository.JdbcRepositoryProvider.initialize(JdbcRepositoryProvider.java:65)
at org.apache.sqoop.repository.RepositoryManager.initialize(RepositoryManager.java:124)
at org.apache.sqoop.tools.tool.UpgradeTool.runToolWithConfiguration(UpgradeTool.java:39)
at org.apache.sqoop.tools.ConfiguredTool.runTool(ConfiguredTool.java:35)
at org.apache.sqoop.tools.ToolRunner.main(ToolRunner.java:72)
Caused by: java.lang.SecurityException: sealing violation: package org.apache.derby.impl.jdbc.authentication is sealed
step6 Set the Hadoop configuration directory in $SQOOP_HOME/conf/sqoop.properties
# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/home/hadoop/hadoop/etc/hadoop/
# Default value: /etc/hadoop/conf/
# Do not set this to $HADOOP_HOME/etc/hadoop/ (shell variables are not expanded in this file), or the error below will still occur
This step avoids the following error:
...
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/
at org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine.initialize(MapreduceSubmissionEngine.java:97)
...
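The same change can also be applied non-interactively; a minimal sketch, assuming the stock sqoop.properties shipped with the 1.99.7 binary distribution:
cp $SQOOP_HOME/conf/sqoop.properties $SQOOP_HOME/conf/sqoop.properties.bak
sed -i 's|^org.apache.sqoop.submission.engine.mapreduce.configuration.directory=.*|org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/home/hadoop/hadoop/etc/hadoop/|' $SQOOP_HOME/conf/sqoop.properties
grep 'configuration.directory' $SQOOP_HOME/conf/sqoop.properties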
step7 Verify the installation / startup (sqoop 1.99.7)
Use sqoop2-tool to verify that the installation succeeded:
# cd $SQOOP_HOME/bin
# sqoop2-tool verify
Verification was Successful.
Tool class ..... has finished correctly.
If the output instead contains
Verification has failed, please check Server logs for further details.
check the Sqoop log at the following path:
vi $SQOOP_HOME/bin/@LOGDIR@/sqoop.log
Supplementary: Sqoop's default logging configuration is shown below.
/home/hadoop/sqoop-1.99.7-bin-hadoop200/conf/sqoop.properties:org.apache.sqoop.log4j.appender.file.File
org.apache.sqoop.log4j.debug=false
org.apache.sqoop.log4j.rootLogger=INFO, file
org.apache.sqoop.log4j.category.org.apache.sqoop=INFO
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=@LOGDIR@/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%l] %m%n
# Audit logger for default configuration of FileAuditLogger
org.apache.sqoop.log4j.logger.audit=INFO, audit
org.apache.sqoop.log4j.appender.audit=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.audit.File=@LOGDIR@/audit.log
org.apache.sqoop.log4j.appender.audit.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.audit.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.audit.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.audit.layout.ConversionPattern=%d{ISO8601} %-5p [%l] %m%n
step8 Start the Sqoop2 server and client
Start the server with sqoop2-server:
sqoop2-server [ start | stop ]
Log in with the sqoop2-shell client:
Note: the shell can be started even if the server is not running.
# sqoop2-shell
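Once inside the shell, you can point the client at the server and run a quick check; a minimal sketch, assuming the server was started locally on its default port 12000:
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
sqoop:000> show version --all
sqoop:000> show connector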
At this point, the Sqoop2 installation is complete.
4 Installing Sqoop1 (1.4.7)
step0 Deployment environment
hadoop 3.3.2
Recommended: hadoop 2.6.x / 2.7.x
step1 Download the Sqoop package
[sqoop 1.4.7]
http://sqoop.apache.org/
http://www.apache.org/dyn/closer.lua/sqoop/1.4.7
https://mirror.bit.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
http://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
step2 Extract the archive and create a symlink named sqoop
tar -xvf /mnt/share_data/Apache\ Sqoop/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /home/hadoop/
cd /home/hadoop
ln -s sqoop-1.4.7.bin__hadoop-2.6.0 sqoop
ls -l /home/hadoop
ls -l /home/hadoop/sqoop
ls -l /home/hadoop/sqoop/
step3 Configure environment variables: HADOOP_HOME / SQOOP_HOME
vim /etc/profile
# HADOOP_HOME
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
# SQOOP_HOME
export SQOOP_HOME=/home/hadoop/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
source /etc/profile
step4 Install the MySQL JDBC driver
- step4.1 Place mysql-connector-java-5.1.*.jar into the $SQOOP_HOME/lib directory.
ls -l $SQOOP_HOME/lib/
- step4.2 Also place the external jar that the MySQL code path depends on (commons-lang-2.6.jar) into the $SQOOP_HOME/lib directory; a copy sketch is shown below.
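For example (both source paths are assumptions for illustration; commons-lang 2.6 can be downloaded from Maven Central if it is not already available locally):
cp /mnt/share_data/mysql-connector-java-5.1.49.jar $SQOOP_HOME/lib/
cp /mnt/share_data/commons-lang-2.6.jar $SQOOP_HOME/lib/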
ls -la $SQOOP_HOME/lib | grep -i "commons-lang"
This avoids the following error when later running Sqoop against a MySQL database:
...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
at org.apache.sqoop.manager.MySQLManager.initOptionDefaults(MySQLManager.java:73)
at org.apache.sqoop.manager.SqlManager.<init>(SqlManager.java:89)
at com.cloudera.sqoop.manager.SqlManager.<init>(SqlManager.java:33)
at org.apache.sqoop.manager.GenericJdbcManager.<init>(GenericJdbcManager.java:51)
...
step5 Configure Hadoop and ecosystem component paths in sqoop-env.sh
cd $SQOOP_HOME/conf/
cp sqoop-env-template.sh sqoop-env.sh
vi $SQOOP_HOME/conf/sqoop-env.sh
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=$HADOOP_HOME
# or: HADOOP_COMMON_HOME=/home/hadoop/hadoop
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=$HADOOP_HOME
# or: HADOOP_MAPRED_HOME=/home/hadoop/hadoop
#set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
export HIVE_HOME=$HIVE_HOME
# or: HIVE_HOME=/home/hadoop/hive/
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
# or: /home/hadoop/zookeeper/
step6 Verify the installation / startup
(The following commands apply to sqoop 1.4.7; they are not supported by sqoop 1.99.7.)
sqoop version
sqoop version
sqoop help
# sqoop help
Warning: /opt/software/sqoop-1.4.7/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/software/sqoop-1.4.7/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/software/sqoop-1.4.7/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/sdp/2.6.0.0-1245/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/sdp/2.6.0.0-1245/tez-0.9.0/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/sdp/2.6.0.0-1245/hadoop-2.7.3/lib/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/01/18 21:32:31 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
Connect to a MySQL database and list its databases and tables:
sqoop list-databases --connect jdbc:mysql://127.0.0.1:3306/ --username root --password 123456
sqoop list-tables --connect jdbc:mysql://127.0.0.1:3306/exam_system --username root --password 123456
5 Use Cases (based on Sqoop1)
5.1 Data migration: Oracle to Hive
[Template]
sqoop import \
--connect jdbc:oracle:thin:eas/eas@192.168.100.199:1521:eas \
--table TMP_NEEDCLEAN_PROVIDER2 \
--hive-import --hive-database xxx \
--hive-table RPT_PROVIDER_NEEDCLEAN2 \
--hive-overwrite \
--delete-target-dir \
--hive-drop-import-delims \
--null-string '' --null-non-string '' \
-m 1
[Example]
sqoop import \
--connect jdbc:oracle:thin:@10.0.3.3:1521:sid \
--username abc \
--password r123 \
--table CMX.SALES \
-m 1 \
--hive-import \
--hive-overwrite \
--hive-database ra \
--hive-table "cmx_sales" \
--null-non-string '' \
--null-string '' \
--delete-target-dir \
--hive-drop-import-delims
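An optional sanity check after the import finishes (the database and table names follow the example above):
hive -e "select count(*) from ra.cmx_sales;"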
5.2 Data migration: Hive to MySQL
sqoop export \
--connect jdbc:mysql://172.16.100.19:3306/test \
--username queryuser \
--password abcde12345 \
--table t_sitenumber_distance \
--export-dir /usr/local/warehouse/sitelight.db/t_sitenumber_distance \
--fields-terminated-by '\001'
Note: when creating the Hive table, make sure you fully understand (or explicitly specify) the field delimiter; if none is specified, the default delimiter is \001. A quick way to check or set it is sketched below.
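A minimal sketch (database and table names follow the export example above; the column list is a placeholder):
# Inspect the table definition; a table created without an explicit delimiter uses \001
hive -e "show create table sitelight.t_sitenumber_distance;"
# When creating a table yourself, state the delimiter explicitly:
hive -e "create table sitelight.t_sitenumber_distance_new (site_number string, distance double) row format delimited fields terminated by '\001' stored as textfile;"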
5.3 Data migration: MySQL/PostgreSQL to Hive
Whole-table mode
(The --null-string and --null-non-string options below are optional.)
sqoop import \
--connect jdbc:mysql://172.16.100.19:3306/test_db \
--username queryuser \
--password abcde12345 \
--table t_sitenumber_distance \
--null-string '@@@' \
--null-non-string '###' \
--hive-import \
--hive-overwrite \
--create-hive-table \
--delete-target-dir \
--hive-database ods_db \
--hive-table t_sitenumber_distance
Notes:
- The target Hive table is created automatically by default; it does not need to be created manually.
- --null-string: for string-typed columns, a NULL value from the source database is replaced with the specified string (here '@@@').
- --null-non-string: for non-string-typed columns, a NULL value from the source database is replaced with the specified string (here '###').
- If the Sqoop job fails at this point, it is most likely due to insufficient memory resources on the YARN cluster rather than the import options themselves.
Query mode
sqoop import \
--connect 'jdbc:postgresql://172.21.232.128:1921/coupon?useUnicode=true&characterEncoding=utf8' \
--username coupondw \
--password 'dw!123456' \
--query "select id,primeofferid,primeoffercode,primecardcode,primecode,usercode,primemembertypeid,primeoffertypeid,primeofferactivityid,reusablecount,usedseqid,totalavailablecount,validstarttime
,validendtime,saving,activatedchannelid,activatedtime,channelremark,version,lastupdatetime,updateuser,lastusedtime,lastusedcount,offerstatus,statusupdatetime from t_prime_offer_info_4 where lastupdatetime >='2021-06-17' and lastupdatetime < '2021-06-18' and \$CONDITIONS" \
--target-dir /apps/hive/warehouse/staging/all/coupon_t_prime_offer_info_4 \
--hive-import \
--hive-database tmp \
--delete-target-dir \
--hive-table coupon_t_prime_offer_info_4 \
--m 1
# Note 1: -m / --num-mappers sets the number of map tasks.
# Note 2: --query and --table cannot be used together.
# Note 3: the WHERE clause of the --query statement must contain $CONDITIONS, otherwise the job fails with an error such as:
## ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Query [SELECT * FROM t_passenger ] must contain '$CONDITIONS' in WHERE clause.
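To make the role of $CONDITIONS more concrete: Sqoop replaces it with a different range predicate for each map task, so several mappers can read disjoint slices of the result set in parallel. A sketch based on the command above (the --split-by column, target directory, and Hive table name here are assumptions):
sqoop import \
--connect 'jdbc:postgresql://172.21.232.128:1921/coupon?useUnicode=true&characterEncoding=utf8' \
--username coupondw \
--password 'dw!123456' \
--query "select id,primeofferid,primeoffercode from t_prime_offer_info_4 where \$CONDITIONS" \
--split-by id \
--target-dir /apps/hive/warehouse/staging/all/coupon_t_prime_offer_info_4_p \
--hive-import \
--hive-database tmp \
--hive-table coupon_t_prime_offer_info_4_p \
--delete-target-dir \
-m 4
# Each of the 4 map tasks runs the query with $CONDITIONS rewritten to a range predicate
# on the --split-by column, e.g. "id >= <low> AND id < <high>".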
Query mode (with LIMIT)
sqoop import \
--connect 'jdbc:postgresql://172.21.232.128:1921/coupon?useUnicode=true&characterEncoding=utf8' \
--username coupondw \
--password 'dw!123456' \
--query "select id,primeofferid,primeoffercode,primecardcode from t_prime_offer_info_0 where \$CONDITIONS limit 10" \
--target-dir /apps/hive/warehouse/staging/all/coupon_t_prime_offer_info_0_qhh \
--hive-import \
--create-hive-table \
--hive-database tmp \
--hive-table coupon_t_prime_offer_info_0_qhh \
--delete-target-dir \
--m 1
Example (query mode) [recommended reading]
- step1 Identify the MySQL data to be imported into Hive
select
id,name,birthdate,idcard,sex,height,school_name,now() as etl_time
from exam_system.tb_examinee
-- where \$CONDITIONS
- step2 Prepare a temporary HDFS directory (Sqoop's --target-dir, used to stage data while the job runs)
hdfs dfs -mkdir /test/test_ods/
hdfs dfs -ls /test/test_ods/
hdfs dfs -ls /test/
- step3 Run the import
sqoop import \
--connect jdbc:mysql://172.16.20.102:3306/exam_system \
--username root \
--password 123456 \
--query "select id,name,birthdate,idcard,sex,height,school_name,now() as etl_time from exam_system.tb_examinee where \$CONDITIONS" \
--split-by 'id' \
--target-dir /test/test_ods/ \
--hive-import \
--hive-overwrite \
--create-hive-table \
--delete-target-dir \
--hive-database default \
--hive-table tb_examinee
- step4 Check the new table and its data in Hive
# hive
hive> select * from tb_examinee;
- step5 In the Yarn Cluster Web UI, check the MapReduce job that the Sqoop task submits while it runs (a command-line alternative is sketched below).
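The same information is also available from the Yarn command line; a minimal sketch:
# List recently finished applications; the Sqoop import appears as a MAPREDUCE application
# (for a --query import the job name is typically QueryResult.jar).
yarn application -list -appStates FINISHED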