[Big Data: Data Integration] Installing and Using Apache Sqoop

1 Background

Project use case: importing data from an OLTP Oracle database into OLAP Hive.

2 Sqoop Overview

  • Apache Sqoop(TM) is an open-source ETL tool designed to transfer bulk data efficiently between Apache Hadoop and structured data stores such as traditional relational databases. It can import data from a relational database (e.g. MySQL, Oracle, PostgreSQL) into HDFS, and it can also export data from HDFS back into a relational database.

  • Project history: Sqoop started in 2009, originally as a third-party module for Hadoop. It was later spun out so that users could deploy it quickly and developers could iterate faster; in March 2012 it graduated from the Apache Incubator and became an Apache top-level project. In June 2021, however, Apache Sqoop was retired from the Apache community and moved to the Attic.

  • Implementation principle

  • Import and export commands are translated into MapReduce jobs for execution.
  • The work mainly consists of customizing MapReduce's InputFormat and OutputFormat.
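
A minimal sketch of what this means in practice (Sqoop1 syntax; the connection values reuse the MySQL example from section 5 and are illustrative only):

sqoop import \
  --connect jdbc:mysql://172.16.20.102:3306/exam_system \
  --username root --password 123456 \
  --table tb_examinee \
  --split-by id \
  -m 4
# Sqoop generates a record class for tb_examinee, runs a boundary query such as
# SELECT MIN(id), MAX(id) FROM tb_examinee, and submits a map-only MapReduce job whose
# customized InputFormat (DataDrivenDBInputFormat) hands each of the 4 map tasks one id
# range to read into HDFS; exports work in reverse through a customized OutputFormat
# that turns records into batched INSERT statements.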

Sqoop2 architecture (diagram not reproduced here)

  • Community status: Apache Sqoop was retired from the Apache community to the Attic in June 2021.
Apache Sqoop moved into the Attic in June 2021. Apache Sqoop's mission was to create and maintain software related to bulk data transfer between Apache Hadoop and structured data stores.
The website, downloads and issue tracker all remain open, though the issue tracker is read-only. For more information about Sqoop, see the website http://sqoop.apache.org .
As with any project in the Attic: if you choose to fork Sqoop outside of Apache, let the project know so that it can link to your fork.
  • Latest releases

    • The latest stable Sqoop1 release is 1.4.7 (download / documentation).
    • The latest Sqoop2 release is 1.99.7 (download / documentation). Note that 1.99.7 is not compatible with 1.4.7, is feature-incomplete, and is not intended for production deployment.
  • Official site: http://sqoop.apache.org

This information was last updated on 2022/05/11 at 12:29 PM.

3 Installing Sqoop2 (1.99.7)

step0 Deployment environment

jdk 1.8
hadoop 3.3.2
hive 3.1.3
sqoop 1.99.7

step1 Download the Sqoop2 package

[sqoop 1.99.7] The latest Sqoop2 release as of 2022-05-10
  http://archive.apache.org/dist/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz

Release index: http://archive.apache.org/dist/sqoop/
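
If the machine has direct internet access, the package can be fetched straight into the shared directory used in step2 (the URL follows the archive.apache.org layout shown above):

wget -P /mnt/share_data/Apache\ Sqoop/ \
  http://archive.apache.org/dist/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz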

step2 Unpack the archive and create a sqoop symlink

tar -xvf /mnt/share_data/Apache\ Sqoop/sqoop-1.99.7-bin-hadoop200.tar.gz -C /home/hadoop/

cd /home/hadoop
ln -s sqoop-1.99.7-bin-hadoop200 sqoop

ls -l /home/hadoop
ls -l /home/hadoop/sqoop
ls -l /home/hadoop/sqoop/

step3 Configure environment variables: HADOOP_HOME / SQOOP_HOME

vim /etc/profile
      # HADOOP_HOME
      export HADOOP_HOME=/home/hadoop/hadoop
      export PATH=$PATH:$HADOOP_HOME/bin

      # SQOOP_HOME
      export SQOOP_HOME=/home/hadoop/sqoop
      export PATH=$PATH:$SQOOP_HOME/bin

source /etc/profile
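
A quick sanity check that the new variables are visible in the current shell:

echo $HADOOP_HOME $SQOOP_HOME
which sqoop2-server sqoop2-shell
  # both scripts live in $SQOOP_HOME/bin once PATH has been updated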

[Extra: locate the hadoop installation]

[hadoop@node-d ~]$ whereis hadoop
hadoop: /usr/bin/hadoop /home/hadoop/hadoop-3.3.2/bin/hadoop /home/hadoop/hadoop-3.3.2/bin/hadoop.cmd

[hadoop@node-d ~]$ whereis sqoop
sqoop: /usr/bin/sqoop /home/hadoop/sqoop-1.99.7-bin-hadoop200/bin/sqoop.sh

# ll /usr/lib/hadoop

[Extra: if HADOOP_HOME is not configured, you will see:]

# sqoop help
Error: /opt/software/sqoop-1.4.7/bin/../../hadoop-mapreduce does not exist!
Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.

step4 Install the MySQL JDBC driver

Place mysql-connector-java-5.1.*.jar into the $SQOOP_HOME/server/lib directory.

(The connector JAR can be downloaded from the MySQL Product Archives.)

ls -l $SQOOP_HOME/server/lib/
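
Assuming the connector JAR has already been downloaded from the MySQL Product Archives (the 5.1.49 version number below is only an example), copying it in looks like:

cp /mnt/share_data/mysql-connector-java-5.1.49.jar $SQOOP_HOME/server/lib/
ls -l $SQOOP_HOME/server/lib/ | grep -i mysql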

step5 Replace the bundled Derby driver

mv $SQOOP_HOME/server/lib/derby-10.8.2.2.jar $SQOOP_HOME/server/lib/derby-10.8.2.2.jar.bak

cp $HIVE_HOME/lib/derby-10.14.1.0.jar $SQOOP_HOME/server/lib/

This step avoids a Derby error when running sqoop2-tool verify, which is most likely caused by the bundled Derby version being too old:

Exception in thread "main" java.lang.ExceptionInInitializerError
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.sqoop.utils.ClassUtils.loadClassWithClassLoader(ClassUtils.java:106)
	at org.apache.sqoop.utils.ClassUtils.loadClass(ClassUtils.java:70)
	at org.apache.sqoop.repository.JdbcRepositoryProvider.initializeRepositoryHandler(JdbcRepositoryProvider.java:122)
	at org.apache.sqoop.repository.JdbcRepositoryProvider.initialize(JdbcRepositoryProvider.java:65)
	at org.apache.sqoop.repository.RepositoryManager.initialize(RepositoryManager.java:124)
	at org.apache.sqoop.tools.tool.UpgradeTool.runToolWithConfiguration(UpgradeTool.java:39)
	at org.apache.sqoop.tools.ConfiguredTool.runTool(ConfiguredTool.java:35)
	at org.apache.sqoop.tools.ToolRunner.main(ToolRunner.java:72)
Caused by: java.lang.SecurityException: sealing violation: package org.apache.derby.impl.jdbc.authentication is sealed

step6 Point $SQOOP_HOME/conf/sqoop.properties at the Hadoop configuration directory

# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/home/hadoop/hadoop/etc/hadoop/
  # default value: /etc/hadoop/conf/
  # do not set this to the literal string $HADOOP_HOME/etc/hadoop/; the variable is not expanded, so the error below would still occur

This step avoids the following error:

...
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/
        at org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine.initialize(MapreduceSubmissionEngine.java:97)
...
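
If you prefer to apply the change non-interactively, a sed one-liner works as well (this assumes the property line is present and uncommented in your copy of sqoop.properties; the path matches this tutorial's layout):

sed -i 's|^org.apache.sqoop.submission.engine.mapreduce.configuration.directory=.*|org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/home/hadoop/hadoop/etc/hadoop/|' $SQOOP_HOME/conf/sqoop.properties
grep 'configuration.directory' $SQOOP_HOME/conf/sqoop.properties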

step7 Verify the installation / startup

(sqoop 1.99.7)

Use sqoop2-tool to verify the installation:

# cd $SQOOP_HOME/bin

# sqoop2-tool verify
Verification was Successful.
Tool class ..... has finished correctly.

If any output of the form "Verification has failed, please check Server logs for further details." appears, the Sqoop log can be inspected at the following path:

vi $SQOOP_HOME/bin/@LOGDIR@/sqoop.log

Extra: Sqoop's default logging configuration (from sqoop.properties) is shown below:

/home/hadoop/sqoop-1.99.7-bin-hadoop200/conf/sqoop.properties:org.apache.sqoop.log4j.appender.file.File
org.apache.sqoop.log4j.debug=false
org.apache.sqoop.log4j.rootLogger=INFO, file
org.apache.sqoop.log4j.category.org.apache.sqoop=INFO
org.apache.sqoop.log4j.appender.file=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.file.File=@LOGDIR@/sqoop.log
org.apache.sqoop.log4j.appender.file.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.file.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.file.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%l] %m%n
# Audit logger for default configuration of FileAuditLogger
org.apache.sqoop.log4j.logger.audit=INFO, audit
org.apache.sqoop.log4j.appender.audit=org.apache.log4j.RollingFileAppender
org.apache.sqoop.log4j.appender.audit.File=@LOGDIR@/audit.log
org.apache.sqoop.log4j.appender.audit.MaxFileSize=25MB
org.apache.sqoop.log4j.appender.audit.MaxBackupIndex=5
org.apache.sqoop.log4j.appender.audit.layout=org.apache.log4j.PatternLayout
org.apache.sqoop.log4j.appender.audit.layout.ConversionPattern=%d{ISO8601} %-5p [%l] %m%n

step8 Start the Sqoop2 server and client

Start the server with sqoop2-server:

sqoop2-server [ start | stop ]
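
After sqoop2-server start, a couple of quick checks confirm the server is up (12000 is the Sqoop2 REST port by default; adjust if you changed it):

jps | grep -i sqoop        # the Sqoop server JVM (SqoopJettyServer in 1.99.7) should be listed
ss -lntp | grep 12000      # the REST API should be listening on this port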

Log in with the sqoop2-shell client

Note: the shell client can be entered even when the server has not been started.

# sqoop2-shell
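
Once inside the shell, a few basic commands confirm the client can reach the server; a minimal non-interactive sketch (localhost and port 12000 are the assumed defaults):

sqoop2-shell <<'EOF'
set server --host localhost --port 12000 --webapp sqoop
show version --all
show connector
EOF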

That completes the Sqoop2 installation.

4 Installing Sqoop1 (1.4.7)

step0 Deployment environment

hadoop 3.3.2
  recommended: hadoop 2.6.x / 2.7.x

step1 Download the Sqoop1 package

[sqoop 1.4.7]
http://sqoop.apache.org/
      http://www.apache.org/dyn/closer.lua/sqoop/1.4.7
            https://mirror.bit.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

http://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

step2 Unpack the archive and create a sqoop symlink

tar -xvf /mnt/share_data/Apache\ Sqoop/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /home/hadoop/

cd /home/hadoop
ln -s sqoop-1.4.7.bin__hadoop-2.6.0 sqoop

ls -l /home/hadoop
ls -l /home/hadoop/sqoop
ls -l /home/hadoop/sqoop/

step3 Configure environment variables: HADOOP_HOME / SQOOP_HOME

vim /etc/profile
      # HADOOP_HOME
      export HADOOP_HOME=/home/hadoop/hadoop
      export PATH=$PATH:$HADOOP_HOME/bin

      # SQOOP_HOME
      export SQOOP_HOME=/home/hadoop/sqoop
      export PATH=$PATH:$SQOOP_HOME/bin

source /etc/profile

step4 Install the MySQL JDBC driver

  • step4.1 Place mysql-connector-java-5.1.*.jar into the $SQOOP_HOME/lib directory
ls -l $SQOOP_HOME/lib/
  • step4.2 Also place the external JAR that this code path depends on (commons-lang-2.6.jar) into $SQOOP_HOME/lib (a copy sketch follows the error trace below)
ls -la $SQOOP_HOME/lib | grep -i "commons-lang"

This avoids the following error when Sqoop later connects to the MySQL database:

...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
	at org.apache.sqoop.manager.MySQLManager.initOptionDefaults(MySQLManager.java:73)
	at org.apache.sqoop.manager.SqlManager.<init>(SqlManager.java:89)
	at com.cloudera.sqoop.manager.SqlManager.<init>(SqlManager.java:33)
	at org.apache.sqoop.manager.GenericJdbcManager.<init>(GenericJdbcManager.java:51)
...
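
Hadoop 3.x bundles commons-lang3 but not the older commons-lang 2.x that this code path needs, so one convenient source is Hive's lib directory, if the JAR is present there (a sketch; the path and version are assumptions):

cp $HIVE_HOME/lib/commons-lang-2.6.jar $SQOOP_HOME/lib/
ls -la $SQOOP_HOME/lib | grep -i "commons-lang"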

step5 Configure the Hadoop ecosystem paths in sqoop-env.sh

cd $SQOOP_HOME/conf/
cp sqoop-env-template.sh sqoop-env.sh
vi $SQOOP_HOME/conf/sqoop-env.sh
  #Set path to where bin/hadoop is available
  export HADOOP_COMMON_HOME=$HADOOP_HOME
  # or: HADOOP_COMMON_HOME=/home/hadoop/hadoop

  #Set path to where hadoop-*-core.jar is available
  export HADOOP_MAPRED_HOME=$HADOOP_HOME
  # or: HADOOP_MAPRED_HOME=/home/hadoop/hadoop

  #set the path to where bin/hbase is available
  #export HBASE_HOME=

  #Set the path to where bin/hive is available
  export HIVE_HOME=$HIVE_HOME
  # or: /home/hadoop/hive/

  #Set the path for where zookeper config dir is
  #export ZOOCFGDIR=
  # or: /home/hadoop/zookeeper/

step6 Verify the installation / startup

(The following commands apply to Sqoop 1.4.7; they are not supported by Sqoop 1.99.7.)

sqoop version

sqoop help

# sqoop help
Warning: /opt/software/sqoop-1.4.7/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/software/sqoop-1.4.7/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/software/sqoop-1.4.7/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/sdp/2.6.0.0-1245/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/sdp/2.6.0.0-1245/tez-0.9.0/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/sdp/2.6.0.0-1245/hadoop-2.7.3/lib/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/01/18 21:32:31 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

Connect to the MySQL database and list its databases and tables:

sqoop list-databases --connect jdbc:mysql://127.0.0.1:3306/ --username root --password 123456
sqoop list-tables --connect jdbc:mysql://127.0.0.1:3306/exam_system --username root --password 123456
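
sqoop eval is another handy connectivity test: it runs an ad-hoc SQL statement on the source database and prints the result (same connection values as above; the table name is the one used in section 5):

sqoop eval --connect jdbc:mysql://127.0.0.1:3306/exam_system --username root --password 123456 \
  --query "select count(*) from tb_examinee"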

5 Use Cases (Based on Sqoop1)

5.1 Data migration: Oracle into Hive

[Template]

sqoop import \
--connect jdbc:oracle:thin:eas/eas@192.168.100.199:1521:eas \
--table TMP_NEEDCLEAN_PROVIDER2 \
--hive-import \
--hive-database xxx \
--hive-table RPT_PROVIDER_NEEDCLEAN2 \
--hive-overwrite \
--delete-target-dir \
--hive-drop-import-delims \
--null-string '' \
--null-non-string '' \
-m 1

[Example]

sqoop import \
--connect jdbc:oracle:thin:@10.0.3.3:1521:sid \
--username abc \
--password r123 \
--table CMX.SALES \
-m 1 \
--hive-import \
--hive-overwrite \
--hive-database ra \
--hive-table "cmx_sales" \
--null-non-string '' \
--null-string '' \
--delete-target-dir \
--hive-drop-import-delims

5.2 Data migration: Hive out to MySQL

sqoop export \
--connect jdbc:mysql://172.16.100.19:3306/test  \
--username queryuser \
--password abcde12345  \
--table t_sitenumber_distance  \
--export-dir /usr/local/warehouse/sitelight.db/t_sitenumber_distance \
--fields-terminated-by '\001'

Note: when creating the Hive table, it is best to fully understand or explicitly specify the field delimiter; if none is specified, the default delimiter is \001.
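
Two related checks before running the export, sketched below: confirm the Hive table's actual delimiter, and make sure the MySQL target table already exists (sqoop export does not create it). The column definitions are illustrative, not the real schema:

hive -e "SHOW CREATE TABLE sitelight.t_sitenumber_distance;"
mysql -h 172.16.100.19 -P 3306 -u queryuser -p test -e \
  "CREATE TABLE IF NOT EXISTS t_sitenumber_distance (site_number VARCHAR(64), distance DOUBLE);"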

5.3 Data migration: MySQL/PostgreSQL into Hive

Table mode (--table)

# (the options from --null-string onward are optional)
sqoop import \
--connect jdbc:mysql://172.16.100.19:3306/test_db \
--username queryuser \
--password abcde12345 \
--table t_sitenumber_distance \
--null-string '@@@' \
--null-non-string '###' \
--hive-import \
--hive-overwrite \
--create-hive-table \
--delete-target-dir \
--hive-database ods_db \
--hive-table t_sitenumber_distance

Note: the Hive target table is created automatically by default; there is no need to create it by hand.

  • --null-string: for string-typed columns, replace NULL values with the specified string on import (in the example above, '@@@').
  • --null-non-string: for non-string columns, replace NULL values with the specified string (in the example above, '###').

(A Sqoop failure I ran into at this point was caused by insufficient memory in the YARN cluster rather than by Sqoop itself, so it is not something to worry about here.)

Query mode (--query)

sqoop import \
--connect 'jdbc:postgresql://172.21.232.128:1921/coupon?useUnicode=true&characterEncoding=utf8' \
--username coupondw \
--password 'dw!123456' \
--query "select id,primeofferid,primeoffercode,primecardcode,primecode,usercode,primemembertypeid,primeoffertypeid,primeofferactivityid,reusablecount,usedseqid,totalavailablecount,validstarttime
,validendtime,saving,activatedchannelid,activatedtime,channelremark,version,lastupdatetime,updateuser,lastusedtime,lastusedcount,offerstatus,statusupdatetime from t_prime_offer_info_4 where lastupdatetime >='2021-06-17' and lastupdatetime < '2021-06-18' and \$CONDITIONS" \
--target-dir /apps/hive/warehouse/staging/all/coupon_t_prime_offer_info_4 \
--hive-import \
--hive-database tmp \
--delete-target-dir \
--hive-table coupon_t_prime_offer_info_4 \
--m 1

# Note 1: -m/--num-mappers sets the number of map tasks
# Note 2: the --query option cannot be combined with --table
# Note 3: the statement passed to --query must contain "where $CONDITIONS", otherwise the job fails with:
##  ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Query [SELECT * FROM t_passenger ] must contain '$CONDITIONS' in WHERE clause.
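
For completeness, a sketch of what $CONDITIONS buys you when importing in parallel (connection values reuse the example above; the _parallel names are made up to avoid clobbering the earlier import, and --split-by is required once -m is greater than 1):

sqoop import \
--connect 'jdbc:postgresql://172.21.232.128:1921/coupon?useUnicode=true&characterEncoding=utf8' \
--username coupondw \
--password 'dw!123456' \
--query "select id,primeofferid from t_prime_offer_info_4 where \$CONDITIONS" \
--split-by id \
--target-dir /apps/hive/warehouse/staging/all/coupon_t_prime_offer_info_4_parallel \
--hive-import \
--hive-database tmp \
--hive-table coupon_t_prime_offer_info_4_parallel \
--delete-target-dir \
--m 4
# Sqoop first runs SELECT MIN(id), MAX(id) over the query (with $CONDITIONS replaced by 1=1),
# then launches 4 map tasks, each receiving a range predicate such as "id >= low AND id < high"
# in place of $CONDITIONS.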

Query mode (with limit)

sqoop import \
--connect 'jdbc:postgresql://172.21.232.128:1921/coupon?useUnicode=true&characterEncoding=utf8' \
--username coupondw \
--password 'dw!123456' \
--query "select id,primeofferid,primeoffercode,primecardcode from t_prime_offer_info_0 where \$CONDITIONS  limit 10" \
--target-dir /apps/hive/warehouse/staging/all/coupon_t_prime_offer_info_0_qhh \
--hive-import \
--create-hive-table \
--hive-database tmp \
--hive-table coupon_t_prime_offer_info_0_qhh \
--delete-target-dir \
--m 1

Example (query mode) [recommended reading]

  • step1 Determine the MySQL data to be imported into Hive
select 
  id,name,birthdate,idcard,sex,height,school_name,now() as etl_time 
from exam_system.tb_examinee 
-- where \$CONDITIONS
  • step2 Prepare a temporary HDFS directory (the --target-dir where Sqoop stages data while the job runs)
hdfs dfs -mkdir /test/test_ods/
hdfs dfs -ls /test/test_ods/
hdfs dfs -ls /test/
  • step3 Run the import
sqoop import  \
--connect jdbc:mysql://172.16.20.102:3306/exam_system  \
--username root  \
--password 123456  \
--query  "select id,name,birthdate,idcard,sex,height,school_name,now() as etl_time from exam_system.tb_examinee where \$CONDITIONS" \
--split-by 'id' \
--target-dir /test/test_ods/ \
--hive-import \
--hive-overwrite  \
--create-hive-table  \
--delete-target-dir \
--hive-database default \
--hive-table tb_examinee
  • step4 Verify the new table and data in Hive
# hive
hive> select * from tb_examinee;
  • step5 In the YARN Cluster Web UI, check the MapReduce job submitted by the Sqoop task (a command-line alternative is sketched below)
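
The same job can also be spotted from the command line (a sketch; filter by the job name Sqoop assigned, which for free-form --query imports is usually QueryResult.jar):

yarn application -list -appStates ALL | grep -i -e sqoop -e queryresult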

