Using Sqoop to Move Data Between HDFS, MySQL, and Hive

Big Data Collaboration Frameworks
"Big data collaboration framework" is really an umbrella term; it mainly refers to the following four frameworks:
  • Data transfer tool: Sqoop
  • Log and file collection framework: Flume
  • Workflow scheduling framework: Oozie
  • Big data web UI: Hue
 
Sqoop's Role
Sqoop extracts data from a table in a relational database into Hadoop's HDFS; under the hood it still runs MapReduce.
It uses MapReduce to speed up data transfer.
Data is transferred as batch jobs.
It can also export file data on HDFS, or data in Hive tables, into a table in a relational database.
 
HDFS → RDBMS
  sqoop export \
  --connect jdbc:mysql://xxx:3306/xxx \
  --username xxx \
  --password xxx \
  --table xxx \
  --export-dir xxx
RDBMS → Hive
  sqoop import \
  --connect jdbc:mysql://xxx:3306/xxx \
  --username xxx \
  --password xxx \
  --fields-terminated-by "\t" \
  --table xxx \
  --hive-import \
  --hive-table xxx
Hive → RDBMS
  sqoop export \
  --connect jdbc:mysql://xxx:3306/xxx \
  --username xxx \
  --password xxx \
  --table xxx \
  --export-dir xxx \
  --input-fields-terminated-by '\t'
RDBMS → HDFS
  sqoop import \
  --connect jdbc:mysql://xxx:3306/xxx \
  --username xxx \
  --password xxx \
  --table xxx \
  --target-dir xxx
Patterns:
  1. Moving data from an RDBMS into HDFS or Hive always uses import; moving data from HDFS or Hive into an RDBMS always uses export. Take HDFS/Hive as the reference point and choose the keyword by the direction of data flow.
  2. The connect, username, password, and table parameters are required for every transfer. For MySQL, connect always has the form --connect jdbc:mysql://<hostname>:3306/<database>; table names the MySQL table.
  3. The export-dir parameter is used only when exporting data to an RDBMS; it is the HDFS path where the table's data is stored.
Differences:
  • HDFS → RDBMS: table names the MySQL table, which must be created in advance; export-dir is the HDFS path where the data is stored.
  • RDBMS → Hive: fields-terminated-by sets the field delimiter of the data as it will be stored in Hive (when the target is Hive you can think of it as the encoding format; when the target is an RDBMS, as the decoding format); table names the MySQL table; hive-import marks this as an import into Hive; hive-table names the Hive table. Note: the table name must not clash with an existing directory under your HDFS home directory, because Sqoop first imports the data onto HDFS, then loads it into Hive, and finally deletes that staging directory (see the cleanup sketch after this list).
  • Hive → RDBMS: table names the MySQL table; export-dir is the HDFS path where the Hive table is stored.
  • RDBMS → HDFS: table names the MySQL table; target-dir is the HDFS directory where the data will be stored.
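A minimal sketch of that cleanup, assuming the HDFS home directory is /user/hadoop and the table is my_user (both names are illustrative):

  # Check whether a stale directory with the table's name already exists under the home dir.
  hdfs dfs -ls /user/hadoop/my_user
  # If it does, remove it so the hive-import staging step can recreate it.
  hdfs dfs -rm -r /user/hadoop/my_user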
 
Sqoop Installation
Configuring Sqoop 1.x
In the conf directory, edit sqoop-env-template.sh (a filled-in example follows this list):
  • export HADOOP_COMMON_HOME=<Hadoop installation directory>
  • export HADOOP_MAPRED_HOME=<Hadoop installation directory>
  • export HIVE_HOME=<Hive installation directory>
  • export ZOOCFGDIR=<ZooKeeper configuration directory>
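For reference, a filled-in sqoop-env.sh (typically created by copying sqoop-env-template.sh) might look like the sketch below. The Hadoop path matches the HADOOP_MAPRED_HOME printed in the logs later in this post; the Hive and ZooKeeper paths are assumptions to adjust for your cluster:

  export HADOOP_COMMON_HOME=/opt/modules/hadoop-2.5.0-cdh5.3.6_Hive
  export HADOOP_MAPRED_HOME=/opt/modules/hadoop-2.5.0-cdh5.3.6_Hive
  # Illustrative paths; point these at your own installations.
  export HIVE_HOME=/opt/modules/hive-0.13.1-cdh5.3.6
  export ZOOCFGDIR=/opt/modules/zookeeper-3.4.5-cdh5.3.6/conf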
Copy the MySQL JDBC driver jar into Sqoop's lib directory.
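For example (the connector version here is illustrative; use whichever MySQL connector jar you have):

  cp mysql-connector-java-5.1.27-bin.jar /opt/modules/sqoop-1.4.5-cdh5.3.6/lib/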
Test Sqoop
  bin/sqoop list-databases \
  --connect jdbc:mysql://<hostname>:3306 \
  --username root \
  --password 123456
View the data in the local MySQL instance
  mysql> show databases;
  +--------------------+
  | Database           |
  +--------------------+
  | information_schema |
  | metastore          |
  | mysql              |
  | test               |
  +--------------------+
  4 rows in set (0.00 sec)

  mysql> use test;
  Reading table information for completion of table and column names
  You can turn off this feature to get a quicker startup with -A
  Database changed
  mysql> show tables;
  +----------------+
  | Tables_in_test |
  +----------------+
  | my_user        |
  +----------------+
  1 row in set (0.00 sec)

  mysql> desc my_user;
  +---------+--------------+------+-----+---------+----------------+
  | Field   | Type         | Null | Key | Default | Extra          |
  +---------+--------------+------+-----+---------+----------------+
  | id      | tinyint(4)   | NO   | PRI | NULL    | auto_increment |
  | account | varchar(255) | YES  |     | NULL    |                |
  | passwd  | varchar(255) | YES  |     | NULL    |                |
  +---------+--------------+------+-----+---------+----------------+
  3 rows in set (0.00 sec)

  mysql> select * from my_user;
  +----+----------+----------+
  | id | account  | passwd   |
  +----+----------+----------+
  |  1 | admin    | admin    |
  |  2 | johnny   | 123456   |
  |  3 | zhangsan | zhangsan |
  |  4 | lisi     | lisi     |
  |  5 | test     | test     |
  |  6 | qiqi     | qiqi     |
  |  7 | hangzhou | hangzhou |
  +----+----------+----------+
  7 rows in set (0.00 sec)
Create an empty Hive table with the same structure
  hive (test)> create table h_user(
            > id int,
            > account string,
            > passwd string
            > )row format delimited fields terminated by '\t';
  OK
  Time taken: 0.113 seconds
  hive (test)> desc h_user;
  OK
  col_name    data_type    comment
  id          int
  account     string
  passwd      string
  Time taken: 0.228 seconds, Fetched: 3 row(s)
Import data from the local MySQL into Hive
  bin/sqoop import \
  --connect jdbc:mysql://cdaisuke:3306/test \
  --username root \
  --password 123456 \
  --table my_user \
  --num-mappers 1 \
  --delete-target-dir \
  --fields-terminated-by "\t" \
  --hive-database test \
  --hive-import \
  --hive-table h_user
Verify in Hive:
  hive (test)> select * from h_user;
  OK
  h_user.id   h_user.account   h_user.passwd
  1   admin      admin
  2   johnny     123456
  3   zhangsan   zhangsan
  4   lisi       lisi
  5   test       test
  6   qiqi       qiqi
  7   hangzhou   hangzhou
  Time taken: 0.061 seconds, Fetched: 7 row(s)
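A Hive table is just files under its warehouse directory, so you can also inspect the loaded data directly on HDFS. A quick check, assuming the default warehouse location (the same path is used as --export-dir later in this post):

  hdfs dfs -cat /user/hive/warehouse/test.db/h_user/part-m-*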
Import from MySQL into HDFS
  bin/sqoop import \
  --connect jdbc:mysql://cdaisuke:3306/test \
  --username root \
  --password 123456 \
  --table my_user \
  --num-mappers 3 \
  --target-dir /user/hadoop/ \
  --delete-target-dir \
  --fields-terminated-by "\t"
  ------------------------------------------------------------
  [hadoop@cdaisuke sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
  > --connect jdbc:mysql://cdaisuke:3306/test \
  > --username root \
  > --password 123456 \
  > --table my_user \
  > --num-mappers 3 \
  > --target-dir /user/hadoop/ \
  > --delete-target-dir \
  > --fields-terminated-by "\t"
  18/08/14 00:02:11 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
  18/08/14 00:02:11 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
  18/08/14 00:02:12 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
  18/08/14 00:02:12 INFO tool.CodeGenTool: Beginning code generation
  18/08/14 00:02:13 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `my_user` AS t LIMIT 1
  18/08/14 00:02:13 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `my_user` AS t LIMIT 1
  18/08/14 00:02:13 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6_Hive
  Note: /tmp/sqoop-hadoop/compile/7c8bdb7cd3df7b2f4b48700704f46f65/my_user.java uses or overrides a deprecated API.
  Note: Recompile with -Xlint:deprecation for details.
  18/08/14 00:02:18 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/7c8bdb7cd3df7b2f4b48700704f46f65/my_user.jar
  18/08/14 00:02:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  18/08/14 00:02:22 INFO tool.ImportTool: Destination directory /user/hadoop is not present, hence not deleting.
  18/08/14 00:02:22 WARN manager.MySQLManager: It looks like you are importing from mysql.
  18/08/14 00:02:22 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
  18/08/14 00:02:22 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
  18/08/14 00:02:22 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
  18/08/14 00:02:22 INFO mapreduce.ImportJobBase: Beginning import of my_user
  18/08/14 00:02:22 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
  18/08/14 00:02:22 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
  18/08/14 00:02:23 INFO client.RMProxy: Connecting to ResourceManager at slave01/192.168.79.140:8032
  18/08/14 00:02:28 INFO db.DBInputFormat: Using read commited transaction isolation
  18/08/14 00:02:28 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `my_user`
  18/08/14 00:02:28 INFO mapreduce.JobSubmitter: number of splits:3
  18/08/14 00:02:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533652222364_0078
  18/08/14 00:02:29 INFO impl.YarnClientImpl: Submitted application application_1533652222364_0078
  18/08/14 00:02:29 INFO mapreduce.Job: The url to track the job: http://slave01:8088/proxy/application_1533652222364_0078/
  18/08/14 00:02:29 INFO mapreduce.Job: Running job: job_1533652222364_0078
  18/08/14 00:02:50 INFO mapreduce.Job: Job job_1533652222364_0078 running in uber mode : false
  18/08/14 00:02:50 INFO mapreduce.Job: map 0% reduce 0%
  18/08/14 00:03:00 INFO mapreduce.Job: map 33% reduce 0%
  18/08/14 00:03:01 INFO mapreduce.Job: map 67% reduce 0%
  18/08/14 00:03:02 INFO mapreduce.Job: map 100% reduce 0%
  18/08/14 00:03:02 INFO mapreduce.Job: Job job_1533652222364_0078 completed successfully
  18/08/14 00:03:02 INFO mapreduce.Job: Counters: 30
      File System Counters
          FILE: Number of bytes read=0
          FILE: Number of bytes written=394707
          FILE: Number of read operations=0
          FILE: Number of large read operations=0
          FILE: Number of write operations=0
          HDFS: Number of bytes read=295
          HDFS: Number of bytes written=106
          HDFS: Number of read operations=12
          HDFS: Number of large read operations=0
          HDFS: Number of write operations=6
      Job Counters
          Launched map tasks=3
          Other local map tasks=3
          Total time spent by all maps in occupied slots (ms)=25213
          Total time spent by all reduces in occupied slots (ms)=0
          Total time spent by all map tasks (ms)=25213
          Total vcore-seconds taken by all map tasks=25213
          Total megabyte-seconds taken by all map tasks=25818112
      Map-Reduce Framework
          Map input records=7
          Map output records=7
          Input split bytes=295
          Spilled Records=0
          Failed Shuffles=0
          Merged Map outputs=0
          GC time elapsed (ms)=352
          CPU time spent (ms)=3600
          Physical memory (bytes) snapshot=316162048
          Virtual memory (bytes) snapshot=2523156480
          Total committed heap usage (bytes)=77766656
      File Input Format Counters
          Bytes Read=0
      File Output Format Counters
          Bytes Written=106
  18/08/14 00:03:02 INFO mapreduce.ImportJobBase: Transferred 106 bytes in 40.004 seconds (2.6497 bytes/sec)
  18/08/14 00:03:02 INFO mapreduce.ImportJobBase: Retrieved 7 records.
Run with 3 map tasks:
  --num-mappers 3
Set the HDFS target directory:
  --target-dir /user/hadoop/
Delete the target directory first if it already exists:
  --delete-target-dir
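Because the job ran with three mappers (the log shows "number of splits:3"), the import leaves three part files in the target directory. A quick sanity check, not part of the original run (file names follow the standard part-m-NNNNN convention):

  # One file per map task: part-m-00000, part-m-00001, part-m-00002.
  hdfs dfs -ls /user/hadoop/
  hdfs dfs -cat /user/hadoop/part-m-*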
Export from Hive to MySQL
Create a new table in MySQL
  create table user_export(
  id tinyint(4) not null auto_increment,
  account varchar(255) default null,
  passwd varchar(255) default null,
  primary key(id)
  );
Export the data with Sqoop
  bin/sqoop export \
  --connect jdbc:mysql://cdaisuke:3306/test \
  --username root \
  --password 123456 \
  --table user_export \
  --num-mappers 1 \
  --fields-terminated-by "\t" \
  --export-dir /user/hive/warehouse/test.db/h_user
  ----------------------------------------------------
  [hadoop@cdaisuke sqoop-1.4.5-cdh5.3.6]$ bin/sqoop export \
  > --connect jdbc:mysql://cdaisuke:3306/test \
  > --username root \
  > --password 123456 \
  > --table user_export \
  > --num-mappers 1 \
  > --fields-terminated-by "\t" \
  > --export-dir /user/hive/warehouse/test.db/h_user
  Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
  18/08/14 00:16:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
  18/08/14 00:16:32 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
  18/08/14 00:16:33 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
  18/08/14 00:16:33 INFO tool.CodeGenTool: Beginning code generation
  18/08/14 00:16:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user_export` AS t LIMIT 1
  18/08/14 00:16:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user_export` AS t LIMIT 1
  18/08/14 00:16:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6_Hive
  Note: /tmp/sqoop-hadoop/compile/6823ffae505b34f7ae8b9881bae4b898/user_export.java uses or overrides a deprecated API.
  Note: Recompile with -Xlint:deprecation for details.
  18/08/14 00:16:39 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/6823ffae505b34f7ae8b9881bae4b898/user_export.jar
  18/08/14 00:16:39 INFO mapreduce.ExportJobBase: Beginning export of user_export
  18/08/14 00:16:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  18/08/14 00:16:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
  18/08/14 00:16:43 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
  18/08/14 00:16:43 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
  18/08/14 00:16:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
  18/08/14 00:16:43 INFO client.RMProxy: Connecting to ResourceManager at slave01/192.168.79.140:8032
  18/08/14 00:16:48 INFO input.FileInputFormat: Total input paths to process : 1
  18/08/14 00:16:48 INFO input.FileInputFormat: Total input paths to process : 1
  18/08/14 00:16:48 INFO mapreduce.JobSubmitter: number of splits:1
  18/08/14 00:16:48 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
  18/08/14 00:16:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533652222364_0079
  18/08/14 00:16:50 INFO impl.YarnClientImpl: Submitted application application_1533652222364_0079
  18/08/14 00:16:50 INFO mapreduce.Job: The url to track the job: http://slave01:8088/proxy/application_1533652222364_0079/
  18/08/14 00:16:50 INFO mapreduce.Job: Running job: job_1533652222364_0079
  18/08/14 00:17:11 INFO mapreduce.Job: Job job_1533652222364_0079 running in uber mode : false
  18/08/14 00:17:11 INFO mapreduce.Job: map 0% reduce 0%
  18/08/14 00:17:27 INFO mapreduce.Job: map 100% reduce 0%
  18/08/14 00:17:27 INFO mapreduce.Job: Job job_1533652222364_0079 completed successfully
  18/08/14 00:17:27 INFO mapreduce.Job: Counters: 30
      File System Counters
          FILE: Number of bytes read=0
          FILE: Number of bytes written=131287
          FILE: Number of read operations=0
          FILE: Number of large read operations=0
          FILE: Number of write operations=0
          HDFS: Number of bytes read=258
          HDFS: Number of bytes written=0
          HDFS: Number of read operations=4
          HDFS: Number of large read operations=0
          HDFS: Number of write operations=0
      Job Counters
          Launched map tasks=1
          Data-local map tasks=1
          Total time spent by all maps in occupied slots (ms)=13426
          Total time spent by all reduces in occupied slots (ms)=0
          Total time spent by all map tasks (ms)=13426
          Total vcore-seconds taken by all map tasks=13426
          Total megabyte-seconds taken by all map tasks=13748224
      Map-Reduce Framework
          Map input records=7
          Map output records=7
          Input split bytes=149
          Spilled Records=0
          Failed Shuffles=0
          Merged Map outputs=0
          GC time elapsed (ms)=73
          CPU time spent (ms)=1230
          Physical memory (bytes) snapshot=113061888
          Virtual memory (bytes) snapshot=838946816
          Total committed heap usage (bytes)=45613056
      File Input Format Counters
          Bytes Read=0
      File Output Format Counters
          Bytes Written=0
  18/08/14 00:17:27 INFO mapreduce.ExportJobBase: Transferred 258 bytes in 44.2695 seconds (5.8279 bytes/sec)
  18/08/14 00:17:27 INFO mapreduce.ExportJobBase: Exported 7 records.
  -----------------------------------------------------------------
  mysql> select * from user_export;
  +----+----------+----------+
  | id | account  | passwd   |
  +----+----------+----------+
  |  1 | admin    | admin    |
  |  2 | johnny   | 123456   |
  |  3 | zhangsan | zhangsan |
  |  4 | lisi     | lisi     |
  |  5 | test     | test     |
  |  6 | qiqi     | qiqi     |
  |  7 | hangzhou | hangzhou |
  +----+----------+----------+
  7 rows in set (0.00 sec)
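One caveat worth noting (standard Sqoop behavior, not shown in the log above): sqoop export issues plain INSERT statements, so re-running the job against an already-populated table fails with duplicate-key errors on the id primary key. For a clean re-run, empty the table first:

  mysql> truncate table user_export;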
Export from HDFS to MySQL
Create a new table in MySQL
 
  create table my_user2(
  id tinyint(4) not null auto_increment,
  account varchar(255) default null,
  passwd varchar(255) default null,
  primary key (id)
  );
  ---------------------------------------------------------
  [hadoop@cdaisuke sqoop-1.4.5-cdh5.3.6]$ bin/sqoop export \
  > --connect jdbc:mysql://cdaisuke:3306/test \
  > --username root \
  > --password 123456 \
  > --table my_user2 \
  > --num-mappers 1 \
  > --fields-terminated-by "\t" \
  > --export-dir /user/hadoop
  Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
  18/08/14 00:39:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
  18/08/14 00:39:51 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
  18/08/14 00:39:52 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
  18/08/14 00:39:52 INFO tool.CodeGenTool: Beginning code generation
  18/08/14 00:39:53 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `my_user2` AS t LIMIT 1
  18/08/14 00:39:53 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `my_user2` AS t LIMIT 1
  18/08/14 00:39:53 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6_Hive
  Note: /tmp/sqoop-hadoop/compile/7222f42cd6507a21fdcef7600bd14a20/my_user2.java uses or overrides a deprecated API.
  Note: Recompile with -Xlint:deprecation for details.
  18/08/14 00:39:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/7222f42cd6507a21fdcef7600bd14a20/my_user2.jar
  18/08/14 00:39:59 INFO mapreduce.ExportJobBase: Beginning export of my_user2
  18/08/14 00:40:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  18/08/14 00:40:00 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
  18/08/14 00:40:04 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
  18/08/14 00:40:04 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
  18/08/14 00:40:04 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
  18/08/14 00:40:04 INFO client.RMProxy: Connecting to ResourceManager at slave01/192.168.79.140:8032
  18/08/14 00:40:09 INFO input.FileInputFormat: Total input paths to process : 3
  18/08/14 00:40:09 INFO input.FileInputFormat: Total input paths to process : 3
  18/08/14 00:40:09 INFO mapreduce.JobSubmitter: number of splits:1
  18/08/14 00:40:09 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
  18/08/14 00:40:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1533652222364_0084
  18/08/14 00:40:11 INFO impl.YarnClientImpl: Submitted application application_1533652222364_0084
  18/08/14 00:40:11 INFO mapreduce.Job: The url to track the job: http://slave01:8088/proxy/application_1533652222364_0084/
  18/08/14 00:40:11 INFO mapreduce.Job: Running job: job_1533652222364_0084
  18/08/14 00:40:30 INFO mapreduce.Job: Job job_1533652222364_0084 running in uber mode : false
  18/08/14 00:40:30 INFO mapreduce.Job: map 0% reduce 0%
  18/08/14 00:40:46 INFO mapreduce.Job: map 100% reduce 0%
  18/08/14 00:40:46 INFO mapreduce.Job: Job job_1533652222364_0084 completed successfully
  18/08/14 00:40:46 INFO mapreduce.Job: Counters: 30
      File System Counters
          FILE: Number of bytes read=0
          FILE: Number of bytes written=131229
          FILE: Number of read operations=0
          FILE: Number of large read operations=0
          FILE: Number of write operations=0
          HDFS: Number of bytes read=365
          HDFS: Number of bytes written=0
          HDFS: Number of read operations=10
          HDFS: Number of large read operations=0
          HDFS: Number of write operations=0
      Job Counters
          Launched map tasks=1
          Data-local map tasks=1
          Total time spent by all maps in occupied slots (ms)=13670
          Total time spent by all reduces in occupied slots (ms)=0
          Total time spent by all map tasks (ms)=13670
          Total vcore-seconds taken by all map tasks=13670
          Total megabyte-seconds taken by all map tasks=13998080
      Map-Reduce Framework
          Map input records=7
          Map output records=7
          Input split bytes=250
          Spilled Records=0
          Failed Shuffles=0
          Merged Map outputs=0
          GC time elapsed (ms)=89
          CPU time spent (ms)=1670
          Physical memory (bytes) snapshot=115961856
          Virtual memory (bytes) snapshot=838946816
          Total committed heap usage (bytes)=45613056
      File Input Format Counters
          Bytes Read=0
      File Output Format Counters
          Bytes Written=0
  18/08/14 00:40:46 INFO mapreduce.ExportJobBase: Transferred 365 bytes in 42.3534 seconds (8.618 bytes/sec)
  18/08/14 00:40:46 INFO mapreduce.ExportJobBase: Exported 7 records.
  ------------------------------------------------------------------------
  mysql> select * from my_user2;
  +----+----------+----------+
  | id | account  | passwd   |
  +----+----------+----------+
  |  1 | admin    | admin    |
  |  2 | johnny   | 123456   |
  |  3 | zhangsan | zhangsan |
  |  4 | lisi     | lisi     |
  |  5 | test     | test     |
  |  6 | qiqi     | qiqi     |
  |  7 | hangzhou | hangzhou |
  +----+----------+----------+
  7 rows in set (0.00 sec)
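Finally, note the warning that appears in every run above: "Setting your password on the command-line is insecure. Consider using -P instead." The same export works with -P, which makes Sqoop prompt for the password interactively instead of taking it from the command line:

  bin/sqoop export \
  --connect jdbc:mysql://cdaisuke:3306/test \
  --username root \
  -P \
  --table my_user2 \
  --num-mappers 1 \
  --fields-terminated-by "\t" \
  --export-dir /user/hadoop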
 