使用sqoop把mysql数据导入hive
使用sqoop把mysql数据导入hive
export HADOOP_COMMON_HOME=/hadoop export HADOOP_MAPRED_HOME=/hadoop cp /hive/lib/mysql-connector-java-5.1.25-bin.jar /sqoop/lib/ |
share表第一列为自增主键 share_id,share_id<1000的数据共有999条:
mysql> SELECT COUNT(*) FROM share WHERE share_id<1000; +----------+ | COUNT(*) | +----------+ | 999 | +----------+ 1 ROW IN SET (0.00 sec) |
使用sqoop导入到hive
[hduser@www lib]$ /sqoop/bin/sqoop import --connect jdbc:mysql://localhost/shipincon --table share --username root --password xxx --hive-import --where "share_id<1000" -- --default-character-set=utf8 Warning: /usr/lib/hbase does not exist! HBase imports will fail. Please set $HBASE_HOME to the root of your HBase installation. --password 密码 比较危险,可以改成 -P (Read password from console) 13/07/12 18:42:58 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 默认是用hive的分隔符 ctrl+a 可以使用 --fields-terminated-by 来更改 插一句:在console vim中如何输入ctrl+a: 在输入模式下先按ctrl+v,再按ctrl+a即可,会显示成^A 13/07/12 18:42:58 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override 13/07/12 18:42:58 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc. 13/07/12 18:42:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 13/07/12 18:42:58 INFO tool.CodeGenTool: Beginning code generation 13/07/12 18:42:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1 13/07/12 18:42:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1 13/07/12 18:42:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop Note: /tmp/sqoop-hduser/compile/720a3072de33c8a826cdf96d9eced686/share.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 13/07/12 18:43:00 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/720a3072de33c8a826cdf96d9eced686/share.jar 13/07/12 18:43:00 WARN manager.MySQLManager: It looks like you are importing from mysql. 如果没有blob clob longvarbinary字段,可以使用 --direct 速度也许会快一些。 text字段不清楚。 13/07/12 18:43:00 WARN manager.MySQLManager: This transfer can be faster! Use the --direct 13/07/12 18:43:00 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path. 13/07/12 18:43:00 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql) 13/07/12 18:43:00 INFO mapreduce.ImportJobBase: Beginning import of share 13/07/12 18:43:02 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`share_id`), MAX(`share_id`) FROM `share` WHERE ( share_id<1000 ) 13/07/12 18:43:02 INFO mapred.JobClient: Running job: job_201307121751_0001 13/07/12 18:43:03 INFO mapred.JobClient: map 0% reduce 0% 13/07/12 18:43:12 INFO mapred.JobClient: map 50% reduce 0% 13/07/12 18:43:17 INFO mapred.JobClient: map 100% reduce 0% 13/07/12 18:43:18 INFO mapred.JobClient: Job complete: job_201307121751_0001 13/07/12 18:43:18 INFO mapred.JobClient: Counters: 18 13/07/12 18:43:18 INFO mapred.JobClient: Job Counters 13/07/12 18:43:18 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=20723 13/07/12 18:43:18 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/07/12 18:43:18 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 ################################ map tasks =4 就会生成四个文件 13/07/12 18:43:18 INFO mapred.JobClient: Launched map tasks=4 13/07/12 18:43:18 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 13/07/12 18:43:18 INFO mapred.JobClient: File Output Format Counters 13/07/12 18:43:18 INFO mapred.JobClient: Bytes Written=202094 13/07/12 18:43:18 INFO mapred.JobClient: FileSystemCounters 13/07/12 18:43:18 INFO mapred.JobClient: HDFS_BYTES_READ=455 13/07/12 18:43:18 INFO mapred.JobClient: FILE_BYTES_WRITTEN=260292 13/07/12 18:43:18 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=202094 13/07/12 18:43:18 INFO mapred.JobClient: File Input Format Counters 13/07/12 18:43:18 INFO mapred.JobClient: Bytes Read=0 13/07/12 18:43:18 INFO mapred.JobClient: Map-Reduce Framework 13/07/12 18:43:18 INFO mapred.JobClient: Map input records=999 13/07/12 18:43:18 INFO mapred.JobClient: Physical memory (bytes) snapshot=179216384 13/07/12 18:43:18 INFO mapred.JobClient: Spilled Records=0 13/07/12 18:43:18 INFO mapred.JobClient: CPU time spent (ms)=3360 13/07/12 18:43:18 INFO mapred.JobClient: Total committed heap usage (bytes)=65011712 13/07/12 18:43:18 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1519448064 13/07/12 18:43:18 INFO mapred.JobClient: Map output records=999 13/07/12 18:43:18 INFO mapred.JobClient: SPLIT_RAW_BYTES=455 13/07/12 18:43:18 INFO mapreduce.ImportJobBase: Transferred 197.3574 KB in 18.0516 seconds (10.9329 KB/sec) 共999行记录 13/07/12 18:43:18 INFO mapreduce.ImportJobBase: Retrieved 999 records. 13/07/12 18:43:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1 13/07/12 18:43:19 WARN hive.TableDefWriter: Column add_time had to be cast to a less precise type in Hive add_time的精度变小了。。为嘛? mysql中 `add_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, mysql> select add_time from share limit 1; +---------------------+ | add_time | +---------------------+ | 2012-04-08 17:17:13 | +---------------------+ 1 row in set (0.00 sec) 到了hive中会变成: 2012-04-08 17:17:13.0 貌似精度是变大了。 不过据说某些数据库的timestamp字段支持到了毫秒的精度。 13/07/12 18:43:19 INFO hive.HiveImport: Removing temporary files from import process: hdfs://localhost:54310/user/hduser/share/_logs 13/07/12 18:43:19 INFO hive.HiveImport: Loading uploaded data into Hive 13/07/12 18:43:21 INFO hive.HiveImport: WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files. 13/07/12 18:43:22 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-0.10.0.jar!/hive-log4j.properties 13/07/12 18:43:22 INFO hive.HiveImport: Hive history file=/tmp/hduser/hive_job_log_hduser_201307121843_1382245494.txt 13/07/12 18:43:25 INFO hive.HiveImport: OK 13/07/12 18:43:25 INFO hive.HiveImport: Time taken: 2.389 seconds 13/07/12 18:43:25 INFO hive.HiveImport: Loading data to table default.share 13/07/12 18:43:26 INFO hive.HiveImport: Table default.share stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 202094, raw_data_size: 0] 13/07/12 18:43:26 INFO hive.HiveImport: OK 13/07/12 18:43:26 INFO hive.HiveImport: Time taken: 0.726 seconds 13/07/12 18:43:26 INFO hive.HiveImport: Hive import complete. 13/07/12 18:43:26 INFO hive.HiveImport: Export directory is empty, removing it. [hduser@www lib]$ --default-character-set=utf8之前还有 -- Arguments to mysqldump and other subprograms may be supplied after a '--' on the command line. |
去hive中看看数据:
[hduser@www lib]$ /hive/bin/hive WARNING: org.apache.hadoop.metrics.jvm.EventCounter IS deprecated. Please USE org.apache.hadoop.log.metrics.EventCounter IN ALL the log4j.properties files. Logging initialized USING configuration IN jar:file:/usr/LOCAL/hive/lib/hive-common-0.10.0.jar!/hive-log4j.properties Hive history file=/tmp/hduser/hive_job_log_hduser_201307121854_1566000179.txt hive> USE shipincon; FAILED: Error IN metadata: ERROR: The DATABASE shipincon does NOT exist. FAILED: Execution Error, RETURN code 1 FROM org.apache.hadoop.hive.ql.EXEC.DDLTask hive> SHOW DATABASES; OK DEFAULT TIME taken: 0.244 seconds hive> USE DEFAULT; OK TIME taken: 0.018 seconds hive> SHOW TABLES; OK a cleanmaster cm login_info login_interm share TIME taken: 0.155 seconds hive> SELECT COUNT(*) FROM sahre; FAILED: SemanticException [Error 10001]: Line 1:21 TABLE NOT found 'sahre' hive> SELECT COUNT(*) FROM share > ; Total MapReduce jobs = 1 Launching Job 1 OUT OF 1 NUMBER OF reduce tasks determined at compile TIME: 1 IN ORDER TO CHANGE the average LOAD FOR a reducer (IN bytes): SET hive.EXEC.reducers.bytes.per.reducer=<number> IN ORDER TO LIMIT the maximum NUMBER OF reducers: SET hive.EXEC.reducers.MAX=<number> IN ORDER TO SET a constant NUMBER OF reducers: SET mapred.reduce.tasks=<number> Starting Job = job_201307121751_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201307121751_0002 KILL Command = /usr/LOCAL/hadoop/libexec/../bin/hadoop job -KILL job_201307121751_0002 Hadoop job information FOR Stage-1: NUMBER OF mappers: 1; NUMBER OF reducers: 1 2013-07-12 18:54:37,120 Stage-1 map = 0%, reduce = 0% 2013-07-12 18:54:40,143 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec 2013-07-12 18:54:41,154 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec 2013-07-12 18:54:42,162 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec 2013-07-12 18:54:43,168 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec 2013-07-12 18:54:44,175 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec 2013-07-12 18:54:45,181 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec 2013-07-12 18:54:46,192 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec 2013-07-12 18:54:47,199 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec 2013-07-12 18:54:48,208 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 0.82 sec 2013-07-12 18:54:49,219 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.74 sec 2013-07-12 18:54:50,238 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.74 sec MapReduce Total cumulative CPU TIME: 1 seconds 740 msec Ended Job = job_201307121751_0002 MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 Cumulative CPU: 1.74 sec HDFS READ: 202547 HDFS WRITE: 4 SUCCESS Total MapReduce CPU TIME Spent: 1 seconds 740 msec OK 999 TIME taken: 18.573 seconds hive> SHOW CREATE TABLE share; OK CREATE TABLE share( share_id BIGINT, add_time string, tag string) COMMENT 'Imported by sqoop on 2013/07/12 19:48:33' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' LINES TERMINATED BY '\n' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://localhost:54310/user/hive/warehouse2/share' TBLPROPERTIES ( 'numPartitions'='0', 'numFiles'='4', 'transient_lastDdlTime'='1373629961', 'totalSize'='227161', 'numRows'='0', 'rawDataSize'='0') TIME taken: 0.292 seconds hive> |
再去dfs里面看看:
[hduser@www lib]$ /hadoop/bin/hadoop dfs -ls /user/hive/warehouse2/share Found 5 items -rw-r--r-- 1 hduser supergroup 0 2013-07-12 18:43 /user/hive/warehouse2/share/_SUCCESS -rw-r--r-- 1 hduser supergroup 53226 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00000 -rw-r--r-- 1 hduser supergroup 50185 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00001 -rw-r--r-- 1 hduser supergroup 50634 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00002 -rw-r--r-- 1 hduser supergroup 48049 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00003 |
乍一看,像是4个partition。 可是细看文件名,并不是partition的那种结构:
比如:/user/hive/warehouse2/cm/country_name=VN_en
把文件复制到本地:
/hadoop/bin/hadoop dfs -get /user/hive/warehouse2/share ~/ [hduser@www share]$ cd ~/share [hduser@www share]$ wc -l * 250 part-m-00000 250 part-m-00001 249 part-m-00002 250 part-m-00003 0 _SUCCESS 999 total |
–hive-overwrite 可以覆盖hive中的现有数据。
–hive-table
Sets the table name to use when importing to Hive.
可以把数据import到某个分区中:
–hive-partition-key Name of a hive field to partition are sharded on
–hive-partition-value String-value that serves as partition key for this imported into hive in this job.
可以写个cron,周期性地把share表的新增数据导入到hive中了: –append
参考:
http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html
http://www.mysqlperformanceblog.com/2013/07/11/mysql-and-hadoop/