Importing data from the stocks table in the economy database into the emp_stocks table in the human_resources database


Problem 1: Could not view the data files 000000_0 and 000001_0 in the local temporary directory emp_stocks.
Cause: I had (foolishly) been running the HDFS command dfs -cat file from inside the Hive CLI to read them. Because the export used the LOCAL keyword, the files live on the local disk, not in HDFS; the right approach is to leave the CLI first (hive> quit;) and then read them from the OS: cat /emp_stocks/000000_0. If the data from the stocks table in the economy database had instead been exported to a temporary directory in the HDFS filesystem, dfs -cat file from within the CLI would be the right tool.
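To make the distinction concrete, here is a minimal sketch of both cases; the first uses the local export path from this post, while the HDFS path in the second case is hypothetical:

# LOCAL export: the files live on the local disk, so leave the CLI first
hive> quit;
$ cat /home/landen/UntarFile/hive-0.10.0/emp_stocks/000000_0

# HDFS export (no LOCAL keyword): the files live in HDFS, so dfs -cat works from inside the CLI
hive> dfs -cat /tmp/emp_stocks/000000_0;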

Problem 2: When exporting the data from the stocks table in the economy database to a temporary directory, the column separator in the exported files showed up as garbage.
Cause: Data written to the filesystem is serialized as text with columns separated by ^A (Ctrl-A, \001) and rows separated by newlines. If any of the columns are not of primitive type, those columns are serialized to JSON format.
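To confirm that the "garbage" is really just the \001 delimiter, pipe a line through cat -A or tr. If a human-readable delimiter is needed, note that ROW FORMAT DELIMITED on INSERT OVERWRITE DIRECTORY only arrived in Hive 0.11, so on the Hive 0.10.0 used here one workaround is concat_ws; the emp_stocks_csv path below is made up for illustration:

# each column boundary shows up as ^A
$ head -n 1 emp_stocks/000000_0 | cat -A
# rewrite \001 as commas for reading
$ head -n 1 emp_stocks/000000_0 | tr '\001' ','

hive (economy)> insert overwrite local directory '/home/landen/UntarFile/hive-0.10.0/emp_stocks_csv'
              > select concat_ws(',', exchange, symbol, ymd, cast(price_open as string),
              >        cast(price_close as string), cast(price_adj_close as string))
              > from stocks;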

First, export the data to be queried from the stocks table in the economy database into a temporary directory (which can be local or in HDFS; omit the LOCAL keyword to write to HDFS):
hive> set hive.cli.print.current.db=true;
hive(economy)> insert overwrite local directory '/home/landen/UntarFile/hive-0.10.0/emp_stocks' 
    > select exchange,symbol,ymd,price_open,price_close,price_adj_close from stocks;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201303271617_0008, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303271617_0008
Kill Command = /home/landen/UntarFile/hadoop-1.0.4/libexec/../bin/hadoop job  -kill job_201303271617_0008
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2013-03-28 22:25:06,557 Stage-1 map = 0%,  reduce = 0%
2013-03-28 22:25:19,031 Stage-1 map = 11%,  reduce = 0%
2013-03-28 22:25:23,603 Stage-1 map = 24%,  reduce = 0%
2013-03-28 22:25:24,607 Stage-1 map = 36%,  reduce = 0%
2013-03-28 22:25:28,307 Stage-1 map = 41%,  reduce = 0%
2013-03-28 22:25:34,324 Stage-1 map = 53%,  reduce = 0%
2013-03-28 22:25:37,345 Stage-1 map = 72%,  reduce = 0%
2013-03-28 22:25:40,352 Stage-1 map = 83%,  reduce = 0%
2013-03-28 22:25:43,427 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 22.03 sec
2013-03-28 22:25:44,433 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 22.03 sec
2013-03-28 22:25:45,437 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 22.03 sec
2013-03-28 22:25:46,441 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU 22.03 sec
2013-03-28 22:25:47,445 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU 22.03 sec
2013-03-28 22:25:48,453 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU 22.03 sec
2013-03-28 22:25:49,456 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 47.59 sec
2013-03-28 22:25:50,460 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 47.59 sec
2013-03-28 22:25:51,464 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 47.59 sec
2013-03-28 22:25:52,467 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 47.59 sec
2013-03-28 22:25:53,473 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 47.59 sec
2013-03-28 22:25:54,478 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 47.59 sec
2013-03-28 22:25:55,482 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 47.59 sec
2013-03-28 22:25:56,486 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 47.59 sec
2013-03-28 22:25:57,490 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 47.59 sec
2013-03-28 22:25:58,494 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 47.59 sec
MapReduce Total cumulative CPU time: 47 seconds 590 msec
Ended Job = job_201303271617_0008
Copying data to local directory /home/landen/UntarFile/hive-0.10.0/emp_stocks
Copying data to local directory /home/landen/UntarFile/hive-0.10.0/emp_stocks
8487547 Rows loaded to /home/landen/UntarFile/hive-0.10.0/emp_stocks
MapReduce Jobs Launched: 
Job 0: Map: 2   Cumulative CPU: 47.59 sec   HDFS Read: 481098497 HDFS Write: 330347902 SUCCESS
Total MapReduce CPU Time Spent: 47 seconds 590 msec
OK
Time taken: 71.212 seconds

Next, load the data from that temporary directory '/home/landen/UntarFile/hive-0.10.0/emp_stocks' into the emp_stocks table in the human_resources database:
hive (human_resources)> load data local inpath '/home/landen/UntarFile/hive-0.10.0/emp_stocks'
                      > overwrite into table emp_stocks;
Copying data from file:/home/landen/UntarFile/hive-0.10.0/emp_stocks
Copying file: file:/home/landen/UntarFile/hive-0.10.0/emp_stocks/000001_0
Copying file: file:/home/landen/UntarFile/hive-0.10.0/emp_stocks/000000_0
Loading data to table human_resources.emp_stocks
Deleted hdfs://localhost:9000/home/landen/UntarFile/hive-0.10.0/user/hive/warehouse/emp_stocks
Table human_resources.emp_stocks stats: [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 330347902, raw_data_size: 0]
OK
Time taken: 10.752 seconds
hive (human_resources)> select count(1) from emp_stocks;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201303271617_0009, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303271617_0009
Kill Command = /home/landen/UntarFile/hadoop-1.0.4/libexec/../bin/hadoop job  -kill job_201303271617_0009
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2013-03-29 12:07:30,122 Stage-1 map = 0%,  reduce = 0%
2013-03-29 12:08:30,461 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 6.14 sec
2013-03-29 12:08:32,668 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 6.14 sec
2013-03-29 12:08:33,674 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:34,678 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:35,681 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:36,684 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:37,687 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:38,690 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:39,693 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:40,696 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:43,888 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:44,892 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:45,895 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:46,898 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:47,901 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:48,904 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:49,906 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:50,909 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:51,913 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:52,916 Stage-1 map = 89%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:53,919 Stage-1 map = 89%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:54,922 Stage-1 map = 89%,  reduce = 0%, Cumulative CPU 6.6 sec
2013-03-29 12:08:55,926 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.8 sec
2013-03-29 12:08:56,930 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.8 sec
2013-03-29 12:08:57,934 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.8 sec
2013-03-29 12:08:58,947 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.8 sec
2013-03-29 12:08:59,950 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.8 sec
2013-03-29 12:09:00,955 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.8 sec
2013-03-29 12:09:01,960 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 18.76 sec
2013-03-29 12:09:02,964 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 18.76 sec
2013-03-29 12:09:03,967 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 18.76 sec
2013-03-29 12:09:04,970 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 18.76 sec
2013-03-29 12:09:05,974 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 18.76 sec
2013-03-29 12:09:06,978 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 18.76 sec
2013-03-29 12:09:08,042 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 18.76 sec
MapReduce Total cumulative CPU time: 18 seconds 760 msec
Ended Job = job_201303271617_0009
MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 18.76 sec   HDFS Read: 330365236 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 760 msec
OK
8487547
Time taken: 111.511 seconds
hive (human_resources)> 

key 1: Loading files into tables

Synopsis

    1. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables. filepath can be a relative path, e.g. project/data1; an absolute path, e.g. /user/hive/project/data1; or a full URI with scheme and (optionally) an authority, e.g. hdfs://namenode:9000/user/hive/project/data1;
    2. The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns;
    3. filepath can refer to a file (in which case hive will move the file into the table) or it can be a directory (in which case hive will move all the files within that directory into the table). In either case filepath addresses a set of files;
    4. If the keyword LOCAL is specified, then: the load command will look for filepath in the local file system. If a relative path is specified - it will be interpreted relative to the current directory of the user. User can specify a full URI for local files as well - for example: file:///user/hive/project/data1;
    5. the load command will try to copy all the files addressed by filepath to the target filesystem. The target file system is inferred by looking at the location attribute of the table. The copied data files will then be moved to the table;
    6. If the keyword LOCAL is not specified, then Hive will use the full URI of filepath if one is specified. Otherwise the following rules are applied: if scheme or authority are not specified, Hive will use the scheme and authority from the hadoop configuration variable fs.default.name, which specifies the Namenode URI; if the path is not absolute, Hive will interpret it relative to /user/<username>. Hive will move the files addressed by filepath into the table (or partition);
    7. if the OVERWRITE keyword is used then the contents of the target table (or partition) will be deleted and replaced with the files referred to by filepath. Otherwise the files referred by filepath will be added to the table.
    
    Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath, the existing file will be replaced with the new file. (Both the LOCAL and OVERWRITE variants are sketched after the Notes below.)

Notes

    1. filepath cannot contain subdirectories;
    2. If the keyword LOCAL is not used, filepath must refer to files within the same filesystem as the table's (or partition's) location;
    3. Hive does some minimal checks to make sure that the files being loaded match the target table. Currently it checks that if the table is stored in sequencefile format - that the files being loaded are also sequencefiles and vice versa;
    4. Please read CompressedStorage if your datafile is compressed.
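
A minimal sketch of the variants described above; the staging paths and the partitioned table stocks_by_year are hypothetical, not from this post:

-- LOCAL + OVERWRITE: copy files from the local filesystem, replacing the table's current contents
hive> load data local inpath '/home/landen/staging/batch1' overwrite into table emp_stocks;

-- no LOCAL: filepath is resolved in HDFS (per fs.default.name) and the files are moved, not copied
hive> load data inpath '/user/landen/staging/batch2' into table emp_stocks;

-- partitioned target: values for all partitioning columns must be specified
hive> load data local inpath '/home/landen/staging/y2013' into table stocks_by_year partition (yr = '2013');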

key 2: Inserting data into Hive Tables from queries

Synopsis

    1. INSERT OVERWRITE will overwrite any existing data in the table or partition unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0);
    2. INSERT INTO will append to the table or partition, keeping the existing data intact. (Note: INSERT INTO syntax is only available starting in version 0.8);
    3. Inserts can be done to a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns;
    4. Multiple insert clauses (also known as Multi Table Insert) can be specified in the same query;
    5. The output of each of the select statements is written to the chosen table (or partition). Currently the OVERWRITE keyword is mandatory and implies that the contents of the chosen table or partition are replaced with the output of corresponding select statement;
    6. The output format and serialization class is determined by the table's metadata (as specified via DDL commands on the table).
In dynamic partition inserts, users can give a partial partition specification, which means you specify just the list of partition column names in the PARTITION clause; the column values are optional. If a partition column value is given, we call it a static partition; otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement, meaning that dynamic partition creation is determined by the value of that input column. The dynamic partition columns must be specified last among the columns in the SELECT statement, in the same order in which they appear in the PARTITION() clause, as in the sketch below.
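
A hedged sketch of a dynamic partition insert; the sales and staging_sales tables and their columns are hypothetical:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
-- country is static (a value is given); state is dynamic and is filled from the last
-- SELECT column (state_code), matching its position in the PARTITION clause
hive> insert overwrite table sales partition (country = 'US', state)
    > select order_id, amount, state_code from staging_sales;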

Notes

    1. Multi Table Inserts minimize the number of data scans required. Hive can insert data into multiple tables by scanning the input data just once and applying different query operators to it, as sketched below.
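
A sketch of the multi-table insert form against the stocks table from this post; the two target tables are hypothetical:

hive (economy)> from stocks s
              > insert overwrite table high_close select s.symbol, s.ymd, s.price_close where s.price_close > 100.0
              > insert overwrite table low_close select s.symbol, s.ymd, s.price_close where s.price_close <= 100.0;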

key 3: Writing data into filesystem from queries

Synopsis

    1. directory can be a full URI. If scheme or authority are not specified, Hive will use the scheme and authority from the hadoop configuration variable fs.default.name, which specifies the Namenode URI;
    2. if LOCAL keyword is used - then Hive will write data to the directory on the local file system;
    3. Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type - then those columns are serialized to JSON format.

Notes

    1. INSERT OVERWRITE statements to directories, local directories and tables (or partitions) can all be used together within the same query;
    2. INSERT OVERWRITE statements to HDFS filesystem directories are the best way to extract large amounts of data from Hive. Hive can write to HDFS directories in parallel from within a map-reduce job (sketched below);
    3. The directory is, as you would expect, overwritten; in other words, if the specified path exists, it is clobbered and replaced with the output.
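
For completeness, a sketch of the non-LOCAL variant of the export from this post, which lets Hive write to HDFS in parallel; the /tmp/emp_stocks_hdfs path is hypothetical:

hive (economy)> insert overwrite directory '/tmp/emp_stocks_hdfs'
              > select exchange, symbol, ymd, price_open, price_close, price_adj_close from stocks;
-- the files now live in HDFS, so dfs -cat works from inside the CLI (cf. Problem 1)
hive (economy)> dfs -cat /tmp/emp_stocks_hdfs/000000_0;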

 

posted on 2013-03-29 12:56 by kalor