HDFS Small File Merging
1. Hive
Settings
There are 3 settings that should be configured before archiving is used. (Example values are shown.)
hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;
hive.archive.enabled controls whether archiving operations are enabled.
hive.archive.har.parentdir.settable informs Hive whether the parent directory can be set while creating the archive. In recent versions of Hadoop the -p option can specify the root directory of the archive. For example, if /dir1/dir2/file is archived with /dir1 as the parent directory, then the resulting archive file will contain the directory structure dir2/file. In older versions of Hadoop (prior to 2011), this option was not available and therefore Hive must be configured to accommodate this limitation.
har.partfile.size controls the size of the files that make up the archive. The archive will contain size_of_partition / har.partfile.size files, rounded up. Higher values mean fewer files, but will result in longer archiving times due to the reduced number of mappers.
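For example, with har.partfile.size left at the 1099511627776-byte (1 TiB) value shown above, a partition of roughly 2.5 TiB would be packed into ceil(2.5 TiB / 1 TiB) = 3 part files.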
Usage
Archive
Once the configuration values are set, a partition can be archived with the command:
ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
For example:
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
Once the command is issued, a MapReduce job will perform the archiving. Unlike Hive queries, there is no output on the CLI to indicate progress.
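One way to confirm that the archiving completed is to list the partition directory afterwards; the original part files should have been replaced by a data.har directory. A minimal sketch (the warehouse path below is a hypothetical location for the srcpart example, not taken from any output above):
# hypothetical partition path for srcpart; adjust to the actual warehouse location
hdfs dfs -ls /user/hive/warehouse/srcpart/ds=2008-04-08/hr=12
# expected: a single data.har directory instead of the original part files
hdfs dfs -ls /user/hive/warehouse/srcpart/ds=2008-04-08/hr=12/data.har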
Unarchive
The partition can be reverted back to its original files with the unarchive command:
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')
2. HDFS (Apache Hadoop Archives – Hadoop Archives Guide)
Overview
Hadoop archives are special format archives. A Hadoop archive maps to a file system directory. A Hadoop archive always has a .har extension. A Hadoop archive directory contains metadata (in the form of _index and _masterindex) and data (part-*) files. The _index file contains the name of the files that are part of the archive and the location within the part files.
How to Create an Archive
Usage: hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>
-archiveName is the name of the archive you would like to create. An example would be foo.har. The name should have a *.har extension. The parent argument specifies the relative path to which the files should be archived. An example would be:
-p /foo/bar a/b/c e/f/g
Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to the parent. Note that it is a Map/Reduce job that creates the archives, so you need a MapReduce cluster to run this. For a detailed example see the later sections.
-r indicates the desired replication factor; if this optional argument is not specified, a replication factor of 3 will be used.
If you just want to archive a single directory /foo/bar then you can just use
hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir
If you specify source files that are in an encryption zone, they will be decrypted and written into the archive. If the har file is not located in an encryption zone, then they will be stored in clear (decrypted) form. If the har file is located in an encryption zone they will be stored in encrypted form.
How to Look Up Files in Archives
The archive exposes itself as a file system layer, so all the fs shell commands work on archives, but with a different URI. Also note that archives are immutable, so renames, deletes and creates return an error. The URI for Hadoop Archives is
har://scheme-hostname:port/archivepath/fileinarchive
If no scheme is provided it assumes the underlying filesystem. In that case the URI would look like
har:///archivepath/fileinarchive
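As a quick illustration, both URI forms work with the ordinary fs shell commands. The sketch below assumes the foo.har archive from the examples later in this section and an HDFS HA nameservice named nameservice1 (the one used in the practice demos below):
# fully qualified: underlying scheme plus nameservice (or hostname:port)
hdfs dfs -ls har://hdfs-nameservice1/user/zoo/foo.har/dir1
# scheme omitted: falls back to the default filesystem (fs.defaultFS)
hdfs dfs -ls har:///user/zoo/foo.har/dir1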
How to Unarchive an Archive
Since all the fs shell commands in the archives work transparently, unarchiving is just a matter of copying.
To unarchive sequentially:
hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir
To unarchive in parallel, use DistCp:
hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir
Archives Examples
Creating an Archive
hadoop archive -archiveName foo.har -p /user/hadoop -r 3 dir1 dir2 /user/zoo
The above example is creating an archive using /user/hadoop as the relative archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be archived in the following file system directory – /user/zoo/foo.har. Archiving does not delete the input files. If you want to delete the input files after creating the archives (to reduce namespace), you will have to do it on your own. In this example, because -r 3 is specified, a replication factor of 3 will be used.
Looking Up Files
Looking up files in hadoop archives is as easy as doing an ls on the filesystem. After you have archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in the example above, to see all the files in the archives you can just run:
hdfs dfs -ls -R har:///user/zoo/foo.har/
To understand the significance of the -p argument, let's go through the above example again. If you just do an ls (not lsr) on the hadoop archive using
hdfs dfs -ls har:///user/zoo/foo.har
The output should be:
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
As you may recall, the archives were created with the following command:
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
If we were to change the command to:
hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo
then a ls on the hadoop archive using
hdfs dfs -ls har:///user/zoo/foo.har
would give you
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
Notice that the archived files have been archived relative to /user/ rather than /user/hadoop.
3. Practice (demo with an internal table stored as PARQUET files)
create internal table
CREATE TABLE xx.a(
original_test_value DOUBLE,
flag STRING
)
PARTITIONED BY (
stat_date STRING,
parametric_hash STRING
)
WITH SERDEPROPERTIES ('serialization.format'='1')
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/user/hive/warehouse/xx.db/a'
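Before archiving it is worth confirming that Hive treats xx.a as a managed table, because (as the external-table demo below shows) ARCHIVE is rejected for external tables. A minimal check, assuming the hive CLI is available on the node:
# Table Type should report MANAGED_TABLE for xx.a
hive -e "DESCRIBE FORMATTED xx.a;" | grep -i "Table Type"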
copy the parquet files to the new dir
sudo -u hdfs hdfs dfs -mkdir -p /user/hive/warehouse/xx.db/a/stat_date=20220125/parametric_hash=0
sudo -u hdfs hdfs dfs -cp /user/hive/warehouse/xx.db/a/stat_date=20220125/parametric_hash=0/* /user/hive/warehouse/xx.db/a/stat_date=20220125/parametric_hash=0
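To record how many small files the partition holds before archiving, a quick count can be taken (paths as in the commands above; the grep pattern simply matches the part file names):
# summary: directory count, file count and total bytes under the partition
sudo -u hdfs hdfs dfs -count /user/hive/warehouse/xx.db/a/stat_date=20220125/parametric_hash=0
# or count only the parquet part files
hdfs dfs -ls /user/hive/warehouse/xx.db/a/stat_date=20220125/parametric_hash=0 | grep -c "part-"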
current hdfs file list:
Permission | Owner | Group | Size | Last Modified | Replication | Block Size | Name
---|---|---|---|---|---|---|---
-rwxr-xr-x | airflow | hive | 40.08 KB | Feb 09 11:21 | 3 | 128 MB | part-00000-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 29.07 KB | Feb 09 11:21 | 3 | 128 MB | part-00002-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 27.97 KB | Feb 09 11:21 | 3 | 128 MB | part-00003-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 27.03 KB | Feb 09 11:21 | 3 | 128 MB | part-00004-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 39.71 KB | Feb 09 11:21 | 3 | 128 MB | part-00006-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 32.54 KB | Feb 09 11:21 | 3 | 128 MB | part-00007-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 23.99 KB | Feb 09 11:21 | 3 | 128 MB | part-00011-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 23.62 KB | Feb 09 11:21 | 3 | 128 MB | part-00012-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 30.02 KB | Feb 09 11:21 | 3 | 128 MB | part-00014-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 22.91 KB | Feb 09 11:21 | 3 | 128 MB | part-00015-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 28.79 KB | Feb 09 11:21 | 3 | 128 MB | part-00016-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 19.11 KB | Feb 09 11:21 | 3 | 128 MB | part-00018-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
add hive partition and archive
#add partition
ALTER TABLE xx.a add PARTITION(stat_date='20220125', parametric_hash='0') location '/user/hive/warehouse/xx.db/a/stat_date=20220125/parametric_hash=0/';
#archive
ALTER TABLE xx.a ARCHIVE PARTITION(stat_date='20220125',parametric_hash='0');
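As noted earlier, the ARCHIVE statement runs a MapReduce job with no progress output on the CLI; listing the partition directory afterwards shows whether data.har has replaced the original files:
# after ARCHIVE completes, the partition directory should contain only data.har
hdfs dfs -ls /user/hive/warehouse/xx.db/a/stat_date=20220125/parametric_hash=0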
archived HDFS file list
hive merges the 12 files into a single archive, data.har
data.har contains _SUCCESS, _index, _masterindex and part-0; the actual data is stored in part-0, and individual files are located through the _index and _masterindex files.
Permission | Owner | Group | Size | Last Modified | Replication | Block Size | Name
---|---|---|---|---|---|---|---
drwxr-xr-x | hdfs | hive | 0 B | Mar 01 12:36 | 0 | 0 B | data.har

Permission | Owner | Group | Size | Last Modified | Replication | Block Size | Name
---|---|---|---|---|---|---|---
-rw-r--r-- | hdfs | hive | 0 B | Mar 01 12:36 | 3 | 128 MB | _SUCCESS
-rw-r--r-- | hdfs | hive | 3.38 KB | Mar 01 12:36 | 3 | 128 MB | _index
-rw-r--r-- | hdfs | hive | 24 B | Mar 01 12:36 | 3 | 128 MB | _masterindex
-rw-r--r-- | hdfs | hive | 561.31 KB | Mar 01 12:36 | 3 | 512 MB | part-0
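The 12 original part files remain visible through the har:// filesystem layer, so they can still be listed directly out of the archive (a sketch using the scheme-less URI form, which resolves against the default filesystem):
# the original parquet part files, now served from part-0 via the index files
hdfs dfs -ls -R har:///user/hive/warehouse/xx.db/a/stat_date=20220125/parametric_hash=0/data.har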
Effect
Positive
12 small files occupying 12 blocks are merged into 1 file with 1 block.
This cuts 11 file/block metadata entries, offloading the NameNode.
Negative
##### select * from the non-archived and the archived table on the same partition via HiveServer2
# no archive
SELECT * from xx.b where parametric_hash =0;
20000 rows - 93ms (+5.754s)
# archive
SELECT * from xx.a where parametric_hash =0;
20000 rows - 102ms (+8.972s)
error: selecting the table via impala
select * from xx.a;
# cannot select via impala due to: Failed to connect to FS: har://hdfs-nameservice1/
SQL error [500051] [HY000]: [Cloudera][ImpalaJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: Failed to connect to FS: har://hdfs-nameservice1/
Error(255): Unknown error 255
Root cause: IOException: Invalid path for the Har Filesystem. har://hdfs-nameservice1/
, Query: SELECT `a`.`original_test_value`, `a`.`flag`, `a`.`stat_date`,`a`.`parametric_hash` FROM `xx`.`a`.
4. Practice (demo with an external table stored as PARQUET files)
create external table
CREATE EXTERNAL TABLE xx.b(
original_test_value DOUBLE,
flag STRING
)
PARTITIONED BY (
stat_date STRING,
parametric_hash STRING
)
WITH SERDEPROPERTIES ('serialization.format'='1')
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/user/hive/warehouse/xx.db/b'
copy the parquet files to the new dir
sudo -u hdfs hdfs dfs -mkdir -p /user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0
sudo -u hdfs hdfs dfs -cp /user/hive/warehouse/xx.db/c/stat_date=20220125/parametric_hash=0/* /user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0
current hdfs file list:
Permission | Owner | Group | Size | Last Modified | Replication | Block Size | Name
---|---|---|---|---|---|---|---
-rwxr-xr-x | airflow | hive | 40.08 KB | Feb 09 11:21 | 3 | 128 MB | part-00000-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 29.07 KB | Feb 09 11:21 | 3 | 128 MB | part-00002-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 27.97 KB | Feb 09 11:21 | 3 | 128 MB | part-00003-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 27.03 KB | Feb 09 11:21 | 3 | 128 MB | part-00004-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 39.71 KB | Feb 09 11:21 | 3 | 128 MB | part-00006-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 32.54 KB | Feb 09 11:21 | 3 | 128 MB | part-00007-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 23.99 KB | Feb 09 11:21 | 3 | 128 MB | part-00011-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 23.62 KB | Feb 09 11:21 | 3 | 128 MB | part-00012-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 30.02 KB | Feb 09 11:21 | 3 | 128 MB | part-00014-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 22.91 KB | Feb 09 11:21 | 3 | 128 MB | part-00015-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 28.79 KB | Feb 09 11:21 | 3 | 128 MB | part-00016-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
-rwxr-xr-x | airflow | hive | 19.11 KB | Feb 09 11:21 | 3 | 128 MB | part-00018-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
add hive partition and archive
# add partition
ALTER TABLE xx.b add PARTITION(stat_date='20220125',parametric_hash='0') location '/user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/';
# archive
ALTER TABLE xx.b ARCHIVE PARTITION(stat_date='20220125',parametric_hash='0');
error:
## external tables don't support hive archive
ALTER TABLE xx.b ARCHIVE PARTITION(stat_date='20220125', parametric_hash='0');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. ARCHIVE can only be performed on managed tables
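An alternative not tried in this demo would be to temporarily flip the table to managed, archive, and flip it back; 'EXTERNAL'='TRUE'/'FALSE' is a standard table property, but whether this is acceptable depends on who owns the external data, so treat the following only as a sketch:
## make xx.b a managed table so ARCHIVE is allowed (use with care on shared data)
hive -e "ALTER TABLE xx.b SET TBLPROPERTIES('EXTERNAL'='FALSE');"
hive -e "set hive.archive.enabled=true; set hive.archive.har.parentdir.settable=true; set har.partfile.size=1099511627776; ALTER TABLE xx.b ARCHIVE PARTITION(stat_date='20220125', parametric_hash='0');"
## restore the external flag afterwards
hive -e "ALTER TABLE xx.b SET TBLPROPERTIES('EXTERNAL'='TRUE');"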
## drop the partition
ALTER TABLE xx.b drop PARTITION(stat_date='20220125', parametric_hash='0');
merge the small files with the HDFS archive command
# build the har from all files in the source dir, outputting to the same dir
sudo -u hdfs hadoop archive -archiveName data.har -p /user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/ /user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/
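Before deleting the originals in the next step, the new archive can be sanity-checked by listing it through the har:// layer and confirming all 12 part files are visible:
# should print 12, one line per original .c000 file inside data.har
hdfs dfs -ls -R har:///user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/data.har | grep -c "\.c000"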
# delete the original files now that the har has been built
hdfs dfs -rmr /user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/*.c000
archived HDFS file list
the hadoop archive command merges the 12 files into a single archive, data.har
data.har contains _SUCCESS, _index, _masterindex and part-0; the actual data is stored in part-0, and individual files are located through the _index and _masterindex files.
Permission | Owner | Group | Size | Last Modified | Replication | Block Size | Name
---|---|---|---|---|---|---|---
drwxr-xr-x | hdfs | hive | 0 B | Mar 01 12:36 | 0 | 0 B | data.har

Permission | Owner | Group | Size | Last Modified | Replication | Block Size | Name
---|---|---|---|---|---|---|---
-rw-r--r-- | hdfs | hive | 0 B | Mar 01 12:36 | 3 | 128 MB | _SUCCESS
-rw-r--r-- | hdfs | hive | 3.38 KB | Mar 01 12:36 | 3 | 128 MB | _index
-rw-r--r-- | hdfs | hive | 24 B | Mar 01 12:36 | 3 | 128 MB | _masterindex
-rw-r--r-- | hdfs | hive | 561.31 KB | Mar 01 12:36 | 3 | 512 MB | part-0
rebuild the partition
# rebuild the partition, setting its location to the har file path
ALTER TABLE xx.b add PARTITION(stat_date='20220125',parametric_hash='0') location 'har:///user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/data.har';
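A quick sanity check that the rebuilt partition is readable through the archive is to count its rows from the hive CLI (a sketch; the count should match what the partition returned before archiving):
# row count for the archived partition, read through the har:// location
hive -e "SELECT COUNT(*) FROM xx.b WHERE stat_date='20220125' AND parametric_hash='0';"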
Effect
Positive
12 small files occupying 12 blocks are merged into 1 file with 1 block.
This cuts 11 file/block metadata entries, offloading the NameNode.
Negative
##### select * from the non-archived and the archived table on the same partition via HiveServer2
# no archive
SELECT * from xx.b where parametric_hash =0;
20000 rows - 93ms (+5.754s)
# archive
SELECT * from xx.b where parametric_hash =0;
20000 rows - 102ms (+8.972s)
error: selecting the table via impala
select * from xx.b;
# cannot select via impala due to: Failed to connect to FS: har://hdfs-nameservice1/
SQL error [500312] [HY000]: [Cloudera][ImpalaJDBCDriver](500312) Error in fetching data rows: Disk I/O error on impala03-dev:22000: Failed to open HDFS file har:/user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/data.har/part-00006-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000
Error(22): Invalid argument
Root cause: IllegalArgumentException: Wrong FS: har:/user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/data.har/part-00006-e5bd2f13-a955-4920-a8ac-bd19ec031843.c000, expected: hdfs://nameservice1
[Not achieved] how to fix the problem of selecting the archived table via impala
1. designate the HDFS HA nameservice when setting the HAR file location for the partition
drop the partition whose location is the har:// path
ALTER TABLE xx.b drop PARTITION(stat_date='20220125',parametric_hash='0') ;
Dropped the partition stat_date=20220125/parametric_hash=0
add the new partition with location hdfs://nameservice/path
ALTER TABLE xx.b add PARTITION(stat_date='20220125',parametric_hash='0') location 'hdfs://nameservice1/user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/data.har';
failed: select from the new partition
Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://nameservice1/user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/data.har/part-0
2. designate the HDFS namenode:port when setting the HAR file location for the partition
drop the partition whose location is the hdfs://nameservice path
ALTER TABLE xx.b drop PARTITION(stat_date='20220125',parametric_hash='0') ;
Dropped the partition stat_date=20220125/parametric_hash=0
add the new partition with location hdfs://namenode:port/path
ALTER TABLE xx.b add PARTITION(stat_date='20220125',parametric_hash='0') location 'hdfs://192.168.1.170:8020/user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/data.har';
OK
Time taken: 0.088 seconds
failed: select from the new partition
hive> select * from xx.b limit 10;
OK
Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://192.168.1.170:8020/user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/data.har/part-0
3. designate har://nameservice/path when setting the HAR file location for the partition
drop the partition whose location is the hdfs:// path
ALTER TABLE xx.b drop PARTITION(stat_date='20220125',parametric_hash='0') ;
Dropped the partition stat_date=20220125/parametric_hash=0
add the new partition with location har://nameservice/path
ALTER TABLE xx.b add PARTITION(stat_date='20220125',parametric_hash='0') location 'har://hdfs-nameservice1/user/hive/warehouse/xx.db/b/stat_date=20220125/parametric_hash=0/data.har';
success (hive): select from the new partition
select * from xx.b limit 10;
Time taken: 1.056 seconds, Fetched: 10 row(s)
failed (impala): select from the new partition
SQL 错误 [500051] [HY000]: [Cloudera][ImpalaJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: Failed to connect to FS: har://hdfs-nameservice1/
Error(255): Unknown error 255
Root cause: IOException: Invalid path for the Har Filesystem. har://hdfs-nameservice1/
, Query: SELECT `b`.`original_test_value`, `b`.`flag`, `b`.`stat_date`, `b`.`parametric_hash` FROM `test_xac_dws`.`b` LIMIT 10.