Extracting zip files from C++ through library calls on Ubuntu 20.04
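1. Introduction
On Linux we normally compress and decompress files with command-line tools. To get the same functionality inside a C++ program, there are two options:
- call a command such as 7z through the system function to compress or decompress;
- call a compression library from C++ to compress or decompress.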
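The first option is simple but rigid: to decompress, for example, the archive must first be extracted into files on disk before the content can be used. The second option is more flexible but harder to use. Consider this scenario: we want to use the decompressed data directly without producing any files, which saves two IO operations (writing the extracted file and reading it back). Since the first option is straightforward, this article focuses on decompressing files with the second.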
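For comparison, the first option can be sketched as below. This is only an illustration: the helper names `shellQuote`, `build7zExtractCommand` and `extractWith7z` are hypothetical, and the sketch assumes a `7z` binary is available on PATH. The arguments are quoted so that paths containing spaces or quotes do not break the shell command.

```cpp
#include <cstdlib>
#include <string>

// Wrap a string in single quotes for a POSIX shell, escaping embedded quotes.
std::string shellQuote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') {
            out += "'\\''";  // close the quote, add an escaped quote, reopen
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}

// Build the command line: x = extract with full paths,
// -o<dir> = output directory, -y = answer yes to all prompts.
std::string build7zExtractCommand(const std::string& archive,
                                  const std::string& outDir) {
    return "7z x " + shellQuote(archive) + " -o" + shellQuote(outDir) + " -y";
}

// Run the command; on Linux a return value of 0 from std::system means
// the command exited successfully.
bool extractWith7z(const std::string& archive, const std::string& outDir) {
    return std::system(build7zExtractCommand(archive, outDir).c_str()) == 0;
}
```

A call such as `extractWith7z("images.zip", "/tmp/out")` always leaves the extracted files on disk, which is exactly the extra write and read that the second option avoids.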
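Files compressed with gzip can be handled with the zlib library, but zlib by itself cannot extract .zip archives. To extract .zip files, use the minizip library or the 7zip library.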
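Since the library to use depends on the container format, it can help to check the leading signature bytes of a file before choosing one. The sketch below is a minimal illustration; the function names `hasPrefix` and `detectArchiveType` are hypothetical. It only recognizes the most common signatures (an empty zip archive, for instance, starts with a different PK record and would be reported as unknown).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// True if `data` begins with the bytes in `sig`.
static bool hasPrefix(const std::vector<unsigned char>& data,
                      const std::vector<unsigned char>& sig) {
    return data.size() >= sig.size() &&
           std::equal(sig.begin(), sig.end(), data.begin());
}

// Guess the archive type from the first bytes of the file.
std::string detectArchiveType(const std::vector<unsigned char>& head) {
    if (hasPrefix(head, {0x1F, 0x8B})) {
        return "gzip";  // gzip member header
    }
    if (hasPrefix(head, {0x50, 0x4B, 0x03, 0x04})) {
        return "zip";   // "PK\x03\x04", zip local file header
    }
    if (hasPrefix(head, {0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C})) {
        return "7z";    // "7z" followed by BC AF 27 1C
    }
    return "unknown";
}
```

Reading the first six bytes of a file into a vector and passing it to `detectArchiveType` is enough to decide between zlib and minizip.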
For a speed comparison of common archiving tools, see: https://peazip.github.io/peazip-compression-benchmark.html
2. Solutions
2.1 (Method 1) Extracting .zip files from C++ with the minizip library
Here we use https://github.com/Lecrapouille/zipper:
Zipper is a C++11 wrapper around minizip compression library. Its goal is to bring the power and simplicity of minizip to a more object-oriented/c++ user-friendly library.
2.1.1 Building and installing
Download and build the project:
git clone https://github.com/lecrapouille/zipper.git --recursive
cd zipper
make download-external-libs
make compile-external-libs
make -j`nproc --all`
Output of the last command:
mulan@mulan-ai:/home/xxx/projects/zipper$ sudo make -j`nproc --all`
.makefile/Makefile.flags:127: CXXFLAGS: You have replaced default compilation flags by your own flags
*** About: Zipper => C++ wrapper around minizip compression library
*** Compiler: g++ --std=c++11
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
*** Version: Zipper <= VERSION.txt
Project version: 2.0
Git branch master SHA1 5da208a307465022ec41f41e6a8964f3487cbb75
Compiling [25%] Zipper <= ./src/utils/Timestamp.cpp
Compiling [50%] Zipper <= ./src/utils/Path.cpp
Compiling [75%] Zipper <= ./src/Zipper.cpp
Compiling [100%] Zipper <= ./src/Unzipper.cpp
*** Generating: Zipper => zipper-2.0.0.pc
*** Static lib: Zipper => build/libzipper.a.2.0.0
*** Shared lib: Zipper => build/libzipper.so.2.0
ar: creating libzipper.a.2.0.0
a - Timestamp.o
a - Path.o
a - Zipper.o
a - Unzipper.o
a - _libaes/aescrypt.c.o
a - _libaes/aeskey.c.o
a - _libaes/aestab.c.o
a - _libaes/entropy.c.o
a - _libaes/fileenc.c.o
a - _libaes/hmac.c.o
a - _libaes/prng.c.o
a - _libaes/pwd2key.c.o
a - _libaes/sha1.c.o
a - _libminizip/ioapi.c.o
a - _libminizip/ioapi_buf.c.o
a - _libminizip/ioapi_mem.c.o
a - _libminizip/unzip.c.o
a - _libminizip/zip.c.o
a - _libz/adler32.c.o
a - _libz/adler32_avx2.c.o
a - _libz/adler32_avx512.c.o
a - _libz/adler32_avx512_vnni.c.o
a - _libz/adler32_fold.c.o
a - _libz/adler32_sse42.c.o
a - _libz/adler32_ssse3.c.o
a - _libz/chunkset.c.o
a - _libz/chunkset_avx.c.o
a - _libz/chunkset_sse2.c.o
a - _libz/chunkset_sse41.c.o
a - _libz/compare256.c.o
a - _libz/compare256_avx2.c.o
a - _libz/compare256_sse2.c.o
a - _libz/compress.c.o
a - _libz/cpu_features.c.o
a - _libz/crc32_braid.c.o
a - _libz/crc32_braid_comb.c.o
a - _libz/crc32_fold.c.o
a - _libz/crc32_fold_pclmulqdq.c.o
a - _libz/crc32_fold_vpclmulqdq.c.o
a - _libz/deflate.c.o
a - _libz/deflate_fast.c.o
a - _libz/deflate_huff.c.o
a - _libz/deflate_medium.c.o
a - _libz/deflate_quick.c.o
a - _libz/deflate_rle.c.o
a - _libz/deflate_slow.c.o
a - _libz/deflate_stored.c.o
a - _libz/functable.c.o
a - _libz/gzlib.c.o
a - _libz/gzread.c.o
a - _libz/gzwrite.c.o
a - _libz/infback.c.o
a - _libz/inffast.c.o
a - _libz/inflate.c.o
a - _libz/inftrees.c.o
a - _libz/insert_string.c.o
a - _libz/insert_string_roll.c.o
a - _libz/insert_string_sse42.c.o
a - _libz/slide_hash.c.o
a - _libz/slide_hash_avx2.c.o
a - _libz/slide_hash_sse2.c.o
a - _libz/trees.c.o
a - _libz/uncompr.c.o
a - _libz/x86_features.c.o
a - _libz/zutil.c.o
To install the C++ header files and compiled libraries, run:
mulan@mulan-ai:/home/xxx/projects/zipper$ sudo make install
.makefile/Makefile.flags:127: CXXFLAGS: You have replaced default compilation flags by your own flags
*** About: Zipper => C++ wrapper around minizip compression library
*** Compiler: g++ --std=c++11
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
*** Installing: doc => /usr/share/Zipper/2.0.0/doc
*** Installing: libs => /usr/lib
*** Installing: pkg-config => /usr/lib/pkgconfig
*** Installing: headers => /usr/include/Zipper-2.0.0
*** Installing: include => /usr/include/Zipper-2.0.0
*** Installing: src => /usr/include/Zipper-2.0.0
2.1.2 Linking Zipper into your own project
- In your project add the needed headers in your c++ files:
#include <Zipper/Unzipper.hpp>
#include <Zipper/Zipper.hpp>
- To compile your project against Zipper use pkg-config:
For cmake:
# for zipper
include(FindPkgConfig)
message(STATUS "finding zipper ...")
pkg_check_modules(PKGS REQUIRED zipper)
# print after pkg_check_modules; before that call the variable is still empty
message("pkg include dirs: ${PKGS_INCLUDE_DIRS}")
include_directories(${PKGS_INCLUDE_DIRS})
link_directories(${PKGS_LIBRARY_DIRS})
If you prefer not to locate the zipper library through FindPkgConfig, the following cmake configuration also works:
set(ZIPPER_DIR "/home/mulan/deploy_env/Zipper-2.0.0/")
include_directories(${ZIPPER_DIR}/include/)
link_directories(${ZIPPER_DIR}/lib/)
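Here the include directory holds the contents of Zipper-2.0.0 (which can be copied from /usr/include/Zipper-2.0.0).
Then create a lib folder and put the shared library files from the build folder produced earlier into it. (Tested and working.)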
-----------------------------------------------------------------------
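Addendum: about pkg-config
pkg-config is a tool for querying how to use installed libraries when compiling source code.
It outputs information about an installed library, including:
- the arguments the C/C++ compiler needs;
- the arguments the linker needs;
- the version of the installed package.
When a library is installed (for example from an RPM, deb, or other binary package), it ships a file with the .pc extension, which is placed in a known directory (on Ubuntu, /usr/lib/pkgconfig).
For example, the content of cuda-11.3.pc is: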
mulan@mulan-PowerEdge-R7525:/usr/lib/pkgconfig$ cat cuda-11.3.pc
cudaroot=/usr/local/cuda-11.3/
libdir=${cudaroot}/targets/x86_64-linux/lib/stubs/
includedir=${cudaroot}/targets/x86_64-linux/include

Name: cuda
Description: CUDA Driver Library
Version: 11.3
Libs: -L${libdir} -lcuda
Cflags: -I${includedir}
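To make the variable substitution in a .pc file concrete, here is a simplified model of it in C++. It is not pkg-config's actual implementation; the function names `expandVars` and `pcField` are hypothetical, and the sketch handles only `name=value` definitions, `Field: value` lines, and `${name}` references.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Replace every ${name} in `text` with vars[name]; unknown names expand to "".
std::string expandVars(const std::string& text,
                       const std::map<std::string, std::string>& vars) {
    std::string out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t open = text.find("${", pos);
        std::size_t close = (open == std::string::npos)
                                ? std::string::npos
                                : text.find('}', open + 2);
        if (close == std::string::npos) {
            out += text.substr(pos);  // no further complete reference
            break;
        }
        out += text.substr(pos, open - pos);
        auto it = vars.find(text.substr(open + 2, close - open - 2));
        if (it != vars.end()) {
            out += it->second;
        }
        pos = close + 1;
    }
    return out;
}

// Return the expanded value of one field (for example "Libs") from .pc text.
std::string pcField(const std::string& pcText, const std::string& field) {
    std::map<std::string, std::string> vars;
    std::istringstream in(pcText);
    std::string line;
    while (std::getline(in, line)) {
        std::size_t colon = line.find(':');
        std::size_t eq = line.find('=');
        if (eq != std::string::npos &&
            (colon == std::string::npos || eq < colon)) {
            // Variable definition; earlier variables may be referenced.
            vars[line.substr(0, eq)] = expandVars(line.substr(eq + 1), vars);
        } else if (colon != std::string::npos &&
                   line.substr(0, colon) == field) {
            std::size_t start = line.find_first_not_of(' ', colon + 1);
            return start == std::string::npos
                       ? std::string()
                       : expandVars(line.substr(start), vars);
        }
    }
    return std::string();
}
```

Applied to the cuda-11.3.pc text above, `pcField(text, "Cflags")` yields the `-I` flag with `${includedir}` replaced by the full include path, which is the kind of string the compiler ultimately receives.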
-----------------------------------------------------------------------
For testing, the project uses:
- googletest framework
- lcov for code coverage
Reading the source:
int extractToMemory(std::vector<unsigned char>& outvec, ZipEntry& info)
{
    int err;
    int bytes = 0;

    err = unzOpenCurrentFilePassword(m_zf,
        (m_outer.m_password != "") ? m_outer.m_password.c_str() : NULL);
    if (UNZ_OK == err)
    {
        std::vector<unsigned char> buffer;
        buffer.resize(ZIPPER_WRITE_BUFFER_SIZE);

        outvec.reserve(static_cast<size_t>(info.uncompressedSize));

        do
        {
            bytes = unzReadCurrentFile(m_zf, buffer.data(),
                                       static_cast<unsigned int>(buffer.size()));
            if (bytes == 0)
                break;

            if (bytes < 0)
            {
                m_error_code = make_error_code(
                    unzipper_error::INTERNAL_ERROR, strerror(errno));
                return UNZ_ERRNO;
            }

            outvec.insert(outvec.end(), buffer.data(), buffer.data() + bytes);
        } while (bytes > 0);
    }
    else
    {
        m_error_code = make_error_code(unzipper_error::INTERNAL_ERROR);
    }

    return err;
}
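The read loop in extractToMemory can be tried without minizip by swapping in a stand-in reader. In the sketch below, `FakeEntryReader` and `readAllInto` are hypothetical names and the reader only imitates the return convention of unzReadCurrentFile (bytes read, 0 at the end of the data). It shows one property worth noticing: the loop appends to outvec and never clears it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for unzReadCurrentFile: hands out `data` piece by piece and
// returns 0 once everything has been consumed.
struct FakeEntryReader {
    std::vector<unsigned char> data;
    std::size_t pos = 0;

    int read(unsigned char* buf, unsigned int len) {
        std::size_t n = std::min<std::size_t>(len, data.size() - pos);
        std::copy(data.begin() + pos, data.begin() + pos + n, buf);
        pos += n;
        return static_cast<int>(n);
    }
};

// Same shape as the loop above: read into a fixed-size buffer and append
// each chunk to the end of outvec. Returns 0 on success, -1 on read error.
int readAllInto(FakeEntryReader& reader, std::vector<unsigned char>& outvec,
                std::size_t bufferSize) {
    std::vector<unsigned char> buffer(bufferSize);
    outvec.reserve(outvec.size() + reader.data.size());
    int bytes = 0;
    do {
        bytes = reader.read(buffer.data(),
                            static_cast<unsigned int>(buffer.size()));
        if (bytes < 0) {
            return -1;
        }
        outvec.insert(outvec.end(), buffer.data(), buffer.data() + bytes);
    } while (bytes > 0);
    return 0;
}
```

Calling `readAllInto` twice with the same vector leaves the bytes of both entries in it, so code built on a loop of this shape is safest when it passes a fresh, empty vector for each entry.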
3. Hands-on analysis
3.1 Problem analysis: double free or corruption (out)
When running the C++ code, we hit the following error:
double free or corruption (out)
Aborted (core dumped)
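This message almost always points to a memory problem:
1. Double free: check whether the program releases the same memory twice;
2. Out-of-bounds access (the most common cause: an array or allocated buffer is indexed past its end inside a loop).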
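Both causes can be made much less likely with standard library facilities. The sketch below is a generic illustration, not the fix for the specific crash in this article; `makeBuffer` and `safeWrite` are hypothetical helper names. Building with g++'s `-fsanitize=address` option is another way to locate the faulty access.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Owning buffer: the unique_ptr releases the memory exactly once when it
// goes out of scope, so there is no manual delete[] to call twice.
std::unique_ptr<int[]> makeBuffer(std::size_t n) {
    return std::unique_ptr<int[]>(new int[n]());  // value-initialized to 0
}

// Bounds-checked write: at() throws std::out_of_range for a bad index
// instead of silently writing past the end of the storage.
bool safeWrite(std::vector<int>& v, std::size_t index, int value) {
    try {
        v.at(index) = value;
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

With a three-element vector, `safeWrite(v, 3, 7)` returns false, whereas `v[3] = 7` would be undefined behavior of the kind that can later surface as "double free or corruption".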
3.2 Problem analysis: Unzip should be optimized
Benchmark from sebastiandev/zipper#120 (the author no longer maintains that repository, so the ticket was closed)
Zipper lib:
Zip "The Expanse S4": 206947.200 ms => 3 Minutes 26 Seconds (=-) / 1.2
Unzip "The Expanse S4": 116280.597 ms => 1 Minutes 56 Seconds (--) / 3.4
Zip hy_yj_zg_sc: 16780.517 ms => 16 Seconds (==) * 1.1
Unzip hy_yj_zg_sc: 94578.933 ms => 1 Minutes 34 Seconds (--) / 8.2

Linux zip tool:
Zip "The Expanse S4": 171211.721 ms => 2 Minutes 51 Seconds (=+)
Unzip "The Expanse S4": 33556.513 ms => 33 Seconds (++)
Zip hy_yj_zg_sc: 18668.122 ms => 18 Seconds (==)
Unzip hy_yj_zg_sc: 11414.059 ms => 11 Seconds (++)

The config: Ryzen 1800x, SSD, 16 GB RAM, only one core is running (Zipper lib and Linux zip tool)
I will add my own test results later...
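For anyone who wants to collect comparable numbers, a small timing helper is enough. The sketch below uses std::chrono::steady_clock; the names `elapsedMs` and `formatDuration` are hypothetical, and the formatter truncates to whole seconds, which matches how the figures in the benchmark above appear to be printed.

```cpp
#include <chrono>
#include <string>

// Run `fn` once and return the elapsed wall-clock time in milliseconds.
template <typename Fn>
double elapsedMs(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Format milliseconds as "M Minutes S Seconds", or "S Seconds" when the
// duration is under one minute. Fractions of a second are dropped.
std::string formatDuration(double ms) {
    long long totalSeconds = static_cast<long long>(ms / 1000.0);
    long long minutes = totalSeconds / 60;
    long long seconds = totalSeconds % 60;
    std::string out;
    if (minutes > 0) {
        out += std::to_string(minutes) + " Minutes ";
    }
    out += std::to_string(seconds) + " Seconds";
    return out;
}
```

Wrapping an extraction call in a lambda, for example `elapsedMs([&] { unzipper.extractAll(); })`, gives the millisecond value to pass to `formatDuration`; the `unzipper` object here stands for whatever instance the measured code already has.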
3.3 Problem analysis: failure when extracting to memory
Below is the wrapper function I wrote for extracting to memory:
bool AnalysisHelper::extract2MemoryFromZipFile(
    const std::string& zipFilePath,
    std::vector<std::string>& entry_list,
    std::unordered_map<std::string, std::vector<unsigned char>>& unzipped_entries_map)
{
    LOG_DEBUG << "AnalysisHelper::extract2MemoryFromZipFile() called ...";
    try {
        Unzipper unzipper(zipFilePath);
        std::vector<ZipEntry> entries = unzipper.entries();
        LOG_DEBUG << "extract2MemoryFromZipFile: ZipEntry size:" << entries.size();
        for (auto& it : entries) {
            LOG_DEBUG << "extract2MemoryFromZipFile: cur entry:" << it.name;
            entry_list.emplace_back(it.name);
            // Use a fresh buffer for each entry: extractToMemory (shown earlier)
            // appends to the vector, so a reused one would accumulate data.
            std::vector<unsigned char> unzipped_entry;
            bool res = unzipper.extractEntryToMemory(it.name, unzipped_entry);
            if (!res) {
                // Query and log the error only when extraction failed.
                std::error_code err = unzipper.error();
                LOG_ERROR << "extract2MemoryFromZipFile: got unzipper error:" << err.message();
                return false;
            }
            // emplace constructs the element in place; moving the buffer avoids a copy.
            // Structured bindings require C++17.
            const auto [iter, success] =
                unzipped_entries_map.emplace(it.name, std::move(unzipped_entry));
            LOG_DEBUG << "extract2MemoryFromZipFile: cur entry put into map: " << success
                      << ", iter:" << iter->first;
        }
        unzipper.close();
    } catch (std::runtime_error const& e) {
        LOG_ERROR << "Failed unzipping '" << zipFilePath
                  << "' Exception was: '" << e.what() << "'";
        return false;
    }
    LOG_DEBUG << "AnalysisHelper::extract2MemoryFromZipFile() called done!";
    return true;
}
Related topics to think about:
- map, unordered_map, and dictionary compared (performance, internals, etc.)
- the differences between emplace, insert, and try_emplace on a map
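On the second point, the behavior that matters most when storing large buffers is what happens if the key already exists. insert and emplace both keep the existing element, but emplace may still construct (and then discard) a new element from its arguments. try_emplace (C++17) guarantees that its arguments are left untouched in that case, and insert_or_assign (C++17) overwrites the old value. A minimal sketch, where `EntryMap`, `addIfAbsent` and `addOrReplace` are hypothetical names:

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using EntryMap = std::unordered_map<std::string, std::vector<unsigned char>>;

// try_emplace: inserts only if the key is absent. If the key exists, the
// element is unchanged and `bytes` is guaranteed not to be moved from.
bool addIfAbsent(EntryMap& m, const std::string& name,
                 std::vector<unsigned char>&& bytes) {
    return m.try_emplace(name, std::move(bytes)).second;
}

// insert_or_assign: always stores the new value. Returns true if a new
// element was inserted and false if an existing one was overwritten.
bool addOrReplace(EntryMap& m, const std::string& name,
                  std::vector<unsigned char>&& bytes) {
    return m.insert_or_assign(name, std::move(bytes)).second;
}
```

For a map of extracted zip entries, `addIfAbsent` keeps the first buffer seen for a given name, while `addOrReplace` keeps the last one.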
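I ran into a problem during testing:
The archive passed in contained 40 images. Extraction stopped abruptly at the 38th image, and no exception was caught inside this function (log below):
The problem is solved: the multithreading logic had a bug. Both extractEntryToMemory and extractAll have been tested and work fine.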