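apex installation summary
I recently used a library that depends on apex. It took a whole morning of fiddling to get it installed, so I am recording the process for whoever comes after.
Environment:
OS: Windows
Library: pytorch 1.9.0
CUDA version: 11.1
VS: 2019
A note on VS: apart from VS itself and the default recommended C++ components, I installed additional items on the fly when problems came up,
and did not restart the computer afterwards. In theory this should be unrelated to the apex install, but since it happened during the process I record it here as well.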
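1. CUDA version mismatch
The library recommends pytorch 1.7.1 with cuda=10.2. Installing according to the library's instructions produced a CUDA version mismatch error.
Opening "apex/setup.py" shows that the CUDA version torch was built with (torch_binary_major, torch_binary_minor) must match the version of the installed CUDA toolkit as reported by nvcc (bare_metal_major, bare_metal_minor):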
def get_cuda_bare_metal_version(cuda_dir):
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
    output = raw_output.split()
    release_idx = output.index("release") + 1
    release = output[release_idx].split(".")
    bare_metal_major = release[0]
    bare_metal_minor = release[1][0]

    return raw_output, bare_metal_major, bare_metal_minor


def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    raw_output, bare_metal_major, bare_metal_minor = get_cuda_bare_metal_version(cuda_dir)
    torch_binary_major = torch.version.cuda.split(".")[0]
    torch_binary_minor = torch.version.cuda.split(".")[1]

    print("\nCompiling cuda extensions with")
    print(raw_output + "from " + cuda_dir + "/bin\n")

    if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
        raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
                           "not match the version used to compile Pytorch binaries. " +
                           "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
                           "In some cases, a minor-version mismatch will not cause later errors: " +
                           "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
                           "You can try commenting out this check (at your own risk).")
Fix: make one of CUDA and pytorch match the other. Also, according to setup.py, the CUDA version must be greater than 10.0.
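To see the mismatch before starting a build, the comparison that setup.py performs can be reproduced on plain strings. The following is a minimal sketch, not apex's own code: the function names parse_nvcc_release and cuda_versions_match are hypothetical, and the sample nvcc text (including its build number) is illustrative only. Unlike the excerpt above, it compares the whole minor version rather than only its first character.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) as strings from the text printed by `nvcc -V`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' field found in nvcc output")
    return match.group(1), match.group(2)


def cuda_versions_match(nvcc_output, torch_cuda_version):
    """Compare nvcc's release with a torch.version.cuda style string such as '11.1'."""
    toolkit = parse_nvcc_release(nvcc_output)
    torch_major, torch_minor = torch_cuda_version.split(".")[:2]
    return toolkit == (torch_major, torch_minor)


if __name__ == "__main__":
    # Illustrative text in the shape nvcc prints; not captured from a real run.
    sample = "Cuda compilation tools, release 11.1, V11.1.105"
    print(cuda_versions_match(sample, "11.1"))
    print(cuda_versions_match(sample, "10.2"))
```

In a real environment the first argument would come from running nvcc -V and the second from torch.version.cuda, the two values the setup.py check reads.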
Final choice
Python: 3.7
pytorch install command: ""
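2. Installing nvcc
In a cmd window with the environment activated, entering "nvcc -V" reported that it is not a recognized command.
Reinstall CUDA 11.1: choose the custom install, deselect everything else, tick nvcc, and install.
Next, add the nvcc directory to the system Path. Then refresh Path in the current session using a command found online (a program was running and I did not want to restart the computer).
Entering "nvcc -V" in the cmd window now gives normal output.
This is probably where the pitfall was planted: I did not restart after installing, which may be why the later installs kept failing until I finally restarted.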
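A quick way to confirm what a given process actually sees is to query its own search path. The sketch below is an assumption-laden helper, not part of any CUDA tooling: diagnose_nvcc is a hypothetical name, and it assumes the CUDA installer has set a CUDA_PATH environment variable pointing at the toolkit root.

```python
import os
import shutil


def diagnose_nvcc(search_path=None, cuda_path=None):
    """Report whether nvcc is reachable and, if not, which directory to add."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    if cuda_path is None:
        # Assumption: the CUDA installer exported CUDA_PATH (toolkit root).
        cuda_path = os.environ.get("CUDA_PATH")
    found = shutil.which("nvcc", path=search_path)
    report = {"nvcc_on_path": found, "cuda_path": cuda_path, "suggested_dir": None}
    if found is None and cuda_path:
        # nvcc lives in the toolkit's bin directory.
        report["suggested_dir"] = os.path.join(cuda_path, "bin")
    return report


if __name__ == "__main__":
    for key, value in diagnose_nvcc().items():
        print(key, "=", value)
```

Running it from the same terminal that will run the apex build shows whether that terminal inherited the updated Path or still holds the old one.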
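3. The "Given no hashes to check XXX links for project 'pip': discarding no candidates" error
The install stayed stuck at this message.
1) First, open "apex/requirements.txt" and "apex/requirements_dev.txt", compare them against conda list, and install any missing libraries.
2) Next, "https://blog.csdn.net/qq_33019383/article/details/103990248" says torch-scatter must be installed, so I installed it.
3) Advice found online says to delete the previously downloaded "C:\Users\Administrator\apex" folder and run the following commands again: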
git clone https://www.github.com/nvidia/apex
cd apex
python3 setup.py install
Unfortunately, none of the above worked.
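The comparison in step 1) can be done mechanically instead of by eye. This is a rough sketch under stated assumptions: the helper names are hypothetical, only the leading package name of each requirement line is examined (version constraints are ignored), and the package names in the example are placeholders rather than the actual contents of apex's requirements files.

```python
import re


def normalize(name):
    """Lowercase a distribution name and unify '-', '_' and '.' separators."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_names(lines):
    """Extract bare package names from requirements.txt style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as "-r other.txt"
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(normalize(match.group(0)))
    return names


def missing_packages(requirement_lines, installed):
    """Return required names that do not appear among the installed names."""
    have = {normalize(name) for name in installed}
    return [name for name in requirement_names(requirement_lines) if name not in have]


if __name__ == "__main__":
    # Placeholder data; real input would be the lines of requirements.txt
    # and the package names shown by conda list or pip list.
    wanted = ["numpy>=1.15", "PyYAML", "cxxfilt  # example entry"]
    print(missing_packages(wanted, ["NumPy", "pyyaml"]))
```

On Python 3.8 or later the installed names could be collected with importlib.metadata.distributions(); the Python 3.7 environment described here would need another source such as the output of pip list.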
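4. Final solution
Restart the computer. Because the library mentioned earlier also depends on other packages, I installed those while I was at it: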
pip install opencv-python==4.4.0.46 termcolor==1.1.0 yacs==0.1.8 diffdist
Then run:
cd apex
python3 setup.py install
There were warnings, but the install reported success.
torch.__version__ = 1.9.0
setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
running install
running bdist_egg
running egg_info
writing apex.egg-info\PKG-INFO
writing dependency_links to apex.egg-info\dependency_links.txt
writing top-level names to apex.egg-info\top_level.txt
reading manifest file 'apex.egg-info\SOURCES.txt'
writing manifest file 'apex.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
creating build\lib
creating build\lib\apex
copying apex\__init__.py -> build\lib\apex
creating build\lib\apex\amp
copying apex\amp\amp.py -> build\lib\apex\amp
copying apex\amp\compat.py -> build\lib\apex\amp
……
copying build\lib\apex\pyprof\nvtx\__init__.py -> build\bdist.win-amd64\egg\apex\pyprof\nvtx
creating build\bdist.win-amd64\egg\apex\pyprof\parse
copying build\lib\apex\pyprof\parse\db.py -> build\bdist.win-amd64\egg\apex\pyprof\parse
……
copying build\lib\apex\RNN\__init__.py -> build\bdist.win-amd64\egg\apex\RNN
copying build\lib\apex\__init__.py -> build\bdist.win-amd64\egg\apex
byte-compiling build\bdist.win-amd64\egg\apex\amp\amp.py to amp.cpython-37.pyc
……
byte-compiling build\bdist.win-amd64\egg\apex\RNN\RNNBackend.py to RNNBackend.cpython-37.pyc
byte-compiling build\bdist.win-amd64\egg\apex\RNN\__init__.py to __init__.cpython-37.pyc
byte-compiling build\bdist.win-amd64\egg\apex\__init__.py to __init__.cpython-37.pyc
creating build\bdist.win-amd64\egg\EGG-INFO
copying apex.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying apex.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying apex.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying apex.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
zip_safe flag not set; analyzing archive contents...
apex.pyprof.nvtx.__pycache__.nvmarker.cpython-37: module references __file__
apex.pyprof.nvtx.__pycache__.nvmarker.cpython-37: module references __path__
creating dist
creating 'dist\apex-0.1-py3.7.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing apex-0.1-py3.7.egg
creating c:\programdata\anaconda3\envs\XXXX\lib\site-packages\apex-0.1-py3.7.egg
Extracting apex-0.1-py3.7.egg to c:\programdata\anaconda3\envs\XXXX\lib\site-packages
Adding apex 0.1 to easy-install.pth file
Installed c:\programdata\anaconda3\envs\XXXX\lib\site-packages\apex-0.1-py3.7.egg
Processing dependencies for apex==0.1
Finished processing dependencies for apex==0.1
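5. Follow-up
1)
Later I found that the statement that configures the precision setting raised an error, so the install had not actually succeeded.
Running the command again: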
python setup.py install
The command ran and simply returned to a new prompt line with no output at all.
I switched to:
python setup.py build
pip install -v --no-cache-dir
Output:
torch.__version__ = 1.9.0
setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
running bdist_wheel
running build
running build_py
installing to build\bdist.win-amd64\wheel
running install
running install_lib
………………………………………………………………………………………………………………………………………………
adding 'apex-0.1.dist-info/WHEEL'
adding 'apex-0.1.dist-info/top_level.txt'
adding 'apex-0.1.dist-info/RECORD'
removing build\bdist.win-amd64\wheel
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\pytorch1.8.1\lib\site-packages\colorama\ansitowin32.py", line 59, in closed
return stream.closed
ValueError: underlying buffer has been detached
done
Created wheel for apex: filename=apex-0.1-py3-none-any.whl size=206058 sha256=8761f64146164553df82742b07c5ef2cfe9da3a82a636b9457483cb95a9544ba
Stored in directory: C:\Users\Administrator\AppData\Local\Temp\pip-ephem-wheel-cache-8l21lyri\wheels\17\e2\d0\fbd642567ec1ec2e05aa8db3ae5d45c586c0f909da3f40de6e
Successfully built apex
Installing collected packages: apex
Successfully installed apex-0.1
1 location(s) to search for versions of pip:
* https://pypi.org/simple/pip/
Fetching project page and analyzing links: https://pypi.org/simple/pip/
Getting page https://pypi.org/simple/pip/
Found index url https://pypi.org/simple
Starting new HTTPS connection (1): pypi.org:443
https://pypi.org:443 "GET /simple/pip/ HTTP/1.1" 200 16538
……………………………………………………………………………………………………………………………………………………………………
Found link https://files.pythonhosted.org/packages/b1/44/6e26d5296b83c6aac166e48470d57a00d3ed1f642e89adc4a4e412a01643/pip-21.1.2.tar.gz#sha256=eb5df6b9ab0af50fe1098a52fd439b04730b6e066887ff7497357b9ebd19f79b (from https://pypi.org/simple/pip/) (requires-python:>=3.6), version: 21.1.2
Skipping link: not a file: https://pypi.org/simple/pip/
Given no hashes to check 167 links for project 'pip': discarding no candidates
Removed build tracker: 'C:\\Users\\Administrator\\AppData\\Local\\Temp\\pip-req-tracker-hs8z7jdp'
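"Successfully installed apex-0.1" indicates that the install succeeded. Note, however, that this command does not build the CUDA and C++ extensions. As soon as code touches the parts that rely on them, problems appear.
For example, running the Swin Transformer example raises a warning that "amp_C" cannot be found. As a knock-on effect, the statement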
torch.distributed.init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank)
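raises a warning when executed and in fact fails, so distributed initialization is never completed. As a result, all later code related to distributed training has to be commented out by hand (the sampler, and the per-epoch setting during training).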
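Because "Successfully installed" says nothing about the compiled parts, it may help to probe for them directly after installing. The sketch below uses only the standard library; missing_extensions is a hypothetical helper name, and "amp_C" is the module named in the warning above. It only checks that the module can be located, not that it loads correctly.

```python
import importlib.util


def missing_extensions(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "amp_C" is the compiled module the Swin Transformer warning complains about.
    missing = missing_extensions(["amp_C"])
    if missing:
        print("apex compiled extensions not found:", ", ".join(missing))
    else:
        print("apex compiled extensions look importable")
```

If the module is reported missing, the apex README of that period documents building the extensions by passing --global-option="--cpp_ext" --global-option="--cuda_ext" to the pip install command. Separately, PyTorch's distributed documentation states that Windows supports every collective backend except NCCL, so the backend='nccl' call may fail on this system for reasons that are not caused by apex.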
2)
For other installation methods, see codebrid's "apex install/usage notes".
For testing, see "apex install/usage notes".