FastMoE Installation
Component | Ubuntu | CUDA | PyTorch | NCCL |
---|---|---|---|---|
Version | 18.04 | 10.2 | torch-1.8.0-cp37-cp37m | 2.7.8 |
1. Set up the virtual environment
# Create the virtual environment
(base) root@9fd4db53dc92:~# conda create -n torch-1.8-cu102-py37 python=3.7
# Activate the virtual environment
(base) root@9fd4db53dc92:~# conda activate torch-1.8-cu102-py37
2. Install NCCL
2.1 Check which NCCL version this PyTorch build supports
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# ipython
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:21)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.34.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import torch
In [2]: torch.cuda.nccl.version()
Out[2]: 2708
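The integer 2708 corresponds to NCCL 2.7.8, which is why section 2.2 installs that exact version. A minimal sketch of the decoding follows. It assumes the packing used by the NCCL_VERSION macro in nccl.h (major*1000 + minor*100 + patch up to 2.8, major*10000 + minor*100 + patch from 2.9 on); the helper name decode_nccl_version is made up for this example and is not part of torch.

```python
from typing import Tuple


def decode_nccl_version(code: int) -> Tuple[int, int, int]:
    """Split an integer NCCL version code into (major, minor, patch).

    Assumption: codes below 10000 use the older packing
    (major*1000 + minor*100 + patch); larger codes use the newer
    packing (major*10000 + minor*100 + patch).
    """
    if code < 10000:
        major, rest = divmod(code, 1000)
    else:
        major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# The value reported by torch.cuda.nccl.version() in the session above.
print(decode_nccl_version(2708))  # (2, 7, 8)
```

Newer PyTorch releases may already return a tuple from torch.cuda.nccl.version(), in which case no decoding is needed.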
2.2 Install NCCL 2.7.8
NCCL download link: https://developer.nvidia.com/nccl/nccl-legacy-downloads
2.2.1 Install the dependency packages
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# apt-get install -y gnupg2
2.2.2 Add the signing key
# Download the key
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
# Add the key
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# apt-key add 7fa2af80.pub
OK
2.2.3 Install the repository package
# Download the repository package
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
# Install the repository package
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
2.2.4 Update the package index
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# apt-get update
2.2.5 Install libnccl2 and libnccl-dev
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# apt install libnccl2=2.7.8-1+cuda10.2 libnccl-dev=2.7.8-1+cuda10.2
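The version pins in the command above combine the NCCL version with the CUDA version. The sketch below shows how the two strings are assembled, which can help when adapting these notes to another NCCL/CUDA pair. The helper name nccl_apt_pins is hypothetical, and the "-1" package revision is simply copied from the command above; whether a given combination actually exists in the repository still has to be checked, for example with apt-cache policy libnccl2.

```python
from typing import List


def nccl_apt_pins(nccl_version: str, cuda_version: str, revision: str = "1") -> List[str]:
    """Build "package=version" pins for libnccl2 and libnccl-dev.

    Assumption: the package version has the form
    <nccl>-<revision>+cuda<cuda>, as in the apt command above.
    """
    suffix = f"{nccl_version}-{revision}+cuda{cuda_version}"
    return [f"{pkg}={suffix}" for pkg in ("libnccl2", "libnccl-dev")]


# Reproduces the arguments passed to "apt install" in this section.
print(" ".join(nccl_apt_pins("2.7.8", "10.2")))
```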
2.2.6 Configure the environment variables
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# vim ~/.bashrc
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# tail -3 ~/.bashrc
export PATH=$PATH:/usr/local/cuda-10.2/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64
export CUDA_HOME=/usr/local/cuda-10.2
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# source ~/.bashrc
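Before building, it can be worth confirming that the three variables really are visible to the current process. The following is a small sketch of such a check, not part of FastMoE or CUDA; the function name check_cuda_env is made up, and the default path /usr/local/cuda-10.2 is taken from the .bashrc lines above.

```python
import os
from typing import List, Mapping


def check_cuda_env(env: Mapping[str, str], cuda_home: str = "/usr/local/cuda-10.2") -> List[str]:
    """Return a list of problems found in the CUDA-related variables."""
    problems = []
    if env.get("CUDA_HOME") != cuda_home:
        problems.append(f"CUDA_HOME is {env.get('CUDA_HOME')!r}, expected {cuda_home!r}")
    # Both variables are colon-separated lists on Linux.
    if f"{cuda_home}/bin" not in env.get("PATH", "").split(":"):
        problems.append(f"{cuda_home}/bin is not on PATH")
    if f"{cuda_home}/lib64" not in env.get("LD_LIBRARY_PATH", "").split(":"):
        problems.append(f"{cuda_home}/lib64 is not on LD_LIBRARY_PATH")
    return problems


# Check the environment of the current shell session.
for problem in check_cuda_env(os.environ):
    print("WARNING:", problem)
```

An empty result means all three settings match; each warning names the variable that still needs fixing.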
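3. Build FastMoE 1.0
3.1 Download the source from a release
The repository's default branch is master. I installed from the 1.0 release source instead, since it should be more stable.
3.2 Pitfalls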
cuda/fastermoe/smart_schedule.h: In function ‘void fmoe_cuda_fused_forward_impl(pybind11::function, pybind11::function, pybind11::function, c10::Device, std::vector<at::Tensor>, scalar_t*, scalar_t*, scalar_t*, scalar_t*, const long int*, const long int*, const bool*, long int, long int, long int, long int, long int, long int, CudaStreamManager*)’:
cuda/fastermoe/smart_schedule.h:169:61: error: too few arguments to function ‘cudaError_t cudaStreamWaitEvent(cudaStream_t, cudaEvent_t, unsigned int)’
cudaStreamWaitEvent(smgr->stream(1), evt_get);
cuda/fastermoe/smart_schedule.h:39:9: note: in definition of macro ‘GEN_IDX’
int gidx_recv = ei * world_size + rank_recv; \
^~~~~~~~~
error: command '/usr/bin/gcc' failed with exit code 1
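The official documentation says NCCL is not enabled by default, but in practice the build enables it.
The compiler message shows that smart_schedule.h calls cudaStreamWaitEvent with two arguments, while the declaration in this CUDA 10.2 toolkit takes three.
The build only succeeded for me with NCCL disabled; with it enabled, compilation fails with the error above.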
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# export USE_NCCL=0
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# cd moe_5.31/
(torch-1.8-cu102-py37) root@9fd4db53dc92:~/moe_5.31# python setup.py install
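The same build can be driven from Python, which keeps the USE_NCCL=0 setting scoped to the build process instead of the whole shell session. This is only a sketch of the commands above: the helper names are invented for this example, and the directory moe_5.31 is the source folder used in these notes.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Tuple


def fastmoe_build_command(base_env: Mapping[str, str], use_nccl: bool = False) -> Tuple[List[str], Dict[str, str]]:
    """Return the command and environment for "python setup.py install"."""
    env = dict(base_env)
    # USE_NCCL is the environment variable exported in the shell session above.
    env["USE_NCCL"] = "1" if use_nccl else "0"
    cmd = [sys.executable, "setup.py", "install"]
    return cmd, env


def build_fastmoe(source_dir: str) -> None:
    """Run the build inside source_dir with NCCL disabled."""
    cmd, env = fastmoe_build_command(os.environ)
    subprocess.run(cmd, cwd=source_dir, env=env, check=True)
```

Calling build_fastmoe("moe_5.31") from the home directory would then mirror the three shell commands above.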
3.3 Verification
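One simple way to verify the installation is to check that the installed package can be found by the interpreter of the virtual environment. The sketch below assumes FastMoE installs under the module name fmoe; the helper check_modules is made up for this example, and it only looks the modules up without importing them, so it does not exercise the CUDA extension itself.

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str] = ("torch", "fmoe")) -> Dict[str, bool]:
    """Report which top-level modules the current interpreter can find."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_modules().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If both modules are reported as found, running import fmoe in the same environment is a reasonable next check.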
4. Appendix
The following links were useful while troubleshooting the build:
https://linuxtect.com/the-error-command-gcc-failed-with-exit-status-1-error-and-solution/
https://www.cnblogs.com/gerrydeng/p/7159021.html
https://itsmycode.com/error-command-errored-out-with-exit-status-1/