FastMoE Installation

Name      ubuntu   cuda   torch                    nccl
Version   18.04    10.2   torch-1.8.0-cp37-cp37m   2.7.8

1. Create the virtual environment

# Create the virtual environment
(base) root@9fd4db53dc92:~# conda create -n torch-1.8-cu102-py37 python=3.7

# Activate the virtual environment
(base) root@9fd4db53dc92:~# conda activate torch-1.8-cu102-py37

2. Install NCCL

2.1 Check which NCCL version this PyTorch build supports

(torch-1.8-cu102-py37) root@9fd4db53dc92:~# ipython
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:21) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.34.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: torch.cuda.nccl.version()
Out[2]: 2708
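
The integer 2708 stands for NCCL 2.7.8. As a minimal sketch, assuming the version-code packing used by the NCCL headers (major*1000 + minor*100 + patch for older releases, major*10000 + minor*100 + patch for newer ones), the helper below decodes such a code. The name decode_nccl_version is hypothetical and is not part of PyTorch or NCCL.

```python
def decode_nccl_version(code: int) -> tuple:
    """Decode an integer NCCL version code into (major, minor, patch)."""
    # Codes below 10000 use the older packing: major*1000 + minor*100 + patch.
    # Larger codes use the newer packing: major*10000 + minor*100 + patch.
    scale = 1000 if code < 10000 else 10000
    major = code // scale
    minor = (code % scale) // 100
    patch = code % 100
    return major, minor, patch


# 2708 is the value reported by torch.cuda.nccl.version() in the session above.
print(decode_nccl_version(2708))  # → (2, 7, 8)
```

The decoded version is the one to pick on the NCCL download page in the next step.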

2.2 Install NCCL 2.7.8

NCCL download link: https://developer.nvidia.com/nccl/nccl-legacy-downloads

2.2.1 Install dependencies

(torch-1.8-cu102-py37) root@9fd4db53dc92:~# apt-get install -y gnupg2

2.2.2 Add the key

# Download the key
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub

# Add the key
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# apt-key add 7fa2af80.pub
OK

2.2.3 Install the repository package

# Download the repository package
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

# Install the repository package
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

2.2.4 Update the package index

(torch-1.8-cu102-py37) root@9fd4db53dc92:~# apt-get update

2.2.5 Install libnccl2 and libnccl-dev

(torch-1.8-cu102-py37) root@9fd4db53dc92:~# apt install libnccl2=2.7.8-1+cuda10.2 libnccl-dev=2.7.8-1+cuda10.2
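
Both packages are pinned to the same string, which combines the NCCL version reported by PyTorch with the installed CUDA version. The sketch below rebuilds that pin from its parts. The helper name apt_pins and the default package revision "1" are assumptions taken from the pattern in the command above, not from any NVIDIA tool.

```python
def apt_pins(nccl_version: str, cuda_version: str, revision: str = "1") -> list:
    """Build pinned apt package specs for the NCCL runtime and dev packages."""
    # Pattern seen in the command above: <nccl>-<revision>+cuda<cuda>
    suffix = f"{nccl_version}-{revision}+cuda{cuda_version}"
    return [f"libnccl2={suffix}", f"libnccl-dev={suffix}"]


print(" ".join(["apt", "install"] + apt_pins("2.7.8", "10.2")))
# → apt install libnccl2=2.7.8-1+cuda10.2 libnccl-dev=2.7.8-1+cuda10.2
```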

2.2.6 Configure environment variables

(torch-1.8-cu102-py37) root@9fd4db53dc92:~# vim ~/.bashrc
(torch-1.8-cu102-py37) root@9fd4db53dc92:~# tail -3 ~/.bashrc
export PATH=$PATH:/usr/local/cuda-10.2/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64
export CUDA_HOME=/usr/local/cuda-10.2

(torch-1.8-cu102-py37) root@9fd4db53dc92:~# source ~/.bashrc
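
Before building, it can help to confirm that the three variables agree with each other. The following is a minimal sketch of such a check. The function name check_cuda_env is hypothetical, and the bin and lib64 subdirectories are assumed from the exports shown above.

```python
import os


def check_cuda_env(env) -> list:
    """Return a list of problems found in the CUDA-related environment variables."""
    problems = []
    cuda_home = env.get("CUDA_HOME", "").rstrip("/")
    if not cuda_home:
        return ["CUDA_HOME is not set"]
    # Linux-style ':' separators, matching the exports in ~/.bashrc above.
    path_dirs = env.get("PATH", "").split(":")
    lib_dirs = env.get("LD_LIBRARY_PATH", "").split(":")
    if f"{cuda_home}/bin" not in path_dirs:
        problems.append("PATH does not contain $CUDA_HOME/bin")
    if f"{cuda_home}/lib64" not in lib_dirs:
        problems.append("LD_LIBRARY_PATH does not contain $CUDA_HOME/lib64")
    return problems


# Check the current shell environment and print any problems found.
for line in check_cuda_env(os.environ) or ["CUDA environment looks consistent"]:
    print(line)
```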

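3. Build FastMoE 1.0

3.1 Download the source code from a release

The repository's default branch is master. I installed the 1.0 code from the releases instead, since it should be more stable.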

3.2 Troubleshooting

With NCCL enabled, the build fails with the following error:

cuda/fastermoe/smart_schedule.h: In function ‘void fmoe_cuda_fused_forward_impl(pybind11::function, pybind11::function, pybind11::function, c10::Device, std::vector<at::Tensor>, scalar_t*, scalar_t*, scalar_t*, scalar_t*, const long int*, const long int*, const bool*, long int, long int, long int, long int, long int, long int, CudaStreamManager*)’:
cuda/fastermoe/smart_schedule.h:169:61: error: too few arguments to function ‘cudaError_t cudaStreamWaitEvent(cudaStream_t, cudaEvent_t, unsigned int)’
                 cudaStreamWaitEvent(smgr->stream(1), evt_get);
cuda/fastermoe/smart_schedule.h:39:9: note: in definition of macro ‘GEN_IDX’
     int gidx_recv = ei * world_size + rank_recv; \
         ^~~~~~~~~
error: command '/usr/bin/gcc' failed with exit code 1

The official documentation says NCCL support is not enabled by default, but in practice the build enables it.

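The build only succeeds with NCCL disabled. Otherwise it fails with the error above: the call at smart_schedule.h:169 passes two arguments to cudaStreamWaitEvent, while the declaration the compiler sees takes three.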

(torch-1.8-cu102-py37) root@9fd4db53dc92:~# export USE_NCCL=0

(torch-1.8-cu102-py37) root@9fd4db53dc92:~# cd moe_5.31/
(torch-1.8-cu102-py37) root@9fd4db53dc92:~/moe_5.31# python setup.py install
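
The same build can be scripted so that USE_NCCL is set only for the build process rather than for the whole shell. This is a minimal sketch: the helper names build_env and install_fastmoe are hypothetical, and the value "1" for the enabled case is an assumption, since the steps above only show USE_NCCL=0.

```python
import os
import subprocess
import sys


def build_env(base_env, use_nccl: bool = False) -> dict:
    """Copy the environment and set USE_NCCL for the FastMoE build."""
    env = dict(base_env)  # copy, so the caller's environment is left untouched
    env["USE_NCCL"] = "1" if use_nccl else "0"  # "1" is an assumed value
    return env


def install_fastmoe(source_dir: str, use_nccl: bool = False):
    """Run `python setup.py install` inside source_dir with USE_NCCL set."""
    return subprocess.run(
        [sys.executable, "setup.py", "install"],
        cwd=source_dir,
        env=build_env(os.environ, use_nccl),
        check=True,  # raise CalledProcessError if the build fails
    )
```

Calling install_fastmoe("moe_5.31") from the home directory would correspond to the two commands above.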

3.3 Verification

(Screenshot: verification result)
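
One quick way to check the installation from Python is to ask the import system whether the package can be found. This is a minimal sketch that assumes FastMoE installs a top-level module named fmoe. That module name does not appear in the steps above, and the helper is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "fmoe" is the assumed import name of the FastMoE package.
print("fmoe installed:", is_installed("fmoe"))
```

Run it inside the torch-1.8-cu102-py37 environment, since the package is installed only there.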

4. Appendix

The following links were consulted while troubleshooting the build:

https://stackoverflow.com/questions/11094718/error-command-gcc-failed-with-exit-status-1-while-installing-eventlet

https://linuxtect.com/the-error-command-gcc-failed-with-exit-status-1-error-and-solution/

https://www.cnblogs.com/gerrydeng/p/7159021.html

https://itsmycode.com/error-command-errored-out-with-exit-status-1/

https://qiita.com/picato1123/items/eb8405b1c98de06e628e

https://computerverge.com/error-message-error-command-gcc-failed-with-exit-status-1-causes-and-fixes/