Dalian AI Computing Platform / Huawei Ascend AI Platform / Heterogeneous Computing in HPC / Mixed CPU and GPU Computing

Good news: there is funding again and the account is active, so I can keep playing with the supercomputer.

==========================================================

Installing PyTorch on the supercomputing platform:

Run the following (clearing REQUESTS_CA_BUNDLE and CURL_CA_BUNDLE works around SSL certificate verification failures during the download):

export REQUESTS_CA_BUNDLE=

export CURL_CA_BUNDLE=

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Job submission script:

submit_python.sh

/opt/batch/cli/bin/dsub  -n task_test -A xxxxxxxxxx --priority 9999 --job_retry 10 --job_type hmpi -R "cpu=1;gpu=1;mem=128" -N 1  -eo error.txt -oo output.txt /home/share/xxxxxxxxxxxx/home/xxxxxxx/xxxxxxx/run_python.sh
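When several variants of this command are needed, the flags can be assembled programmatically. The sketch below is a hypothetical helper (the function name `build_dsub` and its defaults are made up; only the flag names come from the command above):

```python
import shlex

def build_dsub(name, account, script, resources, nodes=1):
    """Assemble a dsub command from a resource dict like {"cpu": 1, "gpu": 1, "mem": 128}."""
    res = ";".join(f"{k}={v}" for k, v in resources.items())
    return ["/opt/batch/cli/bin/dsub", "-n", name, "-A", account,
            "-R", res, "-N", str(nodes),
            "-eo", "error.txt", "-oo", "output.txt", script]

cmd = build_dsub("task_test", "myaccount", "run_python.sh",
                 {"cpu": 1, "gpu": 1, "mem": 128})
print(" ".join(shlex.quote(c) for c in cmd))
```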

MPI launch script:

run_python.sh

#!/bin/sh
echo ----- print env vars -----

if [ "${CCS_ALLOC_FILE}" != "" ]; then
    echo "   "
    ls -la ${CCS_ALLOC_FILE}
    echo ------ cat ${CCS_ALLOC_FILE}
    cat ${CCS_ALLOC_FILE}
fi

export HOSTFILE=/tmp/hostfile.$$
rm -rf $HOSTFILE
touch $HOSTFILE

# parse CCS_ALLOC_FILE
## node name,  cores, tasks, task_list
#  hpcbuild002 8 1 container_22_default_00001_e01_000002
#  hpctest005 8 1 container_22_default_00000_e01_000001

ntask=`cat ${CCS_ALLOC_FILE} | awk -v fff="$HOSTFILE" '
{
    split($0, a, " ")
    if (length(a[1]) >0 && length(a[3]) >0) {
        print a[1]" slots="a[2] >> fff
        total_task+=a[3]
    }
}END{print total_task}'`

echo "openmpi hostfile $HOSTFILE generated:"
echo "-----------------------"
cat $HOSTFILE
echo "-----------------------"
echo "Total tasks is $ntask"
echo "mpirun -hostfile $HOSTFILE -n $ntask <your application>"

#start a simple mpi program
#/usr/local/bin/mpirun -hostfile $HOSTFILE -n $ntask hostname

/home/HPCBase/HMPI/hmpi/bin/mpirun -hostfile $HOSTFILE -np $ntask  --mca plm_rsh_agent /opt/batch/agent/tools/dstart /home/share/xxxxxxxxx/home/xxxxxxxx/anaconda3/bin/python /home/share/xxxxxxxxx/home/xxxxxx/xxxxxxx/hello.py
ret=$?

rm -rf $HOSTFILE
exit $ret
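The awk pipeline above can be mirrored in Python, which makes the parsing logic easier to read. This is just a sketch: `parse_alloc_file` is a made-up name, and the input format is taken from the comment in the script:

```python
def parse_alloc_file(text):
    """Turn CCS_ALLOC_FILE lines of 'node cores tasks task_list' into
    OpenMPI hostfile lines plus the total task count."""
    hostfile, total = [], 0
    for raw in text.splitlines():
        parts = raw.split()
        if len(parts) >= 3:  # same guard as the awk: node and task fields present
            hostfile.append(f"{parts[0]} slots={parts[1]}")
            total += int(parts[2])
    return hostfile, total

sample = """hpcbuild002 8 1 container_22_default_00001_e01_000002
hpctest005 8 1 container_22_default_00000_e01_000001"""
lines, ntask = parse_alloc_file(sample)
print(lines)   # ['hpcbuild002 slots=8', 'hpctest005 slots=8']
print(ntask)   # 2
```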

The Python program (hello.py):

import mpi4py.MPI as MPI
import sys
import socket
import numpy as np
import torch


def func1(queue, num):
    import time
    # time.sleep(num)
    # time.sleep(180)
    """
    x = np.random.rand(100)
    for _ in range(2000000):
        x += np.random.rand(100)
    num += np.sum(x)
    """
    x=torch.randn(10000000, device="cuda:0")
    for _ in range(200000):
        x+=torch.randn(10000000, device="cuda:0")

    # queue.put(num)
    queue.put(torch.sum(x).item())


def run_queue():
    from multiprocessing import Process, Queue

    ps = 1

    queue = Queue(maxsize=200)  # result queue shared with the worker processes

    process = [Process(target=func1, args=(queue, num)) for num in range(ps)]
    [p.start() for p in process]
    [p.join() for p in process]
    return [queue.get() for p in process]

 
comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()
node_name = MPI.Get_processor_name()
# node_name = socket.gethostname()
 
# point to point communication
data_send = [comm_rank]*1

comm.send(data_send,dest=(comm_rank+1)%comm_size)

res = run_queue() ###

data_recv =comm.recv(source=(comm_rank-1)%comm_size)

# print("my rank is %d, and Ireceived:" % comm_rank, data_recv, file=sys.stdout, flush=True)
# print(data_recv)

with open("/home/share/xxxxxxxx/home/xxxxxx/xxxxxx/results/{}.txt".format(comm_rank, ), "w") as f:
    f.write("my rank is %d/%d, and node_name: %s Ireceived:" % (comm_rank, comm_size, node_name) + str(data_recv) + str(res) + "\n" )

The run fails with:

raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Monitoring during the run: (screenshots not preserved)

The Job above failed, so try again:

Submit to the cluster:

/opt/batch/cli/bin/dsub   -n task_test -A xxxxxxxxxxxx -eo error.txt -oo output.txt -R "gpu=1" /usr/bin/nvidia-smi 

 

Thu Jul  6 12:11:58 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:02:00.0 Off |                    0 |
| N/A   29C    P0    36W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
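When nvidia-smi output like this ends up in job logs, the driver and CUDA runtime versions can be pulled out with a small stdlib sketch (`smi_versions` is a hypothetical helper, not part of the platform tooling):

```python
import re

def smi_versions(smi_text):
    """Extract (driver_version, cuda_version) from nvidia-smi header text."""
    m = re.search(r"Driver Version:\s*([\d.]+)\s*CUDA Version:\s*([\d.]+)", smi_text)
    return m.groups() if m else (None, None)

header = "| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |"
print(smi_versions(header))  # ('470.57.02', '11.4')
```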

Submit again, this time to locate the cuDNN libraries:

/opt/batch/cli/bin/dsub   -n task_test -A xxxxxxxxxxxxxx -eo error.txt -oo output.txt -R "cpu=1"  find / -name libcudnn*

Output:

/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_train.so
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train_static_v8.a
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_infer.so
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_infer.so.8.2.0
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer.so
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer.so.8
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_infer.so.8
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_infer.so.8.2.0
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn.so.8.2.0
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_static.a
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_static_v8.a
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer.so.8.2.0
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer_static.a
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_train.so.8
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer_static_v8.a
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn.so.8
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train.so
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn.so
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_infer.so
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train.so.8
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train.so.8.2.0
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train_static.a
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_train.so
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_train.so.8
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_train.so.8.2.0
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_train.so.8.2.0
/home/HPCBase/PACKAGE/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_infer.so.8
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_train.so
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train_static_v8.a
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_infer.so
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_infer.so.8.2.0
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer.so
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer.so.8
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_infer.so.8
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_infer.so.8.2.0
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn.so.8.2.0
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_static.a
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_static_v8.a
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer.so.8.2.0
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer_static.a
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_train.so.8
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_infer_static_v8.a
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn.so.8
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train.so
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn.so
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_infer.so
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train.so.8
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train.so.8.2.0
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_cnn_train_static.a
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_train.so
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_train.so.8
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_train.so.8.2.0
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_ops_train.so.8.2.0
/home/HPCBase/Application/GROMACS/cuda-11.3/targets/sbsa-linux/lib/libcudnn_adv_infer.so.8

From this output we can tell that the CUDA builds on this Huawei platform are the arm64 sbsa variants, i.e. the compute nodes are ARM (aarch64) servers:
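The `sbsa-linux` path component refers to the ARM Server Base System Architecture, so any wheels or native libraries must be aarch64 builds rather than the usual x86_64 ones. A quick way to confirm from Python (`needs_sbsa_build` is a made-up helper):

```python
import platform

def needs_sbsa_build(machine=None):
    """True when the node is an ARM server, so CUDA/cuDNN must be the
    arm64 sbsa ('sbsa-linux') builds instead of x86_64 ones."""
    m = machine or platform.machine()
    return m in ("aarch64", "arm64")

print(needs_sbsa_build("aarch64"))  # True
print(needs_sbsa_build("x86_64"))   # False
```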

 

 ============================================================

 

Another test:

/opt/batch/cli/bin/dsub  -n task_test -A xxxxxxxx -eo error.txt -oo output.txt -R "cpu=1"  ls /usr/local/cuda/bin

Output:

bin2c
compute-sanitizer
crt
cudafe++
cuda-gdb
cuda-gdbserver
cuda-install-samples-11.4.sh
cuda-uninstaller
cu++filt
cuobjdump
fatbinary
ncu
nsight-sys
nsys
nsys-exporter
nsys-ui
nvcc
nvcc.profile
nvdisasm
nvlink
nv-nsight-cu-cli
nvprune
ptxas

 

 

Add to the environment (append these lines to .bashrc on the host that submits the Job):

export PATH=/usr/local/cuda-11.4/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH

 

Use a Job submission to inspect the environment variables on the compute (slave) host:

The PATH variable obtained:

/home/HPCBase/HMPI/hmpi/bin:/home/HPCBase/HMPI/hmpi/bin:/usr/local/cuda-11.4/bin:/home/share/xxxxxxxxxxx/home/xxxxxxxx/anaconda3/bin:/home/share/xxxxxxxxxx/home/xxxxxxxx/anaconda3/condabin:/home/share/xxxxxxxxxxxxx/home/xxxxxxx/.local/bin:/home/share/xxxxxxxxxxx/home/xxxxxxx/bin:/opt/batch/cli/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin

 

The LD_LIBRARY_PATH variable:

/home/HPCBase/HMPI/hmpi/lib:/home/HPCBase/HMPI/hmpi/lib:/usr/local/cuda-11.4/lib64:

 

 

Breaking down the PATH value:

Pre-existing entries from the submission host:

/home/share/xxxxxxxxxxxxx/home/xxxxxxx/.local/bin:/home/share/xxxxxxxxxxx/home/xxxxxxx/bin:/opt/batch/cli/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin 

Set by conda:

/home/share/xxxxxxxxxxx/home/xxxxxxxx/anaconda3/bin:/home/share/xxxxxxxxxx/home/xxxxxxxx/anaconda3/condabin:

Set in .bashrc:

/usr/local/cuda-11.4/bin

Prepended by the compute host:

/home/HPCBase/HMPI/hmpi/bin:/home/HPCBase/HMPI/hmpi/bin 

 

 

And breaking down the LD_LIBRARY_PATH value:

Set in .bashrc:

/usr/local/cuda-11.4/lib64:

Prepended by the compute host:

/home/HPCBase/HMPI/hmpi/lib:/home/HPCBase/HMPI/hmpi/lib:
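The duplicated /home/HPCBase/HMPI/hmpi entries are harmless but make the variables hard to read. A small sketch for collapsing duplicates when inspecting them (`dedupe_path` is a made-up helper):

```python
def dedupe_path(value, sep=":"):
    """Collapse duplicate entries in a PATH-like variable, keeping first-seen order."""
    seen, kept = set(), []
    for entry in value.split(sep):
        if entry and entry not in seen:
            seen.add(entry)
            kept.append(entry)
    return sep.join(kept)

print(dedupe_path("/home/HPCBase/HMPI/hmpi/lib:/home/HPCBase/HMPI/hmpi/lib:/usr/local/cuda-11.4/lib64:"))
# /home/HPCBase/HMPI/hmpi/lib:/usr/local/cuda-11.4/lib64
```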

 

 

=============================================================

 

 

 

From the environment variables printed on the compute host, we can guess why the very first program failed: the CUDA library path was not set on the compute host. With LD_LIBRARY_PATH reconfigured as above, install cupy on the submission host to verify the setup. The earlier test suggests the compute host has CUDA 11.4 installed, so install the cupy build matching that version and run again:

import mpi4py.MPI as MPI
import sys
import os
import socket
import numpy as np
import cupy as cp



 
comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()
node_name = MPI.Get_processor_name()
# node_name = socket.gethostname()
 
# point to point communication
data_send = [comm_rank]*1

comm.send(data_send,dest=(comm_rank+1)%comm_size)

# res = run_queue() ###
print(os.environ['PATH'])
print(os.environ['LD_LIBRARY_PATH'])
# print(os.environ['CUDA_HOME'])


arr = cp.array([1, 2, 3, 4, 5])
arr += 10
print(arr)
print(type(arr))


data_recv =comm.recv(source=(comm_rank-1)%comm_size)

# print("my rank is %d, and Ireceived:" % comm_rank, data_recv, file=sys.stdout, flush=True)
# print(data_recv)
res = 111

Result: (output screenshot not preserved)

The run succeeded, which confirms the configuration works; heterogeneous computing is now essentially set up on this Huawei platform.

=============================================================

Official documentation:

https://support.huawei.com/enterprise/zh/doc/EDOC1100228705/d690fe77

 

posted on Angry_Panda
