A failed attempt at machine learning on an AMD card (Radeon RX 470)
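Catching the machine learning (ML) and deep learning wave is not easy: the graphics card alone is a sizeable investment. A web search suggested that AMD cards can, at a pinch, be used for ML, so I decided to learn with the AMD card I already had (an RX 470). It was not a smooth process, so here are my notes.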
1. Trying WSL2: failed.
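I checked the official AMD ROCm site: Windows is not supported, and Ubuntu is essentially the recommended platform. So I figured I would just install WSL2 on Windows 10, where the latest Ubuntu release available is already 20.04; I'll skip the details. After installing Anaconda and ROCm, I ran rocminfo, which reported that no GPU could be found. Some searching confirmed that WSL cannot access the hardware directly for now (according to Microsoft, a solution is in development), so this approach failed.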
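One quick way to see why rocminfo comes up empty is to look for the device nodes the ROCm stack talks to. The sketch below assumes the usual Linux layout, where the compute driver exposes /dev/kfd and the graphics driver exposes render nodes under /dev/dri. The function name `rocm_device_nodes` and its `root` parameter are made up for illustration and are not part of ROCm.

```python
import os


def rocm_device_nodes(root="/"):
    """Report which GPU device nodes exist under `root`.

    Assumption: ROCm user space needs /dev/kfd (compute) and at least one
    /dev/dri/renderD* node. `root` exists only so the function can be
    pointed at a fake directory tree for testing.
    """
    kfd = os.path.join(root, "dev", "kfd")
    dri = os.path.join(root, "dev", "dri")
    render_nodes = []
    if os.path.isdir(dri):
        render_nodes = sorted(
            name for name in os.listdir(dri) if name.startswith("renderD")
        )
    return {"kfd": os.path.exists(kfd), "render_nodes": render_nodes}


if __name__ == "__main__":
    print(rocm_device_nodes())
```

If `kfd` comes back False, user-space tools such as rocminfo have no device to open, whatever packages are installed.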
2. Installing Ubuntu 20.04 on the physical machine
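After installing ROCm and Anaconda by following a tutorial, I installed tensorflow-rocm. The installation went smoothly and everything seemed ready, so I started Python and ran import tensorflow. Error!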
(base) python@python-MS-7972:~$ python
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: librocsolver.so.0: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/__init__.py", line 41, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/__init__.py", line 40, in <module>
    from tensorflow.python.eager import context
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 35, in <module>
    from tensorflow.python import pywrap_tfe
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/pywrap_tfe.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py", line 83, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/python/python-dev/anaconda3/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: librocsolver.so.0: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.
Include the entire stack trace above this error message when asking for help.
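The key line in that trace is the ImportError for librocsolver.so.0: the dynamic linker cannot locate the shared library. A minimal sketch for checking this directly, without importing TensorFlow at all, is to ask ctypes to load the library by name. The helper name `can_load` is hypothetical; `ctypes.CDLL` is the standard-library call doing the work.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic linker can find and load `library_name`."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the file is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    # The library name is the one reported in the traceback.
    name = "librocsolver.so.0"
    print(name, "->", "found" if can_load(name) else "MISSING")
```

Rerunning this after each fix attempt gives a faster answer than starting a full TensorFlow import.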
After another round of painstaking searching :), I finally found the right fix, which turned out to be nothing more than installing rocm-libs!
sudo apt install rocm-libs
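However, the rocm-libs library files are installed across several subdirectories of /opt/rocm-4.3.0, so those directories have to be added to the dynamic linker's search path (the LD path).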
What I did was create a new, separate config file, rocm_4.3.0_libs.conf, under /etc/ld.so.conf.d, with the following content:
/opt/rocm-4.3.0/lib
/opt/rocm-4.3.0/rocsolver/lib
/opt/rocm-4.3.0/rocblas/lib
/opt/rocm-4.3.0/rocclr/lib
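As a rough sketch, the config file can be generated and sanity-checked from Python before it is copied into place. The directory list is the one given in this post; the helper names `render_ld_conf` and `missing_dirs` are invented for illustration, and the script deliberately writes to the temp directory rather than /etc.

```python
import os
import tempfile

# Directories listed in the post; adjust the version to the installed ROCm.
ROCM_LIB_DIRS = [
    "/opt/rocm-4.3.0/lib",
    "/opt/rocm-4.3.0/rocsolver/lib",
    "/opt/rocm-4.3.0/rocblas/lib",
    "/opt/rocm-4.3.0/rocclr/lib",
]


def render_ld_conf(directories):
    """Return ld.so.conf.d file content: one directory per line."""
    return "\n".join(directories) + "\n"


def missing_dirs(directories):
    """Return the directories that do not exist on this machine."""
    return [d for d in directories if not os.path.isdir(d)]


if __name__ == "__main__":
    # Write to a temp file, not /etc, so the sketch is harmless to run.
    out_path = os.path.join(tempfile.gettempdir(), "rocm_4.3.0_libs.conf")
    with open(out_path, "w") as fh:
        fh.write(render_ld_conf(ROCM_LIB_DIRS))
    print("wrote", out_path)
    print("missing:", missing_dirs(ROCM_LIB_DIRS))
```

Once the file is in /etc/ld.so.conf.d, running sudo ldconfig rebuilds the linker cache so the new directories are actually searched.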
I started Python again, imported tensorflow, and it finally worked!
But don't celebrate too soon. I threw together a bit of code:
import os
import numpy as np
import pandas as pd
import matplotlib as plot
import keras
from keras.utils import np_utils
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense,Dropout,MaxPooling2D,MaxPooling1D,Conv1D,Conv2D,LSTM

def main():
    # Load MNIST, normalize the images and one-hot encode the labels
    (x_train,y_train),(x_test,y_test) = mnist.load_data()
    x_train=np_utils.normalize(x_train,2)
    y_train=np_utils.to_categorical(y_train)
    x_test=np_utils.normalize(x_test,2)
    y_test=np_utils.to_categorical(y_test)
    model=Sequential()
    # Dense needs a unit count; 10 (one per digit class) is filled in here
    model.add(Dense(10))

if __name__=='__main__':
    main()
And the result was yet another error:
2021-09-26 09:42:15.973798: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libamdhip64.so
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)
Searching revealed that gfx803-series cards (RX 470/480/570/580/590) are no longer on AMD's supported list for ROCm releases after 3.7!! Let me cry for a moment :(
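The hipErrorNoBinaryForGpu message says the HIP runtime found no code object built for the installed GPU, so it helps to know which architecture name the card reports. The sketch below pulls the gfx names out of rocminfo output. The helper names are hypothetical, and the pattern assumes architecture names look like gfx followed by hex digits (gfx803, gfx900, and so on).

```python
import re
import shutil
import subprocess

# Assumption: architecture names are "gfx" followed by hex digits.
GFX_PATTERN = re.compile(r"\bgfx[0-9a-f]+\b")


def extract_gfx_targets(rocminfo_text):
    """Return the distinct gfx names found in rocminfo output, in order."""
    seen = []
    for match in GFX_PATTERN.findall(rocminfo_text):
        if match not in seen:
            seen.append(match)
    return seen


def local_gfx_targets():
    """Run rocminfo if it is on PATH; return None when it is not installed."""
    if shutil.which("rocminfo") is None:
        return None
    result = subprocess.run(["rocminfo"], capture_output=True, text=True)
    return extract_gfx_targets(result.stdout)


if __name__ == "__main__":
    print(local_gfx_targets())
```

On an RX 470 one would expect gfx803 to show up in the list; whether a given tensorflow-rocm build ships kernels for that target is a separate question this sketch does not answer.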
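Supposedly the method at this link can solve it. But after I installed everything step by step (except the two PyTorch packages), tensorflow could be imported and used, yet it still did not find the GPU and kept running on the CPU! I give up. If any of you want to give it a try and get it working, please be sure to let me know.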
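To confirm whether TensorFlow sees the GPU or has silently fallen back to the CPU, one option is to list the physical devices it detects. This is a minimal sketch: `visible_gpus` is a hypothetical helper, while `tf.config.list_physical_devices` is the TensorFlow 2.x call it relies on.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see.

    Returns None if TensorFlow cannot be imported at all.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not importable")
    elif not gpus:
        print("No GPU visible; TensorFlow will run on the CPU")
    else:
        print("GPUs:", gpus)
```

An empty list matches the situation described here: the import succeeds, but all work runs on the CPU.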
Update, 2022-10-17
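Today, while browsing the web, I stumbled on the article "It's a waste not to run AI models on your MacBook's GPU: this deep learning tool supports GPUs of every brand" (MacBook显卡不跑AI模型太浪费:这个深度学习工具支持所有品牌GPU). I'll try it out when I get back and hope it solves the problem!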