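Problems running chineseocr on a GPU, and how to fix them
System: Ubuntu 18.0
CUDA: 10.0.130
With this setup only tensorflow 1.14.0 or later could be imported; with earlier versions the import failed with: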
ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/errors
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
chineseocr itself supports tensorflow only up to 1.13.1; with a newer version it fails with:
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder_368' with dtype float and shape [2]
[[{{node Placeholder_368}}]]
Fix:
In keras_yolo3.py, lines 365-366, change
boxes = concatenate(boxes, axis=0)
scores = concatenate(scores, axis=0)
to
boxes = K.concatenate(boxes, axis=0)
scores = K.concatenate(scores, axis=0)
Versions installed after this change:
keras==2.2.4 tensorflow==1.14.0 tensorflow-gpu==1.14.0
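To confirm that the active environment really matches these pins, a small check along the following lines can help. This is only a sketch: the helper names `installed_version` and `find_mismatches` are made up for illustration, and the fallback import assumes the `importlib_metadata` backport is installed on Python versions older than 3.8.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # assumed backport for older Pythons

# Pins copied from the line above.
PINS = {"keras": "2.2.4", "tensorflow": "1.14.0", "tensorflow-gpu": "1.14.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a tuple (expected, actual)."""
    mismatches = {}
    for package, expected in pins.items():
        actual = lookup(package)
        if actual != expected:
            mismatches[package] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    for package, (expected, actual) in find_mismatches(PINS).items():
        print(f"{package}: expected {expected}, found {actual}")
```

An empty result means all three packages are installed at the pinned versions.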
After updating the versions, errors still appeared:
2020-08-29 18:47:06.157935: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2020-08-29 18:47:06.286745: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
Fix (see the linked post for details):
In keras_yolo3.py, lines 120-121, change
box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))
to
box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[..., ::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[..., ::-1], K.dtype(feats))
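The rewrite should not change the computed values. Assuming `grid_shape` and `input_shape` are rank-1 tensors holding (height, width), reversing along the first axis and along the last axis is the same operation. The NumPy sketch below illustrates only that equivalence, with made-up values; it does not reproduce the grappler error itself.

```python
import numpy as np

# Stand-in for grid_shape: a rank-1 array (height, width). Values are made up.
grid_shape = np.array([13, 26])

# [::-1] reverses along the first axis, [..., ::-1] along the last axis.
# A rank-1 array has a single axis, so both give the same result.
first_axis_reversed = grid_shape[::-1]
last_axis_reversed = grid_shape[..., ::-1]

print(first_axis_reversed, last_axis_reversed)
```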
PyTorch also raised an error:
Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 3 does not equal 0 (while checking arguments for cudnn_convolution)
Cause: PyTorch loaded the weights and the input onto different devices.
When loading the trained weights, the original code kept them on the CPU (details in the linked source):
map_location=lambda storage, loc: storage
Fix: load everything onto the same device
device = torch.device('cuda', GPUID) if GPU and torch.cuda.is_available() else torch.device('cpu')
model = CRNN(32, 1, len(alphabet) + 1, 256, 1, lstmFlag=LSTMFLAG).to(device)
trainWeights = torch.load(ocrModel, map_location=device)
...
image = image.to(device)
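The same pattern can be tried in isolation. In the sketch below a tiny `nn.Conv2d` stands in for the CRNN model, the weights are saved to and reloaded from an in-memory buffer, and `USE_GPU` and `GPU_ID` are hypothetical stand-ins for the `GPU` and `GPUID` settings; none of this is the project's actual loading code.

```python
import io

import torch
import torch.nn as nn

# Hypothetical settings standing in for GPU and GPUID.
USE_GPU = True
GPU_ID = 0

device = (
    torch.device("cuda", GPU_ID)
    if USE_GPU and torch.cuda.is_available()
    else torch.device("cpu")
)

# A tiny convolution stands in for the CRNN model.
model = nn.Conv2d(1, 4, kernel_size=3).to(device)

# Save the weights, then reload them mapped directly onto `device`.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)
state = torch.load(buffer, map_location=device)
model.load_state_dict(state)

# The input has to live on the same device as the weights.
image = torch.zeros(1, 1, 32, 32).to(device)
output = model(image)

print(output.device, next(model.parameters()).device)
```

Both printed devices should be identical; if the input were left on another device, the convolution would fail with a device-mismatch error like the one above.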
When running TensorFlow, the GPUs were detected but were not actually used at run time.
The check below follows the referenced article:
import tensorflow as tf
# Detect the GPUs
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
print("============")
# Build a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Create a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Run the op.
print(sess.run(c))
It turned out that a message logged during import tensorflow had been overlooked:
2020-08-29 23:08:12.874793: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-08-29 23:08:12.874804: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-08-29 23:08:12.874832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-29 23:08:12.874840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2 3
2020-08-29 23:08:12.874847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y Y Y
2020-08-29 23:08:12.874854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N Y Y
2020-08-29 23:08:12.874861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: Y Y N Y
2020-08-29 23:08:12.874867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3: Y Y Y N
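According to a few write-ups, the likely cause is that cuDNN was not set up when CUDA was installed. Fixing this requires sudo access, so the solution is given in the update further down; some references first: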
https://github.com/tensorflow/tensorflow/issues/20271
https://blog.csdn.net/weixin_40298200/article/details/79420758
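Before changing anything, it is possible to check from Python whether the dynamic loader can open the library named in the log. The following is a minimal sketch using `ctypes`; the helper name `can_load` is invented for illustration, and `libcudart.so.10.0` is assumed to be the CUDA 10.0 runtime library name.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # libcudnn.so.7 is the name from the TensorFlow log above;
    # libcudart.so.10.0 is an assumed name for the CUDA 10.0 runtime.
    for name in ("libcudnn.so.7", "libcudart.so.10.0"):
        print(name, "found" if can_load(name) else "NOT found")
```

If `libcudnn.so.7` is reported as not found while the CUDA runtime loads, that points to cuDNN being absent or outside the loader's search path.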
Update with the fix: install cuDNN 7.6:
wget http://file.ppwwyyxx.com/nvidia/cudnn-10.0-linux-x64-v7.6.4.38.tgz
tar xzvf cudnn-10.0-linux-x64-v7.6.4.38.tgz
sudo cp cuda/include/cudnn*.h /usr/local/cuda-10.0/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda-10.0/lib64
sudo chmod a+r /usr/local/cuda-10.0/include/cudnn*.h /usr/local/cuda-10.0/lib64/libcudnn*
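After copying the files, the installed cuDNN version can be read back from the header. The sketch below assumes the cuDNN 7.x layout, where `cudnn.h` defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`; the function name `parse_cudnn_version` is made up for illustration.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h.

    Returns None if any of the three defines is missing.
    """
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#define\s+" + key + r"\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Path matches the copy destination used in the commands above.
    header_path = "/usr/local/cuda-10.0/include/cudnn.h"
    try:
        with open(header_path, errors="replace") as header:
            print(parse_cudnn_version(header.read()))
    except OSError as error:
        print("could not read", header_path, error)
```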
For how CUDA and cuDNN relate to each other, see the linked article.
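In addition, the nvcc command was not found on the server because the CUDA directories were not on the search paths: the shell finds nvcc through PATH, and the CUDA libraries are found through LD_LIBRARY_PATH. Add the following to ~/.bashrc: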
if [ -d "/usr/local/cuda-10.0/bin/" ]; then
    export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
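Once a new shell has picked up the change, the result can be verified from Python. This is a sketch under the assumption that CUDA lives in /usr/local/cuda-10.0 as above; `cuda_env_report` is a hypothetical helper, not part of any library.

```python
import os
import shutil


def cuda_env_report(cuda_root="/usr/local/cuda-10.0", env=None):
    """Report whether cuda_root's bin and lib64 directories are listed
    in PATH and LD_LIBRARY_PATH respectively."""
    env = os.environ if env is None else env
    path_dirs = env.get("PATH", "").split(os.pathsep)
    lib_dirs = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return {
        "bin_on_path": os.path.join(cuda_root, "bin") in path_dirs,
        "lib64_on_ld_library_path": os.path.join(cuda_root, "lib64") in lib_dirs,
    }


if __name__ == "__main__":
    print(cuda_env_report())
    # shutil.which returns None when nvcc cannot be found on PATH.
    print("nvcc resolves to:", shutil.which("nvcc"))
```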