Problems running chineseocr on a GPU, and how to fix them

System: Ubuntu 18.0

CUDA: 10.0.130

This environment only works with tensorflow 1.14.0 or later; with an older version, import fails with:

ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.

chineseocr itself, however, supports tensorflow only up to 1.13.1; with a newer version it fails with:

tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder_368' with dtype float and shape [2]
[[{{node Placeholder_368}}]]

Solution:

Change lines 365-366 of keras_yolo3.py from

boxes  = concatenate(boxes, axis=0)
scores = concatenate(scores, axis=0)

to

boxes  = K.concatenate(boxes, axis=0)
scores = K.concatenate(scores, axis=0)

Installed versions after the change:

keras==2.2.4 
tensorflow==1.14.0 
tensorflow-gpu==1.14.0
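
To confirm that an environment actually meets these versions, a check along the following lines may help. It is only a minimal sketch: `version_tuple` and `meets_minimum` are hypothetical helper names, not part of chineseocr, and the version strings are compared numerically because a plain string comparison would rank "1.9" above "1.14".

```python
import re


def version_tuple(version):
    """Extract the leading numeric fields, e.g. '1.14.0' -> (1, 14, 0)."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# Example with literal strings; in a real environment these would come
# from tf.__version__ and keras.__version__ (assumed to be importable).
print(meets_minimum("1.14.0", "1.14.0"))
print(meets_minimum("1.13.1", "1.14.0"))
```

In practice the first argument would be `tf.__version__` after `import tensorflow as tf`, which also surfaces any import error early.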

After upgrading, an error is still reported:

2020-08-29 18:47:06.157935: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2020-08-29 18:47:06.286745: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)

Solution (details are in the linked reference):

Change lines 120-121 of keras_yolo3.py from

box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))

to

box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[..., ::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[..., ::-1], K.dtype(feats))
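
The change from `[::-1]` to `[..., ::-1]` should not alter the result as long as `grid_shape` and `input_shape` are 1-D shape tensors, which is an assumption here. The sketch below uses NumPy arrays as a stand-in for Keras backend tensors to show that the two slices agree on 1-D data and differ only on higher ranks; it does not reproduce the grappler error itself.

```python
import numpy as np

# Stand-in for a 1-D shape tensor such as grid_shape (height, width).
grid_shape = np.array([13, 26])

plain = grid_shape[::-1]               # reverses along the first axis
with_ellipsis = grid_shape[..., ::-1]  # reverses along the last axis

# For a 1-D array the first axis is also the last one, so both agree.
same_on_1d = bool(np.array_equal(plain, with_ellipsis))

# On a 2-D array the two forms differ: rows reversed vs. columns reversed.
matrix = np.array([[1, 2], [3, 4]])
rows_reversed = matrix[::-1]
cols_reversed = matrix[..., ::-1]

print(same_on_1d)
print(rows_reversed.tolist(), cols_reversed.tolist())
```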

PyTorch also raised an error:

Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 3 does not equal 0 (while checking arguments for cudnn_convolution)

Cause: the weights and the input were loaded onto different devices.

When loading the trained weights, the original code kept them on the CPU (details are in the linked reference):

 map_location=lambda storage, loc: storage

Solution: load everything onto the same device:

device = torch.device('cuda', GPUID) if GPU and torch.cuda.is_available() else torch.device('cpu')

model = CRNN(32, 1, len(alphabet) + 1, 256, 1, lstmFlag=LSTMFLAG).to(device)
trainWeights = torch.load(ocrModel, map_location=device)
...
image = image.to(device)
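
The same pattern can be tried in isolation. The following is a minimal sketch, assuming PyTorch is installed: a tiny `nn.Conv2d` is a hypothetical stand-in for the CRNN model, an in-memory buffer replaces the `ocrModel` file, and the values of `GPU` and `GPUID` are made up for the example.

```python
import io

import torch
import torch.nn as nn

# Pick one device and use it for the model, the weights and the input.
# GPU and GPUID mirror the names in the snippet above; values are assumed.
GPU, GPUID = True, 0
device = torch.device('cuda', GPUID) if GPU and torch.cuda.is_available() else torch.device('cpu')

# Tiny convolution as a hypothetical stand-in for the CRNN model.
model = nn.Conv2d(1, 4, kernel_size=3, padding=1).to(device)

# Save and reload the weights in memory to imitate torch.load(ocrModel, ...).
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)
weights = torch.load(buffer, map_location=device)
model.load_state_dict(weights)

# The input must live on the same device as the weights.
image = torch.zeros(1, 1, 32, 100).to(device)
with torch.no_grad():
    output = model(image)

print(output.device, tuple(output.shape))
```

Passing the same `device` to `.to(...)` and to `map_location` keeps the device choice in one place, so the model, its weights and the input cannot drift apart.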

When running TensorFlow, the GPU was detected but was not actually used at run time.

The check code below follows a referenced article:

import tensorflow as tf

# Check which devices TensorFlow can see.
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
print("============")

# Build a small graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Create a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Run the op.
print(sess.run(c))

It turned out that a message logged during import tensorflow had been overlooked:

2020-08-29 23:08:12.874793: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-08-29 23:08:12.874804: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-08-29 23:08:12.874832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-29 23:08:12.874840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2 3
2020-08-29 23:08:12.874847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y Y Y
2020-08-29 23:08:12.874854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N Y Y
2020-08-29 23:08:12.874861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: Y Y N Y
2020-08-29 23:08:12.874867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3: Y Y Y N
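
Messages like these are logged at info and warning level and do not stop the program, which makes them easy to miss. As a rough sketch, captured log text can be scanned for libraries that TensorFlow failed to load. `find_missing_libraries` is a hypothetical helper, and the pattern is based only on the wording of the log lines quoted above.

```python
import re

# Pattern taken from the dso_loader message quoted above.
DLOPEN_FAILURE = re.compile(r"Could not dlopen library '([^']+)'")


def find_missing_libraries(log_text):
    """Return the library names reported as not loadable, without repeats."""
    seen = []
    for name in DLOPEN_FAILURE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


sample_log = """\
2020-08-29 23:08:12.874793: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-08-29 23:08:12.874804: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
"""

print(find_missing_libraries(sample_log))
```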

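According to several articles, the cuDNN libraries were probably not linked when CUDA was installed. Fixing this requires sudo rights, so the solution was added later; some references first: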

https://github.com/tensorflow/tensorflow/issues/20271

https://blog.csdn.net/weixin_40298200/article/details/79420758

Update with the solution: install cuDNN 7.6:

wget http://file.ppwwyyxx.com/nvidia/cudnn-10.0-linux-x64-v7.6.4.38.tgz
tar xzvf cudnn-10.0-linux-x64-v7.6.4.38.tgz

sudo cp cuda/include/cudnn*.h /usr/local/cuda-10.0/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda-10.0/lib64
sudo chmod a+r /usr/local/cuda-10.0/include/cudnn*.h /usr/local/cuda-10.0/lib64/libcudnn*
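
After copying the files, one way to check whether the dynamic loader can now find the library is to try opening it from Python. This is a minimal sketch: `can_load` is a hypothetical helper, and it only tests that the file can be opened, not that TensorFlow will accept this cuDNN version. The loader may also need /usr/local/cuda-10.0/lib64 on LD_LIBRARY_PATH before it finds the library.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


# The library name is the one from the TensorFlow log message.
print("libcudnn.so.7 loadable:", can_load("libcudnn.so.7"))
```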

For the relationship between CUDA and cuDNN, see the linked article.

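In addition, the nvcc command did not work on the server because the CUDA environment variables (PATH and LD_LIBRARY_PATH) were not set. Add the following to ~/.bashrc: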

if [ -d "/usr/local/cuda-10.0/bin/" ]; then
    export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
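
Once a new shell has picked up ~/.bashrc, the settings can be verified with a short script. The following is a sketch under the assumption that CUDA lives in /usr/local/cuda-10.0 as in the snippet above; `env_contains` is a hypothetical helper name.

```python
import os
import shutil


def env_contains(variable, directory, environ=None):
    """Return True if `directory` is listed in a path-style variable."""
    environ = os.environ if environ is None else environ
    entries = environ.get(variable, "").split(os.pathsep)
    wanted = directory.rstrip("/")
    return any(entry.rstrip("/") == wanted for entry in entries if entry)


# shutil.which returns None when nvcc is not on PATH.
print("nvcc found at:", shutil.which("nvcc"))
print("bin on PATH:", env_contains("PATH", "/usr/local/cuda-10.0/bin"))
print("lib64 on LD_LIBRARY_PATH:",
      env_contains("LD_LIBRARY_PATH", "/usr/local/cuda-10.0/lib64"))
```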

Posted 2020-08-30 02:25 by Sherrrry
