tensorflow-gpu 2.0 on Ubuntu 18: convolution fails with "Failed to get convolution algorithm."

Environment: Ubuntu 18 + NVIDIA driver 430 + CUDA 10.0 + cuDNN 7.6.0 + tensorflow-gpu 2.0.0

Any model that uses layers.Conv2D() fails as soon as training starts. The error output:

Epoch 1/5
2019-10-11 21:50:00.814925: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-11 21:50:01.026836: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-10-11 21:50:01.472727: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-11 21:50:01.476545: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-11 21:50:01.476593: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node sequential/conv2d/Conv2D}}]]
64/60000 [..............................] - ETA: 16:19Traceback (most recent call last):
File "/media/dxs/E/Project/AI/PycharmProject/deepLearningProject-201903/tf2_demo/mnist_cnn.py", line 53, in <module>
model.fit(train_images, train_labels, epochs=5,batch_size=64)
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
use_multiprocessing=use_multiprocessing)
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
total_epochs=epochs)
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
distributed_function(input_fn))
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
result = self._call(*args, **kwds)
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 520, in _call
return self._stateless_fn(*args, **kwds)
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
self.captured_inputs)
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
ctx=ctx)
File "/home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node sequential/conv2d/Conv2D (defined at home/dxs/anaconda3/envs/tf2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_distributed_function_1000]
 
Function call stack:
distributed_function

 

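I tried upgrading CUDA to 10.1, but the error persisted. After some searching, I found that enabling GPU memory growth at the start of the script is enough to fix it. Add the following statements at the top of the script, right after importing tensorflow: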

# =====================================================================
# Without these lines, Conv2D raises the error above
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)
# =====================================================================
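The statements above configure only the first GPU and abort when no GPU is present. A likely reason the setting helps is that TensorFlow by default reserves almost all GPU memory up front, which can leave cuDNN unable to create its handle; with memory growth enabled, memory is allocated on demand. The sketch below wraps the same two tf.config.experimental calls in a small helper that applies the setting to every visible GPU. The name enable_gpu_memory_growth is hypothetical, and passing the tensorflow module in as an argument is only a convenience so the helper can be exercised on a machine without a GPU.

```python
def enable_gpu_memory_growth(tf_module):
    """Enable memory growth on every visible GPU.

    `tf_module` is expected to be the imported tensorflow module
    (an assumption of this sketch). Returns the number of GPUs configured.
    Must be called before any op has run on the GPU; TensorFlow raises
    RuntimeError if the devices have already been initialized.
    """
    experimental = tf_module.config.experimental
    gpus = experimental.list_physical_devices('GPU')
    for gpu in gpus:
        # Allocate GPU memory on demand instead of reserving it all at startup.
        experimental.set_memory_growth(gpu, True)
    return len(gpus)


# Intended usage at the top of a training script:
#   import tensorflow as tf
#   enable_gpu_memory_growth(tf)
```

Unlike the original snippet, this version does not assert that a GPU exists, so on a CPU-only machine it returns 0 and the script continues on the CPU.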

  

Here is the complete code:

# -*- coding: utf-8 -*-
# @Time : 2019/10/10 10:39 PM
# @Author : dxs
# @Email : dangxusheng163163.com
# @File : mnist_cnn.py
# @Project : PycharmProject
 
from __future__ import absolute_import, division, print_function, unicode_literals
 
import os
import os.path as osp
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import datasets, layers, models
 
import cv2
import numpy as np
import matplotlib.pyplot as plt
 
#
# =====================================================================
# Without these lines, Conv2D raises the error above
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)
# =========================================================================
 
 
 
(train_images, train_labels), (test_images, test_labels) = datasets.mnist.load_data()
 
train_images = train_images.reshape((60000, 28, 28, 1))
test_images = test_images.reshape((10000, 28, 28, 1))
 
# Scale pixel values to the [0, 1] range
train_images, test_images = train_images / 255.0, test_images / 255.0
 
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.summary()  # Print the model architecture
 
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
 
model.fit(train_images, train_labels, epochs=5,batch_size=64)
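As an alternative to calling set_memory_growth in the script, the TensorFlow GPU guide describes an environment variable, TF_FORCE_GPU_ALLOW_GROWTH, that requests the same on-demand allocation. The following is a minimal sketch, assuming the installed TensorFlow build honours that variable; it was not part of the original fix and has not been verified on the setup described here.

```python
import os

# Set the variable before TensorFlow initializes the GPU.
# The simplest way to guarantee that is to set it before importing tensorflow.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import only after the variable is set
```

The variable can also be exported in the shell that launches the script, which leaves the Python source unchanged.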

  

With these lines in place, the output is as follows:

2019-10-11 21:54:15.109707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-11 21:54:15.109714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-10-11 21:54:15.109717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-10-11 21:54:15.109766: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-11 21:54:15.109991: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-11 21:54:15.110211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5562 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #  
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320      
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0        
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496    
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0        
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928    
_________________________________________________________________
flatten (Flatten)            (None, 576)               0        
_________________________________________________________________
dense (Dense)                (None, 64)                36928    
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650      
=================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
_________________________________________________________________
Train on 60000 samples
Epoch 1/5
2019-10-11 21:54:15.965416: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-11 21:54:16.172605: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-10-11 21:54:17.364824: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
60000/60000 [==============================] - 9s 145us/sample - loss: 0.1734 - accuracy: 0.9475
Epoch 2/5
60000/60000 [==============================] - 8s 127us/sample - loss: 0.0504 - accuracy: 0.9844
Epoch 3/5
60000/60000 [==============================] - 7s 124us/sample - loss: 0.0365 - accuracy: 0.9886
Epoch 4/5
60000/60000 [==============================] - 7s 113us/sample - loss: 0.0290 - accuracy: 0.9902
Epoch 5/5
60000/60000 [==============================] - 7s 110us/sample - loss: 0.0229 - accuracy: 0.9929
 
Process finished with exit code 0
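As a sanity check on the model summary in the output above, the parameter counts can be reproduced by hand: a Conv2D layer has kernel_h * kernel_w * in_channels * filters weights plus one bias per filter, and a Dense layer has in_units * out_units weights plus one bias per output unit. The helper names below are hypothetical and serve only this illustration.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


counts = {
    "conv2d": conv2d_params(3, 3, 1, 32),
    "conv2d_1": conv2d_params(3, 3, 32, 64),
    "conv2d_2": conv2d_params(3, 3, 64, 64),
    "dense": dense_params(3 * 3 * 64, 64),  # Flatten yields 3 * 3 * 64 = 576 units
    "dense_1": dense_params(64, 10),
}
total = sum(counts.values())
print(counts)
print("Total params:", total)  # Total params: 93322
```

The per-layer values (320, 18496, 36928, 36928, 650) and the total match the "Param #" column and the "Total params: 93,322" line printed by model.summary().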

  

 

posted @ dangxusheng