GPU lost: nvidia-smi cannot find the GPU, CUDA failure 999 (GPU is lost)

Error messages (they can occur at the same time):

tonyyan@tonyyan-X11SPI:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:65:00.0: GPU is lost.  Reboot the system to recover this GPU
327.411fps
3ms
312.613fps
3ms
309.92fps
3ms
300.209fps
2ms
342.361fps
3ms
322.467fps
3ms
316.99fps
3ms
318.749fps
3ms
321.253fps
3ms
314.281fps
3ms
312.419fps
2ms
342.166fps
3ms
312.345fps
3ms
327.62fps



178ms
5.59761fps
192ms
5.19022fps
178ms
5.59837fps
Cuda failure: 999

 

Unable to open 'raise.c': Unable to read file '/build/glibc-S9d2JN/glibc-2.27/sysdeps/unix/sysv/linux/raise.c' (Error: Unable to resolve non-existing file '/build/glibc-S9d2JN/glibc-2.27/sysdeps/unix/sysv/linux/raise.c').

Reason: unknown

How this occurs:

  1. The GPU is lost some time (usually several hours) after boot, even if nothing is running on it.
  2. While a GPU-dependent process is running, such as model training or TensorRT inference. The FPS gradually slows down until the process prints 'Cuda failure: 999'.
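The FPS drop in the second case can be spotted from the log before the failure message appears. Below is a minimal sketch, assuming the log format shown above (one `<number>fps` line per frame, interleaved with `<number>ms` lines). The function names, the five-reading baseline, and the 50% threshold are hypothetical choices for illustration, not part of any tool.

```python
import re

# Matches log lines such as "327.411fps"; "3ms" and other lines are ignored.
FPS_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)fps\s*$")


def parse_fps(lines):
    """Extract the FPS readings from an iterable of log lines."""
    readings = []
    for line in lines:
        match = FPS_LINE.match(line)
        if match:
            readings.append(float(match.group(1)))
    return readings


def first_collapse(readings, baseline_count=5, ratio=0.5):
    """Return the index of the first reading below ratio * baseline.

    The baseline is the mean of the first `baseline_count` readings
    (an assumed heuristic). Returns None if no reading falls below it.
    """
    if len(readings) <= baseline_count:
        return None
    baseline = sum(readings[:baseline_count]) / baseline_count
    for index in range(baseline_count, len(readings)):
        if readings[index] < ratio * baseline:
            return index
    return None
```

Feeding the log lines above into `parse_fps` and then `first_collapse` would flag the first reading of around 5 fps, which comes just before 'Cuda failure: 999'.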

Current solution:

Restart the computer.
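Since a reboot is the only fix so far, it helps to know when one is needed. Below is a minimal sketch that runs `nvidia-smi` and looks for the error text quoted above. The function names and the 30-second timeout are assumptions, and the marker strings are taken from this one error message only, so other driver versions may word it differently.

```python
import subprocess

# Substrings taken from the nvidia-smi error shown in this post.
LOST_MARKERS = ("GPU is lost", "Unable to determine the device handle")


def gpu_looks_lost(text):
    """Return True if the text contains one of the known 'lost GPU' markers."""
    return any(marker in text for marker in LOST_MARKERS)


def check_gpu(timeout=30):
    """Run nvidia-smi and return 'ok', 'lost', or 'unavailable'."""
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # nvidia-smi is missing, cannot be executed, or hangs.
        return "unavailable"
    combined = result.stdout + result.stderr
    if gpu_looks_lost(combined):
        return "lost"
    return "ok" if result.returncode == 0 else "unavailable"
```

Calling `check_gpu()` periodically (for example from a cron job) and sending an alert when it returns 'lost' would show how long after boot the GPU disappears. The sketch deliberately does not reboot the machine itself.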

 

posted @ 2021-06-30 16:01  略略略——