Bug: Failed to initialize NVML: Driver/library version mismatch; NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.

While using Docker, the following error appeared: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
Running nvidia-smi in a terminal to check the GPU driver reported: Failed to initialize NVML: Driver/library version mismatch
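
This message usually means that the NVIDIA kernel module still loaded in memory and the user-space libraries on disk come from different driver versions, typically after a package upgrade without a reboot. A minimal sketch for comparing the two is shown below. The helper names are hypothetical, and the library location /usr/lib/x86_64-linux-gnu mentioned afterwards is an assumption for an Ubuntu amd64 install, not something stated in the original post.

```shell
# Hypothetical helpers: compare the loaded kernel module version with the
# version of the user-space NVML library on disk.

# Read text in /proc/driver/nvidia/version format on stdin and print the
# driver version found on the "NVRM version" line.
module_version() {
    grep 'NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Print the version embedded in a library file name such as
# libnvidia-ml.so.535.154.05 (prints nothing for libnvidia-ml.so.1).
library_version() {
    printf '%s\n' "$1" | sed -n 's/.*libnvidia-ml\.so\.\([0-9][0-9]*\.[0-9][0-9.]*\)$/\1/p'
}

# Report whether the two versions agree.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: kernel module $1, library $2"
    fi
}
```

On a live system the inputs could come from `module_version < /proc/driver/nvidia/version` and from the versioned libnvidia-ml.so.* file under the assumed library directory. A mismatch points at an upgrade that has not been followed by a reboot or a module reload.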

This is already the second time the problem has shown up on the new system. What I tried:

  1. Reboot the machine. This fixed the problem the first time, but less than a month later the driver was upgraded again;

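  2. Check the upgrade history:

    1) cat /var/log/apt/history.log shows what upgraded the GPU driver and kernel packages; here it reveals that unattended-upgrade has been upgrading our kernel version;

    2) sudo dpkg-reconfigure unattended-upgrades to turn off automatic upgrades (the package name is unattended-upgrades, with a trailing "s");

    3) The setting did not take effect here, so back to the traditional approach!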

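  3. Pin the driver version

    1) Get the current driver version: cat /proc/driver/nvidia/version

    2) Pin the current driver version: sudo apt-mark hold nvidia-535.146

    3) This did not work either, most likely because nvidia-535.146 is not the name of an installed package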

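  4. Find the upgraded package and pin it
    1) View the package upgrade records: cat /var/log/dpkg.log | grep nvidia | grep upgrade

    Two upgrades within half a month, quite a thrill!

    2) Pin the version: sudo apt-mark hold nvidia-driver-535:amd64 (apt-mark takes package names only and holds whichever version is installed, here 535.129.03-0ubuntu0.22.04.1)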
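
The dpkg.log check above can be sketched as a small function that prints one summary line per NVIDIA package upgrade. This is only an illustration: nvidia_upgrades is a hypothetical helper name, and it assumes the usual dpkg.log layout of date, time, action, package, old version, new version.

```shell
# Hypothetical helper: summarise NVIDIA package upgrades recorded in a dpkg log.
# Assumed line layout: DATE TIME upgrade PACKAGE OLD_VERSION NEW_VERSION
# The log path defaults to /var/log/dpkg.log and can be overridden by $1.
nvidia_upgrades() {
    awk '$3 == "upgrade" && $4 ~ /nvidia/ { print $1, $4, $5, "->", $6 }' \
        "${1:-/var/log/dpkg.log}"
}
```

Each package name printed here is a candidate for `sudo apt-mark hold <package>`, and `apt-mark showhold` lists what is currently held.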

2. Roll back the system kernel

2.1 Problem record

nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

2.2 Check the system kernel

Check the current kernel:

uname -a

List the installed kernels in boot-menu order:

grep menuentry /boot/grub/grub.cfg
if [ x"${feature_menuentry_id}" = xy ]; then
  menuentry_id_option="--id"
  menuentry_id_option=""
export menuentry_id_option
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
	menuentry 'Ubuntu, with Linux 6.5.0-25-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-25-generic-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
	menuentry 'Ubuntu, with Linux 6.5.0-25-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-25-generic-recovery-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
	menuentry 'Ubuntu, with Linux 6.5.0-21-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-21-generic-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
	menuentry 'Ubuntu, with Linux 6.5.0-21-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-21-generic-recovery-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
menuentry 'UEFI Firmware Settings' $menuentry_id_option 'uefi-firmware' {

The kernel has been upgraded from 6.5.0-21 to 6.5.0-25 here!
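
The raw grep output is noisy, so here is a rough sketch that keeps only the bootable kernel entries and their IDs. kernel_entries is a hypothetical helper name, and it assumes entries are written the way the output above shows them: a single-quoted title first and a single-quoted gnulinux-... ID second.

```shell
# Hypothetical helper: print "title<TAB>id" for every non-recovery kernel
# entry in a grub.cfg-style file (path defaults to /boot/grub/grub.cfg).
# Splitting on single quotes makes $2 the title and $4 the entry id.
kernel_entries() {
    awk -F"'" '$1 ~ /menuentry/ && $2 ~ /with Linux/ && $2 !~ /recovery mode/ {
        print $2 "\t" $4
    }' "${1:-/boot/grub/grub.cfg}"
}
```

The second column is the value needed later when choosing which kernel to boot by default.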

2.3 Set the kernel version to boot

$ sudo vim /etc/default/grub
# Set GRUB_DEFAULT to the value below; note that "b9661316-823f-4b3f-b1ba-1eed5c4649ad" is the ID of your own machine and will differ on yours
GRUB_DEFAULT="gnulinux-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad>gnulinux-6.5.0-21-generic-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad"
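
The value has the form submenu-id>entry-id, where ">" tells GRUB to descend into the "Advanced options" submenu before picking the kernel entry. As a sketch, the line can be generated instead of typed by hand. grub_default_for is a hypothetical helper name, and it assumes the IDs follow the gnulinux-advanced-UUID and gnulinux-RELEASE-advanced-UUID pattern seen in the grub.cfg output.

```shell
# Hypothetical helper: build the GRUB_DEFAULT line "submenu-id>entry-id"
# from the ID suffix used in grub.cfg ($1) and the kernel release ($2).
grub_default_for() {
    uuid=$1
    release=$2
    printf 'GRUB_DEFAULT="gnulinux-advanced-%s>gnulinux-%s-advanced-%s"\n' \
        "$uuid" "$release" "$uuid"
}
```

For example, `grub_default_for b9661316-823f-4b3f-b1ba-1eed5c4649ad 6.5.0-21-generic` prints the line used above, which can then be pasted into /etc/default/grub.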

2.4 Apply the change

sudo update-grub

After rebooting:

$ uname -a
Linux industai 6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb  9 13:32:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ nvidia-smi
Mon Mar 11 11:40:24 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:02:00.0 Off |                  N/A |
|  0%   37C    P8               7W / 170W |      8MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2174      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
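
To make the post-reboot check repeatable, for instance after the next round of automatic upgrades, a small script along these lines might help. check_kernel is a hypothetical helper name; the second argument exists only so the function can be tried without rebooting and defaults to the output of uname -r.

```shell
# Hypothetical helper: confirm the running kernel is the pinned release.
# $1 = expected release, $2 = actual release (defaults to `uname -r`).
check_kernel() {
    expected=$1
    actual=${2:-$(uname -r)}
    if [ "$actual" = "$expected" ]; then
        echo "ok: running $actual"
    else
        echo "unexpected kernel: running $actual, wanted $expected"
        return 1
    fi
}
```

A typical call would be `check_kernel 6.5.0-21-generic && nvidia-smi --query-gpu=driver_version --format=csv,noheader`, which prints the driver version only when the expected kernel is running.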

posted @ 2024-01-23 16:51  巴蜀秀才