Bug: Failed to initialize NVML: Driver/library version mismatch; NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
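While using Docker, this error appeared: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
Running nvidia-smi in a terminal to check the GPU driver reported: Failed to initialize NVML: Driver/library version mismatch. The message means the NVIDIA user-space libraries on disk no longer match the kernel module that is currently loaded, typically after a package upgrade.
This is already the second time the problem has appeared on this new system. Solutions: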
-
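Reboot the machine. This fixed the problem the first time, but less than a month later the packages were upgraded again and the error came back.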
-
Check the upgrade history:
1.
cat /var/log/apt/history.log
See who upgraded the GPU driver and kernel packages. The log shows that unattended-upgrade has been upgrading the kernel version.
2.
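sudo dpkg-reconfigure unattended-upgrades
Disable automatic updates. Note that the package is named unattended-upgrades (plural), while the program that appears in the log is unattended-upgrade.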
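Steps 1 and 2 can be sketched as two small shell helpers. This is only a sketch under assumptions: it assumes the usual Ubuntu layout where history.log holds blank-line-separated records with Start-Date and Commandline fields, and where the periodic setting lives in /etc/apt/apt.conf.d/20auto-upgrades. The function names apt_gpu_transactions and unattended_upgrade_state are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; names and file locations are assumptions, not from the original notes.

# Read apt history.log text on stdin and print the start date and command line
# of every transaction whose record mentions an NVIDIA or kernel image package.
apt_gpu_transactions() {
    awk 'BEGIN { RS = ""; FS = "\n" }
         /nvidia|linux-image/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^(Start-Date|Commandline):/) print $i
             print "--"
         }'
}

# Read the contents of 20auto-upgrades on stdin and print "enabled" when
# APT::Periodic::Unattended-Upgrade is set to a non-zero value, else "disabled".
unattended_upgrade_state() {
    if grep -Eq '^[[:space:]]*APT::Periodic::Unattended-Upgrade[[:space:]]+"[1-9][0-9]*"'; then
        echo enabled
    else
        echo disabled
    fi
}
```

Typical use would be apt_gpu_transactions < /var/log/apt/history.log to see who triggered the upgrade, and unattended_upgrade_state < /etc/apt/apt.conf.d/20auto-upgrades to check whether disabling actually took effect.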
3. In my case this setting did not take effect, so I went back to the traditional approach.
-
Pin the driver version
1. Get the current driver version:
cat /proc/driver/nvidia/version
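2. Pin the current driver version:
sudo apt-mark hold nvidia-535.146
3. This did not work either, most likely because apt-mark hold expects the name of an installed package and nvidia-535.146 is not one.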
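As a rough sketch of how the two steps above fit together, the helpers below pull the version out of the /proc/driver/nvidia/version text and turn it into a package name that apt-mark can hold. The nvidia-driver-<major> naming is an assumption based on the Ubuntu package seen in the dpkg log (nvidia-driver-535); server or open-kernel variants are named differently. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; the package naming scheme is an assumption.

# Read /proc/driver/nvidia/version text on stdin and print the driver version,
# i.e. the first purely numeric dotted token on the "NVRM version:" line.
loaded_driver_version() {
    awk '/^NVRM version:/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
         }'
}

# Map a version such as 535.154.05 to the assumed metapackage name nvidia-driver-535.
driver_package_for() {
    echo "nvidia-driver-${1%%.*}"
}
```

With these, the hold could be written as sudo apt-mark hold "$(driver_package_for "$(loaded_driver_version < /proc/driver/nvidia/version)")", and apt-mark showhold would confirm that it was recorded.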
-
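System kernel
1. Check the system upgrade log:
cat /var/log/dpkg.log | grep nvidia | grep upgrade
The driver was upgraded twice in half a month, which is alarming! Then pin the package (apt-mark hold takes package names only, so the version 535.129.03-0ubuntu0.22.04.1 from the log is not passed as an argument):
sudo apt-mark hold nvidia-driver-535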
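The grep pipeline above can be tightened into a small filter that prints one summary line per NVIDIA upgrade. This sketch assumes the usual dpkg.log line shape "DATE TIME upgrade PACKAGE:ARCH OLD_VERSION NEW_VERSION"; the function name nvidia_upgrades is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper; assumes the standard dpkg.log column layout.

# Read dpkg.log lines on stdin and print "DATE PACKAGE OLD -> NEW"
# for every upgrade action on a package whose name contains "nvidia".
nvidia_upgrades() {
    awk '$3 == "upgrade" && $4 ~ /nvidia/ { print $1, $4, $5, "->", $6 }'
}
```

Running nvidia_upgrades < /var/log/dpkg.log makes it easy to see how often the driver changed and between which versions.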
2. Roll back the system kernel
2.1 Symptom
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
2.2 Inspect the system kernel
Check the running kernel:
uname -a
List the installed kernels in boot-menu order:
grep menuentry /boot/grub/grub.cfg
if [ x"${feature_menuentry_id}" = xy ]; then
menuentry_id_option="--id"
menuentry_id_option=""
export menuentry_id_option
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
menuentry 'Ubuntu, with Linux 6.5.0-25-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-25-generic-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
menuentry 'Ubuntu, with Linux 6.5.0-25-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-25-generic-recovery-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
menuentry 'Ubuntu, with Linux 6.5.0-21-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-21-generic-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
menuentry 'Ubuntu, with Linux 6.5.0-21-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.5.0-21-generic-recovery-b9661316-823f-4b3f-b1ba-1eed5c4649ad' {
menuentry 'UEFI Firmware Settings' $menuentry_id_option 'uefi-firmware' {
The newest kernel has moved from 6.5.0-21 to 6.5.0-25!
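The raw grep output is hard to scan, so here is a sketch of a filter that keeps only each entry's title and id. It assumes the quoting used in the generated grub.cfg shown above (title and id both in single quotes); the function name grub_entries is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper; relies on the single-quote layout of generated grub.cfg lines.

# Read grub.cfg text on stdin and print "title<TAB>id" for every
# menuentry or submenu line. Splitting on single quotes gives:
#   $1 = keyword, $2 = title, $4 = id
grub_entries() {
    awk -F"'" '$1 ~ /^[ \t]*(menuentry|submenu) *$/ && NF >= 4 { print $2 "\t" $4 }'
}
```

For example, sudo cat /boot/grub/grub.cfg | grub_entries lists the ids needed in the next step without the --class noise.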
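2.3 Select the kernel version to boot
$ sudo vim /etc/default/grub
# Set GRUB_DEFAULT to the value below: the submenu id and the menu entry id joined by ">". Note that "b9661316-823f-4b3f-b1ba-1eed5c4649ad" is specific to this machine; use the id from your own grub.cfg.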
GRUB_DEFAULT="gnulinux-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad>gnulinux-6.5.0-21-generic-advanced-b9661316-823f-4b3f-b1ba-1eed5c4649ad"
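To avoid copying the long ids by hand, the value can be assembled from grub.cfg. The sketch below assumes the id naming seen in the listing above (gnulinux-advanced-... for the submenu and gnulinux-<kernel>-advanced-... for the entry); the function name grub_default_for is made up for illustration, and the result should still be checked by eye before it goes into /etc/default/grub.

```shell
#!/bin/sh
# Hypothetical helper; assumes the id naming scheme of Ubuntu's generated grub.cfg.

# Read grub.cfg text on stdin and print the GRUB_DEFAULT line that selects
# kernel release $1 (e.g. 6.5.0-21-generic). Returns non-zero if either the
# "Advanced options" submenu or the requested kernel entry is not found.
grub_default_for() {
    awk -F"'" -v kver="$1" '
        $1 ~ /^[ \t]*submenu *$/ && index($4, "gnulinux-advanced-") == 1 { sub_id = $4 }
        $1 ~ /^[ \t]*menuentry *$/ && index($4, "gnulinux-" kver "-advanced-") == 1 { entry_id = $4 }
        END {
            if (sub_id == "" || entry_id == "") exit 1
            printf "GRUB_DEFAULT=\"%s>%s\"\n", sub_id, entry_id
        }'
}
```

Usage would look like sudo cat /boot/grub/grub.cfg | grub_default_for 6.5.0-21-generic. Recovery-mode entries are skipped because their ids contain -recovery- rather than -advanced-.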
2.4 Apply the change
sudo update-grub
After rebooting:
$ uname -a
Linux industai 6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 9 13:32:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ nvidia-smi
Mon Mar 11 11:40:24 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:02:00.0 Off |                  N/A |
|  0%   37C    P8               7W / 170W |      8MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2174      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
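The post-reboot check can be wrapped in one function so it is easy to rerun after every upgrade. This is a minimal sketch: the function name verify_gpu_stack is hypothetical, and it only confirms that the expected kernel is running and that nvidia-smi can reach the driver.

```shell
#!/bin/sh
# Hypothetical helper for re-checking the machine after a reboot.

# $1: expected kernel release, e.g. 6.5.0-21-generic.
# Fails if a different kernel is running; otherwise prints the driver version
# reported by nvidia-smi and returns nvidia-smi's exit status.
verify_gpu_stack() {
    running=$(uname -r)
    if [ "$running" != "$1" ]; then
        echo "kernel is $running, expected $1" >&2
        return 1
    fi
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
}
```

On this machine, verify_gpu_stack 6.5.0-21-generic should print the driver version and exit with status 0 once the rollback has taken effect.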