How to use the GPU with TensorFlow on CentOS

Command for continuously monitoring GPU usage:

watch -n 10 nvidia-smi

 

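Explanation of the fields:

Fan: fan speed, from 0 to 100%. This is the speed the device intends to run the fan at; if the card is not cooled by a fan, or the fan has failed, N/A is shown.

Temp: temperature inside the GPU, in degrees Celsius.

Perf: performance state, from P0 to P12. P0 is maximum performance, P12 is minimum performance.

Pwr: power consumption, shown as current draw / power cap in watts.

Bus-Id: the PCI bus address of the GPU.

Disp.A: Display Active, whether a display has been initialized on the GPU.

Memory Usage: GPU memory in use, shown as used / total.

Volatile GPU-Util: GPU utilization as a percentage. In the nvidia-smi header the word "Volatile" belongs to the "Volatile Uncorr. ECC" column printed above it; the field itself is GPU-Util.

Compute M: compute mode.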

 

 

watch -n 5 nvidia-smi

The -n option is followed by the interval at which the command is re-run, in seconds.
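The same periodic polling can be scripted when the values need to be logged rather than watched. The following is a minimal sketch, assuming the nvidia-smi on the machine supports `--query-gpu` with the listed fields; the function names and the choice of fields are illustrative, not part of the original post.

```python
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`; this selection is an assumption
# made for illustration.
FIELDS = ["index", "name", "utilization.gpu",
          "memory.used", "memory.total", "temperature.gpu"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(FIELDS):  # skip lines that do not match the field list
            rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def monitor(interval_s=5, iterations=3):
    """Poll every `interval_s` seconds, similar to `watch -n 5 nvidia-smi`."""
    for _ in range(iterations):
        for gpu in query_gpus():
            print(gpu)
        time.sleep(interval_s)
```

On a machine with the NVIDIA driver installed, calling monitor() prints one dictionary per GPU on each pass. Parsing is kept separate from the subprocess call so it can be checked without a GPU.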

 

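import os
# Set these before TensorFlow initializes CUDA (simplest: before importing tensorflow)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number GPUs by PCI bus ID, matching nvidia-smi
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use the first GPU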
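Because these variables are read when CUDA is initialized, the usual advice is to set them before TensorFlow is imported. A small helper can make that ordering explicit. This is only a sketch: select_gpus is a hypothetical name, and the warning is an added safeguard, not something TensorFlow provides.

```python
import os
import sys
import warnings


def select_gpus(ids):
    """Expose only the given GPU indices to CUDA.

    Intended to be called before `import tensorflow`.
    """
    if "tensorflow" in sys.modules:
        # The setting may come too late if CUDA has already been initialized.
        warnings.warn("tensorflow is already imported; the setting may be ignored")
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]
```

For example, select_gpus([0]) reproduces the two assignments above, and select_gpus([0, 1]) would expose the first two cards.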

 

 

from tensorflow.python.client import device_lib 

print(device_lib.list_local_devices())
 
 
 
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 12225675321456196757
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 16026275881129682833
physical_device_desc: "device: XLA_CPU device"
]

 

import tensorflow as tf

# List the physical GPUs and CPUs that TensorFlow can see
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
cpus = tf.config.experimental.list_physical_devices(device_type='CPU')

print(gpus, cpus)
 
[] [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
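An empty GPU list, as in the output above, means TensorFlow did not find a usable GPU. One way to narrow down the cause is to check whether the installed TensorFlow build has CUDA support at all. The sketch below assumes TensorFlow 2.x; diagnose and check_tensorflow_gpu are hypothetical helper names, and the hint texts are illustrative rather than an exhaustive diagnosis.

```python
def diagnose(built_with_cuda, gpu_count):
    """Map the two checks to a short hint (wording is illustrative only)."""
    if gpu_count > 0:
        return "ok: TensorFlow sees %d GPU(s)" % gpu_count
    if not built_with_cuda:
        return "this TensorFlow build has no CUDA support (CPU-only package?)"
    return ("CUDA build, but no GPU visible: check the driver, "
            "the CUDA/cuDNN versions, and CUDA_VISIBLE_DEVICES")


def check_tensorflow_gpu():
    """Run the checks against the installed TensorFlow."""
    import tensorflow as tf  # imported here so diagnose() works without TensorFlow
    gpus = tf.config.experimental.list_physical_devices(device_type="GPU")
    return diagnose(tf.test.is_built_with_cuda(), len(gpus))
```

Calling print(check_tensorflow_gpu()) in the same environment as the snippet above would report which of the three cases applies.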

 

 

 

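AttributeError: module 'tensorflow' has no attribute 'Session'

This error is raised under TensorFlow 2.x, where tf.Session was removed. Code written in the TF1 style can use tf.compat.v1.Session instead.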

 

 

Viewing GPU card information on CentOS

# yum install pciutils lshw -y

# lspci | grep -E "VGA|NVIDIA"
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)

 

# lspci -v -s 04:00.0
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: Dell Device 3600
    Flags: bus master, fast devsel, latency 0, IRQ 63, NUMA node 0
    Memory at a3000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 90000000 (64-bit, prefetchable) [size=256M]
    Memory at a0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at 2000 [size=128]
    [virtual] Expansion ROM at a4080000 [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Legacy Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] #19
    Kernel driver in use: nvidia
    Kernel modules: nouveau, nvidia_drm, nvidia
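The grep filter above can also be done from Python when the slot address is needed for a later step, such as the `lspci -v -s` query. This is a minimal sketch, assuming lspci prints one device per line in the form shown above; the function names are hypothetical.

```python
import re
import subprocess

# Matches lines such as
# "04:00.0 VGA compatible controller: NVIDIA Corporation GP102 ..."
LSPCI_LINE = re.compile(r"^(?P<slot>\S+) (?P<cls>[^:]+): (?P<desc>.+)$")


def find_nvidia_devices(lspci_text):
    """Return (slot, device class, description) for each NVIDIA entry."""
    found = []
    for line in lspci_text.splitlines():
        m = LSPCI_LINE.match(line.strip())
        if m and "NVIDIA" in m.group("desc"):
            found.append((m.group("slot"), m.group("cls"), m.group("desc")))
    return found


def list_nvidia_devices():
    """Run lspci (from the pciutils package) and filter its output."""
    out = subprocess.run(["lspci"], capture_output=True, text=True, check=True).stdout
    return find_nvidia_devices(out)
```

Applied to the two lines printed above, find_nvidia_devices returns the slots 04:00.0 (VGA compatible controller) and 04:00.1 (Audio device).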

 

 

 

Common GPU management commands

1. List all available NVIDIA devices

nvidia-smi -L

 

2. List details of every GPU card

nvidia-smi --query-gpu=index,name,uuid,serial --format=csv

 

3. Query the details of a single GPU card (select the card by its id with -i)

nvidia-smi -i 0 -q

 

4. Monitor overall GPU usage at a 1-second update interval

nvidia-smi dmon

 

5. Monitor per-process GPU usage at a 1-second update interval

nvidia-smi pmon

 

6. The -pm option sets persistence mode: 0 = disabled, 1 = enabled

nvidia-smi -pm 1

 

7. The -e option toggles ECC support: 0 = disabled, 1 = enabled

nvidia-smi -e 1

 

8. The -r option resets a GPU card (0 is the index of the card)

nvidia-smi -r -i 0
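The query in item 2 produces CSV, so its output is easy to consume from a script. The following sketch assumes the output starts with a header row whose fields are separated by a comma and a space; parse_inventory and gpu_inventory are hypothetical names.

```python
import csv
import io
import subprocess


def parse_inventory(csv_text):
    """Parse `--format=csv` output (header row included) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows = []
    for row in reader:
        # Ignore ragged rows, where DictReader produces None keys or values.
        rows.append({k.strip(): v.strip()
                     for k, v in row.items()
                     if k is not None and isinstance(v, str)})
    return rows


def gpu_inventory():
    """Run the query from item 2 and return one dict per GPU."""
    cmd = ["nvidia-smi", "--query-gpu=index,name,uuid,serial", "--format=csv"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_inventory(out)
```

Each returned dictionary is keyed by the column names in the header row, so a card can be looked up by its index or uuid.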

 

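Check whether the required software (CUDA, cuDNN) is installed. Note that nvcc -V reports the CUDA toolkit version only, not cuDNN.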

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
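When the CUDA version has to be compared against what a TensorFlow release expects, the release number can be extracted from this output. This is a sketch that assumes the last line keeps the "release X.Y, VX.Y.Z" form shown above; the function names are hypothetical.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract (release, full version) from `nvcc -V` text, or None if absent."""
    m = re.search(r"release (\d+\.\d+), V([\d.]+)", nvcc_output)
    return (m.group(1), m.group(2)) if m else None


def cuda_toolkit_version():
    """Return the parsed version, or None when nvcc is missing or fails."""
    try:
        out = subprocess.run(["nvcc", "-V"], capture_output=True,
                             text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(out)
```

For the output above, parse_nvcc_release returns ("10.0", "10.0.130"). A result of None from cuda_toolkit_version only means nvcc was not found on PATH or exited with an error, which is not proof that CUDA is absent.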

 

 

REF

https://tensorflow.google.cn/install/gpu?hl=zh_cn [the official site recommends the Docker approach]

https://blog.csdn.net/To_be_little/article/details/124438800

https://www.nhooo.com/note/qa3ovr.html

The NVIDIA GeForce GTX 1080 Ti is based on the 16 nm GP102 core, with 11 GB of GDDR5X memory on a 352-bit bus and up to 3584 stream processors.

https://cloud.tencent.com/developer/article/1486194?from=15425&areaSource=102001.1&traceId=QZ8GVMtf-DfTefbWaYiDW

https://blog.51cto.com/u_15790101/5673579

email: Installing the NVIDIA 1080 Ti driver, CUDA, cuDNN, and TensorFlow on CentOS 7.3

posted @ 2023-10-10 19:57  emanlee