Notes on the Tesla M40 Compute Card
SEO#
Reading motherboard temperatures with sensors
Controlling PWM fan speed on Linux
Getting the NVIDIA GPU core temperature
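Background#
My model training needs about 20 GB of GPU memory, and I wanted a local debugging environment before moving to the cloud.
A proper RTX card is too expensive, while the Tesla M40 24G compute card is cheap and good value; its smaller number of CUDA cores is not a problem for debugging.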
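Installation#
As with a normal graphics card, installing it in a PCIe x16 slot is all it takes. Note that the CPU must have integrated graphics.
The card has no display output. Once the dual 8-pin power is connected, it is detected normally and the NVIDIA driver can be installed.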
Official Drivers | NVIDIA
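(The process is the same as for RTX-series drivers, with one extra choice for the CUDA version.)
If the card is not detected, try enabling the Above 4G Decoding option in the motherboard BIOS; that is usually enough for it to show up.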
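Cooling#
Tesla cards are built for data centers and rely on chassis airflow for cooling, so an active cooling fan has to be added by hand.
I went with a 3D-printed fan shroud fitted with two high-speed server fans (16000 RPM), speed-controlled via PWM.
Checking the temperature-related information: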
# info from nvidia-smi -q -a
Temperature
GPU Current Temp : 48 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : N/A
GPU Target Temperature : 87 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
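According to this output, it is best to keep the GPU below the 87 °C target. Above 89 °C the card starts to throttle, and at 92 °C it shuts down.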
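As a rough illustration of how these three thresholds could be used in a script, the sketch below maps a temperature reading to a status label. The function name classify_gpu_temp and the labels are made up for this example, and treating each threshold as inclusive is my assumption; the numbers are the ones reported by nvidia-smi above.

```shell
#!/usr/bin/env bash
# Thresholds copied from the nvidia-smi output above (degrees Celsius).
TARGET_TEMP=87
SLOWDOWN_TEMP=89
SHUTDOWN_TEMP=92

# classify_gpu_temp TEMP_C
# Hypothetical helper (not part of any NVIDIA tool): prints a status label.
# Assumption: each threshold is treated as inclusive.
classify_gpu_temp() {
    local temp=$1
    if (( temp >= SHUTDOWN_TEMP )); then
        echo "shutdown"
    elif (( temp >= SLOWDOWN_TEMP )); then
        echo "slowdown"
    elif (( temp >= TARGET_TEMP )); then
        echo "above-target"
    else
        echo "ok"
    fi
}
```

For the reading shown above, `classify_gpu_temp 48` prints `ok`.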
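Temperature control#
Temperature#
Because the card has no fan of its own and cannot regulate fan speed by itself, the GPU temperature has to be read manually and the matching fan speed computed from it.
The GPU temperature can be read with a single command: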
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
If that feels slow (about 12 ms per call), the NVML library can be called directly instead.
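When the reading is fed into a script, it is worth validating the output before doing arithmetic on it, since nvidia-smi can print an error message instead of a number when it cannot talk to the driver. A minimal sketch, assuming only the first GPU is of interest; read_gpu_temp_milli is a hypothetical helper name, and the optional command argument exists only so the function can be tried on a machine without a GPU. The result is in millidegrees Celsius, the unit used by the hwmon sysfs temperature files.

```shell
#!/usr/bin/env bash
# read_gpu_temp_milli [COMMAND...]
# Hypothetical helper: runs COMMAND (default: the nvidia-smi query shown
# above), keeps the first line, checks that it is a plain integer, and
# prints the temperature in millidegrees Celsius. Returns 1 on any failure.
read_gpu_temp_milli() {
    local out
    if (( $# > 0 )); then
        out=$("$@") || return 1
    else
        out=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader) || return 1
    fi
    out=${out%%$'\n'*}          # first line only, in case several GPUs are listed
    out=${out//[[:space:]]/}    # strip stray whitespace
    [[ $out =~ ^[0-9]+$ ]] || return 1
    echo $(( 10#$out * 1000 ))  # force base 10 so a leading zero is harmless
}

# Example (needs a working NVIDIA driver):
#   temp=$(read_gpu_temp_milli) || echo "could not read GPU temperature" >&2
```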
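Fan speed#
Whether the fan speed can be read depends mainly on how well the motherboard chipset is supported. Most consumer motherboards either cannot expose fan speed directly under Linux or only come with Windows control software. My ASUS Z370-P II only lets me adjust the Q-Fan temperature curve in the BIOS, and since the GPU fans are plugged into the CHASSIS FAN header, that curve follows the chassis temperature, which is the wrong signal here. The fans therefore have to be controlled manually.
After installing lm-sensors and fancontrol, pwmconfig could not detect any fans: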
/usr/sbin/pwmconfig: There are no pwm-capable sensor modules installed
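Following a suggestion on Stack Overflow, I manually added a kernel parameter and rebooted, after which the PWM devices were detected.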
sudo sed -E -i 's/(GRUB_CMDLINE_LINUX_DEFAULT=.+)"$/\1 acpi_enforce_resources=lax"/' /etc/default/grub
sudo update-grub
sudo reboot
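After the reboot, it can be worth confirming that the parameter actually reached the running kernel before rerunning pwmconfig. A minimal sketch; has_kernel_param is a hypothetical helper, and its second argument (defaulting to /proc/cmdline) is only there so the check can be tried against a sample file.

```shell
#!/usr/bin/env bash
# has_kernel_param PARAM [CMDLINE_FILE]
# Hypothetical helper: succeeds if PARAM appears as a whole word in the
# kernel command line. CMDLINE_FILE defaults to /proc/cmdline.
has_kernel_param() {
    local param=$1
    local cmdline_file=${2:-/proc/cmdline}
    grep -qwF -- "$param" "$cmdline_file"
}

# Example:
#   has_kernel_param acpi_enforce_resources=lax && echo "parameter is active"
```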
Detection result (sensors output):
nct6793-isa-0290
Adapter: ISA adapter
in0: 632.00 mV (min = +0.00 V, max = +1.74 V)
in1: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in2: 3.41 V (min = +0.00 V, max = +0.00 V) ALARM
in3: 3.38 V (min = +0.00 V, max = +0.00 V) ALARM
in4: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in5: 160.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in6: 864.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in7: 3.41 V (min = +0.00 V, max = +0.00 V) ALARM
in8: 3.14 V (min = +0.00 V, max = +0.00 V) ALARM
in9: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in10: 864.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in11: 864.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in12: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in13: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in14: 416.00 mV (min = +0.00 V, max = +0.00 V) ALARM
fan1: 1121 RPM (min = 0 RPM)
fan2: 1053 RPM (min = 0 RPM)
fan3: 4141 RPM (min = 0 RPM)
fan5: 1330 RPM (min = 0 RPM)
fan6: 0 RPM (min = 0 RPM)
SYSTIN: +33.0°C (high = +98.0°C, hyst = +95.0°C) sensor = thermistor
CPUTIN: +41.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
AUXTIN0: +33.5°C sensor = thermistor
AUXTIN1: +33.0°C sensor = thermistor
AUXTIN2: +33.0°C sensor = thermistor
AUXTIN3: +89.0°C sensor = thermistor
PECI Agent 0: +66.0°C (high = +98.0°C, hyst = +95.0°C)
(crit = +100.0°C)
PECI Agent 0 Calibration: +54.0°C
PCH_CHIP_CPU_MAX_TEMP: +0.0°C
PCH_CHIP_TEMP: +0.0°C
intrusion0: ALARM
intrusion1: ALARM
beep_enable: disabled
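Run pwmconfig to detect the fans; it briefly stops each fan to work out which PWM output drives it.
In the end it generates the /etc/fancontrol configuration file. For the syntax, see fancontrol(8) - Linux man page.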
$ ssh alloy cat /etc/fancontrol
# Configuration file generated by pwmconfig, changes will be lost
INTERVAL=10
DEVPATH=hwmon0=devices/virtual/thermal/thermal_zone0 hwmon3=devices/platform/nct6775.656
DEVNAME=hwmon0=acpitz hwmon3=nct6793
FCTEMPS=hwmon3/pwm3=hwmon0/temp1_input
FCFANS= hwmon3/pwm3=hwmon3/fan3_input
MINTEMP=hwmon3/pwm3=60
MAXTEMP=hwmon3/pwm3=85
MINSTART=hwmon3/pwm3=150
MINSTOP=hwmon3/pwm3=100
MINPWM=hwmon3/pwm3=63
My fans are on FAN3, so do not copy this configuration file verbatim. A visualization makes the settings easier to understand:
A graph might help you understand how the different values relate
to each other:
    PWM ^
    255 +
        |
        |
        |                             ,-------------- MAXPWM
        |                           ,'.
        |                         ,'  .
        |                       ,'    .
        |                     ,'      .
        |                   ,'        .
        |                 ,'          .
        |       MINSTOP .'            .
        |               |             .
        |               |             .
        |               |             .
 MINPWM |---------------'             .
        |               .             .
        |               .             .
        |               .             .
      0 +---------------+-------------+---------------->
                     MINTEMP       MAXTEMP      t (degree C)
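To make the relationship in the graph concrete, here is a small sketch of the mapping from temperature to PWM value, using the numbers from my /etc/fancontrol above. It follows my reading of the graph: MINPWM at or below MINTEMP, MAXPWM at or above MAXTEMP, and a straight line from MINSTOP to MAXPWM in between. pwm_for_temp is a hypothetical helper, MAXPWM=255 is assumed because the config does not set it, and the real script also handles MINSTART and other details that are left out here.

```shell
#!/usr/bin/env bash
# Values from the /etc/fancontrol shown above; MAXPWM=255 is an assumption,
# since the generated config does not set it.
MINTEMP=60
MAXTEMP=85
MINSTOP=100
MINPWM=63
MAXPWM=255

# pwm_for_temp TEMP_MILLI
# Hypothetical helper: prints the PWM value (0-255) for a temperature given
# in millidegrees Celsius, using integer arithmetic.
pwm_for_temp() {
    local tval=$1
    local mint=$(( MINTEMP * 1000 ))
    local maxt=$(( MAXTEMP * 1000 ))
    if (( tval <= mint )); then
        echo "$MINPWM"          # flat part left of MINTEMP
    elif (( tval >= maxt )); then
        echo "$MAXPWM"          # flat part right of MAXTEMP
    else
        # straight line from (MINTEMP, MINSTOP) to (MAXTEMP, MAXPWM)
        echo $(( (tval - mint) * (MAXPWM - MINSTOP) / (maxt - mint) + MINSTOP ))
    fi
}
```

With these settings, `pwm_for_temp 72500` (72.5 °C) prints 177, while anything at or below 60 °C stays at 63.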
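Hack#
The last step is to tie the fan speed to the GPU core temperature. I hacked the fancontrol script itself; follow along only if it suits your setup!
The change is at lines 24 to 27, between the "hardcode GPU temp" comments: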
# /usr/sbin/fancontrol
function UpdateFanSpeeds
{
    local fcvcount
    local pwmo tsens fan mint maxt minsa minso minpwm maxpwm
    local tval tlastval pwmpval fanval min_fanval one_fan one_fanval
    local -i pwmval
    let fcvcount=0
    while (( $fcvcount < ${#AFCPWM[@]} )) # go through all pwm outputs
    do
        #hopefully shorter vars will improve readability:
        pwmo=${AFCPWM[$fcvcount]}
        tsens=${AFCTEMP[$fcvcount]}
        fan=${AFCFAN[$fcvcount]}
        let mint="${AFCMINTEMP[$fcvcount]}*1000"
        let maxt="${AFCMAXTEMP[$fcvcount]}*1000"
        minsa=${AFCMINSTART[$fcvcount]}
        minso=${AFCMINSTOP[$fcvcount]}
        minpwm=${AFCMINPWM[$fcvcount]}
        maxpwm=${AFCMAXPWM[$fcvcount]}
        avg=${AFCAVERAGE[$fcvcount]}
        #read tlastval < ${tsens}
        # hardcode GPU temp start (read the GPU core temperature, convert to millidegrees)
        tlastval=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader) && let tlastval="$tlastval*1000"
        # hardcode GPU temp end
        if [ $? -ne 0 ]
        then
            echo "Error reading GPU temperature from nvidia-smi"
            restorefans 1
        fi
        # ... (remainder of UpdateFanSpeeds omitted in this excerpt)
Once done, restart the service for the change to take effect:
sudo service fancontrol restart
References#
How to see the Video Card Temperature (Nvidia, ATI, Intel...) - Ask Ubuntu
NVIDIA Management Library (NVML) | NVIDIA Developer
usr/sbin/pwmconfig: There are no pwm-capable sensor modules installed MSI | ubuntu 16.04 fancontrol - Stack Overflow
Cippo95/nvidia-fan-control: Controlling fans on my NVIDIA graphics card
fancontrol(8) - Linux man page
lm-sensors/doc/fancontrol.txt at master · lm-sensors/lm-sensors