Investigating High CPU Usage in the Erlang VM
- Background
Recently, among one group of production services, a few nodes showed very low server CPU usage, only about a quarter of the other nodes. After ruling out uneven business load, we suspected top's statistics were simply wrong. Following the approach in 《Erlang调度器的利用率调查》 (an article on investigating Erlang scheduler utilization), we used erlang:statistics(scheduler_wall_time) and found that on the machines with low server CPU the schedulers' actual utilization was close to 100%, while on the other machines it was under 30%.
Comparing the different services, we found that only nodes running a large number of processes exhibited this problem.
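The dumps in the next two sections were collected with the standard sampling idiom from the OTP documentation; a minimal sketch, run in the affected node's shell (the 10-second window is arbitrary):

erlang:system_flag(scheduler_wall_time, true).
Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)).
timer:sleep(10000).
Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)).
%% Per-scheduler utilization: delta of active time / delta of total time.
lists:map(fun({{I, A0, T0}, {I, A1, T1}}) ->
              {I, (A1 - A0) / (T1 - T0)}
          end, lists:zip(Ts0, Ts1)).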
- High utilization
Cpu0 : 68.2%us, 11.4%sy,  0.0%ni,  3.7%id,  0.0%wa,  0.0%hi, 16.7%si,  0.0%st
Cpu1 : 81.6%us, 13.4%sy,  0.0%ni,  4.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu2 :  1.3%us,  0.3%sy,  0.0%ni, 98.0%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3 :  1.3%us,  0.3%sy,  0.0%ni, 98.0%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4 :  0.7%us,  1.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

[{total,0.06939920430400189},
 {1,0.8268947112724254},
 {2,0.8384896040437823},
 {3,8.867169683843468e-6},
 {4,1.0365168328954172e-5},
 {5,1.0024957622820418e-5},
 {6,8.853059601737486e-6},
 {7,8.402152852410522e-6},
 {8,7.63072324998243e-6},
 {9,8.474728373485058e-6},
 {10,1.1576532481056016e-5},
 {11,1.6194115883237974e-5},
 {12,1.44167774196793e-5},
 {13,9.819386220386243e-6},
 {14,7.892097518034394e-6},
 {15,7.163693583884608e-6},
 {16,7.1916733850567694e-6},
 {17,7.148667780983167e-6},
 {18,6.134044075369504e-6},
 {19,8.508809953551505e-6},
 {20,8.418451926460262e-6},
 {21,7.99327959145592e-6},
 {22,1.0466001303723225e-5},
 {23,1.165690052491439e-5},
 {24,1.110477551389582e-5}]
- Low utilization
Cpu0 : 50.9%us, 13.7%sy,  0.0%ni, 21.4%id,  0.0%wa,  0.0%hi, 14.0%si,  0.0%st
Cpu1 : 70.5%us, 14.1%sy,  0.0%ni, 15.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2 : 68.2%us, 16.1%sy,  0.0%ni, 15.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3 : 68.4%us, 15.1%sy,  0.0%ni, 16.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4 : 69.9%us, 15.4%sy,  0.0%ni, 14.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5 : 67.9%us, 14.1%sy,  0.0%ni, 17.6%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu6 : 50.6%us, 13.4%sy,  0.0%ni, 35.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7 : 41.5%us, 10.8%sy,  0.0%ni, 47.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8 : 32.1%us,  9.6%sy,  0.0%ni, 58.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9 :  0.7%us,  1.0%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

[{total,0.09717691269949602},
 {1,0.18233794842702225},
 {2,0.29948956042597197},
 {3,0.29957276129564725},
 {4,0.3039782882961829},
 {5,0.29122320708472654},
 {6,0.31739635715470543},
 {7,0.2454373354022171},
 {8,0.26177474519403443},
 {9,0.13033757084128628},
 {10,8.274133360405898e-5},
 {11,4.181167100346997e-5},
 {12,4.0870150064878635e-5},
 {13,4.012856385623552e-5},
 {14,4.402024019534071e-5},
 {15,4.464950668964882e-5},
 {16,4.662729312473385e-5},
 {17,4.765041344339578e-5},
 {18,4.442241285611087e-5},
 {19,4.494246472994659e-5},
 {20,4.1057127449095396e-5},
 {21,4.487741704964992e-5},
 {22,3.939601150277982e-5},
 {23,4.02231871509171e-5},
 {24,3.866736564497342e-5}]
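The {total, ...} entry at the head of each list is the utilization aggregated over all 24 schedulers, which is why two busy schedulers on the first machine average out to about 0.07. A sketch of that aggregation, reusing Ts0 and Ts1 from the sample above:

%% Aggregate active and total time deltas across all schedulers.
{A, T} = lists:foldl(fun({{_, A0, T0}, {_, A1, T1}}, {Ai, Ti}) ->
                         {Ai + (A1 - A0), Ti + (T1 - T0)}
                     end, {0, 0}, lists:zip(Ts0, Ts1)).
A / T.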
- The WhatsApp case
There are not many documented Erlang cases of this kind; fortunately, WhatsApp published a detailed analysis of a similar problem:
First bottleneck showed up at 425K. System ran into a lot of contention. Work stopped. Instrumented the scheduler to measure how much useful work is being done, or sleeping, or spinning. Under load it started to hit sleeping locks so 35-45% CPU was being used across the system but the schedulers are at 95% utilization.
- Tool analysis
Via Weibo direct message I consulted 郑思瑶, who recommended profiling with VTune and suspected that scheduler overhead becomes excessive when there are a large number of processes.
Registering on Intel's official website brings an immediate reply email with the download address and a 30-day trial.
The download is slow, so using a VPN is recommended. The Linux command-line version of VTune is simple to use:
tar -zxf vtune_amplifier_xe_2015.tar.gz
cd vtune_amplifier_xe_2015
./install.sh
cd /opt/intel/vtune_amplifier_xe_2015.1.0.367959/
source amplxe-vars.sh
amplxe-cl -collect lightweight-hotspots -run-pass-thru=--no-altstack -target-pid=1575
amplxe-cl -report hotspots
It can be run directly against the production process without disturbing the service, and produced the following result:
Summary
-------
Elapsed Time:       19.345
CPU Time:          182.023
Average CPU Usage:   9.155
CPI Rate:            1.501

Function                   Module              CPU Time:Self
-------------------------  ------------------  -------------
sched_spin_wait            beam.smp                   72.754
raw_local_irq_enable       vmlinux                    19.282
process_main               beam.smp                   10.476
ethr_native_atomic32_read  beam.smp                    8.337
func@0xffffffff8100af60    vmlinux                     3.007
__pthread_mutex_lock       libpthread-2.12.so          2.342
raw_local_irq_restore      vmlinux                     1.973
__sched_yield              libc-2.12.so                1.913
pthread_mutex_unlock       libpthread-2.12.so          1.553
__audit_syscall_exit       vmlinux                     1.192
system_call                vmlinux                     1.156
erts_thr_yield             beam.smp                    1.114
handle_delayed_dealloc     beam.smp                    0.977
update                     beam.smp                    0.828
raw_local_irq_enable       vmlinux                     0.780
Here sched_spin_wait alone accounts for about 40% of the CPU time (72.754 of 182.023 seconds). The corresponding ERTS source:
#define ERTS_SCHED_SPIN_UNTIL_YIELD 100

static erts_aint32_t
sched_spin_wait(ErtsSchedulerSleepInfo *ssi, int spincount)
{
    int until_yield = ERTS_SCHED_SPIN_UNTIL_YIELD;
    int sc = spincount;
    erts_aint32_t flgs;

    do {
        flgs = erts_smp_atomic32_read_acqb(&ssi->flags);
        if ((flgs & (ERTS_SSI_FLG_SLEEPING|ERTS_SSI_FLG_WAITING))
            != (ERTS_SSI_FLG_SLEEPING|ERTS_SSI_FLG_WAITING)) {
            break;
        }
        ERTS_SPIN_BODY;
        if (--until_yield == 0) {
            until_yield = ERTS_SCHED_SPIN_UNTIL_YIELD;
            erts_thr_yield();
        }
    } while (--sc > 0);
    return flgs;
}
The default spincount is 10000, and every iteration performs an atomic read. Since an atomic operation typically costs tens to hundreds of CPU cycles, running this busy-wait loop to completion takes a long time.
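A rough back-of-envelope estimate of that cost, written as Erlang shell expressions (the ~100 cycles per atomic read and the 2.5 GHz clock are assumptions, not measurements):

%% Cost of one full spin pass, under the assumptions above.
SpinCycles = 10000 * 100.     %% spincount * cycles per atomic read = 1.0e6
SpinMs = SpinCycles / 2.5e6.  %% cycles / (cycles per ms at 2.5 GHz) ≈ 0.4 ms

So each time a scheduler runs out of work it can burn a fraction of a millisecond of CPU before actually going to sleep, and with 24 schedulers doing this repeatedly the waste adds up.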
We also found the corresponding configuration option:
+sbwt none|very_short|short|medium|long|very_long
    Set scheduler busy wait threshold. Default is medium. The threshold
    determines how long schedulers should busy wait when running out of
    work before going to sleep.
Starting the emulator with +sbwt none switches spinning off entirely, so there is no need to patch beam the way WhatsApp did.
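To verify the effect after a restart, one option is microstate accounting (a sketch; msacc ships with runtime_tools in OTP 19+, which may be newer than the release used here), which reports how each scheduler thread splits its time between running Erlang code and sleeping:

%% Assumes OTP 19+ (runtime_tools). Collect 10 s of per-thread stats.
msacc:start(10000).
msacc:print().  %% per-thread breakdown: emulator, check_io, sleep, ...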
We also compared system calls on the two machines with strace:
The high-utilization machine:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.07    0.022954           0     49747           writev
 33.15    0.022336           0     58925     11806 recvfrom
 21.04    0.014176           1     14558      2394 futex
  8.37    0.005636           0     24722         4 epoll_ctl
  2.71    0.001828           0      6947           epoll_wait
  0.30    0.000199          10        19           munmap
  0.24    0.000164           0       612           sched_yield
  0.12    0.000078          16         5           mmap
  0.00    0.000000           0         4           close
The low-utilization machine:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 86.38   13.138511         307     42838      7218 futex
  6.91    1.050431          12     90659           writev
  5.70    0.866909          13     69221     25351 recvfrom
  0.54    0.082772           9      9307           sched_yield
  0.43    0.065219           1     50157           epoll_ctl
  0.01    0.002220          34        66           munmap
  0.01    0.001092           1       815           epoll_wait
  0.00    0.000675          20        34           mmap
  0.00    0.000612          19        32        32 connect
  0.00    0.000564           9        64           fcntl
  0.00    0.000529          17        32           getsockname
  0.00    0.000457          14        32           getsockopt
  0.00    0.000341          11        32           socket
  0.00    0.000127          16         8           read
  0.00    0.000109           3        32           bind
The two traces cover different amounts of time, as the writev counts show (49,747 vs 90,659 calls). Normalized against writev, the second machine issues nearly twice as many futex calls (14,558/49,747 ≈ 0.29 vs 42,838/90,659 ≈ 0.47 futex per writev), i.e. its scheduler threads are sleeping and waking far more heavily.