Notes on trying out Valgrind
Valgrind is a full-featured code diagnostics tool. On Ubuntu it can be installed with
sudo apt-get install valgrind
The manual can be downloaded as a PDF from the official website.
Valgrind can diagnose memory leaks:
g++ xxx.cpp
valgrind --tool=memcheck ./a.out
It reports where memory is leaked.
It can also profile cache behaviour:
g++ xxx.cpp
valgrind --tool=cachegrind ./a.out
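It reports the L1 data-cache miss rate, the L1 instruction-cache miss rate, the last-level cache miss rate, and related figures.
Here is an example: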
#include<iostream>
#include<ctime>
using namespace std;

const size_t N = 1E3;

int main(){
  double y=0,z=0;
  clock_t tstart = clock();

  // N x N matrices stored row-major in flat arrays
  double *A = new double [N*N];
  for(size_t i=0;i<N*N;i++)A[i]=i;
  double *B = new double [N*N];
  for(size_t i=0;i<N*N;i++)B[i]=i;
  double *C = new double [N*N];

  for(size_t i=0;i<N;i++)
    for(size_t j=0;j<N;j++){
      z=0;
      // walks down column j of B: consecutive accesses are N doubles apart
      for(size_t l=0;l<N;l++)
        z += B[l*N+j];
      y=0;
      // walks down column i of A: consecutive accesses are N doubles apart
      for(size_t k=0;k<N;k++){
        y += A[k*N+i] * z;
      }
      C[i*N+j]=y;
    }

  clock_t t1=clock();
  cout<<(double)(t1-tstart)/CLOCKS_PER_SEC<<" s"<<endl;

  delete [] A;
  delete [] B;
  delete [] C;
  return 0;
}
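In the inner loops of this code, l and k are row indices, so consecutive iterations touch elements a whole row apart and memory locality is poor. In the Valgrind report this shows up as a worse L1 data-cache hit rate (D1 miss rate: 11.8%).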
==2322== Cachegrind, a cache and branch-prediction profiler
==2322== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2322== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2322== Command: ./a.out
==2322==
--2322-- warning: L3 cache found, using its data for the LL simulation.
276.428 s
==2322==
==2322== I   refs:      31,053,190,692
==2322== I1  misses:             1,976
==2322== LLi misses:             1,928
==2322== I1  miss rate:           0.00%
==2322== LLi miss rate:           0.00%
==2322==
==2322== D   refs:      17,025,701,244  (15,018,537,430 rd + 2,007,163,814 wr)
==2322== D1  misses:     2,001,266,444  ( 2,000,014,098 rd +     1,252,346 wr)
==2322== LLd misses:       125,490,381  (   125,113,840 rd +       376,541 wr)
==2322== D1  miss rate:           11.8% (          13.3%   +           0.1%  )
==2322== LLd miss rate:            0.7% (           0.8%   +           0.0%  )
==2322==
==2322== LL refs:        2,001,268,420  ( 2,000,016,074 rd +     1,252,346 wr)
==2322== LL misses:        125,492,309  (   125,115,768 rd +       376,541 wr)
==2322== LL miss rate:             0.3% (           0.3%   +           0.0%  )
In the code below, by contrast, the inner-loop indices k and l are column indices, so memory locality is better:
#include<iostream>
#include<ctime>
using namespace std;

const size_t N = 1E3;

int main(){
  double y=0,z=0;
  clock_t tstart = clock();

  // N x N matrices stored row-major in flat arrays
  double *A = new double [N*N];
  for(size_t i=0;i<N*N;i++)A[i]=i;
  double *B = new double [N*N];
  for(size_t i=0;i<N*N;i++)B[i]=i;
  double *C = new double [N*N];

  for(size_t i=0;i<N;i++)
    for(size_t j=0;j<N;j++){
      z=0;
      // walks along row j of B: consecutive accesses are adjacent in memory
      for(size_t l=0;l<N;l++)
        z += B[j*N+l];
      y=0;
      // walks along row i of A: consecutive accesses are adjacent in memory
      for(size_t k=0;k<N;k++){
        y += A[i*N+k] * z;
      }
      C[i*N+j]=y;
    }

  clock_t t1=clock();
  cout<<(double)(t1-tstart)/CLOCKS_PER_SEC<<" s"<<endl;

  delete [] A;
  delete [] B;
  delete [] C;
  return 0;
}
In the Cachegrind report this shows up as a better L1 data-cache hit rate (D1 miss rate: 0.7%).
==2334== Cachegrind, a cache and branch-prediction profiler
==2334== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2334== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2334== Command: ./a.out
==2334==
--2334-- warning: L3 cache found, using its data for the LL simulation.
202.343 s
==2334==
==2334== I   refs:      31,053,190,658
==2334== I1  misses:             1,974
==2334== LLi misses:             1,926
==2334== I1  miss rate:           0.00%
==2334== LLi miss rate:           0.00%
==2334==
==2334== D   refs:      17,025,701,233  (15,018,537,423 rd + 2,007,163,810 wr)
==2334== D1  misses:       125,517,445  (   125,140,099 rd +       377,346 wr)
==2334== LLd misses:       125,510,970  (   125,134,429 rd +       376,541 wr)
==2334== D1  miss rate:            0.7% (           0.8%   +           0.0%  )
==2334== LLd miss rate:            0.7% (           0.8%   +           0.0%  )
==2334==
==2334== LL refs:          125,519,419  (   125,142,073 rd +       377,346 wr)
==2334== LL misses:        125,512,896  (   125,136,355 rd +       376,541 wr)
==2334== LL miss rate:             0.3% (           0.3%   +           0.0%  )
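Under Valgrind the two programs ran for 276.428 s and 202.343 s. Without Valgrind they ran for 24.0162 s and 7.99471 s. The two reports differ almost only in the D1 miss rate, by about 11 percentage points (11.8% versus 0.7%); LL refs differ by an order of magnitude, but the LL miss counts are nearly identical. That difference goes along with a roughly threefold gap in native run time. This suggests that the arithmetic the CPU has to do costs far less than the extra work caused by those D1 misses.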
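About the cache hierarchy: when the CPU needs a piece of data, it looks for it in the following order: L1 cache -> L2 cache -> ... -> last-level cache -> main memory. If the data is found in the L1 cache, the search stops there and the data is used; the deeper the search has to go, the higher the time cost. If the data finally has to be fetched from main memory (a cache miss), a whole "data block" is fetched at once and stored in the cache, so if the next piece of data needed lies in the same block, that cost is saved. The higher the cache hit rate, the better the program performs, which is why memory locality matters so much.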