Matrix multiplication on the stack vs. matrix multiplication on the heap
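1. Multiplying arrays on the stack
Define three 500 x 500 arrays of double as local variables. Together they occupy 3 x 500 x 500 x 8 = 6 x 10^6 bytes, roughly 6 MB, while the stack is a little under 8 MB. Then multiply the matrices and time the run.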
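2. Allocating heap memory with new through pointers on the stack, then multiplying
Declare double **a, **b, **c and new three 500 x 500 arrays. Only the three pointers a, b, c themselves sit on the stack. The roughly 3 x 500 row pointers are allocated with new as well, so they live on the heap together with the 3 x 500 x 500 x 8 = 6 x 10^6 bytes (under 6 MiB) of matrix data they point to. Then multiply the matrices and time the run.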
The code is as follows:
#include <cmath>
#include <ctime>
#include <iostream>
using namespace std;

// A constant size makes a[n][n] a standard fixed-size array
// instead of a variable-length array (a compiler extension).
const int n = 500;

// Multiply two n x n matrices held entirely in local (stack) arrays.
double stack_multiply(void){
    double a[n][n], b[n][n], c[n][n];
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            a[i][j] = i*j;
            b[i][j] = i*j;
        }
    }
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            double y=0;
            for(int k=0;k<n;k++){
                y += a[i][k] * b[k][j];
            }
            c[i][j] = y;
        }
    }
    return c[200][200];
}

// Multiply two n x n matrices stored as arrays of row pointers on the heap.
void heap_multiply(double **a, double **b, double **c){
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            double y=0;
            for(int k=0;k<n;k++){
                y += a[i][k] * b[k][j];
            }
            c[i][j] = y;
        }
    }
}

int main(){
    // Stack version. Note: this interval also covers filling a and b
    // and the cout below, not only the multiplication.
    clock_t t_start = clock();
    int repeat = 1E0;
    double y;
    for(int i=0;i<repeat;i++) y = stack_multiply();
    cout<<"\t\tc[200][200]="<<y<<endl;
    clock_t t_end = clock();
    cout<<" It took me "<< (double)(t_end- t_start)/repeat/CLOCKS_PER_SEC<<"s to do the matrix multiplication in stack."<<endl;

    // Heap version: allocate and fill a.
    double **a=new double *[n];
    for(int i=0;i<n;i++){
        a[i] = new double [n];
        for(int j=0;j<n;j++){
            a[i][j] = i*j;
        }
    }

    // Optional allocations between a and b, to simulate heap fragmentation.
    int piece = 1E4;
    /*
    double ***fragment = new double ** [piece];
    for(int i=0;i<piece;i++){
        fragment[i] = new double * [piece];
        for(int j=0;j<piece;j++){
            fragment[i][j] = new double [piece];
        }
    }
    */

    // Allocate and fill b, then allocate c.
    double **b=new double *[n];
    for(int i=0;i<n;i++){
        b[i] = new double [n];
        for(int j=0;j<n;j++){
            b[i][j] = i*j;
        }
    }
    double **c = new double * [n];
    for(int i=0;i<n;i++) c[i] = new double [n];

    // Only the multiplication itself is timed here.
    t_start = clock();
    heap_multiply(a, b, c);
    t_end = clock();
    cout<<"\t\tc[200][200]="<<c[200][200]<<endl;

    // Release everything that was allocated with new.
    for(int i=0;i<n;i++) delete [] a[i];
    delete [] a;
    /*
    for(int i=0;i<piece;i++){
        for(int j=0;j<piece;j++){
            delete []fragment[i][j];
        }
        delete [] fragment[i];
    }
    delete [] fragment;
    */
    for(int i=0;i<n;i++) delete [] b[i];
    delete [] b;
    for(int i=0;i<n;i++) delete [] c[i];
    delete [] c;
    cout<<" It took me "<< (double)(t_end- t_start)/CLOCKS_PER_SEC<<"s to do the matrix multiplication in heap."<<endl;
    return 0;
}
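The fragment block is there to simulate heap fragmentation.
With the fragment block commented out (no fragmentation on the heap, so matrix a and matrix b sit right next to each other), the two multiplications take about the same time, and the heap version is even slightly faster: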
g++ main.cpp
./a.out
c[200][200]=1.66167e+12
It took me 0.71875s to do the matrix multiplication in stack
c[200][200]=1.66167e+12
It took me 0.671875s to do the matrix multiplication in heap
Note that compiling with -O2 brings the time down to a bit more than 1/3 of the original:
g++ main.cpp -O2
./a.out
c[200][200]=1.66167e+12
It took me 0.328125s to do the matrix multiplication in stack
c[200][200]=1.66167e+12
It took me 0.25s to do the matrix multiplication in heap
If the fragment block is enabled to simulate heap fragmentation, with piece = 1E1, I get:
g++ main.cpp -O2
./a.out
c[200][200]=1.66167e+12
It took me 0.296875s to do the matrix multiplication in stack
c[200][200]=1.66167e+12
It took me 0.234375s to do the matrix multiplication in heap
Setting piece = 1E2 gives about the same result:
g++ main.cpp -O2
./a.out
c[200][200]=1.66167e+12
It took me 0.28125s to do the matrix multiplication in stack
c[200][200]=1.66167e+12
It took me 0.203125s to do the matrix multiplication in heap
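So the two approaches seem to perform about the same. A possible reason: a read from memory, whether on the stack or on the heap, brings in a whole block of neighbouring cells (a cache line) at once, so as long as the accesses do not jump around between scattered addresses, the program runs fast in both cases.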
During testing, however, there was one run with piece = 1E2 in which the second approach became very slow, so the conclusion is not entirely certain.