猿代码: High-Performance Traditional Optimization Techniques
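High-performance algorithms
Installing LAPACK: the LAPACK source tree ships both BLAS and LAPACK, so one build covers both. Downloading it was troublesome, and the fix eventually came from a Zhihu comment thread. TODO: add a CMake usage guide.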
cd lapack-3.11
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=RELEASE -DBUILD_SHARED_LIBS=ON
make
CMake usage guide
Download PETSc
./configure --prefix=../petsc_install --with-mpi-dir=/thfs1/software/mpich/mpi-n-gcc9.3.0 --with-blas-lapack-dir=../lapack-3.11/build
make PETSC_DIR=/thfs1/home/monkeycode/training_system/zjk/petsc-3.18.1 PETSC_ARCH=arch-linux-c-debug all
make PETSC_DIR=/thfs1/home/monkeycode/training_system/zjk/petsc-3.18.1 PETSC_ARCH=arch-linux-c-debug install
mpicc ex1.c -o ex1 -I./petsc_install/include -L./petsc_install/lib -Wl,-rpath=./petsc_install/lib -lpetsc
srun -p thcp1 -n 1 ex1
KSP Object: 1 MPI process
type: gmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
happy breakdown tolerance 1e-30
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI process
type: jacobi
type DIAGONAL
linear system matrix = precond matrix:
Mat Object: 1 MPI process
type: seqaij
rows=10, cols=10
total: nonzeros=28, allocated nonzeros=50
total number of mallocs used during MatSetValues calls=0
not using I-node routines
Norm of error 2.41202e-15, Iterations 5
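Program performance analysis
**Static analysis** Use Understand for static analysis
Understand
Mainly used to analyze the program's control flow and call relationships.
**Dynamic analysis** Use gprof for dynamic analysis
gprof
Besides the call relationships between functions, gprof also reports how the run time is distributed across them.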
g++ -pg main.cpp -o main
srun -N 1 -n 1 -p thcp1 ./main
gprof main gmon.out >output.txt
chmod +x gprof2dot.py
./gprof2dot.py output.txt | dot -Tpng -o output.png # use gprof2dot to generate the call-graph image
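To have something for gprof to report, it helps to profile a program where one function clearly dominates. The sketch below is only an illustration: `cheap_work` and `expensive_work` are hypothetical names, not taken from the original main.cpp. When built with `-pg` and without optimization, the flat profile would be expected to put most of the self time in `cheap_work`, while the call graph shows that almost all of its calls come from `expensive_work`.

```c
#include <stddef.h>

/* Hypothetical cheap function: a single pass over n values. */
long cheap_work(size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)(i % 7);
    return sum;
}

/* Hypothetical expensive function: calls cheap_work n times,
 * so its total cost grows roughly with n * n. */
long expensive_work(size_t n)
{
    long sum = 0;
    for (size_t rep = 0; rep < n; rep++)
        sum += cheap_work(n);
    return sum;
}
```

Calling `expensive_work` with a few thousand elements from `main` should produce a run long enough for the sampling profiler to collect useful data.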
Timing
#include <stdio.h>
#include <time.h>   /* clock(), clock_t, CLOCKS_PER_SEC */
clock_t start, end;
start = clock();
/* ... code being timed ... */
end = clock();
printf("%f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
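The timing pattern above can be wrapped in a small helper so that any region can be measured the same way. This is a minimal sketch: `cpu_seconds` and `busy_loop` are hypothetical names introduced here for illustration.

```c
#include <time.h>

/* Hypothetical helper: CPU seconds consumed by one call of fn(arg). */
double cpu_seconds(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical workload, used only to demonstrate the helper.
 * arg points to a long holding the iteration count. */
void busy_loop(void *arg)
{
    volatile double acc = 0.0;   /* volatile keeps the loop from being optimized away */
    long n = *(long *)arg;
    for (long i = 0; i < n; i++)
        acc += (double)i * 0.5;
}
```

Note that `clock()` measures processor time used by the program, not wall-clock time, so time spent waiting (for example on I/O) is not counted. In MPI programs, `MPI_Wtime()` is the usual choice for wall-clock timing.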
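Other analysis tools: valgrind + QCachegrind
Building and running serial HPCG: our school previously required getting HPL to run, and HPCG is noticeably simpler by comparison.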
cd setup
cp Make.Linux_Serial ../
# edit Make.Linux_MPI and fill in the MPI path
mkdir build
cd build
../configure Linux_Serial
vim makefile # add the -pg flag
make
cd bin
srun -n 1 -N 1 -p thcp1 xhpcg
gprof xhpcg gmon.out >output.txt
Profiling the Jacobi program with gprof
The resulting call graph is shown here.
![img](file:///C:/Users/10235/AppData/Local/Packages/Microsoft.Windows.Photos_8wekyb3d8bbwe/TempState/ShareServiceTempFolder/output.jpeg)
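Traditional performance optimization
From the architecture perspective
(1) Higher clock frequency
(2) Caches
(3) Pipelining
(4) Parallelism (superscalar execution)
Common loop optimization techniques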
(1) Loop fusion
before
int i;
for(i = 0; i < n; i++) x[i] = a[i] + b[i];
for(i = 0; i < n; i++) y[i] = a[i] - b[i];
after
int i;
for(i = 0; i < n; i++) {
    x[i] = a[i] + b[i];
    y[i] = a[i] - b[i];
}
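The fusion above can be checked with a self-contained sketch. The function names are hypothetical, and the code assumes that `x` and `y` do not overlap `a` or `b`; without that assumption the two versions could give different results. The fused version reads `a[i]` and `b[i]` once per iteration instead of twice across two passes.

```c
#include <stddef.h>

/* Two separate passes: a[] and b[] are each traversed twice. */
void sum_diff_separate(const double *a, const double *b,
                       double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) x[i] = a[i] + b[i];
    for (size_t i = 0; i < n; i++) y[i] = a[i] - b[i];
}

/* Fused pass: each a[i], b[i] is loaded once and used for both results. */
void sum_diff_fused(const double *a, const double *b,
                    double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = a[i] + b[i];
        y[i] = a[i] - b[i];
    }
}
```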
(2) Loop unrolling
before
int i = 0;
for(i = 0; i < N; i++) A[i] = A[i] + B[i];
after
int i = 0;
for(i = 0; i + 3 < N; i += 4) {
    A[i] = A[i] + B[i];
    A[i + 1] = A[i + 1] + B[i + 1];
    A[i + 2] = A[i + 2] + B[i + 2];
    A[i + 3] = A[i + 3] + B[i + 3];
}
for(; i < N; i++) A[i] = A[i] + B[i];  /* remainder when N is not a multiple of 4 */
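A self-contained sketch of unrolling by four, with a cleanup loop so the length does not have to be a multiple of 4. The function names are hypothetical. Whether manual unrolling helps in practice depends on the compiler, which often unrolls on its own at higher optimization levels.

```c
#include <stddef.h>

/* Reference version: one element per iteration. */
void add_rolled(double *A, const double *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[i] = A[i] + B[i];
}

/* Unrolled by 4: fewer loop-control instructions per element. */
void add_unrolled4(double *A, const double *B, size_t n)
{
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        A[i]     = A[i]     + B[i];
        A[i + 1] = A[i + 1] + B[i + 1];
        A[i + 2] = A[i + 2] + B[i + 2];
        A[i + 3] = A[i + 3] + B[i + 3];
    }
    /* Cleanup loop for the last n % 4 elements. */
    for (; i < n; i++)
        A[i] = A[i] + B[i];
}
```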
(3) Loop interchange
before
int j, k, i;
for(j = 0; j < N; j++)
    for(k = 0; k < N; k++)
        for(i = 0; i < N; i++)
            A[i][j] += B[i][k] + C[k][j];
after
int j, k, i;
for(j = 0; j < N; j++)
    for(i = 0; i < N; i++)
        for(k = 0; k < N; k++)
            A[i][j] += B[i][k] + C[k][j];
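The two loop orders above can be compared in a small sketch. `DIM` and the function names are hypothetical. C stores arrays in row-major order, so with `i` innermost both `A[i][j]` and `B[i][k]` jump a full row on every iteration; with `k` innermost, `A[i][j]` stays fixed and `B[i][k]` is read contiguously, leaving only `C[k][j]` strided. Both orders visit `k` in increasing order for each `A[i][j]`, so the results match.

```c
#define DIM 8   /* hypothetical small size for illustration */

/* Original order j, k, i: the innermost index i strides by rows. */
void accumulate_jki(double A[DIM][DIM], double B[DIM][DIM], double C[DIM][DIM])
{
    for (int j = 0; j < DIM; j++)
        for (int k = 0; k < DIM; k++)
            for (int i = 0; i < DIM; i++)
                A[i][j] += B[i][k] + C[k][j];
}

/* Interchanged order j, i, k: B[i][k] is contiguous in the inner loop. */
void accumulate_jik(double A[DIM][DIM], double B[DIM][DIM], double C[DIM][DIM])
{
    for (int j = 0; j < DIM; j++)
        for (int i = 0; i < DIM; i++)
            for (int k = 0; k < DIM; k++)
                A[i][j] += B[i][k] + C[k][j];
}
```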
(4) Loop distribution (loop fission)
before
int i;
for(i = 0; i < N; i++) {
    A[i] = i;
    B[i] = 2 + B[i];
    C[i] = 3 + C[i - 1];  /* depends on the previous iteration */
}
after
int i;
for(i = 0; i < N; i++) {
    A[i] = i;
    B[i] = 2 + B[i];
}
for(i = 0; i < N; i++) C[i] = 3 + C[i - 1];  /* the loop-carried dependence is isolated here */
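A self-contained sketch of the same split. The function names are hypothetical, and the loops start at `i = 1` so that `C[i - 1]` stays in bounds; that start index is an adjustment made for this sketch, not part of the snippet above. Separating the recurrence on `C` leaves a first loop with independent iterations, which a compiler can vectorize more easily.

```c
#include <stddef.h>

/* Single loop: the recurrence on C is mixed with independent work. */
void update_single_loop(double *A, double *B, double *C, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        A[i] = (double)i;
        B[i] = 2.0 + B[i];
        C[i] = 3.0 + C[i - 1];
    }
}

/* Distributed: independent statements first, recurrence in its own loop. */
void update_distributed(double *A, double *B, double *C, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        A[i] = (double)i;
        B[i] = 2.0 + B[i];
    }
    for (size_t i = 1; i < n; i++)
        C[i] = 3.0 + C[i - 1];
}
```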
(5) Loop-invariant code motion (hoisting)
before
for(i = 0; i < N; i++)
    for(j = 0; j < M; j++)
        U[i] += W[i] * W[i] * D[j] / (dt * dt);
after
T1 = dt * dt;
for(i = 0; i < N; i++) {
    T2 = W[i] * W[i];
    for(j = 0; j < M; j++) {
        U[i] += T2 * D[j] / T1;
    }
}
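A self-contained sketch of the hoisting above, with hypothetical function names. `dt * dt` does not change in either loop and `W[i] * W[i]` does not change in the inner loop, so each is computed once at the level where it becomes constant. The arithmetic is performed in the same order as before, so the results should match the naive version.

```c
#include <stddef.h>

/* Naive version: recomputes W[i]*W[i] and dt*dt on every inner iteration. */
void accumulate_naive(double *U, const double *W, const double *D,
                      size_t n, size_t m, double dt)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            U[i] += W[i] * W[i] * D[j] / (dt * dt);
}

/* Hoisted version: invariants are computed once outside the loops they
 * do not depend on. */
void accumulate_hoisted(double *U, const double *W, const double *D,
                        size_t n, size_t m, double dt)
{
    const double dt2 = dt * dt;          /* invariant in both loops */
    for (size_t i = 0; i < n; i++) {
        const double w2 = W[i] * W[i];   /* invariant in the inner loop */
        for (size_t j = 0; j < m; j++)
            U[i] += w2 * D[j] / dt2;
    }
}
```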
Jacobi optimization example
initial
loop fusion
loop interchange
loop-invariant code motion
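As a rough illustration of how these three techniques might combine in a Jacobi kernel, here is a sketch of one sweep for the 2D Laplace equation. It is an assumption made for illustration, not the Jacobi code these notes profile: `GRID`, `jacobi_sweep`, and the four-point averaging stencil are all hypothetical choices. The update and the convergence measure share one loop nest (fusion), the column index is innermost to match C's row-major layout (interchange), and the constant factor is computed outside the loops (hoisting).

```c
#define GRID 6   /* hypothetical grid size, boundary included */

/* One Jacobi sweep over the interior points.
 * Reads u, writes u_new, returns the largest absolute change. */
double jacobi_sweep(double u[GRID][GRID], double u_new[GRID][GRID])
{
    const double quarter = 0.25;   /* loop-invariant factor, hoisted */
    double max_diff = 0.0;

    for (int i = 1; i < GRID - 1; i++) {        /* rows in the outer loop */
        for (int j = 1; j < GRID - 1; j++) {    /* columns inner: contiguous */
            u_new[i][j] = quarter * (u[i - 1][j] + u[i + 1][j] +
                                     u[i][j - 1] + u[i][j + 1]);

            /* Fused convergence check: no second pass over the grid. */
            double diff = u_new[i][j] - u[i][j];
            if (diff < 0.0) diff = -diff;
            if (diff > max_diff) max_diff = diff;
        }
    }
    return max_diff;
}
```

A driver would call `jacobi_sweep` repeatedly, swapping the roles of `u` and `u_new`, until the returned value drops below a chosen tolerance.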