Lousy CPU, Wasting My Time: Jetson Nano First Impressions, Part 2

CPU and GPU Performance Tests

1. CPU performance test: computing pi

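bc is an arbitrary-precision calculator language, commonly used on Linux as a command-line calculator for basic arithmetic.
According to man bc, a() is a built-in function of the bc math library (loaded with -l) that computes the arctangent. Since tan(pi/4) = 1, it follows that 4*arctan(1) = pi.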
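The identity can be checked in ordinary double precision with a minimal C sketch. It does not reproduce how bc evaluates a(). The function names arctan_series and pi_estimate are made up for illustration, and the sketch uses the identity arctan(1) = arctan(1/2) + arctan(1/3) so that the Taylor series converges quickly.

```c
#include <assert.h>

/* Taylor series arctan(x) = x - x^3/3 + x^5/5 - ..., summed over `terms` terms.
   It converges quickly only when |x| is well below 1. */
static double arctan_series(double x, int terms)
{
    assert(terms > 0);
    double sum = 0.0;
    double power = x;                 /* holds x^(2n+1) */
    for (int n = 0; n < terms; n++) {
        double term = power / (2 * n + 1);
        sum += (n % 2 == 0) ? term : -term;
        power *= x * x;
    }
    return sum;
}

/* pi = 4*arctan(1), using arctan(1) = arctan(1/2) + arctan(1/3) */
double pi_estimate(void)
{
    return 4.0 * (arctan_series(0.5, 40) + arctan_series(1.0 / 3.0, 40));
}
```

Calling pi_estimate() from a main function gives a value that agrees with pi to roughly the precision of a double. bc does the same kind of computation with 10,000 decimal digits, which is why the benchmark takes minutes.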

Compute the first 10,000 digits of pi (single-threaded) and compare the result with an Intel(R) Xeon(R) Platinum 8163 CPU.

# Jetson Nano CPU parameters
lscpu

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               1
Model name:          Cortex-A57
Stepping:            r1p1
CPU max MHz:         1428.0000
CPU min MHz:         102.0000
BogoMIPS:            38.40
L1d cache:           32K
L1i cache:           48K
L2 cache:            2048K
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32

# Compute the first 10,000 digits of pi (single-threaded)
time echo "scale = 10000; 4*a(1)" | bc -l -q
3.1415926535897...

real    5m22.161s
user    5m21.496s
sys     0m0.020s

# Intel(R) Xeon(R) Platinum 8163 CPU parameters
lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              1
On-line CPU(s) list: 0
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping:            4
CPU MHz:             2500.008
BogoMIPS:            5000.01
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            33792K
NUMA node0 CPU(s):   0
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1

# Compute the first 10,000 digits of pi (single-threaded)
time echo "scale = 10000; 4*a(1)" | bc -l -q
3.1415926535897...

real    2m20.695s
user    2m19.211s
sys     0m0.047s

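The Jetson Nano took about 2.3 times as long (322 s versus 141 s), so its single-core performance is a little under half that of the Intel(R) Xeon(R) Platinum 8163.

2. CPU versus GPU comparison

CUDA online learning

2.1 Four computer models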

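GPUs were originally designed to take load off the CPU by moving part of the graphics computation onto a separate processor. Operations such as matrix transforms, vertex computation, and lighting computation move from the CPU to the GPU, which both speeds up graphics processing and reduces the CPU workload, leaving the CPU free for other work.
The processors on a GPU process data streams in an asynchronous, parallel fashion. Flynn's taxonomy divides the information stream into an instruction stream and a data stream, which yields four classes of computer:

Single instruction stream, single data stream (SISD): a single-core CPU
Single instruction stream, multiple data streams (SIMD): the GPU computing model
Multiple instruction streams, single data stream (MISD): the pipeline model
Multiple instruction streams, multiple data streams (MIMD): a multi-core CPU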

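2.2 Structural differences between CPU and GPU

[Figure: structural differences between CPU and GPU]

(1) CPU design philosophy: low latency

[Figure: CPU design philosophy, low latency]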

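ALU: A CPU has powerful ALUs (arithmetic logic units) that complete an arithmetic operation in very few clock cycles.
- Today's CPUs support 64-bit double precision. A double-precision floating-point add or multiply takes only 1 to 3 clock cycles.
- CPU clock frequencies are very high, from 1.532 to 4 gigahertz (1 GHz = 10^9 Hz).
Cache: A large cache also reduces latency. It holds a lot of data, so data that has been accessed before can be fetched straight from the cache when it is needed again.
Control: Complex control logic.
- When a program has many branches, branch prediction reduces latency.
- Data forwarding. When an instruction depends on the result of an earlier instruction, the forwarding logic determines where these instructions sit in the pipeline and forwards the result to the dependent instruction as quickly as possible. This takes a lot of comparator and forwarding circuitry.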

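(2) GPU design philosophy: high throughput

[Figure: GPU design philosophy, high throughput]

ALU, Cache: A GPU has many ALUs and very little cache. Unlike on a CPU, the cache is not there to hold data for later accesses but to serve the threads. If many threads need the same data, the cache coalesces those accesses and then goes to DRAM (the data lives in DRAM, not in the cache). Once the data arrives, the cache forwards it to the corresponding threads, acting as a data forwarder. Because DRAM has to be accessed, this naturally adds latency.
Control: The control unit (the yellow block on the left of the figure) merges many accesses into fewer ones.

Although a GPU suffers from DRAM latency, it has a very large number of ALUs and threads. To offset memory latency, we can exploit the many ALUs to reach very high throughput by launching as many threads as possible. This is why GPU ALUs generally have very deep pipelines.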

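2.3 Nvidia GPU architecture

(1) Hardware architecture

SP: the streaming processor, the most basic processing unit, also called a CUDA core. Instructions and tasks are ultimately executed on SPs. GPU parallel computing means many SPs working at the same time.
SM: the streaming multiprocessor, made up of several SPs plus other resources such as warp schedulers, registers, and shared memory. It is also called a GPU "big core". The SM can be seen as the heart of the GPU (comparable to a CPU core). Registers and shared memory are scarce resources on an SM, and CUDA divides them among all threads resident on the SM. These limited resources therefore put a strict cap on the number of active warps per SM, which in turn limits parallelism.

[Figure: Nvidia GPU hardware architecture]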
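How registers and shared memory cap the number of active warps can be sketched with a simplified calculation. The function names and every limit passed in are illustrative assumptions, not figures for the Jetson Nano. Real hardware also allocates these resources in coarser units, so the result is only an upper bound.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Upper bound on thread blocks resident on one SM, limited by the thread
   limit, the register file and shared memory. All limits are supplied by
   the caller; nothing here is queried from a real device. */
int max_resident_blocks(int regs_per_sm, int smem_per_sm, int max_threads_per_sm,
                        int threads_per_block, int regs_per_thread, int smem_per_block)
{
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int by_threads = max_threads_per_sm / threads_per_block;
    int by_regs = regs_per_sm / (regs_per_thread * threads_per_block);
    int limit = min_int(by_threads, by_regs);
    if (smem_per_block > 0)
        limit = min_int(limit, smem_per_sm / smem_per_block);
    return limit;
}

/* Warps kept active by `blocks` resident blocks */
int active_warps(int blocks, int threads_per_block, int warp_size)
{
    return blocks * ((threads_per_block + warp_size - 1) / warp_size);
}
```

With hypothetical limits of 65536 registers, 48 KB of shared memory and 2048 threads per SM, a kernel using 256 threads, 64 registers per thread and 16 KB of shared memory per block is capped at 3 resident blocks (24 warps) by shared memory, even though the thread limit alone would allow 8 blocks.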

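(2) Software architecture

On the software side, CUDA consists of a CUDA library, an application programming interface (API) with its runtime, and two higher-level general-purpose math libraries, CUFFT and CUBLAS. CUDA makes DRAM reads and writes more flexible, bringing the GPU closer to how a CPU works. CUDA also provides on-chip shared memory so that threads can share data. Applications can use shared memory to reduce DRAM traffic and depend less on DRAM bandwidth.

thread: a CUDA parallel program is executed by many threads.
block: several threads are grouped into a block. Threads in the same block can synchronize and can communicate through shared memory.
grid: multiple blocks in turn make up a grid.
warp: the scheduling unit when the GPU executes a program. The CUDA warp size is currently 32. Threads in the same warp execute the same instruction on different data, which is what SIMT refers to.

[Figure: Nvidia GPU software architecture]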
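The relationship between thread, block and warp can be made concrete with a small host-side C sketch for a one-dimensional launch. The helper names are invented for illustration. The sketch assumes the warp size of 32 stated above, and warps formed from consecutive thread indices within a block.

```c
#include <assert.h>

#define WARP_SIZE 32   /* warp size stated in the text */

/* Global index of a thread in a 1-D launch */
int global_thread_index(int block_idx, int block_dim, int thread_idx)
{
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* Which warp of its block a thread belongs to, and its position in that warp */
int warp_in_block(int thread_idx) { return thread_idx / WARP_SIZE; }
int lane_in_warp(int thread_idx)  { return thread_idx % WARP_SIZE; }

/* A block whose size is not a multiple of 32 still occupies whole warps */
int warps_per_block(int block_dim)
{
    return (block_dim + WARP_SIZE - 1) / WARP_SIZE;
}
```

For example, thread 37 of a block sits in warp 1 at lane 5, and a block of 33 threads occupies two warps, the second of which is almost empty. This is one reason block sizes are usually chosen as multiples of 32.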

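(3) Mapping between software and hardware
From the software point of view, an SM is more like an independent CPU core. SMs (streaming multiprocessors) are a very important part of the GPU architecture, and they determine the GPU's hardware parallelism.

When a kernel is launched, its threads are assigned to the SMs for execution. The threads may be spread across different SMs, but threads in the same block always run in parallel (SIMT) on the same SM. Each thread has its own program counter and state registers and executes instructions on its own data, which is what Single Instruction Multiple Thread means.

CUDA is a typical SIMT (single instruction, multiple threads) architecture. SIMT resembles SIMD (Single Instruction, Multiple Data) and can be regarded as an upgraded SIMD: more flexible but slightly less efficient. SIMT is a GPU concept introduced by NVIDIA. Both achieve parallelism by broadcasting the same instruction to multiple execution units. One major difference is that SIMD requires all vector elements to execute together in one synchronized group, whereas SIMT allows the threads in a warp to execute independently.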
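The "more flexible but slightly less efficient" trade-off shows up when threads of one warp take different branches. The following C sketch uses a deliberately simplified model, not a description of any particular GPU: each branch path taken by at least one lane is issued once while the other lanes are masked off. The name branch_passes is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define WARP_SIZE 32

/* Number of passes a warp needs for `if (cond) A; else B;` in the simplified
   model: one pass per path that at least one lane takes. */
int branch_passes(const bool takes_if[WARP_SIZE])
{
    bool any_if = false, any_else = false;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (takes_if[lane])
            any_if = true;
        else
            any_else = true;
    }
    return (int)any_if + (int)any_else;
}
```

If all 32 lanes agree on the condition, the warp needs one pass. If even one lane disagrees, both paths are executed in turn. The threads still behave as if they ran independently, but the warp pays for both branches.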

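2.4 Getting started with CUDA C

(1) Program structure

A CUDA program has two parts: Host and Device. Generally, the host is the CPU and the device is the GPU. The main program is still executed by the CPU. When it reaches a data-parallel part, CUDA compiles that part into a program the GPU can execute and sends it to the GPU. In CUDA such a program is called a kernel. CUDA extends C by letting the programmer define C functions called kernels. When called, a kernel is executed N times in parallel by N different CUDA threads, unlike an ordinary C function, which executes only once. Each thread that executes the kernel is given a unique thread ID, accessible inside the kernel through the built-in threadIdx variable.
Before the main program calls any GPU kernel, it must set the kernel's execution configuration: the number of thread blocks, the number of threads per block, and the shared memory size.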
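Choosing the number of thread blocks usually comes down to a ceiling division, sketched here in host-side C. The function names are illustrative, not part of CUDA.

```c
#include <assert.h>

/* Blocks needed to cover n elements with block_dim threads per block
   (ceiling division) */
int blocks_needed(int n, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/* Threads launched beyond n; a kernel has to skip these with a bounds check */
int surplus_threads(int n, int block_dim)
{
    return blocks_needed(n, block_dim) * block_dim - n;
}
```

For 2000 elements and 32 threads per block this gives 63 blocks, which launches 2016 threads, 16 more than there are elements. For 8000 elements the division is exact and no threads are left over.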

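A CUDA device has several separate memory spaces: global memory, local memory, shared memory, constant memory, texture memory, and registers.

CUDA threads can access data in several memory spaces during execution, as shown in the figure below:

Each thread has its own private local memory.
Each thread block has shared memory that is visible to all threads in the block and has the same lifetime as the block.
All threads can access the same global memory.
There are also two read-only memory spaces accessible to all threads: constant memory and texture memory. The global, constant, and texture memory spaces are optimized for different uses. Texture memory additionally offers different addressing modes and data filtering for some specific data formats, which gives the host fast access to streaming data.

CUDA assumes that threads may execute on a physically separate device that operates as a coprocessor to the host running the C program. Kernels execute on the GPU while the rest of the C program executes on the CPU (serial code runs on the host, parallel code on the device). CUDA also assumes that the host and the device each maintain their own DRAM, called host memory and device memory respectively. A program therefore manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime. This includes allocating and freeing device memory as well as transferring data between host and device memory.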

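(2) CUDA C basics

CUDA C is a variant formed by extending C/C++. It is compatible with C/C++ syntax, uses ".cu" source files, and is compiled with "nvcc". Compared with traditional C/C++, it mainly adds the following:

Function type qualifiers: determine whether a function runs on the CPU or the GPU, and whether it is called from the CPU or the GPU
- __device__: called from the GPU, executed on the GPU
- __global__: called from the CPU, executed on the GPU; also known as a kernel function
- __host__: called from the CPU, executed on the CPU
Execution configuration operator: the <<<>>> operator passes the execution parameters of a kernel function. Its format is:
kernel<<<gridDim, blockDim, memSize, stream>>>(para1, para2, ...);
- gridDim: size of the grid; may be 1-, 2-, or 3-dimensional
- blockDim: size of each block; may be 1-, 2-, or 3-dimensional
- memSize: size of the dynamically allocated shared memory; defaults to 0
- stream: the stream to execute in; defaults to 0
- para1, para2, ...: the kernel function arguments
Five built-in variables: used at run time to obtain the grid and block dimensions and the thread indices
- gridDim: a struct with three members x, y, z giving the size of the grid in each direction; corresponds to the first execution configuration parameter
- blockDim: a struct with three members x, y, z giving the size of the block in each direction; corresponds to the second execution configuration parameter
- blockIdx: a struct with three members x, y, z giving the index of the current thread's block within the grid in the x, y, z directions
- threadIdx: a struct with three members x, y, z giving the index of the current thread within its block in the x, y, z directions
- warpSize: the size of a warp
Variable type qualifiers: determine where a variable resides in device memory
- __device__: resides in global memory; the default
- __shared__: resides in shared memory
- __constant__: resides in constant memory
- texture: a variable bound to it can be read faster through the texture cache
There are also math functions, atomic functions, texture fetches, binding functions, and so on.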
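How blockIdx, blockDim and threadIdx combine into a matrix coordinate can be tried out on the host without a GPU. The sketch below is plain C that loops over every block and thread index of a two-dimensional launch. dim2 is a hypothetical stand-in for CUDA's dim3, and count_coverage_errors is an invented helper that checks every matrix cell is visited exactly once.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y; } dim2;   /* hypothetical stand-in for CUDA's dim3 */

/* Emulates on the host how a 2-D launch maps (blockIdx, threadIdx) to matrix
   cells. Returns the number of cells not visited exactly once. */
int count_coverage_errors(int width, int height, dim2 block)
{
    dim2 grid = { (width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y };
    int *visits = calloc((size_t)width * height, sizeof *visits);
    assert(visits != NULL);

    for (int by = 0; by < grid.y; by++)
    for (int bx = 0; bx < grid.x; bx++)
    for (int ty = 0; ty < block.y; ty++)
    for (int tx = 0; tx < block.x; tx++) {
        int row = ty + by * block.y;   /* threadIdx.y + blockIdx.y * blockDim.y */
        int col = tx + bx * block.x;   /* threadIdx.x + blockIdx.x * blockDim.x */
        if (row >= height || col >= width)
            continue;                  /* surplus thread outside the matrix */
        visits[row * width + col]++;
    }

    int errors = 0;
    for (int i = 0; i < width * height; i++)
        if (visits[i] != 1)
            errors++;
    free(visits);
    return errors;
}
```

A return value of 0 means the index formula covers the whole matrix with no duplicates. The continue statement plays the role of the bounds check a kernel needs whenever the matrix size is not a multiple of the block size.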

2.5 CPU versus GPU matrix multiplication

(1) Single-threaded CPU matrix multiplication

// Single-threaded CPU matrix multiplication
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>    
#include <unistd.h>

#define w 2000

struct Matrix
{
    int width;
    int height;
    float *elements;
};

void matMul(float * M, float * N, float * P, int width){
    for (int i = 0; i < width; i++){
        for (int j = 0; j < width; j++){
            float sum = 0;
            for (int k = 0; k < width; k++){
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
    }
}

int main(){
    int width = w;
    int height = w; 
    
    float * m = (float *)malloc (width * height * sizeof (float));
    float * n = (float *)malloc (width * height * sizeof (float));
    float * p = (float *)malloc (width * height * sizeof (float));

    for (int i = 0; i < width * height; i++){
        m[i] = 9.9;
        n[i] = 2.5;
    }

    struct timeval t1,t2;
    gettimeofday(&t1,NULL);
    double timeuse;

    matMul(m, n, p, w);

    gettimeofday(&t2,NULL);
    timeuse = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec)/1000000.0;
    printf("Use Time:%f\n",timeuse);

    return 0;
}

Then compile and run:

gcc  cpu_sigle.c -O3 -o cpu_sigle

./cpu_sigle

Use Time:52.641901
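One likely reason the loop is slow is that the inner loop reads N column by column, jumping width floats between consecutive accesses. The variant below swaps the two inner loops (i-k-j order) so that both N and P are walked row by row. It is a sketch of a common cache-friendly rewrite, not the code that was timed above, and matMulIKJ is a name chosen for illustration.

```c
#include <assert.h>
#include <string.h>

/* P = M * N for square row-major matrices, using i-k-j loop order so the
   innermost loop walks N and P through contiguous memory. */
void matMulIKJ(const float *M, const float *N, float *P, int width)
{
    assert(width > 0);
    memset(P, 0, (size_t)width * width * sizeof(float));
    for (int i = 0; i < width; i++) {
        for (int k = 0; k < width; k++) {
            float a = M[i * width + k];      /* reused across the whole row */
            for (int j = 0; j < width; j++) {
                P[i * width + j] += a * N[k * width + j];
            }
        }
    }
}
```

Each element of P still accumulates its products in the same order of k, so the values match the original matMul. How much time this saves on the Jetson Nano was not measured here.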

(2) Multithreaded CPU matrix multiplication

// Multithreaded CPU matrix multiplication
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define LOG_
#define SIZE 8000

int * A, * B; // input matrices
int * result, * result2, * result3, * result4; // result matrices
/*
int A[SIZE][SIZE];
int B[SIZE][SIZE];
int result[SIZE][SIZE];
int result2[SIZE][SIZE];
int result3[SIZE][SIZE];
int result4[SIZE][SIZE];
*/

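int size; // matrix order
pthread_t tid2[2]; // thread IDs for the two-thread run
pthread_t tid3[3]; // thread IDs for the three-thread run
pthread_t tid4[4]; // thread IDs for the four-thread run

/* Two-thread functions */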
void twoThread1(){
    int i, j, k;
    for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
    for (k = 0; k < size; k++){
        if (i % 2 == 0)
            result2[i * size + j] += A[i * size + k] * B[k * size + j];
//            result2[i][j] += A[i][k] * B[k][j];
    }
}

void twoThread2(){
    int i, j, k;
    for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
    for (k = 0; k < size; k++){
        if (i % 2 != 0)
            result2[i * size + j] += A[i * size + k] * B[k * size + j];
//            result2[i][j] += A[i][k] * B[k][j];
    }
}
/* Two-thread functions end */

/* Three-thread functions */
void threeThread1(){
    int i, j, k;
    for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
    for (k = 0; k < size; k++){
        if (i % 3 == 0)
            result3[i * size + j] += A[i * size + k] * B[k * size + j];
//            result3[i][j] += A[i][k] * B[k][j];
    }
}

void threeThread2(){
    int i, j, k;
    for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
    for (k = 0; k < size; k++){
        if (i % 3 != 0 && i % 2 != 0)
            result3[i * size + j] += A[i * size + k] * B[k * size + j];
//            result3[i][j] += A[i][k] * B[k][j];
    }
}

void threeThread3(){
    int i, j, k;
    for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
    for (k = 0; k < size; k++){
        if (i % 3 != 0 && i % 2 == 0)
            result3[i * size + j] += A[i * size + k] * B[k * size + j];
//            result3[i][j] += A[i][k] * B[k][j];
    }
}
/* Three-thread functions end */

/* Four-thread functions */
void fourThread1(){
    int i, j, k;
    for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
    for (k = 0; k < size; k++){
        if (i % 2 == 0 && i % 4 != 0)
            result4[i * size + j] += A[i * size + k] * B[k * size + j];
//            result4[i][j] += A[i][k] * B[k][j];
    }
}

void fourThread2(){
    int i, j, k;
    for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
    for (k = 0; k < size; k++){
        if (i % 4 == 0)
            result4[i * size + j] += A[i * size + k] * B[k * size + j];
//            result4[i][j] += A[i][k] * B[k][j];
    }
}

void fourThread3(){
    int i, j, k;
    for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
    for (k = 0; k < size; k++){
        if (i % 2 != 0 && i % 3 == 0)
            result4[i * size + j] += A[i * size + k] * B[k * size + j];
//            result4[i][j] += A[i][k] * B[k][j];
    }
}

void fourThread4(){
    int i, j, k;
    for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
    for (k = 0; k < size; k++){
        if (i % 2 != 0 && i % 3 != 0)
            result4[i * size + j] += A[i * size + k] * B[k * size + j];
//            result4[i][j] += A[i][k] * B[k][j];
    }
}
/* Four-thread functions end */


int main(){
    int i, j, k, m, n; // loop variables
    struct timeval t1, t2;
    double timeuse; // elapsed time

    char sizeChars[8]; // matrix order as a string
    char timeChars[16]; // elapsed time as a string

    // Allocate the input and result matrices
    A = (int *)malloc (sizeof (int) * SIZE * SIZE);
    B = (int *)malloc (sizeof (int) * SIZE * SIZE);
    result = (int *)malloc (sizeof (int) * SIZE * SIZE);
    result2 = (int *)malloc (sizeof (int) * SIZE * SIZE);
    result3 = (int *)malloc (sizeof (int) * SIZE * SIZE);
    result4 = (int *)malloc (sizeof (int) * SIZE * SIZE);

    for (i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++){
        /*
        A[i][j] = 1;
        B[i][j] = 2;
        result[i][j] = 0;
        result2[i][j] = 0;
        result3[i][j] = 0;
        result4[i][j] = 0;
        */
        A[i * SIZE + j] = 1;
        B[i * SIZE + j] = 2;
        result[i * SIZE + j] = 0;
        result2[i * SIZE + j] = 0;
        result3[i * SIZE + j] = 0;
        result4[i * SIZE + j] = 0;
    }

    int fd;
    fd = open ("./pthreadTime.txt", O_WRONLY | O_CREAT, 0777);
    lseek(fd, 0, SEEK_SET);

    for (size = 200; size <= SIZE; size += 200){
        printf ("Current order: %d\n", size);
        sprintf (sizeChars, "%d, ", size);
        write (fd, sizeChars, strlen (sizeChars));

#ifdef LOG
        printf ("Matrix A: \n");
        for (i = 0; i < size; i++){
            for (j = 0; j < size; j++){
                printf ("%d ", A[i * size + j]);
//                printf ("%d ", A[i][j]);
            }
            printf ("\n");
        }

        printf ("Matrix B: \n");
        for (i = 0; i < size; i++){
            for (j = 0; j < size; j++){
                printf ("%d ", B[i * size + j]);
//                printf ("%d ", B[i][j]);
            }
            printf ("\n");
        }
#endif 

        /* Single thread */
        gettimeofday (&t1, NULL);

        for (i = 0; i < size; i++)
        for (j = 0; j < size; j++)
        for (k = 0; k < size; k++){
            result[i * size + j] += A[i * size + k] * B[k * size +j];
//            result[i][j] += A[i][k] * B[k][j];
        }

        gettimeofday(&t2, NULL);
        timeuse = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec)/1000000.0;

#ifdef LOG
        printf ("Single-thread result matrix: \n");
        for (i = 0; i < size; i++){
            for (j = 0; j < size; j++){
                printf ("%d ", result[i * size + j]);
//                printf ("%d ", result[i][j]);
            }
            printf ("\n");
        }
#endif

        printf("Single-thread time: %fs\n", timeuse);
        sprintf (timeChars, "%lf, ", timeuse);
        write (fd, timeChars, strlen (timeChars));


        for (i = 0; i < size; i++)
        for (j = 0; j < size; j++){
            result[i * size + j] = 0;
            result2[i * size + j] = 0;
            result3[i * size + j] = 0;
            result4[i * size + j] = 0;
            /*
            result[i][j] = 0;
            result2[i][j] = 0;
            result3[i][j] = 0;
            result4[i][j] = 0;
            */
        }
        /* Single thread end */

        /* Two threads */
        gettimeofday (&t1, NULL);
        pthread_create (&tid2[0], NULL, (void *)twoThread1, NULL);
        pthread_join (tid2[0], NULL);
        pthread_create (&tid2[1], NULL, (void *)twoThread2, NULL);
        pthread_join (tid2[1], NULL);

        gettimeofday (&t2, NULL);
        timeuse = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec)/1000000.0;

#ifdef LOG
        printf ("Two-thread result matrix: \n");
        for (i = 0; i < size; i++){
            for (j = 0; j < size; j++){
                printf ("%d ", result2[i * size + j]);
//                printf ("%d ", result2[i][j]);
            }
            printf ("\n");
        }
#endif

        printf("Two-thread time: %fs\n", timeuse);
        sprintf (timeChars, "%lf, ", timeuse);
        write (fd, timeChars, strlen (timeChars));

        for (i = 0; i < size; i++)
        for (j = 0; j < size; j++){
            result[i * size + j] = 0;
            result2[i * size + j] = 0;
            result3[i * size + j] = 0;
            result4[i * size + j] = 0;
            /*
            result[i][j] = 0;
            result2[i][j] = 0;
            result3[i][j] = 0;
            result4[i][j] = 0;
            */
        }
        /* Two threads end */

        /* Three threads */
        gettimeofday (&t1, NULL);
        pthread_create (&tid3[0], NULL, (void *)threeThread1, NULL);
        pthread_join (tid3[0], NULL);
        pthread_create (&tid3[1], NULL, (void *)threeThread2, NULL);
        pthread_join (tid3[1], NULL);
        pthread_create (&tid3[2], NULL, (void *)threeThread3, NULL);
        pthread_join (tid3[2], NULL);

        gettimeofday (&t2, NULL);
        timeuse = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec)/1000000.0;

#ifdef LOG
        printf ("Three-thread result matrix: \n");
        for (i = 0; i < size; i++){
            for (j = 0; j < size; j++){
                printf ("%d ", result3[i * size + j]);
            }
            printf ("\n");
        }
#endif

        printf("Three-thread time: %fs\n", timeuse);
        sprintf (timeChars, "%lf, ", timeuse);
        write (fd, timeChars, strlen (timeChars));

        for (i = 0; i < size; i++)
        for (j = 0; j < size; j++){
            result[i * size + j] = 0;
            result2[i * size + j] = 0;
            result3[i * size + j] = 0;
            result4[i * size + j] = 0;
            /*
            result[i][j] = 0;
            result2[i][j] = 0;
            result3[i][j] = 0;
            result4[i][j] = 0;
            */
        }
        /* Three threads end */

        /* Four threads */
        gettimeofday (&t1, NULL);
        pthread_create (&tid4[0], NULL, (void *)fourThread1, NULL);
        pthread_join (tid4[0], NULL);
        pthread_create (&tid4[1], NULL, (void *)fourThread2, NULL);
        pthread_join (tid4[1], NULL);
        pthread_create (&tid4[2], NULL, (void *)fourThread3, NULL);
        pthread_join (tid4[2], NULL);
        pthread_create (&tid4[3], NULL, (void *)fourThread4, NULL);
        pthread_join (tid4[3], NULL);

        gettimeofday (&t2, NULL);
        timeuse = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec)/1000000.0;

#ifdef LOG
        printf ("Four-thread result matrix: \n");
        for (i = 0; i < size; i++){
            for (j = 0; j < size; j++){
                printf ("%d ", result4[i * size + j]);
            }
            printf ("\n");
        }
#endif

        printf("Four-thread time: %fs\n", timeuse);
        sprintf (timeChars, "%lf\n", timeuse);
        write (fd, timeChars, strlen (timeChars));

        for (i = 0; i < size; i++)
        for (j = 0; j < size; j++){
            result[i * size + j] = 0;
            result2[i * size + j] = 0;
            result3[i * size + j] = 0;
            result4[i * size + j] = 0;
            /*
            result[i][j] = 0;
            result2[i][j] = 0;
            result3[i][j] = 0;
            result4[i][j] = 0;
            */
        }
        /* Four threads end */
    }

    // Free memory
    free (A);
    free (B);
    free (result);
    free (result2);
    free (result3);
    free (result4);
    A = NULL; 
    B = NULL;
    result = NULL;
    result2 = NULL;
    result3 = NULL;
    result4 = NULL;

    // Close the file
    close (fd);

    return 0;
}

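Compile:

gcc  cpu_mul.c -O3 -o cpu_mul

The build fails:

undefined reference to `pthread_create'

Cause: the pthread library is not among the libraries linked by default on Linux, so the linker cannot find the entry addresses of the functions in the pthread library and the link fails.

Fix: add the -lpthread option to the gcc command.

Compile again:

gcc  cpu_mul.c -O3 -lpthread -o cpu_mul

Run: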

./cpu_mul
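Current order: 200
Single-thread time: 0.068800s
Two-thread time: 0.064988s
Three-thread time: 0.065114s
Four-thread time: 0.065234s
Current order: 400
Single-thread time: 0.616225s
Two-thread time: 0.592269s
Three-thread time: 0.590107s
Four-thread time: 0.590583s
Current order: 600
Single-thread time: 2.106992s
Two-thread time: 2.067765s
Three-thread time: 2.078379s
Four-thread time: 2.079277s
Current order: 800
Single-thread time: 5.433704s
Two-thread time: 5.232933s
Three-thread time: 5.241303s
Four-thread time: 5.245513s
Current order: 1000
Single-thread time: 11.862752s
Two-thread time: 11.370975s
Three-thread time: 11.380489s
Four-thread time: 11.376795s
Current order: 1200
Single-thread time: 21.028770s
Two-thread time: 20.471958s
Three-thread time: 20.481492s
Four-thread time: 20.683800s
Current order: 1400
Single-thread time: 33.121682s
Two-thread time: 30.445559s
Three-thread time: 30.390110s
Four-thread time: 30.387628s
Current order: 1600
Single-thread time: 57.306933s
Two-thread time: 56.083187s
Three-thread time: 56.102355s
Four-thread time: 56.104007s
Current order: 1800
Single-thread time: 70.554218s
Two-thread time: 64.973096s
Three-thread time: 64.993218s
Four-thread time: 64.977718s
Current order: 2000
Single-thread time: 98.518510s
Two-thread time: 94.792153s
Three-thread time: 94.810620s
Four-thread time: 94.876091s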

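However, htop showed only two threads during the multithreaded runs: one was working and the other was sleeping, which is strange.

Compile without the optimization level: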

gcc  cpu_mul.c  -lpthread -o cpu_mul2

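Current order: 200
Single-thread time: 0.236366s
Two-thread time: 0.276663s
Three-thread time: 0.352680s
Four-thread time: 0.365158s
Current order: 400
Single-thread time: 2.031096s
Two-thread time: 2.332579s
Three-thread time: 2.967604s
Four-thread time: 3.046181s
Current order: 600
Single-thread time: 7.005228s
Two-thread time: 7.990198s
Three-thread time: 10.129884s
Four-thread time: 10.384419s
Current order: 800
Single-thread time: 16.948757s
Two-thread time: 19.568093s
Three-thread time: 24.364115s
Four-thread time: 24.274340s
Current order: 1000
Single-thread time: 38.138426s
Two-thread time: 42.429860s
Three-thread time: 52.424720s
Four-thread time: 53.720896s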

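htop again showed two threads: one CPU at 100% and the other at 0%, just watching... The unoptimized build performed even worse, again with only two threads, one of them just watching. I give up... I'll look into multithreaded programming when I get the chance!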
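A likely explanation lies in the listing itself: every pthread_create is followed at once by pthread_join on the same thread, so the main thread sleeps until that worker finishes before it starts the next one. Only one worker is ever running, which matches what htop shows. In addition, each worker walks all i, j, k and tests its row condition in the innermost loop, so more "threads" mean more total loop iterations. The sketch below shows one possible restructuring, not a tested replacement for the program above. Task, worker, parallelMatMul and MAX_THREADS are names invented for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 8

typedef struct {
    const int *A, *B;
    int *C;
    int n, id, nthreads;
} Task;   /* hypothetical per-thread argument block */

/* Each worker computes only rows id, id + nthreads, id + 2*nthreads, ... */
static void *worker(void *arg)
{
    Task *t = arg;
    for (int i = t->id; i < t->n; i += t->nthreads)
        for (int j = 0; j < t->n; j++) {
            int sum = 0;
            for (int k = 0; k < t->n; k++)
                sum += t->A[i * t->n + k] * t->B[k * t->n + j];
            t->C[i * t->n + j] = sum;
        }
    return NULL;
}

/* C = A * B for n x n row-major matrices; returns 0 on success, -1 otherwise */
int parallelMatMul(const int *A, const int *B, int *C, int n, int nthreads)
{
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    Task tasks[MAX_THREADS];
    int started = 0;

    /* Start every worker first ... */
    for (; started < nthreads; started++) {
        tasks[started] = (Task){ A, B, C, n, started, nthreads };
        if (pthread_create(&tid[started], NULL, worker, &tasks[started]) != 0)
            break;
    }
    /* ... and only then wait for them, so they can run at the same time. */
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return started == nthreads ? 0 : -1;
}
```

The worker has the void *(*)(void *) signature that pthread_create expects, so no cast is needed. Rows are interleaved across threads, and no two threads write the same element of C, so no locking is required. With four workers on the four Cortex-A57 cores, htop should show all cores busy, though the actual speedup was not measured here.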
(3) Printing the GPU device parameters

vim device.cu
nvcc device.cu  -o device

#include <stdio.h>
int main() {
        int nDevices;

        cudaGetDeviceCount(&nDevices);
        for (int i = 0; i < nDevices; i++) {
                cudaDeviceProp prop;
                cudaGetDeviceProperties(&prop, i);
                printf("Device Num: %d\n", i);
                printf("Device name: %s\n", prop.name);
                printf("Device SM Num: %d\n", prop.multiProcessorCount);
                printf("Share Mem Per Block: %.2fKB\n", prop.sharedMemPerBlock / 1024.0);
                printf("Max Thread Per Block: %d\n", prop.maxThreadsPerBlock);
                printf("Memory Clock Rate (KHz): %d\n",
                   prop.memoryClockRate);
                printf("Memory Bus Width (bits): %d\n",
                   prop.memoryBusWidth);
                printf("Peak Memory Bandwidth (GB/s): %.2f\n\n",
                   2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
        }
        return 0;
}

Result:

./device
Device Num: 0
Device name: NVIDIA Tegra X1
Device SM Num: 1
Share Mem Per Block: 48.00KB
Max Thread Per Block: 1024
Memory Clock Rate (KHz): 12750
Memory Bus Width (bits): 64
Peak Memory Bandwidth (GB/s): 0.20
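The last line of the output is derived rather than queried. The arithmetic in device.cu can be isolated as a small C function. peakBandwidthGBs is a name chosen for illustration, and the factor 2 assumes double-data-rate memory (two transfers per clock), as the formula in the listing does.

```c
#include <assert.h>

/* Peak bandwidth in GB/s from the memory clock (kHz) and bus width (bits):
   2 transfers per clock * clock in kHz * bus width in bytes, scaled by 1e6
   to convert kB/s to GB/s. */
double peakBandwidthGBs(int memoryClockKHz, int busWidthBits)
{
    assert(busWidthBits % 8 == 0);
    return 2.0 * memoryClockKHz * (busWidthBits / 8) / 1.0e6;
}
```

With the reported values, 2 * 12750 * 8 / 1e6 = 0.204, which prints as 0.20. The figure therefore follows directly from the memory clock the driver reports and is not a measured bandwidth.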

(4) GPU matrix multiplication

// GPU matrix multiplication
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define w 2000

struct Matrix
{
    int width;
    int height;
    float *elements;
};

__device__ float getElement(Matrix *A, int row, int col)
{
        return A->elements[row * A->width + col];
}

__device__ void setElement(Matrix *A, int row, int col, float value)
{
        A->elements[row * A->width + col] = value;
}

__global__ void matMulKernel(Matrix *A, Matrix *B, Matrix *C)
{
        float Cvalue = 0.0;
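        int row = threadIdx.y + blockIdx.y * blockDim.y;
        int col = threadIdx.x + blockIdx.x * blockDim.x;

        // The grid is rounded up to whole blocks, so skip threads outside the matrix
        if (row >= C->height || col >= C->width) return;

        for (int i = 0; i < A->width; ++i)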
        {
                Cvalue += getElement(A, row, i) * getElement(B, i, col);
        }
        setElement(C, row, col, Cvalue);
}

int main()
{
    int width = w;
    int height = w;

    Matrix *A, *B, *C;

    cudaMallocManaged((void**)&A, sizeof(Matrix));
    cudaMallocManaged((void**)&B, sizeof(Matrix));
    cudaMallocManaged((void**)&C, sizeof(Matrix));

    int nBytes = width * height * sizeof(float);

    cudaMallocManaged((void**)&A->elements, nBytes);
    cudaMallocManaged((void**)&B->elements, nBytes);
    cudaMallocManaged((void**)&C->elements, nBytes);

    A->height = height;
    A->width = width;
    B->height = height;
    B->width = width;
    C->height = height;
    C->width = width;

    for (int i = 0; i < width * height; ++i)
    {
        A->elements[i] = 1.0;
        B->elements[i] = 2.0;
    }

    dim3 blockSize(32, 32);
    dim3 gridSize((width + blockSize.x - 1) / blockSize.x,
        (height + blockSize.y - 1) / blockSize.y);

    struct timeval t1,t2;
    gettimeofday(&t1,NULL);
    double timeuse;

    matMulKernel<<<gridSize, blockSize>>>(A, B, C);

    cudaDeviceSynchronize();

    gettimeofday(&t2,NULL);
    timeuse = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec)/1000000.0;
    printf("Use Time:%fs\n", timeuse);

    return 0;
}

vim gpu_mul.cu
nvcc gpu_mul.cu  -o gpu_mul

Use Time:1.466122s (w=2000)
Use Time:267.060169s (w=8000)

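(5) Comparing the results

For the same 2000 x 2000 matrix multiplication, a single CPU core took 52.64 s while the GPU took only 1.466 s, roughly 36 times faster. The GPU is far faster than the CPU at matrix multiplication, though the code should still be optimized for the GPU programming model.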
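The timings can be turned into a speedup and a throughput figure by counting one multiply and one add per inner-loop step, 2 * n^3 floating-point operations in total. The helper names below are illustrative.

```c
#include <assert.h>

/* Throughput in GFLOP/s of an n x n matrix multiplication that performs
   2 * n^3 floating-point operations in `seconds`. */
double matmulGflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double flops = 2.0 * n * (double)n * n;
    return flops / seconds / 1.0e9;
}

/* How many times faster the second measurement is than the first */
double speedup(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}
```

For w=2000 this gives about 0.30 GFLOP/s on one CPU core and about 10.9 GFLOP/s on the GPU, a speedup of about 35.9. The w=8000 GPU run works out to about 3.8 GFLOP/s, noticeably lower than the w=2000 run, which suggests the simple kernel does not scale well to the larger matrix.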


posted @ 2019-05-10 16:58 〇矽小筑
