MiniTorch Notes

https://github.com/mukobi/Minitorch-Self-Study-Guide-SAIA/blob/main/README.md#what-is-this
https://minitorch.github.io/
srush/GPU-Puzzles: Solve puzzles. Learn CUDA. (github.com)

Code style and submission

black .   # format all files
flake8 .  # check docstrings, style, etc.
mypy .    # check types

Visualization

streamlit run app.py -- 0

Testing

pytest, hypothesis (property-based testing)

Project structure

minitorch: the computational core and main body of the project (operators, tensor ops, autodiff)
- operators.py: scalar functions
1. Scalar-level math functions on plain floats, such as relu, add, less
2. The simplest float-based derivative helpers, such as log_back, inv_back
3. Float-based map, reduce and zip, mainly useful for debugging
- tensor_ops.py, fast_ops.py, cuda_ops.py: tensor operators. FastOps builds on numba.jit and is mainly optimized through multi-threaded parallelism; CudaOps builds on numba.cuda and is mainly optimized through SIMT. They wrap the scalar functions from operators.py into tensor functions.
1. Subclass the TensorOps class and use map, zip, reduce to call the basic functions in operators.py. When called, a Tensor is lowered to numpy arrays (storage, strides, shape), e.g. SimpleOps.log_back(tensor) = SimpleOps.map(operators.log_back)
2. Tensor-level map, reduce, zip
3. SimpleBackend = TensorBackend(SimpleOps); FastTensorBackend = minitorch.TensorBackend(minitorch.FastOps); GPUBackend = minitorch.TensorBackend(minitorch.CudaOps)

- `fast_conv`, `cuda_conv`
- `tensor_functions.py`: class Function and its subclasses; records History (which forms the computation graph) and implements forward and backward
	1. Function: defines the forward and backward interfaces and dispatches the computation to the operators of the backend that the arguments belong to. apply: the interface that actually gets called; it creates a Context, records History and adds the node to the computation graph if backward will be needed, calls forward, and makes sure the result is a Tensor
	2. Neg, Add, Inv, MatMul... subclasses of Function. For arguments such as dim that are not tensors and need no gradient, backward returns 0, which is simply ignored during backpropagation. forward stashes whatever values backward will need into the Context, and backward pulls them back out to do its computation. Also, operators.py only defines the primitive functions, so for example the backward logic of Sigmoid is `mul_zip(grad_output, mul_zip(sigmoid(input), add_zip(1.0, neg_map(sigmoid(input)))))`
	3. Shape-related subclasses such as Copy, View, Permute. Permute only swaps entries of strides and shape rather than moving the actual data. Some operations require the tensor to be contiguous, i.e. laid out in memory order, `strides[i] = prod(shape[i+1:])`
	4. Tensor-creation functions such as zeros and rand
	5. `grad_central_difference`, `grad_check` estimate gradients with a simple numerical method, which carries some error; `grad_check` typically uses a 10% relative tolerance
- `tensor_data.py`
	1. `index_to_position`, `to_index`, `broadcast_index`: helpers for strided access to TensorData, converting between a position (the absolute 1-D address into storage, i.e. storage[position]) and an index (the multi-dimensional strided, logical address). broadcast_index additionally maps an index of a broadcast (larger) tensor to the corresponding index of the original (smaller) tensor. Note that `to_index` and `index_to_position` are not inverses of each other: after `to_index(pos, shape, index)` and `pos2 = index_to_position(index, strides)`, pos equals pos2 in most but not all cases; mixing strided access with broadcasting can make pos and pos2 differ (still worth investigating). It is safest to convert the absolute position to a logical index explicitly and then back again. (See the sketch after this list.)
	2. TensorData: storage, strides, shape, plus basic transformation methods such as permute and copy
- tensor.py
	1. class Tensor: `_tensor: TensorData`, backend, history, name, unique_id. Overloads the basic operators, adds elementary computations such as sigmoid and log (which call the corresponding Function subclasses such as Sigmoid), and wraps TensorData's transformation methods
	2. If tensor.history is None, the tensor is treated as a constant and needs no gradient. If tensor.history exists but is empty (tensor.history.fn and tensor.history.inputs are both empty), the tensor is a leaf (a starting point of the computation graph) and its gradient will be stored in grad. For non-leaf tensors that need gradients, the gradient is only passed through and is not stored after being propagated.
- autodiff.py
	1. topological sort
	2. backpropagation
- optim.py
	1. SGD: `zero_grad` clears the gradients already stored on leaf nodes; step: uses the gradients to update the associated parameters
- module.py
	1. class Parameter
	2. class Module: methods train, eval, named_parameters. `__init__` automatically adds attributes of type Parameter to `self._parameters` and attributes of type Module to `self._modules`. The parameters in `self._parameters` and those belonging to the sub-Modules in `self._modules` are automatically tracked by self.optim and updated in step
  • nn.py: avgpool2d, softmax, logsoftmax, maxpool2d, dropout, etc.: layer-like functions built from the Function subclasses and Tensor methods that do not need to track parameters
    project: visualization and the concrete network layers
  • project/run_tensor.py implements layers such as Linear and Conv that do need to track parameters like weights
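Since the index helpers above are the main source of confusion, here is a minimal pure-Python sketch of their semantics (signatures modeled on minitorch's `tensor_data.py`; treat the details as an approximation rather than the reference implementation):

```python
from typing import List, Sequence

def index_to_position(index: Sequence[int], strides: Sequence[int]) -> int:
    # logical multi-dimensional index -> absolute position into the 1-D storage
    return sum(i * s for i, s in zip(index, strides))

def to_index(ordinal: int, shape: Sequence[int], out_index: List[int]) -> None:
    # ordinal position (0 .. size-1, contiguous order) -> logical index, filled in place
    for dim in range(len(shape) - 1, -1, -1):
        out_index[dim] = ordinal % shape[dim]
        ordinal //= shape[dim]

def broadcast_index(big_index: Sequence[int], big_shape: Sequence[int],
                    shape: Sequence[int], out_index: List[int]) -> None:
    # map an index into the broadcast (larger) tensor back to the original (smaller) one
    offset = len(big_shape) - len(shape)
    for dim in range(len(shape)):
        out_index[dim] = big_index[dim + offset] if shape[dim] > 1 else 0
```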

Lesson0 Setup & ML Primer

  1. Although sigmoid is defined as 1/(1+exp(-x)), exp overflows easily, so in practice it is computed as 1/(1+exp(-x)) when x >= 0 and as exp(x)/(exp(x) + 1) when x < 0 (see the sketch at the end of this lesson's notes)
  2. log adds a small delta so that it never returns inf
  3. The derivative of relu at 0 is taken to be 0, not 1
  4. Most math operations can be grouped into map (1-to-1), zip (2-to-1) and reduce (n-to-1). Implement the most basic float versions first, then use optimized map, zip and reduce to speed up the same computations over whole tensors (in minitorch)
  5. sigmoid is symmetric about the point (0, 0.5), not symmetric about 0. In float arithmetic, sigmoid evaluates to exactly 1 for roughly x >= 37 and to exactly 0 for x <= about -800
  6.  probs = (out * y) + (out - 1.0) * (y - 1.0)
           loss = -probs.log().sum()
           # Update
           loss.view(1).backward()
    

Main places I got stuck:
sigmoid's limited floating-point precision (saturation)
An Iterable is not a List: it has no len method; you can only call iter(x) and then use for ... in or next(it)
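A minimal sketch of the numerically stable sigmoid described in point 1 above (plain Python, not the minitorch code):

```python
import math

def sigmoid(x: float) -> float:
    # For x >= 0, exp(-x) <= 1, so 1 / (1 + exp(-x)) cannot overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For x < 0, exp(x) <= 1, so exp(x) / (exp(x) + 1) cannot overflow either.
    z = math.exp(x)
    return z / (z + 1.0)
```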

Lesson1 Autodiff

Ways to compute derivatives: symbolic differentiation, numerical differentiation, automatic differentiation.
Automatic differentiation sits between symbolic and numerical differentiation, trading off accuracy against ease of implementation and speed.

Forward mode: set x1' = 1 and x2' = 0, then compute v4' from v4 = x1 + 2*x2 + cos(x2), and propagate forward to assemble df/dx.
Forward-mode AD is implemented by a nonstandard interpretation of the program in which real numbers are replaced by dual numbers, constants are lifted to dual numbers with a zero epsilon coefficient, and the numeric primitives are lifted to operate on dual numbers. This nonstandard interpretation is generally implemented using one of two strategies: source code transformation or operator overloading.
Backward mode: set f' = 1, then from f = v4 + v5 obtain v4' = ∂f/∂v4, and propagate all the way back to x1'.

Chain Rule:
h(x) = f(g(x)): let z = g(x) and write f'(z) = df/dz; then dh/dx = f'(z) * dz/dx
h(x, y) = f(g(x, y)): let z = g(x, y); then ∂h/∂x = f'(z) * ∂z/∂x and ∂h/∂y = f'(z) * ∂z/∂y
h(x) = f(g(x), g(x)): let z = g(x); the contributions of the two arguments accumulate, so dh/dx = 2 * f'(z) * dz/dx

Backpropagation:
Propagation can run from the top or from the bottom; this course uses a BFS-style topological sort of the directed acyclic graph.
Steps:

  1. Call topological sort
  2. create a dictionary to record derivatives
  3. if the scalar is a leaf, record the derivative into its grad; otherwise call the chain rule and accumulate derivatives (steps 1-4 are sketched after this list)
  4. Q: why is grad only recorded on leaf tensors? A: to follow PyTorch's convention
  5. sigmoid backward
    t = operators.sigmoid(a)
    return d_output * (t - t * t)
    
  6. python protocol:
    This is called duck typing in Python. In duck typing, the behaviors and properties of an object determine the object type, not the explicit type of the object.
    To make the calculate_total() more dynamic while leveraging type hints, you can use the Protocol from the typing module.
    Only with @runtime_checkable can isinstance be used to check whether an object conforms to a Protocol
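Putting steps 1-4 together, here is a rough sketch of the backpropagation loop. It assumes a Variable interface roughly like minitorch's autodiff.py (unique_id, is_leaf, is_constant, chain_rule, accumulate_derivative) and a topological_sort helper; the details are simplified.

```python
def backpropagate(variable, deriv) -> None:
    order = topological_sort(variable)                 # step 1: output first, leaves last
    derivatives = {variable.unique_id: deriv}          # step 2: dict of accumulated derivatives
    for var in order:
        d = derivatives.get(var.unique_id, 0.0)
        if var.is_leaf():
            var.accumulate_derivative(d)               # step 3: leaves store the result in grad
        else:
            for parent, parent_d in var.chain_rule(d): # non-leaves: apply the chain rule and
                if parent.is_constant():               # pass derivatives on without storing them
                    continue
                derivatives.setdefault(parent.unique_id, 0.0)
                derivatives[parent.unique_id] += parent_d
```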

Lesson2 Tensor

strided access, contiguous strides, non-contiguous strides
• Storage is where the core data of the tensor is kept. It is always a 1-D array of numbers of length size, no matter the dimensionality or shape of the tensor. Keeping a 1-D storage allows us to have tensors with different shapes point to the same type of underlying data.
• Strides is a tuple that provides the mapping from user indexing to the position in the 1-D storage: storage[s1 * index1 + s2 * index2 + s3 * index3 ... ]

Broadcasting: for example, given just the constant 1, you can elementwise-add 1 to a tensor of shape (2,2,2) -- there is no need to create a (2,2,2) tensor, fill it with 1, and then add.

Broadcasting is a protocol that allows us to automatically interpret the first expression as implying the second one. Inside zip, we pretend that 10 is a vector of shape (3,) when zipping it with a vector of shape (3,). Again, this is just an interpretation: we never actually create this vector.

Rule 1: Any dimension of size 1 can be zipped with dimensions of size n > 1 by assuming the dimension is copied n times.

Rule 2: Extra dimensions of shape 1 can be added to a tensor to ensure the same number of dimensions with another tensor.

Rule 3: Any extra dimension of size 1 can only be implicitly added on the left side of the shape.
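The three rules can be condensed into a small shape-broadcasting helper; this is a sketch in the spirit of minitorch's shape_broadcast, not its exact code:

```python
def shape_broadcast(shape_a, shape_b):
    n = max(len(shape_a), len(shape_b))
    a = (1,) * (n - len(shape_a)) + tuple(shape_a)   # Rules 2 & 3: pad with 1s on the left only
    b = (1,) * (n - len(shape_b)) + tuple(shape_b)
    out = []
    for da, db in zip(a, b):
        if da != db and 1 not in (da, db):
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
        out.append(max(da, db))                      # Rule 1: a size-1 dim acts as if copied n times
    return tuple(out)

# shape_broadcast((2, 3, 1), (7, 2, 1, 5)) == (7, 2, 3, 5)
```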

Matrix multiplication can be written in this style; here it is B x A^T where A is 3 x 2 and B is 2 x 2. (And you will need to use this for the assignment.) However, note this is a memory-inefficient way to do it.

From https://minitorch.github.io/module2/broadcasting/
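As an illustration of matmul written in this broadcasting style (zip-multiply over the shared dimension, then reduce), here is the same idea in plain numpy; the intermediate (3, 2, 2) tensor is exactly what makes it memory-inefficient:

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)     # (3, 2)
B = np.arange(4.0).reshape(2, 2)     # (2, 2)

# zip: broadcasted elementwise multiply over the shared dimension k
# reduce: sum k away -> C[i, j] = sum_k A[i, k] * B[k, j]
C = (A.reshape(3, 2, 1) * B.reshape(1, 2, 2)).sum(axis=1)

assert np.allclose(C, A @ B)
```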

out = minitorch.zeros((2, 3, 1)) + minitorch.zeros((7, 2, 1, 5))
out.shape
(7, 2, 3, 5)

From https://minitorch.github.io/module2/broadcasting/

• Operation a / map: These operations just touch each of the positions in the tensor individually. They don't need to deal with other positions or know anything about the shape or size of the tensor. We can view these operations as applying the following transformation:
• Operation b / zip: These operations only need to pair operations between input tensors. If we assume the tensors have the same size and shape, this type of operation simply aligns these two tensors and applies an operator to each pair of elements:
• Operation c / reduce: These operations need to group together cells within a single tensor. We can think of there being an implied reduce shape that is eliminated in the process of the output construction. For instance, in the example below, we start with an input of shape (3, 2) and create an output of shape (1, 2). Implicitly, we reduce over a tensor of shape (3, 1) for each element in the output.

From https://minitorch.github.io/module2/tensorops/
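A naive (non-minitorch) sketch of operation c / reduce, showing the implied reduce shape: reducing a (3, 2) tensor over dim 0 yields shape (1, 2), and each output cell folds over the eliminated dimension:

```python
import numpy as np

def naive_reduce(fn, a: np.ndarray, dim: int, start: float) -> np.ndarray:
    out_shape = list(a.shape)
    out_shape[dim] = 1                               # the reduced dimension collapses to size 1
    out = np.full(out_shape, start)
    for out_idx in np.ndindex(*out_shape):           # every output position
        acc = start
        for k in range(a.shape[dim]):                # fold over the implied reduce shape
            a_idx = list(out_idx)
            a_idx[dim] = k
            acc = fn(acc, a[tuple(a_idx)])
        out[out_idx] = acc
    return out

# naive_reduce(lambda x, y: x + y, np.ones((3, 2)), dim=0, start=0.0) -> [[3., 3.]]
```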

Note:

  1. Pay attention to the direction in which to_index walks the dimensions
  2. In backward, be extra careful with the shape of the gradient; my implementation forces grad to be either of size 1 or exactly the shape of the corresponding input
  3. Note that expand and reshape may not actually change the indexing; sometimes what you really need is permute
  4. Division is implemented as multiplication by the reciprocal

Lesson3

Covered in this module: 1. parallelism (numba.jit(parallel=True)) 2. CUDA (numba.cuda)

numba.jit(parallel=True)
Use the map/zip/reduce structure for fast parallelization; in addition, implement operations such as matrix multiplication as their own kernels.
Guidelines required by the course: 1. main loop in parallel (prange) 2. all indices use numpy buffers 3. the inner function should not call any functions or write any non-local variables. numba parallelizes prange automatically, but it supports only a subset of numpy functions and array operations; if a kernel fails to compile, replacing the offending operation with more primitive numpy operations often gets it to compile without changing the logic. numba may also automatically fuse operations (combining several basic operations into one optimized one, e.g. pointwise add + reduce -> matmul).
Note: declarations inside the loop should be hoisted automatically (moved outside the loop to save allocation time), but writing the hoisted version by hand can lead to multiple threads writing into the same array.

JIT compilation adds some startup overhead, but it replaces inefficient Python code with fast low-level code and combines operations to eliminate unnecessary intermediate tensors.
For the backward of matmul, with f(M, N) = M @ N and upstream gradient d = grad_output:

grad_M = d @ N^T
grad_N = M^T @ d
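These two formulas are easy to sanity-check in numpy; d here stands for the upstream gradient grad_output (an assumption for the check, not minitorch code):

```python
import numpy as np

M = np.random.rand(3, 4)
N = np.random.rand(4, 5)
d = np.random.rand(3, 5)          # upstream gradient of f(M, N) = M @ N

grad_M = d @ N.T                  # shape (3, 4), matches M
grad_N = M.T @ d                  # shape (4, 5), matches N

assert grad_M.shape == M.shape and grad_N.shape == N.shape
```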

numba.jit

Supported Operations with parallel semantics

  1. numba array operation,
    • unary operators: + - ~
    • binary operators: + - * / /? % | >> ^ << & ** //
    • comparison operators: == != < <= > >=
    • Numpy ufuncs that are supported in nopython mode.
    • User defined DUFunc through vectorize().
  2. Numpy reduction functions sum, prod, min, max, argmin, and argmax. Also, array math functions mean, var, and std.
  3. Numpy array creation functions zeros, ones, arange,...
  4. Numpy dot function
  5. The full semantics of Numpy broadcast is not supported
  6. Array assignment using a slice
  7. The reduce operator of functools on 1D Numpy arrays

Explicit Parallel Loops

  1. prange can replace range to mark a loop as parallelizable; numba generally parallelizes only the outermost prange and treats inner ones as ordinary range
  2. The variable holding a reduction result should be initialized before the threads start, so that all threads begin from the same initial state
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def prange_wrong_result(x):
    n = x.shape[0]
    y = np.zeros(4)
    for i in prange(n):
        # accumulating into the same element of `y` from different
        # parallel iterations of the loop results in a race condition
        y[i % 4] += x[i] # ERROR

    return y

whereas performing a whole array reduction is fine:

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def prange_ok_result_whole_arr(x):
    n = x.shape[0]
    y = np.zeros(4)
    for i in prange(n):
        y += x[i]#OK
    return y

Basic pattern

def tensor_map(
    fn: Callable[[float], float]
) -> Callable[[Storage, Shape, Strides, Storage, Shape, Strides], None]:
    """
    NUMBA low_level tensor_map function. See `tensor_ops.py` for description.

    Optimizations:

    * Main loop in parallel
    * All indices use numpy buffers
    * When `out` and `in` are stride-aligned, avoid indexing

    Args:
        fn: function mappings floats-to-floats to apply.

    Returns:
        Tensor map function.
    """

    def _map(
        out: Storage,
        out_shape: Shape,
        out_strides: Strides,
        in_storage: Storage,
        in_shape: Shape,
        in_strides: Strides,
    ) -> None:
        if np.array_equal(out_shape, in_shape) and np.array_equal(
            out_strides, in_strides
        ):
            for out_pos in prange(len(out)):
                out[out_pos] = fn(in_storage[out_pos])
            return

        for outi in prange(len(out)):
            out_index = np.zeros(
                len(out_shape), dtype=np.int32
            )  # will be automatically hoisted
            in_index = np.zeros(len(in_shape), dtype=np.int32)
            to_index(outi, out_shape, out_index)
            out_pos = index_to_position(out_index, out_strides)
            if out_pos != outi:
                print("ERROR!", out_pos, outi)
            # assert out_pos == i
            broadcast_index(out_index, out_shape, in_shape, in_index)
            in_pos = index_to_position(in_index, in_strides)
            out[out_pos] = fn(in_storage[in_pos])

    return njit(parallel=True)(_map)  # type: ignore

to_index = njit(inline="always")(to_index)
index_to_position = njit(inline="always")(index_to_position)
broadcast_index = njit(inline="always")(broadcast_index)

CUDA

grid -> blocks -> threads
a bundle of threads = a block
a bundle of blocks = a grid
The maximum number of threads per block depends on the GPU; the count should ideally be a multiple of 32. Threads and grids can each be 1- to 3-dimensional; the extra dimensions do not change how memory is actually accessed, they are just more convenient for the programmer. Each thread has a limited number of registers and a limited amount of memory; exceeding these limits shows up as a core dump when running under numba.

            # Instantiate and run the cuda kernel.
            threadsperblock = THREADS_PER_BLOCK
            blockspergrid = (out.size + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
            # print("map:", out.shape, out.strides, a.shape, a.strides)
            f[blockspergrid, threadsperblock](*out.tuple(), out.size, *a.tuple())  # type: ignore

There are multiple limits. All must be satisfied.

The maximum number of threads in the block is limited to 1024. This is the product of whatever your threadblock dimensions are (x*y*z). For example (32,32,1) creates a block of 1024 threads. (33,32,1) is not legal, since 33*32*1 > 1024.

The maximum x-dimension is 1024. (1024,1,1) is legal. (1025,1,1) is not legal.

The maximum y-dimension is 1024. (1,1024,1) is legal. (1,1025,1) is not legal.

The maximum z-dimension is 64. (1,1,64) is legal. (2,2,64) is also legal. (1,1,65) is not legal.

Basic pattern with numba.cuda

    def map(fn: Callable[[float], float]) -> MapProto:
        "See `tensor_ops.py`"
        f = tensor_map(cuda.jit(device=True)(fn))

        def ret(a: Tensor, out: Optional[Tensor] = None) -> Tensor:
            if out is None:
                out = a.zeros(a.shape)

            # Instantiate and run the cuda kernel.
            threadsperblock = THREADS_PER_BLOCK
            blockspergrid = (out.size + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
            # print("map:", out.shape, out.strides, a.shape, a.strides)
            f[blockspergrid, threadsperblock](*out.tuple(), out.size, *a.tuple())  # type: ignore
            return out

        return ret


def tensor_map(
    fn: Callable[[float], float]
) -> Callable[[Storage, Shape, Strides, Storage, Shape, Strides], None]:
    """
    CUDA higher-order tensor map function. ::

      fn_map = tensor_map(fn)
      fn_map(out, ... )

    Args:
        fn: function mappings floats-to-floats to apply.

    Returns:
        Tensor map function.
    """

    def _map(
        out: Storage,
        out_shape: Shape,
        out_strides: Strides,
        out_size: int,
        in_storage: Storage,
        in_shape: Shape,
        in_strides: Strides,
    ) -> None:
        out_index = cuda.local.array(MAX_DIMS, numba.int32)
        in_index = cuda.local.array(MAX_DIMS, numba.int32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        if i >= out_size:
            return
        if is_array_eq(out_shape, in_shape) and is_array_eq(out_strides, in_strides):
            out[i] = fn(in_storage[i])
            # print(i, "->", i, "value:", out[i], "named, ", fn(in_storage[i]), "=", fn_name, "(", in_storage[i], ")")
        else:
            to_index(i, out_shape, out_index)
            broadcast_index(out_index, out_shape, in_shape, in_index)
            in_pos = index_to_position(in_index, in_strides)
            out_pos = index_to_position(out_index, out_strides)
            out[out_pos] = fn(in_storage[in_pos])
            # print(i, "(", out_pos, ")", "->", in_pos, "value:", out[out_pos], "named, ", fn(in_storage[in_pos]), "=", fn_name, "(", in_storage[in_pos], ")")

    return cuda.jit()(_map)  # type: ignore

Shared memory:
All threads within a block can access shared memory; it is independent between blocks and faster to access than global memory.

  1. Each block has exactly the same size and shape.
  2. Each thread knows its position in the global grid.
  3. x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x identifies the thread's global position along the first dimension

CUDA + numba, missing features: 1. dynamic parallelism (creating and managing additional parallel workloads automatically, without CPU intervention) 2. texture memory (memory for storing textures, optimized for 2-D spatial locality, e.g. bilinear and trilinear filters)

Writing CUDA kernels: CUDA's execution model differs from the sequential CPU programming model: thousands of threads run at the same time.
Three kinds of GPU memory:

  1. global device memory: the large, relatively slow off-chip memory attached to the GPU itself
  2. shared memory: on-chip, shared within a block
  3. local memory: private to each thread (despite the name, it resides in device memory)

Kernel declaration
kernel function: a GPU function meant to be launched from CPU code; it cannot return a value, and results are written back through its arguments
device function: cuda.jit(device=True); it can return a value and is called from inside the device, by kernel functions and other device functions
A function is compiled only once and can be launched with different grid configurations.

Kernel invocation:

blockspergrid = (16,16,40)
threadsperblock = (16,16,1)
kernel_func[blockspergrid, threadsperblock](inputs_array)

How to choose the block size? It needs to be large enough to allow full occupation of the execution units.
Thread positioning: x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
or numba.cuda.grid(1)
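A tiny self-contained kernel showing the two equivalent ways of positioning a thread (manual index arithmetic vs numba.cuda.grid(1)); the sizes here are just example values:

```python
from numba import cuda
import numpy as np

@cuda.jit
def add_one(x):
    # i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x  (equivalent)
    i = cuda.grid(1)
    if i < x.size:              # guard: the last block may run past the array
        x[i] += 1.0

arr = np.zeros(1000)
threadsperblock = 128
blockspergrid = (arr.size + threadsperblock - 1) // threadsperblock
add_one[blockspergrid, threadsperblock](arr)
```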

Memory Management:
Data Transfer:

  1. Numba can transfer numpy arrays to the device and then transfer device memory back to the host when a kernel finishes
  2. numba.cuda.device_array: allocate an empty device ndarray; numba.cuda.device_array_like
  3. numba.cuda.to_device: allocate and transfer a numpy ndarray or structured scalar to the device; returns a numba.cuda.DeviceNDArray
    e.g.:
    ary = numpy.arange(w)
    d_ary = cuda.to_device(ary)
    d_ary.copy_to_host(ary)
    
  4. Q: on numba.cuda.DeviceNDArray, what is the difference between is_c_contiguous and is_f_contiguous? Between the ravel and reshape methods?
  5. Pinned memory: cannot be paged out; enables direct memory access transfers and avoids page faults. numba.cuda.pinned
  6. numba.cuda.stream
  7. numba.cuda.shared.array
  8. numba.cuda.syncthreads: synchronize all threads in the same thread block, acting as a barrier
  9. numba.cuda.local.array: private to each thread; often used as extra scratchpad area when there are not enough registers for the local variables. It is allocated once when kernel execution starts and cannot be allocated during execution.
  10. constant memory: read-only, cached, off-chip; accessible by all threads; allocated by the host rather than the device
  11. Deallocation behaviour:
    1. An external memory management plugin can change the deallocation behaviour
    2. Deallocation of all CUDA resources is tracked per context; when the last reference is dropped, the underlying memory is queued for deallocation
    3. Deallocation does not happen immediately; it is added to a queue, which defers the synchronization cost of freeing and also prevents continuing deallocation errors from turning into severe errors
  12. CUDA Python: numba does not support any numpy array-level operations on the device, only simple element-level operations; no try/except; print is supported but asynchronous; simple arithmetic and zip-like element operations are supported, array creation is not. assert causes strange failures, so avoid it as well (the program may core dump or some threads may exit early; even if the assert never fires it can still make some threads exit, leaving cache loading or computation incomplete)
  13. Atomic operations: numba.cuda.atomic: add, max, min, compare_and_swap (see the sketch after this list)
  14. Random numbers: numba.cuda.random
  15. The CUDA simulator can be used to debug CUDA Python, but only for simple, single-thread cases
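A small sketch of the atomic operations in item 13: a histogram where many threads may hit the same bin, so the increment has to go through cuda.atomic.add (the kernel and names here are only for illustration):

```python
from numba import cuda
import numpy as np

@cuda.jit
def histogram(values, bins):
    i = cuda.grid(1)
    if i < values.size:
        b = int(values[i]) % bins.size     # which bin this value falls into
        cuda.atomic.add(bins, b, 1)        # atomic: concurrent threads may update the same bin

vals = np.random.randint(0, 16, size=4096).astype(np.float64)
bins = np.zeros(16, dtype=np.int64)
threadsperblock = 128
blockspergrid = (vals.size + threadsperblock - 1) // threadsperblock
histogram[blockspergrid, threadsperblock](vals, bins)
```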

Tips for writing CUDA programs:
Have multiple threads load data into shared memory first, then do the computation.
For matmul A @ B, for example, you can load A[block_start:block_end, k_start:k_end] and B[k_start:k_end, block_start_y:block_end_y] into shared memory and compute block by block; this should be faster than reading directly from the off-chip input.
Note: syncthreads is also needed after the computation, not only around reading and writing the shared-memory cache!!!!!!!

e.g.:

def _tensor_matrix_multiply(
    out: Storage,
    out_shape: Shape,
    out_strides: Strides,
    out_size: int,
    a_storage: Storage,
    a_shape: Shape,
    a_strides: Strides,
    b_storage: Storage,
    b_shape: Shape,
    b_strides: Strides,
) -> None:
    """
    CUDA tensor matrix multiply function.
    Requirements:
    * All data must be first moved to shared memory.
    * Only read each cell in `a` and `b` once.
    * Only write to global memory once per kernel.
    Should work for any tensor shapes that broadcast as long as ::
    ```python
    assert a_shape[-1] == b_shape[-2]
    ```
    Returns:
        None : Fills in `out`
    """
    out_size = 1
    for sz in out_shape:
        out_size *= sz
    out_bbatch_stride = out_strides[-4] if len(out_strides) >= 4 else out_size
    out_bbatch_size = out_size // out_bbatch_stride
    # Batch dimension - fixed, c[batch, i, j]
    batch = cuda.blockIdx.z
    out_dim = len(out_shape)
    a_dim = len(a_shape)
    b_dim = len(b_shape)
    BLOCK_DIM = 16
    a_shared = cuda.shared.array((BLOCK_DIM, BLOCK_DIM), numba.float64)
    b_shared = cuda.shared.array((BLOCK_DIM, BLOCK_DIM), numba.float64)
    out_index = cuda.local.array(MAX_DIMS, numba.int32)
    a_index = cuda.local.array(MAX_DIMS, numba.int32)
    b_index = cuda.local.array(MAX_DIMS, numba.int32)
    # The final position c[i, j]
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    debug = False  # (cuda.blockIdx.x, cuda.blockIdx.y, cuda.blockIdx.z) == (2,2,0)
    # The local position in the block.
    pi = cuda.threadIdx.x
    pj = cuda.threadIdx.y
    # Code Plan:
    # 1) Move across shared dimension by block dim.
    #    a) Copy into shared memory for a matrix.
    #    b) Copy into shared memory for b matrix
    #    c) Compute the dot produce for position c[i, j]
    dim_m = a_shape[-2]
    dim_n = a_shape[-1]
    dim_d = b_shape[-1]
    if i >= dim_m and j >= dim_d:
        return
    out_pos_now = batch * out_strides[-3] + i * out_strides[-2] + j
    for bbatch_i in range(out_bbatch_size):
        to_index(
            batch * out_strides[-3] + bbatch_i * out_bbatch_stride, out_shape, out_index
        )
        out_index[out_dim - 2] = i
        out_index[out_dim - 1] = j
        broadcast_index(out_index, out_shape, a_shape, a_index)
        broadcast_index(out_index, out_shape, b_shape, b_index)
        tmp = 0.0
        for base_n in range(0, dim_n, BLOCK_DIM):
            a_index[a_dim - 1] = base_n + pj
            a_c = b_c = 0
            if a_index[a_dim - 1] < dim_n and i < dim_m:
                a_c = 344
                a_pos = index_to_position(a_index, a_strides)
                a_shared[pi, pj] = a_storage[a_pos]
            b_index[b_dim - 2] = base_n + pi
            if b_index[b_dim - 2] < dim_n and j < dim_d:
                b_c = 234
                b_pos = index_to_position(b_index, b_strides)
                b_shared[pi, pj] = b_storage[b_pos]
            numba.cuda.syncthreads()
            if i < dim_m and j < dim_d:
                k_lim = min(BLOCK_DIM, dim_n - base_n)
                for k in range(k_lim):
                    tmp += a_shared[pi, k] * b_shared[k, pj]
            numba.cuda.syncthreads()  # Note:!!!!!!!!!!!!!
        if i < dim_m and j < dim_d and out_pos_now < out_size:
            out[out_pos_now] = tmp
        out_pos_now += out_bbatch_stride


def _sum_practice(out: Storage, a: Storage, size: int) -> None:
    """
    This is a practice sum kernel to prepare for reduce.
    Given an array of length $n$ and out of size $n // \text{blockDIM}$
    it should sum up each blockDim values into an out cell.
    $[a_1, a_2, ..., a_{100}]$
    |
    $[a_1 +...+ a_{31}, a_{32} + ... + a_{64}, ... ,]$
    Note: Each block must do the sum using shared memory!
    Args:
        out (Storage): storage for `out` tensor.
        a (Storage): storage for `a` tensor.
        size (int):  length of a.
    """
    BLOCK_DIM = 32
    cache = cuda.shared.array(BLOCK_DIM, numba.float64)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    pos = cuda.threadIdx.x
    if i >= size:
        return
    cache[pos] = a[i]
    numba.cuda.syncthreads()
    s = 1
    while s < BLOCK_DIM:
        if pos % (2 * s) == 0 and pos + s < BLOCK_DIM and i + s < size:
            cache[pos] += cache[pos + s]
        s *= 2
        cuda.syncthreads()
    if pos == 0:
        out[cuda.blockIdx.x] = cache[pos]

Note:

  1. Sum is done with a pairwise (binary tree) reduction instead of a straight sequential sum (log depth)
  2. Bugs tend to show up once the input is longer than 2 * blockDim.x
  3. Be careful with return and continue: even if a thread does not take part in the computation, it may still take part in loading the shared-memory cache, and an early continue can leave one or two cache entries unloaded. Also, bounds must be checked when loading the cache, because in CUDA, especially around index_to_position, it is easy to load the wrong element into the cache
  4. Do not put assert inside CUDA kernels; even if the assert never fires it can cause other problems

Lesson4

1-D convolution

Assume that where the sliding window runs past the edge, the missing values are padded with 0.
Under that assumption, output = unrolled_input.view(batch, width, in_channels * kernel_width) @ weight.view(out_channels, in_channels * kernel_width) + self.bias (the weight view has to be transposed for the shapes to line up),
where weight.shape = (out_channels, in_channels, kernel_width), bias.shape = (1, out_channels, 1), input.shape = (batch, in_channels, width). Ignoring edge handling, output.shape simplifies to (batch, out_channels, width) and unrolled_input.shape = (batch, width, in_channels, kernel_width); the matrix multiply (@) contracts the trailing (in_channels, kernel_width) dimensions of weight and unrolled_input.
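A plain-numpy sketch of the unrolled formulation above (zero padding on the right so the window may run off the edge); the helper name is mine, this is not minitorch's tensor_conv1d:

```python
import numpy as np

def conv1d_unrolled(x, w, b):
    # x: (batch, in_c, width), w: (out_c, in_c, kw), b: (1, out_c, 1)
    batch, in_c, width = x.shape
    out_c, _, kw = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (0, kw - 1)))                  # pad with 0 past the edge
    # unrolled[bt, c, t, k] = xp[bt, c, t + k]
    unrolled = np.stack([xp[:, :, k:k + width] for k in range(kw)], axis=-1)
    unrolled = unrolled.transpose(0, 2, 1, 3).reshape(batch, width, in_c * kw)
    out = unrolled @ w.reshape(out_c, in_c * kw).T                 # contract (in_c, kw)
    return out.transpose(0, 2, 1) + b                              # (batch, out_c, width)
```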
Note:

  1. The shape of output can be used to control how far the convolution is computed along each dimension (for example, only up to where no padding is needed)
  2. A convolution can be used to compute its own derivative:
    
       def backward(ctx: Context, grad_output: Tensor) -> Tuple[Tensor, Tensor]:
            input, weight = ctx.saved_values
            batch, in_channels, w = input.shape
            out_channels, in_channels, kw = weight.shape
            grad_weight = grad_output.zeros((in_channels, out_channels, kw))
            new_input = input.permute(1, 0, 2)
            new_grad_output = grad_output.permute(1, 0, 2)
            tensor_conv1d(
                *grad_weight.tuple(),
                grad_weight.size,
                *new_input.tuple(),
                *new_grad_output.tuple(),
                False,
            )
            grad_weight = grad_weight.permute(1, 0, 2)
            grad_input = input.zeros((batch, in_channels, w))
            new_weight = weight.permute(1, 0, 2)
            tensor_conv1d(
                *grad_input.tuple(),
                grad_input.size,
                *grad_output.tuple(),
                *new_weight.tuple(),
                True,
            )
            return grad_input, grad_weight
    
    
  3. The last dimension is used for the kernel, not for the output dimension
  4. I checked the PyTorch code (specifically aten/src/ATen/vulkan/ops/convolutions.cpp); its Tile (unroll) step also tends to copy numbers one by one
  5. Note that exp is quite unstable, which makes softmax unstable as well: once an activation reaches about 700, exp(x) = inf, and the corresponding weights get updated to -inf / nan. Losses therefore mostly use logsoftmax, whose computation also needs extra care:
    
    
    def logsoftmax(input: Tensor, dim: int) -> Tensor:
        r"""
        Compute the log of the softmax as a tensor.
        $z_i = x_i - \log \sum_i e^{x_i}$
        See https://en.wikipedia.org/wiki/LogSumExp#log-sum-exp_trick_for_log-domain_calculations
        Args:
            input : input tensor
            dim : dimension to apply log-softmax
        Returns:
             log of softmax tensor
        """
        max_input = max(input, dim)
        mid_input = input - max_input
        mid_input = mid_input.exp()
        mid_input = mid_input.sum(dim)
        mid_input = mid_input.log()
        lse_input = max_input + mid_input
        return input - lse_input
    
    

2-D conv

max

In practice, to avoid implementing the pooling operation as a loop, we can manipulate the shape and strides of the input tensor in order to pool it directly.

From https://minitorch.github.io/module4/pooling/

To avoid implementing pooling as a loop, combine it with tile: add padding and use the tile function to reshape the last dimensions of the array to new_width * kernel_size, adding pad values if the size does not divide evenly (see the sketch below).
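A sketch of avgpool2d written that way, assuming a tile helper with minitorch's documented behaviour (it returns a tensor of shape (batch, channel, new_height, new_width, kh * kw) plus the new sizes); treat it as an outline, not the reference solution:

```python
def avgpool2d(input, kernel):
    # input: (batch, channel, height, width); kernel: (kh, kw)
    batch, channel, height, width = input.shape
    tiled, new_height, new_width = tile(input, kernel)   # last dim holds the kh * kw window
    pooled = tiled.mean(4)                               # average over each window
    return pooled.view(batch, channel, new_height, new_width)
```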

training

  1. In the basic CNN sentiment training, using maxpool2d instead of avgpool2d may still converge for about 3 epochs, after which the loss settles around 11, i.e. most outputs are -1 and get clipped away by relu
  2. The relu between linear layers is essential; at least in my experiments, training never converged without it
  3. Making the network units too large causes a core dump
  4. logsoftmax indeed returns negative values in every dimension except the one with the largest probability
  5. A poorly designed network structure easily leads to a loss that does not decrease, or to the loss intermittently blowing up to inf and then turning into nan
  6. When writing the caching part and the computation part of a CUDA kernel, think about them separately: blockspergrid partitions the data and should follow the output data; threadsperblock should follow the data actually used (the data that needs to be cached)
posted @ 2024-03-11 22:03 雪溯