HARDWARE FOR ML - LECTURE 6: DATAFLOW

Review

Deep neural networks typically consist of a sequence of convolutional, fully-connected, pooling, batch-normalization, and activation layers.

Convolution is one of the fundamental kernels in DNNs.

  • 2-D convolution
  • Stride and padding
  • 3-D convolution with input/output channels
  • Batch size

Convolution can be calculated in different ways.

  • Direct, GEMM, FFT-based, Winograd-based

Convolution Loop Nest

Option 1: Direct Convolution

image
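The figure shows the classic seven-deep loop nest for direct convolution. As a concrete reference, here is a minimal runnable sketch in plain Python (the function name and nested-list layout are my own illustration, not from the lecture):

# Direct convolution as a seven-deep loop nest (stride S, no padding).
# Shapes: inputs [N][C][H][W], weights [M][C][Kh][Kw] as nested lists.
def direct_conv(inputs, weights, stride=1):
    N, C = len(inputs), len(inputs[0])
    H, W = len(inputs[0][0]), len(inputs[0][0][0])
    M, Kh, Kw = len(weights), len(weights[0][0]), len(weights[0][0][0])
    Ho, Wo = (H - Kh) // stride + 1, (W - Kw) // stride + 1
    out = [[[[0.0] * Wo for _ in range(Ho)] for _ in range(M)] for _ in range(N)]
    for n in range(N):                      # batch
        for m in range(M):                  # output channels
            for y in range(Ho):             # output rows
                for x in range(Wo):         # output columns
                    for c in range(C):      # input channels
                        for i in range(Kh):         # kernel rows
                            for j in range(Kw):     # kernel columns
                                out[n][m][y][x] += (weights[m][c][i][j] *
                                                    inputs[n][c][y*stride + i][x*stride + j])
    return out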

Option 2: GEMM

Since direct convolution is not especially compute-efficient, Option 2 converts the convolution into a GEMM via im2col (Python provides corresponding functions). Matrix multiplication embodies far more accumulated engineering experience, whereas developing dedicated software/hardware acceleration kernels for convolution directly is comparatively hard. The method unrolls the local receptive field of each convolution window into a column, so the input activations of the many windows become the columns of a matrix (see the sketch below the figure).

Main drawbacks: the memory footprint of the input activations grows (window elements are duplicated), and the input data stream must be reorganized (elements are distributed symmetrically along the anti-diagonal).
image
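A minimal PyTorch sketch of the im2col-to-GEMM path, using torch.nn.functional.unfold as the im2col step (shapes are illustrative):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)            # input: N x C x H x W
w = torch.randn(4, 3, 3, 3)            # kernel: M x C x Kh x Kw

cols = F.unfold(x, kernel_size=3)      # im2col: (1, C*Kh*Kw, #windows) = (1, 27, 36)
out = w.view(4, -1) @ cols             # GEMM: (M, 27) x (27, 36) -> (1, 4, 36)
out = out.view(1, 4, 6, 6)             # fold the window index back into H' x W'

print(torch.allclose(out, F.conv2d(x, w), atol=1e-5))  # matches direct convolution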

For optimization techniques for general matrix multiplication, see: https://blog.csdn.net/qq_35985044/article/details/128474264

Besides the approach above, we can also try the diagonal mapping scheme described below. Take as an example an \(H\times W\) input image and a \(K_h\times K_w\) kernel (stride 1, no padding), with input size \(H=W=100\) and kernel size \(K_h=K_w=3\).

Option 3: FFT-based Convolution

The FFT-based method converts convolution in the spatial domain into element-wise multiplication in the frequency domain: apply an FFT to the weights and the input feature map to obtain their frequency-domain representations, multiply them to get the frequency-domain output activations, and finally recover the actual outputs with an inverse FFT.
image
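Before the full implementation below, a minimal numpy check of the convolution theorem (zero-pad both signals to the full linear-convolution length, multiply the spectra, invert):

import numpy as np

x, k = np.random.randn(16), np.random.randn(5)
n = len(x) + len(k) - 1                       # full linear-convolution length
y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
print(np.allclose(y, np.convolve(x, k)))      # pointwise product in frequency = convolution in time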

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from functools import partial
from typing import Iterable, Tuple, Union

import torch
import torch.nn.functional as f
from torch import Tensor, nn
from torch.fft import irfftn, rfftn
from math import ceil, floor

def complex_matmul(a: Tensor, b: Tensor, groups: int = 1) -> Tensor:
    """Multiply two complex-valued tensors over their channel dimensions.

    :param a: complex signal spectrum of shape (batch, channels, ...)
    :param b: complex kernel spectrum of shape (channels, ...)
    :param groups: grouped multiplications support multiple sections of channels
    :return: the complex matrix product
    """
    a = a.view(a.size(0), groups, -1, *a.shape[2:])
    b = b.view(groups, -1, *b.shape[1:])

    a = torch.movedim(a, 2, a.dim() - 1).unsqueeze(-2)
    b = torch.movedim(b, (1, 2), (b.dim() - 1, b.dim() - 2))

    real = a.real @ b.real - a.imag @ b.imag
    imag = a.imag @ b.real + a.real @ b.imag  # (x+iy)(u+iv): imaginary part is yu + xv
    real = torch.movedim(real, real.dim() - 1, 2).squeeze(-1)
    imag = torch.movedim(imag, imag.dim() - 1, 2).squeeze(-1)
    c = torch.zeros(real.shape, dtype=torch.complex64, device=a.device)
    c.real, c.imag = real, imag  

    return c.view(c.size(0), -1, *c.shape[3:])


def to_ntuple(val: Union[int, Iterable[int]], n: int) -> Tuple[int, ...]:
    """Cast an int or an iterable of ints to a tuple of length n.

    :param val: a single int (broadcast to length n) or an iterable of exactly n ints
    :param n: target tuple length
    :return: a tuple of n ints
    """

    if isinstance(val, Iterable):
        out = tuple(val)
        if len(out) == n:
            return out
        else:
            raise ValueError(f"Cannot cast tuple of length {len(out)} to length {n}.")
    else:
        return n * (val,)


def fft_conv(
        signal: Tensor,
        kernel: Tensor,
        bias: Tensor = None,
        padding: Union[int, Iterable[int], str] = 0,
        padding_mode: str = "constant",
        stride: Union[int, Iterable[int]] = 1,
        dilation: Union[int, Iterable[int]] = 1,
        groups: int = 1
) -> Tensor:
    """
    :param signal: input tensor to be convolved with the kernel
    :param kernel: convolution kernel
    :param bias: optional bias tensor to add to the output
    :param padding: if int, number of zero samples to pad the input on the last dimension; if str "same", pad the input to preserve its size
    :param padding_mode: one of {constant, reflect, replicate}
    :param stride: (Union[int, Iterable[int]]) stride size for computing output values
    :param dilation: (Union[int, Iterable[int]]) dilation rate for the kernel
    :param groups: number of groups for the convolution
    :return: the convolution of signal with kernel
    """

    # Cast padding, stride & dilation to tuples
    n = signal.ndim - 2
    stride_ = to_ntuple(stride, n=n)
    dilation_ = to_ntuple(dilation, n=n)
    if isinstance(padding, str):
        if padding == 'same':
            if stride != 1 or dilation != 1:
                raise ValueError("stride must be 1 for padding='same'.")
            padding_ = [(k - 1) / 2 for k in kernel.shape[2:]]
        else:
            raise ValueError(f"Padding mode {padding} not supported")
    else:
        padding_ = to_ntuple(padding, n=n)

    # internal dilation offsets
    offset = torch.zeros(1, 1, *dilation_, device=signal.device, dtype=signal.dtype)
    offset[(slice(None), slice(None), *((0,) * n))] = 1.0  
 
    # correct the kernel by cutting off unwanted dilation trailing zeros
    cutoff = tuple(slice(None, -d + 1 if d != 1 else None) for d in dilation_)  # create tuple

    # pad the kernel internally according to the dilation parameters
    kernel = torch.kron(kernel, offset)[(slice(None), slice(None)) + cutoff]  # kernel after dilation

    # Pad the input signal & kernel tensors (round to support even sized convolutions)
    signal_padding = [r(p) for p in padding_[::-1] for r in (floor, ceil)]
    signal = f.pad(signal, signal_padding, mode=padding_mode)

    signal_size = signal.size()  # signal size before even-length padding
    if signal.size(-1) % 2 != 0:
        signal = f.pad(signal, [0, 1])

    kernel_padding = [
        pad for i in reversed(range(2, signal.ndim)) for pad in [0, signal.size(i) - kernel.size(i)]
    ]  # zero-pad each trailing kernel dim up to the signal size
    padded_kernel = f.pad(kernel, kernel_padding)  # out_channels * in_channels * H * W

    # Perform Fourier convolution FFT matrix multiply IFFT
    signal_fr = rfftn(signal.float(), dim=tuple(range(2, signal.ndim)))
    kernel_fr = rfftn(padded_kernel.float(), dim=tuple(range(2, signal.ndim)))

    kernel_fr.imag *= -1  # conjugate the kernel spectrum: computes correlation (DL-style convolution)
    output_fr = complex_matmul(signal_fr, kernel_fr, groups=groups)
    output = irfftn(output_fr, dim=tuple(range(2, signal.ndim)))

    # Remove extra padded values
    crop_slices = [slice(None), slice(None)] + [
        slice(0, (signal_size[i] - kernel.size(i) + 1), stride_[i - 2])
        for i in range(2, signal.ndim)
    ]
    output = output[crop_slices].contiguous()

    if bias is not None:
        bias_shape = tuple([1, -1] + (signal.ndim - 2) * [1])  # 1 * -1 * 1 * 1
        output += bias.view(bias_shape)
    
    return output

class _FFTConv(nn.Module):
    def __init__(self,
                 in_channels: int,
                 out_channels: int,
                 kernel_size: Union[int, Iterable[int]],
                 padding: Union[int, Iterable[int]] = 0,
                 padding_mode: str="constant",
                 stride: Union[int, Iterable[int]] = 1,
                 dilation: Union[int, Iterable[int]] = 1,
                 groups: int = 1,
                 bias: bool = True,
                 ndim: int = 1):
        super().__init__()
        
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.padding = padding
        self.padding_mode = padding_mode
        self.stride = stride
        self.dilation = dilation
        self.groups = groups
        self.use_bias = bias
        
        if in_channels % groups != 0:
            raise ValueError(
                "'in_channels' must be divisible by 'groups'."
                f"Found: in_channels={in_channels}, groups={groups}."
            )
        if out_channels % groups != 0:
            raise ValueError(
                "'out_channels' must be divisible by 'groups'."
                f"Found: out_channels={out_channels}, groups={groups}."
            )
        
        kernel_size = to_ntuple(kernel_size, ndim)
        weight = torch.randn(out_channels, in_channels // groups, *kernel_size)
        
        self.weight = nn.Parameter(weight)
        self.bias = nn.Parameter(torch.randn(out_channels)) if bias else None
        
    def forward(self, signal):
        return fft_conv(
            signal,
            self.weight,
            bias=self.bias,
            padding=self.padding,
            padding_mode=self.padding_mode,
            stride=self.stride,
            dilation=self.dilation,
            groups=self.groups,
        )

FFTConv1d = partial(_FFTConv, ndim=1)
FFTConv2d = partial(_FFTConv, ndim=2)
FFTConv3d = partial(_FFTConv, ndim=3)

The test code is given below:

import torch
from fft_conv import fft_conv, FFTConv1d

signal = torch.randn(3, 3, 1024)  # data shape: (batch, channels, length)
kernel = torch.randn(2, 3, 128)  # kernel shape: (out_channels, in_channels, kernel_size)
bias = torch.randn(2)

out = fft_conv(signal, kernel, bias=bias)

fft_conv = FFTConv1d(3, 2, 128, bias=True)
fft_conv.weight = torch.nn.Parameter(kernel)
fft_conv.bias = torch.nn.Parameter(bias)
out = fft_conv(signal)
print(f"Output shape: {out.shape}")

Note that in the kernel computation the FFT-based method holds a speedup advantage over direct convolution.
image

Option 4: Winograd Transform

Take the 1-D convolution in the figure below as an example: computed as a general matrix multiplication, it requires 6 multiplications and 4 additions.
image

The matrix built from the input signal is not arbitrary: many of its elements repeat in a regular pattern (the \(d_1\) and \(d_2\) in the first and second rows), so the matrix multiplication derived from convolution is a smaller problem than general matrix multiplication.
Winograd introduces the intermediate terms \(m_1\sim m_4\): computing \(r_0=m_1+m_2+m_3\) and \(r_1=m_2-m_3-m_4\) costs 4 additions (subtractions) on the input signal \(d\), 4 multiplications to form the \(m\) terms, and 4 additions to combine them into the outputs.
Since the kernel elements are fixed at inference time, the transforms on \(g\) can be precomputed once and their cost ignored (3 additions: \(g_0+g_2\) once, then \(g_0+g_2-g_1\) and \(g_0+g_2+g_1\) once each). The total is therefore 4 multiplications and 8 additions. Because multiplication is slower than addition in hardware, trading a few extra additions for fewer multiplications yields a speedup.

The Winograd procedure can be written in the matrix form below (\(G\) and \(B^T\) are the transform operators applied to \(g\) and \(d\)), consisting of an input transform, a kernel transform, a Hadamard product, and an output transform:
image
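A numerical check of the 1-D case with standard F(2,3) transform matrices (the matrices below are the common Lavin–Gray choice; the figure may use an equivalent variant):

import numpy as np

# F(2,3) Winograd: 2 outputs, 3-tap filter, only 4 multiplications.
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

d = np.random.randn(4)                   # input signal tile
g = np.random.randn(3)                   # filter

m = (G @ g) * (BT @ d)                   # Hadamard product: the 4 multiplications
r = AT @ m                               # output transform yields r0, r1
print(np.allclose(r, np.correlate(d, g, mode='valid')))  # matches sliding-window convolution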

How do we generalize Winograd to two dimensions? Use \(Y=A^T\left[[GgG^T]\odot [B^TdB]\right]A\), where \(g\) is the \(r\times r\) kernel and \(d\) is an \((m+r-1)\times (m+r-1)\) image tile.

For the 1-D Winograd algorithm \(F(m,r)\), the number of multiplications required is \(m+r-1\). For the 2-D algorithm \(F(m\times n,r\times s)\) it is \((m+r-1)\times(n+s-1)\); when \(n=m\) and \(s=r\), the algorithm \(F(m\times m,r\times r)\) needs \((m+r-1)\times(m+r-1)\) multiplications. A handwritten derivation is shown below.

image
image
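Reusing BT, G, and AT from the 1-D sketch above, the nested 2-D formula can be checked the same way (scipy is used only to produce the reference result):

from scipy.signal import correlate2d

d2 = np.random.randn(4, 4)               # (m+r-1) x (m+r-1) input tile
g2 = np.random.randn(3, 3)               # r x r kernel

y2 = AT @ ((G @ g2 @ G.T) * (BT @ d2 @ BT.T)) @ AT.T   # 16 multiplies instead of 36
print(np.allclose(y2, correlate2d(d2, g2, mode='valid')))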

Additional Notes

Dilation

In convolutional neural networks (CNNs), dilation is a technique for enlarging the receptive field of a convolution. Ordinarily the kernel slides over the input tensor with a fixed stride, extracting information at each position; introducing a dilation parameter makes the kernel sample input elements at a wider spacing, enlarging its receptive field.
Concretely, dilation inserts zeros between the elements of the kernel, which enlarges the kernel's effective size and hence its receptive field on the input tensor. The key advantage is that the receptive field grows without enlarging the kernel itself, helping the network capture long-range dependencies and contextual information in the input.
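A small PyTorch example (shapes are illustrative): a 3x3 kernel with dilation=2 has an effective size of \(d(k-1)+1=5\), so it sees a 5x5 patch with only 9 weights:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, bias=False)
print(conv(x).shape)   # torch.Size([1, 1, 3, 3]): output shrinks as if the kernel were 5x5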

Dilation serves several purposes in a CNN:

Larger receptive field: enlarging the kernel's effective size grows the network's receptive field without adding parameters, so the network can better grasp the overall structure and context of the input.

Fewer parameters: compared with an equally large dense kernel, a dilated convolution reaches the same receptive field without extra weights, since it only inserts zeros rather than adding parameters.

Higher compute efficiency: covering a given receptive field with fewer kernel taps reduces the amount of computation to some extent.

Overall, dilation enlarges the network's receptive field and strengthens its understanding of the input without adding much parameter count or compute cost.

Dataflow Taxonomy

Locality shows up mainly in memory access patterns and in temporal/spatial data reuse.

  • Memory access involves memory reads, the MAC itself, and a memory write; the number of memory accesses far exceeds the number of MACs.
    image
  • Temporal/spatial data reuse appears as the time-multiplexed or space-multiplexed use of buffered data.
    image
    image
    Improvements targeting temporal and spatial locality:
    For temporal reuse, build a memory hierarchy: a small, faster cache inserted between the compute units and DRAM buffers data for reuse during DNN computation.
    image
    The typical approach for spatial reuse is to build parallel compute units to raise throughput. [The same data is used by more than one consumer at different spatial locations of the hardware.]
    image
    image

image

Locality and parallelism are the main levers for improving performance.

Data Reuse in DNN

image

image

Dataflow: determines the execution order of DNN operations on the hardware, including the order of computation and the order of data movement.
Loop nest: a compact way to describe that execution order (not the strict architectural notion of data distribution). In a dataflow loop nest, for expresses temporal, sequential execution order, while spatial_for expresses parallel execution order.

Output-Stationary and Weight-Stationary

image
WS vs. OS is determined by the innermost loops of the loop nest: the operand whose index does not vary there is the stationary one. In the example shown, the output activation is held fixed, making the dataflow output-stationary; a plain-code sketch of both orderings follows the figure below.

image
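A plain-Python stand-in for the two orderings on a 1-D convolution (function names are my own illustration; in real hardware the inner loop would typically be a spatial_for unrolled across PEs):

def conv1d_os(inputs, weights):
    # Output-stationary: the output index is outermost, so each partial sum
    # stays in a local accumulator until it is complete.
    E = len(inputs) - len(weights) + 1
    outputs = [0.0] * E
    for e in range(E):                    # stationary operand: output e
        acc = 0.0
        for r in range(len(weights)):     # weights/inputs stream past the accumulator
            acc += inputs[e + r] * weights[r]
        outputs[e] = acc
    return outputs

def conv1d_ws(inputs, weights):
    # Weight-stationary: the weight index is outermost, so each weight is
    # fetched once and reused across all outputs before moving on.
    E = len(inputs) - len(weights) + 1
    outputs = [0.0] * E
    for r in range(len(weights)):         # stationary operand: weight r
        w = weights[r]
        for e in range(E):                # partial sums stream in and out
            outputs[e] += inputs[e + r] * w
    return outputs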

OS

image
image
image
image
image
image

WS

image
image
image
image
image

IS

image
image
image
image

Other Dataflows

image
image

Summary

image
image
image

image
image

References

https://www.cnblogs.com/shine-lee/p/10906535.html
https://eyeriss.mit.edu/tutorial.html
