AHS: An Agile Framework for Hardware Specialization and Software Mapping
Overview
As Moore's law draws to a close, designing specialized hardware, together with the software that maps applications onto it, is a promising way forward. The hardware design determines peak performance, while the software mapping determines the performance actually achieved. Hardware/software (HW/SW) co-optimization of hardware acceleration and software mapping improves overall performance. Current flows, however, design hardware and software in isolation, and because the programming abstractions are low-level and the design space is enormous, both the hardware and the software are difficult to design and optimize.
This tutorial introduces AHS, an agile framework for hardware specialization and software mapping of tensor applications. Given a tensor application described in a high-level language, AHS can automatically define the hardware/software interface, jointly navigate the huge design space, and automatically generate both the hardware implementation and the software mapping. AHS consists of several components, each backed by an open-source tool.
First, we introduce HASCO, a tool for hardware/software co-design. HASCO uses a matching approach over a loop-based IR to explore different HW/SW partitioning choices. Because the design goals and evaluation costs differ, HASCO applies different design space exploration (DSE) algorithms on the hardware and software sides.
Second, we introduce TENET, a tool for hardware dataflow representation and performance modeling. TENET uses a relation-centric notation that fully covers the design space of hardware dataflows.
Third, we introduce TensorLib, the synthesis backend of TENET. TensorLib automatically generates hardware dataflow implementations written in Chisel.
Fourth, we introduce FlexTensor, a tool for automatic software mapping and optimization. FlexTensor automatically generates optimized software implementations for a variety of hardware platforms, including CPUs, GPUs, FPGAs, and ASICs.
Chisel is an open-source hardware construction language released by UC Berkeley. It supports advanced hardware design through highly parameterized generators and layered domain-specific hardware design languages.
Key features:
- Embedded in the Scala programming language
- Hierarchical, object-oriented, and functional construction
- Highly parameterizable via metaprogramming in Scala
- Supports layering of domain-specific languages
- Generates low-level Verilog designs that can be passed to standard ASIC or FPGA tools
A circuit designed in Chisel can be compiled into Verilog HDL targeting FPGAs or ASICs, as well as into a cycle-accurate C++ simulator:
Chisel -> FPGA Verilog
Chisel -> ASIC Verilog
Chisel -> C++ Simulator
Schedule
We first give an overview of the AHS project, followed by a series of technique presentations and open-source tool demos. The program covers all components of AHS, including hardware/software co-design, hardware specialization, and software mapping.
Organizers
Yun (Eric) Liang is an Associate Professor in the School of EECS at Peking University. His research interests include computer architecture, electronic design automation, and compilers. He has published over 100 scientific papers at venues such as ISCA, MICRO, DAC, and FPGA. His research has received two best paper awards and six best paper nominations. He serves on the technical program committees of MICRO, ISCA, ASPLOS, HPCA, DAC, FPGA, FCCM, and others, and as an associate editor of ACM TECS and TRETS.
Zizhang Luo is in the final year of his graduate studies at Peking University. He is interested in architecture and software co-design for domain-specific chips.
Liqiang Lu is a fifth-year PhD student at Peking University. He received his bachelor's degree from the same university in 2017. He is interested in spatial architectures and reconfigurable computing.
Liancheng Jia is a fourth-year PhD student at Peking University. He received his bachelor's degree from the same university in 2018. He is interested in high-level synthesis and agile hardware design.
Size Zheng is a third-year PhD student at Peking University. He received his bachelor's degree from the same university in 2019. He is interested in compiler design and optimization for domain-specific accelerators.
Install Steps
Install
Shell script
You can use this shell script to install everything.
sh -c "$(wget https://pku-ahs.github.io/tutorial/en/master/_downloads/9064601015f9cd5e747a641dbdacf3aa/install_ahs.sh -O -)"
source ~/.bashrc
The shell script is tested on Ubuntu 20.04 LTS. If you use another OS, or if you manage Python with Anaconda or Virtualenv, you may need to modify the script yourself. Windows users are best served by WSL.
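If you are unsure which distribution you are running, a short Python snippet can report it before you run the installer. This is a generic sketch that reads /etc/os-release; it is not part of the AHS tooling.

```python
# Check the host OS before running the install script (generic sketch,
# not part of the AHS tooling; the script itself targets Ubuntu 20.04 LTS).
def parse_os_release(text):
    """Parse /etc/os-release-style KEY=value lines into a dict."""
    info = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            info[key] = value.strip('"')
    return info

try:
    with open("/etc/os-release") as f:
        release = parse_os_release(f.read())
    print(release.get("NAME"), release.get("VERSION_ID"))
except FileNotFoundError:
    print("not a Linux host; consider WSL")
```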
Docker
You can pull our Docker image. Everything is prepared, configured, and installed for you.
docker pull ericlyun/ahsmicro:latest
docker run -it ericlyun/ahsmicro:latest /bin/bash
Requirements
Apt
- python3
- python3-pip
- git
- llvm-9
- cmake
- build-essential
- make
- autoconf
- automake
- scons
- libboost-all-dev
- libgmp10-dev
- libtool
- default-jdk
- csvtool
Pip
- numpy
- decorator
- attrs
- tornado
- psutil
- xgboost
- cloudpickle
- tensorflow
- tqdm
- IPython
- botorch
- jinja2
- pandas
- scipy
- scikit-learn
- plotly
Sbt
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee /etc/apt/sources.list.d/sbt_old.list
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add -
sudo apt-get update
sudo apt-get install sbt
Git
git clone --recursive -b micro_tutorial https://github.com/pku-liang/HASCO.git
git clone --recursive -b micro_tutorial https://github.com/pku-liang/TENET.git
git clone https://github.com/KnowingNothing/FlexTensor-Micro.git
git clone -b demo https://github.com/pku-liang/TensorLib.git
Configure & Compile
Hasco
cd ./HASCO
bash ./install.sh
# Settings
vim ~/.bashrc
# append:
# export TVM_HOME=<install_dir>/HASCO/src/tvm
# export AX_HOME=<install_dir>/HASCO/src/Ax
# export PYTHONPATH=$TVM_HOME/python:$AX_HOME:${PYTHONPATH}
source ~/.bashrc
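After sourcing ~/.bashrc, you can confirm the variables are visible to Python. This is a small sketch; the variable names are the ones exported above.

```python
# Confirm the environment variables exported above are visible (sketch;
# TVM_HOME and AX_HOME are the names appended to ~/.bashrc above).
import os

def missing_vars(required=("TVM_HOME", "AX_HOME")):
    """Return the required variables that are not set in the environment."""
    return [name for name in required if name not in os.environ]

print("missing:", missing_vars() or "none")
```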
TENET
cd ./TENET
bash ./init.sh
vim ~/.bashrc
# append:
# export LD_LIBRARY_PATH=<install_dir>/TENET/external/lib:$LD_LIBRARY_PATH
source ~/.bashrc
cd TENET
make cli
make hasco
Dockerfile
The Docker image is about 7 GB. If pulling it is difficult due to its size, you can build it yourself with the following Dockerfile.
# syntax=docker/dockerfile:1
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
&& apt-get -y -q install git sudo vim python3 python3-pip llvm-9 cmake build-essential make autoconf automake scons libboost-all-dev libgmp10-dev libtool curl default-jdk csvtool \
&& pip3 install tensorflow decorator attrs tornado psutil xgboost cloudpickle tqdm IPython botorch jinja2 pandas scipy scikit-learn plotly \
&& echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list \
&& echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee /etc/apt/sources.list.d/sbt_old.list \
&& curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add - \
&& sudo apt-get update \
&& sudo apt-get -y -q install sbt \
&& mkdir AHS \
&& cd AHS \
&& git clone --recursive -b micro_tutorial https://github.com/pku-liang/HASCO.git \
&& git clone --recursive -b micro_tutorial https://github.com/pku-liang/TENET.git \
&& git clone -b demo https://github.com/pku-liang/TensorLib.git \
&& git clone https://github.com/KnowingNothing/FlexTensor-Micro.git \
&& cd HASCO \
&& bash ./install.sh \
&& cd ../TENET \
&& bash ./init.sh
Run
HASCO
Config
vim src/codesign/config.py
mastro_home = "<install_dir>/HASCO/src/maestro"
tenet_path = "<install_dir>/TENET/bin/HASCO_Interface"
tenet_params = {
    "avg_latency": 16,  # average latency of each computation
    "f_trans": 12,      # energy consumed per element transferred
    "f_work": 16        # energy consumed per element in the workload
}
tensorlib_home = "<install_dir>/TensorLib"
tensorlib_main = "tensorlib.ParseJson"
Python API
python3 testbench/co_mobile_conv.py
python3 testbench/co_resnet_gemm.py
...
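To run several testbenches in one go, a small driver script can shell out to each of them. This is a sketch: it assumes the testbench/co_*.py naming shown above and is not part of HASCO itself.

```python
# Run every HASCO co-design testbench in sequence (sketch; assumes the
# testbench/co_*.py naming shown above, not part of HASCO itself).
import glob
import subprocess

def list_testbenches(directory="testbench", pattern="co_*.py"):
    """Return the testbench scripts under `directory`, sorted by name."""
    return sorted(glob.glob(f"{directory}/{pattern}"))

if __name__ == "__main__":
    for script in list_testbenches():
        print("running", script)
        subprocess.run(["python3", script], check=True)
```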
CLI
cd HASCO
./hasco.py -h
# Run a GEMM intrinsic with MobileNetV2 benchmark
./hasco.py -i GEMM -b MobileNetv2 -f gemm_example.json -l 1000 -p 20 -a 0
Results:
- rst/MobileNetV2_CONV.csv: config of the best design for each constraint; view with the column command
- rst/software/MobileNetV2_CONV_*: TVM IR for each design
- rst/hardware/CONV_*.json: TensorLib config for each design
- rst/hardware/CONV_*.v: TensorLib-generated Verilog
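Besides the column command, the result CSV can be inspected programmatically. The snippet below is a sketch using Python's standard csv module; the header names are illustrative only, and the real columns in rst/MobileNetV2_CONV.csv may differ.

```python
# Inspect a HASCO result CSV with the standard csv module (sketch; the
# header below is illustrative -- the real columns may differ).
import csv
import io

def read_designs(csv_text):
    """Parse result-CSV text into a list of per-design row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

sample = "design,latency,energy\n0,1200,3.4\n1,980,4.1\n"
for row in read_designs(sample):
    print(row)
```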
TENET
cd TENET
# Help Text
./bin/tenet -h
# Run a KC-systolic dataflow
./bin/tenet -p ./dataflow_example/pe_array.p -s ./dataflow_example/conv.s -m ./dataflow_example/KC_systolic_dataflow.m -o output.csv --all
# Run an OxOy dataflow
./bin/tenet -p ./dataflow_example/pe_array.p -s ./dataflow_example/conv.s -m ./dataflow_example/OxOy_dataflow.m -o output.csv --all
# Run all layers in MobileNet
./bin/tenet -e ./network_example/MobileNet/config -d ./network_example -o output.csv --all
Result: output.csv
TensorLib
cd TensorLib
# Optional: download the requirements from Maven so that the following commands run faster
sbt compile
# Examples of Scala APIs
sbt "runMain tensorlib.Example_GenConv2D"
sbt "runMain tensorlib.Example_GenGEMM"
# Examples of JSON interface
sbt "runMain tensorlib.ParseJson ./examples/conv2d.json ./output/conv2d.v"
sbt "runMain tensorlib.ParseJson ./examples/gemm.json ./output/gemm.v"
# Testing the result
sbt "runMain tensorlib.Test_Runner_Gemm"
Result:
- Scala API examples: PEArray.v
- ParseJson: the output path given as the second argument
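The JSON interface also lends itself to batch generation. The helper below is a sketch that rebuilds the same sbt command shown above for every config under examples/; the examples/ and output/ paths follow this repo's layout.

```python
# Batch-generate Verilog through TensorLib's JSON interface (sketch; builds
# the same `sbt "runMain tensorlib.ParseJson <in> <out>"` command shown
# above for each JSON config under examples/).
import glob
import os
import subprocess

def parse_json_cmd(config, out_dir="output"):
    """Build the sbt command that compiles one JSON config to Verilog."""
    name = os.path.splitext(os.path.basename(config))[0]
    return ["sbt", f"runMain tensorlib.ParseJson {config} {out_dir}/{name}.v"]

if __name__ == "__main__":
    for config in sorted(glob.glob("examples/*.json")):
        subprocess.run(parse_json_cmd(config), check=True)
```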
FlexTensor
cd FlexTensor-Micro
export PYTHONPATH=$PYTHONPATH:/path/to/FlexTensor-Micro
cd flextensor/tutorial
# First, CPU experiments
cd conv2d_llvm
# run flextensor
python optimize_conv2d.py --shapes res --target llvm --parallel 8 --timeout 20 --log resnet_config.log
# run test
python optimize_conv2d.py --test resnet_optimize_log.txt
# run baseline
python conv2d_baseline.py --type tvm_generic --shapes res --number 100
# run plot
python plot.py
# Next, GPU experiments
cd ../conv2d_cuda
# run flextensor
python optimize_conv2d.py --shapes res --target cuda --parallel 4 --timeout 20 --log resnet_config.log
# run test
python optimize_conv2d.py --test resnet_optimize_log.txt
# run baseline
python conv2d_baseline.py --type pytorch --shapes res --number 100
# run plot
python plot.py
# At last, VNNI experiments
cd ../gemm_vnni
# run flextensor (cascadelake)
python optimize_gemm.py --target "llvm -mcpu=cascadelake" --target_host "llvm -mcpu=cascadelake" --parallel 8 --timeout 20 --log gemm_config.log --dtype int32
# run flextensor (skylake)
python optimize_gemm.py --target "llvm -mcpu=skylake-avx512" --target_host "llvm -mcpu=skylake-avx512" --parallel 8 --timeout 20 --log gemm_config.log
# run test
python optimize_gemm.py --test gemm_optimize_log.txt
# run baseline
python gemm_baseline.py --type numpy --number 100
# run plot
python plot.py
Reference:
https://pku-ahs.github.io/tutorial/en/master/