Proj. CDeepFuzz Paper Reading: PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
Abstract
Tools: TorchDynamo, TorchInductor
Task: the graph-capture front end and compiler back end that implement torch.compile
- TorchDynamo:
- Task: Python-level just-in-time (JIT) compiler
- Method: enables graph compilation in PyTorch programs by
- dynamically modifying Python bytecode before execution
- extracting sequences of PyTorch operations into an FX graph
- which is then JIT compiled using extensible backends
- TorchInductor
- Task: the default compiler backend
- Method: translates PyTorch programs into OpenAI's Triton for GPUs and C++/OpenMP for CPUs
Experiments:
Results:
- TorchDynamo:
- able to capture graphs more robustly than prior approaches while adding minimal overhead
- TorchInductor
- 2.27× inference and 1.41× training geometric mean speedup on an NVIDIA A100 GPU across 180+ real-world models, outperforming six other compilers
1. Intro
P1:
Machine Learning Frameworks:
Eager mode frameworks:
- e.g.: PyTorch, JAX
- imperative define-by-run approach
- the machine learning model is represented as code that is executed each time one wants to run the model
- Pros: easier to understand and can be debugged with standard tools
- Cons: the framework only sees one operator at a time, so it cannot automatically perform optimizations, such as fusing or scheduling, that cross operator boundaries
Graph mode frameworks:
- e.g.: TensorFlow, Caffe, Theano, CNTK
- declarative define-and-run approach
- require the user to build a graph first and execute it later
P2:
Challenge: an eager mode framework sees only one operator at a time and cannot automatically perform cross-operator optimizations
Existing approaches: record/replay; Python parsing; lazy evaluation
Drawbacks of existing approaches: low usability
- record/replay: unsound and can produce incorrect behavior
- Python parsing: works for simple programs but cannot replicate all of Python's complex semantics; fails on more than half of real-world models
- lazy evaluation: incurs high runtime overheads and adds latency to kernel launches
- for some models, PyTorch's exclusively graph mode backends are hard to apply at all
- many model authors rely on features that do not map well to graphs, such as dictionaries, lists, custom classes, third-party libraries (numpy, logging, etc.), disk/network, multiprocessing, exceptions, and handwritten kernels
P3:
This paper introduces TorchDynamo and TorchInductor.
These two extensions power the torch.compile feature introduced in PyTorch 2, officially released in March 2023.
TorchDynamo: a Python-level JIT compiler designed to allow graph compilation in PyTorch programs
- TorchDynamo creates FX graphs via bytecode analysis and is designed to generate smaller graph fragments that can be mixed with Python execution
1. It hooks into the frame evaluation API in CPython [9] to dynamically modify Python bytecode right before it is executed
2. It rewrites Python bytecode in order to extract sequences of PyTorch operations into an FX graph [34]
3. The FX graph is then JIT compiled with many extensible backends
- TorchInductor:
- the new default compiler backend for TorchDynamo
- translates PyTorch programs into OpenAI's Triton [46] for GPUs and C++/OpenMP [15] for CPUs
- uses a new define-by-run loop-level intermediate representation (IR)
- written in Python, so Python users can easily extend and modify it
2. Prior Attempts at PyTorch Graph Capture
Challenge for graph capture: the user is free to embed arbitrary code, so PyTorch code often cannot be mapped cleanly onto a fixed graph abstraction because of the
mismatch between the flexibility provided by Python/PyTorch and the inflexibility of graph representations:
1. PyTorch tensors frequently get converted to Python types
2. usage of external libraries
3. usage of Python constructs (classes, closures, exceptions, control flow, etc.)
2.1 torch.jit.trace
uses record/replay with example inputs to produce a TorchScript graph
recording happens at the PyTorch dispatcher level (in C++), so no control flow is captured
the dispatcher is used to dispatch operators to device-specific kernels and for autograd
def example1(x):
    if len(torch.nonzero(x)) > 1:
        return x + 1
    return x - 1
With example input torch.tensor([0, 0]), torch.jit.trace would capture a graph equivalent to:
def example1_incorrect_capture(x):
    torch.nonzero(x)
    return x - 1
2.2 torch.jit.script
builds a TorchScript graph by parsing the Python AST and performing static analysis
Drawback: it tries to pin Python down as a static language (emulating all of Python statically), which fails on many real programs
2.3 Lazy Tensors
introduced in the PyTorch/XLA project to support Google TPUs
Lazy Tensors: a C++-level graph capture technique; each iteration it defers execution until the graph has been accumulated, then sends the accumulated graph to XLA
Recompilation is avoided via graph hashing
Drawbacks:
- higher overheads: besides running the same Python code and PyTorch dispatcher stack, it must also maintain additional graph data structures every iteration
- introduced delays: Lazy Tensors cannot start the first kernel until the code for the entire model body has executed, whereas with eager tensors the first kernel starts immediately, so host-side execution of the model body runs in parallel with kernel execution
- consequence: this serializes host execution with GPU/accelerator utilization
- Recompilation: Q: every time the captured graph has a new hash, Lazy Tensors must recompile, which in some cases causes frequent recompilation
Mitigation: run Lazy Tensors capture only once rather than on every iteration, and use TorchDynamo to figure out when recapture is needed
2.4 torch.fx.symbolic_trace
- introduced the FX graph format
- takes a record/replay-based approach, but tracing runs at the Python level
Result:
it can capture more operations, but still suffers from all-or-nothing capture, can produce incorrect results, and is hard to debug
def example3(x):
    global call_count
    call_count += 1
    return torch.rand(10) + x
If one runs torch.fx.symbolic_trace on this example, it produces a graph equivalent to:
def example3_incorrect_capture(x):
    return _tensor_constant0 + x
The graph format has no way to express the Python global variable: the call_count update is dropped and the torch.rand result is frozen into a constant.
Nearly all graphs formats for machine learning have no concept of a Python global, so even if this could be captured, it is not supported by downstream backend compilers.
2.5 torch.onnx.export
ONNX export was not actually designed as a graph capture mechanism, but many people use it that way
It internally uses torch.jit.trace and torch.jit.script
ONNX also does not support all PyTorch operators, so only part of TorchScript can be converted
Q: The ONNX team is working on an integration that will replace TorchScript with a direct TorchDynamo integration.
2.6 Comparison To Graph Capture in JAX
JAX: a high-performance computing library from Google with automatic differentiation. JAX is tightly coupled with XLA, so JAX programs carry constraints similar to XLA's.
Its capture mechanism is comparatively simple; for example, jax.jit does not support data-dependent Python control flow, as shown below.
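A small sketch of the kind of data-dependent control flow that jax.jit rejects (illustrative; the exact exception type varies by JAX version):
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # under jit, x.sum() > 0 is an abstract tracer; converting it to a
    # Python bool for the if statement raises a concretization error
    if x.sum() > 0:
        return x + 1
    return x - 1

# f(jnp.ones(3))  # raises a tracer/concretization error under jit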
3. TorchDynamo Design and Implementation
Rather than trying to remove or replace Python, TorchDynamo tries to work with CPython by just-in-time (JIT) compiling Python bytecode.
TorchDynamo is a Python bytecode to Python bytecode translator, where it extracts PyTorch operations from the original bytecode and replaces them with calls to compiled artifacts that fuse many PyTorch operations
3.1 Usage API
torch.compile
Invocation: call it on a PyTorch Module, or use it as a function decorator; e.g.: torch.compile(model, backend='nvfuser')
Arguments: backend; options; mode
A custom CPython frame evaluation hook rewrites the bytecode of each Python function.
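A minimal usage sketch of both invocation forms (the backend string shown is the default; a GPU is not required):
import torch

@torch.compile  # decorator form: the function is compiled on first call
def toy(x):
    return torch.sin(x) + torch.cos(x)

model = torch.nn.Linear(16, 16)
compiled_model = torch.compile(model, backend="inductor")  # Module form
out = compiled_model(torch.randn(4, 16))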
3.2 CPython Frame Evaluation Hook
Background: the frame evaluation API (PEP 523)
A frame here represents a single function call.
Method: use PyInterpreterState.eval_frame to override the core function used to interpret a single function call in Python
- CPython creates a PyFrameObject
- and calls the eval_frame hook, which defaults to _PyEval_EvalFrameDefault
- TorchDynamo replaces this hook with one that performs JIT compilation of Python frames
The hook does the following:
- checks whether the frame should be skipped based on filename exclusions, mainly used to ignore common libraries
- checks whether this frame has already been compiled and cached; if so, executes the generated guard functions until one returns True, and in that case reuses the cached entry
- performs symbolic analysis of the bytecode to extract an FX graph, guards, and side effects
- compiles the FX graph with a user-defined compiler function (the backend supplied to torch.compile)
- generates and compiles a single Python function that checks all of the guards; if it returns True on a later call, the existing compiled artifact is reused
- if the analysis cannot cover the whole frame, generates resume_at_XX continuation functions for the remainder
- generates new Python bytecode that:
  - calls the compiled FX graph
  - stores and reconstructs the local/stack state
  - performs side effects the original function should have had
  - either returns or implements a graph break by falling back to the original bytecode and calling the generated continuation function(s)
- installs the generated Python bytecode and guard function in the cache, runs the generated bytecode with _PyEval_EvalFrameDefault, and returns
3.3 Guards
Guards are the mechanism TorchDynamo uses to recheck dynamic properties assumed by JIT compilation, to determine if a cached compilation can be reused.
Information used:
guards check many torch.Tensor properties, Python types, constant specialization, attributes, dicts/lists/tuples, nn.Module instances, and global PyTorch state. The guard system spans TorchDynamo, AOTAutograd, and TorchInductor.
Any layer can introduce guards to protect its specializations.
Guards are independent of one another (see the sketch below).
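An illustrative sketch of the kind of checks a generated guard function performs (the function name, argument convention, and exact property set here are hypothetical; real generated guards are more elaborate):
import torch

def __guard_for_frame(local_vars):
    # recheck the dynamic properties the compiled artifact specialized on
    x = local_vars["x"]
    return (
        isinstance(x, torch.Tensor)
        and x.dtype == torch.float32
        and x.device.type == "cuda"
        and tuple(x.size()) == (8, 128)
        and not x.requires_grad
    )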
3.4 Symbolic Evaluation
symbolic Python bytecode evaluator: analyzing Python bytecode and modeling effects of each instruction.
It tracks the following state:
- stack state;
- local variables;
- exception contexts;
- accumulated FX graph [34];
- accumulated guards;
- side effects
The algorithm processes one Python bytecode instruction at a time.
Steps:
- At the start of symbolic evaluation, function arguments are examined and converted to a symbolic representation, VariableTracker.
- If bytecodes access data structures such as class attributes or global variables, new symbolic representations for these constructs are added lazily.
- The symbolic evaluator starts at the first bytecode instruction of the function, and continues processing the function one bytecode at a time.
Q: The soundness of this analysis can be shown via induction: as long as each individual bytecode is processed correctly, the overall algorithm will be correct.
e.g.: for the LOAD_FAST bytecode, if the value is a PyTorch tensor, a new node is added to the FX graph and a new symbolic tensor pointing to the result node is created (the example below shows the raw instruction stream being walked).
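For intuition about the bytecode-level view being analyzed, the standard dis module prints the raw instructions of a small PyTorch function (output varies by Python version):
import dis
import torch

def f(x):
    y = torch.sin(x)
    return y + 1

dis.dis(f)  # LOAD_FAST / LOAD_GLOBAL / CALL ... instructions TorchDynamo walks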
3.5 Modeling Python Data Structures
To analyze each stack entry and variable, TorchDynamo models a class hierarchy capturing the shared behavior of many different data structures.
Each of those data structures is a subclass of VariableTracker.
e.g.:
TensorVariable, ConstDictVariable, DataClassVariable, ListVariable, TupleVariable, UserFunctionVariable, UserMethodVariable, UserDefinedClassVariable
Every VariableTracker carries a set of guards, Q: which will be propagated through operations via union
Q: how exactly is this propagation done?
All instances also track their source, which is used to generate bytecode that loads or mutates the underlying value.
3.6 Inlining, Control Flow and Closures
Function calls may originate from user bytecode, but they may also come from magic methods such as __getitem__.
To collect larger graphs, TorchDynamo tries to inline function calls and flatten programs.
For a function call it:
- creates a checkpoint of the current symbolic state
- recursively attempts to symbolically evaluate the called function, passing in the input symbolic state and recording any mutations
- if anything fails, such as a graph break or another error, it rolls back to the checkpointed symbolic state and generates a graph break on that function call
Q: is most control flow optimized away? If so, does that mean TorchDynamo also cannot collect control flow information?
Q: when iterating over a list of torch.nn.Modules, TorchDynamo guards that the list is unchanged and unrolls the loop (Q:)
For control flow based on the type, size, or shape of tensors, TorchDynamo will "guard on those properties and remove the control flow".
This seems to mean the property condition is lifted into a guard and the if statement is removed.
Does it keep only the branch that was taken, or generate a flattened program for each branch?
Q: does this make it less flexible than TF?
Challenge: closures
e.g.:
def closure_example(x):
    y = torch.sigmoid(x)
    return lambda z: y + z
Here y lives in a closure cell.
Closures must be handled explicitly; there are three cases:
- Cell variables created outside the captured region are accessed differently from other variables, via LOAD_DEREF and STORE_DEREF; TorchDynamo rewrites these into direct reads/writes of the inlined function's cell contents.
- Cell variables both created and destroyed within the captured region: the closure is statically optimized away.
- Cell variables created inside the captured region but escaping it: the hardest case.
  - TorchDynamo optimizes away all uses of the closure inside the captured region.
  - At the very end of the generated bytecode, it creates any needed cells and Python function objects to return.
  - From the outside, the caller cannot distinguish the returned closure from one created by the original program.
3.7 Mutation and Side Effects
Side effects are deferred until after the FX graph has been called.
TorchDynamo keeps a side-effect data structure that records all side effects the original code would have had.
Q: If the code tries to read a value that would have been mutated by a pending side effect, it instead reads that pending value.
After graph generation, a garbage-collection pass removes side effects that do not escape the analysis context.
TorchDynamo then generates code to apply the needed side effects.
Q: Handling side effects this way results in multiple writes to the same value being collapsed into a single write.
The following kinds of side effects are supported (a small example follows this list):
- Writes to global variables result in a STORE_GLOBAL bytecode if the target global is in the same file.
- Writes to attributes (such as on classes) are handled similarly and mapped to STORE_ATTR in the output bytecode.
- Writes to cells/closures are tracked.
- Class construction is handled by creating a placeholder symbolic object, inlining the __init__ method, and tracking all attribute mutation on that placeholder object.
- Dictionary and list mutation can also cause side effects if the dict/list was passed in as an input or loaded from a global/attribute.
  - A new dict/list is created to match the final state, and the original dict/list object is mutated to match it.
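A small illustrative example of a deferred side effect: the global counter update is tracked during analysis and replayed (via STORE_GLOBAL) after the compiled graph runs:
import torch

call_count = 0

@torch.compile
def counted_relu(x):
    global call_count
    call_count += 1          # pending side effect, replayed after the graph call
    return torch.relu(x)     # captured into the FX graph

counted_relu(torch.randn(8))
print(call_count)            # 1: the write still happens exactly once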
3.8 Graph Breaks and Continuation Functions
When TorchDynamo encounters Python bytecode it cannot handle, e.g. a call into an external library, it generates a graph break and splits the bytecode into multiple fragments.
Q: why split the graph further instead of simply giving up on the whole frame?
TorchDynamo mixes the already-compiled fragments with the original Python code for hybrid execution.
Q: Any pending partial FX graph is compiled.
In the code that calls the partial graphs, the unsupported bytecode runs eagerly, and TorchDynamo then resumes analysis on the remainder of the function.
TorchDynamo's continuation functions take the form:
def resume_at_X(... livevars ...):
    ... restore try/except/stack state ...
    JUMP_ABSOLUTE X
    ... original function bytecode ...
Differences from the original function:
- the arguments are modified to reflect which variables are live across the graph break
- a prefix is added to restore the stack/exception state
Depending on the number of branches (1 or 2), one or two continuation functions are generated.
They recursively trigger TorchDynamo through the frame evaluation API.
Q: TorchDynamo processes continuation functions just like any other function.
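As a concrete illustration, an unsupported call such as print() inside a compiled function triggers a graph break; the two torch ops land in separate fragments joined by a generated continuation function:
import torch

@torch.compile
def with_graph_break(x):
    x = torch.sin(x)          # first graph fragment
    print("not traceable")    # unsupported -> graph break, runs eagerly
    return torch.cos(x)       # second fragment via a resume_at_* continuation

with_graph_break(torch.randn(4))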
3.9 AOTAutograd
AOTAutograd is invoked by many PyTorch compiler backends to add training support.
PyTorch eager: the backwards graph is generated dynamically using a tape-based autograd.
AOTAutograd: turns the forward graph into separate forwards and backwards graphs, and supports partial graphs.
Method (a hedged driver sketch follows this list):
- runs the PyTorch eager-mode autograd engine on fake tensor inputs and records a joint forwards-and-backwards graph
- data-dependent operations are executed outside the graph via graph breaks (i.e., presumably run eagerly without being compiled?)
- partitions the joint graph with a min-cut algorithm to optimize memory usage
- as part of this min-cut algorithm, backend-specific optimizations rematerialize certain activations that are cheap to recompute in the backwards graph instead of saving them
- applies other dispatcher-level transformations:
  - decompositions: rewrite some operators into lower-level ones
  - functionalization: makes the graph purely functional by removing operations that perform mutation and replacing them with their functional equivalents
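A hedged sketch of driving AOTAutograd directly to inspect the separated forward and backward FX graphs; the functorch.compile entry point and its keyword names are assumed here and may move between PyTorch versions:
import torch
from functorch.compile import aot_function  # assumed import path

def show_graph(gm, example_inputs):
    print(gm.graph)   # FX graph for this phase (forward or backward)
    return gm         # returning the GraphModule means "run it unchanged"

fn = aot_function(lambda x: torch.sin(x).sum(),
                  fw_compiler=show_graph, bw_compiler=show_graph)
out = fn(torch.randn(8, requires_grad=True))
out.backward()        # runs through the separately captured backward graph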
4 TorchInductor Design and Implementation
takes the captured FX graph and generates fast code from it for the user's calls
4.1 Design Principles and Key Technologies
PyTorch makes many design choices that differ from other frameworks,
e.g.:
- users are allowed to manipulate strides
- aliasing views are easy and pervasive
- both data and metadata can be mutated in place
Python First: TorchInductor is implemented in Python
Breadth First: support a broad set of operators, not just those used by currently popular models
Reuse State-Of-The-Art Languages: GPU kernels use Triton; CPU kernels use C++/OpenMP
4.2 Decompositions
e.g.: log2 can be decomposed into log plus a multiply:
import math
import torch
from torch._decomp import register_decomposition  # module path may vary by version

log2_scale = 1 / math.log(2)

@register_decomposition(torch.ops.aten.log2)
def log2(x):
    return torch.log(x) * log2_scale
Q: Note that the active decomposition set must not contain cycles.
4.3 Lowerings and Define-By-Run Loop-Level IR
Define-by-run IR: the IR uses executable Python code to define the bodies of loops.
Task of this stage: lowering from an FX graph of PyTorch operations into TorchInductor's define-by-run IR.
At the time of writing there are lowerings for 433 operators.
def inner_fn_buf0(index):
    i0, i1 = index
    tmp0 = ops.load("arg0_1", i0 * s1 + i1)
    tmp1 = ops.log(tmp0)
    tmp2 = ops.constant(1.4426950408889634, torch.float32)
    tmp3 = ops.mul(tmp1, tmp2)
    return tmp3

buf0_ir = TensorBox(StorageBox(ComputedBuffer(
    name='buf0',
    layout=FixedLayout('cuda', torch.float32,
                       size=[s0, s1], stride=[s1, 1]),
    data=Pointwise(inner_fn=inner_fn_buf0,
                   ranges=[s0, s1], ...))))
Figure 2. TorchInductor IR for torch.log2 on a 2D tensor.
ComputedBuffer represents a tensor that will be computed using generated code (in contrast to ones created via fallback kernels or inputs).
Sizes and indexing use SymPy symbols as symbolic coordinates.
SymPy handles the symbolic algebra: equations, linear algebra, discrete math, and symbolic differentiation.
The ops.* namespace can be dynamically overridden to perform different functions,
enabling actions like recording memory accesses or tracking high/low watermarks for strength-reduction optimizations.
At the time of writing, the loop-level TorchInductor IR consisted of 54 primitive operators:
- ops.load and ops.store access Tensor memory from a provided buffer name and a SymPy index specifying a symbolic memory location.
- ops.reduction operates like ops.store where the reduction happens implicitly inside the write. It combines stored values along the reduction dimension of the current node using a supplied reduction type. Supported reduction types are: argmin, argmax, any, max, min, prod, sum, xor_sum, and welford_combine [50].
- ops.index_expr converts from SymPy expressions used for indexing into values used for compute.
- ops.indirect_indexing converts from computed values into SymPy expressions used for indexing by introducing a new SymPy variable bound dynamically.
- ops.masked implements conditional execution. It takes a condition and a Python function (recursively using the same IR) with no args. This gets mapped to masks in Triton and conditionals in C++.
- ops.load_seed, ops.rand, ops.randn, and ops.randint64 are used for computing random numbers.
- The remaining ops are elementwise math operations.
4.4 Scheduling
Scheduling decides which operations get fused, the order kernels run in, and memory planning.
- every buffer in the IR is converted into a subclass of BaseSchedulerNode
- SchedulerNode represents a standard kernel whose body TorchInductor will codegen.
- ExternKernelSchedulerNode represents calls to library code or user-defined kernels.
- NopKernelSchedulerNode maps to nothing, but is used to add dependency edges to ensure the ordering of kernels (for example, a concatenate kernel which has been handled by making producers write directly to the combined buffer).
- FusedSchedulerNode represents a set of two or more SchedulerNodes fused into a single kernel.
- read/write relationships become dependency edges between nodes, annotated with the symbolic memory address being read
- key functions (a minimal sketch of the greedy loop follows this list):
  - Scheduler.can_fuse(node1, node2) returns True if two nodes can be fused together.
    - e.g.: if config.aggressive_fusion=False, then can_fuse will prevent fusion of nodes that do not share any common memory accesses.
  - Scheduler.score_fusion(node1, node2) is used to order different fusion possibilities.
    - some fusions are mutually exclusive
    - the score is based on:
      - the category of the fusion (e.g. pointwise/reduction/template);
      - estimated bytes of memory traffic saved by the fusion;
      - shorter distance between nodes in the original graph
- fusion then greedily takes the highest-scoring candidate and repeats until no further fusions are possible
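A minimal sketch of the greedy fusion loop, assuming user-supplied can_fuse / score_fusion / fuse callables (illustrative only; the real Scheduler also tracks dependency edges and kernel ordering):
import itertools

def greedy_fuse(nodes, can_fuse, score_fusion, fuse):
    # repeatedly fuse the highest-scoring legal pair until no candidates remain
    while True:
        candidates = [(score_fusion(a, b), i, j)
                      for (i, a), (j, b) in itertools.combinations(enumerate(nodes), 2)
                      if can_fuse(a, b)]
        if not candidates:
            return nodes
        _, i, j = max(candidates)
        fused = fuse(nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [fused]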
4.5 Triton Code Generation
Generation:
Q: in the log2 example, if the number of elements is not a multiple of XBLOCK, the tail elements are masked off (see the sketch below).
During codegen, indexing is simplified,
e.g.: a 2D strided load can be converted into a contiguous load.
CSE: common subexpression elimination is applied.
The pointwise decorator encodes boilerplate code used to facilitate block size heuristics, auto-tuning, and ahead-of-time kernel compilation.
The decorator indicates the type of kernel being generated (pointwise, reduction, or template), and its arguments are required metadata about the kernel like data alignments.
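A hand-written Triton sketch of the masked pointwise pattern described above (not TorchInductor's generated code, which additionally carries the pointwise decorator, heuristics, and autotuning metadata):
import triton
import triton.language as tl

@triton.jit
def log2_kernel(in_ptr, out_ptr, n_elements, XBLOCK: tl.constexpr):
    offsets = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)
    mask = offsets < n_elements          # masks the tail when n_elements % XBLOCK != 0
    x = tl.load(in_ptr + offsets, mask=mask)
    y = tl.log(x) * 1.4426950408889634   # log2(x) = log(x) * (1 / ln 2)
    tl.store(out_ptr + offsets, y, mask=mask)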
Codegen:
- Mode 1: For smaller reductions, it will generate a persistent reduction where the entire reduction is loaded in a single block and retained in registers/shared memory; in this case reductions map directly to Triton reduction operators.
- Q: persistent reduction
- Mode 2: For larger reductions, TorchInductor generates a loop using an entire block as an accumulator with a call to a Triton reduction at the end of the loop.
- Mode 3: For more complex operations (matmuls and convolutions), TorchInductor has its own template system for generating Triton code that mixes handwritten Triton with generated Triton. Templates are written using Jinja [29] with helper methods to interact with TorchInductor’s codegen system.
4.6 C++ Code Generation
- vectorized variant: uses tiling and maps most operations to the at::vec::Vectorized class
- non-vectorized variant: generates relatively standard C++ code using many C++ standard template library (STL) [24] functions.
4.7 Wrapper Codegen
Wrapper codegen emits the code that allocates memory and calls the generated kernels; it has a Python and a C++ variant. The Python backend is more flexible and supports some corner cases that the C++ one does not.
When enabled with mode="reduce-overhead", TorchInductor uses CUDA Graphs [20] to completely eliminate the overhead from wrapper code. CUDA Graphs records and replays kernel launches at the CUDA driver level
4.8 Related Deep Learning Compilers
PyTorch: Triton/OpenMP (TorchInductor)
Inspired by Halide: TVM, nvFuser, NNC
TensorFlow, JAX: XLA
the MLIR ecosystem, including IREE (part of OpenXLA)
5 Dynamic Shapes
Motivation:
- Some dimensions, such as batch size or sequence length, may vary.
  - e.g.: adaptive batching executes inference requests with varying batch sizes depending on how many requests were received within the batching window
  - e.g.: each batch may be padded to the maximum sequence length within that batch, which varies from batch to batch
- Some models exhibit data-dependent output shapes
  - e.g.: image detection may generate a variable number of candidate boxes
- sparse representations
Not supported: dynamic-rank programs,
e.g.: programs whose input tensors change in dimensionality
5.1 Symbolic Shape Guards
Unlike a fully symbolic system, for a condition we do not reason about both branches simultaneously; we always pick one branch and specialize the trace under the assumption that the trace will only be reused when that assumption holds.
Q: why is this guaranteed to be correct and effective? Is it because extra guards are added?
A size hint is stored for each symbolic size, holding its concrete value from the first JIT compilation.
Q: is this record/replay again?
Q: why does this quickly simplify symbolic shapes? Because the condition is removed? Because control flow is simplified?
def f(x, y):
    z = torch.cat([x, y])
    if z.size(0) > 2:
        return z.mul(2)
    return z.add(2)
branch 1: torch.cat([x, y]).mul(2)
branch 2: torch.cat([x, y]).add(2)
guard condition: on z.size(0) > 2 (or its negation), where z.size(0) = x.size(0) + y.size(0)
Approach: write meta functions for all operators in PyTorch (an illustrative sketch follows)
- Meta functions propagate size information to the output of a tensor without actually performing computation on the node.
Current coverage: meta functions exist for 2657 of 3028 PyTorch ops
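An illustrative sketch of what a meta function computes for the torch.cat example above (a hypothetical standalone helper, not the actual registration API):
import torch

def meta_cat_dim0(x, y):
    # propagate only metadata (shape/dtype); no real data is touched
    assert x.dtype == y.dtype and x.shape[1:] == y.shape[1:]
    out_shape = (x.size(0) + y.size(0),) + tuple(x.shape[1:])
    return torch.empty(out_shape, dtype=x.dtype, device="meta")

a = torch.empty(2, 3, device="meta")
b = torch.empty(5, 3, device="meta")
print(meta_cat_dim0(a, b).shape)  # torch.Size([7, 3])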
5.2 Optimizing Dynamic Shapes Reasoning
Motivation for dynamic shapes: reduce compile time
Q: a compiler which supports only static shapes must recompile a kernel for every possible combination of input shapes
Several policies speed up the reasoning (a hedged usage sketch follows this list):
- dynamic shapes: PyTorch's dynamic shapes require no user annotation
  - Assumptions:
    - all inputs are potentially dynamic
    - model weights are static
  - Q: we infer the true dynamism by stepping through the model and analyzing the interactions between the two.
  - unless assume_static_by_default is set, sizes default to dynamic
- variable sizes are additionally checked for 0 or 1; if a size is 0 or 1, it is treated as a constant and an appropriate guard is added
- while processing the user program, further facts are learned from the guards
  - Q: what kind of facts?
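A hedged usage sketch: torch.compile's dynamic flag is public API, while torch._dynamo.mark_dynamic is a private API assumed here and may change between versions:
import torch
import torch._dynamo

model = torch.nn.Linear(64, 64)
compiled = torch.compile(model, dynamic=True)  # treat shapes symbolically up front

x = torch.randn(8, 64)
torch._dynamo.mark_dynamic(x, 0)   # assumed API: mark dim 0 (batch) as dynamic
compiled(x)
compiled(torch.randn(32, 64))      # intended to reuse the compilation without recompiling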
5.3 Hint-Free (Unbacked) Symbolic Integers
Unbacked symbolic integers address the control-flow problem:
- normally, the actual value (hint) of a symbolic integer is checked to determine which branch to take and to guard on it.
- if a size variable comes from a data-dependent operation such as .nonzero() or .item(), it becomes an unbacked symbolic integer with no hint; control flow cannot be resolved on such symbolic integers, so we must graph break on these operations
Drawback: this rule is overly strict
Solutions:
- when creating a tensor, metadata about the tensor is computed eagerly
  - e.g.: sorting the strides and determining overlap, i.e. whether the tensor is dense
- mark some of these properties lazy
- API: constrain_range, which lets the user specify that a size is bounded
6 Experimental Results
Benchmark suites:
- TorchBench
- HuggingFace
- TIMM
Results are also tracked on the TorchInductor Performance Dashboard.
Experimental setup:
- A100
- CUDA 11.6
- Intel Xeon 8275CL CPU
Competitors: nvFuser 2.0; NNC 2.0; Hidet 0.2.2; TVM 0.11.1; ONNX Runtime (ONNXRT) 1.14.1; and PyTorch/XLA 2.1.
For HuggingFace, TorchScript fails on every model because HuggingFace models return a ModelOutput container class that TorchScript does not support.
Metrics:
compared operator capture rates, overheads, and speedups
Analysis:
Sources of speedup:
Main source:
combining pointwise, reduction, and scatter kernels together into a smaller number of fused kernels
Reason: this reduces memory traffic, since values can be reused without requiring a round trip to memory
Where this happens:
- Inlining happens during lowering, and duplicates the body of pointwise kernels into all their consumers when thresholds are met.
- Fusion happens during scheduling, combines the remaining kernels together, and also does horizontal consumer/consumer fusions.
Other optimizations:
- Loop/layout reordering uses a voting algorithm to reorder loops in kernels and change data layouts to match usage.
- Matmul templates use Triton templates with pointwise epilogue fusion for matrix multiply instead of cuBLAS/cuDNN.
- Parameter freezing is an inference-only optimization that constant-folds away parts of the model that only depend on parameters.
- Pattern matching uses graph-level peephole optimizations to rewrite the input graph before it is lowered to TorchInductor.
- CUDA Graphs reduce kernel launch overheads at the CUDA driver level (only used when checked to be safe and enabled via the mode setting).
Artifact
- Source code and benchmark code: https://github.com/pytorch/pytorch/
- PyTorch binaries: https://pytorch.org/
- TorchBench: https://github.com/pytorch/benchmark/