Proj. CDeepFuzz Paper Reading: TensorFlow Eager: A Multi-Stage, Python-Embedded DSL for Machine Learning
Abstract
Tool: TensorFlow Eager
Task: a domain-specific language for hardware-accelerated machine learning
Features: multi-stage, Python-embedded
Background: why extend TensorFlow?
1. TensorFlow makes rapid prototyping and runtime dynamism difficult
Method:
1. provides an imperative front-end
2. provides a JIT tracer that translates Python functions into executable dataflow graphs
Result: makes it easy to combine imperative and staged execution in a single package
1. Introduction
differentiable programs
DSLs (domain-specific languages) for differentiable programming are usually embedded in a host language
imperative DSLs/declarative DSLs
Here, declarative DSLs stage their models as dataflow graphs
selectively stage parts that they wish to accelerate or serialize
opt-in extension
calling a single TensorFlow library function
provides a Python decorator that traces its Python function in a graph-building context
Q: trace the Python function in a graph-building context?
Accumulates primitive operations to build a dataflow graph with named inputs and outputs, and returns an executable graph function
Imperative execution is the default; static graphs are no longer the default
How a graph function differs from a Python function:
- It is called the same way, but it is handed to the C++ dataflow runtime or compiled directly into optimized code for CPUs, GPUs, and ASICs
graph functions and imperative code share a lexical environment
Q: embed imperative code in graph functions via unstaging annotations
"Imperative and staged Tensorflow code share a single set of primitive operations, kernels and user-visible APIs"
2. Related Work
Ease of use: the user does not have to stage the computation, i.e., define and organize operations into a computational graph
Existing approaches:
- embed the framework in a compiled procedural language and implement graph extraction and automatic differentiation as compiler rewrites
- e.g.: DLVM, Swift for TensorFlow, Zygote
- Rewrite imperative code into declarative code using the Python AST
- Implement fused kernels
- Q: does this mean composing multiple operations together more directly, making imperative execution coarser-grained?
- e.g.: CuDNN kernels
- multi-stage programming model
- e.g.: JAX, MXNet, Gluon, PyTorch
- Q: PyTorch implements a staging tracer that is similar to ours
- Less directly related: Terra, OptiML, Numba, PyPy, Scala
- Multi-stage programming is related to staging transformations in compilers and partial evaluation in programming languages
3. Design Principles
- Easy for Python and IPython users to use
- Privilege imperative execution
- Seamlessly embed into Python
- The host language simplifies models such as segmented recurrent NNs and recursive NNs
- Easy to test and easy to deploy on different devices
- Stage imperative code as dataflow graphs
4. Execution Model
Definitions: tensor, operation, kernel, model
a tensor is a multi-dimensional, typed array
an operation is a primitive, possibly stateful function that takes tensors as inputs and produces tensors as outputs,
a kernel is a device specific implementation of an operation
a model is a composition of primitive operations.
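A minimal sketch in code of how these terms relate (the linear_model helper below is a hypothetical example, not from the paper):
import tensorflow as tf
tf.enable_eager_execution()

# tf.matmul and tf.add are primitive operations; each is backed by
# device-specific kernels (CPU, GPU, TPU implementations)
def linear_model(w, b, x):
    # a "model": a composition of primitive operations on tensors
    return tf.add(tf.matmul(x, w), b)

# tensors: multi-dimensional, typed arrays
w = tf.ones((3, 1))
b = tf.zeros((1,))
x = tf.ones((2, 3))
print(linear_model(w, b, x))  # a 2x1 tensor filled with 3.0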
4.1 Multi-stage programming
Two ways to execute operations:
- imperatively
- as part of a static dataflow graph
- Q: the same operations and kernels are used, but the two modes differ in how they dispatch kernels
Imperative execution: much like NumPy plus automatic differentiation
Staged execution:
- Provides a decorator: "function"
- traces the execution of a Python function
- recording all Tensorflow operations and the tensors flowing between them
- Can be thought of as a compiler, but not exactly, because it only traces TensorFlow operations
- The dataflow graph format does not support all of the dynamism present in Python
- A correct trace is produced only when the operations do not depend on Python state
- Q: how does this differ from PyTorch graph capture? Is its correctness still no better than that of graph capture without control flow?
- Python state: apparently just ordinary Python variables
- as long as the set of operations in the trace does not depend on Python State, we can generate a correct trace
- Invoking a callable returned by @function will execute a dataflow graph rather than executing the original Python function
- graph function: executed by an operation that takes tensors as inputs and a function name as an attribute
- This operation appears to be part of the dataflow graph runtime
- @function supports code generation via XLA
import tensorflow as tf
tf.enable_eager_execution()

def select(vector):
    A = tf.constant([[1.0, 0.0]])
    return tf.matmul(A, vector)

x = tf.constant([[2.0], [-2.0]])
print(select(x))
# prints
# tf.Tensor([[ 2.]], shape=(1, 1), dtype=float32)
A multi-stage workflow
- implementation: develop, debug, and test a single-stage imperative program
- analysis: use profiling tools to identify performance-critical blocks of operations, and express those blocks as staging-friendly Python functions or callable objects
- staging: decorate those functions with @function
@function: a JIT tracer that executes Python functions in a graph building context and only records operations and tensors
- Q: how does it record? Is control flow recorded?
- Does it return symbolic variables rather than concrete values?
- Non-TensorFlow Python code executes normally
- Q: how does it interact with non-TensorFlow Python code?
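A minimal sketch of this workflow applied to the select function from the earlier code example, assuming it has been identified as performance-critical:
@tf.contrib.eager.function  # JIT tracer: records the TF operations into a graph function
def select(vector):
    A = tf.constant([[1.0, 0.0]])
    return tf.matmul(A, vector)

x = tf.constant([[2.0], [-2.0]])
print(select(x))  # now dispatched as a dataflow graph; same result as before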
Python functions that are amenable to staging:
1. Can generate an appropriate graph
2. Resilient to being executed multiple times
Executing the dataflow graph can differ semantically from executing the Python function
e.g.: random numbers get frozen into the dataflow graph; here the Python state (np.random.randn) needs to be replaced with a TF operation, tf.random_normal
import numpy as np

def add_noise():
    eye = tf.eye(5)
    randn = np.random.randn(5, 5)  # Python state: frozen into the trace if staged
    return eye + randn
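A staging-friendly rewrite (sketch): replacing the NumPy call, which is Python state frozen into the trace, with the tf.random_normal operation keeps the noise fresh on every execution of the staged function:
def add_noise():
    eye = tf.eye(5)
    randn = tf.random_normal((5, 5))  # a TF op, so it is traced into the graph and re-sampled each run
    return eye + randn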
e.g. 2: if the Python function f has Python side effects, such as mutating a global Python counter, then after several executions its behavior differs from calling the Python function directly.
Because the graph is generated by tracing rather than source-code analysis, for and while loops are unrolled and only one branch of an if is recorded
Otherwise the code must be rewritten with tf.cond and tf.while_loop (see the sketch below)
staging-friendly / staging-unfriendly helper functions
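A minimal sketch of such a rewrite, assuming a data-dependent branch on a scalar tensor (with a plain Python if, only the branch taken at trace time would be recorded):
@tf.contrib.eager.function
def absolute(x):
    # tf.cond keeps both branches in the dataflow graph and selects one at run time
    return tf.cond(x >= 0, lambda: x, lambda: -x)

print(absolute(tf.constant(-3.0)))  # 3.0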
4.2 Automatic Differentiation
a variant of tracing-based reverse-mode automatic differentiation, with a few changes to better support partially staged computation
The implementation is similar to Chainer, Autograd, and PyTorch
The tape is user-visible; if a tape watches a value, operations taking this value as an input will be recorded
Purpose:
- Lets users control which values are traced for automatic differentiation
- Variables are watched automatically
- Nested tapes enable higher-order gradients
x = tf.constant(3.0)
with tf.GradientTape() as t1:
    with tf.GradientTape() as t2:
        t1.watch(x)
        t2.watch(x)
        y = x * x
    dy_dx = t2.gradient(y, x)  # 6.0
d2y_dx2 = t1.gradient(dy_dx, x)  # 2.0
The first time a graph function is called while a tape is active and watching, a forward version of it is built that also returns any intermediate values needed for the backward step (see the sketch below).
Q: " As such, there is no meaningful change in the amount of computation or memory needed in the backward pass by staging or unstaging a particular function, leading to more predictable performance. "
This ensures that if an operation is in the forward pass, it is also in the backward pass
The gradient computation itself can likewise be staged or left unstaged
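A minimal sketch of differentiating through a staged function, illustrating the forward-version mechanism above (square is a hypothetical helper):
@tf.contrib.eager.function
def square(x):
    return x * x

x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = square(x)  # first call builds a forward version that also returns intermediates for the backward step
print(tape.gradient(y, x))  # 6.0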
4.3 State
TensorFlow keeps program state in variables, with save and restore operations
A variable's value is automatically watched by all active tapes
In eager mode, variables behave like Python objects: deleting the object deletes its unique storage
Staged computations refer to variables by unique identifiers
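A small sketch of variable lifetime in eager mode:
v = tf.Variable(1.0)          # an ordinary Python object that owns its storage
print(float(v.read_value()))  # 1.0
del v                         # deleting the object releases its unique storage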
One challenge when moving from purely staged computation to keeping state in Python objects is matching state between executions of the same program.
Q: why "moving from" rather than "moving to"? Isn't all of eager about capturing graphs from Python code?
The challenge of keeping state in Python objects is matching state across executions of the same program?
TensorFlow uses unique names for each variable in a program, so users must create variables in a consistent order
e.g.: if two copies of the same model are created, restoring the second copy requires extra care; a greedy algorithm is used
Other TensorFlow state is often treated as part of a directed graph with named edges
e.g.: an iterator over a serialized input dataset at a given location; mutable hash tables
Q: NumPy and other Python state can also use graph-based state matching
Staging allows programs to be serialized and used without a Python interpreter
e.g.:
- use graph-based state matching when writing TF Eager code
- serialize the trace
- execute the trace with TensorFlow's C++ API
4.4 Devices
API: list_devices: lists all devices that the runtime is aware of
device: a context manager; TensorFlow automatically selects a device based on kernel availability, and when an operation's inputs live on a device different from the one executing the operation, the inputs are transparently copied to the correct device
Q: graph functions can be run on various devices via the device context manager
For example, if a graph function executes on the GPU but one of its operations is pinned to the CPU, that inner annotation overrides the device context manager: the pinned operation runs on the CPU while the other operations run on the GPU (see the sketch after the listing below).
On TPUs, XLA is invoked automatically, which has significant overhead; it applies several optimizations such as layout optimization, instruction scheduling for concurrency, and operation fusion
# stored on CPU
a = tf.constant(1.0)
b = tf.constant(2.0)

with tf.device("/gpu:0"):
    c = tf.add(a, b)

assert c.numpy() == 3.0
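A sketch of the per-operation device override described above, assuming a GPU is available (matmul_then_add is a hypothetical example):
@tf.contrib.eager.function
def matmul_then_add(a, b):
    c = tf.matmul(a, b)        # placed according to the caller's device scope
    with tf.device("/cpu:0"):  # inner annotation overrides the outer scope
        return tf.add(c, 1.0)  # this operation is pinned to the CPU

with tf.device("/gpu:0"):      # the remaining operations run on the GPU
    out = matmul_then_add(tf.eye(2), tf.eye(2))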
4.5 Distribution
4.6 Staging Computations
lightweight modular staging, partial evaluation
The function API accepts a Python function and returns an object.
Calling this object executes a dataflow graph created by running the user-provided Python function in a graph-building context
Polymorphism: Python inputs are polymorphic, but graph functions are statically typed
Solution: trace cache
function behaves like function overloading
The object F = function(f) maintains a cache mapping from inferred input signatures to concrete graph functions
Each time F is called, its inputs are processed and their signature is inferred
1. tensors are represented by abstract types (numerical type and shape tuples)
2. non-tensor values: encoded by object identity
3. (input signature, other metadata): the key into the cache of graph functions
- A cache miss triggers a trace of f on the given inputs (a conceptual sketch follows)
ad-hoc polymorphism
function overloading
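A purely conceptual sketch of the trace cache (not the actual implementation): tensor arguments contribute an abstract type (dtype, shape) to the signature, non-tensor arguments contribute their concrete value, and a cache miss triggers a new trace of f:
def input_signature(args):
    sig = []
    for a in args:
        if isinstance(a, tf.Tensor):
            sig.append(("tensor", a.dtype, tuple(a.shape.as_list())))  # abstract type
        else:
            sig.append(("python", a))  # specializes on the concrete value
    return tuple(sig)

trace_cache = {}  # (input signature, other metadata) -> concrete graph function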
@tf.contrib.eager.function
def lossy_matmul(W, x, training=True):
    outputs = tf.matmul(W, x)
    if training:
        outputs = tf.nn.dropout(outputs, 0.2)
    return outputs

W = tf.random_normal((3, 5))
x = tf.random_normal((5, 1))

lossy_outputs = lossy_matmul(W, x, training=True)
exact_outputs = lossy_matmul(W, x, training=False)
Q: function specializes on the runtime values of non-tensor arguments to let them parameterize the computation
It seems to mean that TensorFlow automatically specializes on the concrete values of non-tensor arguments
Lexical closure: function is capable of tracing Python functions that lexically close over tensors or variables; these closed-over objects are treated as "captured" inputs that are silently passed to the graph function at call time, without programmer intervention.
- Variables are captured by reference, not by value
v = tf.Variable(0.0)

@tf.contrib.eager.function
def mutate():
    v.assign_add(1.0)
    return v.read_value()

mutate()
assert float(v.read_value()) == 1.0
v.assign_add(1.0)
assert float(v.read_value()) == 2.0
mutate()
assert float(v.read_value()) == 3.0
Listing 7. function transparently captures closed-over tensors and variables, forwarding them to TensorFlow functions as inputs.
Composition
@tf.contrib.eager.function
def inner(a):
    return tf.nn.relu(a)

@tf.contrib.eager.function
def outer(a, b):
    return inner(tf.matmul(a, b))

outer(tf.eye(3), tf.diag([-1.0, 1.0, 2.0]))
State creation: the Python function f creates and initializes variables the first time it is called
- Requirement on f: state (e.g., TensorFlow variables) must be created only when f is traced for the first time; variables must not be created in the second or later traces (see the sketch below)
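A sketch of a staging-friendly stateful callable that satisfies this requirement (the Dense class is a hypothetical example): its variable is created only on the first invocation and reused afterwards:
class Dense(object):
    def __init__(self):
        self.w = None

    def __call__(self, x):
        if self.w is None:  # state is created once, on the first call only
            self.w = tf.Variable(tf.random_normal((3, 1)))
        return tf.matmul(x, self.w)

dense = tf.contrib.eager.function(Dense())  # stage the callable object
print(dense(tf.random_normal((2, 3))))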
4.7 Escaping staged computation
Embedding imperative code in graphs:
- Staging computations requires the programmer to refactor the to-be-staged code into Python functions that, when traced, construct dataflow graphs
- Tracing requires replacing complex Python control flow with TensorFlow control flow
- e.g.: a Python function that needs to be staged but has to call a data-dependent recursive Python function
- Possible approaches:
1. stage the code before/after the recursive call
2. leave the recursive call unstaged
3. stage the entire function while wrapping the recursive call in a py_func, which takes the Python function as an attribute and executes it imperatively
- py_func is differentiable? Why? It seems to be only that the TF operations executed inside py_func can themselves be traced
- custom gradient definition
- In staged mode, py_func is a way to embed imperative, Pythonic code into a dataflow graph; it can be used to quickly implement custom operations in Python (a sketch follows)
- Drawbacks of py_func:
1. overhead, since control is handed back to the single-threaded Python interpreter
2. graphs containing py_func are generally not serializable
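A minimal sketch of embedding imperative Python in a graph function via tf.py_func (double_in_python is a hypothetical stand-in for arbitrary Pythonic code, e.g. the recursive call above):
import numpy as np

def double_in_python(x):  # receives and returns NumPy arrays
    return x * 2

@tf.contrib.eager.function
def staged(x):
    # tf.py_func embeds the imperative helper in the dataflow graph; at run time
    # it hands control back to the single-threaded Python interpreter
    y = tf.py_func(double_in_python, [x], tf.float32)
    return y + 1.0

print(staged(tf.constant([1.0, 2.0])))  # [3. 5.]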
5 Implementation
Staging, automatic differentiation, Python, imperative runtime, lightweight API
In TensorFlow, the dataflow graph represents the union of all the computations that the author might be interested in
Each staged computation is expressed as its own graph function
6 Evaluation
Model: ResNet-50
Environments: W-2135 CPU; GTX 1080 GPU; Cloud TPU
Experiments: 1. ResNet-50; 2. ResNet-50 on TPU; 3. L2HMC
Compared configurations: 1. TensorFlow Eager; 2. TensorFlow Eager + function; 3. TensorFlow (graph mode)
Q: why is the time to build and optimize the staged computation not considered?
Results:
The gains do not seem significant; unclear why staging helps so little at larger batch sizes.
Since the ratio of time spent in kernels to time spent in Python increases, ResNet training does not benefit from inter-op parallelization