Proj. CDeepFuzz Paper Reading: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems; TensorFlow: A system for large-scale machine learning

Abstract

Key property: TensorFlow can run on a wide range of heterogeneous systems.

1. Intro

Predecessor: DistBelief
TensorFlow:
Benefits:

  1. Training and inference at many scales, from mobile devices up to clusters with thousands of accelerator cards (multi-scale training)
  2. Reduces the maintenance burden and avoids leaky abstractions
  3. Enables parallelism, allowing some flexibility in how parameters are updated (relaxed synchronization requirements)
  4. Supports not only deep learning but also traditional machine learning and numerical computation

Approach: stateful dataflow graphs + parallelism
Parallelism:

  1. (partial) graph replication
  2. data-parallel training (parallel execution of a core model dataflow)

Q:
we have focused on making the system both flexible enough for quickly experimenting with new models for research purposes and sufficiently high performance and robust for production training and deployment of machine learning models.

2. Programming Model and Basic Concepts

Basic model: a directed graph
On top of a dataflow graph of primitive computations, TensorFlow adds several kinds of nodes, e.g.:

  1. nodes to maintain and update persistent state
  2. branching and looping

node: each node represents an operation
tensor: a value that flows along the normal edges of the graph; an arbitrary-dimensionality array whose underlying element type is specified or inferred at graph-construction time
- multiple element types are supported
control dependencies: special edges along which no data flows; they only express ordering, requiring that the source node finish executing before the destination node starts executing; commonly used to control peak memory usage
operation: an operation has a name and represents an abstract computation. Operations can have attributes, which supports polymorphism, e.g. an Add specialized to int32.

kernel: a particular implementation of an operation that can be run on a particular type of device
registration mechanism: A tensorflow binary defines the sets of operations and kernels via a registration mechanism.

Example TensorFlow operation types

  • Element-wise mathematical operations: Add, Sub, Mul, Div, Exp, Log, Greater, Less, Equal, ...
  • Array operations: Concat, Slice, Split, Constant, Rank, Shape, Shuffle, ...
  • Matrix operations: MatMul, MatrixInverse, MatrixDeterminant, ...
  • Stateful operations: Variable, Assign, AssignAdd, ...
  • Neural-net building blocks: SoftMax, Sigmoid, ReLU, Convolution2D, MaxPool, ...
  • Checkpointing operations: Save, Restore
  • Queue and synchronization operations: Enqueue, Dequeue, MutexAcquire, MutexRelease, ...
  • Control flow operations: Merge, Switch, Enter, Leave, NextIteration

Sessions: client programs interact with the TensorFlow system by creating a Session
- The Extend method adds nodes and edges to the graph currently managed by the session.
- A session's graph is empty when the session is created.

Variable: a Variable is a special kind of operation that returns a handle to a persistent mutable tensor that survives across executions of a graph.
the parameters of the model are typically stored in tensors held in variables, and are updated as part of the Run of the training graph for the model
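A minimal sketch of these concepts (graph construction, Session, Run, Variable) using the TF 1.x-style API the paper describes; the variable `W`, the constant values, and the shapes are made up for illustration (in TF 2.x this requires `tf.compat.v1` with eager execution disabled):

```python
import tensorflow as tf  # TF 1.x-style graph-mode API

# Graph construction: a Variable is a stateful node whose tensor
# survives across executions (Run calls) of the graph.
W = tf.Variable([[1.0, 2.0], [3.0, 4.0]], name="W")
x = tf.constant([[1.0], [1.0]])
y = tf.matmul(W, x)                          # MatMul operation node
update = tf.assign_add(W, tf.ones_like(W))   # AssignAdd mutates W in place

with tf.Session() as sess:                   # the client creates a Session
    sess.run(tf.global_variables_initializer())
    print(sess.run(y))                       # each Run executes only the needed subgraph
    sess.run(update)                         # the new value of W persists...
    print(sess.run(y))                       # ...and is visible to later Run calls
```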

3. Implementation

client, master, worker
The client uses the Session interface to communicate with the master.
There are one or more worker processes; each worker process manages access to one or more devices.
The TensorFlow interface has a local and a distributed implementation; the two share most of their code.
The local implementation is used when the client, master, and worker all run in a single process on one machine (possibly with multiple devices).
In the distributed setting, the different tasks are containers in jobs managed by a cluster scheduling system.
device: each device object is responsible for managing the allocation and deallocation of device memory and arranging the execution of kernels
- Devices are the computational heart of TensorFlow.
- device name, e.g. "/job:worker/task:17/device:gpu:3" (job name / index of the task within that job / device type / the device's index within the worker)

3.1 Single-Device Execution

Keep a count, per node, of the number of its dependencies that have not yet executed; when the count reaches zero, the node is added to a ready queue.
ready queue: nodes are dequeued and executed in some unspecified order.
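A toy sketch (plain Python with hypothetical node/graph structures, not TensorFlow's actual implementation) of this single-device execution loop: per-node pending-dependency counts plus a ready queue processed in arbitrary order.

```python
from collections import deque

def execute_single_device(graph):
    """graph: dict mapping each node to the list of nodes that consume its output.
    Each node is assumed to expose a run_kernel() method (hypothetical)."""
    pending = {}                                  # node -> number of unexecuted inputs
    for node, successors in graph.items():
        pending.setdefault(node, 0)
        for succ in successors:
            pending[succ] = pending.get(succ, 0) + 1

    ready = deque(n for n, count in pending.items() if count == 0)
    while ready:
        node = ready.popleft()                    # unspecified order within the queue
        node.run_kernel()
        for succ in graph.get(node, []):          # finishing a node may unblock successors
            pending[succ] -= 1
            if pending[succ] == 0:
                ready.append(succ)
```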

3.2 Multi-device Execution

challenges:

  1. Deciding which device to place each node's computation on
  2. Managing the data communication implied by those placement decisions

3.2.1 node placement

map the computation onto the set of available devices
A cost model is built for each node.
Inputs to the cost model:

  1. Estimated sizes (in bytes) of the input/output tensors
  2. Estimated computation time

The estimates can come from:

  1. Static heuristics based on the operation type
  2. Measurements from earlier executions of the graph

The placement algorithm:

  1. Runs a simulated execution of the graph
  2. Greedily chooses a device for each node, producing a node-to-device placement

Simulated execution:

  1. Starting from the sources of the computation graph, for each node that becomes ready:
    1. Determine the set of feasible devices
    2. For a node with multiple feasible devices, estimate the completion time of placing it on each of them, considering:
      1. the estimated or measured execution time of the operation
      2. the cost of communication for its inputs
    3. Greedily pick the device that would complete the node's computation soonest
  2. Users may supply hints and partial constraints to guide the placement algorithm (see the sketch below)
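A rough sketch of the greedy simulated placement, in plain Python with hypothetical callbacks (`feasible`, `compute_cost`, `transfer_cost`) standing in for the paper's cost model; the real algorithm also honors user hints and constraints.

```python
def greedy_placement(topo_nodes, feasible, compute_cost, transfer_cost):
    """topo_nodes: nodes in dataflow (topological) order.
    Each node is assumed to have .inputs = [(producer_node, tensor_bytes), ...]."""
    placement = {}
    for node in topo_nodes:                       # simulate the graph execution
        best_dev, best_cost = None, float("inf")
        for dev in feasible(node):                # devices with a kernel for this op
            cost = compute_cost(node, dev)        # estimated or measured execution time
            for producer, nbytes in node.inputs:  # add cross-device communication costs
                if placement[producer] != dev:
                    cost += transfer_cost(placement[producer], dev, nbytes)
            if cost < best_cost:
                best_dev, best_cost = dev, cost
        placement[node] = best_dev                # greedy: earliest estimated completion
    return placement
```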

3.2.2 Cross-device communication

After node placement, the graph is partitioned into one subgraph per device.
Each cross-device edge is removed and replaced by a Send/Recv node pair, e.g.:
A ---> Send ----> Recv ---> B
                       \--> C

All communication is abstracted inside the Send and Recv nodes.
All users of a tensor on a given device are canonicalized: transfers such as A->B and A->C that originate from the same source node share a single Send/Recv pair, so the tensor is transmitted only once.
Send/Recv allows the scheduling of individual nodes of the graph on different devices to be decentralized.
The master only needs to issue a single Run request per graph execution to each worker; it no longer has to be involved in scheduling every cross-device communication.

3.3 Distributed Execution

Similar to the multi-device execution setting; Send/Recv pairs communicate across machines using protocols such as TCP or RDMA.
Fault tolerance
The main failure modes detected:

  1. An error in the communication between a Send and a Recv node pair
  2. A failed periodic health-check from the master to a worker process

When such a failure is detected, the entire graph execution is aborted and restarted.
Variable state can nevertheless be recovered: each Variable is connected to a Save node that periodically writes checkpoints, and to a Restore node that reads them back after a restart.
- Restore nodes are only enabled in the first iteration after a restart.

4 Extensions

4.1 Gradient Computation

Condition for gradient computation: if a tensor C depends on some set of tensors {Xk}, TensorFlow can compute {dC/dXk}.
Algorithm:

  1. Find the paths in the computation graph from I (the relevant inputs Xk) to C
  2. Backtrack from C to I; for each operation on the path, add a node containing the partial gradients

Any operation may have a registered gradient function.
[db, dW, dx] = tf.gradients(C, [b, W, x])
Automatic gradient computation complicates optimization, especially memory usage.
Gradient computation often needs tensors computed early in the forward pass as inputs; although these tensors are not persistent, they cannot be freed until the backward pass consumes them, which inflates memory usage.
Solutions:

  1. More aggressive improvements to memory management
  2. More sophisticated heuristics for choosing the execution order
  3. Recomputing tensors when needed instead of keeping them in memory
  4. Swapping long-lived tensors out of GPU memory into host (CPU) memory

4.2 Partial Execution

Execute a subgraph rather than the whole computation graph.
Clients are allowed to execute an arbitrary subgraph of the whole graph, to inject arbitrary data along any edge, and to retrieve data flowing along any edge in the graph.
Naming the outputs of a node:
e.g. "bar:0", where bar is the name of the node and 0 is the output port

How partial execution works:
The Run call accepts inputs (feeds) and outputs (fetches) as mappings keyed by name:port strings, so the client can specify an arbitrarily fine-grained subgraph (see the sketch below).

  • Each node:port named in the inputs is replaced by a feed node, which reads its value from an entry in a specially initialized rendezvous object
  • Each node:port named in the outputs is replaced by a fetch node
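A small sketch of partial execution through the TF 1.x Run API, using the name:port notation; the node names (`a`, `bar`, `baz`) are placeholders chosen for this example:

```python
import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")
b = tf.add(a, 1.0, name="bar")        # its output tensor is "bar:0"
c = tf.multiply(b, 2.0, name="baz")   # its output tensor is "baz:0"

with tf.Session() as sess:
    # Feed "a:0" and fetch only "bar:0": the multiply node is pruned away,
    # so only the subgraph required for the requested output executes.
    print(sess.run("bar:0", feed_dict={"a:0": 3.0}))   # 4.0
```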

4.3 Device Constraints

The client can provide partial constraints to restrict where nodes may be placed on devices.
Example constraints (see the sketch below):

  1. Place this node only on a device of "/job:worker/task:17"
  2. Colocate this node with var13
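A hedged sketch of how such partial constraints are expressed in the TF 1.x API: `tf.device` accepts a partial device specification, and colocation can be requested by reusing another node's device string (TF 1.x also exposes `tf.colocate_with`); the job/task names here are hypothetical.

```python
import tensorflow as tf

# Constraint 1: place this node somewhere on task 17 of the "worker" job;
# the placement algorithm still chooses the concrete device on that task.
with tf.device("/job:worker/task:17"):
    var13 = tf.Variable(tf.zeros([1024]), name="var13")

# Constraint 2: colocate another node with var13.
with tf.device(var13.device):
    update = tf.assign_add(var13, tf.ones([1024]))
```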

The placement algorithm must not only minimize execution time but also satisfy these user constraints, along with memory constraints.
Steps:

  1. After computing the feasible set of devices for each node, run union-find over the colocation constraints to obtain the graph components that must be colocated
  2. For each component, compute the intersection of the feasible device sets of its member nodes

Q: The computed feasible device set per node fits easily into the placement algorithm's simulation
This seems to just say that the per-node feasible sets feed into the simulated execution that determines the node-to-device placement.

4.4 Control Flow

Method: introduce a small set of primitive control-flow operators that allow TensorFlow to handle cyclic dataflow graphs.

  • Switch, Merge: allow skipping the execution of an entire subgraph based on the value of a boolean tensor
  • Enter, Leave, NextIteration: express iteration

TensorFlow runtime: implements tags and frames, conceptually similar to the MIT Tagged-Token machine
Each loop iteration is identified by a tag, and its execution state is kept in a frame
Multiple iterations can execute concurrently and independently.
Executing a graph that contains control flow:

  1. Because a single loop may be spread across multiple devices, the system must carefully manage the state of the loop

Solution: graph rewriting
Steps:

  1. During graph partitioning, extra control nodes are added to every partition
  2. The control nodes act as small state machines that orchestrate the start and end of each iteration. For every iteration, the device that owns the loop-termination predicate sends a tiny control message to the other participating devices

Challenge: control flow + gradient computation
For example, for an if-branch the gradient computation needs to know which branch was taken, so the intermediate tensors of that branch must be saved
For a while-loop it needs to know how many iterations ran, and the corresponding intermediate tensors must also be recorded (see the sketch below)
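A minimal sketch of the control-flow constructs as exposed in the TF 1.x API (`tf.cond` built on Switch/Merge, `tf.while_loop` built on Enter/Exit/NextIteration); differentiating through them relies on exactly the recorded branch choices and iteration counts described above.

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[])

# Conditional: only the taken branch executes (Switch/Merge under the hood).
y = tf.cond(x > 0.0, lambda: x * x, lambda: -x)

# Iteration: adds x to an accumulator for 5 iterations.
_, acc = tf.while_loop(lambda i, s: i < 5,
                       lambda i, s: (i + 1, s + x),
                       [tf.constant(0), tf.constant(0.0)])

# The backward pass must know which branch ran and how many iterations executed.
grad = tf.gradients([y, acc], [x])[0]

with tf.Session() as sess:
    print(sess.run([y, acc, grad], feed_dict={x: 3.0}))  # [9.0, 15.0, 11.0]
```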

4.5 Input Operations

input node: configured with one or more filenames; each execution produces a tensor containing one or more examples read from those files. This allows reading data directly from the file system.
This allows data to be read directly from the underlying storage system into the memory of the machine

When the client and the worker are on different machines and data is fed from the client, the data usually takes an extra network hop: storage system -> client -> worker.

4.6 Queues

Purpose: allow different parts of the graph to execute asynchronously and hand off data through Enqueue and Dequeue operations.
"
They allow different portions of the graph to execute asynchronously, possibly at different cadences, and to hand off data through Enqueue and Dequeue operations.
"
e.g.: prefetching data from disk (see the sketch below); accumulating many gradients so that a more complex combined gradient can be computed over a larger batch
Grouping input sentences for recurrent language models into bins of similar length

Enqueue: blocks until the queue has enough space
Dequeue: blocks until the queue contains the required number of elements
Besides the normal FIFO queue there is also a shuffling queue, which randomizes the order of its elements
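A sketch of the TF 1.x FIFOQueue API in the prefetching spirit described above; the producer thread and the random data are simplified stand-ins for a real input pipeline.

```python
import threading
import numpy as np
import tensorflow as tf

queue = tf.FIFOQueue(capacity=32, dtypes=[tf.float32], shapes=[[4]])
example = tf.placeholder(tf.float32, shape=[4])
enqueue = queue.enqueue([example])   # blocks when the queue is full
batch = queue.dequeue_many(8)        # blocks until 8 elements are available

def producer(sess):
    # Stand-in for a thread prefetching examples from disk.
    for _ in range(8):
        sess.run(enqueue, feed_dict={example: np.random.rand(4)})

with tf.Session() as sess:
    t = threading.Thread(target=producer, args=(sess,))
    t.start()
    print(sess.run(batch).shape)     # (8, 4), produced and consumed concurrently
    t.join()
```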

4.7 Container

Containers manage long-lived mutable state.
The backing store for a Variable lives in a container.
The default container persists until the process terminates.
Named containers are also allowed.
Containers make it possible to share state across otherwise disjoint graphs in different sessions.
Resetting a container clears its entire contents.

5 Optimizations

5.1 Common Subexpression Elimination

Background: multiple layers of abstraction can produce redundant copies of the same computation.
Approach: a common-subexpression-elimination pass similar to the algorithm described by Click.
Operations with identical inputs and identical operation types are replaced by a single copy.
The graph edges are then redirected to reflect this canonicalization (see the toy sketch below).
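A toy sketch of the idea in plain Python (not TensorFlow's actual pass, which follows Click's algorithm): merge nodes that have the same operation type and the same inputs, then redirect edges to the surviving copy.

```python
def common_subexpression_elimination(nodes):
    """nodes: list in topological order; each node has .op_type, .inputs,
    and .is_stateful (hypothetical attributes)."""
    canonical = {}      # (op_type, input identities) -> representative node
    replacement = {}    # eliminated node -> surviving copy
    for node in nodes:
        # Redirect inputs that point at nodes we have already eliminated.
        node.inputs = [replacement.get(src, src) for src in node.inputs]
        if node.is_stateful:         # never merge stateful ops such as Variable
            continue
        key = (node.op_type, tuple(id(src) for src in node.inputs))
        if key in canonical:
            replacement[node] = canonical[key]
        else:
            canonical[key] = node
    return [n for n in nodes if n not in replacement]
```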

5.2 Controlling Data Communication and Memory Usage

Operation scheduling matters, especially for data transfers and memory usage.
Good scheduling can:

  1. Reduce the time window during which intermediate results need to be kept in memory between operations, and hence the peak memory consumption
  2. Reduce contention for network resources

The most important case:

  1. Recv nodes
    Options: ASAP (as soon as possible) vs. ALAP (as late as possible) start of the transfer
    The adopted approach seems to be essentially ALAP: the transfer is not started until the result is about to be needed

5.3 Asynchronous kernels

non-blocking kernels: the kernel is passed a continuation that should be invoked when the kernel's execution is complete
Used in environments where:
having many active threads is relatively expensive in terms of memory usage and other resources
and where tying up an execution thread for an unbounded period of time would be too costly
e.g.: the Receive kernel, and the Enqueue and Dequeue kernels (which may need to block)

5.4 Optimized Libraries for kernel Implementation

Some kernels are implemented on top of existing optimized numerical libraries.
Many kernel implementations are relatively thin wrappers around such libraries.
e.g.:
Matrix multiplication: BLAS, cuBLAS
Convolution: cuda-convnet, cuDNN
The open-source Eigen linear algebra library (which relies on template-based code generation) is also used for many kernels

5.5 Lossy Compression

Some machine learning algos, including those typically used for training neural networks are tolerant of noise and reduced precision arithmetic
When sending higher-precision data between devices, a lossy compression of the higher-precision internal representation is used; see the NumPy sketch below.
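The paper's example is converting 32-bit floats into a 16-bit representation that keeps the sign, exponent, and top mantissa bits, then padding with zeros on the receiving side; a NumPy sketch of that idea (not TensorFlow's actual conversion nodes):

```python
import numpy as np

def compress_fp32_to_16(x):
    """Keep only the upper 16 bits of each float32 (sign, exponent, top mantissa bits)."""
    return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def decompress_16_to_fp32(y):
    """Restore a float32 by filling the dropped mantissa bits with zeros."""
    return (y.astype(np.uint32) << 16).view(np.float32)

x = np.array([0.1, -3.14159, 1e-8], dtype=np.float32)
print(decompress_16_to_fp32(compress_fp32_to_16(x)))   # close to x, but lossy
```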

6. Status and Experience

front-ends: specifying TensorFlow computations in Python, C++, JS
Validating correctness is a difficult enterprise because the system is inherently stochastic and only intended to behave in a certain way in expectation — potentially after hours of computation.

Porting existing models yielded some lessons; e.g., when porting Inception, assembling and debugging a graph of 36,000 operations into the correct structure proved challenging.
The following strategies proved important:

  1. Build tools to gain insights into the exact number of parameters in a given model.
    • Such tools can reveal subtle flaws in a complex network architecture specification
  2. Start small and scale up
    • This exposes subtle edge cases in individual operations
  3. Always ensure that the objective (loss function) matches between machine learning systems when learning is turned off
    • Set the learning rate to 0 and check for unexpected behavior; such an error would have been difficult to identify in a dynamic, training network
  4. Make a single-machine implementation match before debugging a distributed implementation
    • This helps to delineate and debug training performance differences between machine learning systems
    • e.g.: race conditions, and non-atomic operations incorrectly assumed to be atomic
  5. Guard against numerical errors
    • Numerical libraries differ in how they handle non-finite floating point values
    • Convolutions are especially susceptible, and often behave differently while debugging than in real experiments
    • Solution: check for non-finite floating point values
  6. Analyze pieces of a network and understand the magnitude of numerical error
    • Predicting and understanding the magnitude of the expected numerical error is important for judging whether a given component is implemented correctly
    • Validating complex mathematical operations in the presence of an inherently stochastic system is quite challenging

7. Common Programming Idioms

e.g.: optimizer: SGD, training batch size: 100 to 1000

Data Parallel Training

e.g.: a batch of 1000 elements is split across 10 replicas, each processing 100 elements; the gradients are then combined
This can also be done asynchronously, with one client thread per replica (see the sketch below for the synchronous variant)
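A schematic sketch of the synchronous data-parallel idiom in TF 1.x style: split the batch across replicas, compute per-replica gradients, average them, and apply one update. The device names, shapes, and toy loss are placeholders, and a real setup would use an Optimizer rather than a raw assign.

```python
import tensorflow as tf

w = tf.Variable(tf.random_normal([100, 1]))     # shared model parameters
x = tf.placeholder(tf.float32, [1000, 100])     # full batch of 1000 examples
shards = tf.split(x, 10)                        # 10 replicas x 100 examples each

replica_grads = []
for i, shard in enumerate(shards):
    with tf.device("/gpu:%d" % i):              # hypothetical device per replica
        loss = tf.reduce_mean(tf.square(tf.matmul(shard, w)))   # toy loss
        replica_grads.append(tf.gradients(loss, [w])[0])

mean_grad = tf.add_n(replica_grads) / len(replica_grads)   # combine the gradients
train_op = tf.assign_sub(w, 0.01 * mean_grad)              # one synchronous SGD step
```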

model parallel training

e.g.: recurrent deep LSTM model for seq2seq
Q: apparently the model itself is partitioned across devices?

Concurrent Steps for Model Computation Pipelining

Similar to asynchronous data parallelism, but the concurrent steps run on the same set of devices
Q: " This allows “filling in the gaps” where computation of a single batch of examples might not be able to fully utilize the full parallelism on all devices at all times during a single step"

9. Tools

  1. TensorBoard: visualization
  2. EEG (performance tracing): collects and visualizes the ordering and performance characteristics of graph execution; very useful for understanding bottlenecks
    • Methods: ftrace, an in-house lightweight tracing tool, and CUPTI
    • Reconstructs distributed training steps; every thread switch, CUDA kernel launch, and DMA operation can be traced at millisecond-level resolution
    • Used to identify delays caused by communication, synchronization, or DMA

Q: background housekeeping threads can be seen in other colors being migrated across processor cores

10. Future Work

  1. Adding more machine learning models
  2. A function mechanism: users can designate an entire subgraph as a reusable component, ideally reusable across different front-end languages
  3. A JIT compiler: takes a subgraph of a TensorFlow execution, possibly with runtime profiling information about typical tensor sizes and shapes, and generates an optimized routine for that subgraph
  4. Further improvements to placement and node scheduling
  • TensorFlow is related to Theano [7], Torch [13], Caffe [26], Chainer [49], and the Computational Network Toolkit [54]
    • Like Theano and Chainer, TensorFlow supports symbolic differentiation
      • But the general dataflow graph is more flexible. Q: it can express stateful parameter nodes as Variables?
  • Halide: has an intermediate representation similar to TensorFlow's, but with more knowledge of the semantics of its operations; it considers parallelism and locality together and achieves a higher degree of optimization, but targets only a single machine

Distributed systems:

  • Several other distributed systems execute dataflow graphs across a cluster:
    • Dryad [24] and Flume [8] demonstrate how a complex workflow can be represented as a dataflow graph.
    • CIEL [37] and Naiad [36] introduce generic support for data-dependent control flow:
      - CIEL represents iteration as a dynamically unfolding DAG
      - Naiad uses a static graph with cycles to support low-latency iteration
      - Spark [55] is optimized for computations that repeatedly access the same data, using "resilient distributed datasets" (RDDs), which are soft-state cached outputs of earlier computations
    • Dandelion [44] executes dataflow graphs across a cluster of heterogeneous devices, including GPUs.
    • TensorFlow uses a hybrid dataflow model that borrows elements from each of these systems. Its dataflow scheduler, i.e. the component that chooses the next node to execute:
      • uses the same basic algorithm as Dryad, Flume, CIEL, and Spark
      • Its distributed architecture is closest to Naiad: both use a single optimized dataflow graph to represent the entire computation and cache information about that graph on each device to minimize coordination overhead.
      • Like Spark and Naiad, TensorFlow works best when there is enough RAM in the cluster to hold the working set of the computation.
      • Iteration in TensorFlow uses a hybrid approach: multiple replicas of the same dataflow graph may execute concurrently while sharing the same set of variables. Replicas can share data asynchronously through the variables, or use synchronization mechanisms in the graph (such as queues) to operate synchronously.
      • TensorFlow also supports iteration within a graph, which is a hybrid of CIEL and Naiad: for simplicity, each node fires only when all of its inputs are ready (like CIEL); for efficiency, the graph is represented as a static, cyclic dataflow (like Naiad).

Building on the 2015 whitepaper version:

Abstract

Unlike previous "parameter server" designs, the management of shared state is built into the system.

1. Intro

  • high-level programming models of dataflow systems
  • the low-level efficiency of parameter servers

Unlike traditional dataflow systems, in which graph vertices represent functional computation on immutable data, TensorFlow allows vertices to represent computations that own or update mutable state.
e.g.:
offload computation onto the servers that hold the shared state to reduce the amount of network traffic.

Q:
We have also built various coordination protocols, and achieved encouraging results with synchronous replication, echoing recent results [10, 18] that contradict the commonly held belief that asynchronous replication is required for scalable learning

Synchronous vs. Asynchronous Replication

Asynchronous Replication:

  • In asynchronous replication, different devices or nodes (e.g., multiple GPUs or machines) work independently to update the model parameters.
  • Each device computes gradients based on its local data and then updates the model parameters asynchronously, meaning that updates can happen out of sync with one another.
  • This approach is often believed to be more scalable because devices don't have to wait for each other, potentially speeding up training.

Synchronous Replication:

  • In synchronous replication, all devices or nodes must agree on the model's parameters before moving on to the next step.
  • Each device computes gradients based on its local data, and then these gradients are averaged (or otherwise combined) across all devices. Only after this aggregation step are the model parameters updated synchronously across all devices.
  • This means that every device is working with the same version of the model parameters at each step of training.

Why is Synchronous Replication Important?

The commonly held belief is that asynchronous replication is necessary for scalable learning, especially in large distributed systems. The reasoning is that synchronous methods can be slower because they require all devices to wait for each other, potentially causing bottlenecks if one device is slower than the others.

TensorFlow's Achievement with Synchronous Replication

The TensorFlow team achieved promising results using synchronous replication, which is noteworthy because it challenges the belief that asynchronous replication is required for scalability.

  1. Consistency: With synchronous replication, all nodes update the model parameters at the same time, leading to consistent and stable model updates. This can be particularly beneficial for convergence, ensuring that all nodes are working with the same model state at each training step.

  2. Scalability: Despite the traditional belief that synchronous replication might not scale well, TensorFlow’s implementation showed that it could be done efficiently. They managed to synchronize updates across multiple devices in a way that didn't significantly hinder performance, showing that synchronous methods can be both scalable and effective.

  3. Echoing Recent Results: The paper references other research that also supports the idea that synchronous replication can be scalable, indicating a shift in the understanding of distributed learning.

Example in Machine Learning Training

Imagine you're training a deep learning model on a dataset using multiple GPUs.

  • Asynchronous: Each GPU processes a different mini-batch of data, computes gradients, and updates the model independently. This can lead to faster updates, but the model parameters may diverge slightly between GPUs, leading to issues like instability or slower convergence.

  • Synchronous: All GPUs process their mini-batches, compute gradients, and then these gradients are aggregated (e.g., averaged). The model is updated in a synchronized manner, ensuring that all GPUs are working with the same version of the model. This might introduce some waiting time (as all GPUs must finish their computations before proceeding), but it leads to more stable updates and can still be scalable if implemented efficiently.

2. Background and Motivation

v1 2.1 Requirements

Distributed Execution

Benefits of distribution: the ability to handle more data and larger models.
Requirements:

  1. Data parallelism: eliminates the I/O bottleneck for input data, and the preprocessing steps can run independently
  2. A distributed file system
  3. A distributed system can shard the model across many processes, to increase the available network bandwidth
  4. Many workers read and write the model parameters simultaneously
  5. Mini-batch gradient descent requires transferring large amounts of data with low latency

Accelerator

Expensive computations: matrix multiplication, convolution.
Support: the paper surveys accelerators such as GPUs, the TPU, the Movidius Deep Learning Accelerator, and FPGAs.
TensorFlow uses a portable programming model that targets a generic device abstraction, so the architecture can be specialized for new devices as they appear.

Training and inference support

Depending on the nature of the application, inference may be (1) interactive, (2) offline, or (3) spread across multiple servers.
Training and inference can then share a single common, well-optimized system.

Extensibility

An extensible programming model that supports expressive control flow and stateful constructs.

v2 2.1 Previous System: DistBelief

DistBelief: a large distributed cluster of multicore servers with GPU acceleration
- In the parameter server architecture, a job consists of stateless worker processes and stateful parameter server processes
Programming model: users define a neural network as a directed acyclic graph of layers, terminating in a loss function
- Each layer is a composition of mathematical operators
- DistBelief uses the DAG structure and knowledge of each layer's semantics to compute gradients via backpropagation
- Because parameter updates in many algorithms are commutative and have weak consistency requirements, the workers can apply their updates independently

Limitations:

  1. The Python-based scripting interface suits users with simple requirements, but makes the following kinds of flexibility difficult:
    1. Defining new layers: the layer classes are C++, which is a hurdle for Python-only users
    2. Refining the training algorithm: this requires modifying the parameter server implementation, and its get()/put() interface does not fit every optimizer
    3. Some optimizers need extra operations that would run more efficiently directly on the parameter server, which is not supported
  2. Defining new training algorithms is hard
    1. The training loop pattern is too conventional and does not adapt well to RNNs, GANs, etc.
    2. Sometimes the loss does not come from inside the training system, e.g. from an agent in a separate system such as a video game emulator (as in reinforcement learning)
    3. Other conventional machine learning algorithms, such as random forests, EM, and latent Dirichlet allocation, cannot be expressed
  3. Too heavyweight for the following workflow:
    1. Tweaking the model on a local GPU workstation
    2. Training the model on a cluster with a much larger dataset
    3. Pushing the model into production, e.g. integrating it with an online service or deploying it on mobile devices for offline execution

2.2 Design Principles

Dataflow graphs of primitive operators

  • TensorFlow represents individual mathematical operators as nodes in the dataflow graph, which lets users define new model architectures without modifying the core system
  • This also enables experimentation with different update rules

Deferred execution

phases:

  1. Define the program as a symbolic dataflow graph with placeholders for the input data and state
  2. Execute an optimized version of the program
    • Using the graph's dependency structure, TensorFlow can issue a sequence of kernels to the GPU without waiting for intermediate results

Common abstraction for heterogeneous accelerators

The TPU achieves better power efficiency (higher performance-per-watt).

common abstraction for devices

  • The device abstraction's methods:
    1. issuing a kernel for execution
    2. allocating memory for inputs and outputs
    3. transferring buffers to and from host memory
  • It may be specialized for different intents:
    1. training
    2. serving
    3. offline inference
  • Tensors of primitive values serve as a common interchange format that all devices understand
    • primitive values: element types that the hardware and software support natively
    • Sparse tensors can be encoded in terms of dense tensors
    • This keeps memory allocation and serialization simple at the lowest levels of the system
    • Tensors also enable other optimizations for memory management and communication
      • e.g.: RDMA and direct GPU-to-GPU transfer
  • TensorFlow has no built-in parameter server; instead there are parameter server (PS) tasks
    • task: a named process that can communicate over a network
    • Each task exports the same graph execution API and contains one or more devices
    • Task roles: PS tasks and worker tasks
    • Users can program PS tasks with the same scripting interface they use to define their models

Single-machine frameworks

Caffe, Theano, Torch
Torch: has an imperative programming model that permits fine-grained control over the execution order and memory utilization
- TensorFlow's programming model is closest to Theano's

Batch dataflow systems

Batch dataflow systems require the input data to be immutable and all sub-computations to be deterministic, which suits traditional workloads better than ML training.
e.g.: SparkNet takes 20 seconds to broadcast weights and collect updates, which forces the system to use much larger batches and slows convergence.

MapReduce
DryadLINQ: a more sophisticated high-level query language
Spark: can cache the outputs of previous computations
Dandelion: extends DryadLINQ to support GPUs and FPGAs
Naiad: designed for sparse, discrete data, but with mutable vertex state
- augments a dataflow model with streaming execution, stateful vertices, and structured timestamps (timely dataflow)
- can handle incremental updates and iterative algorithms
- uses cyclic dataflow graphs to express iteration
- TensorFlow borrows the idea of timely-dataflow-style iteration

Parameter servers

Parameter server architecture: a set of servers manages the shared state, while a set of data-parallel workers updates it.
Writes are typically applied with an associative and commutative combiner, e.g. +=.

scalable topic modeling

GeePS: a parameter server for GPUs
MXNet:
- Features:
1. uses a parameter server to scale training
2. supports GPU acceleration
3. has a flexible programming model with support for multiple front-end languages
- MXNet partially meets the requirements of TensorFlow users, but its parameter server is privileged code, which makes it hard for users to customize
- the MXNet key-value store interface does not allow sparse gradient updates within a single value

DistBelief: adding a new algorithm or model architecture often requires modifying the parameter server implementation, and its front end is not an existing high-level language.

3. Tensorflow Execution model

TensorFlow uses a single dataflow graph to represent all computation and state in a machine learning algorithm, including the individual mathematical operations, the parameters and their update rules, and the input preprocessing

Properties:

  1. Supports multiple concurrent executions on overlapping subgraphs of the overall graph
  2. Mutable state can be shared between different executions of the graph

The key observation in the parameter server architecture [14, 20, 49] is that mutable state is crucial when training very large models, because it becomes possible to make in-place updates to very large parameters, and propagate those updates to parallel training steps as quickly as possible

3.1 Dataflow graph element

In a TensorFlow graph, each vertex represents a unit of local computation, and each edge represents the output from, or input to, a vertex.

We refer to the computation at vertices as operations, and the values that flow along edges as tensors.

representing sparse data:

  1. encode the data into variable-length string elements of a dense tensor
  2. use a tuple of dense tensors (e.g., an n-D sparse tensor with m non-zero elements can be represented in coordinate-list format as an m × n matrix of coordinates and a length-m vector of values).
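A small sketch of the second (coordinate-list) option using the TF 1.x `tf.SparseTensor` type, which stores exactly the m x n index matrix and length-m value vector described here:

```python
import tensorflow as tf

# A 3x4 sparse tensor with m = 2 non-zero elements:
# an m x n matrix of coordinates plus a length-m vector of values.
sp = tf.SparseTensor(indices=[[0, 1], [2, 3]],
                     values=[7.0, 5.0],
                     dense_shape=[3, 4])
dense = tf.sparse_tensor_to_dense(sp)    # materialize as an ordinary dense tensor

with tf.Session() as sess:
    print(sess.run(dense))
```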

Operations: An operation takes m ≥ 0 tensors as input and produces n ≥ 0 tensors as output.
An operation has a named “type” (such as Const, MatMul, or Assign) and may have zero or more compile-time attributes that determine its behavior.
For example, the simplest operation, Const, has no inputs and a single output; its value is a compile-time attribute.

Stateful operations: variables
- A variable operation owns mutable state that can be read and/or written on each execution. It owns a mutable buffer that can be used to store the shared parameters of a model as it is trained.
- A Variable has no inputs, and produces a reference handle, which acts as a typed capability for reading and writing the buffer
- A Read operation takes a reference handle r as input, and outputs the value of the variable (State[r]) as a dense tensor
- Other operations modify the underlying buffer: e.g. AssignAdd(r, x) adds x to State[r], and a subsequent Read(r) returns the updated value

Stateful operations: queues
- e.g. FIFOQueue
- The combination of queues and dynamic control flow (§3.4) can also implement a form of streaming computation between subgraphs.

3.2 Partial and concurrent execution

The API for executing a graph allows the client to specify declaratively the subgraph that should be executed. The client selects zero or more edges to feed input tensors into the dataflow, and one or more edges to fetch output tensors from the dataflow

step: Each invocation of the API is called a step, and TensorFlow supports multiple concurrent steps on the same graph.

Partial and concurrent execution is responsible for much of TensorFlow’s flexibility
This concurrency and asynchrony make it convenient to implement machine learning algorithms with weak consistency requirements.

3.3 Distributed Execution

A device is responsible for executing a kernel for each operation assigned to it. TensorFlow allows multiple kernels to be registered for a single operation, with specialized implementations for a particular device or data type

Send and Recv operations that replace edges across device boundaries. Send transmits its single input to a specified device as soon as the tensor is available, using a rendezvous key to name the value. Recv has a single output, and blocks until the value for a specified rendezvous key is available locally, before producing that value. Send and Recv have specialized implementations for several device-type pairs

Once the graph for a step has been pruned, placed, and partitioned, its subgraphs are cached in their respective devices

A client session maintains the mapping from step definitions to cached subgraphs, so that a distributed step on a large graph can be initiated with one small message to each participating task

3.4 Dynamic control flow

Supports advanced machine learning algorithms that contain conditional and iterative control flow.
we add conditional (if statement) and iterative (while loop) programming constructs in the dataflow graph itself. We use these primitives to build higher-order constructs, such as map(), fold(), and scan() [2].

we borrow the Switch and Merge primitives from classic dynamic dataflow architectures

The while loop is more complicated, and uses Enter, Exit, and NextIteration operators to ensure that the loop is well-formed

4. Extensibility case studies

4.1 Differentiation and optimization

TensorFlow includes a user-level library that differentiates a symbolic expression for a loss function and produces a new symbolic expression representing the gradients.

For example, given a neural network as a composition of layers and a loss function, the library will automatically derive the backpropagation code.

We have extended the algorithm to differentiate conditional and iterative subcomputations (§3.4) by adding nodes to the graph that record the control flow decisions in the forward pass, and replaying those decisions in reverse during the backward pass

Differentiating iterative computations over long sequences can lead to a large amount of intermediate state being accumulated in memory, and we have developed techniques for managing limited GPU memory on these computations.

SGD is easy to implement in a parameter server: for each parameter W, gradient ∂L/∂W, and learning rate α, the update rule is W′ ← W − α × ∂L/∂W. A parameter server can implement SGD by using -= as the write operation, and writing α × ∂L/∂W to each W after a training step.

4.2 Training very large models

Some models are so large that their parameters cannot even fit in the RAM of a single host.

  1. One-hot input vectors feeding a huge embedding matrix:
    • Solution: sparse embedding layers (see the sketch below)
      1. Gather: extracts a sparse set of rows from a tensor
        • Q: does this mean the one-hot vector never has to be materialized as a dense tensor?
      2. Part (dynamic partition): divides the incoming indices into variable-sized tensors, one per shard
      3. Stitch (dynamic stitch): reassembles the partial results from each shard into a single result tensor
    • Automatic differentiation is supported for these operations
  2. Softmax over a huge vocabulary:
    • Solutions:
      1. Shard the weights across tasks
      2. Sampled softmax, trained using sparse multiplications
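A hedged sketch of the sharded sparse embedding lookup using the corresponding TF 1.x ops (`tf.dynamic_partition` for Part, `tf.gather` for Gather, `tf.dynamic_stitch` for Stitch); in practice `tf.nn.embedding_lookup` packages this pattern, and the shard count, vocabulary size, and modulo sharding scheme here are made up.

```python
import tensorflow as tf

num_shards, vocab, dim = 4, 1000, 8
# One embedding shard per variable (in a real job each shard lives on its own PS task).
shards = [tf.Variable(tf.random_normal([vocab // num_shards, dim]))
          for _ in range(num_shards)]

ids = tf.placeholder(tf.int64, [None])              # sparse ids for a batch
shard_of_id = tf.cast(ids % num_shards, tf.int32)   # which shard owns each id
row_in_shard = ids // num_shards                    # row index within that shard

# Part: split the ids (and their original positions) by destination shard.
id_parts = tf.dynamic_partition(row_in_shard, shard_of_id, num_shards)
pos_parts = tf.dynamic_partition(tf.range(tf.size(ids)), shard_of_id, num_shards)

# Gather: each shard extracts only the rows it owns.
gathered = [tf.gather(shards[i], id_parts[i]) for i in range(num_shards)]

# Stitch: reassemble the per-shard results into one tensor in the original order.
embeddings = tf.dynamic_stitch(pos_parts, gathered)
```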

4.3 Fault tolerance

We often need to train a model using non-dedicated resources, for example using the Borg cluster manager [71], which does not guarantee availability of the same resources for the duration of the training process.
A long-running TensorFlow job may experience failures or preemptions, so some form of fault tolerance is needed; however, failures are not expected to be so frequent that individual operations require it.

There is no need to make every write to the parameter state durable, because we can recompute any update from the input data, and many learning algorithms do not require strong consistency

Save writes one or more tensors to a checkpoint file
- Each Variable in a task is connected to the same Save operation, with one Save per task, to maximize the I/O bandwidth to the distributed file system
- The checkpointing library does not attempt to produce consistent checkpoints: if training and checkpointing execute concurrently, the checkpoint may include none, all, or some of the updates from the training step. This behavior is compatible with the relaxed guarantees of asynchronous SGD
- Consistent checkpoints require additional synchronization to ensure that update operations do not interfere with checkpointing;
Restore reads one or more tensors from a checkpoint file

the user can apply different policies to subsets of the variables in a model, or customize the checkpoint retention scheme. For example, many users retain checkpoints with the highest score in a custom evaluation metric.
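A sketch of the user-level checkpointing API in TF 1.x (`tf.train.Saver`), which is built on the Save/Restore operations described above; the path and retention value are illustrative.

```python
import tensorflow as tf

w = tf.Variable(tf.zeros([10]), name="w")
saver = tf.train.Saver(max_to_keep=5)     # user-configurable retention policy

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training steps ...
    saver.save(sess, "/tmp/model.ckpt", global_step=100)   # Save: write variables
    saver.restore(sess, "/tmp/model.ckpt-100")             # Restore: read them back
```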

4.4 Synchronous replica coordination

Many systems use asynchronous parameter updates to increase throughput, at the cost of consistency: workers may compute with stale parameters.
Q: since GPUs enable training with hundreds
Synchronous training may actually be faster in time-to-quality.
TensorFlow was originally designed around asynchronous parameter reads and writes, but it also supports synchronous schemes.
TensorFlow lets the user specify how parameters are read and written:

  1. Asynchronous method: each worker reads the parameter values at the start of its step and applies its gradients at the end to whatever the current values are (which may differ from the values it read)
  2. Synchronous method with queues:
    • Steps:
      1. Use blocking queues to force all workers to read the same parameter values
      2. Use a per-variable queue to accumulate the gradient updates so they can be applied atomically
    • Challenge: slow workers (stragglers) limit overall throughput
  3. Backup workers, similar to backup tasks in MapReduce (see the sketch below):
    • They run proactively rather than reactively after a straggler is detected
    • The aggregation takes the first m of n gradient updates produced and simply ignores the stragglers' contributions for that batch
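In the TF 1.x API this coordination is exposed through `tf.train.SyncReplicasOptimizer`; a hedged sketch of the backup-worker setting, where each step aggregates the first m = 4 of n = 6 gradient updates and drops the stragglers' contributions (the toy loss is a placeholder, and a real job also needs a cluster spec and the chief hooks):

```python
import tensorflow as tf

w = tf.Variable(0.5)
loss = tf.square(w - 1.0)                      # toy loss, stand-in for a real model
global_step = tf.train.get_or_create_global_step()

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# n = 6 worker replicas run, but each step waits for only m = 4 gradient updates;
# the extra replicas act as backup workers whose late gradients are discarded.
sync_opt = tf.train.SyncReplicasOptimizer(opt,
                                          replicas_to_aggregate=4,
                                          total_num_replicas=6)
train_op = sync_opt.minimize(loss, global_step=global_step)
# Each worker would also install sync_opt.make_session_run_hook(is_chief=...)
# in its MonitoredTrainingSession to handle queue/token initialization.
```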

5. Implementation

  1. C API: separates user-level code from the core runtime
  2. distributed master:
    • translates user requests into execution across a set of tasks
    • given a graph and a step definition
    • the master:
      • prunes and partitions the graph
        • pruning includes dead code elimination
      • obtains a subgraph for each participating device
      • caches these subgraphs so they can be reused in subsequent steps
      • applies standard optimizations such as common subexpression elimination and constant folding
      • coordinates the execution of the optimized subgraphs across a set of tasks
  3. dataflow executor
    • schedules the execution of the kernels
    • i.e. schedules the kernels that make up the local subgraph
    • can execute 10,000 subgraphs per second
    • dispatches kernels to local devices
    • runs kernels in parallel when possible
  4. Kernel Implementations
    • Many kernels are implemented with Eigen::Tensor, which uses C++ templates to generate efficient parallel code for multicore CPUs and GPUs
    • cuDNN is used where a more efficient specialized implementation exists
    • Quantization accelerates computation in environments such as mobile devices and high-throughput data centers
      • gemmlowp, a low-precision matrix multiplication library, accelerates quantized computation
  5. Networking layer
    • CPU -> GPU: cudaMemcpyAsync overlaps computation with data transfer; collective operations are used as a further optimization
    • GPU <-> GPU: peer-to-peer DMA
    • Between tasks: gRPC over TCP, and RDMA over Converged Ethernet

If it is difficult or inefficient to represent a subcomputation as a composition of existing operations, users can register additional kernels, including hand-implemented fused kernels (e.g. for ReLU and Sigmoid and their gradients).
Automatic kernel fusion is ongoing work.

6 Performance

6.1 Single-machine benchmark

6.2 Synchronous replica microbenchmark

7 Discussion

Limitations:

  1. No default policies that work well for all users
  2. Ongoing work on automatic placement, kernel fusion, memory management, and scheduling
  3. Some applications would benefit from stronger consistency
  4. The static dataflow graph is limiting for algorithms whose computation structure unfolds dynamically