SystemML Large-Scale Machine Learning: A Study on Optimizing Operator Fusion Plans


Abstract

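Many large-scale machine learning (ML) systems allow custom ML algorithms to be specified as linear algebra programs and then automatically generate efficient execution plans. In this context, optimization opportunities for fused operators, i.e., fused chains of basic operators, are ubiquitous. These opportunities include

(1) fewer materialized intermediates,

(2) fewer scans of the input data, and

(3) the exploitation of sparsity across chains of operators.

Automatic operator fusion eliminates the need for hand-written fused operators and significantly improves performance for complex or previously unseen chains of operations. However, existing fusion heuristics struggle to find good fusion plans for complex DAGs or for hybrid plans of local and distributed operations.

This paper introduces an optimization framework for systematically reasoning about fusion plans that considers materialization points in DAGs, sparsity exploitation, different fusion template types, and local as well as distributed operations. In detail, it contributes algorithms for

(1) candidate exploration of valid fusion plans,

(2) cost-based candidate selection, and

(3) code generation of local and distributed operations over dense, sparse, and compressed data.

Experiments in SystemML show that optimized fusion plans improve end-to-end performance by up to 21x compared to hand-written fused operators, with negligible optimization and code generation overhead.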

1. Introduction

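Large-scale machine learning (ML) aims to build predictive models from large data collections and commonly relies on data-parallel frameworks such as MapReduce and Spark for cost-effective parallelization on commodity hardware. Large-scale ML applications range from data-intensive, traditional classification, regression, and clustering use cases to compute-intensive matrix factorization and deep learning architectures. In this context, state-of-the-art ML systems allow data scientists to express their ML algorithms in linear algebra and statistical functions, and automatically compile efficient execution plans. This high-level specification simplifies the development of custom ML algorithms and allows execution plans to adapt to different deployments as well as to different data, hardware, and cluster characteristics.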

 

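Fusion Opportunities: The generated execution plans offer many opportunities where fused operators, i.e., composite operators for chains of basic operators, can improve performance. Figure 1 shows the major categories of fusion potential.

First, fusion can eliminate materialized intermediates.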
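To make this first category concrete, the following minimal Java sketch compares an unfused and a fused evaluation of a sum over an element-wise product of three vectors. The expression, the class name `FusedSumSketch`, and both method names are illustrative assumptions and are not taken from SystemML.

```java
final class FusedSumSketch {
    // Unfused: every element-wise multiply materializes a full intermediate array,
    // and the final aggregation scans the last intermediate once more.
    static double sumUnfused(double[] x, double[] y, double[] z) {
        double[] tmp1 = new double[x.length];
        for (int i = 0; i < x.length; i++) tmp1[i] = x[i] * y[i];
        double[] tmp2 = new double[x.length];
        for (int i = 0; i < x.length; i++) tmp2[i] = tmp1[i] * z[i];
        double sum = 0;
        for (double v : tmp2) sum += v;
        return sum;
    }

    // Fused: a single pass over the inputs, no intermediate arrays.
    static double sumFused(double[] x, double[] y, double[] z) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += x[i] * y[i] * z[i];
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3}, y = {4, 5, 6}, z = {1, 0, 2};
        System.out.println(sumUnfused(x, y, z) + " " + sumFused(x, y, z));
    }
}
```

Both methods return the same value; the fused variant simply avoids allocating and re-reading two arrays of the input size.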

 

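Existing Work on Operator Fusion: Given the ubiquitous opportunities and the high performance impact, operator fusion and code generation have received a lot of attention in the database and high-performance computing literature.

For example, SystemML uses hand-coded fused operators to eliminate intermediates and unnecessary scans, and to exploit sparsity across operations. Similarly, Cumulon and FAQ exploit sparsity via masked operators and semijoin reductions (worst-case optimal joins), respectively. However, such approaches require custom operators, are usually limited to fixed patterns of a few operators, and require considerable development effort for combinations of dense and sparse inputs. Automatic operator fusion addresses these issues through access-pattern-aware fusion and subsequent code generation.

However, existing work mostly ignores sparse and compressed inputs and, except for node-local optimizations in BTO or micro-optimizations such as predication and loop tiling in Tupleware, does not consider the optimization of operator fusion plans.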

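A Case for Optimizing Fusion Plans: As code generators become more sophisticated (e.g., by covering more operations), the lack of a principled approach for optimizing fusion plans becomes increasingly problematic. In this regard, the key challenges are overlapping access patterns of operators and the goal of sparsity exploitation, which together create a search space that requires automatic optimization: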

• Materialization points (e.g., for multiple consumers)

• Sparsity exploitation and ordering of sparse inputs

• Decisions on fusion patterns (e.g., template types), and

• Constraints (e.g., memory budget and blocksizes) and costs for local and/or distributed operations.

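For example, decisions on materialization points trade off redundant computation against materialization and require comparing an exponential number of plans. Baseline solutions are heuristics such as fuse-all or fuse-no-redundancy, but these heuristics struggle to find good solutions for complex DAGs or for hybrid plans of local and distributed operations.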

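Contributions: We introduce a cost-based optimization framework for operator fusion plans over DAGs of linear algebra operations and describe its integration into SystemML. We formulate the optimization problem in terms of three phases (candidate exploration, candidate selection, and code generation), for which we devise novel and efficient algorithms. Our detailed technical contributions are reflected in the structure of the paper: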

• System Architecture: We describe the integration into SystemML in Section 2. This overview includes the compiler integration, our optimization framework, code generation plans, and their runtime integration.

• Candidate Exploration: In Section 3, we introduce a novel bottom-up algorithm for the efficient exploration of valid partial fusion plans. We also discuss our memoization data structure and basic pruning rules.

• Candidate Selection: In Section 4, we then present strategies for selecting the optimal candidate plans. We formulate the problem, describe heuristics, and introduce our novel cost-based plan selection, including its search space, cost model, and enumeration algorithm.

• Experiments: In Section 5, we then report on extensive experiments in SystemML. These cover micro benchmarks for code generation, end-to-end performance in local and distributed environments, as well as comparisons with Julia, TensorFlow, and fusion heuristics.

 

2. System Architecture

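This section describes the architecture of the code generator (codegen) and its compiler integration into SystemML. The cost-based optimization framework extends the SPOOF framework, which relied on ad-hoc candidate exploration and heuristic fusion. We also provide an overview of code generation plans, the generated operators, and their runtime integration.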

2.1 Compiler Integration

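SystemML provides a high-level scripting language with R-like syntax, which includes linear algebra, element-wise, and statistical operations for implementing ML algorithms.

As shown in Figure 2, these scripts are parsed into a hierarchy of statement blocks, where blocks are delineated by control flow. For each block, we compile a DAG of high-level operators (HOPs). These DAGs are modified via static (size-independent) rewrites and inter-procedural analysis (IPA), which propagates matrix dimensions and sparsity from the inputs. Based on this size information, we apply dynamic (size-dependent) rewrites and compute memory estimates per operation. These estimates are in turn used to select local or distributed execution types and physical operators to form an execution plan. Similar to adaptive query processing, SystemML recompiles HOP DAGs during runtime (starting from dynamic rewrites) to adapt plans for initially unknown or changing sizes.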
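The role of memory estimates in choosing an execution type can be illustrated with a small Java sketch. It assumes a deliberately crude model (dense matrices at 8 bytes per cell, operator memory as the sum of inputs and output, a single memory budget); the class `ExecTypeSketch` and all of its members are hypothetical and do not reflect SystemML's actual estimator, which also accounts for sparsity.

```java
final class ExecTypeSketch {
    enum ExecType { LOCAL, DISTRIBUTED }

    // Assumed dense estimate: 8 bytes per double cell, ignoring object overhead.
    static long estimateDenseBytes(long rows, long cols) {
        return Math.multiplyExact(Math.multiplyExact(rows, cols), 8L);
    }

    // Assumed operator estimate: all inputs plus the output must fit in memory.
    static long estimateOperatorBytes(long[][] inputDims, long[] outputDims) {
        long total = estimateDenseBytes(outputDims[0], outputDims[1]);
        for (long[] dims : inputDims) {
            total = Math.addExact(total, estimateDenseBytes(dims[0], dims[1]));
        }
        return total;
    }

    // Run locally if the estimate fits the budget, otherwise fall back to distributed.
    static ExecType select(long operatorBytes, long budgetBytes) {
        return operatorBytes <= budgetBytes ? ExecType.LOCAL : ExecType.DISTRIBUTED;
    }

    public static void main(String[] args) {
        // Matrix-vector multiply: (1000 x 1000) times (1000 x 1) gives (1000 x 1).
        long bytes = estimateOperatorBytes(new long[][] {{1000, 1000}, {1000, 1}}, new long[] {1000, 1});
        System.out.println(bytes + " bytes -> " + select(bytes, 10_000_000L));
    }
}
```

Because sizes may be unknown at initial compilation, such a decision can only be made reliably once dimensions are known, which is why recompilation at runtime matters.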

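Codegen Compiler Integration: As shown in Figure 2, the code generator takes the HOP DAGs after dynamic rewrites as input and produces potentially modified HOP DAGs that can include both basic and fused operators. Fused operators are represented via a generic SpoofOp, which has a type and references the generated and compiled class. These operators are still valid HOPs. Therefore, the remaining compilation steps of memory estimates, operator selection (e.g., local/distributed), and runtime plan generation seamlessly apply to fused operators as well. The code generator can also be invoked during dynamic recompilation, which is important for many algorithms because the optimizer relies on known size information for costing and validity constraints. The actual compiler integration during initial compilation is slightly more involved: the code generator is called after runtime program generation, but with access to the HOP DAGs, in order to generate modified runtime instructions while retaining the original DAGs. This approach helps avoid incomplete fusion, which loses the semantics of operators and limits the fusion potential during dynamic recompilation.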

 

Codegen Architecture: At a high level, the codegen compiler comprises five well-defined compilation steps.

First, during candidate exploration (Section 3), we make a bottom-up pass over the HOP DAG to explore all valid partial fusion plans and store these plans in a memoization table, organized by HOPs. Second, during candidate selection (Section 4), we choose the optimal subset of fusion plans using a time-based cost model. Third, we construct code generation plans (CPlans, Section 2.2), which are a backend-independent basis for code generation, for all selected fusion plans. Fourth, we recursively expand templates for these given CPlans to generate Java source code for each fused operator, compile the classes, and load them into the JVM. By default, we use the fast Janino compiler [89] but also support the standard javac compiler. Generated fused operators are maintained in a plan cache, which identifies equivalent CPlans via hashing, to avoid redundant code generation and compilation for existing operators. Finally, we replace covered parts of the HOP DAG by the fused operators. These separate compilation steps are very valuable for debugging without compromising on fusion potential.

2.2 Code Generation Plans

Code generation plans (CPlans) [27] are a backend-independent representation of fused operators and allow for recursive code generation. We generate code via a depth-first template expansion to ensure valid ordering. Such plans consist of CNodes, which are either template or basic operation nodes. Template nodes represent generic fused operator skeletons that have a specific data binding and contain a DAG of basic operations that encodes the data flow.

Example Expressions: We illustrate CPlans for two typical expressions with high performance impact of fusion. The first expression is part of an inner-loop update rule of ALS-CG (alternating least squares via conjugate gradient)
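The depth-first expansion of basic operation nodes and the idea of a plan cache can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions: the `CNode` class below is a toy binary-expression node rather than SystemML's CNode, the cache is keyed by the generated source string (standing in for hashing a CPlan), and "compilation" is replaced by handing out an integer ID.

```java
import java.util.HashMap;
import java.util.Map;

final class CPlanSketch {
    /** Toy basic-operation node: a leaf (input name or literal) or a binary operation. */
    static final class CNode {
        final String op;    // "+" or "*" for operations, null for leaves
        final String name;  // leaf name, null for operations
        final CNode left, right;
        CNode(String name) { this.op = null; this.name = name; this.left = null; this.right = null; }
        CNode(String op, CNode left, CNode right) { this.op = op; this.name = null; this.left = left; this.right = right; }
    }

    /** Depth-first expansion: inputs are emitted before their consumer, shared nodes only once. */
    static String generate(CNode root) {
        Map<CNode, String> vars = new HashMap<>();
        StringBuilder sb = new StringBuilder();
        String out = expand(root, vars, sb);
        return sb.append("return ").append(out).append(";\n").toString();
    }

    private static String expand(CNode n, Map<CNode, String> vars, StringBuilder sb) {
        if (n.op == null) return n.name;
        String known = vars.get(n);
        if (known != null) return known;
        String l = expand(n.left, vars, sb);
        String r = expand(n.right, vars, sb);
        String var = "TMP" + vars.size();
        vars.put(n, var);
        sb.append("double ").append(var).append(" = ")
          .append(l).append(" ").append(n.op).append(" ").append(r).append(";\n");
        return var;
    }

    /** Toy plan cache: plans with identical generated source share one "compiled" operator ID. */
    static final class PlanCache {
        private final Map<String, Integer> cache = new HashMap<>();
        int compilations = 0;
        int getOrCompile(String source) {
            return cache.computeIfAbsent(source, s -> ++compilations);
        }
    }
}
```

Requesting the same plan twice from `PlanCache` increments `compilations` only once, which mirrors how a plan cache avoids redundant code generation and compilation.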

 

 

 

 

    public final class TMP10 extends SpoofCellwise {
        public TMP10() { super(CellType.NO_AGG, null, false); }
        // genexec is called by the skeleton once per value
        protected double genexec(double a, SideInput[] b,
                double[] scalars, ..., int rix, int cix) {
            double TMP5 = getValue(b[0], n, rix, cix);
            double TMP6 = a * 1.0E-6;
            double TMP7 = getValue(b[1], rix);
            double TMP8 = TMP6 * TMP7;
            double TMP9 = TMP5 + TMP8;
            return TMP9;
        }
    }

Finally, Figure 3(c) shows the row-wise CPlan of Expression (2). This fused operator requires only a single pass over X by exploiting temporal locality, and it avoids six large intermediates. The memory for row intermediates is managed via a preallocated ring buffer per thread (here of size 5).

    public final class TMP25 extends SpoofRowwise {
        public TMP25() { super(RowType.COL_AGG_B1_T, true, 5); }
        protected void genexecDense(double[] a, int ai,
                SideInput[] b, double[] c, ..., int len) {
            double[] TMP11 = getVector(b[1].vals(rix), ...);
            double[] TMP12 = vectMatMult(a, b[0].vals(rix), ...);
            double[] TMP13 = vectMult(TMP11, TMP12, 0, 0, ...);
            double TMP14 = vectSum(TMP13, 0, TMP13.length);
            double[] TMP15 = vectMult(TMP11, TMP14, 0, ...);
            double[] TMP16 = vectMinus(TMP13, TMP15, 0, 0, ...);
            vectOuterMultAdd(a, TMP16, c, ai, 0, 0, ...);
        }
        protected void genexecSparse(double[] avals, int[] aix,
                int ai, SideInput[] b, ..., int len) {...}
    }
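The generated classes above override only a `genexec` method, while a hand-coded skeleton owns the data access, as the Runtime Integration discussion explains. A minimal Java sketch of that split follows. It assumes dense two-dimensional arrays and a single side input; `CellwiseSkeletonSketch` and `MultiplySketch` are hypothetical names, and the real SpoofCellwise skeleton additionally handles sparse and compressed formats, multi-threading, and aggregation.

```java
abstract class CellwiseSkeletonSketch {
    private final boolean sparseSafe;

    protected CellwiseSkeletonSketch(boolean sparseSafe) { this.sparseSafe = sparseSafe; }

    /** Overridden by each generated operator; the skeleton calls it once per processed cell. */
    protected abstract double genexec(double a, double[][] side, int rix, int cix);

    /** Hand-coded data access: all cells, or only non-zero cells for sparse-safe operators. */
    final double[][] executeDense(double[][] in, double[][] side) {
        double[][] out = new double[in.length][];
        for (int r = 0; r < in.length; r++) {
            out[r] = new double[in[r].length];
            for (int c = 0; c < in[r].length; c++) {
                double a = in[r][c];
                if (sparseSafe && a == 0) continue; // output cell stays 0
                out[r][c] = genexec(a, side, r, c);
            }
        }
        return out;
    }
}

/** Stand-in for a generated operator: cell-wise multiply, which is sparse-safe (0 * v = 0). */
final class MultiplySketch extends CellwiseSkeletonSketch {
    int calls = 0; // counts genexec invocations, for illustration only

    MultiplySketch() { super(true); }

    @Override
    protected double genexec(double a, double[][] side, int rix, int cix) {
        calls++;
        return a * side[rix][cix];
    }
}
```

With an input that has two non-zero cells out of four, `genexec` is invoked twice instead of four times, which shows how the skeleton, not the generated code, decides what to iterate over.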

 

 

 

 

 

 

Runtime Integration: Templates refer to generic skeletons of fused operators, which are inspired by algorithmic skeletons [21]. Figure 4 exemplifies this runtime integration using the Cell template. Unlike existing work [9, 22, 48], we made the conscious design decision not to generate the data access into the fused operators. Instead, the hand-coded skeleton implements the data access of dense, sparse, or compressed [28] matrices (over cells or over non-zero values, depending on its sparse-safeness) and calls an abstract (virtual) genexec method for each value. Generated operators inherit this skeleton and only override the specific genexec, which yields very lean yet efficient operators. The skeleton also handles multi-threading, cache blocking, memory management, and pseudo-sparse-safe aggregations. Sharing common skeletons and vector primitives among fused operators can also reduce the instruction footprint and thus L1 instruction cache misses, which are a known bottleneck in OLTP [85] and scale-out workloads [30].

3. Candidate Exploration

The exploration of candidate fusion plans aims to identify all valid partial fusion plans to provide a common input for different plan selection policies and to simplify optimization. However, the exponential search space prohibits the enumeration of all possible plans. Instead, we enumerate partial fusion plans per operator, which represent local fusion decisions. We describe (1) the representation of partial fusion plans in our central memoization table, and (2) an efficient algorithm for populating this memo table in a single pass over the HOP DAG, including pruning techniques.

3.1 Memoization Table

Our memoization (memo) table consists of a set of groups, where each group represents the output of an operator in the HOP DAG, i.e., a logical subexpression. Each group is identified by the operator ID, has access to its operator meta data, and contains a set of valid partial fusion plans for this operator. A partial fusion plan is called a memo table entry and can reference other groups to represent fusion decisions. This structure is similar to groups and group expressions in the Cascades Optimization Framework [16, 34, 83], but we use it merely as a compact representation of fusion plans, which only includes operators that are amenable to fusion.

Memo Table Entries: A memo table entry is a tuple (type, {i1, .., ik}, closed), consisting of a template type (as introduced in Table 1), a list of input references, and a close status. The list of inputs corresponds to HOP inputs (i.e., data dependencies) by position, and each input is either a group reference or -1, which indicate fusion or a materialized intermediate, respectively. A reference from an entry to a group implies that the group contains at least one compatible fusion plan. Finally, the close status can be open valid, open invalid (i.e., an invalid entry point), closed valid, or closed invalid.

Example: We use Expression (2) from Section 2.2 to illustrate the structure of our memo table. Figure 5 shows the HOP DAG and the related memo table after candidate exploration and pruning (described in Section 3.2). All eight operators are represented by groups in the memo table. Group 11 refers to the final matrix multiplication (binary aggregate ba(+*)) and consists of three memo table entries of type Row. These entries encode fusion alternatives: (1) fuse right R(-1,9), (2) fuse left R(10,-1), and (3) fuse both R(10,9). Instead of encoding all alternative subplans along inputs, we only reference the input groups. This memo table then allows for simple costing and fusion by traversing the HOP DAG top down, probing for fusion plans, traversing group references, and determining the input HOPs, from where this process repeats until we reach the leaf HOPs.

3.2 Open-Fuse-Merge-Close Exploration

Given a HOP DAG and an empty memo table, we aim to efficiently discover all valid partial fusion plans.
We introduce a bottom-up algorithm that is template-oblivious and populates the memo table in a single pass over the DAG.

OFMC Template Abstraction: As the basis of our candidate exploration algorithm, we define the open-fuse-merge-close (OFMC) template abstraction:

• open(Hop h): Indicates if a new fused operator of this template can be started at HOP h, covering its operation and reading materialized inputs. For example, the condition of an Outer template is an outer-product-like matrix multiplication with size constraints.

• fuse(Hop h, Hop in): Indicates if an open fused operator (of this template) at the input HOP in can be expanded to its consumer HOP h. For example, a Cell template can fuse valid unary, binary, or ternary operations, valid aggregations, and inner products.

• merge(Hop h, Hop in): Indicates if an open fused operator (of this template) at the consumer HOP h can be expanded to its input HOP in, i.e., if it can merge with fused operators at the input. An example is the merge of Cell templates into Row templates.

• close(Hop h): Indicates the close status of the template after the HOP h and its validity. For example, any aggregation closes a Cell template (as valid or invalid), whereas only column-wise or full aggregations close a Row template. Outer templates are also validated for the existence of sparsity-exploiting operators.

The major benefit of this OFMC abstraction is the separation of template-specific conditions from the HOP DAG traversal and the population of the memo table.

The OFMC Algorithm: Based on the memo table and the OFMC abstraction, we introduce the OFMC exploration algorithm (Algorithm 1). This algorithm is called recursively, in a depth-first manner, to populate the memo table bottom-up. First, we check for already processed operators, indicated by existing groups or marked operators (lines 1-3), to avoid redundant exploration if nodes are reachable over multiple paths. Second, we recursively explore all input operators (lines 4-6) because these input data dependencies constitute potential fusion references. Third, we explore all templates for valid opening conditions at the current operator (lines 7-10). In case of a valid opening condition, we add this memo entry and enumerate merge plans with createPlans. This merging is important to cover scenarios such as X^T (y ⊙ z), where the matrix-vector multiplication with X opens a Row template, which can also merge Cell templates over y ⊙ z. Fourth, we fuse and merge existing partial fusion plans from the operator inputs to the current operator (lines 11-15). This step entails iterating over all distinct template types of all inputs and probing pair-wise fusion conditions. In case of a valid condition, we again call createPlans, which constructs a memo table entry for the fused operator and enumerates all local plan combinations for inputs that satisfy the pair-wise merge condition. This entire plan set is then added to the group of the current operator. Fifth, we check all group entries for closing conditions (lines 16-20). Entries that satisfy the closing condition of their templates are either removed (invalid) or marked as closed (valid), while all other entries remain open.

Pruning Techniques: Finally, we prune duplicates and valid closed entries without group references (line 22). For example, the group 7 ua(R+) in Figure 5 does not contain C(-1) because a rowSums closes the Cell template, which would cover only a single operator. In addition, there are advanced techniques that exploit characteristics of candidate selection policies. For instance, a policy that only considers materialization points with multiple consumers allows pruning dominated plans. A memo entry is dominated if all its references point to operators that are consumed once, and there is another entry (of the same type) whose reference list is a strict superset. For example, in Figure 5, R(10,9) dominates R(10,-1), but R(6,8) does not dominate R(-1,8) because group 6 has multiple consumers. However, we prune dominated plans only for selection heuristics.

Algorithm Analysis: Overall, our algorithm has linear time and space complexity in the number of operators. Memoization ensures that we visit each operator exactly once, and the OFMC conditions apply only locally to an operator and its inputs. These conditions still have access to the HOPs and thus the entire DAG, but this flexibility is only exploited in rare exceptions such as recognizing t(cumsum(t(X))) as a row operation.
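The memo table and the bottom-up exploration can be approximated by a heavily simplified Java sketch. It is not Algorithm 1: it assumes a single Cell-like toy template, collapses the fuse and merge conditions into one pair-wise check, and omits invalid states and dominance pruning. All names (`OfmcSketch`, `Hop`, `MemoEntry`, `Kind`, `exploreDag`) and the concrete open/fuse/close conditions are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OfmcSketch {
    enum Kind { INPUT, CELL_OP, ROW_SUMS }

    static final class Hop {
        final int id; final Kind kind; final Hop[] inputs;
        Hop(int id, Kind kind, Hop... inputs) { this.id = id; this.kind = kind; this.inputs = inputs; }
    }

    /** Memo entry of the toy Cell template: refs[i] is a group ID (fuse) or -1 (materialize). */
    static final class MemoEntry {
        final int[] refs; boolean closed;
        MemoEntry(int[] refs) { this.refs = refs; }
        boolean hasRef() { for (int r : refs) if (r != -1) return true; return false; }
        @Override public String toString() {
            StringBuilder sb = new StringBuilder("C(");
            for (int i = 0; i < refs.length; i++) sb.append(i > 0 ? "," : "").append(refs[i]);
            return sb.append(")").toString();
        }
    }

    // Assumed toy conditions: open at any non-input, fuse over cell operations, close at rowSums.
    static boolean open(Hop h) { return h.kind != Kind.INPUT; }
    static boolean fuse(Hop h, Hop in) { return in.kind == Kind.CELL_OP; }
    static boolean close(Hop h) { return h.kind == Kind.ROW_SUMS; }

    static boolean fusible(Hop h, Hop in, Map<Integer, List<MemoEntry>> memo) {
        if (!fuse(h, in)) return false;
        for (MemoEntry e : memo.get(in.id)) if (!e.closed) return true; // needs an open plan
        return false;
    }

    static void explore(Hop h, Map<Integer, List<MemoEntry>> memo) {
        if (memo.containsKey(h.id)) return;              // already processed (memoization)
        for (Hop in : h.inputs) explore(in, memo);       // explore inputs first (bottom-up)
        List<MemoEntry> group = new ArrayList<>();
        memo.put(h.id, group);
        if (!open(h)) return;
        int n = h.inputs.length;
        for (int mask = 0; mask < (1 << n); mask++) {    // enumerate local fuse/materialize choices
            int[] refs = new int[n];
            boolean valid = true;
            for (int i = 0; i < n && valid; i++) {
                boolean fused = (mask & (1 << i)) != 0;
                refs[i] = fused ? h.inputs[i].id : -1;
                if (fused) valid = fusible(h, h.inputs[i], memo);
            }
            if (valid) group.add(new MemoEntry(refs));
        }
        if (close(h)) {
            for (MemoEntry e : group) e.closed = true;
            group.removeIf(e -> !e.hasRef());            // prune closed entries without references
        }
    }

    static Map<Integer, List<MemoEntry>> exploreDag(Hop root) {
        Map<Integer, List<MemoEntry>> memo = new LinkedHashMap<>();
        explore(root, memo);
        return memo;
    }
}
```

For a toy DAG rowSums((X * Y) + X), the group of the rowSums operator ends up with a single entry that references its input group, while the entry covering only the rowSums itself is pruned, analogous to the C(-1) example for group 7 in the text.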

 

 

4. Candidate Selection

Given a memo table of partial fusion plans, candidate selection aims to choose the optimal subset of non-conflicting partial fusion plans. We describe the problem and cost model, and introduce our cost-based enumeration algorithm MPSkipEnum. The basic ideas are to (1) partition the set of partial fusion plans into independent groups, (2) restrict the search per group to interesting materialization points, (3) linearize the resulting exponential search space, and (4) enumerate and cost plans while skipping areas of the search space that can be pruned based on cost or structure.

4.1 Problem Formulation and Heuristics

Overall, we aim to find the cost-optimal set of fusion plans with the optimization scope of a single HOP DAG at a time and hybrid runtime plans that might include single-node and distributed operations. We define this problem as follows:

 

 

 

Reference:

https://arxiv.org/pdf/1801.00829.pdf

 

posted @ 2021-01-27 08:27  吴建明wujianming