(Simulation) (SIGGRAPH 2022) Cloth Simulation Paper Breakdown: MAS Preconditioner

First, the usual disclaimer:

Reposting without permission is prohibited (to keep this from being reposted around until it ends up somewhere like 蛮牛).

Bilibili: Heskey0


This article presents one of the latest cloth simulation methods from academia. Its high degree of parallelism lets it reach 30 FPS on a GPU, and with the help of distributed computing it can fully achieve real-time, high-accuracy cloth simulation.

1. Abstract

Our preconditioner naturally adopts multilevel and domain decomposition concepts. We advocate the use of small, non-overlapping domains that can well explore the parallel computing power on GPU. We implement our preconditioner for both GPUs and CPUs.

2. Background

2.1 Preconditioning

\[q=\arg\min_qF(q) \]

Using the Newton-Raphson method:

  • solve \(Ax=b\), in which \(A=\partial^2F/\partial q^2\) and \(b=-(\partial F/\partial q)^T\)
    • this system is solved with PCG, which requires constructing a preconditioner \(M\)
  • then \(x\) provides the update \(q\leftarrow q+x\)
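
To make the role of \(M\) concrete, here is a minimal CPU sketch of preconditioned CG. The applyA / applyMinv callbacks are hypothetical stand-ins for the system-matrix product and the preconditioner (e.g. the MAS preconditioner described below); this is not the paper's GPU code.

// Minimal preconditioned CG sketch (dense std::vector storage, user-supplied
// applyA / applyMinv callbacks). Solves A x = b; applyMinv(r) returns z = M^{-1} r.
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

Vec pcg(const std::function<Vec(const Vec&)>& applyA,
        const std::function<Vec(const Vec&)>& applyMinv,
        const Vec& b, int maxIter, double tol) {
    Vec x(b.size(), 0.0);
    Vec r = b;                      // r = b - A*x with x = 0
    Vec z = applyMinv(r);           // preconditioning step
    Vec p = z;
    double rz = dot(r, z);
    for (int k = 0; k < maxIter && std::sqrt(dot(r, r)) > tol; ++k) {
        Vec Ap = applyA(p);
        double alpha = rz / dot(p, Ap);
        for (size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        z = applyMinv(r);           // this is where M^{-1} (e.g. MAS) enters each iteration
        double rzNew = dot(r, z);
        double beta = rzNew / rz;
        rz = rzNew;
        for (size_t i = 0; i < p.size(); ++i) p[i] = z[i] + beta * p[i];
    }
    return x;
}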

2.2 MAS

Let:

  • \(\{\Omega_d\}\) be a set of domains covering all of the nodes \(\Omega\), such that \(\Omega=\cup\Omega_d\).
  • \(S_d\) be the selection matrix (restriction matrix) of domain \(d\) that pulls the nodes in \(\Omega_d\) out of the \(N\) nodes in \(\Omega\).

Define the additive Schwarz (AS) preconditioner as:

\[M^{-1}_{(0)}=\sum_dS^T_dM^{-1}_{d,(0)}S_d \]

When the domains do not overlap and \(M_{d,(0)}=S_dAS^T_d\) (the diagonal block of \(A\)), the AS preconditioner is equivalent to the block Jacobi method.
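
As a minimal illustration of the non-overlapping case, the sketch below applies such a block Jacobi preconditioner: each domain owns a contiguous slice of unknowns, its block \(M_{d,(0)}=S_dAS^T_d\) has been inverted in advance, and every domain can be processed independently (which is what makes it GPU-friendly). The layout and names are hypothetical, not the paper's implementation.

// Block-Jacobi view of non-overlapping additive Schwarz (sketch):
// each domain d owns the slice [begin, begin + size) of the vector, and inv
// stores the precomputed inverse of its diagonal block (S_d A S_d^T)^{-1}.
#include <vector>

struct DomainBlockInverse {
    int begin = 0, size = 0;
    std::vector<double> inv;             // row-major size x size block inverse
};

// z = M^{-1} r, applied domain by domain; each domain is independent.
void applyBlockJacobi(const std::vector<DomainBlockInverse>& domains,
                      const std::vector<double>& r, std::vector<double>& z) {
    z.assign(r.size(), 0.0);
    for (const auto& d : domains) {
        for (int i = 0; i < d.size; ++i) {
            double s = 0.0;
            for (int j = 0; j < d.size; ++j)
                s += d.inv[i * d.size + j] * r[d.begin + j];
            z[d.begin + i] = s;
        }
    }
}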

Two-level AS preconditioner (by analogy with a two-grid cycle):

\[M^{-1}_{TAS}=M^{-1}_{(0)}+C^T_{(1)}M^{-1}_{(1)}C_{(1)} \]

where

  • \(C_{(1)}\in \mathbb{R}^{3N_{(1)}\times3N}\) is a coarsener matrix

  • \(M_{(1)}\in \mathbb{R}^{3N_{(1)}\times3N_{(1)}}\) is a single-level additive Schwarz preconditioner built for the coarsened system matrix \(A_{(1)}=C_{(1)}AC^T_{(1)}\).

  • \(C^T_{(1)}M^{-1}_{(1)}C_{(1)}\) is the coarse space correction

My own understanding:

  • \(M^{-1}_{(0)}\) is the inverse of the fine-grid preconditioner
  • \(C^T_{(1)}M^{-1}_{(1)}C_{(1)}\) is the inverse of the coarse-grid preconditioner, transferred back to the fine grid through \(C_{(1)}\)
  • \(M^{-1}_{(0)}\) and \(M^{-1}_{(1)}\) come from AS preconditioners built on grids of different resolutions

MAS preconditioner (by analogy with V-cycles, F-cycles, and W-cycles):

\[M^{-1}_{MAS}=M^{-1}_{(0)}+\sum^L_{l=1}C^T_{(l)}M^{-1}_{(l)}C_{(l)} \]

Each coarsener matrix \(C_{(l)}\) maps from level 0 directly to level \(l\), rather than chaining from one level to the next.

3. Overview

3.1 Algorithm

the algorithmic pipeline contains three stages:

  • Multilevel domain construction.

    • Based on Morton code sorting (the same kind of sorting used by spatial hashing in collision detection), we present a supernode splitting method and a skipping approach, for fast and simple domain partitioning and coarse space construction.

    • The algorithm sorts all of the nodes by their Morton codes.

      // SePreDefine.h
      #define SE_ALIGN(n)   __declspec(align(n))
      // SeMorton.h -- a 64-bit Morton code built from a node position
      class SE_ALIGN(8) SeMorton64
      {
          void Encode(float x, float y, float z) {}   // interleaves the bits of the cell indices
      };
      // SeSchwarzPreconditioner.h
      std::vector<SeMorton64> m_mortonCode;           // one Morton code per node
      
      /* Core */
      // SeSchwarzPreconditioner.cpp
      void SeSchwarzPreconditioner::SpaceSort() {}    // sorts nodes by their Morton codes
      
  • Matrix precomputation.

    • We propose a one-way elimination algorithm and a compact matrix format to calculate the submatrix inverse of each domain with low computational and storage costs. We also develop a selective update scheme to address slight matrix modifications.
    • The algorithm calculates the coarsened system matrix \(A_{(l)}\) at each level and directly computes the sub-matrix inverse of each domain: \(M^{-1}_{d,(l)}=(S_dA_{(l)}S^T_d)^{-1}\).
  • Runtime preconditioning.

    • Given the precomputed sub-matrix inverses, we perform runtime preconditioning by simple matrix-vector multiplication. To further reduce this cost, we invent a symmetric-matrix-vector multiplication method with balanced workload and zero write conflict.
    • executes three steps in every call:
      • distributing the input \(x\) to every level
      • performing the matrix-vector product in each domain
      • gathering the sub-vectors into the output \(y\).

Functions provided by the preconditioner code:

public:

	//==== call before time integration once a frame
	void AllocatePrecoditioner(int numVerts, int numEdges, int numFaces){
        ComputeTotalAABB();
        SpaceSort();			//Morton code sorting
        ComputeInverseMapper();
        MapHessianTable();
    }

	//==== call before PCG iteration loop
	void PreparePreconditioner(const SeMatrix3f* diagonal, const SeMatrix3f* csrOffDiagonals, const int* csrRanges, const EfSet* efSets, const EeSet* eeSets, const VfSet* vfSets, unsigned int* efCounts, unsigned int* eeCounts, unsigned int* vfCounts){
        PrepareCollisionStencils();
        ReorderRealtime();
        PrepareCollisionHessian();
        PrepareHessian();
        LDLtInverse512();		//matrix precomputation
    }

	//==== call during PCG iterations
	void Preconditioning(SeVec3fSimd* z, const SeVec3fSimd* residual, int dim){
        BuildResidualHierarchy(residual);
        SchwarzLocalXSym();
        CollectFinalZ(z);
    }

3.2 Cloth and Deformable Body Simulation

The paper's approach:

  • Cloth simulation

    • It uses the co-rotational linear model for planar elasticity and the quadratic bending model for bending elasticity
  • Deformable body simulation

    • It supports multiple elastic models, including projective dynamics [Bouaziz et al. 2014], the co-rotational FEM model, and hyperelastic models
  • Collision

    • spatial hashing [2018]
  • Friction

    • both the impulse-based method and the adhesive method (the latter of which is particularly useful for static frictions)

4. Multilevel Domain Construction

In this stage, the algorithm is responsible for two tasks:

  • partitioning nodes into domains, i.e., setting up the selection matrices \(S_d\)
  • constructing the coarse space, i.e., creating the coarseners \(C_{(l)}\).

We formulate this stage by symbolically analyzing the system matrix.

4.1 Domain Partitioning

  • Computing the Morton codes: Specifically, we calculate the axis-aligned bounding box of the nodes, and divide the box into \(2^{20} \times 2^{20} \times 2^{20}\) cells. We then obtain the 60-bit Morton code of each node by simply interleaving the bits of its cell indices.

  • Benefit of sorting: Intuitively, as spatial locality improves, the likelihood of a connection between two distant nodes in the sorted array drops (e.g., grid point 1 ends up adjacent to point 4 in the sorted order, not to point 40). As a result, off-diagonal blocks should be closer to the diagonal of the system matrix, making the preconditioner a better approximation to the matrix.

  • Radix sort: In our implementation, we use radix sort for sorting. Because of its high performance on a GPU, our approach can work at runtime based on node positions in the current state.
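
As a sketch of the 60-bit Morton encoding under the stated \(2^{20}\) grid: the bit-dilation helper below is the common pattern for 3D Morton codes, not necessarily what SeMorton64::Encode does internally, and positions are assumed to be pre-normalized to \([0,1)\) by the scene AABB.

// 60-bit Morton code from three 20-bit cell indices (sketch).
#include <cstdint>

// Spread the low 20 bits of v so there are two zero bits between payload bits.
static std::uint64_t expandBits20(std::uint64_t v) {
    v &= 0xFFFFFull;                               // keep 20 bits
    v = (v | (v << 32)) & 0x1F00000000FFFFull;
    v = (v | (v << 16)) & 0x1F0000FF0000FFull;
    v = (v | (v << 8))  & 0x100F00F00F00F00Full;
    v = (v | (v << 4))  & 0x10C30C30C30C30C3ull;
    v = (v | (v << 2))  & 0x1249249249249249ull;
    return v;
}

// Quantize a normalized position into the 2^20 x 2^20 x 2^20 grid and
// interleave the bits of the three cell indices.
static std::uint64_t morton60(float x, float y, float z) {
    auto quantize = [](float t) {
        if (t < 0.f) t = 0.f;
        if (t > 1.f) t = 1.f;
        return static_cast<std::uint64_t>(t * 1048575.f);   // 2^20 - 1
    };
    return expandBits20(quantize(x))
         | (expandBits20(quantize(y)) << 1)
         | (expandBits20(quantize(z)) << 2);
}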

4.2 Coarse Space Construction

define:

  • \(map_{l\rightarrow l+1}[i]\): a coarsening map that converts the index of every supernode \(i\) at level \(l\) to the index of its parent supernode at level \(l + 1\)

  • Then we collapse \(\{map_{l\rightarrow l+1}\}\) to compute \(map_{(l)}\): the coarsening map from level \(0\) directly to every level \(l\).

Using these maps, we can easily coarsen a vector by atomic additions and prolongate a coarsened vector by retrieving multiple copies of data on a GPU. We can also easily coarsen the system matrix \(A\) to \(A_{(𝑙)}\)
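
A serial sketch of how the collapsed map \(map_{(l)}\) realizes \(C_{(l)}\) and \(C^T_{(l)}\): coarsening accumulates every fine entry into its supernode (an atomicAdd on the GPU), and prolongation reads the supernode value back for every fine node. Per-node values are simplified to scalars here, and the names are hypothetical.

#include <vector>

// map[i] is the supernode index of fine node i at level l (the collapsed map_(l)).

// Coarsening (C_(l) x): accumulate fine entries into their supernodes.
void coarsen(const std::vector<int>& map,
             const std::vector<double>& fine, std::vector<double>& coarse) {
    for (size_t i = 0; i < fine.size(); ++i)
        coarse[map[i]] += fine[i];                 // atomicAdd in a GPU kernel
}

// Prolongation (C_(l)^T z): every fine node copies back its supernode's value.
void prolongate(const std::vector<int>& map,
                const std::vector<double>& coarse, std::vector<double>& fine) {
    for (size_t i = 0; i < fine.size(); ++i)
        fine[i] = coarse[map[i]];
}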

By default, we prefer the supernode size \(S\) to be equal to the domain size \(M\), so that a domain naturally becomes a supernode at the coarser level.

However, this can cause false coupling artifacts:

  • Two independent pieces of cloth are solved in a single linear system.

  • When we use a uniform supernode size \((S = 4)\) to construct the coarse space, two disjoint nodes may be grouped into the same supernode.

  • The motion of one piece of cloth then incorrectly affects the other.


Our solution instead is to use node-node connections. Specifically, two nodes at level 0 are connected if they are linked by an off-diagonal block in the system matrix, due to topological proximity or contact. According to these connections, we check every domain-sized supernode candidate at level 1 and split it into multiple supernodes if its nodes are not actually connected. We then construct the coarsening map at level 1. After that, we coarsen the node-node connections to level 1 and repeat the process to construct more maps at coarser levels hierarchically.
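
One way to realize this splitting for a single domain-sized candidate supernode is a small connected-components pass over the node-node connections restricted to its (at most 32) nodes. The union-find sketch below is one possible implementation under that assumption, not necessarily the paper's.

// Split one candidate supernode into connected components (sketch).
// localEdges holds node-node connections (off-diagonal blocks or contacts)
// with both endpoints given as local indices inside the candidate.
#include <numeric>
#include <utility>
#include <vector>

std::vector<int> splitCandidate(int numLocalNodes,
                                const std::vector<std::pair<int, int>>& localEdges) {
    std::vector<int> parent(numLocalNodes);
    std::iota(parent.begin(), parent.end(), 0);
    auto find = [&](int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    };
    for (auto [a, b] : localEdges) parent[find(a)] = find(b);   // union connected nodes

    std::vector<int> compId(numLocalNodes, -1);                 // component id per local node
    int numComponents = 0;
    for (int i = 0; i < numLocalNodes; ++i) {
        int r = find(i);
        if (compId[r] < 0) compId[r] = numComponents++;         // new supernode
        compId[i] = compId[r];
    }
    return compId;                                              // one supernode per component
}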

5. Matrix Precomputation

5.1 The Choice of the Domain Size \(M\)

  • When \(M\) is large: In the computer graphics and computational mathematics literature, most domain decomposition methods prefer using large domains over small domains. The use of large domains reduces the error associated with inter-domain connections, at the cost of more difficulty in solving each sub-problem. (The AS preconditioner achieves a faster convergence rate as the domain size \(M\) grows.)

  • When \(M\) is small: However, if there are many small domains, we can exploit their parallelization on a GPU and even achieve \(M^{-1}_{d,(l)} = A^{-1}_{d,(l)}\) by computing each sub-matrix inverse directly. Meanwhile, we can also effectively lessen the inter-domain connection issue by coarse space corrections. (In fact, if a MAS preconditioner is able to solve its coarse-level problem exactly, it converges even faster as \(M\) gets smaller.)

In this work, we set \(M = 32\). This number provides the right sub-matrix size for our algorithm (in the next subsection) to utilize the shared memory in the computation of the sub-matrix inverse.

5.2 Per-Domain Sub-Matrix Inverse Calculation

Computing a matrix inverse: a variant of Gaussian elimination, called Gauss-Jordan elimination, can be used to find the inverse of a matrix (see: Gaussian elimination - Wikipedia).

Algorithm: Matrix precomputation

  • Input: System matrix \(A\) and coarsening maps \(\{map_{(l)}\}\)
  • Result: Sub-matrix inverses \(\{M^{-1}_{d,(l)}\}\)

Start:

  • \(\{A_{d,(l)}\}\leftarrow 0\)
  • foreach node \(i\) and node \(j\) and level \(l\) do
    • if \(\lfloor map_{(l)}[i]/32\rfloor=\lfloor map_{(l)}[j]/32\rfloor\) then
      • \(d\leftarrow\lfloor map_{(l)}[i]/32\rfloor\)
      • \(i^\prime\leftarrow map_{(l)}[i]-32d\)
      • \(j^\prime\leftarrow map_{(l)}[j]-32d\)
      • \(A_{d,(l)}[i^\prime,j^\prime]\leftarrow A_{d,(l)}[i^\prime,j^\prime]+A[i,j]\)
    • end
  • end
  • foreach domain \(d\) and level \(l\) do
    • \(\{L^{-1}_{d,(l)},D_{d,(l)}\}\leftarrow Gauss\_Jordan\_Elimination(A_{d,(l)})\)
  • end
  • foreach domain \(d\) and level \(l\) do
    • \(M^{-1}_{d,(l)}\leftarrow L^{-T}_{d,(l)}D^{-1}_{d,(l)}L^{-1}_{d,(l)}\)
  • end
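
As a serial sketch of step 1 of this algorithm, the code below assembles every domain's sub-matrix from the \(3\times3\) blocks of the (coarsened) system matrix. The sparse block container and the Mat3 type are hypothetical stand-ins, and Ad is assumed to be pre-sized to 32*32 blocks per domain and zero-initialized.

#include <array>
#include <vector>

using Mat3 = std::array<std::array<float, 3>, 3>;    // one 3x3 node-node block

struct Block { int i, j; Mat3 value; };               // sparse block A[i,j]

// map holds map_(l); domains are consecutive runs of 32 supernode indices.
// Ad[d] stores the 32x32 blocks (96x96 scalars) of A_{d,(l)} in row-major order.
void assembleDomains(const std::vector<Block>& blocksOfA,
                     const std::vector<int>& map,
                     std::vector<std::vector<Mat3>>& Ad) {
    for (const Block& blk : blocksOfA) {
        int si = map[blk.i], sj = map[blk.j];
        if (si / 32 != sj / 32) continue;             // nodes map to different domains
        int d = si / 32, li = si - 32 * d, lj = sj - 32 * d;
        Mat3& dst = Ad[d][li * 32 + lj];
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                dst[r][c] += blk.value[r][c];         // A_{d,(l)}[i',j'] += A[i,j]
    }
}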

Gauss-Jordan elimination reduces the matrix to an echelon form \([U_{d,(l)}|L^{-1}_{d,(l)}]\), such that \(L_{d,(l)}U_{d,(l)}=A_{d,(l)}\).

  • Gauss-Jordan elimination can also compute \(A^{-1}\) directly, by row-reducing \([A|I]\rightarrow[I|A^{-1}]\).

The algorithm above consists of three steps:

  1. we check every \(3\times3\) system matrix block \(A[i,j]\) and add it into the \(96\times96\) submatrix \(A_{d,(l)}\), if node \(i\) and node \(j\) are mapped to the same domain \(d\) at level \(l\).
  2. apply Gauss-Jordan Elimination to row-reduce the \(96\times192\) matrix \([A_{d,(l)}|I]\) into a row echelon form \([U_{d,(l)}|L^{-1}_{d,(l)}]\), such that \(L_{d,(l)}U_{d,(l)}=A_{d,(l)}\)
    1. To decrease the memory footprint, we apply elimination to the \(96\times96\) matrix \(A_{d,(l)}\) directly, and overwrite it by \(U_{d,(l)}\) and \(L^{-1}_{d,(l)}\). Since \(A_{d,(l)}\) is symmetric, \(U_{d,(l)}=D_{d,(l)}L^T_{d,(l)}\) and \(D_{d,(l)}\) is the diagonal of \(U_{d,(l)}\).
  3. calculate \(L^{-T}_{d,(l)}D^{-1}_{d,(l)}L^{-1}_{d,(l)}\) by visiting every two columns \(i_0\) and \(i_1\) in parallel.

Thanks to this format, we can perform the third step completely in the GPU shared memory. The final output is a per-domain submatrix inverse \(M^{-1}_{d,(l)} = A^{-1}_{d,(l)}\), stored back to the global memory in an alternative compact format to be discussed in Subsection 6.1.

The reason we do not use Gauss-Jordan elimination to obtain \(A^{-1}_{d,(l)}\) directly is that doing so requires two row-reductions.

  • Every row-reduction involves \(32\sum^{31}_{i=1} i \approx \frac12 32^3\) node-node matrix multiplications, and the rows must be visited sequentially.
  • In comparison, the \(L^{-T}_{d,(l)}D^{-1}_{d,(l)}L^{-1}_{d,(l)}\) product uses \(\sum^{32}_{i_0=1} \sum^{i_0}_{i_1=1}i_1 \approx \frac16 32^3\) node-node multiplications, and it can be well parallelized.
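
Plugging in the numbers (exact sums versus the stated approximations):

\[32\sum^{31}_{i=1}i=32\cdot\frac{31\cdot32}{2}=15872\approx\frac12 32^3,\qquad \sum^{32}_{i_0=1}\sum^{i_0}_{i_1=1}i_1=\sum^{32}_{i_0=1}\frac{i_0(i_0+1)}{2}=5984\approx\frac16 32^3 \]

So the \(L^{-T}_{d,(l)}D^{-1}_{d,(l)}L^{-1}_{d,(l)}\) product costs roughly a third of one row-reduction, and about a fifth of the two row-reductions a direct inverse would need, in addition to being easier to parallelize.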

5.3 Selective Sub-Matrix Update

Since the evaluation of the tangent stiffness matrix can be expensive, researchers have also developed simulation techniques that temporarily or permanently skip stiffness matrix updates, such as quasi-Newton methods [2013] and projective dynamics [2014]. (Even so, the system matrix as a whole still needs to be updated, e.g. due to changing contacts.)

We make our preconditioner suitable for these methods by developing a selective matrix inverse update function. To do so, we use the changed contact pairs to detect the domains with changes and recalculate the matrix inverses of those domains only.

In our experiments, when the stiffness matrix stays unchanged, this practice reduces the cost of matrix precomputation to under five percent of its original cost.
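
A sketch of the selective update idea: mark only the domains touched by contact pairs that changed this step, and rerun the per-domain inverse computation for those. All names below are hypothetical, and recomputeDomainInverse stands in for the Gauss-Jordan plus \(L^{-T}D^{-1}L^{-1}\) procedure above.

#include <functional>
#include <unordered_set>
#include <vector>

// changedNodes: fine nodes whose contact state changed since the last solve.
// map: the collapsed map_(l) for one level; domain index = supernode index / 32.
void selectiveUpdate(const std::vector<int>& changedNodes,
                     const std::vector<int>& map,
                     const std::function<void(int)>& recomputeDomainInverse) {
    std::unordered_set<int> dirtyDomains;
    for (int node : changedNodes)
        dirtyDomains.insert(map[node] / 32);          // collect affected domains
    for (int d : dirtyDomains)
        recomputeDomainInverse(d);                    // redo only these sub-matrix inverses
}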

6. Runtime Preconditioning

Our runtime preconditioning process handles all of the domains at all of the levels in three steps:

  1. coarsens the input vector into different copies at all of the levels
  2. (the most complex and expensive step) performs preconditioning inside each domain
  3. sums up the results at all levels into a joint output
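
Reusing the coarsen, prolongate, and applyBlockJacobi helpers from the earlier sketches (one scalar per node, one list of precomputed block inverses per level), the three steps amount to the following serial outline. The names are hypothetical and this is not the paper's kernel structure.

#include <vector>

// z = M^{-1}_MAS x: fine-level term plus one coarse correction per level.
// maps[l-1] is map_(l); levelSizes[l] is the number of supernodes at level l.
void masApply(const std::vector<std::vector<int>>& maps,
              const std::vector<std::vector<DomainBlockInverse>>& levels,
              const std::vector<int>& levelSizes,
              const std::vector<double>& x, std::vector<double>& z) {
    applyBlockJacobi(levels[0], x, z);                // fine-level term M^{-1}_(0) x
    std::vector<double> zl;
    for (size_t l = 1; l < levels.size(); ++l) {
        std::vector<double> xl(levelSizes[l], 0.0), fine(x.size(), 0.0);
        coarsen(maps[l - 1], x, xl);                  // step 1: distribute x to level l
        applyBlockJacobi(levels[l], xl, zl);          // step 2: per-domain products
        prolongate(maps[l - 1], zl, fine);            // step 3: gather back to fine nodes
        for (size_t i = 0; i < z.size(); ++i) z[i] += fine[i];
    }
}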

6.1 Per-Domain Preconditioning

The preconditioning task inside each domain is essentially a matrix-vector multiplication by the submatrix inverse \(A^{-1}_{d,(l)}\). Our basic idea is to halve the memory access by exploiting matrix symmetry, so hopefully we can halve the computational time as well.
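
A serial sketch of the symmetry trick alone: store only the lower triangle of the symmetric per-domain inverse and let each stored entry contribute to two output entries, which halves the memory reads. The paper's actual kernel additionally balances the workload across threads and avoids write conflicts; none of that is shown here, and the packed layout is an assumption.

#include <vector>

// y = S * x for a symmetric n x n matrix stored as its lower triangle,
// packed row by row: lower[k] enumerates entries (i, j) with j <= i.
void symMatVecLower(const std::vector<double>& lower, int n,
                    const std::vector<double>& x, std::vector<double>& y) {
    y.assign(n, 0.0);
    int k = 0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < i; ++j, ++k) {
            y[i] += lower[k] * x[j];     // contribution of S[i][j]
            y[j] += lower[k] * x[i];     // mirrored contribution of S[j][i]
        }
        y[i] += lower[k++] * x[i];       // diagonal entry
    }
}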

7. Results

We evaluate the performance of our preconditioner in these examples on an NVIDIA GeForce RTX 3080 GPU and an Intel Core i7-9700 3.0 GHz CPU.

By default, we:

  • simulate linearized examples with the PCG solver
  • simulate nonlinear examples with the inexact Newton solver, using 32 inner PCG iterations per Newton step