Emre Aksan - 2021 - A Spatio-temporal Transformer for 3D Human Motion Prediction - IEEE

A Spatio-temporal Transformer for 3D Human Motion Prediction #paper


1. paper-info

1.1 Metadata

  • Author:: [[Emre Aksan]], [[Manuel Kaufmann]], [[Peng Cao]], [[Otmar Hilliges]]
  • Affiliation::
  • Keywords:: #DeepLearning , #HMP , #Transformer
  • Journal:: #IEEE
  • Date:: [[2021-11-29]]
  • Status:: #Done

1.2 Abstract

We propose a novel Transformer-based architecture for the task of generative modelling of 3D human motion. Previous work commonly relies on RNN-based models considering shorter forecast horizons reaching a stationary and often implausible state quickly. Recent studies show that implicit temporal representations in the frequency domain are also effective in making predictions for a predetermined horizon. Our focus lies on learning spatio-temporal representations autoregressively and hence generation of plausible future developments over both short and long term. The proposed model learns high dimensional embeddings for skeletal joints and how to compose a temporally coherent pose via a decoupled temporal and spatial self-attention mechanism. Our dual attention concept allows the model to access current and past information directly and to capture both the structural and the temporal dependencies explicitly. We show empirically that this effectively learns the underlying motion dynamics and reduces error accumulation over time observed in auto-regressive models. Our model is able to make accurate short-term predictions and generate plausible motion sequences over long horizons. We make our code publicly available at https://github.com/eth-ait/motion-transformer.


2. Introduction

  • Field
    3D human motion modelling: given an observed motion sequence, predict the poses that follow. Because of the complex kinematics and flexibility of the human body, this is hard to do, and the task can be cast as a generative modelling task.
    The authors propose a new network architecture that learns spatio-temporal information from the observed sequence without relying on the propagation of RNN hidden states or on a fixed temporal encoding (DCT).
  • Method
    The authors propose two attention mechanisms that handle temporal and spatial information separately:
  • temporal attention: the same joint at different time steps forms the input
  • spatial attention: different joints at the same time step form the input

Figure 2-1. Spatio-temporal attention

3. Architecture


Figure 1. Overall network architecture

3.1 Problem Formulation

\(X=\{x_1, ...,x_T\};x_t=\{j_t^{(1)},...,j_t^{(N)}\}\): the motion sequence.
\(x_t\): the human pose at time \(t\).
\(j_t^{(n)}\in \mathbb{R}^M\): the feature of joint \(n\); the dimension \(M\) depends on the chosen representation. The authors use the rotation matrix representation (\(M=9\)).
\(W^{(n,I)}\): a parameter matrix specific to joint \(n\) and input \(I\).
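The notation above can be made concrete with tensor shapes. A minimal NumPy sketch (the values of \(T\) and \(N\) are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Illustrative shapes following the paper's notation:
# T = sequence length, N = number of joints, M = per-joint feature size.
T, N, M = 120, 24, 9          # M = 9 for the rotation-matrix representation
X = np.random.randn(T, N, M)  # X = {x_1, ..., x_T}

x_t = X[0]       # one pose x_t: N joints, each an M-dim feature
j_t_n = X[0, 0]  # one joint feature j_t^(n) in R^M
print(x_t.shape, j_t_n.shape)  # (24, 9) (9,)
```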

3.2. Spatio-temporal Transformer

Joint Embedding
A per-joint linear layer projects each input into a D-dimensional space so that a sinusoidal positional encoding can be added; a dropout layer follows before the attention layers. The linear layer is:

\[e_t^{(n)}=W^{(n,E)}j_t^{(n)} + b^{(n, E)}, W^{(n, E)}\in \mathbb{R}^{D\times M},b^{(n, E)}\in \mathbb{R}^D \]
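A minimal NumPy sketch of the per-joint embedding equation (D=64 is an assumed embedding size; the positional encoding and dropout are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 64, 9, 24  # D is an assumed embedding size; M=9, N joints

# One projection pair (W^(n,E), b^(n,E)) per joint n.
W = rng.standard_normal((N, D, M)) * 0.01
b = np.zeros((N, D))

def embed_pose(x_t):
    """e_t^(n) = W^(n,E) j_t^(n) + b^(n,E), applied to every joint n."""
    return np.einsum('ndm,nm->nd', W, x_t) + b

e_t = embed_pose(rng.standard_normal((N, M)))
print(e_t.shape)  # (24, 64)
```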

Temporal Attention
Identical to the standard attention mechanism, computed over the time steps of each joint.
Spatial Attention
Attention computed over the \(N\) joints of a single time step.
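The decoupling only changes which axis the attention runs over. A simplified NumPy sketch (single head, no learned projections, and no causal mask, so it only illustrates the choice of axis):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq):
    """Dot-product self-attention over the first axis of `seq` (L, D).
    Q/K/V projections are omitted; this only shows which axis attends."""
    scores = seq @ seq.T / np.sqrt(seq.shape[-1])
    return softmax(scores) @ seq

T, N, D = 10, 24, 64
E = np.random.randn(T, N, D)  # embedded sequence

# Temporal attention: each joint n attends over its own time steps.
temporal = np.stack([self_attention(E[:, n]) for n in range(N)], axis=1)

# Spatial attention: each time step t attends over the N joints.
spatial = np.stack([self_attention(E[t]) for t in range(T)], axis=0)
print(temporal.shape, spatial.shape)  # (10, 24, 64) (10, 24, 64)
```

In the paper the temporal attention is additionally masked so each step only sees the past; that mask is left out here for brevity.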

3.3. Training and Inference

Loss function:

\[\mathcal{L}(\boldsymbol{X}, \hat{\boldsymbol{X}})=\sum_{t=2}^{T+1} \sum_{n=1}^{N}\left\|\boldsymbol{j}_{t}^{(n)}-\hat{\boldsymbol{j}}_{t}^{(n)}\right\|_{2} \]
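The loss is the per-joint L2 norm summed over time steps and joints. A direct NumPy sketch:

```python
import numpy as np

def motion_loss(J, J_hat):
    """Sum over time and joints of the per-joint L2 norm ||j - j_hat||_2.
    J, J_hat: arrays of shape (T, N, M)."""
    return np.linalg.norm(J - J_hat, axis=-1).sum()

J = np.zeros((5, 24, 9))
J_hat = np.ones((5, 24, 9))
print(motion_loss(J, J_hat))  # 5 * 24 * sqrt(9) = 360.0
```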

When computing temporal attention, a sliding-window technique is used:
\(\{x_1,...,x_T\}\) predicts \(\hat{x}_{T+1}\); the window then slides, and \(\{x_2,...,\hat{x}_{T+1}\}\) predicts \(\hat{x}_{T+2}\), and so on.
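The sliding-window rollout above can be sketched as a loop; `predict_next` here is a hypothetical stand-in for the trained model (it just repeats the last pose):

```python
import numpy as np

def predict_next(window):
    """Stand-in for the model: repeats the last pose in the window."""
    return window[-1]

def rollout(seed, steps):
    """Autoregressive prediction with a fixed-length sliding window."""
    window = list(seed)
    preds = []
    for _ in range(steps):
        x_next = predict_next(np.stack(window))
        preds.append(x_next)
        window = window[1:] + [x_next]  # slide: drop x_1, append prediction
    return np.stack(preds)

seed = np.random.randn(10, 24, 9)  # {x_1, ..., x_T}, T = 10
future = rollout(seed, steps=5)
print(future.shape)  # (5, 24, 9)
```

Because each prediction is fed back as input, errors can accumulate over the rollout; the paper argues its dual attention reduces this accumulation compared to RNNs.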

4. Experiments

  • datasets
    • AMASS
    • Human3.6M

5. Summary

The paper uses a Transformer to capture temporal and spatial information, improving prediction performance over both short and long horizons.


posted @ 2022-11-03 20:57 GuiXu40