Emre Aksan - 2021 - A Spatio-temporal Transformer for 3D Human Motion Prediction - IEEE
A Spatio-temporal Transformer for 3D Human Motion Prediction #paper
1. paper-info
1.1 Metadata
- Author:: [[Emre Aksan]], [[Manuel Kaufmann]], [[Peng Cao]], [[Otmar Hilliges]]
- Affiliation::
- Keywords:: #DeepLearning , #HMP , #Transformer
- Journal:: #IEEE
- Date:: [[2021-11-29]]
- Status:: #Done
1.2 Abstract
We propose a novel Transformer-based architecture for the task of generative modelling of 3D human motion. Previous work commonly relies on RNN-based models considering shorter forecast horizons reaching a stationary and often implausible state quickly. Recent studies show that implicit temporal representations in the frequency domain are also effective in making predictions for a predetermined horizon. Our focus lies on learning spatio-temporal representations autoregressively and hence generation of plausible future developments over both short and long term. The proposed model learns high dimensional embeddings for skeletal joints and how to compose a temporally coherent pose via a decoupled temporal and spatial self-attention mechanism. Our dual attention concept allows the model to access current and past information directly and to capture both the structural and the temporal dependencies explicitly. We show empirically that this effectively learns the underlying motion dynamics and reduces error accumulation over time observed in auto-regressive models. Our model is able to make accurate short-term predictions and generate plausible motion sequences over long horizons. We make our code publicly available at https://github.com/eth-ait/motion-transformer.
2. Introduction
- Domain: 3D human motion modelling, i.e. predicting the motion that follows a given seed sequence. The complex kinematics and flexibility of the human body make this hard; the problem can be framed as a generative modelling task.
- The authors' approach: a new architecture that learns spatio-temporal information from the observed sequence, without relying on the propagation of RNN hidden states or a fixed temporal encoding (DCT). Two attention mechanisms handle temporal and spatial information separately:
    - temporal attention: inputs are the same joint at different time steps
    - spatial attention: inputs are the different joints at the same time step
3. Architecture
Fig. 1 shows the overall network architecture.
3.1 Problem Formulation
\(X=\{x_1, ...,x_T\};x_t=\{j_t^{(1)},...,j_t^{(N)}\}\): the motion sequence.
\(x_t\): the body pose at time \(t\).
\(j_t^{(n)}\in \mathbb{R}^M\): the feature vector of joint \(n\); its dimension \(M\) depends on the chosen representation. The authors use the rotation matrix representation (M=9).
\(W^{(n,I)}\): parameter matrix applied to input \(I\) of joint \(n\).
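The data layout above can be sketched as follows; the concrete values of \(T\), \(N\) and the identity-rotation placeholders are illustrative choices, not values from the paper:

```python
import numpy as np

# Minimal sketch of the input layout (shapes only, placeholder values).
T, N, M = 10, 24, 9                           # time steps, joints, features per joint
rotmats = np.tile(np.eye(3), (T, N, 1, 1))    # one 3x3 rotation matrix per joint
X = rotmats.reshape(T, N, M)                  # x_t = {j_t^(1), ..., j_t^(N)}, j in R^M (M = 9)
assert X.shape == (T, N, 9)
```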
3.2. Spatio-temporal Transformer
Joint Embedding
Each joint is projected into a D-dimensional space by a per-joint linear layer, so that a position encoding (sinusoidal position encoding) can be added; a dropout layer follows before the result is fed into the attention layers.
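A minimal NumPy sketch of this embedding step, assuming one weight matrix per joint and the standard sinusoidal encoding; all sizes and variable names are hypothetical, and dropout is only indicated by a comment:

```python
import numpy as np

def sinusoidal_pe(T, D):
    """Standard sinusoidal position encoding, shape (T, D)."""
    pos = np.arange(T)[:, None]
    i = np.arange(D)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / D)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

T, N, M, D = 6, 4, 9, 16                      # hypothetical sizes
rng = np.random.default_rng(0)
joints = rng.standard_normal((T, N, M))       # raw per-joint features
W = rng.standard_normal((N, M, D))            # one projection matrix per joint
b = np.zeros((N, D))

# Per-joint linear projection into the shared D-dimensional space ...
emb = np.einsum("tnm,nmd->tnd", joints, W) + b
# ... plus the same temporal position encoding for every joint.
emb = emb + sinusoidal_pe(T, D)[:, None, :]
# (a dropout layer would be applied here during training)
assert emb.shape == (T, N, D)
```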
Temporal Attention
Identical in form to the standard (scaled dot-product) attention mechanism, applied along the time axis.
Spatial Attention
Attention is computed over the \(N\) joints.
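Both attention blocks can be sketched with the same scaled dot-product routine applied along different axes; this is a simplified NumPy sketch (Q/K/V projections, multiple heads, and the causal mask used for autoregressive temporal attention are omitted), not the paper's exact implementation:

```python
import numpy as np

def attention(x):
    """Bare scaled dot-product self-attention over the first axis: (L, D) -> (L, D)."""
    D = x.shape[-1]
    scores = x @ x.T / np.sqrt(D)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)                 # softmax over keys
    return w @ x

T, N, D = 8, 4, 16                                        # hypothetical sizes
rng = np.random.default_rng(0)
x = rng.standard_normal((T, N, D))                        # per-joint embeddings

# Temporal attention: each joint attends over its own T time steps.
temporal = np.stack([attention(x[:, n, :]) for n in range(N)], axis=1)
# Spatial attention: each frame attends over its N joints.
spatial = np.stack([attention(x[t, :, :]) for t in range(T)], axis=0)

assert temporal.shape == spatial.shape == (T, N, D)
```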
3.3. Training and Inference
Loss function:
A sliding-window technique is used when computing temporal attention.
\(\{x_1,...,x_T\}\) is used to predict \(\hat{x}_{T+1}\); the window then slides, and \(\{x_2,...,x_{T+1}\}\) is used to predict \(\hat{x}_{T+2}\).
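The sliding-window rollout can be sketched as follows; `predict_next` is a placeholder standing in for the trained model (here it just repeats the last pose):

```python
import numpy as np

def predict_next(window):
    """Stand-in for the trained model: maps a (T, N, M) window to the next pose (N, M).
    The real model would run the spatio-temporal Transformer here."""
    return window[-1]

def rollout(seed, steps):
    """Autoregressive inference with a sliding window of fixed length T."""
    window = list(seed)
    preds = []
    for _ in range(steps):
        nxt = predict_next(np.stack(window))
        preds.append(nxt)
        window = window[1:] + [nxt]   # slide: drop the oldest pose, append the prediction
    return np.stack(preds)

T, N, M = 5, 4, 9                     # hypothetical sizes
seed = [np.zeros((N, M)) for _ in range(T)]
out = rollout(seed, steps=3)
assert out.shape == (3, N, M)
```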
4. Experiments
- datasets
- AMASS
- Human3.6M
5. Summary
The paper uses a Transformer to capture both temporal and spatial information, improving prediction performance.