Paper: Informer
Informer: a time-series forecasting model
1 Introduction
Three significant limitations when applying the vanilla Transformer to LSTF (Long Sequence Time-series Forecasting):
- The quadratic computation of self-attention. The atom operation of the self-attention mechanism, the canonical dot-product, makes the time complexity and memory usage per layer O(L²).
- The memory bottleneck in stacking layers for long inputs. A stack of J encoder/decoder layers makes the total memory usage O(J·L²), which limits model scalability when receiving long sequence inputs.
- The speed plunge in predicting long outputs. The dynamic decoding of the vanilla Transformer makes step-by-step inference as slow as an RNN-based model (Fig. 1b).
Prior works addressing these limitations:
- Vanilla Transformer(2017)
- The Sparse Transformer(2019)
- LogSparse Transformer(2019)
- Longformer(2020)
- Reformer(2019)
- Linformer(2020)
- Transformer-XL(2019)
- Compressive Transformer(2019)
2 Preliminary
3 Methodology
Efficient Self-attention Mechanism
the i-th query's attention is defined as a kernel smoother in a probability form (sketched below)
- $\mathcal{A}(q_i, K, V) = \sum_{j} \frac{k(q_i, k_j)}{\sum_{l} k(q_i, k_l)}\, v_j = \mathbb{E}_{p(k_j \mid q_i)}[v_j]$, where $k(q_i, k_j) = \exp\!\big(q_i k_j^{\top} / \sqrt{d}\big)$
The Sparse Transformer
“self-attention probability has potential sparsity”
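A minimal sketch (not the official Informer code) of canonical scaled dot-product attention written per query, to make the kernel-smoother view explicit; the tensor shapes are illustrative assumptions:

```python
import torch

def canonical_attention(Q, K, V):
    """Q: (L_Q, d), K: (L_K, d), V: (L_K, d) for a single head."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d**0.5   # q_i k_j^T / sqrt(d) for all pairs
    p = torch.softmax(scores, dim=-1)           # p(k_j | q_i), one row per query
    return p @ V                                # E_{p(k_j|q_i)}[v_j] for each query

Q, K, V = (torch.randn(96, 64) for _ in range(3))
out = canonical_attention(Q, K, V)              # shape (96, 64)
```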
Query Sparsity Measurement
- a few dot-product pairs contribute to the major attention, while the others generate trivial attention
- to distinguish the "important" queries, measure the Kullback-Leibler divergence between the query's attention distribution $p(k_j \mid q_i)$ and the uniform distribution
- dropping the constant, the $i$-th query's sparsity measurement is
  $M(q_i, K) = \ln \sum_{j=1}^{L_K} \exp\!\big(q_i k_j^{\top} / \sqrt{d}\big) - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}$
- the first term is the Log-Sum-Exp (LSE) of $q_i$ over all the keys, and the second term is their arithmetic mean (see the sketch below)
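A rough sketch, under illustrative shapes, of the sparsity measurement: the exact form M (Log-Sum-Exp minus the arithmetic mean of the scaled scores) next to the max-mean approximation used in practice. The paper additionally evaluates it on only a random subset of keys to keep the cost near O(L ln L); that sampling trick is omitted here:

```python
import torch

def sparsity_measurement(Q, K):
    """Q: (L_Q, d), K: (L_K, d); returns (M, M_bar), one score per query."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d**0.5                 # (L_Q, L_K)
    M = torch.logsumexp(scores, dim=-1) - scores.mean(dim=-1)     # LSE - mean (exact)
    M_bar = scores.max(dim=-1).values - scores.mean(dim=-1)       # max - mean (approx.)
    return M, M_bar
```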
ProbSparse Self-attention
- ProbSparse self-attention: $\mathcal{A}(Q, K, V) = \mathrm{Softmax}\!\big(\bar{Q} K^{\top} / \sqrt{d}\big) V$, where $\bar{Q}$ is a sparse matrix of the same size as $Q$ containing only the top-$u$ queries under $M(q, K)$, with $u = c \cdot \ln L_Q$
- it allows each key to only attend to the $u$ dominant queries (see the sketch below)
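A simplified single-head sketch of ProbSparse self-attention (not the official implementation, and without the key-sampling trick): only the top-u queries under the max-mean measurement receive full attention, while the remaining "lazy" queries are filled with the mean of V:

```python
import math
import torch

def probsparse_attention(Q, K, V, c=5):
    """Q: (L_Q, d), K: (L_K, d), V: (L_K, d) for a single head."""
    L_Q, d = Q.shape
    u = min(L_Q, int(c * math.log(L_Q)))                  # u = c * ln(L_Q)
    scores = Q @ K.transpose(-2, -1) / d**0.5             # (L_Q, L_K), dense for clarity
    M_bar = scores.max(dim=-1).values - scores.mean(dim=-1)   # max-mean sparsity measure
    top_idx = M_bar.topk(u).indices                       # the u dominant queries
    out = V.mean(dim=0, keepdim=True).expand(L_Q, -1).clone()  # lazy queries -> mean(V)
    out[top_idx] = torch.softmax(scores[top_idx], dim=-1) @ V  # full attention for top-u
    return out

Q, K, V = (torch.randn(96, 64) for _ in range(3))
print(probsparse_attention(Q, K, V).shape)                # torch.Size([96, 64])
```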
Encoder
- extracts the long-range dependencies of the long sequential inputs
Self-attention Distilling
- distilling (inspired by dilated convolution): the output of each Attention Block is passed through Conv1d( ), an ELU( ) activation, and MaxPool with stride 2, halving the sequence length from one layer to the next
- $X_{j+1}^{t} = \mathrm{MaxPool}\big(\mathrm{ELU}\big(\mathrm{Conv1d}([X_{j}^{t}]_{\mathrm{AB}})\big)\big)$, where $[\cdot]_{\mathrm{AB}}$ denotes the attention block
- privileges the dominant features and sharply reduces memory usage (see the sketch below)
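A minimal sketch of one distilling step between encoder attention blocks, assuming d_model=512 and kernel size 3 as illustrative hyperparameters; the Conv1d → ELU → MaxPool(stride 2) pipeline halves the sequence length fed to the next block:

```python
import torch
from torch import nn

class DistillingLayer(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (batch, L, d_model)
        x = x.transpose(1, 2)              # Conv1d expects (batch, d_model, L)
        x = self.pool(nn.functional.elu(self.conv(x)))
        return x.transpose(1, 2)           # (batch, L/2, d_model)

x = torch.randn(8, 96, 512)
print(DistillingLayer()(x).shape)          # torch.Size([8, 48, 512])
```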
Decoder
- composed of a stack of two identical multi-head attention layers
Generative Inference
- sample an L_token-long sequence from the input as the "start token" (e.g., the known 5 days before the target sequence)
- feed it, concatenated with placeholders for the target positions, to the generative-style inference decoder
- one forward procedure predicts all outputs at once, instead of step-by-step dynamic decoding (see the sketch below)
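A sketch of how the generative-style decoder input could be assembled, with hypothetical lengths L_token=48 and L_pred=24: the known start-token segment is concatenated with zero placeholders for the target positions, so a single forward pass yields all predictions:

```python
import torch

L_token, L_pred, d = 48, 24, 7                # start-token length, prediction window, features
history = torch.randn(1, 96, d)               # known input sequence (batch, L, d)

start_token = history[:, -L_token:, :]        # last L_token known steps
placeholder = torch.zeros(1, L_pred, d)       # target positions, masked as zeros
decoder_input = torch.cat([start_token, placeholder], dim=1)  # (1, L_token + L_pred, d)
# decoder_input is fed to the decoder once; the model emits all L_pred values
# in a single forward procedure rather than decoding step by step.
```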
Loss function
- MSE loss function
4 Experiment
Datasets
Two collected real-world datasets for LSTF and two public benchmark datasets:
ETT (Electricity Transformer Temperature)
ECL (Electricity Consuming Load)
Weather
Experimental Details
Baselines:
- ARIMA(2014)
- Prophet(2018)
- LSTMa(2015)
- LSTnet(2018)
- DeepAR(2017)
Self-attention variants compared:
- the canonical self-attention variant
- Reformer(2019)
- LogSparse self-attention(2019)
Metrics
- MSE
- MAE
Platform:
- a single Nvidia V100 32GB GPU
Results and Analysis
Parameter Sensitivity
Ablation Study
Computation Efficiency
5 Conclusion