Paper: Informer
Informer: a time-series forecasting model
1 Introduction
Three significant limitations when applying the vanilla Transformer to LSTF (Long Sequence Time-series Forecasting):
- The quadratic computation of self-attention. The atom operation of the self-attention mechanism, the canonical dot-product, makes the time complexity and memory usage per layer O(L²).
- The memory bottleneck in stacking layers for long inputs. A stack of J encoder/decoder layers makes the total memory usage O(J·L²), which limits model scalability when receiving long sequence inputs.
- The speed plunge in predicting long outputs. The dynamic decoding of the vanilla Transformer makes step-by-step inference as slow as an RNN-based model (Fig. 1b).
Prior works addressing these limitations:
- Vanilla Transformer(2017)
- The Sparse Transformer(2019)
- LogSparse Transformer(2019)
- Longformer(2020)
- Reformer(2019)
- Linformer(2020)
- Transformer-XL(2019)
- Compressive Transformer(2019)
2 Preliminary
3 Methodology
Efficient Self-attention Mechanism
the i-th query's attention is defined as a kernel smoother in a probability form (sketched below)
- $\mathcal{A}(q_i, K, V) = \sum_{j} \frac{k(q_i, k_j)}{\sum_{l} k(q_i, k_l)}\, v_j = \mathbb{E}_{p(k_j \mid q_i)}[v_j]$, where $k(q_i, k_j) = \exp\!\big(q_i k_j^{\top} / \sqrt{d}\big)$
The Sparse Transformer
“self-attention probability has potential sparsity”
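A minimal sketch (not the official Informer code) of canonical scaled dot-product attention written per query, to make the kernel-smoother view explicit; the tensor shapes are illustrative assumptions:

```python
import torch

def canonical_attention(Q, K, V):
    """Q: (L_Q, d), K: (L_K, d), V: (L_K, d) for a single head."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d**0.5   # q_i k_j^T / sqrt(d) for all pairs
    p = torch.softmax(scores, dim=-1)           # p(k_j | q_i), one row per query
    return p @ V                                # E_{p(k_j|q_i)}[v_j] for each query

Q, K, V = (torch.randn(96, 64) for _ in range(3))
out = canonical_attention(Q, K, V)              # shape (96, 64)
```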
Query Sparsity Measurement
- a few dot-product pairs contribute to the major attention, while the others generate trivial attention
- to distinguish the "important" queries, measure the Kullback-Leibler divergence between the query's attention distribution $p(k_j \mid q_i)$ and the uniform distribution
- dropping the constant, the $i$-th query's sparsity measurement is
  $M(q_i, K) = \ln \sum_{j=1}^{L_K} \exp\!\big(q_i k_j^{\top} / \sqrt{d}\big) - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}$
- the first term is the Log-Sum-Exp (LSE) of $q_i$ over all the keys, and the second term is their arithmetic mean (see the sketch below)
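A rough sketch, under illustrative shapes, of the sparsity measurement: the exact form M (Log-Sum-Exp minus the arithmetic mean of the scaled scores) next to the max-mean approximation used in practice. The paper additionally evaluates it on only a random subset of keys to keep the cost near O(L ln L); that sampling trick is omitted here:

```python
import torch

def sparsity_measurement(Q, K):
    """Q: (L_Q, d), K: (L_K, d); returns (M, M_bar), one score per query."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d**0.5                 # (L_Q, L_K)
    M = torch.logsumexp(scores, dim=-1) - scores.mean(dim=-1)     # LSE - mean (exact)
    M_bar = scores.max(dim=-1).values - scores.mean(dim=-1)       # max - mean (approx.)
    return M, M_bar
```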
ProbSparse Self-attention
- ProbSparse self-attention: $\mathcal{A}(Q, K, V) = \mathrm{Softmax}\!\big(\bar{Q} K^{\top} / \sqrt{d}\big) V$, where $\bar{Q}$ is a sparse matrix of the same size as $Q$ containing only the top-$u$ queries under $M(q, K)$, with $u = c \cdot \ln L_Q$
- it allows each key to only attend to the $u$ dominant queries (see the sketch below)
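A simplified single-head sketch of ProbSparse self-attention (not the official implementation, and without the key-sampling trick): only the top-u queries under the max-mean measurement receive full attention, while the remaining "lazy" queries are filled with the mean of V:

```python
import math
import torch

def probsparse_attention(Q, K, V, c=5):
    """Q: (L_Q, d), K: (L_K, d), V: (L_K, d) for a single head."""
    L_Q, d = Q.shape
    u = min(L_Q, int(c * math.log(L_Q)))                  # u = c * ln(L_Q)
    scores = Q @ K.transpose(-2, -1) / d**0.5             # (L_Q, L_K), dense for clarity
    M_bar = scores.max(dim=-1).values - scores.mean(dim=-1)   # max-mean sparsity measure
    top_idx = M_bar.topk(u).indices                       # the u dominant queries
    out = V.mean(dim=0, keepdim=True).expand(L_Q, -1).clone()  # lazy queries -> mean(V)
    out[top_idx] = torch.softmax(scores[top_idx], dim=-1) @ V  # full attention for top-u
    return out

Q, K, V = (torch.randn(96, 64) for _ in range(3))
print(probsparse_attention(Q, K, V).shape)                # torch.Size([96, 64])
```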
Encoder
- extracts the long-range dependencies of the long sequential inputs
Self-attention Distilling
- distilling (inspired by dilated convolution): the output of each Attention Block is passed through Conv1d( ), an ELU( ) activation, and MaxPool with stride 2, halving the sequence length from one layer to the next
- $X_{j+1}^{t} = \mathrm{MaxPool}\big(\mathrm{ELU}\big(\mathrm{Conv1d}([X_{j}^{t}]_{\mathrm{AB}})\big)\big)$, where $[\cdot]_{\mathrm{AB}}$ denotes the attention block
- privileges the dominant features and sharply reduces memory usage (see the sketch below)
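A minimal sketch of one distilling step between encoder attention blocks, assuming d_model=512 and kernel size 3 as illustrative hyperparameters; the Conv1d → ELU → MaxPool(stride 2) pipeline halves the sequence length fed to the next block:

```python
import torch
from torch import nn

class DistillingLayer(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (batch, L, d_model)
        x = x.transpose(1, 2)              # Conv1d expects (batch, d_model, L)
        x = self.pool(nn.functional.elu(self.conv(x)))
        return x.transpose(1, 2)           # (batch, L/2, d_model)

x = torch.randn(8, 96, 512)
print(DistillingLayer()(x).shape)          # torch.Size([8, 48, 512])
```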
Decoder
- composed of a stack of two identical multi-head attention layers
Generative Inference
- sample an L_token-long sequence from the input as the "start token" (e.g., the known 5 days before the target sequence)
- feed it, concatenated with placeholders for the target positions, to the generative-style inference decoder
- one forward procedure predicts all outputs at once, instead of step-by-step dynamic decoding (see the sketch below)
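A sketch of how the generative-style decoder input could be assembled, with hypothetical lengths L_token=48 and L_pred=24: the known start-token segment is concatenated with zero placeholders for the target positions, so a single forward pass yields all predictions:

```python
import torch

L_token, L_pred, d = 48, 24, 7                # start-token length, prediction window, features
history = torch.randn(1, 96, d)               # known input sequence (batch, L, d)

start_token = history[:, -L_token:, :]        # last L_token known steps
placeholder = torch.zeros(1, L_pred, d)       # target positions, masked as zeros
decoder_input = torch.cat([start_token, placeholder], dim=1)  # (1, L_token + L_pred, d)
# decoder_input is fed to the decoder once; the model emits all L_pred values
# in a single forward procedure rather than decoding step by step.
```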
Loss function
- MSE loss function
4 Experiment
Datasets
Two collected real-world datasets for LSTF and two public benchmark datasets:
ETT (Electricity Transformer Temperature)
ECL (Electricity Consuming Load)
Weather
Experimental Details
Baselines:
- ARIMA(2014)
- Prophet(2018)
- LSTMa(2015)
- LSTnet(2018)
- DeepAR(2017)
Self-attention variants compared:
- the canonical self-attention variant
- Reformer(2019)
- LogSparse self-attention(2019)
Metrics
- MSE
- MAE
Platform:
- a single Nvidia V100 32GB GPU
Results and Analysis
Parameter Sensitivity
Ablation Study
Computation Efficiency
5 Conclusion