时间序列ARIMA模型

一、数据平稳性与差分法

1.1 平稳性

平稳性就是要求经由样本时间序列所得到的拟合曲线在未来的一段期间内仍能顺着现有的形态“惯性”地延续下去
平稳性要求序列的均值和方差不发生明显变化

1.2 严平稳与弱平稳

严平稳：严平稳表示的分布不随时间的改变而改变。
如：白噪声（正态），无论怎么取，都是期望为0，方差为1

弱平稳：期望与相关系数（依赖性）不变
未来某时刻的t的值Xt就要依赖于它的过去信息，所以需要依赖性

严平稳的条件只是理论上的存在，现实中用得比较多的是宽平稳的条件。

弱平稳也叫宽平稳或者二阶平稳（均值和方差平稳），它应满足：

常数均值
常数方差
常数自协方差

ARIMA 模型对时间序列的要求是平稳型。因此，当你得到一个非平稳的时间序列时，首先要做的即是做时间序列的差分，直到得到一个平稳时间序列。如果你对时间序列做d次差分才能得到一个平稳序列，那么可以使用ARIMA(p,d,q)模型，其中d是差分次数。

1.3 差分法

时间序列在t与t-1时刻的差值

从统计学的角度来看，数据差分是指将非平稳数据转换为平稳的过程，去除其非恒定的趋势。“差分消除了时间序列水平的变化，消除了趋势和季节性，从而稳定了时间序列的平均值。”

1.4 差分实验

数据来自：https://github.com/SimiY/pydata-sf-2016-arima-tutorial

import sys
import os
 
import pandas as pd
import numpy as np
 
import matplotlib.pylab as plt
import seaborn as sns
 
pd.set_option('display.float_format', lambda x: '%.5f' % x) # pandas
np.set_printoptions(precision=5, suppress=True) # numpy
 
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
 
# seaborn plotting style
sns.set(style='ticks', context='poster')

#美国消费者信心指数
Sentiment = 'data/sentiment.csv'
Sentiment = pd.read_csv(Sentiment, index_col=0, parse_dates=[0])
Sentiment.head()

DATE	UMCSENT
2000-01-01	112.00000
2000-02-01	111.30000
2000-03-01	107.10000
2000-04-01	109.20000
2000-05-01	110.70000

# Select the series from 2005 - 2016
sentiment_short = Sentiment.loc['2005':'2016']

sentiment_short.plot(figsize=(12,8))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title("Consumer Sentiment")
sns.despine()

index

sentiment_short['diff_1'] = sentiment_short['UMCSENT'].diff(1)#求差分值，一阶差分。     1指的是1个时间间隔，可更改。
 
sentiment_short['diff_2'] = sentiment_short['diff_1'].diff(1)#再求差分，二阶差分。
 
sentiment_short.plot(subplots=True, figsize=(18, 12))

index

二、ARIMA模型拆解

基本模型：自回归移动平均模型(ARMA(p，q))是时间序列中最为重要的模型之一。它主要由两部分组成： AR代表p阶自回归过程，MA代表q阶移动平均过程。

2.1 AR - 自回归

1286166-20180707163430738-1068734747

自回归模型的限制:

自回归模型是用自身的数据来进行预测
必须具有平稳性
必须具有自相关性，如果自相关系数(φi)小于0.5，则不宜采用
自回归只适用于预测与自身前期相关的现象

2.2 MA - 移动平均

1286166-20180707163823559-1177873310

2.3 I - 表示综合

ARIMA中的“I”指的是它的综合方面。当应用差分步骤时，数据是“综合”的，以消除非平稳性。表示原始观测值的差异，以允许时间序列变得平稳，即数据值被数据值和以前的值之间的差异替换。

I表示差分项，用d表示，等于1为1阶差分，2为2阶差分，等于0不用做，一般做1阶差分就够了。

2.4 最终模型

1286166-20180707164506801-1781516986

ARIMA(p，d，q)模型全称为差分自回归移动平均模型(Autoregressive Integrated Moving Average Model,简记ARIMA)

AR是自回归，p为自回归项；
MA为移动平均
q为移动平均项数，d为时间序列成为平稳时所做的差分次数

原理：将非平稳时间序列转化为平稳时间序列然后将因变量。仅对它的滞后值以及随机误差项的现值和滞后值进行回归所建立的模型。

滞后值指的是阶数

三、自相关和偏自相关

3.1 自相关函数ACF

自相关函数ACF(autocorrelation function)
对一个时间序列，现在值与其过去值的相关性。如果相关性为正，则说明现有趋势将继续保持。
自相关函数反映了同一序列在不同时序的取值之间的相关性

公式：

Pk的取值范围为[-1,1]

3.2 偏自相关函数(PACF)

偏自相关函数(PACF)(partial autocorrelation function)
对于一个平稳AR(p)模型，求出滞后k自相关系数p(k)时实际上得到并不是x(t)与x(t-k)之间单纯的相关关系
x(t)同时还会受到中间k-1个随机变量x(t-1)、x(t-2)、……、x(t-k+1)的影响，而这k-1个随机变量又都和x(t-k)具有相关关系。所以自相关系数p(k)里实际掺杂了其他变量对x(t)与x(t-k)的影响
剔除了中间k-1个随机变量x(t-1)、x(t-2)、……、x(t-k+1)的干扰之后x(t-k)对x(t)影响的相关程度。
ACF还包含了其他变量的影响，而偏自相关系数PACF是严格这两个变量之间的相关性

3.3 pdq阶数确定

截尾：落在置信区间内（95%的点都符合该规则），可简单理解为从某阶之后直接就变为0。

1286166-20180707201054120-95753925

3.4 ACF、PACF求解

del sentiment_short['diff_2']
del sentiment_short['diff_1']
sentiment_short.head()
print (type(sentiment_short))

<class 'pandas.core.frame.DataFrame'>

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig = plt.figure(figsize=(12,8))
#acf 自相关
ax1 = fig.add_subplot(211)
plot_acf(sentiment_short,lags=20,title='acf', ax=ax1)  
ax1.xaxis.set_ticks_position('bottom')
fig.tight_layout();

#pacf 偏自相关
ax2 = fig.add_subplot(212)
plot_pacf(sentiment_short,lags=20,title='pacf', ax=ax2)  
ax2.xaxis.set_ticks_position('bottom')
fig.tight_layout();

index

# 散点图也可以表示
 
lags=9
ncols=3
nrows=int(np.ceil(lags/ncols))
 
fig, axes = plt.subplots(ncols=ncols, nrows=nrows, figsize=(4*ncols, 4*nrows))
 
for ax, lag in zip(axes.flat, np.arange(1,lags+1, 1)):
    lag_str = 't-{}'.format(lag)
    X = (pd.concat([sentiment_short, sentiment_short.shift(-lag)], axis=1,
                   keys=['y'] + [lag_str]).dropna())
 
    X.plot(ax=ax, kind='scatter', y='y', x=lag_str);
    corr = X.corr().values[0][1]
    ax.set_ylabel('Original')
    ax.set_title('Lag: {} (corr={:.2f})'.format(lag_str, corr));
    ax.set_aspect('equal');
    sns.despine();
 
fig.tight_layout();

index

# 更直观一些
#模板，使用时直接改自己的数据就行，用以下四个图进行评估和分析就可以
def tsplot(y, lags=None, title='', figsize=(20, 8)):
    
    fig = plt.figure(figsize=figsize)
    layout = (2, 2)
    ts_ax   = plt.subplot2grid(layout, (0, 0))
    hist_ax = plt.subplot2grid(layout, (0, 1))
    acf_ax  = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))
     
    y.plot(ax=ts_ax)
    ts_ax.set_title(title)
    y.plot(ax=hist_ax, kind='hist', bins=25)
    hist_ax.set_title('Histogram')
    plot_acf(y, lags=lags, ax=acf_ax)
    plot_pacf(y, lags=lags, ax=pacf_ax)
    [ax.set_xlim(0) for ax in [acf_ax, pacf_ax]]
    sns.despine()
    plt.tight_layout()
    return ts_ax, acf_ax, pacf_ax
 
tsplot(sentiment_short, title='Consumer Sentiment', lags=36);

index

四、AIC和BIC

AIC和BIC两个值均是越低越好

1286166-20180707203523056-449874149

AIC表达含义是参数和最终结果之间的权衡

AIC和BIC就是在保持精度的情况下，使K值越小越好

五、ARIMA完整流程

1286166-20180707201318245-1855657900

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format='retina'
 
from __future__ import absolute_import, division, print_function
import sys
import os
 
import pandas as pd
import numpy as np
 
import matplotlib.pylab as plt
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
 
pd.set_option('display.float_format', lambda x: '%.5f' % x) # pandas
np.set_printoptions(precision=5, suppress=True) # numpy
 
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
 
# seaborn plotting style
sns.set(style='ticks', context='poster')

读入数据

filename_ts = 'data/series2.csv'
ts_df = pd.read_csv(filename_ts, index_col=0, parse_dates=[0])
n_sample = ts_df.shape[0]
print(ts_df.shape)
print(ts_df.head())

(250, 1)
value
1998-06-01 -0.59883
1998-07-01 -0.80018
1998-08-01 2.29898
1998-09-01 1.15039
1998-10-01 -1.19258

划分训练集和测试集

# Create a training sample and testing sample before analyzing the series
 
n_train=int(0.95*n_sample)+1
n_forecast=n_sample-n_train
#ts_df
ts_train = ts_df.iloc[:n_train]['value']
ts_test = ts_df.iloc[n_train:]['value']
print(ts_train.shape)
print(ts_test.shape)
print("Training Series:", "\n", ts_train.tail(), "\n")
print("Testing Series:", "\n", ts_test.head())

(238,)
(12,)
Training Series:
2017-11-01 1.08165
2017-12-01 0.40939
2018-01-01 2.17708
2018-02-01 2.21133
2018-03-01 -0.39728
Name: value, dtype: float64

Testing Series:
2018-04-01 0.82064
2018-05-01 0.66053
2018-06-01 0.78946
2018-07-01 -0.02444
2018-08-01 -0.39888
Name: value, dtype: float64

def tsplot(y, lags=None,figsize=(14, 8)):
     
    fig = plt.figure(figsize=figsize)
    layout = (2, 2)
    ts_ax   = plt.subplot2grid(layout, (0, 0))
    hist_ax = plt.subplot2grid(layout, (0, 1))
    acf_ax  = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))
    y.plot(ax=ts_ax)
    ts_ax.set_title('A Given Training Series')
    y.plot(ax=hist_ax, kind='hist', bins=25)
    hist_ax.set_title('Histogram')
    plot_acf(y, lags=lags, ax=acf_ax)
    plot_pacf(y, lags=lags, ax=pacf_ax)
    [ax.set_xlim(0) for ax in [acf_ax, pacf_ax]]
    sns.despine()
    
    fig.tight_layout()
    
    return ts_ax, acf_ax, pacf_ax
    
tsplot(ts_train, lags=20);

index

模型训练

#Model Estimation
 
# Fit the model
arima200 = SARIMAX(ts_train, order=(2,0,0))#order里边的三个参数p,d,q
model_results = arima200.fit()#fit模型

import itertools
#当多组值都不符合时，遍历多组值，得出最好的值
p_min = 0
d_min = 0
q_min = 0
p_max = 4
d_max = 0
q_max = 4
 
# Initialize a DataFrame to store the results
results_bic = pd.DataFrame(index=['AR{}'.format(i) for i in range(p_min,p_max+1)],
                           columns=['MA{}'.format(i) for i in range(q_min,q_max+1)])
 
for p,d,q in itertools.product(range(p_min,p_max+1),
                               range(d_min,d_max+1),
                               range(q_min,q_max+1)):
    if p==0 and d==0 and q==0:
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = np.nan
        continue
     
    try:
        model = SARIMAX(ts_train, order=(p, d, q),
                               #enforce_stationarity=False,
                               #enforce_invertibility=False,
                              )
        results = model.fit()
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = results.bic
    except:
        continue
results_bic = results_bic[results_bic.columns].astype(float)

fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.heatmap(results_bic,
                 mask=results_bic.isnull(),
                 ax=ax,
                 annot=True,
                 fmt='.2f',
                 );
ax.set_title('BIC');

index

# Alternative model selection method, limited to only searching AR and MA parameters
from statsmodels.tsa.stattools import arma_order_select_ic
 
train_results = arma_order_select_ic(ts_train, ic=['aic', 'bic'], trend='nc', max_ar=4, max_ma=4)
 
print('AIC', train_results.aic_min_order)
print('BIC', train_results.bic_min_order)

AIC (1, 3)
BIC (2, 0)

结果：得出两个不同的标准，比较尴尬，还需要进行筛选

模型残差检验：
ARIMA模型的残差是否是平均值为0且方差为常数的正态分布
QQ图：线性即正态分布

#残差分析 正态分布 QQ图线性
model_results.plot_diagnostics(figsize=(20, 10));#statsmodels库

index

Q-Q图：越像直线，则是正态分布；越不是直线，离正态分布越远。

时间序列建模基本步骤：

获取被观测系统时间序列数据；
对数据绘图，观测是否为平稳时间序列；对于非平稳时间序列要先进行d阶差分运算，化为平稳时间序列；
经过第二步处理，已经得到平稳时间序列。要对平稳时间序列分别求得其自相关系数ACF 和偏自相关系数PACF ，通过对自相关图和偏自相关图的分析，得到最佳的阶层 p 和阶数 q
由以上得到的，得到ARIMA模型。然后开始对得到的模型进行模型检验。

posted @ 2022-04-18 15:42 王陸阅读(647) 评论(0) 收藏举报

刷新页面返回顶部

王陸

我可不是为了被全人类喜欢才活着的，只要对于某一个人来说我是必要的，我就能活下去。

时间序列ARIMA模型

一、数据平稳性与差分法

1.1 平稳性

1.2 严平稳与弱平稳

1.3 差分法

1.4 差分实验

二、ARIMA模型拆解

2.1 AR - 自回归

2.2 MA - 移动平均

2.3 I - 表示综合

2.4 最终模型

三、自相关和偏自相关

3.1 自相关函数ACF

3.2 偏自相关函数(PACF)

3.3 pdq阶数确定

3.4 ACF、PACF求解

四、AIC和BIC

五、ARIMA完整流程

公告