Time Series Study Notes, Part 2
2. Time Series Basics
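# Assumes the usual imports from earlier in the session:
# from datetime import datetime; import numpy as np; import pandas as pd; from pandas import Series
In [7]: dates = [(2011,1,1),(2011,2,3),(2011,2,4),(2011,4,23),(2011,4,22),(2011,4,1)]
In [8]: dates = [datetime(*x) for x in dates]
In [14]: ts = Series(np.random.randn(6), index=dates)
# Create a Series whose index is made of timestamps.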
In [15]: ts
Out[15]:
2011-01-01 3.627969
2011-02-03 0.731217
2011-02-04 1.178071
2011-04-23 -2.085412
2011-04-22 -0.093829
2011-04-01 -0.157532
dtype: float64
In [16]: type(ts)
Out[16]: pandas.core.series.Series
In [17]: ts.index
Out[17]:
DatetimeIndex(['2011-01-01', '2011-02-03', '2011-02-04', '2011-04-23',
'2011-04-22', '2011-04-01'],
dtype='datetime64[ns]', freq=None)
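# As with any Series, arithmetic between two Series aligns on the index;
# labels present on only one side produce NaN.
In [19]: ts + ts[::2]
Out[19]: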
2011-01-01 7.255939
2011-02-03 NaN
2011-02-04 2.356142
2011-04-01 NaN
2011-04-22 -0.187658
2011-04-23 NaN
dtype: float64
# The index of a time series has dtype datetime64 with nanosecond resolution.
In [20]: ts.index.dtype
Out[20]: dtype('<M8[ns]')
In [21]: stamp = ts.index[0]
In [22]: stamp
Out[22]: Timestamp('2011-01-01 00:00:00')
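Because the transcript above uses random numbers, the alignment behaviour is easier to verify with fixed values. The following is a small sketch; the name `ts_demo` and its values are made up for illustration.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Fixed values instead of random ones so the result is reproducible.
dates = [datetime(2011, 1, 1), datetime(2011, 2, 3), datetime(2011, 2, 4)]
ts_demo = pd.Series([1.0, 2.0, 3.0], index=dates)

# Every other element: keeps 2011-01-01 and 2011-02-04.
every_other = ts_demo[::2]

# Addition aligns on timestamps; a label missing from one side gives NaN.
total = ts_demo + every_other

# Elements of a DatetimeIndex are pandas Timestamp objects.
first_stamp = ts_demo.index[0]

print(total)
print(type(first_stamp))
```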
2.1 Indexing, Selection, and Subsetting
Indexing
# A datetime/Timestamp object can be used as an index label.
In [24]: stamp = ts.index[2]
In [25]: ts[stamp]
Out[25]: 1.1780707665960897
# Strings in common date formats also work as index labels.
In [27]: ts['01/01/2011']
Out[27]:
2011-01-01 3.627969
dtype: float64
In [28]: ts['20110101']
Out[28]:
2011-01-01 3.627969
dtype: float64
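The lookups above can be checked against a small series with known values. This is a minimal sketch: the series `s` and its values are made up for illustration, and the index is sorted and unique, which is the case where a day-level string returns a single scalar.

```python
import pandas as pd

# Hypothetical sorted daily series with fixed values (not the random ts above).
idx = pd.date_range("2011-01-01", periods=3)  # daily frequency by default
s = pd.Series([10.0, 20.0, 30.0], index=idx)

# The same element addressed four ways.
by_stamp = s.loc[pd.Timestamp("2011-01-02")]  # Timestamp object
by_iso = s.loc["2011-01-02"]                  # ISO-style string
by_us = s.loc["01/02/2011"]                   # month/day/year string
by_compact = s.loc["20110102"]                # compact string

print(by_stamp, by_iso, by_us, by_compact)
```

Using `.loc` makes it explicit that the lookup is label-based.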
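Slicing
# Slice directly by date. The plain [] form shown here applies to a Series;
# for DataFrame rows, use .loc with the same date strings.
# pd.date_range creates the DatetimeIndex for the series.
In [29]: longer_ts = Series(np.random.randn(1000), index=pd.date_range('1/1/2017',periods=1000))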
In [30]: longer_ts[:5]
Out[30]:
2017-01-01 0.311815
2017-01-02 -0.424868
2017-01-03 0.198069
2017-01-04 1.011494
2017-01-05 -0.312494
Freq: D, dtype: float64
In [31]: longer_ts[-5:]
Out[31]:
2019-09-23 -0.637869
2019-09-24 0.721613
2019-09-25 -0.914481
2019-09-26 0.036966
2019-09-27 0.677846
Freq: D, dtype: float64
# Select all data for February 2017
In [32]: longer_ts['2017-2']
Out[32]:
2017-02-01 1.258390
2017-02-02 0.606618
2017-02-03 0.927122
2017-02-04 0.761009
...
2017-02-23 -1.039703
2017-02-24 0.478075
2017-02-25 -0.328411
2017-02-26 -1.019641
2017-02-27 0.186212
2017-02-28 -1.466734
Freq: D, dtype: float64
# Data for a single day
In [33]: longer_ts['2017-2-3']
Out[33]: 0.92712152603736908
# Data for a whole year
In [34]: longer_ts['2017'][:5]
Out[34]:
2017-01-01 0.311815
2017-01-02 -0.424868
2017-01-03 0.198069
2017-01-04 1.011494
2017-01-05 -0.312494
Freq: D, dtype: float64
A Series can also be sliced with timestamps that do not appear in its index.
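A short sketch of slicing with absent timestamps follows. The series `sparse` is hypothetical and has one observation every three days, so neither slice endpoint exists in the index. Range slicing like this relies on the index being sorted.

```python
import numpy as np
import pandas as pd

# Hypothetical series: one value every 3 days, so most calendar dates are absent.
idx = pd.date_range("2017-01-01", periods=10, freq="3D")
sparse = pd.Series(np.arange(10), index=idx)

# Neither endpoint is in the index; both act as inclusive bounds.
window = sparse.loc["2017-01-02":"2017-01-11"]

# truncate() expresses the same range selection with keyword arguments.
same_window = sparse.truncate(before="2017-01-02", after="2017-01-11")

print(window)
```

The slice returns the observations on 2017-01-04, 2017-01-07 and 2017-01-10.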
2.2 Time Series with Duplicate Indices
In [35]: dates = pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000','1/2/2000','1/3/2000'])
In [36]: dup_ts = Series(np.arange(5), index=dates)
In [37]: dup_ts
Out[37]:
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int64
# Check whether the index labels are unique
In [40]: dup_ts.index.is_unique
Out[40]: False
In [41]: dup_ts['1/2/2000'] # duplicated label: returns a Series
Out[41]:
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int64
In [42]: dup_ts['1/3/2000'] # unique label: returns a scalar
Out[42]: 4
# Aggregate observations that share a timestamp by grouping on the index (level=0)
In [43]: grouped = dup_ts.groupby(level=0)
In [44]: grouped.mean()
Out[44]:
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int64
In [45]: grouped.count()
Out[45]:
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
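Besides aggregating, duplicated timestamps can simply be dropped. The sketch below shows both options on the same data; the choice of `sum` and of keeping the first observation is only for illustration and is not something these notes prescribe.

```python
import pandas as pd

dates = pd.DatetimeIndex(
    ["2000-01-01", "2000-01-02", "2000-01-02", "2000-01-02", "2000-01-03"]
)
dup = pd.Series(range(5), index=dates)

# Boolean mask marking the second and later occurrences of each timestamp.
repeat_mask = dup.index.duplicated(keep="first")

# Option 1: keep only the first observation per timestamp.
first_only = dup[~repeat_mask]

# Option 2: aggregate all observations that share a timestamp.
per_day_sum = dup.groupby(level=0).sum()

print(first_only)
print(per_day_sum)
```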
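3. Date Ranges, Frequencies, and Shifting
Time series in pandas are generally treated as irregular, with no fixed frequency. Analysis often
requires some fixed frequency, however, and pandas provides a set of tools for converting to one.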
resample
In [49]: dates = pd.DatetimeIndex(['2000-01-02','2000-01-05','2000-01-07','2000-01-08','2000-01-10','2000-01-12'])
In [50]: ts = Series(np.random.randn(6), index=dates)
In [51]: ts
Out[51]:
2000-01-02 0.124049
2000-01-05 -0.840846
2000-01-07 -0.051655
2000-01-08 -0.603824
2000-01-10 0.467815
2000-01-12 -0.201388
dtype: float64
In [52]: ts.resample('D') # returns a deferred Resampler object, not a Series
FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
Out[52]: DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]
In [53]: ts.resample('D').mean() # missing dates are inserted with NaN values
Out[53]:
2000-01-02 0.124049
2000-01-03 NaN
2000-01-04 NaN
2000-01-05 -0.840846
2000-01-06 NaN
2000-01-07 -0.051655
2000-01-08 -0.603824
2000-01-09 NaN
2000-01-10 0.467815
2000-01-11 NaN
2000-01-12 -0.201388
Freq: D, dtype: float64
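Once the series is on a daily grid, the inserted gaps can be left as NaN or filled. The following sketch contrasts the two on a hypothetical three-point series; forward filling is just one possible policy and is shown here as an assumption, not as a recommendation from these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical irregular series with fixed values.
dates = pd.DatetimeIndex(["2000-01-02", "2000-01-05", "2000-01-07"])
irregular = pd.Series([1.0, 2.0, 3.0], index=dates)

# Daily grid; days without an observation become NaN.
daily_nan = irregular.resample("D").mean()

# Daily grid; each gap repeats the last known observation.
daily_ffill = irregular.resample("D").ffill()

print(daily_nan)
print(daily_ffill)
```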
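3.1 Generating Date Ranges
pandas.date_range generates a DatetimeIndex of a specified length at a given frequency (daily by default).
In [54]: index = pd.date_range('4/1/2017','6/1/2017') # daily range between two dates; times default to 00:00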
In [55]: index
Out[55]:
DatetimeIndex(['2017-04-01', '2017-04-02', '2017-04-03', '2017-04-04',
'2017-04-05', '2017-04-06', '2017-04-07', '2017-04-08',
'2017-04-09', '2017-04-10', '2017-04-11', '2017-04-12',
'2017-04-13', '2017-04-14', '2017-04-15', '2017-04-16',
'2017-04-17', '2017-04-18', '2017-04-19', '2017-04-20',
'2017-04-21', '2017-04-22', '2017-04-23', '2017-04-24',
'2017-04-25', '2017-04-26', '2017-04-27', '2017-04-28',
'2017-04-29', '2017-04-30', '2017-05-01', '2017-05-02',
'2017-05-03', '2017-05-04', '2017-05-05', '2017-05-06',
'2017-05-07', '2017-05-08', '2017-05-09', '2017-05-10',
'2017-05-11', '2017-05-12', '2017-05-13', '2017-05-14',
'2017-05-15', '2017-05-16', '2017-05-17', '2017-05-18',
'2017-05-19', '2017-05-20', '2017-05-21', '2017-05-22',
'2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26',
'2017-05-27', '2017-05-28', '2017-05-29', '2017-05-30',
'2017-05-31', '2017-06-01'],
dtype='datetime64[ns]', freq='D')
In [56]: pd.date_range(start='4/1/2017',periods=20) # start date plus a number of periods
Out[56]:
DatetimeIndex(['2017-04-01', '2017-04-02', '2017-04-03', '2017-04-04',
'2017-04-05', '2017-04-06', '2017-04-07', '2017-04-08',
'2017-04-09', '2017-04-10', '2017-04-11', '2017-04-12',
'2017-04-13', '2017-04-14', '2017-04-15', '2017-04-16',
'2017-04-17', '2017-04-18', '2017-04-19', '2017-04-20'],
dtype='datetime64[ns]', freq='D')
In [57]: pd.date_range(end='4/1/2017',periods=20) # end date plus a number of periods
Out[57]:
DatetimeIndex(['2017-03-13', '2017-03-14', '2017-03-15', '2017-03-16',
'2017-03-17', '2017-03-18', '2017-03-19', '2017-03-20',
'2017-03-21', '2017-03-22', '2017-03-23', '2017-03-24',
'2017-03-25', '2017-03-26', '2017-03-27', '2017-03-28',
'2017-03-29', '2017-03-30', '2017-03-31', '2017-04-01'],
dtype='datetime64[ns]', freq='D')
In [58]: pd.date_range('4/1/2017','6/1/2017',freq='BM') # frequency 'BM': last business day of each month
Out[58]: DatetimeIndex(['2017-04-28', '2017-05-31'], dtype='datetime64[ns]', freq='BM')
In [59]: pd.date_range('5/3/2017 12:34:12',periods=5) # the time of day of the start timestamp is preserved
Out[59]:
DatetimeIndex(['2017-05-03 12:34:12', '2017-05-04 12:34:12',
'2017-05-05 12:34:12', '2017-05-06 12:34:12',
'2017-05-07 12:34:12'],
dtype='datetime64[ns]', freq='D')
In [60]: pd.date_range('5/3/2017 12:34:12',periods=5, normalize=True) # normalize=True resets the times to midnight
Out[60]:
DatetimeIndex(['2017-05-03', '2017-05-04', '2017-05-05', '2017-05-06',
'2017-05-07'],
dtype='datetime64[ns]', freq='D')
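The call patterns above can be cross-checked against each other on a shorter range. This is a sketch with made-up variable names. The business-month-end frequency is passed as a `BMonthEnd()` offset object rather than the `'BM'` string, because alias spellings have changed between pandas releases while the offset classes have not.

```python
import pandas as pd
from pandas.tseries.offsets import BMonthEnd

# Three ways to describe the same five days.
by_bounds = pd.date_range("2017-04-01", "2017-04-05")
by_start = pd.date_range(start="2017-04-01", periods=5)
by_end = pd.date_range(end="2017-04-05", periods=5)

# Last business day of each month that falls inside the interval.
business_month_ends = pd.date_range("2017-04-01", "2017-06-01", freq=BMonthEnd())

# normalize=True drops the time of day from the generated timestamps.
normalized = pd.date_range("2017-05-03 12:34:12", periods=2, normalize=True)

print(business_month_ends)
print(normalized)
```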
3.2 Frequencies and Date Offsets
In [61]: # The date offsets used as frequencies can be created explicitly
In [62]: from pandas.tseries.offsets import Hour
In [63]: four_hours = Hour(4)
In [64]: four_hours
Out[64]: <4 * Hours>
In [65]: # Or specify the frequency directly with a string such as '4H'
In [66]: pd.date_range('1/1/2017', '1/3/2017 22:25',freq='4H')
Out[66]:
DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 04:00:00',
'2017-01-01 08:00:00', '2017-01-01 12:00:00',
'2017-01-01 16:00:00', '2017-01-01 20:00:00',
'2017-01-02 00:00:00', '2017-01-02 04:00:00',
'2017-01-02 08:00:00', '2017-01-02 12:00:00',
'2017-01-02 16:00:00', '2017-01-02 20:00:00',
'2017-01-03 00:00:00', '2017-01-03 04:00:00',
'2017-01-03 08:00:00', '2017-01-03 12:00:00',
'2017-01-03 16:00:00', '2017-01-03 20:00:00'],
dtype='datetime64[ns]', freq='4H')
In [67]: pd.date_range('1/1/2017', '1/3/2017 22:25',freq=four_hours)
Out[67]:
DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 04:00:00',
'2017-01-01 08:00:00', '2017-01-01 12:00:00',
'2017-01-01 16:00:00', '2017-01-01 20:00:00',
'2017-01-02 00:00:00', '2017-01-02 04:00:00',
'2017-01-02 08:00:00', '2017-01-02 12:00:00',
'2017-01-02 16:00:00', '2017-01-02 20:00:00',
'2017-01-03 00:00:00', '2017-01-03 04:00:00',
'2017-01-03 08:00:00', '2017-01-03 12:00:00',
'2017-01-03 16:00:00', '2017-01-03 20:00:00'],
dtype='datetime64[ns]', freq='4H')
In [68]: from pandas.tseries.offsets import Hour,Minute
# Offsets can be added together to build an offset of the desired length
In [69]: Hour(1) + Minute(30)
Out[69]: <90 * Minutes>
# Or use the simpler equivalent string
In [70]: pd.date_range('1/1/2017',periods=3, freq='1h30min')
Out[70]:
DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 01:30:00',
'2017-01-01 03:00:00'],
dtype='datetime64[ns]', freq='90T')
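As a quick check that the offset object and the string form describe the same frequency, the sketch below builds the range both ways. The string is written as `'90min'` here, which is an assumption of convenience: it is an equivalent spelling of the `'1h30min'` used above.

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# Adding two fixed-length offsets yields a single combined offset.
ninety_minutes = Hour(1) + Minute(30)

# The same range built from the offset object and from a string.
by_offset = pd.date_range("2017-01-01", periods=3, freq=ninety_minutes)
by_string = pd.date_range("2017-01-01", periods=3, freq="90min")

print(ninety_minutes)
print(by_offset)
```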
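Some offsets are irregular (a month end, for example, falls a different number of days after the previous one depending on the month). pandas ships with a set of built-in date offsets for these, listed in the table below: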
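To make "irregular" concrete, the sketch below uses three of the built-in offset classes from `pandas.tseries.offsets`. The start date is arbitrary, and the three classes are only a small sample chosen for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import BMonthEnd, MonthBegin, MonthEnd

start = pd.Timestamp("2017-01-15")

# Month ends are not evenly spaced: the gap depends on the month length.
month_ends = pd.date_range(start, periods=3, freq=MonthEnd())
gaps_in_days = [delta.days for delta in (month_ends[1:] - month_ends[:-1])]

# Adding an anchored offset moves a timestamp to the next anchor point.
next_month_begin = start + MonthBegin()   # first day of the next month
next_bmonth_end = start + BMonthEnd()     # last business day of this month

print(list(month_ends))
print(gaps_in_days)
```

The gaps come out as 28 and 31 days, which is why such offsets cannot be expressed as a fixed length of time such as Hour(4).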
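3.3 Shifting (Leading and Lagging) Data
shift moves the data forward or backward along the time axis; without freq, the index itself is left unchanged.
In [71]: ts = Series(np.random.randn(4), index=pd.date_range('1/1/2017',periods=4, freq='M'))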
In [72]: ts
Out[72]:
2017-01-31 -0.080326
2017-02-28 0.432715
2017-03-31 1.094710
2017-04-30 -1.024227
Freq: M, dtype: float64
In [73]: ts.shift(2) # move values 2 periods later (a lag); the first 2 entries become NaN
Out[73]:
2017-01-31 NaN
2017-02-28 NaN
2017-03-31 -0.080326
2017-04-30 0.432715
Freq: M, dtype: float64
In [74]: ts.shift(-2) # move values 2 periods earlier (a lead); the last 2 entries become NaN
Out[74]:
2017-01-31 1.094710
2017-02-28 -1.024227
2017-03-31 NaN
2017-04-30 NaN
Freq: M, dtype: float64
# Growth rate of each month relative to the previous month
In [76]: ts/ts.shift(1) - 1
Out[76]:
2017-01-31 NaN
2017-02-28 -6.386994
2017-03-31 1.529866
2017-04-30 -1.935615
Freq: M, dtype: float64
# With freq, the timestamps are advanced and the data stay in their rows
In [78]: ts.shift(2, freq='M')
Out[78]:
2017-03-31 -0.080326
2017-04-30 0.432715
2017-05-31 1.094710
2017-06-30 -1.024227
Freq: M, dtype: float64
# Other frequencies can be passed as well, which is more flexible
In [79]: ts.shift(3, freq='D')
Out[79]:
2017-02-03 -0.080326
2017-03-03 0.432715
2017-04-03 1.094710
2017-05-03 -1.024227
dtype: float64
In [80]: ts.shift(1, freq='3D')
Out[80]:
2017-02-03 -0.080326
2017-03-03 0.432715
2017-04-03 1.094710
2017-05-03 -1.024227
dtype: float64
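The two flavours of shift are easier to compare with round numbers. In the sketch below, the `sales` series and its values are hypothetical, and `pct_change` is included as a built-in way to get the same growth-rate calculation. Frequencies are passed as offset objects so the example does not depend on alias spellings.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import Day, MonthEnd

# Hypothetical month-end series with round numbers.
idx = pd.date_range("2017-01-31", periods=4, freq=MonthEnd())
sales = pd.Series([100.0, 110.0, 99.0, 99.0], index=idx)

# Without freq: values move one row down, the index stays put.
lagged = sales.shift(1)

# Month-over-month growth, computed by hand and with the built-in helper.
growth = sales / lagged - 1
growth_builtin = sales.pct_change()

# With freq: the index moves by 3 days, the values stay in their rows.
moved_index = sales.shift(1, freq=Day(3))

print(growth)
print(moved_index)
```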
Shifting Dates with Offsets
# Day: shift by days; a count can be passed in
# MonthEnd: shift to the end of the month
In [81]: from pandas.tseries.offsets import Day,MonthEnd
In [82]: now = datetime(2017,2,18)
In [83]: now + 3 * Day() # compute dates directly with + and -
Out[83]: Timestamp('2017-02-21 00:00:00')
In [84]: now + MonthEnd() # shift to the end of the month
Out[84]: Timestamp('2017-02-28 00:00:00')
In [85]: now + MonthEnd(1) # same as MonthEnd(): still this month's end; MonthEnd(2) would give the next month's end
Out[85]: Timestamp('2017-02-28 00:00:00')
In [86]: offset = MonthEnd()
In [87]: offset.rollforward(now) # roll forward to the end of this month
Out[87]: Timestamp('2017-02-28 00:00:00')
In [88]: offset.rollback(now) # roll back to the end of the previous month
Out[88]: Timestamp('2017-01-31 00:00:00')
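The difference between adding an anchored offset and rolling is clearest when the date already sits on the anchor. The sketch below compares the two for a mid-month date and for a month-end date; the variable names are made up for illustration.

```python
from datetime import datetime

import pandas as pd
from pandas.tseries.offsets import MonthEnd

mid_month = datetime(2017, 2, 18)
month_end = pd.Timestamp("2017-02-28")
offset = MonthEnd()

# From a mid-month date, n counts month ends starting with the current month.
this_month_end = mid_month + MonthEnd(1)
next_month_end = mid_month + MonthEnd(2)

# From a date already on a month end, adding the offset always moves on...
added_on_anchor = month_end + offset
# ...whereas rollforward/rollback leave an on-anchor date unchanged.
rolled_forward_on_anchor = offset.rollforward(month_end)
rolled_back_on_anchor = offset.rollback(month_end)

print(this_month_end, next_month_end)
print(added_on_anchor, rolled_forward_on_anchor, rolled_back_on_anchor)
```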
In [90]: ts = Series(np.random.randn(20),index=pd.date_range('2/18/2017',periods=20, freq='4d'))
In [91]: ts.groupby(offset.rollforward).mean() # roll each date forward to its month end, group, and take the mean
Out[91]:
2017-02-28 -0.536243
2017-03-31 -0.373386
2017-04-30 0.131691
2017-05-31 1.775742
dtype: float64
In [92]: ts.resample('M',how='mean') # resample is simpler, but the how= argument is deprecated
FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).mean()
Out[92]:
2017-02-28 -0.536243
2017-03-31 -0.373386
2017-04-30 0.131691
2017-05-31 1.775742
Freq: M, dtype: float64
In [93]: ts.resample('M').mean()
Out[93]:
2017-02-28 -0.536243
2017-03-31 -0.373386
2017-04-30 0.131691
2017-05-31 1.775742
Freq: M, dtype: float64
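That the groupby and resample routes agree can be verified with deterministic data. This is a sketch assuming the same date range as above but with the values 0 to 19 in place of random numbers. The monthly frequency is passed as a `MonthEnd()` object, since the `'M'` alias has been renamed in newer pandas releases.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Same dates as the transcript, with deterministic values 0..19.
idx = pd.date_range("2017-02-18", periods=20, freq="4D")
values = pd.Series(np.arange(20, dtype=float), index=idx)

# Route 1: map every timestamp to its month end, then group on that key.
by_rollforward = values.groupby(MonthEnd().rollforward).mean()

# Route 2: let resample bin by month end directly.
by_resample = values.resample(MonthEnd()).mean()

print(by_rollforward)
print(by_resample)
```

Note that resample also emits a row for any month with no observations, while the groupby route only has the groups that actually occur. The two coincide here because every month in the range has data.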
To be continued...