6-Pandas时序数据处理之重采样与频率转换(升降采样、resample()、OHLC、groupby()重采样)
重采样(resampling)指的是将时间序列从一个频率转换到另一个频率的过程,其中:
- 高频转为低频成为降采样(下采样)
- 低频转为高频成为升采样(上采样)
1、使用resample()方法进行重采样
例:现有一个以年月日为索引的时间序列ts,将其重采样为年月的频率,并计算均值
>>> ts = pd.Series(np.random.randint(0,10,5)) >>> ts.index = pd.date_range('2020-7-30',periods=5) >>> ts 2020-07-30 4 2020-07-31 4 2020-08-01 8 2020-08-02 6 2020-08-03 7 Freq: D, dtype: int32 >>> ts.resample('M') DatetimeIndexResampler [freq=<MonthEnd>, axis=0, closed=right, label=right, convention=start, base=0] >>> ts.resample('M').mean() 2020-07-31 4 2020-08-31 7 Freq: M, dtype: int32
2、降采样
使用resample()对数据进行降采样时,需要考虑两个问题:
- 各区间那边是闭合的(close参数的值--right即又边界闭合,left即左边界闭合)
- 如何标记各个聚合面元,用区间的开头还是末尾
例:通过求和的方式将上述数据聚合到2分钟的集合里,传入close='right'会让右边界闭合,传入close='left'会让左边界闭合
>>> ts.resample('2min',closed='right').sum() 2020-07-31 23:58:00 0 2020-08-01 00:00:00 3 2020-08-01 00:02:00 7 Freq: 2T, dtype: int32 >>> ts.resample('2min',closed='left').sum() 2020-08-01 00:00:00 1 2020-08-01 00:02:00 5 2020-08-01 00:04:00 4 Freq: 2T, dtype: int32
可以使用loffset设置索引位移,传入参数loffset一个字符串或者偏移量,即可实现对结果索引的一些位移
>>> ts.resample('T',loffset='-1s').sum() 2020-07-31 23:59:59 0 2020-08-01 00:00:59 1 2020-08-01 00:01:59 2 2020-08-01 00:02:59 3 2020-08-01 00:03:59 4 Freq: T, dtype: int32 >>> ts.resample('2T',loffset='-1s').sum() 2020-07-31 23:59:59 1 2020-08-01 00:01:59 5 2020-08-01 00:03:59 4
3、OHLC重采样
金融领域中有一种时间序列聚合方式(OHLC),计算各面元的四个值:
- O:open,开盘
- H:high,最大值
- L:low,最小值
- C:close,收盘
其后也不单单用于金融领域,O可以用于表达初始值,H表示最大值,L表示最小值,C表示末尾值。
传入how = ‘ohlc’可得到一个含有这四种聚合值的DateFrame,但是格式已经改变,如下:
>>> ts.resample('2T',closed = 'right',how = 'ohlc') __main__:1: FutureWarning: how in .resample() is deprecated the new syntax is .resample(...).ohlc() open high low close 2020-07-31 23:58:00 0 0 0 0 2020-08-01 00:00:00 1 2 1 2 2020-08-01 00:02:00 3 4 3 4 >>> ts.resample('2T',closed = 'right').ohlc() open high low close 2020-07-31 23:58:00 0 0 0 0 2020-08-01 00:00:00 1 2 1 2 2020-08-01 00:02:00 3 4 3 4
3、groupby重采样
同另一篇博文【Pandas时序数据处理(日期范围pd.date_range()、频率(基础频率表)及移动(shift()、rollforward()、rollback()))的第四部分的例子】
例:若有一时间序列数据,如何在每月月末显示该月数据的均值
无需用到 rollback 滚动,只需传入一个能够访问 ts 索引上的月份字段的函数即可
>>> rng = pd.date_range('2020-1-14',periods=100,freq='D') >>> ts = pd.Series(np.random.randint(0,10,100),index=rng) >>> ts.groupby(lambda x:x.month).mean() 1 3.722222 2 5.068966 3 4.290323 4 4.863636 dtype: float64
根据星期几对上述时间序列进行分组并求出分组后的均值,只需传一个能够访问ts索引上的星期字段函数即可
>>> ts.groupby(lambda x:x.weekday).mean() 0 4.714286 1 5.333333 2 4.800000 3 3.214286 4 4.142857 5 4.785714 6 4.714286 dtype: float64
4、升采样
在将数据从低频转换到高频时,不需要聚合。
>>> data = pd.DataFrame(np.random.randint(0,10,size=(2,4))) >>> data.index = pd.date_range('2020-1-14',periods = 2,freq='W-WED') >>> data.columns = ['one','two','three','four'] >>> data one two three four 2020-01-15 6 9 8 6 2020-01-22 5 6 7 6
将data重采样到日频率,默认会引入缺失值
>>> data.resample('D').mean() one two three four 2020-01-15 6.0 9.0 8.0 6.0 2020-01-16 NaN NaN NaN NaN 2020-01-17 NaN NaN NaN NaN 2020-01-18 NaN NaN NaN NaN 2020-01-19 NaN NaN NaN NaN 2020-01-20 NaN NaN NaN NaN 2020-01-21 NaN NaN NaN NaN 2020-01-22 5.0 6.0 7.0 6.0
假设用前面的值填充缺失值,使用ffill()实现,具体填充方式可以参考另一篇博文【Pandas数据初探索之缺失值处理与丢弃数据(填充fillna()、删除drop()、drop_duplicates()、dropna())的第二部分】
>>> data.resample('D').ffill() one two three four 2020-01-15 6 9 8 6 2020-01-16 6 9 8 6 2020-01-17 6 9 8 6 2020-01-18 6 9 8 6 2020-01-19 6 9 8 6 2020-01-20 6 9 8 6 2020-01-21 6 9 8 6 2020-01-22 5 6 7 6 >>> data.resample('D').bfill() one two three four 2020-01-15 6 9 8 6 2020-01-16 5 6 7 6 2020-01-17 5 6 7 6 2020-01-18 5 6 7 6 2020-01-19 5 6 7 6 2020-01-20 5 6 7 6 2020-01-21 5 6 7 6 2020-01-22 5 6 7 6
也可以仅填充指定的时期数(目的是限制前面观测值的持续使用距离,limit = 2表示前面的观测值只能填充往后的两行数据)
>>> data.resample('D').pad(limit=2) one two three four 2020-01-15 6.0 9.0 8.0 6.0 2020-01-16 6.0 9.0 8.0 6.0 2020-01-17 6.0 9.0 8.0 6.0 2020-01-18 NaN NaN NaN NaN 2020-01-19 NaN NaN NaN NaN 2020-01-20 NaN NaN NaN NaN 2020-01-21 NaN NaN NaN NaN 2020-01-22 5.0 6.0 7.0 6.0
5、通过日期进行重采样
对于使用时期索引的数据进行重采样较为简单,先创建一个对象:
>>> df = pd.DataFrame(np.random.randn(24,4)) >>> df.index = pd.period_range('2020-1',periods=24,freq='M') >>> df.columns = ['one','two','three','four'] >>> df one two three four 2020-01 -0.773347 0.121962 0.688172 -0.128935 2020-02 1.260893 0.949058 0.617078 -1.444115 2020-03 0.470896 2.678574 -0.789855 -0.788634 2020-04 -1.011997 -0.743128 1.118954 -0.643499 2020-05 0.139304 0.119937 0.386177 -0.395788 2020-06 -1.264226 -0.647303 0.484827 0.986434 2020-07 0.430877 -0.007752 0.484699 -0.494257 2020-08 2.734575 0.850000 1.020758 0.078646 2020-09 -0.038556 0.168716 -1.301591 0.874963 2020-10 -1.061978 0.329240 0.372740 -0.474351 2020-11 -1.744309 0.050698 -1.261978 1.312718 2020-12 0.518119 -0.062940 0.765845 1.788449 2021-01 -0.876448 0.449906 0.927772 -0.044937 2021-02 -0.515143 1.594102 0.470797 0.377561 2021-03 0.857145 0.488788 0.346126 0.588185 2021-04 -0.467256 0.338766 0.307865 -0.713797 2021-05 1.674114 -0.730812 0.486691 0.059144 2021-06 0.746407 -0.542054 0.047589 -0.616221 2021-07 0.205364 -0.865091 -0.450592 0.736776 2021-08 1.123738 0.091906 1.039720 0.776065 2021-09 1.869627 1.688411 -2.790112 -0.116390 2021-10 -1.315471 -0.085058 0.729701 0.848654 2021-11 2.065949 0.297769 -0.398484 -1.197251 2021-12 -0.466184 -0.084250 0.700341 -1.764270
传入'A-DEC'进行降采样(使用年度财政的方式)
>>> df.resample('A-DEC').mean() one two three four 2020 -0.028312 0.317255 0.215486 0.055969 2021 0.408487 0.220199 0.118118 -0.088873
传入'A-JUN'进行降采样(使用6月作为财政年度的分割单位)
>>> df.resample('A-JUN').mean() one two three four 2020 -0.196413 0.413183 0.417559 -0.402423 2021 0.188129 0.243888 0.222276 0.228009 2022 0.580504 0.173948 -0.194904 -0.119403
6、通过日期进行升采样
需决定在新频率中,各区间的哪端用于放置原来的值,convention参数默认为start,可设置为end
>>> annu_df = df.resample('A-DEC').mean() >>> annu_df one two three four 2020 -0.028312 0.317255 0.215486 0.055969 2021 0.408487 0.220199 0.118118 -0.088873 >>> annu_df.resample('Q-DEC').ffill() one two three four 2020Q1 -0.028312 0.317255 0.215486 0.055969 2020Q2 -0.028312 0.317255 0.215486 0.055969 2020Q3 -0.028312 0.317255 0.215486 0.055969 2020Q4 -0.028312 0.317255 0.215486 0.055969 2021Q1 0.408487 0.220199 0.118118 -0.088873 2021Q2 0.408487 0.220199 0.118118 -0.088873 2021Q3 0.408487 0.220199 0.118118 -0.088873 2021Q4 0.408487 0.220199 0.118118 -0.088873 >>> annu_df.resample('Q-DEC',convention = 'end').ffill() one two three four 2020Q4 -0.028312 0.317255 0.215486 0.055969 2021Q1 -0.028312 0.317255 0.215486 0.055969 2021Q2 -0.028312 0.317255 0.215486 0.055969 2021Q3 -0.028312 0.317255 0.215486 0.055969 2021Q4 0.408487 0.220199 0.118118 -0.088873