pandas: Time Series Indexing
import numpy as np
import pandas as pd
Introduction
A basic kind of time series object in pandas is a Series indexed by timestamps, which are often represented outside of pandas as Python strings or datetime objects:
from datetime import datetime
dates = [
datetime(2011, 1, 2),
datetime(2011, 1, 5),
datetime(2011, 1, 7),
datetime(2011, 1, 8),
datetime(2011, 1, 10),
datetime(2011, 1, 12)
]
ts = pd.Series(np.random.randn(6), index=dates)
ts
2011-01-02 0.825502
2011-01-05 0.453766
2011-01-07 0.077024
2011-01-08 -1.320742
2011-01-10 -1.109912
2011-01-12 -0.469907
dtype: float64
Under the hood, these datetime objects have been put in a DatetimeIndex:
ts.index
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
'2011-01-10', '2011-01-12'],
dtype='datetime64[ns]', freq=None)
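Since timestamps often arrive as strings rather than datetime objects, here is a minimal sketch of building the same kind of index from strings with pd.to_datetime. The date strings and the name ts_from_strings are illustrative assumptions, not data from the example above.

```python
import numpy as np
import pandas as pd

# Illustrative date strings (assumed for this sketch)
date_strings = ['2011-01-02', '2011-01-05', '2011-01-07']

# pd.to_datetime parses a list of strings into a DatetimeIndex
index = pd.to_datetime(date_strings)
ts_from_strings = pd.Series(np.arange(3), index=index)

print(type(ts_from_strings.index).__name__)  # DatetimeIndex
print(ts_from_strings.index.dtype)
```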
Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:
ts + ts[::2]
2011-01-02 1.651004
2011-01-05 NaN
2011-01-07 0.154049
2011-01-08 NaN
2011-01-10 -2.219823
2011-01-12 NaN
dtype: float64
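The alignment shown above can be reproduced with fixed values, which makes the NaN pattern easier to check by hand. This is only a sketch; the values and the names s, total, and filled are made up for illustration. Series.add with fill_value is one possible way to avoid the NaN results, if treating missing dates as zero is acceptable.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# s[::2] keeps positions 0 and 2; dates missing from it become NaN in the sum
total = s + s[::2]
print(total)

# Treat dates missing from one operand as 0 instead of producing NaN
filled = s.add(s[::2], fill_value=0)
print(filled)
```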
Recall that ts[::2] selects every second element in ts.
pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution:
ts.index.dtype
dtype('<M8[ns]')
Scalar values from a DatetimeIndex are Timestamp objects:
stamp = ts.index[0]
stamp
Timestamp('2011-01-02 00:00:00')
A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations. More on both of these things later.
(That is, it supports the various conversion operations needed for time series.)
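To make the substitution claim concrete, here is a small sketch showing that a Timestamp passes for a datetime and can be localized and converted between time zones. The time zone 'Asia/Tokyo' is chosen only for illustration, and the example assumes a time zone database is available in the environment.

```python
from datetime import datetime
import pandas as pd

stamp = pd.Timestamp('2011-01-02')

# Timestamp is a subclass of datetime, so it compares equal to one
print(isinstance(stamp, datetime))    # True
print(stamp == datetime(2011, 1, 2))  # True

# Attach a time zone, then convert to another one
utc_stamp = stamp.tz_localize('UTC')
tokyo_stamp = utc_stamp.tz_convert('Asia/Tokyo')
print(tokyo_stamp)
```

Both utc_stamp and tokyo_stamp refer to the same instant; only the wall-clock representation differs.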
Indexing and Slicing
A time series behaves like any other pandas.Series when you are indexing and selecting data based on label:
stamp = ts.index[2]
ts[stamp]
0.0770243257021936
As a convenience, you can also pass a string that is interpretable as a date:
ts['1/10/2011']
-1.109911691867437
ts['20110110']
-1.109911691867437
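The three lookup styles (string, datetime, Timestamp) can be compared side by side on fixed data. This is a sketch with made-up values; it uses ISO-formatted strings, which avoid the month/day ambiguity of a string like '1/10/2011' (parsed month-first by default), and .loc to make the label-based lookup explicit.

```python
from datetime import datetime
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2011-01-07', '2011-01-08', '2011-01-10'])
s = pd.Series(np.arange(3), index=idx)

# All three labels refer to the same entry
by_string = s.loc['2011-01-10']
by_datetime = s.loc[datetime(2011, 1, 10)]
by_timestamp = s.loc[pd.Timestamp('2011-01-10')]

print(by_string, by_datetime, by_timestamp)  # 2 2 2
```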
For longer time series, a year or only a year and month can be passed to easily select slices of data:
longer_ts = pd.Series(np.random.randn(1000),
index=pd.date_range('1/1/2000', periods=1000))
longer_ts[:5]
2000-01-01 0.401394
2000-01-02 0.720214
2000-01-03 0.488505
2000-01-04 0.446179
2000-01-05 -2.129299
Freq: D, dtype: float64
longer_ts['2001'][:5]
2001-01-01 0.315472
2001-01-02 0.796386
2001-01-03 0.611503
2001-01-04 0.980799
2001-01-05 0.184401
Freq: D, dtype: float64
Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify the month:
longer_ts['2001-05'][:5]
2001-05-01 0.439009
2001-05-02 -0.304236
2001-05-03 0.603268
2001-05-04 -0.726460
2001-05-05 -0.521669
Freq: D, dtype: float64
Slicing with datetime objects works as well:
ts[datetime(2011, 1, 7):]
2011-01-07 0.077024
2011-01-08 -1.320742
2011-01-10 -1.109912
2011-01-12 -0.469907
dtype: float64
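The year, year-month, and datetime-slice selections above can be checked on deterministic data, where the expected lengths follow directly from the calendar. This is a sketch; the name daily and the integer values are assumptions for illustration, and .loc is used to keep the selection explicitly label-based.

```python
from datetime import datetime
import numpy as np
import pandas as pd

# 1000 consecutive days starting 2000-01-01 (ends 2002-09-26)
daily = pd.Series(np.arange(1000),
                  index=pd.date_range('2000-01-01', periods=1000))

year_2001 = daily.loc['2001']                 # every day in 2001
may_2001 = daily.loc['2001-05']               # every day in May 2001
tail = daily.loc[datetime(2002, 9, 20):]      # from a datetime to the end

print(len(year_2001))  # 365
print(len(may_2001))   # 31
print(len(tail))       # 7
```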
Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:
ts
2011-01-02 0.825502
2011-01-05 0.453766
2011-01-07 0.077024
2011-01-08 -1.320742
2011-01-10 -1.109912
2011-01-12 -0.469907
dtype: float64
ts['1/6/2011': '1/11/2011']
2011-01-07 0.077024
2011-01-08 -1.320742
2011-01-10 -1.109912
dtype: float64
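A range query with endpoints that are not in the index can be sketched with fixed values as follows. The data is made up for illustration. Note that this relies on the index being sorted; on an unsorted index, slicing with labels that are not present is not expected to work.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                      '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series(np.arange(6), index=idx)

# Neither 2011-01-06 nor 2011-01-11 is in the index;
# the slice returns everything that falls between them
window = s.loc['2011-01-06':'2011-01-11']
print(window.tolist())  # [2, 3, 4]
```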
As before, you can pass a string date, a datetime, or a Timestamp. Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays: no data is copied, and modifications to the slice are reflected in the original data. (This describes pandas without Copy-on-Write; with Copy-on-Write enabled, modifying the slice leaves the original unchanged.)
There is an equivalent instance method, truncate, that slices a Series between two dates:
ts.truncate(after='1/9/2011')
2011-01-02 0.825502
2011-01-05 0.453766
2011-01-07 0.077024
2011-01-08 -1.320742
dtype: float64
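As a quick check that truncate and label slicing select the same rows, here is a sketch on made-up data using both the before and after arguments of truncate. The boundary dates are arbitrary and chosen so that neither appears in the index.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                      '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series(np.arange(6), index=idx)

# Keep only rows between the two dates (both bounds inclusive)
truncated = s.truncate(before='2011-01-04', after='2011-01-09')
sliced = s.loc['2011-01-04':'2011-01-09']

print(truncated.tolist())        # [1, 2, 3]
print(truncated.equals(sliced))  # True
```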
All of this holds true for DataFrame as well, indexing on its rows:
# periods: how many dates to generate; freq: the spacing between them
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100, 4),
index=dates,
columns=['Colorado', 'Texas', 'New York', 'Ohio'])
long_df.loc['5-2001']
| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2001-05-02 | 0.972317 | 0.407519 | 0.628906 | 1.995901 |
| 2001-05-09 | 0.299961 | -1.208505 | 1.019247 | 2.244728 |
| 2001-05-16 | 0.628163 | -0.716498 | 0.621912 | 1.257635 |
| 2001-05-23 | 0.508852 | 0.753517 | -0.793127 | 0.273496 |
| 2001-05-30 | -1.443141 | -0.878143 | -0.680227 | 0.455401 |
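The same row selection can be sketched with deterministic values, which makes it easy to confirm that a year-month string picks out exactly the Wednesdays in that month. The name df and the integer data are assumptions for illustration; the ISO-style string '2001-05' is used in place of '5-2001'.

```python
import numpy as np
import pandas as pd

# 100 weekly dates, each falling on a Wednesday
wednesdays = pd.date_range('2000-01-01', periods=100, freq='W-WED')
df = pd.DataFrame(np.arange(400).reshape(100, 4),
                  index=wednesdays,
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# Partial string selection on the row index via .loc
may = df.loc['2001-05']

print(len(may))                                # 5
print(may.index.day_name().unique().tolist())  # ['Wednesday']
```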
Duplicate Indices
- ts.index.is_unique
- ts.groupby(level=0)
In some applications, there may be multiple data observations falling on a particular timestamp. Here is an example:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000',
'1/2/2000', '1/2/2000', '1/3/2000'
])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts
2000-01-01 0
2000-01-02 1
2000-01-02 2
2000-01-02 3
2000-01-03 4
dtype: int32
We can tell that the index is not unique by checking its is_unique property:
dup_ts.index.is_unique
False
Indexing into this time series will now produce either scalar values or slices, depending on whether a timestamp is duplicated:
dup_ts['1/3/2000'] # not duplicated
4
dup_ts['1/2/2000'] # duplicated
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int32
Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0:
grouped = dup_ts.groupby(level=0)  # level (or by) is required; both default to None, and omitting both raises an error
grouped.mean()
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int32
grouped.count()
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
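The aggregation above, plus one alternative for handling duplicates, can be sketched as follows. Keeping only the first observation per timestamp with Index.duplicated is not part of the original example; it is shown here as one possible option when aggregation is not wanted. Recent pandas versions return the group means as floats.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup = pd.Series(np.arange(5), index=dates)

# Aggregate observations that share a timestamp
means = dup.groupby(level=0).mean()
counts = dup.groupby(level=0).count()

# Alternative: drop repeats, keeping the first observation per timestamp
first_only = dup[~dup.index.duplicated(keep='first')]

print(counts.tolist())      # [1, 3, 1]
print(first_only.tolist())  # [0, 1, 4]
```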