[pandas] User Guide: 1. 10 Minutes to pandas

1. Import NumPy and pandas

import numpy as np
import pandas as pd

2. Object creation

  • Create a Series from a list of values; the index defaults to integers
  • Create a DataFrame from a NumPy array with a datetime index and labeled columns
  • Create a DataFrame from a dict of Series-like objects

In short, pandas is flexible about how objects are created.

s = pd.Series([1, 3, 4, np.nan, 6, 8])
s
0    1.0
1    3.0
2    4.0
3    NaN
4    6.0
5    8.0
dtype: float64
dates = pd.date_range('20220514', periods=6)
dates
DatetimeIndex(['2022-05-14', '2022-05-15', '2022-05-16', '2022-05-17',
               '2022-05-18', '2022-05-19'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
A B C D
2022-05-14 1.714341 -1.979479 -0.300576 -1.088862
2022-05-15 0.820437 1.266999 -1.408219 1.364722
2022-05-16 -0.807610 0.197376 0.002867 2.958108
2022-05-17 0.495665 -0.270284 -0.243175 -0.715608
2022-05-18 -0.259270 0.256750 0.619412 1.899276
2022-05-19 1.216169 -0.220087 -1.994829 -0.866881
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20220514'),
                    'C': pd.Series(1, index = list(range(4)),dtype='float32'),
                    'D':np.array([3]*4,dtype='int32'),
                    'E': pd.Categorical(["test","train","test","train"]),
                    'F': 'foo'})
df2
A B C D E F
0 1.0 2022-05-14 1.0 3 test foo
1 1.0 2022-05-14 1.0 3 train foo
2 1.0 2022-05-14 1.0 3 test foo
3 1.0 2022-05-14 1.0 3 train foo
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

3. Viewing data

df.head() # view the top rows; defaults to the first 5, or pass a number n for the first n rows
A B C D
2022-05-14 1.714341 -1.979479 -0.300576 -1.088862
2022-05-15 0.820437 1.266999 -1.408219 1.364722
2022-05-16 -0.807610 0.197376 0.002867 2.958108
2022-05-17 0.495665 -0.270284 -0.243175 -0.715608
2022-05-18 -0.259270 0.256750 0.619412 1.899276
df.tail(3)  # view the last 3 rows
A B C D
2022-05-17 0.495665 -0.270284 -0.243175 -0.715608
2022-05-18 -0.259270 0.256750 0.619412 1.899276
2022-05-19 1.216169 -0.220087 -1.994829 -0.866881
df.index 
DatetimeIndex(['2022-05-14', '2022-05-15', '2022-05-16', '2022-05-17',
               '2022-05-18', '2022-05-19'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() returns the underlying data as a NumPy array.
Note that when the DataFrame's columns hold mixed data types this operation can be expensive, which reflects a fundamental difference between pandas and NumPy:

  • a NumPy array has a single dtype for the whole array,
  • a DataFrame has one dtype per column.
  • pandas looks for the NumPy dtype that can hold all of the DataFrame's dtypes.
  • .to_numpy() does not include the index or column labels in its output.
    That common dtype may end up being object, which forces every value to be cast to a Python object.
df.to_numpy()
array([[ 1.71434051e+00, -1.97947895e+00, -3.00576349e-01,
        -1.08886170e+00],
       [ 8.20437240e-01,  1.26699927e+00, -1.40821878e+00,
         1.36472164e+00],
       [-8.07610335e-01,  1.97375765e-01,  2.86653595e-03,
         2.95810777e+00],
       [ 4.95664812e-01, -2.70283518e-01, -2.43174905e-01,
        -7.15608296e-01],
       [-2.59269764e-01,  2.56749508e-01,  6.19412044e-01,
         1.89927592e+00],
       [ 1.21616869e+00, -2.20086698e-01, -1.99482874e+00,
        -8.66880756e-01]])
df2.to_numpy()
array([[1.0, Timestamp('2022-05-14 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2022-05-14 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2022-05-14 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2022-05-14 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.529955 -0.124787 -0.554087 0.591792
std 0.935005 1.063632 0.964729 1.706927
min -0.807610 -1.979479 -1.994829 -1.088862
25% -0.070536 -0.257734 -1.131308 -0.829063
50% 0.658051 -0.011355 -0.271876 0.324557
75% 1.117236 0.241906 -0.058644 1.765637
max 1.714341 1.266999 0.619412 2.958108
df2.describe() # summary statistics: count, mean, std, min, max, quartiles (numeric columns only)
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
df.T # transpose
2022-05-14 2022-05-15 2022-05-16 2022-05-17 2022-05-18 2022-05-19
A 1.714341 0.820437 -0.807610 0.495665 -0.259270 1.216169
B -1.979479 1.266999 0.197376 -0.270284 0.256750 -0.220087
C -0.300576 -1.408219 0.002867 -0.243175 0.619412 -1.994829
D -1.088862 1.364722 2.958108 -0.715608 1.899276 -0.866881
df.sort_index(axis=1, ascending=False)  # sort columns by label, descending
D C B A
2022-05-14 -1.088862 -0.300576 -1.979479 1.714341
2022-05-15 1.364722 -1.408219 1.266999 0.820437
2022-05-16 2.958108 0.002867 0.197376 -0.807610
2022-05-17 -0.715608 -0.243175 -0.270284 0.495665
2022-05-18 1.899276 0.619412 0.256750 -0.259270
2022-05-19 -0.866881 -1.994829 -0.220087 1.216169
df.sort_values(by='B', ascending=False)  # sort by column B; ascending=False gives descending order (ascending is the default)
A B C D
2022-05-15 0.820437 1.266999 -1.408219 1.364722
2022-05-18 -0.259270 0.256750 0.619412 1.899276
2022-05-16 -0.807610 0.197376 0.002867 2.958108
2022-05-19 1.216169 -0.220087 -1.994829 -0.866881
2022-05-17 0.495665 -0.270284 -0.243175 -0.715608
2022-05-14 1.714341 -1.979479 -0.300576 -1.088862
df.at # for production code, prefer the optimized accessors .at, .iat, .loc and .iloc
<pandas.core.indexing._AtIndexer at 0x10fda6f20>
df.iat
<pandas.core.indexing._iAtIndexer at 0x11cbcef20>
df.loc
<pandas.core.indexing._LocIndexer at 0x11cc5d940>
df.iloc
<pandas.core.indexing._iLocIndexer at 0x11cc5fdd0>

4. Selection

4.1 Getting data directly

Returns a Series or a DataFrame, depending on the usage:

  • square-bracket indexing and slicing: df['A'], df[0:3]
  • attribute access: df.A
df['A'] # selects a single column as a Series; equivalent to df.A
2022-05-14    1.714341
2022-05-15    0.820437
2022-05-16   -0.807610
2022-05-17    0.495665
2022-05-18   -0.259270
2022-05-19    1.216169
Freq: D, Name: A, dtype: float64
df.A # attribute access works too — even for non-ASCII labels: rename 'ABCD' to "阿波茨坦" and df.阿 still works
2022-05-14    1.714341
2022-05-15    0.820437
2022-05-16   -0.807610
2022-05-17    0.495665
2022-05-18   -0.259270
2022-05-19    1.216169
Freq: D, Name: A, dtype: float64
df[0:3].A # row slicing; df[0:3].A and df.A[0:3] are both valid
2022-05-14    1.714341
2022-05-15    0.820437
2022-05-16   -0.807610
Freq: D, Name: A, dtype: float64
df['20220514':'20220517']
A B C D
2022-05-14 1.714341 -1.979479 -0.300576 -1.088862
2022-05-15 0.820437 1.266999 -1.408219 1.364722
2022-05-16 -0.807610 0.197376 0.002867 2.958108
2022-05-17 0.495665 -0.270284 -0.243175 -0.715608

4.2 Selection by label

.loc selects data by label values:

.loc[[row labels], [column labels]]

  • with list arguments, the result is a DataFrame
  • with one scalar argument, the result is a Series
  • with all scalar arguments, the result is a single value
df.loc[dates[0]]
A    1.714341
B   -1.979479
C   -0.300576
D   -1.088862
Name: 2022-05-14 00:00:00, dtype: float64
df.loc[:,['A','B']]
A B
2022-05-14 1.714341 -1.979479
2022-05-15 0.820437 1.266999
2022-05-16 -0.807610 0.197376
2022-05-17 0.495665 -0.270284
2022-05-18 -0.259270 0.256750
2022-05-19 1.216169 -0.220087
df.loc['20220514':'20220517',['A','C']]
A C
2022-05-14 1.714341 -0.300576
2022-05-15 0.820437 -1.408219
2022-05-16 -0.807610 0.002867
2022-05-17 0.495665 -0.243175
df.loc['20220514',['A','C']] # a scalar row label with a list of columns "deconstructs" the DataFrame down to a Series
A    1.714341
C   -0.300576
Name: 2022-05-14 00:00:00, dtype: float64
df.loc[['20220514'],['A','B']] # note the difference from the output above: with list arguments this is still a DataFrame
A B
2022-05-14 1.714341 -1.979479
df.loc[dates[0],'A']
1.7143405095532576
df.at[dates[0],'A']  # same value; .at is the optimized (faster) accessor for single scalars
1.7143405095532576
df.loc[[dates[0]],['A']] # as expected, .loc[[rows],[cols]] always returns a DataFrame — a table, not a value
A
2022-05-14 1.714341

4.3 Selection by position

Works much like .loc, but with integer positions instead of labels.

df.iloc[3]
A    0.495665
B   -0.270284
C   -0.243175
D   -0.715608
Name: 2022-05-17 00:00:00, dtype: float64
df.iloc[0:5, 0:2] # position slices exclude the endpoint, for rows and columns alike; counting starts at 0
A B
2022-05-14 1.714341 -1.979479
2022-05-15 0.820437 1.266999
2022-05-16 -0.807610 0.197376
2022-05-17 0.495665 -0.270284
2022-05-18 -0.259270 0.256750
df.iloc[[0,1,2,3,5],[0,3]] # lists of positions select non-contiguous rows or columns
A D
2022-05-14 1.714341 -1.088862
2022-05-15 0.820437 1.364722
2022-05-16 -0.807610 2.958108
2022-05-17 0.495665 -0.715608
2022-05-19 1.216169 -0.866881
df.iloc[0:3, :] # select whole rows explicitly
A B C D
2022-05-14 1.714341 -1.979479 -0.300576 -1.088862
2022-05-15 0.820437 1.266999 -1.408219 1.364722
2022-05-16 -0.807610 0.197376 0.002867 2.958108
df.iloc[:, 1:3] # select whole columns explicitly
B C
2022-05-14 -1.979479 -0.300576
2022-05-15 1.266999 -1.408219
2022-05-16 0.197376 0.002867
2022-05-17 -0.270284 -0.243175
2022-05-18 0.256750 0.619412
2022-05-19 -0.220087 -1.994829
df.iloc[1, 2] # with two scalar positions, .iloc returns the value itself rather than a DataFrame
-1.408218776287673
df.iat[1, 3] # same behaviour, optimized for scalar access
1.364721643082539

4.4 Boolean indexing (conditional selection)

Similar to where in other libraries.

df[df.B > 0] # rows where column B is greater than 0
A B C D
2022-05-15 0.820437 1.266999 -1.408219 1.364722
2022-05-16 -0.807610 0.197376 0.002867 2.958108
2022-05-18 -0.259270 0.256750 0.619412 1.899276
df[df > 0] # every value greater than 0; values failing the condition show up as NaN
A B C D
2022-05-14 1.714341 NaN NaN NaN
2022-05-15 0.820437 1.266999 NaN 1.364722
2022-05-16 NaN 0.197376 0.002867 2.958108
2022-05-17 0.495665 NaN NaN NaN
2022-05-18 NaN 0.256750 0.619412 1.899276
2022-05-19 1.216169 NaN NaN NaN
df3 = df.copy()
df3['E'] = ['one','two','two','three','four','five'] # a new column cannot be created by attribute assignment (df3.E = ...)
df3
A B C D E
2022-05-14 1.714341 -1.979479 -0.300576 -1.088862 one
2022-05-15 0.820437 1.266999 -1.408219 1.364722 two
2022-05-16 -0.807610 0.197376 0.002867 2.958108 two
2022-05-17 0.495665 -0.270284 -0.243175 -0.715608 three
2022-05-18 -0.259270 0.256750 0.619412 1.899276 four
2022-05-19 1.216169 -0.220087 -1.994829 -0.866881 five
df3[df3.E.isin(['two','four'])] # attribute access is fine inside a condition
A B C D E
2022-05-15 0.820437 1.266999 -1.408219 1.364722 two
2022-05-16 -0.807610 0.197376 0.002867 2.958108 two
2022-05-18 -0.259270 0.256750 0.619412 1.899276 four

4.5 Setting values

# assignment aligns on the index; note the start date here is the 2nd day/row of the original df
df['F'] = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20220515', periods=6))
df
A B C D F
2022-05-14 1.714341 -1.979479 -0.300576 -1.088862 NaN
2022-05-15 0.820437 1.266999 -1.408219 1.364722 1.0
2022-05-16 -0.807610 0.197376 0.002867 2.958108 2.0
2022-05-17 0.495665 -0.270284 -0.243175 -0.715608 3.0
2022-05-18 -0.259270 0.256750 0.619412 1.899276 4.0
2022-05-19 1.216169 -0.220087 -1.994829 -0.866881 5.0
# setting by label, by position, and with a NumPy array
df.at[dates[0],'A'] = 0 # set by label: row 1, column 'A'
df.iat[0,1] = 0 # set by position: row 1, column 2
df.loc[:, 'D'] = np.array([5]*len(df)) # set a whole column from a NumPy array
df
A B C D F
2022-05-14 0.000000 0.000000 -0.300576 5 NaN
2022-05-15 0.820437 1.266999 -1.408219 5 1.0
2022-05-16 -0.807610 0.197376 0.002867 5 2.0
2022-05-17 0.495665 -0.270284 -0.243175 5 3.0
2022-05-18 -0.259270 0.256750 0.619412 5 4.0
2022-05-19 1.216169 -0.220087 -1.994829 5 5.0
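The upstream tutorial also shows a where-style operation with setting: assigning through a boolean mask. A minimal self-contained sketch (with its own small df, not the one above):

```python
import numpy as np
import pandas as pd

# Replace every positive value with its negative, via boolean-mask assignment.
df = pd.DataFrame(np.arange(6).reshape(3, 2) - 2, columns=['A', 'B'])
neg = df.copy()
neg[neg > 0] = -neg
print(neg)  # no value is positive afterwards
```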

5. Missing data

np.nan

# missing values introduced by reindexing
df5 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df5.loc[dates[0]:dates[1], 'E'] = 1 # label slices with .loc include the endpoint, unlike position slices with .iloc
df5
A B C D F E
2022-05-14 0.000000 0.000000 -0.300576 5 NaN 1.0
2022-05-15 0.820437 1.266999 -1.408219 5 1.0 1.0
2022-05-16 -0.807610 0.197376 0.002867 5 2.0 NaN
2022-05-17 0.495665 -0.270284 -0.243175 5 3.0 NaN
# drop any row containing a NaN
df5.dropna(how='any')
A B C D F E
2022-05-15 0.820437 1.266999 -1.408219 5 1.0 1.0
# fill missing values
df5.fillna(value=5)
A B C D F E
2022-05-14 0.000000 0.000000 -0.300576 5 5.0 1.0
2022-05-15 0.820437 1.266999 -1.408219 5 1.0 1.0
2022-05-16 -0.807610 0.197376 0.002867 5 2.0 5.0
2022-05-17 0.495665 -0.270284 -0.243175 5 3.0 5.0
# get a boolean mask of the NaN positions
pd.isna(df5)
A B C D F E
2022-05-14 False False False False True False
2022-05-15 False False False False False False
2022-05-16 False False False False False True
2022-05-17 False False False False False True

6. Operations

6.1. Binary operations
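This subsection is empty in the notes. A minimal sketch of binary (arithmetic) operations between pandas objects — operators work elementwise and align on labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=['A', 'B'])

print(df + df)            # elementwise addition; same as df.add(df)
print(df * 2)             # scalar broadcast to every cell
print(df['A'] - df['B'])  # Series arithmetic, aligned on the index
```

Adding a frame to itself doubles every value, so `df + df` and `df * 2` agree here.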

6.2. Stats

# mean
df.mean(0)   # first parameter axis=0 (the default): column-wise mean
A    0.244232
B    0.205126
C   -0.554087
D    5.000000
F    3.000000
dtype: float64
df.mean(1) # axis=1: row-wise mean
2022-05-14    1.174856
2022-05-15    1.335844
2022-05-16    1.278526
2022-05-17    1.596441
2022-05-18    1.923378
2022-05-19    1.800251
Freq: D, dtype: float64

Operating on objects of different shapes requires alignment first.

pandas broadcasts automatically along the specified dimension ("expands" might be the more intuitive word).

# an alignment example
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
# .shift(n) shifts the values down by n positions.
s
2022-05-14    NaN
2022-05-15    NaN
2022-05-16    1.0
2022-05-17    3.0
2022-05-18    5.0
2022-05-19    NaN
Freq: D, dtype: float64
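The notes stop before the aligned operation itself. Following the upstream tutorial, the shifted Series can be subtracted from the DataFrame column by column; rows align on the DatetimeIndex, and positions where s is NaN come out NaN. A sketch that rebuilds the document's df and s:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20220514', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Subtract s from every column of df, aligning row labels on the index.
result = df.sub(s, axis='index')
print(result)  # the rows where s is NaN (1st, 2nd and last) are all NaN
```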

6.3. The apply() function

df
A B C D F
2022-05-14 0.000000 0.000000 -0.300576 5 NaN
2022-05-15 0.820437 1.266999 -1.408219 5 1.0
2022-05-16 -0.807610 0.197376 0.002867 5 2.0
2022-05-17 0.495665 -0.270284 -0.243175 5 3.0
2022-05-18 -0.259270 0.256750 0.619412 5 4.0
2022-05-19 1.216169 -0.220087 -1.994829 5 5.0
# apply NumPy functions to the data
df.apply(np.cumsum) # cumulative sum down each column
A B C D F
2022-05-14 0.000000 0.000000 -0.300576 5 NaN
2022-05-15 0.820437 1.266999 -1.708795 10 1.0
2022-05-16 0.012827 1.464375 -1.705928 15 3.0
2022-05-17 0.508492 1.194091 -1.949103 20 6.0
2022-05-18 0.249222 1.450841 -1.329691 25 10.0
2022-05-19 1.465391 1.230754 -3.324520 30 15.0

7. Merging

7.1. Concatenation (concat)

# the name df is reused here for a fresh 10x4 DataFrame (the creation step was omitted in the notes; inferred from the output below)
df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[7:]]
pd.concat(pieces)
0 1 2 3
0 0.804641 -0.217559 2.073518 0.345552
1 -0.327518 0.562475 0.841837 0.954023
2 1.114386 -0.654050 0.387675 -0.704480
7 -1.901487 -0.539700 -1.418863 -1.481537
8 -1.021292 0.860927 1.171972 0.208986
9 -1.380867 -1.287114 -1.492172 -0.932805

7.2. Join

SQL-style merges.

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left, right
(   key  lval
 0  foo     1
 1  foo     2,
    key  rval
 0  foo     4
 1  foo     5)
pd.merge(left, right, on = 'key')
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [4, 5]})
left, right
(   key  lval
 0  foo     4
 1  bar     5,
    key  rval
 0  foo     4
 1  bar     5)
pd.merge(left, right, on = 'key')
key lval rval
0 foo 4 4
1 bar 5 5

7.3. Append

# append a row to df
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
A B C D
0 0.317374 -1.337608 -2.674631 -0.420806
1 -0.428625 0.401127 1.002601 -0.600688
2 1.032571 0.239903 -0.153389 -2.722230
3 0.346756 -0.464619 0.874594 0.438894
4 -0.705527 0.719372 0.907008 0.656376
5 0.423865 -1.262852 0.721386 -0.614503
6 2.170255 1.199904 -1.468446 -0.809605
7 -1.030749 1.182177 -0.325211 -0.919963
s  = df.iloc[3]
s
A    0.346756
B   -0.464619
C    0.874594
D    0.438894
Name: 3, dtype: float64
# This method is deprecated:
# FutureWarning: 
#   The frame.append method is deprecated and will be removed from pandas in a future version. 
#   Use pandas.concat instead.
df.append(s,ignore_index=True) 
/var/folders/wj/nc3k2r8x1l9blh3bp02_y01r0000gn/T/ipykernel_62802/1540360925.py:5: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df.append(s,ignore_index=True)
A B C D
0 0.317374 -1.337608 -2.674631 -0.420806
1 -0.428625 0.401127 1.002601 -0.600688
2 1.032571 0.239903 -0.153389 -2.722230
3 0.346756 -0.464619 0.874594 0.438894
4 -0.705527 0.719372 0.907008 0.656376
5 0.423865 -1.262852 0.721386 -0.614503
6 2.170255 1.199904 -1.468446 -0.809605
7 -1.030749 1.182177 -0.325211 -0.919963
8 0.346756 -0.464619 0.874594 0.438894
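As the FutureWarning says, pd.concat is the replacement. A sketch of the equivalent call, assuming the same df and s as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]

# Equivalent of the deprecated df.append(s, ignore_index=True):
# turn the row Series into a 1-row DataFrame and concatenate.
out = pd.concat([df, s.to_frame().T], ignore_index=True)
print(out)  # 9 rows; the last row repeats row 3
```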

8. Grouping (groupby)

“group by” refers to a process involving one or more of the following steps:

  • Splitting: dividing the data into groups based on some criteria;
  • Applying: applying a function to each group independently;
  • Combining: combining the results into a data structure.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
A B C D
0 foo one 0.319217 0.930735
1 bar one -1.410576 -0.613675
2 foo two 0.426761 0.318037
3 bar three -0.310835 -2.186474
4 foo two 1.340606 2.311946
5 bar two -0.531829 -0.047955
6 foo one 0.588822 -0.115847
7 foo three 0.274960 -0.639542
df.groupby('A').sum()
C D
A
bar -2.253241 -2.848104
foo 2.950364 2.805329
df.groupby(['A','B']).sum()
C D
A B
bar one -1.410576 -0.613675
three -0.310835 -2.186474
two -0.531829 -0.047955
foo one 0.908038 0.814888
three 0.274960 -0.639542
two 1.767366 2.629982

9. Reshaping

I haven't fully worked this out yet; for now, just know that it exists.

# data setup
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8,2),index = index, columns= ['A', 'B'])
df
df2 = df[:4]
df2
A B
first second
bar one 1.082895 -0.845524
two -1.459212 -0.990315
baz one 1.330689 0.391560
two 1.543693 -0.546439
# stack() "compresses" the columns into the index: the DataFrame becomes a Series with a deeper MultiIndex
stacked = df2.stack()
stacked
first  second   
bar    one     A    1.082895
               B   -0.845524
       two     A   -1.459212
               B   -0.990315
baz    one     A    1.330689
               B    0.391560
       two     A    1.543693
               B   -0.546439
dtype: float64
# unstack() is the inverse: with no argument it unstacks the last level; pass a level number to pick another
stacked.unstack()
A B
first second
bar one 1.082895 -0.845524
two -1.459212 -0.990315
baz one 1.330689 0.391560
two 1.543693 -0.546439
stacked.unstack(1)
second one two
first
bar A 1.082895 -1.459212
B -0.845524 -0.990315
baz A 1.330689 1.543693
B 0.391560 -0.546439
stacked.unstack(2)
A B
first second
bar one 1.082895 -0.845524
two -1.459212 -0.990315
baz one 1.330689 0.391560
two 1.543693 -0.546439

10. Pivot tables

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                       'B': ['A', 'B', 'C'] * 4,
                       'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                       'D': np.random.randn(12),
                       'E': np.random.randn(12)})
df
A B C D E
0 one A foo -0.460053 0.696100
1 one B foo -0.377592 -1.117457
2 two C foo 0.480715 1.824891
3 three A bar -0.327904 0.374917
4 one B bar -0.242910 -0.644575
5 one C bar -0.088203 -1.164071
6 two A foo -0.811716 -1.360298
7 three B foo 0.013366 0.088306
8 one C foo 1.680185 -0.770378
9 one A bar 0.813511 -0.576679
10 two B bar 0.888552 0.424111
11 three C bar -0.017553 0.335018
pd.pivot_table(df,values='D',index= ['A', 'B'], columns= 'C')
C bar foo
A B
one A 0.813511 -0.460053
B -0.242910 -0.377592
C -0.088203 1.680185
three A -0.327904 NaN
B NaN 0.013366
C -0.017553 NaN
two A NaN -0.811716
B 0.888552 NaN
C NaN 0.480715

11. Time series

Mainly conversions of frequency, time zone, timestamps and the like for time-series data.

  • Resampling to a different frequency
# frequency conversion
rng = pd.date_range('1/1/2022', periods=100, freq='S') # many date-string formats are accepted
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2022-01-01 00:00:00    353
2022-01-01 00:00:01     33
2022-01-01 00:00:02    411
2022-01-01 00:00:03    408
2022-01-01 00:00:04    236
                      ... 
2022-01-01 00:01:35    260
2022-01-01 00:01:36     60
2022-01-01 00:01:37    226
2022-01-01 00:01:38    361
2022-01-01 00:01:39    288
Freq: S, Length: 100, dtype: int64
ts.resample('5Min').sum() 
# ts.resample('5Min') by itself returns a <pandas.core.resample.DatetimeIndexResampler object at 0x111f64730>;
# an aggregation such as .sum() then produces the resampled Series
2022-01-01    23659
Freq: 5T, dtype: int64
  • Time zone representation
rng = pd.date_range('3/6/2022 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2022-03-06    0.229783
2022-03-07    1.244212
2022-03-08   -0.246107
2022-03-09    0.678679
2022-03-10    1.117801
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC') # tz_localize attaches a time zone; tz_convert converts between zones
ts_utc
2022-03-06 00:00:00+00:00    0.229783
2022-03-07 00:00:00+00:00    1.244212
2022-03-08 00:00:00+00:00   -0.246107
2022-03-09 00:00:00+00:00    0.678679
2022-03-10 00:00:00+00:00    1.117801
Freq: D, dtype: float64
ts_utc.tz_convert('US/Eastern')
2022-03-05 19:00:00-05:00    0.229783
2022-03-06 19:00:00-05:00    1.244212
2022-03-07 19:00:00-05:00   -0.246107
2022-03-08 19:00:00-05:00    0.678679
2022-03-09 19:00:00-05:00    1.117801
Freq: D, dtype: float64
  • Converting between period representations
    e.g. displaying dates as months, month-ends, and so on.
rng = pd.date_range('1/1/2022', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2022-01-31   -0.084049
2022-02-28   -0.291836
2022-03-31   -1.579093
2022-04-30    0.332739
2022-05-31   -0.247872
Freq: M, dtype: float64
ps = ts.to_period() # convert timestamps to periods (months here)
ps
2022-01   -0.084049
2022-02   -0.291836
2022-03   -1.579093
2022-04    0.332739
2022-05   -0.247872
Freq: M, dtype: float64
ps.to_timestamp() # convert periods back to timestamps
2022-01-01   -0.084049
2022-02-01   -0.291836
2022-03-01   -1.579093
2022-04-01    0.332739
2022-05-01   -0.247872
Freq: MS, dtype: float64

Example: convert a quarterly frequency with the year ending in November to 9am on the first day of the month following each quarter's end.

  • .asfreq()
  • .period_range()
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9 # quarter end -> next month -> start of that month at hourly freq -> +9 hours = 09:00
ts.head()
1990-03-01 09:00    1.473524
1990-06-01 09:00    2.420399
1990-09-01 09:00    0.157428
1990-12-01 09:00    1.361583
1991-03-01 09:00    0.654051
Freq: H, dtype: float64
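The dense one-liner above can be unpacked step by step (a sketch using a shorter range):

```python
import pandas as pd

prng = pd.period_range('1990Q1', '1990Q4', freq='Q-NOV')

m_end = prng.asfreq('M', 'e')     # last month of each quarter: 1990Q1 -> 1990-02
m_next = m_end + 1                # first month after the quarter end: 1990-03
h9 = m_next.asfreq('H', 's') + 9  # hourly freq, start of that month, plus 9 hours -> 09:00

print(h9[0])  # 1990-03-01 09:00
```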

12. Categoricals

Assigns categories to values in a DataFrame. At first glance this looks useful for tagging balance-sheet line items as "assets" or "liabilities" — though on closer reading that may not be quite what it is for.

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6], 
    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']
    })
df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
df['grade'] = df['raw_grade'].astype('category')
df['grade'], df
(0    a
 1    b
 2    b
 3    a
 4    a
 5    e
 Name: grade, dtype: category
 Categories (3, object): ['a', 'b', 'e'],
    id raw_grade grade
 0   1         a     a
 1   2         b     b
 2   3         b     b
 3   4         a     a
 4   5         a     a
 5   6         e     e)

Series.cat.categories: rename the categories to more meaningful names.

df['grade'].cat.categories= ['very good','good', 'bad']
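In newer pandas versions (2.0+), assigning to .cat.categories directly is no longer supported; rename_categories is the stable spelling of the same renaming:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df['grade'] = df['raw_grade'].astype('category')

# Rename the categories positionally: 'a' -> 'very good', 'b' -> 'good', 'e' -> 'bad'.
df['grade'] = df['grade'].cat.rename_categories(['very good', 'good', 'bad'])
print(df['grade'].cat.categories.tolist())  # ['very good', 'good', 'bad']
```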

Series.cat.set_categories: reorder the categories and add any missing ones.

df['grade'] = df['grade'].cat.set_categories(['very bad', 'bad','medium','good','very good'])
df['grade'] # differs from the upstream example, where 'e' maps to 'very bad' — here it is 'bad', because the renaming above used 'bad' as the third name
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
df.sort_values(by='grade') # sorts by category order (very bad < bad < medium < good < very good), not lexical order
id raw_grade grade
5 6 e bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
df.groupby('grade').size() # differs from the upstream output for the same renaming reason; note that empty categories still appear, with count 0
grade
very bad     0
bad          1
medium       0
good         2
very good    3
dtype: int64

13. Plotting

Visualization is done with matplotlib.

 # a single time series
 ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000', periods=1000))
 ts = ts.cumsum()
 ts.plot()
<AxesSubplot:>
# several time series
import matplotlib.pyplot as plt # pandas plotting builds on matplotlib, which is worth learning in its own right

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure() # creates a new (empty) matplotlib figure
df.plot()
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x11e2d6a70>




<Figure size 432x288 with 0 Axes>

14. Getting data in/out

Reading from and writing to various data formats:

Format   Write                                  Read
CSV      df.to_csv('/path')                     pd.read_csv('/path')
HDF5     df.to_hdf('/path', key)                pd.read_hdf('/path', key)
Excel    df.to_excel('/path', sheet_name=...)   pd.read_excel('/path', sheet_name, index_col=..., na_values=['NA'])

  • df: a DataFrame object
  • pd: the pandas module
# CSV
df.to_csv('test.csv')
pd.read_csv('test.csv')
Unnamed: 0 A B C D
0 2000-01-01 -0.202417 0.456969 -2.924227 0.515216
1 2000-01-02 -0.438973 -0.123265 -2.138087 0.180029
2 2000-01-03 -1.314234 0.729317 -3.167284 -0.970958
3 2000-01-04 -0.334582 1.380405 -3.591633 -0.134346
4 2000-01-05 -2.179493 1.489109 -3.695347 -1.412114
... ... ... ... ... ...
995 2002-09-22 -0.249507 43.834701 -15.442732 -2.111668
996 2002-09-23 -1.792658 45.252003 -15.524048 -1.902708
997 2002-09-24 -2.880995 47.077331 -16.008287 -2.276801
998 2002-09-25 -2.008815 48.172294 -16.671531 -2.360542
999 2002-09-26 -1.788311 47.357009 -18.158131 -2.731052

1000 rows × 5 columns

# HDF5
df.to_hdf('foo.h5', 'df') # requires the PyTables library: pip install tables
pd.read_hdf('foo.h5', 'df')
A B C D
2000-01-01 -0.202417 0.456969 -2.924227 0.515216
2000-01-02 -0.438973 -0.123265 -2.138087 0.180029
2000-01-03 -1.314234 0.729317 -3.167284 -0.970958
2000-01-04 -0.334582 1.380405 -3.591633 -0.134346
2000-01-05 -2.179493 1.489109 -3.695347 -1.412114
... ... ... ... ...
2002-09-22 -0.249507 43.834701 -15.442732 -2.111668
2002-09-23 -1.792658 45.252003 -15.524048 -1.902708
2002-09-24 -2.880995 47.077331 -16.008287 -2.276801
2002-09-25 -2.008815 48.172294 -16.671531 -2.360542
2002-09-26 -1.788311 47.357009 -18.158131 -2.731052

1000 rows × 4 columns

# Excel
df.to_excel('test.xlsx', sheet_name='pandas')
pd.read_excel('test.xlsx', 'pandas', index_col = None, na_values=['NA'])
Unnamed: 0 A B C D
0 2000-01-01 -0.202417 0.456969 -2.924227 0.515216
1 2000-01-02 -0.438973 -0.123265 -2.138087 0.180029
2 2000-01-03 -1.314234 0.729317 -3.167284 -0.970958
3 2000-01-04 -0.334582 1.380405 -3.591633 -0.134346
4 2000-01-05 -2.179493 1.489109 -3.695347 -1.412114
... ... ... ... ... ...
995 2002-09-22 -0.249507 43.834701 -15.442732 -2.111668
996 2002-09-23 -1.792658 45.252003 -15.524048 -1.902708
997 2002-09-24 -2.880995 47.077331 -16.008287 -2.276801
998 2002-09-25 -2.008815 48.172294 -16.671531 -2.360542
999 2002-09-26 -1.788311 47.357009 -18.158131 -2.731052

1000 rows × 5 columns

15. Gotchas (common errors)

The link in the original notes is broken.

posted @ 2022-05-27 22:36  王大桃zzZ