pandas study notes: reading the official documentation

1. Object creation

(1) Create a simple series with pd.Series

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>>> s
0    1.0
1    3.0
2    5.0
3    NaN   # note: missing value
4    6.0
5    8.0
dtype: float64
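
The dtype in the output above is float64 even though most of the inputs are integers. A minimal sketch of why, using the hypothetical variable names `ints` and `with_nan`: NaN is a floating-point value, so a Series that contains it is stored as floats.

```python
import numpy as np
import pandas as pd

# A list of plain integers gives an integer Series
ints = pd.Series([1, 3, 5])

# Adding np.nan forces a floating-point dtype, because NaN is a float
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(ints.dtype)
print(with_nan.dtype)          # float64
print(with_nan.isna().sum())   # number of missing entries: 1
```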

(2) Create a date range with pd.date_range

>>> dates = pd.date_range('20130101', periods=6)
>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
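
As a small sketch beyond the notes, the same six days can also be requested with an end date instead of a count, and the `freq` argument changes the step. The variable names here are only illustrative.

```python
import pandas as pd

# Six consecutive days, given a start and a number of periods
by_count = pd.date_range('20130101', periods=6)

# The same days, given a start and an (inclusive) end
by_end = pd.date_range(start='2013-01-01', end='2013-01-06')

# Every second day, using the freq argument
every_other = pd.date_range('2013-01-01', periods=3, freq='2D')

print(by_count.equals(by_end))   # True
print(every_other[-1])           # 2013-01-05 00:00:00
```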

(3) DataFrame structure

>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# index sets the row labels; columns sets the column names

>>> df
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

 

>>> df2 = pd.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20130102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'E': pd.Categorical(["test", "train", "test", "train"]),
...                     'F': 'foo'})

>>> df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
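
The dict-based constructor above gives each column its own dtype, and scalar values such as 1. and 'foo' are broadcast to every row. A short sketch that rebuilds df2 and inspects the result through the `dtypes` attribute:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': 1.0,                                                  # scalar, broadcast to all rows
    'B': pd.Timestamp('20130102'),                             # scalar timestamp
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
    'F': 'foo',                                                # scalar string
})

# One dtype per column; the frame has 4 rows because C, D and E have length 4
print(df2.dtypes)
print(df2.shape)   # (4, 6)
```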

 

2. Viewing data

(1) First n rows (head) and last n rows (tail)

>>> df.head(2)
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236


>>> df.tail(3)
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

(2) Show the row labels (index), the column labels (columns), and the values (values)

>>> df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

>>> df.values
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])

(3) Quick summary statistics with describe

>>> df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

(4) Transpose: df.T
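
The notes give no example for the transpose, so here is a minimal sketch on a small hand-made frame (the data and labels are made up for illustration): rows become columns and columns become rows.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6]],
                  index=['r1', 'r2'], columns=list('ABC'))

t = df.T   # swap rows and columns

print(t.shape)           # (3, 2)
print(list(t.index))     # ['A', 'B', 'C']
print(t.loc['B', 'r2'])  # 5, the same cell as df.loc['r2', 'B']
```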

(5) Sort by axis

Descending: ascending=False
Ascending: ascending=True
Horizontal axis (sort the column labels): df.sort_index(axis=1, ascending=False)
Vertical axis (sort the row labels): df.sort_index(axis=0, ascending=False)
>>> df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
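
To make the two axes concrete, here is a small sketch with made-up labels: axis=0 reorders the rows by their labels, and axis=1 reorders the columns by their labels. The values themselves are never compared.

```python
import pandas as pd

df = pd.DataFrame([[1, 2],
                   [3, 4]],
                  index=['b', 'a'], columns=['X', 'Y'])

by_rows = df.sort_index(axis=0)                        # row labels ascending: a, b
by_cols_desc = df.sort_index(axis=1, ascending=False)  # column labels descending: Y, X

print(list(by_rows.index))          # ['a', 'b']
print(list(by_cols_desc.columns))   # ['Y', 'X']
```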

(6) Sort by values

>>> df.sort_values(by='B')
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
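
sort_values accepts the same ascending flag as sort_index. A minimal sketch on made-up data, which also shows that the call returns a new frame and leaves the original order untouched:

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2], 'B': [0.5, -2.0, 0.1]},
                  index=['x', 'y', 'z'])

asc = df.sort_values(by='B')                     # smallest B first
desc = df.sort_values(by='B', ascending=False)   # largest B first

print(list(asc.index))    # ['y', 'z', 'x']
print(list(desc.index))   # ['x', 'z', 'y']
print(list(df.index))     # ['x', 'y', 'z'], df itself is unchanged
```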

 

3. Selection (similar to MATLAB)

Select a single column (df.A is equivalent to df['A'])

Select a range of rows by slicing (df[0:3])

Select by label (df.loc[dates[0]])
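
The three selection forms above can be sketched on a frame with predictable values. Using np.arange instead of random data is a choice made here so the results are easy to check.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

col_attr = df.A          # column via attribute access
col_key = df['A']        # the same column via key access
first_three = df[0:3]    # rows at positions 0, 1, 2
row = df.loc[dates[0]]   # the row labelled 2013-01-01, returned as a Series

print(col_attr.equals(col_key))   # True
print(first_three.shape)          # (3, 4)
print(row['D'])                   # 3
```

One difference from MATLAB worth keeping in mind: the positional slice df[0:3] starts at 0 and excludes the end position, so it returns three rows.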

 

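4. Missing data

Missing values are represented by np.nan.

Drop every row that contains missing data: df.dropna(how='any')

Fill in missing data with a value: df.fillna(value=5)

Check which values are missing: pd.isna(df1), where df1 is a DataFrame that contains NaN values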
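
A minimal sketch of the three calls above. The frame df1 here is made up for illustration and is not the df1 from the official documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing entries
df1 = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                    'B': [4.0, 5.0, np.nan]})

dropped = df1.dropna(how='any')   # keep only rows without any NaN
filled = df1.fillna(value=5)      # replace every NaN with 5
mask = pd.isna(df1)               # boolean frame, True where a value is missing

print(list(dropped.index))   # [0]
print(filled.loc[1, 'A'])    # 5.0
print(mask.values.sum())     # 2
```

All three calls return new objects, so df1 still contains its NaN values afterwards.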

 

5. Statistics

Compute the mean of each column: df.mean()
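
As a small sketch on made-up numbers, df.mean() averages down each column by default, and passing axis=1 averages across each row instead.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [10.0, 20.0, 60.0]})

col_means = df.mean()         # one value per column
row_means = df.mean(axis=1)   # one value per row

print(col_means['A'], col_means['B'])   # 2.0 30.0
print(row_means.tolist())               # [5.5, 11.0, 31.5]
```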

 

6. Applying functions

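# Note: the output below appears to be copied from the official documentation, where df had been modified by this point (column D set to a constant, column F added), so it does not match the df shown above.
>>> df.apply(lambda x: x.max() - x.min())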
 
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64
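
To see what the lambda receives, here is a sketch on made-up data: apply passes each column to the function as a Series by default, and each row when axis=1 is given. A constant column yields a range of 0, which matches the D entry in the output above.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 4.0, 2.0],
                   'B': [10.0, 10.0, 10.0]})

# Range (max minus min) of every column
col_range = df.apply(lambda x: x.max() - x.min())

# Range of every row
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range.tolist())   # [3.0, 0.0]
print(row_range.tolist())   # [9.0, 6.0, 8.0]
```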

 

posted @ 2017-11-28 09:52  farmerspring