pandas-1.0.3

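I. Introduction

1. pandas (Python Data Analysis Library) is a tool built on NumPy, used mainly for data processing (cleaning, manipulating, storing, and reading data) and data analysis.

2. Official site: http://pandas.pydata.org/; documentation (PDF): https://pandas.pydata.org/docs/pandas.pdf

3. pandas provides many data structures (classes). The main ones are Series (a one-dimensional labeled array) and DataFrame (a two-dimensional table). Panel (a three-dimensional array) existed in older releases but was removed in pandas 0.25, so it is not available in 1.0.3.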

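II. Series

1. A one-dimensional array with labels (the index) that can hold any data type (int, str, float, Python objects, and so on). If no index is given, the axis labels default to integers starting at 0. A Series can be thought of as a single column of a table.

2. Creation: pd.Series(data, index=index), where data can be a dict, an ndarray, or a scalar.

(1) From a dict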

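import pandas as pd

# From a dict
# 1. Without an explicit index, the dict keys become the index and the dict values become the data
#    (with pandas >= 0.23 on Python >= 3.6 the dict's insertion order is kept)
d = {'b': 1, 'a': 0, 'c': 2}
s = pd.Series(d)
print(s)
# 2. With an explicit index, the values matching the index labels are pulled out of the dict;
#    labels with no matching key get NaN (not a number), the standard missing-data marker in pandas
s2 = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s2)
----------------------------------------------------------
b    1
a    0
c    2
dtype: int64
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64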

(2) From an ndarray

import numpy as np
import pandas as pd

# From an ndarray: if an index is given, its length must match the length of the data
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
# Without an index, a default integer index [0, ..., len(data) - 1] is created
s2 = pd.Series(np.random.randn(5))
print(s)
print(s2)
-------------------------------------------
a   -0.019921
b   -2.324644
c   -0.429393
d    1.436731
e    2.564406
dtype: float64
0   -0.925714
1    0.319075
2    0.528071
3   -0.385841
4    0.963207
dtype: float64

(3) From a scalar

import pandas as pd

# From a scalar: an index must be provided, and the value is repeated to match the length of the index
s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
print(s)

-------------------------------------------------------
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

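Because a Series is a labeled array, its values can be reached either by label or by position, and arithmetic between two Series aligns on labels. The following is a minimal sketch with made-up data (the names `s`, `t`, and `total` are illustrative only):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
by_label = s['b']         # look up by index label
by_position = s.iloc[1]   # look up by integer position

# Arithmetic aligns on labels, not positions; a label present in only
# one operand produces NaN in the result.
t = pd.Series([1, 2, 3], index=['b', 'c', 'd'])
total = s + t

print(by_label, by_position)
print(total)
```

Only the labels 'b' and 'c' exist in both operands, so only those two positions of `total` hold numbers.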

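III. DataFrame

1. A two-dimensional tabular data structure with an index (row labels) and columns (column labels). If labels are not specified, they are generated automatically from 0 to length-1.

2. Creation: pd.DataFrame(data, index=index, columns=columns), where data can be an ndarray, a list, a dict, a Series, or another DataFrame. In other words, a Series converts easily into a DataFrame.

(1) A dict {'column': Series} (recommended)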

import pandas as pd

# Dict of Series: each key becomes a column label and each value a column.
# The rows are the union of the Series indexes; columns follow the dict's insertion order
# (they are not random in pandas 1.0.3 on Python >= 3.6)
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
print(pd.DataFrame(d, index=['d', 'b', 'a']))  # if columns is not passed, the columns are the dict keys in order
print(pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three']))  # a column missing from the dict is filled with NaN
-----------------------------------------------------------
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

(2) A dict {'column': ndarray / list}

import pandas as pd

# Dict of ndarrays or lists: all arrays must have the same length
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
df = pd.DataFrame(d)
print(df)  # without an index, the result uses range(n)
print(pd.DataFrame(d, index=['a', 'b', 'c', 'd']))  # an explicit index must have the same length as the arrays
----------------------------------------
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

(3) A list of dicts [{'column': value, ...}, {...}]

import pandas as pd

# List of dicts: each dict is one record (one row), rows keep the order of the list,
# and the dict keys become the column labels
data2 = [{'a': 1, 'b': 2},
         {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data2)
print(df)
print(pd.DataFrame(data2, index=['first', 'second']))
print(pd.DataFrame(data2, columns=['a', 'b']))
-----------------------------------------------------
   a   b     c
0  1   2   NaN
1  5  10  20.0
        a   b     c
first   1   2   NaN
second  5  10  20.0
   a   b
0  1   2
1  5  10

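The creation notes above mention that a Series converts easily into a DataFrame. A short sketch of three ways to do that, using made-up data (the Series name 'one' and the label 'col' are illustrative choices):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='one')

df_from_method = s.to_frame()         # the column label is the Series name
df_from_ctor = pd.DataFrame(s)        # same result through the constructor
df_renamed = pd.DataFrame({'col': s}) # choose the column label explicitly

print(df_from_method)
print(df_renamed)
```

In each case the Series index becomes the row index of the resulting one-column table.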

3. Common attributes

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(20, 3))

print(df)
print(df.dtypes)       # dtype of each column
print(df.index)        # row labels (the index), i.e. the labels of the first axis
print(df.columns)      # column labels, i.e. the labels of the second axis
print(df.values)       # the data as a NumPy ndarray
print(df.axes)         # list of the axes: [row labels, column labels]
print(df.ndim)         # number of axes (dimensions); 2 for a DataFrame
print(df.size)         # total number of elements: rows * columns
print(df.shape)        # shape of the table as a tuple (rows, columns)
print(df.empty)        # True if the table has no elements
---------------------------------------------
           0         1         2
0   1.463760  0.889478  2.719325
1   1.151600 -1.543238 -1.092630
2  -0.915165  0.182080  0.015987
3   0.997108 -1.062458 -0.290915
4   0.506596  0.521730  1.204838
5  -1.205070  0.593703 -0.237471
6  -0.941276  0.375154  2.109682
7   0.490349 -0.333887 -2.234917
8   0.927428  1.178269 -0.252521
9  -0.734907 -1.701942  0.008140
10  0.066335 -0.279483 -1.536980
11  0.381364  0.527889 -0.735369
12 -1.759830  0.837367 -0.311767
13 -0.331585 -0.081331 -1.250890
14  0.010716  0.100442  0.030236
15 -0.718699  1.051054  0.990649
16 -0.295118  0.463517 -0.011839
17  0.216745  1.397626 -0.242623
18  0.667472  1.096342 -1.638717
19 -0.972141 -0.502762 -0.484464
0    float64
1    float64
2    float64
dtype: object
RangeIndex(start=0, stop=20, step=1)
RangeIndex(start=0, stop=3, step=1)
[[ 1.46376002  0.88947846  2.71932544]
 [ 1.15159974 -1.54323848 -1.09262975]
 [-0.91516515  0.18208043  0.01598746]
 [ 0.99710758 -1.06245845 -0.29091497]
 [ 0.506596    0.52173002  1.20483791]
 [-1.20507018  0.59370326 -0.23747127]
 [-0.94127614  0.37515381  2.10968156]
 [ 0.49034868 -0.33388734 -2.23491685]
 [ 0.9274285   1.17826853 -0.25252075]
 [-0.73490667 -1.70194233  0.00814038]
 [ 0.06633457 -0.27948286 -1.53698016]
 [ 0.3813643   0.52788901 -0.73536879]
 [-1.7598303   0.83736726 -0.31176709]
 [-0.33158538 -0.08133123 -1.25089025]
 [ 0.01071576  0.10044171  0.03023593]
 [-0.71869935  1.05105444  0.99064853]
 [-0.29511769  0.46351668 -0.01183942]
 [ 0.216745    1.3976258  -0.24262259]
 [ 0.66747248  1.0963415  -1.63871673]
 [-0.97214094 -0.50276224 -0.48446367]]
[RangeIndex(start=0, stop=20, step=1), RangeIndex(start=0, stop=3, step=1)]
2
60
(20, 3)
False

4. Common methods

(1) Viewing data and indexing: direct indexing, position-based indexing with iloc, label-based indexing with loc

import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
print(df)
# A DataFrame reads left to right and top to bottom: the leftmost column shows the index,
# and each index label identifies one record (row)
print(df.head())         # show the first rows of the table (5 by default)
print(df.tail())         # show the last rows of the table (5 by default)
print('#'*30)

# Indexing
# 1. By label
print(df.at['two', 'B'])               # a single value by row label and column label
# loc indexes by label; tuples can also be used as keys for hierarchical (MultiIndex) data
print(df.loc['one'])                   # one row by label, same as df.loc['one', :]
print(df.loc['two', 'B'])              # a single value by row and column label, same as df.loc['two'].at['B']
print(df.loc[['one', 'two'], 'A':'B']) # a list of row labels and a column label range (both ends included)
print(df.loc[df['C'] > 6, df.loc['three'] > 10])  # conditional (boolean) indexing on rows and columns

print('#'*30)
# 2. By position
print(df.iat[1, 2])                  # same as df.iloc[1].iat[2]
# iloc works like loc but takes integer positions instead of labels: a comma separates the row and
# column selectors, a colon gives a range, and a list selects several positions
print(df.iloc[2])                    # the third row (position 2)
print(df.iloc[1, 1])                 # a single value by position
print(df.iloc[0:1, [1, 2]])          # the first row, columns at positions 1 and 2

# 3. Direct indexing
print(df['A'])                      # select one column, by label only; same as df.A
print(df.transpose()['one'])        # a row can be selected by transposing first
print(df[1:2])                      # a slice selects rows by position
print(df['A']['one'])               # column label first, then row label
print(df[1:2]['A'])                 # this returns a one-dimensional Series
-------------------------------------------------------
        A   B   C
one     0   2   3
two     0   4   1
three  10  20  30
        A   B   C
one     0   2   3
two     0   4   1
three  10  20  30
        A   B   C
one     0   2   3
two     0   4   1
three  10  20  30
##############################
4
A    0
B    2
C    3
Name: one, dtype: int64
4
     A  B
one  0  2
two  0  4
        B   C
three  20  30
##############################
1
A    10
B    20
C    30
Name: three, dtype: int64
4
     B  C
one  2  3
one       0
two       0
three    10
Name: A, dtype: int64
A    0
B    2
C    3
Name: one, dtype: int64
     A  B  C
two  0  4  1
0
two    0
Name: A, dtype: int64

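Two details of loc and iloc are easy to miss in the listing above: a label slice includes both endpoints while a position slice excludes the end, and both accessors can be used for assignment. The sketch below reuses the same small table; the assigned values 40 and 99 are arbitrary.

```python
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

by_label = df.loc['one':'two']   # label slice: both endpoints are included
by_position = df.iloc[0:1]       # position slice: the end is excluded

# Assign through a single loc/iloc call rather than a chained
# expression such as df['B']['two'] = 40.
df.loc['two', 'B'] = 40
df.iloc[0, 2] = 99

print(len(by_label), len(by_position))
print(df)
```

Writing through one loc or iloc call avoids the ambiguity of chained indexing, where the intermediate object may be a copy.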

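(2) Adding, deleting, modifying, and querying

[1] Adding: appending another table to an existing one

import numpy as np
import pandas as pd

# df and df1 are not defined in the original listing; these definitions are reconstructed
# from the printed output below
df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1.0], [1.0, True, False]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   index=['two', 'three', 'four'], columns=['B', 'C', 'D'])
print('df:', df)
print('df1:', df1)

# Adding: keep formats and labels consistent where possible; missing data becomes np.nan,
# and with ignore_index=True the rows are renumbered from 0
# (1) append adds one or more rows at the bottom and returns a new table (the original is unchanged);
#     sort controls column sorting and ignore_index discards the original index.
#     Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat replaces it
print(df.append(df1, sort=False, ignore_index=False))
print(df.append([df1, df1], sort=False, ignore_index=True))  # append two tables below another one

# (2) join adds another table's columns side by side and returns a new table; it is typically used to
#     attach a smaller table to a larger one. By default the result keeps the index of the calling
#     table, and identical column labels are told apart with lsuffix/rsuffix
print(df.join(df1, lsuffix='_left', rsuffix='_right'))

# (3) assign adds columns computed by simple functions and returns a new table
print(df.assign(new_column=lambda x: x['A'] + 4))

# (4) direct assignment adds a column; the original table is modified
df['D'] = np.nan

# (5) update modifies the original table in place with the non-NA values of another table, matched on
#     index and column labels (overwriting by default); the original index and columns are kept,
#     and the method returns None
print(df.update(df1))
print(df)
-----------------------------------------------
df:          A     B      C
one    0.0   NaN      3
two    0.0            1
three  1.0  True  False
df1:        B  C  D
two    1  2  3
three  4  5  6
four   7  8  9
         A     B      C    D
one    0.0   NaN      3  NaN
two    0.0            1  NaN
three  1.0  True  False  NaN
two    NaN     1      2  3.0
three  NaN     4      5  6.0
four   NaN     7      8  9.0
     A     B      C    D
0  0.0   NaN      3  NaN
1  0.0            1  NaN
2  1.0  True  False  NaN
3  NaN     1      2  3.0
4  NaN     4      5  6.0
5  NaN     7      8  9.0
6  NaN     1      2  3.0
7  NaN     4      5  6.0
8  NaN     7      8  9.0
         A B_left C_left  B_right  C_right    D
one    0.0    NaN      3      NaN      NaN  NaN
two    0.0             1      1.0      2.0  3.0
three  1.0   True  False      4.0      5.0  6.0
         A     B      C  new_column
one    0.0   NaN      3         4.0
two    0.0            1         4.0
three  1.0  True  False         5.0
None
         A    B  C    D
one    0.0  NaN  3  NaN
two    0.0    1  2  3.0
three  1.0    4  5  6.0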

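DataFrame.append works in pandas 1.0.3 but was removed in pandas 2.0. The same row-wise stacking can be written with pd.concat, which works in both. A minimal sketch with small made-up tables (not the ones from the listing above):

```python
import pandas as pd

df = pd.DataFrame({'A': [0.0, 1.0], 'B': [1, 2]}, index=['one', 'two'])
df1 = pd.DataFrame({'B': [3, 4], 'D': [5, 6]}, index=['two', 'three'])

# Counterpart of df.append(df1, sort=False): rows are stacked, the original
# labels are kept, and columns missing on one side are filled with NaN.
stacked = pd.concat([df, df1], sort=False)

# Counterpart of df.append([df1, df1], ignore_index=True): rows are renumbered.
renumbered = pd.concat([df, df1, df1], ignore_index=True, sort=False)

print(stacked)
print(renumbered)
```

As with append, neither input table is modified; pd.concat returns a new DataFrame.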

[2] Deleting

import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1.0], [1.0, True, False], [1.0, True, False]],
                  index=['one', 'two', 'three', 'four'], columns=['A', 'B', 'C'])
# Deleting
# drop removes rows and/or columns and returns the resulting table; the original is unchanged
print(df.drop(columns=['B', 'C']))
print(df.drop(index=['one']))
print(df.drop(columns=['B'], index=['one']))  # removes the whole row and the whole column, not a single cell (unlike indexing)

# drop_duplicates removes duplicated rows
# subset: the columns to compare (default None means all columns); keep: which duplicate to keep,
# 'first' (default), 'last', or False (drop them all); inplace: modify the original table directly
print(df.duplicated())                # flag duplicated rows
print(df.drop_duplicates(subset=['A'], keep='last', inplace=True))  # prints None because inplace=True returns nothing
print(df)
------------------------------------
         A
one    0.0
two    0.0
three  1.0
four   1.0
         A     B      C
two    0.0            1
three  1.0  True  False
four   1.0  True  False
         A      C
two    0.0      1
three  1.0  False
four   1.0  False
one      False
two      False
three    False
four      True
dtype: bool
None
        A     B      C
two   0.0            1
four  1.0  True  False

[3] Modifying

import numpy as np
import pandas as pd

# Modifying
df = pd.DataFrame([[0.0, np.nan, 3], [0.0, '', 1.0], [1.0, True, False], [1.0, True, False]],
                  index=['one', 'two', 'three', 'four'], columns=['A', 'B', 'C'])

# (1) Locate the values by indexing and assign directly; the original table is modified
df['A'] = np.nan
print(df)

# (2) rename changes column and row labels; it returns a copy and leaves the original labels unchanged
print(df.rename(columns={"A": "a", "B": "c"}, index={'one': 0, 'two': 1}))

# (3) Changing labels; inplace chooses between returning a copy (False) and modifying the original (True)
print(df.reset_index(drop=False))  # replace the index with integers starting at 0; drop=False keeps the old index as an 'index' column
# Assign new row labels in place (prints None); the inplace argument of set_axis exists in
# pandas 1.0.3 but was removed in pandas 2.0
print(df.set_axis(['I', 'II', 'IIII', 'IV'], axis='index', inplace=True))

# (4) replace substitutes values and returns a copy by default
print(df.replace([np.nan, ''], [0, 1]))    # first list: values to replace; second list: the matching replacements
print(df.replace({'A': np.nan, 'B': True}, 100))  # dict form: the value to replace in each named column
print(df)
-------------------------------------
        A     B      C
one   NaN   NaN      3
two   NaN            1
three NaN  True  False
four  NaN  True  False
        a     c      C
0     NaN   NaN      3
1     NaN            1
three NaN  True  False
four  NaN  True  False
   index   A     B      C
0    one NaN   NaN      3
1    two NaN            1
2  three NaN  True  False
3   four NaN  True  False
None
        A     B      C
I     0.0     0      3
II    0.0     1      1
IIII  0.0  True  False
IV    0.0  True  False
          A    B      C
I     100.0  NaN      3
II    100.0           1
IIII  100.0  100  False
IV    100.0  100  False
       A     B      C
I    NaN   NaN      3
II   NaN            1
IIII NaN  True  False
IV   NaN  True  False

[4] Querying

import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, np.nan, 3], [0.0, 6, 1.0], [1.0, True, False], [1.0, True, False]],
                  index=['one', 'two', 'three', 'four'], columns=['A', 'B', 'C'])
# Querying
print(df)
print(df.info())                              # summary of index, columns, non-null counts, and dtypes (info prints it and returns None)
print(df.describe())                          # summary statistics of the numeric columns
print(df.duplicated())                        # flag duplicated rows
print(df.select_dtypes(include='float'))      # keep the columns of a given dtype
print(df.select_dtypes(exclude=['float']))    # drop the columns of a given dtype
print(df.isna())                              # True where a value is missing
print(df.notna())                             # True where a value is present
print(df.items())                             # iterator of (column label, Series); a DataFrame is keyed by column, so df['A'] is a column
print(df.iterrows())                          # iterator of (row label, row as a Series)
print(df.isin([0, 2]))                        # whether each value is in the list
print(df.isin({'B': [0, 3]}))                 # whether each value of a given column is in the list; other columns are False
print(df.to_numpy())                          # the data as a NumPy array
# print(df.idxmax(axis=1))                    # label of the maximum value in each row
# df.equals(df1)                              # whether two tables have the same shape and elements
-------------------------------------------------------------
         A     B      C
one    0.0   NaN      3
two    0.0     6      1
three  1.0  True  False
four   1.0  True  False
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, one to four
Data columns (total 3 columns):
A    4 non-null float64
B    3 non-null object
C    4 non-null object
dtypes: float64(1), object(2)
memory usage: 128.0+ bytes
None
             A
count  4.00000
mean   0.50000
std    0.57735
min    0.00000
25%    0.00000
50%    0.50000
75%    1.00000
max    1.00000
one      False
two      False
three    False
four      True
dtype: bool
         A
one    0.0
two    0.0
three  1.0
four   1.0
          B      C
one     NaN      3
two       6      1
three  True  False
four   True  False
           A      B      C
one    False   True  False
two    False  False  False
three  False  False  False
four   False  False  False
          A      B     C
one    True  False  True
two    True   True  True
three  True   True  True
four   True   True  True
<generator object DataFrame.iteritems at 0x00000234BF340750>
<generator object DataFrame.iterrows at 0x00000234BF340750>
           A      B      C
one     True  False  False
two     True  False  False
three  False  False   True
four   False  False   True
           A      B      C
one    False  False  False
two    False  False  False
three  False  False  False
four   False  False  False
[[0.0 nan 3]
 [0.0 6 1.0]
 [1.0 True False]
 [1.0 True False]]

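Printing df.items() or df.iterrows() only shows a generator object, as in the output above. To see what they yield, they have to be iterated. A small sketch with made-up data (the names `col_names` and `row_sums` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# items() yields (column label, column as a Series) pairs.
col_names = [name for name, column in df.items()]

# iterrows() yields (row label, row as a Series) pairs.
row_sums = {label: int(row.sum()) for label, row in df.iterrows()}

print(col_names)
print(row_sums)
```

For large tables, vectorized operations such as df.sum(axis=1) are usually preferable to looping with iterrows.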

(3) Handling missing data

import numpy as np
import pandas as pd
dates = pd.date_range('20180507', periods=3)
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list('ABCD'))
# np.nan marks missing data; by default it is excluded from computations
df.loc[dates[1], 'C'] = np.nan  # the original used df.ix, which was removed in pandas 1.0
print(df)
# Dropping missing data
print(df.dropna(axis=0, how='any'))  # drop rows: a row is dropped if it has at least one missing value
print(df.dropna(axis=1, how='all'))  # drop columns: a column is dropped only if all its values are missing
# Filling missing data
print(df.fillna(value=0))  # replace NaN with 0
# Checking for missing data
print(df.isnull())  # same as print(df.isna())
-----------------------------------------------------------
            A  B     C   D
2018-05-07  0  1   2.0   3
2018-05-08  4  5   NaN   7
2018-05-09  8  9  10.0  11
            A  B     C   D
2018-05-07  0  1   2.0   3
2018-05-09  8  9  10.0  11
            A  B     C   D
2018-05-07  0  1   2.0   3
2018-05-08  4  5   NaN   7
2018-05-09  8  9  10.0  11
            A  B     C   D
2018-05-07  0  1   2.0   3
2018-05-08  4  5   0.0   7
2018-05-09  8  9  10.0  11
                A      B      C      D
2018-05-07  False  False  False  False
2018-05-08  False  False   True  False
2018-05-09  False  False  False  False

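Beyond filling with a constant, missing values are often filled with a per-column statistic or with the previous valid value, and rows can be dropped by a minimum count of valid entries. The sketch below illustrates these options on made-up data; it is not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0]})

filled_mean = df.fillna(df.mean())  # fill each column with its own mean
filled_ffill = df.ffill()           # carry the last valid value forward
kept = df.dropna(thresh=2)          # keep rows with at least 2 non-missing values

print(filled_mean)
print(filled_ffill)
print(kept)
```

Forward filling cannot fill leading gaps, so column B keeps its first two NaN values in `filled_ffill`.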

(4) Computation and statistics

import pandas as pd

df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
print(df)

# Statistics and computation
print(df.abs())                   # absolute value of every element
print(df.clip(-4, 6))             # limit the values to the range [-4, 6]
print(df.corr(method='pearson'))  # correlation between columns; methods: pearson, kendall, spearman
print(df.count())                 # number of non-NA elements per column (None, NaN, NaT count as NA; inf only if pandas is configured so)
print(df.cov())                   # covariance matrix of the columns
print(df.cummin())                # cumulative minimum down each column; cummax also exists
print(df.cumsum())                # cumulative sum down each column; cumprod gives the cumulative product
print(df.diff())                  # difference between each element and the one in the previous row (periods=1 by default)
print(df.eval('A+B'))             # evaluate a string expression
print(df.max())                   # maximum of each column; min also exists
print(df.idxmax())                # index label of the maximum of each column
print(df.std())                   # standard deviation
print(df.var())                   # variance
print(df.mean())                  # mean
print(df.median())                # median
print(df.describe())              # summary statistics; 25% is the 25th percentile
print(df['A'].value_counts())     # number of occurrences of each distinct value
print(df.round(2))                # round to 2 decimal places
# print(df.all())                 # whether all elements of each column are True
-------------------------------------------------
       A  B  C
one    1 -5  3
two    4  5  6
three  7  8  9
       A  B  C
one    1  5  3
two    4  5  6
three  7  8  9
       A  B  C
one    1 -4  3
two    4  5  6
three  6  6  6
          A         B         C
A  1.000000  0.954919  1.000000
B  0.954919  1.000000  0.954919
C  1.000000  0.954919  1.000000
A    3
B    3
C    3
dtype: int64
      A          B     C
A   9.0  19.500000   9.0
B  19.5  46.333333  19.5
C   9.0  19.500000   9.0
       A  B  C
one    1 -5  3
two    1 -5  3
three  1 -5  3
        A  B   C
one     1 -5   3
two     5  0   9
three  12  8  18
         A     B    C
one    NaN   NaN  NaN
two    3.0  10.0  3.0
three  3.0   3.0  3.0
one      -4
two       9
three    15
dtype: int64
A    7
B    8
C    9
dtype: int64
A    three
B    three
C    three
dtype: object
A    3.000000
B    6.806859
C    3.000000
dtype: float64
A     9.000000
B    46.333333
C     9.000000
dtype: float64
A    4.000000
B    2.666667
C    6.000000
dtype: float64
A    4.0
B    5.0
C    6.0
dtype: float64
         A         B    C
count  3.0  3.000000  3.0
mean   4.0  2.666667  6.0
std    3.0  6.806859  3.0
min    1.0 -5.000000  3.0
25%    2.5  0.000000  4.5
50%    4.0  5.000000  6.0
75%    5.5  6.500000  7.5
max    7.0  8.000000  9.0
7    1
1    1
4    1
Name: A, dtype: int64
       A  B  C
one    1 -5  3
two    4  5  6
three  7  8  9

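The diff method is worth a closer look: it subtracts the element `periods` rows earlier, so the default is a difference between consecutive rows. A minimal sketch on a made-up Series (the same logic applies to each column of a DataFrame):

```python
import pandas as pd

s = pd.Series([1, 4, 9, 16])

d1 = s.diff()            # each element minus the previous one
d2 = s.diff(periods=2)   # each element minus the one two rows earlier

# diff() is equivalent to subtracting a shifted copy of the data.
same = d1.equals(s - s.shift(1))

print(d1)
print(d2)
print(same)
```

The first `periods` entries of the result are NaN because they have no earlier element to subtract.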

(5) Reading and writing

import pandas as pd

# Readers are named read_<format> and writers are named to_<format>
# Formats include csv/excel/hdf/sql/json/html/stata/pickle in both directions; sas can only be read,
# and records/markdown/dict/latex can only be written (to_ methods only)
# Fixing garbled characters (encoding problems): https://blog.csdn.net/leonzhouwei/article/details/8447643
df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
df.to_csv('foo.csv', index=False)  # index=False leaves the row labels out of the file
data = pd.read_csv('foo.csv')
print(data)
----------------------------------
   A  B  C
0  1 -5  3
1  4  5  6
2  7  8  9

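In the example above the row labels are lost because the file is written with index=False. To keep them across a round trip, write the index and read it back with index_col. The sketch below uses an in-memory io.StringIO buffer in place of a file, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, -5, 3], [4, 5, 6], [7, 8, 9]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])

buffer = io.StringIO()
df.to_csv(buffer)        # the index is written as the first column by default
buffer.seek(0)           # rewind before reading

restored = pd.read_csv(buffer, index_col=0)  # use the first column as the index
print(restored)
```

Passing a file path instead of the buffer to to_csv and read_csv works the same way.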

(6) Visualization

import matplotlib.pyplot as plt
import pandas as pd

# Plotting is done through matplotlib
df = pd.DataFrame({
    'sales': [3, 2, 3, 9, 10, 6],
    'signups': [5, 5, 6, 12, 14, 13],
    'visits': [20, 42, 28, 62, 81, 50],
}, index=pd.date_range(start='2018/01/01', end='2018/07/01',
                       freq='M'))
df.plot()         # line plot (the default)
df.plot.area()    # area plot
df.plot.bar()     # vertical bar plot
df.plot.barh()    # horizontal bar plot
df.plot.hist()    # histogram of the columns
df.plot.line()    # line plot
df.plot.scatter(x='sales', y='signups')  # scatter plot
plt.show()        # display the figures

(7) Miscellaneous

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [2, 3, 2],
     'B': [7, 8, 6],
     'C': [7, 11, 9]}
)
print(df)

print(df.insert(loc=1, column='D', value=[3, 7, 1]))  # insert a column at the given position, in place (prints None)
print(df.astype({'A': 'float16'}))                # convert dtypes
print(df.copy())                                  # copy the table
print(df.applymap(lambda x: x + 1))               # apply a function to every element
df2 = df.groupby(by=['A'])                        # group the rows by the distinct values of a column
print(df2.get_group(2))                           # the sub-table belonging to one group
print(df.aggregate(np.median))                    # aggregate with one or more operations along the given axis

print('#'*30)
print(df)

print(df.sort_values(by='C', ascending=False))  # sort all rows by the values of one column, here in descending order
# The original listing also printed df.to_dense() (sparse to dense conversion); that method was
# deprecated in pandas 0.25 and is absent from later releases, so the call and its output are omitted
-------------------------------------------------
   A  B   C
0  2  7   7
1  3  8  11
2  2  6   9
None
     A  D  B   C
0  2.0  3  7   7
1  3.0  7  8  11
2  2.0  1  6   9
   A  D  B   C
0  2  3  7   7
1  3  7  8  11
2  2  1  6   9
   A  D  B   C
0  3  4  8   8
1  4  8  9  12
2  3  2  7  10
   A  D  B  C
0  2  3  7  7
2  2  1  6  9
A    2.0
D    3.0
B    7.0
C    9.0
dtype: float64
##############################
   A  D  B   C
0  2  3  7   7
1  3  7  8  11
2  2  1  6   9
   A  D  B   C
1  3  7  8  11
2  2  1  6   9
0  2  3  7   7

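A groupby object is usually followed by an aggregation rather than get_group. The sketch below, on the same three-row table as above (without the inserted column D), sums each group and computes two statistics for one column; the names `sums` and `c_range` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 2],
                   'B': [7, 8, 6],
                   'C': [7, 11, 9]})

# One row per distinct value of A, with the other columns summed.
sums = df.groupby('A').sum()

# Several aggregations at once for a single column.
c_range = df.groupby('A')['C'].agg(['min', 'max'])

print(sums)
print(c_range)
```

The grouping column becomes the index of the result, so a group is looked up with, for example, sums.loc[2].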

5. Operations on two tables

(1) Merging

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0.0, np.nan, 3], [0.0, 3, 1.0], [1.0, True, False]],
                  index=['one', 'two', 'three'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['two', 'three', 'four'], columns=['B', 'C', 'D'])

# (1) concat: axis=0 stacks the tables vertically, axis=1 places them side by side;
# ignore_index=True discards the existing labels along the concatenation axis and renumbers them;
# join='outer' takes the union of the other axis and 'inner' the intersection;
# sort controls whether the other axis is sorted when the labels do not line up (strings sort alphabetically)
print(pd.concat([df1, df2], axis=1, ignore_index=False, sort=True, join='outer'))
print(pd.concat([df1, df2], axis=0, ignore_index=False, sort=False, join='inner'))

# (2) merge: join on key columns
# how: 'left' keeps the keys of the left table, 'right' those of the right table,
#      'outer' the union of keys, 'inner' the intersection
# left_index and right_index: whether to use the left or right index as the join key (True or False)
# suffixes: markers appended to columns that share a label but hold different data
print(df1)
print(df2)
print(pd.merge(df1, df2, on=['B'], suffixes=['_left', '_right'], how='outer'))  # merge on the shared column 'B'
----------------------------------------------------------
         A     B      C    B    C    D
four   NaN   NaN    NaN  7.0  8.0  9.0
one    0.0   NaN      3  NaN  NaN  NaN
three  1.0  True  False  4.0  5.0  6.0
two    0.0     3      1  1.0  2.0  3.0
          B      C
one     NaN      3
two       3      1
three  True  False
two       1      2
three     4      5
four      7      8
         A     B      C
one    0.0   NaN      3
two    0.0     3      1
three  1.0  True  False
       B  C  D
two    1  2  3
three  4  5  6
four   7  8  9
     A     B C_left  C_right    D
0  0.0   NaN      3      NaN  NaN
1  0.0     3      1      NaN  NaN
2  1.0  True  False      2.0  3.0
3  NaN     4    NaN      5.0  6.0
4  NaN     7    NaN      8.0  9.0

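The effect of the how and left_index/right_index options is easier to see with a clean key column. The following sketch uses two small made-up tables (the column names 'key', 'x', and 'y' are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

inner = pd.merge(left, right, on='key', how='inner')  # keys present in both tables
outer = pd.merge(left, right, on='key', how='outer')  # union of the keys

# Join on the indexes instead of a column, keeping the left table's keys.
by_index = pd.merge(left.set_index('key'), right.set_index('key'),
                    left_index=True, right_index=True, how='left')

print(inner)
print(outer)
print(by_index)
```

Keys without a partner in the other table get NaN in the columns that come from that table.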

(2) Arithmetic: works much as in NumPy
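A short sketch of what "much as in NumPy" looks like in practice, using two made-up tables. The main difference from NumPy is that operations between two DataFrames align on row and column labels rather than on positions.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
df2 = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=['y', 'z'])

elementwise = df1 * 2 + 1            # scalars broadcast to every element
roots = np.sqrt(df1)                 # NumPy functions apply element-wise

aligned = df1 + df2                  # labels missing on one side give NaN
filled = df1.add(df2, fill_value=0)  # treat a missing operand as 0 instead

print(elementwise)
print(aligned)
print(filled)
```

Only row 'y' exists in both tables, so it is the only row of `aligned` that holds numbers.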

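IV. General notes

1. Any operation on columns has an equivalent operation on rows (and vice versa) by transposing the table first.

2. In pandas, index holds the row labels and corresponds to axis=0, while columns holds the column labels and corresponds to axis=1. A reduction with axis=0 runs down the rows and gives one result per column; with axis=1 it runs across the columns and gives one result per row.

3. When a DataFrame is created from a dict, the resulting table shows the index on the far left, running top to bottom, and the columns along the top, running left to right. The dict data fills the middle, with each value located by its index label and column label.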
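The first two notes can be checked with a small reduction. This sketch uses a made-up 2x3 table and shows that summing along axis=1 gives the same result as transposing and summing along axis=0.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=['r1', 'r2'], columns=['A', 'B', 'C'])

per_column = df.sum(axis=0)        # runs down the rows: one value per column
per_row = df.sum(axis=1)           # runs across the columns: one value per row
via_transpose = df.T.sum(axis=0)   # the row sums again, through the transpose

print(per_column)
print(per_row)
print(via_transpose.equals(per_row))
```

The same axis convention applies to most reductions and to methods such as dropna and concat.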

Posted on 2020-06-06 22:29 by 温润有方