Basic pandas usage

These notes record some basic pandas methods. Being fluent with them makes everyday data-processing work much smoother.

I. Creating objects

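  pandas has two main data structures: Series and DataFrame. Basic create, read, update, and delete operations on both were covered in an earlier post; if anything is unclear, see https://www.cnblogs.com/ppzhang/p/13747910.html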
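
As a quick reminder of what the two structures look like, here is a minimal sketch. The variable names and sample values are made up for illustration and are not taken from the earlier post.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# A DataFrame is a two-dimensional table; each dict key becomes a column.
df = pd.DataFrame({"A": list("abc"), "B": [2, 4, 6]})

print(s)
print(df)
```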

 

II. Viewing data

This section mainly covers viewing the data in a two-dimensional DataFrame.

  1. head(): view the first rows of the data

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})

    return df2

def print_data():

    df = create_numpay_dataform()

    print("-------Original data-----")
    print(df)
    print("-------First 1 row-----")
    print(df.head(1))
    print("-------First 3 rows-----")
    print(df.head(3))

if __name__ == '__main__':
    print_data()


# Output:
-------Original data-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
-------First 1 row-----
   A          B    C  D     E    F
0  1 2013-01-02  3.0  3  test  foo
-------First 3 rows-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo

  2. tail(): view the last rows of the data

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})

    return df2

def print_data():

    df = create_numpay_dataform()

    print("-------Original data-----")
    print(df)
    print("-------Last 1 row-----")
    print(df.tail(1))
    print("-------Last 2 rows-----")
    print(df.tail(2))

if __name__ == '__main__':
    print_data()




# Output:
-------Original data-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
-------Last 1 row-----
   A          B    C  D      E    F
3  1 2013-01-02  3.0  3  train  foo
-------Last 2 rows-----
   A          B    C  D      E    F
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo

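  3. to_numpy(): returns the underlying data as a NumPy array.

  Notes:

    a. A NumPy array has a single dtype for the whole array.

    b. In a DataFrame, each column can have its own dtype.

    c. When the columns of a DataFrame have several different dtypes, this operation is relatively expensive.

    d. When to_numpy() is called, pandas looks for a NumPy dtype that can hold every dtype in the DataFrame.

    e. That dtype may end up being object, in which case every value in the DataFrame is cast to a Python object.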

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})



    return df,df2

def print_data():

    df1,df2 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    print("\n"+"-------DF2-----")
    print(df2)

    print("\n" + "-------DF1 to to_numpy-----")
    print("df1 holds only floats, so DataFrame.to_numpy() is fast and does not need to copy the data.")
    print(df1.to_numpy())


    print("\n" + "-------DF2 to to_numpy-----")
    print("df2 holds several dtypes, so DataFrame.to_numpy() is comparatively expensive.")
    print(df2.to_numpy())

if __name__ == '__main__':
    print_data()


# Output:
-------DF1-----
                   A         B         C         D
2013-01-01  0.214933 -0.932719  0.409751 -1.579671
2013-01-02  0.857846 -0.450446  1.334220 -0.256340
2013-01-03  1.855527 -0.459457 -0.088609  1.970731
2013-01-04 -0.315940  1.216017  0.145649  0.844216
2013-01-05  1.229986 -0.307384 -0.816692 -1.266780
2013-01-06 -0.324177 -0.606538 -0.993541 -1.018344

-------DF2-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo

-------DF1 to to_numpy-----
df1 holds only floats, so DataFrame.to_numpy() is fast and does not need to copy the data.
[[ 0.21493314 -0.93271907  0.40975128 -1.57967127]
 [ 0.85784569 -0.45044625  1.3342199  -0.25634002]
 [ 1.85552743 -0.45945651 -0.08860859  1.97073069]
 [-0.31593997  1.2160171   0.14564932  0.8442159 ]
 [ 1.22998622 -0.30738437 -0.81669186 -1.26677969]
 [-0.3241766  -0.60653794 -0.99354086 -1.01834351]]

-------DF2 to to_numpy-----
df2 holds several dtypes, so DataFrame.to_numpy() is comparatively expensive.
[[1 Timestamp('2013-01-02 00:00:00') 3.0 3 'test' 'foo']
 [1 Timestamp('2013-01-02 00:00:00') 3.0 3 'train' 'foo']
 [1 Timestamp('2013-01-02 00:00:00') 3.0 3 'test' 'foo']
 [1 Timestamp('2013-01-02 00:00:00') 3.0 3 'train' 'foo']]
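
The notes on dtypes can be checked directly by looking at the dtype of the array that to_numpy() returns. The following is a small sketch with made-up frames (`floats` and `mixed` are illustrative names, not part of the example above).

```python
import numpy as np
import pandas as pd

# Every column is float64, so the result keeps a numeric dtype.
floats = pd.DataFrame(np.zeros((2, 2)), columns=["A", "B"])
print(floats.to_numpy().dtype)

# Integer and string columns share no numeric dtype,
# so the values are stored as Python objects.
mixed = pd.DataFrame({"A": [1, 2], "F": ["foo", "bar"]})
print(mixed.to_numpy().dtype)
```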

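  4. describe(): shows a quick statistical summary of the data. Its three most commonly used parameters are:

    a. percentiles: sets which percentiles are reported for numeric columns. The default is [.25, .5, .75], that is, the values at 25%, 50%, and 75% of the data. It can be changed, for example describe(percentiles=[.2, .75, .8]); in the run recorded here the 50% row is reported even when it is not listed.

    b. include: by default only numeric columns are summarized. include=['O'] summarizes the object (non-numeric) columns instead, and include='all' shows the statistics for both numeric and non-numeric columns.

    c. exclude: the counterpart of include. Where include names the dtypes to keep, exclude names the dtypes to leave out. The default is None, which drops no columns.

    d. To show only one row of the result, index into it by position, for example describe().iloc[4]: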

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2]
    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    print("\n"+"-------DF1.describe() with no arguments-----")
    # percentiles defaults to 25%, 50%, 75%
    # include defaults to numeric columns only
    # exclude defaults to None, which drops no columns
    print(df1.describe())

    print("\n" + "-------DF1.describe() with percentiles-----")
    # Report the values at 10%, 60%, 80%, 90%; 50% is shown as well
    print(df1.describe(percentiles=[.1,.6,.8,.9]))


    print("\n" + "-------DF1.describe() with include='all'-----")
    # 'all' shows the statistics for both numeric and non-numeric columns
    print(df1.describe(include="all"))
    print("\n" + "-------DF1.describe() with include='O'-----")
    # include='O' summarizes the object (non-numeric) columns
    print(df1.describe(include='O'))


    print("\n" + "-------DF1.describe() with exclude='O'-----")
    # exclude='O' leaves the object (non-numeric) columns out
    print(df1.describe(exclude='O'))

    print("\n" + "-------One row of the DF1.describe() result-----")
    print(df1.describe().iloc[4])

if __name__ == '__main__':
    print_data()


# Output:

-------DF1-----
   A  B
0  a  2
1  b  4
2  a  6
3  a  3
4  c  6
5  d  2
6  a  5
7  d  8
8  a  0
9  f  2

-------DF1.describe() with no arguments-----
               B
count  10.000000
mean    3.800000
std     2.440401
min     0.000000
25%     2.000000
50%     3.500000
75%     5.750000
max     8.000000

-------DF1.describe() with percentiles-----
               B
count  10.000000
mean    3.800000
std     2.440401
min     0.000000
10%     1.800000
50%     3.500000
60%     4.400000
80%     6.000000
90%     6.200000
max     8.000000

-------DF1.describe() with include='all'-----
          A          B
count    10  10.000000
unique    5        NaN
top       a        NaN
freq      5        NaN
mean    NaN   3.800000
std     NaN   2.440401
min     NaN   0.000000
25%     NaN   2.000000
50%     NaN   3.500000
75%     NaN   5.750000
max     NaN   8.000000

-------DF1.describe() with include='O'-----
         A
count   10
unique   5
top      a
freq     5

-------DF1.describe() with exclude='O'-----
               B
count  10.000000
mean    3.800000
std     2.440401
min     0.000000
25%     2.000000
50%     3.500000
75%     5.750000
max     8.000000

-------One row of the DF1.describe() result-----
B    2.0
Name: 25%, dtype: float64

Process finished with exit code 0
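
Selecting a row of the summary by position works, but the result of describe() is itself a DataFrame, so the row can also be picked by its label. A minimal sketch, assuming the same values as column B above (`summary` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"B": [2, 4, 6, 3, 6, 2, 5, 8, 0, 2]})
summary = df.describe()

# Select a statistic by its label instead of its position.
print(summary.loc["mean"])       # one value per numeric column
print(summary.loc["mean", "B"])  # a single number
```

Selecting by label keeps working if the set of reported percentiles changes, whereas a position such as iloc[4] then points at a different row.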

  5. sort_index(): sort by axis labels

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():

    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),

    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
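    print("\n"+"-------DF1 sorted by row index (descending)-----")
    # sort_index() sorts in ascending order by default; ascending=False reverses it
    print(df1.sort_index(ascending=False))

    print("\n" + "-------DF1 sorted by column labels (descending)-----")
    # axis=1 sorts the column labels instead of the row index
    print(df1.sort_index(axis=1, ascending=False))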



if __name__ == '__main__':
    print_data()


# Output:

-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s

-------DF1 sorted by row index (descending)-----
   A  B  F  D
9  f  2  2  s
8  a  0  0  e
7  d  8  8  a
6  a  5  0  x
5  d  2  2  f
4  c  6  9  r
3  a  3  7  a
2  a  6  6  v
1  b  4  6  f
0  a  2  2  a

-------DF1 sorted by column labels (descending)-----
   F  D  B  A
0  2  a  2  a
1  6  f  4  b
2  6  v  6  a
3  7  a  3  a
4  9  r  6  c
5  2  f  2  d
6  0  x  5  a
7  8  a  8  d
8  0  e  0  a
9  2  s  2  f

Process finished with exit code 0

  6. sort_values(): sort by values

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():

    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),

    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    print("\n"+"-------DF1 sorted by values-----")
    # sort_values() sorts in ascending order by default; ascending=False reverses it
    print(df1.sort_values(by='B', ascending=False))




if __name__ == '__main__':
    print_data()

# Output:

-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s

-------DF1 sorted by values-----
   A  B  F  D
7  d  8  8  a
2  a  6  6  v
4  c  6  9  r
6  a  5  0  x
1  b  4  6  f
3  a  3  7  a
0  a  2  2  a
5  d  2  2  f
9  f  2  2  s
8  a  0  0  e

Process finished with exit code 0

  7. index: shows the row index (the labels in the leftmost column of the printed output)

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})

    return df2

def print_data():

    df = create_numpay_dataform()

    print("-------Original data-----")
    print(df)
    print("-------Index-----")
    print(df.index)


if __name__ == '__main__':
    print_data()


# Output:
-------Original data-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
-------Index-----
Int64Index([0, 1, 2, 3], dtype='int64')

  8. columns: shows the column labels (the header row at the top of the printed output)

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})

    return df2

def print_data():

    df = create_numpay_dataform()

    print("-------Original data-----")
    print(df)
    print("-------Column labels-----")
    print(df.columns)


if __name__ == '__main__':
    print_data()


# Output:

-------Original data-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
-------Column labels-----
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

  9. T: transpose the data

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),

    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    print("\n"+"-------DF1 transposed-----")
    print(df1.T)




if __name__ == '__main__':
    print_data()

# Output:

-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s

-------DF1 transposed-----
   0  1  2  3  4  5  6  7  8  9
A  a  b  a  a  c  d  a  d  a  f
B  2  4  6  3  6  2  5  8  0  2
F  2  6  6  7  9  2  0  8  0  2
D  a  f  v  a  r  f  x  a  e  s

Process finished with exit code 0

 

III. Selecting data

  1. Select a single column with df["column name"] or df.column_name

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),

    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    print("\n"+"-------Single column via ['name']-----")
    print(df1["D"])
    print("\n" + "-------Single column via DF1.name-----")
    print(df1.A)




if __name__ == '__main__':
    print_data()


# Output:

-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s

-------Single column via ['name']-----
0    a
1    f
2    v
3    a
4    r
5    f
6    x
7    a
8    e
9    s
Name: D, dtype: object

-------Single column via DF1.name-----
0    a
1    b
2    a
3    a
4    c
5    d
6    a
7    d
8    a
9    f
Name: A, dtype: object

Process finished with exit code 0

  2. Slice rows with [ ]

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),

    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    print("\n"+"-------Row slice-----")
    print(df1[4:6])
    




if __name__ == '__main__':
    print_data()


# Output:

-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s

-------Row slice-----
   A  B  F  D
4  c  6  9  r
5  d  2  2  f

Process finished with exit code 0

  3. loc: select by label

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    

    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),

    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    print("\n"+"-------Select one row by label------")
    print(df1.loc[8])
    print("\n" + "-------Select several rows by label------")
    print(df1.loc[[1,6,8]])
    print("\n" + "-------Select several columns by label------")
    print(df1.loc[:,['A', 'D']])
    print("\n" + "-------Select given rows and columns by label------")
    print(df1.loc[4:7, ['A', 'D']])
    print("\n" + "-------Reduce dimensions (one row as a Series)------")
    print(df1.loc[5, ['A', 'D']])

   

if __name__ == '__main__':
    print_data()

# Output:

-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s

-------Select one row by label------
A    a
B    0
F    0
D    e
Name: 8, dtype: object

-------Select several rows by label------
   A  B  F  D
1  b  4  6  f
6  a  5  0  x
8  a  0  0  e

-------Select several columns by label------
   A  D
0  a  a
1  b  f
2  a  v
3  a  a
4  c  r
5  d  f
6  a  x
7  d  a
8  a  e
9  f  s

-------Select given rows and columns by label------
   A  D
4  c  r
5  d  f
6  a  x
7  d  a

-------Reduce dimensions (one row as a Series)------
A    d
D    f
Name: 5, dtype: object

Process finished with exit code 0

  4. iloc: select by position

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():


    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),

    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
   

    print("\n" + "-------Select one row by position------")
    print(df1.iloc[3])
    print("\n" + "-------Slice rows and columns by position------")
    print(df1.iloc[3:,:3])
    print("\n" + "-------Select given rows and columns by position------")
    print(df1.iloc[[1,3,5], [0,2]])
   

if __name__ == '__main__':
    print_data()

# Output:

-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s

-------Select one row by position------
A    a
B    3
F    7
D    a
Name: 3, dtype: object

-------Slice rows and columns by position------
   A  B  F
3  a  3  7
4  c  6  9
5  d  2  2
6  a  5  0
7  d  8  8
8  a  0  0
9  f  2  2

-------Select given rows and columns by position------
   A  F
1  b  6
3  a  7
5  d  2

Process finished with exit code 0
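
One difference between the loc and iloc examples is easy to miss: a label slice with loc includes its end point, while a position slice with iloc excludes it, as in ordinary Python slicing. A small sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"A": list("abcdef")})  # default row labels 0..5

by_label = df.loc[2:4]      # label slice: both end points are included
by_position = df.iloc[2:4]  # position slice: the end point is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```

This is why loc[4:7] in the loc example returns four rows, while the plain slice df1[4:6] returns two.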

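  6. Boolean indexing with a single condition (make sure the values being compared have consistent types; comparing incompatible types raises an error)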

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    
    print("\n" + "-------Boolean indexing on one column------")
    print(df1[df1.B > 0])
    print("\n" + "-------Boolean indexing on the whole DataFrame------")
    print(df1[df1 > 0.543791])



if __name__ == '__main__':
    print_data()


# Output:

-------DF1-----
                   A         B         C         D
2013-01-01 -0.571184  0.810240 -1.834513 -0.185410
2013-01-02 -0.085790 -1.009361  1.311410  0.141120
2013-01-03  0.672282  0.569641 -1.394152  0.832807
2013-01-04  0.170832 -0.882142  0.928596 -0.945374
2013-01-05 -1.100324 -1.045981  1.217005  1.420321
2013-01-06 -0.952931  0.575549 -0.164552 -1.097455

-------Boolean indexing on one column------
                   A         B         C         D
2013-01-01 -0.571184  0.810240 -1.834513 -0.185410
2013-01-03  0.672282  0.569641 -1.394152  0.832807
2013-01-06 -0.952931  0.575549 -0.164552 -1.097455

-------Boolean indexing on the whole DataFrame------
                   A         B         C         D
2013-01-01       NaN  0.810240       NaN       NaN
2013-01-02       NaN       NaN  1.311410       NaN
2013-01-03  0.672282  0.569641       NaN  0.832807
2013-01-04       NaN       NaN  0.928596       NaN
2013-01-05       NaN       NaN  1.217005  1.420321
2013-01-06       NaN  0.575549       NaN       NaN

Process finished with exit code 0

  7. isin(): Boolean test against several values

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),

    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    

    print("\n" + "-------Test one column-------")
    print(df1.F.isin(["a","b",6,9]))
    print("\n" + "-------Test the whole DataFrame-------")
    print(df1.isin(["a", "b", 6, 9]))

if __name__ == '__main__':
    print_data()

# Output:

-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s

-------Test one column-------
0    False
1     True
2     True
3    False
4     True
5    False
6    False
7    False
8    False
9    False
Name: F, dtype: bool

-------Test the whole DataFrame-------
       A      B      F      D
0   True  False  False   True
1   True  False   True  False
2   True   True   True  False
3   True  False  False   True
4  False   True   True  False
5  False  False  False  False
6   True  False  False  False
7  False  False  False   True
8   True  False  False  False
9  False  False  False  False

Process finished with exit code 0
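
isin() returns a Boolean mask, and that mask is typically passed back into [ ] to keep only the matching rows. A minimal sketch of the step, using a shortened, made-up version of the frame above:

```python
import pandas as pd

df = pd.DataFrame({"A": list("abaacd"), "B": [2, 4, 6, 3, 6, 2]})

mask = df["A"].isin(["a", "b"])  # True where column A is 'a' or 'b'
selected = df[mask]              # keep only the rows where the mask is True

print(selected)
```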

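  8. Assignment: assigning values is straightforward. Once label-based or position-based selection has located the data, assign to it directly. A Boolean condition can also be used to decide which values get assigned.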
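
The three forms of assignment mentioned here might look like the sketch below. The frame and the values written into it are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abcd"), "B": [2, 4, 6, 3]})

df.loc[0, "B"] = 20             # assign by label (row label 0, column "B")
df.iloc[1, 1] = 40              # assign by position (second row, second column)
df.loc[df["B"] < 10, "B"] = 0   # assign only where the condition holds

print(df)
```

Doing the conditional assignment through a single loc call, rather than chaining two [ ] selections, makes sure the write goes to the original DataFrame and not to a temporary copy.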

  9. Missing values: pandas mainly uses np.nan to represent missing data. By default, missing values are excluded from computations.

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),

    })


    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
    

    print("-------Set missing values----------")
    df1.iloc[1:4, :-1] = np.nan
    print(df1)
    print("-------Drop rows with missing values----------")
    # how="all" drops a row only when every value in it is missing; how="any" drops it as soon as one value is missing
    print(df1.dropna(how="all"))
    print("-------Fill missing values----------")
    print(df1.fillna(value=10086))
    print("-------Test for missing values----------")
    print(pd.isna(df1))



if __name__ == '__main__':
    print_data()

# Output:

-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------Set missing values----------
     A    B    F  D
0    a  2.0  2.0  a
1  NaN  NaN  NaN  f
2  NaN  NaN  NaN  v
3  NaN  NaN  NaN  a
4    c  6.0  9.0  r
5    d  2.0  2.0  f
6    a  5.0  0.0  x
7    d  8.0  8.0  a
8    a  0.0  0.0  e
9    f  2.0  2.0  s
-------Drop rows with missing values----------
     A    B    F  D
0    a  2.0  2.0  a
1  NaN  NaN  NaN  f
2  NaN  NaN  NaN  v
3  NaN  NaN  NaN  a
4    c  6.0  9.0  r
5    d  2.0  2.0  f
6    a  5.0  0.0  x
7    d  8.0  8.0  a
8    a  0.0  0.0  e
9    f  2.0  2.0  s
-------Fill missing values----------
       A        B        F  D
0      a      2.0      2.0  a
1  10086  10086.0  10086.0  f
2  10086  10086.0  10086.0  v
3  10086  10086.0  10086.0  a
4      c      6.0      9.0  r
5      d      2.0      2.0  f
6      a      5.0      0.0  x
7      d      8.0      8.0  a
8      a      0.0      0.0  e
9      f      2.0      2.0  s
-------Test for missing values----------
       A      B      F      D
0  False  False  False  False
1   True   True   True  False
2   True   True   True  False
3   True   True   True  False
4  False  False  False  False
5  False  False  False  False
6  False  False  False  False
7  False  False  False  False
8  False  False  False  False
9  False  False  False  False

Process finished with exit code 0
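
The statement that missing values are left out of computations by default can be seen on a tiny Series. This is only a sketch with made-up numbers; skipna is the pandas parameter that controls the behavior.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.mean())              # NaN is skipped: (1 + 3) / 2
print(s.mean(skipna=False))  # NaN is kept, so the result is NaN
```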

 

IV. Statistics

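  1. shift(): calculations between rows or columns often need the data lined up at an offset first; shift() moves the data by the given number of periods. (It is not especially efficient.)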

import pandas as pd
import numpy as np

def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

   

    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)

    print("------Shift rows down by one--------")
    print(df1.shift())

    print("------Shift columns left by one--------")
    print(df1.shift(-1,axis=1))





if __name__ == '__main__':
    print_data()


# Output:

-------DF1-----
                   A         B         C         D
2013-01-01 -1.494982 -1.816127 -1.557673  0.676270
2013-01-02 -0.382565 -0.772728 -2.028113 -1.000548
2013-01-03  1.024764  1.438836 -2.294408 -0.391837
2013-01-04  0.460244  1.823243 -0.183927  1.755757
2013-01-05  0.655894 -0.193546  1.155935 -0.773810
2013-01-06 -2.142355 -0.583462  1.369368  0.703252
------Shift rows down by one--------
                   A         B         C         D
2013-01-01       NaN       NaN       NaN       NaN
2013-01-02 -1.494982 -1.816127 -1.557673  0.676270
2013-01-03 -0.382565 -0.772728 -2.028113 -1.000548
2013-01-04  1.024764  1.438836 -2.294408 -0.391837
2013-01-05  0.460244  1.823243 -0.183927  1.755757
2013-01-06  0.655894 -0.193546  1.155935 -0.773810
------Shift columns left by one--------
                   A         B         C   D
2013-01-01 -1.816127 -1.557673  0.676270 NaN
2013-01-02 -0.772728 -2.028113 -1.000548 NaN
2013-01-03  1.438836 -2.294408 -0.391837 NaN
2013-01-04  1.823243 -0.183927  1.755757 NaN
2013-01-05 -0.193546  1.155935 -0.773810 NaN
2013-01-06 -0.583462  1.369368  0.703252 NaN

Process finished with exit code 0
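
To make the alignment idea concrete: after shifting by one period, each date sits next to the previous date's value, so a plain subtraction gives the day-over-day change. A minimal sketch with an invented Series:

```python
import pandas as pd

s = pd.Series([10, 12, 15, 11], index=pd.date_range("20130101", periods=4))

# s.shift(1) moves every value down one row, so each date lines up
# with the previous day's value; the first row has nothing to pair with.
change = s - s.shift(1)

print(change)
print(change.equals(s.diff()))  # compare with the built-in diff()
```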

  2. mean(): compute the mean

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)
   
    print("-------Set missing values----------")
    df1.iloc[1:4, :-1] = np.nan
    print(df1)

    print("-------Fill missing values----------")
    c1 = df1.fillna(value=10086)
    print(c1)


    print("-------Mean of the data without missing values----------")
    # axis=0 (the default) averages each column; axis=1 averages each row
    print(c1.mean(axis=0))
    print("-------Mean of the data with missing values----------")
    # Missing values are skipped when the mean is computed
    print(df1.mean())
    
if __name__ == '__main__':
    print_data()


# Output:

-------DF1-----
                   A         B         C         D
2013-01-01  0.504033  0.167604  0.656164 -0.305116
2013-01-02  0.743423  1.004330 -1.858694 -0.962968
2013-01-03 -0.978681 -0.858943  1.527813  0.442333
2013-01-04 -0.447715 -1.075530  0.655507  1.271325
2013-01-05  0.877627  0.641684 -1.701115 -0.211141
2013-01-06  2.704554 -0.666753 -1.092838 -2.232137
-------Set missing values----------
                   A         B         C         D
2013-01-01  0.504033  0.167604  0.656164 -0.305116
2013-01-02       NaN       NaN       NaN -0.962968
2013-01-03       NaN       NaN       NaN  0.442333
2013-01-04       NaN       NaN       NaN  1.271325
2013-01-05  0.877627  0.641684 -1.701115 -0.211141
2013-01-06  2.704554 -0.666753 -1.092838 -2.232137
-------Fill missing values----------
                       A             B             C         D
2013-01-01      0.504033      0.167604      0.656164 -0.305116
2013-01-02  10086.000000  10086.000000  10086.000000 -0.962968
2013-01-03  10086.000000  10086.000000  10086.000000  0.442333
2013-01-04  10086.000000  10086.000000  10086.000000  1.271325
2013-01-05      0.877627      0.641684     -1.701115 -0.211141
2013-01-06      2.704554     -0.666753     -1.092838 -2.232137
-------Mean of the data without missing values----------
A    5043.681036
B    5043.023756
C    5042.643702
D      -0.332951
dtype: float64
-------Mean of the data with missing values----------
A    1.362071
B    0.047512
C   -0.712596
D   -0.332951
dtype: float64


Process finished with exit code 0

  3. diff(): compute the difference between rows (or, with axis=1, between columns)

import pandas as pd
import numpy as np

def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))



    return df

def print_data():

    df1 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)

    print("------Row differences (each row minus the next row)--------")
    print(df1.diff(-1))

    print("------Column differences with step 2 (each column minus the column two to its left)--------")
    print(df1.diff(2,axis=1))





if __name__ == '__main__':
    print_data()


# Output:

-------DF1-----
                   A         B         C         D
2013-01-01 -1.482337  0.735672 -0.523935  1.441714
2013-01-02 -0.293590 -1.251721 -0.532770 -0.178270
2013-01-03  0.464124  0.148478  0.647906 -0.462180
2013-01-04 -1.313573 -0.280773 -0.815059  0.449937
2013-01-05 -0.042054  0.037449 -1.380082  1.694301
2013-01-06 -0.685625  0.379272 -0.009392 -0.563834
------Row differences (each row minus the next row)--------
                   A         B         C         D
2013-01-01 -1.188747  1.987393  0.008835  1.619984
2013-01-02 -0.757714 -1.400199 -1.180676  0.283910
2013-01-03  1.777696  0.429250  1.462965 -0.912117
2013-01-04 -1.271519 -0.318222  0.565023 -1.244364
2013-01-05  0.643571 -0.341823 -1.370691  2.258135
2013-01-06       NaN       NaN       NaN       NaN
------Column differences with step 2 (each column minus the column two to its left)--------
             A   B         C         D
2013-01-01 NaN NaN  0.958402  0.706042
2013-01-02 NaN NaN -0.239180  1.073451
2013-01-03 NaN NaN  0.183783 -0.610658
2013-01-04 NaN NaN  0.498513  0.730710
2013-01-05 NaN NaN -1.338028  1.656852
2013-01-06 NaN NaN  0.676234 -0.943107

Process finished with exit code 0

  4. sub(): subtraction between two DataFrames, or between a DataFrame and a Series

import pandas as pd
import numpy as np

def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df1 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates)
    return df,df1,s

def print_data():

    df,df1,s1 = create_numpay_dataform()

    print("-------DF-----")
    print(df)
    print("-------DF1--------")
    print(df1)
    print("-------s1--------")
    print(s1)


    print("------DataFrame minus Series--------")
    print(df.sub(s1,axis="index"))

    print("------DataFrame minus DataFrame--------")
    print(df.sub(df1))





if __name__ == '__main__':
    print_data()


# Output:

-------DF-----
                   A         B         C         D
2013-01-01 -1.117915 -0.387520  0.013181 -0.732305
2013-01-02  0.259804  0.943158  0.209316 -0.179862
2013-01-03 -0.681971 -1.385040  0.354760 -0.572621
2013-01-04 -0.019748 -0.703220 -0.765874 -0.584478
2013-01-05  1.187278 -0.287918 -0.215136  0.075496
2013-01-06 -1.160146 -0.882323 -0.620577  0.380190
-------DF1--------
                   A         B         C         D
2013-01-01  1.532277 -0.527844 -0.345524  0.701999
2013-01-02  0.794895 -2.042780  1.163952 -0.877180
2013-01-03 -0.489494 -0.131753  0.444089  0.789567
2013-01-04  0.440047 -0.693099 -0.243348 -0.612980
2013-01-05 -1.128350 -1.012848  0.632883 -0.023234
2013-01-06 -0.672428 -0.249193  1.676576 -1.486626
-------s1--------
2013-01-01    1.0
2013-01-02    3.0
2013-01-03    5.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64
------DataFrame minus Series--------
                   A         B         C         D
2013-01-01 -2.117915 -1.387520 -0.986819 -1.732305
2013-01-02 -2.740196 -2.056842 -2.790684 -3.179862
2013-01-03 -5.681971 -6.385040 -4.645240 -5.572621
2013-01-04       NaN       NaN       NaN       NaN
2013-01-05 -4.812722 -6.287918 -6.215136 -5.924504
2013-01-06 -9.160146 -8.882323 -8.620577 -7.619810
------DataFrame minus DataFrame--------
                   A         B         C         D
2013-01-01 -2.650191  0.140324  0.358705 -1.434305
2013-01-02 -0.535091  2.985938 -0.954636  0.697318
2013-01-03 -0.192477 -1.253287 -0.089329 -1.362188
2013-01-04 -0.459795 -0.010121 -0.522526  0.028502
2013-01-05  2.315628  0.724930 -0.848019  0.098730
2013-01-06 -0.487718 -0.633130 -2.297154  1.866815

Process finished with exit code 0

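  5. apply(): one of the most flexible functions in pandas. Older documentation gives the signature as apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds); the broadcast and reduce parameters have since been removed, and recent versions use a signature along the lines of apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs).

   a. The most useful parameter is the first one, func, a function that you supply yourself.

   b. What is passed to func depends on axis. With axis=1, for example, each row is passed to func as a Series, and func can compute with the different fields of that Series.

   c. apply() then iterates over every row of the DataFrame automatically and combines the results into a single Series, which it returns.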

import pandas as pd
import numpy as np

def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    
    return df

def print_data():

    df = create_numpay_dataform()

    print("-------DF-----")
    print(df)
    


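    print("------apply with an existing function--------")
    # np.cumsum computes the cumulative sum
    print(df.apply(np.cumsum))

    print("------apply with a custom function--------")
    # .iloc[2] takes the third value by position; plain x[2] is ambiguous on a date-indexed Series
    print(df.apply(lambda x : x.max()-x.iloc[2]))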



if __name__ == '__main__':
    print_data()

# Output:

-------DF-----
                   A         B         C         D
2013-01-01 -2.086164  0.201652  1.722858  1.071210
2013-01-02 -0.462887 -0.188189  0.733832 -1.798445
2013-01-03  0.672316 -1.359191 -0.031073  0.508793
2013-01-04 -0.624844 -0.503734  0.262923 -0.519521
2013-01-05 -1.170108 -0.308858 -0.653888  0.552537
2013-01-06 -1.408287  0.406629  0.000608  0.085242
------apply with an existing function--------
                   A         B         C         D
2013-01-01 -2.086164  0.201652  1.722858  1.071210
2013-01-02 -2.549051  0.013464  2.456690 -0.727234
2013-01-03 -1.876735 -1.345727  2.425617 -0.218442
2013-01-04 -2.501579 -1.849461  2.688541 -0.737963
2013-01-05 -3.671687 -2.158320  2.034652 -0.185426
2013-01-06 -5.079974 -1.751691  2.035260 -0.100184
------apply with a custom function--------
A    0.000000
B    1.765820
C    1.753931
D    0.562418
dtype: float64

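Point b above describes axis=1, which the example does not exercise. The following is a minimal sketch of the difference between the two axes. The data, the column names `price` and `qty`, and the helper `row_total` are all made up for illustration.

```python
import pandas as pd

# Hypothetical data, chosen so the results are easy to check by hand.
df = pd.DataFrame({"price": [2.0, 3.5, 1.0], "qty": [3, 2, 10]})

def row_total(row):
    # With axis=1, `row` is a Series whose index holds the column names.
    return row["price"] * row["qty"]

# axis=1: one call per row; the scalar results are collected into a Series
# indexed like the rows of the DataFrame.
totals = df.apply(row_total, axis=1)

# axis=0 (the default): one call per column; the result is indexed by column name.
col_range = df.apply(lambda col: col.max() - col.min())

print(totals)
print(col_range)
```

For simple arithmetic like this, the vectorized form `df["price"] * df["qty"]` is usually faster; apply is most useful when the per-row logic cannot be expressed with column operations.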

  

V. Merging

  1. concat(): concatenate multiple DataFrames.

import pandas as pd
import numpy as np

def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=9)
    df = pd.DataFrame(np.random.randn(9, 4), index=dates, columns=list('ABCD'))

    return df

def print_data():

    df = create_numpay_dataform()

    print("-------DF-----")
    print(df)

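    print("-------DF slices-------")
    pieces = [df[:3], df[3:7], df[7:]]
    print(pieces[0])
    print(pieces[1])
    print(pieces[2])

    print("-------Concatenated slices-------")
    # concat keeps the order of the list it is given, so the last slice lands before the middle one
    print(pd.concat([pieces[0],pieces[2],pieces[1]]))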




if __name__ == '__main__':
    print_data()


# Output:

-------DF-----
                   A         B         C         D
2013-01-01  0.276946  1.235298  0.932776 -0.565113
2013-01-02 -0.503525  0.365262  0.884855 -1.432992
2013-01-03 -0.042289  0.923140 -0.067742 -0.993290
2013-01-04 -0.560989 -0.433529 -0.339409  0.099952
2013-01-05  0.032306  0.003271  0.605058  0.398746
2013-01-06  0.033632 -1.831336  0.828554 -0.745181
2013-01-07 -0.306900  0.027087  0.387204 -1.099752
2013-01-08  0.580035 -0.305193 -0.287659 -1.204415
2013-01-09  1.077574  1.034927 -0.360812 -0.792874
-------DF slices-------
                   A         B         C         D
2013-01-01  0.276946  1.235298  0.932776 -0.565113
2013-01-02 -0.503525  0.365262  0.884855 -1.432992
2013-01-03 -0.042289  0.923140 -0.067742 -0.993290
                   A         B         C         D
2013-01-04 -0.560989 -0.433529 -0.339409  0.099952
2013-01-05  0.032306  0.003271  0.605058  0.398746
2013-01-06  0.033632 -1.831336  0.828554 -0.745181
2013-01-07 -0.306900  0.027087  0.387204 -1.099752
                   A         B         C         D
2013-01-08  0.580035 -0.305193 -0.287659 -1.204415
2013-01-09  1.077574  1.034927 -0.360812 -0.792874
-------Concatenated slices-------
                   A         B         C         D
2013-01-01  0.276946  1.235298  0.932776 -0.565113
2013-01-02 -0.503525  0.365262  0.884855 -1.432992
2013-01-03 -0.042289  0.923140 -0.067742 -0.993290
2013-01-08  0.580035 -0.305193 -0.287659 -1.204415
2013-01-09  1.077574  1.034927 -0.360812 -0.792874
2013-01-04 -0.560989 -0.433529 -0.339409  0.099952
2013-01-05  0.032306  0.003271  0.605058  0.398746
2013-01-06  0.033632 -1.831336  0.828554 -0.745181
2013-01-07 -0.306900  0.027087  0.387204 -1.099752

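The example above only stacks rows. Here is a short sketch of three other common uses of concat: restoring index order, renumbering the index, and joining side by side. The frames `a`, `b`, `c` and their labels are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"x": [3, 4]}, index=["r3", "r4"])
c = pd.DataFrame({"y": [10, 20]}, index=["r1", "r2"])

# Default axis=0: stack rows, keeping each piece's own index labels.
rows = pd.concat([b, a])
# concat does not reorder anything; sort_index() restores label order if wanted.
rows_sorted = rows.sort_index()

# ignore_index=True discards the labels and renumbers the rows 0..n-1.
renumbered = pd.concat([b, a], ignore_index=True)

# axis=1: place frames side by side, aligning them on the row index.
side_by_side = pd.concat([a, c], axis=1)

print(rows_sorted)
print(renumbered)
print(side_by_side)
```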

  2. merge(): SQL-style merging, similar to a JOIN.

import pandas as pd
import numpy as np

def create_numpay_dataform():
    
    df1 = pd.DataFrame({"rng":["xiaohua","ming","uzi"],"age":[22,19,24]})
    df2 = pd.DataFrame({"rng": ["xiaohua", "ming", "uzi"], "role": ["mid", "sup", "adc"]})
    df3 = pd.DataFrame({"team":["rng","rng","rng"],"name":["xiaohua","ming","uzi"]})
    df4 = pd.DataFrame({"team":["rng","rng"],"opponent":["LGD","IG"]})
    return df1,df2,df3,df4

def print_data():

    df1,df2,df3,df4 = create_numpay_dataform()

    print("-------DF1-----")
    print(df1)

    print("-------DF2-----")
    print(df2)

    print("-------DF3-----")
    print(df3)

    print("-------DF4-----")
    print(df4)

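    print("-------Merging on a key whose values are unique-------")
    print(pd.merge(df1,df2,on="rng"))

    print("-------Merging on a key whose values repeat-------")
    # Every df3 row matches every df4 row with the same team: 3 x 2 = 6 rows
    print(pd.merge(df3,df4,on="team"))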


if __name__ == '__main__':
    print_data()

# Output:

-------DF1-----
       rng  age
0  xiaohua   22
1     ming   19
2      uzi   24
-------DF2-----
       rng role
0  xiaohua  mid
1     ming  sup
2      uzi  adc
-------DF3-----
  team     name
0  rng  xiaohua
1  rng     ming
2  rng      uzi
-------DF4-----
  team opponent
0  rng      LGD
1  rng       IG
-------Merging on a key whose values are unique-------
       rng  age role
0  xiaohua   22  mid
1     ming   19  sup
2      uzi   24  adc
-------Merging on a key whose values repeat-------
  team     name opponent
0  rng  xiaohua      LGD
1  rng  xiaohua       IG
2  rng     ming      LGD
3  rng     ming       IG
4  rng      uzi      LGD
5  rng      uzi       IG

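Both merges above use the default inner join, and every key has a match on both sides. The sketch below shows what happens when keys are missing on one side, using the `how` and `indicator` parameters of pd.merge. The frames and the extra player row are made up for illustration.

```python
import pandas as pd

players = pd.DataFrame({"name": ["xiaohua", "ming", "uzi"], "age": [22, 19, 24]})
# Hypothetical second table: "xiaohua" is missing and "karsa" is extra.
roles = pd.DataFrame({"name": ["ming", "uzi", "karsa"], "role": ["sup", "adc", "jug"]})

# Default how="inner": only keys present in both frames survive.
inner = pd.merge(players, roles, on="name")

# how="left": keep every row of the left frame; missing matches become NaN.
left = pd.merge(players, roles, on="name", how="left")

# how="outer" keeps keys from both sides; indicator=True adds a `_merge`
# column recording where each row came from.
outer = pd.merge(players, roles, on="name", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```

The `_merge` column is a convenient way to check which keys failed to match before deciding how to handle them.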

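  3. append(): append rows to the end of a DataFrame. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the replacement.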

import pandas as pd
import numpy as np

def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=9)
    df = pd.DataFrame(np.random.randn(9, 4), index=dates, columns=list('ABCD'))


    return df

def print_data():

    df = create_numpay_dataform()

    print("-------DF-----")
    print(df)

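    print("-------Select N rows from DF-----")
    s = df.iloc[0:2]
    print(s)

    print("-------Append the rows to the end of DF-------")
    # DataFrame.append was removed in pandas 2.0; pd.concat produces the same result.
    # ignore_index=True drops the original index and renumbers the rows from 0.
    print(pd.concat([df, s], ignore_index=True))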


if __name__ == '__main__':
    print_data()


# Output:

-------DF-----
                   A         B         C         D
2013-01-01  0.180073  1.027674  0.699021  0.211052
2013-01-02  0.700873  0.893067  0.234802  1.378712
2013-01-03 -0.318609 -0.291524  0.123771  1.057293
2013-01-04 -0.145169  0.213432  0.285161 -0.231468
2013-01-05 -0.916774 -1.284495  1.661716 -0.258821
2013-01-06  0.460373 -2.351527 -0.462772 -0.587480
2013-01-07 -1.149013 -1.290900  0.171418 -0.076885
2013-01-08 -1.621095  0.704023 -0.706554  0.016696
2013-01-09 -0.405135 -1.019510  0.863830 -1.316628
-------Select N rows from DF-----
                   A         B         C         D
2013-01-01  0.180073  1.027674  0.699021  0.211052
2013-01-02  0.700873  0.893067  0.234802  1.378712
-------Append the rows to the end of DF-------
           A         B         C         D
0   0.180073  1.027674  0.699021  0.211052
1   0.700873  0.893067  0.234802  1.378712
2  -0.318609 -0.291524  0.123771  1.057293
3  -0.145169  0.213432  0.285161 -0.231468
4  -0.916774 -1.284495  1.661716 -0.258821
5   0.460373 -2.351527 -0.462772 -0.587480
6  -1.149013 -1.290900  0.171418 -0.076885
7  -1.621095  0.704023 -0.706554  0.016696
8  -0.405135 -1.019510  0.863830 -1.316628
9   0.180073  1.027674  0.699021  0.211052
10  0.700873  0.893067  0.234802  1.378712

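Because DataFrame.append was removed in pandas 2.0, and even where it exists it returns a new copy on every call, a common pattern is to collect the new rows first and concatenate once. This is a minimal sketch; the row values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Rows arriving one at a time (hypothetical values).
new_rows = [{"A": 5, "B": 6}, {"A": 7, "B": 8}]

# Build one DataFrame from all new rows, then concatenate a single time,
# instead of growing `df` inside a loop.
result = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)

print(result)
```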

 

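VI. Grouping

  group by: a process that involves one or more of the following steps:

    a. Split: divide the data into groups according to some criteria

    b. Apply: apply a function to each group independently

    c. Combine: combine the results into a single data structure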

import pandas as pd
import numpy as np

def create_numpay_dataform():
    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                       'B': ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                       'C': np.random.randn(8),
                       'D': np.random.randn(8)})

    return df

def print_data():

    df = create_numpay_dataform()

    print("-------DF-----")
    print(df)

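    print("-------Group by one column, then aggregate-----")
    # Select the numeric columns explicitly: since pandas 2.0, sum() no longer silently drops the string column B
    print(df.groupby('A')[['C', 'D']].sum())

    print("-------Group by several columns, then aggregate-------")
    print(df.groupby(["B","A"]).sum())
    print("-------Same grouping keys in reverse order-------")
    print(df.groupby(["A", "B"]).sum())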


if __name__ == '__main__':
    print_data()


# Output:

-------DF-----
     A      B         C         D
0  foo    one  0.453100 -0.544181
1  bar    one  1.692183 -0.253889
2  foo    two -0.656308 -1.177487
3  bar  three -1.078701  1.239209
4  foo    two -0.866770 -0.949062
5  bar    two -1.305346 -1.705380
6  foo    one -0.259537 -1.492884
7  foo  three -0.669982 -0.943082
-------Group by one column, then aggregate-----
            C         D
A                      
bar -0.691863 -0.720059
foo -1.999496 -5.106695
-------Group by several columns, then aggregate-------
                  C         D
B     A                      
one   bar  1.692183 -0.253889
      foo  0.193563 -2.037065
three bar -1.078701  1.239209
      foo -0.669982 -0.943082
two   bar -1.305346 -1.705380
      foo -1.523078 -2.126549
-------Same grouping keys in reverse order-------
                  C         D
A   B                        
bar one    1.692183 -0.253889
    three -1.078701  1.239209
    two   -1.305346 -1.705380
foo one    0.193563 -2.037065
    three -0.669982 -0.943082
    two   -1.523078 -2.126549

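To make the split, apply, and combine steps visible, the sketch below performs them by hand and then compares the result with the one-call form. The data is made up so the sums can be checked mentally, and the output column names `c_sum` and `d_mean` are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, 2.0, 3.0, 4.0],
                   "D": [10.0, 20.0, 30.0, 40.0]})

# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs.
# Apply: compute one value per group.
# Combine: assemble the per-group values into a Series by hand.
manual = pd.Series({key: group["C"].sum() for key, group in df.groupby("A")}, name="C")

# The built-in aggregation performs all three steps in one call.
builtin = df.groupby("A")["C"].sum()

# agg() with named aggregation applies a different function to each column.
summary = df.groupby("A").agg(c_sum=("C", "sum"), d_mean=("D", "mean"))

print(manual)
print(builtin)
print(summary)
```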

 

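VII. Pivot tables

  What is a pivot table?

    A pivot table is an interactive table in which you can freely combine fields to quickly summarize large amounts of data and analyze how the fields relate to one another. It lets you view a data table from several angles by pivoting on different fields, and builds cross-tabulations that show summaries, analysis results, and digests of the data at different levels.

  What are the advantages of a pivot table?

    • Quickly group and summarize numeric data, and view it by category and subcategory.
    • Expand or collapse the data of interest to move between a summary and its details.
    • Build cross-tabulations (move rows to columns or columns to rows) to see different summaries of the data.
    • Quickly compute summaries, differences, and similar statistics for numeric data.

  pivot_table(): usage pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

    The four most important parameters are index, values, columns, and aggfunc:

      a. index: the column(s) to group by on the rows of the pivot table; one level or several. Nearly every pivot table sets it (pandas requires at least one of index or columns).

      b. values: the column(s) whose data is aggregated

      c. columns: like index, but sets the grouping levels for the columns. It is optional and provides an additional way to split the data.

      d. aggfunc: the aggregation applied to each group of values. It can be a function, a function name such as 'sum', a list of these, or a dict mapping columns to them; the default is 'mean'.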

import pandas as pd
import numpy as np

def create_numpay_dataform():

    df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                       'B': ['A', 'B', 'C'] * 4,
                       'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                       'D': np.random.randn(12),
                       'E': np.random.randn(12)})

    return df

# Not used in this example. As written, it only checks whether the first element is positive.
def table_func(data):
    for i in data:
        if i > 0:
            return True
        else:
            return False

def print_data():

    df = create_numpay_dataform()

    print("-------DF-----")
    print(df)

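    print("-------Pivot table-----")
    # aggfunc defaults to 'mean'
    print(pd.pivot_table(df, values='E', index=['A', 'B'],columns=["C"]))
    print("--------------------")
    # String names avoid the FutureWarning that recent pandas emits for np.sum / np.mean
    print(pd.pivot_table(df, values='E', index=['A', 'B'],columns=["C"], aggfunc=["sum","mean"]))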


if __name__ == '__main__':
    print_data()


# Output:

-------DF-----
        A  B    C         D         E
0     one  A  foo  0.596744  1.260272
1     one  B  foo -0.560929  2.077597
2     two  C  foo -1.326983 -0.997230
3   three  A  bar  0.714451  0.520551
4     one  B  bar  2.378704  0.336855
5     one  C  bar -0.771644  0.109514
6     two  A  foo -2.606868 -0.279142
7   three  B  foo -0.775949 -1.383773
8     one  C  foo  0.106014 -0.840803
9     one  A  bar -0.877053  0.090785
10    two  B  bar -1.594153 -1.002086
11  three  C  bar -0.032272 -0.700847
-------Pivot table-----
C             bar       foo
A     B                    
one   A  0.090785  1.260272
      B  0.336855  2.077597
      C  0.109514 -0.840803
three A  0.520551       NaN
      B       NaN -1.383773
      C -0.700847       NaN
two   A       NaN -0.279142
      B -1.002086       NaN
      C       NaN -0.997230
--------------------
              sum                mean          
C             bar       foo       bar       foo
A     B                                        
one   A  0.090785  1.260272  0.090785  1.260272
      B  0.336855  2.077597  0.336855  2.077597
      C  0.109514 -0.840803  0.109514 -0.840803
three A  0.520551       NaN  0.520551       NaN
      B       NaN -1.383773       NaN -1.383773
      C -0.700847       NaN -0.700847       NaN
two   A       NaN -0.279142       NaN -0.279142
      B -1.002086       NaN -1.002086       NaN
      C       NaN -0.997230       NaN -0.997230

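The output above contains NaN wherever a combination of keys has no rows. The sketch below shows the fill_value and margins parameters from the signature, and the equivalent groupby formulation. The data is made up so each cell can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame({"A": ["one", "one", "two", "two"],
                   "C": ["foo", "bar", "foo", "foo"],
                   "E": [1.0, 2.0, 3.0, 5.0]})

# aggfunc defaults to "mean"; the ("two", "bar") cell has no rows, so it is NaN.
mean_table = pd.pivot_table(df, values="E", index="A", columns="C")

# fill_value replaces the empty cells; margins=True adds "All" totals
# as an extra row and an extra column.
sum_table = pd.pivot_table(df, values="E", index="A", columns="C",
                           aggfunc="sum", fill_value=0, margins=True)

# The same mean table built from groupby + unstack.
same = df.groupby(["A", "C"])["E"].mean().unstack("C")

print(mean_table)
print(sum_table)
print(same)
```

Seeing pivot_table as a groupby followed by unstack can help when deciding which fields belong in index and which in columns.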

 

 

 

 

 

  

posted @ 2020-10-05 17:50  X小白的逆袭之旅