Data Analysis with Pandas

I. Introduction to Pandas

1. Introduction

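pandas is a library built on top of NumPy and created for data analysis tasks. It provides standard data models and the tools needed to work efficiently with large datasets, along with a large collection of functions and methods for processing data quickly and conveniently. It is one of the main reasons Python is a powerful and productive environment for data analysis.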

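Why learn pandas?

NumPy can already help us process data, so what is the point of learning pandas?
NumPy is designed mainly for numeric data, but data analysis also involves many other kinds of data, such as strings and time series. pandas handles these non-numeric kinds of data well.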

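2. Data structures

Series: a one-dimensional labeled array, similar to a one-dimensional NumPy array and also close to Python's built-in list. A Series can hold values of different types, such as strings, booleans, and numbers.

DataFrame: a two-dimensional tabular data structure. Many of its features resemble data.frame in R. A DataFrame can be thought of as a container of Series.

Time-Series: a Series indexed by time.

Panel: a three-dimensional array that can be thought of as a container of DataFrames. Note that Panel was removed in pandas 0.25; a DataFrame with a multi-level index is the usual replacement.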

II. Pandas Data Structures: Series

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

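A Series is an object similar to a one-dimensional array. It consists of two parts:

values: a set of data (an ndarray)
index: the index labels associated with the data

Creating a Series

From a list or NumPy array
From a dict

1. Creating a Series

# There are two ways to create a Series
# 1. From a list or NumPy array (the default index is the integers 0 to N-1)

# Create a Series from a list
Series(data=[1,2,3,4,5],name='zzz')
# Result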
0    1
1    2
2    3
3    4
4    5
Name: zzz, dtype: int64

# Create a Series from a NumPy array
Series(data=np.random.randint(0,10,size=(5,)))
# Result
0    7
1    6
2    2
3    2
4    2
dtype: int32

# The index parameter sets explicit labels
s = Series(data=np.random.randint(0,10,size=(5,)),index=['a','b','c','d','e'])
# Result
a    2
b    5
c    6
d    6
e    3
dtype: int32


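# 2. From a dict: the keys become the index labels.
#    An index argument may still be passed; it selects and orders the keys.
# Note: the data source must be one-dimensional
# (语文 = Chinese, 数学 = math)
dic = {
    '语文':80,
    '数学':95
}
s = Series(data=dic)
# Result (recent pandas versions keep the dict's insertion order)
语文    80
数学    95
dtype: int64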
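
The note above about passing index together with a dict can be checked with a small sketch. The variable names are illustrative, and 英语 (English) is a label added here only to show what happens to a key that the dict does not contain.

```python
import pandas as pd

scores = {'语文': 80, '数学': 95}

# index selects and orders the dict keys; a label that is not
# a key of the dict ('英语' here) gets NaN, so the dtype becomes float64
s = pd.Series(scores, index=['数学', '语文', '英语'])
print(s)
```

Because one value is missing, the integer scores are stored as floats.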

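2. Indexing a Series

The examples above show that a Series can be given an explicit index. What is an explicit index good for?

The answer is simple: an explicit index makes a Series easier to read.

Square brackets with a single label return one element (a scalar); square brackets with a list of labels return a Series.

(1) Explicit index:

- Use the elements of index as the index values
- Use s.loc[] (recommended): note that loc always takes explicit labels

Note that slicing by explicit index is a closed interval: both endpoints are included.

(2) Implicit index:

- Use integers as the index values
- Use .iloc[] (recommended): iloc always takes implicit (positional) indexes

Note that slicing by implicit index is a half-open interval: the end position is excluded.

A bare integer such as s[0] on a Series whose labels are not integers relies on positional fallback, which recent pandas versions deprecate, so s.iloc[0] is the safer form.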

s = Series(data=np.random.randint(0,10,size=(5,)),index=['a','b','c','d','e'])

# Value of s
a    7
b    6
c    3
d    1
e    8
dtype: int32


# Implicit (positional) index
s[0]
# Result
7


s[[0,3]]
# Result
a    7
d    1
dtype: int32


s.iloc[0]
# Result
7


s.iloc[[0,3]]
# Result
a    7
d    1
dtype: int32


# Explicit (label) index
s.a
# Result
7


s[['a','d']]
# Result
a    7
d    1
dtype: int32


s.loc['a']
# Result
7


s.loc[['a','b']]
# Result
a    7
b    6
dtype: int32

3. Slicing

# 1. Slicing by explicit index: [] and loc
s = Series(data=np.random.randint(0,10,size=(5,)),index=['a','b','c','d','e'])
a    5
b    7
c    6
d    0
e    0
dtype: int32

# Explicit index, closed interval
s['a':'d']
# Result
a    5
b    7
c    6
d    0
dtype: int32


s.loc['a':'c']
# Result
a    5
b    7
c    6
dtype: int32


# 2. Slicing by implicit index: integer positions and iloc, half-open interval
s[0:3]
# Result
a    5
b    7
c    6
dtype: int32


s.iloc[0:3]
# Result
a    5
b    7
c    6
dtype: int32
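
The closed versus half-open rule can be seen side by side in a minimal sketch. It uses fixed values instead of random ones so the outcome is reproducible; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([5, 7, 6, 0, 0], index=list('abcde'))

by_label = s.loc['a':'c']    # label slice: the end label 'c' is included
by_position = s.iloc[0:3]    # position slice: position 3 is excluded

# Both select the first three elements
same = by_label.equals(by_position)

# A label slice up to 'd' therefore has four elements
wider = s.loc['a':'d']
print(by_label, by_position, wider, sep='\n')
```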

4. Common Series attributes

Common attributes of a Series

shape
size
index
values
dtype

s = Series(data=np.random.randint(1,10,size=(4,)),index=['a','b','c','d'])
a    2
b    8
c    1
d    7
dtype: int32

s.shape  # shape
(4,)

s.size  # number of elements
4

s.index # the index
Index(['a', 'b', 'c', 'd'], dtype='object')

s.values # the values
array([2, 8, 1, 7])

s.dtype # element type
dtype('int32')

5. Common Series methods

head(), tail()
unique()
isnull(), notnull()
add() sub() mul() div()

A Series can be thought of as an ordered dict of variable length

s = Series(data=np.random.randint(0,10,size=(5,)),index=['a','b','c','d','e'])
a    5
b    7
c    6
d    0
e    0
dtype: int32


# 1. Add a row to a Series: like adding a key-value pair to a dict
s['g'] = 10
# Result
a     5
b     7
c     6
d     0
e     0
g    10
dtype: int64


# 2. shape, size, index, values and so on give the attributes of a Series
s.shape  # (6,)

s.size  # 6

s.index  # Index(['a', 'b', 'c', 'd', 'e', 'g'], dtype='object')

s.values  # array([ 5,  7,  6,  0,  0, 10], dtype=int64)


# 3. s.head() and s.tail() show the first n and the last n values
s.head(3)
a    5
b    7
c    6
dtype: int64


s.tail(2)
e     0
g    10
dtype: int64


# 4. Remove duplicate elements from a Series
s = Series([1,1,1,2,2,3,3,4,4,4,5,6,6])
0     1
1     1
2     1
3     2
4     2
5     3
6     3
7     4
8     4
9     4
10    5
11    6
12    6
dtype: int64


s.unique()  # unique() removes duplicates and returns an ndarray
array([1, 2, 3, 4, 5, 6], dtype=int64)


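# 5. When a label has no matching value, the result can contain missing data shown as NaN (not a number)
s1 = Series([1,2,3],index=['a','b','c'])
s2 = Series([1,2,3],index=['a','b','d'])
display(s1,s2)  # display() is available in Jupyter/IPython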
a    1
b    2
c    3
dtype: int64

a    1
b    2
d    3
dtype: int64

# Add the two Series together
s = s1+s2
a    2.0
b    4.0
c    NaN
d    NaN
dtype: float64


# 6. pd.isnull(), pd.notnull(), s.isnull() and s.notnull() detect missing data
# s is the sum computed above
s.isnull()
a    False
b    False
c     True
d     True
dtype: bool


s.notnull()
a     True
b     True
c    False
d    False
dtype: bool


# Drop the missing data
s[[True,True,False,False]]  # keeps the positions marked True
a    2.0
b    4.0
dtype: float64

# Or use notnull
s[s.notnull()]
a    2.0
b    4.0
dtype: float64


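# 7. Arithmetic on Series
'''
Arithmetic between Series
Data with different indexes is aligned automatically
Labels that do not match are filled with NaN

Operators:
+ - * / correspond to add() sub() mul() div()
'''

s1 = Series([1,2,3])
s2 = Series([4,5,6])


# Addition
s1+s2  # s1.add(s2)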
0    5
1    7
2    9
dtype: int64


# Multiplication
s1.mul(s2)  # s1 * s2
0     4
1    10
2    18
dtype: int64
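
The method forms add(), sub(), mul() and div() accept a fill_value argument that the operators do not. The sketch below reuses the misaligned s1 and s2 from item 5 to show the difference; treating a missing label as 0 is an assumption that only makes sense when a missing entry really means zero.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['a', 'b', 'd'])

# The operator aligns on labels and yields NaN where a label is missing on one side
plain = s1 + s2

# add() with fill_value substitutes 0 for the missing side before adding
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```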

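III. Pandas Data Structures: DataFrame

A DataFrame is a tabular data structure made up of multiple columns in a fixed order. It was designed to extend the use of Series from one dimension to several. A DataFrame has both a row index and a column index.

Row index: index
Column index: columns
Values: values

Creating a DataFrame

From an ndarray
From a dict

1. Creating a DataFrame

The most common way is to pass a dict. The DataFrame uses the dict's keys as the names of the [columns] and the dict's values (each an array) as the column data.

The DataFrame also adds an index for each row automatically.

When a DataFrame is created from a dict, the columns parameter does not rename columns; it selects and orders the dict's keys.

As with Series, if a requested column does not match any key in the dict, its values are NaN.

Creating a DataFrame from an ndarray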

# Without index and columns, the default implicit indexes are used
DataFrame(data=np.random.randint(0,100,size=(3,4)))
# Result
      0    1     2    3
0    37    12   72    9
1    75    5    79    64
2    16    1    76    71

# With index
DataFrame(data=np.random.randint(0,100,size=(3,4)),index=['a','b','c'])
# Result
      0    1    2    3
a    37    12   72    9
b    75    5    79    64
c    16    1    76    71

# With index and columns
DataFrame(data=np.random.randint(0,100,size=(3,4)),index=['a','b','c'],columns=['A','B','C','D'])
# Result
      A    B    C    D
a    37    12   72    9
b    75    5    79    64
c    16    1    76    71

Creating a DataFrame from a dict

import pandas as pd
from pandas import Series,DataFrame
import numpy as np


dic = {
    'name':['john','tom'],
    'salary':[10000,20000]
}

# The dict's keys become the column names; index sets the row labels
df = DataFrame(data=dic,index=['a','b'])
# Result
     name    salary
a    john    10000
b    tom    20000

# Without index, the default implicit row index is used
df = DataFrame(data=dic)
# Result
     name    salary
0    john    10000
1    tom     20000
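
How the columns parameter behaves with a dict can be shown in a short sketch. The 'age' column is hypothetical and is added only to show what a name that matches no key produces.

```python
import pandas as pd

dic = {
    'name': ['john', 'tom'],
    'salary': [10000, 20000],
}

# columns selects and orders the dict keys;
# 'age' is not a key of dic, so that column is filled with NaN
df = pd.DataFrame(data=dic, columns=['salary', 'name', 'age'])
print(df)
```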

DataFrame attributes: values, columns, index, shape

dic = {
    'name':['john','tom'],
    'salary':[10000,20000]
}

df = DataFrame(data=dic)
# df
     name    salary
0    john    10000
1    tom     20000


# Values of df
df.values
# Result
array([['john', 10000],
       ['tom', 20000]], dtype=object)

# Column index of df
df.columns
# Result
Index(['name', 'salary'], dtype='object')

# Row index of df
df.index
# Result
RangeIndex(start=0, stop=2, step=1)

# Dimensions (shape) of df
df.shape
# Result
(2, 2)

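2. Indexing a DataFrame

(1) Indexing columns

- Dict-style: df['q']
- Attribute-style: df.q

A column of a DataFrame can be retrieved as a Series. The returned Series has the same index as the original DataFrame, and its name attribute is already set to the column name.

# 张三 and 李四 are two students; the rows 语文, 数学, 英语, 理综 are Chinese, math, English, and combined science
dic = {
    '张三':[150,150,150,300],
    '李四':[0,0,0,0]
}

df = DataFrame(data=dic,index=['语文','数学','英语','理综'])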
# df

       张三    李四
语文    150     0
数学    150     0
英语    150     0
理综    300     0

# 1. Rename the columns
df.columns = ['zhangsan','lisi']
# Result
    zhangsan  lisi
语文    150     0
数学    150     0
英语    150     0
理综    300     0


# 2. Get the zhangsan column
df['zhangsan']
# Result
语文    150
数学    150
英语    150
理综    300
Name: zhangsan, dtype: int64


# 3. Get the lisi column
df.lisi
# Result
语文    0
数学    0
英语    0
理综    0
Name: lisi, dtype: int64


# 4. Get several columns at once (a list of names, in the order requested)
df[['lisi','zhangsan']]
# Result
      lisi    zhangsan
语文    0        150
数学    0        150
英语    0        150
理综    0        300

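(2) Indexing rows

- Use .loc[] with index labels to select rows
- Use .iloc[] with integer positions to select rows

- .loc[row]: returns the whole row
- .loc[row, column]: returns the value at that row and column

Selecting a row also returns a Series, whose index is the original columns.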

# df
    zhangsan  lisi
语文    150    0
数学    150    0
英语    150    0
理综    300    0


# 1. Get the scores whose explicit label is 数学 (math)
df.loc['数学']
# Result
zhangsan    150
lisi          0
Name: 数学, dtype: int64


# 2. Get the scores at implicit position 1
df.iloc[1]
# Result
zhangsan    150
lisi          0
Name: 数学, dtype: int64

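(3) Indexing a single element

- Select the column first, then the row
- Use row indexing (iloc[3,1] or loc['C','q']): the row comes first and the column second, which locates one element of one row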

# df
    zhangsan lisi
语文    150    0
数学    150    0
英语    150    0
理综    300    0


# 1. Get zhangsan's 英语 (English) score
df.loc['英语','zhangsan']
# Result
150

# Get zhangsan's 语文 (Chinese) and 数学 (math) scores
df.loc[['语文','数学'],'zhangsan']
# Result
语文    150
数学    150
Name: zhangsan, dtype: int64

# 2. Change zhangsan's 英语 (English) score
df.loc['英语','zhangsan'] = 60
# Result
    zhangsan  lisi
语文    150    0
数学    150    0
英语    60     0
理综    300    0

(4) Slicing

[Note] With square brackets directly:

an index selects columns
a slice selects rows

# df
    zhangsan lisi
语文    150    0
数学    150    0
英语    60     0
理综    300    0


# 1. Get the rows from 语文 through 英语
df['语文':'英语']
# Result
    zhangsan lisi
语文    150    0
数学    150    0
英语    60     0


# 2. Slices inside loc and iloc can also slice columns: df.loc[row_start:row_end, col_start:col_end]
df.loc['数学':'英语','zhangsan':'lisi']
# Result
    zhangsan lisi
数学    150    0
英语    60     0

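3. Arithmetic on DataFrames

(1) Arithmetic between DataFrames

Same as for Series:

Data with different indexes is aligned automatically
Labels that do not match are filled with NaN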

# df
    zhangsan  lisi
语文    150    0
数学    150    0
英语    60     0
理综    300    0


# 1. Addition
df + df
# Result
    zhangsan  lisi
语文    300    0
数学    300    0
英语    120    0
理综    600    0


# 2. Division (0 / 0 gives NaN in the lisi column)
df / df
# Result

    zhangsan  lisi
语文    1.0    NaN
数学    1.0    NaN
英语    1.0    NaN
理综    1.0    NaN


# 3. Set zhangsan's 数学 (math) score to 0
df.loc['数学','zhangsan'] = 0
# Result
　　  zhangsan  lisi
语文    150      0
数学    0        0
英语    60       0
理综    300      0


# 4. Add 100 to all of lisi's scores (each 0 becomes 100)
df['lisi'] += 100
# Result
　　　zhangsan  lisi
语文    150     100
数学    0       100
英语    60      100
理综    300     100


# 5. Add 10 to everyone's scores
df += 10
# Result
　　  zhangsan   lisi
语文    160      110
数学    10       110
英语    70       110
理综    310      110

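IV. Handling Missing Data

Loading data with pandas

Parameters of read_xxx():

- sep
- header

# Common ways to load data
pd.read_csv()
pd.read_excel()
pd.read_sql()

1. Two kinds of missing data

There are two kinds of missing data:

None
np.nan (NaN)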
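# 1. None
None is built into Python and its type is a Python object (NoneType), so None cannot take part in arithmetic.

# Check the type of None
type(None)  # NoneType


# 2. np.nan (NaN)
np.nan is a float and can take part in arithmetic, but the result is always NaN.

# Check the type of np.nan
type(np.nan)  # float
np.nan + 1  # nan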
2. None and NaN in pandas

pandas treats both None and np.nan as missing values; in a numeric column None is stored as np.nan
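
A minimal sketch of that conversion, using illustrative variable names: a numeric Series built with None ends up holding NaN, and isna() reports None and np.nan alike.

```python
import numpy as np
import pandas as pd

# None is converted to NaN when the rest of the data is numeric
numeric = pd.Series([1, None, np.nan])

# In a column of strings, None is still reported as missing
text = pd.Series(['a', None])

print(numeric.dtype)      # float64
print(numeric.isna())
print(text.isna())
```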

(1) Preparing the data

# 1. Create a DataFrame
np.random.seed(1)
df = DataFrame(np.random.randint(100,200,size=(7,6)))
# Result
      0      1      2      3      4      5
0    137    112    172    109    175    105
1    179    164    116    101    176    171
2    106    125    150    120    118    184
3    111    128    129    114    150    168
4    187    187    194    196    186    113
5    109    107    163    161    122    157
6    101    100    160    181    108    188

# 2. Set some elements to missing values
df.iloc[1,2] = None
df.iloc[3,2] = np.nan
df.iloc[4,4] = None
# Result

      0      1       2       3       4       5
0    137    112    172.0    109    175.0    105
1    179    164    NaN      101    176.0    171
2    106    125    150.0    120    118.0    184
3    111    128    NaN      114    150.0    168
4    187    187    194.0    196    NaN      113
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188

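(2) pandas operations on missing values

isnull(): test function
notnull(): test function
dropna(): filter out missing data
fillna(): fill in missing data

(2-1) Test functions and the drop function

isnull()
notnull()
drop()

# 1. Test functions
    isnull()
    notnull()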


# df
　　   0      1       2       3      4        5
0    137    112    172.0    109    175.0    105
1    179    164    NaN      101    176.0    171
2    106    125    150.0    120    118.0    184
3    111    128    NaN      114    150.0    168
4    187    187    194.0    196    NaN      113
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188


# Test for missing values
df.isnull()
# Result
       0        1    　　 2    　　 3    　　 4    　　5
0    False    False    False    False    False    False
1    False    False    True     False    False    False
2    False    False    False    False    False    False
3    False    False    True     False    False    False
4    False    False    False    False    True     False
5    False    False    False    False    False    False
6    False    False    False    False    False    False


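# 2. Cleaning data with drop
    drop()

# Clean data with drop
# df.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
drop removes the specified rows or columns:
df.drop(labels=1) removes the row labeled 1; to remove several rows use labels=[0,1,2]
df.drop(columns=1) removes the column labeled 1; to remove several columns use columns=[1,2,3]

When labels is given without index or columns, axis=0 means rows and axis=1 means columns; the default is rows
df.drop([0, 1]) is the same as df.drop([0, 1], axis=0) and removes the rows labeled 0 and 1
df.drop([0, 1], axis=1) removes the columns labeled 0 and 1


# Remove the row labeled 1
df.drop(labels=1)
# Result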
      0      1       2       3       4      5
0    137    112    172.0    109    175.0    105
2    106    125    150.0    120    118.0    184
3    111    128    NaN      114    150.0    168
4    187    187    194.0    196    NaN      113
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188


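# Keep only the rows marked True, that is rows 0 and 2 (the boolean list needs one entry per row)
df.loc[[True,False,True,False,False,False,False]]
# Result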
      0      1      2        3       4       5
0    137    112    172.0    109    175.0    105
2    106    125    150.0    120    118.0    184

(2-2) Combining the test functions with any/all

df.notnull/isnull().any()/all()

# 1. isnull: is the value missing?
df.isnull()
# Result
　　　　 0    　　1    　　 2    　　 3    　　4    　　 5
0    False    False    False    False    False    False
1    False    False    True     False    False    False
2    False    False    False    False    False    False
3    False    False    True     False    False    False
4    False    False    False    False    True     False
5    False    False    False    False    False    False
6    False    False    False    False    False    False


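# any: check whether each row contains a missing value
# axis=0 reduces down each column (one result per column); axis=1 reduces across each row (one result per row)
df.isnull().any(axis=1)
0    False      # row 0 has no missing values
1     True      # row 1 has a missing value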
2    False
3     True
4     True
5    False
6    False
dtype: bool


# How many rows contain missing values
df.isnull().any(axis=1).sum()
# Result
3  # three rows of df contain missing values


# Return the rows that contain missing values
df.loc[df.isnull().any(axis=1)]
# Result
　　　　0      1      2     3       4       5
1    179    164    NaN    101    176.0    171
3    111    128    NaN    114    150.0    168
4    187    187    194.0    196    NaN    113


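# Return the index of the rows that contain missing values
index = df.loc[df.isnull().any(axis=1)].index
# Result (pandas 2.x prints this as Index([1, 3, 4], dtype='int64'))
Int64Index([1, 3, 4], dtype='int64')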


# Remove the rows that contain missing values
df.drop(labels=index)
# Result
　　　 0    　1    　　2    　　3    　　4      5
0    137    112    172.0    109    175.0    105
2    106    125    150.0    120    118.0    184
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188


# all: check whether every value in each row is missing
df.isnull().all(axis=1)
0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool


# 2. notnull: is the value present?
df.notnull()
# Result
    　 0    　  1      2       3       4       5
0    True    True    True    True    True    True
1    True    True    False   True    True    True
2    True    True    True    True    True    True
3    True    True    False   True    True    True
4    True    True    True    True    False    True
5    True    True    True    True    True    True
6    True    True    True    True    True    True


# all: check whether every value in each row is present
df.notnull().all(axis=1)
0     True
1    False
2     True
3    False
4    False
5     True
6     True
dtype: bool


# Output the rows in which no value is missing
df.loc[df.notnull().all(axis=1)]
# Result
　　　　0     1       2       3       4    　　5
0    137    112    172.0    109    175.0    105
2    106    125    150.0    120    118.0    184
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188

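(2-3) The filter function dropna

df.dropna(): filter out rows or columns that contain missing values

df.dropna() can filter either rows or columns (rows by default): axis=0 means rows and axis=1 means columns

# 1. Filter out rows that contain missing values
df.dropna(axis=0)
# Result
      0      1      2        3      4        5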
0    137    112    172.0    109    175.0    105
2    106    125    150.0    120    118.0    184
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188

# 2. Filter out columns that contain missing values
df.dropna(axis=1)
# Result
      0      1      3      5
0    137    112    109    105
1    179    164    101    171
2    106    125    120    184
3    111    128    114    168
4    187    187    196    113
5    109    107    161    157
6    101    100    181    188
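
dropna() has a few further parameters that the examples above do not use. The sketch below is an illustration on a small hypothetical frame (columns a, b, c are made up for the example) of how, thresh and subset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [7.0, 8.0, 9.0],
})

all_missing = df.dropna(how='all')     # drop a row only if every value is missing
at_least_two = df.dropna(thresh=2)     # keep rows with at least 2 non-missing values
only_b = df.dropna(subset=['b'])       # look at column b only
complete_cols = df.dropna(axis=1)      # keep columns without any missing value

print(all_missing, at_least_two, only_b, complete_cols, sep='\n\n')
```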

(2-4) The fill function for Series/DataFrame

fillna(): the value and method parameters

df.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

# df
　　　　0      1      2       3       4       5
0    137    112    172.0    109    175.0    105
1    179    164    NaN      101    176.0    171
2    106    125    150.0    120    118.0    184
3    111    128    NaN      114    150.0    168
4    187    187    194.0    196    NaN    113
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188


# 1. Fill missing values with a value
df.fillna(value=0)  # same as df.fillna(0): fill every missing value with 0
# Result
　　　　0      1       2      3       4       5
0    137    112    172.0    109    175.0    105
1    179    164    0.0      101    176.0    171
2    106    125    150.0    120    118.0    184
3    111    128    0.0      114    150.0    168
4    187    187    194.0    196    0.0      113
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188


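# 2. Replace the missing values in columns 2 and 4 with 1 and 2 respectively
values = {2: 1, 4: 2}
df.fillna(value=values)  # df.fillna(value=values, limit=1) would replace only the first missing value in each of columns 2 and 4
# Result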
　　　　0      1      2        3      4       5
0    137    112    172.0    109    175.0    105
1    179    164    1.0      101    176.0    171
2    106    125    150.0    120    118.0    184
3    111    128    1.0      114    150.0    168
4    187    187    194.0    196    2.0      113
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188


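# 3. Forward fill: axis=0 fills down each column, axis=1 fills across each row
# (recent pandas versions deprecate the method argument; df.ffill(axis=0) is the equivalent call)
df.fillna(method='ffill',axis=0)
# Result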
　　　　0      1      2        3       4      5
0    137    112    172.0    109    175.0    105
1    179    164    172.0    101    176.0    171
2    106    125    150.0    120    118.0    184
3    111    128    150.0    114    150.0    168
4    187    187    194.0    196    150.0    113
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188


# 4. Backward fill (df.bfill(axis=0) is the equivalent call in recent pandas versions)
df.fillna(method='bfill',axis=0)
# Result
      0       1      2       3       4       5
0    137    112    172.0    109    175.0    105
1    179    164    150.0    101    176.0    171
2    106    125    150.0    120    118.0    184
3    111    128    194.0    114    150.0    168
4    187    187    194.0    196    122.0    113
5    109    107    163.0    161    122.0    157
6    101    100    160.0    181    108.0    188
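
The forward and backward fills can also be written with the dedicated ffill() and bfill() methods. The sketch below uses a small hypothetical frame (columns x and y are made up) and also shows that a gap with no neighbour on the filling side stays missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, 5.0, np.nan],
})

forward = df.ffill()    # carry the previous value down each column
backward = df.bfill()   # pull the next value up each column

# y[0] has nothing above it, so it stays NaN after ffill;
# y[2] has nothing below it, so it stays NaN after bfill
print(forward)
print(backward)
```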

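V. Creating Multi-level Indexes

1. Creating a multi-level column index

1. Implicit construction

The most common way is to pass two or more arrays to the index or columns parameter of the DataFrame constructor

# 1. Example
df = DataFrame(np.random.randint(100,150,size=(3,3)),index=['a','b','c'],columns=[['A','B','C'],['AA','BB','CC']])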
　　　　A      B     C
　　　 AA     BB    CC
a    137    143    112
b    108    109    111
c    105    115    100

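2. Explicit construction with pd.MultiIndex.from_product

pandas.MultiIndex.from_product(iterables, sortorder=None, names=None)

Creates a MultiIndex from the Cartesian product of several iterables.

Parameters:	iterables: list / sequence of iterables; each iterable holds the unique labels for one level of the index. sortorder: int or None; level of sortedness (the data must be lexicographically sorted by that level). names: list / sequence of str, optional; names of the levels in the index.
Returns:	index: MultiIndex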

col=pd.MultiIndex.from_product([['qizhong','qimo'],
                                ['chinese','math']])

df = DataFrame(data=np.random.randint(60,120,size=(2,4)),index=['tom','jay'],
         columns=col)

# Result (qizhong = midterm, qimo = final)
      qizhong             qimo
　　  chinese    math    chinese    math
tom    74    　　110    　　64       83
jay    83    　　101    　　109      115

2. Creating a multi-level row index

# Same as for columns
df = DataFrame(np.random.randint(100,150,size=(3,3)),index=[['a','b','c'],['aa','bb','cc']],columns=['A','B','C'])

       　　  A      B      C
a    aa    137    143    112
b    bb    108    109    111
c    cc    105    115    100

# MultiIndex.from_product
index=pd.MultiIndex.from_product([['all'],
                                ['a','b']])
df = DataFrame(data=np.random.randint(60,120,size=(2,4)),index=index,
         columns=['A','B','C','D'])

        　　A     B    C     D
all   a    76    61    72    67
      b    105   66    85    110

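3. Indexing and slicing with a multi-level index

1. Indexing

"""
Note: when the rows have several first-level labels, the second level cannot be indexed directly with plain []. Select the first level first, so that the second level becomes the outermost one, or pass a (level1, level2) tuple to .loc.
With [] directly on a DataFrame:
    an index selects columns
    a slice selects rows

"""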

# df
    　qizhong           qimo
　　　chinese    math    chinese    math
tom    80       97        78       80
jay    71       102       88       89


# 1. Get the final-exam (qimo) scores of all students in all subjects
df['qimo']
# Result
    chinese    math
tom    78      80
jay    88      89


# 2. Get the final-exam math scores of all students
df['qimo']['math']
# Result
tom    80
jay    89
Name: math, dtype: int32


# 3. Get tom's midterm (qizhong) scores in all subjects
df['qizhong'].loc['tom']
# Result
chinese    80
math       97
Name: tom, dtype: int32


# 4. Get tom's final-exam math score
df['qimo'].loc['tom','math']
# Result
80

2. Slicing

# Use a slice to get the midterm chinese and math scores of tom and jay
df.loc['tom':'jay','qizhong']
# Result
     chinese    math
tom    80    　　97
jay    71    　　102
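
Besides chaining [] and .loc as above, a multi-level column can be addressed in a single step with a tuple, and one inner level can be cut out with xs(). The sketch below rebuilds the same score table with fixed values; it is one possible approach, not the only one.

```python
import pandas as pd

col = pd.MultiIndex.from_product([['qizhong', 'qimo'],
                                  ['chinese', 'math']])
df = pd.DataFrame([[80, 97, 78, 80],
                   [71, 102, 88, 89]],
                  index=['tom', 'jay'], columns=col)

# A (level0, level1) tuple selects one column in a single step
math_final = df[('qimo', 'math')]

# The same tuple works as the column part of .loc
tom_math_final = df.loc['tom', ('qimo', 'math')]

# xs() takes a cross-section on an inner level: math for both exams
all_math = df.xs('math', axis=1, level=1)

print(math_final, tom_math_final, all_math, sep='\n\n')
```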

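3. Summary

"""
Summary:
With [] directly, an index selects columns and a slice selects rows

1. Indexing
One or more columns:  square brackets  [columnname]  [[columnname1,columnname2...]]
One or more rows:     .loc[indexname]
A single element:     .loc[indexname,columnname]   e.g. one student's score in one subject

2. Slicing:
Row slice:            [index1:index2]              e.g. the rows for two students
Column slice:         .loc[:,column1:column2]      e.g. a range of subjects for every student
"""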

VI. Aggregation

# df
    qizhong             qimo
　　 chinese    math    chinese    math
tom    80    　　97        78      80
jay    71    　　102       88      89


# 1. Maximum of each row
df.max(axis=1)
# Result
tom     97
jay    102
dtype: int32


# 2. Minimum of each column
df.min(axis=0)
# Result
qizhong  chinese    71
         math       97
qimo     chinese    78
         math       80
dtype: int32


# 3. Mean with mean()
df.mean(axis=1)
# Result
tom    83.75
jay    87.50
dtype: float64


# 4. Sum with sum()
df.sum(axis=1)
# Result
tom    335
jay    350
dtype: int64


# Other functions
Function Name    NaN-safe Version    Description
np.sum           np.nansum           Compute sum of elements
np.prod          np.nanprod          Compute product of elements
np.mean          np.nanmean          Compute mean of elements
np.std           np.nanstd           Compute standard deviation
np.var           np.nanvar           Compute variance
np.min           np.nanmin           Find minimum value
np.max           np.nanmax           Find maximum value
np.argmin        np.nanargmin        Find index of minimum value
np.argmax        np.nanargmax        Find index of maximum value
np.median        np.nanmedian        Compute median of elements
np.percentile    np.nanpercentile    Compute rank-based statistics of elements
np.any           N/A                 Evaluate whether any elements are true
np.all           N/A                 Evaluate whether all elements are true
np.power         N/A                 Raise elements to a power (element-wise)
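
The NaN-safe column of the table matters because plain NumPy reductions propagate NaN, while pandas reductions skip missing values by default. A small sketch on a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, 6.0]})

col_sums = df.sum(axis=0)                  # pandas skips NaN by default
strict = df.sum(axis=0, skipna=False)      # NaN propagates when skipna=False

numpy_plain = np.sum(df['a'].to_numpy())   # plain NumPy: nan
numpy_safe = np.nansum(df['a'].to_numpy()) # NaN-safe version: 4.0

print(col_sums, strict, numpy_plain, numpy_safe, sep='\n')
```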

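VII. Combining Data in pandas

pandas offers two kinds of combining:

Concatenation: pd.concat, DataFrame.append (removed in pandas 2.0)
Merging: pd.merge, DataFrame.join

1. Concatenating with pd.concat()

pd.concat is similar to np.concatenate but takes a few more parameters:

objs
axis=0
keys
join='outer' / 'inner': how to handle the labels on the other axis. outer keeps the union of the labels and fills the gaps with NaN, while inner keeps only the labels shared by all objects
ignore_index: False by default; if True, the index is rebuilt as 0..n-1

1. Matching concatenation

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

# Data
df = DataFrame(np.random.randint(0,100,size=(3,3)),index=['a','b','c'],columns=['A','B','C'])
# Result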
　　　 A     B    C
a    47    81    12
b     7    43    36
c    76    85    47

# outer concatenation of matching frames along axis=1 (side by side, adding columns)
pd.concat([df,df],axis=1,join='outer')
# Result
　　   A    B     C     A     B     C
a    47    81    12    47    81    12
b     7    43    36    7     43    36
c    76    85    47    76    85    47

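2. Non-matching concatenation

Non-matching means that the indexes on the other axis differ: for vertical concatenation the column indexes differ, and for horizontal concatenation the row indexes differ

There are two join modes:

Outer join (outer): fill the gaps with NaN (the default)

Inner join (inner): keep only the matching labels

# Data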
# df
      A     B     C
a    47    81    12
b     7    43    36
c    76    85    47

# df1
df1 = DataFrame(np.random.randint(0,100,size=(3,3)),index=['a','b','c'],columns=['A','B','D'])
　　  A     B    D
a    73    64    69
b    98    68    9
c    63    38    74

# inner keeps only the matching columns
pd.concat([df,df1],axis=0,join='inner',ignore_index=True)
# Result
　　  A    B
0    47    81
1    7    43
2    76    85
3    73    64
4    98    68
5    63    38

# outer keeps every column
pd.concat([df,df1],axis=0,join='outer')
# Result
　　   A    B      C       D
a    47    81    12.0    NaN
b    7     43    36.0    NaN
c    76    85    47.0    NaN
a    73    64    NaN     69.0
b    98    68    NaN     9.0
c    63    38    NaN     74.0

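3. Appending with df.append()

Concatenating below along the vertical axis is so common that a dedicated append method existed for adding rows at the end. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use pd.concat([df, df1]), which gives the same result.
df.append(df1)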
　　   A    B      C       D
a    47    81    12.0    NaN
b    7     43    36.0    NaN
c    76    85    47.0    NaN
a    73    64    NaN    69.0
b    98    68    NaN    9.0
c    63    38    NaN    74.0
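
On pandas versions without DataFrame.append, the same vertical stacking can be written with pd.concat. The sketch below uses two small hypothetical frames and also shows the keys parameter listed earlier, which labels each input block in an outer index level.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])
df1 = pd.DataFrame({'A': [5, 6], 'D': [7, 8]}, index=['a', 'b'])

# Equivalent of the old df.append(df1): stack rows, outer join on columns
stacked = pd.concat([df, df1])

# keys adds an outer index level that records which frame each row came from
labeled = pd.concat([df, df1], keys=['first', 'second'])

print(stacked)
print(labeled)
```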

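2. Merging with pd.merge()

merge differs from concat in that it combines rows based on one or more shared columns

By default pd.merge() uses the columns whose names appear in both frames as the key.

Note that the order of the elements in each column does not have to match

Parameters:

how: outer takes the union of the keys, inner takes the intersection

on: when several columns share a name, on specifies which column or columns to merge on; its value is a column name or a list of names

0. Inner and outer merges: outer takes the union, inner takes the intersection

Inner merge: how='inner'   keep only the keys present in both (the default)
Outer merge: how='outer'   fill the gaps with NaN

1. One-to-one merge

# Data
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                'group':['Accounting','Engineering','Engineering'],
                })
# Result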
　　 employee    group
0    Bob     Accounting
1    Jake    Engineering
2    Lisa    Engineering


df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                'hire_date':[2004,2008,2012],
                })
# Result
　　 employee    hire_date
0    Lisa    　　　2004
1    Bob    　　　 2008
2    Jake    　　　2012


# Merge, taking the intersection
pd.merge(df1,df2,how='inner')
# Result
　　employee    group     hire_date
0    Bob     Accounting     2008
1    Jake    Engineering    2012
2    Lisa    Engineering    2004

2. Many-to-one merge

# Data
df3 = DataFrame({
    'employee':['Lisa','Jake'],
    'group':['Accounting','Engineering'],
    'hire_date':[2004,2016]})
# Result
    employee    group    hire_date
0    Lisa    Accounting     2004
1    Jake    Engineering    2016

df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                       'supervisor':['Carly','Guido','Steve']
                })
# Result
        group     supervisor
0    Accounting     Carly
1    Engineering    Guido
2    Engineering    Steve


# Many-to-one merge
pd.merge(df3,df4)  # how defaults to inner (intersection)
# Result
   employee     group     hire_date    supervisor
0    Lisa    Accounting     2004    　　  Carly
1    Jake    Engineering    2016    　　  Guido
2    Jake    Engineering    2016    　　  Steve

3. Many-to-many merge

# Data
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering']})
# Result
    employee     group
0    Bob      Accounting
1    Jake     Engineering
2    Lisa     Engineering

df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                'supervisor':['Carly','Guido','Steve']
                })
# Result
　　　　 group     supervisor
0    Engineering    Carly
1    Engineering    Guido
2    HR    　　　　　 Steve


# Many-to-many merge with outer (union)
pd.merge(df1,df5,how='outer')
# Result
　　 employee    group     supervisor
0    Bob     Accounting      NaN
1    Jake    Engineering    Carly
2    Jake    Engineering    Guido
3    Lisa    Engineering    Carly
4    Lisa    Engineering    Guido
5    NaN     HR    　　　　  Steve

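4. Additional note

Loading Excel data: pd.read_excel('excel_path', sheet_name=1)

pd.read_excel('./data.xlsx',sheet_name=1)

5. Normalizing the key

When columns conflict, that is when several column names are shared, use on= to specify which column is the key, and suffixes to name the conflicting columns

# Data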
df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
                 'group':['Accounting','Finance','Marketing']})
# Result
　　 employee    group
0    Jack     Accounting
1    Summer    Finance
2    Steve    Marketing

df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                 'hire_date':[2003,2009,2012],
                'group':['Accounting','sell','ceo']})
# Result
    employee    group    hire_date
0    Jack     Accounting    2003
1    Bob        sell        2009
2    Jake       ceo         2012


# on specifies which column to merge on
pd.merge(df1,df2,on='group',how='outer')
# Result
    employee_x     group    　　employee_y    hire_date
0    Jack       Accounting         Jack    　　2003.0
1    Summer       Finance          NaN         NaN
2    Steve       Marketing         NaN    　  　NaN
3    NaN           sell    　　     Bob   　　  2009.0
4    NaN            ceo    　　     Jake       2012.0

When the two tables have no column name in common, left_on and right_on specify which column on each side serves as the join key

# Data
df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                'group':['Accounting','Product','Marketing'],
               'hire_date':[1998,2017,2018]})
# Result
　　employee    group    hire_date
0    Bobs    Accounting    1998
1    Linda    Product      2017
2    Bill    Marketing    2018

df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                'hire_dates':[1998,2016,2007]})
# Result
　　hire_dates    name
0    1998    　　 Lisa
1    2016    　　 Bobs
2    2007    　　 Bill


# Use employee on the left and name on the right as the key
pd.merge(df1,df5,left_on='employee',right_on='name',how='outer')
# Result
    employee    group    hire_date    hire_dates    name
0    Bobs    Accounting    1998.0    　　2016.0      Bobs
1    Linda    Product      2017.0   　　   NaN   　　 NaN
2    Bill    Marketing     2018.0    　　 2007.0     Bill
3    NaN    　　NaN    　　　 NaN     　　  1998.0 　　Lisa
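
The suffixes parameter mentioned above, together with indicator, can make an outer merge easier to read. The sketch below uses two small hypothetical frames; the suffix strings '_left' and '_right' are arbitrary choices for the example.

```python
import pandas as pd

left = pd.DataFrame({'employee': ['Jack', 'Summer'],
                     'group': ['Accounting', 'Finance']})
right = pd.DataFrame({'employee': ['Jack', 'Bob'],
                      'group': ['Accounting', 'sell']})

# suffixes renames the conflicting non-key column 'employee';
# indicator=True adds a _merge column saying where each row came from
merged = pd.merge(left, right, on='group', how='outer',
                  suffixes=('_left', '_right'), indicator=True)

origin = merged.set_index('group')['_merge'].astype(str).to_dict()
print(merged)
print(origin)
```

The _merge column takes the values both, left_only and right_only, which is a quick way to find the keys that failed to match.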

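VIII. Case Study: Population Data for US States

# Import the libraries and look at the raw data
import numpy as np
from pandas import DataFrame,Series
import pandas as pd

# Read the data from csv files (the files are on my local disk)
abb = pd.read_csv('./data/state-abbrevs.csv')  # state abbreviations
pop = pd.read_csv('./data/state-population.csv')  # population
area = pd.read_csv('./data/state-areas.csv')  # area

abb.head(1)
# There is a lot of data, so show only one row to see the row and column names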
     state    abbreviation
0    Alabama    AL

pop.head(1)
# Result
   state/region    ages      year    population
0    AL    　　　　under18    2012    1117489.0

area.head(1)
# Result
      state    area (sq. mi)
0    Alabama     52423


# 1. Merge the population data with the state abbreviation data
abb_pop = pd.merge(abb,pop,how='outer',left_on='abbreviation',right_on='state/region')
abb_pop.head()
# Result
　　  state    abbreviation    state/region    ages     year    population
0    Alabama    　　AL    　　　　　　AL    　　under18    2012    1117489.0
1    Alabama    　　AL    　　　　　　AL    　　total      2012    4817528.0
2    Alabama    　　AL    　　　　　　AL   　　 under18    2010    1130966.0
3    Alabama    　　AL    　　　　　　AL    　　total      2010    4785570.0
4    Alabama    　　AL    　　　　　　AL    　　under18    2011    1125763.0


# 2. Drop the redundant abbreviation column from the merged data
abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
abb_pop.head()
# Result
　　　　state    state/region    ages      year    population
0    Alabama    AL    　　　　　under18    2012    1117489.0
1    Alabama    AL    　　　　　total      2012    4817528.0
2    Alabama    AL   　　　　　 under18    2010    1130966.0
3    Alabama    AL    　　　　　total      2010    4785570.0
4    Alabama    AL    　　　　　under18    2011    1125763.0


# 3. Check which columns contain missing data
abb_pop.isnull().any(axis=0)
# Result
state            True
state/region    False
ages            False
year            False
population       True
dtype: bool


# 4. Find which state/region values have NaN in state, and deduplicate them
# All the missing values in the state column
condition = abb_pop['state'].isnull()
# Rows where state is missing
abb_pop.loc[condition]
# Deduplicate: abb_pop.loc[condition]['state/region'] is a Series
abb_pop.loc[condition]['state/region'].unique()
# Result
array(['PR', 'USA'], dtype=object)


# 5. Fill in the correct state for these state/region values so that the state column has no NaN left
# Get the row index of every USA row in the data
c = abb_pop['state/region'] == 'USA'
index = abb_pop.loc[c].index
abb_pop.loc[index,'state'] = 'United States'

# Do the same for the PR (Puerto Rico) rows
c1 = abb_pop['state/region'] == 'PR'
index1 = abb_pop.loc[c1].index
abb_pop.loc[index1,'state'] = 'Puerto Rico'
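
Step 5 can also be written without collecting row indexes first, by assigning through a boolean mask and a lookup dict. The sketch below is one possible alternative on a tiny stand-in frame; the frame and the names dict are hypothetical and are not read from the csv files.

```python
import pandas as pd

# A tiny stand-in for abb_pop with the same two columns
demo = pd.DataFrame({
    'state/region': ['AL', 'PR', 'USA'],
    'state': ['Alabama', None, None],
})

names = {'PR': 'Puerto Rico', 'USA': 'United States'}

# Boolean mask of the rows whose state is missing
missing = demo['state'].isnull()

# Look each abbreviation up in the dict and assign only into the masked rows
demo.loc[missing, 'state'] = demo.loc[missing, 'state/region'].map(names)

print(demo)
```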


# 6. Merge in the state area data
abb_pop_area = pd.merge(abb_pop,area,how='outer')
abb_pop_area.head(1)
# Result
　　　 state    state/region    ages    　year    population    area (sq. mi)
0    Alabama    AL    　　　　 under18   2012.0    1117489.0    52423.0


# 7. The area (sq. mi) column turns out to have missing data; find which rows
con = abb_pop_area['area (sq. mi)'].isnull()
index = abb_pop_area.loc[con].index


# 8. Drop the rows with missing data
abb_pop_area.drop(labels=index,axis=0,inplace=True)


# 9. Find the total population figures for 2010
abb_pop_area.query('year == 2010 & ages == "total"')


# 10. Compute the population density of each state (midu = density, people per square mile)
abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area.head()
# Result
　　　state     state/region    ages     year     population    area (sq. mi)    midu
0    Alabama    　　AL    　　under18    2012.0    1117489.0     52423.0    　　21.316769
1    Alabama    　　AL    　　total      2012.0    4817528.0     52423.0   　　 91.897221
2    Alabama    　　AL    　　under18    2010.0    1130966.0     52423.0   　　 21.573851
3    Alabama    　　AL    　　total      2010.0    4785570.0     52423.0   　　 91.287603
4    Alabama    　　AL    　　under18    2011.0    1125763.0     52423.0   　　 21.474601


# 11. Sort with df.sort_values() and show the five rows with the highest population density
abb_pop_area.sort_values(by='midu',axis=0,ascending=False).head()

九、pandas数据处理

1、删除重复行

使用duplicated()函数检测重复的行，返回元素为布尔类型的Series对象，每个元素对应一行，如果该行不是第一次出现，则元素为True

- keep参数：指定保留哪一重复的行数据

使用duplicated(keep='first/last'/False)和drop函数删除重复行

# 1.创建具有重复元素行的DataFrame
np.random.seed(1)
df = DataFrame(data=np.random.randint(0,100,size=(8,6)))

df.iloc[1] = [3,3,3,3,3,3]
df.iloc[3] = [3,3,3,3,3,3]
df.iloc[5] = [3,3,3,3,3,3]
# Result
     0     1    2     3    4    5
0    37    12   72    9    75   5
1    3     3    3     3    3    3
2    6     25   50    20   18   84
3    3     3    3     3    3    3
4    87    87   94    96   86   13
5    3     3    3     3    3    3
6    1     0    60    81   8    88
7    13    47   72    30   71   3


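# 2. Use duplicated() to flag the duplicate rows
# keep parameter:
#     keep=False    flag every occurrence of a duplicated row (none is kept)
#     keep="last"   keep the last occurrence and flag the earlier ones
#     keep="first"  keep the first occurrence and flag the later ones (default)

# Rows 1, 3 and 5 are duplicates of each other; keep="last" keeps the last one
con = df.duplicated(keep="last")
# Result
0    False
1     True   # row 1
2    False
3     True   # row 3
4    False
5    False   # row 5 is kept
6    False
7    False
dtype: bool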


# 3. Delete the duplicate rows
index = df.loc[con].index  # row index of the flagged duplicates
df.drop(labels=index,axis=0)  # delete them
# Result
     0     1     2     3     4     5
0    37    12    72    9     75    5
2    6     25    50    20    18    84
4    87    87    94    96    86    13
5    3     3     3     3     3     3
6    1     0     60    81    8     88
7    13    47    72    30    71    3

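Use drop_duplicates(keep='first'/'last'/False) to delete duplicate rows

"""
drop_duplicates() combines the duplicated() and drop() steps shown above
"""
# Keep the first occurrence of each duplicate row and delete the rest
df.drop_duplicates(keep='first')
# Result
     0     1     2     3    4    5
0    37    12    72     9   75    5
1    3     3     3     3    3    3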
2    6     25    50    20   18   84
4    87    87    94    96   86   13
6    1     0     60    81   8    88
7    13    47    72    30   71    3
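
To make the three keep settings concrete, the sketch below runs them on a small made-up frame (the data is hypothetical; rows 1, 3 and 4 are identical) and checks that drop_duplicates matches the manual duplicated-plus-drop route.

```python
import pandas as pd

# Rows 1, 3 and 4 are identical
df = pd.DataFrame({'x': [1, 3, 2, 3, 3], 'y': [9, 3, 8, 3, 3]})

first = df.duplicated(keep='first').tolist()   # later copies flagged
last = df.duplicated(keep='last').tolist()     # earlier copies flagged
every = df.duplicated(keep=False).tolist()     # all copies flagged

# Manual route: flag, collect the index, drop
by_hand = df.drop(labels=df.loc[df.duplicated(keep='last')].index, axis=0)
# Direct route
direct = df.drop_duplicates(keep='last')
print(direct)
```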

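2. Mapping

replace(): replacing elements

Use replace() to map values onto new values

1) Series replacement

Single-value replacement
- plain replacement
- dictionary replacement (recommended)
Multi-value replacement
- list replacement
- dictionary replacement (recommended)
Parameters
- to_replace: the element(s) to be replaced
- value: the replacement value(s)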

# 1. Single-value plain replacement
s = Series([1,2,3,4,5,6])

s.replace(to_replace=3,value='three')
# Result
0        1
1        2
2    three
3        4
4        5
5        6
dtype: object


# 2. Single-value dictionary replacement
s.replace(to_replace={1:'one'})
# Result
0    one
1      2
2      3
3      4
4      5
5      6
dtype: object


# 3. Multi-value list replacement
s.replace(to_replace=[1,2],value=['a','b'])
# Result
0    a
1    b
2    3
3    4
4    5
5    6
dtype: object


# 4. Multi-value dictionary replacement
s.replace(to_replace={3:"three",4:"four"})
# Result
0        1
1        2
2    three
3     four
4        5
5        6
dtype: object


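# 5. The method parameter
# method: fill the matched values from a neighbouring value, one of {'pad', 'ffill', 'bfill', None}
# limit: maximum number of fills, int, default None
# Note: pandas 2.1 and later deprecate the method and limit keywords of replace()

# Replace with the preceding value
s.replace(to_replace=3,method='ffill')
# Result
0    1
1    2
2    2  # 3 was replaced by the preceding 2
3    4
4    5
5    6
dtype: int64

# Replace with the following value
s.replace(to_replace=3,method='bfill')
# Result
0    1
1    2
2    4  # 3 was replaced by the following 4
3    4
4    5
5    6
dtype: int64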
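
On pandas versions where the method keyword of replace() is deprecated or unavailable, a similar effect can be obtained by masking the value and then filling. This is a sketch of one alternative, not the approach used in the text above:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])

# Blank out the value to be replaced (it becomes NaN), then fill from a neighbour
forward = s.mask(s == 3).ffill()    # take the preceding value
backward = s.mask(s == 3).bfill()   # take the following value

print(forward.tolist())
print(backward.tolist())
```

Masking introduces NaN, so the result has dtype float64; `.astype('int64')` converts it back once no NaN remains.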

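2) DataFrame replacement

Single-value replacement
- plain replacement: replaces every matching element: to_replace=15,value='e'
- single-value replacement in a given column: to_replace={column label: element to replace},value='value'

Multi-value replacement
- list replacement: to_replace=[] value=[]
- dictionary replacement (recommended): to_replace={column label 1: element 1 to replace, column label 2: element 2 to replace},value='value'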

# df
     0     1     2     3    4     5
0    37    12    72    9    75    5
1    3      3     3    3     3    3
2    6     25    50    20    18   84
3    3      3     3    3     3    3
4    87    87    94    96    86   13
5    3      3     3     3     3    3
6    1      0    60    81     8   88
7    13    47    72    30    71    3


# 1. Single-value replacement
df.replace(to_replace=3,value='three')  # replace every element equal to 3 with 'three'
# Result
       0      1      2      3      4      5
0     37     12     72      9     75      5
1  three  three  three  three  three  three
2      6     25     50     20     18     84
3  three  three  three  three  three  three
4     87     87     94     96     86     13
5  three  three  three  three  three  three
6      1      0     60     81      8     88
7     13     47     72     30     71  three


# 2. Multi-value list replacement
df.replace(to_replace=[1,3],value=['one','three'])
# Result
       0      1      2      3      4      5
0     37     12     72      9     75      5
1  three  three  three  three  three  three
2      6     25     50     20     18     84
3  three  three  three  three  three  three
4     87     87     94     96     86     13
5  three  three  three  three  three  three
6    one      0     60     81      8     88
7     13     47     72     30     71  three


# 3. Dictionary replacement
# Replace the value 3 in the column labelled 3 with 'aaaa'
df.replace(to_replace={3:3},value='aaaa')
# Result
     0     1     2     3    4     5
0    37    12    72    9    75    5
1    3     3     3   aaaa    3    3
2    6    25    50    20    18    84
3    3     3     3   aaaa    3    3
4    87   87    94    96    86    13
5    3     3     3   aaaa    3    3
6    1     0    60    81     8    88
7    13    47   72    30    71    3


# 4. Replace the value 3 in the columns labelled 3 and 4 with 'three'
df.replace(to_replace={3:3,4:3},value='three')
# Result

    0   1   2      3      4   5
0  37  12  72      9     75   5
1   3   3   3  three  three   3
2   6  25  50     20     18  84
3   3   3   3  three  three   3
4  87  87  94     96     86  13
5   3   3   3  three  three   3
6   1   0  60     81      8  88
7  13  47  72     30     71   3


Note: on a DataFrame, the method and limit parameters cannot be used
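
The dictionary form above applies a single replacement value to every listed column. When each column needs its own replacement, replace() also accepts a nested dictionary of the form {column label: {old value: new value}}. A minimal sketch on a made-up three-column frame (the data is hypothetical):

```python
import pandas as pd

# Hypothetical frame whose column labels are the integers 3, 4 and 5
df = pd.DataFrame({3: [9, 3, 20], 4: [75, 3, 18], 5: [5, 3, 84]})

# {column label: {old value: new value}}
out = df.replace(to_replace={3: {3: 'aaaa'}, 4: {3: 'three'}})
print(out)
```

Column 5 is not mentioned in the dictionary, so its 3 is left unchanged.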

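3) The map() function: creating a new column. map is a Series method, not a DataFrame method

map() can produce a new column of mapped data
map() accepts a lambda expression
map() accepts a function, including a user-defined one

e.g. map({to_replace:value})
Note: map() cannot take aggregating functions such as sum, or for loops

"""
Note: not every function can be passed to map. Only a function that takes exactly one argument and returns a value is suitable
"""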

# 1. Add a column: give df a new column holding the Chinese name that corresponds to each English name
dic = {
    'name':['jay','tom','jay'],
    'salary':[10000,8000,2000]
}
df = DataFrame(data=dic)
# Result
     name   salary
0    jay    10000
1    tom    8000
2    jay    2000

# Mapping table
dic = {
    'jay':'周杰伦',
    'tom':'张三'
}
df['c_name'] = df['name'].map(dic)  # look up each value of the name column in dic and store the result in a new c_name column
# Result
     name   salary    c_name
0    jay    10000    周杰伦
1    tom    8000     张三
2    jay    2000     周杰伦
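
One detail worth knowing when mapping with a dictionary: values that have no key in the dictionary become NaN. The sketch below shows this and one possible fallback; the extra name 'bob' is made up for the illustration.

```python
import pandas as pd

names = pd.Series(['jay', 'tom', 'bob'])
mapping = {'jay': '周杰伦', 'tom': '张三'}  # 'bob' is deliberately missing

# Values without a dictionary entry are mapped to NaN
mapped = names.map(mapping)

# Fall back to the original value where the dictionary has no entry
filled = mapped.fillna(names)
print(filled.tolist())
```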


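map acts as a computing tool; which computation it performs is decided by its argument (a lambda or a function)

# 2. Using a user-defined function
# The function
def after_sal(s):
    if s <= 3000:
        return s
    else:
        return s - (s-3000)*0.5

# The part of the salary above 3000 is taxed at 50%
df['salary'].map(after_sal)
# Result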
0    6500.0
1    5500.0
2    2000.0
Name: salary, dtype: float64

# Add the new column to the original data
df['after_sal'] = df['salary'].map(after_sal)
# Result
     name   salary    c_name    after_sal
0    jay    10000     周杰伦       6500.0
1    tom    8000      张三         5500.0
2    jay    2000      周杰伦       2000.0


# 3. Using a lambda function
# Add 1000 to everyone's salary
df['sal_1000'] = df['salary'].map(lambda x:x+1000)
# Result
     name   salary    c_name   after_sal   sal_1000
0    jay    10000    周杰伦     6500.0       11000
1    tom    8000     张三       5500.0       9000
2    jay    2000     周杰伦     2000.0       3000

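3. Detecting and filtering outliers with aggregation

df.std() returns the standard deviation of every column of a DataFrame

# 1. Create a df with 1000 rows and 3 columns of values in [0, 1); df.std() then gives the standard deviation of each column
df = DataFrame(np.random.random(size=(1000,3)),columns=['A','B','C'])
df.head()
# Partial result
          A         B         C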
0    0.671654    0.411788    0.197551
1    0.289630    0.142120    0.783314
2    0.412539    0.034171    0.624030
3    0.660636    0.298495    0.446135
4    0.222125    0.073364    0.469239


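# 2. Apply a filter to df to pick out outliers: assume a row is an outlier when its C value exceeds twice the standard deviation of column C
# Twice the standard deviation of column C
value = df['C'].std() * 2

# Rows that meet the outlier condition
df.loc[df['C'] > value]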
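
The expression above selects the rows that satisfy the outlier rule. To remove them instead, the same boolean condition can be negated with ~. A sketch with a seeded generator so the split is reproducible (the seed and the names `is_outlier`, `outliers` and `kept` are illustrative choices):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.random_sample(size=(1000, 3)), columns=['A', 'B', 'C'])

# Same rule as in the text: C greater than twice the standard deviation of C
threshold = df['C'].std() * 2
is_outlier = df['C'] > threshold

outliers = df.loc[is_outlier]   # rows flagged by the rule
kept = df.loc[~is_outlier]      # rows that survive the filter
print(len(outliers), len(kept))
```

For uniform data in [0, 1) this rule flags a large share of the rows, so it is best read as a demonstration of the filtering mechanics rather than a realistic outlier test.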

4. Sorting

Reordering with take()

- take() accepts a list of integer positions and arranges df in the order given by that list
- e.g. df.take([1,3,4,2,5])

# 1. Data
df = DataFrame(data=np.random.randint(0,100,size=(8,4)),columns=['A','B','C','D'])
# Result
      A     B    C     D
0    77    72    75    76
1    43    20    30    36
2    7     45    68    57
3    82    96    13    10
4    23    81    7     24
5    74    92    20    32
6    12    65    94    60
7    24    82    97    2


# 2. Reorder the columns
df.take([1,2,0,3],axis=1)
# Result
     B     C     A     D
0    72    75    77    76
1    20    30    43    36
2    45    68    7     57
3    96    13    82    10
4    81    7     23    24
5    92    20    74    32
6    65    94    12    60
7    82    97    24    2


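# 3. Reorder the rows
df.take([7,3,4,1,5,2,0])  # axis defaults to 0 (rows); positions left out of the list are not returned
# Result (row 6 is not in the list, so it does not appear)
     A     B     C     D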
7    24    82    97    2
3    82    96    13    10
4    23    81    7     24
1    43    20    30    36
5    74    92    20    32
2    7     45    68    57
0    77    72    75    76

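np.random.permutation() can shuffle the order at random, which gives a way to do random sampling

np.random.permutation(x) returns the integers 0 to x-1 in random order
When the DataFrame is large, np.random.permutation(x) combined with take() gives random sampling

"""
With 1000 rows, a hand-written take() list would need 1000 entries,
which is impractical; np.random.permutation() can generate the random order instead

Example
np.random.permutation(7)
returns the integers 0 to 6 in random order
The result differs on every call, e.g. array([0, 3, 4, 2, 1, 5, 6])
"""

# Shuffle both rows and columns
# Note: df has 8 rows and 4 columns, so permutation(7) and permutation(3) cover only rows 0-6 and columns A-C
new_df = df.take(np.random.permutation(7),axis=0).take(np.random.permutation(3),axis=1)
# Result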
     C     A    B
2    68    7    45
6    94    12   65
4    7     23    81
5    20    74    92
3    13    82    96
1    30    43    20
0    75    77    72

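# Take part of the shuffled data
new_df2 = new_df[2:6]  # the data here is small; real random samples would be much larger, this is only a demonstration


# Build a new DataFrame from the values: this is the random sample
new_df3 = DataFrame(data=new_df2.values)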
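
Two variations on the sampling recipe may be useful. Passing len(df) to permutation covers every row, and pandas also has a built-in DataFrame.sample method. The sketch below uses a made-up 8x4 frame; the random_state value is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical 8x4 frame with predictable contents
df = pd.DataFrame(np.arange(32).reshape(8, 4), columns=['A', 'B', 'C', 'D'])

# Permute over the full length so that no row is left out
shuffled = df.take(np.random.permutation(len(df)), axis=0)
picked = shuffled[:4]

# Built-in alternative: draw 4 rows without replacement
sampled = df.sample(n=4, random_state=0)
print(sampled)
```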

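5. Grouping and classifying data (important)

Aggregation is the last step of data processing; it usually reduces each array to a single value.

Grouped processing:

Split: divide the data into groups
Apply: apply a function to each group to transform its data
Combine: merge the results from the groups

The core of grouped processing:

 - the groupby() function
 - the groups attribute, which shows how the rows were grouped
 - e.g. df.groupby(by='item').groups

1) Grouped aggregation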

# 1. Data
df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
               'color':['red','yellow','yellow','green','green','green'],
               'weight':[12,20,50,30,20,44]})
# Result
     color     item     price    weight
0    red       Apple    4.0      12
1    yellow    Banana   3.0      20
2    yellow    Orange   3.0      50
3    green     Banana   2.5      30
4    green     Orange   4.0      20
5    green     Apple    2.0      44


# 2. Group with groupby
# groupby splits the data into groups but does not display them
df.groupby(by='item')  # <pandas.core.groupby.DataFrameGroupBy object at 0x00000152DB865DA0>


# 3. Use groups to see how the rows were grouped
df.groupby(by='item').groups
# Result
{'Apple': Int64Index([0, 5], dtype='int64'),
 'Banana': Int64Index([1, 3], dtype='int64'),
 'Orange': Int64Index([2, 4], dtype='int64')}
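
Besides the groups attribute, a grouped object can be iterated or queried for a single group, which makes the split step of split-apply-combine visible. A short sketch using the same fruit data as above:

```python
import pandas as pd

df = pd.DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                   'price': [4, 3, 3, 2.5, 4, 2],
                   'color': ['red', 'yellow', 'yellow', 'green', 'green', 'green'],
                   'weight': [12, 20, 50, 30, 20, 44]})

grouped = df.groupby(by='item')

# Iterating yields (group name, sub-DataFrame) pairs
sizes = {name: len(part) for name, part in grouped}

# get_group returns the rows that belong to one group
apples = grouped.get_group('Apple')
print(sizes)
print(apples)
```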


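# 4. Aggregation after grouping: numeric columns are aggregated, non-numeric columns are left out
# Average price and average weight of each fruit
# (numeric_only=True is needed on pandas 2.x, where the text column color would otherwise raise an error)
df.groupby(by='item').mean(numeric_only=True)
# Result
        price  weight
item
Apple    3.00      28
Banana   2.75      25
Orange   3.50      35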


# Create a new column holding the average price of each fruit
mean_price = df.groupby(by='item')['price'].mean()  # average price
# Result
item
Apple   3.00
Banana  2.75
Orange  3.50
Name: price, dtype: float64

# Build the dictionary
dic = {
    'Apple':3,
    'Banana':2.75,
    'Orange':3.5
}
# Map
df['mean_price'] = df['item'].map(dic)
# Result
     color      item    price   weight  mean_price
0    red       Apple    4.0     12      3.00
1    yellow    Banana   3.0     20      2.75
2    yellow    Orange   3.0     50      3.50
3    green     Banana   2.5     30      2.75
4    green     Orange   4.0     20      3.50
5    green     Apple    2.0     44      3.00


# 5. Compute the average price of the fruit of each color and add it as a new column
s = df.groupby(by='color')['price'].mean()
# Result
color
green   2.833333
red     4.000000
yellow  3.000000
Name: price, dtype: float64

# to_dict() can build the dictionary
df['color_mean_price'] = df['color'].map(s.to_dict())
# Result
     color      item    price    weight   mean_price   color_mean_price
0    red       Apple    4.0      12       3.00         4.000000
1    yellow    Banana   3.0      20       2.75         3.000000
2    yellow    Orange   3.0      50       3.50         3.000000
3    green     Banana   2.5      30       2.75         2.833333
4    green     Orange   4.0      20       3.50         2.833333
5    green     Apple    2.0      44       3.00         2.833333
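
The dictionary step is optional: map also accepts a Series, in which case the Series index plays the role of the dictionary keys. A sketch using the same fruit data:

```python
import pandas as pd

df = pd.DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                   'price': [4, 3, 3, 2.5, 4, 2]})

# Series indexed by item, holding the average price per fruit
mean_price = df.groupby(by='item')['price'].mean()

# Pass the Series itself; no hand-written dictionary is needed
df['mean_price'] = df['item'].map(mean_price)
print(df)
```

This keeps the mapped values in sync with the data, whereas a hand-typed dictionary has to be updated whenever the prices change.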

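6. Advanced aggregation

After grouping with groupby, transform and apply can also take a custom function to perform other computations

df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
Both transform and apply run the computation; just pass the function to transform or apply
transform and apply also accept a lambda expression

# 1. Average price of each fruit
df.groupby(by='item')['price'].mean()


# 2. Average price of each fruit using apply
def fun(s):
    total = 0  # named total to avoid shadowing the built-in sum
    for i in s:
        total += i  # add each element, not the whole Series
    return total/s.size

df.groupby(by='item')['price'].apply(fun)


# 3. Average price of each fruit using transform
df.groupby(by='item')['price'].transform(fun)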
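
The practical difference between the two calls is the shape of the result: apply returns one value per group, while transform broadcasts each group's value back to every original row. A sketch with the same fruit data (the helper name `group_mean` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                   'price': [4, 3, 3, 2.5, 4, 2]})

def group_mean(s):
    # Average of one group's prices, computed by hand
    total = 0
    for value in s:
        total += value
    return total / s.size

per_group = df.groupby(by='item')['price'].apply(group_mean)     # indexed by item
per_row = df.groupby(by='item')['price'].transform(group_mean)   # indexed like df

print(per_group)
print(per_row)
```

Because the transform result lines up with the original index, it can be assigned directly as a new column.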

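X. Case Study: 2012 US Presidential Candidate Campaign Contribution Data

# 1. Imports
import numpy as np
import pandas as pd
from pandas import Series,DataFrame


# 2. For convenience, define the months, the candidates of interest and the candidates' parties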
months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6,
          'JUL' : 7, 'AUG' : 8, 'SEP' : 9, 'OCT': 10, 'NOV': 11, 'DEC' : 12}
of_interest = ['Obama, Barack', 'Romney, Mitt', 'Santorum, Rick', 
               'Paul, Ron', 'Gingrich, Newt']
parties = {
  'Bachmann, Michelle': 'Republican',
  'Romney, Mitt': 'Republican',
  'Obama, Barack': 'Democrat',
  "Roemer, Charles E. 'Buddy' III": 'Reform',
  'Pawlenty, Timothy': 'Republican',
  'Johnson, Gary Earl': 'Libertarian',
  'Paul, Ron': 'Republican',
  'Santorum, Rick': 'Republican',
  'Cain, Herman': 'Republican',
  'Gingrich, Newt': 'Republican',
  'McCotter, Thaddeus G': 'Republican',
  'Huntsman, Jon': 'Republican',
  'Perry, Rick': 'Republican'           
 }


# 3. Read the file
data = pd.read_csv('./data/usa_election.txt')


# 4. Use map with a dictionary to add a party column giving each candidate's party
data['party'] = data['cand_nm'].map(parties)
# 5. Use unique() to see which values occur in the party column
data['party'].unique()
# Result
array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)


# 6. Use value_counts() to count how often each value occurs in the party column
data['party'].value_counts()
# Result
Democrat       292400
Republican     237575
Reform           5364
Libertarian       702
Name: party, dtype: int64


# 7. Use groupby() to see the total contributions (contb_receipt_amt) received by each party
data.groupby(by='party')['contb_receipt_amt'].sum()
# Result
party
Democrat       8.105758e+07
Libertarian    4.132769e+05
Reform         3.390338e+05
Republican     1.192255e+08
Name: contb_receipt_amt, dtype: float64


# 8. See the total contributions (contb_receipt_amt) each party received on each day
# Use groupby([several grouping keys])
data.groupby(by=['party','contb_receipt_dt'])['contb_receipt_amt'].sum()
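
Grouping by two keys returns a Series with a two-level index. The sketch below, on a made-up four-row frame (the amounts are hypothetical, not taken from the real file), shows how to read one entry and how unstack turns the result into a date-by-party table.

```python
import pandas as pd

# Hypothetical miniature of the contribution data
toy = pd.DataFrame({
    'party': ['Democrat', 'Democrat', 'Republican', 'Republican'],
    'contb_receipt_dt': ['20-JUN-11', '20-JUN-11', '20-JUN-11', '21-JUN-11'],
    'contb_receipt_amt': [100.0, 50.0, 30.0, 70.0],
})

daily = toy.groupby(by=['party', 'contb_receipt_dt'])['contb_receipt_amt'].sum()

# One entry is addressed by a (party, date) tuple
dem = daily.loc[('Democrat', '20-JUN-11')]

# Move the party level into columns: one row per date, one column per party
table = daily.unstack(level='party')
print(table)
```

Combinations that never occur in the data, such as a party with no contributions on a given day, show up as NaN in the unstacked table.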


# 9. Inspect the date format and convert it to 'yyyy-mm-dd' with a function plus map: months['month abbreviation'] gives the month number
def transform_date(d):  # e.g. 20-JUN-11
    day,month,year = d.split('-')
    month = str(months[month]).zfill(2)  # zero-pad so the month always has two digits

    return '20'+year+'-'+month+'-'+day

data['contb_receipt_dt'] = data['contb_receipt_dt'].map(transform_date)
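
As an alternative to the hand-written converter, pd.to_datetime can parse this format directly. The sketch assumes an English locale for the %b month abbreviations; title-casing the text first ('JUN' to 'Jun') is only a precaution. The sample values are made up.

```python
import pandas as pd

# Hypothetical sample in the same dd-MON-yy layout
raw = pd.Series(['20-JUN-11', '05-JAN-12'])

# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(raw.str.title(), format='%d-%b-%y')

# Back to text in yyyy-mm-dd form, if strings are needed
as_text = parsed.dt.strftime('%Y-%m-%d')
print(as_text.tolist())
```

Keeping the column as datetime64 rather than text makes later steps such as sorting by date or grouping by month more direct.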
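# 10. See whom the veterans (contributor occupation DISABLED VETERAN) mainly support: which candidate received the most money from them. This exercises Series indexing
# Get the rows for the veterans
data['contbr_occupation'] == 'DISABLED VETERAN'

old_bing = data.loc[data['contbr_occupation'] == 'DISABLED VETERAN']  # 'old_bing' is pinyin for veteran

old_bing.groupby(by='cand_nm')['contb_receipt_amt'].sum()
# Result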
cand_nm
Cain, Herman       300.00
Obama, Barack     4205.00
Paul, Ron         2425.49
Santorum, Rick     250.00
Name: contb_receipt_amt, dtype: float64


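# 11. Among all contributors, find the occupation and the amount of the largest single contribution
# Use query("condition") to look up the contributor's occupation
max_amt = data['contb_receipt_amt'].max()  # 1944042.43
data.query('contb_receipt_amt == @max_amt')  # @ refers to the Python variable, so the number is not hard-coded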
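
Another way to reach the same row is idxmax, which returns the index label of the largest value and so avoids comparing floating-point numbers for equality. A sketch on a made-up frame (occupations and amounts are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the contribution data
toy = pd.DataFrame({
    'contbr_occupation': ['RETIRED', 'ENGINEER', 'TEACHER'],
    'contb_receipt_amt': [250.0, 9000.0, 500.0],
})

# Label of the row holding the largest amount
top_label = toy['contb_receipt_amt'].idxmax()

# The full row for that label
top_row = toy.loc[top_label]
print(top_row['contbr_occupation'], top_row['contb_receipt_amt'])
```

If several rows share the maximum, idxmax returns only the first, whereas the query form returns all of them.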

posted @ 2019-02-15 22:30 我用python写Bug
