和大熊猫们(Pandas)一起游戏吧！

Pandas是Python的一个用于数据分析的库： http://pandas.pydata.org

API速查：http://pandas.pydata.org/pandas-docs/stable/api.html

基于NumPy,SciPy的功能，在其上补充了大量的数据操作（Data Manipulation）功能。

统计、分组、排序、透视表自由转换，如果你已经很熟悉结构化数据库（RDBMS）与Excel的功能，就会知道Pandas有过之而无不及！

0. 上手玩：Why Pandas?

普通的程序员看到一份数据会怎么做？

import codecs

r = requests.get("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")

with codecs.open('S1EP3_Iris.txt','r',encoding='utf-8') as f:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa

Pandas的意义就在于

快速的识别结构化数据

import pandas as pd

快速的操作元数据

cnames = ['sepal_length','sepal_width','petal_length','petal_width','class']

快速过滤

#irisdata[irisdata['petal_width']==irisdata.petal_width.max()]

快速切片

#irisdata.iloc[::30,:2]

快速统计

#print irisdata['class'].value_counts()

SEPAL_LENGTH  Statistics:    7.9    4.3   5.84   0.83
SEPAL_WIDTH   Statistics:    4.4    2.0   3.05   0.43
PETAL_LENGTH  Statistics:    6.9    1.0   3.76   1.76
PETAL_WIDTH   Statistics:    2.5    0.1    1.2   0.76

快速“MapReduce”

slogs = lambda x:sp.log(x)*x

1. 欢迎来到大熊猫世界

Pandas的重要数据类型

DataFrame(二维表)
Series(一维序列)
Index(行索引，行级元数据)

1.1 Series：pandas的长枪(数据表中的一列或一行,观测向量,一维数组...)

数据世界中对于任意一个个体的全面观测，或者对于任意一组个体某一属性的观测，全部可以抽象为Series的概念。

用值构建一个Series：

由默认index和values组成。

Series1 = pd.Series(np.random.randn(4))

0   -0.909672
1    1.739425
2   -1.163028
3    0.408693
dtype: float64 <class 'pandas.core.series.Series'>
Int64Index([0, 1, 2, 3], dtype='int64')
[-0.90967166  1.73942495 -1.1630284   0.40869312]

Series支持过滤的原理就如同NumPy：

print Series1>0

0    False
1     True
2    False
3     True
dtype: bool
1    1.739425
3    0.408693
dtype: float64

当然也支持Broadcasting：

print Series1*2

0   -1.819343
1    3.478850
2   -2.326057
3    0.817386
dtype: float64
0    4.090328
1    6.739425
2    3.836972
3    5.408693
dtype: float64

以及Universal Function：

print np.exp(Series1)

0    0.402656
1    5.694068
2    0.312538
3    1.504850
dtype: float64
0    24.06255
1    4811.913
2    14.49702
3    336.0924
dtype: object

在序列上就使用行标，而不是创建一个2列的数据表，能够轻松辨别哪里是数据，哪里是元数据：

Series2 = pd.Series(Series1.values,index=['norm_'+unicode(i) for i in xrange(4)])

norm_0   -0.909672
norm_1    1.739425
norm_2   -1.163028
norm_3    0.408693
dtype: float64 <class 'pandas.core.series.Series'>
Index([u'norm_0', u'norm_1', u'norm_2', u'norm_3'], dtype='object')
<class 'pandas.core.index.Index'>
[-0.90967166  1.73942495 -1.1630284   0.40869312]

虽然行是有顺序的，但是仍然能够通过行级的index来访问到数据：

（当然也不尽然像Ordered Dict，因为行索引甚至可以重复，不推荐重复的行索引不代表不能用）

print Series2[['norm_0','norm_3']]

norm_0   -0.909672
norm_3    0.408693
dtype: float64

print 'norm_0' in Series2

True
False

默认行索引就像行号一样：

print Series1.index

Int64Index([0, 1, 2, 3], dtype='int64')

从Key不重复的Ordered Dict或者从Dict来定义Series就不需要担心行索引重复：

Series3_Dict = {"Japan":"Tokyo","S.Korea":"Seoul","China":"Beijing"}

China      Beijing
Japan        Tokyo
S.Korea      Seoul
dtype: object
['Beijing' 'Tokyo' 'Seoul']
Index([u'China', u'Japan', u'S.Korea'], dtype='object')

与Dict区别一：有序

Series4_IndexList = ["Japan","China","Singapore","S.Korea"]

Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
dtype: object
['Tokyo' 'Beijing' nan 'Seoul']
Index([u'Japan', u'China', u'Singapore', u'S.Korea'], dtype='object')
Japan        False
China        False
Singapore     True
S.Korea      False
dtype: bool
Japan         True
China         True
Singapore    False
S.Korea       True
dtype: bool

与Dict区别二： index内值可以重复，尽管不推荐。

Series5_IndexList = ['A','B','B','C']

A   -0.909672
B    1.739425
B   -1.163028
C    0.408693
dtype: float64
B    1.739425
B   -1.163028
A   -0.909672
dtype: float64

整个序列级别的元数据信息：name

当数据序列以及index本身有了名字，就可以更方便的进行后续的数据关联啦！

print Series4_pdSeries.name

None
None

Series4_pdSeries.name = "Capital Series"

Nation
Japan          Tokyo
China        Beijing
Singapore        NaN
S.Korea        Seoul
Name: Capital Series, dtype: object

1.2 DataFrame：pandas的战锤(数据表，二维数组)

Series的有序集合，就像R的DataFrame一样方便。

仔细想想，绝大部分的数据形式都可以表现为DataFrame。

从NumPy二维数组、从文件或者从数据库定义：数据虽好，勿忘列名

dataNumPy = np.asarray([('Japan','Tokyo',4000),\

等长的列数据保存在一个字典里（JSON）：很不幸，字典key是无序的

dataDict = {'nation':['Japan','S.Korea','China'],\

从另一个DataFrame定义DataFrame：啊，强迫症犯了！

DF21 = pd.DataFrame(DF2,columns=['nation','capital','GDP'])

DF22 = pd.DataFrame(DF2,columns=['nation','capital','GDP'],index = [2,0,1])

从DataFrame中取出列？两种方法（与JavaScript完全一致！）

'.'的写法容易与其他预留关键字产生冲突
'[ ]'的写法最安全。

print DF22.nation

2      China
0      Japan
1    S.Korea
Name: nation, dtype: object
2    Beijing
0      Tokyo
1      Seoul
Name: capital, dtype: object
2    9100
0    4900
1    1300
Name: GDP, dtype: int64

从DataFrame中取出行？（至少）两种方法：

print DF22[0:1] #给出的实际是DataFrame

  nation  capital   GDP
2  China  Beijing  9100
nation     Japan
capital    Tokyo
GDP         4900
Name: 0, dtype: object

像NumPy切片一样的终极招式：iloc

print DF22.iloc[0,:]

nation       China
capital    Beijing
GDP           9100
Name: 2, dtype: object
2      China
0      Japan
1    S.Korea
Name: nation, dtype: object

听说你从Alter Table地狱来，大熊猫笑了

然而动态增加列无法用"."的方式完成，只能用"[ ]"

DF22['population'] = [1600,130,55]

​

1.3 Index：pandas进行数据操纵的鬼牌（行级索引）

行级索引是

元数据
可能由真实数据产生，因此可以视作数据
可以由多重索引也就是多个列组合而成
可以和列名进行交换，也可以进行堆叠和展开，达到Excel透视表效果

Index有四种...哦不，很多种写法，一些重要的索引类型包括

pd.Index（普通）
Int64Index（数值型索引）
MultiIndex（多重索引，在数据操纵中更详细描述）
DatetimeIndex（以时间格式作为索引）
PeriodIndex （含周期的时间格式作为索引）

直接定义普通索引，长得就和普通的Series一样

index_names = ['a','b','c']

Index([u'a', u'b', u'c'], dtype='object')
Index([u'a', u'b', u'c'], dtype='object')

可惜Immutable，牢记！

index_names = ['a','b','c']

['a' 'b' 'c']

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-f34da0a8623c> in <module>()
      2 index0 = pd.Index(index_names)
      3 print index0.get_values()
----> 4 index0[2] = 'd'

/Users/wangweiyang/anaconda/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in __setitem__(self, key, value)
   1055 
   1056     def __setitem__(self, key, value):
-> 1057         raise TypeError("Indexes does not support mutable operations")
   1058 
   1059     def __getitem__(self, key):

TypeError: Indexes does not support mutable operations

扔进去一个含有多元组的List，就有了MultiIndex

可惜，如果这个List Comprehension改成小括号，就不对了。

#print [('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)]

MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])

对于Series来说，如果拥有了多重Index，数据，变形！

下列代码说明：

二重MultiIndex的Series可以unstack()成DataFrame
DataFrame可以stack成拥有二重MultiIndex的Series

data_for_multi1 = pd.Series(xrange(0,16),index=multi1)

Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64

data_for_multi1.unstack()

data_for_multi1.unstack().stack()

Row_1  Col_1     0
       Col_2     1
       Col_3     2
       Col_4     3
Row_2  Col_1     4
       Col_2     5
       Col_3     6
       Col_4     7
Row_3  Col_1     8
       Col_2     9
       Col_3    10
       Col_4    11
Row_4  Col_1    12
       Col_2    13
       Col_3    14
       Col_4    15
dtype: int64

我们来看一下非平衡数据的例子：

Row_1,2,3,4和Col_1,2,3,4并不是全组合的。

multi2 = pd.Index([('Row_'+str(x),'Col_'+str(y+1)) \

MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],
           labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])

data_for_multi2 = pd.Series(np.arange(10),index = multi2)

Row_1  Col_1    0
Row_2  Col_1    1
       Col_2    2
Row_3  Col_1    3
       Col_2    4
       Col_3    5
Row_4  Col_1    6
       Col_2    7
       Col_3    8
       Col_4    9
dtype: int64

data_for_multi2.unstack()

data_for_multi2.unstack().stack()

Row_1  Col_1    0
Row_2  Col_1    1
       Col_2    2
Row_3  Col_1    3
       Col_2    4
       Col_3    5
Row_4  Col_1    6
       Col_2    7
       Col_3    8
       Col_4    9
dtype: float64

DateTime标准库如此好用，你值得拥有

dates = [datetime.datetime(2015,1,1),datetime.datetime(2015,1,8),datetime.datetime(2015,1,30)]

DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None, tz=None)

如果你不仅需要时间格式统一，时间频率也要统一的话

periodindex1 = pd.period_range('2015-01','2015-04',freq='M')

PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='int64', freq='M')

月级精度和日级精度如何转换？

有的公司统一以1号代表当月，有的公司统一以最后一天代表当月，转化起来很麻烦，可以asfreq

print periodindex1.asfreq('D',how='start')

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='int64', freq='D')
PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='int64', freq='D')

最后的最后，我要真正把两种频率的时间精度匹配上？

periodindex_mon = pd.period_range('2015-01','2015-03',freq='M').asfreq('D',how='start')

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01'], dtype='int64', freq='D')
PeriodIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
             '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
             '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',
             '2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',
             '2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',
             '2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',
             '2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',
             '2015-01-29', '2015-01-30', '2015-01-31', '2015-02-01',
             '2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',
             '2015-02-06', '2015-02-07', '2015-02-08', '2015-02-09',
             '2015-02-10', '2015-02-11', '2015-02-12', '2015-02-13',
             '2015-02-14', '2015-02-15', '2015-02-16', '2015-02-17',
             '2015-02-18', '2015-02-19', '2015-02-20', '2015-02-21',
             '2015-02-22', '2015-02-23', '2015-02-24', '2015-02-25',
             '2015-02-26', '2015-02-27', '2015-02-28', '2015-03-01',
             '2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05',
             '2015-03-06', '2015-03-07', '2015-03-08', '2015-03-09',
             '2015-03-10', '2015-03-11', '2015-03-12', '2015-03-13',
             '2015-03-14', '2015-03-15', '2015-03-16', '2015-03-17',
             '2015-03-18', '2015-03-19', '2015-03-20', '2015-03-21',
             '2015-03-22', '2015-03-23', '2015-03-24', '2015-03-25',
             '2015-03-26', '2015-03-27', '2015-03-28', '2015-03-29',
             '2015-03-30', '2015-03-31'],
            dtype='int64', freq='D')

粗粒度数据＋reindex＋ffill/bfill

#print pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day)

2015-01-01    2015-01-01
2015-01-02           NaN
2015-01-03           NaN
2015-01-04           NaN
2015-01-05           NaN
2015-01-06           NaN
2015-01-07           NaN
2015-01-08           NaN
2015-01-09           NaN
2015-01-10           NaN
2015-01-11           NaN
2015-01-12           NaN
2015-01-13           NaN
2015-01-14           NaN
2015-01-15           NaN
2015-01-16           NaN
2015-01-17           NaN
2015-01-18           NaN
2015-01-19           NaN
2015-01-20           NaN
2015-01-21           NaN
2015-01-22           NaN
2015-01-23           NaN
2015-01-24           NaN
2015-01-25           NaN
2015-01-26           NaN
2015-01-27           NaN
2015-01-28           NaN
2015-01-29           NaN
2015-01-30           NaN
                 ...    
2015-03-02           NaN
2015-03-03           NaN
2015-03-04           NaN
2015-03-05           NaN
2015-03-06           NaN
2015-03-07           NaN
2015-03-08           NaN
2015-03-09           NaN
2015-03-10           NaN
2015-03-11           NaN
2015-03-12           NaN
2015-03-13           NaN
2015-03-14           NaN
2015-03-15           NaN
2015-03-16           NaN
2015-03-17           NaN
2015-03-18           NaN
2015-03-19           NaN
2015-03-20           NaN
2015-03-21           NaN
2015-03-22           NaN
2015-03-23           NaN
2015-03-24           NaN
2015-03-25           NaN
2015-03-26           NaN
2015-03-27           NaN
2015-03-28           NaN
2015-03-29           NaN
2015-03-30           NaN
2015-03-31           NaN
Freq: D, dtype: object

full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day,method='ffill')

2015-01-01    2015-01-01
2015-01-02    2015-01-01
2015-01-03    2015-01-01
2015-01-04    2015-01-01
2015-01-05    2015-01-01
2015-01-06    2015-01-01
2015-01-07    2015-01-01
2015-01-08    2015-01-01
2015-01-09    2015-01-01
2015-01-10    2015-01-01
2015-01-11    2015-01-01
2015-01-12    2015-01-01
2015-01-13    2015-01-01
2015-01-14    2015-01-01
2015-01-15    2015-01-01
2015-01-16    2015-01-01
2015-01-17    2015-01-01
2015-01-18    2015-01-01
2015-01-19    2015-01-01
2015-01-20    2015-01-01
2015-01-21    2015-01-01
2015-01-22    2015-01-01
2015-01-23    2015-01-01
2015-01-24    2015-01-01
2015-01-25    2015-01-01
2015-01-26    2015-01-01
2015-01-27    2015-01-01
2015-01-28    2015-01-01
2015-01-29    2015-01-01
2015-01-30    2015-01-01
                 ...    
2015-03-02    2015-03-01
2015-03-03    2015-03-01
2015-03-04    2015-03-01
2015-03-05    2015-03-01
2015-03-06    2015-03-01
2015-03-07    2015-03-01
2015-03-08    2015-03-01
2015-03-09    2015-03-01
2015-03-10    2015-03-01
2015-03-11    2015-03-01
2015-03-12    2015-03-01
2015-03-13    2015-03-01
2015-03-14    2015-03-01
2015-03-15    2015-03-01
2015-03-16    2015-03-01
2015-03-17    2015-03-01
2015-03-18    2015-03-01
2015-03-19    2015-03-01
2015-03-20    2015-03-01
2015-03-21    2015-03-01
2015-03-22    2015-03-01
2015-03-23    2015-03-01
2015-03-24    2015-03-01
2015-03-25    2015-03-01
2015-03-26    2015-03-01
2015-03-27    2015-03-01
2015-03-28    2015-03-01
2015-03-29    2015-03-01
2015-03-30    2015-03-01
2015-03-31    2015-03-01
Freq: D, dtype: object

关于索引，方便的操作有？

前面描述过了，索引有序，重复，但一定程度上又能通过key来访问，也就是说，某些集合操作都是可以支持的。

index1 = pd.Index(['A','B','B','C','C'])

Index([u'A', u'B', u'B', u'C', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')
Index([u'A', u'B'], dtype='object')
Index([u'C', u'C'], dtype='object')
Index([u'A', u'B', u'B', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')
[False False False  True  True]
Index([u'A', u'B', u'C', u'C'], dtype='object')
Index([u'K', u'A', u'B', u'B', u'C', u'C'], dtype='object')
Index([u'B', u'C'], dtype='object')
True True False
False False True

2. 大熊猫世界来去自如：Pandas的I/O

老生常谈，从基础来看，我们仍然关心pandas对于与外部数据是如何交互的。

2.1 结构化数据输入输出

read_csv与to_csv 是一对输入输出的工具，read_csv直接返回pandas.DataFrame，而to_csv只要执行命令即可写文件
- read_table：功能类似
- read_fwf：操作fixed width file
read_excel与to_excel方便的与excel交互

还记得刚开始的例子吗？

header 表示数据中是否存在列名，如果在第0行就写就写0，并且开始读数据时跳过相应的行数，不存在可以写none
names 表示要用给定的列名来作为最终的列名
encoding 表示数据集的字符编码，通常而言一份数据为了方便的进行文件传输都以utf-8作为标准

提问：下列例子中，header=4，names=cnames时，究竟会读到怎样的数据？

print cnames

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

希望了解全部参数的请移步API：

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

这里介绍一些常用的参数：

读取处理：

skiprows：跳过一定的行数
nrows：仅读取一定的行数
skipfooter：尾部有固定的行数永不读取
skip_blank_lines：空行跳过

内容处理：

sep/delimiter：分隔符很重要，常见的有逗号，空格和Tab('\t')
na_values：指定应该被当作na_values的数值
thousands：处理数值类型时，每千位分隔符并不统一 (1.234.567,89或者1,234,567.89都可能)，此时要把字符串转化为数字需要指明千位分隔符

收尾处理：

index_col：将真实的某列（列的数目，甚至列名）当作index
squeeze：仅读到一列时，不再保存为pandas.DataFrame而是pandas.Series

2.1.x Excel ... ?

对于存储着极为规整数据的Excel而言，其实是没必要一定用Excel来存，尽管Pandas也十分友好的提供了I/O接口。

irisdata.to_excel('S1EP3_irisdata.xls',index = None,encoding='utf-8')

唯一重要的参数：sheetname=k，标志着一个excel的第k个sheet页将会被取出。（从0开始）

2.2 半结构化数据

JSON：网络传输中常用的一种数据格式。

仔细看一下，实际上这就是我们平时收集到异源数据的风格是一致的：

列名不能完全匹配
关联键可能并不唯一
元数据被保存在数据里

json_data = [{'name':'Wang','sal':50000,'job':'VP'},\

2.3 数据库连接流程（Optional）

使用下列包，通过数据库配置建立Connection

pymysql
pyODBC
cx_Oracle

通过pandas.read_sql_query,read_sql_table,to_sql进行数据库操作。

Python与数据库的交互方案有很多种，从数据分析师角度看pandas方案比较适合，之后的讲义中会结合SQL语法进行讲解。

进行数据库连接首先你需要类似的这样一组信息：

IP = '127.0.0.1'

举例说明如果是MySQL：

import pymysql

3. 深入Pandas数据操纵

在第一部分的基础上，数据会有更多种操纵方式：

通过列名、行index来取数据，结合ix、iloc灵活的获取数据的一个子集（第一部分已经介绍）
按记录拼接（就像Union All）或者关联（join）
方便的自定义函数映射
排序
缺失值处理
与Excel一样灵活的数据透视表（在第四部分更详细介绍）

3.1 数据整合：方便灵活

3.1.1 横向拼接：直接DataFrame

pd.DataFrame([np.random.rand(2),np.random.rand(2),np.random.rand(2)],columns=['C1','C2'])

3.1.2 横向拼接：Concatenate

pd.concat([data_employee_ri,data_employee_ri,data_employee_ri])

pd.concat([data_employee_ri,data_employee_ri,data_employee_ri],ignore_index=True)

3.1.3 纵向拼接：Merge

根据数据列关联，使用on关键字

可以指定一列或多列
可以使用left_on和right_on

pd.merge(data_employee_ri,data_employee_ri,on='name')

pd.merge(data_employee_ri,data_employee_ri,on=['name','job'])

根据index关联，可以直接使用left_index和right_index

data_employee_ri.index.name = 'index1'

TIPS: 增加how关键字，并指定

how = 'inner'
how = 'left'
how = 'right'
how = 'outer'

结合how，可以看到merge基本再现了SQL应有的功能，并保持代码整洁。

DF31xA = pd.DataFrame({'name':[u'老王',u'老张',u'老李'],'sal':[5000,3000,1000]})

DF31xB = pd.DataFrame({'name':[u'老王',u'老刘'],'job':['VP','Manager']})

how='left': 保留左表信息

pd.merge(DF31xA,DF31xB,on='name',how='left')

how='right': 保留右表信息

pd.merge(DF31xA,DF31xB,on='name',how='right')

how='inner': 保留两表交集信息，这样尽量避免出现缺失值

pd.merge(DF31xA,DF31xB,on='name',how='inner')

how='outer': 保留两表并集信息，这样会导致缺失值，但最大程度的整合了已有信息

pd.merge(DF31xA,DF31xB,on='name',how='outer')

3.2 数据清洗三剑客

接下来的三个功能，map,applymap,apply,功能，是绝大多数数据分析师在数据清洗这一步骤中的必经之路。

他们分别回答了以下问题：

我想根据一列数据新做一列数据，怎么办？（Series->Series）
我想根据整张表的数据新做整张表，怎么办？（DataFrame->DataFrame）
我想根据很多列的数据新做一列数据，怎么办？（DataFrame->Series）

不要再写什么for循环了！改变思维，提高编码和执行效率

dataNumPy32 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

map: 以相同规则将一列数据作一个映射，也就是进行相同函数的处理

def GDP_Factorize(v):

类似的功能还有applymap，可以对一个dataframe里面每一个元素像map那样全局操作

DF32.applymap(lambda x: float(x)*2 if x.isdigit() else x.upper())

apply则可以对一个DataFrame操作得到一个Series

他会有点像我们后面介绍的agg,但是apply可以按行操作和按列操作，用axis控制即可。

DF32.apply(lambda x:x['nation']+x['capital']+'_'+x['GDP'],axis=1)

0      JapanTokyo_4000
1    S.KoreaSeoul_1300
2    ChinaBeijing_9100
dtype: object

3.3 数据排序

sort: 按一列或者多列的值进行行级排序
sort_index: 根据index里的取值进行排序，而且可以根据axis决定是重排行还是列

dataNumPy33 = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF33.sort(['capital','nation'])

DF33.sort('GDP',ascending=False)

DF33.sort('GDP').sort(ascending=False)

DF33.sort_index(axis=1,ascending=True)

一个好用的功能：Rank

DF33

DF33.rank()

DF33.rank(ascending=False)

注意tied data（相同值）的处理：

method = 'average'
method = 'min'
method = 'max'
method = 'first'

DF33x = pd.DataFrame({'name':[u'老王',u'老张',u'老李',u'老刘'],'sal':np.array([5000,3000,5000,9000])})

DF33x.rank()默认使用method='average'，两条数据相等时，处理排名时大家都用平均值

DF33x.sal.rank()

0    2.5
1    1.0
2    2.5
3    4.0
Name: sal, dtype: float64

method='min'，处理排名时大家都用最小值

DF33x.sal.rank(method='min')

0    2
1    1
2    2
3    4
Name: sal, dtype: float64

method='max'，处理排名时大家都用最大值

DF33x.sal.rank(method='max')

0    3
1    1
2    3
3    4
Name: sal, dtype: float64

method='first'，处理排名时谁先出现就先给谁较小的数值。

DF33x.sal.rank(method='first')

0    2
1    1
2    3
3    4
Name: sal, dtype: float64

3.4 缺失数据处理

DF34 = data_for_multi2.unstack()

忽略缺失值：

DF34.mean(skipna=True)

Col_1    2.500000
Col_2    4.333333
Col_3    6.500000
Col_4    9.000000
dtype: float64

DF34.mean(skipna=False)

Col_1    2.5
Col_2    NaN
Col_3    NaN
Col_4    NaN
dtype: float64

如果不想忽略缺失值的话，就需要祭出fillna了：

DF34

DF34.fillna(0).mean(axis=1,skipna=False)

Row_1    0.00
Row_2    0.75
Row_3    3.00
Row_4    7.50
dtype: float64

4. “一组”大熊猫：Pandas的groupby

groupby的功能类似SQL的group by关键字：

Split-Apply-Combine

Split，就是按照规则分组
Apply，通过一定的agg函数来获得输入pd.Series返回一个值的效果
Combine，把结果收集起来

Pandas的groupby的灵活性：

分组的关键字可以来自于index，也可以来自于真实的列数据
分组规则可以通过一列或者多列

from IPython.display import Image

分组的具体逻辑

irisdata_group = irisdata.groupby('class')

<pandas.core.groupby.DataFrameGroupBy object at 0x10a543b10>

for level,subsetDF in irisdata_group:

Iris-setosa
    sepal_length  sepal_width  petal_length  petal_width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
20           5.4          3.4           1.7          0.2  Iris-setosa
40           5.0          3.5           1.3          0.3  Iris-setosa
Iris-versicolor
    sepal_length  sepal_width  petal_length  petal_width            class
50           7.0          3.2           4.7          1.4  Iris-versicolor
70           5.9          3.2           4.8          1.8  Iris-versicolor
90           5.5          2.6           4.4          1.2  Iris-versicolor
Iris-virginica
     sepal_length  sepal_width  petal_length  petal_width           class
100           6.3          3.3           6.0          2.5  Iris-virginica
120           6.9          3.2           5.7          2.3  Iris-virginica
140           6.7          3.1           5.6          2.4  Iris-virginica

分组可以快速实现MapReduce的逻辑

Map: 指定分组的列标签，不同的值就会被扔到不同的分组处理
Reduce: 输入多个值，返回一个值，一般可以通过agg实现，agg能接受一个函数

irisdata.groupby('class').agg(\

irisdata.groupby('class').agg(spstat.skew)

汇总之后的广播操作

在OLAP数据库上，为了避免groupby+join的二次操作，提出了sum()over(partition by)的开窗操作。

在Pandas中，这种操作能够进一步被transform所取代。

pd.concat([irisdata,irisdata.groupby('class').transform('mean')],axis=1)[::20]

产生 MultiIndex（多列分组）后的数据透视表操作

一般来说，多列groupby的一个副作用就是.groupby().agg()之后你的行index已经变成了一个多列分组的分级索引。

如果我们希望达到Excel的数据透视表的效果，行和列的索引自由交换，达到统计目的，究竟应该怎么办呢？

factor1 = np.random.randint(0,3,50)

hierindexDF = pd.DataFrame({'F1':factor1,'F2':factor2,'F3':factor3,'F4':values})

hierindexDF_gbsum = hierindexDF.groupby(['F1','F2','F3']).sum()

观察Index：

hierindexDF_gbsum.index

MultiIndex(levels=[[0, 1, 2], [0, 1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=[u'F1', u'F2', u'F3'])

unstack：

无参数时，把最末index置换到column上
有数字参数时，把指定位置的index置换到column上
有列表参数时，依次把特定位置的index置换到column上

hierindexDF_gbsum.unstack()

hierindexDF_gbsum.unstack(0)

hierindexDF_gbsum.unstack(1)

hierindexDF_gbsum.unstack([2,0])

更进一步的，stack的功能是和unstack对应，把column上的多级索引换到index上去

hierindexDF_gbsum.unstack([2,0]).stack([1,2])

	sepal_length	sepal_width	petal_length	petal_width
class
Iris-setosa	49.878745	49.695242	49.654909	45.810069
Iris-versicolor	49.815081	49.680665	49.694505	49.452305
Iris-virginica	49.772059	49.714500	49.761700	49.545918

	C1	C2
0	0.958384	0.826066
1	0.607771	0.687302
2	0.943502	0.647464

	nation	capital	GDP	GDP_Level	NATION
0	JAPAN	TOKYO	8000	MEDIUM	JAPAN
1	S.KOREA	SEOUL	2600	LOW	S.KOREA
2	CHINA	BEIJING	18200	HIGH	CHINA

	sepal_length	sepal_width	petal_length	petal_width
class
Iris-setosa	0.116502	0.103857	0.069702	1.161506
Iris-versicolor	0.102232	-0.352014	-0.588404	-0.030249
Iris-virginica	0.114492	0.355026	0.533044	-0.125612

	sepal_length	sepal_width	petal_length	petal_width
class
Iris-setosa	0.116454	0.103814	0.069673	1.161022
Iris-versicolor	0.102190	-0.351867	-0.588159	-0.030236
Iris-virginica	0.114445	0.354878	0.532822	-0.125560

	nation	capital	GDP	population	region
2	China	Beijing	9100	1600	East_Asian
0	Japan	Tokyo	4900	130	East_Asian
1	S.Korea	Seoul	1300	55	East_Asian

	name	job	sal	report
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager
0	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager

	name_x	job_x	sal_x	report_x	name_y	job_y	sal_y	report_y
index1
0	Wang	VP	50000	NaN	Wang	VP	50000	NaN
1	Zhang	Manager	NaN	VP	Zhang	Manager	NaN	VP
2	Li	NaN	5000	Manager	Li	NaN	5000	Manager

	nation	capital	GDP	GDP_Level	NATION
0	Japan	Tokyo	4000	Medium	JAPAN
1	S.Korea	Seoul	1300	Low	S.KOREA
2	China	Beijing	9100	High	CHINA

	sepal_length	sepal_width	petal_length	petal_width	class	sepal_length	sepal_width	petal_length	petal_width
0	5.1	3.5	1.4	0.2	Iris-setosa	5.006	3.418	1.464	0.244
20	5.4	3.4	1.7	0.2	Iris-setosa	5.006	3.418	1.464	0.244
40	5.0	3.5	1.3	0.3	Iris-setosa	5.006	3.418	1.464	0.244
60	5.0	2.0	3.5	1.0	Iris-versicolor	5.936	2.770	4.260	1.326
80	5.5	2.4	3.8	1.1	Iris-versicolor	5.936	2.770	4.260	1.326
100	6.3	3.3	6.0	2.5	Iris-virginica	6.588	2.974	5.552	2.026
120	6.9	3.2	5.7	2.3	Iris-virginica	6.588	2.974	5.552	2.026
140	6.7	3.1	5.6	2.4	Iris-virginica	6.588	2.974	5.552	2.026

	F1	F2	F3	F4
0	1	0	1	-0.083928
1	0	0	0	0.949984
2	2	1	0	-0.037692
3	0	1	1	-0.518305
4	1	0	0	0.963678
5	1	1	1	-0.284463
6	0	0	1	0.412449
7	2	0	0	-0.277126
8	0	0	2	-2.488946
9	0	1	0	0.167413
10	0	1	1	1.520003
11	0	1	2	-0.665450
12	1	1	2	0.783906
13	1	1	1	-0.077717
14	2	0	2	-2.018863
15	0	1	1	-0.993884
16	2	0	2	-1.051174
17	0	0	2	-0.723457
18	1	1	0	-0.783521
19	1	0	2	0.355093
20	1	0	0	-0.252415
21	0	0	2	0.600679
22	2	1	2	-0.660498
23	0	0	2	-0.301613
24	0	1	2	-1.270339
25	0	0	0	0.074086
26	1	0	0	-0.043586
27	0	1	2	-2.000161
28	2	1	1	-1.041330
29	2	1	1	0.984101
30	2	0	0	0.160715
31	1	1	2	0.903035
32	1	1	1	-1.105763
33	1	0	0	0.850310
34	1	1	0	-0.062946
35	1	1	0	-0.507763
36	1	0	1	0.112303
37	2	1	2	-0.663782
38	2	1	1	1.147671
39	0	0	2	-1.552327
40	1	1	1	-1.166517
41	0	1	0	-0.622502
42	0	1	2	1.620961
43	1	1	0	-0.354238
44	1	0	2	-0.233783
45	0	0	0	0.131051
46	2	1	1	-1.301164
47	1	0	2	-1.013341
48	2	0	1	-0.508761
49	0	0	1	-1.104457

			F4
F1	F2	F3
0	0	0	1.155121
		1	-0.692009
		2	-4.465664
	1	0	-0.455089
		1	0.007813
		2	-2.314989
1	0	0	1.517987
		1	0.028374
		2	-0.892030
	1	0	-1.708468
		1	-2.634460
		2	1.686941
2	0	0	-0.116412
		1	-0.508761
		2	-3.070037
	1	0	-0.037692
		1	-0.210722
		2	-1.324281

	name	sal
0	老王	5000
1	老张	3000
2	老李	1000

	name	sal
0	老王	5000
1	老张	3000
2	老李	5000
3	老刘	9000

Pandas python