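Pandas Study Notes
1. Data Structures
Pandas has historically offered three main data structures:
- Series (one-dimensional, size-immutable)
- DataFrame (two-dimensional, size-mutable)
- Panel (three-dimensional, size-mutable; deprecated in pandas 0.20 and removed in 0.25)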
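Series
A one-dimensional array-like structure holding homogeneous data, for example the collection 1, 3, 5, 7, ...
1 | 3 | 5 | 7 | ... |
Key points
- Homogeneous data
- Size is immutable
- Values are mutable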
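DataFrame
A two-dimensional structure holding heterogeneous data, for example:
Name | Age | Gender |
Xiao Ming | 20 | Male |
Xiao Hong | 15 | Female |
Xiao Gang | 18 | Male |
Key points
- Heterogeneous data
- Size is mutable
- Values are mutable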
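Panel
A three-dimensional structure holding heterogeneous data; it can be thought of as a container of DataFrames.
Key points
- Heterogeneous data
- Size is mutable
- Values are mutable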
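2. Series
A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, and so on).
Constructor
pandas.Series(data, index, dtype, copy)
Parameter | Description |
data | The data, in any of several forms: ndarray, list, scalar constant |
index | Index labels; they must be hashable and the same length as the data (uniqueness is not enforced). Defaults to np.arange(n) if no index is passed. |
dtype | The data type. If omitted, it is inferred from the data. |
copy | Whether to copy the input data; defaults to False |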
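Creating an empty Series
    import pandas as pd

    # Pass dtype explicitly: pandas 2.x otherwise gives an empty Series the object dtype.
    s = pd.Series(dtype='float64')
    print(s)
Output
    Series([], dtype: float64)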
If the data is an ndarray, any index passed must have the same length. If no index is passed, the default index is 0 to n-1.
    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data)
    print(s)
Output
    0    a
    1    b
    2    c
    3    d
    dtype: object
The same data with an explicit index:
    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data, index=[100, 101, 102, 103])
    print(s)
Output
    100    a
    101    b
    102    c
    103    d
    dtype: object
Creating a Series from a dict: if no index is specified, the dict keys are used as the index; if an index is specified, the values are looked up by those labels, and labels missing from the dict yield NaN.
    import pandas as pd

    data = {'a': 0., 'b': 1., 'c': 2.}
    s = pd.Series(data)
    print(s)
Output
    a    0.0
    b    1.0
    c    2.0
    dtype: float64
With an explicit index:
    import pandas as pd

    data = {'a': 0., 'b': 1., 'c': 2.}
    s = pd.Series(data, index=['b', 'c', 'd', 'a'])
    print(s)
Output
    b    1.0
    c    2.0
    d    NaN
    a    0.0
    dtype: float64
Creating a Series from a scalar: pass an index to set the length, and the scalar value is repeated to match the length of the index.
    import pandas as pd

    s = pd.Series(5, index=[0, 1, 2, 3])
    print(s)
Output
    0    5
    1    5
    2    5
    3    5
    dtype: int64
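Accessing data by position: the data in a Series can be accessed much like the data in an ndarray. For a single position, .iloc is the explicit form; plain s[0] on a label-indexed Series is deprecated in recent pandas versions.
    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    print(s)
    print(s.iloc[0])  # first element by position
Output
    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64
    1
The first three elements:
    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    print(s[:3])
Output
    a    1
    b    2
    c    3
    dtype: int64
The last three elements:
    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    print(s[-3:])
Output
    c    3
    d    4
    e    5
    dtype: int64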
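Accessing data by label: values can be read and set through their index labels.
    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    print(s['a'])
Output
    1
Selecting several labels at once:
    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    print(s[['a', 'c', 'd']])
Output
    a    1
    c    3
    d    4
    dtype: int64
If a label is not present in the index, a KeyError is raised.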
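The notes mention setting values by label and the exception for a missing label without showing either. The following is a minimal sketch of both; the names missing_raised and fallback are illustrative only, and Series.get is shown as one possible way to avoid the exception.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Set a value through an existing label.
s['a'] = 10

# Looking up a label that is not in the index raises KeyError.
try:
    s['f']
    missing_raised = False
except KeyError:
    missing_raised = True

# Series.get returns a default value instead of raising.
fallback = s.get('f', default=-1)

print(s['a'], missing_raised, fallback)
```

Using get is convenient when a missing label is an expected case rather than an error.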
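3. DataFrame
pandas.DataFrame(data, index, columns, dtype, copy)
Constructor parameters:
Parameter | Description |
data | The data, in any of several forms: ndarray, Series, map, list, dict, constant, or another DataFrame |
index | Row labels |
columns | Column labels |
dtype | Data type to apply to the columns |
copy | Whether to copy the input data (False by default in older pandas releases) |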
Creating an empty DataFrame
    import pandas as pd

    df = pd.DataFrame()
    print(df)
Output
    Empty DataFrame
    Columns: []
    Index: []
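Creating a DataFrame from lists
    import pandas as pd

    data = [1, 2, 3, 4, 5]
    df = pd.DataFrame(data)
    print(df)
Output
       0
    0  1
    1  2
    2  3
    3  4
    4  5
From a list of lists, with column names:
    import pandas as pd

    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)
Output
         Name  Age
    0    Alex   10
    1     Bob   12
    2  Clarke   13
Converting the Age column to float:
    import pandas as pd

    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    # Passing dtype=float for the whole frame is unreliable with a string column
    # (newer pandas versions reject it), so only Age is cast here.
    df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})
    print(df)
Output
         Name   Age
    0    Alex  10.0
    1     Bob  12.0
    2  Clarke  13.0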
Creating a DataFrame from a dict of ndarrays or lists: all arrays must have the same length. If an index is passed, its length must equal the length of the arrays; otherwise the default index is used.
    import pandas as pd

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data)
    print(df)
Output
        Name  Age
    0    Tom   28
    1   Jack   34
    2  Steve   29
    3  Ricky   42
Creating an indexed DataFrame from the same arrays
    import pandas as pd

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
    print(df)
Output
            Name  Age
    rank1    Tom   28
    rank2   Jack   34
    rank3  Steve   29
    rank4  Ricky   42
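Creating a DataFrame from a list of dicts: each dict becomes one row, and the dict keys become the column names by default. Keys missing from a row yield NaN.
    import pandas as pd

    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data)
    print(df)
Output
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
Creating a DataFrame from a list of dicts with explicit row and column indexes. A column name that matches no dict key (b1 below) is filled with NaN.
    import pandas as pd

    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
    df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
    print(df1)
    print(df2)
Output
            a   b
    first   1   2
    second  5  10
            a  b1
    first   1 NaN
    second  5 NaN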
A dict of Series can be passed to build a DataFrame; the resulting index is the union of all the Series indexes.
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df)
Output
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
Column selection
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df['one'])
Output
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64
Column addition
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    print("Adding a new column by passing as Series:")
    df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
    print(df)

    print("Adding a new column using the existing columns in DataFrame:")
    df['four'] = df['one'] + df['three']
    print(df)
Output
    Adding a new column by passing as Series:
       one  two  three
    a  1.0    1   10.0
    b  2.0    2   20.0
    c  3.0    3   30.0
    d  NaN    4    NaN
    Adding a new column using the existing columns in DataFrame:
       one  two  three  four
    a  1.0    1   10.0  11.0
    b  2.0    2   20.0  22.0
    c  3.0    3   30.0  33.0
    d  NaN    4    NaN   NaN
Column deletion
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
         'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
    df = pd.DataFrame(d)
    print("Our dataframe is:")
    print(df)

    print("Deleting the first column using DEL function:")
    del df['one']      # removes the column in place
    print(df)

    print("Deleting another column using POP function:")
    df.pop('two')      # removes the column in place and returns it
    print(df)
Output
    Our dataframe is:
       one  two  three
    a  1.0    1   10.0
    b  2.0    2   20.0
    c  3.0    3   30.0
    d  NaN    4    NaN
    Deleting the first column using DEL function:
       two  three
    a    1   10.0
    b    2   20.0
    c    3   30.0
    d    4    NaN
    Deleting another column using POP function:
       three
    a   10.0
    b   20.0
    c   30.0
    d    NaN
Row selection, addition, and deletion
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df)
    print('---------')
    print(df.loc['a'])     # select a row by label
    print('---------')
    print(df.iloc[2])      # select a row by integer position
    print('---------')
    print(df[2:4])         # select rows by slice
    print('---------')
    df2 = pd.DataFrame([[5, 6], [7, 8]], index=['e', 'f'], columns=['one', 'two'])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    df = pd.concat([df, df2])
    print(df)
    df = df.drop('a')      # delete a row by label
    print('---------')
    print(df)
Output
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    ---------
    one    1.0
    two    1.0
    Name: a, dtype: float64
    ---------
    one    3.0
    two    3.0
    Name: c, dtype: float64
    ---------
       one  two
    c  3.0    3
    d  NaN    4
    ---------
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    e  5.0    6
    f  7.0    8
    ---------
       one  two
    b  2.0    2
    c  3.0    3
    d  NaN    4
    e  5.0    6
    f  7.0    8
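4. Panel
Panel was deprecated in pandas 0.20 and removed in 0.25, so this section applies only to older releases.
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
Parameter | Description |
data | The data, in any of several forms: ndarray, Series, map, list, dict, constant, or DataFrame |
items | axis=0 |
major_axis | axis=1 |
minor_axis | axis=2 |
dtype | Data type of each column |
copy | Whether to copy the data |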
Creating a Panel and selecting data
    # Requires pandas < 0.25; pd.Panel does not exist in later releases.
    import pandas as pd
    import numpy as np

    print('--------create an empty panel---------')
    p = pd.Panel()
    print(p)
    print('-------------end---------------------')

    print('---create a panel from 3D ndarray----')
    data = np.random.rand(2, 4, 5)
    p = pd.Panel(data)
    print(p)
    print('-------------end---------------------')

    print('-create a panel from dict(DataFrame)-')
    data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
            'Item2': pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p)
    print('-------------end---------------------')

    print('-------select data from panel--------')
    data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
            'Item2': pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p['Item1'])
    print('-------------end---------------------')

    print('-----select data use major_axis------')
    data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
            'Item2': pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p.major_xs(1))
    print('-------------end---------------------')

    print('-----select data use minor_axis------')
    data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
            'Item2': pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p.minor_xs(1))
    print('-------------end---------------------')
Output (the data is random and regenerated for each step, so the values differ between steps and between runs)
    --------create an empty panel---------
    <class 'pandas.core.panel.Panel'>
    Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
    Items axis: None
    Major_axis axis: None
    Minor_axis axis: None
    -------------end---------------------
    ---create a panel from 3D ndarray----
    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
    Items axis: 0 to 1
    Major_axis axis: 0 to 3
    Minor_axis axis: 0 to 4
    -------------end---------------------
    -create a panel from dict(DataFrame)-
    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
    Items axis: Item1 to Item2
    Major_axis axis: 0 to 3
    Minor_axis axis: 0 to 2
    -------------end---------------------
    -------select data from panel--------
              0         1         2
    0 -0.960065 -1.114559 -0.296025
    1 -0.382277 -0.585262  1.503437
    2  1.315953 -0.350967 -0.711729
    3  0.959712  0.800819 -0.673261
    -------------end---------------------
    -----select data use major_axis------
          Item1     Item2
    0 -1.742578 -0.697723
    1 -0.156266  0.003577
    2  0.023405       NaN
    -------------end---------------------
    -----select data use minor_axis------
          Item1     Item2
    0  1.103015  0.488929
    1 -0.391214 -0.030208
    2  1.783799  0.039654
    3 -1.863803 -0.949056
    -------------end---------------------
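Because Panel is gone from current pandas, the examples above cannot be run on a recent install. One possible substitute, sketched here under the assumption that a DataFrame with a two-level row index is acceptable, is to stack the same dict of DataFrames with pd.concat. The level names item and major are chosen for illustration and are not part of the original notes.

```python
import numpy as np
import pandas as pd

# Same kind of input as the Panel examples: a dict of DataFrames.
data = {
    'Item1': pd.DataFrame(np.random.randn(4, 3)),
    'Item2': pd.DataFrame(np.random.randn(4, 2)),
}

# Stack the frames; the dict keys become the outer level of a row MultiIndex.
stacked = pd.concat(data).rename_axis(['item', 'major'])

# Roughly p['Item1']: all rows that belong to one item.
item1 = stacked.loc['Item1']

# Roughly p.major_xs(1): row 1 of every item, transposed so items are columns.
major_1 = stacked.xs(1, level='major').T

print(stacked.shape, item1.shape, major_1.shape)
```

As with the Panel version, Item2 has only two columns, so its third entry in major_1 is NaN.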
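5. Basic Functionality
Series basic functionality
Attribute or method | Description |
axes | Returns a list of the row axis labels. |
dtype | Returns the data type of the object. |
empty | Checks whether the object is empty; returns a boolean. |
ndim | Returns the number of dimensions of the underlying data; by definition 1. |
size | Returns the number of elements in the underlying data. |
values | Returns the Series as an ndarray. |
head(n) | Returns the first n rows. |
tail(n) | Returns the last n rows. |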
    import pandas as pd
    import numpy as np

    s = pd.Series(np.random.randn(4))
    print(s)
    print('-------------')
    print("The axes are:")
    print(s.axes)
    print('-------------')
    print("Is the Object empty?")
    print(s.empty)
    print('-------------')
    print("The dimensions of the object:")
    print(s.ndim)
    print('-------------')
    print("The size of the object:")
    print(s.size)
    print('-------------')
    print("The actual data series is:")
    print(s.values)
    print('-------------')
    print("The first two rows of the data series:")
    print(s.head(2))
    print('-------------')
    print("The last two rows of the data series:")
    print(s.tail(2))
Output
    0   -1.478084
    1    0.468882
    2    0.394107
    3    0.682990
    dtype: float64
    -------------
    The axes are:
    [RangeIndex(start=0, stop=4, step=1)]
    -------------
    Is the Object empty?
    False
    -------------
    The dimensions of the object:
    1
    -------------
    The size of the object:
    4
    -------------
    The actual data series is:
    [-1.47808355  0.46888222  0.3941075   0.68299036]
    -------------
    The first two rows of the data series:
    0   -1.478084
    1    0.468882
    dtype: float64
    -------------
    The last two rows of the data series:
    2    0.394107
    3    0.682990
    dtype: float64
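DataFrame basic functionality
Attribute or method | Description |
T | Transposes rows and columns. |
axes | Returns a list whose only members are the row axis labels and the column axis labels. |
dtypes | Returns the data types in this object. |
empty | Checks whether the object is empty; returns a boolean. |
ndim | Number of axes (array dimensions). |
shape | Returns a tuple representing the dimensions of the DataFrame. |
size | Number of elements. |
values | Returns the data as an ndarray. |
head(n) | Returns the first n rows. |
tail(n) | Returns the last n rows. |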
    import pandas as pd
    import numpy as np

    print('---------create a DataFrame---------')
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
    df = pd.DataFrame(d)
    print("Our data series is:")
    print(df)
    print('----------------end-----------------')
    print('--the transpose of the data series--')
    print(df.T)
    print('----------------end-----------------')
    print('-----row and column axis labels-----')
    print(df.axes)
    print('----------------end-----------------')
    print('---the data types of each column----')
    print(df.dtypes)
    print('----------------end-----------------')
    print('---------is the object empty--------')
    print(df.empty)
    print('----------------end-----------------')
    print('-----------the dimension------------')
    print(df.ndim)
    print('----------------end-----------------')
    print('--------------the shape-------------')
    print(df.shape)
    print('----------------end-----------------')
    print('------total number of elements------')
    print(df.size)
    print('----------------end-----------------')
    print('-------------actual data------------')
    print(df.values)
    print('----------------end-----------------')
    print('-------first two rows of data-------')
    print(df.head(2))
    print('----------------end-----------------')
    print('--------last two rows of data-------')
    print(df.tail(2))
    print('----------------end-----------------')
Output
    ---------create a DataFrame---------
    Our data series is:
        Name  Age  Rating
    0    Tom   25    4.23
    1  James   26    3.24
    2  Ricky   25    3.98
    3    Vin   23    2.56
    4  Steve   30    3.20
    5  Minsu   29    4.60
    6   Jack   23    3.80
    ----------------end-----------------
    --the transpose of the data series--
               0      1      2     3      4      5     6
    Name     Tom  James  Ricky   Vin  Steve  Minsu  Jack
    Age       25     26     25    23     30     29    23
    Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8
    ----------------end-----------------
    -----row and column axis labels-----
    [RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]
    ----------------end-----------------
    ---the data types of each column----
    Name       object
    Age         int64
    Rating    float64
    dtype: object
    ----------------end-----------------
    ---------is the object empty--------
    False
    ----------------end-----------------
    -----------the dimension------------
    2
    ----------------end-----------------
    --------------the shape-------------
    (7, 3)
    ----------------end-----------------
    ------total number of elements------
    21
    ----------------end-----------------
    -------------actual data------------
    [['Tom' 25 4.23]
     ['James' 26 3.24]
     ['Ricky' 25 3.98]
     ['Vin' 23 2.56]
     ['Steve' 30 3.2]
     ['Minsu' 29 4.6]
     ['Jack' 23 3.8]]
    ----------------end-----------------
    -------first two rows of data-------
        Name  Age  Rating
    0    Tom   25    4.23
    1  James   26    3.24
    ----------------end-----------------
    --------last two rows of data-------
        Name  Age  Rating
    5  Minsu   29     4.6
    6   Jack   23     3.8
    ----------------end-----------------
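6. Descriptive Statistics
Function | Description |
sum() | Returns the sum of the values along the requested axis; default axis=0 |
mean() | Returns the mean |
std() | Returns the standard deviation |
median() | Median of the values |
mode() | Mode (most frequent value) of the values |
min() | Minimum value |
max() | Maximum value |
abs() | Absolute value |
prod() | Product of the elements |
cumsum() | Cumulative sum |
cumprod() | Cumulative product |
describe() | Computes a summary of statistics; the include argument selects what is summarized: object for string columns, number for numeric columns, all for every column |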
    import pandas as pd
    import numpy as np

    print('--------create a DataFrame--------')
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}
    df = pd.DataFrame(d)
    print(df)
    print('---------------end----------------')
    print('---------------sum----------------')
    print(df.sum())                             # column sums; strings are concatenated
    print('---------------end----------------')
    # numeric_only=True skips the Name column; pandas 2.x no longer drops it silently.
    print(df.sum(axis=1, numeric_only=True))    # row sums
    print('---------------end----------------')
    print('--------------mean----------------')
    print(df.mean(numeric_only=True))
    print('---------------end----------------')
    print('--------------std----------------')
    print(df.std(numeric_only=True))
    print('---------------end----------------')
    print('------------describe--------------')
    print(df.describe())
    print('---------------end----------------')
Output
    --------create a DataFrame--------
          Name  Age  Rating
    0      Tom   25    4.23
    1    James   26    3.24
    2    Ricky   25    3.98
    3      Vin   23    2.56
    4    Steve   30    3.20
    5    Minsu   29    4.60
    6     Jack   23    3.80
    7      Lee   34    3.78
    8    David   40    2.98
    9   Gasper   30    4.80
    10  Betina   51    4.10
    11  Andres   46    3.65
    ---------------end----------------
    ---------------sum----------------
    Name      TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...
    Age                                                     382
    Rating                                                44.92
    dtype: object
    ---------------end----------------
    0     29.23
    1     29.24
    2     28.98
    3     25.56
    4     33.20
    5     33.60
    6     26.80
    7     37.78
    8     42.98
    9     34.80
    10    55.10
    11    49.65
    dtype: float64
    ---------------end----------------
    --------------mean----------------
    Age       31.833333
    Rating     3.743333
    dtype: float64
    ---------------end----------------
    --------------std----------------
    Age       9.232682
    Rating    0.661628
    dtype: float64
    ---------------end----------------
    ------------describe--------------
                 Age     Rating
    count  12.000000  12.000000
    mean   31.833333   3.743333
    std     9.232682   0.661628
    min    23.000000   2.560000
    25%    25.000000   3.230000
    50%    29.500000   3.790000
    75%    35.500000   4.132500
    max    51.000000   4.800000
    ---------------end----------------
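The table lists the object, number, and all choices for describe(), but the example only calls it with defaults. A small sketch of the include argument follows, using a shortened version of the data; the Name column is given an explicit object dtype here so the selection behaves the same across pandas versions, which is an assumption of this sketch rather than something the notes require.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin'], dtype=object),
    'Age': [25, 26, 25, 23],
    'Rating': [4.23, 3.24, 3.98, 2.56],
})

# Default: only numeric columns are summarized.
numeric_summary = df.describe()

# include=['object'] summarizes string columns (count, unique, top, freq).
string_summary = df.describe(include=['object'])

# include='all' covers every column; statistics that do not apply are NaN.
full_summary = df.describe(include='all')

print(numeric_summary.columns.tolist())
print(string_summary.index.tolist())
print(full_summary.columns.tolist())
```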
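7. Function Application
- Table-wise function application: pipe()
- Row-wise or column-wise function application: apply()
- Element-wise function application: applymap()
pipe() performs a custom operation by taking a function plus any extra arguments; the DataFrame itself is passed to the function as its first argument.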
    import pandas as pd
    import numpy as np

    def adder(ele1, ele2):
        return ele1 + ele2

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    print(df)
    print('---------------end----------------')
    print(df.pipe(adder, 2))                        # table-wise: adder(df, 2)
    print('---------------end----------------')
    print(df.apply(np.mean))                        # column-wise mean
    print('---------------end----------------')
    print(df.apply(np.mean, axis=1))                # row-wise mean
    print('---------------end----------------')
    print(df.apply(lambda x: x.max() - x.min()))    # range of each column
    print('---------------end----------------')
    print(df['col1'].map(lambda x: x * 100))        # element-wise on a Series
    print('---------------end----------------')
    # applymap was renamed to DataFrame.map in pandas 2.1 and is deprecated there.
    print(df.applymap(lambda x: x * 100))           # element-wise on a DataFrame
    print('---------------end----------------')
Output
           col1      col2      col3
    0  1.689749  0.959856  1.074871
    1 -0.392017  0.001075  0.806392
    2 -0.484529  0.635483  0.644830
    3 -0.049649  0.113976 -0.220698
    4  1.413197 -0.576231 -0.075871
    ---------------end----------------
           col1      col2      col3
    0  3.689749  2.959856  3.074871
    1  1.607983  2.001075  2.806392
    2  1.515471  2.635483  2.644830
    3  1.950351  2.113976  1.779302
    4  3.413197  1.423769  1.924129
    ---------------end----------------
    col1    0.435350
    col2    0.226832
    col3    0.445905
    dtype: float64
    ---------------end----------------
    0    1.241492
    1    0.138483
    2    0.265261
    3   -0.052123
    4    0.253698
    dtype: float64
    ---------------end----------------
    col1    2.174278
    col2    1.536088
    col3    1.295569
    dtype: float64
    ---------------end----------------
    0    168.974915
    1    -39.201732
    2    -48.452922
    3     -4.964864
    4    141.319700
    Name: col1, dtype: float64
    ---------------end----------------
             col1       col2        col3
    0  168.974915  95.985614  107.487138
    1  -39.201732   0.107497   80.639193
    2  -48.452922  63.548250   64.483009
    3   -4.964864  11.397646  -22.069797
    4  141.319700 -57.623138   -7.587075
    ---------------end----------------
8. Reindexing
Reindexing changes the row and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis. It can:
- Reorder the existing data to match a new set of labels
- Insert missing-value (NA) markers at label positions where no data existed for that label
    import pandas as pd
    import numpy as np

    N = 20
    df = pd.DataFrame({
        'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
        'x': np.linspace(0, stop=N-1, num=N),
        'y': np.random.rand(N),
        'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
        'D': np.random.normal(100, 10, size=(N)).tolist()
    })
    # Column 'B' does not exist in df, so it is created and filled with NaN.
    df_reindexed = df.reindex(index=[0, 2, 5], columns=['A', 'C', 'B'])
    print(df_reindexed)
Output
               A       C   B
    0 2016-01-01    High NaN
    2 2016-01-03  Medium NaN
    5 2016-01-06  Medium NaN
Reindexing to align with another object
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.random.randn(10, 3), columns=['col1', 'col2', 'col3'])
    df2 = pd.DataFrame(np.random.randn(7, 3), columns=['col1', 'col2', 'col3'])
    # df1 takes on df2's labels, so only rows 0 to 6 are kept.
    df1 = df1.reindex_like(df2)
    print(df1)
Output
           col1      col2      col3
    0  0.533272  1.462343  1.958989
    1  0.822496  1.020661 -0.958452
    2  0.583271  1.100357  0.405649
    3 -0.617700 -0.444208  0.921092
    4 -0.883714 -0.068178  1.507545
    5 -0.696816  0.729113 -0.509259
    6 -0.127911 -0.255686 -1.378398
Filling while reindexing
- pad/ffill - fill values forward
- bfill/backfill - fill values backward
- nearest - fill from the nearest index value
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1', 'col2', 'col3'])
    df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])
    print(df2.reindex_like(df1))
    print("Data Frame with Forward Fill:")
    print(df2.reindex_like(df1, method='ffill'))
Output
           col1      col2      col3
    0  0.518742  0.162080  1.606103
    1 -0.355712  2.200266  1.072651
    2       NaN       NaN       NaN
    3       NaN       NaN       NaN
    4       NaN       NaN       NaN
    5       NaN       NaN       NaN
    Data Frame with Forward Fill:
           col1      col2      col3
    0  0.518742  0.162080  1.606103
    1 -0.355712  2.200266  1.072651
    2 -0.355712  2.200266  1.072651
    3 -0.355712  2.200266  1.072651
    4 -0.355712  2.200266  1.072651
    5 -0.355712  2.200266  1.072651
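The example above only exercises forward fill. To compare all three fill methods side by side, here is a small sketch on a Series with a sparse integer index; the data and the variable names are made up for illustration, and reindex is used directly instead of reindex_like.

```python
import pandas as pd

# Known values at labels 0, 5 and 10.
s = pd.Series([10.0, 20.0, 30.0], index=[0, 5, 10])
new_index = [0, 2, 4, 6, 8, 10]

forward = s.reindex(new_index, method='ffill')     # carry the last known value forward
backward = s.reindex(new_index, method='bfill')    # pull the next known value backward
nearest = s.reindex(new_index, method='nearest')   # take the value at the closest label

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'nearest': nearest}))
```

The three methods agree only at labels that already exist in the original index (0 and 10 here).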
Limiting the fill while reindexing: the limit argument gives extra control over filling by capping the number of consecutive labels that are filled.
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1', 'col2', 'col3'])
    df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])
    print(df2.reindex_like(df1))
    print("Data Frame with Forward Fill limiting to 1:")
    print(df2.reindex_like(df1, method='ffill', limit=1))
Output
           col1      col2      col3
    0  0.550406  0.220336 -0.733154
    1  0.372353  0.978386  1.202727
    2       NaN       NaN       NaN
    3       NaN       NaN       NaN
    4       NaN       NaN       NaN
    5       NaN       NaN       NaN
    Data Frame with Forward Fill limiting to 1:
           col1      col2      col3
    0  0.550406  0.220336 -0.733154
    1  0.372353  0.978386  1.202727
    2  0.372353  0.978386  1.202727
    3       NaN       NaN       NaN
    4       NaN       NaN       NaN
    5       NaN       NaN       NaN
Renaming
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1', 'col2', 'col3'])
    print(df1)
    print("After renaming the rows and columns:")
    # Labels not listed in the mappings (col3, rows 3 to 5) are left unchanged.
    print(df1.rename(columns={'col1': 'c1', 'col2': 'c2'},
                     index={0: 'apple', 1: 'banana', 2: 'durian'}))
Output
           col1      col2      col3
    0  0.162944 -0.257846 -0.890368
    1 -0.969776  1.685473 -1.330109
    2 -1.271563 -0.375700  0.778564
    3 -1.123660  0.849679  0.436355
    4  0.321475  0.779693 -2.100270
    5 -1.184636 -0.206975  0.941504
    After renaming the rows and columns:
                  c1        c2      col3
    apple   0.162944 -0.257846 -0.890368
    banana -0.969776  1.685473 -1.330109
    durian -1.271563 -0.375700  0.778564
    3      -1.123660  0.849679  0.436355
    4       0.321475  0.779693 -2.100270
    5      -1.184636 -0.206975  0.941504
9. Iteration
Iterating over a DataFrame directly yields its column names.
    import pandas as pd
    import numpy as np

    N = 20
    df = pd.DataFrame({
        'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
        'x': np.linspace(0, stop=N-1, num=N),
        'y': np.random.rand(N),
        'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
        'D': np.random.normal(100, 10, size=(N)).tolist()
    })
    for col in df:
        print(col)
Output
    A
    x
    y
    C
    D
To iterate over the contents of a DataFrame, the following methods are available:
- items() - iterates over (column name, Series) pairs; it was called iteritems() in older releases, and that name was removed in pandas 2.0
- iterrows() - iterates over the rows as (index, Series) pairs
- itertuples() - iterates over the rows as namedtuples
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
    print('---------------items----------------')
    for key, value in df.items():
        print(key, value)
    print('----------------end----------------')
    print('-------------iterrows--------------')
    for row_index, row in df.iterrows():
        print(row_index, row)
    print('----------------end----------------')
    print('-------------itertuples------------')
    for row in df.itertuples():
        print(row)
    print('----------------end----------------')
Output
    ---------------items----------------
    col1 0   -0.453626
    1   -1.555137
    2    1.209289
    3    0.238345
    Name: col1, dtype: float64
    col2 0   -0.309713
    1   -0.018258
    2    0.326646
    3    1.584639
    Name: col2, dtype: float64
    col3 0   -1.746411
    1    0.144020
    2    0.932400
    3   -0.848700
    Name: col3, dtype: float64
    ----------------end----------------
    -------------iterrows--------------
    0 col1   -0.453626
    col2   -0.309713
    col3   -1.746411
    Name: 0, dtype: float64
    1 col1   -1.555137
    col2   -0.018258
    col3    0.144020
    Name: 1, dtype: float64
    2 col1    1.209289
    col2    0.326646
    col3    0.932400
    Name: 2, dtype: float64
    3 col1    0.238345
    col2    1.584639
    col3   -0.848700
    Name: 3, dtype: float64
    ----------------end----------------
    -------------itertuples------------
    Pandas(Index=0, col1=-0.453625680715928, col2=-0.30971276978094636, col3=-1.7464111236386397)
    Pandas(Index=1, col1=-1.5551365938912898, col2=-0.018257622785818713, col3=0.1440202346073698)
    Pandas(Index=2, col1=1.2092886777094904, col2=0.3266461576970751, col3=0.9323998460902878)
    Pandas(Index=3, col1=0.23834535595475798, col2=1.5846386089382405, col3=-0.8486996087036667)
    ----------------end----------------
10. Sorting
sort_values() lets you choose the sorting algorithm through its kind argument: mergesort, heapsort, or quicksort (the default).
    import pandas as pd
    import numpy as np

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])
    print(unsorted_df)
    print('-------sort by label--------')
    sorted_df = unsorted_df.sort_index()
    print(sorted_df)
    print('-----reverse the order------')
    sorted_df = unsorted_df.sort_index(ascending=False)
    print(sorted_df)
    print('---sort by column label-----')
    sorted_df = unsorted_df.sort_index(axis=1)
    print(sorted_df)
    print('-------sort by value--------')
    sorted_df = unsorted_df.sort_values(by='col1')
    print(sorted_df)
Output
           col2      col1
    1  0.295840 -0.880007
    4  0.151129  1.843255
    6 -0.516764  0.195839
    2 -0.040592  0.582046
    3  1.806547 -0.760579
    5 -1.366668  0.652985
    9 -1.180956  1.198587
    8 -1.621409 -0.555094
    0  0.403722  0.296659
    7  0.520232 -0.759177
    -------sort by label--------
           col2      col1
    0  0.403722  0.296659
    1  0.295840 -0.880007
    2 -0.040592  0.582046
    3  1.806547 -0.760579
    4  0.151129  1.843255
    5 -1.366668  0.652985
    6 -0.516764  0.195839
    7  0.520232 -0.759177
    8 -1.621409 -0.555094
    9 -1.180956  1.198587
    -----reverse the order------
           col2      col1
    9 -1.180956  1.198587
    8 -1.621409 -0.555094
    7  0.520232 -0.759177
    6 -0.516764  0.195839
    5 -1.366668  0.652985
    4  0.151129  1.843255
    3  1.806547 -0.760579
    2 -0.040592  0.582046
    1  0.295840 -0.880007
    0  0.403722  0.296659
    ---sort by column label-----
           col1      col2
    1 -0.880007  0.295840
    4  1.843255  0.151129
    6  0.195839 -0.516764
    2  0.582046 -0.040592
    3 -0.760579  1.806547
    5  0.652985 -1.366668
    9  1.198587 -1.180956
    8 -0.555094 -1.621409
    0  0.296659  0.403722
    7 -0.759177  0.520232
    -------sort by value--------
           col2      col1
    1  0.295840 -0.880007
    3  1.806547 -0.760579
    7  0.520232 -0.759177
    8 -1.621409 -0.555094
    6 -0.516764  0.195839
    0  0.403722  0.296659
    2 -0.040592  0.582046
    5 -1.366668  0.652985
    9 -1.180956  1.198587
    4  0.151129  1.843255
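The choice of sorting algorithm mentioned at the start of this section is not exercised in the example. The sketch below, on made-up data with ties, shows kind='mergesort', which is a stable sort and therefore keeps tied rows in their original order, followed by a sort on two columns.

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 2, 1], 'col2': ['a', 'b', 'c', 'd']})

# Stable sort: rows with equal col1 keep their original relative order.
stable = df.sort_values(by='col1', kind='mergesort')

# Sort on two columns: col1 ascending, then col2 descending within ties.
by_two = df.sort_values(by=['col1', 'col2'], ascending=[True, False])

print(stable)
print(by_two)
```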
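11. Strings and Text Data
These methods are available through the .str accessor of a Series or Index.
Function | Description |
lower() | Converts the strings in the Series/Index to lowercase |
upper() | Converts the strings in the Series/Index to uppercase |
len() | Computes the length of each string |
strip() | Strips whitespace from both sides of each string in the Series/Index |
split() | Splits each string on the given pattern |
cat() | Concatenates the Series/Index elements with the given separator |
get_dummies() | Returns a DataFrame of one-hot encoded values |
contains() | Returns a boolean for each element indicating whether it contains the substring |
replace(a,b) | Replaces the value a with the value b |
repeat() | Repeats each element the specified number of times |
count() | Returns the number of occurrences of the pattern in each element |
startswith() | Returns True if the element starts with the pattern |
endswith() | Returns True if the element ends with the pattern |
find() | Returns the position of the first occurrence of the pattern |
findall() | Returns a list of all occurrences of the pattern |
swapcase() | Swaps upper and lower case |
islower() | Checks whether the string is all lowercase |
isupper() | Checks whether the string is all uppercase |
isnumeric() | Checks whether all characters in the string are numeric |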
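This section has no example, so here is a brief sketch of a few of the methods from the table, called through the .str accessor. The sample strings are invented for illustration.

```python
import pandas as pd

s = pd.Series(['  Tom ', 'William Rick', 'john', 'ALBER@T'])

# Remove surrounding whitespace first, then apply other methods.
stripped = s.str.strip()

lowered = stripped.str.lower()
lengths = stripped.str.len()
has_i = stripped.str.contains('i')          # case-sensitive substring test
parts = stripped.str.split(' ')
joined = stripped.str.cat(sep='_')          # a single string, not a Series
replaced = stripped.str.replace('@', '$', regex=False)

print(lowered.tolist())
print(lengths.tolist())
print(has_i.tolist())
print(parts.tolist())
print(joined)
print(replaced.tolist())
```

Each call returns a new object; the original Series s is left unchanged.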
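12. Customizing Display Options
- pd.get_option(param) # returns the current value of the option
- pd.set_option(param, value) # sets the option to a new value
- pd.reset_option(param) # resets the option to its default value
- pd.describe_option(param) # prints the description of the option
- pd.option_context(param, value) # sets the option temporarily inside a with block; the previous value is restored on exit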
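The functions listed above can be sketched together as follows, using the display.max_rows option as the example; the variable names are illustrative only.

```python
import pandas as pd

# Change an option globally and read it back.
pd.set_option('display.max_rows', 20)
after_set = pd.get_option('display.max_rows')

# Change it temporarily; the previous value returns when the block exits.
with pd.option_context('display.max_rows', 5):
    inside = pd.get_option('display.max_rows')
after_context = pd.get_option('display.max_rows')

# Restore the library default.
pd.reset_option('display.max_rows')
after_reset = pd.get_option('display.max_rows')

print(after_set, inside, after_context, after_reset)
```

option_context is usually the safer choice in shared code, because it cannot leave a changed setting behind.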
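Parameter | Description |
"display.max_rows" | Maximum number of rows displayed |
"display.max_columns" | Maximum number of columns displayed |
"display.expand_frame_repr" | Whether a wide DataFrame is printed in full across multiple lines |
"display.max_colwidth" | Maximum column width displayed |
"display.precision" | Number of decimal places displayed for floating-point numbers |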
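13. Indexing
- .loc[rows, columns] # label-based; the first argument selects rows and the second selects columns; each can be a single label, a list of labels, or a label slice
- .iloc[rows, columns] # integer-position-based; the first argument selects rows and the second selects columns; each can be an integer, a list of integers, or an integer slice
- .ix[rows, columns] # mixed label/position indexer; deprecated in pandas 0.20 and removed in 1.0, so use .loc or .iloc instead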
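The difference between the two remaining indexers is easiest to see on a small frame. The following sketch uses made-up data; note that label slices in .loc include the end label, while integer slices in .iloc exclude the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40], 'C': [100, 200, 300, 400]},
    index=['a', 'b', 'c', 'd'],
)

# Label-based selection.
scalar = df.loc['b', 'A']                    # single row label, single column label
label_slice = df.loc['a':'c', ['A', 'C']]    # label slice (end included), list of columns

# Position-based selection.
pos_scalar = df.iloc[1, 0]                   # second row, first column
pos_slice = df.iloc[0:2, 0:2]                # integer slices (end excluded)

# .loc also accepts a boolean mask for the rows.
masked = df.loc[df['B'] > 20, 'C']

print(scalar, pos_scalar)
print(label_slice)
print(pos_slice)
print(masked)
```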