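3. Essential basic functionality

This section covers many of the basic operations common to the pandas data structures. First, create some example objects: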

In [35]:

import pandas as pd
import numpy as np

In [36]:

index = pd.date_range("1/1/2000",periods=8)
s = pd.Series(np.random.randn(5),index=list('abcde'))
df = pd.DataFrame(np.random.randn(8,3), index= index, columns=["A","B","C"])

3.1. head() and tail()

head()

tail()

To view a small sample of a Series or DataFrame, use the head() and tail() methods. They show five elements by default, but you can pass a custom number.

In [37]:

long_series = pd.Series(np.random.randn(1000))
long_series.head()

Out[37]:

0    0.672212
1    0.876168
2   -1.079769
3   -0.968866
4   -1.854071
dtype: float64

In [38]:

long_series.tail()

Out[38]:

995   -0.014405
996   -0.183823
997   -0.599304
998    0.303536
999   -0.074504
dtype: float64
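Both methods take the number of rows as their first argument. A minimal sketch, using a small deterministic Series (the name `demo` is introduced here only for illustration):

```python
import pandas as pd

# A small Series with known values, so the result is easy to verify.
demo = pd.Series(range(10))

first_three = demo.head(3)  # rows at positions 0, 1, 2
last_two = demo.tail(2)     # rows at positions 8, 9

print(first_three.tolist())
print(last_two.tolist())
```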

3.2. Attributes and underlying data

DataFrame/Series.index

DataFrame.columns

Many objects: .array

DataFrame/Series.to_numpy()

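pandas objects have a number of attributes that give you access to their metadata:

shape: gives the axis dimensions of the object, consistent with ndarray
Axis labels
- Series: index (the only axis)
- DataFrame: index (rows) and columns

Note that these attributes can be safely assigned to!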

In [39]:

df[:2]

Out[39]:

	A	B	C
2000-01-01	-0.219326	-0.536275	1.302458
2000-01-02	-1.249909	1.108175	0.327125

In [40]:

df.columns = [x.lower() for x in df.columns]
df

Out[40]:

	a	b	c
2000-01-01	-0.219326	-0.536275	1.302458
2000-01-02	-1.249909	1.108175	0.327125
2000-01-03	1.789501	1.296963	2.444054
2000-01-04	0.472515	0.936789	-0.823054
2000-01-05	-1.092298	1.079988	-0.739153
2000-01-06	-0.080023	-0.260854	-0.045418
2000-01-07	-1.536474	0.222915	-1.007296
2000-01-08	0.130246	-1.270309	0.280023

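pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual data and do the actual computation. For many types, the underlying array is a numpy.ndarray. However, pandas and third-party libraries may extend NumPy's type system (see dtypes).

To get the actual data inside an Index or Series, use the .array property: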

In [41]:

s.array

Out[41]:

<PandasArray>
[ 0.25182906712820524,  -0.9346872326131562, -0.19523262571678002,
  -0.4679782959972137, -0.03616135904185705]
Length: 5, dtype: float64

In [42]:

s.index.array  # an Index exposes .array as well

Out[42]:

<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

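When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values. See dtypes.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider datetimes with timezones. NumPy has no dtype for timezone-aware datetimes, so there are two possibly useful representations:

An object-dtype numpy.ndarray of Timestamp objects, each with the correct tz
A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone discarded

Timezones can be preserved with dtype=object: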

In [43]:

ser = pd.Series(pd.date_range('2000',periods=2, tz="CET"))
ser.to_numpy(dtype=object)

Out[43]:

array([Timestamp('2000-01-01 00:00:00+0100', tz='CET'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET')], dtype=object)

Or discarded with dtype='datetime64[ns]':

In [44]:

ser.to_numpy(dtype='datetime64[ns]')

Out[44]:

array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')

Getting the "raw data" inside a DataFrame can be a bit more complex. When your DataFrame has a single data type for all columns, DataFrame.to_numpy() returns the underlying data:

In [45]:

df.to_numpy()

Out[45]:

array([[-0.2193259 , -0.53627457,  1.30245811],
       [-1.24990869,  1.1081747 ,  0.32712479],
       [ 1.78950082,  1.29696309,  2.44405386],
       [ 0.47251541,  0.9367891 , -0.82305401],
       [-1.0922979 ,  1.0799879 , -0.73915313],
       [-0.08002316, -0.26085374, -0.04541801],
       [-1.53647403,  0.22291517, -1.00729634],
       [ 0.13024598, -1.27030942,  0.28002296]])

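If a DataFrame contains homogeneously typed data, the ndarray can actually be modified in place, and the changes will be reflected in the data structure. For heterogeneous data (for example, when the DataFrame's columns do not all share the same dtype), this is not the case. Unlike the axis labels, the values attribute itself cannot be assigned to.

Note:

When working with heterogeneous data, the dtype of the resulting ndarray is chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You will still find references to these in older code bases and online. Going forward, we recommend avoiding .values and using .array or .to_numpy() instead. .values has the following drawbacks:

When your Series contains an extension type, it is unclear whether Series.values returns a NumPy array or the extension array. Series.array always returns an ExtensionArray and never copies data. Series.to_numpy() always returns a NumPy array, potentially at the cost of copying or coercing values.
When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.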
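The dtype rules above can be checked on two tiny frames. This is only a sketch; the frames `numeric` and `mixed` and the Series `ser_demo` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frames used only for illustration.
numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
mixed = pd.DataFrame({"i": [1, 2], "s": ["x", "y"]})

numeric_arr = numeric.to_numpy()  # int + float columns -> float64
mixed_arr = mixed.to_numpy()      # a string column forces object dtype

print(numeric_arr.dtype, mixed_arr.dtype)

# Series.array returns an extension array; to_numpy() returns an ndarray.
ser_demo = pd.Series([1, 2, 3])
print(type(ser_demo.array).__name__, type(ser_demo.to_numpy()).__name__)
```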

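3.3. Accelerated operations

pandas supports accelerating certain types of binary numerical and boolean operations using the numexpr and bottleneck libraries.

These libraries are especially useful when dealing with large data sets, where they provide speedups. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized Cython routines that are especially fast when dealing with arrays that contain NaNs.

Both are enabled by default; you can control this by setting the options. The example below turns them off: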

In [46]:

pd.set_option("compute.use_bottleneck", False)
pd.set_option("compute.use_numexpr", False)
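If the change should apply only to one block of code, the same option can be set temporarily. A minimal sketch using `pd.get_option` and `pd.option_context` (the variable names are illustrative):

```python
import pandas as pd

before = pd.get_option("compute.use_numexpr")

# option_context changes the option only inside the with-block
# and restores the previous value on exit.
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

after = pd.get_option("compute.use_numexpr")
print(before, inside, after)
```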

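3.4. Flexible binary operations

With binary operations between pandas data structures, there are two key points of interest:

Broadcasting behavior between higher-dimensional (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
Missing data in computations.

3.4.1. Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and the related functions radd(), rsub(), ... for carrying out binary operations (there are too many to list here; see the API documentation). For broadcasting behavior, Series input is of primary interest. With these functions, you can use the axis keyword to match on either the index or the columns: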

In [47]:

df = pd.DataFrame(
    {
        "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
        "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
        "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
    }
)
df

Out[47]:

	one	two	three
a	-0.819943	0.239292	NaN
b	1.857015	-0.202030	1.565334
c	1.017643	-1.108036	0.374097
d	NaN	-1.791342	0.615869

In [48]:

row = df.iloc[1]  # the second row
column = df['two']  # the column labeled "two"
df.sub(row, axis='columns')  # equivalent to df.sub(row, axis=1)

Out[48]:

	one	two	three
a	-2.676959	0.441323	NaN
b	0.000000	0.000000	0.000000
c	-0.839372	-0.906006	-1.191237
d	NaN	-1.589312	-0.949465

In [49]:

df.sub(column, axis=0)  # equivalent to df.sub(column, axis="index")

Out[49]:

	one	two	three
a	-1.059236	0.000000	NaN
b	2.059046	0.000000	1.767365
c	2.125679	0.000000	1.482133
d	NaN	0.000000	2.407212
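The two axis choices are easier to follow with round numbers. A small sketch on a made-up 2x2 frame (`frame`, `by_row`, and `by_col` are names introduced here):

```python
import pandas as pd

# Hypothetical 2x2 frame with easy-to-check values.
frame = pd.DataFrame({"x": [1, 2], "y": [10, 20]}, index=["r1", "r2"])

# Subtract the first row from every row: the Series labels (x, y)
# are matched against the columns.
by_row = frame.sub(frame.loc["r1"], axis="columns")

# Subtract column "x" from every column: the Series labels (r1, r2)
# are matched against the index.
by_col = frame.sub(frame["x"], axis="index")

print(by_row)
print(by_col)
```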

In addition, you can align a level of a MultiIndexed DataFrame with a Series:

In [50]:

dfmi = df.copy()
dfmi.index = pd.MultiIndex.from_tuples(
    [(1,'a'),(1,'b'),(1,'c'),(2,'a')],names=['first','second']
)
dfmi.sub(column,axis=0,level='second')

Out[50]:

		one	two	three
first	second
1	a	-1.059236	0.000000	NaN
	b	2.059046	0.000000	1.767365
	c	2.125679	0.000000	1.482133
2	a	NaN	-2.030635	0.376577

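Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at the same time, returning a two-tuple of the same type as the left-hand side. For example: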

In [51]:

s = pd.Series(np.arange(10))
s

Out[51]:

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [52]:

divmod(s, 3)  # returns a (div, rem) tuple: div is the floor of s / 3, rem is the remainder

Out[52]:

(0    0
 1    0
 2    0
 3    1
 4    1
 5    1
 6    2
 7    2
 8    2
 9    3
 dtype: int64,
 0    0
 1    1
 2    2
 3    0
 4    1
 5    2
 6    0
 7    1
 8    2
 9    0
 dtype: int64)

In [53]:

idx = pd.Index(np.arange(10))
idx

Out[53]:

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [54]:

divmod(idx, 3)

Out[54]:

(Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64'),
 Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64'))

divmod() can also be applied elementwise, with a different divisor for each element:

In [55]:

div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
div, rem

Out[55]:

(0    0
 1    0
 2    0
 3    1
 4    1
 5    1
 6    1
 7    1
 8    1
 9    1
 dtype: int64,
 0    0
 1    1
 2    2
 3    0
 4    0
 5    1
 6    1
 7    2
 8    2
 9    3
 dtype: int64)

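3.4.2. Missing data / operations with fill values

In Series and DataFrame, the arithmetic functions accept a fill_value option: a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using fillna if you wish).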

In [56]:

df

Out[56]:

	one	two	three
a	-0.819943	0.239292	NaN
b	1.857015	-0.202030	1.565334
c	1.017643	-1.108036	0.374097
d	NaN	-1.791342	0.615869

In [57]:

df2 = df.copy()
df2.loc['a','three'] = 1.0  # df and df2 differ only at row a, column three
df2

Out[57]:

	one	two	three
a	-0.819943	0.239292	1.000000
b	1.857015	-0.202030	1.565334
c	1.017643	-1.108036	0.374097
d	NaN	-1.791342	0.615869

In [58]:

df + df2  # a/three: NaN + 1 = NaN; d/one: NaN + NaN = NaN

Out[58]:

	one	two	three
a	-1.639887	0.478585	NaN
b	3.714031	-0.404061	3.130669
c	2.035286	-2.216072	0.748194
d	NaN	-3.582685	1.231739

In [59]:

df.add(df2, fill_value=0)  # a/three: 0 + 1 = 1; d/one: missing on both sides, so still NaN

Out[59]:

	one	two	three
a	-1.639887	0.478585	1.000000
b	3.714031	-0.404061	3.130669
c	2.035286	-2.216072	0.748194
d	NaN	-3.582685	1.231739
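The "at most one missing value" rule can be seen in isolation on a three-row example. A sketch with made-up frames `left` and `right`:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"v": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"v": [10.0, 20.0, np.nan]})

plain = left + right                    # NaN wherever either side is missing
filled = left.add(right, fill_value=0)  # NaN only where both sides are missing

print(plain["v"].tolist())
print(filled["v"].tolist())
```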

3.4.3. Flexible comparisons

Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge, whose behavior is analogous to the binary arithmetic operations described above:

Method	Meaning
eq	equal to
ne	not equal to
lt	less than
gt	greater than
le	less than or equal to
ge	greater than or equal to

In [60]:

df.gt(df2)

Out[60]:

	one	two	three
a	False	False	False
b	False	False	False
c	False	False	False
d	False	False	False

In [61]:

df2.ne(df)

Out[61]:

	one	two	three
a	False	False	True
b	False	False	False
c	False	False	False
d	True	False	False

3.4.4. Boolean reductions

You can apply the reductions empty, any(), all(), and bool() to summarize a boolean result.

In [62]:

(df > 0).all() , (df > 0).any().any()  # chaining reductions collapses the result to a single value

Out[62]:

(one      False
 two      False
 three    False
 dtype: bool,
 True)

The empty property tests whether a pandas object is empty.

In [63]:

df.empty, pd.DataFrame(columns=list('abc')).empty

Out[63]:

(False, True)

To evaluate a single-element pandas object in a boolean context, use the bool() method:

In [64]:

pd.Series([True]).bool() , pd.Series([False]).bool()

Out[64]:

(True, False)

In [65]:

pd.DataFrame([[True]]).bool(), pd.DataFrame([[False]]).bool()

Out[65]:

(True, False)

DataFrame and Series objects cannot be used directly in a boolean context; doing so raises an error.

In [ ]:

if df:
    pass
"""
ValueError: The truth value of a DataFrame is ambiguous. 
Use a.empty, a.bool(), a.item(), a.any() or a.all().
"""

In [ ]:

df and df2
"""
ValueError: The truth value of a DataFrame is ambiguous. 
Use a.empty, a.bool(), a.item(), a.any() or a.all().
"""

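3.4.5. Comparing if objects are equivalent

Use equals() to test whether two objects are equal. An elementwise comparison reduced with all() is not reliable for this.

In an elementwise comparison, NaNs do not compare as equal, so NaN == NaN evaluates to False; equals() treats NaNs in corresponding locations as equal.
Note that the index must also be in the same order for equals() to return True.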

In [ ]:

df + df == df *2

Out[ ]:

	one	two	three
a	True	True	False
b	True	True	True
c	True	True	True
d	False	True	True

In [ ]:

(df+df==df*2).all()

Out[ ]:

one      False
two       True
three    False
dtype: bool

In [ ]:

(df + df).equals(df *2)

Out[ ]:

True

In [ ]:

df1 = pd.DataFrame({'col': ['foo',0,np.nan]})
df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])
df1.equals(df2)

Out[ ]:

False

In [ ]:

df1.equals(df2.sort_index())  # the index order must match as well

Out[ ]:

True
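The difference between the two approaches comes down to how NaN is handled. A minimal sketch with two identical Series that each contain a NaN (names are illustrative):

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan])
b = pd.Series([1.0, np.nan])

elementwise_all = bool((a == b).all())  # NaN == NaN is False, so this fails
same = a.equals(b)                      # NaNs in the same location count as equal

print(elementwise_all, same)
```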

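3.4.6. Comparing array-like objects

You can conveniently perform elementwise comparisons between a pandas data structure and a scalar value: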

In [ ]:

pd.Series(["foo","bar","baz"])=='foo'

Out[ ]:

0     True
1    False
2    False
dtype: bool

In [ ]:

pd.Index(["foo","bar","baz"])=="foo"

Out[ ]:

array([ True, False, False])

pandas also handles elementwise comparisons between different array-like objects of the same length:

In [ ]:

pd.Series(["foo","bar","baz"]) == pd.Index(["foo", "bar", "qux"])

Out[ ]:

0     True
1     True
2    False
dtype: bool

In [ ]:

pd.Series(["foo", "bar", "baz"]) == np.array(["foo", "bar", "qux"])

Out[ ]:

0     True
1     True
2    False
dtype: bool

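Trying to compare Index or Series objects of different lengths raises a ValueError. NumPy behaves differently: it broadcasts the comparison when the shapes are compatible, and when they are not, as in the example below, the NumPy version used here returns False with a DeprecationWarning stating that this will become an error.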

In [ ]:

np.array([1, 2, 3]) == np.array([1, 2])

C:\Users\watalo\AppData\Local\Temp\ipykernel_11620\1336612208.py:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  np.array([1, 2, 3]) == np.array([1, 2])

Out[ ]:

False
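For contrast, here is a short sketch of the pandas side of this comparison, catching the error so that the snippet runs to completion (the flag `raised` is introduced only for illustration):

```python
import pandas as pd

try:
    # The two Series have different lengths, so their labels do not match.
    pd.Series([1, 2, 3]) == pd.Series([1, 2])
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```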

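3.4.7. Combining overlapping data sets

Sometimes you need to combine two similar DataFrame objects so that missing values in one are conditionally filled with like-labeled values from the other. The function implementing this operation is combine_first(), illustrated below: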

In [ ]:

df1 = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0, 5.0, np.nan], "B": [np.nan, 2.0, 3.0, np.nan, 6.0]}
)

df2 = pd.DataFrame(
    {
        "A": [5.0, 2.0, 4.0, np.nan, 3.0, 7.0],
        "B": [np.nan, np.nan, 3.0, 4.0, 6.0, 8.0],
    }
)

df1, df2

Out[ ]:

(     A    B
 0  1.0  NaN
 1  NaN  2.0
 2  3.0  3.0
 3  5.0  NaN
 4  NaN  6.0,
      A    B
 0  5.0  NaN
 1  2.0  NaN
 2  4.0  3.0
 3  NaN  4.0
 4  3.0  6.0
 5  7.0  8.0)

In [ ]:

df1.combine_first(df2)

Out[ ]:

	A	B
0	1.0	NaN
1	2.0	2.0
2	3.0	3.0
3	5.0	4.0
4	3.0	6.0
5	7.0	8.0

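3.4.8. General DataFrame combine

The combine_first() method above builds on the more general DataFrame.combine(). That method takes another DataFrame and a combiner function, aligns the two DataFrames, and then passes the combiner function pairs of Series (that is, columns with the same name).

For example, to reproduce combine_first() as above: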

In [ ]:

def combiner(x, y):
    return np.where(pd.isna(x),y,x)

df1.combine(df2,combiner)

Out[ ]:

	A	B
0	1.0	NaN
1	2.0	2.0
2	3.0	3.0
3	5.0	4.0
4	3.0	6.0
5	7.0	8.0

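3.5. Descriptive statistics

There are many methods for computing descriptive statistics and other related operations on Series and DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size.

Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer:

Series: no axis argument needed
DataFrame: "index" (axis=0, default), "columns" (axis=1)

For example: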

In [ ]:

df

Out[ ]:

	one	two	three
a	1.139565	-0.565203	NaN
b	-1.184934	1.543159	0.128045
c	-1.633771	0.593862	-0.797587
d	NaN	-0.521046	-0.335744

In [ ]:

df.mean(0), df.mean(1)

Out[ ]:

(one     -0.559713
 two      0.262693
 three   -0.335095
 dtype: float64,
 a    0.287181
 b    0.162090
 c   -0.612498
 d   -0.428395
 dtype: float64)

All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [ ]:

df.sum(0, skipna=False)

Out[ ]:

one           NaN
two      1.050772
three         NaN
dtype: float64

In [ ]:

df.sum(axis=1, skipna=True)

Out[ ]:

a    0.574362
b    0.486270
c   -1.837495
d   -0.856790
dtype: float64

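Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation of 1), very concisely: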

In [ ]:

ts_stand = (df - df.mean()) / df.std()
ts_stand.std()

Out[ ]:

one      1.0
two      1.0
three    1.0
dtype: float64

In [ ]:

xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)
xs_stand.std(1)

Out[ ]:

a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
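The column-wise case can be checked numerically on fixed data. A small sketch, assuming a made-up frame `frame` with two columns on very different scales:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"p": [1.0, 2.0, 3.0, 4.0], "q": [10.0, 20.0, 40.0, 80.0]})

# frame.mean() and frame.std() are Series indexed by column name,
# so they align with the columns of the frame during the arithmetic.
z = (frame - frame.mean()) / frame.std()

print(z.mean().round(12).tolist())
print(z.std().round(12).tolist())
```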

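Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different from expanding() and rolling(), where NaN behavior is additionally dictated by a min_periods parameter.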

In [ ]:

df.cumsum()

Out[ ]:

	one	two	three
a	1.139565	-0.565203	NaN
b	-0.045368	0.977955	0.128045
c	-1.679139	1.571817	-0.669542
d	NaN	1.050772	-1.005286

Here is a quick reference summary table of common functions. Each also takes an optional level parameter, which applies only if the object has a hierarchical index.

Function	Description
count	Number of non-NA observations
sum	Sum of values
mean	Mean of values
mad	Mean absolute deviation
median	Arithmetic median of values
min	Minimum
max	Maximum
mode	Mode
abs	Absolute value
prod	Product of values
std	Bessel-corrected sample standard deviation
var	Unbiased variance
sem	Standard error of the mean
skew	Sample skewness (3rd moment)
kurt	Sample kurtosis (4th moment)
quantile	Sample quantile (value at %)
cumsum	Cumulative sum
cumprod	Cumulative product
cummax	Cumulative maximum
cummin	Cumulative minimum

Note that some NumPy functions, such as mean, std, and sum, exclude NAs by default when given a Series as input:

In [ ]:

np.mean(df["one"]) , np.mean(df['one'].to_numpy())

Out[ ]:

(-0.5597130075722023, nan)
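When only a plain ndarray is available, NumPy's NaN-aware variants give the same answer as the Series method. A minimal sketch on a three-element example (names are illustrative):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

on_series = np.mean(values)                # dispatches to Series.mean(), which skips NaN
on_ndarray = np.mean(values.to_numpy())    # plain NumPy: NaN propagates
nan_aware = np.nanmean(values.to_numpy())  # NumPy's NaN-skipping variant

print(on_series, on_ndarray, nan_aware)
```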

Series.nunique() returns the number of unique non-NA values in a Series:

In [ ]:

series = pd.Series(np.random.randn(500))
series[20:500] = np.nan
series[10:20] = 5  # the first 10 values are distinct random numbers, the next 10 are all 5, the rest are NaN
series.nunique()

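Out[ ]:

11

3.5.1. Summarizing data: describe

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs, of course):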

In [ ]:

series = pd.Series(np.random.randn(1000))
series[::2] = np.nan
series.describe()

Out[ ]:

count    500.000000
mean       0.014638
std        0.960350
min       -2.886848
25%       -0.639844
50%        0.038532
75%        0.700063
max        3.034186
dtype: float64

In [ ]:

frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
frame.iloc[::2] = np.nan
frame.describe()

Out[ ]:

	a	b	c	d	e
count	500.000000	500.000000	500.000000	500.000000	500.000000
mean	-0.027874	-0.028041	-0.049434	0.016202	0.065243
std	0.992697	0.979050	1.041223	1.031190	1.020048
min	-2.997081	-2.737347	-3.296077	-3.009516	-3.198953
25%	-0.662278	-0.670290	-0.779942	-0.728801	-0.678985
50%	-0.001360	-0.039071	-0.034171	-0.004156	0.092582
75%	0.672736	0.601327	0.609958	0.730051	0.721171
max	2.458216	2.577917	2.703639	2.924883	3.083671

You can select specific percentiles to include in the output; the median (50%) is always included:

In [ ]:

series.describe(percentiles=[0.05, 0.25, 0.75, 0.95])

Out[ ]:

count    500.000000
mean       0.014638
std        0.960350
min       -2.886848
5%        -1.546566
25%       -0.639844
50%        0.038532
75%        0.700063
95%        1.625676
max        3.034186
dtype: float64

Note: for a DataFrame of mixed types, describe() restricts the summary to numeric columns only or, if there are none, to categorical columns only:

In [ ]:

frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})
frame.describe()

Out[ ]:

	b
count	4.000000
mean	1.500000
std	1.290994
min	0.000000
25%	0.750000
50%	1.500000
75%	2.250000
max	3.000000

This behavior can be controlled by providing a list of types as the include/exclude arguments. The special value all can also be used:

In [ ]:

frame.describe(include= ['object'])

Out[ ]:

	a
count	4
unique	2
top	Yes
freq	2

In [ ]:

frame.describe(include= ['number'])

Out[ ]:

	b
count	4.000000
mean	1.500000
std	1.290994
min	0.000000
25%	0.750000
50%	1.500000
75%	2.250000
max	3.000000

In [ ]:

frame.describe(include="all")  # note: a plain string, not a list

Out[ ]:

	a	b
count	4	4.000000
unique	2	NaN
top	Yes	NaN
freq	2	NaN
mean	NaN	1.500000
std	NaN	1.290994
min	NaN	0.000000
25%	NaN	0.750000
50%	NaN	1.500000
75%	NaN	2.250000
max	NaN	3.000000

3.5.2. Index of min/max values

idxmin()

idxmax()

The idxmin() and idxmax() functions on Series and DataFrame compute the index labels of the minimum and maximum values, respectively:

In [ ]:

s1 = pd.Series(np.random.randn(5))
s1, s1.idxmin(), s1.idxmax()

Out[ ]:

(0    0.780791
 1    0.172858
 2    0.014429
 3   -0.376702
 4   -1.072792
 dtype: float64,
 4,
 0)

In [ ]:

df1 = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
df1, df1.idxmin(axis=0), df1.idxmax(axis=1)

Out[ ]:

(          A         B         C
 0  0.685701  0.960447  1.895228
 1 -0.827385 -1.889043 -0.021492
 2  1.800086 -0.796629 -0.609086
 3  0.091910 -0.111135 -0.383179
 4 -1.784803  1.931232  0.951651,
 A    4
 B    1
 C    2
 dtype: int64,
 0    C
 1    C
 2    A
 3    A
 4    B
 dtype: object)

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

In [ ]:

df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=["A"], index=list("edcba"))
df3["A"].idxmin()

Out[ ]:

'd'

idxmin and idxmax are called argmin and argmax in NumPy.
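One practical difference is worth a short sketch: idxmin() returns an index label, while NumPy's argmin returns an integer position. The Series `scores` below is made up for illustration:

```python
import numpy as np
import pandas as pd

scores = pd.Series([2, 1, 1, 3], index=list("edcb"))

label = scores.idxmin()                       # index label of the first minimum
position = int(np.argmin(scores.to_numpy()))  # integer position of the first minimum

print(label, position)
```

The position can be turned back into a label with `scores.index[position]`.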

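3.5.3. Value counts (histogramming) / mode

The value_counts() Series method and top-level function compute a histogram of a 1D array of values. It can also be used as a function on regular arrays: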

In [ ]:

data = np.random.randint(0, 7, size=50)
data

Out[ ]:

array([0, 5, 0, 0, 4, 6, 6, 3, 3, 1, 1, 6, 0, 5, 2, 2, 6, 3, 6, 2, 2, 6,
       6, 2, 4, 2, 1, 0, 6, 2, 0, 0, 3, 4, 2, 1, 2, 0, 0, 2, 4, 2, 5, 0,
       2, 6, 6, 4, 0, 1])

In [ ]:

s = pd.Series(data)

In [ ]:

s.value_counts()

Out[ ]:

2    12
0    11
6    10
4     5
1     5
3     4
5     3
dtype: int64

Similarly, you can get the most frequently occurring value(s), i.e. the mode, of the values in a Series or DataFrame:

In [ ]:

s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
s5.mode()

Out[ ]:

0    3
1    7
dtype: int64

In [ ]:

df5= pd.DataFrame(
    {
        "A": np.random.randint(0, 7, size=50),
        "B": np.random.randint(-10, 15, size=50),
    }
)
df5.mode()

Out[ ]:

	A	B
0	0	-4
1	4	0

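3.5.4. Discretization and quantiling

Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions: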

In [ ]:

arr = np.random.randn(20)
factor = pd.cut(arr,4)
factor

Out[ ]:

[(-0.00659, 0.689], (0.689, 1.384], (0.689, 1.384], (-0.702, -0.00659], (-1.4, -0.702], ..., (-0.00659, 0.689], (-0.00659, 0.689], (-0.00659, 0.689], (-1.4, -0.702], (-0.00659, 0.689]]
Length: 20
Categories (4, interval[float64, right]): [(-1.4, -0.702] < (-0.702, -0.00659] < (-0.00659, 0.689] < (0.689, 1.384]]

In [ ]:

factor = pd.cut(arr,[-5, -1, 0, 1, 5])
factor

Out[ ]:

[(0, 1], (1, 5], (0, 1], (-1, 0], (-5, -1], ..., (0, 1], (0, 1], (0, 1], (-5, -1], (0, 1]]
Length: 20
Categories (4, interval[int64, right]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

In [ ]:

arr = np.random.randn(30)
factor = pd.qcut(arr, [0, 0.25, 0.5, 0.75, 1])
factor

Out[ ]:

[(-0.265, 0.108], (0.388, 1.153], (-2.0029999999999997, -0.265], (0.388, 1.153], (-0.265, 0.108], ..., (-2.0029999999999997, -0.265], (-0.265, 0.108], (-2.0029999999999997, -0.265], (-0.265, 0.108], (0.388, 1.153]]
Length: 30
Categories (4, interval[float64, right]): [(-2.0029999999999997, -0.265] < (-0.265, 0.108] < (0.108, 0.388] < (0.388, 1.153]]
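Because the random inputs above make the bins hard to check by eye, here is a deterministic sketch. The data, the bin edges, and the label names are all invented for illustration:

```python
import pandas as pd

ages = [5, 17, 30, 64, 80]

# cut(): bins are defined by value; edges are right-inclusive by default,
# giving (0, 18], (18, 65], (65, 120].
groups = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# qcut(): bins are defined by sample quantiles; labels=False returns
# the integer bin number for each value instead of an interval.
quartile_codes = pd.qcut(list(range(8)), 4, labels=False)

print(list(groups))
print(list(quartile_codes))
```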

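3.6. Function application

To apply your own or another library's functions to pandas objects, you should be aware of the methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.

Tablewise function application: pipe()
Row- or column-wise function application: apply()
Aggregation API: agg() and transform()
Applying elementwise functions: applymap()

3.6.1. Tablewise function application

DataFrames and Series can be passed into functions. However, if the function needs to be called in a chain, consider using the pipe() method.

First some setup: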

In [ ]:

def extract_city_name(df):
    """
    Chicago, IL -> Chicago for city_name column
    """
    df["city_name"] = df["city_and_code"].str.split(",").str.get(0)
    return df

def add_country_name(df, country_name = None):
    """
    Chicago -> ChicagoUS for city_and_country column
    """
    col = "city_name"
    df["city_and_country"] = df[col] + country_name
    return df

df_p = pd.DataFrame({
    "city_and_code":["Chicago, IL"]
})

extract_city_name and add_country_name are functions that take a DataFrame and return a DataFrame.

Now compare the following two styles:

In [ ]:

add_country_name(extract_city_name(df_p), country_name="US")

Out[ ]:

	city_and_code	city_name	city_and_country
0	Chicago, IL	Chicago	ChicagoUS

In [ ]:

df_p.pipe(extract_city_name).pipe(add_country_name,country_name="US")

Out[ ]:

	city_and_code	city_name	city_and_country
0	Chicago, IL	Chicago	ChicagoUS

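pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your own or another library's functions in method chains, alongside pandas' methods.

In the example above, the functions extract_city_name and add_country_name each expect a DataFrame as the first positional argument. What if the function you wish to apply takes its data as, say, the second argument? In this case, provide pipe with a tuple of (callable, data_keyword). pipe will route the DataFrame to the argument specified in the tuple.

For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame as the second argument, data. We pass the function, keyword pair (sm.ols, "data") to pipe: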

In [ ]:

import statsmodels.formula.api as sm
bb = pd.read_csv("data/baseball.csv",index_col='id')
(
    bb.query("h > 0")     # this looks really powerful; worth a closer look later
    .assign(ln_h=lambda df: np.log(df.h))
    .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
    .fit()
    .summary()
)

Out[ ]:

OLS Regression Results
Dep. Variable:	hr	R-squared:	0.458
Model:	OLS	Adj. R-squared:	0.458
Method:	Least Squares	F-statistic:	1926.
Date:	Sun, 05 Jun 2022	Prob (F-statistic):	0.00
Time:	00:33:11	Log-Likelihood:	-60863.
No. Observations:	18236	AIC:	1.217e+05
Df Residuals:	18227	BIC:	1.218e+05
Df Model:	8
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-132.5821	3.090	-42.906	0.000	-138.639	-126.525
C(lg)[T.AL]	-0.7862	0.541	-1.454	0.146	-1.846	0.274
C(lg)[T.FL]	-1.0497	1.266	-0.829	0.407	-3.530	1.431
C(lg)[T.NL]	-1.1781	0.539	-2.187	0.029	-2.234	-0.122
C(lg)[T.PL]	0.1840	1.313	0.140	0.889	-2.390	2.758
C(lg)[T.UA]	2.4496	2.628	0.932	0.351	-2.701	7.600
ln_h	0.4191	0.071	5.886	0.000	0.280	0.559
year	0.0663	0.002	41.513	0.000	0.063	0.069
g	0.1028	0.002	48.636	0.000	0.099	0.107

Omnibus:	6196.996	Durbin-Watson:	1.907
Prob(Omnibus):	0.000	Jarque-Bera (JB):	30705.738
Skew:	1.574	Prob(JB):	0.00
Kurtosis:	8.523	Cond. No.	1.20e+05

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.2e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

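The pipe method is inspired by Unix pipes and, more recently, dplyr and magrittr, which introduced the popular (%>%) (read: pipe) operator for R. We encourage you to view the source code of pipe(). In effect, obj.pipe(func, *args, **kwargs) is equivalent to calling func(obj, *args, **kwargs), written in a form that fits into a method chain.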
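The tuple form of pipe can be tried without statsmodels. The sketch below uses a hypothetical helper `scale` that, like sm.ols, takes its data as the second argument:

```python
import pandas as pd

def scale(factor, data):
    """Hypothetical helper that takes the DataFrame as its second argument."""
    return data * factor

frame = pd.DataFrame({"v": [1, 2, 3]})

# pipe((callable, "data"), ...) passes the frame as the keyword argument `data`.
piped = frame.pipe((scale, "data"), 10)
direct = scale(10, data=frame)

print(piped["v"].tolist())
```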

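3.6.2. Row- or column-wise function application

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument. The apply() method also accepts a string method name.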

In [ ]:

df.apply(np.mean) , df.apply("mean")  # the two forms are equivalent

Out[ ]:

(one     -0.479175
 two      0.867518
 three    1.206819
 dtype: float64,
 one     -0.479175
 two      0.867518
 three    1.206819
 dtype: float64)

In [ ]:

df.apply(np.mean, axis=1), df.apply("mean", axis=1)  # the string form does not require importing numpy

Out[ ]:

(a    0.636975
 b    0.402759
 c    0.886703
 d    0.255335
 dtype: float64,
 a    0.636975
 b    0.402759
 c    0.886703
 d    0.255335
 dtype: float64)

In [ ]:

df.apply(lambda x: x.max() - x.min(), axis=1)  # axis=1 applies the function to each row

Out[ ]:

a    3.006290
b    2.334425
c    2.663356
d    0.265488
dtype: float64

In [ ]:

df.apply(np.cumsum)  # cumulative sum down each column: row a stays a, row b becomes a + b, and so on

Out[ ]:

	one	two	three
a	-0.866169	2.140120	NaN
b	-0.651794	1.469859	1.664163
c	-1.437525	3.347483	3.232379
d	NaN	3.470074	3.620458

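The return type of the function passed to apply() affects the type of the final output from DataFrame.apply. For the default behavior:

If the applied function returns a Series, the final output is a DataFrame. The columns match the index of the Series returned by the applied function.
If the applied function returns any other type, the final output is a Series.

This default behavior can be overridden using result_type, which accepts three options: reduce, broadcast, and expand. These determine how list-like return values expand (or do not expand) to a DataFrame.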
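The effect of result_type is easiest to see with a function that returns a list for each row. A small sketch on a made-up frame (names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 4], "b": [3, 2]})

# Returning a list per row: by default the result is a Series of lists.
as_series = frame.apply(lambda row: [row.min(), row.max()], axis=1)

# result_type="expand" turns each list into columns of a DataFrame.
as_frame = frame.apply(
    lambda row: [row.min(), row.max()], axis=1, result_type="expand"
)

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame)
```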

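apply() combined with some cleverness can be used to answer many questions about a data set. For example, suppose we wanted to extract the date where the maximum value for each column occurred: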

In [ ]:

tsdf = pd.DataFrame(
    np.random.randn(1000, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=1000),
)

tsdf.apply(lambda x:x.idxmax())

Out[ ]:

A   2001-11-27
B   2000-10-01
C   2000-06-11
dtype: datetime64[ns]

In [ ]:

tsdf.idxmax()  # calling idxmax() directly gives the same result

Out[ ]:

A   2001-11-27
B   2000-10-01
C   2000-06-11
dtype: datetime64[ns]

apply() can also pass additional positional and keyword arguments on to the function. For example:

In [ ]:

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

df.apply(subtract_and_divide, args=(5,), divide=3)

Out[ ]:

	one	two	three
a	-1.955390	-0.953293	NaN
b	-1.595208	-1.890087	-1.111946
c	-1.928577	-1.040792	-1.143928
d	NaN	-1.625803	-1.537307

Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:

In [ ]:

tsdf

Out[ ]:

	A	B	C
2000-01-01	-0.180591	-0.091470	0.384488
2000-01-02	-0.353529	0.757932	0.941753
2000-01-03	-2.421471	-1.542389	0.033038
2000-01-04	0.685086	-1.323862	-0.922339
2000-01-05	-1.262845	-0.152383	-1.338716
...	...	...	...
2002-09-22	-1.364892	1.389179	-1.270867
2002-09-23	-1.418890	-0.896929	1.723468
2002-09-24	-1.895373	-0.006156	1.228508
2002-09-25	-1.501526	0.848390	1.400977
2002-09-26	0.150822	1.144106	-0.259324

1000 rows × 3 columns

In [ ]:

tsdf.apply(pd.Series.interpolate)

Out[ ]:

	A	B	C
2000-01-01	-0.180591	-0.091470	0.384488
2000-01-02	-0.353529	0.757932	0.941753
2000-01-03	-2.421471	-1.542389	0.033038
2000-01-04	0.685086	-1.323862	-0.922339
2000-01-05	-1.262845	-0.152383	-1.338716
...	...	...	...
2002-09-22	-1.364892	1.389179	-1.270867
2002-09-23	-1.418890	-0.896929	1.723468
2002-09-24	-1.895373	-0.006156	1.228508
2002-09-25	-1.501526	0.848390	1.400977
2002-09-26	0.150822	1.144106	-0.259324

1000 rows × 3 columns

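Finally, apply() takes an argument raw, which is False by default; in that case each row or column is converted into a Series before the function is applied. When set to True, the passed function instead receives an ndarray object, which has positive performance implications if you do not need the indexing functionality.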
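A quick way to confirm what the function receives is to record the argument type. This is only a sketch; `col_range` and `seen_types` are names introduced here:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 5.0]})
seen_types = []

def col_range(values):
    # Record what apply() hands us, then compute max - min.
    seen_types.append(type(values))
    return values.max() - values.min()

ranges = frame.apply(col_range, raw=True)
print(ranges.tolist(), seen_types)
```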

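3.6.3. Aggregation API

The aggregation API allows one to express possibly multiple aggregation operations in a single concise way. This API is similar across pandas objects; see the groupby API, the window API, and the resample API. The entry point for aggregation is DataFrame.aggregate(), or the alias DataFrame.agg().

We will use a starting frame similar to the one above: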

In [ ]:

tsdf.iloc[3:7] = np.nan
tsdf

Out[ ]:

	A	B	C
2000-01-01	-0.180591	-0.091470	0.384488
2000-01-02	-0.353529	0.757932	0.941753
2000-01-03	-2.421471	-1.542389	0.033038
2000-01-04	NaN	NaN	NaN
2000-01-05	NaN	NaN	NaN
...	...	...	...
2002-09-22	-1.364892	1.389179	-1.270867
2002-09-23	-1.418890	-0.896929	1.723468
2002-09-24	-1.895373	-0.006156	1.228508
2002-09-25	-1.501526	0.848390	1.400977
2002-09-26	0.150822	1.144106	-0.259324

1000 rows × 3 columns

Using a single function is equivalent to apply(). You can also pass named methods as strings. These return a Series of the aggregated output:

In [ ]:

tsdf.agg(np.sum), tsdf.agg("sum"), tsdf.sum(), tsdf.apply(np.sum)
# With a single function, all of these calls give the same result.

Out[ ]:

(A   -38.507567
 B    26.302116
 C   -39.862818
 dtype: float64,
 A   -38.507567
 B    26.302116
 C   -39.862818
 dtype: float64,
 A   -38.507567
 B    26.302116
 C   -39.862818
 dtype: float64,
 A   -38.507567
 B    26.302116
 C   -39.862818
 dtype: float64)

A single aggregation on a Series returns a scalar value:

In [ ]:

tsdf["A"].agg("sum")

Out[ ]:

-38.50756741504713

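Aggregating with multiple functions

You can pass multiple aggregation arguments as a list. The result of each passed function becomes a row in the resulting DataFrame, naturally named after the aggregation function.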

In [ ]:

tsdf.agg(['sum','mean'])

Out[ ]:

	A	B	C
sum	-38.507567	26.302116	-39.862818
mean	-0.038662	0.026408	-0.040023

On a Series, multiple functions return a Series, indexed by the function names:

In [ ]:

tsdf["A"].agg(["sum","mean"])

Out[ ]:

sum    -38.507567
mean    -0.038662
Name: A, dtype: float64

Passing a lambda function yields a row named <lambda>:

In [ ]:

tsdf["A"].agg(["sum",lambda x:x.mean()])

Out[ ]:

sum        -38.507567
<lambda>    -0.038662
Name: A, dtype: float64

Passing a named function yields that name for the row:

In [ ]:

def mymean(x):
    return x.mean()
tsdf["A"].agg(["sum",mymean])

Out[ ]:

sum      -38.507567
mymean    -0.038662
Name: A, dtype: float64

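Aggregating with a dict

Passing a dictionary that maps column names to a scalar or a list of scalars to DataFrame.agg allows you to customize which functions are applied to which columns. Note that the results are not in any particular order; you can use an OrderedDict instead to guarantee ordering.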

In [ ]:

tsdf.agg({"A": "mean", "B": "sum"})

Out[ ]:

A    -0.038662
B    26.302116
dtype: float64

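Passing a list-like generates a DataFrame output. You get a matrix-like output of all of the aggregators. The output consists of all unique functions; those not requested for a particular column are NaN: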

In [ ]:

tsdf.agg({"A":  ["mean", "min"],  "B": "sum"})

Out[ ]:

	A	B
mean	-0.038662	NaN
min	-3.321160	NaN
sum	NaN	26.302116

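Mixed dtypes

Deprecated since version 1.4.0: attempting to determine which columns cannot be aggregated and silently dropping them from the results is deprecated and will be removed in a future version. If any portion of the columns or operations provided fails, the call to .agg will raise.

When presented with mixed dtypes that cannot be aggregated, .agg only takes the valid aggregations. This is similar to how .groupby.agg works.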

In [ ]:

mdf = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": [1.0, 2.0, 3.0],
        "C": ["foo", "bar", "baz"],
        "D": pd.date_range("20130101", periods=3),
    }
)

In [ ]:

mdf.agg(["min", "sum"])

/var/folders/wj/nc3k2r8x1l9blh3bp02_y01r0000gn/T/ipykernel_7506/56008705.py:1: FutureWarning: ['D'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
  mdf.agg(["min", "sum"])

Out[ ]:

	A	B	C	D
min	1	1.0	bar	2013-01-01
sum	6	6.0	foobarbaz	NaT

Custom describe

With .agg() it is possible to easily create a custom describe function, similar to the built-in describe function.

In [ ]:

from functools import partial
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = "25%"
q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = "75%"
tsdf.agg(["count", "mean", "std", "min", q_25, "median", q_75, "max"])

Out[ ]:

	A	B	C
count	996.000000	996.000000	996.000000
mean	-0.038662	0.026408	-0.040023
std	1.015146	0.989194	0.998424
min	-3.321160	-2.968763	-3.567000
25%	-0.769211	-0.630266	-0.705682
median	-0.022098	0.043487	-0.042489
75%	0.645934	0.687734	0.631629
max	2.945581	2.545140	2.804213

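3.6.4. Transform API

The transform() method returns an object that is indexed the same (same size) as the original. This API allows you to provide multiple operations at the same time rather than one by one. Its API is quite similar to the .agg API.

We create a frame similar to the one used in the previous section.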

In [ ]:

tsdf = pd.DataFrame(
    np.random.randn(10, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=10),
)
tsdf.iloc[3:7] = np.nan
tsdf

Out[ ]:

	A	B	C
2000-01-01	0.378065	1.221302	0.342178
2000-01-02	-0.683479	-2.681682	-0.183116
2000-01-03	-0.536174	1.271340	-1.080474
2000-01-04	NaN	NaN	NaN
2000-01-05	NaN	NaN	NaN
2000-01-06	NaN	NaN	NaN
2000-01-07	NaN	NaN	NaN
2000-01-08	0.654358	0.372487	-1.524165
2000-01-09	0.067186	-1.024088	-0.428864
2000-01-10	0.587397	-0.260108	-0.167146

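Transform with a single function

transform() accepts the following as input functions:

A NumPy function
A string function name
A user-defined function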

In [ ]:

# NumPy function
tsdf.transform(np.abs)

Out[ ]:

	A	B	C
2000-01-01	0.378065	1.221302	0.342178
2000-01-02	0.683479	2.681682	0.183116
2000-01-03	0.536174	1.271340	1.080474
2000-01-04	NaN	NaN	NaN
2000-01-05	NaN	NaN	NaN
2000-01-06	NaN	NaN	NaN
2000-01-07	NaN	NaN	NaN
2000-01-08	0.654358	0.372487	1.524165
2000-01-09	0.067186	1.024088	0.428864
2000-01-10	0.587397	0.260108	0.167146

In [ ]:

# string function name
tsdf.transform("abs")

Out[ ]:

	A	B	C
2000-01-01	0.378065	1.221302	0.342178
2000-01-02	0.683479	2.681682	0.183116
2000-01-03	0.536174	1.271340	1.080474
2000-01-04	NaN	NaN	NaN
2000-01-05	NaN	NaN	NaN
2000-01-06	NaN	NaN	NaN
2000-01-07	NaN	NaN	NaN
2000-01-08	0.654358	0.372487	1.524165
2000-01-09	0.067186	1.024088	0.428864
2000-01-10	0.587397	0.260108	0.167146

In [ ]:

# user-defined function
tsdf.transform(lambda x:x.abs())

Out[ ]:

	A	B	C
2000-01-01	0.378065	1.221302	0.342178
2000-01-02	0.683479	2.681682	0.183116
2000-01-03	0.536174	1.271340	1.080474
2000-01-04	NaN	NaN	NaN
2000-01-05	NaN	NaN	NaN
2000-01-06	NaN	NaN	NaN
2000-01-07	NaN	NaN	NaN
2000-01-08	0.654358	0.372487	1.524165
2000-01-09	0.067186	1.024088	0.428864
2000-01-10	0.587397	0.260108	0.167146

Here transform() received a single function; this is equivalent to a ufunc application. Passing a single function to transform() on a Series yields a single Series in return.

In [ ]:

np.abs(tsdf)

Out[ ]:

	A	B	C
2000-01-01	0.378065	1.221302	0.342178
2000-01-02	0.683479	2.681682	0.183116
2000-01-03	0.536174	1.271340	1.080474
2000-01-04	NaN	NaN	NaN
2000-01-05	NaN	NaN	NaN
2000-01-06	NaN	NaN	NaN
2000-01-07	NaN	NaN	NaN
2000-01-08	0.654358	0.372487	1.524165
2000-01-09	0.067186	1.024088	0.428864
2000-01-10	0.587397	0.260108	0.167146

In [ ]:

tsdf["A"].transform(np.abs)

Out[ ]:

2000-01-01    0.378065
2000-01-02    0.683479
2000-01-03    0.536174
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.654358
2000-01-09    0.067186
2000-01-10    0.587397
Freq: D, Name: A, dtype: float64

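Transform with multiple functions

Passing multiple functions yields a column MultiIndexed DataFrame. The first level is the original frame's column names; the second level is the names of the transforming functions.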

In [ ]:

tsdf.transform([np.abs, lambda x:x+1])

Out[ ]:

	A		B		C
	absolute	<lambda>	absolute	<lambda>	absolute	<lambda>
2000-01-01	0.378065	1.378065	1.221302	2.221302	0.342178	1.342178
2000-01-02	0.683479	0.316521	2.681682	-1.681682	0.183116	0.816884
2000-01-03	0.536174	0.463826	1.271340	2.271340	1.080474	-0.080474
2000-01-04	NaN	NaN	NaN	NaN	NaN	NaN
2000-01-05	NaN	NaN	NaN	NaN	NaN	NaN
2000-01-06	NaN	NaN	NaN	NaN	NaN	NaN
2000-01-07	NaN	NaN	NaN	NaN	NaN	NaN
2000-01-08	0.654358	1.654358	0.372487	1.372487	1.524165	-0.524165
2000-01-09	0.067186	1.067186	1.024088	-0.024088	0.428864	0.571136
2000-01-10	0.587397	1.587397	0.260108	0.739892	0.167146	0.832854

Passing multiple functions to a Series yields a DataFrame. The resulting column names are the transforming functions.

In [ ]:

tsdf["A"].transform([np.abs, lambda x: x + 1])

Out[ ]:

	absolute	<lambda>
2000-01-01	0.378065	1.378065
2000-01-02	0.683479	0.316521
2000-01-03	0.536174	0.463826
2000-01-04	NaN	NaN
2000-01-05	NaN	NaN
2000-01-06	NaN	NaN
2000-01-07	NaN	NaN
2000-01-08	0.654358	1.654358
2000-01-09	0.067186	1.067186
2000-01-10	0.587397	1.587397

Transform with a dict

Passing a dict of functions allows selective transforming per column.

In [ ]:

tsdf.transform({"A":np.abs,"B":lambda x:x+1})

Out[ ]:

	A	B
2000-01-01	0.378065	2.221302
2000-01-02	0.683479	-1.681682
2000-01-03	0.536174	2.271340
2000-01-04	NaN	NaN
2000-01-05	NaN	NaN
2000-01-06	NaN	NaN
2000-01-07	NaN	NaN
2000-01-08	0.654358	1.372487
2000-01-09	0.067186	-0.024088
2000-01-10	0.587397	0.739892

Passing a dict of lists generates a DataFrame with MultiIndex columns that carries these selective transformations.

In [ ]:

tsdf.transform({"A":np.abs,"B":[lambda x:x+1, "sqrt"]})

/Users/watalo/programs/pandas-note/venv/lib/python3.10/site-packages/pandas/core/arraylike.py:397: RuntimeWarning: invalid value encountered in sqrt
  result = getattr(ufunc, method)(*inputs, **kwargs)

Out[ ]:

	A	B
	absolute	<lambda>	sqrt
2000-01-01	0.378065	2.221302	1.105125
2000-01-02	0.683479	-1.681682	NaN
2000-01-03	0.536174	2.271340	1.127537
2000-01-04	NaN	NaN	NaN
2000-01-05	NaN	NaN	NaN
2000-01-06	NaN	NaN	NaN
2000-01-07	NaN	NaN	NaN
2000-01-08	0.654358	1.372487	0.610317
2000-01-09	0.067186	-0.024088	NaN
2000-01-10	0.587397	0.739892	NaN
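The MultiIndex columns produced by a dict of lists can be addressed with tuples. The following is a minimal sketch under stated assumptions: `frame` and `plus_one` are hypothetical stand-ins for `tsdf` and the lambda above, and `np.sqrt` is passed as a function rather than as the string "sqrt".

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for tsdf.
frame = pd.DataFrame({"A": [-1.0, 2.0, -3.0], "B": [4.0, 9.0, 16.0]})


def plus_one(x):
    """Named helper, so the second column level is readable."""
    return x + 1


# "A" gets one function, "B" gets a list of two functions.
result = frame.transform({"A": np.abs, "B": [plus_one, np.sqrt]})

# Each column is identified by an (original column, function name) pair.
print(result.columns.tolist())
print(result[("B", "sqrt")])
```

A named function is used instead of a lambda because every lambda is labeled `<lambda>`, which makes the second level harder to select.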

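3.6.5. Applying elementwise functions¶

Since not all functions can be vectorized (that is, accept a NumPy array and return another array or value), applymap() on DataFrame and map() on Series accept any Python function that takes a single value and returns a single value. Note that in pandas 2.1 and later, DataFrame.applymap() is deprecated in favor of DataFrame.map(). For example: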

In [ ]:

df4 = df.copy()

In [ ]:

def f(x):
    return len(str(x))
df4["one"].map(f)

Out[ ]:

a    19
b    18
c    19
d     3
Name: one, dtype: int64

In [ ]:

df4.applymap(f)

Out[ ]:

	one	two	three
a	19	18	3
b	18	19	18
c	19	18	18
d	3	19	18

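Series.map() has an additional feature: it can be used to conveniently "link" or "map" values defined by a secondary Series. This is closely related to the merging/joining functionality: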

In [ ]:

s = pd.Series(
    ["six", "seven", "six", "seven", "six"], index=["a", "b", "c", "d", "e"]
)
t = pd.Series({"six": 6.0, "seven": 7.0})
s , s.map(t)

Out[ ]:

(a      six
 b    seven
 c      six
 d    seven
 e      six
 dtype: object,
 a    6.0
 b    7.0
 c    6.0
 d    7.0
 e    6.0
 dtype: float64)
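As a small sketch of the mapping behavior, Series.map() also accepts a plain dict, and values without a match become NaN. The names `words`, `lookup` and `frame` below are made up for illustration. The `getattr` fallback is an assumption made so that the sketch runs on pandas versions both before and after DataFrame.map() was introduced.

```python
import pandas as pd

words = pd.Series(["six", "seven", "eight"], index=["a", "b", "c"])
lookup = {"six": 6.0, "seven": 7.0}  # "eight" is deliberately missing

# Values that have no entry in the mapping come back as NaN.
mapped = words.map(lookup)
print(mapped)

# Elementwise function on a whole DataFrame.
frame = pd.DataFrame({"one": [1.5, 22.25], "two": [333.0, 4.0]})
# Prefer DataFrame.map when it exists, otherwise fall back to applymap.
elementwise = getattr(frame, "map", None) or frame.applymap
lengths = elementwise(lambda v: len(str(v)))
print(lengths)
```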

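3.7. Reindexing and altering labels¶

reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features that rely on label alignment. To reindex means to conform the data to a given set of labels along a particular axis. This accomplishes several things:

Reorders the existing data to match a new set of labels
Inserts missing value (NA) markers at label locations where no data for that label existed
If specified, fills data for missing labels using logic (highly relevant when working with time series data)

Here is a simple example: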

In [ ]:

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s

Out[ ]:

a   -2.622641
b    1.024650
c    1.155658
d    0.466409
e   -0.022728
dtype: float64

In [ ]:

s.reindex(['e','f','c','g','a'])

Out[ ]:

e   -0.022728
f         NaN
c    1.155658
g         NaN
a   -2.622641
dtype: float64

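Here, the f and g labels were not contained in the Series and hence appear as NaN in the result.

With a DataFrame, you can reindex the index and the columns at the same time: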
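If NaN is not the desired placeholder, reindex() also takes a fill_value argument. A minimal sketch, with `s_demo` as a made-up Series:

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Default behavior: labels with no data become NaN.
reordered = s_demo.reindex(["c", "x", "a"])

# fill_value replaces the NaN marker for labels that had no data.
filled = s_demo.reindex(["c", "x", "a"], fill_value=0.0)

print(reordered)
print(filled)
```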

In [ ]:

df , df.reindex(index=['c','d','e'], columns=['three','one','two'])

Out[ ]:

(        one       two     three
 a  0.156364 -0.667597       NaN
 b  1.096653 -1.031939  1.124496
 c -1.349890 -2.063252 -1.053241
 d       NaN  0.550026 -1.294505,
       three      one       two
 c -1.053241 -1.34989 -2.063252
 d -1.294505      NaN  0.550026
 e       NaN      NaN       NaN)

You can also use reindex with the axis keyword:

In [ ]:

df.reindex(['b','d','a'], axis='index')

Out[ ]:

	one	two	three
b	1.096653	-1.031939	1.124496
d	NaN	0.550026	-1.294505
a	0.156364	-0.667597	NaN

Note that the Index objects containing the actual axis labels can be shared between objects. So if we have a Series and a DataFrame, the following can be done:

In [ ]:

rs = s.reindex(df.index)
rs

Out[ ]:

a   -2.622641
b    1.024650
c    1.155658
d    0.466409
dtype: float64

In [ ]:

rs.index is df.index

Out[ ]:

True

DataFrame.reindex() also supports an "axis-style" calling convention, where you specify a single labels argument and the axis it applies to.

In [ ]:

df.reindex(['three','one','two'], axis='columns')

Out[ ]:

	three	one	two
a	NaN	0.156364	-0.667597
b	1.124496	1.096653	-1.031939
c	-1.053241	-1.349890	-2.063252
d	-1.294505	NaN	0.550026

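When writing performance-sensitive code, there is a good reason to spend some time becoming a reindexing ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because reindex has been heavily optimized), but when CPU cycles matter, a few explicit reindex calls here and there can have an impact.

3.7.1. Reindexing to align with another object¶

You may wish to take an object and reindex its axes so that it is labeled the same as another object. While the syntax for this is straightforward, albeit verbose, it is a common enough operation that the reindex_like() method is available to make it simpler: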
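The implicit alignment step can be made explicit. The sketch below uses two made-up frames, `left` and `right`, and shows that reindexing both to the union of their indexes before adding gives the same result as adding them directly. It illustrates the equivalence only and makes no claim about timings.

```python
import pandas as pd

left = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=["b", "c", "d"])

# Implicit: pandas aligns both operands on the union of their labels.
implicit = left + right

# Explicit: align once up front, then operate on pre-aligned data.
union = left.index.union(right.index)
left_aligned = left.reindex(union)
right_aligned = right.reindex(union)
explicit = left_aligned + right_aligned

print(implicit)
print(implicit.equals(explicit))
```

In a loop that repeats many operations on the same pair of objects, aligning once outside the loop avoids repeating the alignment on every iteration.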

In [ ]:

df2

Out[ ]:

	one	two	three
a	0.156364	-0.667597	1.000000
b	1.096653	-1.031939	1.124496
c	-1.349890	-2.063252	-1.053241
d	NaN	0.550026	-1.294505

In [68]:

df3 = df2.reindex(index=['a','b'],columns=['one','two'])
df3

Out[68]:

	one	two
a	-0.819943	0.239292
b	1.857015	-0.202030

In [ ]:

df.reindex_like(df3)

Out[ ]:

	one	two
a	0.156364	-0.667597
b	1.096653	-1.031939
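As a rough sketch of what reindex_like() saves you from typing, the following compares it with the equivalent reindex() call. `base` and `template` are hypothetical frames; only the labels of `template` matter, not its values.

```python
import pandas as pd

base = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0], "three": [7.0, 8.0, 9.0]},
    index=["a", "b", "c"],
)
# The template only contributes its row and column labels.
template = pd.DataFrame({"two": [0.0, 0.0], "one": [0.0, 0.0]}, index=["b", "a"])

via_like = base.reindex_like(template)
via_reindex = base.reindex(index=template.index, columns=template.columns)

print(via_like)
print(via_like.equals(via_reindex))
```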

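3.7.2. Aligning objects with each other with align¶

The align() method is the fastest way to align two objects simultaneously. It supports a join argument (related to joining and merging):

join='outer': take the union of the indexes (default)
join='left': use the calling object's index
join='right': use the passed object's index
join='inner': intersect the indexes

It returns a tuple with both of the reindexed Series:

Alignment only touches the index; the element values stay unchanged, which is why the result is a tuple of two objects.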

In [ ]:

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

s1 = s[:4]

s2 = s[1:]

s1.align(s2)

Out[ ]:

(a   -2.178863
 b    0.921259
 c   -1.104563
 d    0.285100
 e         NaN
 dtype: float64,
 a         NaN
 b    0.921259
 c   -1.104563
 d    0.285100
 e   -0.532158
 dtype: float64)

In [ ]:

s1.align(s2,join='left')

Out[ ]:

(a   -2.178863
 b    0.921259
 c   -1.104563
 d    0.285100
 dtype: float64,
 a         NaN
 b    0.921259
 c   -1.104563
 d    0.285100
 dtype: float64)

In [ ]:

s1.align(s2,join='right')

Out[ ]:

(b    0.921259
 c   -1.104563
 d    0.285100
 e         NaN
 dtype: float64,
 b    0.921259
 c   -1.104563
 d    0.285100
 e   -0.532158
 dtype: float64)
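The examples above cover the outer, left and right joins. A minimal sketch of join='inner', using a made-up Series `s_full` with fixed values so the result is easy to check:

```python
import pandas as pd

s_full = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=list("abcde"))
first = s_full.iloc[:4]   # labels a..d
second = s_full.iloc[1:]  # labels b..e

# Inner join keeps only the labels present in both objects.
inner_first, inner_second = first.align(second, join="inner")

# Outer join (the default) keeps every label and pads with NaN.
outer_first, outer_second = first.align(second)

print(inner_first.index.tolist())
print(outer_first.index.tolist())
```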

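For DataFrames, the join method is applied to both the index and the columns by default:

You can also pass an axis option to align on the specified axis only: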

In [ ]:

df.align(df3), df.align(df3,join='inner',axis=1)

Out[ ]:

((        one     three       two
  a  0.156364       NaN -0.667597
  b  1.096653  1.124496 -1.031939
  c -1.349890 -1.053241 -2.063252
  d       NaN -1.294505  0.550026,
          one  three       two
  a  0.156364    NaN -0.667597
  b  1.096653    NaN -1.031939
  c       NaN    NaN       NaN
  d       NaN    NaN       NaN),
 (        one       two
  a  0.156364 -0.667597
  b  1.096653 -1.031939
  c -1.349890 -2.063252
  d       NaN  0.550026,
          one       two
  a  0.156364 -0.667597
  b  1.096653 -1.031939))

In [81]:

df.align(df3.iloc[0], axis=1)
# df, df3.iloc[0]

Out[81]:

(        one     three       two
 a -0.819943       NaN  0.239292
 b  1.857015  1.565334 -0.202030
 c  1.017643  0.374097 -1.108036
 d       NaN  0.615869 -1.791342,
 one     -0.819943
 three         NaN
 two      0.239292
 Name: a, dtype: float64)

If you pass a Series to DataFrame.align(), you can choose to align the two objects on either the DataFrame's index or its columns using the axis argument:

In [90]:

s = pd.Series([1,2],index=['aa','two'])
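# df.align(df3.iloc[0], axis=1)
# The axis argument names the DataFrame axis that the Series labels are matched against:
# axis=1 matches them against the columns, axis=0 against the row index.
# This is easy to misread; using a plain Series instead of df3.iloc[0] makes it clearer.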

df.align(s, axis=1) 

Out[90]:

(   aa       one     three       two
 a NaN -0.819943       NaN  0.239292
 b NaN  1.857015  1.565334 -0.202030
 c NaN  1.017643  0.374097 -1.108036
 d NaN       NaN  0.615869 -1.791342,
 aa       1.0
 one      NaN
 three    NaN
 two      2.0
 dtype: float64)
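To make the role of axis concrete, here is a small sketch that aligns one frame with two different Series, once per axis. All names (`frame`, `row_like`, `col_like`) are invented for the illustration.

```python
import pandas as pd

frame = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0]}, index=["a", "b"])

# Labels of this Series look like column names, so align on axis=1.
row_like = pd.Series([10.0, 20.0], index=["two", "extra"])
f_cols, s_cols = frame.align(row_like, axis=1)

# Labels of this Series look like row labels, so align on axis=0.
col_like = pd.Series([100.0, 200.0], index=["b", "c"])
f_rows, s_rows = frame.align(col_like, axis=0)

print(f_cols.columns.tolist(), s_cols.index.tolist())
print(f_rows.index.tolist(), s_rows.index.tolist())
```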

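3.7.3. Filling while reindexing¶

reindex() takes an optional parameter, method, which is a filling method chosen from the following table:

Method	Action
pad / ffill	Fill values forward
bfill / backfill	Fill values backward
nearest	Fill from the nearest index value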

In [93]:

rng = pd.date_range('1/3/2000', periods=8)
ts = pd.Series(np.random.randn(8), index=rng)
ts2 = ts.iloc[[0, 3, 6]]  # select by position explicitly with .iloc
ts2

Out[93]:

2000-01-03   -1.430392
2000-01-06   -0.632743
2000-01-09   -0.462662
Freq: 3D, dtype: float64

In [95]:

ts2.reindex(ts.index)

Out[95]:

2000-01-03   -1.430392
2000-01-04         NaN
2000-01-05         NaN
2000-01-06   -0.632743
2000-01-07         NaN
2000-01-08         NaN
2000-01-09   -0.462662
2000-01-10         NaN
Freq: D, dtype: float64

In [98]:

ts2.reindex(ts.index, method='ffill')

Out[98]:

2000-01-03   -1.430392
2000-01-04   -1.430392
2000-01-05   -1.430392
2000-01-06   -0.632743
2000-01-07   -0.632743
2000-01-08   -0.632743
2000-01-09   -0.462662
2000-01-10   -0.462662
Freq: D, dtype: float64

In [99]:

ts2.reindex(ts.index, method='bfill')

Out[99]:

2000-01-03   -1.430392
2000-01-04   -0.632743
2000-01-05   -0.632743
2000-01-06   -0.632743
2000-01-07   -0.462662
2000-01-08   -0.462662
2000-01-09   -0.462662
2000-01-10         NaN
Freq: D, dtype: float64

In [100]:

ts2.reindex(ts.index, method='nearest')

Out[100]:

2000-01-03   -1.430392
2000-01-04   -1.430392
2000-01-05   -0.632743
2000-01-06   -0.632743
2000-01-07   -0.632743
2000-01-08   -0.462662
2000-01-09   -0.462662
2000-01-10   -0.462662
Freq: D, dtype: float64

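These methods require that the index be ordered, either increasing or decreasing.

Note that the same result could have been achieved using fillna (except for method='nearest') or interpolate. reindex() raises a ValueError if the index is not monotonically increasing or decreasing, whereas fillna() and interpolate() do not perform any checks on the order of the index.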

In [101]:

ts2.reindex(ts.index).fillna(method='ffill')

Out[101]:

2000-01-03   -1.430392
2000-01-04   -1.430392
2000-01-05   -1.430392
2000-01-06   -0.632743
2000-01-07   -0.632743
2000-01-08   -0.632743
2000-01-09   -0.462662
2000-01-10   -0.462662
Freq: D, dtype: float64
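Both points can be checked in a short sketch. It uses made-up data (`days`, `sparse`, `shuffled`) and the ffill() method, which is assumed here as the equivalent of fillna(method='ffill') because newer pandas releases deprecate the method argument of fillna().

```python
import pandas as pd

days = pd.date_range("2000-01-03", periods=8)
sparse = pd.Series([1.0, 2.0, 3.0], index=days[[0, 3, 6]])

# Filling during reindex and filling afterwards give the same values.
via_reindex = sparse.reindex(days, method="ffill")
via_ffill = sparse.reindex(days).ffill()
print(via_reindex.equals(via_ffill))

# A non-monotonic index is rejected when a fill method is requested.
shuffled = sparse.iloc[[2, 0, 1]]
try:
    shuffled.reindex(days, method="ffill")
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```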

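3.7.4. Limits on filling while reindexing¶

The limit and tolerance arguments provide additional control over filling while reindexing.

limit specifies the maximum count of consecutive matches
tolerance specifies the maximum distance between the index values and the target values

Note that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance is coerced into a Timedelta if possible. This allows you to specify the tolerance with an appropriate string.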

In [102]:

ts2.reindex(ts.index, method='ffill',limit=1)

Out[102]:

2000-01-03   -1.430392
2000-01-04   -1.430392
2000-01-05         NaN
2000-01-06   -0.632743
2000-01-07   -0.632743
2000-01-08         NaN
2000-01-09   -0.462662
2000-01-10   -0.462662
Freq: D, dtype: float64

In [106]:

ts2.reindex(ts.index,method='ffill', tolerance='2 day')

Out[106]:

2000-01-03   -1.430392
2000-01-04   -1.430392
2000-01-05   -1.430392
2000-01-06   -0.632743
2000-01-07   -0.632743
2000-01-08   -0.632743
2000-01-09   -0.462662
2000-01-10   -0.462662
Freq: D, dtype: float64
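With observations three days apart, a tolerance of two days never excludes anything, so the output above matches a plain forward fill. The sketch below uses made-up data and a tighter tolerance of one day to show the argument taking effect.

```python
import pandas as pd

days = pd.date_range("2000-01-03", periods=8)
sparse = pd.Series([1.0, 2.0, 3.0], index=days[[0, 3, 6]])

# Only fill targets that lie within one day of the matched source label.
one_day = sparse.reindex(days, method="ffill", tolerance="1 day")

# For this evenly spaced data, limit=1 happens to give the same pattern.
limit_one = sparse.reindex(days, method="ffill", limit=1)

print(one_day)
print(one_day.equals(limit_one))
```

The two arguments coincide here only because the target index is daily. limit counts filled positions, while tolerance measures distance along the index, so they diverge on irregularly spaced targets.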

3.7.5. Dropping labels from an axis¶

A method closely related to reindex is the drop() function. It removes a set of labels from an axis:

In [107]:

df.drop(['a','b'], axis=0)

Out[107]:

	one	two	three
c	1.017643	-1.108036	0.374097
d	NaN	-1.791342	0.615869

In [109]:

df.drop(['one'],axis=1)

Out[109]:

	two	three
a	0.239292	NaN
b	-0.202030	1.565334
c	-1.108036	0.374097
d	-1.791342	0.615869

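The same result can be achieved with Index.difference() combined with reindex(), although it is less obvious.

In [116]:

df.reindex(df.index.difference(['a']))  # acts on the row index here; pass columns=df.columns.difference([...]) for columns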

Out[116]:

	one	two	three
b	1.857015	-0.202030	1.565334
c	1.017643	-1.108036	0.374097
d	NaN	-1.791342	0.615869
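The same idea applies to columns. A minimal sketch with a made-up `frame`; sort=False is passed to Index.difference() so that the remaining columns keep their original order, which is an assumption made here so that both approaches return identical frames.

```python
import pandas as pd

frame = pd.DataFrame(
    {"one": [1, 2], "two": [3, 4], "three": [5, 6]}, index=["a", "b"]
)

# Direct: drop the column by label.
dropped = frame.drop(columns=["one"])

# Indirect: reindex to every column label except the unwanted one.
remaining = frame.columns.difference(["one"], sort=False)
via_difference = frame.reindex(columns=remaining)

print(dropped.columns.tolist())
print(dropped.equals(via_difference))
```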

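3.7.6. Renaming / mapping labels¶

The rename() method lets you relabel an axis based on some mapping (a dict or a Series) or an arbitrary function.

Function: if you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values).
Dict: if a mapping does not include a column/index label, that label is not renamed; extra labels in the mapping do not raise an error.
axis parameter: the "axis-style" calling convention is supported.
inplace keyword: defaults to False, which copies the underlying data. Pass inplace=True to rename the data in place, modifying the original object.
Scalar or list-like: Series.rename() also accepts a scalar or a list-like, which changes the Series.name attribute.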

In [119]:

s = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])
s

Out[119]:

a    0.745168
b    0.510880
c   -0.676405
d   -0.967381
e   -0.742971
dtype: float64

In [120]:

s.rename(str.upper)

Out[120]:

A    0.745168
B    0.510880
C   -0.676405
D   -0.967381
E   -0.742971
dtype: float64

In [121]:

df.rename(
    columns={"one": "foo", "two": "bar"},
    index={"a": "apple", "b": "banana", "d": "durian"},
)

Out[121]:

	foo	bar	three
apple	-0.819943	0.239292	NaN
banana	1.857015	-0.202030	1.565334
c	1.017643	-1.108036	0.374097
durian	NaN	-1.791342	0.615869

In [122]:

df.rename({'one':"foo","two":"bar"}, axis=1)

Out[122]:

	foo	bar	three
a	-0.819943	0.239292	NaN
b	1.857015	-0.202030	1.565334
c	1.017643	-1.108036	0.374097
d	NaN	-1.791342	0.615869

In [124]:

s.rename('scalar-name')  # a scalar sets Series.name instead of relabeling the index

Out[124]:

a    0.745168
b    0.510880
c   -0.676405
d   -0.967381
e   -0.742971
Name: scalar-name, dtype: float64
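Two of the rules above are easy to verify in isolation: a dict may contain labels that do not exist, and a scalar changes only the name. A minimal sketch with a made-up Series `s_demo`:

```python
import pandas as pd

s_demo = pd.Series([1, 2, 3], index=["a", "b", "c"])

# "zzz" is not an index label; it is silently ignored.
partial = s_demo.rename({"a": "apple", "zzz": "unused"})

# A scalar sets Series.name and leaves the index labels alone.
named = s_demo.rename("scores")

print(partial.index.tolist())
print(named.name, named.index.tolist())
# The original object is untouched because inplace defaults to False.
print(s_demo.name, s_demo.index.tolist())
```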

In [125]:

df = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5, 6], "y": [10, 20, 30, 40, 50, 60]},
    index=pd.MultiIndex.from_product(
        [["a", "b", "c"], [1, 2]], names=["let", "num"]
    ),
)

In [126]:

df

Out[126]:

		x	y
let	num
a	1	1	10
a	2	2	20
b	1	3	30
b	2	4	40
c	1	5	50
c	2	6	60

rename_axis(): changes the names of the levels of a MultiIndex. Note that these are the level names, not the row labels.

In [127]:

df.rename_axis(index={'let':'abc'})

Out[127]:

		x	y
abc	num
a	1	1	10
a	2	2	20
b	1	3	30
b	2	4	40
c	1	5	50
c	2	6	60

In [129]:

df.rename_axis(index=str.upper)

Out[129]:

		x	y
LET	NUM
a	1	1	10
a	2	2	20
b	1	3	30
b	2	4	40
c	1	5	50
c	2	6	60
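The difference between renaming level names and renaming labels can be sketched side by side. `frame` below is a smaller, made-up version of the MultiIndex frame above.

```python
import pandas as pd

frame = pd.DataFrame(
    {"x": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["let", "num"]),
)

# rename_axis changes the level names; the labels stay the same.
renamed_levels = frame.rename_axis(index={"let": "letter"})

# rename changes the labels; the level names stay the same.
renamed_labels = frame.rename(index={"a": "A"})

print(list(renamed_levels.index.names))
print(renamed_labels.index.get_level_values("let").tolist())
```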

王大桃zzZ

Too lazy to cook snake for dinner, so I am off to learn Python instead

[pandas] User Guide: 3. Essential basic functionality
