Pandas 2.2 Chinese Documentation (Part 57)

Original: pandas.pydata.org/docs/

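Version 0.16.2 (June 12, 2015)

Original: pandas.pydata.org/docs/whatsnew/v0.16.2.html

This is a minor bug-fix release from 0.16.1. It includes a large number of bug fixes along with some new features (the pipe() method), enhancements, and performance improvements.

We recommend that all users upgrade to this version.

Highlights include:

A new pipe method
Documentation on how to use numba with pandas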

What's new in v0.16.2

New features
- Pipe
- Other enhancements
API changes
Performance improvements
Bug fixes
Contributors

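New features

Pipe

We've introduced a new method, DataFrame.pipe(). As the name suggests, pipe should be used to pipe data through a chain of function calls. The goal is to avoid confusing nested function calls like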

# df is a DataFrame
# f, g, and h are functions that take and return DataFrames
f(g(h(df), arg1=1), arg2=2, arg3=3)  # noqa F821

The logic flows from the inside out, and function names are separated from their keyword arguments. This can be rewritten as

(
    df.pipe(h)  # noqa F821
    .pipe(g, arg1=1)  # noqa F821
    .pipe(f, arg2=2, arg3=3)  # noqa F821
)

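Now both the code and the logic flow from top to bottom. Keyword arguments sit next to their functions. Overall, the code is much more readable.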
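The equivalence of the two styles can be checked with a small, self-contained sketch. The functions drop_missing, scale and add_offset are hypothetical stand-ins for h, g and f; they are not part of pandas.

```python
import pandas as pd


# Hypothetical stand-ins for h, g and f: each takes a DataFrame as its
# first argument and returns a new DataFrame.
def drop_missing(df):
    return df.dropna()


def scale(df, factor=1):
    return df * factor


def add_offset(df, offset=0, column="x"):
    out = df.copy()
    out[column] = out[column] + offset
    return out


df = pd.DataFrame({"x": [1.0, None, 3.0]})

# Nested form: reads from the inside out.
nested = add_offset(scale(drop_missing(df), factor=2), offset=1, column="x")

# Piped form: reads from top to bottom.
piped = (
    df.pipe(drop_missing)
    .pipe(scale, factor=2)
    .pipe(add_offset, offset=1, column="x")
)

print(piped.equals(nested))
```

Both forms should produce the same DataFrame; only the reading order differs.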

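In the example above, the functions f, g, and h each expect the DataFrame as the first positional argument. When the function you wish to apply takes its data anywhere other than the first argument, pass a (function, keyword) tuple indicating where the DataFrame should flow. For example: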

In [1]: import statsmodels.formula.api as sm

In [2]: bb = pd.read_csv("data/baseball.csv", index_col="id")

# sm.ols takes (formula, data)
In [3]: (
...:     bb.query("h > 0")
...:     .assign(ln_h=lambda df: np.log(df.h))
...:     .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
...:     .fit()
...:     .summary()
...: )
...:
Out[3]:
<class 'statsmodels.iolib.summary.Summary'>
"""
 OLS Regression Results
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Tue, 22 Nov 2022   Prob (F-statistic):           3.48e-15
Time:                        05:35:23   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
 coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
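The same (function, keyword) mechanism can be tried without statsmodels installed. The sketch below assumes a hypothetical helper, column_mean, that takes its data as the second parameter, much as sm.ols does.

```python
import pandas as pd


# Hypothetical function that, like sm.ols, expects the data second.
def column_mean(column, data):
    return data[column].mean()


df = pd.DataFrame({"h": [10, 20, 30]})

# The tuple routes the DataFrame to the `data` keyword; the remaining
# positional argument ("h") fills `column`.
result = df.query("h > 10").pipe((column_mean, "data"), "h")

print(result)
```

Here the call is equivalent to column_mean("h", data=filtered_frame), where filtered_frame is the output of query.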

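The pipe method is inspired by Unix pipes, which stream text through processes. More recently, dplyr and magrittr introduced the popular (%>%) pipe operator for R.

See the documentation for more. (GH 10129)

Other enhancements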

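Added rsplit to Index/Series StringMethods (GH 10303)
Removed the hard-coded size limits on the DataFrame HTML representation in the IPython notebook, leaving this to IPython itself (only for IPython v3.0 or greater). This eliminates the duplicate scroll bars that appeared in the notebook with large frames (GH 10231).

Note that the notebook has a toggle output scrolling feature to limit the display of very large frames (click to the left of the output). You can also configure how DataFrames are displayed using the pandas options.
The axis parameter of DataFrame.quantile now also accepts index and column. (GH 9543)

API changes

Holiday now raises NotImplementedError if both offset and observance are used in the constructor, instead of returning an incorrect result (GH 10217).

Performance improvements

Improved Series.resample performance with dtype=datetime64[ns] (GH 7754)
Increased performance of str.split when expand=True (GH 10081)

Bug fixes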
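Bug in Series.hist that raised an error when a one-row Series was given (GH 10214)
Bug where HDFStore.select modified the passed columns list (GH 7212)
Bug in Categorical repr with display.width of None in Python 3 (GH 10087)
Bug in to_json where certain orients combined with a CategoricalIndex would segfault (GH 10317)
Bug where some of the nan functions did not have consistent return dtypes (GH 10251)
Bug in DataFrame.quantile when checking that a valid axis was passed (GH 9543)
Bug in groupby.apply aggregation for Categorical not preserving categories (GH 10138)
Bug in to_csv where date_format was ignored if the datetime was fractional (GH 10209)
Bug in DataFrame.to_json with mixed data types (GH 10289)
Bug in cache updating when consolidating (GH 10264)
Bug in mean() where integer dtypes could overflow (GH 10172)
Bug where Panel.from_dict did not set dtype when specified (GH 10058)
Bug in Index.union that raised AttributeError when passing array-likes (GH 10149)
Bug where Timestamp's microsecond, quarter, dayofyear, week and daysinmonth properties returned np.int type rather than the built-in int (GH 10050)
Bug where NaT raised AttributeError when accessing the daysinmonth and dayofweek properties (GH 10096)
Bug in Index repr when using the max_seq_items=None setting (GH 10182)
Bug in getting timezone data with dateutil on various platforms (GH 9059, GH 8639, GH 9663, GH 10121)
Bug in displaying datetimes with mixed frequencies; 'ms' datetimes are now displayed to the proper precision (GH 10170)
Bug in setitem where type promotion was applied to the entire block (GH 10280)
Bug where Series arithmetic methods could incorrectly hold names (GH 10068)
Bug in GroupBy.get_group when grouping on multiple keys, one of which is categorical (GH 10132)
Bug where DatetimeIndex and TimedeltaIndex names were lost after timedelta arithmetic (GH 9926)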
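Bug in DataFrame construction from a nested dict with datetime64 (GH 10160)
Bug in Series construction from a dict with datetime64 keys (GH 9456)
Bug in Series.plot(label="LABEL") not correctly setting the label (GH 10119)
Bug in plot not defaulting to the matplotlib axes.grid setting (GH 9792)
Bug in the read_csv parser with engine='python' causing strings that contain an exponent but no decimal point to be parsed as int instead of float (GH 9565)
Bug where Series.align reset name when fill_value was specified (GH 10067)
Bug in read_csv causing the index name not to be set on an empty DataFrame (GH 10184)
Bug where SparseSeries.abs reset name (GH 10241)
Bug where TimedeltaIndex slicing could reset freq (GH 10292)
Bug where GroupBy.get_group raised ValueError when the group key contained NaT (GH 6992)
Bug where the SparseSeries constructor ignored the input data name (GH 10258)
Bug in Categorical.remove_categories causing a ValueError when removing the NaN category if the underlying dtype is floating-point (GH 10156)
Bug where infer_freq inferred a time rule (WOM-5XXX) unsupported by to_offset (GH 9425)
Bug in DataFrame.to_hdf() where the table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (GH 9057)
Bug in handling the masking of an empty DataFrame (GH 10126)
Bug where the MySQL interface could not handle numeric table/column names (GH 10255)
Bug in read_csv with a date_parser that returned a datetime64 array of a time resolution other than [ns] (GH 10245)
Bug in Panel.apply when the result has ndim=0 (GH 10332)
Bug in read_hdf where auto_close could not be passed (GH 9327)
Bug in read_hdf where open stores could not be used (GH 10330)
Bug in adding empty DataFrames; the result is now a DataFrame that .equals an empty DataFrame (GH 10181)
Bug in to_hdf and HDFStore which did not check that complib choices were valid (GH 4582, GH 8874)

Contributors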

A total of 34 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.

Andrew Rosenfeld
Artemy Kolchinsky
Bernard Willers +
Christer van der Meeren
Christian Hudon +
Constantine Glen Evans +
Daniel Julius Lasiman +
Evan Wright
Francesco Brundu +
Gaëtan de Menten +
Jake VanderPlas
James Hiebert +
Jeff Reback
Joris Van den Bossche
Justin Lecher +
Ka Wo Chen +
Kevin Sheppard
Mortada Mehyar
Morton Fox +
Robin Wilson +
Sinhrks
Stephan Hoyer
Thomas Grainger
Tom Ajamian
Tom Augspurger
Yoshiki Vázquez Baeza
Younggun Kim
austinc +
behzad nouri
jreback
lexual
rekcahpassyla +
scls19fr
sinhrks

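Version 0.16.1 (May 11, 2015)

Original: pandas.pydata.org/docs/whatsnew/v0.16.1.html

This is a minor bug-fix release from 0.16.0. It includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

Support for CategoricalIndex, a category-based index
A new section on how to contribute to pandas
Revised "Merge, join, and concatenate" documentation, including graphical examples that make each operation easier to understand
A new method, sample, for drawing random samples from Series, DataFrames and Panels
The default Index printing has changed to a more uniform format
BusinessHour date offsets are now supported
Further enhancements to the .str accessor that make string operations easier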

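What's new in v0.16.1

Enhancements
- CategoricalIndex
- Sample
- String methods enhancements
- Other enhancements
API changes
- Deprecations
Index representation
Performance improvements
Bug fixes
Contributors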

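Warning

In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable package (GH 8961).

Enhancements

CategoricalIndex

We introduce CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. It is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series to a column with a category dtype would convert it to a regular object-based Index.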

In [1]: df = pd.DataFrame({'A': np.arange(6),
 ...:                   'B': pd.Series(list('aabbca'))
 ...:                          .astype('category', categories=list('cab'))
 ...:                   })
 ...:

In [2]: df
Out[2]:
 A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [3]: df.dtypes
Out[3]:
A       int64
B    category
dtype: object

In [4]: df.B.cat.categories
Out[4]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex

In [5]: df2 = df.set_index('B')

In [6]: df2.index
Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers must be in the categories, or the operation will raise.

In [7]: df2.loc['a']
Out[7]:
 A
B
a  0
a  1
a  5

and preserves the CategoricalIndex

In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Sorting will order by the order of the categories

In [9]: df2.sort_index()
Out[9]:
 A
B
c  4
a  0
a  1
a  5
b  2
b  3

groupby operations on the index will preserve the index nature as well

In [10]: df2.groupby(level=0).sum()
Out[10]:
 A
B
c  4
a  6
b  5

In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

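Reindexing operations return a resulting index based on the type of the passed indexer: passing a list returns a plain Index, while indexing with a Categorical returns a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to index these arbitrarily, even with values not in the categories, similarly to how you can reindex any pandas index.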

In [12]: df2.reindex(['a', 'e'])
Out[12]:
 A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [13]: df2.reindex(['a', 'e']).index
Out[13]: pd.Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

In [14]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[14]:
 A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [15]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[15]: pd.CategoricalIndex(['a', 'a', 'a', 'e'],
 categories=['a', 'b', 'c', 'd', 'e'],
 ordered=False, name='B',
 dtype='category')
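The session above uses the 0.16-era spelling astype('category', categories=...). As a minimal sketch, assuming a recent pandas release in which pd.CategoricalDtype is available, the same frame can be built and indexed as follows (the variable name cat_dtype is an illustration only):

```python
import numpy as np
import pandas as pd

# Assumption: pd.CategoricalDtype carries the category order ('c', 'a', 'b')
# in place of the older astype('category', categories=...) call.
cat_dtype = pd.CategoricalDtype(categories=list("cab"))
df = pd.DataFrame(
    {"A": np.arange(6), "B": pd.Series(list("aabbca")).astype(cat_dtype)}
)

# Setting the categorical column as the index creates a CategoricalIndex.
df2 = df.set_index("B")

print(type(df2.index).__name__)
print(df2.loc["a", "A"].tolist())        # rows labelled 'a'
print(df2.sort_index().index.tolist())   # ordered by category order
```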

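See the documentation for more. (GH 7629, GH 10038, GH 10039)

### Sample

Series, DataFrames, and Panels now have a new method: sample(). The method accepts a specific number of rows or columns to return, or a fraction of the total number of rows or columns. It also has options for sampling with or without replacement, for passing in a column of weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH 2419)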

In [1]: example_series = pd.Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1
In [2]: example_series.sample()
Out[2]: 
3    3
Length: 1, dtype: int64

# One may specify either a number of rows:
In [3]: example_series.sample(n=3)
Out[3]: 
2    2
1    1
0    0
Length: 3, dtype: int64

# Or a fraction of the rows:
In [4]: example_series.sample(frac=0.5)
Out[4]: 
1    1
5    5
3    3
Length: 3, dtype: int64

# weights are accepted.
In [5]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [6]: example_series.sample(n=3, weights=example_weights)
Out[6]: 
2    2
4    4
3    3
Length: 3, dtype: int64

# weights will also be normalized if they do not sum to one,
# and missing values will be treated as zeros.
In [7]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]

In [8]: example_series.sample(n=1, weights=example_weights2)
Out[8]: 
0    0
Length: 1, dtype: int64

When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.

In [9]: df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})

In [10]: df.sample(n=3, weights="weight_column")
Out[10]: 
 col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1

[3 rows x 2 columns] 
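The seed option mentioned above is not shown in the session. A minimal sketch, assuming the random_state keyword of sample() and an arbitrary seed value of 42, illustrates how a weighted draw can be made repeatable:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])
weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

# The same seed should give the same draw on every call.
first = s.sample(n=3, weights=weights, random_state=42)
second = s.sample(n=3, weights=weights, random_state=42)

print(first.equals(second))
# Entries with zero weight (labels 0 and 1) can never be selected.
print(set(first.index) <= {2, 3, 4, 5})
```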
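### String methods enhancements

Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard Python string operations.

+   Added `StringMethods` (the `.str` accessor) to `Index` ([GH 9068](https://github.com/pandas-dev/pandas/issues/9068))

    The `.str` accessor is now available for both `Series` and `Index`.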

    ```py
    In [11]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

    In [12]: idx.str.strip()
    Out[12]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object') 
    ```

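    One special case for the `.str` accessor on `Index` is that if a string method returns `bool`, the `.str` accessor will return a `np.array` instead of a boolean `Index` ([GH 8875](https://github.com/pandas-dev/pandas/issues/8875)). This enables the following expression to work naturally: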

    ```py
    In [13]: idx = pd.Index(["a1", "a2", "b1", "b2"])

    In [14]: s = pd.Series(range(4), index=idx)

    In [15]: s
    Out[15]: 
    a1    0
    a2    1
    b1    2
    b2    3
    Length: 4, dtype: int64

    In [16]: idx.str.startswith("a")
    Out[16]: array([ True,  True, False, False])

    In [17]: s[s.index.str.startswith("a")]
    Out[17]: 
    a1    0
    a2    1
    Length: 2, dtype: int64 
    ```

+   The following new methods are accessible via the `.str` accessor to apply the function to each value. ([GH 9766](https://github.com/pandas-dev/pandas/issues/9766), [GH 9773](https://github.com/pandas-dev/pandas/issues/9773), [GH 10031](https://github.com/pandas-dev/pandas/issues/10031), [GH 10045](https://github.com/pandas-dev/pandas/issues/10045), [GH 10052](https://github.com/pandas-dev/pandas/issues/10052))

    |  |  | Methods |  |  |
    | --- | --- | --- | --- | --- |
    | `capitalize()` | `swapcase()` | `normalize()` | `partition()` | `rpartition()` |
    | `index()` | `rindex()` | `translate()` |  |  |

+   `split` now takes an `expand` keyword to specify whether to expand dimensionality. `return_type` is deprecated. ([GH 9847](https://github.com/pandas-dev/pandas/issues/9847))

    ```py
    In [18]: s = pd.Series(["a,b", "a,c", "b,c"])

    # return Series
    In [19]: s.str.split(",")
    Out[19]: 
    0    [a, b]
    1    [a, c]
    2    [b, c]
    Length: 3, dtype: object

    # return DataFrame
    In [20]: s.str.split(",", expand=True)
    Out[20]: 
     0  1
    0  a  b
    1  a  c
    2  b  c

    [3 rows x 2 columns]

    In [21]: idx = pd.Index(["a,b", "a,c", "b,c"])

    # return Index
    In [22]: idx.str.split(",")
    Out[22]: Index([['a', 'b'], ['a', 'c'], ['b', 'c']], dtype='object')

    # return MultiIndex
    In [23]: idx.str.split(",", expand=True)
    Out[23]: 
    MultiIndex([('a', 'b'),
     ('a', 'c'),
     ('b', 'c')],
     ) 
    ```

+   Improved the `extract` and `get_dummies` methods for `Index.str` ([GH 9980](https://github.com/pandas-dev/pandas/issues/9980))

### Other enhancements

+   The `BusinessHour` offset is now supported. By default it represents business hours from 09:00 to 17:00 on a `BusinessDay`. See the documentation for details. ([GH 7905](https://github.com/pandas-dev/pandas/issues/7905))

    ```py
    In [24]: pd.Timestamp("2014-08-01 09:00") + pd.tseries.offsets.BusinessHour()
    Out[24]: Timestamp('2014-08-01 10:00:00')

    In [25]: pd.Timestamp("2014-08-01 07:00") + pd.tseries.offsets.BusinessHour()
    Out[25]: Timestamp('2014-08-01 10:00:00')

    In [26]: pd.Timestamp("2014-08-01 16:30") + pd.tseries.offsets.BusinessHour()
    Out[26]: Timestamp('2014-08-04 09:30:00') 
    ```

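+   `DataFrame.diff` now takes an `axis` parameter that determines the direction of differencing. ([GH 9727](https://github.com/pandas-dev/pandas/issues/9727))

+   Allow `clip`, `clip_lower`, and `clip_upper` to accept array-like arguments as thresholds (this was a regression from 0.11.0). These methods now have an `axis` parameter that determines how the Series or DataFrame is aligned with the threshold(s). ([GH 6966](https://github.com/pandas-dev/pandas/issues/6966))

+   `DataFrame.mask()` and `Series.mask()` now support the same keywords as `where`. ([GH 8801](https://github.com/pandas-dev/pandas/issues/8801))

+   The `drop` function can now accept an `errors` keyword to suppress the `ValueError` raised when any of the labels does not exist in the target data. ([GH 6736](https://github.com/pandas-dev/pandas/issues/6736))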

    ```py
    In [27]: df = pd.DataFrame(np.random.randn(3, 3), columns=["A", "B", "C"])

    In [28]: df.drop(["A", "X"], axis=1, errors="ignore")
    Out[28]: 
     B         C
    0 -0.706771 -1.039575
    1 -0.424972  0.567020
    2 -1.087401 -0.673690

    [3 rows x 2 columns] 
    ```

+   Added support for separating years and quarters using dashes, for example 2014-Q1. ([GH 9688](https://github.com/pandas-dev/pandas/issues/9688))

+   Allow conversion of values with dtype `datetime64` or `timedelta64` to strings using `astype(str)`. ([GH 9757](https://github.com/pandas-dev/pandas/issues/9757))

+   The `get_dummies` function now accepts a `sparse` keyword. If set to `True`, the returned `DataFrame` is sparse, e.g. a `SparseDataFrame`. ([GH 8823](https://github.com/pandas-dev/pandas/issues/8823))

+   `Period` now accepts `datetime64` as value input. ([GH 9054](https://github.com/pandas-dev/pandas/issues/9054))

+   Allow timedelta string conversion when the leading zero is missing from the time definition, i.e. `0:00:00` vs `00:00:00`. ([GH 9570](https://github.com/pandas-dev/pandas/issues/9570))

+   Allow `Panel.shift` with `axis='items'`. ([GH 9890](https://github.com/pandas-dev/pandas/issues/9890))

+   Trying to write an Excel file now raises `NotImplementedError` if the `DataFrame` has a `MultiIndex`, instead of writing a broken Excel file. ([GH 9794](https://github.com/pandas-dev/pandas/issues/9794))

+   Allow `Categorical.add_categories` to accept a `Series` or `np.array`. ([GH 9927](https://github.com/pandas-dev/pandas/issues/9927))

+   Add/delete the `str/dt/cat` accessors dynamically from `__dir__`. ([GH 9910](https://github.com/pandas-dev/pandas/issues/9910))

+   Added `normalize` as a `dt` accessor method. ([GH 10047](https://github.com/pandas-dev/pandas/issues/10047))

+   `DataFrame` and `Series` now have a `_constructor_expanddim` property as an overridable constructor for data of one higher dimensionality. This should be used only when it is really needed.

+   `pd.lib.infer_dtype` now returns `'bytes'` in Python 3 where appropriate. ([GH 10032](https://github.com/pandas-dev/pandas/issues/10032))

## API changes

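+   When passing an `ax` to `df.plot( ..., ax=ax)`, the `sharex` keyword argument now defaults to `False`. As a result, the visibility of xlabels and xticklabels is no longer changed. You have to set it yourself for the right axes in your figure, or set `sharex=True` explicitly (but this changes the visibility for all axes in the figure, not only the one that was passed in!). If pandas creates the subplots itself (e.g. no `ax` keyword argument is passed), the default is still `sharex=True` and the visibility changes are applied.

+   `assign()` now inserts new columns in alphabetical order. Previously the order was arbitrary. ([GH 9777](https://github.com/pandas-dev/pandas/issues/9777))

+   By default, `read_csv` and `read_table` now try to infer the compression type from the file extension. Set `compression=None` to restore the previous behavior (no decompression). ([GH 9770](https://github.com/pandas-dev/pandas/issues/9770))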
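The compression inference described in the last item can be sketched with a temporary gzip file. The file name example.csv.gz is an assumption for illustration, and the sketch relies on a pandas release in which `to_csv` accepts `compression="gzip"`.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the ".gz" suffix is what read_csv looks at.
    path = os.path.join(tmp, "example.csv.gz")
    df.to_csv(path, index=False, compression="gzip")

    inferred = pd.read_csv(path)                      # compression inferred from ".gz"
    explicit = pd.read_csv(path, compression="gzip")  # compression stated explicitly

print(inferred.equals(explicit))
```

Both reads should return the same frame, which suggests that the default behaves like an explicit `compression="gzip"` for this extension.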

### Deprecations

+   The `return_type` keyword of `Series.str.split` was removed in favor of `expand`. ([GH 9847](https://github.com/pandas-dev/pandas/issues/9847))

## Index representation

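The string representation of `Index` and its sub-classes has now been unified. It shows a single-line display if there are few values; a wrapped multi-line display for many values (but fewer than `display.max_seq_items`); and a truncated display (the head and tail of the data) if there are more items than `display.max_seq_items`. The formatting for `MultiIndex` is unchanged (a multi-line wrapped display). The display width responds to the option `display.max_seq_items`, which defaults to 100. ([GH 6482](https://github.com/pandas-dev/pandas/issues/6482))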
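The effect of `display.max_seq_items` can be observed on a small scale. This is a minimal sketch that lowers the option to 10 with `pd.option_context` (the threshold and the index sizes are arbitrary choices for illustration); the exact repr text depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

short = pd.Index(np.arange(4), name="foo")
long = pd.Index(np.arange(50), name="foo")

# Temporarily lower the threshold so that 50 items counts as "many".
with pd.option_context("display.max_seq_items", 10):
    short_repr = repr(short)
    long_repr = repr(long)

print(short_repr)  # all values shown
print(long_repr)   # head and tail only, with an ellipsis and a length attribute
```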

Previous behavior

```py
In [2]: pd.Index(range(4), name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')

In [3]: pd.Index(range(104), name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [4]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern

In [5]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern

```

New behavior

```py

In [29]: pd.set_option("display.width", 80)

In [30]: pd.Index(range(4), name="foo")
Out[30]: RangeIndex(start=0, stop=4, step=1, name='foo')

In [31]: pd.Index(range(30), name="foo")
Out[31]: RangeIndex(start=0, stop=30, step=1, name='foo')

In [32]: pd.Index(range(104), name="foo")
Out[32]: RangeIndex(start=0, stop=104, step=1, name='foo')

In [33]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"], ordered=True, name="foobar")
Out[33]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')

In [34]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 10, ordered=True, name="foobar")
Out[34]: 
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb',
 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd',
 'a', 'bb', 'ccc', 'dddd'],
 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')

In [35]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 100, ordered=True, name="foobar")
Out[35]: 
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
 'bb',
 ...
 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
 'dddd'],
 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar', length=400)

In [36]: pd.date_range("20130101", periods=4, name="foo", tz="US/Eastern")
Out[36]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
 '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
 dtype='datetime64[ns, US/Eastern]', name='foo', freq='D')

In [37]: pd.date_range("20130101", periods=25, freq="D")
Out[37]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
 '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
 '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
 '2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
 '2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20',
 '2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24',
 '2013-01-25'],
 dtype='datetime64[ns]', freq='D')

In [38]: pd.date_range("20130101", periods=104, name="foo", tz="US/Eastern")
Out[38]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
 '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
 '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
 '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
 '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
 ...
 '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
 '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
 '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
 '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
 '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
 dtype='datetime64[ns, US/Eastern]', name='foo', length=104, freq='D') 
```

## Performance improvements

+   Improved CSV write performance with mixed dtypes, including datetimes, by up to 5x. ([GH 9940](https://github.com/pandas-dev/pandas/issues/9940))

+   Improved CSV write performance generally by 2x. ([GH 9940](https://github.com/pandas-dev/pandas/issues/9940))

+   Improved the performance of `pd.lib.max_len_string_array` by 5-7x. ([GH 10024](https://github.com/pandas-dev/pandas/issues/10024))

## Bug fixes

+   Bug where labels did not appear properly in the legend of `DataFrame.plot()`. Passing `label=` arguments now works, and Series indices are no longer mutated. ([GH 9542](https://github.com/pandas-dev/pandas/issues/9542))

+   Bug in JSON serialization causing a segfault when a frame had zero length. ([GH 9805](https://github.com/pandas-dev/pandas/issues/9805))

+   Bug in `read_csv` where missing trailing delimiters would cause a segfault. ([GH 5664](https://github.com/pandas-dev/pandas/issues/5664))

+   Bug in retaining the index name on appending. ([GH 9862](https://github.com/pandas-dev/pandas/issues/9862))

+   Bug in `scatter_matrix` drawing unexpected axis tick labels. ([GH 5662](https://github.com/pandas-dev/pandas/issues/5662))

+   Fixed a bug in `StataWriter` that changed the input `DataFrame` upon save. ([GH 9795](https://github.com/pandas-dev/pandas/issues/9795))

+   Bug in `transform` causing a length mismatch when null entries were present and a fast aggregator was being used. ([GH 9697](https://github.com/pandas-dev/pandas/issues/9697))

+   Bug in `equals` causing false negatives when the block order differed. ([GH 9330](https://github.com/pandas-dev/pandas/issues/9330))

+   Bug in grouping with multiple `pd.Grouper` objects where one is non-time based. ([GH 10063](https://github.com/pandas-dev/pandas/issues/10063))

+   Bug in `read_sql_table` raising an error when reading a postgres table with a timezone. ([GH 7139](https://github.com/pandas-dev/pandas/issues/7139))

+   Bug where `DataFrame` slicing might not retain metadata. ([GH 9776](https://github.com/pandas-dev/pandas/issues/9776))

+   Bug where a `TimedeltaIndex` was not properly serialized in a fixed `HDFStore`. ([GH 9635](https://github.com/pandas-dev/pandas/issues/9635))

+   Bug where the `TimedeltaIndex` constructor ignored `name` when given another `TimedeltaIndex` as data. ([GH 10025](https://github.com/pandas-dev/pandas/issues/10025))

+   `DataFrameFormatter._get_formatted_index`中的错误未将`max_colwidth`应用于`DataFrame`索引（[GH 7856](https://github.com/pandas-dev/pandas/issues/7856)）

+   在具有只读 ndarray 数据源的`.loc`中出现错误（[GH 10043](https://github.com/pandas-dev/pandas/issues/10043)）

+   `groupby.apply()`中的错误，如果传递的用户定义函数只返回`None`（对于所有输入），则会引发错误（[GH 9685](https://github.com/pandas-dev/pandas/issues/9685)）

+   Always use temporary files in pytables tests. ([GH 9992](https://github.com/pandas-dev/pandas/issues/9992))

+   Bug where plotting continuously using `secondary_y` may not show the legend properly. ([GH 9610](https://github.com/pandas-dev/pandas/issues/9610), [GH 9779](https://github.com/pandas-dev/pandas/issues/9779))

+   Bug in `DataFrame.plot(kind="hist")` resulting in a `TypeError` when the `DataFrame` contains non-numeric columns. ([GH 9853](https://github.com/pandas-dev/pandas/issues/9853))

+   Bug where repeated plotting of a `DataFrame` with a `DatetimeIndex` may raise a `TypeError`. ([GH 9852](https://github.com/pandas-dev/pandas/issues/9852))

+   Bug in `setup.py` that would allow an incompatible cython version to build. ([GH 9827](https://github.com/pandas-dev/pandas/issues/9827))

+   Bug in plotting `secondary_y` that incorrectly attached a `right_ax` property to secondary axes, specifying itself recursively. ([GH 9861](https://github.com/pandas-dev/pandas/issues/9861))

+   Bug in `Series.quantile` on an empty Series of type `Datetime` or `Timedelta`. ([GH 9675](https://github.com/pandas-dev/pandas/issues/9675))

+   Bug in `where` causing incorrect results when upcasting was required. ([GH 9731](https://github.com/pandas-dev/pandas/issues/9731))

+   Bug in `FloatArrayFormatter` where the decision boundary for displaying "small" floats in decimal format was off by one order of magnitude for a given display.precision. ([GH 9764](https://github.com/pandas-dev/pandas/issues/9764))

+   Fixed bug where `DataFrame.plot()` raised an error when both `color` and `style` keywords were passed and there was no color symbol in the style strings. ([GH 9671](https://github.com/pandas-dev/pandas/issues/9671))

+   A `DeprecationWarning` was not shown on combining list-likes with an `Index`. ([GH 10083](https://github.com/pandas-dev/pandas/issues/10083))

+   Bug in `read_csv` and `read_table` when using the `skip_rows` parameter if blank lines are present. ([GH 9832](https://github.com/pandas-dev/pandas/issues/9832))

+   Bug in `read_csv()` interpreting `index_col=True` as `1`. ([GH 9798](https://github.com/pandas-dev/pandas/issues/9798))

+   Bug in index equality comparisons using `==` failing on Index/MultiIndex type incompatibility. ([GH 9785](https://github.com/pandas-dev/pandas/issues/9785))

+   Bug in which `SparseDataFrame` could not take `nan` as a column name. ([GH 8822](https://github.com/pandas-dev/pandas/issues/8822))

+   Bug in `to_msgpack` and `read_msgpack` zlib and blosc compression support. ([GH 9783](https://github.com/pandas-dev/pandas/issues/9783))

+   Bug where `GroupBy.size` did not attach the index name properly when grouped by `TimeGrouper`. ([GH 9925](https://github.com/pandas-dev/pandas/issues/9925))

+   Bug causing an exception in slice assignments because `length_of_indexer` returned wrong results. ([GH 9995](https://github.com/pandas-dev/pandas/issues/9995))

+   Bug in the csv parser causing lines with initial whitespace plus one non-space character to be skipped. ([GH 9710](https://github.com/pandas-dev/pandas/issues/9710))

+   Bug in the C csv parser causing spurious NaNs when data started with a newline followed by whitespace. ([GH 10022](https://github.com/pandas-dev/pandas/issues/10022))

+   Bug causing elements with a null group to spill into the final group when grouping by a `Categorical`. ([GH 9603](https://github.com/pandas-dev/pandas/issues/9603))

+   Bug where `.iloc` and `.loc` behavior was not consistent on empty dataframes. ([GH 9964](https://github.com/pandas-dev/pandas/issues/9964))

+   Bug where invalid attribute access on a `TimedeltaIndex` incorrectly raised `ValueError` instead of `AttributeError`. ([GH 9680](https://github.com/pandas-dev/pandas/issues/9680))

+   Bug in unequal comparisons between categorical data and a scalar that was not in the categories (e.g. `Series(Categorical(list("abc"), ordered=True)) > "d"`). This returned `False` for all elements, but now raises a `TypeError`. Equality comparisons also now return `False` for `==` and `True` for `!=`. ([GH 9848](https://github.com/pandas-dev/pandas/issues/9848))

+   Bug in DataFrame `__setitem__` when the right-hand side is a dictionary. ([GH 9874](https://github.com/pandas-dev/pandas/issues/9874))

+   Bug in `where` when the dtype is `datetime64/timedelta64`, but the dtype of other is not. ([GH 9804](https://github.com/pandas-dev/pandas/issues/9804))

+   Bug in `MultiIndex.sortlevel()` breaking on unicode level names. ([GH 9856](https://github.com/pandas-dev/pandas/issues/9856))

+   Bug in which `groupby.transform` incorrectly enforced output dtypes to match input dtypes. ([GH 9807](https://github.com/pandas-dev/pandas/issues/9807))

+   Bug in the `DataFrame` constructor when the `columns` parameter is set and `data` is an empty list. ([GH 9939](https://github.com/pandas-dev/pandas/issues/9939))

+   Bug in bar plots with `log=True` raising `TypeError` if all values are less than 1. ([GH 9905](https://github.com/pandas-dev/pandas/issues/9905))

+   Bug in horizontal bar plots ignoring `log=True`. ([GH 9905](https://github.com/pandas-dev/pandas/issues/9905))

+   Bug in PyTables queries that did not return proper results using the index. ([GH 8265](https://github.com/pandas-dev/pandas/issues/8265), [GH 9676](https://github.com/pandas-dev/pandas/issues/9676))

+   Bug where dividing a DataFrame containing values of type `Decimal` by another `Decimal` would raise. ([GH 9787](https://github.com/pandas-dev/pandas/issues/9787))

+   Bug where using `asfreq` on a DataFrame would remove the name of the index. ([GH 9885](https://github.com/pandas-dev/pandas/issues/9885))

+   Bug causing an extra index point when resampling BM/BQ. ([GH 9756](https://github.com/pandas-dev/pandas/issues/9756))

+   Changed caching in `AbstractHolidayCalendar` to be at the instance level rather than at the class level, as the latter can result in unexpected behaviour. ([GH 9552](https://github.com/pandas-dev/pandas/issues/9552))

+   Fixed LaTeX output for MultiIndexed dataframes. ([GH 9778](https://github.com/pandas-dev/pandas/issues/9778))

+   Bug causing an exception when setting an empty range using `DataFrame.loc`. ([GH 9596](https://github.com/pandas-dev/pandas/issues/9596))

+   Bug in hiding tick labels with subplots and shared axes when adding a new plot to an existing grid of axes. ([GH 9158](https://github.com/pandas-dev/pandas/issues/9158))

+   Bug in `transform` and `filter` when grouping on a categorical variable. ([GH 9921](https://github.com/pandas-dev/pandas/issues/9921))

+   Bug in `transform` when groups are equal in number and dtype to the input index. ([GH 9700](https://github.com/pandas-dev/pandas/issues/9700))

+   The Google BigQuery connector now imports dependencies on a per-method basis. ([GH 9713](https://github.com/pandas-dev/pandas/issues/9713))

+   Updated the BigQuery connector to no longer use the deprecated `oauth2client.tools.run()`. ([GH 8327](https://github.com/pandas-dev/pandas/issues/8327))

+   Bug in subclassed `DataFrame`: it may not return the correct class when slicing or subsetting it. ([GH 9632](https://github.com/pandas-dev/pandas/issues/9632))

+   Bug in `.median()` where non-float null values were not handled correctly. ([GH 10040](https://github.com/pandas-dev/pandas/issues/10040))

+   Bug in `Series.fillna()` where it raised if a numerically convertible string was given. ([GH 10092](https://github.com/pandas-dev/pandas/issues/10092))

## Contributors

A total of 58 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.

+   Alfonso MHC +

+   Andy Hayden

+   Artemy Kolchinsky

+   Chris Gilmer +

+   Chris Grinolds +

+   Dan Birken

+   David BROCHART +

+   David Hirschfeld +

+   David Stephens

+   Dr. Leo +

+   Evan Wright +

+   Frans van Dunné +

+   Hatem Nassrat +

+   Henning Sperr +

+   Hugo Herter +

+   Jan Schulz

+   Jeff Blackburne +

+   Jeff Reback

+   Jim Crist +

+   Jonas Abernot +

+   Joris Van den Bossche

+   Kerby Shedden

+   Leo Razoumov +

+   Manuel Riel +

+   Mortada Mehyar

+   Nick Burns +

+   Nick Eubank +

+   Olivier Grisel

+   Phillip Cloud

+   Pietro Battiston

+   Roy Hyunjin Han

+   Sam Zhang +

+   Scott Sanderson +

+   Sinhrks +

+   Stephan Hoyer

+   Tiago Antao

+   Tom Ajamian +

+   Tom Augspurger

+   Tomaz Berisa +

+   Vikram Shirgur +

+   Vladimir Filimonov

+   William Hogman +

+   Yasin A +

+   Younggun Kim +

+   behzad nouri

+   dsm054

+   floydsoft +

+   flying-sheep +

+   gfr +

+   jnmclarty

+   jreback

+   ksanghai +

+   lucas +

+   mschmohl +

+   ptype +

+   rockg

+   scls19fr +

+   sinhrks

## Enhancements

### CategoricalIndex

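We introduce a `CategoricalIndex`, a new type of index object that is useful for supporting indexing with duplicates. It is a container around a `Categorical` (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a `DataFrame/Series` with a `category` dtype would convert it to a regular object-based `Index`.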
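The session in this section passes `categories=` directly to `astype`, which is how it was written for 0.16.1; recent pandas releases no longer accept that keyword. A minimal sketch of the same setup, assuming a current pandas installation and using `CategoricalDtype` to carry the category order (the variable name `cab` is chosen only for illustration):

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumption: a recent pandas release, where the category order is passed
# through a CategoricalDtype rather than as a keyword to astype.
cab = CategoricalDtype(categories=list("cab"))

df = pd.DataFrame(
    {
        "A": np.arange(6),
        "B": pd.Series(list("aabbca")).astype(cab),
    }
)

# Setting a categorical column as the index yields a CategoricalIndex.
df2 = df.set_index("B")
print(type(df2.index).__name__)

# A label lookup returns every row carrying that label.
print(df2.loc["a"])

# Sorting follows the category order (c, a, b), not alphabetical order.
print(df2.sort_index().index.tolist())
```

The rest of the walkthrough (label lookup, sorting, groupby) should behave the same way on `df2` built like this, though printed reprs may differ slightly between versions.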

```py
In [1]: df = pd.DataFrame({'A': np.arange(6),
 ...:                   'B': pd.Series(list('aabbca'))
 ...:                          .astype('category', categories=list('cab'))
 ...:                   })
 ...:

In [2]: df
Out[2]:
 A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [3]: df.dtypes
Out[3]:
A       int64
B    category
dtype: object

In [4]: df.B.cat.categories
Out[4]: Index(['c', 'a', 'b'], dtype='object')

```

Setting the index will create a `CategoricalIndex`:

```py

In [5]: df2 = df.set_index('B')

In [6]: df2.index
Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

```

Indexing with `__getitem__/.iloc/.loc/.ix` works similarly to an `Index` with duplicates. The indexers must be in the categories, or the operation will raise.

```py

In [7]: df2.loc['a']
Out[7]:
 A
B
a  0
a  1
a  5

```

The result also preserves the `CategoricalIndex`:

```py

In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

```

Sorting will order by the order of the categories:

```py

In [9]: df2.sort_index()
Out[9]:
 A
B
c  4
a  0
a  1
a  5
b  2
b  3

```

Groupby operations on the index will preserve the index nature as well:

```py

In [10]: df2.groupby(level=0).sum()
Out[10]:
 A
B
c  4
a  6
b  5

In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

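```

Reindexing operations return a resulting index based on the type of the passed indexer: passing a list returns a plain `Index`, while indexing with a `Categorical` returns a `CategoricalIndex`, indexed according to the categories of the passed `Categorical` dtype. This allows one to index arbitrarily, even with values not in the categories, similarly to how you can reindex any pandas index.

```py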

In [12]: df2.reindex(['a', 'e'])
Out[12]:
 A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [13]: df2.reindex(['a', 'e']).index
Out[13]: pd.Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

In [14]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[14]:
 A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [15]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[15]: pd.CategoricalIndex(['a', 'a', 'a', 'e'],
 categories=['a', 'b', 'c', 'd', 'e'],
 ordered=False, name='B',
 dtype='category')

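```

See the documentation for more information. (GH 7629, GH 10038, GH 10039)

### Sample

Series, DataFrames, and Panels now have a new method: `sample()`. The method accepts a specific number of rows or columns to return, or a fraction of the total number of rows or columns. It also has options for sampling with or without replacement, for passing in a column of weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH 2419)

```py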

In [1]: example_series = pd.Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1
In [2]: example_series.sample()
Out[2]: 
3    3
Length: 1, dtype: int64

# One may specify either a number of rows:
In [3]: example_series.sample(n=3)
Out[3]: 
2    2
1    1
0    0
Length: 3, dtype: int64

# Or a fraction of the rows:
In [4]: example_series.sample(frac=0.5)
Out[4]: 
1    1
5    5
3    3
Length: 3, dtype: int64

# weights are accepted.
In [5]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [6]: example_series.sample(n=3, weights=example_weights)
Out[6]: 
2    2
4    4
3    3
Length: 3, dtype: int64

# weights will also be normalized if they do not sum to one,
# and missing values will be treated as zeros.
In [7]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]

In [8]: example_series.sample(n=1, weights=example_weights2)
Out[8]: 
0    0
Length: 1, dtype: int64

```

When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.

```py

In [9]: df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})

In [10]: df.sample(n=3, weights="weight_column")
Out[10]: 
 col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1

[3 rows x 2 columns] 
```

### String methods enhancements

Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard Python string operations.
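As a quick orientation before the itemized changes, here is a minimal sketch of a few of the `.str` methods named in this section. It assumes a current pandas installation; the sample strings and variable names are made up for illustration, and defaults such as `partition` returning a DataFrame reflect current pandas rather than necessarily 0.16.1.

```python
import pandas as pd

s = pd.Series(["alpha beta", "Gamma delta"])

# capitalize() and swapcase() mirror the built-in str methods element-wise.
capitalized = s.str.capitalize()
swapped = s.str.swapcase()

# partition() splits on the first separator; in current pandas the three
# parts (before, separator, after) come back as DataFrame columns.
parts = s.str.partition(" ")

# index() returns the position of a substring and raises ValueError when the
# substring is missing from an element (unlike find(), which returns -1).
positions = s.str.index("a")

# The accessor is also available on Index objects.
idx = pd.Index([" jack", "jill "])
stripped = idx.str.strip()

print(capitalized.tolist(), swapped.tolist())
print(parts)
print(positions.tolist(), stripped.tolist())
```

Because each method follows its built-in `str` counterpart, the plain Python behaviour is usually a reliable guide to what the vectorized version returns.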

+   Added `StringMethods` (the `.str` accessor) to `Index` ([GH 9068](https://github.com/pandas-dev/pandas/issues/9068))

    The `.str` accessor is now available for both `Series` and `Index`.

    ```py
    In [11]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

    In [12]: idx.str.strip()
    Out[12]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object') 
    ```

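    One special case for the `.str` accessor on `Index` is that if a string method returns `bool`, the `.str` accessor will return a `np.array` instead of a boolean `Index` ([GH 8875](https://github.com/pandas-dev/pandas/issues/8875)). This enables the following expression to work naturally: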

    ```py
    In [13]: idx = pd.Index(["a1", "a2", "b1", "b2"])

    In [14]: s = pd.Series(range(4), index=idx)

    In [15]: s
    Out[15]: 
    a1    0
    a2    1
    b1    2
    b2    3
    Length: 4, dtype: int64

    In [16]: idx.str.startswith("a")
    Out[16]: array([ True,  True, False, False])

    In [17]: s[s.index.str.startswith("a")]
    Out[17]: 
    a1    0
    a2    1
    Length: 2, dtype: int64 
    ```

+   The following new methods are accessible via the `.str` accessor to apply the function to each value. ([GH 9766](https://github.com/pandas-dev/pandas/issues/9766), [GH 9773](https://github.com/pandas-dev/pandas/issues/9773), [GH 10031](https://github.com/pandas-dev/pandas/issues/10031), [GH 10045](https://github.com/pandas-dev/pandas/issues/10045), [GH 10052](https://github.com/pandas-dev/pandas/issues/10052))

    |  |  | Methods |  |  |
    | --- | --- | --- | --- | --- |
    | `capitalize()` | `swapcase()` | `normalize()` | `partition()` | `rpartition()` |
    | `index()` | `rindex()` | `translate()` |  |  |

+   `split` now takes an `expand` keyword to specify whether to expand dimensionality. `return_type` is deprecated. ([GH 9847](https://github.com/pandas-dev/pandas/issues/9847))

    ```py
    In [18]: s = pd.Series(["a,b", "a,c", "b,c"])

    # return Series
    In [19]: s.str.split(",")
    Out[19]: 
    0    [a, b]
    1    [a, c]
    2    [b, c]
    Length: 3, dtype: object

    # return DataFrame
    In [20]: s.str.split(",", expand=True)
    Out[20]: 
     0  1
    0  a  b
    1  a  c
    2  b  c

    [3 rows x 2 columns]

    In [21]: idx = pd.Index(["a,b", "a,c", "b,c"])

    # return Index
    In [22]: idx.str.split(",")
    Out[22]: Index([['a', 'b'], ['a', 'c'], ['b', 'c']], dtype='object')

    # return MultiIndex
    In [23]: idx.str.split(",", expand=True)
    Out[23]: 
    MultiIndex([('a', 'b'),
     ('a', 'c'),
     ('b', 'c')],
     ) 
    ```

+   Improved the `extract` and `get_dummies` methods for `Index.str` ([GH 9980](https://github.com/pandas-dev/pandas/issues/9980))

### Other enhancements

+   The `BusinessHour` offset is now supported. It represents business hours, by default 09:00 - 17:00 on a `BusinessDay`. See here for details. ([GH 7905](https://github.com/pandas-dev/pandas/issues/7905))

    ```py
    In [24]: pd.Timestamp("2014-08-01 09:00") + pd.tseries.offsets.BusinessHour()
    Out[24]: Timestamp('2014-08-01 10:00:00')

    In [25]: pd.Timestamp("2014-08-01 07:00") + pd.tseries.offsets.BusinessHour()
    Out[25]: Timestamp('2014-08-01 10:00:00')

    In [26]: pd.Timestamp("2014-08-01 16:30") + pd.tseries.offsets.BusinessHour()
    Out[26]: Timestamp('2014-08-04 09:30:00') 
    ```

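+   `DataFrame.diff` now takes an `axis` parameter that determines the direction of differencing ([GH 9727](https://github.com/pandas-dev/pandas/issues/9727))

+   Allow `clip`, `clip_lower`, and `clip_upper` to accept array-like arguments as thresholds (this is a regression from 0.11.0). These methods now have an `axis` parameter which determines how the Series or DataFrame will be aligned with the threshold(s). ([GH 6966](https://github.com/pandas-dev/pandas/issues/6966))

+   `DataFrame.mask()` and `Series.mask()` now support the same keywords as `where`. ([GH 8801](https://github.com/pandas-dev/pandas/issues/8801))

+   The `drop` function can now accept an `errors` keyword to suppress the `ValueError` raised when any of the labels does not exist in the target data. ([GH 6736](https://github.com/pandas-dev/pandas/issues/6736))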

    ```py
    In [27]: df = pd.DataFrame(np.random.randn(3, 3), columns=["A", "B", "C"])

    In [28]: df.drop(["A", "X"], axis=1, errors="ignore")
    Out[28]: 
     B         C
    0 -0.706771 -1.039575
    1 -0.424972  0.567020
    2 -1.087401 -0.673690

    [3 rows x 2 columns] 
    ```

+   Added support for separating years and quarters using dashes, for example `2014-Q1`. ([GH 9688](https://github.com/pandas-dev/pandas/issues/9688))

+   Allow conversion of values with dtype `datetime64` or `timedelta64` to strings using `astype(str)`. ([GH 9757](https://github.com/pandas-dev/pandas/issues/9757))

+   The `get_dummies` function now accepts a `sparse` keyword. If set to `True`, the returned `DataFrame` is sparse, e.g. a `SparseDataFrame`. ([GH 8823](https://github.com/pandas-dev/pandas/issues/8823))

+   `Period` now accepts `datetime64` as value input. ([GH 9054](https://github.com/pandas-dev/pandas/issues/9054))

+   Allow timedelta string conversion when the leading zero is missing from the time definition, i.e. `0:00:00` vs. `00:00:00`. ([GH 9570](https://github.com/pandas-dev/pandas/issues/9570))

+   Allow `Panel.shift` with `axis='items'`. ([GH 9890](https://github.com/pandas-dev/pandas/issues/9890))

+   Trying to write an Excel file now raises `NotImplementedError` if the `DataFrame` has a `MultiIndex`, instead of writing a broken Excel file. ([GH 9794](https://github.com/pandas-dev/pandas/issues/9794))

+   Allow `Categorical.add_categories` to accept a `Series` or `np.array`. ([GH 9927](https://github.com/pandas-dev/pandas/issues/9927))

+   Add/delete the `str/dt/cat` accessors dynamically from `__dir__`. ([GH 9910](https://github.com/pandas-dev/pandas/issues/9910))

+   Add `normalize` as a `dt` accessor method. ([GH 10047](https://github.com/pandas-dev/pandas/issues/10047))

+   `DataFrame` and `Series` now have a `_constructor_expanddim` property as an overridable constructor for data of one higher dimensionality. This should be used only when it is really needed, see here

+   `pd.lib.infer_dtype` now returns `'bytes'` in Python 3 where appropriate. ([GH 10032](https://github.com/pandas-dev/pandas/issues/10032))

### CategoricalIndex

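We introduce a `CategoricalIndex`, a new type of index object that is useful for supporting indexing with duplicates. It is a container around a `Categorical` (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a `DataFrame/Series` with a `category` dtype would convert it to a regular object-based `Index`.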

```py
In [1]: df = pd.DataFrame({'A': np.arange(6),
 ...:                   'B': pd.Series(list('aabbca'))
 ...:                          .astype('category', categories=list('cab'))
 ...:                   })
 ...:

In [2]: df
Out[2]:
 A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [3]: df.dtypes
Out[3]:
A       int64
B    category
dtype: object

In [4]: df.B.cat.categories
Out[4]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex:

In [5]: df2 = df.set_index('B')

In [6]: df2.index
Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers must be in the categories, or the operation will raise.

In [7]: df2.loc['a']
Out[7]:
 A
B
a  0
a  1
a  5

The result also preserves the CategoricalIndex:

In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Sorting will order by the order of the categories:

In [9]: df2.sort_index()
Out[9]:
 A
B
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index will preserve the index nature as well:

In [10]: df2.groupby(level=0).sum()
Out[10]:
 A
B
c  4
a  6
b  5

In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

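Reindexing operations return a resulting index based on the type of the passed indexer: passing a list returns a plain Index, while indexing with a Categorical returns a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to index arbitrarily, even with values not in the categories, similarly to how you can reindex any pandas index.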

In [12]: df2.reindex(['a', 'e'])
Out[12]:
 A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [13]: df2.reindex(['a', 'e']).index
Out[13]: pd.Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

In [14]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[14]:
 A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [15]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[15]: pd.CategoricalIndex(['a', 'a', 'a', 'e'],
 categories=['a', 'b', 'c', 'd', 'e'],
 ordered=False, name='B',
 dtype='category')

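See the documentation for more information. (GH 7629, GH 10038, GH 10039)

Sample

Series, DataFrames, and Panels now have a new method: sample(). The method accepts a specific number of rows or columns to return, or a fraction of the total number of rows or columns. It also has options for sampling with or without replacement, for passing in a column of weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH 2419)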

In [1]: example_series = pd.Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1
In [2]: example_series.sample()
Out[2]: 
3    3
Length: 1, dtype: int64

# One may specify either a number of rows:
In [3]: example_series.sample(n=3)
Out[3]: 
2    2
1    1
0    0
Length: 3, dtype: int64

# Or a fraction of the rows:
In [4]: example_series.sample(frac=0.5)
Out[4]: 
1    1
5    5
3    3
Length: 3, dtype: int64

# weights are accepted.
In [5]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [6]: example_series.sample(n=3, weights=example_weights)
Out[6]: 
2    2
4    4
3    3
Length: 3, dtype: int64

# weights will also be normalized if they do not sum to one,
# and missing values will be treated as zeros.
In [7]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]

In [8]: example_series.sample(n=1, weights=example_weights2)
Out[8]: 
0    0
Length: 1, dtype: int64

When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.

In [9]: df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})

In [10]: df.sample(n=3, weights="weight_column")
Out[10]: 
 col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1

[3 rows x 2 columns]

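String methods enhancements

Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard Python string operations.

Added StringMethods (the .str accessor) to Index (GH 9068)

The .str accessor is now available for both Series and Index.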

In [11]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

In [12]: idx.str.strip()
Out[12]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

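One special case for the .str accessor on Index is that if a string method returns bool, the .str accessor will return a np.array instead of a boolean Index (GH 8875). This enables the following expression to work naturally: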

In [13]: idx = pd.Index(["a1", "a2", "b1", "b2"])

In [14]: s = pd.Series(range(4), index=idx)

In [15]: s
Out[15]: 
a1    0
a2    1
b1    2
b2    3
Length: 4, dtype: int64

In [16]: idx.str.startswith("a")
Out[16]: array([ True,  True, False, False])

In [17]: s[s.index.str.startswith("a")]
Out[17]: 
a1    0
a2    1
Length: 2, dtype: int64

The following new methods are accessible via the .str accessor to apply the function to each value. (GH 9766, GH 9773, GH 10031, GH 10045, GH 10052)

Methods

capitalize()  swapcase()  normalize()  partition()  rpartition()
index()  rindex()  translate()

split now takes an expand keyword to specify whether to expand dimensionality. return_type is deprecated. (GH 9847)

In [18]: s = pd.Series(["a,b", "a,c", "b,c"])

# return Series
In [19]: s.str.split(",")
Out[19]: 
0    [a, b]
1    [a, c]
2    [b, c]
Length: 3, dtype: object

# return DataFrame
In [20]: s.str.split(",", expand=True)
Out[20]: 
 0  1
0  a  b
1  a  c
2  b  c

[3 rows x 2 columns]

In [21]: idx = pd.Index(["a,b", "a,c", "b,c"])

# return Index
In [22]: idx.str.split(",")
Out[22]: Index([['a', 'b'], ['a', 'c'], ['b', 'c']], dtype='object')

# return MultiIndex
In [23]: idx.str.split(",", expand=True)
Out[23]: 
MultiIndex([('a', 'b'),
 ('a', 'c'),
 ('b', 'c')],
 )

Improved the extract and get_dummies methods for Index.str. (GH 9980)

Other enhancements

The BusinessHour offset is now supported. It represents business hours, by default 09:00 - 17:00 on a BusinessDay. See here for details. (GH 7905)

In [24]: pd.Timestamp("2014-08-01 09:00") + pd.tseries.offsets.BusinessHour()
Out[24]: Timestamp('2014-08-01 10:00:00')

In [25]: pd.Timestamp("2014-08-01 07:00") + pd.tseries.offsets.BusinessHour()
Out[25]: Timestamp('2014-08-01 10:00:00')

In [26]: pd.Timestamp("2014-08-01 16:30") + pd.tseries.offsets.BusinessHour()
Out[26]: Timestamp('2014-08-04 09:30:00')

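DataFrame.diff now takes an axis parameter that determines the direction of differencing. (GH 9727)
Allow clip, clip_lower, and clip_upper to accept array-like arguments as thresholds (this is a regression from 0.11.0). These methods now have an axis parameter which determines how the Series or DataFrame will be aligned with the threshold(s). (GH 6966)
DataFrame.mask() and Series.mask() now support the same keywords as where. (GH 8801)

The drop function can now accept an errors keyword to suppress the ValueError raised when any of the labels does not exist in the target data. (GH 6736)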

In [27]: df = pd.DataFrame(np.random.randn(3, 3), columns=["A", "B", "C"])

In [28]: df.drop(["A", "X"], axis=1, errors="ignore")
Out[28]: 
 B         C
0 -0.706771 -1.039575
1 -0.424972  0.567020
2 -1.087401 -0.673690

[3 rows x 2 columns]

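Added support for separating years and quarters using dashes, for example 2014-Q1. (GH 9688)
Allow conversion of values with dtype datetime64 or timedelta64 to strings using astype(str). (GH 9757)
The get_dummies function now accepts a sparse keyword. If set to True, the returned DataFrame is sparse, e.g. a SparseDataFrame. (GH 8823)
Period now accepts datetime64 as value input. (GH 9054)
Allow timedelta string conversion when the leading zero is missing from the time definition, i.e. 0:00:00 vs. 00:00:00. (GH 9570)
Allow Panel.shift with axis='items'. (GH 9890)
Trying to write an Excel file now raises NotImplementedError if the DataFrame has a MultiIndex, instead of writing a broken Excel file. (GH 9794)
Allow Categorical.add_categories to accept a Series or np.array. (GH 9927)
Add/delete the str/dt/cat accessors dynamically from __dir__. (GH 9910)
Add normalize as a dt accessor method. (GH 10047)
DataFrame and Series now have a _constructor_expanddim property as an overridable constructor for data of one higher dimensionality. This should be used only when it is really needed, see here
pd.lib.infer_dtype now returns 'bytes' in Python 3 where appropriate. (GH 10032)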

API changes

When passing in an ax to df.plot( ..., ax=ax), the sharex kwarg now defaults to False. As a result, the visibility of xlabels and xticklabels is no longer changed. You have to set this yourself for the right axes in your figure, or set sharex=True explicitly (but this changes the visibility for all axes in the figure, not only the one that is passed in!). If pandas creates the subplots itself (e.g. no ax kwarg is passed in), then the default is still sharex=True and the visibility changes are applied.
assign() now inserts new columns in alphabetical order. Previously the order was arbitrary. (GH 9777)
By default, read_csv and read_table will now try to infer the compression type based on the file extension. Set compression=None to restore the previous behavior (no decompression). (GH 9770)

Deprecations

Series.str.split's return_type keyword was removed in favor of expand. (GH 9847)

Index representation

The string representation of Index and its sub-classes has now been unified. These show a single-line display if there are few values; a wrapped multi-line display for a lot of values (but fewer than display.max_seq_items); and, if there are lots of items (> display.max_seq_items), a truncated display (the head and tail of the data). The formatting for MultiIndex is unchanged (a multi-line wrapped display). The display width responds to the option display.max_seq_items, which defaults to 100. (GH 6482)

Previous behavior

In [2]: pd.Index(range(4), name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')

In [3]: pd.Index(range(104), name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [4]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern

In [5]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern

New behavior

In [29]: pd.set_option("display.width", 80)

In [30]: pd.Index(range(4), name="foo")
Out[30]: RangeIndex(start=0, stop=4, step=1, name='foo')

In [31]: pd.Index(range(30), name="foo")
Out[31]: RangeIndex(start=0, stop=30, step=1, name='foo')

In [32]: pd.Index(range(104), name="foo")
Out[32]: RangeIndex(start=0, stop=104, step=1, name='foo')

In [33]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"], ordered=True, name="foobar")
Out[33]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')

In [34]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 10, ordered=True, name="foobar")
Out[34]: 
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb',
 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd',
 'a', 'bb', 'ccc', 'dddd'],
 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')

In [35]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 100, ordered=True, name="foobar")
Out[35]: 
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
 'bb',
 ...
 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
 'dddd'],
 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar', length=400)

In [36]: pd.date_range("20130101", periods=4, name="foo", tz="US/Eastern")
Out[36]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
 '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
 dtype='datetime64[ns, US/Eastern]', name='foo', freq='D')

In [37]: pd.date_range("20130101", periods=25, freq="D")
Out[37]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
 '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
 '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
 '2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
 '2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20',
 '2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24',
 '2013-01-25'],
 dtype='datetime64[ns]', freq='D')

In [38]: pd.date_range("20130101", periods=104, name="foo", tz="US/Eastern")
Out[38]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
 '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
 '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
 '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
 '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
 ...
 '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
 '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
 '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
 '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
 '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
 dtype='datetime64[ns, US/Eastern]', name='foo', length=104, freq='D')

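Version 0.16.0 (March 22, 2015)

Original: pandas.pydata.org/docs/whatsnew/v0.16.0.html

This is a major release from 0.15.2 and includes a small number of API changes, several new features, enhancements, and performance improvements, along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

DataFrame.assign method, see here
Series.to_coo/from_coo methods to interact with scipy.sparse, see here
Backwards incompatible change to Timedelta to conform the .seconds attribute with datetime.timedelta, see here
Changes to the .loc slicing API to conform with the behavior of .ix, see here
Changes to the default for ordering in the Categorical constructor, see here
Enhancements to the .str accessor to make string operations easier, see here
The pandas.tools.rplot, pandas.sandbox.qtpandas and pandas.rpy modules are deprecated. We refer users to external packages like seaborn, pandas-qt and rpy2 for similar or equivalent functionality, see here

Check the API changes and deprecations before updating.

What's new in v0.16.0

New features
- DataFrame assign
- Interaction with scipy.sparse
- String method enhancements
- Other enhancements
Backwards incompatible API changes
- Changes in timedelta
- Indexing changes
- Categorical changes
- Other API changes
- Deprecations
- Removal of prior version deprecations/changes
Performance improvements
Bug fixes
Contributors

New features

DataFrame assign

Inspired by dplyr's mutate verb, DataFrame has a new assign() method. The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array) or a function of one argument to be called on the DataFrame. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.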

In [1]: iris = pd.read_csv('data/iris.data')

In [2]: iris.head()
Out[2]: 
 SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

[5 rows x 5 columns]

In [3]: iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
Out[3]: 
 SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

[5 rows x 6 columns]

Above was an example of inserting a precomputed value. We can also pass in a function to be evaluated.

In [4]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth']
 ...:                                   / x['SepalLength'])).head()
 ...: 
Out[4]: 
 SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

[5 rows x 6 columns]

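The power of assign comes when it is used in chains of operations. For example, we can limit the DataFrame to just those rows with a sepal length greater than 5, calculate the ratios, and plot the result.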

In [5]: iris = pd.read_csv('data/iris.data')

In [6]: (iris.query('SepalLength > 5')
 ...:     .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
 ...:             PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
 ...:     .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
 ...: 
Out[6]: <Axes: xlabel='SepalRatio', ylabel='PetalRatio'>

../_images/whatsnew_assign.png
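
The behaviour described above can be sketched without the iris file. The small frame and column names below are made up for illustration; the calls themselves (`assign`, `query`) are the ones used in the examples above.

```python
import pandas as pd

# Hypothetical stand-in for the iris data.
df = pd.DataFrame({"width": [2.0, 3.0], "length": [4.0, 6.0]})

# A precomputed value and a one-argument callable give the same column here.
by_value = df.assign(ratio=df["width"] / df["length"])
by_callable = df.assign(ratio=lambda x: x["width"] / x["length"])

# In a chain, the callable receives the already-filtered frame,
# so no intermediate variable is needed.
chained = df.query("length > 5").assign(ratio=lambda x: x["width"] / x["length"])

# assign returns a new DataFrame; the original keeps its two columns.
print(list(df.columns))
print(chained)
```

Passing a callable is what makes `assign` usable mid-chain: a precomputed value would have to refer to a frame that does not yet have a name.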

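See the documentation for more information (GH 9229).

### Interaction with scipy.sparse

Added SparseSeries.to_coo() and SparseSeries.from_coo() methods (GH 8048) for converting to and from scipy.sparse.coo_matrix instances (see here). For example, given a SparseSeries with a MultiIndex, we can convert it to a scipy.sparse.coo_matrix by specifying the row and column labels as index levels: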

s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
                                     (1, 2, 'a', 1),
                                     (1, 1, 'b', 0),
                                     (1, 1, 'b', 1),
                                     (2, 1, 'b', 0),
                                     (2, 1, 'b', 1)],
                                    names=['A', 'B', 'C', 'D'])

s

# SparseSeries
ss = s.to_sparse()
ss

A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
                             column_levels=['C', 'D'],
                             sort_labels=False)

A
A.todense()
rows
columns

The from_coo method is a convenience method for creating a SparseSeries from a scipy.sparse.coo_matrix:

from scipy import sparse
A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
                      shape=(3, 4))
A
A.todense()

ss = pd.SparseSeries.from_coo(A)
ss 
### String method enhancements

+   The following new methods are accessible via the `.str` accessor to apply the function to each value. This is intended to make it more consistent with the standard methods on strings. ([GH 9282](https://github.com/pandas-dev/pandas/issues/9282), [GH 9352](https://github.com/pandas-dev/pandas/issues/9352), [GH 9386](https://github.com/pandas-dev/pandas/issues/9386), [GH 9387](https://github.com/pandas-dev/pandas/issues/9387), [GH 9439](https://github.com/pandas-dev/pandas/issues/9439))

    |  |  | Methods |  |  |
    | --- | --- | --- | --- | --- |
    | `isalnum()` | `isalpha()` | `isdigit()` | `isdigit()` | `isspace()` |
    | `islower()` | `isupper()` | `istitle()` | `isnumeric()` | `isdecimal()` |
    | `find()` | `rfind()` | `ljust()` | `rjust()` | `zfill()` |

    ```py
    In [7]: s = pd.Series(['abcd', '3456', 'EFGH'])

    In [8]: s.str.isalpha()
    Out[8]: 
    0     True
    1    False
    2     True
    Length: 3, dtype: bool

    In [9]: s.str.find('ab')
    Out[9]: 
    0    0
    1   -1
    2   -1
    Length: 3, dtype: int64 
    ```

+   `Series.str.pad()` and `Series.str.center()` now accept a `fillchar` option to specify the filling character ([GH 9352](https://github.com/pandas-dev/pandas/issues/9352))

    ```py
    In [10]: s = pd.Series(['12', '300', '25'])

    In [11]: s.str.pad(5, fillchar='_')
    Out[11]: 
    0    ___12
    1    __300
    2    ___25
    Length: 3, dtype: object 
    ```

+   Added `Series.str.slice_replace()`, which previously raised `NotImplementedError` ([GH 8888](https://github.com/pandas-dev/pandas/issues/8888))

    ```py
    In [12]: s = pd.Series(['ABCD', 'EFGH', 'IJK'])

    In [13]: s.str.slice_replace(1, 3, 'X')
    Out[13]: 
    0    AXD
    1    EXH
    2     IX
    Length: 3, dtype: object

    # replaced with empty char
    In [14]: s.str.slice_replace(0, 1)
    Out[14]: 
    0    BCD
    1    FGH
    2     JK
    Length: 3, dtype: object 
    ```

### Other enhancements

+   Reindex now supports `method='nearest'` for frames or series with a monotonic increasing or decreasing index ([GH 9258](https://github.com/pandas-dev/pandas/issues/9258)):

    ```py
    In [15]: df = pd.DataFrame({'x': range(5)})

    In [16]: df.reindex([0.2, 1.8, 3.5], method='nearest')
    Out[16]: 
     x
    0.2  0
    1.8  2
    3.5  4

    [3 rows x 1 columns] 
    ```

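    This method is also exposed by the lower-level `Index.get_indexer` and `Index.get_loc` methods.

+   The `sheetname` argument of `read_excel()` now accepts a list and `None`, to get multiple or all sheets respectively. If more than one sheet is specified, a dictionary is returned. ([GH 9450](https://github.com/pandas-dev/pandas/issues/9450))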

    ```py
    # Returns the 1st and 4th sheet, as a dictionary of DataFrames.
    pd.read_excel('path_to_file.xls', sheetname=['Sheet1', 3]) 
    ```

+   Allow Stata files to be read incrementally with an iterator; support for long strings in Stata files. See the docs here. ([GH 9493](https://github.com/pandas-dev/pandas/issues/9493))

+   Paths beginning with ~ will now be expanded to begin with the user's home directory. ([GH 9066](https://github.com/pandas-dev/pandas/issues/9066))

+   Added time interval selection in `get_data_yahoo`. ([GH 9071](https://github.com/pandas-dev/pandas/issues/9071))

+   Added `Timestamp.to_datetime64()` to complement `Timedelta.to_timedelta64()`. ([GH 9255](https://github.com/pandas-dev/pandas/issues/9255))

+   `tseries.frequencies.to_offset()` now accepts `Timedelta` as input. ([GH 9064](https://github.com/pandas-dev/pandas/issues/9064))

+   A lag parameter was added to the autocorrelation method of `Series`; it defaults to lag-1 autocorrelation. ([GH 9192](https://github.com/pandas-dev/pandas/issues/9192))

+   `Timedelta` will now accept a `nanoseconds` keyword in the constructor. ([GH 9273](https://github.com/pandas-dev/pandas/issues/9273))

+   SQL code now safely escapes table and column names. ([GH 8986](https://github.com/pandas-dev/pandas/issues/8986))

+   Added auto-complete for `Series.str.<tab>`, `Series.dt.<tab>` and `Series.cat.<tab>`. ([GH 9322](https://github.com/pandas-dev/pandas/issues/9322))

+   `Index.get_indexer` now supports `method='pad'` and `method='backfill'` for any target array, not just monotonic targets. These methods also work for monotonic decreasing as well as monotonic increasing indexes. ([GH 9258](https://github.com/pandas-dev/pandas/issues/9258))

+   `Index.asof` now works on all index types. ([GH 9258](https://github.com/pandas-dev/pandas/issues/9258))

+   A `verbose` argument was added to `io.read_excel()`; it defaults to False. Set it to True to print sheet names as they are parsed. ([GH 9450](https://github.com/pandas-dev/pandas/issues/9450))

+   Added a `days_in_month` (compatibility alias `daysinmonth`) property to `Timestamp`, `DatetimeIndex`, `Period`, `PeriodIndex`, and `Series.dt`. ([GH 9572](https://github.com/pandas-dev/pandas/issues/9572))

+   Added a `decimal` option in `to_csv` to provide formatting for non-'.' decimal separators. ([GH 781](https://github.com/pandas-dev/pandas/issues/781))

+   Added a `normalize` option for `Timestamp` to normalize it to midnight. ([GH 8794](https://github.com/pandas-dev/pandas/issues/8794))

+   Added an example of `DataFrame` import to R using an HDF5 file and the `rhdf5` library. See the documentation for more information. ([GH 9636](https://github.com/pandas-dev/pandas/issues/9636))

## Backwards incompatible API changes

### Changes in timedelta

In v0.15.0 a new scalar type `Timedelta` was introduced, which is a subclass of `datetime.timedelta`. The v0.15.0 release notes mentioned an API change to the `.seconds` accessor. The intent was to provide a user-friendly set of accessors that give the 'natural' value for that unit, e.g. if you had a `Timedelta('1 day, 10:11:12')`, then `.seconds` would return 12. However, this is at odds with the definition of `datetime.timedelta`, which defines `.seconds` as `10 * 3600 + 11 * 60 + 12 == 36672`.

So in v0.16.0, we are restoring the API to match that of `datetime.timedelta`. The component values are still available through the `.components` accessor. This affects the `.seconds` and `.microseconds` accessors, and removes the `.hours`, `.minutes`, and `.milliseconds` accessors. These changes affect `TimedeltaIndex` and the Series `.dt` accessor as well. ([GH 9185](https://github.com/pandas-dev/pandas/issues/9185), [GH 9139](https://github.com/pandas-dev/pandas/issues/9139))
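
For reference, the standard-library semantics that `Timedelta` was aligned with can be checked directly. This sketch uses only `datetime.timedelta`, with the same duration as the examples in this section.

```python
from datetime import timedelta

td = timedelta(days=1, hours=10, minutes=11, seconds=12, microseconds=100123)

# .days holds the whole days.
print(td.days)          # 1
# .seconds is everything below one day, in whole seconds: 10h 11m 12s.
print(td.seconds)       # 36672
# .microseconds is the full sub-second remainder, not split into
# milliseconds and microseconds.
print(td.microseconds)  # 100123
```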

Previous behavior

```py
In [2]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [3]: t.days
Out[3]: 1

In [4]: t.seconds
Out[4]: 12

In [5]: t.microseconds
Out[5]: 123
```

New behavior

```py
In [17]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [18]: t.days
Out[18]: 1

In [19]: t.seconds
Out[19]: 36672

In [20]: t.microseconds
Out[20]: 100123
```

Using `.components` allows full component access

```py
In [21]: t.components
Out[21]: Components(days=1, hours=10, minutes=11, seconds=12, milliseconds=100, microseconds=123, nanoseconds=0)

In [22]: t.components.seconds
Out[22]: 12 
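```

### Indexing changes

The behavior of a small subset of edge cases when using `.loc` has changed ([GH 8613](https://github.com/pandas-dev/pandas/issues/8613)). We have also improved the content of the error messages that are raised:

+   Slicing with `.loc` where the start and/or stop bound is not found in the index is now allowed; this previously raised a `KeyError`. This makes the behavior the same as `.ix` in this case. The change applies only to slicing, not to indexing with a single label.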

    ```py
    In [23]: df = pd.DataFrame(np.random.randn(5, 4),
     ....:                  columns=list('ABCD'),
     ....:                  index=pd.date_range('20130101', periods=5))
     ....: 

    In [24]: df
    Out[24]: 
     A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401

    [5 rows x 4 columns]

    In [25]: s = pd.Series(range(5), [-2, -1, 1, 2, 3])

    In [26]: s
    Out[26]: 
    -2    0
    -1    1
     1    2
     2    3
     3    4
    Length: 5, dtype: int64 
    ```

    Previous behavior

    ```py
    In [4]: df.loc['2013-01-02':'2013-01-10']
    KeyError: 'stop bound [2013-01-10] is not in the [index]'

    In [6]: s.loc[-10:3]
    KeyError: 'start bound [-10] is not the [index]' 
    ```

    New behavior

    ```py
    In [27]: df.loc['2013-01-02':'2013-01-10']
    Out[27]: 
     A         B         C         D
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401

    [4 rows x 4 columns]

    In [28]: s.loc[-10:3]
    Out[28]: 
    -2    0
    -1    1
     1    2
     2    3
     3    4
    Length: 5, dtype: int64 
    ```

+   Slicing with float-like values on an integer index is now allowed for `.ix`. Previously this was only enabled for `.loc`:

    Previous behavior

    ```py
    In [8]: s.ix[-1.0:2]
    TypeError: the slice start value [-1.0] is not a proper indexer for this index type (Int64Index) 
    ```

    New behavior

    ```py
    In [2]: s.ix[-1.0:2]
    Out[2]:
    -1    1
     1    2
     2    3
    dtype: int64 
    ```

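+   A more useful exception is now raised when indexing with `.loc` using a type that is invalid for that index, for example an integer (or a float) on an index of type `DatetimeIndex`, `PeriodIndex`, or `TimedeltaIndex`.

    Previous behavior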

    ```py
    In [4]: df.loc[2:3]
    KeyError: 'start bound [2] is not the [index]' 
    ```

    New behavior

    ```py
    In [4]: df.loc[2:3]
    TypeError: Cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with <type 'int'> keys 
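    ```

### Categorical changes

In prior versions, `Categoricals` that had an unspecified ordering (meaning no `ordered` keyword was passed) defaulted to `ordered` Categoricals. Going forward, the `ordered` keyword in the `Categorical` constructor defaults to `False`. Ordering must now be explicit.

Furthermore, you could previously change the `ordered` attribute of a Categorical by just setting the attribute, e.g. `cat.ordered=True`. This is now deprecated and you should use `cat.as_ordered()` or `cat.as_unordered()`. By default these return a **new** object and do not modify the existing one. ([GH 9347](https://github.com/pandas-dev/pandas/issues/9347), [GH 9190](https://github.com/pandas-dev/pandas/issues/9190))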
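
A minimal sketch of the two rules above, assuming a recent pandas release where both still apply; the variable names are illustrative.

```python
import pandas as pd

# No `ordered` keyword: the Categorical is unordered by default.
cat = pd.Categorical([0, 1, 2])
print(cat.ordered)

# as_ordered() returns a new object and leaves `cat` untouched.
ordered_cat = cat.as_ordered()
print(ordered_cat.ordered, cat.ordered)

# The same pattern through the Series `.cat` accessor.
s = pd.Series([0, 1, 2], dtype="category")
s_ordered = s.cat.as_ordered()
print(s.cat.ordered, s_ordered.cat.ordered)

# Ordering can also be requested explicitly in the constructor.
explicit = pd.Categorical([0, 1, 2], ordered=True)
print(explicit.ordered)
```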

Previous behavior

```py
In [3]: s = pd.Series([0, 1, 2], dtype='category')

In [4]: s
Out[4]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [5]: s.cat.ordered
Out[5]: True

In [6]: s.cat.ordered = False

In [7]: s
Out[7]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0, 1, 2]
```

New behavior

```py
In [29]: s = pd.Series([0, 1, 2], dtype='category')

In [30]: s
Out[30]: 
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0, 1, 2]

In [31]: s.cat.ordered
Out[31]: False

In [32]: s = s.cat.as_ordered()

In [33]: s
Out[33]: 
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0 < 1 < 2]

In [34]: s.cat.ordered
Out[34]: True

# you can set in the constructor of the Categorical
In [35]: s = pd.Series(pd.Categorical([0, 1, 2], ordered=True))

In [36]: s
Out[36]: 
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0 < 1 < 2]

In [37]: s.cat.ordered
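Out[37]: True
```

For ease of creation of series of categorical data, we have added the ability to pass keywords when calling `.astype()`. These are passed directly to the constructor.

```py
In [54]: s = pd.Series(["a", "b", "c", "a"]).astype('category', ordered=True)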

In [55]: s
Out[55]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]

In [56]: s = (pd.Series(["a", "b", "c", "a"])
   ....:        .astype('category', categories=list('abcdef'), ordered=False))

In [57]: s
Out[57]:
0    a
1    b
2    c
3    a
dtype: category
Categories (6, object): [a, b, c, d, e, f] 
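```

### Other API changes

+   `Index.duplicated` now returns `np.array(dtype=bool)` rather than `Index(dtype=object)` containing `bool` values. ([GH 8875](https://github.com/pandas-dev/pandas/issues/8875))

+   `DataFrame.to_json` now returns accurate type serialisation for each column for frames of mixed dtype ([GH 9037](https://github.com/pandas-dev/pandas/issues/9037))

    Previously, data was coerced to a common dtype before serialisation, which for example resulted in integers being serialised as floats: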

    ```py
    In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
    Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}' 
    ```

    Now each column is serialised using its correct dtype:

    ```py
    In [2]:  pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
    Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}' 
    ```

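+   `DatetimeIndex`, `PeriodIndex`, and `TimedeltaIndex.summary` now output the same format. ([GH 9116](https://github.com/pandas-dev/pandas/issues/9116))

+   `TimedeltaIndex.freqstr` now outputs the same string format as `DatetimeIndex`. ([GH 9116](https://github.com/pandas-dev/pandas/issues/9116))

+   Bar and horizontal bar plots no longer add a dashed line along the info axis. The prior style can be achieved with matplotlib's `axhline` or `axvline` methods ([GH 9088](https://github.com/pandas-dev/pandas/issues/9088)).

+   `Series` accessors `.dt`, `.cat`, and `.str` now raise `AttributeError` instead of `TypeError` if the series does not contain the appropriate type of data ([GH 9617](https://github.com/pandas-dev/pandas/issues/9617)). This follows Python's built-in exception hierarchy more closely and ensures that tests like `hasattr(s, 'cat')` are consistent on both Python 2 and 3.

+   `Series` now supports bitwise operations for integral types ([GH 9016](https://github.com/pandas-dev/pandas/issues/9016)). Previously, even if the input dtypes were integral, the output dtype was coerced to `bool`.

    Previous behavior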

    ```py
    In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
    Out[2]:
    a    True
    b    True
    c    True
    d    True
    dtype: bool 
    ```

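    New behavior. If the input dtypes are integral, the output dtype is also integral and the output values are the result of the bitwise operation.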

    ```py
    In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
    Out[2]:
    a    4
    b    5
    c    6
    d    7
    dtype: int64 
    ```

+   During division involving a `Series` or `DataFrame`, `0/0` and `0//0` now give `np.nan` instead of `np.inf`. ([GH 9144](https://github.com/pandas-dev/pandas/issues/9144), [GH 8445](https://github.com/pandas-dev/pandas/issues/8445))

    Previous behavior

    ```py
    In [2]: p = pd.Series([0, 1])

    In [3]: p / 0
    Out[3]:
    0    inf
    1    inf
    dtype: float64

    In [4]: p // 0
    Out[4]:
    0    inf
    1    inf
    dtype: float64 
    ```

    New behavior

    ```py
    In [38]: p = pd.Series([0, 1])

    In [39]: p / 0
    Out[39]: 
    0    NaN
    1    inf
    Length: 2, dtype: float64

    In [40]: p // 0
    Out[40]: 
    0    NaN
    1    inf
    Length: 2, dtype: float64 
    ```

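+   `Series.value_counts` and `Series.describe` for categorical data now put `NaN` entries at the end. ([GH 9443](https://github.com/pandas-dev/pandas/issues/9443))

+   `Series.describe` for categorical data now gives counts and frequencies of 0, not `NaN`, for unused categories. ([GH 9443](https://github.com/pandas-dev/pandas/issues/9443))

+   Due to a bug fix, looking up a partial string label with `DatetimeIndex.asof` now includes values that match the string, even if they are after the start of the partial string label ([GH 9258](https://github.com/pandas-dev/pandas/issues/9258)).

    Old behavior: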

    ```py
    In [4]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
    Out[4]: Timestamp('2000-01-31 00:00:00') 
    ```

    Fixed behavior:

    ```py
    In [41]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
    Out[41]: Timestamp('2000-02-28 00:00:00') 
    ```

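    To reproduce the old behavior, simply add more precision to the label (e.g., use `2000-02-01` instead of `2000-02`).

### Deprecations

+   The `rplot` trellis plotting interface is deprecated and will be removed in a future version. We refer users to external packages like [seaborn](http://stanford.edu/~mwaskom/software/seaborn/) for similar but more refined functionality ([GH 3445](https://github.com/pandas-dev/pandas/issues/3445)). The documentation includes some examples of how to convert existing code from `rplot` to seaborn [here](https://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html#trellis-plotting-interface).

+   The `pandas.sandbox.qtpandas` interface is deprecated and will be removed in a future version. We refer users to the external package [pandas-qt](https://github.com/datalyze-solutions/pandas-qt). ([GH 9615](https://github.com/pandas-dev/pandas/issues/9615))

+   The `pandas.rpy` interface is deprecated and will be removed in a future version. Similar functionality can be accessed through the [rpy2](http://rpy2.bitbucket.org/) project ([GH 9602](https://github.com/pandas-dev/pandas/issues/9602))

+   Adding a `DatetimeIndex/PeriodIndex` to another `DatetimeIndex/PeriodIndex` is deprecated as a set operation. This will be changed to a `TypeError` in a future version. `.union()` should be used for the union set operation. ([GH 9094](https://github.com/pandas-dev/pandas/issues/9094))

+   Subtracting a `DatetimeIndex/PeriodIndex` from another `DatetimeIndex/PeriodIndex` is deprecated as a set operation. This will be changed to an actual numeric subtraction yielding a `TimedeltaIndex` in a future version. `.difference()` should be used for the differencing set operation. ([GH 9094](https://github.com/pandas-dev/pandas/issues/9094))

### Removal of prior version deprecations/changes

+   `DataFrame.pivot_table` and `crosstab`'s `rows` and `cols` keyword arguments were removed in favor of `index` and `columns` ([GH 6581](https://github.com/pandas-dev/pandas/issues/6581))

+   `DataFrame.to_excel` and `DataFrame.to_csv`'s `cols` keyword argument was removed in favor of `columns` ([GH 6581](https://github.com/pandas-dev/pandas/issues/6581))

+   Removed `convert_dummies` in favor of `get_dummies` ([GH 6581](https://github.com/pandas-dev/pandas/issues/6581))

+   Removed `value_range` in favor of `describe` ([GH 6581](https://github.com/pandas-dev/pandas/issues/6581))

## Performance improvements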

+   Fixed a performance regression for `.loc` indexing with an array or list-like ([GH 9126](https://github.com/pandas-dev/pandas/issues/9126))

+   `DataFrame.to_json` 30x performance improvement for mixed dtype frames ([GH 9037](https://github.com/pandas-dev/pandas/issues/9037))

+   Performance improvements in `MultiIndex.duplicated` by working with labels instead of values ([GH 9125](https://github.com/pandas-dev/pandas/issues/9125))

+   Improved the speed of `nunique` by calling `unique` instead of `value_counts` ([GH 9129](https://github.com/pandas-dev/pandas/issues/9129), [GH 7771](https://github.com/pandas-dev/pandas/issues/7771))

+   Performance improvement of up to 10x in `DataFrame.count` and `DataFrame.dropna` by taking advantage of homogeneous/heterogeneous dtypes appropriately ([GH 9136](https://github.com/pandas-dev/pandas/issues/9136))

+   Performance improvement of up to 20x in `DataFrame.count` when using a `MultiIndex` and the `level` keyword argument ([GH 9163](https://github.com/pandas-dev/pandas/issues/9163))

+   Performance and memory usage improvements in `merge` when the key space exceeds `int64` bounds ([GH 9151](https://github.com/pandas-dev/pandas/issues/9151))

+   Performance improvements in multi-key `groupby` ([GH 9429](https://github.com/pandas-dev/pandas/issues/9429))

+   Performance improvements in `MultiIndex.sortlevel` ([GH 9445](https://github.com/pandas-dev/pandas/issues/9445))

+   Performance and memory usage improvements in `DataFrame.duplicated` ([GH 9398](https://github.com/pandas-dev/pandas/issues/9398))

+   Cythonized `Period` ([GH 9440](https://github.com/pandas-dev/pandas/issues/9440))

+   Decreased memory usage in `to_hdf` ([GH 9648](https://github.com/pandas-dev/pandas/issues/9648))

## Bug fixes

+   Changed `.to_html` to remove leading/trailing spaces in the table body ([GH 4987](https://github.com/pandas-dev/pandas/issues/4987))

+   Fixed an issue using `read_csv` on s3 with Python 3 ([GH 9452](https://github.com/pandas-dev/pandas/issues/9452))

+   Fixed a compatibility issue in `DatetimeIndex` affecting architectures where `numpy.int_` defaults to `numpy.int32` ([GH 8943](https://github.com/pandas-dev/pandas/issues/8943))

+   Bug in Panel indexing with an object-like ([GH 9140](https://github.com/pandas-dev/pandas/issues/9140))

+   Bug where the index of the returned `Series.dt.components` was reset to the default index ([GH 9247](https://github.com/pandas-dev/pandas/issues/9247))

+   Bug in `Categorical.__getitem__/__setitem__` with listlike input getting incorrect results from indexer coercion ([GH 9469](https://github.com/pandas-dev/pandas/issues/9469))

+   Bug in partial setting with a DatetimeIndex ([GH 9478](https://github.com/pandas-dev/pandas/issues/9478))

+   Bug in groupby for integer and datetime64 columns when applying an aggregator that caused the value to be changed when the number was sufficiently large ([GH 9311](https://github.com/pandas-dev/pandas/issues/9311), [GH 6620](https://github.com/pandas-dev/pandas/issues/6620))

+   Fixed bug in `to_sql` when mapping a `Timestamp` object column (datetime column with timezone info) to the appropriate sqlalchemy type ([GH 9085](https://github.com/pandas-dev/pandas/issues/9085)).

+   Fixed bug in `to_sql` where the `dtype` argument did not accept an instantiated SQLAlchemy type ([GH 9083](https://github.com/pandas-dev/pandas/issues/9083)).

+   Bug in `.loc` partial setting with a `np.datetime64` ([GH 9516](https://github.com/pandas-dev/pandas/issues/9516))

+   Incorrect dtypes inferred on datetime-like looking `Series` and on `.xs` slices ([GH 9477](https://github.com/pandas-dev/pandas/issues/9477))

+   Items in `Categorical.unique()` (and `s.unique()` if `s` is of dtype `category`) now appear in the order in which they are originally found, not in sorted order ([GH 9331](https://github.com/pandas-dev/pandas/issues/9331)). This is now consistent with the behavior for other dtypes in pandas.

+   Fixed bug on big endian platforms which produced incorrect results in `StataReader` ([GH 8688](https://github.com/pandas-dev/pandas/issues/8688)).

+   Fixed bug in `MultiIndex.has_duplicates` when having many levels caused an indexer overflow ([GH 9075](https://github.com/pandas-dev/pandas/issues/9075), [GH 5873](https://github.com/pandas-dev/pandas/issues/5873)).

+   Fixed bug in `pivot` and `unstack` where `nan` values would break index alignment ([GH 4862](https://github.com/pandas-dev/pandas/issues/4862), [GH 7401](https://github.com/pandas-dev/pandas/issues/7401), [GH 7403](https://github.com/pandas-dev/pandas/issues/7403), [GH 7405](https://github.com/pandas-dev/pandas/issues/7405), [GH 7466](https://github.com/pandas-dev/pandas/issues/7466), [GH 9497](https://github.com/pandas-dev/pandas/issues/9497)).

+   Fixed bug in left join on a MultiIndex with `sort=True` or null values ([GH 9210](https://github.com/pandas-dev/pandas/issues/9210)).

+   Fixed bug in `MultiIndex` where inserting new keys would fail ([GH 9250](https://github.com/pandas-dev/pandas/issues/9250)).

+   Fixed bug in `groupby` when the key space exceeds `int64` bounds ([GH 9096](https://github.com/pandas-dev/pandas/issues/9096)).

+   Fixed bug in `unstack` with a `TimedeltaIndex` or `DatetimeIndex` and nulls ([GH 9491](https://github.com/pandas-dev/pandas/issues/9491)).

+   Fixed an issue with `rank` where comparing floats with tolerance caused inconsistent behaviour ([GH 8365](https://github.com/pandas-dev/pandas/issues/8365)).

+   Fixed a character encoding bug in `read_stata` and `StataReader` when loading data from a URL ([GH 9231](https://github.com/pandas-dev/pandas/issues/9231)).

+   Bug where adding `offsets.Nano` to other offsets raised `TypeError` ([GH 9284](https://github.com/pandas-dev/pandas/issues/9284)).

+   Bug in `DatetimeIndex` iteration, related to ([GH 8890](https://github.com/pandas-dev/pandas/issues/8890)), fixed in ([GH 9100](https://github.com/pandas-dev/pandas/issues/9100)).

+   Bugs in `resample` around DST transitions. This required fixing offset classes so they behave correctly on DST transitions ([GH 5172](https://github.com/pandas-dev/pandas/issues/5172), [GH 8744](https://github.com/pandas-dev/pandas/issues/8744), [GH 8653](https://github.com/pandas-dev/pandas/issues/8653), [GH 9173](https://github.com/pandas-dev/pandas/issues/9173), [GH 9468](https://github.com/pandas-dev/pandas/issues/9468)).

+   Bug in binary operator method (e.g. `.mul()`) alignment with integer levels ([GH 9463](https://github.com/pandas-dev/pandas/issues/9463)).

+   Bug where boxplot, scatter, and hexbin plots could show an unnecessary warning ([GH 8877](https://github.com/pandas-dev/pandas/issues/8877)).

+   Bug where a subplot with the `layout` keyword could show an unnecessary warning ([GH 9464](https://github.com/pandas-dev/pandas/issues/9464))

+   Bug in using grouper functions that need passed-through arguments (e.g. axis) when using a wrapped function (e.g. `fillna`) ([GH 9221](https://github.com/pandas-dev/pandas/issues/9221))

+   `DataFrame` now properly supports simultaneous `copy` and `dtype` arguments in the constructor ([GH 9099](https://github.com/pandas-dev/pandas/issues/9099))

+   Bug in `read_csv` when using skiprows on a file with CR line endings with the c engine ([GH 9079](https://github.com/pandas-dev/pandas/issues/9079))

+   `isnull` now detects `NaT` in `PeriodIndex` ([GH 9129](https://github.com/pandas-dev/pandas/issues/9129))

+   Bug in groupby `.nth()` with a multiple-column groupby ([GH 8979](https://github.com/pandas-dev/pandas/issues/8979))

+   Bug where `DataFrame.where` and `Series.where` incorrectly coerced numerics to strings ([GH 9280](https://github.com/pandas-dev/pandas/issues/9280))

+   Bug where `DataFrame.where` and `Series.where` raised `ValueError` when a string list-like was passed ([GH 9280](https://github.com/pandas-dev/pandas/issues/9280))

+   Accessing `Series.str` methods on non-string values now raises `TypeError` instead of producing incorrect results ([GH 9184](https://github.com/pandas-dev/pandas/issues/9184))

+   Bug in `DatetimeIndex.__contains__` when the index has duplicates and is not monotonic increasing ([GH 9512](https://github.com/pandas-dev/pandas/issues/9512))

+   Fixed a division by zero error in `Series.kurt()` when all values are equal ([GH 9197](https://github.com/pandas-dev/pandas/issues/9197))

+   Fixed an issue in the `xlsxwriter` engine where it added a default 'General' format to cells if no other format was applied. This prevented other row or column formatting from being applied. ([GH 9167](https://github.com/pandas-dev/pandas/issues/9167))

+   Fixed an issue with `index_col=False` when `usecols` is also specified in `read_csv` ([GH 9082](https://github.com/pandas-dev/pandas/issues/9082))

+   Bug where `wide_to_long` would modify the input stub names list ([GH 9204](https://github.com/pandas-dev/pandas/issues/9204))

+   Bug where `to_sql` did not store float64 values using double precision ([GH 9009](https://github.com/pandas-dev/pandas/issues/9009))

+   `SparseSeries` and `SparsePanel` now accept zero-argument constructors (same as their non-sparse counterparts) ([GH 9272](https://github.com/pandas-dev/pandas/issues/9272)).

+   Regression in merging `Categorical` and `object` dtypes ([GH 9426](https://github.com/pandas-dev/pandas/issues/9426))

+   Bug in `read_csv` with buffer overflows on certain malformed input files ([GH 9205](https://github.com/pandas-dev/pandas/issues/9205))

+   Bug in groupby on a MultiIndex with a missing pair ([GH 9049](https://github.com/pandas-dev/pandas/issues/9049), [GH 9344](https://github.com/pandas-dev/pandas/issues/9344))

+   Fixed bug in `Series.groupby` where grouping on `MultiIndex` levels would ignore the `sort` argument ([GH 9444](https://github.com/pandas-dev/pandas/issues/9444))

+   Fixed bug in `DataFrame.Groupby` where `sort=False` was ignored in the case of Categorical columns ([GH 8868](https://github.com/pandas-dev/pandas/issues/8868))

+   Fixed bug where reading CSV files from Amazon S3 on Python 3 raised a TypeError ([GH 9452](https://github.com/pandas-dev/pandas/issues/9452))

+   Bug in the Google BigQuery reader where the 'jobComplete' key may be present but False in the query results ([GH 8728](https://github.com/pandas-dev/pandas/issues/8728))

+   Bug in `Series.value_counts` with excluding `NaN` for categorical type `Series` with `dropna=True` ([GH 9443](https://github.com/pandas-dev/pandas/issues/9443))

+   Fixed the missing `numeric_only` option for `DataFrame.std/var/sem` ([GH 9201](https://github.com/pandas-dev/pandas/issues/9201))

+   Support constructing `Panel` or `Panel4D` with scalar data ([GH 8285](https://github.com/pandas-dev/pandas/issues/8285))

+   `Series` text representation was disconnected from `max_rows`/`max_columns` ([GH 7508](https://github.com/pandas-dev/pandas/issues/7508)).

+   `Series` number formatting was inconsistent when truncated ([GH 8532](https://github.com/pandas-dev/pandas/issues/8532)).

    Previous behavior

    ```py
    In [2]: pd.options.display.max_rows = 10
    In [3]: s = pd.Series([1,1,1,1,1,1,1,1,1,1,0.9999,1,1]*10)
    In [4]: s
    Out[4]:
    0    1
    1    1
    2    1
    ...
    127    0.9999
    128    1.0000
    129    1.0000
    Length: 130, dtype: float64 
    ```

    New behavior

    ```py
    0      1.0000
    1      1.0000
    2      1.0000
    3      1.0000
    4      1.0000
    ...
    125    1.0000
    126    1.0000
    127    0.9999
    128    1.0000
    129    1.0000
    dtype: float64 
    ```

+   A spurious `SettingWithCopy` warning was generated in some cases when setting a new item in a frame ([GH 8730](https://github.com/pandas-dev/pandas/issues/8730))

    The following would previously report a `SettingWithCopy` warning.

    ```py
    In [42]: df1 = pd.DataFrame({'x': pd.Series(['a', 'b', 'c']),
     ....:                    'y': pd.Series(['d', 'e', 'f'])})
     ....: 

    In [43]: df2 = df1[['x']]

    In [44]: df2['y'] = ['g', 'h', 'i'] 
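    ```

## Contributors

A total of 60 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.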

+   Aaron Toth +

+   Alan Du +

+   Alessandro Amici +

+   Artemy Kolchinsky

+   Ashwini Chaudhary +

+   Ben Schiller

+   Bill Letson

+   Brandon Bradley +

+   Chau Hoang +

+   Chris Reynolds

+   Chris Whelan +

+   Christer van der Meeren +

+   David Cottrell +

+   David Stephens

+   Ehsan Azarnasab +

+   Garrett-R +

+   Guillaume Gay

+   Jake Torcasso +

+   Jason Sexauer

+   Jeff Reback

+   John McNamara

+   Joris Van den Bossche

+   Joschka zur Jacobsmühlen +

+   Juarez Bochi +

+   Junya Hayashi +

+   K.-Michael Aye

+   Kerby Shedden +

+   Kevin Sheppard

+   Kieran O’Mahony

+   Kodi Arfer +

+   Matti Airas +

+   Min RK +

+   Mortada Mehyar

+   Robert +

+   Scott E Lasley

+   Scott Lasley +

+   Sergio Pascual +

+   Skipper Seabold

+   Stephan Hoyer

+   Thomas Grainger

+   Tom Augspurger

+   TomAugspurger

+   Vladimir Filimonov +

+   Vyomkesh Tripathi +

+   Will Holmgren

+   Yulong Yang +

+   behzad nouri

+   bertrandhaut +

+   bjonen

+   cel4 +

+   clham

+   hsperr +

+   ischwabacher

+   jnmclarty

+   josham +

+   jreback

+   omtinez +

+   roch +

+   sinhrks

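+   unutbu

## New features

### DataFrame assign

Inspired by [dplyr](https://dplyr.tidyverse.org/articles/dplyr.html#mutating-operations)'s `mutate` verb, DataFrame has a new `assign()` method. The function signature for `assign` is simply `**kwargs`. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a `Series` or NumPy array) or a function of one argument to be called on the `DataFrame`. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.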

```py
In [1]: iris = pd.read_csv('data/iris.data')

In [2]: iris.head()
Out[2]: 
 SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

[5 rows x 5 columns]

In [3]: iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
Out[3]: 
 SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

[5 rows x 6 columns]
```

Above was an example of inserting a precomputed value. We can also pass in a function to be evaluated.

```py
In [4]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth']
 ...:                                   / x['SepalLength'])).head()
 ...: 
Out[4]: 
 SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

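[5 rows x 6 columns]
```

The power of `assign` comes when it is used in chains of operations. For example, we can limit the DataFrame to just those rows with a sepal length greater than 5, calculate the ratios, and plot the result.

```py
In [5]: iris = pd.read_csv('data/iris.data')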

In [6]: (iris.query('SepalLength > 5')
 ...:     .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
 ...:             PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
 ...:     .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
 ...: 
Out[6]: <Axes: xlabel='SepalRatio', ylabel='PetalRatio'>

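```

../_images/whatsnew_assign.png

See the documentation for more details. (GH 9229)

### Interaction with scipy.sparse

Added `SparseSeries.to_coo()` and `SparseSeries.from_coo()` methods (GH 8048) for converting to and from `scipy.sparse.coo_matrix` instances. For example, given a SparseSeries with a MultiIndex, we can convert it to a `scipy.sparse.coo_matrix` by specifying the row and column labels as index levels:

```py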

s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
                                     (1, 2, 'a', 1),
                                     (1, 1, 'b', 0),
                                     (1, 1, 'b', 1),
                                     (2, 1, 'b', 0),
                                     (2, 1, 'b', 1)],
                                    names=['A', 'B', 'C', 'D'])

s

# SparseSeries
ss = s.to_sparse()
ss

A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
                             column_levels=['C', 'D'],
                             sort_labels=False)

A
A.todense()
rows
columns

```

The `from_coo` method is a convenience method for creating a `SparseSeries` from a `scipy.sparse.coo_matrix`:

```py

from scipy import sparse
A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
                      shape=(3, 4))
A
A.todense()

ss = pd.SparseSeries.from_coo(A)
ss 
```

### String method enhancements

+   The following new methods are accessible via the `.str` accessor to apply the function to each value. This is intended to make it more consistent with the standard methods on strings. ([GH 9282](https://github.com/pandas-dev/pandas/issues/9282), [GH 9352](https://github.com/pandas-dev/pandas/issues/9352), [GH 9386](https://github.com/pandas-dev/pandas/issues/9386), [GH 9387](https://github.com/pandas-dev/pandas/issues/9387), [GH 9439](https://github.com/pandas-dev/pandas/issues/9439))

    |  |  | Methods |  |  |
    | --- | --- | --- | --- | --- |
    | `isalnum()` | `isalpha()` | `isdigit()` | `isdigit()` | `isspace()` |
    | `islower()` | `isupper()` | `istitle()` | `isnumeric()` | `isdecimal()` |
    | `find()` | `rfind()` | `ljust()` | `rjust()` | `zfill()` |

    ```py
    In [7]: s = pd.Series(['abcd', '3456', 'EFGH'])

    In [8]: s.str.isalpha()
    Out[8]: 
    0     True
    1    False
    2     True
    Length: 3, dtype: bool

    In [9]: s.str.find('ab')
    Out[9]: 
    0    0
    1   -1
    2   -1
    Length: 3, dtype: int64 
    ```

+   `Series.str.pad()` and `Series.str.center()` now accept a `fillchar` option to specify the filling character ([GH 9352](https://github.com/pandas-dev/pandas/issues/9352))

    ```py
    In [10]: s = pd.Series(['12', '300', '25'])

    In [11]: s.str.pad(5, fillchar='_')
    Out[11]: 
    0    ___12
    1    __300
    2    ___25
    Length: 3, dtype: object 
    ```

+   Added `Series.str.slice_replace()`, which previously raised `NotImplementedError` ([GH 8888](https://github.com/pandas-dev/pandas/issues/8888))

    ```py
    In [12]: s = pd.Series(['ABCD', 'EFGH', 'IJK'])

    In [13]: s.str.slice_replace(1, 3, 'X')
    Out[13]: 
    0    AXD
    1    EXH
    2     IX
    Length: 3, dtype: object

    # replaced with empty char
    In [14]: s.str.slice_replace(0, 1)
    Out[14]: 
    0    BCD
    1    FGH
    2     JK
    Length: 3, dtype: object 
    ```

### Other enhancements

+   Reindex now supports `method='nearest'` for frames or series with a monotonic increasing or decreasing index ([GH 9258](https://github.com/pandas-dev/pandas/issues/9258)):

    ```py
    In [15]: df = pd.DataFrame({'x': range(5)})

    In [16]: df.reindex([0.2, 1.8, 3.5], method='nearest')
    Out[16]: 
     x
    0.2  0
    1.8  2
    3.5  4

    [3 rows x 1 columns] 
    ```

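    This method is also exposed by the lower-level `Index.get_indexer` and `Index.get_loc` methods.

+   The `sheetname` argument of the `read_excel()` function now accepts a list and `None`, to get multiple or all sheets respectively. If more than one sheet is specified, a dictionary is returned. ([GH 9450](https://github.com/pandas-dev/pandas/issues/9450))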

    ```py
    # Returns the 1st and 4th sheet, as a dictionary of DataFrames.
    pd.read_excel('path_to_file.xls', sheetname=['Sheet1', 3]) 
    ```

+   Allow Stata files to be read incrementally with an iterator; support for long strings in Stata files. See the docs ([GH 9493](https://github.com/pandas-dev/pandas/issues/9493)).

+   Paths beginning with `~` will now be expanded to begin with the user's home directory ([GH 9066](https://github.com/pandas-dev/pandas/issues/9066))

+   Added time interval selection in `get_data_yahoo` ([GH 9071](https://github.com/pandas-dev/pandas/issues/9071))

+   Added `Timestamp.to_datetime64()` to complement `Timedelta.to_timedelta64()` ([GH 9255](https://github.com/pandas-dev/pandas/issues/9255))

+   `tseries.frequencies.to_offset()` now accepts `Timedelta` as input ([GH 9064](https://github.com/pandas-dev/pandas/issues/9064))

+   A lag parameter was added to the autocorrelation method of `Series`; it defaults to lag-1 autocorrelation ([GH 9192](https://github.com/pandas-dev/pandas/issues/9192))

+   `Timedelta` now accepts a `nanoseconds` keyword in its constructor ([GH 9273](https://github.com/pandas-dev/pandas/issues/9273))

+   SQL code now safely escapes table and column names ([GH 8986](https://github.com/pandas-dev/pandas/issues/8986))

+   Added auto-complete for `Series.str.<tab>`, `Series.dt.<tab>` and `Series.cat.<tab>` ([GH 9322](https://github.com/pandas-dev/pandas/issues/9322))

+   `Index.get_indexer` now supports `method='pad'` and `method='backfill'` for any target array, not just monotonic targets. These methods also work for monotonic decreasing as well as monotonic increasing indexes ([GH 9258](https://github.com/pandas-dev/pandas/issues/9258)).

+   `Index.asof` now works on all index types ([GH 9258](https://github.com/pandas-dev/pandas/issues/9258)).

+   A `verbose` argument has been added to `io.read_excel()`, defaulting to False. Set it to True to print sheet names as they are parsed. ([GH 9450](https://github.com/pandas-dev/pandas/issues/9450))

+   Added a `days_in_month` (compatibility alias `daysinmonth`) property to `Timestamp`, `DatetimeIndex`, `Period`, `PeriodIndex`, and `Series.dt` ([GH 9572](https://github.com/pandas-dev/pandas/issues/9572))

+   Added a `decimal` option in `to_csv` to provide formatting for non-'.' decimal separators ([GH 781](https://github.com/pandas-dev/pandas/issues/781))

+   Added a `normalize` option for `Timestamp` to normalize to midnight ([GH 8794](https://github.com/pandas-dev/pandas/issues/8794))

+   Added an example of importing a `DataFrame` into R using an HDF5 file and the `rhdf5` library. See the documentation for more details ([GH 9636](https://github.com/pandas-dev/pandas/issues/9636)).

### DataFrame assign

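Inspired by the `mutate` verb of [dplyr](https://dplyr.tidyverse.org/articles/dplyr.html#mutating-operations), DataFrame has a new `assign()` method. The function signature for `assign` is simply `**kwargs`. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a `Series` or NumPy array), or a function of one argument to be called on the `DataFrame`. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.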
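As a minimal, self-contained sketch of the two ways to supply a value, the snippet below uses a small hand-made frame. That frame is a hypothetical stand-in for the iris data, not part of the original notes.

```python
import pandas as pd

# Hypothetical stand-in for the iris data used in the surrounding examples.
df = pd.DataFrame(
    {"SepalLength": [5.1, 4.9, 6.3], "SepalWidth": [3.5, 3.0, 3.3]}
)

# A precomputed value: the Series is evaluated before assign() is called.
with_value = df.assign(sepal_ratio=df["SepalWidth"] / df["SepalLength"])

# A callable: assign() calls it with the DataFrame it is operating on,
# so it can refer to the filtered frame in the middle of a method chain.
with_callable = df.query("SepalLength > 5").assign(
    sepal_ratio=lambda x: x["SepalWidth"] / x["SepalLength"]
)

print(with_value.columns.tolist())
print(len(with_callable))
```

Note that `assign` returns a new frame in both cases; `df` itself keeps only its original two columns.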

```py
In [1]: iris = pd.read_csv('data/iris.data')

In [2]: iris.head()
Out[2]: 
 SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

[5 rows x 5 columns]

In [3]: iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
Out[3]: 
 SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

[5 rows x 6 columns]

```

The above is an example of inserting a precomputed value. We can also pass in a function to be evaluated.

```py

In [4]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth']
 ...:                                   / x['SepalLength'])).head()
 ...: 
Out[4]: 
 SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

[5 rows x 6 columns]

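```

The power of `assign` comes when it is used in chains of operations. For example, we can limit the DataFrame to just those observations with a sepal length greater than 5, calculate the ratios, and plot the result.

```py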

In [5]: iris = pd.read_csv('data/iris.data')

In [6]: (iris.query('SepalLength > 5')
 ...:     .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
 ...:             PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
 ...:     .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
 ...: 
Out[6]: <Axes: xlabel='SepalRatio', ylabel='PetalRatio'>

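```

../_images/whatsnew_assign.png

See the documentation for more details. (GH 9229)

### Interaction with scipy.sparse

Added `SparseSeries.to_coo()` and `SparseSeries.from_coo()` methods for converting to and from `scipy.sparse.coo_matrix` instances. For example, given a SparseSeries with a MultiIndex, we can convert it to a `scipy.sparse.coo_matrix` by specifying the row and column labels as index levels:

```py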

s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
                                     (1, 2, 'a', 1),
                                     (1, 1, 'b', 0),
                                     (1, 1, 'b', 1),
                                     (2, 1, 'b', 0),
                                     (2, 1, 'b', 1)],
                                    names=['A', 'B', 'C', 'D'])

s

# SparseSeries
ss = s.to_sparse()
ss

A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
                             column_levels=['C', 'D'],
                             sort_labels=False)

A
A.todense()
rows
columns

```

The `from_coo` method is a convenience method for creating a `SparseSeries` from a `scipy.sparse.coo_matrix`:

```py

from scipy import sparse
A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
                      shape=(3, 4))
A
A.todense()

ss = pd.SparseSeries.from_coo(A)
ss

```

### String method enhancements

The following new methods are accessible via the `.str` accessor to apply the function to each value. This is intended to make it more consistent with the standard methods on strings. (GH 9282, GH 9352, GH 9386, GH 9387, GH 9439)

|  |  | Methods |  |  |
| --- | --- | --- | --- | --- |
| `isalnum()` | `isalpha()` | `isdigit()` | `isdigit()` | `isspace()` |
| `islower()` | `isupper()` | `istitle()` | `isnumeric()` | `isdecimal()` |
| `find()` | `rfind()` | `ljust()` | `rjust()` | `zfill()` |

```py
In [7]: s = pd.Series(['abcd', '3456', 'EFGH'])

In [8]: s.str.isalpha()
Out[8]: 
0     True
1    False
2     True
Length: 3, dtype: bool

In [9]: s.str.find('ab')
Out[9]: 
0    0
1   -1
2   -1
Length: 3, dtype: int64
```

`Series.str.pad()` and `Series.str.center()` now accept a `fillchar` option to specify the filling character (GH 9352)

```py

In [10]: s = pd.Series(['12', '300', '25'])

In [11]: s.str.pad(5, fillchar='_')
Out[11]: 
0    ___12
1    __300
2    ___25
Length: 3, dtype: object

```

Added `Series.str.slice_replace()`, which previously raised `NotImplementedError` (GH 8888)

```py

In [12]: s = pd.Series(['ABCD', 'EFGH', 'IJK'])

In [13]: s.str.slice_replace(1, 3, 'X')
Out[13]: 
0    AXD
1    EXH
2     IX
Length: 3, dtype: object

# replaced with empty char
In [14]: s.str.slice_replace(0, 1)
Out[14]: 
0    BCD
1    FGH
2     JK
Length: 3, dtype: object

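```

### Other enhancements

+   Reindex now supports `method='nearest'` for frames or series with a monotonic increasing or decreasing index (GH 9258):

    ```py
    In [15]: df = pd.DataFrame({'x': range(5)})

    In [16]: df.reindex([0.2, 1.8, 3.5], method='nearest')
    Out[16]: 
         x
    0.2  0
    1.8  2
    3.5  4

    [3 rows x 1 columns]
    ```

    This method is also exposed by the lower-level `Index.get_indexer` and `Index.get_loc` methods.

+   The `sheetname` argument of the `read_excel()` function now accepts a list and `None`, to get multiple or all sheets respectively. If more than one sheet is specified, a dictionary is returned. (GH 9450)

    ```py
    # Returns the 1st and 4th sheet, as a dictionary of DataFrames.
    pd.read_excel('path_to_file.xls', sheetname=['Sheet1', 3])
    ```

+   Allow Stata files to be read incrementally with an iterator; support for long strings in Stata files. See the docs (GH 9493).

+   Paths beginning with `~` will now be expanded to begin with the user's home directory (GH 9066)

+   Added time interval selection in `get_data_yahoo` (GH 9071)

+   Added `Timestamp.to_datetime64()` to complement `Timedelta.to_timedelta64()` (GH 9255)

+   `tseries.frequencies.to_offset()` now accepts `Timedelta` as input (GH 9064)

+   A lag parameter was added to the autocorrelation method of `Series`; it defaults to lag-1 autocorrelation (GH 9192)

+   `Timedelta` now accepts a `nanoseconds` keyword in its constructor (GH 9273)

+   SQL code now safely escapes table and column names (GH 8986)

+   Added auto-complete for `Series.str.<tab>`, `Series.dt.<tab>` and `Series.cat.<tab>` (GH 9322)

+   `Index.get_indexer` now supports `method='pad'` and `method='backfill'` for any target array, not just monotonic targets. These methods also work for monotonic decreasing as well as monotonic increasing indexes (GH 9258).

+   `Index.asof` now works on all index types (GH 9258).

+   A `verbose` argument has been added to `io.read_excel()`, defaulting to False. Set it to True to print sheet names as they are parsed. (GH 9450)

+   Added a `days_in_month` (compatibility alias `daysinmonth`) property to `Timestamp`, `DatetimeIndex`, `Period`, `PeriodIndex`, and `Series.dt` (GH 9572)

+   Added a `decimal` option in `to_csv` to provide formatting for non-'.' decimal separators (GH 781)

+   Added a `normalize` option for `Timestamp` to normalize to midnight (GH 8794)

+   Added an example of importing a `DataFrame` into R using an HDF5 file and the `rhdf5` library. See the documentation for more details (GH 9636).

## Backwards incompatible API changes

### Changes in timedelta

In v0.15.0 a new scalar type `Timedelta` was introduced, which is a subclass of `datetime.timedelta`. The notes for that release mentioned an API change to the `.seconds` accessor. The intent was to provide a user-friendly set of accessors that give the "natural" value for that unit; for example, if you had a `Timedelta('1 day, 10:11:12')`, then `.seconds` would return 12. However, this is at odds with the definition of `datetime.timedelta`, which defines `.seconds` as `10 * 3600 + 11 * 60 + 12 == 36672`.

So in v0.16.0, we are restoring the API to match that of `datetime.timedelta`. The component values are still available through the `.components` accessor. This affects the `.seconds` and `.microseconds` accessors, and removes the `.hours`, `.minutes`, and `.milliseconds` accessors. These changes affect `TimedeltaIndex` and the Series `.dt` accessor as well. (GH 9185, GH 9139)

Previous behavior

```py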

In [2]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [3]: t.days
Out[3]: 1

In [4]: t.seconds
Out[4]: 12

In [5]: t.microseconds
Out[5]: 123

```

New behavior

```py

In [17]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [18]: t.days
Out[18]: 1

In [19]: t.seconds
Out[19]: 36672

In [20]: t.microseconds
Out[20]: 100123

```

Using `.components` allows full access to the individual components

```py

In [21]: t.components
Out[21]: Components(days=1, hours=10, minutes=11, seconds=12, milliseconds=100, microseconds=123, nanoseconds=0)

In [22]: t.components.seconds
Out[22]: 12 
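```

### Indexing changes

The behavior of a small subset of edge cases for using `.loc` has changed ([GH 8613](https://github.com/pandas-dev/pandas/issues/8613)). Furthermore, we have improved the content of the error messages that are raised:

+   Slicing with `.loc` where the start and/or stop bound is not found in the index is now allowed; this previously raised a `KeyError`. This makes the behavior the same as `.ix` in this case. This change applies only to slicing, not to indexing with a single label.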

    ```py
    In [23]: df = pd.DataFrame(np.random.randn(5, 4),
     ....:                  columns=list('ABCD'),
     ....:                  index=pd.date_range('20130101', periods=5))
     ....: 

    In [24]: df
    Out[24]: 
     A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401

    [5 rows x 4 columns]

    In [25]: s = pd.Series(range(5), [-2, -1, 1, 2, 3])

    In [26]: s
    Out[26]: 
    -2    0
    -1    1
     1    2
     2    3
     3    4
    Length: 5, dtype: int64 
    ```

    Previous behavior

    ```py
    In [4]: df.loc['2013-01-02':'2013-01-10']
    KeyError: 'stop bound [2013-01-10] is not in the [index]'

    In [6]: s.loc[-10:3]
    KeyError: 'start bound [-10] is not the [index]' 
    ```

    New behavior

    ```py
    In [27]: df.loc['2013-01-02':'2013-01-10']
    Out[27]: 
     A         B         C         D
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401

    [4 rows x 4 columns]

    In [28]: s.loc[-10:3]
    Out[28]: 
    -2    0
    -1    1
     1    2
     2    3
     3    4
    Length: 5, dtype: int64 
    ```

+   Allow slicing with float-like values on an integer index for `.ix`. Previously this was only enabled for `.loc`:

    Previous behavior

    ```py
    In [8]: s.ix[-1.0:2]
    TypeError: the slice start value [-1.0] is not a proper indexer for this index type (Int64Index) 
    ```

    New behavior

    ```py
    In [2]: s.ix[-1.0:2]
    Out[2]:
    -1    1
     1    2
     2    3
    dtype: int64 
    ```

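+   Provide a useful exception when indexing with a type that is invalid for the index when using `.loc`. For example, trying to use `.loc` with an integer (or a float) on an index of type `DatetimeIndex`, `PeriodIndex`, or `TimedeltaIndex`.

    Previous behavior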

    ```py
    In [4]: df.loc[2:3]
    KeyError: 'start bound [2] is not the [index]' 
    ```

    New behavior

    ```py
    In [4]: df.loc[2:3]
    TypeError: Cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with <type 'int'> keys 
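    ```

### Categorical changes

In prior versions, `Categoricals` that had an unspecified ordering (meaning no `ordered` keyword was passed) defaulted to ordered `Categoricals`. Going forward, the `ordered` keyword in the `Categorical` constructor will default to `False`. Ordering must now be explicit.

Furthermore, previously you *could* change the `ordered` attribute of a Categorical by just setting the attribute, e.g. `cat.ordered=True`; this is now deprecated and you should use `cat.as_ordered()` or `cat.as_unordered()`. By default these return a **new** object and do not modify the existing object. ([GH 9347](https://github.com/pandas-dev/pandas/issues/9347), [GH 9190](https://github.com/pandas-dev/pandas/issues/9190))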
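A short sketch of the explicit-ordering workflow described above, assuming a recent pandas release; the variable names are illustrative only.

```python
import pandas as pd

# A categorical Series is unordered unless ordering is requested explicitly.
s = pd.Series([0, 1, 2], dtype="category")

# as_ordered() returns a new Series; the original is left unchanged.
ordered = s.cat.as_ordered()

# as_unordered() goes the other way, again returning a new object.
unordered = ordered.cat.as_unordered()

# Ordering can also be requested up front in the Categorical constructor.
from_ctor = pd.Series(pd.Categorical([0, 1, 2], ordered=True))

print(s.cat.ordered, ordered.cat.ordered, unordered.cat.ordered)
print(from_ctor.cat.ordered)
```

Because both methods return new objects, the result has to be assigned back (for example `s = s.cat.as_ordered()`) for the change to take effect on `s`.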

Previous behavior

```py
In [3]: s = pd.Series([0, 1, 2], dtype='category')

In [4]: s
Out[4]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [5]: s.cat.ordered
Out[5]: True

In [6]: s.cat.ordered = False

In [7]: s
Out[7]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0, 1, 2]

```

New behavior

```py

In [29]: s = pd.Series([0, 1, 2], dtype='category')

In [30]: s
Out[30]: 
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0, 1, 2]

In [31]: s.cat.ordered
Out[31]: False

In [32]: s = s.cat.as_ordered()

In [33]: s
Out[33]: 
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0 < 1 < 2]

In [34]: s.cat.ordered
Out[34]: True

# you can set in the constructor of the Categorical
In [35]: s = pd.Series(pd.Categorical([0, 1, 2], ordered=True))

In [36]: s
Out[36]: 
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0 < 1 < 2]

In [37]: s.cat.ordered
Out[37]: True

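```

For ease of creation of series of categorical data, we have added the ability to pass keywords when calling `.astype()`. These are passed directly to the constructor.

```py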

In [54]: s = pd.Series(["a", "b", "c", "a"]).astype('category', ordered=True)

In [55]: s
Out[55]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]

In [56]: s = (pd.Series(["a", "b", "c", "a"])
   ....:        .astype('category', categories=list('abcdef'), ordered=False))

In [57]: s
Out[57]:
0    a
1    b
2    c
3    a
dtype: category
Categories (6, object): [a, b, c, d, e, f] 
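```

### Other API changes

+   `Index.duplicated` now returns `np.array(dtype=bool)` rather than an `Index(dtype=object)` containing `bool` values. ([GH 8875](https://github.com/pandas-dev/pandas/issues/8875))

+   `DataFrame.to_json` now returns accurate type serialisation for each column for frames of mixed dtype ([GH 9037](https://github.com/pandas-dev/pandas/issues/9037)).

    Previously, data was coerced to a common dtype before serialisation, which for example resulted in integers being serialised to floats: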

    ```py
    In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
    Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}' 
    ```

    Now each column is serialised using its correct dtype:

    ```py
    In [2]:  pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
    Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}' 
    ```

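+   `DatetimeIndex`, `PeriodIndex` and `TimedeltaIndex.summary` now output the same format. ([GH 9116](https://github.com/pandas-dev/pandas/issues/9116))

+   `TimedeltaIndex.freqstr` now outputs the same string format as `DatetimeIndex`.

+   Bar and horizontal bar plots no longer add a dashed line along the info axis. The prior style can be achieved with matplotlib's `axhline` or `axvline` methods ([GH 9088](https://github.com/pandas-dev/pandas/issues/9088)).

+   `Series` accessors `.dt`, `.cat` and `.str` now raise `AttributeError` instead of `TypeError` if the series does not contain the appropriate type of data ([GH 9617](https://github.com/pandas-dev/pandas/issues/9617)). This follows Python's built-in exception hierarchy more closely and ensures that tests like `hasattr(s, 'cat')` are consistent on both Python 2 and 3.

+   `Series` now supports bitwise operations for integral types ([GH 9016](https://github.com/pandas-dev/pandas/issues/9016)). Previously, even if the input dtypes were integral, the output dtype was coerced to `bool`.

    Previous behavior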

    ```py
    In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
    Out[2]:
    a    True
    b    True
    c    True
    d    True
    dtype: bool 
    ```

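    New behavior. If the input dtypes are integral, the output dtype is also integral and the output values are the result of the bitwise operation.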

    ```py
    In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
    Out[2]:
    a    4
    b    5
    c    6
    d    7
    dtype: int64 
    ```

+   During division involving a `Series` or `DataFrame`, `0/0` and `0//0` now give `np.nan` instead of `np.inf`. ([GH 9144](https://github.com/pandas-dev/pandas/issues/9144), [GH 8445](https://github.com/pandas-dev/pandas/issues/8445))

    Previous behavior

    ```py
    In [2]: p = pd.Series([0, 1])

    In [3]: p / 0
    Out[3]:
    0    inf
    1    inf
    dtype: float64

    In [4]: p // 0
    Out[4]:
    0    inf
    1    inf
    dtype: float64 
    ```

    New behavior

    ```py
    In [38]: p = pd.Series([0, 1])

    In [39]: p / 0
    Out[39]: 
    0    NaN
    1    inf
    Length: 2, dtype: float64

    In [40]: p // 0
    Out[40]: 
    0    NaN
    1    inf
    Length: 2, dtype: float64 
    ```

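+   `Series.value_counts` and `Series.describe` for categorical data will now put `NaN` entries at the end. ([GH 9443](https://github.com/pandas-dev/pandas/issues/9443))

+   `Series.describe` for categorical data will now give counts and frequencies of 0, not `NaN`, for unused categories. ([GH 9443](https://github.com/pandas-dev/pandas/issues/9443))

+   Due to a bug fix, looking up a partial string label with `DatetimeIndex.asof` now includes values that match the string, even if they are after the start of the partial string label. ([GH 9258](https://github.com/pandas-dev/pandas/issues/9258))

    Old behavior: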

    ```py
    In [4]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
    Out[4]: Timestamp('2000-01-31 00:00:00') 
    ```

    Fixed behavior:

    ```py
    In [41]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
    Out[41]: Timestamp('2000-02-28 00:00:00') 
    ```

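    To reproduce the old behavior, simply add more precision to the label (e.g., use `2000-02-01` instead of `2000-02`).

### Deprecations

+   The `rplot` trellis plotting interface is deprecated and will be removed in a future version. We refer to external packages like [seaborn](http://stanford.edu/~mwaskom/software/seaborn/) for similar but more refined functionality ([GH 3445](https://github.com/pandas-dev/pandas/issues/3445)). The documentation includes some examples of how to convert your existing code from `rplot` to seaborn [here](https://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html#trellis-plotting-interface).

+   The `pandas.sandbox.qtpandas` interface is deprecated and will be removed in a future version. We refer users to the external package [pandas-qt](https://github.com/datalyze-solutions/pandas-qt). ([GH 9615](https://github.com/pandas-dev/pandas/issues/9615))

+   The `pandas.rpy` interface is deprecated and will be removed in a future version. Similar functionality can be accessed through the [rpy2](http://rpy2.bitbucket.org/) project. ([GH 9602](https://github.com/pandas-dev/pandas/issues/9602))

+   Adding a `DatetimeIndex/PeriodIndex` to another `DatetimeIndex/PeriodIndex` is being deprecated as a set operation. This will be changed to a `TypeError` in a future version. `.union()` should be used for the union set operation. ([GH 9094](https://github.com/pandas-dev/pandas/issues/9094))

+   Subtracting a `DatetimeIndex/PeriodIndex` from another `DatetimeIndex/PeriodIndex` is being deprecated as a set operation. This will be changed to an actual numeric subtraction yielding a `TimedeltaIndex` in a future version. `.difference()` should be used for the differencing set operation. ([GH 9094](https://github.com/pandas-dev/pandas/issues/9094))

### Removal of prior version deprecations/changes

+   The `rows` and `cols` keyword arguments of `DataFrame.pivot_table` and `crosstab` were removed in favor of `index` and `columns`. ([GH 6581](https://github.com/pandas-dev/pandas/issues/6581))

+   The `cols` keyword argument of `DataFrame.to_excel` and `DataFrame.to_csv` was removed in favor of `columns`. ([GH 6581](https://github.com/pandas-dev/pandas/issues/6581))

+   Removed `convert_dummies` in favor of `get_dummies`. ([GH 6581](https://github.com/pandas-dev/pandas/issues/6581))

+   Removed `value_range` in favor of `describe`. ([GH 6581](https://github.com/pandas-dev/pandas/issues/6581))

### Changes in timedelta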

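In v0.15.0 a new scalar type `Timedelta` was introduced, which is a subclass of `datetime.timedelta`. The notes for that release mentioned an API change to the `.seconds` accessor. The intent was to provide a user-friendly set of accessors that give the "natural" value for that unit; for example, if you had a `Timedelta('1 day, 10:11:12')`, then `.seconds` would return 12. However, this is at odds with the definition of `datetime.timedelta`, which defines `.seconds` as `10 * 3600 + 11 * 60 + 12 == 36672`.

So in v0.16.0, we are restoring the API to match that of `datetime.timedelta`. The component values are still available through the `.components` accessor. This affects the `.seconds` and `.microseconds` accessors, and removes the `.hours`, `.minutes`, and `.milliseconds` accessors. These changes affect `TimedeltaIndex` and the Series `.dt` accessor as well. ([GH 9185](https://github.com/pandas-dev/pandas/issues/9185), [GH 9139](https://github.com/pandas-dev/pandas/issues/9139))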
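To make the difference concrete, the sketch below compares the accessors with the standard library's `datetime.timedelta` and with `.components`. It assumes a pandas version at or after this change; the variable names are illustrative.

```python
import datetime

import pandas as pd

t = pd.Timedelta("1 day, 10:11:12.100123")

# The same duration expressed with the standard library type.
std = datetime.timedelta(
    days=1, hours=10, minutes=11, seconds=12, microseconds=100123
)

# .seconds and .microseconds follow datetime.timedelta semantics:
# seconds within the day, and microseconds within the second.
print(t.seconds, std.seconds)
print(t.microseconds, std.microseconds)

# The "natural" per-unit values live on .components instead.
c = t.components
print(c.hours, c.minutes, c.seconds, c.milliseconds, c.microseconds)
```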

Previous behavior

```py
In [2]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [3]: t.days
Out[3]: 1

In [4]: t.seconds
Out[4]: 12

In [5]: t.microseconds
Out[5]: 123

```

New behavior

```py

In [17]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [18]: t.days
Out[18]: 1

In [19]: t.seconds
Out[19]: 36672

In [20]: t.microseconds
Out[20]: 100123

```

Using `.components` allows full access to the individual components

```py

In [21]: t.components
Out[21]: Components(days=1, hours=10, minutes=11, seconds=12, milliseconds=100, microseconds=123, nanoseconds=0)

In [22]: t.components.seconds
Out[22]: 12

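```

### Indexing changes

The behavior of a small subset of edge cases for using `.loc` has changed (GH 8613). Furthermore, we have improved the content of the error messages that are raised:

Slicing with `.loc` where the start and/or stop bound is not found in the index is now allowed; this previously raised a `KeyError`. This makes the behavior the same as `.ix` in this case. This change applies only to slicing, not to indexing with a single label.

```py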

In [23]: df = pd.DataFrame(np.random.randn(5, 4),
 ....:                  columns=list('ABCD'),
 ....:                  index=pd.date_range('20130101', periods=5))
 ....: 

In [24]: df
Out[24]: 
 A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

[5 rows x 4 columns]

In [25]: s = pd.Series(range(5), [-2, -1, 1, 2, 3])

In [26]: s
Out[26]: 
-2    0
-1    1
 1    2
 2    3
 3    4
Length: 5, dtype: int64

```

Previous behavior

```py

In [4]: df.loc['2013-01-02':'2013-01-10']
KeyError: 'stop bound [2013-01-10] is not in the [index]'

In [6]: s.loc[-10:3]
KeyError: 'start bound [-10] is not the [index]'

```

New behavior

```py

In [27]: df.loc['2013-01-02':'2013-01-10']
Out[27]: 
 A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

[4 rows x 4 columns]

In [28]: s.loc[-10:3]
Out[28]: 
-2    0
-1    1
 1    2
 2    3
 3    4
Length: 5, dtype: int64

```

Allow slicing with float-like values on an integer index for `.ix`. Previously this was only enabled for `.loc`:

Previous behavior

```py

In [8]: s.ix[-1.0:2]
TypeError: the slice start value [-1.0] is not a proper indexer for this index type (Int64Index)

```

New behavior

```py

In [2]: s.ix[-1.0:2]
Out[2]:
-1    1
 1    2
 2    3
dtype: int64

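```

Provide a useful exception when indexing with a type that is invalid for the index when using `.loc`. For example, trying to use `.loc` with an integer (or a float) on an index of type `DatetimeIndex`, `PeriodIndex`, or `TimedeltaIndex`.

Previous behavior

```py
In [4]: df.loc[2:3]
KeyError: 'start bound [2] is not the [index]'
```

New behavior

```py
In [4]: df.loc[2:3]
TypeError: Cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with <type 'int'> keys
```

### Categorical changes

In prior versions, `Categoricals` that had an unspecified ordering (meaning no `ordered` keyword was passed) defaulted to ordered `Categoricals`. Going forward, the `ordered` keyword in the `Categorical` constructor will default to `False`. Ordering must now be explicit.

Furthermore, previously you could change the `ordered` attribute of a Categorical by just setting the attribute, e.g. `cat.ordered=True`; this is now deprecated and you should use `cat.as_ordered()` or `cat.as_unordered()`. By default these return a new object and do not modify the existing object. (GH 9347, GH 9190)

Previous behavior

```py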

In [3]: s = pd.Series([0, 1, 2], dtype='category')

In [4]: s
Out[4]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [5]: s.cat.ordered
Out[5]: True

In [6]: s.cat.ordered = False

In [7]: s
Out[7]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0, 1, 2]

```

New behavior

```py

In [29]: s = pd.Series([0, 1, 2], dtype='category')

In [30]: s
Out[30]: 
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0, 1, 2]

In [31]: s.cat.ordered
Out[31]: False

In [32]: s = s.cat.as_ordered()

In [33]: s
Out[33]: 
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0 < 1 < 2]

In [34]: s.cat.ordered
Out[34]: True

# you can set in the constructor of the Categorical
In [35]: s = pd.Series(pd.Categorical([0, 1, 2], ordered=True))

In [36]: s
Out[36]: 
0    0
1    1
2    2
Length: 3, dtype: category
Categories (3, int64): [0 < 1 < 2]

In [37]: s.cat.ordered
Out[37]: True

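```

For ease of creation of series of categorical data, we have added the ability to pass keywords when calling `.astype()`. These are passed directly to the constructor.

```py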

In [54]: s = pd.Series(["a", "b", "c", "a"]).astype('category', ordered=True)

In [55]: s
Out[55]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]

In [56]: s = (pd.Series(["a", "b", "c", "a"])
   ....:        .astype('category', categories=list('abcdef'), ordered=False))

In [57]: s
Out[57]:
0    a
1    b
2    c
3    a
dtype: category
Categories (6, object): [a, b, c, d, e, f]

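```

### Other API changes

+   `Index.duplicated` now returns `np.array(dtype=bool)` rather than an `Index(dtype=object)` containing `bool` values. (GH 8875)

+   `DataFrame.to_json` now returns accurate type serialisation for each column for frames of mixed dtype (GH 9037).

    Previously, data was coerced to a common dtype before serialisation, which for example resulted in integers being serialised to floats:

    ```py
    In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
    Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'
    ```

    Now each column is serialised using its correct dtype:

    ```py
    In [2]:  pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
    Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'
    ```

+   `DatetimeIndex`, `PeriodIndex` and `TimedeltaIndex.summary` now output the same format. (GH 9116)

+   `TimedeltaIndex.freqstr` now outputs the same string format as `DatetimeIndex`. (GH 9116)

+   Bar and horizontal bar plots no longer add a dashed line along the info axis. The prior style can be achieved with matplotlib's `axhline` or `axvline` methods (GH 9088).

+   `Series` accessors `.dt`, `.cat` and `.str` now raise `AttributeError` instead of `TypeError` if the series does not contain the appropriate type of data (GH 9617). This follows Python's built-in exception hierarchy more closely and ensures that tests like `hasattr(s, 'cat')` are consistent on both Python 2 and 3.

+   `Series` now supports bitwise operations for integral types (GH 9016). Previously, even if the input dtypes were integral, the output dtype was coerced to `bool`.

Previous behavior

```py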

In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
Out[2]:
a    True
b    True
c    True
d    True
dtype: bool

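```

New behavior. If the input dtypes are integral, the output dtype is also integral and the output values are the result of the bitwise operation.

```py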

In [2]: pd.Series([0, 1, 2, 3], list('abcd')) | pd.Series([4, 4, 4, 4], list('abcd'))
Out[2]:
a    4
b    5
c    6
d    7
dtype: int64

```

During division involving a `Series` or `DataFrame`, `0/0` and `0//0` now give `np.nan` instead of `np.inf`. (GH 9144, GH 8445)

Previous behavior

```py

In [2]: p = pd.Series([0, 1])

In [3]: p / 0
Out[3]:
0    inf
1    inf
dtype: float64

In [4]: p // 0
Out[4]:
0    inf
1    inf
dtype: float64

```

New behavior

```py

In [38]: p = pd.Series([0, 1])

In [39]: p / 0
Out[39]: 
0    NaN
1    inf
Length: 2, dtype: float64

In [40]: p // 0
Out[40]: 
0    NaN
1    inf
Length: 2, dtype: float64

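```

+   `Series.value_counts` and `Series.describe` for categorical data will now put `NaN` entries at the end. (GH 9443)

+   `Series.describe` for categorical data will now give counts and frequencies of 0, not `NaN`, for unused categories (GH 9443).

+   Due to a bug fix, looking up a partial string label with `DatetimeIndex.asof` now includes values that match the string, even if they are after the start of the partial string label (GH 9258).

    Old behavior:

    ```py
    In [4]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
    Out[4]: Timestamp('2000-01-31 00:00:00')
    ```

    Fixed behavior:

    ```py
    In [41]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02')
    Out[41]: Timestamp('2000-02-28 00:00:00')
    ```

    To reproduce the old behavior, simply add more precision to the label (e.g., use `2000-02-01` instead of `2000-02`).

### Deprecations

+   The `rplot` trellis plotting interface is deprecated and will be removed in a future version. We refer to external packages like seaborn for similar but more refined functionality (GH 3445). The documentation includes some examples of how to convert your existing code from `rplot` to seaborn.

+   The `pandas.sandbox.qtpandas` interface is deprecated and will be removed in a future version. We refer users to the external package pandas-qt. (GH 9615)

+   The `pandas.rpy` interface is deprecated and will be removed in a future version. Similar functionality can be accessed through the rpy2 project. (GH 9602)

+   Adding a `DatetimeIndex/PeriodIndex` to another `DatetimeIndex/PeriodIndex` is being deprecated as a set operation. This will be changed to a `TypeError` in a future version. `.union()` should be used for the union set operation. (GH 9094)

+   Subtracting a `DatetimeIndex/PeriodIndex` from another `DatetimeIndex/PeriodIndex` is being deprecated as a set operation. This will be changed to an actual numeric subtraction yielding a `TimedeltaIndex` in a future version. `.difference()` should be used for the differencing set operation. (GH 9094)

### Removal of prior version deprecations/changes

+   The `rows` and `cols` keyword arguments of `DataFrame.pivot_table` and `crosstab` were removed in favor of `index` and `columns` (GH 6581)

+   The `cols` keyword argument of `DataFrame.to_excel` and `DataFrame.to_csv` was removed in favor of `columns` (GH 6581)

+   Removed `convert_dummies` in favor of `get_dummies` (GH 6581)

+   Removed `value_range` in favor of `describe` (GH 6581)

## Performance improvements

+   Fixed a performance regression for `.loc` indexing with an array or list-like (GH 9126).

+   `DataFrame.to_json` is 30x faster for mixed dtype frames. (GH 9037)

+   Performance improvements in `MultiIndex.duplicated` by working with labels instead of values (GH 9125)

+   Improved the speed of `nunique` by calling `unique` instead of `value_counts` (GH 9129, GH 7771)

+   Performance improvement of up to 10x in `DataFrame.count` and `DataFrame.dropna` by taking advantage of homogeneous/heterogeneous dtypes appropriately (GH 9136)

+   Performance improvement of up to 20x in `DataFrame.count` when using a `MultiIndex` and the `level` keyword argument (GH 9163)

+   Performance and memory usage improvements in `merge` when the key space exceeds `int64` bounds (GH 9151)

+   Performance improvements in multi-key `groupby` (GH 9429)

+   Performance improvements in `MultiIndex.sortlevel` (GH 9445)

+   Performance and memory usage improvements in `DataFrame.duplicated` (GH 9398)

+   Cythonized `Period` (GH 9440)

+   Decreased memory usage on `to_hdf` (GH 9648)

## Bug fixes

+   Changed `.to_html` to remove leading/trailing spaces in the table body (GH 4987)

+   Fixed an issue using `read_csv` on s3 with Python 3 (GH 9452)

+   Fixed a compatibility issue in `DatetimeIndex` affecting architectures where `numpy.int_` defaults to `numpy.int32` (GH 8943)

+   Bug in Panel indexing with an object-like (GH 9140)

+   Bug where the returned `Series.dt.components` index was reset to the default index (GH 9247)

+   Bug in `Categorical.__getitem__/__setitem__` with list-like input getting incorrect results from indexer coercion (GH 9469)

+   Bug in partial setting with a DatetimeIndex (GH 9478)

+   Bug in groupby for integer and datetime64 columns when applying an aggregator that caused the value to be changed when the number was sufficiently large (GH 9311, GH 6620)

+   Fixed a bug in `to_sql` when mapping a `Timestamp` object column (datetime column with timezone info) to the appropriate SQLAlchemy type (GH 9085)

+   Fixed a bug in `to_sql` where the `dtype` argument did not accept an instantiated SQLAlchemy type (GH 9083)

+   Bug in `.loc` partial setting with a `np.datetime64` (GH 9516)

+   Incorrect dtypes inferred on datetime-like looking `Series` and on `.xs` slices (GH 9477)

+   Items in `Categorical.unique()` (and `s.unique()` if `s` is of dtype `category`) now appear in the order in which they are originally found, not in sorted order (GH 9331). This is now consistent with the behavior for other dtypes in pandas.

+   Fixed a bug on big-endian platforms which produced incorrect results in `StataReader` (GH 8688)

+   Bug in `MultiIndex.has_duplicates` where having many levels caused an indexer overflow (GH 9075, GH 5873)

+   Bug in `pivot` and `unstack` where `nan` values would break index alignment (GH 4862, GH 7401, GH 7403, GH 7405, GH 7466, GH 9497)

+   Bug in left `join` on a MultiIndex with `sort=True` or null values (GH 9210)

+   Bug in `MultiIndex` where inserting new keys would fail (GH 9250)

+   Bug in `groupby` when the key space exceeds `int64` bounds (GH 9096)

+   Bug in `unstack` with a `TimedeltaIndex` or `DatetimeIndex` and nulls (GH 9491)

+   Bug in `rank` where comparing floats with tolerance caused inconsistent behavior (GH 8365)

+   Fixed a character encoding bug in `read_stata` and `StataReader` when loading data from a URL (GH 9231)

+   Bug where adding `offsets.Nano` to other offsets raised a `TypeError` (GH 9284)

+   Bug in `DatetimeIndex` iteration, related to (GH 8890), fixed in (GH 9100)

+   Bugs in `resample` around DST transitions. This required fixing offset classes so they behave correctly on DST transitions. (GH 5172, GH 8744, GH 8653, GH 9173, GH 9468)

+   Bug in binary operator method (e.g. `.mul()`) alignment with integer levels (GH 9463)

+   Bug where boxplot, scatter and hexbin plots could show an unnecessary warning (GH 8877)

+   Bug where subplot with the `layout` keyword could show an unnecessary warning (GH 9464)

+   Bug in using grouper functions that need passed-through arguments (e.g. axis) when using a wrapped function (e.g. `fillna`) (GH 9221)

+   `DataFrame` now properly supports simultaneous `copy` and `dtype` arguments in the constructor (GH 9099)

+   Bug in `read_csv` when using `skiprows` on a file with CR line endings with the c engine. (GH 9079)

+   `isnull` now detects `NaT` in `PeriodIndex` (GH 9129)

+   Bug in groupby `.nth()` with a multiple-column groupby (GH 8979)

+   Bug where `DataFrame.where` and `Series.where` incorrectly coerced numerics to strings (GH 9280)

+   Bug where `DataFrame.where` and `Series.where` raised a `ValueError` when a string list-like was passed. (GH 9280)

+   Accessing `Series.str` methods on non-string values now raises `TypeError` instead of producing incorrect results (GH 9184)

+   Bug in `DatetimeIndex.__contains__` when the index has duplicates and is not monotonic increasing (GH 9512)

+   Fixed a division-by-zero error for `Series.kurt()` when all values are equal (GH 9197)

+   Fixed an issue in the `xlsxwriter` engine where it added a default 'General' format to cells if no other format was applied. This prevented other row or column formatting from being applied. (GH 9167)

+   Fixed an issue with `index_col=False` when `usecols` is also specified in `read_csv`. (GH 9082)

+   Bug where `wide_to_long` would modify the input stub names list (GH 9204)

+   Bug in `to_sql` not storing float64 values using double precision. (GH 9009)

+   `SparseSeries` and `SparsePanel` now accept zero-argument constructors (same as their non-sparse counterparts) (GH 9272)

+   Regression in merging `Categorical` and `object` dtypes (GH 9426)

+   Bug in `read_csv` with buffer overflows on certain malformed input files. (GH 9205)

+   Bug in groupby MultiIndex with a missing pair (GH 9049, GH 9344)

+   Fixed a bug in `Series.groupby` where grouping on `MultiIndex` levels would ignore the sort argument (GH 9444)

+   Fixed a bug in `DataFrame.Groupby` where `sort=False` was ignored in the case of Categorical columns. (GH 8868)

+   Fixed a bug where reading CSV files from Amazon S3 on Python 3 raised a TypeError (GH 9452)

+   Bug in the Google BigQuery reader where the 'jobComplete' key may be present but False in the query results (GH 8728)

+   Bug in `Series.value_counts` excluding `NaN` for categorical type `Series` with `dropna=True` (GH 9443)

+   Fixed the missing `numeric_only` option for `DataFrame.std/var/sem` (GH 9201)

+   Support constructing a `Panel` or `Panel4D` with scalar data (GH 8285)

+   `Series` text representation was disconnected from `max_rows`/`max_columns` (GH 7508).

+   `Series` number formatting was inconsistent when truncated (GH 8532).

Previous behavior

```py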

In [2]: pd.options.display.max_rows = 10
In [3]: s = pd.Series([1,1,1,1,1,1,1,1,1,1,0.9999,1,1]*10)
In [4]: s
Out[4]:
0    1
1    1
2    1
...
127    0.9999
128    1.0000
129    1.0000
Length: 130, dtype: float64

New behavior

0      1.0000
1      1.0000
2      1.0000
3      1.0000
4      1.0000
...
125    1.0000
126    1.0000
127    0.9999
128    1.0000
129    1.0000
dtype: float64

A spurious SettingWithCopy warning was generated when setting a new item in a frame in some cases (GH 8730)

The following would previously report a SettingWithCopy warning.

In [42]: df1 = pd.DataFrame({'x': pd.Series(['a', 'b', 'c']),
 ....:                    'y': pd.Series(['d', 'e', 'f'])})
 ....: 

In [43]: df2 = df1[['x']]

In [44]: df2['y'] = ['g', 'h', 'i']
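
Several of the fixes listed in this section can be checked directly. The following is a minimal sketch, assuming a recent pandas installation; the variable names are illustrative and do not come from the release notes.

```python
import pandas as pd

# Categorical.unique() reports values in order of first appearance,
# not in sorted order (GH 9331).
s = pd.Series(["b", "a", "b", "c"], dtype="category")
appearance_order = list(s.unique())

# isnull detects NaT inside a PeriodIndex (GH 9129).
periods = pd.PeriodIndex(["2015-01", "NaT", "2015-03"], freq="M")
nat_mask = list(pd.isnull(periods))

# DataFrame.std accepts the numeric_only option (GH 9201).
frame = pd.DataFrame({"x": [1.0, 2.0, 3.0], "label": ["a", "b", "c"]})
numeric_std = frame.std(numeric_only=True)

print(appearance_order)
print(nat_mask)
print(numeric_std)
```

The non-numeric column is skipped by numeric_only=True, so only x appears in the result.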

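Contributors

A total of 60 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.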

Aaron Toth +
Alan Du +
Alessandro Amici +
Artemy Kolchinsky
Ashwini Chaudhary +
Ben Schiller
Bill Letson
Brandon Bradley +
Chau Hoang +
Chris Reynolds
Chris Whelan +
Christer van der Meeren +
David Cottrell +
David Stephens
Ehsan Azarnasab +
Garrett-R +
Guillaume Gay
Jake Torcasso +
Jason Sexauer
Jeff Reback
John McNamara
Joris Van den Bossche
Joschka zur Jacobsmühlen +
Juarez Bochi +
Junya Hayashi +
K.-Michael Aye
Kerby Shedden +
Kevin Sheppard
Kieran O’Mahony
Kodi Arfer +
Matti Airas +
Min RK +
Mortada Mehyar
Robert +
Scott E Lasley
Scott Lasley +
Sergio Pascual +
Skipper Seabold
Stephan Hoyer
Thomas Grainger
Tom Augspurger
TomAugspurger
Vladimir Filimonov +
Vyomkesh Tripathi +
Will Holmgren
Yulong Yang +
behzad nouri
bertrandhaut +
bjonen
cel4 +
clham
hsperr +
ischwabacher
jnmclarty
josham +
jreback
omtinez +
roch +
sinhrks
unutbu

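Version 0.15.2 (December 12, 2014)

Original: pandas.pydata.org/docs/whatsnew/v0.15.2.html

This is a minor release from 0.15.1 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. A small number of API changes were necessary to fix existing bugs. We recommend that all users upgrade to this version.

Enhancements
API changes
Performance improvements
Bug fixes

API changes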


Indexing in MultiIndex beyond lex-sort depth is now supported, though a lexically sorted index will have better performance. (GH 2646)

In [1]: df = pd.DataFrame({'jim':[0, 0, 1, 1],
 ...:                   'joe':['x', 'x', 'z', 'y'],
 ...:                   'jolie':np.random.rand(4)}).set_index(['jim', 'joe'])
 ...:

In [2]: df
Out[2]:
 jolie
jim joe
0   x    0.126970
 x    0.966718
1   z    0.260476
 y    0.897237

[4 rows x 1 columns]

In [3]: df.index.lexsort_depth
Out[3]: 1

# in prior versions this would raise a KeyError
# will now show a PerformanceWarning
In [4]: df.loc[(1, 'z')]
Out[4]:
 jolie
jim joe
1   z    0.260476

[1 rows x 1 columns]

# lexically sorting
In [5]: df2 = df.sort_index()

In [6]: df2
Out[6]:
 jolie
jim joe
0   x    0.126970
 x    0.966718
1   y    0.897237
 z    0.260476

[4 rows x 1 columns]

In [7]: df2.index.lexsort_depth
Out[7]: 2

In [8]: df2.loc[(1,'z')]
Out[8]:
 jolie
jim joe
1   z    0.260476

[1 rows x 1 columns]

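Bug in unique of Series with category dtype, which returned all categories regardless of whether they were "used" or not (see GH 8559 for the discussion). The previous behavior was to return all categories: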

In [3]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])

In [4]: cat
Out[4]:
[a, b, a]
Categories (3, object): [a < b < c]

In [5]: cat.unique()
Out[5]: array(['a', 'b', 'c'], dtype=object)

Now, only the categories that actually occur in the array are returned:

In [1]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])

In [2]: cat.unique()
Out[2]: 
['a', 'b']
Categories (3, object): ['a', 'b', 'c']

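Series.all and Series.any now support the level and skipna parameters. Series.all, Series.any, Index.all, and Index.any no longer support the out and keepdims parameters, which existed for compatibility with ndarray. Various index types no longer support the all and any aggregation functions and will now raise TypeError. (GH 8302).
Allow equality comparisons of Series with a categorical dtype and object dtype; previously these would raise TypeError (GH 8938)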
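
As a minimal sketch of the equality comparison described above (assuming a recent pandas installation; the series names are illustrative), a category-dtype Series can be compared position by position with an object-dtype Series:

```python
import pandas as pd

cat_series = pd.Series(["a", "b", "a"], dtype="category")
obj_series = pd.Series(["a", "c", "a"])

# Element-wise (positional) equality between category and object dtype.
equal_mask = cat_series == obj_series
print(equal_mask.tolist())
```

Only equality and inequality are covered here; ordering comparisons such as < between the two dtypes are a separate matter.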

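Bug in NDFrame: conflicting attribute/column names now behave consistently between getting and setting. Previously, when both a column and an attribute named y existed, data.y would return the attribute, while data.y = z would update the column (GH 8994)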

In [3]: data = pd.DataFrame({'x': [1, 2, 3]})

In [4]: data.y = 2

In [5]: data['y'] = [2, 4, 6]

In [6]: data
Out[6]: 
 x  y
0  1  2
1  2  4
2  3  6

[3 rows x 2 columns]

# this assignment was inconsistent
In [7]: data.y = 5

Old behavior:

In [6]: data.y
Out[6]: 2

In [7]: data['y'].values
Out[7]: array([5, 5, 5])

New behavior:

In [8]: data.y
Out[8]: 5

In [9]: data['y'].values
Out[9]: array([2, 4, 6])

Timestamp('now') is now equivalent to Timestamp.now() in that it returns the local time rather than UTC. Also, Timestamp('today') is now equivalent to Timestamp.today(), and both accept tz as a possible argument. (GH 9000)
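
The equivalence can be sketched as follows, assuming a recent pandas installation in which both spellings still return timezone-naive local time; the variable names are illustrative.

```python
import pandas as pd

now_from_string = pd.Timestamp("now")
now_from_method = pd.Timestamp.now()

# Both values are local times taken moments apart, so the gap is tiny.
gap = abs(now_from_method - now_from_string)

# tz is accepted in both spellings and yields timezone-aware values.
now_utc = pd.Timestamp("now", tz="UTC")
today_utc = pd.Timestamp.today(tz="UTC")

print(gap, now_utc.tzinfo, today_utc.tzinfo)
```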

Fix negative step support for label-based slices (GH 8753)

Old behavior:

In [1]: s = pd.Series(np.arange(3), ['a', 'b', 'c'])
Out[1]:
a    0
b    1
c    2
dtype: int64

In [2]: s.loc['c':'a':-1]
Out[2]:
c    2
dtype: int64

New behavior:

In [10]: s = pd.Series(np.arange(3), ['a', 'b', 'c'])

In [11]: s.loc['c':'a':-1]
Out[11]: 
c    2
b    1
a    0
Length: 3, dtype: int64

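Enhancements

Categorical enhancements:

Added ability to export Categorical data to Stata (GH 8633). See the Stata section of the pandas I/O documentation for limitations of categorical variables exported to Stata data files.
Added the order_categoricals flag to StataReader and read_stata to select whether to order imported categorical data (GH 8836). See the Stata section of the pandas I/O documentation for more information on importing categorical variables from Stata data files.
Added ability to export Categorical data to/from HDF5 (GH 7621). Queries work the same as if it were an object array. However, the category dtyped data is stored in a more efficient manner. See the HDF5 section of the pandas I/O documentation for an example and caveats with respect to prior versions of pandas.
Added support for searchsorted() on the Categorical class (GH 8420).

Other enhancements:

Added the ability to specify the SQL type of columns when writing a DataFrame to a database (GH 8778). For example, specifying to use the SQLAlchemy String type instead of the default Text type for string columns: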
```
from sqlalchemy.types import String
data.to_sql('data_dtype', engine, dtype={'Col_1': String})  # noqa F821 
```
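
The snippet above relies on an existing SQLAlchemy engine and DataFrame. As a self-contained sketch, the following swaps in the standard-library sqlite3 module (an assumption made for illustration, not the setup used in the release notes). With a plain sqlite3 connection, the dtype values are SQL type strings rather than SQLAlchemy types. The table and column names are illustrative.

```python
import sqlite3

import pandas as pd

data = pd.DataFrame({"Col_1": ["a", None, "c"], "Col_2": [1.5, 2.5, 3.5]})

conn = sqlite3.connect(":memory:")
try:
    # Override the default TEXT type for the string column.
    data.to_sql("data_dtype", conn, index=False, dtype={"Col_1": "VARCHAR(10)"})
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    column_types = {
        row[1]: row[2] for row in conn.execute("PRAGMA table_info(data_dtype)")
    }
    round_trip = pd.read_sql("SELECT * FROM data_dtype", conn)
finally:
    conn.close()

print(column_types)
```

The declared type of Col_1 in the created table is the string passed through dtype, while Col_2 keeps the type pandas infers for it.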

Series.all and Series.any now support the level and skipna parameters (GH 8302):

>>> s = pd.Series([False, True, False], index=[0, 0, 1])
>>> s.any(level=0)
0     True
1    False
dtype: bool
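
The level argument was later removed from Series.any and Series.all (in pandas 2.0). Assuming such a version, a group-wise reduction over the index level gives the same result; this is a sketch of the equivalent spelling, not part of the 0.15.2 release notes.

```python
import pandas as pd

s = pd.Series([False, True, False], index=[0, 0, 1])

# Reduce within each index label instead of passing level=0 to any().
any_by_level = s.groupby(level=0).any()
all_by_level = s.groupby(level=0).all()

print(any_by_level)
print(all_by_level)
```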

Panel now supports the all and any aggregation functions (GH 8302):

>>> p = pd.Panel(np.random.rand(2, 5, 4) > 0.1)
>>> p.all()
 0      1      2     3
0   True   True   True  True
1   True  False   True  True
2   True   True   True  True
3  False   True  False  True
4   True   True   True  True

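Added support for utcfromtimestamp(), fromtimestamp(), and combine() on the Timestamp class (GH 5351).
Added basic documentation for Google Analytics (pandas.io.ga) (GH 8835).
Timedelta arithmetic returns NotImplemented in unknown cases, allowing extensions by custom classes (GH 8813).
Timedelta now supports arithmetic with numpy.ndarray objects of the appropriate dtype (numpy 1.8 or newer only) (GH 8884).
Added the Timedelta.to_timedelta64() method to the public API (GH 8884).
Added the gbq.generate_bq_schema() function to the gbq module (GH 8325).
Series now works with map objects the same way as with generators (GH 8909).
Added a context manager to HDFStore for automatic closing (GH 8791).
to_datetime gains an exact keyword so that a format does not have to match the provided format string exactly (when exact is False). exact defaults to True, meaning that exact matching is still the default (GH 8904)
Added the axvlines boolean option to the parallel_coordinates plot function, which determines whether vertical lines will be printed; the default is True
Added the ability to read table footers to read_html (GH 8552).
to_sql now infers the datatypes of non-NA values for columns that contain NA values and have dtype object (GH 8778).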
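
A few of the enhancements listed above can be exercised together. This is a minimal sketch, assuming a recent pandas installation; the sample values and variable names are illustrative.

```python
import datetime

import numpy as np
import pandas as pd

# Timestamp.combine mirrors datetime.datetime.combine (GH 5351).
combined = pd.Timestamp.combine(datetime.date(2014, 12, 12), datetime.time(9, 30))

# Timedelta.to_timedelta64 returns a numpy.timedelta64 (GH 8884).
as_td64 = pd.Timedelta(hours=1).to_timedelta64()

# Series accepts generators and map objects alike (GH 8909).
from_generator = pd.Series(x * 2 for x in range(3))
from_map = pd.Series(map(str.upper, ["a", "b"]))

# exact=False lets the format match only part of the string (GH 8904).
text = "12/12/2014 (release day)"
loose = pd.to_datetime(text, format="%m/%d/%Y", exact=False)
try:
    pd.to_datetime(text, format="%m/%d/%Y")  # exact=True is the default
    strict_failed = False
except ValueError:
    strict_failed = True

print(combined, as_td64, loose, strict_failed)
```

With the default exact=True, the trailing text after the date makes the conversion fail, which is why the second call is wrapped in try/except.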

Performance

Reduced memory usage when skiprows is an integer in read_csv (GH 8681)
Performance boost for to_datetime conversions when format= is passed together with exact=False (GH 8904)

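Bug fixes

Bug in concat of Series with category dtype, which were being coerced to object (GH 8641)
Bug in Timestamp-Timestamp not returning a Timedelta type, and in datelike-datelike ops with timezones (GH 8865)
Made the timezone mismatch exception consistent (either tz operated with None or an incompatible timezone); it now raises TypeError rather than ValueError (a couple of edge cases only) (GH 8865)
Bug in using a pd.Grouper(key=...) with no level/axis or with level only (GH 8795, GH 8866)
Report a TypeError when invalid/no parameters are passed in a groupby (GH 8015)
Bug in packaging pandas with py2app/cx_Freeze (GH 8602, GH 8831)
Bug in groupby signatures that did not include *args or **kwargs (GH 8733).
io.data.Options now raises RemoteDataError when no expiry dates are available from Yahoo and when it receives no data from Yahoo (GH 8761), (GH 8783).
Unclear error message in csv parsing when passing dtype and names and the parsed data is a different data type (GH 8833)
Bug in slicing a MultiIndex with an empty list and at least one boolean indexer (GH 8781)
io.data.Options now raises RemoteDataError when no expiry dates are available from Yahoo (GH 8761).
Timedelta kwargs may now be numpy ints and floats (GH 8757).
Fixed several outstanding bugs in Timedelta arithmetic and comparisons (GH 8813, GH 5963, GH 5436).
sql_schema now generates dialect-appropriate CREATE TABLE statements (GH 8697)
The slice string method now takes step into account (GH 8754)
Bug in BlockManager where setting values with a different type would break block integrity (GH 8850)
Bug in DatetimeIndex when using a time object as key (GH 8667)
Bug in merge where how='left' and sort=False would not preserve the left frame order (GH 7331)
Bug in MultiIndex.reindex where reindexing at a level would not reorder labels (GH 4088)
Bug in certain operations with dateutil timezones, manifesting with dateutil 2.3 (GH 8639)
Regression in DatetimeIndex iteration with a Fixed/Local offset timezone (GH 8890)
Bug in to_datetime when parsing nanoseconds using the %f format (GH 8989)
io.data.Options now raises RemoteDataError when no expiry dates are available from Yahoo and when it receives no data from Yahoo (GH 8761, GH 8783).
Fix: the font size was only set on the x axis if vertical or on the y axis if horizontal. (GH 8765)
Fixed division by 0 when reading big csv files in Python 3 (GH 8621)
Bug in outputting a MultiIndex with to_html, index=False, which would add an extra column (GH 8452)
Categorical variables imported from Stata files retain the ordinal information in the underlying data (GH 8836)
Defined the .size attribute across NDFrame objects to provide compatibility with numpy >= 1.9.1; buggy with np.array_split (GH 8846)
Skip testing of histogram plots for matplotlib <= 1.2 (GH 8648).
Bug where get_data_google returned object dtypes (GH 3995)
Bug in DataFrame.stack(..., dropna=False) when the DataFrame's columns is a MultiIndex whose labels do not reference all of its levels (GH 8844)
Bug in applying the Option context on __enter__ (GH 8514)
Bug in resample that caused a ValueError when resampling across multiple days and the last offset was not calculated from the start of the range (GH 8683)
Bug where DataFrame.plot(kind='scatter') failed when checking whether an np.array is in the DataFrame (GH 8852)
Bug in pd.infer_freq/DataFrame.inferred_freq that prevented proper sub-daily frequency inference when the index contained DST days (GH 8772).
Bug where the index name was still used when plotting a series with use_index=False (GH 8558).
Bug where trying to stack multiple columns raised errors when some (or all) of the level names were numbers (GH 8584).
Bug in MultiIndex where __contains__ returned the wrong result if the index was not lexically sorted or unique (GH 7724)
BUG CSV: fix problem with trailing white space in skipped rows (GH 8679), (GH 8661), (GH 8983)
Regression in Timestamp not parsing the 'Z' zone designator for UTC (GH 8771)
Bug in StataWriter that produced strings with a length of 244 characters irrespective of the actual size (GH 8969)
Fixed ValueError raised by cummin/cummax when a datetime64 Series contains NaT. (GH 8965)
Bug in DataReader that returned object dtype if there were missing values (GH 8980)
Bug in plotting where, if sharex was enabled and the index was a timeseries, labels would be shown on multiple axes (GH 3964).
Bug where passing a unit to the TimedeltaIndex constructor applied the conversion to nanoseconds twice. (GH 9011).
Bug in plotting of a period-like array (GH 9012)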
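
Some of the fixed behaviors in this list are easy to confirm. The following is a minimal sketch, assuming a recent pandas installation; the inputs and variable names are illustrative.

```python
import datetime

import numpy as np
import pandas as pd

# The slice string method honors the step argument (GH 8754).
every_other = pd.Series(["abcdef"]).str.slice(0, None, 2).tolist()

# Timedelta keyword arguments may be numpy integers and floats (GH 8757).
delta = pd.Timedelta(days=np.int64(1), hours=np.float64(1.5))

# A trailing 'Z' is parsed as the UTC zone designator (GH 8771).
stamp = pd.Timestamp("2014-12-12T10:00:00Z")

# cummax does not raise when a datetime64 Series contains NaT (GH 8965).
dates = pd.Series(pd.to_datetime(["2014-01-02", None, "2014-01-01"]))
running_max = dates.cummax()

print(every_other, delta, stamp, running_max.tolist())
```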

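Contributors

A total of 49 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.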

Aaron Staple
Angelos Evripiotis +
Artemy Kolchinsky
Benoit Pointet +
Brian Jacobowski +
Charalampos Papaloizou +
Chris Warth +
David Stephens
Fabio Zanini +
Francesc Via +
Henry Kleynhans +
Jake VanderPlas +
Jan Schulz
Jeff Reback
Jeff Tratner
Joris Van den Bossche
Kevin Sheppard
Matt Suggit +
Matthew Brett
Phillip Cloud
Rupert Thompson +
Scott E Lasley +
Stephan Hoyer
Stephen Simmons +
Sylvain Corlay +
Thomas Grainger +
Tiago Antao +
Tom Augspurger
Trent Hauck
Victor Chaves +
Victor Salgado +
Vikram Bhandoh +
WANG Aiyong
Will Holmgren +
behzad nouri
broessli +
charalampos papaloizou +
immerrr
jnmclarty
jreback
mgilbert +
onesandzeroes
peadarcoyle +
rockg
seth-p
sinhrks
unutbu
wavedatalab +
Åsmund Hjulstad +

Posted 2024-06-26 10:36 by 绝不原创的飞龙
