25 Frequently Used Pandas Tricks

Adapted and translated from: https://github.com/justmarkham/pandas-videos

Load the example datasets

import pandas as pd
import numpy as np

drinks = pd.read_csv('http://bit.ly/drinksbycountry')
movies = pd.read_csv('http://bit.ly/imdbratings')
orders = pd.read_csv('http://bit.ly/chiporders', sep='\t')
# regex=False treats '$' as a literal character rather than a regular expression
orders['item_price'] = orders.item_price.str.replace('$', '', regex=False).astype('float')
stocks = pd.read_csv('http://bit.ly/smallstocks', parse_dates=['Date'])
titanic = pd.read_csv('http://bit.ly/kaggletrain')
ufo = pd.read_csv('http://bit.ly/uforeports', parse_dates=['Time'])

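1. Show installed versions

Sometimes you need to know which pandas version you are using, especially when reading the pandas documentation. You can display it with the following command: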

pd.__version__

'1.2.4'

If you also want to know the versions of the packages pandas depends on, you can use the show_versions() function:

pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 2cb96529396d93b46abab7bbc73a208e708c642e
python           : 3.8.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22000
machine          : AMD64
processor        : AMD64 Family 25 Model 80 Stepping 0, AuthenticAMD
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : Chinese (Simplified)_China.936

pandas           : 1.2.4
numpy            : 1.18.0
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 22.1.2
setuptools       : 52.0.0.post20210125
Cython           : 0.29.23
pytest           : 6.2.3
hypothesis       : None
sphinx           : 4.0.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.8
lxml.etree       : 4.6.3
html5lib         : 1.1
pymysql          : 1.0.2
psycopg2         : 2.9.1 (dt dec pq3 ext lo64)
jinja2           : 2.11.3
IPython          : 7.22.0
pandas_datareader: 0.10.0
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.9.0
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : 2.7.3
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.0
sqlalchemy       : 1.4.7
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : 1.3.0
numba            : 0.53.1

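2. Create an example DataFrame

Suppose you need to create an example DataFrame. There are many ways to do this, but my favorite is to pass a dictionary to the DataFrame constructor, in which the dictionary keys are the column names and the dictionary values are lists of column values.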

df = pd.DataFrame({'col one':[100, 200], 'col two':[300, 400]})
df

	col one	col two
0	100	300
1	200	400

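If you need a larger DataFrame, the approach above requires too much typing. In that case, you can use NumPy's random.rand() function, tell it the number of rows and columns, and pass the result to the DataFrame constructor: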

pd.DataFrame(np.random.rand(4, 8))

	0	1	2	3	4	5	6	7
0	0.434350	0.664889	0.003442	0.500086	0.053749	0.831000	0.199008	0.194081
1	0.708474	0.363857	0.949917	0.664410	0.285345	0.957187	0.851665	0.347094
2	0.107086	0.497177	0.488709	0.283645	0.155678	0.815601	0.558401	0.695038
3	0.039673	0.836976	0.878320	0.462584	0.742012	0.483997	0.578045	0.568551

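That works well, but if you also want non-numeric column names, you can convert a string of letters into a list and pass it to the columns parameter: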

pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))

	a	b	c	d	e	f	g	h
0	0.106455	0.072711	0.492421	0.810857	0.986341	0.251466	0.557781	0.299379
1	0.589126	0.851388	0.362811	0.729866	0.524497	0.464101	0.873737	0.098877
2	0.623276	0.835985	0.750665	0.599064	0.230829	0.688544	0.313951	0.878711
3	0.379598	0.665771	0.949013	0.460847	0.004878	0.617837	0.773584	0.560171

As you might guess, the string must contain exactly as many characters as there are columns.
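
np.random.rand() produces different values on every run. If you want the example DataFrame to be reproducible, one option is a seeded NumPy generator. This is a minimal sketch: the seed value, the variable names, and the col_0 ... col_7 naming scheme are illustrative assumptions, not part of the original tricks.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same "random" values on every run.
rng = np.random.default_rng(seed=0)

n_rows, n_cols = 4, 8
# Hypothetical naming scheme: one generated name per column.
col_names = [f"col_{i}" for i in range(n_cols)]

# rng.random() draws floats from the half-open interval [0, 1).
df_random = pd.DataFrame(rng.random((n_rows, n_cols)), columns=col_names)
print(df_random.shape)
```

Generating the names from n_cols keeps the number of names and the number of columns in sync, which avoids the length mismatch mentioned above.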

3. Rename columns

Let's take a look at the example DataFrame we created a moment ago:

df

	col one	col two
0	100	300
1	200	400

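I prefer to use dot notation (.) to select pandas columns, but that doesn't work for column names that contain spaces. Let's fix this.

The most flexible way to rename columns is the rename() method. You pass it a dictionary in which the keys are the old names and the values are the new names, and you also specify the axis: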

df = df.rename({'col one':'col_one','col two':'col_two'},axis='columns')

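The strength of this method is that it can rename any number of columns, whether that is a single column or all of them.

If you are going to rename all of the columns at once, a simpler way is to overwrite the DataFrame's columns attribute: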

df.columns = ['col_one', 'col_two']

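If all you need to do is replace spaces with underscores, an even better approach is the str.replace() method, since you don't have to type out all of the column names: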

df.columns = df.columns.str.replace(' ', '_')

All three approaches give the same result, which is column names without any spaces:

df

	col_one	col_two
0	100	300
1	200	400

Finally, if you need to add a prefix or a suffix to all of the column names, you can use the add_prefix() method:

df.add_prefix('X_')

	X_col_one	X_col_two
0	100	300
1	200	400

Or the add_suffix() method:

df.add_suffix('_Y')

	col_one_Y	col_two_Y
0	100	300
1	200	400

4. Reverse row order

Let's take a look at the drinks DataFrame:

drinks.head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

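This dataset describes the average alcohol consumption of each country. What if you wanted to reverse the order of the rows?

The most straightforward way is to use the loc accessor and pass it ::-1, the same slicing notation used to reverse a Python list: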

drinks.loc[::-1].head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
192	Zimbabwe	64	18	4	4.7	Africa
191	Zambia	32	19	4	2.5	Africa
190	Yemen	6	0	0	0.1	Asia
189	Vietnam	111	2	1	2.0	Asia
188	Venezuela	333	100	3	7.7	South America

What if you also wanted to reset the index so that it starts at zero?

You can use the reset_index() method and tell it to drop the old index entirely:

drinks.loc[::-1].reset_index(drop=True).head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Zimbabwe	64	18	4	4.7	Africa
1	Zambia	32	19	4	2.5	Africa
2	Yemen	6	0	0	0.1	Asia
3	Vietnam	111	2	1	2.0	Asia
4	Venezuela	333	100	3	7.7	South America

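As you can see, the rows are in reverse order and the index has been reset to the default integer index.

5. Reverse column order

As in the previous trick, you can also use loc to reverse the left-to-right order of the columns: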

drinks.loc[:, ::-1].head()

	continent	total_litres_of_pure_alcohol	wine_servings	spirit_servings	beer_servings	country
0	Asia	0.0	0	0	0	Afghanistan
1	Europe	4.9	54	132	89	Albania
2	Africa	0.7	14	0	25	Algeria
3	Europe	12.4	312	138	245	Andorra
4	Africa	5.9	45	57	217	Angola

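The colon before the comma means "select all rows", and the ::-1 after the comma means "reverse the columns", which is why country is now on the far right.

6. Select columns by data type

Here are the data types of the drinks DataFrame: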

drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

Suppose you only need the numeric columns. You can use the select_dtypes() method:

drinks.select_dtypes(include='number').head()

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
0	0	0	0	0.0
1	89	132	54	4.9
2	25	0	14	0.7
3	245	138	312	12.4
4	217	57	45	5.9

This includes both the int and the float columns.

You can also use this method to select only the object columns:

drinks.select_dtypes(include='object').head()

	country	continent
0	Afghanistan	Asia
1	Albania	Europe
2	Algeria	Africa
3	Andorra	Europe
4	Angola	Africa

You can select multiple data types by passing a list:

drinks.select_dtypes(include=['number', 'object', 'category', 'datetime']).head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

You can also use it to exclude certain data types:

drinks.select_dtypes(exclude='number').head()
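
The output of this last call is not shown above. As a rough sketch on a tiny hand-made DataFrame (the sample values and variable names are made up for illustration), exclude= keeps exactly the columns that include= with the same type would drop:

```python
import pandas as pd

# Hypothetical miniature version of the drinks data.
sample = pd.DataFrame({
    "country": ["Albania", "Algeria"],
    "beer_servings": [89, 25],
    "total_litres": [4.9, 0.7],
    "continent": ["Europe", "Africa"],
})

numeric = sample.select_dtypes(include="number")      # int and float columns
non_numeric = sample.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))
print(list(non_numeric.columns))
```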

7. Convert strings to numbers

Let's create another example DataFrame:

df = pd.DataFrame({'col_one':['1.1', '2.2', '3.3'],
                   'col_two':['4.4', '5.5', '6.6'],
                   'col_three':['7.7', '8.8', '-']})
df

	col_one	col_two	col_three
0	1.1	4.4	7.7
1	2.2	5.5	8.8
2	3.3	6.6	-

These numbers are actually stored as strings, which is why the columns have the object data type:

df.dtypes

col_one      object
col_two      object
col_three    object
dtype: object

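To do mathematical operations on these columns, we need to convert them to a numeric data type. You can use the astype() method on the first two columns: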

df.astype({'col_one':'float', 'col_two':'float'}).dtypes

col_one      float64
col_two      float64
col_three     object
dtype: object

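However, using this method on the third column would raise an error, because that column contains a dash (used here to represent zero) and pandas doesn't know how to handle it.

Instead, you can use the to_numeric() function on the third column and tell it to convert any invalid input into NaN: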

pd.to_numeric(df.col_three, errors='coerce')

0    7.7
1    8.8
2    NaN
Name: col_three, dtype: float64

If you know that the NaN values actually represent zeros, you can replace them with zeros using the fillna() method:

pd.to_numeric(df.col_three, errors='coerce').fillna(0)

0    7.7
1    8.8
2    0.0
Name: col_three, dtype: float64

Finally, you can apply this function to the entire DataFrame at once by using the apply() method:

df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
df

	col_one	col_two	col_three
0	1.1	4.4	7.7
1	2.2	5.5	8.8
2	3.3	6.6	0.0

That single line of code accomplishes our goal, because all of the columns are now float:

df.dtypes

col_one      float64
col_two      float64
col_three    float64
dtype: object

8. Reduce DataFrame size

pandas DataFrames are designed to fit into memory, so sometimes you need to reduce the size of a DataFrame in order to work with it on your system.

Here is the size of the drinks DataFrame:

drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 30.5 KB

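You can see that it currently uses 30.5 KB.

If you are having performance problems with your DataFrame, or you can't even read it into memory, there are two steps you can take while reading the file to reduce the DataFrame's size.

The first step is to read in only the columns you actually need, which you specify with the usecols parameter: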

cols = ['beer_servings', 'continent']
small_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols)
small_drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   beer_servings  193 non-null    int64 
 1   continent      193 non-null    object
dtypes: int64(1), object(1)
memory usage: 13.7 KB

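The second step is to convert any object columns that are really categorical data to the category data type, which you specify with the dtype parameter: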

dtypes = {'continent':'category'}
smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols, dtype=dtypes)
smaller_drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   beer_servings  193 non-null    int64   
 1   continent      193 non-null    category
dtypes: category(1), int64(1)
memory usage: 2.4 KB

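By reading in the continent column as the category data type, we have further reduced the DataFrame size to 2.4 KB.

Keep in mind that the category data type only reduces memory usage when the number of distinct categories is small relative to the number of rows.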
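
To see the effect without downloading anything, here is a small sketch that compares the memory footprint of the same values stored as plain strings and as a category. The data and variable names are made up for illustration, and the exact byte counts depend on your pandas version, so none are quoted here.

```python
import pandas as pd

# 1,000 rows but only 4 distinct values: a good candidate for category.
continents = ["Asia", "Europe", "Africa", "Oceania"]
as_strings = pd.Series(continents * 250)
as_category = as_strings.astype("category")

# deep=True also counts the memory used by the string objects themselves.
string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
```

A category column stores each distinct value once plus a small integer code per row, which is why it pays off when values repeat often.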

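9. Build a DataFrame from multiple files (row-wise)

Suppose your dataset is spread across multiple files, but you want to read it into a single DataFrame.

For example, I have a small dataset of stock data in which each CSV file contains a single day: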

pd.read_csv('data/stocks1.csv')
pd.read_csv('data/stocks2.csv')
pd.read_csv('data/stocks3.csv')

	Date	Close	Volume	Symbol
0	2016-10-05	57.64	16726400	MSFT
1	2016-10-05	31.59	11808600	CSCO
2	2016-10-05	113.05	21453100	AAPL

You could read each CSV file into its own DataFrame, combine them, and then delete the original DataFrames, but that would use extra memory and require a lot of code.

A better solution is to use the built-in glob module. You pass the glob() function a pattern, including wildcard characters, and it returns a list of all files that match that pattern. Here the pattern matches every CSV file named "stocks" followed by a single digit, so that a combined file such as data/stocks.csv is not read a second time:

from glob import glob
stock_files = sorted(glob('data/stocks[0-9].csv'))
stock_files

['data\\stocks1.csv',
 'data\\stocks2.csv',
 'data\\stocks3.csv']

glob returns filenames in arbitrary order, which is why we sort the list with Python's built-in sorted() function.

We then use a generator expression to read each file with read_csv() and pass the results to the concat() function, which combines the individual DataFrames row-wise:

pd.concat((pd.read_csv(file) for file in stock_files))

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
0	2016-10-04	113.00	29736800	AAPL
1	2016-10-04	57.24	20085900	MSFT
2	2016-10-04	31.35	18460400	CSCO
0	2016-10-05	57.64	16726400	MSFT
1	2016-10-05	31.59	11808600	CSCO
2	2016-10-05	113.05	21453100	AAPL

Unfortunately, there are now duplicate values in the index. To avoid that, we tell concat() to ignore the index and use the default integer index instead:

pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
3	2016-10-04	113.00	29736800	AAPL
4	2016-10-04	57.24	20085900	MSFT
5	2016-10-04	31.35	18460400	CSCO
6	2016-10-05	57.64	16726400	MSFT
7	2016-10-05	31.59	11808600	CSCO
8	2016-10-05	113.05	21453100	AAPL
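
If you don't have the data/ folder at hand, the same idea can be tried end to end with throwaway files. This is only a sketch: the temporary directory, the tiny per-day DataFrames, and all variable names are made up for illustration and are not the real stock files.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write three tiny per-day files (hypothetical data).
    for day in (1, 2, 3):
        daily = pd.DataFrame({"Date": [f"2016-10-0{day}"] * 2,
                              "Symbol": ["AAPL", "MSFT"],
                              "Close": [100.0 + day, 50.0 + day]})
        daily.to_csv(Path(tmp_dir) / f"stocks{day}.csv", index=False)

    # Match "stocks" followed by exactly one digit, then sort for a stable order.
    files = sorted(glob(str(Path(tmp_dir) / "stocks[0-9].csv")))

    # Read every file lazily and stack the rows under a fresh 0..n-1 index.
    combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(combined)
```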

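10. Build a DataFrame from multiple files (column-wise)

The previous trick is useful when each file contains rows of your dataset. But what if each file contains columns of your dataset instead?

Here is an example in which the drinks dataset has been split into two CSV files, each containing three columns: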

pd.read_csv('data/drinks1.csv').head()

	country	beer_servings	spirit_servings
0	Afghanistan	0	0
1	Albania	89	132
2	Algeria	25	0
3	Andorra	245	138
4	Angola	217	57

pd.read_csv('data/drinks2.csv').head()

	wine_servings	total_litres_of_pure_alcohol	continent
0	0	0.0	Asia
1	54	4.9	Europe
2	14	0.7	Africa
3	312	12.4	Europe
4	45	5.9	Africa

As in the previous trick, we start by using glob(). The pattern matches only the numbered files, so that a complete copy such as data/drinks.csv is not picked up as well, and this time we tell concat() to combine along the columns axis:

drink_files = sorted(glob('data/drinks[0-9].csv'))
pd.concat((pd.read_csv(file) for file in drink_files), axis='columns').head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

Now our DataFrame has all six columns.

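11. Create a DataFrame from the clipboard

Suppose you have some data stored in an Excel spreadsheet or a Google Sheet, and you want to get it into a DataFrame as quickly as possible.

Select the data and copy it to the clipboard. Then you can use the read_clipboard() function to read it into a DataFrame: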

df = pd.read_clipboard()
df

	year	month	day	land_use
0	2018	3	6	62
1	2018	3	6	62
2	2018	3	6	130
3	2018	3	6	121
4	2018	3	6	72
5	2018	3	6	72
6	2018	3	6	130
7	2018	3	6	72
8	2018	3	6	72

Just like read_csv(), read_clipboard() automatically detects the correct data type for each column:

df.dtypes

year        int64
month       int64
day         int64
land_use    int64
dtype: object

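Keep in mind that read_clipboard() is not recommended if you want your work to be reproducible in the future.

12. Split a DataFrame into two random subsets

Suppose you want to split a DataFrame into two parts, randomly assigning 75% of the rows to one DataFrame and the remaining 25% to another.

For example, our movie ratings DataFrame has 979 rows: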

len(movies)

We can use the sample() method to randomly select 75% of the rows and assign them to the "movies_1" DataFrame:

movies_1 = movies.sample(frac=0.75, random_state=1234)

Then we use the drop() method to drop all of the rows that are in "movies_1" and assign the remaining rows to "movies_2":

movies_2 = movies.drop(movies_1.index)

You can see that the total number of rows is correct:

len(movies_1) + len(movies_2)

You can also check the index of each DataFrame to confirm that every movie ended up in one of the two, first for "movies_1":

movies_1.index.sort_values()

Int64Index([  0,   2,   5,   6,   7,   8,   9,  11,  13,  16,
            ...
            966, 967, 969, 971, 972, 974, 975, 976, 977, 978],
           dtype='int64', length=734)

And then for "movies_2":

movies_2.index.sort_values()

Int64Index([  1,   3,   4,  10,  12,  14,  15,  18,  26,  30,
            ...
            931, 934, 937, 941, 950, 954, 960, 968, 970, 973],
           dtype='int64', length=245)

Keep in mind that this approach does not work if the index values are not unique.
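
If your index does contain duplicates, one possible workaround is to give every row a unique label first and then split exactly as before. A minimal sketch, using a made-up DataFrame and hypothetical variable names:

```python
import pandas as pd

# Hypothetical frame whose index labels repeat (0, 1, 2, 3, 0, 1, 2, 3).
df_dup = pd.DataFrame({"value": range(8)}, index=[0, 1, 2, 3] * 2)

# Replace the index with unique integers before splitting.
df_unique = df_dup.reset_index(drop=True)

part_1 = df_unique.sample(frac=0.75, random_state=1234)
part_2 = df_unique.drop(part_1.index)   # drops only the sampled rows

print(len(part_1), len(part_2))
```

Without the reset_index() step, drop() would remove every row that shares a label with a sampled row, so the two parts would no longer add up to the original row count.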

13. Filter a DataFrame by multiple categories

Let's take a look at the movies DataFrame:

movies.head()

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

One of the columns is genre:

movies.genre.unique()

array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',
       'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',
       'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)

Suppose we want to filter the DataFrame to show only movies whose genre is Action, Drama, or Western. We could use multiple conditions separated by the "or" operator (|):

movies[(movies.genre == 'Action') |
       (movies.genre == 'Drama') |
       (movies.genre == 'Western')].head()

	star_rating	title	content_rating	genre	duration	actors_list
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
6	8.9	The Good, the Bad and the Ugly	NOT RATED	Western	161	[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...

However, you can write this more clearly by using the isin() method and passing it the list of genres:

movies[movies.genre.isin(['Action', 'Drama', 'Western'])].head()

	star_rating	title	content_rating	genre	duration	actors_list
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
6	8.9	The Good, the Bad and the Ugly	NOT RATED	Western	161	[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...

And if you want to reverse the filter, so that you exclude those three genres instead, you can put a tilde (~) in front of the condition:

movies[~movies.genre.isin(['Action', 'Drama', 'Western'])].head()

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....
7	8.9	The Lord of the Rings: The Return of the King	PG-13	Adventure	201	[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...

This works because, when applied to a boolean Series in pandas, the tilde performs an element-wise "not".
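
Here is the same pattern on a tiny made-up DataFrame, as a sketch you can run without the movies dataset (the titles, genres, and variable names are placeholders):

```python
import pandas as pd

films = pd.DataFrame({"title": ["A", "B", "C", "D"],
                      "genre": ["Action", "Crime", "Drama", "Horror"]})
wanted = ["Action", "Drama"]

mask = films.genre.isin(wanted)   # boolean Series, one entry per row
kept = films[mask]                # rows whose genre is in the list
dropped = films[~mask]            # ~ flips each True/False in the mask

print(kept.title.tolist(), dropped.title.tolist())
```

Storing the mask in a variable makes it easy to reuse the same condition for both the filter and its inverse.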

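14. Filter a DataFrame by the largest categories

Suppose you want to filter the movies DataFrame by genre, but keep only the 3 genres with the most movies.

We use the value_counts() method on genre and save the resulting Series as counts: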

counts = movies.genre.value_counts()
counts

Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Thriller       5
Sci-Fi         5
Film-Noir      3
Family         2
Fantasy        1
History        1
Name: genre, dtype: int64

The Series method nlargest() makes it easy to select the 3 largest values in this Series:

counts.nlargest(3)

Drama     278
Comedy    156
Action    136
Name: genre, dtype: int64

Finally, we pass the index of that result to the isin() method, which treats it as a list of genres:

movies[movies.genre.isin(counts.nlargest(3).index)].head()

	star_rating	title	content_rating	genre	duration	actors_list
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...
12	8.8	Star Wars: Episode V - The Empire Strikes Back	PG	Action	124	[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...

Now only Drama, Comedy, and Action movies remain in the DataFrame.

15. Handle missing values

Let's look at the UFO sightings DataFrame:

ufo.head()

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	1930-06-01 22:00:00
1	Willingboro	NaN	OTHER	NJ	1930-06-30 20:00:00
2	Holyoke	NaN	OVAL	CO	1931-02-15 14:00:00
3	Abilene	NaN	DISK	KS	1931-06-01 13:00:00
4	New York Worlds Fair	NaN	LIGHT	NY	1933-04-18 19:00:00

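You will notice that some of the values are missing.
To find out how many values are missing in each column, you can use the isna() method and then take the sum():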

ufo.isna().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

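isna() produces a DataFrame of True and False values, and sum() treats every True as 1 and every False as 0 when it adds them up.

Similarly, you can find the proportion of values that are missing in each column by taking the mean() of isna():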

ufo.isna().mean()

City               0.001371
Colors Reported    0.842004
Shape Reported     0.144948
State              0.000000
Time               0.000000
dtype: float64

If you want to drop the columns that contain any missing values, you can use the dropna() method:

ufo.dropna(axis='columns').head()

	State	Time
0	NY	1930-06-01 22:00:00
1	NJ	1930-06-30 20:00:00
2	CO	1931-02-15 14:00:00
3	KS	1931-06-01 13:00:00
4	NY	1933-04-18 19:00:00

Or, if you want to drop only the columns in which more than 10% of the values are missing, you can set a threshold for dropna():

ufo.dropna(thresh=len(ufo)*0.9, axis='columns').head()

	City	State	Time
0	Ithaca	NY	1930-06-01 22:00:00
1	Willingboro	NJ	1930-06-30 20:00:00
2	Holyoke	CO	1931-02-15 14:00:00
3	Abilene	KS	1931-06-01 13:00:00
4	New York Worlds Fair	NY	1933-04-18 19:00:00

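len(ufo) returns the total number of rows, and we multiply it by 0.9 to tell pandas to keep only the columns in which at least 90% of the values are not missing.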
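
The thresh argument counts non-missing values, which is easy to get backwards. The sketch below uses a made-up 10-row DataFrame (all names and values are hypothetical) in which one column is 10% missing and another is 90% missing:

```python
import numpy as np
import pandas as pd

sightings = pd.DataFrame({
    "city": ["a", "b", "c", "d", "e", "f", "g", "h", "i", np.nan],  # 1 of 10 missing
    "color": ["red"] + [np.nan] * 9,                                 # 9 of 10 missing
    "state": list("ABCDEFGHIJ"),                                     # nothing missing
})

# Share of missing values per column.
missing_share = sightings.isna().mean()

# Keep a column only if it has at least 9 non-missing values (90% of 10 rows).
min_present = int(len(sightings) * 0.9)
kept = sightings.dropna(thresh=min_present, axis="columns")

print(list(kept.columns))
```

Converting the threshold to an integer is a design choice made here for clarity; the original call passes the float directly.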

16. Split a string into multiple columns

Let's create another example DataFrame:

df = pd.DataFrame({'name':['John Arthur Doe', 'Jane Ann Smith'],
                   'location':['Los Angeles, CA', 'Washington, DC']})
df

	name	location
0	John Arthur Doe	Los Angeles, CA
1	Jane Ann Smith	Washington, DC

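What if we wanted to split the "name" column into three separate columns for the first, middle, and last name? We would use the str.split() method, tell it to split on a space, and expand the results into a DataFrame: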

df.name.str.split(' ', expand=True)

	0	1	2
0	John	Arthur	Doe
1	Jane	Ann	Smith

These three columns can be saved to the original DataFrame in a single assignment:

df[['first', 'middle', 'last']] = df.name.str.split(' ', expand=True)
df

	name	location	first	middle	last
0	John Arthur Doe	Los Angeles, CA	John	Arthur	Doe
1	Jane Ann Smith	Washington, DC	Jane	Ann	Smith

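What if we wanted to split a string but keep only one of the resulting columns? For example, let's split the location column on ", " (a comma followed by a space):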

df.location.str.split(', ', expand=True)

	0	1
0	Los Angeles	CA
1	Washington	DC

If we only want to keep column 0 as the city name, we can select just that column and save it to the DataFrame:

df['city'] = df.location.str.split(', ', expand=True)[0]
df

	name	location	first	middle	last	city
0	John Arthur Doe	Los Angeles, CA	John	Arthur	Doe	Los Angeles
1	Jane Ann Smith	Washington, DC	Jane	Ann	Smith	Washington

17. Expand a Series of lists into a DataFrame

Let's create another example DataFrame:

df = pd.DataFrame({'col_one':['a', 'b', 'c'], 'col_two':[[10, 40], [20, 50], [30, 60]]})
df

	col_one	col_two
0	a	[10, 40]
1	b	[20, 50]
2	c	[30, 60]

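There are two columns, and the second column contains regular Python lists of integers.

If we wanted to expand the second column into its own DataFrame, we can use the apply() method on that column and pass it the Series constructor: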

df_new = df.col_two.apply(pd.Series)
df_new

	0	1
0	10	40
1	20	50
2	30	60

By using the concat() function, we can combine the original DataFrame with the new DataFrame:

pd.concat([df, df_new], axis='columns')

	col_one	col_two	0	1
0	a	[10, 40]	10	40
1	b	[20, 50]	20	50
2	c	[30, 60]	30	60
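
An alternative that avoids calling the Series constructor once per row is to build the new DataFrame directly from the lists. This is a sketch rather than the approach used above, it assumes every list has the same length, and the new column names are hypothetical:

```python
import pandas as pd

df_lists = pd.DataFrame({"col_one": ["a", "b", "c"],
                         "col_two": [[10, 40], [20, 50], [30, 60]]})

# tolist() yields a list of lists, which the DataFrame constructor
# turns into one column per list position.
expanded = pd.DataFrame(df_lists["col_two"].tolist(),
                        index=df_lists.index,
                        columns=["first_value", "second_value"])

result = pd.concat([df_lists, expanded], axis="columns")
print(result)
```

Passing index=df_lists.index keeps the rows aligned when the original DataFrame does not use the default integer index.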

18. Aggregate by multiple functions

Let's look at a DataFrame of orders from the Chipotle restaurant chain:

orders.head(10)

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39
1	1	1	Izze	[Clementine]	3.39
2	1	1	Nantucket Nectar	[Apple]	3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98
6	3	1	Side of Chips	NaN	1.69
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25

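Each order has an order_id and consists of one or more rows. To figure out the total price of an order, you sum the item_price values for that order_id. For example, here is the total price of order number 1: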

orders[orders.order_id == 1].item_price.sum()

11.56

If you wanted to calculate the total price of every order, you would groupby() order_id and then take the sum of item_price for each group:

orders.groupby('order_id').item_price.sum().head()

order_id
1    11.56
2    16.98
3    12.67
4    21.00
5    13.70
Name: item_price, dtype: float64

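However, you are not limited to aggregating by a single function such as sum(). To aggregate by multiple functions, you use the agg() method and pass it a list of functions, such as sum() and count():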

orders.groupby('order_id').item_price.agg(['sum', 'count']).head()

	sum	count
order_id
1	11.56	4
2	16.98	1
3	12.67	2
4	21.00	2
5	13.70	2

That gives us the total price of each order as well as the number of rows (line items) in each order.
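
If you would rather choose the names of the result columns yourself, pandas also supports named aggregation (available since pandas 0.25). A small sketch on made-up order data; the DataFrame, the variable names, and the output column names are illustrative:

```python
import pandas as pd

orders_small = pd.DataFrame({"order_id": [1, 1, 2, 3, 3],
                             "item_price": [2.39, 3.39, 16.98, 10.98, 1.69]})

# Each keyword is an output column: (input column, aggregation function).
summary = orders_small.groupby("order_id").agg(
    total_price=("item_price", "sum"),
    n_items=("item_price", "count"),
)

print(summary)
```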

19. Combine the output of an aggregation with a DataFrame

Let's take another look at the orders DataFrame:

orders.head(10)

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39
1	1	1	Izze	[Clementine]	3.39
2	1	1	Nantucket Nectar	[Apple]	3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98
6	3	1	Side of Chips	NaN	1.69
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25

What if we wanted to create a new column listing the total price of each order? Recall that we calculated the total price using the sum() method:

orders.groupby('order_id').item_price.sum().head()

order_id
1    11.56
2    16.98
3    12.67
4    21.00
5    13.70
Name: item_price, dtype: float64

sum() is an aggregation function, which means that it returns a reduced version of the input data.

In other words, the output of the sum() function:

len(orders.groupby('order_id').item_price.sum())

is smaller than the input to the function:

len(orders.item_price)

The solution is to use the transform() method, which performs the same calculation but returns output that has the same shape as the input data:

total_price = orders.groupby('order_id').item_price.transform('sum')
len(total_price)

We store the result in a new DataFrame column:

orders['total_price'] = total_price
orders.head(10)

	order_id	quantity	item_name	choice_description	item_price	total_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39	11.56
1	1	1	Izze	[Clementine]	3.39	11.56
2	1	1	Nantucket Nectar	[Apple]	3.39	11.56
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39	11.56
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98	16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98	12.67
6	3	1	Side of Chips	NaN	1.69	12.67
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75	21.00
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25	21.00
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25	13.70

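As you can see, the total price of each order is now listed in every row.

That makes it easy to calculate the share of the order's total price that each line item accounts for: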

orders['percent_of_total'] = orders.item_price / orders.total_price
orders.head(10)

	order_id	quantity	item_name	choice_description	item_price	total_price	percent_of_total
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39	11.56	0.206747
1	1	1	Izze	[Clementine]	3.39	11.56	0.293253
2	1	1	Nantucket Nectar	[Apple]	3.39	11.56	0.293253
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39	11.56	0.206747
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98	16.98	1.000000
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98	12.67	0.866614
6	3	1	Side of Chips	NaN	1.69	12.67	0.133386
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75	21.00	0.559524
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25	21.00	0.440476
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25	13.70	0.675182
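
The difference between aggregating and transforming is easiest to see on a handful of rows. The sketch below uses made-up prices and hypothetical variable names:

```python
import pandas as pd

orders_small = pd.DataFrame({"order_id": [1, 1, 2, 3, 3],
                             "item_price": [2.0, 3.0, 10.0, 6.0, 4.0]})

grouped = orders_small.groupby("order_id").item_price
per_order = grouped.sum()             # one value per order (reduced)
per_row = grouped.transform("sum")    # one value per original row

# Because per_row is aligned with the original rows, it can be a new column.
orders_small["total_price"] = per_row
orders_small["percent_of_total"] = (orders_small.item_price
                                    / orders_small.total_price)

print(len(per_order), len(per_row))
print(orders_small)
```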

20. Select a slice of rows and columns

Let's take a look at another dataset:

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

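This is the famous Titanic dataset, which records information about the passengers on the Titanic and whether or not they survived.

If you want a numerical summary of the dataset, you can use the describe() method: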

titanic.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

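However, the resulting DataFrame might show more information than you need.

If you want to filter it to show only the "five-number summary", you can use the loc accessor and pass it a slice from the "min" row label through the "max" row label: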

titanic.describe().loc['min':'max']

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
min	1.0	0.0	1.0	0.420	0.0	0.0	0.0000
25%	223.5	0.0	2.0	20.125	0.0	0.0	7.9104
50%	446.0	0.0	3.0	28.000	0.0	0.0	14.4542
75%	668.5	1.0	3.0	38.000	1.0	0.0	31.0000
max	891.0	1.0	3.0	80.000	8.0	6.0	512.3292

And if you are not interested in all of the columns, you can also pass it a slice of column labels:

titanic.describe().loc['min':'max', 'Pclass':'Parch']

	Pclass	Age	SibSp	Parch
min	1.0	0.420	0.0	0.0
25%	2.0	20.125	0.0	0.0
50%	3.0	28.000	0.0	0.0
75%	3.0	38.000	1.0	0.0
max	3.0	80.000	8.0	6.0

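21. Reshape a MultiIndexed Series

The Titanic dataset has a Survived column made up of ones and zeros, so you can calculate the overall survival rate by taking the mean of that column: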

titanic.Survived.mean()

0.3838383838383838

If you wanted to calculate the survival rate by a single category such as "Sex", you would use groupby():

titanic.groupby('Sex').Survived.mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

And if you wanted to calculate the survival rate across two categorical variables at once, you would groupby() both of them:

titanic.groupby(['Sex', 'Pclass']).Survived.mean()

Sex     Pclass
female  1         0.968085
        2         0.921053
        3         0.500000
male    1         0.368852
        2         0.157407
        3         0.135447
Name: Survived, dtype: float64

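This shows the survival rate for every combination of Sex and Passenger Class. It is stored as a MultiIndexed Series, meaning that it has multiple index levels to the left of the actual data.

Data in this format can be hard to read and interact with, so it is often more convenient to reshape a MultiIndexed Series into a DataFrame by using the unstack() method: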

titanic.groupby(['Sex', 'Pclass']).Survived.mean().unstack()

Pclass	1	2	3
Sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

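This DataFrame contains exactly the same data as the MultiIndexed Series, except that now you can work with it using familiar DataFrame methods.

22. Create a pivot table

If you often create DataFrames like the one above, you might find it more convenient to use the pivot_table() method instead: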

titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean')

Pclass	1	2	3
Sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

With a pivot table, you directly specify the index, the columns, the values, and the aggregation function.

An added benefit of a pivot table is that you can easily add row and column totals by setting margins=True:

titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean',
                    margins=True)

Pclass	1	2	3	All
Sex
female	0.968085	0.921053	0.500000	0.742038
male	0.368852	0.157407	0.135447	0.188908
All	0.629630	0.472826	0.242363	0.383838

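This shows the overall survival rate as well as the survival rate by Sex and by Passenger Class.

Finally, you can create a cross-tabulation just by changing the aggregation function from "mean" to "count":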

titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='count',
                    margins=True)

Pclass	1	2	3	All
Sex
female	94	76	144	314
male	122	108	347	577
All	216	184	491	891

This shows the number of records in each combination of categories.
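
pandas also has a dedicated crosstab() function that produces this kind of count table without a values column. The sketch below runs on five made-up passengers (the data and variable names are hypothetical), not on the real Titanic dataset:

```python
import pandas as pd

passengers = pd.DataFrame({"Sex": ["female", "female", "male", "male", "male"],
                           "Pclass": [1, 3, 1, 3, 3]})

# Count the rows for each Sex/Pclass combination; margins=True adds the totals.
counts = pd.crosstab(passengers.Sex, passengers.Pclass, margins=True)

print(counts)
```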

23. Convert continuous data into categorical data

Let's take a look at the Age column from the Titanic dataset:

titanic.Age.head(10)

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

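It is currently continuous data, but what if you wanted to convert it into categorical data?

One solution is to label the age ranges, such as "child", "young adult", and "adult". The best way to do this is with the cut() function: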

pd.cut(titanic.Age, bins=[0, 18, 25, 99], labels=['child', 'young adult', 'adult']).head(10)

0    young adult
1          adult
2          adult
3          adult
4          adult
5            NaN
6          adult
7          child
8          adult
9          child
Name: Age, dtype: category
Categories (3, object): ['child' < 'young adult' < 'adult']

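This assigns each value to a bin with a label. Ages above 0 and up to 18 were labeled "child", ages above 18 and up to 25 were labeled "young adult", and ages above 25 and up to 99 were labeled "adult".

Notice that the data type is now category, and that the categories are automatically ordered.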
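
Which bin a boundary value such as 18 or 25 falls into depends on the right parameter of cut(). By default the bins include their right edge; with right=False they include their left edge instead. A short sketch on made-up ages (variable names are illustrative):

```python
import pandas as pd

ages = pd.Series([17.0, 18.0, 19.0, 25.0, 26.0])
bins = [0, 18, 25, 99]
labels = ["child", "young adult", "adult"]

# Default: bins are (0, 18], (18, 25], (25, 99]
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: bins are [0, 18), [18, 25), [25, 99)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```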

24. Change display options

Let's take another look at the Titanic dataset:

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

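Notice that the Age column has 1 decimal place and the Fare column has 4 decimal places. What if you wanted to standardize the display to use 2 decimal places?

You can use the set_option() function: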

pd.set_option('display.float_format', '{:.2f}'.format)

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.00	1	A/5 21171	7.25	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.00	1	PC 17599	71.28	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.00	0	STON/O2. 3101282	7.92	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.00	1	113803	53.10	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.00	0	373450	8.05	NaN	S

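The first argument to set_option() is the name of the option, and the second argument is a formatting function, here the format method of a Python format string. You can see that Age and Fare are now displayed with 2 decimal places. Note that this did not change the underlying data, only how it is displayed.

You can also reset any option back to its default:

pd.reset_option('display.float_format')

Other options are used in the same way.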
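
If you only need a different format for one piece of output, option_context() applies an option temporarily and restores the previous value afterwards. A minimal sketch with a made-up DataFrame; to_string() is used here only so that the rendered text can be inspected:

```python
import pandas as pd

fares = pd.DataFrame({"Fare": [7.25, 71.2833, 7.925]})

# The option is active only inside the with block.
with pd.option_context("display.float_format", "{:.2f}".format):
    inside = fares.to_string()

# Outside the block the previous setting is back in effect.
outside = fares.to_string()

print(inside)
print(outside)
```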

25. Style a DataFrame

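The previous trick is useful if you want to change the display of your entire Jupyter notebook. However, a more flexible and powerful approach is to define the style of a particular DataFrame.

Let's return to the stocks DataFrame: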

stocks

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
3	2016-10-04	113.00	29736800	AAPL
4	2016-10-04	57.24	20085900	MSFT
5	2016-10-04	31.35	18460400	CSCO
6	2016-10-05	57.64	16726400	MSFT
7	2016-10-05	31.59	11808600	CSCO
8	2016-10-05	113.05	21453100	AAPL

We can create a dictionary of format strings that specifies how each column should be formatted, and then pass it to the DataFrame's style.format() method:

format_dict = {'Date':'{:%m/%d/%y}', 'Close':'${:.2f}', 'Volume':'{:,}'}
stocks.style.format(format_dict)

	Date	Close	Volume	Symbol
0	10/03/16	$31.50	14,070,500	CSCO
1	10/03/16	$112.52	21,701,800	AAPL
2	10/03/16	$57.42	19,189,500	MSFT
3	10/04/16	$113.00	29,736,800	AAPL
4	10/04/16	$57.24	20,085,900	MSFT
5	10/04/16	$31.35	18,460,400	CSCO
6	10/05/16	$57.64	16,726,400	MSFT
7	10/05/16	$31.59	11,808,600	CSCO
8	10/05/16	$113.05	21,453,100	AAPL

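Notice that the Date column is now in month/day/year format, the Close column has a dollar sign, and the Volume column has commas.

We can apply more styling by chaining additional methods: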

(stocks.style.format(format_dict)
 .hide_index()
 .highlight_min('Close', color='red')
 .highlight_max('Close', color='lightgreen')
)

Date	Close	Volume	Symbol
10/03/16	$31.50	14,070,500	CSCO
10/03/16	$112.52	21,701,800	AAPL
10/03/16	$57.42	19,189,500	MSFT
10/04/16	$113.00	29,736,800	AAPL
10/04/16	$57.24	20,085,900	MSFT
10/04/16	$31.35	18,460,400	CSCO
10/05/16	$57.64	16,726,400	MSFT
10/05/16	$31.59	11,808,600	CSCO
10/05/16	$113.05	21,453,100	AAPL

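We have now hidden the index, highlighted the minimum Close value in red, and highlighted the maximum Close value in light green.

Here is another example of DataFrame styling: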

(stocks.style.format(format_dict)
 .hide_index()
 .background_gradient(subset='Volume', cmap='Blues')
)

Date	Close	Volume	Symbol
10/03/16	$31.50	14,070,500	CSCO
10/03/16	$112.52	21,701,800	AAPL
10/03/16	$57.42	19,189,500	MSFT
10/04/16	$113.00	29,736,800	AAPL
10/04/16	$57.24	20,085,900	MSFT
10/04/16	$31.35	18,460,400	CSCO
10/05/16	$57.64	16,726,400	MSFT
10/05/16	$31.59	11,808,600	CSCO
10/05/16	$113.05	21,453,100	AAPL

The Volume column now has a background gradient that helps you easily identify high and low values.

And here is one final example:

(stocks.style.format(format_dict)
 .hide_index()
 .bar('Volume', color='lightblue', align='zero')
 .set_caption('Stock Prices from October 2016')
)

Stock Prices from October 2016
Date	Close	Volume	Symbol
10/03/16	$31.50	14,070,500	CSCO
10/03/16	$112.52	21,701,800	AAPL
10/03/16	$57.42	19,189,500	MSFT
10/04/16	$113.00	29,736,800	AAPL
10/04/16	$57.24	20,085,900	MSFT
10/04/16	$31.35	18,460,400	CSCO
10/05/16	$57.64	16,726,400	MSFT
10/05/16	$31.59	11,808,600	CSCO
10/05/16	$113.05	21,453,100	AAPL

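There is now a bar chart within the Volume column and a caption above the DataFrame. Note that there are many more options you can use to style a DataFrame.

Bonus: Profile a DataFrame

Suppose you have a new dataset and you want to explore it quickly without too much effort. You can use the pandas-profiling package for this. Install it on your system, then call the ProfileReport() function and pass it any DataFrame. It returns an interactive HTML report:

The first section is an overview of the dataset and a list of possible issues with the data;
The second section gives a summary of each column. You can click "toggle details" for more information;
The third section shows a heatmap of the correlation between columns;
The fourth section reports on missing values;
The fifth section shows the first few rows of the dataset.

Here is a usage example (only the first section of the report is displayed):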

import pandas_profiling
pandas_profiling.ProfileReport(titanic)

Posted 2022-11-18 20:31 by 王陸
