16. How do I handle missing values in pandas? (video)

In [121]:

 
​
url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"#定义列名
ufo = pd.read_csv(url3)#用read_csv打开csv文件
ufo.tail()

Out[121]:

	City	Colors Reported	Shape Reported	State	Time
18236	Grant Park	NaN	TRIANGLE	IL	12/31/2000 23:00
18237	Spirit Lake	NaN	DISK	IA	12/31/2000 23:00
18238	Eagle River	NaN	NaN	WI	12/31/2000 23:45
18239	Eagle River	RED	LIGHT	WI	12/31/2000 23:45
18240	Ybor	NaN	OVAL	FL	12/31/2000 23:59

“NaN”是什么意思？

“NaN”不是字符串，而是一个特殊值：numpy.nan。它代表“非数字”并表示缺失值。 read_csv在读取文件时检测缺失值（默认情况下），并用此特殊值替换它们。

Documentation for read_csv

In [122]:

 
# ''isnull'返回一个布尔DataFrame（如果缺少则为True，如果没有缺失则为False）
ufo.isnull().tail()

Out[122]:

	City	Colors Reported	Shape Reported	State	Time
18236	False	True	False	False	False
18237	False	True	False	False	False
18238	False	True	True	False	False
18239	False	False	False	False	False
18240	False	True	False	False	False

In [123]:

 
# 'nonnull'返回'isnull'的反面（如果不丢失，则返回true，如果丢失则返回False）
ufo.notnull().tail()

Out[123]:

	City	Colors Reported	Shape Reported	State	Time
18236	True	False	True	True	True
18237	True	False	True	True	True
18238	True	False	False	True	True
18239	True	True	True	True	True
18240	True	False	True	True	True

Documentation for isnull and notnull

In [124]:

 
# 计算每个Series中缺失值的数量
ufo.isnull().sum()

Out[124]:

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

此计算有效，因为：

1.默认情况下，DataFrame的sum **方法在axis = 0 上运行（因此产生列总和）。 2.为了添加布尔值，pandas将“True”转换为 1 ，将“False”转换为 0 **。

In [125]:

 
# 使用'isnull'Series方法过滤DataFrame行
ufo[ufo.City.isnull()].head()

Out[125]:

	City	Colors Reported	Shape Reported	State	Time
21	NaN	NaN	NaN	LA	8/15/1943 0:00
22	NaN	NaN	LIGHT	LA	8/15/1943 0:00
204	NaN	NaN	DISK	CA	7/15/1952 12:30
241	NaN	BLUE	DISK	MT	7/4/1953 14:00
613	NaN	NaN	DISK	NV	7/1/1960 12:00

How to handle missing values depends on the dataset as well as the nature of your analysis. Here are some options:

In [126]:

 
# 检查行数和列数
ufo.shape

Out[126]:

(18241, 5)

In [127]:

 
# 如果连续缺少“任何”值，则删除该行
ufo.dropna(how='any').shape

Out[127]:

(2486, 5)

Documentation for dropna

In [128]:

 
# 默认情况下，'dropna'的inplace'参数为False，因此只暂时删除行
ufo.shape

Out[128]:

(18241, 5)

In [129]:

 
# 如果一行中缺少'all'值，则删除该行（在这种情况下不丢弃）
ufo.dropna(how='all').shape

Out[129]:

(18241, 5)

In [130]:

 
# 给 any 一个范围
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape

Out[130]:

(15576, 5)

In [131]:

 
# 给all 一个范围
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape

Out[131]:

(18237, 5)

In [132]:

 
# 默认情况下，value_counts'不包含缺失值
ufo['Shape Reported'].value_counts().head()

Out[132]:

LIGHT       2803
DISK        2122
TRIANGLE    1889
OTHER       1402
CIRCLE      1365
Name: Shape Reported, dtype: int64

In [133]:

 
# 明确包含缺失值
ufo['Shape Reported'].value_counts(dropna=False).head()

Out[133]:

LIGHT       2803
NaN         2644
DISK        2122
TRIANGLE    1889
OTHER       1402
Name: Shape Reported, dtype: int64

Documentation for value_counts

In [134]:

 
# 使用指定值填充缺失值
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)

Documentation for fillna

In [135]:

 
# 确认填写了缺失值
ufo['Shape Reported'].value_counts().head()

Out[135]:

VARIOUS     2977
LIGHT       2803
DISK        2122
TRIANGLE    1889
OTHER       1402
Name: Shape Reported, dtype: int64

Working with missing data in pandas

[Back to top]

17. What do I need to know about the pandas index? (Part 1) (video)

In [137]:

 
​
drinks = pd.read_csv(url7)
drinks.head()

Out[137]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

In [138]:

 
# 每个DataFrame都有一个索引（有时称为“行标签”）
drinks.index

Out[138]:

RangeIndex(start=0, stop=193, step=1)

In [139]:

 
# 列名也存储在特殊的“索引”对象中
drinks.columns

Out[139]:

Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

In [140]:

 
# 索引和列名都不包含在形状中
drinks.shape

Out[140]:

(193, 6)

In [141]:

 
# 如果不定义索引和列名，则默认为整数
pd.read_table('http://bit.ly/movieusers', header=None, sep='|').head()

Out[141]:

	0	1	2	3	4
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213

索引用于什么？

识别选择对齐

In [142]:

 
# i识别：过滤DataFrame时，每行保留索引
drinks[drinks.continent=='South America']

Out[142]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
6	Argentina	193	25	221	8.3	South America
20	Bolivia	167	41	8	3.8	South America
23	Brazil	245	145	16	7.2	South America
35	Chile	130	124	172	7.6	South America
37	Colombia	159	76	3	4.2	South America
52	Ecuador	162	74	3	4.2	South America
72	Guyana	93	302	1	7.1	South America
132	Paraguay	213	117	74	7.3	South America
133	Peru	163	160	21	6.1	South America
163	Suriname	128	178	7	5.6	South America
185	Uruguay	115	35	220	6.6	South America
188	Venezuela	333	100	3	7.7	South America

In [143]:

 
# 选择：使用索引选择DataFrame的一部分
drinks.loc[23, 'beer_servings']

Out[143]:

Documentation for loc

In [144]:

 
# 将现有列设置为索引
drinks.set_index('country', inplace=True)
drinks.head()

Out[144]:

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
country
Afghanistan	0	0	0	0.0	Asia
Albania	89	132	54	4.9	Europe
Algeria	25	0	14	0.7	Africa
Andorra	245	138	312	12.4	Europe
Angola	217	57	45	5.9	Africa

Documentation for set_index

In [145]:

 
# 'country'现在是索引
drinks.index

Out[145]:

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
       'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=193)

In [146]:

 
# 'country'不再是列名
drinks.columns

Out[146]:

Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

In [147]:

 
# 'country'数据不再是DataFrame内容的一部分
drinks.shape

Out[147]:

(193, 5)

In [148]:

 
# 国家名字现在可用于选择
drinks.loc['Brazil', 'beer_servings']

Out[148]:

In [149]:

 
# 索引名称是可选的
drinks.index.name = None
drinks.head()

Out[149]:

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
Afghanistan	0	0	0	0.0	Asia
Albania	89	132	54	4.9	Europe
Algeria	25	0	14	0.7	Africa
Andorra	245	138	312	12.4	Europe
Angola	217	57	45	5.9	Africa

In [150]:

 
# 恢复索引名称，并将索引移回列
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()

Out[150]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

Documentation for reset_index

In [151]:

 
# 许多DataFrame方法输出一个DataFrame
drinks.describe()

Out[151]:

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
count	193.000000	193.000000	193.000000	193.000000
mean	106.160622	80.994819	49.450777	4.717098
std	101.143103	88.284312	79.697598	3.773298
min	0.000000	0.000000	0.000000	0.000000
25%	20.000000	4.000000	1.000000	1.300000
50%	76.000000	56.000000	8.000000	4.200000
75%	188.000000	128.000000	59.000000	7.200000
max	376.000000	438.000000	370.000000	14.400000

In [152]:

 
# 您可以使用其索引和列与任何DataFrame进行交互
drinks.describe().loc['25%', 'beer_servings']

Out[152]:

20.0

Indexing and selecting data

[Back to top]

18. What do I need to know about the pandas index? (Part 2) (video)

In [153]:

 
​
drinks = pd.read_csv(url7)
drinks.head()

Out[153]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

In [154]:

 
# 每个DataFrame都有一个索引
drinks.index

Out[154]:

RangeIndex(start=0, stop=193, step=1)

In [155]:

 
# 每个Series还有一个索引（从DataFrame继承）
drinks.continent.head()

Out[155]:

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object

In [156]:

 
# 将'country'设置为索引
drinks.set_index('country', inplace=True)

Documentation for set_index

In [157]:

 
# Series 索引在左侧，值在右侧
drinks.continent.head()

Out[157]:

country
Afghanistan      Asia
Albania        Europe
Algeria        Africa
Andorra        Europe
Angola         Africa
Name: continent, dtype: object

In [158]:

 
# 系Series的另一个例子（来自'value_counts'方法的输出）
drinks.continent.value_counts()

Out[158]:

Africa           53
Europe           45
Asia             44
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

Documentation for value_counts

In [159]:

 
# Series 访问索引
drinks.continent.value_counts().index

Out[159]:

Index(['Africa', 'Europe', 'Asia', 'North America', 'Oceania',
       'South America'],
      dtype='object')

In [160]:

 
# 访问Series 的值
drinks.continent.value_counts().values

Out[160]:

array([53, 45, 44, 23, 16, 12], dtype=int64)

In [161]:

 
#可以通过索引选择系列中的元素（使用括号表示法）
drinks.continent.value_counts()['Africa']

Out[161]:

In [162]:

 
# 任何Series都可以按其值排序
drinks.continent.value_counts().sort_values()

Out[162]:

South America    12
Oceania          16
North America    23
Asia             44
Europe           45
Africa           53
Name: continent, dtype: int64

In [163]:

 
# a任何Series都可以按其索引排序
drinks.continent.value_counts().sort_index()

Out[163]:

Africa           53
Asia             44
Europe           45
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64

Documentation for sort_values and sort_index

索引用于什么？

识别选择对齐

In [164]:

 
# beer_servings'包含每人平均每年的啤酒摄取量
drinks.beer_servings.head()

Out[164]:

country
Afghanistan      0
Albania         89
Algeria         25
Andorra        245
Angola         217
Name: beer_servings, dtype: int64

In [165]:

 
# 创建一个包含两个国家人口的Series
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people

Out[165]:

Albania    3000000
Andorra      85000
Name: population, dtype: int64

Documentation for Series

In [166]:

 
# 计算每个国家的年度啤酒总量
(drinks.beer_servings * people).head()

Out[166]:

Afghanistan            NaN
Albania        267000000.0
Algeria                NaN
Andorra         20825000.0
Angola                 NaN
dtype: float64

两个Series按其指数对齐。如果任一Series中缺少值，则结果标记为NaN。对齐使我们能够轻松处理不完整的数据。

In [167]:

 
# concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index)
pd.concat([drinks, people], axis=1).head()

Out[167]:

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent	population
Afghanistan	0	0	0	0.0	Asia	NaN
Albania	89	132	54	4.9	Europe	3000000.0
Algeria	25	0	14	0.7	Africa	NaN
Andorra	245	138	312	12.4	Europe	85000.0
Angola	217	57	45	5.9	Africa	NaN

Documentation for concat

Indexing and selecting data

[Back to top]

19. How do I select multiple rows and columns from a pandas DataFrame? (video)

In [168]:

 
​
ufo = pd.read_csv(url3)
ufo.head(3)

Out[168]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00

The loc y loc方法用于按标签选择行和列。你可以传递它：

单个标签标签列表标签切片布尔Series 冒号（表示“所有标签”）

In [169]:

 
# 第0行，所有列
ufo.loc[0, :]

Out[169]:

City                       Ithaca
Colors Reported               NaN
Shape Reported           TRIANGLE
State                          NY
Time               6/1/1930 22:00
Name: 0, dtype: object

In [170]:

 
# 头三行，所有列
ufo.loc[[0, 1, 2], :]

Out[170]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00

In [171]:

 
# 0：2行 包括2, all columns
ufo.loc[0:2, :]

Out[171]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00

In [172]:

 
# t这意味着“所有列”，但明确声明“所有列”更好
ufo.loc[0:2]

Out[172]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00

In [173]:

 
# 行0到行2（包含行2），列'城市'
ufo.loc[0:2, 'City']

Out[173]:

0         Ithaca
1    Willingboro
2        Holyoke
Name: City, dtype: object

In [174]:

 
# 行0到行2（包含行2）, 列 'City' and 'State'
ufo.loc[0:2, ['City', 'State']]

Out[174]:

	City	State
0	Ithaca	NY
1	Willingboro	NJ
2	Holyoke	CO

In [175]:

 
# 使用双括号完成相同的事情 - 但使用'loc'是首选，因为它更明确
ufo[['City', 'State']].head(3)

Out[175]:

	City	State
0	Ithaca	NY
1	Willingboro	NJ
2	Holyoke	CO

In [176]:

 
# 行0到行2（包含行2）, 列 'City' 到 'State')
ufo.loc[0:2, 'City':'State']

Out[176]:

	City	Colors Reported	Shape Reported	State
0	Ithaca	NaN	TRIANGLE	NY
1	Willingboro	NaN	OTHER	NJ
2	Holyoke	NaN	OVAL	CO

In [177]:

 
# 使用'head'和'drop'完成同样的事情
ufo.head(3).drop('Time', axis=1)

Out[177]:

	City	Colors Reported	Shape Reported	State
0	Ithaca	NaN	TRIANGLE	NY
1	Willingboro	NaN	OTHER	NJ
2	Holyoke	NaN	OVAL	CO

In [178]:

 
# 'City'为'Oakland'的行，'State'列
ufo.loc[ufo.City=='Oakland', 'State']

Out[178]:

1694     CA
2144     CA
4686     MD
7293     CA
8488     CA
8768     CA
10816    OR
10948    CA
11045    CA
12322    CA
12941    CA
16803    MD
17322    CA
Name: State, dtype: object

In [179]:

 
# 使用“链式索引”完成同样的事情 - 但使用'loc'是首选，因为链式索引可能会导致问题 
ufo[ufo.City=='Oakland'].State

Out[179]:

1694     CA
2144     CA
4686     MD
7293     CA
8488     CA
8768     CA
10816    OR
10948    CA
11045    CA
12322    CA
12941    CA
16803    MD
17322    CA
Name: State, dtype: object

The iloc iloc'方法用于通过整数位置选择行和列。你可以传递它：

单个整数位置
整数位置列表
整数位置切片
冒号（表示“所有整数位置”）

In [180]:

 
# 位置0和1中的行，位置0和3中的列
ufo.iloc[[0, 1], [0, 3]]

Out[180]:

	City	State
0	Ithaca	NY
1	Willingboro	NJ

In [181]:

 
# 位置0到2中的行（不包括），位置0到4中的列（不包括）
ufo.iloc[0:2, 0:4]

Out[181]:

	City	Colors Reported	Shape Reported	State
0	Ithaca	NaN	TRIANGLE	NY
1	Willingboro	NaN	OTHER	NJ

In [182]:

 
# 所有列中位置0到2（不包括）的行
ufo.iloc[0:2, :]

Out[182]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00

In [183]:

 
# 完成同样的事情 - 但使用'iloc'是首选，因为它更明确
ufo[0:2]

Out[183]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00

The ix ix方法用于通过标签或整数位置选择行和列，并且只应在需要在同一调用中混合基于标签和基于整数的选择时使用。

In [184]:

 
# read a dataset of alcohol consumption into a DataFrame and set 'country' as the index
drinks = pd.read_csv(url7, index_col='country')
drinks.head()

Out[184]:

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
country
Afghanistan	0	0	0	0.0	Asia
Albania	89	132	54	4.9	Europe
Algeria	25	0	14	0.7	Africa
Andorra	245	138	312	12.4	Europe
Angola	217	57	45	5.9	Africa

In [185]:

 
# 标有'Albania'的行，位于0的列
drinks.ix['Albania', 0]

c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated

Out[185]:

In [186]:

 
# row in position 1, column with label 'beer_servings'
drinks.ix[1, 'beer_servings']

Out[186]:

使用带有ix的数字的规则：

如果索引是字符串，则数字被视为整数位置，因此切片在右侧是不包含。
如果索引是整数，则数字被视为标签，因此切片是包含。

In [187]:

 
# 行'阿尔巴尼亚'到'安道尔'（包括），0到2位的列（包含）
drinks.ix['Albania':'Andorra', 0:2]

Out[187]:

	beer_servings	spirit_servings
country
Albania	89	132
Algeria	25	0
Andorra	245	138

In [188]:

 
# 行0到2（含），0到2位的列（不包括）
ufo.ix[0:2, 0:2]

c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated

Out[188]:

	City	Colors Reported
0	Ithaca	NaN
1	Willingboro	NaN
2	Holyoke	NaN

Summary of the pandas API for selection

[Back to top]

20. When should I use the "inplace" parameter in pandas? (video)

In [189]:

 
# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv(url3)
ufo.head()

Out[189]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00

In [190]:

ufo.shape

Out[190]:

(18241, 5)

In [191]:

 
# 删除“城市”列（因为inplace = False，不会影响DataFrame）
ufo.drop('City', axis=1).head()

Out[191]:

	Colors Reported	Shape Reported	State	Time
0	NaN	TRIANGLE	NY	6/1/1930 22:00
1	NaN	OTHER	NJ	6/30/1930 20:00
2	NaN	OVAL	CO	2/15/1931 14:00
3	NaN	DISK	KS	6/1/1931 13:00
4	NaN	LIGHT	NY	4/18/1933 19:00

In [192]:

 
# 确认实际上没有删除“城市”列
ufo.head()

Out[192]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00

In [193]:

 
# 删除“城市”列（因为inplace = True，确实会影响DataFrame）
ufo.drop('City', axis=1, inplace=True)

In [194]:

 
# 确认实际删除了“城市”列
ufo.head()

Out[194]:

	Colors Reported	Shape Reported	State	Time
0	NaN	TRIANGLE	NY	6/1/1930 22:00
1	NaN	OTHER	NJ	6/30/1930 20:00
2	NaN	OVAL	CO	2/15/1931 14:00
3	NaN	DISK	KS	6/1/1931 13:00
4	NaN	LIGHT	NY	4/18/1933 19:00

In [195]:

 
# 如果该行中缺少任何值，则删除一行（因为inplace = False，不会影响DataFrame）
ufo.dropna(how='any').shape

Out[195]:

(2490, 4)

In [196]:

 
# 确认没有实际删除任何行
ufo.shape

Out[196]:

(18241, 4)

In [197]:

 
# 使用赋值语句而不是'inplace'参数
ufo = ufo.set_index('Time')
ufo.tail()

Out[197]:

	Colors Reported	Shape Reported	State
Time
12/31/2000 23:00	NaN	TRIANGLE	IL
12/31/2000 23:00	NaN	DISK	IA
12/31/2000 23:45	NaN	NaN	WI
12/31/2000 23:45	RED	LIGHT	WI
12/31/2000 23:59	NaN	OVAL	FL

In [198]:

 
# 使用“向后填充”策略填充缺失值（因为inplace = False不影响DataFrame)
ufo.fillna(method='bfill').tail()

Out[198]:

	Colors Reported	Shape Reported	State
Time
12/31/2000 23:00	RED	TRIANGLE	IL
12/31/2000 23:00	RED	DISK	IA
12/31/2000 23:45	RED	LIGHT	WI
12/31/2000 23:45	RED	LIGHT	WI
12/31/2000 23:59	NaN	OVAL	FL

In [199]:

 
# 使用“前向填充”策略进行操作（因为inplace = False，不会影响DataFrame）
ufo.fillna(method='ffill').tail()

Out[199]:

	Colors Reported	Shape Reported	State
Time
12/31/2000 23:00	RED	TRIANGLE	IL
12/31/2000 23:00	RED	DISK	IA
12/31/2000 23:45	RED	DISK	WI
12/31/2000 23:45	RED	LIGHT	WI
12/31/2000 23:59	RED	OVAL	FL

[Back to top]

21. How do I make my pandas DataFrame smaller and faster? (video)

In [200]:

 
​
drinks = pd.read_csv(url7)
drinks.head()

Out[200]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

In [201]:

 
# 确切的内存使用是未知的，因为对象列是别处的引用
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB

In [202]:

 
#强制pandas计算真正的内存使用量
drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 30.4 KB

In [203]:

 
# 计算每个系列的内存使用量（以字节为单位）
drinks.memory_usage(deep=True)

Out[203]:

Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                       12332
dtype: int64

Documentation for info and memory_usage

In [204]:

 
# 使用'category'数据类型（pandas 0.15）将'continent'字符串存储为整数
drinks['continent'] = drinks.continent.astype('category')
drinks.dtypes

Out[204]:

country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object

In [205]:

 
# 'continent'系列似乎没有变化
drinks.continent.head()

Out[205]:

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]

In [206]:

 
# strings are now encoded (0 means 'Africa', 1 means 'Asia', 2 means 'Europe', etc.)
drinks.continent.cat.codes.head()

Out[206]:

0    1
1    2
2    0
3    2
4    0
dtype: int8

In [207]:

 
# 现在对编码进行编码（0表示'非洲'，1表示'亚洲'，2表示'欧洲'等）
drinks.memory_usage(deep=True)

Out[207]:

Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

In [208]:

 
# country' Series中重复这个过程'
drinks['country'] = drinks.country.astype('category')
drinks.memory_usage(deep=True)

Out[208]:

Index                              80
country                         18094
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

In [209]:

 
# 因为我们创建了193个类别，因此内存使用增加
drinks.country.cat.categories

Out[209]:

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=193)

类别数据类型只应与具有少量可能值的字符串Series一起使用。

In [210]:

 
# 从字典中创建一个小型DataFrame
df = pd.DataFrame({'ID':[100, 101, 102, 103], 'quality':['good', 'very good', 'good', 'excellent']})
df

Out[210]:

	ID	quality
0	100	good
1	101	very good
2	102	good
3	103	excellent

In [211]:

 
# 按“质量”系列排序DataFrame（按字母顺序排列）
df.sort_values('quality')

Out[211]:

	ID	quality
3	103	excellent
0	100	good
2	102	good
1	101	very good

In [212]:

 
# 定义类别的逻辑顺序
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
df.quality

c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead

Out[212]:

0         good
1    very good
2         good
3    excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]

In [213]:

 
# 按“质量”系列（逻辑顺序）输出DataFrame)
df.sort_values('quality')

Out[213]:

	ID	quality
0	100	good
2	102	good
1	101	very good
3	103	excellent

In [214]:

 
# 比较运算符使用有序类别
df.loc[df.quality > 'good', :]

Out[214]:

	ID	quality
1	101	very good
3	103	excellent

Overview of categorical data in pandas

API reference for categorical methods

[Back to top]

22. How do I use pandas with scikit-learn to create Kaggle submissions? (video)

In [215]:

 
# 从Kaggle的泰坦尼克号竞赛中将训练数据集读入DataFrame
url5 = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/titanic_train.csv'
train = pd.read_csv(url5)
train.head()

Out[215]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Goal: Predict passenger survival aboard the Titanic based on passenger attributes

Video: What is machine learning, and how does it work?

In [216]:

#根据乘客属性预测泰坦尼克号上的乘客生存率

In [217]:

 
# 通过选择两个DataFrame列来创建特征矩阵“X”
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]
X.shape

Out[217]:

(891, 2)

In [218]:

 
# 通过选择一个系列来创建一个响应向量'y'
y = train.Survived
y.shape

Out[218]:

(891,)

Note: There is no need to convert these pandas objects to NumPy arrays. scikit-learn will understand these objects as long as they are entirely numeric and the proper shapes.

In [219]:

 
#无需将这些pandas对象转换为NumPy数组。 scikit-learn将理解这些对象，只要它们完全是数字和正确的形状

In [220]:

 
# 将分类模型拟合到训练数据中
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)

Out[220]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Video series: Introduction to machine learning with scikit-learn

In [221]:

 
# 从Kaggle的泰坦尼克号竞赛中读取测试数据集到DataFrame
url6 ="https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/titanic_test.csv"
test = pd.read_csv(url6)
test.head()

Out[221]:

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

In [222]:

 
# 根据与训练数据匹配的测试数据创建特征矩阵
X_new = test.loc[:, feature_cols]
X_new.shape

Out[222]:

(418, 2)

In [223]:

 
# 使用拟合模型对测试集进行预测
new_pred_class = logreg.predict(X_new)

In [224]:

 
# 输出结果
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).head()

Out[224]:

	PassengerId	Survived
0	892	0
1	893	0
2	894	0
3	895	0
4	896	0

Documentation for the DataFrame constructor

In [225]:

 
# 通过将PassengerID设置为索引来确保PassengerID是第一列
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').head()

Out[225]:

	Survived
PassengerId
892	0
893	0
894	0
895	0
896	0

In [226]:

 
# 将DataFrame写入可以提交给Kaggle的CSV文件
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').to_csv('sub.csv')

Documentation for to_csv

In [227]:

 
# save a DataFrame to disk ("pickle it")
train.to_pickle('train.pkl')

In [228]:

 
# read a pickled object from disk ("unpickle it")
pd.read_pickle('train.pkl').head()

Out[228]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Documentation for to_pickle and read_pickle

[Back to top]

23. More of your pandas questions answered! (video)

Question: Could you explain how to read the pandas documentation?

pandas API reference

Question: What is the difference between ufo.isnull() and pd.isnull(ufo)?

In [229]:

 
​
ufo = pd.read_csv(url3)
ufo.head()

Out[229]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00

In [230]:

 
# ＃使用'isnull'作为顶层函数
pd.isnull(ufo).head()

Out[230]:

	City	Colors Reported	Shape Reported	State	Time
0	False	True	False	False	False
1	False	True	False	False	False
2	False	True	False	False	False
3	False	True	False	False	False
4	False	True	False	False	False

In [231]:

 
# 等效：使用'isnull'作为DataFrame方法
ufo.isnull().head()

Out[231]:

	City	Colors Reported	Shape Reported	State	Time
0	False	True	False	False	False
1	False	True	False	False	False
2	False	True	False	False	False
3	False	True	False	False	False
4	False	True	False	False	False

Documentation for isnull

Question: Why are DataFrame slices inclusive when using .loc, but exclusive when using .iloc?

In [232]:

 
# 基于标签的切片包括开始和结束
ufo.loc[0:4, :]

Out[232]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00

In [233]:

 
# 基于位置的切片包括开始和不包括结束
ufo.iloc[0:4, :]

Out[233]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00

Documentation for loc and iloc

In [234]:

 
# iloc'只是遵循NumPy的切片惯例......
ufo.values[0:4, :]

Out[234]:

array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'],
       ['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'],
       ['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'],
       ['Abilene', nan, 'DISK', 'KS', '6/1/1931 13:00']], dtype=object)

In [235]:

 
# ..而NumPy只是遵循Python的切片约定
'python'[0:4]

Out[235]:

'pyth'

In [236]:

 
# oc'包含停止标签，因为您不一定知道它后面会是什么标签
ufo.loc[0:4, 'City':'State']

Out[236]:

	City	Colors Reported	Shape Reported	State
0	Ithaca	NaN	TRIANGLE	NY
1	Willingboro	NaN	OTHER	NJ
2	Holyoke	NaN	OVAL	CO
3	Abilene	NaN	DISK	KS
4	New York Worlds Fair	NaN	LIGHT	NY

Question: How do I randomly sample rows from a DataFrame?

In [237]:

 
# 随机取三行
ufo.sample(n=3)

Out[237]:

	City	Colors Reported	Shape Reported	State	Time
16158	Summer Lake	RED ORANGE	SPHERE	OR	3/28/2000 7:45
3334	San Jose	NaN	DISK	CA	8/1/1976 21:00
15116	Dallas	GREEN	FLASH	TX	11/16/1999 5:00

Documentation for sample

In [238]:

 
# 使用'random_state'参数进行使结果可以再现
ufo.sample(n=3, random_state=42)

Out[238]:

	City	Colors Reported	Shape Reported	State	Time
217	Norridgewock	NaN	DISK	ME	9/15/1952 14:00
12282	Ipava	NaN	TRIANGLE	IL	10/1/1998 21:15
17933	Ellinwood	NaN	FIREBALL	KS	11/13/2000 22:00

In [239]:

 
# 按比例采样
train = ufo.sample(frac=0.75, random_state=99)

In [240]:

 
# 剩下数据存到另一个DataFrame
test = ufo.loc[~ufo.index.isin(train.index), :]

Documentation for isin

[Back to top]

24. How do I create dummy variables in pandas? (video)

In [241]:

 
​
train = pd.read_csv(url5)
train.head()

Out[241]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

In [242]:

 
# 使用'map'方法创建'Sex_male'虚拟变量
train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
train.head()

Out[242]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Sex_male
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	0
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	1

Documentation for map

In [243]:

 
# 替代方法：使用'get_dummies'为每个可能的值创建一列
pd.get_dummies(train.Sex).head()

Out[243]:

	female	male
0	0	1
1	1	0
2	1	0
3	1	0
4	0	1

Generally speaking:

If you have "K" possible values for a categorical feature, you only need "K-1" dummy variablesto capture all of the information about that feature.
One convention is to drop the first dummy variable, which defines that level as the "baseline".

In [244]:

 
"""一般来说：
​
如果分类特征具有“K”个可能值，则只需要“K-1”虚拟变量来捕获有关该特征的所有信息。
一种惯用方法是删除第一个虚拟变量，该变量将该级别定义为“基线”。"""

Out[244]:

'一般来说：\n\n如果分类特征具有“K”个可能值，则只需要“K-1”虚拟变量来捕获有关该特征的所有信息。\n一种惯用方法是删除第一个虚拟变量，该变量将该级别定义为“基线”。'

In [245]:

 
# 使用'iloc'方法删除第一个虚拟变量（'female'）
pd.get_dummies(train.Sex).iloc[:, 1:].head()

Out[245]:

	male
0	1
1	0
2	0
3	0
4	1

In [246]:

 
# 添加前缀以标识虚拟变量的来源
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head()

Out[246]:

	Sex_male
0	1
1	0
2	0
3	0
4	1

In [247]:

 
# 使用 'get_dummies' 处理具有3个可能值
pd.get_dummies(train.Embarked, prefix='Embarked').head(10)

Out[247]:

	Embarked_C	Embarked_Q	Embarked_S
0	0	0	1
1	1	0	0
2	0	0	1
3	0	0	1
4	0	0	1
5	0	1	0
6	0	0	1
7	0	0	1
8	0	0	1
9	1	0	0

In [248]:

 
# 删除第一个虚拟变量（'C'）
pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10)

Out[248]:

	Embarked_Q	Embarked_S
0	0	1
1	0	0
2	0	1
3	0	1
4	0	1
5	1	0
6	0	1
7	0	1
8	0	1
9	0	0

How to translate these values back to the original 'Embarked' value:

0, 0 means C
1, 0 means Q
0, 1 means S

In [249]:

 
# 保存虚拟变量的DataFrame并将它们连接到原始DataFram
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
train = pd.concat([train, embarked_dummies], axis=1)
train.head()

Out[249]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Sex_male	Embarked_S
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	1	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	0	0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	0	1
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	0	1
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	1	1

Documentation for concat

In [250]:

 
# 重新读取DataFrame
train = pd.read_csv(url5)
train.head()

Out[250]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

In [251]:

 
#  将DataFrame传递给'get_dummies'并指定哪些列为dummy（它删除了原始列）
pd.get_dummies(train, columns=['Sex', 'Embarked']).head()

Out[251]:

	PassengerId	Survived	Pclass	Name	Age	SibSp	Ticket	Fare	Cabin	Sex_female	Sex_male	Embarked_C	Embarked_S
0	1	0	3	Braund, Mr. Owen Harris	22.0	1	A/5 21171	7.2500	NaN	0	1	0	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	38.0	1	PC 17599	71.2833	C85	1	0	1	0
2	3	1	3	Heikkinen, Miss. Laina	26.0	0	STON/O2. 3101282	7.9250	NaN	1	0	0	1
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0	1	113803	53.1000	C123	1	0	0	1
4	5	0	3	Allen, Mr. William Henry	35.0	0	373450	8.0500	NaN	0	1	0	1

In [252]:

 
# 使用'drop_first'参数（pandas 0.18）删除每个特征的第一个虚拟变量
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True).head()

Out[252]:

	PassengerId	Survived	Pclass	Name	Age	SibSp	Ticket	Fare	Cabin	Sex_male	Embarked_S
0	1	0	3	Braund, Mr. Owen Harris	22.0	1	A/5 21171	7.2500	NaN	1	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	38.0	1	PC 17599	71.2833	C85	0	0
2	3	1	3	Heikkinen, Miss. Laina	26.0	0	STON/O2. 3101282	7.9250	NaN	0	1
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0	1	113803	53.1000	C123	0	1
4	5	0	3	Allen, Mr. William Henry	35.0	0	373450	8.0500	NaN	1	1

Documentation for get_dummies

[Back to top]

25. How do I work with dates and times in pandas? (video)

In [253]:

 
​
ufo = pd.read_csv(url3)
ufo.head()

Out[253]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00

In [254]:

 
# 'Time'当前存储为字符串
ufo.dtypes

Out[254]:

City               object
Colors Reported    object
Shape Reported     object
State              object
Time               object
dtype: object

In [255]:

 
# 可以使用字符串切片访问小时，但这种方法很容易出问题
ufo.Time.str.slice(-5, -3).astype(int).head()

Out[255]:

0    22
1    20
2    14
3    13
4    19
Name: Time, dtype: int32

In [256]:

 
# 将“时间”转换为日期时间格式
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.head()

Out[256]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	1930-06-01 22:00:00
1	Willingboro	NaN	OTHER	NJ	1930-06-30 20:00:00
2	Holyoke	NaN	OVAL	CO	1931-02-15 14:00:00
3	Abilene	NaN	DISK	KS	1931-06-01 13:00:00
4	New York Worlds Fair	NaN	LIGHT	NY	1933-04-18 19:00:00

In [257]:

ufo.dtypes

Out[257]:

City                       object
Colors Reported            object
Shape Reported             object
State                      object
Time               datetime64[ns]
dtype: object

Documentation for to_datetime

In [258]:

 
#使用Series方法
ufo.Time.dt.hour.head()

Out[258]:

0    22
1    20
2    14
3    13
4    19
Name: Time, dtype: int64

In [259]:

 
ufo.Time.dt.weekday_name.head()

Out[259]:

0     Sunday
1     Monday
2     Sunday
3     Monday
4    Tuesday
Name: Time, dtype: object

In [260]:

 
ufo.Time.dt.dayofyear.head()

Out[260]:

0    152
1    181
2     46
3    152
4    108
Name: Time, dtype: int64

API reference for datetime properties and methods

In [261]:

 
# convert a single string to datetime format (outputs a timestamp object)
ts = pd.to_datetime('1/1/1999')
ts

Out[261]:

Timestamp('1999-01-01 00:00:00')

In [262]:

 
# 将datetime系列与时间戳进行比较
ufo.loc[ufo.Time >= ts, :].head()

Out[262]:

	City	Colors Reported	Shape Reported	State	Time
12832	Loma Rica	NaN	LIGHT	CA	1999-01-01 02:30:00
12833	Bauxite	NaN	NaN	AR	1999-01-01 03:00:00
12834	Florence	NaN	CYLINDER	SC	1999-01-01 14:00:00
12835	Lake Henshaw	NaN	CIGAR	CA	1999-01-01 15:00:00
12836	Wilmington Island	NaN	LIGHT	GA	1999-01-01 17:15:00

In [263]:

 
# 使用时间戳执行数学运算（输出timedelta对象）
ufo.Time.max() - ufo.Time.min()

Out[263]:

Timedelta('25781 days 01:59:00')

In [264]:

 
# timedelta对象还具有您可以访问的属性
(ufo.Time.max() - ufo.Time.min()).days

Out[264]:

In [265]:

 
# 允许绘图出现在笔记本jupyter notebook中
%matplotlib inline

In [266]:

 
# 计算每年不明飞行物报告的数量
ufo['Year'] = ufo.Time.dt.year
ufo.Year.value_counts().sort_index().head()

Out[266]:

1930    2
1931    2
1933    1
1934    1
1935    1
Name: Year, dtype: int64

In [267]:

 
# 绘制每年不明飞行物报告的数量（默认折线图）
ufo.Year.value_counts().sort_index().plot()

Out[267]:

<matplotlib.axes._subplots.AxesSubplot at 0x296f1bd1240>

[Back to top]

26. How do I find and remove duplicate rows in pandas? (video)

In [268]:

 
​
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table(url2, sep='|', header=None, names=user_cols, index_col='user_id')
users.head()

Out[268]:

	age	gender	occupation	zip_code
user_id
1	24	M	technician	85711
2	53	F	other	94043
3	23	M	writer	32067
4	24	M	technician	43537
5	33	F	other	15213

In [269]:

users.shape

Out[269]:

(943, 4)

In [270]:

 
# 检测重复的邮政编码：如果项目与之前的项目相同，则为真
users.zip_code.duplicated().tail()

Out[270]:

user_id
939    False
940     True
941    False
942    False
943    False
Name: zip_code, dtype: bool

In [271]:

 
# 计算重复项（True变为1，False变为0）
users.zip_code.duplicated().sum()

Out[271]:

In [272]:

 
# 检测重复的DataFrame行：如果整行与前一行相同，则为True
users.duplicated().tail()

Out[272]:

user_id
939    False
940    False
941    False
942    False
943    False
dtype: bool

In [273]:

 
# 数重复的行
users.duplicated().sum()

Out[273]:

Logic for duplicated:

keep='first' (default): Mark duplicates as True except for the first occurrence.
keep='last': Mark duplicates as True except for the last occurrence.
keep=False: Mark all duplicates as True.

In [274]:

 
"""duplicated的逻辑：
​
keep ='first'（默认值）：将重复项标记为True，但第一次出现除外。
keep ='last'：将重复项标记为True，但最后一次出现除外。
keep = False：将所有重复项标记为True。"""

Out[274]:

"duplicated的逻辑：\n\nkeep ='first'（默认值）：将重复项标记为True，但第一次出现除外。\nkeep ='last'：将重复项标记为True，但最后一次出现除外。\nkeep = False：将所有重复项标记为True。"

In [275]:

 
# 检查重复行，忽略最第一次出现的
users.loc[users.duplicated(keep='first'), :]

Out[275]:

	age	gender	occupation	zip_code
user_id
496	21	F	student	55414
572	51	M	educator	20003
621	17	M	student	60402
684	28	M	student	55414
733	44	F	other	60630
805	27	F	other	20009
890	32	M	student	97301

In [276]:

 
# 检查重复行，忽略最后一个出现的
users.loc[users.duplicated(keep='last'), :]

Out[276]:

	age	gender	occupation	zip_code
user_id
67	17	M	student	60402
85	51	M	educator	20003
198	21	F	student	55414
350	32	M	student	97301
428	28	M	student	55414
437	27	F	other	20009
460	44	F	other	60630

In [277]:

 
# 检查重复行，保留所有重复数据
users.loc[users.duplicated(keep=False), :]

Out[277]:

	age	gender	occupation	zip_code
user_id
67	17	M	student	60402
85	51	M	educator	20003
198	21	F	student	55414
350	32	M	student	97301
428	28	M	student	55414
437	27	F	other	20009
460	44	F	other	60630
496	21	F	student	55414
572	51	M	educator	20003
621	17	M	student	60402
684	28	M	student	55414
733	44	F	other	60630
805	27	F	other	20009
890	32	M	student	97301

In [278]:

 
# 删除重复的行（默认情况下为inplace = False）
users.drop_duplicates(keep='first').shape

Out[278]:

(936, 4)

In [279]:

 
users.drop_duplicates(keep='last').shape

Out[279]:

(936, 4)

In [280]:

 
users.drop_duplicates(keep=False).shape

Out[280]:

(929, 4)

Documentation for drop_duplicates

In [281]:

 
# 在识别重复项时，只考虑列的子集
users.duplicated(subset=['age', 'zip_code']).sum()

Out[281]:

In [282]:

 
users.drop_duplicates(subset=['age', 'zip_code']).shape

Out[282]:

(927, 4)

[Back to top]

27. How do I avoid a SettingWithCopyWarning in pandas? (video)

In [283]:

 
​
movies = pd.read_csv(url4)
movies.head()

Out[283]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

In [284]:

 
# 数'content_rating'系列中的缺失值
movies.content_rating.isnull().sum()

Out[284]:

In [285]:

 
# 检查包含缺少值的DataFrame行
movies[movies.content_rating.isnull()]

Out[285]:

	star_rating	title	content_rating	genre	duration	actors_list
187	8.2	Butch Cassidy and the Sundance Kid	NaN	Biography	110	[u'Paul Newman', u'Robert Redford', u'Katharin...
649	7.7	Where Eagles Dare	NaN	Action	158	[u'Richard Burton', u'Clint Eastwood', u'Mary ...
936	7.4	True Grit	NaN	Adventure	128	[u'John Wayne', u'Kim Darby', u'Glen Campbell']

In [286]:

 
# 检查'content_rating'系列中的唯一值
movies.content_rating.value_counts()

Out[286]:

R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
PASSED         7
NC-17          7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64

Goal: Mark the 'NOT RATED' values as missing values, represented by 'NaN'.

In [287]:

#目标：将“未评级”值标记为缺失值，由“NaN”表示。

In [288]:

 
# 首先，找到相关的行
movies[movies.content_rating=='NOT RATED'].head()

Out[288]:

	star_rating	title	content_rating	genre	duration	actors_list
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
6	8.9	The Good, the Bad and the Ugly	NOT RATED	Western	161	[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
41	8.5	Sunset Blvd.	NOT RATED	Drama	110	[u'William Holden', u'Gloria Swanson', u'Erich...
63	8.4	M	NOT RATED	Crime	99	[u'Peter Lorre', u'Ellen Widmann', u'Inge Land...
66	8.4	Munna Bhai M.B.B.S.	NOT RATED	Comedy	156	[u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi']

In [289]:

 
# 然后，从这些行中选择'content_rating'系列
movies[movies.content_rating=='NOT RATED'].content_rating.head()

Out[289]:

5     NOT RATED
6     NOT RATED
41    NOT RATED
63    NOT RATED
66    NOT RATED
Name: content_rating, dtype: object

In [290]:

 
# 最后，用'NaN'（从NumPy导入）替换'NOT RATED'值
import numpy as np
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan

c:\users\qq123\anaconda3\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

Problem: That statement involves two operations, a __getitem__ and a __setitem__. pandas can't guarantee whether the __getitem__ operation returns a view or a copy of the data.

If __getitem__ returns a view of the data, __setitem__ will affect the 'movies' DataFrame.
But if __getitem__ returns a copy of the data, __setitem__ will not affect the 'movies' DataFrame.

In [291]:

 
"""问题：该语句涉及两个操作，__ getitem__和__setitem__。 pandas无法保证__getitem__操作是返回视图还是数据副本。
​
如果__getitem__返回数据视图，__ setitem__将影响'movies'DataFrame。
但是如果__getitem__返回数据的副本，__ setitem__将不会影响'movies'DataFrame。"""

Out[291]:

"问题：该语句涉及两个操作，__ getitem__和__setitem__。 pandas无法保证__getitem__操作是返回视图还是数据副本。\n\n如果__getitem__返回数据视图，__ setitem__将影响'movies'DataFrame。\n但是如果__getitem__返回数据的副本，__ setitem__将不会影响'movies'DataFrame。"

In [292]:

 
# 'content_rating'系列没有改变
movies.content_rating.isnull().sum()

Out[292]:

Solution: Use the loc method, which replaces the 'NOT RATED' values in a single __setitem__operation.

In [293]:

 
​
"""解决方案：使用loc方法，该方法用单个__setitem__操作中替换“NOT RATED”值。"""

Out[293]:

'解决方案：使用loc方法，该方法用单个__setitem__操作中替换“NOT RATED”值。'

In [294]:

 
# 用'NaN'替换'NOT RATED'值（不会导致SettingWithCopyWarning）
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan

In [295]:

 
# 这一次，'content_rating'Sreies发生了变化
movies.content_rating.isnull().sum()

Out[295]:

Summary: Use the loc method any time you are selecting rows and columns in the same statement.

More information: Modern Pandas (Part 1)

In [296]:

 
# 创建一个仅包含高'star_rating'影片的DataFrame
top_movies = movies.loc[movies.star_rating >= 9, :]
top_movies

Out[296]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...

Goal: Fix the 'duration' for 'The Shawshank Redemption'.

In [297]:

'目标：修复“The Shawshank Redemption”的“持续时间”。'

Out[297]:

'目标：修复“The Shawshank Redemption”的“持续时间”。'

In [298]:

 
# 用正确的持续时间覆盖相关的单元格
top_movies.loc[0, 'duration'] = 150

c:\users\qq123\anaconda3\lib\site-packages\pandas\core\indexing.py:537: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

Problem: pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'.

In [299]:

"问题：pandas不确定'top_movies'是视图还是'电影'的副本。"

Out[299]:

"问题：pandas不确定'top_movies'是视图还是'电影'的副本。"

In [300]:

 
# 结果还是修改了
top_movies

Out[300]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	150	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...

In [301]:

 
# 'movies' 电影的DataFrame没有改变
movies.head(1)

Out[301]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...

Solution: Any time you are attempting to create a DataFrame copy, use the copy method.

In [302]:

"解决方案：每次尝试创建DataFrame副本时，请使用copy方法。"

Out[302]:

'解决方案：每次尝试创建DataFrame副本时，请使用copy方法。'

In [303]:

 
# 明确创建'电影'的副本
top_movies = movies.loc[movies.star_rating >= 9, :].copy()

In [304]:

 
# andas现在知道你正在更新副本而不是视图（不会导致SettingWithCopyWarning
top_movies.loc[0, 'duration'] = 150

In [305]:

 
# 'top_movies'DataFrame已更新
top_movies

Out[305]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	150	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...

Documentation on indexing and selection: Returning a view versus a copy

Stack Overflow: What is the point of views in pandas if it is undefined whether an indexing operation returns a view or a copy?

[Back to top]

28. How do I change display options in pandas? (video)

In [306]:

 
​
drinks = pd.read_csv(url7)

In [307]:

 
# 打印时仅显示60行
drinks

Out[307]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa
5	Antigua & Barbuda	102	128	45	4.9	North America
6	Argentina	193	25	221	8.3	South America
7	Armenia	21	179	11	3.8	Europe
8	Australia	261	72	212	10.4	Oceania
9	Austria	279	75	191	9.7	Europe
10	Azerbaijan	21	46	5	1.3	Europe
11	Bahamas	122	176	51	6.3	North America
12	Bahrain	42	63	7	2.0	Asia
13	Bangladesh	0	0	0	0.0	Asia
14	Barbados	143	173	36	6.3	North America
15	Belarus	142	373	42	14.4	Europe
16	Belgium	295	84	212	10.5	Europe
17	Belize	263	114	8	6.8	North America
18	Benin	34	4	13	1.1	Africa
19	Bhutan	23	0	0	0.4	Asia
20	Bolivia	167	41	8	3.8	South America
21	Bosnia-Herzegovina	76	173	8	4.6	Europe
22	Botswana	173	35	35	5.4	Africa
23	Brazil	245	145	16	7.2	South America
24	Brunei	31	2	1	0.6	Asia
25	Bulgaria	231	252	94	10.3	Europe
26	Burkina Faso	25	7	7	4.3	Africa
27	Burundi	88	0	0	6.3	Africa
28	Cote d'Ivoire	37	1	7	4.0	Africa
29	Cabo Verde	144	56	16	4.0	Africa
...	...	...	...	...	...	...
163	Suriname	128	178	7	5.6	South America
164	Swaziland	90	2	2	4.7	Africa
165	Sweden	152	60	186	7.2	Europe
166	Switzerland	185	100	280	10.2	Europe
167	Syria	5	35	16	1.0	Asia
168	Tajikistan	2	15	0	0.3	Asia
169	Thailand	99	258	1	6.4	Asia
170	Macedonia	106	27	86	3.9	Europe
171	Timor-Leste	1	1	4	0.1	Asia
172	Togo	36	2	19	1.3	Africa
173	Tonga	36	21	5	1.1	Oceania
174	Trinidad & Tobago	197	156	7	6.4	North America
175	Tunisia	51	3	20	1.3	Africa
176	Turkey	51	22	7	1.4	Asia
177	Turkmenistan	19	71	32	2.2	Asia
178	Tuvalu	6	41	9	1.0	Oceania
179	Uganda	45	9	0	8.3	Africa
180	Ukraine	206	237	45	8.9	Europe
181	United Arab Emirates	16	135	5	2.8	Asia
182	United Kingdom	219	126	195	10.4	Europe
183	Tanzania	36	6	1	5.7	Africa
184	USA	249	158	84	8.7	North America
185	Uruguay	115	35	220	6.6	South America
186	Uzbekistan	25	101	8	2.4	Asia
187	Vanuatu	21	18	11	0.9	Oceania
188	Venezuela	333	100	3	7.7	South America
189	Vietnam	111	2	1	2.0	Asia
190	Yemen	6	0	0	0.1	Asia
191	Zambia	32	19	4	2.5	Africa
192	Zimbabwe	64	18	4	4.7	Africa

193 rows × 6 columns

In [308]:

 
# 检查'max_rows'选项的当前设置
pd.get_option('display.max_rows')

Out[308]:

Documentation for get_option

In [309]:

 
# 覆盖当前设置，以便显示所有行
pd.set_option('display.max_rows', None)
drinks

Out[309]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa
5	Antigua & Barbuda	102	128	45	4.9	North America
6	Argentina	193	25	221	8.3	South America
7	Armenia	21	179	11	3.8	Europe
8	Australia	261	72	212	10.4	Oceania
9	Austria	279	75	191	9.7	Europe
10	Azerbaijan	21	46	5	1.3	Europe
11	Bahamas	122	176	51	6.3	North America
12	Bahrain	42	63	7	2.0	Asia
13	Bangladesh	0	0	0	0.0	Asia
14	Barbados	143	173	36	6.3	North America
15	Belarus	142	373	42	14.4	Europe
16	Belgium	295	84	212	10.5	Europe
17	Belize	263	114	8	6.8	North America
18	Benin	34	4	13	1.1	Africa
19	Bhutan	23	0	0	0.4	Asia
20	Bolivia	167	41	8	3.8	South America
21	Bosnia-Herzegovina	76	173	8	4.6	Europe
22	Botswana	173	35	35	5.4	Africa
23	Brazil	245	145	16	7.2	South America
24	Brunei	31	2	1	0.6	Asia
25	Bulgaria	231	252	94	10.3	Europe
26	Burkina Faso	25	7	7	4.3	Africa
27	Burundi	88	0	0	6.3	Africa
28	Cote d'Ivoire	37	1	7	4.0	Africa
29	Cabo Verde	144	56	16	4.0	Africa
30	Cambodia	57	65	1	2.2	Asia
31	Cameroon	147	1	4	5.8	Africa
32	Canada	240	122	100	8.2	North America
33	Central African Republic	17	2	1	1.8	Africa
34	Chad	15	1	1	0.4	Africa
35	Chile	130	124	172	7.6	South America
36	China	79	192	8	5.0	Asia
37	Colombia	159	76	3	4.2	South America
38	Comoros	1	3	1	0.1	Africa
39	Congo	76	1	9	1.7	Africa
40	Cook Islands	0	254	74	5.9	Oceania
41	Costa Rica	149	87	11	4.4	North America
42	Croatia	230	87	254	10.2	Europe
43	Cuba	93	137	5	4.2	North America
44	Cyprus	192	154	113	8.2	Europe
45	Czech Republic	361	170	134	11.8	Europe
46	North Korea	0	0	0	0.0	Asia
47	DR Congo	32	3	1	2.3	Africa
48	Denmark	224	81	278	10.4	Europe
49	Djibouti	15	44	3	1.1	Africa
50	Dominica	52	286	26	6.6	North America
51	Dominican Republic	193	147	9	6.2	North America
52	Ecuador	162	74	3	4.2	South America
53	Egypt	6	4	1	0.2	Africa
54	El Salvador	52	69	2	2.2	North America
55	Equatorial Guinea	92	0	233	5.8	Africa
56	Eritrea	18	0	0	0.5	Africa
57	Estonia	224	194	59	9.5	Europe
58	Ethiopia	20	3	0	0.7	Africa
59	Fiji	77	35	1	2.0	Oceania
60	Finland	263	133	97	10.0	Europe
61	France	127	151	370	11.8	Europe
62	Gabon	347	98	59	8.9	Africa
63	Gambia	8	0	1	2.4	Africa
64	Georgia	52	100	149	5.4	Europe
65	Germany	346	117	175	11.3	Europe
66	Ghana	31	3	10	1.8	Africa
67	Greece	133	112	218	8.3	Europe
68	Grenada	199	438	28	11.9	North America
69	Guatemala	53	69	2	2.2	North America
70	Guinea	9	0	2	0.2	Africa
71	Guinea-Bissau	28	31	21	2.5	Africa
72	Guyana	93	302	1	7.1	South America
73	Haiti	1	326	1	5.9	North America
74	Honduras	69	98	2	3.0	North America
75	Hungary	234	215	185	11.3	Europe
76	Iceland	233	61	78	6.6	Europe
77	India	9	114	0	2.2	Asia
78	Indonesia	5	1	0	0.1	Asia
79	Iran	0	0	0	0.0	Asia
80	Iraq	9	3	0	0.2	Asia
81	Ireland	313	118	165	11.4	Europe
82	Israel	63	69	9	2.5	Asia
83	Italy	85	42	237	6.5	Europe
84	Jamaica	82	97	9	3.4	North America
85	Japan	77	202	16	7.0	Asia
86	Jordan	6	21	1	0.5	Asia
87	Kazakhstan	124	246	12	6.8	Asia
88	Kenya	58	22	2	1.8	Africa
89	Kiribati	21	34	1	1.0	Oceania
90	Kuwait	0	0	0	0.0	Asia
91	Kyrgyzstan	31	97	6	2.4	Asia
92	Laos	62	0	123	6.2	Asia
93	Latvia	281	216	62	10.5	Europe
94	Lebanon	20	55	31	1.9	Asia
95	Lesotho	82	29	0	2.8	Africa
96	Liberia	19	152	2	3.1	Africa
97	Libya	0	0	0	0.0	Africa
98	Lithuania	343	244	56	12.9	Europe
99	Luxembourg	236	133	271	11.4	Europe
100	Madagascar	26	15	4	0.8	Africa
101	Malawi	8	11	1	1.5	Africa
102	Malaysia	13	4	0	0.3	Asia
103	Maldives	0	0	0	0.0	Asia
104	Mali	5	1	1	0.6	Africa
105	Malta	149	100	120	6.6	Europe
106	Marshall Islands	0	0	0	0.0	Oceania
107	Mauritania	0	0	0	0.0	Africa
108	Mauritius	98	31	18	2.6	Africa
109	Mexico	238	68	5	5.5	North America
110	Micronesia	62	50	18	2.3	Oceania
111	Monaco	0	0	0	0.0	Europe
112	Mongolia	77	189	8	4.9	Asia
113	Montenegro	31	114	128	4.9	Europe
114	Morocco	12	6	10	0.5	Africa
115	Mozambique	47	18	5	1.3	Africa
116	Myanmar	5	1	0	0.1	Asia
117	Namibia	376	3	1	6.8	Africa
118	Nauru	49	0	8	1.0	Oceania
119	Nepal	5	6	0	0.2	Asia
120	Netherlands	251	88	190	9.4	Europe
121	New Zealand	203	79	175	9.3	Oceania
122	Nicaragua	78	118	1	3.5	North America
123	Niger	3	2	1	0.1	Africa
124	Nigeria	42	5	2	9.1	Africa
125	Niue	188	200	7	7.0	Oceania
126	Norway	169	71	129	6.7	Europe
127	Oman	22	16	1	0.7	Asia
128	Pakistan	0	0	0	0.0	Asia
129	Palau	306	63	23	6.9	Oceania
130	Panama	285	104	18	7.2	North America
131	Papua New Guinea	44	39	1	1.5	Oceania
132	Paraguay	213	117	74	7.3	South America
133	Peru	163	160	21	6.1	South America
134	Philippines	71	186	1	4.6	Asia
135	Poland	343	215	56	10.9	Europe
136	Portugal	194	67	339	11.0	Europe
137	Qatar	1	42	7	0.9	Asia
138	South Korea	140	16	9	9.8	Asia
139	Moldova	109	226	18	6.3	Europe
140	Romania	297	122	167	10.4	Europe
141	Russian Federation	247	326	73	11.5	Asia
142	Rwanda	43	2	0	6.8	Africa
143	St. Kitts & Nevis	194	205	32	7.7	North America
144	St. Lucia	171	315	71	10.1	North America
145	St. Vincent & the Grenadines	120	221	11	6.3	North America
146	Samoa	105	18	24	2.6	Oceania
147	San Marino	0	0	0	0.0	Europe
148	Sao Tome & Principe	56	38	140	4.2	Africa
149	Saudi Arabia	0	5	0	0.1	Asia
150	Senegal	9	1	7	0.3	Africa
151	Serbia	283	131	127	9.6	Europe
152	Seychelles	157	25	51	4.1	Africa
153	Sierra Leone	25	3	2	6.7	Africa
154	Singapore	60	12	11	1.5	Asia
155	Slovakia	196	293	116	11.4	Europe
156	Slovenia	270	51	276	10.6	Europe
157	Solomon Islands	56	11	1	1.2	Oceania
158	Somalia	0	0	0	0.0	Africa
159	South Africa	225	76	81	8.2	Africa
160	Spain	284	157	112	10.0	Europe
161	Sri Lanka	16	104	0	2.2	Asia
162	Sudan	8	13	0	1.7	Africa
163	Suriname	128	178	7	5.6	South America
164	Swaziland	90	2	2	4.7	Africa
165	Sweden	152	60	186	7.2	Europe
166	Switzerland	185	100	280	10.2	Europe
167	Syria	5	35	16	1.0	Asia
168	Tajikistan	2	15	0	0.3	Asia
169	Thailand	99	258	1	6.4	Asia
170	Macedonia	106	27	86	3.9	Europe
171	Timor-Leste	1	1	4	0.1	Asia
172	Togo	36	2	19	1.3	Africa
173	Tonga	36	21	5	1.1	Oceania
174	Trinidad & Tobago	197	156	7	6.4	North America
175	Tunisia	51	3	20	1.3	Africa
176	Turkey	51	22	7	1.4	Asia
177	Turkmenistan	19	71	32	2.2	Asia
178	Tuvalu	6	41	9	1.0	Oceania
179	Uganda	45	9	0	8.3	Africa
180	Ukraine	206	237	45	8.9	Europe
181	United Arab Emirates	16	135	5	2.8	Asia
182	United Kingdom	219	126	195	10.4	Europe
183	Tanzania	36	6	1	5.7	Africa
184	USA	249	158	84	8.7	North America
185	Uruguay	115	35	220	6.6	South America
186	Uzbekistan	25	101	8	2.4	Asia
187	Vanuatu	21	18	11	0.9	Oceania
188	Venezuela	333	100	3	7.7	South America
189	Vietnam	111	2	1	2.0	Asia
190	Yemen	6	0	0	0.1	Asia
191	Zambia	32	19	4	2.5	Africa
192	Zimbabwe	64	18	4	4.7	Africa

In [310]:

 
# 将'max_rows'选项重置为默认值
pd.reset_option('display.max_rows')

Documentation for set_option and reset_option

In [311]:

 
# 'max_columns'选项类似于'max_rows'
pd.get_option('display.max_columns')

Out[311]:

In [312]:

 
​
train = pd.read_csv(url5)
train.head()

Out[312]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

In [313]:

 
# 由于“max_colwidth”选项，省略号显示在第1行的“名称”单元格中
pd.get_option('display.max_colwidth')

Out[313]:

In [314]:

 
# 覆盖当前设置，以便显示更多字符
pd.set_option('display.max_colwidth', 1000)
train.head()

Out[314]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

In [315]:

 
# 覆盖''precision' '设置以显示'票价'小数点后的2位数
pd.set_option('display.precision', 2)
train.head()

Out[315]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.25	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38.0	1	PC 17599	71.28	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.92	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.10	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.05	NaN	S

In [316]:

 
# 在饮料DataFrame中添加两个无意义的列
drinks['x'] = drinks.wine_servings * 1000
drinks['y'] = drinks.total_litres_of_pure_alcohol * 1000
drinks.head()

Out[316]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent	x	y
0	Afghanistan	0	0	0	0.0	Asia	0	0.0
1	Albania	89	132	54	4.9	Europe	54000	4900.0
2	Algeria	25	0	14	0.7	Africa	14000	700.0
3	Andorra	245	138	312	12.4	Europe	312000	12400.0
4	Angola	217	57	45	5.9	Africa	45000	5900.0

In [317]:

 
# 使用Python格式字符串将逗号指定为千位分隔符
pd.set_option('display.float_format', '{:,}'.format)
drinks.head()

Out[317]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent	x	y
0	Afghanistan	0	0	0	0.0	Asia	0	0.0
1	Albania	89	132	54	4.9	Europe	54000	4,900.0
2	Algeria	25	0	14	0.7	Africa	14000	700.0
3	Andorra	245	138	312	12.4	Europe	312000	12,400.0
4	Angola	217	57	45	5.9	Africa	45000	5,900.0

In [318]:

 
# y'受影响（但不是'x'）因为'float_format'选项只影响浮点数（不是整数）
drinks.dtypes

Out[318]:

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
x                                 int64
y                               float64
dtype: object

In [319]:

 
# 查看选项说明（包括默认值和当前值）
pd.describe_option()

compute.use_bottleneck : bool
    Use the bottleneck library to accelerate if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]

compute.use_numexpr : bool
    Use the numexpr library to accelerate computation if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]

display.chop_threshold : float or None
    if set to a float value, all float values smaller then the given threshold
    will be displayed as exactly 0 by repr and friends.
    [default: None] [currently: None]

display.colheader_justify : 'left'/'right'
    Controls the justification of column headers. used by DataFrameFormatter.
    [default: right] [currently: right]

display.column_space No description available.
    [default: 12] [currently: 12]

display.date_dayfirst : boolean
    When True, prints and parses dates with the day first, eg 20/01/2005
    [default: False] [currently: False]

display.date_yearfirst : boolean
    When True, prints and parses dates with the year first, eg 2005/01/20
    [default: False] [currently: False]

display.encoding : str/unicode
    Defaults to the detected encoding of the console.
    Specifies the encoding to be used for strings returned by to_string,
    these are generally strings meant to be displayed on the console.
    [default: UTF-8] [currently: UTF-8]

display.expand_frame_repr : boolean
    Whether to print out the full DataFrame repr for wide DataFrames across
    multiple lines, `max_columns` is still respected, but the output will
    wrap-around across multiple "pages" if its width exceeds `display.width`.
    [default: True] [currently: True]

display.float_format : callable
    The callable should accept a floating point number and return
    a string with the desired format of the number. This is used
    in some places like SeriesFormatter.
    See formats.format.EngFormatter for an example.
    [default: None] [currently: <built-in method format of str object at 0x00000296F1B05880>]

display.height : int
    Deprecated.
    [default: 60] [currently: 60]

display.html.border : int
    A ``border=value`` attribute is inserted in the ``<table>`` tag
    for the DataFrame HTML repr.
    [default: 1] [currently: 1]

display.html.table_schema : boolean
    Whether to publish a Table Schema representation for frontends
    that support it.
    (default: False)
    [default: False] [currently: False]

display.large_repr : 'truncate'/'info'
    For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
    show a truncated table (the default from 0.13), or switch to the view from
    df.info() (the behaviour in earlier versions of pandas).
    [default: truncate] [currently: truncate]

display.latex.escape : bool
    This specifies if the to_latex method of a Dataframe uses escapes special
    characters.
    Valid values: False,True
    [default: True] [currently: True]

display.latex.longtable :bool
    This specifies if the to_latex method of a Dataframe uses the longtable
    format.
    Valid values: False,True
    [default: False] [currently: False]

display.latex.multicolumn : bool
    This specifies if the to_latex method of a Dataframe uses multicolumns
    to pretty-print MultiIndex columns.
    Valid values: False,True
    [default: True] [currently: True]

display.latex.multicolumn_format : bool
    This specifies if the to_latex method of a Dataframe uses multicolumns
    to pretty-print MultiIndex columns.
    Valid values: False,True
    [default: l] [currently: l]

display.latex.multirow : bool
    This specifies if the to_latex method of a Dataframe uses multirows
    to pretty-print MultiIndex rows.
    Valid values: False,True
    [default: False] [currently: False]

display.latex.repr : boolean
    Whether to produce a latex DataFrame representation for jupyter
    environments that support it.
    (default: False)
    [default: False] [currently: False]

display.line_width : int
    Deprecated.
    [default: 80] [currently: 80]

display.max_categories : int
    This sets the maximum number of categories pandas should output when
    printing out a `Categorical` or a Series of dtype "category".
    [default: 8] [currently: 8]

display.max_columns : int
    If max_cols is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the width of the terminal and print a truncated object which fits
    the screen width. The IPython notebook, IPython qtconsole, or IDLE
    do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 20] [currently: 20]

display.max_colwidth : int
    The maximum width in characters of a column in the repr of
    a pandas data structure. When the column overflows, a "..."
    placeholder is embedded in the output.
    [default: 50] [currently: 1000]

display.max_info_columns : int
    max_info_columns is used in DataFrame.info method to decide if
    per column information will be printed.
    [default: 100] [currently: 100]

display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions than
    specified.
    [default: 1690785] [currently: 1690785]

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 60]

display.max_seq_items : int or None
    when pretty-printing a long sequence, no more then `max_seq_items`
    will be printed. If items are omitted, they will be denoted by the
    addition of "..." to the resulting string.

    If set to None, the number of items to be printed is unlimited.
    [default: 100] [currently: 100]

display.memory_usage : bool, string or None
    This specifies if the memory usage of a DataFrame should be displayed when
    df.info() is called. Valid values True,False,'deep'
    [default: True] [currently: True]

display.multi_sparse : boolean
    "sparsify" MultiIndex display (don't display repeated
    elements in outer levels within groups)
    [default: True] [currently: True]

display.notebook_repr_html : boolean
    When True, IPython notebook will use html representation for
    pandas objects (if it is available).
    [default: True] [currently: True]

display.pprint_nest_depth : int
    Controls the number of nested levels to process when pretty-printing
    [default: 3] [currently: 3]

display.precision : int
    Floating point output precision (number of significant digits). This is
    only a suggestion
    [default: 6] [currently: 2]

display.show_dimensions : boolean or 'truncate'
    Whether to print out dimensions at the end of DataFrame repr.
    If 'truncate' is specified, only print out the dimensions if the
    frame is truncated (e.g. not display all rows and/or columns)
    [default: truncate] [currently: truncate]

display.unicode.ambiguous_as_wide : boolean
    Whether to use the Unicode East Asian Width to calculate the display text
    width.
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]

display.unicode.east_asian_width : boolean
    Whether to use the Unicode East Asian Width to calculate the display text
    width.
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]

display.width : int
    Width of the display in characters. In case python/IPython is running in
    a terminal this can be set to None and pandas will correctly auto-detect
    the width.
    Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
    terminal and hence it is not possible to correctly detect the width.
    [default: 80] [currently: 80]

html.border : int
    A ``border=value`` attribute is inserted in the ``<table>`` tag
    for the DataFrame HTML repr.
    [default: 1] [currently: 1]
    (Deprecated, use `display.html.border` instead.)

io.excel.xls.writer : string
    The default Excel writer engine for 'xls' files. Available options:
    auto, xlwt.
    [default: auto] [currently: auto]

io.excel.xlsm.writer : string
    The default Excel writer engine for 'xlsm' files. Available options:
    auto, openpyxl.
    [default: auto] [currently: auto]

io.excel.xlsx.writer : string
    The default Excel writer engine for 'xlsx' files. Available options:
    auto, openpyxl, xlsxwriter.
    [default: auto] [currently: auto]

io.hdf.default_format : format
    default format writing format, if None, then
    put will default to 'fixed' and append will default to 'table'
    [default: None] [currently: None]

io.hdf.dropna_table : boolean
    drop ALL nan rows when appending to a table
    [default: False] [currently: False]

io.parquet.engine : string
    The default parquet reader/writer engine. Available options:
    'auto', 'pyarrow', 'fastparquet', the default is 'auto'
    [default: auto] [currently: auto]

mode.chained_assignment : string
    Raise an exception, warn, or no action if trying to use chained assignment,
    The default is warn
    [default: warn] [currently: warn]

mode.sim_interactive : boolean
    Whether to simulate interactive mode for purposes of testing
    [default: False] [currently: False]

mode.use_inf_as_na : boolean
    True means treat None, NaN, INF, -INF as NA (old way),
    False means None and NaN are null, but INF, -INF are not NA
    (new way).
    [default: False] [currently: False]

mode.use_inf_as_null use_inf_as_null had been deprecated and will be removed in a future version.
Use `use_inf_as_na` instead.
    [default: False] [currently: False]
    (Deprecated, use `mode.use_inf_as_na` instead.)

plotting.matplotlib.register_converters : bool
    Whether to register converters with matplotlib's units registry for
    dates, times, datetimes, and Periods. Toggling to False will remove
    the converters, restoring any converters that pandas overwrote.
    [default: True] [currently: True]

In [320]:

 
# 按名称搜索特定选项
pd.describe_option('rows')

display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions than
    specified.
    [default: 1690785] [currently: 1690785]

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 60]

Documentation for describe_option

In [321]:

 
# 将所有选项重置为其默认值
pd.reset_option('all')

html.border has been deprecated, use display.html.border instead
(currently both are identical)


use_inf_as_null had been deprecated and will be removed in a future version.
Use `use_inf_as_na` instead.

Options and Settings

[Back to top]

29. How do I create a pandas DataFrame from another object? (video)

In [322]:

 
# 从字典创建一个DataFrame（键成为列名，值成为数据）
pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']})

Out[322]:

	color	id
0	red	100
1	blue	101
2	red	102

In [323]:

 
# 指定列的顺序并定义索引
df = pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']}, columns=['id', 'color'], index=['a', 'b', 'c'])
df

Out[323]:

	id	color
a	100	red
b	101	blue
c	102	red

Documentation for DataFrame

In [324]:

 
# 从列表中的列表结构创建一个DataFrame（每个内部列表成为一行）
pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id', 'color'])

Out[324]:

	id	color
0	100	red
1	101	blue
2	102	red

In [325]:

 
# 创建一个NumPy数组（形状为4乘2）并用0到1之间的随机数填充它
import numpy as np
arr = np.random.rand(4, 2)
arr

Out[325]:

array([[0.75703155, 0.7684219 ],
       [0.06043145, 0.1511115 ],
       [0.5908619 , 0.02133966],
       [0.78627718, 0.98944222]])

In [326]:

 
# 从NumPy数组创建一个DataFrame
pd.DataFrame(arr, columns=['one', 'two'])

Out[326]:

	one	two
0	0.757032	0.768422
1	0.060431	0.151112
2	0.590862	0.021340
3	0.786277	0.989442

In [327]:

 
# 创建学生ID（100到109）和测试分数（60到10之间的随机整数）的DataFrame
pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)})

Out[327]:

	student	test
0	100	61
1	101	84
2	102	100
3	103	75
4	104	97
5	105	73
6	106	67
7	107	82
8	108	76
9	109	69

Documentation for np.arange and np.random

In [328]:

 
# set_index'可以与DataFrame构造函数链接以选择索引
pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)}).set_index('student')

Out[328]:

	test
student
100	97
101	94
102	82
103	76
104	72
105	86
106	70
107	84
108	84
109	100

Documentation for set_index

In [329]:

 
# 使用Series构造函数创建一个新Series
s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')
s

Out[329]:

c     round
b    square
Name: shape, dtype: object

Documentation for Series

In [330]:

 
# 连接DataFrame和Series（使用axis = 1来连接列）
pd.concat([df, s], axis=1)

Out[330]:

	id	color	shape
a	100	red	NaN
b	101	blue	square
c	102	red	round

Notes:

The Series name became the column name in the DataFrame.
The Series data was aligned to the DataFrame by its index.
The 'shape' for row 'a' was marked as a missing value (NaN) because that index was not present in the Series.

Documentation for concat

[Back to top]

In [331]:

 
"""笔记：
​
系列名称成为DataFrame中的列名称。
Series数据通过其索引与DataFrame对齐。
行'a'的'形状'被标记为缺失值（NaN），因为该索引在系列中不存在。"""

Out[331]:

"笔记：\n\n系列名称成为DataFrame中的列名称。\nSeries数据通过其索引与DataFrame对齐。\n行'a'的'形状'被标记为缺失值（NaN），因为该索引在系列中不存在。"

30. How do I apply a function to a pandas Series or DataFrame? (video)

In [332]:

 
# read the training dataset from Kaggle's Titanic competition into a DataFrame
train = pd.read_csv(url5)
train.head()

Out[332]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Goal: Map the existing values of a Series to a different set of values

Method: map (Series method)

In [333]:

 
​
"""目标：将Series的现有值映射到不同的值集
​
方法：map（Series方法）"""

Out[333]:

'目标：将Series的现有值映射到不同的值集\n\n方法：map（Series方法）'

In [334]:

 
# map 'female' to 0 and 'male' to 1
train['Sex_num'] = train.Sex.map({'female':0, 'male':1})
train.loc[0:4, ['Sex', 'Sex_num']]

Out[334]:

	Sex	Sex_num
0	male	1
1	female	0
2	female	0
3	female	0
4	male	1

Goal: Apply a function to each element in a Series

Method: apply (Series method)

Note: map can be substituted for apply in many cases, but apply is more flexible and thus is recommended

In [335]:

 
"""目标：将函数应用于系列中的每个元素
​
方法：apply（Series方法）
​
注意：在许多情况下，map可以替换为apply，但是apply更灵活，因此建议使用"""

Out[335]:

'目标：将函数应用于系列中的每个元素\n\n方法：apply（Series方法）\n\n注意：在许多情况下，map可以替换为apply，但是apply更灵活，因此建议使用'

In [336]:

 
# 计算“姓名”系列中每个字符串的长度
train['Name_length'] = train.Name.apply(len)
train.loc[0:4, ['Name', 'Name_length']]

Out[336]:

	Name	Name_length
0	Braund, Mr. Owen Harris	23
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	51
2	Heikkinen, Miss. Laina	22
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	44
4	Allen, Mr. William Henry	24

In [337]:

 
# 将“票价”系列中的每个元素四舍五入到下一个整数
import numpy as np
train['Fare_ceil'] = train.Fare.apply(np.ceil)
train.loc[0:4, ['Fare', 'Fare_ceil']]

Out[337]:

	Fare	Fare_ceil
0	7.2500	8.0
1	71.2833	72.0
2	7.9250	8.0
3	53.1000	54.0
4	8.0500	9.0

In [338]:

 
# 我们想要提取每个人的姓氏
train.Name.head()

Out[338]:

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

In [339]:

 
# 使用字符串方法以逗号分隔“名称”系列（返回一Series列表）
train.Name.str.split(',').head()

Out[339]:

0                           [Braund,  Mr. Owen Harris]
1    [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                            [Heikkinen,  Miss. Laina]
3      [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                          [Allen,  Mr. William Henry]
Name: Name, dtype: object

In [340]:

 
# 定义一个函数，该函数根据位置从列表中返回一个元素
def get_element(my_list, position):
    return my_list[position]

In [341]:

 
# 应用'get_element'函数并将'position'作为关键字参数传递
train.Name.str.split(',').apply(get_element, position=0).head()

Out[341]:

0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object

In [342]:

 
# 或者，使用lambda函数
train.Name.str.split(',').apply(lambda x: x[0]).head()

Out[342]:

0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object

Goal: Apply a function along either axis of a DataFrame

Method: apply (DataFrame method)

In [343]:

 
"""目标：沿DataFrame的任一轴应用函数
​
方法：apply（DataFrame方法）"""

Out[343]:

'目标：沿DataFrame的任一轴应用函数\n\n方法：apply（DataFrame方法）'

In [344]:

 
​
drinks = pd.read_csv(url7)
drinks.head()

Out[344]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

In [345]:

 
# 选择要使用的DataFrame的子集
drinks.loc[:, 'beer_servings':'wine_servings'].head()

Out[345]:

	beer_servings	spirit_servings	wine_servings
0	0	0	0
1	89	132	54
2	25	0	14
3	245	138	312
4	217	57	45

In [346]:

 
# 沿axix 0应用'max'函数来计算每列中的最大值
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)

Out[346]:

beer_servings      376
spirit_servings    438
wine_servings      370
dtype: int64

In [347]:

 
# 沿axis 1应用'max'函数来计算每行的最大值
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1).head()

Out[347]:

0      0
1    132
2     25
3    312
4    217
dtype: int64

In [348]:

 
# 使用'np.argmax'计算哪一列具有每行的最大值
drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1).head()

c:\users\qq123\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:52: FutureWarning: 'argmax' is deprecated. Use 'idxmax' instead. The behavior of 'argmax' will be corrected to return the positional maximum in the future. Use 'series.values.argmax' to get the position of the maximum now.
  return getattr(obj, method)(*args, **kwds)

Out[348]:

0      beer_servings
1    spirit_servings
2      beer_servings
3      wine_servings
4      beer_servings
dtype: object

Goal: Apply a function to every element in a DataFrame

Method: applymap (DataFrame method)

In [349]:

 
# 将每个DataFrame元素转换为float
drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float).head()

Out[349]:

	beer_servings	spirit_servings	wine_servings
0	0.0	0.0	0.0
1	89.0	132.0	54.0
2	25.0	0.0	14.0
3	245.0	138.0	312.0
4	217.0	57.0	45.0

In [350]:

 
#覆盖现有的DataFrame列
drinks.loc[:, 'beer_servings':'wine_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)
drinks.head()

Out[350]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0.0	0.0	0.0	0.0	Asia
1	Albania	89.0	132.0	54.0	4.9	Europe
2	Algeria	25.0	0.0	14.0	0.7	Africa
3	Andorra	245.0	138.0	312.0	12.4	Europe
4	Angola	217.0	57.0	45.0	5.9	Africa

[Back to top]

跨界混子

萌新向Python数据分析及数据挖掘第二章 pandas 第一节 pandas使用基础Q&A 16-30

16. How do I handle missing values in pandas? (video)

17. What do I need to know about the pandas index? (Part 1) (video)

18. What do I need to know about the pandas index? (Part 2) (video)

19. How do I select multiple rows and columns from a pandas DataFrame? (video)

20. When should I use the "inplace" parameter in pandas? (video)

21. How do I make my pandas DataFrame smaller and faster? (video)

22. How do I use pandas with scikit-learn to create Kaggle submissions? (video)

23. More of your pandas questions answered! (video)

24. How do I create dummy variables in pandas? (video)

25. How do I work with dates and times in pandas? (video)

26. How do I find and remove duplicate rows in pandas? (video)

27. How do I avoid a SettingWithCopyWarning in pandas? (video)

28. How do I change display options in pandas? (video)

29. How do I create a pandas DataFrame from another object? (video)

30. How do I apply a function to a pandas Series or DataFrame? (video)

公告

	student	test
0	100	61
1	101	84
2	102	100
3	103	75
4	104	97
5	105	73
6	106	67
7	107	82
8	108	76
9	109	69

	student	test
0	100	61
1	101	84
2	102	100
3	103	75
4	104	97
5	105	73
6	106	67
7	107	82
8	108	76
9	109	69

跨界混子

萌新向Python数据分析及数据挖掘 第二章 pandas 第一节 pandas使用基础Q&A 16-30

16. How do I handle missing values in pandas? (video)

17. What do I need to know about the pandas index? (Part 1) (video)

18. What do I need to know about the pandas index? (Part 2) (video)

19. How do I select multiple rows and columns from a pandas DataFrame? (video)

20. When should I use the "inplace" parameter in pandas? (video)

21. How do I make my pandas DataFrame smaller and faster? (video)

22. How do I use pandas with scikit-learn to create Kaggle submissions? (video)

23. More of your pandas questions answered! (video)

24. How do I create dummy variables in pandas? (video)

25. How do I work with dates and times in pandas? (video)

26. How do I find and remove duplicate rows in pandas? (video)

27. How do I avoid a SettingWithCopyWarning in pandas? (video)

28. How do I change display options in pandas? (video)

29. How do I create a pandas DataFrame from another object? (video)

30. How do I apply a function to a pandas Series or DataFrame? (video)

公告

萌新向Python数据分析及数据挖掘第二章 pandas 第一节 pandas使用基础Q&A 16-30

	student	test
0	100	61
1	101	84
2	102	100
3	103	75
4	104	97
5	105	73
6	106	67
7	107	82
8	108	76
9	109	69