Python Data Analysis and Data Mining for Beginners, Chapter 2: pandas, Section 1: pandas Basics Q&A 16-30

16. How do I handle missing values in pandas? (video)

In [121]:

url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"
ufo = pd.read_csv(url3)  # read a dataset of UFO reports into a DataFrame
ufo.tail()

Out[121]:
       City  Colors Reported  Shape Reported  State  Time
18236 Grant Park NaN TRIANGLE IL 12/31/2000 23:00
18237 Spirit Lake NaN DISK IA 12/31/2000 23:00
18238 Eagle River NaN NaN WI 12/31/2000 23:45
18239 Eagle River RED LIGHT WI 12/31/2000 23:45
18240 Ybor NaN OVAL FL 12/31/2000 23:59
 

What does "NaN" mean?

"NaN" is not a string, rather it's a special value: numpy.nan. It stands for "Not a Number" and indicates a missing value. read_csv detects missing values (by default) when reading the file, and replaces them with this special value.
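Because numpy.nan never compares equal to itself, you cannot detect it with `==`; a minimal sketch of why isnull is the right tool:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Ithaca', np.nan, 'Holyoke'])

# NaN never compares equal to itself, so == cannot find it
print(np.nan == np.nan)        # False
print((s == np.nan).any())     # False, even though a NaN is present

# isnull (or pd.isnull) is the reliable check
print(s.isnull().tolist())     # [False, True, False]
```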

Documentation for read_csv

In [122]:

# 'isnull' returns a DataFrame of booleans (True if missing, False if not missing)
ufo.isnull().tail()

Out[122]:
       City  Colors Reported  Shape Reported  State  Time
18236 False True False False False
18237 False True False False False
18238 False True True False False
18239 False False False False False
18240 False True False False False
In [123]:

# 'notnull' returns the opposite of 'isnull' (True if not missing, False if missing)
ufo.notnull().tail()

Out[123]:
       City  Colors Reported  Shape Reported  State  Time
18236 True False True True True
18237 True False True True True
18238 True False False True True
18239 True True True True True
18240 True False True True True
 

Documentation for isnull and notnull

In [124]:

# count the number of missing values in each Series
ufo.isnull().sum()
 
 
Out[124]:
City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64
 

This calculation works because:

1. The sum method for a DataFrame operates on axis=0 by default (and thus produces column sums).
2. In order to add boolean values, pandas converts True to 1 and False to 0.
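Both facts can be checked directly with a small sketch:

```python
import pandas as pd

# booleans are treated as 1/0 when summed
bools = pd.Series([True, False, True, True])
print(bools.sum())   # 3

# sum() on a DataFrame works column-wise (axis=0) by default,
# so isnull().sum() counts the missing values per column
df = pd.DataFrame({'a': [1.0, None], 'b': [None, None]})
print(df.isnull().sum().tolist())   # [1, 2]
```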

In [125]:

# use the 'isnull' Series method to filter the DataFrame rows
ufo[ufo.City.isnull()].head()

Out[125]:
       City  Colors Reported  Shape Reported  State  Time
21 NaN NaN NaN LA 8/15/1943 0:00
22 NaN NaN LIGHT LA 8/15/1943 0:00
204 NaN NaN DISK CA 7/15/1952 12:30
241 NaN BLUE DISK MT 7/4/1953 14:00
613 NaN NaN DISK NV 7/1/1960 12:00
 

How to handle missing values depends on the dataset as well as the nature of your analysis. Here are some options:

In [126]:

# examine the number of rows and columns
ufo.shape
 
 
Out[126]:
(18241, 5)
In [127]:

# if 'any' values are missing in a row, then drop that row
ufo.dropna(how='any').shape
 
 
Out[127]:
(2486, 5)
 

Documentation for dropna

In [128]:

# the 'inplace' parameter for 'dropna' is False by default, so the rows were only dropped temporarily
ufo.shape
 
 
Out[128]:
(18241, 5)
In [129]:

# if 'all' values are missing in a row, then drop that row (none are dropped in this case)
ufo.dropna(how='all').shape
 
 
Out[129]:
(18241, 5)
In [130]:

# consider only a subset of columns when deciding whether to drop a row with 'any'
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape
 
 
Out[130]:
(15576, 5)
In [131]:

# consider only a subset of columns when deciding whether to drop a row with 'all'
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape
 
 
Out[131]:
(18237, 5)
In [132]:

# 'value_counts' does not include missing values by default
ufo['Shape Reported'].value_counts().head()
 
 
Out[132]:
LIGHT       2803
DISK        2122
TRIANGLE    1889
OTHER       1402
CIRCLE      1365
Name: Shape Reported, dtype: int64
In [133]:

# explicitly include missing values
ufo['Shape Reported'].value_counts(dropna=False).head()
 
 
Out[133]:
LIGHT       2803
NaN         2644
DISK        2122
TRIANGLE    1889
OTHER       1402
Name: Shape Reported, dtype: int64
 

Documentation for value_counts

In [134]:

# fill in missing values with a specified value
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)
 
 
 

Documentation for fillna

In [135]:

# confirm that the missing values were filled in
ufo['Shape Reported'].value_counts().head()
 
 
Out[135]:
VARIOUS     2977
LIGHT       2803
DISK        2122
TRIANGLE    1889
OTHER       1402
Name: Shape Reported, dtype: int64
 

17. What do I need to know about the pandas index? (Part 1) (video)

In [137]:

# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv(url7)
drinks.head()

Out[137]:
   country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [138]:

# every DataFrame has an index (sometimes called the "row labels")
drinks.index
 
 
Out[138]:
RangeIndex(start=0, stop=193, step=1)
In [139]:

# column names are also stored in a special "Index" object
drinks.columns
 
 
Out[139]:
Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')
In [140]:

# neither the index nor the column names are included in the shape
drinks.shape
 
 
Out[140]:
(193, 6)
In [141]:

# the index and column names both default to integers if you don't define them
pd.read_table('http://bit.ly/movieusers', header=None, sep='|').head()

Out[141]:
   0  1  2  3  4
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
 

What is the index used for?

identification, selection, alignment

In [142]:

# identification: the index remains with each row when you filter the DataFrame
drinks[drinks.continent=='South America']

Out[142]:
   country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
6 Argentina 193 25 221 8.3 South America
20 Bolivia 167 41 8 3.8 South America
23 Brazil 245 145 16 7.2 South America
35 Chile 130 124 172 7.6 South America
37 Colombia 159 76 3 4.2 South America
52 Ecuador 162 74 3 4.2 South America
72 Guyana 93 302 1 7.1 South America
132 Paraguay 213 117 74 7.3 South America
133 Peru 163 160 21 6.1 South America
163 Suriname 128 178 7 5.6 South America
185 Uruguay 115 35 220 6.6 South America
188 Venezuela 333 100 3 7.7 South America
In [143]:

# selection: use the index to select a portion of the DataFrame
drinks.loc[23, 'beer_servings']
 
 
Out[143]:
245
 

Documentation for loc

In [144]:

# set an existing column as the index
drinks.set_index('country', inplace=True)
drinks.head()

Out[144]:
             beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
country     
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa
 

Documentation for set_index

In [145]:

# 'country' is now the index
drinks.index
 
 
Out[145]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
       'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=193)
In [146]:

# 'country' is no longer a column name
drinks.columns
 
 
Out[146]:
Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')
In [147]:

# the 'country' data is no longer part of the DataFrame contents
drinks.shape
 
 
Out[147]:
(193, 5)
In [148]:

# country names can now be used for selection
drinks.loc['Brazil', 'beer_servings']
 
 
Out[148]:
245
In [149]:

# the index name is optional
drinks.index.name = None
drinks.head()

Out[149]:
             beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa
In [150]:

# restore the index name, and move the index back to a column
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()

Out[150]:
   country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
 

Documentation for reset_index

In [151]:

# many DataFrame methods output a DataFrame
drinks.describe()

Out[151]:
        beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
count 193.000000 193.000000 193.000000 193.000000
mean 106.160622 80.994819 49.450777 4.717098
std 101.143103 88.284312 79.697598 3.773298
min 0.000000 0.000000 0.000000 0.000000
25% 20.000000 4.000000 1.000000 1.300000
50% 76.000000 56.000000 8.000000 4.200000
75% 188.000000 128.000000 59.000000 7.200000
max 376.000000 438.000000 370.000000 14.400000
In [152]:

# you can interact with any DataFrame using its index and columns
drinks.describe().loc['25%', 'beer_servings']
 
 
Out[152]:
20.0
 

18. What do I need to know about the pandas index? (Part 2) (video)

In [153]:

# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv(url7)
drinks.head()

Out[153]:
   country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [154]:

# every DataFrame has an index
drinks.index
 
 
Out[154]:
RangeIndex(start=0, stop=193, step=1)
In [155]:

# every Series also has an index (which carries over from the DataFrame)
drinks.continent.head()
 
 
Out[155]:
0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object
In [156]:

# set 'country' as the index
drinks.set_index('country', inplace=True)
 
 
 

Documentation for set_index

In [157]:

# the Series index is on the left, the values are on the right
drinks.continent.head()
 
 
Out[157]:
country
Afghanistan      Asia
Albania        Europe
Algeria        Africa
Andorra        Europe
Angola         Africa
Name: continent, dtype: object
In [158]:

# another example of a Series (output from the 'value_counts' method)
drinks.continent.value_counts()
 
 
Out[158]:
Africa           53
Europe           45
Asia             44
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64
 

Documentation for value_counts

In [159]:

# access the Series index
drinks.continent.value_counts().index
 
 
Out[159]:
Index(['Africa', 'Europe', 'Asia', 'North America', 'Oceania',
       'South America'],
      dtype='object')
In [160]:

# access the Series values
drinks.continent.value_counts().values
 
 
Out[160]:
array([53, 45, 44, 23, 16, 12], dtype=int64)
In [161]:
 
 
 
 
 
#可以通过索引选择系列中的元素(使用括号表示法)
drinks.continent.value_counts()['Africa']
 
 
Out[161]:
53
In [162]:

# any Series can be sorted by its values
drinks.continent.value_counts().sort_values()
 
 
Out[162]:
South America    12
Oceania          16
North America    23
Asia             44
Europe           45
Africa           53
Name: continent, dtype: int64
In [163]:

# any Series can be sorted by its index
drinks.continent.value_counts().sort_index()
 
 
Out[163]:
Africa           53
Asia             44
Europe           45
North America    23
Oceania          16
South America    12
Name: continent, dtype: int64
 

Documentation for sort_values and sort_index

 

What is the index used for?

identification, selection, alignment

In [164]:

# 'beer_servings' contains the average annual beer servings per person
drinks.beer_servings.head()
 
 
Out[164]:
country
Afghanistan      0
Albania         89
Algeria         25
Andorra        245
Angola         217
Name: beer_servings, dtype: int64
In [165]:

# create a Series containing the populations of two countries
people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')
people
 
 
Out[165]:
Albania    3000000
Andorra      85000
Name: population, dtype: int64
 

Documentation for Series

In [166]:

# calculate the total annual beer servings for each country
(drinks.beer_servings * people).head()
 
 
Out[166]:
Afghanistan            NaN
Albania        267000000.0
Algeria                NaN
Andorra         20825000.0
Angola                 NaN
dtype: float64
 

The two Series were aligned by their indexes. If a value is missing in either Series, the result is marked as NaN. Alignment enables us to easily work with incomplete data.
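A self-contained sketch of alignment, using a few of the numbers from the tables above:

```python
import pandas as pd

servings = pd.Series([89, 245, 25], index=['Albania', 'Andorra', 'Algeria'])
population = pd.Series([3000000, 85000], index=['Albania', 'Andorra'])

# multiplication matches values by index label, not by position;
# labels present in only one Series yield NaN
total = servings * population
print(total['Albania'])              # 267000000.0
print(pd.isnull(total['Algeria']))   # True
```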

In [167]:

# concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index)
pd.concat([drinks, people], axis=1).head()

Out[167]:
             beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent  population
Afghanistan 0 0 0 0.0 Asia NaN
Albania 89 132 54 4.9 Europe 3000000.0
Algeria 25 0 14 0.7 Africa NaN
Andorra 245 138 312 12.4 Europe 85000.0
Angola 217 57 45 5.9 Africa NaN
 

19. How do I select multiple rows and columns from a pandas DataFrame? (video)

In [168]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv(url3)
ufo.head(3)

Out[168]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
 

The loc method is used to select rows and columns by label. You can pass it:

  • a single label
  • a list of labels
  • a slice of labels
  • a boolean Series
  • a colon (which means "all labels")

In [169]:

# row 0, all columns
ufo.loc[0, :]
 
 
Out[169]:
City                       Ithaca
Colors Reported               NaN
Shape Reported           TRIANGLE
State                          NY
Time               6/1/1930 22:00
Name: 0, dtype: object
In [170]:

# rows 0 and 1 and 2, all columns
ufo.loc[[0, 1, 2], :]

Out[170]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
In [171]:

# rows 0 through 2 (inclusive), all columns
ufo.loc[0:2, :]

Out[171]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
In [172]:

# this implies "all columns", but explicitly stating "all columns" is better
ufo.loc[0:2]

Out[172]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
In [173]:

# rows 0 through 2 (inclusive), column 'City'
ufo.loc[0:2, 'City']
 
 
Out[173]:
0         Ithaca
1    Willingboro
2        Holyoke
Name: City, dtype: object
In [174]:

# rows 0 through 2 (inclusive), columns 'City' and 'State'
ufo.loc[0:2, ['City', 'State']]

Out[174]:
   City  State
0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO
In [175]:

# accomplish the same thing using double brackets - but 'loc' is preferred since it's more explicit
ufo[['City', 'State']].head(3)

Out[175]:
   City  State
0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO
In [176]:

# rows 0 through 2 (inclusive), columns 'City' through 'State' (inclusive)
ufo.loc[0:2, 'City':'State']

Out[176]:
   City  Colors Reported  Shape Reported  State
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO
In [177]:

# accomplish the same thing using 'head' and 'drop'
ufo.head(3).drop('Time', axis=1)

Out[177]:
   City  Colors Reported  Shape Reported  State
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO
In [178]:

# rows in which the 'City' is 'Oakland', column 'State'
ufo.loc[ufo.City=='Oakland', 'State']
 
 
Out[178]:
1694     CA
2144     CA
4686     MD
7293     CA
8488     CA
8768     CA
10816    OR
10948    CA
11045    CA
12322    CA
12941    CA
16803    MD
17322    CA
Name: State, dtype: object
In [179]:

# accomplish the same thing using "chained indexing" - but 'loc' is preferred since chained indexing can cause problems
ufo[ufo.City=='Oakland'].State
 
 
Out[179]:
1694     CA
2144     CA
4686     MD
7293     CA
8488     CA
8768     CA
10816    OR
10948    CA
11045    CA
12322    CA
12941    CA
16803    MD
17322    CA
Name: State, dtype: object
 

The iloc method is used to select rows and columns by integer position. You can pass it:

  • a single integer position
  • a list of integer positions
  • a slice of integer positions
  • a colon (which means "all integer positions")
In [180]:

# rows in positions 0 and 1, columns in positions 0 and 3
ufo.iloc[[0, 1], [0, 3]]

Out[180]:
   City  State
0 Ithaca NY
1 Willingboro NJ
In [181]:

# rows in positions 0 through 2 (exclusive), columns in positions 0 through 4 (exclusive)
ufo.iloc[0:2, 0:4]

Out[181]:
   City  Colors Reported  Shape Reported  State
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
In [182]:

# rows in positions 0 through 2 (exclusive), all columns
ufo.iloc[0:2, :]

Out[182]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
In [183]:

# accomplish the same thing - but 'iloc' is preferred since it's more explicit
ufo[0:2]

Out[183]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
 

The ix method is used to select rows and columns by either label or integer position, and should only be used when you need to mix label-based and integer-based selection in the same call.

In [184]:

# read a dataset of alcohol consumption into a DataFrame and set 'country' as the index
drinks = pd.read_csv(url7, index_col='country')
drinks.head()

Out[184]:
             beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
country     
Afghanistan 0 0 0 0.0 Asia
Albania 89 132 54 4.9 Europe
Algeria 25 0 14 0.7 Africa
Andorra 245 138 312 12.4 Europe
Angola 217 57 45 5.9 Africa
In [185]:

# row with label 'Albania', column in position 0
drinks.ix['Albania', 0]
 
 
 
c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  
Out[185]:
89
In [186]:

# row in position 1, column with label 'beer_servings'
drinks.ix[1, 'beer_servings']
 
 
Out[186]:
89
 

Rules for using numbers with ix:

  • If the index is strings, numbers are treated as integer positions, and thus slices are exclusive on the right.
  • If the index is integers, numbers are treated as labels, and thus slices are inclusive.
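Note that .ix was deprecated in pandas 0.20 and removed in 1.0; the two mixed selections above can be rewritten with loc and iloc. A sketch, using a small stand-in for the drinks DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'beer_servings': [0, 89, 25],
                   'spirit_servings': [0, 132, 0]},
                  index=['Afghanistan', 'Albania', 'Algeria'])

# old: df.ix['Albania', 0] -- label-based row, position-based column:
# translate the position into a column name explicitly
print(df.loc['Albania', df.columns[0]])                 # 89

# old: df.ix[1, 'beer_servings'] -- position-based row, label-based column:
# translate the label into a position explicitly
print(df.iloc[1, df.columns.get_loc('beer_servings')])  # 89
```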
In [187]:

# rows 'Albania' through 'Andorra' (inclusive), columns in positions 0 through 2 (exclusive)
drinks.ix['Albania':'Andorra', 0:2]

Out[187]:
         beer_servings  spirit_servings
country  
Albania 89 132
Algeria 25 0
Andorra 245 138
In [188]:

# rows 0 through 2 (inclusive), columns in positions 0 through 2 (exclusive)
ufo.ix[0:2, 0:2]
 
 
 
c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  
Out[188]:
   City  Colors Reported
0 Ithaca NaN
1 Willingboro NaN
2 Holyoke NaN
 

20. When should I use the "inplace" parameter in pandas? (video)

In [189]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv(url3)
ufo.head()

Out[189]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
In [190]:

ufo.shape
 
 
Out[190]:
(18241, 5)
In [191]:

# remove the 'City' column (does not affect the DataFrame since inplace=False)
ufo.drop('City', axis=1).head()

Out[191]:
   Colors Reported  Shape Reported  State  Time
0 NaN TRIANGLE NY 6/1/1930 22:00
1 NaN OTHER NJ 6/30/1930 20:00
2 NaN OVAL CO 2/15/1931 14:00
3 NaN DISK KS 6/1/1931 13:00
4 NaN LIGHT NY 4/18/1933 19:00
In [192]:

# confirm that the 'City' column was not actually removed
ufo.head()

Out[192]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
In [193]:

# remove the 'City' column (does affect the DataFrame since inplace=True)
ufo.drop('City', axis=1, inplace=True)
 
 
In [194]:

# confirm that the 'City' column was actually removed
ufo.head()

Out[194]:
   Colors Reported  Shape Reported  State  Time
0 NaN TRIANGLE NY 6/1/1930 22:00
1 NaN OTHER NJ 6/30/1930 20:00
2 NaN OVAL CO 2/15/1931 14:00
3 NaN DISK KS 6/1/1931 13:00
4 NaN LIGHT NY 4/18/1933 19:00
In [195]:

# drop a row if any value is missing from that row (does not affect the DataFrame since inplace=False)
ufo.dropna(how='any').shape
 
 
Out[195]:
(2490, 4)
In [196]:

# confirm that no rows were actually removed
ufo.shape
 
 
Out[196]:
(18241, 4)
In [197]:

# use an assignment statement instead of the 'inplace' parameter
ufo = ufo.set_index('Time')
ufo.tail()

Out[197]:
                  Colors Reported  Shape Reported  State
Time   
12/31/2000 23:00 NaN TRIANGLE IL
12/31/2000 23:00 NaN DISK IA
12/31/2000 23:45 NaN NaN WI
12/31/2000 23:45 RED LIGHT WI
12/31/2000 23:59 NaN OVAL FL
In [198]:

# fill missing values using "backward fill" strategy (does not affect the DataFrame since inplace=False)
ufo.fillna(method='bfill').tail()

Out[198]:
                  Colors Reported  Shape Reported  State
Time   
12/31/2000 23:00 RED TRIANGLE IL
12/31/2000 23:00 RED DISK IA
12/31/2000 23:45 RED LIGHT WI
12/31/2000 23:45 RED LIGHT WI
12/31/2000 23:59 NaN OVAL FL
In [199]:

# compare with "forward fill" strategy (does not affect the DataFrame since inplace=False)
ufo.fillna(method='ffill').tail()

Out[199]:
                  Colors Reported  Shape Reported  State
Time   
12/31/2000 23:00 RED TRIANGLE IL
12/31/2000 23:00 RED DISK IA
12/31/2000 23:45 RED DISK WI
12/31/2000 23:45 RED LIGHT WI
12/31/2000 23:59 RED OVAL FL
 

21. How do I make my pandas DataFrame smaller and faster? (video)

In [200]:

# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv(url7)
drinks.head()

Out[200]:
   country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [201]:

# the exact memory usage is unknown because object columns are references elsewhere
drinks.info()
 
 
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB
In [202]:

# force pandas to calculate the true memory usage
drinks.info(memory_usage='deep')
 
 
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 30.4 KB
In [203]:

# calculate the memory usage for each Series (in bytes)
drinks.memory_usage(deep=True)
 
 
Out[203]:
Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                       12332
dtype: int64
 

Documentation for info and memory_usage

In [204]:

# use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers
drinks['continent'] = drinks.continent.astype('category')
drinks.dtypes
 
 
Out[204]:
country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object
In [205]:

# the 'continent' Series appears to be unchanged
drinks.continent.head()
 
 
Out[205]:
0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]
In [206]:

# strings are now encoded (0 means 'Africa', 1 means 'Asia', 2 means 'Europe', etc.)
drinks.continent.cat.codes.head()
 
 
Out[206]:
0    1
1    2
2    0
3    2
4    0
dtype: int8
In [207]:

# memory usage of the 'continent' Series has been drastically reduced
drinks.memory_usage(deep=True)
 
 
Out[207]:
Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64
In [208]:

# repeat this process for the 'country' Series
drinks['country'] = drinks.country.astype('category')
drinks.memory_usage(deep=True)
 
 
Out[208]:
Index                              80
country                         18094
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64
In [209]:

# memory usage increased because we created 193 categories
drinks.country.cat.categories
 
 
Out[209]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=193)
 

The category data type should only be used with a string Series that has a small number of possible values.
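A quick way to see that trade-off (exact byte counts vary by pandas version, but the direction holds):

```python
import pandas as pd

# many rows, only two distinct values: a good fit for 'category'
low_cardinality = pd.Series(['red', 'blue'] * 5000)

as_object = low_cardinality.memory_usage(deep=True)
as_category = low_cardinality.astype('category').memory_usage(deep=True)

# the category encoding is far smaller than storing 10000 strings
print(as_category < as_object)   # True
```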

In [210]:

# create a small DataFrame from a dictionary
df = pd.DataFrame({'ID':[100, 101, 102, 103], 'quality':['good', 'very good', 'good', 'excellent']})
df

Out[210]:
   ID  quality
0 100 good
1 101 very good
2 102 good
3 103 excellent
In [211]:

# sort the DataFrame by the 'quality' Series (alphabetical order)
df.sort_values('quality')

Out[211]:
   ID  quality
3 103 excellent
0 100 good
2 102 good
1 101 very good
In [212]:

# define a logical ordering for the categories
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
df.quality
 
 
 
c:\users\qq123\anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
  
Out[212]:
0         good
1    very good
2         good
3    excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]
In [213]:

# sort the DataFrame by the 'quality' Series (logical order)
df.sort_values('quality')

Out[213]:
   ID  quality
0 100 good
2 102 good
1 101 very good
3 103 excellent
In [214]:

# comparison operators work with ordered categories
df.loc[df.quality > 'good', :]

Out[214]:
   ID  quality
1 101 very good
3 103 excellent
 

22. How do I use pandas with scikit-learn to create Kaggle submissions? (video)

In [215]:

# read the training dataset from Kaggle's Titanic competition into a DataFrame
url5 = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/titanic_train.csv'
train = pd.read_csv(url5)
train.head()

Out[215]:
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
 

Goal: Predict passenger survival aboard the Titanic based on passenger attributes

Video: What is machine learning, and how does it work?

 
In [217]:

# create a feature matrix 'X' by selecting two DataFrame columns
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]
X.shape
 
 
Out[217]:
(891, 2)
In [218]:

# create a response vector 'y' by selecting a Series
y = train.Survived
y.shape
 
 
Out[218]:
(891,)
 

Note: There is no need to convert these pandas objects to NumPy arrays. scikit-learn will understand these objects as long as they are entirely numeric and the proper shapes.
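A small sketch of the sanity checks implied by that note (the feature columns are the ones used in this section):

```python
import pandas as pd
import numpy as np

X = pd.DataFrame({'Pclass': [3, 1, 3], 'Parch': [0, 0, 1]})
y = pd.Series([0, 1, 1], name='Survived')

# X must be 2-dimensional and y 1-dimensional, with matching row counts
print(X.ndim, y.ndim)            # 2 1
print(X.shape[0] == y.shape[0])  # True

# and every feature column must be numeric
print(all(np.issubdtype(dt, np.number) for dt in X.dtypes))  # True
```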

 
 
In [220]:

# fit a classification model to the training data
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
 
 
Out[220]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [221]:

# read the testing dataset from Kaggle's Titanic competition into a DataFrame
url6 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/titanic_test.csv"
test = pd.read_csv(url6)
test.head()

Out[221]:
   PassengerId  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
In [222]:

# create a feature matrix from the testing data that matches the training data
X_new = test.loc[:, feature_cols]
X_new.shape
 
 
Out[222]:
(418, 2)
In [223]:

# use the fitted model to make predictions for the testing set
new_pred_class = logreg.predict(X_new)
 
 
In [224]:

# create a DataFrame of passenger IDs and predicted classes
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).head()

Out[224]:
   PassengerId  Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 0
 

Documentation for the DataFrame constructor

In [225]:

# ensure that PassengerId is the first column by setting it as the index
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').head()
 
 
Out[225]:
 Survived
PassengerId 
892 0
893 0
894 0
895 0
896 0
In [226]:

# write the DataFrame to a CSV file that can be submitted to Kaggle
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId').to_csv('sub.csv')
 
 
 

Documentation for to_csv

In [227]:

# save a DataFrame to disk ("pickle it")
train.to_pickle('train.pkl')
 
 
In [228]:

# read a pickled object from disk ("unpickle it")
pd.read_pickle('train.pkl').head()

Out[228]:
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
 

Documentation for to_pickle and read_pickle

[Back to top]

 

23. More of your pandas questions answered! (video)

 

Question: Could you explain how to read the pandas documentation?

pandas API reference

 

Question: What is the difference between ufo.isnull() and pd.isnull(ufo)?

In [229]:

# read a dataset of UFO reports into a DataFrame
ufo = pd.read_csv(url3)
ufo.head()

Out[229]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
In [230]:

# use 'isnull' as a top-level function
pd.isnull(ufo).head()

Out[230]:
   City  Colors Reported  Shape Reported  State  Time
0 False True False False False
1 False True False False False
2 False True False False False
3 False True False False False
4 False True False False False
In [231]:

# equivalent: use 'isnull' as a DataFrame method
ufo.isnull().head()

Out[231]:
   City  Colors Reported  Shape Reported  State  Time
0 False True False False False
1 False True False False False
2 False True False False False
3 False True False False False
4 False True False False False
 

Documentation for isnull

 

Question: Why are DataFrame slices inclusive when using .loc, but exclusive when using .iloc?

In [232]:

# label-based slicing is inclusive of both the start and the stop
ufo.loc[0:4, :]

Out[232]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
In [233]:

# position-based slicing is inclusive of the start and exclusive of the stop
ufo.iloc[0:4, :]

Out[233]:
   City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
 

Documentation for loc and iloc

In [234]:

# 'iloc' simply follows NumPy's slicing conventions...
ufo.values[0:4, :]
 
 
Out[234]:
array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'],
       ['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'],
       ['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'],
       ['Abilene', nan, 'DISK', 'KS', '6/1/1931 13:00']], dtype=object)
In [235]:

# ...and NumPy simply follows Python's slicing conventions
'python'[0:4]
 
 
Out[235]:
'pyth'
In [236]:

# 'loc' includes the stop label because you don't necessarily know what label will come after it
ufo.loc[0:4, 'City':'State']

Out[236]:
   City  Colors Reported  Shape Reported  State
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO
3 Abilene NaN DISK KS
4 New York Worlds Fair NaN LIGHT NY
 

Question: How do I randomly sample rows from a DataFrame?

In [237]:

# sample 3 rows from the DataFrame at random
ufo.sample(n=3)

Out[237]:
   City  Colors Reported  Shape Reported  State  Time
16158 Summer Lake RED ORANGE SPHERE OR 3/28/2000 7:45
3334 San Jose NaN DISK CA 8/1/1976 21:00
15116 Dallas GREEN FLASH TX 11/16/1999 5:00
 

Documentation for sample

In [238]:

# use the 'random_state' parameter to make the results reproducible
ufo.sample(n=3, random_state=42)

Out[238]:
   City  Colors Reported  Shape Reported  State  Time
217 Norridgewock NaN DISK ME 9/15/1952 14:00
12282 Ipava NaN TRIANGLE IL 10/1/1998 21:15
17933 Ellinwood NaN FIREBALL KS 11/13/2000 22:00
In [239]:

# sample 75% of the DataFrame's rows without replacement
train = ufo.sample(frac=0.75, random_state=99)
 
 
In [240]:

# store the remaining 25% of the rows in another DataFrame
test = ufo.loc[~ufo.index.isin(train.index), :]
 
 
 

Documentation for isin

[Back to top]

 

24. How do I create dummy variables in pandas? (video)

In [241]:
 
 
 
 
 
train = pd.read_csv(url5)
train.head()
 
 
Out[241]:
  PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [242]:
 
 
 
 
 
# use the 'map' method to create the 'Sex_male' dummy variable
train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
train.head()
 
 
Out[242]:
  PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked  Sex_male
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 1
 

Documentation for map

In [243]:
 
 
 
 
 
# alternative: use 'get_dummies' to create one column for every possible value
pd.get_dummies(train.Sex).head()
 
 
Out[243]:
  female  male
0 0 1
1 1 0
2 1 0
3 1 0
4 0 1
 

Generally speaking:

  • If you have "K" possible values for a categorical feature, you only need "K-1" dummy variables to capture all of the information about that feature.
  • One convention is to drop the first dummy variable, which defines that level as the "baseline".
In [245]:
 
 
 
 
 
# use 'iloc' to drop the first dummy variable ('female')
pd.get_dummies(train.Sex).iloc[:, 1:].head()
 
 
Out[245]:
 male
0 1
1 0
2 0
3 0
4 1
In [246]:
 
 
 
 
 
# add a prefix to identify the source of the dummy variables
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head()
 
 
Out[246]:
 Sex_male
0 1
1 0
2 0
3 0
4 1
In [247]:
 
 
 
 
 
# use 'get_dummies' with a feature that has 3 possible values
pd.get_dummies(train.Embarked, prefix='Embarked').head(10)
 
 
Out[247]:
  Embarked_C  Embarked_Q  Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
5 0 1 0
6 0 0 1
7 0 0 1
8 0 0 1
9 1 0 0
In [248]:
 
 
 
 
 
# drop the first dummy variable ('C')
pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10)
 
 
Out[248]:
  Embarked_Q  Embarked_S
0 0 1
1 0 0
2 0 1
3 0 1
4 0 1
5 1 0
6 0 1
7 0 1
8 0 1
9 0 0
 

How to translate these values back to the original 'Embarked' value:

  • 0, 0 means C
  • 1, 0 means Q
  • 0, 1 means S
In [249]:
 
 
 
 
 
# save the DataFrame of dummy variables and concatenate them onto the original DataFrame
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
train = pd.concat([train, embarked_dummies], axis=1)
train.head()
 
 
Out[249]:
  PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked  Sex_male  Embarked_Q  Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 0 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 1 0 1
 

Documentation for concat

In [250]:
 
 
 
 
 
# reset the DataFrame by re-reading the file
train = pd.read_csv(url5)
train.head()
 
 
Out[250]:
  PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [251]:
 
 
 
 
 
# pass the DataFrame to 'get_dummies' and specify which columns to dummy (it drops the original columns)
pd.get_dummies(train, columns=['Sex', 'Embarked']).head()
 
 
Out[251]:
  PassengerId  Survived  Pclass  Name  Age  SibSp  Parch  Ticket  Fare  Cabin  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN 0 1 0 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 C85 1 0 1 0 0
2 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN 1 0 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 1 0 0 0 1
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN 0 1 0 0 1
In [252]:
 
 
 
 
 
# use the 'drop_first' parameter (new in pandas 0.18) to drop the first dummy variable for each feature
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True).head()
 
 
Out[252]:
  PassengerId  Survived  Pclass  Name  Age  SibSp  Parch  Ticket  Fare  Cabin  Sex_male  Embarked_Q  Embarked_S
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN 1 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 C85 0 0 0
2 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 0 0 1
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN 1 0 1
 

Documentation for get_dummies
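
The behaviour above can be sketched on a hypothetical two-column frame: the dummied columns replace the originals, and `drop_first=True` removes the first level of each feature as the baseline.

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                   'Embarked': ['S', 'C', 'Q']})

full = pd.get_dummies(df, columns=['Sex', 'Embarked'])
reduced = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

print(sorted(full.columns))
# ['Embarked_C', 'Embarked_Q', 'Embarked_S', 'Sex_female', 'Sex_male']
print(sorted(reduced.columns))
# ['Embarked_Q', 'Embarked_S', 'Sex_male']
```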

[Back to top]

 

25. How do I work with dates and times in pandas? (video)

In [253]:
 
 
 
 
 
ufo = pd.read_csv(url3)
ufo.head()
 
 
Out[253]:
  City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
In [254]:
 
 
 
 
 
# 'Time' is currently stored as a string
ufo.dtypes
 
 
Out[254]:
City               object
Colors Reported    object
Shape Reported     object
State              object
Time               object
dtype: object
In [255]:
 
 
 
 
 
# the hour could be accessed using string slicing, but this approach breaks too easily
ufo.Time.str.slice(-5, -3).astype(int).head()
 
 
Out[255]:
0    22
1    20
2    14
3    13
4    19
Name: Time, dtype: int32
In [256]:
 
 
 
 
 
# convert 'Time' to datetime format
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.head()
 
 
Out[256]:
  City  Colors Reported  Shape Reported  State  Time
0 Ithaca NaN TRIANGLE NY 1930-06-01 22:00:00
1 Willingboro NaN OTHER NJ 1930-06-30 20:00:00
2 Holyoke NaN OVAL CO 1931-02-15 14:00:00
3 Abilene NaN DISK KS 1931-06-01 13:00:00
4 New York Worlds Fair NaN LIGHT NY 1933-04-18 19:00:00
In [257]:
 
 
 
 
 
ufo.dtypes
 
 
Out[257]:
City                       object
Colors Reported            object
Shape Reported             object
State                      object
Time               datetime64[ns]
dtype: object
 

Documentation for to_datetime

In [258]:
 
 
 
 
 
# convenient Series attributes are now available via the 'dt' accessor
ufo.Time.dt.hour.head()
 
 
Out[258]:
0    22
1    20
2    14
3    13
4    19
Name: Time, dtype: int64
In [259]:
 
 
 
 
 
# note: 'weekday_name' was removed in pandas 1.0; use ufo.Time.dt.day_name() there
ufo.Time.dt.weekday_name.head()
 
 
Out[259]:
0     Sunday
1     Monday
2     Sunday
3     Monday
4    Tuesday
Name: Time, dtype: object
In [260]:
 
 
 
 
 
ufo.Time.dt.dayofyear.head()
 
 
Out[260]:
0    152
1    181
2     46
3    152
4    108
Name: Time, dtype: int64
 

API reference for datetime properties and methods
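
The 'dt' accessor and Timestamp arithmetic can be sketched on a couple of hand-made timestamps (hypothetical dates, not the ufo data):

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['1/1/1999 22:00', '6/1/1999 22:00']))

hours = s.dt.hour          # hour component of each timestamp
span = s.max() - s.min()   # subtraction yields a Timedelta

print(list(hours))  # [22, 22]
print(span.days)    # 151
```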

In [261]:
 
 
 
 
 
# convert a single string to datetime format (outputs a timestamp object)
ts = pd.to_datetime('1/1/1999')
ts
 
 
Out[261]:
Timestamp('1999-01-01 00:00:00')
In [262]:
 
 
 
 
 
# compare a datetime Series with a Timestamp
ufo.loc[ufo.Time >= ts, :].head()
 
 
Out[262]:
  City  Colors Reported  Shape Reported  State  Time
12832 Loma Rica NaN LIGHT CA 1999-01-01 02:30:00
12833 Bauxite NaN NaN AR 1999-01-01 03:00:00
12834 Florence NaN CYLINDER SC 1999-01-01 14:00:00
12835 Lake Henshaw NaN CIGAR CA 1999-01-01 15:00:00
12836 Wilmington Island NaN LIGHT GA 1999-01-01 17:15:00
In [263]:
 
 
 
 
 
# perform mathematical operations with Timestamps (outputs a Timedelta object)
ufo.Time.max() - ufo.Time.min()
 
 
Out[263]:
Timedelta('25781 days 01:59:00')
In [264]:
 
 
 
 
 
# Timedelta objects also have attributes you can access
(ufo.Time.max() - ufo.Time.min()).days
 
 
Out[264]:
25781
In [265]:
 
 
 
 
 
# allow plots to appear in the Jupyter notebook
%matplotlib inline
 
 
In [266]:
 
 
 
 
 
# count the number of UFO reports per year
ufo['Year'] = ufo.Time.dt.year
ufo.Year.value_counts().sort_index().head()
 
 
Out[266]:
1930    2
1931    2
1933    1
1934    1
1935    1
Name: Year, dtype: int64
In [267]:
 
 
 
 
 
# plot the number of UFO reports per year (line plot is the default)
ufo.Year.value_counts().sort_index().plot()
 
 
Out[267]:
<matplotlib.axes._subplots.AxesSubplot at 0x296f1bd1240>
 
 

26. How do I find and remove duplicate rows in pandas? (video)

In [268]:
 
 
 
 
 
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table(url2, sep='|', header=None, names=user_cols, index_col='user_id')
users.head()
 
 
Out[268]:
  age  gender  occupation  zip_code
user_id    
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
In [269]:
 
 
 
 
 
users.shape
 
 
Out[269]:
(943, 4)
In [270]:
 
 
 
 
 
# detect duplicate zip codes: True if an item is identical to a previous item
users.zip_code.duplicated().tail()
 
 
Out[270]:
user_id
939    False
940     True
941    False
942    False
943    False
Name: zip_code, dtype: bool
In [271]:
 
 
 
 
 
# count the duplicate items (True becomes 1, False becomes 0)
users.zip_code.duplicated().sum()
 
 
Out[271]:
148
In [272]:
 
 
 
 
 
# detect duplicate DataFrame rows: True if an entire row is identical to a previous row
users.duplicated().tail()
 
 
Out[272]:
user_id
939    False
940    False
941    False
942    False
943    False
dtype: bool
In [273]:
 
 
 
 
 
# count the duplicate rows
users.duplicated().sum()
 
 
Out[273]:
7
 

Logic for duplicated:

  • keep='first' (default): Mark duplicates as True except for the first occurrence.
  • keep='last': Mark duplicates as True except for the last occurrence.
  • keep=False: Mark all duplicates as True.
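
The three `keep` settings above can be sketched on a small hypothetical Series with one value repeated three times:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a'])

print(s.duplicated(keep='first').sum())  # 2 -- every 'a' except the first
print(s.duplicated(keep='last').sum())   # 2 -- every 'a' except the last
print(s.duplicated(keep=False).sum())    # 3 -- every 'a'
```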
In [275]:
 
 
 
 
 
# examine the duplicate rows (ignoring the first occurrence)
users.loc[users.duplicated(keep='first'), :]
 
 
Out[275]:
  age  gender  occupation  zip_code
user_id    
496 21 F student 55414
572 51 M educator 20003
621 17 M student 60402
684 28 M student 55414
733 44 F other 60630
805 27 F other 20009
890 32 M student 97301
In [276]:
 
 
 
 
 
# examine the duplicate rows (ignoring the last occurrence)
users.loc[users.duplicated(keep='last'), :]
 
 
Out[276]:
  age  gender  occupation  zip_code
user_id    
67 17 M student 60402
85 51 M educator 20003
198 21 F student 55414
350 32 M student 97301
428 28 M student 55414
437 27 F other 20009
460 44 F other 60630
In [277]:
 
 
 
 
 
# examine the duplicate rows (including all duplicates)
users.loc[users.duplicated(keep=False), :]
 
 
Out[277]:
  age  gender  occupation  zip_code
user_id    
67 17 M student 60402
85 51 M educator 20003
198 21 F student 55414
350 32 M student 97301
428 28 M student 55414
437 27 F other 20009
460 44 F other 60630
496 21 F student 55414
572 51 M educator 20003
621 17 M student 60402
684 28 M student 55414
733 44 F other 60630
805 27 F other 20009
890 32 M student 97301
In [278]:
 
 
 
 
 
# drop the duplicate rows (inplace=False by default)
users.drop_duplicates(keep='first').shape
 
 
Out[278]:
(936, 4)
In [279]:
 
 
 
 
 
users.drop_duplicates(keep='last').shape
 
 
Out[279]:
(936, 4)
In [280]:
 
 
 
 
 
users.drop_duplicates(keep=False).shape
 
 
Out[280]:
(929, 4)
 

Documentation for drop_duplicates

In [281]:
 
 
 
 
 
# only consider a subset of columns when identifying duplicates
users.duplicated(subset=['age', 'zip_code']).sum()
 
 
Out[281]:
16
In [282]:
 
 
 
 
 
users.drop_duplicates(subset=['age', 'zip_code']).shape
 
 
Out[282]:
(927, 4)
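
Subset-based deduplication can be sketched on hypothetical data: only the listed columns are compared, so rows that differ elsewhere can still be dropped.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 24, 30],
                   'zip_code': ['85711', '85711', '94043'],
                   'occupation': ['technician', 'writer', 'other']})

# row 1 matches row 0 on 'age' and 'zip_code', so it is dropped even
# though its 'occupation' differs
deduped = df.drop_duplicates(subset=['age', 'zip_code'])
print(deduped.shape)  # (2, 3)
```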
 

27. How do I avoid a SettingWithCopyWarning in pandas? (video)

In [283]:
 
 
 
 
 
movies = pd.read_csv(url4)
movies.head()
 
 
Out[283]:
  star_rating  title  content_rating  genre  duration  actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....
In [284]:
 
 
 
 
 
# count the missing values in the 'content_rating' Series
movies.content_rating.isnull().sum()
 
 
Out[284]:
3
In [285]:
 
 
 
 
 
# examine the DataFrame rows that contain those missing values
movies[movies.content_rating.isnull()]
 
 
Out[285]:
  star_rating  title  content_rating  genre  duration  actors_list
187 8.2 Butch Cassidy and the Sundance Kid NaN Biography 110 [u'Paul Newman', u'Robert Redford', u'Katharin...
649 7.7 Where Eagles Dare NaN Action 158 [u'Richard Burton', u'Clint Eastwood', u'Mary ...
936 7.4 True Grit NaN Adventure 128 [u'John Wayne', u'Kim Darby', u'Glen Campbell']
In [286]:
 
 
 
 
 
# examine the unique values in the 'content_rating' Series
movies.content_rating.value_counts()
 
 
Out[286]:
R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
PASSED         7
NC-17          7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64
 

Goal: Mark the 'NOT RATED' values as missing values, represented by 'NaN'.

In [288]:
 
 
 
 
 
# first, locate the relevant rows
movies[movies.content_rating=='NOT RATED'].head()
 
 
Out[288]:
  star_rating  title  content_rating  genre  duration  actors_list
5 8.9 12 Angry Men NOT RATED Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
6 8.9 The Good, the Bad and the Ugly NOT RATED Western 161 [u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
41 8.5 Sunset Blvd. NOT RATED Drama 110 [u'William Holden', u'Gloria Swanson', u'Erich...
63 8.4 M NOT RATED Crime 99 [u'Peter Lorre', u'Ellen Widmann', u'Inge Land...
66 8.4 Munna Bhai M.B.B.S. NOT RATED Comedy 156 [u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi']
In [289]:
 
 
 
 
 
# then, select the 'content_rating' Series from those rows
movies[movies.content_rating=='NOT RATED'].content_rating.head()
 
 
Out[289]:
5     NOT RATED
6     NOT RATED
41    NOT RATED
63    NOT RATED
66    NOT RATED
Name: content_rating, dtype: object
In [290]:
 
 
 
 
 
# finally, replace the 'NOT RATED' values with 'NaN' (imported from NumPy)
import numpy as np
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan
 
 
 
c:\users\qq123\anaconda3\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
 

Problem: That statement involves two operations, a __getitem__ and a __setitem__. pandas can't guarantee whether the __getitem__ operation returns a view or a copy of the data.

  • If __getitem__ returns a view of the data, __setitem__ will affect the 'movies' DataFrame.
  • But if __getitem__ returns a copy of the data, __setitem__ will not affect the 'movies' DataFrame.
In [292]:
 
 
 
 
 
# the 'content_rating' Series has not changed
movies.content_rating.isnull().sum()
 
 
Out[292]:
3
 

Solution: Use the loc method, which replaces the 'NOT RATED' values in a single __setitem__ operation.

In [294]:
 
 
 
 
 
# replace the 'NOT RATED' values with 'NaN' (does not cause a SettingWithCopyWarning)
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan
 
 
In [295]:
 
 
 
 
 
# this time, the 'content_rating' Series has changed
movies.content_rating.isnull().sum()
 
 
Out[295]:
68
 

Summary: Use the loc method any time you are selecting rows and columns in the same statement.

More information: Modern Pandas (Part 1)
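
The single-__setitem__ fix can be sketched on a hypothetical frame: assigning through .loc with a boolean row mask and a column label modifies the original DataFrame and raises no SettingWithCopyWarning.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'rating': ['R', 'NOT RATED', 'PG']})

# one __setitem__ call: select rows and column in the same statement
df.loc[df.rating == 'NOT RATED', 'rating'] = np.nan

print(df.rating.isnull().sum())  # 1
```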

In [296]:
 
 
 
 
 
# create a DataFrame containing only movies with a high 'star_rating'
top_movies = movies.loc[movies.star_rating >= 9, :]
top_movies
 
 
Out[296]:
  star_rating  title  content_rating  genre  duration  actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
 

Goal: Fix the 'duration' for 'The Shawshank Redemption'.

In [298]:
 
 
 
 
 
# overwrite the relevant cell with the correct duration
top_movies.loc[0, 'duration'] = 150
 
 
 
c:\users\qq123\anaconda3\lib\site-packages\pandas\core\indexing.py:537: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
 

Problem: pandas isn't sure whether 'top_movies' is a view or a copy of 'movies'.

In [300]:
 
 
 
 
 
# the assignment went through anyway
top_movies
 
 
Out[300]:
  star_rating  title  content_rating  genre  duration  actors_list
0 9.3 The Shawshank Redemption R Crime 150 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
In [301]:
 
 
 
 
 
# but the 'movies' DataFrame has not changed
movies.head(1)
 
 
Out[301]:
  star_rating  title  content_rating  genre  duration  actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
 

Solution: Any time you are attempting to create a DataFrame copy, use the copy method.

In [303]:
 
 
 
 
 
# explicitly create a copy of 'movies'
top_movies = movies.loc[movies.star_rating >= 9, :].copy()
 
 
In [304]:
 
 
 
 
 
# pandas now knows you are updating a copy rather than a view (does not cause a SettingWithCopyWarning)
top_movies.loc[0, 'duration'] = 150
 
 
In [305]:
 
 
 
 
 
# the 'top_movies' DataFrame was updated
top_movies
 
 
Out[305]:
  star_rating  title  content_rating  genre  duration  actors_list
0 9.3 The Shawshank Redemption R Crime 150 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
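
The copy-method pattern can be sketched on a hypothetical frame: writing through the explicit copy leaves the original intact and triggers no SettingWithCopyWarning.

```python
import pandas as pd

movies = pd.DataFrame({'star_rating': [9.3, 8.0], 'duration': [142, 100]})

top = movies.loc[movies.star_rating >= 9, :].copy()  # explicit copy
top.loc[0, 'duration'] = 150                         # safe to modify

print(top.loc[0, 'duration'])     # 150
print(movies.loc[0, 'duration'])  # 142 -- the original is untouched
```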
 

28. How do I change display options in pandas? (video)

In [306]:
 
 
 
 
 
drinks = pd.read_csv(url7)
 
 
In [307]:
 
 
 
 
 
# only 60 rows are shown when printing
drinks
 
 
Out[307]:
  country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
5 Antigua & Barbuda 102 128 45 4.9 North America
6 Argentina 193 25 221 8.3 South America
7 Armenia 21 179 11 3.8 Europe
8 Australia 261 72 212 10.4 Oceania
9 Austria 279 75 191 9.7 Europe
10 Azerbaijan 21 46 5 1.3 Europe
11 Bahamas 122 176 51 6.3 North America
12 Bahrain 42 63 7 2.0 Asia
13 Bangladesh 0 0 0 0.0 Asia
14 Barbados 143 173 36 6.3 North America
15 Belarus 142 373 42 14.4 Europe
16 Belgium 295 84 212 10.5 Europe
17 Belize 263 114 8 6.8 North America
18 Benin 34 4 13 1.1 Africa
19 Bhutan 23 0 0 0.4 Asia
20 Bolivia 167 41 8 3.8 South America
21 Bosnia-Herzegovina 76 173 8 4.6 Europe
22 Botswana 173 35 35 5.4 Africa
23 Brazil 245 145 16 7.2 South America
24 Brunei 31 2 1 0.6 Asia
25 Bulgaria 231 252 94 10.3 Europe
26 Burkina Faso 25 7 7 4.3 Africa
27 Burundi 88 0 0 6.3 Africa
28 Cote d'Ivoire 37 1 7 4.0 Africa
29 Cabo Verde 144 56 16 4.0 Africa
... ... ... ... ... ... ...
163 Suriname 128 178 7 5.6 South America
164 Swaziland 90 2 2 4.7 Africa
165 Sweden 152 60 186 7.2 Europe
166 Switzerland 185 100 280 10.2 Europe
167 Syria 5 35 16 1.0 Asia
168 Tajikistan 2 15 0 0.3 Asia
169 Thailand 99 258 1 6.4 Asia
170 Macedonia 106 27 86 3.9 Europe
171 Timor-Leste 1 1 4 0.1 Asia
172 Togo 36 2 19 1.3 Africa
173 Tonga 36 21 5 1.1 Oceania
174 Trinidad & Tobago 197 156 7 6.4 North America
175 Tunisia 51 3 20 1.3 Africa
176 Turkey 51 22 7 1.4 Asia
177 Turkmenistan 19 71 32 2.2 Asia
178 Tuvalu 6 41 9 1.0 Oceania
179 Uganda 45 9 0 8.3 Africa
180 Ukraine 206 237 45 8.9 Europe
181 United Arab Emirates 16 135 5 2.8 Asia
182 United Kingdom 219 126 195 10.4 Europe
183 Tanzania 36 6 1 5.7 Africa
184 USA 249 158 84 8.7 North America
185 Uruguay 115 35 220 6.6 South America
186 Uzbekistan 25 101 8 2.4 Asia
187 Vanuatu 21 18 11 0.9 Oceania
188 Venezuela 333 100 3 7.7 South America
189 Vietnam 111 2 1 2.0 Asia
190 Yemen 6 0 0 0.1 Asia
191 Zambia 32 19 4 2.5 Africa
192 Zimbabwe 64 18 4 4.7 Africa

193 rows × 6 columns

In [308]:
 
 
 
 
 
# check the current setting for the 'max_rows' option
pd.get_option('display.max_rows')
 
 
Out[308]:
60
 

Documentation for get_option

In [309]:
 
 
 
 
 
# overwrite the current setting so that all rows are displayed
pd.set_option('display.max_rows', None)
drinks
 
 
Out[309]:
  country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
5 Antigua & Barbuda 102 128 45 4.9 North America
6 Argentina 193 25 221 8.3 South America
7 Armenia 21 179 11 3.8 Europe
8 Australia 261 72 212 10.4 Oceania
9 Austria 279 75 191 9.7 Europe
10 Azerbaijan 21 46 5 1.3 Europe
11 Bahamas 122 176 51 6.3 North America
12 Bahrain 42 63 7 2.0 Asia
13 Bangladesh 0 0 0 0.0 Asia
14 Barbados 143 173 36 6.3 North America
15 Belarus 142 373 42 14.4 Europe
16 Belgium 295 84 212 10.5 Europe
17 Belize 263 114 8 6.8 North America
18 Benin 34 4 13 1.1 Africa
19 Bhutan 23 0 0 0.4 Asia
20 Bolivia 167 41 8 3.8 South America
21 Bosnia-Herzegovina 76 173 8 4.6 Europe
22 Botswana 173 35 35 5.4 Africa
23 Brazil 245 145 16 7.2 South America
24 Brunei 31 2 1 0.6 Asia
25 Bulgaria 231 252 94 10.3 Europe
26 Burkina Faso 25 7 7 4.3 Africa
27 Burundi 88 0 0 6.3 Africa
28 Cote d'Ivoire 37 1 7 4.0 Africa
29 Cabo Verde 144 56 16 4.0 Africa
30 Cambodia 57 65 1 2.2 Asia
31 Cameroon 147 1 4 5.8 Africa
32 Canada 240 122 100 8.2 North America
33 Central African Republic 17 2 1 1.8 Africa
34 Chad 15 1 1 0.4 Africa
35 Chile 130 124 172 7.6 South America
36 China 79 192 8 5.0 Asia
37 Colombia 159 76 3 4.2 South America
38 Comoros 1 3 1 0.1 Africa
39 Congo 76 1 9 1.7 Africa
40 Cook Islands 0 254 74 5.9 Oceania
41 Costa Rica 149 87 11 4.4 North America
42 Croatia 230 87 254 10.2 Europe
43 Cuba 93 137 5 4.2 North America
44 Cyprus 192 154 113 8.2 Europe
45 Czech Republic 361 170 134 11.8 Europe
46 North Korea 0 0 0 0.0 Asia
47 DR Congo 32 3 1 2.3 Africa
48 Denmark 224 81 278 10.4 Europe
49 Djibouti 15 44 3 1.1 Africa
50 Dominica 52 286 26 6.6 North America
51 Dominican Republic 193 147 9 6.2 North America
52 Ecuador 162 74 3 4.2 South America
53 Egypt 6 4 1 0.2 Africa
54 El Salvador 52 69 2 2.2 North America
55 Equatorial Guinea 92 0 233 5.8 Africa
56 Eritrea 18 0 0 0.5 Africa
57 Estonia 224 194 59 9.5 Europe
58 Ethiopia 20 3 0 0.7 Africa
59 Fiji 77 35 1 2.0 Oceania
60 Finland 263 133 97 10.0 Europe
61 France 127 151 370 11.8 Europe
62 Gabon 347 98 59 8.9 Africa
63 Gambia 8 0 1 2.4 Africa
64 Georgia 52 100 149 5.4 Europe
65 Germany 346 117 175 11.3 Europe
66 Ghana 31 3 10 1.8 Africa
67 Greece 133 112 218 8.3 Europe
68 Grenada 199 438 28 11.9 North America
69 Guatemala 53 69 2 2.2 North America
70 Guinea 9 0 2 0.2 Africa
71 Guinea-Bissau 28 31 21 2.5 Africa
72 Guyana 93 302 1 7.1 South America
73 Haiti 1 326 1 5.9 North America
74 Honduras 69 98 2 3.0 North America
75 Hungary 234 215 185 11.3 Europe
76 Iceland 233 61 78 6.6 Europe
77 India 9 114 0 2.2 Asia
78 Indonesia 5 1 0 0.1 Asia
79 Iran 0 0 0 0.0 Asia
80 Iraq 9 3 0 0.2 Asia
81 Ireland 313 118 165 11.4 Europe
82 Israel 63 69 9 2.5 Asia
83 Italy 85 42 237 6.5 Europe
84 Jamaica 82 97 9 3.4 North America
85 Japan 77 202 16 7.0 Asia
86 Jordan 6 21 1 0.5 Asia
87 Kazakhstan 124 246 12 6.8 Asia
88 Kenya 58 22 2 1.8 Africa
89 Kiribati 21 34 1 1.0 Oceania
90 Kuwait 0 0 0 0.0 Asia
91 Kyrgyzstan 31 97 6 2.4 Asia
92 Laos 62 0 123 6.2 Asia
93 Latvia 281 216 62 10.5 Europe
94 Lebanon 20 55 31 1.9 Asia
95 Lesotho 82 29 0 2.8 Africa
96 Liberia 19 152 2 3.1 Africa
97 Libya 0 0 0 0.0 Africa
98 Lithuania 343 244 56 12.9 Europe
99 Luxembourg 236 133 271 11.4 Europe
100 Madagascar 26 15 4 0.8 Africa
101 Malawi 8 11 1 1.5 Africa
102 Malaysia 13 4 0 0.3 Asia
103 Maldives 0 0 0 0.0 Asia
104 Mali 5 1 1 0.6 Africa
105 Malta 149 100 120 6.6 Europe
106 Marshall Islands 0 0 0 0.0 Oceania
107 Mauritania 0 0 0 0.0 Africa
108 Mauritius 98 31 18 2.6 Africa
109 Mexico 238 68 5 5.5 North America
110 Micronesia 62 50 18 2.3 Oceania
111 Monaco 0 0 0 0.0 Europe
112 Mongolia 77 189 8 4.9 Asia
113 Montenegro 31 114 128 4.9 Europe
114 Morocco 12 6 10 0.5 Africa
115 Mozambique 47 18 5 1.3 Africa
116 Myanmar 5 1 0 0.1 Asia
117 Namibia 376 3 1 6.8 Africa
118 Nauru 49 0 8 1.0 Oceania
119 Nepal 5 6 0 0.2 Asia
120 Netherlands 251 88 190 9.4 Europe
121 New Zealand 203 79 175 9.3 Oceania
122 Nicaragua 78 118 1 3.5 North America
123 Niger 3 2 1 0.1 Africa
124 Nigeria 42 5 2 9.1 Africa
125 Niue 188 200 7 7.0 Oceania
126 Norway 169 71 129 6.7 Europe
127 Oman 22 16 1 0.7 Asia
128 Pakistan 0 0 0 0.0 Asia
129 Palau 306 63 23 6.9 Oceania
130 Panama 285 104 18 7.2 North America
131 Papua New Guinea 44 39 1 1.5 Oceania
132 Paraguay 213 117 74 7.3 South America
133 Peru 163 160 21 6.1 South America
134 Philippines 71 186 1 4.6 Asia
135 Poland 343 215 56 10.9 Europe
136 Portugal 194 67 339 11.0 Europe
137 Qatar 1 42 7 0.9 Asia
138 South Korea 140 16 9 9.8 Asia
139 Moldova 109 226 18 6.3 Europe
140 Romania 297 122 167 10.4 Europe
141 Russian Federation 247 326 73 11.5 Asia
142 Rwanda 43 2 0 6.8 Africa
143 St. Kitts & Nevis 194 205 32 7.7 North America
144 St. Lucia 171 315 71 10.1 North America
145 St. Vincent & the Grenadines 120 221 11 6.3 North America
146 Samoa 105 18 24 2.6 Oceania
147 San Marino 0 0 0 0.0 Europe
148 Sao Tome & Principe 56 38 140 4.2 Africa
149 Saudi Arabia 0 5 0 0.1 Asia
150 Senegal 9 1 7 0.3 Africa
151 Serbia 283 131 127 9.6 Europe
152 Seychelles 157 25 51 4.1 Africa
153 Sierra Leone 25 3 2 6.7 Africa
154 Singapore 60 12 11 1.5 Asia
155 Slovakia 196 293 116 11.4 Europe
156 Slovenia 270 51 276 10.6 Europe
157 Solomon Islands 56 11 1 1.2 Oceania
158 Somalia 0 0 0 0.0 Africa
159 South Africa 225 76 81 8.2 Africa
160 Spain 284 157 112 10.0 Europe
161 Sri Lanka 16 104 0 2.2 Asia
162 Sudan 8 13 0 1.7 Africa
163 Suriname 128 178 7 5.6 South America
164 Swaziland 90 2 2 4.7 Africa
165 Sweden 152 60 186 7.2 Europe
166 Switzerland 185 100 280 10.2 Europe
167 Syria 5 35 16 1.0 Asia
168 Tajikistan 2 15 0 0.3 Asia
169 Thailand 99 258 1 6.4 Asia
170 Macedonia 106 27 86 3.9 Europe
171 Timor-Leste 1 1 4 0.1 Asia
172 Togo 36 2 19 1.3 Africa
173 Tonga 36 21 5 1.1 Oceania
174 Trinidad & Tobago 197 156 7 6.4 North America
175 Tunisia 51 3 20 1.3 Africa
176 Turkey 51 22 7 1.4 Asia
177 Turkmenistan 19 71 32 2.2 Asia
178 Tuvalu 6 41 9 1.0 Oceania
179 Uganda 45 9 0 8.3 Africa
180 Ukraine 206 237 45 8.9 Europe
181 United Arab Emirates 16 135 5 2.8 Asia
182 United Kingdom 219 126 195 10.4 Europe
183 Tanzania 36 6 1 5.7 Africa
184 USA 249 158 84 8.7 North America
185 Uruguay 115 35 220 6.6 South America
186 Uzbekistan 25 101 8 2.4 Asia
187 Vanuatu 21 18 11 0.9 Oceania
188 Venezuela 333 100 3 7.7 South America
189 Vietnam 111 2 1 2.0 Asia
190 Yemen 6 0 0 0.1 Asia
191 Zambia 32 19 4 2.5 Africa
192 Zimbabwe 64 18 4 4.7 Africa
In [310]:
 
 
 
 
 
# reset the 'max_rows' option to its default
pd.reset_option('display.max_rows')
 
 
 

Documentation for set_option and reset_option
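
The get/set/reset round trip can be verified directly: whatever the current default is, `reset_option` restores it.

```python
import pandas as pd

default = pd.get_option('display.max_rows')   # capture the default (60 unless changed)

pd.set_option('display.max_rows', None)       # None means show every row
assert pd.get_option('display.max_rows') is None

pd.reset_option('display.max_rows')           # back to the default
print(pd.get_option('display.max_rows') == default)  # True
```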

In [311]:
 
 
 
 
 
# the 'max_columns' option is similar to 'max_rows'
pd.get_option('display.max_columns')
 
 
Out[311]:
20
In [312]:
 
 
 
 
 
train = pd.read_csv(url5)
train.head()
 
 
Out[312]:
  PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [313]:
 
 
 
 
 
# an ellipsis is displayed in the 'Name' cell of row 1 due to the 'max_colwidth' option
pd.get_option('display.max_colwidth')
 
 
Out[313]:
50
In [314]:
 
 
 
 
 
# overwrite the current setting so that more characters are displayed
pd.set_option('display.max_colwidth', 1000)
train.head()
 
 
Out[314]:
  PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [315]:
 
 
 
 
 
# overwrite the 'precision' setting to display 2 digits after the decimal point for 'Fare'
pd.set_option('display.precision', 2)
train.head()
 
 
Out[315]:
  PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.25 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.28 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.92 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.10 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.05 NaN S
In [316]:
 
 
 
 
 
# add two meaningless columns to the drinks DataFrame
drinks['x'] = drinks.wine_servings * 1000
drinks['y'] = drinks.total_litres_of_pure_alcohol * 1000
drinks.head()
 
 
Out[316]:
  country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent  x  y
0 Afghanistan 0 0 0 0.0 Asia 0 0.0
1 Albania 89 132 54 4.9 Europe 54000 4900.0
2 Algeria 25 0 14 0.7 Africa 14000 700.0
3 Andorra 245 138 312 12.4 Europe 312000 12400.0
4 Angola 217 57 45 5.9 Africa 45000 5900.0
In [317]:
 
 
 
 
 
# use a Python format string to specify a comma as the thousands separator
pd.set_option('display.float_format', '{:,}'.format)
drinks.head()
 
 
Out[317]:
  country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent  x  y
0 Afghanistan 0 0 0 0.0 Asia 0 0.0
1 Albania 89 132 54 4.9 Europe 54000 4,900.0
2 Algeria 25 0 14 0.7 Africa 14000 700.0
3 Andorra 245 138 312 12.4 Europe 312000 12,400.0
4 Angola 217 57 45 5.9 Africa 45000 5,900.0
In [318]:
 
 
 
 
 
# 'y' is affected (but not 'x') because the 'float_format' option only affects floats (not ints)
drinks.dtypes
 
 
Out[318]:
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
x                                 int64
y                               float64
dtype: object
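
Why only the float column picks up the separator can be sketched on a hypothetical one-row frame: the option feeds each float through the callable, while integer columns bypass it.

```python
import pandas as pd

df = pd.DataFrame({'x': [312000], 'y': [12400.0]})  # int64 vs float64

pd.set_option('display.float_format', '{:,}'.format)
rendered = repr(df)                                 # render with the option active
pd.reset_option('display.float_format')

print('12,400.0' in rendered)  # True  -- float formatted with commas
print('312,000' in rendered)   # False -- the int column is untouched
```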
In [319]:
 
 
 
 
 
# view option descriptions (including their default and current values)
pd.describe_option()
 
 
 
compute.use_bottleneck : bool
    Use the bottleneck library to accelerate if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]

compute.use_numexpr : bool
    Use the numexpr library to accelerate computation if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]

display.chop_threshold : float or None
    if set to a float value, all float values smaller then the given threshold
    will be displayed as exactly 0 by repr and friends.
    [default: None] [currently: None]

display.colheader_justify : 'left'/'right'
    Controls the justification of column headers. used by DataFrameFormatter.
    [default: right] [currently: right]

display.column_space No description available.
    [default: 12] [currently: 12]

display.date_dayfirst : boolean
    When True, prints and parses dates with the day first, eg 20/01/2005
    [default: False] [currently: False]

display.date_yearfirst : boolean
    When True, prints and parses dates with the year first, eg 2005/01/20
    [default: False] [currently: False]

display.encoding : str/unicode
    Defaults to the detected encoding of the console.
    Specifies the encoding to be used for strings returned by to_string,
    these are generally strings meant to be displayed on the console.
    [default: UTF-8] [currently: UTF-8]

display.expand_frame_repr : boolean
    Whether to print out the full DataFrame repr for wide DataFrames across
    multiple lines, `max_columns` is still respected, but the output will
    wrap-around across multiple "pages" if its width exceeds `display.width`.
    [default: True] [currently: True]

display.float_format : callable
    The callable should accept a floating point number and return
    a string with the desired format of the number. This is used
    in some places like SeriesFormatter.
    See formats.format.EngFormatter for an example.
    [default: None] [currently: <built-in method format of str object at 0x00000296F1B05880>]

display.height : int
    Deprecated.
    [default: 60] [currently: 60]

display.html.border : int
    A ``border=value`` attribute is inserted in the ``<table>`` tag
    for the DataFrame HTML repr.
    [default: 1] [currently: 1]

display.html.table_schema : boolean
    Whether to publish a Table Schema representation for frontends
    that support it.
    (default: False)
    [default: False] [currently: False]

display.large_repr : 'truncate'/'info'
    For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
    show a truncated table (the default from 0.13), or switch to the view from
    df.info() (the behaviour in earlier versions of pandas).
    [default: truncate] [currently: truncate]

display.latex.escape : bool
    This specifies if the to_latex method of a Dataframe uses escapes special
    characters.
    Valid values: False,True
    [default: True] [currently: True]

display.latex.longtable :bool
    This specifies if the to_latex method of a Dataframe uses the longtable
    format.
    Valid values: False,True
    [default: False] [currently: False]

display.latex.multicolumn : bool
    This specifies if the to_latex method of a Dataframe uses multicolumns
    to pretty-print MultiIndex columns.
    Valid values: False,True
    [default: True] [currently: True]

display.latex.multicolumn_format : bool
    This specifies if the to_latex method of a Dataframe uses multicolumns
    to pretty-print MultiIndex columns.
    Valid values: False,True
    [default: l] [currently: l]

display.latex.multirow : bool
    This specifies if the to_latex method of a Dataframe uses multirows
    to pretty-print MultiIndex rows.
    Valid values: False,True
    [default: False] [currently: False]

display.latex.repr : boolean
    Whether to produce a latex DataFrame representation for jupyter
    environments that support it.
    (default: False)
    [default: False] [currently: False]

display.line_width : int
    Deprecated.
    [default: 80] [currently: 80]

display.max_categories : int
    This sets the maximum number of categories pandas should output when
    printing out a `Categorical` or a Series of dtype "category".
    [default: 8] [currently: 8]

display.max_columns : int
    If max_cols is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the width of the terminal and print a truncated object which fits
    the screen width. The IPython notebook, IPython qtconsole, or IDLE
    do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 20] [currently: 20]

display.max_colwidth : int
    The maximum width in characters of a column in the repr of
    a pandas data structure. When the column overflows, a "..."
    placeholder is embedded in the output.
    [default: 50] [currently: 1000]

display.max_info_columns : int
    max_info_columns is used in DataFrame.info method to decide if
    per column information will be printed.
    [default: 100] [currently: 100]

display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions than
    specified.
    [default: 1690785] [currently: 1690785]

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 60]

display.max_seq_items : int or None
    when pretty-printing a long sequence, no more then `max_seq_items`
    will be printed. If items are omitted, they will be denoted by the
    addition of "..." to the resulting string.

    If set to None, the number of items to be printed is unlimited.
    [default: 100] [currently: 100]

display.memory_usage : bool, string or None
    This specifies if the memory usage of a DataFrame should be displayed when
    df.info() is called. Valid values True,False,'deep'
    [default: True] [currently: True]

display.multi_sparse : boolean
    "sparsify" MultiIndex display (don't display repeated
    elements in outer levels within groups)
    [default: True] [currently: True]

display.notebook_repr_html : boolean
    When True, IPython notebook will use html representation for
    pandas objects (if it is available).
    [default: True] [currently: True]

display.pprint_nest_depth : int
    Controls the number of nested levels to process when pretty-printing
    [default: 3] [currently: 3]

display.precision : int
    Floating point output precision (number of significant digits). This is
    only a suggestion
    [default: 6] [currently: 2]

display.show_dimensions : boolean or 'truncate'
    Whether to print out dimensions at the end of DataFrame repr.
    If 'truncate' is specified, only print out the dimensions if the
    frame is truncated (e.g. not display all rows and/or columns)
    [default: truncate] [currently: truncate]

display.unicode.ambiguous_as_wide : boolean
    Whether to use the Unicode East Asian Width to calculate the display text
    width.
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]

display.unicode.east_asian_width : boolean
    Whether to use the Unicode East Asian Width to calculate the display text
    width.
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]

display.width : int
    Width of the display in characters. In case python/IPython is running in
    a terminal this can be set to None and pandas will correctly auto-detect
    the width.
    Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
    terminal and hence it is not possible to correctly detect the width.
    [default: 80] [currently: 80]

html.border : int
    A ``border=value`` attribute is inserted in the ``<table>`` tag
    for the DataFrame HTML repr.
    [default: 1] [currently: 1]
    (Deprecated, use `display.html.border` instead.)

io.excel.xls.writer : string
    The default Excel writer engine for 'xls' files. Available options:
    auto, xlwt.
    [default: auto] [currently: auto]

io.excel.xlsm.writer : string
    The default Excel writer engine for 'xlsm' files. Available options:
    auto, openpyxl.
    [default: auto] [currently: auto]

io.excel.xlsx.writer : string
    The default Excel writer engine for 'xlsx' files. Available options:
    auto, openpyxl, xlsxwriter.
    [default: auto] [currently: auto]

io.hdf.default_format : format
    default format writing format, if None, then
    put will default to 'fixed' and append will default to 'table'
    [default: None] [currently: None]

io.hdf.dropna_table : boolean
    drop ALL nan rows when appending to a table
    [default: False] [currently: False]

io.parquet.engine : string
    The default parquet reader/writer engine. Available options:
    'auto', 'pyarrow', 'fastparquet', the default is 'auto'
    [default: auto] [currently: auto]

mode.chained_assignment : string
    Raise an exception, warn, or no action if trying to use chained assignment,
    The default is warn
    [default: warn] [currently: warn]

mode.sim_interactive : boolean
    Whether to simulate interactive mode for purposes of testing
    [default: False] [currently: False]

mode.use_inf_as_na : boolean
    True means treat None, NaN, INF, -INF as NA (old way),
    False means None and NaN are null, but INF, -INF are not NA
    (new way).
    [default: False] [currently: False]

mode.use_inf_as_null use_inf_as_null had been deprecated and will be removed in a future version.
Use `use_inf_as_na` instead.
    [default: False] [currently: False]
    (Deprecated, use `mode.use_inf_as_na` instead.)

plotting.matplotlib.register_converters : bool
    Whether to register converters with matplotlib's units registry for
    dates, times, datetimes, and Periods. Toggling to False will remove
    the converters, restoring any converters that pandas overwrote.
    [default: True] [currently: True]


In [320]:
 
 
 
 
 
# search for specific options by name
pd.describe_option('rows')
 
 
 
display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions than
    specified.
    [default: 1690785] [currently: 1690785]

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 60]


 

Documentation for describe_option
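Individual options can also be read, changed, and reset by name rather than all at once; a brief sketch with get_option, set_option, and a targeted reset_option:

```python
import pandas as pd

# read the current value of an option by its full name
default_rows = pd.get_option('display.max_rows')

# change it, then reset only this one option (instead of 'all')
pd.set_option('display.max_rows', 10)
changed = pd.get_option('display.max_rows')

pd.reset_option('display.max_rows')
restored = pd.get_option('display.max_rows')
```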

In [321]:
 
 
 
 
 
# reset all of the options to their default values
pd.reset_option('all')
 
 
 
html.border has been deprecated, use display.html.border instead
(currently both are identical)


use_inf_as_null had been deprecated and will be removed in a future version.
Use `use_inf_as_na` instead.

 

29. How do I create a pandas DataFrame from another object? (video)

In [322]:
 
 
 
 
 
# create a DataFrame from a dictionary (keys become column names, values become data)
pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']})
 
 
Out[322]:
 colorid
0 red 100
1 blue 101
2 red 102
In [323]:
 
 
 
 
 
 
# specify the order of the columns and define the index
df = pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']}, columns=['id', 'color'], index=['a', 'b', 'c'])
df
 
 
Out[323]:
 idcolor
a 100 red
b 101 blue
c 102 red
 

Documentation for DataFrame
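A related constructor worth knowing: `DataFrame.from_dict` can treat each dictionary entry as a row instead of a column via `orient='index'`. A sketch reusing the id/color data above (the `columns` parameter with `orient='index'` assumes a reasonably recent pandas version):

```python
import pandas as pd

# keys become index labels and each value list becomes a row
df = pd.DataFrame.from_dict(
    {'a': [100, 'red'], 'b': [101, 'blue'], 'c': [102, 'red']},
    orient='index', columns=['id', 'color'])
```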

In [324]:
 
 
 
 
 
# create a DataFrame from a list of lists (each inner list becomes a row)
pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id', 'color'])
 
 
Out[324]:
 idcolor
0 100 red
1 101 blue
2 102 red
In [325]:
 
 
 
 
 
# create a NumPy array (with shape 4 by 2) and fill it with random numbers between 0 and 1
import numpy as np
arr = np.random.rand(4, 2)
arr
 
 
Out[325]:
array([[0.75703155, 0.7684219 ],
       [0.06043145, 0.1511115 ],
       [0.5908619 , 0.02133966],
       [0.78627718, 0.98944222]])
In [326]:
 
 
 
 
 
# create a DataFrame from the NumPy array
pd.DataFrame(arr, columns=['one', 'two'])
 
 
Out[326]:
 onetwo
0 0.757032 0.768422
1 0.060431 0.151112
2 0.590862 0.021340
3 0.786277 0.989442
In [327]:
 
 
 
 
 
# create a DataFrame of student IDs (100 through 109) and test scores (random integers between 60 and 100)
pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)})
 
 
Out[327]:
 studenttest
0 100 61
1 101 84
2 102 100
3 103 75
4 104 97
5 105 73
6 106 67
7 107 82
8 108 76
9 109 69
 

Documentation for np.arange and np.random

In [328]:
 
 
 
 
 
 
# 'set_index' can be chained with the DataFrame constructor to select the index
pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)}).set_index('student')
 
 
Out[328]:
 test
student 
100 97
101 94
102 82
103 76
104 72
105 86
106 70
107 84
108 84
109 100
 

Documentation for set_index

In [329]:
 
 
 
 
 
# create a new Series using the Series constructor
s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')
s
 
 
Out[329]:
c     round
b    square
Name: shape, dtype: object
 

Documentation for Series

In [330]:
 
 
 
 
 
# concatenate the DataFrame and the Series (use axis=1 to concatenate columns)
pd.concat([df, s], axis=1)
 
 
Out[330]:
 idcolorshape
a 100 red NaN
b 101 blue square
c 102 red round
 

Notes:

  • The Series name became the column name in the DataFrame.
  • The Series data was aligned to the DataFrame by its index.
  • The 'shape' for row 'a' was marked as a missing value (NaN) because that index was not present in the Series.

Documentation for concat


 

30. How do I apply a function to a pandas Series or DataFrame? (video)

In [332]:
 
 
 
 
 
# read the training dataset from Kaggle's Titanic competition into a DataFrame
train = pd.read_csv(url5)
train.head()
 
 
Out[332]:
 PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
 

Goal: Map the existing values of a Series to a different set of values

Method: map (Series method)

In [334]:
 
 
 
 
 
# map 'female' to 0 and 'male' to 1
train['Sex_num'] = train.Sex.map({'female':0, 'male':1})
train.loc[0:4, ['Sex', 'Sex_num']]
 
 
Out[334]:
 SexSex_num
0 male 1
1 female 0
2 female 0
3 female 0
4 male 1
 

Goal: Apply a function to each element in a Series

Method: apply (Series method)

Note: map can be substituted for apply in many cases, but apply is more flexible and thus is recommended
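One behavioral difference worth noting: when given a dict, `map` fills in NaN for any value that has no dict entry instead of raising an error. A sketch with a hypothetical Series containing an unmapped value:

```python
import pandas as pd

# hypothetical Series containing a value that is not in the mapping
s = pd.Series(['female', 'male', 'unknown'])

# 'unknown' has no dict entry, so map fills in NaN instead of raising
mapped = s.map({'female': 0, 'male': 1})
```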

In [336]:
 
 
 
 
 
# calculate the length of each string in the 'Name' Series
train['Name_length'] = train.Name.apply(len)
train.loc[0:4, ['Name', 'Name_length']]
 
 
Out[336]:
 NameName_length
0 Braund, Mr. Owen Harris 23
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 51
2 Heikkinen, Miss. Laina 22
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 44
4 Allen, Mr. William Henry 24
In [337]:
 
 
 
 
 
# round up each element in the 'Fare' Series to the next integer
import numpy as np
train['Fare_ceil'] = train.Fare.apply(np.ceil)
train.loc[0:4, ['Fare', 'Fare_ceil']]
 
 
Out[337]:
 FareFare_ceil
0 7.2500 8.0
1 71.2833 72.0
2 7.9250 8.0
3 53.1000 54.0
4 8.0500 9.0
In [338]:
 
 
 
 
 
# we want to extract the last name of each person
train.Name.head()
 
 
Out[338]:
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
In [339]:
 
 
 
 
 
# use a string method to split the 'Name' Series at commas (returns a Series of lists)
train.Name.str.split(',').head()
 
 
Out[339]:
0                           [Braund,  Mr. Owen Harris]
1    [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                            [Heikkinen,  Miss. Laina]
3      [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                          [Allen,  Mr. William Henry]
Name: Name, dtype: object
In [340]:
 
 
 
 
 
# define a function that returns an element from a list based on its position
def get_element(my_list, position):
    return my_list[position]
 
 
In [341]:
 
 
 
 
 
# apply the 'get_element' function and pass 'position' as a keyword argument
train.Name.str.split(',').apply(get_element, position=0).head()
 
 
Out[341]:
0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object
In [342]:
 
 
 
 
 
# alternatively, use a lambda function
train.Name.str.split(',').apply(lambda x: x[0]).head()
 
 
Out[342]:
0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object
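For this particular task, the `.str` accessor can also index into the split lists directly, avoiding `apply` altogether. A sketch on two made-up names in the same "last name, title first name" format:

```python
import pandas as pd

# two made-up names in the same "last name, title first name" format
names = pd.Series(['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley'])

# .str[0] indexes element-wise into the lists produced by split
last_names = names.str.split(',').str[0]
```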
 

Goal: Apply a function along either axis of a DataFrame

Method: apply (DataFrame method)

In [344]:
 
 
 
 
 
drinks = pd.read_csv(url7)
drinks.head()
 
 
Out[344]:
 countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [345]:
 
 
 
 
 
# select a subset of the DataFrame to work with
drinks.loc[:, 'beer_servings':'wine_servings'].head()
 
 
Out[345]:
 beer_servingsspirit_servingswine_servings
0 0 0 0
1 89 132 54
2 25 0 14
3 245 138 312
4 217 57 45
In [346]:
 
 
 
 
 
# apply the 'max' function along axis 0 to calculate the maximum value of each column
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)
 
 
Out[346]:
beer_servings      376
spirit_servings    438
wine_servings      370
dtype: int64
In [347]:
 
 
 
 
 
# apply the 'max' function along axis 1 to calculate the maximum value of each row
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1).head()
 
 
Out[347]:
0      0
1    132
2     25
3    312
4    217
dtype: int64
In [348]:
 
 
 
 
 
# use 'np.argmax' to calculate which column has the maximum value for each row
drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1).head()
 
 
 
c:\users\qq123\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:52: FutureWarning: 'argmax' is deprecated. Use 'idxmax' instead. The behavior of 'argmax' will be corrected to return the positional maximum in the future. Use 'series.values.argmax' to get the position of the maximum now.
  return getattr(obj, method)(*args, **kwds)
Out[348]:
0      beer_servings
1    spirit_servings
2      beer_servings
3      wine_servings
4      beer_servings
dtype: object
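As the FutureWarning above suggests, `idxmax` is the replacement for `argmax` here; the DataFrame method `idxmax(axis=1)` returns the column label of each row's maximum directly, with no `apply` needed. A sketch on a few rows modeled on the drinks data:

```python
import pandas as pd

# a few rows modeled on the drinks data
df = pd.DataFrame({'beer_servings': [0, 89, 25],
                   'spirit_servings': [0, 132, 0],
                   'wine_servings': [0, 54, 14]})

# column label of each row's maximum (ties resolve to the first column)
row_max_col = df.idxmax(axis=1)
```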
 

Goal: Apply a function to every element in a DataFrame

Method: applymap (DataFrame method)

In [349]:
 
 
 
 
 
# convert every DataFrame element into a float
drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float).head()
 
 
Out[349]:
 beer_servingsspirit_servingswine_servings
0 0.0 0.0 0.0
1 89.0 132.0 54.0
2 25.0 0.0 14.0
3 245.0 138.0 312.0
4 217.0 57.0 45.0
In [350]:
 
 
 
 
 
 
# overwrite the existing DataFrame columns
drinks.loc[:, 'beer_servings':'wine_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)
drinks.head()
 
 
Out[350]:
 countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0 Afghanistan 0.0 0.0 0.0 0.0 Asia
1 Albania 89.0 132.0 54.0 4.9 Europe
2 Algeria 25.0 0.0 14.0 0.7 Africa
3 Andorra 245.0 138.0 312.0 12.4 Europe
4 Angola 217.0 57.0 45.0 5.9 Africa
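For a plain type conversion like this one, `astype` is the more idiomatic (and typically faster) choice than `applymap(float)`, since it converts whole columns at once rather than looping over elements. A sketch on a couple of rows:

```python
import pandas as pd

df = pd.DataFrame({'beer_servings': [0, 89], 'wine_servings': [0, 54]})

# astype converts all values at once, without a Python-level loop
floats = df.astype(float)
```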
posted @ 2019-04-06 12:40  对抗拖延症的二傻子