python pandas练习:数据的过滤和排序(2)
Ex2 - Filtering and Sorting Data
EX2 - 数据的过滤与排序
This time we are going to pull data directly from the internet.
【这次不想搞在线读取的了,直接把数据下载下来用。】
Step 1. Import the necessary libraries
【第一步,导入必要的库】
import pandas as pd
Step 2. Import the dataset from this address.
【从这个地址导入数据集。(或者下载下来用)。】
这回的数据好像是一次射击比赛的数据。
Step 3. Assign it to a variable called euro12.
【第三步,将数据集赋值给变量euro12.】
sep=','加不加都行,CSV本来就是逗号分隔值文件格式。
# euro12 = pd.read_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv', sep=',')
euro12 = pd.read_csv('Euro_2012_stats_TEAM.csv', sep=',') # 同一目录下直接用。
euro12
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 | 9 | 9 | 16 |
1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 | 11 | 11 | 19 |
2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 | 7 | 7 | 15 |
3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 | 11 | 11 | 16 |
4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 | 11 | 11 | 19 |
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 | 12 | 12 | 20 |
7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 | 2 | 0 | 0 | ... | 20 | 74.1% | 101 | 89 | 16 | 16 | 0 | 18 | 18 | 19 |
8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 | 2 | 0 | 0 | ... | 12 | 70.6% | 35 | 30 | 3 | 5 | 0 | 7 | 7 | 15 |
9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 | 0 | 0 | 0 | ... | 6 | 66.7% | 48 | 56 | 3 | 7 | 1 | 7 | 7 | 17 |
10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 | 6 | 0 | 0 | ... | 10 | 71.5% | 73 | 90 | 10 | 12 | 0 | 14 | 14 | 16 |
11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 | 0 | 0 | 0 | ... | 17 | 65.4% | 43 | 51 | 11 | 6 | 1 | 10 | 10 | 17 |
12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 | 2 | 0 | 0 | ... | 10 | 77.0% | 34 | 43 | 4 | 6 | 0 | 7 | 7 | 16 |
13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 | 17 | 17 | 18 |
14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 | 3 | 0 | 0 | ... | 8 | 61.6% | 35 | 51 | 7 | 7 | 0 | 9 | 9 | 18 |
15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 | 0 | 0 | 0 | ... | 13 | 76.5% | 48 | 31 | 4 | 5 | 0 | 9 | 9 | 18 |
16 rows × 35 columns
Step 4. Select only the Goal column.
【第四步,只筛选[Goals]这一列。】
euro12.Goals
# 或者
euro12['Goals']
0 4
1 4
2 4
3 5
4 3
5 10
6 5
7 6
8 2
9 2
10 6
11 1
12 5
13 12
14 5
15 2
Name: Goals, dtype: int64
Step 5. How many team participated in the Euro2012?
【第五步,有多少支队伍参加了Euro2012?】
表里每一行就是一个队伍的数据,有几行就有几个队伍。
使用shape可以看出数据集的行与列信息,shape[0]表示只显示行。
euro12.shape[0]
16
Step 6. What is the number of columns in the dataset?
【第六步,这个数据集有多少列?】
使用info()或者shape都可以。
euro12.info()
# 或者
euro12.shape[1]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Team 16 non-null object
1 Goals 16 non-null int64
2 Shots on target 16 non-null int64
3 Shots off target 16 non-null int64
4 Shooting Accuracy 16 non-null object
5 % Goals-to-shots 16 non-null object
6 Total shots (inc. Blocked) 16 non-null int64
7 Hit Woodwork 16 non-null int64
8 Penalty goals 16 non-null int64
9 Penalties not scored 16 non-null int64
10 Headed goals 16 non-null int64
11 Passes 16 non-null int64
12 Passes completed 16 non-null int64
13 Passing Accuracy 16 non-null object
14 Touches 16 non-null int64
15 Crosses 16 non-null int64
16 Dribbles 16 non-null int64
17 Corners Taken 16 non-null int64
18 Tackles 16 non-null int64
19 Clearances 16 non-null int64
20 Interceptions 16 non-null int64
21 Clearances off line 15 non-null float64
22 Clean Sheets 16 non-null int64
23 Blocks 16 non-null int64
24 Goals conceded 16 non-null int64
25 Saves made 16 non-null int64
26 Saves-to-shots ratio 16 non-null object
27 Fouls Won 16 non-null int64
28 Fouls Conceded 16 non-null int64
29 Offsides 16 non-null int64
30 Yellow Cards 16 non-null int64
31 Red Cards 16 non-null int64
32 Subs on 16 non-null int64
33 Subs off 16 non-null int64
34 Players Used 16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
35
Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline
【第七步,只看“Team”、“Yellow Cards”、“Red Cards”这三列,并将这三列赋值到一个新的dataframe中去,命名为discipline】
# filter only giving the column names
# 只筛选出给出的列名。
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline
Team | Yellow Cards | Red Cards | |
---|---|---|---|
0 | Croatia | 9 | 0 |
1 | Czech Republic | 7 | 0 |
2 | Denmark | 4 | 0 |
3 | England | 5 | 0 |
4 | France | 6 | 0 |
5 | Germany | 4 | 0 |
6 | Greece | 9 | 1 |
7 | Italy | 16 | 0 |
8 | Netherlands | 5 | 0 |
9 | Poland | 7 | 1 |
10 | Portugal | 12 | 0 |
11 | Republic of Ireland | 6 | 1 |
12 | Russia | 6 | 0 |
13 | Spain | 11 | 0 |
14 | Sweden | 7 | 0 |
15 | Ukraine | 5 | 0 |
Step 8. Sort the teams by Red Cards, then to Yellow Cards
【第八步,将队伍按照“Red Cards”列排序,再按“Yellow Cards”排序】
# 默认是升序排序,设置ascending = False后变为降序排序。
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending = False)
Team | Yellow Cards | Red Cards | |
---|---|---|---|
6 | Greece | 9 | 1 |
9 | Poland | 7 | 1 |
11 | Republic of Ireland | 6 | 1 |
7 | Italy | 16 | 0 |
10 | Portugal | 12 | 0 |
13 | Spain | 11 | 0 |
0 | Croatia | 9 | 0 |
1 | Czech Republic | 7 | 0 |
14 | Sweden | 7 | 0 |
4 | France | 6 | 0 |
12 | Russia | 6 | 0 |
3 | England | 5 | 0 |
8 | Netherlands | 5 | 0 |
15 | Ukraine | 5 | 0 |
2 | Denmark | 4 | 0 |
5 | Germany | 4 | 0 |
Step 9. Calculate the mean Yellow Cards given per Team
【第九步,计算每个队伍的平均黄牌数。】
# 使用mean()函数,计算该列的平均值。
round(discipline['Yellow Cards'].mean())
7
Step 10. Filter teams that scored more than 6 goals
【第十步,过滤出得分超过6的队伍】
# 直接中括号里写条件判断,或者使用query()函数都可以。
euro12[euro12.Goals > 6]
# 或者
euro12.query('Goals > 6')
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 | 17 | 17 | 18 |
2 rows × 35 columns
Step 11. Select the teams that start with G
【第11步,筛选出名字以G开头的队伍。】
分3步,
第一步:筛选出Team列。
第二步:将其转换成字符串。
第三步:使用startswith()函数,找到所有G开头的队伍。
# 使用euro12.Team.str.startswith('G')的返回值是布尔类型,这里将所有返回值为true的行筛选出来。
euro12[euro12.Team.str.startswith('G')]
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 | 12 | 12 | 20 |
2 rows × 35 columns
Step 12. Select the first 7 columns
【第十二步,筛选出前七列。】
# use .iloc to slices via the position of the passed integers
# 使用.iloc,通过传递整数的位置来进行切片。
# 有两个参数,前面的表示选取的行,后面的表示选取的列。
# : means all, 0:7 means from 0 to 7
# 单独一个:代表选取所有行,0:7代表选取从0到7列。
euro12.iloc[: , 0:7]
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | |
---|---|---|---|---|---|---|---|
0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 |
1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 |
2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 |
3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 |
4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 |
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 |
6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 |
7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 |
8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 |
9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 |
10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 |
11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 |
12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 |
13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 |
14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 |
15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 |
Step 13. Select all columns except the last 3.
【第13步,选出除了最后三列的其他所有列。】
# use negative to exclude the last 3 columns
# 使用负数来排除掉最后三列。
euro12.iloc[: , :-3]
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Clean Sheets | Blocks | Goals conceded | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 0 | 10 | 3 | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 |
1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 1 | 10 | 6 | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 |
2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 1 | 10 | 5 | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 |
3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 2 | 29 | 3 | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 |
4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 1 | 7 | 5 | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 |
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 1 | 11 | 6 | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 |
6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 1 | 23 | 7 | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 |
7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 | 2 | 0 | 0 | ... | 2 | 18 | 7 | 20 | 74.1% | 101 | 89 | 16 | 16 | 0 |
8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 | 2 | 0 | 0 | ... | 0 | 9 | 5 | 12 | 70.6% | 35 | 30 | 3 | 5 | 0 |
9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 | 0 | 0 | 0 | ... | 0 | 8 | 3 | 6 | 66.7% | 48 | 56 | 3 | 7 | 1 |
10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 | 6 | 0 | 0 | ... | 2 | 11 | 4 | 10 | 71.5% | 73 | 90 | 10 | 12 | 0 |
11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 | 0 | 0 | 0 | ... | 0 | 23 | 9 | 17 | 65.4% | 43 | 51 | 11 | 6 | 1 |
12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 | 2 | 0 | 0 | ... | 0 | 8 | 3 | 10 | 77.0% | 34 | 43 | 4 | 6 | 0 |
13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 5 | 8 | 1 | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 |
14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 | 3 | 0 | 0 | ... | 1 | 12 | 5 | 8 | 61.6% | 35 | 51 | 7 | 7 | 0 |
15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 | 0 | 0 | 0 | ... | 0 | 4 | 4 | 13 | 76.5% | 48 | 31 | 4 | 5 | 0 |
16 rows × 32 columns
Step 14. Present only the Shooting Accuracy from England, Italy and Russia
【第十四步,只展示英国、意大利和俄罗斯的射击精度(Shooting Accuracy)】
# .loc is another way to slice, using the labels of the columns and indexes
# .loc是另一种切片方法,使用的是列或者索引的标签。
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
Team | Shooting Accuracy | |
---|---|---|
3 | England | 50.0% |
7 | Italy | 43.0% |
12 | Russia | 22.5% |