Python pandas Exercises (1)

Reference project: https://github.com/guipsamora/pandas_exercises

Ex3 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the dataset from this address: https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user

Step 3. Assign it to a variable called users and use the 'user_id' as index

Parameters:

sep: sets the field separator; here the fields are separated by '|'.

index_col: uses one column of the dataset as the index; here it is the 'user_id' column.

users = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                      sep='|', index_col='user_id')
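To see what sep and index_col do without a network connection, here is a minimal sketch that reads a few rows copied from the Step 4 output. It assumes the file's header line is user_id|age|gender|occupation|zip_code, and the names raw and sample are only for illustration.

```python
import io
import pandas as pd

# A few rows in the same '|'-separated layout as the dataset (header line assumed).
raw = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
"""

# io.StringIO stands in for the URL so the example runs offline.
sample = pd.read_csv(io.StringIO(raw), sep="|", index_col="user_id")

print(sample.index.name)     # user_id
print(list(sample.columns))  # ['age', 'gender', 'occupation', 'zip_code']
```

Because user_id became the index, it no longer appears among the columns.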

Step 4. See the first 25 entries

Use the DataFrame's head() method; its argument is the number of rows to show.

users.head(25)

	age	gender	occupation	zip_code
user_id
1	24	M	technician	85711
2	53	F	other	94043
3	23	M	writer	32067
4	24	M	technician	43537
5	33	F	other	15213
6	42	M	executive	98101
7	57	M	administrator	91344
8	36	M	administrator	05201
9	29	M	student	01002
10	53	M	lawyer	90703
11	39	F	other	30329
12	28	F	other	06405
13	47	M	educator	29206
14	45	M	scientist	55106
15	49	F	educator	97301
16	21	M	entertainment	10309
17	30	M	programmer	06355
18	35	F	other	37212
19	40	M	librarian	02138
20	42	F	homemaker	95660
21	26	M	writer	30068
22	25	M	writer	40206
23	30	F	artist	48197
24	21	F	artist	94533
25	39	M	engineer	55107

Step 5. See the last 10 entries

Use the DataFrame's tail() method, which works like head() but counts rows from the end.

users.tail(10)

	age	gender	occupation	zip_code
user_id
934	61	M	engineer	22902
935	42	M	doctor	66221
936	24	M	other	32789
937	48	M	educator	98072
938	38	F	technician	55038
939	26	F	student	33319
940	32	M	administrator	02215
941	20	M	student	97229
942	48	F	librarian	78209
943	22	M	student	77841
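The behaviour of head() and tail() is easy to check on a small frame. The sketch below uses a hypothetical 8-row DataFrame (df and its value column are made up for illustration) and also shows the default of 5 rows when no argument is given.

```python
import pandas as pd

# Hypothetical toy frame with 8 rows, indexed 1..8.
df = pd.DataFrame({"value": range(10, 18)}, index=range(1, 9))

first_three = df.head(3)   # rows with index 1, 2, 3
last_two = df.tail(2)      # rows with index 7, 8
default_head = df.head()   # no argument: the first 5 rows

print(list(first_three.index))  # [1, 2, 3]
print(list(last_two.index))     # [7, 8]
print(len(default_head))        # 5
```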

Step 6. What is the number of observations in the dataset?

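The DataFrame's shape attribute (not a method, so no parentheses) holds the number of rows and columns as a tuple.

shape[0] is the number of rows, and shape[1] is the number of columns.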

users.shape[0]

Step 7. What is the number of columns in the dataset?

Use the shape attribute again; shape[1] is the number of columns.

users.shape[1]
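As a quick illustration of shape, the following sketch unpacks the tuple on a hypothetical 3-row, 2-column frame and compares it with len(), which is another common way to count rows.

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns.
df = pd.DataFrame({"age": [24, 53, 23], "gender": ["M", "F", "M"]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

print(n_rows, n_cols)   # 3 2
print(len(df))          # 3, same as shape[0]
print(len(df.columns))  # 2, same as shape[1]
```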

Step 8. Print the name of all the columns.

Use the DataFrame's columns attribute to list the column labels.

users.columns

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

Step 9. How is the dataset indexed?

Use the DataFrame's index attribute to inspect the index.

# "the index" (aka "the labels")
users.index
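Both columns and index are Index objects, and they convert to plain Python lists when needed. A small sketch on a hypothetical two-row frame with a named index:

```python
import pandas as pd

# Hypothetical toy frame whose index is named 'user_id', as in the exercise.
df = pd.DataFrame(
    {"age": [24, 53], "gender": ["M", "F"]},
    index=pd.Index([1, 2], name="user_id"),
)

column_names = df.columns.tolist()  # column labels as a list
index_labels = df.index.tolist()    # row labels as a list
index_name = df.index.name          # the name given to the index

print(column_names)  # ['age', 'gender']
print(index_labels)  # [1, 2]
print(index_name)    # user_id
```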

Step 10. What is the data type of each column?

Use the DataFrame's dtypes attribute to see the data type of every column.

users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object
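The zip_code column comes out as object (text) in the full dataset, which keeps values such as 05201 intact. On a smaller, purely numeric sample, pandas would infer integers and drop the leading zeros. The sketch below, which assumes the same '|'-separated layout and header line, shows that effect and one way to force text with the dtype parameter of read_csv.

```python
import io
import pandas as pd

# Two rows from the Step 4 output whose zip codes start with 0 (header line assumed).
raw = """user_id|age|gender|occupation|zip_code
8|36|M|administrator|05201
9|29|M|student|01002
"""

# Let pandas infer the type: the all-digit column is read as integers.
inferred = pd.read_csv(io.StringIO(raw), sep="|", index_col="user_id")

# Ask for text explicitly so the leading zeros survive.
as_text = pd.read_csv(
    io.StringIO(raw), sep="|", index_col="user_id", dtype={"zip_code": str}
)

print(inferred["zip_code"].tolist())  # [5201, 1002]
print(as_text["zip_code"].tolist())   # ['05201', '01002']
```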

Step 11. Print only the occupation column

users.occupation

#or

users['occupation']

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object
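The two spellings return the same Series, but they are not always interchangeable. The sketch below uses a hypothetical frame with a column named 'zip code' (with a space) to show a case where only bracket access works, and shows that double brackets return a DataFrame instead of a Series.

```python
import pandas as pd

# Hypothetical toy frame; the 'zip code' column name is made up to show the limitation.
df = pd.DataFrame(
    {"occupation": ["technician", "other"], "zip code": ["85711", "94043"]}
)

as_attr = df.occupation      # attribute access
as_key = df["occupation"]    # bracket access, same result
print(as_attr.equals(as_key))  # True

with_space = df["zip code"]  # names with spaces need brackets
as_frame = df[["occupation"]]  # a list of names returns a DataFrame

print(type(as_key).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
```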

Step 12. How many different occupations are in this dataset?

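Use the nunique() method on the column (a Series) to get its number of distinct values.

Alternatively, value_counts() returns how often each distinct value occurs, and calling count() on that result gives the number of distinct values.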

users.occupation.nunique()
#or by using value_counts() which returns the count of unique elements
#users.occupation.value_counts().count()
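Both approaches skip missing values by default, which matters if a column has gaps. A small sketch on a hypothetical Series with one missing entry:

```python
import pandas as pd

# Hypothetical occupations with one missing value.
occupation = pd.Series(
    ["technician", "other", "writer", "technician", "other", None]
)

n_unique = occupation.nunique()                    # missing value ignored
n_via_counts = occupation.value_counts().count()   # same result, two steps
n_with_missing = occupation.nunique(dropna=False)  # missing value counted as one more

print(n_unique)        # 3
print(n_via_counts)    # 3
print(n_with_missing)  # 4
```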

Step 13. What is the most frequent occupation?

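value_counts() sorts by frequency in descending order, so value_counts().head(1) returns a one-element Series whose label is the most frequent occupation and whose value is its count.

Indexing the result with index[0] then returns only the name.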

#Because "most" is asked
users.occupation.value_counts().head(1).index[0]

#or
#to have the top 5

# users.occupation.value_counts().head()

'student'
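There are other ways to reach the same answer. The sketch below, on a hypothetical Series of occupations, compares the approach above with idxmax() on the counts and with mode(). These are alternatives for illustration, not the exercise's official solution.

```python
import pandas as pd

# Hypothetical occupations; 'student' appears most often.
occupation = pd.Series(
    ["student", "other", "student", "writer", "other", "student"]
)

counts = occupation.value_counts()

top_via_head = counts.head(1).index[0]  # the approach used in this step
top_via_idxmax = counts.idxmax()        # label of the largest count
top_via_mode = occupation.mode().iloc[0]  # mode() returns a Series of the most common value(s)

print(top_via_head, top_via_idxmax, top_via_mode)  # student student student
```

If several values tie for first place, mode() returns all of them, while the other two approaches return only one.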

Step 14. Summarize the DataFrame.

Use the DataFrame's describe() method, which returns descriptive statistics for the DataFrame.

By default it only covers the numeric columns.

users.describe() #Notice: by default, only the numeric columns are returned.

	age
count	943.000000
mean	34.051962
std	12.192740
min	7.000000
25%	25.000000
50%	31.000000
75%	43.000000
max	73.000000

Step 15. Summarize all the columns

Pass include="all" to describe().

users.describe(include = "all") #Notice: include="all" adds the non-numeric columns to the summary.

	age	gender	occupation	zip_code
count	943.000000	943	943	943
unique	NaN	2	21	795
top	NaN	M	student	55414
freq	NaN	670	196	9
mean	34.051962	NaN	NaN	NaN
std	12.192740	NaN	NaN	NaN
min	7.000000	NaN	NaN	NaN
25%	25.000000	NaN	NaN	NaN
50%	31.000000	NaN	NaN	NaN
75%	43.000000	NaN	NaN	NaN
max	73.000000	NaN	NaN	NaN
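The include and exclude parameters of describe() select which column types are summarized. The sketch below uses a hypothetical 4-row frame to compare the default, include="all", and a summary of only the non-numeric columns via exclude.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one text column.
df = pd.DataFrame(
    {
        "age": [24, 53, 23, 24],
        "occupation": ["technician", "other", "writer", "technician"],
    }
)

numeric_only = df.describe()                  # default: numeric columns only
everything = df.describe(include="all")       # every column, NaN where a statistic does not apply
text_only = df.describe(exclude=[np.number])  # only the non-numeric columns

print(list(numeric_only.columns))  # ['age']
print(list(everything.columns))    # ['age', 'occupation']
print(list(text_only.index))       # ['count', 'unique', 'top', 'freq']
```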

Step 16. Summarize only the occupation column

users.occupation.describe()

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object

Step 17. What is the mean age of users?

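Use the mean() method to get the average of the column.

The built-in round() function rounds a floating-point number; with a single argument it rounds to the nearest integer.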

round(users.age.mean())
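A short sketch of how round() behaves, using a hypothetical list of five ages. The mean is converted to a plain Python float first so the result does not depend on NumPy's rounding behaviour, and the last lines show that Python rounds exact halves to the nearest even integer.

```python
import pandas as pd

# Hypothetical ages: the sum is 157, so the mean is 31.4.
age = pd.Series([24, 53, 23, 24, 33])

mean_age = float(age.mean())

print(round(mean_age))     # 31   (one argument: nearest integer)
print(round(mean_age, 1))  # 31.4 (second argument: number of decimals)

# Exact halves go to the nearest even integer in Python 3.
print(round(2.5))  # 2
print(round(3.5))  # 4
```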

Step 18. What is the age with least occurrence?

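First call value_counts() on the age column to get how often each age occurs. The result is sorted from most to least frequent, so tail() (last 5 rows by default) shows the rarest ages.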

users.age.value_counts().tail() #7, 10, 11, 66 and 73 years -> only 1 occurrence

7     1
66    1
11    1
10    1
73    1
Name: age, dtype: int64
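tail() always returns a fixed number of rows, so it can cut off ties or include ages that are not actually the rarest. One possible alternative, sketched here on a hypothetical Series of ages, is to keep every age whose count equals the minimum count.

```python
import pandas as pd

# Hypothetical ages: 24 occurs three times, 33 twice, 53 and 23 once each.
age = pd.Series([24, 53, 23, 24, 33, 33, 24])

counts = age.value_counts()

# Keep all ages that share the smallest count, however many there are.
least = counts[counts == counts.min()]

print(sorted(least.index.tolist()))  # [23, 53]
print(int(counts.min()))             # 1
```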

Posted 2022-04-25 17:13 by 鑫xin哥
