Pandas Numpy
我用python3.9 学的
安装
pip3.9 install numpy
pip3.9 install pandas
pip3.9 install openpyxl
refer to : https://www.cnblogs.com/wobujiaonaoxin/articles/11386046.html
一、生成数据表
1、首先导入pandas库,一般都会用到numpy库,所以我们先导入备用:
2、导入CSV或者xlsx文件:
3、用pandas创建数据表:
0 1001 2018-01-01 beijing 22 100-A 1000.0
1 1001 2018-01-02 shagnhai 45 101-B NaN
2 1003 2018-01-03 guangzhou 56 102-C 3000.0
3 1004 2018-01-04 chengdu 33 103-D 4000.0
4 1005 2018-01-05 wuhan 24 104-E NaN
5 1006 2018-01-06 qingdao 43 105-F 6000.0
二、数据表信息查看
1、维度查看
2、数据表基本信息(维度、列名称、数据格式、所占空间等):
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
id 6 non-null int64
date 6 non-null datetime64[ns]
city 6 non-null object
age 6 non-null int64
category 6 non-null object
price 4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 368.0+ bytes
None
3、每一列数据的格式:
id int64
date datetime64[ns]
city object
age int64
category object
price float64
dtype: object
4、某一列格式:
5、空值:
0 False False False False False False
1 False False False False False True
2 False False False False False False
3 False False False False False False
4 False False False False False True
5 False False False False False False
6、查看某一列空值:
1 False
2 False
3 False
4 False
5 False
Name: age, dtype: bool
7、查看某一列的唯一值:
1 1001
2 1003
3 1004
4 1005
5 1006
Name: id, dtype: int64>
8、查看数据表的值:
[1001 Timestamp('2018-01-02 00:00:00') 'shagnhai' 45 '101-B' nan]
[1003 Timestamp('2018-01-03 00:00:00') 'guangzhou' 56 '102-C' 3000.0]
[1004 Timestamp('2018-01-04 00:00:00') 'chengdu' 33 '103-D' 4000.0]
[1005 Timestamp('2018-01-05 00:00:00') 'wuhan' 24 '104-E' nan]
[1006 Timestamp('2018-01-06 00:00:00') 'qingdao' 43 '105-F' 6000.0]]
9、查看列名称:
10、查看前10行数据、后10行数据:
id date city age category price
0 1000 2018-01-01 beijing 22 100-A 1000
显示倒数第一行:
id date city age category price
5 1006 2018-01-06 qingdao 43 105-F 6000
三、数据表清洗
1、用数字0填充空值:
id date city age category price
0 1001 2018-01-01 beijing 22 100-A 1000.0
1 1001 2018-01-02 shagnhai 45 101-B 0.0
2 1003 2018-01-03 guangzhou 56 102-C 3000.0
3 1004 2018-01-04 chengdu 33 103-D 4000.0
4 1005 2018-01-05 wuhan 24 104-E 0.0
5 1006 2018-01-06 qingdao 43 105-F 6000.0
2、使用列price的均值对NA进行填充:
0 1000.0
1 3500.0
2 3000.0
3 4000.0
4 3500.0
5 6000.0
3、清除city字段的字符空格:
0 beijing
1 shagnhai
2 guangzhou
3 chengdu
4 wuhan
5 qingdao
Name: city, dtype: object
4、大小写转换:
0 beijing
1 shagnhai
2 guangzhou
3 chengdu
4 wuhan
5 qingdao
Name: city, dtype: object
5、更改数据格式:
0 1000.0
1 NaN
2 3000.0
3 4000.0
4 NaN
5 6000.0
Name: price, dtype: float64
6、更改列名称:
id date city age category-size price
0 1001 2018-01-01 beijing 22 100-A 1000.0
1 1001 2018-01-02 shagnhai 45 101-B NaN
2 1003 2018-01-03 guangzhou 56 102-C 3000.0
3 1004 2018-01-04 chengdu 33 103-D 4000.0
4 1005 2018-01-05 wuhan 24 104-E NaN
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0
7、删除后出现的重复值:
0 1001
2 1003
3 1004
4 1005
5 1006
Name: id, dtype: int64
8、删除先出现的重复值
1 1001
2 1003
3 1004
4 1005
5 1006
Name: id, dtype: int64
9、数据替换:
0 beijing
1 sh
2 guangzhou
3 chengdu
4 wuhan
5 QINGDAO
Name: city, dtype: object
四、数据预处理
df1
=
pd.DataFrame({
"id"
:[
1001
,
1002
,
1003
,
1004
,
1005
,
1006
,
1007
,
1008
],
"gender"
:[
'male'
,
'female'
,
'male'
,
'female'
,
'male'
,
'female'
,
'male'
,
'female'
],
"pay"
:[
'Y'
,
'N'
,
'Y'
,
'Y'
,
'N'
,
'Y'
,
'N'
,
'Y'
,],
"m-point"
:[
10
,
12
,
20
,
40
,
40
,
40
,
30
,
20
]})
1、数据表合并
id date city age category price gender pay m-point
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
左连接:
id date city age category price gender pay m-point
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
右连接:
df_right
全连接:
df_outer
2、设置索引列
date city age category price gender pay m-point
id
1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
1005 2018-01-05 wuhan 24 104-E NaN male N 40
1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
3、按照特定列的值排序:
id date city age category price gender pay m-point
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
4、按照索引列排序:
id date city age category price gender pay m-point
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40
5、如果price列的值>3000,group列显示high,否则显示low:
id date city age category price gender pay m-point group
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high
6、对复合多个条件的数据进行分组标记
id date city age category price gender pay m-point group sign
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN
7、对category字段的值依次进行分列,并创建数据表,索引值为df_inner的索引列,列名称为category和size
0 100 A
1 101 B
2 102 C
3 103 D
4 104 E
5 105 F
8、将完成分裂后的数据表和原df_inner数据表进行匹配
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
五、数据提取
主要用到的三个函数:loc,iloc和ix,loc函数按标签值进行提取,iloc按位置进行提取,ix可以同时按标签和位置进行提取。
1、按索引提取单行的数值
id 1004
date 2018-01-04 00:00:00
city chengdu
age 33
category_x 103-D
price 4000
gender female
pay Y
m-point 40
group high
sign NaN
category_y 103
size D
Name: 3, dtype: object
2、按索引提取区域行数值
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
3、重设索引
index id date city age category_x price gender pay m-point group sign category_y size
0 0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
4 4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
5 5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
4、设置日期为索引
id city age category_x price gender pay m-point group sign category_y size
date
2018-01-01 1001 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
2018-01-02 1001 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2018-01-03 1003 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
2018-01-04 1004 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
2018-01-05 1005 wuhan 24 104-E NaN male N 40 low NaN 104 E
2018-01-06 1006 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
5、提取4日之前的所有数据
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
6、使用iloc按位置区域提取数据
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
7、适应iloc按位置单独提起数据
0 100-A 1000.0
2 102-C 3000.0
5 105-F 6000.0
8、使用ix按索引标签和位置混合提取数据
0 1001 2018-01-01 beijing 22
1 1001 2018-01-02 shagnhai 45
2 1003 2018-01-03 guangzhou 56
9、判断city列的值是否为北京
0 True
1 False
2 False
3 False
4 False
5 False
10、判断city列里是否包含beijing和shanghai,然后将符合条件的数据提取出来
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
11、提取前三个字符,并生成数据表
六、数据筛选
使用与、或、非三个条件配合大于、小于、等于对数据进行筛选,并进行计数和求和。
1、使用“与”进行筛选
id gender
0 1001 male
2、使用“或”进行筛选
id gender age
0 1001 male 22
1 1001 male 45
2 1003 male 56
3 1004 female 33
4 1005 male 24
5 1006 female 43
3、使用“非”条件进行筛选
id gender city
1 1001 male shagnhai
2 1003 male guangzhou
3 1004 female chengdu
4 1005 male wuhan
5 1006 female QINGDAO
4、对筛选后的数据按city列进行计数
1
5、使用query函数进行筛选
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
6、对筛选后的结果按price进行求和
1000.0
七、数据汇总
主要函数是groupby和pivote_table
1、对所有的列进行计数汇总
id date age category_x price gender pay m-point group sign category_y size
city
QINGDAO 1 1 1 1 1 1 1 1 1 0 1 1
beijing 1 1 1 1 1 1 1 1 1 1 1 1
chengdu 1 1 1 1 1 1 1 1 1 0 1 1
guangzhou 1 1 1 1 1 1 1 1 1 0 1 1
shagnhai 1 1 1 1 0 1 1 1 1 0 1 1
wuhan 1 1 1 1 0 1 1 1 1 0 1 1
2、按城市对id字段进行计数
city
QINGDAO 1
beijing 1
chengdu 1
guangzhou 1
shagnhai 1
wuhan 1
3、对两个字段进行汇总计数
city size
QINGDAO F 1
beijing A 1
chengdu D 1
guangzhou C 1
shagnhai B 1
wuhan E 1
Name: id, dtype: int64
4、对city字段进行汇总,并分别计算price的合计和均值
len sum mean
city
QINGDAO 1.0 6000.0 6000.0
beijing 1.0 1000.0 1000.0
chengdu 1.0 4000.0 4000.0
guangzhou 1.0 3000.0 3000.0
shagnhai 1.0 0.0 NaN
wuhan 1.0 0.0 NaN
八、数据统计
数据采样,计算标准差,协方差和相关系数
1、简单的数据采样
id date city age category_x price gender pay m-point group sign category_y size
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
2、手动设置采样权重
id date city age category_x price gender pay m-point group sign category_y size
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
3、采样后不放回
id date city age category_x price gender pay m-point group sign category_y size
0 1001 2018-01-01 beijing 22 100-A 1000.0 male Y 10 low 1.0 100 A
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
4、采样后放回
id date city age category_x price gender pay m-point group sign category_y size
2 1003 2018-01-03 guangzhou 56 102-C 3000.0 male Y 20 low NaN 102 C
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
3 1004 2018-01-04 chengdu 33 103-D 4000.0 female Y 40 high NaN 103 D
5 1006 2018-01-06 QINGDAO 43 105-F 6000.0 female Y 40 high NaN 105 F
4 1005 2018-01-05 wuhan 24 104-E NaN male N 40 low NaN 104 E
1 1001 2018-01-02 shagnhai 45 101-B NaN male Y 10 low NaN 101 B
5、 数据表描述性统计
count mean std min 25% 50% 75% max
id 6.0 1003.33 2.07 1001.0 1001.50 1003.5 1004.75 1006.0
age 6.0 37.17 13.20 22.0 26.25 38.0 44.50 56.0
price 4.0 3500.00 2081.67 1000.0 2500.00 3500.0 4500.00 6000.0
m-point 6.0 26.67 15.06 10.0 12.50 30.0 40.00 40.0
sign 1.0 1.00 NaN 1.0 1.00 1.0 1.00 1.0
6、计算列的标准差
2081.6659994661327
7、计算两个字段间的协方差
28333.333333333332
8、数据表中所有字段间的协方差
id age price m-point sign
id 4.266667 0.333333 4.333333e+03 29.333333 NaN
age 0.333333 174.166667 1.366667e+04 -31.333333 NaN
price 4333.333333 13666.666667 4.333333e+06 28333.333333 NaN
m-point 29.333333 -31.333333 2.833333e+04 226.666667 NaN
sign NaN NaN NaN NaN NaN
9、两个字段的相关性分析
0.9073928715621604
10、数据表的相关性分析
id age price m-point sign
id 1.000000 0.012228 1.000000 0.943242 NaN
age 0.012228 1.000000 0.453406 -0.157699 NaN
price 1.000000 0.453406 1.000000 0.907393 NaN
m-point 0.943242 -0.157699 0.907393 1.000000 NaN
sign NaN NaN NaN NaN NaN
九、数据输出
分析后的数据可以输出为xlsx格式和csv格式
1、写入Excel
'excel_to_python.xlsx'
, sheet_name
=
'bluewhale_cc'
)
2、写入到CSV
'excel_to_python.csv'
)