pandas模块


Python从入门到放弃完整教程目录:https://www.cnblogs.com/nickchen121/p/10718112.html
pandas官方文档:[https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750](https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750)
pandas基于Numpy,可以看成是处理文本或者表格数据。pandas中有两个主要的数据结构,其中Series数据结构类似于Numpy中的一维数组,DataFrame类似于多维表格数据结构。

pandas是python数据分析的核心模块。它主要提供了五大功能:

  1. 支持文件存取操作,支持数据库(sql)、html、json、pickle、csv(txt、excel)、sas、stata、hdf等。
  2. 支持增删改查、切片、高阶函数、分组聚合等单表操作,以及和dict、list的互相转换。
  3. 支持多表拼接合并操作。
  4. 支持简单的绘图操作。
  5. 支持简单的统计分析操作。

一、Series数据结构

Series是一种类似于一维数组的对象,由一组数据和一组与之相关的数据标签(索引)组成。

Series比较像列表(数组)和字典的结合体

import numpy as np
import pandas as pd
df = pd.Series(0, index=['a', 'b', 'c', 'd'])
print(df)
a    0
b    0
c    0
d    0
dtype: int64
print(df.values)
[0 0 0 0]
print(df.index)
Index(['a', 'b', 'c', 'd'], dtype='object')

1.1 Series支持NumPy模块的特性(下标)

详解 方法
从ndarray创建Series Series(arr)
与标量运算 df*2
两个Series运算 df1+df2
索引 df[0], df[[1,2,4]]
切片 df[0:2]
通用函数 np.abs(df)
布尔值过滤 df[df>0]
arr = np.array([1, 2, 3, 4, np.nan])
print(arr)
[ 1.  2.  3.  4. nan]
df = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])
print(df)
a    1.0
b    2.0
c    3.0
d    4.0
e    NaN
dtype: float64
print(df**2)
a     1.0
b     4.0
c     9.0
d    16.0
e     NaN
dtype: float64
print(df[0])
1.0
print(df['a'])
1.0
print(df[[0, 1, 2]])
a    1.0
b    2.0
c    3.0
dtype: float64
print(df[0:2])
a    1.0
b    2.0
dtype: float64
np.sin(df)
a    0.841471
b    0.909297
c    0.141120
d   -0.756802
e         NaN
dtype: float64
df[df > 1]
b    2.0
c    3.0
d    4.0
dtype: float64

1.2 Series支持字典的特性(标签)

详解 方法
从字典创建Series Series(dic),
in运算 ’a’ in sr
键索引 sr['a'], sr[['a', 'b', 'd']]
df = pd.Series({'a': 1, 'b': 2})
print(df)
a    1
b    2
dtype: int64
print('a' in df)
True
print(df['a'])
1

1.3 Series缺失数据处理

方法 详解
dropna() 过滤掉值为NaN的行
fillna() 填充缺失数据
isnull() 返回布尔数组,缺失值对应为True
notnull() 返回布尔数组,缺失值对应为False
df = pd.Series([1, 2, 3, 4, np.nan], index=['a', 'b', 'c', 'd', 'e'])
print(df)
a    1.0
b    2.0
c    3.0
d    4.0
e    NaN
dtype: float64
print(df.dropna())
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64
print(df.fillna(5))
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64
print(df.isnull())
a    False
b    False
c    False
d    False
e     True
dtype: bool
print(df.notnull())
a     True
b     True
c     True
d     True
e    False
dtype: bool

二、DataFrame数据结构

DataFrame是一个表格型的数据结构,含有一组有序的列。

DataFrame可以被看做是由Series组成的字典,并且共用一个索引。

2.1 产生时间对象数组:date_range

date_range参数详解:

参数 详解
start 开始时间
end 结束时间
periods 时间长度
freq 时间频率,默认为'D',可选H(our),W(eek),B(usiness),S(emi-)M(onth),(min)T(es), S(econd), A(year),…
dates = pd.date_range('20190101', periods=6, freq='M')
print(dates)
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30'],
              dtype='datetime64[ns]', freq='M')
np.random.seed(1)
arr = 10 * np.random.randn(6, 4)
print(arr)
[[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]
 [  8.65407629 -23.01538697  17.44811764  -7.61206901]
 [  3.19039096  -2.49370375  14.62107937 -20.60140709]
 [ -3.22417204  -3.84054355  11.33769442 -10.99891267]
 [ -1.72428208  -8.77858418   0.42213747   5.82815214]
 [-11.00619177  11.4472371    9.01590721   5.02494339]]
df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4'])
df
c1 c2 c3 c4
16.243454 -6.117564 -5.281718 -10.729686
8.654076 -23.015387 17.448118 -7.612069
3.190391 -2.493704 14.621079 -20.601407
-3.224172 -3.840544 11.337694 -10.998913
-1.724282 -8.778584 0.422137 5.828152
-11.006192 11.447237 9.015907 5.024943

三、DataFrame属性

属性 详解
dtype是 查看数据类型
index 查看行序列或者索引
columns 查看各列的标签
values 查看数据框内的数据,也即不含表头索引的数据
describe 查看数据每一列的极值,均值,中位数,只可用于数值型数据
transpose 转置,也可用T来操作
sort_index 排序,可按行或列index排序输出
sort_values 按数据值来排序
# 查看数据类型
print(df2.dtypes)
0    float64
1    float64
2    float64
3    float64
dtype: object
df
c1 c2 c3 c4
16.243454 -6.117564 -5.281718 -10.729686
8.654076 -23.015387 17.448118 -7.612069
3.190391 -2.493704 14.621079 -20.601407
-3.224172 -3.840544 11.337694 -10.998913
-1.724282 -8.778584 0.422137 5.828152
-11.006192 11.447237 9.015907 5.024943
print(df.index)
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30'],
              dtype='datetime64[ns]', freq='M')
print(df.columns)
Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
print(df.values)
[[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]
 [  8.65407629 -23.01538697  17.44811764  -7.61206901]
 [  3.19039096  -2.49370375  14.62107937 -20.60140709]
 [ -3.22417204  -3.84054355  11.33769442 -10.99891267]
 [ -1.72428208  -8.77858418   0.42213747   5.82815214]
 [-11.00619177  11.4472371    9.01590721   5.02494339]]
df.describe()
c1 c2 c3 c4
6.000000 6.000000 6.000000 6.000000
2.022213 -5.466424 7.927203 -6.514830
9.580084 11.107772 8.707171 10.227641
-11.006192 -23.015387 -5.281718 -20.601407
-2.849200 -8.113329 2.570580 -10.931606
0.733054 -4.979054 10.176801 -9.170878
7.288155 -2.830414 13.800233 1.865690
16.243454 11.447237 17.448118 5.828152
df.T
2019-01-31 00:00:00 2019-02-28 00:00:00 2019-03-31 00:00:00 2019-04-30 00:00:00 2019-05-31 00:00:00 2019-06-30 00:00:00
16.243454 8.654076 3.190391 -3.224172 -1.724282 -11.006192
-6.117564 -23.015387 -2.493704 -3.840544 -8.778584 11.447237
-5.281718 17.448118 14.621079 11.337694 0.422137 9.015907
-10.729686 -7.612069 -20.601407 -10.998913 5.828152 5.024943
# 按行标签[c1, c2, c3, c4]从大到小排序
df.sort_index(axis=0)
c1 c2 c3 c4
16.243454 -6.117564 -5.281718 -10.729686
8.654076 -23.015387 17.448118 -7.612069
3.190391 -2.493704 14.621079 -20.601407
-3.224172 -3.840544 11.337694 -10.998913
-1.724282 -8.778584 0.422137 5.828152
-11.006192 11.447237 9.015907 5.024943
# 按列标签[2019-01-01, 2019-01-02...]从大到小排序
df.sort_index(axis=1)
c1 c2 c3 c4
16.243454 -6.117564 -5.281718 -10.729686
8.654076 -23.015387 17.448118 -7.612069
3.190391 -2.493704 14.621079 -20.601407
-3.224172 -3.840544 11.337694 -10.998913
-1.724282 -8.778584 0.422137 5.828152
-11.006192 11.447237 9.015907 5.024943
# 按c2列的值从大到小排序
df.sort_values(by='c2')
c1 c2 c3 c4
8.654076 -23.015387 17.448118 -7.612069
-1.724282 -8.778584 0.422137 5.828152
16.243454 -6.117564 -5.281718 -10.729686
-3.224172 -3.840544 11.337694 -10.998913
3.190391 -2.493704 14.621079 -20.601407
-11.006192 11.447237 9.015907 5.024943

四、DataFrame取值

df
c1 c2 c3 c4
16.243454 -6.117564 -5.281718 -10.729686
8.654076 -23.015387 17.448118 -7.612069
3.190391 -2.493704 14.621079 -20.601407
-3.224172 -3.840544 11.337694 -10.998913
-1.724282 -8.778584 0.422137 5.828152
-11.006192 11.447237 9.015907 5.024943

4.1 通过columns取值

df['c2']
2019-01-31    -6.117564
2019-02-28   -23.015387
2019-03-31    -2.493704
2019-04-30    -3.840544
2019-05-31    -8.778584
2019-06-30    11.447237
Freq: M, Name: c2, dtype: float64
df[['c2', 'c3']]
c2 c3
-6.117564 -5.281718
-23.015387 17.448118
-2.493704 14.621079
-3.840544 11.337694
-8.778584 0.422137
11.447237 9.015907

4.2 loc(通过行标签取值)

# 通过自定义的行标签选择数据
df.loc['2019-01-01':'2019-01-03']
c1 c2 c3 c4
df[0:3]
c1 c2 c3 c4
16.243454 -6.117564 -5.281718 -10.729686
8.654076 -23.015387 17.448118 -7.612069
3.190391 -2.493704 14.621079 -20.601407

4.3 iloc(类似于numpy数组取值)

df.values
array([[ 16.24345364,  -6.11756414,  -5.28171752, -10.72968622],
       [  8.65407629, -23.01538697,  17.44811764,  -7.61206901],
       [  3.19039096,  -2.49370375,  14.62107937, -20.60140709],
       [ -3.22417204,  -3.84054355,  11.33769442, -10.99891267],
       [ -1.72428208,  -8.77858418,   0.42213747,   5.82815214],
       [-11.00619177,  11.4472371 ,   9.01590721,   5.02494339]])
# 通过行索引选择数据
print(df.iloc[2, 1])
-2.493703754774101
df.iloc[1:4, 1:4]
c2 c3 c4
-23.015387 17.448118 -7.612069
-2.493704 14.621079 -20.601407
-3.840544 11.337694 -10.998913

4.4 使用逻辑判断取值

df[df['c1'] > 0]
c1 c2 c3 c4
16.243454 -6.117564 -5.281718 -10.729686
8.654076 -23.015387 17.448118 -7.612069
3.190391 -2.493704 14.621079 -20.601407
df[(df['c1'] > 0) & (df['c2'] > -8)]
c1 c2 c3 c4
16.243454 -6.117564 -5.281718 -10.729686
3.190391 -2.493704 14.621079 -20.601407

五、DataFrame值替换

df
c1 c2 c3 c4
16.243454 -6.117564 -5.281718 -10.729686
8.654076 -23.015387 17.448118 -7.612069
3.190391 -2.493704 14.621079 -20.601407
-3.224172 -3.840544 11.337694 -10.998913
-1.724282 -8.778584 0.422137 5.828152
-11.006192 11.447237 9.015907 5.024943
df.iloc[0:3, 0:2] = 0
df
c1 c2 c3 c4
0.000000 0.000000 -5.281718 -10.729686
0.000000 0.000000 17.448118 -7.612069
0.000000 0.000000 14.621079 -20.601407
-3.224172 -3.840544 11.337694 -10.998913
-1.724282 -8.778584 0.422137 5.828152
-11.006192 11.447237 9.015907 5.024943
df['c3'] > 10
2019-01-31    False
2019-02-28     True
2019-03-31     True
2019-04-30     True
2019-05-31    False
2019-06-30    False
Freq: M, Name: c3, dtype: bool
# 针对行做处理
df[df['c3'] > 10] = 100
df
c1 c2 c3 c4
0.000000 0.000000 -5.281718 -10.729686
100.000000 100.000000 100.000000 100.000000
100.000000 100.000000 100.000000 100.000000
100.000000 100.000000 100.000000 100.000000
-1.724282 -8.778584 0.422137 5.828152
-11.006192 11.447237 9.015907 5.024943
# 针对行做处理
df = df.astype(np.int32)
df[df['c3'].isin([100])] = 1000
df
c1 c2 c3 c4
0 0 -5 -10
1000 1000 1000 1000
1000 1000 1000 1000
1000 1000 1000 1000
-1 -8 0 5
-11 11 9 5

六、读取CSV文件

import pandas as pd
from io import StringIO
test_data = '''
5.1,,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,,0.2
7.0,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,
,,,
'''

test_data = StringIO(test_data)
df = pd.read_csv(test_data, header=None)
df.columns = ['c1', 'c2', 'c3', 'c4']
df
c1 c2 c3 c4
5.1 NaN 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 NaN 0.2
7.0 3.2 4.7 1.4
6.4 3.2 4.5 1.5
6.9 3.1 4.9 NaN
NaN NaN NaN NaN

七、处理丢失数据

df.isnull()
c1 c2 c3 c4
False True False False
False False False False
False False True False
False False False False
False False False False
False False False True
True True True True
# 通过在isnull()方法后使用sum()方法即可获得该数据集某个特征含有多少个缺失值
print(df.isnull().sum())
c1    1
c2    2
c3    2
c4    2
dtype: int64
# axis=0删除有NaN值的行
df.dropna(axis=0)
c1 c2 c3 c4
4.9 3.0 1.4 0.2
7.0 3.2 4.7 1.4
6.4 3.2 4.5 1.5
# axis=1删除有NaN值的列
df.dropna(axis=1)
# 删除全为NaN值得行或列
df.dropna(how='all')
c1 c2 c3 c4
5.1 NaN 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 NaN 0.2
7.0 3.2 4.7 1.4
6.4 3.2 4.5 1.5
6.9 3.1 4.9 NaN
# 删除行不为4个值的
df.dropna(thresh=4)
c1 c2 c3 c4
4.9 3.0 1.4 0.2
7.0 3.2 4.7 1.4
6.4 3.2 4.5 1.5
# 删除c2中有NaN值的行
df.dropna(subset=['c2'])
c1 c2 c3 c4
4.9 3.0 1.4 0.2
4.7 3.2 NaN 0.2
7.0 3.2 4.7 1.4
6.4 3.2 4.5 1.5
6.9 3.1 4.9 NaN
# 填充nan值
df.fillna(value=10)
c1 c2 c3 c4
5.1 10.0 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 10.0 0.2
7.0 3.2 4.7 1.4
6.4 3.2 4.5 1.5
6.9 3.1 4.9 10.0
10.0 10.0 10.0 10.0

八、合并数据

df1 = pd.DataFrame(np.zeros((3, 4)))
df1
0 1 2 3
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
df2 = pd.DataFrame(np.ones((3, 4)))
df2
0 1 2 3
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
# axis=0合并列
pd.concat((df1, df2), axis=0)
0 1 2 3
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
# axis=1合并行
pd.concat((df1, df2), axis=1)
0 1 2 3 0 1 2 3
0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
# append只能合并列
df1.append(df2)
0 1 2 3
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0

九、导入导出数据

使用df = pd.read_excel(filename)读取文件,使用df.to_excel(filename)保存文件。

9.1 读取文件导入数据

读取文件导入数据函数主要参数:

参数 详解
sep 指定分隔符,可用正则表达式如'\s+'
header=None 指定文件无行名
name 指定列名
index_col 指定某列作为索引
skip_row 指定跳过某些行
na_values 指定某些字符串表示缺失值
parse_dates 指定某些列是否被解析为日期,布尔值或列表
df = pd.read_excel(filename)
df = pd.read_csv(filename)

9.2 写入文件导出数据

写入文件函数的主要参数:

参数 详解
sep 分隔符
na_rep 指定缺失值转换的字符串,默认为空字符串
header=False 不保存列名
index=False 不保存行索引
cols 指定输出的列,传入列表
df.to_excel(filename)

十、pandas读取json文件

strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'

df = pd.read_json(strtext, orient='records')
df
code code1 code2 issue time ttery
8,4,5,2,9 297734529 NaN 20130801-3391 1013395466000 min
7,8,2,1,2 298058212 NaN 20130801-3390 1013395406000 min
5,9,1,2,9 298329129 NaN 20130801-3389 1013395346000 min
3,8,7,3,3 298588733 NaN 20130801-3388 1013395286000 min
0,8,5,2,7 298818527 NaN 20130801-3387 1013395226000 min
df.to_excel('pandas处理json.xlsx',
            index=False,
            columns=["ttery", "issue", "code", "code1", "code2", "time"])

10.1 orient参数的五种形式

orient是表明预期的json字符串格式。orient的设置有以下五个值:

1.'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

这种就是有索引,有列字段,和数据矩阵构成的json格式。key名称只能是index,columns和data。

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
df = pd.read_json(s, orient='split')
df
a b
1 3
2 8
3 9

2.'records' : list like [{column -> value}, ... , {column -> value}]

这种就是成员为字典的列表。如我今天要处理的json数据示例所见。构成是列字段为键,值为键值,每一个字典成员就构成了dataframe的一行数据。

strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000}]'

df = pd.read_json(strtext, orient='records')
df
code code1 code2 issue time ttery
8,4,5,2,9 297734529 NaN 20130801-3391 1013395466000 min
7,8,2,1,2 298058212 NaN 20130801-3390 1013395406000 min

3.'index' : dict like {index -> {column -> value}}

以索引为key,以列字段构成的字典为键值。如:

s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
df = pd.read_json(s, orient='index')
df
a b
1 2
9 11

4.'columns' : dict like {column -> {index -> value}}

这种处理的就是以列为键,对应一个值字典的对象。这个字典对象以索引为键,以值为键值构成的json字符串。如下图所示:

s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
df = pd.read_json(s, orient='columns')
df
a b
1 2
9 11

5.'values' : just the values array。

values这种我们就很常见了。就是一个嵌套的列表。里面的成员也是列表,2层的。

s = '[["a",1],["b",2]]'
df = pd.read_json(s, orient='values')
df
0 1
a 1
b 2

十一、pandas读取sql语句

import numpy as np
import pandas as pd
import pymysql


def conn(sql):
    # 连接到mysql数据库
    conn = pymysql.connect(
        host="localhost",
        port=3306,
        user="root",
        passwd="123",
        db="db1",
    )
    try:
        data = pd.read_sql(sql, con=conn)
        return data
    except Exception as e:
        print("SQL is not correct!")
    finally:
        conn.close()


sql = "select * from test1 limit 0, 10"  # sql语句
data = conn(sql)
print(data.columns.tolist())  # 查看字段
print(data)  # 查看数据
posted @ 2019-05-04 09:39  B站-水论文的程序猿  阅读(3186)  评论(1编辑  收藏  举报