pandas Basics

1. pandas Data Structures

pandas has two main data structures:

Series
DataFrame

A Series is a one-dimensional labeled array built on NumPy's ndarray.

A DataFrame is a two-dimensional table, similar to an Excel sheet.
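
As a minimal sketch of how the two structures relate (the column names and values here are made up for illustration), a DataFrame can be built from a dict of equal-length lists, and each of its columns is itself a Series:

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
s = pd.Series([10, 20, 30], name="score")
df = pd.DataFrame({"name": ["Ann", "Ben", "Cy"], "score": [10, 20, 30]})

print(s.ndim)    # 1: a Series is one-dimensional
print(df.ndim)   # 2: a DataFrame is two-dimensional
print(df.shape)  # (3, 2): 3 rows, 2 columns

# Selecting a single column returns a Series.
col = df["score"]
print(type(col).__name__)  # Series
```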

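1.1 Series

Creating a Series

import numpy as np
import pandas as pd

arr = pd.Series([1, 2, -3, 4, -5, np.nan])    # from a list
arr2 = pd.Series(np.arange(6))                # from a NumPy array
d = {'a':1,'b':2,'c':3,'d':4,'e':5}
arr3 = pd.Series(d)                           # from a dict; the keys become the index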
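
A Series can also be given explicit labels instead of the default 0..n-1 index. The sketch below uses made-up labels and values to show that the same element can then be reached by label or by position:

```python
import pandas as pd

# Explicit labels instead of the default RangeIndex (values are illustrative).
s = pd.Series([1.5, 2.5, 3.5], index=["x", "y", "z"])

by_label = s["y"]        # label-based lookup
by_position = s.iloc[1]  # position-based lookup, same element

# Building from a dict uses the keys as the index.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})
print(list(from_dict.index))  # ['a', 'b', 'c']
```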

Series attributes

arr = pd.Series([1, 2, -3, 4, -5, np.nan])
arr.values
#array([ 1.,  2., -3.,  4., -5., nan])
arr.index
#RangeIndex(start=0, stop=6, step=1)

# Shape (dimensions)
arr.shape
#(6,)

Selecting by position

arr = pd.Series([1, 2, -3, 4, -5, np.nan])
arr[3:]
arr.iloc[3:]
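
One detail worth checking when slicing: iloc slices by position and excludes the end, while loc slices by label and includes it. A small sketch on the same Series:

```python
import numpy as np
import pandas as pd

arr = pd.Series([1, 2, -3, 4, -5, np.nan])

by_pos = arr.iloc[1:3]   # positions 1 and 2 (end excluded)
by_label = arr.loc[1:3]  # labels 1, 2 and 3 (end included)

print(len(by_pos))    # 2
print(len(by_label))  # 3
```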

Counting values

obj = pd.Series(['Bob', 'Steve', 'Jeff', 'Ryan', 'Jeff', 'Ryan'])
obj.value_counts()

# Result
#Jeff     2
#Ryan     2
#Bob      1
#Steve    1
#dtype: int64

Sorting

arr = pd.Series([1, 2, -3, 4, -5, np.nan])
# Sort by value
arr.sort_values()
# Sort by index
arr.sort_index()
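
A short sketch of the common sort_values options on the same data: by default NaN goes last, ascending=False reverses the order of the real values, and na_position='first' moves NaN to the front.

```python
import numpy as np
import pandas as pd

arr = pd.Series([1, 2, -3, 4, -5, np.nan])

asc = arr.sort_values()                          # smallest first, NaN last
desc = arr.sort_values(ascending=False)          # largest first, NaN still last
nan_first = arr.sort_values(na_position="first") # NaN moved to the front

# The original index labels travel with the values.
print(asc.index.tolist())        # [4, 2, 0, 1, 3, 5]
print(desc.index.tolist())       # [3, 1, 0, 2, 4, 5]
print(nan_first.index.tolist())  # [5, 4, 2, 0, 1, 3]
```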

1.2 DataFrame

Creating a DataFrame

df1 = pd.DataFrame(np.random.randn(3, 3), index=list('abc'), columns=list('ABC'))
print(df1)
# Each row and each column of a DataFrame is a Series
#          A         B         C
#a -0.454419 -0.606726  0.499842
#b -0.666458  1.231203 -1.460624
#c -0.338414 -1.550477 -0.517511

#-----------------------------------------------------------------------------
# Plain [] indexing selects by column first, e.g. df['column name']['row label']
type(df1['A'])
#<class 'pandas.core.series.Series'>
df1['A']['a']
#-0.454419
df1['A'].iloc[0]    # by position; an integer key in [] on a label-indexed Series is deprecated
#-0.454419
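
For comparison, here is a sketch of the same lookup done in one step with loc and at. The integer values replace the random data only so that the results are predictable. Chained indexing is fine for reading, but a single loc call is the safer choice when assigning.

```python
import pandas as pd

# Fixed values instead of random ones, so the results are reproducible.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=list("abc"), columns=list("ABC"))

chained = df["A"]["a"]     # column first, then row label
direct = df.loc["a", "A"]  # row label and column label in one step
scalar = df.at["a", "A"]   # fast access to a single value

# Assignment through loc modifies the DataFrame itself.
df.loc["b", "C"] = 60
print(df.loc["b", "C"])  # 60
```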

Basic DataFrame attributes

df1.dtypes
df1.index
#Index(['a', 'b', 'c'], dtype='object')
df1.columns
#Index(['A', 'B', 'C'], dtype='object')

Using iloc

df1 = pd.DataFrame(np.random.randn(3, 3), index=list('abc'), columns=list('ABC'))
print(df1)
#          A         B         C
#a -0.454419 -0.606726  0.499842
#b -0.666458  1.231203 -1.460624
#c -0.338414 -1.550477 -0.517511

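#-------------------------------------------------------------
# iloc selects by integer position, rows first, and does not accept labels, e.g. df.iloc[row position, column position]
# Label pairs such as ['row label', 'column label'] do not work with iloc; use loc for labels
df1.iloc[0]    # a single row (Series)
#A   -0.454419
#B   -0.606726
#C    0.499842
#Name: a, dtype: float64
type(df1.iloc[0])
#<class 'pandas.core.series.Series'>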


#-------------------------------------------------------------
df1.iloc[[0]]    # a single row (DataFrame); passing a list keeps the result two-dimensional
#          A         B         C
#a -0.454419 -0.606726  0.499842
type(df1.iloc[[0]])
#<class 'pandas.core.frame.DataFrame'>


#-------------------------------------------------------------
df1.iloc[[0, 1]]    # several rows (DataFrame)
#          A         B         C
#a -0.454419 -0.606726  0.499842
#b -0.666458  1.231203 -1.460624

#-------------------------------------------------------------
df1.iloc[0:2]     # several rows by slice (DataFrame)
#          A         B         C
#a -0.454419 -0.606726  0.499842
#b -0.666458  1.231203 -1.460624
type(df1.iloc[0:2])
#<class 'pandas.core.frame.DataFrame'>

#-------------------------------------------------------------
df1.iloc[[True, False, True]]  # a boolean list selects which rows to keep
#          A         B         C
#a -0.454419 -0.606726  0.499842
#c -0.338414 -1.550477 -0.517511

#-------------------------------------------------------------
df1.iloc[0,2]             # a single value by row position and column position
#0.499842

#-------------------------------------------------------------
df1.iloc[lambda x: x.index == 'a']     # a callable can also select rows
#          A         B         C
#a -0.454419 -0.606726  0.499842
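
Since iloc only takes positions, the label-based counterpart is loc. The sketch below uses fixed integer values (an assumption, to keep the output predictable) and shows the same selection written both ways, plus a boolean condition with loc:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=list("abc"), columns=list("ABC"))

by_position = df.iloc[0:2, [0, 2]]        # rows 0-1, columns 0 and 2
by_label = df.loc["a":"b", ["A", "C"]]    # label slice includes "b"
print(by_position.equals(by_label))       # True

# A boolean condition on a column keeps only the matching rows.
filtered = df.loc[df["A"] > 1]
print(filtered.index.tolist())  # ['b', 'c']
```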

Viewing the first or last rows

df1.head()
df1.tail()

Other table operations

# Summary statistics
df1.describe()
#              A         B         C
#count  3.000000  3.000000  3.000000
#mean  -0.486430 -0.308667 -0.492765
#std    0.166348  1.414590  0.980467
#min   -0.666458 -1.550477 -1.460624
#25%   -0.560438 -1.078601 -0.989068
#50%   -0.454419 -0.606726 -0.517511
#75%   -0.396417  0.312238 -0.008835
#max   -0.338414  1.231203  0.499842
#-------------------------------------------------------------
# Column sums (one total per column)
df1.sum(axis=0)
# Row sums (one total per row)
df1.sum(axis=1)
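
The axis argument is easy to mix up, so here is a small sketch with made-up integers: axis=0 collapses the rows and returns one value per column, while axis=1 collapses the columns and returns one value per row.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=["r1", "r2"], columns=list("ABC"))

col_sums = df.sum(axis=0)  # indexed by column name
row_sums = df.sum(axis=1)  # indexed by row label

print(col_sums.to_dict())  # {'A': 5, 'B': 7, 'C': 9}
print(row_sums.to_dict())  # {'r1': 6, 'r2': 15}
```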

#-------------------------------------------------------------
# Transpose
df1.T

#-------------------------------------------------------------
# Element-wise arithmetic (double every value)
df1.apply(lambda x: x * 2)

#-------------------------------------------------------------
# Fill missing values with zero
df.fillna(value=0)
# Fill missing values in one column with that column's mean
df['prince'].fillna(df['prince'].mean())

#-------------------------------------------------------------
# Find missing values (returns a boolean Series)
data1 = df['field1'].isnull()
# Find non-missing values (returns a boolean Series)
data2 = df['field2'].notnull()

#-------------------------------------------------------------
# Drop rows where the column is missing
data_notnull = df[df['field2'].notnull()]
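
To see these missing-value operations on concrete data, here is a sketch with a small made-up table (the column names price and qty are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, np.nan, 30.0],
                   "qty": [1, 2, np.nan]})

n_missing = df.isnull().sum()  # number of missing values per column
filled_zero = df.fillna(0)     # every NaN replaced with 0

# Mean of the non-missing prices is (10 + 30) / 2 = 20.
filled_mean = df["price"].fillna(df["price"].mean())
print(filled_mean.tolist())  # [10.0, 20.0, 30.0]

# Keep only rows where qty is present; dropna(subset=...) does the same.
kept = df[df["qty"].notnull()]
print(kept.equals(df.dropna(subset=["qty"])))  # True
```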

#-------------------------------------------------------------
# Find zero values (returns a boolean Series)
is_zero = df['field'] == 0

#-------------------------------------------------------------
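# Apply a function to the table data
df['field1'].map(function)    # map works element by element on one column
df.apply(function, axis=0)  # column-wise by default (each column is passed to function as a Series)
# Example: take the square root of every value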
#>>> df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
#>>> df
#   A  B
#0  4  9
#1  4  9
#2  4  9
#>>> df.apply(np.sqrt)
#     A    B
#0  2.0  3.0
#1  2.0  3.0
#2  2.0  3.0
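
The difference between map and apply can be sketched as follows, using made-up values: map transforms one column element by element, while apply hands a whole column (axis=0) or a whole row (axis=1) to the function.

```python
import pandas as pd

df = pd.DataFrame({"A": [4, 9, 16], "B": [1, 2, 3]})

# map: the function receives one value at a time.
doubled = df["A"].map(lambda v: v * 2)

# apply with axis=0: the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# apply with axis=1: the function receives each row as a Series.
row_total = df.apply(lambda row: row["A"] + row["B"], axis=1)

print(doubled.tolist())     # [8, 18, 32]
print(col_range.to_dict())  # {'A': 12, 'B': 2}
print(row_total.tolist())   # [5, 11, 19]
```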


#-------------------------------------------------------------
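# Strip leading and trailing whitespace
df['field1'] = df['field1'].str.strip()

#-------------------------------------------------------------
# Convert to lowercase
df['field1'] = df['field1'].str.lower()

#-------------------------------------------------------------
# Change the data type (astype returns a new Series, so assign it back)
df['field1'] = df['field1'].astype('int')

#-------------------------------------------------------------
# Rename a column (rename returns a new DataFrame, so assign it back)
df = df.rename(columns={'oldname': 'newname'})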

#-------------------------------------------------------------
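# Rolling-window sum (sum over every three consecutive values)
df['field'].rolling(3).sum()
# rolling(window, min_periods=None, center=False, ...)
# window : size of the window
# min_periods : minimum number of observations a window needs to produce a value
# The older top-level function pd.rolling_sum(arg, window, ...) has been removed from pandas; use .rolling() instead
df.rolling(window=3).sum()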
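
A worked sketch of the rolling sum on a short made-up Series. With a window of 3 the first two positions have too few values and come out as NaN, unless min_periods is lowered:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

full = s.rolling(3).sum()                    # needs 3 values per window
partial = s.rolling(3, min_periods=1).sum()  # accepts shorter windows at the start

print(full.tolist())     # [nan, nan, 6.0, 9.0, 12.0]
print(partial.tolist())  # [1.0, 3.0, 6.0, 9.0, 12.0]
```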

#-------------------------------------------------------------
# Index label of the minimum value in each column
df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
                   'co2_emissions': [37.2, 19.66, 1712]},
                  index=['Pork', 'Wheat Products', 'Beef'])
#                consumption  co2_emissions
#Pork                  10.51         37.20
#Wheat Products       103.11         19.66
#Beef                  55.48       1712.00
df.idxmin()
#consumption                Pork
#co2_emissions    Wheat Products

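2. Reading Files

import pandas as pd
# Pass encoding when the file is not UTF-8 (GB18030 here)
data = pd.read_csv('filename', encoding='GB18030')
# read_csv already returns a DataFrame; header=1 uses the second line of the file as the column names
df = pd.read_csv('name.csv', header=1)

# read_excel also returns a DataFrame directly
df = pd.read_excel('name.xlsx')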
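
To try read_csv without a file on disk, the sketch below feeds it an in-memory string through io.StringIO. The CSV content is made up for illustration. It also shows what header=1 does when a file starts with an extra line before the real header.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file.
csv_text = "id,name\n1,Ann\n2,Ben\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)

# With header=1 the first line is skipped and the second line becomes the header.
with_note = "exported by tool\n" + csv_text
skipped = pd.read_csv(io.StringIO(with_note), header=1)
print(list(skipped.columns))  # ['id', 'name']
```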

posted @ 2020-11-05 13:56 鸭梨的药丸哥
