Pandas Library Basics

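1. Introduction to pandas:
pandas is a powerful Python library for data analysis. It was originally developed as a financial analysis tool, so it has strong support for time series analysis. The name combines "panel data" (an economics term for multidimensional data sets) and "data analysis". It provides the Series and DataFrame data types for working with such data.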

2. Installation and import

Install: pip install pandas

Import the library:
from pandas import Series,DataFrame
import pandas as pd
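3. pandas data structures

Series, DataFrame
Series: a one-dimensional array-like object made up of a sequence of values (of any NumPy data type) and an associated set of data labels, called its index.
DataFrame: a tabular data structure with an ordered collection of columns, each of which can hold a different value type. It has both a row index and a column index, and can be viewed as a dict of Series.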
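To make the "dict of Series" view concrete, here is a minimal sketch. The variable names (`names`, `scores`, `df_demo`) and the labels 'a', 'b', 'c' are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data).
names = pd.Series(['Tom', 'Merry', 'John'], index=['a', 'b', 'c'])
scores = pd.Series([76, 98, 100], index=['a', 'b', 'c'])

# A DataFrame built from a dict of Series: the dict keys become column labels.
df_demo = pd.DataFrame({'name': names, 'score': scores})

# Each column is itself a Series, and columns may have different dtypes.
score_col = df_demo['score']
print(type(score_col).__name__)
print(df_demo.dtypes)
```

Selecting a column with `df_demo['score']` hands back a Series that carries the row index with it.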
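4. Series:
Creating a Series:
From a one-dimensional array
From a dict
Applying NumPy array operations to a Series
Detecting missing values in a Series
Automatic alignment of Series
The name attribute of a Series and its index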
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
arr=np.array([1,2,3,4])
series01=Series(arr)
A Series holds an index, the data, and the element dtype.
series01.index
series01.values
series01.dtype
series02=Series([34.5,56.78,45.67])
series02.index=['product1','product2','product3']
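Reassigns the index of the data; if no index is specified, the default index 0 to N-1 is used.
series03=Series([98,56,88,45],index=['语文','数学','英语','体育'])
The index can also be specified directly when the Series is created.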
series03.index
series03.values
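Creating a Series from a dict
A Series can be viewed as a fixed-length, ordered dict that maps index labels to data values, so it can be created directly from a dict.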
a_dict={'20071001':678.98,'20071002':34556.89,'20071003':3748758.88}
series04=Series(a_dict)
series04.index
series04.values
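series04['20071001']   # get a value by its index label
series04.iloc[0]   # positional access still works; plain series04[0] on a string index is deprecated in recent pandas
NumPy array operations all remain available on a Series, and the mapping between index labels and values is preserved when they are applied.
Applying NumPy array operations to a Series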
series04
series04[series04>10000]
series04/10
series01=Series([1,2,3,4])
np.exp(series01)
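As a quick check that the index survives array operations, the sketch below reuses the `series04` values from above and inspects the labels after each step. `np.log10` is chosen here only for illustration.

```python
import numpy as np
import pandas as pd

series04 = pd.Series({'20071001': 678.98,
                      '20071002': 34556.89,
                      '20071003': 3748758.88})

large = series04[series04 > 10000]   # boolean filter keeps the matching labels
scaled = series04 / 10               # element-wise arithmetic keeps every label
logged = np.log10(series04)          # a NumPy ufunc returns a Series with the same index

print(list(large.index))
print(list(scaled.index))
print(type(logged).__name__)
```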
scores=Series({'Tom':89,'John':88,'Merry':96,'Max':65})
scores
new_index=['Tom','Max','Joe','John','Merry']
scores=Series(scores,index=new_index)
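The new index has one extra label, which produces a missing value; missing values are represented as NaN.
The pandas functions isnull and notnull can be used to detect missing values in a Series.
Both return a boolean Series.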
pd.isnull(scores)
pd.notnull(scores)
The result is a Series of index label and True/False pairs.
scores[pd.isnull(scores)]   # keep only the missing entries
Joe NaN
scores[pd.notnull(scores)]   # keep only the non-missing entries
Tom 89.0
Max 65.0
John 88.0
Merry 96.0
dtype: float64
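The same re-labelling can also be written with `Series.reindex`, which is a sketch of an alternative rather than what the text above uses. The helper names `reindexed`, `missing_labels`, and `n_missing` are assumptions made for this example.

```python
import pandas as pd

scores = pd.Series({'Tom': 89, 'John': 88, 'Merry': 96, 'Max': 65})
new_index = ['Tom', 'Max', 'Joe', 'John', 'Merry']

# reindex conforms the data to the new labels; 'Joe' has no value, so it becomes NaN.
reindexed = scores.reindex(new_index)

# Labels whose value is missing, and how many there are.
missing_labels = list(reindexed[reindexed.isnull()].index)
n_missing = int(reindexed.isnull().sum())

print(missing_labels, n_missing)
print(reindexed.dtype)  # the integers are upcast to float64 so that NaN can be stored
```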
Automatic alignment of Series
Arithmetic between different Series automatically aligns the data that share the same index label.
product_num=Series([23,45,67,89],index=['p3','p1','p2','p5'])
product_price_table=Series([9.98,2.34,4.56,5.67,8.78],index=['p1','p2','p3','p4','p5'])
product_sum=product_num*product_price_table
Index p4 gets NaN, because p4 is missing from product_num.
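If NaN is not wanted for labels that appear on one side only, `Series.mul` accepts a `fill_value`. The sketch below assumes the first price was meant to be 9.98, so that five prices match the five labels, and treats a missing quantity as 0.

```python
import pandas as pd

product_num = pd.Series([23, 45, 67, 89], index=['p3', 'p1', 'p2', 'p5'])
# Assumption: 9.98 is the intended first price (five values for five labels).
product_price_table = pd.Series([9.98, 2.34, 4.56, 5.67, 8.78],
                                index=['p1', 'p2', 'p3', 'p4', 'p5'])

# Plain multiplication aligns on labels; 'p4' exists on one side only, so it is NaN.
product_sum = product_num * product_price_table

# With fill_value=0 the missing quantity for 'p4' is treated as 0 instead.
product_sum_filled = product_num.mul(product_price_table, fill_value=0)

print(product_sum)
print(product_sum_filled)
```

Whether 0 is a sensible stand-in depends on the data; for prices it usually would not be.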
The name attribute of a Series and its index
Both the Series object and its index have a name attribute, which can be set by assignment.
product_num.name='ProductNums'
product_num.index.name='ProductType'
product_num
DataFrame
Creating a DataFrame from a two-dimensional array
Creating a DataFrame from a dict
Index objects
df01=DataFrame([['Tom','Merry','John'],[76,98,199]])
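Produces a 2x3 table with automatically generated column labels 0, 1, 2 and row labels 0, 1.
df02=DataFrame([['Tom',76],['Merry',98],['John',100]])
Produces a 3x2 table, again with automatically generated column and row labels.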
arr=np.array([['Tom',76],['Merry',98],['John',100]])
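df03=DataFrame(arr,columns=['name','score'])   # sets the column labels to name and score
df04=DataFrame(arr,index=['one','two','three'],columns=['name','score'])
Custom row labels (index) and column labels (columns); without them the defaults are the integers 0, 1, 2, ...
Creating a DataFrame from a dict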
data={"apart":['1001','1002','1003','1001'],"profits":[567.87,987.87,873,498.87],"year":[2001,2001,2001,2000]}
df=DataFrame(data)
The column labels are apart, profits, and year.
df.index   # row index
df.columns   # column index
df.values   # underlying array of values
df=DataFrame(data,index=['one','two','three','four'])
Sets the row index.
df.index
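Index objects: both Series and DataFrame objects have index objects.
They hold the axis labels and other metadata. Through the index you can read values from a Series or DataFrame, or assign a new value at a given position.
Automatic alignment of Series and DataFrame objects is carried out through the index.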
series02=Series([34.56,23.34,45.66,98.08],index=['2001','2002','2003','2004'])
series02['2003']
series02['2002':'2004']   # label slices include the right endpoint, unlike Python list slices
Selecting values from series02:
series02['2001':]
series02[:'2003']
series02['2002']
series02[:'2002']
series02
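The difference between label-based and position-based slicing is easy to miss, so here is a small side-by-side sketch. It uses the explicit `.loc` and `.iloc` accessors; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

series02 = pd.Series([34.56, 23.34, 45.66, 98.08],
                     index=['2001', '2002', '2003', '2004'])

# Label-based slice: both endpoints are included.
by_label = series02.loc['2002':'2004']

# Position-based slice: the stop position is excluded, as with Python lists.
by_position = series02.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```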
Selecting values from a DataFrame
df
df['year']
df.iloc[0]   # first row (.ix was removed in pandas 1.0; use .iloc for positions or .loc for labels)
df=DataFrame(data)
df
df['pdn']=np.nan
Adds a new column to the DataFrame.
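In current pandas releases rows are selected with `.iloc` (by position) or `.loc` (by label), since `.ix` was removed in pandas 1.0. The following sketch reuses the `data` dict from above; the value 1.5 assigned to a single cell is made up for illustration.

```python
import numpy as np
import pandas as pd

data = {"apart": ['1001', '1002', '1003', '1001'],
        "profits": [567.87, 987.87, 873, 498.87],
        "year": [2001, 2001, 2001, 2000]}
df = pd.DataFrame(data, index=['one', 'two', 'three', 'four'])

first_row = df.iloc[0]        # select a row by position
same_row = df.loc['one']      # select the same row by label

df['pdn'] = np.nan            # new column filled with missing values
df.loc['two', 'pdn'] = 1.5    # assign a single cell by row and column label

print(first_row)
print(df)
```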
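Basic pandas functionality:
Reindexing, dropping entries from an axis, index selection and filtering, arithmetic and data alignment, function application and mapping, sorting and ranking, axis indexes with duplicate labels.
Statistics: correlation and covariance
Unique values, value counts, and membership.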
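Common statistical methods:
count   number of non-NA values
describe   summary statistics for each column
min/max   minimum and maximum values
argmin, argmax   integer positions of the minimum and maximum values (idxmin/idxmax return the index labels)
quantile   sample quantile
sum   sum of the values
mean   mean of the values
median   median
mad   mean absolute deviation (removed in pandas 2.0)
var   sample variance
std   sample standard deviation
cumsum   cumulative sum
cummin, cummax   cumulative minimum and maximum
cumprod   cumulative product
pct_change   percent change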
df.describe()
df.count()   # counts the non-NA values in each column
df.count(axis=1)   # with axis=1 the count runs across each row instead
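A few of the methods listed above, applied to a tiny made-up Series so the results can be checked by hand. The data and variable names are assumptions for this sketch, and the mean absolute deviation is computed manually because `Series.mad` is not available in pandas 2.x.

```python
import pandas as pd

s = pd.Series([10, 20, 15, 30], index=['a', 'b', 'c', 'd'])

total = s.sum()              # 75
running = s.cumsum()         # 10, 30, 45, 75
running_max = s.cummax()     # 10, 20, 20, 30
top_label = s.idxmax()       # 'd', the label of the maximum
change = s.pct_change()      # NaN, 1.0, -0.25, 1.0

# Mean absolute deviation: average distance of each value from the mean.
mad = (s - s.mean()).abs().mean()   # 6.25

print(total, top_label, mad)
```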
df=DataFrame({"GDP":[12,23,34,45,56],"air_temperature":[23,25,26,27,30]},index=['2001','2002','2003','2004','2005'])
df
df.corr()   # correlation matrix
df.cov()   # covariance matrix
df['GDP'].corr(df['air_temperature'])   # correlation coefficient
df['GDP'].cov(df['air_temperature'])   # covariance
series=Series([13,13.4,13.5,13.6,13.7],index=['2001','2002','2003','2004','2005'])
series
df.corrwith(series)   # correlation of each column with series
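To show what `cov` and `corr` compute, the sketch below rebuilds both by hand for the GDP and air_temperature columns, using the sample (n-1) denominator and the Pearson formula, which are the pandas defaults. The names `cov_manual` and `corr_manual` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"GDP": [12, 23, 34, 45, 56],
                   "air_temperature": [23, 25, 26, 27, 30]},
                  index=['2001', '2002', '2003', '2004', '2005'])

x, y = df['GDP'], df['air_temperature']

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson correlation: covariance divided by the product of the sample standard deviations.
corr_manual = cov_manual / (x.std() * y.std())

print(cov_manual, x.cov(y))
print(corr_manual, x.corr(y))
```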
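Unique values, value counts, membership
unique returns an array of the unique values in a Series.
value_counts counts how often each value occurs in a Series.
isin tests vectorized set membership and is used to select a subset of the data in a Series or in a DataFrame column.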
ser=Series(['a','b','c','a','a','b','c'])
ser
ser.unique()
df=DataFrame({'orderId':['1001','1002','1003','1004'],'orderAmt':[345.67,34.23,456.77,334.55],'menberId':['a1001','b1002','a1001','a1001']})
df
df['menberId'].unique()
ser
ser.value_counts()   # value counts
ser.value_counts(ascending=False)   # descending order, which is also the default
mask=ser.isin(['b','c'])
ser[mask]
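The same three methods work on a DataFrame column. This sketch reuses the order table from above (keeping its `menberId` column name as written) and uses `isin` to filter rows; `members`, `counts`, and `a_orders` are names assumed for the example.

```python
import pandas as pd

df = pd.DataFrame({'orderId': ['1001', '1002', '1003', '1004'],
                   'orderAmt': [345.67, 34.23, 456.77, 334.55],
                   'menberId': ['a1001', 'b1002', 'a1001', 'a1001']})

members = df['menberId'].unique()        # distinct values, in order of first appearance
counts = df['menberId'].value_counts()   # sorted by frequency, most common first

# isin builds a boolean mask; indexing with it keeps the matching rows.
a_orders = df[df['menberId'].isin(['a1001'])]

print(members)
print(counts)
print(a_orders)
```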
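Handling missing data: detecting, filtering, filling
dropna   filters axis labels according to whether they contain missing data.
fillna   fills missing data with a specified value or an interpolation method.
isnull   shows which values are missing
notnull   shows which values are not missing
df=DataFrame([['Tom',np.nan,456.67,'M'],['Merry',34,4567.34,np.nan],['John',23,np.nan,'M'],['Joe',18,342.45,'F']],columns=['name','age','salary','gender'])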
df
df.isnull()
df.notnull()
series=Series([1,2,3,4,np.nan,5])
series.dropna()
data=DataFrame([[1.,3.4,4.],[np.nan,np.nan,np.nan],[np.nan,4.5,6.7]])
data
data.dropna()
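By default, any row that contains a missing value is dropped.
data.dropna(how='all')
Drops only the rows in which every value is missing.
data[4]=np.nan
data
data.dropna(axis=1,how='all')   # drops the columns in which every value is missing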
df=DataFrame(np.random.randn(7,3))
df.loc[:4,1]=np.nan   # .loc replaces the removed .ix; label slices include the endpoint
df.loc[:2,2]=np.nan
df.fillna(0)
df.fillna({1:0.5,2:-1,3:-2})
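Because the frame above is filled with random numbers, the effect of each `fillna` call is hard to verify by eye. The sketch below uses a small fixed frame instead; the values are made up, and filling with the column mean is shown as one common choice rather than a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 3.0],
                   [np.nan, np.nan, 6.0],
                   [7.0, 8.0, np.nan]])

filled_zero = df.fillna(0)                    # one value for every gap
filled_per_col = df.fillna({0: -1, 1: 0.5})   # per-column values; column 2 is left untouched
filled_mean = df.fillna(df.mean())            # each gap gets its own column's mean
rows_kept = df.dropna(how='all')              # no row here is entirely NaN, so all rows stay

print(filled_per_col)
print(filled_mean)
```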
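Hierarchical indexing
Multiple index levels on a single axis
Hierarchical indexing lets pandas handle higher-dimensional data in a lower-dimensional form.
With hierarchical indexing, data can be aggregated by level.
Hierarchical indexing on a Series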
data=Series([988.44,95859,3949.44,32445.44,234.45],index=[['2001','2001','2001','2002','2002'],['苹果','香蕉','西瓜','苹果','西瓜']])
data
data.index.names=['年份','水果类别']
data

df=DataFrame({'year':[2001,2001,2002,2002,2003],'fruit':['apple','banana','apple','banana','apple'],'production':[2345,3423,4556,4455,543],'profits':[2334.44,4455.55,5566.77,77865.556,3345.55]})
df
new_df=df.set_index(['year','fruit'])
new_df
Aggregating by level
new_df.groupby(level='year').sum()   # sum(level=...) was removed in pandas 2.0
new_df.groupby(level='fruit').sum()
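In pandas 2.x, aggregation by index level is written with `groupby(level=...)`. The sketch below rebuilds the frame from above, totals it by each level, and also uses `unstack` to pivot the inner level into columns, which illustrates how a two-level index stores two-dimensional data in a one-dimensional Series. The names `by_year`, `by_fruit`, and `wide` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'year': [2001, 2001, 2002, 2002, 2003],
                   'fruit': ['apple', 'banana', 'apple', 'banana', 'apple'],
                   'production': [2345, 3423, 4556, 4455, 543],
                   'profits': [2334.44, 4455.55, 5566.77, 77865.556, 3345.55]})
new_df = df.set_index(['year', 'fruit'])

by_year = new_df.groupby(level='year').sum()     # totals per year
by_fruit = new_df.groupby(level='fruit').sum()   # totals per fruit

# unstack moves the 'fruit' level into columns; missing combinations become NaN.
wide = new_df['production'].unstack('fruit')

print(by_year)
print(by_fruit)
print(wide)
```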

posted @ 2022-05-07 09:59 老牛小茂
