Summary statistics methods

| Method | Description |
| --- | --- |
| count | Number of non-NA values |
| describe | Compute summary statistics for a Series or for each DataFrame column |
| min, max | Compute minimum and maximum values |
| argmin, argmax | Compute the integer index locations at which the minimum or maximum is obtained |
| idxmin, idxmax | Compute the index labels at which the minimum or maximum is obtained |
| quantile | Compute sample quantile (from 0 to 1) |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median (50th percentile) of values |
| mad | Mean absolute deviation from the mean |
| var | Sample variance of values |
| std | Sample standard deviation of values |
| skew | Sample skewness of values |
| kurt | Sample kurtosis of values |
| cumsum | Cumulative sum of values |
| cummin, cummax | Cumulative minimum and maximum of values |
| cumprod | Cumulative product of values |
| diff | Compute first differences (useful for time series) |
| pct_change | Compute percent changes |
import numpy as np
import pandas as pd
from pandas import Series
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
# Sum of each column
df.sum()
one 9.25
two -5.80
dtype: float64
# Sum of each row
df.sum(axis=1)
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
| Option | Description |
| --- | --- |
| axis | Axis to reduce over; axis=1 reduces across each row, the default (axis=0) reduces down each column |
| skipna | Exclude missing values; True by default |
| level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |
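A small sketch of how the skipna option behaves on the same df built above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# skipna=True (the default): NA values are ignored in the reduction
col_sums = df.sum()

# skipna=False: any NA in the reduced slice makes the result NA
row_means = df.mean(axis=1, skipna=False)
print(col_sums)
print(row_means)
```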
# Return the index labels of the minimum values
df.idxmin()
one d
two b
dtype: object
# Cumulative sums down each column
df.cumsum()
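The cumsum call above is shown without its output; on the same frame, the accumulation skips over NA values but leaves them in place, roughly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
cs = df.cumsum()
# NA entries stay NA; later values keep accumulating past them:
#       one  two
# a    1.40  NaN
# b    8.50 -4.5
# c     NaN  NaN
# d    9.25 -5.8
print(cs)
```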
### Summary analysis
df.describe()
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
# For non-numeric data, describe produces an alternative summary
obj = Series(['a','a','b','c']*4)
obj.describe()
count 16
unique 3
top a
freq 8
dtype: object
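For numeric data, describe also accepts a percentiles parameter to control which quantiles it reports (a sketch; the data here is illustrative):

```python
import pandas as pd

s = pd.Series(range(1, 11), dtype=float)
# Report the 10% and 90% quantiles instead of only the default quartiles
stats = s.describe(percentiles=[0.1, 0.9])
print(stats)
```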
Correlation and covariance
# Requires: pip install pandas-datareader
import pandas_datareader.data as web
# Fetch stock data remotely
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')
all_data
{'AAPL':                 High       Low      Open     Close       Volume  Adj Close
 Date
 1999-12-31      3.674107  3.553571  3.604911  3.671875   40952800.0   2.467498
 2000-01-03      4.017857  3.631696  3.745536  3.997768  133949200.0   2.686497
 2000-01-04      3.950893  3.613839  3.866071  3.660714  128094400.0   2.459998
 ...                  ...       ...       ...       ...          ...        ...
 2009-12-29     30.388571 29.818571 30.375713 29.871429  111301400.0  20.073582
 2009-12-30     30.285715 29.758572 29.832857 30.234285  103021100.0  20.317423
 2009-12-31     30.478571 30.080000 30.447144 30.104286   88102700.0  20.230061
 [2516 rows x 6 columns],
 ...}
price = pd.DataFrame({tic:data['Adj Close'] for tic, data in all_data.items()})
volume = pd.DataFrame({tic:data['Volume'] for tic, data in all_data.items()})
# Compute the percent change of the prices
returns = price.pct_change()
# Take the last 5 rows
returns.tail()
AAPL IBM MSFT
Date
2009-12-24 0.034340 0.004385 0.002587
2009-12-28 0.012294 0.013326 0.005484
2009-12-29 -0.011861 -0.003477 0.007058
2009-12-30 0.012147 0.005461 -0.013698
2009-12-31 -0.004300 -0.012597 -0.015504
The corr method of Series computes the correlation of the overlapping, non-NA, index-aligned values of two Series; cov computes the covariance.
# Correlation between MSFT and IBM
returns.MSFT.corr(returns.IBM)
0.49253706924390156
# Covariance between MSFT and IBM
returns.MSFT.cov(returns.IBM)
0.00021557776646297303
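Since the live Yahoo data changes over time, here is a reproducible check of the relationship corr = cov / (std_x · std_y) on synthetic returns (the data below is made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.normal(size=500))
y = 0.5 * x + pd.Series(rng.normal(size=500))  # correlated with x by construction

corr = x.corr(y)
cov = x.cov(y)
# Correlation is just the covariance normalized by both standard deviations
print(corr, cov, cov / (x.std() * y.std()))
```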
The corr and cov methods of DataFrame return a full matrix of the correlations or covariances between every pair of columns:
returns.corr()
AAPL IBM MSFT
AAPL 1.000000 0.412392 0.422852
IBM 0.412392 1.000000 0.492537
MSFT 0.422852 0.492537 1.000000
# Passing a Series to corrwith returns the correlation of each column with IBM
returns.corrwith(returns.IBM)
AAPL 0.412392
IBM 1.000000
MSFT 0.492537
dtype: float64
# Correlations between the percent changes and the trading volumes
returns.corrwith(volume)
AAPL -0.057665
IBM -0.006592
MSFT -0.016101
dtype: float64
returns.cov()
AAPL IBM MSFT
AAPL 0.001030 0.000254 0.000309
IBM 0.000254 0.000369 0.000216
MSFT 0.000309 0.000216 0.000519
Unique values: unique returns the array of distinct values; value_counts returns a Series whose index holds the unique values and whose values are their frequencies.
obj = Series(['c','a','d','a','a','b','b','c','c'])
uniques = obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
# The returned array can be sorted in place
uniques.sort()
uniques
array(['a', 'b', 'c', 'd'], dtype=object)
# Count occurrences of each value; sorted by frequency in descending order for convenience
obj.value_counts()
a 3
c 3
b 2
d 1
dtype: int64
# The top-level pd.value_counts accepts any one-dimensional array
pd.value_counts(obj.values)
a 3
c 3
b 2
d 1
dtype: int64
Membership: isin computes a boolean array indicating whether each value is contained in the passed set of values.
mask = obj.isin(['b','c'])
# First compute the membership mask, a boolean Series
mask
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
# Select the True entries
obj[mask]
0 c
5 b
6 b
7 c
8 c
dtype: object
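A related tool for membership-style problems is Index.get_indexer, which maps each value to its integer position in an array of distinct values (a sketch with made-up data):

```python
import pandas as pd

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])

# Integer position of each element of to_match within unique_vals
indexer = pd.Index(unique_vals).get_indexer(to_match)
print(indexer)  # [0 2 1 1 0 2]
```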
data = pd.DataFrame({'Qu1': [100, 300, 400, 300, 400],
                     'Qu2': [200, 300, 100, 200, 300],
                     'Qu3': [100, 500, 200, 400, 400]})
data
Qu1 Qu2 Qu3
0 100 200 100
1 300 300 500
2 400 100 200
3 300 200 400
4 400 300 400
# Value counts for each column; combinations that never occur become NaN, filled with 0
p = data.apply(pd.value_counts).fillna(0)
p
Qu1 Qu2 Qu3
100 1.0 1.0 1.0
200 0.0 2.0 1.0
300 2.0 2.0 0.0
400 2.0 0.0 2.0
500 0.0 0.0 1.0
Handling missing data

NA handling methods

| Method | Description |
| --- | --- |
| dropna | Filter axis labels based on whether each label's values have missing data, with a threshold to tune the tolerance for missing values |
| fillna | Fill in missing data with a specified value or an interpolation method |
| isnull | Return an object of booleans indicating which values are missing |
| notnull | Negation of isnull |
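A quick sketch of isnull/notnull from the table above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

mask = s.isnull()  # True where the value is missing
print(mask)
# Boolean selection with notnull keeps the non-missing values, same as dropna
print(s[s.notnull()].equals(s.dropna()))
```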
Filtering out missing data
from numpy import nan as NA
data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
# By default, drops every row containing any NA
data.dropna()
0 1 2
0 1.0 6.5 3.0
# Passing how='all' drops only the rows that are all NA
data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
data[4]=NA
data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
# axis=1 with how='all' drops the columns that are all NA
data.dropna(axis=1,how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
df = pd.DataFrame(np.random.randn(7, 5))
# .ix is removed in modern pandas; .loc with these integer labels is inclusive
df.loc[:4, 1] = NA
df.loc[:2, 2] = NA
df
0 1 2 3 4
0 0.758567 NaN NaN -0.064858 -0.385678
1 -0.275586 NaN NaN -0.184934 -0.253343
2 -1.872585 NaN NaN -1.539924 0.794054
3 1.092201 NaN 0.250026 -0.654255 -0.016992
4 0.625871 NaN -1.418505 1.141008 3.188408
5 -0.714581 0.423811 -0.799328 -1.010573 -0.959299
6 0.887836 1.412723 -0.405043 -0.417018 -1.114318
# thresh=5 keeps only the rows that contain at least 5 non-NA values
df.dropna(thresh=5)
0 1 2 3 4
5 -0.714581 0.423811 -0.799328 -1.010573 -0.959299
6 0.887836 1.412723 -0.405043 -0.417018 -1.114318
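Besides thresh, dropna also takes a subset argument to consider only certain columns when deciding what to drop (a sketch with made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 1.0],
                   'c': [1.0, 2.0, 3.0]})

# Keep rows that have at least 2 non-NA values
kept = df.dropna(thresh=2)
# Drop only the rows where column 'a' is NA
by_a = df.dropna(subset=['a'])
print(kept)
print(by_a)
```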
#### Filling in missing data (if you don't want to filter it out)
df.fillna(0)
# Calling fillna with a dict fills a different value into each column
df.fillna({1: 'col1', 2: 'col2'})
0 1 2 3 4
0 0.758567 col1 col2 -0.064858 -0.385678
1 -0.275586 col1 col2 -0.184934 -0.253343
2 -1.872585 col1 col2 -1.539924 0.794054
3 1.092201 col1 0.250026 -0.654255 -0.016992
4 0.625871 col1 -1.41851 1.141008 3.188408
5 -0.714581 0.423811 -0.799328 -1.010573 -0.959299
6 0.887836 1.41272 -0.405043 -0.417018 -1.114318
# The calls above all return new objects; to modify the original data in place, pass inplace=True
df.fillna({1: 'col1', 2: 'col2'}, inplace=True)
# Filling backward and forward (bfill, ffill)
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[1:3, 1] = NA
df.iloc[3:5, 2] = NA
df
0 1 2
0 2.040458 -2.276695 -1.038916
1 0.427078 NaN 0.001678
2 1.798042 NaN -0.839205
3 -0.433214 -0.312427 NaN
4 0.041802 1.356883 NaN
5 -0.904390 -1.030643 1.507598
df.fillna(method='ffill',limit=1)
0 1 2
0 2.040458 -2.276695 -1.038916
1 0.427078 -2.276695 0.001678
2 1.798042 NaN -0.839205
3 -0.433214 -0.312427 -0.839205
4 0.041802 1.356883 NaN
5 -0.904390 -1.030643 1.507598
# You can also fill with a statistic of the Series, such as its mean or median
data = Series([1,NA,3.5])
data.fillna(data.mean())
0 1.00
1 2.25
2 3.50
dtype: float64