pandas: DataFrame
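A DataFrame is a tabular data structure that holds an ordered collection of columns.
A DataFrame can be thought of as a dictionary of Series that all share one index.
Creating a DataFrame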
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [5, 6, 7, 8]})

In [3]: df
Out[3]:
   one  two
0    1    5
1    2    6
2    3    7
3    4    8

In [4]: df2 = pd.DataFrame({'one': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
   ...:                     'tow': pd.Series([4, 5, 6], index=['a', 'c', 'e'])})

In [5]: df2
Out[5]:
   one  tow
a  1.0  4.0
b  2.0  NaN
c  3.0  5.0
d  4.0  NaN
e  NaN  6.0

In [6]: df3 = pd.DataFrame({'one': pd.Series([1, 2, 3, 4], index=list('abcd')),
   ...:                     'tow': pd.Series([4, 5, 6], index=list('acd'))})

In [7]: df3
Out[7]:
   one  tow
a    1  4.0
b    2  NaN
c    3  5.0
d    4  6.0
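The NaN entries in df2 and df3 come from index alignment: when a DataFrame is built from Series, the rows are matched by label rather than by position. The following is a minimal sketch of that behaviour; the names s1, s2 and aligned are illustrative and not part of the original session.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data)
s1 = pd.Series([1, 2, 3, 4], index=list('abcd'))
s2 = pd.Series([4, 5, 6], index=list('ace'))

aligned = pd.DataFrame({'one': s1, 'tow': s2})

# The row index is the union of both Series' labels
print(list(aligned.index))           # ['a', 'b', 'c', 'd', 'e']

# Labels missing from a Series become NaN in that column
print(aligned['tow'].isna().sum())   # 2
print(aligned['one'].isna().sum())   # 1

# Each column is a Series that shares the DataFrame's index
print(aligned['one'].index.equals(aligned.index))  # True
```

Because NaN is a floating-point value, a column that receives NaN during alignment is stored as float, which is why 'one' prints as 1.0 in df2 but as 1 in df3.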
Reading and writing CSV files
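# First cd into the directory that contains the CSV file.
import pandas as pd

# Method 1: parse the CSV file into a DataFrame
df = pd.read_csv('filename.csv')

# Method 2: read the raw text with the built-in open();
# this gives a plain string, not a DataFrame
with open('filename.csv') as f:
    text = f.read()

# Write the DataFrame to a new CSV file
df.to_csv('newfilename.csv')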
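A CSV round trip can be tried without touching the disk. The sketch below assumes an in-memory buffer (io.StringIO) as a stand-in for a real file; the variable names are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [5, 6, 7, 8]})

# With no path, to_csv returns the CSV text; index=False leaves out the row labels
csv_text = df.to_csv(index=False)

# read_csv accepts any file-like object, so StringIO stands in for a file
restored = pd.read_csv(io.StringIO(csv_text))
print(list(restored.columns))    # ['one', 'two']
print(restored['two'].tolist())  # [5, 6, 7, 8]

# By default to_csv also writes the index, which comes back as an unnamed column
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(list(with_index.columns))  # ['Unnamed: 0', 'one', 'two']
```

If the extra column is unwanted, either write with index=False or read with index_col=0.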
pandas: Viewing DataFrame data
Common attributes and methods for viewing data
index: get the row index

In [29]: df2
Out[29]:
   one  tow
a  1.0  4.0
b  2.0  NaN
c  3.0  5.0
d  4.0  NaN
e  NaN  6.0

In [30]: df2.index
Out[30]: Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

columns: get the column index

In [32]: df2.columns
Out[32]: Index([u'one', u'tow'], dtype='object')

values: get the values as an array

In [33]: df2.values
Out[33]:
array([[ 1.,  4.],
       [ 2., nan],
       [ 3.,  5.],
       [ 4., nan],
       [nan,  6.]])

T: transpose (swap rows and columns)

In [35]: df2.T
Out[35]:
       a    b    c    d    e
one  1.0  2.0  3.0  4.0  NaN
tow  4.0  NaN  5.0  NaN  6.0

describe(): get quick summary statistics

In [34]: df2.describe()
Out[34]:
            one  tow
count  4.000000  3.0
mean   2.500000  5.0
std    1.290994  1.0
min    1.000000  4.0
25%    1.750000  4.5
50%    2.500000  5.0
75%    3.250000  5.5
max    4.000000  6.0
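The same attributes can be checked programmatically. This is a small sketch that rebuilds df2 from the creation example; shape is an extra attribute not covered above and is used here only to make the effect of T visible.

```python
import pandas as pd

df2 = pd.DataFrame({'one': pd.Series([1, 2, 3, 4], index=list('abcd')),
                    'tow': pd.Series([4, 5, 6], index=list('ace'))})

print(df2.shape)         # (5, 2)
print(df2.T.shape)       # (2, 5)
print(type(df2.values))  # <class 'numpy.ndarray'>

# describe() skips NaN: 'one' has 4 valid values, 'tow' has 3
stats = df2.describe()
print(stats.loc['count', 'one'], stats.loc['count', 'tow'])  # 4.0 3.0
print(stats.loc['mean', 'one'], stats.loc['mean', 'tow'])    # 2.5 5.0
```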
The name attribute of each DataFrame column: the column name
rename(columns={'old column name': 'new column name'})

In [37]: df2
Out[37]:
   one  tow
a  1.0  4.0
b  2.0  NaN
c  3.0  5.0
d  4.0  NaN
e  NaN  6.0

In [38]: df2.rename(columns={'one': 'first'})
Out[38]:
   first  tow
a    1.0  4.0
b    2.0  NaN
c    3.0  5.0
d    4.0  NaN
e    NaN  6.0
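Note that rename returns a new DataFrame and leaves the original untouched, so the result has to be assigned if the new names should persist. A minimal sketch with a two-row stand-in for df2 (the data and the name 'second' are illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({'one': [1.0, 2.0], 'tow': [4.0, None]}, index=['a', 'b'])

renamed = df2.rename(columns={'one': 'first'})
print(list(renamed.columns))  # ['first', 'tow']
print(list(df2.columns))      # ['one', 'tow']  (original unchanged)

# Assign the result back to keep the new name
df2 = df2.rename(columns={'tow': 'second'})
print(list(df2.columns))      # ['one', 'second']
```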
pandas: DataFrame indexing and slicing
A DataFrame has both a row index and a column index.
Selecting by label
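df = pd.read_csv('601318.csv')
df['open']                           # select a single column
df[['open', 'high']]                 # fancy indexing: select several columns
df['open'][0]                        # value of the open column in the row labelled 0
df[0:10]                             # rows at positions 0 through 9 (the end is excluded)
df[0:10][['date', 'close']]          # rows at positions 0 through 9, columns 'date' and 'close'
df.loc[:, ['open', 'close', 'low']]  # all rows, columns 'open', 'close' and 'low'
df.loc[:, 'open':'close']            # all rows, every column from 'open' through 'close' inclusive
df.loc[0, 'open']                    # row labelled 0, column 'open'
df.loc[0:10, ['open', 'low']]        # rows labelled 0 through 10 (the end is included), columns 'open' and 'low'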
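The difference between plain slicing and .loc slicing is easy to miss, so here is a small sketch. The data is hypothetical and only stands in for 601318.csv, whose actual contents and column order are not shown in the original.

```python
import pandas as pd

# Hypothetical price data standing in for 601318.csv
df = pd.DataFrame({
    'date':  ['2007-03-01', '2007-03-02', '2007-03-05', '2007-03-06'],
    'open':  [21.0, 20.5, 19.8, 20.1],
    'high':  [21.5, 20.9, 20.3, 20.8],
    'low':   [20.4, 19.9, 19.5, 19.9],
    'close': [20.6, 20.0, 20.2, 20.7],
})

# Plain [] slicing is positional and excludes the end: rows 0 and 1
print(len(df[0:2]))      # 2

# .loc slicing is label-based and includes the end: rows 0, 1 and 2
print(len(df.loc[0:2]))  # 3

# A column label slice returns every column between the endpoints, inclusive
print(list(df.loc[:, 'open':'close'].columns))  # ['open', 'high', 'low', 'close']

# .loc[row, column] does the lookup in one step, unlike the chained df['open'][0]
print(df.loc[0, 'open'])  # 21.0
```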
Selecting by position (integer index)
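df.iloc[3]                       # row at position 3 (the fourth row)
df.iloc[3, 3]                    # value at row position 3, column position 3
df.iloc[0:3, 4:6]                # rows at positions 0 through 2, columns at positions 4 and 5
df.iloc[1:5, :]                  # rows at positions 1 through 4, all columns
df.iloc[[1, 2, 4], [0, 3, 6]]    # rows at positions 1, 2, 4 and columns at positions 0, 3, 6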
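Positions are counted from 0 and ignore the labels entirely, which is easiest to see on a frame whose index is not 0, 1, 2, and so on. A minimal sketch with hypothetical data (the labels 'w' to 'z' and the 'volume' column are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame(
    {'open': [21.0, 20.5, 19.8, 20.1],
     'close': [20.6, 20.0, 20.2, 20.7],
     'volume': [100, 200, 300, 400]},
    index=['w', 'x', 'y', 'z'],
)

print(df.iloc[3].name)          # z  (position 3 is the fourth row)
print(df.iloc[3, 1])            # 20.7  (fourth row, second column)
print(df.iloc[0:3, 1:3].shape)  # (3, 2)  (end positions are excluded)
print(list(df.iloc[[0, 2], [0, 2]].index))    # ['w', 'y']
print(list(df.iloc[[0, 2], [0, 2]].columns))  # ['open', 'volume']
```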
Filtering with boolean values
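df[df['open'] > 20]                                 # rows where the open column is greater than 20
df[df < 50]                                         # keep values less than 50; all other cells become NaN (assumes every column is numeric)
df[df['date'].isin(['2007-03-01', '2007-03-06'])]   # rows whose date is in ['2007-03-01', '2007-03-06']
df[df < 50].fillna(0)                               # cells that fail the condition show as NaN; fillna(0) replaces them with 0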
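Boolean filtering works in two steps: a comparison produces a mask of True and False values, and the mask then selects rows or cells. The sketch below uses hypothetical data, and it applies the whole-frame comparison only to the numeric columns, on the assumption that a string column such as 'date' cannot be compared with a number.

```python
import pandas as pd

# Hypothetical price data
df = pd.DataFrame({
    'date':  ['2007-03-01', '2007-03-02', '2007-03-05', '2007-03-06'],
    'open':  [21.0, 20.5, 19.8, 20.1],
    'close': [20.6, 20.0, 20.2, 20.7],
})

# A column comparison yields a boolean Series; indexing with it keeps the True rows
mask = df['open'] > 20
print(mask.tolist())  # [True, True, False, True]
print(len(df[mask]))  # 3

# isin() tests membership in a list of values
print(df[df['date'].isin(['2007-03-01', '2007-03-06'])].index.tolist())  # [0, 3]

# A whole-frame mask keeps the shape and turns non-matching cells into NaN
prices = df[['open', 'close']]
masked = prices[prices < 20.5]
print(masked.isna().sum().sum())          # 4
print(masked.fillna(0)['open'].tolist())  # [0.0, 0.0, 19.8, 20.1]
```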