Python Data Structures: pandas (Part 1)
No preamble; let's get straight to the material.
I. Data structures
II. Basic usage
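1. Creating a Series object: a Series is similar to a one-dimensional array. Below, one is built from a list-like sequence.
Note: a Series consists of data and an index. When printed, the index is on the left and the data on the right. If no index is supplied, a default integer index is created automatically.
import pandas as pd

ser_obj = pd.Series(range(10, 20))
print('type(ser_obj):\n', type(ser_obj))  # the object is a pandas Series
print('ser_obj=\n', ser_obj)
Output:
type(ser_obj):
<class 'pandas.core.series.Series'>
ser_obj=
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64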
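To make the point about automatic indexes concrete, here is a minimal sketch that contrasts the default index with an explicitly supplied one. The names `default_ser` and `labeled_ser` and the labels 'a', 'b', 'c' are hypothetical, chosen only for illustration.

```python
import pandas as pd

# No index given: pandas generates the positions 0..n-1 as the index.
default_ser = pd.Series([10, 11, 12])

# Explicit index: the labels below are made up for this example.
labeled_ser = pd.Series([10, 11, 12], index=['a', 'b', 'c'])

print(list(default_ser.index))  # [0, 1, 2]
print(list(labeled_ser.index))  # ['a', 'b', 'c']
print(labeled_ser['b'])         # 11
```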
2. Getting the values and the index:
print(ser_obj)  # display all the data
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
print(type(ser_obj))  # type of the object: <class 'pandas.core.series.Series'>
print(ser_obj.values)  # the underlying values: [10 11 12 13 14 15 16 17 18 19]
print(type(ser_obj.values))  # type of the values: <class 'numpy.ndarray'>
print(ser_obj.index)  # the index object: RangeIndex(start=0, stop=10, step=1)
print(type(ser_obj.index))  # type of the index: <class 'pandas.core.indexes.range.RangeIndex'>
print(ser_obj.items())  # a lazy iterator of (index, value) pairs, e.g. <zip object at 0x000000000B8DEAC8>
print(type(ser_obj.items()))  # <class 'zip'>
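Because items() returns a lazy zip object, printing it shows only a memory address. A small sketch of how its contents could be inspected follows; the variable name `pairs` is an assumption of this example.

```python
import pandas as pd

ser_obj = pd.Series(range(10, 20))

# items() yields (index label, value) pairs lazily, so it has to be
# materialized (or iterated) before its contents become visible.
pairs = list(ser_obj.items())
print(len(pairs))  # 10

# Iterating directly works as well; head(2) keeps the output short.
for label, value in ser_obj.head(2).items():
    print(label, value)
```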
3. Previewing data
print(ser_obj.head(3))
0 10
1 11
2 12
A look at the source of head():
def head(self, n=5):  # by default, the first 5 rows
"""
Return the first `n` rows.
This function returns the first `n` rows for the object based
on position. It is useful for quickly testing if your object
has the right type of data in it.
Parameters
----------
n : int, default 5
Number of rows to select.
Returns
-------
obj_head : type of caller
The first `n` rows of the caller object.
See Also
--------
pandas.DataFrame.tail: Returns the last `n` rows.
Examples
--------
>>> df = pd.DataFrame({'animal':['alligator', 'bee', 'falcon', 'lion',
... 'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
Viewing the first 5 lines
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
Viewing the first `n` lines (three in this case)
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
4. Getting data by index
print(ser_obj[0]) #10
print(ser_obj[8]) #18
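With the default index, label 0 and position 0 happen to coincide, which hides the difference between label-based and position-based lookup. The sketch below uses a deliberately reversed index (hypothetical labels) to separate the two; `.loc` and `.iloc` are the explicit pandas accessors for each.

```python
import pandas as pd

# Hypothetical Series whose integer labels are in reverse order,
# so that label and position no longer agree.
ser = pd.Series([10, 11, 12], index=[2, 1, 0])

print(ser.loc[0])   # 12 -> the value whose label is 0
print(ser.iloc[0])  # 10 -> the value at position 0
```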
5. The correspondence between index and data is preserved in the results of array operations
print(ser_obj*2)
0 20
1 22
2 24
3 26
4 28
5 30
6 32
7 34
8 36
9 38
print(ser_obj[ser_obj>15])
6 16
7 17
8 18
9 19
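The same behavior can be checked step by step. In this sketch (the names `mask`, `filtered`, and `doubled` are illustrative), the boolean mask and the arithmetic both keep the original index labels attached to their values.

```python
import pandas as pd

ser_obj = pd.Series(range(10, 20))

mask = ser_obj > 15        # boolean Series with the same index as ser_obj
filtered = ser_obj[mask]   # keeps only the rows where the mask is True
doubled = filtered * 2     # element-wise arithmetic, labels unchanged

print(filtered.index.tolist())  # [6, 7, 8, 9]
print(doubled.tolist())         # [32, 34, 36, 38]
print(doubled.index.tolist())   # [6, 7, 8, 9]
```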
6. Building a Series from a dict
# build a Series from a dict: the keys become the index
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5, 2004: 19.9, 2005: 20.2, 2006: 22.6}
ser_obj2 = pd.Series(year_data)
print(ser_obj2.head())  # prints the first 5 rows by default
2001 17.8
2002 20.1
2003 16.5
2004 19.9
2005 20.2
print(ser_obj2.index)  # print the index of ser_obj2
Int64Index([2001, 2002, 2003, 2004, 2005, 2006], dtype='int64')
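When building from a dict, an explicit `index` argument can also be passed. As a rough sketch with a shortened, illustrative copy of the data: keys listed in the index are selected, and labels missing from the dict come out as NaN.

```python
import pandas as pd

# Shortened, illustrative version of the year_data dict.
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}

# 2004 is not a key in year_data, so its value becomes NaN;
# 2001 is not in the requested index, so it is dropped.
ser = pd.Series(year_data, index=[2002, 2003, 2004])

print(ser.index.tolist())   # [2002, 2003, 2004]
print(ser.isna().tolist())  # [False, False, True]
```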
7. Setting the name attribute
ser_obj.name =
ser_obj.index.name =
ser_obj2.name = 'temp'  # name the Series 'temp'
ser_obj2.index.name = 'year'  # name the index 'year'
print(ser_obj2.head())  # print the first 5 rows
print(ser_obj2.name)  # print the name of the Series
print(ser_obj2.index.name)  # print the name of the index
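One place where these names become visible is when a Series is converted to a DataFrame. The following sketch uses a shortened copy of the data; `frame` is an illustrative variable name.

```python
import pandas as pd

ser_obj2 = pd.Series({2001: 17.8, 2002: 20.1})
ser_obj2.name = 'temp'
ser_obj2.index.name = 'year'

# to_frame() turns the Series into a one-column DataFrame:
# the Series name becomes the column label, the index name is kept.
frame = ser_obj2.to_frame()

print(frame.columns.tolist())  # ['temp']
print(frame.index.name)        # year
```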
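8. The pandas DataFrame data structure
(1) Similar to a multi-dimensional array or tabular data
(2) Each column can hold a different data type
(3) The index includes a row index and a column index
(4) A DataFrame can be built from an ndarray
import numpy as np
import pandas as pd

array = np.random.rand(5, 4)  # 5x4 array of random floats in [0, 1)
print(array)
df_obj = pd.DataFrame(array)  # convert the array into a DataFrame
print(df_obj.head())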
[[0.16638712 0.7711124 0.72202224 0.2714576 ]
[0.39650865 0.01447041 0.41879748 0.27559135]
[0.46626184 0.67238444 0.72607271 0.93931229]
[0.41514637 0.23213519 0.68909139 0.83395236]
[0.84700412 0.3739937 0.64183245 0.64426823]]
0 1 2 3
0 0.166387 0.771112 0.722022 0.271458
1 0.396509 0.014470 0.418797 0.275591
2 0.466262 0.672384 0.726073 0.939312
3 0.415146 0.232135 0.689091 0.833952
4 0.847004 0.373994 0.641832 0.644268
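In the output above, both the row and column labels default to 0, 1, 2, and so on. A minimal sketch of supplying column labels explicitly follows; the column names and the deterministic `np.arange` data are assumptions made so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random array: 5 rows, 4 columns.
array = np.arange(20).reshape(5, 4)

# Hypothetical column names; the row index is still generated automatically.
df = pd.DataFrame(array, columns=['a', 'b', 'c', 'd'])

print(df.shape)             # (5, 4)
print(df.columns.tolist())  # ['a', 'b', 'c', 'd']
print(df.index.tolist())    # [0, 1, 2, 3, 4]
```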
(5) Building a DataFrame from a dict
# build a DataFrame from a dict: each key becomes a column,
# and scalar values are repeated to fill every row
dict_data = {
    'A': 1,
    'B': pd.Timestamp('20190101'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['python', 'java', 'C++', 'C#']),
    'F': 'ChinaHadoop',
}
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())
Result:
A B C D E F
0 1 2019-01-01 1.0 3 python ChinaHadoop
1 1 2019-01-01 1.0 3 java ChinaHadoop
2 1 2019-01-01 1.0 3 C++ ChinaHadoop
3 1 2019-01-01 1.0 3 C# ChinaHadoop
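The table above does not show that each column keeps its own data type. A short sketch using the `dtypes` attribute makes this visible. Only the columns whose dtype was requested explicitly are annotated here, since the dtypes inferred for the other columns can differ between pandas versions.

```python
import numpy as np
import pandas as pd

dict_data = {
    'A': 1,
    'B': pd.Timestamp('20190101'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['python', 'java', 'C++', 'C#']),
    'F': 'ChinaHadoop',
}
df_obj2 = pd.DataFrame(dict_data)

# One dtype per column.
print(df_obj2.dtypes)

print(df_obj2['C'].dtype)  # float32
print(df_obj2['D'].dtype)  # int32
print(df_obj2['E'].dtype)  # category
```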
(6) Getting data by column index (the result is a Series)
df_obj[col_idx] or df_obj.col_idx
dict_data = {
    'A': 1,
    'B': pd.Timestamp('20190101'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['python', 'java', 'C++', 'C#']),
    'F': 'ChinaHadoop',
}
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())  # same table as in (5)
# get data by column index
print(df_obj2['A'])
print(type(df_obj2['A']))  # type of column A: <class 'pandas.core.series.Series'>
print(df_obj2.A)  # attribute-style access to the same column
0 1
1 1
2 1
3 1
Name: A, dtype: int64
<class 'pandas.core.series.Series'>
0 1
1 1
2 1
3 1
Name: A, dtype: int64
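The two access styles are not fully interchangeable. As a hedged illustration with made-up column names: attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method, while bracket access always works.

```python
import pandas as pd

# Hypothetical frame: 'my col' contains a space, and 'count' shares
# its name with the DataFrame.count() method.
df = pd.DataFrame({'A': [1, 2], 'my col': [3, 4], 'count': [5, 6]})

print(df['A'].equals(df.A))   # True: both styles return the same column
print(df['my col'].tolist())  # [3, 4]: only brackets can reach this name
print(callable(df.count))     # True: df.count is the method, not the column
print(df['count'].tolist())   # [5, 6]: brackets still return the column
```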
(7) Adding a column, similar to adding a key-value pair to a dict
df_obj[new_col_idx] = data
df_obj2['G'] = df_obj2['D'] + 4
print(df_obj2)
A B C D E F G
0 1 2019-01-01 1.0 3 python ChinaHadoop 7
1 1 2019-01-01 1.0 3 java ChinaHadoop 7
2 1 2019-01-01 1.0 3 C++ ChinaHadoop 7
3 1 2019-01-01 1.0 3 C# ChinaHadoop 7
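Bracket assignment modifies the DataFrame in place. As a sketch of an alternative, `DataFrame.assign` returns a new frame and leaves the original untouched; the column name 'H' and the reduced one-column frame are assumptions of this example.

```python
import pandas as pd

df = pd.DataFrame({'D': [3, 3, 3, 3]})

# In-place: df itself gains the column 'G'.
df['G'] = df['D'] + 4

# Non-mutating: assign() returns a new frame that also has 'H'.
df_with_h = df.assign(H=df['D'] * 2)

print(df.columns.tolist())         # ['D', 'G']
print(df_with_h.columns.tolist())  # ['D', 'G', 'H']
print(df_with_h['H'].tolist())     # [6, 6, 6, 6]
```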
(8) Deleting a column
del df_obj[col_idx]
# delete a column
del df_obj2['G']
print(df_obj2)
A B C D E F
0 1 2019-01-01 1.0 3 python ChinaHadoop
1 1 2019-01-01 1.0 3 java ChinaHadoop
2 1 2019-01-01 1.0 3 C++ ChinaHadoop
3 1 2019-01-01 1.0 3 C# ChinaHadoop
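Besides `del`, pandas offers two other ways to remove a column, sketched below on a small illustrative frame: `drop(columns=...)` returns a new frame, and `pop` removes the column in place and hands it back as a Series.

```python
import pandas as pd

df = pd.DataFrame({'D': [3, 3], 'G': [7, 7]})

# drop() returns a new DataFrame; df still has both columns afterwards.
without_g = df.drop(columns=['G'])
print(without_g.columns.tolist())  # ['D']
print(df.columns.tolist())         # ['D', 'G']

# pop() removes the column from df and returns it.
popped = df.pop('G')
print(popped.tolist())             # [7, 7]
print(df.columns.tolist())         # ['D']
```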
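9. The Index object
(1) The indexes of both Series and DataFrame are Index objects
print(type(df_obj2))  # type of the DataFrame itself: <class 'pandas.core.frame.DataFrame'>
print(type(ser_obj2))  # type of the Series itself: <class 'pandas.core.series.Series'>
print(isinstance(df_obj2.index, pd.Index))  # True: the DataFrame's index is an Index object
print(isinstance(ser_obj2.index, pd.Index))  # True: the Series' index is an Index object
(2) Immutable: individual index entries cannot be modified, which protects the integrity of the data
# df_obj2.index[0] = 3
# raises TypeError: Index does not support mutable operations
# ser_obj2.index[2] = 1
# raises TypeError: Index does not support mutable operations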
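The error can be observed safely with a try/except block. This sketch, on a shortened illustrative Series, also shows that although single entries cannot be changed, the whole index can be replaced by assigning a new one.

```python
import pandas as pd

ser = pd.Series([17.8, 20.1], index=[2001, 2002])

raised = False
try:
    ser.index[0] = 1999  # item assignment on an Index is not allowed
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# Replacing the entire index is permitted.
ser.index = [1999, 2002]
print(ser.index.tolist())  # [1999, 2002]
```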
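(3) Common Index types
Index
Int64Index, an integer index (removed in pandas 2.0, where a plain Index with dtype int64 is used instead)
MultiIndex, a "hierarchical" index
DatetimeIndex, a timestamp index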
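As a rough sketch of how some of these types can be constructed directly (all labels and dates below are made up for illustration), note that each specialized type is still an instance of the base Index class.

```python
import pandas as pd

# Plain Index over string labels.
plain = pd.Index(['a', 'b', 'c'])

# MultiIndex: each entry is a tuple with one label per level.
multi = pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1)])

# DatetimeIndex: three consecutive days starting on 2019-01-01.
dates = pd.date_range('2019-01-01', periods=3)

print(type(multi).__name__)  # MultiIndex
print(type(dates).__name__)  # DatetimeIndex
print(multi.nlevels)         # 2
print(all(isinstance(i, pd.Index) for i in (plain, multi, dates)))  # True
```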