pandas (2)
pandas 数据类型
赋值
#Series赋值
s = pd.Series([3,-5,7,4],index = ['a','b','c','d'])
#DataFrame 赋值
data = {'Country':['belgium','India','Brazil'],
'Capital':['Brussels','New Delhi','Brasilia'],
'Population':[11190846,1303171035,207847528]}
df = pd.DataFrame(data,cloumns=['Country','Capital','Population'])
数据选择
#选择一个项
s['b']
# -5
#选择多个
df[1:]
#选择第n行,如果已经定义了clonms,还可以直接跟 =['xx','xx']赋值新行
df.loc[n]
选择、布尔下标
By Position 坐标选择:
df.iloc([0],[0])
# `Belgium`
df.iat([0],[0])
# `Belgium`
By Label 标签选择:
df.loc([0],['country'])
df.at([0],['country'])
By Label/Position :
de.ix[2]
# Country Brazil
# Capital rasilia
# Population 207847528
df.ix[:,'Capital']
# 0 Brussels
# 1 Delhi
# 2 Brasilia
df.ix[1,'Capital']
# 'New Delhi'
使用ix方法被提示如下:(ix is deprecated
)
Boolean Indexing 布尔下标(筛选)
s[-(s>1)]
s[(s<-1)|(s>2)]
df[df['Population']>1200000000]
Dropping
s.drop(['a','c'])
df.drop('Country',axis=1)
Sort&Rank 排序
df.sort_index()
df.sort_values(by='Country')
df.rank()
## Retrieving Series/DataFrame Information
### Basic Information
```python
df.shape # (rows,columns)
df.index # Describe index
df.cloumns # Describe DataFrame cloumns
df.info() # Info on DataFrame
df.count() # Number of non-NA values 默认输出每列的项数
Summary 概要
df.sum() #sum of values
df.cumsum() #cummulative sum of values 从上到下的累加,输出一个新的dataframe
df.min()/df.max() #Minimum/maximum values
df.idxmin()/df.idxmax() #Minimum/maximum index values
df.describe() #Summary statistics 所有特征计算汇总统计
df.mean() #Mean of values 平均值(所有int64数据的)
df.median() #Median of values 中间值
Applying Functions 应用函数
f = lambda x : x*2
df.apply(f)
df.applymap(f)
df.apply()
函数只输出 df*2,不改变df的值。此例中博主没发现df.applymap()
和df.apply()
的区别。
Data Alignment 数据对齐
Internal Data Alignment 内部数据对齐
I/O 文件读写
csv文件
pd.read_csv()
pd.to_csv()
Excel文件
pd.read_excel('path')
pd.to_excel('path',sheet_name='name')
#读取单个文件下不同sheets
xlsx = pd.ExcelFile('path')
df = pd.read_excel(xlsx,'sheetname')
SQL Query or Database Table
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
pd.read_sql("SELECT * FROM my_table;",engine)
pd.read_sql_table('my_table',engine)
pd.read_sql_query("SELECT * FROM my_table;",engine)
#生成sql
pd.to_sql('myDf',engine)