Pandas Module 04
**pandas, part two:** [Pandas Module 05](https://www.cnblogs.com/zhangchaocoming/p/11625767.html)
In [6]:
import numpy as np

Why use numpy
Computing the total price of a shopping cart

In [15]:
l1=[2,3,10,5]         # e.g. the quantity of each item
l2=[10,200,150,50]    # e.g. the unit price of each item

In [8]:
l1_np = np.array(l1)
l2_np = np.array(l2)

In [9]:
l1_np

In [10]:
l2_np

In [11]:
l1_np * l2_np    # element-wise multiplication

In [13]:
np.sum(l1_np * l2_np)    # sum of the element-wise products: the cart total
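As a rough comparison of the two approaches, the sketch below computes the same cart total with a plain Python loop, with element-wise numpy multiplication, and with `np.dot`. The names `quantities` and `prices` are assumed readings of `l1` and `l2`; the original does not say which list is which.

```python
import numpy as np

quantities = [2, 3, 10, 5]      # same values as l1 (assumed to be quantities)
prices = [10, 200, 150, 50]     # same values as l2 (assumed to be unit prices)

# Pure-Python version: pair the items up and add the products one by one.
total_loop = sum(q * p for q, p in zip(quantities, prices))

# numpy version: one element-wise multiply, then one sum.
total_np = int(np.sum(np.array(quantities) * np.array(prices)))

# np.dot computes the same sum of products in a single call.
total_dot = int(np.dot(quantities, prices))

print(total_loop, total_np, total_dot)
```

All three give the same number; the numpy versions avoid the explicit loop, which matters more as the lists grow.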
In [14]:
import numpy as np
import pandas as pd

Learning Series
Creating a Series

In [17]:
s1 = pd.Series([2,3,4,5,6])    # default integer index 0..4

In [18]:
s1

In [19]:
s2 = pd.Series([2,3,4,5,6],index=['a','b','c','d','e'])    # custom label index

In [20]:
s2

Getting values by index

In [21]:
s1[1]

In [22]:
s2['b']

In [23]:
s2.iloc[1]    # positional access still works alongside the custom labels; use .iloc, since plain s2[1] is deprecated in recent pandas versions

In [24]:
s4 = pd.Series(0,index=['a','b','c'])    # a scalar is broadcast to every label
s4
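To make the difference between label access and positional access explicit, here is a minimal sketch using the same `s2` as above. The variable names other than `s2` are only for illustration.

```python
import pandas as pd

s2 = pd.Series([2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e'])

by_label = s2.loc['b']       # label-based lookup
by_position = s2.iloc[1]     # position-based lookup (second element)

# Slices behave differently: a label slice includes its end label,
# a positional slice excludes its end position.
label_slice = s2.loc['a':'c']
position_slice = s2.iloc[0:2]

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

Spelling out `.loc` or `.iloc` removes any ambiguity about whether a number is meant as a label or a position.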
Features of Series

In [26]:
s1 * 2    # vectorized arithmetic: every element is multiplied by 2

In [28]:
s3 = pd.Series([2,3,4,5,6])
s3

In [29]:
s1 * s3    # element-wise product of two Series

In [30]:
s1>4    # element-wise comparison gives a boolean Series

In [31]:
s1[s1>4]    # boolean indexing: keep only the values greater than 4

In [32]:
abs(s1)

In [34]:
sum(s1)    # sum of all values
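The features above can be collected in one short sketch. It reuses `s1` from the cells above; the combined condition with `&` is an extra illustration that does not appear in the original.

```python
import pandas as pd

s1 = pd.Series([2, 3, 4, 5, 6])

doubled = s1 * 2                       # arithmetic applies to every element
mask = s1 > 4                          # boolean Series, one flag per element
large = s1[mask]                       # keeps only the rows where mask is True
in_range = s1[(s1 > 2) & (s1 < 6)]     # combine conditions with & and parentheses
total = s1.sum()                       # Series method, same result as sum(s1)

print(doubled.tolist())
print(large.tolist(), in_range.tolist())
print(total)
```

The built-in `sum(s1)` works too, but `s1.sum()` stays inside pandas and skips missing values by default.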
Handling missing values

In [35]:
st = pd.Series({'sean':12, 'yang':15, 'cloud':20, 'bella':23})    # a dict also works: its keys become the index
st

In [37]:
obj1 = pd.Series(st, index=['sean', 'yang', 'cloud'])
obj1

In [39]:
obj2 = pd.Series(st, index=['sean', 'yang', 'cloud', 'rocky'])    # 'rocky' is not in st, so its value is NaN
obj2

In [41]:
# Why did the values change from integers to floats?
# Answer: NaN is a float, so to hold it pandas converts the integer values to floats.
type(np.nan)

In [42]:
# What is nan equal to? Nothing, not even itself.
np.nan == np.nan
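The dtype change and the odd behaviour of NaN can be checked directly. This sketch uses `reindex`, which is a swapped-in way of requesting a label that does not exist, rather than the `pd.Series(st, index=...)` form used above.

```python
import numpy as np
import pandas as pd

st = pd.Series({'sean': 12, 'yang': 15, 'cloud': 20, 'bella': 23})

# Ask for a label ('rocky') that st does not have: its value becomes NaN.
obj2 = st.reindex(['sean', 'yang', 'cloud', 'rocky'])

print(st.dtype)      # integer dtype before reindexing
print(obj2.dtype)    # float dtype once a NaN is present

missing = obj2.isna()                     # element-wise missing-value test
nan_equals_itself = (np.nan == np.nan)    # comparison with == never matches NaN
nan_detected = bool(np.isnan(np.nan))     # use isnan / isna to detect it

print(missing.tolist(), nan_equals_itself, nan_detected)
```

Because `==` cannot find NaN, `isna()` (or its alias `isnull()`) is the reliable test for missing values.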
In [45]:
# Fill NaN with the fillna() method (here with the string "好", meaning "good")
obj2.fillna("好")

In [46]:
obj2    # unchanged: fillna() returns a new Series rather than modifying obj2

In [47]:
a = obj2
a

In [48]:
# Drop the rows that contain NaN
obj2.dropna(inplace=True)    # inplace=True modifies the object itself instead of returning a copy

In [49]:
obj2

In [51]:
# isnull() checks whether each value is missing
obj2.isnull()
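A small sketch of the copy-versus-in-place distinction follows. The Series `obj` is rebuilt by hand here with the same labels as `obj2`, and the fill value `0` is an assumption chosen to keep the dtype numeric.

```python
import numpy as np
import pandas as pd

obj = pd.Series([12.0, 15.0, 20.0, np.nan],
                index=['sean', 'yang', 'cloud', 'rocky'])

alias = obj                  # plain assignment: a second name for the same object
filled = obj.fillna(0)       # new Series; obj still contains the NaN
dropped = obj.dropna()       # new Series without the NaN row; obj is untouched

print(alias is obj)
print(int(obj.isna().sum()), filled['rocky'])
print(list(dropped.index))
```

Assigning the result back (`obj = obj.dropna()`) has the same visible effect as `inplace=True` and is often easier to follow in longer code.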
Operations on two Series

In [52]:
# Vectorized operations

In [53]:
s4 = pd.Series({'height':170, 'age':18, 'salary':2000})
s4

In [54]:
s5 = pd.Series({'height':110, 'age':20, 'salary':2367})
s5

In [55]:
s4 + s5    # values are combined by matching index label (the same holds for -, * and /)

In [57]:
s6 = pd.Series({'name':110, 'age':20, 'salary':2367})
s6

In [58]:
s4 + s6    # labels present in only one Series ('height', 'name') give NaN
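When the two indexes do not match, the NaN results can be avoided with `Series.add` and its `fill_value` parameter. This is a minimal sketch with the same `s4` and `s6`; treating a missing side as 0 is an assumption that may or may not suit real data.

```python
import pandas as pd

s4 = pd.Series({'height': 170, 'age': 18, 'salary': 2000})
s6 = pd.Series({'name': 110, 'age': 20, 'salary': 2367})

plain = s4 + s6                       # labels found on one side only become NaN
filled = s4.add(s6, fill_value=0)     # a missing side is treated as 0 instead

print(plain)
print(filled)
```

The result index is the union of both indexes in either case; only the handling of the unmatched labels differs.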
In [59]:
s1

Series indexing

In [62]:
# Get the three values 2, 4, 6
# Fancy indexing: brackets inside brackets, with the inner brackets holding the index labels
s1[[0,2,4]]

In [63]:
arr = np.array([1,2,3,4,5])

In [64]:
arr

In [65]:
arr[[0,2,4]]

Integer indexes

In [68]:
sr = pd.Series(np.arange(10))
sr

In [71]:
sr1 = sr[3:]
sr1

In [73]:
sr1.iloc[0]    # iloc = integer location: selects by position

In [75]:
sr1.loc[3]    # loc selects by index label
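With an integer index, labels and positions are easy to confuse once the Series has been sliced. The sketch below rebuilds `sr1` (using `.iloc` for the slice, an assumed but equivalent spelling here) and shows where the two kinds of lookup diverge.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(10))
sr1 = sr.iloc[3:]                  # labels 3..9 now sit at positions 0..6

first_by_position = sr1.iloc[0]    # position 0 holds the value 3
first_by_label = sr1.loc[3]        # label 3 also holds the value 3
at_position_5 = sr1.iloc[5]        # sixth element: value 8
at_label_5 = sr1.loc[5]            # label 5: value 5

# Label 0 was sliced away, so a label lookup for it fails.
try:
    sr1.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position, first_by_label, at_position_5, at_label_5, label_zero_found)
```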
DataFrame

In [76]:
df = pd.DataFrame({'one':[1,2,3,4],'two':[5,6,7,8]})
df

In [77]:
df['one'].iloc[0]    # select the column first, then the row

Common attributes

In [78]:
df.index    # the row index

In [79]:
df.columns    # the column index

In [80]:
df.T    # transpose rows and columns

In [81]:
df.values    # the data as a NumPy array

In [83]:
df['one'].values[0]

In [85]:
df.describe()    # summary statistics for each numeric column
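For reference, a single cell can also be read in one step with `.loc` or `.iloc` instead of chaining a column selection and a row selection. This sketch uses the same `df`; `to_numpy()` is shown as the alternative to `.values` that the pandas documentation recommends.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [5, 6, 7, 8]})

n_rows, n_cols = df.shape             # (rows, columns)
by_label = df.loc[0, 'one']           # row label 0, column 'one'
by_position = df.iloc[0, 0]           # first row, first column
transposed_shape = df.T.shape         # rows and columns swapped
as_array = df.to_numpy()              # plain NumPy array of the data
mean_one = df.describe().loc['mean', 'one']   # one figure out of the summary table

print(n_rows, n_cols, by_label, by_position, transposed_shape, mean_one)
```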
How data is handled in a company

In [86]:
# 1. A colleague gives you an Excel file or a CSV file
# 2. Read the CSV file with pandas

In [87]:
movies = pd.read_csv('./douban_movie.csv')    # read_csv alone reads all the data in the CSV file
movies

In [89]:
movies.describe()

In [98]:
movies.to_csv('./modify_movies.csv',index=False)    # save the data to a file; index=False leaves the row index out of the file

In [99]:
pd.read_csv('./modify_movies.csv')

In [100]:
movies.head()    # shows the first 5 rows by default; pass a number to head() to change that

In [102]:
movies.head(2)    # passing 2 shows the first two rows

In [103]:
movies.tail()    # shows the last 5 rows by default; tail() also accepts a row count
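The effect of `index=False` is easiest to see by writing the same frame twice and reading both files back. The data below is a hypothetical stand-in (the contents of `douban_movie.csv` are not reproduced here), and the file names are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in data, not the real douban_movie.csv.
movies = pd.DataFrame({'title': ['A', 'B', 'C'], 'score': [8.5, 9.0, 7.2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, 'with_index.csv')
    without_index = os.path.join(tmp_dir, 'without_index.csv')

    movies.to_csv(with_index)                    # row index written as an extra first column
    movies.to_csv(without_index, index=False)    # only the data columns are written

    cols_with_index = list(pd.read_csv(with_index).columns)
    round_trip = pd.read_csv(without_index)

print(cols_with_index)
print(list(round_trip.columns))
print(round_trip.head(2))
```

Without `index=False`, the saved index comes back as an unnamed extra column on the next read, which is usually not wanted.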
Grouping a DataFrame

In [104]:
# Read all the table data on the page at a given URL
res = pd.read_html('https://baike.baidu.com/item/NBA%E6%80%BB%E5%86%A0%E5%86%9B/2173192?fr=aladdin')

In [107]:
champion_res = res[0]    # res is a list of DataFrames; take the first table

In [108]:
champion_res

In [109]:
# 1. Turn the first row of data into the column names
champion_res.iloc[0]

In [110]:
champion_res.columns = champion_res.iloc[0]    # assign the first row's values to the column names

In [112]:
champion_res.head()

In [113]:
champion_res.drop([0],inplace=True)    # drop the first row, which now duplicates the column names

In [114]:
champion_res
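The header clean-up above can be tried without network access on a small hand-made table. Everything in `raw` is hypothetical and only mimics the shape of a scraped table whose header landed in row 0.

```python
import pandas as pd

# Hypothetical table whose real header ended up in the first data row.
raw = pd.DataFrame([
    ['year', 'champion'],
    ['2001', 'Team A'],
    ['2002', 'Team B'],
])

table = raw.copy()
table.columns = table.iloc[0]                   # first row becomes the column names
table = table.drop(0).reset_index(drop=True)    # remove that row and renumber from 0
table.columns.name = None                       # clear the name inherited from the row label

print(list(table.columns))
print(table)
```

Depending on how a page marks up its table, passing `header=0` to `pd.read_html` may make this step unnecessary; whether that applies to the page used above has not been checked here.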
In [116]:
# 1. Count how many championships each team has won
# Idea: group by the champion team (MySQL: GROUP BY); '冠军' is the "champion" column
champion_res.groupby('冠军')    # returns a GroupBy object

In [117]:
champion_res.groupby('冠军').groups    # the row index labels of each team's championships

In [118]:
# 2. Aggregate the data (MySQL: COUNT, SUM)
champion_res.groupby('冠军').size()    # the number of championships each team has won

In [119]:
# 3. Sort the data (MySQL: ORDER BY)
champion_res.groupby('冠军').size().sort_values(ascending=False)    # ascending is the default; ascending=False sorts in descending order
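The group, aggregate, sort pipeline can be sketched on made-up data as follows. The `results` frame and its team names are hypothetical, not the table from the page; `value_counts` is shown as a shorter route to the same counts.

```python
import pandas as pd

# Hypothetical results, not the real table from the page.
results = pd.DataFrame({
    'year': [2001, 2002, 2003, 2004, 2005, 2006],
    'champion': ['Team A', 'Team B', 'Team A', 'Team C', 'Team A', 'Team B'],
})

# Group by team, count the rows in each group, then sort from most to fewest titles.
titles = results.groupby('champion').size().sort_values(ascending=False)

# value_counts() counts and sorts in descending order in one call.
same_via_value_counts = results['champion'].value_counts()

print(titles)
print(same_via_value_counts)
```

`groupby(...).size()` generalizes more easily when other aggregates (sums, means) are needed alongside the count, while `value_counts()` is handy for a quick tally of one column.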