08: Learning pandas for Python Data Analysis
1.1 Introduction to Data Structures
Reference blog: http://www.cnblogs.com/nxld/p/6058591.html
1. Introduction to pandas
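1. pandas has two especially important data structures: the Series and the DataFrame.
2. A Series is similar to a one-dimensional NumPy array. It works with the functions and methods available for one-dimensional arrays, and in addition it lets you fetch data by index label and aligns automatically on the index.
3. A DataFrame is similar to a two-dimensional NumPy array. It likewise works with NumPy array functions and methods, and it offers further flexible uses that are covered later.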
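The points above can be sketched with a tiny example. The sample values and the names `s`, `roots`, `df`, and `col_sums` are made up for illustration; the sketch only shows that a NumPy function applied to a Series keeps the index labels, and that a DataFrame can be handed to NumPy as a two-dimensional array.

```python
import numpy as np
import pandas as pd

# A Series with custom index labels (illustrative values)
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A NumPy ufunc works elementwise on a Series and keeps the index labels
roots = np.sqrt(s)
print(roots["b"])  # label-based access

# A DataFrame can be converted to a 2-D NumPy array for array functions
df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["x", "y", "z"])
col_sums = np.sum(df.to_numpy(), axis=0)  # column sums as a plain ndarray
print(col_sums)
```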
2. Three Ways to Create a Series
1. Create a Series from a one-dimensional array
import numpy as np
import pandas as pd

arr1 = np.arange(10)
print(arr1, type(arr1))
# [0 1 2 3 4 5 6 7 8 9] <class 'numpy.ndarray'>

s1 = pd.Series(arr1)
print(s1, type(s1))
# 0    0
# 1    1
# 2    2
# 3    3
# 4    4
# 5    5
# 6    6
# 7    7
# 8    8
# 9    9
# dtype: int64 <class 'pandas.core.series.Series'>
2. Create a Series from a dict
import numpy as np
import pandas as pd

dic1 = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
s2 = pd.Series(dic1)  # dict keys become the index labels
print(s2, type(s2))
# a    10
# b    20
# c    30
# d    40
# e    50
# dtype: int64 <class 'pandas.core.series.Series'>
3. Create a Series from a row or a column of a DataFrame
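The original gives no code for this third way, so here is a minimal sketch. The frame `df` and the names `col` and `row` are invented for illustration: selecting one column with `df["a"]` or one row with `df.iloc[0]` returns a Series.

```python
import pandas as pd

# Illustrative DataFrame (values are made up)
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

col = df["a"]     # one column -> Series indexed by the row labels
row = df.iloc[0]  # first row  -> Series indexed by the column names

print(type(col), type(row))
print(col)
print(row)
```

Note that the row Series uses the column names as its index, and its `name` attribute is the original row label.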
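3. Three Ways to Create a DataFrame
1. Create a DataFrame from a two-dimensional array
import numpy as np
import pandas as pd

arr2 = np.arange(12).reshape(4, 3)
print(arr2, type(arr2))
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]] <class 'numpy.ndarray'>

df1 = pd.DataFrame(arr2)  # row and column labels default to 0, 1, 2, ...
print(df1, type(df1))
#    0   1   2
# 0  0   1   2
# 1  3   4   5
# 2  6   7   8
# 3  9  10  11 <class 'pandas.core.frame.DataFrame'>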
2.1 Create a DataFrame from a dict
import numpy as np
import pandas as pd

# Dict of lists: each key becomes a column
dic2 = {'a': [1, 2, 3, 4],
        'b': [5, 6, 7, 8],
        'c': [9, 10, 11, 12],
        'd': [13, 14, 15, 16]}
df2 = pd.DataFrame(dic2)
print(df2)
#    a  b   c   d
# 0  1  5   9  13
# 1  2  6  10  14
# 2  3  7  11  15
# 3  4  8  12  16
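import numpy as np
import pandas as pd

# Nested dict: outer keys become columns, inner keys become the row index
dic3 = {'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4},
        'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8},
        'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12}}
df3 = pd.DataFrame(dic3)
print(df3, type(df3))
# (older pandas/Python versions may sort the columns alphabetically)
#    one  two  three
# a    1    5      9
# b    2    6     10
# c    3    7     11
# d    4    8     12 <class 'pandas.core.frame.DataFrame'>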
import pandas as pd

# A list of dicts also works: each dict becomes one row
d = {
    "slagroupcount": [
        {"g_sla": 99.943755250038564, "weight": 20.0, "g_t_v": 19.988751050007714,
         "sla_nums": 14, "id": 1, "name": "Big Data"},
        {"g_sla": 99.994763756058816, "weight": 20.0, "g_t_v": 19.998952751211764,
         "sla_nums": 6, "id": 2, "name": "Infrastructure"},
    ],
    "slacount": 99.611111411465515,
}

result = {}
gcounts = []
subs = []
for i in range(10):
    day_of_result = d
    gcounts.append(float(day_of_result['slacount']))  # "slacount": 99.611111411465515
    subs += day_of_result['slagroupcount']            # slagroupcount is a list of dicts
result['slacount'] = sum(gcounts) / len(gcounts)
print(subs)

df = pd.DataFrame(subs)        # subs = [{...}, {...}, ...]
g = df.groupby('name').mean()  # group by name and average each numeric column
print(g)
# Printed result of g (per-name averages; column order may vary with the pandas version):
#                     g_sla  weight      g_t_v  sla_nums   id
# name
# Big Data        99.943755    20.0  19.988751      14.0  1.0
# Infrastructure  99.994764    20.0  19.998953       6.0  2.0
import pandas as pd

# Part 1: the dict d stands for one record of the result field in the GroupCountResult table
d = {
    "slagroupcount": [
        {"g_sla": 99.943755250038564, "weight": 20.0, "g_t_v": 19.988751050007714,
         "sla_nums": 14, "id": 1, "name": "Big Data"},
        {"g_sla": 99.994763756058816, "weight": 20.0, "g_t_v": 19.998952751211764,
         "sla_nums": 6, "id": 2, "name": "Infrastructure"},
    ],
    "slacount": 99.611111411465515,
}

# Part 2: simulate the average SLA over the last 10 days. The for loop fakes
# fetching 10 records from the GroupCountResult table before averaging them.
gcounts = []
subs = []
for i in range(10):
    day_of_result = d
    gcounts.append(float(day_of_result['slacount']))  # "slacount": 99.611111411465515
    subs += day_of_result['slagroupcount']            # slagroupcount is a list of dicts

df = pd.DataFrame(subs)        # subs = [{...}, {...}, ...]
g = df.groupby('name').mean()  # group by name and average each numeric column
print(g)
# Printed result of g (column order may vary with the pandas version):
#                     g_sla  weight      g_t_v  sla_nums   id
# name
# Big Data        99.943755    20.0  19.988751      14.0  1.0
# Infrastructure  99.994764    20.0  19.998953       6.0  2.0

# Part 3: loop over the pandas result and copy it into a dict
result = {}
result['slagroupcount'] = []
for index, row in g.iterrows():
    # row.name is the index label of the row, i.e. the group name
    result['slagroupcount'].append({'name': row.name,
                                    'id': int(row.id),
                                    'weight': row.weight,
                                    'sla_nums': row.sla_nums,
                                    'g_sla': row.g_sla,
                                    'g_t_v': row.g_t_v})
print(result['slagroupcount'])
# result['slagroupcount'] has the same shape as d['slagroupcount'] above,
# but now holds the 10-day averages for each group.
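As an alternative to the `iterrows()` loop in Part 3, a grouped result can also be turned into a list of dicts with `reset_index()` followed by `to_dict(orient="records")`. This is a minimal sketch with invented sample rows (`subs`, group names "A" and "B"), not the author's method.

```python
import pandas as pd

# Illustrative rows (made-up values, two groups)
subs = [
    {"name": "A", "id": 1, "weight": 20.0, "g_sla": 99.0},
    {"name": "A", "id": 1, "weight": 20.0, "g_sla": 97.0},
    {"name": "B", "id": 2, "weight": 20.0, "g_sla": 95.0},
]

g = pd.DataFrame(subs).groupby("name").mean()

# reset_index() turns the group key back into a regular column,
# and orient="records" yields one dict per row
records = g.reset_index().to_dict(orient="records")
print(records)
```

One difference from the loop: every averaged column, including `id`, comes back as a float, so cast it yourself if you need an integer.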
2.2 Grouping a DataFrame and Computing Values
import pandas as pd

li = [
    {'name': 'Hospital01', 'abbreviation': 'sdhospital', 'domain': '', 'service': 'mongodb', 'sla': 97.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital', 'domain': '', 'service': 'redmine', 'sla': 93.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital', 'domain': '', 'service': 'mongodb', 'sla': 95.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital', 'domain': '', 'service': 'redmine', 'sla': 98.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc', 'domain': '', 'service': 'redmine', 'sla': 87.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc', 'domain': '', 'service': 'mongodb', 'sla': 73.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc', 'domain': '', 'service': 'redmine', 'sla': 55.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc', 'domain': '', 'service': 'mongodb', 'sla': 78.07472},
]

# Step 1: convert the list of dicts to a DataFrame
df = pd.DataFrame(li)

# Step 2: group by name and compute the mean.
# Only the numeric column 'sla' is selected, because recent pandas versions
# raise an error when asked to average string columns.
g = df.groupby('name')[['sla']].mean()
# print(g)
#                  sla
# name
# Hospital01  95.82472
# Hospital02  73.32472

# Step 3: convert the grouped result from step 2 to a dict
print(g.to_dict())
# Approximately (floating-point rounding may add trailing digits):
# {'sla': {'Hospital01': 95.82472, 'Hospital02': 73.32472}}
import pandas as pd

li = [
    {'name': 'Hospital01', 'abbreviation': 'sdhospital', 'domain': '', 'service': 'mongodb', 'sla': 97.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital', 'domain': '', 'service': 'redmine', 'sla': 93.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital', 'domain': '', 'service': 'mongodb', 'sla': 95.07472},
    {'name': 'Hospital01', 'abbreviation': 'sdhospital', 'domain': '', 'service': 'redmine', 'sla': 98.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc', 'domain': '', 'service': 'redmine', 'sla': 87.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc', 'domain': '', 'service': 'mongodb', 'sla': 73.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc', 'domain': '', 'service': 'redmine', 'sla': 55.07472},
    {'name': 'Hospital02', 'abbreviation': 'sysucc', 'domain': '', 'service': 'mongodb', 'sla': 78.07472},
]

# Step 1: convert the list of dicts to a DataFrame
df = pd.DataFrame(li)

# Step 2: group by service, name and abbreviation at the same time.
# Only the numeric column 'sla' is averaged; recent pandas versions raise an
# error if the string column 'domain' is included.
service_name_group = df.groupby(['service', 'name', 'abbreviation'])[['sla']].mean()
# print(service_name_group)
#                                       sla
# service name       abbreviation
# mongodb Hospital01 sdhospital    96.07472
#         Hospital02 sysucc        75.57472
# redmine Hospital01 sdhospital    95.57472
#         Hospital02 sysucc        71.07472

# Step 3: convert the grouped result to a dict (keys are tuples of the group labels)
# print(service_name_group.to_dict())
# {'sla': {('mongodb', 'Hospital01', 'sdhospital'): 96.07472,
#          ('mongodb', 'Hospital02', 'sysucc'): 75.57472,
#          ('redmine', 'Hospital01', 'sdhospital'): 95.57472,
#          ('redmine', 'Hospital02', 'sysucc'): 71.07472}}

# Step 4: reshape that dict into the format we actually want
context = {}
for k, v in service_name_group.to_dict()['sla'].items():
    # k is a (service, name, abbreviation) tuple, v is the mean sla
    context.setdefault(k[0], [])  # {'mongodb': [], 'redmine': []}
    context[k[0]].append({'name': k[1], 'sla': v, 'abbreviation': k[2]})
# print(context)

# d is the final result we want (values approximate):
d = {
    "mongodb": [
        {"abbreviation": "sdhospital", "name": "Hospital01", "sla": 96.07472},
        {"abbreviation": "sysucc", "name": "Hospital02", "sla": 75.57472},
    ],
    "redmine": [
        {"abbreviation": "sdhospital", "name": "Hospital01", "sla": 95.57472},
        {"abbreviation": "sysucc", "name": "Hospital02", "sla": 71.07472},
    ],
}
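The same reshaping can be done without handling tuple keys by hand. A minimal sketch with invented, rounder sample values: flatten the grouped means with `reset_index()`, then group once more by `service` and export each part as records. The variable names `rows`, `means`, and `context` are illustrative only.

```python
import pandas as pd

# Illustrative rows (made-up sla values so the means are easy to check)
rows = [
    {"service": "mongodb", "name": "Hospital01", "abbreviation": "sdhospital", "sla": 90.0},
    {"service": "mongodb", "name": "Hospital01", "abbreviation": "sdhospital", "sla": 100.0},
    {"service": "redmine", "name": "Hospital01", "abbreviation": "sdhospital", "sla": 80.0},
    {"service": "redmine", "name": "Hospital02", "abbreviation": "sysucc", "sla": 70.0},
]
df = pd.DataFrame(rows)

# Mean sla per (service, name, abbreviation), flattened back into columns
means = (df.groupby(["service", "name", "abbreviation"])["sla"]
           .mean()
           .reset_index())

# One list of records per service
context = {
    service: part[["name", "abbreviation", "sla"]].to_dict(orient="records")
    for service, part in means.groupby("service")
}
print(context)
```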
import pandas as pd

li = [
    {'name': 'zhangsan', 'times': 'first', 'math': 88, 'chinese': 82},
    {'name': 'zhangsan', 'times': 'second', 'math': 84, 'chinese': 83},
    {'name': 'zhangsan', 'times': 'third', 'math': 85, 'chinese': 87},
    {'name': 'lisi', 'times': 'first', 'math': 88, 'chinese': 82},
    {'name': 'lisi', 'times': 'second', 'math': 84, 'chinese': 83},
    {'name': 'lisi', 'times': 'third', 'math': 85, 'chinese': 87},
]

# Step 1: convert the list of dicts to a DataFrame
df = pd.DataFrame(li)

# Step 2: group by name and average the numeric score columns
# (the string column 'times' is left out; recent pandas cannot average it)
g = df.groupby('name')[['math', 'chinese']].mean()
# print(g)
#                math  chinese
# name
# lisi      85.666667     84.0
# zhangsan  85.666667     84.0

# Step 3: loop over the pandas result and copy it into a list of dicts
result = []
for index, row in g.iterrows():
    result.append({'name': row.name,        # index label, i.e. the student name
                   'math': int(row.math),   # truncated to an integer
                   'chinese': row.chinese})
# print(result)
ret_li = [
    {"chinese": 84.0, "name": "lisi", "math": 85},
    {"chinese": 84.0, "name": "zhangsan", "math": 85},
]
2.3 Filtering a DataFrame
import json
import pandas as pd

li = [
    {'name': 'zhangsan', 'times': 'first', 'math': 88, 'chinese': 82},
    {'name': 'zhangsan', 'times': 'second', 'math': 84, 'chinese': 83},
    {'name': 'zhangsan', 'times': 'third', 'math': 85, 'chinese': 87},
    {'name': 'lisi', 'times': 'first', 'math': 88, 'chinese': 82},
    {'name': 'lisi', 'times': 'second', 'math': 84, 'chinese': 83},
    {'name': 'lisi', 'times': 'third', 'math': 85, 'chinese': 87},
]

# Step 1: convert the list of dicts to a DataFrame
df = pd.DataFrame(li)

# Step 2: keep only zhangsan's result for the first exam
result = df[(df['name'] == 'zhangsan') & (df['times'] == 'first')]
# Use | instead of & to keep rows where name == 'zhangsan' OR times == 'first':
# result = df[(df['name'] == 'zhangsan') | (df['times'] == 'first')]

# Step 3: copy the filtered rows from step 2 into a list of dicts
filtered = []
for index, row in result.iterrows():
    filtered.append({
        'name': row['name'],
        'exam': row['times'],
        # cast to built-in int so that json can always serialize the value
        'math_score': int(row['math']),
        'chinese_score': int(row['chinese']),
    })
print(json.dumps(filtered))
# [{"name": "zhangsan", "exam": "first", "math_score": 88, "chinese_score": 82}]
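Two other common ways to write such filters are `DataFrame.query()` and `Series.isin()`. The sketch below reuses the same score data as an assumption; `first_zs` and `early` are illustrative names, and both calls should select the same kind of rows as the boolean masks above.

```python
import pandas as pd

df = pd.DataFrame([
    {"name": "zhangsan", "times": "first", "math": 88, "chinese": 82},
    {"name": "zhangsan", "times": "second", "math": 84, "chinese": 83},
    {"name": "zhangsan", "times": "third", "math": 85, "chinese": 87},
    {"name": "lisi", "times": "first", "math": 88, "chinese": 82},
    {"name": "lisi", "times": "second", "math": 84, "chinese": 83},
    {"name": "lisi", "times": "third", "math": 85, "chinese": 87},
])

# query(): the condition is written as a string expression over column names
first_zs = df.query("name == 'zhangsan' and times == 'first'")

# isin(): keep rows whose value is in a given list
early = df[df["times"].isin(["first", "second"])]

print(first_zs)
print(early)
```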
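1.2 Data Index
1. Getting data by position or by index label
import numpy as np
import pandas as pd

# 1. Create a Series from an array
s4 = pd.Series(np.array([1, 2, 3, 4]))
print(s4)
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64

# 2. Give the Series custom index labels
s4.index = ['a', 'b', 'c', 'd']
print(s4)
# a    1
# b    2
# c    3
# d    4
# dtype: int64

# 3. The value can be fetched by position or by label.
# Position access uses .iloc, because plain s4[3] on a labelled index is
# deprecated in recent pandas versions.
print(s4.iloc[3], s4['d'])
# 4 4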
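To make the difference between the two kinds of access explicit, here is a small sketch using `.loc` (labels) and `.iloc` (positions). The Series `s` mirrors the example above but is defined again so the snippet runs on its own.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

by_label = s.loc["d"]        # label-based lookup
by_position = s.iloc[3]      # position-based lookup

label_slice = s.loc["b":"c"]  # label slices include BOTH endpoints
pos_slice = s.iloc[1:3]       # position slices exclude the stop, like Python lists

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```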
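2. Automatic Alignment
import numpy as np
import pandas as pd

s5 = pd.Series(np.array([10, 15, 20, 30]), index=['a', 'b', 'c', 'd'])
s6 = pd.Series(np.array([12, 11, 13, 15]), index=['a', 'c', 'g', 'b'])
print(s5 + s6)
# a    22.0
# b    30.0
# c    31.0
# d     NaN
# g     NaN
# dtype: float64

# Explanation: 'd' in s5 and 'g' in s6 have no matching index label in the
# other Series, so the operation produces two missing values (NaN).
# Note that the arithmetic is aligned on the index labels of the two Series;
# it does not simply combine the values position by position.
# For DataFrames, alignment applies to the row index and also to the
# column index (the variable names).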
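The remark about DataFrames can be illustrated with a short sketch. The frames `df_a` and `df_b` are made up; the second part uses `Series.add` with `fill_value=0`, which treats a label missing on one side as 0 instead of producing NaN.

```python
import numpy as np
import pandas as pd

# Two frames that overlap only in row 'r2' and column 'y' (illustrative data)
df_a = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
df_b = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["r2", "r3"])

total = df_a + df_b  # aligned on both row labels and column labels
print(total)         # only the cell ('r2', 'y') has values on both sides

# Same Series as in the text, but missing labels are filled with 0
s5 = pd.Series(np.array([10, 15, 20, 30]), index=["a", "b", "c", "d"])
s6 = pd.Series(np.array([12, 11, 13, 15]), index=["a", "c", "g", "b"])
filled = s5.add(s6, fill_value=0)
print(filled)
```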
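1.3 Statistical Analysis
import numpy as np
import pandas as pd

np.random.seed(1234)
d1 = pd.Series(2 * np.random.normal(size=100) + 3)  # Series of 100 random values

d1.count()        # number of non-null elements
d1.min()          # minimum
d1.max()          # maximum
d1.idxmin()       # index label of the minimum, like which.min in R
d1.idxmax()       # index label of the maximum, like which.max in R
d1.quantile(0.1)  # 10% quantile
d1.sum()          # sum
d1.mean()         # mean
d1.median()       # median
d1.mode()         # mode(s)
d1.var()          # variance
d1.std()          # standard deviation
(d1 - d1.mean()).abs().mean()  # mean absolute deviation (Series.mad() was removed in pandas 2.0)
d1.skew()         # skewness
d1.kurt()         # kurtosis
d1.describe()     # several descriptive statistics at once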
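Because the random data above makes the results hard to check by eye, here is a small worked sketch on five hand-picked values (an assumption for illustration). Note that pandas computes `var()` and `std()` with `ddof=1` by default, i.e. the sample variance.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 7])

mean = s.mean()                 # (1 + 2 + 2 + 3 + 7) / 5 = 3.0
median = s.median()             # middle value = 2.0
modes = s.mode().tolist()       # most frequent value(s) = [2]
var = s.var()                   # sample variance: 22 / (5 - 1) = 5.5
mad = (s - mean).abs().mean()   # mean absolute deviation: 8 / 5 = 1.6
pos_max = s.idxmax()            # index label of the maximum = 4

print(mean, median, modes, var, mad, pos_max)
print(s.describe())             # count, mean, std, min, quartiles, max
```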
Author: 学无止境
Source: https://www.cnblogs.com/xiaonq