Python for Data Analysis - 01 - Introduction
1. Load the gov data and read it (time_zones is a list holding the tz values): method 1
import json

path = 'B:/test/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
#print(records[0])
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
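The loading pattern can be tried without the data file. The sketch below is only an illustration: the three sample records are made up, and only the tz and a keys mirror the real file. It shows that records lacking a tz key are skipped, while an empty tz string is kept.

```python
import io
import json

# Hypothetical sample data: one JSON object per line, like the bitly file.
sample = io.StringIO(
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}\n'
    '{"a": "GoogleMaps/RochesterNY"}\n'
    '{"tz": "", "a": "Mozilla/4.0"}\n'
)

# Parse every line into a dict.
sample_records = [json.loads(line) for line in sample]

# Keep tz only for records that actually have the key.
sample_time_zones = [rec["tz"] for rec in sample_records if "tz" in rec]
print(sample_time_zones)
```

The empty string that survives here is why the later cleaning step treats empty values separately from missing ones.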
Define custom counting functions that count how many times each value occurs in time_zones:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts
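The second function is shorter because defaultdict(int) supplies 0 for any key it has not seen yet. A minimal sketch of that behaviour, using made-up time zone values:

```python
from collections import defaultdict

counts = defaultdict(int)

# Looking up a missing key creates it with int() == 0, so += 1 works directly.
for tz in ["Asia/Tokyo", "Europe/London", "Asia/Tokyo"]:
    counts[tz] += 1

print(dict(counts))

# Note: merely reading an unseen key also inserts it with value 0.
unseen = counts["Nowhere"]
print(unseen, "Nowhere" in counts)
```

With a plain dict the same effect needs either the if/else branch or counts.get(x, 0) + 1.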
For example, count the occurrences of America/New_York in time_zones (coun is a dict):
coun = get_counts2(time_zones)
print(coun['America/New_York'])
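Output the n most frequent entries (n defaults to 10), sorted in ascending order of count. count_dict is a dict, and its items() method yields the (key, value) pairs: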
def count_value(count_dict, n=10):
    value_count_dict = [(count, tz) for tz, count in count_dict.items()]
    value_count_dict.sort()
    return value_count_dict[-n:]

print(count_value(coun, 10))
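As an alternative to sorting the whole list, the top entries can be picked with heapq.nlargest from the standard library. This is a sketch, not the method used in these notes: top_counts is a hypothetical helper name and the example dict is made up. Unlike count_value, it returns (tz, count) pairs with the largest count first.

```python
import heapq

def top_counts(count_dict, n=10):
    """Return the n (tz, count) pairs with the highest counts, largest first."""
    return heapq.nlargest(n, count_dict.items(), key=lambda item: item[1])

# Made-up counts for illustration.
example = {"America/New_York": 3, "": 2, "Asia/Tokyo": 1, "Europe/London": 5}
print(top_counts(example, 2))
```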
Load the gov data and read it (time_zones is a list holding the tz values): method 2
Use the Counter class from collections:
import json
from collections import Counter

path = 'B:/test/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
coun = Counter(time_zones)
print(coun.most_common(10))
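Two Counter behaviours are worth seeing on a small list (the values below are made up): most_common returns (value, count) pairs from most to least frequent, and looking up an unseen key returns 0 without adding it.

```python
from collections import Counter

tz_sample = ["Asia/Tokyo", "", "Asia/Tokyo", "Europe/London", "", "Asia/Tokyo"]
tz_counter = Counter(tz_sample)

# The two most frequent values, largest count first.
top_two = tz_counter.most_common(2)
print(top_two)

# An unseen key counts as 0 and is not inserted into the Counter.
print(tz_counter["Nowhere"], "Nowhere" in tz_counter)
```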
2. Use a pandas DataFrame to view the data
from pandas import DataFrame, Series
import pandas as pd
import numpy as np

frame = DataFrame(records)
print(frame['tz'][:10])
This prints a summary view of tz. The Series returned by frame['tz'] also has a value_counts() method that does the counting:
print(frame['tz'][:10])
tz_counts = frame['tz'].value_counts()
print(tz_counts[:10])
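A small sketch of value_counts() on made-up data, assuming pandas is installed. By default the result is sorted from most to least frequent and missing values are left out; passing dropna=False includes them.

```python
import pandas as pd

# Made-up tz values, including one missing value and one empty string.
tz = pd.Series(["Asia/Tokyo", None, "Asia/Tokyo", "", "Europe/London", "Asia/Tokyo"])

counts = tz.value_counts()                  # missing values are excluded
counts_with_na = tz.value_counts(dropna=False)  # missing values get their own row

print(counts)
print(counts_with_na)
```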
3. Generate a plot with matplotlib
fillna replaces the missing values (here with 'Missing'); empty strings are then replaced with 'Unknown':
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
plot draws the chart: kind='bar' selects a bar chart, and rot is the rotation angle of the tick labels.
tz_counts[:5].plot(kind='bar', rot=0)
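The cleaning step in this section can be checked on a small Series before plotting. This is a sketch with made-up values, assuming pandas is installed; it leaves out the plot itself so that it runs without matplotlib.

```python
import pandas as pd

# Made-up tz values: one missing (None) and one empty string.
tz = pd.Series(["Asia/Tokyo", None, "", "Asia/Tokyo"])

clean = tz.fillna("Missing")      # missing values become 'Missing'
clean[clean == ""] = "Unknown"    # empty strings become 'Unknown'

cleaned_counts = clean.value_counts()
print(clean.tolist())
print(cleaned_counts)
```

After this step every row falls into a named category, so none are silently dropped from the bar chart.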
4. Split each value of the a field and take the first token; frame.a.dropna() and frame['a'].dropna() are equivalent
results = Series([x.split()[0] for x in frame.a.dropna()])
For a Series, dropna returns a Series containing only the non-null values and their index labels.
Classify the rows by whether a contains Windows, then count each group:
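A minimal sketch of dropna followed by the split, on made-up values of a and assuming pandas is installed. Note that dropna keeps the original index labels, whereas the Series built from the list comprehension gets a fresh 0, 1, ... index.

```python
import pandas as pd

# Made-up values for the a field, with one missing entry.
agents = pd.Series(["Mozilla/5.0 (Windows NT 6.1)", None, "GoogleMaps/RochesterNY"])

non_null = agents.dropna()  # keeps index labels 0 and 2
first_tokens = pd.Series([x.split()[0] for x in non_null])

print(non_null.index.tolist())
print(first_tokens.tolist())
```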
cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe.a.str.contains('Windows'),
'Windows','Not Windows')
print(operating_system[:5])
by_tz_os = cframe.groupby(['tz',operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
print(agg_counts[:10])
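The grouping step can be followed on a four-row frame. This is a sketch with made-up rows, assuming pandas and NumPy are installed: size() counts the rows in each (tz, label) group, unstack() turns the label level into columns, and fillna(0) fills the combinations that never occur.

```python
import numpy as np
import pandas as pd

# Made-up rows that mirror the tz and a columns.
df = pd.DataFrame({
    "tz": ["Asia/Tokyo", "Asia/Tokyo", "Europe/London", "Asia/Tokyo"],
    "a": [
        "Mozilla/5.0 (Windows NT 6.1)",
        "Mozilla/5.0 (Macintosh)",
        "Mozilla/4.0 (Windows NT 5.1)",
        "Mozilla/5.0 (Windows NT 6.1)",
    ],
})

# Label each row by whether its a value mentions Windows.
os_labels = np.where(df["a"].str.contains("Windows"), "Windows", "Not Windows")

# Count rows per (tz, label), then pivot the labels into columns.
os_counts = df.groupby(["tz", os_labels]).size().unstack().fillna(0)
print(os_counts)
```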