Python for Data Analysis - 01 - Introduction

1. Load the gov data and read it (time_zones is a list holding the tz values) - Method 1

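import json
path = 'B:/test/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
# Each line of the file is one JSON object; the with block closes the file afterwards
with open(path) as f:
    records = [json.loads(line) for line in f]
#print(records[0])
# Keep only the records that actually contain a 'tz' key
time_zones = [rec['tz'] for rec in records if 'tz' in rec]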
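
To see what the two list comprehensions do without needing the data file, here is a minimal sketch that parses a few hand-written JSON lines. The sample lines and the names sample_lines, sample_records and sample_time_zones are made up for illustration; they are not taken from the real file.

```python
import json

# Hypothetical sample lines; the real file holds one JSON object per line.
sample_lines = [
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}',
    '{"tz": "", "a": "Opera/9.80"}',
    '{"a": "GoogleMaps/RochesterNY"}',  # this record has no "tz" key
]

sample_records = [json.loads(line) for line in sample_lines]

# The `if 'tz' in rec` filter drops records without the key,
# but keeps records whose tz is an empty string.
sample_time_zones = [rec['tz'] for rec in sample_records if 'tz' in rec]
print(sample_time_zones)  # ['America/New_York', '']
```

Note that an empty tz string still passes the filter, which is why the empty values need separate handling later on.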

Define a custom counting function to tally how many times each value occurs in time_zones

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
    
from collections import defaultdict
def get_counts2(sequence):
    counts =defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

For example, count the occurrences of America/New_York in time_zones (coun is a dict):

coun = get_counts2(time_zones)
print (coun['America/New_York'])
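
As a small illustration of why the defaultdict version needs no if/else, the sketch below counts a hand-made list. The sample values are assumptions for demonstration only.

```python
from collections import defaultdict

counts = defaultdict(int)
for tz in ['America/New_York', 'Asia/Tokyo', 'America/New_York', '']:
    # A missing key is created with int() == 0 before the += 1 runs
    counts[tz] += 1

print(counts['America/New_York'])  # 2

# Caveat: counts['Europe/London'] would insert that key with value 0.
# Use .get() to look up a key without modifying the dict.
print(counts.get('Europe/London'))  # None
print(len(counts))  # 3
```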

Output the top n entries (10 here). count_dict is a dict, and its items() method yields the (key, value) pairs:

def count_value(count_dict,n=10):
    value_count_dict = [(count,tz) for tz,count in count_dict.items()]
    value_count_dict.sort()
    return value_count_dict[-n:]

print(count_value(coun,10))
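
The sort works because tuples compare element by element, so (count, tz) pairs are ordered by count first. A minimal sketch on made-up counts, together with heapq.nlargest as one possible alternative that avoids sorting the whole list (sample_counts is a hypothetical dict, not the real data):

```python
import heapq

sample_counts = {'America/New_York': 3, 'Asia/Tokyo': 1, '': 2, 'Europe/London': 1}

# Sorting (count, tz) pairs orders by count, then by tz name on ties
pairs = sorted((count, tz) for tz, count in sample_counts.items())
print(pairs[-2:])  # [(2, ''), (3, 'America/New_York')]

# Alternative: take the 2 largest by count, largest first
top2 = heapq.nlargest(2, sample_counts.items(), key=lambda kv: kv[1])
print(top2)  # [('America/New_York', 3), ('', 2)]
```

The slice returns the largest entries last, while heapq.nlargest returns them first.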
    

Load the gov data and read it (time_zones is a list holding the tz values) - Method 2

Use the Counter class from collections:

import json
path = 'B:/test/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
from collections import Counter
coun = Counter(time_zones)
print(coun.most_common(10))
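
A minimal sketch of Counter on a hand-made list (the sample values are assumptions), showing that most_common returns (value, count) pairs with the largest count first:

```python
from collections import Counter

sample = ['America/New_York', '', 'America/New_York',
          'Asia/Tokyo', '', 'America/New_York']
c = Counter(sample)

print(c.most_common(2))  # [('America/New_York', 3), ('', 2)]

# Unlike a plain dict, a missing key returns 0 instead of raising KeyError
print(c['Europe/London'])  # 0
```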

2. Use a pandas DataFrame to view the data

from pandas import DataFrame, Series
import pandas as pd
import numpy as np
frame = DataFrame(records)

print (frame['tz'][:10])

Summary view of tz. frame['tz'] is a Series object, and its value_counts() method does the counting

print (frame['tz'][:10])
tz_counts = frame['tz'].value_counts()
print(tz_counts[:10])
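
A small sketch of how value_counts treats missing data, using a hand-made list of records (the sample dicts are assumptions). A record without a tz key becomes NaN in the DataFrame, and value_counts skips NaN unless dropna=False is passed:

```python
import pandas as pd

sample_records = [
    {'tz': 'America/New_York', 'a': 'Mozilla/5.0'},
    {'tz': 'America/New_York', 'a': 'Mozilla/4.0'},
    {'tz': '', 'a': 'Opera/9.80'},
    {'a': 'GoogleMaps/RochesterNY'},  # no tz key -> NaN in the tz column
]
sample_frame = pd.DataFrame(sample_records)

counts = sample_frame['tz'].value_counts()
print(counts)  # NaN is excluded by default

counts_with_nan = sample_frame['tz'].value_counts(dropna=False)
print(counts_with_nan)  # NaN gets its own row
```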

3. Generate a plot with matplotlib

The fillna function replaces missing values (NaN) with 'Missing'; empty strings are labeled 'Unknown'

clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
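
The two cleaning steps can also be chained. The sketch below is one possible variant that uses Series.replace in place of the boolean-mask assignment; the sample Series and the label strings are assumptions for illustration.

```python
import numpy as np
import pandas as pd

tz = pd.Series(['America/New_York', '', np.nan, ''])

# fillna handles NaN; replace swaps exact matches of the empty string
clean = tz.fillna('Missing').replace('', 'Unknown')

print(clean.tolist())
# ['America/New_York', 'Unknown', 'Missing', 'Unknown']
print(clean.value_counts())
```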

Draw with plot: kind='bar' selects a bar chart, and rot is the rotation angle of the axis tick labels

tz_counts[:5].plot(kind = 'bar',rot = 0)
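
A minimal sketch of the same call on made-up counts, for running as a plain script instead of in an interactive session. The Agg backend and the sample values are assumptions; in a notebook the backend line is not needed.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, works without a display
import pandas as pd

sample_counts = pd.Series(
    [3, 2, 1], index=['America/New_York', 'Unknown', 'Asia/Tokyo'])

# plot returns a matplotlib Axes; rot=45 tilts the tick labels by 45 degrees
ax = sample_counts.plot(kind='bar', rot=45)
ax.set_ylabel('count')

# One bar is drawn per entry of the Series
print(len(ax.patches))  # 3
```

With an interactive backend, calling matplotlib.pyplot.show() afterwards displays the figure.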

4. Split each value of the a field and keep the first token. frame.a.dropna and frame['a'].dropna are equivalent

results = Series([x.split()[0] for x in frame.a.dropna()])

dropna: for a Series, dropna returns a Series containing only the non-null data and their index values.
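
A short sketch of dropna followed by the split step, on a hand-made Series (the agent strings are assumptions). It also shows a vectorized variant with the .str accessor, which keeps the original index, whereas building a new Series from a list renumbers it from 0:

```python
import numpy as np
import pandas as pd

agents = pd.Series(['Mozilla/5.0 (Windows NT 6.1)', np.nan,
                    'GoogleMaps/RochesterNY'])

non_null = agents.dropna()
print(non_null.index.tolist())  # [0, 2], the NaN row at index 1 is gone

# List comprehension, as in the text: the new Series is indexed 0, 1
first_tokens = pd.Series([x.split()[0] for x in non_null])
print(first_tokens.tolist())  # ['Mozilla/5.0', 'GoogleMaps/RochesterNY']

# Vectorized variant: same values, original index [0, 2] preserved
first_tokens_vec = non_null.str.split().str[0]
print(first_tokens_vec.index.tolist())  # [0, 2]
```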

Group and count the records by whether a contains Windows or not

cframe = frame[frame.a.notnull()]
operating_system = np.where(cframe.a.str.contains('Windows'),
                            'Windows','Not Windows')
print(operating_system[:5])
by_tz_os = cframe.groupby(['tz',operating_system])
agg_counts = by_tz_os.size().unstack().fillna(0)
print(agg_counts[:10])
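
To make the groupby / size / unstack chain easier to follow, here is a minimal sketch on four made-up rows (the sample frame is an assumption). size() counts rows per (tz, label) pair, unstack() turns the label level into columns, and fillna(0) fills the combinations that never occur:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'tz': ['America/New_York', 'America/New_York',
           'Asia/Tokyo', 'America/New_York'],
    'a': ['Mozilla/5.0 (Windows NT 6.1)', 'Mozilla/5.0 (Macintosh)',
          'Mozilla/5.0 (Windows NT 5.1)', 'Opera/9.80 (Windows NT 6.1)'],
})

# One label per row, as a NumPy array aligned with the rows of sample
os_label = np.where(sample['a'].str.contains('Windows'),
                    'Windows', 'Not Windows')

# Group by a column name and an external array at the same time
agg = sample.groupby(['tz', os_label]).size().unstack().fillna(0)
print(agg)
```

In the result, Asia/Tokyo has no non-Windows rows, so that cell is NaN after unstack() and becomes 0 after fillna(0).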

posted @ 2015-10-19 16:04  Groupe