Python for Data Analysis -- Introduction

Preface

This is one post in a series on working through Python for Data Analysis. It records what I learned from the introductory examples in Chapter 2.

 

Contents

1. Analyzing the 1.usa.gov data from Bitly

(1) Loading the data

import json

if __name__ == "__main__":
    # load the data: one JSON record per line
    path = "../../datasets/bitly_usagov/example.txt"
    with open(path) as data:
        records = [json.loads(line) for line in data]
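The counting functions below all operate on a time_zones list, which this post never shows being built; following the book, it can be extracted from records like this (note that not every record contains a 'tz' field):

# collect the time zone of every record that has one
time_zones = [rec['tz'] for rec in records if 'tz' in rec]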
 

(2) Counting the occurrences of each time zone in the data

1. The verbose way

# define the count function
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
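A quick way to try the function (assuming time_zones was built as shown above; the actual numbers depend on your copy of the dataset):

counts = get_counts(time_zones)
print(len(time_zones))                      # total number of records with a time zone
print(counts.get('America/New_York', 0))    # occurrences of one particular time zone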

2. The simpler way


from collections import defaultdict

# a simpler version of the count function
def simple_get_counts(sequence):
    # defaultdict(int) initializes every missing key's value to zero
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

This version is simpler because it uses defaultdict, which initializes the value of every missing key to 0, so there is no need to check whether a key has already appeared in the dictionary.
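A minimal standalone sketch of that defaultdict behaviour (not part of the book's code):

from collections import defaultdict

d = defaultdict(int)
d['seen'] += 1      # no KeyError: the missing key starts at 0, then becomes 1
print(d['seen'])    # 1
print(d['other'])   # 0, the key is created on first access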

(3) Getting the 10 most common time zones and their counts

1. The verbose way

# get the 10 time zones with the largest counts
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    # list.sort() sorts in ascending order, so the largest counts end up last
    value_key_pairs.sort()
    return value_key_pairs[-n:]

# get the top counts from the counts dictionary
counts = simple_get_counts(time_zones)
top10 = top_counts(counts)

2. The simpler way

from collections import Counter

# a simpler version of the top_counts function
def simple_top_counts(timezone, n=10):
    counter_counts = Counter(timezone)
    return counter_counts.most_common(n)
simple_top_counts(time_zones)

This version is simpler because it uses Counter from the collections module, which counts the occurrences of each value automatically and exposes most_common(), so there is no need to build the counts dictionary first.
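A minimal standalone sketch of Counter and most_common (not part of the book's code):

from collections import Counter

c = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
print(c.most_common(2))   # [('a', 3), ('b', 2)]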

(4) Plotting the time zone counts as a horizontal bar chart with pylab and DataFrame

from pandas import DataFrame, Series
import pandas as pd
import numpy as np
import pylab as pyl

# use the dataframe to show the counts of timezone
def show_timezone_data(records):
    frame = DataFrame(records)
    clean_tz = frame['tz'].fillna("Missing")
    clean_tz[clean_tz == ''] = 'Unknown'
    tz_counts = clean_tz.value_counts()
    tz_counts[:10].plot(kind='barh', rot=0)
    pyl.show()

The resulting figure:

 

At first I could run the code, but the figure never appeared. After looking it up I found the reason: I was using PyCharm, which is an editor that does not run in pylab (interactive) mode, so you have to import the pylab module and call pylab.show() before the figure will be displayed.
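An equivalent fix is to call matplotlib.pyplot directly instead of going through pylab; a minimal sketch, assuming tz_counts has been computed as in show_timezone_data above:

import matplotlib.pyplot as plt

tz_counts[:10].plot(kind='barh', rot=0)
plt.show()   # required outside interactive/pylab mode, e.g. in PyCharm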

(5) Plotting the Windows vs. non-Windows breakdown of each time zone with pylab and DataFrame

 

# use the dataframe to show the Windows / non-Windows breakdown per time zone
def show_timezone_windows(records):
    frame = DataFrame(records)
    # results = Series([x.split()[0] for x in frame.a.dropna()])
    # keep only the records that have a user-agent string
    cframe = frame[frame.a.notnull()]
    operating_system = np.where(cframe['a'].str.contains("Windows"), "Windows", "Not Windows")
    by_tz_os = cframe.groupby(["tz", operating_system])
    agg_counts = by_tz_os.size().unstack().fillna(0)
    # sort the time zones by total count and keep the 10 largest
    indexer = agg_counts.sum(1).argsort()
    count_subset = agg_counts.take(indexer)[-10:]
    count_subset.plot(kind='barh', stacked=True)
    pyl.show()
    # normalize each row so that the two shares sum to 1
    normed_subset = count_subset.div(count_subset.sum(1), axis=0)
    normed_subset.plot(kind='barh', stacked=True)
    pyl.show()
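A small sketch (with made-up user-agent strings, not taken from the dataset) of how np.where combined with str.contains labels each record:

import numpy as np
import pandas as pd

agents = pd.Series(['Mozilla/5.0 (Windows NT 6.1)', 'Mozilla/5.0 (Linux; Android 4.4)'])
print(np.where(agents.str.contains('Windows'), 'Windows', 'Not Windows'))
# -> ['Windows' 'Not Windows']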

 

 

 

The figure before normalization:

 

The figure after normalization:

 

2. Analyzing the MovieLens data

(1) Reading the data

users holds the user information, ratings holds the users' ratings, and movies holds the movie information.

import pandas as pd

# read the three tables; the '::' separator requires the python parser engine
# (otherwise pandas falls back to it with a warning)
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table("../../datasets/movielens/users.dat", sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table("../../datasets/movielens/ratings.dat", sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('../../datasets/movielens/movies.dat', sep="::", header=None, names=mnames, engine='python')

 

(2) Merging the data

The merge uses pandas' merge function, which infers which columns to join on from the overlapping column names of the objects being merged.

# merge the three tables into one DataFrame
data = pd.merge(pd.merge(users, ratings), movies)
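Since merge works out the keys by itself here, it can help to see the same call with the join columns written out explicitly; a sketch of the equivalent merge (the shared key columns are user_id and movie_id):

# the same merge with the join keys spelled out
data = pd.merge(pd.merge(users, ratings, on='user_id'), movies, on='movie_id')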

(3) Mean ratings for the same movie by gender

def mean_score(data):
    '''
    print the mean rating of the same title between genders
    :param data:the data source
    :return:
    '''
    mean_ratings = data.pivot_table(values='rating', index=['title'], columns='gender', aggfunc='mean')
    rating_by_title = data.groupby('title').size()
    # keep only the titles that received at least 250 ratings
    active_titles = rating_by_title.index[rating_by_title >= 250]
    mean_ratings = mean_ratings.loc[active_titles]
    print(mean_ratings)

Note in particular that DataFrame.pivot_table no longer has the rows and cols arguments; the groupings are now passed as index and columns, and the aggregated column has to be named explicitly with values.

# the book's original code
mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
# the current equivalent
mean_ratings = data.pivot_table(values='rating', index=['title'], columns='gender', aggfunc='mean')

(4) Measuring rating disagreement

def diff_score(mean_ratings, data):
    '''
    calculate the rating difference between groups, e.g. male vs. female viewers
    :param mean_ratings: the gender pivot table produced by mean_score
    :param data: the merged data source
    :return:
    '''
    mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
    # sort the table by diff, in ascending order
    sorted_by_diff = mean_ratings.sort_values(by='diff')
    print("sorted by diff between male and female")
    print(sorted_by_diff[:10])
    # find the most divisive movies regardless of gender via the rating standard deviation
    rating_std_by_title = data.groupby('title')['rating'].std()
    rating_std_by_title = rating_std_by_title.sort_values(ascending=False)
    print("sorted by standard deviation of the ratings")
    print(rating_std_by_title[:10])

Note that Series no longer has the order method; it has been replaced by sort_values and sort_index, which do the same job. The difference can be seen in the following code.

# the book's original code
rating_std_by_title.order(ascending=False)
# the current equivalent
rating_std_by_title.sort_values(ascending=False)

 

3. Analyzing baby names

(1) Reading the data

def concat_data():
    '''
    concatenate the names from every year into a single table
    :return: the combined DataFrame, with a year column added
    '''
    years = range(1880, 2011)
    pieces = []
    columns = ['name', 'sex', 'births']
    for year in years:
        path = "../../datasets/babynames/yob%d.txt" % year
        frame = pd.read_csv(path, names=columns)
        frame['year'] = year
        pieces.append(frame)
    data = pd.concat(pieces, ignore_index=True)
    return data

(2) Plotting the total number of births by sex for each year

def gather_function(data):
    total_births = data.pivot_table(values='births', index=['year'], columns='sex', aggfunc=sum)
    print(total_births.tail())
    total_births.plot(title='total births by sex and year')
    pyl.show()

The resulting figure:

(3) Getting the 1000 most common names for each year/sex group

def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]

def prop_function(data):
    names = data.groupby(["year", "sex"]).apply(add_prop)
    # sanity check: the prop values within each group should sum to 1
    # print(np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1))
    grouped = names.groupby(['year', 'sex'])
    top1000 = grouped.apply(get_top1000)
    return top1000

What is notable here is the use of apply to run a custom function on each of the groups produced by groupby.
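One more thing: prop_function calls add_prop, which this post does not show. Following the book, it adds a prop column giving each name's share of the births within its (year, sex) group:

def add_prop(group):
    # prop is the name's fraction of all births in this (year, sex) group
    group['prop'] = group.births / group.births.sum()
    return group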

(4) Analyzing naming trends

def name_trend(data, top1000):
    boys = top1000[top1000.sex == 'M']
    girls = top1000[top1000.sex == 'F']
    total_births = top1000.pivot_table(values='births', index=['year'], columns='name', aggfunc=sum)
    subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
    subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")
    pyl.show()

This plots how often the four names John, Harry, Mary, and Marilyn were given across the years.

 

(5) The increase in naming diversity

There are two ways to corroborate this.

1. Compute the proportion of all births accounted for by the top 1000 names.

def variety_growth(top1000):
    table = top1000.pivot_table(values='prop', index=['year'], columns='sex', aggfunc=sum)
    table.plot(title="sum of top1000.prop by year and sex", yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
    pyl.show()
    

This produces the following figure:

2. For each sex and year, compute how many of the top names are needed to reach a cumulative proportion of 0.5.

def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().searchsorted(q) + 1

def variety_growth(top1000):
    '''
    show the plot of sum of top1000.prop by year and sex
    and the plot of number of the popular names in top 50%
    :param top1000:
    :return:
    '''
    table = top1000.pivot_table(values='prop', index=['year'], columns='sex', aggfunc=sum)
    table.plot(title="sum of top1000.prop by year and sex", yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
    pyl.show()
    diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
    diversity = diversity.unstack('sex')
    print(diversity.head())
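To see why the searchsorted call in get_quantile_count yields the number of names needed, here is a tiny sketch on made-up cumulative proportions:

import numpy as np

cum_prop = np.array([0.20, 0.35, 0.48, 0.55, 0.90])
print(cum_prop.searchsorted(0.5) + 1)   # 4: four names are needed to pass 50%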

The result:

(6) The last-letter revolution

Look at the proportion of births by the last letter of the name, for each year and sex.

import matplotlib.pyplot as plt

def last_name(data):
    '''
    the proportion of births for each last letter, by year and sex
    :param data:
    :return:
    '''
    get_last_letter = lambda x: x[-1]
    last_letters = data.name.map(get_last_letter)
    # merge the last_letters column into data (needed for the pivot_table below)
    data['last_letters'] = last_letters
    table = data.pivot_table(values='births', index=['last_letters'], columns=['sex', 'year'], aggfunc=sum)
    # keep three representative years and normalize within each (sex, year) column
    subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
    letter_prop = subtable / subtable.sum().astype(float)
    fig, axes = plt.subplots(2, 1, figsize=(10, 8))
    letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
    letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
    pyl.show()
    # trend of boys' names ending in d, n and y across all years
    letter_prop = table / table.sum().astype(float)
    dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T
    dny_ts.plot()
    pyl.show()

The resulting figures:

 

When writing this code, note that you have to add the line data['last_letters'] = last_letters to merge the last_letters column into data; otherwise, running the book's code as-is raises an error.

(7) Boy names that became girl names (and vice versa)

 

def name_change_sex(top1000):
    allnames = top1000.name.unique()
    mask = np.array(['lesl' in i.lower() for i in allnames])
    lesley_like = allnames[mask]
    # filter by lesley_like
    filtered = top1000[top1000.name.isin(lesley_like)]
    table = filtered.pivot_table(values='births', index=['year'], columns='sex', aggfunc=sum)
    table = table.div(table.sum(1), axis=0)
    table.plot(style={'M': 'k-', 'F': 'k--'})
    pyl.show()
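The table.div(table.sum(1), axis=0) step normalizes each year's row so that the M and F shares add up to 1; a tiny sketch with made-up numbers:

import pandas as pd

t = pd.DataFrame({'F': [20.0, 80.0], 'M': [80.0, 20.0]}, index=[1960, 2000])
print(t.div(t.sum(1), axis=0))   # each row now sums to 1.0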

The resulting figure:

 

posted @ 2018-05-31 18:04  whatyouknow123