Data Science IPython Notebooks 7.2 Data Wrangling

7.2 Data Wrangling

Original: Data Wrangling

Translator: 飞龙

License: CC BY-NC-SA 4.0 (original license: Apache License 2.0)

Data Flow

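Viz mines data directly from GitHub. It is powered by the GitHub API and leverages the following:

In the future, Google BigQuery along with GitHub Archive could also supplement the GitHub API.

Imports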

import re

import pandas as pd

Prepare Repo Data

Load the repos data and drop duplicates:

repos = pd.read_csv("data/2017/repos-dump.csv", quotechar='"', skipinitialspace=True)
print('Shape before dropping duplicates', repos.shape)
repos = repos.drop_duplicates(subset='full_name', keep='last')
print('Shape after  dropping duplicates', repos.shape)
repos.head()

'''
Shape before dropping duplicates (8697, 5)
Shape after  dropping duplicates (8697, 5)
'''
   full_name                         stars  forks  description                                      language
0  thedaviddias/Front-End-Checklist  24267  2058   ? The perfect Front-End Checklist for modern w…  JavaScript
1  GoogleChrome/puppeteer            21976  1259   Headless Chrome Node API                         JavaScript
2  parcel-bundler/parcel             13981  463    ?? Blazing fast, zero configuration web applic…  JavaScript
3  Chalarangelo/30-seconds-of-code   13466  1185   Curated collection of useful Javascript snippe…  JavaScript
4  wearehive/project-guidelines      11279  970    A set of best practices for JavaScript projects  JavaScript
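
The shapes before and after are identical here, so the dump happened to contain no repeated full_name values. To see what keep='last' does when duplicates are present, here is a minimal sketch on made-up rows (the toy frame and its values are hypothetical, not taken from the dump):

```python
import pandas as pd

# Hypothetical toy rows; the second "a/x" row stands in for a repo
# that was scraped twice with an updated star count.
toy = pd.DataFrame({
    "full_name": ["a/x", "b/y", "a/x"],
    "stars": [100, 250, 120],
})

# keep='last' retains the final occurrence of each full_name.
deduped = toy.drop_duplicates(subset="full_name", keep="last")
n_removed = len(toy) - len(deduped)

print(deduped)
print("rows removed:", n_removed)
```

Comparing the row counts, as the notebook does with shape, is a cheap way to confirm whether anything was actually dropped.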

Separate out user and repo from full_name into new columns:

def extract_user(line):
    return line.split('/')[0]

def extract_repo(line):
    return line.split('/')[1]

# The .str[:] slice returned each string unchanged, so apply directly.
repos['user'] = repos['full_name'].apply(extract_user)
repos['repo'] = repos['full_name'].apply(extract_repo)
print(repos.shape)
repos.head()

# (8697, 7)
   full_name                         stars  forks  description                                      language    user            repo
0  thedaviddias/Front-End-Checklist  24267  2058   ? The perfect Front-End Checklist for modern w…  JavaScript  thedaviddias    Front-End-Checklist
1  GoogleChrome/puppeteer            21976  1259   Headless Chrome Node API                         JavaScript  GoogleChrome    puppeteer
2  parcel-bundler/parcel             13981  463    ?? Blazing fast, zero configuration web applic…  JavaScript  parcel-bundler  parcel
3  Chalarangelo/30-seconds-of-code   13466  1185   Curated collection of useful Javascript snippe…  JavaScript  Chalarangelo    30-seconds-of-code
4  wearehive/project-guidelines      11279  970    A set of best practices for JavaScript projects  JavaScript  wearehive       project-guidelines
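
The same split can also be written without helper functions by using the pandas string accessor. This is an alternative sketch, not the notebook's own method, and the two-row frame is a stand-in for the real data:

```python
import pandas as pd

# Stand-in frame; the real notebook uses the full repos table.
toy = pd.DataFrame({"full_name": ["GoogleChrome/puppeteer", "python/cpython"]})

# n=1 splits on the first slash only; expand=True returns one column per piece.
parts = toy["full_name"].str.split("/", n=1, expand=True)
toy["user"] = parts[0]
toy["repo"] = parts[1]

print(toy)
```

One split pass yields both columns, and missing full_name values would stay missing rather than raising inside a Python-level function.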

Prepare User Data

Load the users data and drop duplicates:

users = pd.read_csv("data/2017/user-geocodes-dump.csv", quotechar='"', skipinitialspace=True)
print('Shape before dropping duplicates', users.shape)
users = users.drop_duplicates(subset='id', keep='last')
print('Shape after  dropping duplicates', users.shape)
users.head()

'''
Shape before dropping duplicates (6426, 8)
Shape after  dropping duplicates (6426, 8)
'''
   id              name             type          location      lat        long        city     country
0  dns-violations  NaN              Organization  NaN           NaN        NaN         NaN      NaN
1  hannob          Hanno Böck       User          Berlin        52.520007  13.404954   Berlin   Germany
2  takecian        Takeshi Fujiki   User          Tokyo, Japan  35.689487  139.691706  Tokyo    Japan
3  jtomschroeder   Tom Schroeder    User          Chicago       41.878114  -87.629798  Chicago  United States
4  wapiflapi       Wannes Rombouts  User          France        46.227638  2.213749    NaN      France

Rename the id column to user:

users.rename(columns={'id': 'user'}, inplace=True)
users.head()
   user            name             type          location      lat        long        city     country
0  dns-violations  NaN              Organization  NaN           NaN        NaN         NaN      NaN
1  hannob          Hanno Böck       User          Berlin        52.520007  13.404954   Berlin   Germany
2  takecian        Takeshi Fujiki   User          Tokyo, Japan  35.689487  139.691706  Tokyo    Japan
3  jtomschroeder   Tom Schroeder    User          Chicago       41.878114  -87.629798  Chicago  United States
4  wapiflapi       Wannes Rombouts  User          France        46.227638  2.213749    NaN      France

Merge Repo and User Data

Left join repos and users:

repos_users = pd.merge(repos, users, on='user', how='left')
print('Shape repos:', repos.shape)
print('Shape users:', users.shape)
print('Shape repos_users:', repos_users.shape)
repos_users.head()

'''
Shape repos: (8697, 7)
Shape users: (6426, 8)
Shape repos_users: (8697, 14)
'''
   full_name                         stars  forks  description                                      language    user            repo                 name              type          location                   lat        long       city    country
0  thedaviddias/Front-End-Checklist  24267  2058   ? The perfect Front-End Checklist for modern w…  JavaScript  thedaviddias    Front-End-Checklist  David Dias        User          France, Mauritius, Canada  NaN        NaN        NaN     NaN
1  GoogleChrome/puppeteer            21976  1259   Headless Chrome Node API                         JavaScript  GoogleChrome    puppeteer            NaN               Organization  NaN                        NaN        NaN        NaN     NaN
2  parcel-bundler/parcel             13981  463    ?? Blazing fast, zero configuration web applic…  JavaScript  parcel-bundler  parcel               Parcel            Organization  NaN                        NaN        NaN        NaN     NaN
3  Chalarangelo/30-seconds-of-code   13466  1185   Curated collection of useful Javascript snippe…  JavaScript  Chalarangelo    30-seconds-of-code   Angelos Chalaris  User          Athens, Greece             37.983810  23.727539  Athens  Greece
4  wearehive/project-guidelines      11279  970    A set of best practices for JavaScript projects  JavaScript  wearehive       project-guidelines   Hive              Organization  London                     51.507351  -0.127758  London  United Kingdom
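
A left join keeps every repo row, which is why repos_users has the same 8697 rows as repos: each repo matches at most one user because the users were deduplicated on their key. The sketch below uses hypothetical toy frames to show two optional pd.merge arguments that make this explicit. validate raises if a user key is repeated on the right side, and indicator adds a _merge column that marks repos with no matching user:

```python
import pandas as pd

# Hypothetical toy frames; user "c" has no entry in toy_users.
toy_repos = pd.DataFrame({
    "full_name": ["a/x", "b/y", "c/z"],
    "user": ["a", "b", "c"],
})
toy_users = pd.DataFrame({
    "user": ["a", "b"],
    "country": ["Germany", "Japan"],
})

merged = pd.merge(
    toy_repos, toy_users, on="user", how="left",
    validate="many_to_one",  # fail loudly if a user appears twice on the right
    indicator=True,          # adds a "_merge" column: both / left_only
)

# Repos whose owner was not found in the users table.
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print("unmatched users:", unmatched["user"].tolist())
```

Unmatched rows carry NaN in every column that came from the users table.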

Tidy Up Repo and User Data

Re-order the columns:

# reindex_axis was removed in pandas 1.0; reindex(columns=...) replaces it.
columns = ['full_name', 'repo', 'description', 'stars', 'forks', 'language',
           'user', 'name', 'type', 'location', 'lat', 'long', 'city', 'country']
repos_users = repos_users.reindex(columns=columns)
print(repos_users.shape)
repos_users.head()

# (8697, 14)
   full_name                         repo                 description                                      stars  forks  language    user            name              type          location                   lat        long       city    country
0  thedaviddias/Front-End-Checklist  Front-End-Checklist  ? The perfect Front-End Checklist for modern w…  24267  2058   JavaScript  thedaviddias    David Dias        User          France, Mauritius, Canada  NaN        NaN        NaN     NaN
1  GoogleChrome/puppeteer            puppeteer            Headless Chrome Node API                         21976  1259   JavaScript  GoogleChrome    NaN               Organization  NaN                        NaN        NaN        NaN     NaN
2  parcel-bundler/parcel             parcel               ?? Blazing fast, zero configuration web applic…  13981  463    JavaScript  parcel-bundler  Parcel            Organization  NaN                        NaN        NaN        NaN     NaN
3  Chalarangelo/30-seconds-of-code   30-seconds-of-code   Curated collection of useful Javascript snippe…  13466  1185   JavaScript  Chalarangelo    Angelos Chalaris  User          Athens, Greece             37.983810  23.727539  Athens  Greece
4  wearehive/project-guidelines      project-guidelines   A set of best practices for JavaScript projects  11279  970    JavaScript  wearehive       Hive              Organization  London                     51.507351  -0.127758  London  United Kingdom

Add Overall Ranks

Rank each element based on number of stars:

repos_users['rank'] = repos_users['stars'].rank(ascending=False)
print(repos_users.shape)
repos_users.head()

# (8697, 15)
   full_name                         repo                 description                                      stars  forks  language    user            name              type          location                   lat        long       city    country         rank
0  thedaviddias/Front-End-Checklist  Front-End-Checklist  ? The perfect Front-End Checklist for modern w…  24267  2058   JavaScript  thedaviddias    David Dias        User          France, Mauritius, Canada  NaN        NaN        NaN     NaN             3
1  GoogleChrome/puppeteer            puppeteer            Headless Chrome Node API                         21976  1259   JavaScript  GoogleChrome    NaN               Organization  NaN                        NaN        NaN        NaN     NaN             4
2  parcel-bundler/parcel             parcel               ?? Blazing fast, zero configuration web applic…  13981  463    JavaScript  parcel-bundler  Parcel            Organization  NaN                        NaN        NaN        NaN     NaN             11
3  Chalarangelo/30-seconds-of-code   30-seconds-of-code   Curated collection of useful Javascript snippe…  13466  1185   JavaScript  Chalarangelo    Angelos Chalaris  User          Athens, Greece             37.983810  23.727539  Athens  Greece          13
4  wearehive/project-guidelines      project-guidelines   A set of best practices for JavaScript projects  11279  970    JavaScript  wearehive       Hive              Organization  London                     51.507351  -0.127758  London  United Kingdom  16
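
Series.rank defaults to method='average', so repos with the same star count share a fractional rank. A minimal sketch on a hypothetical four-value series shows the difference between the default and method='min', which gives tied repos the best rank in their group:

```python
import pandas as pd

# Hypothetical star counts with one tie at 300.
stars = pd.Series([500, 300, 300, 100], name="stars")

# Default method='average': the two tied values share rank (2 + 3) / 2.
avg_rank = stars.rank(ascending=False)

# method='min': tied values both get the lowest rank in the tie group.
min_rank = stars.rank(ascending=False, method="min")

print(pd.DataFrame({"stars": stars, "average": avg_rank, "min": min_rank}))
```

Which tie rule suits a leaderboard is a presentation choice; the notebook itself keeps the default.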

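Verify Results: User

Equivalent to the GitHub search query: created:2017-01-01..2017-12-31 stars:>=100 user:donnemartin

Note: The data might be slightly off, because the search query takes into account data up to the moment the query is executed. The data in this notebook was mined on January 1, 2017 to "freeze" the results for the year 2017. The longer after January 1, 2017 the search is run, the larger the discrepancy.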

repos_users[repos_users['user'] == 'donnemartin']
      full_name                         repo                  description                                      stars  forks  language  user         name          type  location          lat        long        city        country        rank
3308  donnemartin/system-design-primer  system-design-primer  Learn how to design large-scale systems. Prep …  21780  2633   Python    donnemartin  Donne Martin  User  Washington, D.C.  38.907192  -77.036871  Washington  United States  5

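Verify Results: Python Repos

Equivalent to the GitHub search query: created:2017-01-01..2017-12-31 stars:>=100 language:python

Note: The data might be slightly off, because the search query takes into account data up to the moment the query is executed. The data in this notebook was mined on January 1, 2017 to "freeze" the results for the year 2017. The longer after January 1, 2017 the search is run, the larger the discrepancy.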

print(repos_users[repos_users['language'] == 'Python'].shape)
repos_users[repos_users['language'] == 'Python'].head()

# (1357, 15)
      full_name                         repo                  description                                      stars  forks  language  user             name              type          location          lat        long        city        country        rank
3308  donnemartin/system-design-primer  system-design-primer  Learn how to design large-scale systems. Prep …  21780  2633   Python    donnemartin      Donne Martin      User          Washington, D.C.  38.907192  -77.036871  Washington  United States  5
3309  python/cpython                    cpython               The Python programming language                  15060  3779   Python    python           Python            Organization  NaN               NaN        NaN         NaN         NaN            9
3310  ageitgey/face_recognition         face_recognition      The world’s simplest facial recognition api fo…  8487   1691   Python    ageitgey         Adam Geitgey      User          Various places    NaN        NaN         NaN         NaN            31
3311  tonybeltramelli/pix2code          pix2code              pix2code: Generating Code from a Graphical Use…  8037   605    Python    tonybeltramelli  Tony Beltramelli  User          Denmark           NaN        NaN         NaN         NaN            34
3312  google/python-fire                python-fire           Python Fire is a library for automatically gen…  7663   386    Python    google           Google            Organization  NaN               NaN        NaN         NaN         NaN            36
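
The notebook filters on language alone, which suggests the stars:>=100 threshold was already applied when the data was collected (an inference, not something the notebook states). If the threshold had to be re-applied locally, several search qualifiers could be combined into one boolean mask. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows standing in for repos_users.
toy = pd.DataFrame({
    "full_name": ["a/x", "b/y", "c/z"],
    "language": ["Python", "JavaScript", "Python"],
    "stars": [150, 900, 80],
})

# Each condition is wrapped in parentheses and joined with &,
# mirroring the qualifiers "stars:>=100 language:python".
mask = (toy["language"] == "Python") & (toy["stars"] >= 100)
python_repos = toy[mask].sort_values("stars", ascending=False)

print(python_repos)
```

Note that the comparison on language is case-sensitive here, whereas the GitHub search qualifier is written in lowercase.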

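Verify Results: Overall Repos

Equivalent to the GitHub search query: created:2017-01-01..2017-12-31 stars:>=100

Note: The data might be slightly off, because the search query takes into account data up to the moment the query is executed. The data in this notebook was mined on January 1, 2017 to "freeze" the results for the year 2017. The longer after January 1, 2017 the search is run, the larger the discrepancy.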

print(repos_users.shape)
repos_users.head()

# (8697, 15)
   full_name                         repo                 description                                      stars  forks  language    user            name              type          location                   lat        long       city    country         rank
0  thedaviddias/Front-End-Checklist  Front-End-Checklist  ? The perfect Front-End Checklist for modern w…  24267  2058   JavaScript  thedaviddias    David Dias        User          France, Mauritius, Canada  NaN        NaN        NaN     NaN             3
1  GoogleChrome/puppeteer            puppeteer            Headless Chrome Node API                         21976  1259   JavaScript  GoogleChrome    NaN               Organization  NaN                        NaN        NaN        NaN     NaN             4
2  parcel-bundler/parcel             parcel               ?? Blazing fast, zero configuration web applic…  13981  463    JavaScript  parcel-bundler  Parcel            Organization  NaN                        NaN        NaN        NaN     NaN             11
3  Chalarangelo/30-seconds-of-code   30-seconds-of-code   Curated collection of useful Javascript snippe…  13466  1185   JavaScript  Chalarangelo    Angelos Chalaris  User          Athens, Greece             37.983810  23.727539  Athens  Greece          13
4  wearehive/project-guidelines      project-guidelines   A set of best practices for JavaScript projects  11279  970    JavaScript  wearehive       Hive              Organization  London                     51.507351  -0.127758  London  United Kingdom  16

Output Results

Write the results out to CSV for visualization in Tableau:

users.to_csv('data/2017/users.csv', index=False)
repos_users.to_csv('data/2017/repos-users-geocodes.csv', index=False)
repos_users.to_csv('data/2017/repos-users.csv', index=False)

# reindex_axis was removed in pandas 1.0; reindex(columns=...) replaces it.
repos_rank = repos_users.reindex(columns=['full_name', 'rank'])
repos_rank.to_csv('data/2017/repos-ranks.csv', index=False)
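
Before handing the files to Tableau, it can help to confirm that a frame survives the CSV round trip. The sketch below writes to an in-memory buffer instead of the data/2017 paths, and the toy frame is hypothetical:

```python
import io

import pandas as pd

# Hypothetical stand-in for repos_rank.
toy = pd.DataFrame({"full_name": ["a/x", "b/y"], "rank": [1.0, 2.0]})

# index=False omits the row index, matching the notebook's to_csv calls.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back and compare against the original frame.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

With index=False the header row holds only the data columns, so the reloaded frame has no stray unnamed index column.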

Posted 2020-05-13 09:45 by 绝不原创的飞龙