人口分析 案例

需求:

  • 导入文件,查看原始数据
  • 将人口数据和各州简称数据进行合并
  • 将合并的数据中重复的abbreviation列进行删除
  • 查看存在缺失数据的列
  • 找到有哪些state/region使得state的值为NaN,进行去重操作
  • 为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
  • 合并各州面积数据areas
  • 我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
  • 去除含有缺失数据的行
  • 找出2010年的全民人口数据
  • 计算各州的人口密度
  • 排序,并找出人口密度最高的五个州 df.sort_values()
import numpy as np
from pandas import DataFrame,Series
import pandas as pd
abb = pd.read_csv('./data/state-abbrevs.csv')
pop = pd.read_csv('./data/state-population.csv')
area = pd.read_csv('./data/state-areas.csv')
abb.head(2)
state abbreviation
0 Alabama AL
1 Alaska AK
pop.head(2)
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
area.head(2)
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425

1. 将人口数据和各州简称数据进行合并

# 将人口数据和各州简称数据进行合并
abb_pop = pd.merge(abb,pop,how='outer',left_on='abbreviation',right_on='state/region')
abb_pop.head(4)
state abbreviation state/region ages year population
0 Alabama AL AL under18 2012 1117489.0
1 Alabama AL AL total 2012 4817528.0
2 Alabama AL AL under18 2010 1130966.0
3 Alabama AL AL total 2010 4785570.0

2.将合并的数据中重复的abbreviation列进行删除


abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
abb_pop.head(4)
state state/region ages year population
0 Alabama AL under18 2012 1117489.0
1 Alabama AL total 2012 4817528.0
2 Alabama AL under18 2010 1130966.0
3 Alabama AL total 2010 4785570.0

3.查看存在缺失数据的列

#查看存在缺失数据的列
# abb_pop.loc[]
abb_pop.isnull().any(axis=0)
state            True
state/region    False
ages            False
year            False
population       True
dtype: bool

4.找到有哪些state/region使得state的值为NaN,进行去重操作

#找到有哪些state/region使得state的值为NaN,进行去重操作

# 判断是否为空
abb_pop['state'].isnull()

# 定位到 state 为空的数据
abb_pop.loc[abb_pop['state'].isnull()]
#将结果中的state/region列取出,返回的是一个Series
abb_pop.loc[abb_pop['state'].isnull()]['state/region']

#对Series进行去重
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()
  • array(['USA'], dtype=object)
#为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
abb_pop.loc[abb_pop['state/region'] == 'PR']

PR_indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index
PR_indexs
Int64Index([2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458,
            2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469,
            2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480,
            2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491,
            2492, 2493, 2494, 2495],
           dtype='int64')

5.给对应的 空值赋值


abb_pop.loc[PR_indexs,'state'] = 'ppprrr'
usa_index = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
# abb_pop.loc[usa_index,'state'] = 'Alabama'
abb_pop.loc[usa_index,'state']

6.合并各州面积数据areas


area.head()
abb_pop_area = pd.merge(abb_pop,area,how='outer')
abb_pop_area.head()
state state/region ages year population area (sq. mi)
0 Alabama AL under18 2012.0 1117489.0 52423.0
1 Alabama AL total 2012.0 4817528.0 52423.0
2 Alabama AL under18 2010.0 1130966.0 52423.0
3 Alabama AL total 2010.0 4785570.0 52423.0
4 Alabama AL under18 2011.0 1125763.0 52423.0

7.我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行

#我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
abb_pop_area['area (sq. mi)'].isnull()
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index

8.去除含有缺失数据的行

#去除含有缺失数据的行
abb_pop_area.drop(labels=indexs,axis=0,inplace=True)

9.找出2010年的全民人口数据

#找出2010年的全民人口数据
abb_pop_area.query('year == 2010 & ages == "total"')

10.计算各州的人口密度

#计算各州的人口密度
abb_pop_area['population']/abb_pop_area['area (sq. mi)']

# 将数据添加到 总数据中
abb_pop_area['midu'] = abb_pop_area['population']/abb_pop_area['area (sq. mi)']
abb_pop_area.head()
state state/region ages year population area (sq. mi) midu
0 Alabama AL under18 2012.0 1117489.0 52423.0 21.316769
1 Alabama AL total 2012.0 4817528.0 52423.0 91.897221
2 Alabama AL under18 2010.0 1130966.0 52423.0 21.573851
3 Alabama AL total 2010.0 4785570.0 52423.0 91.287603
4 Alabama AL under18 2011.0 1125763.0 52423.0 21.474601

11.排序,并找出人口密度最高的五个州 df.sort_values()

##排序,并找出人口密度最高的五个州 df.sort_values()
# ascending=False 倒叙
abb_pop_area.sort_values('midu',axis=0,ascending=False).head(5)
state state/region ages year population area (sq. mi) midu
439 District of Columbia DC total 2013.0 646449.0 68.0 9506.602941
433 District of Columbia DC total 2012.0 633427.0 68.0 9315.102941
435 District of Columbia DC total 2011.0 619624.0 68.0 9112.117647
479 District of Columbia DC total 1990.0 605321.0 68.0 8901.779412
437 District of Columbia DC total 2010.0 605125.0 68.0 8898.897059
posted @ 2019-04-22 12:41  拐弯  阅读(458)  评论(0编辑  收藏  举报