For the complete version of the machine learning algorithms series, see fenghaootong-github

Data Exploration

  • We know that data is very important in data science, but exploring it is time-consuming.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
df_train = pd.read_csv('../DATA/SalePrice_train.csv')
df_train.columns
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',  
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',  
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
# help(df_train.columns)

1 What can we expect?

  • In order to understand our data, we can look at each variable and try to understand its meaning and relevance to this problem. I know this is time-consuming, but it will give us the flavour of our dataset (see the sketch below).
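
One lightweight way to do this is to keep a small notes table while reading the data documentation. This is only an illustrative sketch, assuming the imports above; the columns ('Variable', 'Type', 'Segment', 'Expectation') are a hypothetical convention, not something defined by the dataset.

# Hypothetical bookkeeping table for auditing variables one by one;
# the columns are an illustrative convention, not defined by the dataset.
notes = pd.DataFrame(
    [
        ('OverallQual', 'ordinal', 'building', 'high influence on SalePrice expected'),
        ('GrLivArea', 'numerical', 'space', 'high influence on SalePrice expected'),
        ('MoSold', 'numerical', 'sale', 'low influence on SalePrice expected'),
    ],
    columns=['Variable', 'Type', 'Segment', 'Expectation'])
notes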

2 First: Analysing ‘SalePrice’

  • First, we need to look at ‘SalePrice’, because it is the target of our quest.
  • Summary statistics for ‘SalePrice’:
df_train['SalePrice'].describe()
count      1460.000000   
mean     180921.195890   
std       79442.502883  
min       34900.000000
25%      129975.000000  
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
  • We can also plot the distribution of ‘SalePrice’:
sns.distplot(df_train['SalePrice'])
<matplotlib.axes._subplots.AxesSubplot at 0x2410a439be0>

[Figure: distribution of ‘SalePrice’]

  • Calculate skewness and kurtosis:
    • Skewness (third moment) measures the direction and degree of asymmetry of a distribution: the third central moment divided by the cube of the standard deviation. A normal distribution has skewness 0; the distribution in the plot above is positively skewed.
      • Skewness of 0 means the data is as symmetric as a normal distribution. Skewness greater than 0 means the distribution is positively (right) skewed: a long tail trails off to the right and there are more extreme values at the high end. Skewness below 0 means the distribution is negatively (left) skewed: a long tail trails off to the left and there are more extreme values at the low end. The larger the absolute value, the more strongly skewed the distribution.
    • Kurtosis (fourth moment) describes how peaked or flat a distribution is, i.e. the height of the probability density around the mean. It is commonly defined as the fourth central moment divided by the square of the variance, minus three.
      • A normal distribution has kurtosis 3 (or excess kurtosis 0 under the minus-three convention, which is what pandas’ .kurt() reports). With the normal distribution as reference, kurtosis describes how steep a distribution is: if bk < 3 the distribution is platykurtic (flatter than normal), and if bk > 3 it is leptokurtic (more peaked than normal). If a distribution is suspected of deviating from normality in its peakedness, kurtosis can be used to test for normality. For the same standard deviation, a larger kurtosis coefficient means more extreme values.
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
Skewness: 1.882876
Kurtosis: 6.536282
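
As a cross-check of the definitions above, here is a minimal sketch computing the same quantities directly from the central moments, assuming df_train from above. Note that pandas’ .skew() and .kurt() apply a sample bias correction, so these population values differ slightly from the printed ones:

# Skewness and excess kurtosis from the moment definitions given above.
# These are population (biased) estimators; pandas applies a bias correction.
x = df_train['SalePrice']
mu = x.mean()
sigma = x.std(ddof=0)                               # population standard deviation
skew_pop = ((x - mu) ** 3).mean() / sigma ** 3      # third central moment / sigma^3
kurt_pop = ((x - mu) ** 4).mean() / sigma ** 4 - 3  # fourth central moment / var^2, minus 3
print("Skewness (population): %f" % skew_pop)
print("Excess kurtosis (population): %f" % kurt_pop)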

‘SalePrice’, her buddies and her interests

  • GrLivArea
  • TotalBsmtSF

Relationship with numerical variables

var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));

[Figure: scatter plot of ‘SalePrice’ vs ‘GrLivArea’]

var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));  

[Figure: scatter plot of ‘SalePrice’ vs ‘TotalBsmtSF’]

#box plot overallqual/saleprice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

[Figure: box plot of ‘SalePrice’ by ‘OverallQual’]

var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);   

[Figure: box plot of ‘SalePrice’ by ‘YearBuilt’]

  • Feature selection and feature engineering can help us analyze the data.

3 Keep calm and work smart

Next, we do a more objective analysis.

Raw data is like soup

  • Raw data is like soup: we know only a little about what is in it.
  • We will use the following to analyze it:
    • Correlation matrix (heatmap style).
    • ‘SalePrice’ correlation matrix (zoomed heatmap style).
    • Scatter plots between the most correlated variables (move like Jagger style).

Correlation matrix

#correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)   
<matplotlib.axes._subplots.AxesSubplot at 0x2410a1c99b0>    

[Figure: correlation matrix heatmap]

SalePrice correlation matrix (zoomed heatmap style)

#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)  
cols
Index(['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 
       'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt'], 
      dtype='object')
cm
array([[ 1.        ,  0.7909816 ,  0.70862448,  0.6404092 ,  0.62343144,   
         0.61358055,  0.60585218,  0.56066376,  0.53372316,  0.52289733],
       [ 0.7909816 ,  1.        ,  0.59300743,  0.60067072,  0.56202176,
         0.5378085 ,  0.47622383,  0.55059971,  0.42745234,  0.57232277],
       [ 0.70862448,  0.59300743,  1.        ,  0.46724742,  0.46899748,
         0.4548682 ,  0.56602397,  0.63001165,  0.82548937,  0.19900971],
       [ 0.6404092 ,  0.60067072,  0.46724742,  1.        ,  0.88247541,
         0.43458483,  0.43931681,  0.46967204,  0.36228857,  0.53785009],
       [ 0.62343144,  0.56202176,  0.46899748,  0.88247541,  1.        ,
         0.48666546,  0.48978165,  0.40565621,  0.33782212,  0.47895382],
       [ 0.61358055,  0.5378085 ,  0.4548682 ,  0.43458483,  0.48666546, 
         1.        ,  0.81952998,  0.32372241,  0.28557256,  0.391452  ], 
       [ 0.60585218,  0.47622383,  0.56602397,  0.43931681,  0.48978165, 
         0.81952998,  1.        ,  0.38063749,  0.40951598,  0.28198586],
       [ 0.56066376,  0.55059971,  0.63001165,  0.46967204,  0.40565621, 
         0.32372241,  0.38063749,  1.        ,  0.55478425,  0.46827079],
       [ 0.53372316,  0.42745234,  0.82548937,  0.36228857,  0.33782212,
         0.28557256,  0.40951598,  0.55478425,  1.        ,  0.09558913],
       [ 0.52289733,  0.57232277,  0.19900971,  0.53785009,  0.47895382,
         0.391452  ,  0.28198586,  0.46827079,  0.09558913,  1.        ]])
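
For reference, corrmat.nlargest(k, 'SalePrice') returns the k rows of the correlation matrix with the largest values in the 'SalePrice' column, so taking ['SalePrice'].index yields the k variables most correlated with the target. A minimal equivalent sketch using sort_values (ties aside):

# Alternative way to select the k variables most correlated with 'SalePrice'
cols_alt = corrmat['SalePrice'].sort_values(ascending=False).head(k).index
print(list(cols_alt))  # should match cols above, up to ties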
#help(corrmat.nlargest)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

[Figure: zoomed ‘SalePrice’ correlation matrix heatmap with annotations]

From the heatmap above, we can see the following:

  • ‘OverallQual’, ‘GrLivArea’ and ‘TotalBsmtSF’ are strongly correlated with ‘SalePrice’, as our earlier analysis suggested.
  • ‘GarageCars’ and ‘GarageArea’ are also among the most correlated variables. However, as discussed in the last point, the number of cars that fit in the garage is a consequence of the garage area. ‘GarageCars’ and ‘GarageArea’ are like twin brothers; you can never tell them apart. Therefore, we only need one of these variables in our analysis (we can keep ‘GarageCars’, since its correlation with ‘SalePrice’ is higher; see the quick check below).
  • ‘TotalBsmtSF’ and ‘1stFlrSF’ also look like twin brothers. We can keep ‘TotalBsmtSF’, which just confirms that our first guess was right.
  • ‘FullBath’? Not sure.
  • ‘TotRmsAbvGrd’ and ‘GrLivArea’? Twin brothers again.
  • ‘YearBuilt’ appears to be slightly correlated with ‘SalePrice’.
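
These claims can be read directly off the matrix cm printed above; as a minimal sketch, assuming corrmat from above:

# Pairwise correlations backing the 'twin brothers' argument:
print(corrmat.loc['GarageCars', 'GarageArea'])                 # ~0.88, near-duplicates
print(corrmat.loc['TotalBsmtSF', '1stFlrSF'])                  # ~0.82, near-duplicates
print(corrmat.loc['SalePrice', ['GarageCars', 'GarageArea']])  # keep the higher one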

Scatter plots between ‘SalePrice’ and correlated variables (move like Jagger style)

#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], size = 2.5)
plt.show()

[Figure: pair plot of ‘SalePrice’ and selected correlated variables]

4 Missing Data

Important questions when thinking about missing data:
- How prevalent is the missing data?
- Is missing data random or does it have a pattern?

#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
# isnull().count() counts all entries (missing or not), i.e. the number of rows
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
              Total   Percent
PoolQC         1453  0.995205
MiscFeature    1406  0.963014
Alley          1369  0.937671
Fence          1179  0.807534
FireplaceQu     690  0.472603
LotFrontage     259  0.177397
GarageCond       81  0.055479
GarageType       81  0.055479
GarageYrBlt      81  0.055479
GarageFinish     81  0.055479
GarageQual       81  0.055479
BsmtExposure     38  0.026027
BsmtFinType2     38  0.026027
BsmtFinType1     37  0.025342
BsmtCond         37  0.025342
BsmtQual         37  0.025342
MasVnrArea        8  0.005479
MasVnrType        8  0.005479
Electrical        1  0.000685
Utilities         0  0.000000
  • When more than 15% of the data is missing, the corresponding variable should be dropped.
  • The GarageX variables are all missing the same number of values, so the missing entries must refer to the same set of observations (see the check below). Since ‘GarageCars’ already captures the most important information about the garage, we can drop these variables with ~5% missing data.
  • The BsmtX variables can be dropped in the same way.
  • ‘MasVnrArea’ and ‘MasVnrType’ are strongly correlated with ‘YearBuilt’ and ‘OverallQual’, so they can also be dropped.
  • ‘Electrical’ has only one missing value, so we can drop that single observation instead of the whole variable.
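
The claim that the Garage* columns are missing on the same observations can be verified quickly; a minimal sketch (run it before the drops below):

# If the Garage* columns are missing together, 'any' and 'all' give the same count (81).
garage_cols = ['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']
print(df_train[garage_cols].isnull().any(axis=1).sum())  # rows missing at least one
print(df_train[garage_cols].isnull().all(axis=1).sum())  # rows missing all five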
# drop every variable with more than one missing value, then drop the single
# observation with missing 'Electrical'
df_train = df_train.drop(missing_data[missing_data['Total'] > 1].index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max()  # confirm that no missing data remains
0

For the complete version, see fenghaootong-github

posted on 2018-03-07 14:20  一小白