[Data Analysis] Visualizing a correlation matrix (heatmap)

Data overview

# Example data: the Kaggle House Prices training set (Ames, Iowa housing data)
import pandas as pd
train=pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
train.head()  # call the method; a bare `train.head` only prints the bound method
        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0        1          60       RL         65.0     8450   Pave   NaN      Reg   
1        2          20       RL         80.0     9600   Pave   NaN      Reg   
2        3          60       RL         68.0    11250   Pave   NaN      IR1   
3        4          70       RL         60.0     9550   Pave   NaN      IR1   
4        5          60       RL         84.0    14260   Pave   NaN      IR1   

     LandContour Utilities  ... PoolArea PoolQC  Fence MiscFeature MiscVal  \
0            Lvl    AllPub  ...        0    NaN    NaN         NaN       0   
1            Lvl    AllPub  ...        0    NaN    NaN         NaN       0   
2            Lvl    AllPub  ...        0    NaN    NaN         NaN       0   
3            Lvl    AllPub  ...        0    NaN    NaN         NaN       0   
4            Lvl    AllPub  ...        0    NaN    NaN         NaN       0   

     MoSold YrSold  SaleType  SaleCondition  SalePrice  
0         2   2008        WD         Normal     208500  
1         5   2007        WD         Normal     181500  
2         9   2008        WD         Normal     223500  
3         2   2006        WD        Abnorml     140000  
4        12   2008        WD         Normal     250000  

[5 rows x 81 columns]


Computing the correlation matrix

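import numpy as np

k = 10
# NOTE: train_drop is not defined anywhere in this post; it is presumably `train`
# after the author's own preprocessing. Substitute your cleaned DataFrame here.
# numeric_only=True (pandas >= 1.5) skips the string columns, which newer pandas would otherwise reject.
corrmat = train_drop.corr(numeric_only=True)  # correlation matrix of the numeric columns
# labels of the k features most correlated with SalePrice (SalePrice itself ranks first)
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
# correlation matrix restricted to those k features
cm = np.corrcoef(train_drop[cols].values.T)
cm  # last expression of the cell, so the notebook displays its repr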
array([[1.        , 0.79577427, 0.73496816, 0.65115291, 0.64104701,
        0.63153038, 0.62921745, 0.56216475, 0.53776882, 0.5236084 ],
       [0.79577427, 1.        , 0.58941358, 0.53859453, 0.60074082,
        0.46909184, 0.55723004, 0.54841971, 0.4206215 , 0.57136809],
       [0.73496816, 0.58941358, 1.        , 0.40879348, 0.47544152,
        0.53369718, 0.4563575 , 0.63837846, 0.8294982 , 0.19439712],
       [0.65115291, 0.53859453, 0.40879348, 1.        , 0.45188972,
        0.80382963, 0.47506909, 0.32772043, 0.26614613, 0.40026576],
       [0.64104701, 0.60074082, 0.47544152, 0.45188972, 1.        ,
        0.44919454, 0.8873045 , 0.46819822, 0.36115155, 0.5373007 ],
       [0.63153038, 0.46909184, 0.53369718, 0.80382963, 0.44919454,
        1.        , 0.47729916, 0.38212   , 0.39638135, 0.28125344],
       [0.62921745, 0.55723004, 0.4563575 , 0.47506909, 0.8873045 ,
        0.47729916, 1.        , 0.4040763 , 0.32871405, 0.47799759],
       [0.56216475, 0.54841971, 0.63837846, 0.32772043, 0.46819822,
        0.38212   , 0.4040763 , 1.        , 0.55303847, 0.46714602],
       [0.53776882, 0.4206215 , 0.8294982 , 0.26614613, 0.36115155,
        0.39638135, 0.32871405, 0.55303847, 1.        , 0.09122031],
       [0.5236084 , 0.57136809, 0.19439712, 0.40026576, 0.5373007 ,
        0.28125344, 0.47799759, 0.46714602, 0.09122031, 1.        ]])
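To make the selection step concrete, here is a minimal sketch on a small synthetic frame. The column names `target`, `strong`, `weak`, `noise` and the data are made up for illustration and are not taken from the housing set. It shows that `nlargest` ranks columns by their signed correlation with the target, so the target itself always comes first, and that on data without missing values `np.corrcoef` gives the same numbers as slicing the pandas correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(0)
n = 500
target = rng.normal(size=n)
toy = pd.DataFrame({
    "target": target,
    "strong": target + 0.3 * rng.normal(size=n),  # tightly linked to target
    "weak": target + 3.0 * rng.normal(size=n),    # loosely linked to target
    "noise": rng.normal(size=n),                  # unrelated to target
})

k = 3
corrmat = toy.corr()
# Row labels of the k largest values in the "target" column of corrmat.
cols = corrmat.nlargest(k, "target")["target"].index

# Two routes to the same k x k matrix when no values are missing.
cm_numpy = np.corrcoef(toy[cols].values.T)   # rows of the transposed array are variables
cm_pandas = corrmat.loc[cols, cols].values   # slice of the full correlation matrix

print(list(cols))
print(np.allclose(cm_numpy, cm_pandas))
```

Because the ranking uses signed values, a strongly negative correlation would be ranked last. If that matters for your data, ranking by `corrmat["target"].abs().nlargest(k)` is one possible alternative.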

Visualization

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(font_scale=1.25)  # scale up the font size
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
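The three steps above (correlation matrix, top-k selection, heatmap) can be wrapped in one helper, sketched below on synthetic data so it runs without the Kaggle file. The function name `top_k_corr_heatmap`, the demo columns, and the `vmin`/`vmax`/`cmap` choices are assumptions made for this illustration and are not part of the original post; `numeric_only=True` assumes pandas 1.5 or later.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def top_k_corr_heatmap(df, target, k=10):
    """Plot correlations among the k numeric columns most correlated with `target`.

    Hypothetical helper: returns the k x k correlation DataFrame and the Axes.
    """
    corrmat = df.corr(numeric_only=True)              # ignore non-numeric columns
    cols = corrmat.nlargest(k, target)[target].index  # target itself ranks first
    cm = corrmat.loc[cols, cols]                      # labelled k x k sub-matrix
    # Passing a DataFrame lets seaborn take the tick labels from its index/columns.
    ax = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt=".2f",
                     annot_kws={"size": 10}, vmin=-1, vmax=1, cmap="coolwarm")
    return cm, ax


# Synthetic demo frame (made-up columns, including one string column).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "price": base,
    "a": base + 0.5 * rng.normal(size=200),
    "b": base + 2.0 * rng.normal(size=200),
    "c": rng.normal(size=200),
    "label": ["x"] * 200,
})

cm_demo, ax = top_k_corr_heatmap(demo, "price", k=3)
print(cm_demo.round(2))
plt.close("all")
```

In a notebook you would drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of `plt.close("all")`. Fixing the color scale to the range -1 to 1 keeps colors comparable between plots, which is a design choice rather than a requirement.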


posted @ 2020-11-04 13:36  ccql