
4-9 Pandas and sklearn combined examples

 

 

1. Bar chart showing percentages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Generate random sample data (fixed seed for reproducibility)
np.random.seed(0)
df=pd.DataFrame({'Condition 1':np.random.rand(20),
                 'Condition 2':np.random.rand(20)*0.9,
                 'Condition 3':np.random.rand(20)*1.1})
df.head()
Out[1]:
 
   Condition 1  Condition 2  Condition 3
0     0.548814     0.880757     0.395459
1     0.715189     0.719243     0.480735
2     0.602763     0.415331     0.767394
3     0.544883     0.702476     0.066248
4     0.423655     0.106447     0.733443
In [2]:
fig,ax=plt.subplots()
df.plot.bar(ax=ax,stacked=True)  # stacked=True stacks the three conditions in one bar per row
Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x8b32eb8>
 
In [3]:
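# Size each bar segment by its share of the row total
from matplotlib.ticker import FuncFormatter  # formats tick labels with a custom function
df_ratio=df.div(df.sum(axis=1),axis=0)  # divide every row by its sum so each row adds up to 1
# Plot
fig,ax=plt.subplots()
df_ratio.plot.bar(ax=ax,stacked=True)  # stacked=True stacks the bars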
ax.yaxis.set_major_formatter(FuncFormatter(lambda y,_:'{:.0%}'.format(y)))
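
To make the normalization step easier to follow, here is a minimal sketch on a tiny hand-made frame. The name `toy` and its values are assumptions chosen for illustration and are not part of the original notebook. It shows that `div(..., axis=0)` divides each row by its own total, and what the `'{:.0%}'` format used in the tick formatter produces.

```python
import pandas as pd

# Hypothetical two-row frame, only for illustration
toy = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 2.0]})

# Row totals: 1 + 3 = 4 and 2 + 2 = 4
row_totals = toy.sum(axis=1)

# axis=0 aligns row_totals with the row index, so each row is divided by its own total
toy_ratio = toy.div(row_totals, axis=0)
print(toy_ratio)

# Every row of the normalized frame sums to 1
row_sums = toy_ratio.sum(axis=1).tolist()
print(row_sums)

# The same format string the tick formatter applies to each y value
label = "{:.0%}".format(0.25)
print(label)  # 25%
```

Because every row sums to 1, the stacked bars all reach the same height and the y axis can be read directly as a percentage.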
 
 

2. sklearn example

In [4]:
# Read the data from the UCI repository; "?" marks a missing value in this file
url='https://archive.ics.uci.edu/ml/machine-learning-databases/00383/risk_factors_cervical_cancer.csv'
df=pd.read_csv(url,na_values="?")
df.head()
Out[4]:
 
 Age | Number of sexual partners | First sexual intercourse | Num of pregnancies | Smokes | Smokes (years) | Smokes (packs/year) | Hormonal Contraceptives | Hormonal Contraceptives (years) | IUD | ... | STDs: Time since first diagnosis | STDs: Time since last diagnosis | Dx:Cancer | Dx:CIN | Dx:HPV | Dx | Hinselmann | Schiller | Citology | Biopsy
0 18 4.0 15.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN 0 0 0 0 0 0 0 0
1 15 1.0 14.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN 0 0 0 0 0 0 0 0
2 34 1.0 NaN 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... NaN NaN 0 0 0 0 0 0 0 0
3 52 5.0 16.0 4.0 1.0 37.0 37.0 1.0 3.0 0.0 ... NaN NaN 1 0 1 0 0 0 0 0
4 46 3.0 21.0 4.0 0.0 0.0 0.0 1.0 15.0 0.0 ... NaN NaN 0 0 0 0 0 0 0 0

5 rows × 36 columns
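
The effect of `na_values="?"` can be checked without downloading anything. The sketch below parses a small inline CSV whose contents are made up for illustration (only the column names are borrowed from the dataset) and counts the missing entries per column.

```python
import io
import pandas as pd

# Hypothetical CSV text; "?" stands for a missing value, as in the real file
csv_text = "Age,Smokes,IUD\n18,0,?\n15,?,0\n34,1,1\n"

# na_values="?" tells read_csv to turn every "?" into NaN
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Columns that contain NaN are read as float, the others stay integer
print(toy.dtypes)
```

Running `df.isna().sum()` on the full dataset in the same way shows which of the 36 columns need imputation before modeling.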

 

2-1 Filling missing values

In [5]:
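# Imputer fills missing values with the column mean by default.
# It was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer.
from sklearn.preprocessing import Imputer

impute=pd.DataFrame(Imputer().fit_transform(df))
# NOTE: 'colums' is a typo for 'columns'. It triggers the UserWarning below
# and leaves the integer column labels in place; the next cell uses the correct name.
impute.colums=df.columns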
impute.index=df.index

impute.head()
 
E:\Software\Anaconda3_5.2.0\lib\site-packages\ipykernel_launcher.py:4: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  after removing the cwd from sys.path.
Out[5]:
 
 0 1 2 3 4 5 6 7 8 9 ... 26 27 28 29 30 31 32 33 34 35
0 18.0 4.0 15.0000 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 6.140845 5.816901 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 15.0 1.0 14.0000 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 6.140845 5.816901 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 34.0 1.0 16.9953 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 6.140845 5.816901 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 52.0 5.0 16.0000 4.0 1.0 37.0 37.0 1.0 3.0 0.0 ... 6.140845 5.816901 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
4 46.0 3.0 21.0000 4.0 0.0 0.0 0.0 1.0 15.0 0.0 ... 6.140845 5.816901 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 36 columns
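
On scikit-learn 0.22 and later, `sklearn.preprocessing.Imputer` no longer exists and `sklearn.impute.SimpleImputer` takes its place. The following is a minimal sketch of the same mean imputation with that class. The frame `toy` is a made-up example, and passing `columns=` and `index=` to the constructor is one way to keep the labels without assigning attributes afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one missing value per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [np.nan, 4.0, 6.0]})

# strategy="mean" replaces each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")

# fit_transform returns a plain NumPy array, so the labels are restored here
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns,
                      index=toy.index)
print(filled)
```

Here the missing entry in `a` becomes 2.0 (the mean of 1 and 3) and the one in `b` becomes 5.0 (the mean of 4 and 6).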

In [7]:
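# 1. Use Imputer() to fill missing values with the column mean
impute = pd.DataFrame(Imputer().fit_transform(df))
print(impute.head())
impute.columns = df.columns
impute.index = df.index
# 2. Import the required packages
%matplotlib notebook
import numpy as np
import seaborn as sns  # statistical plotting library
import matplotlib.pyplot as plt
# sklearn.decomposition contains matrix decomposition algorithms such as PCA, NMF and ICA;
# most of them can be regarded as dimensionality reduction techniques
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D  # 3D plotting support

# 3. Separate the features from the Dx:Cancer column
features = impute.drop('Dx:Cancer', axis=1)
y = impute['Dx:Cancer']
# 4. Run PCA
pca = PCA(n_components=3)
X_r = pca.fit_transform(features)
# '{:.2%}' formats a number as a percentage with two decimal places;
# pca.explained_variance_ratio_ holds the fraction of variance explained by each component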
print('Explained variance:\nPC1: {:.2%}\nPC2: {:.2%}\nPC3: {:.2%}'
    .format(pca.explained_variance_ratio_[0],
            pca.explained_variance_ratio_[1],
            pca.explained_variance_ratio_[2],))
# Build the 3D axes (add_subplot also works on newer matplotlib, where Axes3D(fig) is no longer added to the figure automatically)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Draw the scatter plot (with the single color c='r', the cmap argument has no effect)
ax.scatter(X_r[:, 0], X_r[:, 1], X_r[:, 2], c='r', cmap=plt.cm.coolwarm)
# Label the three axes
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
 
     0    1        2    3    4     5     6    7     8    9  ...         26  \
0  18.0  4.0  15.0000  1.0  0.0   0.0   0.0  0.0   0.0  0.0 ...   6.140845   
1  15.0  1.0  14.0000  1.0  0.0   0.0   0.0  0.0   0.0  0.0 ...   6.140845   
2  34.0  1.0  16.9953  1.0  0.0   0.0   0.0  0.0   0.0  0.0 ...   6.140845   
3  52.0  5.0  16.0000  4.0  1.0  37.0  37.0  1.0   3.0  0.0 ...   6.140845   
4  46.0  3.0  21.0000  4.0  0.0   0.0   0.0  1.0  15.0  0.0 ...   6.140845   

         27   28   29   30   31   32   33   34   35  
0  5.816901  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
1  5.816901  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
2  5.816901  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
3  5.816901  1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  
4  5.816901  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  

[5 rows x 36 columns]
Explained variance:
PC1: 59.41%
PC2: 14.59%
PC3: 9.02%
 
 
 
Out[7]:
Text(0.5,0,'PC3')
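
To clarify what the explained variance ratio measures, the sketch below recomputes it by hand with NumPy on synthetic data. The data, the column scales and the variable names are assumptions for illustration only; this is not the cervical cancer dataset and not scikit-learn's internal code, just the standard definition: the variance along each principal component divided by the total variance.

```python
import numpy as np

# Synthetic data: 100 samples, 4 features with deliberately different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

# PCA works on centered data
Xc = X - X.mean(axis=0)

# Singular values of the centered data, returned in descending order
s = np.linalg.svd(Xc, compute_uv=False)

# Variance along each principal component
explained_variance = s ** 2 / (X.shape[0] - 1)

# Share of the total variance captured by each component
ratio = explained_variance / explained_variance.sum()

for i, r in enumerate(ratio, start=1):
    print('PC{}: {:.2%}'.format(i, r))
```

The ratios over all components sum to 1. When only the first three are kept, as in the cell above, their sum tells how much of the total variance the 3D plot retains (about 83% there: 59.41% + 14.59% + 9.02%).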
posted @ 2019-10-28 10:32  karina512