Machine Learning in Action study notes: 09. Simplifying Data with PCA
Keywords: PCA, principal component analysis, dimensionality reduction
Author: 米仓山下
Date: 2018-11-15
Machine Learning in Action (@author: Peter Harrington)
Source code download: https://www.manning.com/books/machine-learning-in-action
git@github.com:pbharrin/machinelearninginaction.git
**************************************************************************************************************************
To understand PCA, you first need the concepts of variance, covariance, the covariance matrix, eigenvalues, and eigenvectors. Principal component analysis studies the relationships among variables: finding the eigenvectors and eigenvalues of the covariance matrix is equivalent to fitting the line (the principal component) that preserves the maximum variance.
The eigenvectors describe an intrinsic property of a square matrix (here, the covariance matrix): they are the vectors whose direction does not change under the matrix's transformation. In other words, eigenvectors are the invariant axes of a linear transformation. The WeChat official account 机器之心 published an article that explains the principle of PCA very well:
"Tutorial | From eigendecomposition to the covariance matrix: a detailed analysis and implementation of PCA" (教程 | 从特征分解到协方差矩阵:详细剖析和实现PCA算法): https://mp.weixin.qq.com/s/tJ_FbL2nFQfkvKqpQJ8kmg
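The "invariant axis" idea can be checked directly: an eigenvector of the covariance matrix is only scaled by the matrix, never rotated. A minimal sketch with made-up 2-D data (not from the book):

```python
import numpy as np

# made-up correlated 2-D point cloud
rng = np.random.default_rng(0)
pts = rng.standard_normal((100, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

C = np.cov(pts, rowvar=False)          # 2x2 covariance matrix
eig_vals, eig_vecs = np.linalg.eig(C)  # columns of eig_vecs are eigenvectors
v = eig_vecs[:, 0]

# the direction is unchanged: C @ v is just v scaled by its eigenvalue
assert np.allclose(C @ v, eig_vals[0] * v)
```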
# Input data:
import numpy as np
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9])
# Center the data (subtract the means):
mean_x = np.mean(x)
mean_y = np.mean(y)
scaled_x = x - mean_x
scaled_y = y - mean_y
data = np.matrix([[scaled_x[i], scaled_y[i]] for i in range(len(scaled_x))])
# Scatter plot to inspect the distribution:
import matplotlib.pyplot as plt
plt.plot(scaled_x, scaled_y, 'o')
plt.show()
# Covariance matrix:
cov = np.cov(scaled_x, scaled_y)
# Eigenvalues and eigenvectors of the covariance matrix:
eig_val, eig_vec = np.linalg.eig(cov)
# Choose the reduced dimension k (n-dimensional data down to k dimensions);
# this data has only two dimensions, so we can only reduce to one:
eig_pairs = [(np.abs(eig_val[i]), eig_vec[:, i]) for i in range(len(eig_val))]
eig_pairs.sort(key=lambda p: p[0], reverse=True)  # sort by eigenvalue, largest first
feature = eig_pairs[0][1]
# Project to obtain the reduced data:
new_data_reduced = np.transpose(np.dot(feature, np.transpose(data)))
>>> eig_val
array([0.0490834 , 1.28402771])
>>> eig_vec
array([[-0.73517866, -0.6778734 ],
[ 0.6778734 , -0.73517866]])
>>>
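The eigenvalues printed above can be sanity-checked: the variance of the centered data projected onto each eigenvector equals the matching eigenvalue, and the eigenvectors of a symmetric covariance matrix are orthonormal. A small sketch repeating the same toy x/y data:

```python
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x - x.mean(), y - y.mean()])  # centered, shape (10, 2)

cov = np.cov(data, rowvar=False)
eig_val, eig_vec = np.linalg.eig(cov)

proj = data @ eig_vec  # coordinates in the eigenvector basis
# sample variance along each new axis equals the matching eigenvalue
assert np.allclose(proj.var(axis=0, ddof=1), eig_val)
# eigenvectors of a symmetric matrix form an orthonormal basis
assert np.allclose(eig_vec.T @ eig_vec, np.eye(2))
```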
**************************************************************************************************************************
The PCA function in Machine Learning in Action is as follows:
from numpy import *
def pca(dataMat, topNfeat=9999999):
    meanVals = mean(dataMat, axis=0)             # column means
    meanRemoved = dataMat - meanVals             # center the data
    covMat = cov(meanRemoved, rowvar=0)          # covariance matrix
    eigVals, eigVects = linalg.eig(mat(covMat))  # eigenvalues and eigenvectors
    eigValInd = argsort(eigVals)                 # sort, smallest to largest
    eigValInd = eigValInd[:-(topNfeat+1):-1]     # keep the top topNfeat eigenvalues
    redEigVects = eigVects[:, eigValInd]         # reorganize eig vects largest to smallest
    lowDDataMat = meanRemoved * redEigVects      # transform the centered data into the new basis
    # Multiplying by the transpose of the eigenvector matrix (which equals its
    # inverse, since the basis is orthonormal) transforms the data back. If only
    # the top k eigenvectors were kept, the information in the remaining
    # dimensions is lost in the reconstruction.
    reconMat = (lowDDataMat * redEigVects.T) + meanVals
    return lowDDataMat, reconMat
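A quick sanity check on this pca(): keeping all components reconstructs the data exactly, while topNfeat=1 yields a one-dimensional representation. The function is repeated below so the snippet runs standalone, and the 10-point data set is the toy data from earlier in these notes, not the book's testSet.txt:

```python
import numpy as np

def pca(dataMat, topNfeat=9999999):  # same routine as above
    meanVals = np.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals
    covMat = np.cov(meanRemoved, rowvar=False)
    eigVals, eigVects = np.linalg.eig(np.matrix(covMat))
    eigValInd = np.argsort(eigVals)[:-(topNfeat + 1):-1]
    redEigVects = eigVects[:, eigValInd]
    lowDDataMat = meanRemoved * redEigVects
    reconMat = (lowDDataMat * redEigVects.T) + meanVals
    return lowDDataMat, reconMat

dataMat = np.matrix([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

lowD2, recon2 = pca(dataMat, 2)  # keep both components: lossless round-trip
assert np.allclose(recon2, dataMat)

lowD1, recon1 = pca(dataMat, 1)  # keep one component: rank-1 approximation
assert lowD1.shape == (10, 1)
```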
**************************************************************************************************************************
Tests:
1. python createFig1.py  # generate the test data 'testSet.txt' and draw it as a scatter plot
'''
Created on Jun 1, 2011
@author: Peter
'''
from numpy import *
import matplotlib
import matplotlib.pyplot as plt

n = 1000  # number of points to create
xcord0 = []; ycord0 = []
fw = open('testSet.txt', 'w')
for i in range(n):
    [r0, r1] = random.standard_normal(2)
    fFlyer = r0 + 9.0
    tats = 1.0*r1 + fFlyer + 0
    xcord0.append(fFlyer)
    ycord0.append(tats)
    fw.write("%f\t%f\n" % (fFlyer, tats))
fw.close()
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xcord0, ycord0, marker='^', s=30)
plt.xlabel('hours of direct sunlight')
plt.ylabel('liters of water')
plt.show()
2. python createFig2.py  # scatter plot of the test data 'testSet.txt' (blue); overlaid in red is the same data after PCA keeping only one principal component and transforming back (so only the first component's information remains)
'''
Created on Jun 1, 2011
@author: Peter
'''
from numpy import *
import matplotlib
import matplotlib.pyplot as plt
import pca

dataMat = pca.loadDataSet('testSet.txt')
lowDMat, reconMat = pca.pca(dataMat, 1)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(dataMat[:,0].tolist(), dataMat[:,1].tolist(), marker='^', s=30)
ax.scatter(reconMat[:,0].tolist(), reconMat[:,1].tolist(), marker='o', s=30, c='red')
plt.show()
3. python createFig3.py  # compare scatter plots before and after PCA. Top: the original data; bottom: the first principal component plotted against zero.
'''
Created on Jun 1, 2011
@author: Peter
'''
from numpy import *
import matplotlib
import matplotlib.pyplot as plt
import pca

n = 1000  # number of points to create
xcord0 = []; ycord0 = []
xcord1 = []; ycord1 = []
xcord2 = []; ycord2 = []
fw = open('testSet3.txt', 'w')
for i in range(n):
    groupNum = int(3*random.uniform())
    [r0, r1] = random.standard_normal(2)
    if groupNum == 0:
        x = r0 + 16.0
        y = 1.0*r1 + x
        xcord0.append(x); ycord0.append(y)
    elif groupNum == 1:
        x = r0 + 8.0
        y = 1.0*r1 + x
        xcord1.append(x); ycord1.append(y)
    elif groupNum == 2:
        x = r0 + 0.0
        y = 1.0*r1 + x
        xcord2.append(x); ycord2.append(y)
    fw.write("%f\t%f\t%d\n" % (x, y, groupNum))
fw.close()
fig = plt.figure()
ax = fig.add_subplot(211)
ax.scatter(xcord0, ycord0, marker='^', s=90)
ax.scatter(xcord1, ycord1, marker='o', s=50, c='red')
ax.scatter(xcord2, ycord2, marker='v', s=50, c='yellow')
ax = fig.add_subplot(212)
myDat = pca.loadDataSet('testSet3.txt')
lowDDat, reconDat = pca.pca(myDat[:,0:2], 1)
label0Mat = lowDDat[nonzero(myDat[:,2]==0)[0], :2][0]  # get the items with label 0
label1Mat = lowDDat[nonzero(myDat[:,2]==1)[0], :2][0]  # get the items with label 1
label2Mat = lowDDat[nonzero(myDat[:,2]==2)[0], :2][0]  # get the items with label 2
ax.scatter(label0Mat[:,0].tolist(), zeros(shape(label0Mat)[0]), marker='^', s=90)
ax.scatter(label1Mat[:,0].tolist(), zeros(shape(label1Mat)[0]), marker='o', s=50, c='red')
ax.scatter(label2Mat[:,0].tolist(), zeros(shape(label2Mat)[0]), marker='v', s=50, c='yellow')
plt.show()
4. python createFig4.py  # percentage of variance captured by the first 20 principal components; most of the variance is contained in the first few components. The variance along each transformed axis equals the corresponding eigenvalue [cf. the proof that the maximal variance in PCA equals the largest eigenvalue].
'''
Created on Jun 14, 2011
@author: Peter
'''
from numpy import *
import matplotlib
import matplotlib.pyplot as plt
import pca

dataMat = pca.replaceNanWithMean()
# below is a quick hack copied from pca.pca()
meanVals = mean(dataMat, axis=0)
meanRemoved = dataMat - meanVals  # remove mean
covMat = cov(meanRemoved, rowvar=0)
eigVals, eigVects = linalg.eig(mat(covMat))
eigValInd = argsort(eigVals)  # sort goes smallest to largest
eigValInd = eigValInd[::-1]   # reverse
sortedEigVals = eigVals[eigValInd]
total = sum(sortedEigVals)
varPercentage = sortedEigVals/total*100
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(range(1, 21), varPercentage[:20], marker='^')
plt.xlabel('Principal Component Number')
plt.ylabel('Percentage of Variance')
plt.show()
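A common follow-up to this plot is choosing the number of components k from the cumulative variance ratio. A minimal sketch with made-up eigenvalues (the 90% threshold is an arbitrary example, not from the book):

```python
import numpy as np

eigVals = np.array([9.0, 4.0, 1.5, 0.4, 0.1])  # toy eigenvalues, already sorted
varRatio = eigVals / eigVals.sum()             # fraction of variance per component
cumRatio = np.cumsum(varRatio)                 # cumulative fraction

# smallest k whose components together cover at least 90% of the variance
k = int(np.searchsorted(cumRatio, 0.90)) + 1
print(k, cumRatio[:k])
```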