ML: The k-Nearest Neighbors Algorithm

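In this section

  • The k-nearest neighbors classification algorithm
  • Parsing and importing data from a text file
  • Creating scatter plots with Matplotlib
  • Normalizing numeric values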


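I. Overview of the k-Nearest Neighbors Algorithm

Put simply, the k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distance between its feature values and those of samples whose class is already known.

k-nearest neighbors at a glance

Pros: high accuracy, insensitive to outliers, no assumptions about the input data

Cons: high computational complexity, high space complexity

Works with: numeric and nominal values

As an example, we use k-nearest neighbors to tell romance movies from action movies: given the number of fight scenes and kissing scenes in a movie, is it a romance movie or an action movie?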

from IPython.display import Image
Image(filename="./data/2_1.png",width=500)

output_6_0.png

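First we need to know how many fight scenes and kissing scenes the unknown movie contains. The "?" in the figure marks where the unknown movie falls based on those two counts.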

Movie title                  Fight scenes  Kissing scenes  Movie type
California Man               3             104             Romance (爱情片)
He’s Not Really into Dudes   2             100             Romance (爱情片)
Beautiful Woman              1             81              Romance (爱情片)
Kevin Longblade              101           10              Action (动作片)
Robo Slayer 3000             99            5               Action (动作片)
Amped II                     98            2               Action (动作片)
?                            18            90              Unknown

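Even without knowing which type the unknown movie belongs to, we can work it out. First, compute the Euclidean distance between the unknown movie and every movie in the sample set:

Movie title                  Distance to the unknown movie
California Man               20.5
He’s Not Really into Dudes   18.9
Beautiful Woman              19.2
Kevin Longblade              115.3
Robo Slayer 3000             117.4
Amped II                     118.9

We now have the distance from every movie in the sample set to the unknown movie. Sorting by increasing distance gives the k nearest movies. With k=3, the three closest movies are, in order, He’s Not Really into Dudes, Beautiful Woman, and California Man. The k-nearest neighbors algorithm decides the type of the unknown movie from the types of these three movies. All three are romance movies, so we classify the unknown movie as a romance movie.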
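
The distances in the table can be checked with a few lines of standard-library Python. This is only a sketch for verifying the arithmetic: the dictionary `movies` and the variable names are illustrative, and the feature values are copied from the first table.

```python
import math

# (fight scenes, kissing scenes) for the six labeled movies, from the table above
movies = {
    "California Man": (3, 104),
    "He's Not Really into Dudes": (2, 100),
    "Beautiful Woman": (1, 81),
    "Kevin Longblade": (101, 10),
    "Robo Slayer 3000": (99, 5),
    "Amped II": (98, 2),
}
unknown = (18, 90)

# Euclidean distance from the unknown movie to each labeled movie
distances = {
    title: math.hypot(unknown[0] - fights, unknown[1] - kisses)
    for title, (fights, kisses) in movies.items()
}

# Titles ordered from nearest to farthest; the first three are the k=3 neighbors
ranked = sorted(distances, key=distances.get)
nearest_three = ranked[:3]

for title in ranked:
    print(f"{title}: {distances[title]:.1f}")
```

For example, the distance to California Man is sqrt((18-3)^2 + (90-104)^2) = sqrt(421), which is about 20.5.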

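General approach to k-nearest neighbors

  1. Collect data: any method will do
  2. Prepare data: numeric values are needed for the distance calculation
  3. Analyze data: any method will do
  4. Train the algorithm: this step does not apply to k-nearest neighbors
  5. Test the algorithm: compute the error rate
  6. Use the algorithm: supply sample data and structured output, run k-nearest neighbors to determine which class each input belongs to, then carry out whatever follow-up processing the application requires for the predicted class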


1. Prepare: importing data with Python

import numpy as np
import operator

def createDataSet():
    dataset=np.array([[3,104],[2,100],[1,81],[101,10],[99,5],[98,2]])
    # Labels are kept in Chinese: 爱情片 = romance movie, 动作片 = action movie
    labels=["爱情片","爱情片","爱情片","动作片","动作片","动作片"]
    return dataset,labels
dataset,labels=createDataSet()
dataset
array([[  3, 104],
       [  2, 100],
       [  1,  81],
       [101,  10],
       [ 99,   5],
       [ 98,   2]])
labels
['爱情片', '爱情片', '爱情片', '动作片', '动作片', '动作片']

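The labels vector holds the class label of each data point, so it has as many elements as the dataset matrix has rows. In the plot below, red dots are romance movies and blue triangles are action movies.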

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

plt.plot([3,2,1],[104,100,81],"ro",[101,99,98],[10,5,2],"b^")
[<matplotlib.lines.Line2D at 0x2075b9f8358>,
 <matplotlib.lines.Line2D at 0x2075b9f8470>]

output_19_1.png


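2. Implementing the kNN classifier

For each point in the data set whose class is unknown, do the following:

  1. Compute the distance between the current point and every point in the data set of known classes
  2. Sort the distances in ascending order
  3. Take the k points closest to the current point
  4. Count how often each class occurs among those k points
  5. Return the most frequent class among those k points as the predicted class of the current point

def classMovieTest(X,dataset,labels,k):
    """
    :param X: input vector to classify
    :param dataset: training sample set
    :param labels: vector of class labels
    :param k: number of nearest neighbors to use
    :return: class label; distances from X to each known sample
    """

    # Euclidean distance from X to every row of dataset
    datasetSize=dataset.shape[0]
    datasetMat=np.tile(X,(datasetSize,1))-dataset
    sqdatasetMat=datasetMat**2
    sqDistances=sqdatasetMat.sum(axis=1)
    distances=sqDistances**0.5

    # Indices of the samples, ordered from nearest to farthest
    sortDistIndicies=distances.argsort()
    classcount={}

    # Tally the labels of the k nearest points
    for i in range(k):
        voteLabel=labels[sortDistIndicies[i]]
        classcount[voteLabel]=classcount.get(voteLabel,0)+1

    # Sort the classes by vote count, highest first
    sortClasscount=sorted(classcount.items(),key=operator.itemgetter(1),reverse=True)
    return sortClasscount[0][0],distances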

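Now classify a new point. With X=[18,90] as input, the result should agree with the analysis above: the label 爱情片 (romance movie), together with the distances to the six known movies.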

classMovieTest([18,90],dataset,labels,3)
('爱情片', array([ 20.51828453,  18.86796226,  19.23538406, 115.27792503,
        117.41379817, 118.92854998]))
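
The same steps can be written more compactly. The sketch below is an alternative formulation, not the function used in the rest of this post: it relies on NumPy broadcasting instead of np.tile and on collections.Counter for the vote. The name `knn_vote` and the English labels are assumptions made for this illustration.

```python
from collections import Counter

import numpy as np


def knn_vote(x, dataset, labels, k):
    """Return the majority label among the k rows of dataset nearest to x."""
    # Broadcasting subtracts x from every row, so np.tile is not needed
    distances = np.linalg.norm(dataset - np.asarray(x), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those k neighbors and take the most common one
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Same six movies as above; English labels are used only in this sketch
movie_features = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
movie_labels = ["romance"] * 3 + ["action"] * 3

print(knn_vote([18, 90], movie_features, movie_labels, 3))
```

With an even k or more than two classes the vote can tie; Counter.most_common then returns the tied label that was counted first, that is, the one belonging to the nearer neighbor.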


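II. Using k-Nearest Neighbors to Improve Matches on a Dating Site

The people to be classified fall into three types:

  • People you do not like
  • People of average charm
  • People who are very charming


1. Prepare: parsing data from a text file

The data is stored in the text file datingTestSet2.txt, one sample per line, 1000 lines in total. Each sample has the following three features:

  1. Frequent flier miles earned per year
  2. Percentage of time spent playing video games
  3. Liters of ice cream consumed per week

The function fileTmatrix handles the input format. It takes a file name string as input and returns the training sample matrix and the vector of class labels.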

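def fileTmatrix(filename):
    """
    :param filename: name of the data set file
    :return: training data matrix; class label vector
    """
    # Read all lines; the with-block closes the file afterwards
    with open(filename) as fr:
        arrayLines=fr.readlines()

    # Number of lines in the file
    numberLines=len(arrayLines)

    # NumPy matrix to return: one row per sample, three features per row
    datasetMat=np.zeros((numberLines,3))
    classLabelVector=[]
    index=0

    # Parse each line: the first three fields are features, the last one is the class label
    for line in arrayLines:
        line=line.strip()
        listFromLine=line.split("\t")
        datasetMat[index,:]=listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index+=1
    return datasetMat,classLabelVector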
dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
dataMat
array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
       [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
       [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
       ...,
       [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
       [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
       [4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])
dataLabels[0:20]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
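
To try the parsing logic without the data file, the sketch below reads tab-separated rows from an in-memory string with the standard csv module. The three sample rows are reconstructed from the first three rows of dataMat and dataLabels printed above; the function name `parse_dating_rows` is an assumption made for this illustration.

```python
import csv
import io


def parse_dating_rows(text_stream):
    """Split tab-separated rows into a list of feature rows and a list of labels."""
    features, labels = [], []
    for row in csv.reader(text_stream, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # The first three fields are the numeric features
        features.append([float(value) for value in row[:3]])
        # The last field is the integer class label
        labels.append(int(row[-1]))
    return features, labels


# Rows reconstructed from the printed output above (not read from the real file)
sample = io.StringIO(
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)
sample_features, sample_labels = parse_dating_rows(sample)
print(sample_features[0], sample_labels)
```

Passing an open file object instead of the StringIO object would work the same way.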


2. Analyze: creating scatter plots with Matplotlib

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.plot(dataMat[:,1],dataMat[:,2],"bo")

plt.xlabel("Percentage of Time Spent Playing Video Games")
plt.ylabel("Liters of ice cream consumed per week")

plt.show()

output_36_0.png

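Matplotlib's scatter function lets us customize the markers in a scatter plot. Here both the marker size and the marker color are derived from the class label: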

fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(dataMat[:,1],dataMat[:,2],15.0*np.array(dataLabels),15.0*np.array(dataLabels))
<matplotlib.collections.PathCollection at 0x2075c05ea58>

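output_38_1.png

Plotting the first and second columns of dataMat (frequent flier miles and percentage of time spent playing video games) works better: the figure clearly shows three separate regions, one per class, so people with different habits end up in different regions.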

fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(dataMat[:,0],dataMat[:,1],15.0*np.array(dataLabels),15.0*np.array(dataLabels))
<matplotlib.collections.PathCollection at 0x2075d1d50b8>

output_40_1.png


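3. Prepare: normalizing numeric values

To rescale a feature with an arbitrary range of values to the interval from 0 to 1, use:

newValue=(oldValue-min)/(max-min)

where min and max are the smallest and largest values of that feature in the data set. The function Norm applies this to every numeric feature: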

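def Norm(dataset):
    """
    :param dataset: data set
    :return: normalized data set; range (max - min) of each feature; minimum of each feature
    """

    # The argument 0 takes the minimum and maximum down each column (per feature)
    minVal=dataset.min(0)
    maxVal=dataset.max(0)
    ranges=maxVal-minVal
    m=dataset.shape[0]

    # Subtract the column minimum from every row
    normDataset=dataset-np.tile(minVal,(m,1))

    # Element-wise division by the range of each feature
    normDataset=normDataset/np.tile(ranges,(m,1))
    return normDataset,ranges,minVal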
normMat,ranges,minVal=Norm(dataMat)
normMat
array([[0.44832535, 0.39805139, 0.56233353],
       [0.15873259, 0.34195467, 0.98724416],
       [0.28542943, 0.06892523, 0.47449629],
       ...,
       [0.29115949, 0.50910294, 0.51079493],
       [0.52711097, 0.43665451, 0.4290048 ],
       [0.47940793, 0.3768091 , 0.78571804]])
ranges
array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])
minVal
array([0.      , 0.      , 0.001156])
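
NumPy broadcasting gives the same min-max scaling without np.tile. The sketch below is an alternative under that assumption, not a replacement for Norm; the name `min_max_scale` and the point `new_point` are hypothetical. Because only the first three rows of dataMat are used as sample data, the scaled values differ from the normMat shown above.

```python
import numpy as np


def min_max_scale(dataset):
    """Scale every column of dataset to the interval [0, 1]."""
    mins = dataset.min(axis=0)            # per-column minimum
    ranges = dataset.max(axis=0) - mins   # per-column max - min
    # mins and ranges have shape (3,), so they broadcast across all rows
    return (dataset - mins) / ranges, ranges, mins


# First three rows of dataMat as printed earlier
sample = np.array([
    [40920.0, 8.326976, 0.953952],
    [14488.0, 7.153469, 1.673904],
    [26052.0, 1.441871, 0.805124],
])
scaled, sample_ranges, sample_mins = min_max_scale(sample)

# A new (hypothetical) point must be scaled with the SAME mins and ranges
new_point = np.array([20000.0, 5.0, 1.0])
new_scaled = (new_point - sample_mins) / sample_ranges
print(scaled)
print(new_scaled)
```

A column whose values are all identical has a range of 0 and would cause a division by zero; neither this sketch nor Norm guards against that case.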


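4. Test: validating the classifier as a complete program

classMovieTest is redefined here so that it returns only the class label.

def classMovieTest(X,dataset,labels,k):
    """
    :param X: input vector to classify
    :param dataset: training sample set
    :param labels: vector of class labels
    :param k: number of nearest neighbors to use
    :return: class label
    """

    # Euclidean distance from X to every row of dataset
    datasetSize=dataset.shape[0]
    datasetMat=np.tile(X,(datasetSize,1))-dataset
    sqdatasetMat=datasetMat**2
    sqDistances=sqdatasetMat.sum(axis=1)
    distances=sqDistances**0.5

    # Indices of the samples, ordered from nearest to farthest
    sortDistIndicies=distances.argsort()
    classcount={}

    # Tally the labels of the k nearest points
    for i in range(k):
        voteLabel=labels[sortDistIndicies[i]]
        classcount[voteLabel]=classcount.get(voteLabel,0)+1

    # Sort the classes by vote count, highest first
    sortClasscount=sorted(classcount.items(),key=operator.itemgetter(1),reverse=True)
    return sortClasscount[0][0]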
def classTest():
    # Fraction of the data held out for testing
    haRatio=0.10
    dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
    normMat,ranges,minvals=Norm(dataMat)
    m=normMat.shape[0]
    numTestVecs=int(m*haRatio)
    errorcount=0.0

    # The first numTestVecs rows are the test set; the remaining rows are the training set
    for i in range(numTestVecs):
        classifierResult=classMovieTest(normMat[i,:],normMat[numTestVecs:m,:],dataLabels[numTestVecs:m],3)
        print("The classifier came back with:%d,The real answer is:%d"%(classifierResult,dataLabels[i]))
        if (classifierResult!=dataLabels[i]):
            errorcount+=1.0
    # Note: the next line prints the number of errors, not a rate
    print("The total error rate is:%d"%errorcount)
    print("The total error rate is:%f"%(errorcount/numTestVecs))
classTest()
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:1
The total error rate is:5
The total error rate is:0.050000

Suppose we use the entire data set as the training set. Does the accuracy improve?

def classTest2():
    dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
    normMat,ranges,minvals=Norm(dataMat)
    m=normMat.shape[0]
    errorcount=0.0
    
    for i in range(m):
        classifierResult2=classMovieTest(normMat[i,:],normMat[:,:],dataLabels[:],3)
        
        if (classifierResult2!=dataLabels[i]):
            errorcount+=1.0
    print("The total error rate:",(errorcount/m))
classTest2()
The total error rate: 0.027

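The result shows the error rate falling from 5% to 2.7%. This comparison should be read with care, though: here every test sample is also part of the training set, so one of its three nearest neighbors is the sample itself (distance 0). That makes the 2.7% figure optimistic rather than a like-for-like improvement over the 5% hold-out error rate.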
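
One way to score every sample without letting it vote for itself is leave-one-out evaluation: each sample is classified against all the other samples. The sketch below compares the two ways of measuring the error on a tiny hypothetical one-dimensional data set (not the dating data); all names in it are assumptions made for this illustration.

```python
from collections import Counter


def knn_predict(x, points, labels, k):
    """Majority label among the k points closest to x (1-D distance)."""
    order = sorted(range(len(points)), key=lambda i: abs(points[i] - x))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def in_sample_error(points, labels, k):
    """Error rate when each sample is also part of its own training set."""
    wrong = sum(knn_predict(x, points, labels, k) != y
                for x, y in zip(points, labels))
    return wrong / len(points)


def leave_one_out_error(points, labels, k):
    """Error rate when each sample is classified against all OTHER samples."""
    wrong = 0
    for i, (x, y) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        wrong += knn_predict(x, rest_points, rest_labels, k) != y
    return wrong / len(points)


# Hypothetical toy data: two "B" samples sit close to the "A" cluster
toy_points = [0.0, 1.0, 2.0, 4.5, 5.5, 20.0, 21.0, 22.0]
toy_labels = ["A", "A", "A", "B", "B", "B", "B", "B"]

print(in_sample_error(toy_points, toy_labels, 3))      # 0.0
print(leave_one_out_error(toy_points, toy_labels, 3))  # 0.25
```

On this toy data the in-sample error is 0 while the leave-one-out error is 0.25: the two samples at 4.5 and 5.5 are classified correctly only when their own vote is counted.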


5. Use: building a complete, usable system

def classifyPerson():
    resultList=["not at all","in small doses","in large doses"]
    percentTats=float(input("Percentage of time spent playing video games:"))
    ffMiles=float(input("Frequent flier miles earned per year:"))
    iceCream=float(input("liters of ice cream consumed per week:"))
    datingDataMat,datingLabels=fileTmatrix("./data/datingTestSet2.txt")
    normMat,ranges,minvals=Norm(datingDataMat)
    # The feature order must match the columns of the data file
    inArr=np.array([ffMiles,percentTats,iceCream])
    # Scale the query with the same minimum values and ranges as the training data
    classifierResult=classMovieTest((inArr-minvals)/ranges,normMat,datingLabels,3)
    print("You will probably like this person:",resultList[classifierResult-1])
classifyPerson()
Percentage of time spent playing video games: 10
Frequent flier miles earned per year: 10000
liters of ice cream consumed per week: 0.5


You will probably like this person: in small doses


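III. A Handwriting Recognition System

We build a system that recognizes the digits 0 through 9. The images have been processed to share the same color and size: black-and-white images 32 pixels wide by 32 pixels high.


1. Prepare: converting images into test vectors

The images are stored in the directory trainingDigits, which contains about 2000 examples, roughly 200 samples per digit. The directory testDigits contains about 900 test samples.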

from IPython.display import Image

Image(filename="./data/2_2.png",width=500)

output_64_0.png

Image(filename="./data/2_3.png",width=500)

output_65_0.png

Image(filename="./data/2_4.png",width=500)

output_66_0.png

We convert each 32×32 binary image matrix into a 1×1024 vector. The function imgTvector performs this conversion:

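def imgTvector(filename):
    # 1x1024 row vector that will hold the flattened 32x32 image
    returnVect=np.zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr=fr.readline()
            # Copy the 32 characters of row i into positions 32*i .. 32*i+31
            for j in range(32):
                returnVect[0,32*i+j]=int(lineStr[j])
    return returnVect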
testVector=imgTvector("./data/digits/testDigits/0_13.txt")
testVector[0,0:31]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
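
The row-major layout (pixel at row i, column j goes to position 32*i+j) can be checked on a synthetic image without touching the data directory. In the sketch below, the image is hypothetical: every row repeats the pattern of the first row printed above, with ones in columns 14 to 17. The function name `text_image_to_vector` is likewise an assumption made for this illustration.

```python
import io


def text_image_to_vector(lines, size=32):
    """Flatten the first `size` lines of `size` digits each into one flat list."""
    vector = []
    for row_number, line in enumerate(lines):
        if row_number == size:
            break
        row = line.strip()
        # Row i occupies positions size*i .. size*i + size - 1
        vector.extend(int(ch) for ch in row[:size])
    return vector


# Hypothetical 32x32 image: a vertical bar in columns 14..17 of every row
synthetic_row = "0" * 14 + "1" * 4 + "0" * 14
synthetic_image = io.StringIO("\n".join([synthetic_row] * 32))

image_vector = text_image_to_vector(synthetic_image)
print(len(image_vector))        # 1024
print(image_vector[14:18])      # [1, 1, 1, 1]
```

This version returns a plain Python list rather than the 1×1024 NumPy array that imgTvector produces.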


2. Test: recognizing handwritten digits with k-nearest neighbors

from os import listdir

def handwritingClassTest():
    hwLabels=[]
    trainingFileList=listdir("./data/digits/trainingDigits/")
    m=len(trainingFileList)
    trainingMat=np.zeros((m,1024))
    for i in range(m):
        fileNameStr=trainingFileList[i]
        fileStr=fileNameStr.split(".")[0]
        classNumStr=int(fileStr.split("_")[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:]=imgTvector("./data/digits/trainingDigits/%s"%fileNameStr)
    testFileList=listdir("./data/digits/testDigits/")
    errorCount=0.0
    mTest=len(testFileList)
    for i in range(mTest):
        fileNameStr=testFileList[i]
        fileStr=fileNameStr.split(".")[0]
        classNumStr=int(fileStr.split("_")[0])
        vectorUnderTest=imgTvector("./data/digits/testDigits/%s"%fileNameStr)
        classifierResult=classMovieTest(vectorUnderTest,trainingMat,hwLabels,3)
        print("The classifier came back with:%d,The real answer is:%d"%(classifierResult,classNumStr))
        if (classifierResult!=classNumStr):
            errorCount+=1.0
    print("The total number of errors is:%d"%errorCount)
    print("The total error rate is:%f"%(errorCount/float(mTest)))
handwritingClassTest()
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
.
.
.
The classifier came back with:9,The real answer is:9
The classifier came back with:9,The real answer is:9
The classifier came back with:9,The real answer is:9
The total number of errors is:10
The total error rate is:0.010571

On the handwritten digits data set, the k-nearest neighbors classifier reaches an error rate of about 1.1%.
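
In handwritingClassTest the true digit is read from the file name: a name such as 0_13.txt means digit 0. A minimal sketch of that step in isolation follows; the helper name `label_from_filename` and the second file name in the example are hypothetical.

```python
import os


def label_from_filename(file_name):
    """Return the digit encoded before the underscore, e.g. '0_13.txt' -> 0."""
    # Drop any directory part and the extension, leaving e.g. '0_13'
    stem, _extension = os.path.splitext(os.path.basename(file_name))
    # The class label is the part before the underscore
    return int(stem.split("_")[0])


print(label_from_filename("0_13.txt"))
print(label_from_filename("./data/digits/testDigits/9_45.txt"))
```

If the directory contains files that do not follow this naming pattern, int() raises a ValueError, so such files would need to be filtered out before the loop.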

posted @ 2019-05-09 08:58  LQ6H