Machine Learning Study Notes, Part 1: The K-Nearest Neighbours Algorithm (KNN)

Algorithm

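Suppose the data has M features. Each of the N samples is then a point in M-dimensional space, and the whole dataset can be written as a matrix with one sample per row: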

\[X = \begin{pmatrix} x_{11} & x_{12} & ... & x_{1M} \\ x_{21} & x_{22} & ... & x_{2M} \\ . & . & & .\\ . & . & & .\\ . & . & & .\\ x_{N1} & x_{N2} & ... & x_{NM} \end{pmatrix}\]

We also have a vector of labels, one per sample:

\[\vec{y} = \begin{pmatrix} y_1 \\ y_2 \\ . \\ . \\ . \\ y_N \end{pmatrix}\]

Then, for a new data point

\[\vec{x_z} = \begin{pmatrix} x_{z1} & x_{z2} & ... & x_{zM} \end{pmatrix}\]

we compute its Euclidean distance to every existing point j

\[D_j=\sqrt{(x_{z1}-x_{j1})^2+(x_{z2}-x_{j2})^2+...+(x_{zM}-x_{jM})^2} \]

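This gives a vector of distances to all N points, which we sort in ascending order (so that, after re-indexing, \(D_1 \le D_2 \le ... \le D_N\)):

\[\vec{D} = \begin{pmatrix} D_1 \\ D_2 \\ . \\ . \\ . \\ D_N \end{pmatrix}\]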

The first k entries are the k nearest neighbours.

\[\vec{D_k} = \begin{pmatrix} D_1 \\ D_2 \\ . \\ . \\ . \\ D_k \end{pmatrix}\]
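The distance and selection steps above can be sketched with NumPy as follows. This is only an illustration: the four training points, the new point `x_z` and the value of `k` are made up, and `argsort` is used so that the row indices of the neighbours are kept (they are needed later to look up the labels).

```python
import numpy as np

# Hypothetical training points (N = 4, M = 2) and a new point x_z
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
x_z = np.array([0.0, 0.0])
k = 2

# Broadcasting subtracts x_z from every row of X, giving D_j for each row j
distances = np.sqrt(((X - x_z) ** 2).sum(axis=1))
order = np.argsort(distances)   # row indices, nearest first
nearest_idx = order[:k]         # indices of the k nearest points

print(distances)                # distances are 0, 5, 10 and 1
print(nearest_idx)              # rows 0 and 3
print(distances[nearest_idx])   # 0 and 1
```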

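Using the labels of these k points, count how many times each distinct label \(y_i\) occurs; call this count \(n_i\) (there are at most k distinct labels):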

\[\vec{y'}=\begin{pmatrix} y_1 & n_1 \\ y_2 & n_2 \\ . & .\\ . & .\\ . & .\\ y_k & n_k \end{pmatrix}\]

Take the label with the largest count as the label of \(\vec{x_z}\):

\[y_z = \arg\max_{y_i} n_i \]
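A minimal sketch of this voting step, assuming the labels of the k nearest neighbours have already been collected into a list. The helper name `majority_vote` and the sample labels are hypothetical.

```python
from collections import Counter


def majority_vote(neighbour_labels):
    """Return the most frequent label among the k nearest neighbours."""
    counts = Counter(neighbour_labels)       # maps each label y_i to its count n_i
    label, _ = counts.most_common(1)[0]      # the (label, count) pair with the largest count
    return label


print(majority_vote(["A", "B", "A", "A", "B"]))  # A
```

When two labels have the same count, `most_common` returns them in the order they were first encountered, so passing the labels nearest-first lets the closer neighbour decide a tie. That is one possible tie-breaking rule, not the only one.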

Implementation (Python)

import operator

import numpy as np


def KNNclassify(inX, dataset, labels, k):
    """
    K-Nearest Neighbour algorithm
    :param inX: Input vector X
    :param dataset: Training Dataset
    :param labels: Labels vector
    :param k: the number of nearest neighbours
    :return: The class of input
    """
    dataset_size = dataset.shape[0]
    # Repeat inX dataset_size times so it can be subtracted from every row
    diffMat = np.tile(inX, (dataset_size, 1)) - dataset
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)  # Sum along each row of the matrix
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()  # Indices that sort the distances in ascending order
    classCount = {}
    for i in range(k):
        # Count one vote for the label of the i-th nearest neighbour
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Python 3: dict.items() replaces the Python 2 dict.iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
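To see the classifier at work, the sketch below applies a condensed version of the same procedure to a tiny dataset. `knn_classify` is a hypothetical, self-contained stand-in for the function above so that the block runs on its own. The four sample points and their labels are invented for illustration.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, dataset, labels, k):
    """Condensed KNN: distances, k nearest indices, majority vote."""
    distances = np.sqrt(((dataset - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]              # indices of the k nearest rows
    votes = Counter(labels[i] for i in nearest)    # label -> number of votes
    return votes.most_common(1)[0][0]


# Made-up training data: two points near (1, 1) and two near (0, 0)
dataset = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]

print(knn_classify(np.array([0.0, 0.0]), dataset, labels, 3))  # B
print(knn_classify(np.array([1.2, 1.0]), dataset, labels, 3))  # A
```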

Advantages

The algorithm is simple to implement;
No training step is needed beforehand; it can be applied to the data directly.

Disadvantages

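When there are many data entries the algorithm is slow, because it must compute the distance from the new point to every existing point;
Several labels may tie for the largest count, in which case the true class of the new point cannot be determined reliably;
If KNN is applied directly, features with a large numeric range dominate the result. To remove this effect, the data should be normalized in a preprocessing step.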
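One common way to do the normalization mentioned in the last point is min-max scaling, which maps every feature to the range [0, 1]. The sketch below is one possible approach, not necessarily the one the notes had in mind; the helper name `min_max_normalize` and the sample data are hypothetical.

```python
import numpy as np


def min_max_normalize(dataset):
    """Rescale each feature (column) of a float array to the range [0, 1]."""
    col_min = dataset.min(axis=0)
    col_range = dataset.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero for constant features
    return (dataset - col_min) / col_range, col_min, col_range


# Made-up data: the first feature is thousands of times larger than the second
data = np.array([[1000.0, 0.1], [2000.0, 0.5], [3000.0, 0.9]])
normalized, col_min, col_range = min_max_normalize(data)
print(normalized)  # both columns become 0, 0.5, 1 (up to rounding)
```

A new point should be rescaled with the same `col_min` and `col_range` before its distances are computed, otherwise it would not be on the same scale as the training data.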

posted @ 2017-08-15 22:11 飞鸟_Asuka
