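kNN (k-nearest neighbors): a code implementation
Goal: predict the class y of unknown data (or test data) X
Batch kNN algorithm
1. Take a query X to predict (a 1-D vector or a multi-dimensional array) and compute its distance to every sample in the training set X_train.
2. Find the k samples closest to the query and look up the classes they belong to in y_train.
3. Count the classes among those k nearest samples; x is assigned to whichever class occurs most often.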
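The steps above do not name a distance metric; the code in this post uses Euclidean distance. As a rough sketch, the distance and the vote can be written as follows. The notation is introduced here for illustration and is not from the original: x is the query, x^{(i)} the i-th training sample with label y^{(i)}, n the number of features, and N_k(x) the indices of the k nearest training samples.

```latex
d\bigl(x, x^{(i)}\bigr) = \sqrt{\sum_{j=1}^{n} \bigl(x_j - x^{(i)}_j\bigr)^2},
\qquad
\hat{y} = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}\bigl[y^{(i)} = c\bigr]
```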
# Preparation:
import numpy as np
# import matplotlib.pyplot as plt

raw_data_X = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]
             ]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)

x = np.array([8.093607318, 3.365731514])
Core code:
from math import sqrt
from collections import Counter

# Given: X_train, y_train
# Predict the class of x
def predict(x, k=5):
    # Compute the distance from x to every training sample.
    # np.sum((x - x_train)**2) relies on NumPy's element-wise (vectorized) arithmetic.
    distances = [sqrt(np.sum((x - x_train)**2)) for x_train in X_train]
    # Indices that sort the distances in ascending order;
    # each index leads back to the sample's class in y_train
    nearest = np.argsort(distances)
    # Classes of the k nearest samples
    topK_y = [y_train[neighbor] for neighbor in nearest[:k]]
    # Vote: Counter is a dict-like object, key = class, value = count
    votes = Counter(topK_y)
    # Return the class with the most votes
    return votes.most_common(1)[0][0]

# Get y_predict
predict(x, k=6)
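The list comprehension above loops over the training samples in Python. As an alternative sketch, the same three steps can be done with NumPy broadcasting and `np.linalg.norm`. The function name `predict_vectorized` and its explicit `X_train`/`y_train` parameters are introduced here for illustration and are not part of the original code; the data is copied from the preparation step so that the snippet runs on its own.

```python
import numpy as np
from collections import Counter

# Training data copied from the preparation step
X_train = np.array([[3.393533211, 2.331273381],
                    [3.110073483, 1.781539638],
                    [1.343808831, 3.368360954],
                    [3.582294042, 4.679179110],
                    [2.280362439, 2.866990263],
                    [7.423436942, 4.696522875],
                    [5.745051997, 3.533989803],
                    [9.172168622, 2.511101045],
                    [7.792783481, 3.424088941],
                    [7.939820817, 0.791637231]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
x = np.array([8.093607318, 3.365731514])


def predict_vectorized(x, X_train, y_train, k=5):
    """Hypothetical vectorized variant of predict(); same voting logic."""
    # Broadcasting subtracts x from every row; norm over axis=1 gives
    # one Euclidean distance per training sample.
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Index y_train with the whole index array at once, then count votes
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


y_predict = predict_vectorized(x, X_train, y_train, k=6)
print(y_predict)
```

Passing the training data as arguments instead of reading module-level variables makes the function easier to reuse, at the cost of a longer call.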
An object-oriented version that implements kNN in the style of sklearn's interface:
import numpy as np
from math import sqrt
from collections import Counter

class kNN_classify:
    def __init__(self, n_neighbor=5):
        self.k = n_neighbor
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X):
        '''Take several samples (one per row of X) and return one prediction per sample.'''
        y_predict = [self._predict(x) for x in X]
        # return y_predict
        return np.array(y_predict)  # return the predictions as an array

    def _predict(self, x):
        '''Take a single sample x and return its predicted class.'''
        distances = [sqrt(np.sum((x - x_train)**2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[neighbor] for neighbor in nearest[:self.k]]
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return 'kNN_clf(k=%d)' % self.k
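A short usage sketch for the class, following the fit-then-predict pattern. The class is repeated in condensed form only so the snippet runs on its own. The second query point `[3.0, 2.0]` and the choice `n_neighbor=3` are made up for illustration and do not come from the original post.

```python
import numpy as np
from math import sqrt
from collections import Counter


class kNN_classify:
    """Condensed copy of the class above, repeated so this snippet is self-contained."""

    def __init__(self, n_neighbor=5):
        self.k = n_neighbor
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X):
        return np.array([self._predict(x) for x in X])

    def _predict(self, x):
        distances = [sqrt(np.sum((x - x_train)**2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[neighbor] for neighbor in nearest[:self.k]]
        return Counter(topK_y).most_common(1)[0][0]

    def __repr__(self):
        return 'kNN_clf(k=%d)' % self.k


# Training data from the preparation step
X_train = np.array([[3.393533211, 2.331273381],
                    [3.110073483, 1.781539638],
                    [1.343808831, 3.368360954],
                    [3.582294042, 4.679179110],
                    [2.280362439, 2.866990263],
                    [7.423436942, 4.696522875],
                    [5.745051997, 3.533989803],
                    [9.172168622, 2.511101045],
                    [7.792783481, 3.424088941],
                    [7.939820817, 0.791637231]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# predict() expects one sample per row, so a single query still needs 2-D shape.
X_new = np.array([[8.093607318, 3.365731514],   # query point from the post
                  [3.0, 2.0]])                  # illustrative point, not from the post

clf = kNN_classify(n_neighbor=3).fit(X_train, y_train)  # fit() returns self, so calls chain
y_predict = clf.predict(X_new)
print(clf, y_predict)
```

To classify one 1-D sample such as `x`, reshape it first, for example `clf.predict(x.reshape(1, -1))`.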