K-Nearest Neighbors Algorithm (KNN)
Overview
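The k-nearest neighbors algorithm is a classification algorithm. Suppose a training set $D$ contains $n$ training examples and we are given a test example $s$. Compute the similarity between $s$ and every example in $D$, find the $k$ examples most similar to $s$, and assign $s$ the class that accounts for the largest share of those $k$ examples. Similarity is usually measured by Euclidean distance (the smaller the distance, the more similar the examples), defined as:
$$dist(s,d_i)=\sqrt{\sum_{l=1}^{m}(a_{l}-a_{il})^2}$$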
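where $d_i$ is the $i$-th example in the training set $D$, $a_l$ is the $l$-th attribute value of the example $s$, $a_{il}$ is the $l$-th attribute value of the example $d_i$, and $m$ is the number of attributes.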
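The distance formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original program: the function name `euclidean_distance` and the two sample vectors are assumptions chosen for the example.

```python
import math


def euclidean_distance(s, d_i):
    """Euclidean distance between two equal-length attribute vectors.

    s and d_i are sequences of numbers (hypothetical names that mirror
    the symbols in the formula).
    """
    if len(s) != len(d_i):
        raise ValueError("both examples must have the same number of attributes")
    return math.sqrt(sum((a_l - a_il) ** 2 for a_l, a_il in zip(s, d_i)))


# Two made-up examples with m = 2 attributes each.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

With a single attribute the expression reduces to the absolute difference of the two values.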
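Example
Take the personnel records in the table below as the sample data and use only the height attribute to compute distances. Classify <Pat, female, 1.6> with the k-nearest neighbors algorithm:
- Compute the Euclidean distance between the test example's height and the height of each training example.
- Sort by increasing distance.
- Take the first 5 examples as the neighbor set and assign the test example the class with the largest share in that set.
The table below lists the distance between the test example and each training example.
Sorting by increasing distance and taking the first 5 examples gives the neighbor set.
Four of these examples belong to the class "short" and one to the class "medium", so the test example <Pat, female, 1.6> is classified as short.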
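The three steps above can be sketched as follows for a single numeric attribute. This is only an illustration: the function name `knn_classify_1d` is hypothetical, and the heights and labels are invented for the sketch. They are not the data from the article's table.

```python
from collections import Counter


def knn_classify_1d(training, value, k):
    """Classify value by majority vote among its k nearest training examples.

    training is a list of (height, label) pairs. With one attribute the
    Euclidean distance is the absolute difference of the two heights.
    """
    by_distance = sorted(training, key=lambda pair: abs(pair[0] - value))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical heights in metres, invented for this sketch.
training = [
    (1.50, "short"), (1.55, "short"), (1.58, "short"), (1.62, "short"),
    (1.70, "medium"), (1.75, "medium"), (1.80, "medium"),
    (1.90, "tall"), (1.95, "tall"),
]
print(knn_classify_1d(training, 1.60, 5))  # short
```

Sorting the whole list is the simplest way to find the nearest neighbors and is adequate for small data sets like this one.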
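Iris example
# -*- coding: utf-8 -*-
# If your program (including comments) contains Chinese characters, add the line above to declare the encoding (utf-8).
# Lines that start with # are comments.
# How to edit this program:
#   Any text editor will do.
# How to run this program:
#   In a terminal, run: python p405_exercises.py
# This program implements a simple classification algorithm (kNN, k-nearest neighbors).
#
# You have two data sets (a training set and a test set):
#   Training set: data/iris.training.arff
#   Test set: data/iris.test.arff
# Both files have the same format. Each data line represents one data object with
# four attributes (the length and width of the sepal and of the petal, all real numbers) and
# one class (the iris subclass, a string).
# Blank lines and lines starting with % or @ should be ignored.
#
# The kNN algorithm works as follows:
# (1) For each data object in the test set, compute how much it differs from every object in the training set,
#     i.e. the Euclidean distance between its four attributes and the four attributes of each training object:
#     dist(x, y) = math.sqrt((x1-y1)**2 + (x2-y2)**2 + (x3-y3)**2 + (x4-y4)**2)
#     where x1,...,x4 and y1,...,y4 are the four attribute values of the two objects.
# (2) Sort these distances in increasing order and select the k training objects with the shortest distances (the nearest neighbors).
#     Take the majority class of these k training objects as the class of the test object (the classification result).
# (3) Compare the classification result with the true class. Count the number and the proportion of correct classifications.
#
# TODO: Choose k=1,3,5,7,9 and record the percentage classified correctly (two decimal places):
# k accuracy
# 1 ?
# 3 ?
# 5 ?
# 7 ?
# 9 ?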
import math


def dist(obj1, obj2):
    '''Compute and return the Euclidean distance between two data objects.

    Each object is a one-entry dict of the form {class_label: [four attribute strings]}.
    '''
    # Convert the attribute strings of each object to floats.
    x = [float(v) for v in list(obj1.values())[0]]
    y = [float(v) for v in list(obj2.values())[0]]
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
def read_data(filename):
    '''Read the data objects from the file named filename,
    store them in a list, and return that list.
    '''
    data_list = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and lines starting with % or @.
            if not line or line.startswith(('%', '@')):
                continue
            fields = line.split(',')
            # Store each object as {class_label: [four attribute strings]}.
            data = {fields[4]: fields[:4]}
            data_list.append(data)
    return data_list
def sort_by_distance(list_data):
    '''Sort a list of (label, distance) pairs in place by increasing distance and return it.'''
    list_data.sort(key=lambda item: item[1])
    return list_data
def classify(training_list, test_list, k):
    '''kNN algorithm.
    training_list: the training data set.
    test_list: a list of test data objects.
    k: the kNN parameter (number of neighbors).
    Returns: a list with the predicted class of each test object.
    '''
    classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    result = list()
    for eachTestData in test_list:
        # Pair the class label of each training object with its distance to the test object.
        data_list = list()
        for eachTrainingData in training_list:
            label = list(eachTrainingData.keys())[0]
            data_list.append((label, dist(eachTestData, eachTrainingData)))
        # Sort the (label, distance) pairs by increasing distance.
        data_list = sort_by_distance(data_list)
        # Count the class labels among the k nearest neighbors.
        counts = {c: 0 for c in classes}
        for label, _ in data_list[:k]:
            if label in counts:
                counts[label] += 1
        # Majority vote; on a tie the class listed first in classes wins.
        result.append(max(classes, key=lambda c: counts[c]))
    return result
if __name__ == '__main__':
    result_list = list()
    title = "{0} {1}".format('K', 'Accuracy')
    result_list.append(title + '\n')
    filePath1 = input("Enter the absolute path of the training set:\n")
    filePath2 = input("Enter the absolute path of the test set:\n")
    try:
        # Read both files once, before the prompt loop.
        training_list = read_data(filePath1)  # e.g. 'D:/python_course/data/iris.training.arff'
        test_list = read_data(filePath2)  # e.g. 'D:/python_course/data/iris.test.arff'
    except FileNotFoundError:
        print('Sorry, the path is wrong or the file does not exist.')
    else:
        while True:
            K = input("Enter a value for K (choose k=1,3,5,7,9); any non-numeric input quits:\n")
            if not K.isdigit():
                break
            K = int(K)
            label = classify(training_list, test_list, K)
            # Count the test objects whose predicted class matches the true class.
            counter = 0
            for i in range(len(label)):
                if label[i] == list(test_list[i].keys())[0]:
                    counter += 1
            rate = round((counter / len(test_list)) * 100, 2)
            result = "{0} {1}%".format(K, rate)
            print(title)
            print(result)
            result_list.append(result + '\n')
            # Overwrite the result file with all results collected so far.
            file_name = 'D:/python_course/data/result.txt'
            with open(file_name, 'w') as f:
                f.writelines(result_list)
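For comparison, here is a compact sketch of the same classify-and-score steps using `heapq.nsmallest`, `math.dist` (Python 3.8 or later), and `collections.Counter` from the standard library. It is an alternative formulation, not the program above. The tuple-based data layout, the function names `knn_predict` and `accuracy`, and the tiny in-memory data (a handful of rows copied from the iris training and test files used in this article) are assumptions made to keep the example self-contained.

```python
import heapq
import math
from collections import Counter


def knn_predict(training, features, k):
    """Predict a label by majority vote among the k nearest training rows.

    training is a list of (features, label) pairs; features is a tuple of floats.
    """
    nearest = heapq.nsmallest(
        k, training, key=lambda row: math.dist(row[0], features)
    )
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(training, test, k):
    """Percentage of test rows whose predicted label equals the true label."""
    correct = sum(
        1 for features, label in test
        if knn_predict(training, features, k) == label
    )
    return round(100 * correct / len(test), 2)


# A few rows only, so the numbers say nothing about the full data set.
training = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Iris-setosa"),
    ((4.9, 2.4, 3.3, 1.0), "Iris-versicolor"),
    ((6.6, 2.9, 4.6, 1.3), "Iris-versicolor"),
    ((7.3, 2.9, 6.3, 1.8), "Iris-virginica"),
    ((6.7, 2.5, 5.8, 1.8), "Iris-virginica"),
]
test = [
    ((4.7, 3.2, 1.3, 0.2), "Iris-setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris-virginica"),
]

for k in (1, 3):
    print(k, accuracy(training, test, k))
```

`heapq.nsmallest` avoids sorting every distance when only the k smallest are needed, which matters more as the training set grows.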
Training set
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
% 3. Past Usage:
% - Publications: too many to mention!!! Here are a few.
% 1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
% Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
% to Mathematical Statistics" (John Wiley, NY, 1950).
% 2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
% (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
% 3. Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
% Structure and Classification Rule for Recognition in Partially Exposed
% Environments". IEEE Transactions on Pattern Analysis and Machine
% Intelligence, Vol. PAMI-2, No. 1, 67-71.
% -- Results:
% -- very low misclassification rates (0% for the setosa class)
% 4. Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE
% Transactions on Information Theory, May 1972, 431-433.
% -- Results:
% -- very low misclassification rates again
% 5. See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
% conceptual clustering system finds 3 classes in the data.
%
% 4. Relevant Information:
% --- This is perhaps the best known database to be found in the pattern
% recognition literature. Fisher's paper is a classic in the field
% and is referenced frequently to this day. (See Duda & Hart, for
% example.) The data set contains 3 classes of 50 instances each,
% where each class refers to a type of iris plant. One class is
% linearly separable from the other 2; the latter are NOT linearly
% separable from each other.
% --- Predicted attribute: class of iris plant.
% --- This is an exceedingly simple domain.
%
% 5. Number of Instances: 150 (50 in each of three classes)
%
% 6. Number of Attributes: 4 numeric, predictive attributes and the class
%
% 7. Attribute Information:
% 1. sepal length in cm
% 2. sepal width in cm
% 3. petal length in cm
% 4. petal width in cm
% 5. class:
% -- Iris Setosa
% -- Iris Versicolour
% -- Iris Virginica
%
% 8. Missing Attribute Values: None
%
% Summary Statistics:
% Min Max Mean SD Class Correlation
% sepal length: 4.3 7.9 5.84 0.83 0.7826
% sepal width: 2.0 4.4 3.05 0.43 -0.4194
% petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
% petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
%
% 9. Class Distribution: 33.3% for each of 3 classes.
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
% END
Test set
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
4.7,3.2,1.3,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
% END
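A quick sanity check for files in the layout shown above is to count the data rows per class before running the classifier. The sketch below is an assumption-laden illustration: the helper name `class_counts` is hypothetical, and it works on a short in-memory excerpt rather than on the real files.

```python
from collections import Counter


def class_counts(arff_text):
    """Count data rows per class label in ARFF-formatted text.

    Blank lines and lines starting with % or @ are skipped. The class label
    is taken to be the last comma-separated field of each remaining line.
    """
    counts = Counter()
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("%", "@")):
            continue
        counts[line.split(",")[-1]] += 1
    return counts


# A short excerpt in the same layout as the files above.
sample = """@RELATION iris
@DATA
4.7,3.2,1.3,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
% END
"""
print(class_counts(sample))
```

To check a real file, read its contents with `open(path).read()` and pass the resulting string to the function.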