Fundamentals of Knowledge Analysis and Application: Assignment 1
Blog code: 180913
Assignment code: 180312
Applying KNN (Python code)
-
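Introduction to KNN
[Figure: basic principle of KNN]
KNN (k-nearest neighbors) computes the "distance" [1] from the unknown target sample to every sample in the labeled data set and sorts the samples by that distance. It then takes the K nearest samples, where K is a chosen threshold, counts how many of them belong to each class, and assigns the unknown sample to the class with the largest count.
[1] Several common distance measures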
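Footnote [1] refers to distance measures without listing them in this post. As an illustration only, the sketch below shows three widely used ones: Euclidean (the one used in the implementation further down), Manhattan, and Chebyshev. The function names and the sample points are hypothetical and not taken from the original post.

```python
def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """Sum of the absolute differences along each attribute."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    """Largest absolute difference along any single attribute."""
    return max(abs(a - b) for a, b in zip(p, q))


# Made-up points for illustration
p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))  # 5.0 7.0 4.0
```

Which measure works best depends on the data; attributes on very different scales usually need rescaling before any of these distances is meaningful.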
-
Advantages:
- Simple
- Works well on basic recognition problems
-
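Disadvantages:
- It is a lazy learner: it builds no model from the training set and simply uses the stored training data for the computation at prediction time
- The whole training set must be kept in memory, and every test sample is compared against every training sample, so large data sets cost a lot of memory and time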
-
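Algorithm steps
- Choose K (the threshold)
- Compute the distance from every sample in the data set to the test sample
- Sort the distances and take the classes of the K nearest samples
- Count how many of those K samples belong to each class
- The class with the largest count is the class of the test sample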
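The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used for the assignment: `knn_classify` and the toy data are hypothetical, and `math.dist` (Python 3.8+) is assumed as the Euclidean distance.

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k):
    """Classify `query` by majority vote among its k nearest samples."""
    # Step 1: k is supplied by the caller; it must fit the data set.
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    # Step 2: Euclidean distance from every stored sample to the query.
    distances = [math.dist(query, s) for s in samples]
    # Step 3: indices of the k smallest distances.
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Step 4: count how often each class appears among those neighbours.
    votes = Counter(labels[i] for i in nearest)
    # Step 5: the most common class is the prediction.
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration: two attributes, two classes.
samples = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]
print(knn_classify((1.1, 1.0), samples, labels, k=3))  # a
```

When two classes tie, `Counter.most_common` returns the one counted first, which here means the class of the closer neighbour; an odd K avoids ties in two-class problems.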
-
Homework
- Three different classes
- Four values per sample (attributes)
Goal: achieve the highest accuracy
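Accuracy here means the fraction of test samples whose predicted class equals the true class, and the goal amounts to finding the K that maximizes it. A minimal sketch of that search follows; `accuracy`, `best_k`, and the `predict(sample, k)` callback are hypothetical names introduced for illustration, and any classifier with that shape could be plugged in.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def best_k(candidate_ks, predict, samples, labels):
    """Return (k, accuracy) for the candidate with the highest accuracy.

    `predict(sample, k)` must return a label. On ties the first
    candidate tried is kept.
    """
    best = None
    for k in candidate_ks:
        acc = accuracy([predict(s, k) for s in samples], labels)
        if best is None or acc > best[1]:
            best = (k, acc)
    return best


print(accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"]))  # 0.75
```

Choosing K on the same data that is used to report the final accuracy tends to give an optimistic figure; a separate validation split is the usual remedy when enough labeled data is available.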
-
Python implementation
import codecs

import numpy as np


def load_data(path):
    # Each line holds four numeric attributes followed by a class label,
    # separated by commas.
    features = []
    labels = []
    with open(path) as f:
        for line in f:
            fields = [x for x in line.strip().split(',') if len(x) > 0]
            if len(fields) < 5:
                continue  # skip blank or incomplete lines
            features.append([float(x) for x in fields[0:4]])
            labels.append(fields[4])
    return np.array(features), labels


def KNN(test, A, B, k):
    # Euclidean distance from the test sample to every training sample
    num = A.shape[0]
    diff = np.tile(test, (num, 1)) - A
    squareddiff = diff ** 2
    squareddist = np.sum(squareddiff, axis=1)
    distance = squareddist ** 0.5

    # Vote among the k nearest training samples
    sortdist = np.argsort(distance)
    classcount = {}
    for i in range(k):
        vote = B[sortdist[i]]
        classcount[vote] = classcount.get(vote, 0) + 1

    # Return the class with the most votes
    maxcount = 0
    maxindex = None
    for key, value in classcount.items():
        if value > maxcount:
            maxcount = value
            maxindex = key

    return maxindex


def tryBestK(A, B, C, D):
    # Try every k from 1 to (number of training samples - 1) and keep the
    # first k that reaches the highest accuracy on the labeled test data
    num1 = C.shape[0]
    maxright = 0
    maxans = 1
    for k in range(1, A.shape[0]):
        out = [KNN(C[i], A, B, k) for i in range(num1)]
        count_3 = 0
        for i in range(num1):
            if out[i] == D[i]:
                count_3 += 1
        right = count_3 / num1
        if maxright < right:
            maxright = right
            maxans = k

    return maxans


A, B = load_data("train.txt")     # training attributes and labels
C, D = load_data("test_try.txt")  # test attributes and labels

K = tryBestK(A, B, C, D)
out = [KNN(C[i], A, B, K) for i in range(C.shape[0])]

# Write each test row followed by its predicted label
with codecs.open("test_ans.txt", 'w', 'utf-8') as f:
    for row, label in zip(C, out):
        f.write(",".join(str(x) for x in row) + "," + label + '\r\n')
The individual parts of this code are explained separately in three follow-up posts, blog codes 180914, 180915, and 180916.