Fundamentals of Knowledge Analysis and Application: Homework 1

Blog code: 180913

Homework code: 180312

Applying KNN (Python code)

  • Introduction to KNN

 

Figure: the basic principle of KNN

 

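The KNN (k-nearest neighbors) algorithm computes the "distance" [1] from each sample in the known, labeled data set to the target sample and sorts the samples by that distance. Using the chosen parameter K, it then takes the K nearest samples, counts how many of them belong to each class, and assigns the unknown sample to the class with the highest count.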

[1] Several common distance measures
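
As a rough illustration of footnote [1], the sketch below computes three distance measures that are often used with KNN: Euclidean, Manhattan, and Chebyshev. The choice of these three and the function names are assumptions made for illustration; the footnote text itself does not list them. The code later in this post uses the Euclidean distance.

```python
import numpy as np


def euclidean(a, b):
    """Square root of the summed squared differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    """Sum of the absolute differences along each attribute."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))


def chebyshev(a, b):
    """Largest absolute difference along any single attribute."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.max(np.abs(a - b)))


p = [0.0, 0.0]
q = [3.0, 4.0]
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
print(chebyshev(p, q))  # 4.0
```

Which measure works best depends on the data; attributes on very different scales usually need rescaling first, otherwise the largest one dominates any of these distances.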

 

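  • Advantages:

  1. Simple
  2. Works well for basic recognition problems
  • Disadvantages:

  1. It is a lazy learner: it builds no model from the training set and simply uses the training data directly in every computation
  2. For large data sets it needs a lot of memory, because the whole training set must be kept and compared against every test sample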
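
To make the memory cost concrete: a naive vectorized prediction builds one distance per (test sample, training sample) pair. The sketch below is one possible way to bound that, not something the original homework does. The helper name `nearest_indices_chunked` and the `chunk_size` parameter are hypothetical; the idea is simply to process the test set a few rows at a time.

```python
import numpy as np


def nearest_indices_chunked(train, test, k, chunk_size=256):
    """For each test row, return the indices of its k nearest training rows.

    Hypothetical helper: the test set is processed in chunks, so the
    temporary arrays scale with chunk_size * len(train) instead of
    len(test) * len(train).
    """
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    result = np.empty((test.shape[0], k), dtype=int)
    for start in range(0, test.shape[0], chunk_size):
        block = test[start:start + chunk_size]
        # Squared Euclidean distances, shape (rows in block, training rows).
        d2 = ((block[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
        # Sort each row by distance and keep the k nearest indices.
        result[start:start + block.shape[0]] = np.argsort(d2, axis=1)[:, :k]
    return result


train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
test = np.array([[0.1, 0.0], [4.0, 4.0]])
print(nearest_indices_chunked(train, test, k=2, chunk_size=1))
# [[0 1]
#  [2 1]]
```

Chunking lowers peak memory only; the total number of distance computations stays the same, which is the other side of the lazy-learner cost.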
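  • Algorithm steps

  1. Choose K (the number of neighbors to consider)
  2. Compute the distance from every sample in the data set to the test sample
  3. Sort the distances and take the classes of the K nearest samples
  4. Count how many of those samples belong to each class
  5. The class with the highest count is the class of the test sample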
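
The five steps above can be sketched in a few lines. This is a minimal illustration on made-up two-dimensional points, not the homework data; the function name `knn_predict` and the labels "a" and "b" are assumptions.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k):
    # Step 1: k is supplied by the caller.
    train_x = np.asarray(train_x, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 2: Euclidean distance from every training sample to the query.
    distances = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    # Step 3: labels of the k nearest samples.
    nearest = np.argsort(distances)[:k]
    nearest_labels = [train_y[i] for i in nearest]
    # Step 4: count how often each label occurs.
    counts = Counter(nearest_labels)
    # Step 5: the most frequent label is the prediction.
    return counts.most_common(1)[0][0]


train_x = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
train_y = ["a", "a", "b", "b", "b"]
print(knn_predict(train_x, train_y, [1.1, 1.0], k=3))  # a
```

With k=3 the query's neighbors are the two "a" points and one "b" point, so "a" wins the vote. When two labels tie, this sketch returns whichever tied label was counted first; a real solution may want an explicit tie-breaking rule.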
  • Homework

  1. Three different classes
  2. Four values per sample (attributes)

 Goal: achieve the highest accuracy
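
Since the goal is accuracy, it helps to pin down how a record is read and how accuracy is measured. The sketch below assumes each line holds four comma-separated numeric attributes followed by a class label, which is the layout the code later in this post reads; the sample values, the label names, and the helper names `parse_record` and `accuracy` are made up for illustration.

```python
def parse_record(line):
    """Split one line into four float attributes and a class label."""
    fields = [f for f in line.strip().split(",") if f]
    return [float(v) for v in fields[:4]], fields[4]


def accuracy(predicted, actual):
    """Fraction of predictions that match the known labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


attrs, label = parse_record("5.1,3.5,1.4,0.2,class_1\n")
print(attrs, label)  # [5.1, 3.5, 1.4, 0.2] class_1
print(accuracy(["x", "y", "z", "x"], ["x", "y", "x", "x"]))  # 0.75
```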

 

  • Python implementation

import numpy as np


def load_data(path):
    # Each line is expected to hold four numeric attributes followed by a
    # class label, separated by commas.
    features = []
    labels = []
    with open(path) as f:
        for line in f:
            fields = line.strip().split(',')
            fields = [x for x in fields if len(x) > 0]
            if len(fields) < 5:
                continue  # skip blank or incomplete lines
            features.append([float(x) for x in fields[0:4]])
            labels.append(fields[4])
    return np.array(features), labels


# Training data: A holds the attributes, B the class labels.
A, B = load_data("train.txt")

# Test data: C holds the attributes, D the known class labels.
C, D = load_data("test_try.txt")


def KNN(test, A, B, k):
    # Euclidean distance from the test sample to every training sample.
    diff = A - test
    squareddist = (diff ** 2).sum(axis=1)
    distance = squareddist ** 0.5

    # Indices of the training samples, nearest first.
    sortdist = np.argsort(distance)

    # Count the labels of the k nearest samples.
    classcount = {}
    for i in range(k):
        vote = B[sortdist[i]]
        classcount[vote] = classcount.get(vote, 0) + 1

    # Return the label with the most votes.
    maxcount = 0
    maxindex = None
    for key, value in classcount.items():
        if value > maxcount:
            maxcount = value
            maxindex = key

    return maxindex


def tryBestK(A, B, C, D):
    # Try every k from 1 up to (number of training samples - 1) and keep
    # the first k that gives the highest accuracy on the test data.
    num1 = C.shape[0]
    maxright = 0
    maxans = 1
    for k in range(1, A.shape[0]):
        out = [KNN(C[i], A, B, k) for i in range(num1)]
        count_3 = 0
        for i in range(num1):
            if out[i] == D[i]:
                count_3 += 1
        right = count_3 / num1
        if maxright < right:
            maxright = right
            maxans = k

    return maxans


K = tryBestK(A, B, C, D)
out = [KNN(C[i], A, B, K) for i in range(C.shape[0])]

# Write each test sample followed by its predicted label.
with open("test_ans.txt", 'w', encoding='utf-8', newline='') as f:
    for row, label in zip(C, out):
        f.write(",".join([str(v) for v in row] + [label]) + '\r\n')
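
The script above picks K by measuring accuracy on the same test file it then labels, which tends to give an optimistic estimate. One common alternative, sketched here under assumptions, is leave-one-out validation on the training data only: each training sample is classified from all the others, and the K with the best score is kept. The names `loo_accuracy` and `best_k_loo` and the toy data are hypothetical and not part of the homework.

```python
from collections import Counter

import numpy as np


def loo_accuracy(x, y, k):
    """Leave-one-out accuracy of KNN with the given k on (x, y)."""
    x = np.asarray(x, dtype=float)
    correct = 0
    for i in range(len(x)):
        d = np.sqrt(((x - x[i]) ** 2).sum(axis=1))
        d[i] = np.inf  # exclude the sample itself from its own neighbors
        nearest = np.argsort(d)[:k]
        vote = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if vote == y[i]:
            correct += 1
    return correct / len(x)


def best_k_loo(x, y, candidates):
    """Return the first candidate k with the highest leave-one-out accuracy."""
    scores = {k: loo_accuracy(x, y, k) for k in candidates}
    return max(scores, key=scores.get), scores


x = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y = ["a", "a", "a", "b", "b", "b"]
k, scores = best_k_loo(x, y, [1, 3, 5])
print(k, scores)  # 1 {1: 1.0, 3: 1.0, 5: 0.0}
```

On this toy data k=5 fails completely, because after removing the sample itself each cluster is outvoted three to two by the other one. That is a reminder that K should stay small relative to the size of the smallest class.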

 

 

The individual pieces used in this code are explained separately in three follow-up posts, blog codes 180914, 180915, and 180916.

 

posted @ 2018-09-26 10:35  降腰