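For simplicity, the system built here recognizes only the digits 0 through 9. The digits to be recognized have already been processed with image-editing software so that they all share the same color and size: 32×32-pixel black-and-white images. Storing images as text does not use memory efficiently, but we convert them to text format anyway because it makes them easier to understand.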
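---1. Collect data: text files are provided
The dataset is a modified version of the "Optical Recognition of Handwritten Digits" dataset, retrieved from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) on October 3, 2010.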
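---2. Prepare data: convert the images into test vectors
trainingDigits contains about 2,000 examples, roughly 200 samples per digit; testDigits contains about 900 test samples. The two sets do not overlap.
First we format each image as a single vector: the 32×32 binary image matrix becomes a 1×1024 vector.
We start by writing the function img2vector, which converts an image into a vector. It creates a 1×1024 NumPy array, opens the given file, reads the first 32 lines in a loop, stores the first 32 character values of each line in the array, and finally returns the array.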
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
from numpy import *      # NumPy scientific computing package
from os import listdir
import operator          # standard-library operator module


# Core of the algorithm
# inX:     input vector to classify
# dataSet: training sample set
# labels:  label vector
# k:       number of nearest neighbors that vote
def classifyO(inX, dataSet, labels, k):
    # Distance calculation
    dataSetSize = dataSet.shape[0]            # number of rows = number of training samples
    # tile repeats inX dataSetSize times (one copy per training row);
    # diffMat holds the difference between the input and every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2                  # square each element
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5            # square root gives the distance
    sortedDistIndicies = distances.argsort()  # indices in ascending order of distance
    # Pick the k points with the smallest distances and count their labels
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort by vote count, descending (items() works on both Python 2 and 3)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def img2vector(filename):
    returnVect = zeros((1, 1024))
    with open(filename) as fr:                # the file is closed automatically
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
```
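As a quick sanity check of the distance-and-vote logic, the following minimal sketch applies the same idea to four made-up 2-D points. It relies on NumPy broadcasting instead of tile and on collections.Counter for the vote; the name knn_classify and the toy data are illustrative assumptions and are not part of kNN.py.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training rows nearest to in_x."""
    # Broadcasting subtracts in_x from every row, so tile() is not needed.
    diff = data_set - in_x
    distances = np.sqrt((diff ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]          # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points near (1, 1) and two near (0, 0).
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
group_labels = ['A', 'A', 'B', 'B']

result = knn_classify(np.array([0.1, 0.2]), group, group_labels, 3)
print(result)
```

With k = 3, the query point (0.1, 0.2) has two 'B' neighbors and one 'A' neighbor, so the vote goes to 'B'.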
Enter the following commands at the Python prompt to test img2vector, then compare the result with the same file opened in a text editor:
```
>>> import kNN
>>> testVector = kNN.img2vector('digits/testDigits/0_13.txt')  # adjust the path to your own directory
>>> testVector[0, 0:31]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])
>>> testVector[0, 32:63]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])
```
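To make the comparison with the text file easier, one option is to turn the vector back into 32 lines of characters. The sketch below assumes a 1×1024 vector of 0s and 1s like the one img2vector returns; the helper vector_to_text and the demo vector are hypothetical and only stand in for a real digit file.

```python
import numpy as np


def vector_to_text(vect, width=32):
    """Turn a 1 x (width*width) 0/1 vector back into lines of '0'/'1' characters."""
    grid = np.asarray(vect).reshape(width, width).astype(int)
    return [''.join(str(v) for v in row) for row in grid]


# Hypothetical stand-in for a digit file: a blank image with four 1s on the first row.
demo = np.zeros((1, 1024))
demo[0, 14:18] = 1.0

rows = vector_to_text(demo)
print(rows[0])
```

Each string in rows corresponds to one line of the image file, because element 32 * i + j of the vector holds character j of line i.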
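---3. Test the algorithm: recognize handwritten digits with k-nearest neighbors
The data is now in a format the classifier can work with, so the next step is to feed it to the classifier and check the results. handwritingClassTest() is the code that tests the classifier; add it to kNN.py. Before doing so, make sure that from os import listdir appears at the top of the file. That statement imports the function listdir from the os module, which lists the file names in a given directory.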
```python
def handwritingClassTest():
    hwLabels = []
    # Adjust the directory paths to your own layout. Note that img2vector is
    # called with a relative path, so run this from the folder containing digits/.
    trainingFileList = listdir('E:\\python excise\\digits\\trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]          # drop the .txt extension
        classNumStr = int(fileStr.split('_')[0])     # digit before the underscore
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('digits/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('E:/python excise/digits/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classifyO(vectorUnderTest, trainingMat, hwLabels, 3)
        # print(...) with a single formatted string works on both Python 2 and 3
        print("the classifier came back with:%d,the real answeris:%d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of error is:%d" % errorCount)
    print("\nthe total error rate is:%f" % (errorCount / float(mTest)))
```
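Explanation: the names of the files in the E:\\python excise\\digits\\trainingDigits directory are stored in the list trainingFileList, which tells us how many files the directory contains; that count is stored in the variable m. The code then creates a training matrix with m rows and 1024 columns, where each row holds one image. The class digit can be parsed from the file name, because the files in this directory follow a naming rule: the file 9_45.txt, for example, belongs to class 9 and is the 45th instance of the digit 9. The class label is then appended to the hwLabels vector, and the image is loaded with the img2vector function shown earlier.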
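The file-name rule described above can be checked in isolation. The helper parse_label below is a hypothetical name introduced for illustration; it performs the same two split calls that handwritingClassTest uses.

```python
def parse_label(file_name):
    """Extract the class digit from a name such as '9_45.txt'."""
    stem = file_name.split('.')[0]      # '9_45'
    return int(stem.split('_')[0])      # 9


names = ['0_13.txt', '9_45.txt']
labels = [parse_label(n) for n in names]
print(labels)
```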
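Next, the files in the E:/python excise/digits/testDigits directory are handled in a similar way. The difference is that the files in this directory are not loaded into a matrix; instead, each file is classified with the classifyO() function. Because the values in the files are already either 0 or 1, no normalization is needed.
Enter kNN.handwritingClassTest() at the Python prompt to see the function's output. Depending on the speed of your machine, loading the dataset may take a long time; the function then tests each file in turn: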
```
>>> kNN.handwritingClassTest()
the classifier came back with:0,the real answeris:0
the classifier came back with:0,the real answeris:0
the classifier came back with:0,the real answeris:0
the classifier came back with:0,the real answeris:0
...
the classifier came back with:1,the real answeris:1
the classifier came back with:1,the real answeris:1
the classifier came back with:1,the real answeris:1
the classifier came back with:1,the real answeris:1
...
the classifier came back with:9,the real answeris:9
the classifier came back with:9,the real answeris:9
the classifier came back with:9,the real answeris:9
the classifier came back with:9,the real answeris:9

the total number of error is:11

the total error rate is:0.011628
```
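Summary
On this handwritten digit dataset, the k-nearest neighbors algorithm reaches an error rate of about 1.2% (11 errors in the run above). Changing the value of k, modifying handwritingClassTest to pick training samples at random, or changing the number of training samples will all affect the error rate.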
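One way to explore these effects is to wrap the test loop in a function that takes k and a training subset as parameters. The sketch below is only an illustration under stated assumptions: it uses small synthetic 16-bit vectors with two classes rather than the digit files, and error_rate and make_samples are hypothetical helpers, not part of kNN.py.

```python
import numpy as np


def error_rate(train_x, train_y, test_x, test_y, k):
    """Fraction of test rows whose k-nearest-neighbor vote differs from the true label."""
    errors = 0
    for x, y in zip(test_x, test_y):
        # Squared distances are enough for ranking the neighbors.
        dist = ((train_x - x) ** 2).sum(axis=1)
        nearest = dist.argsort()[:k]
        votes = np.bincount(train_y[nearest], minlength=10)
        if votes.argmax() != y:
            errors += 1
    return errors / float(len(test_y))


def make_samples(rng, n_per_class, flips=2, dim=16):
    """Hypothetical data: class 0 is all 0s, class 1 is all 1s, each with `flips` bits flipped."""
    xs, ys = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            v = np.full(dim, float(label))
            idx = rng.choice(dim, size=flips, replace=False)
            v[idx] = 1.0 - v[idx]
            xs.append(v)
            ys.append(label)
    return np.array(xs), np.array(ys)


rng = np.random.default_rng(0)
train_x, train_y = make_samples(rng, 20)
test_x, test_y = make_samples(rng, 5)

for k in (1, 3, 5):
    for n_train in (10, 40):
        # Randomly chosen training subset of the requested size.
        subset = rng.choice(len(train_y), size=n_train, replace=False)
        rate = error_rate(train_x[subset], train_y[subset], test_x, test_y, k)
        print("k=%d, n_train=%d, error rate=%f" % (k, n_train, rate))
```

For the real experiment, the synthetic arrays would be replaced by the trainingMat and hwLabels built in handwritingClassTest, with the test vectors loaded the same way.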