字符和文档识别的四十年研究
---工业前景的瞻望
文档来源:http://www.sciencedirect.com/science/article/pii/S0031320308000964
文章历史:
Received 15 February 2008
Received in revised form 10 March 2008
Accepted 11 March 2008
摘要:本文简要介绍过去40年中字符和文档识别领域的技术进步,对每个十年的代表性进展做了简要阐述,然后重点介绍了汉字识别关键技术的发展。文中用较大篇幅讨论了鲁棒性设计原则,这些原则已被证明是解决邮政地址识别等复杂问题的有效途径,包括假设驱动的原则、递延决定/多假设的原则、信息集成原则、替代解决方案的原则和摄动原则等。最后,讨论了对未来的预测、长尾现象以及新的应用。
©2008 Elsevier Ltd. All rights reserved.
关键字:OCR;字符识别;手写识别;汉字识别;邮政地址识别;鲁棒性;鲁棒性设计;信息集成;假说驱动方法;数码笔
1.前言
本文基于在ICDAR会议上发表的一些资料[1],以工业的视角描述字符和文档识别技术。商业光学字符阅读器(OCR)出现在20世纪50年代,从那时起,字符和文档识别技术不断进步,提供了满足工业和商业需要的先进产品和系统。与此同时,企业把基于此技术的业务利润再投入到更先进技术的研究和开发之中。在这里我们可以观察到一个良性循环:新技术促成了新的应用,新的应用又支撑了更好技术的发展。字符和文档识别是模式识别领域中一个非常成功的方向。
在过去四十年中,字符和文档识别的主要商业及工业应用一直是表单阅读、银行支票阅读和邮政地址阅读。在支持这些应用的过程中,识别能力在多个维度上得到扩展:书写模式、文字种类、文档类型等。可识别的书写模式包括机器印刷、手写印刷体和手写连笔体;可识别的文字从阿拉伯数字开始,逐步扩展到拉丁字母、日文片假名音节字符、日文汉字(中国汉字的日文版本)、中文汉字和韩文字符。
目前正在开展的工作是使印度文字和阿拉伯文字也能够被识别。今天的OCR可以读取很多种类的纸质表单,包括银行支票、明信片、信封、书页和名片等。OCR-A和OCR-B等字体标准使得OCR即使在早期阶段也足够可靠。同样,专门设计的OCR表格简化了分割问题,使得即便识别技术尚不成熟,手写印刷体字符也能够被读取。今天的OCR已能成功地读取任何字体以及自由手写的字符。
字符和文档识别领域的发展并不总是一帆风顺的,它曾两次受到新的数字技术浪潮的冲击,这些浪潮一度威胁要削弱OCR技术的作用。第一次浪潮是20世纪80年代早期的办公自动化。从那时起,大部分信息似乎都将"生来数字化",这有可能减少对OCR的现实需求,一些研究人员也因此对OCR的未来持悲观态度。但事实证明,以日本为例,字符识别产品的销售额恰恰在20世纪80年代达到顶峰。具有讽刺意味的是,这正是办公计算机大力推广引进的结果。众所周知,纸张的使用量一直在持续增加。
我们现在正面临第二波浪潮。IT和网络技术可能会产生不同的影响:许多应用现在可以在网络上完成,信息可以即时传遍世界各地。但是,目前仍不清楚对字符和文档识别的需求是会减少,还是会催生出需要更先进技术的新应用。搜索引擎已经无处不在,并正将搜索范围扩展到图像文档、照片和视频。人们正在重新评估手写的重要性,并试图将其融入数字世界。看来纸张仍然不会消失。带有微型摄像头的移动设备如今已配备能够进行实时识别的CPU。本文将在后面讨论这些发展的前景。
2.发展史回顾
2.1概述
第一个实用的OCR产品于20世纪50年代出现在美国,与第一台商用计算机UNIVAC处于同一个十年。从那以后,每个十年都见证了OCR技术的进步。20世纪60年代初期,IBM公司生产出了其第一批光学阅读机IBM1418(1960年)和IBM1428(1962年),分别能够阅读印刷数字和手写印刷体数字。当时的一款机型可以读取200种印刷文档字体,并被用作IBM1401计算机的输入设备。同样在20世纪60年代,邮政业利用带OCR的机械信件分拣机实现了自动化,首次自动读取邮政编码来确定投递目的地。美国邮政首次引入了地址识别OCR,于1965年开始阅读印刷信封上的城市/州/邮政编码(ZIP)行[2]。在日本,东芝和NEC开发了用于邮政编码识别的手写印刷体数字OCR,并于1968年投入使用[3]。在德国,邮政编码系统于1961年在世界上首次引入[4];但欧洲第一台可阅读邮政编码的信件分拣机于1973年在意大利投入使用,而第一台带自动地址阅读器的信件分拣机则于1978年在德国引入[5]。
日本于20世纪60年代后期开始引入商业OCR。日立公司于1968年推出了其第一款用于印刷字母数字字符的OCR,并于1972年推出了第一款商用手写印刷体数字OCR。NEC则于1976年开发出第一款还能识别手写片假名的OCR。日本通商产业省(后改名为经济产业省)从1971年起开展了为期10年、耗资200亿日元的"模式信息处理"国家项目。在众多研究课题中,东芝研究印刷汉字识别,富士通研究手写字符识别。包括汉字在内的ETL字符数据库作为该项目的一部分被创建,为汉字OCR的研究和开发做出了贡献[6]。作为副产品,该项目还吸引了许多学生和研究人员进入模式识别领域。在美国,IBM公司于1977年推出了存款处理系统(IBM3895),能够识别不受书写限制的手写支票金额。笔者有幸于1981年在美国匹兹堡的梅隆银行参观了这台机器的运行,据说它可以读取约50%的手写支票,其余一半由人工编码处理。20世纪60年代和70年代字符识别的最新进展在文献[7,8]中有详细的记载。
20世纪80年代,CCD图像传感器、微处理器、动态随机存储器(DRAM)以及定制设计的LSI等半导体技术取得了显著进步。例如,OCR变得比以往更小,可以放在办公桌面上(图1);而日益便宜的兆字节级存储器和CCD图像传感器使整页扫描图像可以存入内存作进一步处理,从而带来了更先进的识别技术和更广泛的应用领域。例如,能够识别粘连字符的手写数字OCR于1983年首次推出,使得表格的物理约束和书写约束得以放宽。在20世纪80年代后期,日本的OCR厂商在其产品线中引入了可识别约2400个印刷和手写汉字字符的新型OCR,用于读取数据录入中的姓名和地址。更详细的技术综述可见文献[9,10]。
20世纪80年代的办公自动化热潮在日本影响很大,它有两个特点。其一是计算机的日语处理和日语文字处理器的出现,汉字OCR的出现正是这一发展的自然结果。另一个特点是用作计算机存储系统的光盘,它于20世纪80年代初被开发并投入使用。一个典型应用是美国和日本用于存储专利说明书文档图像的专利自动化系统。当时的日本专利局系统在12英寸光盘上存储了大约5000万份文件,即约2亿个数字化页面;每张光盘可存储7GB数据,相当于20万个数字化页面,整个系统使用了80台日立光盘机和80个光盘库单元。这些系统可以被视为最早的数字图书馆之一。这类新的计算机应用直接和间接地推动了日本在文档理解和文档版面分析方面的研究。更重要的是,正是在这十年中,文档第一次成为计算机处理的焦点。
20世纪90年代的变化源于UNIX工作站以及随后个人电脑性能的提升。虽然扫描和图像预处理仍由硬件完成,但识别的主要部分已由通用计算机上的软件实现。这意味着可以使用C和C++等编程语言来编写识别算法,让更多的工程师能够开发更复杂的算法,也使研究群体扩大到学术界。在这十年中,运行于个人电脑上的商业OCR软件包也出现在市场上。自由手写字符识别技术得到了广泛研究,并被成功应用于银行支票阅读机和邮政地址阅读机。先进的版面分析技术使得能够识别更多种类的商业表格。Suen教授领导的CENPARMI以及Srihari教授和Govindaraju教授领导的CEDAR等该领域的专门研究机构为这些进步做出了贡献。新的高科技厂商相继出现,包括由已故Simon教授在法国创立的A2iA[11],以及创立于俄罗斯、在美国开展业务的Parascript。在日本,日本邮政省于1994年至1996年间实施了第三代邮政自动化项目,东芝、NEC和日立参与其中,开发了能够按投递路线顺序分拣的邮政地址识别系统。该项目使日本的地址识别技术取得了显著进步。
国际模式识别协会(IAPR)从20世纪90年代初开始举办ICDAR、IWFHR、DAS等会议。许多经过深入研究的方法在这些会议上被报道,例如人工神经网络、隐马尔可夫模型(HMM)、多项式函数分类器、改进二次判别函数(MQDF)分类器[12]、支持向量机(SVM)、分类器组合[13-15]、信息集成以及词典导向的字符串识别[16-19],其中一些方法源于20世纪60年代的原创思想[20,21]。这些技术中的大多数在今天的系统中发挥着关键作用。与工业界大多使用专有自研技术的前几十年不同,20世纪90年代见证了学术界和工业界之间的重要互动:学术界研究实际的技术难题,开发出基于严谨理论的先进方法,使工业界从其研究中受益。字符识别系统(包括图像预处理、特征提取、模式分类和单词识别)的最新进展在文献[22]中有详细论述。
在下面的小节中,将描述20世纪90年代以前在汉字分类器、字符分割算法和语言信息处理领域取得的主要技术成果。
2.2汉字的分类
在20世纪70年代,字符识别有两种相互竞争的方法:结构分析和模板匹配(即统计方法)。当时的商业OCR使用结构化方法读取手写印刷体字母数字和片假名,使用模板匹配方法读取机器印刷的字母数字。到20世纪70年代后期,模板匹配方法已被实验证明适用于印刷汉字的识别[23-26],但其对手写(或手写印刷体)汉字的适用性仍存疑问。手写汉字识别问题就像一座陡峭的、未被开发的高山,很明显,无论是结构分析方法还是简单的模板匹配方法都无法单独攻克它:前者难以应对复杂笔画结构所带来的大量拓扑变化,后者则难以应对非线性的形状变化。然而,鉴于此前使用模板匹配方法识别手写数字的工作,后者似乎有更大的成功机会[27]。
问题的关键在于把模糊化作为特征提取手段这一概念,它被应用于方向特征,并被证明对识别手写汉字是有效的[27,28]。连续空间特征提取的引入使得最佳模糊量大得出人意料。日立公司第一台阅读手写汉字的OCR使用基于模糊方向特征的简单模板匹配,其特征模板是四组16×16的灰度值阵列。该方向特征于1979年在日本申请专利,它使用二维梯度计算来确定笔画方向(图2),甚至还适用于灰度图像[29]。虽然只是间接相关,Hubel和Wiesel的工作支持了我们认为方向特征大有前途的观点[30]。非线性形状归一化[31-33]和统计分类方法[12,34]进一步提高了识别准确率。我们认识到,模糊化应当被视为一种获得潜在维度(子空间)的手段,而不是降低计算成本的手段,尽管两者的效果看起来可能相似。例如,统计方法中使用的8×8网格大小是根据香农采样定理由最佳模糊参数确定的,而在相同模糊参数下采用更大的网格并没有带来更好的识别性能。
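下面给出一个方向特征提取的极简示意(Python/NumPy,说明性代码,并非原文系统的实现):按图2的思路用二维梯度估计笔画方向,将其量化为四个方向平面,再用块平均近似"模糊化"并下采样到8×8网格,得到8×8×4维特征向量。函数名、方向量化方式和块平均等细节均为说明性假设。

```python
import numpy as np

def directional_features(img, mesh=8):
    """从单字符灰度图像提取四方向模糊特征(示意实现,假定图像尺寸不小于 mesh×mesh)。
    img: 二维数组,取值0~1,笔画处取值较高;返回 mesh*mesh*4 维特征向量。"""
    gy, gx = np.gradient(img.astype(float))          # 二维梯度,用于估计笔画方向(图2的思路)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # 方向不区分正负,范围 [0, pi)

    # 将方向量化为4个方向平面(水平、右斜、竖直、左斜)
    planes = np.zeros((4,) + img.shape)
    idx = np.floor(ang / (np.pi / 4)).astype(int) % 4
    for d in range(4):
        planes[d][idx == d] = mag[idx == d]

    # 块平均近似模糊化,并下采样到 mesh × mesh
    h, w = img.shape
    ys = np.linspace(0, h, mesh + 1).astype(int)
    xs = np.linspace(0, w, mesh + 1).astype(int)
    feat = np.zeros((4, mesh, mesh))
    for d in range(4):
        for i in range(mesh):
            for j in range(mesh):
                feat[d, i, j] = planes[d, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    return feat.ravel()                              # 8*8*4 = 256 维
```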
Kimura教授领导的研究小组的深入研究推动了统计二次分类器的发展[12],这类分类器被成功应用于手写汉字识别。实际上,其基本理论早已为人所知,只是20世纪70年代的计算机没有足够的计算能力来支撑这类统计方法的研究。如今,用于汉字模式的四方向特征向量由8×8×4个元素组成,通过统计协方差分析获得的子空间维数约为100到140维。然而,考虑到许多汉字的复杂程度,8×8阵列的尺寸小得出人意料(有违直觉)。不过,单个自由手写汉字的识别精度仍然不够高,因此需要利用姓名和地址等语言上下文来提高整体识别准确率。为了降低计算成本,还采用基于聚类的两阶段分类来减少必须匹配的模板数目。汉字(中文字符)识别的最新进展之一是识别引擎体积的减小,特别是为移动电话应用设计的引擎。文献[35,36]报道的紧凑型识别引擎只需要613KB的内存来存储参数,即可识别4344类印刷中文字符。
2.3字符分割算法
在20世纪60年代和70年代,飞点扫描仪或带旋转镜的激光扫描仪与光电倍增管一起使用,将光信号转换成电信号。字符分割通常借助这类扫描机制完成。例如,手写印刷体阅读表格会在纸边设置标记,用于指示待扫描字符行的存在。此外,表格上书写框的位置需要预先登记,而且框线的颜色对扫描仪传感器是"透明"的(不会被感知)。因此,OCR可以很容易地提取恰好包含单个手写印刷体字符的图像。
然后,在20世纪80年代,半导体传感器和存储器的出现使OCR能够扫描并存储整页图像。这是一个划时代的变化,对用户具有重要意义,因为它放宽了OCR表格规范的严格条件,例如允许用户使用更小的、不分隔的书写框。但是,这也要求解决数字粘连的问题,并改变图像在存储器中的表示方式[37]。在此之前,扫描图像是二值像素阵列,分割是基于像素进行的;而从这时起,内存中的二值图像改用游程编码表示。游程表示适合进行连通分量分析和轮廓跟踪,处理的对象是作为黑色物体的连通分量,而不再是像素。1983年,日立生产了最早能够基于多假设"分割-识别"方法对粘连手写数字进行分割和识别的OCR之一(图3)。轮廓形状分析用于确定候选的粘连点,多对被强制分开的图案被送入分类器;通过参考分类器给出的置信度值,识别器能够选出正确的假设。这一方向的变化引导我们走向表格处理,其最终目标是读取未知表格,或者至少是那些并非专为OCR设计的表格。然而,这也意味着用户在书写时可能变得不那么小心,因此OCR对自由手写字符也必须有更高的识别精度。
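作为补充,下面是二值图像游程编码的一个极简示意(Python,说明性代码,并非当年系统的实现):把每一行的黑像素压缩为(行号, 起始列, 长度)三元组,连通分量分析和轮廓跟踪即可在这种游程表示上进行。

```python
import numpy as np

def run_length_encode(binary_img):
    """将二值图像逐行编码为 (行号, 起始列, 游程长度) 的列表(示意实现)。"""
    runs = []
    for y, row in enumerate(binary_img):
        x, w = 0, len(row)
        while x < w:
            if row[x]:                      # 黑游程的起点
                start = x
                while x < w and row[x]:
                    x += 1
                runs.append((y, start, x - start))
            else:
                x += 1
    return runs

# 用法示例:两行中的黑像素被压缩成三个游程
img = np.array([[0, 1, 1, 0, 1],
                [1, 1, 0, 0, 0]], dtype=bool)
print(run_length_encode(img))               # [(0, 1, 2), (0, 4, 1), (1, 0, 2)]
```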
在邮政地址识别中,分割问题要困难得多。图4显示了横向手写的地址:字符宽度的变化可达两倍之多,而且有些偏旁和部件本身也是合法的字符。如图4所示,由于有些字符相当宽而另一些又很窄,很难把正确的部件组合在一起形成正确的字符模式。为了解决这一分组问题,除了几何信息和相似度信息之外,还需要语言信息(即地址知识)。此问题将在第3节中详细讨论。
2.4语言信息的整合
手写汉字OCR的主要商业用途是阅读申请表格中的姓名和地址。在这类应用中,为了避免分割问题,表格上预印了相互分隔的固定书写框。但是,如何实现高精度的单词/短语识别仍然是一个问题。
我们可以利用先验的语言知识,从候选格中选出正确的选项,从而准确地识别单词和短语。这里的候选格(lattice)是一张表,其中每一列存放候选类别,每一行对应纸面上的一个字符。如果一个字符串包含N个汉字,且每个字符有K个候选,则会有K的N次方种可能的解释(即单词识别结果)。语言处理的任务就是从众多可能的解释中选出一个。为此,我们开发出了一种以有限状态自动机为核心技术的方法[38]。基本思路是把L个词条逐一输入自动机,观察哪些词条被自动机接受,而自动机模型是根据候选格动态生成的(图5)。L通常是几万量级的数字,但只有首字符出现在候选格第一列中的词条才可能被接受;为了提高准确性,还可以同时考虑第二个字符出现在候选格第二列中的词条。这些词条被逐个送入自动机,状态转移确定出一条路径(一系列的边),路径上对应的罚分被累加起来并与该输入词条相关联。当K=16时,通过第一条边的罚分为0,通过最后一条边的罚分为15。这样,罚分最小的词条就被判定为识别出的单词。每个字符的候选数目被自适应地控制为不超过K,以排除极不可能的候选词。在字符能够被可靠分割的前提下,该算法已成功地用于地址短语识别。Marukawa等人的实验表明,对于一个包含10828个词条的词典,字符识别准确率从90.2%提高到了99.7%,地址短语识别准确率达到99.1%。这里值得注意的是,错误的发生在统计上并不是相互独立的。能够解决困难分割问题(参见图4)的语言处理将在第3节中讨论。
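为说明上述"向自动机投词条并累加罚分"的思路,下面给出一个高度简化的示意(Python,说明性代码,并非文献[38]的实现):候选格的每一列是按置信度排序的候选字符,词条中各字符的候选排名被用作罚分,罚分最小且被"接受"的词条即为识别结果。其中的示例词条和候选数据均为假设。

```python
def match_lexicon(lattice, lexicon):
    """在候选格上为词典中的每个词条累加罚分,返回罚分最小的词条(示意实现)。
    lattice: 每列为按置信度排序的候选字符列表,排名 r 记罚分 r;lexicon: 词条列表。"""
    best_term, best_penalty = None, float("inf")
    for term in lexicon:
        if len(term) != len(lattice):
            continue                          # 长度不符的词条直接跳过
        penalty, accepted = 0, True
        for ch, column in zip(term, lattice):
            if ch not in column:
                accepted = False              # 相当于自动机不接受该词条
                break
            penalty += column.index(ch)       # 候选排名越靠后罚分越大
        if accepted and penalty < best_penalty:
            best_term, best_penalty = term, penalty
    return best_term, best_penalty

# 用法示例:三列候选格,每列 3 个候选(数据为假设)
lattice = [["束", "東", "車"], ["京", "亰", "凉"], ["都", "部", "郁"]]
print(match_lexicon(lattice, ["東京都", "京都市", "東京部"]))   # ('東京都', 1)
```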
3.处理不确定性和可变性的鲁棒性设计
邮政地址识别包含众多技术挑战,从这个意义上说,它对研究人员而言是一种理想的应用;与此同时,这种创新也是邮政自动化所期待的,相关投资确实得到了回报。20世纪90年代,美国、欧洲和日本开展了研发项目,开发能够识别自由手写和印刷完整地址的地址阅读机,目的是使投递路线顺序分拣这一邮政工人单调乏味的工作实现自动化。识别任务是通过识别包括街道号和公寓号在内的完整目的地址来确定确切的投递点;在日本,这一问题相当于在约4000万个地址点中确定一个。本节基于作者研究小组的经验[39,40],讨论用于处理不确定性和可变性的鲁棒性设计的主要问题。
如图6所示,日语地址识别是一项困难的任务。印刷和手写邮件的阅读率分别高于90%和70%。被拒识的邮件图像被送到视频编码站,由人工操作员录入地址信息。自动识别和人工编码的结果被转换为地址码,并在邮件通过分拣机时喷印在相应的邮件上。地址码随后被映射为表示投递路线顺序的编号,通过两遍基数排序方法,即可将邮件按投递顺序排好。
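两遍基数排序的思想可以用下面的极简示意来说明(Python,说明性代码;格口数等参数为假设值):第一遍按投递顺序编号的低位把邮件分堆,第二遍按高位分堆,由于分堆过程是稳定的,两遍之后邮件即按投递顺序排列。

```python
def two_pass_radix_sort(mail, num_bins=100):
    """两遍基数排序示意:seq 为投递顺序编号,要求 0 <= seq < num_bins**2。"""
    def one_pass(items, key):
        bins = [[] for _ in range(num_bins)]
        for m in items:
            bins[key(m)].append(m)                              # 稳定地放入对应格口
        return [m for b in bins for m in b]

    mail = one_pass(mail, key=lambda m: m["seq"] % num_bins)    # 第一遍:低位
    mail = one_pass(mail, key=lambda m: m["seq"] // num_bins)   # 第二遍:高位
    return mail

# 用法示例(数据为假设)
pieces = [{"id": "A", "seq": 302}, {"id": "B", "seq": 7}, {"id": "C", "seq": 45}]
print([p["id"] for p in two_pass_radix_sort(pieces)])           # ['B', 'C', 'A']
```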
识别系统由高速扫描仪、图像预处理硬件以及计算机软件组成,软件部分完成用于地址块定位的版面分析、字符行分割、字符串识别(即地址短语解释)、字符分类和后处理(图7)。从框图中可以看出,系统中有许多做出不完美决策的模块,也就是说,不确定性始终存在。解决具体问题的算法很容易受到图像变化的影响,所以最根本的问题是如何处理不确定性和可变性,以及如何把鲁棒性植入系统。一个更恰当的问法也许是:如何用一个个小的识别模块组成这样的识别系统,或者说如何连接这些模块。
在回答这些问题时,应当认识到存在一些可以指导研究人员和工程师的基本设计原则,我们可以把它们叫做鲁棒性设计原则。表1列出了这些原则并给出了简单的解释。在下面的小节中,将讨论五个这样的原则。
3.1假设驱动的原则
可变性意味着没有任何一种解决方案能适合所有情况。因此,问题往往必须被划分为若干种情形,每种情形采用不同的解决方案(问题求解器)。然而,当前输入属于哪种情形事先是未知的,此时就可以应用假设驱动的原则。日本地址块识别问题就是这样一个例子:版面基本上有6种类型,但在现实中其实有12种,因为信封有时会被上下颠倒使用。我们采取的方法是选择能够区分这些情形的显著特征,并根据这些显著特征的观测值来评估每种情形的可能性。
作为假设驱动方法的一个总体框架,我们把每种情形称为一个假设,把观测到的显著特征称为证据,并可以用统计假设检验方法来评估其可能性。观测到证据后,第k个假设的后验概率可按式(1)计算:

P(H_k | e_k) = L(e_k | H_k) P(H_k) / [ L(e_k | H_k) P(H_k) + P(\bar{H}_k) ]  (1)

其中 H_k 表示第k个假设,e_k 表示与第k个假设对应的特征向量。式(1)中,L 是假设 H_k 相对于零假设 \bar{H}_k 的似然比,在假设各特征统计独立的前提下按式(2)计算:

L(e_k | H_k) = \prod_i P(e_{ki} | H_k) / P(e_{ki} | \bar{H}_k)  (2)

函数 P(e_{ki}|H_k) 和 P(e_{ki}|\bar{H}_k) 可以从训练样本中学习得到。
因此,对所有假设观测证据 {e_k | k = 1, ..., K},就可以相应地计算出 L(e_k|H_k) 和 P(H_k|e_k),从而找到最可能的假设[41]。
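下面用一段极简的Python代码示意式(1)(2)的计算过程(说明性代码,数值和特征均为假设;其中 P(\bar{H}_k) 取 1-P(H_k)):

```python
import math

def posterior(evidence, p_given_h, p_given_not_h, prior):
    """按式(1)(2)计算单个假设的后验概率(示意实现,假设各特征统计独立)。
    evidence: 特征取值列表 e_k;p_given_h(i, v) ≈ P(e_ki=v|H_k);
    p_given_not_h(i, v) ≈ P(e_ki=v|¬H_k);prior: 先验概率 P(H_k)。"""
    # 式(2): 似然比 L = Π_i P(e_ki|H_k)/P(e_ki|¬H_k),用对数避免下溢
    log_l = sum(math.log(p_given_h(i, v)) - math.log(p_given_not_h(i, v))
                for i, v in enumerate(evidence))
    l = math.exp(log_l)
    # 式(1): P(H_k|e_k) = L·P(H_k) / (L·P(H_k) + P(¬H_k))
    return l * prior / (l * prior + (1.0 - prior))

# 用法示例(数值为假设):用两个二值特征检验"信封被上下颠倒使用"这一假设
p_h     = lambda i, v: [0.9, 0.8][i] if v else 1 - [0.9, 0.8][i]
p_not_h = lambda i, v: [0.2, 0.3][i] if v else 1 - [0.2, 0.3][i]
print(round(posterior([True, True], p_h, p_not_h, prior=1 / 12), 3))   # 约 0.522
```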
在假设驱动的方法中,确定了候选假设之后,只有适用于该类输入的相应问题求解器才会被调用来处理输入。
3.2递延决定/多假设原则
在一个复杂的模式识别系统中,必须做出许多决策才能得到最终结果。每一个决策都不可能做到100%准确,所以决策模块不能被简单地级联起来。每个模块不应草率做出决定,而应推迟决定,把多个假设转发给下一个模块。这个概念本身很简单。以邮政地址识别为例,可以有如下众多功能模块:
•字符行方向检测
•字符大小(大/小)判定
•字符行的形成与提取
•地址块识别
•字符类型(机器印刷/手写)识别
•文字种类(汉字/假名)识别
•字符方向识别
•字符分割
•字符分类
•单词识别
•短语解释
•地址编号识别
•楼号/房间号识别
•收件人姓名识别
•最终决策(接受/拒绝/重试)
表格1 鲁棒性设计原则
编号 | 原则 | 说明
P1 | 假设驱动的原则 | 当问题的类型不确定时,建立假设并加以检验
P2 | 递延决定/多假设原则 | 不急于做出决定,而是将决定连同多个假设一起转交给后续的"专家"
P3a | 信息集成原则:流程整合 | 由多个不同领域的"专家"组成一个团队来解决问题
P3b | 信息集成原则:组合整合 | 将多个"专家"组合(集成)为一个整体
P3c | 信息集成原则:"佐证"整合 | 利用其他输入信息寻求更多的佐证
P4 | 替代解决方案的原则 | 用多种备选方法解决同一个问题
P5 | 摄动原理 | 稍微修改问题,然后再试一次
这些功能模块各自生成多个假设,每个假设被转发给下一个模块,而下一个模块又会生成多个假设。因此,这个过程会形成图8所示的假设层次树。这里的问题是如何确定应当沿着哪些最优分支前进,从而在尽可能短的时间内得到尽可能好的答案。在众所周知的搜索方法中,我们基本上采用带回溯的爬山搜索,借助它可以在最短时间内达到最优解;当"最优"分支在后续阶段因置信度值低于预设阈值而被拒绝时,再转而处理其他分支。在后期阶段使用束搜索(Beam Search)能有效提高识别精度,而在早期阶段使用则代价过高。对生成假设数量的搜索控制是时间与精度之间的重要权衡,因为在我们的例子中计算时间被限制在3.7秒以内。当然时间越短越好,因为这意味着所需的计算量更少。
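带回溯的贪心搜索可以用下面的示意代码来说明(Python,说明性代码;expand、confidence 等接口均为假设,并非实际系统的模块划分):沿置信度最高的分支向下展开,一旦置信度低于阈值就回溯到最近的分叉点尝试其他分支。

```python
def search_hypothesis_tree(root, expand, confidence, threshold, is_leaf):
    """带回溯的爬山搜索示意:expand(node) 返回下一模块生成的子假设列表,
    confidence(node) 返回该假设的置信度,is_leaf(node) 判断是否为最终结果。"""
    stack = [root]                                # 栈中保存尚未尝试的分支,用于回溯
    while stack:
        node = stack.pop()
        if confidence(node) < threshold:
            continue                              # 剪枝:置信度过低,转而处理其他分支
        if is_leaf(node):
            return node                           # 找到可接受的最终假设
        children = sorted(expand(node), key=confidence)
        stack.extend(children)                    # 置信度最高的子节点位于栈顶,最先被展开
    return None                                   # 所有分支均被拒绝
```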
3.3信息集成原则
在字符和文档识别领域,我们认识到有三种用于应对不确定性问题的信息集成方式:(1)流程整合,(2)基于组合的整合,(3)基于"佐证"的整合。第一种方式即流程整合,把两个或三个过程集成为一个问题求解器,例如"分割-识别"方法和"分割-识别-解释"方法;这种方式早在20世纪70年代就出现在语音理解领域。第二种基于组合的整合方式即字符分类中所采用的分类器组合或分类器集成[13-15]:将统计分类器、结构分类器和神经网络等不同分类器组合(集成)起来推导出单一结果,期望这些分类器的行为能够互补;多数投票和Dempster-Shafer方法等可用于实现这类算法。最后,基于佐证的整合方式是寻找支持结果的更多证据,或寻找表达同一信息的多个输入来源。一个很好的例子是银行支票金额的读取:同时识别小写金额(数字)和大写金额(文字)。在邮政地址识别中,则同时读取邮政编码和文字地址短语以获得更准确的结果;收件人姓名识别是佐证的另一个例子,当街道门牌号无法识别时就会采用这种方法。
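多数投票是组合整合中最简单的一种,下面给出一个极简示意(Python,说明性代码;示例中各分类器的输出为假设数据):

```python
from collections import Counter

def majority_vote(predictions):
    """多数投票的分类器组合示意:predictions 为各分类器给出的类别标签列表。
    得票最多的类别胜出;若票数并列,这里简单地返回先出现者(实际系统可选择拒识)。"""
    label, _ = Counter(predictions).most_common(1)[0]
    return label

# 用法示例:统计分类器、结构分类器和神经网络对同一字符给出的结果(假设数据)
print(majority_vote(["東", "東", "束"]))          # '東'
```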
在邮政地址识别中,最重要的考虑是对字符分割、字符分类和短语解释(语言处理)这三个过程进行整合。如前几节所述,必须结合几何信息[42]和字符相似度,利用地址知识来消解分割中的歧义,因此简单地应用多假设原则是不够的。为此发展出了一种称为词典导向或词典驱动的方法,如下所述,它可以被看作一种假设驱动的方法。该方法如图9所示:通过在预分割网络(图10)中搜索与表示语言知识的网络(图11)中某条路径最匹配的路径,来解释输入模式;也可以说,这等价于在语言网络中搜索与预分割网络中某条路径最匹配的路径[18,19]。对这种知识导向识别过程的理解与Simon给出的解释是一致的[43]:
在语义丰富的领域中求解问题时,很大一部分问题求解搜索是在长时记忆中进行的,并受该记忆中所发现信息的引导。
在我们的情形中,长时记忆指的是语言知识,而短时记忆指的是预分割网络。
我们已经开发出了此类算法的多个版本,其中之一(图12)由Liu等人提出[19]。在一项使用3589封实际邮件和包含111349个地址短语词典的实验中,该词典驱动的手写地址识别算法的识别率为83.7%,错误率为1.1%。语言模型用TRIE结构表示,搜索由束搜索方法控制。在奔腾III/600MHz机器上,识别时间约为100ms。
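作为对TRIE结构语言模型的一个说明,下面给出一个极简示意(Python,说明性代码,并非文献[19]的实现):把地址短语词典构建成TRIE之后,给定已识别出的前缀,即可查出语言模型允许的下一批字符,用于束搜索中的剪枝。示例词条为假设数据。

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # 字符 -> TrieNode
        self.is_term = False      # 是否为完整词条的结尾

def build_trie(lexicon):
    """把地址短语词典构建成 TRIE(示意实现)。"""
    root = TrieNode()
    for term in lexicon:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True
    return root

def valid_next_chars(root, prefix):
    """返回语言模型在给定前缀之后允许出现的字符集合,可用于束搜索剪枝。"""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return set()
        node = node.children[ch]
    return set(node.children)

# 用法示例(词条为假设数据)
trie = build_trie(["東京都千代田区", "東京都港区", "京都市左京区"])
print(valid_next_chars(trie, "東京都"))            # {'千', '港'}
```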
3.4替代解决方案的原则
图像层面还存在许多问题,包括字符粘连、字符与下划线粘连、透明窗口阴影噪声、盖销戳覆盖或压在地址字符上等。替代解决方案的思路是针对同一个问题提供不止一种解决方案,这实际上提供了彼此互补的求解途径。例如,字符粘连问题既可以用整体法解决,也可以用强制分离法(二分法)解决;特别是在处理数字时,一对粘连数字可以被视为100个类别中的一个字符,训练这样的整体分类器后,可将整体分类器与二分法分类器的结果合并,产生更可靠的识别结果。替代解决方案的另一个例子用于解决透明窗口噪声问题。当怀疑存在窗口噪声时,需要两个问题求解器:一个假设阴影细而浅,试图通过腐蚀(细化)操作消除这种噪声;另一个假设阴影相当粗实,试图提取构成窗口边框的线段。同时使用这两个求解器,希望其中一个能够成功。
3.5摄动原理
摄动原理是指当一个问题难以求解时,对问题稍作修改,然后再次尝试求解。如果模式识别是一个连续的过程,摄动原理就不会起作用;但在现实中,它往往是一个不连续的过程,非常小的修改就可能改变最终的识别结果。我们所希望的是这种变化是从拒识变为正确识别,或从错误识别变为正确识别。20世纪80年代用结构化方法识别手写数字时就采用了这种做法:由于轻微的拓扑变化会引起拒识,对参数或输入图像进行扰动能够提高识别率。近年来更加系统的研究再次表明了该方法的有效性。输入图像可以通过各种变换来扰动,如形态学变换(膨胀/腐蚀)和几何变换(旋转、倾斜、透视、收缩和扩张)。在Ha和Bunke的工作[44]中,手写数字被以12种方式变换,并在分类器组合框架下进行识别;与k-NN和神经网络等经典分类器相比,他们的方法能更好地识别困难、怪异的手写体。顺便指出,模糊化也是一种图像变换,但尚未在摄动的语境下被应用;字符特征提取中使用的模糊化并不属于这里所说的"轻微变换"。
摄动方法也被成功地应用于日本邮政地址识别。我们的测试表明,该方法平均可使识别率提高约10到15个百分点。当我们不限制识别时间,依次反复进行旋转变换、重新二值化以及其他一些参数修改等扰动操作时,发现53%的被拒识图像得到了正确识别,错误率为12%。虽然这个结果很有吸引力,但要使用这种方法,减少额外引入的错误是必要的一步。一种可能的途径是像Ha和Bunke那样采用组合方案[44]:不是在一系列拒识之后采用第一个得到的识别结果,而是同时施加多种扰动,并通过投票等方式得出一个结果。鉴于计算能力的不断增强,这种做法看起来很有前途。应当指出,摄动不仅对字符分类有效,对版面分析、直线提取、字符分割以及其他中间决策同样有效。
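下面是"摄动+投票"思路的一个极简示意(Python/NumPy,说明性代码;所列变换种类和分类接口均为假设,并非实际系统所用的12种变换):对输入图像施加若干轻微变换,分别送入分类器,再对未被拒识的结果做多数投票。

```python
import numpy as np
from collections import Counter

def recognize_with_perturbation(img, classify, reject_label="REJECT"):
    """摄动 + 投票示意:classify(img) 返回类别标签或 "REJECT"。"""
    perturbations = [
        lambda im: im,                                      # 原图
        lambda im: np.roll(im, 1, axis=1),                  # 轻微平移
        lambda im: np.roll(im, -1, axis=0),
        lambda im: np.maximum(im, np.roll(im, 1, axis=0)),  # 粗略近似的"膨胀"
        lambda im: np.minimum(im, np.roll(im, 1, axis=1)),  # 粗略近似的"腐蚀"
    ]
    votes = [classify(p(img)) for p in perturbations]
    votes = [v for v in votes if v != reject_label]
    if not votes:
        return reject_label                                 # 所有扰动版本均被拒识
    return Counter(votes).most_common(1)[0][0]              # 多数投票
```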
3.6鲁棒性的实现
前面各小节描述的设计原则涉及识别系统的结构和算法,但分类器和各种参数还必须经过谨慎且同步的训练和调整[40],即使对特定的问题求解模块也是如此。尽管多数问题并不严重,但在开发阶段会不断涌现出许多问题。因此,鲁棒性的实现对研究人员和工程师来说是一项艰巨的任务。以下是实现高效而有效的开发过程的几个关键:
•来自用户现场的实际样本
•使用多"袋"测试样本进行鲁棒性度量
•加速数据集
•逐样本的原因分析
如果可能的话,从用户现场收集样本是非常可取的,我们把这些实际样本叫做现场样本。然而,样本通常是分多次收集的,现场样本不应被混合成单一的样本集。选择合适的时机采集样本很重要,因为样本特性会随运行模式和季节性倾向而变化。我们不混合这些采集批次,而是把样本保存在许多不同的"袋"中,并可以为每个袋(数据集)测量识别率(或识别精度),如图13所示。这里图中的一个小技巧是把数据集编号重新排列,使识别率按递减顺序排列;这样绘图便于观察识别率的轮廓,斜率越陡,说明识别系统越不鲁棒。此外,如果某个数据集的识别性能非常低,由于该数据集规模较小,我们可以对它进行详细的重新检查,以找出问题(即低识别率)的原因。
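图13所示识别率轮廓的计算可以用下面的示意代码表示(Python,说明性代码;袋名和数值均为假设):按识别率递减排列各"袋",陡峭的下降段即提示系统在某些数据集上不够鲁棒,值得做逐样本的原因分析。

```python
def recognition_rate_profile(bags):
    """计算各'袋'样本的识别率并按递减顺序排列(示意实现)。
    bags: {袋名: (正确识别数, 样本总数)};返回 [(袋名, 识别率), ...]。"""
    rates = [(name, correct / total) for name, (correct, total) in bags.items()]
    rates.sort(key=lambda x: x[1], reverse=True)    # 识别率递减,便于观察轮廓的陡峭程度
    return rates

# 用法示例(数据为假设):识别率骤降的袋值得逐样本分析原因
bags = {"bag01": (960, 1000), "bag02": (930, 1000), "bag03": (610, 1000)}
for name, rate in recognition_rate_profile(bags):
    print(f"{name}: {rate:.1%}")
```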
加速数据集是被某一版本识别器拒识或错误识别的样本的集合。数据集中的每个样本都可以被赋予一个唯一标识符,借助它可以对样本逐一进行原因分析;更重要的是,借助它可以在整个开发过程中追踪改进情况。如果能为有问题的情形指定名称和问题代码,就可以更妥善地管理纠正过程中那些并非一帆风顺的进展。
4.前景
回顾40至50年的OCR历史并概观当前市场,可能会让人认为这项技术已近乎成熟。然而,很明显,该技术仍处于发展之中,与人类的认知能力相比还相去甚远。如果从技术已经成熟的观点来看,目前所处的正是市场(或应用)的长尾部分。按照这种观点,市场的"头部"只包含少量应用,但每种应用都有海量文档需要识别,它们是商业表格阅读、银行支票阅读和邮政地址阅读;由于需求足够大,这些应用的投资是有效益的,或者说投资回报几乎总是有保障的。当然,技术进步已经使头部向尾部方向延伸,但剩下的尾部仍然很长。即使被视为头部的这三个应用领域,也各自存在尾部:有很多商业表格、支票和邮件是非常难以阅读的,更先进的识别技术无疑是急需的。例如,日本的中小企业(SME)仍在使用纸质表格办理银行业务,并用纸质收入表格向地方政府报税。每家这样的公司的业务量都不大,它们也没有太大的动力去革新;因此,接收这些公司各种表格的银行希望使用更智能、更多功能的OCR。长尾现象同样适用于邮政地址识别。问题在于需求方能否预见对拟议新产品和新系统投资的回报,以及科学家和工程师能否让他们相信这种回报,而技术问题又是零散而多样的。这些都是典型的长尾问题。
从另一个角度谈论未来时,会遇到"先有鸡还是先有蛋"的问题,即需求与技术种子孰先孰后的问题,这在一般情况下很难回答。从工业界的角度来看,更重要的似乎是考虑需求,或至少是潜在需求,而对未来需求的判断至少目前还是主观的。当前已被广泛认识但尚未满足的需求包括:(1)面向电子政务的办公文档归档;(2)作为移动设备人机接口的手写输入;(3)面向视频搜索的视频内文本识别;(4)面向全局搜索的书籍和历史文献识别。此外还有两项应用:(5)面向信息采集的场景文本识别;(6)面向知识工作者的手写文档管理。
对于身处国外的旅行者来说,未知的文字和语言是在路上、商店里、机场中快速做出判断的一大障碍。带数码相机的移动设备,即"信息采集相机"[45],在这种情况下可以帮助他们(图14)。随着微处理器性能的提高,场景中的文本将可以被识别。这项技术面临的挑战包括彩色图像处理、几何透视归一化、文本分割、自适应阈值化、未知文字识别、语言翻译等。在日本,每一部移动电话都配有数码相机,其微处理器也越来越强大。一些数码相机现在已具备在待拍摄图像中定位人脸的智能功能,那么问题来了:为什么文本识别就这么困难?在日本,一些手机现在已经可以识别超过4000个汉字字符[36]。一个更有趣的挑战似乎是动态识别能力:在用户无需刻意操作的情况下,通过反复识别相机拍摄的多帧图像来保证高识别性能。用户可以对准识别目标尝试不同的角度和位置,这可以被看作一种交互式摄动。
另一个有吸引力的领域是数码笔和手写文档管理工具。基于手写在教育和知识工作中的重要性,人们正在重新认识手写这一行为:书写有助于人们阅读、写作和记忆。今天的数码笔能够以非常自然的方式捕获手写批注和备忘录,从而把这些行为纳入信息系统。Anoto是其中一项先进技术,可以数字化地捕获手写笔画数据及其他相关数据(图15)。一些研究小组正在使用这种数码笔创建更智能的信息管理系统[46-49],其目标是精确地管理数字墨水与文档。一组研究人员开发了倡导"即时信息"(iJIT)的实验系统,支持笔记和纸质/电子混合文档的管理[49]:手写的研究笔记可以始终与计算机中的数字副本保持同步,这样即使成员相距遥远,也可以方便地在小组内共享信息。该系统的另一个特点是,任何数字文档都可以以一种特殊方式打印出来,即在打印内容上叠加Anoto点阵(图16);这样,用户在打印稿上用数码笔书写的笔画就可以被捕获,所做的标注会与计算机中相应的数字文档同步。这种系统的价值在于计算机中的数字文件带有与纸面上相同的标注,这意味着用户可以丢弃纸质文件而不损失任何信息。这一概念使用户能够同样自如地在数字世界和现实世界中工作,是一种试图超越"无纸化办公神话"[50]的尝试。当这种数码笔的使用变得普遍时,对手写字符识别、手写查询处理以及更智能的知识管理能力的需求将会自然出现。打造这样的信息系统是我们推进识别技术的一种方式,我们希望看到更先进的信息系统对更先进的识别技术提出需求。
5.结论
前瞻性的愿景和基础技术都是我们这个技术共同体未来发展的关键。愿景指明应用的价值和方向,吸引对新技术的投资,并吸引众多人才乃至创新者进入这一领域,这是一种自上而下(TOP-DOWN)的创新方式;而基础技术创新则自下而上地发挥作用。这里所说的基础技术包括两类:一类是从底层支撑我们这个领域的技术,另一类是我们自己的技术,即字符和文档识别技术。就第一类而言,我们已经看到先进的半导体器件、高性能计算机以及更先进的软件开发工具对识别技术的支撑作用:它们不仅使更先进的OCR系统得以出现,也吸引并推动学术界进入这一领域,进一步促进了识别技术的进步。我们乐于看到这种良性循环持续下去。
致谢:
本文作者感谢在日立公司参与邮政地址识别系统开发的小组成员:H. Sako, K. Marukawa, M. Koga, H. Ogata, H. Shinjo, K. Nakashima, H. Ikeda, T. Kagehiro, R. Mine, N. Furukawa, and T. Takahashi。感谢C.-L. Liu博士(现就职于北京的自动化研究所)和Y. Shima博士(现就职于东京明星大学)在我们实验室所做的工作。还要感谢G博士提出的宝贵意见,以及Siemens ElectroCom的U. Miletzki博士提供的有关发展历程的信息。
参考文献:
[1]H. Fujisawa, A view on the past and future of character and documentrecognition, in: Proceedings of the Seventh ICDAR, Curitiba, Brazil,September 2007, pp. 3--7.
[2]The United States Postal Service: An American History 1775--2002,Government Relations, United States Postal Service, 2003.
[3]H. Genchi, K. Mori, S. Watanabe, S. Katsuragi, Recognition ofhandwritten numeral characters for automatic letter sorting, Proc.IEEE 56 (1968) 1292--1301.
[4]W. Schaaf, G. Ohling, et al., Recognizing the Essentials, SiemensElectroCom, Konstanz, 1997.
[5]http://www.industry.siemens.com/postal-automation/usa.
[6]K. Yamamoto, H. Yamada, T. Saito, I. Sakaga, Recognition ofhandprinted characters in the first level of JIS Chinese characters,in: Proceedings of the Eighth ICPR, 1986, pp. 570--572.
[7]J.R. Ullmann, Pattern Recognition Techniques, Butterworths, London,1973.
[8]C.Y. Suen, M. Berthod, S. Mori, Automatic recognition of handprintedcharacters---The state of art, Proc. IEEE 68 (4) (1980) 469--487.
[9]S. Mori, C.Y. Suen, K. Yamamoto, Historical review of OCR researchand development, Proc. IEEE 80 (7) (1992) 1029--1058.
[10]G. Nagy, At the frontiers of OCR, Proc. IEEE 80 (7) (1992) 1093--1100.
[11]J.C. Simon, Off-line cursive word recognition, Proc. IEEE 80 (7) (1992) 1150--1161.
[12]F. Kimura, K. Takashina, S. Tsuruoka, Y. Miyake, Modified quadratic discriminant functions and the application to Chinese character recognition, IEEE Trans. PAMI 9 (1) (1987) 149--153.
[13]C.Y. Suen, C. Nadal, T.A. Mai, R. Legault, L. Lam, Recognition of totally unconstrained handwritten numerals based on the concept of multiple experts, in: Proceedings of the First IWFHR, Montreal, Canada, 1990, pp. 131--143.
[14]L. Xu, A. Krzyzak, C.Y. Suen, Methods of combining multiple classifiers and their applications to handwriting recognition, IEEE Trans. SMC 22 (3) (1992) 418--435.
[15]T.K. Ho, J.J. Hull, S.N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. PAMI 16 (1) (1994) 66--75.
[16]F. Kimura, M. Sridhar, Z. Chen, Improvements of lexicon-directedalgorithm for recognition of unconstrained hand-written words, in:Proceedings of the Second ICDAR, Tsukuba, Japan, October 1993, pp.18--22.
[17]C.H. Chen, Lexicon-driven word recognition, in: Proceedings of theThird ICDAR, Montreal, Canada, August 1995, pp. 919--922.
[18]M. Koga, R. Mine, H. Sako, H. Fujisawa, Lexical search approach forcharacter-string recognition, in: Proceedings of the Third DAS,Nagano, Japan, November 1998, pp. 237--251.
[19]C.-L. Liu, M. Koga, H. Fujisawa, Lexicon-driven segmentation andrecognition of handwritten character strings for Japanese addressreading, IEEE Trans. PAMI 24 (11) (2002) 425--1437.
[20]M. Aizermann, E. Braverman, L. Rozonoer, Theoretical foundations ofthe potential function method in pattern recognition learning,Automat. Remote Control 25 (1964) 821--837.
[21]U. Miletzki, Schürmann-polynomials---roots and offsprings, in:Proceedings of the Eighth IWFHR, 2002, pp. 3--10.
[22]M. Cheriet, N. Kharma, C.-L. Liu, C.Y. Suen, Character RecognitionSystems---A Guide for Students and Practitioners, John Wiley &Sons, Inc., Hoboken, NJ, 2007.
[23]R. Casey, G. Nagy, Recognition of printed Chinese characters, IEEETrans. Electron. Comput. EC-15 (1) (1966) 91--101.
[24]S. Yamamoto, A. Nakajima, K. Nakata, Chinese character recognition byhierarchical pattern matching, in: Proceedings of the First IJCPR,Washington, DC, 1973, pp. 183--194.
[25]H. Fujisawa, Y. Nakano, Y. Kitazume, M. Yasuda, Development of aKanji OCR: an optical Chinese character reader, in: Proceedings ofthe Fourth IJCPR, Kyoto, November 1978, pp. 815--820.
[26]G. Nagy, Chinese character recognition: a twenty-five-yearretrospective, in: Proceedings of the Ninth ICPR, 1988, pp. 163--167.
[27]M. Yasuda, H. Fujisawa, An Improvement of Correlation Method forCharacter Recognition, vol. 10 (2), Systems, Computers, Controls,Scripta Publishing Co., 1979, pp. 29--38.
[28]H. Fujisawa, C.-L. Liu, Directional pattern matching for characterrecognition revisited, in: Proceedings of the Seventh ICDAR,Edinburgh, August 2003, pp. 794--798.
[29]H. Fujisawa, O. Kunisaki, Method of pattern recognition, JapanesePatent 1,520,768 granted in 1989, filed in 1979.
[30]D.H. Hubel, T.N. Wiesel, Functional architecture of macaque monkeyvisual cortex, Proc. R. Soc. London Ser. B 198 (1977) 1--59.
[31]J. Tsukumo, H. Tanaka, Classification of handprinted Chinesecharacters using non-linear normalization and correlation methods,in: Proceedings of the Ninth ICPR, Rome, Italy, 1988, pp. 168--171.
[32]C.-L. Liu, Normalization-cooperated gradient feature extraction forhandwritten character recognition, IEEE Trans. PAMI 29 (6) (2007)1465--1469.
[33]C.-L. Liu, Handwritten Chinese character recognition: effects ofshape normalization and feature extraction, in: Proceedings of theSummit on Arabic and Chinese Handwriting, College Park, September2006, pp. 23--27.
[34]K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: areview, IEEE Trans. PAMI 22 (1) (2000) 4--37.
[35]C.-L. Liu, R. Mine, M. Koga, Building compact classifier for largecharacter set recognition using discriminative feature extraction,in: Proceedings of the Eighth ICDAR, Seoul, Korea, 2005, pp.846--850.
[36]M. Koga, R. Mine, T. Kameyama, T. Takahashi, M. Yamazaki, T.Yamaguchi, Camera-based Kanji OCR for mobile-phones: practicalissues, in: Proceedings of the Eighth ICDAR, Seoul, Korea, 2005, pp.635--639.
[37]H. Fujisawa, Y. Nakano, K. Kurino, Segmentation methods for characterrecognition: from segmentation to document structure analysis, Proc.IEEE 80 (7) (1992) 1079--1092.
[38]K. Marukawa, M. Koga, Y. Shima, H. Fujisawa, An error correctionalgorithm for handwritten Chinese character address recognition, in:Proceedings of the First ICDAR, Saint-Malo, September 1991, pp.916--924.
[39]H. Fujisawa, How to deal with uncertainty and variability: experienceand solutions, in: Proceedings of the Summit on Arabic and ChineseHandwriting, College Park, September 2006, pp. 29--39.
[40]H. Fujisawa, Robustness design of industrial strength recognitionsystems, in: B.B. Chaudhuri (Ed.), Digital Document Processing: MajorDirections and Recent Advances, Springer, London, 2007, pp. 185--212.
[41]T. Kagehiro, H. Fujisawa, Multiple hypotheses document analysis, in:S. Marinai, H. Fujisawa (Eds.), Studies in ComputationalIntelligence, vol. 90, Springer, Berlin, Heidelberg, 2008, pp.277--303.
[42]T. Kagehiro, M. Koga, H. Sako, H. Fujisawa, Segmentation ofhandwritten Kanji numerals integrating peripheral information byBayesian rule, in: Proceedings of the IAPR MVA'98, Chiba, Japan,November 1998, pp. 439--442.
[43]H.A. Simon, The Sciences of the Artificial, third ed., The MIT Press,Cambridge, MA, 1998, pp. 87--88.
[44]T.M. Ha, H. Bunke, Off-line, handwritten numeral recognition byperturbation method, IEEE Trans. PAMI 19 (5) (1997) 535--539.
[45]H. Fujisawa, H. Sako, Y. Okada, S-W. Lee, Information capturingcamera and developmental issues, in: Proceedings of the FifthICDAR'99, Bangalore, September 1999, pp. 205--208.
[46]F. Guimbretière, Paper augmented digital documents, in: Proceedingsof the ACM Symposium on User Interface Software and Technology,UIST2003, Vancouver, Canada, 2003, pp. 51--60.
[47]C. Liao, F. Guimbretière, PapierCraft: a command system forinteractive paper, in: Proceedings of the ACM Symposium on UserInterface Software and Technology, UIST2005, Seattle, USA, 2005, pp.241--244.
[48]R. Yeh, C. Liao, S. Klemmer, F. Guimbretière, B. Lee, B. Kakaradov,J. Stamberger, A. Paepcke, ButterflyNet: a mobile capture and accesssystem for field biology research, in: Proceedings of theInternational Conference on Computer—Human Interaction, CHI2006,Montreal, Canada, 2006, pp. 571--580.
[49]H. Ikeda, K. Konishi, N. Furukawa, iJITinOffice: desktop environmentenabling integration of paper and electronic documents, in:Proceedings of the ACM Symposium on User Interface Software andTechnology, UIST2006, Montreux, Switzerland, October 2006.
[50]A.J. Sellen, R.H. Harper, The Myth of the Paperless Office, The MIT Press, Cambridge, MA, 2002.
附录:英文原文:
1.Introduction
Presentedis an industrial view on the character and document recognitiontechnology, based on some material presented at ICDAR [1]. Commercialoptical character readers (OCRs) emerged in the 1950s, and sincethen, the character and document recognition technology has advancedsignificantly providing products and systems to meet industrial andcommercial needs throughout the development process. At the sametime, the profits from businesses based on this technology have beeninvested in research and development of more advanced technology. Wecan observe here a virtuous cycle. New technologies have enabled newapplications, and the new applications have supported the developmentof better technology. Character and document recognition has been avery successful area of pattern recognition. The main business andindustrial applications of character and document recognition in thelast forty years have been in form reading, bank check reading andpostal address reading. By supporting these applications, recognitioncapability has expanded in multiple dimensions: mode of writing,scripts, types of documents, and so on. The recognizable modes ofwriting are machine-printing, handprint-ing, and script handwriting.Recognizable scripts started with Arabic numerals and expanded to theLatin alphabets, Japanese Katakana syllabic characters, Kanji(Japanese version of Chinese) characters, Chinese characters, andHangul characters. Work is now being done
tomake Indian and Arabic scripts readable. Many different kinds ofpaper forms can be read by today's OCRs, including bank checks, postcards, envelopes, book pages, and business cards. Typeface standardssuch as OCR-A and OCR-B fonts have contributed to making OCRsreliable enough even in the early stages. In the same context,specially designed OCR forms have simplified the segmentation problemand made handprinted character OCRs readable even by immaturerecognition technology. Today's OCRs are successfully used to readany type of fonts and freely handwritten characters. The field ofcharacter and document recognition has not always been peaceful. Ithas twice been disturbed by waves of new digital technologies thatthreatened to diminish the role of OCR technology. The first suchwave was that of office automation in the early 1980s. Starting then,most of information seemed to be going to be 'born digital',potentially diminishing demand for OCRs, and some researchers werepessimistic about the future. However, it turned out that the salesof OCRs in Japan, for example, peaked in the 1980s. This wasironically due to the promoted introduction of office computers. Itis well known that the use of paper has kept increasing. We are nowfacing the second wave. IT and Web technologies might have adifferent impact. Many kinds of applications can now be completed onthe Web. Information can flow around the world in an instant.However, it is still not known whether the demand for character anddocument recognition will decrease or whether new applicationsrequiring more advanced technology will be created. Search engineshave become ubiquitous and are expanding their reach into the areasof image documents, photographs, and videos. People are re-evaluatingthe importance of handwriting and trying to integrate it into thedigital world. It seems that paper is still not going to disappear.Mobile devices with micro cameras now have CPUs capable of real-timerecognition. The future prospects of these developments are discussedhere.
2.Brief historical view
2.1.Overview
Thefirst practical OCR appeared in the United States in the 1950s, inthe same decade as the first commercial computer UNIVAC. Since then,each decade has seen advances in OCR technology. In the early 1960s,IBM produced their first models of optical readers, the IBM 1418(1960) and IBM 1428 (1962), which were, respectively, capable ofreading printed numerals and handprinted numerals. One of the modelsof those days could read 200 printed document fonts and were used asinput apparatus for IBM 1401 computers. Also in the 1960s, postaloperations were automated using mechanical letter sorters with OCRs,which for the first time automatically read postal codes to determinedestinations. The United States Postal Service first introducedaddress-reading OCRs, which in 1965 began reading the city/state/ZIPline of printed envelopes [2]. In Japan, Toshiba and NEC developedhandprinted numeral OCRs for postal code recognition,
andput them into use in 1968 [3]. In Germany, a postal code system wasintroduced for the first time in the world in 1961 [4]. However, thefirst postal code reading letter sorter in Europe was introduced inItaly in 1973, and the first letter sorter with an automatic addressreader was introduced in Germany in 1978 [5].
Japanstarted to introduce commercial OCRs in the late 1960s.Hitachiproduced their first OCR for printed alpha numerics in 1968and thefirst handprinted numeral OCR for business use in 1972. NEC developedthe first OCR that could read handprinted Katakana in addition in1976. The Japanese Ministry of International Trade and Industry(since renamed the Ministry of Economy, Trade and Industry) conducteda 10-year 20 billion-yen national project on pattern in-formationprocessing starting in 1971. Among other research topics, Toshibaworked on printed Kanji recognition, and Fujitsu worked onhandwritten character recognition. The ETL character databasesincluding Kanji characters were created as part of this project,which contributed to research and development of Kanji OCRs [6].Asaby product, the project attracted many students and researchersinto the pattern recognition area. In the United States, IBMintroduced a deposit processing system (IBM 3895) in 1977, which wasable to recognize unconstrained handwritten check amounts. The authorhad a chance to observe it in operation at Mellon Bank in Pittsburghin 1981, and it could reportedly read about 50% of handwritten checkswith the remaining half being hand coded. The state of the art incharacter recognition in the 1960s and 1970s is well documented inthe literature [7,8].
The1980s witnessed significant technological advances in semi-conductordevices such as CCD image sensors, microprocessors, dynamic randomaccess memories (DRAMs), and custom-designed LSIs. For example, OCRsbecame smaller than ever fitting on desktops (Fig. 1). Then cheapermegabyte-size memories and CCD image sensors enabled whole-pageimages to be scanned into memory for further processing, in turnenabling more advanced recognition and
widerapplications. For example, handwritten numeral OCRs that couldrecognize touching characters were introduced for the first time in1983, making it possible to relax physical form constraints andwriting constraints. In the late 1980s, Japanese vendors of OCRsintroduced into their product lines new OCRs that could recognizeabout 2400 printed and handprinted Kanji characters. These were usedto read names and addresses for data entry. More detailed tech-nologyreviews are available in the literature [9,10].
Theoffice automation boom of the 1980s, which was influential in Japan,had two features. One was Japanese language processing by computersand Japanese word processors. Emergence of Kanji OCRs was a naturalconsequence of this development. The other feature was optical disksused as computer storage systems, which were developed and put intouse in the early 1980s. A typical application was patent automationsystems in the United States and Japan that stored images of patentspecification documents. The Japanese patent office system thenstored approximately 50 million documents or 200 million digitizedpages on 12-in optical disks. Each disk could store 7GB of data, theequivalent of 200000 digitized pages. The sys-tem used 80 Hitachioptical disk units and 80 optical library units. These systems can beconsidered one of the first digital libraries. This kind of newcomputer applications directly and indirectly encouraged studies ondocument understanding and document lay-out analysis in Japan. Moreimportantly, it was in this decade that documents became the focus ofcomputer processing for the first time.
Thechanges in the 1990s were due to the upgraded performances of UNIXworkstations and then personal computers. Though scanning and imagepreprocessing were still done by the hardware, a major part ofrecognition was implemented by the software on general-purposecomputers. The implication of this was that programming languageslike c and c ++ could be used to code recognition algorithms,allowing more engineers to develop more complicated algorithms andexpanding the research community to include academia. During thisdecade, commercial software OCR packages running on PCs also appearedon the market. Techniques for recognizing freely handwrittencharacters were extensively studied, and successfully applied to bankcheck readers and postal address readers. Advanced layout analysistechniques enabled recognition of wider varieties of business forms.Research institutions specializing in this field such as CENPARMI,led by Prof. Suen and CEDAR, led by Prof. Srihari and Prof.Govindaraju contributed to these advances. New high-tech vendorsappeared, including A2iA, which was started by the late Prof. Simonin France [11] , and Parascript, which was started in Russia to do
businessin the United States. In Japan, the Japanese Postal Ministryconducted the third generation postal automation project between 1994and 1996, in which Toshiba, NEC, and Hitachi joined to develop postaladdress recognition systems that could sort sequences. This projectenabled significant advances in Japanese address reading.
TheInternational Association for Pattern Recognition began holdingconferences such as ICDAR, IWFHR, and DAS in the early 1990s. Manyintensively studied methods have been reported in these conferences.Examples are artificial neural networks, hidden Markov models (HMMs),polynomial function classifiers, modified quadratic discriminantfunction (MQDF) classifiers [12] , support vector machines (SVMs),classifier combination [13--15] , information integration, andlexicon-directed character string recognition [16--19] , some ofwhich are based on original ideas from the 1960s [20,21]. Most ofthese play key roles in today's systems. In contrast with previousdecades, in which industry mostly used proprietary in-housetechnology, the 1990s witnessed important interactions betweenacademia and industry. Academics studied real technical problems anddeveloped sophisticated theory-based methods, enabling industry tobenefit from their research. Readers may find the state of the art ofcharacter recognition systems, including image preprocessing, featureextraction, pattern classification, and word recognition, welldescribed in the literature[22] .
Inthe following subsections, major pre-1990s technical achievements inthe area of Kanji character classifiers, character segmentationalgorithms, and linguistic processing are described.
2.2.Kanji character classifiers
Inthe 1970s, there were two competing approaches to characterrecognition, structural analysis and template matching (or thestatistical approach). Contemporary commercial OCRs were usingstructural methods to read handprinted alphanumerics and Katakana,and template matching methods to read printed alphanumerics.Tem-plate matching methods had been experimentally proven to beapplicable to printed Kanji recognition by the late 1970s[23--26] ,but their applicability to handwritten (or handprinted) Kanji was inquestion. The problem of recognizing handwritten Kanji seemed like asteep, unexplored mountain. It was clear that neither the structuralnor the simple template matching approaches could conquer it alone.The former had difficulty with the huge number of topologicalvariations due to complex stroke structures, while the latter haddifficulty with nonlinear shape variations. However, in light ofprevious work on handwritten numeral recognition using a templatematch-ing approach, the latter approach seemed to have a greaterchance of success [27] .
Thekey was the concept of blurring as feature extraction, which wasapplied to directional features and found to be effective inrecognizing handwritten Kanji [27,28]. The introduction of continuousspatial feature extraction made the optimum amount of blurringsurprisingly large. The first Hitachi OCR for reading handprintedKanji used simple template matching based on blurred directionalfeatures where the feature templates were four sets of 16× 16 arraysof gray values. The directional feature, which was patented in Japanin 1979, was computed using a two-dimensional gradient to determinestroke direction ( Fig. 2) and was even applicable to grayscaleimages [29] . Although it was only indirectly relevant, Hubel andWiesel's work encouraged our view that the directional feature waspromising [30] . Nonlinear shape normalization [31--33] andstatistical classifier methods [12,34] boosted recognition accuracy.We learned that blurring should be considered as a means of obtaininglatent dimensions (subspace) rather than as a means of reducingcomputational cost, though the effects might seem similar. Forexample, the mesh size of 8× 8 used in statistical approaches wasdetermined by the optimum blurring parameter in light of the Shannonsampling theorem, and bigger mesh sizes with the same blurringparameter did not give better recognition performances.
Thethorough studies of the research group led by Prof. Kimuracontributed to advancing statistical quadratic classifiers [12] ,which were successfully applied to handwritten Kanji recognition.Actually, the basic theory had been known, but computers of the1970s did not have sufficient computational power to be applied tostudies of such statistical approaches. Today, the four-directionalfeature vector for Kanji patterns consists of 8 × 8 × 4 elements,and the subspace obtained by statistical covariant analysis is offrom 100 to 140 dimensions. However, the size of the 8× 8 array issurprisingly (counter-intuitively) small in light of many complexKanji characters. Recognition accuracy for individual freelyhandwritten Kanji is not yet high enough, however. Therefore,linguistic context such as name and address is used to enhance totalrecognition accuracy. To reduce computational cost, cluster-basedtwo-stage classification is used
toreduce the number of templates that must be matched. One of therecent advances in Kanji (and Chinese character) recognition is thereduced size of recognition engines designed especially for mobilephone applications. A compact recognition engine reported in Refs.[35,36] requires only 613kB of memory to store parameters torecognize 4344 classes of printed Chinese characters.
2.3.Character segmentation algorithms
Inthe 1960s and 1970s, a flying-spot scanner or a laser scanner with arotated mirror was used together with a photo-multiplier to convertoptical signals into electrical signals. Character segmentation wasusually carried out with the help of these kinds of scanningmechanisms. For example, forms for handprint reading used marks on anedge that signaled the presence of a character line to be scanned. Inaddition, the locations of writing boxes on the forms were registeredbeforehand, and the colors of the boxes were trans-parent to thescanner sensor. Therefore, OCRs could easily extract images thatcontained exactly one single handprinted character.
Then,in the 1980s, semiconductor sensors and memories appeared, enablingOCRs to scan and store images of whole pages. This was an epochmaking change that was significant to users because it relaxed strictconditions on OCR form specifications, for example, by enabling themto use smaller non-separated writing boxes. However, it required asolution of the problem of touching numerals and change in how imagesare represented in memory[37] . Be-fore this change, scanned imageshad been arrays of binary pixels, and segmentation was pixel-based,but from this time on, the bi-nary image in the memory wasrepresented by run-length codes. The
runlengthrepresentation was suited to conducting connected component analysisand contour following. The connected components were processed asblack objects rather than as pixels. In 1983, Hitachi produced one ofthe first OCRs that could segment and recognize touching handwrittennumerals based on a multiple-hypothesis segmentation--recognitionmethod (Fig. 3 ). Contour shape analysis was able to identifycandidates of touching points, and multiple pairs
offorcedly separated patterns were fed into the classifier. Byconsulting the confidence values from the classifier, the recognizerwas able to choose the right hypothesis. This direction of changeshas led us to forms processing whose ultimate goal is to read unknownforms, or at least those forms that are not specifically designed forOCRs. However, this means that users might become less careful intheir writing, so OCRs have to be more accurate for freelyhandwritten characters as well.
Thesegmentation problem was far tougher in postal address recognition.Fig. 4 shows horizontally handwritten addresses. The width of acharacter varies by as much as a factor of two, and some of theradicals and components are also valid characters. As shown in Fig. 4, it is difficult to group the right components to form the rightcharacter patterns, where some characters are quite wide and othersnarrow. To resolve the grouping problem, linguistic information (oraddress knowledge) is required in addition to geometric andsimilarity information. This issue will be discussed in more detailin Section 3.
2.4.Integration of linguistic information
Majorbusiness uses of handprinted Kanji OCRs have been the reading ofnames and addresses in application forms. In such applications, toavoid the segmentation problem, forms have separate preprinted fixedboxes, but how to achieve highly accurate word/phrase recognition isstill a question. We can utilize a priori linguistic knowledge tochoose the right options from the candidate lattice to accuratelyrecognize words and phrases. Here, the lattice is a table in whicheach column carries candidate classes, and each row corresponds tocharacters on the sheet. If a string consists of N Kanji charactersand there are K candidates for each, there are KN possibleinterpretations (or word recognition results). The linguisticprocessing consists of choosing one of the many possibleinterpretations. To do this, we developed a method based on a finitestate automaton as a key technique [38] . The basic idea is to throwL lexical terms at the automaton, and see which terms the automatonaccepts, where the model of the automaton is dynamically generatedfrom the lattice (Fig. 5). L is usually a number as big as severaltens of thousands, but only the terms whose first character appearsin the first column of the lattice are to be accepted. To improveaccuracy, we may consider the terms whose second character appears inthe second column of the lattice as well. Such terms are fed into theautomaton one by one, and the state transitions determine a path (aseries of edges). Then the corresponding penalties are summed up andassociated with the input term. Passing the first edge gives apenalty of zero, and passing the last gives a penalty of 15 when K =16. In this way, a term with the smallest penalty is deter-mined tobe the recognized word. The number of candidates for each characteris adaptively controlled to be equal to or less than K ,toexcludeextremely unlikely word candidates. This algorithm has been usedsuccessfully for address phrases, provided that the characters arereliably segmented. Marukawa et al.'s experiments showed thatcharacter recognition accuracy was raised to 99.7% from 90.2% for alexicon with 10828 terms, resulting in address phrase recognitionaccuracy of 99.1%. Here, we can note that error occurrences are notstatistically independent. Linguistic processing that solvesdifficult segmentation problems (cf. Fig. 4) is discussed in Section3.
3.Robustness design to deal with uncertainty and variability
Postaladdress recognition was an ideal application for re-searchers in thesense that it presented many technical challenges, but, at the sametime, the innovation was an expected one for post office automationand the investments really paid off. In the 1990s, R&D projectswere conducted in the United States, Europe, and Japan to developaddress readers that could recognize freely handwritten and printedfull addresses. These were intended to automate carrier
sequencesorting, a tedious task for postal workers. The recognition task wasto identify an exact delivery point by recognizing the fulldestination address including street and apartment numbers. Theproblem in Japan is to identify one of 40000000 address points. Inthis section, the main issues of robustness design intended to dealwith uncertainty and variability are discussed based on theexperience of the author's team [39,40].
Japaneseaddress recognition is a difficult task as shown in Fig. 6. The readrates for printed and handwritten mail are higher than 90% and 70%,respectively. Images of the rejected mail pieces are sent tovideo-coding stations where human operators enter addressinformation. The results of automatic recognition and human codingare transformed to address codes, which are then sprayed on thecorresponding mail pieces as they run through the sorting machine.After the address codes are mapped to numbers that show a carriersequence, the mail pieces can be sorted in sequence by using thetwo-pass radix sort method.
Therecognition system consists of a high-speed scanner, imagepreprocessing hardware, and the computer software that carries outlayout analysis for address block location, character linesegmentation, character string recognition (i.e., address phraseinterpretation), character classification and post processing (Fig. 7). As can be seen in the block diagram, there are many modules thatmake imperfect decisions; i.e., uncertainty is always involved.Algorithms to solve
specificproblems are susceptible to variations in the images, so the mostbasic questions are how to deal with uncertainty and variability andhow to implant robustness into the system. A more appropriatequestion may be how to compose such a recognition system from smallpieces of recognition modules, or how to connect those modules.
Inanswering these questions, it should be recognized that there aredesign principles that can guide researchers and engineers. We maycall them robustness design principles. Table 1 lists them and givessimple explanations. In the following subsections, five suchprinciples are discussed.
3.1.Hypothesis-driven principle
Variabilitymeans that no one solution can fit all situations. There-fore,problems must often be divided into a certain number of cases with adifferent solution (problem-solver) to each case. However, the caseto which an input in question belongs is unknown. Thehypothesis-driven principle can be applied in such cases, and theproblem of Japanese address block identification is one such case.There are six layout types basically, but in real life, there areactually twelve types because envelopes are sometimes usedupside-down. The approach we take is to choose salient features todistinguish between such cases and to evaluate the likelihood of eachcase based on the observed value of such salient features. As ageneral framework of the hypothesis-driven approach, we call the casea hypothesis and the observed salient features evidence, and astatistical hypothesis test method may be used to evaluatelikelihood. The a posteriori probability of the k -th hypothesisafter observing evidence for this hypothesis can be computed as inEq. (1), where Hk represents the k -th hypothesis, and e k thefeature vector for the kth hypothesis. In Eq. (1), L is a likelihoodratio of hypothesis H k to null hypothesis¯Hk and is computed as inEq. (2) assuming the statistical independence of the features.Functions, P( ki |H k) andP(eki|¯Hk), can be learned from thetraining samples.
Therefore,observing evidence {ek|k = 1,...,K } for all hypotheses makes itpossible to compute L( e k|Hk) and P(Hk|ek) accordingly, to find themost probable hypothesis [41] .
Inthe hypothesis-driven approach, after identifying candidates ofhypotheses, the corresponding problem-solvers applicable only to thatkind of input are called to process the input.
3.2.Deferred decision/multiple-hypotheses principle
Ina complex pattern recognition system, many decisions must be made toobtain the final result. As always, each decision is not 100%accurate, so the decision-making modules cannot be simply cascaded.Each module should not make a decision but should defer the decisionand forward multiple hypotheses to the next module. The idea itselfis a simple one. In the case of postal address recognition, there canbe as many functional modules as shown below:
• Lineorientation detection
• Charactersize (large/small) determination
• Characterline formation and extraction
• Addressblock identification
• Charactertype (machine-printed/handwritten) identification
• Script(Kanji/Kana) identification
• Characterorientation identification
• Charactersegmentation
• Characterclassification
• Wordrecognition
• Phraseinterpretation
• Addressnumber recognition
• Building/roomnumber recognition
• Recipientname recognition
• Finaldecision making (accept/reject/retry)
Thesefunctional modules generate multiple hypotheses each of which is thenforwarded to the next module, which again generates multiplehypotheses. This process therefore creates the kind of hierarchicaltree of hypotheses shown in Fig. 8 . The question here is how to findwhich optimum branches to follow to reach the best possible answer inthe shortest possible time. Among the well known search methods, webasically use the Hill Climbing Search with backtracking, by which wecan reach the optimum solution in the shortest time. When an optimumbranch is rejected at a later stage because it has a confidence valuesmaller than a preset threshold, other branches are processed. Theuse of the Beam Search at the later stages effectively boosts therecognition accuracy, while its use in earlier stages is too costly.Search control on the number of hypotheses to generate is importanttrade-off between time and accuracy because computational time islimited to 3.7s in our case. Of course, shorter is better because itrequires less computational power.
3.3.Information integration principle
Werecognize three kinds of information integration known in thecharacter and document recognition field to attack the uncertaintyissue: (1) process integration, (2) combination-based integration,and (3) corroboration-based integration. The first approach, processintegration, integrates two or three processes to form a singleproblem-solver. Examples are segmentation—recognition methods andsegmentation--recognition--interpretation methods.
Thisapproach started in the area of speech understanding back in the1970s. The second combination-based integration approach is the onetaken in character classification and known as classifier combinationor classifier ensemble [13--15] . Different classifiers such asstatistical and structural classifiers and neural networks arecombined (integrated) to deduce a single result, in the expectationthat the classifiers will behave complementarily. Methods known asmajority voting and Dempster Shafer approaches can be used toimplement the algorithm. Finally, corroboration-based integration isthe approach of finding additional evidence that supports the resultor looking for multiple input information sources for the sameinformation. A good example is reading bank check amounts byrecognizing both the courtesy amount (numerals) and the legal amount(numbers in words).
Inpostal address recognition, both the postal code and the addressphrase in words are read to obtain more accurate results. Recipientname recognition is another example of corroboration. This approachis taken when street numbers are not recognized. In postal addressrecognition, the most important consideration is to integrate thethree processes of character segmentation, character classification,and interpretation of the phrases (or linguistic processing). Asdescribed in previous sections, address knowledge is required toresolve the ambiguities in segmentation incorporation withgeometrical information[42] and character similarity, so simple application of the multiple-hypotheses principle was not sufficient.An approach known as the lexicon-directed or lexicon-driven approachhas been developed and can be considered a hypothesis-drivenapproach, as explained below. The approach is illustrated in Fig. 9,where an input pattern is interpreted by searching for the path inthe presegmentation network ( Fig. 10) that best matches the path inthe network that represents linguistic knowledge ( Fig. 11 ). We can
saythat this is the equivalent of searching for a path in the linguisticnetwork that best matches a path in the presegmented network [18,19].This interpretation of the knowledge-directed recognition process isin line with an explanation given by Simon[43] :
When it is solvingproblems in semantically rich domains, a large part of theproblem-solving search takes place in long-term memory and is guidedby information discovered in that memory.
Inour case, the long-term memory refers to the linguistic knowledge,and the short-term memory refers to the presegmented network.
Wehave developed several versions of such algorithms, one of which(Fig. 12) was presented by Liu et al. [19] . The recognition rate ofthe lexicon-driven handwritten address recognition algorithm was83.7% with 1.1% error in an experiment, which was done using 3589actual mail pieces and a lexicon containing 111349 address phrases.The linguistic model was represented in the TRIE structure, and thesearch was controlled by the Beam Search method. Recognition time wasabout 100ms using a Pentium III/600MHz machine.
3.4.Alternative solutions principle
Thereare many image level problems including touching characters, touchingunderlines, window shadow noise, cancellation stampscovering/touching address characters, and so on. The alternativesolutions approach is to provide more than one solution to a problem.It effectively provides solutions that are complementary to eachother. For example, the problem of touching characters may be solvedusing a holistic approach or a forced separation (dichotomizing)approach. Especially when dealing with numerals, a pair of touchingnumerals can be treated as one character out of 100 classes. Trainingsuch holistic classifiers enables the results of the holistic anddichotomizing classifiers to be merged producing more reliablerecognition results. Another example of the alternative solutionsapproach is used to solve the window noise problem. When existence ofwindow noise is suspected, two problem-solvers are needed. Oneattempts to eliminate such noise by erosion (thinning) operation,assuming the shadow is thin or faint. The other attempts to extractline segments that form a frame, assuming the shadow is rather solid.These two problem-solvers are used hoping one will succeed.
3.5.Perturbation principle
Theprinciple of perturbation is to modify the problem slightly when itis difficult to solve and to try again to solve it. If patternrecognition were such a continuous process, the perturbationprinciple would not work. In reality, however, it is often adiscontinuous process. Very small modifications may change the finalrecognition results. It is hoped that the change is from rejection tocorrect recognition or from error to correct recognition. Thisapproach was used in the 1980s to recognize handwritten numeralsusing a structural approach. Because slight topological variationscaused rejection, perturbation of parameters or of input imagesimproved the recognition rate. In recent years more systematicstudies have again shown the effectiveness of the approach. Inputimages are perturbed by various transformations such as morphological(dilation/erosion) and geo-metrical transformations (rotation,slanting, perspective, shrinking, and expanding). In Ha and Bunke'swork [44] , handwritten numerals were transformed in twelve ways andrecognized using the frame-work of classifier combination. Theirapproach recognized difficult, eccentric handwriting better thanclassical classifiers such as k –NN and neural network. By the way,blurring is one of image transformations but has not been applied inthe context of perturbation. Blurring used in character featureextraction is not the kind of 'slight transformation'.
Theperturbation approach has also been successfully applied to Japanesepostal address recognition. Our test of the approach achieved about10--15 percentage point improvements in recognition rates on theaverage. When we did not set limits on recognition time and repeatedmore perturbation operations including rotational transformation,rebinarization, and some other parametric modifications in sequence,we found that 53% of rejected images were correctly recognized with a12% error rate. Although the result was attractive, reduction ofadditional errors is a necessary step to using this approach. Onepossible way to pursue this is to apply the combination scheme as Haand Bunke did [44] . Instead of taking the first recognition resultafter a series of rejections, multiple perturbations may besimultaneously applied yielding one result by voting, for example. Inthe light of ever increasing computing power, this approach seems tobe very promising. It should be noted here that perturbation is notonly effective to character classification but also effective tolayout analysis, line extraction, character segmentation, and otherintermediate decisions.
3.6.Robustness implementation
Thedesign principles described in the previous subsections con-cern thestructure and algorithms of a recognition system, but classifiers andvarious parameters have to be carefully and simultaneously trainedand adjusted [40] . The same is true even for specificproblem-solving modules. Though minor, many problems emerge duringthe development phases. Robustness implementation, therefore, is adifficult task for researchers and engineers. The following areimportant keys to an efficient and effective development process.
• Livesamples at users' sites
• Robustnessmeasurement using many 'bags' of test samples
• Accelerationdata sets
• Sample-by-samplecause analysis
Ifpossible, it is highly desirable to gather samples from the users'sites. We call these real samples live samples. However, live samplesshould not be mixed into a single sample set while samples areusually collected in multiple sessions. It is important to choose theright occasions to capture samples because sample characteristicsvary depending on the operational modes and seasonal tendencies.With-out mixing the collections, we have kept samples in manydifferent 'bags'. Recognition rates (or recognition accuracy) may bemeasured for each of the bags (or data sets), as shown in Fig. 13.Here, a trick in the graph is that the data set numbers arerearranged so that the recognition rates are in decreasing order.Arranging the graph this way enables observation of the profiles ofrecognition rates, where a steeper slope means that the recognitionsystem is less robust. In addition, if recognition performance for adata set is very low, then we can re-examine that data set in detail,which is small in size, to identify the cause of the problem (i.e.,low recognition rate).
Acceleration data sets are collections of samples that have been rejected or erroneously recognized by a particular version of the recognizer. Every sample in these data sets may be given a unique identifier by which the samples can be subjected to sample-by-sample cause analysis and, more importantly, by which improvements can be traced throughout the development process. If names and problem codes can be assigned to problematic situations, the non-straightforward progress resulting from the remedying process can be managed more appropriately.
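A possible data structure for such an acceleration set is sketched below; the field names and problem codes are illustrative assumptions, not the scheme actually used in the systems discussed here.

```python
# A sketch of an acceleration data set: samples rejected or misread by some
# recognizer version, each with a stable identifier and an optional problem
# code, re-run against every new version so improvements can be traced.
from dataclasses import dataclass, field

@dataclass
class AccelSample:
    sample_id: str                       # unique identifier, never reused
    image: object                        # the stored sample image
    true_label: str
    problem_code: str = ""               # e.g. "touching-characters" (illustrative)
    history: dict = field(default_factory=dict)   # recognizer version -> result

def rerun(samples, version, recognize):
    """Run a new recognizer version over the set; return how many are now correct."""
    fixed = 0
    for sample in samples:
        sample.history[version] = recognize(sample.image)
        if sample.history[version] == sample.true_label:
            fixed += 1
    return fixed
```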
4. Future prospects
A 40--50 year overview of OCR history and of the current market may give rise to the view that the technology is almost mature. However, it is clear that the technology is still in the midst of development and is far inferior to human cognition. From the viewpoint that the technology is mature, the current state corresponds to the long-tail part of the market (or of applications). According to this view, the "head" part of the market consists of a small number of applications with huge volumes of documents to read: business form reading, bank check reading, and postal address reading. These have been worth the investment because demand is sufficiently heavy; the return on investment has almost always been assured. Of course, technological advances have extended the head part towards the tail, but the remaining tail is very long. The three application areas considered part of the head also have tail parts of their own: there are many business forms, checks, and mailpieces that are very difficult to read, and more advanced recognition techniques are undoubtedly needed. For example, small and medium-sized enterprises (SMEs) in Japan still use paper forms for bank transactions and paper income forms for reporting to local government. The number of transactions carried out by each such company is not very large, so there is little incentive for them to change. Banks that receive different forms from such companies therefore want more intelligent, versatile OCRs. The long-tail phenomenon applies to postal address recognition as well. The questions are whether the demand side can foresee the return on investment in proposed new products and systems, and whether scientists and engineers can convince them of that return, while the technical problems remain piecemeal and diverse. These are typical long-tail questions.
In talking about the future from a different angle, there is the question of chicken and egg, or need and seed, which is difficult to answer in general. From the industry's viewpoint, it seems more important to think of needs, or at least latent needs, though future needs seem to be subjective, at least for now. The well-recognized unfilled needs of today include: (1) office document archives for e-Government, (2) handwriting as a human interface for mobile devices, (3) text in videos for video search, and (4) books and historical documents for global search. There are also two other applications: (5) text in the scene for information capture, and (6) handwriting document management for knowledge workers.
Unknown scripts and unknown languages are a big handicap for travelers in foreign countries making quick decisions on the road, in shops, at the airport, and so on. A mobile device with a digital camera, i.e., an Information Capturing Camera [45], may be an aid in such a situation (Fig. 14). With a higher-performance microprocessor, text in the scene can be recognized. The technical challenges include color image processing, geometric perspective normalization, text segmentation, adaptive thresholding, unknown script recognition, language translation, and so on. Every mobile phone in Japan is equipped with a digital camera, and their microprocessors are becoming more powerful. Some digital cameras now have intelligent functionality that locates faces in the image about to be taken. The question is why text recognition remains so difficult. Some mobile phones in Japan can now recognize over 4000 Kanji characters [36]. An interesting challenge is a dynamic recognition capability that ensures high recognition performance by repeatedly recognizing multiple camera shots without any conscious operation by the user. Users may try various angles and positions while aiming at the recognition target; this can be considered interactive perturbation.
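One way such a dynamic recognition capability might be organized is sketched below: successive camera frames are recognized and a result is accepted only once several consecutive frames agree. The `recognize_text_in_scene` function and the agreement threshold are assumptions for illustration.

```python
# A sketch of the 'dynamic recognition' idea: keep recognizing successive
# camera frames and accept a text result only once several consecutive
# frames agree. `recognize_text_in_scene` is a hypothetical scene-text recognizer.
def dynamic_recognize(frames, recognize_text_in_scene, required_agreement=3):
    last, streak = None, 0
    for frame in frames:                       # frames: iterator of camera images
        text = recognize_text_in_scene(frame)
        if text and text == last:
            streak += 1
            if streak >= required_agreement:   # stable across several shots: accept
                return text
        else:
            last, streak = text, 1
    return None                                # never stabilized: keep trying / reject
```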
Another attractive area is digital pens and handwriting document management. The act of handwriting is being reconsidered because of its importance in education and knowledge work. Writing helps people read, write, and memorize, and we may integrate these acts into information systems by using today's digital pens, which can capture handwritten annotations and memos in a very natural way. The Anoto functionality is one such advanced technique; it digitally captures handwriting stroke data and other related data (Fig. 15). Several research groups are using such digital pens to create more intelligent information management systems [46--49]. Their goal is to seamlessly manage documents with digital ink. A group advocating 'Information Just-in-Time' (iJIT) is developing a pilot system for researchers that supports their note-taking and hybrid document management [49]. Their handwritten research notebooks can always be kept consistent with their digital counterparts in computers, so members can easily share information within the group even when they are located remotely. Another feature of the system is that users can print any digital document in such a way that the printed document is sensitive to a digital pen (Fig. 16); in other words, the content of a digital document is printed overlaid with Anoto dots. Users can therefore mark and write annotations on those printouts, and the handwriting strokes are captured and synchronized with the corresponding document already stored in the computer. The value of this kind of system is that a digital document in the computer comes to carry the same annotations as its physical counterpart, meaning that users can throw away the paper documents at any time without any loss of information. This concept enables users to work equally well in the digital world and in the real world, and it is an attempt to go beyond the myth of the paperless office [50]. When such use of digital pens becomes common practice, it will be natural to demand handwritten character recognition, handwritten query processing, and more intelligent knowledge management. Creating information systems that require recognition technology is one path we may pursue; we hope that more advanced information systems will require more advanced recognition technology.
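A highly simplified sketch of the synchronization idea follows. The decoding of the Anoto dot pattern is done by the pen itself and is not shown; we simply assume that each captured stroke arrives tagged with the identifier of the page it was written on, and the class and field names are hypothetical.

```python
# A sketch of synchronizing digital-pen strokes with the corresponding digital
# document. We assume each captured stroke arrives tagged with the page id it
# was written on; the dot-pattern decoding itself happens inside the pen.
from collections import defaultdict

class HandwritingSync:
    def __init__(self):
        self.page_to_document = {}                 # page id -> (document id, page number)
        self.annotations = defaultdict(list)       # (document id, page number) -> strokes

    def register_printout(self, page_id, document_id, page_number):
        """Record which digital page was printed with which dot-pattern page id."""
        self.page_to_document[page_id] = (document_id, page_number)

    def add_stroke(self, page_id, stroke):
        """Attach a captured stroke (list of (x, y, t) points) to the digital page."""
        key = self.page_to_document.get(page_id)
        if key is not None:
            self.annotations[key].append(stroke)   # digital copy now carries the same marks
```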
5. Conclusion
Vision and fundamental technologies are both key to the future of our technical community. Vision takes the form of forecast applications with new value propositions. For investment to be made in new technology, such propositions need to be attractive to many people, or at least to some innovative people; this is a top-down approach to innovation. Fundamental technologies may start innovation from the bottom as well. Here, the technologies we are discussing have two parts: one is the technology that supports our community from the bottom; the other is our own technology, i.e., character and document recognition. For the first part, we have seen the impact of advanced semiconductor devices, high-performance computers, and more advanced software development tools, which have supported the advances in recognition technology. They not only enabled more advanced OCR systems directly, but also attracted more academic researchers into this community, who have in turn contributed to the advances in recognition technology. We would like to see this kind of virtuous cycle continue forever.
Acknowledgments
The author is grateful to the members of his research team in Hitachi who worked on the development of the postal address recognition system: H. Sako, K. Marukawa, M. Koga, H. Ogata, H. Shinjo, K. Nakashima, H. Ikeda, T. Kagehiro, R. Mine, N. Furukawa, and T. Takahashi. He is also grateful to Dr. C.-L. Liu at the Institute of Automation of the Chinese Academy of Sciences, Beijing, and Prof. Y. Shima at Meisei University, Tokyo, for the work they did at our laboratory. The author also thanks Prof. G. Nagy of Rensselaer Polytechnic Institute for his valuable discussions and comments on this manuscript. Thanks also go to Dr. U. Miletzki of Siemens ElectroCom for providing information regarding their historical work.