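The first practical OCR product appeared in the United States in the 1950s, the same decade as the first commercial computer, UNIVAC. Since then, each decade has seen substantial advances in OCR technology. In the early 1960s, IBM produced its first optical readers, the IBM 1418 (1960) and the IBM 1428 (1962), which could read printed numerals and handprinted numerals, respectively. One of the machines of that period could read 200 printed document fonts and was used as an input device for IBM 1401 computers. Also in the 1960s, postal operations were automated using letter sorters with optical character recognition, which for the first time read postal codes automatically to determine destinations. The United States Postal Service first introduced address-reading OCRs, which in 1965 began reading the city/state/ZIP line of printed envelopes [2]. In Japan, Toshiba and NEC developed handprinted numeral OCRs for postal code recognition and put them into use. In Germany, a postal code system was introduced for the first time in the world in 1961 [4]. However, the first postal code reading letter sorter in Europe was introduced in Italy in 1973, and the first letter sorter with an automatic address reader was introduced in Germany in 1978 [5].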


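The 1980s witnessed significant advances in semiconductor technology, such as CCD image sensors, microprocessors, dynamic random access memories (DRAMs), and custom-designed LSIs. For example, OCRs became smaller and better suited to desktop use (Fig. 1). Cheaper megabyte-size memories and CCD image sensors then enabled whole-page images to be scanned into memory for further processing, which in turn enabled more advanced recognition techniques and wider applications. For example, handwritten numeral OCRs that could recognize touching characters appeared for the first time in 1983, making it possible to relax physical form constraints and writing constraints. In the late 1980s, Japanese OCR vendors introduced into their product lines OCRs that could recognize about 2400 printed and handprinted Kanji characters. These were used to read names and addresses for data entry. More detailed technology reviews are available in the literature [9,10].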




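The International Association for Pattern Recognition began holding conferences such as ICDAR, IWFHR, and DAS in the early 1990s. Many intensively studied methods have been reported at these conferences. Examples are artificial neural networks, hidden Markov models (HMMs), polynomial function classifiers, modified quadratic discriminant function (MQDF) classifiers [12], support vector machines (SVMs), classifier combination [13-15], information integration, and lexicon-directed character string recognition [16-19], some of which are based on original ideas from the 1960s [20,21]. Most of these techniques play key roles in today's systems. In contrast with previous decades, in which industry mostly used proprietary in-house technology, the 1990s witnessed important interactions between academia and industry. Academics studied real technical problems and developed sophisticated theory-based methods, enabling industry to benefit from their research. Readers may find the state of the art of character recognition systems, including image preprocessing, feature extraction, pattern classification, and word recognition, well described in the literature [22].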













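We can utilize a priori linguistic knowledge to choose the right options from the candidate lattice and so recognize words and phrases accurately. Here, the lattice is a table in which each column carries candidate classes and each row corresponds to a character on the sheet. If a string consists of N Kanji characters and there are K candidates for each, there are K^N possible interpretations (or word recognition results). Linguistic processing consists of choosing one of these many possible interpretations. To do this, we developed a key technique based on a finite state automaton [38]. The basic idea is to feed L lexical terms to the automaton and see which terms it accepts, where the automaton model is generated dynamically from the lattice (Fig. 5). L is usually as large as several tens of thousands, but only terms whose first character appears in the first column of the lattice are accepted. To improve accuracy, we may also consider terms whose second character appears in the second column of the lattice. Such terms are fed into the automaton one by one, and the state transitions determine a path (a series of edges). The corresponding penalties are then summed and associated with the input term. Passing the first edge gives a penalty of zero, and passing the last gives a penalty of 15 when K = 16. In this way, the term with the smallest penalty is determined to be the recognized word. The number of candidates for each character is adaptively controlled to be equal to or less than K, to exclude extremely unlikely candidates. This algorithm has been used successfully for address phrase recognition, provided that the characters are reliably segmented. Experiments by Marukawa et al. showed that, for a lexicon of 10,828 terms, character recognition accuracy rose from 90.2% to 99.7%, giving an address phrase recognition accuracy of 99.1%. Here, we can note that error occurrences are not statistically independent. Linguistic processing that solves difficult segmentation problems (cf. Fig. 4) is discussed in Section 3.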









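Therefore, observing evidence {e_k | k = 1, ..., K} for all hypotheses makes it possible to compute L(e_k|H_k) and P(H_k|e_k), and thus to find the best-matching hypothesis.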
































Information integration principle: combination-based integration













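We recognize three kinds of information integration used in the character and document recognition field to attack the uncertainty issue: (1) process integration, (2) combination-based integration, and (3) corroboration-based integration. The first approach, process integration, integrates two or three processes to form a single problem-solver. Examples are segmentation-recognition methods and segmentation-recognition-interpretation methods. This approach started in the area of speech understanding back in the 1970s. The second approach, combination-based integration, is the one taken in character classification and known as classifier combination or classifier ensemble [13-15]. Different classifiers, such as statistical and structural classifiers and neural networks, are combined (integrated) to deduce a single result, in the expectation that the classifiers will behave complementarily. Methods known as majority voting and the Dempster-Shafer approach can be used to implement the algorithm. Finally, corroboration-based integration is the approach of finding additional evidence that supports the result, or of looking for multiple input sources for the same information. A good example is reading bank check amounts by confirming the numerals against the amount written in words. In postal address recognition, both the postal code and the address phrase in words are read to obtain more accurate results. Recipient name recognition is another example of corroboration. This approach is taken when street numbers are not recognized.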
















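It is highly desirable to collect samples from users' sites if possible. We call these real samples live samples. However, when samples are collected in multiple sessions, the live samples should not be mixed into a single sample set. It is important to choose appropriate occasions for collecting samples, because the characteristics of the samples vary with the mode of operation and with seasonal tendencies. By keeping the sample sets unmixed, we can maintain many different "bags" of samples. The recognition rate (or recognition accuracy) can be measured for each bag, as shown in Fig. 13. Here, the data set numbers in the figure have been rearranged in decreasing order of recognition rate. A steep slope means that the recognition system is unstable. In addition, if recognition performance for a data set is very low, we can examine that set, which is small in volume, in more detail to find the causes of the low recognition rate.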






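Another attractive area is digital pens and handwritten document management tools. The act of handwriting is being reconsidered for its importance in education and at work. Writing with today's digital pens can help people read, write, and memorize. Handwritten annotations and memos can be captured in a very natural way, and these acts can be brought into information systems. The Anoto functionality is one of the advanced technologies that can digitally capture handwritten stroke data and other related data (Fig. 15). Research groups are using this kind of digital pen to create more intelligent information management systems [46-49]. Their goal is to manage documents with digital ink precisely. One group of researchers, who advocate "information just-in-time" (iJIT), has developed an experimental system that supports their note taking and hybrid document management [49]. Their handwritten research notebooks are always kept consistent with the digital versions in their computers. In this way, they can easily share information within the group even when they are far apart. Another feature of the system is that users can write with the digital pen on printed documents: any kind of digital document can be printed in such a way that its content is overlaid with Anoto dots (Fig. 16). The handwritten strokes can therefore be captured, so users can annotate printouts and have those annotations synchronized with the corresponding documents already in the computer. The value of such a system is that the digital document in the computer carries the same annotations. This means the paper documents can be thrown away at any time without any loss of information. This concept lets users work equally well in the digital world and the real world. It is an attempt to go beyond the myth of the paperless office [50]. When the use of such digital pens becomes commonplace, handwritten character recognition, handwritten query processing, and more intelligent knowledge management capabilities will become natural requirements. We are working to build such information systems, and recognition technology is one of the means we pursue. We hope to see more advanced information systems that call for more advanced recognition technology.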




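The author thanks the members of the group at Hitachi who developed the postal address recognition system: H. Sako, K. Marukawa, M. Koga, H. Ogata, H. Shinjo, K. Nakashima, H. Ikeda, T. Kagehiro, R. Mine, N. Furukawa, and T. Takahashi. Thanks also go to Dr. C.-L. Liu of the Institute of Automation, Beijing, China, for his extensive work, and to Dr. Y. Shima of Meisei University, Tokyo, for his work in our laboratory. The author also thanks Dr. G for very valuable comments, and Dr. U. Miletzki of Siemens ElectroCom for providing information on the history of development.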


Presented is an industrial view on character and document recognition technology, based on some material presented at ICDAR [1]. Commercial optical character readers (OCRs) emerged in the 1950s, and since then, character and document recognition technology has advanced significantly, providing products and systems to meet industrial and commercial needs throughout the development process. At the same time, the profits from businesses based on this technology have been invested in research and development of more advanced technology. We can observe here a virtuous cycle. New technologies have enabled new applications, and the new applications have supported the development of better technology. Character and document recognition has been a very successful area of pattern recognition. The main business and industrial applications of character and document recognition in the last forty years have been in form reading, bank check reading, and postal address reading. By supporting these applications, recognition capability has expanded in multiple dimensions: mode of writing, scripts, types of documents, and so on. The recognizable modes of writing are machine-printing, handprinting, and script handwriting. Recognizable scripts started with Arabic numerals and expanded to the Latin alphabet, Japanese Katakana syllabic characters, Kanji (Japanese version of Chinese) characters, Chinese characters, and Hangul characters. Work is now being done to make Indian and Arabic scripts readable. Many different kinds of paper forms can be read by today's OCRs, including bank checks, postcards, envelopes, book pages, and business cards. Typeface standards such as the OCR-A and OCR-B fonts have contributed to making OCRs reliable enough even in the early stages. In the same context, specially designed OCR forms have simplified the segmentation problem and made handprinted characters readable even by immature recognition technology. Today's OCRs are successfully used to read any type of font and freely handwritten characters.

The field of character and document recognition has not always been peaceful. It has twice been disturbed by waves of new digital technologies that threatened to diminish the role of OCR technology. The first such wave was that of office automation in the early 1980s. Starting then, most information seemed likely to be "born digital", potentially diminishing demand for OCRs, and some researchers were pessimistic about the future. However, it turned out that the sales of OCRs in Japan, for example, peaked in the 1980s. This was, ironically, due to the promoted introduction of office computers. It is well known that the use of paper has kept increasing. We are now facing the second wave. IT and Web technologies might have a different impact. Many kinds of applications can now be completed on the Web. Information can flow around the world in an instant. However, it is still not known whether the demand for character and document recognition will decrease or whether new applications requiring more advanced technology will be created. Search engines have become ubiquitous and are expanding their reach into the areas of image documents, photographs, and videos. People are re-evaluating the importance of handwriting and trying to integrate it into the digital world. It seems that paper is still not going to disappear. Mobile devices with micro cameras now have CPUs capable of real-time recognition. The future prospects of these developments are discussed here.

2. Brief historical view


The first practical OCR appeared in the United States in the 1950s, in the same decade as the first commercial computer, UNIVAC. Since then, each decade has seen advances in OCR technology. In the early 1960s, IBM produced its first models of optical readers, the IBM 1418 (1960) and the IBM 1428 (1962), which were capable of reading printed numerals and handprinted numerals, respectively. One of the models of those days could read 200 printed document fonts and was used as an input apparatus for IBM 1401 computers. Also in the 1960s, postal operations were automated using mechanical letter sorters with OCRs, which for the first time automatically read postal codes to determine destinations. The United States Postal Service first introduced address-reading OCRs, which in 1965 began reading the city/state/ZIP line of printed envelopes [2]. In Japan, Toshiba and NEC developed handprinted numeral OCRs for postal code recognition and put them into use in 1968 [3]. In Germany, a postal code system was introduced for the first time in the world in 1961 [4]. However, the first postal code reading letter sorter in Europe was introduced in Italy in 1973, and the first letter sorter with an automatic address reader was introduced in Germany in 1978 [5].

Japan started to introduce commercial OCRs in the late 1960s. Hitachi produced its first OCR for printed alphanumerics in 1968 and the first handprinted numeral OCR for business use in 1972. NEC developed the first OCR that could also read handprinted Katakana in 1976. The Japanese Ministry of International Trade and Industry (since renamed the Ministry of Economy, Trade and Industry) conducted a 10-year, 20-billion-yen national project on pattern information processing starting in 1971. Among other research topics, Toshiba worked on printed Kanji recognition, and Fujitsu worked on handwritten character recognition. The ETL character databases, including Kanji characters, were created as part of this project, which contributed to research and development of Kanji OCRs [6]. As a by-product, the project attracted many students and researchers into the pattern recognition area. In the United States, IBM introduced a deposit processing system (IBM 3895) in 1977, which was able to recognize unconstrained handwritten check amounts. The author had a chance to observe it in operation at Mellon Bank in Pittsburgh in 1981, and it could reportedly read about 50% of handwritten checks, with the remaining half being hand coded. The state of the art in character recognition in the 1960s and 1970s is well documented in the literature [7,8].

The 1980s witnessed significant technological advances in semiconductor devices such as CCD image sensors, microprocessors, dynamic random access memories (DRAMs), and custom-designed LSIs. For example, OCRs became smaller than ever, fitting on desktops (Fig. 1). Then cheaper megabyte-size memories and CCD image sensors enabled whole-page images to be scanned into memory for further processing, in turn enabling more advanced recognition and wider applications. For example, handwritten numeral OCRs that could recognize touching characters were introduced for the first time in 1983, making it possible to relax physical form constraints and writing constraints. In the late 1980s, Japanese vendors of OCRs introduced into their product lines new OCRs that could recognize about 2400 printed and handprinted Kanji characters. These were used to read names and addresses for data entry. More detailed technology reviews are available in the literature [9,10].

The office automation boom of the 1980s, which was influential in Japan, had two features. One was Japanese language processing by computers and Japanese word processors. The emergence of Kanji OCRs was a natural consequence of this development. The other feature was optical disks used as computer storage systems, which were developed and put into use in the early 1980s. A typical application was patent automation systems in the United States and Japan that stored images of patent specification documents. The Japanese patent office system then stored approximately 50 million documents, or 200 million digitized pages, on 12-in optical disks. Each disk could store 7 GB of data, the equivalent of 200,000 digitized pages. The system used 80 Hitachi optical disk units and 80 optical library units. These systems can be considered among the first digital libraries. New computer applications of this kind directly and indirectly encouraged studies on document understanding and document layout analysis in Japan. More importantly, it was in this decade that documents became the focus of computer processing for the first time.

The changes in the 1990s were due to the upgraded performance of UNIX workstations and then personal computers. Though scanning and image preprocessing were still done by hardware, a major part of recognition was implemented in software on general-purpose computers. The implication of this was that programming languages like C and C++ could be used to code recognition algorithms, allowing more engineers to develop more complicated algorithms and expanding the research community to include academia. During this decade, commercial software OCR packages running on PCs also appeared on the market. Techniques for recognizing freely handwritten characters were extensively studied and successfully applied to bank check readers and postal address readers. Advanced layout analysis techniques enabled recognition of wider varieties of business forms. Research institutions specializing in this field, such as CENPARMI, led by Prof. Suen, and CEDAR, led by Prof. Srihari and Prof. Govindaraju, contributed to these advances. New high-tech vendors appeared, including A2iA, which was started by the late Prof. Simon in France [11], and Parascript, which was started in Russia to do business in the United States. In Japan, the Japanese Postal Ministry conducted the third-generation postal automation project between 1994 and 1996, in which Toshiba, NEC, and Hitachi joined to develop postal address recognition systems that could sort mail into carrier sequence. This project enabled significant advances in Japanese address reading.

The International Association for Pattern Recognition began holding conferences such as ICDAR, IWFHR, and DAS in the early 1990s. Many intensively studied methods have been reported at these conferences. Examples are artificial neural networks, hidden Markov models (HMMs), polynomial function classifiers, modified quadratic discriminant function (MQDF) classifiers [12], support vector machines (SVMs), classifier combination [13-15], information integration, and lexicon-directed character string recognition [16-19], some of which are based on original ideas from the 1960s [20,21]. Most of these play key roles in today's systems. In contrast with previous decades, in which industry mostly used proprietary in-house technology, the 1990s witnessed important interactions between academia and industry. Academics studied real technical problems and developed sophisticated theory-based methods, enabling industry to benefit from their research. Readers may find the state of the art of character recognition systems, including image preprocessing, feature extraction, pattern classification, and word recognition, well described in the literature [22].

In the following subsections, major pre-1990s technical achievements in the area of Kanji character classifiers, character segmentation algorithms, and linguistic processing are described.

2.2. Kanji character classifiers

In the 1970s, there were two competing approaches to character recognition: structural analysis and template matching (or the statistical approach). Contemporary commercial OCRs were using structural methods to read handprinted alphanumerics and Katakana, and template matching methods to read printed alphanumerics. Template matching methods had been experimentally proven to be applicable to printed Kanji recognition by the late 1970s [23-26], but their applicability to handwritten (or handprinted) Kanji was in question. The problem of recognizing handwritten Kanji seemed like a steep, unexplored mountain. It was clear that neither the structural nor the simple template matching approach could conquer it alone. The former had difficulty with the huge number of topological variations due to complex stroke structures, while the latter had difficulty with nonlinear shape variations. However, in light of previous work on handwritten numeral recognition using a template matching approach, the latter approach seemed to have a greater chance of success [27].

The key was the concept of blurring as feature extraction, which was applied to directional features and found to be effective in recognizing handwritten Kanji [27,28]. The introduction of continuous spatial feature extraction made the optimum amount of blurring surprisingly large. The first Hitachi OCR for reading handprinted Kanji used simple template matching based on blurred directional features, where the feature templates were four sets of 16 × 16 arrays of gray values. The directional feature, which was patented in Japan in 1979, was computed using a two-dimensional gradient to determine stroke direction (Fig. 2) and was even applicable to grayscale images [29]. Although it was only indirectly relevant, Hubel and Wiesel's work encouraged our view that the directional feature was promising [30]. Nonlinear shape normalization [31-33] and statistical classifier methods [12,34] boosted recognition accuracy. We learned that blurring should be considered as a means of obtaining latent dimensions (subspace) rather than as a means of reducing computational cost, though the effects might seem similar. For example, the mesh size of 8 × 8 used in statistical approaches was determined by the optimum blurring parameter in light of the Shannon sampling theorem, and bigger mesh sizes with the same blurring parameter did not give better recognition performance.
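The blurred directional feature described above can be illustrated with a minimal sketch. This is not the patented Hitachi method: the central-difference gradient, the quantization of gradient orientation into four bins, the 3 × 3 box blur, and the names directional_features and blur are all assumptions made for illustration.

```python
import math

def directional_features(img, mesh=8):
    """img: square list of lists of gray values (larger = darker ink).
    Returns 4 planes (one per gradient-orientation bin), each mesh x mesh."""
    n = len(img)
    cell = n // mesh
    planes = [[[0.0] * mesh for _ in range(mesh)] for _ in range(4)]
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            # Two-dimensional gradient by central differences (an assumption).
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            # Fold the orientation into [0, pi) and quantize into 4 bins.
            theta = math.atan2(gy, gx) % math.pi
            d = int((theta + math.pi / 8) // (math.pi / 4)) % 4
            row = min(y // cell, mesh - 1)
            col = min(x // cell, mesh - 1)
            planes[d][row][col] += mag
    return planes

def blur(plane):
    """3x3 box blur with clamped borders, a stand-in for the real blurring."""
    m = len(plane)
    out = [[0.0] * m for _ in range(m)]
    for y in range(m):
        for x in range(m):
            vals = [plane[j][i]
                    for j in range(max(0, y - 1), min(m, y + 2))
                    for i in range(max(0, x - 1), min(m, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out

# A 16 x 16 image containing one vertical stroke two pixels wide.
image = [[1 if x in (7, 8) else 0 for x in range(16)] for _ in range(16)]
planes = directional_features(image)
blurred = [blur(p) for p in planes]
```

With an 8 × 8 mesh and four planes, the flattened feature vector has 8 × 8 × 4 = 256 elements, matching the dimensionality quoted in the text. Blurring spreads each stroke's response into neighboring cells, which is what makes the feature tolerant of small shifts in stroke position.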

The thorough studies of the research group led by Prof. Kimura contributed to advancing statistical quadratic classifiers [12], which were successfully applied to handwritten Kanji recognition. Actually, the basic theory had been known, but computers of the 1970s did not have sufficient computational power to be applied to studies of such statistical approaches. Today, the four-directional feature vector for Kanji patterns consists of 8 × 8 × 4 elements, and the subspace obtained by statistical covariance analysis has from 100 to 140 dimensions. The size of the 8 × 8 array is surprisingly (counter-intuitively) small in light of the many complex Kanji characters. Recognition accuracy for individual freely handwritten Kanji is not yet high enough, however. Therefore, linguistic context such as names and addresses is used to enhance total recognition accuracy. To reduce computational cost, cluster-based two-stage classification is used to reduce the number of templates that must be matched. One of the recent advances in Kanji (and Chinese character) recognition is the reduced size of recognition engines designed especially for mobile phone applications. A compact recognition engine reported in Refs. [35,36] requires only 613 kB of memory to store the parameters needed to recognize 4344 classes of printed Chinese characters.

2.3. Character segmentation algorithms

In the 1960s and 1970s, a flying-spot scanner or a laser scanner with a rotated mirror was used together with a photomultiplier to convert optical signals into electrical signals. Character segmentation was usually carried out with the help of these kinds of scanning mechanisms. For example, forms for handprint reading used marks on an edge that signaled the presence of a character line to be scanned. In addition, the locations of writing boxes on the forms were registered beforehand, and the colors of the boxes were transparent to the scanner sensor. Therefore, OCRs could easily extract images that contained exactly one single handprinted character.

Then, in the 1980s, semiconductor sensors and memories appeared, enabling OCRs to scan and store images of whole pages. This was an epoch-making change that was significant to users because it relaxed strict conditions on OCR form specifications, for example, by enabling them to use smaller non-separated writing boxes. However, it required a solution to the problem of touching numerals and a change in how images are represented in memory [37]. Before this change, scanned images had been arrays of binary pixels, and segmentation was pixel-based, but from this time on, the binary image in memory was represented by run-length codes. The run-length representation was suited to conducting connected component analysis and contour following. The connected components were processed as black objects rather than as pixels. In 1983, Hitachi produced one of the first OCRs that could segment and recognize touching handwritten numerals based on a multiple-hypothesis segmentation-recognition method (Fig. 3). Contour shape analysis was able to identify candidates of touching points, and multiple pairs of forcedly separated patterns were fed into the classifier. By consulting the confidence values from the classifier, the recognizer was able to choose the right hypothesis. This direction of change has led us to forms processing whose ultimate goal is to read unknown forms, or at least those forms that are not specifically designed for OCRs. However, this means that users might become less careful in their writing, so OCRs have to be more accurate for freely handwritten characters as well.
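To make the run-length representation concrete, here is a minimal sketch of how a binary image might be encoded row by row as runs of black pixels, and how overlapping runs in adjacent rows can be linked into connected components. The half-open (start, end) run format, the union-find bookkeeping, and the function names are assumptions for illustration, not the representation of Ref. [37].

```python
def encode_runs(row):
    """Return [(start, end)] for each maximal run of black (1) pixels; end is exclusive."""
    runs, start = [], None
    for x, value in enumerate(row):
        if value and start is None:
            start = x
        elif not value and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(row)))
    return runs

def count_components(image):
    """Count 4-connected black components by linking overlapping runs."""
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    previous = []
    for y, row in enumerate(image):
        current = [(y, s, e) for s, e in encode_runs(row)]
        for run in current:
            parent[run] = run
            for above in previous:
                # Two runs touch when their column ranges overlap.
                if run[1] < above[2] and above[1] < run[2]:
                    parent[find(above)] = find(run)
        previous = current
    return len({find(run) for run in list(parent)})

image = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
]
```

Because the loop visits runs rather than pixels, the work grows with the number of runs, which is one reason the representation suits connected component analysis of mostly white document images.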

The segmentation problem was far tougher in postal address recognition. Fig. 4 shows horizontally handwritten addresses. The width of a character varies by as much as a factor of two, and some of the radicals and components are also valid characters. As shown in Fig. 4, it is difficult to group the right components to form the right character patterns, where some characters are quite wide and others narrow. To resolve the grouping problem, linguistic information (or address knowledge) is required in addition to geometric and similarity information. This issue will be discussed in more detail in Section 3.

2.4. Integration of linguistic information

Major business uses of handprinted Kanji OCRs have been the reading of names and addresses in application forms. In such applications, to avoid the segmentation problem, forms have separate preprinted fixed boxes, but how to achieve highly accurate word/phrase recognition is still a question. We can utilize a priori linguistic knowledge to choose the right options from the candidate lattice to accurately recognize words and phrases. Here, the lattice is a table in which each column carries candidate classes, and each row corresponds to characters on the sheet. If a string consists of N Kanji characters and there are K candidates for each, there are K^N possible interpretations (or word recognition results). The linguistic processing consists of choosing one of the many possible interpretations. To do this, we developed a method based on a finite state automaton as a key technique [38]. The basic idea is to throw L lexical terms at the automaton and see which terms the automaton accepts, where the model of the automaton is dynamically generated from the lattice (Fig. 5). L is usually a number as big as several tens of thousands, but only the terms whose first character appears in the first column of the lattice are to be accepted. To improve accuracy, we may consider the terms whose second character appears in the second column of the lattice as well. Such terms are fed into the automaton one by one, and the state transitions determine a path (a series of edges). Then the corresponding penalties are summed up and associated with the input term. Passing the first edge gives a penalty of zero, and passing the last gives a penalty of 15 when K = 16. In this way, the term with the smallest penalty is determined to be the recognized word. The number of candidates for each character is adaptively controlled to be equal to or less than K, to exclude extremely unlikely word candidates. This algorithm has been used successfully for address phrases, provided that the characters are reliably segmented. Marukawa et al.'s experiments showed that character recognition accuracy was raised to 99.7% from 90.2% for a lexicon with 10,828 terms, resulting in address phrase recognition accuracy of 99.1%. Here, we can note that error occurrences are not statistically independent. Linguistic processing that solves difficult segmentation problems (cf. Fig. 4) is discussed in Section 3.
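The penalty-based selection described above can be sketched as follows. This is a simplified illustration, not the automaton of Ref. [38]: it assumes the lattice is given as one ranked candidate list per character position (best first), treats a candidate's rank as its penalty (0 for the first candidate, K - 1 for the last), and replaces the explicit automaton with one lookup table per position. The name recognize_word and the Latin-letter example are hypothetical.

```python
def recognize_word(lattice, lexicon):
    """Return (term, penalty) for the accepted lexical term with the smallest
    total penalty, or (None, None) if no term is accepted."""
    # One transition table per character position: candidate -> rank penalty.
    transitions = [{ch: rank for rank, ch in enumerate(candidates)}
                   for candidates in lattice]
    best_term, best_penalty = None, None
    for term in lexicon:
        if len(term) != len(transitions):
            continue
        penalty = 0
        for table, ch in zip(transitions, term):
            if ch not in table:
                break  # no edge for this character: the term is rejected
            penalty += table[ch]
        else:
            if best_penalty is None or penalty < best_penalty:
                best_term, best_penalty = term, penalty
    return best_term, best_penalty

# Top-ranked candidates spell "FOKV0", which is not a word in the lexicon.
lattice = [["F", "T"], ["O", "Q"], ["K", "R"], ["V", "Y"], ["0", "O"]]
lexicon = ["TOKYO", "KYOTO", "OSAKA"]
result = recognize_word(lattice, lexicon)
```

Here the top-ranked string is wrong in three positions, yet the lexicon still recovers "TOKYO" with a total penalty of 3. This is the effect behind the reported accuracy gain: errors that the classifier alone cannot avoid are corrected as long as the true character appears somewhere among the K candidates.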

3. Robustness design to deal with uncertainty and variability

Postal address recognition was an ideal application for researchers in the sense that it presented many technical challenges, but, at the same time, the innovation was an expected one for post office automation and the investments really paid off. In the 1990s, R&D projects were conducted in the United States, Europe, and Japan to develop address readers that could recognize freely handwritten and printed full addresses. These were intended to automate carrier sequence sorting, a tedious task for postal workers. The recognition task was to identify an exact delivery point by recognizing the full destination address, including street and apartment numbers. The problem in Japan is to identify one of 40,000,000 address points. In this section, the main issues of robustness design intended to deal with uncertainty and variability are discussed based on the experience of the author's team [39,40].

Japanese address recognition is a difficult task, as shown in Fig. 6. The read rates for printed and handwritten mail are higher than 90% and 70%, respectively. Images of the rejected mail pieces are sent to video-coding stations, where human operators enter address information. The results of automatic recognition and human coding are transformed into address codes, which are then sprayed on the corresponding mail pieces as they run through the sorting machine. After the address codes are mapped to numbers that show a carrier sequence, the mail pieces can be sorted in sequence by using the two-pass radix sort method.
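The two-pass radix sort mentioned above can be sketched as below. The sketch assumes a sorter with num_bins output bins and carrier-sequence numbers in the range 0 to num_bins² - 1; the first pass distributes pieces by the low-order digit and the second by the high-order digit, collecting the bins in order after each pass. The function name and the (sequence number, piece id) tuples are hypothetical.

```python
def two_pass_radix_sort(pieces, num_bins):
    """pieces: list of (sequence_number, piece_id). Returns the pieces in
    carrier-sequence order after two passes through a num_bins-bin sorter."""
    if any(not 0 <= seq < num_bins ** 2 for seq, _ in pieces):
        raise ValueError("sequence number does not fit in two passes")

    def one_pass(items, digit):
        bins = [[] for _ in range(num_bins)]
        for item in items:
            bins[digit(item[0])].append(item)  # order within a bin is preserved
        return [item for b in bins for item in b]

    after_first = one_pass(pieces, lambda seq: seq % num_bins)   # low digit
    return one_pass(after_first, lambda seq: seq // num_bins)    # high digit

pieces = [(13, "a"), (2, "b"), (7, "c"), (13, "d"), (0, "e"), (15, "f")]
ordered = two_pass_radix_sort(pieces, num_bins=4)
```

The method relies on each pass being stable: pieces that share a high-order digit keep the low-order ordering established in the first pass. Two passes through a machine with B bins can therefore order up to B² delivery points.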

The recognition system consists of a high-speed scanner, image preprocessing hardware, and the computer software that carries out layout analysis for address block location, character line segmentation, character string recognition (i.e., address phrase interpretation), character classification, and post-processing (Fig. 7). As can be seen in the block diagram, there are many modules that make imperfect decisions; i.e., uncertainty is always involved. Algorithms to solve specific problems are susceptible to variations in the images, so the most basic questions are how to deal with uncertainty and variability and how to implant robustness into the system. A more appropriate question may be how to compose such a recognition system from small pieces of recognition modules, or how to connect those modules.

In answering these questions, it should be recognized that there are design principles that can guide researchers and engineers. We may call them robustness design principles. Table 1 lists them and gives simple explanations. In the following subsections, five such principles are discussed.

3.1. Hypothesis-driven principle

Variability means that no one solution can fit all situations. Therefore, problems must often be divided into a certain number of cases, with a different solution (problem-solver) for each case. However, the case to which an input in question belongs is unknown. The hypothesis-driven principle can be applied in such cases, and the problem of Japanese address block identification is one such case. There are basically six layout types, but in real life there are actually twelve, because envelopes are sometimes used upside-down. The approach we take is to choose salient features to distinguish between such cases and to evaluate the likelihood of each case based on the observed values of those salient features. As a general framework of the hypothesis-driven approach, we call the case a hypothesis and the observed salient features evidence, and a statistical hypothesis test method may be used to evaluate likelihood. The a posteriori probability of the k-th hypothesis after observing evidence for this hypothesis can be computed as in Eq. (1), where H_k represents the k-th hypothesis and e_k the feature vector for the k-th hypothesis. In Eq. (1), L is the likelihood ratio of hypothesis H_k to the null hypothesis H̄_k and is computed as in Eq. (2), assuming the statistical independence of the features. The functions P(e_ki|H_k) and P(e_ki|H̄_k) can be learned from the training samples.

Therefore, observing evidence {e_k | k = 1, ..., K} for all hypotheses makes it possible to compute L(e_k|H_k) and P(H_k|e_k) accordingly, to find the most probable hypothesis [41].
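Eqs. (1) and (2) are not reproduced in this text, so the following sketch is an assumption about their form rather than a transcription. It takes the likelihood ratio to be the product of per-feature ratios P(e_ki|H_k) / P(e_ki|H̄_k), which is what feature independence implies, and obtains the posterior from Bayes' rule as P(H_k|e_k) = P(H_k) L / (P(H_k) L + 1 - P(H_k)). The discrete feature tables, the prior value, and the function names are hypothetical.

```python
def likelihood_ratio(evidence, p_given_h, p_given_not_h):
    """Product of per-feature ratios, assuming statistically independent features.
    evidence: observed feature values; p_given_h[i] and p_given_not_h[i] map each
    possible value of feature i to its learned probability."""
    ratio = 1.0
    for i, value in enumerate(evidence):
        ratio *= p_given_h[i][value] / p_given_not_h[i][value]
    return ratio

def posterior(prior, ratio):
    """Bayes' rule written in terms of the likelihood ratio against the null hypothesis."""
    return prior * ratio / (prior * ratio + (1.0 - prior))

# Two hypothetical binary salient features for one layout hypothesis.
p_given_h = [{"yes": 0.8, "no": 0.2}, {"yes": 0.6, "no": 0.4}]
p_given_not_h = [{"yes": 0.2, "no": 0.8}, {"yes": 0.3, "no": 0.7}]

ratio = likelihood_ratio(("yes", "yes"), p_given_h, p_given_not_h)
probability = posterior(prior=0.25, ratio=ratio)
```

In this example both features favor the hypothesis, so the ratio is about 8 and a prior of 0.25 rises to a posterior of about 0.73. Repeating the computation for every hypothesis and taking the largest posterior gives the most probable layout.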

In the hypothesis-driven approach, after identifying candidates of hypotheses, the corresponding problem-solvers applicable only to that kind of input are called to process the input.

3.2. Deferred decision/multiple-hypotheses principle

In a complex pattern recognition system, many decisions must be made to obtain the final result. No single decision is 100% accurate, so the decision-making modules cannot simply be cascaded. Each module should not make a decision but should defer the decision and forward multiple hypotheses to the next module. The idea itself is a simple one. In the case of postal address recognition, there can be as many functional modules as shown below:

Line orientation detection

Character size (large/small) determination

Character line formation and extraction

Address block identification

Character type (machine-printed/handwritten) identification

Script (Kanji/Kana) identification

Character orientation identification





Address number recognition

Building/room number recognition

Recipient name recognition

Final decision making (accept/reject/retry)

These functional modules generate multiple hypotheses, each of which is then forwarded to the next module, which again generates multiple hypotheses. This process therefore creates the kind of hierarchical tree of hypotheses shown in Fig. 8. The question here is how to find which branches to follow to reach the best possible answer in the shortest possible time. Among the well-known search methods, we basically use the Hill Climbing Search with backtracking, by which we can reach the optimum solution in the shortest time. When an optimum branch is rejected at a later stage because it has a confidence value smaller than a preset threshold, other branches are processed. The use of the Beam Search at the later stages effectively boosts the recognition accuracy, while its use in earlier stages is too costly. Search control on the number of hypotheses to generate is an important trade-off between time and accuracy, because computational time is limited to 3.7 s in our case. Of course, shorter is better because it requires less computational power.
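The search over the hypothesis tree can be sketched as a depth-first search that always tries the most confident child first and backtracks when a branch falls below the threshold. This is only an illustration of hill climbing with backtracking under assumed interfaces: the expand and is_final callbacks, the (confidence, state) node format, and the toy tree are hypothetical, and the time limit and the later-stage Beam Search are omitted.

```python
def search(node, expand, is_final, threshold):
    """node: (confidence, state). Returns the first final node reached by
    following the most confident branches, or None if every branch is rejected."""
    confidence, state = node
    if confidence < threshold:
        return None  # reject this branch; the caller backtracks to a sibling
    if is_final(state):
        return node
    children = sorted(expand(state), key=lambda child: child[0], reverse=True)
    for child in children:
        result = search(child, expand, is_final, threshold)
        if result is not None:
            return result
    return None

# Toy hypothesis tree: each module expands a state into scored hypotheses.
tree = {
    "root": [(0.9, "A"), (0.6, "B")],
    "A": [(0.3, "A1")],
    "B": [(0.7, "B1"), (0.8, "B2")],
}
leaves = {"A1", "B1", "B2"}

answer = search((1.0, "root"),
                expand=lambda state: tree.get(state, []),
                is_final=lambda state: state in leaves,
                threshold=0.5)
```

In the toy tree the initially preferred branch A leads only to a low-confidence result, so the search backtracks and accepts B2. Deferring the decision at the first level is what makes this recovery possible; a cascade that committed to A would have failed.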

3.3. Information integration principle

We recognize three kinds of information integration known in the character and document recognition field to attack the uncertainty issue: (1) process integration, (2) combination-based integration, and (3) corroboration-based integration. The first approach, process integration, integrates two or three processes to form a single problem-solver. Examples are segmentation-recognition methods and segmentation-recognition-interpretation methods. This approach started in the area of speech understanding back in the 1970s. The second approach, combination-based integration, is the one taken in character classification and known as classifier combination or classifier ensemble [13-15]. Different classifiers, such as statistical and structural classifiers and neural networks, are combined (integrated) to deduce a single result, in the expectation that the classifiers will behave complementarily. Methods known as majority voting and the Dempster-Shafer approach can be used to implement the algorithm. Finally, corroboration-based integration is the approach of finding additional evidence that supports the result or looking for multiple input information sources for the same information. A good example is reading bank check amounts by recognizing both the courtesy amount (numerals) and the legal amount (numbers in words).
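Of the combination methods named above, majority voting is the simplest to illustrate. The sketch below assumes each classifier outputs a single label and that the combined result is rejected (None) unless one label wins a strict majority; the rejection rule and the function name are assumptions, and the Dempster-Shafer approach is not shown.

```python
from collections import Counter

def majority_vote(labels, min_votes=None):
    """Combine one label per classifier. Returns the winning label, or None
    (reject) when no label reaches the required number of votes."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    needed = len(labels) // 2 + 1 if min_votes is None else min_votes
    return label if votes >= needed else None

# Hypothetical outputs of a statistical, a structural, and a neural classifier.
agreed = majority_vote(["8", "8", "3"])
rejected = majority_vote(["8", "3", "5"])
```

Voting helps only when the classifiers err on different inputs, which is why the text stresses complementary behavior. Rejecting when there is no majority trades read rate for a lower error rate.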

In postal address recognition, both the postal code and the address phrase in words are read to obtain more accurate results. Recipient name recognition is another example of corroboration. This approach is taken when street numbers are not recognized. In postal address recognition, the most important consideration is to integrate the three processes of character segmentation, character classification, and interpretation of the phrases (or linguistic processing). As described in previous sections, address knowledge is required to resolve the ambiguities in segmentation in cooperation with geometrical information [42] and character similarity, so simple application of the multiple-hypotheses principle was not sufficient. An approach known as the lexicon-directed or lexicon-driven approach has been developed and can be considered a hypothesis-driven approach, as explained below. The approach is illustrated in Fig. 9, where an input pattern is interpreted by searching for the path in the presegmentation network (Fig. 10) that best matches the path in the network that represents linguistic knowledge (Fig. 11). We can say that this is the equivalent of searching for a path in the linguistic network that best matches a path in the presegmented network [18,19]. This interpretation of the knowledge-directed recognition process is in line with an explanation given by Simon [43]:

When it is solving problems in semantically rich domains, a large part of the problem-solving search takes place in long-term memory and is guided by information discovered in that memory.

In our case, the long-term memory refers to the linguistic knowledge, and the short-term memory refers to the presegmented network.

We have developed several versions of such algorithms, one of which (Fig. 12) was presented by Liu et al. [19]. The recognition rate of the lexicon-driven handwritten address recognition algorithm was 83.7% with 1.1% error in an experiment using 3589 actual mail pieces and a lexicon containing 111349 address phrases. The linguistic model was represented in a trie structure, and the search was controlled by the beam search method. Recognition time was about 100 ms on a Pentium III/600 MHz machine.
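The idea of a lexicon held in a trie and searched with a beam can be illustrated with a much simplified sketch. It is not the algorithm of Liu et al. [19]: the data layout (`candidates` keyed by presegment ranges), the additive cost, the function names, and the parameters `beam_width` and `max_merge` are all assumptions chosen for brevity. The beam is pruned by raw accumulated cost, which a real system would likely normalize.

```python
import heapq


def build_trie(phrases):
    """Build a nested-dict trie; the key None marks the end of a phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def lexicon_driven_search(candidates, n_segments, trie, beam_width=5, max_merge=2):
    """Find the lexicon phrase that best matches a presegmentation network.

    candidates[(i, j)] maps a character to the cost (lower is better) of
    reading presegments i..j-1 as that single character.
    Returns (phrase, cost), or (None, inf) if no lexicon phrase fits.
    """
    beam = [(0.0, 0, "", trie)]  # (cost, position, text so far, trie node)
    best = (None, float("inf"))
    while beam:
        extended = []
        for cost, pos, text, node in beam:
            last_end = min(pos + max_merge, n_segments)
            for end in range(pos + 1, last_end + 1):
                for ch, ch_cost in candidates.get((pos, end), {}).items():
                    child = node.get(ch)
                    if child is None:
                        continue  # the lexicon does not allow ch here
                    new_cost = cost + ch_cost
                    # A complete reading uses all segments and ends a phrase.
                    if end == n_segments and None in child and new_cost < best[1]:
                        best = (text + ch, new_cost)
                    extended.append((new_cost, end, text + ch, child))
        # Beam pruning: keep only the cheapest partial hypotheses.
        beam = heapq.nsmallest(beam_width, extended, key=lambda h: h[0])
    return best
```

Because only characters that continue some lexicon phrase are expanded, a character candidate with a good classification score but no linguistic support is never followed, which is the hypothesis-driven behavior described above.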

3.4. Alternative solutions principle

There are many image-level problems, including touching characters, touching underlines, window shadow noise, cancellation stamps covering or touching address characters, and so on. The alternative solutions approach is to provide more than one solution to a problem. It effectively provides solutions that are complementary to each other. For example, the problem of touching characters may be solved using a holistic approach or a forced separation (dichotomizing) approach. Especially when dealing with numerals, a pair of touching numerals can be treated as one character out of 100 classes. Training such holistic classifiers enables the results of the holistic and dichotomizing classifiers to be merged, producing more reliable recognition results. Another example of the alternative solutions approach is used to solve the window noise problem. When the existence of window noise is suspected, two problem-solvers are needed. One attempts to eliminate such noise by an erosion (thinning) operation, assuming the shadow is thin or faint. The other attempts to extract line segments that form a frame, assuming the shadow is rather solid. These two problem-solvers are used in the hope that one will succeed.
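One conceivable way to merge the holistic and dichotomizing results for a pair of touching numerals is sketched below. The text does not specify the merging rule, so the weighted sum, the `weight` parameter, and the function name `merge_pair_scores` are purely hypothetical, and the scores are assumed to be probability-like values.

```python
def merge_pair_scores(holistic, left, right, weight=0.5):
    """Merge two alternative readings of a touching numeral pair.

    holistic: maps a two-digit string ("00".."99") to a score from the
              100-class holistic classifier.
    left, right: map a digit to a score from the single-digit classifier
              applied to each half after forced separation.
    Returns (best_pair, merged_score).
    """
    digits = "0123456789"
    merged = {}
    for a in digits:
        for b in digits:
            pair = a + b
            split_score = left.get(a, 0.0) * right.get(b, 0.0)
            merged[pair] = (weight * holistic.get(pair, 0.0)
                            + (1.0 - weight) * split_score)
    best_pair = max(merged, key=merged.get)
    return best_pair, merged[best_pair]
```

In this sketch, a pair that the forced separation leaves ambiguous (for instance, the left half equally likely to be "3" or "8") can still be resolved when the holistic classifier prefers one of the two-digit classes.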

3.5. Perturbation principle

The principle of perturbation is to modify the problem slightly when it is difficult to solve and to try again to solve it. If pattern recognition were a continuous process, the perturbation principle would not work. In reality, however, it is often a discontinuous process. Very small modifications may change the final recognition results. It is hoped that the change is from rejection to correct recognition or from error to correct recognition. This approach was used in the 1980s to recognize handwritten numerals using a structural approach. Because slight topological variations caused rejection, perturbation of parameters or of input images improved the recognition rate. In recent years, more systematic studies have again shown the effectiveness of the approach. Input images are perturbed by various transformations such as morphological transformations (dilation/erosion) and geometrical transformations (rotation, slanting, perspective, shrinking, and expanding). In Ha and Bunke's work [44], handwritten numerals were transformed in twelve ways and recognized using the framework of classifier combination. Their approach recognized difficult, eccentric handwriting better than classical classifiers such as k-NN and neural networks. Incidentally, blurring is also an image transformation, but it has not been applied in the context of perturbation. The blurring used in character feature extraction is not this kind of 'slight transformation'.

The perturbation approach has also been successfully applied to Japanese postal address recognition. Our test of the approach achieved improvements of about 10--15 percentage points in recognition rates on average. When we did not set limits on recognition time and repeated more perturbation operations in sequence, including rotational transformation, rebinarization, and some other parametric modifications, we found that 53% of rejected images were correctly recognized with a 12% error rate. Although the result was attractive, reducing the additional errors is a necessary step before using this approach. One possible way to pursue this is to apply the combination scheme as Ha and Bunke did [44]. Instead of taking the first recognition result after a series of rejections, multiple perturbations may be applied simultaneously, yielding one result by voting, for example. In light of ever-increasing computing power, this approach seems to be very promising. It should be noted here that perturbation is effective not only for character classification but also for layout analysis, line extraction, character segmentation, and other intermediate decisions.
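The two ways of using perturbation discussed above, retrying in sequence until a result is accepted and applying several perturbations at once and voting, can be contrasted in a short sketch. The recognizer is assumed to be any function that returns a label or None for rejection; the function names and the `min_votes` threshold are hypothetical and serve only as an illustration.

```python
from collections import Counter


def retry_with_perturbations(image, recognize, perturbations):
    """Sequential variant: return the first result that is not a rejection."""
    result = recognize(image)
    if result is not None:
        return result
    for perturb in perturbations:
        result = recognize(perturb(image))
        if result is not None:
            return result
    return None


def vote_over_perturbations(image, recognize, perturbations, min_votes=2):
    """Voting variant: recognize every perturbed version and take a vote.

    Returns None (reject) if no label gets `min_votes` votes or if the
    two most frequent labels tie.
    """
    variants = [image] + [perturb(image) for perturb in perturbations]
    votes = Counter(r for r in map(recognize, variants) if r is not None)
    if not votes:
        return None
    ranked = votes.most_common(2)
    label, count = ranked[0]
    if count < min_votes:
        return None
    if len(ranked) > 1 and ranked[1][1] == count:
        return None
    return label
```

The sequential variant accepts whatever the first successful perturbation yields, so an erroneous reading passes through unchecked. The voting variant trades extra computation for the chance to suppress such errors.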

3.6. Robustness implementation

The design principles described in the previous subsections concern the structure and algorithms of a recognition system, but classifiers and various parameters have to be carefully and simultaneously trained and adjusted [40]. The same is true even for specific problem-solving modules. Though minor, many problems emerge during the development phases. Robustness implementation, therefore, is a difficult task for researchers and engineers. The following are important keys to an efficient and effective development process.

Live samples at users' sites

Robustness measurement using many 'bags' of test samples

Acceleration data sets

Sample-by-sample cause analysis

If possible, it is highly desirable to gather samples from the users' sites. We call these real samples live samples. However, live samples, which are usually collected in multiple sessions, should not be mixed into a single sample set. It is important to choose the right occasions to capture samples because sample characteristics vary depending on the operational modes and seasonal tendencies. Without mixing the collections, we have kept samples in many different 'bags'. Recognition rates (or recognition accuracy) may be measured for each of the bags (or data sets), as shown in Fig. 13. Here, a trick in the graph is that the data set numbers are rearranged so that the recognition rates are in decreasing order. Arranging the graph this way enables observation of the profiles of recognition rates, where a steeper slope means that the recognition system is less robust. In addition, if recognition performance for a data set is very low, then we can re-examine that data set, which is small in size, in detail to identify the cause of the problem (i.e., the low recognition rate).
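The per-bag measurement described above amounts to computing one recognition rate per bag and sorting the rates in decreasing order. A minimal sketch follows; the data layout (a list of correct/incorrect flags per bag) and the spread measure used as a crude stand-in for the slope of the profile are assumptions of this sketch.

```python
def robustness_profile(bags):
    """Compute per-bag recognition rates, sorted in decreasing order.

    bags: maps a bag name to a list of booleans (True = correctly recognized).
    Returns a list of (bag_name, recognition_rate) pairs; empty bags are skipped.
    """
    rates = {
        name: sum(results) / len(results)
        for name, results in bags.items()
        if results
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


def profile_spread(profile):
    """Difference between the best and worst bag rates (larger = less robust)."""
    rates = [rate for _, rate in profile]
    return max(rates) - min(rates) if rates else 0.0
```

The bag at the end of the sorted profile is the natural candidate for detailed re-examination.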

Acceleration data sets are collections of samples that have been rejected or erroneously recognized by a version of the recognizer concerned. Every sample in the data sets may be given a unique identifier by which the samples can be subjected to sample-by-sample cause analysis and, more importantly, by which the improvements can be traced throughout the development process. If names and problem codes can be assigned to problematic situations, the non-straightforward progress resulting from the remedying processes can be managed more appropriately.
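Tracing improvements over an acceleration data set can be pictured as a small bookkeeping exercise: each sample carries an identifier and a problem code, and each recognizer version reports which identifiers it now reads correctly. The identifiers, problem codes, and function name below are hypothetical and are given only as a sketch.

```python
def trace_improvements(acceleration_set, results_by_version):
    """Summarize, per recognizer version, how many samples of each problem
    code are now recognized correctly.

    acceleration_set: maps sample_id to a problem code (e.g. "touching").
    results_by_version: maps a version name to the set of sample_ids that
        the version recognizes correctly.
    Returns {version: {problem_code: (solved, total)}}.
    """
    report = {}
    for version, solved_ids in results_by_version.items():
        per_code = {}
        for sample_id, code in acceleration_set.items():
            solved, total = per_code.get(code, (0, 0))
            if sample_id in solved_ids:
                solved += 1
            per_code[code] = (solved, total + 1)
        report[version] = per_code
    return report
```

A report of this kind makes it visible when a remedy for one problem code causes regressions in another, which is one reason progress tends not to be straightforward.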

4. Future prospects

A 40--50 year overview of OCR history and an overview of the current market may give rise to the view that the technology is almost mature. However, it is clear that the technology is still in the midst of development and is far inferior to human cognition. From the viewpoint that the technology is mature, it seems that the current state is the long-tail part of the market (or applications). According to this view, the 'head' part of the market has a small number of applications with a huge amount of documents to read. They are business form reading, bank check reading, and postal address reading. They have been investment-effective due to sufficiently heavy demand; in other words, a return on investment has almost always been promised. Of course, technological advances have elongated the head part towards the tail, but the remaining tail is very long. The three application areas considered parts of the head also have tail parts. There are a lot of business forms, checks, and mail pieces that are very difficult to read. More advanced recognition techniques are undoubtedly needed. For example, small to medium-sized enterprises (SMEs) in Japan are still using paper forms to do bank transactions and paper income forms to report to local government. The number of transactions carried out by each such company is not very large, and there is not much incentive for them to innovate. Banks that receive different forms from such companies, therefore, want to use more intelligent, versatile OCRs. The long-tail phenomenon applies to postal address recognition as well. The questions are whether the demand side can foresee the return on the investment in proposed new products and systems, and whether scientists and engineers can convince them of the return, while the technical problems are piecewise and diverse. These are typical long-tail questions.

Talking about the future from a different angle, there is the question of the chicken and the egg, or need and seed, which is difficult to answer in general. From the industry's viewpoint, it seems more important to think of needs, or at least latent needs, and the future needs seem to be subjective, at least for now. The well-recognized unfilled needs of today include: (1) office document archives for e-Government, (2) handwriting for the human interface of mobile devices, (3) text in videos for video search, and (4) books and historical documents for global search. There are also two other applications: (5) text-in-the-scene for information capture, and (6) handwriting document management for knowledge workers.

Unknown scripts and unknown languages are a big handicap for travelers in foreign countries who must make quick decisions on the road, in shops, at the airport, etc. A mobile device with a digital camera, i.e., an Information Capturing Camera [45], may be an aid in such a situation (Fig. 14). With a higher-performance microprocessor, text in the scene can be recognized. The technical challenges for this technology include color image processing, geometric perspective normalization, text segmentation, adaptive thresholding, unknown script recognition, language translation, and so on. Every mobile phone in Japan is equipped with a digital camera, and their microprocessors are becoming more powerful. Some digital cameras now have intelligent functionality for locating faces in images to be taken. The question is why text recognition is so difficult. Some mobile phones in Japan can now recognize over 4000 Kanji characters [36]. What seems an interesting challenge is a dynamic recognition capability, which ensures high recognition performance by repeatedly recognizing multiple shots of camera images without the users' conscious operation. Users may try various angles and positions while aiming at a recognition target. This can be considered interactive perturbation.

Another attractive area is digital pens and handwriting document management. The act of handwriting is being reconsidered based on its importance in education and knowledge work contexts. The act of writing helps people read, write, and memorize, and we may integrate these acts into information systems by using today's digital pens, which can capture handwritten annotations and memos in a very natural way. The Anoto functionality is one such advanced technique; it digitally captures handwriting stroke data and other related data (Fig. 15). There are research groups that are using such digital pens to create more intelligent information management systems [46--49]. Their goal is to seamlessly manage documents with digital ink. A group advocating 'Information Just-in-Time' (iJIT) is developing a pilot system for researchers that supports their note-taking and hybrid document management [49]. Their handwritten research notebooks can always be kept compatible with their digital counterparts in computers. By doing so, they can easily share information in the group even when they are located remotely. Another feature of the system is that users can print any digital document in such a way that the printed document is sensitive to a digital pen (Fig. 16). In other words, the content of a digital document is printed overlaid with Anoto dots. Therefore, the users can mark and write annotations onto those printouts, and handwriting strokes are captured and synchronized with the corresponding document already existing in the computer. The value of this kind of system is that a digital document in the computer comes to have the same annotations as its physical counterpart, meaning that users can throw away paper documents at any time without any loss of information. This concept enables users to work equally well in the digital world and in the real world. This is an attempt to go beyond the myth of the paperless office [50].

When such use of digital pens becomes common practice, it will be natural to ask for capabilities such as handwritten character recognition, handwritten query processing, and more intelligent knowledge management. An effort to create information systems that would require recognition technology is one path that we may pursue. We hope that more advanced information systems will require more advanced recognition technology.


Vision and fundamental technologies are both key to the future of our technical community. Vision takes the form of forecasted applications with new value propositions. For investment to be made in new technology, such new propositions need to be attractive to many people or at least to some innovative people. This is a top-down approach to innovation. Fundamental technologies may start innovation from the bottom as well. Here, the technologies we are discussing have two parts: one is the technology that supports our community from the bottom; the other is the technology of our own, i.e., character and document recognition. For the first part, we have seen the impact of advanced semiconductor devices, high-performance computers, and more advanced software development tools, which have supported the advances in recognition technology. They not only enabled more advanced OCR systems on the surface, but also attracted more academic researchers into this community, who have also contributed to the advances in recognition technology. We would like to see this kind of virtuous cycle continue.


The author is grateful to the members of his research team at Hitachi who worked on the development of the postal address recognition system: H. Sako, K. Marukawa, M. Koga, H. Ogata, H. Shinjo, K. Nakashima, H. Ikeda, T. Kagehiro, R. Mine, N. Furukawa, and T. Takahashi. He is also grateful to Dr. C.-L. Liu at the Institute of Automation of the Chinese Academy of Sciences, Beijing, and Prof. Y. Shima at Meisei University, Tokyo, for the work they did at our laboratory. The author also thanks Prof. G. Nagy of Rensselaer Polytechnic Institute for his valuable discussions and comments on this manuscript. Thanks also go to Dr. U. Miletzki of Siemens ElectroCom for providing information regarding their historical work.

