
Contents
Introduction
       History
       Evolution of the Research Field
Features
       Polar Representation
       Invariance to Similarities
       Pixel Descriptors
       Multi-scale/Resolution Decomposition
       Structural Descriptors
Recognition Methods
       Distance and Similarity Measures
       Embedding Methods
       Structural Classification
       Statistical Classification
Symbol Spotting
Conclusion
Cross-References
References
       Further Reading

Abstract

According to the Cambridge Dictionaries Online, a symbol is a sign, shape, or object that is used to represent something else.


Symbol recognition is a subfield of general pattern recognition that focuses on identifying, detecting, and recognizing symbols in technical drawings, maps, or miscellaneous documents such as logos and musical scores.


This chapter aims at providing the reader with an overview of the existing ways of describing and recognizing symbols and of how the field has evolved to attain a certain degree of maturity.


Keywords: Pattern recognition, shape descriptors, structural descriptors, symbol recognition, symbol spotting

Introduction
In any symbol recognition process, the following operations are usually performed (not necessarily in this order): segmentation, feature extraction, invariance to similarities, and comparison. Traditionally, these operations are seen as a two-step process consisting of feature extraction and symbol recognition. Most features are extracted from segmented symbols. However, symbols are often embedded in technical documents, connected to other symbols, and associated with text. Moreover, it is usually very difficult to perform symbol recognition by simply assuming that symbols have been cleanly extracted. Therefore, "symbol spotting" methods have been proposed to localize symbols without the need for a full segmentation step.

Some of the methodologies presented in the forthcoming sections have been developed to deal with particular problems arising from technical documents. The evolution of symbol recognition is closely linked with text recognition because characters can also be considered as symbols, and several symbol recognition methods have been inspired by character recognition. Specialized techniques for OCR are described in detail in Part C (Text Recognition) of this book, and this chapter will focus on symbol recognition only.

This chapter starts with a brief history of pattern recognition methods related to symbol recognition from the 1950s until today. This historical review will be used as a basis for the structure of the forthcoming sections.


History

The first works on pattern recognition appeared in the late 1950s and early 1960s (Fig. 16.1). These studies were modest in terms of objectives and experimental data, but even at that time their authors sensed the application potential behind the pattern recognition field. Forty years later, much progress has been made, but there is still room for research.

Fig. 16.1 Chronological history related to shape recognition purposes


From today's perspective, where everyone has a camera and can transmit images worldwide instantly, one of the main difficulties that researchers faced 40 years ago was image acquisition and manipulation. For instance, the Lincoln Laboratory was one of the few laboratories in the world that could read and process digital images, through the Memory Test Computer, in 1955. At that time, Dinneen [25] studied whether averaging and edge operators could be used for simple shape recognition purposes (images of A's, O's, triangles, and squares). In 1959, Bomba [10] proposed the first structural descriptor for character recognition, based on a set of features (straight lines at different orientations, four orientations of T- and L-junctions, and some selected V-junctions). To extract those features, the averaging operator studied by Dinneen was applied to reduce noise and to normalize line width. The features were then computed from sliding windows by aggregating pixels at different orientations. As a result of this process, characters were decomposed (segmented) into different layers, and recognition was performed with a decision tree. In 1961, Freeman [34] proposed the Chain Code, an encoding method that represents arbitrary planar curves by a sequence of integers ranging from 0 to 7 and that is still widely used in many applications today.

These previous works by Dinneen, Bomba, and Freeman assumed that patterns (squares and triangles) and characters were isolated inside the image, perfectly oriented, and of the same scale. In 1962, Hu [40] defined a first set of invariant features, namely, geometric moments, to deal with shapes at different positions, scales, and orientations. The theory of moment invariants was based on the theory of algebraic invariants developed during the second half of the nineteenth century by Cayley, Sylvester, and Boole. In this work, Hu observed that the more moments were used, the higher the discrimination capacity of his method would be.

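To make this concrete, here is a minimal numpy sketch (an illustration of the idea, not Hu's full set of seven invariants) that computes scale-normalized central moments of a binary shape image and the first two Hu invariants, which are unchanged under translation, scaling, and rotation:

```python
import numpy as np

def normalized_moment(img, p, q):
    """Scale-normalized central moment eta_pq of a binary image."""
    ys, xs = np.nonzero(img)                  # shape = nonzero pixels
    xc, yc = xs.mean(), ys.mean()             # centroid -> translation invariance
    mu = ((xs - xc) ** p * (ys - yc) ** q).sum()
    return mu / len(xs) ** ((p + q) / 2 + 1)  # mu00 = pixel count

def hu_first_two(img):
    """First two Hu invariants, also invariant to rotation."""
    e20 = normalized_moment(img, 2, 0)
    e02 = normalized_moment(img, 0, 2)
    e11 = normalized_moment(img, 1, 1)
    return e20 + e02, (e20 - e02) ** 2 + 4 * e11 ** 2
```

Computed on a shape and on a rotated, rescaled copy of it, the two values should agree up to discretization error.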

All these methods proposed sets of features for describing shapes based on the authors' intuition about which kind of shape information was relevant for recognition purposes. At that time, a set of features inspired by psychological and psychophysical works [4] was proposed in [9, 82]. Based on this psychophysical knowledge, Blum [9] proposed in 1967 the Medial Axis Transform (MAT), defined as the locus of points that are equidistant to the shape contour; it has largely been used in many other recognition systems, sometimes under the name Shape Skeleton. Blum used the MAT to represent the symmetry lines of patterns and observed several properties that make this transform useful for recognition purposes.

Zahn and Roskies [82] proposed a set of Fourier descriptors based on shape contours in 1972. By construction, the amplitudes of the Fourier coefficients are invariant to a shape's scale, position, and rotation. However, the phase of the Fourier coefficients lacks invariance properties because it depends on the starting point used to parameterize the curve. Zahn and Roskies therefore proposed a family of phase functions that are invariant regardless of the starting point.

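The idea can be sketched in a few lines (assuming the contour is available as an ordered sequence of points): encoding the points as complex numbers, the magnitudes of the Fourier coefficients are invariant to rotation and starting point, dropping the DC term gives translation invariance, and dividing by the first harmonic gives scale invariance:

```python
import numpy as np

def fourier_descriptor(contour, n_coeffs=16):
    """contour: (N, 2) array of (x, y) points along a closed curve."""
    z = contour[:, 0] + 1j * contour[:, 1]  # points as complex numbers
    F = np.fft.fft(z)
    F[0] = 0.0                              # drop DC term -> translation invariance
    mag = np.abs(F)                         # magnitudes -> rotation/start-point invariance
    return mag[1:n_coeffs + 1] / mag[1]     # first-harmonic division -> scale invariance
```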

The aforementioned contributions are typical examples of methods in which the authors supposed that symbols had previously been segmented from the document. On the contrary, some works assumed that symbols were connected to other parts of the document. For instance, in 1969, Shaw [71] proposed the Picture Description Language (PDL) to describe graphics for the interpretation of electrical circuits. In this case, the symbols representing the circuit elements were not segmented from the document, and their recognition was performed during the interpretation of the document through a grammar. Shaw represented circuits by directed graphs where each node was a segment of the circuit and the nodes were connected based on adjacency relations between segments [35]. This chapter will not go into details about the use of grammars as an interpretation method, since this is discussed in Chap. 17 (Analysis and Interpretation of Graphical Documents); it focuses only on the representation.

In these early works, from a statistical point of view, it was necessary that symbols were segmented and the number of classes was quite limited (mostly printed capital letters); from a structural point of view, the described relationships were restricted to adjacency relationships (ignoring inclusion or overlap). It should also be noted that, at that time, algorithms were not readily available in common programming languages, nor were debugging tools or image processing libraries, as they are today (e.g., the C language was developed in the late 1960s). Therefore, the development of any new method had a high cost in time and resources. Later on, with the advent of the IBM 704 and the construction of the first scanner, researchers were able to process images in a manner more similar to what is done today. The majority of the most competitive state-of-the-art methods nowadays rely on concepts and ideas that can be found in these early works. The next section will show how successive contributions, by adding more processing layers, refined and improved them.

Evolution of the Research Field
Symbol recognition has become a fertile field where several methods have been proposed in many directions. Methods introduced in the 1960s and 1970s are still relevant today, albeit with some improvements. For instance, the Angular Radial Transform (ART) descriptor [53], which is included in the MPEG-7 standard, is an evolution of descriptors based on Zernike moments, which in turn evolved from the Hu moments [40]. Likewise, based on the works of Zahn and Roskies, the univariate [63, 79] and bivariate Fourier descriptors [47, 84] have been widely used and reused to describe shapes for different applications. Current technology also makes practical methods that were impossible to apply 50 years ago due to their computational complexity. The early works of those years sowed the seeds for the growth of many of today's methods.

In what follows, to illustrate the evolution of symbol recognition methods, two axes of change are selected. The first axis will show how structural methods have evolved into representations of complex data, which has led to increasingly efficient matching algorithms. The second will show how the problem of symbol deformation was addressed, whatever method was used.

Structural methods focus on describing relations among the elementary parts, namely primitives, that make up symbols. Primitives are, in most cases, vectors and arcs, and the relations between them can be adjacency or inclusion/overlap. Trees and graphs are the data structures that best represent this type of information, and the recognition process consists in finding symbol patterns within these structures. Therefore, structural approaches focus on developing matching methods that are efficient enough to find substructures within graphs or trees. It is important, however, to use suitable data structures so that matching algorithms give the expected results. For instance, directed graphs have been used for the representation of electronic circuits [71]. In addition, some attempts to use associative graphs to describe electrical diagrams [22] and shapes [62] will also be mentioned later in this chapter. In 1982, Bunke [11] used attributed graphs for the interpretation of electrical diagrams. This method was then extended as a basis for the proposal of region adjacency graphs (RAG) for similar purposes in 2001 [48].

Despite many improvements, matching algorithms are still so computationally expensive that they cannot be applied to large data sets. For this reason, interest in kernel and embedding methods has increased recently. The goal of such methods is to transform graphs into feature vectors in order to apply machine learning and statistical pattern recognition techniques. Further details on these methods are discussed in section "Embedding Methods."

These structures are able to represent any kind of technical document by establishing simple relations among primitives. However, symbols can also be deformed or show local distortions near boundaries, and the methods existing at that time, such as the Chain Code, were not able to deal with them. In the late 1970s, these difficulties were known, and relaxation techniques were used to find approximate correspondences between shapes, with or without incomplete information [22]. The edit distance is another technique used for the same kind of problem [48, 55].

A slightly different approach from the previous ones is that of elastic, or deformable, models. These methods consider symbols as objects that can be deformed by adding enough energy to warp one shape into another. Thus, a vectorial field based on curve equations was used by Burr to transform one symbol into another [14]. In the same vein, deformable models based on active contours were created to compute similarities between handwritten symbols [77].

In the 1980s and 1990s, the use of graphics tablets with CAD software allowed the direct acquisition of images of diagrams and freehand sketches of architectural symbols, in addition to digitized documents. These new devices allowed researchers to explore new symbol recognition applications such as the recognition of mathematical expressions [52]. In addition, online acquisition added dynamic information about symbol drawing, which was used for segmentation and recognition tasks as well. However, the use of this dynamic information introduced new challenges to the symbol recognition field. For instance, the response time of the system became an important issue to be taken into account, and the order in which the strokes of a symbol were drawn affected the recognition results. In 2010, an adjacency grammar, which can be generated either on- or off-line, was proposed in [54]. Depending on the source of data, a set of primitives was taken and used as terminal elements. The derived adjacency grammar is built from region adjacency graphs [48], but unlike the grammars proposed by Shaw [71] or Bunke [11], inclusion and neighborhood relations between terminal elements, and between terminal and nonterminal elements, were defined.

These methods are just a few examples of how the problem of nonrigid deformations of symbols has been tackled. In general, structural descriptors are more flexible than statistical ones because the adopted metrics and matching algorithms allow greater variability between symbols of the same class. For structural descriptors, relationships between primitives are usually independent of position, scale, and rotation since they are defined locally. The matching algorithms, as with nonrigid transformations, are in charge of finding symbols regardless of position, scale, and rotation. Statistical descriptors achieve invariance to similarities differently from structural ones. In statistical methods, relationships between primitives are not made explicit. Images of symbols are seen as surfaces or contours defined in the plane, and features are obtained as the result of applying mathematical transformations. Thus, to be invariant to these deformations, the possible deformations of symbols are taken into account [5, 6] in the comparison stage rather than in the statistical descriptor. Generally speaking, statistical descriptors can also achieve invariance just before feature extraction by using some kind of normalization. Ideally, similarity invariance should be achieved during the process of feature extraction itself. More details on descriptor invariance can be found in the next section.

Features
This section focuses on feature extraction methods for the construction of symbol descriptors. Several surveys have been proposed in the literature to summarize advances in shape descriptors [61, 85]. However, due to the large number of methods, and since many of them are combinations of previous ones that may be of different types, it is difficult to establish a clear categorization of symbol recognition methods. Broadly speaking, descriptors are divided into four main groups. On the one hand, descriptors are categorized according to their structural or statistical properties. On the other hand, descriptors are classified depending on whether they are extracted from regions or from contours. A slightly different terminology was used by Pavlidis [61] in 1978. Pavlidis divided algorithms for shape analysis into several binary classes: external methods are defined over the local boundary, whereas internal methods are defined over the whole shape. Pavlidis also made another distinction between scalar and domain transforms. Domain methods transform one image into another, whereas scalar methods compute scalar features from input images. According to these criteria, he defined the following four classes of algorithms: external scalar transforms, internal scalar transforms, external space domain techniques, and internal space domain techniques. Scalar features are simple descriptors like area, compactness, rectangularity, and ellipticity.

Nevertheless, the distinction between contours and regions is somewhat blurred. Indeed, many transforms can be applied to both contours and regions (e.g., the Fourier transform, wavelet transform, Radon transform, and regression methods), and many of their properties do not depend on whether they are applied to one-dimensional or two-dimensional data. Moreover, the notion of contour also depends on the topology of a symbol. Full symbols such as logos are different from line-based ones like wire diagrams. Extracting contours in the latter case means that the skeleton, or the MAT (see Chap. 15 (Graphics Recognition Techniques)), is considered, whereas for full symbols, the contour means the external shape. Moreover, as symbols are often represented in black and white, for noiseless full symbols, describing a symbol by its region or by its external contour is equivalent. On the contrary, for noisy full symbols, region descriptors are more robust than contour ones. Following these considerations, Fig. 16.2 proposes a taxonomy of the most representative pixel feature methods according to their representation (polar or Cartesian), their decomposition in a multi-scale/resolution space, and their domain of applicability (contour/skeleton vs. region).

Polar Representation
Shapes are usually expressed in Cartesian coordinates, but some descriptors are based on polar coordinates. In the polar representation, the description of a shape is more concise and therefore less sensitive to noise and shape variations. However, the main drawback of this representation is the definition of the coordinate origin. The change from Cartesian to polar coordinates, and also from polar to Cartesian, is based on the distance of points from the origin. The same shape can be represented in very different ways depending on the definition of the origin, leading to instability when the shape is noisy. Examples of methods to determine the origin are the center of gravity, the center of the bounding box, or the center of the minimal enclosing circle; each of them results in a different polar description of the shape. Another drawback is the effect of shifts in a polar description. While a shift in Cartesian coordinates follows a linear map, a shift in polar coordinates follows a sinusoidal function, which makes it difficult to obtain invariance to shape translation. The change from Cartesian to polar coordinates is also a time-consuming process, and pseudo-polar transform methods using concentric squares instead of concentric circles to represent shapes have been proposed [64] to speed up the conversion. However, this transform introduces geometric distortions due to the approximation of circles by squares.

Although most of the descriptors using this kind of representation are built from pixel images, some examples of structural descriptors coded in polar coordinates are also found [42]. The Hough and Radon transforms describe straight lines in terms of slope angle and distance to the origin, which provides a polar description that has been used in both structural and statistical descriptors. The R-transform, which is an integral function of the Radon transform along the radial parameter, gives rise to a descriptor called the R-signature [38, 74]. The ridgelet descriptor [64] is defined by performing a wavelet transform on the Radon space or by combining several transforms (Radon, Fourier, and wavelets [16]).

The polar Fourier transform simply consists in computing the 2D Fourier transform in polar coordinates. The projection into the polar space provides rotation invariance. An example is the generic Fourier descriptor [85]. The Fourier–Mellin transform is defined by applying the Fourier and Mellin transforms, respectively, to the angular and radial parameters [1, 39]. The trace transform generalizes the Radon transform by applying other functionals over a set of lines [43].

ART [53] decomposes a shape into a basis defined by the product of a radial and an angular function. Both functions, angular and radial, are defined by a parameter that determines the ART coefficients. Finally, Zernike moments [17] are defined by the same angular function as the ART descriptors, but the radial function is a real-valued polynomial.

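As an illustration of a descriptor built on a polar representation, the following sketch (grid sizes and names are arbitrary choices, in the spirit of the generic Fourier descriptor [85]) resamples the image onto a polar grid centered on its center of gravity and keeps low-order 2D Fourier magnitudes; a rotation of the symbol becomes a circular shift along the angular axis, which the magnitudes absorb:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def polar_fourier_descriptor(img, n_r=32, n_t=64, keep=(4, 8)):
    ys, xs = np.nonzero(img)
    cy, cx = ys.mean(), xs.mean()              # polar origin: center of gravity
    r_max = np.hypot(ys - cy, xs - cx).max()
    r = np.linspace(0, r_max, n_r)
    t = np.linspace(0, 2 * np.pi, n_t, endpoint=False)
    R, T = np.meshgrid(r, t, indexing="ij")
    rows = cy + R * np.sin(T)                  # Cartesian coordinates of the
    cols = cx + R * np.cos(T)                  # polar sampling grid
    polar = map_coordinates(img.astype(float), [rows, cols], order=1)
    mag = np.abs(np.fft.fft2(polar))           # magnitudes -> rotation invariance
    return (mag[:keep[0], :keep[1]] / mag[0, 0]).ravel()
```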

Invariance to Similarities
Usually, in invoice documents, trademark logos are not rotated, and thus descriptors do not need to be rotation invariant. On the contrary, in architectural or electronic documents, symbols can be found in almost any orientation. Such documents can be scanned at different resolutions or, in the case of architectural maps, drawn at different scales; in such cases, descriptors have to be invariant to scale. For instance, the descriptor in Fig. 16.3 is invariant to scale and translation but not to rotation: under rotation, the descriptor is shifted horizontally and circularly. Also, when the document is nonplanar, as with thick-bound book pages or documents captured by mobile devices (see Chaps. 2 (Document Creation, Image Acquisition and Document Quality) and 4 (Imaging Techniques in Document Analysis Processes)), images are often warped, and this deformation can be approximated by an affine transform.

Fig. 16.3 Example of similarity invariance


There are different ways of achieving invariance to similarities. The first one is to extract an invariant descriptor directly from symbols; the next section will show a set of pixel descriptors that are intrinsically invariant to similarities. In other cases, the symbol is normalized by estimating its center, scale, and orientation; once estimated, these changes can be reversed to obtain a normalized symbol. Symbol normalization is sensitive to wrong segmentations, document noise, and partial occlusions or distortions. The estimation of these parameters is done before applying any feature extraction method. When symbols are completely segmented, this normalization is achieved as it is done for the polar representation. Thus, the symbol center is defined either as the center of gravity of the symbol or as the center of its bounding box, convex hull, or minimal enclosing circle. Invariance to translation and scaling is then achieved by shifting the symbol center to the coordinate origin and rescaling the symbol to a fixed size. Achieving invariance to symbol rotation by normalization is more critical because some symbols do not have a clear orientation, as is the case for characters. A typical technique to recover symbol rotation is to use the angle of the main axis defined by the second-order moments, but this technique is sensitive to noise and distortions and is not robust when the eccentricity value is near 1, meaning that the symbol is quite circular. Consequently, when possible, it is better to achieve similarity invariance by means of an invariant feature extraction method, since such methods do not require the estimation of symbol position, scale, and rotation.

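A minimal sketch of such a normalization for a segmented symbol given as a point set, using the center of gravity for translation and the mean radius for scale (rotation is deliberately left out since, as noted above, it is the fragile part):

```python
import numpy as np

def normalize_shape(points):
    """points: (N, 2) array of symbol pixel coordinates."""
    centered = points - points.mean(axis=0)          # translation invariance
    scale = np.sqrt((centered ** 2).sum(axis=1)).mean()
    return centered / scale                          # scale invariance
```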

The last way of taking the invariance into account is to incorporate it directly into the measure of similarity. For instance, to recover warping deformations, several works use the elasticity of the dynamic time warping (DTW) distance [5, 32]. Until now, affine invariance has only been slightly addressed in the symbol recognition community, but with the popularity of mobile devices, interest in invariance to affine transformations is likely to increase substantially in the future.

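The DTW distance mentioned above can be sketched with a few lines of dynamic programming, here for two 1D feature sequences (e.g., contour signatures); the min over the three predecessor cells is what provides the elasticity:

```python
import numpy as np

def dtw(a, b):
    """Elastic distance between two 1D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])        # local matching cost
            D[i, j] = cost + min(D[i - 1, j],      # stretch a
                                 D[i, j - 1],      # stretch b
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```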

Pixel Descriptors

Pixel descriptors are features directly computed from raw images (Fig. 16.4). These types of descriptors have usually been called statistical since traditionally they have been used as input to statistical classifiers. Some examples of the simplest pixel features are the Euler number, the number of connected components, the area, the perimeter, the compactness, the rectangularity, and the symbol ellipticity. These scalar features can be enough to solve very simple recognition problems, or they can also be used as attributes in graph descriptors. For more complex tasks, more complex transforms are needed, like those reviewed below.

Fig. 16.4 Example of pixel descriptors

The Fourier transform is probably the most popular extraction method in pattern recognition problems, both in one and two dimensions. There are several ways of applying the Fourier transform to a planar curve. As explained in the section "History," the Fourier transform was first applied to recognition problems in 1972 [82]. The phase of the Fourier coefficients computed from shape contours has been used to compute similarity between shapes [5]. Moreover, the Fourier transform has been applied to the R-signature, obtained from the Radon transform of the image, to get invariance to rotation [74]. The strength of Fourier descriptors is that they provide a global description of curves without requiring a large number of coefficients. However, they lose their discriminative capability when shapes are very similar, because slight shape differences are confused with noise. The bivariate Fourier transform is also used for symbol recognition after applying other transforms such as contourlets [15] or the Radon transform [16, 39]. The generic Fourier descriptor, computed as a bivariate Fourier transform in polar coordinates, has proved to be more robust to symbol distortions [85].


Geometric moments were introduced in 1962 by Hu [40] to build descriptors invariant to similarity transforms. If, instead of monomials of order p (i.e., x^p), orthogonal polynomials like the Zernike and Legendre polynomials are used over the unit disk, we obtain the Zernike and Legendre moments; the moment order is then defined as the polynomial degree. The related mathematical theory proves that any bivariate function defined over the unit disk can be expressed as an infinite series of Legendre or Zernike polynomials. An approximation of the original symbol is then obtained by truncating this infinite series. Zernike moments have proved to be more discriminant and more robust to noise than geometric and Legendre moments; however, they are more computationally expensive. Some comparative studies have been carried out in this direction in [17]. Other works concerning Legendre and Zernike moments are found in [75, 81], to mention just a few. Based on the geometric invariant theory [57], several wavelet invariant descriptors have been proposed in [28, 76], but most of the proposed invariant functions require a contour-based description of symbols, and their extension to regions is not straightforward [33, 45].


Local norm methods compute a norm over a set of features. This type of descriptor was formalized in [19]. Zoning descriptors are the most basic kind of local norm descriptors: an image is divided into cells, and the ink area is computed in each cell. In general, such descriptors are not invariant to similarities unless symbols are centered and resized before computing the local norms. These descriptors are useful because the shape description is compact and the size of the descriptor is small. However, the discrimination capability is lower, especially for tasks with many different classes of symbols. The blurred shape model (BSM) is a sophisticated zoning descriptor that is robust to symbol deformations [29]. The R-signature is invariant to shift and scale because the signature is computed along the radial parameter of the Radon transform [74].

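A zoning descriptor in its most basic form can be sketched as follows (the grid size g is an arbitrary choice): the image is cut into g × g cells, and the ink density of each cell becomes one feature:

```python
import numpy as np

def zoning(img, g=4):
    """Mean ink density over a g x g grid of cells (image cropped to fit)."""
    H, W = (img.shape[0] // g) * g, (img.shape[1] // g) * g
    cells = img[:H, :W].reshape(g, H // g, g, W // g)
    return cells.mean(axis=(1, 3)).ravel()          # one feature per cell
```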

Auto-regressive (AR) methods consist of computing the parameters of closed curves using regression techniques like least squares. These methods are usually applied to contour curves, and the coefficients fitting the contour curves are then used to derive descriptors invariant to similarity transforms. Bivariate AR models were proposed in [21, 70] to overcome some shape representation problems of the univariate functions representing the shape boundary [44]; with a bivariate function, convex and non-convex shapes are treated in the same way. One drawback of these stochastic methods is that the number of coefficients required to describe the shape is high for complex shapes and is usually chosen empirically.


The curvature function is based on the second derivative; it describes a planar curve (up to its position and orientation) and is therefore invariant to shifts and rotations of the shape. Changes of curvature are considered to be dominant shape features, and they have been the focus of detailed studies since the beginning of the 1980s. It has been shown that this kind of descriptor usually performs well for general shape description purposes. However, the required computation of the second derivative makes this descriptor sensitive to noise. The curvature function has been widely used as a symbol descriptor. The concept of the curvature primal sketch, which is in fact a curvature scale space, was introduced in [3]. Maximal curvature points were proposed in [7], where the curvature function was computed and local maxima were extracted to construct a structural descriptor. Finally, the Curvature Scale Space (CSS) descriptor [56] is based on a multi-resolution description of the curvature function and was included in the MPEG-7 standard [53].

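The curvature function itself is straightforward to sketch for a sampled contour, using the classical formula κ = (x′y″ − y′x″)/(x′² + y′²)^{3/2}; as noted above, the second derivatives are what make it sensitive to noise in practice:

```python
import numpy as np

def curvature(contour):
    """Curvature along a contour given as an (N, 2) array of points."""
    x, y = contour[:, 0], contour[:, 1]
    dx, dy = np.gradient(x), np.gradient(y)        # first derivatives
    ddx, ddy = np.gradient(dx), np.gradient(dy)    # second derivatives
    return (dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5
```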

Directional methods compute shape gradients in several directions [49]. They are essentially based on the implementation of discrete derivatives, and, as a result, their strengths and weaknesses are strongly influenced by this fact. In general, these descriptors are extremely sensitive to contour distortions and local occlusions of shapes, but they can be easily applied and adapted. Therefore, these descriptors have been used for both machine-printed and handwritten non-Latin characters with different degrees of success since the end of the 1970s. In these works, directional information is extracted by means of different masks like the Kirsch or Sobel masks. A different approach, based on Chain Codes, was proposed by Kimura et al. in 1997 for handwritten Japanese character recognition [46] and successfully applied to mathematical symbol recognition [52]. An advantage of this type of descriptor over the gradient-based approach is its computation time. However, the Chain Code-based descriptor usually shows lower accuracy than gradient-based descriptors.


Histogram descriptors are empirical approximations of the probability density functions of features. These descriptors are useful because they reduce the feature space to a small set of bins in which feature information, such as directions or variations in gradient modulus, is accumulated. Their use is not restricted to shape description: for instance, histograms based on color features [73], or computed from global or local shape features [36, 80], have been proposed in the literature (Fig. 16.4).

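A gradient-orientation histogram is a typical instance and can be sketched as follows (the bin count is an arbitrary choice): orientations are accumulated into bins, weighted by the gradient modulus, and normalized into an empirical density:

```python
import numpy as np

def orientation_histogram(img, n_bins=8):
    """Histogram of gradient orientations weighted by gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                         # gradient modulus
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # orientations in [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0.0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-12)             # normalize to a density
```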

Multi-scale/Resolution Decomposition

Some of the previous descriptors are defined in multi-scale or multi-resolution frameworks. The underlying hypothesis is that the most relevant features are preserved at rough scales. The multi-scale decomposition of a shape basically consists in smoothing it by convolution with a scale function, which is most of the time the Gaussian function. The scale function depends on a parameter, σ, usually referred to as the scale parameter. An inherent drawback of multi-scale decomposition is that the size of the shape remains the same in spite of the reduction in the number of features; this means that the size of the descriptor is the same regardless of the scale used. For this reason, descriptors based on this method usually extract features from a pyramidal decomposition of the shape, from the roughest scale to the finest one, to complete the shape description. Contour-based scale space descriptors are obtained by convolving the contours of the shape with a Gaussian kernel (or the first and second derivatives of a Gaussian filter). The Curvature Scale Space [56] and the "curvature primal sketch" [3] are defined in a multi-scale context where local maximal curvatures are extracted, leading to a reduction in the size of the descriptor. Region-based scale space descriptors are obtained by applying Gaussian kernels over the whole image. For instance, the SIFT descriptor [49] is defined by selecting key points at the extrema of the difference of Gaussians in a scale space, where a directional descriptor is computed at each scale and at each selected point.

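A sketch of contour-based scale-space analysis in the spirit of the CSS descriptor (reusing the curvature function sketched earlier; the set of scales is an arbitrary choice): the closed contour is smoothed with Gaussians of increasing σ, and the zero crossings of the curvature are tracked across scales:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def curvature_zero_crossings(contour, sigmas=(1, 2, 4, 8)):
    """Number of curvature zero crossings at each scale (closed contour)."""
    counts = []
    for s in sigmas:
        x = gaussian_filter1d(contour[:, 0], s, mode="wrap")  # wrap: closed curve
        y = gaussian_filter1d(contour[:, 1], s, mode="wrap")
        k = curvature(np.column_stack([x, y]))                # curvature at scale s
        counts.append(int(np.sum(np.sign(k[:-1]) != np.sign(k[1:]))))
    return counts
```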

Unlike multi-scale decomposition, multi-resolution decomposition is derived from the Multi-Resolution Analysis (MRA) theory [51]. Wavelets were the first MRA methods used as symbol descriptors and were applied to shape contours in combination with affine invariants [28]. Moreover, the wavelet transform has also been used directly on images to detect horizontal or vertical lines and corners. As symbols are composed of lines oriented in any direction, the performance of bivariate wavelet descriptors in symbol recognition tasks is not high. Therefore, other MRA methods like Gabor wavelets [78], ridgelets [64], and contourlets [15] have been used to overcome this problem of line orientation (Fig. 16.5).


Structural Descriptors

Structural descriptors consider the shape structure in their definition. Shape structures are the logical relations (perpendicularity, adjacency, crossing, and so on) between the primitive elements composing the shape. These types of descriptors are usually stored in graph or grammar structures.


This section is dedicated to the descriptors themselves or, more specifically, to the existing structures used to represent the relations between shape entities. The section "Structural Classification" below will cover the main matching techniques and the methods used to reduce their high complexity. Structural descriptors for graphics recognition can be broadly divided into two classes: syntactic and prototype-based descriptors.


Syntactic descriptors are determined by a grammar, based on the formal language theory introduced by Chomsky in the middle of the 1950s [37], for graphic document interpretation (see Chap. 17 (Analysis and Interpretation of Graphical Documents) for further details). A grammar is a condensed representation of a large set of prototypes. From a finite set of elements and a set of rules, a large set of prototypes is produced, in a similar way as in human language, where alphabets and grammar rules allow us to produce words. This kind of representation is suitable when the number of prototype patterns is large, when common substructures among patterns are large, and when the available knowledge about the structure facilitates grammar inference. When any of these conditions does not hold, it is better to use a prototype-based descriptor.


Fig. 16.5 Example of a multi-resolution descriptor based on ridgelet decomposition


A prototype is a class representative, usually represented using strings and graphs. Using graph theory (graph and subgraph isomorphisms), it is possible to compare and to classify shapes, even partially occluded ones. The use of prototypes, instead of all graph descriptors, reduces the complexity of matching algorithms, although the computation of graph prototypes may itself be computationally expensive. However, graph prototypes are computed only once, during the learning phase, whereas matching is performed at each query. There are several definitions of prototypes, and learning graph prototypes is still a subject of active research. For instance, a recent work [65] proposed using a genetic algorithm for learning graph prototypes (generalized median set, generalized discriminative set).


In addition to the above, the use of embedding techniques (see section "Embedding Methods") allows the computation of the graph edit distance to be replaced by the computation of an Lp distance in an n-dimensional space, thus reducing the computational complexity. The most representative graph structures used in symbol recognition are labeled graphs, attributed graphs, and associative graphs.


A labeled graph is a set of nodes, edges, and labeling functions. It is one of the simplest graph structures still in use and can be constructed from a graphic document after a vectorization process. The formal definition of a labeled graph can be found in any structural pattern recognition textbook, with slight differences in notation and names. Typically, the set of nodes is composed of vectors, which play the role of primitives, and the set of edges is composed of pairs of nodes representing touching vectors. The definition of the labeling function depends on each particular method. As shown earlier, a grammar was used in 1969 to create the PDL; later on, labeled graphs were used as a basis to build more sophisticated representations such as attributed graphs. In 1999, a labeled graph called the "shock graph" was proposed in [72], where the set of nodes is composed of terminal and junction points obtained after applying the MAT to symbols.

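As an illustration, a labeled graph for a vectorized symbol can be built with the networkx library as follows (the node and edge labels here are invented for the example): primitives become nodes, and touching relations become labeled edges:

```python
import networkx as nx

# Nodes are primitives produced by vectorization; edges connect touching ones.
G = nx.Graph()
G.add_node(0, kind="segment", length=10.0, angle=0.0)
G.add_node(1, kind="segment", length=10.0, angle=90.0)
G.add_node(2, kind="arc", radius=5.0)
G.add_edge(0, 1, relation="L-junction")  # edge labels encode the relation
G.add_edge(1, 2, relation="adjacent")
```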

An attributed graph is a labeled graph (Fig. 16.6) with two more functions that assign sets of attributes to nodes and edges [11]. Graphs without attributes represent the structural information of symbols, while attributes add semantic information for the interpretation of schematic diagrams. In addition to symbol recognition, attributed graphs have also been used in other related applications such as handwriting recognition (see Chaps. 11 (Handprinted Character and Word Recognition) and 12 (Continuous Handwritten Script Recognition)) and general structural pattern recognition (see Chap. 15 (Graphics Recognition Techniques)). A particular case of attributed graphs is the RAG, in which the minimal closed loops (regions) are extracted together with their adjacency relations [48].


An associative graph is a different graph-based representation. Each node is an association between a local description of a symbol model and a partial description of the document being processed. Each association is then evaluated by a locally defined matching function, and two nodes are connected if the matching value is below a preset threshold. With this representation, graphs do not represent simple primitive connections but incorporate information related to the sought symbol models. This representation, which provided a definition of association graphs for trees, was used in [22, 62] for comparing shock graphs.


Regardless of the graph structure used for symbol description, structural relations can be grouped according to the number of primitives involved:


• Unary relations are commonly used as node attributes in graphs. For instance, attributed graphs based on the MAT have been proposed in [24, 72]. Another set of structural information that is classically used comprises the area, perimeter, eccentricity, length, and number of holes of a region.


• Binary relations are the most used for capturing structural information from shapes, including, for instance, parallelism, angle of intersection, or inclusion between lines [59]. These relations are classically represented by means of edge attributes in graph structures or in a constraint set for adjacency grammars [54].


• Ternary relations are used less often than binary ones but may be more important in some applications like text detection. For instance, a ternary descriptor in a Markov random field was proposed in [83] to measure text alignment using adjacent regions as primitives.


Fig. 16.6 Example of structural descriptor based on an attributed graph representation

Recognition Methods

In 50 years of publications in the field of pattern recognition, many textbooks [8, 27] have been produced. While some of them are dedicated to specific techniques [31], many others have been applied to symbol recognition. Therefore, a thorough review of them is not only unworkable but also beyond the scope of this chapter. In this perspective, only the most used and well-known methods in the field are discussed in this section.


The pattern recognition process is divided into two stages: feature extraction and classification. As seen above, feature extraction involves the construction of symbol descriptors, which can be pixel-based or structural, invariant or not to transformations. Classification refers to the set of methods that allow symbols to be recognized. In general, the majority of symbol recognition methods fit within supervised learning (i.e., classification), and therefore a set of training symbols is given.


When a training set is lacking or there is no need for supervised learning, the classification step reduces to the mere calculation of a distance or similarity measure between the symbols to be recognized and a set of models. Therefore, the first part of this section is dedicated to similarity and distance measures for both structural and statistical methods. The computational cost is one of the motivations for the introduction of embedding and kernel techniques, which allow moving from a graph description to a feature-based one. The second part gives an overview of the respective techniques in structural and statistical classification that have been used to recognize symbols.


Distance and Similarity Measures

The easiest way to recognize a symbol is to compare it to a reference set, namely, models or prototypes, and assign to it the label of the most similar model. If the set of models is not very large, this comparison can be done sequentially using a similarity measure. A variety of similarity measures, such as distances, correlation, inner products, trigonometric functions, and integral operators, all of which can be applied to symbol recognition, can be found in the literature.


Generally speaking, any functional defined on two elements that returns a scalar value can be interpreted as a similarity measure. Of course, it is preferable that this value has a meaning. An example of a similarity measure used in some symbol recognition methods is the Kullback–Leibler (KL) divergence, which is used for comparing two probability density functions q and p [86].

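For two discrete densities p and q (e.g., histograms normalized to sum to 1), the KL divergence can be sketched as follows; note that it is not symmetric, which is why it is a similarity measure rather than a distance:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```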

A special case of similarity measure is the distance, or metric. A set X with a distance d is called a metric space. From a formal viewpoint, if d is a distance, a set of metric space properties can be directly applied. The definition of distance found in any book of elementary geometry satisfies the three well-known properties: positivity, symmetry, and the triangle inequality. Examples of the most common distances are:


1. The real vector space $\mathbb{R}^n$ with any of the $L^p$ distances: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$. For $p = 1$, $2$, and $\infty$, it is, respectively, the Manhattan, the Euclidean, and the supremum distance; for the supremum distance, the sum operator is replaced by the max operator.

2. The space of real functions $L^p(\mathbb{R})$ composed of $p$-integrable functions: $d(f, g) = \left( \int_{\mathbb{R}} |f(x) - g(x)|^p \, dx \right)^{1/p}$. For $p = 1$, it is the Banach space, and for $p = 2$ the Hilbert space.

In general, for any vector space with a norm $N$, a distance is defined as $d(x, y) = N(x - y)$. When the chosen symbol descriptor is a feature vector, it can be useful to consider one of the distances in the previous examples. In such a case, the feature vector is embedded into a vector space of finite dimension $n < \infty$, where all distances are topologically equivalent. That is, given two distances $d_1$ and $d_2$ defined on the metric space $X$, two real positive values $A$ and $B$ exist such that, for all values $x$ and $y$,

$A \, d_1(x, y) \le d_2(x, y) \le B \, d_1(x, y)$

In practice, this means that, given a symbol described by means of feature vectors, differences in classification rates are insignificant regardless of which distance from Example 1 is used. In other words, the performance of a given feature vector will not change much if the Manhattan distance is used instead of the Euclidean one, but the complexity will decrease.

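The $L^p$ distances of Example 1 (the Minkowski family) can be sketched in one function:

```python
import numpy as np

def lp_distance(x, y, p=2):
    """Minkowski distance: p=1 Manhattan, p=2 Euclidean, p=inf supremum."""
    d = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(d.max()) if np.isinf(p) else float((d ** p).sum() ** (1.0 / p))
```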

While the $L^p$ distances are the most widely used in finite-dimensional vector spaces, the edit distance, which is in a broad sense a similarity measure, is most often used to compare structural descriptors. It was initially defined to compare strings and was later extended to trees and graphs. The distance is obtained by adding the costs of the edit operations (insertion, deletion, and substitution) needed to transform one string into another. The costs associated with each operation depend on the application, and if they are chosen properly, the edit distance is a true distance that satisfies the three properties required of a distance.

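For strings, the edit distance is computed by dynamic programming; a minimal sketch with unit insertion, deletion, and substitution costs (application-specific costs would replace the 1s) is:

```python
def edit_distance(a, b):
    """Levenshtein distance with unit operation costs."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                               # i deletions
    for j in range(m + 1):
        D[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution
    return D[n][m]
```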

The edit distance is a measure that is robust to errors arising during the extraction of primitives, but it is computationally expensive to calculate exactly. To overcome this problem, an algorithm that estimates this distance by means of a bipartite graph was proposed in [66].


Embedding Methods

The goal of kernel and embedding methods is to apply statistical methods to structural descriptors. The reason is twofold. On the one hand, they benefit from structural descriptors, which allow a richer representation of symbols, using graphs and trees, than feature vectors do. On the other hand, they extend the range of methods that can be used in classification problems and reduce the order of complexity of some operations, for example, the calculation of the generalized median graph [30].


Embedding methods are formally categorized as implicit or explicit. Explicit methods transform a graph into a feature vector. We can then apply any statistical method: dimension reduction, such as PCA or Fisher's discriminant analysis, and classifiers such as KNN, boosting, neural networks, and SVM. In all cases, the difficulty of embedding methods is to find suitable embedding functions.

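A toy explicit embedding might look as follows (the choice of features is purely illustrative): a graph is mapped to a fixed-length vector of simple structural statistics, after which any statistical classifier applies:

```python
import numpy as np
import networkx as nx

def embed_graph(G, max_degree=8):
    """Explicit embedding: graph -> feature vector of structural statistics."""
    degrees = np.asarray([d for _, d in G.degree()], dtype=int)
    hist = np.bincount(degrees, minlength=max_degree + 1)[:max_degree + 1]
    counts = [G.number_of_nodes(), G.number_of_edges()]
    return np.concatenate([counts, hist]).astype(float)
```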

Rather than seeking explicit transforms, implicit embedding is based on graph kernels. A kernel is a bivariate function that performs two operations at once: it first embeds structural descriptors into a vector space and then computes the dot product in that space. The advantage of using kernel functions is that the embedding transformation does not need to be known, and it is much easier to define kernel functions than embedding functions. Further details on how to apply this framework to graphs are found in [12, 13, 58].


Structural Classification

Classification methods with structural descriptors consist mostly in finding substructures in global representations of documents. One advantage of these methods, in contrast to statistical ones, is that they do not require a learning phase. However, an expert is needed to set the parameters and heuristics in order to achieve good performance, in either execution time or recognition rate. Basic programming techniques such as dynamic programming [5, 14] or branch-and-bound techniques [48] have been applied to subgraph search. Overviews of these algorithms, not only for symbol recognition but for any field of application in structural pattern recognition, can be found in [13, 18].


Ideally, matching algorithms look for exact matches between the object to be recognized and a list of known patterns and models. This means that, if data are represented by graphs, both structures must have at least the same number of nodes. Similarly, in statistical methods, distances require the two feature vectors to have the same dimension: the first component of the first vector is compared to the first component of the second vector, and so on until the last component. In contrast, in structural approaches, there is no canonical order of nodes, and a priori all nodes are compared with one another, making sure that all relations between nodes of the first graph are the same as the relations defined between nodes of the second graph. The complexity of these algorithms, in the worst case, grows exponentially with the number of nodes. In practice, this makes these approaches intractable, especially when graphs have many nodes. Matching algorithms seek to perform these searches in a more intelligent way, with heuristics to reduce the search space. Moreover, structural descriptors obtained after a feature extraction process are not free of errors: two descriptors extracted from different images of the same object can differ in their numbers of nodes and edges. Matching algorithms have to be able to deal with this source of errors in descriptors, and therefore the graph matching problem in symbol recognition is rather a problem of subgraph matching. Furthermore, error tolerance with respect to the primitive extraction process is achieved through the edit distance. Other possibilities are relaxation or elastic techniques [22] and active contours [77].

Finally, relations between nodes of trees and graphs can also be represented by adjacency matrices. The values of these matrices depend on the type of tree or graph, but in any case they are real or even complex numbers. The eigenvalues and eigenvectors of these matrices can thus be computed to compare two different graphs [72]. One advantage is that the complexity of this matching is of the same order as computing an ordinary distance; however, the graphs must have the same number of nodes, and two similar eigenvectors do not necessarily correspond to similar graphs.
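A minimal sketch of this spectral comparison, assuming undirected graphs with real-valued (hence symmetric) adjacency matrices:

```python
import numpy as np

def spectral_distance(A1, A2):
    """Compare two equal-size graphs through the sorted eigenvalues of their
    adjacency matrices: cheap (one eigendecomposition per graph), but, as
    noted above, similar spectra do not guarantee similar graphs."""
    e1 = np.sort(np.linalg.eigvalsh(A1))
    e2 = np.sort(np.linalg.eigvalsh(A2))
    return np.linalg.norm(e1 - e2)

A1 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)  # path 1-0-2
A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path 0-1-2
print(spectral_distance(A1, A2))  # 0.0: the two paths are isomorphic
```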

Statistical Classification

Statistical methods use information from labeled data to learn classifiers for symbol recognition. The aim of these methods is to learn decision rules that classify with the minimum possible risk of being wrong. In its simplest formulation, the decision rule is defined from the a posteriori probabilities of the classes. That is, it maximizes the a posteriori probability (MAP rule) that the class is $\omega_m$ given a descriptor $x$:

$$\hat{\omega}(x) = \arg\max_{\omega_m} P(\omega_m \mid x) \qquad (16.1)$$

This formulation is useful for classifiers whose response approximates the a posteriori probabilities of the classes. For other classifiers, it is necessary to introduce the concepts of loss function and risk of error:

$$R(y \mid x) = \sum_{m} L(y, \omega_m)\, P(\omega_m \mid x) \qquad (16.2)$$

A loss function $L$ evaluates the loss suffered when a classification mistake is made. When the loss is 0/1, the MAP rule of Eq. (16.1) is recovered. The risk $R(y \mid x)$ indicates the risk of error when deciding $y$ given a descriptor $x$, and it is obtained from the loss function and the a posteriori probabilities of the classes. In symbol recognition, the classifier traditionally used is the nearest neighbor. Other classifiers such as k-nearest neighbors (KNN), support vector machines (SVM), boosting methods, or genetic algorithms have come into use since benchmarks became available.
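A small numeric illustration of Eqs. (16.1) and (16.2) with a 0/1 loss; the posterior probabilities are made up for the example:

```python
import numpy as np

# Posterior probabilities P(omega_m | x) for one descriptor x, three classes
posterior = np.array([0.2, 0.5, 0.3])

# 0/1 loss: L[y, m] = 0 if y == m else 1, which recovers the MAP rule
L = 1.0 - np.eye(3)

# Risk of deciding y given x: R(y|x) = sum_m L[y, m] * P(omega_m | x)
risk = L @ posterior
print(risk.argmin(), posterior.argmax())  # both pick class 1
```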

Thus, the KNN is the simplest and most used classifier in symbol recognition. The distance between the unknown symbol and all the elements of the data set is computed, and the labels of the closest elements are counted. The label of the majority class is then assigned to the unknown symbol. As k increases, the ratio of the number of votes for a class to k estimates its a posteriori probability [41].
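A minimal KNN sketch along these lines; the 2-D descriptors and class names are made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Assign to x the majority label among its k nearest training symbols.
    The vote ratio count/k also estimates the a posteriori probability."""
    dists = np.linalg.norm(X_train - x, axis=1)
    votes = [y_train[i] for i in np.argsort(dists)[:k]]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / k  # predicted class and its posterior estimate

# Hypothetical 2-D descriptors of training symbols and their labels
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = ["resistor", "resistor", "resistor", "diode", "diode", "diode"]
print(knn_classify(np.array([0.4, 0.2]), X_train, y_train, k=3))  # ('resistor', 1.0)
```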

Other state-of-the-art classifiers, like boosting and SVM methods, are two-class classifiers that have been extended to multi-class classifiers to perform symbol recognition. The extension depends on the strategy used to learn. In classical recognition problems, one possibility is to follow a 1 vs. 1 strategy in which m − 1 classifiers are taken for each of the m classes. The predictions of these m − 1 classifiers can be combined using any of the techniques described in [2]. Another possibility is a 1 vs. all strategy in which m classifiers, one per class, are trained. In this case, one class is composed of all the elements of a given class (positive class), and the other class (negative class) is composed of elements taken randomly from the remaining classes.
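Assuming scikit-learn is available, the two strategies can be sketched as follows; the data are synthetic placeholders rather than symbol descriptors:

```python
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Placeholder descriptors X and labels y for a 4-class problem
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)   # one classifier per class pair
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # m classifiers, one per class
print(ovo.predict(X[:5]), ovr.predict(X[:5]))
```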

SVMs are a family of classifiers that look for the optimal hyperplane separating two classes. When this hyperplane does not exist (because the classes overlap), the hyperplane separating the two classes with minimum error is sought. To find the optimal solution, the problem is formulated as a quadratic optimization problem with boundary constraints. The method of Lagrange multipliers is used to determine the vector $w$, perpendicular to the separating hyperplane, and a scalar $b$, the offset of the hyperplane from the origin. Hence, to classify a new element $x$, the sign operator is applied:

$$f(x) = \operatorname{sign}\left(\langle w, x \rangle + b\right) \qquad (16.3)$$

To better fit the data, surfaces are sought instead of planes. In other words, the data are transformed so that the optimal surface separating the two classes becomes flat. That is the idea behind the kernel trick, which has also inspired kernel methods for structural descriptors. In this case, the dot product of Eq. (16.3) is replaced by a kernel function $k$, which computes the dot product of $x$ and $w$ transformed by an embedding function $\phi$ in another vector space, usually of higher dimension than the original data:

$$f(x) = \operatorname{sign}\left(k(w, x) + b\right), \quad k(w, x) = \langle \phi(w), \phi(x) \rangle \qquad (16.4)$$

Choosing an appropriate kernel function is one of the difficulties to solve. Examples of kernel functions are the polynomial kernel of order $p$, $(\langle x, y \rangle + 1)^p$; the radial basis function, $\exp(-\|x - y\|^2 / 2\sigma^2)$; and the hyperbolic tangent, $\tanh(\langle x, y \rangle - \delta)$. Once the kernel function is chosen, optimal parameters can be obtained by cross validation on the training set. However, learning SVMs on large data sets (millions of support vectors) is still a challenging problem [26].
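A hedged sketch of this model-selection step, again assuming scikit-learn; the parameter grid and the synthetic data are illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Placeholder symbol descriptors and labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {
    "kernel": ["poly", "rbf", "sigmoid"],  # the three kernel families above
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
}
# 5-fold cross validation on the training set picks the kernel and parameters
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```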

Boosting methods seek to build a good classifier by reinforcing the learning of what is called a weak classifier. A weak classifier performs only slightly better than random guessing; that is, for a two-class problem, it has a recognition rate slightly above 50 %. Boosting methods assign a weight to each learning sample, which is used to learn a new classifier. This new classifier is then evaluated, and the weights of well-classified samples are reduced while those of misclassified samples are increased. The resulting classifier is an additive model consisting of the sum of all weak classifiers $f_p$ trained so far, where $c_p$ is a weight obtained from the classification error of $f_p$ on the training set:

$$F(x) = \sum_{p} c_p\, f_p(x) \qquad (16.5)$$
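A from-scratch sketch of this scheme with discrete AdaBoost-style weight updates; the pool of threshold stumps and the toy data are illustrative assumptions:

```python
import numpy as np

def adaboost(X, y, weak_learners, T=10):
    """Discrete AdaBoost sketch: y in {-1,+1}; weak_learners is a pool of
    candidate functions h(X) -> {-1,+1}. Returns the weights c_p and the
    chosen weak classifiers f_p of the additive model in Eq. (16.5)."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # one weight per training sample
    model = []
    for _ in range(T):
        # pick the weak classifier with the lowest weighted error
        errs = [(w * (h(X) != y)).sum() for h in weak_learners]
        p = int(np.argmin(errs))
        err = max(errs[p], 1e-10)
        if err >= 0.5:       # no weak learner better than chance
            break
        c = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-c * y * weak_learners[p](X))  # raise misclassified weights
        w /= w.sum()
        model.append((c, weak_learners[p]))
    return model

def predict(model, X):
    return np.sign(sum(c * f(X) for c, f in model))  # sign of Eq. (16.5)

# Toy 1-D data and a pool of threshold stumps
X = np.array([0.0, 1.0, 2.0, 3.0]); y = np.array([-1, -1, 1, 1])
stumps = [lambda X, t=t: np.where(X > t, 1, -1) for t in (0.5, 1.5, 2.5)]
model = adaboost(X, y, stumps, T=5)
print(predict(model, X))  # [-1. -1.  1.  1.]
```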

Both SVM and boosting methods have proven to be good classifiers in many pattern recognition applications and are often used as reference classifiers. For instance, directional features were used in [52] for SVM classification. A descriptor based on a combination of contourlets and Fourier coefficients, trained with an AdaBoost classifier, was proposed in [15]. However, recent experiments using these two kinds of classifiers on zoning-like descriptors show comparable degrees of success whose differences are not statistically significant [29].

Symbol Spotting

All the symbol descriptors seen so far assume that symbols are cleanly segmented. However, when symbols are embedded in documents, the well-known paradox arises: to correctly recognize the symbols, one should be able to segment the input data, but to correctly segment them, one needs to recognize the symbols! Generally speaking, symbol spotting is a kind of "strategy" used to break this paradox, since one does not try to recognize a symbol in a document as a whole but as a set of primitives.

In a spotting system, the user submits a query he or she wants to retrieve from the document database, and the system retrieves zones in the documents that are likely to contain the query (see Fig. 16.7). The query defined by the user is usually a cropped image of a document or a hand-sketched symbol belonging to the database. The retrieval stage is done online and, of course, a short response delay is expected.

 

Fig. 16.7 (a) The query. (b) Retrieved zones in the document that are likely to contain the query

The analogy in the computer vision community is that a spotting system resembles a content-based image retrieval (CBIR) application focused on retrieving subparts of images. The main difference is that technical documents have different photometric conditions: documents are usually black and white, so descriptors based on color or texture features cannot be used. Thus, although the underlying strategies are somewhat similar, the features are different, since the structural representation is very important for a document.

Symbol spotting is an emerging topic, and few works have been proposed so far. Since a manuscript on symbol spotting in digital libraries [68] has recently been published, the state of the art given in this section is kept short. Broadly speaking, symbol spotting methods decompose into two main steps. The first describes the document and is related to the feature extraction method (pixel-based or structural). This description can be done locally, with a focus on points or regions of interest, or globally, through a structural representation. The second is the decision step, which occurs at the retrieval stage. Since the query is decomposed into a set of primitives like the document, a strategy is needed to organize these primitives and generate a list of locations in the document that are likely to contain the queried symbol. If the number of symbols and documents is high, a sequential search is not realistic. In this respect, hashing structures have shown their efficiency in quickly generating hypotheses; a minimal sketch is given below. In [68], a relational indexing scheme is proposed in which a hash table stores the adjacency matrices of proximity graphs and a hash function is designed for feature vectors, allowing numerical and structural descriptions of symbols to be combined. Other structures have been proposed based on hierarchical organization or on inverted files combined with the vector-space model; these structures, however, have a higher time complexity. As in CBIR applications, the list of hypotheses is ranked from most to least likely based on similarity measures or voting strategies.
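As a rough illustration of hash-based indexing, the sketch below quantizes local descriptors into hash keys and lets a query vote for the stored zones; `SpottingIndex` and its quantization step `q` are invented for illustration and do not reproduce the relational scheme of [68].

```python
import numpy as np
from collections import defaultdict

class SpottingIndex:
    """Toy spotting index: local descriptors of document zones are quantised
    into hash keys; a query votes for the zones stored under matching keys."""
    def __init__(self, q=0.25):
        self.q = q                      # quantisation step (illustrative)
        self.table = defaultdict(list)  # hash key -> list of document zones

    def _key(self, descriptor):
        return tuple(np.floor(np.asarray(descriptor) / self.q).astype(int))

    def add(self, descriptor, zone):
        self.table[self._key(descriptor)].append(zone)

    def query(self, descriptors):
        votes = defaultdict(int)
        for d in descriptors:
            for zone in self.table.get(self._key(d), []):
                votes[zone] += 1
        # rank hypotheses from most to least likely
        return sorted(votes.items(), key=lambda kv: -kv[1])

index = SpottingIndex()
index.add([0.10, 0.90], zone="doc1:zoneA")
index.add([0.12, 0.88], zone="doc2:zoneB")
print(index.query([[0.11, 0.91]]))  # both zones fall in the same hash bucket
```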

When the number of documents is small and the symbols have low variability in size and orientation, a brute-force solution can be applied; in this case spotting works like a correlation filter (a minimal sketch is given below). The document is decomposed into regions defined over connected components, a grid partition, or a sliding window, and the similarity with the query (like a correlation function) is measured. For instance, a method to localize mechanical objects in grayscale images was proposed in [50]. The approach requires no preprocessing step and is based on a pyramidal decomposition. The first step searches for potential positions of the query symbol in the document as maxima of a normalized correlation surface. These positions are then propagated level by level towards the lowest level of the pyramid in order to precisely locate the corresponding objects. The approach defined in [29] is based on a sliding-window strategy and a descriptor (circular blurred shape model) describing the spatial arrangement around a centroid of the object region in a correlogram structure. These correlation-based methods locate the positions of a query symbol accurately and are robust enough for real applications, but invariance to symbol variations (size and rotation) remains a bottleneck. In this perspective, to avoid scanning the whole document, a variant is to focus on interest points, usually corners, and either use these points to describe the document locally by means of visual words, as is done in information retrieval where documents are indexed by textual words, or organize these points spatially to validate hypotheses in order to find a symbol.
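A brute-force sketch of such correlation-based spotting on a binary image, under the stated assumption of near-constant size and orientation; the threshold is an illustrative parameter:

```python
import numpy as np

def correlation_spotting(document, query, threshold=0.9):
    """Slide the query template over a binary document image and report the
    positions where the normalised correlation exceeds a threshold."""
    H, W = document.shape
    h, w = query.shape
    qz = query - query.mean()
    qn = np.linalg.norm(qz)
    hits = []
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            win = document[i:i + h, j:j + w]
            wz = win - win.mean()
            denom = np.linalg.norm(wz) * qn
            if denom > 0 and (wz * qz).sum() / denom >= threshold:
                hits.append((i, j))
    return hits

doc = np.zeros((8, 8)); doc[2:5, 3:6] = np.eye(3)  # a tiny embedded "symbol"
print(correlation_spotting(doc, np.eye(3)))         # [(2, 3)]
```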

Lines remain among the most used and simplest primitives [60]. An extension to polyline primitives was proposed to take into account circular shapes and the segment fragmentation caused by the noise sensitivity of raster-to-vector algorithms. In [68, 69], regions are considered to be circular polylines and symbols are supposed to be defined by closed contours, but this is not necessarily the case in electrical drawings. Some structural approaches encode primitives as strings [69] or dendrograms [87]; however, graphs remain the most popular data structure, since they offer a more powerful representation for encoding structural relations between different parts of symbols. In some approaches, the graph is a means to extract a vectorial signature [68], using a windowing or bucket decomposition of the document as a correlation filter. The main limitation of these approaches is that they are not robust to line fragmentation. They could, however, be used to presegment the document into zones of interest, which offer a richer description than interest points. Others [48] consider the graph as a whole, and spotting is cast as a problem of subgraph matching between the query and the document, both represented by graphs. These approaches, however, suffer from the classical drawbacks of subgraph isomorphism in pattern recognition, namely, poor tolerance to noise and high time complexity, as discussed in section "Structural Classification."

Spotting methods are strongly related to the adopted symbol descriptors and to the subsequent treatments, since the heart of these methods lies in the description of the document. The primitives used are usually either pixel-based or structural. Pixel descriptors are less dependent on the quality of a prior segmentation but are less robust to partial occlusions and produce more false responses, because they do not take the structural configuration of a symbol into account. In contrast, structural descriptors offer a more powerful description of the document (fewer false alarms are generated) and are more flexible with respect to occlusions, but they depend on the quality of the preprocessing step needed to feed the structural descriptor. Therefore, the compactness and precision of the representation are very important, because they affect the performance of the spotting method, the indexing efficiency of the system, and the response delay.

Conclusion
The field of symbol recognition has reached maturity, with many descriptors proposed so far. Some descriptors are variants of earlier ones, and several reach very high recognition rates even when symbols are noisy, partially occluded, or transformed under similarities. However, obtaining descriptors robust to severe deformations remains a topic of interest. To achieve a high level of performance, several assumptions on symbols, related to the number of models, their variability, and segmentation quality, are required. Thus, some specific problems remain for future research.

When the number of symbol models or the complexity of the symbols increases, the confusion between different classes increases. For instance, in aircraft electrical wiring diagrams, symbols are numerous and are complex entities that associate a graphical representation, a number of connection points, and associated text annotations. The main challenge here is to discriminate symbols not by their global shape but by small details (e.g., the number of connections). Combining the outputs of classifiers or descriptors, or selecting features, is one strategy used to improve the recognition rate. Although it is relatively easy to adapt these strategies to statistical descriptors, extending them to structural ones remains difficult. Recent works on embedding methods [12, 30] have been proposed with promising results.

Performance evaluation campaigns on massive data collections to test the scalability of the proposed approaches are lacking (see Chap. 30 (Tools and Metrics for Document Analysis Systems Evaluation)). Contrary to domains where huge numbers of documents are managed, technical documents are rare, and few data sets are available as benchmarks to the image analysis community. Several efforts have been made to create databases for graphics recognition contests, but these databases are often small and synthetic, despite efforts to generate synthetic documents that look like real ones [23]. The probable reason is the difficulty of obtaining real documents and the cost of associating ground truth with them, since generating it manually would require a huge effort.

With the growing popularity of digital input devices like tablet PCs and smartphones, there is an increasing interest in designing systems that can automatically recognize hand-sketched or camera-captured symbols (see Chap. 28 (Sketching Interfaces)). These symbols are warped and thus exhibit high variation and distortion. Even if general document de-warping methods can be applied to the particular domain of symbols, new descriptors robust to shape deformations or affine transformations are needed.

Spotting methods are still in their early years; the methods proposed so far have only been applied to synthetic documents, and they need to reach a higher level of maturity before applications to real documents can be considered. Methods combining hash tables and voting strategies seem to be efficient in terms of time and indexing complexity when facing large collections of documents. Moreover, although performance measures [67] have been proposed for symbol spotting methods, these measures are geared towards synthetic documents, and their applicability to real documents has not really been verified. In addition, even though symbol spotting is closely related to symbol recognition, it can be seen as a particular case of document mining; the next step beyond symbol spotting would be new methods for symbol mining, and there is still much to do to reach this goal. In this vein, recent works [20] use an original adaptation of the Galois lattice, which has shown good performance in the field of data mining. Galois lattices are a means to obtain a "symbolic representation" from a numeric one, with a view to narrowing the semantic gap.

Cross-References
Analysis and Interpretation of Graphical Documents
Asian Character Recognition
Continuous Handwritten Script Recognition
Document Creation, Image Acquisition and Document Quality
Graphics Recognition Techniques
Imaging Techniques in Document Analysis Processes
Language, Script, and Font Recognition
Middle Eastern Character Recognition
Sketching Interfaces
Text Segmentation for Document Recognition
Tools and Metrics for Document Analysis Systems Evaluation
Further Reading

This chapter does not claim to be exhaustive; interested readers can broaden their knowledge with the following books: [19, 68] for shape description and symbol spotting, and [8, 27] for more general background on shape classification and machine learning.
