Abstract
This paper proposes LPRNet, an end-to-end method for Automatic License Plate Recognition without preliminary character segmentation.
Our approach is inspired by recent breakthroughs in Deep Neural Networks, and works in real time with recognition accuracy up to 95% for Chinese license plates: 3 ms/plate on an NVIDIA GeForce GTX 1080 and 1.3 ms/plate on an Intel Core i7-6700K CPU.
LPRNet consists of a lightweight Convolutional Neural Network, so it can be trained end-to-end.
To the best of our knowledge, LPRNet is the first real-time License Plate Recognition system that does not use RNNs.
As a result, the LPRNet algorithm may be used to create embedded solutions for LPR that feature a high level of accuracy even on challenging Chinese license plates.
1. Introduction
Automatic License Plate Recognition is a challenging and important task used in traffic management, digital security surveillance, vehicle recognition, and parking management in big cities.
This task is complex due to many factors, including but not limited to: blurry images, poor lighting conditions, variability of license plate numbers (including special characters, e.g. logograms for China and Japan), physical impact (deformations), and weather conditions (see some examples in Fig. 1).
This paper tackles the License Plate Recognition problem and introduces the LPRNet algorithm, which is designed to work without pre-segmentation and subsequent recognition of characters.
In the present paper, we do not consider the License Plate Detection problem; however, for our particular case it can be done with an LBP cascade.
LPRNet is based on a Deep Convolutional Neural Network.
Recent studies have demonstrated the effectiveness and superiority of Convolutional Neural Networks in many Computer Vision tasks such as image classification, object detection and semantic segmentation.
However, running most of them on embedded devices still remains a challenging problem.
LPRNet is a very efficient neural network, which takes only 0.34 GFLOPs to make a single forward pass.
Also, our model runs in real time on an Intel Core i7-6700K (Skylake) CPU with high accuracy on challenging Chinese license plates and can be trained end-to-end.
Moreover, LPRNet can be partially ported to an FPGA, which can free up CPU power for other parts of the pipeline.
Our main contributions can be summarized as follows:
● LPRNet is a real-time framework for high-quality license plate recognition that supports template- and character-independent variable-length license plates, performs LPR without character pre-segmentation, and is trainable end-to-end from scratch for different national license plates.
● LPRNet is the first real-time approach that does not use Recurrent Neural Networks and is lightweight enough to run on a variety of platforms, including embedded devices.
● Application of LPRNet to real traffic surveillance video shows that our approach is robust enough to handle difficult cases, such as perspective and camera-dependent distortions, hard lighting conditions, change of viewpoint, etc.
The rest of the paper is organized as follows.
Section 2 describes the related work.
In Sec. 3 we review the details of the LPRNet model.
Sec. 4 reports the results on Chinese license plates and includes an ablation study of our algorithm.
We summarize and conclude our work in Sec. 5.
In earlier works on general LP recognition, the pipeline consists of character segmentation and character classification stages:
Character segmentation typically uses different handcrafted algorithms, combining projections, connectivity and contour based image components.
It takes a binary image or intermediate representation as input, so character segmentation quality is highly affected by input image noise, low resolution, blur or deformations.
Character classification typically utilizes one of the optical character recognition (OCR) methods, adapted to the LP character set.
Since classification follows the character segmentation, end-to-end recognition quality depends heavily on the applied segmentation method.
To avoid the character segmentation problem, end-to-end solutions based on Convolutional Neural Networks (CNNs) were proposed, which take the whole LP image as input and produce the output character sequence.
The segmentation-free model in [2] is based on variable length sequence decoding driven by connectionist temporal classification (CTC) loss [3, 4].
[ ] H. Li and C. Shen, “Reading Car License Plates Using Deep Convolutional Neural Networks and LSTMs,” arXiv:1601.05610 [cs], Jan. 2016.
It uses hand-crafted LBP features built on a binarized image as CNN input to produce character class probabilities.
Applied to all input image positions via a sliding window, it produces the input sequence for the bidirectional Long Short-Term Memory (LSTM) [5] based decoder.
Since the decoder output and target character sequence lengths are different, CTC loss is used for the pre-segmentation free end-to-end training.
The model in [6] mostly follows the approach described in [2], except that the sliding window method was replaced by spatially splitting the CNN output into the RNN input sequence (a "sliding window" over the feature map instead of the input).
[ ] T. K. Cheang, Y. S. Chong, and Y. H. Tay, “Segmentation-free Vehicle License Plate Recognition using ConvNet-RNN,” arXiv:1701.06439 [cs], Jan. 2017.
In contrast, [7] uses a CNN-based model over the whole LP image to produce a global LP embedding, which is decoded to an 11-character sequence via 11 fully connected model heads.
[ ] V. Jain, Z. Sasindran, A. Rajagopal, S. Biswas, H. S. Bharadwaj, and K. R. Ramakrishnan, “Deep Automatic License Plate Recognition System,” in Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, ser. ICVGIP ’16. New York, NY, USA: ACM, 2016, pp. 6:1–6:8.
Each of the heads is trained to classify the i-th target string character (which is assumed to be padded to the predefined fixed length), so the whole recognition can be done in a single feed-forward pass.
It also utilizes the Spatial Transformer Network (STN) [8] to reduce the effect of input image deformations.
[ ] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial Transformer Networks,” arXiv:1506.02025 [cs], Jun. 2015.
The algorithm in [9] attempts to solve both the license plate detection and license plate recognition problems with a single Deep Neural Network.
[ ] H. Li, P. Wang, and C. Shen, “Towards End-to-End Car License Plates Detection and Recognition with Deep Neural Networks,” ArXiv e-prints, Sep. 2017.
Recent work [10] tries to exploit a synthetic data generation approach based on Generative Adversarial Networks [11] to obtain a large, representative license plate dataset.
[ ] X. Wang, Z. Man, M. You, and C. Shen, “Adversarial Generation of Training Examples: Applications to Moving Vehicle License Plate Recognition,” ArXiv e-prints, Jul. 2017.
[ ] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” ArXiv e-prints, Jun. 2014.
In our approach, we avoided using hand-crafted features over a binarized image; instead, we used raw RGB pixels as CNN input.
The LSTM-based sequence decoder working on the outputs of a sliding window CNN was replaced with a fully convolutional model whose output is interpreted as a sequence of character probabilities for CTC loss training and for greedy or prefix search string inference.
For better performance, the pre-decoder intermediate feature map was augmented by the global context embedding as described in [12].
Also the backbone CNN model was reduced significantly using the low computation cost basic building block inspired by SqueezeNet Fire Blocks [13] and Inception Blocks of [14, 15, 16].
Batch Normalization [17] and Dropout [18] techniques were used for regularization.
LP image input size affects both the computational cost and the recognition quality [19]; as a result, there is a tradeoff between using high [6] or moderate [7, 2] resolution.
[ ] S. Agarwal, D. Tran, L. Torresani, and H. Farid, “Deciphering Severely Degraded License Plates,” San Francisco, CA, 2017.
3. LPRNet
In this section we describe our LPRNet network architecture design in detail.
Recent studies tend to use parts of powerful classification networks such as VGG, ResNet or GoogLeNet as a 'backbone' for their tasks by applying transfer learning techniques.
However, this is not the best option for building fast and lightweight networks, so in our case we redesigned the main 'backbone' network, applying recently discovered architecture tricks.
The basic building block of our CNN model backbone (Table 2) was inspired by SqueezeNet Fire Blocks [13] and Inception Blocks of [14, 15, 16].
We also followed the research best practices and used Batch Normalization [17] and ReLU activation after each convolution operation.
In a nutshell our design consists of:
• location network with Spatial Transformer Layer [8]
• light-weight convolutional neural network (backbone)
• per-position character classification head
• character probabilities for further sequence decoding
• post-filtering procedure
First, the input image is preprocessed by the Spatial Transformer Layer, as proposed in [8].
This step is optional, but it makes it possible to explore how the input image can be transformed to have better characteristics for recognition.
The original LocNet architecture (see Table 1) was used to estimate optimal transformation parameters.
The backbone network architecture is described in Table 3.
The backbone takes a raw RGB image as input and calculates spatially distributed rich features.
A wide convolution (with a 1 × 13 kernel) utilizes the local character context instead of an LSTM-based RNN.
The backbone subnetwork output can be interpreted as a sequence of character probabilities whose length corresponds to the input image pixel width.
Since the decoder output and the target character sequence have different lengths, we apply the CTC loss [20] for segmentation-free end-to-end training.
CTC loss is a well-known approach for situations where input and output sequences are misaligned and have variable lengths.
Moreover, CTC provides an efficient way to go from probabilities at each time step to the probability of an output sequence.
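To make the step from per-position probabilities to a sequence probability concrete, here is a minimal pure-Python sketch of the standard CTC forward recursion. It is an illustration under stated assumptions, not the authors' TensorFlow implementation: the function name, the list-of-lists input layout and the choice of index 0 for the blank class are all assumptions made here.

```python
def ctc_sequence_probability(probs, target, blank=0):
    """Total probability of `target` under CTC.

    probs:  list of T rows; probs[t][c] is the probability of class c at position t.
    target: list of label indices, without blanks.
    blank:  index of the blank class (assumed to be 0 in this sketch).
    """
    # Extended target: blanks around and between the labels.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s] sums the probability of all alignments that end in ext[s].
    alpha = [0.0] * size
    alpha[0] = probs[0][blank]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one symbol
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or in the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)
```

The CTC loss for one training example is the negative logarithm of this value. With two positions and uniform probabilities over {blank, 'A'}, the alignments "AA", "A_" and "_A" all collapse to "A", so the sequence probability is 3 × 0.25.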
More detailed explanation about CTC loss can be found in .
To further improve performance, the pre-decoder intermediate feature map was augmented with the global context embedding as in [12].
It is computed via a fully-connected layer over backbone output, tiled to the desired size and concatenated with backbone output.
In order to adjust the depth of the feature map to the number of character classes, an additional 1 × 1 convolution is applied.
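The global context step can be sketched as follows. This is a simplified illustration, assuming the feature map has already been collapsed to one channel vector per width position; the helper names and the weight layout are hypothetical, and the actual layer sizes are not taken from the paper.

```python
def linear(vec, weights, bias):
    """Dense layer: each row of `weights` produces one output value."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def add_global_context(features, fc_weights, fc_bias):
    """Append one global embedding to every position of the feature map.

    features: list over width positions, each a list of channel values.
    """
    flat = [x for position in features for x in position]
    context = linear(flat, fc_weights, fc_bias)   # fully connected over the whole map
    # "Tile" the embedding across positions and concatenate along channels.
    return [position + context for position in features]


def conv1x1(features, weights, bias):
    """A 1 x 1 convolution is the same dense layer applied at every position."""
    return [linear(position, weights, bias) for position in features]
```

For example, with two positions of two channels each, a one-dimensional context and a 1 × 1 convolution that maps three channels to two classes:

```python
features = [[1.0, 2.0], [3.0, 4.0]]
augmented = add_global_context(features, [[1.0, 1.0, 1.0, 1.0]], [0.0])
logits = conv1x1(augmented, [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0])
```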
For the decoding procedure at the inference stage we considered 2 options: greedy search and beam search.
While greedy search takes the maximum of class probabilities in each position, beam search maximizes the total probability of the output sequence [3, 4].
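The difference between the two decoders can be illustrated with a short pure-Python sketch. The beam search below is the commonly used CTC prefix beam search, which approximates the most probable output sequence; the paper does not state which exact variant was used, so treat the function names, the blank index 0 and the beam width as assumptions.

```python
from collections import defaultdict


def ctc_greedy_decode(probs, blank=0):
    """Take the most probable class at each position, merge repeats, drop blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    decoded, prev = [], blank
    for c in best:
        if c != blank and c != prev:
            decoded.append(c)
        prev = c
    return decoded


def ctc_beam_search(probs, beam_width=10, blank=0):
    """Prefix beam search: returns (label_sequence, probability) pairs, best first."""
    # For each prefix keep the probability of paths ending in blank / non-blank.
    beams = {(): (1.0, 0.0)}
    for row in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            for c, p in enumerate(row):
                if c == blank:
                    next_beams[prefix][0] += (p_blank + p_non_blank) * p
                elif prefix and prefix[-1] == c:
                    # A repeated label extends the prefix only after a blank ...
                    next_beams[prefix + (c,)][1] += p_blank * p
                    # ... otherwise it merges into the same prefix.
                    next_beams[prefix][1] += p_non_blank * p
                else:
                    next_beams[prefix + (c,)][1] += (p_blank + p_non_blank) * p
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    return sorted(((list(prefix), pb + pnb) for prefix, (pb, pnb) in beams.items()),
                  key=lambda item: item[1], reverse=True)
```

With `probs = [[0.6, 0.4], [0.6, 0.4]]`, greedy search picks the blank at both positions and returns an empty sequence (probability 0.36), while beam search returns the single label 1, whose three alignments sum to 0.64.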
For post-filtering we use a task-oriented language model implemented as a set of the target country LP templates.
Note that post-filtering is applied together with beam search.
The post-filtering procedure takes the top-N most probable sequences found by beam search and returns the first one that matches a set of predefined templates, which depends on the country's LP regulations.
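A minimal sketch of such template-based post-filtering is shown below, assuming that templates are expressed as regular expressions and that candidates arrive as (text, score) pairs sorted by decreasing probability. The template in the usage example is purely hypothetical and does not reflect any real plate regulation; the fallback to the top candidate when nothing matches is also an assumption, since the text does not say what happens in that case.

```python
import re


def post_filter(candidates, templates):
    """Return the first candidate that fully matches any template.

    candidates: list of (text, score) pairs, most probable first.
    templates:  list of regular expression strings (hypothetical format).
    """
    compiled = [re.compile(t) for t in templates]
    for text, _score in candidates:
        if any(pattern.fullmatch(text) for pattern in compiled):
            return text
    # Assumed fallback: keep the most probable sequence if no template matches.
    return candidates[0][0] if candidates else None
```

Usage with a made-up template of two letters followed by four digits:

```python
candidates = [("A1B234", 0.50), ("AB1234", 0.30), ("AB123", 0.20)]
plate = post_filter(candidates, [r"[A-Z]{2}\d{4}"])
```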
All training experiments were done with the help of TensorFlow [21].
We train our model with the Adam optimizer using a batch size of 32, an initial learning rate of 0.001 and a gradient noise scale of 0.001.
We drop the learning rate by a factor of 10 after every 100k iterations and train our network for 250k iterations in total.
In our experiments we use data augmentation by random affine transformations, e.g. rotation, scaling and shift.
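As an illustration of this kind of augmentation, the sketch below samples a random 2 × 3 affine matrix that rotates and scales about the image center and then shifts. The parameter ranges and function names are assumptions chosen for the example; the paper does not report the ranges it used.

```python
import math
import random


def random_affine(width, height, rng, max_angle_deg=5.0,
                  scale_range=(0.9, 1.1), max_shift=0.05):
    """Sample a 2x3 affine matrix: rotation and scaling about the center, then a shift.

    All ranges are illustrative defaults, not values from the paper.
    """
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    scale = rng.uniform(*scale_range)
    tx = rng.uniform(-max_shift, max_shift) * width
    ty = rng.uniform(-max_shift, max_shift) * height
    cx, cy = width / 2.0, height / 2.0
    a, b = scale * math.cos(angle), scale * math.sin(angle)
    # The translation column keeps the center fixed before the shift is added.
    return [
        [a, -b, cx - a * cx + b * cy + tx],
        [b, a, cy - b * cx - a * cy + ty],
    ]


def apply_affine(matrix, x, y):
    """Map a single point through the 2x3 affine matrix."""
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])
```

In a training pipeline the sampled matrix would be passed to an image warping routine; here `apply_affine` only shows how individual pixel coordinates move.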
It is worth mentioning that applying LocNet from the beginning of training degrades the results, because LocNet cannot get reasonable gradients from a recognizer, which is typically too weak during the first few iterations.
So, in our experiments, we turn LocNet on only after 5k iterations.
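The training schedule described above can be summarized in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the exact iteration at which each change takes effect (for example, whether the drop happens at step 100,000 or 100,001) is not specified in the text.

```python
def learning_rate(step, base_lr=0.001, drop_every=100_000, factor=0.1):
    """Step decay: multiply the rate by `factor` after every `drop_every` iterations."""
    return base_lr * factor ** (step // drop_every)


def locnet_enabled(step, warmup_steps=5_000):
    """LocNet is switched on only once the recognizer has trained for the warm-up period."""
    return step >= warmup_steps
```

Over the 250k training iterations this gives a rate of 0.001 for the first 100k steps, 0.0001 for the next 100k, and 0.00001 for the final 50k.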
All other hyper-parameters were chosen by cross-validation over the target dataset.
The LPRNet baseline network, from which we started our experiments with different architectures, was inspired by [2].
It’s mainly based on Inception blocks followed by a bidirectional LSTM (biLSTM) decoder and trained with CTC loss.
We first performed some experiments aimed at replacing biLSTM with biGRU cells, but we did not observe any clear benefits of using biGRU over biLSTM.
Then, we focused on eliminating the complicated biLSTM decoder, because most modern embedded devices still do not have sufficient compute and memory to efficiently execute biLSTM.
Importantly, our LSTM is applied to a spatial sequence rather than to a temporal one.
Thus all LSTM inputs are known upfront, both at the training stage and at the inference stage.
Therefore, we believe that the RNN can be replaced by spatial convolutions without a significant drop in accuracy.
The RNN-less model with some backbone modifications is referred to as LPRNet basic, and it was described in detail in Sec. 3.
To improve runtime performance we also modified LPRNet basic by using 2 × 2 strides for all pooling layers.
This modification (the LPRNet reduced model) significantly reduces the size of the intermediate feature maps and the total inference computational cost (see the GFLOPs column of Table 4).
Ablation study
It is of vital importance to conduct an ablation study to identify the correlation between various enhancements and the respective accuracy/performance improvements.
This helps other researchers adopt ideas from the paper and reuse the most promising architecture approaches.
Table 5 shows a summary of architecture approaches and their impact on accuracy.
As one can see, the largest accuracy gain (36%) was achieved using the global context.
The data augmentation techniques also help to improve accuracy significantly (by 28.6%).
Without using data augmentation and the global context we could not train the model from scratch.
The STN-based alignment subnetwork provides a noticeable improvement of 2.8-5.2%.
Beam search with post-filtering further improves recognition accuracy by 0.4-0.6%.
In this work, we have shown that fairly small convolutional neural networks can be used for License Plate Recognition.
The LPRNet model was introduced, which can be used for challenging data, achieving up to 95% recognition accuracy.
Architecture details and their motivation were presented, and an ablation study was conducted.
We showed that LPRNet can perform inference in real time on a variety of hardware architectures including CPU, GPU and FPGA.
We have no doubt that LPRNet could attain real-time performance even on more specialized embedded low-power devices.
The LPRNet can likely be compressed using modern pruning and quantization techniques, which would potentially help to reduce the computational complexity even further.
As a future direction of research, the LPRNet work can be extended by merging a CNN-based detection part into our algorithm, so that both detection and recognition are performed by a single network, with the aim of outperforming the quality of the LBP-based cascaded detector.