Improving the quality of the output
2016-05-17 23:24 狼人:-)
There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that, unless you're using a very unusual font or a new language, retraining Tesseract is unlikely to help.
- Image processing
- Page segmentation method
- Dictionaries, word lists, and patterns
- Still having problems?
Image processing
Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.
You can see how Tesseract has processed the image by setting the configuration variable tessedit_write_images to true when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract.
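As a minimal sketch, the command line for this could be assembled as below. It assumes the tesseract executable is on your PATH and that your build accepts configuration variables through the -c option; the function name and file names are made up for illustration.

```python
def debug_image_command(image_path, output_base):
    """Build a tesseract command that sets tessedit_write_images to true.

    Assumption: the installed tesseract accepts "-c name=value" for
    configuration variables. The helper itself is hypothetical.
    """
    return [
        "tesseract", image_path, output_base,
        "-c", "tessedit_write_images=true",
    ]


# To run it (needs tesseract installed):
#   import subprocess
#   subprocess.run(debug_image_command("page.png", "out"), check=True)
# Afterwards, inspect tessinput.tif in the working directory.
```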
Rescaling
Tesseract works best on images with a resolution of at least 300 dpi, so it may be beneficial to resize images. For more information see the FAQ.
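The arithmetic behind rescaling can be sketched as follows. This only computes the new pixel dimensions; the resampling itself would be done by whatever image library you use. The helper name is hypothetical, and the choice to leave images at or above the target untouched is an assumption of this sketch.

```python
def target_size(width, height, current_dpi, target_dpi=300):
    """Return the (width, height) needed to bring an image up to target_dpi.

    Images already at or above the target are left unchanged.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    if current_dpi >= target_dpi:
        return width, height
    scale = target_dpi / current_dpi
    return round(width * scale), round(height * scale)
```

For example, a 1000 x 500 pixel scan made at 150 dpi would need to be enlarged by a factor of two.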
Binarisation
Binarisation converts an image to black and white. Tesseract does this internally, but the result can be suboptimal, particularly if the page background is of uneven darkness.
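One way to binarise an image yourself is to pick a global threshold with Otsu's method, sketched below in plain Python. This is a common technique chosen for illustration, not necessarily what Tesseract does internally, and a single global threshold may still struggle with the uneven backgrounds mentioned above. The image is assumed to be a list of rows of grey values from 0 to 255.

```python
def otsu_threshold(pixels):
    """Return the grey level that best separates dark from light pixels.

    Otsu's method: choose the threshold that maximises the
    between-class variance of the two resulting groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of their grey values
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarise(image):
    """Map a 2D list of grey values to pure black (0) and white (255)."""
    t = otsu_threshold(p for row in image for p in row)
    return [[255 if p > t else 0 for p in row] for row in image]
```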
Noise Removal
Noise is random variation of brightness or colour in an image that can make the text of the image more difficult to read. Certain types of noise cannot be removed by Tesseract in the binarisation step, which can cause accuracy rates to drop.
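A median filter is one simple way to remove isolated specks before OCR. The sketch below is an illustrative pure-Python version working on a list of rows of grey values; a 3x3 window is an assumption, and it may also erode very thin strokes, so check the result on your own images.

```python
def median_filter_3x3(image):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Windows are clipped at the image edges. For even-sized windows the
    upper of the two middle values is used so results stay integers.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [list(row) for row in image]
    for y in range(height):
        for x in range(width):
            window = sorted(
                image[yy][xx]
                for yy in range(max(0, y - 1), min(height, y + 2))
                for xx in range(max(0, x - 1), min(width, x + 2))
            )
            out[y][x] = window[len(window) // 2]
    return out
```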
Rotation / Deskewing
A skewed image is one where a page was not straight when it was scanned. The quality of Tesseract's line segmentation reduces significantly if a page is too skewed, which severely impacts the quality of the OCR. To address this, rotate the page image so that the text lines are horizontal.
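To know how far to rotate, you first need an estimate of the skew angle. The sketch below uses a projection-profile search, which is one common approach and not something the text above prescribes: for each candidate angle it projects the text pixels onto rows and keeps the angle that gives the sharpest row histogram. The search range, step size and function name are assumptions. The input is a list of (x, y) coordinates of dark pixels, with y increasing downwards.

```python
import math


def estimate_skew_degrees(points, max_angle=5.0, step=0.5):
    """Estimate the slope angle of text lines from foreground pixel coordinates.

    A positive result means the lines descend to the right in image
    coordinates; rotating the page by the opposite angle levels them.
    """
    points = list(points)
    if not points:
        return 0.0
    best_angle, best_score = 0.0, -1
    n_steps = int(round(2 * max_angle / step))
    for i in range(n_steps + 1):
        angle = -max_angle + i * step
        sin_a = math.sin(math.radians(angle))
        cos_a = math.cos(math.radians(angle))
        rows = {}
        for x, y in points:
            # Row index of the pixel after undoing a skew of `angle`.
            r = round(y * cos_a - x * sin_a)
            rows[r] = rows.get(r, 0) + 1
        # Sum of squares is largest when pixels pile up in few rows.
        score = sum(n * n for n in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The rotation itself is left to your image library of choice.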
Border Removal
Scanned pages often have dark borders around them. These can be erroneously picked up as extra characters, especially if they vary in shape and gradation.
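A rough sketch of border removal is shown below: it strips rows and columns from each edge for as long as they are mostly dark. The darkness threshold and the 50% cut-off are arbitrary assumptions to tune for your scans, and the helper name is made up. The image is a list of rows of grey values from 0 to 255.

```python
def crop_dark_border(image, dark_below=64, max_dark_fraction=0.5):
    """Trim edge rows and columns that consist mostly of dark pixels."""

    def mostly_dark(values):
        values = list(values)
        dark = sum(1 for v in values if v < dark_below)
        return dark > max_dark_fraction * len(values)

    top, bottom = 0, len(image)
    while top < bottom and mostly_dark(image[top]):
        top += 1
    while bottom > top and mostly_dark(image[bottom - 1]):
        bottom -= 1

    body = image[top:bottom]
    left, right = 0, (len(image[0]) if image else 0)
    while left < right and mostly_dark(row[left] for row in body):
        left += 1
    while right > left and mostly_dark(row[right - 1] for row in body):
        right -= 1

    return [list(row[left:right]) for row in body]
```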
Tools / Libraries
Examples
If you need an example of how to improve image quality programmatically, have a look at these examples:
- OpenCV - Rotation (Deskewing) - c++ example
- Fred's ImageMagick TEXTCLEANER - bash script for processing a scanned document of text to clean the text background.
- rotation_spacing.py - python script for automatic detection of rotation and line spacing of an image of text
- crop_morphology.py - Finding blocks of text in an image using Python, OpenCV and numpy
Page segmentation method
By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode using the -psm argument. Note that adding a white border to text which is too tightly cropped may also help; see issue 398.
To see a complete list of supported page segmentation modes, use tesseract -h. Here's the list as of 3.04:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
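The two suggestions above, choosing a segmentation mode and padding a tightly cropped image, could be sketched like this. The helper names and the 10 pixel margin are assumptions. The option is spelled -psm as in the 3.04 usage described here; later Tesseract releases spell it --psm, so check tesseract -h for your version.

```python
def add_white_border(image, margin=10, white=255):
    """Return a copy of a 2D greyscale image with a white margin on all sides."""
    width = len(image[0]) if image else 0
    blank = [white] * (width + 2 * margin)
    padded = [blank[:] for _ in range(margin)]
    for row in image:
        padded.append([white] * margin + list(row) + [white] * margin)
    padded.extend(blank[:] for _ in range(margin))
    return padded


def psm_command(image_path, output_base, psm):
    """Build a tesseract command line that selects a page segmentation mode."""
    if not 0 <= psm <= 10:
        raise ValueError("psm must be between 0 and 10 (list as of 3.04)")
    return ["tesseract", image_path, output_base, "-psm", str(psm)]


# Example: treat the image as a single text line (mode 7).
#   import subprocess
#   subprocess.run(psm_command("line.png", "out", 7), check=True)
```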
Dictionaries, word lists, and patterns
By default Tesseract is optimized to recognize sentences of words. If you're trying to recognize something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.
Disabling the dictionaries Tesseract uses should increase recognition if most of your text isn't dictionary words. They can be disabled by setting both of the configuration variables load_system_dawg and load_freq_dawg to false.
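One way to set both variables is through a config file passed as the last argument on the command line. The sketch below assumes the usual config file layout of one "name value" pair per line and that your build accepts a path to such a file; the file name nodict and the helper names are made up.

```python
def no_dictionary_config_text():
    """Return config file contents that switch off both dictionaries."""
    settings = {
        "load_system_dawg": "false",
        "load_freq_dawg": "false",
    }
    return "".join(f"{name} {value}\n" for name, value in settings.items())


def command_with_config(image_path, output_base, config_path):
    """Build a tesseract command that loads an extra config file."""
    return ["tesseract", image_path, output_base, config_path]


# Example usage (file name is hypothetical):
#   with open("nodict", "w") as fh:
#       fh.write(no_dictionary_config_text())
#   subprocess.run(command_with_config("receipt.png", "out", "nodict"), check=True)
```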
It is also possible to add words to the word list Tesseract uses to help recognition, or to add common character patterns, which can further help to improve accuracy if you have a good idea of the sort of input you expect. This is explained in more detail in the Tesseract manual.
If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the tessedit_char_whitelist configuration variable. See the FAQ for an example.
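For the digits-only case, a command line could be built as in this sketch. It assumes your tesseract build accepts configuration variables through the -c option; the helper name is hypothetical, and the FAQ example may use a different mechanism.

```python
import string


def whitelist_command(image_path, output_base, allowed=string.digits):
    """Build a tesseract command restricted to the characters in `allowed`."""
    if not allowed:
        raise ValueError("allowed must contain at least one character")
    return [
        "tesseract", image_path, output_base,
        "-c", "tessedit_char_whitelist=" + allowed,
    ]
```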
Still having problems?
If you've tried the above and are still getting low accuracy results, ask on the forum for help, ideally posting an example image.