OCR Text Recognition

2013-07-01

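OCR (Optical Character Recognition) is the process of analyzing and recognizing the text in an image file and extracting it as text.

Tesseract is an open-source OCR engine. It was originally developed at HP Labs and later contributed to the open-source community; Google then improved it, fixed bugs, optimized it, and re-released it.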

http://code.google.com/p/tesseract-ocr/

Summary: Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.

Supported Platforms: Tesseract works on Linux, Windows (with VC++ Express or CygWin) and Mac OSX. See the ReadMe for more details and install instructions. It can also be compiled for other platforms, including Android and the iPhone, though these are not as well tested platforms. See also the AddOns page for other projects using Tesseract on various platforms.

----------------------------------------------------------

1) Installing tesseract on Linux: http://code.google.com/p/tesseract-ocr/wiki/Compiling

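Usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

For example:

    tesseract code.jpg result -l chi_sim -psm 7 nobatch

-l chi_sim selects the Simplified Chinese language data. The language data file has to be downloaded, unpacked, and placed in the tessdata directory. Language data files use the .traineddata extension; the Simplified Chinese file is named chi_sim.traineddata.

-psm 7 tells tesseract that code.jpg contains a single line of text, which can reduce the recognition error rate. The default is 3.

configfile values are the names of files in the tessdata\configs and tessdata\tessconfigs directories.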
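The same command line can be driven from a script. The following is only a minimal Python sketch, not part of the original notes: the helper names build_tesseract_command and recognize are hypothetical, and it assumes the tesseract executable is on the PATH and that the result is written to outputbase.txt as UTF-8. The -psm spelling follows the usage line in these notes; newer Tesseract releases spell the flag --psm, so adjust it for your version.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang=None, psm=None, configs=()):
    """Assemble the argument list for:
    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
    (hypothetical helper; the flag spelling follows the usage line in these notes)
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if lang is not None:
        cmd += ["-l", lang]          # e.g. "chi_sim" for Simplified Chinese
    if psm is not None:
        cmd += ["-psm", str(psm)]    # e.g. 7 = treat the image as one text line
    cmd += list(configs)             # e.g. ["nobatch"] or ["digits"]
    return cmd


def recognize(image, output_base, **options):
    """Run tesseract and return the recognized text.

    Assumes tesseract is installed and writes its result to
    "<output_base>.txt" encoded as UTF-8.
    """
    cmd = build_tesseract_command(image, output_base, **options)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Show the command that would be run for the example in the text.
print(build_tesseract_command("code.jpg", "result",
                              lang="chi_sim", psm=7, configs=["nobatch"]))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces. Calling recognize("code.jpg", "result", lang="chi_sim", psm=7, configs=["nobatch"]) would then run the example command and return the contents of result.txt.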

6) Calling tesseract-ocr from Java: http://blog.sina.com.cn/s/blog_025270e90101avgb.html

7) Using tesseract-ocr on Windows: http://blog.csdn.net/xiaochunyong/article/details/7193744

8) Recognizing digits only: tesseract imagename outputbase digits
