Tesseract 3.01: garbled Chinese output

2013-10-31

C# code:
Bitmap bmp = (Bitmap)Bitmap.FromFile(@"E:\12.png");
TesseractProcessor tp = new TesseractProcessor();
if (tp.Init(null, "chi_sim", 7))
{
    // Run OCR on the bitmap; result holds the recognized text.
    string result = tp.Apply(bmp);
}
I ran the code above on the image (Tesseract version 3.01) and got:
result == "涓腑涓腑涓腑涓腑涓?涓腑\n\n";
Applying Encoding.UTF8.GetString(Encoding.GetEncoding("GB2312").GetBytes(result)) gives:
"中中中中中中中中 ?中中\n\n";

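It looks as though Tesseract's Chinese OCR output is encoded as GB2312, but the spaces between characters keep turning part of the result into garbage.
Encoding.GetEncoding("gb2312").GetBytes(result) returns this byte[]:
228,184,173,228,184,173,32,228,184,173,228,184,173,228,184,173,228,184,173,32,228,184,63,228,184,173,228,184,173,10,10

Here 32 is a space and the three bytes 228,184,173 are one "中" character, but there are always plenty of sequences like "228,184,63" that come out garbled. Any pointers from the experts would be appreciated. Thanks.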
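
The byte dump above is consistent with UTF-8 text having been decoded as GB2312 somewhere before result is returned: 228,184,173 is the UTF-8 encoding of "中", and a 63 ('?') turns up where the three-byte groups fall out of step with GB2312's two-byte pairs, for example a single character followed directly by a space. The sketch below reproduces that effect without Tesseract. It is an illustration under assumptions, not a diagnosis of the wrapper: MojibakeDemo, RoundTripThroughGb2312 and IsLossy are made-up names, and the exact replacement bytes may differ between .NET versions.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Decodes UTF-8 bytes with the wrong code page (GB2312), then encodes the
    // garbled string back to bytes, mirroring the round trip tried in the post.
    public static byte[] RoundTripThroughGb2312(byte[] utf8Bytes)
    {
        // Needed on .NET Core / .NET 5+ so that "GB2312" can be resolved.
        // On .NET Framework the encoding is built in and this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb2312 = Encoding.GetEncoding("GB2312");
        string garbled = gb2312.GetString(utf8Bytes);
        return gb2312.GetBytes(garbled);
    }

    // True when the round trip fails to give back the original UTF-8 bytes.
    public static bool IsLossy(string text)
    {
        byte[] original = Encoding.UTF8.GetBytes(text);
        byte[] back = RoundTripThroughGb2312(original);
        if (original.Length != back.Length) return true;
        for (int i = 0; i < original.Length; i++)
        {
            if (original[i] != back[i]) return true;
        }
        return false;
    }

    static void Main()
    {
        // "\u4e2d" is the character from the post. One three-byte character
        // followed by a space leaves a dangling byte that cannot pair up into
        // a valid GB2312 character.
        string sample = "\u4e2d \u4e2d\n";
        byte[] original = Encoding.UTF8.GetBytes(sample);
        Console.WriteLine(string.Join(",", original));
        Console.WriteLine(string.Join(",", RoundTripThroughGb2312(original)));
        Console.WriteLine("lossy: " + IsLossy(sample));
    }
}
```

Once a byte has been replaced by '?', no later call to Encoding.GetBytes can bring it back, which is why the re-encoding trick in the post only repairs part of the text.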

The following example behaves the same way:
string result = tp.Apply(bmp);
result == "浣犱綘浼?\n";
Applying Encoding.UTF8.GetString(Encoding.GetEncoding("GB2312").GetBytes(result)) gives:
"你你 ?\n";
Encoding.GetEncoding("gb2312").GetBytes(result) returns this byte[]:
228,189,160,228,189,160,32,228,188,63,10
[Solution]
An image is binary data, so handle the conversion at the byte level.

It has nothing to do with GB2312.
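
One way to read this advice: take the bytes the engine produced and decode them as UTF-8 yourself, so the system's default code page never touches them. Below is a minimal sketch, assuming the native result is a NUL-terminated UTF-8 buffer and that you can get a pointer to it. How to obtain that pointer from the wrapper is not covered in the post, and Utf8Interop, PtrToStringUtf8 and DecodeViaNativeBuffer are hypothetical names. DecodeViaNativeBuffer fakes the native buffer with Marshal.AllocHGlobal so the helper can be tried without Tesseract.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class Utf8Interop
{
    // Reads a NUL-terminated UTF-8 string from unmanaged memory.
    // Returns an empty string for a null pointer.
    public static string PtrToStringUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero) return string.Empty;

        // Find the terminating zero byte to learn the length.
        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0) length++;

        // Copy the raw bytes out and decode them explicitly as UTF-8.
        byte[] buffer = new byte[length];
        Marshal.Copy(ptr, buffer, 0, length);
        return Encoding.UTF8.GetString(buffer);
    }

    // Stand-in for a native call: copies the bytes into unmanaged memory,
    // appends a NUL terminator, and decodes them through PtrToStringUtf8.
    public static string DecodeViaNativeBuffer(byte[] utf8Bytes)
    {
        IntPtr native = Marshal.AllocHGlobal(utf8Bytes.Length + 1);
        try
        {
            Marshal.Copy(utf8Bytes, 0, native, utf8Bytes.Length);
            Marshal.WriteByte(native, utf8Bytes.Length, 0);
            return PtrToStringUtf8(native);
        }
        finally
        {
            Marshal.FreeHGlobal(native);
        }
    }

    static void Main()
    {
        // 228,184,173 is the three-byte group identified as one character in the post.
        byte[] utf8 = { 228, 184, 173, 32, 228, 184, 173, 10 };
        Console.Write(DecodeViaNativeBuffer(utf8));
    }
}
```

Because the bytes are decoded directly as UTF-8, spaces and newlines next to a character no longer matter, and no GB2312 step is involved.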
