#Introduction to Using Tesseract OCR

##Contents
[TOC]

##Download links and overview

Official documentation: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
GitHub source: https://github.com/tesseract-ocr
Open-source contributor's homepage: https://kevintechnology.com/

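##Installing Tesseract

Browse the available language packs (MacPorts): https://www.macports.org/ports.php?by=name&substr=tesseract-
Tesseract runs on Windows, Linux, and macOS.

1. Install tesseract and its language packs with MacPorts
sudo port install tesseract
sudo port install tesseract-<langcode>

2. Install with Homebrew (the second form also builds the training tools)
brew install tesseract
brew install --with-training-tools tesseract

3. Reinstall with the training tools
brew uninstall tesseract
brew install --with-training-tools tesseract

Homebrew is a package manager. If it is not installed yet, run the following in a terminal:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"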

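##Using Tesseract

Image recognition is run from the command line.
imagename is the name of the image file to recognize, and outputbase is the base name of the file that receives the result.
lang is the code of the language to recognize, for example eng for English or chi_sim for Simplified Chinese. Several languages can be recognized at once by joining their codes with "+", for example eng+chi_sim. English is used when lang is omitted.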

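1. Command format

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

2. Example: recognize image.png and save the result to out.txt (the second command treats the image as a single character)

tesseract image.png out -l chi_sim
tesseract image.png out -l chi_sim -psm 10

3. pagesegmode selects the page segmentation mode (Tesseract 4 and later spell the option --psm). The available modes are: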

•	0 = Orientation and script detection (OSD) only.
•	1 = Automatic page segmentation with OSD.
•	2 = Automatic page segmentation, but no OSD, or OCR
•	3 = Fully automatic page segmentation, but no OSD. (Default)
•	4 = Assume a single column of text of variable sizes.
•	5 = Assume a single uniform block of vertically aligned text.
•	6 = Assume a single uniform block of text.
•	7 = Treat the image as a single text line.
•	8 = Treat the image as a single word.
•	9 = Treat the image as a single word in a circle.
•	10 = Treat the image as a single character.
•	11 = Sparse text. Find as much text as possible in no particular order.
•	12 = Sparse text with OSD.
•	13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
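The command format and the mode list above can be wrapped in a small helper. The following is a minimal sketch in POSIX shell, not part of the original write-up: the function names `join_langs` and `build_ocr_cmd` are hypothetical, and the helper only prints the command instead of running it, so it works without Tesseract installed. It assumes the Tesseract 3.x spelling `-psm` used in this article.

```shell
# Join language codes with "+", e.g. "eng chi_sim" -> "eng+chi_sim".
join_langs() {
    result=""
    for code in "$@"; do
        if [ -z "$result" ]; then
            result="$code"
        else
            result="$result+$code"
        fi
    done
    printf '%s\n' "$result"
}

# Print a tesseract command line after checking the page segmentation mode.
# Usage: build_ocr_cmd IMAGE OUTPUTBASE PSM [LANG...]
build_ocr_cmd() {
    image="$1"; outputbase="$2"; psm="$3"
    shift 3
    # English is the default when no language is given.
    if [ "$#" -eq 0 ]; then
        set -- eng
    fi
    case "$psm" in
        [0-9]|1[0-3]) ;;   # modes 0..13 from the list above
        *) echo "invalid pagesegmode: $psm" >&2; return 1 ;;
    esac
    printf 'tesseract %s %s -l %s -psm %s\n' \
        "$image" "$outputbase" "$(join_langs "$@")" "$psm"
}

build_ocr_cmd image.png out 10 eng chi_sim
# prints: tesseract image.png out -l eng+chi_sim -psm 10
```

Printing the command first makes it easy to review the language string and mode before running the real recognition.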

##Training samples

Training tools: https://github.com/tesseract-ocr/tesseract/wiki/AddOns
Training tutorial: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract
Improving recognition quality: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
Cleaning up text backgrounds: http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
Extracting text regions: http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html
The steps below use jTessBoxEditor as the example tool.

> 1. Collect images that contain the text to train on
> 2. Convert the images to TIFF format
> 3. Merge the TIFF images in jTessBoxEditor (Tool -> Merge TIFF)

Name the merged image according to the convention [lang].[fontname].exp[num].tif
[lang] is the language, [fontname] is the font, and [num] is a sequence number.
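The naming convention can be built and taken apart with plain parameter expansion. This is a minimal sketch in POSIX shell; `training_name` and `parse_training_name` are hypothetical helper names introduced only for illustration, and the parser assumes the language code itself contains no dot.

```shell
# Build a training image name following [lang].[fontname].exp[num].tif.
training_name() {
    printf '%s.%s.exp%s.tif\n' "$1" "$2" "$3"
}

# Split such a name back into "lang fontname num".
parse_training_name() {
    base="${1%.tif}"       # strip the extension
    lang="${base%%.*}"     # text before the first dot
    rest="${base#*.}"      # fontname.expN
    font="${rest%.exp*}"   # text before ".exp"
    num="${rest##*.exp}"   # digits after ".exp"
    printf '%s %s %s\n' "$lang" "$font" "$num"
}

training_name hz font 0                # prints: hz.font.exp0.tif
parse_training_name hz.font.exp0.tif   # prints: hz font 0
```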

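###1. Make Box Files

Run Tesseract on the image to generate a box file:

tesseract hz.font.exp0.tif hz.font.exp0 -l chi_sim -psm 10 batch.nochop makebox

Then correct any recognition errors in the box file. Make sure the tif and box files have the same base name and sit in the same directory, then open the tif file in jTessBoxEditor, or edit the box file directly in a text editor.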
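The requirement that the tif and box files share a base name and a directory is easy to check before opening the editor. A minimal sketch in POSIX shell; `has_matching_box` is a hypothetical helper name, not part of Tesseract or jTessBoxEditor.

```shell
# Succeed only if a .box file sits next to the .tif with the same base name.
has_matching_box() {
    tif="$1"
    box="${tif%.tif}.box"
    [ -f "$tif" ] && [ -f "$box" ]
}

if has_matching_box hz.font.exp0.tif; then
    echo "box file found"
else
    echo "box file missing or misnamed"
fi
```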

###2. Run Tesseract for Training

Train Tesseract with the corrected box file to generate a .tr file:

tesseract hz.font.exp0.tif hz.font.exp0 -psm 10 nobatch box.train

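###3. Compute the Character Set

Generate the character set file (unicharset) from the box files:

unicharset_extractor hz.font.exp0.box hz.font.exp1.box

For versions after 3.03, the unicharset properties also need to be set:
training/set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=training/langdata

The resulting file should look like this: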

110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
...
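In the sample above, the first line holds the number of entries and every following line describes one character. A quick consistency check can catch a truncated or hand-edited file. This is a minimal sketch in POSIX shell with awk; `check_unicharset` is a hypothetical helper name, and it assumes blank lines should be ignored.

```shell
# Compare the count on line 1 with the number of entry lines that follow.
check_unicharset() {
    awk 'NR == 1 { declared = $1; next }
         NF      { entries++ }
         END     { exit (declared == entries ? 0 : 1) }' "$1"
}
```

For example, `check_unicharset unicharset || echo "entry count mismatch"` reports a file whose header no longer matches its body.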

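###4. font_properties (new in 3.01)

Define the font properties file. Tesseract-OCR 3.01 and later require a file named font_properties to exist before training. The file must not contain a BOM, and each line has this format:

`<fontname> <italic> <bold> <fixed> <serif> <fraktur>`

fontname is the name of the font and must match the name used in [lang].[fontname].exp[num].box. `<italic>`, `<bold>`, `<fixed>`, `<serif>`, and `<fraktur>` are each 1 or 0, indicating whether the font has that property.
The font here is a regular one, neither italic nor bold, so create a file named font_properties containing: font 0 0 0 0 0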
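A font_properties line can be generated with a small helper that rejects anything other than 0 or 1 for the five flags. This is a minimal sketch in POSIX shell; `font_properties_line` is a hypothetical name used only for illustration.

```shell
# Print one line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>.
font_properties_line() {
    if [ "$#" -ne 6 ]; then
        echo "usage: font_properties_line NAME ITALIC BOLD FIXED SERIF FRAKTUR" >&2
        return 1
    fi
    name="$1"
    shift
    # Each of the five remaining arguments must be 0 or 1.
    for flag in "$@"; do
        case "$flag" in
            0|1) ;;
            *) echo "flag must be 0 or 1: $flag" >&2; return 1 ;;
        esac
    done
    printf '%s %s %s %s %s %s\n' "$name" "$@"
}

font_properties_line font 0 0 0 0 0   # prints: font 0 0 0 0 0
```

Redirecting the output, as in `font_properties_line font 0 0 0 0 0 > font_properties`, writes plain bytes without a BOM.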

###5. Clustering

The clustering step produces four files (inttemp, pffmtable, normproto, shapetable):

```
shapeclustering -F font_properties -U unicharset hz.font.exp0.tr hz.font.exp1.tr ...

mftraining -F font_properties -U unicharset -O hz.unicharset hz.font.exp0.tr hz.font.exp1.tr ...

cntraining hz.font.exp0.tr hz.font.exp1.tr ...
```

* The generated files need a language prefix; here they are renamed to hz.inttemp, hz.pffmtable, hz.normproto, and hz.shapetable.
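The renaming can be done in one loop. A minimal sketch in POSIX shell, assuming the four clustering outputs are in the current directory; `add_lang_prefix` is a hypothetical helper name.

```shell
# Add "<lang>." in front of the clustering outputs in the current directory.
add_lang_prefix() {
    lang="$1"
    for name in inttemp pffmtable normproto shapetable; do
        if [ -f "$name" ]; then
            mv "$name" "$lang.$name"
        else
            echo "missing: $name" >&2
        fi
    done
}
```

Calling `add_lang_prefix hz` after clustering produces hz.inttemp, hz.pffmtable, hz.normproto, and hz.shapetable, and reports any file that was not generated.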

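###6. Putting it all together

* Generate the final traineddata file

combine_tessdata hz.

###7. Usage example

* Recognize an image with the trained data (after copying the resulting hz.traineddata into Tesseract's tessdata directory)

tesseract test.png out -l hz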

##Running the steps as a script

```
#!/bin/bash
# Ask for the language and font that make up the training file name.
read -p "Enter the language: " lang
echo "${lang}"
read -p "Enter the font: " font
echo "${font}"
echo "Full file name:"
echo "${lang}.${font}.exp0.tif"
echo "Starting..."
echo "${font} 0 0 0 0 0" > font_properties

# Uncomment to regenerate the box file (this overwrites manual corrections):
#tesseract ${lang}.${font}.exp0.tif ${lang}.${font}.exp0 -l chi_sim -psm 10 batch.nochop makebox

#read -p "Continue and generate the tr file?"
tesseract "${lang}.${font}.exp0.tif" "${lang}.${font}.exp0" -psm 10 nobatch box.train
unicharset_extractor "${lang}.${font}.exp0.box"
shapeclustering -F font_properties -U unicharset "${lang}.${font}.exp0.tr"
mftraining -F font_properties -U unicharset -O unicharset "${lang}.${font}.exp0.tr"
cntraining "${lang}.${font}.exp0.tr"
echo "Renaming files"
# Prefix with the language code, matching step 5 (hz.inttemp, ...).
mv inttemp "${lang}.inttemp"
mv normproto "${lang}.normproto"
mv pffmtable "${lang}.pffmtable"
mv shapetable "${lang}.shapetable"
mv unicharset "${lang}.unicharset"
echo "Generating the final file"
combine_tessdata "${lang}."
echo "Done"
```
