Using the tesserocr and pytesseract modules

1. Using tesserocr

# Recognize text from an image file

In [7]: tesserocr.file_to_text('image.png')

Out[7]: 'Python3WebSpider\n\n'

# List the language packs installed for Tesseract

In [8]: tesserocr.get_languages()

Out[8]: ('/usr/share/tesseract/tessdata/', ['eng'])

# Recognize text from image data (im is a PIL Image object)

In [9]: tesserocr.image_to_text(im)

Out[9]: 'Python3WebSpider\n\n'

# Show version information

In [10]: tesserocr.tesseract_version()

Out[10]: 'tesseract 3.04.00\n leptonica-1.72\n  libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0\n'
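Since pytesseract's image_to_data needs Tesseract 3.05 or later, it can help to turn the version string above into something comparable. The following is a minimal sketch using only the standard library. parse_tesseract_version is a hypothetical helper, not part of tesserocr, and it assumes the text begins with 'tesseract X.Y' or 'tesseract X.Y.Z'.

```python
import re


def parse_tesseract_version(version_text):
    """Extract (major, minor, patch) from text such as 'tesseract 3.04.00\\n ...'.

    Hypothetical helper: assumes the text starts with 'tesseract X.Y[.Z]'.
    """
    match = re.match(r"\s*tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version_text)
    # A missing patch component is treated as 0.
    major, minor, patch = match.groups(default="0")
    return int(major), int(minor), int(patch)


# The first two lines of the version text shown in Out[10].
version = parse_tesseract_version("tesseract 3.04.00\n leptonica-1.72\n")
supports_image_to_data = version >= (3, 5, 0)
print(version, supports_image_to_data)
```

In a real session the argument would be the return value of tesserocr.tesseract_version().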

2. Using pytesseract

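Functions:

get_tesseract_version    Returns the version of Tesseract installed on the system.
image_to_string    Returns the result of running Tesseract OCR on the image as a string.
image_to_boxes    Returns the recognized characters and their bounding boxes.
image_to_data    Returns bounding boxes, confidences, and other information. Requires Tesseract 3.05+. For more information, see the Tesseract TSV documentation.
image_to_osd    Returns information about orientation and script detection.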
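To make the image_to_boxes result easier to picture, here is a small sketch that parses text in Tesseract's box format, where each line is "char left bottom right top page" and, in that format, coordinates are measured from the bottom-left corner of the image. CharBox and parse_boxes are hypothetical names, and the sample string is made up rather than real OCR output.

```python
from collections import namedtuple

# One recognized character and its bounding box.
CharBox = namedtuple("CharBox", "char left bottom right top page")


def parse_boxes(box_text):
    """Parse box-format text: one 'char left bottom right top page' record per line."""
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so the five numeric fields are always separated.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        left, bottom, right, top, page = map(int, parts[1:])
        boxes.append(CharBox(parts[0], left, bottom, right, top, page))
    return boxes


# Made-up sample in the box format, not real OCR output.
sample = "P 10 20 30 60 0\ny 32 12 50 45 0\n"
for box in parse_boxes(sample):
    # Print each character with its box width and height.
    print(box.char, box.right - box.left, box.top - box.bottom)
```

With pytesseract installed, the string returned by image_to_boxes could be passed to parse_boxes in place of the sample.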

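Parameters:

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING)

image    Object: the image to process.
lang    String: the Tesseract language code.
config    String: any additional configuration, passed as a string, for example config='--psm 6'.
nice    Integer: adjusts the processor priority of the Tesseract run. Not supported on Windows. nice adjusts the niceness of Unix-like processes.
output_type    Class attribute: specifies the type of the output; defaults to string. For the full list of supported types, see the definition of the pytesseract.Output class.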

from PIL import Image

import pytesseract

# If the tesseract executable is not on your PATH, specify its path (raw string, so the backslashes are kept literally)

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

# Print the text recognized in the image

print(pytesseract.image_to_string(Image.open('test.png')))

# Recognize the text in a specified language; eng is English

print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='eng'))

# Get the bounding boxes of the recognized characters

print(pytesseract.image_to_boxes(Image.open('test.png')))

# Get detailed data including bounding boxes, confidences, and line and page numbers

print(pytesseract.image_to_data(Image.open('test.png')))

# Get orientation and script detection information

print(pytesseract.image_to_osd(Image.open('test.png')))
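By default image_to_data returns tab-separated text whose header row is level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text. The sketch below shows one way to keep only the words above a confidence threshold using the standard library. confident_words and the 60.0 cutoff are illustrative assumptions, and the sample rows are made up.

```python
import csv
import io


def confident_words(tsv_text, min_conf=60.0):
    """Return (text, conf) pairs whose confidence is at least min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])  # structural rows (page, block, line) carry -1
        if text and conf >= min_conf:
            words.append((text, conf))
    return words


header = ("level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
          "left\ttop\twidth\theight\tconf\ttext")
# Made-up rows in the TSV layout, not real OCR output.
sample = "\n".join([
    header,
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t50\t-1\t",
    "5\t1\t1\t1\t1\t1\t10\t8\t180\t30\t91\tPython3WebSpider",
    "5\t1\t1\t1\t1\t2\t192\t8\t6\t30\t35\t.",
])
print(confident_words(sample))
```

Filtering on conf is a simple way to discard noise such as stray punctuation before using the result.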

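A simple image recognition application

Recognizing a CAPTCHA usually requires converting the image to grayscale and then binarizing it so that the text stands out. Below is a simple example that processes and recognizes a CAPTCHA image. More complex CAPTCHAs, such as those crossed by several lines as thick as the characters, also require deskewing and segmenting the text, and even then the recognition rate is only about 30%, so another way of getting past the verification has to be found.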

from PIL import Image

import pytesseract

im = Image.open('66.png')

# Binarize the image; takes the image and a threshold

def erzhihua(image,threshold):

    ''':type image:Image.Image'''

    image=image.convert('L')

    table=[]

    for i in range(256):

        if i <  threshold:

            table.append(0)

        else:

            table.append(1)

    return image.point(table,'1')  # mode '1' produces a bilevel (black-and-white) image

image=erzhihua(im,127)

image.show()

result=pytesseract.image_to_string(image,lang='eng')

print(result)
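The binarization step works by building a 256-entry lookup table and mapping every grayscale level through it. The sketch below isolates that table logic in plain Python so it can be inspected without an image. build_threshold_table is a hypothetical name, and the pixel values are made up.

```python
def build_threshold_table(threshold):
    """Return a 256-entry lookup table: 0 below the threshold, 1 at or above it."""
    return [0 if level < threshold else 1 for level in range(256)]


table = build_threshold_table(127)

# Hypothetical grayscale pixel values (0 = black, 255 = white).
pixels = [0, 60, 126, 127, 200, 255]
print([table[p] for p in pixels])
```

Lowering the threshold turns more pixels white and raising it turns more pixels black, so trying a few values on a sample image is a reasonable way to choose one.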

Simulating a login with automatic CAPTCHA recognition:

#!/usr/bin/env python

# -*- coding: utf-8 -*-

# @Time    : 2018/7/13 8:58

# @Author  : Py.qi

# @File    : login.py

# @Software: PyCharm

from selenium import webdriver

from selenium.common.exceptions import TimeoutException,WebDriverException

from selenium.webdriver.common.by import By

from selenium.webdriver.common.keys import Keys

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC

from selenium.webdriver.remote.webelement import WebElement

from io import BytesIO

from PIL import Image

import pytesseract

import time

user='zhang'

password=''

url='http://10.0.0.200'

driver=webdriver.Chrome()

wait=WebDriverWait(driver,10)

# Recognize the CAPTCHA

def acker(content):

    im_erzhihua=erzhihua(content,127)

    result=pytesseract.image_to_string(im_erzhihua,lang='eng')

    return result.strip()  # drop the whitespace and newlines Tesseract appends to its output

# Binarize the CAPTCHA image

def erzhihua(image,threshold):

    ''':type image:Image.Image'''

    image=image.convert('L')

    table=[]

    for i in range(256):

        if i <  threshold:

            table.append(0)

        else:

            table.append(1)

    return image.point(table,'1')  # mode '1' produces a bilevel (black-and-white) image

# Log in automatically

def login():

    try:

        driver.get(url)

        # Locate the username input box

        input=wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#loginname'))) #type:WebElement

        input.clear()

        # Enter the username

        input.send_keys(user)

        # Locate the password input box

        inpass=wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#password'))) #type:WebElement

        inpass.clear()

        # Enter the password

        inpass.send_keys(password)

        # Locate the CAPTCHA input box

        yanzheng=wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#code'))) #type:WebElement

        # Get the position of the CAPTCHA image on the page

        codeimg=wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#codeImg'))) #type:WebElement

        image_location = codeimg.location

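        # Take a screenshot of the page and crop out the CAPTCHA region (488 and 473 are the hard-coded right and lower edges for this page)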

        image=driver.get_screenshot_as_png()

        im=Image.open(BytesIO(image))

        imag_code=im.crop((image_location['x'],image_location['y'],488,473))

        # Enter the CAPTCHA and log in

        yanzheng.clear()

        yanzheng.send_keys(acker(imag_code))

        time.sleep(2)

        yanzheng.send_keys(Keys.ENTER)

    except TimeoutException as e:

        print('timeout:',e)

    except WebDriverException as e:

        print('webdriver error:',e)

if __name__ == '__main__':

    login()
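The crop in the script uses fixed numbers for the right and lower edges, which only fit one page layout. One possible alternative is to derive the box from the element's location and size. The sketch below assumes dicts shaped like Selenium's WebElement.location and WebElement.size. crop_box and its scale parameter (an assumed ratio of screenshot pixels to CSS pixels) are hypothetical, and the numbers are made up.

```python
def crop_box(location, size, scale=1.0):
    """Build a (left, upper, right, lower) box suitable for Image.crop.

    location and size are dicts shaped like WebElement.location ({'x', 'y'})
    and WebElement.size ({'width', 'height'}). scale is an assumed
    device-pixel ratio: screenshot pixels per CSS pixel.
    """
    left = location["x"] * scale
    upper = location["y"] * scale
    right = (location["x"] + size["width"]) * scale
    lower = (location["y"] + size["height"]) * scale
    return tuple(int(round(value)) for value in (left, upper, right, lower))


# Made-up numbers standing in for codeimg.location and codeimg.size.
print(crop_box({"x": 408, "y": 443}, {"width": 80, "height": 30}))
```

In the script this could be used as im.crop(crop_box(codeimg.location, codeimg.size)), provided the CAPTCHA element is visible in the screenshot without scrolling.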

Original article: https://www.cnblogs.com/-qing-/p/11027821.html
