Searching for a string in different file types (doc, txt, pdf) with Python

TXT files:

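    import os

    def txt_handler(self, f_name, find_str):
        """
        Search a txt file for a string.

        :param f_name: path of the file to search
        :param find_str: string to look for
        :return: dict with 'file_name' and 'line_count' (1-based line number
                 of the first match), or an empty dict if nothing is found
        """
        file_str_dict = {}
        if os.path.exists(f_name):
            # 'with' closes the file even when the loop exits early
            with open(f_name, 'r', encoding='utf-8') as f:
                for line_count, line in enumerate(f, start=1):
                    if find_str in line:
                        file_str_dict['file_name'] = f_name
                        file_str_dict['line_count'] = line_count
                        break
        return file_str_dict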
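
The handler above stops at the first matching line. As a minimal sketch of a variation, the standalone function below collects the numbers of all matching lines instead. The name find_all_lines and its encoding parameter are assumptions made for illustration; they are not part of the original code.

```python
import os
import tempfile


def find_all_lines(f_name, find_str, encoding="utf-8"):
    """Return the 1-based numbers of all lines that contain find_str."""
    if not os.path.exists(f_name):
        return []
    with open(f_name, "r", encoding=encoding) as f:
        return [n for n, line in enumerate(f, start=1) if find_str in line]


if __name__ == "__main__":
    # Small self-check on a temporary file
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("alpha\nbeta\nalpha beta\n")
        print(find_all_lines(path, "alpha"))  # [1, 3]
```

Returning a list keeps the "not found" case simple: an empty list, in the same spirit as the empty dict returned by txt_handler.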

docx files:

This requires the python-docx package:

    pip install python-docx

Reference: https://python-docx.readthedocs.io/en/latest/

    import os

    from docx import Document

    def docx_handler(self, f_name, find_str):
        """
        Search a Word docx file for a string.

        :param f_name: path of the docx file to search
        :param find_str: string to look for
        :return: dict with 'file_name' if the string is found, else an empty
                 dict (no line number is recorded for docx files)
        """
        file_str_dict = {}
        if os.path.exists(f_name):
            document = Document(f_name)  # open the docx file
            for paragraph in document.paragraphs:  # iterate over the paragraphs
                if find_str in paragraph.text:
                    file_str_dict['file_name'] = f_name
                    break
        return file_str_dict
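
One limitation of the handler above is that document.paragraphs only covers body paragraphs, so text inside tables is not searched. The sketch below shows one possible way to include table cells as well, using the tables, rows and cells attributes of python-docx. The helper names iter_docx_text and docx_contains are hypothetical, and nested tables are not handled.

```python
def iter_docx_text(document):
    """Yield text from body paragraphs, then from every top-level table cell.

    `document` is expected to be a python-docx Document object.
    """
    for paragraph in document.paragraphs:
        yield paragraph.text
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                yield cell.text


def docx_contains(document, find_str):
    """Return True if find_str occurs in any paragraph or table cell."""
    return any(find_str in text for text in iter_docx_text(document))
```

Assuming python-docx is installed, it could be called as docx_contains(Document(f_name), find_str).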

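doc files:

python-docx cannot read the legacy doc format, so a doc file has to be converted to docx first and then processed the same way as a docx file. The conversion below drives Microsoft Word through COM, so it needs Windows, an installed copy of Word, and the pywin32 package (which provides win32com).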

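    import os

    from win32com import client as wc

    def doc_to_docx(self, fileName):
        """Convert a doc file to docx with Word and return the new path."""
        # Word resolves relative paths against its own working directory,
        # so pass absolute paths
        fileName = os.path.abspath(fileName)
        docx_path = os.path.splitext(fileName)[0] + '.docx'
        word = wc.Dispatch("Word.Application")
        try:
            doc = word.Documents.Open(fileName)
            # File format 16 saves the document as docx; the file can only
            # be read by python-docx after it has been saved as docx
            doc.SaveAs(docx_path, 16)
            doc.Close()
        finally:
            word.Quit()  # always shut down the Word process
        return docx_path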

pdf files:

This uses the PDFMiner package. To install it for Python 3:

    python -m pip install pdfminer.six

Reference article:

https://dzone.com/articles/exporting-data-from-pdfs-with-python

    import io
    import os

    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    def pdf_handler(self, f_name, find_str):
        """
        Search a pdf file for a string.

        :param f_name: path of the pdf file to search
        :param find_str: string to look for
        :return: dict with 'file_name' if the string is found, else an empty dict
        """
        file_str_dict = {}
        if os.path.exists(f_name):
            for page in self.extract_text_by_page(f_name):
                # page holds all the text of the current page,
                # including text inside tables
                if find_str in page:
                    file_str_dict['file_name'] = f_name
                    break
        return file_str_dict

    @staticmethod
    def extract_text_by_page(pdf_path):
        """
        Read a PDF page by page.

        Generator function that yields the text of each page.

        :param pdf_path: path of the pdf file
        :return: generator of page text strings
        """
        with open(pdf_path, 'rb') as fh:
            for page in PDFPage.get_pages(fh,
                                          caching=True,
                                          check_extractable=True):
                resource_manager = PDFResourceManager()
                fake_file_handle = io.StringIO()
                converter = TextConverter(resource_manager, fake_file_handle)
                page_interpreter = PDFPageInterpreter(resource_manager, converter)
                page_interpreter.process_page(page)
                text = fake_file_handle.getvalue()
                # close the handles before yielding, so they are released
                # even if the caller stops iterating early
                converter.close()
                fake_file_handle.close()
                yield text
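
The handlers for the different file types all take a file name and a search string and return a dict, so they can be selected by file extension. The following is a minimal sketch of such a dispatcher, not part of the original code: pick_handler, find_in_file and the handlers mapping are hypothetical names, and the mapping is assumed to be built by the caller.

```python
import os


def pick_handler(f_name, handlers):
    """Return the handler registered for f_name's extension, or None."""
    ext = os.path.splitext(f_name)[1].lower()
    return handlers.get(ext)


def find_in_file(f_name, find_str, handlers):
    """Run the matching handler; return an empty dict for unknown types."""
    handler = pick_handler(f_name, handlers)
    if handler is None:
        return {}
    return handler(f_name, find_str)
```

With an object obj that carries the methods shown in this post, the mapping might look like {'.txt': obj.txt_handler, '.docx': obj.docx_handler, '.pdf': obj.pdf_handler}. A '.doc' entry could first call doc_to_docx and then pass the resulting path to docx_handler.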
