Implementing Full-Text Search over doc, docx, pdf, and txt Documents with Lucene

Please credit the source when reposting: http://blog.csdn.net/dongdong9223/article/details/76273859

This article comes from the blog 【我是干勾鱼的博客】

This post shows how to implement full-text search over doc, docx, pdf, and txt documents with Lucene.

Two classes are involved:

LuceneCreateIndex, which creates the index:

package com.yhd.test.poi;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class LuceneCreateIndex {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // Directory containing the documents to index
        String dataDirectory = "D:\\Studying\\poi\\test\\dataDirectory";
        // Directory where the Lucene index files are stored
        String indexDirectory = "D:\\Studying\\poi\\test\\indexDirectory";
        // Create the Directory object, i.e. the place the index is stored (not the analyzer)
        Directory directory = new SimpleFSDirectory(new File(indexDirectory));
        // Create a standard analyzer, which splits the text into tokens
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

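        // Create the IndexWriter.
        // Argument 1 is the Directory,
        // argument 2 is the analyzer,
        // argument 3 says whether to create the index: true overwrites any existing
        // data, false appends to the existing index,
        // argument 4, MaxFieldLength, caps the number of terms indexed per Field:
        // MaxFieldLength.UNLIMITED means no limit;
        // MaxFieldLength.LIMITED applies a limit, which can be set with
        // IndexWriter.setMaxFieldLength(int n)
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        // List every file that needs to be indexed
        File[] files = new File(dataDirectory).listFiles();
        // listFiles() returns null when the path is not a readable directory
        if (files == null) {
            indexWriter.close();
            throw new IOException("Cannot list directory: " + dataDirectory);
        }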

        for (int i = 0; i < files.length; i++) {
            // Skip subdirectories; only regular files can be opened below
            if (!files[i].isFile()) {
                continue;
            }
            // Position of the file in the listing (1-based)
            System.out.println("File #" + (i + 1) + " ----------------");
            // Full path of the file
            System.out.println("Full path: " + files[i].toString());
            // File name
            String fileName = files[i].getName();
            // Use the lower-cased file extension as the file type
            String fileType = fileName.substring(fileName.lastIndexOf(".") + 1,
                    fileName.length()).toLowerCase();
            System.out.println("File name: " + fileName);
            System.out.println("File type: " + fileType);

            Document doc = new Document();

            // String fileCode = FileType.getFileType(files[i].toString());
            // Inspect the type marker in each file's header
            // System.out.println("fileCode=" + fileCode);

            InputStream in = new FileInputStream(files[i]);
            InputStreamReader reader = null;

            if (fileType != null && !fileType.equals("")) {

                if (fileType.equals("doc")) {
                    // Open the .doc Word document
                    WordExtractor wordExtractor = new WordExtractor(in);
                    // Create a Field and add it to doc.
                    // The Field arguments are:
                    // 1: the field name,
                    // 2: the value, which can be text (a String, a Reader, or a
                    // pre-analyzed TokenStream) or binary (byte[]); in Lucene 3.x
                    // numbers go through the separate NumericField class,
                    // 3: Field.Store, whether to store the value; stored values can
                    // be returned with the search results,
                    // 4: Field.Index, how the field is indexed
                    doc.add(new Field("contents", wordExtractor.getText(),
                            Field.Store.YES, Field.Index.ANALYZED));
                    // Close the document
                    wordExtractor.close();
                    System.out.println("Note: created index for file \"" + fileName + "\"");

                } else if (fileType.equals("docx")) {
                    // Open the .docx Word document
                    XWPFWordExtractor xwpfWordExtractor = new XWPFWordExtractor(
                            new XWPFDocument(in));
                    // Create a Field and add it to doc
                    doc.add(new Field("contents", xwpfWordExtractor.getText(),
                            Field.Store.YES, Field.Index.ANALYZED));
                    // Close the document
                    xwpfWordExtractor.close();
                    System.out.println("Note: created index for file \"" + fileName + "\"");

                } else if (fileType.equals("pdf")) {
                    // Parse the PDF document (PDFBox 1.8.x API)
                    PDFParser parser = new PDFParser(in);
                    parser.parse();
                    PDDocument pdDocument = parser.getPDDocument();
                    PDFTextStripper stripper = new PDFTextStripper();
                    // Create a Field and add it to doc; with Field.Store.NO the text
                    // is searchable but is not returned with the search results
                    doc.add(new Field("contents", stripper.getText(pdDocument),
                            Field.Store.NO, Field.Index.ANALYZED));
                    // Close the document
                    pdDocument.close();
                    System.out.println("Note: created index for file \"" + fileName + "\"");

                } else if (fileType.equals("txt")) {
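                    // Wrap the stream in a reader (uses the platform default charset)
                    reader = new InputStreamReader(in);
                    // Buffer the reader so the text can be read line by line
                    BufferedReader br = new BufferedReader(reader);
                    StringBuilder txtFile = new StringBuilder();
                    String line = null;

                    while ((line = br.readLine()) != null) {
                        // Read one line at a time; keep the line break so that words
                        // on adjacent lines are not joined into a single token
                        txtFile.append(line).append('\n');
                    }
                    // Create a Field and add it to doc
                    doc.add(new Field("contents", txtFile.toString(), Field.Store.NO,
                            Field.Index.ANALYZED));
                    System.out.println("Note: created index for file \"" + fileName + "\"");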

                } else {

                    // Unsupported file type: release the stream and skip the file
                    in.close();
                    System.out.println();
                    continue;

                }

            }
            // Add the file name as a field
            doc.add(new Field("filename", files[i].getName(), Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            // Add the indexing date as a field
            doc.add(new Field("indexDate", DateTools.dateToString(new Date(),
                    DateTools.Resolution.DAY), Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            // Add the document to the IndexWriter
            indexWriter.addDocument(doc);
            // Release the file stream (harmless if an extractor already closed it)
            in.close();
            // Blank line between files
            System.out.println();
        }
        // Print how many documents the index contains
        System.out.println("numDocs=" + indexWriter.numDocs());
        // Close the IndexWriter
        indexWriter.close();

    }
}
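The txt branch above decodes the file with the platform default charset, so the same file can index differently from one machine to another. Below is a minimal sketch of reading a text file with an explicit charset and its line breaks intact, using only the JDK. It assumes the files are UTF-8 encoded, which the original post does not state; `TextFileReader` and `readText` are hypothetical names, not part of the original classes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original classes.
final class TextFileReader {

    // Reads the whole file as UTF-8 (an assumed encoding), line breaks included.
    static String readText(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file, read it back, then clean up.
        Path sample = Files.createTempFile("lucene-sample", ".txt");
        try {
            Files.write(sample, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8));
            System.out.println(readText(sample));
        } finally {
            Files.delete(sample);
        }
    }
}
```

The returned string could be passed as the value of the "contents" field. If the documents use another encoding such as GBK, the charset would need to change accordingly.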

LuceneSearch, which runs the search:

package com.yhd.test.poi;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class LuceneSearch {
    public static void main(String[] args) throws IOException, ParseException {
        // Directory where the index files are stored
        String indexDirectory = "D:\\Studying\\poi\\test\\indexDirectory";
        // Create the Directory object that points at the stored index
        Directory directory = new SimpleFSDirectory(new File(indexDirectory));
        // Create the IndexSearcher; unlike IndexWriter, it only needs the index directory
        IndexSearcher indexSearch = new IndexSearcher(directory);
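        // Create the QueryParser.
        // Argument 1 is the Lucene version,
        // argument 2 is the default field to search,
        // argument 3 is the analyzer applied to the query text
        QueryParser queryParser = new QueryParser(Version.LUCENE_30,
                "contents", new StandardAnalyzer(Version.LUCENE_30));
        // Build the Query object; the search term here is "百度" (Baidu)
        Query query = queryParser.parse("百度");
        // Run the search; TopDocs holds a scoreDocs[] array with the ids of the matching documents
        TopDocs hits = indexSearch.search(query, 10);
        // hits.totalHits is the total number of matches
        System.out.println("Found " + hits.totalHits + " match(es)");
        // Loop over hits.scoreDocs, load each Document with indexSearch.doc, and read the stored field values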
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            ScoreDoc sdoc = hits.scoreDocs[i];
            Document doc = indexSearch.doc(sdoc.doc);
            System.out.println(doc.get("filename"));
        }
        indexSearch.close();
    }
}
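The search above only prints the filename field, but the index also stores indexDate, written with DateTools.dateToString at DAY resolution. To my knowledge that call produces a `yyyyMMdd` string in GMT. The sketch below reproduces the format with the JDK alone, so the value to match against the NOT_ANALYZED indexDate field can be worked out without Lucene on the classpath. `IndexDateKey` and `dayKey` are hypothetical names, and the format is an assumption that should be checked against the Lucene version in use.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of the original classes.
final class IndexDateKey {

    // Formats a date as yyyyMMdd in GMT, the layout assumed for DAY resolution.
    static String dayKey(Date date) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format.format(date);
    }

    public static void main(String[] args) {
        // Print the key that would be stored for a document indexed right now.
        System.out.println(dayKey(new Date()));
    }
}
```

Because the field is indexed without analysis, an exact term match on this string is one plausible way to find everything indexed on a given day.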

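The code comments cover the details, so I will not repeat them here. The required jars are Lucene, Apache POI, and Apache PDFBox.

Download the POI classes from the Apache POI website and the PDF classes from the Apache PDFBox website. This example uses PDFBox 1.8.13. The 2.0 API differs from 1.x (for example, PDFTextStripper moved from org.apache.pdfbox.util to org.apache.pdfbox.text), so the PDF code above will not compile against 2.0 unchanged.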


Run LuceneCreateIndex first. It reads the files under dataDirectory, that is:

D:\Studying\poi\test\dataDirectory

builds the index, and saves it under indexDirectory, that is:

D:\Studying\poi\test\indexDirectory

Then run LuceneSearch to query the index and see the results.
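Before running LuceneCreateIndex, it can help to check which files under dataDirectory would actually be indexed. The sketch below is a dry run using only the JDK: it lists the regular files whose extension is doc, docx, pdf, or txt and returns an empty list when the directory is missing. `IndexableFiles`, `isSupported`, and `list` are hypothetical names, not part of the original classes.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original classes.
final class IndexableFiles {

    // The four file types handled by LuceneCreateIndex.
    static final List<String> SUPPORTED = Arrays.asList("doc", "docx", "pdf", "txt");

    // True when the name has an extension that matches a supported type.
    static boolean isSupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(extension);
    }

    // Lists supported regular files; empty when the directory cannot be read.
    static List<File> list(File dataDirectory) {
        List<File> result = new ArrayList<File>();
        File[] entries = dataDirectory.listFiles();
        if (entries == null) {
            return result;
        }
        for (File entry : entries) {
            if (entry.isFile() && isSupported(entry.getName())) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pass the data directory as the first argument; defaults to the current directory.
        File directory = new File(args.length > 0 ? args[0] : ".");
        for (File file : list(directory)) {
            System.out.println(file.getPath());
        }
    }
}
```

Passing D:\Studying\poi\test\dataDirectory as the argument prints the candidate files without touching the index.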
