Indexing Chinese Text with Lucene 4.5

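The goal here is to build an index over txt files and to run search queries against that index. This is done with Lucene 4.5, together with the Chinese analyzer that ships with it.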

As preparation, create a few txt files in a single folder; mine is the 五月天歌词文件 folder that fileInput points to in the code.

First, build an index over these texts. The code is as follows:

 package com.test;

 import java.io.*;
 import java.util.ArrayList;
 import java.util.List;

 import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
 import org.apache.lucene.document.*;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.Version;

 public class Indexer {

     // Folder that holds the txt files to index
     private static String fileInput = "C:\\Users\\Press-Lab\\Desktop\\五月天歌词文件";

     // Folder where the index is stored
     private static String indexPath = "C:\\Users\\Press-Lab\\Desktop\\index";

     public static void main(String[] args) throws Exception {
         // Read the content and path of every txt file into a list,
         // so that the next step can put each entry into a Document
         File[] files = new File(fileInput).listFiles();
         List<FileBag> list = new ArrayList<FileBag>();
         for (File f : files) {
             // Note: FileReader decodes with the platform default charset
             BufferedReader br = new BufferedReader(new FileReader(f));
             StringBuilder sb = new StringBuilder();
             String line = null;
             while ((line = br.readLine()) != null) {
                 // readLine() strips the line terminator, so put one back;
                 // otherwise the end of one line runs into the start of the next
                 sb.append(line).append('\n');
             }
             br.close();
             FileBag fileBag = new FileBag();
             fileBag.setContent(sb.toString());
             fileBag.setPath(f.getAbsolutePath());
             list.add(fileBag);
         }

         // Build the index
         Directory dir = FSDirectory.open(new File(indexPath));
         // Use the bundled Chinese analyzer
         SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_45);
         IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_45, analyzer);
         IndexWriter writer = new IndexWriter(dir, config);

         // Index every object in the list as one document
         for (FileBag fileBag : list) {
             Document doc = new Document();
             // Field.Store.YES keeps the text in the index so it can be printed with the results
             doc.add(new Field("contents", fileBag.getContent(), Field.Store.YES, Field.Index.ANALYZED));
             doc.add(new StringField("path", fileBag.getPath(), Field.Store.YES));
             writer.addDocument(doc);
         }
         writer.close();
     }
 }

 // A domain object that corresponds to one txt file
 class FileBag {

     private String content;
     private String path;

     public String getContent() {
         return content;
     }

     public void setContent(String content) {
         this.content = content;
     }

     public String getPath() {
         return path;
     }

     public void setPath(String path) {
         this.path = path;
     }
 }

In this code I store the text in the index. Normally there is no need to store it; it is stored here only to make the results easier to inspect.

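Pay special attention to the line that adds the "contents" field with new Field(...). When I used TextField instead, which is what Lucene 4.5 recommends, no index could be built. I do not know the reason; if anyone has solved this problem, please let me know.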
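A side note on the file-reading loop in the indexer: FileReader decodes with the platform default charset, and listFiles() also returns subfolders and non-txt files. A minimal JDK-only sketch of a stricter reader might look like the following. The class name TxtReader, its two helper methods, and the choice of UTF-8 are assumptions made for illustration and are not part of the original code; if the lyrics files were saved in another encoding (GBK, for example), that charset would have to be passed instead.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original post.
class TxtReader {

    // Returns the regular files in `folder` whose names end in ".txt", sorted by path.
    static List<File> listTxtFiles(File folder) {
        List<File> result = new ArrayList<File>();
        File[] entries = folder.listFiles();
        if (entries == null) { // folder is missing or is not a directory
            return result;
        }
        Arrays.sort(entries);
        for (File f : entries) {
            if (f.isFile() && f.getName().toLowerCase(Locale.ROOT).endsWith(".txt")) {
                result.add(f);
            }
        }
        return result;
    }

    // Reads the whole file with an explicit charset; line breaks are preserved.
    static String readText(File file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file.toPath()), charset);
    }

    public static void main(String[] args) throws IOException {
        File folder = new File(args.length > 0 ? args[0] : ".");
        for (File f : listTxtFiles(folder)) {
            String text = readText(f, StandardCharsets.UTF_8); // assumed encoding
            System.out.println(f.getAbsolutePath() + ": " + text.length() + " chars");
        }
    }
}
```

The returned string could be passed to fileBag.setContent(...) in place of the StringBuffer result; nothing else in the indexer would need to change.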

Below is the search code:

 package com.test;

 import java.io.*;

 import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.index.DirectoryReader;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.queryparser.classic.QueryParser;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.ScoreDoc;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.Version;

 public class Searcher {

     // Folder where the index was stored by the indexer
     private static String indexPath = "C:\\Users\\Press-Lab\\Desktop\\index";

     public static void main(String[] args) throws Exception {
         Directory dir = FSDirectory.open(new File(indexPath));
         IndexReader reader = DirectoryReader.open(dir);
         IndexSearcher searcher = new IndexSearcher(reader);

         // The Chinese analyzer is needed here as well
         SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_45);
         QueryParser parser = new QueryParser(Version.LUCENE_45, "contents", analyzer);
         Query query = parser.parse("如果");

         // Fetch at most the 5 best-scoring hits
         TopDocs topdocs = searcher.search(query, 5);
         ScoreDoc[] scoreDocs = topdocs.scoreDocs;
         System.out.println("Number of hits returned: " + scoreDocs.length);

         for (int i = 0; i < scoreDocs.length; i++) {
             int doc = scoreDocs[i].doc;
             Document document = searcher.doc(doc);
             System.out.println("Path of hit " + (i + 1) + ": " + document.get("path"));
             System.out.println("Content of hit " + (i + 1) + ": " + document.get("contents"));
         }
         reader.close();
     }
 }

Note that the query side needs the Chinese analyzer as well.
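The reason can be seen in a toy model: an index maps the tokens produced at index time to documents, so a query only matches if it is cut into the same tokens. The sketch below is an in-memory inverted index that uses only the JDK. It is not Lucene, and the class ToyIndex and its two tokenizers are hypothetical stand-ins: bigrams plays the role of an analyzer that segments Chinese, and wholeString the role of one that does not. SmartChineseAnalyzer itself segments text into words rather than bigrams.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Toy illustration only; none of these names come from Lucene.
class ToyIndex {

    // Stand-in for a segmenting analyzer: overlapping two-character tokens.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Stand-in for a non-segmenting analyzer: the whole text is one token.
    static List<String> wholeString(String text) {
        return Collections.singletonList(text);
    }

    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
    private final Function<String, List<String>> indexTokenizer;

    ToyIndex(Function<String, List<String>> indexTokenizer) {
        this.indexTokenizer = indexTokenizer;
    }

    // Records every index-time token of `text` as pointing to `docId`.
    void add(int docId, String text) {
        for (String token : indexTokenizer.apply(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<Integer>()).add(docId);
        }
    }

    // Returns the ids of documents that contain every query token.
    Set<Integer> search(String query, Function<String, List<String>> queryTokenizer) {
        Set<Integer> result = null;
        for (String token : queryTokenizer.apply(query)) {
            Set<Integer> docs = postings.get(token);
            if (docs == null) {
                return new TreeSet<Integer>(); // one token is unknown, so nothing matches
            }
            if (result == null) {
                result = new TreeSet<Integer>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(ToyIndex::bigrams);
        index.add(0, "如果明天下雨");
        index.add(1, "今天天气很好");
        // Same tokenizer on both sides: the query tokens exist in the index.
        System.out.println(index.search("如果明天", ToyIndex::bigrams));     // prints [0]
        // Different tokenizer at query time: the single long token was never indexed.
        System.out.println(index.search("如果明天", ToyIndex::wholeString)); // prints []
    }
}
```

The second lookup returns nothing even though document 0 contains the query text, which mirrors what tends to happen when a query is parsed with an analyzer that tokenizes differently from the one used to build the index.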

The next post will implement paginated queries against the index.
