Lucene 4.x Spellcheck Usage Guide

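SpellChecker is the spell-checking component in Lucene's suggest module. Before introducing it, we need to be clear about which data sources it supports. The SpellChecker constructor takes the Directory in which the spelling index is stored; the words themselves are supplied through the Dictionary interface, which is passed to indexDictionary: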

package org.apache.lucene.search.spell;

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;

import org.apache.lucene.search.suggest.InputIterator;

/**
 * A simple interface representing a Dictionary. A Dictionary
 * here is a list of entries, where every entry consists of
 * term, weight and payload.
 *
 */
public interface Dictionary {

  /**
   * Returns an iterator over all the entries
   * @return Iterator
   */
  InputIterator getEntryIterator() throws IOException;

}

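Several Dictionary implementations are available. The two used most often are the plain-text one (PlainTextDictionary) and the one built from an existing Lucene index (LuceneDictionary).

Below is the code I used for testing. It covers both building the index and querying it: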

package com.tianditu.com.search;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.Version;

public class GlobalSuggest {

	// Directory where the spellcheck index is written
	private final String SPELL_CHECK_FOLDER = "c:\\spellcheck\\";

	// Existing index that supplies the dictionary words
	private final String GLOBAL_PINYIN_SUGGEST = "O:\\searchwork_custom\\data_index\\pinyin2008\\";

	// Build the spellcheck index from the pinyin index
	public void testIndexPinyin2008() throws IOException {
		long start = System.currentTimeMillis();
		//北京吉威时代软件股份有限公司
		Directory direct = new MMapDirectory(new File(GLOBAL_PINYIN_SUGGEST));
		DirectoryReader reader = DirectoryReader.open(direct);
		Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
		SpellChecker sc = new SpellChecker(spd);
		try {
			// Every term of the "name" field becomes a dictionary word
			LuceneDictionary ld = new LuceneDictionary(reader, "name");
			// null: no analyzer is supplied
			IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
			// Write the index into the spellcheck directory; true requests a full merge
			sc.indexDictionary(ld, iwc, true);
		} finally {
			sc.close();
			reader.close();
		}
		long end = System.currentTimeMillis();
		System.out.println("Indexing finished, took " + (end - start) + " ms");
	}

	// Same as above, but reads the words from the GlobalIndex index
	public void testIndex() throws IOException {
		long start = System.currentTimeMillis();
		//北京吉威时代软件股份有限公司
		String indexDir = "O:\\searchwork_custom\\data_index\\GlobalIndex\\";
		Directory direct = new MMapDirectory(new File(indexDir));
		DirectoryReader reader = DirectoryReader.open(direct);
		Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
		SpellChecker sc = new SpellChecker(spd);
		try {
			LuceneDictionary ld = new LuceneDictionary(reader, "name");
			IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
			sc.indexDictionary(ld, iwc, true);
		} finally {
			sc.close();
			reader.close();
		}
		long end = System.currentTimeMillis();
		System.out.println("Indexing finished, took " + (end - start) + " ms");
	}

	public void testSearch(String wd) throws IOException {
		// Open the spellcheck index directory
		Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
		// Instantiate the SpellChecker component
		SpellChecker sc = new SpellChecker(spd);
		try {
			// Fetch up to 10 suggestions closest to the input. The third argument is
			// the minimum accuracy: the higher it is, the closer a match must be.
			// Tune it to your needs.
			String[] suggests = sc.suggestSimilar(wd, 10, 0.6f);
			if (suggests != null) {
				for (String word : suggests) {
					System.out.println("Did you mean: " + word);
				}
			}
		} finally {
			sc.close();
		}
	}

	/**
	 * @param args
	 * @throws IOException
	 */
	public static void main(String[] args) throws IOException {
		GlobalSuggest spellcheck = new GlobalSuggest();
		//spellcheck.testIndexPinyin2008();
		spellcheck.testSearch("beijing京鸭");
		//spellcheck.testSearch("beijng");
	}

}

The index-building part of the code is:

	// Build the spellcheck index from the pinyin index
	public void testIndexPinyin2008() throws IOException {
		long start = System.currentTimeMillis();
		//北京吉威时代软件股份有限公司
		Directory direct = new MMapDirectory(new File(GLOBAL_PINYIN_SUGGEST));
		DirectoryReader reader = DirectoryReader.open(direct);
		Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
		SpellChecker sc = new SpellChecker(spd);
		try {
			// Every term of the "name" field becomes a dictionary word
			LuceneDictionary ld = new LuceneDictionary(reader, "name");
			// null: no analyzer is supplied
			IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
			// Write the index into the spellcheck directory; true requests a full merge
			sc.indexDictionary(ld, iwc, true);
		} finally {
			sc.close();
			reader.close();
		}
		long end = System.currentTimeMillis();
		System.out.println("Indexing finished, took " + (end - start) + " ms");
	}

This code builds the index that SpellChecker needs from an existing index.
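For context on what ends up in the spellcheck index: SpellChecker stores each dictionary word together with its character n-grams, and candidates are later found by matching the n-grams of the query word. The following is a minimal standalone sketch of n-gram splitting, not Lucene's own code. The class name `NGramSketch` and the gram size of 2 are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class NGramSketch {

    /** Splits a word into overlapping character n-grams of length n. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [be, ei, ij, ji, in, ng]
        System.out.println(ngrams("beijing", 2));
    }
}
```

With this gram size, the misspelling "beijng" still shares the grams be, ei, ij and ng with "beijing", which is why the correct word can be retrieved as a candidate.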

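The snippet that queries the spellcheck index is:

		// Open the spellcheck index directory
		Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
		// Instantiate the SpellChecker component
		SpellChecker sc = new SpellChecker(spd);
		try {
			// Fetch up to 10 suggestions closest to the input. The third argument is
			// the minimum accuracy: the higher it is, the closer a match must be.
			// Tune it to your needs.
			String[] suggests = sc.suggestSimilar(wd, 10, 0.6f);
			if (suggests != null) {
				for (String word : suggests) {
					System.out.println("Did you mean: " + word);
				}
			}
		} finally {
			sc.close();
		}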

Algorithm: the default string distance is LevensteinDistance.
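To make the accuracy argument of suggestSimilar more concrete, here is a minimal standalone sketch of an edit-distance similarity. It assumes the score is normalized as 1 - distance / max(length), which is my understanding of how the default behaves; the class name `LevenshteinSketch` is hypothetical and this is not Lucene's implementation.

```java
// Hypothetical helper for illustration only; not part of Lucene.
class LevenshteinSketch {

    /** Classic dynamic-programming edit distance (insert, delete, substitute). */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1 means identical (assumed normalization). */
    static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0f;
        }
        return 1.0f - (float) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(distance("beijng", "beijing"));   // prints 1
        System.out.println(similarity("beijng", "beijing"));
    }
}
```

Under this normalization, "beijng" versus "beijing" scores 1 - 1/7, roughly 0.86, so it would pass the 0.6 threshold used in the query code.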

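Query examples:

1. Querying Chinese characters that contain a wrong character:

2. Querying pinyin:

3. Querying pinyin mixed with Chinese characters:

(Note: this exposed a problem. Input that mixes pinyin and Chinese characters does not work; to support it, the query needs some kind of preprocessing.)

4. Querying a long string of Chinese characters with a wrong character in the middle: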
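For the mixed pinyin and Chinese case, one possible form of preprocessing is to split the query into runs of Han characters and runs of everything else, and then check each run separately. The sketch below is hypothetical (the class `ScriptSplitter` and its method are my own names, not part of Lucene) and shows only the splitting step, not the lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class ScriptSplitter {

    /** Splits text into maximal runs that are either all Han or all non-Han. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean currentIsHan = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean isHan =
                Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            // A change of script closes the current run
            if (current.length() > 0 && isHan != currentIsHan) {
                runs.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            currentIsHan = isHan;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // prints [beijing, 京鸭]
        System.out.println(split("beijing京鸭"));
    }
}
```

Each run could then be passed to suggestSimilar on its own and the results recombined; how to recombine them depends on the application.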

Summary: SpellChecker's capabilities turn out to be fairly limited, and using it in practice may require some customization.
