Compiling the Ansj Plugin for Solr

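Ansj is a well-regarded Chinese word segmentation library; its details are outside the scope of this post. The Ansj author ships support for the Lucene interface in the official code base. To use it under Solr, a small extension is still needed.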

1. Managing the build with Maven

Ansj is developed and managed with Maven. First, modify the plugin's pom.xml as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.ansj</groupId>
		<artifactId>MavenAccount-aggregator</artifactId>
		<version>0.0.1</version>
		<relativePath>../pom.xml</relativePath>
	</parent>
	<artifactId>ansj_lucene4_plug</artifactId>
	<version>2.0.2</version>
	<packaging>jar</packaging>
	<name>ansj_lucene4_plug</name>
	<properties>
		<solr.version>4.8.0</solr.version>
	</properties>
	<dependencies>
		<dependency>
			<groupId>org.ansj</groupId>
			<artifactId>ansj_seg</artifactId>
			<version>2.0.5</version>
			<classifier>min</classifier>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-core</artifactId>
			<version>${solr.version}</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-highlighter</artifactId>
			<version>${solr.version}</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-queries</artifactId>
			<version>${solr.version}</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-queryparser</artifactId>
			<version>${solr.version}</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.solr</groupId>
			<artifactId>solr-dataimporthandler</artifactId>
			<version>${solr.version}</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.4</version>
			<scope>test</scope>
		</dependency>
	</dependencies>
</project>

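In the dependency declarations, <scope>provided</scope> means the dependency is only needed on the compile and test classpath and is not bundled with the artifact; at run time it is expected to be supplied by the hosting environment, here Solr. Note that ansj_seg is declared provided as well, so its jar has to be available on Solr's classpath alongside the plugin jar. With the dependencies in place, write a TokenizerFactory class that can be referenced from the Solr configuration. The code is as follows: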

package org.ansj.solr;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.ansj.lucene.util.AnsjTokenizer;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class AnsjTokenizerFactory extends TokenizerFactory {

    // passed through unchanged to AnsjTokenizer
    boolean pstemming;

    // true: query-time analysis (ToAnalysis); false: index-time analysis (IndexAnalysis)
    boolean isQuery;

    // path of the stop-word file, read from the "words" attribute
    private String stopwordsDir;

    // stop words handed to AnsjTokenizer; stays null when no file is configured
    public Set<String> filter;

    public AnsjTokenizerFactory(Map<String, String> args) {
        super(args);
        assureMatchVersion();
        isQuery = getBoolean(args, "isQuery", true);
        pstemming = getBoolean(args, "pstemming", false);
        stopwordsDir = get(args, "words");
        addStopwords(stopwordsDir);
    }

    // add stopwords list to filter (UTF-8 file, one word per line)
    private void addStopwords(String dir) {
        if (dir == null) {
            System.out.println("no stopwords dir");
            return;
        }
        // read stoplist
        System.out.println("stopwords: " + dir);
        filter = new HashSet<String>();
        BufferedReader br = null;
        try {
            br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(dir)), "UTF-8"));
            String word;
            while ((word = br.readLine()) != null) {
                filter.add(word);
            }
        } catch (FileNotFoundException e) {
            System.out.println("No stopword file found");
        } catch (IOException e) {
            System.out.println("stopword file io exception");
        } finally {
            // close the reader so the file handle is not leaked
            if (br != null) {
                try {
                    br.close();
                } catch (IOException ignored) {
                    // nothing useful can be done if close fails
                }
            }
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader input) {
        if (isQuery) {
            // query
            return new AnsjTokenizer(new ToAnalysis(new BufferedReader(input)), input, filter, pstemming);
        } else {
            // index
            return new AnsjTokenizer(new IndexAnalysis(new BufferedReader(input)), input, filter, pstemming);
        }
    }
}
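The stop-word file read by addStopwords is plain UTF-8 text with one word per line. To check such a file outside Solr, a standalone sketch along the following lines may help. The class name StopwordFileSketch and its load method are hypothetical and not part of Ansj or Solr. Unlike the factory above, this sketch trims each line and skips blank ones, and it assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for inspecting a stop-word file; not part of Ansj or Solr.
public class StopwordFileSketch {

    // Reads a UTF-8 file with one stop word per line into a set.
    // Assumption (differs from the factory): lines are trimmed and blank lines skipped.
    public static Set<String> load(Path file) throws IOException {
        Set<String> words = new HashSet<String>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file; the escapes are three common Chinese stop words.
        Path tmp = Files.createTempFile("stopwords", ".txt");
        Files.write(tmp, "\u7684\n\u4e86\n\n\u662f\n".getBytes(StandardCharsets.UTF_8));
        try {
            Set<String> words = load(tmp);
            System.out.println("loaded " + words.size() + " stop words");
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Running main writes a temporary sample file, loads it, and prints how many distinct words were read.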

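pstemming is a parameter required by Ansj; the factory passes it straight through to AnsjTokenizer.

isQuery tells the factory whether the tokenizer is being used for querying or for indexing. As a rule, segmentation at index time is finer-grained, while segmentation at query time is coarser.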
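In Solr, the attributes written on a tokenizer element in the schema are handed to the factory constructor as a Map<String, String>. The sketch below mimics, without Lucene, Solr, or Ansj on the classpath, how the three attributes isQuery, pstemming, and words are read with the same defaults as the factory above, and which analysis class each isQuery value selects. The class AnsjArgsSketch, its Settings holder, and the sample path are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration of the factory's argument handling; hypothetical class.
public class AnsjArgsSketch {

    // Hypothetical value holder for the three settings; not part of Ansj or Solr.
    public static final class Settings {
        public final boolean isQuery;
        public final boolean pstemming;
        public final String words; // null when no stop-word file is configured

        public Settings(boolean isQuery, boolean pstemming, String words) {
            this.isQuery = isQuery;
            this.pstemming = pstemming;
            this.words = words;
        }
    }

    // Reads the attributes with the same defaults as the factory:
    // isQuery defaults to true, pstemming to false, words to null.
    public static Settings parse(Map<String, String> attributes) {
        Map<String, String> args = new HashMap<String, String>(attributes);
        String q = args.remove("isQuery");
        String p = args.remove("pstemming");
        String w = args.remove("words");
        boolean isQuery = (q == null) ? true : Boolean.parseBoolean(q);
        boolean pstemming = (p == null) ? false : Boolean.parseBoolean(p);
        return new Settings(isQuery, pstemming, w);
    }

    // Name of the Ansj analysis class the factory would pick for these settings.
    public static String analysisFor(Settings s) {
        return s.isQuery ? "ToAnalysis" : "IndexAnalysis";
    }

    public static void main(String[] argv) {
        Map<String, String> indexSide = new HashMap<String, String>();
        indexSide.put("isQuery", "false");
        indexSide.put("words", "/path/to/stopwords.dic"); // hypothetical path

        System.out.println("index side: " + analysisFor(parse(indexSide)));
        System.out.println("defaults:   " + analysisFor(parse(new HashMap<String, String>())));
    }
}
```

A field type would typically declare the tokenizer twice, once in the index analyzer with isQuery set to false and once in the query analyzer with isQuery set to true (or left at its default).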

2. Building the jar

The code structure is as follows: [screenshot of the project layout not preserved]

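Build with Maven: mvn clean install -DskipTests=true. The -DskipTests flag skips running the unit tests; the test sources are still compiled, as the testCompile step in the output shows.

Build output: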

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building ansj_lucene4_plug 2.0.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ ansj_lucene4_plug ---
[INFO] Deleting R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target
[INFO]
[INFO] --- maven-resources-plugin:2.4.3:resources (default-resources) @ ansj_lucene4_plug ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\src\main\resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ ansj_lucene4_plug ---
[INFO] Compiling 5 source files to R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target\classes
[INFO]
[INFO] --- maven-resources-plugin:2.4.3:testResources (default-testResources) @ ansj_lucene4_plug ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\src\test\resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ ansj_lucene4_plug ---
[INFO] Compiling 3 source files to R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target\test-classes
[INFO]
[INFO] --- maven-surefire-plugin:2.7.1:test (default-test) @ ansj_lucene4_plug ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.3.1:jar (default-jar) @ ansj_lucene4_plug ---
[INFO] Building jar: R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target\ansj_lucene4_plug-2.0.2.jar
[INFO]
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ ansj_lucene4_plug ---
[INFO] Installing R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\target\ansj_lucene4_plug-2.0.2.jar to C:\Users\GCZX-016\.m2\repository\org\ansj\ansj_lucene4_plug\2.0.2\ansj_lucene4_plug-2.0.2.jar
[INFO] Installing R:\ansj-seg\ansj_seg\plug\ansj_lucene4_plug\pom.xml to C:\Users\GCZX-016\.m2\repository\org\ansj\ansj_lucene4_plug\2.0.2\ansj_lucene4_plug-2.0.2.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.149s
[INFO] Finished at: Tue May 05 15:29:19 CST 2015
[INFO] Final Memory: 27M/245M
[INFO] ------------------------------------------------------------------------
