solr6.6 solrJ索引富文本(word/pdf)文件

　　1、文件配置

　　　　在core下面新建lib文件夹，存放相关的jar包，如图所示：

　　　　修改solrconfig.xml

<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/contrib/clustering/lib/" regex=".*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-clustering-\d.*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/contrib/langid/lib/" regex=".*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-langid-\d.*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/contrib/velocity/lib" regex=".*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-velocity-\d.*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

  <lib dir="./lib" regex=".*\.jar"/>

　　　　增加配置，如果有则不用添加：

 <requestHandler name="/update/extract"

                  startup="lazy"

                  class="solr.extraction.ExtractingRequestHandler" >

    <lst name="defaults">

      <str name="fmap.content">text</str>

      <str name="fmap.meta">ignored_</str>

      <str name="lowernames">true</str>

      <str name="uprefix">attr_</str>

      <str name="captureAttr">true</str>

    </lst>

  </requestHandler>

　　配置managed-schema文件：

　　修改managed-schema文件，增加字段：

  <field name="path"      type="string"   indexed="true"  stored="true"  multiValued="false" />

  <field name="pathftype"      type="string"   indexed="true"  stored="true"  multiValued="false" />

  <field name="pathuploaddate"      type="string"   indexed="true"  stored="true"  multiValued="false" />

  <field name="pathsummary"      type="string"   indexed="true"  stored="true"  multiValued="false" />

  <field name="attr_content"      type="text_general"   indexed="true"  stored="true"  multiValued="false" />

　　2、Java代码solrj操作（6.6.0版本）

import java.io.File;

import java.io.IOException;

import java.text.SimpleDateFormat;

import java.util.Date;

import org.apache.solr.client.solrj.SolrClient;

import org.apache.solr.client.solrj.SolrQuery;

import org.apache.solr.client.solrj.SolrServerException;

import org.apache.solr.client.solrj.impl.HttpSolrClient;

import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;

import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

import org.apache.solr.client.solrj.response.QueryResponse;

/**

 * @Author：sks

 * @Description：索引pdf等富文本文件

 * @Date：Created in 15:16 2017/12/13

 * @Modified by：

 **/

public class solr_pdf {

    public static void main(String[] args)

    {

        String fileName = "D:/work/Solr/ImportData/20160229001cn.pdf";

        String solrId = "20160229001cn.pdf";

        try

        {

            indexFilesSolrCell(solrId, solrId,fileName);

        }

        catch (IOException e)

        {

            e.printStackTrace();

        }

        catch (SolrServerException e)

        {

            e.printStackTrace();

        }

    }

    /**

     * @Author：sks

     * @Description：获取系统当天日期yyyy-mm-dd

     * @Date：

     */

    private static String GetCurrentDate(){

        Date dt = new Date();

        //最后的aa表示“上午”或“下午”    HH表示24小时制    如果换成hh表示12小时制

//        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss aa");

        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");

        String day =sdf.format(dt);

        return day;

    }

    public static void indexFilesSolrCell(String fileName, String solrId, String path)

            throws IOException, SolrServerException

    {

        String urlString = "http://localhost:8983/solr/test";

        SolrClient solr = new HttpSolrClient.Builder(urlString).build();

        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");

        String contentType = getFileContentType(fileName);

        up.addFile(new File(path), contentType);

        String fileType = fileName.substring(fileName.lastIndexOf(".")+1);

        up.setParam("literal.id", fileName);

        up.setParam("literal.path", path);//文件路径

        up.setParam("literal.pathuploaddate", GetCurrentDate());//文件上传时间

        up.setParam("literal.pathftype", fileType);//文件类型，doc,pdf

        up.setParam("fmap.content", "attr_content");//文件内容

        up.setAction(ACTION.COMMIT, true, true);

        solr.request(up);

    }

    /**

    * @Author：sks

    * @Description：根据文件名获取文件的ContentType类型

    * @Date：

    */

    public static String getFileContentType(String filename) {

        String contentType = "";

        String prefix = filename.substring(filename.lastIndexOf(".") + 1);

        if (prefix.equals("xlsx")) {

            contentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

        } else if (prefix.equals("pdf")) {

            contentType = "application/pdf";

        } else if (prefix.equals("doc")) {

            contentType = "application/msword";

        } else if (prefix.equals("txt")) {

            contentType = "text/plain";

        } else if (prefix.equals("xls")) {

            contentType = "application/vnd.ms-excel";

        } else if (prefix.equals("docx")) {

            contentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

        } else if (prefix.equals("ppt")) {

            contentType = "application/vnd.ms-powerpoint";

        } else if (prefix.equals("pptx")) {

            contentType = "application/vnd.openxmlformats-officedocument.presentationml.presentation";

        }

        else {

            contentType = "othertype";

        }

        return contentType;

    }

}

solr6.6 solrJ索引富文本(word/pdf)文件的更多相关文章

搜索引擎Solr6.2.1 索引富文本(word/pdf/txt/html)
一:首先建立Core 在core下面新建lib文件夹,存放相关的jar包,如图所示: lib文件夹打开所示,这些类库在solr6.2.1解压之后都能找到: 修改solrconfig.xml,把刚刚建的 ...
利用 Pandoc 将 Markdown 生成 Word/PDF 文件
Pandoc 是一个格式转化工具,可以用于各(luan)种(qi)各(ba)样(zao)的文件转换, 反正我是认不全官网上的那个图(傲娇脸), 之前一直使用它将 Markdown 文件转换成 Html ...
SolrCloud索引富文本数据
solrconfig配置文件: schema配置文件: 执行目录: /opt/solr-5.5.4/server/scripts/cloud-scripts -- 下载配置文件 ./zkcli.sh ...
个人永久性免费-Excel催化剂功能第88波-批量提取pdf文件信息（图片、表格、文本等）
日常办公场合中,除了常规的Excel.Word.PPT等文档外,还有一个不可忽略的文件格式是pdf格式,而对于想从pdf文件中获取信息时,常规方法将变得非常痛苦和麻烦.此篇给大家送一pdf文件提取信息 ...
富文本编辑器UEditor的配置使用方法
将下载的富文本编辑器的文件解压后放到 webcontent 下如果文件中的jsp文件夹下的controller.java文件报错的话就将jsp下的lib文件夹中的文件都复制到 web-i ...
iOS - 开发中加载本地word/pdf文档说明
最近项目中要加载一个本地的word/pdf等文件比如<用户隐私政策><用户注册说明>,有两种方法加载 > 用QLPreviewController控制器实现步骤 : & ...
uedit富文本编辑器及图片上传控件
微力后台 uedit富文本编辑器及文件上传控件的使用,无时间整理,暂略,参考本地代码.能跑起来.
给Django后台富文本编辑器添加上传文件的功能
使用富文本编辑器上传的文件是要放到服务器上的,所以这是一个request.既然是一个request,就需要urls.py进行转发请求views.py进行处理.views.py处理完了返回一个文件所在的 ...
「Python实用秘技04」为pdf文件批量添加文字水印
本文完整示例代码及文件已上传至我的Github仓库https://github.com/CNFeffery/PythonPracticalSkills 这是我的系列文章「Python实用秘技」的第4期 ...

随机推荐

mysql五-2:多表查询
一介绍本节主题多表连接查询复合条件连接查询子查询准备表 company.employeecompany.department #建表 create table department( id ...
RabbitMQ消息队列（一）: 简单队列
1. 示例选用python的pika模块进行测试,需要预先安装pika模块: https://pypi.python.org/pypi/pika/0.10.0#downloads 上述地址下载源码,加 ...
使用go写一个检测tcpudp状态的包
使用go写一个检测tcpudp状态的包 http://www.2cto.com/os/201501/367596.html
使用div实现progress进度条
在百度上搜了很多方法去修改HTML5 progress的样式,然而并没有实现. 所以自己用div实现了一个. 简单粗暴(*^-^*) 可以在CSS里改样式,可以JS里改进度. <div cla ...
消除Git diff中^M的差异
消除Git diff中^M的差异在Windows上把一个刚commit的文件夹上传到了Ubuntu.在Ubuntu上使用git status查看,发现很多文件都被红色标注,表示刚刚修改未add.在W ...
病毒&烦人的幻灯片
<病毒>传送门 <烦人的幻灯片>传送门病毒描述有一天,小y突然发现自己的计算机感染了一种病毒!还好,小y发现这种病毒很弱,只是会把文档中的所有字母替换成其它字母,但并不改 ...
layer close 关闭层IE9-浏览器崩溃问题解决
针对ayer弹出层在IE上关闭导致浏览器崩溃的问题: 导致原因: 查看src源码,layer.close关闭总方法中有这么一行: layer.close = function(index){ ] + ...
Pycharm上python和unittest两种姿势傻傻分不清楚【转载】
前言经常有人在群里反馈,明明代码一样的啊,为什么别人的能出报告,我的出不了报告:为什么别人运行结果跟我的不一样啊... 这种问题先检查代码,确定是一样的,那就是运行姿势不对了,一旦导入unittes ...
mysql的expain（zz)
两张表,T1和T2,都只有一个字段,id int.各插入1000条记录,运行如下语句: explain SELECT t1.id,t2.id FROM t1 INNER JOIN t2 ON t1.i ...
sonarQube6.1 升级至6.2
在使用sonarQube6.1一段时间后,今天才发现sonarQube6.2已经更新,为了尝鲜,我决定在本机先尝试一下,如何升级至6.2 在这里,根据站点提示的升级步骤 1.下载新版本sonarQub ...

solr6.6 solrJ索引富文本(word/pdf)文件

1、文件配置

solr6.6 solrJ索引富文本(word/pdf)文件的更多相关文章

随机推荐

热门专题

　　1、文件配置