1. Detecting Languages During Indexing

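  During indexing, Solr can identify languages with the langid UpdateRequestProcessor and then map text to language-specific fields. Solr supports two implementations of this feature:

  1. Tika's language detection feature: http://tika.apache.org/0.10/detection.html
  2. LangDetect language detection: http://code.google.com/p/language-detection/

  See http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html for a comparison of the two. In general, LangDetect supports more languages and performs better.

  See http://wiki.apache.org/solr/LanguageDetection for more information about the langid UpdateRequestProcessor.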

1.1 Configuring Language Detection

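  You configure the langid UpdateRequestProcessor in solrconfig.xml. Both implementations take the same parameters. At a minimum, you must specify the fields to run language identification on and the field that receives the resulting language code.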

1.2 Configuring Tika Language Detection

  Here is a minimal Tika langid UpdateRequestProcessor configuration in solrconfig.xml.

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>

1.3 Configuring LangDetect Language Detection

  Here is a minimal LangDetect langid configuration in solrconfig.xml.

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>
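
  A processor element like the ones above only takes effect once it sits inside an update request processor chain that an update handler uses. The following is a minimal sketch of that wiring, not a definitive setup: the chain name langid is an arbitrary name chosen for this example, and the LogUpdateProcessorFactory and RunUpdateProcessorFactory entries are the usual closing processors of a chain.

```xml
<!-- Sketch only: the chain name "langid" is an assumption made for this example -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,subject,text,keywords</str>
      <str name="langid.langField">language_s</str>
    </lst>
  </processor>
  <!-- Log the update, then hand the document on to be indexed -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Route /update requests through the chain by default -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
```

  A single request can also select the chain explicitly by passing update.chain=langid as a request parameter.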

1.4 langid Parameters

  As mentioned above, both implementations of the langid UpdateRequestProcessor take the same parameters:

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| langid | Boolean | true | no | Enables and disables language detection. |
| langid.fl | string | none | yes | A comma- or space-delimited list of fields to be processed by langid. |
| langid.langField | string | none | yes | Specifies the field for the returned language code. |
| langid.langsField | multivalued string | none | no | Specifies the field for a list of returned language codes. If you use langid.map.individual, each detected language is added to this field. |
| langid.overwrite | Boolean | false | no | Specifies whether the content of the langField and langsField fields is overwritten if they already contain values. |
| langid.lcmap | string | none | no | A space-separated list specifying colon-delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common cjk code, and map both American and British English to a single en code, by using langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. This affects the values put into the langField and langsField fields. |
| langid.threshold | float | 0.5 | no | Specifies a threshold value between 0 and 1 that the language identification score must reach before langid accepts it. With longer text fields, a high threshold such as 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results. |
| langid.whitelist | string | none | no | Specifies a list of allowed language identification codes. Use this in combination with langid.map to ensure that you only index documents into fields that are in your schema. |
| langid.map | Boolean | false | no | Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl. |
| langid.map.fl | string | none | no | A comma-separated list of fields for langid.map that is different than the fields specified in langid.fl. |
| langid.map.keepOrig | Boolean | false | no | If true, Solr will copy the field during the field name mapping process, leaving the original field in place. |
| langid.map.individual | Boolean | false | no | If true, Solr will detect and map languages for each field individually. |
| langid.map.individual.fl | string | none | no | A comma-separated list of fields for use with langid.map.individual that is different than the fields specified in langid.fl. |
| langid.fallbackFields | string | none | no | If no language is detected that meets the langid.threshold score, or if the detected language is not on the langid.whitelist, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in langid.fallback. |
| langid.fallback | string | none | no | Specifies a language code to use if no language is detected or specified in langid.fallbackFields. |
| langid.map.lcmap | string | determined by langid.lcmap | no | A space-separated list specifying colon-delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common *_cjk suffix, and map both American and British English fields to a single *_en by using langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. |
| langid.map.pattern | Java regular expression | none | no | By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter. |
| langid.map.replace | Java replace | none | no | By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter. |
| langid.enforceSchema | Boolean | true | no | If false, the langid processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain. |
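
  To see how several of these parameters combine, here is a sketch of a configuration that maps fields by detected language. The field list, the whitelist values, the fallback code, and the 0.8 threshold are illustrative assumptions rather than recommended settings, and the sketch presumes that the schema defines the resulting fields (or a matching dynamic field).

```xml
<!-- Sketch only: field names, whitelist, fallback and threshold are example values -->
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <!-- Detect the language from these fields and store the code in language_s -->
    <str name="langid.fl">title,text</str>
    <str name="langid.langField">language_s</str>
    <!-- Accept a detection only if its score reaches 0.8 -->
    <float name="langid.threshold">0.8</float>
    <!-- Allow only languages the schema has fields for; otherwise fall back to en -->
    <str name="langid.whitelist">en,de,fr</str>
    <str name="langid.fallback">en</str>
    <!-- Rename the fields in langid.fl using the default <field>_<language> pattern -->
    <bool name="langid.map">true</bool>
  </lst>
</processor>
```

  With these settings, a document detected as German would have its title and text fields mapped to title_de and text_de, while a document whose language is not detected confidently or is not on the whitelist would fall back to en.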
