1. Source packages

core: Lucene core library

analyzers-common: Analyzers for indexing content in different languages and domains.
analyzers-icu: Analysis integration with ICU (International Components for Unicode).
analyzers-kuromoji: Japanese Morphological Analyzer
analyzers-morfologik: Analyzer for dictionary stemming, built-in Polish dictionary
analyzers-nori: Korean Morphological Analyzer
analyzers-opennlp: OpenNLP Library Integration
analyzers-phonetic: Analyzer for indexing phonetic signatures (for sounds-alike search)
analyzers-smartcn: Analyzer for indexing Chinese
analyzers-stempel: Analyzer for indexing Polish
backward-codecs: Codecs for older versions of Lucene.
benchmark: System for benchmarking Lucene
classification: Classification module for Lucene
codecs: Lucene codecs and postings formats.
demo: Simple example code
expressions: Dynamically computed values to sort/facet/search on based on a pluggable grammar.
facet: Faceted indexing and search capabilities
grouping: Collectors for grouping search results.
highlighter: Highlights search keywords in results
join: Index-time and Query-time joins for normalized content
memory: Single-document in-memory index implementation
misc: Index tools and other miscellaneous code
queries: Filters and Queries that add to core Lucene
queryparser: Query parsers and parsing framework
replicator: Files replication utility
sandbox: Various third party contributions and new ideas
spatial: Geospatial search
spatial3d: 3D spatial planar geometry APIs
spatial-extras: Geospatial search
suggest: Auto-suggest and Spellchecking support
test-framework: Framework for testing Lucene-based applications

The core module exposes the following API packages:

org.apache.lucene
Top-level package.
org.apache.lucene.analysis
Text analysis.
org.apache.lucene.analysis.standard
Fast, general-purpose grammar-based tokenizer StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
org.apache.lucene.analysis.tokenattributes
General-purpose attributes for text analysis.
org.apache.lucene.codecs
Codecs API: API for customization of the encoding and structure of the index.
org.apache.lucene.codecs.blocktree
BlockTree terms dictionary.
org.apache.lucene.codecs.compressing
StoredFieldsFormat that allows cross-document and cross-field compression of stored fields.
org.apache.lucene.codecs.lucene50
Components from the Lucene 5.0 index format. See org.apache.lucene.codecs.lucene50 for an overview of the index format.
org.apache.lucene.codecs.lucene60
Components from the Lucene 6.0 index format.
org.apache.lucene.codecs.lucene62
Components from the Lucene 6.2 index format. See org.apache.lucene.codecs.lucene70 for an overview of the current index format.
org.apache.lucene.codecs.lucene70
Lucene 7.0 file format.
org.apache.lucene.codecs.perfield
Postings format that can delegate to different formats per-field.
org.apache.lucene.document
The logical representation of a Document for indexing and searching.
org.apache.lucene.geo
Geospatial Utility Implementations for Lucene Core
org.apache.lucene.index
Code to maintain and access indices.
org.apache.lucene.search
Code to search indices.
org.apache.lucene.search.similarities
This package contains the various ranking models that can be used in Lucene.
org.apache.lucene.search.spans
The calculus of spans.
org.apache.lucene.store
Binary i/o API, used for all index data.
org.apache.lucene.util
Some utility classes.
org.apache.lucene.util.automaton
Finite-state automaton for regular expressions.
org.apache.lucene.util.bkd
Block KD-tree, implementing the generic spatial data structure described in this paper.
org.apache.lucene.util.fst
Finite state transducers
org.apache.lucene.util.graph
Utility classes for working with token streams as graphs.
org.apache.lucene.util.mutable
Comparable object wrappers
org.apache.lucene.util.packed
Packed integer arrays and streams.

2. Term definitions

2.1 Term

org.apache.lucene.index.Term

  /**
    A Term represents a word from text. This is the unit of search. It is
    composed of two elements, the text of the word, as a string, and the name of
    the field that the text occurred in.

    Note that terms may represent more than words from text fields, but also
    things like dates, email addresses, urls, etc. */
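As a minimal sketch of the two elements described above, the snippet below builds a Term and wraps it in a TermQuery. It assumes the Lucene 7.5 core jar is on the classpath; the class name `TermExample`, the helper `buildTerm`, and the field name "fieldname" are illustrative choices, not part of Lucene.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class TermExample {
    /** Builds the term for the word "text" in the (hypothetical) field "fieldname". */
    static Term buildTerm() {
        // First argument: field name; second argument: text of the word.
        return new Term("fieldname", "text");
    }

    public static void main(String[] args) {
        Term term = buildTerm();
        System.out.println(term.field()); // name of the field the text occurred in
        System.out.println(term.text());  // text of the word
        // A TermQuery matches documents that contain exactly this term.
        TermQuery query = new TermQuery(term);
        System.out.println(query.getTerm().equals(term));
    }
}
```

Because a Term is compared as a (field, text) pair, the same word in two different fields gives two distinct terms.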

2.2 Field

org.apache.lucene.document.Field

  /**
   * Expert: directly create a field for a document. Most
   * users should use one of the sugar subclasses:
   * <ul>
   * <li>{@link TextField}: {@link Reader} or {@link String} indexed for full-text search
   * <li>{@link StringField}: {@link String} indexed verbatim as a single token
   * <li>{@link IntPoint}: {@code int} indexed for exact/range queries.
   * <li>{@link LongPoint}: {@code long} indexed for exact/range queries.
   * <li>{@link FloatPoint}: {@code float} indexed for exact/range queries.
   * <li>{@link DoublePoint}: {@code double} indexed for exact/range queries.
   * <li>{@link SortedDocValuesField}: {@code byte[]} indexed column-wise for sorting/faceting
   * <li>{@link SortedSetDocValuesField}: {@code SortedSet<byte[]>} indexed column-wise for sorting/faceting
   * <li>{@link NumericDocValuesField}: {@code long} indexed column-wise for sorting/faceting
   * <li>{@link SortedNumericDocValuesField}: {@code SortedSet<long>} indexed column-wise for sorting/faceting
   * <li>{@link StoredField}: Stored-only value for retrieving in summary results
   * </ul>
   *
   * <p> A field is a section of a Document. Each field has three
   * parts: name, type and value. Values may be text
   * (String, Reader or pre-analyzed TokenStream), binary
   * (byte[]), or numeric (a Number). Fields are optionally stored in the
   * index, so that they may be returned with hits on the document.
   *
   * <p>
   * NOTE: the field type is an {@link IndexableFieldType}. Making changes
   * to the state of the IndexableFieldType will impact any
   * Field it is used in. It is strongly recommended that no
   * changes be made after Field instantiation.
   */
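The sugar subclasses listed above can be combined on a single document. The following sketch assumes Lucene 7.5 on the classpath; the class `FieldExample`, the helper `buildDocument`, and the field names "body", "id" and "year" are hypothetical. It adds the same numeric value three times because each subclass serves a different purpose: IntPoint for range queries, StoredField for retrieval, NumericDocValuesField for sorting.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldExample {
    /** Builds one document using several of the sugar Field subclasses. */
    static Document buildDocument() {
        Document doc = new Document();
        // Analyzed for full-text search and stored for retrieval.
        doc.add(new TextField("body", "This is the text to be indexed.", Field.Store.YES));
        // Indexed verbatim as a single token, typical for identifiers.
        doc.add(new StringField("id", "doc-42", Field.Store.YES));
        // Indexed for exact/range queries; points are not stored by themselves.
        doc.add(new IntPoint("year", 2018));
        // Stored-only copy so the value can be returned with hits.
        doc.add(new StoredField("year", 2018));
        // Column-wise value for sorting/faceting.
        doc.add(new NumericDocValuesField("year", 2018L));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = buildDocument();
        System.out.println(doc.get("id"));
        System.out.println(doc.getFields().size());
    }
}
```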

2.3 Document

org.apache.lucene.document.Document

  /** Documents are the unit of indexing and search.
   *
   * A Document is a set of fields. Each field has a name and a textual value.
   * A field may be {@link org.apache.lucene.index.IndexableFieldType#stored() stored} with the document, in which
   * case it is returned with search hits on the document. Thus each document
   * should typically contain one or more stored fields which uniquely identify
   * it.
   *
   * <p>Note that fields which are <i>not</i> {@link org.apache.lucene.index.IndexableFieldType#stored() stored} are
   * <i>not</i> available in documents retrieved from the index, e.g. with {@link
   * ScoreDoc#doc} or {@link IndexReader#document(int)}.
   */
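The difference between stored and non-stored fields is easiest to see by writing a document and reading it back. This is a sketch under the assumption of Lucene 7.5 (where RAMDirectory and IndexReader#document(int) are available); `StoredFieldExample`, `indexAndFetch` and the field names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class StoredFieldExample {
    /** Indexes one document in memory and returns it as retrieved from the index. */
    static Document indexAndFetch() {
        try (Directory directory = new RAMDirectory()) {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(directory, config)) {
                Document doc = new Document();
                // Stored: comes back with the retrieved document.
                doc.add(new StringField("id", "doc-1", Field.Store.YES));
                // Indexed but not stored: searchable, yet absent on retrieval.
                doc.add(new TextField("body", "searchable but not stored", Field.Store.NO));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                // The only document in a fresh index has document number 0.
                return reader.document(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document fetched = indexAndFetch();
        System.out.println(fetched.get("id"));   // the stored identifier
        System.out.println(fetched.get("body")); // null, because "body" was not stored
    }
}
```

This is why the Javadoc recommends keeping at least one stored field that uniquely identifies the document: without it, a hit cannot be mapped back to the original record.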

2.4 Segment

org.apache.lucene.index.SegmentInfo

  /**
   * Information about a segment such as its name, directory, and files related
   * to the segment.
   *
   * @lucene.experimental
   */
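To see SegmentInfo objects in practice, one option is to read the latest commit of an index and print each segment's name and document count. The sketch below assumes Lucene 7.5; it disables merging with NoMergePolicy so that each commit leaves its own segment, which is a choice made for this illustration only. `SegmentExample`, `countSegmentsAfterTwoCommits` and `docWithBody` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SegmentExample {
    private static Document docWithBody(String text) {
        Document doc = new Document();
        doc.add(new TextField("body", text, Field.Store.NO));
        return doc;
    }

    /** Commits twice with merging disabled and returns the number of segments. */
    static int countSegmentsAfterTwoCommits() {
        try (Directory directory = new RAMDirectory()) {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setMergePolicy(NoMergePolicy.INSTANCE); // keep segments separate
            try (IndexWriter writer = new IndexWriter(directory, config)) {
                writer.addDocument(docWithBody("first"));
                writer.commit(); // flushes the first segment
                writer.addDocument(docWithBody("second"));
                writer.commit(); // flushes the second segment
            }
            SegmentInfos infos = SegmentInfos.readLatestCommit(directory);
            for (SegmentCommitInfo commitInfo : infos) {
                // commitInfo.info is the SegmentInfo described above.
                System.out.println(commitInfo.info.name + " maxDoc=" + commitInfo.info.maxDoc());
            }
            return infos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countSegmentsAfterTwoCommits());
    }
}
```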

2.5 FSDirectory

org.apache.lucene.store.FSDirectory

  /**
   * Base class for Directory implementations that store index
   * files in the file system.
   * <a name="subclasses"></a>
   * There are currently three core
   * subclasses:
   *
   * <ul>
   *
   * <li>{@link SimpleFSDirectory} is a straightforward
   * implementation using Files.newByteChannel.
   * However, it has poor concurrent performance
   * (multiple threads will bottleneck) as it
   * synchronizes when multiple threads read from the
   * same file.
   *
   * <li>{@link NIOFSDirectory} uses java.nio's
   * FileChannel's positional io when reading to avoid
   * synchronization when reading from the same file.
   * Unfortunately, due to a Windows-only <a
   * href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734">Sun
   * JRE bug</a> this is a poor choice for Windows, but
   * on all other platforms this is the preferred
   * choice. Applications using {@link Thread#interrupt()} or
   * {@link Future#cancel(boolean)} should use
   * {@code RAFDirectory} instead. See {@link NIOFSDirectory} java doc
   * for details.
   *
   * <li>{@link MMapDirectory} uses memory-mapped IO when
   * reading. This is a good choice if you have plenty
   * of virtual memory relative to your index size, eg
   * if you are running on a 64 bit JRE, or you are
   * running on a 32 bit JRE but your index sizes are
   * small enough to fit into the virtual memory space.
   * Java has currently the limitation of not being able to
   * unmap files from user code. The files are unmapped, when GC
   * releases the byte buffers. Due to
   * <a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4724038">
   * this bug</a> in Sun's JRE, MMapDirectory's {@link IndexInput#close}
   * is unable to close the underlying OS file handle. Only when
   * GC finally collects the underlying objects, which could be
   * quite some time later, will the file handle be closed.
   * This will consume additional transient disk usage: on Windows,
   * attempts to delete or overwrite the files will result in an
   * exception; on other platforms, which typically have a &quot;delete on
   * last close&quot; semantics, while such operations will succeed, the bytes
   * are still consuming space on disk. For many applications this
   * limitation is not a problem (e.g. if you have plenty of disk space,
   * and you don't rely on overwriting files on Windows) but it's still
   * an important limitation to be aware of. This class supplies a
   * (possibly dangerous) workaround mentioned in the bug report,
   * which may fail on non-Sun JVMs.
   * </ul>
   *
   * <p>Unfortunately, because of system peculiarities, there is
   * no single overall best implementation. Therefore, we've
   * added the {@link #open} method, to allow Lucene to choose
   * the best FSDirectory implementation given your
   * environment, and the known limitations of each
   * implementation. For users who have no reason to prefer a
   * specific implementation, it's best to simply use {@link
   * #open}. For all others, you should instantiate the
   * desired implementation directly.
   *
   * <p><b>NOTE:</b> Accessing one of the above subclasses either directly or
   * indirectly from a thread while it's interrupted can close the
   * underlying channel immediately if at the same time the thread is
   * blocked on IO. The channel will remain closed and subsequent access
   * to the index will throw a {@link ClosedChannelException}.
   * Applications using {@link Thread#interrupt()} or
   * {@link Future#cancel(boolean)} should use the slower legacy
   * {@code RAFDirectory} from the {@code misc} Lucene module instead.
   *
   * <p>The locking implementation is by default {@link
   * NativeFSLockFactory}, but can be changed by
   * passing in a custom {@link LockFactory} instance.
   *
   * @see Directory
   */
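The "positional io" that the Javadoc credits NIOFSDirectory with is the java.nio call FileChannel.read(ByteBuffer, long). The sketch below is not Lucene's code; it uses only the JDK to illustrate the mechanism: a positional read takes the offset as an argument and leaves the channel's shared position untouched, so concurrent readers of the same file need no lock around a seek-then-read pair. `PositionalReadSketch`, `readAt` and `demo` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalReadSketch {
    /** Reads up to length bytes starting at position, without moving the channel's own position. */
    static String readAt(FileChannel channel, long position, int length) {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        try {
            long offset = position;
            while (buffer.hasRemaining()) {
                int n = channel.read(buffer, offset); // positional read
                if (n < 0) {
                    break; // end of file
                }
                offset += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer.flip();
        return StandardCharsets.US_ASCII.decode(buffer).toString();
    }

    /** Returns two slices of a temp file plus the channel position after both reads. */
    static String[] demo() {
        try {
            Path file = Files.createTempFile("positional-read", ".bin");
            try {
                Files.write(file, "0123456789".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                    String first = readAt(channel, 2, 3);
                    String second = readAt(channel, 7, 3);
                    return new String[] {first, second, Long.toString(channel.position())};
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String value : demo()) {
            System.out.println(value);
        }
    }
}
```

A channel that is read through its shared position, by contrast, needs external synchronization when several threads use it, which matches the bottleneck the Javadoc attributes to SimpleFSDirectory.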

3. Worked example

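  import static org.junit.Assert.assertEquals;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.RAMDirectory;
  import org.junit.Test;

  // Hypothetical JUnit 4 wrapper class so that the snippet compiles on its own.
  public class IndexAndSearchTest {
    @Test
    public void indexThenSearch() throws Exception {
      Analyzer analyzer = new StandardAnalyzer();

      // Store the index in memory:
      Directory directory = new RAMDirectory();
      // To store an index on disk, use this instead (FSDirectory.open takes a
      // java.nio.file.Path; add imports for FSDirectory and java.nio.file.Paths):
      //Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
      IndexWriterConfig config = new IndexWriterConfig(analyzer);
      IndexWriter iwriter = new IndexWriter(directory, config);
      Document doc = new Document();
      String text = "This is the text to be indexed.";
      doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
      iwriter.addDocument(doc);
      iwriter.close();

      // Now search the index:
      DirectoryReader ireader = DirectoryReader.open(directory);
      IndexSearcher isearcher = new IndexSearcher(ireader);
      // Parse a simple query that searches for "text":
      QueryParser parser = new QueryParser("fieldname", analyzer);
      Query query = parser.parse("text");
      // Lucene 7.x has no search(Query, Filter, int) overload; pass the query and hit limit only.
      ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
      assertEquals(1, hits.length);
      // Iterate through the results:
      for (int i = 0; i < hits.length; i++) {
        Document hitDoc = isearcher.doc(hits[i].doc);
        assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
      }
      ireader.close();
      directory.close();
    }
  }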

References

[1] http://lucene.apache.org/core/7_5_0/

[2] http://lucene.apache.org/core/7_5_0/core/index.html
