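Notes on Upgrading an Enterprise Crawler System (5): Full-Text Search with JieBaNet + Lucene.Net
Result:
The previous article included a design mockup of the full-text search results page. Below is a screenshot of the real page after development finished and it went live:
The basic style imitates Baidu's search results; the green pagination gives it a fresh look.
So far a little over 30,000 articles have been crawled and indexed. The index files are not very large, and queries are fast.
A knife rusts if it is not sharpened; a person falls behind without learning. Learn something new every day.
Technical background:
The last time I worked on full-text search was in 2013. The core design and code were written by the architect at the time; I only took part throughout.
Back then we used the classic combination: PanGu segmentation (盘古分词) + Lucene.net.
As mentioned in earlier articles, PanGu has not been updated for many years, so the SupportYun system has been using JieBaNet for word segmentation all along.
So is there a ready-made full-text search solution based on JieBaNet + Lucene.Net?
After some searching, I found a simple example on GitHub: https://github.com/anderscui/jiebaForLuceneNet
The implementation described below was inspired by this demo; have a look at it if you are interested.
Versions used: Lucene.net 3.0.3.0 and JieBaNet 0.38.3.0 (with some minor adjustments and extensions, covered in earlier articles).
First, we extend Lucene.Net's Tokenizer and Analyzer on top of JieBaNet.
1. JiebaForLuceneAnalyzer: a JieBa analyzer built on Lucene.Net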
/// <summary>
/// JieBa analyzer built on Lucene.Net
/// </summary>
public class JiebaForLuceneAnalyzer : Analyzer
{
    protected static readonly ISet<string> DefaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
    private static ISet<string> StopWords;

    static JiebaForLuceneAnalyzer()
    {
        StopWords = new HashSet<string>();
        var stopWordsFile = Path.GetFullPath(JiebaNet.Analyser.ConfigManager.StopWordsFile);
        if (File.Exists(stopWordsFile))
        {
            // One stop word per line
            var lines = File.ReadAllLines(stopWordsFile);
            foreach (var line in lines)
            {
                StopWords.Add(line.Trim());
            }
        }
        else
        {
            // Fall back to Lucene's built-in English stop words
            StopWords = DefaultStopWords;
        }
    }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Segment with JieBa, then lower-case and drop stop words
        var seg = new JiebaSegmenter();
        TokenStream result = new JiebaForLuceneTokenizer(seg, reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(true, result, StopWords);
        return result;
    }
}
2. JiebaForLuceneTokenizer: a JieBa tokenizer built on Lucene.Net
/// <summary>
/// JieBa tokenizer extension for Lucene
/// </summary>
public class JiebaForLuceneTokenizer : Tokenizer
{
    private readonly JiebaSegmenter segmenter;
    private readonly ITermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private readonly ITypeAttribute typeAtt;

    // Fully qualified to avoid a clash with Lucene.Net.Analysis.Token
    private readonly List<JiebaNet.Segmenter.Token> tokens;
    // Index of the current token; -1 means iteration has not started
    private int position = -1;

    public JiebaForLuceneTokenizer(JiebaSegmenter seg, TextReader input) : this(seg, input.ReadToEnd()) { }

    public JiebaForLuceneTokenizer(JiebaSegmenter seg, string input)
    {
        segmenter = seg;
        termAtt = AddAttribute<ITermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
        typeAtt = AddAttribute<ITypeAttribute>();

        // Segment the whole input up front; IncrementToken walks the list
        var text = input;
        tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();
    }

    public override bool IncrementToken()
    {
        ClearAttributes();
        position++;
        if (position < tokens.Count)
        {
            var token = tokens[position];
            termAtt.SetTermBuffer(token.Word);
            offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
            typeAtt.Type = "Jieba";
            return true;
        }

        End();
        return false;
    }

    public IEnumerable<JiebaNet.Segmenter.Token> Tokenize(string text, TokenizerMode mode = TokenizerMode.Search)
    {
        return segmenter.Tokenize(text, mode);
    }
}
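If ideals never yield a little to reality, they too will turn to dust.
Solution design:
One question always comes up when designing full-text search: the system is split into many modules whose fields differ widely, so how can a single index support searching within one module, searching the whole site, and even filtering by certain fields?
SupportYun faces the same questions, because its data is naturally split into two categories, activities and articles, with quite different fields. What I wanted is a search that can cover the whole site (results include both activities and articles), can be limited to the article section or the activity section, and can filter by a few specified fields.
Building such a feature takes some care in program design. Here is my design:
I. Index creation
1. We design an IndexManager to handle the basic index create, update, and delete operations.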
public class IndexManager
{
/// <summary>
/// Index storage directory
/// </summary>
public static readonly string IndexStorePath = ConfigurationManager.AppSettings["IndexStorePath"];
private IndexWriter indexWriter;
private FSDirectory entityDirectory;

~IndexManager()
{
if (entityDirectory != null)
{
entityDirectory.Dispose();
}
if (indexWriter != null)
{
indexWriter.Dispose();
}
}

/// <summary>
/// Adds the given contents to the index
/// </summary>
public void BuildIndex(List<IndexContent> indexContents)
{
try
{
if (entityDirectory == null)
{
entityDirectory = FSDirectory.Open(new DirectoryInfo(IndexStorePath));
}
if (indexWriter == null)
{
Analyzer analyzer = new JiebaForLuceneAnalyzer();
indexWriter = new IndexWriter(entityDirectory, analyzer, IndexWriter.MaxFieldLength.LIMITED);
}
lock (IndexStorePath)
{
foreach (var indexContent in indexContents)
{
var doc = GetDocument(indexContent);
indexWriter.AddDocument(doc);
}
indexWriter.Commit();
indexWriter.Optimize();
indexWriter.Dispose();
}
}
catch (Exception exception)
{
LogUtils.ErrorLog(exception);
}
finally
{
if (entityDirectory != null)
{
entityDirectory.Dispose();
}
if (indexWriter != null)
{
indexWriter.Dispose();
}
}
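}

/// <summary>
/// Deletes documents from the index
/// </summary>
/// <param name="moduleType">Module the document belongs to</param>
/// <param name="tableName">Source table name (optional)</param>
/// <param name="rowID">Primary key of the source row</param>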
public void DeleteIndex(string moduleType, string tableName, string rowID)
{
try
{
if (entityDirectory == null)
{
entityDirectory = FSDirectory.Open(new DirectoryInfo(IndexStorePath));
}
if (indexWriter == null)
{
Analyzer analyzer = new JiebaForLuceneAnalyzer();
indexWriter = new IndexWriter(entityDirectory, analyzer, IndexWriter.MaxFieldLength.LIMITED);
}
lock (IndexStorePath)
{
var query = new BooleanQuery
{
{new TermQuery(new Term("ModuleType", moduleType)), Occur.MUST},
{new TermQuery(new Term("RowId", rowID)), Occur.MUST}
};
if (!string.IsNullOrEmpty(tableName))
{
query.Add(new TermQuery(new Term("TableName", tableName)), Occur.MUST);
}

indexWriter.DeleteDocuments(query);
indexWriter.Commit();
indexWriter.Optimize();
indexWriter.Dispose();
}
}
catch (Exception exception)
{
LogUtils.ErrorLog(exception);
}
finally
{
if (entityDirectory != null)
{
entityDirectory.Dispose();
}
if (indexWriter != null)
{
indexWriter.Dispose();
}
}
}

/// <summary>
/// Updates the index (deletes the old document, then adds the new one)
/// </summary>
/// <param name="indexContent">Content to re-index</param>
public void UpdateIndex(IndexContent indexContent)
{
try
{
if (entityDirectory == null)
{
entityDirectory = FSDirectory.Open(new DirectoryInfo(IndexStorePath));
}
if (indexWriter == null)
{
Analyzer analyzer = new JiebaForLuceneAnalyzer();
indexWriter = new IndexWriter(entityDirectory, analyzer, IndexWriter.MaxFieldLength.LIMITED);
}
lock (IndexStorePath)
{
var query = new BooleanQuery
{
{new TermQuery(new Term("ModuleType", indexContent.ModuleType)), Occur.MUST},
{new TermQuery(new Term("RowId", indexContent.RowId.ToString())), Occur.MUST}
};
if (!string.IsNullOrEmpty(indexContent.TableName))
{
query.Add(new TermQuery(new Term("TableName", indexContent.TableName)), Occur.MUST);
}

indexWriter.DeleteDocuments(query);

var document = GetDocument(indexContent);
indexWriter.AddDocument(document);

indexWriter.Commit();
indexWriter.Optimize();
indexWriter.Dispose();
}
}
catch (Exception exception)
{
LogUtils.ErrorLog(exception);
}
finally
{
if (entityDirectory != null)
{
entityDirectory.Dispose();
}
if (indexWriter != null)
{
indexWriter.Dispose();
}
}
}

private Document GetDocument(IndexContent indexContent)
{
var doc = new Document();
doc.Add(new Field("ModuleType", indexContent.ModuleType, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("TableName", indexContent.TableName, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("RowId", indexContent.RowId.ToString().ToLower(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Title", indexContent.Title, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("IndexTextContent", ReplaceIndexSensitiveWords(indexContent.IndexTextContent), Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("CollectTime", indexContent.CollectTime.ToString("yyyy-MM-dd HH:mm:ss"), Field.Store.YES, Field.Index.NO));
// Reserved fields
doc.Add(new Field("Tag1", indexContent.Tag1.Value, GetStoreEnum(indexContent.Tag1.Store)
, GetIndexEnum(indexContent.Tag1.Index)));
doc.Add(new Field("Tag2", indexContent.Tag2.Value, GetStoreEnum(indexContent.Tag2.Store)
, GetIndexEnum(indexContent.Tag2.Index)));
doc.Add(new Field("Tag3", indexContent.Tag3.Value, GetStoreEnum(indexContent.Tag3.Store)
, GetIndexEnum(indexContent.Tag3.Index)));
doc.Add(new Field("Tag4", indexContent.Tag4.Value, GetStoreEnum(indexContent.Tag4.Store)
, GetIndexEnum(indexContent.Tag4.Index)));
doc.Add(new Field("Tag5", indexContent.Tag5.Value, GetStoreEnum(indexContent.Tag5.Store)
, GetIndexEnum(indexContent.Tag5.Index)));
doc.Add(new Field("Tag6", indexContent.Tag6.Value, GetStoreEnum(indexContent.Tag6.Store)
, GetIndexEnum(indexContent.Tag6.Index)));
doc.Add(new Field("Tag7", indexContent.Tag7.Value, GetStoreEnum(indexContent.Tag7.Store)
, GetIndexEnum(indexContent.Tag7.Index)));
doc.Add(new Field("Tag8", indexContent.Tag8.Value, GetStoreEnum(indexContent.Tag8.Store)
, GetIndexEnum(indexContent.Tag8.Index)));
var field = new NumericField("FloatTag9", GetStoreEnum(indexContent.FloatTag9.Store),
indexContent.FloatTag9.Index != IndexEnum.NotIndex);
field = field.SetFloatValue(indexContent.FloatTag9.Value);
doc.Add(field);
field = new NumericField("FloatTag10", GetStoreEnum(indexContent.FloatTag10.Store),
indexContent.FloatTag10.Index != IndexEnum.NotIndex);
field = field.SetFloatValue(indexContent.FloatTag10.Value);
doc.Add(field);
return doc;
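}

/// <summary>
/// Stopgap method for temporary use:
/// strips characters that should not be indexed.
/// </summary>
/// <param name="str">Raw text</param>
/// <returns>Text with the unwanted characters removed</returns>
private string ReplaceIndexSensitiveWords(string str)
{
// The loop bounds were lost in extraction; 0 and 3 are assumed values.
for (var i = 0; i < 3; i++)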
{
str = str.Replace(" ", "");
str = str.Replace(" ", "").Replace("\n", "");
}
return str;
}

private Field.Index GetIndexEnum(IndexEnum index)
{
switch (index)
{
case IndexEnum.NotIndex:
return Field.Index.NO;
case IndexEnum.NotUseAnalyzerButIndex:
return Field.Index.NOT_ANALYZED;
case IndexEnum.UseAnalyzerIndex:
return Field.Index.ANALYZED;
default:
return Field.Index.NO;
}
}

private Field.Store GetStoreEnum(bool store)
{
return store ? Field.Store.YES : Field.Store.NO;
}
}
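2. IndexContent: the standard data class used when creating and updating the index.
We define six base fields: TableName (the DB table name), RowId (the DB primary key), CollectTime (when the DB row was created), ModuleType (the system module the data belongs to), Title (the searchable title), and IndexTextContent (the searchable text). Every module that wants to be indexed must supply these six fields (extend them to suit your own case).
We then define 10 reserved fields, Tag1 to Tag8 plus FloatTag9 and FloatTag10, to accommodate the other fields that differ between modules.
How each reserved field is stored and indexed can be configured independently.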
/// <summary>
/// Extended index content class.
/// Adds 10 reserved fields (8 text, 2 numeric).
/// </summary>
public class IndexContent : BaseIndexContent
{
    public IndexContent()
    {
        Tag1 = new IndexContentStringValue();
        Tag2 = new IndexContentStringValue();
        Tag3 = new IndexContentStringValue();
        Tag4 = new IndexContentStringValue();
        Tag5 = new IndexContentStringValue();
        Tag6 = new IndexContentStringValue();
        Tag7 = new IndexContentStringValue();
        Tag8 = new IndexContentStringValue();
        FloatTag9 = new IndexContentFloatValue();
        FloatTag10 = new IndexContentFloatValue();
    }

    /// <summary>Reserved 1</summary>
    public IndexContentStringValue Tag1 { get; set; }

    /// <summary>Reserved 2</summary>
    public IndexContentStringValue Tag2 { get; set; }

    /// <summary>Reserved 3</summary>
    public IndexContentStringValue Tag3 { get; set; }

    /// <summary>Reserved 4</summary>
    public IndexContentStringValue Tag4 { get; set; }

    /// <summary>Reserved 5</summary>
    public IndexContentStringValue Tag5 { get; set; }

    /// <summary>Reserved 6</summary>
    public IndexContentStringValue Tag6 { get; set; }

    /// <summary>Reserved 7</summary>
    public IndexContentStringValue Tag7 { get; set; }

    /// <summary>Reserved 8</summary>
    public IndexContentStringValue Tag8 { get; set; }

    /// <summary>Reserved 9 (numeric)</summary>
    public IndexContentFloatValue FloatTag9 { get; set; }

    /// <summary>Reserved 10 (numeric)</summary>
    public IndexContentFloatValue FloatTag10 { get; set; }
}

/// <summary>
/// A string value plus how it is stored and indexed
/// </summary>
public class IndexContentStringValue
{
    public IndexContentStringValue()
    {
        Value = "";
        Store = true;
        Index = IndexEnum.NotIndex;
    }

    /// <summary>String value</summary>
    public string Value { get; set; }

    /// <summary>Whether the value is stored</summary>
    public bool Store { get; set; }

    /// <summary>Indexing and analysis mode</summary>
    public IndexEnum Index { get; set; }
}

/// <summary>
/// A numeric value plus how it is stored and indexed
/// </summary>
public class IndexContentFloatValue
{
    public IndexContentFloatValue()
    {
        Value = 0;
        Store = true;
        Index = IndexEnum.NotIndex;
    }

    /// <summary>Numeric value</summary>
    public float Value { get; set; }

    /// <summary>Whether the value is stored</summary>
    public bool Store { get; set; }

    /// <summary>Indexing mode (anything other than NotIndex means indexed)</summary>
    public IndexEnum Index { get; set; }
}
BaseIndexContent holds the six base fields.
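The article does not show BaseIndexContent or IndexEnum themselves. The following is a minimal sketch of what they might look like, inferred only from how the rest of the code uses them (property names, the Guid RowId, and the three enum members). The property types and the member order are assumptions, not the project's actual source.

```csharp
using System;

/// <summary>How a field is indexed (member names taken from the article's code).</summary>
public enum IndexEnum
{
    NotIndex,               // mapped to Field.Index.NO by GetIndexEnum
    NotUseAnalyzerButIndex, // mapped to Field.Index.NOT_ANALYZED
    UseAnalyzerIndex        // mapped to Field.Index.ANALYZED
}

/// <summary>Assumed shape of the base class holding the six base fields.</summary>
public class BaseIndexContent
{
    /// <summary>System module the data belongs to</summary>
    public string ModuleType { get; set; }

    /// <summary>DB table name</summary>
    public string TableName { get; set; }

    /// <summary>DB primary key</summary>
    public Guid RowId { get; set; }

    /// <summary>When the DB row was created</summary>
    public DateTime CollectTime { get; set; }

    /// <summary>Searchable title</summary>
    public string Title { get; set; }

    /// <summary>Searchable text</summary>
    public string IndexTextContent { get; set; }
}
```

With NotIndex declared first it is also the enum's default value, which matches the constructors above that set Index = IndexEnum.NotIndex explicitly.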
3. Create an interface for the per-module index builders: IIndexBuilder.
Each sub-module implements IIndexBuilder to perform its own index operations.
/// <summary>
/// Interface for per-module content index builders
/// </summary>
public interface IIndexBuilder<TIndexContent>
{
    /// <summary>
    /// Builds index entries for a collection of contents
    /// </summary>
    void BuildIndex(List<TIndexContent> indexContents);

    /// <summary>
    /// Deletes an index entry
    /// </summary>
    void DeleteIndex(string tableName, string rowID);

    /// <summary>
    /// Updates index entries
    /// </summary>
    /// <param name="indexContents">Contents to re-index</param>
    void UpdateIndex(List<TIndexContent> indexContents);
}
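4. Next, we take the activity module as an example and implement index creation.
a) First create a data class for the activity module, ActivityIndexContent, containing every field we want to index or store.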
public class ActivityIndexContent
{
    /// <summary>Name of the related table</summary>
    public string TableName { get; set; }

    /// <summary>Row ID in the related table</summary>
    public Guid RowId { get; set; }

    /// <summary>Time the data was collected and analyzed</summary>
    public DateTime CollectTime { get; set; }

    public string Title { get; set; }

    /// <summary>Details</summary>
    public string InformationContent { get; set; }

    /// <summary>Activity categories</summary>
    public List<ActivityType> ActivityTypes { get; set; }

    public Guid CityId { get; set; }

    /// <summary>Activity address</summary>
    public string Address { get; set; }

    /// <summary>Activity date</summary>
    public DateTime? ActivityDate { get; set; }

    /// <summary>Source link</summary>
    public string Url { get; set; }

    /// <summary>Name of the crawled source</summary>
    public string SourceName { get; set; }

    /// <summary>Main site URL of the crawled source</summary>
    public string SourceUrl { get; set; }

    /// <summary>Official hotline of the crawled source</summary>
    public string SourceOfficialHotline { get; set; }
}
b) Then create ActivityIndexBuilder, which implements IIndexBuilder and provides the create, update, and delete methods.
/// <summary>
/// Index builder for activity data
/// </summary>
public class ActivityIndexBuilder : IIndexBuilder<ActivityIndexContent>
{
    // "活动" means "activity"; the literal is stored as-is in the ModuleType field
    public const string MODULETYPE = "活动";

    /// <summary>
    /// Creates index entries
    /// </summary>
    /// <param name="activityIndexContents">Activity contents to index</param>
    public void BuildIndex(List<ActivityIndexContent> activityIndexContents)
    {
        var indexManager = new IndexManager();
        var indexContents = activityIndexContents.Select(activityIndexContent => new IndexContent
        {
            ModuleType = MODULETYPE,
            TableName = activityIndexContent.TableName,
            RowId = activityIndexContent.RowId,
            Title = activityIndexContent.Title,
            IndexTextContent = activityIndexContent.InformationContent,
            CollectTime = activityIndexContent.CollectTime,
            Tag1 = new IndexContentStringValue
            {
                // Activity categories
                Value = activityIndexContent.GetActivityTypeStr()
            },
            Tag2 = new IndexContentStringValue
            {
                // Source link
                Value = activityIndexContent.Url
            },
            Tag3 = new IndexContentStringValue
            {
                // Name of the crawled source (analyzed, so it is searchable)
                Value = activityIndexContent.SourceName,
                Index = IndexEnum.UseAnalyzerIndex
            },
            Tag4 = new IndexContentStringValue
            {
                // Official hotline of the crawled source
                Value = activityIndexContent.SourceOfficialHotline
            },
            Tag5 = new IndexContentStringValue
            {
                // Main site URL of the crawled source
                Value = activityIndexContent.SourceUrl
            },
            Tag6 = new IndexContentStringValue()
            {
                // ID of the city hosting the activity (indexed verbatim for exact-match filtering)
                Value = activityIndexContent.CityId.ToString().ToLower(),
                Index = IndexEnum.NotUseAnalyzerButIndex
            },
            Tag7 = new IndexContentStringValue()
            {
                // Address of the activity
                Value = string.IsNullOrEmpty(activityIndexContent.Address) ? "" : activityIndexContent.Address
            },
            Tag8 = new IndexContentStringValue()
            {
                // Date of the activity
                Value = activityIndexContent.ActivityDate.HasValue ? activityIndexContent.ActivityDate.Value.ToString("yyyy年MM月dd日") : ""
            }
        }).ToList();
        indexManager.BuildIndex(indexContents);
    }

    /// <summary>
    /// Deletes an index entry
    /// </summary>
    /// <param name="tableName">Source table name</param>
    /// <param name="rowID">Primary key of the source row</param>
    public void DeleteIndex(string tableName, string rowID)
    {
        var indexManager = new IndexManager();
        indexManager.DeleteIndex(MODULETYPE, tableName, rowID);
    }

    /// <summary>
    /// Updates index entries
    /// </summary>
    /// <param name="indexContents">Contents to re-index</param>
    public void UpdateIndex(List<ActivityIndexContent> indexContents)
    {
        foreach (var indexContent in indexContents)
        {
            if (indexContent.RowId != Guid.Empty &&
                indexContent.TableName != null)
            {
                // Delete the old entry
                this.DeleteIndex(indexContent.TableName,
                    indexContent.RowId.ToString().ToLower());
            }
        }

        // Add the new entries
        this.BuildIndex(indexContents);
    }
}
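The code needs little explanation; it mainly delegates to IndexManager.
Wherever the business logic needs to index activity data, create an ActivityIndexBuilder, build a list of ActivityIndexContent as the argument, and call BuildIndex.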
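BuildIndex calls a GetActivityTypeStr() helper that the article does not show. One plausible reading is that it flattens the activity's category list into a single string for the stored-only Tag1 field. The sketch below illustrates that idea; the ActivityType class with a Name property, the ActivityTypeFormatter name, and the comma separator are all hypothetical stand-ins, not the project's real types.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in: the article does not show the real ActivityType.
public class ActivityType
{
    public string Name { get; set; }
}

public static class ActivityTypeFormatter
{
    // Joins the category names into one string, skipping null or blank entries.
    // A null list yields an empty string, so the Field constructor never sees null.
    public static string Join(IEnumerable<ActivityType> types, string separator = ",")
    {
        if (types == null)
        {
            return "";
        }

        var names = types
            .Where(t => t != null && !string.IsNullOrWhiteSpace(t.Name))
            .Select(t => t.Name.Trim());
        return string.Join(separator, names);
    }
}
```

Returning an empty string rather than null matters here because Lucene's Field constructor rejects null values.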
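II. Full-text search
Full-text search follows the same design approach.
1. Design an abstract search class, BaseIndexSearch. Every search module (including site-wide search) inherits from it to implement its search.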
public abstract class BaseIndexSearch<TIndexSearchResultItem>
where TIndexSearchResultItem : IndexSearchResultItem
{
/// <summary>
/// Index storage directory
/// </summary>
private static readonly string IndexStorePath = ConfigurationManager.AppSettings["IndexStorePath"];
private readonly string[] fieldsToSearch;
protected static readonly SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<em>", "</em>");
private static IndexSearcher indexSearcher = null;

/// <summary>
/// Size of the highlighted fragment taken from matched content
/// </summary>
public int FragmentSize { get; set; }

/// <summary>
/// Constructor
/// </summary>
/// <param name="fieldsToSearch">Text fields to search</param>
protected BaseIndexSearch(string[] fieldsToSearch)
{
FragmentSize = 100; // the original value was lost in extraction; 100 is an assumed default
this.fieldsToSearch = fieldsToSearch;
}

/// <summary>
/// Creates a search result item instance
/// </summary>
/// <returns></returns>
protected abstract TIndexSearchResultItem CreateIndexSearchResultItem();

/// <summary>
/// Adjusts a search result item (mainly fills the properties backed by the Tag fields)
/// </summary>
/// <param name="indexSearchResultItem">Search result item instance</param>
/// <param name="content">Text the user searched for</param>
/// <param name="docIndex">Position of the document in the index</param>
/// <param name="doc">Document at that position</param>
protected abstract void ModifyIndexSearchResultItem(ref TIndexSearchResultItem indexSearchResultItem, string content, int docIndex, Document doc);

/// <summary>
/// Adjusts the filter (per module)
/// </summary>
/// <param name="filter">Filter conditions: field name mapped to the exact value required</param>
protected abstract void ModifySearchFilter(ref Dictionary<string, string> filter);

/// <summary>
/// Searches the whole index
/// </summary>
/// <param name="content">Search text</param>
/// <param name="filter">Conditions that restrict the results; defaults to null (no restriction)</param>
/// <param name="fieldSorts">Fields to sort by</param>
/// <param name="pageIndex">Current page of the results; defaults to 1</param>
/// <param name="pageSize">Number of results per page; defaults to 20</param>
public PagedIndexSearchResult<TIndexSearchResultItem> Search(string content
, Dictionary<string, string> filter = null, List<FieldSort> fieldSorts = null
, int pageIndex = 1, int pageSize = 20)
{
try
{
if (!string.IsNullOrEmpty(content))
{
content = ReplaceIndexSensitiveWords(content);
content = GetKeywordsSplitBySpace(content,
new JiebaForLuceneTokenizer(new JiebaSegmenter(), content));
}
if (string.IsNullOrEmpty(content) || pageIndex < 1)
{
throw new Exception("Invalid input: the search text is empty or the page index is less than 1.");
}

var stopWatch = new Stopwatch();
stopWatch.Start();

Analyzer analyzer = new JiebaForLuceneAnalyzer();
// Build the query
var query = MakeSearchQuery(content, analyzer);
// Build the filter (copy the caller's dictionary so it is not modified)
filter = filter == null ? new Dictionary<string, string>() : new Dictionary<string, string>(filter);
ModifySearchFilter(ref filter);
Filter luceneFilter = MakeSearchFilter(filter);

#region Execute the query

TopDocs topDocs;
if (indexSearcher == null)
{
var dir = new DirectoryInfo(IndexStorePath);
FSDirectory entityDirectory = FSDirectory.Open(dir);
IndexReader reader = IndexReader.Open(entityDirectory, true);
indexSearcher = new IndexSearcher(reader);
}
else
{
IndexReader indexReader = indexSearcher.IndexReader;
if (!indexReader.IsCurrent())
{
indexSearcher.Dispose();
indexSearcher = new IndexSearcher(indexReader.Reopen());
}
}
// Collect only as many hits as are needed to reach the end of the requested page
int totalCollectCount = pageIndex*pageSize;
Sort sort = GetSortByFieldSorts(fieldSorts);
topDocs = indexSearcher.Search(query, luceneFilter, totalCollectCount, sort ?? Sort.RELEVANCE);

#endregion

#region Build the result

ScoreDoc[] hits = topDocs.ScoreDocs;
// 1-based positions of the first and last hit on the requested page
var start = (pageIndex - 1)*pageSize + 1;
var end = Math.Min(totalCollectCount, hits.Count());

var result = new PagedIndexSearchResult<TIndexSearchResultItem>
{
PageIndex = pageIndex,
PageSize = pageSize,
TotalRecords = topDocs.TotalHits
};

for (var i = start; i <= end; i++)
{
var scoreDoc = hits[i - 1];
var doc = indexSearcher.Doc(scoreDoc.Doc);

var indexSearchResultItem = CreateIndexSearchResultItem();
indexSearchResultItem.DocIndex = scoreDoc.Doc;
indexSearchResultItem.ModuleType = doc.Get("ModuleType");
indexSearchResultItem.TableName = doc.Get("TableName");
indexSearchResultItem.RowId = Guid.Parse(doc.Get("RowId"));
if (!string.IsNullOrEmpty(doc.Get("CollectTime")))
{
indexSearchResultItem.CollectTime = DateTime.Parse(doc.Get("CollectTime"));
}
var title = GetHighlighter(formatter, FragmentSize).GetBestFragment(content, doc.Get("Title"));
indexSearchResultItem.Title = string.IsNullOrEmpty(title) ? doc.Get("Title") : title;
var text = GetHighlighter(formatter, FragmentSize)
.GetBestFragment(content, doc.Get("IndexTextContent"));
indexSearchResultItem.Content = string.IsNullOrEmpty(text)
? (doc.Get("IndexTextContent").Length > 100 // the original length was lost in extraction; 100 is assumed
? doc.Get("IndexTextContent").Substring(0, 100)
: doc.Get("IndexTextContent"))
: text;
ModifyIndexSearchResultItem(ref indexSearchResultItem, content, scoreDoc.Doc, doc);
result.Add(indexSearchResultItem);
}
stopWatch.Stop();
// Elapsed time in seconds (the divisor was lost in extraction; 1000 is assumed)
result.Elapsed = stopWatch.ElapsedMilliseconds*1.0/1000;

return result;

#endregion
}
catch (Exception exception)
{
LogUtils.ErrorLog(exception);
return null;
}
}

private Sort GetSortByFieldSorts(List<FieldSort> fieldSorts)
{
if (fieldSorts == null)
{
return null;
}
return new Sort(fieldSorts.Select(fieldSort => new SortField(fieldSort.FieldName, SortField.FLOAT, !fieldSort.Ascend)).ToArray());
}

private static Filter MakeSearchFilter(Dictionary<string, string> filter)
{
Filter luceneFilter = null;
if (filter != null && filter.Keys.Any())
{
var booleanQuery = new BooleanQuery();
foreach (KeyValuePair<string, string> keyValuePair in filter)
{
var termQuery = new TermQuery(new Term(keyValuePair.Key, keyValuePair.Value));
booleanQuery.Add(termQuery, Occur.MUST);
}
luceneFilter = new QueryWrapperFilter(booleanQuery);
}
return luceneFilter;
}

private Query MakeSearchQuery(string content, Analyzer analyzer)
{
var query = new BooleanQuery();
// Overall query
// Full-text clause over the configured fields
if (!string.IsNullOrEmpty(content))
{
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_30, fieldsToSearch, analyzer);
Query queryObj;
try
{
queryObj = parser.Parse(content);
}
catch (ParseException parseException)
{
throw new Exception("Failed to construct the Query in FileLibraryIndexSearch.", parseException);
}
query.Add(queryObj, Occur.MUST);
}
return query;
}

private string GetKeywordsSplitBySpace(string keywords, JiebaForLuceneTokenizer jiebaForLuceneTokenizer)
{
var result = new StringBuilder();

var words = jiebaForLuceneTokenizer.Tokenize(keywords);

foreach (var word in words)
{
if (string.IsNullOrWhiteSpace(word.Word))
{
continue;
}

result.AppendFormat("{0} ", word.Word);
}

return result.ToString().Trim();
}

private string ReplaceIndexSensitiveWords(string str)
{
// Strip Lucene query-syntax characters and their full-width counterparts.
// Several full-width characters were flattened to ASCII in extraction (leaving duplicated lines);
// the full-width forms below are restored on that assumption.
var sensitiveChars = new[]
{
"+", "+", "-", "-", "!", "!", "(", ")", "(", ")", ":", ":", "^",
"[", "]", "【", "】", "{", "}", "{", "}", "~", "~", "*", "*", "?", "?"
};
foreach (var sensitiveChar in sensitiveChars)
{
str = str.Replace(sensitiveChar, "");
}
return str;
}

protected Highlighter GetHighlighter(Formatter formatter, int fragmentSize)
{
var highlighter = new Highlighter(formatter, new Segment()) { FragmentSize = fragmentSize };
return highlighter;
}
}
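The paging arithmetic inside Search() is easy to get wrong by one, so it can help to look at it in isolation. The sketch below assumes the usual convention that pageIndex is 1-based and that positions within the hit list are 1-based and inclusive; PagingWindow and Compute are hypothetical names introduced only for this illustration.

```csharp
using System;

public static class PagingWindow
{
    // Returns the 1-based inclusive (start, end) positions of the hits that belong to a page.
    // hitCount is the number of hits actually collected (at most pageIndex * pageSize).
    // When end is smaller than start, the requested page is empty.
    public static Tuple<int, int> Compute(int pageIndex, int pageSize, int hitCount)
    {
        if (pageIndex < 1) throw new ArgumentOutOfRangeException("pageIndex");
        if (pageSize < 1) throw new ArgumentOutOfRangeException("pageSize");

        var totalCollectCount = pageIndex * pageSize;       // hits needed to fill pages 1..pageIndex
        var start = (pageIndex - 1) * pageSize + 1;         // first hit on the requested page
        var end = Math.Min(totalCollectCount, hitCount);    // last hit that really exists
        return Tuple.Create(start, end);
    }
}
```

One consequence of collecting pageIndex * pageSize hits per request is that deep pages become progressively more expensive, which is usually acceptable for a search UI where users rarely go beyond the first few pages.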
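The protected abstract methods must be implemented by the derived classes.
To highlight matched keywords in the search results, the code references the Highlighter from PanGu segmentation. In principle this part should have been reimplemented on top of JieBaNet, using the PanGu source as a reference, but because the schedule was tight I referenced PanGu directly.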
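For readers who would rather not depend on PanGu, the core idea of keyword highlighting can be sketched with nothing but the base class library. This is not PanGu's algorithm: it does no fragment selection or scoring, it simply wraps every occurrence of the already-segmented keywords in a tag pair. SimpleHighlighter and its parameters are hypothetical names for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SimpleHighlighter
{
    // Wraps each case-insensitive occurrence of any keyword in open/close tags.
    // Longer keywords are tried first so that a short keyword does not split a longer match.
    public static string Highlight(string text, IEnumerable<string> keywords,
        string open = "<em>", string close = "</em>")
    {
        if (string.IsNullOrEmpty(text) || keywords == null)
        {
            return text;
        }

        var parts = keywords
            .Where(k => !string.IsNullOrWhiteSpace(k))
            .Select(k => k.Trim())
            .Distinct()
            .OrderByDescending(k => k.Length)
            .Select(k => Regex.Escape(k)) // keywords are literals, not patterns
            .ToArray();
        if (parts.Length == 0)
        {
            return text;
        }

        var pattern = string.Join("|", parts);
        return Regex.Replace(text, pattern, m => open + m.Value + close, RegexOptions.IgnoreCase);
    }
}
```

The keywords would typically be the space-separated terms produced by GetKeywordsSplitBySpace. Note that the sketch does not HTML-encode the text, so the input should already be safe to render.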
2. Design an IndexSearchResultItem as the base class for search results.
/// <summary>
/// A single item in the full-index search results
/// </summary>
public class IndexSearchResultItem
{
    /// <summary>Position of the document in the index</summary>
    public int DocIndex { get; set; }

    /// <summary>Module type</summary>
    public string ModuleType { get; set; }

    /// <summary>Table name</summary>
    public string TableName { get; set; }

    /// <summary>Row ID</summary>
    public Guid RowId { get; set; }

    /// <summary>Document title</summary>
    public string Title { get; set; }

    /// <summary>Fragment of the document content</summary>
    public string Content { get; set; }

    public DateTime? CollectTime { get; set; }
}
3. Now for the concrete implementations. First, the site-wide search service, IndexSearch:
public class IndexSearch : BaseIndexSearch<IndexSearchResultItem>
{
    public IndexSearch()
        : base(new[] { "IndexTextContent", "Title" })
    {
    }

    protected override IndexSearchResultItem CreateIndexSearchResultItem()
    {
        return new IndexSearchResultItem();
    }

    protected override void ModifyIndexSearchResultItem(ref IndexSearchResultItem indexSearchResultItem, string content,
        int docIndex, Document doc)
    {
        // No changes needed
    }

    protected override void ModifySearchFilter(ref Dictionary<string, string> filter)
    {
        // No extra filter conditions
    }
}
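Very simple, isn't it? Because this searches the whole site, the results use the base class directly and only the base fields are read.
4. Next, the search implementation for activities.
a) First create an activity search result class, ActivityIndexSearchResultItem, derived from the base result class IndexSearchResultItem.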
public class ActivityIndexSearchResultItem : IndexSearchResultItem
{
    /// <summary>Activity categories</summary>
    public string ActivityTypes { get; set; }

    public Guid CityId { get; set; }

    /// <summary>Activity address</summary>
    public string Address { get; set; }

    /// <summary>Activity date</summary>
    public string ActivityDate { get; set; }

    /// <summary>Source link</summary>
    public string Url { get; set; }

    /// <summary>Name of the crawled source</summary>
    public string SourceName { get; set; }

    /// <summary>Main site URL of the crawled source</summary>
    public string SourceUrl { get; set; }

    /// <summary>Official hotline of the crawled source</summary>
    public string SourceOfficialHotline { get; set; }
}
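b) Then create the search service for the activity module, ActivityIndexSearch, which also inherits from BaseIndexSearch. Compared with site-wide search, ActivityIndexSearch only changes a few things.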
public class ActivityIndexSearch : BaseIndexSearch<ActivityIndexSearchResultItem>
{
    public ActivityIndexSearch()
        : base(new[] { "IndexTextContent", "Title" })
    {
    }

    protected override ActivityIndexSearchResultItem CreateIndexSearchResultItem()
    {
        return new ActivityIndexSearchResultItem();
    }

    protected override void ModifyIndexSearchResultItem(ref ActivityIndexSearchResultItem indexSearchResultItem, string content,
        int docIndex, Document doc)
    {
        // Map the reserved Tag fields back to the activity-specific properties
        indexSearchResultItem.ActivityTypes = doc.Get("Tag1");
        indexSearchResultItem.Url = doc.Get("Tag2");
        indexSearchResultItem.SourceName = doc.Get("Tag3");
        indexSearchResultItem.SourceOfficialHotline = doc.Get("Tag4");
        indexSearchResultItem.SourceUrl = doc.Get("Tag5");
        indexSearchResultItem.CityId = new Guid(doc.Get("Tag6"));
        indexSearchResultItem.Address = doc.Get("Tag7");
        indexSearchResultItem.ActivityDate = doc.Get("Tag8");
    }

    protected override void ModifySearchFilter(ref Dictionary<string, string> filter)
    {
        // Restrict results to the activity module ("活动" = "activity").
        // The indexer is used instead of Add so a caller-supplied ModuleType key does not throw.
        filter["ModuleType"] = "活动";
    }
}
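It adds the filter ModuleType = 活动 (activity), specifies the result data class, and fills in the activity-specific fields on each result.
Calling it from business code is very simple.
Site-wide full-text search: new IndexSearch(), then call its Search() method.
Activity full-text search: new ActivityIndexSearch(), then call its Search() method.
Parameters of Search():
/// <param name="content">Search text</param>
/// <param name="filter">Conditions that restrict the results; defaults to null (no restriction)</param>
/// <param name="fieldSorts">Fields to sort by</param>
/// <param name="pageIndex">Current page of the results; defaults to 1</param>
/// <param name="pageSize">Number of results per page; defaults to 20</param>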
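Search() also relies on two small supporting types that the article does not list: FieldSort and PagedIndexSearchResult. A minimal sketch of how they might be declared follows, inferred from the way Search() uses them (result.Add(...), PageIndex, PageSize, TotalRecords, Elapsed, fieldSort.FieldName, fieldSort.Ascend). The List-based inheritance and the TotalPages convenience property are assumptions added for illustration.

```csharp
using System;
using System.Collections.Generic;

/// <summary>One sort criterion: which field to sort on and in which direction.</summary>
public class FieldSort
{
    public string FieldName { get; set; }

    /// <summary>True for ascending order, false for descending.</summary>
    public bool Ascend { get; set; }
}

/// <summary>One page of results plus paging metadata (assumed to be a List so that Add works).</summary>
public class PagedIndexSearchResult<T> : List<T>
{
    public int PageIndex { get; set; }
    public int PageSize { get; set; }

    /// <summary>Total number of matching documents, not just those on this page.</summary>
    public int TotalRecords { get; set; }

    /// <summary>Time the search took, as reported by Search().</summary>
    public double Elapsed { get; set; }

    /// <summary>Hypothetical helper: number of pages needed for TotalRecords (ceiling division).</summary>
    public int TotalPages
    {
        get { return PageSize <= 0 ? 0 : (TotalRecords + PageSize - 1) / PageSize; }
    }
}
```

Keep in mind that GetSortByFieldSorts builds every SortField with SortField.FLOAT, so in the article's design the FieldName passed in should refer to one of the numeric fields such as FloatTag9 or FloatTag10.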
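If we judge programmers by their soft skills rather than their technical ability, isn't that a little perverse?
Many of these ideas come from that full-text search project in 2013, where I learned from the architect at the time.
My thanks to him.
This is an original article; all the code is pasted from my own project. Please credit the source when reposting.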