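The Aho-Corasick Algorithm for String Searching
Aho and Corasick extended the idea behind the KMP algorithm (Knuth–Morris–Pratt algorithm) from a single pattern to a whole set of keywords. The Aho-Corasick algorithm builds a keyword tree with failure links, then scans a text of length n in O(n) time plus the number of matches reported. The diagram below (taken from "Aho-Corasick string matching in C#") shows the principle: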
Building of the keyword tree (figure 1 - after the first step, figure 2 - tree with the fail function)
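The two steps named in the caption can be sketched compactly as follows. This is a minimal illustration and not the implementation discussed in this post: the names `Node`, `AhoCorasickSketch`, `Build` and `FindAll` are hypothetical, the sketch is case-sensitive only, and it assumes C# 7 or later (for `out` variables and tuples).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for this sketch only.
class Node
{
    public Dictionary<char, Node> Next = new Dictionary<char, Node>();
    public Node Fail;                                  // failure link
    public List<string> Output = new List<string>();   // keywords ending at this node
}

static class AhoCorasickSketch
{
    public static Node Build(IEnumerable<string> keywords)
    {
        // Step 1: insert every keyword into the keyword tree.
        var root = new Node();
        foreach (string word in keywords)
        {
            Node node = root;
            foreach (char c in word)
            {
                if (!node.Next.TryGetValue(c, out Node child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            if (!node.Output.Contains(word)) node.Output.Add(word);
        }

        // Step 2: compute failure links breadth-first, so that every
        // shallower node is finished before the nodes below it.
        root.Fail = root;
        var queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node f = node.Fail;
                while (f != root && !f.Next.ContainsKey(edge.Key)) f = f.Fail;
                edge.Value.Fail = f.Next.TryGetValue(edge.Key, out Node target) ? target : root;
                // Keywords that end at the failure target also end here.
                edge.Value.Output.AddRange(edge.Value.Fail.Output);
                queue.Enqueue(edge.Value);
            }
        }
        return root;
    }

    public static List<(int Index, string Keyword)> FindAll(Node root, string text)
    {
        var hits = new List<(int Index, string Keyword)>();
        Node node = root;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            // Follow failure links until a transition on c exists or the root is reached.
            while (node != root && !node.Next.ContainsKey(c)) node = node.Fail;
            if (node.Next.TryGetValue(c, out Node next)) node = next;
            foreach (string word in node.Output)
                hits.Add((i - word.Length + 1, word));
        }
        return hits;
    }
}

static class SketchDemo
{
    static void Main()
    {
        Node root = AhoCorasickSketch.Build(new[] { "he", "she", "his", "hers" });
        foreach (var hit in AhoCorasickSketch.FindAll(root, "ushers"))
            Console.WriteLine($"{hit.Index}: {hit.Keyword}");
    }
}
```

For the text "ushers" the sketch reports "she" at index 1, then "he" and "hers" at index 2. "he" is found without a second pass because the node for "she" inherits the output of its failure target.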
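A C# implementation is available from the article "Aho-Corasick string matching in C#", and a PDF document describing the algorithm is also available.
Here is a sample application:
It highlights the search keywords in a loaded RTF document, and the search is fast. The sample does not implement whole-word matching. The algorithm code, in brief, is as follows: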
- /* Aho-Corasick text search algorithm implementation
- *
- * For more information visit
- * - http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
- */
- using System;
- using System.Collections;
- namespace EeekSoft.Text
- {
- /// <summary>
- /// Interface containing all methods to be implemented
- /// by string search algorithm
- /// </summary>
- public interface IStringSearchAlgorithm
- {
- #region Methods & Properties
- /// <summary>
- /// Ignore case of letters
- /// </summary>
- bool IgnoreCase { get; set; }
- /// <summary>
- /// List of keywords to search for
- /// </summary>
- string[] Keywords { get; set; }
- /// <summary>
- /// Searches passed text and returns all occurrences of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>Array of occurrences</returns>
- StringSearchResult[] FindAll(string text);
- /// <summary>
- /// Searches passed text and returns first occurrence of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>First occurrence of any keyword (or StringSearchResult.Empty if text doesn't contain any keyword)</returns>
- StringSearchResult FindFirst(string text);
- /// <summary>
- /// Searches passed text and returns true if text contains any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>True when text contains any keyword</returns>
- bool ContainsAny(string text);
- #endregion
- }
- /// <summary>
- /// Structure containing results of search
- /// (keyword and position in original text)
- /// </summary>
- public struct StringSearchResult
- {
- #region Members
- private int _index;
- private string _keyword;
- /// <summary>
- /// Initialize string search result
- /// </summary>
- /// <param name="index">Index in text</param>
- /// <param name="keyword">Found keyword</param>
- public StringSearchResult(int index, string keyword)
- {
- _index = index; _keyword = keyword;
- }
- /// <summary>
- /// Returns index of found keyword in original text
- /// </summary>
- public int Index
- {
- get { return _index; }
- }
- /// <summary>
- /// Returns keyword found by this result
- /// </summary>
- public string Keyword
- {
- get { return _keyword; }
- }
- /// <summary>
- /// Returns empty search result
- /// </summary>
- public static StringSearchResult Empty
- {
- get { return new StringSearchResult(-1, ""); }
- }
- #endregion
- }
- /// <summary>
- /// Class for searching string for one or multiple
- /// keywords using efficient Aho-Corasick search algorithm
- /// </summary>
- public class StringSearch : IStringSearchAlgorithm
- {
- #region Objects
- /// <summary>
- /// Tree node representing character and its
- /// transition and failure function
- /// </summary>
- class TreeNode
- {
- #region Constructor & Methods
- /// <summary>
- /// Initialize tree node with specified character
- /// </summary>
- /// <param name="parent">Parent node</param>
- /// <param name="c">Character</param>
- public TreeNode(TreeNode parent, char c)
- {
- _char = c; _parent = parent;
- _results = new ArrayList();
- _resultsAr = new string[] { };
- _transitionsAr = new TreeNode[] { };
- _transHash = new Hashtable();
- }
- /// <summary>
- /// Adds pattern ending in this node
- /// </summary>
- /// <param name="result">Pattern</param>
- public void AddResult(string result)
- {
- if (_results.Contains(result)) return;
- _results.Add(result);
- _resultsAr = (string[])_results.ToArray(typeof(string));
- }
- /// <summary>
- /// Adds transition node
- /// </summary>
- /// <param name="node">Node</param>
- /// <param name="ignoreCase">Ignore case of letters</param>
- public void AddTransition(TreeNode node, bool ignoreCase)
- {
- if (ignoreCase) _transHash.Add(char.ToLower(node.Char), node);
- else _transHash.Add(node.Char, node);
- TreeNode[] ar = new TreeNode[_transHash.Values.Count];
- _transHash.Values.CopyTo(ar, 0);
- _transitionsAr = ar;
- }
- /// <summary>
- /// Returns transition to specified character (if exists)
- /// </summary>
- /// <param name="c">Character</param>
- /// <param name="ignoreCase">Ignore case of letters</param>
- /// <returns>Returns TreeNode or null</returns>
- public TreeNode GetTransition(char c, bool ignoreCase)
- {
- if (ignoreCase)
- return (TreeNode)_transHash[char.ToLower(c)];
- return (TreeNode)_transHash[c];
- }
- /// <summary>
- /// Returns true if node contains transition to specified character
- /// </summary>
- /// <param name="c">Character</param>
- /// <param name="ignoreCase">Ignore case of letters</param>
- /// <returns>True if transition exists</returns>
- public bool ContainsTransition(char c, bool ignoreCase)
- {
- return GetTransition(c, ignoreCase) != null;
- }
- #endregion
- #region Properties
- private char _char;
- private TreeNode _parent;
- private TreeNode _failure;
- private ArrayList _results;
- private TreeNode[] _transitionsAr;
- private string[] _resultsAr;
- private Hashtable _transHash;
- /// <summary>
- /// Character
- /// </summary>
- public char Char
- {
- get { return _char; }
- }
- /// <summary>
- /// Parent tree node
- /// </summary>
- public TreeNode Parent
- {
- get { return _parent; }
- }
- /// <summary>
- /// Failure function - node to fall back to when no transition matches
- /// </summary>
- public TreeNode Failure
- {
- get { return _failure; }
- set { _failure = value; }
- }
- /// <summary>
- /// Transition function - list of descendant nodes
- /// </summary>
- public TreeNode[] Transitions
- {
- get { return _transitionsAr; }
- }
- /// <summary>
- /// Returns list of patterns ending by this letter
- /// </summary>
- public string[] Results
- {
- get { return _resultsAr; }
- }
- #endregion
- }
- #endregion
- #region Local fields
- /// <summary>
- /// Root of keyword tree
- /// </summary>
- private TreeNode _root;
- /// <summary>
- /// Keywords to search for
- /// </summary>
- private string[] _keywords;
- #endregion
- #region Initialization
- /// <summary>
- /// Initialize search algorithm (Build keyword tree)
- /// </summary>
- /// <param name="keywords">Keywords to search for</param>
- /// <param name="ignoreCase">Ignore case of letters (the default is false)</param>
- public StringSearch(string[] keywords, bool ignoreCase)
- {
- // Set the flag first: assigning Keywords builds the tree using the current IgnoreCase value
- IgnoreCase = ignoreCase;
- Keywords = keywords;
- }
- /// <summary>
- /// Initialize search algorithm (Build keyword tree)
- /// </summary>
- /// <param name="keywords">Keywords to search for</param>
- public StringSearch(string[] keywords)
- {
- Keywords = keywords;
- }
- /// <summary>
- /// Initialize search algorithm with no keywords
- /// (Use Keywords property)
- /// </summary>
- public StringSearch()
- { }
- #endregion
- #region Implementation
- /// <summary>
- /// Build tree from specified keywords
- /// </summary>
- void BuildTree()
- {
- // Build keyword tree and transition function
- _root = new TreeNode(null, ' ');
- foreach (string p in _keywords)
- {
- // add pattern to tree
- TreeNode nd = _root;
- foreach (char c in p)
- {
- TreeNode ndNew = null;
- foreach (TreeNode trans in nd.Transitions)
- {
- if (this.IgnoreCase)
- {
- if (char.ToLower(trans.Char) == char.ToLower(c)) { ndNew = trans; break; }
- }
- else
- {
- if (trans.Char == c) { ndNew = trans; break; }
- }
- }
- if (ndNew == null)
- {
- ndNew = new TreeNode(nd, c);
- nd.AddTransition(ndNew, this.IgnoreCase);
- }
- nd = ndNew;
- }
- nd.AddResult(p);
- }
- // Find failure functions
- ArrayList nodes = new ArrayList();
- // level 1 nodes - fail to root node
- foreach (TreeNode nd in _root.Transitions)
- {
- nd.Failure = _root;
- foreach (TreeNode trans in nd.Transitions) nodes.Add(trans);
- }
- // other nodes - using BFS
- while (nodes.Count != 0)
- {
- ArrayList newNodes = new ArrayList();
- foreach (TreeNode nd in nodes)
- {
- TreeNode r = nd.Parent.Failure;
- char c = nd.Char;
- while (r != null && !r.ContainsTransition(c, this.IgnoreCase)) r = r.Failure;
- if (r == null)
- nd.Failure = _root;
- else
- {
- nd.Failure = r.GetTransition(c, this.IgnoreCase);
- foreach (string result in nd.Failure.Results)
- nd.AddResult(result);
- }
- // add child nodes to BFS list
- foreach (TreeNode child in nd.Transitions)
- newNodes.Add(child);
- }
- nodes = newNodes;
- }
- _root.Failure = _root;
- }
- #endregion
- #region Methods & Properties
- /// <summary>
- /// Ignore case of letters (set before Keywords: the tree is built with the current value)
- /// </summary>
- public bool IgnoreCase
- {
- get;
- set;
- }
- /// <summary>
- /// Keywords to search for (setting this property is slow, because
- /// it requires rebuilding of keyword tree)
- /// </summary>
- public string[] Keywords
- {
- get { return _keywords; }
- set
- {
- _keywords = value;
- BuildTree();
- }
- }
- /// <summary>
- /// Searches passed text and returns all occurrences of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>Array of occurrences</returns>
- public StringSearchResult[] FindAll(string text)
- {
- ArrayList ret = new ArrayList();
- TreeNode ptr = _root;
- int index = 0;
- while (index < text.Length)
- {
- TreeNode trans = null;
- while (trans == null)
- {
- trans = ptr.GetTransition(text[index], this.IgnoreCase);
- if (ptr == _root) break;
- if (trans == null) ptr = ptr.Failure;
- }
- if (trans != null) ptr = trans;
- foreach (string found in ptr.Results)
- ret.Add(new StringSearchResult(index - found.Length + 1, found));
- index++;
- }
- return (StringSearchResult[])ret.ToArray(typeof(StringSearchResult));
- }
- /// <summary>
- /// Searches passed text and returns first occurrence of any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>First occurrence of any keyword (or StringSearchResult.Empty if text doesn't contain any keyword)</returns>
- public StringSearchResult FindFirst(string text)
- {
- TreeNode ptr = _root;
- int index = 0;
- while (index < text.Length)
- {
- TreeNode trans = null;
- while (trans == null)
- {
- trans = ptr.GetTransition(text[index], this.IgnoreCase);
- if (ptr == _root) break;
- if (trans == null) ptr = ptr.Failure;
- }
- if (trans != null) ptr = trans;
- foreach (string found in ptr.Results)
- return new StringSearchResult(index - found.Length + 1, found);
- index++;
- }
- return StringSearchResult.Empty;
- }
- /// <summary>
- /// Searches passed text and returns true if text contains any keyword
- /// </summary>
- /// <param name="text">Text to search</param>
- /// <returns>True when text contains any keyword</returns>
- public bool ContainsAny(string text)
- {
- TreeNode ptr = _root;
- int index = 0;
- while (index < text.Length)
- {
- TreeNode trans = null;
- while (trans == null)
- {
- trans = ptr.GetTransition(text[index], this.IgnoreCase);
- if (ptr == _root) break;
- if (trans == null) ptr = ptr.Failure;
- }
- if (trans != null) ptr = trans;
- if (ptr.Results.Length > 0) return true;
- index++;
- }
- return false;
- }
- #endregion
- }
- }
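Since the sample does not implement whole-word matching, one possible way to add it is to filter the reported occurrences afterwards. The sketch below is an assumption about how that could be done and is not part of the original code: the helper name `IsWholeWord` is hypothetical, and treating any character for which `char.IsLetterOrDigit` returns false as a word boundary is a simplification. The (index, keyword) pairs in the demo are written by hand in the shape `FindAll` reports them.

```csharp
using System;
using System.Collections.Generic;

static class WholeWordFilter
{
    // True when the match text[index .. index + length) has no letter or digit
    // directly before it or directly after it.
    public static bool IsWholeWord(string text, int index, int length)
    {
        bool leftOk = index == 0 || !char.IsLetterOrDigit(text[index - 1]);
        int end = index + length;
        bool rightOk = end == text.Length || !char.IsLetterOrDigit(text[end]);
        return leftOk && rightOk;
    }
}

static class WholeWordDemo
{
    static void Main()
    {
        string text = "she sells seashells";

        // Hand-written occurrences of the keywords "she", "sells" and "sea".
        var matches = new List<(int Index, string Keyword)>
        {
            (0, "she"), (4, "sells"), (10, "sea"), (13, "she")
        };

        foreach (var m in matches)
        {
            // Keep only occurrences that stand alone as words.
            if (WholeWordFilter.IsWholeWord(text, m.Index, m.Keyword.Length))
                Console.WriteLine($"{m.Index}: {m.Keyword}");
        }
    }
}
```

With these inputs only "she" at index 0 and "sells" at index 4 pass the filter. The occurrences inside "seashells" are rejected because a letter follows or precedes them.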
Sample download page: http://www.uushare.com/user/m2nlight/file/2722093
以下是要爬虫的html内容: <div class="article block untagged mb15" id='qiushi_tag_113452216'> & ...