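The Aho-Corasick Algorithm for String Searching
The Aho-Corasick algorithm extends the idea behind the KMP (Knuth–Morris–Pratt) algorithm from a single pattern to a whole set of keywords. It builds a keyword tree (a trie) with a fail function, and then scans the text in a single pass. For a text of length n the search takes O(n) time plus the number of matches reported; building the tree takes time proportional to the total length of the keywords. The diagram below (taken from Aho-Corasick string matching in C#) illustrates the principle: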
Building of the keyword tree (figure 1 - after the first step, figure 2 - tree with the fail function)
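To make the fail function concrete, here is a minimal sketch, separate from the implementation listed later, that builds a keyword tree and computes each node's failure link with a breadth-first pass. The class name `FailureLinkSketch`, the method `BuildFailureLinks`, and the keyword set are assumptions chosen for illustration; they do not come from the original article or its figures.

```csharp
using System;
using System.Collections.Generic;

static class FailureLinkSketch
{
    // Builds a trie over the keywords and returns, for every non-root node,
    // the string it spells mapped to the string spelled by its failure node.
    public static Dictionary<string, string> BuildFailureLinks(string[] keywords)
    {
        // Node 0 is the root; next[i] maps a character to a child node id.
        var next = new List<Dictionary<char, int>> { new Dictionary<char, int>() };
        var label = new List<string> { "" };

        foreach (string word in keywords)
        {
            int node = 0;
            foreach (char c in word)
            {
                int child;
                if (!next[node].TryGetValue(c, out child))
                {
                    child = next.Count;
                    next.Add(new Dictionary<char, int>());
                    label.Add(label[node] + c);
                    next[node][c] = child;
                }
                node = child;
            }
        }

        // Breadth-first pass: a child's failure link is found by walking the
        // parent's failure chain until some node has a transition on the same
        // character. Depth-1 nodes fail to the root (fail[] defaults to 0).
        var fail = new int[next.Count];
        var queue = new Queue<int>();
        foreach (int child in next[0].Values) queue.Enqueue(child);

        while (queue.Count > 0)
        {
            int node = queue.Dequeue();
            foreach (KeyValuePair<char, int> edge in next[node])
            {
                int f = fail[node];
                while (f != 0 && !next[f].ContainsKey(edge.Key)) f = fail[f];
                int target;
                fail[edge.Value] = next[f].TryGetValue(edge.Key, out target) ? target : 0;
                queue.Enqueue(edge.Value);
            }
        }

        var links = new Dictionary<string, string>();
        for (int i = 1; i < next.Count; i++) links[label[i]] = label[fail[i]];
        return links;
    }

    static void Main()
    {
        var links = BuildFailureLinks(new[] { "he", "she", "his", "hers" });
        foreach (KeyValuePair<string, string> link in links)
            Console.WriteLine("\"{0}\" -> \"{1}\"", link.Key, link.Value);
    }
}
```

For the keywords he, she, his, hers, the node for "she" falls back to "he", "sh" falls back to "h", and both "his" and "hers" fall back to "s"; every other node falls back to the root (the empty string). The sketch uses generic collections for brevity, whereas the listing that follows uses `ArrayList` and `Hashtable`.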
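The C# implementation can be obtained from Aho-Corasick string matching in C#; a PDF document describing the algorithm is also available.
Here is a sample application:
It highlights the search keywords in a loaded RTF document, and searching is fairly fast. The sample does not implement whole-word matching. The algorithm code, in brief, is as follows: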
/* Aho-Corasick text search algorithm implementation
 *
 * For more information visit
 *   - http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
 */
using System;
using System.Collections;

namespace EeekSoft.Text
{
    /// <summary>
    /// Interface containing all methods to be implemented
    /// by string search algorithm
    /// </summary>
    public interface IStringSearchAlgorithm
    {
        #region Methods & Properties
        /// <summary>
        /// Ignore case of letters
        /// </summary>
        bool IgnoreCase { get; set; }

        /// <summary>
        /// List of keywords to search for
        /// </summary>
        string[] Keywords { get; set; }

        /// <summary>
        /// Searches passed text and returns all occurrences of any keyword
        /// </summary>
        /// <param name="text">Text to search</param>
        /// <returns>Array of occurrences</returns>
        StringSearchResult[] FindAll(string text);

        /// <summary>
        /// Searches passed text and returns first occurrence of any keyword
        /// </summary>
        /// <param name="text">Text to search</param>
        /// <returns>First occurrence of any keyword (or StringSearchResult.Empty if text doesn't contain any keyword)</returns>
        StringSearchResult FindFirst(string text);

        /// <summary>
        /// Searches passed text and returns true if text contains any keyword
        /// </summary>
        /// <param name="text">Text to search</param>
        /// <returns>True when text contains any keyword</returns>
        bool ContainsAny(string text);
        #endregion
    }
    /// <summary>
    /// Structure containing results of search
    /// (keyword and position in original text)
    /// </summary>
    public struct StringSearchResult
    {
        #region Members
        private int _index;
        private string _keyword;

        /// <summary>
        /// Initialize string search result
        /// </summary>
        /// <param name="index">Index in text</param>
        /// <param name="keyword">Found keyword</param>
        public StringSearchResult(int index, string keyword)
        {
            _index = index; _keyword = keyword;
        }

        /// <summary>
        /// Returns index of found keyword in original text
        /// </summary>
        public int Index
        {
            get { return _index; }
        }

        /// <summary>
        /// Returns keyword found by this result
        /// </summary>
        public string Keyword
        {
            get { return _keyword; }
        }

        /// <summary>
        /// Returns empty search result
        /// </summary>
        public static StringSearchResult Empty
        {
            get { return new StringSearchResult(-1, ""); }
        }
        #endregion
    }
    /// <summary>
    /// Class for searching string for one or multiple
    /// keywords using efficient Aho-Corasick search algorithm
    /// </summary>
    public class StringSearch : IStringSearchAlgorithm
    {
        #region Objects
        /// <summary>
        /// Tree node representing character and its
        /// transition and failure function
        /// </summary>
        class TreeNode
        {
            #region Constructor & Methods
            /// <summary>
            /// Initialize tree node with specified character
            /// </summary>
            /// <param name="parent">Parent node</param>
            /// <param name="c">Character</param>
            public TreeNode(TreeNode parent, char c)
            {
                _char = c; _parent = parent;
                _results = new ArrayList();
                _resultsAr = new string[] { };
                _transitionsAr = new TreeNode[] { };
                _transHash = new Hashtable();
            }

            /// <summary>
            /// Adds pattern ending in this node
            /// </summary>
            /// <param name="result">Pattern</param>
            public void AddResult(string result)
            {
                if (_results.Contains(result)) return;
                _results.Add(result);
                _resultsAr = (string[])_results.ToArray(typeof(string));
            }

            /// <summary>
            /// Adds transition node
            /// </summary>
            /// <param name="node">Node</param>
            /// <param name="ignoreCase">Ignore case of letters</param>
            public void AddTransition(TreeNode node, bool ignoreCase)
            {
                // When ignoring case, transitions are keyed by the lower-case character
                if (ignoreCase) _transHash.Add(char.ToLower(node.Char), node);
                else _transHash.Add(node.Char, node);
                // Keep an array copy of the child nodes for fast enumeration
                TreeNode[] ar = new TreeNode[_transHash.Values.Count];
                _transHash.Values.CopyTo(ar, 0);
                _transitionsAr = ar;
            }

            /// <summary>
            /// Returns transition to specified character (if exists)
            /// </summary>
            /// <param name="c">Character</param>
            /// <param name="ignoreCase">Ignore case of letters</param>
            /// <returns>Returns TreeNode or null</returns>
            public TreeNode GetTransition(char c, bool ignoreCase)
            {
                if (ignoreCase)
                    return (TreeNode)_transHash[char.ToLower(c)];
                return (TreeNode)_transHash[c];
            }

            /// <summary>
            /// Returns true if node contains transition to specified character
            /// </summary>
            /// <param name="c">Character</param>
            /// <param name="ignoreCase">Ignore case of letters</param>
            /// <returns>True if transition exists</returns>
            public bool ContainsTransition(char c, bool ignoreCase)
            {
                return GetTransition(c, ignoreCase) != null;
            }
            #endregion

            #region Properties
            private char _char;
            private TreeNode _parent;
            private TreeNode _failure;
            private ArrayList _results;
            private TreeNode[] _transitionsAr;
            private string[] _resultsAr;
            private Hashtable _transHash;

            /// <summary>
            /// Character
            /// </summary>
            public char Char
            {
                get { return _char; }
            }

            /// <summary>
            /// Parent tree node
            /// </summary>
            public TreeNode Parent
            {
                get { return _parent; }
            }

            /// <summary>
            /// Failure function - node to fall back to when no transition
            /// matches (it spells the longest proper suffix of this node's
            /// path that is also a path in the tree)
            /// </summary>
            public TreeNode Failure
            {
                get { return _failure; }
                set { _failure = value; }
            }

            /// <summary>
            /// Transition function - list of child nodes
            /// </summary>
            public TreeNode[] Transitions
            {
                get { return _transitionsAr; }
            }

            /// <summary>
            /// Returns list of patterns ending at this node
            /// </summary>
            public string[] Results
            {
                get { return _resultsAr; }
            }
            #endregion
        }
        #endregion
        #region Local fields
        /// <summary>
        /// Root of keyword tree
        /// </summary>
        private TreeNode _root;

        /// <summary>
        /// Keywords to search for
        /// </summary>
        private string[] _keywords;
        #endregion

        #region Initialization
        /// <summary>
        /// Initialize search algorithm (Build keyword tree)
        /// </summary>
        /// <param name="keywords">Keywords to search for</param>
        /// <param name="ignoreCase">Ignore case of letters (the default is false)</param>
        public StringSearch(string[] keywords, bool ignoreCase)
        {
            // IgnoreCase must be set before Keywords, because assigning
            // Keywords builds the tree using the current IgnoreCase value
            IgnoreCase = ignoreCase;
            Keywords = keywords;
        }

        /// <summary>
        /// Initialize search algorithm (Build keyword tree)
        /// </summary>
        /// <param name="keywords">Keywords to search for</param>
        public StringSearch(string[] keywords)
        {
            Keywords = keywords;
        }

        /// <summary>
        /// Initialize search algorithm with no keywords
        /// (the tree is not built until the Keywords property is assigned,
        /// so assign it before searching)
        /// </summary>
        public StringSearch()
        { }
        #endregion
        #region Implementation
        /// <summary>
        /// Build tree from specified keywords
        /// </summary>
        void BuildTree()
        {
            // Build keyword tree and transition function
            _root = new TreeNode(null, ' ');
            foreach (string p in _keywords)
            {
                // add pattern to tree
                TreeNode nd = _root;
                foreach (char c in p)
                {
                    TreeNode ndNew = null;
                    foreach (TreeNode trans in nd.Transitions)
                    {
                        if (this.IgnoreCase)
                        {
                            if (char.ToLower(trans.Char) == char.ToLower(c)) { ndNew = trans; break; }
                        }
                        else
                        {
                            if (trans.Char == c) { ndNew = trans; break; }
                        }
                    }
                    if (ndNew == null)
                    {
                        ndNew = new TreeNode(nd, c);
                        nd.AddTransition(ndNew, this.IgnoreCase);
                    }
                    nd = ndNew;
                }
                nd.AddResult(p);
            }

            // Find failure functions
            ArrayList nodes = new ArrayList();
            // level 1 nodes - fail to root node
            foreach (TreeNode nd in _root.Transitions)
            {
                nd.Failure = _root;
                foreach (TreeNode trans in nd.Transitions) nodes.Add(trans);
            }
            // other nodes - using BFS
            while (nodes.Count != 0)
            {
                ArrayList newNodes = new ArrayList();
                foreach (TreeNode nd in nodes)
                {
                    // Walk the parent's failure chain until a node with a
                    // transition on c is found (the root's failure is still
                    // null here, which ends the walk)
                    TreeNode r = nd.Parent.Failure;
                    char c = nd.Char;
                    while (r != null && !r.ContainsTransition(c, this.IgnoreCase)) r = r.Failure;
                    if (r == null)
                        nd.Failure = _root;
                    else
                    {
                        nd.Failure = r.GetTransition(c, this.IgnoreCase);
                        // Patterns ending at the failure node also end here
                        foreach (string result in nd.Failure.Results)
                            nd.AddResult(result);
                    }
                    // add child nodes to BFS list
                    foreach (TreeNode child in nd.Transitions)
                        newNodes.Add(child);
                }
                nodes = newNodes;
            }
            _root.Failure = _root;
        }
        #endregion
        #region Methods & Properties
        /// <summary>
        /// Ignore case of letters (changing this after Keywords has been set
        /// does not rebuild the tree; assign Keywords again to rebuild it)
        /// </summary>
        public bool IgnoreCase
        {
            get;
            set;
        }

        /// <summary>
        /// Keywords to search for (setting this property is slow, because
        /// it requires rebuilding of keyword tree)
        /// </summary>
        public string[] Keywords
        {
            get { return _keywords; }
            set
            {
                _keywords = value;
                BuildTree();
            }
        }

        /// <summary>
        /// Searches passed text and returns all occurrences of any keyword
        /// </summary>
        /// <param name="text">Text to search</param>
        /// <returns>Array of occurrences</returns>
        public StringSearchResult[] FindAll(string text)
        {
            ArrayList ret = new ArrayList();
            TreeNode ptr = _root;
            int index = 0;
            while (index < text.Length)
            {
                // Follow failure links until a transition on the current
                // character exists or the root is reached
                TreeNode trans = null;
                while (trans == null)
                {
                    trans = ptr.GetTransition(text[index], this.IgnoreCase);
                    if (ptr == _root) break;
                    if (trans == null) ptr = ptr.Failure;
                }
                if (trans != null) ptr = trans;
                // Every pattern in Results ends at the current index
                foreach (string found in ptr.Results)
                    ret.Add(new StringSearchResult(index - found.Length + 1, found));
                index++;
            }
            return (StringSearchResult[])ret.ToArray(typeof(StringSearchResult));
        }

        /// <summary>
        /// Searches passed text and returns first occurrence of any keyword
        /// </summary>
        /// <param name="text">Text to search</param>
        /// <returns>First occurrence of any keyword (or StringSearchResult.Empty if text doesn't contain any keyword)</returns>
        public StringSearchResult FindFirst(string text)
        {
            TreeNode ptr = _root;
            int index = 0;
            while (index < text.Length)
            {
                TreeNode trans = null;
                while (trans == null)
                {
                    trans = ptr.GetTransition(text[index], this.IgnoreCase);
                    if (ptr == _root) break;
                    if (trans == null) ptr = ptr.Failure;
                }
                if (trans != null) ptr = trans;
                // Return as soon as any pattern ends at the current index
                foreach (string found in ptr.Results)
                    return new StringSearchResult(index - found.Length + 1, found);
                index++;
            }
            return StringSearchResult.Empty;
        }

        /// <summary>
        /// Searches passed text and returns true if text contains any keyword
        /// </summary>
        /// <param name="text">Text to search</param>
        /// <returns>True when text contains any keyword</returns>
        public bool ContainsAny(string text)
        {
            TreeNode ptr = _root;
            int index = 0;
            while (index < text.Length)
            {
                TreeNode trans = null;
                while (trans == null)
                {
                    trans = ptr.GetTransition(text[index], this.IgnoreCase);
                    if (ptr == _root) break;
                    if (trans == null) ptr = ptr.Failure;
                }
                if (trans != null) ptr = trans;
                if (ptr.Results.Length > 0) return true;
                index++;
            }
            return false;
        }
        #endregion
    }
}
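As noted earlier, the sample does not implement whole-word matching. One possible way to add it is to post-filter the matches by checking the characters on either side of each hit. The sketch below is not part of the original code: `WholeWordSketch`, `IsWordChar`, and `IsWholeWord` are hypothetical names, and it assumes that a "word character" is a letter, a digit, or an underscore.

```csharp
using System;

static class WholeWordSketch
{
    // Assumption: a "word character" is a letter, a digit, or an underscore.
    static bool IsWordChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }

    // Returns true when the match that starts at index and spans length
    // characters is not immediately preceded or followed by a word character.
    public static bool IsWholeWord(string text, int index, int length)
    {
        bool startsAtBoundary = index == 0 || !IsWordChar(text[index - 1]);
        int end = index + length; // first position after the match
        bool endsAtBoundary = end >= text.Length || !IsWordChar(text[end]);
        return startsAtBoundary && endsAtBoundary;
    }

    static void Main()
    {
        string text = "the cat concatenates";
        // "cat" occurs at index 4 (a whole word) and at index 11 (inside "concatenates").
        Console.WriteLine(IsWholeWord(text, 4, 3));  // True
        Console.WriteLine(IsWholeWord(text, 11, 3)); // False
    }
}
```

With the listing above, each `StringSearchResult r` returned by `FindAll(text)` could be kept only when `IsWholeWord(text, r.Index, r.Keyword.Length)` is true. Filtering after the search leaves the keyword tree untouched, at the cost of two extra character checks per match.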
Sample download page: http://www.uushare.com/user/m2nlight/file/2722093