The KMP Algorithm

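I cover the KMP algorithm in two parts. Part (1) is the algorithm itself and introduces the idea behind KMP; part (2) is the implementation and presents a compact implementation together with commentary. Because this post is mainly about how to implement KMP, the implementation is analyzed first; readers who are not yet comfortable with KMP may want to read the algorithm introduction first and then come back to the code.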

(2). Implementation:

One remark first, which is also key to understanding KMP: “KMP algorithm never re-compares a character in T that has matched a character in P”, where T is the target text and P is the pattern.

Here is the code, written by an expert (original source unknown).

#include <stdio.h>

#define MAX_N 10010

char T[MAX_N], P[MAX_N];

// n is the length of T, m is the length of P
int next[MAX_N], n, m;

// Compute next[]: next[i] is the length of the longest proper prefix of
// P[0..i-1] that matches a proper suffix; next[0] = -1 is a sentinel.
void kmpPreprocess()
{
    int i = 0, j = -1;
    next[0] = -1;
    while (i < m) // ① why does this loop compute next[] correctly?
    {
        while (j >= 0 && P[i] != P[j]) j = next[j];
        i++, j++;
        next[i] = j;
    }
}

void kmpSearch()
{
    int i = 0, j = 0;
    while (i < n)
    {
        // On a mismatch, fall back in P without moving backwards in T.
        while (j >= 0 && T[i] != P[j]) j = next[j];
        i++, j++;
        if (j == m)
        {
            printf("P is found at index %d in T\n", i - j);
            j = next[j]; // keep going so overlapping matches are found too
        }
    }
}
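To make the role of j=next[j] after a full match easier to see, here is a minimal sketch of the same search loop with explicit parameters instead of globals. The name kmp_count and its signature are my own assumptions, not part of the original code; next[] is expected to hold m + 1 entries in the layout that kmpPreprocess() produces, and treating an empty pattern as "no match" is also an assumption.

```c
/* Hypothetical helper (not in the original code): counts the occurrences
   of p[0..m-1] in t[0..n-1], overlapping occurrences included.
   next[] must have m + 1 entries, with next[0] == -1. */
int kmp_count(const char *t, int n, const char *p, int m, const int *next)
{
    int i = 0, j = 0, count = 0;
    if (m == 0)
        return 0;            /* assumption: empty pattern means no match */
    while (i < n) {
        /* On a mismatch, fall back in p; i never moves backwards. */
        while (j >= 0 && t[i] != p[j])
            j = next[j];
        i++, j++;
        if (j == m) {        /* full match ending at t[i-1] */
            count++;
            j = next[j];     /* reuse the longest border of the whole pattern */
        }
    }
    return count;
}
```

For the pattern "aba", whose next[] is {-1, 0, 0, 1}, the text "ababa" should give a count of 2, because the two occurrences overlap in the middle "a".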

Let us now analyze why the loop marked ① in the code above computes next[] correctly.

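First, two facts:

  1. next[i] is the length of the longest proper prefix of P[0…i-1] that matches a proper suffix of P[0…i-1] (next[0] = -1 is only a sentinel).
  2. At the start of every iteration of the outer loop, j = next[i] holds; that is, j is the length of the longest proper prefix of P[0…i-1] that matches a proper suffix.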

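  Next, how do we compute next[i+1]? There are two cases.

  1. If P[i]==P[j], then next[i+1]=next[i]+1. This is easy to see, so I will not dwell on it.
  2. If they differ, the code sets j=next[j]; that is, j becomes the length of the longest proper prefix of P[0…j-1] that matches a proper suffix. It then compares P[i] with P[j] again, and repeats until they match or j drops to -1. But why is this correct?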

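  Let us spell this out by restating the problem.

First, note that when P[i]!=P[j], next[i+1] is smaller than it would have been had P[i]==P[j], i.e. smaller than j+1. Any prefix of P[0…i] that matches a suffix of P[0…i] consists of a prefix of P[0…i-1] that matches a suffix of P[0…i-1], followed by P[i]. So we need the next shorter candidate length k < j with P[0…k-1]==P[i-k…i-1]. Because P[0…j-1]==P[i-j…i-1], the suffix P[i-k…i-1] is also a suffix of P[0…j-1], so the candidates are exactly the proper prefixes of P[0…j-1] that match a proper suffix of P[0…j-1]. The longest of these has length next[j], which is why the code sets j=next[j] and then compares P[i] with P[j] again.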
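As a sanity check on the argument above, here is a minimal sketch that computes next[] with the same loop and compares it against a brute-force reading of the definition. The names build_next and longest_border are hypothetical helpers introduced only for this illustration; they do not appear in the original code.

```c
#include <string.h>

/* Hypothetical helper: same loop as kmpPreprocess(), but with explicit
   parameters. Fills next[0..m] for the pattern p[0..m-1]. */
void build_next(const char *p, int m, int *next)
{
    int i = 0, j = -1;
    next[0] = -1;                 /* sentinel */
    while (i < m) {
        /* Invariant: j == next[i] at this point. */
        while (j >= 0 && p[i] != p[j])
            j = next[j];          /* try the next shorter candidate */
        i++, j++;
        next[i] = j;
    }
}

/* Brute force straight from the definition: the length of the longest
   proper prefix of p[0..len-1] that matches a proper suffix of it. */
int longest_border(const char *p, int len)
{
    for (int k = len - 1; k > 0; k--)
        if (memcmp(p, p + (len - k), (size_t)k) == 0)
            return k;
    return 0;
}
```

For every len from 1 to m, next[len] should equal longest_border(p, len); for the pattern "abababca" both give 0, 0, 1, 2, 3, 4, 0, 1.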

(1). Algorithm introduction:

I will not write this part myself, because I have already found an explanation that is good enough.

Reposted from: http://jakeboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/

The Partial Match Table

The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.

Here’s the partial match table for the pattern “abababca”:

char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building or something. If I’m looking at the sixth cell in the table… you get the idea. Notice that I haven’t talked about what each cell means yet, but just what it’s referring to.

Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.

Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.

Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.

With this in mind, I can now give the one-sentence meaning of the values in the partial match table:

The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.

Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.

Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.

Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.

Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”. Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there are no matches, cell seven gets 0.

Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since they both start and end with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two and up, all the suffixes contain a c, while only the last prefix (“abababc”) does. This seven-character prefix does not match the seven-character suffix (“bababca”), so cell eight gets 1.
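The cell values worked out by hand above can also be computed in a single pass over the pattern. The following is a minimal sketch under my own naming (build_table is a hypothetical helper, not something from the quoted article). Note that the prose counts cells from one while the array is indexed from zero, so "cell five" is table[4].

```c
/* Hypothetical helper: table[i] is the length of the longest proper prefix
   of p[0..i] that matches a proper suffix of p[0..i]. */
void build_table(const char *p, int m, int *table)
{
    int k = 0;                     /* length of the prefix matched so far */
    if (m > 0)
        table[0] = 0;              /* a single character has no proper prefix */
    for (int i = 1; i < m; i++) {
        while (k > 0 && p[i] != p[k])
            k = table[k - 1];      /* fall back to a shorter matching prefix */
        if (p[i] == p[k])
            k++;
        table[i] = k;
    }
}
```

For "abababca" this fills the table with 0, 0, 1, 2, 3, 4, 0, 1, matching the values listed earlier. Compared with the next[] array in the implementation part, the same numbers appear shifted by one position: table[i] here corresponds to next[i+1] there.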

How to use the Partial Match Table

We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:

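If a partial match of length partial_match_length is found (with partial_match_length > 0), we may skip ahead partial_match_length - table[partial_match_length - 1] characters. When partial_match_length is 1 this is a skip of just one character, i.e. the ordinary step to the next position.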

Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:

char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

The first time we get a partial match is here:

bacbababaabcbab
 |
 abababca

This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any. The next partial match we get is here:

bacbababaabcbab
    |||||
    abababca

This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:

// x denotes a skip
bacbababaabcbab
    xx|||
      abababca

This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:

// x denotes a skip
bacbababaabcbab
      xx|
        abababca

At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.
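The walkthrough above can be sketched as a search that literally slides the pattern along the text. This is only an illustration under my own assumptions: the name kmp_find_shift and its signature are hypothetical, the partial match table is passed in ready-made, and only the first occurrence is reported.

```c
/* Hypothetical helper: returns the index of the first occurrence of
   p[0..m-1] in t[0..n-1], or -1 if there is none. table[] is the partial
   match table of p, with m entries. */
int kmp_find_shift(const char *t, int n, const char *p, int m, const int *table)
{
    int s = 0;   /* position in t where the pattern currently starts */
    int q = 0;   /* partial_match_length: characters known to match */
    while (s + m <= n) {       /* stop once the pattern no longer fits */
        while (q < m && t[s + q] == p[q])
            q++;
        if (q == m)
            return s;          /* full match */
        if (q == 0) {
            s += 1;            /* nothing matched: ordinary one-step advance */
        } else {
            int keep = table[q - 1];
            s += q - keep;     /* skip ahead q - table[q - 1] characters */
            q = keep;          /* these characters are not compared again */
        }
    }
    return -1;
}
```

With the text “bacbababaabcbab” and the pattern “abababca”, the partial matches of length 1, 5, 3 and 1 should produce the shifts 1, 2, 2 from the walkthrough, after which the pattern no longer fits and the function returns -1.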

Conclusion

So there you have it. Like I promised before, it’s no exhaustive explanation or formal proof of KMP; it’s a walk through my brain, with the parts I found confusing spelled out in extreme detail. If you have any questions or notice something I messed up, please leave a comment; maybe we’ll all learn something.
