origianl

For the past few days, I’ve been reading various explanations of the Knuth-Morris-Pratt string searching algorithms. For some reason, none of the explanations were doing it for me. I kept banging my head against a brick wall once I started reading “the prefix of the suffix of the prefix of the…”.

Finally, after reading the same paragraph of CLRS over and over for about 30 minutes, I decided to sit down, do a bunch of examples, and diagram them out. I now understand the algorithm, and can explain it. For those who think like me, here it is in my own words. As a side note, I’m not going to explain why it’s more efficient than na”ive string matching; that’s explained perfectly well in a multitude of places. I’m going to explain exactly how it works, as my brain understands it.

The Partial Match Table

The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.

Here’s the partial match table for the pattern “abababca”:

1
2
3
char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building or something. If I’m looking at the sixth cell of the in the table… you get the idea. Notice that I haven’t talked about what each cell means yet, but just what it’s referring to.

Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.

Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.

Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.

With this in mind, I can now give the one-sentence meaning of the values in the partial match table:

The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.

Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.

Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.

Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.

Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”. Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there are no matches, cell seven gets 0.

Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since they both start and end with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two and up, all the suffixes contain a c, while only the last prefix (“abababc”) does. This seven-character prefix does not match the seven-character suffix (“bababca”), so cell eight gets 1.

How to use the Partial Match Table

We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:

If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.

Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:

1
2
3
char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

The first time we get a partial match is here:

1
2
3
bacbababaabcbab
|
abababca

This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any. The next partial match we get is here:

1
2
3
bacbababaabcbab
|||||
abababca

This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:

1
2
3
4
5
// x denotes a skip

bacbababaabcbab
xx|||
abababca

This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:

1
2
3
4
5
// x denotes a skip

bacbababaabcbab
xx|
abababca

At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.

Conclusion

So there you have it. Like I promised before, it’s no exhaustive explanation or formal proof of KMP; it’s a walk through my brain, with the parts I found confusing spelled out in extreme detail. If you have any questions or notice something I messed up, please leave a comment; maybe we’ll all learn something.

// C++ program for implementation of KMP pattern searching
// algorithm
#include <bits/stdc++.h> void computeLPSArray(char* pat, int M, int* lps); // Prints occurrences of txt[] in pat[]
void KMPSearch(char* pat, char* txt)
{
int M = strlen(pat);
int N = strlen(txt); // create lps[] that will hold the longest prefix suffix
// values for pattern
int lps[M]; // Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps); int i = 0; // index for txt[]
int j = 0; // index for pat[]
while (i < N) {
if (pat[j] == txt[i]) {
j++;
i++;
} if (j == M) {
printf("Found pattern at index %d ", i - j);
j = lps[j - 1];
} // mismatch after j matches
else if (i < N && pat[j] != txt[i]) {
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j - 1];
else
i = i + 1;
}
}
} // Fills lps[] for given patttern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
{
// length of the previous longest prefix suffix
int len = 0; lps[0] = 0; // lps[0] is always 0 // the loop calculates lps[i] for i = 1 to M-1
int i = 1;
while (i < M) {
if (pat[i] == pat[len]) {
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
// This is tricky. Consider the example.
// AAACAAAA and i = 7. The idea is similar
// to search step.
if (len != 0) {
len = lps[len - 1]; // Also, note that we do not increment
// i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
} // Driver program to test above function
int main()
{
char txt[] = "ABABDABACDABABCABAB";
char pat[] = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
}

  

The Knuth-Morris-Pratt Algorithm in my own words(转)的更多相关文章

  1. 我所理解的 KMP(Knuth–Morris–Pratt) 算法

    假设要在 haystack 中匹配 needle . 要理解 KMP 先需要理解两个概念 proper prefix 和 proper suffix,由于找到没有合适的翻译,暂时分别称真实前缀 和 真 ...

  2. 字符串匹配算法--KMP字符串搜索(Knuth–Morris–Pratt string-searching)C语言实现与讲解

    一.前言   在计算机科学中,Knuth-Morris-Pratt字符串查找算法(简称为KMP算法)可在一个主文本字符串S内查找一个词W的出现位置.此算法通过运用对这个词在不匹配时本身就包含足够的信息 ...

  3. 笔试算法题(52):简介 - KMP算法(D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm)

    议题:KMP算法(D.E. Knuth, J.H. Morris, V.R. Pratt Algorithm) 分析: KMP算法用于在一个主串中找出特定的字符或者模式串.现在假设主串为长度n的数组T ...

  4. Aho - Corasick string matching algorithm

    Aho - Corasick string matching algorithm 俗称:多模式匹配算法,它是对 Knuth - Morris - pratt algorithm (单模式匹配算法) 形 ...

  5. GO语言的开源库

    Indexes and search engines These sites provide indexes and search engines for Go packages: godoc.org ...

  6. Go语言(golang)开源项目大全

    转http://www.open-open.com/lib/view/open1396063913278.html内容目录Astronomy构建工具缓存云计算命令行选项解析器命令行工具压缩配置文件解析 ...

  7. 一个字符串搜索的Aho-Corasick算法

    Aho和Corasick对KMP算法(Knuth–Morris–Pratt algorithm)进行了改进,Aho-Corasick算法(Aho-Corasick algorithm)利用构建树,总时 ...

  8. [转]Go语言(golang)开源项目大全

    内容目录 Astronomy 构建工具 缓存 云计算 命令行选项解析器 命令行工具 压缩 配置文件解析器 控制台用户界面 加密 数据处理 数据结构 数据库和存储 开发工具 分布式/网格计算 文档 编辑 ...

  9. go语言项目汇总

    Horst Rutter edited this page 7 days ago · 529 revisions Indexes and search engines These sites prov ...

  10. Golang优秀开源项目汇总, 10大流行Go语言开源项目, golang 开源项目全集(golang/go/wiki/Projects), GitHub上优秀的Go开源项目

    Golang优秀开源项目汇总(持续更新...)我把这个汇总放在github上了, 后面更新也会在github上更新. https://github.com/hackstoic/golang-open- ...

随机推荐

  1. Codeforces 429B Working out:dp【枚举交点】

    题目链接:http://codeforces.com/problemset/problem/429/B 题意: 给你一个n*m的网格,每个格子上有一个数字a[i][j]. 一个人从左上角走到右下角,一 ...

  2. django1.8.3搭建博客——2 django web 开发指南阅读笔记

    一.    django框架 1 .http封装web服务的整个过程.由请求(request)和响应(response)两部分组成. 请求的内容为URL (指向文档的路径). 响应主要为正文(body ...

  3. vmware在桥接模式下配置centos7网络

    首先要将Vmware10.0.3设置为桥接模式. CentOS 7.0默认安装好之后是没有自动开启网络连接的! cd  /etc/sysconfig/network-scripts/  #进入网络配置 ...

  4. python-编译安装Python2.7

    yum中最新的也是Python 2.6.6,只能下载Python 2.7.5的源代码自己编译安装. 操作步骤如下: 1)下载并解压Python 2.7.9的源代码 cd /opt wget --no- ...

  5. Simple Rtmp Server的安装与简单使用

    Simple Rtmp Server是一个国人编写的开源的RTMP/HLS流媒体服务器. 功能与nginx-rtmp-module类似, 可以实现rtmp/hls的分发. 有关nginx-rtmp-m ...

  6. H5 限制input只能输入数字

    <input type="tel" /> 参考: http://blog.csdn.net/kongjiea/article/details/40185951

  7. GSM与GPRS区别介绍

    1.    GSM是全球移动通讯系统(Global System for Mobile Communications)的简称 2.    GPRS是通用分组无线业务(General Packet Ra ...

  8. 办公软件-Excel:Microsoft Office Excel 2003百科

    ylbtech-办公软件-Excel:Microsoft Office Excel 2003百科 Microsoft® Office Excel 2003 是一种电子表格程序,可提供对于 XML 的支 ...

  9. DCloud-MUI:Hello MUI2

    ylbtech-DCloud: 1. <head> <meta charset="utf-8"> <title>Hello MUI</ti ...

  10. js---复选框(全选,不选,反选)demo1--

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/stri ...