KMP Text Matching Algorithm

April 10, 2019 ⁄ General

The KMP Algorithm
Author: ljs
2011-06-20
(Please credit the source when reposting. Thank you!)

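The KMP (Knuth–Morris–Pratt) algorithm was invented at almost the same time as the BM (Boyer-Moore) algorithm, in the late 1970s. (Coincidentally, the growth of the Internet has placed higher demands on text processing, so string processing has become a hot topic again in recent years.) Both can run in time linear in the text length in the worst case: KMP always does, while BM needs the good-suffix rule, plus Galil's refinement when the pattern occurs in the text, to guarantee that bound. KMP differs from BM mainly in two ways: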

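1) KMP compares characters from left to right in every matching attempt. BM starts each attempt at the end of the pattern (its pointer moves from right to left) and, when it hits a mismatch, consults two tables (the good suffix shift table and the bad character shift table) to decide how far to shift the pattern to the right. In practice KMP therefore usually performs more comparisons than BM, whose best case needs only O(n/m) comparisons. (Compare the test output below with the output in my other article on BM-Horspool to see the advantage that BM and its simplified variant gain by starting the comparison at the end of the pattern.)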

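2) KMP does not depend on the size of the alphabet; its preprocessing uses only the pattern. BM must build a shift table for bad characters over the whole alphabet, because a bad character comes from the text rather than from the pattern. KMP can therefore use less storage.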

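In the discussion below, the text T has length n and the pattern P has length m. The current position in T is i (0<=i<n), and the current position in P is j (0<=j<m).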

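Like BM, KMP trades space for time. The basic requirement is to shift the pattern as far to the right as possible without skipping any substring that could match. The algorithm has two main steps: 1) preprocess the pattern to build the failure function; 2) run the matching process.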

****************

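First, a table called the failure function, f(j), is precomputed from the pattern. The argument j is an index into the pattern. f(j) is the length of the longest proper prefix of P[0...j] that is also a suffix of P[0...j], where "proper" means the prefix is shorter than P[0...j] itself. By definition f(0)=0, and f(j)=0 whenever no such prefix exists.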

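For example, take the pattern xyzabc with j=3, P[j]=a. The prefix xyza has no proper prefix that equals one of its suffixes, so f(3)=0. Now take the pattern ABABACA with j=4, P[j]=A. In the prefix ABABA, the longest such prefix is ABA, which overlaps the equal suffix; its length is 3, so f(4)=3. Note that f(j) is a length while j is an index. With 0-based indexing, a length equals the index of the character that follows a prefix of that length. KMP exploits this: when a mismatch occurs at pattern index j+1, f(j) is the next value of the j pointer.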
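The definition can be checked directly with a brute-force computation. The sketch below is not the article's implementation: the class name `FailureByDefinition` and its method are hypothetical, and it deliberately takes the slow route of testing every candidate length, only to make the definition concrete.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of the article's KMP implementation.
class FailureByDefinition {

    // f[j] = length of the longest proper prefix of p[0..j] that is also a suffix of p[0..j].
    static int[] failure(String p) {
        int m = p.length();
        int[] f = new int[m];
        for (int j = 0; j < m; j++) {
            // Try candidate lengths from the longest (j) down to 1.
            // Length j + 1 is excluded because the prefix must be proper.
            for (int len = j; len >= 1; len--) {
                // Compare prefix p[0..len-1] with suffix p[j-len+1..j].
                if (p.regionMatches(0, p, j - len + 1, len)) {
                    f[j] = len;
                    break;
                }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(failure("ABABACA"))); // [0, 0, 1, 2, 3, 0, 1]
        System.out.println(Arrays.toString(failure("xyzabc")));  // [0, 0, 0, 0, 0, 0]
    }
}
```

The entry at index 4 for ABABACA is 3 and the entry at index 3 for xyzabc is 0, matching the two examples in the text.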

****************

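The matching process repeatedly updates i and j until a matching substring of the text is found. When T[i] == P[j], both pointers advance by 1, and a match is reported once the last pattern character (j = m-1) has matched. When T[i] != P[j], there are two cases:
1) j equals 0 (it points at the first character of the pattern). The first character of the pattern must then be compared with the next character of the text: i is incremented by 1 and j stays 0.

2) j is not 0, so the previous j characters P[0...j-1] matched T[i-j...i-1]. This is where the failure function f is used: f[j-1] becomes the next value of j while i is unchanged, which guarantees that no matching substring is skipped. For example, with text BABCXXXX and pattern BABD, the mismatch occurs at i=3, j=3 (C versus D); since f[2]=1, j becomes 1 and i stays 3, which shifts the pattern 2 positions to the right.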
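The cases above can be traced step by step. The following sketch logs every comparison for the text BABCXXXX and the pattern BABD (inputs taken from the test run in this article). The class `KmpStepTrace`, the log format, and the hand-written failure table are assumptions made for illustration; they are not part of the article's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tracing helper; names and log format are illustrative only.
class KmpStepTrace {

    // Runs the matching loop with a precomputed failure table f and logs every comparison.
    // Returns the index of the first match, or -1 if there is none.
    static int match(String text, String pattern, int[] f, List<String> log) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) return 0;
        int i = 0;
        int j = 0;
        while (i < n) {
            boolean equal = text.charAt(i) == pattern.charAt(j);
            log.add("i=" + i + " j=" + j + (equal ? " match" : " mismatch"));
            if (equal) {
                if (j == m - 1) return i - (m - 1);
                i++;
                j++;
            } else if (j == 0) {
                i++;            // case 1: nothing matched yet, advance in the text
            } else {
                j = f[j - 1];   // case 2: fall back in the pattern, i stays put
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] f = {0, 0, 1, 0};   // failure table for "BABD", written out by hand
        List<String> log = new ArrayList<>();
        int pos = match("BABCXXXX", "BABD", f, log);
        log.forEach(System.out::println);
        System.out.println("result=" + pos + ", comparisons=" + log.size());
    }
}
```

In this run the log shows i staying at 3 while j falls from 3 to 1 to 0, and the whole 8-character text is processed with 10 comparisons. Because i never moves backward, the number of comparisons stays proportional to n.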

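Interestingly, the failure function f can be computed with the KMP matching procedure itself, because computing f amounts to matching the pattern against itself. The only difference is that at the start the pattern is shifted one character to the right (that is, i=1, j=0).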
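One way to make this reuse explicit is to factor the loop body into a single step function that both phases call. This is a restructured sketch, not the article's code: the class `SharedStepSketch` and the method `step` are hypothetical names, and the inner `while` loop replaces the three-branch loop used elsewhere in this article.

```java
import java.util.Arrays;

// Hypothetical sketch; class and method names are illustrative, not the article's.
class SharedStepSketch {

    // One matching step: j characters of p are currently matched and c is the next
    // input character. Returns the new number of matched characters.
    static int step(String p, int[] f, int j, char c) {
        while (j > 0 && p.charAt(j) != c) {
            j = f[j - 1];                  // fall back to the next shorter border
        }
        return p.charAt(j) == c ? j + 1 : 0;
    }

    // Failure function: feed the pattern to itself, starting at i = 1 with j = 0.
    static int[] failure(String p) {
        int[] f = new int[p.length()];     // f[0] = 0 by definition
        int j = 0;
        for (int i = 1; i < p.length(); i++) {
            j = step(p, f, j, p.charAt(i));
            f[i] = j;
        }
        return f;
    }

    // Matching: feed the text to the same step function.
    static int match(String text, String p) {
        if (p.isEmpty()) return 0;
        int[] f = failure(p);
        int j = 0;
        for (int i = 0; i < text.length(); i++) {
            j = step(p, f, j, text.charAt(i));
            if (j == p.length()) return i - j + 1;   // first match ends at i
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(failure("ababaca"))); // [0, 0, 1, 2, 3, 0, 1]
        System.out.println(match("The quick brown fox jumps over the lazy dog.", "lazy")); // 35
    }
}
```

While f[i] is being computed, `step` only reads entries of f below index i, which are already final. That is why the table can be filled by the same procedure that later consumes it.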

Implementation:

    import java.util.Arrays;

    /**
     *
     * @author ljs
     * 2011-06-20
     *
     */
    public class KMP {
        // calculate the failure function: f[0], f[1..m-1], m is the length of pattern
        private int[] calFailureF(String pattern) {
            int m = pattern.length();
            int[] f = new int[m];
            // guard: an empty pattern has an empty table
            if (m == 0)
                return f;

            // i is the right border position
            int i = 1;
            int j = 0;
            f[0] = 0; // by definition
            while (i < m) {
                if (pattern.charAt(i) == pattern.charAt(j)) {
                    // j is index from 0, f[i] is the length of suffix/prefix
                    // so we need to add 1
                    f[i] = j + 1;
                    i++;
                    j++;
                } else if (j == 0) {
                    // find no valid prefix
                    f[i] = 0;
                    // move i only, j is still 0
                    i++;
                } else {
                    // move j only, i doesn't change position, thus f[i]'s value is not determined yet.
                    // reuse the KMP algorithm: we already know f[j-1]'s value
                    j = f[j - 1];
                }
            }
            return f;
        }

        // find the first match in text: return the first char's index if found; return -1 otherwise
        public int match(String text, String pattern) {
            int m = pattern.length();
            int n = text.length();
            // guard: an empty pattern matches at position 0 (avoids charAt(0) on an empty string)
            if (m == 0)
                return 0;
            if (m > n)
                return -1;

            int[] f = this.calFailureF(pattern);

            // text's index
            int i = 0;
            // pattern's index
            int j = 0;

            /****BEGIN TEST: the following code snippet can be commented out****/
            System.out.format("%s%n", text);
            System.out.format("%s%n", pattern);
            /****END TEST: the above code snippet can be commented out****/

            while (i < n) {
                if (text.charAt(i) == pattern.charAt(j)) {
                    // if we find the first match, return immediately
                    if (j == m - 1) // the border
                        return i - (m - 1);
                    i++;
                    j++;
                } else if (j == 0) {
                    i++;
                    /****BEGIN TEST: the following code snippet can be commented out****/
                    int dotsCount = i;
                    byte dot[] = new byte[dotsCount];
                    Arrays.fill(dot, (byte) '.');
                    System.out.format("%s%s%n", new String(dot), pattern);
                    /****END TEST: the above code snippet can be commented out****/
                } else {
                    // j-1>=0
                    j = f[j - 1];
                    /****BEGIN TEST: the following code snippet can be commented out****/
                    int dotsCount = i - j;
                    byte dot[] = new byte[dotsCount];
                    Arrays.fill(dot, (byte) '.');
                    System.out.format("%s%s%n", new String(dot), pattern);
                    /****END TEST: the above code snippet can be commented out****/
                }
            }
            return -1;
        }

        public static void printFailureFunction(String pattern, int[] failureFunc) {
            // pattern index positions
            System.out.print("i:");
            for (int i = 0; i < pattern.length(); i++) {
                System.out.format(" %2s", i);
            }
            System.out.println();
            // pattern print
            System.out.print("P:");
            for (int i = 0; i < pattern.length(); i++) {
                System.out.format(" %2s", pattern.charAt(i));
            }
            System.out.println();
            // failure function output
            System.out.print("f:");
            for (int i = 0; i < failureFunc.length; i++) {
                System.out.format(" %2d", failureFunc[i]);
            }
            System.out.println();
            System.out.println();
        }

        public static void findMatch(KMP solver, String text, String pattern) {
            int index = solver.match(text, pattern);
            if (index >= 0) {
                System.out.format("Found at position %d%n", index);
            } else {
                System.out.format("No match%n");
            }
        }

        public static void main(String[] args) {
            KMP kmp = new KMP();

            String pattern = "cbcbcb";
            int[] f = kmp.calFailureF(pattern);
            KMP.printFailureFunction(pattern, f);

            pattern = "ababaca";
            f = kmp.calFailureF(pattern);
            KMP.printFailureFunction(pattern, f);

            pattern = "aaaaaabb";
            f = kmp.calFailureF(pattern);
            KMP.printFailureFunction(pattern, f);

            String text = "BABCXXXX";
            pattern = "BABD";
            KMP.findMatch(kmp, text, pattern);

            text = "After a long text, here's a needle ZZZZZ";
            pattern = "ZZZZZ";
            KMP.findMatch(kmp, text, pattern);

            text = "The quick brown fox jumps over the lazy dog.";
            pattern = "lazy";
            KMP.findMatch(kmp, text, pattern);

            System.out.format("**************%n");

            text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna...";
            pattern = "tempor";
            KMP.findMatch(kmp, text, pattern);
        }
    }

Test output:

i:  0  1  2  3  4  5
P:  c  b  c  b  c  b
f:  0  0  1  2  3  4

i:  0  1  2  3  4  5  6
P:  a  b  a  b  a  c  a
f:  0  0  1  2  3  0  1

i:  0  1  2  3  4  5  6  7
P:  a  a  a  a  a  a  b  b
f:  0  1  2  3  4  5  0  0

BABCXXXX
BABD
..BABD
...BABD
....BABD
.....BABD
......BABD
.......BABD
........BABD
No match
After a long text, here's a needle ZZZZZ
ZZZZZ
.ZZZZZ
..ZZZZZ
...ZZZZZ
[lines with 4 to 33 leading dots omitted]
..................................ZZZZZ
...................................ZZZZZ
Found at position 35
The quick brown fox jumps over the lazy dog.
lazy
.lazy
..lazy
...lazy
[lines with 4 to 33 leading dots omitted]
..................................lazy
...................................lazy
Found at position 35
**************
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna...
tempor
.tempor
..tempor
...tempor
[lines with 4 to 32 leading dots omitted]
.................................tempor
..................................tempor
....................................tempor
.....................................tempor
[lines with 38 to 71 leading dots omitted]
........................................................................tempor
.........................................................................tempor
Found at position 73


Posted by: Ozskpgbd

This post was published by Ozskpgbd in the General category; last updated April 10, 2019.
When reposting, please credit: KMP Text Matching Algorithm | 学步园
