KMP字符串搜寻

2012-10-11

KMP字符串搜索?KMP号称是效率最高的字符串搜索算法，大学数据结构的时候老师说这玩意不考，然后我就直接趴

KMP字符串搜索

KMP号称是效率最高的字符串搜索算法，大学数据结构的时候老师说这玩意不考，然后我就直接趴下睡觉了-_-|||。这阵子正好要做点字符串搜索的东西，顺便抄起数据结构的书看看。KMP算法是Knuth、Morris、Pratt仨牛人发明的，不过俺也只听说过Knuth。算法的思想不难，不过还是有点小绕脑子的。

Java SDK的String类中的indexOf方法没有使用KMP搜索，基本上算是最简单的搜索

??? /**
???? * Code shared by String andStringBuffer to do searches. The source is the
???? * character array being searched,and the target is the string being
???? * searched for.
???? *
???? * @paramsource
???? *??????????? the characters being searched.
???? * @paramsourceOffset
???? *??????????? offset of the source string.
???? * @paramsourceCount
???? *??????????? count of the source string.
???? * @paramtarget
???? *??????????? the characters being searched for.
???? * @paramtargetOffset
???? *??????????? offset of the target string.
???? * @paramtargetCount
???? *??????????? count of the target string.
???? * @paramfromIndex
???? *??????????? the index to begin searching from.
???? */
??? static int indexOf(char[] source, intsourceOffset, int sourceCount,
??????????? char[]target, int targetOffset, int targetCount,int fromIndex) {

??????? // if start from a position that is beyond the sourcestring
???????if (fromIndex >= sourceCount) {
??????????? // return the string length if target string is empty,otherwise,
??????????? // return -1 which means matchfails
???????????return (targetCount == 0 ? sourceCount : -1);
??????? }

??????? // correct the fromIndex
???????if (fromIndex < 0) {
??????????? fromIndex = 0;
??????? }

??????? // if target string is empty, return fromIndex
???????if (targetCount == 0) {
??????????? returnfromIndex;
??????? }

??????? // first char to match
???????char first = target[targetOffset];

??????? /*
???????? * a little optimize. let's saythe source string length is 9 and the
???????? * target String length is 7. Thenstarting from 3 (index is 2) of
???????? * source string is the lastchange to match the whole target sting.
???????? * Otherwise, there are only 6characters in source string and it would
???????? * definitely not going to matchthe target string whose length is 7.
???????? */
??????? int max =sourceOffset + (sourceCount - targetCount);

??????? // loop from the first to the max
???????for (int i = sourceOffset + fromIndex; i <= max;i++) {
??????????? /* Look for first character. */
??????????? if (source[i]!= first) {
??????????????? // using i <= max, not i < max
??????????????? while (++i<= max && source[i] != first)
??????????????????? ;
??????????? }

??????????? /* Found first character, now look at the rest of v2 */
??????????? if (i<= max) {
??????????????? int j = i+ 1;
??????????????? int end =j + targetCount - 1;
??????????????? // using j < end, not j <= end
??????????????? for (int k = targetOffset + 1; j < end
??????????????????????? &&source[j] == target[k]; j++, k++)
???????????? ???????;

??????????????? if (j ==end) {
??????????????????? /* Found whole string. */
??????????????????? return i - sourceOffset;
??????????????? }
??????????????? // if match fails, i++ and loop again, there are toiterators
??????????????? // for two loops. i andj.
???????????}
??????? }
??????? return -1;
??? }

这是Java String中的搜索算法，对于原字符串使用了两个指针来进行搜索。但是实质上来讲，这个算法还是有回溯的，可以看出来，每次搜索的时候，j都会搜索到一个大于i的位置，而如果搜索失败，则下次搜索将是从i++开始，这就是回溯了。

KMP的优势就是没有回溯，这对于只能够使用一个指针进行搜索的情况下，不仅仅有效率上的优势，实现起来也更自然。当然对于数组来说，使用俩指针并没有什么不便，如果是对于文件或者输入流进行搜索，那回溯起来就会很麻烦了。下面是KMP搜索。

KMP算法的核心就是不回溯原字符串指针，这点其实不难做到，重要的是要想到这一点——对于回溯的字符，其实都是已知的。解释一下就是，比如在"abcdefg"中搜索"abcdeg"，前五个字符"abcdeg"都是匹配的，第六个字符f和g不匹配，这时候，对于上面的搜索算法，i将会+1，整个匹配重新开始一次，这就是回溯了。但是仔细想一下，回溯其实完全可以避免的，因为如果知道是在第六个字符不匹配，那就说明前五个字符都是匹配的，从而说明“知道回溯之后的字符是什么”，对于这个例子来说，我们肯定知道源字符串前面五个字符是"abcde"。这是KMP搜索的根基。

好，下面让我们抛开源字符串吧！我们只关心目标字符串，也就是"abcdeg"。下面我们来设想，如果在搜索中发现源字符串的【n】字符和目标字符串的【m】字符匹配失败，那说明什么呢？说明之前的字符都是匹配的，否则也不会走到这里。也就是源字符串的【n-m】到【n-1】这m个字符与目标字符串的【0】到【m-1】这m个字符匹配。既然已经在搜索之前知道这个相等关系，那何苦在搜索的时候一次又一次的回溯呢？这个本来就是可以预测的，是搞一次就得的事情。因为源字符串的【n-m】到【n-1】是已知的。所以不用每次都死板的回溯到源字符串的n-m+1。

举例来说，对于在"abababc"中搜索"ababc"，第一次不匹配的情况如下

0 1 2 3 4 5 6

a b a b a b c

a b a b c

?? ? ? ? ?^

这时候，如果把指针回溯到源字符串的1位置，其实没有意义的，因为它是b，和目标字符串的a不匹配。而且，我们其实已经知道源字符串0到3这四个字符的值是跟目标字符串的四个字符一样的，都是abab。KMP的思想就是，充分利用这个已知条件，“源字符串不回溯，尽量让目标字符串少回溯，然后继续进行搜索”。那应该让目标字符串回溯到什么地方呢？这就看已经匹配的字符串的内容了。

使用S代表源字符串，T代表目标字符串，S[n]和T[m]失配（注意，因为失配了，这时候S[n]是什么是不知道的）。对于源字符串已知的只有 S[n-m+1]到S[n-1]这m-1个字符。假设能够找到这样一个k，使得S[n-k]...S[n-1]=T[0]....T[k-1]?(0<k<m)，那么就只需要保持S不回溯，让T回溯到K，然后继续匹配就好了。而如果能够找到一个最大的K值，那么效率则是最高的.

对于上面的例子，k的值是2，KMP搜索的下一个状态是：

0 1 2 3 4 5 6

a b a b a b c

?? ? a b a b c

?? ? ? ? ?^

然后继续匹配就成功啦。

所以，KMP算法的核心是，如何为目标字符串的每个位置的找到一个k值，组成一个数组F，好在每次匹配到目标字符串的m失配的时候，将目标字符串回溯到F[m]，然后继续进行匹配。找到这个数组之后，KMP搜索就算是完成80%了。

下面是构建这个数组F的方法。

这时候目标字符串身兼源字符串和目标字符串两个角色。构建数组T可以说是一个步进的过程，需要用到之前的结果。首先是F[0]，F[0]的意思是第一个字符就不匹配，也就是说对源字符串一无所知，这时候没得搞了，直接要源字符串向前挪动一个。在F里，我们使用-1来标记第一个字符就匹配失败的情况。也就是F[0]=-1。F[1]其实肯定是0。我们真正需要计算的是从F[2]到最后的。下面是>=2的时候的计算方法。注意，F[i]代表S的第 i个字符匹配“失败”的时候，T需要回溯到的索引的值。如何求F[i]的值呢？首先取得F[i-1]的值，然后看S[i-1]是否=T[F[i-1]]，如果等于，那么F[i]=F[i-1]+1。这个原理是递归的。F[i-1]的值是在i-1失配的时候，T索引回溯到的值，如果这时候，这个值与S[i- 1]相等，那就说明F[i]可以在F[i-1]的基础上增加1了。否则继续检查S[i-1]是否等于T[[F[i-1]]]，直到没有的搜索了，就是0。下面是具体的代码：

/**
???? * each value of array rollbackmeans: when source[i] mismatch pattern[i],
???? * KMP will restart match processform rollback[j] of pattern with
???? * source[i]. And if rollback[i] ==-1, it means the current source[i] will
???? * never match pattern. then i shouldbe added by 1 and j should be set to
???? * 0, which means restart matchprocess from source[i+1] with pattern from
???? * pattern[0].
???? *
???? * @parampattern
???? * @return
???? */
??? private static int[] getRollbackArray(char[]pattern) {
??????? int[]rollback = new int[pattern.length];
??????? for (int i = 0; i < pattern.length; i++) {
??????????? rollback[i] = 0;
??????? }
??????? rollback[0] = -1;
??????? for (int i = 1; i < rollback.length; i++) {
??????????? charprevChar = pattern[i - 1];
??????????? intprevRollback = i - 1;
??????????? while(prevRollback >= 0) {
??????????????? intpreviousRollBackIdx = rollback[prevRollback];
??????????????? if((previousRollBackIdx == -1)
??????????????????????? || (prevChar ==pattern[previousRollBackIdx])) {
???????????????? ???rollback[i] = previousRollBackIdx + 1;
??????????????????? break;
??????????????? } else {
??????????????????? prevRollback =rollback[prevRollback];
??????????????? }
??????????? }
??????? }
??????? returnrollback;
??? }

上面并没有吧F[1]=1写成固定的，不过根据计算，F[1]始终是=0的。

有了这个rollback数组，KMP搜索就是水到渠成了：

/**
???? * search pattern chars in sourcechars.
???? *
???? * @paramsource
???? * @parampattern
???? * @return
???? */
??? public static int searchKMP(char[]source, char[] pattern) {
??????? // validation
???????if (source == null ||source.length == 0 || pattern == null
??????????????? || pattern.length == 0) {
??????????? return -1;
??????? }

??????? // get the rollback array.
???????int[] rollback = getRollbackArray(pattern);

??????? // incremental index of pattern. pointing the char tocompare with.
???????int currMatch = 0;
??????? int len =pattern.length;
??????? // i point the char to compare with
???????for (int i = 0; i < source.length;) {
??????????? // if current char match
???????????if ((currMatch == -1) || (source[i] ==pattern[currMatch])) {
??????????????? /*
???????????????? * then each of theindexes adding by one, moving to the next
???????????????? * char for comparation.notice that if currMatch is -1, it
???????????????? * means the first charin pattern can not be matched. so i add
???????????????? * by one to move on. andcurrMatch add by one so its value is
???????????????? * 0.
???????????????? */
??????????????? i++;
??????????????? currMatch++;
??????????????? /*
???????????????? * if reaches the end ofpattern, then match success, return the
???????????????? * index of first matchedchar.
???????????????? */
??????????????? if(currMatch == len) {
??????????????????? return i - len;
??????????????? }
??????????? } else {
?????????? ?????/*
???????????????? * if current charmismatch, then rollback the next char to
???????????????? * compare in pattern.
???????????????? */
??????????????? currMatch =rollback[currMatch];
??????????? }
??????? }
??????? return -1;
??? }

下面是几个测试方法：

? ??@Test
??? public void testRollBackArray() {
??????? int[]expectedRollback = new int[] { -1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0,
??????????????? 0, 0, 0, 0, 0, 1, 2, 3,0, 0, 0, 0, 0 };
??????? int[]rollback = getRollbackArray("PARTICIPATE IN PARACHUTE"
???? ???????????.toCharArray());
???????Assert.assertArrayEquals("Rollback array compare failed tomatch!",
??????????????? expectedRollback,rollback);
??? }

??? @Test
??? public void testKMPSearchMatch() {
??????? intmatchIndex = searchKMP(
?????????????? ?"aaaaaababacbaslierjalsdzmflkasjf".toCharArray(),
???????????????"ababacb".toCharArray());
??????? Assert.assertEquals(5,matchIndex);

??????? matchIndex = searchKMP(
???????????????"aaaaaababacbaslierjalsdzmflkasjf".toCharArray(),
??????????????? "aaaaaababacbaslierjalsdzmflkasjf".toCharArray());
??????? Assert.assertEquals(0,matchIndex);
??? }

??? @Test
??? public void testKMPSearchNoMatch() {
??????? intmatchIndex = searchKMP("ABCABCDABABCDABCDABDE".toCharArray(),
???????????????"hjABCDABD".toCharArray());
??????? Assert.assertEquals(-1,matchIndex);

??? }

把这三段代码放在一个类里，KMP搜索就算是完事儿了。

=============================正文无关内容============================================

在自己看KMP算法之前，很多文章都说神马KMP有代价，只适合目标字符串很长很长，搜索字符串也很长很长的case。但是就我看下来，KMP对于日常一般的搜索也是有优势的。首先，构建rollback数组计算并不复杂，当然需要一个额外的数组空间。但是对于匹配来说，还是有很大的加速优势的，而且目标字符串不需要回溯。所以KMP唯一的代价就是需要一个额外的数组，实际占用的内存应该是目标字符串的两倍（String是char的数组，char=short，int是char的两倍）。难道，真的是为了节省内存所以不采用KMP搜索？

热点排行

编程

KMP字符串搜寻