In my view, the Knuth-Morris-Pratt string searching algorithm is one of those “magic” algorithms that do something wonderful but are difficult to explain or understand in detail without a significant time investment. The pedagogical question that I would like to address is not just: “how does it work?”, or “why does it work?”, or “does it work?”. More importantly, it is: “how did they come up with it?”. Of course, I wasn’t there, and I am no historian, so let me instead address the somewhat related question: “how do I reconstruct it?”.
Of course one can prove that the algorithm works; this can be done with pen-and-paper or with a machine-checked proof. This is very useful, but does not answer my question.
I know of several ways of explaining and reconstructing the algorithm. In the first approach, one views the pattern as a nondeterministic finite-state automaton; by eliminating the ε-transitions, one obtains a deterministic automaton, whose transitions form the “failure function”. In the second approach, which is the one I would like to describe here, one begins with a naïve implementation of string searching and, by a series of program transformations, derives an efficient algorithm, namely the Knuth-Morris-Pratt algorithm.
I found this idea in Ager, Danvy, and Rohde’s paper, where it is somewhat buried in Appendix A. I rephrase it here in a slightly different way.
The naïve algorithm
Let us begin with a supposedly idiomatic Java implementation of naïve string searching. We are looking for an occurrence of a string, known as the “pattern”, within another string, known as the “text”. Throughout, I write m for the length of the pattern and n for the length of the text.
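Such an implementation might look like this (the class and method names here are my own):

```java
// Naïve string search. Returns the index in the text of the
// leftmost occurrence of the pattern, or -1 if there is none.
class NaiveSearch {
    static int search(String pattern, String text) {
        int m = pattern.length(), n = text.length();
        // Try every candidate position k in the text.
        for (int k = 0; k + m <= n; k++) {
            int j = 0;
            // Compare the pattern with the text at position k.
            while (j < m && pattern.charAt(j) == text.charAt(k + j))
                j++;
            if (j == m)
                return k; // Every character matched.
        }
        return -1;
    }
}
```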
The code returns the index in the text of the leftmost occurrence of the pattern, if one exists, and −1 otherwise. In the worst case, its time complexity is O(mn), that is, quadratic.
Let us first write this algorithm in the form of a single infinite loop. This makes the control flow much simpler, and means that the current state of the algorithm is now entirely encoded in the variables j and k. (With respect to the previous code, the meaning of k has changed. Now, j is an index into the pattern, while k is an index into the text.)
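Concretely, the loop version might read as follows (again, a sketch):

```java
// Naïve string search, written as a single infinite loop.
// j is an index into the pattern; k is an index into the text.
// The candidate occurrence currently under test begins at k - j.
class LoopSearch {
    static int search(String pattern, String text) {
        int m = pattern.length(), n = text.length();
        int j = 0, k = 0;
        while (true) {
            if (j == m)
                return k - m;      // Success: the occurrence ends at k.
            else if (k == n)
                return -1;         // Failure: the text is exhausted.
            else if (pattern.charAt(j) == text.charAt(k)) {
                j++; k++;          // Match: advance in lockstep.
            } else {
                // Mismatch: backtrack in the text and restart
                // at the next candidate position.
                k = k - j + 1;
                j = 0;
            }
        }
    }
}
```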
The source of inefficiency is the last case, where we backtrack and (if j is greater than zero) the index k does not increase. This leads the algorithm to read again a part of the text that it has already read.
A pure formulation
Before attempting any algorithmic change, let us transform the code to a purely functional form, which is both more general and easier to reason about. We switch to OCaml, and instead of a loop, we write a tail-recursive function.
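One possible rendering is this; the six parameters are the pattern and its length m, the text and its length n, the current state j, and the current index k into the text:

```ocaml
(* [search pattern m text n j k] searches the text interval [k, n)
   for the pattern, starting in state [j]: that is, assuming that
   [j] characters of the pattern have been matched already. *)
let rec search pattern m text n j k =
  if j = m then
    Found (k - m)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    search pattern m text n (j + 1) (k + 1)
  else
    search pattern m text n 0 (k - j + 1)
```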
This is a function of six parameters, but the first four are invariant, so it costs nothing to carry them around – they are just stored in registers. The code produced by the OCaml compiler is about as tight as one might hope (try ocamlopt -unsafe -S and look at the assembly file).
I have taken this opportunity to enrich the result type of the search function. When the search fails, by reaching the end of the text, instead of returning just −1, it returns the current value of j. In other words, it returns the state of the algorithm at that point.
For the sake of completeness, let me show the definition of the type result, as well as a destructor that extracts j out of a result when we know that this result is Interrupted _. This destructor will be useful later on.
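For instance:

```ocaml
type result =
  | Found of int        (* the pattern occurs at this index in the text *)
  | Interrupted of int  (* the search reached the end of the text in this state *)

(* [assertInterrupted r] extracts [j] out of the result [r],
   which must be of the form [Interrupted j]. *)
let assertInterrupted = function
  | Interrupted j -> j
  | Found _ -> assert false
```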
In this form, the code looks almost identical to the previous version, which was written in Java. It is, however, much more powerful. Thanks to the extra parameters that it exposes and thanks to the extra information that it returns, the algorithm is now interruptible and resumable. In other words, a search within a long text can be split up as a sequential composition of two searches. If the index l lies between k and n (that is, k ≤ l ≤ n), then the call:
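```ocaml
(* a sketch, using the parameter order [pattern m text n j k] *)
search pattern m text n j k
```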
is equivalent to the sequence:
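```ocaml
(* first search [k, l), then resume in the state thus reached *)
match search pattern m text l j k with
| Found i -> Found i
| Interrupted j' -> search pattern m text n j' l
```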
The first call searches the text interval [k, l), while the second call searches the interval [l, n), beginning in the state j′ that was reached by the first call.
Identifying and exposing the redundant work
It is now time to do something about the inefficiency of this algorithm. Recall that the root of the problem is the backtracking that takes place in the last case when j is greater than zero. It is stupid to go back (or stay where we are) in the text, because we already know what characters are there. We have just read these characters and successfully compared them with a prefix of the pattern!
To clarify what I mean, and to allow my claim to be tested, let me enrich the code with extra assertions that document our knowledge. (The function sub is String.sub; it extracts a substring out of a string.) In technical lingo, these assertions form a precondition of the function loop; in the second Java version above, they would form an invariant of the while loop.
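The code, thus annotated, might read:

```ocaml
let sub = String.sub

let rec loop pattern m text n j k =
  (* Precondition: the indices are in range... *)
  assert (0 <= j && j <= m && j <= k && k <= n);
  (* ...and the [j] characters of the text that precede index [k]
     match a prefix of the pattern: *)
  assert (sub pattern 0 j = sub text (k - j) j);
  if j = m then
    Found (k - m)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    loop pattern m text n (j + 1) (k + 1)
  else
    loop pattern m text n 0 (k - j + 1)
```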
The last recursive call is performing partly redundant work, because we already know what characters are found in the text interval [k − j, k). Fortunately, we are now in a position to do something about it. Instead of searching the interval [k − j + 1, n) in one go, let us split this search into the two intervals [k − j + 1, k) and [k, n). We will then be able to identify the first search as entirely redundant.
We set apart the sub-case where j is zero, as it does not pose a problem. Thus, the code becomes (assertions now removed, for brevity):
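Something along these lines:

```ocaml
let rec search pattern m text n j k =
  if j = m then
    Found (k - m)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    search pattern m text n (j + 1) (k + 1)
  else if j = 0 then
    search pattern m text n 0 (k + 1)
  else
    (* Split the search of [k - j + 1, n)
       into [k - j + 1, k) followed by [k, n). *)
    match search pattern m text k 0 (k - j + 1) with
    | Found i -> Found i
    | Interrupted j' -> search pattern m text n j' k
```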
We can immediately simplify this code a little bit. Indeed, the call search pattern m text k 0 (k - j + 1) searches for the entire pattern, which has length m, in a text interval of length j − 1, where j < m, so it cannot possibly succeed. Thus, the first branch of the last match construct is dead, and the whole construct can be replaced with a more concise call to assertInterrupted:
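The code might now read:

```ocaml
let rec search pattern m text n j k =
  if j = m then
    Found (k - m)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    search pattern m text n (j + 1) (k + 1)
  else if j = 0 then
    search pattern m text n 0 (k + 1)
  else
    (* This sub-search cannot find the pattern, as it searches
       an interval of length j - 1, which is less than m. *)
    let j' = assertInterrupted (search pattern m text k 0 (k - j + 1)) in
    search pattern m text n j' k
```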
We now take a more essential step. As previously noted, the sub-strings sub pattern 0 j and sub text (k - j) j are equal. Thus, the call search pattern m text k 0 (k - j + 1) searches for the pattern within (a prefix of a suffix of) the pattern itself. As a result, this call can be rewritten to mention the pattern, instead of the text! It is equivalent to the call search pattern m pattern j 0 1. We have replaced text with pattern, and adjusted the indices into the text by subtracting k − j, so that they are now indices into the pattern. The code is now:
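```ocaml
let rec search pattern m text n j k =
  if j = m then
    Found (k - m)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    search pattern m text n (j + 1) (k + 1)
  else if j = 0 then
    search pattern m text n 0 (k + 1)
  else
    (* The text interval [k - j, k) is a copy of the pattern
       interval [0, j), so this sub-search can be performed
       within the pattern itself: *)
    let j' = assertInterrupted (search pattern m pattern j 0 1) in
    assert (j' < j);
    search pattern m text n j' k
```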
The new assertion j′ < j is justified as follows. We are searching in a text fragment of length j − 1, and we begin in state 0, so, at best (if all characters match), this search will reach the final state j − 1.
The code is now roughly analogous to Figure 8 (page 42) of Ager, Danvy, and Rohde’s paper. Our version is more compact than theirs, as we have avoided some code duplication, but the idea is the same.
It is worth noting that, at this point, our code is not tail-recursive, due to the manner in which we perform two searches in sequence. This is not a problem, however, as our very next step (as I am sure you, the reader, have guessed) is to remove the call search pattern m pattern j 0 1 entirely. Indeed, this value does not depend on the text in any way, so it can be precomputed before the search begins.
The search phase
Let us assume, for the time being, that we have precomputed the value of search pattern m pattern j 0 1, for every j in the range 0 < j < m, and stored this information in an array table. (As noted above, for every j in this range, we have the property that table.(j) is less than j.) Then, the code can be simplified as follows:
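Assuming that loop receives the array table as an additional parameter, the code might read:

```ocaml
(* [table.(j)] holds [assertInterrupted (search pattern m pattern j 0 1)],
   for every j in the range 0 < j < m. *)
let rec loop pattern m table text n j k =
  if j = m then
    Found (k - m)
  else if k = n then
    Interrupted j
  else if pattern.[j] = text.[k] then
    loop pattern m table text n (j + 1) (k + 1)
  else if j = 0 then
    loop pattern m table text n 0 (k + 1)
  else
    (* The precomputed failure transition. *)
    loop pattern m table text n table.(j) k
```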
Cool, the code is tail-recursive again! But wait a minute. A shadow of a doubt should creep into your mind.
Is this code efficient? We still seem to have a recursive call, on the last line, where k is not incremented. Thus, it is not obvious that this code has good time complexity. In fact, a rapid analysis indicates that, in the worst case, table.(j) could be j − 1, so, in the worst case, the algorithm could stutter m times (minus one or two) before it finally increments k. This leads to a worst-case upper bound of O(mn), no better than the naïve algorithm.
Fortunately, this analysis is overly pessimistic. The variables j and k are linked: whenever j is incremented, k is incremented too. Thus, the worst-case scenario, where j has reached a high point and requires many steps to go down, arises only if k has been incremented as many times as j has been incremented. Thus, speaking in terms of amortization, the cost of backtracking can be charged, ahead of time, to the steps where k was incremented. Speaking in plain mathematics, it is not difficult to see that the quantity 2k − j strictly increases at every recursive call. Since this quantity is initially 0 and is bounded by 2n, we find that the time complexity of the search phase is O(n).
The phenomenon that we have identified, and that gave rise to a doubt, is that the algorithm is not real-time. In the worst case, it can take up to O(m) steps to process one character of the text, even though, in an amortized sense (i.e., averaged over the entire text), processing one character only requires constant time.
The preprocessing phase
We now need to program the preprocessing phase, which initializes the array table. Recall that table.(j) is supposed to contain the integer value assertInterrupted (search pattern m pattern j 0 1). How do we efficiently compute this information?
Of course, one could invoke the inefficient search function at every j and store the results. But this naïve approach would have time complexity O(m³), which is not great.
In order to do better, we use two simple ideas.
First, it is quite obvious that the desired value at j can be computed in terms of the desired value at j − 1, simply by exploiting (again) the fact that a search can be split into two consecutive sub-searches. The base case, j = 1, corresponds to searching an empty interval; the desired value is then the initial state 0.
Second, instead of calling the inefficient search function, let us use the efficient loop function. Granted, loop reads the array table, which we have not fully initialized yet, but a moment’s thought reveals that this is safe: the array will be read only at indices that have been computed already. (The proof is not difficult; let me skip it.) The code is as follows:
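Assuming, as above, that loop receives table as an extra parameter, the initialization might read:

```ocaml
let init pattern m =
  (* Base case: table.(1) is 0, the initial state,
     as the interval [1, 1) is empty. *)
  let table = Array.make m 0 in
  for j = 2 to m - 1 do
    (* The search of [1, j) is split into a search of [1, j - 1),
       whose outcome is table.(j - 1), followed by a search of
       [j - 1, j), which begins in that state. *)
    table.(j) <-
      assertInterrupted
        (loop pattern m table pattern j table.(j - 1) (j - 1))
  done;
  table
```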
This is a beautiful bootstrapping process: the same code, loop, is used in the preprocessing phase and in the search phase. We did not even consciously plan for this; the function was general enough from the start that it did not need to be modified. We re-use not only the code, but also its complexity argument: since the preprocessing phase searches a text of length m, init has time complexity O(m).
Packing it all up
I leave it as an exercise for the reader to put everything together and produce an implementation that satisfies the following signature:
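One plausible form of this signature:

```ocaml
module type KMP = sig
  (* [search pattern text] returns the index in [text] of the
     leftmost occurrence of [pattern], if one exists,
     and -1 otherwise. *)
  val search : string -> string -> int
end
```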
I should note that the algorithm that we have obtained is not quite the Knuth-Morris-Pratt algorithm. The latter includes one extra optimization, which I have not attempted to incorporate here. This optimization builds shortcuts in the failure table, and is implemented by modifying one line in the preprocessing phase, as follows:
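Concretely, in the preprocessing loop, the assignment to table.(j) might become (a sketch, assuming loop takes table as an extra parameter):

```ocaml
    table.(j) <-
      let j' =
        assertInterrupted
          (loop pattern m table pattern j table.(j - 1) (j - 1))
      in
      (* Shortcut: if, after a failure at state j, the very same
         comparison would immediately fail again at state j',
         then jump directly to table.(j'). *)
      if pattern.[j'] = pattern.[j] then table.(j') else j'
```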
As another exercise, the reader can translate the code back to imperative style (and to Java, if desired).
Chengdian (1985) and van der Woude (1989) derive an implementation of the Knuth-Morris-Pratt algorithm from its specification, expressed as a pre/postcondition pair. In other words, they construct, at the same time, the code and the proof of its correctness in Hoare logic. Again, this establishes that the code is correct, but does not (in my opinion) lead to a pedagogical explanation of the algorithm.
Bird, Gibbons, Jones (1989) and Bird (2005) use equational reasoning to transform a specification of the algorithm (that is, a simple but inefficient implementation) into an efficient one. Unfortunately, it is not obvious to a casual reader that the final algorithm is the Knuth-Morris-Pratt algorithm.
Ager, Danvy, and Rohde (2002) explain how, once the algorithm has been manually put under a suitable form, the standard techniques of partial evaluation can be used to specialize the code with respect to a fixed pattern. Thus, the partial evaluation phase is the preprocessing phase. Their paper provided the inspiration for this post. They also prove, in great detail, that the algorithm thus obtained is exactly the Knuth-Morris-Pratt algorithm.