In this Haskell cicada we will read the recently published Functional Pearl “Knuth-Morris-Pratt illustrated”. We will use the same strategy we used in understanding the cycle-finding function: slowly deconstructing the functions and playing with them in the ghci repl (we use the Haskell setup described here).

The first function we dissect is scanl, used in the paper in what looks like a convoluted way to get all possible suffixes of a string.

Let’s first remind ourselves how foldl works:

myfoldl :: (a -> b -> a) -> a -> [b] -> a
myfoldl _ z [] = z
myfoldl f z (x:xs) = myfoldl f (f z x) xs

The scanl variant of foldl keeps all the intermediate results around.

wrap :: (a -> b -> a) -> [a] -> b -> [a]
wrap f (y:ys) x = (f y x) : (y:ys)  -- push the next accumulator value on top
wrap _ [] _ = []                    -- unreachable here: myscanl seeds with [z]

myscanl :: (a -> b -> a) -> a -> [b] -> [a]
myscanl f z xs = reverse (myfoldl (wrap f) [z] xs)

Note: A good way to understand the functions and their uses is to re-implement them and play with them in the repl.
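
For example, a quick check in ghci that our re-implementations agree with the Prelude versions:

myfoldl (+) 0 [1,2,3]   -- 6, same as foldl (+) 0 [1,2,3]
myscanl (+) 0 [1,2,3]   -- [0,1,3,6], same as scanl (+) 0 [1,2,3]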

Horizontally naive

How does the paper use scanl? The folding function is called step and has this definition:

step :: String -> Char -> String
step [] _ = []
step (_:ys) _ = ys

Using it like so gets us all the suffixes of a string:

myscanl step "abcdefg" "abcdefg" 

with result:

["abcdefg","bcdefg","cdefg","defg","efg","fg","g",""]

The second "abcdefg" is bound to the [b] argument in scanl. It determines how often we fold. We’re munching through "abcdefg", each time dropping one more character from the front of the last suffix we computed, but also collecting all these intermediary suffixes.

If the pattern is a prefix of any of these suffixes, then we have found the pattern in the text. So far so good. This is called Horizontally naive in the paper: we slide the pattern over the whole text, one position at a time, hence the suffixes. If $n$ is the length of the text and $m$ the length of the pattern, then this approach has runtime $O(nm)$.

What is nice in this paper: for the algorithms that follow, any and scanl remain the same; only init, step and done change from section to section.
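
To make that shape concrete, here is how the horizontally naive matcher could be wired up, a sketch with my own names (matchesH is not the paper’s code): the initial state is the whole text, step is the suffix-dropper from above, and done checks whether the pattern is a prefix of the current suffix.

import Data.List (isPrefixOf)

matchesH :: String -> String -> Bool
matchesH pattern text = any done (myscanl step text text)
  where
    done suffix = pattern `isPrefixOf` suffix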

Vertically naive

Let’s continue to the next section in the paper: the naive vertical approach. The idea here is that we maintain a set of pattern suffixes that are still in play as we traverse the text, one character at a time. At each stop, we filter out the pattern suffixes whose first character does not match the current text character. The remaining suffixes get shortened by one character. In addition, we re-insert the full pattern into the pattern suffix set (a new chance for a new beginning). If at any moment we arrive at an empty suffix, it means we matched: the pattern is in the text.

The function init initializes the set of pattern suffixes with the full pattern. The function done looks for the empty string in the set of pattern suffixes (an empty string means a suffix survived all the way to its last character without being filtered out of the set).

The interesting function is the step function because it evolves the set of pattern suffixes according to the description above. It is actually very readable and I don’t think I can explain it further. The folding munches through the text. It is a very elegant setup. The additional notes linking this approach to NFAs and Brzozowski derivatives are interesting.
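
If I reconstruct the three pieces from this description over Data.Set, I get something like the following sketch (my own names; the paper’s actual code may differ in details):

import qualified Data.Set as Set

-- The state is the set of pattern suffixes still in play.
initV :: String -> Set.Set String
initV pattern = Set.singleton pattern

-- Keep the suffixes whose first character matches the current text
-- character, shortened by one, and re-insert the full pattern
-- (a new chance for a new beginning).
stepV :: String -> Set.Set String -> Char -> Set.Set String
stepV pattern live c =
  Set.insert pattern (Set.fromList [ ys | (y:ys) <- Set.toList live, y == c ])

-- A surviving empty suffix means the pattern occurred in the text.
doneV :: Set.Set String -> Bool
doneV = Set.member ""

matchesV :: String -> String -> Bool
matchesV pattern text = any doneV (myscanl (stepV pattern) (initV pattern) text)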

Vertically naive with a list

This next approach is a refinement of the previous one. Instead of sets, it uses lists (specially defined in anticipation of the next section). The only unfamiliar thing to me was the @ syntax, explained here. The rest is very readable. The choice of a list here is very natural: it keeps the pattern suffix with the most matches so far on top, and the others sorted below it. The top either survives all the filtering, which eventually puts an empty string on top, or it gets filtered out, which promotes the next best pattern suffix to the top. If at the end we have an empty string on top, we have a match.

data List a = Nil | Cons { top :: a, rest :: List a }

Instead of doing the filtering and the construction of the new set in two separate passes over the previous one, here we do both the filtering and the shortening of the pattern suffixes in a single pass.
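
A sketch of what that single pass might look like with the List type above (my reconstruction from the description, not the paper’s highlighted listing):

-- Filter and shorten the live suffixes in one pass; the full
-- pattern is appended at the bottom, preserving the order.
stepL :: String -> List String -> Char -> List String
stepL pattern Nil _ = Cons pattern Nil
stepL pattern (Cons suffix live) c = case suffix of
  (y:ys) | y == c -> Cons ys (stepL pattern live c)
  _               -> stepL pattern live c

Note how, when the top survives, the rest of the new list is built by a complete recursive stepL; that kind of expression is what the proofs in the next section lean on.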

Morris-Pratt

We have arrived at the Morris-Pratt section. Knuth comes in later (and takes naming pole position 🙂).

This section is based on two observations about the pattern suffix lists (a small worked example follows the list):

  • There is only one possible list with a given pattern suffix on top.
  • Each list tail was itself a complete list earlier in the folding.
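
To make these concrete, take the pattern "ab". Only three lists can ever arise during the folding (written with ordinary list syntax for brevity):

  ["ab"]        the initial state
  ["b", "ab"]   after reading an 'a'
  ["", "ab"]    after reading "ab"

Each of "ab", "b" and "" tops exactly one list, and the tail of the two longer lists is ["ab"], itself a complete earlier state.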

Both observations can be proven inductively, and the paper provides a nice informal proof sketch. The proof of the second observation relies on the highlighted expression in the code listing of the previous section of the paper. It is clear now why the author felt it important to highlight that expression. The Haskell code for vertically naive with a list is actually a good source of inductive expressions for proving the two observations.

The two observations allow us to collapse all the pattern suffix lists that would appear in the folding into one graph. Let me emphasize this one more time: all the possible lists that could arise from matching the pattern against the text using the vertically naive with a list algorithm can be collapsed into one graph.

data Tree a = Nil | Node { top :: a, next :: Tree a, rest :: Tree a }

Instead of just a list with top and rest, we now have nodes with an additional next. If we can proceed in the matching from a given node, we go on to what next points to. If not, we go on to rest in our folding step.
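
Here is my sketch of that folding step over the graph (my own names, not the paper’s code): the state is the current node, and root is the node carrying the full pattern.

-- On a match follow next; on a mismatch fall back along rest,
-- restarting at the root when we fall off the bottom.
stepT :: Tree String -> Tree String -> Char -> Tree String
stepT root Nil _ = root
stepT root node c = case top node of
  (y:_) | y == c -> next node
  _              -> stepT root (rest node) c

-- An empty suffix on top of the current node means a match.
doneT :: Tree String -> Bool
doneT Nil  = False
doneT node = null (top node)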

The paper then constructs this graph and uses it in the tweaked matching algorithm. The advantage gained: we compute the graph only once and don’t rebuild lists at every folding step. One delightful little wrinkle here: the graph is cyclic, which wouldn’t work in a “normal” language, but is no problem for Haskell courtesy of lazy evaluation. The graph is not fully realized upfront; each next field is a thunk sitting there, to be used only when needed.
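
To see the knot-tying in miniature, here is the graph for the pattern "ab" from the worked example above, written by hand (the paper of course derives it generically):

-- The three lists collapse into three mutually referencing nodes;
-- note the cycles: nodeB and nodeC point back to nodeA.
nodeA, nodeB, nodeC :: Tree String
nodeA = Node "ab" nodeB Nil    -- the list ["ab"]
nodeB = Node "b"  nodeC nodeA  -- the list ["b", "ab"]
nodeC = Node ""   err   nodeA  -- the list ["", "ab"]

-- With the step above, the next of an empty suffix is never
-- inspected, so an error thunk is harmless thanks to laziness.
err :: Tree String
err = error "next of the empty suffix is never followed"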

Knuth-Morris-Pratt

And finally the Knuth-Morris-Pratt algorithm. This is a graph simplification based on an observation about the first two pattern suffixes in a node’s list: if their first characters match, then the node’s rest field can point directly at the rest field of the second pattern suffix’s node, because during matching we would end up there anyway on a mismatch at that node.
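
A sketch of that tweak as a node-building function (my own formulation of the rule, not the paper’s code):

-- If this node's suffix starts with the same character as its
-- fallback's suffix, the fallback is doomed to the same mismatch,
-- so point rest one level further down.
node :: String -> Tree String -> Tree String -> Tree String
node t nx (Node (y:_) _ r') | (z:_) <- t, z == y = Node t nx r'
node t nx r = Node t nx r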

That’s it. A truly delightful Functional Pearl. What I really liked is how this paper gradually led us to the full algorithm, carefully re-using the functions that don’t change from one section to the next and refining, in a motivated way, what we previously learned.