2012-11-18

String pattern matching (part 1/?)

Part 1: Finding the needle in a haystack

Background

I have recently learned about the Boyer-Moore-Horspool (BMH) algorithm, but after reading through it again, I realized that there were parts of it I didn't quite understand. I decided to write this article as a way for me to repeat what I've learned and also help others that want to implement this algorithm.

Boyer-Moore-Horspool

BMH pre-processes the string you search for (pattern) to reduce the amount of compares needed when searching through the string you search in (text). The algorithm has an average-case complexity of O(N) and a worst case complexity of O(MN) where N is the text length and M is the pattern length.

A brute force approach to searching for a pattern in a text would be to start by comparing the two first characters in the text and the pattern. If they match, compare the next two and continue until we either have figured out that we've found a complete match or we find a mismatch. Upon finding a mismatch we skip one character ahead in the text and start over again.

Let's see an example of this naïve approach. We'll use the string "dream" as the pattern, and "ice creamer dreamer" as the text to search in (I know it doesn't make much sense, but whatever). I have replaced spaces with underscores to make it easier later to see which character we're dealing with. The red color indicates where a mismatch happen, and the blue indicates a match. The number on the left side indicates how many positions we shift the pattern before the next compare. I also use the character '|' to show the position of the pattern we're looking for in the text.

  iced_creamer_dreamer
  dream        |   |
1  dream       |   |
1   dream      |   |
1    dream     |   |
1     dream    |   |
1      dream   |   |
1       dream  |   |
1        dream |   |
1         dream|   |
1          dream   |
1           dream  |
1            dream |
1             dream|
1              dream

Each blue and red character represents a comparison, so that's a total of 19 compares. There must be a way to do this better! This is where BMH comes in. Let's go through the algorithm step by step. The first step of the algorithm is to start comparing the string from the end, and not the start. In the example below we see that there's a mismatch on the last character. The character in the text that causes the mismatch (whitespace) doesn't occur in the pattern, which means that we can skip ahead the entire length of the pattern.

  iced_creamer_dreamer
  dream
5      dream

But now we have a more complicated situation (below). The mismatch doesn't happen before we hit the d in the beginning of the word. How do we figure out how far we can skip ahead this time?

  iced_creamer_dreamer
  dream
5      dream

Luckily this information is something we can pre-compute based on the search pattern. Say hello to the bad character shift table (BCS). The table contains an entry for each possible character we can encounter in the text, and tells us how far we're allowed to skip ahead if it doesn't match the pattern. This is probably a good time to point out that this approach works perfectly for fixed width character encodings and will also work out of the box with multi-byte encodings if they are self-synchronizing (UTF-8). It's still possible to make it work with encodings that don't meet these criterias, but it requires some extra work that I won't cover here.

First we create a table containing each possible character in the text (128 in the case of ASCII) and fill each entry with the length of the pattern, since this is the default amount of characters we can skip if the mismatched character in the text doesn't appear in the pattern.

Next we iterate over every character in the pattern except the last one and set the shift value of that character to be how far we must move the pattern to align with the character we found in the text. We skip the last character since a mismatch would lead to the same character being compared since it's already in the mismatched position. The BCS table can easily be calculated with the following C-like pseudo code where 'last' represents the index of the last character in the pattern and we use zero-indexed arrays:

for (i = 0; i < MAX_CHARACTER; ++i)
    shift[i] = pattern.length;  

last = pattern.length - 1;
for (i = 0; i < last; ++i)
    shift[pattern[i]] = last - i;

By iterating from start to end, we make sure that we never shift longer than the last character of a certain kind if there are multiple occurrences of the same character. The table generated for the pattern (* here means every character that's not present in the pattern):


*  5 (pattern length)
d  4
r  3
e  2
a  1
m  5 (skipped)

With the BCS table made, it's time to go back to our example and study it again. The first mismatch happens on the whitespace as before, which gives us an offset of 5.

  iced_creamer_dreamer
  dream        |   |
5      dream   |   |
5           dream  |
3              dream

Well that was easy, and we reduced the amount of compares to 12. The algorithm seemed to work for the test string, but we haven't quite solved the problem yet. Look at the example below where the pattern is "ram ram", and the text is "rum ram ram tam" (another nonsensical string, but I can't be bothered to find good examples that make sense). We get a mismatch between the characters 'u' in the text and 'a' in the pattern. 'u' doesn't exist in the pattern, so we consider it safe to shift the entire length of the pattern, but that makes us overshoot an occurrence of the pattern in the text.

  rum_ram_ram_tam
  ram_ram
7        ram_ram

The reason for this is that the BCS table values tell us how long we need to shift the string if the mismatch happen on the last character, but here it happens on the second. There are at least two ways of handling this problem. The Boyer-Moore way is to subtract the amount of matches we have already found from the BCS. This might lead to negative shifts, but the Boyer-Moore algorithm will always choose the larger of two shifts (bad character and good suffix) which will solve this issue. The Horspool solution is to always use the last character of the compared text. In this case we would use 'm' which gives us a shift of 4 and therefore puts us directly at the pattern we're searching for.

  rum_ram_ram_tam
  ram_ram   |
4     ram_ram

It is worth pointing out that this solution has some bad side effects in very special cases. If the two last characters are the same, then any mismatch that doesn't happen on the last character will lead to a shift of only one. This isn't a big issue since we usually see a lot of mismatches on the first character comparison, and a shift of just one will most likely be followed by a larger shift, but it's something to keep in mind.

Let's sum it all up in some more pseudo code to make it clearer:

for (i = 0; i < MAX_CHARACTER; ++i)
    shift[i] = pattern.length;  

last = pattern.length - 1;
for (i = 0; i < last; ++i)
    shift[pattern[i]] = last - i;

pos = 0;
while (pos < (text.length - pattern.length)) {
    i = last;
    while (pattern[i] == text[pos + i]) {
        if (i == 0)
            return pos;
        ++i;
    }
    pos += shift[buffer[pos + last]];
}


One last thing before we're done. Let's say we want to find all occurrences of the pattern in the string and want to jump as far as possible after each match we find. Lets start with a couple of examples:

One last thing before we're done. Let's say we want to find all occurrences of the pattern in the string and want to jump as far as possible after each match we find. Intuitively we realize that the longest shift we can perform after a match is the length of the string minus the longest match between a suffix and a prefix in the pattern. Calculating this requires a bit more pre-processing (which I'll leave as an exercise), but we already have some information we can use to gain the second best thing: the bad character shift table. By using the BCS table entry for the last character in the pattern we're not guaranteed to shift as long as we can, but we're guaranteed to never overshoot the next match, and it will perform almost as good in most cases. Under is an example of some strings that match and how far we shift. The first entry is the optimal shift, and the second is using the BCS table.

No overlap:
  find
4     find
4     find

Single character overlap:
  test
3    test
3    test

Multiple character overlap:
  baobao
3    baobao
3    baobao

Multiple suffix match:
  bababa
2   bababa
2   bababa

Bad character shift skips too short ahead:
  this_is_this
8         this_is_this
5      this_is_this

That's it for now. It sure helped me writing this, and I hope it helped you if you're reading it. Please post a comment if you find anything that's incorrect or if you appreciate the article (it's getting pretty late and I'm pretty tired right now, so I can't guarantee that it's all bug free, plus I've already found a couple of bugs myself).

No comments: