## Part 1: Finding the needle in a haystack

#### Background

I have recently learned about the Boyer-Moore-Horspool (BMH) algorithm, but after reading through it again, I realized that there were parts of it I didn't quite understand. I decided to write this article as a way for me to repeat what I've learned and also help others that want to implement this algorithm.

#### Boyer-Moore-Horspool

BMH pre-processes the string you search for (pattern) to reduce the amount of compares needed when searching through the string you search in (text). The algorithm has an average-case complexity of O(N) and a worst case complexity of O(MN) where N is the text length and M is the pattern length.

A brute force approach to searching for a pattern in a text would be to start by comparing the two first characters in the text and the pattern. If they match, compare the next two and continue until we either have figured out that we've found a complete match or we find a mismatch. Upon finding a mismatch we skip one character ahead in the text and start over again.

Let's see an example of this naïve approach. We'll use the string "dream" as the pattern, and "ice creamer dreamer" as the text to search in (I know it doesn't make much sense, but whatever). I have replaced spaces with underscores to make it easier later to see which character we're dealing with. The red color indicates where a mismatch happen, and the blue indicates a match. The number on the left side indicates how many positions we shift the pattern before the next compare. I also use the character '|' to show the position of the pattern we're looking for in the text.

```  iced_creamer_dreamer
dream        |   |
1  dream       |   |
1   dream      |   |
1    dream     |   |
1     dream    |   |
1      dream   |   |
1       dream  |   |
1        dream |   |
1         dream|   |
1          dream   |
1           dream  |
1            dream |
1             dream|
1              dream
```

Each blue and red character represents a comparison, so that's a total of 19 compares. There must be a way to do this better! This is where BMH comes in. Let's go through the algorithm step by step. The first step of the algorithm is to start comparing the string from the end, and not the start. In the example below we see that there's a mismatch on the last character. The character in the text that causes the mismatch (whitespace) doesn't occur in the pattern, which means that we can skip ahead the entire length of the pattern.

```  iced_creamer_dreamer
dream
5      dream
```

But now we have a more complicated situation (below). The mismatch doesn't happen before we hit the d in the beginning of the word. How do we figure out how far we can skip ahead this time?

```  iced_creamer_dreamer
dream
5      dream
```

Luckily this information is something we can pre-compute based on the search pattern. Say hello to the bad character shift table (BCS). The table contains an entry for each possible character we can encounter in the text, and tells us how far we're allowed to skip ahead if it doesn't match the pattern. This is probably a good time to point out that this approach works perfectly for fixed width character encodings and will also work out of the box with multi-byte encodings if they are self-synchronizing (UTF-8). It's still possible to make it work with encodings that don't meet these criterias, but it requires some extra work that I won't cover here.

First we create a table containing each possible character in the text (128 in the case of ASCII) and fill each entry with the length of the pattern, since this is the default amount of characters we can skip if the mismatched character in the text doesn't appear in the pattern.

Next we iterate over every character in the pattern except the last one and set the shift value of that character to be how far we must move the pattern to align with the character we found in the text. We skip the last character since a mismatch would lead to the same character being compared since it's already in the mismatched position. The BCS table can easily be calculated with the following C-like pseudo code where 'last' represents the index of the last character in the pattern and we use zero-indexed arrays:

```for (i = 0; i < MAX_CHARACTER; ++i)
shift[i] = pattern.length;

last = pattern.length - 1;
for (i = 0; i < last; ++i)
shift[pattern[i]] = last - i;
```

By iterating from start to end, we make sure that we never shift longer than the last character of a certain kind if there are multiple occurrences of the same character. The table generated for the pattern (* here means every character that's not present in the pattern):

```*  5 (pattern length)
d  4
r  3
e  2
a  1
m  5 (skipped)```

With the BCS table made, it's time to go back to our example and study it again. The first mismatch happens on the whitespace as before, which gives us an offset of 5.

```  iced_creamer_dreamer
dream        |   |
5      dream   |   |
5           dream  |
3              dream```

Well that was easy, and we reduced the amount of compares to 12. The algorithm seemed to work for the test string, but we haven't quite solved the problem yet. Look at the example below where the pattern is "ram ram", and the text is "rum ram ram tam" (another nonsensical string, but I can't be bothered to find good examples that make sense). We get a mismatch between the characters 'u' in the text and 'a' in the pattern. 'u' doesn't exist in the pattern, so we consider it safe to shift the entire length of the pattern, but that makes us overshoot an occurrence of the pattern in the text.

```  rum_ram_ram_tam
ram_ram
7        ram_ram```

The reason for this is that the BCS table values tell us how long we need to shift the string if the mismatch happen on the last character, but here it happens on the second. There are at least two ways of handling this problem. The Boyer-Moore way is to subtract the amount of matches we have already found from the BCS. This might lead to negative shifts, but the Boyer-Moore algorithm will always choose the larger of two shifts (bad character and good suffix) which will solve this issue. The Horspool solution is to always use the last character of the compared text. In this case we would use 'm' which gives us a shift of 4 and therefore puts us directly at the pattern we're searching for.

```  rum_ram_ram_tam
ram_ram   |
4     ram_ram
```

It is worth pointing out that this solution has some bad side effects in very special cases. If the two last characters are the same, then any mismatch that doesn't happen on the last character will lead to a shift of only one. This isn't a big issue since we usually see a lot of mismatches on the first character comparison, and a shift of just one will most likely be followed by a larger shift, but it's something to keep in mind.

Let's sum it all up in some more pseudo code to make it clearer:

```for (i = 0; i < MAX_CHARACTER; ++i)
shift[i] = pattern.length;

last = pattern.length - 1;
for (i = 0; i < last; ++i)
shift[pattern[i]] = last - i;

pos = 0;
while (pos < (text.length - pattern.length)) {
i = last;
while (pattern[i] == text[pos + i]) {
if (i == 0)
return pos;```
```        ++i;
}
pos += shift[buffer[pos + last]];
}
```

One last thing before we're done. Let's say we want to find all occurrences of the pattern in the string and want to jump as far as possible after each match we find. Lets start with a couple of examples:

One last thing before we're done. Let's say we want to find all occurrences of the pattern in the string and want to jump as far as possible after each match we find. Intuitively we realize that the longest shift we can perform after a match is the length of the string minus the longest match between a suffix and a prefix in the pattern. Calculating this requires a bit more pre-processing (which I'll leave as an exercise), but we already have some information we can use to gain the second best thing: the bad character shift table. By using the BCS table entry for the last character in the pattern we're not guaranteed to shift as long as we can, but we're guaranteed to never overshoot the next match, and it will perform almost as good in most cases. Under is an example of some strings that match and how far we shift. The first entry is the optimal shift, and the second is using the BCS table.

No overlap:
```  find
4     find
4     find```

Single character overlap:
```  test
3    test
3    test```

Multiple character overlap:
```  baobao
3    baobao
3    baobao```

Multiple suffix match:
```  bababa
2   bababa
2   bababa```

Bad character shift skips too short ahead:
```  this_is_this
8         this_is_this
5      this_is_this```

That's it for now. It sure helped me writing this, and I hope it helped you if you're reading it. Please post a comment if you find anything that's incorrect or if you appreciate the article (it's getting pretty late and I'm pretty tired right now, so I can't guarantee that it's all bug free, plus I've already found a couple of bugs myself).

## 2012-10-18

### Interesting blog

I found an interesting blog over at http://benryves.com. It belongs to a guy that seems to share my passion for low end systems. He's working on a 2.5D engine for the TI-83 and has a bunch of different hardware and software projects.

## 2012-10-10

### GBA 3D Rendering

I've finally put all the pieces together, and it turned out pretty good imo. The video shows a rotating torus with a rotating directional light. I'd love to do point lights, but the inverse square root needed for normalization is extremely expensive in terms of CPU time needed.

## 2012-10-05

### Fixed point division

Found an interesting article about fast fixed point division and multiplication on ARM processors over at Henry Thomas consulting. Originally I claimed that it didn't work for the GBA, but that was a lie. Some minor tweaks and now it runs like a charm. Fixed point division used to be a major bottleneck in my attempt at 3D rendering on the GBA, and now it's not a problem at all.

## 2012-09-29

### Refreshing my memory (3d math)

So.. it's been a while since the last time I did all the 3D calculations myself. Nowadays I just use libraries since every decent graphics API have it build in, but that isn't the case for the GBA. I started implementing the functions needed for matrix transformations and quickly concluded with that I suspected before starting: a naive implementation will be way too slow. You still need the background information to start optimizing though. The following is a good starting point at least: Egon Rath's Notes: Basic 3D Math: Matrices.

## 2012-09-26

### GBA: Place functions/variables in iwram

Placing functions and variables in iwram will *seriously* speed up your program. You only have limited amounts on the other hand (32K). I made the following defines based on The Dev'rs GBA Dev FAQs.

Defines:
```#define IWRAM_FUNCTION __attribute__((section (".iwram"), long_call))
#define IWRAM_VARIABLE __attribute__((section (".iwram")))
```
Usage:
```const int somearray[] IWRAM_VARIABLE = { ... }
void IWRAM_FUNCTION SomeFunction( ... ) { ... }
```
I figured out that there's some more info on this over at tonc after I posted.

## 2012-09-25

### One of my many 2d sprite projects

This is a test of a 2d engine with custom physics I worked on a while ago. Art credits goes to Zelda :)

### GBA: Asm tips

Thanks to the wayback machine for helping me find some GBA asm optimization tricks and source code here. Credits goes to Pete (dooby@bits.bris.ac.uk / http://bits.bris.ac.uk/dooby/).

There's also a few tips and tricks I thought people might like - comments welcome.
• STMIA for DMA setup
When setting up DMA transfers, take advantage of the fact you are writing words to consecutive memory locations and use a STore Multiple Increment After instead of 3 `STR`s. `STMIA` for 3 registers takes 2n+2s cycles which is better than the 6n cycles that 3 `STR`s would take. For example:
``` ldr a1, =0x040000d4 @ Point at DMA src register.
ldr a2, =from @ Point at source data.
ldr a3, =to  @ Point at destination.
mov a4, =0x85002580 @ Set up control register.
stmia a1, {a2-a4} @ Start DMA.
```
The registers for source, destination and control need to be in ascending numerical order for this to work, as STM stores the lowest numbered registers at the lowest memory address.
• Halfword Endianess
To change endianess of a halfword (or short) first load the value in the bottom 16 bits of a register (eg `ldr r0, =0xcafe`) then do
``` eor r0, r0, r0, ror #8
eor r0, r0, r0, ror #24```
Now r0 holds `0xfecafeca` ie the endian-flipped halfword replicated twice. To store just the short use `strh`. To get `0x0000feca` use `mov r0, r0, asr #16`. Finally, if you are using the value in another calculation you may be able to make use of the ALU to do this for free, eg `add r2, r1, r0, asr #16`.
• Fast multiply by (say) 240
Instead of using a `mul` you can usually achieve the same with a couple of adds and shifts
``` mov  a2, a1, lsl #8
sub a2, a2, a1, lsl #4```
will multiply a1 by 240 and put the result in a2. This takes 2S cycles whereas
``` mov a3, #240
mul a2, a1, a3```
takes 2S+I cycles (and uses an extra register).
• Fast register swap
You can swap the contents of registers a1 and a2 in just 3 instructions without corrupting another register using
``` eor a1, a1, a2
eor a2, a2, a1
eor a1, a1, a2```
(Thanks to baah of ARM's Tech for that one). This also works in C using the `^` operator on unsigned long ints.
• Negating registers
To negate a register in ARM (ie get -a1) don't use
` mvn a1, a1`
because that just flips the bits (ie -1 will become 0 not +1). Instead, use
` rsb a1, a1, #0`
to get 0-a1 which is what you want.
Don't forget to use offsets in single loads/stores where possible. For example, to read VCOUNT on GBA don't use
``` ldr a1, =0x04000006
ldrh a2, [a1]```
which requires 2 loads, but use
``` mov a1, #0x04000000
ldrh a2, [a1, #6]```
which does just the same but saves you memory (no storing 0x04000006 in the literal pool) and speed (mov is faster than ldr).
• Unroll loops
Because there's no actual cache in GBA, unroll loops where you can afford the memory. The fastest way of blitting a scanline is to jump the right number of instructions into a list of 120
` strh a1, [a2], #2`
instructions where a2 is your start position and a1 is your colour halfword. This saves flushing the pipeline and losing 2S+N cycles for every branch in a very tight inner loop (thanks to someone on [gbadav] for that one).
• DMA re-reading source
I have heard (but not tested) that DMA from a fixed source (such as a clear screen routine may use) re-reads the source each time. To speed this up either copy a 0 word to EXT WRAM or push a 0 word on the stack (in nice fast INT WRAM) and point at that for the DMA copy. Since DMA halts the CPU this will be safe. Even if the DMA is interrupted as it finishes, it will be IRQ mode's stack which changes, not your USR mode stack. See my gba library for an example (if this is wrong can someone let me know ;)
• Using rrx to divide by 2 and set carry
In my scanline blitter say I have the number of pixels to plot in a1 then I use
` movs a1, a1, rrx`
to find out how many halfwords I have to plot, and the carry flag tells me if I need to plot one more pixel at the end.