## 2012-09-25

### GBA: Asm tips

Thanks to the Wayback Machine for helping me find some GBA asm optimization tricks and source code here. Credit goes to Pete (dooby@bits.bris.ac.uk / http://bits.bris.ac.uk/dooby/).

There are also a few tips and tricks I thought people might like - comments welcome.
• STMIA for DMA setup
When setting up DMA transfers, take advantage of the fact that you are writing words to consecutive memory locations and use a STore Multiple Increment After instead of 3 `STR`s. `STMIA` for 3 registers takes 2N+2S cycles, which is better than the 6N cycles that 3 `STR`s would take. For example:
```
ldr a1, =0x040000d4 @ Point at DMA src register.
ldr a2, =from       @ Point at source data.
ldr a3, =to         @ Point at destination.
ldr a4, =0x85002580 @ Set up control register (not a valid mov immediate, so use ldr =).
stmia a1, {a2-a4}   @ Start DMA.
```
The registers for source, destination and control need to be in ascending numerical order for this to work, as STM stores the lowest numbered registers at the lowest memory address.
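To make the layout concrete, here is a small host-side C sketch of the three consecutive words that the `stmia` fills. The struct and helper names (`dma3_regs`, `dma3_setup`, `dma_count` and so on) are made up for illustration, and the bit positions are my reading of the usual GBA DMA control layout (word count in the low 16 bits, 32-bit transfer flag in bit 26, enable in bit 31), so treat it as a sketch rather than a reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the three words starting at 0x040000d4. */
struct dma3_regs {
    uint32_t src;   /* 0x040000d4: source address                     */
    uint32_t dst;   /* 0x040000d8: destination address                */
    uint32_t ctrl;  /* 0x040000dc: word count (low 16) + control bits */
};

/* Write the words in ascending address order, as stmia a1, {a2-a4} does.
   The control word goes last, so the enable bit is set after src and dst. */
static void dma3_setup(struct dma3_regs *regs,
                       uint32_t src, uint32_t dst, uint32_t ctrl) {
    assert(regs != NULL);
    regs->src  = src;
    regs->dst  = dst;
    regs->ctrl = ctrl;
}

/* Assumed field positions inside the combined 32-bit control word. */
static unsigned dma_count(uint32_t ctrl)    { return ctrl & 0xffffu; }
static int      dma_is_32bit(uint32_t ctrl) { return (int)((ctrl >> 26) & 1u); }
static int      dma_enabled(uint32_t ctrl)  { return (int)((ctrl >> 31) & 1u); }
```

With the control value from the example, `dma_count(0x85002580)` is 0x2580 (9600) units and both the 32-bit and enable flags are set.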
• Halfword Endianness
To change the endianness of a halfword (or short), first load the value into the bottom 16 bits of a register (e.g. `ldr r0, =0xcafe`), then do
```
eor r0, r0, r0, ror #8
eor r0, r0, r0, ror #24
```
Now r0 holds `0xfecafeca`, i.e. the endian-flipped halfword replicated twice. To store just the short use `strh`. To get `0x0000feca` use `mov r0, r0, lsr #16` (an `asr` would sign-extend this particular value to `0xfffffeca`). Finally, if you are using the value in another calculation you may be able to make use of the barrel shifter to do this for free, e.g. `add r2, r1, r0, lsr #16`.
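If you want to convince yourself the two EORs really do this, the following C sketch mirrors them on a host machine. `ror32` and `swap_halfword_replicated` are names I made up for the illustration; the only assumption is that the top 16 bits of the input are zero, as in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right like the ARM barrel shifter's ROR; n must be 1..31 here. */
static uint32_t ror32(uint32_t x, unsigned n) {
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32 - n));
}

/* Input: halfword in the low 16 bits, top 16 bits zero.
   Output: the byte-swapped halfword in both halves of the word. */
static uint32_t swap_halfword_replicated(uint32_t r0) {
    assert(r0 <= 0xffffu);
    r0 ^= ror32(r0, 8);   /* eor r0, r0, r0, ror #8  */
    r0 ^= ror32(r0, 24);  /* eor r0, r0, r0, ror #24 */
    return r0;
}
```

A logical shift right by 16 (or a truncation to 16 bits, which is what `strh` does) then leaves just the swapped halfword.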
• Fast multiply by (say) 240
Instead of using a `mul` you can usually achieve the same with a couple of adds and shifts
```
mov a2, a1, lsl #8
sub a2, a2, a1, lsl #4
```
will multiply a1 by 240 and put the result in a2. This takes 2S cycles whereas
```
mov a3, #240
mul a2, a1, a3
```
takes 2S+I cycles (and uses an extra register).
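The arithmetic behind the shift version is just 240 = 256 - 16. A quick C sketch of the same two steps, with `mul240` and `mul240_selfcheck` as hypothetical names, and using unsigned 32-bit wraparound to match what the registers do:

```c
#include <assert.h>
#include <stdint.h>

/* Same two steps as the mov/sub pair. */
static uint32_t mul240(uint32_t a1) {
    uint32_t a2 = a1 << 8;  /* mov a2, a1, lsl #8     : 256 * a1        */
    a2 -= a1 << 4;          /* sub a2, a2, a1, lsl #4 : minus 16 * a1   */
    return a2;
}

/* Compare against a plain multiply over a small range of inputs. */
static void mul240_selfcheck(void) {
    for (uint32_t x = 0; x < 1000; ++x)
        assert(mul240(x) == x * 240u);
}
```

The same idea works for any constant that is a sum or difference of two powers of two.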
• Fast register swap
You can swap the contents of registers a1 and a2 in just 3 instructions without corrupting another register using
```
eor a1, a1, a2
eor a2, a2, a1
eor a1, a1, a2
```
(Thanks to baah of ARM's Tech for that one). This also works in C using the `^` operator on unsigned long ints.
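For the C version, a minimal sketch might look like the following. `xor_swap` is a hypothetical helper name. One caveat that does not arise with two distinct registers: if both pointers refer to the same object, the first XOR zeroes it, so the sketch asserts against that.

```c
#include <assert.h>

/* Swap two unsigned longs without a temporary, mirroring the three EORs. */
static void xor_swap(unsigned long *a, unsigned long *b) {
    assert(a != b);  /* aliasing would leave the value as 0 */
    *a ^= *b;        /* eor a1, a1, a2 */
    *b ^= *a;        /* eor a2, a2, a1 */
    *a ^= *b;        /* eor a1, a1, a2 */
}
```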
• Negating registers
To negate a register in ARM (ie get -a1) don't use
` mvn a1, a1`
because that just flips the bits (ie -1 will become 0 not +1). Instead, use
` rsb a1, a1, #0`
to get 0-a1 which is what you want.
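The difference between the two is the usual two's complement identity: flipping the bits gives -x - 1, so you need to add 1 (or subtract from 0) to get -x. A small C sketch on unsigned 32-bit values, with `arm_mvn`, `arm_rsb_zero` and `negate_selfcheck` as made-up names standing in for the two instructions:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t arm_mvn(uint32_t a1)      { return ~a1; }      /* mvn a1, a1     */
static uint32_t arm_rsb_zero(uint32_t a1) { return 0u - a1; }  /* rsb a1, a1, #0 */

/* Check that mvn is always one short of a true negation. */
static void negate_selfcheck(void) {
    for (uint32_t x = 0; x < 1000; ++x)
        assert(arm_mvn(x) + 1u == arm_rsb_zero(x));
}
```

So `arm_mvn(0xffffffff)` (that is, -1) is 0, while `arm_rsb_zero(0xffffffff)` is 1.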
• Use offsets in loads/stores
Don't forget to use offsets in single loads/stores where possible. For example, to read VCOUNT on GBA don't use
```
ldr a1, =0x04000006
ldrh a2, [a1]
```
which requires 2 loads, but use
```
mov a1, #0x04000000
ldrh a2, [a1, #6]
```
which does just the same but saves you memory (no storing 0x04000006 in the literal pool) and speed (mov is faster than ldr).
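The reason the split works is that an ARM `mov` immediate has to be an 8-bit value rotated right by an even amount, and `ldrh` accepts an 8-bit immediate offset. The C sketch below checks both conditions on a host machine; `rol32`, `is_arm_immediate` and `fits_ldrh_offset` are names invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n bits (0..31). */
static uint32_t rol32(uint32_t x, unsigned n) {
    assert(n < 32);
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Returns 1 if value is an 8-bit constant rotated right by an even
   amount (0, 2, ..., 30), i.e. usable directly in mov/add/etc. */
static int is_arm_immediate(uint32_t value) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by rot; an encodable value then fits in 8 bits. */
        if (rol32(value, rot) <= 0xffu)
            return 1;
    }
    return 0;
}

/* ldrh/strh take an 8-bit immediate offset magnitude. */
static int fits_ldrh_offset(uint32_t offset) { return offset <= 0xffu; }
```

Here `0x04000000` passes and `0x04000006` does not, which is why the full address needs a literal pool load while the base plus `#6` does not.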
• Unroll loops
Because there's no actual cache in GBA, unroll loops where you can afford the memory. The fastest way of blitting a scanline is to jump the right number of instructions into a list of 120
` strh a1, [a2], #2`
instructions, where a2 is your start position and a1 is your colour halfword. This saves flushing the pipeline and losing 2S+N cycles for every branch in a very tight inner loop (thanks to someone on [gbadev] for that one).
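As a rough model of the jump-into-the-list idea, here is a scaled-down C sketch using 8 stores instead of 120. The `switch` plays the role of the single computed jump, and each fall-through case stands for one `strh a1, [a2], #2`. `UNROLL`, `blit_run` and `entry_offset_bytes` are hypothetical names, and the offset helper assumes 4-byte ARM-mode instructions.

```c
#include <assert.h>
#include <stdint.h>

enum { UNROLL = 8 };  /* scaled-down stand-in for the 120 stores */

/* Store `count` copies of colour starting at dst, with no loop branch. */
static void blit_run(uint16_t *dst, uint16_t colour, unsigned count) {
    assert(count <= UNROLL);
    switch (count) {            /* one jump, then straight-line stores */
    case 8: *dst++ = colour;    /* fall through */
    case 7: *dst++ = colour;    /* fall through */
    case 6: *dst++ = colour;    /* fall through */
    case 5: *dst++ = colour;    /* fall through */
    case 4: *dst++ = colour;    /* fall through */
    case 3: *dst++ = colour;    /* fall through */
    case 2: *dst++ = colour;    /* fall through */
    case 1: *dst++ = colour;    /* fall through */
    case 0: break;
    }
}

/* Byte offset of the entry point in a list of `total` 4-byte instructions
   so that exactly `count` of them execute before the end of the list. */
static unsigned entry_offset_bytes(unsigned total, unsigned count) {
    assert(count <= total);
    return (total - count) * 4u;
}
```

For a full list of 120 stores, drawing 5 pixels means skipping the first 115 instructions, i.e. `entry_offset_bytes(120, 5)` is 460 bytes.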
` movs a1, a1, rrx`