my eulogy: GBA: Asm tips

Thanks to the Wayback Machine for helping me find some GBA asm optimization tricks and source code here. Credit goes to Pete (dooby@bits.bris.ac.uk / http://bits.bris.ac.uk/dooby/).

Here are a few tips and tricks I thought people might like - comments welcome.

STMIA for DMA setup
When setting up DMA transfers, take advantage of the fact that you are writing words to consecutive memory locations and use a STore Multiple Increment After instead of 3 STRs. STMIA for 3 registers takes 2N+2S cycles, which is better than the 6N cycles that 3 STRs would take. For example:
```
 ldr a1, =0x040000d4 @ Point at DMA src register.
 ldr a2, =from @ Point at source data.
 ldr a3, =to  @ Point at destination.
 ldr a4, =0x85002580 @ Set up control register (mov cannot take an =constant).
 stmia a1, {a2-a4} @ Start DMA.
```
The registers for source, destination and control need to be in ascending numerical order for this to work, as STM stores the lowest numbered registers at the lowest memory address.
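As a rough aid to reading the control word above, here is a small C sketch that rebuilds 0x85002580 from its fields. The macro and function names are made up for illustration, and the bit positions are my reading of the commonly documented DMA control layout (enable in bit 31, 32-bit transfer in bit 26, source address control in bits 23-24), so treat them as assumptions to check against your hardware reference.

```c
#include <stdint.h>

/* Assumed bit layout of the 32-bit DMA count/control word (names are hypothetical). */
#define DMA_ENABLE    (UINT32_C(1) << 31)  /* start the transfer */
#define DMA_32BIT     (UINT32_C(1) << 26)  /* move words rather than halfwords */
#define DMA_SRC_FIXED (UINT32_C(2) << 23)  /* source address does not advance */

/* Build a control word for a fixed-source word fill of word_count words. */
static uint32_t dma_fill_control(uint32_t word_count)
{
    return DMA_ENABLE | DMA_32BIT | DMA_SRC_FIXED | (word_count & UINT32_C(0xFFFF));
}
```

Under those assumptions, `dma_fill_control(240 * 160 / 4)` gives the value used in the example, i.e. a fixed-source fill of 9600 words.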
Halfword Endianness
To change the endianness of a halfword (or short), first load the value into the bottom 16 bits of a register with the top 16 bits clear (eg ldr r0, =0xcafe), then do
```
 eor r0, r0, r0, ror #8
 eor r0, r0, r0, ror #24
```
Now r0 holds 0xfecafeca, ie the endian-flipped halfword replicated twice. To store just the short use strh. To get 0x0000feca use mov r0, r0, lsr #16 (a logical shift; asr would sign-extend and give 0xfffffeca here). Finally, if you are using the value in another calculation you may be able to make use of the barrel shifter to do this for free, eg add r2, r1, r0, lsr #16.
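To check the two-EOR trick on a PC, the sequence can be modelled in C. This is only a sketch: ror32 and swap_halfword are names invented here, and the model assumes the top 16 bits of the input are zero, as the text requires.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (0 < n < 32), like ARM's ROR. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Byte-swap the halfword held in the low 16 bits of x (top 16 bits must be 0). */
static uint32_t swap_halfword(uint32_t x)
{
    x ^= ror32(x, 8);   /* eor r0, r0, r0, ror #8  */
    x ^= ror32(x, 24);  /* eor r0, r0, r0, ror #24 : swapped halfword, twice */
    return x >> 16;     /* logical shift keeps the top 16 bits clear */
}
```

Writing the input bytes as 00 00 A B, the first EOR gives B 00 A A^B and the second gives B A B A, which is why the swapped halfword appears in both halves.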
Fast multiply by (say) 240
Instead of using a mul you can usually achieve the same with a shift and a subtract (or a couple of adds and shifts):
```
 mov a2, a1, lsl #8
 sub a2, a2, a1, lsl #4
```
will multiply a1 by 240 and put the result in a2. This takes 2S cycles whereas
```
 mov a3, #240
 mul a2, a1, a3
```
takes 2S+I cycles (and uses an extra register).
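The arithmetic behind the shift-and-subtract version is 256x - 16x = 240x. A small C model (mul240 is a name made up for this sketch) makes that easy to check on a host machine:

```c
#include <stdint.h>

/* Model of the two-instruction multiply by 240. */
static uint32_t mul240(uint32_t x)
{
    uint32_t r = x << 8;  /* mov a2, a1, lsl #8     : r = 256 * x       */
    r -= x << 4;          /* sub a2, a2, a1, lsl #4 : r = 256x - 16x    */
    return r;             /* 240 * x, wrapping modulo 2^32 like the asm */
}
```

The same idea should work for any constant that is a sum or difference of two powers of two, such as 160 = 128 + 32.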
Fast register swap
You can swap the contents of registers a1 and a2 in just 3 instructions without corrupting another register using
```
 eor a1, a1, a2
 eor a2, a2, a1
 eor a1, a1, a2
```
(Thanks to baah of ARM's Tech for that one). This also works in C using the ^ operator on unsigned long ints.
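For the C version mentioned above, a minimal sketch might look like this. The function name is made up, and the guard is there because XOR-swapping an object with itself would zero it, which cannot happen with two distinct registers but can happen with two pointers to the same variable.

```c
#include <stdint.h>

/* Swap two 32-bit values in place without a temporary. */
static void xor_swap(uint32_t *a, uint32_t *b)
{
    if (a == b) {
        return;  /* same object: x ^ x would leave 0, so do nothing */
    }
    *a ^= *b;    /* eor a1, a1, a2 */
    *b ^= *a;    /* eor a2, a2, a1 */
    *a ^= *b;    /* eor a1, a1, a2 */
}
```

In C a plain temporary is usually just as fast, since the compiler picks the registers; the trick mainly pays off in hand-written asm when no register is free.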
Negating registers
To negate a register in ARM (ie get -a1) don't use
```
 mvn a1, a1
```
because that just flips the bits (ie -1 will become 0 not +1). Instead, use
```
 rsb a1, a1, #0
```
to get 0-a1 which is what you want.
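The difference between the two instructions is the usual two's complement identity -x = ~x + 1. A quick C model (function names invented for this sketch) shows both results side by side:

```c
#include <stdint.h>

/* mvn a1, a1 : bitwise NOT only. */
static uint32_t model_mvn(uint32_t x)
{
    return ~x;
}

/* rsb a1, a1, #0 : reverse subtract, 0 - x, i.e. two's complement negate. */
static uint32_t model_rsb0(uint32_t x)
{
    return 0u - x;
}
```

With x = 0xFFFFFFFF (that is, -1), model_mvn gives 0 and model_rsb0 gives 1, matching the text.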
Load offsets
Don't forget to use offsets in single loads/stores where possible. For example, to read VCOUNT on GBA don't use
```
 ldr a1, =0x04000006
 ldrh a2, [a1]
```
which requires 2 loads, but use
```
 mov a1, #0x04000000
 ldrh a2, [a1, #6]
```
which does just the same but saves you memory (no storing 0x04000006 in the literal pool) and speed (mov is faster than ldr).
Unroll loops
Because there's no actual cache on the GBA, unroll loops where you can afford the memory. The fastest way of blitting a scanline is to jump the right number of instructions into a list of 120
```
 strh a1, [a2], #2
```
instructions where a2 is your start position and a1 is your colour halfword. This saves flushing the pipeline and losing 2S+N cycles for every branch in a very tight inner loop (thanks to someone on [gbadev] for that one).
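The "jump into the middle of an unrolled list" idea can be sketched in C with a switch that falls through. This is only an illustration with 8 stores instead of 120; fill_upto8 is a made-up name, and there is no guarantee a C compiler turns it into the same code as the hand-written asm.

```c
#include <stdint.h>

/* Write `colour` to `count` halfwords (0 to 8) starting at dst by entering
   a straight run of stores part-way through, so no loop branch is needed. */
static void fill_upto8(uint16_t *dst, uint16_t colour, unsigned count)
{
    switch (count) {
    case 8: *dst++ = colour; /* fall through */
    case 7: *dst++ = colour; /* fall through */
    case 6: *dst++ = colour; /* fall through */
    case 5: *dst++ = colour; /* fall through */
    case 4: *dst++ = colour; /* fall through */
    case 3: *dst++ = colour; /* fall through */
    case 2: *dst++ = colour; /* fall through */
    case 1: *dst++ = colour; /* fall through */
    default: break;          /* 0, or out of range: store nothing */
    }
}
```

Each case label plays the role of an entry point into the list of strh instructions: a larger count enters earlier and executes more stores.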
DMA re-reading source
I have heard (but not tested) that DMA from a fixed source (such as a clear screen routine may use) re-reads the source each time. To speed this up, either copy a 0 word to EXT WRAM or push a 0 word on the stack (in nice fast INT WRAM) and point at that for the DMA copy. Since DMA halts the CPU this will be safe. Even if the DMA is interrupted as it finishes, it will be IRQ mode's stack which changes, not your USR mode stack. See my gba library for an example (if this is wrong, can someone let me know? ;)
Using rrx to divide by 2 and set carry
In my scanline blitter say I have the number of pixels to plot in a1 then I use
```
 movs a1, a1, rrx
```
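to find out how many halfwords I have to plot, and the carry flag tells me if I need to plot one more pixel at the end. Note that rrx shifts the old carry flag into bit 31, so this only halves a1 if the carry is clear beforehand; movs a1, a1, lsr #1 sets the carry from bit 0 in the same way without that dependency.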
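To see what the shift does to the value and the carry flag, here is a small C model of the two variants. The struct and function names are invented for this sketch; carry_in stands for the state of the C flag before the instruction runs.

```c
#include <stdint.h>

/* Result register plus the carry flag left behind by a MOVS with a shift. */
typedef struct {
    uint32_t value;
    unsigned carry;
} shift_result;

/* movs a1, a1, rrx : rotate right one bit through the carry flag. */
static shift_result model_rrx(uint32_t x, unsigned carry_in)
{
    shift_result r;
    r.value = (x >> 1) | ((uint32_t)(carry_in & 1u) << 31); /* old carry -> bit 31 */
    r.carry = (unsigned)(x & 1u);                           /* bit 0 -> new carry  */
    return r;
}

/* movs a1, a1, lsr #1 : plain logical shift, bit 31 becomes 0. */
static shift_result model_lsr1(uint32_t x)
{
    shift_result r;
    r.value = x >> 1;
    r.carry = (unsigned)(x & 1u);
    return r;
}
```

For 7 pixels with the carry clear, both give 3 halfwords with carry set (one odd pixel left over). With the carry set on entry, the rrx model also sets bit 31 of the result.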

2012-09-25