There are also a few tips and tricks I thought people might like - comments welcome.
- STMIA for DMA setup
When setting up DMA transfers, take advantage of the fact that you are writing words to consecutive memory locations and use a single STore Multiple Increment After (STMIA) instead of 3 STRs. STMIA for 3 registers takes 2N+2S cycles, which is better than the 6N cycles that 3 STRs would take. For example:

    ldr   a1, =0x040000d4   @ Point at DMA src register.
    ldr   a2, =from         @ Point at source data.
    ldr   a3, =to           @ Point at destination.
    ldr   a4, =0x85002580   @ Set up control register.
    stmia a1, {a2-a4}       @ Start DMA.

The registers for source, destination and control need to be in ascending numerical order for this to work, as STM stores the lowest numbered register at the lowest memory address.
- Halfword Endianness
To change the endianness of a halfword (or short), first load the value into the bottom 16 bits of a register (with the top 16 bits clear), e.g.

    ldr r0, =0xcafe

then do

    eor r0, r0, r0, ror #8
    eor r0, r0, r0, ror #24

Now r0 holds 0xfecafeca, i.e. the endian-flipped halfword replicated twice. To store just the short use strh. To get 0x0000feca use

    mov r0, r0, lsr #16

(lsr rather than asr, because asr would copy the set top bit down and give 0xfffffeca). Finally, if you are using the value in another calculation you may be able to make use of the shifted second operand to do this for free, e.g.

    add r2, r1, r0, lsr #16
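To check the rotate-and-XOR trick away from the hardware, here is a minimal C sketch of the same steps. The helper names (ror32, swap_halfword_replicated, swap_halfword) are made up for illustration, and uint32_t stands in for a 32-bit ARM register. It assumes the top 16 bits of the input are zero and uses a logical shift for the final step so the upper half comes out as zero.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (0 < n < 32), like ARM's "ror #n". */
uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Hypothetical helper mirroring the two eor instructions.
   Assumes the top 16 bits of x are zero on entry. */
uint32_t swap_halfword_replicated(uint32_t x)
{
    x ^= ror32(x, 8);   /* eor r0, r0, r0, ror #8  */
    x ^= ror32(x, 24);  /* eor r0, r0, r0, ror #24 */
    return x;           /* byte-swapped halfword in both halves */
}

/* Keep a single copy by shifting the top half down logically. */
uint16_t swap_halfword(uint16_t h)
{
    return (uint16_t)(swap_halfword_replicated(h) >> 16);
}
```

Calling swap_halfword(0xcafe) follows the worked example in the text: the intermediate word is 0xfecafeca and the returned halfword is 0xfeca.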
- Fast multiply by (say) 240
Instead of using a mul you can usually achieve the same with a couple of adds and shifts.

    mov a2, a1, lsl #8
    sub a2, a2, a1, lsl #4

will multiply a1 by 240 (256*a1 - 16*a1) and put the result in a2. This takes 2S cycles whereas

    mov a3, #240
    mul a2, a1, a3

takes 2S+I cycles (and uses an extra register).
- Fast register swap
You can swap the contents of registers a1 and a2 in just 3 instructions without corrupting another register using

    eor a1, a1, a2
    eor a2, a2, a1
    eor a1, a1, a2

(Thanks to baah of ARM's Tech for that one). This also works in C using the ^ operator on unsigned long ints.
- Negating registers
To negate a register in ARM (i.e. get -a1) don't use

    mvn a1, a1

because that just flips the bits (i.e. -1 will become 0, not +1). Instead, use

    rsb a1, a1, #0

to get 0-a1, which is what you want.
- Load offsets
Don't forget to use offsets in single loads/stores where possible. For example, to read VCOUNT on GBA don't use

    ldr  a1, =0x04000006
    ldrh a2, [a1]

which requires 2 loads, but use

    mov  a1, #0x04000000
    ldrh a2, [a1, #6]

which does just the same but saves you memory (no storing 0x04000006 in the literal pool) and speed (mov is faster than ldr).
- Unroll loops
Because there's no actual cache in GBA, unroll loops where you can afford the memory. The fastest way of blitting a scanline is to jump the right number of instructions into a list of 120

    strh a1, [a2], #2

instructions, where a2 is your start position and a1 is your colour halfword. This saves flushing the pipeline and losing 2S+N cycles for every branch in a very tight inner loop (thanks to someone on [gbadev] for that one).
- DMA re-reading source
I have heard (but not tested) that DMA from a fixed source (such as a clear screen routine may use) re-reads the source each time. To speed this up, either copy a 0 word to EXT WRAM or push a 0 word on the stack (in nice fast INT WRAM) and point at that for the DMA copy. Since DMA halts the CPU this will be safe. Even if the DMA is interrupted as it finishes, it will be IRQ mode's stack which changes, not your USR mode stack. See my gba library for an example (if this is wrong can someone let me know ;)
- Using rrx to divide by 2 and set carry
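In my scanline blitter, say I have the number of pixels to plot in a1; then I use

    movs a1, a1, rrx

to find out how many halfwords I have to plot, and the carry flag tells me if I need to plot one more pixel at the end. Note that rrx shifts the old carry flag into bit 31, so this relies on carry being clear beforehand; movs a1, a1, lsr #1 gives the same result without that assumption.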
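For anyone who wants to sanity-check the arithmetic in the tips above on a PC, here is a rough C sketch of the multiply-by-240, XOR swap, negation and halve-with-carry tricks. The function names are made up for illustration, and uint32_t stands in for a 32-bit ARM register. It models the results only, not the cycle counts.

```c
#include <stdint.h>

/* 240*x computed as (x << 8) - (x << 4), i.e. 256x - 16x, modulo 2^32. */
uint32_t mul240(uint32_t x)
{
    return (x << 8) - (x << 4);
}

/* XOR swap. a and b must point at different objects,
   otherwise the first XOR zeroes the shared value. */
void xor_swap(uint32_t *a, uint32_t *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

/* Two's complement negation as 0 - x (what rsb a1, a1, #0 computes). */
uint32_t negate_rsb(uint32_t x)
{
    return 0u - x;
}

/* Plain bit inversion (what mvn a1, a1 computes). */
uint32_t invert_mvn(uint32_t x)
{
    return ~x;
}

/* Halve a pixel count. *odd receives the bit shifted out,
   which is what ends up in the carry flag on ARM. */
uint32_t halve_with_carry(uint32_t pixels, unsigned *odd)
{
    *odd = pixels & 1u;
    return pixels >> 1;
}
```

With 0xffffffff (that is, -1) as input, negate_rsb returns 1 while invert_mvn returns 0, which is the difference the negation tip describes.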