There are also a few tips and tricks I thought people might like - comments welcome.

__STMIA for DMA setup__

When setting up DMA transfers, take advantage of the fact that you are writing words to consecutive memory locations and use a STore Multiple Increment After instead of 3 `STR`s. `STMIA` for 3 registers takes 2N+2S cycles, which is better than the 6N cycles that 3 `STR`s would take. For example:

    ldr a1, =0x040000d4    @ Point at DMA src register.
    ldr a2, =from          @ Point at source data.
    ldr a3, =to            @ Point at destination.
    ldr a4, =0x85002580    @ Set up control register.
    stmia a1, {a2-a4}      @ Start DMA.

The registers for source, destination and control need to be in ascending numerical order for this to work, as STM stores the lowest-numbered register at the lowest memory address.

__Halfword Endianness__

To change the endianness of a halfword (or short), first load the value into the bottom 16 bits of a register with the top 16 bits clear (eg `ldr r0, =0xcafe`), then do

    eor r0, r0, r0, ror #8
    eor r0, r0, r0, ror #24

Now r0 holds `0xfecafeca`, ie the endian-flipped halfword replicated twice. To store just the short use `strh`. To get `0x0000feca` use `mov r0, r0, lsr #16` (a logical shift; `asr` would sign-extend and give `0xfffffeca` here). Finally, if you are using the value in another calculation you may be able to make use of the barrel shifter to do this for free, eg `add r2, r1, r0, lsr #16`.
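
The two `eor` steps can be traced on a host machine. Below is a minimal C sketch, assuming the halfword starts in the bottom 16 bits with the top 16 bits clear. The names `ror32`, `swap_halfword_replicated` and `swap_halfword` are made up for this illustration; they model the ARM instructions rather than generate them.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (0 < n < 32), like ARM's ror. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Models the two eor instructions. Assumes the top 16 bits of x are clear.
   Returns the byte-swapped halfword replicated in both halves of the word. */
static uint32_t swap_halfword_replicated(uint32_t x)
{
    x ^= ror32(x, 8);    /* eor r0, r0, r0, ror #8  */
    x ^= ror32(x, 24);   /* eor r0, r0, r0, ror #24 */
    return x;
}

/* A logical shift right by 16 leaves just the swapped halfword. */
static uint32_t swap_halfword(uint32_t x)
{
    return swap_halfword_replicated(x) >> 16;
}
```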

__Fast multiply by (say) 240__

Instead of using a `mul` you can usually achieve the same with a couple of shifted adds or subtracts:

    mov a2, a1, lsl #8        @ a2 = a1 * 256
    sub a2, a2, a1, lsl #4    @ a2 = a1 * 256 - a1 * 16

will multiply a1 by 240 and put the result in a2. This takes 2S cycles, whereas

    mov a3, #240
    mul a2, a1, a3

takes 2S+I cycles (and uses an extra register).

__Fast register swap__

You can swap the contents of registers a1 and a2 in just 3 instructions without corrupting another register using

    eor a1, a1, a2
    eor a2, a2, a1
    eor a1, a1, a2
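
To see why the three exclusive-ors swap the values, the sequence can be traced on a host machine. This is a sketch only: `xor_swap` is a hypothetical helper that models a1 and a2 as two `uint32_t` objects. One caveat applies to this pointer version but not to two distinct registers: if both pointers refer to the same object, the first exclusive-or would clear it, so the helper returns early in that case.

```c
#include <stdint.h>

/* Swap *a and *b with three exclusive-ors and no temporary. */
static void xor_swap(uint32_t *a, uint32_t *b)
{
    if (a == b)
        return;    /* x ^ x == 0, so swapping an object with itself would clear it */
    *a ^= *b;      /* eor a1, a1, a2 : a = a0 ^ b0           */
    *b ^= *a;      /* eor a2, a2, a1 : b = b0 ^ a0 ^ b0 = a0 */
    *a ^= *b;      /* eor a1, a1, a2 : a = a0 ^ b0 ^ a0 = b0 */
}
```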

(Thanks to baah of ARM's Tech for that one.) This also works in C using the `^` operator on unsigned long ints.

__Negating registers__

To negate a register in ARM (ie get -a1) don't use

    mvn a1, a1

because that just flips the bits (ie -1 will become 0, not +1). Instead, use

    rsb a1, a1, #0

to get 0-a1, which is what you want.

__Load offsets__

Don't forget to use offsets in single loads/stores where possible. For example, to read VCOUNT on GBA don't use

    ldr a1, =0x04000006
    ldrh a2, [a1]

which requires 2 loads, but use

    mov a1, #0x04000000
    ldrh a2, [a1, #6]

which does just the same but saves you memory (no storing 0x04000006 in the literal pool) and speed (mov is faster than ldr).

__Unroll loops__

Because there's no actual cache in GBA, unroll loops where you can afford the memory. The fastest way of blitting a scanline is to jump the right number of instructions into a list of 120

    strh a1, [a2], #2

instructions, where a2 is your start position and a1 is your colour halfword. This saves flushing the pipeline and losing 2S+N cycles for every branch in a very tight inner loop (thanks to someone on [gbadav] for that one).

__DMA re-reading source__

I have heard (but not tested) that DMA from a fixed source (such as a clear screen routine may use) re-reads the source each time. To speed this up, either copy a 0 word to EXT WRAM or push a 0 word on the stack (in nice fast INT WRAM) and point at that for the DMA copy. Since DMA halts the CPU this will be safe. Even if the DMA is interrupted as it finishes, it will be IRQ mode's stack which changes, not your USR mode stack. See my gba library for an example (if this is wrong can someone let me know ;)

__Using rrx to divide by 2 and set carry__

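In my scanline blitter, say I have the number of pixels to plot in a1; then I use

    movs a1, a1, rrx

to find out how many halfwords I have to plot, and the carry flag tells me if I need to plot one more pixel at the end. Note that `rrx` rotates the old carry flag into bit 31, so this relies on carry being clear beforehand; `movs a1, a1, lsr #1` gives the same result and carry without that requirement.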
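
The effect of the shift on the value and on the carry flag can be modelled in C. This is a hedged sketch: `shift_result`, `rrx` and `lsr1` are names invented here to mimic what `movs` does with each shifter operand. The incoming carry is passed as a parameter to `rrx` because that is the bit which ends up in bit 31 of the result.

```c
#include <stdint.h>

/* Result of a flag-setting shift: the new value and the carry-out. */
typedef struct {
    uint32_t value;
    unsigned carry;    /* 0 or 1 */
} shift_result;

/* Models movs rd, rm, rrx: carry_in (0 or 1) enters bit 31, bit 0 leaves as carry. */
static shift_result rrx(uint32_t x, unsigned carry_in)
{
    shift_result r;
    r.value = (x >> 1) | ((uint32_t)carry_in << 31);
    r.carry = x & 1u;
    return r;
}

/* Models movs rd, rm, lsr #1: zero enters bit 31, bit 0 leaves as carry. */
static shift_result lsr1(uint32_t x)
{
    shift_result r;
    r.value = x >> 1;
    r.carry = x & 1u;
    return r;
}
```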
