Refreshing my memory (3d math)

So.. it's been a while since the last time I did all the 3D calculations myself. Nowadays I just use libraries, since every decent graphics API has them built in, but that isn't the case for the GBA. I started implementing the functions needed for matrix transformations and quickly confirmed what I had suspected before starting: a naive implementation will be way too slow. You still need the background information to start optimizing, though. The following is a good starting point at least: Egon Rath's Notes: Basic 3D Math: Matrices.


GBA: Useful links

Here are some useful links when developing for the GBA:


GBA: Place functions/variables in iwram

Placing functions and variables in iwram will *seriously* speed up your program. On the other hand, there is only a limited amount of it (32K). I made the following defines based on The Dev'rs GBA Dev FAQs.

#define IWRAM_FUNCTION __attribute__((section (".iwram"), long_call))
#define IWRAM_VARIABLE __attribute__((section (".iwram")))
const int somearray[] IWRAM_VARIABLE = { ... }
void IWRAM_FUNCTION SomeFunction( ... ) { ... }
I figured out that there's some more info on this over at tonc after I posted.
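
As a small usage sketch of the function define above: the routine below is hypothetical (fill_halfwords is a made-up name), and it assumes the toolchain's linker script provides an .iwram section that the startup code copies into IWRAM, as devkitARM-style setups do. The #if fallback is only there so the same file also builds on a non-ARM host for testing.

```c
#include <stdint.h>

/* On an ARM toolchain, use the section attribute from the post.
   Elsewhere, fall back to an empty macro so the file still compiles. */
#if defined(__arm__)
#define IWRAM_FUNCTION __attribute__((section(".iwram"), long_call))
#else
#define IWRAM_FUNCTION
#endif

/* Hypothetical hot routine: write `colour` to `count` consecutive
   halfwords starting at `dst`. */
void IWRAM_FUNCTION fill_halfwords(uint16_t *dst, uint16_t colour, unsigned count)
{
    while (count--) {
        *dst++ = colour;
    }
}
```

Only routines that are really on the hot path are worth placing this way, since they all compete for the same 32K.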


One of my many 2d sprite projects

This is a test of a 2d engine with custom physics I worked on a while ago. Art credit goes to Zelda :)

GBA: Asm tips

Thanks to the wayback machine for helping me find some GBA asm optimization tricks and source code here. Credit goes to Pete (dooby@bits.bris.ac.uk / http://bits.bris.ac.uk/dooby/).

There are also a few tips and tricks I thought people might like - comments welcome.
  • STMIA for DMA setup
    When setting up DMA transfers, take advantage of the fact you are writing words to consecutive memory locations and use a STore Multiple Increment After instead of 3 STRs. STMIA for 3 registers takes 2N+2S cycles, which is better than the 6N cycles that 3 STRs would take. For example:
     ldr a1, =0x040000d4 @ Point at DMA src register.
     ldr a2, =from @ Point at source data.
     ldr a3, =to  @ Point at destination.
     ldr a4, =0x85002580 @ Set up control register.
     stmia a1, {a2-a4} @ Start DMA.
    The registers for source, destination and control need to be in ascending numerical order for this to work, as STM stores the lowest numbered registers at the lowest memory address.
  • Halfword Endianness
    To change the endianness of a halfword (or short), first load the value into the bottom 16 bits of a register (eg ldr r0, =0xcafe), then do
     eor r0, r0, r0, ror #8
     eor r0, r0, r0, ror #24
    Now r0 holds 0xfecafeca, ie the endian-flipped halfword replicated twice. To store just the short use strh. To get 0x0000feca use mov r0, r0, lsr #16 (asr would sign-extend and give 0xfffffeca here, because bit 31 is set). Finally, if you are using the value in another calculation you may be able to make use of the ALU to do this for free, eg add r2, r1, r0, lsr #16.
  • Fast multiply by (say) 240
    Instead of using a mul you can usually achieve the same with a couple of adds and shifts
     mov  a2, a1, lsl #8
     sub a2, a2, a1, lsl #4
    will multiply a1 by 240 and put the result in a2. This takes 2S cycles whereas
     mov a3, #240
     mul a2, a1, a3
    takes 2S+I cycles (and uses an extra register).
  • Fast register swap
    You can swap the contents of registers a1 and a2 in just 3 instructions without corrupting another register using
     eor a1, a1, a2
     eor a2, a2, a1
     eor a1, a1, a2
    (Thanks to baah of ARM's Tech for that one). This also works in C using the ^ operator on unsigned long ints.
  • Negating registers
    To negate a register in ARM (ie get -a1) don't use
     mvn a1, a1
    because that just flips the bits (ie -1 will become 0 not +1). Instead, use
     rsb a1, a1, #0
    to get 0-a1 which is what you want.
  • Load offsets
    Don't forget to use offsets in single loads/stores where possible. For example, to read VCOUNT on GBA don't use
     ldr a1, =0x04000006
     ldrh a2, [a1]
    which requires 2 loads, but use
     mov a1, #0x04000000
     ldrh a2, [a1, #6]
    which does just the same but saves you memory (no storing 0x04000006 in the literal pool) and speed (mov is faster than ldr).
  • Unroll loops
    Because there's no actual cache in GBA, unroll loops where you can afford the memory. The fastest way of blitting a scanline is to jump the right number of instructions into a list of 120
     strh a1, [a2], #2
    instructions where a2 is your start position and a1 is your colour halfword. This saves flushing the pipeline and losing 2S+N cycles for every branch in a very tight inner loop (thanks to someone on [gbadev] for that one).
  • DMA re-reading source
    I have heard (but not tested) that DMA from a fixed source (such as a clear screen routine may use) re-reads the source each time. To speed this up either copy a 0 word to EXT WRAM or push a 0 word on the stack (in nice fast INT WRAM) and point at that for the DMA copy. Since DMA halts the CPU this will be safe. Even if the DMA is interrupted as it finishes, it will be IRQ mode's stack which changes, not your USR mode stack. See my gba library for an example (if this is wrong can someone let me know ;)
  • Using rrx to divide by 2 and set carry
    In my scanline blitter say I have the number of pixels to plot in a1 then I use
     movs a1, a1, rrx
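    to find out how many halfwords I have to plot, and the carry flag tells me if I need to plot one more pixel at the end. Note that rrx shifts the old carry flag into bit 31, so this only gives a1/2 when carry is clear beforehand; movs a1, a1, lsr #1 halves a1 and sets carry the same way without that precondition.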
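
Several of the tricks above are easy to sanity-check on a desktop machine. What follows is a minimal C sketch, not Pete's code: it models the ARM instructions with plain 32-bit unsigned arithmetic, and every function name in it is made up for illustration.

```c
#include <stdint.h>

/* Model of the ARM barrel shifter's ROR for n in 1..31. */
static inline uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* eor r0, r0, r0, ror #8 ; eor r0, r0, r0, ror #24
   Expects the halfword in bits 0..15 and zeros above. Returns the
   byte-swapped halfword replicated in both halves of the word. */
static inline uint32_t flip_halfword(uint32_t r0)
{
    r0 ^= ror32(r0, 8);
    r0 ^= ror32(r0, 24);
    return r0;
}

/* mov a2, a1, lsl #8 ; sub a2, a2, a1, lsl #4  =>  256*a1 - 16*a1 */
static inline uint32_t mul240(uint32_t a1)
{
    return (a1 << 8) - (a1 << 4);
}

/* Three-EOR swap. The two pointers must not alias: swapping a
   variable with itself this way would zero it. */
static inline void xor_swap(uint32_t *a, uint32_t *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

/* rsb a1, a1, #0 computes 0 - a1; mvn a1, a1 only inverts the bits. */
static inline uint32_t rsb_zero(uint32_t a1) { return 0u - a1; }
static inline uint32_t mvn(uint32_t a1)      { return ~a1; }

/* movs a1, a1, rrx: shift right by one, the old carry enters bit 31,
   and bit 0 leaves into the new carry. */
static inline uint32_t rrx(uint32_t a1, unsigned carry_in, unsigned *carry_out)
{
    *carry_out = a1 & 1u;
    return (a1 >> 1) | ((uint32_t)(carry_in & 1u) << 31);
}
```

For instance, flip_halfword(0xcafe) gives 0xfecafeca, mul240(3) gives 720, and rrx(7, 0, &c) gives 3 with c set to 1. With the carry set on entry, rrx(6, 1, &c) gives 0x80000003 rather than 3.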