AArch64 Assembly for People Who Know C

A practical introduction to AArch64 assembly — registers, calling convention, common patterns — written for developers who already understand what the CPU is supposed to do.

AArch64 (also called ARM64) dominates mobile, is widespread in embedded systems, and is increasingly common in servers and laptops. If you’ve written x86-64 assembly before, AArch64 will feel familiar in some ways and surprisingly different in others. If you’ve never written assembly, this is a decent first architecture to learn on: the instruction set is cleaner and more regular than x86-64.

I’m assuming you know C, you understand what a stack pointer is, and you’ve at least read about registers conceptually.

The register file

AArch64 has 31 general-purpose registers named x0-x30. The w names (w0-w30) access the lower 32 bits of the same registers; a write to w0 zeroes the upper 32 bits of x0. There’s also xzr/wzr (the zero register: reads always return 0, writes are discarded), sp (the stack pointer), and pc (the program counter, which is not a named register; you only reach it indirectly, through branches and PC-relative instructions such as adr).
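Here is a small sketch of the w/x aliasing and the zero register in action. The label demo_zero_extend is invented for illustration, and I’m assuming GNU-style syntax on an ELF target, like the rest of the listings here.

```asm
// Hypothetical demo: a write to a w register clears the top half of the x register.
.text
.global demo_zero_extend

demo_zero_extend:
    mov x0, #-1        // x0 = 0xFFFFFFFFFFFFFFFF (all 64 bits set)
    mov w0, #5         // writes the low 32 bits and zeroes the upper 32
    mov x1, xzr        // reading xzr always yields 0
    add x0, x0, x1     // x0 = x0 + 0, upper bits are still clear
    ret                // x0 holds the value written through w0
```

If the upper half survived the write to w0, the caller would see a huge value instead of a small one. This is different from x86-64's 8- and 16-bit sub-registers, which leave the rest of the register alone.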

Key registers by convention (the AAPCS64 calling convention):

Register    Role
x0-x7       Arguments and return values
x8          Indirect result location (large struct return)
x9-x15      Temporary (caller-saved)
x16-x17     Intra-procedure-call scratch (used by linker stubs)
x18         Platform register (don’t touch on Darwin/iOS)
x19-x28     Callee-saved
x29         Frame pointer
x30         Link register (return address)

The calling convention: integer/pointer arguments go in x0-x7, return value in x0 (or x0+x1 for a 128-bit value). Callee-saved registers must be preserved across function calls: if you use x19, you must save it on entry and restore it before returning.
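To make the argument-to-register mapping concrete, here is a minimal sketch of a three-argument leaf function. The name mul_add is hypothetical; the point is only that the first three integer arguments arrive in x0, x1, and x2, and the result leaves in x0.

```asm
// long mul_add(long a, long b, long c)   (hypothetical example function)
// a in x0, b in x1, c in x2; result returned in x0
.text
.global mul_add

mul_add:
    madd x0, x1, x2, x0   // x0 = a + (b * c)
    ret
```

Because this function calls nothing and only touches argument registers, it needs no stack frame and no saves.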

A minimal function

Let’s implement int add(int a, int b) in assembly:

add.s
// int add(int a, int b)
// a is in w0, b is in w1
.global add
.text

add:
    add w0, w0, w1   // w0 = w0 + w1 (result goes in w0)
    ret              // return (branches to x30)

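ret is not like x86’s ret: it doesn’t pop anything off the stack. It branches to the address held in a register, x30 by default, so it behaves like br x30 with an added hint to the branch predictor that this is a function return. The return address gets into x30 because bl (branch with link) writes the address of the following instruction there before jumping.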
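To see bl and ret working together, here is a rough sketch of a caller for the add function above. The name call_add is made up for this example, and the stp/ldp lines that protect x30 are covered in the next section.

```asm
// int call_add(void) { return add(2, 3); }   (hypothetical caller)
.text
.global call_add

call_add:
    stp x29, x30, [sp, #-16]!   // bl will overwrite x30, so save our own first
    mov x29, sp
    mov w0, #2                  // first argument
    mov w1, #3                  // second argument
    bl  add                     // x30 = address of the next instruction; jump to add
    ldp x29, x30, [sp], #16     // restore our caller's return address
    ret                         // add's result is still in w0
```

Without the save and restore, the final ret would branch to the instruction after bl forever, because that is the address bl left in x30.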

A function with a stack frame

For a function that calls other functions, you need to save and restore x29 (frame pointer) and x30 (link register):

factorial.s
// long factorial(long n)
.global factorial
.text

factorial:
    stp x29, x30, [sp, #-16]!  // push {fp, lr}; sp -= 16
    mov x29, sp                 // frame pointer = current sp

    cmp x0, #1
    b.le .Lbase                 // if n <= 1, return 1

    sub x1, x0, #1             // x1 = n - 1
    mov x0, x1
    bl  factorial               // x0 = factorial(n - 1)
    // now x0 = factorial(n-1), but we've lost n!

    // ← this is wrong, we need to save n before the call
    // (see corrected version below)

.Lbase:
    mov x0, #1
    ldp x29, x30, [sp], #16    // pop {fp, lr}; sp += 16
    ret

That version is buggy: the recursive call overwrites x0 (which held n), so there is nothing left to multiply by, and execution simply falls through into .Lbase and returns 1. The fix is to save n in a callee-saved register (x19) before the call:

factorial.s
.global factorial
.text

factorial:
    stp x29, x30, [sp, #-32]!  // save fp, lr; make room
    stp x19, xzr, [sp, #16]    // save x19 (callee-saved)
    mov x29, sp

    mov x19, x0                // x19 = n (preserved across call)

    cmp x0, #1
    b.le .Lbase

    sub x0, x19, #1            // arg0 = n - 1
    bl  factorial               // x0 = factorial(n - 1)
    mul x0, x0, x19            // result = factorial(n-1) * n

    b .Lreturn

.Lbase:
    mov x0, #1

.Lreturn:
    ldp x19, xzr, [sp, #16]    // restore x19
    ldp x29, x30, [sp], #32    // restore fp, lr; sp += 32
    ret

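stp (store pair) and ldp (load pair) are the idiomatic way to push/pop on AArch64. The ! suffix on stp x29, x30, [sp, #-16]! means “pre-indexed”: update sp first, then store. The matching ldp x29, x30, [sp], #16 is post-indexed: load first, then update sp. Frames are sized in multiples of 16 bytes because AAPCS64 requires sp to stay 16-byte aligned, which is why the corrected factorial reserves 32 bytes and pads the x19 slot with xzr.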
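When a function needs two callee-saved registers, the padding slot disappears and the pairs line up neatly. The following is only a skeleton under that assumption; uses_two_saved is a hypothetical name and the body is left empty.

```asm
// Hypothetical prologue/epilogue for a function that needs x19 and x20.
.text
.global uses_two_saved

uses_two_saved:
    stp x29, x30, [sp, #-32]!   // pre-index: sp -= 32, then store fp/lr at [sp]
    stp x19, x20, [sp, #16]     // plain offset: store at sp+16, sp unchanged
    mov x29, sp

    // body goes here; x19 and x20 may be used freely

    ldp x19, x20, [sp, #16]     // reload the saved pair
    ldp x29, x30, [sp], #32     // post-index: load from [sp], then sp += 32
    ret
```

The epilogue undoes the prologue in reverse order, and sp moves exactly twice: once on entry and once on exit.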

Load and store patterns

AArch64 is a load/store architecture. You can’t operate on memory directly — you load into registers, compute, then store:

// int sum_array(int *arr, int len)
// arr in x0, len in w1
.global sum_array
.text

sum_array:
    mov w2, wzr        // sum = 0
    cbz w1, .Ldone    // if len == 0, return

.Lloop:
    ldr w3, [x0], #4   // w3 = *arr; x0 += 4 bytes, i.e. arr++ (post-index)
    add w2, w2, w3     // sum += w3
    subs w1, w1, #1    // len--; set flags
    b.ne .Lloop        // if len != 0, continue

.Ldone:
    mov w0, w2         // return sum
    ret

ldr w3, [x0], #4 is a post-indexed load: it reads 4 bytes from [x0] into w3, then adds 4 to x0. Very common for iterating over arrays.
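Post-indexing is one of three closely related addressing modes, and seeing them side by side helps. In this sketch, demo_addressing is a hypothetical function that assumes x0 points to an int array with at least three elements.

```asm
// int demo_addressing(int *arr)   (hypothetical; arr must have >= 3 elements)
.text
.global demo_addressing

demo_addressing:
    ldr w1, [x0, #4]     // offset:     w1 = arr[1]; x0 unchanged
    ldr w2, [x0], #4     // post-index: w2 = arr[0]; then x0 += 4
    ldr w3, [x0, #4]!    // pre-index:  x0 += 4 first; then w3 = arr[2]
    add w0, w1, w2       // w0 = arr[1] + arr[0]
    add w0, w0, w3       // w0 += arr[2]
    ret
```

The function returns the sum of the first three elements. The second load leaves x0 one element further on, and the third moves it again before reading, so it ends up two elements past the start.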

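subs is sub plus setting the condition flags. b.ne (some assemblers also accept the spelling bne) branches when the Z flag is clear, which here means the decremented len has not yet reached zero.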

Where to go from here

  • ARM Architecture Reference Manual: The authoritative spec. Dense but comprehensive.
  • Compiler Explorer (godbolt.org): Write C, see the generated AArch64 assembly. This is the fastest way to understand what the compiler does and doesn’t do.
  • GDB or LLDB with -arch arm64: Step through your assembly. There’s no substitute for watching registers change.

AArch64 is genuinely a pleasure to read compared to x86-64. The regularity of the instruction set — three-operand instructions, consistent load/store patterns, a clean calling convention — makes it worth learning even if you’ll mostly just be reading compiler output.

By PatrickChoDev
