AArch64 Assembly for People Who Know C

AArch64 (also called ARM64) dominates mobile and embedded devices and is increasingly common in servers and laptops. If you’ve written x86-64 assembly before, AArch64 will feel familiar in some ways and surprisingly different in others. If you’ve never written assembly, this is a decent first architecture to learn on: the instruction set is cleaner and more regular than x86-64’s.

I’m assuming you know C, you understand what a stack pointer is, and you’ve at least read about registers conceptually.

The register file

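AArch64 has 31 general-purpose registers named x0–x30. The w names (w0–w30) access the lower 32 bits of the same registers; a write to w0 zeroes the upper 32 bits of x0. Register number 31 is special: depending on the instruction, it is either the zero register xzr/wzr (reads always return 0, writes are discarded) or the stack pointer sp. The program counter pc is not a general-purpose register and cannot be named as an operand; it changes only through branches and is read implicitly by PC-relative instructions such as adr, adrp, and literal loads.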
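The w/x aliasing is easy to model in C. The sketch below treats one 64-bit register as a uint64_t and shows what a write through the w name does to it. The helper names write_w and read_w are made up for illustration; nothing in the architecture or any library defines them.

```c
#include <stdint.h>

// Hypothetical model: one 64-bit register, viewed as x (64 bits) or w (low 32 bits).

// Writing wN replaces the whole register: low 32 bits = value, high 32 bits = 0.
uint64_t write_w(uint64_t old_x, uint32_t value) {
    (void)old_x;             // nothing of the old contents survives, not even the top half
    return (uint64_t)value;  // zero-extension
}

// Reading wN just truncates xN to its low 32 bits.
uint32_t read_w(uint64_t x) {
    return (uint32_t)x;
}
```

One consequence worth remembering: a negative int sitting in w0 is not sign-extended into x0. If you need the 64-bit signed value, that takes an explicit sxtw x0, w0.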

Key registers by convention (the AAPCS64 calling convention):

Register	Role
`x0`–`x7`	Arguments and return values
`x8`	Indirect result location (large struct return)
`x9`–`x15`	Temporary (caller-saved)
`x16`–`x17`	Intra-procedure-call scratch (used by linker stubs)
`x18`	Platform register (don’t touch on Darwin/iOS)
`x19`–`x28`	Callee-saved
`x29`	Frame pointer
`x30`	Link register (return address)

Integer and pointer arguments go in x0–x7 (any further arguments go on the stack), and the return value comes back in x0 (or x0 and x1 for a 128-bit value). Callee-saved registers must be preserved across function calls: if your function writes x19, it must save the old value first and restore it before returning.
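To make the table concrete, here are three C functions annotated with where their arguments and results should live. This is a sketch, assuming AAPCS64, an LP64 platform (64-bit long), and ordinary non-variadic calls; the function and type names are invented for illustration.

```c
// Register assignments in the comments assume AAPCS64; names are illustrative only.

struct pair {      // 16 bytes: small enough to come back in x0 and x1
    long lo;
    long hi;
};

struct big {       // 64 bytes: too large for registers, returned through memory
    long v[8];
};

// a..h arrive in x0..x7, in order. The result leaves in x0.
long sum8(long a, long b, long c, long d, long e, long f, long g, long h) {
    return a + b + c + d + e + f + g + h;
}

// lo arrives in x0, hi in x1. The struct goes back with lo in x0 and hi in x1.
struct pair make_pair(long lo, long hi) {
    struct pair p = { lo, hi };
    return p;
}

// The caller passes the address of the result buffer in x8; seed arrives in x0.
struct big make_big(long seed) {
    struct big b;
    for (int i = 0; i < 8; i++)
        b.v[i] = seed + i;
    return b;
}
```

Pasting these into Compiler Explorer with an AArch64 compiler is a quick way to check the comments against real output.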

A minimal function

Let’s implement int add(int a, int b) in assembly:

// int add(int a, int b)
// a is in w0, b is in w1
.global add
.text

add:
    add w0, w0, w1   // w0 = w0 + w1 (result goes in w0)
    ret              // return (branches to x30)

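ret is not like x86’s ret: nothing is popped off the stack. It branches to the address held in a register, x30 by default, so it behaves like br x30 with an added hint to the branch predictor that this is a function return. The return address gets into x30 because bl (branch with link) writes the address of the following instruction there before jumping to the function.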
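If the bl/ret pairing feels abstract, a toy model helps. The sketch below tracks only pc and x30; struct cpu, do_bl, and do_ret are hypothetical names for illustration, and real hardware obviously does far more.

```c
#include <stdint.h>

// Hypothetical toy model of the two registers involved in call and return.
struct cpu {
    uint64_t pc;   // address of the instruction being executed
    uint64_t x30;  // link register
};

// bl target: remember the address of the next instruction, then jump.
void do_bl(struct cpu *c, uint64_t target) {
    c->x30 = c->pc + 4;  // every AArch64 instruction is 4 bytes long
    c->pc = target;
}

// ret: jump to whatever address x30 currently holds.
void do_ret(struct cpu *c) {
    c->pc = c->x30;
}
```

Notice that there is only one x30. If a function reached via do_bl performs a second do_bl, the first return address is overwritten, and a later do_ret can no longer get back to the original caller unless x30 was saved somewhere in between.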

A function with a stack frame

For a function that calls other functions, you need to save and restore x29 (frame pointer) and x30 (link register):

// long factorial(long n)
.global factorial
.text

factorial:
    stp x29, x30, [sp, #-16]!  // push {fp, lr}; sp -= 16
    mov x29, sp                 // frame pointer = current sp

    cmp x0, #1
    ble .Lbase                  // if n <= 1, return 1

    sub x1, x0, #1             // x1 = n - 1
    mov x0, x1
    bl  factorial               // x0 = factorial(n - 1)
    // now x0 = factorial(n-1), but we've lost n!

    // ← this is wrong, we need to save n before the call
    // (see corrected version below)

.Lbase:
    mov x0, #1
    ldp x29, x30, [sp], #16    // pop {fp, lr}; sp += 16
    ret

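That version is buggy: the recursive call overwrites x0, which held n, so there is nothing left to multiply by. It also falls straight through into .Lbase after the call, so it always returns 1. The fix is to keep n in a callee-saved register (x19) across the call, which in turn means saving x19’s old value first: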

.global factorial
.text

factorial:
    stp x29, x30, [sp, #-32]!  // save fp, lr; make room
    stp x19, xzr, [sp, #16]    // save x19 (callee-saved)
    mov x29, sp

    mov x19, x0                // x19 = n (preserved across call)

    cmp x0, #1
    ble .Lbase

    sub x0, x19, #1            // arg0 = n - 1
    bl  factorial               // x0 = factorial(n - 1)
    mul x0, x0, x19            // result = factorial(n-1) * n

    b .Lreturn

.Lbase:
    mov x0, #1

.Lreturn:
    ldp x19, xzr, [sp, #16]    // restore x19
    ldp x29, x30, [sp], #32    // restore fp, lr; sp += 32
    ret

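stp (store pair) and ldp (load pair) are the idiomatic way to push and pop on AArch64. The ! suffix on stp x29, x30, [sp, #-16]! means “pre-indexed”: update sp first, then store at the new address. The matching ldp x29, x30, [sp], #16 is post-indexed: load from sp, then add 16. sp must stay 16-byte aligned whenever it is used as a base address, which is why the corrected version reserves 32 bytes even though three saved registers need only 24.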
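For reference, here is the C that the corrected assembly corresponds to. The comments map each line back to the instructions above; a real compiler is free to pick different registers.

```c
// C equivalent of the corrected factorial listing.
long factorial(long n) {
    if (n <= 1)                    // cmp x0, #1 ; ble .Lbase
        return 1;                  // mov x0, #1
    return factorial(n - 1) * n;   // n must outlive the call, hence x19
}
```

With a 64-bit long, 20! is the largest factorial that fits. Past that, the assembly quietly wraps around, while the C version has undefined behavior from signed overflow.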

Load and store patterns

AArch64 is a load/store architecture. You can’t operate on memory directly — you load into registers, compute, then store:

// int sum_array(int *arr, int len)
// arr in x0, len in w1
.global sum_array
.text

sum_array:
    mov w2, wzr        // sum = 0
    cmp w1, #0
    ble .Ldone         // if len <= 0, return 0 (len is a signed int)

.Lloop:
    ldr w3, [x0], #4   // w3 = *arr; then advance arr by 4 bytes (post-index)
    add w2, w2, w3     // sum += w3
    subs w1, w1, #1    // len--; set flags
    bne .Lloop         // if len != 0, continue

.Ldone:
    mov w0, w2         // return sum
    ret

ldr w3, [x0], #4 is a post-indexed load: it reads 4 bytes from [x0] into w3, then adds 4 to x0. This is a very common way to walk an array.
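In C terms, a post-indexed load is just *arr++. The sketch below is a C rendering of the same loop, treating a non-positive len as an empty array. I’ve marked the pointer const, which the original signature doesn’t do, since the function never writes through it.

```c
// C rendering of sum_array. `*arr++` reads through the pointer, then advances it
// by one element (4 bytes for int), which is what the post-indexed ldr does.
int sum_array(const int *arr, int len) {
    int sum = 0;
    while (len > 0) {
        sum += *arr++;   // ldr w3, [x0], #4 ; add w2, w2, w3
        len--;           // subs w1, w1, #1
    }
    return sum;
}
```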

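subs is sub that also sets the condition flags. bne (the architectural spelling is b.ne; GNU as accepts both) branches when the Z flag is clear, that is, when the result of the subs was not zero.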
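Here is a small C model of what subs does to the four condition flags on 32-bit operands. It’s a sketch for building intuition; struct flags, subs32, and bne_taken are hypothetical names, not part of any real API.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical model of the NZCV condition flags.
struct flags {
    bool n;  // result is negative (bit 31 set)
    bool z;  // result is zero
    bool c;  // carry: set when the subtraction did NOT borrow (a >= b, unsigned)
    bool v;  // signed overflow
};

// Models `subs wd, wn, wm`: stores the result and returns the flags.
struct flags subs32(uint32_t a, uint32_t b, uint32_t *result) {
    uint32_t r = a - b;  // unsigned arithmetic wraps, as the hardware does
    struct flags f;
    f.n = (r >> 31) != 0;
    f.z = (r == 0);
    f.c = (a >= b);
    // Overflow: the operands had different signs and the result's sign differs from a's.
    f.v = (((a ^ b) & (a ^ r)) >> 31) != 0;
    *result = r;
    return f;
}

// bne / b.ne is taken when Z is clear.
bool bne_taken(struct flags f) {
    return !f.z;
}
```

In the loop above, only Z matters: subs32(1, 1, &r) sets Z, so the branch falls through and the loop ends. The carry flag is the one that trips people up coming from x86, because AArch64 sets it when there was no borrow.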

Where to go from here

ARM Architecture Reference Manual: The authoritative spec. Dense but comprehensive.
Compiler Explorer (godbolt.org): Write C, see the generated AArch64 assembly. This is the fastest way to understand what the compiler does and doesn’t do.
GDB or LLDB with -arch arm64: Step through your assembly. There’s no substitute for watching registers change.

AArch64 is genuinely a pleasure to read compared to x86-64. The regularity of the instruction set — three-operand instructions, consistent load/store patterns, a clean calling convention — makes it worth learning even if you’ll mostly just be reading compiler output.
