Pointer arithmetic and arrays¶

a[i] is the same as *(a + i). That one identity is the foundation of how every C-family language represents arrays, why iterating in cache-friendly order makes code 10× faster, why the compiler can read your for loop and generate SIMD, why Go and Java slices look different from C arrays but behave similarly at the assembly level, and why "off-by-one" is the most common bug in systems code.

This page walks the arc: what an array actually is in memory, what a pointer actually holds, how the two interact when you write a[i], where the cache layer enters the picture, and the patterns that make pointer-arithmetic code fast and correct.

1. What an array is in memory¶

An array of N elements of a type T is N × sizeof(T) contiguous bytes in memory. That is the whole definition. The elements sit back-to-back with no gaps; the second element starts immediately after the first.

For int32[5] (five 32-bit integers, 4 bytes each):

  address:  100  101  102  103  104  105  106  107  108  109  ...  119
  byte:     [---- a[0] ----] [---- a[1] ----] [---- a[2] ----] ... [---- a[4] ----]

20 bytes total. a[0] starts at address 100. a[1] starts at 104. a[2] starts at 108. The address of a[i] is address_of(a[0]) + i × sizeof(T).

This is the same in C, C++, Rust, Go, Java (for primitive arrays), Swift, and any language with anything resembling a "real" array type. JavaScript and Python "arrays" are different - in those languages an array is a heap object with metadata and the elements live elsewhere - but every language touching systems-level performance ultimately drops to the contiguous-bytes model under the hood (TypedArray in JavaScript, NumPy arrays in Python).

2. What a pointer is¶

A pointer is a memory address - usually 8 bytes on a 64-bit system, 4 bytes on a 32-bit system. The address itself is just a number. What makes a pointer typed is the size the language attaches to it: an int32* knows that one unit of arithmetic on it equals 4 bytes; a double* knows it equals 8.

In Go this is explicit:

import "unsafe"
var a [5]int32
fmt.Println(unsafe.Sizeof(a))       // 20  (five 4-byte ints)
fmt.Println(unsafe.Sizeof(a[0]))    // 4

In C, the language hides the arithmetic: p + 1 advances the pointer by sizeof(*p) bytes, not by 1 byte. So:

int32_t *p = &a[0];
p + 1;   // advances by 4 bytes (sizeof(int32_t))
p + 2;   // advances by 8 bytes

This is the dual of "arrays are contiguous." Combined, they produce the foundational identity of pointer arithmetic.

3. The identity: `a[i]` is `*(a + i)`¶

In C, the array name a decays to a pointer to its first element in most expressions. Then a + i is pointer arithmetic - advance by i × sizeof(T) bytes - and *(a + i) reads the value there. The language defines a[i] to be exactly that expression.

int32_t a[5] = {10, 20, 30, 40, 50};
a[2];                  // 30
*(a + 2);              // 30  (identical to a[2])
2[a];                  // 30  (a[i] is defined as *(a + i), and addition commutes)

The 2[a] trick is a C language curiosity that the spec actually allows. Nobody writes it in production; it exists because the bracket syntax is defined in terms of pointer arithmetic, which is commutative.

Go does not let you do raw pointer arithmetic on safe pointers (you must use unsafe.Pointer and uintptr to opt out), but the same indexing math runs underneath. When you write slice[i] in Go, the runtime checks the bound and then computes base + i × element_size - exactly the C-style address calculation. Slices are "a pointer, a length, and a capacity" precisely so this calculation stays cheap.

4. Stride and the row-vs-column iteration problem¶

For a 2D array int32 a[ROWS][COLS], C-family languages store it in row-major order: row 0's columns are contiguous, then row 1's columns, and so on. The address of a[r][c] is base + r × COLS × sizeof(T) + c × sizeof(T).

The distance between adjacent elements in the iteration direction is called the stride:

Iterating along a row (for c in 0..COLS: a[r][c]) has stride sizeof(T) - successive accesses are adjacent in memory.
Iterating down a column (for r in 0..ROWS: a[r][c]) has stride COLS × sizeof(T) - successive accesses are COLS × sizeof(T) bytes apart.

This matters enormously because of the cache line. CPU memory comes in chunks of 64 bytes (on every x86 and most ARM). When you read one byte, the CPU loads the whole 64-byte line containing it into cache. Subsequent accesses within that line are free; accesses outside it pay the (~100-cycle) cost of a main-memory load.

Row-by-row iteration on int32s: every cache line holds 16 elements, so 1 in 16 accesses pays the memory cost; 15 in 16 are free cache hits. Column-by-column iteration: if a row is wider than 64 bytes (always the case for non-trivial matrices), every access lands in a different cache line and none are cache hits. The same algorithm runs 10-30× slower.

const N = 4096
var a [N][N]float64

// Row-major - fast
sum := 0.0
for r := 0; r < N; r++ {
    for c := 0; c < N; c++ {
        sum += a[r][c]
    }
}

// Column-major - same elements, ~20× slower
sum = 0.0
for c := 0; c < N; c++ {
    for r := 0; r < N; r++ {
        sum += a[r][c]
    }
}

The standard advice "iterate over your arrays in memory order" is exactly this. Numerical libraries, matrix multiplication, and image processing kernels are tuned to march through memory in the natural stride direction; doing it wrong is the single most common reason "my code is slow" in performance-sensitive code.

Fortran and MATLAB use column-major order instead, so the cache-friendly direction for the same algorithm is reversed. If you ever port code between NumPy (row-major by default, since it grew from C) and MATLAB (column-major), this is the trap.

5. Why this lets the compiler generate SIMD¶

A loop with a contiguous-array access pattern (constant stride of 1 element) is easy for the compiler to vectorize - turn into SIMD (Single Instruction Multiple Data) instructions that process 4 or 8 elements at once.

for i := 0; i < len(a); i++ {
    a[i] = b[i] + c[i]
}

Modern compilers (Go's, GCC's, Clang's, MSVC's) recognize this loop, detect that the accesses are linear and aligned, and emit something like:

  vmovups  ymm0, [rsi + rdi*4]    ; load 8 floats from b[i..i+8]
  vaddps   ymm0, ymm0, [rdx + rdi*4]  ; add 8 floats from c[i..i+8]
  vmovups  [rcx + rdi*4], ymm0    ; store 8 floats to a[i..i+8]
  add      rdi, 8
  cmp      rdi, ...
  jl       loop

One instruction does 8 additions. The whole loop runs at memory bandwidth, not compute - because the access pattern is regular, the data layout is contiguous, and the compiler can prove there is no aliasing between the arrays. Break any of those (insert an indirection, alias the pointers, use a non-constant stride) and the compiler gives up and emits one element at a time.

This is the most direct reward for "understand your memory layout": the language gets to use the CPU's most powerful instructions on your behalf, automatically.

6. Slices, vectors, and "fat pointers"¶

C arrays decay to a bare pointer (the address of element 0). The size is lost - that is why sizeof does not work on arrays passed to a function; you have to pass the length as a separate argument, and forgetting to is the source of decades of buffer-overflow vulnerabilities.

Go slices, Rust slices, C++ std::span and std::string_view, and Swift Array all use a fat pointer: a pointer plus length, sometimes plus capacity:

// Go slice header (conceptually):
type slice struct {
    data     unsafe.Pointer   // pointer to first element
    length   int              // current length
    capacity int              // allocated space (for append)
}

// Rust &[T] is conceptually:
struct Slice<T> {
    ptr: *const T,
    len: usize,
}

Element access still does the base + i × element_size arithmetic underneath - the fat pointer's job is to carry the bounds information so the runtime can panic (Go, Rust) or throw (Java) on out-of-bounds access. The cost of the bounds check is ~1 cycle per access; the compiler frequently proves it away inside a loop (Go's, Rust's, and the JVM's optimizers all do this for the obvious for i := 0; i < len(a); i++ pattern). You get C-array speed with array-of-bytes safety. This is one of the best deals in modern programming language design.

7. Common pointer-arithmetic mistakes¶

7.1 Off-by-one ("fencepost") errors¶

If an array has N elements, its valid indices are 0 .. N-1. The pointer one past the end (&a[N]) is valid as an address but reading it is undefined behavior. The C standard allows the past-the-end address to exist precisely so loop conditions like p < end are safe.

int a[5];
int *end = a + 5;     // OK - past-the-end is allowed as an address
*end;                 // UB - reading the past-the-end address
for (int *p = a; p < end; p++) { ... }  // OK - classic pattern

Every off-by-one bug is a confusion between "the array has 5 elements" and "the last valid index is 4."

7.2 Forgetting element size¶

Subtracting two pointers in C gives the count of elements between them, not bytes:

int32_t a[10];
int32_t *p = &a[3];
int32_t *q = &a[7];
ptrdiff_t n = q - p;   // 4, not 16

This is convenient when you remember it and a foot-gun when you cast to char* halfway through and forget. The Go equivalent uses unsafe.Pointer and explicit uintptr arithmetic which makes the byte-vs-element distinction visible:

p := unsafe.Pointer(&a[3])
q := unsafe.Pointer(&a[7])
n := (uintptr(q) - uintptr(p)) / unsafe.Sizeof(a[0])  // 4

7.3 Aliasing¶

If two pointers point into the same array, the compiler cannot reorder operations across them freely. C's restrict keyword (and the convention of stating no-aliasing in function signatures) is how you tell the compiler "trust me, these do not overlap." Mis-stated restrict is undefined behavior; correctly stated, it unlocks vectorization. Fortran's lack of pointers is one reason its numerical code traditionally outran C - the compiler could always assume non-aliasing.

7.4 Storing a pointer to a stack array¶

int* leak(void) {
    int local[5];
    return local;   // BUG - returns pointer to a stack frame about to be destroyed
}

Once the function returns, the stack memory is reused by the next function; the pointer now refers to garbage. Go's escape analysis catches this and moves the array to the heap automatically; Rust's borrow checker rejects the code at compile time. C and C++ silently produce a use-after-free. This is the single most common "why does my server crash?" cause in long-running C programs.

8. Advanced: where the layer below shows up¶

8.1 Alignment¶

Most CPUs require certain types to be aligned at certain byte boundaries: a double (8 bytes) must usually live at an address divisible by 8; SIMD registers like __m128 (16 bytes) at addresses divisible by 16. The compiler enforces this by padding structs and aligning stack frames.

Misaligned accesses on x86 used to be slow but legal; on ARMv7 they would fault. Modern x86 makes them fast but still a few cycles slower than aligned. The patterns engineers care about:

struct { char a; double b; } is 16 bytes, not 9, because the double is forced to an 8-byte boundary and 7 bytes of padding sit between.
A []byte slice in Go is byte-aligned. If you cast it to []uint32 you must ensure the underlying address is 4-byte aligned, or risk a crash (or a silent slowdown, depending on the architecture).
posix_memalign and _mm_malloc exist precisely to return memory aligned for SIMD use.

8.2 Prefetching¶

When the CPU sees a regular access pattern, the hardware prefetcher notices and starts loading the next cache line before you ask for it. This is why linear iteration is so fast - much of the time, the data has already been loaded by the time the loop reaches it.

Random-access patterns (hash tables, linked lists, tree traversal) defeat the prefetcher: every access is a fresh memory request. This is the single biggest reason hash tables and linked lists are slow relative to their algorithmic-complexity story, and why "array of structs" beats "linked list of structs" almost always in practice.

When the compiler does not detect a pattern but you know one exists, intrinsics like __builtin_prefetch (GCC, Clang) and _mm_prefetch let you tell the CPU "I will need this address soon." Used carefully in tree traversal code, it can shave 30-50% off the runtime.

Two threads writing to different variables that happen to share a 64-byte cache line will spend most of their time bouncing the line between CPU caches, completely negating the parallelism. This is false sharing and it shows up in concurrent counters, log buffers, and any per-thread struct that is not padded to a cache line.

type Counter struct {
    val int64
    _   [56]byte   // pad to 64 bytes so adjacent counters do not share a line
}

Go's runtime/internal/atomic and Java's @Contended annotation exist for this. If you ever benchmark a parallel program and find it gets slower as you add threads, false sharing is the first suspect.

8.4 Pointer tagging¶

Pointers on 64-bit systems usually only use 48 of their 64 bits for the actual address; the upper bits are required to be all zero (user-space) or all one (kernel-space) by the architecture. Garbage collectors and dynamic-language runtimes (V8, LuaJIT, the JVM) abuse the unused bits to encode type information directly in the pointer: bit 0 = "this is an integer, not a heap pointer," for instance.

Reading such a pointer requires masking; the unmasked address is invalid. This is how V8 makes small integers indistinguishable from pointers without an allocation, and why some debuggers show V8 internal values as nonsense addresses until you understand the tagging scheme.

8.5 Garbage collection and pointers¶

A tracing GC (Java, Go, JavaScript) finds all live memory by walking pointers from a root set. This requires the runtime to know which words in memory are pointers (so it can follow them) and which are not (so it does not chase a 0xDEADBEEF integer as if it were an address). The information lives in stack maps, type descriptors, and the like; for native-compiled languages it must be emitted by the compiler.

This is why a language with mixed values and pointers in memory (C, C++ for raw pointers) cannot have a precise tracing GC without serious effort - the runtime cannot tell what is a pointer and what is data. Conservative GC (the Boehm collector) treats anything that looks like a pointer as one, which works but pins memory the program is not actually holding. Go's GC, by contrast, is precise because the compiler emits type information alongside the code.

9. The mental model to keep¶

An array is N × sizeof(T) contiguous bytes. A pointer is a typed memory address; p + 1 advances by sizeof(*p) bytes.
a[i] is *(a + i). Every indexing operation underneath is base + i × element_size.
Iterate in memory order. Row-major arrays mean inner loops over columns. Column-major (Fortran, MATLAB) is the opposite.
A cache line is 64 bytes. Iterating in memory order gets ~16 element accesses per memory load on int32s. Striding across cache lines pays for every access.
Fat pointers (Go slices, Rust slices, C++ span) buy bounds-checked safety with one extra word of overhead and almost no runtime cost.
The compiler vectorizes loops with regular access patterns. Indirection, aliasing, and irregular stride break that.
Off-by-one, alignment, aliasing, and use-after-free are the four classic pointer-arithmetic bugs. Memory-safe languages prevent the last; the first three still get you.

10. Further reading¶

Ulrich Drepper, "What Every Programmer Should Know About Memory" (2007) - the canonical reference for cache, TLB, prefetching, and access-pattern effects. Long but essential.
The Go runtime source, runtime/slice.go - a clean implementation of a fat-pointer array type.
Computer Systems: A Programmer's Perspective (Bryant & O'Hallaron), chapters 6 and 9 - the textbook treatment of memory hierarchy.
Bit operations - for the alignment math.
Virtual memory - what happens at the address level above pointer arithmetic.