Lecture from: 18.11.2025 | Video: Videos ETHZ

Optimizing Compilers

A significant amount of time has been spent understanding how to write C code, how it compiles to assembly, and how the hardware executes it. Now, attention turns to performance. How is this code made to run fast?

There is a harsh reality in systems programming: asymptotic complexity is not everything. Factors of 10x or 100x in runtime performance can be lost simply by constants, coding style, and how data interacts with the hardware architecture.

The Compiler is Your Friend (But a Conservative One)

The compiler is the bridge between logical C code and the physical machine. It generally wants code to run as fast as possible. It is excellent at mapping programs to machines:

  • Register Allocation: It knows exactly how many registers the CPU has and how to keep hot variables inside them.
  • Code Selection: It knows the specific assembly instructions (like leal for arithmetic) that are most efficient.
  • Dead Code Elimination: It ruthlessly deletes calculations that do not affect the final result.

However, the compiler operates under a fundamental constraint: it must not change the observable behavior of the program. If there is even a shadow of doubt that an optimization might alter the output under some edge case, the compiler will refuse to perform it. It assumes the programmer is right, even if the code looks inefficient.

Basic Optimization Techniques

Before looking at what blocks the compiler, it is useful to look at what it can do (or what can be done manually).

1. Code Motion (Precomputation)

If a computation is performed repeatedly but always yields the same result, it should be moved to a place where it is computed less frequently.

Example:

void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++) {
        a[n*i + j] = b[j]; // n*i is computed in every iteration!
    }
}

The expression n*i is loop-invariant. The compiler (or the programmer) can calculate ni = n*i once before the loop starts and just use ni inside. This saves one multiplication per iteration.
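The hoisted version can be sketched as follows (set_row_opt is an illustrative name, not from the lecture):

```c
/* Sketch: the loop-invariant product n*i is hoisted out of the loop. */
void set_row_opt(double *a, double *b, long i, long n) {
    long ni = n * i;          /* computed once, not per iteration */
    for (long j = 0; j < n; j++)
        a[ni + j] = b[j];     /* only an addition and a copy per step */
}
```

With optimization enabled, modern compilers typically perform this hoisting automatically when they can prove `n` and `i` do not change inside the loop.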

2. Strength Reduction

Replace a computationally expensive operation with a simpler one.

  • Multiplication to Shift: 16 * x becomes x << 4.
  • Multiplication to Addition:
    for (i=0; i<n; i++) {
        int offset = n*i; // Multiplication
        ...
    }
    Can be transformed into:
    int offset = 0;
    for (i=0; i<n; i++) {
        ...
        offset += n; // Addition is "weaker" (cheaper)
    }

While modern CPUs have very fast multipliers (3-5 cycles), addition is still faster (1 cycle), so this optimization remains relevant.
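As a complete sketch of the transformation above (row_starts is an illustrative name): filling an array with the row offsets n*i using only additions:

```c
/* Sketch: strength reduction turns out[i] = n*i into a running sum. */
void row_starts(long *out, long n) {
    long offset = 0;              /* n * 0 */
    for (long i = 0; i < n; i++) {
        out[i] = offset;          /* equals n*i, with no multiply */
        offset += n;              /* addition replaces multiplication */
    }
}
```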

3. Common Subexpression Elimination

If the same value is computed multiple times in a block, compute it once and reuse it.

up   = val[(i-1)*n + j];
down = val[(i+1)*n + j]; // i*n + j is computed repeatedly

The compiler detects that i*n + j is common to all the neighbor index calculations (since (i-1)*n + j = i*n + j - n, and similarly for the other neighbors) and computes it once.
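A sketch of the transformed neighbor access, with the shared index factored out (neighbor_sum and its signature are chosen here for illustration):

```c
/* Sketch: sum of the four neighbors of element (i, j) in an n x n grid,
 * with the common subexpression i*n + j computed once. */
double neighbor_sum(const double *val, long i, long j, long n) {
    long inj = i*n + j;              /* common subexpression */
    double up    = val[inj - n];     /* (i-1)*n + j */
    double down  = val[inj + n];     /* (i+1)*n + j */
    double left  = val[inj - 1];     /* i*n + (j-1) */
    double right = val[inj + 1];     /* i*n + (j+1) */
    return up + down + left + right;
}
```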

Optimization Blockers

Despite these capabilities, compilers often fail to optimize obvious inefficiencies. This happens when the compiler cannot prove that the optimization is safe.

Blocker 1: Procedure Calls

The compiler usually treats a function call as a “black box.” It assumes the function might have side effects, modify global state, or depend on specific machine state.

The strlen Disaster: Consider converting a string to lowercase:

void lower(char *s) {
    for (size_t i = 0; i < strlen(s); i++) {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    }
}
  • Complexity: This looks like O(n), but it is actually O(n²).
  • Why? strlen(s) is called every single iteration. strlen itself walks the string to find the null terminator (O(n)).
  • The Fix: Move strlen out of the loop.
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) { ... }
  • Why didn’t the compiler do this? The compiler cannot verify that lower doesn’t change the length of the string. It sees a write to s[i]. It assumes the worst: maybe the null terminator was overwritten? It forces the re-evaluation of strlen to be safe.
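Putting the fix together (lower_fixed is a name used here for the repaired version):

```c
#include <string.h>

/* Sketch: strlen runs once before the loop, so the whole pass is O(n). */
void lower_fixed(char *s) {
    size_t len = strlen(s);               /* length computed once */
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');          /* shift into lowercase range */
    }
}
```

The transformation is safe here because a human can see the writes never touch the null terminator; the compiler, lacking that proof, could not do it on its own.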

Blocker 2: Memory Aliasing

Aliasing happens when two different pointers refer to the same memory location.

Example: Summing rows of matrix A into vector B.

void sum_rows(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j]; // Memory write to b[i] every step!
    }
}

The compiler generates code that reads b[i] from memory, adds a[...], and writes back to memory every iteration. Why not accumulate in a register (like %xmm0) and write once at the end?

The Compiler’s Fear: What if b points into the middle of a? If b[i] is actually an alias for a[i*n + j], then writing to b[i] updates a. The next iteration needs to read that new value. If the compiler optimized this into a register, the program logic would break for the aliased case.

The Fix: Introduce a local variable.

void sum_rows_optimized(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0; // Accumulate in register
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val; // Write memory once
    }
}

The compiler knows local variables cannot alias (unless their address is taken), so it can safely optimize.

The restrict keyword

C99 introduced the restrict qualifier, as in void foo(int *restrict a, int *restrict b). This tells the compiler: “I promise these pointers do not overlap.” It allows aggressive optimization, but if the promise is broken, the behavior is undefined.
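A minimal sketch of restrict in use (scale is an illustrative name): because dst and src are promised not to overlap, the compiler is free to keep values in registers and vectorize the loop:

```c
/* Sketch: the restrict qualifiers promise that dst and src never alias,
 * so the compiler may vectorize or reorder the loads and stores. */
void scale(double *restrict dst, const double *restrict src,
           double c, long n) {
    for (long i = 0; i < n; i++)
        dst[i] = c * src[i];
}
```

Calling scale with overlapping regions would be undefined behavior, which is exactly the price of the promise.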

Advanced Optimization: Blocking (Tiling)

Matrix-Matrix Multiplication (MMM) was previously discussed. Naive triple-loop implementations run slowly because they ignore the Memory Hierarchy.

The Cache Problem

When computing C = A × B, the access patterns are:

  • Matrix A: Row-wise. Spatial locality is good.
  • Matrix B: Column-wise. As the inner loop increments k, the access jumps n * sizeof(double) = 8n bytes to get the next element. This is terrible spatial locality.

If n is large, by the time one element of C is finished, huge chunks of B have been loaded into the cache, evicting everything else. When moving to the next element of C, B is needed again, but it is gone. This causes Cache Thrashing.

The Solution: Blocking

The loops can be restructured to operate on small sub-matrices (tiles) that fit entirely inside the cache.

  1. Load a small block of A into cache.
  2. Load a small block of B into cache.
  3. Compute partial results for a block of C using only this resident data.

Code Transformation: The triple loop becomes a six-level loop (loops over blocks, then loops inside blocks).

/* B is the block (tile) size; a, b, c are the matrices */
for (i = 0; i < n; i += B)
  for (j = 0; j < n; j += B)
    for (k = 0; k < n; k += B)
      /* Inner loops process a B x B tile */
      for (i1 = i; i1 < i+B; i1++)
        for (j1 = j; j1 < j+B; j1++)
          for (k1 = k; k1 < k+B; k1++)
            c[i1][j1] += a[i1][k1] * b[k1][j1];

The Rule: If three B × B blocks fit in the cache simultaneously (roughly 3 · B² · 8 bytes < cache size, for doubles), the working set stays resident. For a 32 KiB L1 cache this gives B ≈ 36, so a power of two like B = 32 is a common choice. This can speed up matrix multiplication by factors of 10x or 20x simply by avoiding RAM access.

Why won’t the compiler do this? Blocking changes the order of addition. Since floating-point addition is not associative, the result would change bit-wise. The compiler is conservative and won’t touch it.
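The non-associativity is easy to demonstrate; this sketch uses the classic 0.1/0.2/0.3 example (fp_assoc_differs is a name chosen here):

```c
/* Sketch: reordering floating-point additions changes the rounding,
 * which is exactly what blocking does to a long sum. */
int fp_assoc_differs(void) {
    double left  = (0.1 + 0.2) + 0.3;   /* one grouping */
    double right = 0.1 + (0.2 + 0.3);   /* the reordered grouping */
    return left != right;               /* 1 on IEEE-754 doubles */
}
```

Flags like gcc's -ffast-math explicitly waive this guarantee and allow the compiler to reassociate floating-point sums.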

Practice: Identifying Optimization Blockers

Correcting such “anti-patterns” in C can lead to massive performance gains without changing the program logic.

Exercise: Identify and Fix

Identify the optimization blocker in the following code and provide a fix.

void aggregate(int *src, int *dest, int n) {
    for (int i = 0; i < n; i++) {
        *dest += src[i] * factor();
    }
}

Answer:

  1. Blocker 1 (Procedure Call): factor() is called every iteration. The compiler doesn’t know if factor() returns a constant or has side effects (like logging or modifying globals).
  2. Blocker 2 (Memory Aliasing): The compiler must write to *dest every iteration because dest might point to the same memory as src[i]. Updating *dest might change the input src[i]!

Optimized Version:

void aggregate_fixed(int *src, int *dest, int n) {
    int f = factor(); // Move call out
    int acc = *dest;  // Accumulate in local variable (register)
    for (int i = 0; i < n; i++) {
        acc += src[i] * f;
    }
    *dest = acc;      // Write to memory once
}

Exercise: Restricted Pointers

How does adding restrict help the following function?

void add(int *restrict a, int *restrict b, int n);

Answer: It tells the compiler that the memory regions pointed to by a and b do not overlap. This allows the compiler to load multiple values into vector registers (SIMD) or reorder accesses without worrying about one write overwriting an input value.


Continue here: 19 Processor Architecture and Performance Optimization