Lecture from: 19.11.2025 | Video: Videos ETHZ

Architecture and Optimization

Discussion has previously focused on how the compiler can rearrange code to speed it up. Techniques like code motion, strength reduction, and blocking for cache locality were examined. However, a wall was hit: the compiler is conservative. It refuses to make optimizations that might change the behavior of the program, even if the programmer knows it would be safe.

To go beyond what the compiler can do, it is necessary to look inside the “black box” of the processor and understand the micro-architecture. Why did the matrix multiplication example from the first lecture run in 8 cycles? Because modern hardware is frighteningly complex, and understanding that complexity allows for code that screams.

Performance Analysis

To improve performance, it must first be measured. A benchmark (a standard program to run) and a metric (a way to quantify the result) are needed.

There is a classic split in performance metrics:

  1. Throughput: How much work can be done in a unit of time? (e.g., requests per second).
  2. Latency: How long does it take to get a single result back? (e.g., response time).

Be Suspicious of "Fast"

When someone says a system is “fast,” always ask: “Do you mean high throughput or low latency?” They are often at odds with one another.

For the purposes of this chapter, numerical computations are the focus. A metric called CPE (Cycles Per Element) is defined.

The Linear Performance Model

The total execution time T of a program operating on a vector of length n can be modeled as a linear function, T = CPE · n + Overhead, where:

  • Slope (CPE): The average number of cycles needed to process one element. This represents the efficiency of the inner loop.
  • Intercept (Overhead): The fixed cost of setting up loops, calling functions, etc.

The goal is to minimize the CPE.
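
To make the model concrete, CPE can be estimated empirically as the slope of measured cycles versus n. The following is a rough, self-contained sketch (not from the lecture): it assumes an x86 machine with GCC or Clang so that __rdtsc() is available, and uses a toy summation loop as a stand-in for the benchmark routine.

/* Sketch (assumptions noted above): estimate CPE as the slope of cycles vs. n,
   which cancels the fixed overhead term. */
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() on x86 with GCC/Clang */

static double toy_combine(const double *d, long n) {
    double t = 0.0;                    /* stand-in for the real combine routine */
    for (long i = 0; i < n; i++)
        t += d[i];
    return t;
}

int main(void) {
    enum { N1 = 1000, N2 = 2000 };
    static double d[N2];
    for (long i = 0; i < N2; i++)
        d[i] = 1.0;

    unsigned long long t0 = __rdtsc();
    volatile double r1 = toy_combine(d, N1);
    unsigned long long t1 = __rdtsc();
    volatile double r2 = toy_combine(d, N2);
    unsigned long long t2 = __rdtsc();

    /* slope = (cycles for the N2 run - cycles for the N1 run) / (N2 - N1) */
    double cpe = (double)((t2 - t1) - (t1 - t0)) / (double)(N2 - N1);
    printf("estimated CPE ~= %.2f (results: %f, %f)\n", cpe, r1, r2);
    return 0;
}

In practice one would warm the cache and take the best of several runs; the sketch only illustrates how taking the slope cancels the fixed overhead.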

The Benchmark: Vector Combination

A simple function combine is used to demonstrate optimization. It iterates over a vector and aggregates the result (sum or product) using a specific operation (OP) and an identity value (IDENT).

Data Structures:

/* data structure for vectors */
struct vec {
    size_t len;
    data_t *data;
};
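
The helper functions used by the combine variants below are not shown in the lecture; plausible implementations (an assumption, with data_t being whatever element type the benchmark is instantiated with) look roughly like this:

/* Plausible helper implementations (assumed, not shown in the lecture) */
size_t vec_length(struct vec *v) {
    return v->len;
}

data_t *get_vec_start(struct vec *v) {
    return v->data;                    /* raw pointer to the element array */
}

int get_vec_element(struct vec *v, size_t idx, data_t *val) {
    if (idx >= v->len)                 /* bounds check on every access */
        return 0;
    *val = v->data[idx];
    return 1;
}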

Baseline Implementation (combine1):

void combine1(struct vec *v, data_t *dest) {
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) { // Optimization Blocker 1: Function call in loop condition
        data_t val;
        get_vec_element(v, i, &val);      // Optimization Blocker 2: Bounds checking every time
        *dest = *dest OP val;             // Optimization Blocker 3: Memory aliasing on dest
    }
}

As discussed in the previous chapter, combine1 is slow because of optimization blockers. Basic optimizations can be manually applied to create combine4.

Optimized Baseline (combine4):

void combine4(struct vec *v, data_t *dest) {
    long i;
    long length = vec_length(v);       // Move length call out
    data_t *d = get_vec_start(v);      // Direct data access
    data_t t = IDENT;                  // Local accumulator (register)
    
    for (i = 0; i < length; i++)
        t = t OP d[i];                 // The core operation
        
    *dest = t;
}

Going from combine1 to combine4 yields a roughly 2x to 10x speedup depending on the data type. But a wall is reached: for floating-point operations, combine4 stabilizes at 3.0-5.0 CPE (about 5.0 for double-precision multiplication). To go faster, understanding the hardware is required.

Modern Processor Design

How does a processor actually execute instructions?

1. Sequential Processor

The simplest model is the sequential processor. It executes instructions strictly one at a time: Fetch → Decode → Execute → Memory → Write Back → Update PC.

  • Benefit: Easy to build, correct by design.
  • Drawback: Horribly slow. Most hardware units sit idle most of the time. The clock speed is limited by the time it takes for a signal to propagate through all stages.

2. Pipelined Processor

To speed this up, pipelining (like an assembly line) is used. Execution is broken into stages separated by pipeline registers.

  • Pros: Increased throughput. One instruction can be finished every cycle.
  • Cons: Hazards.
    • Data Hazards: Instruction B needs a result from Instruction A, but A has not finished yet. The pipeline must be “stalled” or data forwarding used.
    • Control Hazards: The result of a branch (taken/not taken) is unknown until late in the pipeline. A guess must be made (branch prediction) and the pipeline flushed if wrong.

3. Superscalar Processor (The Modern Era)

Modern CPUs (Intel Core, AMD Ryzen, etc.) are Superscalar. They can issue and execute multiple instructions per clock cycle.

The processor maintains an illusion of sequential execution. To the programmer, it looks like instructions run one after another. Under the hood, the hardware is grabbing a window of instructions (e.g., 100 at a time), analyzing their dependencies, and executing as many as possible in parallel.

Key Components:

  • Instruction Control Unit: Fetches instructions, decodes them into micro-operations, and throws them into a pool.
  • Execution Unit: Contains multiple Functional Units (e.g., 2 Load units, 4 Integer ALUs, 2 Floating Point Multipliers).
  • Retirement Unit: Reorders the results to ensure they are committed to registers/memory in the correct original order.

The Physics of Clock Speed

Why can’t the clock speed just be cranked up? Clock distribution is hard. In the DEC Alpha days (150 MHz), the clock was driven by a single transistor folded back and forth across the chip, measuring over 3 meters in total! Physics limits how fast signals can be synchronized across the die. Therefore, the goal is more work per cycle (parallelism) rather than just faster cycles.

Latency vs. Throughput Bounds

To optimize combine4, the specific characteristics of the functional units (e.g., Intel Haswell) must be examined.

Two numbers matter for every arithmetic operation:

  1. Latency: The time it takes to perform the computation (e.g., 5 cycles for Double FP Multiply).
  2. Issue Time (Throughput): The minimum number of cycles between starting two consecutive, independent operations of the same kind.

| Instruction      | Latency (Cycles) | Cycles/Issue (Throughput) |
|------------------|------------------|---------------------------|
| Integer Add      | 1                | 1                         |
| Integer Multiply | 3                | 1                         |
| FP Add           | 3                | 1                         |
| FP Multiply      | 5                | 1                         |
| FP Divide        | 3-15             | 3-15                      |

The Pipelining Effect: Notice that FP Multiply takes 5 cycles to finish, but a new one can be started every 1 cycle. The multiplier is fully pipelined.

  1. Latency Bound: If the result of the previous operation must be waited for before starting the next (sequential dependency), performance is limited by Latency.
  2. Throughput Bound: If operations are independent, performance is limited by Cycles/Issue.

Analyzing combine4:

The core loop is: t = t * d[i]. To compute the t for the next iteration, the result of the multiplication from the current iteration is needed. This is a Read-After-Write (RAW) data dependency.

  • The CPU cannot use the pipelined nature of the multiplier. It issues the multiply, waits 5 cycles for the result, then issues the next one.
  • Result: combine4 is Latency Bound. The measured CPE is ~5.0, matching the latency of the multiplier.

Breaking the Latency Bound

To go faster, the sequential dependency chain must be broken so the processor can utilize its parallel execution units.

Method 1: Loop Unrolling

Multiple elements are processed per iteration to reduce loop overhead.

// Unrolling by 2 (limit = length - 1; any leftover element is handled after the loop)
for (i = 0; i < limit; i += 2) {
    x = (x OP d[i]) OP d[i+1];
}

Effect: This helps slightly with operations like Integer Addition (latency 1) by reducing loop index overhead. However, for FP Multiplication, the dependency chain is ((x * d[i]) * d[i+1]). The first multiply must still finish before doing the second. It does not break the latency bound.
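
For completeness, here is a fuller sketch of the 2x-unrolled version; the function name combine5 and the cleanup loop for odd-length vectors are illustrative, only the inner statement is taken from above.

/* Sketch: combine4 unrolled by 2, with a cleanup loop for odd lengths */
void combine5(struct vec *v, data_t *dest) {
    long i;
    long length = vec_length(v);
    long limit = length - 1;           /* highest index the 2-wide body may start at */
    data_t *d = get_vec_start(v);
    data_t x = IDENT;

    for (i = 0; i < limit; i += 2)     /* combine two elements per iteration */
        x = (x OP d[i]) OP d[i+1];

    for (; i < length; i++)            /* finish any remaining element */
        x = x OP d[i];

    *dest = x;
}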

Method 2: Reassociation

The parentheses in the calculation can be changed.

Original: x = (x OP d[i]) OP d[i+1]
Reassociated: x = x OP (d[i] OP d[i+1])

Why does this help?

  • d[i] OP d[i+1] depends only on the array data. It does not depend on the accumulator x.
  • The CPU can execute d[i] * d[i+1] in parallel with the multiplication of x from the previous iteration.
  • By the time x is ready to be multiplied, the result of d[i]*d[i+1] is also ready.

Result: The strict sequential chain is broken. The CPE drops to near the theoretical throughput bound (e.g., ~1.0 for Int Mult, ~1.5 for FP).
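
A sketch of the corresponding inner loop (same unrolling and cleanup as the unrolled sketch above; only the parenthesization changes):

for (i = 0; i < limit; i += 2)
    x = x OP (d[i] OP d[i+1]);         /* d[i] OP d[i+1] does not depend on x */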

The Danger of Reassociation

Floating-point arithmetic is not associative. (a + b) + c is not always bit-wise identical to a + (b + c) due to rounding. The compiler will almost never do this optimization automatically. It refuses to change the numerical result of the program. The programmer must explicitly write the code this way if the slight numerical deviation is acceptable.

Method 3: Separate Accumulators

Multiple independent streams of computation can be explicitly maintained.

// 2-way unrolling with 2 accumulators
data_t x0 = IDENT;
data_t x1 = IDENT;
for (i = 0; i < limit; i+=2) {
    x0 = x0 OP d[i];
    x1 = x1 OP d[i+1];
}
// ... finish remaining elements ...
*dest = x0 OP x1;

Effect: x0 and x1 are independent. The CPU can push operations for x0 and x1 into the pipelines simultaneously.

  • With 2 accumulators, the CPE is effectively cut in half (e.g., from 5.0 to 2.5).
  • This can be scaled up (e.g., 10 accumulators) until the Throughput Bound of the functional units is hit.

Practice: Latency vs. Throughput

Calculations involving cycles and parallelism are key to performance tuning.

Exercise: Theoretical CPE

Consider a processor with 2 FP Multipliers. Each multiplier has a Latency of 4 cycles and is fully pipelined (issue rate of 1). A programmer writes a loop that multiplies 1000 numbers into a single accumulator (acc = acc * d[i]).

  1. What is the CPE of the naive loop?
    • Answer: 4.0. The loop is latency-bound because each multiply depends on the previous one.
  2. If the loop is unrolled by 2 with 2 separate accumulators, what is the CPE?
    • Answer: 2.0. Each accumulator still takes 4 cycles per result, but since they are independent, 2 results finish every 4 cycles, i.e., 2 cycles per element.
  3. What is the maximum achievable CPE (Throughput Bound) on this hardware?
    • Answer: 0.5. With 2 multipliers, each able to start a new operation every cycle, the hardware can theoretically finish 2 multiplications per cycle.

Exercise: Reassociation Precision

Question: Why does the compiler optimize (a + 5) + b to (a + b) + 5 for integers but NOT (f1 + 5.0) + f2 to (f1 + f2) + 5.0 for floating point?

Answer: Integer addition is perfectly associative, so reordering cannot change the result. Floating-point addition is not: (f1 + 5.0) + f2 may round differently than (f1 + f2) + 5.0. The compiler preserves the exact numerical result by default.
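
A tiny self-contained demo of this non-associativity (the values are chosen for illustration and are not from the lecture):

/* Floating-point addition is not associative: reordering changes the result */
#include <stdio.h>

int main(void) {
    double a = 1e16, b = -1e16, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);   /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c));   /* prints 0: c is lost when added to b */
    return 0;
}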

The Ultimate Limit: Throughput Bound

If unrolling and separate accumulators are used enough, the hardware is eventually saturated.

  • Integer Addition: Haswell has 4 ALU units but only 2 Load units. The bottleneck becomes loading data from memory. Bound: 0.5 CPE.
  • FP Multiplication: Haswell has 2 FP Multipliers. Bound: 0.5 CPE.

Using these techniques, a 42x speedup over the original unoptimized code can be achieved.

Beyond Scalar: Vectorization (SIMD)

It is possible to go even faster. The functional units discussed (like the FP multiplier) are actually vector units. They can perform operations on 256-bit registers (%ymm0 - %ymm15).

  • A single AVX2 instruction (vmulpd) performs 4 double-precision multiplies at once.
  • Latency: 5 cycles. Throughput: 1 cycle (pipelined).
  • Vector Throughput Bound: 0.5 cycles / 4 operations = 0.125 CPE.

By writing vectorized code (using intrinsics or compiler auto-vectorization), data can be processed in chunks of 4 or 8.
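
As an illustration of what such code can look like, here is a minimal AVX intrinsics sketch of a vectorized product with a single vector accumulator (assumptions: x86-64, compiled with -mavx; this is not the lecture's code):

/* Sketch: product of n doubles using 4-wide vector multiplies (vmulpd) */
#include <immintrin.h>

double vec_product(const double *d, long n) {
    __m256d acc = _mm256_set1_pd(1.0);       /* 4 partial products, initialized to the identity */
    long i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(d + i);  /* load 4 doubles */
        acc = _mm256_mul_pd(acc, v);         /* one vmulpd = 4 multiplies */
    }
    double lane[4];
    _mm256_storeu_pd(lane, acc);             /* reduce the 4 lanes */
    double p = lane[0] * lane[1] * lane[2] * lane[3];
    for (; i < n; i++)                        /* scalar cleanup */
        p *= d[i];
    return p;
}

Note that with only one vector accumulator the loop is still latency-bound across vector multiplies; combining this with multiple accumulators (Method 3) is what pushes the CPE toward the 0.125 bound.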

Summary of Haswell Matrix Multiply: The “insane” 8-cycle performance for a 4x4x4 matrix multiply mentioned earlier is achieved by combining three techniques (sketched below, after the list, on a simpler dot product):

  1. Using Vector Instructions (doing 4 ops at once).
  2. Using Fused Multiply-Add (FMA) instructions (doing multiply and add in one step).
  3. Using Separate Accumulators (keeping all vector registers busy).
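
The following sketch combines the same three ideas in a simpler setting, a dot product rather than the matrix multiply (assumptions: x86-64 with AVX2 and FMA, compiled with -mavx2 -mfma; illustrative only, not the kernel from the lecture):

/* Sketch: dot product using vector instructions, FMA, and two accumulators */
#include <immintrin.h>

double dot(const double *a, const double *b, long n) {
    __m256d acc0 = _mm256_setzero_pd();      /* two independent accumulator chains */
    __m256d acc1 = _mm256_setzero_pd();
    long i;
    for (i = 0; i + 8 <= n; i += 8) {
        /* each vfmadd does 4 multiplies and 4 adds in one instruction */
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),     _mm256_loadu_pd(b + i),     acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i + 4), _mm256_loadu_pd(b + i + 4), acc1);
    }
    double lane[4];
    _mm256_storeu_pd(lane, _mm256_add_pd(acc0, acc1));
    double s = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; i++)                        /* scalar cleanup */
        s += a[i] * b[i];
    return s;
}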

Summary: How to Optimize

  1. Optimize Algorithms: a better algorithm beats everything else.
  2. Compiler Flags: Use -O3.
  3. Eliminate Blockers: Use local variables to avoid memory aliasing; remove function calls from loops.
  4. Exploit ILP:
    • Unroll loops.
    • Use multiple accumulators.
    • Reassociate operations (if precision allows).
  5. Vectorize: Use SIMD instructions for data-parallel tasks.

Continue here: 20 Principles of Cache Memory