Lecture from: 25.11.2025 | Video: Videos ETHZ

Caches

A critical juncture in computer architecture has been reached. Processors have been successfully pipelined, executing multiple instructions per cycle, and are extremely fast. However, a historical problem persists: The Processor-Memory Bottleneck.

While processor performance has doubled roughly every 18 months (the Moore’s Law era), the bandwidth and latency of the memory bus have evolved much more slowly.

  • The CPU: Can process roughly 512 Bytes/cycle (e.g., Haswell with AVX).
  • The Memory: Bandwidth is only ~10 Bytes/cycle, and the latency to fetch data is around 100 cycles.

If the CPU has to wait 100 cycles every time it needs a byte, it does not matter how fast the ALU is; the CPU will spend 99% of its time idle. The solution is the Cache.

The Core Concept: Locality

Caches work because of a fundamental property of computer programs called Locality. Programs do not access memory randomly; they exhibit specific patterns.

  1. Temporal Locality: If an item was referenced recently, it will likely be referenced again soon. Examples include loop counters, accumulators, and common variables.
  2. Spatial Locality: If an item is referenced, items with nearby addresses will likely be referenced soon. Examples include iterating through an array or sequential instruction execution.
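
As a small illustration (a minimal sketch, not code from the lecture), the following loop exhibits both kinds of locality at once:

/* Summing an array shows both kinds of locality. */
double sum_array(const double *a, int n) {
    double sum = 0.0;               /* temporal: sum is reused every iteration */
    for (int i = 0; i < n; i++)     /* temporal: i is reused every iteration   */
        sum += a[i];                /* spatial: a[0], a[1], ... are adjacent   */
    return sum;
}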

A memory hierarchy is built based on technology and cost:

  • SRAM (Static RAM): Fast, expensive, low density. Used for Caches.
  • DRAM (Dynamic RAM): Slower, cheaper, high density. Used for Main Memory.
  • Disk/Flash: Slowest, cheapest, massive density.

Cache Mechanics

A cache is a small, fast storage buffer between the CPU and Main Memory. It holds a subset of the data from main memory.

Terminology

  • Block (or Line): The fixed-size unit of data transfer between memory and cache (typically 64 bytes). Whole blocks are fetched, not individual bytes, to exploit spatial locality.
  • Hit: The data requested is found in the cache.
  • Miss: The data is not in the cache. It must be fetched from main memory.
  • Placement Policy: Specifies where a block coming from memory should be placed.
  • Replacement Policy: Specifies which block should be evicted if the cache is full.

Performance Metrics

Cache performance is analyzed using the Average Memory Access Time (AMAT): AMAT = Hit Time + Miss Rate × Miss Penalty.

  • Hit Time: Time to deliver a line from cache (e.g., 1-2 cycles for L1, 5-20 for L2).
  • Miss Rate: Fraction of accesses that miss (e.g., 3-10% for L1).
  • Miss Penalty: Additional time required to fetch from memory (e.g., 50-200 cycles).

The Power of Miss Rates

Consider a cache with a 1 cycle hit time and 100 cycle miss penalty.

  • 97% Hit Rate (3% Miss): AMAT = 1 + 0.03 × 100 = 4 cycles.
  • 99% Hit Rate (1% Miss): AMAT = 1 + 0.01 × 100 = 2 cycles.

Reducing the miss rate by just two percentage points halves the AMAT (from 4 to 2 cycles), doubling average memory performance. This is why miss rates, rather than hit rates, are the primary focus.

The Four C’s: Types of Misses

  1. Cold (Compulsory) Miss: The first time a block is accessed, it is not there. It must be fetched.
  2. Conflict Miss: The cache has enough space overall, but multiple blocks map to the same slot due to the placement policy (e.g., strict modulo mapping), forcing an eviction.
  3. Capacity Miss: The set of active blocks (the working set) is simply larger than the cache. It physically cannot fit.
  4. Coherency Miss: (Relevant in multiprocessors) Another core has updated the data, invalidating the local copy.

Cache Organization

How is data found in the cache? The whole thing is not scanned; the memory address itself is used to index into it. A cache is characterized by the triple (S, E, B):

  • S = 2^s: Number of Sets.
  • E: Number of Lines per Set (Associativity).
  • B = 2^b: Block Size (Bytes).

The m-bit memory address is split into three parts:

  1. Set Index (s bits): Selects which “row” (set) of the cache to look in.
  2. Tag (t = m − s − b bits): Used to verify whether the line in that set actually corresponds to the requested address.
  3. Block Offset (b bits): Selects the specific byte(s) within the data block.

The idea is: Address -> Set -> Tag match -> Block -> Block offset -> Bytes
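
A minimal C sketch of this decomposition, with the field widths s and b passed in as parameters (the struct and function names are illustrative, not from the lecture):

#include <stdint.h>

/* Split an address into (tag, set index, block offset) for a cache */
/* with 2^s sets and 2^b-byte blocks.                               */
typedef struct { uint64_t tag, set, offset; } addr_fields;

addr_fields split_address(uint64_t addr, unsigned s, unsigned b) {
    addr_fields f;
    f.offset = addr & ((1ULL << b) - 1);         /* low b bits          */
    f.set    = (addr >> b) & ((1ULL << s) - 1);  /* next s bits         */
    f.tag    = addr >> (s + b);                  /* remaining high bits */
    return f;
}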

Direct Mapped Cache (E = 1)

Each memory address maps to exactly one specific line in the cache.

  • Mapping: Set index = (Address / B) mod S, i.e., the block number modulo the number of sets.
  • Lookup: Index into the selected set. Compare the stored Tag with the address Tag; if they match and the Valid Bit is set, it is a hit.
  • Problem: If two addresses that map to the same set are accessed alternately, thrashing (constant conflict misses) occurs, even if the rest of the cache is empty (see the sketch below).
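
A hypothetical sketch of such thrashing: assume a direct mapped cache and two arrays whose starting addresses happen to lie exactly a multiple of the cache size apart, so x[i] and y[i] always map to the same set.

/* Worst-case layout (assumed): &y[0] - &x[0] is a multiple of the cache */
/* size, so x[i] and y[i] compete for the same set on every iteration.   */
double dot(const double *x, const double *y, int n) {
    double d = 0.0;
    for (int i = 0; i < n; i++)
        d += x[i] * y[i];  /* each access evicts the block the other array needs next */
    return d;
}

Every access then misses, even though most of the cache remains unused.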

Set Associative Cache (E > 1)

Each memory address maps to a specific set, but within that set, it can go into any of the lines.

  • Example: 2-way associative (E = 2).
  • Lookup: Go to the set. Check all tags in that set in parallel using hardware comparators.
  • Replacement: If the set is full, a victim must be chosen to evict. A common policy is LRU (Least Recently Used).
  • Benefit: Reduces conflict misses significantly compared to direct mapped.
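
A simplified software model of this lookup and eviction (a sketch only: it assumes per-line timestamps for exact LRU, whereas real hardware uses parallel comparators and approximated LRU bits):

#include <stdbool.h>
#include <stdint.h>

#define E 2                          /* lines per set: 2-way associative */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;              /* timestamp for LRU replacement */
} cache_line;

typedef struct { cache_line lines[E]; } cache_set;

/* Returns true on a hit; on a miss, evicts the LRU line and installs the tag. */
bool access_set(cache_set *set, uint64_t tag, uint64_t now) {
    int victim = 0;
    for (int i = 0; i < E; i++) {
        if (set->lines[i].valid && set->lines[i].tag == tag) {
            set->lines[i].last_used = now;       /* hit: refresh LRU info */
            return true;
        }
        if (set->lines[i].last_used < set->lines[victim].last_used)
            victim = i;                          /* track least recently used */
    }
    for (int i = 0; i < E; i++)                  /* prefer an empty line */
        if (!set->lines[i].valid) { victim = i; break; }
    set->lines[victim].valid = true;
    set->lines[victim].tag = tag;
    set->lines[victim].last_used = now;
    return false;
}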

Detailed Analysis: Matrix Traversal

Let’s look at a concrete example of locality to understand the difference between cache organizations. Consider summing a 16 × 16 array of doubles.

  • Assumption: Cache block size = 32 bytes (holds 4 doubles).
  • State: Cache is cold (empty).
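
The two traversal orders analyzed below might look like this (a minimal sketch; the exact signatures from the lecture are assumed):

#define N 16

/* Row-major traversal: walks memory sequentially, block by block. */
double sum_rows(double a[N][N]) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: strides N doubles (128 bytes) per access. */
double sum_cols(double a[N][N]) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}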

Scenario 1: Row-Major Traversal (sum_rows)

Accesses proceed row by row, in memory order: a[0][0], a[0][1], a[0][2], …

  1. Access a[0][0]: Miss (Cold). CPU fetches block containing a[0][0] through a[0][3].
  2. Access a[0][1]: Hit (Loaded by previous miss).
  3. Access a[0][2]: Hit.
  4. Access a[0][3]: Hit.
  5. Access a[0][4]: Miss. Fetch next block.
  • Result: 1 Miss followed by 3 Hits.
  • Miss Rate: 1/4 = 25%. This code exhibits excellent Spatial Locality.

Scenario 2: Column-Major Traversal (sum_cols)

Accesses proceed down each column: a[0][0], a[1][0], a[2][0], …, a[15][0], then a[0][1], …

  • In memory, a[0][0] is far away from a[1][0]. They are separated by 16 doubles (128 bytes).
  • Mapping: If the cache is small, these addresses map to specific sets.
    • a[0][0] maps to Set 0.
    • a[1][0] (128 bytes later) maps to Set 4.
    • Eventually, a[X][0] will map to Set 0 again (wrapping around).
  • The Conflict: The block for a[0][0] is loaded, but the very next access jumps to a[1][0], which loads another block. By the time the column is finished and the program returns to a[0][1] (the neighbor of the first access), the original block has long been evicted.
  • Result: Every single access is a miss.
  • Miss Rate: 100%.

Does Associativity Help?

If a 2-way associative cache is used, 2 blocks can be stored in Set 0.

  • a[0][0] (Set 0) is loaded.
  • Later a[X][0] (Set 0) is loaded. Both can be kept.
  • However: The column loop iterates through all 16 rows. All 16 blocks need to be stored to avoid eviction before returning to the start. A 2-way cache is not enough; a[0][0] will eventually be evicted before returning to it.
  • Conclusion: Associativity helps with conflicts, but it cannot solve Capacity issues where the working set (the whole column) is simply too big for the cache.

Cache Writes

Writing is trickier than reading because the cache and main memory must be kept consistent.

Scenario 1: Write Hit (Data is in cache)

  • Write-Through: Write to cache and immediately write to main memory.
    • Pro: Simple, memory is always consistent.
    • Con: Slow (memory bus traffic is the bottleneck).
  • Write-Back: Write only to cache. Set a Dirty Bit to 1. Write to memory only when this block is evicted.
    • Pro: Fast, filters multiple writes to same location.
    • Con: Complex, memory is temporarily inconsistent.

Scenario 2: Write Miss (Data is not in cache)

  • Write-Allocate: Load the block into cache, then update it. (Usually paired with Write-Back). This anticipates future locality.
  • No-Write-Allocate: Just write directly to memory, do not bring into cache. (Usually paired with Write-Through).

Modern Standard

Most modern CPUs (like Intel Core i7) use Write-Back + Write-Allocate for data caches.
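
A sketch of how these two policies combine for a single cache line, in simulator style (structure and helper names are assumptions; the actual memory transfers are left as comments):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;    /* set on write; block is written back only on eviction */
    uint64_t tag;
    /* data bytes omitted in this sketch */
} line_t;

/* Write-Back + Write-Allocate handling of a store that maps to this line. */
void handle_write(line_t *line, uint64_t tag) {
    if (line->valid && line->tag == tag) {
        line->dirty = true;                /* write hit: update cache only */
        return;
    }
    if (line->valid && line->dirty) {
        /* write_back_to_memory(line);        evict: flush the dirty block first */
    }
    /* fetch_block_from_memory(tag);          write-allocate: bring the block in */
    line->valid = true;
    line->tag   = tag;
    line->dirty = true;                    /* then perform the write in the cache */
}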

Real World Hierarchies

  • L1 Cache: Often split into Instruction Cache (I-Cache) and Data Cache (D-Cache). This prevents structural hazards where the CPU wants to fetch an instruction and read data in the same cycle.
  • L2/L3 Caches: Usually Unified (store both instructions and data).
  • Private vs Shared: L1/L2 are usually private to a core; L3 is often shared among all cores.
  • Inclusive vs Exclusive:
    • Inclusive: If data is in L1, it must also be in L2. Simplifies coherence but wastes space.
    • Exclusive: If data is in L1, it is not in L2. Maximizes capacity but complicates eviction logic.

Software Caches

Focus has mostly been on hardware caches (L1, L2, L3) which are implemented with transistors and wires. However, the concept of caching is universal. It applies whenever there is a slow, large storage medium and a fast, small buffer.

As the hierarchy is descended, the nature of the cache changes significantly because the Miss Penalty grows by orders of magnitude.

The Cost of a Miss

Compare the penalties:

  • Hardware Cache Miss (L1 to RAM): ~100 cycles.
  • Software Cache Miss (RAM to Disk/Network): ~1,000,000 to ~1,000,000,000 cycles.

Because the penalty for missing in a software cache (like a file system buffer or a web browser cache) is so astronomical (millions of cycles), the design trade-offs flip completely compared to hardware caches.

  1. Associativity: Software caches are almost always Fully Associative.
    • Hardware: Set-associativity (index bits) is used because the answer is needed in 1-2 cycles. Searching the whole cache is too slow.
    • Software: There are millions of cycles to spare before the cost of a disk read is incurred, so it is worth spending thousands of cycles on a hash-table lookup to guarantee the data is found if it is present. Conflict misses are simply not tolerated at this level (see the sketch after this list).
  2. Replacement Policy:
    • Hardware: Simple policies (like approximated LRU) implemented in logic gates.
    • Software: Sophisticated algorithms. The OS tracks file access patterns over seconds or minutes to make intelligent eviction decisions.
  3. Granularity:
    • Hardware: 64-byte blocks.
    • Software: 4KB pages (Virtual Memory), entire files, or complete web pages (Browser cache).
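
To make the contrast concrete, a toy software cache might look like the sketch below: fully associative, exact LRU, and page-sized entries. A plain linear search stands in for the hash-table lookup mentioned above; all names and sizes are illustrative assumptions.

#include <stdint.h>

#define ENTRIES 256                 /* fully associative: any key may go anywhere */

typedef struct {
    uint64_t key;                   /* e.g., a file block number */
    uint64_t last_used;             /* exact LRU bookkeeping is affordable here */
    char     data[4096];            /* software granularity: a whole 4 KB page  */
    int      valid;
} sw_entry;

static sw_entry cache[ENTRIES];
static uint64_t ticks;

/* Look up a key; on a miss, evict the true LRU entry and (conceptually) */
/* refill it from the slow backing store.                                */
char *sw_cache_lookup(uint64_t key) {
    int victim = 0;
    ticks++;
    for (int i = 0; i < ENTRIES; i++) {
        if (cache[i].valid && cache[i].key == key) {
            cache[i].last_used = ticks;              /* hit */
            return cache[i].data;
        }
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;                              /* track the LRU victim */
    }
    /* Miss: spending thousands of cycles here is fine, because the disk */
    /* read below costs millions.                                        */
    /* read_from_disk(key, cache[victim].data);      assumed helper      */
    cache[victim].valid = 1;
    cache[victim].key = key;
    cache[victim].last_used = ticks;
    return cache[victim].data;
}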

Advanced Optimization: Blocking

The Matrix-Matrix Multiplication (MMM) example (C = A × B) is revisited to demonstrate how code can be rewritten to exploit cache locality. This technique is known as Blocking or Tiling.

The Problem with Naive Matrix Multiplication

Consider the standard triple-loop implementation:

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)         // Iterate rows of C
        for (j = 0; j < n; j++)     // Iterate columns of C
            for (k = 0; k < n; k++) // Dot product iteration
                c[i*n+j] += a[i*n + k] * b[k*n + j];
}

Let’s analyze the cache performance of the inner loop (variable k) which computes one element of C.

  • Assumptions:
    • Matrix elements are doubles (8 bytes).
    • Cache Block size is 64 bytes (holds 8 doubles).
    • Matrix size is huge (much larger than the cache).
    • Cache is cold (empty) at the start.

Access Pattern Analysis

  1. Matrix A (a[i*n + k]):

    • As k increments, access is a[i][0], a[i][1], a[i][2], …
    • These are contiguous in memory (Row-Major order).
    • Miss Rate: One miss is incurred to bring in a block of 8 doubles, then hits occur for the next 7 accesses. Miss rate = 1/8 = 12.5%.
    • Misses per inner loop: n/8.
  2. Matrix B (b[k*n + j]):

    • As k increments, access is b[0][j], b[1][j], b[2][j], …
    • Movement is down a column. In memory, b[0][j] is separated from b[1][j] by n × 8 bytes (one full row width).
    • Miss Rate: Because n is large, every access jumps to a memory address far from the previous one. Each access lands in a different cache block.
    • Misses per inner loop: n (every access is a miss).

Total Misses (Naive)

Per element of C, the inner loop incurs n/8 + n = (9/8) · n misses, i.e., roughly (9/8) · n³ misses in total. This is highly inefficient: every element of matrix B is loaded from RAM n times!

Practice: Cache Address Math

Understanding how a 64-bit address is partitioned into Tag, Index, and Offset bits is a core systems programming skill.

Exercise: Address Breakdown

A system has a 32KB cache that is 4-way associative with 64-byte blocks. The total physical address space is 48 bits. Break down the address bits into Tag, Set Index, and Block Offset.

Solution:

  1. Block Offset (b): Block size is 64 bytes (2^6). So, 6 bits.
  2. Set Index (s):
    • Total lines in cache = 32 KB / 64 B = 512 lines.
    • Number of sets (S) = 512 lines / 4 ways = 128 sets.
    • Index bits = log2(128) = 7.
  3. Tag (t):
    • Tag bits = Total bits − Index bits − Offset bits.
    • Tag bits = 48 − 7 − 6 = 35.
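
A throwaway sketch to double-check this arithmetic (the exercise parameters are hard-coded):

#include <stdio.h>

int main(void) {
    unsigned cache_bytes = 32 * 1024;   /* 32 KB cache        */
    unsigned block_bytes = 64;          /* 64-byte blocks     */
    unsigned ways        = 4;           /* 4-way associative  */
    unsigned addr_bits   = 48;          /* physical address   */

    unsigned sets = cache_bytes / (block_bytes * ways);   /* 128 sets */
    unsigned b = 0, s = 0;
    while ((1u << b) < block_bytes) b++;                  /* b = 6 */
    while ((1u << s) < sets)        s++;                  /* s = 7 */
    unsigned t = addr_bits - s - b;                       /* t = 35 */

    printf("offset = %u bits, index = %u bits, tag = %u bits\n", b, s, t);
    return 0;
}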

Exercise: AMAT Logic

Given a processor with:

  • L1 Hit Time: 1 cycle
  • L2 Hit Time: 10 cycles
  • L1 Miss Rate: 10%
  • L2 Miss Rate (Global): 2%
  • Main Memory Penalty: 100 cycles

Calculate the AMAT. Answer: AMAT = 1 + 0.10 × 10 + 0.02 × 100 = 1 + 1 + 2 = 4 cycles. (Note: Be careful with Local vs Global miss rates! The 2% here is global, i.e., measured against all accesses; the corresponding local L2 miss rate would be 0.02 / 0.10 = 20%.)

The Solution: Blocking

The goal of blocking is to reorganize the computation so that work is done on small sub-matrices (tiles) that fit entirely inside the fast L1 cache. A tile is loaded, the data is reused as much as possible, and then it is discarded.

Concept

Instead of computing one full element of C at a time (which requires a full row of A and a full column of B), a small sub-matrix of C is computed. To do this, a sub-matrix of A is multiplied by a sub-matrix of B, and the result is accumulated.

Blocked Code

The 3 loops are changed into 6 loops. The outer 3 loops iterate over the blocks (stepping by B), and the inner 3 loops perform the multiplication within the blocks.

/* Blocked (tiled) MMM; assumes n is a multiple of the block size B */
void mmm_blocked(double *a, double *b, double *c, int n, int B) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i+=B) {          // Block Row of C
        for (j = 0; j < n; j+=B) {      // Block Col of C
            for (k = 0; k < n; k+=B) {  // Block position in A and B

                /* Inner loops operate on B x B blocks */
                /* These fit entirely in L1 Cache! */
                for (i1 = i; i1 < i+B; i1++) {
                    for (j1 = j; j1 < j+B; j1++) {
                        for (k1 = k; k1 < k+B; k1++) {
                            c[i1*n+j1] += a[i1*n + k1] * b[k1*n + j1];
                        }
                    }
                }
            }
        }
    }
}

Cache Miss Analysis (Blocked)

Let’s derive the new miss count.

  • Block Size B: chosen such that three B × B blocks (one each from A, B, and C) fit in the cache simultaneously.
    • Constraint: 3 · B² doubles (i.e., 3 · B² · 8 bytes) must fit in the cache.

Step-by-Step Derivation

  1. Outer Loops: The three outer loops together execute (n/B)³ block iterations.
  2. Per Block Iteration: Inside the k loop, one B × B block of A and one B × B block of B are loaded.
    • Misses for Block A: The block has B² doubles. Miss once for every 8 doubles: B²/8 misses.
    • Misses for Block B: Similarly, the block of B is loaded. Since it is accessed repeatedly within the small inner loops and fits in cache, only the compulsory misses to load it once per block step are paid: B²/8 misses.
    • Note: Misses for C are ignored for simplicity, as they are negligible compared to those for A and B.
  3. Total Misses: (n/B)³ · 2 · (B²/8) = n³ / (4B).

Comparison: Naive vs. Blocked

  • Naive Misses: (9/8) · n³
  • Blocked Misses: n³ / (4B)

The Speedup

The blocking technique reduces the number of cache misses by a factor proportional to B. Even a small block size helps: B = 8, for instance, gives n³/32 misses versus the naive (9/8) · n³, a 36-fold reduction in memory traffic.

Why it works

In the naive case, matrix B is accessed column-wise, and because the stride is n doubles, its cache lines are evicted before they can be reused for the next row of C.

In the blocked case, a small chunk of B is loaded into the cache and reused B times (once for each of the B rows of the C block) before it is discarded. This temporal reuse is what saves performance.

Summary

  1. Blocking: For algorithms processing large data sets (like matrix multiplication), restructuring the code to work on cache-sized tiles is essential to avoid thrashing and to exploit temporal locality.

Continue here: 21 Cache Blocking and Exception Mechanics