Lecture from: 09.12.2025 | Video: Videos ETHZ

The modern era has been reached. Around 2003, the “free lunch” of automatically faster processors abruptly ended: gains in clock frequency stalled, and more cores began to appear on a single chip. This lecture explores why that happened, the architecture of these systems, and the profound complexity they introduce in how memory behaves.

20.1 The End of the Free Lunch

For decades, Moore’s Law (doubling transistors every ~18 months) directly translated to performance. Then three specific “walls” were hit that forced a change in strategy.

1. The Power Wall

This is the primary physical limitation. A processor converts electricity into computation and heat. The dynamic power dissipation is governed by the formula:

P = C · V² · f

  • C: Capacitance (related to the area/number of transistors).
  • V: Supply Voltage.
  • f: Frequency.

For years, frequency (f) was increased to get speed, but the power cost was offset by lowering the voltage (V). Since V is squared, small reductions bought a lot of power budget.
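
As a quick sanity check with illustrative numbers (not from the lecture): lowering V by 15% at fixed C and f scales power by 0.85² ≈ 0.72, i.e. roughly a 28% saving — headroom that could then be spent on raising f.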

The Problem

The floor on voltage scaling has been reached. Voltage cannot be lowered further because of Leakage Power (static power lost even when transistors are “off”). If V cannot be lowered, f cannot be raised without melting the chip.

2. The Memory Wall

Processors got faster much more quickly than memory did. The time to fetch data from RAM (in terms of CPU cycles) grew exponentially. A modern CPU executing billions of operations per second spends most of its life waiting for data.

3. The ILP Wall

Billions of transistors were spent on “serial acceleration”, making a single stream of instructions run faster (branch prediction, out-of-order execution, superscalar). A point of diminishing returns was reached; adding more complexity yielded negligible speedups.

The Solution: If the engine cannot rev higher, more engines are built. The transistors are used to build Multicore Processors.


20.2 Symmetric Multiprocessing (SMP)

The standard architecture today is SMP.

  • Symmetric: All cores are identical.
  • Shared Memory: All cores connect to a single physical memory address space via a shared bus (or interconnect).

The Catch: The shared bus is a bottleneck. If every instruction from 4 cores went to RAM, the system would crawl. The Fix: Caches. Heavy reliance is placed on caches to filter out traffic.

The New Problem

If Core A caches variable x and Core B caches variable x, and Core A changes it, then Core B is holding stale data.

This leads to two fundamental concepts that are often confused: Coherence and Consistency.


20.3 Coherence vs. Consistency

These two concepts must be strictly distinguished.

  1. Cache Coherence (Zoomed In)

    • Scope: A single memory address.
    • Definition: All processors must agree on the order of reads and writes to that specific location. It ensures that if x=5 is written, 5 is eventually seen by everyone, not a stale old value.
    • Analogy: Everyone in a meeting agrees on what was written on the whiteboard.
  2. Memory Consistency (Zoomed Out)

    • Scope: The ordering of operations across multiple different addresses.
    • Definition: If x=1 is written then y=1, in what order do they appear?
    • Analogy: Agreement on the timeline of events happening in the meeting relative to events in the hallway.

Sequential Consistency (SC)

This is the “intuitive” model. It is what is assumed to happen when writing code.

Sequential Consistency

  1. Operations from a single processor appear in Program Order.
  2. The global execution is an interleaving of these program orders.

Consider a big switch connected to memory. It selects Core A, executes A’s next instruction completely, then selects Core B, executes B’s next instruction completely.

The Limitation of SC

SC is beautiful but slow. It forbids the hardware from doing almost any optimization (reordering reads/writes, using store buffers).

Therefore, no modern high-performance processor implements pure Sequential Consistency. They all use relaxed models.

20.4 Cache Coherence Protocols (Snooping)

How is data in different caches kept coherent? Snooping is used. Because all cores share a bus, caches can “snoop” (listen) to transactions. If Core A announces “I am writing to address 0x100”, Core B hears this and can react (e.g., by invalidating its own copy of 0x100).

This is managed using a state machine for every cache line.

1. The MSI Protocol (The Baseline)

Every cache line is in one of three states:

  • M (Modified): The block is dirty. The only valid copy is locally held. RAM is stale.
  • S (Shared): The block is clean. A copy is held, others might have it too. Read-only.
  • I (Invalid): The block is empty or garbage.

Transitions (The Dance of the Caches):

  1. Read Miss (Local):
    • State is I. A Read is requested.
    • BusRd (Read Request) is broadcast.
    • If someone else has it in M, they must Flush (write back to RAM) immediately.
    • Data is loaded and state moves to S.
  2. Write Miss (Local):
    • State is I. A Write is requested.
    • BusRdX (Read Exclusive / “I want to own this”) is broadcast.
    • Everyone else in S or M must Invalidate (move to I).
    • Data is loaded and state moves to M.
  3. Writing to Shared (Upgrade):
    • State is S. A Write is requested.
    • Writing cannot happen yet (others have copies).
    • BusRdX is broadcast to kill other copies.
    • State moves to M.
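
These transitions can be captured as a small state machine per cache line. Below is a minimal sketch in C (the type and function names are invented for illustration; real hardware implements this in logic, not software):

typedef enum { I, S, M } msi_state_t;                    // per-cache-line state
typedef enum { BUS_NONE, BUS_RD, BUS_RDX } bus_event_t;  // broadcast transactions

// Local processor READ: a miss from I broadcasts BusRd and loads in S.
msi_state_t proc_read(msi_state_t st, bus_event_t *bus) {
    if (st == I) { *bus = BUS_RD; return S; }   // read miss
    *bus = BUS_NONE; return st;                 // read hit (stay in S or M)
}

// Local processor WRITE: from I or S, BusRdX must be broadcast first.
msi_state_t proc_write(msi_state_t st, bus_event_t *bus) {
    if (st == M) { *bus = BUS_NONE; return M; } // write hit in M: silent
    *bus = BUS_RDX; return M;                   // write miss (I) or upgrade (S)
}

// Reaction to a transaction snooped from another core's cache.
msi_state_t snoop(msi_state_t st, bus_event_t bus, int *must_flush) {
    *must_flush = (st == M && bus != BUS_NONE); // dirty copy: flush to RAM first
    if (bus == BUS_RDX) return I;               // someone claims ownership
    if (bus == BUS_RD && st == M) return S;     // downgrade after the flush
    return st;                                  // S stays S on BusRd; I stays I
}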

2. The MESI Protocol (Intel Optimization)

MSI has a flaw. Imagine a single-threaded program. It reads x, then writes x.

  • MSI behavior: Read x (Go to S). Write x (Broadcast “Invalidate” to go to M).
  • The waste: No one else had x! The bus was needlessly spammed to invalidate ghosts.

Solution: Add the E (Exclusive) state.

  • E (Exclusive): The copy is locally held. It is clean (matches RAM). But it is known to be unique.
  • Optimization: If a write to an E block is requested, the state can silently change to M without telling the bus. This saves massive bandwidth for private data.

MESI States

State   Valid?   Exclusive?   Dirty?
M       Yes      Yes          Yes
E       Yes      Yes          No
S       Yes      No           No
I       No       -            -
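
The E→M silent upgrade is the only change to the write path relative to MSI. A sketch, reusing bus_event_t from the MSI sketch above (names again invented for illustration):

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state_t;

// Write path under MESI: a write hit in E upgrades to M with zero bus traffic.
mesi_state_t mesi_proc_write(mesi_state_t st, bus_event_t *bus) {
    if (st == MESI_M || st == MESI_E) { *bus = BUS_NONE; return MESI_M; } // silent
    *bus = BUS_RDX;                 // from I or S, other copies must still be killed
    return MESI_M;
}

On a read miss, the bus additionally carries a “shared” signal so the requester knows whether to load the line in E (no other copies exist) or in S.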

3. The MOESI Protocol (AMD Optimization)

MESI has a flaw. If a line is in M (Dirty) and another core requests to Read it:

  • MESI behavior: Data must be flushed to RAM (slow). The other core reads from RAM. Both go to S. RAM is now up to date.
  • The waste: Why go through slow RAM? The data is right here!

Solution: Add the O (Owner) state.

  • O (Owner): The dirty data is locally held. Another core is allowed to have a copy (Shared). The local core is responsible for the eventual write-back to RAM.
  • Optimization: The dirty data is sent directly to the requester (Cache-to-Cache transfer). The local core moves to O, the requester moves to S. RAM remains stale (dirty), saving the memory write.

4. The MESIF Protocol (Intel Core i7 Optimization)

Modern interconnects are point-to-point (not a single wire bus). Broadcasts are expensive.

  • Problem: If 10 cores have a line in S and Core 11 requests it, who responds? If everyone responds, a traffic jam occurs.
  • Solution: Add F (Forward).
  • F (Forward): Special Shared state. Only the F node is allowed to respond to read requests. Everyone else in S stays silent. The most recent requester usually gets the F token.

20.5 Practice: Cache Coherence Transitions

Understanding the state machine of MSI/MESI is crucial for predicting bus traffic.

Example: MSI Transition Sequence

Consider three cores (C1, C2, C3) and a single memory address X.

Action          C1 State   C2 State   C3 State   Bus Action
Initial State   I          I          I          -
C1 reads X      S          I          I          BusRd(X)
C2 reads X      S          S          I          BusRd(X)
C1 writes X     M          I          I          BusRdX(X) or Upgrade(X)
C3 reads X      S          I          S          BusRd(X), C1 performs Flush

Important

Notice how the write by C1 forced C2 into the Invalid state. When C3 eventually read the data, C1 had to flush its modified data to RAM so C3 (and C1) could enter the Shared state.

Exercise: MESI vs MSI

How does the Exclusive state in MESI save bandwidth in the following scenario?

  1. Core A reads X. (MSI: S, MESI: E)
  2. Core A writes X.

Answer: In MSI, the write requires a BusRdX or Invalidate broadcast even if no one else has a copy. In MESI, because Core A is in the E state, it can transition to M silently, without any bus traffic.

20.6 Relaxed Consistency Models

Hardware wants speed. To get speed, it cheats on the “Program Order” rule of Sequential Consistency.

The most common model (x86) is Processor Consistency or Total Store Ordering (TSO).

The Write Buffer

Writes are slow. Stalling the CPU on a store is undesirable. The write is put into a Write Buffer and execution continues. The write trickles to memory later.

The TSO Relaxation (Write-to-Read):

  • Rule: A later Read is allowed to pass an earlier Write (to a different address).
  • If A is written to (the store sits in the buffer) and B is then read (from memory), the read of B might happen globally before the write of A is visible to others.

Example: Two cores update separate variables and then read the other’s variable. In TSO, it is possible for both to read the old value (0) because both writes are stuck in their respective write buffers. This is impossible in Sequential Consistency.
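
A minimal sketch of this classic store-buffer litmus test in C with POSIX threads (the variable and function names are invented for this example; the plain int accesses are deliberately racy to expose what the hardware may do):

#include <pthread.h>
#include <stdio.h>

int A = 0, B = 0;   // shared flags (deliberately non-atomic)
int r1, r2;         // per-thread results

void *writer_a(void *arg) {
    A = 1;          // store: may sit in Core 1's write buffer
    r1 = B;         // load: may complete before the store is visible
    return NULL;
}

void *writer_b(void *arg) {
    B = 1;          // symmetric on Core 2
    r2 = A;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer_a, NULL);
    pthread_create(&t2, NULL, writer_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // Under SC, at least one of r1, r2 must be 1.
    // Under TSO, r1 == 0 && r2 == 0 is a legal outcome.
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}

Compile with -pthread. Note that in portable C this is formally a data race, and a single run will only rarely show the reordering; the example illustrates what TSO hardware is allowed to do, not a coding pattern to imitate.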


20.7 Synchronization: Barriers and Fences

Since the hardware reorders things, how is correct code written (e.g., implementing a lock)? The hardware must be explicitly told “Stop reordering!” using Fences.

Types of Barriers

  1. Compiler Barriers: Tell the compiler (GCC/LLVM) not to reorder instructions in the generated assembly.
    • Example: asm volatile("" ::: "memory");
  2. Hardware Fences: Instructions that force the CPU pipeline to drain or the Write Buffer to flush.
    • MFENCE (Memory Fence): The heavy hammer. Stops all loads and stores from passing the fence.
    • SFENCE: Waits for stores.
    • LFENCE: Waits for loads.
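
As a sketch of how such a fence repairs the litmus test above, using C11 <stdatomic.h> (names invented for the example): a full fence between the store and the load forces the store to become globally visible first, and on x86 it typically compiles down to MFENCE.

#include <stdatomic.h>

atomic_int X = 0, Y = 0;  // shared flags, now C11 atomics

int fenced_writer(void) {
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); // full fence: drains the store
                                               // before the load below runs
    return atomic_load_explicit(&Y, memory_order_relaxed);
}

If both threads use this pattern symmetrically, the r1 == r2 == 0 outcome becomes impossible.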


20.8 Atomic Operations: Test-and-Set

A lock cannot be implemented with just load and store. An instruction that Read-Modifies-Writes atomically is needed.

Test-and-Set (TAS)

Conceptually:

int TAS(int *lock) {
    int old = *lock;   // read the old value
    *lock = 1;         // unconditionally set the lock
    return old;        // 0 = we acquired it, 1 = it was already held
}

This happens indivisibly. On x86, this is the XCHG instruction (or instructions with the LOCK prefix).
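
A runnable counterpart using C11 atomics (a sketch; atomic_exchange is the portable way to express this, and compilers lower it to XCHG on x86):

#include <stdatomic.h>

// Atomic test-and-set: writes 1 and returns the previous value, indivisibly.
int tas(atomic_int *lock) {
    return atomic_exchange(lock, 1);
}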

A Simple Spinlock:

void acquire(int *lock) {
    while (TAS(lock) == 1); // Spin while lock was already held
}

Performance Issue: TAS writes to memory every time, even if the lock is held. In a cache-coherent system (MESI), this is a disaster.

  1. Core A tries TAS. It invalidates everyone else to get M state.
  2. Core B tries TAS. It invalidates Core A to get M state.
  3. The cache line is ping-ponged back and forth just to discover “Still locked.” This creates massive bus traffic.

The Fix: Test-and-Test-and-Set. Check with a normal Read first; only try the atomic write if the lock looks free.

void acquire(int *lock) {
    do {
        while (*lock == 1); // Spin on cached read (Shared state)
    } while (TAS(lock) == 1); // Only try write if it looks free
}

This allows processors to spin on their local Shared copy of the cache line, generating zero bus traffic until the lock is actually released.
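
For completeness (not shown above), releasing the lock is just a store of 0, which invalidates the spinners' Shared copies and lets one of them win the next TAS. A minimal sketch, assuming the lock is declared as a C11 atomic_int; on x86 (TSO) a plain store is strong enough at the hardware level, but release semantics are needed so the compiler cannot reorder it past the critical section:

#include <stdatomic.h>

void release(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release); // unlock: critical-
                                                          // section writes stay
                                                          // before this store
}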

Takeaway: Synchronization requires Atomic instructions (TAS) and Fences to fight reordering.

Continue here: 25 Parallel Software and Synchronization