Lecture from: 09.12.2025 | Video: Videos ETHZ
The modern era has been reached. Around 2003, the “free lunch” of automatically faster processors abruptly ended. Massive gains in clock frequency stopped, and more cores on a single chip began to appear. This lecture explores why that happened, the architecture of these systems, and the profound complexity it introduces regarding how memory behaves.
The End of the Free Lunch
For decades, Moore’s Law (doubling transistors every ~18 months) directly translated to performance. Then three specific “walls” were hit that forced a change in strategy.
1. The Power Wall
This is the primary physical limitation. A processor converts electricity into computation and heat. The dynamic power dissipation is governed by the formula:

$$P_{dyn} = C \cdot V^2 \cdot f$$

- $C$: Capacitance (related to the area/number of transistors).
- $V$: Supply Voltage.
- $f$: Clock Frequency.

For years, frequency ($f$) was increased to get speed, and the power cost was offset by lowering the voltage ($V$). Since $V$ is squared, small reductions bought a lot of power budget.
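To see why the squared voltage term matters, here is a short worked example with illustrative (not historical) numbers, assuming $P = C V^2 f$ with $C$ and $f$ held fixed:

$$\frac{P_{new}}{P_{old}} = \left(\frac{V_{new}}{V_{old}}\right)^2 = \left(\frac{1.0\,\text{V}}{1.2\,\text{V}}\right)^2 \approx 0.69$$

Dropping the supply voltage by just 17% cuts dynamic power by roughly 31%, headroom that could then be spent on raising $f$.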
The Problem
The floor on voltage scaling has been reached. Voltage cannot be lowered further because of Leakage Power (static power lost even when transistors are “off”). If $V$ cannot be lowered, $f$ cannot be raised without melting the chip.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114012.png)
2. The Memory Wall
Processors got faster much more quickly than memory did. The time to fetch data from RAM (in terms of CPU cycles) grew exponentially. A modern CPU executing billions of operations per second spends most of its life waiting for data.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114021.png)
3. The ILP Wall
Billions of transistors were spent on “serial acceleration”, making a single stream of instructions run faster (branch prediction, out-of-order execution, superscalar). A point of diminishing returns was reached; adding more complexity yielded negligible speedups.
The Solution: If the engine cannot rev higher, more engines are built. The transistors are used to build Multicore Processors.
Symmetric Multiprocessing (SMP)
The standard architecture today is SMP.
- Symmetric: All cores are identical.
- Shared Memory: All cores connect to a single physical memory address space via a shared bus (or interconnect).
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114034.png)
The Catch: The shared bus is a bottleneck. If every instruction from 4 cores went to RAM, the system would crawl. The Fix: Caches. Heavy reliance is placed on caches to filter out traffic.
The New Problem
If Core A caches variable x and Core B caches variable x, and Core A changes it, then Core B is holding stale data.
This leads to two fundamental concepts that are often confused: Coherence and Consistency.
Coherence vs. Consistency
These two concepts must be strictly distinguished.
- Cache Coherence (Zoomed In)
	- Scope: A single memory address.
	- Definition: All processors must agree on the order of reads and writes to that specific location. It ensures that if `x = 5` is written, `5` is eventually seen, not `4`.
	- Analogy: Everyone in a meeting agrees on what was written on the whiteboard.
- Memory Consistency (Zoomed Out)
	- Scope: The ordering of operations across multiple different addresses.
	- Definition: If `x = 1` is written and then `y = 1`, in what order do the writes become visible to other cores?
	- Analogy: Agreement on the timeline of events happening in the meeting relative to events in the hallway.
Sequential Consistency (SC)
This is the “intuitive” model. It is what is assumed to happen when writing code.
Sequential Consistency
- Operations from a single processor appear in Program Order.
- The global execution is an interleaving of these program orders.
Consider a big switch connected to memory. It selects Core A, executes A’s next instruction completely, then selects Core B, executes B’s next instruction completely.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114104.png)
The Limitation of SC
SC is beautiful but slow. It forbids the hardware from doing almost any optimization (reordering reads/writes, using store buffers).
Therefore, no modern high-performance processor implements pure Sequential Consistency. They all use relaxed models.
Cache Coherence Protocols (Snooping)
How is data in different caches kept coherent? Snooping is used.
Because all cores share a bus, caches can “snoop” (listen) to transactions. If Core A announces “I am writing to address 0x100”, Core B hears this and can react (e.g., by invalidating its own copy of 0x100).
This is managed using a state machine for every cache line.
1. The MSI Protocol (The Baseline)
Every cache line is in one of three states:
- M (Modified): The block is dirty. The only valid copy is locally held. RAM is stale.
- S (Shared): The block is clean. A copy is held, others might have it too. Read-only.
- I (Invalid): The block is empty or garbage.
Transitions (The Dance of the Caches):

- Read Miss (Local):
	- State is I. A Read is requested.
	- `BusRd` (Read Request) is broadcast.
	- If someone else has it in M, they must Flush (write back to RAM) immediately.
	- Data is loaded and state moves to S.
- Write Miss (Local):
	- State is I. A Write is requested.
	- `BusRdX` (Read Exclusive / “I want to own this”) is broadcast.
	- Everyone else in S or M must Invalidate (move to I).
	- Data is loaded and state moves to M.
- Writing to Shared (Upgrade):
	- State is S. A Write is requested.
	- Writing cannot happen yet (others have copies).
	- `BusRdX` is broadcast to kill other copies.
	- State moves to M.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114131.png)
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114142.png)
2. The MESI Protocol (Intel Optimization)
MSI has a flaw. Imagine a single-threaded program. It reads x, then writes x.

- MSI behavior: Read `x` (go to S). Write `x` (broadcast “Invalidate” to go to M).
- The waste: No one else had `x`! The bus was needlessly spammed to invalidate ghosts.
Solution: Add the E (Exclusive) state.
- E (Exclusive): The copy is locally held. It is clean (matches RAM). But it is known to be unique.
- Optimization: If a write to an E block is requested, the state can silently change to M without telling the bus. This saves massive bandwidth for private data.
MESI States
| State | Valid? | Exclusive? | Dirty? |
|---|---|---|---|
| M | Yes | Yes | Yes |
| E | Yes | Yes | No |
| S | Yes | No | No |
| I | No | - | - |
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114228.png)
3. The MOESI Protocol (AMD Optimization)
MESI has a flaw. If a line is in M (Dirty) and another core requests to Read it:
- MESI behavior: Data must be flushed to RAM (slow). The other core reads from RAM. Both go to S. RAM is now up to date.
- The waste: Why go through slow RAM? The data is right here!
Solution: Add the O (Owner) state.
- O (Owner): The dirty data is locally held. Another core is allowed to have a copy (Shared). The local core is responsible for the eventual write-back to RAM.
- Optimization: The dirty data is sent directly to the requester (Cache-to-Cache transfer). The local core moves to O, the requester moves to S. RAM remains stale (dirty), saving the memory write.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114254.png)
4. The MESIF Protocol (Intel Core i7 Optimization)
Modern interconnects are point-to-point (not a single wire bus). Broadcasts are expensive.
- Problem: If 10 cores have a line in S and Core 11 requests it, who responds? If everyone responds, a traffic jam occurs.
- Solution: Add F (Forward).
- F (Forward): Special Shared state. Only the F node is allowed to respond to read requests. Everyone else in S stays silent. The most recent requester usually gets the F token.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114311.png)
Practice: Cache Coherence Transitions
Understanding the state machine of MSI/MESI is crucial for predicting bus traffic.
Example: MSI Transition Sequence
Consider three cores (C1, C2, C3) and a single memory address X.
| Action | C1 State | C2 State | C3 State | Bus Action |
|---|---|---|---|---|
| Initial State | I | I | I | - |
| C1 reads X | S | I | I | BusRd(X) |
| C2 reads X | S | S | I | BusRd(X) |
| C1 writes X | M | I | I | BusRdX(X) or Upgrade(X) |
| C3 reads X | S | I | S | BusRd(X), C1 performs Flush |
Important
Notice how the write by C1 forced C2 into the Invalid state. When C3 eventually read the data, C1 had to flush its modified data to RAM so C3 (and C1) could enter the Shared state.
Exercise: MESI vs MSI
How does the Exclusive state in MESI save bandwidth in the following scenario?
- Core A reads `X`. (MSI: S, MESI: E)
- Core A writes `X`.

Answer: In MSI, the write requires a `BusRdX` or `Invalidate` broadcast even if no one else has a copy. In MESI, because Core A is in the E state, it can transition to M silently, without any bus traffic.
Relaxed Consistency Models
Hardware wants speed. To get speed, it cheats on the “Program Order” rule of Sequential Consistency.
The most common model (x86) is Processor Consistency or Total Store Ordering (TSO).
The Write Buffer
Writes are slow. Stalling the CPU on a store is undesirable.
The write is put into a Write Buffer and execution continues. The write trickles to memory later.
The TSO Relaxation (Write-to-Read):
- Rule: Reads are allowed to pass Writes.
- If `A` is written (the write sits in the buffer) and `B` is then read from memory, the read of `B` can become globally visible before the write of `A` does.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114331.png)
Example: Two cores update separate variables and then read the other’s variable. In TSO, it is possible for both to read the old value (0) because both writes are stuck in their respective write buffers. This is impossible in Sequential Consistency.
Synchronization: Barriers and Fences
Since the hardware reorders things, how is correct code written (e.g., implementing a lock)? The hardware must be explicitly told “Stop reordering!” using Fences.
Types of Barriers
- Compiler Barriers: Tell the compiler (GCC/LLVM) not to reorder instructions in the generated assembly.
	- Example: `asm volatile("" ::: "memory");`
- Hardware Fences: Instructions that force the CPU pipeline to drain or the Write Buffer to flush.
- MFENCE (Memory Fence): The heavy hammer. Stops all loads and stores from passing the fence.
- SFENCE: Waits for stores.
- LFENCE: Waits for loads.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/attachments/Pasted-image-20260107114402.png)
Atomic Operations: Test-and-Set
A lock cannot be implemented with just load and store. An instruction that Read-Modifies-Writes atomically is needed.
Test-and-Set (TAS)
Conceptually:
```c
int TAS(int *lock) {
    int old = *lock;
    *lock = 1;
    return old;
}
```

This happens indivisibly. On x86, this is the `XCHG` instruction (or instructions with the `LOCK` prefix).
A Simple Spinlock:
```c
void acquire(int *lock) {
    while (TAS(lock) == 1); // Spin while lock was already held
}
```

Performance Issue: TAS writes to memory every time, even if the lock is held. In a cache-coherent system (MESI), this is a disaster.
- Core A tries `TAS`. It invalidates everyone else to get M state.
- Core B tries `TAS`. It invalidates Core A to get M state.
- The cache line is ping-ponged back and forth just to discover “Still locked.” This creates massive bus traffic.
The Fix: Test-and-Test-and-Set Check with a normal Read first. Only try the atomic write if the lock looks free.
```c
void acquire(int *lock) {
    do {
        while (*lock == 1); // Spin on cached read (Shared state)
    } while (TAS(lock) == 1); // Only try write if it looks free
}
```

This allows processors to spin on their local Shared copy of the cache line, generating zero bus traffic until the lock is actually released.
Takeaway: correct synchronization requires atomic instructions (like TAS) and fences to fight reordering.
Continue here: 25 Parallel Software and Synchronization