Lecture from: 10.12.2025 | Video: Videos ETHZ
In the previous section, the hardware foundations were established: Cache Coherence (keeping data consistent across caches using protocols like MSI/MESI) and Memory Consistency (rules regarding the ordering of operations). Now, the focus moves up the stack to Systems Programming. How can these hardware features be used to write correct, high-performance parallel software?
Synchronization Primitives
Standard loads and stores are insufficient for implementing locks. Atomic operations are needed: instructions that perform a read-modify-write cycle while guaranteeing that the memory bus is locked, preventing any other core from interfering.
The Test-and-Set (TAS) Instruction
The simplest atomic primitive is Test-And-Set (often implemented as XCHG on x86).
- Operation: It reads the value at a memory address, returns that old value, and unconditionally writes a 1 to that address.
- Atomicity: The hardware guarantees this happens as one indivisible unit.
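To make the semantics concrete, here is what the hardware does, written as plain C. This is a sketch for illustration only; the real operation is a single atomic instruction, so the body below cannot be interrupted partway through.

```c
// Semantics of Test-And-Set. The hardware executes this as one
// indivisible unit; plain C is shown purely for illustration.
int test_and_set(int *addr) {
    int old = *addr;  // read the current value
    *addr = 1;        // unconditionally write a 1
    return old;       // return the OLD value to the caller
}
```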
Implementation: Basic Spinlock
A basic Spinlock can be built with this:
```c
// 0 = Free, 1 = Locked
void acquire(int *lock) {
    // TAS returns the OLD value.
    // If it returns 1, another thread held it; keep spinning.
    // If it returns 0, the lock was free and this thread just set it
    // to 1; the current thread now owns it.
    while (test_and_set(lock) == 1) {
        ; // spin
    }
}

void release(int *lock) {
    *lock = 0; // A simple write releases the lock
}
```

The Performance Problem with TAS
While correct, this implementation is disastrous for performance on a cached multicore system.
- Writes cause Invalidations: Test-And-Set performs a write every single time, even if the lock is already held (writing 1 over an existing 1).
- Bus Contention: To write, the coherence protocol (MESI) demands the cache line be in Modified state. This invalidates the copy in every other spinning core's cache.
- Ping-Pong: The cache line bounces between cores continuously. The interconnect is flooded with useless traffic, slowing down the entire machine, not just the app using the lock.
Optimization: Test-and-Test-and-Set
Bus traffic can be reduced by exploiting the cache. The expensive atomic write should only be attempted if the lock appears to be free locally.
```c
void acquire(int *lock) {
    do {
        // 1. The "Test" (normal read):
        // spin here while the lock is held.
        // The thread spins on a cached SHARED copy: zero bus traffic.
        while (*lock == 1)
            ; // spin
    } while (test_and_set(lock) == 1); // 2. The "Set" (atomic attempt)
}
```

How this works
- Waiting: While the lock is held, all waiters have the line in Shared (S) state. They spin reading their local cache, so the bus stays quiet.
- Release: The owner writes 0. This invalidates all S copies.
- Race: All waiters take a cache miss and fetch the new 0. They all exit the while loop.
- Attempt: They all rush to execute test_and_set. One succeeds; the rest fail and go back to spinning.
Warning: This pattern works well in low-level systems code (OS kernels). In high-level application code (like Java), compiler or processor reordering might break the logic unless explicit memory fences are used.
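For application-level code, the same pattern can be written with C11 atomics, which make the required ordering explicit. A minimal sketch, assuming a C11 compiler with <stdatomic.h>; the default memory order is sequentially consistent, which rules out the problematic reordering:

```c
#include <stdatomic.h>

typedef atomic_int ttas_lock_t; // 0 = free, 1 = locked

void acquire(ttas_lock_t *lock) {
    do {
        // "Test": plain atomic read; spins on the locally cached copy.
        while (atomic_load(lock) == 1)
            ; // spin
        // "Set": atomic exchange returns the OLD value.
    } while (atomic_exchange(lock, 1) == 1);
}

void release(ttas_lock_t *lock) {
    atomic_store(lock, 0); // ordered store; releases the lock
}
```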
Compare-and-Swap (CAS)
Test-And-Set is a blunt instrument (always writes 1). Compare-and-Swap (CAS) is a more precise, general-purpose primitive used for Lock-Free and Wait-Free algorithms.
Semantics
CAS(location, expected_old_value, new_value)
- Atomically load the value from location.
- Compare it with expected_old_value.
- If equal: write new_value to location.
- Return: the actual value found at location.
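Written out as plain C (illustration only; on x86 this is a single LOCK CMPXCHG instruction, and the whole body executes atomically):

```c
// Semantics of CAS. Executed as one indivisible unit by the hardware.
int CAS(int *location, int expected_old_value, int new_value) {
    int actual = *location;          // atomically load
    if (actual == expected_old_value)
        *location = new_value;       // write only if nothing changed
    return actual;                    // caller checks: actual == expected?
}
```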
Pattern: Lock-Free Update
Instead of locking a data structure, it is possible to optimistically try to update it. A common pattern is Read-Copy-Update:
1. Read a global pointer to the data.
2. Create a local copy of the data structure and modify it.
3. Use CAS to swap the global pointer to point to the new copy.
   - Condition: only swap if the global pointer hasn't changed since step 1.
4. If CAS fails (someone else updated it), loop back to step 1 and try again with the new data.
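The same retry loop on the simplest possible shared state, a counter, shows the skeleton of the pattern. A sketch using the CAS semantics above; a real pointer-based version would allocate the new copy in step 2:

```c
// Lock-free increment: read, compute a new value locally, CAS, retry.
void increment(int *counter) {
    int old;
    do {
        old = *counter;              // step 1: read the current value
        // step 2: the "local copy" is simply old + 1
    } while (CAS(counter, old, old + 1) != old); // step 3: swap if unchanged
}
```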
The ABA Problem
There is a subtle flaw in the logic above. CAS checks if the value is the same, not if the history is the same.
Scenario:
- Stack is A -> B -> C. Thread 1 reads Top (A) and its next pointer (B).
- Thread 1 stalls.
- Thread 2 pops A. Stack is B -> C.
- Thread 2 pushes A back. Stack is A -> B -> C.
- Thread 1 wakes up. It wants to CAS A to B. It checks Top: it sees A. Success!
- Bug: Thread 1 sets Top to B. In this run the stack survives by luck, but nothing guaranteed it: had Thread 2 also popped and freed B before pushing A back (as in the exercise below), Top would now point to freed or structurally invalid memory. CAS cannot tell the difference, because it checks the value, not the history.
Solution: Version Counters
It is necessary to ensure that every update changes the state uniquely. The pointer is extended to the pair {Pointer, Counter}.

- Every time the pointer is modified, the counter is incremented.
- In the scenario above, Thread 2's pop changes Top from {A, v1} to {B, v2}, and its push changes it back to {A, v3}.
- When Thread 1 wakes up, it compares {A, v1} against {A, v3}. They differ, so the CAS fails, correctly.
- Hardware Support: x86 provides CMPXCHG16B (128-bit CAS) specifically to allow swapping a 64-bit pointer and a 64-bit counter atomically.
Practice: Lock-Free Stack and ABA
The ABA problem is a classic interview question and a critical bug in high-performance systems.
Exercise: Identify the ABA Vulnerability
Consider the following pop operation for a lock-free stack:
```c
node_t* pop(stack_t *stack) {
    node_t *old_top, *new_top;
    do {
        old_top = stack->top;            // Step 1
        if (old_top == NULL) return NULL;
        new_top = old_top->next;         // Step 2
    } while (CAS(&stack->top, old_top, new_top) != old_top); // Step 3: retry if Top changed
    return old_top;
}
```

Scenario:

- Thread 1 is at Step 2. old_top points to node A; new_top points to node B.
- Thread 2 pops A, pops B, then pushes A back.
- Thread 1 wakes up and executes Step 3.
Question: Why does CAS succeed, and what is the destructive result?
Answer: CAS succeeds because stack->top is still A. However, new_top (node B) has been freed or reused! The stack top is now pointing to a corrupted or invalid memory location.
Practice Challenge: Fixing with Versions
How would the counter fix this?
Answer: Thread 1 would hold {A, v1}. After Thread 2’s actions, the stack top would be {A, v3}. The CAS would fail, forcing Thread 1 to retry and correctly see the new state.
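A hedged sketch of the versioned pop. The names (CAS128, top_t) and the struct layout are illustrative, not from the lecture; on x86 the 128-bit CAS would be CMPXCHG16B, which requires 16-byte alignment, and a production version would also load the {pointer, version} pair atomically.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct node node_t;
struct node { node_t *next; /* payload ... */ };

typedef struct {
    node_t  *ptr;      // the actual top pointer
    uint64_t version;  // incremented on every successful update
} top_t;

typedef struct { _Alignas(16) top_t top; } stack_t;

// CAS128: hypothetical wrapper around a 128-bit CAS (CMPXCHG16B).
// Returns true if BOTH fields matched and the pair was swapped.
bool CAS128(top_t *loc, top_t expected, top_t desired);

node_t* pop(stack_t *stack) {
    top_t old_top, new_top;
    do {
        old_top = stack->top;                  // read {pointer, version}
        if (old_top.ptr == NULL) return NULL;
        new_top.ptr = old_top.ptr->next;
        new_top.version = old_top.version + 1; // every update bumps the counter
        // ABA is now impossible: even if ptr is A again, version differs.
    } while (!CAS128(&stack->top, old_top, new_top));
    return old_top.ptr;
}
```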
Simultaneous Multithreading (SMT)
Recall the “Walls” from the start of the chapter: the Memory Wall (the CPU waits for RAM) and the ILP Wall (it is hard to find independent instructions within a single thread). How can the execution units (ALUs) be utilized while a thread is stalled on a cache miss?
The Solution: Simultaneous Multithreading (Intel calls this Hyper-Threading). One physical core is presented as multiple (usually 2) Logical Cores to the OS. These logical cores share the execution engine but have separate architectural states (registers, PC).
- Coarse-Grained MT: A thread is switched only on a long-latency event (like an L3 miss).
- Fine-Grained MT: A thread is switched every cycle (Round Robin).
- SMT: Instructions are fetched from Thread A and Thread B simultaneously. If Thread A is waiting on memory, Thread B’s instructions can flow into the ALUs.
Pros and Cons
- Pro: Increases total throughput (utilization) by ~10-30%. It is cheap to implement (mostly logic sharing).
- Con: Single-thread performance often drops slightly due to resource contention (L1 cache, TLB entries).
- Security: Cloud providers (e.g., AWS, Google) often disable SMT or refuse to co-locate different customers on the same physical core. Sharing internal buffers opens up Side-Channel Attacks where one user can spy on another based on timing differences.
Non-Uniform Memory Access (NUMA)
Scaling SMP (Shared Memory) beyond ~8-16 cores is difficult because a single memory bus cannot feed that many hungry processors.
The Solution: Distributed Memory Architecture. The machine is split into Sockets (or NUMA Nodes).
- Each Socket has a group of cores (e.g., 12) and a Local Memory Controller.
- Physical RAM is partitioned; some is plugged into Socket 0, some into Socket 1.
- Sockets are connected via a high-speed interconnect (e.g., Intel QPI/UPI, AMD HyperTransport).
Implications
- Local Access: Fast.
- Remote Access: Requires traversing the interconnect. This results in higher latency and lower bandwidth.
- L3 Cache: Often partitioned per socket. Accessing a remote L3 is slower than local L3.
Latency Hierarchy
- L1 Hit: ~2 cycles
- L2 Hit: ~15 cycles
- L3 Hit (Local): ~75 cycles
- L3 Hit (Remote): ~190 cycles (Must cross interconnect)
- Remote RAM: Significantly slower.
This architecture requires OS and Application awareness. Linux uses numactl to pin processes to specific nodes to ensure they use local memory.
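Beyond numactl, applications can place memory explicitly with libnuma. A minimal sketch, assuming libnuma is installed (compile with -lnuma):

```c
#include <numa.h>   // libnuma
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {   // returns -1 if the kernel lacks NUMA support
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    size_t size = 1 << 20;        // 1 MiB
    // Back the buffer with physical memory on node 0, so threads
    // pinned to that socket get fast local accesses.
    void *buf = numa_alloc_onnode(size, 0);
    /* ... work on buf from threads bound to node 0 ... */
    numa_free(buf, size);
    return 0;
}
```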
NUMA Cache Coherence
The machine still needs to behave as one coherent memory space. If Core A (Socket 0) writes to an address, Core B (Socket 1) must see it. However, Snooping (broadcasting requests) cannot be used because flooding the interconnect with broadcasts doesn’t scale.
Solution: Directory-Based Coherence. Instead of asking everyone “Who has this block?”, a specific “Librarian” is asked.
- The Directory: Stored in L3 or Main Memory.
- Home Node: Every physical address maps to a specific Home Node based on a hash.
- Operation:
- A core requests data. The request goes to the Home Node.
- The Home Node checks the Directory. It contains a bitmask of which nodes hold a copy.
- The Home Node forwards the request only to the specific holders (Multicast).
- The holders respond.
Trade-off: Interconnect bandwidth (scalability) is saved at the cost of storage overhead (storing the directory state takes up ~5-10% of memory capacity).
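To make the directory state concrete, one plausible entry layout (a sketch only; real implementations compress this heavily and vary between vendors):

```c
#include <stdint.h>

// One entry per memory block, stored at the block's Home Node.
// The per-block sharer bitmask is exactly the storage overhead
// mentioned above.
typedef struct {
    uint8_t  state;   // Invalid / Shared / Modified (MSI-style)
    uint8_t  owner;   // node holding the Modified copy, if any
    uint64_t sharers; // bit i set => node i holds a copy (up to 64 nodes)
} dir_entry_t;
```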
False Sharing
A classic performance bug in multicore programming. Caches manage data in Cache Lines (typically 64 bytes).
Scenario:
- Thread A updates int x.
- Thread B updates int y.
- x and y are totally unrelated variables, but they sit next to each other in memory (on the same 64-byte line).
The Result: The hardware thinks the threads are contending for the same object.
- Thread A writes. To do so, it must get the line in Exclusive/Modified state. This invalidates Thread B’s cache.
- Thread B writes. This invalidates Thread A.
- Ping-Pong: The line bounces back and forth on every write. Performance degrades by orders of magnitude.
Fix: Use padding (e.g., char pad[64]) to force variables onto different cache lines.
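A minimal sketch of the fix in C11, assuming 64-byte cache lines (alignas forces each hot field onto its own line):

```c
#include <stdalign.h>

// Bad: x and y likely share one 64-byte line and ping-pong on writes.
struct counters_bad {
    int x; // written by Thread A
    int y; // written by Thread B
};

// Good: each field is placed at the start of its own 64-byte line.
struct counters_good {
    alignas(64) int x; // Thread A's cache line
    alignas(64) int y; // Thread B's cache line
};
```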
The MCS Lock
Spinlocks (TAS) scale poorly because all cores spin on a single global address, causing massive cache-invalidation traffic (Ping-Pong). How can a lock scale to 100+ cores? The answer is the MCS Lock (Mellor-Crummey and Scott).
The Idea
Instead of spinning on a global variable, each waiting thread spins on its own local variable.
- Structure: A linked list of waiters.
- Wait: Each node in the list spins on a flag inside its own node.
- Coherence Benefit: Since the node is local, the thread spins on a line in its own L1 cache. No bus traffic is generated while waiting.
Implementation Details
The Data Structure (QNode):
```c
struct qnode {
    struct qnode *next;
    int locked; // 1 = waiting, 0 = acquired
};
```

Acquire:
- The thread creates a local QNode.
- It uses an atomic XCHG (swap) on the global Lock pointer, swapping its own node in as the new Tail.
- It gets back the address of the previous tail.
- If there was a predecessor, it links behind it: prev->next = my_node.
- Spin: while (my_node->locked == 1). This spins on local data.
Release:
- The owner checks its next pointer.
- If there is a successor, the owner writes 0 to successor->locked.
  - This is the only cache miss! The owner invalidates the successor's cache line to signal "Go ahead".
- Race Condition: What if next is NULL, but a new thread is currently executing the XCHG in Acquire?
  - The owner uses CAS on the global lock pointer to try to swap it from Me to NULL.
  - If CAS fails, it means a successor arrived. The owner waits for the successor to finish linking, then notifies it.
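Putting acquire and release together, a hedged sketch using GCC's __atomic builtins (assumed available; the memory orders are chosen conservatively, and a production version would tune them further):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *next;
    int locked;                 // 1 = waiting, 0 = lock handed over
} qnode_t;

typedef struct { qnode_t *tail; } mcs_lock_t; // tail == NULL means free

void mcs_acquire(mcs_lock_t *lock, qnode_t *me) {
    me->next = NULL;
    me->locked = 1;
    // XCHG: install ourselves as the new tail, get the old tail back.
    qnode_t *prev = __atomic_exchange_n(&lock->tail, me, __ATOMIC_ACQ_REL);
    if (prev != NULL) {
        // Link behind the predecessor so it can hand us the lock.
        __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
        while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
            ;                   // spin on OUR OWN node: stays in local L1
    }
}

void mcs_release(mcs_lock_t *lock, qnode_t *me) {
    qnode_t *succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE);
    if (succ == NULL) {
        // No visible successor: try to swing tail from us back to NULL.
        qnode_t *expected = me;
        if (__atomic_compare_exchange_n(&lock->tail, &expected, NULL,
                false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return;             // nobody was waiting; lock is free
        // CAS failed: a successor is mid-acquire; wait until it links in.
        while ((succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE)) == NULL)
            ;
    }
    // Hand over the lock: the one cache miss per handoff.
    __atomic_store_n(&succ->locked, 0, __ATOMIC_RELEASE);
}
```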
Performance Comparison
- MCS Lock: Flat performance. The time to acquire/release stays constant regardless of how many other threads are waiting, whereas TAS-based spinlocks degrade as contention grows. This is the gold standard for high-contention locks.
Continue here: 26 System IO and Device Drivers