Lecture from: 10.12.2025 | Video: Videos ETHZ
In the previous section, the hardware foundations were established: Cache Coherence (keeping data consistent across caches using protocols like MSI/MESI) and Memory Consistency (rules regarding the ordering of operations). Now, the focus moves up the stack to Systems Programming. How can these hardware features be used to write correct, high-performance parallel software?
Synchronization Primitives
Standard loads and stores are insufficient for implementing locks. Atomic operations are needed: instructions that perform a read-modify-write cycle while guaranteeing that the memory bus is locked, preventing any other core from interfering.
The Test-and-Set (TAS) Instruction
The simplest atomic primitive is Test-And-Set (often implemented as XCHG on x86).
- Operation: It reads the value at a memory address, returns that old value, and unconditionally writes a 1 to that address.
- Atomicity: The hardware guarantees this happens as one indivisible unit.
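To make the semantics concrete, here is what the hardware does, written as plain C. This is a sketch for illustration only; the real operation is a single atomic instruction, so the body below cannot be interrupted partway through.

```c
// Semantics of Test-And-Set. The hardware executes this as one
// indivisible unit; plain C is shown purely for illustration.
int test_and_set(int *addr) {
    int old = *addr;  // read the current value
    *addr = 1;        // unconditionally write a 1
    return old;       // return the OLD value to the caller
}
```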
Implementation: Basic Spinlock
A basic Spinlock can be built with this:
```c
// 0 = Free, 1 = Locked
void acquire(int *lock) {
    // TAS returns the OLD value.
    // If it returns 1, another thread held it; keep spinning.
    // If it returns 0, the lock was free and this thread just set it
    // to 1; the current thread now owns it.
    while (test_and_set(lock) == 1) {
        ; // spin
    }
}

void release(int *lock) {
    *lock = 0; // A simple write releases the lock
}
```

The Performance Problem with TAS
While correct, this implementation is disastrous for performance on a cached multicore system.
- Writes cause Invalidations: Test-And-Set performs a write every single time, even if the lock is already held (writing 1 over an existing 1).
- Bus Contention: To write, the coherence protocol (MESI) demands the cache line be in Modified state. This invalidates the copy in every other spinning core's cache.
- Ping-Pong: The cache line bounces between cores continuously. The interconnect is flooded with useless traffic, slowing down the entire machine, not just the app using the lock.
Optimization: Test-and-Test-and-Set
Bus traffic can be reduced by exploiting the cache. The expensive atomic write should only be attempted if the lock appears to be free locally.
```c
void acquire(int *lock) {
    do {
        // 1. The "Test" (normal read):
        // spin here while the lock is held.
        // The thread spins on a cached SHARED copy: zero bus traffic.
        while (*lock == 1)
            ; // spin
    } while (test_and_set(lock) == 1); // 2. The "Set" (atomic attempt)
}
```

How this works
- Waiting: While the lock is held, all waiters have the line in Shared (S) state. They spin reading their local cache, so the bus stays quiet.
- Release: The owner writes 0. This invalidates all S copies.
- Race: All waiters take a cache miss and fetch the new 0. They all exit the while loop.
- Attempt: They all rush to execute test_and_set. One succeeds; the rest fail and go back to spinning.
Warning: This pattern works well in low-level systems code (OS kernels). In high-level application code (like Java), compiler or processor reordering might break the logic unless explicit memory fences are used.
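For application-level code, the same pattern can be written with C11 atomics, which make the required ordering explicit. A minimal sketch, assuming a C11 compiler with <stdatomic.h>; the default memory order is sequentially consistent, which rules out the problematic reordering:

```c
#include <stdatomic.h>

typedef atomic_int ttas_lock_t; // 0 = free, 1 = locked

void acquire(ttas_lock_t *lock) {
    do {
        // "Test": plain atomic read; spins on the locally cached copy.
        while (atomic_load(lock) == 1)
            ; // spin
        // "Set": atomic exchange returns the OLD value.
    } while (atomic_exchange(lock, 1) == 1);
}

void release(ttas_lock_t *lock) {
    atomic_store(lock, 0); // ordered store; releases the lock
}
```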
Compare-and-Swap (CAS)
Test-And-Set is a blunt instrument (always writes 1). Compare-and-Swap (CAS) is a more precise, general-purpose primitive used for Lock-Free and Wait-Free algorithms.
Semantics
CAS(location, expected_old_value, new_value)
- Atomically load the value from location.
- Compare it with expected_old_value.
- If equal: write new_value to location.
- Return: the actual value found at location.
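Written out as plain C (illustration only; on x86 this is a single LOCK CMPXCHG instruction, and the whole body executes atomically):

```c
// Semantics of CAS. Executed as one indivisible unit by the hardware.
int CAS(int *location, int expected_old_value, int new_value) {
    int actual = *location;          // atomically load
    if (actual == expected_old_value)
        *location = new_value;       // write only if nothing changed
    return actual;                    // caller checks: actual == expected?
}
```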
Pattern: Lock-Free Update
Instead of locking a data structure, it is possible to optimistically try to update it. A common pattern is Read-Copy-Update:
1. Read a global pointer to the data.
2. Create a local copy of the data structure and modify it.
3. Use CAS to swap the global pointer to point to the new copy.
   - Condition: only swap if the global pointer hasn't changed since step 1.
4. If CAS fails (someone else updated it), loop back to step 1 and try again with the new data.
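The same retry loop on the simplest possible shared state, a counter, shows the skeleton of the pattern. A sketch using the CAS semantics above; a real pointer-based version would allocate the new copy in step 2:

```c
// Lock-free increment: read, compute a new value locally, CAS, retry.
void increment(int *counter) {
    int old;
    do {
        old = *counter;              // step 1: read the current value
        // step 2: the "local copy" is simply old + 1
    } while (CAS(counter, old, old + 1) != old); // step 3: swap if unchanged
}
```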
The ABA Problem
There is a subtle flaw in the logic above. CAS checks if the value is the same, not if the history is the same.
Scenario:
- Stack is A -> B -> C. Thread 1 reads Top (A) and its next pointer (B).
- Thread 1 stalls.
- Thread 2 pops A. Stack is B -> C.
- Thread 2 pushes A back. Stack is A -> B -> C.
- Thread 1 wakes up. It wants to CAS A to B. It checks Top: it sees A. Success!
- Bug: Thread 1 sets Top to B. In this run the stack survives by luck, but nothing guaranteed it: had Thread 2 also popped and freed B before pushing A back (as in the exercise below), Top would now point to freed or structurally invalid memory. CAS cannot tell the difference, because it checks the value, not the history.
Solution: Version Counters
It is necessary to ensure that every update changes the state uniquely. The pointer is extended to the pair {Pointer, Counter}.

- Every time the pointer is modified, the counter is incremented.
- In the scenario above, Thread 2's pop changes Top from {A, v1} to {B, v2}, and its push changes it back to {A, v3}.
- When Thread 1 wakes up, it compares {A, v1} against {A, v3}. They differ, so the CAS fails, correctly.
- Hardware Support: x86 provides CMPXCHG16B (128-bit CAS) specifically to allow swapping a 64-bit pointer and a 64-bit counter atomically.
Practice: Lock-Free Stack and ABA
The ABA problem is a classic interview question and a critical bug in high-performance systems.
Exercise: Identify the ABA Vulnerability
Consider the following pop operation for a lock-free stack:
```c
node_t* pop(stack_t *stack) {
    node_t *old_top, *new_top;
    do {
        old_top = stack->top;            // Step 1
        if (old_top == NULL) return NULL;
        new_top = old_top->next;         // Step 2
    } while (CAS(&stack->top, old_top, new_top) != old_top); // Step 3: retry if Top changed
    return old_top;
}
```

Scenario:

- Thread 1 is at Step 2. old_top points to node A; new_top points to node B.
- Thread 2 pops A, pops B, then pushes A back.
- Thread 1 wakes up and executes Step 3.
Question: Why does CAS succeed, and what is the destructive result?
Answer: CAS succeeds because stack->top is still A. However, new_top (node B) has been freed or reused! The stack top is now pointing to a corrupted or invalid memory location.
Practice Challenge: Fixing with Versions
How would the counter fix this?
Answer: Thread 1 would hold {A, v1}. After Thread 2’s actions, the stack top would be {A, v3}. The CAS would fail, forcing Thread 1 to retry and correctly see the new state.
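A hedged sketch of the versioned pop. The names (CAS128, top_t) and the struct layout are illustrative, not from the lecture; on x86 the 128-bit CAS would be CMPXCHG16B, which requires 16-byte alignment, and a production version would also load the {pointer, version} pair atomically.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct node node_t;
struct node { node_t *next; /* payload ... */ };

typedef struct {
    node_t  *ptr;      // the actual top pointer
    uint64_t version;  // incremented on every successful update
} top_t;

typedef struct { _Alignas(16) top_t top; } stack_t;

// CAS128: hypothetical wrapper around a 128-bit CAS (CMPXCHG16B).
// Returns true if BOTH fields matched and the pair was swapped.
bool CAS128(top_t *loc, top_t expected, top_t desired);

node_t* pop(stack_t *stack) {
    top_t old_top, new_top;
    do {
        old_top = stack->top;                  // read {pointer, version}
        if (old_top.ptr == NULL) return NULL;
        new_top.ptr = old_top.ptr->next;
        new_top.version = old_top.version + 1; // every update bumps the counter
        // ABA is now impossible: even if ptr is A again, version differs.
    } while (!CAS128(&stack->top, old_top, new_top));
    return old_top.ptr;
}
```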
Simultaneous Multithreading (SMT)
Recall the “Walls” from the start of the chapter: the Memory Wall (the CPU waits for RAM) and the ILP Wall (it is hard to find independent instructions within a single thread). How can the execution units (ALUs) be utilized while a thread is stalled on a cache miss?
The Solution: Simultaneous Multithreading (Intel calls this Hyper-Threading). One physical core is presented as multiple (usually 2) Logical Cores to the OS. These logical cores share the execution engine but have separate architectural states (registers, PC).
- Coarse-Grained MT: A thread is switched only on a long-latency event (like an L3 miss).
- Fine-Grained MT: A thread is switched every cycle (Round Robin).
- SMT: Instructions are fetched from Thread A and Thread B simultaneously. If Thread A is waiting on memory, Thread B’s instructions can flow into the ALUs.
Pros and Cons
- Pro: Increases total throughput (utilization) by ~10-30%. It is cheap to implement (mostly logic sharing).
- Con: Single-thread performance often drops slightly due to resource contention (L1 cache, TLB entries).
- Security: Cloud providers (e.g., AWS, Google) often disable SMT or refuse to co-locate different customers on the same physical core. Sharing internal buffers opens up Side-Channel Attacks where one user can spy on another based on timing differences.
Non-Uniform Memory Access (NUMA)
Scaling SMP (Shared Memory) beyond ~8-16 cores is difficult because a single memory bus cannot feed that many hungry processors.
The Solution: Distributed Memory Architecture. The machine is split into Sockets (or NUMA Nodes).
- Each Socket has a group of cores (e.g., 12) and a Local Memory Controller.
- Physical RAM is partitioned; some is plugged into Socket 0, some into Socket 1.
- Sockets are connected via a high-speed interconnect (e.g., Intel QPI/UPI, AMD HyperTransport).
Implications
- Local Access: Fast.
- Remote Access: Requires traversing the interconnect. This results in higher latency and lower bandwidth.
- L3 Cache: Often partitioned per socket. Accessing a remote L3 is slower than local L3.
Latency Hierarchy
- L1 Hit: ~2 cycles
- L2 Hit: ~15 cycles
- L3 Hit (Local): ~75 cycles
- L3 Hit (Remote): ~190 cycles (Must cross interconnect)
- Remote RAM: Significantly slower.
This architecture requires OS and Application awareness. Linux uses numactl to pin processes to specific nodes to ensure they use local memory.
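Beyond numactl, applications can place memory explicitly with libnuma. A minimal sketch, assuming libnuma is installed (compile with -lnuma):

```c
#include <numa.h>   // libnuma
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {   // returns -1 if the kernel lacks NUMA support
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    size_t size = 1 << 20;        // 1 MiB
    // Back the buffer with physical memory on node 0, so threads
    // pinned to that socket get fast local accesses.
    void *buf = numa_alloc_onnode(size, 0);
    /* ... work on buf from threads bound to node 0 ... */
    numa_free(buf, size);
    return 0;
}
```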
NUMA Cache Coherence
The machine still needs to behave as one coherent memory space. If Core A (Socket 0) writes to an address, Core B (Socket 1) must see it. However, Snooping (broadcasting requests) cannot be used because flooding the interconnect with broadcasts doesn’t scale.
Solution: Directory-Based Coherence. Instead of asking everyone “Who has this block?”, a specific “Librarian” is asked.
- The Directory: Stored in L3 or Main Memory.
- Home Node: Every physical address maps to a specific Home Node based on a hash.
- Operation:
- A core requests data. The request goes to the Home Node.
- The Home Node checks the Directory. It contains a bitmask of which nodes hold a copy.
- The Home Node forwards the request only to the specific holders (Multicast).
- The holders respond.
Trade-off: Interconnect bandwidth (scalability) is saved at the cost of storage overhead (storing the directory state takes up ~5-10% of memory capacity).
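To make the directory state concrete, one plausible entry layout (a sketch only; real implementations compress this heavily and vary between vendors):

```c
#include <stdint.h>

// One entry per memory block, stored at the block's Home Node.
// The per-block sharer bitmask is exactly the storage overhead
// mentioned above.
typedef struct {
    uint8_t  state;   // Invalid / Shared / Modified (MSI-style)
    uint8_t  owner;   // node holding the Modified copy, if any
    uint64_t sharers; // bit i set => node i holds a copy (up to 64 nodes)
} dir_entry_t;
```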
False Sharing
A classic performance bug in multicore programming. Caches manage data in Cache Lines (typically 64 bytes).
Scenario:
- Thread A updates int x.
- Thread B updates int y.
- x and y are totally unrelated variables, but they sit next to each other in memory (on the same 64-byte line).
The Result: The hardware thinks the threads are contending for the same object.
- Thread A writes. To do so, it must get the line in Exclusive/Modified state. This invalidates Thread B’s cache.
- Thread B writes. This invalidates Thread A.
- Ping-Pong: The line bounces back and forth on every write. Performance degrades by orders of magnitude.
Fix: Use padding (e.g., char pad[64]) to force variables onto different cache lines.
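A minimal sketch of the fix in C11, assuming 64-byte cache lines (alignas forces each hot field onto its own line):

```c
#include <stdalign.h>

// Bad: x and y likely share one 64-byte line and ping-pong on writes.
struct counters_bad {
    int x; // written by Thread A
    int y; // written by Thread B
};

// Good: each field is placed at the start of its own 64-byte line.
struct counters_good {
    alignas(64) int x; // Thread A's cache line
    alignas(64) int y; // Thread B's cache line
};
```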
The MCS Lock
Spinlocks (TAS) scale poorly because all cores spin on a single global address, causing massive cache-invalidation traffic (Ping-Pong). How can a lock scale to 100+ cores? The answer is the MCS Lock (Mellor-Crummey and Scott).
The Idea
Instead of spinning on a global variable, each waiting thread spins on its own local variable.
- Structure: A linked list of waiters.
- Wait: Each node in the list spins on a flag inside its own node.
- Coherence Benefit: Since the node is local, the thread spins on a line in its own L1 cache. No bus traffic is generated while waiting.
Implementation Details
The Data Structure (QNode):
```c
struct qnode {
    struct qnode *next;
    int locked; // 1 = waiting, 0 = acquired
};
```

Acquire:
- The thread creates a local QNode.
- It uses an atomic XCHG (swap) on the global Lock pointer, swapping its own node in as the new Tail.
- It gets back the address of the previous tail.
- If there was a predecessor, it links behind it: prev->next = my_node.
- Spin: while (my_node->locked == 1). This spins on local data.
Release:
- The owner checks its next pointer.
- If there is a successor, the owner writes 0 to successor->locked.
  - This is the only cache miss! The owner invalidates the successor's cache line to signal "Go ahead".
- Race Condition: What if next is NULL, but a new thread is currently executing the XCHG in Acquire?
  - The owner uses CAS on the global lock pointer to try to swap it from Me to NULL.
  - If CAS fails, it means a successor arrived. The owner waits for the successor to finish linking, then notifies it.
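Putting acquire and release together, a hedged sketch using GCC's __atomic builtins (assumed available; the memory orders are chosen conservatively, and a production version would tune them further):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *next;
    int locked;                 // 1 = waiting, 0 = lock handed over
} qnode_t;

typedef struct { qnode_t *tail; } mcs_lock_t; // tail == NULL means free

void mcs_acquire(mcs_lock_t *lock, qnode_t *me) {
    me->next = NULL;
    me->locked = 1;
    // XCHG: install ourselves as the new tail, get the old tail back.
    qnode_t *prev = __atomic_exchange_n(&lock->tail, me, __ATOMIC_ACQ_REL);
    if (prev != NULL) {
        // Link behind the predecessor so it can hand us the lock.
        __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
        while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
            ;                   // spin on OUR OWN node: stays in local L1
    }
}

void mcs_release(mcs_lock_t *lock, qnode_t *me) {
    qnode_t *succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE);
    if (succ == NULL) {
        // No visible successor: try to swing tail from us back to NULL.
        qnode_t *expected = me;
        if (__atomic_compare_exchange_n(&lock->tail, &expected, NULL,
                false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return;             // nobody was waiting; lock is free
        // CAS failed: a successor is mid-acquire; wait until it links in.
        while ((succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE)) == NULL)
            ;
    }
    // Hand over the lock: the one cache miss per handoff.
    __atomic_store_n(&succ->locked, 0, __ATOMIC_RELEASE);
}
```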
Performance Comparison
- MCS Lock: Flat performance. The time to acquire/release stays constant regardless of how many other threads are waiting, whereas TAS-based spinlocks degrade as contention grows. This is the gold standard for high-contention locks.
Continue here: 26 System IO and Device Drivers