Lecture from: 17.12.2025 | Video: Videos ETHZ
This is the final lecture. The discussion of devices concludes here, focusing on how hardware and software coordinate data transfers efficiently. The course then wraps up with information about the exam and a brief look at current systems research.
Direct Memory Access (DMA) Recap
Recall from the previous lecture that Programmed I/O, where the CPU is responsible for moving every byte of data, is to be avoided. Instead, Direct Memory Access (DMA) is used: the CPU offloads the transfer to a DMA controller by specifying a source address, a destination address, and a size, and can then execute other instructions while the transfer happens asynchronously.
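As a minimal sketch of that handoff, assuming a made-up MMIO register layout (real controllers differ in register names and details):

```c
#include <stdint.h>

/* Hypothetical MMIO register layout for a simple DMA controller. */
typedef struct {
    volatile uint64_t src;   /* physical source address       */
    volatile uint64_t dst;   /* physical destination address  */
    volatile uint32_t size;  /* transfer length in bytes      */
    volatile uint32_t ctrl;  /* bit 0: start (assumed layout) */
} dma_regs_t;

#define DMA_CTRL_START (1u << 0)

/* Program the controller and return immediately: the CPU is free to
 * run other code while the transfer proceeds asynchronously. */
static void dma_start(dma_regs_t *dma, uint64_t src, uint64_t dst, uint32_t size)
{
    dma->src  = src;
    dma->dst  = dst;
    dma->size = size;
    dma->ctrl = DMA_CTRL_START;  /* the "doorbell" write */
}
```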
To orchestrate this communication efficiently between the OS (Device Driver) and the Device (Hardware), Ring Buffers are used.
Buffer and Descriptor Rings
A ring buffer allows the OS to enqueue work and the device to dequeue work asynchronously.
- Buffer Ring: The ring contains the actual data packets. Requires large contiguous physical memory.
- Descriptor Ring: The ring contains descriptors (pointers). Each descriptor points to a data buffer located elsewhere in memory. This is more flexible as buffers can be scattered in physical memory.
The Ownership Bit
To avoid locks, each descriptor contains an ownership bit; a minimal descriptor layout is sketched after this list.
- Owned by OS: The device has finished with it (or hasn’t touched it yet). The OS is safe to modify it.
- Owned by Device: The OS has submitted it. The device is processing it. The OS must not touch it.
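A descriptor with an ownership flag might look as follows, assuming made-up field names and bit positions (real NICs pack these fields differently):

```c
#include <stdint.h>

/* One slot of a transmit descriptor ring (illustrative layout). */
struct tx_desc {
    uint64_t buf_addr;  /* physical address of the data buffer    */
    uint16_t len;       /* packet length in bytes                 */
    uint16_t flags;     /* bit 0 = ownership: 1 means device owns */
};

#define DESC_OWNED_BY_DEVICE (1u << 0)

static inline int owned_by_device(const struct tx_desc *d)
{
    return d->flags & DESC_OWNED_BY_DEVICE;
}
```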
I/O State Machines
The interaction between the driver and the device can be modeled as two communicating state machines. The example of Sending Packets (Transmit) is examined below.
1. The Hardware (Device) State Machine
The device conceptually oscillates between Stopped and Running; a C sketch of this loop appears after the list.
- Read Descriptor: The device reads the descriptor at its current Head pointer.
- Check Ownership:
    - If Owned by Device: The OS has provided data. The device reads the data buffer pointed to by the descriptor, sends the packet, and then writes back to the descriptor to mark it Owned by OS. It then advances the Head pointer.
    - If Owned by OS: The device has caught up to the driver. It has no work, so it stops and may request an interrupt to wake it up later.
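The device side of the transmit path as a conceptual C sketch (a real device implements this in hardware). It reuses struct tx_desc and owned_by_device() from above; transmit() and wait_for_doorbell() are hypothetical stand-ins:

```c
/* Assumed hardware actions, declared so the sketch is complete. */
extern void transmit(uint64_t buf_phys, uint16_t len);
extern void wait_for_doorbell(void);

void device_tx_loop(struct tx_desc *ring, unsigned ring_size)
{
    unsigned head = 0;  /* the device's Head pointer */
    for (;;) {
        struct tx_desc *d = &ring[head];
        if (!owned_by_device(d)) {
            /* Caught up with the driver: Stopped state. Optionally arm
             * an interrupt, then sleep until the OS rings the doorbell. */
            wait_for_doorbell();
            continue;
        }
        transmit(d->buf_addr, d->len);      /* DMA the payload out */
        d->flags &= ~DESC_OWNED_BY_DEVICE;  /* hand back to the OS */
        head = (head + 1) % ring_size;      /* advance Head        */
    }
}
```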
2. The Software (Driver) State Machine
The driver manages the Tail pointer and transitions between Running, Stopped, and Waiting; a matching C sketch appears after the list.
- Packet to Send: The driver checks the descriptor at Tail.
- Check Ownership:
    - If Owned by OS: The slot is free. The driver copies packet info into the descriptor, flips the bit to Owned by Device, and advances Tail.
    - If Owned by Device: The ring is full. The driver must Wait: it configures the device to raise an interrupt when a descriptor becomes free (i.e., when the device finishes sending a packet).
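The driver side, mirroring the state machine above and again reusing the tx_desc sketch; ring_doorbell() is a hypothetical MMIO write that wakes the device:

```c
extern void ring_doorbell(void);  /* assumed MMIO write to the device */

static unsigned tail;  /* the driver's Tail pointer */

int driver_send(struct tx_desc *ring, unsigned ring_size,
                uint64_t buf_phys, uint16_t len)
{
    struct tx_desc *d = &ring[tail];
    if (owned_by_device(d))
        return -1;  /* ring full: Wait state, arm a "descriptor free" IRQ */

    d->buf_addr = buf_phys;
    d->len      = len;
    /* A real driver issues a write barrier here so the device can never
     * observe the ownership bit before buf_addr and len are visible. */
    d->flags   |= DESC_OWNED_BY_DEVICE;  /* flip ownership last     */
    tail = (tail + 1) % ring_size;       /* advance Tail            */
    ring_doorbell();                     /* tell the device to look */
    return 0;
}
```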
DMA and Caches
A critical challenge with DMA is Cache Coherence. Two entities access main memory: the CPU (via its caches) and the DMA Controller (directly to RAM).
The Problem
If the CPU writes to a buffer (e.g., to send a packet), that data might sit dirty in the L1/L2 cache and not yet be in main memory. If the DMA controller reads from that physical address in main memory, it reads stale data (garbage), not the packet the CPU intended to send.
Conversely, if the Device writes a received packet to memory via DMA, but the CPU has a stale copy of that address in its cache, the CPU will read old data, not the new packet.
The Solutions
1. Hardware Coherence (x86/PCIe) On modern x86 systems with PCI Express, the hardware handles this. DMA transactions “snoop” the CPU caches. If a DMA read hits a dirty cache line, the hardware flushes it automatically. This makes the programmer’s life easy, but it is complex to build.
2. Software Management (Flush & Invalidate) On many other architectures (such as ARM or older systems), the OS must manage coherence explicitly, as sketched after this list.
- DMA Read (Device reads from RAM): Before the DMA starts, the CPU must Flush (clean) the cache range for that buffer. This forces dirty data out to main memory.
- DMA Write (Device writes to RAM): Before reading the data, the CPU must Invalidate the cache range. This ensures the next CPU load fetches fresh data from main memory.
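A sketch of both directions; cache_flush_range() and cache_invalidate_range() stand in for architecture-specific primitives (on ARM, for example, these map to cache-maintenance instructions):

```c
#include <stddef.h>

/* Assumed platform primitives for this sketch. */
extern void cache_flush_range(void *buf, size_t len);
extern void cache_invalidate_range(void *buf, size_t len);
extern void start_dma_to_device(void *buf, size_t len);
extern void start_dma_from_device(void *buf, size_t len);
extern void wait_for_dma_complete(void);

/* Transmit: device will read from RAM, so clean the cache first. */
void dma_send(void *buf, size_t len)
{
    cache_flush_range(buf, len);        /* push dirty lines out to RAM */
    start_dma_to_device(buf, len);
}

/* Receive: device wrote to RAM, so drop any stale cached copies
 * before the CPU reads the buffer. */
void dma_receive(void *buf, size_t len)
{
    start_dma_from_device(buf, len);
    wait_for_dma_complete();
    cache_invalidate_range(buf, len);   /* next loads fetch fresh data */
}
```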
Practice: Packet Processing with Descriptor Rings
Exercise: Producer-Consumer Coordination
Imagine a Receive Descriptor Ring with 1024 entries.
- The Head pointer (owned by hardware) is at index 500.
- The Tail pointer (owned by software) is at index 450.
Question: How many packets is the hardware currently processing?
Answer: Zero. In a receive ring, the software provides empty buffers to the hardware. If the Tail is behind the Head, it usually means the software has consumed everything up to the Head. The hardware is waiting for the software to “refill” the tail.
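To make the pointer gap concrete: the modular distance from Tail to Head counts the slots the software has consumed but not yet recycled back to the hardware, here 50. A snippet of the ring arithmetic with the values from the exercise:

```c
#include <stdio.h>

int main(void)
{
    unsigned ring_size = 1024, head = 500, tail = 450;
    /* Slots between Tail and Head, handling wrap-around. */
    unsigned to_refill = (head - tail + ring_size) % ring_size;
    printf("%u\n", to_refill);  /* prints 50 */
    return 0;
}
```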
Exercise: Performance Bottlenecks
If the Ownership bit is flipped to Hardware only once every 64 packets (batching), what is the benefit and the trade-off?
Answer:
- Benefit: Fewer “doorbell” rings (MMIO writes) to the hardware, saving CPU cycles.
- Trade-off: Increased latency for individual packets, and if the ring is small, it might cause packet drops if the batch isn’t processed fast enough.
Non-Cacheable Memory
Another option is to mark DMA buffers as Non-Cacheable in the page tables. This avoids coherence issues but severely impacts performance because every CPU access to the buffer must go all the way to slow main memory.
DMA and Virtual Memory
DMA controllers operate on Physical Addresses. However, user programs and the kernel mostly deal with Virtual Addresses.
Scatter-Gather
A buffer that is contiguous in virtual memory (e.g., a 1MB array) might be fragmented across many non-contiguous physical pages.
A simple DMA controller can only copy contiguous physical memory. To solve this, sophisticated DMA engines support Scatter-Gather. The OS provides a list (a vector) of (Physical Address, Length) pairs, and the DMA engine processes them one by one.
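Conceptually, a scatter-gather list is just an array of (physical address, length) pairs that the engine walks in order. A sketch, where dma_copy() is a hypothetical single-chunk transfer:

```c
#include <stdint.h>
#include <stddef.h>

/* One physically contiguous chunk of the buffer. */
struct sg_entry {
    uint64_t phys_addr;  /* start of the chunk    */
    uint32_t len;        /* chunk length in bytes */
};

extern void dma_copy(uint64_t src, uint64_t dst, uint32_t len);

/* The DMA engine processes the list one contiguous chunk at a time. */
void dma_scatter_gather(const struct sg_entry *sg, size_t n, uint64_t dst)
{
    for (size_t i = 0; i < n; i++) {
        dma_copy(sg[i].phys_addr, dst, sg[i].len);
        dst += sg[i].len;
    }
}
```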
The IOMMU
Modern systems include an IOMMU (Input-Output Memory Management Unit). Just as the MMU translates addresses for the CPU, the IOMMU translates addresses for devices.
- The device can use virtual addresses.
- The IOMMU intercepts the memory access and translates it to the correct physical address using page tables provided by the OS.
- This also provides protection, preventing a rogue device from overwriting kernel memory.
Discoverable Buses: PCI
How does the OS know what devices are plugged in and what addresses to use for their registers? This is solved by the bus protocol, specifically PCI (Peripheral Component Interconnect) and its successor PCIe.
Device Discovery
PCI is a tree structure. At the root is the Root Complex, connected to the CPU. Below are bridges and devices.
- The OS scans the bus at boot.
- Every PCI device has a standardized Configuration Header.
- The OS reads this header to determine the Vendor ID, Device ID, and Class (e.g., “Network Controller”).
Address Allocation (BARs)
The device tells the OS how much memory it needs via Base Address Registers (BARs) in the configuration header.
- The OS writes all-ones (0xFFFFFFFF) to the BAR; the device leaves its low-order size bits at zero, revealing how much address space it needs (always a power of two).
- The OS reads the value back, computes the size, and finds a free chunk of physical address space.
- The OS writes the chosen base address back into the BAR. Now, when the CPU accesses that address range, the PCI bridge routes the request to that device (a sketch of the probe follows).
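The probe works because a BAR's size bits ignore writes and always read back as zero. A sketch of the standard 32-bit memory-BAR probe, with pci_cfg_read32()/pci_cfg_write32() standing in for the platform's configuration-space accessors:

```c
#include <stdint.h>

extern uint32_t pci_cfg_read32(int bus, int dev, int fn, int off);
extern void     pci_cfg_write32(int bus, int dev, int fn, int off, uint32_t val);

/* Returns the size of a 32-bit memory BAR (always a power of two). */
uint32_t bar_size(int bus, int dev, int fn, int bar_off)
{
    uint32_t orig = pci_cfg_read32(bus, dev, fn, bar_off);

    pci_cfg_write32(bus, dev, fn, bar_off, 0xFFFFFFFFu);
    uint32_t probe = pci_cfg_read32(bus, dev, fn, bar_off);
    pci_cfg_write32(bus, dev, fn, bar_off, orig);  /* restore */

    probe &= ~0xFu;     /* mask off the low flag bits of a memory BAR */
    return ~probe + 1;  /* e.g. 0xFFFF0000 -> 0x10000 (64 KiB)        */
}
```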
Interrupts
- Legacy: Physical pins (INTA, INTB…).
- MSI (Message Signaled Interrupts): The device triggers an interrupt by writing a specific data value to a specific address in memory. This effectively turns an interrupt into a memory write, which is scalable and compatible with multi-core interrupt routing.
Bus Mastering
PCI allows Bus Mastering, which essentially means the device can initiate its own DMA transactions. It can act as the “master” of the bus to read/write system memory without constant CPU intervention.
Evolution of Devices
Development has moved far beyond simple peripherals.
- Polling/PIO: CPU does everything. Slow.
- Interrupts: CPU notified when data ready. Better.
- DMA: Device moves data. CPU coordinates.
- Smart Devices: Today, devices like GPUs and SmartNICs are full computers in their own right. They have their own processors, memory, and operating systems. The main CPU often acts merely as a coordinator or “liaison” for these accelerators.
Course Wrap-Up
This concludes the technical content of Systems Programming and Computer Architecture.
The Exam
- Date: Friday, January 30th.
- Format: Digital (Moodle/Code Expert), taken on campus.
- Content: Everything covered in lectures, exercises, and assignments.
- Aids: No internet. No notes. (Dictionary allowed).
Philosophy
The grading philosophy is reasonable. There is no intent to trick students with trivia. The goal is to test understanding and the ability to do systems programming.
- Students will not be asked to memorize obscure assembly instruction encodings.
- Students will be asked to read/write C code.
- Students will be asked to understand memory layout, pointers, and how hardware concepts (caches, virtual memory) impact software performance.
Exam Tips
- Read the Instructions: Take 30 seconds to calm down and understand the rules.
- Scan the Exam: Read all questions first. Check the point values.
- Do the Easiest Question First: Find something to be confident in. Get points on the board. It builds momentum and reduces panic.
- Manage Time: Don’t spend 20 minutes on a 2-point question.
- Compile Your Code: The last version submitted is graded. If a syntax error is introduced 10 seconds before the end, the tests cannot be run. Ensure the code compiles at the end.
Research at ETH
Systems research is about building over-engineered, flexible platforms to understand the future of hardware.
Cloud Computing (Prof. Klimovic)
Research focuses on the mismatch between legacy software stacks and modern hyperscale hardware.
- Serverless: Moving away from renting VMs to renting functions.
- Sailor: A system to optimize training of large AI models by navigating the complex trade-offs of available GPU types, costs, and performance in a heterogeneous cloud.
Hardware/Software Co-design (Prof. Roscoe)
Research focuses on the increasing complexity of hardware (e.g., manuals with 9,000 pages).
- Enzian: A research computer built at ETH. It combines a server-class CPU (48-core ARM) with a massive FPGA, connected via a coherent link. This allows researchers to simulate future hardware designs and monitor system behavior at a level of detail impossible with standard commercial servers.
Final Thoughts
Computers are unique because they can virtualize themselves, making one machine look like many, or many look like one. This concept appears everywhere: Virtual Memory, Processes, VMs, VLANs. Understanding these layers from the transistor up to the cloud is what defines a systems programmer.
Good luck with the exam!
Back to index