Lecture from: 16.12.2025 | Video: Videos ETHZ
The final frontier of this course has been reached. Until now, the CPU and Memory have been treated as a self-contained universe. However, a computer that cannot talk to the outside world is effectively a heater.
This chapter covers Devices. This is where the rubber meets the road in Systems Programming. In fact, looking at the Linux kernel source code, roughly 70% of it is device drivers. It is the messy, complex, and absolutely vital interface between software concepts and physical reality.
What is a Device?
To an OS programmer, a device is not just a physical object like a printer or a keyboard. It is a specific entity on the system bus that exposes a programmatic interface.
A device typically has four characteristics:
- Bus Location: It occupies a specific slot on a bus (like PCI or PCIe).
- Registers: A set of addressable locations used to command the device.
- Interrupts: A mechanism to poke the CPU when something happens.
- DMA: The ability to bypass the CPU and read/write main memory directly.
The Register Interface
The CPU talks to devices primarily through Device Registers. These should not be confused with CPU registers (like rax). These are interfaces to the device hardware.
- Load (Read): Gets status information or data.
- Store (Write): Sets configuration, sends commands, or sends data.
Architectures map these registers in two ways:
- Memory-Mapped I/O (MMIO): The registers appear as normal physical memory addresses and are accessed using standard `mov` instructions. This is the modern standard.
- Port-Mapped I/O (specific to x86): A separate, smaller 16-bit address space accessed via special instructions (`inb`, `outb`).
Registers ≠ RAM
Device registers are volatile.
Reading a register might have Side Effects, such as clearing an interrupt. Writing to a register might transmit a packet. Furthermore, the value in a register can change due to External Changes without the CPU touching it (e.g., a “Data Ready” bit flipping because a network packet arrived).
Case Study: The UART Driver
To understand the basics, the “Hello World” of device drivers is examined: the ns16550 UART (Universal Asynchronous Receiver/Transmitter). This is the classic Serial Port.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/Lecture-Notes/attachments/Pasted-image-20260107143606.png)
The Datasheet
A driver cannot be written without the Datasheet. This document is the bible for the hardware; it lists register offsets and bit definitions.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/Lecture-Notes/attachments/Pasted-image-20260107143551.png)
Programmed I/O (PIO)
The simplest way to drive a device is Polling. The CPU spins in a loop, asking the device “Are you ready?” repeatedly.
A simplified driver implementation is provided:
```c
#define UART_BASE 0x3f8 // Standard PC port address
#define UART_THR  0     // Transmit Holding Register (write here to send)
#define UART_RBR  0     // Receive Buffer Register (same offset, read to receive)
#define UART_LSR  5     // Line Status Register (read status here)

// Bit masks from the datasheet
#define LSR_DATA_READY 0x01
#define LSR_TX_EMPTY   0x20

void serial_putc(char c) {
    // 1. Poll: wait until the device says the transmit buffer is empty
    while ((inb(UART_BASE + UART_LSR) & LSR_TX_EMPTY) == 0);
    // 2. Action: write the character to the device
    outb(UART_BASE + UART_THR, c);
}

char serial_getc() {
    // 1. Poll: wait until data is ready
    while ((inb(UART_BASE + UART_LSR) & LSR_DATA_READY) == 0);
    // 2. Action: read the character
    return inb(UART_BASE + UART_RBR); // Offset 0 is the Receive Buffer when reading
}
```

Critique: This uses Programmed I/O. The CPU is involved in moving every single byte. Worse, it uses Polling, wasting 100% of the CPU's cycles just waiting for the slow serial hardware. A modern system should perform other work while waiting, which leads to Interrupts (covered in previous chapters) and DMA.
Direct Memory Access (DMA)
If terabytes of data are being transferred via a 100Gbps network card, having the CPU copy data byte-by-byte using mov instructions is unaffordable. The CPU is too valuable for that.
Direct Memory Access (DMA) is the solution. A DMA Controller is introduced, a specialized processor whose only job is to copy memory.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/Lecture-Notes/attachments/Pasted-image-20260107143635.png)
The Transaction Flow
- Setup: The OS allocates a buffer in RAM.
- Command: The OS tells the device: “Transfer N bytes starting at Address A.”
- Transfer: The Device (bus master) takes control of the memory bus and copies the data. The CPU is free to do other work.
- Interrupt: When finished, the Device interrupts the CPU to say “I’m done.”
This decouples data movement from data processing.
DMA and Caches
DMA introduces a nasty problem: Coherence. Two distinct entities now modify memory: the CPU and the Device. However, the CPU sees memory through its L1/L2/L3 caches, while the Device sees physical RAM directly.
The Inconsistency Problem
Case 1: Transmitting (CPU Writes, Device Reads)
- CPU writes data to a buffer. It sits dirty in the L1 Cache. RAM still holds the old data.
- CPU commands Device to read from RAM.
- Device reads the stale (old) data from RAM. Corruption.
Case 2: Receiving (Device Writes, CPU Reads)
- CPU reads a buffer. It is loaded into L1 Cache.
- Device writes new incoming data to RAM.
- CPU reads the buffer again. It hits the stale line in L1 Cache. Corruption.
The Solutions
Coherence must be enforced manually in the driver software.
- Disable Caching: DMA buffers can be marked as “Uncacheable” in the Page Table.
- Trade-off: Safe, but performance is terrible. The CPU is very slow when accessing uncached RAM.
- Explicit Flush/Invalidate:
- Sending (CPU → Device): Before the DMA starts, the OS must Flush (clean) the cache range. This forces dirty data out to RAM.
- Receiving (Device → CPU): Before the CPU reads the data, the OS must Invalidate the cache range. This forces the next CPU load to fetch fresh data from RAM.
Hardware Coherence?
On some architectures (like x86 with PCI), the hardware automatically snoops DMA traffic and invalidates caches. However, on many RISC architectures and SoCs (System on Chips), software management is mandatory.
Virtual vs. Physical Addresses
DMA controllers live in the physical world. They need Physical Addresses. However, the OS and User programs work with Virtual Addresses.
If a 1MB buffer is allocated in software (malloc), it looks contiguous. But physically, it might be scattered across hundreds of discontinuous 4KB pages.
If a simple DMA controller is told to “Copy 1MB starting here,” it will happily march off the end of the first physical page into unrelated memory.
Solutions:
- Scatter-Gather: The DMA controller accepts a list (a vector) of (Physical Address, Length) pairs, chaining them together.
- IOMMU: A hardware unit (analogous to the CPU's MMU) that sits between the Device and Memory. It translates Device Virtual Addresses to Physical Addresses, allowing the device to see a contiguous view of memory.
Practice: I/O and Coherence
Exercise: MMIO vs PMIO Register Access
An engineer needs to set bit 3 of a device register located at MMIO address 0x4000. Write the C code to do this safely.
Answer:
volatile uint32_t *reg = (volatile uint32_t *)0x4000;
*reg |= (1 << 3); // Read, modify, write
Important: The `volatile` keyword is mandatory. Without it, the compiler might optimize the access away or cache the value in a CPU register, failing to communicate with the hardware.
Exercise: DMA Coherence Flush
A driver is sending a 4KB buffer to a disk. The buffer is at address 0x2000 in RAM. The CPU just finished writing the data. What must the driver do before starting the DMA?
Answer: The driver must Flush (clean) the cache range for [0x2000, 0x3000). If it doesn’t, the DMA controller might read old data from RAM while the new data is still dirty in the CPU’s L1 cache.
High-Performance I/O: Descriptor Rings
For complex devices like Network Interface Cards (NICs), a continuous stream of packets must be handled. An interrupt for every single packet is not desired. The standard design pattern for this is the Descriptor Ring (a circular buffer).
![](/Semester-3/Systems-Programming-and-Computer-Architecture/Lecture-Notes/attachments/Pasted-image-20260107143715.png)
How it Works
- Shared Memory: A ring of “Descriptors” is allocated in main memory.
- Descriptor: A small struct containing:
- Pointer to the actual data buffer.
- Length of the buffer.
- Status/Ownership Bit.
- Ownership: This is the key synchronization primitive.
- Bit = 1 (Device Owned): The hardware can process this slot. The OS must not touch it.
- Bit = 0 (OS Owned): The hardware is done. The OS can process the results or refill the slot.
The Transmit Loop (Producer: OS, Consumer: Device)
- OS prepares a packet in a buffer.
- OS writes the buffer pointer into the next available descriptor at the Tail.
- OS flips the Ownership bit to Device.
- OS updates the device’s “Tail Pointer Register” (ringing the doorbell) to wake it up.
- Device reads the descriptor (DMA), reads the packet (DMA), and sends it.
- Device flips the Ownership bit back to OS and optionally interrupts.
The Receive Loop (Producer: Device, Consumer: OS)
- OS fills the ring with empty buffers and gives ownership to the Device.
- Device receives a packet off the wire.
- Device writes packet data into the buffer at the Head (DMA).
- Device flips Ownership bit to OS.
- Device interrupts.
- OS sees the bit flip, processes the packet, allocates a fresh buffer, and flips the bit back to Device.
Flow Control
- Overrun: Device receives packets faster than OS can process. The ring fills up (Head crashes into Tail). The device must drop packets.
- Underrun: Device wants to send, but the ring is empty. The device goes to sleep.
Example: The DEC Tulip NIC
To make this concrete, the DEC 21140A “Tulip” is examined, which is a classic Fast Ethernet card.
The Descriptor Structure
The Tulip uses a specific descriptor format:
- Word 0 (Status): Contains the OWN bit (Bit 31). If set, the Tulip owns it. If clear, the Host owns it.
- Word 1 (Control): Buffer sizes.
- Word 2 (Address 1): Pointer to the data buffer.
- Word 3 (Address 2): Pointer to a second data buffer OR a pointer to the next descriptor.
Chaining vs. Contiguous
Because of Word 3, the Tulip supports two modes:
- Ring Mode: Descriptors are in a contiguous array. The hardware wraps around automatically.
- Chain Mode: Descriptors form a linked list scattered in memory; `Address 2` points to the next descriptor node.
![](/Semester-3/Systems-Programming-and-Computer-Architecture/Lecture-Notes/attachments/Pasted-image-20260107144547.png)
Driver Initialization
Writing the driver involves a state machine:
- Reset: Write to `CSR0` (Bus Mode) to reset the hardware.
- Setup: Allocate the descriptor ring in memory. Write the physical base addresses to `CSR3` (Receive List Base) and `CSR4` (Transmit List Base).
- Start: Write to `CSR6` (Operation Mode) to put the device into the “Running” state.
Summary
This chapter closes the loop on how computers actually work.
- Descriptor Rings: The fundamental design pattern for asynchronous, lock-free communication between Hardware and Software.
As the course concludes, it is important to remember that modern systems are massive distributed systems on a single chip, with CPUs, GPUs, NICs, and IOMMUs all dancing around shared memory, trying not to step on each other’s toes.
Continue here: 27 Network IO and Course Wrap-up