Lecture from: 03.12.2025 | Video: Videos ETHZ

The last lecture ended with a cliffhanger. It was established that Virtual Memory is amazing, providing isolation, capacity, and management features. But a wall was hit regarding the Page Table size.

The Page Table Size Problem

Let’s do the math for a standard modern x86-64 system.

  • Virtual Address Space: 48 bits (currently utilized).
  • Page Size: 4 KB ($2^{12}$ bytes).
  • Page Table Entry (PTE) Size: 8 bytes.

If a simple Linear Page Table (an array) is used, an entry is needed for every possible virtual page: $2^{48} / 2^{12} = 2^{36}$ entries, each 8 bytes, giving $2^{39}$ bytes = 512 GB of page table per process.

This is absurd. Allocating 512 GB of contiguous RAM just to run Notepad is impossible. A way to store these translations is needed that reflects the reality of memory usage: Sparseness. Most processes use a tiny fraction of the address space (some code at the bottom, stack at the top, vast nothingness in between).

The Solution: Multi-Level Page Tables

Instead of one giant array, a tree structure (a hierarchy of tables) is used. This is another layer of indirection.

The Concept

  1. A Level 1 Table (the root) exists. It fits in one page (4KB).
  2. An entry in Level 1 points to a Level 2 Table.
  3. An entry in Level 2 points to a Level 3 Table, and so on.

Sparseness is Key

If a large chunk of virtual memory is unused, the corresponding entry in the Level 1 table is marked Null/Invalid. The Level 2, 3, or 4 tables for that region are simply not allocated.

x86-64 Implementation (4 Levels)

The 48-bit virtual address is split into indices for each level. (Note: x86 numbers the levels from the leaf upward, so the root of the tree is Level 4, not Level 1.)

  • Bits 47-39 (9 bits): Index into Level 4 (Page Map Level 4).
  • Bits 38-30 (9 bits): Index into Level 3 (Page Directory Pointer Table).
  • Bits 29-21 (9 bits): Index into Level 2 (Page Directory Table).
  • Bits 20-12 (9 bits): Index into Level 1 (Page Table).
  • Bits 11-0 (12 bits): Page Offset (VPO).

The Page Walk

The hardware (MMU) holds a special register (CR3 in x86) that points to the physical address of the Level 4 table for the current process.

  1. MMU reads CR3 to find L4 table.
  2. Uses VPN bits 47-39 to select an L4 Entry (PML4E). This points to the base of the L3 table.
  3. Uses VPN bits 38-30 to select an L3 Entry. This points to L2.
  4. …and so on, until the L1 Entry gives the Physical Page Number (PPN).

This solves the space problem elegantly. However, it introduces a performance problem. A single memory access now requires 4 extra memory reads just to find the address.
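To make the walk concrete, below is a minimal software model of the four-level lookup. The entry format, the flat `phys_mem` array, and the zero "page fault" sentinel are simplifying assumptions for illustration, not the real hardware or OS data structures.

```c
#include <stdint.h>
#include <string.h>

#define PTE_PRESENT 0x1ULL                    /* valid bit (bit 0)          */
#define ADDR_MASK   0x000FFFFFFFFFF000ULL     /* bits 51..12: frame address */

/* Read one 8-byte entry from a simulated physical memory. */
static uint64_t read_pte(const uint8_t *phys_mem, uint64_t table, uint64_t idx) {
    uint64_t pte;
    memcpy(&pte, phys_mem + table + idx * 8, sizeof pte);
    return pte;
}

/* Translate a 48-bit virtual address; returns 0 as a "page fault" sentinel. */
uint64_t page_walk(const uint8_t *phys_mem, uint64_t cr3, uint64_t vaddr) {
    uint64_t table = cr3 & ADDR_MASK;          /* physical base of L4 table  */
    /* Walk L4 -> L3 -> L2 -> L1; each level consumes 9 bits of the VA.      */
    for (int shift = 39; shift >= 12; shift -= 9) {
        uint64_t index = (vaddr >> shift) & 0x1FF;   /* 9-bit table index    */
        uint64_t pte   = read_pte(phys_mem, table, index);
        if (!(pte & PTE_PRESENT))
            return 0;                          /* not mapped: page fault     */
        table = pte & ADDR_MASK;               /* next table, or the PPN     */
    }
    return table | (vaddr & 0xFFF);            /* PPN + 12-bit page offset   */
}
```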

The TLB (Translation Lookaside Buffer)

Since accessing RAM is slow (~100s of cycles), doing it 4 times just to translate an address is unacceptable. These translations must be cached.

The Translation Lookaside Buffer (TLB) is a small, specialized hardware cache inside the MMU.

  • Input: Virtual Page Number (VPN).
  • Output: Physical Page Number (PPN) + Permission Bits.

TLB Operation

When the CPU generates a virtual address:

  1. Extract VPN: The MMU splits the VPN into a TLB Tag and TLB Index.
  2. Check TLB:
    • Hit: The PPN is obtained immediately. (Zero memory accesses!). This is the common case (>99%).
    • Miss: The hardware (or OS) must perform the Page Walk (accessing L4 → L3 → L2 → L1 in memory), fetch the PTE, place it in the TLB, and restart the instruction.
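As a rough illustration, here is a sketch of a small set-associative TLB lookup. The sizes (16 entries, 4-way, hence 4 sets) and the field names are assumptions for the example, not a specific real design.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_WAYS 4
#define TLB_SETS 4               /* 16 entries total, 4-way associative    */

typedef struct {
    bool     valid;
    uint64_t tag;                /* upper bits of the VPN                  */
    uint64_t ppn;
    uint8_t  perm;               /* permission bits cached alongside       */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SETS][TLB_WAYS];

/* Returns true on a hit and fills *ppn; a miss would trigger a page walk. */
bool tlb_lookup(uint64_t vpn, uint64_t *ppn) {
    uint64_t index = vpn % TLB_SETS;      /* low VPN bits select the set   */
    uint64_t tag   = vpn / TLB_SETS;      /* remaining bits are the tag    */
    for (int way = 0; way < TLB_WAYS; way++) {
        tlb_entry_t *e = &tlb[index][way];
        if (e->valid && e->tag == tag) {
            *ppn = e->ppn;                /* hit: zero extra memory reads  */
            return true;
        }
    }
    return false;                         /* miss: page walk, then refill  */
}
```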

Context Switches

Since each process has its own virtual address space, the translations in the TLB are only valid for the current process. On a context switch, the OS must either:

  1. Flush the entire TLB (expensive!).
  2. Use Address Space IDs (ASIDs) (tags in the TLB to identify which process owns the entry).
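A minimal sketch of option 2: each entry carries an ASID, and a lookup only hits when both the tag and the current process's ASID match. The field names are illustrative assumptions, extending the TLB entry sketched above.

```c
#include <stdbool.h>
#include <stdint.h>

/* TLB entry tagged with an Address Space ID so that translations of
 * different processes can coexist; no flush is needed on a switch. */
typedef struct {
    bool     valid;
    uint16_t asid;     /* which address space owns this translation */
    uint64_t tag;      /* upper VPN bits */
    uint64_t ppn;
    uint8_t  perm;
} tlb_entry_asid_t;

/* Hit condition becomes:
 *   e->valid && e->tag == tag && e->asid == current_asid               */
```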

Detailed Address Translation Examples

Let’s solidify this with the “Toy” memory system from the lecture (Slide 48).

Specs:

  • Virtual Address: 14 bits.
  • Physical Address: 12 bits.
  • Page Size: 64 bytes ($2^6$ bytes).
  • TLB: 16 entries, 4-way associative.
  • L1 Cache: 16 lines, 4-byte blocks, Direct Mapped.

Bit Breakdown:

  • Offset (VPO/PPO): 6 bits.
  • VPN: 8 bits ($14 - 6 = 8$).
  • PPN: 6 bits ($12 - 6 = 6$).

Example 1: Translating 0x03D4

Binary: 00 0011 1101 0100

  1. VPO: 010100 (0x14). VPN: 00001111 (0x0F).
  2. TLB Lookup:
    • Index (bottom 2 bits of VPN): 11 (3).
    • Tag (top 6 bits of VPN): 000011 (0x03).
    • Check TLB Set 3: (Slide 53) Entry with Tag 03 is Valid. PPN=0D. Hit!
  3. Physical Address: PPN 0x0D + Offset 0x14 → 0x354.
  4. Cache Lookup:
    • Block Offset (2 bits): 00.
    • Index (4 bits): 0101 (5).
    • Tag (6 bits): 001101 (0D).
    • Check Cache Set 5: (Slide 56) Tag matches 0D. Hit! Data: 0x36.

Example 2: Translating 0x0B8F

Binary: 00 1011 1000 1111

  1. VPO: 001111. VPN: 0x2E.
  2. TLB Lookup:
    • Index: 10 (2). Tag: 0x0B.
    • Check TLB Set 2: (Slide 59) No entry with Tag 0B. TLB Miss!
  3. Next Step: The MMU must go to the Page Table in memory (Page Walk). It checks index 0x2E. If valid, it loads the PPN into the TLB. If invalid → Page Fault.

Example 3: Translating 0x0020

Binary: 00 0000 0010 0000

  1. VPO: 100000. VPN: 0x00.
  2. TLB Lookup:
    • Index: 0. Tag: 0.
    • Check TLB Set 0: (Slide 62) Valid entry for tag 03, but not 00. TLB Miss!
  3. Page Table: (Slide 63) Entry 0 is Valid, PPN = 0x28. Page Table Hit.
  4. Physical Address: PPN 0x28 + Offset 0x20 → 0xA20 (1010 0010 0000).
  5. Cache Lookup:
    • Index: 1000 (8). Tag: 0x28.
    • Check Cache Set 8: (Slide 66) Tag is 24. We need 28. Cache Miss.
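The bit slicing used in all three examples can be reproduced with a few shifts and masks. This sketch only re-derives VPN, VPO, TLB set, and TLB tag for the toy system; the slide-specific TLB and cache contents are not modeled.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t vas[] = { 0x03D4, 0x0B8F, 0x0020 };   /* the three example VAs */
    for (int i = 0; i < 3; i++) {
        unsigned va  = vas[i];
        unsigned vpo = va & 0x3F;       /* low 6 bits: page offset          */
        unsigned vpn = va >> 6;         /* remaining 8 bits: VPN            */
        unsigned set = vpn & 0x3;       /* low 2 VPN bits: TLB set index    */
        unsigned tag = vpn >> 2;        /* high 6 VPN bits: TLB tag         */
        printf("VA 0x%04X: VPN 0x%02X, VPO 0x%02X, TLB set %u, TLB tag 0x%02X\n",
               va, vpn, vpo, set, tag);
    }
    return 0;
}
```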

Practice: Address Translation Math

Exercise: Multi-Level Table Calculation

Consider a 32-bit virtual address space with 4KB pages.

  • Question: If each Page Table Entry is 4 bytes, how many levels are needed to ensure every table fits in exactly one page?
  • Logic:
    1. A page holds $4\,\text{KB} / 4\,\text{B} = 1024 = 2^{10}$ entries.
    2. Each level handles 10 bits of the address.
    3. The VPN for a 32-bit address with 12-bit offset is 20 bits.
  • Answer: $20 / 10 = 2$ levels.

Exercise: TLB Reach

A system has 64 TLB entries.

  • Scenario A: Page size is 4KB. Reach is $64 \times 4\,\text{KB} = 256\,\text{KB}$.
  • Scenario B: Page size is 2MB (Large Pages). Reach is $64 \times 2\,\text{MB} = 128\,\text{MB}$.
  • Conclusion: Large pages significantly increase the amount of memory accessible without a TLB miss, benefiting applications with large data structures.

Integration: Caches and Translation

There are now two critical hardware lookups: The TLB (for translation) and the L1 Cache (for data). How are they ordered?

1. Physically Indexed, Physically Tagged (PIPT)

  • Process: Translate VA → PA using the TLB. Then use the PA to access the Cache.
  • Pro: Simple. No aliasing issues.
  • Con: Slow. The cache access is serialized after the TLB lookup.

2. Virtually Indexed, Virtually Tagged (VIVT)

  • Process: Use VA to access Cache directly. Ignore TLB for data fetch.
  • Pro: Extremely fast.
  • Con:
    • Homonyms: Same VA in different processes maps to different data.
    • Synonyms: Different VAs map to the same data. Requires flushing the cache on a context switch. Complex to manage.

3. Virtually Indexed, Physically Tagged (VIPT) - The Winner

This is the clever trick used in modern L1 caches. Recall that the Page Offset bits (e.g., bottom 12 bits for 4KB pages) do not change during translation.

If the Cache Index bits fit entirely within the Page Offset bits, the cache can be indexed using the Virtual Address bits while the TLB translates the upper bits.

Parallel Operation:

  1. Send VA bits to Cache (Index) and TLB (Tag/Index) simultaneously.
  2. Cache reads out the data and tags.
  3. TLB outputs the Physical Address (PPN).
  4. Compare the TLB’s Physical Address against the Cache’s Physical Tags.

Constraint: For this to work, Cache Size / Associativity must be ≤ Page Size. (This often limits L1 cache size).
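For instance, an assumed 32 KB, 8-way L1 with 64 B lines satisfies the constraint (32 KB / 8 = 4 KB per way), so both the block offset and the set index come from the page-offset bits:

```c
#include <stdint.h>

typedef struct {
    uint32_t block_offset;   /* VA bits 5..0  : inside the 64 B line      */
    uint32_t set_index;      /* VA bits 11..6 : selects one of 64 sets    */
} l1_index_t;

/* Both fields lie within the 12-bit page offset, so the cache set can be
 * selected from the virtual address while the TLB translates bits 47..12. */
l1_index_t l1_split(uint64_t vaddr) {
    l1_index_t f;
    f.block_offset = vaddr & 0x3F;
    f.set_index    = (vaddr >> 6) & 0x3F;
    return f;
}
```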

Large Pages

4KB pages are usually used. But x86 allows the use of Large Pages (2 MB) or Huge Pages (1 GB).

How? In the Page Walk, a bit in the Directory Entry (e.g., at Level 2) says “Stop here. This isn’t a pointer to a Level 1 table; this is the physical page.”

Why use them?

  1. TLB Reach: One TLB entry now covers 2MB or 1GB of RAM instead of just 4KB.
  2. Performance: If a program has a massive working set (like a database or matrix solver), using 4KB pages requires thousands of TLB entries, leading to TLB thrashing. Large pages drastically reduce TLB misses.
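On Linux, one (OS-specific) way to request 2 MB large pages explicitly is mmap with MAP_HUGETLB; this is a hedged sketch, and it only succeeds if huge pages have been reserved by the administrator, so it falls back to normal 4 KB pages on failure.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 2 * 1024 * 1024;            /* one 2 MB large page          */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {                   /* no huge pages reserved?      */
        perror("mmap(MAP_HUGETLB)");
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,   /* fall back: 4 KB pages */
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    /* With huge pages, this 2 MB region needs 1 TLB entry instead of 512. */
    return p == MAP_FAILED;
}
```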


Optimization: Matrix Multiplication Revisited

Matrix Multiplication ($C = A \cdot B$) was previously optimized using Blocking to fit data into the L1/L2 Cache.

Constraint: the three $b \times b$ blocks in use (from A, B, and C) must fit in the cache, i.e. roughly $3b^2$ elements. An optimal block size for L2 was found this way.

The Hidden Problem: With a large matrix (e.g., 800 × 800), accessing a column of matrix B involves striding through memory. Each row of the matrix is likely on a different 4KB Page. If 800 different rows are accessed, 800 different pages are touched. A typical TLB has only ~128 entries. Result: TLB Thrashing. Every single memory access causes a TLB miss + Page Walk.

Solution 1: Smaller Blocks. Wait, two constraints must be satisfied:

  1. Data fits in Cache.
  2. Page translations fit in the TLB. Solving for the TLB constraint (the pages touched by the active blocks must fit in the ~128 TLB entries), a block size of roughly 42 is obtained. This works, but a block size of 42 is too small to get maximum efficiency out of the data cache and prefetchers.

Solution 2: Copy Optimization (Buffering). Instead of striding through the large original matrix, the block is copied into a contiguous temporary buffer on the stack/heap (see the sketch after this list).

  • Cost: Copying a block takes extra time ($O(b^2)$ per block).
  • Benefit: The computation on that block is $O(b^3)$, so the copy cost is amortized. The temporary buffer is contiguous, so it fits into a minimal number of pages (high TLB locality) and fits perfectly in L1 cache (high Data locality).
  • Result: the block size can be pushed back up toward the cache-optimal value, getting the best of both worlds.

  • TLB Reach matters: algorithms must optimize for Page Locality (TLB) just as much as Data Locality (Cache).
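A sketch of the copy optimization, assuming illustrative sizes (800 × 800 doubles, block size 100) rather than the lecture's exact numbers: each block is packed into a small contiguous buffer before the inner compute loops, so the hot data touches only a handful of pages.

```c
#define N  800                    /* assumed matrix dimension (row-major)   */
#define BS 100                    /* assumed block size; N % BS == 0        */

static double Apack[BS * BS];     /* contiguous copies: few pages, high     */
static double Bpack[BS * BS];     /* TLB and L1 locality                    */

/* Copy a BS x BS block (row stride N) into a contiguous buffer. */
static void pack(double *dst, const double *src, int rows, int cols) {
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            dst[r * cols + c] = src[r * N + c];   /* strided -> contiguous  */
}

/* C += A * B; caller zero-initializes C. */
void matmul_blocked_packed(const double *A, const double *B, double *C) {
    for (int i0 = 0; i0 < N; i0 += BS)
        for (int k0 = 0; k0 < N; k0 += BS) {
            pack(Apack, &A[i0 * N + k0], BS, BS);       /* O(BS^2) copy      */
            for (int j0 = 0; j0 < N; j0 += BS) {
                pack(Bpack, &B[k0 * N + j0], BS, BS);   /* O(BS^2) copy      */
                /* O(BS^3) compute on contiguous buffers amortizes the copies. */
                for (int i = 0; i < BS; i++)
                    for (int k = 0; k < BS; k++) {
                        double a = Apack[i * BS + k];
                        for (int j = 0; j < BS; j++)
                            C[(i0 + i) * N + (j0 + j)] += a * Bpack[k * BS + j];
                    }
            }
        }
}
```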

Continue here: 24 Multiprocessing and Multicore Architecture