08 Garbage Collection and Software Engineering Principles

Lecture from: 08.10.2025 | Video: Videos ETHZ

Implicit Memory Management: Garbage Collection

The alternative to C’s manual free() is garbage collection (GC), the automatic reclamation of heap-allocated storage.

The core question for a GC is: How does the memory manager know when memory can be freed?

The answer is based on reachability. A block of memory is considered “garbage” if it is no longer possible for the application to reach it.

Memory as a Graph

Memory can be viewed as a directed graph:

Nodes: Each allocated block in the heap is a node.
Edges: A pointer from one block to another is a directed edge.
Root Nodes: These are locations outside the heap that contain pointers into the heap. The roots include:
- Global variables.
- Local variables on the stack.
- CPU registers.

Reachability

A heap block is reachable if there is a path of pointers from any root node to that block. Any block that is not reachable is garbage and can be safely reclaimed.

Mark-and-Sweep Collection

A classic and influential GC algorithm is Mark-and-Sweep, first developed by John McCarthy in 1960 for Lisp.

It operates in two phases when the system runs out of memory:

Mark Phase: Traverses the memory graph starting from the root nodes. It follows every pointer and sets a special mark bit in the header of every reachable block.
Sweep Phase: Scans the entire heap from beginning to end. For each block it encounters:
- If the mark bit is set, the block is reachable. The bit is cleared for the next GC cycle.
- If the mark bit is not set, the block is unreachable garbage. The block is freed and added back to the free list.

Conservative Garbage Collection for C

Implementing GC for C is possible but difficult. In languages like Java, the runtime knows exactly which variables are pointers. In C, a long and a pointer can have the same bit pattern.

A conservative garbage collector for C assumes that if a value on the stack or in a register looks like it could be a pointer into the heap (i.e., its numeric value falls within the heap’s address range), it is a pointer. It marks the corresponding block as reachable. This means it might fail to collect some garbage (if an integer happens to look like a pointer), but it will never accidentally free valid memory.

Common Memory Pitfalls

Manual memory management is unforgiving. Several common memory-related bugs exist in C.

Dereferencing Bad Pointers: Passing a variable itself when a pointer to it is expected.

// WRONG
int val;
scanf("%d", val); // scanf expects an address, but gets garbage value of val
 
// CORRECT
scanf("%d", &val); // Pass the address of val

Reading Uninitialized Memory: malloc provides garbage data. Reading from it before writing results in undefined behavior.

// WRONG
int *y = malloc(N * sizeof(int));
for (i=0; i<N; i++) {
    for (j=0; j<N; j++) {
        y[i] += A[i][j] * x[j]; // Reads uninitialized y[i]!
    }
}
 
// CORRECT: Use calloc or initialize y[i] to 0 in the outer loop.

Overwriting Memory:
- Size Errors: Allocating space for an array of ints but using sizeof(int) instead of sizeof(int *) for an array of pointers.
- Off-by-one errors: Using <= instead of < in a loop condition.
- Unbounded string copies: Using dangerous functions like gets() which do not check buffer sizes.
Dangling Pointers: Returning an address of a local variable that is about to be deallocated.
Memory Leaks: Failing to free memory that is no longer reachable.
Double Free: Calling free on the same pointer twice.

Design and Debugging

Writing code is only part of programming. Writing good code that is correct, robust, maintainable, and understandable is essential. This section covers the philosophy of designing clean code and the systematic art of finding and fixing bugs.

The Origin of “Bug”

The term “bug” has a literal origin. On September 9, 1947, engineers at Harvard found the source of a malfunction in the Mark II computer: a moth trapped in a relay. They taped the insect into their logbook with the note: “First actual case of bug being found.” Removing it was “debugging.”

Defects, Errors, & Failures

Software engineering uses precise terminology for bugs.

A programmer creates a defect (or “bug”). This is a mistake in the code.
The defect may cause an error. This is an incorrect internal state, like a wrong variable value.
The error propagates. The incorrect state corrupts more of the program’s state.
Eventually, the error may cause a failure. This is an observable, incorrect result (e.g., wrong output, crash).

Why is an error not necessarily a failure?

An error might be corrected by later code, affect unused state, or the program might terminate before the error causes a visible failure.

The Curse of Testing

Edsger Dijkstra famously articulated a truth about software quality:

“Testing can only show the presence of bugs – not their absence.”

Not every defect causes a failure on every run.

A defect can be latent, hidden in rarely executed code.
A defect might only become active under specific conditions.

Thorough testing aims to turn latent defects into observable failures.

Debugging: The Classic Mystery Game

Debugging is the process of working backward from an observed failure to find and remove the underlying defect.

Testing Your Code

Good debugging starts with good testing.

Test Incrementally: Write tests during development. After adding a feature, write a test for it immediately.
Test the Simple Parts First: Build confidence in helper functions before testing complex functions.
Test Boundary Conditions: Bugs often hide in edge cases. Test:
- Empty inputs (e.g., empty list, empty string).
- Single-item inputs.
- Smallest and largest possible values.
- Inputs exactly at the limit.

Example: Test Cases for a Sorted Linked List

Consider a function to insert a value into a sorted linked list. Essential test cases include:

Empty list: Inserting into a NULL list.
Single-item list:
- Insert a value that already exists.
- Insert a smaller value.
- Insert a larger value.
Multi-item list:
- Insert at the beginning.
- Insert at the end.
- Insert in the middle.
- Insert an existing value in the middle.

Thinking through these cases before writing code is a powerful design technique.

General Debugging Advice

A systematic approach aids debugging.

Examine the Most Recent Change: If testing iteratively, the bug is likely in the recent code.
Simplify the Input: Find the smallest test case that triggers the failure.
Use a Debugger (gdb): For crashes, get a stack trace (bt).
Sanity-Check Intermediate Results: Don’t assume code behavior.
- Use printf to trace execution and inspect values.
- Use the debugger to step through code.
Understand the Root Cause: Avoid quick fixes without understanding why they work.
Check for Similar Mistakes: If a logical error exists in one place, it may exist elsewhere.

Diagnosing Failures: The Scientific Method

Debugging applies the scientific method to code.

Observe: Compare expected and actual output.
Hypothesize: Form a specific hypothesis to explain the failure.
Predict & Test: Develop a fix and predict the new output.
Confirm: Run the test to confirm resolution.

Correlation Does Not Imply Causation

Be wary of the post hoc fallacy. A change might make the bug vanish without fixing the root cause, possibly just shifting memory. Understanding the relationship between defect and fix is rigorous.

Debugging Tools

Debugger (gdb): Primary tool for inspecting program state, call stack, and variable values.
valgrind: Detects memory errors gdb misses, such as uninitialized variables and memory leaks.

Design: Managing Complexity and Communication

Good software design aims to minimize future bugs. Two main themes unify design goals:

Complexity Management: Techniques to keep software complexity manageable.
Communication: Code is for people as well as computers.

The overarching goal is readability.

Complexity Management Techniques

Separation of Concerns: Each function or module should do “one thing” well.
Modularity & Encapsulation: Divide the system into independent modules with clean interfaces, hiding internal state.
Abstraction: Hide implementation details behind simple interfaces.

Communication

Code communicates with:

The machine.
Other developers.
Code reviewers.
The future self.

Techniques include writing good tests, clear names, and effective comments.

Naming is Understanding

“If you don’t know what a thing should be called, you cannot know what it is. If you don’t know what it is, you cannot sit down and write the code.” - Sam Gardiner

Convey Meaning and be Accurate: Names should describe what a variable or function is or does. Avoid meaningless names.
Use Short Names for Locals, Longer for Globals: Short names suffice for small scopes; global variables require descriptive names.
Be Consistent: Stick to a naming convention and use opposites consistently.

Comments: The Why, Not the What

Good comments explain why, not what.

Comment Don'ts

Don’t say what the code does: i++; // Increment i is useless.

Don’t explain awkward logic: If explanation is needed, rewrite the code.

Don’t add too many comments: They clutter code and age poorly.

What Comments Should Do

Good comments answer:

Why does this code exist?

What are the pre-conditions and post-conditions?

Are there non-obvious limitations or edge cases?

Are there alternatives?

Coding by commenting is a useful technique: outline logic with comments before writing code.

// Example for quick sort:
// Initialize locals
// Pick a pivot value
// Reorder array around the pivot
// Recurse

Practice: Reachability and GC

Understanding the memory graph is key to predicting garbage collector behavior.

Exercise: Identifying Reachability

Consider the following heap structure:

Root A points to Block 1.
Block 1 points to Block 2.
Block 2 points to Block 3.
Block 4 points to Block 2.

Scenario: The user sets Root A = NULL.

Which blocks are now unreachable?
- Answer: Block 1, Block 2, Block 3, and Block 4.
- Explanation: Although Block 4 points to Block 2, there is no path FROM A ROOT to Block 4. Thus, the entire chain becomes garbage.
If this were a reference counting collector (another GC type), what would happen to Block 2?
- Answer: It might leak if the collector doesn’t handle cycles (though there isn’t a cycle here, Block 2 has an incoming edge from Block 4). A reference counter would see Block 2 has 1 incoming edge (from 4), but since 4 has 0 incoming edges from roots, they are all dead.

Exercise: Identify the Leak

Is there a leak here?

char *get_name() {
    char *name = malloc(10);
    strcpy(name, "Alice");
    return name;
}
 
int main() {
    get_name();
    return 0;
}

Answer: Yes. The get_name function returns a pointer to heap memory, but main discards that pointer immediately. The allocated memory is now unreachable and leaked.

Continue here: 09 From C to Rust - Safety and Ownership

CS Notes

Explorer

08 Garbage Collection and Software Engineering Principles

Implicit Memory Management: Garbage Collection

Memory as a Graph

Mark-and-Sweep Collection

Conservative Garbage Collection for C

Common Memory Pitfalls

Design and Debugging

The Origin of “Bug”

Defects, Errors, & Failures

The Curse of Testing

Debugging: The Classic Mystery Game

Testing Your Code

Example: Test Cases for a Sorted Linked List

General Debugging Advice

Diagnosing Failures: The Scientific Method

Debugging Tools

Design: Managing Complexity and Communication

Complexity Management Techniques

Communication

Naming is Understanding

Comments: The Why, Not the What

Practice: Reachability and GC

Exercise: Identifying Reachability

Exercise: Identify the Leak

Table of Contents

Graph View

Backlinks