Lecture from: 24.09.2025 | Video: Videos ETHZ

A Deeper Look at Strings

In many modern languages, strings are a fundamental, built-in type. They come with a rich set of methods, automatic memory management, and a sense of safety. C takes a different path. It’s a philosophy you’ll see again and again: C gives you the fundamental building blocks and trusts you to assemble them correctly.

The C “String”: A Convention, Not a Type

C has no real string type. Instead, it has a simple but powerful convention:

A “string” is an array of chars.
This array is terminated by a special character: the null byte, written as \0 or simply the integer 0.

This null terminator is the key. It’s how library functions know where the string ends. Without it, a function like printf would just keep reading memory until it hit a random zero byte or crashed.

Let’s look at two ways to initialize a character array, which are functionally identical:

#include <stdio.h>
 
int main(int argc, char *argv[]) {
    // The string literal is syntactic sugar for the array initialization below.
    // The compiler automatically adds the null terminator '\0'.
    char s1[6] = "hello";
 
    // Manually creating the same array of characters.
    // We must explicitly add the null terminator (0).
    char s2[6] = { 'h', 'e', 'l', 'l', 'o', 0 };
 
    printf("s1 = '%s'\n", s1);
    printf("s2 = '%s'\n", s2);
}

The string literal "hello" is just convenient shorthand. The compiler translates it into an array of characters and appends the \0 for you.

Visualizing String Operations in Memory

Let’s trace what happens in memory. When you declare char name1[12];, the compiler allocates a contiguous block of 12 bytes on the stack. Initially, their contents are garbage, whatever was in that memory location before.

A string literal like "Rosalinda" also exists in memory, typically in a read-only data segment of the program. It’s an array of 10 bytes (9 for the letters, 1 for \0).

Now, consider the function strncpy(name1, "Rosalinda", 12);. This function copies bytes.

It starts copying from the source ("Rosalinda") to the destination (name1).
It copies the 9 characters of “Rosalinda”.
It copies the null terminator \0.
It has now copied 10 bytes, but we told it to copy up to 12. strncpy will continue to pad the destination with null bytes until it has written n bytes in total.

The result in name1 is:

This padding behavior is a safety feature. It ensures that the destination buffer is always null-terminated if it’s large enough. Let’s trace a more complex example involving strncpy and strncat.

We execute the following code:

strncpy(mixed, name1, 25);
strncat(mixed, " & ", 25);
strncat(mixed, name2, 25);

After strncpy(mixed, name1, 25);: The contents of name1 (“Rosalinda”) are copied into mixed, which is then padded with \0s up to 25 bytes.
After strncat(mixed, " & ", 25);: strncat finds the first \0 in mixed (right after ‘a’) and starts copying ” & ” from there.
After strncat(mixed, name2, 25);: It finds the new \0 (after ” & ”) and copies name2 (“Zeke”) there.

strncpy vs. strcpy

The older strcpy function is notoriously unsafe. It copies from src to dest until it hits a \0 in src, without any bounds checking. If src is larger than dest, it will happily write past the end of dest, corrupting other data. This is a classic “buffer overflow” vulnerability. Always prefer strncpy and its safer modern alternatives.

Assertions: Guard Rails for Programmers

When you write a function, you operate under a set of assumptions. “This pointer will not be null.” “This count must be positive.” These are the invariants and pre-conditions that must hold for your code to be correct.

An assertion is a way to state these assumptions directly in your code. It’s a declaration to yourself and other programmers: “I assert that this condition is true. If it’s not, something is fundamentally wrong with the program’s logic, it’s a bug.”

How `assert` Works

You use it by including <assert.h> and calling the assert macro:

#include <assert.h>
 
void array_copy(int a[], int b[], size_t count) {
    // It's a bug to call this with null pointers.
    // We state this assumption explicitly.
    assert(a != NULL);
    assert(b != NULL);
 
    for(int i = 0; i < count; i++) {
        a[i] = b[i];
    }
}

Here is the contract assert makes:

At runtime, the expression inside assert() is evaluated.
If the expression is true (non-zero), nothing happens. The program continues.
If the expression is false (zero), the program immediately aborts. It prints a diagnostic message detailing what failed, where it failed (file, line number, function name), and then terminates, often dumping core for debugging.

The Philosophy of Assertions

Assertions are for Programmers, Not Users

An assertion failure signifies a bug. It’s an unexpected, internal contradiction in the program’s state. It is not for handling predictable errors, like invalid user input or a file not being found. For those, use if/else blocks and print helpful error messages. Crashing with an assertion is hostile to the user.

Why is `assert` a Macro?

assert must be a macro, not a function. Why? A function would only receive the result of the expression (true or false). It wouldn’t know:

The source code of the expression itself (e.g., the text a != NULL).
The file and line number where the check occurred.

The C preprocessor, however, can access this information and bake it into the macro expansion, giving us those wonderfully precise error messages. Because it’s a macro, you must ensure the expression has no side effects. An expression like assert(x++ > 0) is dangerous. If assertions are compiled out (by defining NDEBUG), x++ will never happen, and your program’s behavior will change.

On Style

Writing C is like working with a powerful, sharp tool. You can build beautiful, efficient things, but you can also make a mess. Good style isn’t just about aesthetics; it’s about safety, clarity, and maintainability.

Core Principles of Good C Style

Consistency is Key: Pick a code style (indentation, brace placement, naming) and stick to it. If you’re contributing to an existing project, adopt its style.
Write Clear Code:
- Indent to show structure.
- Use parentheses to clarify operator precedence. Don’t make readers guess.
- Break complex expressions into smaller, intermediate variables with meaningful names.
Avoid Magic Numbers: Don’t sprinkle raw numbers like 1024 in your code. Use const, enum, or #define to give them names.

Don’t Write C Like This

C’s flexibility allows for extreme cleverness, which is often the enemy of clarity. The International Obfuscated C Code Contest (IOCCC) celebrates this by awarding prizes for the most confusing, yet functional, programs.

This program, believe it or not, compiles and runs. Its purpose? To generate and print a random maze.

A Fascinating Diversion

While you should never write code like this professionally, studying obfuscated code can be a fun way to learn the dark corners of a language’s syntax and semantics. But remember: clarity trumps cleverness, always.

Chapter 3: Representing Integers in C

We often think of int as representing the mathematical integers ( $Z$ ). This is a useful abstraction, but a leaky one. A computer’s integers are finite. They have fixed sizes (like 32 or 64 bits), and this finiteness leads to behaviors that can be surprising if you’re not prepared.

Encoding Integers

A $w$ -bit integer is just a vector of $w$ bits. How we interpret those bits defines the number.

Unsigned Integers: A standard binary representation. A $w$ -bit vector $x$ represents the number:
$B 2 U (x) = i = 0 \sum w - 1 x_{i} \cdot 2^{i}$
Signed Integers (Two’s Complement): The most significant bit (MSB) has a negative weight.
$B 2 T (x) = - x_{w - 1} \cdot 2^{w - 1} + i = 0 \sum w - 2 x_{i} \cdot 2^{i}$

Little vs. Big Endian

Endianness refers to the order of bytes in memory for a multi-byte number.

Big Endian (e.g., SPARC, old PowerPC): Stores the most significant byte at the lowest memory address. (Reads like a number written on paper).
Little Endian (e.g., x86-64, ARM): Stores the least significant byte at the lowest memory address.

Bit-Level Operations and Shifts

C provides operators that manipulate these bit vectors directly.

Operator	Name	Description
`&`, `\|`, `^`	Bitwise AND, OR, XOR	Applied bit-by-bit to operands.
`~`	Bitwise NOT	Flips every bit.
`<<`, `>>`	Left Shift, Right Shift	Moves all bits left or right.

Shift Operations in Detail

Left Shift x << y: Fills with 0s on the right. Equivalent to multiplication by $2^{y}$ .
Right Shift x >> y:
- Logical Shift: If x is unsigned, fills with 0s on the left.
- Arithmetic Shift: If x is signed, it performs sign extension, filling with copies of the original sign bit. This preserves the number’s sign.

Let’s trace an arithmetic shift: (signed char)0b10100010 >> 2

The number is 10100010. It’s signed, so the MSB 1 means it’s negative.
We shift right by 2. The 10 on the right are discarded.
The two new bit positions on the left must be filled. Because it’s an arithmetic shift, we copy the original sign bit (1).
The result is 11101000.

Signed vs. Unsigned: The Danger Zone

The Golden Rule: If an operation involves both a signed and an unsigned integer, the C compiler will implicitly cast the signed value to unsigned before performing the operation.

Let’s dissect the surprising comparison -1 < 0U:

The operands are a signed int (-1) and an unsigned int (0U).
The rule applies: the signed -1 is cast to unsigned int.
The 32-bit two’s complement representation of -1 is 0xFFFFFFFF (all ones).
Interpreting 0xFFFFFFFF as an unsigned integer gives $2^{32} - 1$ , or UMax.
The comparison becomes 4294967295 < 0, which is false.

Sign Extension

When converting from a smaller integer type to a larger one (e.g., short to int), C preserves the value.

For unsigned values, it pads with 0s (zero extension).
For signed values, it pads by copying the sign bit (sign extension).

Integer Arithmetic in C

Now that we understand how integers are represented as bits, we can explore how arithmetic operations, addition, subtraction, multiplication, and division, work on them. The behavior of C’s arithmetic is a direct reflection of the underlying hardware. It’s efficient and predictable, but it is not the same as the arithmetic of mathematical integers.

Negation: The Foundation of Subtraction

In two’s complement, subtraction x - y is simply implemented as addition: x + (-y). This makes negation a critical operation. We’ve already seen the rule for negation:

- x =\sim x + 1

Let’s prove why this works. We know that for any x, the sum x + (~x) results in a bit pattern of all ones.

Example: An 8-bit x = 10 (00001010)

  00001010   (x)
+ 11110101   (~x)
----------
  11111111   (All ones)

In two’s complement, a pattern of all ones represents the number -1. So, we have the identity:

x + (\sim x) = - 1

With a little algebra, we can rearrange this:

- x =\sim x + 1

This simple, elegant identity is why two’s complement is universally used in modern computers. It allows the same adder circuit to be used for both addition and subtraction, with only a small amount of extra logic (an inverter and the ability to add 1 via the carry-in bit).

Unsigned Addition: Arithmetic on a Clock

When we add two $w$ -bit unsigned integers, the hardware performs standard binary addition. If the sum requires more than $w$ bits, the most significant bit (the “carry-out”) is simply discarded.

This behavior is equivalent to arithmetic modulo $2^{w}$ . The numbers “wrap around.”

U A d d_{w} (u, v) = (u + v) (mod 2^{w})

If we visualize the true mathematical sum of two 4-bit numbers, we get a smooth, planar surface.

However, when we visualize unsigned addition in C (with its wrap-around), the plane is abruptly cut off and wraps back to the bottom, creating a discontinuity.

Mathematical Properties of Unsigned Addition

This system of modular addition isn’t just a quirk; it forms a well-defined mathematical structure called an Abelian group. It is:

Closed: Adding two unsigned numbers yields an unsigned number.

Commutative: u + v == v + u.

Associative: (t + u) + v == t + (u + v).

Has an Identity: u + 0 == u.

Has an Inverse: For every u, there’s a -u such that u + (-u) == 0.

Two’s Complement Addition: The Same Bits, Different Meaning

Here is one of the most beautiful aspects of two’s complement: the hardware required for signed addition is exactly the same as for unsigned addition. The ALU adds the bit patterns together and discards the carry, regardless of whether you, the programmer, interpret those bits as signed or unsigned.

// This will always be true in C
int s, t, u, v;
s = (int) ((unsigned) u + (unsigned) v);
t = u + v;
// s == t will hold

The difference lies entirely in how we interpret the result, especially when overflow occurs.

Positive Overflow: When you add two large positive numbers, the result can wrap around and become negative. The sign bit, which was 0, flips to 1.
- Example: TMax + 1 → 011...1 + 1 → 100...0 → TMin.
Negative Overflow: When you add two large (in magnitude) negative numbers, the result can wrap around and become positive.
- Example: TMin + TMin → 100...0 + 100...0 → (1)000...0. The carry (1) is discarded, leaving 0.

Visualizing this gives us the same planar surface as unsigned addition, but shifted so that the origin (0,0) is in the middle. The wrap-around cliffs now represent positive and negative overflow.

Multiplication: Truncation is Key

When you multiply two $w$ -bit numbers, the full mathematical result can require up to $2 w$ bits. A 32-bit CPU’s multiplier hardware will indeed produce a 64-bit result, often storing it in two separate 32-bit registers.

However, C’s * operator for standard integer types only returns a $w$ -bit result. What happens to the rest? They are simply truncated (discarded).

Unsigned Multiplication: Truncating the result to $w$ bits is equivalent to taking the result modulo $2^{w}$ .
$U M u l t_{w} (u, v) = (u \cdot v) (mod 2^{w})$
This, combined with unsigned addition, forms a commutative ring.
Signed Multiplication: The bit-level operation is again identical for the lower $w$ bits. The hardware is the same, and the truncated result is the same. The only difference is the interpretation.

Power-of-2 Arithmetic with Shifts

Multiplication and division are computationally expensive operations. Shifts, on the other hand, are extremely fast, often taking a single clock cycle. Compilers are masters at replacing expensive arithmetic with cheap shifts whenever possible.

Multiplication by Powers of 2

This is the simple case. To multiply by $2^{k}$ , you can left-shift by k.

x \cdot 2^{k} \equiv x ≪ k

This works for both signed and unsigned integers (as long as overflow doesn’t change the sign). Compilers use this to synthesize multiplication by any constant. For example, x * 12 can be computed as:

x \cdot 12 = x \cdot (8 + 4) = (x \cdot 8) + (x \cdot 4) = (x ≪ 3) + (x ≪ 2)

Unsigned Division by Powers of 2

For unsigned numbers, division by $2^{k}$ is a logical right shift by k.

⌊ u / 2^{k} ⌋ \equiv u ≫ k

The 0s shifted in from the left correctly implement the floor division for non-negative numbers.

Signed Division by Powers of 2: The Tricky Case

Here, we hit a subtle snag.

The C standard requires that integer division rounds towards zero (e.g., -7 / 4 is -1).
An arithmetic right shift >> inherently rounds towards negative infinity (it performs a floor operation).

For positive numbers, these are the same. For negative numbers, they are different.

Example: x = -9, divide by 8 (shift by 3).

C Division: -9 / 8 should be -1.
Arithmetic Shift:
- -9 in 8-bit is 11110111.
- 11110111 >> 3 (arithmetic) → 11111110.
- 11111110 is the two’s complement representation of -2.
- The results -1 and -2 do not match.

To fix this, the compiler adds a bias to the number before shifting, but only if the number is negative. The bias “nudges” the number in the right direction so that the floor operation of the shift gives the same result as rounding towards zero.

The formula for correct division of s by $2^{k}$ is:

(s < 0)? ((s + (1 ≪ k) - 1) ≫ k) : (s ≫ k)

Let’s re-run our example x = -9, k = 3:

s is negative, so we apply the bias.
The bias is (1 << 3) - 1 = 8 - 1 = 7.
We compute (s + 7) >> 3 → (-9 + 7) >> 3 → -2 >> 3.
-2 in 8-bit is 11111110.
11111110 >> 3 (arithmetic) → 11111111.
11111111 is the two’s complement representation of -1.
This now matches the C division result.

This clever trick is exactly what you will see in compiled code.

This deep dive into the bit-level reality of C arithmetic is essential. It explains why certain bugs occur, how compilers optimize code, and reinforces the core principle of systems programming: the language is a thin but powerful abstraction over the real hardware.

Integer C Puzzles (with Detailed Solutions)

Let’s test our understanding. Assume 32-bit, two’s complement integers.

Puzzle 1: x < 0 ==> ((x*2) < 0)

Status: False.
Reasoning: This statement fails in the specific case of overflow. Multiplication by 2 is equivalent to a left shift by 1 (x << 1). If x is the most negative number, TMin, its bit pattern is 1000...000. Shifting this left results in 000...000, which is 0.
Counterexample: Let x = TMin (approximately -2.1 billion).
- x < 0 is true.
- x * 2 overflows to 0.
- The expression becomes 0 < 0, which is false.

Puzzle 2: ux >= 0

Status: True.
Reasoning: ux is an unsigned int. By definition, unsigned integers can only represent non-negative values. Their range is $[0, U M a x]$ .

Puzzle 3: x & 7 == 7 ==> (x<<30) < 0

Status: True.
Reasoning:
1. The expression x & 7 == 7 checks the last three bits of x. In binary, 7 is ...000111. For the result of the & to be 7, the last three bits of x must also be 111. So, x has the bit pattern ...b_2 b_1 b_0 where b_2=1, b_1=1, b_0=1.
2. x << 30 shifts the bits of x left by 30 positions. The new sign bit (bit 31) will be the bit that was originally at position 1 (b_1).
3. Since we established that b_1 is 1, the result of the shift will have 1 as its sign bit, making it a negative number.

Puzzle 4: x > y ==> -x < -y

Status: False.
Reasoning: This algebraic identity fails for TMin. The negation of TMin in two’s complement overflows to TMin itself.
Counterexample: Let x = 0 and y = TMin.
- x > y becomes 0 > TMin, which is true.
- -x is 0.
- -y is -TMin, which overflows and results in TMin.
- The comparison becomes 0 < TMin. This is false, as TMin is the most negative number.

Puzzle 5: x * x >= 0

Status: False.
Reasoning: The product of two $w$ -bit numbers can require up to $2 w$ bits. C truncates the result to $w$ bits. A large positive number squared can overflow and wrap around to become a negative number.
Counterexample: Let x = 65535 (which is 0x0000FFFF).
- x * x = 65535 * 65535 = 4294836225.
- In 64-bit hex, this is 0x00000000FFFF0001.
- Truncating to 32 bits gives 0xFFFF0001.
- This bit pattern, interpreted as a 32-bit signed integer, is -65535.
- The expression becomes -65535 >= 0, which is false.

Puzzle 6: x > 0 && y > 0 ==> x + y > 0

Status: False.
Reasoning: This is the classic positive overflow case. Adding two large positive numbers can result in a sum that wraps around and becomes negative.
Counterexample: Let x = TMax and y = 1.
- x > 0 and y > 0 are both true.
- TMax is 0111...111. Adding 1 results in 1000...000, which is the bit pattern for TMin.
- The expression becomes TMin > 0, which is false.

Puzzle 7: ux >> 3 == ux / 8

Status: True.
Reasoning: For unsigned integers, a right shift by k is defined to be equivalent to an integer division by $2^{k}$ . This identity always holds.

Puzzle 8: x >> 3 == x / 8

Status: False.
Reasoning: The two operations round differently for negative numbers.
- C’s / operator rounds towards zero.
- The >> operator (arithmetic shift) rounds towards negative infinity.
Counterexample: Let x = -9.
- x / 8 is -1.125, which rounds toward zero to become -1.
- The bit pattern for -9 is ...11110111.
- x >> 3 shifts this arithmetically: ...11111110. This is the bit pattern for -2.
- The expression becomes -2 == -1, which is false.

Continue here: 05 Pointers, Stack

CS Notes

Explorer

04 Strings, Assertions, Style, Integers and Arithmetic

A Deeper Look at Strings

The C “String”: A Convention, Not a Type

Visualizing String Operations in Memory

Assertions: Guard Rails for Programmers

How `assert` Works

The Philosophy of Assertions

Why is `assert` a Macro?

On Style

Core Principles of Good C Style

Don’t Write C Like This

Chapter 3: Representing Integers in C

Encoding Integers

Little vs. Big Endian

Bit-Level Operations and Shifts

Shift Operations in Detail

Signed vs. Unsigned: The Danger Zone

Sign Extension

Integer Arithmetic in C

Negation: The Foundation of Subtraction

Unsigned Addition: Arithmetic on a Clock

Two’s Complement Addition: The Same Bits, Different Meaning

Multiplication: Truncation is Key

Power-of-2 Arithmetic with Shifts

Multiplication by Powers of 2

Unsigned Division by Powers of 2

Signed Division by Powers of 2: The Tricky Case

Integer C Puzzles (with Detailed Solutions)

Table of Contents

Graph View

Backlinks

CS Notes

Explorer

04 Strings, Assertions, Style, Integers and Arithmetic

A Deeper Look at Strings

The C “String”: A Convention, Not a Type

Visualizing String Operations in Memory

Assertions: Guard Rails for Programmers

How assert Works

The Philosophy of Assertions

Why is assert a Macro?

On Style

Core Principles of Good C Style

Don’t Write C Like This

Chapter 3: Representing Integers in C

Encoding Integers

Little vs. Big Endian

Bit-Level Operations and Shifts

Shift Operations in Detail

Signed vs. Unsigned: The Danger Zone

Sign Extension

Integer Arithmetic in C

Negation: The Foundation of Subtraction

Unsigned Addition: Arithmetic on a Clock

Two’s Complement Addition: The Same Bits, Different Meaning

Multiplication: Truncation is Key

Power-of-2 Arithmetic with Shifts

Multiplication by Powers of 2

Unsigned Division by Powers of 2

Signed Division by Powers of 2: The Tricky Case

Integer C Puzzles (with Detailed Solutions)

Table of Contents

Graph View

Backlinks

How `assert` Works

Why is `assert` a Macro?