Lecture from: 12.11.2025 | Video: Videos ETHZ

Floating Point Rounding

The representation of floating-point numbers was discussed in the previous lecture. The next crucial step is understanding how mathematical results fit back into this finite representation.

Rounding Modes

When a calculation results in a number that cannot be exactly represented (e.g., it requires more bits of precision than the fraction field allows), rounding is necessary. IEEE 754 defines four standard rounding modes.

Consider rounding a currency (Francs) to whole numbers:

  1. Round toward zero (Truncate): $1.40 \to 1$, $-1.50 \to -1$.
  2. Round down (toward $-\infty$): $1.40 \to 1$, $-1.50 \to -2$.
  3. Round up (toward $+\infty$): $1.40 \to 2$, $-1.50 \to -1$.
  4. Round to nearest even (Default): $1.50 \to 2$, $-1.50 \to -2$. This is the standard mode used in C and most hardware.
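
These modes can be selected at runtime through C's standard <fenv.h> interface. A minimal sketch (fesetround and rint are standard C99; strictly conforming code also needs #pragma STDC FENV_ACCESS ON, and linking requires -lm):

```c
#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    int modes[]         = { FE_TOWARDZERO, FE_DOWNWARD, FE_UPWARD, FE_TONEAREST };
    const char *names[] = { "toward zero", "down", "up", "to nearest even" };
    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]);             // select the IEEE rounding mode
        printf("%-16s rint(2.5) = %g, rint(-2.5) = %g\n",
               names[i], rint(2.5), rint(-2.5));
    }
    return 0;
}
```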

Closer Look at “Round-to-Even”

Why is “Round-to-Even” the default?

  • Statistical Bias: If 0.5 were always rounded up, results would acquire a slight upward bias over many calculations. Rounding ties to the nearest even number ensures that, statistically, errors average out to zero (half the time the tie rounds up, half the time it rounds down).
  • The rule:
    1. If the number is not exactly halfway between two representable values, round to the nearest one (e.g., $1.4 \to 1$, $1.6 \to 2$).
    2. If the number is exactly halfway (e.g., $1.5$, $2.5$), round to the nearest even integer:
      • $1.5 \to 2$ (2 is even)
      • $2.5 \to 2$ (2 is even)
      • $-1.5 \to -2$ ($-2$ is even)

Rounding Binary Numbers

In binary, “even” means the Least Significant Bit (LSB) is 0. To implement rounding in hardware, the bits beyond the precision limit are examined using three specific bits:

  1. Guard Bit (G): The first bit dropped, immediately right of the rounding position.
  2. Round Bit (R): The bit immediately following the guard bit.
  3. Sticky Bit (S): The logical OR of all bits remaining after the round bit.

Rounding Logic:

  • If $G = 0$: The value is less than halfway. Round down (truncate).
  • If $G = 1$ and $(R \lor S) = 1$: The value is more than halfway. Round up.
  • If $G = 1$, $R = 0$, and $S = 0$: The value is exactly halfway. Round to even (see the sketch below):
    • If the current LSB is 1 (odd), add 1 to make it even.
    • If the current LSB is 0 (even), leave it alone.
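
A minimal C sketch of this decision logic on a raw integer significand (the function name and interface are illustrative; hardware operates on the bit fields directly):

```c
#include <stdint.h>

// Round `sig` to nearest-even by dropping its `drop` lowest bits (drop >= 1).
// G is the first dropped bit; `rest` collapses R and S into (R | S).
uint64_t round_nearest_even(uint64_t sig, unsigned drop) {
    uint64_t kept = sig >> drop;                       // retained bits
    uint64_t g    = (sig >> (drop - 1)) & 1;           // guard bit
    uint64_t rest = sig & ((1ULL << (drop - 1)) - 1);  // R OR'ed with S
    if (g && rest) return kept + 1;                    // > halfway: round up
    if (g)         return kept + (kept & 1);           // = halfway: to even
    return kept;                                       // < halfway: round down
}
```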

Examples (rounding to nearest 1/4, i.e., 2 bits right of binary point):

  • $10.00011_2$ ($2\tfrac{3}{32}$): dropped bits are less than halfway ($G = 0$), round down to $2$ ($10.00_2$).
  • $10.00110_2$ ($2\tfrac{3}{16}$): more than halfway ($G = 1$, $R = 1$), round up to $2\tfrac{1}{4}$ ($10.01_2$).
  • $10.111_2$ ($2\tfrac{7}{8}$): Exactly halfway between $2\tfrac{3}{4}$ ($10.11_2$) and $3$ ($11.00_2$). $11.00_2$ is even (LSB 0), so round up to 3.
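
As a quick cross-check with the sketch above: representing $2\tfrac{7}{8} = 10.111_2$ as the integer 23 in units of $\tfrac{1}{8}$, round_nearest_even(23, 1) returns 12, i.e. $1100_2$ read as $11.00_2 = 3$.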

Post-Normalization

After rounding, the significand might overflow (e.g., rounding up $1.111_2$ yields $10.000_2$, which no longer fits the normalized $1.f$ form).

  • Fix: Shift right by one and increment the exponent.
  • If this increment causes the exponent to become all 1s, the result overflows to $\infty$.

Practice: Floating Point Rounding

Correctly applying the GRS (Guard, Round, Sticky) bits is crucial for understanding how hardware ensures precision.

Exercise: Rounding to Nearest Even

Round the following binary values to the nearest 1/4 (2 fractional bits).

  1. Value: $10.01001_2$ ($2\tfrac{9}{32}$)
    • Reason: $GRS = 001$ means the value is only $\tfrac{1}{32}$ beyond the rounding point, less than halfway.
    • Answer: $10.01_2$ ($2\tfrac{1}{4}$) - Round Down.
  2. Value: $10.10110_2$ ($2\tfrac{11}{16}$)
    • Reason: $GRS = 110$ means the value is $\tfrac{3}{16}$ beyond the rounding point, more than halfway.
    • Answer: $10.11_2$ ($2\tfrac{3}{4}$) - Round Up.
  3. Value: $10.01100_2$ ($2\tfrac{3}{8}$)
    • Reason: $GRS = 100$ means it is exactly halfway. We look at the LSB of the retained bits. Since it is 1 (odd), we round up to make it even.
    • Answer: $10.10_2$ ($2\tfrac{1}{2}$).
  4. Value: $10.10100_2$ ($2\tfrac{5}{8}$)
    • Reason: $GRS = 100$ means it is exactly halfway. Since the LSB is 0 (even), we leave it alone.
    • Answer: $10.10_2$ ($2\tfrac{1}{2}$).

Floating Point Arithmetic

The standard defines arithmetic operations (add, multiply, etc.) as if the exact mathematical result were computed first, and then rounded to fit the format.

Multiplication

Multiplication is relatively straightforward in hardware. For $(-1)^{s_1} M_1\, 2^{E_1} \times (-1)^{s_2} M_2\, 2^{E_2}$, the result $(-1)^{s} M\, 2^{E}$ is computed as follows (a C-level sketch comes after the list):

  1. Sign: $s = s_1 \oplus s_2$ (XOR of the sign bits).
  2. Significand: $M = M_1 \times M_2$. (This is essentially an integer multiplication).
  3. Exponent: $E = E_1 + E_2$.
  4. Fixing: If $M \ge 2$, shift $M$ right and increment $E$. Round $M$ to fit the precision of the fraction field.
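
At a high level the recipe can be mimicked with the C library's frexp/ldexp (a sketch for intuition only; real hardware manipulates the raw bit fields):

```c
#include <math.h>

// Decompose, multiply significands, add exponents, renormalize.
// frexp gives a = ma * 2^ea with 0.5 <= |ma| < 1; ldexp reassembles.
double fp_mul_sketch(double a, double b) {
    int ea, eb;
    double ma = frexp(a, &ea);
    double mb = frexp(b, &eb);
    double m  = ma * mb;         // signs ride along in ma and mb
    return ldexp(m, ea + eb);    // ldexp handles the shift/increment fix
}
```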

Addition

Addition is harder because of alignment: numbers with different exponents cannot be added directly. For $(-1)^{s_1} M_1\, 2^{E_1} + (-1)^{s_2} M_2\, 2^{E_2}$, assume $E_1 \ge E_2$.

  1. Align: Shift $M_2$ right by $E_1 - E_2$ positions so both operands share the exponent $E_1$.
    • Note: If $E_1$ is much larger than $E_2$, $M_2$ might be shifted entirely off the end. This effectively means $x + y = x$ (demonstrated in the snippet below).
  2. Add: $M = M_1 + M_2 \cdot 2^{-(E_1 - E_2)}$ with $E = E_1$; the sign $s$ falls out of the signed addition.
  3. Fixing: If $M \ge 2$, shift right and increment $E$. If $M < 1$, shift left by $k$ positions and decrement $E$ by $k$. Round.
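
A small C demonstration of an operand vanishing during alignment (values chosen for illustration):

```c
#include <stdio.h>

int main(void) {
    float f = 16777216.0f;          // 2^24: the float significand's limit
    printf("%d\n", f + 1.0f == f);  // prints 1: the 1.0 is shifted away

    double d = 1e20;                // ulp(1e20) is about 2e4 for double
    printf("%d\n", d + 1.0 == d);   // prints 1: same effect for double
    return 0;
}
```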

Mathematical Properties

Floating-point arithmetic does not behave like standard real arithmetic: it forms neither an Abelian group under addition nor a ring.

Properties of Addition

  • Closed? Yes (but may generate $\pm\infty$ or NaN).
  • Commutative? Yes ($x + y = y + x$).
  • Associative? NO.
    • Due to overflow and rounding precision.
    • Example (single precision): $(3.14 + 10^{10}) - 10^{10} = 0$ (because $3.14$ is lost when aligning with $10^{10}$).
    • But $3.14 + (10^{10} - 10^{10}) = 3.14$.
  • Inverses? Almost (except infinities and NaNs).
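
The same example in C (single precision, so 3.14 really does vanish):

```c
#include <stdio.h>

int main(void) {
    float a = 3.14f, big = 1e10f;
    printf("%g\n", (a + big) - big);  // 0: 3.14's bits are lost in alignment
    printf("%g\n", a + (big - big));  // 3.14
    return 0;
}
```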

Properties of Multiplication

  • Commutative? Yes.
  • Associative? NO.
    • Example (single precision): $(10^{20} \cdot 10^{20}) \cdot 10^{-20} = \infty$, but $10^{20} \cdot (10^{20} \cdot 10^{-20}) = 10^{20}$.
  • Distributive over addition? NO.
    • $10^{20} \cdot (10^{20} - 10^{20}) = 0$, but $10^{20} \cdot 10^{20} - 10^{20} \cdot 10^{20} = \infty - \infty = \text{NaN}$, due to overflow and rounding errors.
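
And in C (again single precision, so the products overflow):

```c
#include <stdio.h>

int main(void) {
    float x = 1e20f;
    printf("%g\n", (x * x) * 1e-20f);  // inf: x*x overflows float first
    printf("%g\n", x * (x * 1e-20f));  // 1e+20
    printf("%g\n", x * (x - x));       // 0
    printf("%g\n", x * x - x * x);     // nan: inf - inf
    return 0;
}
```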

Implication for Compilers

Because FP arithmetic is not associative, compilers cannot arbitrarily reorder FP operations for optimization (like they can with integers) without risking changing the result.

Floating Point Puzzles

Let’s test understanding. Assume int x, float f, double d. Assume neither d nor f is NaN.

| Expression | Verdict | Reason |
| --- | --- | --- |
| x == (int)(float) x | False | float has 23 fraction bits; an int has up to 31 value bits. Large integers (e.g., 0x0f0f0f0f) cannot be exactly represented as a float and will lose precision. |
| x == (int)(double) x | True | double has 52 fraction bits, which is enough to store any 32-bit int exactly. |
| f == (float)(double) f | True | Promoting a float to double preserves the value exactly. Converting back is an identity operation. |
| d == (float) d | False | float has less range and precision. Converting a double to float might overflow to $\pm\infty$ or lose precision. |
| f == -(-f) | True | Floating-point negation just flips the sign bit. It is symmetric. |
| 2/3 == 2/3.0 | False | 2/3 is integer division (result 0). 2/3.0 is FP division (result 0.66...). 0 != 0.66.... |
| d < 0.0 => ((d*2) < 0.0) | True | Even if d is a huge negative number, d*2 at worst becomes $-\infty$, which is still $< 0.0$. |
| d > f => -f > -d | True | Negation is monotonic. |
| d * d >= 0.0 | True | Squaring a real number is non-negative. Even if it overflows to $+\infty$, that is $\ge 0$. (Unless it's NaN, but NaNs were assumed not to be present.) |
| (d+f)-d == f | False | Not associative. If d is huge (e.g., $10^{308}$), d+f will equal d due to precision loss. Then (d+f)-d = d-d = 0, and 0 != f in general. |
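
A small test program for a few of these puzzles (the concrete values 0x0f0f0f0f, 3.25f, and 1e308 are illustrative):

```c
#include <stdio.h>

int main(void) {
    int x    = 0x0f0f0f0f;   // needs 28 significant bits
    float f  = 3.25f;
    double d = 1e308;

    printf("%d\n", x == (int)(float) x);     // 0: float keeps only 24 bits
    printf("%d\n", x == (int)(double) x);    // 1: double holds it exactly
    printf("%d\n", f == (float)(double) f);  // 1: round trip is exact
    printf("%d\n", d == (float) d);          // 0: 1e308 overflows float
    printf("%d\n", (d + f) - d == f);        // 0: 3.25 vanishes next to 1e308
    return 0;
}
```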

Floating Point in C and Assembly

C Guarantees

  • float: IEEE Single Precision.
  • double: IEEE Double Precision.
  • Casting int → float: Changes the bit representation; rounding may occur.
  • Casting int → double: Exact.
  • Casting double → int: Truncates (rounds toward zero).
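
A short demonstration of these guarantees (the value $2^{24} + 1$ is chosen because it is the first integer a float cannot hold exactly):

```c
#include <stdio.h>

int main(void) {
    int i = (1 << 24) + 1;                // 16777217 needs 25 significand bits
    printf("%.1f\n", (double)(float) i);  // 16777216.0: int -> float rounded
    printf("%.1f\n", (double) i);         // 16777217.0: int -> double exact
    printf("%d %d\n", (int) 2.99, (int) -2.99);  // 2 -2: truncation toward zero
    return 0;
}
```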

Hardware: SSE (Streaming SIMD Extensions)

On modern Intel architectures (x86-64), floating point is handled by the SSE unit (specifically SSE3 for this course), not the old x87 stack-based coprocessor.

Architecture:

  • Registers: 16 registers named %xmm0 through %xmm15.
  • Size: Each is 128 bits wide.
  • Usage:
    • Scalar: Can hold a single float (32-bit) or double (64-bit) in the lower bits.
    • Packed (Vector): Can hold 4 floats or 2 doubles packed together for SIMD operations.

Instructions: The instructions differentiate between Scalar (S) vs Packed (P) and Single (S) vs Double (D).

  • addss: Add Scalar Single. (Standard float addition).
  • addsd: Add Scalar Double. (Standard double addition).
  • addps: Add Packed Single. (Vector addition of 4 floats at once).

Code Example (Inner Product):

```asm
; float ipf(float x[], float y[], int n)
; x in %rdi, y in %rsi, n in %edx

    xorps %xmm1, %xmm1          ; result = 0.0
    xorl  %ecx, %ecx            ; i = 0
    jmp   .L8
.L10:                           ; Loop
    movslq %ecx, %rax           ; sign-extend i for addressing
    incl  %ecx                  ; i++
    movss (%rsi,%rax,4), %xmm0  ; Load y[i] into xmm0
    mulss (%rdi,%rax,4), %xmm0  ; xmm0 = xmm0 * x[i]
    addss %xmm0, %xmm1          ; result += xmm0
.L8:
    cmpl  %edx, %ecx            ; compare i with n
    jl    .L10                  ; loop while i < n
    movaps %xmm1, %xmm0         ; return value goes in %xmm0
    ret
```

Note the use of movss, mulss, and addss. These are scalar operations using the XMM registers.
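
For reference, C source that plausibly compiles to this pattern (reconstructed from the signature comment above):

```c
float ipf(float x[], float y[], int n) {
    float result = 0.0f;
    for (int i = 0; i < n; i++)
        result += x[i] * y[i];   // one movss/mulss/addss per iteration
    return result;
}
```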

Conversions: Special instructions exist to convert between types, e.g., cvtsi2sd (Convert Scalar Integer to Scalar Double). This is what happens when values are cast like (double) i in C.
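
For example, a function like this compiles (with optimization, roughly) to a single cvtsi2sd and a return:

```c
double to_double(int i) {
    return (double) i;   // roughly: cvtsi2sd %edi, %xmm0 ; ret
}
```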


Optimizing Compilers

The focus now shifts from correctness and representation to performance. Having seen how C turns into assembly and how memory works, the question now is how to make code run fast.

The Landscape

  • Virtual Memory: The code runs in a nice linear address space (0 to $2^{64} - 1$), isolated from other programs.
  • Hardware Reality: Caches, RAM latencies, pipelines, and IO devices all impact speed.
  • The Compiler: It is the bridge between the logical C code and the physical hardware.

The Compiler is Your Friend

The compiler wants to make code fast. It is very good at:

  1. Register Allocation: Deciding which variables live in fast registers vs. slow stack memory.
  2. Code Selection: Choosing the most efficient assembly instructions.
  3. Dead Code Elimination: Removing useless calculations.

However, the compiler has a strict rule: It must not change the behavior of the program. If there is any ambiguity (e.g., “does writing to pointer A change the value at pointer B?”), the compiler must be conservative. It forces itself to be slow to ensure correctness.
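
A standard illustration of this conservatism (function and variable names are illustrative):

```c
// If a and b might point to the same long, the compiler cannot rewrite
// this as *a += 2 * (*b): with a == b the two versions differ (4x vs 3x).
void twiddle(long *a, long *b) {
    *a += *b;
    *a += *b;
}
```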

Optimization Levels

  • -O0: No optimization (default). Good for debugging, terrible for speed.
  • -O1: Basic optimizations.
  • -O2: Recommended for most deployments. Good balance.
  • -O3: Aggressive optimization. Can sometimes make code larger or expose subtle bugs, but usually fastest.

Example: Matrix Multiplication

Let’s look at a classic example: Matrix-Matrix Multiplication (MMM). Calculate $C = A \cdot B$ for $n \times n$ matrices. Standard algorithm: triple loop ($O(n^3)$), as sketched below.
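
A minimal sketch of that baseline (row-major $n \times n$ doubles; function and parameter names are illustrative):

```c
// Naive matrix-matrix multiply: C = A * B, all matrices n x n, row-major.
void mmm(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}
```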

  • The Baseline: A standard triple loop compiled with -O3. It achieves a certain performance level but plateaus.
  • The Best Code: Highly optimized code (e.g., by K. Goto).
  • The Gap: The optimized code is 160x faster than the naive version, despite performing the exact same number of arithmetic operations ($2n^3$ flops).

What is going on? The asymptotic complexity ($O(n^3)$) is the same for both. The difference lies in the constants and how the hardware is utilized:

  1. Memory Hierarchy: Blocking/tiling to keep data in L1/L2 cache.
  2. Vectorization: Using those SIMD instructions (addps) to do 4 or 8 operations per cycle.
  3. Instruction Level Parallelism (ILP): Keeping the CPU pipeline full.
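
As a taste of the first point, a cache-blocking sketch of MMM (BLK is an illustrative tile size; assumes BLK divides n and that C starts zeroed):

```c
#define BLK 32   // tile size: tune so three BLK x BLK tiles fit in cache

// Blocked MMM: C += A * B, row-major n x n doubles.
void mmm_blocked(const double *A, const double *B, double *C, int n) {
    for (int ii = 0; ii < n; ii += BLK)
      for (int jj = 0; jj < n; jj += BLK)
        for (int kk = 0; kk < n; kk += BLK)
          // Work tile by tile so the data stays resident in L1/L2
          // while it is being reused.
          for (int i = ii; i < ii + BLK; i++)
            for (int j = jj; j < jj + BLK; j++) {
                double sum = C[i*n + j];
                for (int k = kk; k < kk + BLK; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
}
```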

Harsh Reality

There is more to performance than Big-O. A factor of 10x or 100x can be lost just by confusing the compiler or ignoring the hardware architecture.

Goals for this chapter:

  1. Understand what the compiler can do.
  2. Understand what blocks the compiler (e.g., memory aliasing).
  3. Learn how to write code that helps the compiler help you.

Next time: Specific optimization techniques like code motion, strength reduction, and how to resolve memory aliasing will be discussed.

Tip

The compiler loves you and wants you to be happy (fast code). But it is terrified of making you sad (incorrect code). Your job is to prove to the compiler that the fast way is also the safe way.
