In our last lecture, we deconstructed the digital image, understanding its journey from photons in the world to a grid of numbers in a computer. We concluded by acknowledging the ever-present reality of noise. Before we build upon this foundation to tackle the complex task of image segmentation, let us revisit a crucial and often counter-intuitive aspect of noise that warrants a deeper look.

A Remark on the Nature of Light: Revisiting Poisson Noise

We identified two primary categories of noise: scene-independent noise (like electronic hiss) and scene-dependent noise. The most fundamental form of the latter is Photon Shot Noise, which arises not from imperfections in our sensor, but from the quantum nature of light itself.

Photons do not arrive in a smooth, continuous stream; they arrive as discrete, random events. The number of photons, $k$, that hits a given pixel in a given time interval follows a Poisson distribution:

$$P(k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

Here, $\lambda$ is the crucial parameter: it is the expected (average) number of photons for a given light level. $P(k \mid \lambda)$ then gives us the probability of actually observing exactly $k$ photons.

A unique and critical property of the Poisson distribution is that its mean is equal to its variance. That is:

  • Mean: $\mathbb{E}[k] = \lambda$
  • Variance: $\mathrm{Var}[k] = \lambda$

This single fact is the key to understanding why shot noise behaves the way it does.

The Signal-to-Noise Ratio Paradox

Our intuition might tell us that brighter scenes, which have a higher variance ($\sigma^2 = \lambda$), should be noisier. However, our eyes tell us the opposite: low-light photos are notoriously grainy, while bright, sunny photos are clean. The Signal-to-Noise Ratio (SNR) resolves this apparent paradox.

Recall that SNR is the ratio of the signal’s strength (the mean) to the noise’s strength (the standard deviation). For a Poisson process, the standard deviation is the square root of the variance, which is $\sqrt{\lambda}$. The SNR is therefore

$$\mathrm{SNR} = \frac{\lambda}{\sqrt{\lambda}} = \sqrt{\lambda},$$

so it grows with the square root of the light level.

An Intuitive Example: Low Light vs. Bright Light

Let’s put numbers to this to build our intuition.

  • At Low Light (e.g., a dim scene): Let’s say we expect an average of $\lambda = 4$ photons per pixel.
    • Mean signal = 4
    • Standard deviation of noise = $\sqrt{4} = 2$
    • SNR = 4 / 2 = 2. The noise is a staggering 50% of the signal strength. An actual measurement of 3 or 5 photons is a huge relative deviation from the expected value of 4. The image will look very noisy.
  • At Bright Light (e.g., a sunny day): Let’s say we expect an average of $\lambda = 1000$ photons per pixel.
    • Mean signal = 1000
    • Standard deviation of noise = $\sqrt{1000} \approx 31.6$
    • SNR = 1000 / 31.6 ≈ 31.6. The noise is now only about 3% of the signal strength. An actual measurement of 980 or 1020 photons is a tiny relative deviation. The image will look much cleaner.

This is why brighter pictures have fewer noise issues. Even though the absolute amount of random fluctuation is larger in bright light, it is dwarfed by the strength of the signal itself.
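To make this concrete, here is a minimal simulation sketch (using NumPy; the patch size and the two $\lambda$ values are arbitrary choices for illustration) that draws Poisson-distributed photon counts and reports the empirical SNR, which should land close to $\sqrt{\lambda}$:

```python
import numpy as np

rng = np.random.default_rng(0)

for lam in (4, 1000):  # expected photons per pixel: dim vs. bright
    # Simulate photon counts for a 256x256 patch of identical pixels.
    counts = rng.poisson(lam=lam, size=(256, 256))

    mean = counts.mean()   # signal strength, approx. lambda
    std = counts.std()     # shot-noise strength, approx. sqrt(lambda)
    snr = mean / std

    print(f"lambda={lam:5d}  mean={mean:8.1f}  std={std:6.2f}  "
          f"SNR={snr:5.1f}  (sqrt(lambda)={np.sqrt(lam):5.1f})")
```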

An Introduction to Segmentation

With a firm grasp of the digital image, we now turn to a fundamental task in computer vision: Image Segmentation. This is the process of partitioning an image into a set of meaningful regions, often corresponding to objects or parts of objects. It is a critical first step in many analysis systems, from autonomous driving (identifying road, pedestrians, and other vehicles) to medical imaging (isolating tumors or organs).

Defining Segmentation

Formally, a complete segmentation of an image is a finite set of regions that are collectively exhaustive and mutually exclusive (aka a partition).
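Written out, with $I$ denoting the set of all image pixels and $R_1, \dots, R_n$ the regions (notation introduced here for convenience), the two conditions are:

$$\bigcup_{i=1}^{n} R_i = I \quad \text{(collectively exhaustive)}, \qquad R_i \cap R_j = \emptyset \ \text{ for } i \neq j \quad \text{(mutually exclusive)}.$$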

Advanced Segmentation Paradigms

We’ve seen that simple thresholding is limited. To create more powerful and general segmentation algorithms, we need to move beyond isolated pixel operations and incorporate notions of shape, context, and probability. We will now explore three such advanced paradigms.

Segmentation by Energy Minimization: Active Contours (Snakes)

Imagine an elastic rubber band that we can place on an image. We want this band to automatically shrink and deform until it tightly wraps around the boundary of an object. This is the core idea behind Active Contours, or Snakes.

A snake is a deformable contour, typically a polygon, that is iteratively adjusted to minimize an energy functional.

What is an Energy Functional?

In standard optimization, we minimize a function that maps a vector of variables to a single cost value. A functional is a higher-level concept: it is a “function of a function.”

The input to our snake functional is not a vector of numbers, but the entire continuous curve, $v(s) = (x(s), y(s))$ for $s \in [0, 1]$, that defines the snake’s shape. The functional, $E[v]$, takes this curve and outputs a single scalar value representing its total “energy.” The algorithm’s goal is to find the curve that minimizes this energy.

The total energy is a carefully designed sum of competing terms that guide the snake’s evolution:

$$E_{\text{snake}} = E_{\text{int}} + E_{\text{image}} + E_{\text{con}}$$

  • Internal Energy ($E_{\text{int}}$): This term enforces the snake’s smoothness and physical properties. It includes an elasticity term (penalizing stretching) and a bending term (penalizing sharp corners). This keeps the snake from becoming jagged or breaking apart.
  • Image Energy ($E_{\text{image}}$): This term links the snake to the image data. It is designed to be low where there are strong image features, like edges. This is the force that “pulls” the snake towards the object’s boundary.
  • External Constraint Energy ($E_{\text{con}}$): This allows for high-level guidance, such as user-defined points that attract the snake or a prior shape model.

The algorithm starts with an initial contour and iteratively adjusts it, balancing the internal desire for smoothness with the external pull of image features, until it settles into a low-energy state, hopefully outlining the desired object.
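In practice the contour is discretized into a polygon, and the functional becomes an ordinary function of the vertex coordinates. The sketch below (the weights `alpha`, `beta`, `gamma` and the choice of negative gradient magnitude as the image term are assumptions for illustration) evaluates such a discrete snake energy with NumPy; a greedy optimizer would repeatedly nudge each vertex to whichever nearby position lowers this value.

```python
import numpy as np

def snake_energy(points, grad_mag, alpha=0.1, beta=0.1, gamma=1.0):
    """Energy of a closed polygonal snake (vertices assumed inside the image).

    points   -- (N, 2) array of (row, col) vertex coordinates
    grad_mag -- 2D array of image gradient magnitudes
    """
    nxt = np.roll(points, -1, axis=0)   # next vertex (contour is closed)
    prv = np.roll(points, 1, axis=0)    # previous vertex

    # Elasticity: penalize stretching (first-difference magnitude).
    elastic = np.sum(np.linalg.norm(nxt - points, axis=1) ** 2)

    # Bending: penalize sharp corners (second-difference magnitude).
    bending = np.sum(np.linalg.norm(nxt - 2 * points + prv, axis=1) ** 2)

    # Image term: low energy on strong edges, so take the negative gradient.
    rows, cols = np.round(points).astype(int).T
    image_term = -np.sum(grad_mag[rows, cols])

    return alpha * elastic + beta * bending + gamma * image_term
```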

Segmentation as a Clustering Problem

A fundamentally different approach is to reframe segmentation as a clustering problem. The idea is to group pixels not by their spatial proximity, but by their similarity in an abstract feature space.

The K-Means Algorithm

The K-Means algorithm is a cornerstone of unsupervised learning and a powerful tool for this task. It aims to partition a set of data points into $K$ clusters by iteratively performing two steps:

  1. Assign Points: Assign each data point to the cluster with the nearest center (mean).
  2. Update Centers: Recalculate each cluster’s center as the mean of the points assigned to it.

This process repeats until the assignments stabilize, minimizing the overall Sum of Squared Differences (SSD) between points and their assigned cluster centers.
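A minimal NumPy sketch of this loop (the random-sample initialization, the iteration cap, and the handling of empty clusters are implementation choices made here, not part of the lecture's description):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm: points is an (N, D) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iters):
        # 1. Assign each point to the cluster with the nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # 2. Update each center as the mean of the points assigned to it
        #    (keep the old center if a cluster ends up empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])

        if np.allclose(new_centers, centers):  # assignments have stabilized
            break
        centers = new_centers

    return labels, centers
```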

Pros and Cons of K-Means

K-Means is simple, fast, and widely used, but it’s important to understand its limitations:

Pros:

  • Simple and Fast: The algorithm is easy to understand and computationally efficient.
  • Guaranteed Convergence: It is guaranteed to converge to a local minimum of the within-cluster SSD.

Cons / Issues:

  • Choosing K: The number of clusters, $K$, must be specified in advance, which can be difficult.
  • Sensitivity to Initialization: The final result can vary significantly depending on the random initial placement of cluster centers.
  • Sensitivity to Outliers: A single outlier can drastically skew the mean of a cluster, pulling the center away from the true group of points.
  • Assumes Spherical Clusters: The use of Euclidean distance means K-Means performs best when clusters are roughly spherical and of similar sizes. It struggles with elongated, non-convex, or complex shapes.

The Power of Feature Space

The magic of using K-Means for segmentation lies in the choice of feature space. By defining what “features” represent a pixel, we can achieve different kinds of groupings:

  • Intensity (1D space): Clustering on grayscale values groups pixels of similar brightness.
  • Color (3D space): Clustering on (R, G, B) vectors groups pixels of similar color.
  • Texture (High-D space): Using a filter bank to create a feature vector for the patch around each pixel allows us to group by texture.
  • Color + Position (5D space): By clustering on $(R, G, B, x, y)$ vectors, we group pixels that are similar in both color and spatial location, elegantly incorporating the Gestalt principle of proximity.
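For instance (this sketch assumes scikit-learn's `KMeans` and an illustrative weighting of the position features; neither is prescribed by the lecture), segmentation in the 5-D color-plus-position space could look like:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

def segment_color_position(image, k=5, position_weight=0.5):
    """Cluster pixels of an (H, W, 3) float image (values in [0, 1])
    in (R, G, B, x, y) feature space."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Scale positions to [0, 1] so color and position are comparable,
    # then weight how strongly proximity influences the grouping.
    features = np.column_stack([
        image.reshape(-1, 3),
        position_weight * (xs.reshape(-1, 1) / w),
        position_weight * (ys.reshape(-1, 1) / h),
    ])

    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    return labels.reshape(h, w)
```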

Markov Random Fields (MRFs)

Our final and most principled approach frames segmentation as a problem of probabilistic inference. Markov Random Fields (MRFs) allow us to build a model that finds the most probable segmentation given the image data, elegantly balancing local evidence with neighborhood consistency.

The Energy Formulation

An MRF for segmentation consists of a grid of nodes, where each node represents the unknown label of a pixel. We want to find the labeling $x$ that maximizes the joint probability $P(x, y)$, where $y$ is the observed image. This probability factors into a product of two types of local potentials:

  • Unary Potentials $\varphi(x_i, y_i)$: The likelihood of pixel $i$ having label $x_i$ given its observed data $y_i$. This is our data term.
  • Pairwise Potentials $\psi(x_i, x_j)$: The likelihood of two neighboring pixels $i$ and $j$ having labels $x_i$ and $x_j$. This is our smoothness term, which penalizes disagreements between neighbors.

Maximizing this probability is equivalent to minimizing its negative logarithm, which we call the energy function:

$$E(x) = \sum_{i} -\log \varphi(x_i, y_i) \;+\; \sum_{(i,j) \in \mathcal{N}} -\log \psi(x_i, x_j),$$

where $\mathcal{N}$ is the set of neighboring pixel pairs. The first sum is the data term and the second the smoothness term.
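As a small computational sketch (the Potts-style smoothness penalty and the cost-array layout are illustrative assumptions), the energy of a binary labeling on a 4-connected grid can be evaluated as:

```python
import numpy as np

def mrf_energy(labels, unary_cost, pairwise_weight=1.0):
    """Energy of a binary labeling on a 4-connected grid.

    labels     -- (H, W) array of 0/1 labels
    unary_cost -- (H, W, 2) array: unary_cost[i, j, l] is the data cost
                  of assigning label l to pixel (i, j)
    """
    h, w = labels.shape

    # Data term: sum the cost of each pixel's chosen label.
    data = unary_cost[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()

    # Smoothness term (Potts model): pay a penalty wherever neighbors disagree.
    disagree = (labels[:, :-1] != labels[:, 1:]).sum() + \
               (labels[:-1, :] != labels[1:, :]).sum()

    return data + pairwise_weight * disagree
```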

Solving the MRF: Graph Cuts

For binary segmentation (foreground/background), this energy minimization problem can be solved exactly and efficiently using graph cuts. The problem is transformed into finding a minimum cut in a specially constructed graph.

The Min-Cut/Max-Flow Theorem

This is a beautiful and profound result from computer science. It states that for a given source-sink graph, the maximum “flow” that can be pushed from the source to the sink is exactly equal to the minimum “cost” of any cut that separates the source from the sink. This allows us to solve the difficult min-cut problem by using well-known, efficient algorithms for finding the maximum flow, such as the augmenting path algorithm.

See: 21 Flow Networks, Cuts, Max-Flow Min-Cut, Residual Networks, Ford-Fulkerson

The graph is constructed with a node for each pixel, plus a special source node $s$ (foreground) and a sink node $t$ (background). The weights (capacities) of the edges in this graph are set according to our energy function:

  • The weights of edges from $s$ and to $t$ (t-links) are determined by the unary potentials. A pixel that is very likely to be foreground will have a strong link to $s$ and a weak link to $t$.
  • The weights of edges between neighboring pixels (n-links) are determined by the pairwise potentials. If two neighboring pixels have very different colors, the edge between them will have a low weight, making it “cheap” to cut and assign them different labels.

Finding the minimum cut in this graph is equivalent to finding the optimal segmentation that minimizes our energy function.
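As a concrete sketch, a simplified version of this construction can be set up with the PyMaxflow package; the library choice, the negative-log-likelihood t-link capacities, and the uniform n-link weight (rather than the contrast-sensitive weights described above) are assumptions made for this example.

```python
import numpy as np
import maxflow  # PyMaxflow: pip install PyMaxflow

def graph_cut_segment(fg_prob, pairwise_weight=2.0):
    """Binary segmentation from an (H, W) per-pixel foreground probability map."""
    eps = 1e-6
    # Unary costs as negative log-likelihoods (t-link capacities).
    cost_fg = -np.log(fg_prob + eps)        # cost of labeling a pixel foreground
    cost_bg = -np.log(1.0 - fg_prob + eps)  # cost of labeling it background

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(fg_prob.shape)

    # n-links: uniform smoothness penalty between 4-connected neighbors.
    g.add_grid_edges(nodes, pairwise_weight)

    # t-links: connect each pixel to the source (foreground) and sink (background).
    g.add_grid_tedges(nodes, cost_bg, cost_fg)

    g.maxflow()
    # get_grid_segments returns True for sink-side pixels, so invert it to
    # mark pixels on the source side of the minimum cut as foreground.
    return ~g.get_grid_segments(nodes)
```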

This powerful technique is the engine behind interactive segmentation tools like “GrabCut,” where a user provides rough scribbles for foreground and background, and the algorithm finds the optimal boundary that respects both the user’s input and the underlying image structure.

Continue here: 05 Convolution and Filtering