Welcome back. Before we venture into our main topic for this week, Image Segmentation, let us first conclude our discussion on the fundamental nature of the digital image by exploring its final, unavoidable property: noise.

The Unavoidable Reality of Image Noise

No matter how advanced our technology becomes, the process of capturing an image of the real world will always be imperfect. Every step in the imaging pipeline—from the quantum arrival of photons at the sensor, to their conversion into electrons, to the transmission and processing of the digital signal—introduces a degree of randomness. This unwanted modification of the signal is what we call noise.

Classifying Noise

We can broadly categorize image noise into two families:

  1. Scene-Dependent Noise: This noise varies with the content of the image itself. The primary example is Photon Shot Noise. Light is not a continuous fluid; it is composed of discrete particles called photons. The arrival of these photons at any given photosite on the sensor is a random, quantum process. Brighter areas of an image, which correspond to a higher rate of photon arrival, will naturally exhibit more of this random fluctuation than darker areas.
  2. Scene-Independent Noise: This noise is a characteristic of the imaging system itself, regardless of what it is looking at. This category includes electronic noise from the sensor’s circuitry, quantization noise introduced when converting the analog signal to digital values, and the dark current noise we discussed previously, which is caused by thermal energy in the sensor.

Measuring Image Quality: The Signal-to-Noise Ratio (SNR)

To quantify the “cleanliness” of an image, we use a metric called the Signal-to-Noise Ratio (SNR). It’s an intuitive measure that compares the strength of the actual image signal to the strength of the noise.

The SNR, denoted $\mathrm{SNR}$, is defined as the ratio of the mean pixel intensity of the image, $\mu$, to the standard deviation of the noise, $\sigma$.

Understanding SNR in images

The signal-to-noise ratio (SNR) is often defined as

$$\mathrm{SNR} = \frac{\mu}{\sigma}$$

  • $\mu$ = average pixel intensity (the “signal”).
  • $\sigma$ = standard deviation of pixel intensities (the “noise”).

High SNR → the image patch is nearly uniform (all pixels close to the mean).
Example: a smooth gray or white region → $\sigma \approx 0$, so $\mathrm{SNR}$ is very large.

Low SNR → pixel values vary strongly.
This could be from random noise, or from structured patterns (like a half-black/half-white split or a checkerboard).

⚠️ Important: this formula does not distinguish between “true noise” and “structured variation” — it only measures how consistent pixel intensities are. If all pixels agree, SNR is high; if they are very different, SNR is low.

A high SNR indicates a clean image where the signal dominates the noise. A low SNR indicates a noisy image where the random fluctuations are significant relative to the image content.
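
To make this concrete, here is a minimal NumPy sketch of the definition above, computing the SNR of an image patch as its mean divided by its standard deviation (the function name and test values are illustrative, not from the lecture):

```python
import numpy as np

def snr(patch):
    """SNR of an image patch: mean intensity divided by the standard
    deviation of the intensities (per the definition above)."""
    patch = np.asarray(patch, dtype=np.float64)
    sigma = patch.std()
    if sigma == 0:
        return np.inf  # perfectly uniform patch: "infinite" SNR
    return patch.mean() / sigma

# A smooth gray patch (high SNR) vs. the same patch with added noise (lower SNR).
rng = np.random.default_rng(0)
smooth = np.full((32, 32), 128.0)
noisy = smooth + rng.normal(0.0, 10.0, smooth.shape)
print(snr(smooth), snr(noisy))  # inf for the uniform patch, about 12.8 for the noisy one
```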

A Note on PSNR

In many computer vision papers, you will encounter a related metric called the Peak Signal-to-Noise Ratio (PSNR). It is very similar to SNR, but instead of using the mean signal strength in the numerator, it uses the maximum possible signal value, $I_{\max}$ (e.g., 255 for an 8-bit image).

PSNR is often used to compare the quality of a reconstructed or compressed image against an original, noise-free ground truth image.
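
As a sketch of this idea (using the common decibel formulation rather than any formula from the lecture), PSNR can be computed against a reference image via the mean squared error, with the peak value $I_{\max}$ in the numerator:

```python
import numpy as np

def psnr(reference, reconstructed, i_max=255.0):
    """Peak SNR in decibels: 10 * log10(I_max^2 / MSE), where MSE is the
    mean squared error between the reference (ground truth) image and
    the reconstructed or compressed version being evaluated."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((reference - reconstructed) ** 2)
    if mse == 0:
        return np.inf  # images are identical
    return 10.0 * np.log10(i_max ** 2 / mse)
```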

Common Noise Models

To develop algorithms that can remove noise, we first need mathematical models to describe it.

  • Additive Gaussian Noise: This is the most common model for scene-independent electronic noise. We model the observed image $g(x, y)$ as the true, clean image $f(x, y)$ plus a random noise value $n$, i.e. $g(x, y) = f(x, y) + n$, where $n$ is drawn from a Gaussian (normal) distribution with a mean of 0 and a variance of $\sigma^2$.
  • Poisson Noise (Shot Noise): This models the scene-dependent noise from the random arrival of photons. The probability of observing $k$ photons at a pixel, given that the true average rate of arrival is $\lambda$, follows the Poisson distribution $P(k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$. As the formula (and its plot) shows, the shape of the noise distribution depends on the expected intensity $\lambda$. Brighter regions (larger $\lambda$) have a wider distribution of noise, but, as we discussed, a higher signal-to-noise ratio.

Other noise types include multiplicative noise (common in radar imagery), and impulse “salt-and-pepper” noise, which appears as random white and black pixels and can be caused by transmission errors or faulty sensor elements.
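
To see these models in action, the following sketch (with assumed parameter values) corrupts a synthetic clean image with additive Gaussian noise, Poisson shot noise, and salt-and-pepper noise using NumPy's random generators:

```python
import numpy as np

rng = np.random.default_rng(42)
clean = np.full((64, 64), 100.0)   # hypothetical clean image, mean level 100

# Additive Gaussian noise: g = f + n, with n ~ N(0, sigma^2).
sigma = 5.0
gaussian_noisy = clean + rng.normal(0.0, sigma, clean.shape)

# Poisson (shot) noise: each pixel is a Poisson draw whose rate lambda
# is the clean intensity (interpreted as a photon count).
poisson_noisy = rng.poisson(clean).astype(np.float64)

# Salt-and-pepper (impulse) noise: a small fraction of pixels is forced
# to the extreme values 0 or 255.
sp_noisy = clean.copy()
mask = rng.random(clean.shape)
sp_noisy[mask < 0.01] = 0.0      # "pepper"
sp_noisy[mask > 0.99] = 255.0    # "salt"
```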

Week 2: Image Segmentation

With our understanding of the digital image now solidified, we can turn to our first major task in computer vision: Image Segmentation. This is the process of partitioning an image into meaningful regions or objects. It is often the very first step in a complex image analysis pipeline, providing the raw material for higher-level tasks like object recognition or scene understanding.

The Psychology of Grouping: Gestalt Theory

Why is segmentation such a natural and fundamental task? The answer lies in the very way our own visual system works. A school of psychology that emerged in the early 20th century, known as Gestalt Theory, sought to understand how humans perceive structure in the world. Its central tenet is that grouping is the key to visual perception.

The famous maxim of Gestalt psychology is: “The whole is greater than the sum of its parts.” This means that when we look at a collection of elements, we perceive emergent properties that are not present in the individual elements themselves.

Our brains are hardwired to group visual elements based on a set of powerful, intuitive principles, or Gestalt Factors:

  • Proximity: We group elements that are close to each other.
  • Similarity: We group elements that share similar features, like color, shape, or texture.
  • Continuity: We perceive smooth, continuous lines or curves rather than disjointed fragments.
  • Closure: We tend to “fill in the gaps” to perceive complete, closed figures.
  • Common Fate: We group elements that move together in the same direction.

These principles are what allow us to see a Dalmatian in a seemingly random pattern of black and white splotches. Our brain groups the splotches that form a coherent, familiar shape. The challenge for computer vision is to translate these powerful, intuitive human principles into concrete algorithms.

What is Image Segmentation?

At its core, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics. It is the act of separating an image into coherent “objects” or regions.

More formally, a complete segmentation of an image $I$ is a finite set of regions $R_1, R_2, \ldots, R_n$ such that:

  1. The union of all regions covers the entire image: $\bigcup_{i=1}^{n} R_i = I$.
  2. The regions are mutually exclusive (they do not overlap): $R_i \cap R_j = \emptyset$ for all $i \neq j$.

Different Flavors of Segmentation

The term “segmentation” can mean slightly different things depending on the specific goal.

  • Semantic Segmentation: Assigns a class label (e.g., “person,” “car,” “sky”) to each pixel. It does not distinguish between different instances of the same class. In the image below, all people are colored pink.
  • Instance Segmentation: Goes a step further. It not only labels each pixel with a class but also identifies individual object instances. Here, each person is given a unique color.
  • Panoptic Segmentation: This is the holy grail, combining both. It provides a complete scene understanding, assigning both a semantic label and an instance ID to every single pixel in the image.

Segmentation by Thresholding

The simplest possible method for segmentation is thresholding. It is a process that creates a binary image (an image with only two values, typically 0 and 1) by labeling pixels as either “foreground” or “background” based on their intensity.

We define a threshold value, $T$. For every pixel $(x, y)$ in the image $I$, we create a new binary image $B$ according to the rule:

$$B(x, y) = \begin{cases} 1 & \text{if } I(x, y) \geq T \\ 0 & \text{otherwise} \end{cases}$$
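
In NumPy this rule is a one-line comparison; the sketch below assumes a grayscale image array and a hand-picked threshold:

```python
import numpy as np

def threshold(image, T):
    """Binary segmentation: 1 where the intensity is >= T, else 0."""
    return (np.asarray(image) >= T).astype(np.uint8)

# Example: a tiny hypothetical 8-bit grayscale image and threshold.
image = np.array([[ 10, 200],
                  [120,  30]], dtype=np.uint8)
print(threshold(image, T=100))
# [[0 1]
#  [1 0]]
```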

The Power of the Histogram

How do we choose a good threshold ? A powerful tool for this is the image histogram. A histogram is a plot that shows the number of pixels in an image for each possible intensity value.

If an image contains a dark object on a bright background (or vice versa), its histogram will often be bimodal, meaning it will have two distinct peaks. The valley between these two peaks is often an excellent choice for the threshold .

The Challenge of Choosing T

Let’s consider an example: segmenting a white duck from a dark, watery background.

  • If we choose a low threshold, we correctly label the duck as foreground, but we also mislabel many bright parts of the water as foreground.
  • If we choose a high threshold, we get a very clean background, but we start to lose parts of the duck itself.

Clearly, the choice of $T$ is a critical trade-off. How can we choose it?

  1. Trial and Error: Manually adjust $T$ until the result looks good.
  2. Comparison with Ground Truth: If we have a manually segmented “ground truth” image, we can systematically test different values of $T$ and choose the one that produces a result most similar to the ground truth.
  3. Automatic Methods: Develop algorithms that can analyze the image histogram (e.g., by finding the valley between two peaks) to automatically determine the optimal threshold.
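
One classic automatic method of this kind is Otsu's thresholding (not derived in these notes), which searches the histogram for the value of $T$ that best separates two intensity populations. A self-contained NumPy sketch:

```python
import numpy as np

def otsu_threshold(image):
    """Pick T by maximizing the between-class variance of the two
    populations that T induces in the 256-bin histogram (Otsu's method)."""
    hist, _ = np.histogram(image, bins=256, range=(0, 256))
    p = hist.astype(np.float64) / hist.sum()        # intensity probabilities
    levels = np.arange(256)

    best_T, best_score = 0, -1.0
    for T in range(1, 256):
        w0, w1 = p[:T].sum(), p[T:].sum()           # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:T] * p[:T]).sum() / w0       # class means
        mu1 = (levels[T:] * p[T:]).sum() / w1
        score = w0 * w1 * (mu0 - mu1) ** 2          # between-class variance
        if score > best_score:
            best_T, best_score = T, score
    return best_T
```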

A Real-World Application: Chromakeying

The ideal scenario for thresholding is when the foreground and background intensity distributions are perfectly separated, with no overlap.

While this rarely happens in natural images, we can engineer this situation in a studio. This is the principle behind chromakeying, the technique widely known as green screen or blue screen. By filming an actor against a uniformly lit, brightly colored background, we create an image where the background pixels have a very distinct and predictable color, making them, in theory, easy to separate from the foreground actor.

A Naive Approach: Plain Distance Thresholding

Let’s try to build a simple chromakeying algorithm. Our goal is to create a binary mask, often called an alpha mask, which is 1 for every foreground pixel and 0 for every background pixel.

A straightforward idea is to define a target green color $\mathbf{g}$ in RGB space. We can then classify a pixel with color $\mathbf{c}$ as foreground if its color is “far enough” from our target green. We can measure this “distance” and compare it to a threshold, $T$:

$$d(\mathbf{c}, \mathbf{g}) > T \;\Rightarrow\; \text{foreground}$$

Here, $d(\cdot, \cdot)$ could be the L1 or L2 norm (Euclidean distance); a minimal sketch of this test follows the list below. This seems plausible, but this simple approach is brittle and suffers from two major problems:

  1. Correlated Color Variation: The formula assumes that the “distance” from pure green is uniform in all directions in color space. However, the actual color of a green screen is never perfectly uniform. Due to subtle variations in lighting, shadows, and material properties, the background pixels form a correlated cloud of colors. A shadow, for instance, might reduce the green and red components together. The simple distance metric fails to capture this complex, correlated structure.
  2. Hard Alpha Mask: This method produces a binary, “hard” mask. A pixel is either 100% foreground or 100% background. This creates jagged, unrealistic edges when the foreground is composited onto a new background. Professional systems need to compute a “soft” alpha mask, where pixels at the boundary (like strands of hair) can be semi-transparent.
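
For reference, here is a minimal sketch of the naive distance test described above; the target green and the threshold value are placeholders, not calibrated values:

```python
import numpy as np

def naive_alpha_mask(image, target_green=(0, 255, 0), T=120.0):
    """Hard alpha mask: 1 (foreground) where the pixel's Euclidean distance
    in RGB space from the target green exceeds T, else 0 (background)."""
    diff = image.astype(np.float64) - np.asarray(target_green, dtype=np.float64)
    dist = np.linalg.norm(diff, axis=-1)          # L2 norm per pixel
    return (dist > T).astype(np.uint8)            # image has shape (H, W, 3)
```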

Understanding Background Color Variation

To build a better model, we must first understand the nature of the background color’s variation. Let’s visualize the colors of pixels sampled from a green screen on a 2D plot of Red vs. Green intensity.

Ideally, all background pixels would be the exact same color, forming a single point on our plot. In reality, they form a cloud. Crucially, this cloud is often not circular but elliptical, indicating that the variations in the red and green channels are correlated.

Now, imagine we have a new test pixel (the black dot). Is it part of the background? A simple Euclidean distance measure is equivalent to drawing a circle around the mean of the background colors and checking if the test pixel falls inside. As the figure clearly shows, this is a poor fit for an elliptical cloud. A point that is intuitively “close” to the data cloud might be outside the circle, while a point that is far away could be inside.

We need a distance measure that understands the shape of the data.

Modeling the Background with a Gaussian

The elliptical shape of the data cloud strongly suggests that we can model the distribution of background colors with a multivariate Gaussian distribution. A Gaussian distribution is defined by two parameters:

  • A mean vector ($\boldsymbol{\mu}$), which represents the center of the data cloud (the average background color).
  • A covariance matrix ($\Sigma$), which describes the spread and orientation of the data cloud (the shape of the ellipse).

The Covariance Matrix: Understanding the Shape of Data

The covariance matrix is one of the most important concepts in data analysis. For our 3D RGB color data, it’s a 3x3 matrix.

  • The diagonal elements ($\Sigma_{RR}$, $\Sigma_{GG}$, $\Sigma_{BB}$) represent the variance in each channel (Red, Green, Blue) independently. A large value means the data is very spread out along that axis.
  • The off-diagonal elements ($\Sigma_{RG}$, $\Sigma_{RB}$, etc.) represent the covariance between pairs of channels. A large positive value for $\Sigma_{RG}$ means that when the Red value increases, the Green value also tends to increase. This is what gives the data cloud its elliptical tilt.

In essence, the covariance matrix captures the complete second-order statistics of the data’s shape.
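
As a sketch, the mean vector and covariance matrix of a set of sampled background pixels can be estimated directly with NumPy; the sample array below is synthetic placeholder data standing in for real green-screen pixels:

```python
import numpy as np

# background_samples: an (N, 3) array of RGB values taken from pixels
# known to belong to the green screen (placeholder data here).
rng = np.random.default_rng(0)
background_samples = rng.normal([40, 200, 50], [8, 15, 6], size=(500, 3))

mu = background_samples.mean(axis=0)              # mean background color, shape (3,)
Sigma = np.cov(background_samples, rowvar=False)  # 3x3 covariance matrix

print(mu)                 # average R, G, B of the screen
print(np.diag(Sigma))     # per-channel variances (diagonal elements)
print(Sigma[0, 1])        # covariance between R and G (an off-diagonal element)
```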

A Better Metric: The Mahalanobis Distance

Once we have modeled our background color distribution as a Gaussian, we can use a more powerful distance metric that is perfectly suited for it: the Mahalanobis Distance.

Instead of measuring the simple geometric distance to the mean, the Mahalanobis distance measures how many standard deviations a point is from the mean, taking the covariance of the data into account. It effectively “warps” the coordinate system so that the elliptical data cloud becomes a unit circle, and then measures the standard Euclidean distance in this new, warped space.

The Mahalanobis distance of a pixel color $\mathbf{c}$ from the mean background color $\boldsymbol{\mu}$ is given by:

$$d_M(\mathbf{c}) = \sqrt{(\mathbf{c} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{c} - \boldsymbol{\mu})}$$

We can then use this more meaningful distance for our thresholding:

$$\alpha(x, y) = \begin{cases} 1 & \text{if } d_M(\mathbf{c}(x, y)) > T \\ 0 & \text{otherwise} \end{cases}$$

Here, $\boldsymbol{\mu}$ is the mean background color vector, and $\Sigma$ is the covariance matrix, both computed from a sample of known background pixels. This approach is far more robust to the correlated variations found in real-world green screens.
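
A minimal sketch of this classifier, assuming `mu` and `Sigma` have been estimated from background samples as in the earlier sketch and using an illustrative threshold of 3 standard deviations:

```python
import numpy as np

def mahalanobis_alpha_mask(image, mu, Sigma, T=3.0):
    """Alpha mask: 1 (foreground) where the Mahalanobis distance of the
    pixel color from the background Gaussian exceeds T."""
    Sigma_inv = np.linalg.inv(Sigma)
    diff = image.astype(np.float64).reshape(-1, 3) - mu       # (N, 3) color offsets
    d2 = np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)      # squared Mahalanobis distances
    d = np.sqrt(d2).reshape(image.shape[:2])
    return (d > T).astype(np.uint8)
```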

Handling Illumination: Normalized Color

One final problem remains: overall brightness. A shadow cast on the green screen doesn’t change its “green-ness,” but it does lower the absolute R, G, and B values. To make our segmentation robust to such lighting effects, we can work in a normalized color space.

First, we define the overall intensity of a pixel as $I = R + G + B$. Then, we define the normalized color coordinates as:

$$r = \frac{R}{I}, \quad g = \frac{G}{I}, \quad b = \frac{B}{I}$$

By dividing by the total intensity, we remove the effect of overall brightness and are left with a representation of pure color, or chromaticity. Performing our Gaussian modeling and Mahalanobis distance calculation in this normalized space makes the chromakeying process remarkably robust to shadows and lighting changes.
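
A small sketch of this conversion to normalized (chromaticity) coordinates; the epsilon term is my own guard against division by zero in completely black pixels:

```python
import numpy as np

def normalized_rgb(image, eps=1e-8):
    """Convert an (H, W, 3) RGB image to chromaticity coordinates
    r = R / (R + G + B), g = G / (R + G + B), b = B / (R + G + B)."""
    rgb = image.astype(np.float64)
    intensity = rgb.sum(axis=-1, keepdims=True) + eps
    return rgb / intensity
```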

Evaluating Segmentation Performance

How do we know if our segmentation algorithm is any good? We need a quantitative way to measure its performance. This requires comparing the algorithm’s output to a manually created ground truth segmentation.

The Language of Classification

For a binary segmentation task (foreground vs. background), there are four possible outcomes for each pixel:

  • True Positive (TP): The algorithm correctly labels a foreground pixel as foreground.
  • False Positive (FP): The algorithm incorrectly labels a background pixel as foreground. (Type I error)
  • True Negative (TN): The algorithm correctly labels a background pixel as background.
  • False Negative (FN): The algorithm incorrectly labels a foreground pixel as background. (Type II error)
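
Given a predicted binary mask and a ground-truth mask, these four counts can be tallied with a few boolean operations, as in this sketch:

```python
import numpy as np

def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN for binary masks (1 = foreground, 0 = background)."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    TP = np.sum( pred &  truth)   # foreground correctly labeled foreground
    FP = np.sum( pred & ~truth)   # background wrongly labeled foreground
    TN = np.sum(~pred & ~truth)   # background correctly labeled background
    FN = np.sum(~pred &  truth)   # foreground wrongly labeled background
    return TP, FP, TN, FN
```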

The ROC Curve

A powerful tool for visualizing the performance of a binary classifier is the Receiver Operating Characteristic (ROC) curve. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as we vary the decision threshold of our classifier.

  • True Positive Rate (TPR), also known as Recall or Sensitivity: “What fraction of the actual foreground pixels did we correctly identify?” $\mathrm{TPR} = \frac{TP}{TP + FN}$
  • False Positive Rate (FPR): “What fraction of the actual background pixels did we incorrectly label as foreground?” $\mathrm{FPR} = \frac{FP}{FP + TN}$

Interpreting an ROC Curve

  • An ROC curve is generated by sweeping the decision threshold (e.g., our intensity threshold ) from high to low.
  • The point (0, 0) corresponds to a very high threshold where the classifier labels everything as negative (0% TPR, 0% FPR).
  • The point (1, 1) corresponds to a very low threshold where the classifier labels everything as positive (100% TPR, 100% FPR).
  • A perfect classifier would have a curve that goes straight up to the point (0, 1) and then across. This represents a 100% True Positive Rate with a 0% False Positive Rate.
  • A classifier that is no better than random guessing will produce a diagonal line from (0, 0) to (1, 1).
  • The Area Under the Curve (AUC) is a common single-number metric for classifier performance. A perfect classifier has an AUC of 1.0, while a random one has an AUC of 0.5.
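
The self-contained sketch below traces out an ROC curve for the simple intensity-thresholding segmenter by sweeping $T$ over all 8-bit values, and approximates the AUC with the trapezoidal rule; all function names are illustrative:

```python
import numpy as np

def roc_points(image, truth, thresholds=range(256)):
    """TPR and FPR of the rule 'foreground if intensity >= T', for each T."""
    truth = np.asarray(truth).astype(bool)
    tprs, fprs = [], []
    for T in thresholds:
        pred = np.asarray(image) >= T
        TP = np.sum(pred & truth)
        FP = np.sum(pred & ~truth)
        FN = np.sum(~pred & truth)
        TN = np.sum(~pred & ~truth)
        tprs.append(TP / max(TP + FN, 1))
        fprs.append(FP / max(FP + TN, 1))
    return np.array(fprs), np.array(tprs)

def auc(fprs, tprs):
    """Area under the ROC curve via the trapezoidal rule (sorted by FPR)."""
    order = np.argsort(fprs)
    x, y = fprs[order], tprs[order]
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))
```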

Precision and Recall

Another common pair of metrics, especially in information retrieval and object detection, are Precision and Recall.

  • Precision: “Of all the pixels that the classifier labeled as foreground, what fraction were actually correct?” $\mathrm{Precision} = \frac{TP}{TP + FP}$
  • Recall: This is identical to the True Positive Rate. “Of all the pixels that truly are foreground, what fraction did the classifier find?” $\mathrm{Recall} = \frac{TP}{TP + FN}$

There is often a trade-off between precision and recall. An algorithm can achieve high recall by being very aggressive and labeling many things as positive, but this will likely lower its precision. A good algorithm finds a balance between the two.
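
Both metrics follow directly from the same pixel counts, as this small sketch shows:

```python
import numpy as np

def precision_recall(pred, truth):
    """Precision = TP / (TP + FP); Recall (= TPR) = TP / (TP + FN)."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    TP = np.sum(pred & truth)
    FP = np.sum(pred & ~truth)
    FN = np.sum(~pred & truth)
    precision = TP / max(TP + FP, 1)
    recall = TP / max(TP + FN, 1)
    return precision, recall
```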

Beyond Thresholding: Using Context

Simple thresholding operates on each pixel in isolation. It has no concept of “context” or “objectness.” This is why humans can so easily outperform it. We don’t just see pixel intensities; we see surfaces, shapes, and groups. To improve our algorithms, we must incorporate this idea of context. The simplest way to do this is to consider a pixel’s neighbors.

Pixel Connectivity

Before we can use neighbors, we must define what a “neighbor” is. For a pixel on a grid, we typically use one of two definitions:

  • 4-neighborhood: The four pixels directly above, below, left, and right.
  • 8-neighborhood: The four pixels of the 4-neighborhood, plus the four diagonal neighbors.

Based on this, we can define a path between two pixels as a sequence of adjacent neighbors. A connected region is then a set of pixels where a path exists between any two pixels within the set.

Connected Components Labeling

This brings us to a powerful segmentation technique: Connected Components Labeling. The algorithm takes a binary image (perhaps from thresholding) and assigns a unique label to each distinct connected region.

This simple algorithm is incredibly useful for counting objects, separating them for further analysis, and cleaning up the noisy output of a thresholding operation.
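
In practice this is a standard library routine; for example, scipy.ndimage.label assigns an integer label to each connected foreground region of a binary image. A short sketch, using a 3x3 structuring element to select 8-connectivity:

```python
import numpy as np
from scipy import ndimage

binary = np.array([[1, 1, 0, 0],
                   [0, 1, 0, 1],
                   [0, 0, 0, 1],
                   [1, 0, 0, 1]], dtype=np.uint8)

# 8-connectivity: diagonal neighbors belong to the same component.
structure = np.ones((3, 3), dtype=int)
labels, num_components = ndimage.label(binary, structure=structure)

print(num_components)   # 3 distinct regions in this example
print(labels)           # each region carries its own integer label
```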

Region Growing

A more flexible approach that combines the idea of connectivity with inclusion criteria is Region Growing. The algorithm works as follows:

  1. Start with one or more “seed” points or regions.
  2. Grow the regions by adding neighboring pixels that satisfy some inclusion criteria.
  3. Repeat until no more pixels can be added to any region.

The power of this method lies in its flexibility. The inclusion criteria can be much more sophisticated than a simple global threshold. For example, a pixel might be added to a region if its intensity is “close enough” to the mean intensity of the region so far. This allows the algorithm to adapt to local variations in brightness and texture.
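
Here is a minimal sketch of region growing from a single seed, using 4-connectivity and exactly this adaptive criterion (accept a neighbor if it lies within `tol` of the current region mean); the names and the tolerance value are illustrative:

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tol=10.0):
    """Grow a region from `seed` (row, col): a 4-connected neighbor joins the
    region if its intensity is within `tol` of the region's current mean."""
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    total, count = image[seed], 1          # running sum and size of the region
    frontier = deque([seed])

    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # 4-neighborhood
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not region[nr, nc]:
                if abs(image[nr, nc] - total / count) <= tol:
                    region[nr, nc] = True
                    total += image[nr, nc]
                    count += 1
                    frontier.append((nr, nc))
    return region
```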

By incorporating neighborhood information and adaptive criteria, region growing represents a significant step up from simple thresholding, bringing us closer to the context-aware segmentation that our own brains perform so effortlessly.
