In our previous lecture, we embarked on the task of image segmentation, exploring a variety of methods from simple thresholding to complex probabilistic models. We saw that to truly succeed, our algorithms must move beyond treating pixels as isolated entities and begin to understand their relationships and context.
Today, we delve into the fundamental machinery that allows us to do just that: image filtering. This is the bedrock of low-level image processing, a set of tools for modifying or extracting information from an image by operating on a pixel’s local neighborhood. The concepts we will learn here—especially convolution—are not just foundational to classical image processing; they are the very heart of modern deep learning architectures like Convolutional Neural Networks (CNNs).
What is Image Filtering?
At its core, image filtering is a simple but powerful idea: it is the process of modifying the value of a pixel based on a function of the values in its local neighborhood.
Imagine sliding a small window across an image. At each position, we look at the pixel values inside the window, apply some mathematical function to them, and use the result to compute the new value for the center pixel in our output image.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005113231.png)
Why Filter an Image?
This simple operation of local neighborhood processing unlocks a vast array of capabilities. The two primary motivations for filtering are:
- Noise Reduction and Image Restoration: As we’ve learned, all real-world images are corrupted by noise. By making the reasonable assumption that a pixel’s true value is likely similar to its neighbors, we can use filtering to average out random fluctuations and “denoise” the image, restoring a cleaner version of the underlying signal.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005113246.png)
- Structure Extraction: Images are rich with structure—edges, corners, textures, and lines. Filtering allows us to design specific operators that are sensitive to these structures. We can create filters that respond strongly at the locations of vertical edges, for example, effectively extracting a “map” of all the vertical lines in an image. This is a crucial first step in transforming a raw pixel grid into a more meaningful, feature-based representation.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005113257.png)
A First Application: Denoising by Averaging
Let’s tackle the problem of noise reduction head-on. Consider an image corrupted by Gaussian noise, where each pixel’s intensity is randomly perturbed. Our goal is to recover the original, clean image.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005113327.png)
The Core Assumptions
Our strategy will be based on two fundamental assumptions:
- Signal Coherence: We expect the true pixel values to be “like their neighbors.” Natural images are generally smooth; abrupt changes are rare and usually correspond to object boundaries.
- Noise Independence: We assume the noise is i.i.d. (independent and identically distributed). This means the noise added to one pixel is a random event that has no bearing on the noise added to its neighbors.
The Moving Average Solution
Given these assumptions, an intuitive solution presents itself: if the noise is random and independent at each pixel, but the underlying signal is similar across neighbors, then averaging the values in a local neighborhood should cancel out the noise while preserving the signal.
Let’s see this in action. We will create a new image, $G$, from our noisy input image, $F$. For each pixel in $G$, its value will be the average of the pixels in a 3x3 neighborhood around the corresponding location in $F$. This is called a moving average.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005113406.png)
As the averaging window slides across the image, you can see the effect. Outlier pixels (like the 90 in a sea of 0s) are suppressed, and the sharp boundary between the 0 and 90 regions becomes blurred. The noise is reduced, but at the cost of some image sharpness.
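To make this concrete, here is a minimal NumPy sketch of a 3x3 moving average; the zero padding at the border and the toy 0/90 test image are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def box_filter_3x3(F):
    """3x3 moving average: each output pixel is the mean of its 3x3 neighborhood."""
    F = F.astype(float)
    padded = np.pad(F, 1, mode="constant")               # zero-pad so border pixels also have 9 neighbors
    G = np.zeros_like(F)
    for x in range(F.shape[0]):
        for y in range(F.shape[1]):
            G[x, y] = padded[x:x + 3, y:y + 3].mean()    # average the 3x3 window centered at (x, y)
    return G

# Toy example in the spirit of the lecture figure: a bright 90-block in a sea of 0s
F = np.zeros((6, 6))
F[2:5, 2:5] = 90
print(box_filter_3x3(F).round(1))
```

Running this pulls isolated bright pixels down toward their neighbors and spreads the 0/90 boundary over adjacent pixels, exactly the noise-suppression versus blurring trade-off described above.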
Formalizing the Operation: Linear Shift-Invariant Filtering
The moving average operation we just performed is an example of a broad and powerful class of operations known as Linear Shift-Invariant (LSI) Filtering. Let’s break down what this means.
- Linear: An operation is linear if it obeys the principles of superposition and homogeneity. Formally, for an operator $\mathcal{L}$, inputs $f_1, f_2$, and scalars $a, b$:
$$\mathcal{L}(a f_1 + b f_2) = a\,\mathcal{L}(f_1) + b\,\mathcal{L}(f_2)$$
In the context of images, this means the operation can be expressed as a weighted sum of the input pixel values. Our moving average, where we summed the neighbors and divided by 9 (i.e., multiplied each neighbor by 1/9 and summed them), is a perfect example.
- Shift-Invariant: An operation is shift-invariant if it behaves the same way everywhere in the image. The weights (the averaging function) we apply at pixel $(x, y)$ are identical to the weights we apply at any other pixel $(x', y')$. We use the same filter, or kernel, at every location.
The Mathematics of Linear Filtering
Any LSI filtering operation can be written as a weighted sum over a neighborhood. The output image $G$ at pixel $(x, y)$ is computed from the input image $F$ as:
$$G(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} H(i, j)\, F(x+i, y+j)$$
- $H$ is the kernel of the operation. It’s a small array of weights that defines the filter. For our 3x3 moving average, the kernel was a 3x3 matrix of all 1/9s.
- The offsets $i, j \in [-k, k]$ define the neighborhood of the pixel $(x, y)$; a 3x3 kernel corresponds to $k = 1$.
This operation is so fundamental that it has two specific names, depending on a subtle detail in its implementation: Correlation and Convolution.
Correlation
Correlation is the most direct implementation of the formula above. To compute the output at $(x, y)$, we overlay the kernel onto the image neighborhood centered at $(x, y)$, multiply the corresponding elements, and sum the results.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005114912.png)
This is often used for template matching. If we treat the kernel as a “template” we are looking for, the output image will have high values in regions that are a good match for the template.
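As a sketch, the correlation sum above can be implemented directly with two nested loops; the “valid” output size (no padding) is an implementation assumption here:

```python
import numpy as np

def correlate(F, H):
    """Direct correlation: G(x, y) = sum over (i, j) of H(i, j) * F(x+i, y+j), valid region only."""
    kh, kw = H.shape
    out_h, out_w = F.shape[0] - kh + 1, F.shape[1] - kw + 1
    G = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            G[x, y] = np.sum(H * F[x:x + kh, y:y + kw])   # elementwise product, then sum
    return G

# A 3x3 box kernel reproduces the moving average from earlier
box = np.full((3, 3), 1.0 / 9.0)
smoothed = correlate(np.random.rand(8, 8), box)
```

Real libraries (for example `scipy.ndimage.correlate`) compute the same weighted sum far more efficiently and with configurable boundary handling.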
Correlation as a Dot Product
We can think of the correlation operation in a more geometric way. If we “unroll” the kernel and the image patch under it into long vectors, say $\mathbf{h}$ and $\mathbf{f}$, the correlation output is simply their dot product, $\mathbf{h} \cdot \mathbf{f}$.
Recall the definition of the dot product:
$$\mathbf{h} \cdot \mathbf{f} = \|\mathbf{h}\|\,\|\mathbf{f}\|\cos\theta$$
The output is maximized when the angle $\theta$ between the two vectors is zero—that is, when the vectors are pointing in the same direction (i.e., they are similar). By normalizing both the kernel and image patch vectors to have a length of 1, the dot product directly gives us $\cos\theta$, a measure of similarity ranging from -1 to 1. This is the basis of the Normalized Cross-Correlation (NCC) algorithm, a very robust method for template matching.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005114944.png)
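A minimal sketch of NCC-style template matching built on this idea; it keeps only the length-1 normalization described above (full NCC usually also subtracts the mean of each patch, which this sketch omits):

```python
import numpy as np

def ncc_map(F, T):
    """Slide template T over image F; each output value is cos(theta) between the
    unit-normalized template vector and the unit-normalized image patch."""
    th, tw = T.shape
    t = T.astype(float).ravel()
    t /= np.linalg.norm(t) + 1e-8                              # template as a unit vector
    out = np.zeros((F.shape[0] - th + 1, F.shape[1] - tw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            p = F[x:x + th, y:y + tw].astype(float).ravel()
            out[x, y] = t @ (p / (np.linalg.norm(p) + 1e-8))   # dot product of unit vectors, in [-1, 1]
    return out
```

Locations where the response approaches 1 are patches that point in (nearly) the same direction as the template, i.e., good matches.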
Convolution
Convolution is almost identical to correlation, with one crucial difference: the kernel is flipped 180 degrees before being applied.
$$G(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} H(i, j)\, F(x-i, y-j)$$
Notice the signs: $+i, +j$ for correlation, and $-i, -j$ for convolution. This means that to compute the output at $(x, y)$, the kernel element $H(i, j)$ is multiplied with the image pixel $F(x-i, y-j)$.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115025.png)
Why the Flip?
Why have two nearly identical operations? Convolution has deep roots in physics and signal processing. It is the correct mathematical operator to describe how a linear, shift-invariant system (like an optical system) responds to an input signal. It possesses mathematically desirable properties, such as being associative: $(f * g) * h = f * (g * h)$.
In practice, for many applications in computer vision (including deep learning), the distinction is minor. If the kernel is symmetric (e.g., a Gaussian), then correlation and convolution are identical. Many deep learning libraries, while calling their layers “convolutional,” actually implement correlation for computational convenience.
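A quick numerical check of the flip relationship, assuming SciPy is available; the asymmetric kernel is chosen so that the 180-degree flip actually changes the result:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

F = np.random.rand(8, 8)
H = np.array([[1, 2, 0],
              [0, 1, -1],
              [3, 0, 1]], dtype=float)      # asymmetric kernel: flipping it matters

conv = convolve2d(F, H, mode="valid")                       # true convolution
corr_flipped = correlate2d(F, np.flip(H), mode="valid")     # flip 180 degrees, then correlate
print(np.allclose(conv, corr_flipped))                      # True: the two agree exactly
```

With a symmetric kernel such as a Gaussian, the flip changes nothing, which is why the distinction rarely matters in practice.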
A Gallery of Kernels
The power of LSI filtering comes from the design of the kernel. Different kernels produce vastly different effects.
- Identity: A kernel with a 1 at the center and 0s elsewhere. The output is simply a copy of the original image.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115055.png)
- Shift: A kernel with an off-center 1. This shifts the entire image in the opposite direction of the 1’s offset (for convolution).
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115106.png)
- Blur (Box Filter): A kernel of all 1s, normalized by the number of elements (e.g., 1/9 for a 3x3 kernel). This is the moving average filter we started with. It smooths the image and reduces noise.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115122.png)
- Sharpening: A sharpening filter enhances differences between a pixel and its local average. A common way to construct one is to take an identity kernel and subtract a blurred version of the identity. This creates a kernel with a large positive value at the center, surrounded by small negative values.
This operation, known as unsharp masking, accentuates edges and details (a small construction sketch follows the figure below).
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115145.png)
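A minimal sketch of building such a sharpening kernel from an identity kernel and a 3x3 box blur; the `amount` parameter controlling the strength of the effect is an illustrative assumption:

```python
import numpy as np

identity = np.zeros((3, 3))
identity[1, 1] = 1.0                         # identity kernel: passes the image through unchanged
box = np.full((3, 3), 1.0 / 9.0)             # 3x3 box blur kernel

# "details" = identity - blur; sharpened = identity + amount * details
amount = 1.0
sharpen = identity + amount * (identity - box)
print(sharpen)                               # large positive center (about 1.89), small negative surround (about -0.11)
```

Convolving an image with `sharpen` adds the high-frequency detail (image minus its blur) back onto the original, which is exactly the unsharp masking effect described above.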
The Gaussian Filter: A Better Way to Smooth
The simple box filter gives equal weight to all pixels in the neighborhood, which can create blocky artifacts. A more principled approach is to give more weight to closer pixels, with the influence falling off smoothly with distance. The perfect mathematical function for this is the Gaussian.
A 2D Gaussian kernel is defined by:
$$G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$
The parameter $\sigma$ (standard deviation) controls the “width” of the bell curve. A larger $\sigma$ results in a wider kernel and more aggressive smoothing.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115206.png)
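A small sketch of sampling such a kernel on a discrete grid; the plus/minus 3-sigma radius is a common rule of thumb rather than a requirement:

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Sampled 2D Gaussian kernel, normalized so its weights sum to 1."""
    if radius is None:
        radius = int(np.ceil(3 * sigma))            # rule of thumb: cover +/- 3 sigma
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    K = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return K / K.sum()                              # normalize so filtering preserves brightness

print(gaussian_kernel(1.0).shape)                   # (7, 7) for sigma = 1
```

Normalizing the sampled weights to sum to 1 keeps the overall image brightness unchanged after smoothing.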
Properties of the Gaussian Filter
The Gaussian is the king of smoothing filters for several reasons:
- Rotational Symmetry: It smooths equally in all directions.
- Smooth Falloff: It provides a gentle, weighted average that avoids the artifacts of a box filter.
- Separability: This is a crucial computational property. A 2D Gaussian can be decomposed into the product of two 1D Gaussians:
$$G_\sigma(x, y) = \left(\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}\right) \cdot \left(\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{y^2}{2\sigma^2}}\right)$$
This means that convolving with a 2D Gaussian kernel is equivalent to first convolving every row with a 1D Gaussian, and then convolving every column of the result with the same 1D Gaussian. This reduces the computational complexity for a $K \times K$ kernel from $O(K^2)$ to $O(K)$ operations per pixel—a massive speedup for large kernels. A small numerical check of this equivalence follows the figure below.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115236.png)
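A numerical check of the separability claim, assuming SciPy; the “nearest” boundary mode and the random test image are arbitrary choices for this sketch:

```python
import numpy as np
from scipy.ndimage import convolve, convolve1d

sigma = 2.0
r = int(np.ceil(3 * sigma))
x = np.arange(-r, r + 1)
g1 = np.exp(-x**2 / (2 * sigma**2))
g1 /= g1.sum()                         # 1D Gaussian, weights summing to 1
g2 = np.outer(g1, g1)                  # equivalent 2D kernel (outer product of the 1D kernels)

img = np.random.rand(64, 64)
full_2d = convolve(img, g2, mode="nearest")               # one K x K pass: O(K^2) per pixel
rows = convolve1d(img, g1, axis=1, mode="nearest")        # filter every row:    O(K) per pixel
separable = convolve1d(rows, g1, axis=0, mode="nearest")  # then every column:   O(K) per pixel
print(np.allclose(full_2d, separable))                    # True: same result, far fewer operations
```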
Multi-Scale Representation: Gaussian Pyramids
A fundamental problem in vision is that objects can appear at any size or scale. A template of Waldo that is 50x30 pixels will fail to find a Waldo that is 100x60 pixels in the image. To handle this, we need to analyze the image at multiple scales. The Gaussian Pyramid is the standard way to do this.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115302.png)
It’s a collection of down-sampled versions of the image. However, to avoid aliasing artifacts when down-sampling, we must first smooth the image. The process is iterative:
- Start with the original image, $G_0$.
- Create the next level, $G_1$, by convolving $G_0$ with a Gaussian filter and then down-sampling (e.g., by taking every other pixel).
- Repeat: create $G_2$ by smoothing and down-sampling $G_1$, and so on.
This creates a “pyramid” of images, from high resolution at the bottom to low resolution at the top, where each level represents a coarser scale of the image content. This multi-scale representation, also known as scale space, is essential for building scale-invariant algorithms.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115250.png)
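A compact sketch of the iterative smooth-then-subsample loop, using SciPy’s Gaussian filter; the number of levels and the sigma value are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels=4, sigma=1.0):
    """Return [G0, G1, ...]: each level is the previous one smoothed, then subsampled by 2."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(pyramid[-1], sigma)   # smooth first, to avoid aliasing
        pyramid.append(smoothed[::2, ::2])               # then keep every other pixel in each dimension
    return pyramid

img = np.random.rand(256, 256)
print([level.shape for level in gaussian_pyramid(img)])  # (256, 256), (128, 128), (64, 64), (32, 32)
```

An algorithm (such as template matching) can then be run at every level, so that objects of different sizes are all brought into range of a fixed-size operator.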
Integral Images: A Shortcut for Summation
The box filter, while simple, requires summing up all the pixels in a rectangular window at every position. If the window is large, this is computationally expensive. The Integral Image, or Summed-Area Table, is a brilliant data structure that allows this calculation to be done in constant time, regardless of the window size.
An integral image, $II$, is an image where the value at any pixel $(x, y)$ is the sum of all pixels in the original image above and to the left of $(x, y)$.
Once this integral image is computed (which takes a single pass over the image), the sum of pixels within any rectangular region can be computed with just four lookups into the integral image.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115322.png)
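A minimal sketch of building the summed-area table with cumulative sums and recovering an arbitrary box sum with four lookups; the inclusive row/column indexing convention is an assumption of this sketch:

```python
import numpy as np

def integral_image(I):
    """II(x, y) = sum of all I(x', y') with x' <= x and y' <= y, via two cumulative sums."""
    return I.astype(float).cumsum(axis=0).cumsum(axis=1)

def box_sum(II, r0, c0, r1, c1):
    """Sum of the original image over rows r0..r1 and columns c0..c1 (inclusive), four lookups."""
    total = II[r1, c1]
    if r0 > 0:
        total -= II[r0 - 1, c1]
    if c0 > 0:
        total -= II[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += II[r0 - 1, c0 - 1]
    return total

I = np.arange(16).reshape(4, 4)
II = integral_image(I)
print(box_sum(II, 1, 1, 2, 2), I[1:3, 1:3].sum())   # both give the same value (30)
```

Whatever the window size, the cost of one box sum stays at four lookups, which is what makes large box-like features cheap to evaluate.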
This technique was the key innovation behind the Viola-Jones face detector, the first real-time object detector, which needed to compute thousands of box-filter-like features at every location and scale in the image.
Extracting Structure: Image Gradients
We now turn to our second major application of filtering: structure extraction. The most fundamental structure in an image is the edge—a location of rapid intensity change. Calculus gives us the perfect tool for measuring change: the derivative.
Derivatives and Edges
If we plot the intensity profile across an edge, we see a step-like function.
- The first derivative of this profile will have a peak (a maximum or minimum) at the location of the edge.
- The second derivative will have a zero-crossing at the location of the edge.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115410.png)
Approximating Derivatives with Convolution
For a discrete image, we can approximate the partial derivative with respect to $x$ using a finite difference:
$$\frac{\partial F}{\partial x}(x, y) \approx F(x+1, y) - F(x, y)$$
This operation is equivalent to convolving the image with the simple kernel $[-1, 1]$. Similarly, the partial derivative with respect to $y$ can be computed by convolving with the kernel $[-1, 1]$ transposed. Filters like Prewitt and Sobel are slightly more sophisticated versions that incorporate smoothing to make the derivative calculation more robust to noise.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005115448.png)
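A short sketch contrasting the simple finite-difference kernels with a Sobel kernel on a synthetic step edge, assuming SciPy; note that true convolution flips the kernel, so the sign of the responses depends on whether a library convolves or correlates:

```python
import numpy as np
from scipy.ndimage import convolve

# Synthetic vertical step edge: dark on the left, bright on the right
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Finite differences with the simple [-1, 1] kernels from above
dx = convolve(img, np.array([[-1.0, 1.0]]))    # strong response at the intensity step
dy = convolve(img, np.array([[-1.0], [1.0]]))  # zero everywhere: intensity does not change along y

# Sobel: a finite difference in one direction combined with smoothing in the other
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
gx = convolve(img, sobel_x)
gy = convolve(img, sobel_x.T)
magnitude = np.sqrt(gx**2 + gy**2)             # gradient magnitude peaks at the vertical edge
print(magnitude.round(1))
```

The gradient magnitude map is exactly the kind of “edge map” structure extraction aims to produce; Canny-style edge detectors build directly on it.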
Continue here: 05 Convolution and Filtering