In our previous lectures, we have built a solid foundation. We journeyed from the physics of image formation to the mathematics of filtering, learning how to manipulate and enhance pixel values. We saw how linear shift-invariant filters, particularly convolution, can be used for tasks like noise reduction (smoothing) and structure extraction (computing image gradients to find edges).
Today, we take a significant leap forward. We will move from processing images to understanding them. The goal is no longer just to modify pixels, but to distill the vast, unstructured grid of pixel data into a compact, meaningful, and robust set of image features. These features will serve as the fundamental building blocks for nearly all high-level computer vision tasks, from object recognition and image stitching to 3D reconstruction.
Recap: The Power and Purpose of Image Filtering
Let’s briefly revisit the key concepts from our discussion on filtering, as they are the tools we will use to build our features.
- Linear Shift-Invariant Filtering: The core operation, in which each pixel's new value is a weighted sum of its neighbors. The weights are defined by a kernel.
- Convolution/Correlation: The two fundamental implementations of this weighted sum; they differ only in whether the kernel is flipped before being applied.
- Gaussian Filter: Our go-to filter for smoothing, chosen for its desirable mathematical properties like rotational symmetry and separability.
- Scale Space: By repeatedly smoothing and down-sampling an image, we create a Gaussian Pyramid, a multi-scale representation that allows us to analyze features at different levels of detail.
- Integral Image: A clever data structure that allows for the extremely fast computation of sums within rectangular regions, enabling real-time applications.
- Image Gradient: By applying derivative filters, we can compute the image gradient ($\nabla I$), a vector at each pixel that points in the direction of the steepest intensity change. Its magnitude, $\|\nabla I\|$, tells us the edge strength, and its direction, $\theta$, tells us the orientation of the edge.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005175910.png)
The Image Gradient: Quantifying Change
The most fundamental structure in an image is the edge, a location of rapid intensity change. Calculus gives us the perfect tool for measuring change: the derivative.
Edges and Derivatives
If we plot the intensity profile across an edge, we see a step-like function.
- The first derivative of this profile will have a peak (a maximum or minimum) at the location of the edge.
- The second derivative will have a zero-crossing at the location of the edge.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005175921.png)
From Calculus to Convolution
For a discrete image, we can approximate the partial derivative with respect to $x$ using a finite difference:

$$\frac{\partial I}{\partial x}(x, y) \approx I(x+1, y) - I(x, y)$$

This operation is equivalent to convolving the image with the simple kernel $[-1, 1]$. Similarly, the partial derivative with respect to $y$ can be computed by convolving with the kernel $[-1, 1]^\top$. Filters like Prewitt and Sobel are slightly more sophisticated versions that incorporate smoothing to make the derivative calculation more robust to noise.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005175942.png)
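To make this concrete, here is a minimal NumPy/SciPy sketch of these derivative kernels. The `image` array is a random placeholder standing in for a real grayscale input, and `correlate` is used so that the kernel weights read off directly (convolution would flip them):

```python
import numpy as np
from scipy.ndimage import correlate

# Placeholder input: any 2D float array (a grayscale image) works here.
image = np.random.rand(64, 64)

kernel_dx = np.array([[-1.0, 1.0]])  # finite difference along x
kernel_dy = kernel_dx.T              # the same kernel, transposed, for y

grad_x = correlate(image, kernel_dx)  # approximates dI/dx (up to centering)
grad_y = correlate(image, kernel_dy)  # approximates dI/dy (up to centering)
```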
The Gradient Vector
The two partial derivatives can be combined into a vector called the image gradient:

$$\nabla I = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right)$$
This vector has two important properties:
- Magnitude: $\|\nabla I\| = \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2}$. This measures the edge strength. High magnitude corresponds to a strong edge.
- Direction: $\theta = \arctan\!\left(\frac{\partial I / \partial y}{\partial I / \partial x}\right)$. This vector points in the direction of the most rapid intensity increase, perpendicular to the edge itself.
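Continuing the sketch above, both quantities follow directly from the two partials; `arctan2` is used instead of a plain arctangent so that all four quadrants are resolved:

```python
import numpy as np

# grad_x and grad_y are the partial derivatives from the previous sketch.
magnitude = np.sqrt(grad_x**2 + grad_y**2)  # edge strength at every pixel
direction = np.arctan2(grad_y, grad_x)      # edge orientation in radians
```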
The Achilles’ Heel of Differentiation: Noise
A major problem is that differentiation is extremely sensitive to noise. A small, random fluctuation in a noisy image can create a very large derivative, swamping the true signal from the edges.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180003.png)
Solution: Smooth First!
The solution is elegant and powerful: smooth the image first, then take the derivative. By convolving the image with a Gaussian filter before applying the derivative filter, we suppress the noise, allowing the true edge structure to emerge.
A key mathematical property of convolution is that it is associative and differentiation is a linear operator. This leads to the Derivative of Convolution Theorem:

$$\frac{\partial}{\partial x}(f * g) = f * \frac{\partial g}{\partial x}$$

This means that smoothing with a filter $g$ and then differentiating is identical to convolving the image with the derivative of the filter, $\frac{\partial g}{\partial x}$. We can pre-compute the derivative of our Gaussian kernel to create a Derivative of Gaussian (DoG) filter. Applying this single filter to the image accomplishes both smoothing and differentiation in one efficient step, giving us a robust way to find edges even in the presence of noise.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180027.png)
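A minimal sketch of this one-step approach, assuming SciPy is available: `gaussian_filter` with `order=1` along an axis convolves with the first derivative of the Gaussian, so smoothing and differentiation happen in a single pass. The input and `sigma` are placeholder choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.rand(128, 128)  # placeholder grayscale input
sigma = 2.0                       # width of the Gaussian

dog_x = gaussian_filter(image, sigma, order=(0, 1))  # derivative along x (columns)
dog_y = gaussian_filter(image, sigma, order=(1, 0))  # derivative along y (rows)
edge_strength = np.hypot(dog_x, dog_y)  # noise-robust gradient magnitude
```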
An Alternative Approach: The Laplacian of Gaussian (LoG)
Instead of finding the peaks of the first derivative, we can look for the zero-crossings of the second derivative. The 2D equivalent of the second derivative is the Laplacian operator, $\nabla^2 I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}$.
Applying the same “smooth first” principle, we can convolve our image with the Laplacian of a Gaussian (LoG). This filter, often called the “Mexican Hat” filter for its distinctive shape, finds edges at the zero-crossings of its response. The width of the Gaussian, $\sigma$, controls the scale of the edges being detected.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180113.png)
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180154.png)
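A compact sketch of the LoG approach, again assuming SciPy: `gaussian_laplace` applies the Mexican Hat filter directly, and a deliberately crude sign-change test marks the zero-crossings:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

image = np.random.rand(128, 128)  # placeholder grayscale input

response = gaussian_laplace(image, sigma=2.0)  # LoG filter response

# Crude zero-crossing test: flag pixels where the response changes sign
# between horizontal or vertical neighbors. Real detectors also threshold
# the slope of the crossing to reject weak, noisy responses.
sign = np.sign(response)
zero_cross = (np.diff(sign, axis=0) != 0)[:, :-1] | (np.diff(sign, axis=1) != 0)[:-1, :]
```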
The Canny Edge Detector: An Optimal Approach
The Canny edge detector, developed in 1986, is a classic and highly influential algorithm that embodies the principles of designing an optimal detector. It’s a multi-stage process designed to satisfy three criteria: good detection, good localization, and single response.
- Smooth Image: Convolve the image with a Gaussian filter to reduce noise.
- Compute Gradient: Compute the gradient magnitude and angle at every pixel.
- Non-Maximum Suppression: Thin the thick edges. For each pixel, look at its two neighbors along the gradient direction. If the pixel’s gradient magnitude is not greater than both of its neighbors, suppress it (set its magnitude to 0). This ensures that only the peaks of the gradient ridges survive, resulting in thin, one-pixel-wide edges.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180225.png)
- Double Thresholding: Use two thresholds, $T_{\text{high}}$ and $T_{\text{low}}$. Pixels with magnitude above $T_{\text{high}}$ are marked as “strong” edges. Pixels with magnitude between $T_{\text{low}}$ and $T_{\text{high}}$ are marked as “weak” edges.
- Edge Linking (Hysteresis): Keep the weak edge pixels only if they are connected to a strong edge pixel. This clever step allows the algorithm to trace along weaker parts of a strong contour while discarding isolated noise responses (see the usage sketch after the figure below).
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180240.png)
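In practice, the whole pipeline is available as a single call in libraries such as OpenCV. A usage sketch, where the file path, blur parameters, and both thresholds are placeholder choices:

```python
import cv2

# "input.png" is a placeholder path; cv2.Canny expects an 8-bit grayscale image.
gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)              # step 1: smooth
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)  # steps 2-5

# `edges` is a binary map: 255 on thin, one-pixel-wide edges, 0 elsewhere.
```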
From Edges to Fitting: The Hough Transform
An edge map is a good start, but it’s just a collection of disconnected points. How do we find the larger structures, like lines or circles, that these points belong to? This is a fitting problem.
The Hough Transform is a brilliant and general technique for solving this. It reframes the fitting problem as a voting process in a parameter space.
The Duality of Points and Lines
Consider the equation of a line: $y = mx + b$. This equation defines a relationship between image space $(x, y)$ and a parameter space $(m, b)$.
- A single line in image space corresponds to a single point $(m, b)$ in parameter space.
- A single point $(x_0, y_0)$ in image space corresponds to a line in parameter space, defined by the equation $b = -x_0 m + y_0$. This line represents all possible lines that could pass through the point $(x_0, y_0)$.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180316.png)
This duality is the key. If multiple points are collinear in image space, their corresponding lines in parameter space will all intersect at a single point, the point that defines the line they all lie on!
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180331.png)
The Hough transform algorithm is thus:
- Create a 2D grid (an accumulator) representing the parameter space.
- For each detected edge pixel in the image, trace its corresponding line in parameter space and “vote” for (increment the value of) every cell it passes through.
- Find the cells in the accumulator with the most votes. These peaks correspond to the parameters of the most prominent lines in the image.
A Better Parameterization: Polar Coordinates
The $(m, b)$ parameterization has a major flaw: the slope $m$ goes to infinity for vertical lines. A more robust representation is the polar form:

$$x \cos\theta + y \sin\theta = \rho$$

Here, a line is defined by its angle $\theta$ and its perpendicular distance from the origin, $\rho$. In this parameter space, a point in the image maps to a sinusoid. The voting process remains the same: collinear points will produce sinusoids that intersect at a single point in the $(\theta, \rho)$ Hough space.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180352.png)
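A minimal NumPy sketch of this voting process; the toy edge map, the 1-degree angular resolution, and the integer rounding of $\rho$ are all illustrative simplifications:

```python
import numpy as np

# Toy edge map: a single horizontal line of edge pixels.
edges = np.zeros((100, 100), dtype=bool)
edges[50, 10:90] = True

h, w = edges.shape
diag = int(np.ceil(np.hypot(h, w)))  # largest possible |rho|
thetas = np.deg2rad(np.arange(180))  # theta sampled at 1-degree steps
accumulator = np.zeros((2 * diag + 1, len(thetas)), dtype=np.int64)

ys, xs = np.nonzero(edges)
for x, y in zip(xs, ys):
    # Each edge point traces the sinusoid rho = x*cos(theta) + y*sin(theta)
    # and votes in every accumulator cell it passes through.
    rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
    accumulator[rhos + diag, np.arange(len(thetas))] += 1

# The strongest peak gives the parameters of the dominant line.
rho_idx, theta_idx = np.unravel_index(np.argmax(accumulator), accumulator.shape)
print("rho =", rho_idx - diag, "theta =", np.rad2deg(thetas[theta_idx]), "deg")
```

For the toy input this prints the expected $\rho = 50$, $\theta = 90°$, i.e. the horizontal line $y = 50$.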
Examples
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180437.png)
Generalizing to Circles and Beyond
This voting principle is incredibly general. To find circles with a known radius $r$, we can use the circle equation $(x - a)^2 + (y - b)^2 = r^2$. Here, the parameter space is the 2D space of circle centers $(a, b)$. Each edge point $(x, y)$ in the image votes for a circle of possible centers in the $(a, b)$ space. The intersection of these circles reveals the center of the circle in the image. If the radius is also unknown, the parameter space becomes 3D $(a, b, r)$, and the voting process becomes more complex, but the principle remains the same.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180452.png)
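OpenCV ships a circle variant of this voting scheme. A usage sketch, where the file name and every parameter value are placeholder assumptions:

```python
import cv2
import numpy as np

# "coins.png" is a placeholder path; the detector expects 8-bit grayscale.
gray = cv2.imread("coins.png", cv2.IMREAD_GRAYSCALE)
gray = cv2.medianBlur(gray, 5)  # suppress noise before voting

circles = cv2.HoughCircles(
    gray, cv2.HOUGH_GRADIENT,
    dp=1,         # accumulator resolution (1 = same as the image)
    minDist=20,   # minimum distance between detected centers
    param1=100,   # upper threshold of the internal Canny stage
    param2=30,    # accumulator threshold: lower values find weaker circles
    minRadius=10, maxRadius=80,
)

if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        print(f"circle at ({x}, {y}) with radius {r}")
```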
Examples
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180511.png)
Local Image Features: The Building Blocks of Recognition
We now shift our focus from finding global structures like lines to identifying and describing local, salient regions. These local image features are the cornerstone of modern computer vision, enabling tasks like image stitching, 3D reconstruction, and object recognition.
Why We Need Features
Imagine trying to build a panorama from multiple photos. You need to find corresponding points between the images to align them. Or consider a robot navigating a room using a single camera (Monocular SLAM, Simultaneous Localization and Mapping). It needs to identify stable, recognizable landmarks in the video stream to track its own motion. In all these cases, we need to answer two key questions:
- Where are the interesting, repeatable regions? (Detection)
- How can we describe these regions in a way that is robust and distinctive? (Description)
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180526.png)
Applications
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20251005180813.png)