Before we embark on our technical journey, it is wise to equip ourselves with good maps. For those who wish to explore the vast territories of computer vision beyond this course, two texts are particularly invaluable.

  • Computer Vision: Algorithms and Applications by Richard Szeliski is a comprehensive and authoritative tome, covering the breadth and depth of the field.
  • Foundations of Computer Vision, a more recent text by Antonio Torralba, Phillip Isola, and William T. Freeman, offers a modern perspective, beautifully connecting classical principles with the latest deep learning and foundation models.

For this course, we will follow a curated path. A set of lecture notes, originally prepared by Prof. Marc Pollefeys and Prof. Markus Gross and now updated for this semester, will be our primary guide. These notes will be released weekly to accompany the lectures and will serve as the core material for the exam.

The Modern Miracle of Digital Cameras

The image you see below, a stunning photograph of Zurich at dusk, is a testament to a technology that has become so ubiquitous we often forget its magic. Digital cameras are, without exaggeration, one of the most remarkable sensors ever invented.

This technology has evolved at a breathtaking pace. The very first digital camera, a cumbersome prototype secretly built by Kodak engineers in 1975, was a far cry from the sleek devices we know today. It wasn’t until 1988 that Fuji released one of the first commercial models with memory and storage. Now, just a few decades later, we carry cameras of extraordinary power and sophistication in our pockets.

This proliferation has had a profound consequence: it has fueled an unprecedented explosion of visual data. Every photo shared, every video uploaded, contributes to a vast digital library of the world. This very data is the lifeblood of modern artificial intelligence, providing the raw material from which machine learning models learn to see, understand, and create.

The Imperfections of a Perfect Sensor

Yet, for all their brilliance, digital cameras are not perfect. The journey from light in the real world to a final image file is fraught with potential pitfalls. The images we capture are often corrupted by a host of artifacts and imperfections.

  • Transmission Interference: Data can be lost or garbled as it’s sent from a satellite to Earth, resulting in corrupted or incomplete images.
  • Compression Artifacts: To save space, images are compressed using algorithms like JPEG. This process is lossy, meaning some information is discarded. At high compression levels, this results in noticeable blocky or blurry artifacts.
  • Sensor Noise and Defects: Physical imperfections in the sensor, or even cosmic rays striking it, can lead to scratches, dead pixels, or random noise that speckles the image.
  • Bad Contrast: If the camera fails to correctly map the range of brightness in the scene to the displayable range, the resulting image can appear washed out or overly dark, obscuring important details.
  • Resolution Limits: Every digital image has a finite resolution. If we zoom in too far, we lose detail. The task of super-resolution, intelligently guessing the missing details to create a higher-resolution image, is a major area of computer vision research, often relying on AI models trained on vast datasets to “hallucinate” plausible details.
  • Motion Blur: If an object (or the camera itself) moves during the time the shutter is open, the resulting image will be blurred. This is a common problem when photographing fast-moving scenes.

Co-Designing Hardware and Algorithms to Fight Blur

How can we combat motion blur? One clever approach, pioneered in a 2006 SIGGRAPH paper on coded exposure photography by Raskar, Agrawal, and Tumblin, involves co-designing the camera hardware and the processing algorithm.

  • A short exposure freezes motion but captures very little light, resulting in a dark, noisy image.
  • A traditional long exposure captures plenty of light but produces a blurry image.
  • The proposed solution is a coded exposure, where the shutter flutters open and closed in a specific, pseudo-random pattern during the exposure time.

The resulting captured photo still looks blurry to a human. However, because the pattern of the blur is now known and controlled, a corresponding deblurring algorithm can deconvolve this specific pattern from the image, recovering a final result that is both sharp and bright. This is a beautiful example of how thinking about the entire imaging pipeline, from hardware to software, can solve seemingly intractable problems.
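
To see why the flutter pattern matters, here is a small 1D sketch (using NumPy; the random binary code below is made up, not the carefully optimized code from the paper). Blur is a convolution with the exposure pattern, and deblurring divides by that pattern's frequency response, so the pattern must not contain frequencies at or near zero.

```python
import numpy as np

# 1D sketch: compare the frequency response of a flat long exposure (a box) with that of
# a pseudo-random on/off "fluttered shutter" code of the same total length.
rng = np.random.default_rng(0)
n, length = 64, 16                                       # signal length, exposure length

box = np.ones(length)                                    # traditional long exposure
code = rng.integers(0, 2, size=length).astype(float)     # random open/closed pattern

box_response = np.abs(np.fft.rfft(box, n))
code_response = np.abs(np.fft.rfft(code, n))

# The box has frequencies that are exactly zero: that information is destroyed and no
# algorithm can recover it. A well-chosen code keeps every frequency away from zero, so
# the known blur can be divided out again. The paper searches for a code that maximizes
# this minimum; a random code merely illustrates the idea.
print("smallest |response|, box: ", box_response.min())
print("smallest |response|, code:", code_response.min())
```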

What is an Image? A Deeper Look

We have discussed the creation and flaws of digital images, but we have yet to formally define what an image is. There are two complementary ways to think about this: one from the perspective of mathematics and signal processing, and the other from the perspective of physics and geometry.

The Image as a 2D Signal

From a mathematical standpoint, an image is simply a function. For a standard grayscale image, it is a function of two variables, I(x, y), that returns an intensity value for every coordinate (x, y) on a 2D plane.

The value of the function, I(x, y), typically represents brightness. However, it can represent any other physical quantity that varies over a 2D space, such as temperature in a thermal image, tissue density in a CT scan, or depth in a range map.

A video can be seen as a natural extension of this idea, adding a third variable for time: I(x, y, t).
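
To make this concrete, here is a minimal Python sketch (using NumPy, with random arrays standing in for real camera data) of an image as a discrete 2D function and a video as its 3D extension:

```python
import numpy as np

# A grayscale "image" as a discrete 2D function I(x, y); random values stand in for data.
height, width = 480, 640
I = np.random.randint(0, 256, size=(height, width), dtype=np.uint8)

# Evaluating the function at a coordinate returns an intensity value.
x, y = 100, 50
print(f"I({x}, {y}) = {I[y, x]}")        # rows index y, columns index x

# A video adds a third variable for time: I(x, y, t).
num_frames = 30
video = np.random.randint(0, 256, size=(num_frames, height, width), dtype=np.uint8)
print(f"I({x}, {y}, t=0) = {video[0, y, x]}")
```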

The Image as a Projection

From a physical standpoint, an image is a projection of a 3D scene onto a 2D plane. It is a shadow, a trace left by the 3D world on our sensor. To understand the image, we must understand the geometric and photometric (light-related) relationship between the scene and its 2D representation.

Mathematically, we can describe an image as a function f that maps a point in some n-dimensional space R^n to a value in some space S.

For a digital grayscale image, this continuous function is discretized. The domain becomes a finite grid of integer coordinates, and the range becomes the set of positive real numbers, representing intensity.
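
Spelled out in symbols, and assuming a common but not canonical choice of notation (Ω for the domain, f for the image function, W × H for the pixel grid), the continuous and discretized views might be written as:

```latex
% Continuous image: a function from an n-dimensional domain to a value space S
f \colon \mathbb{R}^n \supseteq \Omega \to S

% Continuous grayscale image on the plane: intensities over 2D coordinates
f \colon \Omega \subset \mathbb{R}^2 \to \mathbb{R}_{+}, \qquad (x, y) \mapsto f(x, y)

% Digital grayscale image: the domain becomes a finite W x H grid of integer coordinates
f \colon \{0, \dots, W-1\} \times \{0, \dots, H-1\} \to \mathbb{R}_{+}
```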

What is a Pixel?

This leads us to the fundamental building block of a digital image: the pixel. What exactly is it?

It is a common and deeply ingrained misconception to think of a pixel as a “little square.” It is not.

A Pixel Is Not A Little Square!

In his famous 1995 technical memo, computer graphics pioneer Alvy Ray Smith passionately argued against the “little square” model. A pixel, he explained, is not a geometric shape. It is a sample. It is a point measurement of the continuous light signal that falls on the sensor at a specific location.

The familiar grid of squares we see when we zoom into an image is merely a convenient way to visualize these discrete samples. The underlying reality is a set of point values, which can be rendered or reconstructed in many different ways, as dots, as interpolated smooth surfaces, or, yes, as little squares. But the square itself is not the pixel.
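
Here is a minimal 1D sketch of this idea, with a made-up signal: the same point samples can be rendered as flat "little squares" (zero-order hold) or as a smoother interpolated curve, and neither rendering is the sample itself.

```python
import numpy as np

# Pixels as point samples of a continuous signal, shown in 1D for simplicity.
xs = np.arange(8)                        # sample positions
samples = np.sin(2 * np.pi * xs / 8)     # point measurements ("pixels")

# A dense grid on which to reconstruct a continuous signal from the samples.
x_fine = np.linspace(0, 7, 141)

# Reconstruction 1: nearest neighbour / zero-order hold -> the "little square" look.
nearest = samples[np.clip(np.round(x_fine).astype(int), 0, len(xs) - 1)]

# Reconstruction 2: linear interpolation -> a smoother curve from the very same samples.
linear = np.interp(x_fine, xs, samples)

# Same samples, two different reconstructions; the square is a rendering choice.
```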

The Genesis of an Image: From Pinhole to Lens

Images are all around us, originating from digital cameras, MRI scanners, computer graphics software, and more. But how is an image formed in the first place? The fundamental physical principle is ancient, elegant, and remarkably simple.

The Pinhole Camera

The principle of the pinhole camera, or camera obscura, was described by thinkers like Mozi in ancient China and Aristotle in ancient Greece. It states that light passing through a tiny hole projects an inverted image on the opposite side.

Let’s understand why this is necessary through a thought experiment.

  1. Bare-Sensor Imaging: Imagine we have a light-sensitive sensor and an object (a tree). Light rays from every point on the tree travel in all directions. This means that every point on our sensor is illuminated by rays from every point on the tree. The result is a complete, uniform blur; no image is formed.
  2. Adding a Barrier: Now, let’s place an opaque barrier with a tiny pinhole between the tree and the sensor. This barrier blocks almost all light rays. For any given point on the sensor, only a single ray of light from a single point on the tree can pass through the pinhole to reach it.
  3. Image Formation: This creates a one-to-one mapping between points in the 3D scene and points on the 2D sensor. The result is a sharp, focused copy of the object, though inverted and scaled. This is the magic of the pinhole camera (a small numeric sketch of this projection follows the list).
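
A minimal numeric sketch of this projection, assuming an idealized pinhole, a sensor at distance f behind it, and simple similar-triangle geometry (the focal length and scene point below are invented for illustration):

```python
import numpy as np

def pinhole_project(point_3d, focal_length=0.05):
    """Project a 3D point (X, Y, Z), in metres, onto the 2D sensor plane."""
    X, Y, Z = point_3d
    # Similar triangles: the image is scaled by focal_length / Z and inverted
    # (hence the minus signs).
    return np.array([-focal_length * X / Z, -focal_length * Y / Z])

# The top of a 10 m tall tree, 50 m away, lands 1 cm below the optical axis.
print(pinhole_project((0.0, 10.0, 50.0)))    # -> [ 0.   -0.01]
```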

The Pinhole’s Dilemma and the Lens Solution

The ideal pinhole camera creates a perfectly sharp image. However, it relies on two conflicting requirements:

  • For a perfectly sharp image, the pinhole must be infinitesimally small to ensure a perfect one-to-one mapping of rays.
  • To create a bright image, the pinhole must be large enough to let in a sufficient amount of light (photons).

In practice, making the pinhole larger to increase brightness causes each scene point to project to a small circle on the sensor, rather than a single point, resulting in a blurry image.

The modern solution to this trade-off is the lens. A lens has a much larger aperture than a pinhole, allowing it to gather a large “bundle” of light rays from a single point in the scene. Its crucial property is that it can bend all of these rays and refocus them back to a single point on the sensor. A lens thus provides the best of both worlds: the brightness of a large aperture and the sharpness of a perfect pinhole.
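
A small sketch of the pinhole trade-off, with illustrative numbers that are not from the notes: the light gathered grows with the square of the pinhole diameter, while the geometric blur spot left by each scene point grows roughly linearly with it.

```python
# Pinhole brightness vs. sharpness trade-off (all quantities are assumed values).
sensor_distance = 0.05       # metres from the pinhole to the sensor
object_distance = 10.0       # metres from the pinhole to the scene point

for diameter in (0.0001, 0.001, 0.01):            # pinhole diameter in metres
    relative_brightness = diameter ** 2            # proportional to the aperture area
    # Similar triangles: a point source projects to a spot roughly this wide.
    blur_spot = diameter * (object_distance + sensor_distance) / object_distance
    print(f"d = {diameter * 1000:6.2f} mm | brightness ~ {relative_brightness:.1e} | "
          f"blur spot ~ {blur_spot * 1000:.3f} mm")
```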

Inside the Digital Camera

The pinhole and lens describe the optics of image formation. But how is that optical image converted into a digital file? This is the job of the image sensor and its associated electronics.

The Charge-Coupled Device (CCD)

One of the foundational sensor technologies is the Charge-Coupled Device (CCD). Here’s how it works:

  1. Capturing Photons: The sensor is a grid of millions of photosites. Each photosite is essentially a tiny “bucket” made of silicon that is sensitive to light. When the shutter opens, photons from the lens strike these photosites.
  2. Photon-to-Electron Conversion: The silicon in the photosites has a special property: when a photon hits it, it releases an electron. Over the exposure time, each bucket accumulates an electrical charge (a number of electrons) that is directly proportional to the intensity of the light that fell on it.
  3. The Readout: A “Bucket Brigade”: After the shutter closes, the collected charge must be read out and measured. In a CCD, this happens through a clever process often called a “bucket brigade.” The entire grid of charge packets is shifted, one row at a time, to a special readout register at the edge of the sensor. This register then shifts the charge packets one by one to an amplifier.
  4. Analog-to-Digital Conversion (ADC): The amplifier converts the tiny electrical charge of each packet into a measurable analog voltage. This voltage is then fed to an Analog-to-Digital Converter (ADC), which measures the voltage and assigns it a discrete numerical value (e.g., a number from 0 to 255). This stream of numbers forms the final digital image (a toy numeric sketch of these steps follows the list).
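
The following toy sketch compresses these steps into a few lines; the photon-to-electron gain, full-well capacity, and 8-bit output range are made-up illustrative values, not the specifications of any real sensor.

```python
import numpy as np

rng = np.random.default_rng(0)

incident_light = np.array([[0.1, 0.5],            # relative light intensity per photosite
                           [0.9, 5.0]])
exposure = 1.0                                     # exposure time (arbitrary units)
full_well = 10_000                                 # max electrons a "bucket" can hold

# 1-2. Photon arrival is random (Poisson), and each photon frees roughly one electron.
electrons = rng.poisson(incident_light * exposure * 2_000)
electrons = np.minimum(electrons, full_well)       # an overfull bucket simply saturates

# 3-4. The charge is read out, amplified, and quantized by the ADC to an 8-bit number.
digital = np.round(electrons / full_well * 255).astype(np.uint8)
print(digital)                                     # the very bright photosite clips at 255
```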

This elegant process, however, has some inherent artifacts:

  • Blooming: If a photosite is exposed to extremely bright light, its bucket can overflow, spilling excess charge into neighboring buckets and creating bright streaks or blobs in the image.
  • Dark Current: Heat can cause electrons to be generated in the silicon even in complete darkness. This creates a low level of noise called dark current, which is why high-end astronomical cameras are often actively cooled.

The CMOS Sensor

Today, most consumer cameras, especially in smartphones, use a different technology called CMOS (Complementary Metal-Oxide-Semiconductor). While the light-capturing element is still a silicon photosite, the readout architecture is fundamentally different.

The key innovation in CMOS sensors is that each photosite has its own amplifier. There is no “bucket brigade.” Instead, the charge in each pixel can be converted to a voltage and read out directly and individually.

CCD vs. CMOS: A Tale of Two Architectures

| CCD (Charge-Coupled Device) | CMOS (Complementary Metal-Oxide-Semiconductor) |
| --- | --- |
| Mature, specialized technology | More recent, standard IC technology |
| High production cost | Cheap to manufacture at scale |
| High power consumption | Low power consumption |
| High "fill rate" (more light-sensitive area) | Lower fill rate (space needed for amplifier) |
| Prone to "blooming" artifacts | Less sensitive, traditionally more noise |
| Sequential, "bucket brigade" readout | Random pixel access, "smart pixels" possible |
| – | On-chip integration of other components |

One significant issue with many CMOS sensors is the rolling shutter. Because rows of pixels are read out sequentially rather than all at once, the top of a fast-moving object is captured at a slightly different time than the bottom. This can lead to strange geometric distortions, such as the skewed appearance of a helicopter’s rotor blades in a video. More expensive “global shutter” CMOS sensors avoid this by reading out the entire sensor simultaneously.
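
A minimal sketch of the effect, with a synthetic "image" and made-up timing: a vertical bar moving sideways while the rows are read out one after another comes out skewed.

```python
import numpy as np

height, width = 8, 16
row_readout_delay = 1.0      # time between reading consecutive rows (arbitrary units)
bar_speed = 1.0              # columns the bar moves per time unit (assumed)

frame = np.zeros((height, width), dtype=int)
for row in range(height):
    t = row * row_readout_delay              # the moment this row is sampled
    bar_column = int(2 + bar_speed * t)      # where the moving bar is at that moment
    frame[row, bar_column % width] = 1

print(frame)    # the 1s trace a diagonal: a straight bar rendered as a skewed one
```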

The Future: Event Cameras

A revolutionary new type of sensor, inspired by the human visual system and pioneered here in Zurich, is the Dynamic Vision Sensor (DVS), or event camera. Unlike traditional cameras that capture entire frames at a fixed rate (e.g., 30 times per second), an event camera’s pixels work independently and asynchronously. A pixel only “fires” and transmits data when it detects a change in brightness.
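
A minimal sketch of this per-pixel rule, assuming the common formulation in which an event fires whenever the log brightness changes by more than a contrast threshold since the pixel's last event (the threshold and brightness trace below are invented for illustration):

```python
import numpy as np

threshold = 0.2
brightness = [1.0, 1.05, 1.3, 1.3, 0.9, 0.88]     # one pixel's brightness over time

events = []
reference = np.log(brightness[0])                  # level at the pixel's last event
for t, b in enumerate(brightness[1:], start=1):
    delta = np.log(b) - reference
    if abs(delta) > threshold:
        events.append((t, +1 if delta > 0 else -1))   # (timestamp, polarity)
        reference = np.log(b)                          # reset the reference level

print(events)    # no change -> no events; only brightness changes produce data
```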

This leads to incredible advantages:

  • High Temporal Resolution: It can capture extremely fast motion without any motion blur.
  • Low Data Rate: If nothing in the scene is moving, the camera produces no data, saving power and bandwidth.
  • High Dynamic Range: It can see details in both very dark and very bright parts of a scene simultaneously.

This technology is poised to revolutionize high-speed robotics, drones, and other applications where traditional cameras fall short.

Digitizing Reality: The Twin Processes

We’ve seen how a sensor turns light into an analog electrical signal. The final step is to convert this continuous signal into the discrete numbers that form a digital image. This involves two fundamental processes.

Sampling

The world is continuous. An image is discrete. Sampling is the process of measuring the continuous image signal at a finite number of points on a grid. Each sample becomes a pixel.

The Peril of Undersampling: Aliasing

A crucial question arises: how many samples are enough? If we don’t sample frequently enough, we run into a deceptive problem called aliasing.

Imagine a rapidly spinning wagon wheel in an old movie. Sometimes, it appears to be spinning slowly backwards. This is aliasing. The camera’s frame rate (its sampling rate in time) is too slow to accurately capture the high-frequency rotation of the wheel. The high-frequency motion is not just lost; it is masquerading as a different, lower frequency.

This is a fundamental issue in all of signal processing. A signal “travels in disguise” as another frequency if the sampling rate is insufficient.

The Nyquist-Shannon Sampling Theorem

Fortunately, there is a mathematical foundation that tells us how to avoid this. The Nyquist-Shannon Sampling Theorem is a cornerstone of information theory. It states:

To perfectly capture and reconstruct a signal, you must sample it at a rate at least twice as fast as the highest frequency present in the signal.

This critical rate, twice the highest frequency present in the signal, is known as the Nyquist rate. Conversely, the highest frequency that can be faithfully captured at a given sampling rate, namely half that sampling rate, is called the Nyquist frequency.
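
A minimal numeric sketch of the theorem in action, with made-up frequencies: a 9 Hz sine sampled at only 10 Hz (well below its Nyquist rate of 18 Hz) produces exactly the same samples as a 1 Hz sine running "backwards", just like the wagon wheel.

```python
import numpy as np

sample_rate = 10.0                        # Hz; the Nyquist frequency is therefore 5 Hz
t = np.arange(0, 1, 1 / sample_rate)      # one second of sample times

fast = np.sin(2 * np.pi * 9 * t)          # 9 Hz signal, above the Nyquist frequency
alias = np.sin(2 * np.pi * -1 * t)        # the 1 Hz alias it masquerades as (reversed)

print(np.allclose(fast, alias))           # True: indistinguishable once sampled
```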

Quantization

Sampling discretizes the spatial domain of an image. Quantization discretizes the value domain. The analog voltage from the sensor can be any real number within a range. To store it digitally, we must round it to one of a finite number of levels.

This process is inherently lossy. Once a value is rounded, we can never recover the original, precise analog value. The number of quantization levels is determined by the bit depth.

  • A 1-bit image can only represent 2^1 = 2 levels (e.g., black and white).
  • An 8-bit grayscale image can represent 2^8 = 256 levels of gray.
  • A standard 24-bit color image uses 8 bits for each of the Red, Green, and Blue channels, allowing for 2^24 ≈ 16.7 million possible colors per pixel (a small quantization sketch follows this list).
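
A small sketch of quantization at different bit depths (the input intensity is an arbitrary made-up value); the rounding error is exactly the information that is lost.

```python
import numpy as np

def quantize(value, bits):
    """Round a value in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    return np.round(value * (levels - 1)) / (levels - 1)

intensity = 0.7203            # an "analog" value from the sensor
for bits in (1, 4, 8):
    q = quantize(intensity, bits)
    print(f"{bits}-bit: {2 ** bits:4d} levels, stored as {q:.4f}, "
          f"error {abs(q - intensity):.4f}")
```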

Image Properties and Noise

The concepts of sampling and quantization lead us to the final properties that define a digital image.

Image Resolution

When we talk about “resolution,” we are often referring to two different things:

  • Geometric Resolution: This is determined by sampling. It refers to how many pixels are used to represent a certain area. An image sampled at, say, 1024 × 1024 pixels has a higher geometric resolution than one sampled at 128 × 128.
  • Radiometric Resolution: This is determined by quantization. It refers to the number of bits used to represent each pixel’s value (its bit depth). An image with 256 gray levels (8-bit) has a higher radiometric resolution than one with only 2 gray levels (1-bit).
