A Tale of Two Disciplines
Welcome to the world of Visual Computing. At its heart, this field is driven by a fundamental human quest: to understand and recreate the world we see around us. This course is a journey into that quest, structured as a tale of two deeply intertwined disciplines.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923162730.png)
The first half of our journey, led by Prof. Siyu Tang, will be an exploration of Computer Vision. Think of this as the science of seeing. We will delve into how we can empower machines to interpret and understand the content of images and videos, moving from raw pixel data to meaningful insights. Our focus will be on the bedrock of this field: the principles of image processing.
The second half, guided by Prof. Markus Gross, will venture into the realm of Computer Graphics. This is the art and science of creating. Here, we will learn how to synthesize visual worlds from the ground up: how to model the geometry of a mountain, how to simulate the way light dances on water, and how to render the breathtaking scenes you see in modern films and video games.
These two fields are not separate; they are two sides of the same coin, locked in a fascinating, inverse relationship.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923162715.png)
The Grand Duality: Vision and Graphics
To truly grasp the essence of visual computing, we must first understand the profound connection between seeing and creating.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923162534.png)
Computer Graphics: The Forward Problem
Computer Graphics begins with a perfect, known quantity: a complete description of a 3D scene stored within a computer’s memory. This is a world of pure information, where every detail is explicitly defined: the exact shape of every object, the material properties of every surface, the precise location and intensity of every light source, and the viewpoint of a virtual camera.
The task of computer graphics is to solve the forward problem: to take this rich 3D model and project it onto a 2D plane, calculating the color and brightness of every single pixel to produce a final image. This is a process of synthesis. It is a well-posed problem; with enough computational power, the result is deterministic and unambiguous.
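To make the forward problem concrete, here is a minimal sketch of its geometric core: a pinhole camera projecting 3D points onto a 2D image plane. The focal length and points are made-up illustration values, not anything from the lecture.

```python
import numpy as np

def project(points_3d: np.ndarray, f: float = 1.0) -> np.ndarray:
    """Pinhole projection: map camera-space points (X, Y, Z)
    to image-plane coordinates (f*X/Z, f*Y/Z)."""
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([f * X / Z, f * Y / Z], axis=1)

# Three example points; note the second lies on the same ray as the first.
points = np.array([[1.0, 0.5, 2.0],
                   [2.0, 1.0, 4.0],   # same direction, twice the depth
                   [0.0, 0.0, 1.0]])
print(project(points))  # the first two points land on the same pixel
```

Notice how the projection quietly discards depth. That loss of information is exactly what makes the inverse direction so hard.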
Computer Vision: The Inverse Problem
Computer Vision starts where graphics leaves off. We are handed the final product, a 2D image, which to a computer is nothing more than a vast grid of numerical values. From this limited, flat representation, we must perform an act of profound inference. We must solve the inverse problem: to deduce the rich, 3D world that gave rise to the image. This is a process of analysis.
This task is monumentally more difficult. It is what mathematicians call an ill-posed problem.
The Ill-Posed Nature of Vision
An ill-posed problem is one for which a solution may not exist, may not be unique, or may not depend stably on the input. Vision fails the uniqueness test spectacularly: for any given 2D image, there are infinitely many possible 3D scenes that could have produced it.
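A one-line calculation makes this precise. Under a pinhole camera with focal length $f$, a 3D point $(X, Y, Z)$ projects to

$$(x, y) = \left(\frac{fX}{Z},\ \frac{fY}{Z}\right) = \left(\frac{f(\lambda X)}{\lambda Z},\ \frac{f(\lambda Y)}{\lambda Z}\right) \quad \text{for every } \lambda > 0,$$

so every scaled copy of a scene along the viewing ray produces exactly the same pixel. Depth is simply not recoverable from a single measurement.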
Consider a simple 2D pattern. It could be the result of a flat painting on an easel, a complex 3D sculpture viewed from just the right angle, or even a clever projection of light onto a blank wall. All of these are plausible, yet wildly different, 3D realities.
To solve this inherent ambiguity, computer vision must rely on priors: assumptions or knowledge about the structure of the world. We might assume that surfaces are generally smooth, that objects are solid, or that lighting comes from above. In the modern era, these priors are often learned from enormous datasets, allowing a machine to understand what constitutes a “plausible” 3D scene.
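One common way to formalize the role of priors, sketched here schematically rather than as any specific algorithm, is as a regularized inverse problem: search for the scene that both explains the image and looks plausible a priori,

$$\hat{S} = \arg\min_{S}\ \underbrace{\lVert I - \Pi(S) \rVert^2}_{\text{data term}} \;+\; \lambda\, \underbrace{P(S)}_{\text{prior}},$$

where $\Pi$ is the forward (graphics) operator that renders a scene $S$ into an image, $P$ penalizes implausible scenes, and $\lambda$ balances the two terms. Modern learned priors simply replace a hand-designed $P$ with one distilled from data.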
What, Then, is Computer Vision?
When we look at a photograph, we don’t see a matrix of numbers; we see a story. Our brain effortlessly translates patterns of light into concepts: a bicycle leaning against a wall, flowers blooming in a window box. A computer, however, sees only the raw data.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923162815.png)
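It is worth seeing this literally once. A minimal sketch, assuming Pillow and NumPy are available and using a hypothetical file photo.jpg, of what the computer actually receives:

```python
import numpy as np
from PIL import Image

img = np.asarray(Image.open("photo.jpg"))  # hypothetical example file
print(img.shape)   # e.g. (480, 640, 3): height x width x RGB channels
print(img.dtype)   # uint8: every value is an integer in 0..255
print(img[0, 0])   # the top-left pixel, e.g. [142  98  71] -- just numbers
```

Everything that follows in this course, from edge detection to scene understanding, starts from nothing more than this array.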
The grand, aspirational goal of our field can be stated simply: to give computers (super) human-level perception. We aim to build a bridge from that meaningless grid of numbers to a rich, semantic understanding. This ambition unfolds into two primary pursuits.
Vision for Measurement
The first pursuit is to use vision as a scientific instrument. Here, the goal is to extract precise, quantitative information about the 3D world from images. A classic example is the NASA Mars Rover, which uses its stereo cameras not just to navigate, but to build detailed 3D topographical maps of the Martian landscape. A more terrestrial example involves taking hundreds of tourist photos of a landmark, like the Notre Dame Cathedral, and using sophisticated algorithms to automatically stitch them together into a metrically accurate 3D model. This is about turning pixels into geometry.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923162855.png)
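The rover’s stereo measurement rests on one classical relation. For two parallel cameras with focal length $f$ and baseline $B$ (the distance between them), a scene point whose image shifts by disparity $d$ between the left and right views lies at depth

$$Z = \frac{f B}{d}.$$

Nearby rocks shift a lot between the two views (large $d$, small $Z$); distant hills barely move. Turning pixels into geometry thus begins with matching pixels across views to measure $d$.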
Vision for Perception and Interpretation
The second, and arguably more profound, pursuit is to imbue machines with the ability to understand the meaning within an image. This is not about measuring distances, but about identifying concepts. Given an image of an amusement park, we want a system that can label the objects (Ferris wheel, carousel), describe the activities (people sitting on a ride), and summarize the scene (an amusement park on a sunny day). This is about turning pixels into knowledge.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923162910.png)
The Foundational Wisdom of David Marr
The pioneering neuroscientist David Marr provided one of the most elegant definitions of vision in his seminal book, *Vision*. He proposed that to “see” is simply “to know what is where by looking.”
Marr’s enduring legacy is his framework for understanding any complex information-processing system, like vision. He argued that we must analyze it at three distinct levels: the computational theory (the “what” and “why”), the algorithm (the “how”), and the implementation (the physical substrate). This structured way of thinking remains a guiding principle for researchers today.
The Intrinsic Hardship of Visual Perception
The human brain, with over half its cortex dedicated to processing visual information, makes seeing look easy. It is anything but. When we attempt to replicate this feat in silicon, we confront a series of fundamental and deeply challenging problems.
The Ambiguity of Projection
The core difficulty stems from the loss of information when a 3D world is projected onto a 2D sensor.
- Viewpoint Variation: An object’s appearance is a function of the observer’s viewpoint. A coffee mug seen from the side is a rectangle with a C-shaped handle; from the top, it is a circle. A robust vision system must achieve viewpoint invariance, recognizing that these are merely different views of the same underlying 3D object.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923163007.png)
- Perception vs. Measurement: Our visual system is an interpretation engine, not a photometer. The famous checker shadow illusion by Edward Adelson is the quintessential example. In this image, two checkerboard squares, labeled A and B, appear to be very different shades of gray. In truth, their pixel values are identical. Our brain, using its built-in knowledge of how shadows work, automatically brightens square B in our perception. A computer, reading the raw data, would see no difference; a sketch of this raw-value check follows the figure below. This highlights the gap between raw measurement and contextual interpretation.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923163055.png)
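A hedged sketch of that raw-value check: given the illusion image as a file and hypothetical pixel coordinates for the centers of squares A and B, the data itself shows no difference.

```python
import numpy as np
from PIL import Image

# Load the illusion as a grayscale array (file name is illustrative).
img = np.asarray(Image.open("checker_shadow.png").convert("L"))

a = img[110, 150]  # hypothetical coordinates inside square A
b = img[200, 150]  # hypothetical coordinates inside square B
print(a, b)        # for Adelson's image, the two gray values coincide
```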
The Unruly Nature of the Real World
Beyond the geometry of projection, the physical world itself presents a chaotic and ever-changing stream of data.
- Illumination: The color and intensity of light sources dramatically alter the raw pixel values of a scene. An object photographed at noon looks entirely different at sunset or under fluorescent lighting. An algorithm must learn to disentangle the intrinsic properties of an object (its “albedo” or true color) from the lighting that illuminates it; a standard model for this decomposition is sketched after this list.
- Deformation: Most objects in the world are not rigid. Animals, people, cloth, and water are in constant flux. Recognizing a “horse” requires an algorithm to understand the concept of a horse across all its possible gaits and poses, a far greater challenge than recognizing a static, rigid object like a building.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923163031.png)
- Occlusion and Clutter: Rarely do we see objects in perfect isolation. They are often partially hidden behind other objects (occlusion) or set against a busy, confusing background (clutter). The ability to segment an object from its surroundings and infer its full shape from partial evidence is a critical, and difficult, visual skill.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923163117.png)
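The albedo/lighting disentanglement from the first bullet is usually written as the intrinsic-image model, a standard simplification that ignores effects like specular highlights:

$$I(x) = R(x)\cdot S(x),$$

where $I(x)$ is the observed intensity at pixel $x$, $R(x)$ is the reflectance (albedo) of the surface, and $S(x)$ is the shading contributed by illumination and geometry. Recovering $R$ and $S$ from $I$ alone is itself ill-posed, one equation with two unknowns per pixel, which is once again where priors enter.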
The Abstraction of Meaning
Perhaps the deepest challenge is that of semantics.
- Intra-class Variation: Consider the concept of a “chair.” There is no single geometric template that can describe all chairs. Some have four legs, some have one. Some are made of wood, others of plastic or metal. The category is defined by its function, or affordance: it is something one can sit on. Teaching a machine such an abstract, functional category based only on visual data is a frontier of AI research.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923163204.png)
Opportunities: Finding Cues in the Chaos
While the challenges are immense, the picture is not entirely bleak. Images are not random noise; they are structured projections of a physical world governed by rules. This structure provides a wealth of cues that an intelligent system can exploit to resolve ambiguity. Our job is to design algorithms that can find and interpret these cues.
Cues like linear perspective, where parallel lines converge at a distance, provide powerful information about depth. Occlusion, where one object partially hides another, gives us an unambiguous sense of relative depth ordering. The subtle gradations of shading across a curved surface reveal its 3D form. By combining these and many other cues, such as texture, color similarity, and motion, a vision system can begin to piece together a coherent and plausible model of the world.
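The linear-perspective cue follows in one line from the pinhole model. A 3D line through point $A$ with direction $D$ projects, as we travel along it, to

$$\lim_{t\to\infty} \left( \frac{f(A_x + tD_x)}{A_z + tD_z},\ \frac{f(A_y + tD_y)}{A_z + tD_z} \right) = \left( \frac{fD_x}{D_z},\ \frac{fD_y}{D_z} \right),$$

a point that depends only on the direction $D$, not on $A$. All parallel lines therefore converge to the same vanishing point, which is why railway tracks appear to meet at the horizon, and why detecting such convergence tells us about scene depth and orientation.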
Applications: From Pixels to Progress
The rapid progress in solving these challenges, largely fueled by deep learning, has unlocked a breathtaking array of applications that are reshaping industries and our daily lives.
Reconstructing Our World in 3D
The dream of capturing the real world and bringing it into the digital realm is fast becoming a reality. This technology powers the 3D buildings in Google Earth and allows us to create “digital twins” of factories for simulation. A cornerstone of this technology is COLMAP, a powerful Structure-from-Motion pipeline developed right here at ETH, which can build a 3D model from nothing more than a collection of photographs.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923163242.png)
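For the curious, the whole photos-to-model pipeline is scriptable. A minimal sketch using COLMAP’s pycolmap Python bindings (paths are hypothetical, and the exact API may vary between versions, so treat this as an outline rather than the definitive interface):

```python
import os
import pycolmap

image_dir = "photos/"              # hypothetical folder of input photographs
database_path = "out/database.db"  # COLMAP stores features and matches here
output_path = "out/sparse"
os.makedirs(output_path, exist_ok=True)

# 1. Detect local features in every image.
pycolmap.extract_features(database_path, image_dir)
# 2. Match features across image pairs.
pycolmap.match_exhaustive(database_path)
# 3. Incrementally estimate camera poses and a sparse 3D point cloud.
maps = pycolmap.incremental_mapping(database_path, image_dir, output_path)
maps[0].write(output_path)
```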
Understanding and Simulating Humanity
Nowhere is this progress more evident than in the quest to capture, understand, and synthesize humans.
The Quest for the Digital Human
Creating realistic digital avatars is crucial for the future of communication, entertainment, and virtual reality. Traditionally, this required Hollywood-style motion capture studios. Today, the goal is to democratize this process, enabling the creation of photorealistic avatars from simple video.
A Decade of Progress in Motion Capture
The pace of change in this field is staggering. For the lecturer’s own Master’s thesis, roughly 14 years ago, the task was to track human motion from a single depth camera. The result, after much effort, was a crude model of connected cylinders, and it didn’t even have arms, as they were too difficult to track reliably!
Fast forward ten years. A PhD student at ETH, using the same type of input data but armed with modern deep learning techniques, can now produce a fully articulated, realistic human model that tracks complex motion with remarkable fidelity. This incredible leap was made possible by the confluence of large datasets, sophisticated neural network architectures, and powerful GPUs. It raises the question: what will be possible in another ten years?
Synthesizing Behavior for a Safer World
Beyond just capturing what is, we can now synthesize what could be. For applications like training autonomous vehicles, this is critical. We need to test self-driving cars against not just typical pedestrian behavior, but also rare and dangerous “edge cases.” We can’t wait for these to happen in the real world; we must generate them.
Using techniques like Reinforcement Learning, we can create virtual humans that populate these simulators, behaving in diverse and realistic ways, making our AI systems safer and more robust.
The Generative Revolution
Perhaps the most visible breakthrough has been in image synthesis. Models like DALL-E 2 and Imagen can now take a simple text prompt, like “An astronaut riding a horse in a photorealistic style,” and generate a high-quality image that matches the description. This technology is revolutionizing creative industries, but it also forces us to confront profound ethical questions about authenticity, misinformation, and the nature of art itself.
![](/Semester-5/Visual-Computing/Lecture-Notes/attachments/Pasted-image-20250923163344.png)
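DALL-E 2 and Imagen are proprietary, but the same text-to-image interface is available through open-source diffusion models. A minimal sketch using Hugging Face’s diffusers library (the model checkpoint is an illustrative choice, and a CUDA GPU is assumed):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an open-source text-to-image diffusion model (illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "An astronaut riding a horse in a photorealistic style"
image = pipe(prompt).images[0]  # run the denoising loop; returns a PIL image
image.save("astronaut.png")
```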
The Road Ahead
This is a golden age for visual computing. Yet, for all our progress, many fundamental questions remain open. How can we learn more efficiently, with less human supervision? How can we guarantee the robustness and safety of our systems? How do we deploy these powerful models on resource-constrained devices like phones and drones? And most importantly, how do we navigate the ethical landscape we are creating?
The principles we will explore in this course are the foundation upon which the answers to these questions will be built. Let us begin.
Continue here: 02 The Digital Image