In the world of computing, everything ultimately reduces to sequences of symbols. Programs are symbol sequences, data is stored as bits, and every interaction with a computer, from a mouse click to a complex calculation, can be represented in this fundamental way. To study computation, we first need a precise vocabulary to describe these symbol sequences.

This chapter introduces the fundamental vocabulary of theoretical computer science. We will build a formal framework for describing information, which will serve as the bedrock for all the concepts that follow, from simple automata to the limits of what can be computed.

2.1 Aims

This chapter is designed to equip you with the foundational concepts for our entire journey. By the end, you will be able to:

  1. Define the Core Vocabulary: Understand and use the fundamental concepts of Alphabets, Words, and Languages. These are the basic building blocks for describing information in a computational context, much like letters form words and sentences in natural languages.
  2. Formalize Algorithmic Problems: Use this new vocabulary to precisely define what an “algorithmic problem” is. We will focus on crucial classes such as Decision Problems, Optimization Problems, Relation Problems, and Generation/Enumeration Problems.
  3. Measure Information Content: Explore the idea of compressibility by introducing Kolmogorov Complexity. This powerful concept allows us to measure the intrinsic information content of a word and provides a formal, mathematical definition for the elusive concept of randomness. It also serves as a valuable instrument for investigating computations and proving non-existence results.

2.2 Alphabets, Words, and Languages

Just as natural languages are built from letters, the languages of computation are built from symbols. To precisely define what computers process and how they do it, we must first formalize these fundamental building blocks.

2.2.1 Alphabets: The Building Blocks

Definition 2.1 (Alphabet)

An alphabet is a finite, non-empty set of symbols. We typically denote it with the Greek letter $\Sigma$ (Sigma). The elements of an alphabet are called letters, characters, or symbols.

The choice of symbols is arbitrary; what matters is that the set is finite. The meaning of the symbols comes from how we interpret them, not from their appearance. For instance, ‘0’ and ‘1’ could just as easily be ‘A’ and ‘B’ – their computational role remains the same.

Common Examples of Alphabets:

  • The Boolean Alphabet: $\Sigma_{\text{bool}} = \{0, 1\}$. This is the most fundamental alphabet in computing, representing binary data.
  • The Latin Alphabet: $\Sigma_{\text{lat}} = \{a, b, c, \dots, z\}$.
  • A Keyboard Alphabet: $\Sigma_{\text{keyboard}}$ includes all symbols on a standard computer keyboard: letters, digits, punctuation, and special characters like the blank symbol (␣).
  • The Alphabet for m-adic Numbers: $\Sigma_m = \{0, 1, \dots, m-1\}$ for any integer $m \geq 2$.
  • The Alphabet for Logic: $\Sigma_{\text{logic}} = \{0, 1, x, (, ), \land, \lor, \neg\}$, used to represent Boolean formulas.

2.2.2 Words: Sequences of Symbols

Once we have an alphabet, we can form sequences of symbols. In computer science, any text, regardless of its length or meaning in natural language, is considered a “word.”

Definition 2.2 (Word)

A word (or string) over an alphabet $\Sigma$ is a finite sequence of letters from $\Sigma$.

  • The empty word, denoted by $\varepsilon$ (or sometimes $\lambda$), is the sequence with zero letters. Its length is $|\varepsilon| = 0$.
  • The length of a word $w$, denoted $|w|$, is the number of letters in the sequence.
  • $\Sigma^*$ is the set of all possible words over the alphabet $\Sigma$, including the empty word. This set represents every possible finite combination of letters from $\Sigma$.
  • $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$ is the set of all non-empty words over $\Sigma$.

For example, 010011 is a word over $\Sigma_{\text{bool}}$ with length $|010011| = 6$. The set $(\Sigma_{\text{bool}})^*$ contains all possible binary strings: $\{\varepsilon, 0, 1, 00, 01, 10, 11, 000, \dots\}$.
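As a quick illustration (ours, not part of the formal development), the following Python sketch models words as strings and enumerates the words of $(\Sigma_{\text{bool}})^*$ up to a given length bound; the helper name `words_up_to` is ours.

```python
from itertools import product

SIGMA_BOOL = "01"  # the Boolean alphabet {0, 1}

def words_up_to(alphabet: str, max_len: int):
    """Yield every word over `alphabet` of length 0..max_len,
    starting with the empty word epsilon (modeled as "")."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

print(list(words_up_to(SIGMA_BOOL, 2)))
# ['', '0', '1', '00', '01', '10', '11']
print(len("010011"))  # the length |010011| = 6
```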

2.2.3 Representing Objects as Words (Encoding)

The true power of this formalism comes from its ability to represent any object of interest in computation as a unique word. This universal representation process is called encoding. It’s how we translate complex ideas—numbers, graphs, logical formulas, or even entire computer programs—into the simple sequences of symbols that computers can process.

Representing Numbers

A binary word $x = x_1 x_2 \dots x_n \in (\Sigma_{\text{bool}})^*$ can represent the natural number $\text{Number}(x) = \sum_{i=1}^{n} 2^{n-i} \cdot x_i$. We denote the standard, shortest binary representation of a number $m$ as $\text{Bin}(m)$. For $m \geq 1$, this representation conventionally starts with a ‘1’. For example, $\text{Bin}(5) = 101$ and $\text{Bin}(12) = 1100$.

A sequence of numbers, like $m_1, m_2, \dots, m_k$, can be encoded as a single word using a special separator symbol $\#$, for instance: $\text{Bin}(m_1)\#\text{Bin}(m_2)\#\cdots\#\text{Bin}(m_k)$.
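A minimal Python sketch of these conventions (our illustration; `bin_repr` and `encode_sequence` are hypothetical helper names):

```python
def bin_repr(m: int) -> str:
    """Shortest binary representation Bin(m); starts with '1' for m >= 1."""
    return format(m, "b")

def encode_sequence(numbers) -> str:
    """Encode m1, ..., mk as Bin(m1)#Bin(m2)#...#Bin(mk)."""
    return "#".join(bin_repr(m) for m in numbers)

print(bin_repr(12))                # 1100
print(encode_sequence([2, 3, 5]))  # 10#11#101
```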

Representing Graphs

A directed graph $G$ with $n$ vertices $v_1, \dots, v_n$ can be represented by its adjacency matrix $M_G = [a_{ij}]_{i,j=1,\dots,n}$. In this matrix, the entry $a_{ij}$ is 1 if there is an edge from vertex $v_i$ to vertex $v_j$, and 0 otherwise.

We can flatten this matrix into a word by listing its rows, separated by a $\#$ symbol.

Graph Encoding

The graph with the adjacency matrix

$$M_G = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$ is encoded as the word `0011#0011#0101#0000#` over the alphabet $\{0,1,\#\}$. This representation is unambiguous; we can perfectly reconstruct the graph from the word.

Weighted graphs can be represented similarly by encoding the integer weights in binary, using a single $\#$ to separate weights in a row and a double $\#\#$ to separate rows. For example, an edge weight of 7 would be encoded as $\text{Bin}(7) = 111$.
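The following sketch (ours) implements the unweighted encoding just described and demonstrates the claimed unambiguity by decoding the word back into the matrix:

```python
def encode_graph(adj) -> str:
    """Flatten an adjacency matrix into a word over {0, 1, #},
    one row at a time, each row terminated by '#'."""
    return "".join("".join(str(bit) for bit in row) + "#" for row in adj)

def decode_graph(word: str):
    """Recover the adjacency matrix; the encoding is unambiguous."""
    rows = word.split("#")[:-1]  # drop the empty piece after the final '#'
    return [[int(bit) for bit in row] for row in rows]

M = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 1, 0, 1],
     [0, 0, 0, 0]]
w = encode_graph(M)
print(w)                     # 0011#0011#0101#0000#
assert decode_graph(w) == M  # round-trip shows the representation is unique
```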

Representing Logical Formulas

Boolean formulas can also be encoded as words. We can use an alphabet like $\Sigma_{\text{logic}} = \{0, 1, x, (, ), \land, \lor, \neg\}$. To handle an infinite supply of variables $x_1, x_2, x_3, \dots$, we can encode the variable $x_i$ as the letter $x$ followed by $\text{Bin}(i)$.

Formula Encoding

The formula $(x_1 \lor x_7) \land \neg(x_{12})$ is encoded as: `(x1 ∨ x111) ∧ ¬(x1100)`. Here, $\text{Bin}(1) = 1$, $\text{Bin}(7) = 111$, and $\text{Bin}(12) = 1100$.
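A one-line sketch of the variable encoding (our illustration; `encode_var` is a hypothetical name):

```python
def encode_var(i: int) -> str:
    """Encode the variable x_i as the letter 'x' followed by Bin(i)."""
    return "x" + format(i, "b")

print(encode_var(1), encode_var(7), encode_var(12))  # x1 x111 x1100
```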

2.2.4 Operations on Words

We can manipulate words using several fundamental operations.

Definition 2.3 (Concatenation)

The concatenation of two words $u$ and $v$, written $uv$ (or $u \cdot v$), is the word formed by appending $v$ to the end of $u$. For example, if $u = 01$ and $v = 10$, then $uv = 0110$.

Concatenation is associative (i.e., $(uv)w = u(vw)$ for all $u, v, w \in \Sigma^*$). The set $\Sigma^*$ of all words with the concatenation operation forms a mathematical structure called a monoid, with the empty word as the neutral element ($\varepsilon w = w \varepsilon = w$). Concatenation is generally not commutative (i.e., $uv \neq vu$ unless specific conditions are met). The length of concatenated words is additive: $|uv| = |u| + |v|$.

Definition 2.4 (Reversal)

For a word $w = w_1 w_2 \dots w_n$, where $w_i \in \Sigma$, the reversal of $w$ is the word $w^R = w_n w_{n-1} \dots w_1$. The reversal of the empty word is $\varepsilon^R = \varepsilon$. A useful property is that for any words $u, v$, $(uv)^R = v^R u^R$.

Definition 2.5 (Iteration)

For any word $w$ and any integer $i \geq 0$, the $i$-th iteration of $w$, denoted $w^i$, is the word formed by concatenating $w$ with itself $i$ times. By definition, $w^0 = \varepsilon$ and $w^{i+1} = w^i w$. For example, $(ab)^3 = ababab$.

Definition 2.6 (Subwords)

Let $v, w \in \Sigma^*$.

  • $v$ is a subword (or substring) of $w$ if there exist words $x, y \in \Sigma^*$ such that $w = xvy$.
  • $v$ is a prefix of $w$ if there exists a word $y \in \Sigma^*$ such that $w = vy$.
  • $v$ is a suffix of $w$ if there exists a word $x \in \Sigma^*$ such that $w = xv$. A subword, prefix, or suffix $v$ of $w$ is proper if it is not equal to the word itself (i.e., $v \neq w$).

Definition 2.7 (Symbol Count)

For a word $w \in \Sigma^*$ and a symbol $a \in \Sigma$, the notation $|w|_a$ denotes the number of occurrences of the symbol $a$ in $w$. For example, for $w = abbab$, $|w|_a = 2$ and $|w|_b = 3$. The total length of a word is the sum of the counts of all its symbols: $|w| = \sum_{a \in \Sigma} |w|_a$.
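Python strings happen to model all of these word operations directly; the following sketch (ours) checks the stated identities on the concrete word $w = abbab$:

```python
u, v = "ab", "bab"
w = u + v                                  # concatenation: uv = "abbab"
assert (u + v)[::-1] == v[::-1] + u[::-1]  # reversal: (uv)^R = v^R u^R
assert "ab" * 3 == "ababab"                # iteration: (ab)^3
assert w.startswith(u) and w.endswith(v)   # u is a prefix, v a suffix of w
assert "bba" in w                          # "bba" is a subword of w
assert w.count("a") == 2 and w.count("b") == 3  # |w|_a = 2, |w|_b = 3
assert len(w) == w.count("a") + w.count("b")    # |w| = sum of symbol counts
print("all identities hold for this example")
```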

2.2.5 Languages: Sets of Words

In formal language theory, the term “language” has a very broad meaning, encompassing any collection of words.

Definition 2.8 (Language)

A language $L$ over an alphabet $\Sigma$ is any subset of $\Sigma^*$, i.e., $L \subseteq \Sigma^*$.

This means any set of strings, finite or infinite, is a language. Think of it as a collection of “valid” or “meaningful” words from the infinite pool of all possible words. Languages are the primary way we define the properties of strings we are interested in.

Examples of languages:

  • $\emptyset$ (the empty language, containing no words).
  • $\{\varepsilon\}$ (the language containing only the empty word; note that $\{\varepsilon\} \neq \emptyset$).
  • $\{a^n b^n \mid n \in \mathbb{N}\}$ over $\{a, b\}$. This language represents strings with a block of a’s followed by an equal number of b’s.
  • $\{w \in \{a, b\}^* \mid |w|_a = |w|_b\}$ (all words with an equal number of a’s and b’s).
  • The set of all syntactically correct C++ programs (a language over $\Sigma_{\text{keyboard}}$).

Since languages are sets, we can apply standard set operations like union ($L_1 \cup L_2$), intersection ($L_1 \cap L_2$), and complement ($L^C = \Sigma^* \setminus L$). We can also extend word operations to languages:

  • Concatenation: $L_1 \cdot L_2 = \{uv \mid u \in L_1, v \in L_2\}$. This forms a new language by concatenating every word from $L_1$ with every word from $L_2$.
  • Kleene Star: $L^* = \bigcup_{i \geq 0} L^i$, where $L^0 = \{\varepsilon\}$ and $L^{i+1} = L^i \cdot L$. The Kleene star represents concatenating words from the language zero or more times, and always includes the empty word.
  • Positive Closure: $L^+ = \bigcup_{i \geq 1} L^i = L \cdot L^*$. This is similar to the Kleene star, but it contains the empty word only if $\varepsilon \in L$. (A small sketch of these operations on finite languages follows below.)
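Here is the promised sketch (ours) of these operations on small finite languages; the Kleene star is necessarily truncated to a length bound, since $L^*$ is infinite for any $L$ containing a non-empty word. The helpers `concat` and `star_up_to` are hypothetical names.

```python
def concat(L1, L2):
    """L1 . L2 = { uv | u in L1, v in L2 }"""
    return {u + v for u in L1 for v in L2}

def star_up_to(L, max_len):
    """All words of L* having length <= max_len."""
    result = {""}            # L^0 = {epsilon}
    frontier = {""}
    while frontier:
        frontier = {w + u for w in frontier for u in L
                    if len(w + u) <= max_len} - result
        result |= frontier
    return result

L1, L2 = {"a", "ab"}, {"b", "bb"}
print(sorted(concat(L1, L2)))         # ['ab', 'abb', 'abbb']
print(sorted(star_up_to({"ab"}, 6)))  # ['', 'ab', 'abab', 'ababab']
```

Note that $|L_1 \cdot L_2| = 3$ here even though $|L_1| \cdot |L_2| = 4$: the words $a \cdot bb$ and $ab \cdot b$ coincide.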

Lemma 2.1 (Distributivity of Concatenation over Union)

For any languages $L_1, L_2, L_3$ over an alphabet $\Sigma$, the following holds: $L_1 \cdot (L_2 \cup L_3) = (L_1 \cdot L_2) \cup (L_1 \cdot L_3)$. This means concatenation distributes over union, similar to multiplication over addition in arithmetic.

Lemma 2.2 (Inclusion for Concatenation and Intersection)

For any languages $L_1, L_2, L_3$ over an alphabet $\Sigma$, the following inclusion holds: $L_1 \cdot (L_2 \cap L_3) \subseteq (L_1 \cdot L_2) \cap (L_1 \cdot L_3)$.

Lemma 2.3 (Strict Inclusion Example)

There exist languages $L_1, L_2, L_3$ for which the inclusion in Lemma 2.2 is proper (i.e., not an equality). For example, take $L_1 = \{\varepsilon, a\}$, $L_2 = \{a\}$, and $L_3 = \{aa\}$: the left-hand side is empty because $L_2 \cap L_3 = \emptyset$, yet $aa \in (L_1 \cdot L_2) \cap (L_1 \cdot L_3)$.

2.2.6 Canonical Order and Homomorphisms

To work with infinite sets of words, it’s often useful to have a standardized, predictable way to list them. This is where the concept of a canonical order comes in.

Definition 2.9 (Canonical Order)

Let $\Sigma = \{s_1, s_2, \dots, s_m\}$ be an alphabet with a fixed ordering on its symbols, $s_1 < s_2 < \dots < s_m$. The canonical order (also called the shortlex or length-lexicographical order; note that it is not the plain dictionary order) on $\Sigma^*$ is defined as follows: for any two words $u, v \in \Sigma^*$, we say $u < v$ if:

  1. $|u| < |v|$ (shorter words come first), OR
  2. $|u| = |v|$ and $u$ comes before $v$ in the standard dictionary order (comparing symbol by symbol from left to right).

For example, for $\Sigma_{\text{bool}}$ with the ordering $0 < 1$, the canonical order begins: $\varepsilon, 0, 1, 00, 01, 10, 11, 000, 001, \dots$

This ordering is crucial for enumeration problems, where we need to list the words of a language in a specific sequence.
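Because the canonical order compares length first, it is easy to realize in Python with a sort key; this sketch (ours) assumes the symbol order matches the character order, as it does for '0' < '1':

```python
def canonical_key(w: str):
    """Sort key for the canonical order: length first, then the
    dictionary order among words of equal length."""
    return (len(w), w)

words = ["10", "1", "", "01", "0", "11", "00"]
print(sorted(words, key=canonical_key))
# ['', '0', '1', '00', '01', '10', '11']
```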

Another powerful tool for relating languages to each other is a homomorphism, which is a function that preserves the structure of word concatenation.

Definition 2.10 (Homomorphism)

A homomorphism is a function $h \colon \Sigma_1^* \to \Sigma_2^*$ between the words over two alphabets $\Sigma_1$ and $\Sigma_2$ that satisfies two properties:

  1. $h(\varepsilon) = \varepsilon$ (it maps the empty word to the empty word).
  2. $h(uv) = h(u)h(v)$ for all words $u, v \in \Sigma_1^*$.

Because of the second property, a homomorphism is completely determined by its action on the individual symbols of the alphabet $\Sigma_1$. If we know $h(a)$ for every $a \in \Sigma_1$, we can find the image of any word by concatenating the images of its symbols. For example, if $w = a_1 a_2 \dots a_n$, then $h(w) = h(a_1) h(a_2) \cdots h(a_n)$.

Homomorphism for Encoding

We can use a homomorphism to encode words from one alphabet into another. Let $\Sigma_1 = \{a, b\}$ and $\Sigma_2 = \Sigma_{\text{bool}}$. Define a homomorphism $h$ by $h(a) = 0$ and $h(b) = 11$. Then the word $aba$ is mapped to: $h(aba) = h(a)h(b)h(a) = 0 \cdot 11 \cdot 0 = 0110$.
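A sketch (ours) of this determination-by-symbols: represent $h$ by a dictionary of symbol images and extend it to whole words by concatenation; `apply_homomorphism` is a hypothetical helper name.

```python
def apply_homomorphism(h: dict, w: str) -> str:
    """h maps each symbol to a word; extend h to whole words by
    concatenating the images symbol by symbol."""
    return "".join(h[a] for a in w)

h = {"a": "0", "b": "11"}
print(apply_homomorphism(h, "aba"))  # 0110
# h(epsilon) = epsilon and h(uv) = h(u)h(v) hold by construction:
assert apply_homomorphism(h, "") == ""
assert apply_homomorphism(h, "ab" + "ba") == \
       apply_homomorphism(h, "ab") + apply_homomorphism(h, "ba")
```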

Homomorphisms are fundamental in formal language theory for simplifying proofs and relating different classes of languages.


2.3 Algorithmic Problems

With our formal vocabulary of words and languages, we can now define what we mean by an “algorithmic problem.” Just as we formalized data, we must formalize the questions we ask about that data to study their solvability and complexity. We’ll focus on several main types.

2.3.1 Decision Problems

A decision problem is a question with a yes/no answer. We can formalize this using languages.

Definition 2.11 (Decision Problem)

A decision problem is a pair $(\Sigma, L)$, where $\Sigma$ is an alphabet and $L \subseteq \Sigma^*$ is a language. The problem is: for any given word $x \in \Sigma^*$, decide whether $x \in L$.

An algorithm solves (or decides) this problem if it halts on every input $x \in \Sigma^*$ and outputs “yes” (e.g., 1) if $x \in L$, and “no” (e.g., 0) if $x \notin L$. A language for which such an algorithm exists is called recursive or decidable.

Primality Test

  • Problem: Given a number, is it prime?
  • Formalization: Let $\Sigma = \Sigma_{\text{bool}}$ and $L_{\text{PRIM}} = \{\text{Bin}(p) \mid p \text{ is prime}\}$. The problem is $(\Sigma_{\text{bool}}, L_{\text{PRIM}})$. An algorithm for this problem would take a binary string $x$ as input and output 1 if $\text{Number}(x)$ is prime, and 0 otherwise (see the sketch below).
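A sketch (ours) of one such decision algorithm, using trial division purely for simplicity; any halting primality test would serve:

```python
def decide_prim(x: str) -> int:
    """Decide membership of the binary word x in L_PRIM:
    output 1 if Number(x) is prime, 0 otherwise."""
    if x == "":               # the empty word is not in L_PRIM
        return 0
    m = int(x, 2)             # Number(x)
    if m < 2:
        return 0
    d = 2
    while d * d <= m:
        if m % d == 0:
            return 0
        d += 1
    return 1

print(decide_prim("101"), decide_prim("1111"))  # 1 (5 is prime), 0 (15 = 3*5)
```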

Hamiltonian Cycle Problem

  • Problem: Does a given graph contain a cycle that visits every vertex exactly once?
  • Formalization: Let $\Sigma = \{0, 1, \#\}$ and $L_{\text{HC}} = \{w \in \Sigma^* \mid w \text{ encodes a graph containing a Hamiltonian cycle}\}$. The problem is $(\Sigma, L_{\text{HC}})$. The input would be an encoded graph, and the output would be 1 or 0.

2.3.2 Optimization Problems

Many real-world problems ask for the “best” solution, not just a yes/no answer.

Definition 2.12 (Optimization Problem)

An optimization problem is a 6-tuple $\mathcal{U} = (\Sigma_I, \Sigma_O, L, \mathcal{M}, \text{cost}, \text{goal})$, where:

  • $\Sigma_I$ is the input alphabet.
  • $\Sigma_O$ is the output alphabet.
  • $L \subseteq \Sigma_I^*$ is the language of valid problem instances (i.e., inputs that have a meaningful interpretation). An $x \in L$ is called a problem instance.
  • $\mathcal{M}$ is a function from $L$ to $\mathcal{P}(\Sigma_O^*)$, where $\mathcal{M}(x) \subseteq \Sigma_O^*$ is the set of feasible solutions for an instance $x$.
  • cost is a function assigning a positive real number $\text{cost}(u, x)$ to every pair of a feasible solution $u \in \mathcal{M}(x)$ and an instance $x \in L$.
  • goal $\in \{\text{minimum}, \text{maximum}\}$ is the optimization objective.

The task is to find a feasible solution with the optimal (minimum or maximum, according to goal) cost. An algorithm solves $\mathcal{U}$ if for every instance $x \in L$ it outputs a solution $y \in \mathcal{M}(x)$ such that $\text{cost}(y, x)$ is optimal over $\mathcal{M}(x)$.

Traveling Salesman Problem (TSP)

  • Input: An edge-weighted complete graph $(K_n, c)$. This would be encoded as a word over $\{0, 1, \#\}$.
  • Feasible Solutions $\mathcal{M}(x)$: The set of all Hamiltonian cycles of $K_n$. Each cycle can be encoded as a word, e.g., as the sequence of its vertex indices.
  • Cost: The sum of the weights of the edges in a cycle.
  • Goal: minimum.

The main difficulty in optimization problems is that the set of feasible solutions is often astronomically large, making it impossible to check them all. For a complete graph with $n$ vertices, there are $(n-1)!/2$ distinct Hamiltonian cycles, a number that grows extremely fast (see the brute-force sketch below).
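To make the blow-up concrete, here is a brute-force sketch (ours, with a made-up cost matrix) that inspects every cycle through a fixed start vertex; already for moderate $n$, walking through all $(n-1)!$ permutations is hopeless:

```python
from itertools import permutations

def tsp_brute_force(cost):
    """cost[i][j] = weight of edge {i, j}; tries all (n-1)! vertex
    orders with vertex 0 fixed (each undirected cycle appears twice)."""
    n = len(cost)
    best_tour, best_cost = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        c = sum(cost[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if c < best_cost:
            best_tour, best_cost = tour, c
    return best_tour, best_cost

cost = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 8],
        [10, 4, 8, 0]]
print(tsp_brute_force(cost))  # ((0, 1, 3, 2), 23) for this instance
```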

2.3.3 Relation Problems

Beyond decision and optimization problems lies a more general class: relation problems. While a decision problem asks for a yes/no answer (a function from $\Sigma^*$ to $\{0, 1\}$) and an optimization problem seeks a single best solution, a relation problem allows for multiple valid outputs for a single input.

Definition 2.13 (Relation Problem)

A relation problem is defined by a relation $R \subseteq \Sigma_1^* \times \Sigma_2^*$, where $\Sigma_1$ and $\Sigma_2$ are the input and output alphabets. The task is: for a given input $x \in \Sigma_1^*$, find any output $y \in \Sigma_2^*$ such that $(x, y) \in R$.

If for a given $x$ no such $y$ exists, the problem may be undefined or require a specific output indicating this. The key difference from a function is that a relation can associate one input with many possible correct outputs.

Finding a Factor

  • Problem: Given a composite number, find one of its non-trivial factors.
  • Formalization: Let $R_{\text{fac}} \subseteq \Sigma_{\text{bool}}^* \times \Sigma_{\text{bool}}^*$ be the relation where $(x, y) \in R_{\text{fac}}$ if and only if $\text{Number}(x)$ is composite and $\text{Number}(y)$ is a factor of $\text{Number}(x)$ such that $1 < \text{Number}(y) < \text{Number}(x)$.
  • Task: For the input $x = 1111$, representing the number 15, both the output representing 3 ($y = 11$) and the output representing 5 ($y = 101$) are correct. An algorithm only needs to find one of them (see the sketch below).
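A sketch (ours) of one algorithm solving this relation problem; it happens to return the smallest non-trivial factor, but any valid $y$ would do:

```python
def find_factor(x: str):
    """Return Bin(d) for some non-trivial factor d of Number(x),
    or None when Number(x) has no such factor (i.e., is not composite)."""
    m = int(x, 2)
    d = 2
    while d * d <= m:
        if m % d == 0:
            return format(d, "b")  # one valid answer among possibly many
        d += 1
    return None

print(find_factor("1111"))  # '11' (3); '101' (5) would be equally correct
```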

2.3.4 Generation and Enumeration Problems

Some algorithmic tasks don’t require an input at all. Their purpose is to generate a specific output or to list all members of a set.

Definition 2.14 (Generation Problem)

The task of creating a single, specific word $w \in \Sigma^*$. An algorithm solves this by taking no input (or a trivial one like $\varepsilon$) and producing $w$. A program that generates $w$ can be considered an alternative representation of $w$.

Definition 2.15 (Enumeration Problem)

The task of listing the words of a language $L$. An algorithm for this task, given a positive integer $i$, typically outputs the first $i$ words of $L$ according to a defined order (like the canonical order).

A language is recursively enumerable if an algorithm exists that can list all of its words (though the process might never halt if the language is infinite). If an algorithm can also decide membership in the language (a decision problem), the language is called recursive or decidable.

Enumerating Primes

  • Problem: List the first $i$ prime numbers.
  • Formalization: Let $L_{\text{PRIM}} = \{\text{Bin}(p) \mid p \text{ is prime}\}$ be the language of binary representations of prime numbers.
  • Task: An enumeration algorithm, given input $i$, would output the first $i$ words from $L_{\text{PRIM}}$ in canonical order. For $i = 4$, the output would be 10, 11, 101, 111 (representing 2, 3, 5, 7), as in the sketch below.
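A sketch (ours) of such an enumerator. Because $\text{Bin}$ is monotone, the canonical order on $L_{\text{PRIM}}$ coincides with the numeric order of the primes, so scanning $m = 2, 3, 4, \dots$ suffices:

```python
def is_prime(m: int) -> bool:
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def enumerate_prim(i: int):
    """Output the first i words of L_PRIM in canonical order."""
    out, m = [], 2
    while len(out) < i:
        if is_prime(m):
            out.append(format(m, "b"))
        m += 1
    return out

print(enumerate_prim(4))  # ['10', '11', '101', '111']
```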

2.4 Kolmogorov Complexity

Is the string 0101010101010101 complex? What about 1101001011101001? Intuitively, the first is simple because it has a short description: “repeat 01 eight times.” The second seems random, with no obvious pattern. Kolmogorov complexity formalizes this intuition by measuring the intrinsic information content of a word.

Definition 2.16 (Kolmogorov Complexity)

The Kolmogorov complexity of a word $w \in (\Sigma_{\text{bool}})^*$, denoted $K(w)$, is the length of the shortest program (in a fixed, universal programming language) that generates $w$ as its output and then halts.

This definition elegantly circumvents the problem of choosing a specific compression method by allowing any computable compression. A program that generates a string is the ultimate compressed representation of that string. The length of this shortest program is the measure of its information content.

Lemma 2.4 (Upper Bound)

There exists a constant $d$ such that for every word $w \in (\Sigma_{\text{bool}})^*$: $K(w) \leq |w| + d$. This is because we can always write a simple program that just contains the string literally and prints it. The constant $d$ accounts for the fixed length of the print instruction itself and any overhead for encoding the literal string. This shows that the Kolmogorov complexity of a word is never significantly larger than its length.

Lemma 2.5 (Incompressibility)

For every integer $n \geq 1$, there exists at least one word $w$ of length $n$ such that $K(w) \geq n$.

This follows from a simple counting argument: there are $2^n$ words of length $n$, but only $\sum_{i=0}^{n-1} 2^i = 2^n - 1$ programs of length less than $n$, so at least one word of length $n$ has no shorter program. This implies that most long strings are incompressible, which leads to the idea of defining randomness as incompressibility.

Theorem 2.1 (Invariance Theorem)

Let $A$ and $B$ be any two universal programming languages. There exists a constant $c_{A,B}$ (that depends only on $A$ and $B$) such that for all words $w$: $|K_A(w) - K_B(w)| \leq c_{A,B}$. This theorem is crucial. It states that the choice of programming language does not significantly change the Kolmogorov complexity of a string. The complexity is robust and language-independent, up to an additive constant. The constant $c_{A,B}$ represents the length of a compiler or interpreter program that translates between the two languages. This justifies speaking of “the” Kolmogorov complexity without specifying a particular programming language.

The existence of an algorithm to decide a language can lead to strong statements about the complexity of the words in that language.

Theorem 2.2 (Kolmogorov Complexity of Decidable Language Elements)

Let $L \subseteq (\Sigma_{\text{bool}})^*$ be a recursive (decidable) language. Let $z_n$ be the $n$-th word in $L$ according to the canonical order. Then there exists a constant $c$ (depending only on $L$) such that for all $n$: $K(z_n) \leq \lceil \log_2(n+1) \rceil + c$.

A word is considered random if it is incompressible, i.e., if $K(w) \geq |w|$. This provides a powerful, algorithmic definition of randomness, aligning with the intuition that a random object has no shorter description than itself.

Definition 2.17 (Randomness)

A word $w \in (\Sigma_{\text{bool}})^*$ is called random if its Kolmogorov complexity is at least its length, i.e., $K(w) \geq |w|$. A number $n$ is called random if $K(\text{Bin}(n)) \geq \lceil \log_2(n+1) \rceil - 1$.

Application: Lower Bounds for Prime Numbers (Weaker Prime Number Theorem)

Kolmogorov complexity can be used to prove results in number theory. Here, we present a proof idea for a weaker version of the Prime Number Theorem, demonstrating that prime numbers are not “too sparse.”

Lemma 2.6 (Prime Factors of Incompressible Numbers)

Let $n_1 < n_2 < n_3 < \dots$ be a strictly increasing infinite sequence of natural numbers such that $K(n_i) \geq \lceil \log_2 n_i \rceil / 2$ for every $i$. For every $i$, let $p_i$ be the largest prime number that divides $n_i$. Then the set $\{p_i \mid i \in \mathbb{N}\}$ is infinite.

This lemma, combined with careful encoding arguments, can be used to derive lower bounds on the density of prime numbers, showing that they appear frequently enough to support various computational tasks, especially in randomized algorithms.


2.5 Summary and Outlook

This chapter has laid the essential groundwork for our study of theoretical computer science.

  • We defined alphabets, words, and languages as the formal tools to describe data and problems, emphasizing that any computational object can be encoded as a word.
  • We categorized algorithmic tasks into decision problems, optimization problems, relation problems, and generation/enumeration problems, providing a precise framework for discussing what algorithms do.
  • We introduced Kolmogorov complexity as a robust measure of the intrinsic information content of a string, which also provides a formal definition of randomness. This concept is powerful for both theoretical proofs and understanding the limits of compression.

With this formal language in hand, we are now ready to explore the machines that process it. In the next chapter, we will start with the simplest model of computation: the finite automaton, and use the concepts developed here to understand its capabilities and limitations.



2.6 Exercises

Exercise 2.1

How many different words of length $n$ exist over an alphabet $\Sigma$ with $|\Sigma| = m$?

Exercise 2.2

Given the alphabet $\Sigma_{\text{bool}} = \{0, 1\}$. Let $n$ and $i$ be positive integers with $i \leq n$. (a) Determine the number of different words of length $n$ with exactly $i$ occurrences of the symbol 1. (b) Determine the number of different words of length $n$ with at most $i$ occurrences of the symbol 1.

Exercise 2.3

The binary representation $\text{Bin}(m)$ of every positive number $m$ begins with a 1. How long is $\text{Bin}(m)$ for a given number $m$?

Exercise 2.4

Let $x = x_1 x_2 \dots x_n \in (\Sigma_m)^*$ for an integer $m \geq 2$. Consider $x$ as an $m$-adic representation of a number $\text{Number}_m(x)$. How do you compute $\text{Number}_m(x)$ from the symbols $x_1, \dots, x_n$?

Exercise 2.5

The proposed representation of a graph as a word over $\{0, 1, \#\}$ has length $n^2 + n$ for a graph with $n$ vertices (using $n$ rows of $n$ bits and $n$ separators). Think of a shorter unique representation of graphs over the alphabet $\{0, 1, \#\}$.

Exercise 2.6

Design a representation for graphs over the alphabet $\Sigma_{\text{bool}} = \{0, 1\}$.

Exercise 2.7

Prove that for any words $u, v \in \Sigma^*$, $(uv)^R = v^R u^R$.

Exercise 2.8

What is the maximum number of distinct subwords a word of length $n$ can have? List all different subwords of the word abbcbbab.

Exercise 2.9

Let and . Which words are in the language ?

Exercise 2.10

Let be languages over the alphabet . Does hold?

Exercise 2.11

Let and for two alphabets and with . Does hold?

Exercise 2.12

Do there exist languages such that is finite and is infinite?

Exercise 2.13

Prove or disprove: .

Exercise 2.14

Let $h$ be a homomorphism from $\Sigma_1^*$ to $\Sigma_2^*$. Prove by induction that for every word $w = w_1 w_2 \dots w_n \in \Sigma_1^*$, where $w_i \in \Sigma_1$ for $i = 1, \dots, n$, we have $h(w) = h(w_1) h(w_2) \cdots h(w_n)$.

Exercise 2.15

Define a homomorphism $h$ from $\Sigma_1^*$ to $\Sigma_2^*$, for alphabets of your choice, that maps infinitely many different words from $\Sigma_1^*$ to the same word from $\Sigma_2^*$.

Exercise 2.16

Define an injective homomorphism from $\Sigma_{\text{logic}}^*$ to $\Sigma_{\text{bool}}^*$ to create a unique representation of Boolean formulas over $\{0, 1\}$.

Exercise 2.17

Let $\Sigma_1$ and $\Sigma_2$ be two alphabets. Let $h$ be a homomorphism from $\Sigma_1^*$ to $\Sigma_2^*$. For every language $L \subseteq \Sigma_1^*$, we define $h(L) = \{h(w) \mid w \in L\}$. Let $L_1, L_2 \subseteq \Sigma_1^*$. Prove or disprove the following statement: $h(L_1 \cdot L_2) = h(L_1) \cdot h(L_2)$.

Exercise 2.18

For the Traveling Salesman Problem (TSP) on a complete graph with $n$ vertices, how many feasible solutions (Hamiltonian cycles) are there?

Exercise 2.19

The minimum vertex cover problem (MIN-VC) is a minimization problem where one seeks a vertex cover of minimal cardinality for a given (undirected) graph $G$. (A vertex cover of $G$ is a set $U$ of vertices such that every edge of $G$ has at least one endpoint in $U$.) (a) Determine the set of all vertex covers of the graph in Figure 2.5. (b) Give a formal description of MIN-VC. Specify the representation of inputs and outputs over the alphabet $\{0, 1, \#\}$.

Exercise 2.24

Argue why $K((01)^n)$ is approximately $\log_2 n$, up to an additive constant $c$. What does this tell you about the “randomness” of the string $(01)^n$?

Exercise 2.27

Prove that for any integers $n \geq i \geq 1$, there are at least $2^n - 2^{n-i} + 1$ distinct words $w$ of length $n$ such that $K(w) \geq n - i$.

Exercise 2.28

Prove that there are infinitely many natural numbers $n$ such that $K(n) \geq \lceil \log_2 n \rceil / 2$.


Key Takeaways

  • Formal Language Theory is Foundational: Alphabets, words, and languages provide the precise mathematical framework for describing all computational objects and problems.
  • Encoding is Universal: Any object relevant to computation can be uniquely encoded as a word over a finite alphabet, enabling universal processing by algorithms.
  • Problem Types: Algorithmic tasks are formally categorized into decision, optimization, relation, generation, and enumeration problems, each with specific objectives.
  • Kolmogorov Complexity: This concept offers a robust, machine-independent measure of a word’s intrinsic information content, defining randomness as incompressibility. It’s a powerful tool for theoretical proofs.
  • Limits of Computation: Even at this foundational level, we begin to see hints of the inherent limits of computation, both in terms of what can be compressed and what can be efficiently solved.