A Warm-Up: Recognizing Suffixes

Let’s get our hands dirty and jump straight into designing an automaton. We’ll start with a classic problem: recognizing words that end in a specific pattern.

Consider the language over the alphabet $\Sigma = \{a, b\}$:

$$L = \{ w \in \{a,b\}^* \mid w \text{ ends with } aa \text{ or } ab \}$$
The core of the problem is that our automaton needs to remember the last one or two symbols it has seen. Since it has no variables, this memory must be encoded in its states.

The Design Strategy

The most crucial information is the suffix of length two. This suggests we’ll need states corresponding to the four possible two-letter suffixes: aa, ab, ba, and bb.

Let’s start with these core states.

  • $q_{aa}$: The last two symbols seen were aa.
  • $q_{ab}$: The last two symbols seen were ab.
  • $q_{ba}$: The last two symbols seen were ba.
  • $q_{bb}$: The last two symbols seen were bb.

Now, let’s think about the transitions. If we are in state $q_{aa}$ (meaning the word so far ends in aa) and we read a new symbol, say b, the new suffix of length two is ab. So, we should transition from $q_{aa}$ to $q_{ab}$. If we read an a, the new suffix is aa, so we loop back to $q_{aa}$.

Applying this logic to all four states gives us the core of our machine. But what about short words, of length 0 or 1? They don’t have a two-letter suffix. We need states to handle the beginning of the computation:

  • $q_\varepsilon$: The start state, representing the empty word prefix.
  • $q_a$: We have seen exactly one symbol, and it was a.
  • $q_b$: We have seen exactly one symbol, and it was b.

Putting it all together, we get the following automaton. The accepting states are $q_{aa}$ and $q_{ab}$, since those are the desired endings.

Let’s trace an example word like aabababba to see it in action:

  1. Start at $q_\varepsilon$. Read a, move to $q_a$.
  2. In $q_a$. Read a, move to $q_{aa}$. (If the word stopped here, it would be accepted).
  3. In $q_{aa}$. Read b, move to $q_{ab}$. (Still an accepting state if the word ended).
  4. In $q_{ab}$. Read a, move to $q_{ba}$.
  5. In $q_{ba}$. Read b, move to $q_{ab}$.
  6. In $q_{ab}$. Read a, move to $q_{ba}$.
  7. In $q_{ba}$. Read b, move to $q_{ab}$.
  8. In $q_{ab}$. Read b, move to $q_{bb}$.
  9. In $q_{bb}$. Read a, move to $q_{ba}$. The word ends, and we are in state $q_{ba}$, which is not accepting. The word is correctly rejected.
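
To make the construction concrete, here is a minimal Python sketch of this seven-state automaton (the dictionary encoding, the short state labels such as "e" for the start state, and the accepts helper are my own, not part of the lecture):

```python
# A minimal sketch (not from the lecture) of the suffix automaton in Python.
# State names follow the notes: "e" stands for the start state, "a"/"b" for
# the one-symbol states, and "aa", "ab", "ba", "bb" for the suffix states.

TRANSITIONS = {
    ("e", "a"): "a",   ("e", "b"): "b",
    ("a", "a"): "aa",  ("a", "b"): "ab",
    ("b", "a"): "ba",  ("b", "b"): "bb",
    ("aa", "a"): "aa", ("aa", "b"): "ab",
    ("ab", "a"): "ba", ("ab", "b"): "bb",
    ("ba", "a"): "aa", ("ba", "b"): "ab",
    ("bb", "a"): "ba", ("bb", "b"): "bb",
}
START = "e"
ACCEPTING = {"aa", "ab"}  # words ending in aa or ab

def accepts(word: str) -> bool:
    """Run the DFA on `word` and report whether it ends in an accepting state."""
    state = START
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

print(accepts("aabababba"))  # False: ends in "ba"
print(accepts("aab"))        # True:  ends in "ab"
```

Running it on aabababba reproduces the trace above and correctly rejects the word.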

State Classes for the Suffix Automaton

We can formally describe what each state means by defining its class, i.e. the set of words that lead the automaton to that state:

  • For short prefixes:
    • $[q_x] = \{x\}$ for $x \in \{\varepsilon, a, b\}$.
  • For the main states (words of length $\geq 2$):
    • $[q_{xy}] = \{ w \in \{a,b\}^* \mid |w| \geq 2 \text{ and } w \text{ ends with } xy \}$ for $xy \in \{aa, ab, ba, bb\}$.

Counting Modulo a Constant

Finite automata cannot count indefinitely, but they are perfect for counting modulo a fixed number. This is a powerful technique.

Consider the language:

$$L = \{ w \in \{a,b\}^* \mid (|w|_a + 2 \cdot |w|_b) \bmod 4 = 3 \}$$

The expression $(|w|_a + 2 \cdot |w|_b) \bmod 4$ can only have four possible outcomes: 0, 1, 2, or 3. This suggests a beautiful design: one state for each possible remainder.

  • $q_0$: The current sum modulo 4 is 0. (This is our start state, for the empty word).
  • $q_1$: The current sum modulo 4 is 1.
  • $q_2$: The current sum modulo 4 is 2.
  • $q_3$: The current sum modulo 4 is 3. (This is our only accepting state).

Now, how do the transitions work?

  • If we are in state $q_i$ and read an a, the total sum increases by 1. So we move from $q_i$ to $q_{(i+1) \bmod 4}$. This creates a cycle: $q_0 \to q_1 \to q_2 \to q_3 \to q_0$.
  • If we are in state $q_i$ and read a b, the total sum increases by 2. So we move from $q_i$ to $q_{(i+2) \bmod 4}$. This creates two separate cycles: $q_0 \to q_2 \to q_0$ and $q_1 \to q_3 \to q_1$.

The resulting automaton is symmetric and elegant:
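
In place of the diagram, here is a minimal Python sketch of this four-state counter (the encoding is my own assumption; it only relies on the transition rules above and on remainder 3 being the single accepting state):

```python
# A minimal sketch (my own encoding, not the lecture's) of the mod-4 counter.
# State i means: (number of a's + 2 * number of b's) mod 4 == i.

START = 0
ACCEPTING = {3}  # only remainder 3 is accepted

def step(state: int, symbol: str) -> int:
    """One transition: 'a' adds 1 to the running sum, 'b' adds 2, modulo 4."""
    return (state + (1 if symbol == "a" else 2)) % 4

def accepts(word: str) -> bool:
    state = START
    for symbol in word:
        state = step(state, symbol)
    return state in ACCEPTING

print(accepts("ab"))    # True:  1 + 2 = 3
print(accepts("aabb"))  # False: 1 + 1 + 2 + 2 = 6, and 6 mod 4 = 2
```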

The class of each state $q_i$ is precisely the set of words for which the counting condition holds with remainder $i$:

$$[q_i] = \{ w \in \{a,b\}^* \mid (|w|_a + 2 \cdot |w|_b) \bmod 4 = i \}$$

What Finite Automata Can Do

Let’s summarize the capabilities we’ve uncovered. Finite automata can:

  • Recognize any finite set of words.
  • Count modulo a constant.
  • Recognize patterns like prefixes, suffixes, and infixes.
  • Combine a finite number of these conditions using boolean logic (AND, OR, NOT).

This last point is key. If we have automata for two different languages, can we combine them to recognize their union, intersection, or difference? The answer is yes, and the technique is called simulation.

Simulations: The Product Construction

The core idea is to build a new, larger automaton that simulates two smaller automata, $A_1$ and $A_2$, simultaneously.

Lemma 3.3: Let $\Sigma$ be an alphabet and let $A_1$ and $A_2$ be two DFAs over $\Sigma$. For any set operation $\circ \in \{\cup, \cap, \setminus\}$, there exists a DFA $A$ such that:

$$L(A) = L(A_1) \circ L(A_2)$$

This proves that the class of regular languages is closed under union, intersection, and set difference.

The Proof Idea: A “Simulcast” Automaton

We construct a new automaton $A$ that runs $A_1$ and $A_2$ in parallel; a small code sketch follows the list below.

  • States: The states of our new automaton are pairs of states, one from each of the original automata. This is the Cartesian product $Q = Q_1 \times Q_2$ of the state sets. A state in $Q$ looks like $(q_1, q_2)$, where $q_1 \in Q_1$ and $q_2 \in Q_2$.
  • Start State: The start state is the pair of the original start states.
  • Transition Function: To simulate both automata, a transition in $A$ on a symbol a simply applies the respective transition functions of $A_1$ and $A_2$ to each component of the state pair: $\delta((q_1, q_2), a) = (\delta_1(q_1, a), \delta_2(q_2, a))$.
  • Accepting States: This is where the set operation comes in. The set of accepting states $F$ is defined based on whether the components are in their respective accepting sets, $F_1$ and $F_2$.
    • For Union ($\cup$): We accept if either automaton accepts: $q_1 \in F_1$ or $q_2 \in F_2$.
    • For Intersection ($\cap$): We accept only if both automata accept: $q_1 \in F_1$ and $q_2 \in F_2$.
    • For Set Difference ($\setminus$): We accept if the first accepts and the second does not: $q_1 \in F_1$ and $q_2 \notin F_2$.
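
Here is a minimal sketch of the product construction in Python (the dict-based DFA encoding and the function name product_dfa are my own, not from the lecture):

```python
# A minimal sketch (my own, not from the lecture) of the product construction.
# Each DFA is given as (transitions, start, accepting), where transitions is a
# total dict mapping (state, symbol) -> state over a shared alphabet.

def product_dfa(dfa1, dfa2, combine):
    """Build the product DFA; combine(acc1, acc2) encodes the set operation."""
    (delta1, start1, acc1), (delta2, start2, acc2) = dfa1, dfa2
    alphabet = {sym for (_, sym) in delta1}
    states1 = {s for (s, _) in delta1} | set(delta1.values())
    states2 = {s for (s, _) in delta2} | set(delta2.values())
    # Run both automata in lockstep: each component follows its own rules.
    delta = {((p, q), sym): (delta1[(p, sym)], delta2[(q, sym)])
             for p in states1 for q in states2 for sym in alphabet}
    start = (start1, start2)
    accepting = {(p, q) for p in states1 for q in states2
                 if combine(p in acc1, q in acc2)}
    return delta, start, accepting

# The three set operations correspond to three ways of combining acceptance:
union        = lambda x, y: x or y
intersection = lambda x, y: x and y
difference   = lambda x, y: x and not y
```

The combine parameter is exactly the boolean version of the chosen set operation, so the same function covers union, intersection, and set difference.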

Example: A Modular Design

Let’s build an automaton for the language .

We can break this down into two simpler problems:

We can easily design automata for each of these; let us refer to the state set of the first automaton as $Q_1$ and that of the second as $Q_2$.

Now, we construct the product automaton. The states will be pairs , which we can arrange in a grid. The transitions are determined by combining the transitions of the two base automata.

The start state is the pair of the two original start states. Since the operation is OR (union), a state is accepting if its first component is accepting in the first automaton OR its second component is accepting in the second. This corresponds to the entire first row and the entire last column of our grid.

Small exercise:

Task: Build the product automata.

Proving Non-Existence

We’ve seen what finite automata can do. Now we turn to the more profound question: what can’t they do? We will show that there are languages that are not in the class of regular languages.

To prove that something doesn’t exist is fundamentally harder than proving it does. As the lecturer said, to prove white rabbits exist, I just need to show you one. To prove green rabbits don’t exist, showing you a lack of a green rabbit isn’t enough. We need a more powerful, indirect argument. We will assume the thing we want to disprove does exist, and then derive a logical contradiction.

The contradiction will always hinge on the one great weakness of a DFA: its finite memory.

The Core Limitation: Obliviousness

A DFA has no auxiliary memory. All it “knows” about the input it has processed so far is encoded in its current state. If two different input strings, $u$ and $v$, drive the automaton to the same state, the automaton becomes “oblivious” to the difference between them. From that point on, it must treat them identically.

This is the key insight, formalized in the following lemma.

Lemma 3.3: Let $A = (Q, \Sigma, \delta, q_0, F)$ be a DFA. Let $u, v \in \Sigma^*$ be two different strings ($u \neq v$) that lead to the same state:

$$\hat\delta(q_0, u) = \hat\delta(q_0, v)$$

Then, for any suffix string $w \in \Sigma^*$, the strings $uw$ and $vw$ must also lead to the same final state. Consequently:

$$uw \in L(A) \iff vw \in L(A)$$

Proof

The proof is a direct consequence of the definition of the extended transition function $\hat\delta$.

  1. We are given that after reading $u$ or $v$, the automaton is in the same state $q = \hat\delta(q_0, u) = \hat\delta(q_0, v)$.
  2. From state $q$, reading the suffix $w$ will lead to some state $p = \hat\delta(q, w)$.
  3. Since the automaton is deterministic, the path from $q$ determined by $w$ is unique.
  4. Therefore, the computation for $uw$ ends in state $p$, and the computation for $vw$ also ends in state $p$.
  5. If $p \in F$, both strings are accepted. If $p \notin F$, both are rejected.
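
As a quick sanity check (my own illustration, not part of the lecture), we can replay the lemma on the mod-4 counter from earlier: the words b and aa both land in state 2, so appending any suffix to either one must give the same verdict:

```python
# A small illustration of Lemma 3.3 on the mod-4 counter: "b" and "aa" both
# reach state 2, so u·w and v·w are accepted or rejected together.

def run(state, word):
    """Extended transition function: read `word` starting from `state`."""
    for symbol in word:
        state = (state + (1 if symbol == "a" else 2)) % 4
    return state

u, v = "b", "aa"
assert run(0, u) == run(0, v) == 2           # u and v collide in state 2
for w in ["", "a", "ab", "bba", "abab"]:     # a few arbitrary suffixes
    assert (run(0, u + w) == 3) == (run(0, v + w) == 3)  # same accept/reject
print("u·w and v·w are always treated identically")
```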

(Diagram illustrating Lemma 3.3, from the lecture slides.)

The First Non-Regular Language: $L = \{a^n b^n \mid n \in \mathbb{N}\}$

This is the canonical example of a non-regular language. It represents the problem of balanced parentheses or matching tags in a programming language. Intuitively, to check if a string is in $L$, an automaton would need to count the number of a’s. But since $n$ can be arbitrarily large, this would require an infinite number of states (memories) to store the count. A finite automaton cannot do this. Let’s prove it formally.

Theorem: The language $L = \{a^n b^n \mid n \in \mathbb{N}\}$ is not regular.

Proof by Contradiction

  1. Assumption: Assume $L$ is regular. By definition, there must exist a DFA $A = (Q, \{a,b\}, \delta, q_0, F)$ such that $L(A) = L$. Let $k = |Q|$ be the finite number of states in this automaton.

  2. Pigeonhole Principle: Consider the following $k + 1$ distinct words:

    $$a^0 = \varepsilon,\; a^1,\; a^2,\; \ldots,\; a^k$$

    When the automaton processes these words, each one ends in some state in $Q$. Since there are $k + 1$ words (pigeons) but only $k$ states (pigeonholes), at least two of these words must end in the same state.

  3. Finding the Collision: Therefore, there must exist two distinct integers $i$ and $j$ with $0 \le i < j \le k$, such that:

    $$\hat\delta(q_0, a^i) = \hat\delta(q_0, a^j)$$
  4. Applying the Lemma: We have found our two words, $a^i$ and $a^j$, that land in the same state. Now we apply Lemma 3.3. The lemma states that for any suffix $w$, the automaton must treat $a^i w$ and $a^j w$ identically.

  5. Choosing the Killer Suffix: We can choose any $w$ we want to create a contradiction. Let’s choose a $w$ that completes one of the words to be in the language $L$. A clever choice is $w = b^i$.

    • Consider the word $a^i b^i$. By the definition of $L$, this word is in the language. Therefore, our automaton must accept it: $a^i b^i \in L(A)$.
    • Now consider the word $a^j b^i$. Since $j \neq i$, this word is not in the language $L$.
    • However, according to Lemma 3.3, since $a^i$ and $a^j$ lead to the same state, appending $b^i$ must also lead to the same final outcome. Because $a^i b^i$ is accepted, $a^j b^i$ must also be accepted by the automaton $A$.
  6. The Contradiction: We have shown that the automaton accepts the word $a^j b^i$, but this word is not in the language $L$. This contradicts our initial assumption that $L(A) = L$.

  7. Conclusion: The assumption that $L$ is regular must be false.
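
The argument is constructive enough to turn into code. Below is a sketch (my own; it assumes a DFA is given as a total transition dictionary over {a, b}, a start state, an accepting set, and its number of states) that extracts the colliding prefixes and returns a word on which any such DFA must disagree with $\{a^n b^n \mid n \in \mathbb{N}\}$:

```python
# A sketch (my own) that turns the proof into a procedure: given ANY DFA over
# {a, b}, it finds the pigeonhole collision a^i, a^j and returns a word on
# which the DFA necessarily disagrees with L = { a^n b^n : n >= 0 }.

def find_counterexample(delta, start, accepting, num_states):
    """Return a word w with (w accepted) != (w in L); one must exist."""
    def run(word):
        state = start
        for symbol in word:
            state = delta[(state, symbol)]
        return state

    # Pigeonhole: among a^0, ..., a^k (k = num_states) two prefixes collide.
    seen = {}
    i = j = None
    for n in range(num_states + 1):
        state = run("a" * n)
        if state in seen:
            i, j = seen[state], n
            break
        seen[state] = n

    # The DFA gives the same verdict on a^i b^i (in L) and a^j b^i (not in L),
    # so it must be wrong about at least one of them.
    for word, in_L in [("a" * i + "b" * i, True), ("a" * j + "b" * i, False)]:
        if (run(word) in accepting) != in_L:
            return word
```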

A Preview of the Pumping Lemma

This method of proof is powerful, but there is another, closely related technique called the Pumping Lemma. The core idea is similar, but instead of considering two different words, we consider one single, very long word.

If a word is longer than the number of states in the automaton, the path it takes through the state diagram must visit at least one state twice. This creates a loop.

This means we can decompose the word into three parts, $w = xyz$, where:

  • $x$ is the part that leads up to the first visit to the repeated state.
  • $y$ is the part that traverses the loop.
  • $z$ is the part that follows the loop to the end.

The crucial insight is that we can traverse the loop ($y$) any number of times, zero, one, or many, and the resulting word must still be treated the same way by the automaton. We can “pump” the middle section:

$$x y^i z \quad \text{for } i = 0, 1, 2, \ldots$$

If the original word was accepted, then all pumped versions must also be accepted. This gives us another powerful tool to prove that languages requiring unbounded counting, like $\{a^n b^n \mid n \in \mathbb{N}\}$, cannot be regular. We will formalize this next time.
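
As a preview (my own sketch, using the same dict-based DFA encoding as before, not the formal lemma): finding such a decomposition amounts to recording where each state is first visited along the run:

```python
# A preview sketch (my own): for a word at least as long as the number of
# states, find the first repeated state along the run and split the word into
# x, y, z so that x y^i z follows the same path for every i >= 0.

def decompose(delta, start, word):
    """Return (x, y, z) where y spans the first loop in the run on `word`."""
    state = start
    first_visit = {state: 0}             # state -> position where first seen
    for pos, symbol in enumerate(word, start=1):
        state = delta[(state, symbol)]
        if state in first_visit:         # the run revisits a state: a loop
            lo, hi = first_visit[state], pos
            return word[:lo], word[lo:hi], word[hi:]
        first_visit[state] = pos
    return None                           # no repeat: word is shorter than |Q|
```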

Continue here: 08 Proving Non-Regularity