The theory of formal languages is a cornerstone of theoretical computer science, with profound connections to linguistics and even evolutionary biology. While our previous computational models, such as finite automata and Turing machines, were primarily conceived as acceptors of languages (recognizing whether a given word belongs to a language), grammars offer an alternative perspective: they are mechanisms for generating (producing) words.

Grammars provide a finite way to describe potentially infinite languages. For linguists, they are crucial for formally describing the syntax of natural languages. In computer science, grammars, especially context-free grammars, are central to defining the syntax of programming languages and are indispensable in compiler construction. (They allow us to precisely define the valid structure of code, enabling compilers to parse and understand programs.) From a computability standpoint, the most general type of grammars (Type-0) are as powerful as Turing machines, further reinforcing Church’s thesis.

This chapter will introduce the fundamental concept of grammars, explain their generative mechanism, and classify them according to the Chomsky Hierarchy. We will explore each type of grammar, relating its generative power to the corresponding automaton model we’ve studied.

10.1 Aims

By the end of this chapter, you will be able to:

  • Understand the Concept of Grammars: Grasp how grammars axiomatically define and generate languages.
  • Formally Define Grammars: Understand the components of a grammar (nonterminals, terminals, production rules, start symbol) and the process of derivation.
  • Classify Grammars by the Chomsky Hierarchy: Differentiate between Type-0 (unrestricted), Type-1 (context-sensitive), Type-2 (context-free), and Type-3 (regular) grammars based on their rule restrictions.
  • Relate Grammars to Automata: Understand the equivalence between regular grammars and finite automata, and context-free grammars and pushdown automata.
  • Apply Normal Forms: Understand Chomsky and Greibach normal forms for context-free grammars.
  • Prove Non-Context-Freeness: Utilize the Pumping Lemma for Context-Free Languages and Ogden’s Lemma to demonstrate that certain languages are not context-free.

10.2 The Concept of Grammars

Grammars provide an axiomatic method for defining classes of objects. Just as Boolean formulas can be defined recursively, languages can be defined by a set of rules that generate their words.

Generation Procedure for $L = \{a^n b^n : n \ge 1\}$

Consider the following rules for generating a language $L$ over {a, b}:

  1. $ab \in L$.
  2. If $w \in L$, then $awb \in L$.
  3. No other words belong to $L$.

Starting with $ab$, applying rule (2) once yields $aabb$, twice yields $aaabbb$, and so on. This procedure generates exactly the language $L = \{a^n b^n : n \ge 1\}$.
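This recursive generation procedure translates directly into a few lines of code. The following sketch (illustrative only, assuming the base word $ab$ and rule (2) as stated above) prints the first few words produced by repeated application of rule (2):

```python
def generate_first_words(k):
    """List the first k words of L = {a^n b^n : n >= 1} by the rules above:
    start from the base word 'ab' (rule 1) and repeatedly apply rule (2),
    which turns a word w into awb."""
    w = "ab"                      # rule (1): ab belongs to L
    words = [w]
    for _ in range(k - 1):
        w = "a" + w + "b"         # rule (2): if w is in L, then awb is in L
        words.append(w)
    return words

print(generate_first_words(4))    # ['ab', 'aabb', 'aaabbb', 'aaaabbbb']
```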

This idea is formalized using terminal symbols (the alphabet of the language, which are the final characters in a generated word), nonterminal symbols (variables that are replaced during generation, acting as placeholders for structures), production rules (the rewrite rules), and a start symbol.

Definition 10.1 (Grammar)

  1. A grammar is a 4-tuple $G = (N, \Sigma, P, S)$, where: a. $N$ is the nonterminal alphabet (set of nonterminals). b. $\Sigma$ is the terminal alphabet (set of terminal symbols), with $N \cap \Sigma = \emptyset$. c. $S \in N$ is the start symbol. d. $P$ is a finite set of production rules (or productions), where each rule is of the form $\alpha \to \beta$, with $\alpha \in (N \cup \Sigma)^* N (N \cup \Sigma)^*$ and $\beta \in (N \cup \Sigma)^*$.
  2. Let $\gamma, \delta \in (N \cup \Sigma)^*$. We say $\delta$ is derivable from $\gamma$ in one step in $G$, denoted $\gamma \Rightarrow_G \delta$, if $\gamma = \omega_1 \alpha \omega_2$ and $\delta = \omega_1 \beta \omega_2$ for some rule $\alpha \to \beta \in P$ and $\omega_1, \omega_2 \in (N \cup \Sigma)^*$.
  3. We say $\delta$ is derivable from $\gamma$ in $G$, denoted $\gamma \Rightarrow_G^* \delta$, if $\gamma = \delta$ or there is a sequence of one-step derivations $\gamma \Rightarrow_G \omega_1 \Rightarrow_G \cdots \Rightarrow_G \omega_k \Rightarrow_G \delta$.
  4. The language generated by $G$ is $L(G) = \{w \in \Sigma^* : S \Rightarrow_G^* w\}$.

Grammar for $\{w \in \{a, b\}^* : |w|_a = |w|_b\}$

Let $G = (\{S\}, \{a, b\}, P, S)$ with $P = \{S \to \lambda,\; S \to aSb,\; S \to bSa,\; S \to SS\}$. A derivation for baabaabb: $S \Rightarrow SS \Rightarrow SSS \Rightarrow bSaSS \Rightarrow baSS \Rightarrow baaSbS \Rightarrow baabS \Rightarrow baabaSb \Rightarrow baabaaSbb \Rightarrow baabaabb$. This grammar generates all words with an equal number of 'a's and 'b's.

Grammars are inherently nondeterministic: multiple rules might apply, or multiple nonterminals could be chosen for replacement. (This flexibility allows a single grammar to describe a wide range of valid strings, reflecting the ambiguity often present in natural and programming languages.)
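Definition 10.1 can be executed almost verbatim. The sketch below (the function name and encoding are illustrative; the rule set is the equal-numbers grammar from the example above) computes all sentential forms reachable in one derivation step, which makes the nondeterminism visible: a single form typically has several successors.

```python
# Productions of the example grammar: S -> lambda | aSb | bSa | SS
PRODUCTIONS = [("S", ""), ("S", "aSb"), ("S", "bSa"), ("S", "SS")]

def one_step(form, productions=PRODUCTIONS):
    """All sentential forms derivable from `form` in one step, i.e. every
    result of replacing one occurrence of a left-hand side by its right-hand side."""
    successors = set()
    for lhs, rhs in productions:
        i = form.find(lhs)
        while i != -1:
            successors.add(form[:i] + rhs + form[i + len(lhs):])
            i = form.find(lhs, i + 1)
    return successors

print(one_step("S"))      # 4 successors: '', 'aSb', 'bSa', 'SS'
print(one_step("bSaS"))   # 8 successors: two occurrences of S, four rules each
```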


10.3 Regular Grammars and Finite Automata

The simplest type of grammars in the Chomsky Hierarchy are regular grammars.

Definition 10.2 (Chomsky Hierarchy - Part 1)

Let $G = (N, \Sigma, P, S)$ be a grammar.

  • $G$ is a regular grammar (Type-3 grammar) if all rules in $P$ are of the form $X \to u$ or $X \to uY$, where $X, Y \in N$ and $u \in \Sigma^*$. (These simple rules essentially allow generating a terminal symbol and then optionally transitioning to another nonterminal, mimicking the state transitions of a finite automaton.) (Typically, $u$ is restricted to a single terminal symbol or the empty word, i.e., $u \in \Sigma \cup \{\lambda\}$.)

A language $L \subseteq \Sigma^*$ is regular if it is generated by a regular grammar. The class of regular languages is denoted $\mathcal{L}_3$.

Closure Properties of Regular Languages

The class $\mathcal{L}_3$ is closed under many operations, mirroring the closure properties of the regular languages defined by finite automata.

Lemma 10.2 ($\mathcal{L}_3$ is closed under union)

If $L_1, L_2 \in \mathcal{L}_3$, then $L_1 \cup L_2 \in \mathcal{L}_3$.

Lemma 10.3 ($\mathcal{L}_3$ is closed under concatenation)

If $L_1, L_2 \in \mathcal{L}_3$, then $L_1 \cdot L_2 \in \mathcal{L}_3$.

Equivalence with Finite Automata

A cornerstone result is that regular grammars generate precisely the class of languages accepted by finite automata.

Theorem 10.1 (FA to Regular Grammar)

To each finite automaton $M$, there exists a regular grammar $G$ such that $L(M) = L(G)$.
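A sketch of the standard construction behind Theorem 10.1 (the conventions here are assumed, not quoted from the text): every state $q$ becomes a nonterminal $X_q$, every transition $\delta(q, a) = p$ becomes a rule $X_q \to a X_p$, and $X_q \to \lambda$ is added for every accepting state $q$.

```python
def dfa_to_regular_grammar(states, alphabet, delta, start, accepting):
    """Regular-grammar productions from a deterministic FA given as a
    transition dict delta[(state, symbol)] = state.  States double as nonterminals."""
    productions = []
    for q in states:
        for a in alphabet:
            productions.append((q, a + delta[(q, a)]))   # X_q -> a X_p
        if q in accepting:
            productions.append((q, ""))                  # X_q -> lambda
    return start, productions

# Example input (not from the text): a DFA for words over {a, b} ending in 'b'.
delta = {("E", "a"): "E", ("E", "b"): "B", ("B", "a"): "E", ("B", "b"): "B"}
print(dfa_to_regular_grammar({"E", "B"}, {"a", "b"}, delta, "E", {"B"}))
```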

To prove the reverse, we first need to normalize regular grammars.

Definition 10.3 (Normalized Regular Grammar)

A regular grammar $G = (N, \Sigma, P, S)$ is normalized if all rules in $P$ are of one of the following forms:

  1. $S \to \lambda$ (if $\lambda \in L(G)$)
  2. $X \to aY$ for $X, Y \in N$ and $a \in \Sigma$
  3. $X \to a$ for $X \in N$ and $a \in \Sigma$

Theorem 10.2 (Normalization of Regular Grammars)

To every regular grammar, there exists an equivalent normalized regular grammar.

Theorem 10.3 (Regular Grammar to NFA)

To each regular grammar $G$, there exists a nondeterministic finite automaton (NFA) $M$ such that $L(G) = L(M)$.
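The direction of Theorem 10.3 is equally direct once the grammar is normalized: each nonterminal becomes a state, a rule $X \to aY$ becomes a transition from $X$ to $Y$ labeled $a$, a rule $X \to a$ leads into a fresh accepting state, and $S \to \lambda$ makes the start state accepting. A minimal sketch under those (assumed) conventions:

```python
def normalized_grammar_to_nfa(nonterminals, start, productions):
    """Build the NFA for a normalized regular grammar.  Each production is a
    pair (X, rhs) with rhs of the form '', 'a', or 'aY' (Y a nonterminal)."""
    final = "qf"                                    # fresh accepting state
    transitions = {}                                # (state, symbol) -> set of states
    accepting = {final}
    for lhs, rhs in productions:
        if rhs == "":                               # S -> lambda: start state accepts
            accepting.add(lhs)
        elif len(rhs) == 1:                         # X -> a: step into the accepting state
            transitions.setdefault((lhs, rhs), set()).add(final)
        else:                                       # X -> aY: transition X --a--> Y
            transitions.setdefault((lhs, rhs[0]), set()).add(rhs[1:])
    return nonterminals | {final}, start, transitions, accepting

# Example input (not from the text): S -> aS | bS | b, i.e. words ending in 'b'.
print(normalized_grammar_to_nfa({"S"}, "S", [("S", "aS"), ("S", "bS"), ("S", "b")]))
```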

These theorems establish that regular grammars and finite automata have equivalent expressive power. (This equivalence is fundamental: it means we can either describe simple patterns using a generative grammar or recognize them using a finite-state machine, offering two complementary perspectives on the same class of languages.)


10.4 Context-Free Grammars and Pushdown Automata

Context-free grammars (CFGs) are more powerful than regular grammars and are fundamental to describing the syntax of programming languages.

Definition 10.2 (Chomsky Hierarchy - Part 2)

  • $G$ is a context-free grammar (Type-2 grammar) if all rules in $P$ are of the form $X \to \alpha$, where $X \in N$ and $\alpha \in (N \cup \Sigma)^*$.

A language is context-free if it is generated by a context-free grammar. The class of context-free languages is denoted $\mathcal{L}_2$. The key difference from regular grammars is that the right-hand side $\alpha$ can contain several nonterminals, allowing for recursive structures. (This added flexibility allows CFGs to describe languages with nested dependencies, like correctly matched parentheses or arithmetic expressions, which regular grammars cannot handle.)

Context-Free Language

The grammar $G = (\{S\}, \{a, b\}, \{S \to aSb,\; S \to \lambda\}, S)$ generates $L(G) = \{a^n b^n : n \in \mathbb{N}\}$. This language is not regular.

Syntax Trees

Derivations in CFGs can be represented by syntax trees (or parse trees), which visually depict the hierarchical structure of the generated word. (These trees are invaluable for understanding the structure of programming language constructs and are a core concept in compiler design.)

Definition 10.4 (Syntax Tree)

A syntax tree for a CFG is an ordered tree where:

  1. The root is labeled with the start symbol.
  2. Internal nodes are labeled with nonterminals.
  3. Leaves are labeled with terminals or $\lambda$.
  4. If an internal node labeled $X$ has children labeled $\alpha_1, \alpha_2, \ldots, \alpha_k$ (from left to right), then $X \to \alpha_1 \alpha_2 \cdots \alpha_k$ must be a production rule in $P$. The concatenation of the labels of the leaves (read from left to right) forms the generated word.

Syntax Tree for Arithmetic Expression

For the grammar with productions $E \to E + T \mid T$, $T \to T * F \mid F$, $F \to (E) \mid \mathrm{id}$ and start symbol $E$, the expression (id * id) + id has a syntax tree showing its structure.
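To make the definition concrete, here is a small sketch (nested tuples stand in for tree nodes; all names are illustrative) of a syntax tree for (id * id) + id under the grammar above, together with a function that reads off the yield, i.e. the leaf labels from left to right:

```python
def leaf(s):
    """A leaf node: a label with no children."""
    return (s, [])

def yield_of(node):
    """Concatenate the leaf labels from left to right (the generated word)."""
    label, children = node
    return label if not children else "".join(yield_of(c) for c in children)

# Syntax tree for (id * id) + id, using E -> E + T, E -> T, T -> T * F, T -> F, F -> (E) | id
tree = ("E", [
    ("E", [("T", [("F", [leaf("("),
                         ("E", [("T", [("T", [("F", [leaf("id")])]),
                                       leaf("*"),
                                       ("F", [leaf("id")])])]),
                         leaf(")")])])]),
    leaf("+"),
    ("T", [("F", [leaf("id")])]),
])

print(yield_of(tree))   # (id*id)+id
```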

Normal Forms for Context-Free Grammars

For theoretical analysis and parsing algorithms, CFGs are often converted into simpler forms.

Definition 10.5 (Chomsky Normal Form and Greibach Normal Form)

Let $G = (N, \Sigma, P, S)$ be a context-free grammar with $\lambda \notin L(G)$.

  • $G$ is in Chomsky Normal Form (CNF) if all rules are of the form $X \to a$ (where $X \in N$ and $a \in \Sigma$) or $X \to YZ$ (where $X, Y, Z \in N$). (CNF ensures that every production either generates a single terminal or combines exactly two nonterminals, simplifying parsing algorithms.)
  • $G$ is in Greibach Normal Form (GNF) if all rules are of the form $X \to a\alpha$ (where $X \in N$, $a \in \Sigma$, and $\alpha \in N^*$). (GNF ensures that every production starts with a terminal, which is useful for top-down parsing.)

Theorem 10.4 (Existence of CNF and GNF)

For every context-free grammar $G$ with $\lambda \notin L(G)$, there exist equivalent grammars in Chomsky Normal Form and in Greibach Normal Form.
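One payoff of Chomsky Normal Form, noted above, is that it enables simple cubic-time parsing algorithms such as CYK. The sketch below tests membership with CYK; the CNF grammar used here ($S \to AB \mid AX$, $X \to SB$, $A \to a$, $B \to b$, generating $\{a^n b^n : n \ge 1\}$) is an assumed example, not taken from the text.

```python
from itertools import product

# An assumed CNF grammar for {a^n b^n : n >= 1}: S -> AB | AX, X -> SB, A -> a, B -> b
TERMINAL_RULES = {"a": {"A"}, "b": {"B"}}                                 # X -> a
BINARY_RULES = {("A", "B"): {"S"}, ("A", "X"): {"S"}, ("S", "B"): {"X"}}  # X -> YZ

def cyk(word, start="S"):
    """CYK membership test for a grammar in Chomsky Normal Form."""
    n = len(word)
    if n == 0:
        return False                       # CNF grammars never generate the empty word
    # table[i][l - 1] holds the nonterminals deriving word[i : i + l]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = set(TERMINAL_RULES.get(ch, set()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for Y, Z in product(left, right):
                    table[i][length - 1] |= BINARY_RULES.get((Y, Z), set())
    return start in table[0][n - 1]

print(cyk("aaabbb"), cyk("aab"))   # True False
```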

Pumping Lemma for Context-Free Languages

Similar to regular languages, there’s a pumping lemma for CFGs to prove that a language is not context-free.

Lemma 10.6 (Pumping Lemma for Context-Free Languages)

If $L$ is a context-free language, then there exists a constant $n_L$ (depending only on $L$) such that for all words $z \in L$ with $|z| \ge n_L$, there exists a decomposition $z = uvwxy$ such that:

  1. $|vx| \ge 1$ (at least one of $v$ or $x$ is non-empty).
  2. $|vwx| \le n_L$.
  3. $u v^i w x^i y \in L$ for all $i \in \mathbb{N}$ (including $i = 0$).

(Important Note: The Pumping Lemma can only be used to prove that a language is not context-free. It cannot be used to prove that a language is context-free.)

Proving $\{a^n b^n c^n : n \in \mathbb{N}\}$ is not context-free

Assume $L = \{a^n b^n c^n : n \in \mathbb{N}\}$ is context-free. Let $n_L$ be the pumping length. Choose $z = a^{n_L} b^{n_L} c^{n_L}$. By the pumping lemma, $z = uvwxy$ with $|vwx| \le n_L$ and $|vx| \ge 1$. Since $|vwx| \le n_L$, the segment $vwx$ cannot contain all three types of symbols ($a$, $b$, and $c$). It can contain at most two types. If $vwx$ contains only $a$'s and $b$'s (or only $b$'s and $c$'s, or symbols of a single type), then pumping $v$ and $x$ (i.e., forming $u v^2 w x^2 y$) will increase the count of only one or two types of symbols, while the count of the third remains unchanged. This breaks the pattern, so $u v^2 w x^2 y \notin L$. This contradicts condition (3) of the pumping lemma, so $L$ is not context-free.

Pushdown Automata

Nondeterministic Pushdown Automata (NPdAs) are the machine model equivalent to context-free grammars. They extend finite automata with a stack (a LIFO data structure) for memory. (This stack memory is crucial for handling nested structures, like matching parentheses, which finite automata cannot do.)

Definition 10.6 (Nondeterministic Pushdown Automaton)

An NPdA is a 6-tuple $M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0)$, where:

  • $Q$ is a finite set of states.
  • $\Sigma$ is the input alphabet.
  • $\Gamma$ is the stack alphabet.
  • $\delta : Q \times (\Sigma \cup \{\lambda\}) \times \Gamma \to \mathcal{P}_{\mathrm{fin}}(Q \times \Gamma^*)$ is the transition function; each move depends on the current state, the next input symbol (or $\lambda$), and the topmost stack symbol, and yields a new state together with a word that replaces the topmost stack symbol.
  • $q_0 \in Q$ is the initial state.
  • $Z_0 \in \Gamma$ is the initial stack symbol.

An NPdA $M$ accepts a word $w \in \Sigma^*$ if, starting in $q_0$ with $Z_0$ on the stack, it can read all of $w$ and empty its stack; $L(M)$ denotes the set of all words accepted by $M$.

NPdA for $\{0^n 1^n : n \ge 1\}$

An NPdA for $\{0^n 1^n : n \ge 1\}$ would push a '0' onto the stack for each '0' read from the input. When a '1' is read, it pops a '0' from the stack. If the stack becomes empty exactly when the last '1' has been read, the word is accepted.
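A direct simulation of such an NPdA, accepting by empty stack; the state names, transition table, and encoding below are assumptions for illustration. The stack is a string whose last character is the topmost symbol, and each move replaces that symbol by a (possibly empty) word.

```python
# (state, input symbol, stack top) -> set of (next state, word replacing the top symbol)
DELTA = {
    ("p", "0", "Z"): {("p", "0")},     # first 0: replace the bottom marker Z by 0
    ("p", "0", "0"): {("p", "00")},    # further 0s: push another 0
    ("p", "1", "0"): {("q", "")},      # first 1: pop a 0 and switch to the popping state
    ("q", "1", "0"): {("q", "")},      # further 1s: keep popping
}

def accepts(word, state="p", stack="Z"):
    """True iff the NPdA can read `word` completely and empty its stack."""
    if not word:
        return stack == ""                         # acceptance by empty stack
    if not stack:
        return False                               # stack empty but input remains
    for nxt, push in DELTA.get((state, word[0], stack[-1]), set()):
        if accepts(word[1:], nxt, stack[:-1] + push):
            return True
    return False

print(accepts("000111"), accepts("0011"), accepts("010"))   # True True False
```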

Equivalence of CFGs and NPdAs

A language $L$ is context-free if and only if there exists a nondeterministic pushdown automaton $M$ such that $L = L(M)$.


10.5 Context-Sensitive and Unrestricted Grammars

Beyond context-free grammars lie more powerful types, completing the Chomsky Hierarchy.

Definition 10.2 (Chomsky Hierarchy - Part 3)

  • $G$ is a context-sensitive grammar (Type-1 grammar) if all rules $\alpha \to \beta$ in $P$ satisfy $|\alpha| \le |\beta|$, and $\alpha$ contains at least one nonterminal. (This means a nonterminal can only be replaced if it appears in a specific "context" of surrounding symbols, and because $|\alpha| \le |\beta|$, no rule can shorten the sentential form.)
  • $G$ is an unrestricted grammar (Type-0 grammar) if there are no restrictions on its rules beyond the left-hand side containing at least one nonterminal.

A language is context-sensitive (resp. recursively enumerable) if it is generated by a context-sensitive (resp. unrestricted) grammar. The corresponding classes are denoted $\mathcal{L}_1$ and $\mathcal{L}_0$.

Generative Power and Automata Equivalence

  • Context-Sensitive Grammars (Type-1): These grammars can model languages where the replacement of a nonterminal depends on its surrounding context; a concrete example is sketched after this list. They are equivalent in power to Linear Bounded Automata (LBAs), which are Turing machines whose tape head cannot move beyond the portion of the tape initially occupied by the input. Context-sensitive languages are decidable.
  • Unrestricted Grammars (Type-0): These are the most powerful grammars, equivalent to Turing machines. They can generate any recursively enumerable language. Languages generated by unrestricted grammars are not necessarily decidable.
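To see the extra power of Type-1 rules in action, here is a classic context-sensitive grammar for $\{a^n b^n c^n : n \ge 1\}$ (a standard textbook grammar, assumed here rather than taken from this chapter), together with a brute-force derivation search. Because no rule shortens a sentential form, any form longer than the target word can be pruned, so the search terminates.

```python
from collections import deque

# A classic context-sensitive grammar for {a^n b^n c^n : n >= 1}
RULES = [("S", "aSBC"), ("S", "aBC"), ("CB", "BC"),
         ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]

def derivable(target, start="S"):
    """Breadth-first search over sentential forms.  Since every rule is
    non-contracting (|lhs| <= |rhs|), forms longer than the target are pruned."""
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in RULES:
            i = form.find(lhs)
            while i != -1:
                successor = form[:i] + rhs + form[i + len(lhs):]
                if len(successor) <= len(target) and successor not in seen:
                    seen.add(successor)
                    queue.append(successor)
                i = form.find(lhs, i + 1)
    return False

print(derivable("aabbcc"), derivable("aabbc"))   # True False
```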

The Chomsky Hierarchy forms a chain of strict inclusions: $\mathcal{L}_3 \subsetneq \mathcal{L}_2 \subsetneq \mathcal{L}_1 \subsetneq \mathcal{L}_0$. (Each inclusion is proper: for every $i \in \{0, 1, 2\}$, there are languages that can be generated by a Type-$i$ grammar but not by any Type-$(i+1)$ grammar.)


Summary

  • Grammars are generative models for languages, defined by nonterminals, terminals, production rules, and a start symbol.
  • The Chomsky Hierarchy classifies grammars into four types based on restrictions on their production rules, each corresponding to a different class of languages and computational power.
    • Type-3 (Regular Grammars): Generate regular languages, equivalent to finite automata. Rules are of the form $X \to u$ or $X \to uY$ with $X, Y \in N$ and $u \in \Sigma^*$.
    • Type-2 (Context-Free Grammars): Generate context-free languages, equivalent to pushdown automata. Rules are of the form $X \to \alpha$ with $X \in N$ and $\alpha \in (N \cup \Sigma)^*$. Crucial for programming language syntax.
    • Type-1 (Context-Sensitive Grammars): Generate context-sensitive languages, equivalent to linear bounded automata. Rules $\alpha \to \beta$ with $|\alpha| \le |\beta|$.
    • Type-0 (Unrestricted Grammars): Generate recursively enumerable languages, equivalent to Turing machines. No restrictions on rules.
  • Syntax trees provide a hierarchical representation of derivations in context-free grammars.
  • Normal forms (Chomsky Normal Form, Greibach Normal Form) simplify CFGs for analysis and parsing.
  • Pumping lemmas (for regular and context-free languages) are powerful tools for proving that a language does not belong to a particular class.
  • The hierarchy demonstrates a fundamental trade-off between the expressiveness of a language model and the complexity of the automaton required to process it.


Exercises

Exercise 10.1 (Regular Grammar Construction)

Construct a regular grammar for the language .

Exercise 10.2 (Context-Free Grammar for Palindromes)

Construct a context-free grammar for the language of palindromes over {a, b}, i.e., $L = \{w \in \{a, b\}^* : w = w^R\}$, where $w^R$ denotes the reversal of $w$.

Exercise 10.3 (Pumping Lemma for Context-Free Languages)

Prove that the language is not context-free using the Pumping Lemma for Context-Free Languages.

Exercise 10.4 (Pushdown Automaton Design)

Design a nondeterministic pushdown automaton (NPdA) for the language .