The theory of formal languages is a cornerstone of theoretical computer science, with profound connections to linguistics and even evolutionary biology. While our previous computational models, such as finite automata and Turing machines, were primarily conceived as acceptors of languages (recognizing whether a given word belongs to a language), grammars offer an alternative perspective: they are mechanisms for generating (producing) words.
Grammars provide a finite way to describe potentially infinite languages. For linguists, they are crucial for formally describing the syntax of natural languages. In computer science, grammars, especially context-free grammars, are central to defining the syntax of programming languages and are indispensable in compiler construction. (They allow us to precisely define the valid structure of code, enabling compilers to parse and understand programs.) From a computability standpoint, the most general type of grammars (Type-0) are as powerful as Turing machines, further reinforcing Church’s thesis.
This chapter will introduce the fundamental concept of grammars, explain their generative mechanism, and classify them according to the Chomsky Hierarchy. We will explore each type of grammar, relating its generative power to the corresponding automaton model we’ve studied.
10.1 Aims
By the end of this chapter, you will be able to:
- Understand the Concept of Grammars: Grasp how grammars axiomatically define and generate languages.
- Formally Define Grammars: Understand the components of a grammar (nonterminals, terminals, production rules, start symbol) and the process of derivation.
- Classify Grammars by the Chomsky Hierarchy: Differentiate between Type-0 (unrestricted), Type-1 (context-sensitive), Type-2 (context-free), and Type-3 (regular) grammars based on their rule restrictions.
- Relate Grammars to Automata: Understand the equivalence between regular grammars and finite automata, and context-free grammars and pushdown automata.
- Apply Normal Forms: Understand Chomsky and Greibach normal forms for context-free grammars.
- Prove Non-Context-Freeness: Utilize the Pumping Lemma for Context-Free Languages and Ogden’s Lemma to demonstrate that certain languages are not context-free.
10.2 The Concept of Grammars
Grammars provide an axiomatic method for defining classes of objects. Just as Boolean formulas can be defined recursively, languages can be defined by a set of rules that generate their words.
Generation Procedure for $\{a^nb^n \mid n \in \mathbb{N}\}$
Consider the following rules for generating a language $L$ over $\{a, b\}$:
- $\lambda \in L$.
- If $w \in L$, then $awb \in L$.
- No other words belong to $L$.
Starting with $\lambda$, applying rule (2) once yields $ab$, twice yields $aabb$, and so on. This procedure generates exactly the language $L = \{a^nb^n \mid n \in \mathbb{N}\}$.
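The three rules above can be sketched directly in code (a minimal illustration; the function name is ours):

```python
# Enumerate the first few words of L by the generation procedure:
# rule (1) contributes the empty word, rule (2) maps w to a w b.
def generate(max_steps):
    """Return all words obtained by applying rule (2) up to max_steps times."""
    words = [""]          # rule (1): the empty word is in L
    w = ""
    for _ in range(max_steps):
        w = "a" + w + "b"  # rule (2): if w is in L, then awb is in L
        words.append(w)
    return words

print(generate(3))  # -> ['', 'ab', 'aabb', 'aaabbb']
```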
This idea is formalized using terminal symbols (the alphabet of the language, which are the final characters in a generated word), nonterminal symbols (variables that are replaced during generation, acting as placeholders for structures), production rules (the rewrite rules), and a start symbol.
Definition 10.1 (Grammar)
- A grammar is a 4-tuple $G = (\Sigma_N, \Sigma_T, P, S)$, where:
  a. $\Sigma_N$ is the nonterminal alphabet (set of nonterminals).
  b. $\Sigma_T$ is the terminal alphabet (set of terminal symbols), with $\Sigma_N \cap \Sigma_T = \emptyset$.
  c. $S \in \Sigma_N$ is the start symbol.
  d. $P$ is a finite set of production rules (or productions), where each rule is of the form $\alpha \to \beta$, with $\alpha \in (\Sigma_N \cup \Sigma_T)^*\Sigma_N(\Sigma_N \cup \Sigma_T)^*$ and $\beta \in (\Sigma_N \cup \Sigma_T)^*$.
- Let $\gamma, \delta \in (\Sigma_N \cup \Sigma_T)^*$. We say $\delta$ is derivable from $\gamma$ in one step in $G$, denoted $\gamma \Rightarrow_G \delta$, if $\gamma = \omega_1\alpha\omega_2$ and $\delta = \omega_1\beta\omega_2$ for some rule $\alpha \to \beta \in P$ and $\omega_1, \omega_2 \in (\Sigma_N \cup \Sigma_T)^*$.
- We say $\delta$ is derivable from $\gamma$ in $G$, denoted $\gamma \Rightarrow_G^* \delta$, if $\gamma = \delta$ or there is a sequence of one-step derivations $\gamma \Rightarrow_G \omega_1 \Rightarrow_G \cdots \Rightarrow_G \omega_k \Rightarrow_G \delta$.
- The language generated by $G$ is $L(G) = \{w \in \Sigma_T^* \mid S \Rightarrow_G^* w\}$.
Grammar for $\{w \in \{a, b\}^* \mid |w|_a = |w|_b\}$
Let $G = (\{S\}, \{a, b\}, P, S)$ with $P = \{S \to \lambda,\ S \to aSbS,\ S \to bSaS\}$. A derivation for
baabaabb: $S \Rightarrow bSaS \Rightarrow baS \Rightarrow baaSbS \Rightarrow baabS \Rightarrow baabaSbS \Rightarrow baabaaSbSbS \Rightarrow baabaabSbS \Rightarrow baabaabbS \Rightarrow baabaabb$. This grammar generates all words with an equal number of 'a's and 'b's.
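The one-step derivation relation can be checked mechanically. The sketch below (our own illustration; rules are stored as a dict from a nonterminal to its right-hand sides, with `""` standing for $\lambda$) verifies the derivation of baabaabb step by step:

```python
def is_one_step(cur, nxt, rules):
    """Check whether nxt is derivable from cur in one step: some occurrence
    of a left-hand side is replaced by one of its right-hand sides."""
    for lhs, rhs_list in rules.items():
        for i in range(len(cur)):
            if cur.startswith(lhs, i):
                for rhs in rhs_list:
                    if cur[:i] + rhs + cur[i + len(lhs):] == nxt:
                        return True
    return False

rules = {"S": ["", "aSbS", "bSaS"]}  # S -> lambda | aSbS | bSaS
derivation = ["S", "bSaS", "baS", "baaSbS", "baabS", "baabaSbS",
              "baabaaSbSbS", "baabaabSbS", "baabaabbS", "baabaabb"]
ok = all(is_one_step(derivation[k], derivation[k + 1], rules)
         for k in range(len(derivation) - 1))
```

Here `ok` evaluates to `True`, confirming that each listed step is a legal application of one production.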
Grammars are inherently nondeterministic: multiple rules might apply, or multiple nonterminals could be chosen for replacement. (This flexibility allows a single grammar to describe a wide range of valid strings, reflecting the ambiguity often present in natural and programming languages.)
10.3 Regular Grammars and Finite Automata
The simplest type of grammars in the Chomsky Hierarchy are regular grammars.
Definition 10.2 (Chomsky Hierarchy - Part 1)
Let $G = (\Sigma_N, \Sigma_T, P, S)$ be a grammar.
- $G$ is a regular grammar (Type-3 grammar) if all rules in $P$ are of the form $X \to uY$ or $X \to u$, where $X, Y \in \Sigma_N$ and $u \in \Sigma_T^*$. (These simple rules essentially allow generating a terminal string and then optionally transitioning to another nonterminal, mimicking the state transitions of a finite automaton.) (Typically, $u$ is restricted to a single terminal symbol or the empty string, i.e., $u \in \Sigma_T \cup \{\lambda\}$.)
A language is regular if it is generated by a regular grammar. The class of regular languages is denoted $\mathcal{L}_3$.
Closure Properties of Regular Languages
The class $\mathcal{L}_3$ is closed under many operations, mirroring the closure properties of the regular languages defined by finite automata.
Lemma 10.2 ($\mathcal{L}_3$ is closed under union)
If $L_1, L_2 \in \mathcal{L}_3$, then $L_1 \cup L_2 \in \mathcal{L}_3$.
Proof of Closure under Union
Let $G_1 = (\Sigma_{N,1}, \Sigma_T, P_1, S_1)$ and $G_2 = (\Sigma_{N,2}, \Sigma_T, P_2, S_2)$ be regular grammars for $L_1$ and $L_2$, respectively. Assume $\Sigma_{N,1} \cap \Sigma_{N,2} = \emptyset$. Construct $G = (\Sigma_{N,1} \cup \Sigma_{N,2} \cup \{S\}, \Sigma_T, P, S)$, where $S$ is a new start symbol and $P = P_1 \cup P_2 \cup \{S \to S_1, S \to S_2\}$. $G$ is regular and $L(G) = L_1 \cup L_2$.
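The union construction can be sketched as follows (a minimal illustration with our own representation: a grammar is a dict from nonterminals to right-hand sides, and the fresh start symbol gets chain rules to both original start symbols):

```python
def union_grammar(p1, s1, p2, s2, new_start="S"):
    """Combine two regular grammars (as rule dicts) for the union language."""
    assert not set(p1) & set(p2)                 # nonterminal alphabets disjoint
    assert new_start not in p1 and new_start not in p2
    p = {new_start: [s1, s2]}                    # S -> S1 | S2 (rules with u = lambda)
    p.update(p1)
    p.update(p2)
    return p

# G1 generates {a^n : n >= 1}, G2 generates {b^n : n >= 1} (illustrative).
g = union_grammar({"X": ["aX", "a"]}, "X", {"Y": ["bY", "b"]}, "Y")
```

Every derivation in `g` first commits to one of the two subgrammars, so the generated language is exactly the union.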
Lemma 10.3 ($\mathcal{L}_3$ is closed under concatenation)
If $L_1, L_2 \in \mathcal{L}_3$, then $L_1L_2 \in \mathcal{L}_3$.
Proof of Closure under Concatenation
Let $G_1 = (\Sigma_{N,1}, \Sigma_T, P_1, S_1)$ and $G_2 = (\Sigma_{N,2}, \Sigma_T, P_2, S_2)$ be regular grammars for $L_1$ and $L_2$. Assume $\Sigma_{N,1} \cap \Sigma_{N,2} = \emptyset$. Construct $G = (\Sigma_{N,1} \cup \Sigma_{N,2}, \Sigma_T, P, S_1)$ as follows. $P$ contains all rules from $P_2$. For each rule $X \to u$ in $P_1$ (where $u \in \Sigma_T^*$), add $X \to uS_2$ to $P$. For each rule $X \to uY$ in $P_1$, add it to $P$. $G$ is regular and $L(G) = L_1L_2$.
Equivalence with Finite Automata
A cornerstone result is that regular grammars generate precisely the class of languages accepted by finite automata.
Theorem 10.1 (FA to Regular Grammar)
To each finite automaton $A$, there exists a regular grammar $G$ such that $L(A) = L(G)$.
Proof of FA to Regular Grammar Equivalence
Let $A = (Q, \Sigma, \delta, q_0, F)$ be an FA. Construct $G = (Q, \Sigma, P, q_0)$ where:
- For each transition $\delta(q, a) = p$, add rule $q \to ap$ to $P$.
- For each accepting state $q \in F$, add rule $q \to \lambda$ to $P$. The nonterminals of $G$ are the states of $A$. A derivation in $G$ simulates a computation in $A$.
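The construction is purely mechanical, as the sketch below shows (our own illustration; `""` stands for the rule $q \to \lambda$, and the example DFA, accepting words over $\{0, 1\}$ that end in 1, is an assumption):

```python
def fa_to_grammar(states, delta, start, accepting):
    """Theorem 10.1: states become nonterminals; delta(q, a) = p yields
    the rule q -> a p; each accepting state q yields q -> lambda."""
    rules = {q: [] for q in states}
    for (q, a), p in delta.items():
        rules[q].append(a + p)   # rule q -> a p
    for q in accepting:
        rules[q].append("")      # rule q -> lambda
    return rules, start

# Illustrative DFA: state Z = "last symbol was not 1", N = "last symbol was 1".
delta = {("Z", "1"): "N", ("Z", "0"): "Z",
         ("N", "1"): "N", ("N", "0"): "Z"}
rules, start = fa_to_grammar({"Z", "N"}, delta, "Z", {"N"})
```

A derivation such as $Z \Rightarrow 1N \Rightarrow 11N \Rightarrow 11$ then retraces the accepting run of the automaton on 11.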
To prove the reverse, we first need to normalize regular grammars.
Definition 10.3 (Normalized Regular Grammar)
A regular grammar is normalized if all rules are of one of the following forms:
- $S \to \lambda$ (if $\lambda \in L(G)$)
- $X \to aY$ for $X, Y \in \Sigma_N$, $a \in \Sigma_T$
- $X \to a$ for $X \in \Sigma_N$, $a \in \Sigma_T$
Theorem 10.2 (Normalization of Regular Grammars)
To every regular grammar, there exists an equivalent normalized regular grammar.
Proof of Normalization
The proof involves a series of transformations:
- Eliminate chain rules ($X \to Y$ with $X, Y \in \Sigma_N$): For each $X$, find all $Z$ such that $X \Rightarrow^* Z$ using only chain rules. For every rule $Z \to \beta$ (where $\beta$ is not a single nonterminal), add $X \to \beta$.
- Eliminate $\lambda$-rules ($X \to \lambda$ for $X \neq S$): For every rule $Y \to uX$, add $Y \to u$.
- Break long terminal strings: Replace $X \to a_1a_2\cdots a_kY$ (for $k \geq 2$) with $X \to a_1X_1, X_1 \to a_2X_2, \ldots, X_{k-1} \to a_kY$ using new nonterminals $X_1, \ldots, X_{k-1}$. Similarly for $X \to a_1a_2\cdots a_k$.
Theorem 10.3 (Regular Grammar to NFA)
To each regular grammar $G$, there exists a nondeterministic finite automaton (NFA) $M$ such that $L(G) = L(M)$.
Proof of Regular Grammar to NFA Equivalence
Let $G = (\Sigma_N, \Sigma_T, P, S)$ be a normalized regular grammar. Construct $M = (\Sigma_N \cup \{p_f\}, \Sigma_T, \delta, S, F)$, where $p_f$ is a new final state.
- $F = \{p_f\}$ (and $F = \{p_f, S\}$ if $S \to \lambda \in P$).
- For each rule $X \to aY$, add transition $Y \in \delta(X, a)$.
- For each rule $X \to a$, add transition $p_f \in \delta(X, a)$. The nonterminals of $G$ become states in $M$. A derivation in $G$ corresponds to an accepting path in $M$.
These theorems establish that regular grammars and finite automata have equivalent expressive power. (This equivalence is fundamental: it means we can either describe simple patterns using a generative grammar or recognize them using a finite-state machine, offering two complementary perspectives on the same class of languages.)
10.4 Context-Free Grammars and Pushdown Automata
Context-free grammars (CFGs) are more powerful than regular grammars and are fundamental to describing the syntax of programming languages.
Definition 10.2 (Chomsky Hierarchy - Part 2)
- $G$ is a context-free grammar (Type-2 grammar) if all rules in $P$ are of the form $X \to \beta$, where $X \in \Sigma_N$ and $\beta \in (\Sigma_N \cup \Sigma_T)^*$.
A language is context-free if it is generated by a context-free grammar. The class of context-free languages is denoted $\mathcal{L}_2$. The key difference from regular grammars is that $\beta$ can contain multiple nonterminals, allowing for recursive structures. (This added flexibility allows CFGs to describe languages with nested dependencies, like correctly matched parentheses or arithmetic expressions, which regular grammars cannot handle.)
Context-Free Language $\{a^nb^n \mid n \in \mathbb{N}\}$
The grammar $G = (\{S\}, \{a, b\}, \{S \to aSb,\ S \to \lambda\}, S)$ generates $L(G) = \{a^nb^n \mid n \in \mathbb{N}\}$. This language is not regular.
Syntax Trees
Derivations in CFGs can be represented by syntax trees (or parse trees), which visually depict the hierarchical structure of the generated word. (These trees are invaluable for understanding the structure of programming language constructs and are a core concept in compiler design.)
Definition 10.4 (Syntax Tree)
A syntax tree for a CFG is an ordered tree where:
- The root is labeled with the start symbol.
- Internal nodes are labeled with nonterminals.
- Leaves are labeled with terminals or $\lambda$.
- If an internal node labeled $X$ has children labeled $\alpha_1, \ldots, \alpha_k$ (from left to right), then $X \to \alpha_1\cdots\alpha_k$ must be a production rule in $P$. The concatenation of the labels of the leaves (read from left to right) forms the generated word.
Syntax Tree for Arithmetic Expression
For the grammar $E \to E + T \mid T$, $T \to T * F \mid F$, $F \to (E) \mid \mathrm{id}$, the expression
(id * id) + id has a syntax tree showing its structure.
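A small recursive-descent sketch makes the tree concrete (our own illustration: trees are nested tuples `(nonterminal, children...)`, and the left-recursive rules are implemented iteratively, producing the same left-leaning trees):

```python
def parse_E(tokens, i):
    node, i = parse_T(tokens, i)
    node = ("E", node)                        # E -> T
    while i < len(tokens) and tokens[i] == "+":
        right, i = parse_T(tokens, i + 1)
        node = ("E", node, "+", right)        # E -> E + T
    return node, i

def parse_T(tokens, i):
    node, i = parse_F(tokens, i)
    node = ("T", node)                        # T -> F
    while i < len(tokens) and tokens[i] == "*":
        right, i = parse_F(tokens, i + 1)
        node = ("T", node, "*", right)        # T -> T * F
    return node, i

def parse_F(tokens, i):
    if tokens[i] == "(":
        inner, i = parse_E(tokens, i + 1)
        assert tokens[i] == ")"
        return ("F", "(", inner, ")"), i + 1  # F -> ( E )
    assert tokens[i] == "id"
    return ("F", "id"), i + 1                 # F -> id

def leaves(node):
    """Concatenate leaf labels left to right: this is the generated word."""
    if isinstance(node, str):
        return node
    return "".join(leaves(child) for child in node[1:])

tree, _ = parse_E(["(", "id", "*", "id", ")", "+", "id"], 0)
```

Reading the leaves of `tree` from left to right reproduces `(id*id)+id`, exactly as Definition 10.4 demands.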
Normal Forms for Context-Free Grammars
For theoretical analysis and parsing algorithms, CFGs are often converted into simpler forms.
Definition 10.5 (Chomsky Normal Form and Greibach Normal Form)
Let $G = (\Sigma_N, \Sigma_T, P, S)$ be a context-free grammar with $\lambda \notin L(G)$ (the empty string is not in $L(G)$).
- $G$ is in Chomsky Normal Form (CNF) if all rules are of the form $X \to a$ (where $a \in \Sigma_T$) or $X \to YZ$ (where $Y, Z \in \Sigma_N$). (CNF ensures that every production either generates a single terminal or combines exactly two nonterminals, simplifying parsing algorithms.)
- $G$ is in Greibach Normal Form (GNF) if all rules are of the form $X \to a\alpha$ (where $a \in \Sigma_T$ and $\alpha \in \Sigma_N^*$). (GNF ensures that every production starts with a terminal, which is useful for top-down parsing.)
Theorem 10.4 (Existence of CNF and GNF)
For every context-free grammar $G$ with $\lambda \notin L(G)$, there exist equivalent grammars in Chomsky Normal Form and in Greibach Normal Form.
Proof of Existence of CNF and GNF
The conversion to CNF typically involves:
- Eliminating useless symbols (nonterminals that cannot be reached or cannot derive a terminal string).
- Eliminating $\lambda$-productions (rules $X \to \lambda$; in the general setting, $S \to \lambda$ is kept when $\lambda \in L(G)$).
- Eliminating chain rules ($X \to Y$ with $X, Y \in \Sigma_N$).
- Replacing rules with mixed terminals/nonterminals or long right-hand sides with new nonterminals and binary rules.
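The last step above can be sketched concisely (our own illustration: the right-hand side is given as a list of at least two nonterminal names, and the fresh names `C1, C2, ...` are hypothetical):

```python
def binarize(lhs, rhs):
    """Break X -> Y1 Y2 ... Yk (k >= 2, all nonterminals) into binary rules
    X -> Y1 C1, C1 -> Y2 C2, ..., using fresh nonterminals C1, C2, ..."""
    rules, cur = [], lhs
    for k in range(len(rhs) - 2):
        fresh = "C%d" % (k + 1)        # fresh nonterminal (illustrative naming)
        rules.append((cur, [rhs[k], fresh]))
        cur = fresh
    rules.append((cur, rhs[-2:]))      # final binary rule
    return rules

print(binarize("X", ["A", "B", "C", "D"]))
# -> [('X', ['A', 'C1']), ('C1', ['B', 'C2']), ('C2', ['C', 'D'])]
```

Each fresh nonterminal derives exactly the suffix of the original right-hand side it stands for, so the generated language is unchanged.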
Pumping Lemma for Context-Free Languages
Similar to regular languages, there’s a pumping lemma for CFGs to prove that a language is not context-free.
Lemma 10.6 (Pumping Lemma for Context-Free Languages)
If $L$ is a context-free language, then there exists a constant $n_L$ (depending only on $L$) such that for all words $z \in L$ with $|z| \geq n_L$, there exists a decomposition $z = uvwxy$ such that:
- $|vx| \geq 1$ (at least one of $v$ or $x$ is non-empty).
- $|vwx| \leq n_L$.
- $uv^iwx^iy \in L$ for all $i \in \mathbb{N}$.
(Important Note: The Pumping Lemma can only be used to prove that a language is not context-free. It cannot be used to prove that a language is context-free.)
Proof of Pumping Lemma for CFLs
The proof relies on the fact that in any sufficiently long derivation tree, some nonterminal must repeat on a path from the root to a leaf. This repeating nonterminal allows for “pumping” (repeating or removing) the corresponding segments of the word.
Proving $\{a^nb^nc^n \mid n \in \mathbb{N}\}$ is not context-free
Assume $L = \{a^nb^nc^n \mid n \in \mathbb{N}\}$ is context-free. Let $n_L$ be the pumping length. Choose $z = a^{n_L}b^{n_L}c^{n_L}$. By the pumping lemma, $z = uvwxy$ with $|vwx| \leq n_L$ and $|vx| \geq 1$. Since $|vwx| \leq n_L$, the segment $vwx$ cannot contain all three types of symbols ($a$, $b$, $c$); it can contain at most two types. Whether $vx$ contains only $a$'s and $b$'s (or only $b$'s and $c$'s, or only one type of symbol), pumping $v$ and $x$ (i.e., forming $uv^2wx^2y$) increases the count of only one or two types of symbols, while the third remains unchanged. This breaks the pattern, so $uv^2wx^2y \notin L$. This contradicts condition (3) of the pumping lemma, so $L$ is not context-free.
Pushdown Automata
Nondeterministic Pushdown Automata (NPdAs) are the machine model equivalent to context-free grammars. They extend finite automata with a stack (a LIFO data structure) for memory. (This stack memory is crucial for handling nested structures, like matching parentheses, which finite automata cannot do.)
Definition 10.6 (Nondeterministic Pushdown Automaton)
An NPdA is a 6-tuple $M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0)$, where:
- $Q$ is a finite set of states.
- $\Sigma$ is the input alphabet.
- $\Gamma$ is the stack alphabet.
- $\delta$ is the transition function: $\delta$ maps $Q \times (\Sigma \cup \{\lambda\}) \times \Gamma$ to finite subsets of $Q \times \Gamma^*$.
- $q_0 \in Q$ is the initial state.
- $Z_0 \in \Gamma$ is the initial stack symbol. An NPdA accepts a word $w$ if, starting in $q_0$ with $Z_0$ on the stack, it can read all of $w$ and empty its stack.
NPdA for $\{0^n1^n \mid n \in \mathbb{N}\}$
An NPdA for $L = \{0^n1^n \mid n \in \mathbb{N}\}$ would push a '0' onto the stack for each '0' read from the input. When a '1' is read, it pops a '0' from the stack. If the stack is empty (after popping $Z_0$ by a $\lambda$-move) when all '1's are read, the word is accepted.
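The behaviour of this automaton can be simulated directly; on this language the NPdA has no real choices, so a deterministic sketch suffices (function name and symbol names are our own):

```python
def accepts_0n1n(word):
    """Simulate the NPdA: push for each 0, pop for each 1,
    accept iff the input is consumed and the stack is emptied."""
    stack = ["Z0"]                # initial stack symbol Z0
    seen_one = False
    for c in word:
        if c == "0":
            if seen_one:
                return False      # a 0 after a 1: no transition, reject
            stack.append("0")     # push a 0 for each 0 read
        elif c == "1":
            seen_one = True
            if stack[-1] != "0":
                return False      # more 1s than 0s: reject
            stack.pop()
        else:
            return False
    stack.pop()                   # final lambda-move: pop Z0 (or a leftover 0)
    return not stack              # accept iff the stack is now empty
```

For example, `accepts_0n1n("000111")` is `True`, while `accepts_0n1n("001")` and `accepts_0n1n("10")` are `False`.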
Equivalence of CFGs and NPdAs
A language $L$ is context-free if and only if there exists a nondeterministic pushdown automaton $M$ such that $L = L(M)$.
Proof of Equivalence of CFGs and NPdAs
- CFG to NPdA: Given a CFG in Greibach Normal Form, an NPdA can be constructed with a single state. For a rule $X \to a\alpha$, the NPdA reads $a$, pops $X$ from the stack, and pushes $\alpha$ (reversed, so that the leftmost nonterminal of $\alpha$ ends up on top).
- NPdA to CFG: This is more complex. It involves constructing a CFG whose nonterminals represent the state changes and stack contents of the NPdA.
10.5 Context-Sensitive and Unrestricted Grammars
Beyond context-free grammars lie more powerful types, completing the Chomsky Hierarchy.
Definition 10.2 (Chomsky Hierarchy - Part 3)
- $G$ is a context-sensitive grammar (Type-1 grammar) if all rules $\alpha \to \beta$ satisfy $|\alpha| \leq |\beta|$, and $\alpha$ contains at least one nonterminal. (This means a nonterminal can only be replaced if it is in a specific "context" of surrounding symbols, and the restriction $|\alpha| \leq |\beta|$ means rules cannot shorten strings.)
- $G$ is an unrestricted grammar (Type-0 grammar) if it has no restrictions on its rules other than that each left-hand side contains at least one nonterminal.
A language is context-sensitive (resp. recursively enumerable) if it is generated by a context-sensitive (resp. unrestricted) grammar. The classes are denoted $\mathcal{L}_1$ and $\mathcal{L}_0$.
Generative Power and Automata Equivalence
- Context-Sensitive Grammars ($\mathcal{L}_1$): These grammars can model languages where the replacement of a nonterminal depends on its surrounding context. They are equivalent in power to Linear Bounded Automata (LBAs), which are Turing machines whose tape head cannot move beyond the portion of the tape initially occupied by the input. Context-sensitive languages are decidable.
- Unrestricted Grammars ($\mathcal{L}_0$): These are the most powerful grammars, equivalent to Turing Machines. They can generate any recursively enumerable language. Languages generated by unrestricted grammars are not necessarily decidable.
The Chomsky Hierarchy forms a strict inclusion: $\mathcal{L}_3 \subsetneq \mathcal{L}_2 \subsetneq \mathcal{L}_1 \subsetneq \mathcal{L}_0$. (Each level is strictly more powerful than the next more restricted one: for each $i \in \{0, 1, 2\}$, there are languages generated by a Type-$i$ grammar but by no Type-$(i+1)$ grammar.)
Summary
- Grammars are generative models for languages, defined by nonterminals, terminals, production rules, and a start symbol.
- The Chomsky Hierarchy classifies grammars into four types based on restrictions on their production rules, each corresponding to a different class of languages and computational power.
- Type-3 (Regular Grammars): Generate regular languages, equivalent to finite automata. Rules are $X \to uY$ or $X \to u$ with $u \in \Sigma_T^*$.
- Type-2 (Context-Free Grammars): Generate context-free languages, equivalent to pushdown automata. Rules are $X \to \beta$ with $\beta \in (\Sigma_N \cup \Sigma_T)^*$. Crucial for programming language syntax.
- Type-1 (Context-Sensitive Grammars): Generate context-sensitive languages, equivalent to linear bounded automata. Rules $\alpha \to \beta$ with $|\alpha| \leq |\beta|$.
- Type-0 (Unrestricted Grammars): Generate recursively enumerable languages, equivalent to Turing machines. No restrictions on rules.
- Syntax trees provide a hierarchical representation of derivations in context-free grammars.
- Normal forms (Chomsky Normal Form, Greibach Normal Form) simplify CFGs for analysis and parsing.
- Pumping lemmas (for regular and context-free languages) are powerful tools for proving that a language does not belong to a particular class.
- The hierarchy demonstrates a fundamental trade-off between the expressiveness of a language model and the complexity of the automaton required to process it.
Exercises
Exercise 10.1 (Regular Grammar Construction)
Construct a regular grammar for the language $L = \{w \in \{0, 1\}^* \mid w \text{ contains an even number of 0s}\}$.
Solution
Let $G = (\{E, O\}, \{0, 1\}, P, E)$, where $E$ means an even number of 0s seen so far, and $O$ means an odd number of 0s seen so far. $P = \{E \to 1E,\ E \to 0O,\ O \to 1O,\ O \to 0E,\ E \to \lambda\}$.
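The solution can be sanity-checked by enumeration (our own sketch: since the grammar is regular, each sentential form has at most one nonterminal, and `""` stands for $\lambda$):

```python
# Enumerate every word derivable within a few steps and confirm
# each one contains an even number of 0s.
rules = {"E": ["1E", "0O", ""], "O": ["1O", "0E"]}

def derive_all(start, depth):
    """All terminal words reachable from `start` within `depth` rewrite steps."""
    forms, words = {start}, set()
    for _ in range(depth):
        nxt = set()
        for f in forms:
            for i, ch in enumerate(f):
                if ch in rules:                    # rewrite the (only) nonterminal
                    for rhs in rules[ch]:
                        nxt.add(f[:i] + rhs + f[i + 1:])
                    break
        words |= {f for f in nxt if not any(c in rules for c in f)}
        forms = nxt
    return words

words = derive_all("E", 6)
```

Every enumerated word has an even number of 0s, and words like `0` are never produced.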
Exercise 10.2 (Context-Free Grammar for Palindromes)
Construct a context-free grammar for the language of palindromes over {a, b}, i.e., $L = \{w \in \{a, b\}^* \mid w = w^R\}$.
Solution
$G = (\{S\}, \{a, b\}, P, S)$ with $P = \{S \to \lambda,\ S \to a,\ S \to b,\ S \to aSa,\ S \to bSb\}$. (Note: $S \to a$ and $S \to b$ are for odd-length palindromes.)
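A quick enumeration confirms the grammar generates only palindromes (our own sketch; `""` stands for $\lambda$, and each sentential form contains at most one `S`):

```python
rhs_list = ["", "a", "b", "aSa", "bSb"]   # the five rules for S

def words_up_to(depth):
    """All terminal words derivable from S within `depth` rewrite steps."""
    forms, words = {"S"}, set()
    for _ in range(depth):
        nxt = set()
        for f in forms:
            i = f.find("S")
            if i >= 0:                     # rewrite the single occurrence of S
                for rhs in rhs_list:
                    nxt.add(f[:i] + rhs + f[i + 1:])
        words |= {f for f in nxt if "S" not in f}
        forms = nxt
    return words

pals = words_up_to(4)
```

Every word produced equals its own reversal, while non-palindromes such as `ab` never appear.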
Exercise 10.3 (Pumping Lemma for Context-Free Languages)
Prove that the language $L = \{a^nb^mc^nd^m \mid n, m \in \mathbb{N}\}$ is not context-free using the Pumping Lemma for Context-Free Languages.
Solution
Assume $L$ is context-free. Let $n_L$ be the pumping length. Choose $z = a^{n_L}b^{n_L}c^{n_L}d^{n_L}$. By the pumping lemma, $z = uvwxy$ with $|vwx| \leq n_L$ and $|vx| \geq 1$. Since $|vwx| \leq n_L$, the segment $vwx$ can contain symbols from at most two of the four blocks of identical symbols.
Case 1: $vx$ contains symbols from only one block (e.g., only $a$'s). Pumping $v$ and $x$ (i.e., forming $uv^2wx^2y$) will increase the count of only $a$'s, breaking the pattern. Case 2: $vwx$ contains symbols from two adjacent blocks (e.g., $a$'s and $b$'s). Pumping will increase the counts of $a$'s and/or $b$'s, but not $c$'s or $d$'s, breaking the pattern. Case 3: $vwx$ contains symbols from two non-adjacent blocks (e.g., $a$'s and $c$'s). This is impossible, since $vwx$ is a contiguous substring and $|vwx| \leq n_L$.
In all valid cases, pumping $v$ and $x$ leads to a string not in $L$, contradicting the pumping lemma. Therefore, $L$ is not context-free.
Exercise 10.4 (Pushdown Automaton Design)
Design a nondeterministic pushdown automaton (NPdA) for the language $L = \{w \in \{a, b\}^* \mid |w|_a = |w|_b\}$.
Solution
This is a classic example where an NPdA can be designed. The NPdA would use its stack to keep track of the imbalance between 'a's and 'b's. Let $M = (\{q\}, \{a, b\}, \{Z_0, A, B\}, \delta, q, Z_0)$.
- If an 'a' is read: if $B$ is on top of the stack, pop $B$; otherwise, push $A$.
- If a 'b' is read: if $A$ is on top of the stack, pop $A$; otherwise, push $B$.
- Accept if the stack is empty (only $Z_0$ remains, which is then popped by a $\lambda$-move) after reading the entire input.
This requires a bit more detail in the transition function to handle $Z_0$ and ensure correct pushing/popping. For example:
- $\delta(q, a, Z_0) \ni (q, AZ_0)$, $\delta(q, a, A) \ni (q, AA)$, $\delta(q, a, B) \ni (q, \lambda)$
- $\delta(q, b, Z_0) \ni (q, BZ_0)$, $\delta(q, b, B) \ni (q, BB)$, $\delta(q, b, A) \ni (q, \lambda)$
- $\delta(q, \lambda, Z_0) \ni (q, \lambda)$ (for acceptance if only $Z_0$ remains on the stack)
This NPdA accepts the language.
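On this language the transitions leave no real choice, so the automaton's behaviour can be simulated deterministically (a sketch with our own naming; the stack records the current surplus of $a$'s as $A$'s or of $b$'s as $B$'s above $Z_0$):

```python
def accepts_equal_ab(word):
    """Simulate the one-state NPdA for words with equally many a's and b's."""
    stack = ["Z0"]
    for c in word:
        if c not in "ab":
            return False
        opposite = "B" if c == "a" else "A"
        own = "A" if c == "a" else "B"
        if stack[-1] == opposite:
            stack.pop()            # cancel one surplus symbol of the other kind
        else:
            stack.append(own)      # record one more surplus symbol
    stack.pop()                    # lambda-move: pop Z0 (or a leftover symbol)
    return not stack               # accept iff the stack is now empty
```

For instance, `accepts_equal_ab("abba")` is `True`, while `accepts_equal_ab("aab")` is `False`.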