Context-Free Languages

Context-free languages (CFLs) are the second level of the Chomsky hierarchy. They capture the recursive structure of programming languages and are recognized by pushdown automata.

Context-Free Grammars (CFG)

A CFG G = (V, Σ, R, S) where:

V: Non-terminal symbols
Σ: Terminal symbols
R: Production rules of the form A → γ where A ∈ V and γ ∈ (V ∪ Σ)*
S: Start symbol

Example: Balanced Parentheses

S → (S) | SS | ε

Derivation of "(()())":

S ⇒ (S) ⇒ (SS) ⇒ ((S)S) ⇒ ((S)(S)) ⇒ (()()) ⇒ (()())

Example: Arithmetic Expressions

E → E + T | T
T → T * F | F
F → (E) | id | num

This grammar encodes operator precedence (* before +) and associativity (left) in its structure.

Derivations and Parse Trees

Leftmost and Rightmost Derivations

Leftmost: Always expand the leftmost non-terminal. Rightmost: Always expand the rightmost non-terminal.

For unambiguous grammars, the leftmost and rightmost derivations correspond to the same parse tree.

Parse Trees

A tree representation of a derivation:

Root: Start symbol
Internal nodes: Non-terminals
Leaves: Terminals (reading left-to-right gives the derived string)

Expression: id + id * id

Parse tree (with precedence):
        E
       / \
      E   +   T
      |      / \
      T    T  *  F
      |    |     |
      F    F    id
      |    |
     id   id

Ambiguity

A grammar is ambiguous if some string has two or more distinct parse trees (equivalently, two distinct leftmost derivations).

Example: E → E + E | E * E | id. The string "id + id * id" has two parse trees (one with + at root, one with * at root), giving different evaluation orders.

Disambiguation techniques:

Rewrite the grammar to encode precedence/associativity.
Use disambiguation rules in the parser (shift/reduce preferences).

Inherently ambiguous languages: Some CFLs have no unambiguous grammar. Example: {aⁱbʲcᵏ | i=j or j=k}.

Normal Forms

Chomsky Normal Form (CNF)

Every production is either:

A → BC (two non-terminals)
A → a (single terminal)
S → ε (only for start symbol, if ε ∈ L)

Every CFG can be converted to CNF (by eliminating ε-productions, unit productions, and long/mixed productions).

CNF is important for: CYK parsing algorithm (requires CNF input).

Greibach Normal Form (GNF)

Every production is: A → aα where a is a terminal and α is a (possibly empty) string of non-terminals.

Every CFG can be converted to GNF. Useful for proofs and for constructing PDAs.

Pushdown Automata (PDA)

PDA example — accepting the language a^n b^n

A PDA is an NFA augmented with a stack (infinite LIFO memory).

M = (Q, Σ, Γ, δ, q₀, Z₀, F) where:

Γ: Stack alphabet
Z₀: Initial stack symbol
δ: Q × (Σ ∪ {ε}) × Γ → P(Q × Γ*)

Each transition: read an input symbol (or ε), pop a stack symbol, push zero or more stack symbols.

Example: {aⁿbⁿ | n ≥ 0}

States: {q₀, q₁, q_accept}

1. In q₀, reading 'a': Push A onto stack. Stay in q₀.
2. In q₀, reading 'b': Pop A from stack. Go to q₁.
3. In q₁, reading 'b': Pop A from stack. Stay in q₁.
4. In q₁, reading ε with empty stack: Go to q_accept.

Computation on "aabb":

(q₀, aabb, Z₀)
→ (q₀, abb, AZ₀)     push A
→ (q₀, bb, AAZ₀)     push A
→ (q₁, b, AZ₀)       pop A, switch to q₁
→ (q₁, ε, Z₀)        pop A
→ (q_accept, ε, Z₀)  accept

Deterministic vs Nondeterministic PDA

DPDA: At each step, at most one transition applies. DCFLs ⊊ CFLs.

NPDA: Multiple transitions may apply; accepts if any path reaches acceptance. NPDAs recognize all CFLs.

Key difference from DFA/NFA: For finite automata, deterministic = nondeterministic (same power). For PDAs, nondeterministic is strictly more powerful than deterministic.

DCFLs: {aⁿbⁿ}, most programming language syntax, LR(1) languages. Inherently non-deterministic CFLs: {wwᴿ | w ∈ {a,b}*} (palindromes — need to guess the middle).

Equivalence: CFG ↔ PDA

Theorem: A language is context-free iff it is recognized by a PDA.

CFG → PDA: Top-down simulation. Push start symbol, match terminals with input, expand non-terminals using productions.

PDA → CFG: Construct grammar rules that simulate PDA transitions. Each non-terminal represents a state pair [qᵢ, qⱼ] (strings that take the PDA from qᵢ to qⱼ with the same stack height).

Pumping Lemma for CFLs

Theorem: If L is context-free, there exists a pumping length p such that for every w ∈ L with |w| ≥ p, there exist u, v, x, y, z with w = uvxyz satisfying:

|vy| ≥ 1 (v and y aren't both empty)
|vxy| ≤ p
For all i ≥ 0: uvⁱxyⁱz ∈ L

Differences from regular pumping lemma: Two segments are pumped (v and y), and they must be pumped together at the same rate.

Example: {aⁿbⁿcⁿ | n ≥ 0} is Not Context-Free

Take w = aᵖbᵖcᵖ. By condition 2, |vxy| ≤ p, so vxy can contain at most two types of symbols. Pumping changes the count of those symbols but not the third. So uvⁱxyⁱz ∉ L for i ≠ 1. ∎

Closure Properties

Operation	Closed?
Union	✓ (S → S₁ \| S₂)
Concatenation	✓ (S → S₁S₂)
Kleene star	✓ (S → SS₁ \| ε)
Intersection	✗ ({aⁿbⁿcⁿ} = {aⁿbⁿ} ∩ {bⁿcⁿ} shift)
Complement	✗ (since intersection = complement of union of complements)
Intersection with regular	✓ (product construction: PDA × DFA)
Reversal	✓
Homomorphism	✓

CFLs are NOT closed under intersection or complement — a crucial difference from regular languages.

CYK Parsing Algorithm

Cocke-Younger-Kasami: Decides membership in O(n³) for any CFG (in CNF).

Dynamic programming: Fill a table T[i][j] = set of non-terminals that can derive the substring w[i..j].

For length l = 1 to n:
    For start i = 0 to n-l:
        j = i + l - 1
        For split k = i to j-1:
            If A → BC where B ∈ T[i][k] and C ∈ T[k+1][j]:
                Add A to T[i][j]
Accept if S ∈ T[0][n-1]

Time: O(n³ × |G|). Space: O(n²).

Practical use: Not used for programming language parsers (too slow, O(n³)). Used for natural language parsing (where ambiguity is high and grammars are complex).

Earley Parsing

Handles any CFG (not just CNF) in O(n³) worst case, O(n²) for unambiguous grammars, and O(n) for LR grammars.

More practical than CYK. Used in NLP (e.g., NLTK).

Applications in CS

Programming language syntax: Most programming language syntax is defined by context-free grammars.
Parsers: Compilers use LR, LALR, LL parsers for CFGs (covered in compiler design).
XML/HTML/JSON: Structured documents have context-free structure.
Natural language processing: Phrase structure grammars, probabilistic CFGs for parsing sentences.
Bioinformatics: Stochastic CFGs model RNA secondary structure.
Security: Grammar-based fuzzing generates syntactically valid inputs.