5 min read
On this page

Context-Free Languages

Context-free languages (CFLs) are the second level of the Chomsky hierarchy. They capture the recursive structure of programming languages and are recognized by pushdown automata.

Context-Free Grammars (CFG)

A CFG G = (V, Σ, R, S) where:

  • V: Non-terminal symbols
  • Σ: Terminal symbols
  • R: Production rules of the form A → γ where A ∈ V and γ ∈ (V ∪ Σ)*
  • S: Start symbol

Example: Balanced Parentheses

S → (S) | SS | ε

Derivation of "(()())":

S ⇒ (S) ⇒ (SS) ⇒ ((S)S) ⇒ ((S)(S)) ⇒ (()()) ⇒ (()())

Example: Arithmetic Expressions

E → E + T | T
T → T * F | F
F → (E) | id | num

This grammar encodes operator precedence (* before +) and associativity (left) in its structure.

Derivations and Parse Trees

Leftmost and Rightmost Derivations

Leftmost: Always expand the leftmost non-terminal. Rightmost: Always expand the rightmost non-terminal.

For unambiguous grammars, the leftmost and rightmost derivations correspond to the same parse tree.

Parse Trees

A tree representation of a derivation:

  • Root: Start symbol
  • Internal nodes: Non-terminals
  • Leaves: Terminals (reading left-to-right gives the derived string)
Expression: id + id * id

Parse tree (with precedence):
        E
       / \
      E   +   T
      |      / \
      T    T  *  F
      |    |     |
      F    F    id
      |    |
     id   id

Ambiguity

A grammar is ambiguous if some string has two or more distinct parse trees (equivalently, two distinct leftmost derivations).

Example: E → E + E | E * E | id. The string "id + id * id" has two parse trees (one with + at root, one with * at root), giving different evaluation orders.

Disambiguation techniques:

  • Rewrite the grammar to encode precedence/associativity.
  • Use disambiguation rules in the parser (shift/reduce preferences).

Inherently ambiguous languages: Some CFLs have no unambiguous grammar. Example: {aⁱbʲcᵏ | i=j or j=k}.

Normal Forms

Chomsky Normal Form (CNF)

Every production is either:

  • A → BC (two non-terminals)
  • A → a (single terminal)
  • S → ε (only for start symbol, if ε ∈ L)

Every CFG can be converted to CNF (by eliminating ε-productions, unit productions, and long/mixed productions).

CNF is important for: CYK parsing algorithm (requires CNF input).

Greibach Normal Form (GNF)

Every production is: A → aα where a is a terminal and α is a (possibly empty) string of non-terminals.

Every CFG can be converted to GNF. Useful for proofs and for constructing PDAs.

Pushdown Automata (PDA)

PDA example — accepting the language a^n b^n

A PDA is an NFA augmented with a stack (infinite LIFO memory).

M = (Q, Σ, Γ, δ, q₀, Z₀, F) where:

  • Γ: Stack alphabet
  • Z₀: Initial stack symbol
  • δ: Q × (Σ ∪ {ε}) × Γ → P(Q × Γ*)

Each transition: read an input symbol (or ε), pop a stack symbol, push zero or more stack symbols.

Example: {aⁿbⁿ | n ≥ 0}

States: {q₀, q₁, q_accept}

1. In q₀, reading 'a': Push A onto stack. Stay in q₀.
2. In q₀, reading 'b': Pop A from stack. Go to q₁.
3. In q₁, reading 'b': Pop A from stack. Stay in q₁.
4. In q₁, reading ε with empty stack: Go to q_accept.

Computation on "aabb":

(q₀, aabb, Z₀)
→ (q₀, abb, AZ₀)     push A
→ (q₀, bb, AAZ₀)     push A
→ (q₁, b, AZ₀)       pop A, switch to q₁
→ (q₁, ε, Z₀)        pop A
→ (q_accept, ε, Z₀)  accept

Deterministic vs Nondeterministic PDA

DPDA: At each step, at most one transition applies. DCFLs ⊊ CFLs.

NPDA: Multiple transitions may apply; accepts if any path reaches acceptance. NPDAs recognize all CFLs.

Key difference from DFA/NFA: For finite automata, deterministic = nondeterministic (same power). For PDAs, nondeterministic is strictly more powerful than deterministic.

DCFLs: {aⁿbⁿ}, most programming language syntax, LR(1) languages. Inherently non-deterministic CFLs: {wwᴿ | w ∈ {a,b}*} (palindromes — need to guess the middle).

Equivalence: CFG ↔ PDA

Theorem: A language is context-free iff it is recognized by a PDA.

CFG → PDA: Top-down simulation. Push start symbol, match terminals with input, expand non-terminals using productions.

PDA → CFG: Construct grammar rules that simulate PDA transitions. Each non-terminal represents a state pair [qᵢ, qⱼ] (strings that take the PDA from qᵢ to qⱼ with the same stack height).

Pumping Lemma for CFLs

Theorem: If L is context-free, there exists a pumping length p such that for every w ∈ L with |w| ≥ p, there exist u, v, x, y, z with w = uvxyz satisfying:

  1. |vy| ≥ 1 (v and y aren't both empty)
  2. |vxy| ≤ p
  3. For all i ≥ 0: uvⁱxyⁱz ∈ L

Differences from regular pumping lemma: Two segments are pumped (v and y), and they must be pumped together at the same rate.

Example: {aⁿbⁿcⁿ | n ≥ 0} is Not Context-Free

Take w = aᵖbᵖcᵖ. By condition 2, |vxy| ≤ p, so vxy can contain at most two types of symbols. Pumping changes the count of those symbols but not the third. So uvⁱxyⁱz ∉ L for i ≠ 1. ∎

Closure Properties

| Operation | Closed? | |---|---| | Union | ✓ (S → S₁ | S₂) | | Concatenation | ✓ (S → S₁S₂) | | Kleene star | ✓ (S → SS₁ | ε) | | Intersection | ✗ ({aⁿbⁿcⁿ} = {aⁿbⁿ} ∩ {bⁿcⁿ} shift) | | Complement | ✗ (since intersection = complement of union of complements) | | Intersection with regular | ✓ (product construction: PDA × DFA) | | Reversal | ✓ | | Homomorphism | ✓ |

CFLs are NOT closed under intersection or complement — a crucial difference from regular languages.

CYK Parsing Algorithm

Cocke-Younger-Kasami: Decides membership in O(n³) for any CFG (in CNF).

Dynamic programming: Fill a table T[i][j] = set of non-terminals that can derive the substring w[i..j].

For length l = 1 to n:
    For start i = 0 to n-l:
        j = i + l - 1
        For split k = i to j-1:
            If A → BC where B ∈ T[i][k] and C ∈ T[k+1][j]:
                Add A to T[i][j]
Accept if S ∈ T[0][n-1]

Time: O(n³ × |G|). Space: O(n²).

Practical use: Not used for programming language parsers (too slow, O(n³)). Used for natural language parsing (where ambiguity is high and grammars are complex).

Earley Parsing

Handles any CFG (not just CNF) in O(n³) worst case, O(n²) for unambiguous grammars, and O(n) for LR grammars.

More practical than CYK. Used in NLP (e.g., NLTK).

Applications in CS

  • Programming language syntax: Most programming language syntax is defined by context-free grammars.
  • Parsers: Compilers use LR, LALR, LL parsers for CFGs (covered in compiler design).
  • XML/HTML/JSON: Structured documents have context-free structure.
  • Natural language processing: Phrase structure grammars, probabilistic CFGs for parsing sentences.
  • Bioinformatics: Stochastic CFGs model RNA secondary structure.
  • Security: Grammar-based fuzzing generates syntactically valid inputs.