5 min read
On this page

Formal Languages

Formal language theory provides the mathematical framework for describing patterns in strings — the foundation of compilers, parsers, regular expressions, and computability theory.

Alphabets and Strings

Alphabet (Σ)

A finite, non-empty set of symbols.

Σ = {0, 1}           — binary alphabet
Σ = {a, b, c, ..., z} — lowercase English
Σ = {0, 1, ..., 9}   — decimal digits
Σ = ASCII             — all ASCII characters

String (Word)

A finite sequence of symbols from Σ.

w = 0110    (string over {0, 1})
w = hello   (string over {a-z})

Empty string: ε (or λ) — the string with zero symbols. |ε| = 0.

Length: |w| = number of symbols in w. |hello| = 5.

Concatenation: w₁w₂ — append w₂ to w₁. "ab" · "cd" = "abcd".

Power: wⁿ = w concatenated n times. (ab)³ = ababab. w⁰ = ε.

Reversal: wᴿ — reverse the string. (abc)ᴿ = cba.

Σ* and Σ⁺

Σ*: Set of all strings over Σ (including ε). Σ* = Σ⁰ ∪ Σ¹ ∪ Σ² ∪ ...

Σ⁺: All non-empty strings. Σ⁺ = Σ* \ {ε} = Σ¹ ∪ Σ² ∪ ...

For Σ = {0, 1}: Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, ...} — countably infinite.

Languages

A language L over alphabet Σ is any subset of Σ*.

L ⊆ Σ*

Examples:

  • L = {ε} — language containing only the empty string
  • L = ∅ — the empty language (no strings at all)
  • L = {0ⁿ1ⁿ | n ≥ 0} = {ε, 01, 0011, 000111, ...}
  • L = {w ∈ {0,1}* | w is a palindrome}
  • L = {w ∈ {a-z}* | w is a valid English word}
  • L = set of all syntactically valid Rust programs

Note: ∅ ≠ {ε}. The empty language has no strings. {ε} has one string (the empty string).

Operations on Languages

Set Operations

Since languages are sets of strings, all set operations apply:

  • Union: L₁ ∪ L₂ = {w | w ∈ L₁ or w ∈ L₂}
  • Intersection: L₁ ∩ L₂ = {w | w ∈ L₁ and w ∈ L₂}
  • Difference: L₁ \ L₂ = {w | w ∈ L₁ and w ∉ L₂}
  • Complement: L̄ = Σ* \ L = {w ∈ Σ* | w ∉ L}

Concatenation

L₁ · L₂ = {w₁w₂ | w₁ ∈ L₁, w₂ ∈ L₂}

Example: L₁ = {a, ab}, L₂ = {c, cd}. L₁ · L₂ = {ac, acd, abc, abcd}.

Properties:

  • L · {ε} = {ε} · L = L
  • L · ∅ = ∅ · L = ∅
  • Not commutative in general

Kleene Star (Closure)

L* = L⁰ ∪ L¹ ∪ L² ∪ L³ ∪ ... = {ε} ∪ L ∪ LL ∪ LLL ∪ ...

Zero or more concatenations of strings from L.

Example: L = {ab}. L* = {ε, ab, abab, ababab, ...}.

Kleene plus: L⁺ = L · L* = L* \ {ε} (one or more).

Properties:

  • ∅* = {ε}
  • {ε}* = {ε}
  • (L*)* = L*
  • L* = {ε} ∪ L · L*

Reversal

Lᴿ = {wᴿ | w ∈ L}

Homomorphism

A function h: Σ → Δ* mapping each symbol to a string. Extended to strings: h(a₁a₂...aₙ) = h(a₁)h(a₂)...h(aₙ).

Example: h(0) = ab, h(1) = ε. Then h(010) = abab.

Chomsky Hierarchy

Chomsky Hierarchy — four levels with automata and example languages

The Chomsky hierarchy classifies grammars and languages into four types based on the form of their production rules.

Type 0: Recursively Enumerable (Turing machines)
  ⊃
Type 1: Context-Sensitive (Linear bounded automata)
  ⊃
Type 2: Context-Free (Pushdown automata)
  ⊃
Type 3: Regular (Finite automata)
Type Grammar Automaton Example
3 (Regular) A → aB or A → a DFA/NFA a*b+
2 (Context-Free) A → γ (any string of terminals/nonterminals) PDA {aⁿbⁿ}
1 (Context-Sensitive) αAβ → αγβ (|LHS| ≤ |RHS|) LBA {aⁿbⁿcⁿ}
0 (Unrestricted) Any production Turing machine Halting problem complement

Each level strictly contains the levels below it.

Why This Matters for CS

Language Class Application
Regular Lexical analysis (tokenization), pattern matching (regex), protocol validation
Context-free Syntax analysis (parsing), programming language grammars, XML/JSON/HTML
Context-sensitive Natural language (some aspects), type checking
Recursively enumerable General computation, program verification

Grammars

A grammar G = (V, Σ, R, S) consists of:

  • V: Set of non-terminal symbols (variables)
  • Σ: Set of terminal symbols (alphabet)
  • R: Set of production rules (rewriting rules)
  • S: Start symbol (S ∈ V)

Derivation

Apply production rules to transform strings.

Grammar:
S → aSb | ε

Derivation of "aabb":
S ⇒ aSb ⇒ aaSbb ⇒ aabb  (replaced S → ε at last step)

Leftmost derivation: Always expand the leftmost non-terminal. Rightmost derivation: Always expand the rightmost non-terminal.

The language generated by G: L(G) = {w ∈ Σ* | S ⇒* w} (all terminal strings derivable from S).

Decision Problems

Problem Regular Context-Free Context-Sensitive RE
Membership (w ∈ L?) O(n) O(n³) (CYK) Decidable Undecidable
Emptiness (L = ∅?) Decidable Decidable Undecidable Undecidable
Equivalence (L₁ = L₂?) Decidable Undecidable Undecidable Undecidable
Containment (L₁ ⊆ L₂?) Decidable Undecidable Undecidable Undecidable

Applications in CS

  • Compilers: Lexical analysis (regular), parsing (context-free), semantic analysis (context-sensitive aspects).
  • Text processing: Regular expressions for search, validation, extraction.
  • Protocol specification: Formal grammars describe message formats and state machines.
  • Natural language processing: Context-free grammars for syntax, context-sensitive for agreement and scoping.
  • Bioinformatics: Stochastic context-free grammars for RNA secondary structure prediction.
  • Security: Input validation, protocol conformance checking.
  • Verification: Model checking temporal properties expressed as automata.