Regular Languages

Regular languages are the simplest class in the Chomsky hierarchy. They are recognized by finite automata, generated by regular grammars, and described by regular expressions. These three formalisms are equivalent.

Deterministic Finite Automata (DFA)

DFA example — binary numbers divisible by 3

A DFA M = (Q, Σ, δ, q₀, F) consists of:

Q: Finite set of states
Σ: Input alphabet
δ: Transition function δ: Q × Σ → Q (exactly one transition per state-symbol pair)
q₀: Start state (q₀ ∈ Q)
F: Set of accepting (final) states (F ⊆ Q)

Example

DFA accepting strings over {0, 1} with an even number of 1s:

States: {q₀, q₁}  (q₀ = even count, q₁ = odd count)
Start: q₀
Accept: {q₀}

Transitions:
δ(q₀, 0) = q₀    δ(q₀, 1) = q₁
δ(q₁, 0) = q₁    δ(q₁, 1) = q₀

Diagram:
    0         1
→(q₀) ─────→ q₁
  ↺0    ←───── ↺0
         1

Computation

Process input string one symbol at a time, transitioning between states:

Input: 1010
q₀ →₁ q₁ →₀ q₁ →₁ q₀ →₀ q₀ ∈ F → ACCEPT

L(M) = {w ∈ Σ* | δ*(q₀, w) ∈ F} where δ* is the extended transition function.

Nondeterministic Finite Automata (NFA)

An NFA N = (Q, Σ, δ, q₀, F) where:

δ: Q × Σ → P(Q) (transition to a set of states — can be multiple or empty)

The NFA accepts if any computation path leads to an accepting state.

Example

NFA accepting strings ending in "01":

States: {q₀, q₁, q₂}
Start: q₀, Accept: {q₂}

δ(q₀, 0) = {q₀, q₁}    δ(q₀, 1) = {q₀}
δ(q₁, 0) = ∅            δ(q₁, 1) = {q₂}
δ(q₂, 0) = ∅            δ(q₂, 1) = ∅

Key insight: The NFA "guesses" when the pattern starts. Multiple branches explore different guesses simultaneously.

Epsilon-NFA (ε-NFA)

Allows ε-transitions: transitions without consuming input.

δ: Q × (Σ ∪ {ε}) → P(Q).

ε-transitions enable "free" state changes, useful for composing automata.

NFA to DFA Conversion (Subset Construction)

Theorem: Every NFA can be converted to an equivalent DFA.

Algorithm (Rabin-Scott, 1959):

The DFA states are subsets of NFA states.
Start state: ε-closure({q₀}) (all states reachable from q₀ via ε-transitions).
Transition: δ_DFA(S, a) = ε-closure(⋃_{q∈S} δ_NFA(q, a)).
Accept: Any DFA state containing an NFA accept state.

Worst case: 2^n DFA states from n NFA states. This exponential blowup is sometimes unavoidable (e.g., "n-th from last symbol is 1").

In practice: Most states are unreachable, so the DFA is much smaller.

Regular Expressions

A regular expression (regex) describes a regular language using:

Operator	Meaning	Example	Language
a	Literal character	a	{a}
ε	Empty string	ε	{ε}
∅	Empty language	∅	{}
R₁R₂	Concatenation	ab	{ab}
R₁\|R₂	Union (alternation)	a\|b	{a, b}
R*	Kleene star	a*	{ε, a, aa, aaa, ...}
R⁺	One or more	a⁺	{a, aa, aaa, ...}
R?	Zero or one	a?	{ε, a}

Examples

(0|1)*         — all binary strings
(0|1)*1        — binary strings ending in 1
0*10*          — exactly one 1
(01|10)*       — alternating pairs
a*b*           — any number of a's followed by any number of b's
(a|b)*aba(a|b)* — contains "aba" as substring

Equivalence: DFA ↔ NFA ↔ Regex

All three formalisms describe exactly the class of regular languages:

Regex → NFA: Thompson's construction. Build NFA from regex compositionally.
NFA → DFA: Subset construction.
DFA → Regex: State elimination method (remove states one by one, labeling edges with regex).

DFA Minimization

Find the smallest DFA (fewest states) accepting a given language.

Hopcroft's Algorithm

Start with two groups: accepting states and non-accepting states.
Refine: Split a group if some transition on symbol a goes to different groups.
Repeat until no more splits.

Time: O(n log n) where n = number of states.

The minimal DFA is unique (up to state renaming) for each regular language.

Myhill-Nerode Theorem

A language L is regular if and only if the number of equivalence classes of the indistinguishability relation ≡_L is finite.

Definition: x ≡_L y iff for all z ∈ Σ*, xz ∈ L ↔ yz ∈ L.

The number of equivalence classes equals the number of states in the minimal DFA.

Use: Prove a language is NOT regular by showing infinitely many distinguishable strings.

Pumping Lemma for Regular Languages

Theorem: If L is regular, then there exists a pumping length p such that for every string w ∈ L with |w| ≥ p, there exist x, y, z with w = xyz satisfying:

|y| ≥ 1 (y is non-empty)
|xy| ≤ p
For all i ≥ 0: xyⁱz ∈ L

Using the Pumping Lemma

To prove L is not regular: Show that no pumping length p works.

Example: Prove L = {0ⁿ1ⁿ | n ≥ 0} is not regular.

Assume L is regular with pumping length p. Take w = 0ᵖ1ᵖ (w ∈ L, |w| = 2p ≥ p).

By condition 2, |xy| ≤ p, so y consists only of 0s. Say y = 0ᵏ for some k ≥ 1.

Pump with i = 2: xy²z = 0ᵖ⁺ᵏ1ᵖ ∉ L (more 0s than 1s). Contradiction. ∎

Closure Properties

Regular languages are closed under:

Operation	Closed?	Proof Technique
Union	✓	Product construction or NFA
Intersection	✓	Product construction
Complement	✓	Swap accept/non-accept in DFA
Concatenation	✓	NFA with ε-transition
Kleene star	✓	NFA with ε-transition
Reversal	✓	Reverse all transitions, swap start/accept
Difference	✓	L₁ ∩ L₂'
Homomorphism	✓	Replace each symbol in NFA transitions
Inverse homomorphism	✓	Modify DFA transitions

Product construction: Run two DFAs simultaneously. State = (state_of_DFA₁, state_of_DFA₂).

Intersection: Accept when both accept.
Union: Accept when either accepts.

Applications in CS

Lexical analysis: Tokenizers convert source code into tokens using DFAs compiled from regex patterns. Tools: lex, flex, re2c.
Text search: grep, ripgrep, IDE search — all powered by regex/DFA matching.
Input validation: Email addresses, phone numbers, URLs — validated by regex.
Network protocols: Packet format validation. Protocol state machines are DFAs.
Hardware design: FSM-based controllers are DFAs implemented in logic gates.
Model checking: System behaviors modeled as automata. Properties as automata on infinite words (Büchi automata).
Bioinformatics: DNA motif finding with position-specific scoring matrices (related to weighted automata).