Regular Languages
Regular languages are the simplest class in the Chomsky hierarchy. They are recognized by finite automata, generated by regular grammars, and described by regular expressions. These three formalisms are equivalent.
Deterministic Finite Automata (DFA)

A DFA M = (Q, Σ, δ, q₀, F) consists of:
- Q: Finite set of states
- Σ: Input alphabet
- δ: Transition function δ: Q × Σ → Q (exactly one transition per state-symbol pair)
- q₀: Start state (q₀ ∈ Q)
- F: Set of accepting (final) states (F ⊆ Q)
Example
DFA accepting strings over {0, 1} with an even number of 1s:
States: {q₀, q₁} (q₀ = even count, q₁ = odd count)
Start: q₀
Accept: {q₀}
Transitions:
δ(q₀, 0) = q₀ δ(q₀, 1) = q₁
δ(q₁, 0) = q₁ δ(q₁, 1) = q₀
Diagram:
0 1
→(q₀) ─────→ q₁
↺0 ←───── ↺0
1
Computation
Process input string one symbol at a time, transitioning between states:
Input: 1010
q₀ →₁ q₁ →₀ q₁ →₁ q₀ →₀ q₀ ∈ F → ACCEPT
L(M) = {w ∈ Σ* | δ*(q₀, w) ∈ F} where δ* is the extended transition function.
Nondeterministic Finite Automata (NFA)
An NFA N = (Q, Σ, δ, q₀, F) where:
- δ: Q × Σ → P(Q) (transition to a set of states — can be multiple or empty)
The NFA accepts if any computation path leads to an accepting state.
Example
NFA accepting strings ending in "01":
States: {q₀, q₁, q₂}
Start: q₀, Accept: {q₂}
δ(q₀, 0) = {q₀, q₁} δ(q₀, 1) = {q₀}
δ(q₁, 0) = ∅ δ(q₁, 1) = {q₂}
δ(q₂, 0) = ∅ δ(q₂, 1) = ∅
Key insight: The NFA "guesses" when the pattern starts. Multiple branches explore different guesses simultaneously.
Epsilon-NFA (ε-NFA)
Allows ε-transitions: transitions without consuming input.
δ: Q × (Σ ∪ {ε}) → P(Q).
ε-transitions enable "free" state changes, useful for composing automata.
NFA to DFA Conversion (Subset Construction)
Theorem: Every NFA can be converted to an equivalent DFA.
Algorithm (Rabin-Scott, 1959):
- The DFA states are subsets of NFA states.
- Start state: ε-closure({q₀}) (all states reachable from q₀ via ε-transitions).
- Transition: δ_DFA(S, a) = ε-closure(⋃_{q∈S} δ_NFA(q, a)).
- Accept: Any DFA state containing an NFA accept state.
Worst case: 2^n DFA states from n NFA states. This exponential blowup is sometimes unavoidable (e.g., "n-th from last symbol is 1").
In practice: Most states are unreachable, so the DFA is much smaller.
Regular Expressions
A regular expression (regex) describes a regular language using:
| Operator | Meaning | Example | Language | |---|---|---|---| | a | Literal character | a | {a} | | ε | Empty string | ε | {ε} | | ∅ | Empty language | ∅ | {} | | R₁R₂ | Concatenation | ab | {ab} | | R₁|R₂ | Union (alternation) | a|b | {a, b} | | R* | Kleene star | a* | {ε, a, aa, aaa, ...} | | R⁺ | One or more | a⁺ | {a, aa, aaa, ...} | | R? | Zero or one | a? | {ε, a} |
Examples
(0|1)* — all binary strings
(0|1)*1 — binary strings ending in 1
0*10* — exactly one 1
(01|10)* — alternating pairs
a*b* — any number of a's followed by any number of b's
(a|b)*aba(a|b)* — contains "aba" as substring
Equivalence: DFA ↔ NFA ↔ Regex
All three formalisms describe exactly the class of regular languages:
- Regex → NFA: Thompson's construction. Build NFA from regex compositionally.
- NFA → DFA: Subset construction.
- DFA → Regex: State elimination method (remove states one by one, labeling edges with regex).
DFA Minimization
Find the smallest DFA (fewest states) accepting a given language.
Hopcroft's Algorithm
- Start with two groups: accepting states and non-accepting states.
- Refine: Split a group if some transition on symbol a goes to different groups.
- Repeat until no more splits.
Time: O(n log n) where n = number of states.
The minimal DFA is unique (up to state renaming) for each regular language.
Myhill-Nerode Theorem
A language L is regular if and only if the number of equivalence classes of the indistinguishability relation ≡_L is finite.
Definition: x ≡_L y iff for all z ∈ Σ*, xz ∈ L ↔ yz ∈ L.
The number of equivalence classes equals the number of states in the minimal DFA.
Use: Prove a language is NOT regular by showing infinitely many distinguishable strings.
Pumping Lemma for Regular Languages
Theorem: If L is regular, then there exists a pumping length p such that for every string w ∈ L with |w| ≥ p, there exist x, y, z with w = xyz satisfying:
- |y| ≥ 1 (y is non-empty)
- |xy| ≤ p
- For all i ≥ 0: xyⁱz ∈ L
Using the Pumping Lemma
To prove L is not regular: Show that no pumping length p works.
Example: Prove L = {0ⁿ1ⁿ | n ≥ 0} is not regular.
Assume L is regular with pumping length p. Take w = 0ᵖ1ᵖ (w ∈ L, |w| = 2p ≥ p).
By condition 2, |xy| ≤ p, so y consists only of 0s. Say y = 0ᵏ for some k ≥ 1.
Pump with i = 2: xy²z = 0ᵖ⁺ᵏ1ᵖ ∉ L (more 0s than 1s). Contradiction. ∎
Closure Properties
Regular languages are closed under:
| Operation | Closed? | Proof Technique | |---|---|---| | Union | ✓ | Product construction or NFA | | Intersection | ✓ | Product construction | | Complement | ✓ | Swap accept/non-accept in DFA | | Concatenation | ✓ | NFA with ε-transition | | Kleene star | ✓ | NFA with ε-transition | | Reversal | ✓ | Reverse all transitions, swap start/accept | | Difference | ✓ | L₁ ∩ L₂' | | Homomorphism | ✓ | Replace each symbol in NFA transitions | | Inverse homomorphism | ✓ | Modify DFA transitions |
Product construction: Run two DFAs simultaneously. State = (state_of_DFA₁, state_of_DFA₂).
- Intersection: Accept when both accept.
- Union: Accept when either accepts.
Applications in CS
- Lexical analysis: Tokenizers convert source code into tokens using DFAs compiled from regex patterns. Tools: lex, flex, re2c.
- Text search: grep, ripgrep, IDE search — all powered by regex/DFA matching.
- Input validation: Email addresses, phone numbers, URLs — validated by regex.
- Network protocols: Packet format validation. Protocol state machines are DFAs.
- Hardware design: FSM-based controllers are DFAs implemented in logic gates.
- Model checking: System behaviors modeled as automata. Properties as automata on infinite words (Büchi automata).
- Bioinformatics: DNA motif finding with position-specific scoring matrices (related to weighted automata).