Lexical Analysis
The lexer (lexical analyzer, scanner, tokenizer) converts a stream of characters into a stream of tokens — the smallest meaningful units of the language.
Tokens, Lexemes, and Patterns
Token: A category of lexical unit (IDENTIFIER, NUMBER, PLUS, IF).
Lexeme: The actual character string matched (variable name "count", number "42", operator "+").
Pattern: The rule describing which strings match a token (regex, description).
Source: let x = 42 + y;
Tokens:
(KEYWORD, "let")
(IDENTIFIER, "x")
(EQUALS, "=")
(NUMBER, "42")
(PLUS, "+")
(IDENTIFIER, "y")
(SEMICOLON, ";")
Token Categories
| Category | Examples |
|---|---|
| Keywords | if, else, while, fn, let, return, struct |
| Identifiers | variable_name, MyStruct, foo |
| Literals | 42, 3.14, "hello", 'c', true |
| Operators | +, -, *, /, ==, !=, &&, \|\| |
| Delimiters | (, ), {, }, [, ], ;, , |
| Whitespace | spaces, tabs, newlines (usually discarded) |
| Comments | // line comment, /* block */ (discarded) |
Regular Expressions for Tokens
Each token class is defined by a regular expression.
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*
INTEGER: [0-9]+
FLOAT: [0-9]+\.[0-9]+([eE][+-]?[0-9]+)?
STRING: "([^"\\]|\\.)*"
WHITESPACE: [ \t\n\r]+
LINE_COMMENT: //[^\n]*
BLOCK_COMMENT: /\*([^*]|\*[^/])*\*/
Priority Rules
When multiple patterns match:
- Longest match (maximal munch): Choose the pattern that matches the most characters. "iffy" is IDENTIFIER, not IF + "fy".
- Priority order: If two patterns match the same length, the one listed first wins. "if" is IF keyword, not IDENTIFIER.
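These two rules can be sketched directly: each token pattern reports how far it can match from the current position, the longest match wins, and ties go to the pattern listed first. The matcher set here is a minimal assumed example (just the IF keyword and identifiers).

```rust
// Maximal munch with priority. Each matcher returns how many bytes it can
// consume at the start of `input` (0 = no match).

fn match_keyword_if(s: &str) -> usize {
    if s.starts_with("if") { 2 } else { 0 }
}

fn match_identifier(s: &str) -> usize {
    let mut len = 0;
    for (i, c) in s.char_indices() {
        let ok = if i == 0 {
            c.is_ascii_alphabetic() || c == '_'
        } else {
            c.is_ascii_alphanumeric() || c == '_'
        };
        if !ok { break; }
        len = i + c.len_utf8();
    }
    len
}

fn next_token_kind(input: &str) -> Option<(&'static str, usize)> {
    // Listed in priority order: keywords before identifiers.
    let matchers: [(&'static str, fn(&str) -> usize); 2] =
        [("IF", match_keyword_if), ("IDENTIFIER", match_identifier)];
    let mut best: Option<(&'static str, usize)> = None;
    for (name, m) in matchers {
        let len = m(input);
        // Only a strictly longer match replaces the current best, so on a
        // tie the earlier (higher-priority) matcher wins.
        if len > 0 && best.map_or(true, |(_, l)| len > l) {
            best = Some((name, len));
        }
    }
    best
}
```

So `next_token_kind("iffy")` yields IDENTIFIER with length 4 (longest match beats the keyword), while `next_token_kind("if")` yields IF (equal lengths, keyword listed first).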
Finite Automata for Lexing
From Regex to NFA (Thompson's Construction)
Each regex operation maps to an NFA fragment:
Character 'a': ──(a)──→
Concatenation AB: ──A──→──B──→
Alternation A|B: ──ε──→A──→──ε──→
──ε──→B──→──ε──→
Kleene star A*: ──ε──→A──→──ε──→
──ε────────────→ (skip)
←──────ε──────── (repeat)
NFA to DFA (Subset Construction)
Convert NFA to DFA for efficient simulation. Each DFA state = set of NFA states. See automata topic for details.
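The core of subset construction can be seen by simulating an NFA directly: at each step, track the *set* of states the NFA could be in; each distinct set is one state of the equivalent DFA. The hard-coded NFA below is an assumed example recognizing `a*b`, with one Thompson-style epsilon transition.

```rust
use std::collections::BTreeSet;

type State = usize;
// (from, label, to); label None = epsilon transition.
const EDGES: &[(State, Option<char>, State)] = &[
    (0, Some('a'), 0), // loop on 'a'
    (0, None, 1),      // epsilon into the 'b' part
    (1, Some('b'), 2),
];
const ACCEPT: State = 2;

// Add every state reachable via epsilon transitions.
fn eps_closure(mut set: BTreeSet<State>) -> BTreeSet<State> {
    let mut stack: Vec<State> = set.iter().copied().collect();
    while let Some(s) = stack.pop() {
        for &(from, label, to) in EDGES {
            if from == s && label.is_none() && set.insert(to) {
                stack.push(to);
            }
        }
    }
    set
}

// One DFA transition: move on `c`, then close over epsilons.
fn step(set: &BTreeSet<State>, c: char) -> BTreeSet<State> {
    let moved = EDGES
        .iter()
        .filter(|&&(from, label, _)| set.contains(&from) && label == Some(c))
        .map(|&(_, _, to)| to)
        .collect();
    eps_closure(moved)
}

fn nfa_accepts(input: &str) -> bool {
    let mut current = eps_closure(BTreeSet::from([0]));
    for c in input.chars() {
        current = step(&current, c);
    }
    current.contains(&ACCEPT)
}
```

Full subset construction simply does this ahead of time: it enumerates every reachable state-set once and records the transitions in a table, instead of recomputing them per input character.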
DFA Minimization
Minimize the DFA so the lexer uses as few states as possible: merge states that accept exactly the same suffix language. Hopcroft's algorithm finds the minimal equivalent DFA in O(n log n) time.
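The idea behind minimization can be sketched with a simpler quadratic variant (Moore-style partition refinement, same result as Hopcroft's): two states stay in the same block as long as, for every input symbol, they transition into the same block. The 4-state DFA below is an assumed example in which states 1 and 2 are behaviorally identical.

```rust
// Example DFA over a two-symbol alphabet {a, b}; states 1 and 2 are
// equivalent, so the minimized DFA has 3 states.
const N: usize = 4;
const DELTA: [[usize; 2]; N] = [
    [1, 2], // state 0
    [1, 3], // state 1
    [2, 3], // state 2 (equivalent to state 1)
    [3, 3], // state 3, accepting
];
const ACCEPTING: [bool; N] = [false, false, false, true];

fn minimized_state_count() -> usize {
    // block[s] = index of the partition block containing state s.
    // Initial partition: accepting vs. non-accepting.
    let mut block: Vec<usize> = ACCEPTING.iter().map(|&a| a as usize).collect();
    loop {
        // Signature of a state: its own block plus the blocks it moves to.
        let sig = |s: usize| (block[s], block[DELTA[s][0]], block[DELTA[s][1]]);
        let sigs: Vec<_> = (0..N).map(sig).collect();
        // Renumber distinct signatures to get the refined partition.
        let mut uniq = sigs.clone();
        uniq.sort();
        uniq.dedup();
        let new_block: Vec<usize> = sigs
            .iter()
            .map(|s| uniq.binary_search(s).unwrap())
            .collect();
        if new_block == block {
            return uniq.len(); // partition stable: number of minimized states
        }
        block = new_block;
    }
}
```

Hopcroft's algorithm reaches the same fixed point but refines blocks selectively with a worklist, which is what brings the cost down to O(n log n).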
Lexer Implementation
Table-Driven (Generated)
Lexer generators (lex, flex, re2c) compile regex specifications into DFA transition tables.
// Conceptual table-driven lexer: run the DFA as far as it will go,
// remember the last accepting state, and emit the token for that state
// (maximal munch).
fn next_token(input: &str) -> Token {
    let mut state = START;
    let mut pos = 0;
    let mut last_accepting = None;
    let bytes = input.as_bytes(); // &str cannot be indexed directly
    while pos < bytes.len() {
        let c = bytes[pos] as usize;
        state = transition_table[state][c];
        if state == DEAD { break; }
        if is_accepting[state] {
            last_accepting = Some((state, pos));
        }
        pos += 1;
    }
    match last_accepting {
        Some((state, end)) => token_for_state(state, &input[..=end]),
        None => error("unexpected character"),
    }
}
Hand-Written Lexer
Many production compilers use hand-written lexers for better error messages and performance.
CLASS Lexer
FIELDS: source (string), pos (integer)
FUNCTION NEXT_TOKEN()
SKIP_WHITESPACE()
IF pos ≥ length(source) THEN RETURN Token.Eof
c ← CURRENT_CHAR()
MATCH c
'(': ADVANCE(); RETURN Token.LParen
')': ADVANCE(); RETURN Token.RParen
'+': ADVANCE(); RETURN Token.Plus
'=':
ADVANCE()
IF CURRENT_CHAR() = '='
ADVANCE(); RETURN Token.EqualEqual
ELSE RETURN Token.Equal
'0'..'9': RETURN LEX_NUMBER()
'a'..'z', 'A'..'Z', '_': RETURN LEX_IDENTIFIER()
'"': RETURN LEX_STRING()
DEFAULT: RETURN Token.Error("unexpected '" + c + "'")
FUNCTION LEX_NUMBER()
start ← pos
WHILE pos < length(source) AND CURRENT_CHAR() is digit
ADVANCE()
value ← PARSE_INT(source[start..pos])
RETURN Token.Integer(value)
FUNCTION LEX_IDENTIFIER()
start ← pos
WHILE pos < length(source) AND (CURRENT_CHAR() is alphanumeric OR CURRENT_CHAR() = '_')
ADVANCE()
text ← source[start..pos]
MATCH text
"fn": RETURN Token.Fn
"let": RETURN Token.Let
"if": RETURN Token.If
"else": RETURN Token.Else
"return": RETURN Token.Return
DEFAULT: RETURN Token.Identifier(text)
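The pseudocode above translates to a compact, runnable Rust lexer. The token set here is trimmed to what the examples use (no strings, no comments), and the names are illustrative.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    LParen, RParen, Plus, Equal, EqualEqual,
    Integer(i64), Identifier(String),
    Fn, Let, If, Else, Return,
    Error(char), Eof,
}

struct Lexer<'a> {
    src: &'a [u8], // byte-oriented scanning; fine for ASCII source
    pos: usize,
}

impl<'a> Lexer<'a> {
    fn new(src: &'a str) -> Self {
        Lexer { src: src.as_bytes(), pos: 0 }
    }

    fn peek(&self) -> Option<u8> {
        self.src.get(self.pos).copied()
    }

    fn next_token(&mut self) -> Token {
        // Skip whitespace between tokens.
        while matches!(self.peek(), Some(b' ' | b'\t' | b'\n' | b'\r')) {
            self.pos += 1;
        }
        let c = match self.peek() {
            None => return Token::Eof,
            Some(c) => c,
        };
        match c {
            b'(' => { self.pos += 1; Token::LParen }
            b')' => { self.pos += 1; Token::RParen }
            b'+' => { self.pos += 1; Token::Plus }
            b'=' => {
                self.pos += 1;
                if self.peek() == Some(b'=') { self.pos += 1; Token::EqualEqual }
                else { Token::Equal }
            }
            b'0'..=b'9' => self.lex_number(),
            b'a'..=b'z' | b'A'..=b'Z' | b'_' => self.lex_identifier(),
            other => { self.pos += 1; Token::Error(other as char) }
        }
    }

    fn lex_number(&mut self) -> Token {
        let start = self.pos;
        while matches!(self.peek(), Some(b'0'..=b'9')) { self.pos += 1; }
        let text = std::str::from_utf8(&self.src[start..self.pos]).unwrap();
        Token::Integer(text.parse().unwrap())
    }

    fn lex_identifier(&mut self) -> Token {
        let start = self.pos;
        while matches!(self.peek(),
                       Some(b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b'_')) {
            self.pos += 1;
        }
        let text = std::str::from_utf8(&self.src[start..self.pos]).unwrap().to_string();
        // Keyword check happens after the maximal munch, so "iffy" stays an identifier.
        match text.as_str() {
            "fn" => Token::Fn, "let" => Token::Let, "if" => Token::If,
            "else" => Token::Else, "return" => Token::Return,
            _ => Token::Identifier(text),
        }
    }
}
```

Lexing identifiers first and then checking a keyword table is the common trick: it keeps the character-level scan simple while still giving keywords priority over identifiers of the same spelling.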
Advantages of hand-written: Better error messages, faster (no table lookup overhead), easier to handle special cases (string interpolation, indentation-sensitive syntax).
Used by: GCC, Clang, rustc, Go compiler, V8 (JavaScript).
Lexer Generators
lex/flex (C): The classic. Define patterns → generate C code.
re2c (C/C++): Generates highly optimized DFAs. Very fast. Used by PHP.
logos (Rust): Procedural macro-based lexer generator.
// Lexer generator specification
TOKEN DEFINITIONS
Fn ← "fn"
Let ← "let"
Identifier ← [a-zA-Z_][a-zA-Z0-9_]*
Number ← [0-9]+
Plus ← "+"
Equals ← "="
Whitespace ← [ \t\n\r]+ (skip)
lex ← CREATE_LEXER("let x = 42 + y")
WHILE HAS_NEXT(lex)
token ← NEXT(lex)
PRINT token, ": ", SLICE(lex)
Token Buffering
The parser may need to look ahead multiple tokens before deciding which grammar rule to apply.
Single buffer: Read tokens one at a time. Parser calls next_token().
Peekable: Buffer one token ahead. peek() returns the next token without consuming it.
Multi-token lookahead: Buffer k tokens for LL(k) or backtracking parsers.
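A k-token lookahead layer can be sketched as a small buffer over any token source; `peek(k)` fills the buffer on demand and `next()` drains it first. The wrapper is generic over an iterator, so any token type works.

```rust
use std::collections::VecDeque;

struct Lookahead<I: Iterator> {
    inner: I,
    buf: VecDeque<I::Item>,
}

impl<I: Iterator> Lookahead<I> {
    fn new(inner: I) -> Self {
        Lookahead { inner, buf: VecDeque::new() }
    }

    /// Return the k-th upcoming token (k = 0 is the very next one)
    /// without consuming it.
    fn peek(&mut self, k: usize) -> Option<&I::Item> {
        while self.buf.len() <= k {
            let item = self.inner.next()?; // source exhausted: no k-th token
            self.buf.push_back(item);
        }
        self.buf.get(k)
    }

    /// Consume and return the next token, draining the buffer first.
    fn next(&mut self) -> Option<I::Item> {
        if let Some(item) = self.buf.pop_front() { Some(item) }
        else { self.inner.next() }
    }
}
```

With k = 1 this is exactly the "peekable" case; an LL(k) parser simply calls `peek(0)` through `peek(k - 1)` before committing to a rule.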
Error Recovery in Lexing
Unknown character: Report error, skip the character, continue lexing.
Unterminated string: Report error at end of line/file, insert closing quote.
Invalid number: "42abc" — lex "42" as number, "abc" as identifier (or report error).
Goal: Report as many errors as possible in one compilation pass. Don't stop at the first error.
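A minimal recovery loop looks like this: collect every token and every error in one pass rather than stopping at the first bad character. The grammar here is an assumed toy (numbers, identifiers, whitespace only), so `"42abc"` splits into a number followed by an identifier, as described above.

```rust
fn lex_all(input: &str) -> (Vec<String>, Vec<String>) {
    let (mut tokens, mut errors) = (Vec::new(), Vec::new());
    let bytes = input.as_bytes();
    let mut pos = 0;
    while pos < bytes.len() {
        let c = bytes[pos];
        if c.is_ascii_whitespace() {
            pos += 1;
        } else if c.is_ascii_digit() {
            let start = pos;
            while pos < bytes.len() && bytes[pos].is_ascii_digit() { pos += 1; }
            tokens.push(format!("NUM({})", &input[start..pos]));
        } else if c.is_ascii_alphabetic() || c == b'_' {
            let start = pos;
            while pos < bytes.len()
                && (bytes[pos].is_ascii_alphanumeric() || bytes[pos] == b'_') {
                pos += 1;
            }
            tokens.push(format!("ID({})", &input[start..pos]));
        } else {
            // Report the error, skip one character, and keep lexing.
            errors.push(format!("unexpected '{}' at byte {}", c as char, pos));
            pos += 1;
        }
    }
    (tokens, errors)
}
```

On input `"42abc $ x"` this yields three tokens (NUM(42), ID(abc), ID(x)) plus one error for the `$`, instead of aborting at the first unknown character.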
Applications in CS
- Compilers: First phase of every compiler.
- Syntax highlighting: IDEs and editors use lexers to colorize code.
- Code formatters: Parse tokens → reformat (prettier, rustfmt, gofmt).
- Search tools: grep, ripgrep use regex/DFA for fast pattern matching.
- Data parsing: CSV, JSON, XML, log files — all start with tokenization.
- Protocol parsing: Network protocols have defined token structures.