Lexical Analysis
The lexer (lexical analyzer, scanner, tokenizer) converts a stream of characters into a stream of tokens — the smallest meaningful units of the language.
Tokens, Lexemes, and Patterns
Token: A category of lexical unit (IDENTIFIER, NUMBER, PLUS, IF).
Lexeme: The actual character string matched (variable name "count", number "42", operator "+").
Pattern: The rule describing which strings match a token (regex, description).
Source: let x = 42 + y;
Tokens:
(KEYWORD, "let")
(IDENTIFIER, "x")
(EQUALS, "=")
(NUMBER, "42")
(PLUS, "+")
(IDENTIFIER, "y")
(SEMICOLON, ";")
Token Categories
| Category | Examples |
|---|---|
| Keywords | if, else, while, fn, let, return, struct |
| Identifiers | variable_name, MyStruct, foo |
| Literals | 42, 3.14, "hello", 'c', true |
| Operators | +, -, *, /, ==, !=, &&, \|\| |
| Delimiters | (, ), {, }, [, ], ;, , |
| Whitespace | spaces, tabs, newlines (usually discarded) |
| Comments | // line comment, /* block */ (discarded) |
Regular Expressions for Tokens
Each token class is defined by a regular expression.
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*
INTEGER: [0-9]+
FLOAT: [0-9]+\.[0-9]+([eE][+-]?[0-9]+)?
STRING: "([^"\\]|\\.)*"
WHITESPACE: [ \t\n\r]+
LINE_COMMENT: //[^\n]*
BLOCK_COMMENT: /\*([^*]|\*[^/])*\*/
Priority Rules
When multiple patterns match:
- Longest match (maximal munch): Choose the pattern that matches the most characters. "iffy" is IDENTIFIER, not IF + "fy".
- Priority order: If two patterns match the same length, the one listed first wins. "if" is IF keyword, not IDENTIFIER.
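These two rules can be sketched directly: each token pattern reports how far it can match from the current position, the longest match wins, and ties go to the pattern listed first. The matcher set here is a minimal assumed example (just the IF keyword and identifiers).

```rust
// Maximal munch with priority. Each matcher returns how many bytes it can
// consume at the start of `input` (0 = no match).

fn match_keyword_if(s: &str) -> usize {
    if s.starts_with("if") { 2 } else { 0 }
}

fn match_identifier(s: &str) -> usize {
    let mut len = 0;
    for (i, c) in s.char_indices() {
        let ok = if i == 0 {
            c.is_ascii_alphabetic() || c == '_'
        } else {
            c.is_ascii_alphanumeric() || c == '_'
        };
        if !ok { break; }
        len = i + c.len_utf8();
    }
    len
}

fn next_token_kind(input: &str) -> Option<(&'static str, usize)> {
    // Listed in priority order: keywords before identifiers.
    let matchers: [(&'static str, fn(&str) -> usize); 2] =
        [("IF", match_keyword_if), ("IDENTIFIER", match_identifier)];
    let mut best: Option<(&'static str, usize)> = None;
    for (name, m) in matchers {
        let len = m(input);
        // Only a strictly longer match replaces the current best, so on a
        // tie the earlier (higher-priority) matcher wins.
        if len > 0 && best.map_or(true, |(_, l)| len > l) {
            best = Some((name, len));
        }
    }
    best
}
```

So `next_token_kind("iffy")` yields IDENTIFIER with length 4 (longest match beats the keyword), while `next_token_kind("if")` yields IF (equal lengths, keyword listed first).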
Finite Automata for Lexing
From Regex to NFA (Thompson's Construction)
Each regex operation maps to an NFA fragment:
Character 'a': ──(a)──→
Concatenation AB: ──A──→──B──→
Alternation A|B: ──ε──→A──→──ε──→
──ε──→B──→──ε──→
Kleene star A*: ──ε──→A──→──ε──→
──ε────────────→ (skip)
←──────ε──────── (repeat)
NFA to DFA (Subset Construction)
Convert NFA to DFA for efficient simulation. Each DFA state = set of NFA states. See automata topic for details.
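The core of subset construction can be seen by simulating an NFA directly: at each step, track the *set* of states the NFA could be in; each distinct set is one state of the equivalent DFA. The hard-coded NFA below is an assumed example recognizing `a*b`, with one Thompson-style epsilon transition.

```rust
use std::collections::BTreeSet;

type State = usize;
// (from, label, to); label None = epsilon transition.
const EDGES: &[(State, Option<char>, State)] = &[
    (0, Some('a'), 0), // loop on 'a'
    (0, None, 1),      // epsilon into the 'b' part
    (1, Some('b'), 2),
];
const ACCEPT: State = 2;

// Add every state reachable via epsilon transitions.
fn eps_closure(mut set: BTreeSet<State>) -> BTreeSet<State> {
    let mut stack: Vec<State> = set.iter().copied().collect();
    while let Some(s) = stack.pop() {
        for &(from, label, to) in EDGES {
            if from == s && label.is_none() && set.insert(to) {
                stack.push(to);
            }
        }
    }
    set
}

// One DFA transition: move on `c`, then close over epsilons.
fn step(set: &BTreeSet<State>, c: char) -> BTreeSet<State> {
    let moved = EDGES
        .iter()
        .filter(|&&(from, label, _)| set.contains(&from) && label == Some(c))
        .map(|&(_, _, to)| to)
        .collect();
    eps_closure(moved)
}

fn nfa_accepts(input: &str) -> bool {
    let mut current = eps_closure(BTreeSet::from([0]));
    for c in input.chars() {
        current = step(&current, c);
    }
    current.contains(&ACCEPT)
}
```

Full subset construction simply does this ahead of time: it enumerates every reachable state-set once and records the transitions in a table, instead of recomputing them per input character.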
DFA Minimization
Minimize the DFA so the lexer uses as few states as possible: merge states that accept exactly the same suffix language. Hopcroft's algorithm finds the minimal equivalent DFA in O(n log n) time.
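The idea behind minimization can be sketched with a simpler quadratic variant (Moore-style partition refinement, same result as Hopcroft's): two states stay in the same block as long as, for every input symbol, they transition into the same block. The 4-state DFA below is an assumed example in which states 1 and 2 are behaviorally identical.

```rust
// Example DFA over a two-symbol alphabet {a, b}; states 1 and 2 are
// equivalent, so the minimized DFA has 3 states.
const N: usize = 4;
const DELTA: [[usize; 2]; N] = [
    [1, 2], // state 0
    [1, 3], // state 1
    [2, 3], // state 2 (equivalent to state 1)
    [3, 3], // state 3, accepting
];
const ACCEPTING: [bool; N] = [false, false, false, true];

fn minimized_state_count() -> usize {
    // block[s] = index of the partition block containing state s.
    // Initial partition: accepting vs. non-accepting.
    let mut block: Vec<usize> = ACCEPTING.iter().map(|&a| a as usize).collect();
    loop {
        // Signature of a state: its own block plus the blocks it moves to.
        let sig = |s: usize| (block[s], block[DELTA[s][0]], block[DELTA[s][1]]);
        let sigs: Vec<_> = (0..N).map(sig).collect();
        // Renumber distinct signatures to get the refined partition.
        let mut uniq = sigs.clone();
        uniq.sort();
        uniq.dedup();
        let new_block: Vec<usize> = sigs
            .iter()
            .map(|s| uniq.binary_search(s).unwrap())
            .collect();
        if new_block == block {
            return uniq.len(); // partition stable: number of minimized states
        }
        block = new_block;
    }
}
```

Hopcroft's algorithm reaches the same fixed point but refines blocks selectively with a worklist, which is what brings the cost down to O(n log n).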
Lexer Implementation
Table-Driven (Generated)
Lexer generators (lex, flex, re2c) compile regex specifications into DFA transition tables.
// Conceptual table-driven lexer: run the DFA as far as it will go,
// remember the last accepting state, and emit the token for that state
// (maximal munch).
fn next_token(input: &str) -> Token {
    let mut state = START;
    let mut pos = 0;
    let mut last_accepting = None;
    let bytes = input.as_bytes(); // &str cannot be indexed directly
    while pos < bytes.len() {
        let c = bytes[pos] as usize;
        state = transition_table[state][c];
        if state == DEAD { break; }
        if is_accepting[state] {
            last_accepting = Some((state, pos));
        }
        pos += 1;
    }
    match last_accepting {
        Some((state, end)) => token_for_state(state, &input[..=end]),
        None => error("unexpected character"),
    }
}
Hand-Written Lexer
Many production compilers use hand-written lexers for better error messages and performance.
CLASS Lexer
FIELDS: source (string), pos (integer)
FUNCTION NEXT_TOKEN()
SKIP_WHITESPACE()
IF pos ≥ length(source) THEN RETURN Token.Eof
c ← CURRENT_CHAR()
MATCH c
'(': ADVANCE(); RETURN Token.LParen
')': ADVANCE(); RETURN Token.RParen
'+': ADVANCE(); RETURN Token.Plus
'=':
ADVANCE()
IF CURRENT_CHAR() = '='
ADVANCE(); RETURN Token.EqualEqual
ELSE RETURN Token.Equal
'0'..'9': RETURN LEX_NUMBER()
'a'..'z', 'A'..'Z', '_': RETURN LEX_IDENTIFIER()
'"': RETURN LEX_STRING()
DEFAULT: RETURN Token.Error("unexpected '" + c + "'")
FUNCTION LEX_NUMBER()
start ← pos
WHILE pos < length(source) AND CURRENT_CHAR() is digit
ADVANCE()
value ← PARSE_INT(source[start..pos])
RETURN Token.Integer(value)
FUNCTION LEX_IDENTIFIER()
start ← pos
WHILE pos < length(source) AND (CURRENT_CHAR() is alphanumeric OR CURRENT_CHAR() = '_')
ADVANCE()
text ← source[start..pos]
MATCH text
"fn": RETURN Token.Fn
"let": RETURN Token.Let
"if": RETURN Token.If
"else": RETURN Token.Else
"return": RETURN Token.Return
DEFAULT: RETURN Token.Identifier(text)
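The pseudocode above translates to a compact, runnable Rust lexer. The token set here is trimmed to what the examples use (no strings, no comments), and the names are illustrative.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    LParen, RParen, Plus, Equal, EqualEqual,
    Integer(i64), Identifier(String),
    Fn, Let, If, Else, Return,
    Error(char), Eof,
}

struct Lexer<'a> {
    src: &'a [u8], // byte-oriented scanning; fine for ASCII source
    pos: usize,
}

impl<'a> Lexer<'a> {
    fn new(src: &'a str) -> Self {
        Lexer { src: src.as_bytes(), pos: 0 }
    }

    fn peek(&self) -> Option<u8> {
        self.src.get(self.pos).copied()
    }

    fn next_token(&mut self) -> Token {
        // Skip whitespace between tokens.
        while matches!(self.peek(), Some(b' ' | b'\t' | b'\n' | b'\r')) {
            self.pos += 1;
        }
        let c = match self.peek() {
            None => return Token::Eof,
            Some(c) => c,
        };
        match c {
            b'(' => { self.pos += 1; Token::LParen }
            b')' => { self.pos += 1; Token::RParen }
            b'+' => { self.pos += 1; Token::Plus }
            b'=' => {
                self.pos += 1;
                if self.peek() == Some(b'=') { self.pos += 1; Token::EqualEqual }
                else { Token::Equal }
            }
            b'0'..=b'9' => self.lex_number(),
            b'a'..=b'z' | b'A'..=b'Z' | b'_' => self.lex_identifier(),
            other => { self.pos += 1; Token::Error(other as char) }
        }
    }

    fn lex_number(&mut self) -> Token {
        let start = self.pos;
        while matches!(self.peek(), Some(b'0'..=b'9')) { self.pos += 1; }
        let text = std::str::from_utf8(&self.src[start..self.pos]).unwrap();
        Token::Integer(text.parse().unwrap())
    }

    fn lex_identifier(&mut self) -> Token {
        let start = self.pos;
        while matches!(self.peek(),
                       Some(b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b'_')) {
            self.pos += 1;
        }
        let text = std::str::from_utf8(&self.src[start..self.pos]).unwrap().to_string();
        // Keyword check happens after the maximal munch, so "iffy" stays an identifier.
        match text.as_str() {
            "fn" => Token::Fn, "let" => Token::Let, "if" => Token::If,
            "else" => Token::Else, "return" => Token::Return,
            _ => Token::Identifier(text),
        }
    }
}
```

Lexing identifiers first and then checking a keyword table is the common trick: it keeps the character-level scan simple while still giving keywords priority over identifiers of the same spelling.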
Advantages of hand-written: Better error messages, faster (no table lookup overhead), easier to handle special cases (string interpolation, indentation-sensitive syntax).
Used by: GCC, Clang, rustc, Go compiler, V8 (JavaScript).
Lexer Generators
lex/flex (C): The classic. Define patterns → generate C code.
re2c (C/C++): Generates highly optimized DFAs. Very fast. Used by PHP.
logos (Rust): Procedural macro-based lexer generator.
// Lexer generator specification
TOKEN DEFINITIONS
Fn ← "fn"
Let ← "let"
Identifier ← [a-zA-Z_][a-zA-Z0-9_]*
Number ← [0-9]+
Plus ← "+"
Equals ← "="
Whitespace ← [ \t\n\r]+ (skip)
lex ← CREATE_LEXER("let x = 42 + y")
WHILE HAS_NEXT(lex)
token ← NEXT(lex)
PRINT token, ": ", SLICE(lex)
Token Buffering
The parser may need to look ahead multiple tokens before deciding which grammar rule to apply.
Single buffer: Read tokens one at a time. Parser calls next_token().
Peekable: Buffer one token ahead. peek() returns the next token without consuming it.
Multi-token lookahead: Buffer k tokens for LL(k) or backtracking parsers.
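A k-token lookahead layer can be sketched as a small buffer over any token source; `peek(k)` fills the buffer on demand and `next()` drains it first. The wrapper is generic over an iterator, so any token type works.

```rust
use std::collections::VecDeque;

struct Lookahead<I: Iterator> {
    inner: I,
    buf: VecDeque<I::Item>,
}

impl<I: Iterator> Lookahead<I> {
    fn new(inner: I) -> Self {
        Lookahead { inner, buf: VecDeque::new() }
    }

    /// Return the k-th upcoming token (k = 0 is the very next one)
    /// without consuming it.
    fn peek(&mut self, k: usize) -> Option<&I::Item> {
        while self.buf.len() <= k {
            let item = self.inner.next()?; // source exhausted: no k-th token
            self.buf.push_back(item);
        }
        self.buf.get(k)
    }

    /// Consume and return the next token, draining the buffer first.
    fn next(&mut self) -> Option<I::Item> {
        if let Some(item) = self.buf.pop_front() { Some(item) }
        else { self.inner.next() }
    }
}
```

With k = 1 this is exactly the "peekable" case; an LL(k) parser simply calls `peek(0)` through `peek(k - 1)` before committing to a rule.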
Error Recovery in Lexing
Unknown character: Report error, skip the character, continue lexing.
Unterminated string: Report error at end of line/file, insert closing quote.
Invalid number: "42abc" — lex "42" as number, "abc" as identifier (or report error).
Goal: Report as many errors as possible in one compilation pass. Don't stop at the first error.
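A minimal recovery loop looks like this: collect every token and every error in one pass rather than stopping at the first bad character. The grammar here is an assumed toy (numbers, identifiers, whitespace only), so `"42abc"` splits into a number followed by an identifier, as described above.

```rust
fn lex_all(input: &str) -> (Vec<String>, Vec<String>) {
    let (mut tokens, mut errors) = (Vec::new(), Vec::new());
    let bytes = input.as_bytes();
    let mut pos = 0;
    while pos < bytes.len() {
        let c = bytes[pos];
        if c.is_ascii_whitespace() {
            pos += 1;
        } else if c.is_ascii_digit() {
            let start = pos;
            while pos < bytes.len() && bytes[pos].is_ascii_digit() { pos += 1; }
            tokens.push(format!("NUM({})", &input[start..pos]));
        } else if c.is_ascii_alphabetic() || c == b'_' {
            let start = pos;
            while pos < bytes.len()
                && (bytes[pos].is_ascii_alphanumeric() || bytes[pos] == b'_') {
                pos += 1;
            }
            tokens.push(format!("ID({})", &input[start..pos]));
        } else {
            // Report the error, skip one character, and keep lexing.
            errors.push(format!("unexpected '{}' at byte {}", c as char, pos));
            pos += 1;
        }
    }
    (tokens, errors)
}
```

On input `"42abc $ x"` this yields three tokens (NUM(42), ID(abc), ID(x)) plus one error for the `$`, instead of aborting at the first unknown character.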
Applications in CS
- Compilers: First phase of every compiler.
- Syntax highlighting: IDEs and editors use lexers to colorize code.
- Code formatters: Parse tokens → reformat (prettier, rustfmt, gofmt).
- Search tools: grep, ripgrep use regex/DFA for fast pattern matching.
- Data parsing: CSV, JSON, XML, log files — all start with tokenization.
- Protocol parsing: Network protocols have defined token structures.