4 min read
On this page

Syntax Analysis (Parsing)

The parser takes a token stream from the lexer and builds a structured representation (parse tree or AST) according to the language's grammar.

Context-Free Grammars (Review)

A CFG G = (V, Σ, P, S) defines the syntax. Productions describe how to derive valid programs.

// Simple expression grammar
Expr    → Expr '+' Term | Expr '-' Term | Term
Term    → Term '*' Factor | Term '/' Factor | Factor
Factor  → '(' Expr ')' | NUMBER | IDENTIFIER

This grammar encodes precedence (* binds tighter than +) and associativity (left-to-right) in its structure.

Parse Trees vs ASTs

Parse tree (concrete syntax tree): Includes all grammar symbols, including intermediate non-terminals.

AST (abstract syntax tree): Simplified tree retaining only essential structure. Drops parentheses, intermediate nodes.

Expression: 2 + 3 * 4

Parse tree:           AST:
    Expr                 +
   / | \                / \
 Expr + Term           2    *
  |      |  \              / \
Term   Term * Factor      3   4
  |      |     |
Factor Factor  4
  |      |
  2      3

Ambiguity Resolution

Operator Precedence

Higher-precedence operators appear deeper in the grammar (closer to terminal rules).

Expr → Expr '+' Term | Term       // + is lower precedence (closer to start)
Term → Term '*' Factor | Factor   // * is higher precedence (deeper)

Associativity

Left-associative (most operators): Left-recursive production.

Expr → Expr '+' Term    // left-recursive → left-associative: (a+b)+c

Right-associative (assignment, exponentiation): Right-recursive production.

Assign → IDENTIFIER '=' Assign | Expr   // right-recursive: a = b = c → a = (b = c)

Dangling Else

if (a) if (b) s1 else s2
// Does 'else' belong to the inner or outer 'if'?

Convention: else matches the nearest unmatched if. Most parsers handle this by preferring shift over reduce (in LR parsers) or by restructuring the grammar.

Top-Down Parsing

Parsing hierarchy — LL vs LR vs PEG parser relationships

Parse from the start symbol down to the terminal tokens.

Recursive Descent

Each non-terminal becomes a function. The function examines the next token(s) to decide which production to apply.

CLASS Parser
    FIELDS: tokens (list), pos (integer)

    FUNCTION PARSE_EXPR()
        left ← PARSE_TERM()
        WHILE CURRENT_TOKEN is Plus OR Minus
            op ← ADVANCE()
            right ← PARSE_TERM()
            left ← BinaryExpr(left, op, right)
        RETURN left

    FUNCTION PARSE_TERM()
        left ← PARSE_FACTOR()
        WHILE CURRENT_TOKEN is Star OR Slash
            op ← ADVANCE()
            right ← PARSE_FACTOR()
            left ← BinaryExpr(left, op, right)
        RETURN left

    FUNCTION PARSE_FACTOR()
        IF CURRENT_TOKEN is LParen
            ADVANCE()              // consume '('
            expr ← PARSE_EXPR()
            EXPECT(RParen)
            RETURN expr
        ELSE IF CURRENT_TOKEN is Number(n)
            ADVANCE()
            RETURN NumberExpr(n)
        ELSE
            ERROR "unexpected token: ", PEEK()

Advantages: Simple, intuitive. Excellent error messages (know exactly where we are in the grammar). Easy to extend. No external tools needed.

Used by: GCC, Clang, rustc, Go, TypeScript, most hand-written parsers.

Limitation: Can't handle left-recursive grammars directly (infinite recursion). Must rewrite: E → E + T becomes E → T (('+') T)*.

LL(1) Parsing

Predictive parsing: Use a single lookahead token to decide which production to apply. No backtracking.

FIRST set: Set of terminals that can begin a derivation from a non-terminal.

FOLLOW set: Set of terminals that can appear immediately after a non-terminal.

LL(1) parse table: M[A, a] = production to use when non-terminal A sees lookahead token a.

LL(1) conflict: If a cell in the table has multiple entries, the grammar is not LL(1).

LL(k): Use k tokens of lookahead. More powerful but larger tables.

Parser Combinators

Build parsers by combining smaller parsers using higher-order functions.

// Parser combinator style
FUNCTION EXPR(input)
    (input, first) ← TERM(input)
    (input, rest) ← MANY(SEQUENCE(ONE_OF('+', '-'), TERM))(input)
    RETURN (input, FOLD_LEFT(rest, first, (acc, (op, t)) -> BinaryExpr(acc, op, t)))

Advantages: Composable. No separate grammar file. Type-safe. Good for small-to-medium grammars.

Libraries: nom (Rust, widely used), chumsky (Rust, better errors), parsec (Haskell), megaparsec (Haskell).

Bottom-Up Parsing

Parse from the terminal tokens up to the start symbol. More powerful than top-down — handles left-recursive grammars naturally.

Shift-Reduce Parsing

Maintain a stack. Two operations:

Shift: Push the next input token onto the stack.

Reduce: Replace the top of the stack (matching a production's right side) with the production's left side.

Input: 2 + 3 * 4

Stack          Input       Action
               2+3*4$
2              +3*4$      Shift 2
Factor         +3*4$      Reduce Factor → 2
Term           +3*4$      Reduce Term → Factor
Expr           +3*4$      Reduce Expr → Term
Expr +         3*4$       Shift +
Expr + 3       *4$        Shift 3
Expr + Factor  *4$        Reduce
Expr + Term    *4$        Reduce
Expr + Term *  4$         Shift *
Expr + Term * 4 $         Shift 4
Expr + Term * Factor $    Reduce
Expr + Term    $          Reduce (Term → Term * Factor)
Expr           $          Reduce (Expr → Expr + Term)
ACCEPT

LR Parsing Family

LR(k): Left-to-right scan, Rightmost derivation, k tokens of lookahead.

| Parser | Power | Table Size | Use | |---|---|---|---| | LR(0) | Weakest | Small | Too weak for most grammars | | SLR(1) | Medium | Small | Simple grammars | | LALR(1) | Strong | Medium | Most programming languages (yacc/bison) | | LR(1) | Strongest | Large | Very complex grammars | | GLR | Universal | Medium | Ambiguous grammars (multiple parses) |

LALR(1) is the sweet spot for generated parsers — handles most programming languages with compact tables.

Parser Generators

yacc / bison (C): The classic. Define grammar + actions → generate C parser.

// bison grammar excerpt
expr: expr '+' term   { $$ = make_binary('+', $1, $3); }
    | expr '-' term   { $$ = make_binary('-', $1, $3); }
    | term            { $$ = $1; }
    ;

ANTLR (Java, multi-target): LL(*) parser generator. Generates parsers in Java, Python, C#, JavaScript, Go, C++. Excellent tooling (ANTLRWorks, syntax diagrams).

tree-sitter (C, multi-language): Incremental parser generator. Used by editors (Neovim, Helix, Zed) for syntax highlighting, code folding, structural navigation. Handles incomplete/erroneous code gracefully.

lalrpop (Rust): LALR(1) parser generator for Rust.

pest (Rust): PEG parser generator.

Error Recovery

Parsers should report multiple errors per compilation, not just the first one.

Panic Mode

On error: skip tokens until a synchronization token is found (;, }, keyword). Resume parsing from there.

PROCEDURE SYNCHRONIZE()
    ADVANCE()         // skip the error token
    WHILE NOT AT_END()
        IF PREVIOUS_TOKEN = Semicolon THEN RETURN
        IF CURRENT_TOKEN IN {Fn, Let, If, Return} THEN RETURN
        ADVANCE()

Phrase-Level Recovery

Insert or delete specific tokens to fix the parse locally.

Expected ')' but found ';'  →  Insert ')' before ';', continue parsing

Error Productions

Add grammar rules for common mistakes.

// Common mistake: missing semicolon
stmt → expr ';'
     | expr error { report("missing semicolon") }   // error production

PEG Grammars

Parsing Expression Grammars: Similar to CFGs but with ordered choice (try alternatives in order, commit to the first match).

Expr    ← Term (('+' / '-') Term)*
Term    ← Factor (('*' / '/') Factor)*
Factor  ← '(' Expr ')' / NUMBER / IDENTIFIER

Key difference from CFG: PEG's / (ordered choice) always picks the first successful match. No ambiguity.

Advantages: No ambiguity. Scannerless (can integrate lexing). Supports unlimited lookahead.

Disadvantages: Left recursion requires special handling. Ordered choice can mask bugs. No formal theory of "the language" as rich as CFG.

Tools: pest (Rust), PEG.js (JavaScript), packrat parsers.

Applications in CS

  • Compilers: Central phase — determines whether code is syntactically valid and builds the structure for subsequent phases.
  • IDEs: Incremental parsing (tree-sitter) enables syntax highlighting, code folding, and structural navigation that updates as you type.
  • Linters/Formatters: Parse code to AST, analyze or reformat, emit modified source.
  • Domain-specific languages: Custom parsers for configuration files, query languages, template languages.
  • Data formats: JSON, XML, YAML, TOML, protobuf — all defined by grammars and parsed by specialized parsers.