
Syntactic Parsing

Overview

Syntactic parsing analyzes the grammatical structure of sentences. Two major formalisms exist: constituency parsing (phrase structure trees) and dependency parsing (head-modifier relations). Both are essential for tasks requiring structural understanding of language.


Constituency Parsing

Constituency parsing decomposes a sentence into nested phrases according to a context-free grammar.

          S
        /   \
      NP     VP
     /  \   /  \
   DET   N  V   PP
    |    |  |  /  \
   The  cat sat P   NP
                |  /  \
                on DET  N
                    |   |
                   the mat

Context-Free Grammars (CFG)

A CFG G = (N, T, R, S) where:

  • N: non-terminal symbols (S, NP, VP, PP, ...)
  • T: terminal symbols (words)
  • R: production rules (S -> NP VP, NP -> DET N, ...)
  • S: start symbol

Chomsky Normal Form (CNF): All rules are either A -> B C (binary) or A -> a (lexical). Any CFG can be converted to CNF.
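As a sketch, a rule with more than two right-hand-side symbols can be binarized by introducing fresh intermediate symbols (the `@`-prefixed names below are an illustrative convention, not part of the formalism):

```python
def binarize(lhs, rhs):
    """Binarize one rule A -> B1 B2 ... Bk (k > 2) into a chain of binary
    rules using fresh '@' intermediate symbols (an illustrative convention)."""
    rules = []
    while len(rhs) > 2:
        new = f"@{lhs}_{len(rules)}"
        rules.append((lhs, (rhs[0], new)))
        lhs, rhs = new, rhs[1:]
    rules.append((lhs, tuple(rhs)))
    return rules
```

For example, `binarize("NP", ["DET", "ADJ", "N"])` splits the ternary rule into `NP -> DET @NP_0` and `@NP_0 -> ADJ N`. A full CNF conversion would also remove unit and empty productions.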

CKY Algorithm

The Cocke-Kasami-Younger algorithm is the standard dynamic programming parser for CFGs in CNF.

Procedure:

  1. Fill a triangular chart bottom-up
  2. Cell [i, j] contains all non-terminals that can derive the span from word i to word j
  3. For each span, try all split points k and all rules A -> B C where B is in [i, k] and C is in [k+1, j]

Complexity: O(n^3 * |G|) where n is sentence length and |G| is grammar size.

Recognizer vs Parser: The basic algorithm determines if a sentence is grammatical. To recover the tree, store backpointers at each cell.
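The recognizer version of the procedure above can be sketched in a few lines; the toy CNF grammar is an assumption for illustration:

```python
# Toy grammar in CNF, assumed for illustration
LEXICAL = {("DET", "the"), ("N", "cat"), ("N", "mat"),
           ("V", "sat"), ("P", "on")}
BINARY = {("S", "NP", "VP"), ("NP", "DET", "N"),
          ("VP", "V", "PP"), ("PP", "P", "NP")}

def cky_recognize(words):
    """Return True iff the grammar derives the sentence."""
    n = len(words)
    # chart[i][j] = non-terminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = {A for A, a in LEXICAL if a == w}
    for span in range(2, n + 1):          # fill bottom-up over span lengths
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # try every split point
                for A, B, C in BINARY:
                    if B in chart[i][k] and C in chart[k][j]:
                        chart[i][j].add(A)
    return "S" in chart[0][n]
```

Turning this into a parser means storing, alongside each added non-terminal, the split point and rule that produced it.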

Probabilistic Context-Free Grammars (PCFG)

Augment each rule with a probability. Probabilities for rules with the same left-hand side sum to 1.

S  -> NP VP      [0.8]
S  -> VP         [0.2]
NP -> DET N      [0.5]
NP -> NP PP      [0.3]
NP -> N          [0.2]

Probabilistic CKY: Same algorithm, but track the probability of each entry (product of rule probabilities). Select the parse with highest probability (Viterbi parse).
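A minimal Viterbi CKY sketch, tracking log-probabilities and backpointers (the toy grammar and rule probabilities are illustrative, not estimated from data):

```python
import math

# Toy PCFG in CNF; probabilities are illustrative
LEXICAL = {("DET", "the"): 1.0, ("N", "cat"): 0.5, ("N", "mat"): 0.5,
           ("V", "sat"): 1.0, ("P", "on"): 1.0}
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "DET", "N"): 1.0,
          ("VP", "V", "PP"): 1.0, ("PP", "P", "NP"): 1.0}

def viterbi_cky(words):
    """Return the highest-probability (Viterbi) parse as a nested tuple."""
    n = len(words)
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]  # log-probs
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]  # backpointers
    for i, w in enumerate(words):
        for (A, a), p in LEXICAL.items():
            if a == w:
                best[i][i + 1][A] = math.log(p)
                back[i][i + 1][A] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for (A, B, C), p in BINARY.items():
                    if B in best[i][k] and C in best[k][j]:
                        s = math.log(p) + best[i][k][B] + best[k][j][C]
                        if s > best[i][j].get(A, -math.inf):
                            best[i][j][A] = s
                            back[i][j][A] = (k, B, C)
    def build(i, j, A):
        bp = back[i][j][A]
        if isinstance(bp, str):
            return (A, bp)
        k, B, C = bp
        return (A, build(i, k, B), build(k, j, C))
    return build(0, n, "S") if "S" in best[0][n] else None
```

Log-probabilities are used instead of raw products to avoid numerical underflow on long sentences.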

Limitations of PCFGs:

  • Independence assumption: rule probability is independent of context
  • "NP -> PRP" has the same probability whether NP is a subject or object
  • Poor modeling of lexical preferences and structural ambiguity

Lexicalized PCFGs

Add head word annotations to non-terminals to capture lexical dependencies.

  • VP(sat) -> V(sat) PP(on) is more informative than VP -> V PP
  • Dramatically improves parsing accuracy but explodes the grammar size
  • Sparse data problem requires smoothing

Neural Constituency Parsers

Chart-based neural parsers (Stern et al., Kitaev & Klein):

  1. Encode the sentence with a BiLSTM or Transformer
  2. Score each span [i, j] with label L using a neural classifier
  3. Find the highest-scoring tree using CKY decoding over span scores
  4. Kitaev & Klein (2018): BERT encoder + self-attention + CKY achieves ~96 F1 on Penn Treebank

Transition-based neural parsers:

  • Shift-reduce parsing with neural scoring of actions
  • Faster (linear time) but slightly less accurate than chart parsers

Sequence-to-sequence parsers:

  • Linearize the tree and generate it as a token sequence
  • Competitive accuracy with simpler implementation

Dependency Parsing

Dependency parsing represents syntactic structure as directed edges between words. Each word (dependent) is connected to exactly one head (governor), forming a tree rooted at a special ROOT node.

ROOT -> sat (root)
sat -> cat (nsubj)
cat -> The (det)
sat -> mat (obl)
mat -> on (case)
mat -> the (det)
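A common in-memory representation is a pair of parallel head/label arrays with index 0 reserved for ROOT (a sketch, assuming the UD-style convention where content words head prepositions):

```python
# "The cat sat on the mat", UD-style (content words head prepositions).
# Index 0 is the artificial ROOT; heads[i] is the head index of word i.
words  = ["<ROOT>", "The", "cat", "sat", "on", "the", "mat"]
heads  = [None,      2,     3,     0,     6,    6,     3]
labels = [None,     "det", "nsubj", "root", "case", "det", "obl"]

def dependents(h):
    """Return the words whose head is index h."""
    return [words[i] for i, head in enumerate(heads) if head == h]
```

Each word appears exactly once in `heads`, which directly encodes the single-head property.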

Dependency Relations

Each edge is labeled with a grammatical relation. In the table below, arrows point from dependent to head.

| Relation | Meaning | Example |
|---|---|---|
| nsubj | Nominal subject | "cat" -> "sat" |
| obj | Direct object | "book" -> "read" |
| det | Determiner | "the" -> "cat" |
| amod | Adjectival modifier | "big" -> "cat" |
| nmod | Nominal modifier | "tea" -> "cup" in "cup of tea" |
| advmod | Adverbial modifier | "quickly" -> "ran" |
| conj | Conjunct | "dogs" -> "cats" in "cats and dogs" |

Properties of Dependency Trees

  • Single head: each word has exactly one head (except ROOT)
  • Connected: all words are reachable from ROOT
  • Acyclic: no cycles
  • Projectivity: an arc from head A to dependent B is projective if every word between A and B is a descendant of A; equivalently, all arcs can be drawn above the sentence without crossing. Non-projective trees occur in free word-order languages.
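The projectivity condition can be checked directly from a head array (a sketch; assumes the input already encodes a valid tree):

```python
def is_projective(heads):
    """heads[i] is the head of word i (1-based words; heads[0] is a dummy;
    head 0 means ROOT). An arc (h, d) is projective iff every word strictly
    between h and d is a descendant of h. Assumes heads encodes a valid tree."""
    n = len(heads) - 1
    for d in range(1, n + 1):
        h = heads[d]
        lo, hi = min(h, d), max(h, d)
        for w in range(lo + 1, hi):
            a = w
            while a != 0 and a != h:   # climb toward ROOT
                a = heads[a]
            if a != h:
                return False
    return True
```

The second example below has crossing arcs (word 1 headed by 3, word 2 headed by 4), the classic non-projective configuration.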

Transition-Based Parsing

Processes the sentence left-to-right using a stack, a buffer, and a set of actions.

Arc-standard system:

| Action | Effect |
|---|---|
| SHIFT | Move the front of the buffer onto the stack |
| LEFT-ARC(r) | Head = stack top, dependent = second on stack; remove the dependent |
| RIGHT-ARC(r) | Head = second on stack, dependent = stack top; remove the dependent |

Properties:

  • Linear time O(n) -- exactly 2n transitions for a sentence of length n (each word is shifted once and later attached by an arc)
  • Greedy: each action is predicted independently (no global search)
  • Error propagation: early mistakes cascade
  • Beam search partially mitigates but increases complexity
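The transition system above can be exercised with a static oracle that replays a gold tree (a sketch for projective trees, unlabeled arcs only):

```python
def oracle_parse(gold_heads):
    """Arc-standard oracle for a projective gold tree.
    gold_heads[i] = head of word i (1-based; 0 = ROOT, gold_heads[0] unused).
    Returns the transition sequence and the arcs it recovers."""
    n = len(gold_heads) - 1
    stack, buffer = [0], list(range(1, n + 1))
    seq, arcs = [], []
    # number of dependents each node still has to collect
    pending = {i: sum(1 for d in range(1, n + 1) if gold_heads[d] == i)
               for i in range(n + 1)}
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, second = stack[-1], stack[-2]
            # an arc is safe only once the dependent has collected all its own deps
            if second != 0 and gold_heads[second] == top and pending[second] == 0:
                seq.append("LEFT-ARC"); arcs.append((top, second))
                stack.pop(-2); pending[top] -= 1
                continue
            if gold_heads[top] == second and pending[top] == 0:
                seq.append("RIGHT-ARC"); arcs.append((second, top))
                stack.pop(); pending[second] -= 1
                continue
        seq.append("SHIFT"); stack.append(buffer.pop(0))
    return seq, arcs
```

Training a greedy parser amounts to learning a classifier that imitates this oracle's action at each configuration.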

Neural scoring (Chen & Manning, 2014):

  • Feed-forward network over features from stack and buffer (embeddings of words, POS tags, labels)
  • First fast and accurate neural parser
  • Later improved with BiLSTM encoders and global training

Graph-Based Parsing

Scores all possible edges independently, then finds the highest-scoring tree globally.

Edge scoring:

  • For each pair (head, dependent), compute a score using a biaffine classifier
  • s(i, j) = h_i^T W h_j + u^T h_i + v^T h_j + b, where h_i is the head representation of word i and h_j is the dependent representation of word j (produced by separate MLPs)
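The biaffine score for all head-dependent pairs can be computed in one batched expression (a sketch with random toy parameters; a real model learns W, u, v and produces the representations with MLPs over the encoder output):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                       # toy sizes: 6 words, hidden size 8
H_head = rng.normal(size=(n, d))  # head representations (from the head MLP)
H_dep  = rng.normal(size=(n, d))  # dependent representations (from the dep MLP)
W = rng.normal(size=(d, d))       # biaffine weight matrix
u = rng.normal(size=d)            # head bias vector
v = rng.normal(size=d)            # dependent bias vector
b = 0.0                           # scalar bias

# S[i, j] = h_i^T W h_j + u^T h_i + v^T h_j + b, for all pairs at once
S = H_head @ W @ H_dep.T + (H_head @ u)[:, None] + (H_dep @ v)[None, :] + b
```

The full n x n score matrix is exactly what the decoding step (Eisner or Chu-Liu/Edmonds) consumes.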

Eisner Algorithm:

  • Dynamic programming for projective dependency parsing
  • O(n^3) complexity
  • Builds optimal tree by combining left and right spans
  • Analogous to CKY for dependencies

Chu-Liu/Edmonds Algorithm:

  • Finds the maximum spanning arborescence (non-projective case)
  • O(n^2) or O(n^3) depending on implementation
  • Required for free word-order languages (Czech, German, etc.)
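For intuition, the objective Chu-Liu/Edmonds optimizes can be solved by brute force on tiny inputs: enumerate head assignments, keep only valid trees rooted at ROOT, and take the best-scoring one (exponential time, purely illustrative):

```python
from itertools import product

def best_arborescence(score):
    """Brute-force maximum spanning arborescence.
    score[h][d] is the score of making h the head of d; node 0 is ROOT.
    Exponential in n -- Chu-Liu/Edmonds solves the same problem in O(n^2)."""
    n = len(score)  # number of nodes, including ROOT
    best, best_heads = float("-inf"), None
    for heads in product(range(n), repeat=n - 1):  # heads of words 1..n-1
        # keep only assignments where every word reaches ROOT without a cycle
        valid = True
        for d in range(1, n):
            seen, a = set(), d
            while a != 0:
                if a in seen:
                    valid = False
                    break
                seen.add(a)
                a = heads[a - 1]
            if not valid:
                break
        if not valid:
            continue
        total = sum(score[heads[d - 1]][d] for d in range(1, n))
        if total > best:
            best, best_heads = total, (None,) + heads
    return best, best_heads
```

The test case below is one where greedily picking each word's best head (2 -> 1 and 1 -> 2) forms a cycle, which is exactly the situation Chu-Liu/Edmonds handles by cycle contraction.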

Biaffine Parser (Dozat & Manning, 2017)

The current standard architecture for graph-based dependency parsing.

Input -> BERT/BiLSTM -> separate head/dependent MLPs -> Biaffine scoring -> MST decoding
  1. Encode sentence with BiLSTM or Transformer
  2. Apply separate MLPs to produce head representations and dependent representations
  3. Biaffine attention scores all head-dependent pairs
  4. Decode with Eisner (projective) or Chu-Liu/Edmonds (non-projective)
  5. Separate biaffine classifier for relation labeling

Achieves ~96 UAS / ~94 LAS on Penn Treebank.

Transition-Based vs Graph-Based

| Aspect | Transition-Based | Graph-Based |
|---|---|---|
| Time complexity | O(n) greedy, O(n*b) beam | O(n^2) or O(n^3) |
| Global optimality | No (greedy/beam) | Yes (MST) |
| Non-projectivity | Requires special transitions | Chu-Liu/Edmonds |
| Error propagation | Yes | Less |
| Speed | Faster | Slower |


Universal Dependencies (UD)

A cross-linguistically consistent framework for dependency annotation.

Design principles:

  • Content words are heads (not function words)
  • Prioritize cross-linguistic parallelism
  • Unified POS tag set (UPOS) and relation set across languages

Scale: UD v2.x covers 150+ languages with 250+ treebanks.

Impact:

  • Enables multilingual parser training and evaluation
  • Facilitates cross-lingual transfer learning
  • Standard benchmark for dependency parsing research

Evaluation Metrics

| Metric | Definition |
|---|---|
| UAS (Unlabeled Attachment Score) | % of words with the correct head |
| LAS (Labeled Attachment Score) | % of words with the correct head and relation label |
| ELAS (Enhanced LAS) | LAS on enhanced dependencies (graphs with additional relations) |
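UAS and LAS reduce to simple per-word comparisons (a sketch, ignoring the common convention of excluding punctuation tokens):

```python
def attachment_scores(gold, pred):
    """gold/pred: one (head, label) pair per word, in sentence order.
    Returns (UAS, LAS)."""
    assert len(gold) == len(pred) and gold
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las
```

Since a correct label requires a correct head, LAS can never exceed UAS.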


Constituency vs Dependency

| Aspect | Constituency | Dependency |
|---|---|---|
| Represents | Phrases and nesting | Head-modifier relations |
| Nodes | Words + phrase labels | Words only |
| Better for | English, phrase structure | Free word order, morphologically rich languages |
| Conversion | Can convert to dependency | Harder to convert back |
| Common in | Formal linguistics, generation | Typological studies, multilingual NLP |

Both formalisms are interconvertible to a degree. Head-finding rules convert constituency trees to dependencies. The Stanford converter and UD tools automate this.


Key Takeaways

  • CKY with PCFGs provides exact inference for constituency parsing; neural chart parsers achieve ~96 F1
  • Dependency parsing captures head-modifier relations; transition-based is fast, graph-based is globally optimal
  • The biaffine parser (graph-based, neural) is the dominant dependency parsing architecture
  • Universal Dependencies standardize annotation across 150+ languages
  • Modern parsers use pretrained transformers as encoders, achieving near-human performance on English