6 min read
On this page

Syntactic Parsing

Overview

Syntactic parsing analyzes the grammatical structure of sentences. Two major formalisms exist: constituency parsing (phrase structure trees) and dependency parsing (head-modifier relations). Both are essential for tasks requiring structural understanding of language.


Constituency Parsing

Constituency parsing decomposes a sentence into nested phrases according to a context-free grammar.

          S
        /   \
      NP      VP
     / \     / \
   DET  N   V   NP
   The cat  sat  on the mat

Context-Free Grammars (CFG)

A CFG G = (N, T, R, S) where:

  • N: non-terminal symbols (S, NP, VP, PP, ...)
  • T: terminal symbols (words)
  • R: production rules (S -> NP VP, NP -> DET N, ...)
  • S: start symbol

Chomsky Normal Form (CNF): All rules are either A -> B C (binary) or A -> a (lexical). Any CFG can be converted to CNF.

CKY Algorithm

The Cocke-Kasami-Younger algorithm is the standard dynamic programming parser for CFGs in CNF.

Procedure:

  1. Fill a triangular chart bottom-up
  2. Cell [i, j] contains all non-terminals that can derive the span from word i to word j
  3. For each span, try all split points k and all rules A -> B C where B is in [i, k] and C is in [k+1, j]

Complexity: O(n^3 * |G|) where n is sentence length and |G| is grammar size.

Recognizer vs Parser: The basic algorithm determines if a sentence is grammatical. To recover the tree, store backpointers at each cell.

Probabilistic Context-Free Grammars (PCFG)

Augment each rule with a probability. Probabilities for rules with the same left-hand side sum to 1.

S  -> NP VP      [0.8]
S  -> VP         [0.2]
NP -> DET N      [0.5]
NP -> NP PP      [0.3]
NP -> N          [0.2]

Probabilistic CKY: Same algorithm, but track the probability of each entry (product of rule probabilities). Select the parse with highest probability (Viterbi parse).

Limitations of PCFGs:

  • Independence assumption: rule probability is independent of context
  • "NP -> PRP" has the same probability whether NP is a subject or object
  • Poor modeling of lexical preferences and structural ambiguity

Lexicalized PCFGs

Add head word annotations to non-terminals to capture lexical dependencies.

  • VP(sat) -> V(sat) PP(on) is more informative than VP -> V PP
  • Dramatically improves parsing accuracy but explodes the grammar size
  • Sparse data problem requires smoothing

Neural Constituency Parsers

Chart-based neural parsers (Stern et al., Kitaev & Klein):

  1. Encode the sentence with a BiLSTM or Transformer
  2. Score each span [i, j] with label L using a neural classifier
  3. Find the highest-scoring tree using CKY decoding over span scores
  4. Kitaev & Klein (2018): BERT encoder + self-attention + CKY achieves ~96 F1 on Penn Treebank

Transition-based neural parsers:

  • Shift-reduce parsing with neural scoring of actions
  • Faster (linear time) but slightly less accurate than chart parsers

Sequence-to-sequence parsers:

  • Linearize the tree and generate it as a token sequence
  • Competitive accuracy with simpler implementation

Dependency Parsing

Dependency parsing represents syntactic structure as directed edges between words. Each word (dependent) is connected to exactly one head (governor), forming a tree rooted at a special ROOT node.

ROOT -> sat
sat -> cat (nsubj)
cat -> The (det)
sat -> on (obl)
on -> mat (nmod)
mat -> the (det)

Dependency Relations

Each edge is labeled with a grammatical relation.

Relation Meaning Example
nsubj Nominal subject "cat" -> "sat"
obj Direct object "book" -> "read"
det Determiner "the" -> "cat"
amod Adjectival modifier "big" -> "cat"
nmod Nominal modifier "mat" -> "on"
advmod Adverbial modifier "quickly" -> "ran"
conj Conjunct "cats" -> "dogs" in "cats and dogs"

Properties of Dependency Trees

  • Single head: each word has exactly one head (except ROOT)
  • Connected: all words are reachable from ROOT
  • Acyclic: no cycles
  • Projectivity: if word A governs word B, then A also governs all words between A and B (in projective trees). Non-projective trees occur in free word-order languages.

Transition-Based Parsing

Processes the sentence left-to-right using a stack, a buffer, and a set of actions.

Arc-standard system:

Action Effect
SHIFT Move front of buffer onto stack
LEFT-ARC(r) Head = stack top, dependent = second on stack; remove dependent
RIGHT-ARC(r) Head = second on stack, dependent = stack top; remove dependent

Properties:

  • Linear time O(n) -- exactly 2n transitions for a sentence of length n
  • Greedy: each action is predicted independently (no global search)
  • Error propagation: early mistakes cascade
  • Beam search partially mitigates but increases complexity

Neural scoring (Chen & Manning, 2014):

  • Feed-forward network over features from stack and buffer (embeddings of words, POS tags, labels)
  • First fast and accurate neural parser
  • Later improved with BiLSTM encoders and global training

Graph-Based Parsing

Scores all possible edges independently, then finds the highest-scoring tree globally.

Edge scoring:

  • For each pair (head, dependent), compute a score using a biaffine classifier
  • s(i, j) = h_i^T W h_j + U^T h_i + V^T h_j + b

Eisner Algorithm:

  • Dynamic programming for projective dependency parsing
  • O(n^3) complexity
  • Builds optimal tree by combining left and right spans
  • Analogous to CKY for dependencies

Chu-Liu/Edmonds Algorithm:

  • Finds the maximum spanning arborescence (non-projective case)
  • O(n^2) or O(n^3) depending on implementation
  • Required for free word-order languages (Czech, German, etc.)

Biaffine Parser (Dozat & Manning, 2017)

The current standard architecture for graph-based dependency parsing.

Input -> BERT/BiLSTM -> separate head/dependent MLPs -> Biaffine scoring -> MST decoding
  1. Encode sentence with BiLSTM or Transformer
  2. Apply separate MLPs to produce head representations and dependent representations
  3. Biaffine attention scores all head-dependent pairs
  4. Decode with Eisner (projective) or Chu-Liu/Edmonds (non-projective)
  5. Separate biaffine classifier for relation labeling

Achieves ~96 UAS / ~94 LAS on Penn Treebank.

Transition-Based vs Graph-Based

Aspect Transition-Based Graph-Based
Time complexity O(n) greedy, O(n*b) beam O(n^2) or O(n^3)
Global optimality No (greedy/beam) Yes (MST)
Non-projectivity Requires special transitions Chu-Liu/Edmonds
Error propagation Yes Less
Speed Faster Slower

Universal Dependencies (UD)

A cross-linguistically consistent framework for dependency annotation.

Design principles:

  • Content words are heads (not function words)
  • Prioritize cross-linguistic parallelism
  • Unified POS tag set (UPOS) and relation set across languages

Scale: UD v2.x covers 150+ languages with 250+ treebanks.

Impact:

  • Enables multilingual parser training and evaluation
  • Facilitates cross-lingual transfer learning
  • Standard benchmark for dependency parsing research

Evaluation Metrics

Metric Definition
UAS (Unlabeled Attachment Score) % of words with correct head
LAS (Labeled Attachment Score) % of words with correct head and relation label
ELAS (Enhanced LAS) LAS on enhanced dependencies (additional relations)

Constituency vs Dependency

Aspect Constituency Dependency
Represents Phrases and nesting Head-modifier relations
Nodes Words + phrase labels Words only
Better for English, phrase structure Free word order, morphologically rich
Conversion Can convert to dependency Harder to convert back
Common in Formal linguistics, generation Typological studies, multilingual NLP

Both formalisms are interconvertible to a degree. Head-finding rules convert constituency trees to dependencies. The Stanford converter and UD tools automate this.


Key Takeaways

  • CKY with PCFGs provides exact inference for constituency parsing; neural chart parsers achieve ~96 F1
  • Dependency parsing captures head-modifier relations; transition-based is fast, graph-based is globally optimal
  • The biaffine parser (graph-based, neural) is the dominant dependency parsing architecture
  • Universal Dependencies standardize annotation across 150+ languages
  • Modern parsers use pretrained transformers as encoders, achieving near-human performance on English