Syntactic Parsing
Overview
Syntactic parsing analyzes the grammatical structure of sentences. Two major formalisms exist: constituency parsing (phrase structure trees) and dependency parsing (head-modifier relations). Both are essential for tasks requiring structural understanding of language.
Constituency Parsing
Constituency parsing decomposes a sentence into nested phrases according to a context-free grammar.
For "The cat sat on the mat":

(S (NP (DET The) (N cat))
   (VP (V sat)
       (PP (P on)
           (NP (DET the) (N mat)))))
Context-Free Grammars (CFG)
A CFG G = (N, T, R, S) where:
- N: non-terminal symbols (S, NP, VP, PP, ...)
- T: terminal symbols (words)
- R: production rules (S -> NP VP, NP -> DET N, ...)
- S: start symbol
Chomsky Normal Form (CNF): All rules are either A -> B C (binary) or A -> a (lexical). Any CFG can be converted to CNF.
CKY Algorithm
The Cocke-Kasami-Younger algorithm is the standard dynamic programming parser for CFGs in CNF.
Procedure:
- Fill a triangular chart bottom-up
- Cell [i, j] contains all non-terminals that can derive the span from word i to word j
- For each span, try all split points k and all rules A -> B C where B is in [i, k] and C is in [k+1, j]
Complexity: O(n^3 * |G|) where n is sentence length and |G| is grammar size.
Recognizer vs Parser: The basic algorithm determines if a sentence is grammatical. To recover the tree, store backpointers at each cell.
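The chart-filling procedure above can be sketched in a few lines. This is a minimal recognizer (no backpointers), assuming the CNF grammar is given as two dictionaries; the toy grammar at the bottom is illustrative, not from the source:

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """CKY recognizer for a CFG in Chomsky Normal Form.

    lexical: word -> set of non-terminals A with a rule A -> word
    binary:  (B, C) -> set of non-terminals A with a rule A -> B C
    """
    n = len(words)
    chart = defaultdict(set)  # chart[(i, j)] = non-terminals deriving words[i:j]
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(lexical.get(w, ()))
    for span in range(2, n + 1):              # fill bottom-up by span length
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # try every split point
                for B in chart[(i, k)]:
                    for C in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((B, C), set())
    return start in chart[(0, n)]

# Toy CNF grammar (illustrative)
lexical = {"the": {"DET"}, "cat": {"N"}, "mat": {"N"},
           "sat": {"V"}, "on": {"P"}}
binary = {("NP", "VP"): {"S"}, ("DET", "N"): {"NP"},
          ("V", "PP"): {"VP"}, ("P", "NP"): {"PP"}}
```

Storing, for each entry, the split point k and rule used would turn this recognizer into a full parser.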
Probabilistic Context-Free Grammars (PCFG)
Augment each rule with a probability. Probabilities for rules with the same left-hand side sum to 1.
S -> NP VP [0.8]
S -> VP [0.2]
NP -> DET N [0.5]
NP -> NP PP [0.3]
NP -> N [0.2]
Probabilistic CKY: Same algorithm, but track the probability of each entry (product of rule probabilities). Select the parse with highest probability (Viterbi parse).
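The Viterbi variant replaces the set-valued chart with per-non-terminal best probabilities. A sketch, with an illustrative toy grammar (names and probabilities are assumptions, not from the source):

```python
def viterbi_cky(words, lexical, binary, start="S"):
    """Probabilistic CKY: probability of the best (Viterbi) parse.

    lexical: word -> {A: P(A -> word)}
    binary:  (B, C) -> {A: P(A -> B C)}
    """
    n = len(words)
    best = {(i, i + 1): dict(lexical.get(w, {})) for i, w in enumerate(words)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):           # split point
                for B, pb in best[(i, k)].items():
                    for C, pc in best[(k, j)].items():
                        for A, pr in binary.get((B, C), {}).items():
                            p = pr * pb * pc    # product of rule probabilities
                            if p > cell.get(A, 0.0):
                                cell[A] = p     # keep the Viterbi maximum
            best[(i, j)] = cell
    return best[(0, n)].get(start, 0.0)

# Toy CNF grammar with probabilities (illustrative)
lexical = {"the": {"DET": 1.0}, "cat": {"N": 0.5}, "mat": {"N": 0.5},
           "sat": {"V": 1.0}, "on": {"P": 1.0}}
binary = {("NP", "VP"): {"S": 0.8}, ("DET", "N"): {"NP": 0.5},
          ("V", "PP"): {"VP": 1.0}, ("P", "NP"): {"PP": 1.0}}
```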
Limitations of PCFGs:
- Independence assumption: rule probability is independent of context
- "NP -> PRP" has the same probability whether NP is a subject or object
- Poor modeling of lexical preferences and structural ambiguity
Lexicalized PCFGs
Add head word annotations to non-terminals to capture lexical dependencies.
- VP(sat) -> V(sat) PP(on) is more informative than VP -> V PP
- Dramatically improves parsing accuracy but explodes the grammar size
- Sparse data problem requires smoothing
Neural Constituency Parsers
Chart-based neural parsers (Stern et al., Kitaev & Klein):
- Encode the sentence with a BiLSTM or Transformer
- Score each span [i, j] with label L using a neural classifier
- Find the highest-scoring tree using CKY decoding over span scores
- Kitaev & Klein's self-attentive chart parser (2018), later combined with a BERT encoder (Kitaev, Cao & Klein, 2019), achieves ~96 F1 on the Penn Treebank with CKY decoding
Transition-based neural parsers:
- Shift-reduce parsing with neural scoring of actions
- Faster (linear time) but slightly less accurate than chart parsers
Sequence-to-sequence parsers:
- Linearize the tree and generate it as a token sequence
- Competitive accuracy with simpler implementation
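One common way to linearize a constituency tree is as a bracketed token sequence; a minimal sketch, assuming a simple (label, children) tuple encoding (the encoding and names are illustrative):

```python
def linearize(tree):
    """Flatten a nested (label, children) constituency tree into the
    bracketed token sequence a seq2seq parser would generate."""
    label, children = tree
    if isinstance(children, str):          # pre-terminal: (POS, word)
        return ["(" + label, children, ")"]
    tokens = ["(" + label]
    for child in children:
        tokens += linearize(child)         # recurse into each subtree
    return tokens + [")"]

tree = ("S", [("NP", [("DET", "The"), ("N", "cat")]),
              ("VP", [("V", "sat")])])
```

Real seq2seq parsers often predict only the structural tokens and copy or index the input words rather than generating them.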
Dependency Parsing
Dependency parsing represents syntactic structure as directed edges between words. Each word (dependent) is connected to exactly one head (governor), forming a tree rooted at a special ROOT node.
ROOT -> sat (root)
sat -> cat (nsubj)
cat -> The (det)
sat -> mat (obl)
mat -> on (case)
mat -> the (det)
Dependency Relations
Each edge is labeled with a grammatical relation.
| Relation | Meaning | Example |
|---|---|---|
| nsubj | Nominal subject | "cat" -> "sat" |
| obj | Direct object | "book" -> "read" |
| det | Determiner | "the" -> "cat" |
| amod | Adjectival modifier | "big" -> "cat" |
| nmod | Nominal modifier | "corner" -> "mat" in "the mat in the corner" |
| advmod | Adverbial modifier | "quickly" -> "ran" |
| conj | Conjunct | "dogs" -> "cats" in "cats and dogs" |
Properties of Dependency Trees
- Single head: each word has exactly one head (except ROOT)
- Connected: all words are reachable from ROOT
- Acyclic: no cycles
- Projectivity: for every arc, the head dominates (directly or transitively) every word between the head and the dependent; equivalently, no two arcs cross when drawn above the sentence. Non-projective trees are common in free word-order languages.
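The projectivity property can be checked directly from head indices. A minimal sketch, assuming words are 1-based with head 0 meaning ROOT (the encoding is an assumption):

```python
def is_projective(heads):
    """Check projectivity of a dependency tree.

    heads[d] = head index of word d, for d = 1..n; index 0 is a
    dummy slot for ROOT. An arc (h, d) is projective iff every word
    strictly between h and d is dominated by h.
    """
    n = len(heads) - 1
    for d in range(1, n + 1):
        h = heads[d]
        for k in range(min(h, d) + 1, max(h, d)):
            a = k                       # climb head pointers from k
            while a != 0 and a != h:
                a = heads[a]
            if a != h:                  # k is not dominated by h
                return False
    return True
```

For the example above ("The cat sat on the mat", heads [0, 2, 3, 0, 6, 6, 3]) the tree is projective; a tree with crossing arcs fails the check.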
Transition-Based Parsing
Processes the sentence left-to-right using a stack, a buffer, and a set of actions.
Arc-standard system:
| Action | Effect |
|---|---|
| SHIFT | Move front of buffer onto stack |
| LEFT-ARC(r) | Head = stack top, dependent = second on stack; remove dependent |
| RIGHT-ARC(r) | Head = second on stack, dependent = stack top; remove dependent |
Properties:
- Linear time O(n) -- exactly 2n transitions for a sentence of length n
- Greedy: each action is predicted independently (no global search)
- Error propagation: early mistakes cascade
- Beam search partially mitigates but increases complexity
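The arc-standard system above can be simulated in a few lines. A sketch, assuming a simple action encoding and a hand-written oracle sequence (both illustrative):

```python
def arc_standard_parse(n_words, actions):
    """Run an arc-standard derivation and collect labeled arcs.

    Words are 1-based indices; 0 is ROOT. Actions are "SHIFT",
    ("LEFT", rel), or ("RIGHT", rel).
    """
    stack, buffer = [0], list(range(1, n_words + 1))
    arcs = []  # (head, dependent, relation)
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        else:
            kind, rel = action
            s1, s2 = stack[-1], stack[-2]
            if kind == "LEFT":          # head = top, dependent = second
                arcs.append((s1, s2, rel))
                stack.pop(-2)
            else:                       # RIGHT: head = second, dependent = top
                arcs.append((s2, s1, rel))
                stack.pop()
    return arcs
```

Parsing "The cat sat" (words 1-3) with the oracle SHIFT, SHIFT, LEFT(det), SHIFT, LEFT(nsubj), RIGHT(root) takes exactly 2n = 6 transitions and recovers all three arcs.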
Neural scoring (Chen & Manning, 2014):
- Feed-forward network over features from stack and buffer (embeddings of words, POS tags, labels)
- First fast and accurate neural parser
- Later improved with BiLSTM encoders and global training
Graph-Based Parsing
Scores all possible edges independently, then finds the highest-scoring tree globally.
Edge scoring:
- For each pair (head, dependent), compute a score using a biaffine classifier
- s(i, j) = h_i^T W h_j + U^T h_i + V^T h_j + b
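The scoring formula vectorizes into a single matrix expression. A NumPy sketch, assuming h_i comes from the dependent MLP and h_j from the head MLP (shapes and names are illustrative):

```python
import numpy as np

def biaffine_scores(H_dep, H_head, W, U, V, b):
    """Score every (dependent i, head j) pair with
    s(i, j) = h_i^T W h_j + U^T h_i + V^T h_j + b.

    H_dep, H_head: (n, d) row-wise representations; W: (d, d);
    U, V: (d,); b: scalar. Returns an (n, n) score matrix.
    """
    bilinear = H_dep @ W @ H_head.T                    # h_i^T W h_j for all i, j
    return bilinear + (H_dep @ U)[:, None] + (H_head @ V)[None, :] + b
```

Row i of the result scores every candidate head for word i; decoding then finds the best tree over this matrix.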
Eisner Algorithm:
- Dynamic programming for projective dependency parsing
- O(n^3) complexity
- Builds optimal tree by combining left and right spans
- Analogous to CKY for dependencies
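The span-combination scheme can be sketched compactly. This sketch returns only the best tree's score (backpointers for tree recovery are omitted), assumes ROOT is index 0 of the score matrix, and does not restrict ROOT to a single child:

```python
import numpy as np

def eisner_best_score(scores):
    """Eisner's O(n^3) algorithm over an (n, n) arc-score matrix,
    where scores[h, d] is the score of attaching d to head h."""
    n = scores.shape[0]
    NEG = float("-inf")
    # [i, j, dir]: dir 0 = span headed at j (left), dir 1 = headed at i (right)
    incomplete = np.full((n, n, 2), NEG)
    complete = np.full((n, n, 2), NEG)
    for i in range(n):
        complete[i, i, 0] = complete[i, i, 1] = 0.0
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # incomplete: join two complete spans and add one new arc
            best = max(complete[i, k, 1] + complete[k + 1, j, 0]
                       for k in range(i, j))
            incomplete[i, j, 0] = best + scores[j, i]   # arc j -> i
            incomplete[i, j, 1] = best + scores[i, j]   # arc i -> j
            # complete: extend an incomplete span with a complete one
            complete[i, j, 0] = max(complete[i, k, 0] + incomplete[k, j, 0]
                                    for k in range(i, j))
            complete[i, j, 1] = max(incomplete[i, k, 1] + complete[k, j, 1]
                                    for k in range(i + 1, j + 1))
    return complete[0, n - 1, 1]
```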
Chu-Liu/Edmonds Algorithm:
- Finds the maximum spanning arborescence (non-projective case)
- O(n^2) or O(n^3) depending on implementation
- Required for free word-order languages (Czech, German, etc.)
Biaffine Parser (Dozat & Manning, 2017)
The current standard architecture for graph-based dependency parsing.
Input -> BERT/BiLSTM -> separate head/dependent MLPs -> Biaffine scoring -> MST decoding
- Encode sentence with BiLSTM or Transformer
- Apply separate MLPs to produce head representations and dependent representations
- Biaffine attention scores all head-dependent pairs
- Decode with Eisner (projective) or Chu-Liu/Edmonds (non-projective)
- Separate biaffine classifier for relation labeling
Achieves ~96 UAS / ~94 LAS on Penn Treebank.
Transition-Based vs Graph-Based
| Aspect | Transition-Based | Graph-Based |
|---|---|---|
| Time complexity | O(n) greedy, O(n*b) beam | O(n^2) or O(n^3) |
| Global optimality | No (greedy/beam) | Yes (MST) |
| Non-projectivity | Requires special transitions | Chu-Liu/Edmonds |
| Error propagation | Yes | Less |
| Speed | Faster | Slower |
Universal Dependencies (UD)
A cross-linguistically consistent framework for dependency annotation.
Design principles:
- Content words are heads (not function words)
- Prioritize cross-linguistic parallelism
- Unified POS tag set (UPOS) and relation set across languages
Scale: UD v2.x covers 150+ languages with 250+ treebanks.
Impact:
- Enables multilingual parser training and evaluation
- Facilitates cross-lingual transfer learning
- Standard benchmark for dependency parsing research
Evaluation Metrics
| Metric | Definition |
|---|---|
| UAS (Unlabeled Attachment Score) | % of words with correct head |
| LAS (Labeled Attachment Score) | % of words with correct head and relation label |
| ELAS (Enhanced LAS) | LAS on enhanced dependencies (additional relations) |
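Both attachment scores reduce to per-word comparisons. A minimal sketch, assuming gold and predicted analyses are aligned lists of (head, relation) pairs (the encoding is an assumption):

```python
def attachment_scores(gold, pred):
    """UAS and LAS over aligned (head, relation) pairs, one per word."""
    total = len(gold)
    uas = sum(gh == ph for (gh, _), (ph, _) in zip(gold, pred)) / total
    las = sum(g == p for g, p in zip(gold, pred)) / total
    return uas, las
```

A prediction with the right head but the wrong label counts toward UAS but not LAS, which is why LAS is always less than or equal to UAS.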
Constituency vs Dependency
| Aspect | Constituency | Dependency |
|---|---|---|
| Represents | Phrases and nesting | Head-modifier relations |
| Nodes | Words + phrase labels | Words only |
| Better for | English, phrase structure | Free word order, morphologically rich languages |
| Conversion | Can convert to dependency | Harder to convert back |
| Common in | Formal linguistics, generation | Typological studies, multilingual NLP |
Both formalisms are interconvertible to a degree. Head-finding rules convert constituency trees to dependencies. The Stanford converter and UD tools automate this.
Key Takeaways
- CKY with PCFGs provides exact inference for constituency parsing; neural chart parsers achieve ~96 F1
- Dependency parsing captures head-modifier relations; transition-based is fast, graph-based is globally optimal
- The biaffine parser (graph-based, neural) is the dominant dependency parsing architecture
- Universal Dependencies standardize annotation across 150+ languages
- Modern parsers use pretrained transformers as encoders, achieving near-human performance on English