Syntactic Parsing

Overview

Syntactic parsing analyzes the grammatical structure of sentences. Two major formalisms exist: constituency parsing (phrase structure trees) and dependency parsing (head-modifier relations). Both are essential for tasks requiring structural understanding of language.

Constituency Parsing

Constituency parsing decomposes a sentence into nested phrases according to a context-free grammar.

          S
        /   \
      NP      VP
     / \     / \
   DET  N   V   NP
   The cat  sat  on the mat

Context-Free Grammars (CFG)

A CFG G = (N, T, R, S) where:

N: non-terminal symbols (S, NP, VP, PP, ...)
T: terminal symbols (words)
R: production rules (S -> NP VP, NP -> DET N, ...)
S: start symbol

Chomsky Normal Form (CNF): All rules are either A -> B C (binary) or A -> a (lexical). Any CFG can be converted to CNF.

CKY Algorithm

The Cocke-Kasami-Younger algorithm is the standard dynamic programming parser for CFGs in CNF.

Procedure:

Fill a triangular chart bottom-up
Cell [i, j] contains all non-terminals that can derive the span from word i to word j
For each span, try all split points k and all rules A -> B C where B is in [i, k] and C is in [k+1, j]

Complexity: O(n^3 * |G|) where n is sentence length and |G| is grammar size.

Recognizer vs Parser: The basic algorithm determines if a sentence is grammatical. To recover the tree, store backpointers at each cell.

Probabilistic Context-Free Grammars (PCFG)

Augment each rule with a probability. Probabilities for rules with the same left-hand side sum to 1.

S  -> NP VP      [0.8]
S  -> VP         [0.2]
NP -> DET N      [0.5]
NP -> NP PP      [0.3]
NP -> N          [0.2]

Probabilistic CKY: Same algorithm, but track the probability of each entry (product of rule probabilities). Select the parse with highest probability (Viterbi parse).

Limitations of PCFGs:

Independence assumption: rule probability is independent of context
"NP -> PRP" has the same probability whether NP is a subject or object
Poor modeling of lexical preferences and structural ambiguity

Lexicalized PCFGs

Add head word annotations to non-terminals to capture lexical dependencies.

VP(sat) -> V(sat) PP(on) is more informative than VP -> V PP
Dramatically improves parsing accuracy but explodes the grammar size
Sparse data problem requires smoothing

Neural Constituency Parsers

Chart-based neural parsers (Stern et al., Kitaev & Klein):

Encode the sentence with a BiLSTM or Transformer
Score each span [i, j] with label L using a neural classifier
Find the highest-scoring tree using CKY decoding over span scores
Kitaev & Klein (2018): BERT encoder + self-attention + CKY achieves ~96 F1 on Penn Treebank

Transition-based neural parsers:

Shift-reduce parsing with neural scoring of actions
Faster (linear time) but slightly less accurate than chart parsers

Sequence-to-sequence parsers:

Linearize the tree and generate it as a token sequence
Competitive accuracy with simpler implementation

Dependency Parsing

Dependency parsing represents syntactic structure as directed edges between words. Each word (dependent) is connected to exactly one head (governor), forming a tree rooted at a special ROOT node.

ROOT -> sat
sat -> cat (nsubj)
cat -> The (det)
sat -> on (obl)
on -> mat (nmod)
mat -> the (det)

Dependency Relations

Each edge is labeled with a grammatical relation.

Relation	Meaning	Example
nsubj	Nominal subject	"cat" -> "sat"
obj	Direct object	"book" -> "read"
det	Determiner	"the" -> "cat"
amod	Adjectival modifier	"big" -> "cat"
nmod	Nominal modifier	"mat" -> "on"
advmod	Adverbial modifier	"quickly" -> "ran"
conj	Conjunct	"cats" -> "dogs" in "cats and dogs"

Properties of Dependency Trees

Single head: each word has exactly one head (except ROOT)
Connected: all words are reachable from ROOT
Acyclic: no cycles
Projectivity: if word A governs word B, then A also governs all words between A and B (in projective trees). Non-projective trees occur in free word-order languages.

Transition-Based Parsing

Processes the sentence left-to-right using a stack, a buffer, and a set of actions.

Arc-standard system:

Action	Effect
SHIFT	Move front of buffer onto stack
LEFT-ARC(r)	Head = stack top, dependent = second on stack; remove dependent
RIGHT-ARC(r)	Head = second on stack, dependent = stack top; remove dependent

Properties:

Linear time O(n) -- exactly 2n transitions for a sentence of length n
Greedy: each action is predicted independently (no global search)
Error propagation: early mistakes cascade
Beam search partially mitigates but increases complexity

Neural scoring (Chen & Manning, 2014):

Feed-forward network over features from stack and buffer (embeddings of words, POS tags, labels)
First fast and accurate neural parser
Later improved with BiLSTM encoders and global training

Graph-Based Parsing

Scores all possible edges independently, then finds the highest-scoring tree globally.

Edge scoring:

For each pair (head, dependent), compute a score using a biaffine classifier
s(i, j) = h_i^T W h_j + U^T h_i + V^T h_j + b

Eisner Algorithm:

Dynamic programming for projective dependency parsing
O(n^3) complexity
Builds optimal tree by combining left and right spans
Analogous to CKY for dependencies

Chu-Liu/Edmonds Algorithm:

Finds the maximum spanning arborescence (non-projective case)
O(n^2) or O(n^3) depending on implementation
Required for free word-order languages (Czech, German, etc.)

Biaffine Parser (Dozat & Manning, 2017)

The current standard architecture for graph-based dependency parsing.

Input -> BERT/BiLSTM -> separate head/dependent MLPs -> Biaffine scoring -> MST decoding

Encode sentence with BiLSTM or Transformer
Apply separate MLPs to produce head representations and dependent representations
Biaffine attention scores all head-dependent pairs
Decode with Eisner (projective) or Chu-Liu/Edmonds (non-projective)
Separate biaffine classifier for relation labeling

Achieves ~96 UAS / ~94 LAS on Penn Treebank.

Transition-Based vs Graph-Based

Aspect	Transition-Based	Graph-Based
Time complexity	O(n) greedy, O(n*b) beam	O(n^2) or O(n^3)
Global optimality	No (greedy/beam)	Yes (MST)
Non-projectivity	Requires special transitions	Chu-Liu/Edmonds
Error propagation	Yes	Less
Speed	Faster	Slower

Universal Dependencies (UD)

A cross-linguistically consistent framework for dependency annotation.

Design principles:

Content words are heads (not function words)
Prioritize cross-linguistic parallelism
Unified POS tag set (UPOS) and relation set across languages

Scale: UD v2.x covers 150+ languages with 250+ treebanks.

Impact:

Enables multilingual parser training and evaluation
Facilitates cross-lingual transfer learning
Standard benchmark for dependency parsing research

Evaluation Metrics

Metric	Definition
UAS (Unlabeled Attachment Score)	% of words with correct head
LAS (Labeled Attachment Score)	% of words with correct head and relation label
ELAS (Enhanced LAS)	LAS on enhanced dependencies (additional relations)

Constituency vs Dependency

Aspect	Constituency	Dependency
Represents	Phrases and nesting	Head-modifier relations
Nodes	Words + phrase labels	Words only
Better for	English, phrase structure	Free word order, morphologically rich
Conversion	Can convert to dependency	Harder to convert back
Common in	Formal linguistics, generation	Typological studies, multilingual NLP

Both formalisms are interconvertible to a degree. Head-finding rules convert constituency trees to dependencies. The Stanford converter and UD tools automate this.

Key Takeaways

CKY with PCFGs provides exact inference for constituency parsing; neural chart parsers achieve ~96 F1
Dependency parsing captures head-modifier relations; transition-based is fast, graph-based is globally optimal
The biaffine parser (graph-based, neural) is the dominant dependency parsing architecture
Universal Dependencies standardize annotation across 150+ languages
Modern parsers use pretrained transformers as encoders, achieving near-human performance on English