Sequence Labeling
Overview
Sequence labeling assigns a categorical label to each token in a sequence. It is a foundational task in NLP, underpinning part-of-speech tagging, named entity recognition, chunking, and more. The progression from handcrafted rules to statistical models to neural architectures mirrors the broader evolution of the field.
Part-of-Speech Tagging
POS tagging assigns grammatical categories (noun, verb, adjective, etc.) to each word.
Tag Sets
| Tag Set | Size | Example Tags |
|---|---|---|
| Universal POS | 17 | NOUN, VERB, ADJ, ADV, DET |
| Penn Treebank | 45 | NN, NNS, VB, VBD, JJ, RB |
Hidden Markov Models (HMM)
A generative model that jointly models the tag sequence and the word sequence.
Components:
- Transition probabilities: P(t_i | t_{i-1}) -- how likely one tag follows another
- Emission probabilities: P(w_i | t_i) -- how likely a word given its tag
- Initial probabilities: P(t_1)
Decoding: The Viterbi algorithm finds the most probable tag sequence in O(T * K^2) time, where T is sequence length and K is the number of tags.
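The dynamic program can be sketched in a few lines. This is an illustrative toy HMM with made-up probabilities, not a trained model; log-probabilities are used to avoid numerical underflow, and unseen emissions get a tiny smoothing value.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under an HMM, in O(T * K^2) time."""
    # best[i][t]: log-prob of the best path ending at position i with tag t
    best = [{t: math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-10))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # pick the previous tag that maximizes the path score
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][t])) for p in tags),
                key=lambda x: x[1])
            best[i][t] = score + math.log(emit_p[t].get(words[i], 1e-10))
            back[i][t] = prev
    # backtrace from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy parameters (illustration only)
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {"DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
           "NOUN": {"DET": 0.1,  "NOUN": 0.2,  "VERB": 0.7},
           "VERB": {"DET": 0.5,  "NOUN": 0.3,  "VERB": 0.2}}
emit_p = {"DET":  {"the": 0.9},
          "NOUN": {"cat": 0.5, "mat": 0.5},
          "VERB": {"sat": 0.8}}

print(viterbi(["the", "cat", "sat"], tags, start_p, trans_p, emit_p))
# -> ['DET', 'NOUN', 'VERB']
```

Each position keeps only the best path per tag (K entries), and each entry considers K predecessors, giving the O(T * K^2) bound.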
Limitations:
- Assumes each word depends only on its tag (not context words)
- Cannot use arbitrary overlapping features
- Unknown words require ad-hoc handling (e.g., suffix-based heuristics), since unseen words have no emission probability
Conditional Random Fields (CRF)
A discriminative model that directly models P(tags | words), allowing rich overlapping features.
Advantages over HMM:
- No independence assumptions on observations
- Can include arbitrary features: word shape, prefixes, suffixes, capitalization, gazetteer membership
- Feature templates capture patterns like "current word is capitalized and previous tag is DET"
Training: Maximize conditional log-likelihood with gradient-based optimization. The partition function is computed efficiently with the forward algorithm.
Decoding: Viterbi algorithm (same complexity as HMM).
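The feature templates above can be made concrete. This is an illustrative extractor with a hypothetical feature set; real CRF taggers typically define dozens of such templates.

```python
def token_features(words, i, prev_tag):
    """Overlapping feature templates for a linear-chain CRF (illustrative set)."""
    w = words[i]
    feats = {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),   # capitalization
        "word.isupper": w.isupper(),
        "prefix3": w[:3],              # prefix
        "suffix3": w[-3:],             # suffix
        # word shape: map uppercase -> X, lowercase -> x, digits -> d
        "word.shape": "".join("X" if c.isupper() else "x" if c.islower()
                              else "d" if c.isdigit() else c for c in w),
        "prev_tag": prev_tag,
    }
    # conjoined template: "current word is capitalized and previous tag is DET"
    feats["istitle+prev_tag"] = f"{w.istitle()}|{prev_tag}"
    return feats

f = token_features(["The", "White", "House"], 1, "DET")
print(f["word.shape"])        # -> Xxxxx
print(f["istitle+prev_tag"])  # -> True|DET
```

An HMM cannot use such features because they overlap and violate its independence assumptions; a CRF simply assigns each one a learned weight.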
Neural POS Taggers
Modern approach: encode the sentence with a bidirectional model and classify each token.
Input tokens -> Embeddings -> BiLSTM / Transformer -> Linear -> Softmax per token
- Pretrained embeddings (or contextual encoders like BERT) provide rich input representations
- BiLSTM captures left and right context
- A CRF layer on top can model tag dependencies (see BiLSTM-CRF below)
- BERT-based taggers achieve >97.5% accuracy on Penn Treebank
Named Entity Recognition (NER)
NER identifies and classifies named entities in text: persons, organizations, locations, dates, etc.
BIO Tagging Scheme
Converts NER from a span identification problem to a token classification problem.
Barack B-PER
Obama I-PER
visited O
the O
White B-LOC
House I-LOC
| Tag | Meaning |
|---|---|
| B-X | Beginning of entity type X |
| I-X | Inside (continuation) of entity type X |
| O | Outside any entity |
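Decoding BIO tags back into entity spans is a common utility. A minimal sketch (following the usual repair convention that an I-X with no matching open entity starts a new one):

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (start, end_exclusive, type) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.append((start, i, etype))  # close any open entity
            start, etype = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
        # an I-X matching the open entity type just extends the span
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans

tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tags))  # -> [(0, 2, 'PER'), (4, 6, 'LOC')]
```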
Variants:
- BIOES/BILOU: adds E (end) and S (single-token entity) tags for richer boundary information
- BIOES generally outperforms BIO for CRF-based models
Entity Types
Standard datasets define varying granularities:
| Dataset | Types | Example Entities |
|---|---|---|
| CoNLL-2003 | 4 | PER, ORG, LOC, MISC |
| OntoNotes 5.0 | 18 | PERSON, ORG, GPE, DATE, MONEY, ... |
| Few-NERD | 66 | Fine-grained subtypes |
BiLSTM-CRF
The dominant neural NER architecture before transformers.
Characters -> CharCNN/CharLSTM -> [char embedding]
Words -> Pretrained embedding -> [word embedding]
[char; word] -> BiLSTM -> CRF -> BIO tags
Why the CRF layer matters:
- Without CRF: each tag is predicted independently per token
- With CRF: the model learns transition scores (e.g., I-PER cannot follow B-LOC)
- CRF ensures globally consistent tag sequences
- Improves F1 by 1-2 points on standard benchmarks
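The hard BIO constraints the CRF should respect can be encoded directly as a transition mask: disallowed transitions get a score of negative infinity so Viterbi decoding never selects them. A minimal sketch:

```python
def allowed_transition(prev, curr):
    """BIO constraint: I-X may only follow B-X or I-X of the same type."""
    if curr.startswith("I-"):
        etype = curr[2:]
        return prev in (f"B-{etype}", f"I-{etype}")
    return True  # O and B-X may follow any tag

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
NEG_INF = float("-inf")

# Transition score matrix: learned scores would go in the allowed cells;
# forbidden cells are pinned to -inf
trans = [[0.0 if allowed_transition(p, c) else NEG_INF for c in labels]
         for p in labels]

print(allowed_transition("B-LOC", "I-PER"))  # -> False
print(allowed_transition("B-PER", "I-PER"))  # -> True
```

In practice the CRF learns the remaining (soft) transition preferences from data; only structurally impossible transitions need to be masked out.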
Span-based NER
An alternative to sequence labeling that directly classifies text spans.
Approach:
- Enumerate all spans up to a maximum length
- Represent each span (e.g., concatenation of start, end, and span-width embeddings)
- Classify each span as an entity type or "not an entity"
Advantages:
- Naturally handles nested entities ("[Bank of [New York]_LOC]_ORG")
- No BIO encoding needed
- Can share span representations with other tasks (coreference, relation extraction)
Challenge: O(n^2) candidate spans for a sequence of length n; pruning strategies needed.
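Capping the span length reduces the candidate set from O(n^2) to O(n * L). A minimal enumeration sketch:

```python
def enumerate_spans(n, max_len):
    """All candidate (start, end_exclusive) spans of up to max_len tokens."""
    return [(i, j) for i in range(n)
                   for j in range(i + 1, min(i + max_len, n) + 1)]

spans = enumerate_spans(6, 3)
print(len(spans))  # -> 15  (6 spans of length 1, 5 of length 2, 4 of length 3)
```

Even with a length cap, most candidates are not entities, so systems typically add a lightweight scorer that prunes low-scoring spans before the expensive classification step.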
Transformer-based NER
Fine-tuning BERT/RoBERTa for NER:
- Tokenize with subword tokenizer
- Encode with transformer
- Take first subword token's representation for each word
- Linear + CRF (or just linear + softmax) for BIO tag prediction
State-of-the-art on CoNLL-2003: ~94 F1 (ensemble approaches).
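The "first subword per word" step needs an alignment between subwords and words. A minimal sketch, assuming a subword-to-word index map like the one fast tokenizers expose (the example tokenization is hypothetical):

```python
def first_subword_indices(word_ids):
    """Index of the first subword of each word, given a subword -> word map.
    None marks special tokens such as [CLS]/[SEP]."""
    seen, first = set(), []
    for i, wid in enumerate(word_ids):
        if wid is not None and wid not in seen:
            seen.add(wid)
            first.append(i)
    return first

# Hypothetical tokenization: "visited Washington DC" where
# "Washington" splits into "Wash" + "##ington"
word_ids = [None, 0, 1, 1, 2, None]  # [CLS] visited Wash ##ington DC [SEP]
print(first_subword_indices(word_ids))  # -> [1, 2, 4]
```

Only the representations at these indices are fed to the tag classifier, so the model predicts exactly one BIO tag per original word.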
Chunking (Shallow Parsing)
Identifies non-recursive phrase constituents without building a full parse tree.
[NP The cat] [VP sat] [PP on] [NP the mat]
- Uses BIO tagging with chunk type labels (B-NP, I-NP, B-VP, etc.)
- Faster than full parsing; useful as a preprocessing step
- Often jointly trained with POS tagging in a multi-task setup
Semantic Role Labeling (SRL)
Identifies "who did what to whom, where, when, and how" by labeling predicate-argument structures.
[ARG0 The cat] [V sat] [ARGM-LOC on the mat]
| Role | Meaning | Example |
|---|---|---|
| ARG0 | Agent / proto-agent | "The cat" |
| ARG1 | Patient / theme | "the ball" in "kicked the ball" |
| ARGM-TMP | Temporal modifier | "yesterday" |
| ARGM-LOC | Location modifier | "on the mat" |
| V | Verb predicate | "sat" |
Approaches:
- BIO sequence labeling: given a predicate, label each token with its role
- Span-based: classify candidate spans as arguments of a given predicate
- End-to-end: jointly identify predicates and their arguments
- Modern systems: BERT + span classification achieves ~87 F1 on OntoNotes
PropBank vs FrameNet
| Resource | Approach | Roles |
|---|---|---|
| PropBank | Verb-specific numbered roles (ARG0-5) | Consistent per verb sense |
| FrameNet | Frame-specific roles | "Buyer", "Seller", "Goods" for the commerce frame |
Coreference Resolution
Determines which mentions in a text refer to the same real-world entity.
[Barack Obama]_1 visited France. [He]_1 met [the president]_2.
[Macron]_2 welcomed [him]_1.
Mention detection: Find all noun phrases, pronouns, and named entities.
Approaches:
- Mention-pair: Binary classifier on all pairs of mentions; greedy clustering
- Mention-ranking: For each mention, score all antecedent candidates; pick the best
- End-to-end (Lee et al., 2017): Jointly detect mentions and resolve coreference
- Span representations from BiLSTM/Transformer
- Score each span as a mention, then score mention pairs
- Coarse-to-fine pruning for efficiency
- Current SOTA: SpanBERT-based models achieve ~80 F1 on OntoNotes
Challenges:
- Long-distance references
- World knowledge required ("The company... its CEO")
- Gender bias in pronoun resolution
Evaluation Metrics
| Task | Primary Metric | Notes |
|---|---|---|
| POS Tagging | Accuracy | Per-token accuracy |
| NER | Span-level F1 | Exact match on entity boundaries and type |
| Chunking | F1 | Exact phrase match |
| SRL | F1 | Argument span and label match |
| Coreference | CoNLL F1 | Average of MUC, B-cubed, CEAF metrics |
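Span-level F1 is strict: a prediction only counts if both boundaries and the type match exactly. A minimal sketch over (start, end, type) tuples:

```python
def span_f1(gold, pred):
    """Exact-match span F1 over (start, end, type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact boundary + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "PER"), (4, 6, "LOC")]
pred = [(0, 2, "PER"), (4, 5, "LOC")]  # second span has a boundary error
print(round(span_f1(gold, pred), 2))   # -> 0.5
```

Note how unforgiving this is: the second prediction gets the type right and overlaps the gold span, yet scores zero credit, which is why partial-match variants are sometimes reported alongside exact-match F1.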
Key Takeaways
- Sequence labeling reduces structured prediction to per-token classification with BIO encoding
- CRF layers enforce global consistency of label sequences and improve neural taggers
- BiLSTM-CRF was the dominant architecture; transformer-based models now achieve state-of-the-art
- Span-based methods handle nested entities and bridge to relation extraction and coreference
- SRL and coreference resolution extend sequence labeling to deeper semantic analysis