Sequence Labeling
Overview
Sequence labeling assigns a categorical label to each token in a sequence. It is a foundational task in NLP, underpinning part-of-speech tagging, named entity recognition, chunking, and more. The progression from handcrafted rules to statistical models to neural architectures mirrors the broader evolution of the field.
Part-of-Speech Tagging
POS tagging assigns grammatical categories (noun, verb, adjective, etc.) to each word.
Tag Sets
| Tag Set | Size | Example Tags |
|---|---|---|
| Universal POS | 17 | NOUN, VERB, ADJ, ADV, DET |
| Penn Treebank | 45 | NN, NNS, VB, VBD, JJ, RB |
Hidden Markov Models (HMM)
A generative model that jointly models the tag sequence and the word sequence.
Components:
- Transition probabilities: P(t_i | t_{i-1}) -- how likely one tag follows another
- Emission probabilities: P(w_i | t_i) -- how likely a word given its tag
- Initial probabilities: P(t_1)
Decoding: The Viterbi algorithm finds the most probable tag sequence in O(T * K^2) time, where T is sequence length and K is the number of tags.
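The dynamic program can be sketched in a few lines. This is an illustrative toy HMM with made-up probabilities, not a trained model; log-probabilities are used to avoid numerical underflow, and unseen emissions get a tiny smoothing value.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under an HMM, in O(T * K^2) time."""
    # best[i][t]: log-prob of the best path ending at position i with tag t
    best = [{t: math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-10))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # pick the previous tag that maximizes the path score
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][t])) for p in tags),
                key=lambda x: x[1])
            best[i][t] = score + math.log(emit_p[t].get(words[i], 1e-10))
            back[i][t] = prev
    # backtrace from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy parameters (illustration only)
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {"DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
           "NOUN": {"DET": 0.1,  "NOUN": 0.2,  "VERB": 0.7},
           "VERB": {"DET": 0.5,  "NOUN": 0.3,  "VERB": 0.2}}
emit_p = {"DET":  {"the": 0.9},
          "NOUN": {"cat": 0.5, "mat": 0.5},
          "VERB": {"sat": 0.8}}

print(viterbi(["the", "cat", "sat"], tags, start_p, trans_p, emit_p))
# -> ['DET', 'NOUN', 'VERB']
```

Each position keeps only the best path per tag (K entries), and each entry considers K predecessors, giving the O(T * K^2) bound.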
Limitations:
- Assumes each word depends only on its tag (not context words)
- Cannot use arbitrary overlapping features
- Unknown words require ad-hoc handling (e.g., suffix-based heuristics), since unseen words have no emission probability
Conditional Random Fields (CRF)
A discriminative model that directly models P(tags | words), allowing rich overlapping features.
Advantages over HMM:
- No independence assumptions on observations
- Can include arbitrary features: word shape, prefixes, suffixes, capitalization, gazetteer membership
- Feature templates capture patterns like "current word is capitalized and previous tag is DET"
Training: Maximize conditional log-likelihood with gradient-based optimization. The partition function is computed efficiently with the forward algorithm.
Decoding: Viterbi algorithm (same complexity as HMM).
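The feature templates above can be made concrete. This is an illustrative extractor with a hypothetical feature set; real CRF taggers typically define dozens of such templates.

```python
def token_features(words, i, prev_tag):
    """Overlapping feature templates for a linear-chain CRF (illustrative set)."""
    w = words[i]
    feats = {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),   # capitalization
        "word.isupper": w.isupper(),
        "prefix3": w[:3],              # prefix
        "suffix3": w[-3:],             # suffix
        # word shape: map uppercase -> X, lowercase -> x, digits -> d
        "word.shape": "".join("X" if c.isupper() else "x" if c.islower()
                              else "d" if c.isdigit() else c for c in w),
        "prev_tag": prev_tag,
    }
    # conjoined template: "current word is capitalized and previous tag is DET"
    feats["istitle+prev_tag"] = f"{w.istitle()}|{prev_tag}"
    return feats

f = token_features(["The", "White", "House"], 1, "DET")
print(f["word.shape"])        # -> Xxxxx
print(f["istitle+prev_tag"])  # -> True|DET
```

An HMM cannot use such features because they overlap and violate its independence assumptions; a CRF simply assigns each one a learned weight.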
Neural POS Taggers
Modern approach: encode the sentence with a bidirectional model and classify each token.
Input tokens -> Embeddings -> BiLSTM / Transformer -> Linear -> Softmax per token
- Pretrained embeddings (or contextual encoders like BERT) provide rich input representations
- BiLSTM captures left and right context
- A CRF layer on top can model tag dependencies (see BiLSTM-CRF below)
- BERT-based taggers achieve >97.5% accuracy on Penn Treebank
Named Entity Recognition (NER)
NER identifies and classifies named entities in text: persons, organizations, locations, dates, etc.
BIO Tagging Scheme
Converts NER from a span identification problem to a token classification problem.
Barack B-PER
Obama I-PER
visited O
the O
White B-LOC
House I-LOC
| Tag | Meaning |
|---|---|
| B-X | Beginning of entity type X |
| I-X | Inside (continuation) of entity type X |
| O | Outside any entity |
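Decoding BIO tags back into entity spans is a common utility. A minimal sketch (following the usual repair convention that an I-X with no matching open entity starts a new one):

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (start, end_exclusive, type) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.append((start, i, etype))  # close any open entity
            start, etype = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
        # an I-X matching the open entity type just extends the span
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans

tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tags))  # -> [(0, 2, 'PER'), (4, 6, 'LOC')]
```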
Variants:
- BIOES/BILOU: adds E (end) and S (single-token entity) tags for richer boundary information
- BIOES generally outperforms BIO for CRF-based models
Entity Types
Standard datasets define varying granularities:
| Dataset | Types | Example Entities |
|---|---|---|
| CoNLL-2003 | 4 | PER, ORG, LOC, MISC |
| OntoNotes 5.0 | 18 | PERSON, ORG, GPE, DATE, MONEY, ... |
| Few-NERD | 66 | Fine-grained subtypes |
BiLSTM-CRF
The dominant neural NER architecture before transformers.
Characters -> CharCNN/CharLSTM -> [char embedding]
Words -> Pretrained embedding -> [word embedding]
[char; word] -> BiLSTM -> CRF -> BIO tags
Why the CRF layer matters:
- Without CRF: each tag is predicted independently per token
- With CRF: the model learns transition scores (e.g., I-PER cannot follow B-LOC)
- CRF ensures globally consistent tag sequences
- Improves F1 by 1-2 points on standard benchmarks
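The hard BIO constraints the CRF should respect can be encoded directly as a transition mask: disallowed transitions get a score of negative infinity so Viterbi decoding never selects them. A minimal sketch:

```python
def allowed_transition(prev, curr):
    """BIO constraint: I-X may only follow B-X or I-X of the same type."""
    if curr.startswith("I-"):
        etype = curr[2:]
        return prev in (f"B-{etype}", f"I-{etype}")
    return True  # O and B-X may follow any tag

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
NEG_INF = float("-inf")

# Transition score matrix: learned scores would go in the allowed cells;
# forbidden cells are pinned to -inf
trans = [[0.0 if allowed_transition(p, c) else NEG_INF for c in labels]
         for p in labels]

print(allowed_transition("B-LOC", "I-PER"))  # -> False
print(allowed_transition("B-PER", "I-PER"))  # -> True
```

In practice the CRF learns the remaining (soft) transition preferences from data; only structurally impossible transitions need to be masked out.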
Span-based NER
An alternative to sequence labeling that directly classifies text spans.
Approach:
- Enumerate all spans up to a maximum length
- Represent each span (e.g., concatenation of start, end, and span-width embeddings)
- Classify each span as an entity type or "not an entity"
Advantages:
- Naturally handles nested entities ("[Bank of [New York]_LOC]_ORG")
- No BIO encoding needed
- Can share span representations with other tasks (coreference, relation extraction)
Challenge: O(n^2) candidate spans for a sequence of length n; pruning strategies needed.
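Capping the span length reduces the candidate set from O(n^2) to O(n * L). A minimal enumeration sketch:

```python
def enumerate_spans(n, max_len):
    """All candidate (start, end_exclusive) spans of up to max_len tokens."""
    return [(i, j) for i in range(n)
                   for j in range(i + 1, min(i + max_len, n) + 1)]

spans = enumerate_spans(6, 3)
print(len(spans))  # -> 15  (6 spans of length 1, 5 of length 2, 4 of length 3)
```

Even with a length cap, most candidates are not entities, so systems typically add a lightweight scorer that prunes low-scoring spans before the expensive classification step.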
Transformer-based NER
Fine-tuning BERT/RoBERTa for NER:
- Tokenize with subword tokenizer
- Encode with transformer
- Take first subword token's representation for each word
- Linear + CRF (or just linear + softmax) for BIO tag prediction
State-of-the-art on CoNLL-2003: ~94 F1 (ensemble approaches).
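The "first subword per word" step needs an alignment between subwords and words. A minimal sketch, assuming a subword-to-word index map like the one fast tokenizers expose (the example tokenization is hypothetical):

```python
def first_subword_indices(word_ids):
    """Index of the first subword of each word, given a subword -> word map.
    None marks special tokens such as [CLS]/[SEP]."""
    seen, first = set(), []
    for i, wid in enumerate(word_ids):
        if wid is not None and wid not in seen:
            seen.add(wid)
            first.append(i)
    return first

# Hypothetical tokenization: "visited Washington DC" where
# "Washington" splits into "Wash" + "##ington"
word_ids = [None, 0, 1, 1, 2, None]  # [CLS] visited Wash ##ington DC [SEP]
print(first_subword_indices(word_ids))  # -> [1, 2, 4]
```

Only the representations at these indices are fed to the tag classifier, so the model predicts exactly one BIO tag per original word.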
Chunking (Shallow Parsing)
Identifies non-recursive phrase constituents without building a full parse tree.
[NP The cat] [VP sat] [PP on] [NP the mat]
- Uses BIO tagging with chunk type labels (B-NP, I-NP, B-VP, etc.)
- Faster than full parsing; useful as a preprocessing step
- Often jointly trained with POS tagging in a multi-task setup
Semantic Role Labeling (SRL)
Identifies "who did what to whom, where, when, and how" by labeling predicate-argument structures.
[ARG0 The cat] [V sat] [ARGM-LOC on the mat]
| Role | Meaning | Example |
|---|---|---|
| ARG0 | Agent / proto-agent | "The cat" |
| ARG1 | Patient / theme | "the ball" in "kicked the ball" |
| ARGM-TMP | Temporal modifier | "yesterday" |
| ARGM-LOC | Location modifier | "on the mat" |
| V | Verb predicate | "sat" |
Approaches:
- BIO sequence labeling: given a predicate, label each token with its role
- Span-based: classify candidate spans as arguments of a given predicate
- End-to-end: jointly identify predicates and their arguments
- Modern systems: BERT + span classification achieves ~87 F1 on OntoNotes
PropBank vs FrameNet
| Resource | Approach | Roles |
|---|---|---|
| PropBank | Verb-specific numbered roles (ARG0-5) | Consistent per verb sense |
| FrameNet | Frame-specific roles | "Buyer", "Seller", "Goods" for the commerce frame |
Coreference Resolution
Determines which mentions in a text refer to the same real-world entity.
[Barack Obama]_1 visited France. [He]_1 met [the president]_2.
[Macron]_2 welcomed [him]_1.
Mention detection: Find all noun phrases, pronouns, and named entities.
Approaches:
- Mention-pair: Binary classifier on all pairs of mentions; greedy clustering
- Mention-ranking: For each mention, score all antecedent candidates; pick the best
- End-to-end (Lee et al., 2017): Jointly detect mentions and resolve coreference
- Span representations from BiLSTM/Transformer
- Score each span as a mention, then score mention pairs
- Coarse-to-fine pruning for efficiency
- Current SOTA: SpanBERT-based models achieve ~80 F1 on OntoNotes
Challenges:
- Long-distance references
- World knowledge required ("The company... its CEO")
- Gender bias in pronoun resolution
Evaluation Metrics
| Task | Primary Metric | Notes |
|---|---|---|
| POS Tagging | Accuracy | Per-token accuracy |
| NER | Span-level F1 | Exact match on entity boundaries and type |
| Chunking | F1 | Exact phrase match |
| SRL | F1 | Argument span and label match |
| Coreference | CoNLL F1 | Average of MUC, B-cubed, CEAF metrics |
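Span-level F1 is strict: a prediction only counts if both boundaries and the type match exactly. A minimal sketch over (start, end, type) tuples:

```python
def span_f1(gold, pred):
    """Exact-match span F1 over (start, end, type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact boundary + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "PER"), (4, 6, "LOC")]
pred = [(0, 2, "PER"), (4, 5, "LOC")]  # second span has a boundary error
print(round(span_f1(gold, pred), 2))   # -> 0.5
```

Note how unforgiving this is: the second prediction gets the type right and overlaps the gold span, yet scores zero credit, which is why partial-match variants are sometimes reported alongside exact-match F1.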
Key Takeaways
- Sequence labeling reduces structured prediction to per-token classification with BIO encoding
- CRF layers enforce global consistency of label sequences and improve neural taggers
- BiLSTM-CRF was the dominant architecture; transformer-based models now achieve state-of-the-art
- Span-based methods handle nested entities and bridge to relation extraction and coreference
- SRL and coreference resolution extend sequence labeling to deeper semantic analysis