Sequence Labeling

Overview

Sequence labeling assigns a categorical label to each token in a sequence. It is a foundational task in NLP, underpinning part-of-speech tagging, named entity recognition, chunking, and more. The progression from handcrafted rules to statistical models to neural architectures mirrors the broader evolution of the field.


Part-of-Speech Tagging

POS tagging assigns a grammatical category (noun, verb, adjective, etc.) to each word.

Tag Sets

Tag Set         Size   Example Tags
Universal POS   17     NOUN, VERB, ADJ, ADV, DET
Penn Treebank   45     NN, NNS, VB, VBD, JJ, RB

Hidden Markov Models (HMM)

A generative model that jointly models the tag sequence and the word sequence.

Components:

  • Transition probabilities: P(t_i | t_{i-1}) -- how likely one tag follows another
  • Emission probabilities: P(w_i | t_i) -- how likely a word given its tag
  • Initial probabilities: P(t_1)

Decoding: The Viterbi algorithm finds the most probable tag sequence in O(T * K^2) time, where T is sequence length and K is the number of tags.
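As a rough illustration of the decoding step, here is a minimal Viterbi sketch in plain Python. The probability tables, the tag set, and the `UNSEEN` floor for unknown words are toy values invented for this example, not estimates from any corpus. The nested loop over current and previous tags is where the K^2 factor comes from.

```python
import math

def to_log(table):
    """Convert a nested dict of probabilities into log-probabilities."""
    return {k: {x: math.log(p) for x, p in row.items()} for k, row in table.items()}

# Toy HMM parameters (assumed values, for illustration only).
TAGS = ["DET", "NOUN", "VERB"]
LOG_INIT = {t: math.log(p) for t, p in {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}.items()}
LOG_TRANS = to_log({
    "DET":  {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.40, "VERB": 0.10},
})
LOG_EMIT = to_log({
    "DET":  {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "cat": 0.3, "barks": 0.2},
    "VERB": {"barks": 0.7, "runs": 0.2, "dog": 0.1},
})
UNSEEN = math.log(1e-6)  # crude floor for word/tag pairs missing from the table

def viterbi(words, tags=TAGS):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][t]: log-probability of the best tag path ending in tag t at position i
    best = [{t: LOG_INIT[t] + LOG_EMIT[t].get(words[0], UNSEEN) for t in tags}]
    back = []  # back[i][t]: previous tag on that best path
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:  # K current tags ...
            prev = max(tags, key=lambda p: best[-1][p] + LOG_TRANS[p][t])  # ... times K previous tags
            scores[t] = best[-1][prev] + LOG_TRANS[prev][t] + LOG_EMIT[t].get(word, UNSEEN)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Start from the best final tag and follow the back-pointers.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

print(viterbi(["the", "dog", "barks"]))
```

With these toy tables, "dog" is tagged NOUN after "the" even though it also has a small VERB emission probability, because the DET -> NOUN transition dominates.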

Limitations:

  • Assumes each word depends only on its tag (not context words)
  • Cannot use arbitrary overlapping features
  • Unknown words need separate handling, typically suffix-based heuristics

Conditional Random Fields (CRF)

A discriminative model that directly models P(tags | words), allowing rich overlapping features.

Advantages over HMM:

  • No independence assumptions on observations
  • Can include arbitrary features: word shape, prefixes, suffixes, capitalization, gazetteer membership
  • Feature templates capture patterns like "current word is capitalized and previous tag is DET"

Training: Maximize conditional log-likelihood with gradient-based optimization. The partition function is computed efficiently with the forward algorithm.
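The sketch below shows one way the forward algorithm can compute the log partition function for a linear-chain CRF, checked against brute-force enumeration of every tag sequence. The emission and transition scores are made-up numbers, and start/end transition scores are omitted for brevity; both are simplifying assumptions.

```python
import math
from itertools import product

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sequence_score(tags, emissions, transitions):
    """Unnormalized score of one tag sequence (tags are integer indices)."""
    score = emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return score

def log_partition(emissions, transitions):
    """log Z via the forward algorithm: O(T * K^2) instead of O(K^T)."""
    K = len(transitions)
    alpha = list(emissions[0])  # alpha[k]: log-sum of scores of all prefixes ending in tag k
    for emit in emissions[1:]:
        alpha = [
            emit[k] + logsumexp([alpha[j] + transitions[j][k] for j in range(K)])
            for k in range(K)
        ]
    return logsumexp(alpha)

def brute_force_log_partition(emissions, transitions):
    """Reference implementation: enumerate all K^T tag sequences."""
    K, T = len(transitions), len(emissions)
    return logsumexp([
        sequence_score(seq, emissions, transitions)
        for seq in product(range(K), repeat=T)
    ])

# Toy scores for 3 tokens and 2 tags (assumed values).
EMISSIONS = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]
TRANSITIONS = [[0.5, -0.2], [-1.0, 0.7]]

log_z = log_partition(EMISSIONS, TRANSITIONS)
gold = [0, 1, 0]
# Conditional log-likelihood of the gold sequence: score(gold) - log Z
log_likelihood = sequence_score(gold, EMISSIONS, TRANSITIONS) - log_z
print(log_z, log_likelihood)
```

Training maximizes `log_likelihood` summed over the training set; in practice an autodiff framework would differentiate through `log_partition` rather than using hand-written gradients.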

Decoding: Viterbi algorithm (same complexity as HMM).

Neural POS Taggers

Modern approach: encode the sentence with a bidirectional model and classify each token.

Input tokens -> Embeddings -> BiLSTM / Transformer -> Linear -> Softmax per token

  • Pretrained embeddings (or contextual encoders like BERT) provide rich input representations
  • BiLSTM captures left and right context
  • A CRF layer on top can model tag dependencies (see BiLSTM-CRF below)
  • BERT-based taggers achieve >97.5% accuracy on Penn Treebank
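To make the last two stages of the pipeline concrete, the sketch below applies a softmax to each token's logits and picks the highest-probability tag independently per token. The logits are invented stand-ins for the output of the linear layer; no encoder is actually run here.

```python
import math

def softmax(logits):
    """Turn a list of scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_tokens(logits_per_token, tagset):
    """Choose the most probable tag for each token, independently of its neighbors."""
    predictions = []
    for logits in logits_per_token:
        probs = softmax(logits)
        best = max(range(len(tagset)), key=probs.__getitem__)
        predictions.append((tagset[best], probs[best]))
    return predictions

TAGSET = ["DET", "NOUN", "VERB"]
# Hypothetical linear-layer outputs for the tokens "the", "dog", "barks".
LOGITS = [
    [4.0, 0.5, 0.1],
    [0.2, 3.1, 1.0],
    [0.1, 1.2, 2.8],
]

PREDICTIONS = classify_tokens(LOGITS, TAGSET)
for token, (tag, prob) in zip(["the", "dog", "barks"], PREDICTIONS):
    print(f"{token}\t{tag}\t{prob:.3f}")
```

Because each token is classified on its own, nothing in this step looks at the neighboring predicted tags; that is the gap a CRF layer is meant to fill.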

Named Entity Recognition (NER)

NER identifies and classifies named entities in text: persons, organizations, locations, dates, etc.

BIO Tagging Scheme

Converts NER from a span identification problem to a token classification problem.

Barack  B-PER
Obama   I-PER
visited O
the     O
White   B-LOC
House   I-LOC

Tag   Meaning
B-X   Beginning of entity type X
I-X   Inside (continuation) of entity type X
O     Outside any entity
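The conversion between entity spans and BIO tags can be sketched as follows. Spans are assumed to be non-overlapping `(start, end, label)` triples with an exclusive end index, and the decoder leniently treats a stray I-X tag as the start of a new entity; both conventions are choices made for this example.

```python
def spans_to_bio(tokens, spans):
    """Encode non-overlapping (start, end, label) spans as one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

def bio_to_spans(tags):
    """Decode a list of BIO tags back into (start, end, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and start is not None and tag[2:] == label
        if continues:
            continue
        if start is not None:               # the open entity ends before position i
            spans.append((start, i, label))
            start, label = None, None
        if tag != "O":                      # B-X, or a stray I-X treated as B-X
            start, label = i, tag[2:]
    return spans

TOKENS = ["Barack", "Obama", "visited", "the", "White", "House"]
SPANS = [(0, 2, "PER"), (4, 6, "LOC")]

TAGS = spans_to_bio(TOKENS, SPANS)
for token, tag in zip(TOKENS, TAGS):
    print(f"{token}\t{tag}")
```

Decoding `TAGS` with `bio_to_spans` returns the original two spans, which is the round trip that lets a token classifier stand in for a span predictor.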

Variants:

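  • BIOES/BILOU: adds E (end) and S (single-token entity) tags for richer boundary information (BILOU uses L and U for the same two roles)
  • BIOES is often reported to give small gains over BIO, particularly for CRF-based models, though the effect is not consistent across datasets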
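A BIO sequence can be rewritten in BIOES with a single pass that looks one tag ahead. This is a minimal sketch that assumes the input is already well-formed BIO.

```python
def bio_to_bioes(tags):
    """Rewrite a well-formed BIO tag sequence in the BIOES scheme."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"  # does this entity extend to the next token?
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out

BIO = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]
BIOES = bio_to_bioes(BIO)
print(BIOES)
```

The single-token entity becomes S-LOC and the last token of each multi-token entity becomes E-X, so entity boundaries are marked on both sides.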

Entity Types

Standard datasets define varying granularities:

Dataset         Types   Example Entities
CoNLL-2003      4       PER, ORG, LOC, MISC
OntoNotes 5.0   18      PERSON, ORG, GPE, DATE, MONEY, ...
Few-NERD        66      Fine-grained subtypes

BiLSTM-CRF

The dominant neural NER architecture before transformers.

Characters -> CharCNN/CharLSTM -> [char embedding]
Words -> Pretrained embedding -> [word embedding]
[char; word] -> BiLSTM -> CRF -> BIO tags

Why the CRF layer matters:

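  • Without CRF: each token's tag is predicted independently of the neighboring tags, so invalid sequences can be produced
  • With CRF: the model learns a score for every tag-to-tag transition and can assign very low scores to invalid ones (e.g., I-PER following B-LOC)
  • Decoding selects the best-scoring whole sequence rather than the best tag per token, which favors globally consistent tag sequences
  • Typically improves F1 by about 1-2 points on standard benchmarks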
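The effect of transition scores can be illustrated with a simplified stand-in for a trained CRF: a hard mask that sets invalid BIO transitions to negative infinity, combined with Viterbi decoding over per-token emission scores. A real CRF layer learns soft transition scores instead of using a fixed mask, and the emission scores below are invented for the example.

```python
NEG_INF = float("-inf")

def allowed(prev_tag, tag):
    """BIO validity rule: I-X may only follow B-X or I-X."""
    if not tag.startswith("I-"):
        return True
    return prev_tag in (f"B-{tag[2:]}", f"I-{tag[2:]}")

def transition_mask(tagset):
    """mask[j][k] is 0 for a valid transition j -> k and -inf otherwise."""
    return [[0.0 if allowed(p, t) else NEG_INF for t in tagset] for p in tagset]

def greedy_decode(emissions, tagset):
    """Independent per-token argmax, with no transition information."""
    return [tagset[max(range(len(tagset)), key=row.__getitem__)] for row in emissions]

def constrained_decode(emissions, tagset):
    """Viterbi over emission scores plus the hard transition mask."""
    K, mask = len(tagset), transition_mask(tagset)
    # The first tag is treated as if it followed "O", so it cannot be I-X.
    score = [e if allowed("O", tagset[k]) else NEG_INF for k, e in enumerate(emissions[0])]
    back = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(K):
            j = max(range(K), key=lambda j: score[j] + mask[j][k])
            new_score.append(score[j] + mask[j][k] + emit[k])
            pointers.append(j)
        score = new_score
        back.append(pointers)
    path = [max(range(K), key=score.__getitem__)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return [tagset[k] for k in reversed(path)]

TAGSET = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
# Hypothetical emission scores for two tokens, in TAGSET order.
EMISSIONS = [
    [0.1, 1.5, 0.0, 2.0, 0.0],
    [0.2, 0.1, 2.0, 0.1, 1.8],
]

print(greedy_decode(EMISSIONS, TAGSET))
print(constrained_decode(EMISSIONS, TAGSET))
```

With these scores the per-token argmax yields B-LOC followed by I-PER, while the constrained decoder returns B-LOC followed by I-LOC, the best-scoring sequence among the valid ones.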

Span-based NER

An alternative to sequence labeling that directly classifies text spans.

Approach:

  1. Enumerate all spans up to a maximum length
  2. Represent each span (e.g., concatenation of start, end, and span-width embeddings)
  3. Classify each span as an entity type or "not an entity"

Advantages:

  • Naturally handles nested entities, e.g. [Bank of [New York]_LOC]_ORG
  • No BIO encoding needed
  • Can share span representations with other tasks (coreference, relation extraction)

Challenge: a sequence of length n has O(n^2) candidate spans (O(n * L) when spans are capped at length L), so pruning strategies are needed.
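The enumeration step and its cost can be sketched in a few lines. Spans are represented here as `(start, end)` pairs with an exclusive end, which is a convention chosen for this example.

```python
def enumerate_spans(n, max_len):
    """All (start, end) pairs with end exclusive and 1 <= end - start <= max_len."""
    return [
        (start, end)
        for start in range(n)
        for end in range(start + 1, min(start + max_len, n) + 1)
    ]

TOKENS = ["Bank", "of", "New", "York"]
ALL_SPANS = enumerate_spans(len(TOKENS), max_len=len(TOKENS))
SHORT_SPANS = enumerate_spans(len(TOKENS), max_len=2)

print(len(ALL_SPANS), len(SHORT_SPANS))

# Both the outer ORG span and the nested LOC span are separate candidates,
# so a span classifier can label them independently.
for start, end in [(0, 4), (2, 4)]:
    print((start, end), " ".join(TOKENS[start:end]))
```

Without a length cap there are n(n+1)/2 candidates; capping the length trades recall on long entities for a candidate set that grows linearly in n.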

Transformer-based NER

Fine-tuning BERT/RoBERTa for NER:

  1. Tokenize with subword tokenizer
  2. Encode with transformer
  3. Take first subword token's representation for each word
  4. Linear + CRF (or just linear + softmax) for BIO tag prediction
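Steps 1 and 3 hinge on keeping track of which subword starts each word. The sketch below uses a toy tokenizer that simply cuts words into four-character pieces; it is a hypothetical stand-in, not WordPiece or BPE, and the `IGNORE` marker for non-initial subwords is likewise an illustrative convention.

```python
def toy_subword_tokenize(word, piece_len=4):
    """Hypothetical tokenizer: fixed-size pieces, '##' marks continuations."""
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def tokenize_with_alignment(words):
    """Return the subword sequence and the index of each word's first subword."""
    subwords, first_index = [], []
    for word in words:  # words are assumed to be non-empty strings
        first_index.append(len(subwords))
        subwords.extend(toy_subword_tokenize(word))
    return subwords, first_index

def align_labels(word_labels, first_index, n_subwords, ignore="IGNORE"):
    """Put each word's label on its first subword; mark the rest as ignored."""
    labels = [ignore] * n_subwords
    for label, idx in zip(word_labels, first_index):
        labels[idx] = label
    return labels

WORDS = ["Barack", "Obama", "visited", "Paris"]
WORD_LABELS = ["B-PER", "I-PER", "O", "B-LOC"]

SUBWORDS, FIRST_INDEX = tokenize_with_alignment(WORDS)
SUBWORD_LABELS = align_labels(WORD_LABELS, FIRST_INDEX, len(SUBWORDS))
for subword, label in zip(SUBWORDS, SUBWORD_LABELS):
    print(f"{subword}\t{label}")
```

During training the ignored positions would be excluded from the loss, and at prediction time only the positions in `FIRST_INDEX` would be read off to get one tag per word.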

State-of-the-art on CoNLL-2003: ~94 F1 (ensemble approaches).


Chunking (Shallow Parsing)

Identifies non-recursive phrase constituents without building a full parse tree.

[NP The cat] [VP sat] [PP on] [NP the mat]

  • Uses BIO tagging with chunk type labels (B-NP, I-NP, B-VP, etc.)
  • Faster than full parsing; useful as a preprocessing step
  • Often jointly trained with POS tagging in a multi-task setup
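As a small illustration of how chunk tags map to the bracketed notation, the sketch below renders a BIO-tagged sentence as bracketed chunks. The tag sequence is written by hand for this example and assumed to be well-formed.

```python
def chunk_tags_to_brackets(tokens, tags):
    """Render BIO chunk tags as a bracketed string such as '[NP The cat] [VP sat]'."""
    pieces, current, label = [], [], None
    for token, tag in zip(tokens + [None], tags + ["O"]):  # sentinel flushes the last chunk
        if tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
            continue
        if current:                      # close the chunk that was open
            pieces.append(f"[{label} {' '.join(current)}]")
            current, label = [], None
        if tag != "O":                   # open a new chunk
            current, label = [token], tag[2:]
        elif token is not None:          # token outside any chunk
            pieces.append(token)
    return " ".join(pieces)

TOKENS = ["The", "cat", "sat", "on", "the", "mat"]
CHUNK_TAGS = ["B-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]

BRACKETED = chunk_tags_to_brackets(TOKENS, CHUNK_TAGS)
print(BRACKETED)
```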

Semantic Role Labeling (SRL)

Identifies "who did what to whom, where, when, and how" by labeling predicate-argument structures.

[ARG0 The cat] [V sat] [ARGM-LOC on the mat]

Role       Meaning               Example
ARG0       Agent / proto-agent   "The cat"
ARG1       Patient / theme       "the ball" in "kicked the ball"
ARGM-TMP   Temporal modifier     "yesterday"
ARGM-LOC   Location modifier     "on the mat"
V          Verb predicate        "sat"

Approaches:

  • BIO sequence labeling: given a predicate, label each token with its role
  • Span-based: classify candidate spans as arguments of a given predicate
  • End-to-end: jointly identify predicates and their arguments
  • Modern systems: BERT + span classification achieves ~87 F1 on OntoNotes

PropBank vs FrameNet

Resource   Approach                                Roles
PropBank   Verb-specific numbered roles (ARG0-5)   Consistent per verb sense
FrameNet   Frame-specific roles                    "Buyer", "Seller", "Goods" for commerce frame

Coreference Resolution

Determines which mentions in a text refer to the same real-world entity.

[Barack Obama]_1 visited France. [He]_1 met [the president]_2.
[Macron]_2 welcomed [him]_1.

Mention detection: Find all noun phrases, pronouns, and named entities.

Approaches:

  • Mention-pair: Binary classifier on all pairs of mentions; greedy clustering
  • Mention-ranking: For each mention, score all antecedent candidates; pick the best
  • End-to-end (Lee et al., 2017): Jointly detect mentions and resolve coreference
    • Span representations from BiLSTM/Transformer
    • Score each span as a mention, then score mention pairs
    • Coarse-to-fine pruning for efficiency
  • SpanBERT-based models reach about 80 CoNLL F1 on OntoNotes
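The mention-ranking idea can be sketched as follows: each mention picks its best-scoring earlier mention, or no antecedent at all, and the chosen links are merged into clusters with union-find. The scores are invented for the example sentence; a dummy "no antecedent" option with a fixed score of 0 is assumed, following the convention used in end-to-end models.

```python
def cluster_mentions(mentions, antecedent_scores):
    """Link each mention to its best antecedent and return the resulting clusters.

    antecedent_scores[i] maps an earlier mention index j (j < i) to a score.
    A mention is linked only if its best score beats the dummy score of 0.
    """
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, scores in enumerate(antecedent_scores):
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            parent[find(i)] = find(best)

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())

MENTIONS = ["Barack Obama", "He", "the president", "Macron", "him"]
# Hypothetical pairwise scores: {antecedent index: score} for each mention.
SCORES = [
    {},
    {0: 2.3},
    {0: -1.0, 1: -0.8},
    {0: -2.0, 1: -1.5, 2: 1.7},
    {0: 1.1, 1: 1.9, 2: -0.5, 3: -0.7},
]

CLUSTERS = cluster_mentions(MENTIONS, SCORES)
print(CLUSTERS)
```

Here "the president" starts a new cluster because none of its candidate scores exceeds 0, and "Macron" later joins it.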

Challenges:

  • Long-distance references
  • World knowledge required ("The company... its CEO")
  • Gender bias in pronoun resolution

Evaluation Metrics

Task          Primary Metric   Notes
POS Tagging   Accuracy         Per-token accuracy
NER           Span-level F1    Exact match on entity boundaries and type
Chunking      F1               Exact phrase match
SRL           F1               Argument span and label match
Coreference   CoNLL F1         Average of MUC, B-cubed, CEAF metrics
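The difference between per-token accuracy and span-level F1 is easy to see on a small example. The sketch below assumes well-formed BIO tags and counts a predicted entity as correct only when its boundaries and type both match; the gold and predicted sequences are made up for illustration.

```python
def extract_spans(tags):
    """Set of (start, end, label) spans from well-formed BIO tags (end exclusive)."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        # Close the open span unless this tag continues it with the same type.
        if start is not None and tag != f"I-{tags[start][2:]}":
            spans.add((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def span_f1(gold_tags, pred_tags):
    """Exact-match precision, recall, and F1 over entity spans."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1

def token_accuracy(gold_tags, pred_tags):
    return sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

GOLD = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
PRED = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]  # second entity is cut short

PRECISION, RECALL, F1 = span_f1(GOLD, PRED)
ACCURACY = token_accuracy(GOLD, PRED)
print(f"token accuracy = {ACCURACY:.3f}, span F1 = {F1:.3f}")
```

One wrong tag out of six leaves token accuracy at 5/6, but the truncated entity counts as both a false positive and a false negative, so span-level F1 drops to 0.5.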

Key Takeaways

  • With BIO encoding, sequence labeling reduces span-level structured prediction (NER, chunking, SRL arguments) to per-token classification
  • CRF layers enforce global consistency of label sequences and improve neural taggers
  • BiLSTM-CRF was the dominant architecture; transformer-based models now achieve state-of-the-art
  • Span-based methods handle nested entities and bridge to relation extraction and coreference
  • SRL and coreference resolution extend sequence labeling to deeper semantic analysis