
Sequence Labeling

Overview

Sequence labeling assigns a categorical label to each token in a sequence. It is a foundational task in NLP, underpinning part-of-speech tagging, named entity recognition, chunking, and more. The progression from handcrafted rules to statistical models to neural architectures mirrors the broader evolution of the field.


Part-of-Speech Tagging

POS tagging assigns grammatical categories (noun, verb, adjective, etc.) to each word.

Tag Sets

| Tag Set | Size | Example Tags |
|---|---|---|
| Universal POS | 17 | NOUN, VERB, ADJ, ADV, DET |
| Penn Treebank | 45 | NN, NNS, VB, VBD, JJ, RB |

Hidden Markov Models (HMM)

A generative model that jointly models the tag sequence and the word sequence.

Components:

  • Transition probabilities: P(t_i | t_{i-1}) -- how likely one tag follows another
  • Emission probabilities: P(w_i | t_i) -- how likely a word given its tag
  • Initial probabilities: P(t_1)

Decoding: The Viterbi algorithm finds the most probable tag sequence in O(T * K^2) time, where T is sequence length and K is the number of tags.
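The dynamic program is short enough to write out. A minimal log-space Viterbi over a toy two-tag HMM (all probabilities here are made up for illustration; unseen words get a small floor probability):

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under an HMM, in O(T * K^2) time.
    Probabilities are combined in log space to avoid underflow."""
    # best[i][t] = log-prob of the best path ending in tag t at position i
    best = [{t: math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-12))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:  # K choices for the current tag ...
            # ... times K choices for the previous tag -> K^2 work per position
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][t])) for p in tags),
                key=lambda x: x[1])
            best[i][t] = score + math.log(emit_p[t].get(words[i], 1e-12))
            back[i][t] = prev
    # Trace back pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy two-tag model (illustrative numbers only)
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.6, "cats": 0.4},
          "VERB": {"bark": 0.7, "run": 0.3}}

print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))
# -> ['NOUN', 'VERB']
```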

Limitations:

  • Assumes each word depends only on its tag (not context words)
  • Cannot use arbitrary overlapping features
  • Unknown words must be handled with ad hoc suffix-based heuristics

Conditional Random Fields (CRF)

A discriminative model that directly models P(tags | words), allowing rich overlapping features.

Advantages over HMM:

  • No independence assumptions on observations
  • Can include arbitrary features: word shape, prefixes, suffixes, capitalization, gazetteer membership
  • Feature templates capture patterns like "current word is capitalized and previous tag is DET"
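A typical feature function makes the contrast with HMMs concrete: every feature below overlaps with its neighbors, which an HMM's independence assumptions rule out. This sketch follows the dict-per-token style used by linear-chain CRF toolkits such as CRFsuite; the feature names are illustrative, not a fixed API:

```python
def token_features(sent, i):
    """Overlapping feature dict for token i of a tokenized sentence."""
    w = sent[i]
    feats = {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),   # capitalization
        "word.isupper": w.isupper(),
        "word.isdigit": w.isdigit(),
        "prefix3": w[:3],              # prefixes / suffixes
        "suffix3": w[-3:],
        # Word shape: uppercase -> X, lowercase -> x, digit -> d
        "shape": "".join("X" if c.isupper() else "x" if c.islower()
                         else "d" if c.isdigit() else c for c in w),
    }
    # Context features reach across positions freely
    feats["prev.word.lower"] = sent[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.word.lower"] = sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>"
    return feats

feats = token_features(["The", "White", "House"], 1)
print(feats["shape"], feats["prev.word.lower"])
# -> Xxxxx the
```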

Training: Maximize conditional log-likelihood with gradient-based optimization. The partition function is computed efficiently with the forward algorithm.

Decoding: Viterbi algorithm (same complexity as HMM).

Neural POS Taggers

Modern approach: encode the sentence with a bidirectional model and classify each token.

Input tokens -> Embeddings -> BiLSTM / Transformer -> Linear -> Softmax per token
  • Pretrained embeddings (or contextual encoders like BERT) provide rich input representations
  • BiLSTM captures left and right context
  • A CRF layer on top can model tag dependencies (see BiLSTM-CRF below)
  • BERT-based taggers achieve >97.5% accuracy on Penn Treebank

Named Entity Recognition (NER)

NER identifies and classifies named entities in text: persons, organizations, locations, dates, etc.

BIO Tagging Scheme

Converts NER from a span identification problem to a token classification problem.

Barack  B-PER
Obama   I-PER
visited O
the     O
White   B-LOC
House   I-LOC

| Tag | Meaning |
|---|---|
| B-X | Beginning of entity type X |
| I-X | Inside (continuation) of entity type X |
| O | Outside any entity |

Variants:

  • BIOES/BILOU: adds E (end) and S (single-token entity) tags for richer boundary information
  • BIOES generally outperforms BIO for CRF-based models
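Decoding a BIO tag sequence back into entity spans is a small state machine. A sketch (the handling of ill-formed I- tags without a matching B- is one common convention, not the only one):

```python
def bio_to_spans(tag_seq):
    """Decode BIO tags into (type, start, end_exclusive) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tag_seq + ["O"]):  # sentinel closes a trailing entity
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
                etype = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype != tag[2:]:
            # Ill-formed I- without a matching B-: treat it as a new entity
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
    return spans

tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tags))
# -> [('PER', 0, 2), ('LOC', 4, 6)]
```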

Entity Types

Standard datasets define varying granularities:

| Dataset | Types | Example Entities |
|---|---|---|
| CoNLL-2003 | 4 | PER, ORG, LOC, MISC |
| OntoNotes 5.0 | 18 | PERSON, ORG, GPE, DATE, MONEY, ... |
| Few-NERD | 66 | Fine-grained subtypes |

BiLSTM-CRF

The dominant neural NER architecture before transformers.

Characters -> CharCNN/CharLSTM -> [char embedding]
Words -> Pretrained embedding -> [word embedding]
[char; word] -> BiLSTM -> CRF -> BIO tags

Why the CRF layer matters:

  • Without CRF: each tag is predicted independently per token
  • With CRF: the model learns transition scores (e.g., I-PER cannot follow B-LOC)
  • CRF ensures globally consistent tag sequences
  • Improves F1 by 1-2 points on standard benchmarks
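The transition constraints the CRF layer captures can be stated as a simple predicate; at decode time an implementation can mask illegal transitions to negative infinity. A sketch of the BIO legality rule:

```python
def valid_transition(prev_tag, tag):
    """Whether `tag` may follow `prev_tag` under the BIO scheme.
    A trained CRF learns transition scores; hard-illegal moves like
    B-LOC -> I-PER can additionally be masked out at decode time."""
    if tag.startswith("I-"):
        # I-X is only legal after B-X or I-X of the same type
        return prev_tag in ("B-" + tag[2:], "I-" + tag[2:])
    return True  # O and B-X may follow any tag

print(valid_transition("B-PER", "I-PER"))   # -> True
print(valid_transition("B-LOC", "I-PER"))   # -> False
```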

Span-based NER

An alternative to sequence labeling that directly classifies text spans.

Approach:

  1. Enumerate all spans up to a maximum length
  2. Represent each span (e.g., concatenation of start, end, and span-width embeddings)
  3. Classify each span as an entity type or "not an entity"

Advantages:

  • Naturally handles nested entities ("[Bank of [New York]_LOC]_ORG")
  • No BIO encoding needed
  • Can share span representations with other tasks (coreference, relation extraction)

Challenge: O(n^2) candidate spans for a sequence of length n; pruning strategies needed.
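The enumeration step can be sketched directly; the length cap is what keeps the candidate count manageable (roughly n * L instead of O(n^2)):

```python
def enumerate_spans(n, max_len):
    """All candidate (start, end_exclusive) spans of up to max_len tokens
    in a sequence of n tokens."""
    return [(i, j) for i in range(n)
                   for j in range(i + 1, min(i + max_len, n) + 1)]

spans = enumerate_spans(6, 3)  # 6-token sentence, spans of length <= 3
print(len(spans))
# -> 15, versus 21 spans with no length cap
```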

Transformer-based NER

Fine-tuning BERT/RoBERTa for NER:

  1. Tokenize with subword tokenizer
  2. Encode with transformer
  3. Take first subword token's representation for each word
  4. Linear + CRF (or just linear + softmax) for BIO tag prediction

State-of-the-art on CoNLL-2003: ~94 F1 (ensemble approaches).
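Step 3 above (selecting the first subword per word) is a small alignment utility. A sketch assuming the word-ids mapping that fast subword tokenizers typically expose: one entry per subword, `None` for special tokens such as [CLS] and [SEP]:

```python
def first_subword_indices(word_ids):
    """Index of the first subword for each word, given a per-subword
    word-id list (None marks special tokens). Only these positions
    receive BIO labels; continuation subwords are skipped or masked."""
    seen, first = set(), []
    for i, wid in enumerate(word_ids):
        if wid is not None and wid not in seen:
            seen.add(wid)
            first.append(i)
    return first

# "[CLS] Wash ##ington visited [SEP]" -> word_ids [None, 0, 0, 1, None]
print(first_subword_indices([None, 0, 0, 1, None]))
# -> [1, 3]
```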


Chunking (Shallow Parsing)

Identifies non-recursive phrase constituents without building a full parse tree.

[NP The cat] [VP sat] [PP on] [NP the mat]
  • Uses BIO tagging with chunk type labels (B-NP, I-NP, B-VP, etc.)
  • Faster than full parsing; useful as a preprocessing step
  • Often jointly trained with POS tagging in a multi-task setup

Semantic Role Labeling (SRL)

Identifies "who did what to whom, where, when, and how" by labeling predicate-argument structures.

[ARG0 The cat] [V sat] [ARGM-LOC on the mat]

| Role | Meaning | Example |
|---|---|---|
| ARG0 | Agent / proto-agent | "The cat" |
| ARG1 | Patient / theme | "the ball" in "kicked the ball" |
| ARGM-TMP | Temporal modifier | "yesterday" |
| ARGM-LOC | Location modifier | "on the mat" |
| V | Verb predicate | "sat" |

Approaches:

  • BIO sequence labeling: given a predicate, label each token with its role
  • Span-based: classify candidate spans as arguments of a given predicate
  • End-to-end: jointly identify predicates and their arguments
  • Modern systems: BERT + span classification achieves ~87 F1 on OntoNotes

PropBank vs FrameNet

| Resource | Approach | Roles |
|---|---|---|
| PropBank | Verb-specific numbered roles (ARG0-5) | Consistent per verb sense |
| FrameNet | Frame-specific roles | "Buyer", "Seller", "Goods" for commerce frame |


Coreference Resolution

Determines which mentions in a text refer to the same real-world entity.

[Barack Obama]_1 visited France. [He]_1 met [the president]_2.
[Macron]_2 welcomed [him]_1.

Mention detection: Find all noun phrases, pronouns, and named entities.

Approaches:

  • Mention-pair: Binary classifier on all pairs of mentions; greedy clustering
  • Mention-ranking: For each mention, score all antecedent candidates; pick the best
  • End-to-end (Lee et al., 2017): Jointly detect mentions and resolve coreference
    • Span representations from BiLSTM/Transformer
    • Score each span as a mention, then score mention pairs
    • Coarse-to-fine pruning for efficiency
  • Current SOTA: SpanBERT-based models achieve ~80 F1 on OntoNotes
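The clustering step of the mention-pair approach reduces to connected components over the positively classified pairs. A sketch using union-find, where the link list stands in for a hypothetical pairwise classifier's output:

```python
class DSU:
    """Union-find with path compression, for grouping mentions."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster_mentions(n_mentions, coref_links):
    """Greedy clustering of mention-pair decisions: any positively
    classified pair puts both mentions in the same entity cluster."""
    dsu = DSU(n_mentions)
    for a, b in coref_links:
        dsu.union(a, b)
    clusters = {}
    for m in range(n_mentions):
        clusters.setdefault(dsu.find(m), []).append(m)
    return sorted(clusters.values())

# Mentions 0..4; links 0-1, 1-4, 2-3 from a pairwise classifier
print(cluster_mentions(5, [(0, 1), (1, 4), (2, 3)]))
# -> [[0, 1, 4], [2, 3]]
```

Note the transitivity this buys for free: mentions 0 and 4 are never compared directly but end up in one cluster via mention 1, which is exactly the behavior (and the error-propagation risk) of greedy mention-pair clustering.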

Challenges:

  • Long-distance references
  • World knowledge required ("The company... its CEO")
  • Gender bias in pronoun resolution

Evaluation Metrics

| Task | Primary Metric | Notes |
|---|---|---|
| POS Tagging | Accuracy | Per-token accuracy |
| NER | Span-level F1 | Exact match on entity boundaries and type |
| Chunking | F1 | Exact phrase match |
| SRL | F1 | Argument span and label match |
| Coreference | CoNLL F1 | Average of MUC, B-cubed, CEAF metrics |
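Span-level F1 is strict: a predicted entity earns credit only if both boundaries and the type match exactly. A minimal sketch of the scoring rule:

```python
def span_f1(gold, pred):
    """Exact-match span F1 over (type, start, end) entity tuples.
    A prediction with one wrong boundary or the wrong type counts as
    both a false positive and a missed gold entity."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = [("PER", 0, 2), ("LOC", 4, 6)]
pred = [("PER", 0, 2), ("LOC", 4, 5)]  # one boundary off -> no credit
print(round(span_f1(gold, pred), 2))
# -> 0.5
```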


Key Takeaways

  • Sequence labeling reduces structured prediction to per-token classification with BIO encoding
  • CRF layers enforce global consistency of label sequences and improve neural taggers
  • BiLSTM-CRF was the dominant architecture; transformer-based models now achieve state-of-the-art
  • Span-based methods handle nested entities and bridge to relation extraction and coreference
  • SRL and coreference resolution extend sequence labeling to deeper semantic analysis