5 min read
On this page

Text Preprocessing

Overview

NLP Text Processing Pipeline

Text preprocessing transforms raw text into a structured representation suitable for downstream NLP tasks. The choices made at this stage profoundly affect model performance, vocabulary size, and the ability to handle out-of-vocabulary words.


Tokenization

Tokenization splits text into discrete units (tokens) that serve as input to models.

Word Tokenization

The simplest approach splits on whitespace and punctuation.

Challenge Example
Contractions "don't" -> "do" + "n't"?
Hyphenation "state-of-the-art" -> one or four tokens?
Multiword expressions "New York" is one concept
Agglutinative languages Turkish, Finnish words encode entire phrases

Rule-based tokenizers (Penn Treebank, Moses) use regex cascades. SpaCy uses language-specific rules plus statistical models.

Subword Tokenization

Subword methods balance vocabulary size against token granularity, eliminating the out-of-vocabulary problem.

Byte Pair Encoding (BPE)

  1. Start with a character-level vocabulary
  2. Count all adjacent symbol pairs in the corpus
  3. Merge the most frequent pair into a new symbol
  4. Repeat for a fixed number of merge operations (typically 30k-50k)

BPE is used by GPT-2/3/4 and RoBERTa. It is deterministic at inference given a learned merge table.

WordPiece

  • Similar to BPE but selects merges that maximize likelihood of the training corpus
  • Merges the pair that maximizes P(merged) / (P(first) * P(second))
  • Used by BERT and DistilBERT
  • Prefixes continuation tokens with ## (e.g., "playing" -> "play" + "##ing")

Unigram Language Model

  • Starts with a large vocabulary and iteratively removes tokens whose removal least decreases corpus likelihood
  • Each segmentation has a probability; can sample or find the Viterbi-best segmentation
  • Used by XLNet, ALBERT, T5

SentencePiece

  • Language-agnostic framework that treats input as a raw byte stream (no pre-tokenization)
  • Implements both BPE and Unigram algorithms
  • Handles any language without whitespace-based pre-splitting (critical for Chinese, Japanese, Thai)
  • Supports byte-fallback to handle any Unicode character

Comparison

Method Vocabulary Source Merge Criterion Notable Users
BPE Bottom-up merging Frequency GPT family
WordPiece Bottom-up merging Likelihood gain BERT
Unigram Top-down pruning Likelihood loss T5, XLNet
SentencePiece Framework (BPE/Unigram) Depends on algorithm LLaMA, T5

Text Normalization

Case Folding

Lowercasing reduces vocabulary but destroys information (e.g., "US" vs "us"). Modern models typically preserve case and let the model learn case sensitivity.

Unicode Normalization and UTF-8

Unicode assigns codepoints to characters; UTF-8 encodes them in 1-4 bytes.

  • NFC (Canonical Composition): combines decomposed characters ("e" + combining accent -> "e")
  • NFKC (Compatibility Composition): also normalizes compatibility variants (ligatures, width forms)
  • Always normalize to a consistent form before tokenization to avoid duplicate vocabulary entries

Stemming

Reduces words to a stem by stripping suffixes heuristically.

  • Porter Stemmer: rule-based suffix stripping (e.g., "running" -> "run", "studies" -> "studi")
  • Snowball Stemmer: improved Porter with multilingual support
  • Fast but crude; produces non-words ("studies" -> "studi")

Lemmatization

Maps words to their dictionary form (lemma) using morphological analysis.

  • "better" -> "good", "ran" -> "run", "mice" -> "mouse"
  • Requires POS information for disambiguation ("saw" noun vs verb)
  • WordNet lemmatizer, SpaCy lemmatizer, Stanza
  • More accurate than stemming but slower

Stemming vs Lemmatization

Aspect Stemming Lemmatization
Output May be non-word Always valid word
Speed Fast Slower
Context needed No Often (POS tag)
Use case IR, search indexing Text understanding tasks

Stop Word Removal

Stop words are high-frequency, low-information words (the, is, at, of). Removing them reduces dimensionality for bag-of-words models.

Caveats:

  • Negation words ("not", "no") carry critical sentiment information
  • Phrases depend on stop words ("to be or not to be")
  • Modern neural models generally do not remove stop words; the model learns to ignore them
  • Custom stop word lists per domain often outperform generic lists

Sentence Segmentation

Splitting text into sentences is harder than it appears.

Challenge Example
Abbreviations "Dr. Smith went to Washington."
Ellipsis "Wait... what?"
Decimal numbers "The price is 3.50 dollars."
Quoted speech She said, "Hello." He nodded.

Approaches:

  • Rule-based: Punkt tokenizer (unsupervised abbreviation detection)
  • ML-based: binary classifier on each period (features: word length, case, abbreviation lists)
  • Neural: SpaCy's sentence segmenter, Stanza

Preprocessing Pipelines

A typical pipeline applies steps in order. The exact steps depend on the task and model.

Raw Text
  -> Unicode normalization (NFKC)
  -> Sentence segmentation
  -> Tokenization (subword for neural, word-level for classical)
  -> Optional: lowercasing, stop word removal, stemming/lemmatization
  -> Numericalization (token -> integer ID)

Classical NLP Pipeline (e.g., for TF-IDF + SVM)

  1. Lowercase
  2. Remove punctuation and special characters
  3. Tokenize (word-level)
  4. Remove stop words
  5. Stem or lemmatize
  6. Build vocabulary and vectorize

Modern Neural Pipeline (e.g., for BERT)

  1. Unicode normalize
  2. Apply pretrained tokenizer (WordPiece) -- handles casing, subwords
  3. Add special tokens ([CLS], [SEP])
  4. Convert to IDs and create attention masks
  5. No stop word removal, no stemming

Practical Considerations

Whitespace and Noise

  • HTML tags, URLs, email addresses, @mentions, hashtags all need task-specific handling
  • Regular expressions for cleaning must be carefully ordered

Multilingual Text

  • Language detection (fastText lid, CLD3) before language-specific processing
  • Transliteration for code-switched text
  • Shared subword vocabularies (SentencePiece) enable multilingual models

Reproducibility

  • Pin tokenizer versions; vocabulary changes break model compatibility
  • Document normalization choices; NFKC vs NFC matters for reproducibility
  • Store tokenizer artifacts alongside model checkpoints

Key Takeaways

  • Subword tokenization (BPE, WordPiece, Unigram) has largely replaced word-level tokenization in neural NLP
  • SentencePiece enables language-agnostic preprocessing by operating on raw bytes
  • Classical preprocessing (stemming, stop words) is still relevant for bag-of-words and search applications
  • Modern pretrained models handle most normalization internally; minimal preprocessing is preferred
  • Unicode normalization is essential for consistent tokenization across different text sources