Text Preprocessing

Overview

[Figure: NLP text processing pipeline]

Text preprocessing transforms raw text into a structured representation suitable for downstream NLP tasks. The choices made at this stage profoundly affect model performance, vocabulary size, and the ability to handle out-of-vocabulary words.

Tokenization

Tokenization splits text into discrete units (tokens) that serve as input to models.

Word Tokenization

The simplest approach splits on whitespace and punctuation.

Challenge	Example
Contractions	"don't" -> "do" + "n't"?
Hyphenation	"state-of-the-art" -> one or four tokens?
Multiword expressions	"New York" is one concept
Agglutinative languages	Turkish, Finnish words encode entire phrases

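Rule-based tokenizers (Penn Treebank, Moses) use regex cascades. spaCy's tokenizer is also rule-based, driven by language-specific prefix, suffix, and infix patterns plus exception lists; its statistical models run on the resulting tokens.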
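
A minimal sketch of a rule-based word tokenizer, assuming a single hypothetical regex (TOKEN_RE) rather than the full cascades used by Penn Treebank or Moses. It keeps contractions and hyphenated compounds as one token, which is only one of the possible answers to the questions in the table above.

```python
import re

# Hypothetical pattern (an assumption for illustration, not the Penn Treebank
# or Moses rules): a run of word characters, optionally joined by internal
# hyphens or ASCII apostrophes, or else a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def word_tokenize(text):
    """Return word and punctuation tokens in order of appearance."""
    return TOKEN_RE.findall(text)


print(word_tokenize("Don't split state-of-the-art, please."))
# ["Don't", 'split', 'state-of-the-art', ',', 'please', '.']
```

Typographic apostrophes and multiword expressions such as "New York" are not handled here; those need extra rules or a lexicon.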

Subword Tokenization

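Subword methods balance vocabulary size against token granularity. Rare or unseen words are split into smaller known pieces, which largely removes the out-of-vocabulary problem; it disappears entirely only when the base vocabulary covers every character or byte.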

Byte Pair Encoding (BPE)

Start with a character-level vocabulary
Count all adjacent symbol pairs in the corpus
Merge the most frequent pair into a new symbol
Repeat for a fixed number of merge operations (typically 30k-50k)

Byte-level BPE is used by GPT-2/3/4 and RoBERTa. Encoding is deterministic at inference given a learned merge table.
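
The four training steps can be sketched on a toy corpus as follows. This is a simplified illustration under stated assumptions: the word frequencies are made up, there is no end-of-word marker, and ties are broken by whichever pair was counted first. Production implementations differ in these details.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered merge table from a {word: frequency} mapping."""
    # Step 1: start from a character-level segmentation of every word.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)          # step 2: count pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # step 3: most frequent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)         # apply the merge, then repeat
    return merges


# Toy word frequencies (made up for illustration).
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

At inference, the learned merges are replayed in the same order on each new word, which is why encoding is deterministic.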

WordPiece

Similar to BPE but selects merges that maximize likelihood of the training corpus
Merges the pair that maximizes P(merged) / (P(first) * P(second))
Used by BERT and DistilBERT
Prefixes continuation tokens with ## (e.g., "playing" -> "play" + "##ing")
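
At inference time, WordPiece tokenizers in the BERT style segment each word by greedy longest-match-first lookup. The sketch below illustrates that lookup and the ## continuation prefix; the vocabulary is a hypothetical toy set, not BERT's real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word maps to unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```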

Unigram Language Model

Starts with a large vocabulary and iteratively removes tokens whose removal least decreases corpus likelihood
Each segmentation has a probability; can sample or find the Viterbi-best segmentation
Used by XLNet, ALBERT, T5
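
Finding the Viterbi-best segmentation is a short dynamic program over string positions. The sketch below assumes a hypothetical toy table of piece probabilities; a trained Unigram model would supply these values.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the string.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical piece probabilities (made up for illustration).
toy_probs = {"un": 0.1, "related": 0.05, "relate": 0.02, "d": 0.02,
             "u": 0.01, "n": 0.01, "unrelated": 0.001}
log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

print(viterbi_segment("unrelated", log_probs))  # ['un', 'related']
```

Here "un" + "related" scores 0.1 * 0.05 = 0.005, which beats the single piece "unrelated" at 0.001. Sampling from the segmentation distribution instead of taking the best path is what enables subword regularization.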

SentencePiece

Language-agnostic framework that treats input as a raw stream of Unicode characters, with whitespace encoded as an ordinary symbol (no pre-tokenization required)
Implements both BPE and Unigram algorithms
Handles any language without whitespace-based pre-splitting (critical for Chinese, Japanese, Thai)
Supports byte fallback: characters missing from the vocabulary are decomposed into UTF-8 bytes, so any Unicode input can be encoded

Comparison

Method	Vocabulary Source	Merge Criterion	Notable Users
BPE	Bottom-up merging	Frequency	GPT family
WordPiece	Bottom-up merging	Likelihood gain	BERT
Unigram	Top-down pruning	Likelihood loss	T5, XLNet
SentencePiece	Framework (BPE/Unigram)	Depends on algorithm	LLaMA, T5

Text Normalization

Case Folding

Lowercasing reduces vocabulary but destroys information (e.g., "US" vs "us"). Modern models typically preserve case and let the model learn case sensitivity.

Unicode Normalization and UTF-8

Unicode assigns codepoints to characters; UTF-8 encodes them in 1-4 bytes.

NFC (Canonical Composition): combines decomposed characters ("e" + combining acute accent U+0301 -> "é")
NFKC (Compatibility Composition): also normalizes compatibility variants (ligatures, width forms)
Always normalize to a consistent form before tokenization to avoid duplicate vocabulary entries
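
The difference between the forms is easy to check with Python's standard unicodedata module. The sample characters below are chosen for illustration and written as escapes so the code points are unambiguous.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

# The two strings render identically but compare unequal until normalized.
print(decomposed == composed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# NFKC additionally folds compatibility variants.
ligature = "\ufb01"   # LATIN SMALL LIGATURE FI
fullwidth = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A
print(unicodedata.normalize("NFKC", ligature))               # fi
print(unicodedata.normalize("NFKC", fullwidth))              # A
print(unicodedata.normalize("NFC", ligature) == ligature)    # True (NFC keeps it)

# UTF-8 uses 1 to 4 bytes per code point.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)                                          # [1, 2, 3, 4]
```

Without normalization, the composed and decomposed spellings of the same word would become separate vocabulary entries.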

Stemming

Reduces words to a stem by stripping suffixes heuristically.

Porter Stemmer: rule-based suffix stripping (e.g., "running" -> "run", "studies" -> "studi")
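Snowball Stemmer: a stemmer framework that includes an improved English Porter stemmer (Porter2) and stemmers for many other languages
Fast but crude; can produce non-words such as "studi" and conflate unrelated words ("universe" and "university" -> "univers")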
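
To make heuristic suffix stripping concrete, here is a deliberately tiny stemmer. It is NOT the Porter algorithm: the rule list, the minimum stem length, and the function name are all assumptions made for illustration, and the real algorithm has several more steps and conditions.

```python
# Hypothetical rule list, checked in order; the first matching suffix wins.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"),
                ("s", ""), ("ing", ""), ("ed", "")]
MIN_STEM = 3  # assumption: never strip down to fewer than 3 characters


def toy_stem(word):
    """Strip one common English suffix using the toy rules above."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) < MIN_STEM:
                return word  # too short to strip safely
            # Undouble a final consonant after -ing/-ed ("runn" -> "run"),
            # except for l, s, z ("fall" stays "fall").
            if (suffix in ("ing", "ed") and stem[-1] == stem[-2]
                    and stem[-1] not in "lsz"):
                stem = stem[:-1]
            return stem + replacement
    return word


for w in ["running", "studies", "cats", "glass", "falling"]:
    print(w, "->", toy_stem(w))
# running -> run
# studies -> studi
# cats -> cat
# glass -> glass
# falling -> fall
```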

Lemmatization

Maps words to their dictionary form (lemma) using morphological analysis.

"better" -> "good", "ran" -> "run", "mice" -> "mouse"
Requires POS information for disambiguation ("saw" noun vs verb)
Common implementations: WordNet lemmatizer (NLTK), spaCy lemmatizer, Stanza
More accurate than stemming but slower
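
The role of POS information can be shown with a lookup keyed on (word, POS). The table below is a hypothetical toy lexicon with made-up tag names; real lemmatizers combine much larger lexicons with morphological rules.

```python
# Hypothetical toy lexicon: (surface form, POS tag) -> lemma.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("ran", "VERB"): "run",
    ("mice", "NOUN"): "mouse",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("saw", "VERB"))  # see
print(toy_lemmatize("saw", "NOUN"))  # saw
print(toy_lemmatize("Mice", "NOUN"))  # mouse
```

The same surface form "saw" maps to two different lemmas, which is why a lemmatizer without POS tags has to guess.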

Stemming vs Lemmatization

Aspect	Stemming	Lemmatization
Output	May be non-word	Dictionary form (valid word for known vocabulary)
Speed	Fast	Slower
Context needed	No	Often (POS tag)
Use case	IR, search indexing	Text understanding tasks

Stop Word Removal

Stop words are high-frequency, low-information words (the, is, at, of). Removing them reduces dimensionality for bag-of-words models.

Caveats:

Negation words ("not", "no") carry critical sentiment information
Phrases depend on stop words ("to be or not to be")
Modern neural models generally do not remove stop words; the model learns to ignore them
Custom stop word lists per domain often outperform generic lists
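
The negation caveat can be demonstrated with a small filter that accepts a keep-list. Both word sets here are hypothetical examples, not a standard stop word list.

```python
# Hypothetical generic list that, like many real ones, includes negations.
GENERIC_STOP_WORDS = {"the", "is", "at", "of", "a", "an", "and", "to", "not", "no"}
NEGATIONS = {"not", "no", "never"}


def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop tokens found in `stop_words` unless they are also in `keep`."""
    return [t for t in tokens
            if t.lower() not in stop_words or t.lower() in keep]


tokens = "the movie is not good".split()
print(remove_stop_words(tokens, GENERIC_STOP_WORDS))
# ['movie', 'good']
print(remove_stop_words(tokens, GENERIC_STOP_WORDS, keep=NEGATIONS))
# ['movie', 'not', 'good']
```

With the generic list alone, a negative review collapses to the same tokens as a positive one.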

Sentence Segmentation

Splitting text into sentences is harder than it appears.

Challenge	Example
Abbreviations	"Dr. Smith went to Washington."
Ellipsis	"Wait... what?"
Decimal numbers	"The price is 3.50 dollars."
Quoted speech	She said, "Hello." He nodded.

Approaches:

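Rule-based: regular expressions combined with hand-built abbreviation lists
Unsupervised: Punkt tokenizer (learns abbreviations and sentence starters from unlabeled text)
ML-based: binary classifier on each period (features: word length, case, abbreviation lists)
Neural: spaCy's statistical sentence segmenter, Stanza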
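
A small rule-based splitter shows how the challenges in the table interact. This is a sketch under stated assumptions: the abbreviation set and the boundary regex are hypothetical and tuned only to the examples above, and it will fail on harder cases such as an abbreviation that really does end a sentence.

```python
import re

# Hypothetical abbreviation list (lowercased, with trailing period).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

# Candidate boundary: terminal punctuation (plus optional closing quote or
# bracket), then whitespace, then an uppercase letter or an opening quote.
BOUNDARY_RE = re.compile(r'([.!?]+[")\]]*)\s+(?=["(\[]?[A-Z])')


def split_sentences(text):
    """Split text at candidate boundaries that do not follow an abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY_RE.finditer(text):
        candidate = text[start:match.end(1)]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Dr." is not the end of a sentence
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = ('Dr. Smith went to Washington. The price is 3.50 dollars. '
        'Wait... what? She said, "Hello." He nodded.')
for sentence in split_sentences(text):
    print(sentence)
# Dr. Smith went to Washington.
# The price is 3.50 dollars.
# Wait... what?
# She said, "Hello."
# He nodded.
```

The decimal point is skipped because no whitespace follows it, and the ellipsis is skipped because the next word is lowercase.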

Preprocessing Pipelines

A typical pipeline applies steps in order. The exact steps depend on the task and model.

Raw Text
  -> Unicode normalization (NFKC)
  -> Sentence segmentation
  -> Tokenization (subword for neural, word-level for classical)
  -> Optional: lowercasing, stop word removal, stemming/lemmatization
  -> Numericalization (token -> integer ID)
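
The final numericalization step can be sketched as follows. The helper names, the regex tokenizer, and the special tokens <pad> and <unk> are assumptions chosen for illustration; sentence segmentation and the optional steps are left out for brevity.

```python
import re
import unicodedata


def preprocess(text):
    """NFKC-normalize, lowercase, and split into word/punctuation tokens."""
    text = unicodedata.normalize("NFKC", text)
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Assign integer IDs in order of first appearance, specials first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def numericalize(tokens, vocab):
    """Map tokens to IDs; unseen tokens map to the <unk> ID."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


corpus = ["The cat sat.", "The dog sat."]
vocab = build_vocab([preprocess(doc) for doc in corpus])
print(vocab)
# {'<pad>': 0, '<unk>': 1, 'the': 2, 'cat': 3, 'sat': 4, '.': 5, 'dog': 6}
print(numericalize(preprocess("The bird sat."), vocab))
# [2, 1, 4, 5]
```

The unseen word "bird" maps to the <unk> ID, which is the word-level out-of-vocabulary problem that subword tokenization is designed to reduce.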

Classical NLP Pipeline (e.g., for TF-IDF + SVM)

Lowercase
Remove punctuation and special characters
Tokenize (word-level)
Remove stop words
Stem or lemmatize
Build vocabulary and vectorize
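
A standard-library sketch of these steps, assuming a hypothetical stop word set and the plain tf * log(N / df) weighting. Stemming is omitted for brevity, and libraries such as scikit-learn use smoothed variants of the formula, so their numbers will differ.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on", "and"}  # hypothetical toy list


def classical_tokens(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in text.split() if t not in STOP_WORDS]


def tfidf(docs):
    """Return (sorted vocabulary, one TF-IDF vector per document)."""
    tokenized = [classical_tokens(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors


docs = ["The cat sat on the mat.", "The dog sat on the log."]
vocab, vectors = tfidf(docs)
print(vocab)  # ['cat', 'dog', 'log', 'mat', 'sat']
print([round(x, 3) for x in vectors[0]])  # [0.693, 0.0, 0.0, 0.693, 0.0]
```

The term "sat" appears in every document, so its weight is zero; the resulting vectors would then be passed to a classifier such as an SVM.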

Modern Neural Pipeline (e.g., for BERT)

Unicode normalize
Apply pretrained tokenizer (WordPiece) -- handles casing, subwords
Add special tokens ([CLS], [SEP])
Convert to IDs and create attention masks
No stop word removal, no stemming
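
The special-token, ID, and attention-mask steps can be illustrated without any library. The vocabulary and its ID values below are hypothetical and do not match BERT's real vocabulary; in practice the pretrained tokenizer performs all of this in one call.

```python
def encode_batch(token_lists, vocab, max_len):
    """Add [CLS]/[SEP], map to IDs, pad to max_len, and build attention masks."""
    input_ids, attention_masks = [], []
    for tokens in token_lists:
        # Truncate so the two special tokens still fit within max_len.
        tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
        ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
        pad = max_len - len(ids)
        input_ids.append(ids + [vocab["[PAD]"]] * pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_masks.append([1] * len(ids) + [0] * pad)
    return input_ids, attention_masks


# Hypothetical toy vocabulary (IDs are made up).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "play": 4, "##ing": 5, "chess": 6}

batch = [["play", "##ing", "chess"], ["play"]]
ids, masks = encode_batch(batch, toy_vocab, max_len=6)
print(ids)    # [[2, 4, 5, 6, 3, 0], [2, 4, 3, 0, 0, 0]]
print(masks)  # [[1, 1, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0]]
```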

Practical Considerations

Whitespace and Noise

HTML tags, URLs, email addresses, @mentions, hashtags all need task-specific handling
Regular expressions for cleaning must be carefully ordered
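
Why ordering matters can be seen with emails and @mentions, which share the @ character. The patterns and placeholder strings below are simplified assumptions for illustration, not production-grade URL or email matchers.

```python
import html
import re

# Simplified, hypothetical patterns.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"@\w+")


def clean(text):
    """Replace noisy spans with placeholders, in a deliberate order."""
    text = TAG_RE.sub(" ", text)           # 1. strip HTML tags
    text = html.unescape(text)             # 2. decode entities such as &amp;
    text = URL_RE.sub("<url>", text)       # 3. URLs
    text = EMAIL_RE.sub("<email>", text)   # 4. emails BEFORE mentions
    text = MENTION_RE.sub("<user>", text)  # 5. mentions last
    return re.sub(r"\s+", " ", text).strip()


raw = ("<p>Mail bob@example.com or ping @alice at "
       "https://example.com/x?y=1 &amp; enjoy</p>")
print(clean(raw))
# Mail <email> or ping <user> at <url> & enjoy

# Running the mention rule first corrupts the email address.
print(MENTION_RE.sub("<user>", "bob@example.com"))
# bob<user>.com
```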

Multilingual Text

Language detection (fastText lid, CLD3) before language-specific processing
Transliteration to a common script for code-switched or romanized text
Shared subword vocabularies (SentencePiece) enable multilingual models

Reproducibility

Pin tokenizer versions; vocabulary changes break model compatibility
Document normalization choices; NFKC vs NFC matters for reproducibility
Store tokenizer artifacts alongside model checkpoints
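
One lightweight way to detect tokenizer drift is to store a hash of the vocabulary and normalization settings next to the checkpoint. The function name and the config keys below are hypothetical; real tokenizer artifacts would be hashed from their saved files.

```python
import hashlib
import json


def tokenizer_fingerprint(vocab, config):
    """Return a SHA-256 hex digest of the vocabulary and preprocessing config."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps({"config": config, "vocab": vocab},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical toy vocabulary and settings.
vocab = {"[PAD]": 0, "[UNK]": 1, "play": 2, "##ing": 3}
fp_nfkc = tokenizer_fingerprint(vocab, {"normalization": "NFKC", "lowercase": False})
fp_nfc = tokenizer_fingerprint(vocab, {"normalization": "NFC", "lowercase": False})

print(fp_nfkc == fp_nfc)  # False
```

Comparing the stored fingerprint at load time catches a silent change in vocabulary or normalization form before it degrades model outputs.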

Key Takeaways

Subword tokenization (BPE, WordPiece, Unigram) has largely replaced word-level tokenization in neural NLP
SentencePiece enables language-agnostic preprocessing by operating on raw Unicode text without pre-tokenization
Classical preprocessing (stemming, stop words) is still relevant for bag-of-words and search applications
The tokenizers shipped with modern pretrained models handle most normalization themselves; minimal additional preprocessing is preferred
Unicode normalization is essential for consistent tokenization across different text sources