Text Representation
Overview
Text representation converts discrete symbols into numerical vectors that capture semantic and syntactic relationships. The evolution from sparse, count-based methods to dense, contextual embeddings is one of the central advances in NLP.
Sparse Representations
Bag of Words (BoW)
Represents a document as a vector of word counts, ignoring word order.
- Vocabulary of size V defines the vector dimensionality
- Each document becomes a sparse V-dimensional vector
- Simple and interpretable but loses word order, synonymy, and polysemy
- High dimensionality (V can be 100k+)
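The BoW construction above can be sketched in a few lines of plain Python. The function names (`build_vocab`, `bow_vector`) are illustrative, not from any particular library:

```python
from collections import Counter

def build_vocab(docs):
    """Map each unique token to a column index (vocabulary of size V)."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def bow_vector(doc, vocab):
    """Sparse V-dimensional count vector; word order is discarded."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in sorted(vocab, key=vocab.get)]

docs = ["the cat sat", "the cat sat on the mat"]
vocab = build_vocab(docs)          # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
vec = bow_vector(docs[1], vocab)   # [1, 1, 1, 1, 2]
```

Note that "the cat sat on the mat" and "the mat sat on the cat" map to the same vector, which is exactly the word-order loss described above.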
TF-IDF
Term Frequency-Inverse Document Frequency weights words by their importance.
TF(t, d) = count(t in d) / |d|
IDF(t) = log(N / df(t))
TF-IDF(t, d) = TF(t, d) * IDF(t)
- Downweights common words (high df), upweights rare but present words
- Variants: sublinear TF (1 + log(tf)), smoothed IDF, BM25 (length-normalized)
- BM25 remains a strong baseline for information retrieval
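The three formulas above translate directly into code. This is a minimal sketch of the basic (unsmoothed) variant; a document is represented as a token list:

```python
import math

def tf(term, doc_tokens):
    """Term frequency: count of t in d, normalized by document length |d|."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / df(t))."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
# "the" appears in all 3 docs -> IDF = log(3/3) = 0, so its TF-IDF is 0
# "cat" appears in 2 of 3 docs -> IDF = log(3/2), so it carries weight
```

This makes the downweighting concrete: a word present in every document scores exactly zero, which is why smoothed IDF variants add constants to keep such terms from vanishing entirely.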
N-grams
Captures local word order by treating sequences of n consecutive tokens as features.
- Unigrams (n=1): bag of words
- Bigrams (n=2): "New York", "machine learning"
- Trigrams (n=3): "New York City"
- Vocabulary explodes combinatorially; typically limited to n <= 3
- Character n-grams capture morphological patterns and are robust to typos
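Both word-level and character-level n-gram extraction are one-liners; a short sketch with illustrative function names:

```python
def word_ngrams(tokens, n):
    """All runs of n consecutive tokens (n=1 recovers bag of words)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """Character n-grams: robust to typos, shared across morphological variants."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

bigrams = word_ngrams(["new", "york", "city"], 2)  # ["new york", "york city"]
tri = char_ngrams("typo", 3)                       # ["typ", "ypo"]
```

The combinatorial vocabulary explosion is visible here: every distinct n-token window becomes its own feature, which is why n is rarely taken above 3.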
Static Word Embeddings
Dense, low-dimensional vectors (50-300d) learned from co-occurrence patterns. Each word gets one fixed vector regardless of context.
Word2Vec
Two architectures trained on a sliding window objective.
CBOW (Continuous Bag of Words)
- Predicts center word from surrounding context words
- Averages context word embeddings, passes through a linear layer
- Faster to train, better for frequent words
Skip-gram
- Predicts context words from the center word
- For each (center, context) pair, maximizes dot-product similarity
- Better for rare words and small datasets
- Negative sampling (Skip-gram with Negative Sampling, SGNS): instead of full softmax, contrast true pairs against k randomly sampled negatives
Training objective (Skip-gram with negative sampling):
maximize: log sigma(v_context . v_center)
+ sum_k E[log sigma(-v_negative_k . v_center)]
GloVe (Global Vectors)
Factorizes the log co-occurrence matrix directly.
- Builds a global word-word co-occurrence matrix X from the corpus
- Learns word vectors w_i and separate context vectors w~_j such that w_i . w~_j + b_i + b~_j = log(X_ij)
- Weighted least squares loss; downweights very frequent co-occurrences
- Combines the global statistics of matrix factorization with the efficiency of local window methods
- Comparable performance to Word2Vec; often used interchangeably
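One term of the GloVe loss is easy to write out. The weighting function f(x) = (x/x_max)^alpha (capped at 1) is the one from the GloVe paper, with its standard defaults x_max = 100 and alpha = 0.75:

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): downweights very frequent co-occurrences, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss_term(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log co-occurrence."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# when the factorization is exact, the loss term is zero:
# dot + b_i + b_j = 1 = log(e) for w_i = w_j = [1, 0], b_i = b_j = 0, X_ij = e
```

The full objective sums this term over all nonzero entries of the co-occurrence matrix X.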
FastText
Extends Word2Vec by representing each word as a bag of character n-grams.
- "where" with n=3 becomes <wh, whe, her, ere, re> (< and > mark word boundaries)
- Word vector = sum of its character n-gram vectors (plus a vector for the full word)
- Can compute vectors for out-of-vocabulary words by summing their n-grams
- Handles morphologically rich languages well
- Pretrained vectors available for 157 languages
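The subword composition is a straightforward sum; this sketch uses a tiny hypothetical n-gram table with made-up values to show how an OOV word still gets a vector:

```python
def subwords(word, n=3):
    """Boundary-marked character n-grams, e.g. "where" -> <wh, whe, her, ere, re>."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

def word_vector(word, ngram_vectors, dim=2):
    """Sum the vectors of the word's character n-grams; works even for OOV words.
    Unknown n-grams contribute zero (a real model hashes them into buckets)."""
    vec = [0.0] * dim
    for g in subwords(word):
        for k, val in enumerate(ngram_vectors.get(g, [0.0] * dim)):
            vec[k] += val
    return vec

# toy n-gram table (hypothetical values, not trained)
table = {"<wh": [1.0, 0.0], "whe": [0.0, 1.0], "her": [1.0, 1.0],
         "ere": [0.5, 0.0], "re>": [0.0, 0.5]}
v = word_vector("where", table)   # sum of the five n-gram vectors = [2.5, 2.5]
```

A misspelling like "whree" would still share n-grams with "where" and land nearby in the space, which is the robustness property noted above.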
Comparison of Static Embeddings
| Method | Key Idea | OOV Handling | Training Signal |
|---|---|---|---|
| Word2Vec | Predict word from/by context | None | Local windows |
| GloVe | Factorize co-occurrence matrix | None | Global statistics |
| FastText | Character n-gram composition | Yes | Local windows + subword |
Properties of Word Embeddings
- Analogies: king - man + woman = queen (approximately)
- Clustering: semantically related words cluster together
- Bias: embeddings reflect and amplify biases in training data (gender, racial)
- Evaluation: intrinsic (analogy, similarity benchmarks) vs extrinsic (downstream task performance)
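The analogy property is usually tested by offset arithmetic plus cosine similarity. The 2-d vectors below are hand-picked toy values chosen so the analogy works; real embeddings only satisfy it approximately:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# toy 2-d "embeddings" (illustrative values only)
emb = {"king": [0.9, 0.8], "man": [0.5, 0.2],
       "woman": [0.5, 0.9], "queen": [0.9, 1.5], "apple": [0.1, -0.2]}

# king - man + woman, then find the nearest word (excluding the query words)
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))   # "queen"
```

Excluding the three query words from the candidate set is standard practice, since the nearest neighbor of the offset vector is often one of the inputs themselves.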
Contextual Embeddings
Static embeddings assign one vector per word type. Contextual embeddings produce different vectors for each word token depending on its surrounding context.
ELMo (Embeddings from Language Models)
- Trains a two-layer bidirectional LSTM language model
- Word representation = learned weighted sum of all LSTM layers
- Different layers capture different information (lower = syntax, upper = semantics)
- Task-specific layer weights are learned while training the downstream model
- First widely successful contextual embedding (2018)
BERT Embeddings
- Transformer encoder trained with masked language modeling
- Each token gets a 768d (base) or 1024d (large) contextual vector
- "Bank" in "river bank" vs "bank account" gets different representations
- Representations from different layers serve different purposes
- Concatenating or averaging the last 4 layers is a common choice for feature extraction
Comparison: Static vs Contextual
| Aspect | Static (Word2Vec, GloVe) | Contextual (ELMo, BERT) |
|---|---|---|
| Polysemy | One vector per word | Different vector per usage |
| Dimensionality | 50-300 | 768-1024 |
| Training cost | Hours on CPU | Days on GPUs/TPUs |
| Downstream use | Feature input | Fine-tune or feature extract |
| Vocabulary | Fixed | Subword (open vocabulary) |
Sentence and Document Embeddings
Simple Aggregation
- Average word embeddings (surprisingly strong baseline)
- TF-IDF weighted average
- Max pooling over word embeddings
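The three aggregation strategies above are a few lines each, given per-word vectors (toy values here; function names are illustrative):

```python
def mean_pool(vectors):
    """Average word embeddings: the surprisingly strong baseline."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def max_pool(vectors):
    """Component-wise max over the word embeddings."""
    dim = len(vectors[0])
    return [max(v[k] for v in vectors) for k in range(dim)]

def weighted_mean(vectors, weights):
    """TF-IDF weighted average: informative words contribute more."""
    dim = len(vectors[0])
    total = sum(weights)
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]

words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy word vectors
```

With weights [0, 0, 1], the weighted mean collapses to the third word's vector, showing how TF-IDF weighting lets high-IDF words dominate the sentence representation.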
Sentence-BERT (SBERT)
- Fine-tunes BERT with a siamese/triplet network on NLI and STS data
- Produces fixed-size sentence embeddings efficient for similarity search
- Cosine similarity between SBERT vectors correlates well with semantic similarity
- Orders of magnitude faster than cross-encoder BERT for pairwise comparison (encode once, compare many)
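The "encode once, compare many" pattern looks like this. The `encode` function below is a stand-in for an SBERT model (here just a normalized character-bucket vector), so only the workflow is meaningful, not the embedding quality:

```python
import math

def encode(sentence):
    """Stand-in for a sentence encoder: normalized bag-of-characters buckets."""
    vec = [0.0] * 8
    for ch in sentence.lower():
        vec[ord(ch) % 8] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))  # unit vectors: dot = cosine

corpus = ["the cat sat", "a dog ran", "the cat sits"]
encoded = [encode(s) for s in corpus]   # one encoder pass per sentence, done once
query = encode("cat sitting")           # one more pass for the query
best = max(range(len(corpus)), key=lambda i: cosine(query, encoded[i]))
```

A cross-encoder would instead run the full model on every (query, sentence) pair: n encoder passes per query versus one, which is where the orders-of-magnitude speedup comes from.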
Other Sentence Embedding Methods
| Method | Approach | Key Property |
|---|---|---|
| InferSent | BiLSTM on NLI data | Early learned sentence encoder |
| Universal Sentence Encoder | Transformer or DAN | Lightweight, multilingual |
| SimCSE | Contrastive learning on BERT | Strong unsupervised baseline |
| E5, GTE, BGE | Instruction-tuned encoders | Strong general-purpose retrieval encoders |
Embedding Spaces and Operations
Similarity Measures
- Cosine similarity: most common for normalized embeddings
- Euclidean distance: sensitive to magnitude
- Dot product: captures both similarity and magnitude
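A toy pair of vectors makes the magnitude-sensitivity difference concrete:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [1.0, 0.0]
v = [10.0, 0.0]   # same direction, 10x the magnitude
# cosine(u, v)    -> 1.0  (direction only; magnitude ignored)
# euclidean(u, v) -> 9.0  (dominated by the magnitude gap)
# dot(u, v)       -> 10.0 (grows with both alignment and magnitude)
```

This is why embeddings are usually L2-normalized before cosine search: on unit vectors all three measures induce the same nearest-neighbor ranking.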
Dimensionality Reduction for Visualization
- PCA: linear projection preserving variance
- t-SNE: nonlinear; preserves local neighborhood structure
- UMAP: faster alternative to t-SNE with better global structure
Alignment and Mapping
- Cross-lingual word embedding alignment maps two languages into a shared space
- Procrustes alignment uses a bilingual dictionary of word pairs as anchors
- Enables zero-shot cross-lingual transfer
Practical Considerations
- Pretrained embeddings (GloVe, FastText) are effective initializations for downstream models
- Fine-tuning embeddings on domain-specific data often helps, especially for specialized vocabulary
- For retrieval tasks, embedding quality directly determines recall; choose models trained on similar data
- Embedding dimensionality is a trade-off: higher dimensions capture more information but increase compute and storage
- Normalize embeddings for cosine similarity search; use approximate nearest neighbor (FAISS, ScaNN) for scale
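The normalization point in the last bullet is worth making concrete: after L2 normalization, cosine similarity reduces to a plain dot product, which is what an inner-product ANN index computes. A brute-force sketch (the vectors are toy values):

```python
import math

def l2_normalize(vec):
    """Scale to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# normalize the database once at indexing time
db = [l2_normalize(v) for v in [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]]
q = l2_normalize([6.0, 8.0])   # same direction as db[0], different magnitude

# brute-force max inner product; an ANN library replaces this linear scan at scale
nearest = max(range(len(db)), key=lambda i: dot(q, db[i]))   # index 0
```

Libraries like FAISS or ScaNN replace the linear scan with approximate search, but the normalize-then-inner-product convention is the same.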
Key Takeaways
- Sparse representations (BoW, TF-IDF) remain useful for interpretability and efficiency in retrieval
- Static embeddings (Word2Vec, GloVe, FastText) capture semantic similarity in dense vectors but ignore context
- Contextual embeddings (ELMo, BERT) produce context-dependent representations that handle polysemy
- Sentence embeddings (SBERT, E5) enable efficient semantic search and clustering at the sentence level
- The choice of representation depends on the task, computational budget, and need for interpretability