Text Representation
Overview
Text representation converts discrete symbols into numerical vectors that capture semantic and syntactic relationships. The evolution from sparse, count-based methods to dense, contextual embeddings is one of the central advances in NLP.
Sparse Representations
Bag of Words (BoW)
Represents a document as a vector of word counts, ignoring word order.
- Vocabulary of size V defines the vector dimensionality
- Each document becomes a sparse V-dimensional vector
- Simple and interpretable but loses word order, synonymy, and polysemy
- High dimensionality (V can be 100k+)
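The BoW construction above can be sketched in a few lines of plain Python. The function names (`build_vocab`, `bow_vector`) are illustrative, not from any particular library:

```python
from collections import Counter

def build_vocab(docs):
    """Map each unique token to a column index (vocabulary of size V)."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def bow_vector(doc, vocab):
    """Sparse V-dimensional count vector; word order is discarded."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in sorted(vocab, key=vocab.get)]

docs = ["the cat sat", "the cat sat on the mat"]
vocab = build_vocab(docs)          # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
vec = bow_vector(docs[1], vocab)   # [1, 1, 1, 1, 2]
```

Note that "the cat sat on the mat" and "the mat sat on the cat" map to the same vector, which is exactly the word-order loss described above.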
TF-IDF
Term Frequency-Inverse Document Frequency weights words by their importance.
TF(t, d) = count(t in d) / |d|
IDF(t) = log(N / df(t))
TF-IDF(t, d) = TF(t, d) * IDF(t)
- Downweights common words (high df), upweights rare but present words
- Variants: sublinear TF (1 + log(tf)), smoothed IDF, BM25 (length-normalized)
- BM25 remains a strong baseline for information retrieval
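The three formulas above translate directly into code. This is a minimal sketch of the basic (unsmoothed) variant; a document is represented as a token list:

```python
import math

def tf(term, doc_tokens):
    """Term frequency: count of t in d, normalized by document length |d|."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / df(t))."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
# "the" appears in all 3 docs -> IDF = log(3/3) = 0, so its TF-IDF is 0
# "cat" appears in 2 of 3 docs -> IDF = log(3/2), so it carries weight
```

This makes the downweighting concrete: a word present in every document scores exactly zero, which is why smoothed IDF variants add constants to keep such terms from vanishing entirely.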
N-grams
Captures local word order by treating sequences of n consecutive tokens as features.
- Unigrams (n=1): bag of words
- Bigrams (n=2): "New York", "machine learning"
- Trigrams (n=3): "New York City"
- Vocabulary explodes combinatorially; typically limited to n <= 3
- Character n-grams capture morphological patterns and are robust to typos
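Both word-level and character-level n-gram extraction are one-liners; a short sketch with illustrative function names:

```python
def word_ngrams(tokens, n):
    """All runs of n consecutive tokens (n=1 recovers bag of words)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """Character n-grams: robust to typos, shared across morphological variants."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

bigrams = word_ngrams(["new", "york", "city"], 2)  # ["new york", "york city"]
tri = char_ngrams("typo", 3)                       # ["typ", "ypo"]
```

The combinatorial vocabulary explosion is visible here: every distinct n-token window becomes its own feature, which is why n is rarely taken above 3.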
Static Word Embeddings
Dense, low-dimensional vectors (50-300d) learned from co-occurrence patterns. Each word gets one fixed vector regardless of context.
Word2Vec
Two architectures trained on a sliding window objective.
CBOW (Continuous Bag of Words)
- Predicts center word from surrounding context words
- Averages context word embeddings, passes through a linear layer
- Faster to train, better for frequent words
Skip-gram
- Predicts context words from the center word
- For each (center, context) pair, maximizes dot-product similarity
- Better for rare words and small datasets
- Negative sampling (Skip-gram with Negative Sampling, SGNS): instead of full softmax, contrast true pairs against k randomly sampled negatives
Training objective (Skip-gram with negative sampling):
maximize: log sigma(v_context . v_center)
+ sum_k E[log sigma(-v_negative_k . v_center)]
GloVe (Global Vectors)
Factorizes the log co-occurrence matrix directly.
- Builds a global word-word co-occurrence matrix X from the corpus
- Learns word vectors w_i and separate context vectors w~_j such that w_i . w~_j + b_i + b~_j = log(X_ij)
- Weighted least squares loss; downweights very frequent co-occurrences
- Combines the global statistics of matrix factorization with the efficiency of local window methods
- Comparable performance to Word2Vec; often used interchangeably
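One term of the GloVe loss is easy to write out. The weighting function f(x) = (x/x_max)^alpha (capped at 1) is the one from the GloVe paper, with its standard defaults x_max = 100 and alpha = 0.75:

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): downweights very frequent co-occurrences, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss_term(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log co-occurrence."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# when the factorization is exact, the loss term is zero:
# dot + b_i + b_j = 1 = log(e) for w_i = w_j = [1, 0], b_i = b_j = 0, X_ij = e
```

The full objective sums this term over all nonzero entries of the co-occurrence matrix X.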
FastText
Extends Word2Vec by representing each word as a bag of character n-grams.
- "where" with n=3 becomes <wh, whe, her, ere, re> (< and > mark word boundaries)
- Word vector = sum of its character n-gram vectors (plus a vector for the full word)
- Can compute vectors for out-of-vocabulary words by summing their n-grams
- Handles morphologically rich languages well
- Pretrained vectors available for 157 languages
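The subword composition is a straightforward sum; this sketch uses a tiny hypothetical n-gram table with made-up values to show how an OOV word still gets a vector:

```python
def subwords(word, n=3):
    """Boundary-marked character n-grams, e.g. "where" -> <wh, whe, her, ere, re>."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

def word_vector(word, ngram_vectors, dim=2):
    """Sum the vectors of the word's character n-grams; works even for OOV words.
    Unknown n-grams contribute zero (a real model hashes them into buckets)."""
    vec = [0.0] * dim
    for g in subwords(word):
        for k, val in enumerate(ngram_vectors.get(g, [0.0] * dim)):
            vec[k] += val
    return vec

# toy n-gram table (hypothetical values, not trained)
table = {"<wh": [1.0, 0.0], "whe": [0.0, 1.0], "her": [1.0, 1.0],
         "ere": [0.5, 0.0], "re>": [0.0, 0.5]}
v = word_vector("where", table)   # sum of the five n-gram vectors = [2.5, 2.5]
```

A misspelling like "whree" would still share n-grams with "where" and land nearby in the space, which is the robustness property noted above.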
Comparison of Static Embeddings
| Method | Key Idea | OOV Handling | Training Signal |
|---|---|---|---|
| Word2Vec | Predict word from/by context | None | Local windows |
| GloVe | Factorize co-occurrence matrix | None | Global statistics |
| FastText | Character n-gram composition | Yes | Local windows + subword |
Properties of Word Embeddings
- Analogies: king - man + woman = queen (approximately)
- Clustering: semantically related words cluster together
- Bias: embeddings reflect and amplify biases in training data (gender, racial)
- Evaluation: intrinsic (analogy, similarity benchmarks) vs extrinsic (downstream task performance)
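The analogy property is usually tested by offset arithmetic plus cosine similarity. The 2-d vectors below are hand-picked toy values chosen so the analogy works; real embeddings only satisfy it approximately:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# toy 2-d "embeddings" (illustrative values only)
emb = {"king": [0.9, 0.8], "man": [0.5, 0.2],
       "woman": [0.5, 0.9], "queen": [0.9, 1.5], "apple": [0.1, -0.2]}

# king - man + woman, then find the nearest word (excluding the query words)
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))   # "queen"
```

Excluding the three query words from the candidate set is standard practice, since the nearest neighbor of the offset vector is often one of the inputs themselves.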
Contextual Embeddings
Static embeddings assign one vector per word type. Contextual embeddings produce different vectors for each word token depending on its surrounding context.
ELMo (Embeddings from Language Models)
- Trains a two-layer bidirectional LSTM language model
- Word representation = learned weighted sum of all LSTM layers
- Different layers capture different information (lower = syntax, upper = semantics)
- Task-specific layer weights are learned while training the downstream model
- First widely successful contextual embedding (2018)
BERT Embeddings
- Transformer encoder trained with masked language modeling
- Each token gets a 768d (base) or 1024d (large) contextual vector
- "Bank" in "river bank" vs "bank account" gets different representations
- Representations from different layers serve different purposes
- Concatenating or averaging the last 4 layers is a common choice for feature extraction
Comparison: Static vs Contextual
| Aspect | Static (Word2Vec, GloVe) | Contextual (ELMo, BERT) |
|---|---|---|
| Polysemy | One vector per word | Different vector per usage |
| Dimensionality | 50-300 | 768-1024 |
| Training cost | Hours on CPU | Days on GPUs/TPUs |
| Downstream use | Feature input | Fine-tune or feature extract |
| Vocabulary | Fixed | Subword (open vocabulary) |
Sentence and Document Embeddings
Simple Aggregation
- Average word embeddings (surprisingly strong baseline)
- TF-IDF weighted average
- Max pooling over word embeddings
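The three aggregation strategies above are a few lines each, given per-word vectors (toy values here; function names are illustrative):

```python
def mean_pool(vectors):
    """Average word embeddings: the surprisingly strong baseline."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def max_pool(vectors):
    """Component-wise max over the word embeddings."""
    dim = len(vectors[0])
    return [max(v[k] for v in vectors) for k in range(dim)]

def weighted_mean(vectors, weights):
    """TF-IDF weighted average: informative words contribute more."""
    dim = len(vectors[0])
    total = sum(weights)
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]

words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy word vectors
```

With weights [0, 0, 1], the weighted mean collapses to the third word's vector, showing how TF-IDF weighting lets high-IDF words dominate the sentence representation.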
Sentence-BERT (SBERT)
- Fine-tunes BERT with a siamese/triplet network on NLI and STS data
- Produces fixed-size sentence embeddings efficient for similarity search
- Cosine similarity between SBERT vectors correlates well with semantic similarity
- Orders of magnitude faster than cross-encoder BERT for pairwise comparison (encode once, compare many)
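The "encode once, compare many" pattern looks like this. The `encode` function below is a stand-in for an SBERT model (here just a normalized character-bucket vector), so only the workflow is meaningful, not the embedding quality:

```python
import math

def encode(sentence):
    """Stand-in for a sentence encoder: normalized bag-of-characters buckets."""
    vec = [0.0] * 8
    for ch in sentence.lower():
        vec[ord(ch) % 8] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))  # unit vectors: dot = cosine

corpus = ["the cat sat", "a dog ran", "the cat sits"]
encoded = [encode(s) for s in corpus]   # one encoder pass per sentence, done once
query = encode("cat sitting")           # one more pass for the query
best = max(range(len(corpus)), key=lambda i: cosine(query, encoded[i]))
```

A cross-encoder would instead run the full model on every (query, sentence) pair: n encoder passes per query versus one, which is where the orders-of-magnitude speedup comes from.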
Other Sentence Embedding Methods
| Method | Approach | Key Property |
|---|---|---|
| InferSent | BiLSTM on NLI data | Early learned sentence encoder |
| Universal Sentence Encoder | Transformer or DAN | Lightweight, multilingual |
| SimCSE | Contrastive learning on BERT | Strong unsupervised baseline |
| E5, GTE, BGE | Instruction-tuned encoders | Strong general-purpose retrieval encoders |
Embedding Spaces and Operations
Similarity Measures
- Cosine similarity: most common for normalized embeddings
- Euclidean distance: sensitive to magnitude
- Dot product: captures both similarity and magnitude
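A toy pair of vectors makes the magnitude-sensitivity difference concrete:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [1.0, 0.0]
v = [10.0, 0.0]   # same direction, 10x the magnitude
# cosine(u, v)    -> 1.0  (direction only; magnitude ignored)
# euclidean(u, v) -> 9.0  (dominated by the magnitude gap)
# dot(u, v)       -> 10.0 (grows with both alignment and magnitude)
```

This is why embeddings are usually L2-normalized before cosine search: on unit vectors all three measures induce the same nearest-neighbor ranking.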
Dimensionality Reduction for Visualization
- PCA: linear projection preserving variance
- t-SNE: nonlinear; preserves local neighborhood structure
- UMAP: faster alternative to t-SNE with better global structure
Alignment and Mapping
- Cross-lingual word embedding alignment maps two languages into a shared space
- Procrustes alignment uses a bilingual dictionary of word pairs as anchors
- Enables zero-shot cross-lingual transfer
Practical Considerations
- Pretrained embeddings (GloVe, FastText) are effective initializations for downstream models
- Fine-tuning embeddings on domain-specific data often helps, especially for specialized vocabulary
- For retrieval tasks, embedding quality directly determines recall; choose models trained on similar data
- Embedding dimensionality is a trade-off: higher dimensions capture more information but increase compute and storage
- Normalize embeddings for cosine similarity search; use approximate nearest neighbor (FAISS, ScaNN) for scale
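The normalization point in the last bullet is worth making concrete: after L2 normalization, cosine similarity reduces to a plain dot product, which is what an inner-product ANN index computes. A brute-force sketch (the vectors are toy values):

```python
import math

def l2_normalize(vec):
    """Scale to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# normalize the database once at indexing time
db = [l2_normalize(v) for v in [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]]
q = l2_normalize([6.0, 8.0])   # same direction as db[0], different magnitude

# brute-force max inner product; an ANN library replaces this linear scan at scale
nearest = max(range(len(db)), key=lambda i: dot(q, db[i]))   # index 0
```

Libraries like FAISS or ScaNN replace the linear scan with approximate search, but the normalize-then-inner-product convention is the same.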
Key Takeaways
- Sparse representations (BoW, TF-IDF) remain useful for interpretability and efficiency in retrieval
- Static embeddings (Word2Vec, GloVe, FastText) capture semantic similarity in dense vectors but ignore context
- Contextual embeddings (ELMo, BERT) produce context-dependent representations that handle polysemy
- Sentence embeddings (SBERT, E5) enable efficient semantic search and clustering at the sentence level
- The choice of representation depends on the task, computational budget, and need for interpretability