What Embeddings Are

An embedding is a mapping from something humans understand — a word, a sentence, an image, a song — to a dense vector of floating-point numbers that a machine can compare, cluster, and search. The core idea: things with similar meaning end up close together in vector space, and things with different meaning end up far apart. "King" is closer to "Queen" than it is to "Banana." "A photo of a sunset" is closer to "golden hour over the ocean" than it is to "tax return spreadsheet."

This is not a metaphor. An embedding is literally a list of numbers. An embedding model takes your input and returns something like [0.023, -0.118, 0.541, ...], typically with a few hundred to a few thousand dimensions (the models covered later in this article range from 384 to 4096). Those numbers encode semantic information learned during training on massive text (or image, or audio) corpora.

Why Embeddings Matter

Traditional text comparison is keyword-based. If a user searches "how to fix a slow database," keyword search finds documents containing those exact words. It misses a document titled "Optimizing PostgreSQL query performance" even though it is exactly what the user wants.
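The miss is easy to reproduce. The sketch below is a deliberately crude stand-in for keyword search (real engines add stemming, stop-word handling, and ranking); the keyword_overlap helper is an illustrative name, not part of any library. It shows only that the query and the title share no words at all:

```python
def keyword_overlap(query: str, document: str) -> set[str]:
    """Return the lowercase words that appear in both strings."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return query_words & document_words

query = "how to fix a slow database"
title = "Optimizing PostgreSQL query performance"

shared = keyword_overlap(query, title)
print(shared)  # set(): no shared words, so a pure keyword match finds nothing
```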

Embeddings solve this. Both "how to fix a slow database" and "optimizing PostgreSQL query performance" map to nearby vectors because the embedding model learned that they mean similar things. Search becomes semantic, not lexical.

This extends beyond text. Embed an image and a text description into the same vector space (as CLIP does), and you can search images using natural language queries. Embed audio transcripts and you can search podcasts. The technique is general: if you can feed it to a neural network, you can embed it.

How Embedding Models Work

At a high level, embedding models are neural networks trained on pairs of related content. The training objective forces the model to produce similar vectors for similar inputs and dissimilar vectors for dissimilar inputs. This is called contrastive learning.

Training pair (similar):    "The cat sat on the mat" <-> "A feline resting on a rug"
Training pair (dissimilar): "The cat sat on the mat" <-> "Stock prices fell 3% today"

The model learns to push similar pairs together and dissimilar pairs apart in vector space. After training on billions of pairs, the model generalizes: it can embed text it has never seen and place it meaningfully relative to everything else.
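To make "push together, pull apart" concrete, here is a minimal NumPy sketch of one common contrastive objective: an InfoNCE-style loss with in-batch negatives. This is an assumption for illustration; the article does not say which loss any particular model uses. The temperature value and the toy vectors are also made up. Real training computes this loss on model outputs and backpropagates through the network.

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray,
                  temperature: float = 0.05) -> float:
    """Contrastive loss with in-batch negatives.

    Row i of `anchors` is paired with row i of `positives`; every other
    row of `positives` acts as a negative for anchor i.
    """
    # L2-normalize so the dot products below are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct partner for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy stand-ins for model outputs: four anchors and their near-duplicates.
anchors = np.eye(4)
aligned = np.eye(4) + 0.1      # each positive points almost the same way
shuffled = aligned[::-1]       # every anchor is paired with the wrong row

aligned_loss = info_nce_loss(anchors, aligned)
shuffled_loss = info_nce_loss(anchors, shuffled)
print(aligned_loss < shuffled_loss)  # True
```

The loss is small when each anchor is most similar to its own partner and large when it is not, which is the pressure that shapes the vector space during training.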

Commercial APIs

Model                            Dimensions   Context Window   Notes
OpenAI text-embedding-3-small    1536         8191 tokens      Good balance of cost & quality
OpenAI text-embedding-3-large    3072         8191 tokens      Higher quality, higher cost
Cohere embed-v3                  1024         512 tokens       Strong multilingual support
Google text-embedding-004        768          2048 tokens      Tight Vertex AI integration
Voyage AI voyage-3               1024         32000 tokens     Long-context specialist

Open Source

Model                                     Dimensions   Notes
sentence-transformers/all-MiniLM-L6-v2    384          Fast, small, good enough for many tasks
BAAI/bge-large-en-v1.5                    1024         Led the MTEB leaderboard around its 2023 release
nomic-ai/nomic-embed-text-v1.5            768          Long context, open weights
intfloat/e5-mistral-7b-instruct           4096         LLM-based, highest quality, slow

Open-source models run locally, which matters for sensitive data, air-gapped environments, and cost at scale.

Generating Embeddings in Practice

# Using OpenAI
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The database query is running slowly"
)
vector = response.data[0].embedding  # list of 1536 floats

# Using sentence-transformers (local, no API key)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("The database query is running slowly")  # numpy array, 384 dims

Dimensions & What They Mean

Each dimension in an embedding vector captures some learned feature of the input. Unlike hand-crafted features ("word length," "contains number"), these features are not human-interpretable. Dimension 47 does not mean anything specific you can point to. But collectively, the dimensions encode meaning.

More dimensions generally means more expressive power, but also more storage, more compute for similarity search, and diminishing returns. For most production use cases, 768 to 1536 dimensions is the sweet spot.

# A 1536-dimensional float32 embedding takes 1536 * 4 = 6144 bytes (6 KiB).
# 1 million documents  = ~6.1 GB (~5.7 GiB) of raw vector storage
# 10 million documents = ~61 GB (~57 GiB)

embedding_size_bytes = 1536 * 4  # 6144 bytes per vector
one_million_gb = 1_000_000 * embedding_size_bytes / 1e9  # ~6.1 GB (decimal)
one_million_gib = 1_000_000 * embedding_size_bytes / (1024 ** 3)  # ~5.7 GiB (binary)

Some models support shortened embeddings. OpenAI's text-embedding-3 models were trained with Matryoshka Representation Learning, so you can request fewer dimensions through the dimensions parameter, trading some quality for storage:

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="The database query is running slowly",
    dimensions=256  # truncate from 3072 to 256
)
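If you have already stored full-length vectors from a model trained this way, you can also shorten them yourself. The sketch below assumes the leading components carry the most information, which holds for Matryoshka-trained models but not for embeddings in general. The truncate_embedding helper is an illustrative name. After cutting, the vector is rescaled to unit length so dot product and cosine similarity still behave as expected.

```python
import numpy as np

def truncate_embedding(vector, dims: int) -> np.ndarray:
    """Keep the first `dims` components and rescale to unit length."""
    v = np.asarray(vector, dtype=np.float32)
    if not 0 < dims <= v.shape[0]:
        raise ValueError("dims must be between 1 and the vector length")
    head = v[:dims]
    norm = np.linalg.norm(head)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return head / norm

# Toy 3-dimensional vector standing in for a real embedding.
short = truncate_embedding([3.0, 4.0, 12.0], dims=2)
print(short)  # approximately [0.6, 0.8]
```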

Similarity Metrics

Once you have two vectors, you need a way to measure how close they are. Three metrics dominate:

Cosine Similarity

Measures the angle between two vectors, ignoring magnitude. Returns a value between -1 and 1, where 1 means identical direction.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rough ranges only; actual scores vary by model, so calibrate on your own data.
# Identical meaning -> ~0.95
# Related meaning  -> ~0.70-0.85
# Unrelated        -> ~0.10-0.40

Cosine similarity is the default choice. Most embedding models are trained to optimize for it, and it is insensitive to vector magnitude, which makes it robust.
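A quick way to see the magnitude insensitivity is to compare a vector with scaled, perpendicular, and reversed versions of itself. The vectors here are hand-picked toy values, not real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
scaled = 10 * a                          # same direction, ten times longer
orthogonal = np.array([3.0, 0.0, -1.0])  # dot product with a is zero
opposite = -a                            # exactly the reverse direction

print(round(cosine_similarity(a, scaled), 6))      # 1.0
print(round(cosine_similarity(a, orthogonal), 6))  # 0.0
print(round(cosine_similarity(a, opposite), 6))    # -1.0
```

Scaling a vector changes its length but not its angle, so the score stays at 1.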

Dot Product

def dot_product(a, b):
    return np.dot(a, b)

Faster than cosine (no normalization step). If your vectors are already L2-normalized (unit vectors), dot product equals cosine similarity. Many production systems pre-normalize vectors at indexing time and use dot product at query time for speed.
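The normalize-once, dot-product-per-query pattern might look like the sketch below. It is a brute-force search over a small in-memory matrix; build_index and top_k are illustrative names, and a production system would typically hand this work to a vector database or an approximate-nearest-neighbor library.

```python
import numpy as np

def build_index(vectors: np.ndarray) -> np.ndarray:
    """Normalize each row to unit length once, at indexing time."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def top_k(index: np.ndarray, query: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
    """Return (row, score) for the k rows most similar to the query."""
    unit_query = query / np.linalg.norm(query)
    scores = index @ unit_query        # dot product equals cosine on unit vectors
    best = np.argsort(-scores)[:k]     # highest score first
    return [(int(i), float(scores[i])) for i in best]

# Toy 2-dimensional "document" vectors with different magnitudes.
docs = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
index = build_index(docs)

results = top_k(index, np.array([6.0, 8.0]), k=2)
print(results)  # row 0 first (score ~1.0), then row 2 (score ~0.8)
```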

Euclidean Distance (L2)

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

Measures straight-line distance. Smaller means more similar. Less common for text embeddings but used in some clustering and anomaly detection scenarios.
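One relationship is worth knowing here. For unit-length vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so ranking by smallest distance and ranking by largest cosine give the same order. A small numeric check, using random vectors as stand-ins for embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=16)
b = rng.normal(size=16)
a = a / np.linalg.norm(a)   # make both vectors unit length
b = b / np.linalg.norm(b)

squared_distance = np.linalg.norm(a - b) ** 2
cosine = np.dot(a, b)

print(np.isclose(squared_distance, 2 - 2 * cosine))  # True
```

The metrics only disagree when vectors have different magnitudes.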

When to use what:
  Cosine similarity  -> default, works everywhere, most models optimize for it
  Dot product        -> when vectors are pre-normalized, slightly faster
  Euclidean distance -> clustering, anomaly detection, when magnitude matters

Practical Considerations

Chunking Long Documents

Embedding models have context windows. If your document exceeds the window, you must split it into chunks. This is not trivial — a bad chunking strategy produces bad embeddings.

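# Naive chunking: split every N words or tokens (loses context at the boundaries)
# Better: split on paragraph or section boundaries
# Best: overlap chunks so boundary information is not lost

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of chunk_size words; neighbors share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()  # whitespace words, a rough stand-in for model tokens
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks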
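The "split on paragraph boundaries" option can be sketched as follows. It assumes paragraphs are separated by blank lines and counts whitespace-separated words rather than model tokens. The chunk_by_paragraph name and the max_words default are illustrative choices.

```python
def chunk_by_paragraph(text: str, max_words: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most `max_words` words.

    A single paragraph longer than `max_words` becomes its own oversized
    chunk; a fuller version would split it further.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "a b c\n\nd e\n\nf g h i"
print(chunk_by_paragraph(sample, max_words=5))
# ['a b c\n\nd e', 'f g h i']
```

Because no chunk ends mid-paragraph, each embedding covers a complete thought, at the cost of uneven chunk sizes.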

Batch Processing

Embedding one document at a time is wasteful. Batch your requests:

# OpenAI supports batches of up to 2048 inputs
texts = ["doc one", "doc two", "doc three"]  # ... plus the rest of your documents
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts  # send all at once
)
vectors = [item.embedding for item in response.data]
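When you have more inputs than one request allows, split them into batches first. The sketch below keeps the API call out of the helper: embed_batch is a hypothetical callable you supply (for example, a small wrapper around client.embeddings.create) that takes a list of strings and returns one vector per string, in order.

```python
from typing import Callable, Iterator, Sequence

def batched(items: Sequence[str], batch_size: int = 2048) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(texts: Sequence[str],
              embed_batch: Callable[[Sequence[str]], list],
              batch_size: int = 2048) -> list:
    """Embed every text, one request per batch, preserving input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Stand-in embedder for demonstration: one fake "vector" per text.
fake_embed = lambda batch: [[float(len(t))] for t in batch]
print(embed_all(["a", "bb", "ccc"], fake_embed, batch_size=2))
# [[1.0], [2.0], [3.0]]
```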

Embedding Quality Depends on the Domain

General-purpose embedding models work well for general text. If your domain is specialized (medical records, legal contracts, semiconductor datasheets), you may need to fine-tune an embedding model or choose one trained on domain-specific data. Test this before assuming: general models are often better than you expect on specialized text. Fine-tune only when you have measured a quality gap on your actual queries.
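One way to measure that gap is recall@k on a small labeled set: for each real query, note which document should come back, then check how often it lands in the top k. The sketch below assumes exactly one relevant document per query, which is a simplification. The recall_at_k name and the toy vectors are illustrative; in practice the vectors come from the model you are evaluating.

```python
import numpy as np

def recall_at_k(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                relevant_doc: list[int], k: int = 5) -> float:
    """Fraction of queries whose labeled document ranks in the top k by cosine."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = q @ d.T                                # (n_queries, n_docs)
    top = np.argsort(-scores, axis=1)[:, :k]        # best k doc indices per query
    hits = [relevant_doc[i] in top[i] for i in range(len(relevant_doc))]
    return float(np.mean(hits))

# Toy data: three documents, two queries, hand-labeled relevant documents.
docs = np.eye(3)
queries = np.array([[1.0, 0.1, 0.0],    # closest to doc 0
                    [0.0, 1.0, 0.2]])   # closest to doc 1, then doc 2
labels = [0, 2]

print(recall_at_k(queries, docs, labels, k=1))  # 0.5
print(recall_at_k(queries, docs, labels, k=2))  # 1.0
```

Run the same labeled set through each candidate model and compare the numbers before deciding whether fine-tuning is worth the effort.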

Common Pitfalls

  • Using keyword search when you need semantic search, or vice versa. Embeddings are great at "find me something with similar meaning." They are not great at exact match. If a user searches for invoice number INV-2024-0847, keyword search wins.
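  • Ignoring the context window. Feeding a 10,000-word document to a model with a 512-token window does not do what you hope. Depending on the library or API, the input is either silently truncated (you get an embedding of the first 512 tokens only) or rejected with an error.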
  • Mixing embedding models. Vectors from different models live in different spaces. You cannot compare an OpenAI embedding to a sentence-transformers embedding. Pick one model and stick with it (or re-embed everything when you switch).
  • Not normalizing before dot product. If your vectors are not unit-length and you use dot product instead of cosine similarity, vectors with larger magnitude will score higher regardless of relevance.
  • Assuming embeddings are stable across model versions. Model updates can change the embedding space, so vectors from the old and new versions may not be comparable. Pin your model version and re-embed when you upgrade.
  • Embedding too much at once. A single embedding for a 50-page document is a blurry summary. Chunk it and embed the chunks separately for better retrieval granularity.
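The model-mixing pitfall is easy to guard against in code if you store the model name next to every vector. A minimal sketch, where StoredEmbedding and safe_cosine are illustrative names rather than part of any library:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class StoredEmbedding:
    model: str          # e.g. the pinned model name and version
    vector: np.ndarray

def safe_cosine(a: StoredEmbedding, b: StoredEmbedding) -> float:
    """Cosine similarity that refuses to compare vectors from different models."""
    if a.model != b.model:
        raise ValueError(f"cannot compare {a.model!r} with {b.model!r}")
    return float(np.dot(a.vector, b.vector)
                 / (np.linalg.norm(a.vector) * np.linalg.norm(b.vector)))

x = StoredEmbedding("model-a", np.array([1.0, 0.0]))
y = StoredEmbedding("model-a", np.array([1.0, 1.0]))
z = StoredEmbedding("model-b", np.array([1.0, 0.0]))

print(round(safe_cosine(x, y), 4))  # 0.7071
# safe_cosine(x, z) raises ValueError because the model names differ
```

The same stored model name tells you which rows need re-embedding after an upgrade.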

Key Takeaways

  • An embedding is a dense vector of floats that encodes semantic meaning. Similar inputs produce nearby vectors.
  • Cosine similarity is the default metric. Use dot product if you pre-normalize.
  • Choose dimensions based on your quality-vs-cost tradeoff. 768 to 1536 covers most use cases.
  • Chunking strategy matters as much as model choice. Bad chunks produce bad embeddings.
  • Do not mix vectors from different models. They live in incompatible spaces.
  • Open-source models (sentence-transformers, BGE, Nomic) are production-ready and eliminate API dependency.
  • Batch your embedding requests. One request per document wastes network round trips and throughput.
  • Test general-purpose models on your domain before investing in fine-tuning.