Vector Databases
You have millions of embedding vectors. A user sends a query, you embed it, and you need to find the 10 most similar vectors out of those millions — in under 100 milliseconds. A brute-force comparison against every vector is O(n), which is too slow at scale. Vector databases solve this with specialized indexing algorithms that trade a small amount of accuracy for orders-of-magnitude speed improvement.
The Landscape
The vector database market exploded in 2023-2024. Some options are purpose-built for vectors, others are extensions of existing databases.
Purpose-Built Vector Databases
Database Deployment Notes
Pinecone Managed (cloud only) Simplest to operate. No self-hosted option.
Weaviate Managed or self-hosted Hybrid search (vector + keyword). GraphQL API.
Qdrant Managed or self-hosted Rust-based, fast. Good filtering.
Milvus Self-hosted or Zilliz Apache 2.0. Handles billions of vectors.
Chroma Self-hosted (or cloud) Developer-friendly. Popular with LLM apps.
Vector Extensions for Existing Databases
Extension Database Notes
pgvector PostgreSQL SQL interface. No new infrastructure.
pgvectorscale PostgreSQL Timescale's optimized fork of pgvector.
Atlas Vector MongoDB Native vector search in MongoDB.
OpenSearch k-NN OpenSearch If you already run OpenSearch.
Redis VSS Redis In-memory, very fast, limited persistence story.
pgvector: When It Is Enough
If you are already running PostgreSQL, pgvector eliminates an entire category of infrastructure. You store vectors alongside your relational data in the same database, query them with SQL, and get transactional consistency for free.
# Install pgvector extension (run once)
# CREATE EXTENSION vector;
import psycopg2
import numpy as np
conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()
# Create a table with a vector column
cur.execute("""
CREATE TABLE IF NOT EXISTS documents (
id SERIAL PRIMARY KEY,
title TEXT,
content TEXT,
category TEXT,
embedding vector(1536)
)
""")
# Insert a document with its embedding
embedding = np.random.rand(1536).tolist() # replace with real embedding
cur.execute(
"INSERT INTO documents (title, content, category, embedding) VALUES (%s, %s, %s, %s)",
("Query optimization guide", "Full text here...", "engineering", str(embedding))
)
# Find the 10 most similar documents using cosine distance
query_embedding = np.random.rand(1536).tolist()
cur.execute("""
SELECT id, title, 1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT 10
""", (str(query_embedding), str(query_embedding)))
results = cur.fetchall()
conn.commit()
pgvector supports three distance operators:
<=> Cosine distance
<-> L2 (Euclidean) distance
<#> Negative inner product (for dot product similarity)
When pgvector Is Enough
pgvector handles millions of vectors comfortably on modern hardware. Benchmarks consistently show sub-100ms query times for 1-5 million vectors with proper indexing. For many production applications — RAG systems, semantic search within a product, recommendation engines for mid-size catalogs — this is more than sufficient.
The real advantages:
- One database to operate. No new infrastructure, no new backup strategy, no new monitoring.
- Joins with relational data. Filter by user, tenant, category, date — all in one SQL query.
- ACID transactions. Your vectors and metadata are always consistent.
- Existing tooling. pgdump, replication, connection pooling — it all works.
When You Need a Dedicated Vector Database
- Hundreds of millions to billions of vectors. pgvector starts to struggle. Purpose-built databases like Milvus are designed for this scale.
- Sub-10ms latency requirements. In-memory vector databases (Qdrant, Redis VSS) can hit microsecond-level latencies.
- Distributed vector search. Sharding vectors across a cluster with automatic rebalancing.
- Advanced features. Multi-tenancy, role-based access, built-in reranking, hybrid search scoring.
Indexing Algorithms
The magic of vector databases is approximate nearest neighbor (ANN) search. Instead of comparing your query against every vector, ANN algorithms build an index structure that narrows the search space.
HNSW (Hierarchical Navigable Small World)
The most popular algorithm. Builds a multi-layer graph where each node is a vector and edges connect nearby vectors. Search starts at the top layer (sparse, long-range connections) and descends to lower layers (dense, short-range connections).
Layer 3: A ---- D (few nodes, long jumps)
Layer 2: A -- C -- D -- F (more nodes, medium jumps)
Layer 1: A - B - C - D - E - F - G (all nodes, short jumps)
Layer 0: [all vectors, fine-grained]
HNSW gives excellent recall (95-99% of true nearest neighbors) with fast query times. The tradeoff is memory: the index must fit in RAM.
# pgvector HNSW index
# m = max connections per node (higher = better recall, more memory)
# ef_construction = search width during build (higher = better recall, slower build)
cur.execute("""
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
""")
IVF (Inverted File Index)
Partitions vectors into clusters (using k-means), then only searches the clusters nearest to the query vector. Faster to build than HNSW, uses less memory, but lower recall at the same speed.
# pgvector IVF index
# lists = number of clusters (sqrt(n) is a common starting point)
cur.execute("""
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100)
""")
# At query time, set the number of clusters to search (probes)
cur.execute("SET ivfflat.probes = 10") # search 10 of 100 clusters
Algorithm Build Speed Query Speed Memory Recall Best For
HNSW Slow Fast High High Most use cases
IVF Fast Medium Medium Medium Large datasets, memory-constrained
For most applications, HNSW is the right default. IVF is useful when you have very large datasets and cannot afford the memory overhead of HNSW.
Filtering with Metadata
Real queries are rarely "find the 10 most similar vectors." They are "find the 10 most similar vectors that belong to this tenant, were created in the last 30 days, and are in the engineering category."
Pre-filtering vs Post-filtering
Pre-filtering: Filter first, then search vectors within the filtered set.
Accurate, but slow if the filter is very selective (small result set).
Post-filtering: Search vectors first, then filter results.
Fast, but may return fewer than K results if many get filtered out.
Most vector databases use a hybrid approach. pgvector lets you combine SQL WHERE clauses with vector search:
cur.execute("""
SELECT id, title, 1 - (embedding <=> %s::vector) AS similarity
FROM documents
WHERE category = 'engineering'
AND created_at > NOW() - INTERVAL '30 days'
ORDER BY embedding <=> %s::vector
LIMIT 10
""", (str(query_embedding), str(query_embedding)))
Dedicated vector databases handle this differently. Qdrant, for example, uses payload filtering:
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
from datetime import datetime, timedelta
client = QdrantClient("localhost", port=6333)
results = client.search(
collection_name="documents",
query_vector=query_embedding,
query_filter=Filter(
must=[
FieldCondition(key="category", match=MatchValue(value="engineering")),
FieldCondition(
key="created_at",
range=Range(gte=(datetime.now() - timedelta(days=30)).isoformat())
),
]
),
limit=10
)
Production Considerations
Sizing & Performance
Rule of thumb for pgvector with HNSW:
- 1M vectors at 1536 dims: ~6 GB vectors + ~2 GB index = ~8 GB RAM
- 5M vectors at 1536 dims: ~30 GB vectors + ~10 GB index = ~40 GB RAM
- Query latency: 5-50ms depending on index params and hardware
Rule of thumb for Qdrant / Milvus:
- Same vectors, ~30-50% less memory with quantization
- Query latency: 1-10ms
- Can shard across multiple nodes
Quantization
Storing full float32 vectors is expensive. Quantization reduces each float to fewer bits:
float32: 4 bytes per dimension (full precision)
float16: 2 bytes per dimension (minimal quality loss)
int8: 1 byte per dimension (noticeable on edge cases)
binary: 1 bit per dimension (fast but lossy, good for re-ranking stage)
pgvector supports halfvec (float16) natively. Qdrant and Milvus support scalar and product quantization.
Reindexing
When you add a significant number of vectors (say, 20%+ of the collection), HNSW indexes degrade. You need to rebuild the index periodically. With pgvector, this means REINDEX INDEX CONCURRENTLY. With managed vector databases, this is handled automatically.
Common Pitfalls
- Choosing a dedicated vector database before you need one. pgvector in PostgreSQL handles millions of vectors. Start there. Migrate when you hit a real scaling wall, not a theoretical one.
- No index on the vector column. Without an ANN index, every query is a full table scan. This is the number one performance complaint from pgvector users, and it is always a missing index.
- Wrong distance metric. If you build an HNSW index with cosine distance but query with L2 distance, the index is useless and the database falls back to sequential scan.
- Ignoring index build time. HNSW indexes on millions of vectors take minutes to hours to build. Plan for this during initial load and migrations.
- Filtering after vector search. If you search for top 10 and then filter, you might end up with 2 results. Over-fetch (top 50) and filter, or use a database that supports pre-filtering.
- Not monitoring recall. ANN search is approximate. Periodically spot-check results against brute-force search to verify your index parameters give acceptable recall.
Key Takeaways
- pgvector in PostgreSQL is the right starting point for most applications. It handles millions of vectors, offers SQL joins with relational data, and requires no new infrastructure.
- Purpose-built vector databases (Pinecone, Qdrant, Milvus) earn their place at hundreds of millions of vectors, sub-10ms latency requirements, or when you need distributed search.
- HNSW is the default indexing algorithm. It gives the best recall-speed tradeoff for most workloads.
- Always create an ANN index. Without one, every search is a full scan.
- Metadata filtering is a first-class concern. Most real queries combine vector similarity with structured filters.
- Quantization (float16, int8) cuts memory usage substantially with minimal quality loss.