Traditional databases find exact matches: “find all users where email = alice@example.com.” Vector databases find similar matches: “find documents most similar to this question.” This capability powers every RAG pipeline, semantic search engine, recommendation system, and image similarity feature built with AI.

What Are Embeddings?

An embedding is a list of numbers (a vector) that represents the meaning of text, images, or any data. Similar meanings produce similar vectors. The magic is that “How do I reset my password?” and “I forgot my login credentials” produce nearby vectors, even though they share no words.

# Generate embeddings with OpenAI or sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "What is the weather today?",
]

embeddings = model.encode(texts)
# embeddings[0].shape = (384,)  # 384-dimensional vector

# Similarity between password questions: ~0.85 (very similar)
# Similarity between password and weather: ~0.12 (very different)

from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity([embeddings[0]], [embeddings[1]]))  # ~0.85
print(cosine_similarity([embeddings[0]], [embeddings[2]]))  # ~0.12

How Similarity Search Works

Given a query vector, find the K nearest vectors in a database of millions. The naive approach (compare against every vector) is O(n) and too slow. Vector databases use approximate nearest neighbor (ANN) algorithms.

HNSW: The Algorithm Behind Most Vector DBs

HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each layer is progressively sparser. Search starts at the top layer (coarse navigation) and descends to lower layers (fine-grained search).

# Conceptual HNSW structure:
# Layer 2 (sparse):   A ---- D ---- G
# Layer 1 (medium):   A -- B -- D -- F -- G
# Layer 0 (dense):    A-B-C-D-E-F-G-H-I-J

# Search for a vector near E:
# 1. Start at layer 2: jump to closest node (D)
# 2. Drop to layer 1: navigate D -> F or D -> B
# 3. Drop to layer 0: navigate to E (found!)

# Time complexity: O(log n) instead of O(n)
# Accuracy: 95-99% recall (misses ~1-5% of true nearest neighbors)
# Trade-off: more memory for higher recall

Vector Database Options

Database Type Best For Pricing
pgvector PostgreSQL extension Small-medium datasets, existing PG users Free (self-hosted)
ChromaDB Embedded / client-server Prototyping, small RAG apps Free (open source)
Pinecone Managed cloud Production at scale, zero ops Pay per use
Weaviate Self-hosted / cloud Multi-modal (text + images) Free (self-hosted) / paid cloud
Qdrant Self-hosted / cloud High performance, filtering Free (self-hosted) / paid cloud
Milvus Self-hosted / cloud Billion-scale datasets Free (self-hosted) / paid cloud

pgvector: Start Here

# Install pgvector extension
CREATE EXTENSION vector;

# Create a table with a vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding vector(384)   -- 384 dimensions
);

# Insert a document with its embedding
INSERT INTO documents (title, content, embedding)
VALUES ('Password Reset', 'How to reset your password...',
        '[0.1, -0.3, 0.5, ...]');  -- 384 floats

# Find the 5 most similar documents
SELECT id, title, embedding <=> '[0.2, -0.1, 0.4, ...]' AS distance
FROM documents
ORDER BY embedding <=> '[0.2, -0.1, 0.4, ...]'  -- cosine distance
LIMIT 5;

# Create an HNSW index for fast search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

# With index: searches 1M vectors in ~5ms
# Without index: searches 1M vectors in ~500ms

Chunking Strategies for RAG

Documents must be split into chunks before embedding. Chunk size dramatically affects retrieval quality.

# Strategy 1: Fixed-size chunks (simple, often good enough)
def chunk_by_size(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # Overlap prevents splitting mid-sentence
    return chunks

# Strategy 2: Semantic chunking (split on headings/paragraphs)
def chunk_by_structure(text: str) -> list[str]:
    # Split on markdown headers or double newlines
    import re
    sections = re.split(r'\n#{1,3} |\n\n', text)
    return [s.strip() for s in sections if len(s.strip()) > 50]

# Strategy 3: Recursive chunking (LangChain approach)
# Split on paragraphs first, then sentences, then words
# Keep chunks under max_size while preserving semantic boundaries

# Chunk size guidelines:
# Too small (< 100 tokens): loses context, retrieval misses meaning
# Too large (> 1000 tokens): dilutes relevance, wastes context window
# Sweet spot: 200-500 tokens with 10-20% overlap

Complete RAG Pipeline

import chromadb
from sentence_transformers import SentenceTransformer

# Setup
embedder = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.PersistentClient(path="./vectordb")
collection = client.get_or_create_collection(
    "docs",
    metadata={"hnsw:space": "cosine"}
)

# Index documents
def index_documents(docs: list[dict]):
    for doc in docs:
        chunks = chunk_by_size(doc["content"])
        embeddings = embedder.encode(chunks).tolist()
        collection.add(
            ids=[f"{doc['id']}_chunk_{i}" for i in range(len(chunks))],
            embeddings=embeddings,
            documents=chunks,
            metadatas=[{"source": doc["title"], "chunk": i} for i in range(len(chunks))],
        )

# Query: find relevant chunks
def search(query: str, top_k: int = 5) -> list[str]:
    query_embedding = embedder.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )
    return results["documents"][0]

# Generate answer with context
def rag_answer(question: str) -> str:
    context_chunks = search(question, top_k=5)
    context = "\n\n".join(context_chunks)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        system="Answer using ONLY the provided context. Cite sources.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }],
    )
    return response.content[0].text

When You Need a Vector Database

  • RAG pipeline: Retrieve relevant documents to ground LLM responses
  • Semantic search: Search by meaning, not just keywords
  • Recommendation engine: Find similar products, articles, or users
  • Image similarity: Reverse image search, duplicate detection
  • Anomaly detection: Find data points that are far from any cluster

When You Do NOT Need One

  • Less than 10,000 documents: Brute-force cosine similarity in NumPy is fast enough
  • Keyword search is sufficient: Elasticsearch with BM25 handles keyword queries well
  • Exact match only: Regular database with full-text search
  • Already using PostgreSQL: pgvector extension avoids adding a new database

Key Takeaways

  • Embeddings convert meaning to numbers — similar meanings produce nearby vectors
  • HNSW is the dominant algorithm for approximate nearest neighbor search — O(log n) with 95-99% recall
  • Start with pgvector if you already use PostgreSQL — it handles millions of vectors well
  • Chunk size matters for RAG: 200-500 tokens with overlap is the sweet spot
  • Use managed services (Pinecone) for production at scale — self-hosting vector databases requires tuning
  • You might not need a vector database — for small datasets, NumPy cosine similarity works fine
  • Combine vector search with keyword search (hybrid search) for best results

Vector databases are infrastructure, not magic. They store numbers and find nearest neighbors efficiently. The magic is in the embeddings — how you convert your data into meaningful vectors. Get the embeddings and chunking right, and any vector database will serve you well. Get them wrong, and the fanciest database cannot save your search quality.