Vector Databases Explained

Understand how vector databases enable semantic search for RAG systems. Compare Chroma, Pinecone, Weaviate, and FAISS with practical examples.

20 min read · Vector Database · Chroma · Pinecone · FAISS

In the previous lesson, you learned that RAG systems need to search through documents to find relevant information. But how do you search for meaning rather than just keywords?

That's where vector databases come in.

The Problem with Traditional Databases

Traditional databases are great for exact matches, but terrible at understanding meaning:

python
# Traditional SQL database
query = "SELECT * FROM documents WHERE content LIKE '%refund policy%'"

# ❌ Problems:
# - Misses synonyms: "return policy", "money back guarantee"
# - Misses related concepts: "customer satisfaction", "warranty"
# - No semantic understanding

Keyword Search Limitations

python
documents = [
    "Our refund policy allows 30-day returns",
    "We offer a money-back guarantee within one month",
    "Customer satisfaction is our priority with full reimbursement"
]

user_query = "What's your return policy?"

# Traditional keyword search: look for a literal phrase from the query
def keyword_search(query, docs):
    phrase = "return policy"  # naive keyword extraction from the query
    return [doc for doc in docs if phrase in doc.lower()]

results = keyword_search(user_query, documents)
print(results)
# Output: [] ❌ Misses all relevant documents — none contains the literal phrase!

What is a Vector Database?

A vector database stores data as high-dimensional vectors (embeddings) that represent semantic meaning, enabling similarity search.

Vector Database Definition: A specialized database optimized for storing, indexing, and searching high-dimensional vector embeddings, enabling fast similarity searches based on semantic meaning rather than exact keyword matches.

How It Works

python
# Pseudocode: embedding_model and vector_db stand in for any
# embedding model and vector store (concrete examples follow below)

# 1. Convert text to vectors (embeddings)
text = "refund policy"
vector = embedding_model.embed(text)
print(vector[:5])  # [0.023, -0.145, 0.389, 0.012, -0.234]
# Full shape: (1536,) - a 1536-dimensional vector

# 2. Store vectors in database
vector_db.add(
    id="doc1",
    vector=vector,
    metadata={"text": text, "source": "policy.pdf"}
)

# 3. Search by similarity
query_vector = embedding_model.embed("return policy")
results = vector_db.search(query_vector, top_k=3)
# Returns most similar documents ✅

Semantic Search Example

python
from openai import OpenAI
import numpy as np

client = OpenAI(api_key="your-api-key")

def get_embedding(text):
    """Convert text to vector"""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Our documents
documents = [
    "Our refund policy allows 30-day returns",
    "We offer a money-back guarantee within one month",
    "Customer satisfaction is our priority with full reimbursement",
    "Contact us at support@example.com"
]

# Create embeddings
doc_embeddings = [get_embedding(doc) for doc in documents]

# User query
query = "What's your return policy?"
query_embedding = get_embedding(query)

Now find the most similar documents using cosine similarity:

<Callout type="info">
**Cosine Similarity Definition:** A metric that measures the similarity between two vectors by calculating the cosine of the angle between them, ranging from -1 to 1, where 1 indicates identical direction regardless of magnitude.
</Callout>

python
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [
    cosine_similarity(query_embedding, doc_emb)
    for doc_emb in doc_embeddings
]

# Get top results
top_idx = np.argsort(similarities)[::-1][:3]
for idx in top_idx:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")

# Output:
# Score: 0.892 - Our refund policy allows 30-day returns ✅
# Score: 0.856 - We offer a money-back guarantee within one month ✅
# Score: 0.823 - Customer satisfaction is our priority with full reimbursement ✅

Key Insight: Vector databases understand that "refund", "return", "money-back", and "reimbursement" all refer to similar concepts, even though they use different words.
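
You can check this with the helpers defined above. A minimal sketch (the exact scores will vary by embedding model, but the ordering should hold):

python
from itertools import combinations

# Compare a few words pairwise; synonyms should score far higher than unrelated terms
words = ["refund", "return", "reimbursement", "pizza"]
vectors = {w: get_embedding(w) for w in words}

for a, b in combinations(words, 2):
    print(f"{a} vs {b}: {cosine_similarity(vectors[a], vectors[b]):.3f}")
# Expect the refund/return/reimbursement pairs to cluster together,
# and every pair involving "pizza" to score noticeably lower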

Popular Vector Databases

1. Chroma - Easy to Start

Best for: Development, prototyping, small projects

Chroma Definition: An open-source, embedded vector database that runs locally without external dependencies, designed for easy prototyping and development of RAG applications with simple setup and Python-native API.

python
# Install
# pip install chromadb

import chromadb
from chromadb.utils import embedding_functions

# Create client
client = chromadb.Client()  # in-memory by default; see the persistence note below

# Create collection
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection(
    name="my_documents",
    embedding_function=openai_ef
)

# Add documents
collection.add(
    documents=[
        "Our refund policy allows 30-day returns",
        "We offer a money-back guarantee",
        "Contact us at support@example.com"
    ],
    metadatas=[
        {"source": "policy.pdf", "page": 1},
        {"source": "policy.pdf", "page": 2},
        {"source": "contact.txt", "page": 1}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query
results = collection.query(
    query_texts=["What's your return policy?"],
    n_results=2
)

print(results['documents'])
# [["Our refund policy allows 30-day returns",
#   "We offer a money-back guarantee"]]

Chroma Features:

  • ✅ Runs locally, no server needed
  • ✅ Built-in embedding functions
  • ✅ Persistent storage
  • ✅ Simple API
  • ❌ Not ideal for production scale
  • ❌ Limited advanced features
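
One note on the "persistent storage" bullet: chromadb.Client() is in-memory, so data disappears on restart. For durable storage, Chroma ships a persistent client; a minimal sketch (the ./chroma_db path is just an example):

python
import chromadb

# Persistent client: writes to disk, so collections survive restarts
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="my_documents",
    embedding_function=openai_ef  # reuse the embedding function from above
)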

2. Pinecone - Production Ready

Best for: Production applications, large scale

Pinecone Definition: A fully managed, cloud-native vector database service that provides high-performance similarity search at scale, handling billions of vectors with low latency and minimal operational overhead.

python
# Install
# pip install pinecone-client

from pinecone import Pinecone, ServerlessSpec

# Initialize
pc = Pinecone(api_key="your-api-key")

# Create index
index_name = "rag-documents"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # text-embedding-3-small dimension
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

index = pc.Index(index_name)

# Prepare data
from openai import OpenAI
client = OpenAI(api_key="your-openai-key")

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

documents = [
    "Our refund policy allows 30-day returns",
    "We offer a money-back guarantee",
    "Contact us at support@example.com"
]

# Upsert vectors
vectors = []
for i, doc in enumerate(documents):
    vectors.append({
        "id": f"doc{i}",
        "values": get_embedding(doc),
        "metadata": {"text": doc, "source": "policy.pdf"}
    })

index.upsert(vectors=vectors)

# Query
query_embedding = get_embedding("What's your return policy?")
results = index.query(
    vector=query_embedding,
    top_k=2,
    include_metadata=True
)

for match in results['matches']:
    print(f"Score: {match['score']:.3f}")
    print(f"Text: {match['metadata']['text']}\n")

Pinecone Features:

  • ✅ Fully managed cloud service
  • ✅ Scales to billions of vectors
  • ✅ Low query latency at scale
  • ✅ Namespace support
  • ✅ Filtering and metadata
  • ❌ Requires API key (paid after free tier)
  • ❌ Cloud-only (no local option)
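
The "namespace support" bullet is worth a quick illustration: namespaces partition a single index so each tenant's vectors are stored and searched in isolation. A sketch reusing the code above (the "tenant-a" name is hypothetical):

python
# Upsert into a namespace; vectors in different namespaces never mix
index.upsert(vectors=vectors, namespace="tenant-a")

# Queries only see vectors in the namespace they target
results = index.query(
    vector=query_embedding,
    top_k=2,
    namespace="tenant-a",
    include_metadata=True
)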

3. Weaviate - Open Source Powerhouse

Best for: Self-hosted, hybrid search, GraphQL fans

Weaviate Definition: An open-source vector database, available self-hosted or as a managed cloud service, that combines vector search with keyword (BM25) search and exposes a GraphQL API.

python
# Install
# pip install weaviate-client

import weaviate

# Connect to Weaviate (v3 client, matching the schema/query API used below)
client = weaviate.Client(
    url="your-cluster-url",
    auth_client_secret=weaviate.AuthApiKey("your-api-key"),
    additional_headers={"X-OpenAI-Api-Key": "your-openai-key"},  # required by text2vec-openai
)

# Create schema
schema = {
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            "model": "text-embedding-3-small"
        }
    },
    "properties": [
        {
            "name": "content",
            "dataType": ["text"]
        },
        {
            "name": "source",
            "dataType": ["string"]
        }
    ]
}

client.schema.create_class(schema)

# Add documents
documents = [
    {"content": "Our refund policy allows 30-day returns", "source": "policy.pdf"},
    {"content": "We offer a money-back guarantee", "source": "policy.pdf"},
    {"content": "Contact us at support@example.com", "source": "contact.txt"}
]

with client.batch as batch:
    for doc in documents:
        batch.add_data_object(
            data_object=doc,
            class_name="Document"
        )

# Query
result = client.query.get(
    "Document",
    ["content", "source"]
).with_near_text({
    "concepts": ["return policy"]
}).with_limit(2).do()

print(result)

Weaviate Features:

  • ✅ Self-hosted or cloud
  • ✅ Hybrid search (vector + keyword)
  • ✅ GraphQL API
  • ✅ Multi-tenancy support
  • ✅ Built-in vectorizers
  • ❌ More complex setup
  • ❌ Steeper learning curve

4. FAISS - Facebook's Library

Best for: Research, maximum control, no external dependencies

FAISS Definition: Facebook AI Similarity Search, an open-source library (not a full database) for efficient similarity search and clustering of dense vectors, with CPU and GPU implementations.

python
# Install
# pip install faiss-cpu

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Prepare documents
documents = [
    "Our refund policy allows 30-day returns",
    "We offer a money-back guarantee",
    "Contact us at support@example.com"
]

# Create embeddings
embeddings = np.array([get_embedding(doc) for doc in documents]).astype('float32')

# Create FAISS index
dimension = embeddings.shape[1]  # 1536
index = faiss.IndexFlatL2(dimension)  # exact search using L2 (Euclidean) distance

# Add vectors
index.add(embeddings)

print(f"Total vectors: {index.ntotal}")

# Search
query = "What's your return policy?"
query_vector = np.array([get_embedding(query)]).astype('float32')

# Find 2 nearest neighbors
k = 2
distances, indices = index.search(query_vector, k)

print("\nTop results:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. Distance: {distances[0][i]:.3f}")
    print(f"   Text: {documents[idx]}\n")

# Save index
faiss.write_index(index, "documents.index")

# Load index
loaded_index = faiss.read_index("documents.index")
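
IndexFlatL2 ranks by Euclidean distance. If you want cosine similarity, as in the earlier example, the usual FAISS pattern is an inner-product index over L2-normalized vectors, since the inner product of unit vectors equals their cosine. A minimal sketch reusing the arrays above:

python
# Cosine similarity via inner product on normalized vectors
faiss.normalize_L2(embeddings)           # normalizes in place
ip_index = faiss.IndexFlatIP(dimension)  # inner product == cosine on unit vectors
ip_index.add(embeddings)

faiss.normalize_L2(query_vector)
scores, indices = ip_index.search(query_vector, k)  # higher score = more similar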

Advanced FAISS: Faster Search

python
# For large datasets, use IVF (Inverted File Index)
# Note: training needs a representative sample of at least `nlist` vectors,
# so this only makes sense with far more data than the toy set above
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train the index
index.train(embeddings)

# Add vectors
index.add(embeddings)

# Search (faster for large datasets)
index.nprobe = 10  # Number of clusters to search
distances, indices = index.search(query_vector, k)

FAISS Features:

  • ✅ Extremely fast
  • ✅ No external services
  • ✅ Advanced indexing algorithms
  • ✅ Free and open source
  • ✅ CPU and GPU support
  • ❌ No built-in embeddings
  • ❌ Manual metadata management
  • ❌ No built-in persistence

FAISS Tip: Use IndexFlatL2 for small datasets (<100k vectors). For larger datasets, use IndexIVFFlat or IndexIVFPQ for faster searches with approximate results.

Vector Database Comparison

| Feature | Chroma | Pinecone | Weaviate | FAISS |
|---|---|---|---|---|
| Deployment | Local/Embedded | Cloud only | Self-hosted/Cloud | Local library |
| Ease of use | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Scalability | Small-Medium | Billions | Large | Custom |
| Cost | Free | Free tier + paid | Free (self-hosted) | Free |
| Managed | No | Yes | Optional | No |
| Metadata filter | Yes | Yes | Yes | Manual |
| Hybrid search | No | No | Yes | No |
| Best for | Development | Production | Flexibility | Research |

Choosing the Right Vector Database

Decision Tree

python
def choose_vector_db(project_stage, budget=0, self_hosted_required=False,
                     hybrid_search_needed=False, need_maximum_control=False,
                     research_project=False):
    """Choose the right vector database for your needs."""

    # Just getting started?
    if project_stage == "prototype":
        return "Chroma"  # Easy, local, fast to set up

    # Production with budget?
    if project_stage == "production" and budget > 0:
        return "Pinecone"  # Managed, scalable, reliable

    # Need self-hosting or hybrid search?
    if self_hosted_required or hybrid_search_needed:
        return "Weaviate"  # Flexible, powerful features

    # Maximum performance, custom solution?
    if need_maximum_control or research_project:
        return "FAISS"  # Fast, customizable, free

    # Default: Start simple
    return "Chroma"

Real-World Scenarios

python
# Scenario 1: Startup MVP
scenario_1 = {
    "project": "Document Q&A MVP",
    "docs": "10K documents",
    "users": "100",
    "budget": "$0",
    "recommendation": "Chroma",
    "reason": "Free, easy to set up, sufficient for early stage"
}

# Scenario 2: Growing SaaS
scenario_2 = {
    "project": "Customer support chatbot",
    "docs": "1M+ documents",
    "users": "10K concurrent",
    "budget": "$500/month",
    "recommendation": "Pinecone",
    "reason": "Scales automatically, managed service, reliable"
}

# Scenario 3: Enterprise
scenario_3 = {
    "project": "Internal knowledge base",
    "docs": "500K documents",
    "users": "5K employees",
    "budget": "Self-host preferred",
    "recommendation": "Weaviate",
    "reason": "Full control, hybrid search, on-premises"
}

# Scenario 4: Research
scenario_4 = {
    "project": "Academic research on embeddings",
    "docs": "Variable",
    "users": "Researchers",
    "budget": "$0",
    "recommendation": "FAISS",
    "reason": "Maximum flexibility, no vendor lock-in, customize algorithms"
}

Practical Integration with LangChain

LangChain makes it easy to work with any vector database:

python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.schema import Document

# Initialize embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key="your-api-key"
)

# Create documents
documents = [
    Document(
        page_content="Our refund policy allows 30-day returns",
        metadata={"source": "policy.pdf", "page": 1}
    ),
    Document(
        page_content="We offer a money-back guarantee within one month",
        metadata={"source": "policy.pdf", "page": 2}
    ),
    Document(
        page_content="Contact us at support@example.com for assistance",
        metadata={"source": "contact.txt"}
    )
]

# Create vector store
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-4", api_key="your-api-key")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2})
)

# Query
response = qa_chain.invoke({"query": "What's your return policy?"})
print(response['result'])
# "We have a 30-day return policy and offer a money-back guarantee within one month..."

Switching Vector Databases

python
# Switch to Pinecone (same code structure!)
from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore.from_documents(
    documents=documents,
    embedding=embeddings,
    index_name="rag-documents"
)

# Switch to Weaviate
from langchain_weaviate import WeaviateVectorStore

vectorstore = WeaviateVectorStore.from_documents(
    documents=documents,
    embedding=embeddings,
    client=weaviate_client
)
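
# Switch to FAISS (in-memory; assumes faiss-cpu and langchain-community are installed)
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(
    documents=documents,
    embedding=embeddings
)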

# Rest of the code stays the same! ✅

LangChain Advantage: Write once, switch databases easily. The abstraction lets you start with Chroma and migrate to Pinecone without rewriting your code.

Advanced Features

Metadata Filtering

python
# Chroma: Filter by metadata
results = collection.query(
    query_texts=["refund policy"],
    n_results=5,
    where={"source": "policy.pdf"}  # Only search policy documents
)

# Pinecone: Filter with expressions
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"source": {"$eq": "policy.pdf"}}
)

# Weaviate: GraphQL filtering
result = client.query.get(
    "Document",
    ["content"]
).with_where({
    "path": ["source"],
    "operator": "Equal",
    "valueString": "policy.pdf"
}).with_near_text({
    "concepts": ["refund policy"]
}).do()
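
FAISS has no metadata layer (the "Manual" cell in the comparison table): the index stores only vectors and returns integer positions. The usual workaround is a parallel Python structure that you post-filter; a minimal sketch using the FAISS index and query_vector built in the FAISS section:

python
# Metadata lives outside the index, keyed by insertion order (row id)
doc_metadata = [
    {"text": "Our refund policy allows 30-day returns", "source": "policy.pdf"},
    {"text": "We offer a money-back guarantee", "source": "policy.pdf"},
    {"text": "Contact us at support@example.com", "source": "contact.txt"},
]

# Search first (over-fetch so post-filtering still leaves enough results),
# then keep only hits whose metadata matches
distances, indices = index.search(query_vector, 3)
hits = [
    (distances[0][rank], doc_metadata[idx])
    for rank, idx in enumerate(indices[0])
    if doc_metadata[idx]["source"] == "policy.pdf"
]
print(hits)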

Hybrid Search (Weaviate)

python
# Combine vector search + keyword search
result = client.query.get(
    "Document",
    ["content", "source"]
).with_hybrid(
    query="refund policy",
    alpha=0.5  # 0=keyword only, 1=vector only, 0.5=balanced
).with_limit(5).do()

Performance Optimization

Batch Operations

python
# ❌ Slow: one insert (and one network/index round trip) per document
for doc in documents:
    vector_store.add_documents([doc])

# ✅ Fast: single batch insert
vector_store.add_documents(documents)  # Much faster!

Indexing Strategies

python
# FAISS: Choose the right index type
indexes = {
    "small": "IndexFlatL2",          # < 100K vectors, exact search
    "medium": "IndexIVFFlat",        # 100K-10M, approximate
    "large": "IndexIVFPQ",           # > 10M, compressed
    "huge": "IndexHNSWFlat",         # Graph-based, fast approximate search
}

# Example: Product Quantization for compression
dimension = 1536
nlist = 100                          # Number of IVF clusters
m = 8                                # Number of sub-quantizers (must divide dimension)
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, 8)  # 8 bits per sub-code
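
For the graph-based option, a minimal HNSW sketch reusing the variables from the FAISS section (32 is the number of graph neighbors per node, a common default; no training step is needed):

python
# HNSW: fast approximate search, no train() required
hnsw_index = faiss.IndexHNSWFlat(dimension, 32)
hnsw_index.add(embeddings)
distances, indices = hnsw_index.search(query_vector, k)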

Summary

Vector databases are the foundation of RAG systems:

  1. Store embeddings for semantic search
  2. Enable similarity search instead of keyword matching
  3. Scale from prototypes to production
  4. Integrate easily with LangChain and other frameworks

Quick Recommendations:

  • Learning/Prototyping: Start with Chroma
  • Production: Use Pinecone (cloud) or Weaviate (self-hosted)
  • Research/Custom: Use FAISS for maximum control