# Vector Databases Explained
In the previous lesson, you learned that RAG systems need to search through documents to find relevant information. But how do you search for meaning rather than just keywords?
That's where vector databases come in.
## The Problem with Traditional Databases
Traditional databases are great for exact matches, but terrible at understanding meaning:
```python
# Traditional SQL database
query = "SELECT * FROM documents WHERE content LIKE '%refund policy%'"

# ❌ Problems:
# - Misses synonyms: "return policy", "money back guarantee"
# - Misses related concepts: "customer satisfaction", "warranty"
# - No semantic understanding
```
### Keyword Search Limitations
```python
documents = [
    "Our refund policy allows 30-day returns",
    "We offer a money-back guarantee within one month",
    "Customer satisfaction is our priority with full reimbursement"
]

user_query = "What's your return policy?"

# Traditional keyword search: match documents containing the exact word
def keyword_search(keyword, docs):
    return [doc for doc in docs if keyword in doc.lower().split()]

results = keyword_search("return", documents)
print(results)
# Output: [] ❌ Misses all relevant documents!
# "returns", "money-back", and "reimbursement" are different tokens,
# so an exact word match finds nothing
```
## What is a Vector Database?
A vector database stores data as high-dimensional vectors (embeddings) that represent semantic meaning, enabling similarity search.
**Vector Database Definition:** A specialized database optimized for storing, indexing, and searching high-dimensional vector embeddings, enabling fast similarity searches based on semantic meaning rather than exact keyword matches.
### How It Works
```python
# Illustrative pseudocode: `embedding_model` and `vector_db` stand in
# for any embedding model and vector store.

# 1. Convert text to vectors (embeddings)
text = "refund policy"
vector = embedding_model.embed(text)
print(vector[:5])  # [0.023, -0.145, 0.389, 0.012, -0.234]
# Shape: (1536,) - 1536-dimensional vector

# 2. Store vectors in database
vector_db.add(
    id="doc1",
    vector=vector,
    metadata={"text": text, "source": "policy.pdf"}
)

# 3. Search by similarity
query_vector = embedding_model.embed("return policy")
results = vector_db.search(query_vector, top_k=3)
# Returns most similar documents ✅
```
### Semantic Search Example
```python
from openai import OpenAI
import numpy as np

client = OpenAI(api_key="your-api-key")

def get_embedding(text):
    """Convert text to vector"""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Our documents
documents = [
    "Our refund policy allows 30-day returns",
    "We offer a money-back guarantee within one month",
    "Customer satisfaction is our priority with full reimbursement",
    "Contact us at support@example.com"
]

# Create embeddings
doc_embeddings = [get_embedding(doc) for doc in documents]

# User query
query = "What's your return policy?"
query_embedding = get_embedding(query)
```
<Callout type="info">
**Cosine Similarity Definition:** A metric that measures the similarity between two vectors by calculating the cosine of the angle between them, ranging from -1 to 1, where 1 indicates identical direction regardless of magnitude.
</Callout>
```python
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the documents most similar to the query
similarities = [
    cosine_similarity(query_embedding, doc_emb)
    for doc_emb in doc_embeddings
]

# Get top results
top_idx = np.argsort(similarities)[::-1][:3]
for idx in top_idx:
    print(f"Score: {similarities[idx]:.3f} - {documents[idx]}")

# Example output (exact scores vary by model):
# Score: 0.892 - Our refund policy allows 30-day returns ✅
# Score: 0.856 - We offer a money-back guarantee within one month ✅
# Score: 0.823 - Customer satisfaction is our priority with full reimbursement ✅
```
**Key Insight:** Vector databases understand that "refund", "return", "money-back", and "reimbursement" all refer to similar concepts, even though they use different words.
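You can verify this directly: embed two different phrasings and compare them. A minimal sketch reusing `get_embedding` and `cosine_similarity` from above; the exact scores depend on the embedding model:

```python
# Near-synonyms land close together in embedding space, so their
# cosine similarity is noticeably higher than for unrelated text.
refund_vec = get_embedding("refund policy")
return_vec = get_embedding("return policy")
support_vec = get_embedding("contact our support team")

print(cosine_similarity(refund_vec, return_vec))   # high: near-synonyms
print(cosine_similarity(refund_vec, support_vec))  # lower: related but distinct
```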
## Popular Vector Databases
### 1. Chroma - Easy to Start
**Best for:** Development, prototyping, small projects

**Chroma Definition:** An open-source, embedded vector database that runs locally without external dependencies, designed for easy prototyping and development of RAG applications with simple setup and a Python-native API.
```python
# Install
# pip install chromadb

import chromadb
from chromadb.utils import embedding_functions

# Create client
client = chromadb.Client()

# Create collection
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection(
    name="my_documents",
    embedding_function=openai_ef
)

# Add documents
collection.add(
    documents=[
        "Our refund policy allows 30-day returns",
        "We offer a money-back guarantee",
        "Contact us at support@example.com"
    ],
    metadatas=[
        {"source": "policy.pdf", "page": 1},
        {"source": "policy.pdf", "page": 2},
        {"source": "contact.txt", "page": 1}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query
results = collection.query(
    query_texts=["What's your return policy?"],
    n_results=2
)

print(results['documents'])
# [["Our refund policy allows 30-day returns",
#   "We offer a money-back guarantee"]]
```
**Chroma Features:**
- ✅ Runs locally, no server needed
- ✅ Built-in embedding functions
- ✅ Persistent storage (see the sketch below)
- ✅ Simple API
- ❌ Not ideal for production scale
- ❌ Limited advanced features
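Persistence is one line of setup: swap `Client` for `PersistentClient` and collections survive restarts. A minimal sketch; the path is an arbitrary local directory:

```python
import chromadb

# Store collections on disk instead of in memory;
# "./chroma_data" is an arbitrary local path.
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(name="my_documents")
# Anything added here is still queryable after the process restarts.
```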
### 2. Pinecone - Production Ready

**Best for:** Production applications, large scale

**Pinecone Definition:** A fully managed, cloud-native vector database service that provides high-performance similarity search at scale, handling billions of vectors with low latency and minimal operational overhead.
```python
# Install
# pip install pinecone-client

from pinecone import Pinecone, ServerlessSpec

# Initialize
pc = Pinecone(api_key="your-api-key")

# Create index
index_name = "rag-documents"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # text-embedding-3-small dimension
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

index = pc.Index(index_name)

# Prepare data
from openai import OpenAI

client = OpenAI(api_key="your-openai-key")

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

documents = [
    "Our refund policy allows 30-day returns",
    "We offer a money-back guarantee",
    "Contact us at support@example.com"
]

# Upsert vectors
vectors = []
for i, doc in enumerate(documents):
    vectors.append({
        "id": f"doc{i}",
        "values": get_embedding(doc),
        "metadata": {"text": doc, "source": "policy.pdf"}
    })

index.upsert(vectors=vectors)

# Query
query_embedding = get_embedding("What's your return policy?")

results = index.query(
    vector=query_embedding,
    top_k=2,
    include_metadata=True
)

for match in results['matches']:
    print(f"Score: {match['score']:.3f}")
    print(f"Text: {match['metadata']['text']}\n")
```
**Pinecone Features:**
- ✅ Fully managed cloud service
- ✅ Scales to billions of vectors
- ✅ Low query latency at scale
- ✅ Namespace support (see the sketch below)
- ✅ Filtering and metadata
- ❌ Requires API key (paid after free tier)
- ❌ Cloud-only (no local option)
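Namespaces partition a single index, for example per tenant, and each query is scoped to one namespace. A minimal sketch reusing `index` and `get_embedding` from above; "customer-a" is a made-up namespace name:

```python
# Upsert into a tenant-specific namespace
index.upsert(
    vectors=[{
        "id": "doc1",
        "values": get_embedding("Our refund policy allows 30-day returns"),
        "metadata": {"text": "Our refund policy allows 30-day returns"}
    }],
    namespace="customer-a"
)

# Queries only see vectors in the namespace they target
results = index.query(
    vector=get_embedding("return policy"),
    top_k=2,
    namespace="customer-a",
    include_metadata=True
)
```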
### 3. Weaviate - Open Source Powerhouse

**Best for:** Self-hosted deployments, hybrid search, GraphQL fans
```python
# Install
# pip install weaviate-client

# Uses the v4 Python client (weaviate-client >= 4) throughout.
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.config import Configure, Property, DataType

# Connect to Weaviate Cloud
client = weaviate.connect_to_weaviate_cloud(
    cluster_url="your-cluster-url",
    auth_credentials=Auth.api_key("your-api-key"),
    headers={"X-OpenAI-Api-Key": "your-openai-key"},  # used by the vectorizer
)

# Create a collection with a built-in OpenAI vectorizer
docs_collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    ),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ],
)

# Add documents
documents = [
    {"content": "Our refund policy allows 30-day returns", "source": "policy.pdf"},
    {"content": "We offer a money-back guarantee", "source": "policy.pdf"},
    {"content": "Contact us at support@example.com", "source": "contact.txt"}
]

with docs_collection.batch.dynamic() as batch:
    for doc in documents:
        batch.add_object(properties=doc)

# Query by semantic similarity
response = docs_collection.query.near_text(
    query="return policy",
    limit=2,
)
for obj in response.objects:
    print(obj.properties["content"], "-", obj.properties["source"])

client.close()
```
**Weaviate Features:**
- ✅ Self-hosted or cloud (see the sketch below)
- ✅ Hybrid search (vector + keyword)
- ✅ GraphQL API
- ✅ Multi-tenancy support
- ✅ Built-in vectorizers
- ❌ More complex setup
- ❌ Steeper learning curve
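For the self-hosted path, the v4 client can also connect to a local instance. A sketch assuming Weaviate is already running locally, for example via Docker, on the default port:

```python
import weaviate

# Connect to a self-hosted instance instead of Weaviate Cloud
client = weaviate.connect_to_local()  # defaults to http://localhost:8080
print(client.is_ready())              # True once the instance is reachable
client.close()
```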
### 4. FAISS - Facebook's Library

**Best for:** Research, maximum control, no external dependencies
```python
# Install
# pip install faiss-cpu

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Prepare documents
documents = [
    "Our refund policy allows 30-day returns",
    "We offer a money-back guarantee",
    "Contact us at support@example.com"
]

# Create embeddings
embeddings = np.array([get_embedding(doc) for doc in documents]).astype('float32')

# Create FAISS index
dimension = embeddings.shape[1]       # 1536
index = faiss.IndexFlatL2(dimension)  # L2 distance

# Add vectors
index.add(embeddings)
print(f"Total vectors: {index.ntotal}")

# Search
query = "What's your return policy?"
query_vector = np.array([get_embedding(query)]).astype('float32')

# Find 2 nearest neighbors
k = 2
distances, indices = index.search(query_vector, k)

print("\nTop results:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. Distance: {distances[0][i]:.3f}")
    print(f"   Text: {documents[idx]}\n")

# Save index
faiss.write_index(index, "documents.index")

# Load index
loaded_index = faiss.read_index("documents.index")
```
#### Advanced FAISS: Faster Search
```python
# For large datasets, use IVF (Inverted File Index).
# Assumes `embeddings`, `dimension`, `query_vector`, and `k` from above.
nlist = 100  # number of clusters; keep well below the number of vectors
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train the index (needs a representative sample of vectors,
# ideally many more than nlist)
index.train(embeddings)

# Add vectors
index.add(embeddings)

# Search (faster for large datasets)
index.nprobe = 10  # number of clusters to search
distances, indices = index.search(query_vector, k)
```
**FAISS Features:**
- ✅ Extremely fast
- ✅ No external services
- ✅ Advanced indexing algorithms
- ✅ Free and open source
- ✅ CPU and GPU support
- ❌ No built-in embeddings
- ❌ Manual metadata management (see the sketch below)
- ❌ No built-in persistence
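Because FAISS stores only vectors, "manual metadata management" means keeping your own mapping from row positions back to documents. One common minimal approach, sketched here reusing `index`, `query_vector`, and `documents` from above:

```python
# FAISS returns row positions into the arrays you added,
# so keep a parallel structure with everything else you need.
metadata_store = [
    {"text": doc, "source": "policy.pdf"} for doc in documents
]

distances, indices = index.search(query_vector, 2)
for idx in indices[0]:
    record = metadata_store[idx]
    print(record["text"], "-", record["source"])
```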
**FAISS Tip:** Use `IndexFlatL2` for small datasets where exact search is feasible, `IndexIVFFlat` for faster approximate search on larger sets, and `IndexIVFPQ` when you need compression at scale.

## Vector Database Comparison
| Feature | Chroma | Pinecone | Weaviate | FAISS |
|---|---|---|---|---|
| Deployment | Local/Embedded | Cloud Only | Self-hosted/Cloud | Local Library |
| Ease of Use | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Scalability | Small-Medium | Billions | Large | Custom |
| Cost | Free | Free tier + paid | Free (self-hosted) | Free |
| Managed | No | Yes | Optional | No |
| Metadata Filter | Yes | Yes | Yes | Manual |
| Hybrid Search | No | No | Yes | No |
| Best For | Development | Production | Flexibility | Research |
## Choosing the Right Vector Database
### Decision Tree
```python
def choose_vector_db(project_stage, budget=0, self_hosted_required=False,
                     hybrid_search_needed=False, need_maximum_control=False,
                     research_project=False):
    """
    Choose the right vector database for your needs
    """
    # Just getting started?
    if project_stage == "prototype":
        return "Chroma"  # Easy, local, fast to set up

    # Production with budget?
    if project_stage == "production" and budget > 0:
        return "Pinecone"  # Managed, scalable, reliable

    # Need self-hosting or hybrid search?
    if self_hosted_required or hybrid_search_needed:
        return "Weaviate"  # Flexible, powerful features

    # Maximum performance, custom solution?
    if need_maximum_control or research_project:
        return "FAISS"  # Fast, customizable, free

    # Default: Start simple
    return "Chroma"

print(choose_vector_db("production", budget=500))  # Pinecone
```
### Real-World Scenarios
```python
# Scenario 1: Startup MVP
scenario_1 = {
    "project": "Document Q&A MVP",
    "docs": "10K documents",
    "users": "100",
    "budget": "$0",
    "recommendation": "Chroma",
    "reason": "Free, easy to set up, sufficient for early stage"
}

# Scenario 2: Growing SaaS
scenario_2 = {
    "project": "Customer support chatbot",
    "docs": "1M+ documents",
    "users": "10K concurrent",
    "budget": "$500/month",
    "recommendation": "Pinecone",
    "reason": "Scales automatically, managed service, reliable"
}

# Scenario 3: Enterprise
scenario_3 = {
    "project": "Internal knowledge base",
    "docs": "500K documents",
    "users": "5K employees",
    "budget": "Self-host preferred",
    "recommendation": "Weaviate",
    "reason": "Full control, hybrid search, on-premises"
}

# Scenario 4: Research
scenario_4 = {
    "project": "Academic research on embeddings",
    "docs": "Variable",
    "users": "Researchers",
    "budget": "$0",
    "recommendation": "FAISS",
    "reason": "Maximum flexibility, no vendor lock-in, customize algorithms"
}
```
## Practical Integration with LangChain
LangChain makes it easy to work with any vector database:
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.schema import Document

# Initialize embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key="your-api-key"
)

# Create documents
documents = [
    Document(
        page_content="Our refund policy allows 30-day returns",
        metadata={"source": "policy.pdf", "page": 1}
    ),
    Document(
        page_content="We offer a money-back guarantee within one month",
        metadata={"source": "policy.pdf", "page": 2}
    ),
    Document(
        page_content="Contact us at support@example.com for assistance",
        metadata={"source": "contact.txt"}
    )
]

# Create vector store
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-4", api_key="your-api-key")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2})
)

# Query
response = qa_chain.invoke({"query": "What's your return policy?"})
print(response['result'])
# Example output: "We have a 30-day return policy and offer a
# money-back guarantee within one month..."
```
### Switching Vector Databases
```python
# Switch to Pinecone (same code structure!)
from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore.from_documents(
    documents=documents,
    embedding=embeddings,
    index_name="rag-documents"
)

# Switch to Weaviate
from langchain_weaviate import WeaviateVectorStore

vectorstore = WeaviateVectorStore.from_documents(
    documents=documents,
    embedding=embeddings,
    client=weaviate_client
)

# Rest of the code stays the same! ✅
```
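FAISS plugs into the same interface too. A sketch; it requires `pip install faiss-cpu` and builds an in-memory index:

```python
# Switch to FAISS (local, in-memory)
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(
    documents=documents,
    embedding=embeddings
)
```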
**LangChain Advantage:** Write once, switch databases easily. The abstraction lets you start with Chroma and migrate to Pinecone without rewriting your code.
## Advanced Features

### Metadata Filtering
```python
# Chroma: Filter by metadata
results = collection.query(
    query_texts=["refund policy"],
    n_results=5,
    where={"source": "policy.pdf"}  # Only search policy documents
)

# Pinecone: Filter with expressions
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"source": {"$eq": "policy.pdf"}}
)

# Weaviate (v4 client): typed filters
from weaviate.classes.query import Filter

response = docs_collection.query.near_text(
    query="refund policy",
    filters=Filter.by_property("source").equal("policy.pdf"),
    limit=5,
)
```
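The same idea carries through LangChain: metadata filters can be passed to the retriever. A sketch assuming the Chroma `vectorstore` from the integration example above:

```python
# Restrict retrieval to policy documents via metadata
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 2, "filter": {"source": "policy.pdf"}}
)
docs = retriever.invoke("refund policy")
```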
### Hybrid Search (Weaviate)
```python
# Combine vector search + keyword (BM25) search (v4 client)
response = docs_collection.query.hybrid(
    query="refund policy",
    alpha=0.5,  # 0 = keyword only, 1 = vector only, 0.5 = balanced
    limit=5,
)

for obj in response.objects:
    print(obj.properties["content"], "-", obj.properties["source"])
```
## Performance Optimization

### Batch Operations
```python
# Generic sketch: exact method names vary by library

# ❌ Slow: One at a time
for doc in documents:
    vector_store.add_document(doc)

# ✅ Fast: Batch insert
vector_store.add_documents(documents)  # Much faster!
```
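The same applies to embedding calls: the OpenAI embeddings endpoint accepts a list of texts, so one batched request replaces many single ones. A sketch reusing `client` and `documents` from earlier:

```python
# One API call for all documents instead of one call per document
response = client.embeddings.create(
    input=documents,  # a list of strings
    model="text-embedding-3-small"
)
doc_embeddings = [item.embedding for item in response.data]
```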
### Indexing Strategies
```python
# FAISS: Choose the right index type
indexes = {
    "small": "IndexFlatL2",      # < 100K vectors, exact search
    "medium": "IndexIVFFlat",    # 100K-10M, approximate
    "large": "IndexIVFPQ",       # > 10M, compressed
    "huge": "IndexHNSWFlat",     # graph-based, very fast queries
}

# Example: Product Quantization for compression
dimension = 1536
nlist = 100                               # number of IVF clusters
quantizer = faiss.IndexFlatL2(dimension)
m = 8                                     # number of sub-quantizers
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, 8)  # 8 bits per code
```
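To see why PQ helps at scale: with `m = 8` sub-quantizers at 8 bits each, every vector is stored in about 8 bytes, versus 6,144 bytes (1536 dimensions × 4 bytes) for raw float32, a roughly 768× reduction, traded against some recall.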
## Summary
Vector databases are the foundation of RAG systems:
- Store embeddings for semantic search
- Enable similarity search instead of keyword matching
- Scale from prototypes to production
- Integrate easily with LangChain and other frameworks
**Quick Recommendations:**
- **Learning/Prototyping:** Start with Chroma
- **Production:** Use Pinecone (cloud) or Weaviate (self-hosted)
- **Research/Custom:** Use FAISS for maximum control