Understanding Embeddings
Embeddings are the foundation of modern semantic search, recommendation systems, and AI applications. But what exactly are they, and how do they work? Let's demystify these powerful numerical representations.
What Are Embeddings?
An embedding is a dense vector representation of data (text, images, audio) in a continuous, high-dimensional space, arranged so that semantic similarity is preserved as geometric proximity.
Embedding Definition: An embedding is a numerical representation (vector) of data that captures its semantic meaning. Similar concepts have similar vectors, enabling mathematical operations on meaning itself.
The Core Concept
# Human-readable text
text = "The quick brown fox"
# Machine-understandable embedding
embedding = [0.023, -0.145, 0.389, ..., -0.234]
# A list of 1536 floating-point numbers
Key Principle: Similar meanings → Similar vectors
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="your-api-key")
def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding
# Similar concepts produce similar embeddings
dog_embedding = get_embedding("dog")
puppy_embedding = get_embedding("puppy")
car_embedding = get_embedding("car")
# Calculate similarity (cosine)
def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"dog vs puppy: {cosine_similarity(dog_embedding, puppy_embedding):.3f}") # ~0.85
print(f"dog vs car: {cosine_similarity(dog_embedding, car_embedding):.3f}") # ~0.45
Cosine Similarity: A measure of similarity between two vectors that calculates the cosine of the angle between them. Values range from -1 (opposite) to 1 (identical), with higher values indicating greater semantic similarity.
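Written out for two vectors a and b, this is just the normalized dot product: cosine_similarity(a, b) = (a · b) / (||a|| × ||b||), which is exactly what the NumPy helper above computes.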
Key Insight: Embeddings map semantic similarity to geometric proximity. Words with similar meanings cluster together in high-dimensional space, enabling mathematical operations on meaning.
How Embeddings Are Created
The Training Process
Embedding models learn to represent text through self-supervised learning on massive text corpora:
"""
Embedding Training (Simplified):
1. TRAINING DATA
- Billions of text examples
- Learn from context: words appearing together
2. LEARNING OBJECTIVE
- Predict masked words: "The [MASK] sat on the mat" → "cat"
- Context similarity: "king - man + woman ≈ queen"
- Contrastive learning: similar pairs close, dissimilar pairs far
3. RESULT
- Dense vector for each token
- Semantic relationships encoded geometrically
"""
# Example: Vector arithmetic captures relationships
# These work because embeddings preserve semantic structure:
# - king - man + woman ≈ queen
# - Paris - France + Italy ≈ Rome
# - walked - walking + swimming ≈ swam
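These analogies are easiest to demonstrate with classic word vectors; with sentence-level embedding APIs the effect is weaker but often still visible. A rough sketch, reusing the get_embedding and cosine_similarity helpers from above (the candidate word list is illustrative, and results vary by model):

# Compose a vector for "king - man + woman" and see which candidate lands closest
words = ["queen", "princess", "king", "man", "woman", "apple"]
vectors = {w: np.array(get_embedding(w)) for w in words}

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Exclude the input words, as is standard for analogy tests
candidates = [w for w in words if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))
print(best)  # often "queen", but not guaranteed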
Modern Embedding Architectures
"""
Evolution of Embedding Models:
1. Word2Vec (2013)
- Skip-gram and CBOW
- Static embeddings (one vector per word)
- No context sensitivity
2. GloVe (2014)
- Global word-word co-occurrence
- Better captures semantic relationships
3. ELMo (2018)
- Contextual embeddings
- Same word, different vectors based on context
4. BERT/Transformers (2018+)
- Deep contextual understanding
- Attention mechanisms
- State-of-the-art performance
5. Modern (2023+)
- OpenAI text-embedding-3
- Cohere embed-v3
- Optimized for specific tasks
"""
Creating Embeddings with OpenAI
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
# Single text embedding
def create_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding
# Example
text = "Machine learning is transforming technology"
embedding = create_embedding(text)
print(f"Embedding dimension: {len(embedding)}") # 1536
print(f"First 5 values: {embedding[:5]}")
# Output: [0.0123, -0.0456, 0.0789, -0.0234, 0.0567]
Batch Processing for Efficiency
# BAD: One at a time (slow, expensive)
texts = ["text1", "text2", "text3", ..., "text100"]
embeddings = [create_embedding(t) for t in texts] # 100 API calls!
# GOOD: Batch processing (fast, cheap)
def create_embeddings_batch(texts, model="text-embedding-3-small"):
    """Create multiple embeddings in one API call"""
    response = client.embeddings.create(
        input=texts,  # List of texts
        model=model
    )
    return [item.embedding for item in response.data]
# Process up to 2048 texts at once
embeddings = create_embeddings_batch(texts) # 1 API call!
Cost and Latency Optimization: Batching does not change the per-token price, but it removes nearly all per-request overhead, so large jobs run 10-100x faster and use far fewer rate-limited requests. Always batch when creating multiple embeddings; OpenAI accepts up to 2048 inputs per request.
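If you have more than 2048 texts, split the list into chunks and make one request per chunk. A minimal sketch building on create_embeddings_batch above (the helper name is illustrative):

def create_embeddings_large(texts, batch_size=2048):
    """Embed an arbitrarily long list of texts, batch_size items per API call."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        embeddings.extend(create_embeddings_batch(texts[i:i + batch_size]))
    return embeddings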
Embedding Dimensions and Models
Understanding Dimensions
Embedding Dimensions: The number of values in an embedding vector. Each dimension captures different semantic features learned during training. More dimensions generally mean more nuanced representations but also more storage and computation.
# What are dimensions?
embedding = [0.1, -0.2, 0.3, 0.4, -0.5] # 5-dimensional vector
# Each dimension captures different semantic aspects:
# - Dimension 0 might encode "animal vs object"
# - Dimension 1 might encode "size"
# - Dimension 2 might encode "sentiment"
# - etc. (learned automatically, not manually defined)
OpenAI Embedding Models Comparison
models_comparison = {
    "text-embedding-3-small": {
        "dimensions": 1536,
        "performance": "High quality",
        "cost_per_1M_tokens": "$0.02",
        "use_case": "Most applications - best value",
        "max_input_tokens": 8191,
        "release": "January 2024"
    },
    "text-embedding-3-large": {
        "dimensions": 3072,
        "performance": "Highest quality",
        "cost_per_1M_tokens": "$0.13",
        "use_case": "Maximum accuracy needed",
        "max_input_tokens": 8191,
        "release": "January 2024"
    },
    "text-embedding-ada-002": {
        "dimensions": 1536,
        "performance": "Good (legacy)",
        "cost_per_1M_tokens": "$0.10",
        "use_case": "Legacy projects only",
        "max_input_tokens": 8191,
        "release": "December 2022"
    }
}
# Recommendation: Use text-embedding-3-small for most tasks
Performance Comparison
import numpy as np
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
def compare_models():
    """Compare embedding quality across models"""
    # Test cases
    queries = [
        ("dog", "puppy"),   # Should be very similar
        ("dog", "canine"),  # Should be very similar
        ("dog", "cat"),     # Should be somewhat similar
        ("dog", "car"),     # Should be dissimilar
    ]
    models = ["text-embedding-3-small", "text-embedding-3-large"]
    for model in models:
        print(f"\n{model}:")
        print("-" * 50)
        for word1, word2 in queries:
            # Get both embeddings in one request
            embeddings = client.embeddings.create(
                input=[word1, word2],
                model=model
            )
            e1 = np.array(embeddings.data[0].embedding)
            e2 = np.array(embeddings.data[1].embedding)
            # Calculate cosine similarity
            similarity = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
            print(f"{word1:10} vs {word2:10} = {similarity:.4f}")
compare_models()
# Output:
# text-embedding-3-small:
# --------------------------------------------------
# dog vs puppy = 0.8567
# dog vs canine = 0.8234
# dog vs cat = 0.7123
# dog vs car = 0.4567
#
# text-embedding-3-large:
# --------------------------------------------------
# dog vs puppy = 0.8891 ← Better discrimination
# dog vs canine = 0.8678
# dog vs cat = 0.6892
# dog vs car = 0.3912
Dimension Reduction
# text-embedding-3 models support dimension reduction
# Reduce storage and speed up search with minimal accuracy loss
response = client.embeddings.create(
    input="Your text here",
    model="text-embedding-3-small",
    dimensions=512  # Reduce from 1536 to 512
)
# Benefits:
# - 66% less storage (1536 → 512)
# - 3x faster similarity search
# - ~95% of original accuracy retained
# Supported dimensions for text-embedding-3-small:
# Any value from 1 to 1536
# Example: Testing different dimensions
def test_dimensions():
    text = "Machine learning embeddings"
    for dims in [256, 512, 768, 1024, 1536]:
        response = client.embeddings.create(
            input=text,
            model="text-embedding-3-small",
            dimensions=dims
        )
        embedding = response.data[0].embedding
        print(f"Dimensions: {dims:4d} | Actual length: {len(embedding)}")
Important: When using dimension reduction, stay consistent. Don't compare embeddings with different dimensions. Choose your dimension count upfront based on accuracy vs. performance trade-offs.
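A simple way to stay consistent is to fix the model and dimension count in one place and route every embedding call through a single helper, at indexing time and at query time alike. A minimal sketch; the constant and function names are illustrative:

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 512  # chosen once, used for every embedding in the system

def embed(texts):
    """Every embedding in the application is created through this one helper."""
    response = client.embeddings.create(
        input=texts,
        model=EMBEDDING_MODEL,
        dimensions=EMBEDDING_DIMENSIONS
    )
    return [item.embedding for item in response.data]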
Use Cases for Embeddings
1. Semantic Search
# Traditional keyword search: exact matches only
# Query: "canine" → Won't match "dog"
# Semantic search with embeddings: meaning-based
# Query: "canine" → Matches "dog", "puppy", "pet"
from typing import List, Tuple
class SemanticSearch:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def index(self, documents: List[str]):
        """Index documents for search"""
        # Create embeddings for all documents in one batch call
        response = client.embeddings.create(
            input=documents,
            model="text-embedding-3-small"
        )
        self.documents = documents
        self.embeddings = [np.array(item.embedding) for item in response.data]

    def search(self, query: str, top_k: int = 5) -> List[Tuple[str, float]]:
        """Search for similar documents"""
        # Get query embedding
        query_response = client.embeddings.create(
            input=query,
            model="text-embedding-3-small"
        )
        query_embedding = np.array(query_response.data[0].embedding)
        # Calculate cosine similarity against every indexed document
        similarities = []
        for doc, doc_embedding in zip(self.documents, self.embeddings):
            similarity = np.dot(query_embedding, doc_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
            )
            similarities.append((doc, similarity))
        # Sort and return top-k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]
# Example
search = SemanticSearch()
search.index([
"Python is a programming language",
"Dogs are loyal pets",
"Machine learning uses algorithms",
"Puppies are young dogs",
"JavaScript runs in browsers"
])
results = search.search("What are canines?", top_k=3)
for doc, score in results:
    print(f"{score:.3f} | {doc}")
# Output:
# 0.823 | Dogs are loyal pets
# 0.798 | Puppies are young dogs
# 0.456 | Python is a programming language
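For larger collections, the per-document Python loop in search() becomes the bottleneck. A common optimization, sketched below with an illustrative fast_search helper, is to store the document embeddings as one L2-normalized NumPy matrix so each query is a single matrix-vector product; beyond tens of thousands of documents you would typically move to a vector database instead.

# Precompute a normalized document matrix once after indexing
doc_matrix = np.array(search.embeddings)
doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)

def fast_search(query_embedding, top_k=3):
    q = np.array(query_embedding)
    q = q / np.linalg.norm(q)
    scores = doc_matrix @ q                  # cosine similarities for all documents at once
    top = np.argsort(scores)[::-1][:top_k]   # indices of the best matches
    return [(search.documents[i], float(scores[i])) for i in top]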
2. Recommendation Systems
class ContentRecommender:
    """Recommend similar content based on embeddings"""

    def __init__(self, items: List[dict]):
        """
        items: List of dicts with 'id', 'title', 'description'
        """
        self.items = items
        # Create embeddings for all items in one batch call
        texts = [f"{item['title']}. {item['description']}" for item in items]
        response = client.embeddings.create(
            input=texts,
            model="text-embedding-3-small"
        )
        self.embeddings = [np.array(item.embedding) for item in response.data]

    def recommend(self, item_id: str, top_k: int = 5):
        """Recommend similar items"""
        # Find the index of the item we are recommending against
        item_idx = next(i for i, item in enumerate(self.items) if item['id'] == item_id)
        item_embedding = self.embeddings[item_idx]
        # Calculate similarity to every other item
        similarities = []
        for i, (item, embedding) in enumerate(zip(self.items, self.embeddings)):
            if i == item_idx:
                continue  # Skip the item itself
            similarity = np.dot(item_embedding, embedding) / (
                np.linalg.norm(item_embedding) * np.linalg.norm(embedding)
            )
            similarities.append((item, similarity))
        # Return top-k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]
# Example: Movie recommendations
movies = [
{"id": "1", "title": "The Matrix", "description": "Sci-fi action about virtual reality"},
{"id": "2", "title": "Inception", "description": "Dreams within dreams heist thriller"},
{"id": "3", "title": "The Notebook", "description": "Romantic drama love story"},
{"id": "4", "title": "Interstellar", "description": "Space exploration sci-fi epic"},
{"id": "5", "title": "Titanic", "description": "Historical romance on doomed ship"},
]
recommender = ContentRecommender(movies)
recommendations = recommender.recommend("1", top_k=3)
print("Because you liked 'The Matrix', you might like:")
for movie, score in recommendations:
    print(f" {score:.3f} | {movie['title']}")
# Output:
# Because you liked 'The Matrix', you might like:
# 0.876 | Inception
# 0.834 | Interstellar
# 0.523 | The Notebook
3. Clustering and Classification
from sklearn.cluster import KMeans
import numpy as np
def cluster_documents(documents: List[str], n_clusters: int = 3):
    """Cluster documents by semantic similarity"""
    # Create embeddings
    response = client.embeddings.create(
        input=documents,
        model="text-embedding-3-small"
    )
    embeddings = np.array([item.embedding for item in response.data])
    # Cluster with k-means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)
    # Group documents by cluster id
    clustered = {}
    for doc, cluster_id in zip(documents, clusters):
        if cluster_id not in clustered:
            clustered[cluster_id] = []
        clustered[cluster_id].append(doc)
    return clustered
# Example: News articles
articles = [
"Stock market hits new high",
"Tech startup raises $50M in funding",
"New planet discovered in distant galaxy",
"Scientists find water on Mars",
"GDP growth exceeds expectations",
"Telescope captures black hole image",
]
clusters = cluster_documents(articles, n_clusters=2)
for cluster_id, docs in clusters.items():
    print(f"\nCluster {cluster_id}:")
    for doc in docs:
        print(f" - {doc}")
# Output:
# Cluster 0: (Science)
# - New planet discovered in distant galaxy
# - Scientists find water on Mars
# - Telescope captures black hole image
#
# Cluster 1: (Finance)
# - Stock market hits new high
# - Tech startup raises $50M in funding
# - GDP growth exceeds expectations
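This section's heading also mentions classification: once you have labelled examples, the same embeddings work well as input features for an ordinary classifier. A minimal sketch using scikit-learn's LogisticRegression; the tiny training set and labels below are purely illustrative:

from sklearn.linear_model import LogisticRegression

# Small labelled training set (illustrative)
train_texts = [
    "Stock market hits new high",
    "GDP growth exceeds expectations",
    "New planet discovered in distant galaxy",
    "Scientists find water on Mars",
]
train_labels = ["finance", "finance", "science", "science"]

# Embed the training texts and fit a classifier on the vectors
X_train = np.array(create_embeddings_batch(train_texts))
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Classify a new headline
X_new = np.array(create_embeddings_batch(["Telescope captures black hole image"]))
print(clf.predict(X_new))  # likely ['science']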
4. Anomaly Detection
def detect_anomalies(normal_texts: List[str], test_texts: List[str], threshold: float = 0.7):
    """Detect texts that don't fit the normal pattern"""
    # Create embeddings for all texts in one batch
    all_texts = normal_texts + test_texts
    response = client.embeddings.create(
        input=all_texts,
        model="text-embedding-3-small"
    )
    embeddings = [np.array(item.embedding) for item in response.data]
    normal_embeddings = embeddings[:len(normal_texts)]
    test_embeddings = embeddings[len(normal_texts):]
    # Calculate the average "normal" embedding (centroid)
    avg_normal = np.mean(normal_embeddings, axis=0)
    # Flag test texts that fall too far from the centroid
    anomalies = []
    for text, embedding in zip(test_texts, test_embeddings):
        similarity = np.dot(avg_normal, embedding) / (
            np.linalg.norm(avg_normal) * np.linalg.norm(embedding)
        )
        if similarity < threshold:
            anomalies.append((text, similarity))
    return anomalies
# Example: Customer support tickets
normal_tickets = [
"My order hasn't arrived yet",
"Need to return a damaged product",
"Can I get a refund?",
"Tracking number doesn't work",
]
test_tickets = [
"When will my package arrive?", # Similar to normal
"I love your products!", # Anomaly (positive feedback, not issue)
"Send me free stuff or I'll sue", # Anomaly (threat)
]
anomalies = detect_anomalies(normal_tickets, test_tickets, threshold=0.75)
print("Anomalous tickets detected:")
for ticket, score in anomalies:
    print(f" {score:.3f} | {ticket}")
Visualizing Embeddings
High-dimensional embeddings (1536D) are impossible to visualize directly. We use dimensionality reduction to project them into 2D or 3D.
Using t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding): A dimensionality reduction technique that projects high-dimensional data into 2D or 3D for visualization while preserving local structure and clustering patterns.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
def visualize_embeddings(texts: List[str], labels: List[str] = None):
    """Visualize embeddings in 2D using t-SNE"""
    # Create embeddings
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )
    embeddings = np.array([item.embedding for item in response.data])
    # Reduce to 2D (perplexity must be smaller than the number of samples)
    tsne = TSNE(n_components=2, random_state=42, perplexity=5)
    embeddings_2d = tsne.fit_transform(embeddings)
    # Plot
    plt.figure(figsize=(12, 8))
    if labels:
        # Color points by label
        unique_labels = list(set(labels))
        colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
        for label, color in zip(unique_labels, colors):
            mask = np.array([l == label for l in labels])
            plt.scatter(
                embeddings_2d[mask, 0],
                embeddings_2d[mask, 1],
                c=[color],
                label=label,
                s=100
            )
        plt.legend()  # only show a legend when points are labeled
    else:
        plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100)
    # Annotate points with their (truncated) source text
    for i, text in enumerate(texts):
        plt.annotate(
            text[:30],  # Truncate long texts
            (embeddings_2d[i, 0], embeddings_2d[i, 1]),
            fontsize=8
        )
    plt.xlabel("t-SNE Dimension 1")
    plt.ylabel("t-SNE Dimension 2")
    plt.title("Embedding Visualization (2D projection)")
    plt.tight_layout()
    plt.savefig("embeddings_visualization.png", dpi=300)
    plt.show()
# Example: Visualize different categories
texts = [
# Animals
"dog", "cat", "puppy", "kitten",
# Vehicles
"car", "truck", "bus", "van",
# Fruits
"apple", "banana", "orange", "grape",
]
labels = (
["animal"] * 4 +
["vehicle"] * 4 +
["fruit"] * 4
)
visualize_embeddings(texts, labels)
Using PCA (Faster Alternative)
from sklearn.decomposition import PCA
def visualize_with_pca(texts: List[str]):
    """Faster visualization using PCA"""
    # Create embeddings
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )
    embeddings = np.array([item.embedding for item in response.data])
    # Reduce dimensions: 1536 → 2
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embeddings)
    # Plot
    plt.figure(figsize=(10, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100)
    for i, text in enumerate(texts):
        plt.annotate(text, (embeddings_2d[i, 0], embeddings_2d[i, 1]))
    plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
    plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
    plt.title("Embeddings Visualization (PCA)")
    plt.tight_layout()
    plt.show()
t-SNE vs PCA: Use t-SNE for better cluster separation and local structure; use PCA when you need speed or want to preserve global structure. t-SNE is non-linear and usually produces clearer cluster plots, while PCA is linear, deterministic, and much faster.
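The two are often combined in practice: PCA first compresses the 1536-dimensional vectors to roughly 50 components, and t-SNE then projects those to 2D. This is much faster on larger datasets and is the approach suggested in scikit-learn's t-SNE documentation. A minimal sketch, assuming embeddings is a NumPy array like the ones built above:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def reduce_for_plotting(embeddings, pca_components=50):
    """PCA down to ~50 dimensions (capped by sample count), then t-SNE to 2D."""
    n_components = min(pca_components, len(embeddings) - 1)
    compressed = PCA(n_components=n_components).fit_transform(embeddings)
    # perplexity must stay below the number of samples
    return TSNE(n_components=2, random_state=42, perplexity=5).fit_transform(compressed)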
Best Practices
1. Choose the Right Model
# Most use cases: text-embedding-3-small
embedding = client.embeddings.create(
    input="Your text",
    model="text-embedding-3-small"
).data[0].embedding

# High accuracy needed: text-embedding-3-large
embedding = client.embeddings.create(
    input="Your text",
    model="text-embedding-3-large"
).data[0].embedding
2. Normalize Your Text
import re

def normalize_text(text: str) -> str:
    """Normalize text before embedding"""
    # Lowercase (optional - modern embedding models handle case fine)
    text = text.lower()
    # Collapse extra whitespace
    text = " ".join(text.split())
    # Remove special characters (optional - punctuation can carry meaning,
    # so only do this if your data is noisy)
    text = re.sub(r'[^\w\s]', '', text)
    return text
# Use it
text = " Hello, WORLD!!! "
normalized = normalize_text(text)
embedding = create_embedding(normalized)
3. Cache Embeddings
import json
import hashlib
class EmbeddingCache:
    def __init__(self, cache_file="embeddings.json"):
        self.cache_file = cache_file
        self.cache = self._load()

    def _load(self):
        try:
            with open(self.cache_file) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)

    def get(self, text: str):
        key = hashlib.md5(text.encode()).hexdigest()
        if key in self.cache:
            return self.cache[key]
        # Cache miss: create a new embedding and persist it
        embedding = create_embedding(text)
        self.cache[key] = embedding
        self._save()
        return embedding
# Usage
cache = EmbeddingCache()
emb1 = cache.get("Hello") # API call
emb2 = cache.get("Hello") # Cache hit! (no cost)
Summary
Embeddings are the foundation of semantic AI:
- Representation: Convert text to vectors that capture meaning
- Models: Use text-embedding-3-small for most tasks, text-embedding-3-large for highest accuracy
- Dimensions: Higher dimensions = more accuracy, but more storage/compute
- Use Cases: Search, recommendations, clustering, anomaly detection
- Visualization: Use t-SNE or PCA to understand embedding structure
- Best Practices: Batch processing, caching, consistent dimensions
With embeddings, you can build systems that understand meaning, not just match keywords.