Understanding Embeddings
Embeddings are the foundation of modern semantic search, recommendation systems, and AI applications. But what exactly are they, and how do they work? Let's demystify these powerful numerical representations.
What Are Embeddings?
An embedding is a dense vector representation of data (text, images, audio) in a continuous, high-dimensional space, arranged so that semantic similarity is preserved as geometric proximity.
Embedding Definition: An embedding is a numerical representation (vector) of data that captures its semantic meaning. Similar concepts have similar vectors, enabling mathematical operations on meaning itself.
The Core Concept
# Human-readable text
text = "The quick brown fox"
# Machine-understandable embedding
embedding = [0.023, -0.145, 0.389, ..., -0.234]
# A list of 1536 floating-point numbers
Key Principle: Similar meanings → Similar vectors
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="your-api-key")
def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding
# Similar concepts produce similar embeddings
dog_embedding = get_embedding("dog")
puppy_embedding = get_embedding("puppy")
car_embedding = get_embedding("car")
# Calculate similarity (cosine)
def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"dog vs puppy: {cosine_similarity(dog_embedding, puppy_embedding):.3f}") # ~0.85
print(f"dog vs car: {cosine_similarity(dog_embedding, car_embedding):.3f}") # ~0.45
Cosine Similarity: A measure of similarity between two vectors that calculates the cosine of the angle between them. Values range from -1 (opposite) to 1 (identical), with higher values indicating greater semantic similarity.
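Written out for two vectors a and b, this is just the normalized dot product: cosine_similarity(a, b) = (a · b) / (||a|| × ||b||), which is exactly what the NumPy helper above computes.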
Key Insight: Embeddings map semantic similarity to geometric proximity. Words with similar meanings cluster together in high-dimensional space, enabling mathematical operations on meaning.
How Embeddings Are Created
The Training Process
Embedding models learn to represent text through self-supervised learning on massive text corpora:
"""
Embedding Training (Simplified):
1. TRAINING DATA
- Billions of text examples
- Learn from context: words appearing together
2. LEARNING OBJECTIVE
- Predict masked words: "The [MASK] sat on the mat" → "cat"
- Context similarity: "king - man + woman ≈ queen"
- Contrastive learning: similar pairs close, dissimilar pairs far
3. RESULT
- Dense vector for each token
- Semantic relationships encoded geometrically
"""
# Example: Vector arithmetic captures relationships
# These work because embeddings preserve semantic structure:
# - king - man + woman ≈ queen
# - Paris - France + Italy ≈ Rome
# - walked - walking + swimming ≈ swam
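These analogies are easiest to demonstrate with classic word vectors; with sentence-level embedding APIs the effect is weaker but often still visible. A rough sketch, reusing the get_embedding and cosine_similarity helpers from above (the candidate word list is illustrative, and results vary by model):

# Compose a vector for "king - man + woman" and see which candidate lands closest
words = ["queen", "princess", "king", "man", "woman", "apple"]
vectors = {w: np.array(get_embedding(w)) for w in words}

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Exclude the input words, as is standard for analogy tests
candidates = [w for w in words if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))
print(best)  # often "queen", but not guaranteed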
Modern Embedding Architectures
"""
Evolution of Embedding Models:
1. Word2Vec (2013)
- Skip-gram and CBOW
- Static embeddings (one vector per word)
- No context sensitivity
2. GloVe (2014)
- Global word-word co-occurrence
- Better captures semantic relationships
3. ELMo (2018)
- Contextual embeddings
- Same word, different vectors based on context
4. BERT/Transformers (2018+)
- Deep contextual understanding
- Attention mechanisms
- State-of-the-art performance
5. Modern (2023+)
- OpenAI text-embedding-3
- Cohere embed-v3
- Optimized for specific tasks
"""
Creating Embeddings with OpenAI
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
# Single text embedding
def create_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding
# Example
text = "Machine learning is transforming technology"
embedding = create_embedding(text)
print(f"Embedding dimension: {len(embedding)}") # 1536
print(f"First 5 values: {embedding[:5]}")
# Output: [0.0123, -0.0456, 0.0789, -0.0234, 0.0567]
Batch Processing for Efficiency
# BAD: One at a time (slow, expensive)
texts = ["text1", "text2", "text3", ..., "text100"]
embeddings = [create_embedding(t) for t in texts] # 100 API calls!
# GOOD: Batch processing (fast, cheap)
def create_embeddings_batch(texts, model="text-embedding-3-small"):
    """Create multiple embeddings in one API call"""
    response = client.embeddings.create(
        input=texts,  # List of texts
        model=model
    )
    return [item.embedding for item in response.data]
# Process up to 2048 texts at once
embeddings = create_embeddings_batch(texts) # 1 API call!
Cost and Latency Optimization: Batching does not change the per-token price, but it removes nearly all per-request overhead, so large jobs run 10-100x faster and use far fewer rate-limited requests. Always batch when creating multiple embeddings; OpenAI accepts up to 2048 inputs per request.
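If you have more than 2048 texts, split the list into chunks and make one request per chunk. A minimal sketch building on create_embeddings_batch above (the helper name is illustrative):

def create_embeddings_large(texts, batch_size=2048):
    """Embed an arbitrarily long list of texts, batch_size items per API call."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        embeddings.extend(create_embeddings_batch(texts[i:i + batch_size]))
    return embeddings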
Embedding Dimensions and Models
Understanding Dimensions
Embedding Dimensions: The number of values in an embedding vector. Each dimension captures different semantic features learned during training. More dimensions generally mean more nuanced representations but also more storage and computation.
# What are dimensions?
embedding = [0.1, -0.2, 0.3, 0.4, -0.5] # 5-dimensional vector
# Each dimension captures different semantic aspects:
# - Dimension 0 might encode "animal vs object"
# - Dimension 1 might encode "size"
# - Dimension 2 might encode "sentiment"
# - etc. (learned automatically, not manually defined)
OpenAI Embedding Models Comparison
models_comparison = {
    "text-embedding-3-small": {
        "dimensions": 1536,
        "performance": "High quality",
        "cost_per_1M_tokens": "$0.02",
        "use_case": "Most applications - best value",
        "max_input_tokens": 8191,
        "release": "January 2024"
    },
    "text-embedding-3-large": {
        "dimensions": 3072,
        "performance": "Highest quality",
        "cost_per_1M_tokens": "$0.13",
        "use_case": "Maximum accuracy needed",
        "max_input_tokens": 8191,
        "release": "January 2024"
    },
    "text-embedding-ada-002": {
        "dimensions": 1536,
        "performance": "Good (legacy)",
        "cost_per_1M_tokens": "$0.10",
        "use_case": "Legacy projects only",
        "max_input_tokens": 8191,
        "release": "December 2022"
    }
}
# Recommendation: Use text-embedding-3-small for most tasks
Performance Comparison
import numpy as np
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
def compare_models():
    """Compare embedding quality across models"""
    # Test cases
    queries = [
        ("dog", "puppy"),   # Should be very similar
        ("dog", "canine"),  # Should be very similar
        ("dog", "cat"),     # Should be somewhat similar
        ("dog", "car"),     # Should be dissimilar
    ]
    models = ["text-embedding-3-small", "text-embedding-3-large"]
    for model in models:
        print(f"\n{model}:")
        print("-" * 50)
        for word1, word2 in queries:
            # Get both embeddings in one request
            embeddings = client.embeddings.create(
                input=[word1, word2],
                model=model
            )
            e1 = np.array(embeddings.data[0].embedding)
            e2 = np.array(embeddings.data[1].embedding)
            # Calculate cosine similarity
            similarity = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
            print(f"{word1:10} vs {word2:10} = {similarity:.4f}")
compare_models()
# Output:
# text-embedding-3-small:
# --------------------------------------------------
# dog vs puppy = 0.8567
# dog vs canine = 0.8234
# dog vs cat = 0.7123
# dog vs car = 0.4567
#
# text-embedding-3-large:
# --------------------------------------------------
# dog vs puppy = 0.8891 ← Better discrimination
# dog vs canine = 0.8678
# dog vs cat = 0.6892
# dog vs car = 0.3912
Dimension Reduction
# text-embedding-3 models support dimension reduction
# Reduce storage and speed up search with minimal accuracy loss
response = client.embeddings.create(
    input="Your text here",
    model="text-embedding-3-small",
    dimensions=512  # Reduce from 1536 to 512
)
# Benefits:
# - 66% less storage (1536 → 512)
# - 3x faster similarity search
# - ~95% of original accuracy retained
# Supported dimensions for text-embedding-3-small:
# Any value from 1 to 1536
# Example: Testing different dimensions
def test_dimensions():
    text = "Machine learning embeddings"
    for dims in [256, 512, 768, 1024, 1536]:
        response = client.embeddings.create(
            input=text,
            model="text-embedding-3-small",
            dimensions=dims
        )
        embedding = response.data[0].embedding
        print(f"Dimensions: {dims:4d} | Actual length: {len(embedding)}")
Important: When using dimension reduction, stay consistent. Don't compare embeddings with different dimensions. Choose your dimension count upfront based on accuracy vs. performance trade-offs.
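A simple way to stay consistent is to fix the model and dimension count in one place and route every embedding call through a single helper, at indexing time and at query time alike. A minimal sketch; the constant and function names are illustrative:

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 512  # chosen once, used for every embedding in the system

def embed(texts):
    """Every embedding in the application is created through this one helper."""
    response = client.embeddings.create(
        input=texts,
        model=EMBEDDING_MODEL,
        dimensions=EMBEDDING_DIMENSIONS
    )
    return [item.embedding for item in response.data]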
Use Cases for Embeddings
1. Semantic Search
# Traditional keyword search: exact matches only
# Query: "canine" → Won't match "dog"
# Semantic search with embeddings: meaning-based
# Query: "canine" → Matches "dog", "puppy", "pet"
from typing import List, Tuple
class SemanticSearch:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def index(self, documents: List[str]):
        """Index documents for search"""
        # Create embeddings for all documents in one batch call
        response = client.embeddings.create(
            input=documents,
            model="text-embedding-3-small"
        )
        self.documents = documents
        self.embeddings = [np.array(item.embedding) for item in response.data]

    def search(self, query: str, top_k: int = 5) -> List[Tuple[str, float]]:
        """Search for similar documents"""
        # Get query embedding
        query_response = client.embeddings.create(
            input=query,
            model="text-embedding-3-small"
        )
        query_embedding = np.array(query_response.data[0].embedding)
        # Calculate cosine similarity against every indexed document
        similarities = []
        for doc, doc_embedding in zip(self.documents, self.embeddings):
            similarity = np.dot(query_embedding, doc_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
            )
            similarities.append((doc, similarity))
        # Sort and return top-k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]
# Example
search = SemanticSearch()
search.index([
"Python is a programming language",
"Dogs are loyal pets",
"Machine learning uses algorithms",
"Puppies are young dogs",
"JavaScript runs in browsers"
])
results = search.search("What are canines?", top_k=3)
for doc, score in results:
    print(f"{score:.3f} | {doc}")
# Output:
# 0.823 | Dogs are loyal pets
# 0.798 | Puppies are young dogs
# 0.456 | Python is a programming language
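For larger collections, the per-document Python loop in search() becomes the bottleneck. A common optimization, sketched below with an illustrative fast_search helper, is to store the document embeddings as one L2-normalized NumPy matrix so each query is a single matrix-vector product; beyond tens of thousands of documents you would typically move to a vector database instead.

# Precompute a normalized document matrix once after indexing
doc_matrix = np.array(search.embeddings)
doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)

def fast_search(query_embedding, top_k=3):
    q = np.array(query_embedding)
    q = q / np.linalg.norm(q)
    scores = doc_matrix @ q                  # cosine similarities for all documents at once
    top = np.argsort(scores)[::-1][:top_k]   # indices of the best matches
    return [(search.documents[i], float(scores[i])) for i in top]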
2. Recommendation Systems
class ContentRecommender:
    """Recommend similar content based on embeddings"""

    def __init__(self, items: List[dict]):
        """
        items: List of dicts with 'id', 'title', 'description'
        """
        self.items = items
        # Create embeddings for all items in one batch call
        texts = [f"{item['title']}. {item['description']}" for item in items]
        response = client.embeddings.create(
            input=texts,
            model="text-embedding-3-small"
        )
        self.embeddings = [np.array(item.embedding) for item in response.data]

    def recommend(self, item_id: str, top_k: int = 5):
        """Recommend similar items"""
        # Find the index of the item we are recommending against
        item_idx = next(i for i, item in enumerate(self.items) if item['id'] == item_id)
        item_embedding = self.embeddings[item_idx]
        # Calculate similarity to every other item
        similarities = []
        for i, (item, embedding) in enumerate(zip(self.items, self.embeddings)):
            if i == item_idx:
                continue  # Skip the item itself
            similarity = np.dot(item_embedding, embedding) / (
                np.linalg.norm(item_embedding) * np.linalg.norm(embedding)
            )
            similarities.append((item, similarity))
        # Return top-k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]
# Example: Movie recommendations
movies = [
{"id": "1", "title": "The Matrix", "description": "Sci-fi action about virtual reality"},
{"id": "2", "title": "Inception", "description": "Dreams within dreams heist thriller"},
{"id": "3", "title": "The Notebook", "description": "Romantic drama love story"},
{"id": "4", "title": "Interstellar", "description": "Space exploration sci-fi epic"},
{"id": "5", "title": "Titanic", "description": "Historical romance on doomed ship"},
]
recommender = ContentRecommender(movies)
recommendations = recommender.recommend("1", top_k=3)
print("Because you liked 'The Matrix', you might like:")
for movie, score in recommendations:
    print(f" {score:.3f} | {movie['title']}")
# Output:
# Because you liked 'The Matrix', you might like:
# 0.876 | Inception
# 0.834 | Interstellar
# 0.523 | The Notebook
3. Clustering and Classification
from sklearn.cluster import KMeans
import numpy as np
def cluster_documents(documents: List[str], n_clusters: int = 3):
    """Cluster documents by semantic similarity"""
    # Create embeddings
    response = client.embeddings.create(
        input=documents,
        model="text-embedding-3-small"
    )
    embeddings = np.array([item.embedding for item in response.data])
    # Cluster with k-means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)
    # Group documents by cluster id
    clustered = {}
    for doc, cluster_id in zip(documents, clusters):
        if cluster_id not in clustered:
            clustered[cluster_id] = []
        clustered[cluster_id].append(doc)
    return clustered
# Example: News articles
articles = [
"Stock market hits new high",
"Tech startup raises $50M in funding",
"New planet discovered in distant galaxy",
"Scientists find water on Mars",
"GDP growth exceeds expectations",
"Telescope captures black hole image",
]
clusters = cluster_documents(articles, n_clusters=2)
for cluster_id, docs in clusters.items():
    print(f"\nCluster {cluster_id}:")
    for doc in docs:
        print(f" - {doc}")
# Output:
# Cluster 0: (Science)
# - New planet discovered in distant galaxy
# - Scientists find water on Mars
# - Telescope captures black hole image
#
# Cluster 1: (Finance)
# - Stock market hits new high
# - Tech startup raises $50M in funding
# - GDP growth exceeds expectations
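This section's heading also mentions classification: once you have labelled examples, the same embeddings work well as input features for an ordinary classifier. A minimal sketch using scikit-learn's LogisticRegression; the tiny training set and labels below are purely illustrative:

from sklearn.linear_model import LogisticRegression

# Small labelled training set (illustrative)
train_texts = [
    "Stock market hits new high",
    "GDP growth exceeds expectations",
    "New planet discovered in distant galaxy",
    "Scientists find water on Mars",
]
train_labels = ["finance", "finance", "science", "science"]

# Embed the training texts and fit a classifier on the vectors
X_train = np.array(create_embeddings_batch(train_texts))
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Classify a new headline
X_new = np.array(create_embeddings_batch(["Telescope captures black hole image"]))
print(clf.predict(X_new))  # likely ['science']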
4. Anomaly Detection
def detect_anomalies(normal_texts: List[str], test_texts: List[str], threshold: float = 0.7):
    """Detect texts that don't fit the normal pattern"""
    # Create embeddings for all texts in one batch
    all_texts = normal_texts + test_texts
    response = client.embeddings.create(
        input=all_texts,
        model="text-embedding-3-small"
    )
    embeddings = [np.array(item.embedding) for item in response.data]
    normal_embeddings = embeddings[:len(normal_texts)]
    test_embeddings = embeddings[len(normal_texts):]
    # Calculate the average "normal" embedding (centroid)
    avg_normal = np.mean(normal_embeddings, axis=0)
    # Flag test texts that fall too far from the centroid
    anomalies = []
    for text, embedding in zip(test_texts, test_embeddings):
        similarity = np.dot(avg_normal, embedding) / (
            np.linalg.norm(avg_normal) * np.linalg.norm(embedding)
        )
        if similarity < threshold:
            anomalies.append((text, similarity))
    return anomalies
# Example: Customer support tickets
normal_tickets = [
"My order hasn't arrived yet",
"Need to return a damaged product",
"Can I get a refund?",
"Tracking number doesn't work",
]
test_tickets = [
"When will my package arrive?", # Similar to normal
"I love your products!", # Anomaly (positive feedback, not issue)
"Send me free stuff or I'll sue", # Anomaly (threat)
]
anomalies = detect_anomalies(normal_tickets, test_tickets, threshold=0.75)
print("Anomalous tickets detected:")
for ticket, score in anomalies:
    print(f" {score:.3f} | {ticket}")
Visualizing Embeddings
High-dimensional embeddings (1536D) are impossible to visualize directly. We use dimensionality reduction to project them into 2D or 3D.
Using t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding): A dimensionality reduction technique that projects high-dimensional data into 2D or 3D for visualization while preserving local structure and clustering patterns.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
def visualize_embeddings(texts: List[str], labels: List[str] = None):
    """Visualize embeddings in 2D using t-SNE"""
    # Create embeddings
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )
    embeddings = np.array([item.embedding for item in response.data])
    # Reduce to 2D (perplexity must be smaller than the number of samples)
    tsne = TSNE(n_components=2, random_state=42, perplexity=5)
    embeddings_2d = tsne.fit_transform(embeddings)
    # Plot
    plt.figure(figsize=(12, 8))
    if labels:
        # Color points by label
        unique_labels = list(set(labels))
        colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
        for label, color in zip(unique_labels, colors):
            mask = np.array([l == label for l in labels])
            plt.scatter(
                embeddings_2d[mask, 0],
                embeddings_2d[mask, 1],
                c=[color],
                label=label,
                s=100
            )
        plt.legend()  # only show a legend when points are labeled
    else:
        plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100)
    # Annotate points with their (truncated) source text
    for i, text in enumerate(texts):
        plt.annotate(
            text[:30],  # Truncate long texts
            (embeddings_2d[i, 0], embeddings_2d[i, 1]),
            fontsize=8
        )
    plt.xlabel("t-SNE Dimension 1")
    plt.ylabel("t-SNE Dimension 2")
    plt.title("Embedding Visualization (2D projection)")
    plt.tight_layout()
    plt.savefig("embeddings_visualization.png", dpi=300)
    plt.show()
# Example: Visualize different categories
texts = [
# Animals
"dog", "cat", "puppy", "kitten",
# Vehicles
"car", "truck", "bus", "van",
# Fruits
"apple", "banana", "orange", "grape",
]
labels = (
["animal"] * 4 +
["vehicle"] * 4 +
["fruit"] * 4
)
visualize_embeddings(texts, labels)
Using PCA (Faster Alternative)
from sklearn.decomposition import PCA
def visualize_with_pca(texts: List[str]):
    """Faster visualization using PCA"""
    # Create embeddings
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small"
    )
    embeddings = np.array([item.embedding for item in response.data])
    # Reduce dimensions: 1536 → 2
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embeddings)
    # Plot
    plt.figure(figsize=(10, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100)
    for i, text in enumerate(texts):
        plt.annotate(text, (embeddings_2d[i, 0], embeddings_2d[i, 1]))
    plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
    plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
    plt.title("Embeddings Visualization (PCA)")
    plt.tight_layout()
    plt.show()
t-SNE vs PCA: Use t-SNE for better cluster separation and local structure; use PCA when you need speed or want to preserve global structure. t-SNE is non-linear and usually produces clearer cluster plots, while PCA is linear, deterministic, and much faster.
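The two are often combined in practice: PCA first compresses the 1536-dimensional vectors to roughly 50 components, and t-SNE then projects those to 2D. This is much faster on larger datasets and is the approach suggested in scikit-learn's t-SNE documentation. A minimal sketch, assuming embeddings is a NumPy array like the ones built above:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def reduce_for_plotting(embeddings, pca_components=50):
    """PCA down to ~50 dimensions (capped by sample count), then t-SNE to 2D."""
    n_components = min(pca_components, len(embeddings) - 1)
    compressed = PCA(n_components=n_components).fit_transform(embeddings)
    # perplexity must stay below the number of samples
    return TSNE(n_components=2, random_state=42, perplexity=5).fit_transform(compressed)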
Best Practices
1. Choose the Right Model
# Most use cases: text-embedding-3-small
embedding = client.embeddings.create(
    input="Your text",
    model="text-embedding-3-small"
).data[0].embedding

# High accuracy needed: text-embedding-3-large
embedding = client.embeddings.create(
    input="Your text",
    model="text-embedding-3-large"
).data[0].embedding
2. Normalize Your Text
import re

def normalize_text(text: str) -> str:
    """Normalize text before embedding"""
    # Lowercase (optional - modern embedding models handle case fine)
    text = text.lower()
    # Collapse extra whitespace
    text = " ".join(text.split())
    # Remove special characters (optional - punctuation can carry meaning,
    # so only do this if your data is noisy)
    text = re.sub(r'[^\w\s]', '', text)
    return text
# Use it
text = " Hello, WORLD!!! "
normalized = normalize_text(text)
embedding = create_embedding(normalized)
3. Cache Embeddings
import json
import hashlib
class EmbeddingCache:
    def __init__(self, cache_file="embeddings.json"):
        self.cache_file = cache_file
        self.cache = self._load()

    def _load(self):
        try:
            with open(self.cache_file) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)

    def get(self, text: str):
        key = hashlib.md5(text.encode()).hexdigest()
        if key in self.cache:
            return self.cache[key]
        # Cache miss: create a new embedding and persist it
        embedding = create_embedding(text)
        self.cache[key] = embedding
        self._save()
        return embedding
# Usage
cache = EmbeddingCache()
emb1 = cache.get("Hello") # API call
emb2 = cache.get("Hello") # Cache hit! (no cost)
Summary
Embeddings are the foundation of semantic AI:
- Representation: Convert text to vectors that capture meaning
- Models: Use text-embedding-3-small for most tasks, text-embedding-3-large for highest accuracy
- Dimensions: Higher dimensions = more accuracy, but more storage/compute
- Use Cases: Search, recommendations, clustering, anomaly detection
- Visualization: Use t-SNE or PCA to understand embedding structure
- Best Practices: Batch processing, caching, consistent dimensions
With embeddings, you can build systems that understand meaning, not just match keywords.