Positional Encodings Explained
Transformers process all tokens in parallel, which means they have no inherent notion of sequence order. Positional encodings solve this by injecting position information into the input. This lesson explores why we need them and how they work.
The Position Problem
Why Transformers Don't Understand Order
Permutation Invariance: A property where rearranging the order of the inputs merely rearranges the outputs in the same way (strictly speaking, this is permutation equivariance). Transformers without positional encodings treat sequences as unordered sets, losing crucial sequential information.
Self-attention has exactly this property: shuffling the input produces a correspondingly shuffled output, as the following demonstration shows.
```python
import torch
import torch.nn as nn

# Simple self-attention (without positional encoding)
d_model = 4
seq_len = 3
X = torch.randn(seq_len, d_model)

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

def attention(x):
    Q, K, V = W_q(x), W_k(x), W_v(x)
    scores = torch.matmul(Q, K.T) / (d_model ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)

# Original sequence
output1 = attention(X)

# Permuted sequence
X_permuted = X[[2, 0, 1], :]  # Shuffle rows
output2 = attention(X_permuted)

# Outputs are correspondingly permuted
print("Original output:\n", output1)
print("\nPermuted output:\n", output2[[1, 2, 0], :])  # Un-permute
print("\nAre they the same?", torch.allclose(output1, output2[[1, 2, 0], :]))
```
Why Order Matters
Consider these sentences:
- "The cat chased the dog"
- "The dog chased the cat"
Same words, completely different meanings! Without positional information, a transformer can't distinguish them.
The Permutation Problem:
Without positional encodings:
- "I love Paris" = "Paris love I" = "love I Paris"
- Transformer would treat all three identically
- Word order is crucial for understanding language
RNNs don't have this problem because they process sequentially. Transformers gain parallelism but lose position awareness.
The Solution: Positional Encodings
Add position information directly to the input embeddings:
Input = Word Embedding + Positional Encoding
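A minimal sketch of the idea (the shapes here are illustrative assumptions, not from any particular model): each token's embedding is summed elementwise with the encoding for its position.

```python
import torch

word_embeddings = torch.randn(5, 16)  # 5 tokens, d_model = 16
pos_encodings = torch.randn(5, 16)    # one encoding per position (stand-in values)
model_input = word_embeddings + pos_encodings
print(model_input.shape)              # torch.Size([5, 16])
```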
Requirements for Good Positional Encodings
- Unique: Different positions get different encodings
- Consistent: Same relative positions should have consistent relationships
- Generalizable: Works for sequences longer than those seen in training
- Bounded: Values don't grow arbitrarily large
Sinusoidal Positional Encoding
Sinusoidal Positional Encoding: A fixed mathematical function using sine and cosine waves at different frequencies to encode position information, providing unique representations for each position that can generalize to unseen sequence lengths.
The original transformer paper uses sine and cosine functions.
The Formula
For position pos and dimension pair index i:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:
- pos = position in sequence (0, 1, 2, ...)
- i = dimension pair index (0, 1, 2, ..., d_model/2 - 1)
- Even dimensions (2i) use sine
- Odd dimensions (2i+1) use cosine
Implementation
```python
import numpy as np
import torch
import matplotlib.pyplot as plt

def get_positional_encoding(max_len, d_model):
    """
    Generate sinusoidal positional encodings.

    Args:
        max_len: Maximum sequence length
        d_model: Model dimension (must be even)
    Returns:
        pe: Positional encoding matrix (max_len, d_model)
    """
    # Initialize encoding matrix
    pe = np.zeros((max_len, d_model))

    # Create position indices [0, 1, 2, ..., max_len-1]
    position = np.arange(0, max_len)[:, np.newaxis]  # (max_len, 1)

    # Inverse frequencies for dimension pairs [0, 2, 4, ..., d_model-2]
    div_term = np.exp(
        np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model)
    )  # (d_model/2,)

    # Apply sine to even indices
    pe[:, 0::2] = np.sin(position * div_term)
    # Apply cosine to odd indices
    pe[:, 1::2] = np.cos(position * div_term)

    return torch.FloatTensor(pe)

# Generate positional encodings
max_len = 100
d_model = 128
pe = get_positional_encoding(max_len, d_model)

print("Positional encoding shape:", pe.shape)  # (100, 128)
print("\nFirst position encoding:")
print(pe[0, :8])  # First 8 dimensions
print("\nSecond position encoding:")
print(pe[1, :8])
```
Visualizing Positional Encodings
```python
def visualize_positional_encoding(pe):
    """Visualize positional encoding as a heatmap and as per-dimension curves."""
    # Plot heatmap
    plt.figure(figsize=(12, 6))
    plt.imshow(pe.numpy(), cmap='RdBu', aspect='auto')
    plt.colorbar(label='Encoding Value')
    plt.xlabel('Embedding Dimension')
    plt.ylabel('Position in Sequence')
    plt.title('Sinusoidal Positional Encoding')
    plt.tight_layout()
    plt.show()

    # Plot specific dimensions over positions
    plt.figure(figsize=(12, 6))
    for i in [0, 1, 4, 8, 16, 32]:
        plt.plot(pe[:, i].numpy(), label=f'Dim {i}')
    plt.xlabel('Position')
    plt.ylabel('Encoding Value')
    plt.title('Positional Encoding Values Across Positions')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# Visualize
pe = get_positional_encoding(100, 128)
visualize_positional_encoding(pe)
```
Patterns in the Visualization:
- Low dimensions (columns 0-10): Rapid oscillation; values change quickly with position
- High dimensions (columns 100-127): Slow oscillation; values change slowly with position
- Wavelengths: Each dimension pair has a different wavelength, forming a geometric progression from 2π to 10000·2π
This creates a unique "fingerprint" for each position.
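A quick check of that wavelength claim (an illustrative sketch; the chosen dimension pairs are arbitrary): the wavelength of dimension pair i is 2π · 10000^(2i/d_model).

```python
import numpy as np

d_model = 128
i = np.array([0, 16, 32, 48, 63])  # a few dimension pairs
wavelengths = 2 * np.pi * 10000 ** (2 * i / d_model)
print(wavelengths)  # grows geometrically from ~6.28 toward 10000 * 2π
```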
Why Sinusoidal Functions?
1. Unique Representations
Each position gets a unique encoding vector:
```python
pe = get_positional_encoding(100, 128)

# Compare different positions
pos_0 = pe[0]
pos_1 = pe[1]
pos_50 = pe[50]

print("Similarity (pos 0 vs pos 1):", torch.cosine_similarity(pos_0, pos_1, dim=0))
print("Similarity (pos 0 vs pos 50):", torch.cosine_similarity(pos_0, pos_50, dim=0))
```
2. Relative Position Information
For any fixed offset k, the encoding at position pos + k is a linear transformation of the encoding at position pos:

PE(pos + k) = M_k · PE(pos)

where the matrix M_k depends only on the offset k, not on pos. This is due to the trigonometric identities:

sin(α + β) = sin(α)cos(β) + cos(α)sin(β)
cos(α + β) = cos(α)cos(β) - sin(α)sin(β)
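A numerical sketch of this property (illustrative; the specific frequency and positions are arbitrary choices): within one sin/cos dimension pair, M_k is just a 2×2 rotation whose angle depends only on the offset k.

```python
import numpy as np

w = 1.0 / (10000 ** (4 / 128))  # frequency of dimension pair i = 2 for d_model = 128
pos, k = 7, 5

pair = np.array([np.sin(w * pos), np.cos(w * pos)])  # PE values at position pos
M_k = np.array([
    [np.cos(w * k),  np.sin(w * k)],
    [-np.sin(w * k), np.cos(w * k)],
])  # depends only on the offset k, not on pos

print(np.allclose(M_k @ pair, [np.sin(w * (pos + k)), np.cos(w * (pos + k))]))  # True
```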
3. Extrapolation to Longer Sequences
Sinusoidal functions continue smoothly beyond training lengths:
```python
# Train on sequences up to length 50
train_pe = get_positional_encoding(50, 128)

# Generalize to length 200 (4x longer)
test_pe = get_positional_encoding(200, 128)

# The pattern continues smoothly: the first 50 rows are identical
print("Training PE shape:", train_pe.shape)
print("Test PE shape:", test_pe.shape)
print("Prefix matches:", torch.allclose(test_pe[:50], train_pe))  # True
```
4. Bounded Values
All values stay in [-1, 1]:
```python
pe = get_positional_encoding(1000, 512)
print("Min value:", pe.min().item())  # Close to -1
print("Max value:", pe.max().item())  # Close to +1
```
Using Positional Encodings in Practice
Adding to Input Embeddings
```python
class PositionalEncoding(nn.Module):
    """Positional encoding module for transformers."""

    def __init__(self, d_model, max_len=5000, dropout=0.1):
        """
        Args:
            d_model: Model dimension
            max_len: Maximum sequence length
            dropout: Dropout probability
        """
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Register as buffer (not a parameter, but part of state)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: Input embeddings (batch, seq_len, d_model)
        Returns:
            x: Embeddings with positional encoding added (batch, seq_len, d_model)
        """
        # Add positional encoding, trimmed to the input's sequence length
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

# Example usage
batch_size = 2
seq_len = 10
d_model = 128
vocab_size = 10000

# Token embeddings
embedding = nn.Embedding(vocab_size, d_model)
pos_encoding = PositionalEncoding(d_model)

# Input tokens
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# Get embeddings
token_embeddings = embedding(tokens)  # (2, 10, 128)
print("Token embeddings shape:", token_embeddings.shape)

# Add positional encoding
output = pos_encoding(token_embeddings)  # (2, 10, 128)
print("Output shape:", output.shape)
```
Scaling Convention
The original paper scales embeddings before adding positional encodings:
```python
# Standard approach
token_embeddings = embedding(tokens) * np.sqrt(d_model)
output = pos_encoding(token_embeddings)
```
This makes the embedding and positional encoding magnitudes comparable.
Why scale by √d_model?
- Token embeddings typically have variance ≈ 1 (from their initialization)
- Summing two independent random variables adds their variances: var = var₁ + var₂
- Scaling by √d_model raises the embedding variance to ≈ d_model
- After adding the PE, total variance ≈ d_model + 1 ≈ d_model
- This keeps the token signal strong relative to the positional information (see the rough check below)
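A rough numerical sketch of that argument, reusing get_positional_encoding from above (the sizes are illustrative assumptions):

```python
emb = torch.randn(1000, 512)  # stand-in embeddings with variance ≈ 1
scaled = emb * np.sqrt(512)
print("Scaled embedding variance:", scaled.var().item())  # ≈ 512

pe = get_positional_encoding(1000, 512)
print("PE variance:", pe.var().item())  # under 1, since values are bounded in [-1, 1]
```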
Learned Positional Embeddings
An alternative to sinusoidal encodings: learn position embeddings as parameters.
Implementation
```python
class LearnedPositionalEncoding(nn.Module):
    """Learned positional embeddings (used in BERT)."""

    def __init__(self, d_model, max_len=512, dropout=0.1):
        super(LearnedPositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Learnable position embeddings
        self.position_embeddings = nn.Embedding(max_len, d_model)

    def forward(self, x):
        """
        Args:
            x: Input embeddings (batch, seq_len, d_model)
        Returns:
            x: Embeddings with positional encoding added
        """
        batch_size, seq_len, d_model = x.size()

        # Create position IDs [0, 1, 2, ..., seq_len-1]
        position_ids = torch.arange(seq_len, dtype=torch.long, device=x.device)
        position_ids = position_ids.unsqueeze(0).expand(batch_size, -1)  # (batch, seq_len)

        # Get position embeddings
        position_embeds = self.position_embeddings(position_ids)

        # Add to input
        x = x + position_embeds
        return self.dropout(x)

# Example
learned_pe = LearnedPositionalEncoding(d_model=128, max_len=512)
token_embeddings = torch.randn(2, 10, 128)
output = learned_pe(token_embeddings)
print("Output shape:", output.shape)
```
Learned vs Sinusoidal
| Aspect | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 (deterministic) | max_len × d_model |
| Extrapolation | Natural (continues smoothly) | Poor (unseen positions) |
| Flexibility | Fixed pattern | Adapts to data |
| Used in | Original Transformer, many models | BERT, GPT-2 |
Modern Practice:
- BERT, GPT-2, GPT-3: Learned embeddings
- T5, Reformer: Relative positional encodings
- RoFormer, LLaMA: Rotary Position Embeddings (RoPE)
Learned embeddings often work slightly better for fixed-length tasks, but sinusoidal is better for variable/long sequences.
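To make the parameter-count row of the table concrete, here is a quick check against the LearnedPositionalEncoding class above (the d_model and max_len values are illustrative):

```python
learned = LearnedPositionalEncoding(d_model=768, max_len=512)
n_params = sum(p.numel() for p in learned.parameters())
print(n_params)  # 512 * 768 = 393,216 learnable parameters; sinusoidal has zero
```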
Advanced: Relative Positional Encodings
Relative Positional Encoding: An alternative to absolute positions that encodes the distance between positions rather than their absolute locations, allowing models to better capture positional relationships and generalize to longer sequences.
Instead of absolute positions (0, 1, 2, ...), encode relative distances.
Motivation
For attention, what matters is often relative position:
- "I saw her duck" - "her" is 2 positions before "duck"
- Absolute positions (3, 5) matter less than the gap (2)
Relative Position Bias (T5 Approach)
```python
class RelativePositionBias(nn.Module):
    """Relative position bias (simplified T5-style approach)."""

    def __init__(self, num_heads, max_distance=128):
        super(RelativePositionBias, self).__init__()
        self.num_heads = num_heads
        self.max_distance = max_distance

        # Learnable bias for each relative position and head
        # Positions: [-max_distance, ..., -1, 0, 1, ..., max_distance]
        num_buckets = 2 * max_distance + 1
        self.relative_bias = nn.Embedding(num_buckets, num_heads)

    def forward(self, seq_len):
        """
        Compute relative position bias.

        Args:
            seq_len: Sequence length
        Returns:
            bias: (num_heads, seq_len, seq_len)
        """
        # Compute relative positions
        positions = torch.arange(seq_len)
        relative_positions = positions[:, None] - positions[None, :]  # (seq_len, seq_len)

        # Clip to max distance
        relative_positions = torch.clamp(
            relative_positions,
            -self.max_distance,
            self.max_distance
        )

        # Shift to positive indices
        relative_positions = relative_positions + self.max_distance

        # Look up bias values
        bias = self.relative_bias(relative_positions)  # (seq_len, seq_len, num_heads)

        # Transpose to (num_heads, seq_len, seq_len)
        bias = bias.permute(2, 0, 1)
        return bias

# Usage in attention
def attention_with_relative_bias(Q, K, V, bias):
    """
    Attention with relative position bias.

    Args:
        Q, K, V: (batch, num_heads, seq_len, d_k)
        bias: (num_heads, seq_len, seq_len)
    """
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(d_k)

    # Add relative position bias (broadcast across batch)
    scores = scores + bias.unsqueeze(0)

    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```
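A hedged usage sketch of the two pieces above (the tensor sizes are illustrative assumptions):

```python
batch, heads, seq_len, d_k = 2, 4, 6, 16
Q = torch.randn(batch, heads, seq_len, d_k)
K = torch.randn(batch, heads, seq_len, d_k)
V = torch.randn(batch, heads, seq_len, d_k)

rel_bias = RelativePositionBias(num_heads=heads, max_distance=8)
out = attention_with_relative_bias(Q, K, V, rel_bias(seq_len))
print(out.shape)  # torch.Size([2, 4, 6, 16])
```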
Rotary Position Embeddings (RoPE)
Rotary Position Embeddings (RoPE): A position encoding method that applies rotations to query and key vectors based on their absolute positions, creating relative position information in the attention mechanism and enabling excellent length extrapolation.
Used in modern models like LLaMA, RoPE encodes position via rotations in complex space.
Key Idea
Rotate query and key vectors based on their position:
```python
def apply_rotary_emb(x, position):
    """
    Apply rotary position embedding (simplified).

    Args:
        x: Input tensor (..., seq_len, d)
        position: Position indices (seq_len,)
    """
    # Create rotation frequencies, one per dimension pair
    d = x.size(-1)
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))

    # Compute rotation angles for each position
    angles = position[:, None].float() * inv_freq[None, :]  # (seq_len, d/2)

    # Rotation components
    cos = torch.cos(angles)
    sin = torch.sin(angles)

    # Rotate each (even, odd) pair (simplified - actual implementations
    # lay out the pairs differently)
    x_rot = torch.cat([
        x[..., ::2] * cos - x[..., 1::2] * sin,
        x[..., ::2] * sin + x[..., 1::2] * cos
    ], dim=-1)
    return x_rot
```
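A quick numerical sketch of why this yields relative information (illustrative positions and sizes): after rotating queries and keys, their dot product depends only on the offset between their positions, so shifting both by the same amount leaves the score unchanged.

```python
q = torch.randn(1, 8)
k = torch.randn(1, 8)

# The offset is 2 in both cases; only the absolute positions differ
d1 = apply_rotary_emb(q, torch.tensor([3])) @ apply_rotary_emb(k, torch.tensor([1])).T
d2 = apply_rotary_emb(q, torch.tensor([10])) @ apply_rotary_emb(k, torch.tensor([8])).T
print(torch.allclose(d1, d2))  # True
```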
RoPE Advantages:
- Relative information: Naturally encodes relative positions through rotation
- Extrapolation: Generalizes well to longer sequences
- No additional parameters: Applied via rotation, not learned
- Efficiency: Can be computed efficiently
Used in: LLaMA, GPT-NeoX, PaLM
Practical Considerations
Maximum Sequence Length
```python
# Fixed maximum
pe = PositionalEncoding(d_model=512, max_len=512)
# For longer sequences, need to re-initialize or use relative encodings
```
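Because the sinusoidal table is deterministic, re-initializing just means regenerating it at a longer length; a minimal sketch using the PositionalEncoding module from earlier:

```python
# No retraining needed: the same formula extends to any length
longer_pe = PositionalEncoding(d_model=512, max_len=4096)
```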
Memory Usage
```python
# Sinusoidal: O(max_len × d_model) storage (but computed once)
# Learned: O(max_len × d_model) parameters

# For max_len=2048, d_model=768:
memory = 2048 * 768 * 4  # 4 bytes per float32
print(f"Memory: {memory / 1e6:.2f} MB")  # ~6.3 MB
```
Position IDs for Padding
When using padding, position IDs should account for it:
```python
def create_position_ids(input_ids, pad_token_id=0):
    """
    Create position IDs, accounting for padding.

    Args:
        input_ids: (batch, seq_len)
        pad_token_id: Padding token ID
    Returns:
        position_ids: (batch, seq_len)
    """
    # Mask for non-padding tokens
    mask = (input_ids != pad_token_id).long()

    # Cumulative sum gives 1-based positions; zero out padding slots
    position_ids = torch.cumsum(mask, dim=1) * mask

    # Shift to 0-based positions, pinning padding slots to 0
    position_ids = (position_ids - 1).clamp(min=0)
    return position_ids

# Example
input_ids = torch.tensor([
    [101, 2054, 2003, 0, 0],  # Last two are padding
    [101, 2023, 2003, 1037, 3231]
])

position_ids = create_position_ids(input_ids, pad_token_id=0)
print("Position IDs:")
print(position_ids)
# Output:
# [[0, 1, 2, 0, 0],
#  [0, 1, 2, 3, 4]]
```
Summary
Positional encodings solve the position-blindness of transformers:
Why Needed:
- Self-attention is permutation-invariant
- Word order is crucial for language
- Transformers process in parallel, no inherent sequence information
Sinusoidal Encodings:
- Formula: PE(pos, 2i) = sin(pos/10000^(2i/d))
- Unique for each position
- Generalizes to unseen lengths
- Bounded values [-1, 1]
Alternatives:
- Learned: More flexible, but limited to training lengths
- Relative: Focus on position differences, not absolute positions
- RoPE: Rotation-based, excellent extrapolation
Modern Practice:
- GPT-2/3: Learned
- BERT: Learned
- T5: Relative
- LLaMA: RoPE
Position information is fundamental to transformer performance!