Attention mechanisms have become a cornerstone of modern deep learning. Let's explore how they work.
The Intuition
Think of attention like a spotlight. When reading a sentence, you don't give equal weight to every word—you focus on the relevant parts.
Types of Attention
Soft Attention
Soft attention computes a weighted average over all positions:

$$c = \sum_{i} \alpha_i h_i$$

where $\alpha_i$ are the attention weights (non-negative and summing to 1, typically produced by a softmax over alignment scores) and $h_i$ are the hidden states at each position. Because the average is a smooth function of the scores, the whole mechanism is differentiable end to end.
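The weighted average above can be sketched in a few lines of PyTorch. The shapes, the dot-product scoring function, and the variable names here are illustrative choices, not the only way to compute alignment scores:

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: one query vector attends over 4 encoder hidden states.
hidden_states = torch.randn(4, 8)   # (positions, dim) -- assumed shapes
query = torch.randn(8)

# Alignment scores via a plain dot product (one of several scoring choices).
scores = hidden_states @ query            # (4,)
alpha = F.softmax(scores, dim=0)          # attention weights, sum to 1
context = alpha @ hidden_states           # weighted average, shape (8,)
```

Every position contributes to `context` in proportion to its weight, which is what makes soft attention trainable by ordinary backpropagation.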
Hard Attention
Hard attention selects a single position instead of averaging over all of them. Because the selection is a discrete sampling step, it is non-differentiable, so training typically relies on reinforcement-learning-style gradient estimators (e.g., REINFORCE) rather than ordinary backpropagation.
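The difference from soft attention is easiest to see in code. A minimal sketch, with the same assumed shapes as before; the sampling step is the part that blocks gradient flow:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_states = torch.randn(4, 8)   # (positions, dim) -- assumed shapes
query = torch.randn(8)

scores = hidden_states @ query
probs = F.softmax(scores, dim=0)    # a distribution over positions

# Hard attention: sample exactly one position. torch.multinomial is a
# discrete draw, so no gradient flows through the choice of index.
idx = torch.multinomial(probs, num_samples=1)
context = hidden_states[idx.item()]       # one selected state, shape (8,)
```

Since `idx` is a discrete draw, the gradient of the loss with respect to `probs` must be estimated (e.g., by scoring sampled choices), which is why hard attention is harder to train than its soft counterpart.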
Scaled Dot-Product Attention
The most common form used in Transformers:
```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    d_k = query.size(-1)
    # Scale by sqrt(d_k) so the dot products don't grow with the key dimension.
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Large negative scores are driven to ~0 by the softmax.
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, value)
```
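For production use, PyTorch 2.0+ ships this same computation as a built-in, `torch.nn.functional.scaled_dot_product_attention`, which dispatches to fused kernels. A short sketch with illustrative shapes (the `(batch, heads, seq_len, head_dim)` layout is an assumption of this example):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch=2, heads=4, seq_len=10, head_dim=16).
q = torch.randn(2, 4, 10, 16)
k = torch.randn(2, 4, 10, 16)
v = torch.randn(2, 4, 10, 16)

# Same math as the hand-written version above, but using PyTorch's
# fused implementation (available in PyTorch 2.0 and later).
out = F.scaled_dot_product_attention(q, k, v)
```

The built-in also accepts an `attn_mask` argument and an `is_causal` flag for the triangular masking used in autoregressive decoding.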
Applications
- Machine Translation
- Image Captioning
- Speech Recognition
- Question Answering
Attention mechanisms continue to evolve and remain an active area of research.