
Review: Attention Is All You Need


Vaswani et al.

NeurIPS 2017

arXiv →
Nov 1, 2025 · 2 min read

Paper Summary

This landmark paper introduced the Transformer architecture, which has since become the foundation for most modern NLP models.

Key Contributions

  1. Self-Attention Mechanism: Replaced recurrence with attention for sequence modeling
  2. Multi-Head Attention: Allows the model to attend to different representation subspaces
  3. Positional Encoding: Injects sequence order information without recurrence
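The self-attention mechanism at the heart of contribution 1 can be sketched in a few lines. This is an illustrative NumPy implementation of the paper's scaled dot-product attention, $\text{softmax}(QK^T/\sqrt{d_k})V$; the shapes and variable names are my own, not the authors' code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

# Tiny example: 3 query positions, 4 key/value positions, d_k = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8)
```

The $\sqrt{d_k}$ scaling is the paper's fix for softmax saturation at large key dimensions; without it, dot products grow with $d_k$ and push gradients toward zero.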

The Architecture

The Transformer consists of:

  • Encoder: 6 identical layers with self-attention and feed-forward networks
  • Decoder: 6 identical layers with masked self-attention, encoder-decoder attention, and feed-forward networks
Multi-head attention, the core building block of both stacks, is defined as:

$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O, \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $

where $\text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k})V$.
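The MultiHead equation can be sketched directly in NumPy. This is a minimal illustration, assuming each head's projection is a $d_k$-wide slice of a full $d_{model} \times d_{model}$ matrix; the dimensions ($d_{model} = 16$, $h = 4$) are hypothetical, not the paper's ($d_{model} = 512$, $h = 8$).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, h):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O."""
    d_model = Q.shape[-1]
    d_k = d_model // h
    heads = []
    for i in range(h):
        # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), using column
        # slices of the full projection matrices as the per-head weights
        q = Q @ W_Q[:, i * d_k:(i + 1) * d_k]
        k = K @ W_K[:, i * d_k:(i + 1) * d_k]
        v = V @ W_V[:, i * d_k:(i + 1) * d_k]
        heads.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(...) W^O

rng = np.random.default_rng(1)
n, d_model, h = 5, 16, 4
Q = K = V = rng.normal(size=(n, d_model))         # self-attention: Q = K = V
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (5, 16)
```

Each head attends in its own $d_k$-dimensional subspace, which is what lets different heads specialize in different relations between positions.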

Why This Paper Matters

Before Transformers:

  • RNNs/LSTMs were the standard for sequence modeling
  • Training was sequential and slow
  • Long-range dependencies were difficult to capture

After Transformers:

  • Parallel training enabled massive scale
  • Models like BERT, GPT, T5 became possible
  • Attention became the dominant paradigm

Strengths

  • Elegant, simple architecture
  • Highly parallelizable
  • Strong empirical results
  • Clear writing and presentation

Limitations

  • Quadratic complexity in sequence length ($O(n^2)$)
  • No inherent notion of position (requires positional encoding)
  • Large memory footprint for long sequences
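The quadratic cost above is easy to make concrete: the attention score matrix alone holds $n \times n$ entries. A back-of-the-envelope sketch, assuming float32 scores and a single head (figures illustrate scaling only, not any specific model):

```python
def attention_matrix_bytes(n, bytes_per_score=4):
    """Memory for one n x n attention score matrix (float32 by default)."""
    return n * n * bytes_per_score

# Doubling sequence length quadruples the score matrix
for n in (512, 4096, 32768):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n = {n:5d}: {mib:8.1f} MiB")
# n =   512:      1.0 MiB
# n =  4096:     64.0 MiB
# n = 32768:   4096.0 MiB
```

This is the bottleneck that later work on sparse and linear attention variants targets.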

My Takeaways

This paper is a masterclass in research presentation. The authors clearly motivate the problem, present a clean solution, and provide thorough experiments. A must-read for anyone in ML.