Previously, RNNs, gated RNNs, and LSTMs were considered the SOTA models for NLP tasks. However, because an RNN encodes the input words sequentially, it suffers from the long-term dependency problem: the longer the input sequence, the harder it becomes to train effectively. The sequential nature of RNNs also makes parallelization difficult. [link to blog]
Before the attention mechanism came into the limelight, the LSTM-based Seq2Seq model was the SOTA model for machine translation. However, the Seq2Seq model requires a fixed-size context vector, so every input sentence had to be compressed into a vector of the same size, which introduced a major limitation. Seq2Seq models had a critical bottleneck problem: all encoded information from the source sentence had to be crammed into a single context vector $v$. When dealing with long sentences, this single vector $v$ became overloaded, leading to loss of information and poor translation quality. This limitation hindered the model's ability to handle long-range dependencies in text.
To address this bottleneck, researchers added an attention mechanism to Seq2Seq models. This allowed the model to focus on different parts of the input sentence dynamically at each decoding step. Instead of relying on a single fixed-size context vector $v$, the decoder selectively attends to relevant portions of the source sentence. So how does the decoder selectively retrieve relevant information?
The decoder computes an energy score to measure the relevance between the current decoder hidden state and each encoder hidden state. This is typically computed using methods such as dot product, additive, or scaled dot-product attention. The energy score $e_i$ quantifies how relevant the encoder hidden state $h_i$ is to the current decoder hidden state $s_t$.
\[e_i = \text{score}(s_t, h_i)\]Then, the energy scores are converted into probabilities using a softmax function to obtain attention weights, which determine how much focus is given to each encoder state.
\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]The final context vector is obtained as a weighted sum of all encoder hidden states, where higher-weighted states contribute more to the final representation. This ensures that the model focuses more on important words (higher $\alpha_i $) while still considering others with lower importance.
\[c_t = \sum_{i} \alpha_i h_i\]
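Putting these three steps together, here is a minimal NumPy sketch of one decoding step; the dot-product scoring function, the toy dimensions, and the random hidden states are assumptions chosen purely for illustration:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: 5 encoder hidden states h_i and one decoder state s_t, dimension 8.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # encoder hidden states, one row per source word
s_t = rng.normal(size=(8,))   # current decoder hidden state

# 1) Energy scores e_i = score(s_t, h_i); here the simple dot-product variant.
e = H @ s_t                   # shape (5,)

# 2) Attention weights alpha_i = softmax(e_i).
alpha = softmax(e)            # shape (5,), sums to 1

# 3) Context vector c_t = sum_i alpha_i * h_i.
c_t = alpha @ H               # shape (8,)

print(alpha, c_t.shape)
```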
In this paper, Google researchers propose the Transformer, a model architecture that relies entirely on attention mechanisms, with no recurrence at all. In fact, Transformers can be trained with smaller computational resources thanks to their parallelization-friendly structure.
As stated above, Transformers do not use an RNN architecture; they rely on attention mechanisms and positional encoding instead.
Both the encoder and decoder are composed of N = 6 identical layers. Each encoder layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization.
The decoder follows a similar structure but includes an additional cross-attention sub-layer that attends over the encoder output, and its self-attention is masked so that each position can only attend to earlier positions.
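As a rough structural sketch (not the authors' implementation), one encoder layer can be pictured as two residual sub-layers, each followed by layer normalization. The self-attention and feed-forward internals are explained in the sections below, so they appear here only as identity placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    """One encoder layer: two sub-layers, each applied as LayerNorm(x + Sublayer(x))."""
    x = layer_norm(x + self_attention(x))   # sub-layer 1: multi-head self-attention
    x = layer_norm(x + feed_forward(x))     # sub-layer 2: position-wise feed-forward
    return x

# Placeholder sub-layers (identity stand-ins) just to show the data flow.
x = np.random.default_rng(0).normal(size=(6, 512))   # 6 tokens, d_model = 512
out = encoder_layer(x, self_attention=lambda h: h, feed_forward=lambda h: h)
print(out.shape)   # (6, 512)
```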
TL;DR : Attention is computed to determine how much attention each word should pay to the others.
To calculate attention, three vectors are used: Query, Key, and Value. The attention mechanism compares the Query with all Keys, assigns weights using a softmax, and produces a weighted sum of the Values. Attention is merely a weighted average, which is powerful when the weights are learned.
For the sentence “The cat sat on the mat”, if we set Q as “sat”,
\[Q = \text{embedding of "sat"}\] \[K = \text{embeddings of the individual words: "The cat sat on the mat"}\] \[V = \text{embeddings of the individual words: "The cat sat on the mat"}\]The similarity (dot product) between Q and each K is computed to see how much “sat” should attend to each word in the sentence. In the Transformer these scores are additionally scaled by $\frac{1}{\sqrt{d_k}}$ before the softmax, giving scaled dot-product attention. After applying a softmax over these scores, the attention weights are obtained. These weights are then used to compute a weighted sum over V, resulting in the attention output for “sat”.
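Here is a minimal NumPy sketch of this computation; the random toy embeddings and the dimension $d_k = 4$ are assumptions for illustration (real models use learned embeddings and projections):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of the query with every key
    weights = softmax(scores)            # attention weights, sum to 1 per query
    return weights @ V, weights

# Toy embeddings (random; real models use learned embeddings) for
# the six tokens of "The cat sat on the mat", d_k = 4.
rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
E = rng.normal(size=(6, 4))

Q = E[2:3]        # query: the embedding of "sat"
K = V = E         # keys and values: embeddings of all words

output, weights = scaled_dot_product_attention(Q, K, V)
for tok, w in zip(tokens, weights[0]):
    print(f"{tok:>4}: {w:.2f}")          # how much "sat" attends to each word
print(output.shape)                      # (1, 4): the attention output for "sat"
```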
Instead of a single attention function, Multi-Head Attention uses multiple heads. Each head is an independent attention mechanism with its own learned Q, K, V projections. This allows the model to capture different types of relationships in parallel.
Each head attends to the input sequence differently — one head might learn syntactic dependencies, another might learn positional or semantic relevance.
\[\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O\]where each head is computed as:
\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]where $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learned projection matrices.
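A minimal NumPy sketch of multi-head self-attention, assuming toy dimensions (6 tokens, $d_{\text{model}} = 8$, $h = 2$ heads, $d_k = d_v = 4$) and randomly initialized projection matrices standing in for learned ones:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, weights, num_heads):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O,
    where head_i = Attention(X W_Q[i], X W_K[i], X W_V[i])."""
    heads = []
    for i in range(num_heads):
        Q = X @ weights["W_Q"][i]     # each head has its own projections
        K = X @ weights["W_K"][i]
        V = X @ weights["W_V"][i]
        heads.append(attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ weights["W_O"]

# Toy dimensions: 6 tokens, d_model = 8, h = 2 heads, d_k = d_v = 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
weights = {
    "W_Q": rng.normal(size=(2, 8, 4)),
    "W_K": rng.normal(size=(2, 8, 4)),
    "W_V": rng.normal(size=(2, 8, 4)),
    "W_O": rng.normal(size=(8, 8)),   # h * d_v = 8 projected back to d_model = 8
}
print(multi_head_attention(X, weights, num_heads=2).shape)   # (6, 8)
```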
One of the key differences between Transformers and RNNs is that Transformers do not process input sequences in order. Since there is no recurrence or convolution, the model needs a way to incorporate the order of words, and that is where positional encoding comes in.
Positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. These encodings inject information about the position of each token in the sequence using a combination of sine and cosine functions of different frequencies:
\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)\]This deterministic function was chosen because, for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$, which should make it easy for the model to attend by relative positions, and because it may allow the model to extrapolate to sequence lengths longer than those seen during training.
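A minimal NumPy sketch of these sinusoidal encodings (the toy sizes are assumptions; in practice the table is computed once for the maximum sequence length and added to the token embeddings):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Encodings for 10 positions with d_model = 16, added element-wise to embeddings.
pe = positional_encoding(max_len=10, d_model=16)
print(pe.shape)   # (10, 16)
```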
Each encoder and decoder layer also includes a position-wise feedforward network, applied independently to each position. The architecture is:
\[\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\]This two-layer MLP transforms each position’s embedding, enabling non-linear combinations and richer representations.
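A minimal NumPy sketch of this position-wise feed-forward network, using the paper's dimensions $d_{\text{model}} = 512$ and $d_{ff} = 2048$, with randomly initialized weights standing in for learned ones:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Paper dimensions: d_model = 512, inner layer d_ff = 2048 (toy batch of 6 tokens).
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 512))
W1, b1 = rng.normal(size=(512, 2048)), np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)), np.zeros(512)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (6, 512)
```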
The original paper trained the model on the WMT 2014 English-German and English-French translation tasks.
The Transformer outperformed traditional Seq2Seq models (with or without attention) and convolutional models on translation benchmarks.
The “Attention Is All You Need” paper introduced the Transformer model, a groundbreaking architecture that relies entirely on attention, dispenses with recurrence and convolution, trains efficiently thanks to parallelization, and set a new state of the art on machine translation benchmarks.
This architecture has since become the foundation for modern NLP, powering models like BERT, GPT, T5, and many others.