Previously, RNNs, gated RNNs, and LSTMs were considered the SOTA models for NLP tasks. However, because an RNN encodes the input words sequentially, it suffers from the long-term dependency problem: the longer the input sequence, the harder it becomes to train effectively. The sequential nature of RNNs also makes parallelization difficult. [link to blog]
Before the attention mechanism came into the limelight, the LSTM-based Seq2Seq model was the SOTA model for machine translation. However, the Seq2Seq model requires a fixed-size context vector, so every input sentence had to be compressed into a vector of the same size, which introduced a major limitation. Seq2Seq models had a critical bottleneck problem: all encoded information from the source sentence had to be crammed into a single context vector $v$. When dealing with long sentences, this single vector $v$ became overloaded, leading to loss of information and poor translation quality. This limitation hindered the model's ability to handle long-range dependencies in text.
To address this bottleneck, researchers added an attention mechanism to Seq2Seq models. This allowed the model to focus on different parts of the input sentence dynamically at each decoding step. Instead of relying on a single fixed-size context vector $v$, the decoder selectively attends to relevant portions of the source sentence. So how does the decoder selectively retrieve relevant information?
The decoder computes an energy score to measure the relevance between the current decoder hidden state and each encoder hidden state. This is typically computed using methods such as dot product, additive, or scaled dot-product attention. The energy score $e_i$ quantifies how relevant the encoder hidden state $h_i$ is to the current decoder hidden state $s_t$.
\[e_i = \text{score}(s_t, h_i)\]Then, the energy scores are converted into probabilities using a softmax function to obtain attention weights, which determine how much focus is given to each encoder state.
\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]The final context vector is obtained as a weighted sum of all encoder hidden states, where higher-weighted states contribute more to the final representation. This ensures that the model focuses more on important words (higher $\alpha_i $) while still considering others with lower importance.
\[c_t = \sum_{i} \alpha_i h_i\]
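Putting these three steps together, here is a minimal NumPy sketch of one decoding step; the dot-product scoring function, the toy dimensions, and the random hidden states are assumptions chosen purely for illustration:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: 5 encoder hidden states h_i and one decoder state s_t, dimension 8.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # encoder hidden states, one row per source word
s_t = rng.normal(size=(8,))   # current decoder hidden state

# 1) Energy scores e_i = score(s_t, h_i); here the simple dot-product variant.
e = H @ s_t                   # shape (5,)

# 2) Attention weights alpha_i = softmax(e_i).
alpha = softmax(e)            # shape (5,), sums to 1

# 3) Context vector c_t = sum_i alpha_i * h_i.
c_t = alpha @ H               # shape (8,)

print(alpha, c_t.shape)
```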
In this paper, Google researchers propose the Transformer, a model architecture that relies entirely on attention mechanisms, with no recurrence at all. In fact, Transformers can be trained with smaller computational resources thanks to their parallelization-friendly structure.
As stated above, Transformers do not use an RNN architecture; they rely on attention mechanisms and positional encoding instead.
Both the encoder and decoder are composed of N = 6 identical layers. Each encoder layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization.
The decoder follows a similar structure but includes an additional cross-attention sub-layer that attends over the encoder output, and its self-attention is masked so that each position can only attend to earlier positions.
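As a rough structural sketch (not the authors' implementation), one encoder layer can be pictured as two residual sub-layers, each followed by layer normalization. The self-attention and feed-forward internals are explained in the sections below, so they appear here only as identity placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    """One encoder layer: two sub-layers, each applied as LayerNorm(x + Sublayer(x))."""
    x = layer_norm(x + self_attention(x))   # sub-layer 1: multi-head self-attention
    x = layer_norm(x + feed_forward(x))     # sub-layer 2: position-wise feed-forward
    return x

# Placeholder sub-layers (identity stand-ins) just to show the data flow.
x = np.random.default_rng(0).normal(size=(6, 512))   # 6 tokens, d_model = 512
out = encoder_layer(x, self_attention=lambda h: h, feed_forward=lambda h: h)
print(out.shape)   # (6, 512)
```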
TL;DR : Attention is computed to determine how much attention each word should pay to the others.
To calculate attention, three vectors are used: Query, Key, and Value. The attention mechanism compares the Query with all Keys, assigns weights using a softmax, and produces a weighted sum of the Values. Attention is merely a weighted average, which is powerful when the weights are learned.
For the sentence “The cat sat on the mat”, if we set Q as “sat”,
\[Q = \text{embedding of "sat"}\] \[K = \text{embeddings of the individual words: "The cat sat on the mat"}\] \[V = \text{embeddings of the individual words: "The cat sat on the mat"}\]The similarity (dot product) between Q and each K is computed to see how much “sat” should attend to each word in the sentence. In the Transformer these scores are additionally scaled by $\frac{1}{\sqrt{d_k}}$ before the softmax, giving scaled dot-product attention. After applying a softmax over these scores, the attention weights are obtained. These weights are then used to compute a weighted sum over V, resulting in the attention output for “sat”.
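Here is a minimal NumPy sketch of this computation; the random toy embeddings and the dimension $d_k = 4$ are assumptions for illustration (real models use learned embeddings and projections):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of the query with every key
    weights = softmax(scores)            # attention weights, sum to 1 per query
    return weights @ V, weights

# Toy embeddings (random; real models use learned embeddings) for
# the six tokens of "The cat sat on the mat", d_k = 4.
rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
E = rng.normal(size=(6, 4))

Q = E[2:3]        # query: the embedding of "sat"
K = V = E         # keys and values: embeddings of all words

output, weights = scaled_dot_product_attention(Q, K, V)
for tok, w in zip(tokens, weights[0]):
    print(f"{tok:>4}: {w:.2f}")          # how much "sat" attends to each word
print(output.shape)                      # (1, 4): the attention output for "sat"
```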
Instead of a single attention function, Multi-Head Attention uses multiple heads. Each head is an independent attention mechanism with its own learned Q, K, V projections. This allows the model to capture different types of relationships in parallel.
Each head attends to the input sequence differently — one head might learn syntactic dependencies, another might learn positional or semantic relevance.
\[\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O\]where each head is computed as:
\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]where $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learned projection matrices.
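A minimal NumPy sketch of multi-head self-attention, assuming toy dimensions (6 tokens, $d_{\text{model}} = 8$, $h = 2$ heads, $d_k = d_v = 4$) and randomly initialized projection matrices standing in for learned ones:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, weights, num_heads):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O,
    where head_i = Attention(X W_Q[i], X W_K[i], X W_V[i])."""
    heads = []
    for i in range(num_heads):
        Q = X @ weights["W_Q"][i]     # each head has its own projections
        K = X @ weights["W_K"][i]
        V = X @ weights["W_V"][i]
        heads.append(attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ weights["W_O"]

# Toy dimensions: 6 tokens, d_model = 8, h = 2 heads, d_k = d_v = 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
weights = {
    "W_Q": rng.normal(size=(2, 8, 4)),
    "W_K": rng.normal(size=(2, 8, 4)),
    "W_V": rng.normal(size=(2, 8, 4)),
    "W_O": rng.normal(size=(8, 8)),   # h * d_v = 8 projected back to d_model = 8
}
print(multi_head_attention(X, weights, num_heads=2).shape)   # (6, 8)
```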
One of the key differences between Transformers and RNNs is that Transformers do not process input sequences in order. Since there is no recurrence or convolution, the model needs a way to incorporate the order of words, and that is where positional encoding comes in.
Positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. These encodings inject information about the position of each token in the sequence using a combination of sine and cosine functions of different frequencies:
\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)\]This deterministic function was chosen because, for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$, which should make it easy for the model to attend by relative positions, and because it may allow the model to extrapolate to sequence lengths longer than those seen during training.
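A minimal NumPy sketch of these sinusoidal encodings (the toy sizes are assumptions; in practice the table is computed once for the maximum sequence length and added to the token embeddings):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Encodings for 10 positions with d_model = 16, added element-wise to embeddings.
pe = positional_encoding(max_len=10, d_model=16)
print(pe.shape)   # (10, 16)
```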
Each encoder and decoder layer also includes a position-wise feedforward network, applied independently to each position. The architecture is:
\[\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\]This two-layer MLP transforms each position’s embedding, enabling non-linear combinations and richer representations.
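A minimal NumPy sketch of this position-wise feed-forward network, using the paper's dimensions $d_{\text{model}} = 512$ and $d_{ff} = 2048$, with randomly initialized weights standing in for learned ones:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Paper dimensions: d_model = 512, inner layer d_ff = 2048 (toy batch of 6 tokens).
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 512))
W1, b1 = rng.normal(size=(512, 2048)), np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)), np.zeros(512)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (6, 512)
```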
The original paper trained the model on the WMT 2014 English-German and English-French translation tasks.
The Transformer outperformed traditional Seq2Seq models (with or without attention) and convolutional models on translation benchmarks.
The “Attention Is All You Need” paper introduced the Transformer model, a groundbreaking architecture that relies entirely on attention, dispenses with recurrence and convolution, trains efficiently thanks to parallelization, and set a new state of the art on machine translation benchmarks.
This architecture has since become the foundation for modern NLP, powering models like BERT, GPT, T5, and many others.