Previously, RNNs, gated RNNs, and LSTMs were considered the SOTA models for NLP tasks. However, because an RNN encodes the input words one step at a time, it suffers from the long-term dependency problem: as the input sequence grows longer, it fails to train efficiently. The sequential nature of RNNs also makes parallelization difficult. [link to blog]
Before the attention mechanism was put in the spotlight, the LSTM-based Seq2Seq model was the SOTA model for machine translation. However, the Seq2Seq model requires a fixed-size context vector, so every input sentence had to be compressed into a vector of fixed size, which introduced a major limitation. Seq2Seq models had a critical bottleneck problem, where all encoded information from the source sentence had to be crammed into a single context vector $v$. When dealing with long sentences, this single vector $v$ became overloaded, leading to loss of information and poor translation quality. This limitation hindered the model's ability to handle long-range dependencies in text.
To address this bottleneck, researchers added an attention mechanism to Seq2Seq models. This allows the model to focus on different parts of the input sentence dynamically at each decoding step. Instead of relying on a single fixed-size context vector $v$, the decoder selectively attends to relevant portions of the source sentence. So how does the decoder selectively retrieve relevant information?
The decoder computes an energy score to measure the relevance between the current decoder hidden state and each encoder hidden state. This is typically computed using methods such as dot product, additive, or scaled dot-product attention. The energy score $e_i$ quantifies how relevant the encoder hidden state $h_i$ is to the current decoder hidden state $s_t$.
\[e_i = \text{score}(s_t, h_i)\]Then, the energy scores are converted into probabilities using a softmax function to obtain attention weights, which determine how much focus is given to each encoder state.
\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]The final context vector is obtained as a weighted sum of all encoder hidden states, where higher-weighted states contribute more to the final representation. This ensures that the model focuses more on important words (higher $\alpha_i $) while still considering others with lower importance.
\[c_t = \sum_{i} \alpha_i h_i\]In this paper, Google researchers propose the Transformer, a model architecture that relies entirely on attention mechanisms, without any RNN architecture. In fact, Transformers can be trained with smaller computational resources thanks to their parallelization-friendly structure.
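To make the three steps above concrete (score, softmax, weighted sum), here is a minimal NumPy sketch of this attention computation, assuming a simple dot-product score; the function name and the hidden sizes are illustrative, not from the paper.

```python
import numpy as np

def seq2seq_attention(s_t, H):
    """Compute a context vector for decoder state s_t over encoder states H.

    s_t : (d,)    current decoder hidden state
    H   : (T, d)  encoder hidden states h_1 .. h_T
    """
    # 1. Energy scores e_i = score(s_t, h_i); here a simple dot product.
    e = H @ s_t                        # (T,)
    # 2. Softmax over the scores to get attention weights alpha_i.
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()               # (T,)
    # 3. Context vector c_t as a weighted sum of the encoder states.
    c_t = alpha @ H                    # (d,)
    return c_t, alpha

# Toy usage with random states: 6 source positions, hidden size 8.
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))
s_t = rng.normal(size=8)
c_t, alpha = seq2seq_attention(s_t, H)
print(alpha.round(3), c_t.shape)
```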
As stated above, the Transformer does not use an RNN architecture. Instead, it relies on the attention mechanism and positional encoding.
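Because attention by itself is order-agnostic, the paper injects word order with sinusoidal positional encodings, $PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$, which are added to the token embeddings. A minimal NumPy sketch (the sizes below are chosen only for illustration):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16) -- added to the embeddings before the first layer
```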
Both the encoder and decoder are composed of N = 6 identical layers, where each layer consists of the following sub-layers: a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, each wrapped in a residual connection followed by layer normalization.
The decoder follows a similar structure but includes an additional cross-attention layer.
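As a rough PyTorch sketch of one encoder layer with the paper's default sizes ($d_{model} = 512$, 8 heads, $d_{ff} = 2048$); this is a simplified illustration rather than the full architecture, and details such as masking and exact dropout placement are omitted or approximated:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Sub-layer 1: multi-head self-attention.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        # Sub-layer 2: position-wise feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Residual connection + layer normalization around each sub-layer.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

# The encoder stacks N = 6 of these layers.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
x = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
print(encoder(x).shape)       # torch.Size([2, 10, 512])
```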
TL;DR : Attention is computed to determine how much attention each word should pay to the others.
To calculate attention, three vectors are used: Query, Key, and Value. The attention mechanism compares the Query with all Keys, assigns weights using a softmax, and produces a weighted sum of the Values. Attention is merely a weighted average - which is powerful when the weights are learned.
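In matrix form this is the paper's scaled dot-product attention, $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$. A minimal NumPy sketch, with shapes chosen only for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)."""
    d_k = Q.shape[-1]
    # Compare each Query with every Key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)                          # (T_q, T_k)
    # Softmax over the keys -> attention weights (each row sums to 1).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output is a weighted average of the Value vectors.
    return weights @ V, weights                              # (T_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 64))   # e.g. 6 tokens, d_k = 64
K = rng.normal(size=(6, 64))
V = rng.normal(size=(6, 64))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))   # (6, 64); every weight row sums to 1
```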
For the sentence "The cat sat on the mat", if we set the Query to "sat", then the Keys and Values are (the embeddings of) all the words in the sentence:
\[Q = \text{"sat"}, \quad K = V = \{\text{"The"}, \text{"cat"}, \text{"sat"}, \text{"on"}, \text{"the"}, \text{"mat"}\}\]Instead of a single attention function, Multi-Head Attention utilizes multiple heads, where each head is a scaled dot-product attention.