(Work in Progress) (ENG) Attention Is All You Need Paper Review

1. Background

RNN models

Previously, RNNs, gated RNNs, and LSTMs were considered the SOTA models for NLP tasks. However, because an RNN encodes the entire input sequence step by step, it suffers from the long-term dependency problem: the longer the input sequence, the harder it is to train effectively. The sequential nature of RNNs also makes parallelization difficult. [link to blog]

Seq2Seq models

Before the attention mechanism was put in the limelight, the LSTM-based Seq2Seq model was the SOTA model for machine translation. However, the Seq2Seq model requires a fixed-size context vector, so every source sentence has to be compressed into a single vector $v$. This is a critical bottleneck: all of the encoded information from the source sentence must be crammed into $v$, and for long sentences this single vector becomes overloaded, leading to loss of information and poor translation quality. This limitation hindered the model's ability to handle long-range dependencies in text.

Seq2Seq models with Attention Mechanism

To address this bottleneck, researchers added an attention mechanism to Seq2Seq models. This allows the model to focus on different parts of the input sentence dynamically at each decoding step. Instead of relying on a single fixed-size context vector $v$, the decoder selectively attends to relevant portions of the source sentence. So how does the decoder selectively retrieve relevant information?

Step 1

The decoder computes an energy score to measure the relevance between the current decoder hidden state and each encoder hidden state. This is typically computed using methods such as dot product, additive, or scaled dot-product attention. The energy score $e_i$ quantifies how relevant the encoder hidden state $h_i$ is to the current decoder hidden state $s_t$.

\[e_i = \text{score}(s_t, h_i)\]

Step 2

Then, the energy scores are converted into probabilities using a softmax function to obtain attention weights, which determine how much focus is given to each encoder state.

\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]

Step 3

The final context vector is obtained as a weighted sum of all encoder hidden states, where higher-weighted states contribute more to the final representation. This ensures that the model focuses more on important words (higher $\alpha_i$) while still considering others with lower importance.

\[c_t = \sum_{i} \alpha_i h_i\]
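
To make the three steps concrete, here is a minimal NumPy sketch assuming a simple dot-product score and toy dimensions; the function name `seq2seq_attention` and the sizes are illustrative rather than taken from any particular paper.

```python
import numpy as np

def seq2seq_attention(s_t, encoder_states):
    """Steps 1-3 above, using a simple dot-product score.

    s_t            : current decoder hidden state, shape (d,)
    encoder_states : encoder hidden states h_1..h_n, shape (n, d)
    """
    # Step 1: energy scores e_i = score(s_t, h_i), here a dot product
    e = encoder_states @ s_t                    # shape (n,)

    # Step 2: softmax over the energies -> attention weights alpha_i
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                 # shape (n,), sums to 1

    # Step 3: context vector c_t = sum_i alpha_i * h_i
    c_t = alpha @ encoder_states                # shape (d,)
    return c_t, alpha

# Toy example: 5 source tokens, hidden size 8
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # encoder hidden states
s = rng.normal(size=(8,))     # decoder hidden state at step t
c_t, alpha = seq2seq_attention(s, H)
print(alpha.round(3), alpha.sum())   # a probability distribution over source words
```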

Transformer Models

In this paper, Google researchers propose the Transformer, a model architecture that relies entirely on attention mechanisms, with no recurrence at all. In fact, Transformers can be trained with fewer computational resources thanks to their parallelization-friendly structure.

2. Model Architecture

As stated above, Transformers do not use a recurrent architecture. They instead rely on attention mechanisms and positional encoding.
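
The positional encoding in the paper uses fixed sinusoids so that the model can still exploit word order without recurrence. Below is a minimal NumPy sketch of that formula; `max_len = 50` is just an example choice, while `d_model = 512` matches the paper's base configuration.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
       PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)   # (max_len, d_model / 2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); added element-wise to the token embeddings
```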

Encoder and Decoder Stacks

Both the encoder and decoder are composed of N = 6 identical layers, where each layer consists of the following sub-layers.

Encoder Sub-layers

  1. Multi-head Self-Attention : allows the model to attend to different positions in the input sequence
  2. Feed Forward Network : applies a position-wise transformation to enhance the representation
  3. Layer Normalization and Residual Connections : to ensure stable training, residual connections and layer normalization are applied after each sub-layer.
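
Putting these three sub-layers together, here is a minimal PyTorch sketch of a single encoder layer, assuming the paper's base sizes ($d_{model} = 512$, 8 heads, $d_{ff} = 2048$) and the post-layer-norm arrangement $\text{LayerNorm}(x + \text{Sublayer}(x))$; the class name `EncoderLayer` is illustrative.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention -> add & norm -> FFN -> add & norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(            # position-wise feed-forward network
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        # Sub-layer 1: multi-head self-attention + residual connection + LayerNorm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: feed-forward network + residual connection + LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

layer = EncoderLayer()
x = torch.randn(2, 10, 512)   # batch of 2 sentences, 10 tokens each
print(layer(x).shape)         # torch.Size([2, 10, 512])
```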

Decoder Sub-layers

The decoder follows a similar structure but includes an additional cross-attention layer.

  1. Masked Multi-Head Self-Attention
  2. Cross-Attention over Encoder Outputs
  3. Feed-Forward Network (FFN)
  4. Layer Normalization and Residual Connections
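
The masked self-attention sub-layer only differs from the encoder's self-attention in that each position is prevented from attending to later positions. Here is a small NumPy sketch of that causal mask on a toy 4-token sequence; the helper name `causal_mask` is illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """Mask for masked self-attention: position i may only attend to j <= i."""
    # Future positions (upper triangle) get -inf before the softmax,
    # so their attention weights become exactly zero.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.random.default_rng(0).normal(size=(4, 4))   # toy attention scores
masked = scores + causal_mask(4)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # row i has zeros for columns j > i
```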

Attention

TL;DR : Attention is computed to determine how much attention each word should pay to the others.

To calculate attention, three vectors are used: Query, Key, and Value. The attention mechanism compares the Query with all Keys, assigns weights using softmax, and produces a weighted sum of the Values. Attention is merely a weighted average, which becomes powerful when the weights are learned.

For the sentence “The cat sat on the mat”, if we take the Query from the word “sat”, then the Keys and Values come from every word in the sentence:

\[Q = \text{“sat”}, \quad K = V = \{\text{“The”}, \text{“cat”}, \text{“sat”}, \text{“on”}, \text{“the”}, \text{“mat”}\}\]

Multi-Head Attention

Instead of a single attention function, Multi-Head Attention runs multiple attention heads in parallel. Each head is a scaled dot-product attention applied to its own learned projections of the Queries, Keys, and Values, and the head outputs are concatenated and projected back to the model dimension.
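
As a rough sketch of how the heads fit together, assuming NumPy, random matrices in place of learned weights, and the illustrative helper name `multi_head_attention`: each head runs scaled dot-product attention on its own slice of the projected Queries, Keys, and Values, and the concatenated head outputs are projected by an output matrix $W^O$.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Each head runs scaled dot-product attention on its own projection of X;
    the head outputs are concatenated and projected with W_o."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # (seq_len, d_model)

    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]   # this head's slice
        scores = q @ k.T / np.sqrt(d_head)       # scaled dot-product
        heads.append(softmax(scores) @ v)        # (seq_len, d_head)

    return np.concatenate(heads, axis=-1) @ W_o  # (seq_len, d_model)

# Toy run: 6 tokens, d_model = 16, 4 heads; weights are random stand-ins
rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
X = rng.normal(size=(6, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (6, 16)
```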