(ENG) RNN and LSTM

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining an internal state that captures information about previous time steps. This makes them ideal for tasks like speech recognition, machine translation, and time series analysis.

  • Structure: In an RNN, each hidden unit has connections to itself and to other hidden units, allowing information to flow across time steps.
  • Mathematics: At each time step \(t\), the hidden state \(h_t\) is updated from the input \(x_t\) and the previous hidden state \(h_{t-1}\): \(h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)\), where (a minimal code sketch of this recurrence follows the list):
    • \(W_{xh}\): Weight matrix for the input.
    • \(W_{hh}\): Weight matrix for the hidden state.
    • \(b_h\): Bias term.
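
As a concrete illustration of this recurrence, here is a minimal NumPy sketch of a single RNN step. The function name `rnn_step` and the toy dimensions are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One RNN time step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy dimensions chosen for illustration: input size 3, hidden size 4.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 3))
W_hh = rng.normal(scale=0.1, size=(4, 4))
b_h = np.zeros(4)

h = np.zeros(4)                       # initial hidden state h_0
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)                        # (4,)
```

Note that the same \(W_{xh}\), \(W_{hh}\), and \(b_h\) are reused at every step of the loop; this weight sharing is exactly what leads to the gradient problems discussed next.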

Exploding and Vanishing Gradient Problem

RNNs use the same weight matrices \(W_{hh}\) and \(W_{xh}\) at every time step. During backpropagation through time (BPTT), gradients are repeatedly multiplied by these matrices, which can cause two failure modes (a numerical sketch follows the list):

  • Vanishing gradients: Gradients shrink exponentially, making it difficult to update weights and learn long-term dependencies.
  • Exploding gradients: Gradients grow exponentially, leading to instability.
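
The effect is easy to reproduce numerically. The sketch below uses arbitrary made-up matrices and repeatedly multiplies a gradient vector by \(W_{hh}^\top\), as BPTT does (ignoring the \(\tanh\) derivative for simplicity); whether the norm collapses toward zero or blows up depends on the scale of \(W_{hh}\).

```python
import numpy as np

def backprop_norms(W_hh, steps=50):
    """Track the norm of a gradient vector that is repeatedly multiplied
    by W_hh^T, as happens during BPTT (the tanh derivative is ignored
    to keep the illustration simple)."""
    g = np.ones(W_hh.shape[0])
    norms = []
    for _ in range(steps):
        g = W_hh.T @ g
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
W_small = rng.normal(scale=0.1, size=(4, 4))  # small weights
W_large = rng.normal(scale=2.0, size=(4, 4))  # large weights

print(backprop_norms(W_small)[-1])  # ~0        -> vanishing gradient
print(backprop_norms(W_large)[-1])  # enormous  -> exploding gradient
```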

Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) networks are a type of RNN specifically designed to handle the vanishing gradient problem. They achieve this by introducing a more complex structure within each unit that allows information to be selectively remembered or forgotten over long time steps.

LSTM Cell Structure

An LSTM unit controls the flow of information through its cell state in four steps, each built from interacting gates and activations (a minimal code sketch follows this list):

  1. Forget Gate: Determines which parts of the previous cell state \(C_{t-1}\) should be forgotten: \(f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\) Where \(\sigma\) is the sigmoid activation function.

  2. Input Gate: Decides what new information to store in the cell state: \(i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\) A candidate cell state \(\tilde{C}_t\) is created: \(\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\)

  3. Cell State Update: Combines the forget gate and the input gate to update the cell state: \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\) Where \(\odot\) denotes element-wise multiplication.

  4. Output Gate: Determines the next hidden state \(h_t\): \(o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\) The hidden state is calculated as: \(h_t = o_t \odot \tanh(C_t)\)
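
Putting the four equations together, here is a minimal NumPy sketch of a single LSTM step. The function name `lstm_step`, the concatenation layout, and the toy dimensions are assumptions for this example rather than a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following the four equations above.
    Every weight matrix acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)         # 1. forget gate
    i_t = sigmoid(W_i @ z + b_i)         # 2. input gate
    C_tilde = np.tanh(W_C @ z + b_C)     #    candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde   # 3. cell state update
    o_t = sigmoid(W_o @ z + b_o)         # 4. output gate
    h_t = o_t * np.tanh(C_t)             #    new hidden state
    return h_t, C_t

# Toy dimensions chosen for illustration: input size 3, hidden size 4.
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
Ws = [rng.normal(scale=0.1, size=(n_h, n_h + n_in)) for _ in range(4)]
bs = [np.zeros(n_h) for _ in range(4)]

h, C = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):   # a sequence of 5 input vectors
    h, C = lstm_step(x_t, h, C, *Ws, *bs)
print(h.shape, C.shape)                  # (4,) (4,)
```

Because the cell state \(C_t\) is carried forward through element-wise operations rather than repeated matrix multiplications, gradients flowing along it are far less prone to vanishing.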


Differences Between RNN and LSTM
| Feature | RNN | LSTM |
| --- | --- | --- |
| Architecture | Simple structure with one update equation. | Complex structure with gates for controlling memory. |
| Gradient issues | Suffers from vanishing/exploding gradients. | Designed to avoid vanishing gradients. |
| Long-term dependencies | Struggles to learn long-term patterns. | Efficiently captures long-term dependencies. |
| Use cases | Short-term dependencies. | Long-term dependencies (e.g., speech, text). |

TL;DR

While RNNs are effective for sequence modeling, their training can be hindered by the vanishing gradient problem, especially for long-term dependencies. LSTMs address this limitation by introducing gating mechanisms that allow the model to selectively retain and forget information, making them a powerful tool for tasks involving sequential data.



