(ENG) RNN and LSTM

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining an internal state that captures information about previous time steps. This makes them ideal for tasks like speech recognition, machine translation, and time series analysis.

  • Structure: In an RNN, each hidden unit has connections to itself and to other hidden units, allowing information to flow across time steps.
  • Mathematics: At each time step \(t\), the hidden state \(h_t\) is updated from the input \(x_t\) and the previous hidden state \(h_{t-1}\): \(h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)\), where (a minimal code sketch of this recurrence follows the list):
    • \(W_{xh}\): Weight matrix for the input.
    • \(W_{hh}\): Weight matrix for the hidden state.
    • \(b_h\): Bias term.
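
As a concrete illustration of this recurrence, here is a minimal NumPy sketch of a single RNN step. The function name `rnn_step` and the toy dimensions are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One RNN time step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy dimensions chosen for illustration: input size 3, hidden size 4.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 3))
W_hh = rng.normal(scale=0.1, size=(4, 4))
b_h = np.zeros(4)

h = np.zeros(4)                       # initial hidden state h_0
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)                        # (4,)
```

Note that the same \(W_{xh}\), \(W_{hh}\), and \(b_h\) are reused at every step of the loop; this weight sharing is exactly what leads to the gradient problems discussed next.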

Exploding and Vanishing Gradient Problem

RNNs use the same weight matrices \(W_{hh}\) and \(W_{xh}\) at every time step. During backpropagation through time (BPTT), gradients are repeatedly multiplied by these matrices, which can cause two failure modes (a numerical sketch follows the list):

  • Vanishing gradients: Gradients shrink exponentially, making it difficult to update weights and learn long-term dependencies.
  • Exploding gradients: Gradients grow exponentially, leading to instability.
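
The effect is easy to reproduce numerically. The sketch below uses arbitrary made-up matrices and repeatedly multiplies a gradient vector by \(W_{hh}^\top\), as BPTT does (ignoring the \(\tanh\) derivative for simplicity); whether the norm collapses toward zero or blows up depends on the scale of \(W_{hh}\).

```python
import numpy as np

def backprop_norms(W_hh, steps=50):
    """Track the norm of a gradient vector that is repeatedly multiplied
    by W_hh^T, as happens during BPTT (the tanh derivative is ignored
    to keep the illustration simple)."""
    g = np.ones(W_hh.shape[0])
    norms = []
    for _ in range(steps):
        g = W_hh.T @ g
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
W_small = rng.normal(scale=0.1, size=(4, 4))  # small weights
W_large = rng.normal(scale=2.0, size=(4, 4))  # large weights

print(backprop_norms(W_small)[-1])  # ~0        -> vanishing gradient
print(backprop_norms(W_large)[-1])  # enormous  -> exploding gradient
```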

Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) networks are a type of RNN specifically designed to handle the vanishing gradient problem. They achieve this by introducing a more complex structure within each unit that allows information to be selectively remembered or forgotten over long time steps.

LSTM Cell Structure

An LSTM unit controls the flow of information through its cell state in four steps, each built from interacting gates and activations (a minimal code sketch follows this list):

  1. Forget Gate: Determines which parts of the previous cell state \(C_{t-1}\) should be forgotten: \(f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\) Where \(\sigma\) is the sigmoid activation function.

  2. Input Gate: Decides what new information to store in the cell state: \(i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\) A candidate cell state \(\tilde{C}_t\) is created: \(\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\)

  3. Cell State Update: Combines the forget gate and the input gate to update the cell state: \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\) Where \(\odot\) denotes element-wise multiplication.

  4. Output Gate: Determines the next hidden state \(h_t\): \(o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\) The hidden state is calculated as: \(h_t = o_t \odot \tanh(C_t)\)
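
Putting the four equations together, here is a minimal NumPy sketch of a single LSTM step. The function name `lstm_step`, the concatenation layout, and the toy dimensions are assumptions for this example rather than a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following the four equations above.
    Every weight matrix acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)         # 1. forget gate
    i_t = sigmoid(W_i @ z + b_i)         # 2. input gate
    C_tilde = np.tanh(W_C @ z + b_C)     #    candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde   # 3. cell state update
    o_t = sigmoid(W_o @ z + b_o)         # 4. output gate
    h_t = o_t * np.tanh(C_t)             #    new hidden state
    return h_t, C_t

# Toy dimensions chosen for illustration: input size 3, hidden size 4.
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
Ws = [rng.normal(scale=0.1, size=(n_h, n_h + n_in)) for _ in range(4)]
bs = [np.zeros(n_h) for _ in range(4)]

h, C = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):   # a sequence of 5 input vectors
    h, C = lstm_step(x_t, h, C, *Ws, *bs)
print(h.shape, C.shape)                  # (4,) (4,)
```

Because the cell state \(C_t\) is carried forward through element-wise operations rather than repeated matrix multiplications, gradients flowing along it are far less prone to vanishing.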


Differences Between RNN and LSTM
| Feature | RNN | LSTM |
| --- | --- | --- |
| Architecture | Simple structure with one update equation. | Complex structure with gates for controlling memory. |
| Gradient issues | Suffers from vanishing/exploding gradients. | Designed to avoid vanishing gradients. |
| Long-term dependencies | Struggles to learn long-term patterns. | Efficiently captures long-term dependencies. |
| Use cases | Short-term dependencies. | Long-term dependencies (e.g., speech, text). |

TL;DR

While RNNs are effective for sequence modeling, their training can be hindered by the vanishing gradient problem, especially for long-term dependencies. LSTMs address this limitation by introducing gating mechanisms that allow the model to selectively retain and forget information, making them a powerful tool for tasks involving sequential data.



