(ENG) RNN and LSTM
Table of Contents
- Recurrent Neural Networks
- Exploding and Vanishing Gradient Problem
- Long Short-Term Memory (LSTM)
- TL;DR
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining an internal state that captures information about previous time steps. This makes them ideal for tasks like speech recognition, machine translation, and time series analysis.
- Structure: In an RNN, each hidden unit has connections to itself and to other hidden units, allowing information to flow across time steps.
- Mathematics: At each time step \(t\), the hidden state \(h_t\) is updated using the input \(x_t\) and the previous hidden state \(h_{t-1}\) (a minimal code sketch follows the list): \(h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h)\), where:
- \(W_{xh}\): Weight matrix for the input.
- \(W_{hh}\): Weight matrix for the hidden state.
- \(b_h\): Bias term.
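To make the recurrence concrete, here is a minimal NumPy sketch of this update. The layer sizes, the random weights, and the toy input sequence are illustrative assumptions, not values from the text above.

```python
# Hypothetical sizes and random data, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)                                    # bias term

xs = rng.normal(size=(seq_len, input_size))  # a toy input sequence x_1 ... x_T
h = np.zeros(hidden_size)                    # initial hidden state h_0

for x_t in xs:
    # The same weights W_xh and W_hh are reused at every time step.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h)  # final hidden state h_T
```

The loop carries \(h\) forward from step to step, which is exactly the internal state described above.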
Exploding and Vanishing Gradient Problem
RNNs use the same weight matrices \(W_{hh}\) and \(W_{xh}\) at every time step. During backpropagation through time (BPTT), gradients are repeatedly multiplied by these matrices, which can cause two problems, illustrated by the sketch after this list:
- Vanishing gradients: Gradients shrink exponentially, making it difficult to update weights and learn long-term dependencies.
- Exploding gradients: Gradients grow exponentially, leading to instability.
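A rough way to see this: ignoring the \(\tanh\) derivative, each step of BPTT multiplies the gradient by \(W_{hh}^{\top}\), so its norm scales roughly like the spectral radius of \(W_{hh}\) raised to the number of time steps. The sketch below is a simplified illustration under that assumption; the 0.5 and 1.5 scaling factors and the matrix size are arbitrary choices.

```python
# Simplified illustration of repeated multiplication during BPTT (tanh derivative omitted).
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 4, 50

def backprop_norms(scale):
    # Random recurrent weights, rescaled so the spectral radius equals `scale`.
    W_hh = rng.normal(size=(hidden_size, hidden_size))
    W_hh *= scale / np.max(np.abs(np.linalg.eigvals(W_hh)))
    grad = np.ones(hidden_size)
    norms = []
    for _ in range(steps):
        grad = W_hh.T @ grad               # one step backward through time
        norms.append(np.linalg.norm(grad))
    return norms

print(backprop_norms(0.5)[-1])  # vanishing: shrinks roughly like 0.5**50
print(backprop_norms(1.5)[-1])  # exploding: grows roughly like 1.5**50
```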
Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are a type of RNN specifically designed to handle the vanishing gradient problem. They achieve this by introducing a more complex structure within each unit that allows information to be selectively remembered or forgotten over long time steps.
LSTM Cell Structure
An LSTM unit consists of four interacting layers that control the flow of information:
- Forget Gate: Determines which parts of the previous cell state \(C_{t-1}\) should be forgotten: \(f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\), where \(\sigma\) is the sigmoid activation function.
- Input Gate: Decides what new information to store in the cell state: \(i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\). A candidate cell state \(\tilde{C}_t\) is created: \(\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\)
- Cell State Update: Combines the forget gate and the input gate to update the cell state: \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\), where \(\odot\) denotes element-wise multiplication.
- Output Gate: Determines the next hidden state \(h_t\): \(o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\). The hidden state is then calculated as \(h_t = o_t \odot \tanh(C_t)\). A minimal code sketch of the full cell follows this list.
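Putting the four equations together, here is a minimal NumPy sketch of a single LSTM step. The weight shapes, the explicit concatenation \([h_{t-1}, x_t]\), and the random toy parameters are assumptions for illustration, not a reference implementation.

```python
# Hypothetical sizes and random parameters, chosen only for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state h_t and cell state C_t."""
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_C @ z + b_C)       # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # cell state update (element-wise products)
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # new hidden state
    return h_t, c_t

# Toy usage with random parameters.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

def W():  # gate weight matrix acting on [h_{t-1}, x_t]
    return rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))

def b():  # gate bias
    return np.zeros(hidden_size)

params = (W(), b(), W(), b(), W(), b(), W(), b())
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a toy sequence of length 5
    h, c = lstm_step(x_t, h, c, params)
print(h)
```

Because \(C_t\) is updated additively, gated by \(f_t\), rather than being repeatedly pushed through \(\tanh\), gradients can flow across many time steps, which is how the LSTM sidesteps the vanishing gradient problem described above.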
Differences Between RNN and LSTM
Feature | RNN | LSTM |
---|---|---|
Architecture | Simple structure with one update equation. | Complex structure with gates for controlling memory. |
Gradient Issues | Suffers from vanishing/exploding gradients. | Designed to avoid vanishing gradients. |
Long-Term Dependencies | Struggles to learn long-term patterns. | Efficiently captures long-term dependencies. |
Use Cases | Short-term dependencies. | Long-term dependencies (e.g., speech, text). |
TL;DR
While RNNs are effective for sequence modeling, their training can be hindered by the vanishing gradient problem, especially for long-term dependencies. LSTMs address this limitation by introducing gating mechanisms that allow the model to selectively retain and forget information, making them a powerful tool for tasks involving sequential data.