(ENG) MLP and DNN
- Perceptron -> delta rule
- MLP (2~3 layers) -> backpropagation
- DNN (~10 layers) -> using ReLU instead of sigmoid
MLP’s problem
- A major issue in designing an MLP is choosing the right number of hidden units:
- Too many: overfitting
- Too few: underfitting
- As the number of hidden layers increases, the sigmoid function's relatively small gradient (at most 0.25) is multiplied repeatedly during backpropagation. This drives the gradient toward 0, so the weights in the early layers are barely updated (the vanishing gradient problem); a short numeric sketch follows below.
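As a rough illustration (a minimal NumPy sketch, not part of the original notes): the sigmoid derivative is at most 0.25, so even in the best case the gradient factor shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, reached at x = 0

# Multiply the (largest possible) sigmoid gradient across many layers.
for n_layers in (2, 5, 10, 20):
    product = sigmoid_grad(0.0) ** n_layers
    print(f"{n_layers:2d} layers -> gradient factor ~ {product:.2e}")
```

With 10 layers the factor is already on the order of 1e-6, which is why the early layers effectively stop learning.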
Solutions to this problem
- Using a ReLU Function
The Rectified Linear Unit (ReLU) function is defined as:
\[f(x) = \max(0, x)\]
ReLU has a gradient of 1 for positive inputs, which prevents the gradient from shrinking excessively as it is propagated through the network.
Advantages of ReLU:
- Avoids vanishing gradients: The gradient remains constant for positive inputs, ensuring that weights continue to be updated.
- Computational efficiency: ReLU is simpler to compute than sigmoid or tanh.
- Sparsity: It introduces sparsity in activations (many outputs are zero), which can improve generalization.
However, ReLU can suffer from the dying ReLU problem, where a neuron's pre-activation becomes negative for all inputs (often after a large weight update), so it permanently outputs zero and stops receiving gradient. Variants like Leaky ReLU and Parametric ReLU address this issue.
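A minimal NumPy sketch of ReLU, its gradient, and the Leaky ReLU variant mentioned above (the 0.01 negative slope is a common default, not something fixed by these notes):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for negative inputs keeps the neuron from "dying"
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0.  0.  0.  1.  1. ]
print(leaky_relu(x))  # [-0.02  -0.005  0.     0.5    2.   ]
```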
- Xavier Initialization
Proper weight initialization is crucial for mitigating vanishing or exploding gradients. The Xavier initialization (or Glorot initialization) ensures that the variance of activations and gradients is maintained across layers.
Benefits:
- Prevents vanishing/exploding gradients by keeping the variance of inputs and outputs consistent across layers.
- Helps the network converge faster.
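As a sketch of the commonly used Glorot/Xavier uniform variant: weights are drawn from a uniform distribution with limit \(\sqrt{6 / (n_{in} + n_{out})}\), which gives a weight variance of \(2 / (n_{in} + n_{out})\) and keeps activation variance roughly constant across layers (a minimal NumPy illustration, not a definitive implementation):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Glorot/Xavier uniform init: W ~ U[-a, a], a = sqrt(6 / (n_in + n_out))."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

W = xavier_uniform(256, 128)
# Var(U[-a, a]) = a^2 / 3 = 2 / (n_in + n_out), so the empirical std matches:
print(W.std(), np.sqrt(2.0 / (256 + 128)))
```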
- Batch Normalization
Batch Normalization (BatchNorm) normalizes the inputs to each layer by adjusting the mean and variance of the activations during training.
Benefits
- Improves gradient flow: Normalization reduces internal covariate shift, allowing deeper networks to train effectively.
- Stabilizes learning: It reduces sensitivity to initialization and learning rate.
- Acts as regularization: BatchNorm has a slight regularization effect, reducing the need for dropout in some cases.
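A minimal sketch of the BatchNorm forward pass at training time; the running statistics used at test time are omitted, and gamma/beta stand in for the learnable scale and shift parameters:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, features). Normalize each feature over the batch,
    # then apply the learnable scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```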
- Early stopping
Terminating training once a performance plateau has been reached.
Example stopping criteria:
- The error is small enough
- A maximum number of epochs has been reached
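A minimal sketch of patience-based early stopping; `train_one_epoch` and `evaluate` are hypothetical placeholders for your own training and validation routines:

```python
def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5, min_delta=1e-4):
    """Stop when the validation error has not improved by min_delta
    for `patience` consecutive epochs, or when max_epochs is reached."""
    best_error = float("inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)       # hypothetical: one pass over the training data
        val_error = evaluate(model)  # hypothetical: error on a held-out validation set

        if val_error < best_error - min_delta:
            best_error = val_error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}: performance plateau reached.")
                break

    return best_error
```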
- Pruning the network
- Training with noisy samples