(ENG) MLP and DNN

  • Perceptron -> delta rule
  • MLP (2~3 layers) -> backpropagation
  • DNN (~10 layers) -> using ReLU instead of sigmoid

MLP’s problem

  • A major issue in designing an MLP is deciding how many hidden units to use:
    • Too many: overfitting
    • Too few: underfitting
  • As the number of hidden layers increases, the sigmoid function’s relatively small gradient is multiplied over and over during backpropagation. This drives the gradient toward 0, so the weights stop being updated (the vanishing gradient problem); see the numeric sketch below.
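
A quick numeric sketch of how fast this happens: the sigmoid derivative is at most 0.25 (at x = 0), so even in the best case the backpropagated gradient shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Best case: every layer contributes the maximum sigmoid gradient of 0.25.
# The product still shrinks geometrically with the number of layers.
for depth in (2, 5, 10, 20):
    print(depth, 0.25 ** depth)
# 2   0.0625
# 5   0.00098
# 10  ~9.5e-07
# 20  ~9.1e-13
```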

Solutions to this problem

  1. Using a ReLU Function

The Rectified Linear Unit (ReLU) function is defined as:

\[f(x) = \max(0, x)\]

ReLU has a gradient of 1 for positive inputs, which prevents the gradient from shrinking excessively as it is propagated backward through the network.

Advantages of ReLU:

  • Avoids vanishing gradients: The gradient remains constant for positive inputs, ensuring that weights continue to be updated.
  • Computational efficiency: ReLU is simpler to compute than sigmoid or tanh.
  • Sparsity: It introduces sparsity in activations (many outputs are zero), which can improve generalization.

However, ReLU can suffer from the dying ReLU problem, where a neuron becomes permanently inactive (always outputting zero) because its inputs stay negative, often after a large weight update. Variants such as Leaky ReLU and Parametric ReLU address this issue, as sketched below.
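
A minimal NumPy sketch of ReLU, its gradient, and the Leaky ReLU variant mentioned above (the slope alpha = 0.01 is a common default, chosen here only for illustration):

```python
import numpy as np

def relu(x):
    # Passes positive inputs through unchanged, zeros out the rest.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # Keeps a small slope for negative inputs so the gradient never
    # becomes exactly zero (mitigates the dying ReLU problem).
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.   0.   0.   0.5  2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]
print(leaky_relu(x))  # [-0.02  -0.005  0.     0.5    2.  ]
```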

  2. Xavier Initialization

Proper weight initialization is crucial for mitigating vanishing or exploding gradients. The Xavier initialization (or Glorot initialization) ensures that the variance of activations and gradients is maintained across layers.

Benefits:

  • Prevents vanishing/exploding gradients by keeping the variance of inputs and outputs consistent across layers.
  • Helps the network converge faster.
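
A minimal sketch of the uniform variant of Xavier/Glorot initialization; the layer sizes (256 in, 128 out) are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(n_in, n_out):
    # Glorot/Xavier uniform initialization: W ~ U(-limit, +limit) with
    # limit = sqrt(6 / (n_in + n_out)), which keeps the variance of
    # activations and gradients roughly constant across layers.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(256, 128)
print(W.shape, W.var())  # variance stays near 2 / (n_in + n_out) ≈ 0.0052
```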
  3. Batch Normalization

Batch Normalization (BatchNorm) normalizes the inputs to each layer by adjusting the mean and variance of the activations during training.

Benefits:

  • Improves gradient flow: Normalization reduces internal covariate shift, allowing deeper networks to train effectively.
  • Stabilizes learning: It reduces sensitivity to initialization and learning rate.
  • Acts as regularization: BatchNorm has a slight regularization effect, reducing the need for dropout in some cases.
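
A minimal forward-pass sketch of batch normalization in training mode (gamma and beta are the learnable scale and shift; the batch shape is an arbitrary example, and inference would use running estimates of the mean and variance instead):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then apply the
    # learnable scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # batch of 32, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3))  # ≈ [0. 0. 0. 0.]
print(out.std(axis=0).round(3))   # ≈ [1. 1. 1. 1.]
```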
  4. Early Stopping

Early stopping terminates training once performance has plateaued, for example when the validation error stops improving; a sketch of a typical loop follows the list below.

Example stopping criteria:

  • The error has become small enough
  • A maximum number of epochs has been reached

Related techniques for reducing overfitting:

  • Pruning the network
  • Training with noisy samples
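
A minimal sketch of a patience-based early-stopping loop; `train_one_epoch` and `validation_loss` are hypothetical placeholders for whatever training setup is in use:

```python
def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=100, patience=5, min_delta=1e-4):
    # Stop once the validation loss has not improved by at least
    # `min_delta` for `patience` consecutive epochs.
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)              # one pass over the training data
        val_loss = validation_loss(model)   # loss on a held-out set
        if val_loss < best_loss - min_delta:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping at epoch {epoch}: no improvement for "
                  f"{patience} epochs (best validation loss {best_loss:.4f})")
            break
    return model
```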


