(ENG) MLP and DNN

  • Perceptron -> delta rule
  • MLP (2~3 layers) -> backpropagation
  • DNN (~10 layers) -> using ReLU instead of sigmoid

MLP’s problem

  • A major issue in designing an MLP is deciding how many hidden units to use:
    • Too many: overfitting
    • Too few: underfitting
  • As the number of hidden layers increases, the sigmoid function’s relatively small gradient is multiplied over and over during backpropagation. This drives the gradient toward 0, so the weights stop being updated (the vanishing gradient problem); see the numeric sketch below.
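
A quick numeric sketch of how fast this happens: the sigmoid derivative is at most 0.25 (at x = 0), so even in the best case the backpropagated gradient shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Best case: every layer contributes the maximum sigmoid gradient of 0.25.
# The product still shrinks geometrically with the number of layers.
for depth in (2, 5, 10, 20):
    print(depth, 0.25 ** depth)
# 2   0.0625
# 5   0.00098
# 10  ~9.5e-07
# 20  ~9.1e-13
```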

Solutions to this problem

  1. Using a ReLU Function

The Rectified Linear Unit (ReLU) function is defined as:

\[f(x) = \max(0, x)\]

ReLU has a gradient of 1 for positive inputs, which prevents the gradient from shrinking excessively as it is propagated backward through the network.

Advantages of ReLU:

  • Avoids vanishing gradients: The gradient remains constant for positive inputs, ensuring that weights continue to be updated.
  • Computational efficiency: ReLU is simpler to compute than sigmoid or tanh.
  • Sparsity: It introduces sparsity in activations (many outputs are zero), which can improve generalization.

However, ReLU can suffer from the dying ReLU problem, where a neuron becomes permanently inactive (always outputting zero) because its inputs stay negative, often after a large weight update. Variants such as Leaky ReLU and Parametric ReLU address this issue, as sketched below.
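
A minimal NumPy sketch of ReLU, its gradient, and the Leaky ReLU variant mentioned above (the slope alpha = 0.01 is a common default, chosen here only for illustration):

```python
import numpy as np

def relu(x):
    # Passes positive inputs through unchanged, zeros out the rest.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # Keeps a small slope for negative inputs so the gradient never
    # becomes exactly zero (mitigates the dying ReLU problem).
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.   0.   0.   0.5  2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]
print(leaky_relu(x))  # [-0.02  -0.005  0.     0.5    2.  ]
```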

  2. Xavier Initialization

Proper weight initialization is crucial for mitigating vanishing or exploding gradients. The Xavier initialization (or Glorot initialization) ensures that the variance of activations and gradients is maintained across layers.

Benefits:

  • Prevents vanishing/exploding gradients by keeping the variance of inputs and outputs consistent across layers.
  • Helps the network converge faster.
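
A minimal sketch of the uniform variant of Xavier/Glorot initialization; the layer sizes (256 in, 128 out) are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(n_in, n_out):
    # Glorot/Xavier uniform initialization: W ~ U(-limit, +limit) with
    # limit = sqrt(6 / (n_in + n_out)), which keeps the variance of
    # activations and gradients roughly constant across layers.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(256, 128)
print(W.shape, W.var())  # variance stays near 2 / (n_in + n_out) ≈ 0.0052
```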
  3. Batch Normalization

Batch Normalization (BatchNorm) normalizes the inputs to each layer by adjusting the mean and variance of the activations during training.

Benefits:

  • Improves gradient flow: Normalization reduces internal covariate shift, allowing deeper networks to train effectively.
  • Stabilizes learning: It reduces sensitivity to initialization and learning rate.
  • Acts as regularization: BatchNorm has a slight regularization effect, reducing the need for dropout in some cases.
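
A minimal forward-pass sketch of batch normalization in training mode (gamma and beta are the learnable scale and shift; the batch shape is an arbitrary example, and inference would use running estimates of the mean and variance instead):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then apply the
    # learnable scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # batch of 32, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3))  # ≈ [0. 0. 0. 0.]
print(out.std(axis=0).round(3))   # ≈ [1. 1. 1. 1.]
```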
  4. Early Stopping

Early stopping terminates training once performance has plateaued, for example when the validation error stops improving; a sketch of a typical loop follows the list below.

Example stopping criteria:

  • The error has become small enough
  • A maximum number of epochs has been reached

Related techniques for reducing overfitting:

  • Pruning the network
  • Training with noisy samples
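
A minimal sketch of a patience-based early-stopping loop; `train_one_epoch` and `validation_loss` are hypothetical placeholders for whatever training setup is in use:

```python
def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=100, patience=5, min_delta=1e-4):
    # Stop once the validation loss has not improved by at least
    # `min_delta` for `patience` consecutive epochs.
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)              # one pass over the training data
        val_loss = validation_loss(model)   # loss on a held-out set
        if val_loss < best_loss - min_delta:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping at epoch {epoch}: no improvement for "
                  f"{patience} epochs (best validation loss {best_loss:.4f})")
            break
    return model
```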


