(ENG) Backpropagation
Before delving in
Fitting the curve
What is the “best curve”?
The core problem here is fitting a curve to data points scattered across a plane.
So what is the “best curve”?
Loss function : Measure of total squared distance between the points and the curve
🧠 It is called a function because it depends on multiple parameters \(\eta = (k_0, \dots, k_5)\)
The function yields a single value, where a low value means a good fit
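As a minimal sketch of this idea (assuming a degree-5 polynomial with coefficients \(k_0, \dots, k_5\) and the squared vertical distance as the measure; the data below is invented for illustration):

```python
import numpy as np

def loss(eta, xs, ys):
    """Total squared distance between the data points and the curve.

    eta    : array of polynomial coefficients (k0 ... k5)
    xs, ys : coordinates of the data points
    """
    # Evaluate the polynomial k0 + k1*x + ... + k5*x^5 at each x
    curve = sum(k * xs**i for i, k in enumerate(eta))
    # Low value = good fit
    return np.sum((ys - curve) ** 2)

# Example: noisy samples of a cubic, scored with an arbitrary guess for eta
xs = np.linspace(-1, 1, 50)
ys = 0.5 * xs**3 - xs + 0.1 * np.random.randn(50)
print(loss(np.zeros(6), xs, ys))
```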
Minimizing the loss function
How can we find the best configuration of \((k_0 ... k_5)\), or in other words, minimize the loss function?
Methods
- Random perturbation : random changes to the parameters
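A hedged sketch of the random-perturbation idea, reusing the `loss` function sketched above: nudge the parameters at random and keep a change only if it lowers the loss.

```python
import numpy as np

def random_perturbation(f, eta, steps=1000, scale=0.01):
    """Randomly nudge the parameters; keep a change only if it lowers f(eta)."""
    best = f(eta)
    for _ in range(steps):
        candidate = eta + scale * np.random.randn(eta.size)
        value = f(candidate)
        if value < best:                  # accept only improvements
            eta, best = candidate, value
    return eta, best

# Usage (with the loss sketch above):
# eta, best = random_perturbation(lambda e: loss(e, xs, ys), np.zeros(6))
```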
How can we estimate a good value of \(\eta\) without evaluating every configuration by brute force?
Gradient Descent Method
Limitation : sometimes the derivative is unknown (the function is not differentiable, or no closed form is available)
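A minimal gradient-descent sketch, assuming the same loss as above and a finite-difference approximation of the gradient for the case where an analytic derivative is unavailable (function names and hyperparameters are illustrative choices, not from the original post):

```python
import numpy as np

def numerical_gradient(f, eta, eps=1e-6):
    """Finite-difference approximation of df/d(eta_i) for each parameter."""
    grad = np.zeros_like(eta)
    for i in range(eta.size):
        step = np.zeros_like(eta)
        step[i] = eps
        grad[i] = (f(eta + step) - f(eta - step)) / (2 * eps)
    return grad

def gradient_descent(f, eta, lr=0.01, steps=1000):
    """Repeatedly move the parameters against the gradient of the loss."""
    for _ in range(steps):
        eta = eta - lr * numerical_gradient(f, eta)
    return eta

# Usage: wrap the loss so it is a function of eta only
# best_eta = gradient_descent(lambda e: loss(e, xs, ys), np.zeros(6))
```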
Limitations of the Perceptron
a) Linear Separability
A single-layer perceptron uses a linear decision boundary to classify data. It only works well if the data can be separated by a straight line (or a hyperplane in higher dimensions). However, some problems, like XOR, are not linearly separable.
b) Absence of Hidden Layers
The perceptron lacks hidden layers. Without them, it cannot model complex relationships between input features.
c) Learning is not Incremental
Learning is not incremental over time, meaning it has no retention of previous learning.
d) Inability to Learn Nonlinear Functions
The perceptron updates its weights using a simple rule:
\[w_{i+1} = w_i + \eta \, (y - \hat{y}) \, x\]
where \(\eta\) here denotes the learning rate. This works for linear problems but fails for nonlinear ones, as the perceptron has no mechanism to capture nonlinear patterns.
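A sketch of this rule with a step activation (the 0/1 encoding and the bias absorbed as an extra constant input are my own assumptions): trained on AND it converges, but on XOR no weight vector can ever classify all four points.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=50):
    """Single-layer perceptron trained with w <- w + lr * (y - y_hat) * x."""
    X = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias as a constant input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w > 0 else 0     # step activation
            w += lr * (yi - y_hat) * xi        # perceptron update rule
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train_perceptron(X, np.array([0, 0, 0, 1])))  # AND: finds a separating line
print(train_perceptron(X, np.array([0, 1, 1, 0])))  # XOR: never classifies all four points
```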
How did Backpropagation solve these issues?
Backpropagation (Hinton et al., 1986) solved these problems by enabling multi-layer networks to learn nonlinear decision boundaries.
a) Nonlinear Activation Functions
Backpropagation allows the use of nonlinear activation functions (e.g., sigmoid, tanh, ReLU). Nonlinear activations enable the network to combine inputs in complex ways, effectively learning nonlinear decision boundaries.
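For reference, the three activations mentioned above, written out as plain functions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negative inputs, identity otherwise
```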
b) Hidden Layers
Backpropagation trains networks with multiple layers of neurons (hidden layers). Hidden layers allow the network to create hierarchical representations of data, transforming input features into complex, abstract representations.
For example, in the XOR problem:
- The first hidden layer transforms the inputs into a new feature space.
- The second layer uses this new space to create a nonlinear decision boundary.
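One concrete, hand-picked set of weights that realises this (an illustration, not the result of training): the hidden layer computes OR and NAND of the inputs, and the output unit ANDs those two features, which is exactly XOR.

```python
import numpy as np

def step(z):
    return (z > 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: first unit fires for OR(x1, x2), second for NAND(x1, x2)
    h = step(np.array([[1, 1], [-1, -1]]) @ x + np.array([-0.5, 1.5]))
    # Output layer: AND of the two hidden features -> XOR of the original inputs
    return step(np.array([1, 1]) @ h - 1.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))   # prints 0, 1, 1, 0
```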
c) Learning Complex Relationships
Backpropagation applies the chain rule to compute gradients layer by layer, allowing the network to adjust weights in all layers based on how they affect the output error. This enables the network to learn mappings for nonlinear functions, solving problems like XOR.
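A compact sketch of this, assuming a 2-4-1 sigmoid network with a squared-error loss (the architecture, hyperparameters, and random seed are my own illustrative choices): the backward pass applies the chain rule layer by layer and updates the weights of every layer by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 4 hidden units -> 1 output
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

for _ in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)                    # hidden activations
    y_hat = sigmoid(h @ W2 + b2)                # network output
    # Backward pass: chain rule applied layer by layer (squared-error loss)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # gradient at the output pre-activation
    d_hid = (d_out @ W2.T) * h * (1 - h)        # gradient at the hidden pre-activation
    # Gradient-descent updates for every layer
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)

print(np.round(y_hat.ravel(), 2))   # typically close to [0, 1, 1, 0]
```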
Chain Rule behind Backpropagation
Composition of a NN
- Layers with weights : \(w\) and biases : \(b\)
- Activation function : \(f\)
- Output function : \(\hat{y}\)
- Loss function : \(\eta\)
A single-layer NN equation, where \(x\) is the input :
\[\hat{y} = f(wx + b)\]
- The goal is to minimize the loss:
\[\min_{w,\, b} \; \eta(y, \hat{y})\]
Chain Rule
The chain rule (연쇄법칙)
- Given functions \(f\) and \(g\) that are both differentiable, and a composite function \(F(x) = f(g(x)) = (f \circ g)(x)\)
- Then, \(F'(x) = f'(g(x)) \cdot g'(x)\)
- If we let \(t = g(x)\) and \(y = f(t)\)
- Then, \(\frac{dy}{dx} = \frac{dy}{dt} \cdot \frac{dt}{dx}\)
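Applying this to the single-layer network above (introducing \(z = wx + b\) as a name for the pre-activation, a symbol not used in the original notation), the gradient of the loss with respect to each parameter factors into a chain of local derivatives:

\[
\frac{\partial \eta}{\partial w}
= \frac{\partial \eta}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}
= \frac{\partial \eta}{\partial \hat{y}} \cdot f'(z) \cdot x,
\qquad
\frac{\partial \eta}{\partial b}
= \frac{\partial \eta}{\partial \hat{y}} \cdot f'(z)
\]

In a multi-layer network the same factorization is repeated layer by layer, which is exactly what the backward pass computes.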