Gradient Descent using Backpropagation

In my first post, we talked about neural networks and their uses. In this post, we are going to talk about how they work and what their advantages are. Before moving on to the discussion, we need to understand a few things. A neural network is nothing but a bunch of interconnected neurons.

x -> Input neuron (the value the network takes in)

w -> Weight of the neuron (in the case of y = 2 * x, the weight would be 2)

y -> Output neuron (y = w * x)
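In code, this single neuron is just one multiplication. Here is a tiny Python sketch (the function name is mine, just for illustration):

```python
def neuron(x, w):
    # a single neuron: multiply the input by its weight
    return w * x

# with a weight of 2 this neuron computes y = 2 * x
print(neuron(3, 2))   # 6
```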

Let’s consider a small problem and understand what the gradient descent algorithm means.

We are given x = 1 and y = 4. We need to find the relationship between them, i.e. the weight which, when multiplied by x, gives y. The answer is 4 (since y = w * x, we need 4 = w * 1). It was easy to find the weight here because the relationship is simple. But if x were 2 and y were 7.9, we could not just eyeball the weight that maps x to y. So we need a general algorithm that finds the weight for any values of x and y we are given. That algorithm for finding the optimal weights is gradient descent using backpropagation.

Let us consider x to be 1 and y to be 3. We need to find the weight that connects x to y. First we guess some random weight and compute y'. Then we compare it with the original y and find the difference. If the difference is 0, we have found the correct weight; otherwise we need to change the weight. The algorithm is quite easy, but we don't yet know whether to increase or decrease the weight when the difference isn't 0. Let's understand it with a picture.

It is clear from the picture above that we used a trial-and-error method. We guessed some random weight and computed the difference between the actual output and our output. If the error is 0, we have found the correct weight; otherwise we readjust it. But we still need to decide whether to increase or decrease the weight when the error is not zero. How do we decide? To answer that question, let us first plot the error against the weights we used. We will call our friend Mr. Graph.

From the graph we can conclude that as the weight increases, the error decreases. What does this signify? Yes, the derivative of a function. To restate: the derivative of a function is nothing but its rate of change; it measures how much the output changes when the input changes. So, to answer whether to increase or decrease the weight when the error isn't zero, we can use the derivative. How? The derivative tells us whether the function is increasing or decreasing. If the derivative is positive, the function increases as the input increases, and vice versa for a negative derivative.

We want our error to be small or zero, so we want to move in the direction where the error is decreasing. In the graph above, the derivative comes out to be -1. This says that if we increase the weight, the error decreases, which answers our question of whether to increase or decrease the weight when the error is not zero. What if the derivative were positive? That would mean increasing the weight also increases the error, so in that situation we would decrease the weight to bring the error down. Please spend some time on this paragraph, as it is the key to understanding almost any concept in deep learning.

Okay, we now know how to find the weight that maps x to y. Let us formalize the steps as an algorithm (gradient descent using backpropagation) so it will be easier to follow.
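Before we formalize it, here is a small numerical sketch of this argument in Python (an illustration of the idea, not the exact plot from the post):

```python
x, y = 1, 3

def error(w):
    return y - w * x        # E = y - y'

# Tabulate the error for a few weights: it falls as the weight grows.
for w in range(5):
    print(w, error(w))      # prints: 0 3, 1 2, 2 1, 3 0, 4 -1

# Estimate the derivative dE/dw numerically: it comes out to about -1,
# which tells us that increasing the weight decreases the error.
h = 1e-6
print((error(1 + h) - error(1)) / h)   # roughly -1.0
```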

Gradient Descent algorithm using backpropagation:

1.) Initialise with some random weight

2.) Compute y' (our network output)

3.) Find the error between the actual output and our output: E = ( y – y' )

4.) Find the derivative of the error with respect to the weight (dE/dw)

5.) Update the weight as: w = w – (dE/dw) / |dE/dw|

6.) Go back to step 2 until the error is zero.
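Here is a minimal sketch of these six steps in Python for the x = 1, y = 4 example. It assumes the starting guess is below the true weight, because with the error E = y – y' from step 3 the derivative is always -x, so this update only ever pushes the weight upward:

```python
x, y = 1, 4
w = 1.0                          # 1.) initial weight (a guess below the true weight of 4)

while True:
    y_hat = w * x                # 2.) compute y', our network output
    E = y - y_hat                # 3.) error between the actual output and ours
    if E == 0:
        break                    # 6.) stop once the error is zero
    dE_dw = -x                   # 4.) derivative of E = y - w*x with respect to w
    w = w - dE_dw / abs(dE_dw)   # 5.) move the weight by exactly one unit in the right direction
    print(w)                     # 2.0, 3.0, 4.0

print("learned weight:", w)      # 4.0
```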

Why is step 5 so weird? Let's talk about it. In the x = 1, y = 4 example, if the weight were 1 the error would be 3, and the derivative of the error with respect to the weight comes out to be -1, since E = y – w * x and therefore dE/dw = -x (try it yourself). The negative sign of the derivative tells us that the error will decrease if we increase the weight. So if step 5 were w = w + (dE/dw) / |dE/dw|, we would be decreasing the weight, the opposite of what we want; that is why we flip the sign.

Why are we dividing by |dE/dw|? Let's understand. If we used the raw derivative as the step, the size of the update would depend on how steep the slope is. Suppose the input were 3 instead of 1, with the same true weight of 4 (so y = 12): the derivative would then be -3, and from w = 3 the update w = w – dE/dw would jump the weight from 3 to 6, overshooting the true weight of 4. To avoid such large jumps, we update step by step: dividing the derivative by its own magnitude turns it into +1 or -1, so the weight is incremented or decremented by exactly 1 each time and we eventually land on the optimal weight (the short sketch below shows the difference).

That's all the algorithm is. Why the name backpropagation? Because we first compute y' using some random weight, compare it with the actual output, and then push the error backward to update the weight. Since we move backward to adjust the weight, the algorithm is called backpropagation. In the next post, we will implement this algorithm from scratch in Python and write our own neural network!
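Here is that sketch of the overshooting scenario (the x = 3 input is my hypothetical, chosen only so the derivative has magnitude 3):

```python
x, y = 3, 12            # hypothetical input where the true weight is still 4
w = 3.0                 # current guess
dE_dw = -x              # derivative of E = y - w*x with respect to w, i.e. -3

# Raw update: the step is as large as the derivative, so we overshoot the true weight.
print(w - dE_dw)                 # 6.0

# Normalised update from step 5: the step is exactly 1, so we land on 4.
print(w - dE_dw / abs(dE_dw))    # 4.0
```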
