In this post, we are going to look at one of the most popular optimization algorithms for deep neural networks: Adagrad (Adaptive Gradient). Many of the classical machine learning optimization algorithms do not carry over well to deep neural networks.
In the case of SGD or plain gradient descent, the learning rate eta stays constant for the entire run, right up until we find the minimum. But sometimes, because of a poorly chosen learning rate, we oscillate around the minimum and never actually reach it, or we approach it very slowly.
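This behavior is easy to see on a toy problem. The sketch below (function and step sizes are illustrative, not from the post) runs plain gradient descent on f(w) = w^2 with two fixed learning rates: a small one that converges, and one large enough that the iterate bounces between +5 and -5 forever without reaching the minimum at 0.

```python
def gd(grad, w, eta, steps=100):
    """Plain gradient descent: eta never changes during the run."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

grad = lambda w: 2.0 * w  # gradient of f(w) = w**2

small = gd(grad, 5.0, eta=0.1)  # shrinks toward the minimum at 0
osc = gd(grad, 5.0, eta=1.0)    # update is w -> -w: oscillates forever
```

With eta = 0.1 each step multiplies w by 0.8, so it converges; with eta = 1.0 each step flips the sign of w, so the iterate circles the minimum indefinitely.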
So to avoid this problem we use Adagrad, i.e. Adaptive Gradient. In Adagrad, at each iteration we update the weights and also adapt the learning rate. The main idea of the algorithm is to keep the learning rate high in the initial iterations and gradually decrease it in the subsequent ones.
At each iteration, the learning rate eta is also updated, according to the following rule.
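In standard notation, with $g_t$ the gradient of the loss $L$ with respect to the weight $w_t$, the Adagrad rule can be written as:

```latex
\begin{aligned}
w_{t+1} &= w_t - \eta_t \, g_t, \qquad g_t = \frac{\partial L}{\partial w_t} \\
\alpha_t &= \alpha_{t-1} + g_t^{\,2} = \sum_{i=1}^{t} g_i^{\,2} \\
\eta_t &= \frac{\eta}{\sqrt{\alpha_t + \epsilon}}
\end{aligned}
```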
To avoid the problem of division by zero we add epsilon, whose value is very small. Alpha at each iteration is the sum of the squares of all previously computed gradients.
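Putting the pieces together, here is a minimal sketch of Adagrad on a single scalar weight (the test function and hyperparameter values are illustrative, not from the post):

```python
import math

def adagrad(grad, w, eta=1.0, eps=1e-8, steps=100):
    """Adagrad on a scalar weight: accumulate squared gradients in alpha
    and scale the base learning rate by 1 / sqrt(alpha + eps)."""
    alpha = 0.0
    for _ in range(steps):
        g = grad(w)
        alpha += g * g  # running sum of squared past gradients
        w -= (eta / math.sqrt(alpha + eps)) * g
    return w

# Minimize f(w) = w**2 starting from w = 5.0
w_final = adagrad(lambda w: 2.0 * w, w=5.0)
```

Early on, alpha is small, so the effective learning rate is large; as gradients accumulate, the steps automatically shrink and the iterate settles near the minimum at 0.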
This removes the problem of manually tuning the learning rate. But it has one disadvantage: as the number of iterations increases, alpha keeps accumulating and at some point becomes extremely large, so the denominator sqrt(alpha + epsilon) grows and eta becomes vanishingly small. The weight updates almost stop at this point. To avoid this problem we have other optimization algorithms such as Adadelta.
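The vanishing-step problem can be seen directly: if we feed Adagrad a constant gradient (a toy setup, chosen only for illustration), alpha grows linearly with the step count t, so the effective learning rate decays like 1/sqrt(t).

```python
import math

eta, eps, alpha = 1.0, 1e-8, 0.0
effective_lrs = []
for t in range(10_000):
    g = 1.0          # constant gradient, purely for illustration
    alpha += g * g   # alpha grows by 1 each step -> alpha = t + 1
    effective_lrs.append(eta / math.sqrt(alpha + eps))

# eta_t ~ 1/sqrt(t): after 10,000 steps the update is 100x smaller
```

After 10,000 steps the effective learning rate has dropped from roughly 1.0 to roughly 0.01, and it keeps shrinking no matter how far we still are from the minimum, which is exactly the weakness Adadelta was designed to fix.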