
Gradient descent is a popular optimization method (for minimizing error) in machine learning and deep learning. It iteratively modifies a model's parameters in order to minimize a specified loss function. In deep learning it is paired with the backpropagation technique, which computes the gradients of the loss with respect to the weights and biases, while in linear regression it is used directly to optimize the weights and biases.
A gradient measures how much a function's output changes when its inputs are altered slightly. In machine learning, the gradient is the vector of partial derivatives of a function with several input variables. It is simply a measure of the change in error relative to a change in the weights, and geometrically it is the slope of the function.
If y is a variable that depends on x, and
dy = change in y and dx = change in x,
then the ratio dy / dx is the basic measure of the gradient used in gradient descent.
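To make the idea concrete, here is a minimal sketch of gradient descent on a one-variable function. The function y = (x - 3)^2, its gradient dy/dx = 2(x - 3), the starting point, and the learning rate are all illustrative assumptions, not values taken from the text.

```python
# Gradient descent on y = (x - 3)^2, whose gradient is dy/dx = 2 * (x - 3).
x = 10.0            # initial guess (assumed)
learning_rate = 0.1  # step size (assumed)

for step in range(50):
    grad = 2 * (x - 3)            # dy/dx at the current x
    x -= learning_rate * grad     # step against the gradient to reduce y

print(x)  # approaches 3, the minimizer of y
```

Each step moves x a small distance against the sign of dy/dx, so y shrinks until the gradient is (nearly) zero.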
Types of Gradient Descent
There are three popular types of gradient descent, depending on the amount of data used in each update.
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
1) Batch Gradient Descent:
In batch gradient descent (commonly referred to as vanilla gradient descent), the error is computed for every example in the training dataset, but the model is not updated until all training samples have been evaluated. One complete pass through the training dataset is called a training epoch (or cycle). This procedure is sometimes described as steepest descent and is computationally efficient. Processing the whole batch at once produces a consistent error gradient and stable convergence, which is its main advantage. A drawback is that this stable error gradient can occasionally lead the model to converge to a solution that is not the best it can achieve. In practice, this method is most often used with small datasets.
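Below is a hedged sketch of batch gradient descent for simple linear regression. The synthetic data (a noisy line y = 4x + 2), the learning rate, and the number of epochs are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=100)
y = 4.0 * X + 2.0 + rng.normal(0.0, 0.1, size=100)  # assumed synthetic data

w, b = 0.0, 0.0
learning_rate = 0.3   # assumed step size

for epoch in range(500):                  # one epoch = one pass over the whole dataset
    error = (w * X + b) - y               # errors for ALL training examples
    grad_w = 2.0 * (error * X).mean()     # mean-squared-error gradient w.r.t. w
    grad_b = 2.0 * error.mean()           # mean-squared-error gradient w.r.t. b
    w -= learning_rate * grad_w           # a single parameter update per epoch
    b -= learning_rate * grad_b

print(w, b)  # should end up close to 4.0 and 2.0
```

Note that only one update is made per epoch, which is why the trajectory is smooth but can be slow for very large datasets.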

2) Stochastic Gradient Descent:
In stochastic gradient descent (SGD), the parameters are updated for each training example in the dataset, one example at a time. These frequent updates can make SGD faster than batch gradient descent, and they let us track the rate of improvement quite closely.
However, updating after every example requires more computation than the batch method. The frequent updates also produce noisy gradients, which may cause the error rate to fluctuate rather than decrease steadily over time. SGD is well suited to larger datasets.
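The sketch below illustrates SGD on the same kind of synthetic linear-regression data; one noisy update is made per training example. The data, learning rate, and epoch count are again assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=100)
y = 4.0 * X + 2.0 + rng.normal(0.0, 0.1, size=100)  # assumed synthetic data

w, b = 0.0, 0.0
learning_rate = 0.05   # assumed step size

for epoch in range(20):
    for i in rng.permutation(len(X)):          # visit examples in a random order
        error = (w * X[i] + b) - y[i]          # error on a single example
        w -= learning_rate * 2.0 * error * X[i]
        b -= learning_rate * 2.0 * error       # one noisy update per example

print(w, b)  # noisy estimates, but close to 4.0 and 2.0
```

Because each update is based on a single example, the individual gradients are noisy, but there are many more updates per pass over the data.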

3) Mini-batch Gradient Descent:
Mini-batch gradient descent is the favoured method because it combines the ideas of batch gradient descent and SGD. The training dataset is split into small batches, and the parameters are updated once per batch. This strikes a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. There is no fixed standard, since the best value depends on the application, but mini-batch sizes commonly range from 50 to 256. This approach is applied to neural networks and is the most widely used type in deep learning. Smaller batch sizes let the learning process converge more quickly, at the cost of noisier updates; larger batch sizes converge more slowly but give a more accurate estimate of the error gradient.
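Finally, here is a hedged sketch of mini-batch gradient descent on the same kind of synthetic data: the training set is shuffled, split into small batches, and one update is made per batch. The batch size, data, learning rate, and epoch count are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=1000)
y = 4.0 * X + 2.0 + rng.normal(0.0, 0.1, size=1000)  # assumed synthetic data

w, b = 0.0, 0.0
learning_rate = 0.2   # assumed step size
batch_size = 64       # a common choice within the 50-256 range

for epoch in range(50):
    order = rng.permutation(len(X))               # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        error = (w * X[batch] + b) - y[batch]     # errors for this mini-batch only
        w -= learning_rate * 2.0 * (error * X[batch]).mean()
        b -= learning_rate * 2.0 * error.mean()

print(w, b)  # should end up close to 4.0 and 2.0
```

Averaging the gradient over each mini-batch smooths out much of the noise seen in SGD while still making many updates per epoch.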


