Here's a good explanation:

The key advantage of using a minibatch, as opposed to the full dataset, goes back to the fundamental idea of stochastic gradient descent [1].

In batch gradient descent, you compute the gradient over the entire dataset, averaging over a potentially vast amount of information. That takes a lot of memory. But the real handicap is that the batch gradient trajectory can land you in a bad spot (a saddle point).

In pure SGD, on the other hand, you update your parameters by subtracting the gradient computed on a single instance of the dataset. Since it is based on one random data point, it is very noisy and may go off in a direction far from the batch gradient. However, that noisiness is exactly what you want in non-convex optimization, because it helps you escape saddle points and local minima (Theorem 6 in [2]). The disadvantage is that it is terribly inefficient: you need to loop over the entire dataset many times to find a good solution.
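To make the update rule concrete, here is a minimal sketch of a single pure-SGD step. The names `params`, `grad_fn`, and `lr` are illustrative, not from the answer above; `grad_fn` stands in for whatever per-example gradient your model defines.

```python
import numpy as np

def sgd_step(params, grad_fn, x_i, y_i, lr=0.01):
    """One pure-SGD update: step against the gradient from a single example."""
    g = grad_fn(params, x_i, y_i)   # noisy gradient from one random data point
    return params - lr * g          # the minus sign moves the parameters downhill

# Hypothetical usage with a toy squared-error model y ~= params @ x:
# grad_fn = lambda w, x, y: 2 * (w @ x - y) * x
# w = sgd_step(np.zeros(3), grad_fn, np.array([1.0, 2.0, 3.0]), 4.0)
```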

The minibatch methodology is a compromise that injects enough noise into each gradient update, while still achieving relatively speedy convergence.
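The contrast between the exact gradient and the minibatch estimate can be sketched as below, assuming `X` and `y` are NumPy arrays and `grad_fn` is the same illustrative per-example gradient function as above (none of these names come from the quoted answer).

```python
import numpy as np

def full_batch_grad(params, grad_fn, X, y):
    """Exact gradient: average the per-example gradients over the whole dataset."""
    return np.mean([grad_fn(params, x_i, y_i) for x_i, y_i in zip(X, y)], axis=0)

def minibatch_grad(params, grad_fn, X, y, batch_size=32, rng=None):
    """Noisy but cheap estimate: average over a small random subset of examples."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)  # batch_size <= len(X)
    return np.mean([grad_fn(params, X[i], y[i]) for i in idx], axis=0)
```

The minibatch estimate costs `batch_size` gradient evaluations instead of `len(X)`, at the price of some noise in the update direction.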

https://datascience.stackexchange.com/questions/16807/why-mini-batch-size-is-better-than-one-single-batch-with-all-training-data

Doing stochastic gradient descent is vastly more efficient than doing regular gradient descent.

Stochastic gradient descent works by using an estimate of the gradient rather than the actual gradient.

When doing stochastic gradient descent we just iterate through our dataset one batch at a time.

Randomising is important. If you feed the batches in a fixed order, the network will pick up on that ordering.
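A minimal sketch of that loop, assuming `X` and `y` are NumPy arrays (the function and parameter names are illustrative): reshuffle the indices at the start of every epoch, then walk through the shuffled data one batch at a time.

```python
import numpy as np

def minibatch_epochs(X, y, batch_size=32, n_epochs=10, seed=0):
    """Yield shuffled minibatches, reshuffling at the start of every epoch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(n_epochs):
        order = rng.permutation(n)                # fresh random order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], y[idx]                  # one minibatch per gradient update
```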

https://classroom.udacity.com/courses/ud730/lessons/6370362152/concepts/63798118390923