### A non-technical introduction to Gradient Descent

Gradient descent is an iterative optimization algorithm that is often used to find the minimum or maximum values of a function.

To understand it better, let us use an analogy of a factory that assembles different parts to build an automobile.

Naturally, the factory would like to produce vehicles as efficiently as possible. Although a real world example would comprise of many different factors to be optimized, for simplicity, we shall consider two key factors only – time and money. This would require the factory manager to come up with some ‘function’ that represents the relationship between money, time and the produced product.

Therefore, the manager figures out a set of equations that represents these relationships, and needs an optimization approach to find a solution for producing the product with lowest monetary and temporal costs.

In order to solve this problem, the manager could solve these equations manually, which will expend enormous amounts of time, energy (and patience, to say the least), or he could plot a graph of all the equations previously formed, and find the local minimum of the resulting curve. The minima, as the name suggests, is the point at which the curve has its smallest value.

The above sample graph has only one curve, and shows the minima at (0,0).

Finding the minimum manually is quite easy when the graph is as simple as the one illustrated above, but doing the same for a complex curve can be challenging. Optimization techniques such as using a gradient descent algorithm however can simplify and speed up the process considerably.

The gradient descent algorithm starts at a point on the graph and traverses the graph “downwards”, i.e. in the direction it is more likely to find the minimum. The algorithm uses the current position to determine the direction to move in order to find the minimum.

For example, if the algorithm is currently at a point (5,20) in the graph above, it will check and conclude that moving towards the minimum would mean moving in a direction where the value of the point is less than (5,20) on the graph – the downward direction

But here a new question arises: how many points should the algorithm traverse in a single step in order to reach the minimum in the fastest and most optimal manner? Could it also be that the step taken by the algorithm is either too small or too big to be efficient?

This brings us to the notion of ‘step size’ or ‘learning rate’ (often denoted by alpha).  The step size is the rate at which the algorithm traverses the curve or size of the step the algorithm makes in each iteration traversal.

Selecting a large step size would force the algorithm to bounce between different points around the minimum, thus reducing efficiency.

On the other hand, selecting a small step size would force the algorithm to make more iterations, which again, like before would increase the time taken for the process to complete.

`next_x = 6  `

`gamma = 0.01  `

`precision = 0.00001  `

`max_iters = 10000  `

`# Derivative function`

`df = lambda x: 4 * x**3 - 9 * x**2`

`for i in range(max_iters):`

`    current_x = next_x`

`    next_x = current_x - gamma * df(current_x)`

`    step = next_x - current_x`

`    if abs(step) <= precision:`

`        break`

`print("Minimum at", next_x)`

`# The output for the above will be something like`

`# "Minimum at 2.2499646074278457"`

(Code taken from wikipedia)

Let’s break the code down to understand it better-

The first section defines the following variables-

– next_x (the point where we want the search to begin)

– gamma (step size multiplier)

–  precision (This value determines how close the obtained result need to be to the actual minimum in order to decide the stopping point)

– max_iters (the maximum number of iterations)

The derivative function represents the equation for which we are trying to find the minima.

The for loop runs until the maximum number of iterations are reached. This is done in order to prevent the program from entering an infinite loop. The starting point (next_x) is stored in a different variable called current_x, and a new value for next_x is calculated. The variable called step is then defined, and a check is performed to make sure the step isn’t smaller than the value for precision (set earlier). Setting smaller values than precision could lead the algorithm to take extremely small values, and take a very long time to finish execution. Finally, the result is then displayed on the screen.