 ### Gradient Descent with Nesterov Momentum from the ground up

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to situate the minimum of the function.

A restriction of gradient descent is that it can get stuck in flat areas or bounce around if the objective function gives back noisy gradients. Momentum is a strategy that accelerates the progression of the search to skim across flat regions and smooth out bouncy gradients.

In some scenarios, the acceleration of momentum can cause the search to miss or overshoot the minima at the bottom of basins or valleys. Nesterov Momentum is an extension of momentum that consists of calculating the decaying moving average of the gradients of projected positions within the search space instead of the actual positions themselves.

This has the influence of leveraging the accelerating advantages of momentum while enabling the search to slow down when approaching the optima and minimize the probability of missing or overshooting it.

In this guide, you will find out how to develop the Gradient Descent optimization algorithm with Nesterov Momentum from the ground up.

After going through this guide, you will be aware of:

• Gradient descent being an optimization algorithm that leverages the gradient of the objective function to navigate the search space.
• The convergence of gradient descent optimization algorithm can be accelerated through extension of the algorithm and including Nesterov Momentum.
• How to implement the Nesterov Momentum optimization algorithm from the ground up and apply it to an objective function and assess the outcomes.

Tutorial Summarization

This tutorial is subdivided into three portions, which are:

2] Nesterov Momentum

3] Gradient Descent with Nesterov Momentum

• Two-Dimensional Test Problem
• Gradient descent optimization with Nesterov Momentum
•  Visualization of Nesterov Momentum

Gradient descent is an optimization algorithm.

It is technically referenced to as a first-order optimization algorithm as it overtly leverages the first order derivative of the targeted objective function.

First-order methods are reliant on gradient data to assist in directing the search for a minimum.

The first order derivative, or merely the ‘derivative’ is the rate of change or slope of the target function at a particular point, for example, for a particular input.

If the target functions takes several input variables, it is referenced to as a multivariate function and the input variables can be perceived of as a vector. In turn, the derivative of a multivariate targeted function might also be perceived as a vector and is referenced to typically as the “gradient”.

Gradient: First order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a particular input.

Gradient descent references to a minimization optimization algorithm that follows the negative of the gradient downhill of the targeted function to situate the minimum of the function.

The gradient descent algorithm needs a target function that is being optimized and the derivative function for the objective function. The targeted function(f) gives back a score for a provided grouping of inputs, and the derivative function f’() provides the derivative of the target function for a provided grouping of inputs.

The gradient descent algorithm needs a beginning point (x) in the problem, like an arbitrarily chosen point in the input space.

The derivative is subsequently calculated and a step is undertaken within the input space that is predicted to have the outcome of a downhill movement within the targeted function, under the assumption we are minimizing the target function.

A downhill movement is performed by first calculating how far to shift within the input space, calculated as the steps size (referred to as alpha or the learning rate) multiplied by the gradient. This is then removed from the present point, making sure we shift against the gradient, or down the targeted function.

• x(t+1) = x(t) – step_size * f'(x(t))

The steeper the objective function at a provided point, the bigger the magnitude of the gradient, and in turn, the bigger the step taken within the search space. The size of the step undertaken is scaled leveraging a step size hyperparameter.

Step size (alpha): Hyperparameter that manages how far to shift in the search space against the gradient every iteration of the algorithm.

If the step size is too minimal, the movement within the search space will be minimal, and the search will take a protracted time. If the step size is too big, the search might bounce around the search space and skip over the optima.

Now that we are acquainted with the gradient descent optimization algorithm, let’s observe Nesterov momentum.

Nesterov Momentum

Nesterov Momentum is an extension to the gradient descent optimization algorithm.

The strategy was detailed by (and named for) Yurii Nesterov in his 1983 paper entitled “A Method for Solving the Convex Programming Problem with Convergence Rate O(1/k^2)

Ilya Sutskever, et al. are accountable for making widespread the application of Nesterov Momentum with regards to the training of neural networks with stochastic gradient descent detailed in their 2013 research paper, “On the importance of initialization and momentum in deep learning.” They refer to the approach as “Nesterov’s Accelerated Gradient” or NAG for short.

Nesterov Momentum is not usually viewed as a variant of momentum, it indeed happens to be closely connected to classical momentum, varying only in the precise update of the velocity vector.

Conventional momentum consists of maintenance of an extra variable that indicates the last update performed to the variable, an exponentially decaying shifting average of past gradients.

The momentum algorithm accumulates an exponentially decaying moving average of historical gradients and goes on to shift in their direction.

This final update or final change to the variable is the included to the variable scaled by a “momentum” hyperparameter that manages how much of the final change to include, e.g., 0.9 for 90%.

It is simpler to think about this update in terms of dual steps, for example, calculate the modification in the variable leveraging the partial derivative then calculate the fresh value for the variable.

• change(t+1) = (momentum * change(t)) – (step_size * f’(x(t)))
• w(t+1) = x(t) + change(t+1)

We can perceive momentum in terms of a ball rolling down the hill that will speed up and continue traversing in the same direction even in the existence of small hills.

Momentum can be interpreted as a ball rolling down a nearly horizontal incline. The ball naturally gathers momentum as gravity is the reason behind its acceleration, just like the gradient is the reason behind momentum to accumulate in this descent method.

An issue with momentum is that acceleration can at times be the reason behind the search to overshoot the minima at the bottom of a basin or valley floor.

Nesterov Momentum can be viewed of as a modification to momentum to surpass this issue of overshooting the minima.

It consists of first calculating the projected location of the variable leveraging the modification from the last iteration, and leveraging the derivative of the projected position in the calculation of the new position for the variable.

Calculating the gradient of the projected position functions like a correction factor for the acceleration that has been accumulated.

With Nesterov Momentum the gradient is assessed after the current velocity is applied. Therefore one can interpret Nesterov momentum as making an effort to include a correction factor to the conventional method of momentum.

1] Project the location of the solutions.

2] Caculate the gradient of the projection

3] Calculate the change in the variable leveraging the partial derivative.

4] Update the variable

Let’s observe these individual steps in comprehensive detail.

To start with, the projected position of the complete solution is calculated leveraging the change calculated in the final iteration of the algorithm.

• Projection(t+1) = x(t) + (momentum * change(t))

We can then calculate the gradient for this new position.

Now we can calculate the new position of every variable leveraging the gradient of the projection, to start with by calculating the modification in every variable.

• Change(t+1) = (momentum * change(t)) – (step_size * gradient(t+1))

And lastly, calculating the fresh value for every variable leveraging the calculated change.

x(t+1) = x(t) + change(t+1)

In the domain of convex optimization more generally, Nesterov Momentum is known to enhance the rate of convergence of the optimization algorithm (for example, minimize the number of iterations needed to identify the solution)

Much like momentum, NAG is a first-order optimization strategy with improved convergence rate guarantee than gradient descent in specific situations.

Even though the strategy is efficient in training of neural networks, it might not have the same general effect of accelerating convergence.

Unluckily, in the stochastic gradient scenario, Nesterov momentum does not enhance the rate of convergence.

Now that we are acquainted with the Nesterov momentum algorithm, let’s look into how we might go about implementing it and assess its performance.

We will leverage a simple two-dimensional functions that squares the input of every dimension and define the range of valid inputs from -1.0 to 1.0

The objective() function below goes about implementing this function:

 123 # objective functiondef objective(x, y):return x**2.0 + y**2.0

We can develop a 3D plot of the dataset to obtain a feeling for the curvature of the response surface.

The full instance of plotting the objective function is detailed below.

 123456789101112131415161718192021222324 # 3d plot of the test functionfrom numpy import arangefrom numpy import meshgridfrom matplotlib import pyplot # objective functiondef objective(x, y):return x**2.0 + y**2.0 # define range for inputr_min, r_max = -1.0, 1.0# sample input range uniformly at 0.1 incrementsxaxis = arange(r_min, r_max, 0.1)yaxis = arange(r_min, r_max, 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a surface plot with the jet color schemefigure = pyplot.figure()axis = figure.gca(projection=’3d’)axis.plot_surface(x, y, results, cmap=’jet’)# show the plotpyplot.show()

Running the instance develops a 3D surface plot of the objective function.

We can see the familiar bowl shape possessing the global minima at f(0,0) = 0. We can also develop a 2D plot of the function. This will be beneficial later on when we wish to plot the progress of the search.

The instance below develops a contour plot of the objective function.

 1234567891011121314151617181920212223 # contour plot of the test functionfrom numpy import asarrayfrom numpy import arangefrom numpy import meshgridfrom matplotlib import pyplot # objective functiondef objective(x, y):return x**2.0 + y**2.0 # define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)# show the plotpyplot.show()

Running the instance develops a 2D contour plot of the objective function.

We can observe the classical bowl shape compressed to contours displayed with a color gradient. We will leverage this plot to plot the particular points looked into during the progression of the search. Now that we possess a test objective function, let’s observe how we might implement the Nesterov Momentum optimization algorithm.

Gradient Descent Optimization with Nesterov Momentum

We can go about applying the gradient descent with Nesterov Momentum to the test problem.

To start with, we require a function that calculates the derivatives for this function.

The derivative of x^2 is x * 2 in every dimension and the derivative() function implements this below.

 123 # derivative of objective functiondef derivative(x, y):return asarray([x * 2.0, y * 2.0])

Then, we can implement gradient descent optimization.

To start with, we can choose an arbitrary point in the bounds of the problem as a beginning point for the search.

This assumes we have a plethora that defines the bounds of the search with a single row for every dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

 123 …# generate an initial pointsolution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])

Then, we are required to calculate the projected point from the present position and calculate its derivative.

 12345 …# calculate the projected solutionprojected = [solution[i] + momentum * change[i] for i in range(solution.shape)]# calculate the gradient for the projectiongradient = derivative(projected, projected)

We can then develop the new solution, a single variable at a time.

To start with the, the modification in the variable is calculated leveraging the partial derivative and learning rate with the momentum from the final change in the variable. This modification is recorded for the next iteration of the algorithm. Then the modification is leveraged to calculate the new value for the variable.

 12345678910 …# build a solution one variable at a timenew_solution = list()for i in range(solution.shape):# calculate the changechange[i] = (momentum * change[i]) – step_size * gradient[i]# calculate the new position in this variablevalue = solution[i] + change[i]# store this variablenew_solution.append(value)

This is then performed for every variable for the objective function, then repeated for every iteration of the algorithm.

This new solution can then be assessed leveraging the objective() function and the performance of the search can be reported.

 123456 …# evaluate candidate pointsolution = asarray(new_solution)solution_eval = objective(solution, solution)# report progressprint(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval))

And that’s it.

We can connect all of this together, into a function referred to as nesterov() that takes on the names of the objective function and the derivative function, an array with the bounds of the domain and hyperparameter values for the cumulative number of algorithm iterations, the learning rate, and the momentum, and returns the final solution and its evaluation.

This complete function is detailed below.

 123456789101112131415161718192021222324252627 # gradient descent algorithm with nesterov momentumdef nesterov(objective, derivative, bounds, n_iter, step_size, momentum):# generate an initial pointsolution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])# list of changes made to each variablechange = [0.0 for _ in range(bounds.shape)]# run the gradient descentfor it in range(n_iter):# calculate the projected solutionprojected = [solution[i] + momentum * change[i] for i in range(solution.shape)]# calculate the gradient for the projectiongradient = derivative(projected, projected)# build a solution one variable at a timenew_solution = list()for i in range(solution.shape):# calculate the changechange[i] = (momentum * change[i]) – step_size * gradient[i]# calculate the new position in this variablevalue = solution[i] + change[i]# store this variablenew_solution.append(value)# evaluate candidate pointsolution = asarray(new_solution)solution_eval = objective(solution, solution)# report progressprint(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval))return [solution, solution_eval]

Note: We have intentionally leveraged lists and imperative coding style rather than vectorized operations for readability. Feel free to adapt the implementation to a vectorization implementation with NumPy arrays for enhanced performance.

We can then go about defining our hyperparameters can call the nesterov() function to optimize our test objective function.

In this scenario, we will leverage 30 iterations of the algorithm with a learning rate of 0.1 and momentum of 0.3. These hyperparameter values were identified following a bit of trial and error.

 123456789101112131415 …# seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 30# define the step sizestep_size = 0.1# define momentummomentum = 0.3# perform the gradient descent search with nesterov momentumbest, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)print(‘Done!’)print(‘f(%s) = %f’ % (best, score))

Connecting all of this together, the complete example of gradient descent optimization with Nesterov Momentum is detailed here.

 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556 # gradient descent optimization with nesterov momentum for a two-dimensional test functionfrom math import sqrtfrom numpy import asarrayfrom numpy.random import randfrom numpy.random import seed # objective functiondef objective(x, y):return x**2.0 + y**2.0 # derivative of objective functiondef derivative(x, y):return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with nesterov momentumdef nesterov(objective, derivative, bounds, n_iter, step_size, momentum):# generate an initial pointsolution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])# list of changes made to each variablechange = [0.0 for _ in range(bounds.shape)]# run the gradient descentfor it in range(n_iter):# calculate the projected solutionprojected = [solution[i] + momentum * change[i] for i in range(solution.shape)]# calculate the gradient for the projectiongradient = derivative(projected, projected)# build a solution one variable at a timenew_solution = list()for i in range(solution.shape):# calculate the changechange[i] = (momentum * change[i]) – step_size * gradient[i]# calculate the new position in this variablevalue = solution[i] + change[i]# store this variablenew_solution.append(value)# evaluate candidate pointsolution = asarray(new_solution)solution_eval = objective(solution, solution)# report progressprint(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval))return [solution, solution_eval] # seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 30# define the step sizestep_size = 0.1# define momentummomentum = 0.3# perform the gradient descent search with nesterov momentumbest, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)print(‘Done!’)print(‘f(%s) = %f’ % (best, score))

Running the instance applies the optimization algorithm with Nesterov Momentum to our test issue and reports performance of the search for every iteration of the algorithm.

Your outcomes may demonstrate variance provided the stochastic nature of the algorithm or evaluation procedure, or variations in numerical accuracy. Take up executing the instance a few times and contrast the mean outcomes.

In this scenario, we can observe that a near optimal solution was identified following after probably 15 iterations of the search, with input values  near 0.0 and 0.0 evaluating to 0.0.

 1234567891011121314151617181920212223242526272829303132 >0 f([-0.13276479 0.35251919]) = 0.14190>1 f([-0.09824595 0.2608642 ]) = 0.07770>2 f([-0.07031223 0.18669416]) = 0.03980>3 f([-0.0495457 0.13155452]) = 0.01976>4 f([-0.03465259 0.0920101 ]) = 0.00967>5 f([-0.02414772 0.06411742]) = 0.00469>6 f([-0.01679701 0.04459969]) = 0.00227>7 f([-0.01167344 0.0309955 ]) = 0.00110>8 f([-0.00810909 0.02153139]) = 0.00053>9 f([-0.00563183 0.01495373]) = 0.00026>10 f([-0.00391092 0.01038434]) = 0.00012>11 f([-0.00271572 0.00721082]) = 0.00006>12 f([-0.00188573 0.00500701]) = 0.00003>13 f([-0.00130938 0.0034767 ]) = 0.00001>14 f([-0.00090918 0.00241408]) = 0.00001>15 f([-0.0006313 0.00167624]) = 0.00000>16 f([-0.00043835 0.00116391]) = 0.00000>17 f([-0.00030437 0.00080817]) = 0.00000>18 f([-0.00021134 0.00056116]) = 0.00000>19 f([-0.00014675 0.00038964]) = 0.00000>20 f([-0.00010189 0.00027055]) = 0.00000>21 f([-7.07505806e-05 1.87858067e-04]) = 0.00000>22 f([-4.91260884e-05 1.30440372e-04]) = 0.00000>23 f([-3.41109926e-05 9.05720503e-05]) = 0.00000>24 f([-2.36851711e-05 6.28892431e-05]) = 0.00000>25 f([-1.64459397e-05 4.36675208e-05]) = 0.00000>26 f([-1.14193362e-05 3.03208033e-05]) = 0.00000>27 f([-7.92908415e-06 2.10534304e-05]) = 0.00000>28 f([-5.50560682e-06 1.46185748e-05]) = 0.00000>29 f([-3.82285090e-06 1.01504945e-05]) = 0.00000Done!f([-3.82285090e-06 1.01504945e-05]) = 0.000000

Visualization of Nesterov Momentum

We can go about plotting the progress of the Nesterov Momentum search on a contour plot of the domain.

This can furnish an intuition with regards to the progress of the search over the iterations of the algorithm.

We must go about updating the nesterov() function to maintain a listing of all solutions identified during the search, then return this list at the conclusion of the search.

The updated variant of the function with these modifications is detailed here:

 12345678910111213141516171819202122232425262728293031 # gradient descent algorithm with nesterov momentumdef nesterov(objective, derivative, bounds, n_iter, step_size, momentum):# track all solutionssolutions = list()# generate an initial pointsolution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])# list of changes made to each variablechange = [0.0 for _ in range(bounds.shape)]# run the gradient descentfor it in range(n_iter):# calculate the projected solutionprojected = [solution[i] + momentum * change[i] for i in range(solution.shape)]# calculate the gradient for the projectiongradient = derivative(projected, projected)# build a solution one variable at a timenew_solution = list()for i in range(solution.shape):# calculate the changechange[i] = (momentum * change[i]) – step_size * gradient[i]# calculate the new position in this variablevalue = solution[i] + change[i]# store this variablenew_solution.append(value)# store the new solutionsolution = asarray(new_solution)solutions.append(solution)# evaluate candidate pointsolution_eval = objective(solution, solution)# report progressprint(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval))return solutions

We can then carry out the search as prior, and this time retrieve the listing of solutions rather than the best final solution.

 12345678910111213 …# seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 50# define the step sizestep_size = 0.01# define momentummomentum = 0.8# perform the gradient descent search with nesterov momentumsolutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)

We can then develop a contour plot of the objective function, as prior:

 12345678910 …# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)

Lastly, we can plot every solution identified during the search as a white dot linked by a line.

 1234 …# plot the sample as black circlessolutions = asarray(solutions)pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’)

Connecting all of this together, the complete example of performing the Nesterov Momentum optimization on the test problem and plotting the outcomes on a contour plot is detailed here.

 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576 # example of plotting the nesterov momentum search on a contour plot of the test functionfrom math import sqrtfrom numpy import asarrayfrom numpy import arangefrom numpy.random import randfrom numpy.random import seedfrom numpy import meshgridfrom matplotlib import pyplotfrom mpl_toolkits.mplot3d import Axes3D # objective functiondef objective(x, y):return x**2.0 + y**2.0 # derivative of objective functiondef derivative(x, y):return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with nesterov momentumdef nesterov(objective, derivative, bounds, n_iter, step_size, momentum):# track all solutionssolutions = list()# generate an initial pointsolution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])# list of changes made to each variablechange = [0.0 for _ in range(bounds.shape)]# run the gradient descentfor it in range(n_iter):# calculate the projected solutionprojected = [solution[i] + momentum * change[i] for i in range(solution.shape)]# calculate the gradient for the projectiongradient = derivative(projected, projected)# build a solution one variable at a timenew_solution = list()for i in range(solution.shape):# calculate the changechange[i] = (momentum * change[i]) – step_size * gradient[i]# calculate the new position in this variablevalue = solution[i] + change[i]# store this variablenew_solution.append(value)# store the new solutionsolution = asarray(new_solution)solutions.append(solution)# evaluate candidate pointsolution_eval = objective(solution, solution)# report progressprint(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval))return solutions # seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 50# define the step sizestep_size = 0.01# define momentummomentum = 0.8# perform the gradient descent search with nesterov momentumsolutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)# plot the sample as black circlessolutions = asarray(solutions)pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’)# show the plotpyplot.show()

Running the instance performs the search as prior, except in this scenario, the contour plot of the objective function is developed.

In this scenario, we can observe that a white dot is displayed for every solution identified during the search, beginning above the optima and progressively getting nearer to the optima at the centre of the plot. This portion of the blog furnishes additional resources on the subject if you are looking to delve deeper.

Papers

A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.

On the importance of initialization and momentum in deep learning, 2013

Books

• Algorithms for optimization, 2019
• Deep learning, 2016.

APIs

• numpy.random.rand.API
• numpy.asarray.API
• Matplotlib API

Articles