A way to express Nesterov Accelerated Gradient in terms of a regular momentum update was noted by Sutskever and co-workers, and perhaps more importantly, when it came to training neural networks, it seemed to work better than classical momentum schemes. This was further confirmed by Bengio and co-workers, who provided an alternative formulation that might be easier to integrate into existing software.

I implemented NAG as part of mize, but found it was a bit more difficult than I’d anticipated to test that I’d got the correct results.

It seems I am not alone in being confused by the exact details of implementing NAG as a momentum scheme (e.g. comments and issues in the deep learning projects keras and mocha.jl, which I found from random googling), so mainly for the benefit of future me, I am going to derive the Sutskever and Bengio formulations from NAG in tedious detail. I am also going to derive an alternative expression that I ended up using in mize. Finally, I will write some R code to demonstrate their equivalence. This will turn out to be trickier than I anticipated.

Wherever possible I will try and stick with the notation used by Sutskever. Apart from the Sutskever and Bengio papers linked to above, it’s worth checking out the appendix to the Sutskever paper (PDF) or the relevant part (chapter 7) of Sutskever’s thesis. For an introduction to NAG itself, try the first part of this paper by O’Donoghue and Candès (the rest of it is also good but not germane to Nesterov momentum expressions).


The goal is to optimize some parameters, \(\theta\). This is going to involve gradient descent, so we will be evaluating the gradient of an objective function of those parameters, \(\nabla f\left(\theta\right)\), and moving a certain distance in the direction of the negative of the gradient, the distance being related to the learning rate, \(\varepsilon\). There will also be a momentum term involved, with momentum coefficient \(\mu\). The value of the parameters, learning rate and momentum at iteration \(t\) will be indicated by a subscript, e.g. \(\theta_t\).

A definition which holds across all the methods discussed here is that the parameters at iteration \(t+1\) are related to the parameters at iteration \(t\) by an update that involves the addition of a velocity vector, \(v\):

\[\theta_{t+1} = \theta_t + v_{t+1} \implies v_{t+1} = \theta_{t+1} - \theta_t\]

This is pretty obvious, but different methods have different velocity defintions, and it’s easy to get confused.

Finally, I’ll try to be strict about names. When I’m referring to the original Nesterov Accelerated Gradient, I’ll call that NAG. When I’m referring to the versions that recast it as a momentum scheme, I’ll call that Nesterov momentum, and refer to either the Sutskever or Bengio forumulation where necessary. The traditional momentum scheme is referred to as “standard” or “regular” momentum by Bengio, and “classical” by Sutskever. I’ll refer to it as “classical” momentum.

Classical Momentum

The classical momentum velocity vector is defined by Bengio as:

\[v_t = \mu_{t-1}v_{t-1} - \varepsilon_{t-1}\nabla f\left(\theta_{t-1}\right)\]

Sutskever gives the same definition but without any \(t\) subscript on the velocity vector or the momentum coefficient.

We can write out the full update for classical momentum as:

\[\theta_{t+1} = \theta_t + \mu_t v_t - \varepsilon_t \nabla f\left(\theta_t\right)\] with velocity:

\[v_{t+1} = \mu_t v_t - \varepsilon_t \nabla f\left(\theta_t \right)\]


The Nesterov Accelerated Gradient method consists of a gradient descent step, followed by something that looks a lot like a momentum term, but isn’t exactly the same as that found in classical momentum. I’ll call it a “momentum stage” here. It’s important to note that the parameters being minimized by NAG are given the symbol \(y\) by Sutskever, not \(\theta\). You’ll see that \(\theta\) is the symbol for the parameters after they’ve been updated by the gradient descent stage, but before the momentum stage.

Here’s the gradient descent stage: \[\theta_t = y_t - \varepsilon_t \nabla f\left(y_t\right)\] And here’s the momentum stage: \[y_{t+1} = \theta_t + \mu_t\left(\theta_t - \theta_{t-1} \right)\] That concludes one iteration of NAG. The hard part is actually finding the correct learning rate and momentum value in order to get the convergence guarantees that make the method attractive, but that needn’t concern us. Sutskever in his thesis suggests manual tuning to get an optimal result (at least for deep learning applications) so we are in heuristic territory here anyway.

Sutskever Nesterov Momentum

The key idea behind the Sutskever momentum derivation is to shift the perspective about which of the parameters we want as the result of the iteration, from \(y\) to \(\theta\). Rather than having the optimization iterations proceed as “gradient descent, momentum (end of iteration 1), gradient descent, momentum (end of iteration 2), gradient descent etc.” move the boundary of where the iterations end by a half-iteration to get “momentum, gradient descent (end of iteration 1), momentum, gradient descent (end of iteration 2) etc.”. This leaves a phantom gradient descent step that used to be first stage of the first iteration, now floating in the nether regions of iteration zero, but you can just pretend that the starting position is the result of gradient descent from some other arbitrary starting position.

The Sutskever derivation then proceeds as follows. First, rewrite the momentum stage in terms of \(y_t\):

\[y_t = \theta_{t-1} + \mu_{t-1}\left(\theta_{t-1} - \theta_{t-2} \right)\]

Then note that the term in parentheses is just the definition of the velocity vector, so we can write:

\[y_t = \theta_{t-1} + \mu_{t-1}v_{t-1}\]

We now substitute this expression for \(y_t\) into the gradient descent stage:

\[\theta_t = \theta_{t-1} + \mu_{t-1}v_{t-1} - \varepsilon_t \nabla f\left(y_t\right)\] and replace \(y_t\) with an expression in terms of \(\theta_t\):

\[\theta_t = \theta_{t-1} + \mu_{t-1}v_{t-1} - \varepsilon_t \nabla f\left(\theta_{t-1} + \mu_{t-1}v_{t-1}\right)\]

At this point, the expression differs from that given by Sutskever in one detail: the learning rate associated with the gradient descent is currently written as \(\varepsilon_{t}\). But remember that we have now moved the boundaries of the iteration: the gradient descent that would have been the first stage of iteration \(t+1\) is now the second stage of iteration \(t\). So the above expression is more correctly written as:

\[\theta_t = \theta_{t-1} + \mu_{t-1}v_{t-1} - \varepsilon_{t-1} \nabla f\left(\theta_{t-1} + \mu_{t-1}v_{t-1}\right)\]

just as given by Sutskever. And therefore the expression for the parameter update is trivially:

\[\theta_{t+1} = \theta_t + \mu_t v_t - \varepsilon_t \nabla f\left(\theta_t + \mu_t v_t\right)\]

with the velocity vector defined as:

\[v_{t+1} = \mu_t v_t - \varepsilon_t \nabla f\left(\theta_t + \mu_t v_t\right)\]

This looks just like the classical momentum update, except that the gradient is calculated after the momentum update. Hence, one can do NAG by simply reversing the order in which the update is usually carried out: do the momentum stage first, update the parameters, and then do the gradient descent part.

Bengio Nesterov Momentum

The Bengio formulation of Nesterov momentum starts from the Sutskever definition and then defines a new variable, \(\Theta\), which represents \(\theta\) after the momentum update:

\[\Theta_{t-1} \equiv \theta_{t-1} + \mu_{t-1} v_{t-1}\]

A re-arrangement that will come in handy is: \[\theta_{t-1} = \Theta_{t-1} - \mu_{t-1} v_{t-1}\]

Also, we are going to do something a bit tricky, and that is to keep on using the velocity vector definition from the Sutskever formulation, but with \(\Theta\) substituted in:

\[v_t = \mu_{t-1} v_{t-1} - \varepsilon_{t-1} \nabla f\left(\Theta_{t-1}\right)\] The reason I call this tricky is that the velocity vector still refers to the \(\theta\) update, but we are going to be writing the update in terms of \(\Theta\). You’ll see what I mean.

Anyway, let’s take the Sutskever Nesterov momentum update and substitute in the expression above for \(\theta\) in terms of \(\Theta\):

\[\Theta_{t+1} - \mu_{t+1} v_{t+1} = \Theta_t - \mu_t v_t + \mu_t v_t - \varepsilon_t \nabla f\left(\Theta_t\right)\]

Noting that those two \(\mu_t v_t\) terms cancel out and then rearranging to leave \(\Theta_{t+1}\) on the LHS, we get:

\[\Theta_{t+1} = \Theta_t + \mu_{t+1} v_{t+1} - \varepsilon_t \nabla f\left(\Theta_t\right)\]

We can now substitute in the Sutskever velocity expression, \(v_{t+1}\):

\[ \Theta_{t+1} = \Theta_t + \mu_{t+1}\left[\mu_t v_t - \varepsilon_t \nabla f\left(\Theta_t\right)\right] - \varepsilon_t \nabla f\left(\Theta_t\right) \]

Expanding out the parentheses gives:

\[ \Theta_{t+1} = \Theta_t + \mu_{t+1} \mu_t v_t - \mu_{t+1} \varepsilon_t \nabla f\left(\Theta_t\right) - \varepsilon_t \nabla f\left(\Theta_t\right) \]

and, after grouping the gradient descent parts, finally gives:

\[ \Theta_{t+1} = \Theta_t + \mu_{t+1} \mu_t v_t - \left(1 + \mu_{t+1} \right) \varepsilon_t \nabla f\left(\Theta_t\right) \]

At this point in the other derivations, I isolated the bits on the RHS that wasn’t the old parameter and defined them as the velocity vector. Well, we can’t do that here, as we’re already using the Sutskever definition of the velocity vector which is \(\theta_t - \theta_{t-1}\) and not \(\Theta_t - \Theta_{t-1}\). I said it was a bit tricky.

The advantage of this expression for the Nesterov momentum is that it doesn’t require calculating a gradient at a non-standard position, and only requires a modification to the coefficients used to calculate the velocity, which is probably an easier change to make to an existing codebase which already uses classical momentum.

Before going any further, it’s worth going back to the original expression for the gradient descent part of NAG that Sutskever gives in his thesis (and appendix to the paper):

\[ y_{t+1} = \theta_t + \mu_t\left(\theta_t - \theta_{t-1} \right) = \theta_t + \mu_t v_t \] Let’s just compare that to the Bengio definition of \(\Theta\):

\[ \Theta_{t} \equiv \theta_{t} + \mu_{t} v_{t} \] Yep, that means:

\[ \Theta_{t} \equiv y_{t+1} \]

So the Bengio formulation is really just another way to write the standard NAG update.

An Alternative Expression for NAG

Let’s go back to the original formulation of NAG and now instead of making \(\theta\) the variables after gradient descent, we’ll put them where \(y\) was originally used. The variables after gradient descent I’ll refer to as \(\phi\).

\[\phi_t = \theta_t - \varepsilon_t \nabla f\left(\theta_t \right)\] \[\theta_{t+1} = \phi_t + \mu_t\left(\phi_t - \phi_{t-1} \right)\]

Now, let’s just write out the momentum stage in terms of \(\theta\), substituting \(\phi\) wherever we find it:

\[ \theta_{t+1} = \theta_t - \varepsilon_t \nabla f\left(\theta_t \right) + \mu_t \left[\theta_t - \varepsilon_t \nabla f\left(\theta_t \right) - \theta_{t-1} + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) \right] \]


\[ \theta_{t+1} = \theta_t + \mu_t \left[\theta_t - \theta_{t-1} + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) - \varepsilon_t \nabla f\left(\theta_t \right) \right] - \varepsilon_t \nabla f\left(\theta_t \right) \]

Finally, we can subtitute in \(v_t\) for the first two terms in the square brackets, to give:

\[ \theta_{t+1} = \theta_t + \mu_t \left[v_t + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) - \varepsilon_t \nabla f\left(\theta_t \right) \right] - \varepsilon_t \nabla f\left(\theta_t \right) \]

with velocity:

\[ v_{t+1} = \mu_t \left[v_t + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) - \varepsilon_t \nabla f\left(\theta_t \right) \right] - \varepsilon_t \nabla f\left(\theta_t \right) \]

This looks a lot like the classical momentum expression, but with the velocity vector modified to first remove the contribution of the gradient descent from the previous iteration, and replace it with the gradient descent contribution from the current iteration. Gives an interesting insight into the idea of the Nesterov momentum using a form of “lookahead” with the gradient descent.

You could also choose to expand the velocity expression to make it look a bit like the Bengio formulation:

\[ v_{t+1} = \mu_t \left[v_t + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) \right] - \left(1 + \mu_t\right) \varepsilon_t \nabla f\left(\theta_t \right) \]

but as this version can’t be expressed as the classical momentum form with different coefficients, the way the Bengio formulation can be, it probably doesn’t gain you anything in terms of implementation, except you can expand it and rearrange it further to give:

\[ v_{t+1} = \mu_t v_t - \varepsilon_{t} \nabla f\left(\theta_{t} \right) + \mu_t \left[ \varepsilon_{t-1} \nabla f \left(\theta_{t-1}\right) - \varepsilon_{t} \nabla f \left(\theta_{t}\right) \right] \] which now resembles the classical momentum expression with an extra momentum term. I haven’t found a definitive reference for this expression, or what, if any, extra insight it provides, but user ‘denis’ uses this expression in an answer on the Cross Validated Stack Exchange, and refers to the third form as “gradient momentum”. Another way to look at this is to say that the third term reduces the contribution of the classical momentum component of the update and increases the contribution of the gradient descent, with the degree of weighting controlled by \(\mu_{t}\).

Quasi-hyperbolic momentum uses a similar formulation, but introduces an extra parameter that allows QHM to generalize over classical momentum and NAG, subject to the restriction of a fixed learning rate. The Stack Exchange answer given above also suggests that the “gradient momentum” term can be set independently of \(\mu\) and Denis Yarats is one of the authors of the QHM paper, so perhaps we see the origins of QHM in this answer.

The setup and presentation of the update equations for QHM in that paper make it a bit hard to see the connection between it and the form of NAG given here. The supplemental material in Understanding the Role of Momentum in Stochastic Gradient Methods might be of interest, or some extra notes on QHM I wrote for my own benefit.

Similarly to QHM, we can write a related “generalized” momentum update by introducing a parameter \(\beta\) to give:

\[ v_{t+1} = \mu_t v_t - \varepsilon_{t} \nabla f\left(\theta_{t} \right) + \beta_t \mu_t \left[ \varepsilon_{t-1} \nabla f \left(\theta_{t-1}\right) - \varepsilon_{t} \nabla f \left(\theta_{t}\right) \right] \] where setting \(\beta_t = 0\) will give classical momentum and \(\beta_t = 1\) gives Nesterov.

From the look of the expression, it seems that a major downside of this expression is that calculating the parameters for iteration \(t+1\) requires storage of information from not only iteration \(t\) but from iteration \(t-1\) too. But you don’t necessarily have to do any extra storage with an implementation that used this version of NAG. At the end of an iteration, when saving the velocity vector for the next iteration, you change:

\[v_{t-1} \leftarrow v_t\] to: \[v_{t-1} \leftarrow v_t + \varepsilon_{t} \nabla f\left(\theta_t \right)\]

and then when calculating the momentum term, change:

\[\mu_t v_t\] to: \[\mu_t \left[v_t - \varepsilon_{t} \nabla f\left(\theta_t \right)\right]\]

Unrolling NAG and Classical Momentum

As a way to understand the difference between classical momentum and NAG, let’s write out the first few steps of the optimization, substituting in the recursive definitions with the previous step, and see what emerges.

Because I have finally got tired of writing out \(\varepsilon_t \nabla f\left(\theta_t \right)\) all over the place (and it’s visually distracting), let’s define:

\[ -\varepsilon_t \nabla f\left(\theta_t \right) \equiv s_t \]

to represent the steepest descent direction. Note that this includes the negative sign. Also, I am going to assume that the momentum is constant, to save on some more notational clutter.

Classical momentum is therefore represented as:

\[ \theta_{t+1} = \theta_t + s_t + \mu v_t \]

And we’ll use this straight-forward expression for NAG:

\[ \theta_{t+1} = \theta_t + s_t + \mu \left(\theta_t + s_t - \theta_{t-1} - s_{t-1} \right) \]

with just this tiny bit of re-arranging so it most closely resembles the classical momentum expression:

\[ \theta_{t+1} = \theta_t + s_t + \mu v_t + \mu \left(s_t - s_{t-1} \right) \]

Step 1: Gradient Descent

A consequence of my nomenclature choice is that the initial coordinates are at \(\theta_1\), not \(\theta_0\). Sorry if you find that annoying. In both cases, the first step is always just gradient descent:

\[ \theta_2 = \theta_1 + s_1 \]

The next three steps then proceed as follows:

Classical Momentum Steps 2-4

\[ \theta_3 = \theta_2 + s_2 + \mu v_2 \\ \theta_3 = \theta_2 + s_2 + \mu s_1 \]

\[ \theta_4 = \theta_3 + s_3 + \mu v_3 \\ \theta_4 = \theta_3 + s_3 + \mu \left(s_2 + \mu s_1 \right) \\ \theta_4 = \theta_3 + s_3 + \mu s_2 + \mu^2 s_1 \]

\[ \theta_5 = \theta_4 + s_4 + \mu v_4 \\ \theta_5 = \theta_4 + s_4 + \mu \left( s_3 + \mu s_2 + \mu^2 s_1 \right)\\ \theta_5 = \theta_4 + s_4 + \mu s_3 + \mu^2 s_2 + \mu^3 s_1 \]

NAG Steps 2-4

\[ \theta_3 = \theta_2 + s_2 + \mu v_2 + \mu \left(s_2 - s_1 \right) \\ \theta_3 = \theta_2 + s_2 + \mu s_1 + \mu s_2 - \mu s_1 \\ \theta_3 = \theta_2 + \left(1 + \mu \right) s_2 \]

\[ \theta_4 = \theta_3 + s_3 + \mu v_3 + \mu \left(s_3 - s_2 \right) \\ \theta_4 = \theta_3 + s_3 + \mu \left[ \left(1 + \mu \right) s_2 \right] + \mu s_3 - \mu s_2 \\ \theta_4 = \theta_3 + \left(1 + \mu \right) s_3 + \mu^2 s_2 \] \[ \theta_5 = \theta_4 + s_4 + \mu v_4 + \mu \left(s_4 - s_3 \right) \\ \theta_5 = \theta_4 + s_4 + \mu \left[ \left(1 + \mu \right) s_3 + \mu^2 s_2 \right] + \mu s_4 - \mu s_3 \\ \theta_5 = \theta_4 + \left(1 + \mu \right) s_4 + \mu^2 s_3 + \mu^3 s_2 \]

You can see a pattern emerging here at this point, I hope. These are quite similar forms, the difference being in how the momentum affects the weighting coefficients.

Relative Weights

Here’s a table showing how on the 4th step, the two methods weight the current and previous gradients in deciding on the next direction:

Momentum Type \(s_4\) \(s_3\) \(s_2\) \(s_1\)
Classical 1 \(\mu\) \(\mu^2\) \(\mu^3\)
NAG 1 + \(\mu\) \(\mu^2\) \(\mu^3\) 0

From this it’s clear that the difference between NAG and classical momentum is that NAG puts more weight on recent gradients: in fact, it gives zero weight to the first gradient descent direction after the first iteration, so the second step also be a pure gradient descent step, but with an extra big step size of \(\left(1 + \mu \right) \varepsilon\). Another way of looking at it is to say that NAG forgets old gradients more quickly.

Let’s write some R code to generate a table to compare how the relative weights turn out between CM and NAG at different values of \(\mu\). We’ll normalize the contributions so they sum to 1.

cm_weights <- function(mu) { c(1, mu, mu * mu, mu * mu * mu) }

nag_weights <- function(mu) { c(1 + mu, mu * mu, mu * mu * mu, 0) }

mumat <- function(wfun, mus = c(0.1, 0.25, 0.5, 0.75, 0.9, 0.99)) {
  rel_weights <- matrix(nrow = length(mus), ncol = 5)
  for (i in 1:length(mus)) {
    mu <- mus[i]
    weights <- wfun(mu)
    rel_weights[i, ] <- c(mu, weights / sum(weights))
  colnames(rel_weights) <- c("mu", "s4", "s3", "s2", "s1")


Classical momentum weights

knitr::kable(mumat(cm_weights), digits = 4)
mu s4 s3 s2 s1
0.10 0.9001 0.0900 0.0090 0.0009
0.25 0.7529 0.1882 0.0471 0.0118
0.50 0.5333 0.2667 0.1333 0.0667
0.75 0.3657 0.2743 0.2057 0.1543
0.90 0.2908 0.2617 0.2355 0.2120
0.99 0.2538 0.2512 0.2487 0.2462

NAG weights

knitr::kable(mumat(nag_weights), digits = 4)
mu s4 s3 s2 s1
0.10 0.9901 0.0090 0.0009 0
0.25 0.9412 0.0471 0.0118 0
0.50 0.8000 0.1333 0.0667 0
0.75 0.6400 0.2057 0.1543 0
0.90 0.5525 0.2355 0.2120 0
0.99 0.5050 0.2487 0.2462 0

At high momentum, classical momentum is only placing around 25-30% of the weight on the current gradient, whereas for NAG, it’s still at 50-55%.

NAG in practice

So that’s how it’s all supposed to work in principle. Here I’ll demonstrate the equivalence of these methods, by implementing them all in simple R code.

The biggest simplification I’ll make is that I’ll assume a constant learning rate and a constant momentum coefficient.

Classical Momentum

#' Optimization by Classical Momentum
#' @param par Starting point of vector of parameters to optimize.
#' @param fn Objective function to optimize. Takes vector with length of
#' \code{par} and returns a scalar.
#' @param gr Gradient of the objective function \code{fn}. Takes vector with
#' length of \code{par} and returns a vector of the same length.
#' @param lr Learning rate.
#' @param mu Momentum coefficient.
#' @param max_iter Maximum number of iterations to optimize for. First iteration
#' is always steepest descent.
#' @return list with components: \code{par} final set of parameters; \code{f}
#' value of \code{fn} evaluated at the returned set of parameters; \code{fs}
#' vector of function evaluated after each iteration.
cm <- function(par, fn, gr, lr, mu, max_iter = 10) {
  fs <- rep(0, max_iter)

  v <- rep(0, length(par))
  for (i in 1:max_iter) {
    g <- gr(par)
    v <- mu * v - lr * g
    par <- par + v
    # store results
    f <- fn(par)
    fs[i] <- f

  list(par = par, f = f, fs = fs)

This is a reference implementation of classical momentum. Nearly all the parameters and the return values are the same for the other functions (except where noted), so they’re documented here once.

Onto the various NAG implementations. They all have an extra bit of code in them to ensure that on the first iteration only the gradient descent stage is carried out. This is needed to keep the outputs consistent between the different implementations. The cm routine doesn’t require any extra checks for steepest descent on the first iteration, because initializing the velocity vector to zero results in steepest descent anyway even if the momentum coefficient is non-zero. As we’ll see below, some of the other implementations either don’t explicitly use a velocity vector or need the momentum coefficient to be set to zero on the first iteration to get the same result.


# Optimization by Nesterov Accelerated Gradient
# Return list also contains gd_fs: function evaluated after gradient descent
# stage of each iteration; all: function evaluated after gradient descent
# stage and momentum stage, in order.
nag <- function(par, fn, gr, lr, mu, max_iter = 10) {
  fs <- rep(0, max_iter)
  gd_fs <- rep(0, max_iter)
  all <- rep(0, max_iter * 2)

  x_old <- rep(0, length(par))
  for (i in 1:max_iter) {
    # gradient descent stage
    g <- gr(par)
    x <- par - (lr * g)

    # store gradient descent values
    f <- fn(x)
    gd_fs[i] <- f
    all[i * 2 - 1] <- f

    # momentum stage and update
    par <- x + ifelse(i == 1, 0, mu) * (x - x_old)
    x_old <- x

    # store momentum values
    f <- fn(par)
    fs[i] <- f
    all[i * 2] <- f

  list(par = par, f = f, fs = fs, gd_fs = gd_fs, all = all)

In this routine, rather than store the velocity vector, we store the previous gradient descent result, x_old. In order to ensure the first iteration is gradient descent only, we also need to manually set the momentum coefficient mu to zero on the first iteration, which is what that ifelse expression does.

Also, there’s some extra code to calculate and store the function values after the gradient descent stage. These aren’t needed for the optimization to work, I just want to keep track of the values to demonstrate the different methods are in fact equivalent.

Sutskever Formulation

# Optimization by Sutskever Nesterov Momentum
# Return list also contains mu_fs: function evaluated after momentum
# stage of each iteration; all: function evaluated after gradient descent
# stage and momentum stage, in order.
snag <- function(par, fn, gr, lr, mu, max_iter = 10) {
  v <- rep(0, length(par))

  fs <- rep(0, max_iter)
  mu_fs <- rep(0, max_iter)
  all <- rep(0, max_iter * 2)

  for (i in 1:max_iter) {
    # momentum stage and update parameters
    mu_step <- mu * v
    par <- par + mu_step

    # store momentum results
    f <- fn(par)
    mu_fs[i] <- f
    all[i * 2 - 1] <- f

    # gradient descent stage
    g <- gr(par)
    gd_step <- -lr * g

    # update and store velocity for next step
    par <- par + gd_step
    v <- mu_step + gd_step

    # store gradient descent results
    f <- fn(par)
    fs[i] <- f
    all[i * 2] <- f

  list(par = par, f = f, fs = fs, mu_fs = mu_fs, all = all)

Bengio Formulation

# Optimization by Bengio Nesterov Momentum
bnag <- function(par, fn, gr, lr, mu, max_iter = 10) {
  fs <- rep(0, max_iter)

  v <- rep(0, length(par))
  for (i in 1:max_iter) {
    g <- gr(par)
    par <- par + mu * mu * v - (1 + mu) * lr * g
    v <- mu * v - lr * g

    # store results
    f <- fn(par)
    fs[i] <- f

  list(par = par, f = f, fs = fs)

Because this implementation only uses a constant momentum coefficient, we replace the \(\mu_{t+1}\mu_{t}\) term with \(\mu_{t}^2\). You would probably have to using a pretty wacky non-constant momentum schedule where this approximation had any major effect anyway… except when the momentum is changing from zero to non-zero, that is, in which case that’s the difference between some momentum and no momentum. We shall return to this point later.

Alternative momentum NAG Expression

# Optimization by Nesterov Accelerated Gradient, using a momentum-style
# update expression.
mnag <- function(par, fn, gr, lr, mu, max_iter = 10) {
  fs <- rep(0, max_iter)

  v <- rep(0, length(par))
  for (i in 1:max_iter) {
    g <- gr(par)
    v <- mu * (v - lr * g) - lr * g
    par <- par + v
    # setup v for the next iteration by removing the old gradient contribution
    v <- v + lr * g
    # store results
    f <- fn(par)
    fs[i] <- f

  list(par = par, f = f, fs = fs)

Finally, here is your humble author’s expression for NAG, written as a momentum-style update. To differentiate from the Sutskever and Bengio versions of NAG, I’ll refer to it as momentum-NAG or mNAG.

Testing with Rosenbrock

Tradition dictates that I must demonstrate the use of these optimizers using the 2D Rosenbrock function, with a specific starting point:

par <- c(-1.2, 1)

fn <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  100 * (x2 - x1 * x1) ^ 2 + (1 - x1) ^ 2
gr <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
    -400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1),
    200 *      (x2 - x1 * x1))

Let’s run the optimizers for 100 iterations. The point here is not whether we get really amazing optimization (we don’t), but whether the outputs of the Sutskever and Bengio algorithms are equivalent. It would be a bonus if my mNAG result also worked. For comparison we’ll throw in the classical momentum and as a sanity check, the vanilla NAG routine.

lr <- 0.001
mu <- 0.95
max_iter <- 100
snag_opt <- snag(par, fn, gr, lr, mu, max_iter)
bnag_opt <- bnag(par, fn, gr, lr, mu, max_iter)
nag_opt <- nag(par, fn, gr, lr, mu, max_iter)
mnag_opt <- mnag(par, fn, gr, lr, mu, max_iter)
cm_opt <- cm(par, fn, gr, lr, mu, max_iter)

sbnag_df <- data.frame(Bengio = bnag_opt$fs,
                 Sutskever = snag_opt$fs, 
                 "Suts Mom" = snag_opt$mu_fs,
                 "NAG" = nag_opt$fs,
                 "mNAG" = mnag_opt$fs,
                 "CM" = cm_opt$fs)

Let’s have a look at the first few iteration results:

knitr::kable(head(sbnag_df), caption = paste(
  "First few evaluations of NAG implementations, with lr = ", 
  formatC(lr), "mu = ", formatC(mu),
  collapse = " "))
First few evaluations of NAG implementations, with lr = 0.001 mu = 0.95
Bengio Sutskever Suts.Mom NAG