Open
Description
I'm slightly confused about the final steps described in the docs versus the code below. Should the Nesterov momentum be applied before updating the parameters, i.e. self.wrt -= step1 + step2?
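# Look-ahead: first move the parameters along the momentum step.
step1 = step_m1 * self.momentum
self.wrt -= step1
# The gradient is evaluated at the look-ahead point.
gradient = self.fprime(self.wrt, *args, **kwargs)
self.moving_mean_squared = (
    self.decay * self.moving_mean_squared
    + (1 - self.decay) * gradient ** 2)
# Gradient step scaled by the running RMS of the gradient.
step2 = self.step_rate * gradient
step2 /= sqrt(self.moving_mean_squared + 1e-8)
self.wrt -= step2
step = step1 + step2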
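To make the question concrete, here is a minimal standalone sketch comparing the two-step update from the snippet with a single combined update. It is not the library's code: the function names, the quadratic objective behind `fprime`, and the default hyperparameters are all assumptions made for illustration. Under those assumptions, both variants end at the same parameters as long as the gradient is still evaluated at the look-ahead point `wrt - step1`.

```python
import numpy as np


def fprime(w):
    # Hypothetical objective f(w) = 0.5 * ||w||^2, whose gradient is w.
    return w


def two_step_update(wrt, step_m1, mms, momentum=0.9, decay=0.9, step_rate=0.01):
    # Mirrors the snippet: momentum step, then gradient step.
    wrt = wrt.copy()
    step1 = step_m1 * momentum
    wrt -= step1                          # move to the look-ahead point
    gradient = fprime(wrt)                # gradient at the look-ahead point
    mms = decay * mms + (1 - decay) * gradient ** 2
    step2 = step_rate * gradient / np.sqrt(mms + 1e-8)
    wrt -= step2
    return wrt, step1 + step2, mms


def single_update(wrt, step_m1, mms, momentum=0.9, decay=0.9, step_rate=0.01):
    # Same quantities, but the parameters are changed only once at the end.
    step1 = step_m1 * momentum
    gradient = fprime(wrt - step1)        # still the look-ahead gradient
    mms = decay * mms + (1 - decay) * gradient ** 2
    step2 = step_rate * gradient / np.sqrt(mms + 1e-8)
    return wrt - (step1 + step2), step1 + step2, mms


# Arbitrary illustrative starting values.
w0 = np.array([1.0, -2.0])
prev_step = np.array([0.1, 0.2])
mms0 = np.ones(2)

w_a, step_a, mms_a = two_step_update(w0, prev_step, mms0)
w_b, step_b, mms_b = single_update(w0, prev_step, mms0)
print(np.allclose(w_a, w_b), np.allclose(step_a, step_b))
```

If this reading is right, the order in the snippet matters only for where the gradient is evaluated, not for the final value of the parameters after the iteration.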