Make your Stochastic Gradient Descent more Stochastic


Results in Deep Learning never cease to surprise me. An ICLR 2016 paper from the Google Brain team suggests a simple one-line code change that improves your parameter estimation across the board: add Gaussian noise to the computed gradients. A typical SGD update changes the parameters by taking a step in the direction of the gradient (simplified):

$$\mathbf{\Theta}_{t+1} \leftarrow \mathbf{\Theta}_{t} + \alpha_{t}\nabla\mathbf{\Theta}$$

Instead, the suggestion is to add a small amount of random noise to the update:

$$\mathbf{\Theta}_{t+1} \leftarrow \mathbf{\Theta}_{t} + \alpha_{t}\left(\nabla\mathbf{\Theta} + N(0, \sigma_t^2)\right)$$
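To make the "one-line change" concrete, here is a minimal NumPy sketch of a single update. The function name `noisy_sgd_step` and its arguments are hypothetical, chosen only for illustration; this is not the paper's code. The sketch is written as a descent step on a loss, so it subtracts the noisy gradient, assuming the simplified formula above folds that sign into the gradient term.

```python
import numpy as np

def noisy_sgd_step(theta, grad, lr, sigma, rng):
    """One descent step with zero-mean Gaussian noise added to the gradient.

    `sigma` is the noise standard deviation (the square root of sigma_t^2).
    """
    # The extra line: draw one independent noise value per parameter.
    noise = rng.normal(loc=0.0, scale=sigma, size=grad.shape)
    return theta - lr * (grad + noise)

# Usage on a toy loss ||theta||^2, whose gradient is 2 * theta.
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])
grad = 2.0 * theta
theta_next = noisy_sgd_step(theta, grad, lr=0.1, sigma=0.1, rng=rng)
print(theta_next)
```

Setting `sigma=0.0` recovers the plain update, which makes it easy to switch the noise on and off when comparing runs.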

Further, $\sigma_t$ is prescribed to be:

$$\sigma_t^2 = \frac{\eta}{(1 + t)^{0.55}}$$

and $\eta$ is one of $\{0.01, 0.3, 1.0\}$!
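The schedule is simple enough to write down directly. The sketch below is only an illustration of the formula; the function name `noise_variance` and the default `eta=0.3` are my own choices, with 0.3 picked arbitrarily from the three values listed.

```python
def noise_variance(t, eta=0.3, gamma=0.55):
    """Variance sigma_t^2 of the gradient noise at training step t (t = 0, 1, 2, ...)."""
    return eta / (1.0 + t) ** gamma

# Print the variance at a few steps to see how it shrinks as t grows.
for t in (0, 10, 100, 1000, 10000):
    print(t, noise_variance(t))
```

At `t = 0` the variance is just `eta`, and it only decreases from there. Note that the formula gives a variance, so a sampler that expects a standard deviation needs its square root.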


Stop. Stare at that for a while. Enjoy this magic. As is usual with such things, the authors give no theoretical justification beyond showing that it works on a variety of networks (kudos for that) and a hand-wavy connection to simulated annealing. But examining the expression for $\sigma_t$ should tell you that the additive noise is highest at the beginning and has little to no effect during the later stages of training. Like other methods for careful initialization, this should be effective in breaking symmetries and getting the training started on the right foot.
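The symmetry-breaking intuition can be poked at with a toy experiment. The sketch below is my own illustration, not something from the paper: a two-hidden-unit tanh network is deliberately started with both units identical (a bad initialization), and trained with full-batch gradients plus the annealed noise. The network, the data, the target, and every hyperparameter are assumptions made up for the example.

```python
import numpy as np

def train_two_unit_net(eta, steps=100, lr=0.1, gamma=0.55, seed=0):
    """Train y = w2 . tanh(W1 x) from a symmetric start; return the hidden weights W1."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(64, 3))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1]          # arbitrary toy target
    W1 = np.full((2, 3), 0.5)                    # both hidden units start identical
    w2 = np.full(2, 0.5)
    for t in range(steps):
        h = np.tanh(X @ W1.T)                    # hidden activations, shape (64, 2)
        err = h @ w2 - y                         # residuals of the 0.5 * mean-squared-error loss
        g_w2 = h.T @ err / len(y)                # gradient w.r.t. output weights
        g_h = np.outer(err, w2) * (1.0 - h**2)   # backprop through tanh
        g_W1 = g_h.T @ X / len(y)                # gradient w.r.t. hidden weights
        sigma = np.sqrt(eta / (1.0 + t) ** gamma)  # annealed noise std; zero when eta == 0
        g_w2 = g_w2 + rng.normal(0.0, sigma, size=g_w2.shape)
        g_W1 = g_W1 + rng.normal(0.0, sigma, size=g_W1.shape)
        w2 -= lr * g_w2
        W1 -= lr * g_W1
    return W1

plain = train_two_unit_net(eta=0.0)    # exact gradients
noisy = train_two_unit_net(eta=0.01)   # gradients with annealed Gaussian noise
print("plain: hidden units identical?", np.allclose(plain[0], plain[1]))
print("noisy: hidden units identical?", np.allclose(noisy[0], noisy[1]))
```

With exact gradients the two hidden units receive the same update at every step, so they should remain copies of each other. Independent noise on each weight gives them a way to drift apart. Whether that drift actually helps the fit is a separate question this toy does not answer.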

This is not entirely weird if you think more about it. Any dataset, after all, has a sampling bias. So, in a way, even exact gradients are exact only with respect to this sample, and the empirical gradients are only an approximation of the gradient of the manifold of the underlying physical process. So the real questions are: 1) Why bother computing exact gradients? 2) Are there computationally inexpensive/sloppy approaches to computing approximate gradients that will make the training process faster? (Remember, sampling from the Gaussian will take additional time.)

Bonus: This post has an update.


Copyright © 2021. Delip Rao