Make your Stochastic Gradient Descent more Stochastic

Results in Deep Learning never cease to surprise me. One ICLR 2016 paper from the Google Brain team suggests a simple one-line code change that improves parameter estimation across the board: add Gaussian noise to the computed gradients. Typical SGD updates the parameters by taking a step against the gradient of the loss (simplified):
\mathbf{\Theta}_{t+1} \leftarrow \mathbf{\Theta}_{t} - \alpha_{t}\nabla\mathbf{\Theta}

Instead, the suggestion is to add a small amount of zero-mean Gaussian noise to the gradient before taking the update:

\mathbf{\Theta}_{t+1} \leftarrow \mathbf{\Theta}_{t} - \alpha_{t}(\nabla\mathbf{\Theta} + N(0, \sigma_t^2))

Further, the noise variance \sigma_t^2 is prescribed to decay as:

\sigma_t^2 = \frac{\eta}{(1 + t)^{0.55}}

and \eta is one of \{0.01, 0.3, 1.0\}!

Stop. Stare at that for a while. Enjoy the magic. As with such things, the authors give no theoretical justification beyond showing that it works on a variety of networks (kudos for that) and a hand-wavy connection to simulated annealing. But examining the expression for \sigma_t should tell you that the additive noise is highest at the beginning of training and has little to no effect during the later stages (for \eta = 0.3, \sigma_t drops from about 0.55 at t = 0 to about 0.04 by t = 10000). Like other methods for careful initialization, this should be effective in breaking symmetries and getting training started on the right foot.
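
For concreteness, here is a minimal sketch of the recipe in NumPy. The function noisy_sgd_step and the compute_gradient placeholder are my own illustration, not code from the paper; the sketch just adds zero-mean Gaussian noise with the decaying variance above to whatever gradient your model produces, then takes the usual SGD step.

import numpy as np

def noisy_sgd_step(theta, grad, t, alpha, eta=0.3):
    # eta defaults to 0.3, one of the three values {0.01, 0.3, 1.0} from the paper.
    # sigma_t^2 = eta / (1 + t)^0.55: largest at t = 0, decaying toward zero.
    sigma = np.sqrt(eta / (1.0 + t) ** 0.55)
    # Add zero-mean Gaussian noise to the gradient, then take the usual SGD step.
    noisy_grad = grad + np.random.normal(0.0, sigma, size=grad.shape)
    return theta - alpha * noisy_grad

# Hypothetical training loop: compute_gradient stands in for your model's backprop.
theta = np.zeros(10)
for t in range(10000):
    grad = compute_gradient(theta)
    theta = noisy_sgd_step(theta, grad, t, alpha=0.01)

The only extra work is one Gaussian draw per parameter per step, which is the overhead mentioned below.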

This is not entirely weird if you think about it. Any dataset, after all, has a sampling bias. So, in a way, even exact gradients are exact only with respect to that sample, and the empirical gradients are only an approximation of the gradient on the manifold of the underlying physical process. So the real questions are: 1) Why bother computing exact gradients at all? 2) Are there computationally inexpensive, sloppy approaches to computing approximate gradients that would make training faster? (Remember that sampling from the Gaussian takes additional time.)

Bonus: This post has an update.

3 Responses to Make your Stochastic Gradient Descent more Stochastic

  1. Tim Vieira June 2, 2016 at 3:02 pm #

    Some theory for injected-noise sgd:
    http://web.mit.edu/6.435/www/Gelfand93.pdf

    There is a survey of injected-noise sgd in section 8.4 of James Spall’s book
    https://books.google.com/books/about/Introduction_to_Stochastic_Search_and_Op.html?id=f66OIvvkKnAC&source=kp_cover

    • Delip Rao June 2, 2016 at 4:47 pm #

@timvieira, that’s a great find! I missed out on taking James’s stochastic optimization course in the AMS department.

  2. Jae Duk Seo January 15, 2018 at 11:34 pm #

    Very interesting read, thank you for this!
