Yesterday, I wrote about a gradient noise injection result at ICLR 2016, and noted the authors of the paper, despite detailed experimentation, were very wishy washy in their explanation of why it works. Fortunately, my Twitter friends, particularly Tim Vieira and Shubhendu Trivedi, grounded this much better than the authors themselves!
Shubhendu pointed out Rong Ge (of MSR) and friends tried this in the context of Tensor Decomposition in 2015 (at some point I should write about connection between backprop and matrix factorization). Algorithm 1. in that paper is pretty much the update equation of the recent ICLR paper (modulo actual values of the constants).
Shubhendu also added this goes further back in literature. And indeed it does.
Tim pointed an optimization paper from 1993 where they call gradient noise injection as “Metropolis-Type Annealing Algorithm”. Interestingly, James Spall (of JHU AMS dept.), has an entire section devoted to gradient noise injection in his Stochastic Optimization book.
Both these papers are heavy on analysis. So if you were put off by the hacky result yesterday, geek out to your hearts content :)
That said, yesterday’s question still stands — is there a way to exploit the fact that gradient updates can be noisy to compute it cheaply?