Synthetic Gradients: Cool or Meh?
Synthetic what now? DeepMind recently published a paper on Synthetic Gradients. This post is about that: what they are, and whether it makes sense for your average Deep Joe to use them.
A Computational Graph is the best data structure to represent deep networks. (D)NN training and inference algorithms are examples of data flow algorithms, and the biggest bane of anyone implementing DNNs is the time it takes to train those damn things. Largely, this is because of the backprop algorithm: you pass each data point (in the purely stochastic case) all the way through the network and propagate the error back. You can speed things up a bit with minibatches, the parallelism inherent to the computational graph in complex models, the right optimizer, well-tuned learning rates and learning rate schedules, and multiple GPUs. Still, you are stuck with a fundamental speedup limitation: the forward-backward pass and the weight updates happen in lockstep. Until the final loss is computed, the update for a parameter in a layer simply cannot be computed, because it depends on the gradient of that loss.
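To make the lockstep concrete, here is a minimal sketch of plain backprop on a toy two-layer network with scalar weights. The network, the squared-error loss, the learning rate, and all names (`forward`, `backprop_step`) are my own assumptions for illustration, not anything from the paper. The point to notice is that layer 1's update needs `dL_dh1`, which only exists after the full forward pass and the loss.

```python
# Toy two-layer network with scalar weights: h1 = w1 * x, y = w2 * h1.
# Assumed loss: L = 0.5 * (y - target) ** 2.

def forward(w1, w2, x):
    h1 = w1 * x   # layer 1 activation
    y = w2 * h1   # layer 2 output
    return h1, y

def backprop_step(w1, w2, x, target, lr=0.1):
    h1, y = forward(w1, w2, x)
    dL_dy = y - target    # only known after the full forward pass
    dL_dw2 = dL_dy * h1   # gradient for layer 2
    dL_dh1 = dL_dy * w2   # error signal sent back to layer 1
    dL_dw1 = dL_dh1 * x   # layer 1 had to wait for dL_dh1
    return w1 - lr * dL_dw1, w2 - lr * dL_dw2

w1, w2 = 0.5, 0.5
for _ in range(50):
    w1, w2 = backprop_step(w1, w2, x=1.0, target=2.0)

final_output = forward(w1, w2, 1.0)[1]
print(final_output)  # approaches the target of 2.0
```

Nothing here can be reordered: `dL_dw1` is a function of `dL_dy`, so layer 1 sits idle until the output layer has finished.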
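You knew that. The key insight in the paper is that the gradient $\frac{\partial L}{\partial \theta_i}$ of the loss $L$ with respect to the parameters $\theta_i$ of layer $i$ doesn't have to be a monolithic expression, but factorizes nicely via the chain rule as $\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial h_i} \frac{\partial h_i}{\partial \theta_i}$, where $h_i$ is the output of layer $i$, and that each of these factors can be computed separately:
1. The factor $\frac{\partial h_i}{\partial \theta_i}$ only depends on information local to layer $i$. So, that's available instantaneously.
2. That leaves the factor $\frac{\partial L}{\partial h_i}$ to be approximated with an estimate $\hat{\delta}_i$. So, now your "instantaneous" update equation looks like this: $\theta_i \leftarrow \theta_i - \alpha \, \hat{\delta}_i \frac{\partial h_i}{\partial \theta_i}$, where $\alpha$ is the learning rate.
The estimator can be anything, but the most obvious thing is to train another mini neural network $M_{i+1}$, so that $\hat{\delta}_i = M_{i+1}(h_i)$. [caption] Image courtesy: Jaderberg et al. (2016). A Decoupled Neural Interface. [/caption]
The mini neural network $M_{i+1}$ has its own parameters, trained in a supervised setting with the real loss $L$.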
Big fleas have little fleas, Upon their backs to bite 'em, And little fleas have lesser fleas, and so, ad infinitum.
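Back to the update rule. Here is a rough sketch of what an "instantaneous" layer update could look like in code: the layer multiplies its local factor by a predicted gradient instead of waiting for the real one. Everything here is an assumption for illustration: the scalar layer `h = w * x`, the linear estimator with parameters `a` and `b` (standing in for the mini network with zero hidden layers), and the function names.

```python
def synthetic_gradient(h, a, b):
    # Hypothetical estimator with no hidden layers: delta_hat = a * h + b.
    return a * h + b

def decoupled_layer_update(w, x, a, b, lr=0.1):
    h = w * x                                 # forward through this layer only
    delta_hat = synthetic_gradient(h, a, b)   # stands in for the true dL/dh
    dh_dw = x                                 # local factor, known immediately
    return w - lr * delta_hat * dh_dw, h

# The layer updates without ever seeing the loss.
w_new, h = decoupled_layer_update(w=0.5, x=1.0, a=1.0, b=-2.0)
print(w_new, h)
```

Note that no loss appears anywhere in `decoupled_layer_update`; how good the update is depends entirely on how good the estimator is.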
The gradient prediction model is simply a fully-connected NN regressor with 0-3 hidden layers. The supervision comes from the fact that the accurate loss will eventually get computed, and its gradient will trickle down to serve as the training target for the parameters of the gradient prediction model. This parameter update can happen independently and asynchronously, removing any kind of network locking. As expected, this model can produce noisy gradients, but as we know from the literature, some noise can actually be desirable because it tends to make training more robust.
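The training of the gradient predictor can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the paper's setup: a scalar two-layer network with the output weight `w2` frozen, a single fixed data point, a hypothetical linear estimator (`a`, `b`), a squared-error regression loss between predicted and true gradient, and made-up learning rates. The "eventual" arrival of the true gradient is simulated in the same loop iteration rather than asynchronously.

```python
# Toy setup: h = w1 * x, y = w2 * h, L = 0.5 * (y - target) ** 2.
x, target = 1.0, 2.0
w1, w2 = 0.2, 1.0        # w2 is frozen here to keep the toy small
a, b = 0.0, 0.0          # hypothetical linear estimator: delta_hat = a * h + b
lr, lr_est = 0.1, 0.1    # assumed learning rates

for step in range(300):
    # 1) Layer 1 updates right away from the predicted gradient.
    h = w1 * x
    delta_hat = a * h + b
    w1 -= lr * delta_hat * x

    # 2) Later, the real loss is computed and the true dL/dh arrives.
    y = w2 * h
    delta = (y - target) * w2

    # 3) Regress the estimator onto the true gradient (squared error).
    err = delta_hat - delta
    a -= lr_est * err * h
    b -= lr_est * err

print(w1, err)  # w1 drifts toward 2.0 as the prediction error shrinks
```

Early on the estimator is wrong and layer 1 barely moves; as the estimator catches up with the true gradient, the layer's weight settles where the real loss is minimized. In a real system, steps 2 and 3 would run on their own schedule, which is where any speedup would have to come from.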
At this point, if you are scratching your head and asking yourself: How can this be fast? How much compute is needed for this? Will this work on my tiny GPU cluster? I'm with you.
* The authors show speedup improvements in the paper, but that's on Google/DeepMind's hardware setup, which is far better than what most startups and even big companies have. With no details on the hardware, the timing improvements are hard to interpret. For instance, on a low-end GPU like a Titan X, there could be other systems issues that wipe out any performance gains.
* If you are implementing this in anything other than TensorFlow (sorry torch/caffe peeps), it will require a lot of engineering to get it right in a truly distributed system.
* If you don't have a large GPU-borg farm like Google/Facebook, I'm betting this will not work out well.
* It's a cool idea, really. The obviousness of it makes me think it is something others who have worked long and hard in the field have thought of but never had the resources to execute. But then, some things are obvious only in retrospect. So, nice job DeepMind people.
* This paper also gets the award for best figures. I want to know what LaTeX package or software tools they used for generating the figs!
* This setup with the gradient estimator between layers is now called a "decoupled neural interface". You should be calling your regular matrix multiplications "neural interfaces" from now on :P
See also: the next post, where I ask if backpropagation is necessary.