Synthetic what now? DeepMind recently published a paper on Synthetic Gradients. This post is about that: what they are, and whether it makes sense for your average Deep Joe to use them.

A Computational Graph is the best data structure to represent deep networks. (D)NN training and inference algorithms are examples of data flow algorithms, and the biggest bane of anyone implementing DNNs is the time it takes to train those damn things. Largely, this is because of the backprop algorithm. You pass each data point (in the purely stochastic case) all the way through the network and propagate the error back. You can speed things up a bit using minibatches, the parallelism inherent to the computational graph in complex models, the right optimizer, well-tuned learning rates and learning rate schedules, and multiple GPUs. Still, you are stuck with a fundamental speedup limitation: the forward and backward passes and the weight updates happen in lockstep. Until the final loss L is computed, the update for a parameter \theta_i in layer i simply cannot be computed, because it depends on the gradient of that loss.

\theta_i \leftarrow \theta_i - \alpha~\frac{\partial L}{\partial \theta_i}

You knew that. The key insight in the paper is that the gradient \frac{\partial L}{\partial \theta_i} doesn’t have to be treated as a monolithic expression. It factorizes nicely via the chain rule as \frac{\partial L}{\partial h_i}\frac{\partial h_i}{\partial \theta_i}, where h_i is the output of layer i, and each of these factors can be handled separately:

1. The factor \frac{\partial h_i}{\partial \theta_i} depends only on information local to layer i, so it is available immediately.

2. That leaves the factor \frac{\partial L}{\partial h_i}, which is approximated with an *estimate* \hat{\delta_i}.

So, now your “instantaneous” update equation looks like this:

\theta_i \leftarrow \theta_i - \alpha~\hat{\delta_i}~\frac{\partial h_i}{\partial \theta_i}
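To make the factorization concrete, here is a minimal sketch for a single linear layer h = W x, where W plays the role of \theta_i. For such a layer, \frac{\partial L}{\partial W} is the outer product of \frac{\partial L}{\partial h} and x, so the local factor is just the layer's own input. The choice of a linear layer, the function names `forward` and `local_update`, and the toy numbers are all my assumptions for illustration; none of this is taken from the paper.

```python
# Sketch: decoupled update for one linear layer h = W x (plain Python, no deps).
# Assumption: delta_hat stands in for dL/dh and is supplied by some estimator.

def forward(W, x):
    """Compute h = W x for a list-of-rows matrix W and a vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def local_update(W, x, delta_hat, lr):
    """Return W - lr * outer(delta_hat, x).

    For h = W x, dL/dW[r][c] = dL/dh[r] * x[c], so the update needs only the
    layer's input x plus an estimate of dL/dh. The downstream loss is not used.
    """
    return [[w - lr * d * xj for w, xj in zip(row, x)]
            for row, d in zip(W, delta_hat)]

W = [[1.0, 2.0], [0.0, -1.0]]
x = [1.0, 0.5]
delta_hat = [0.2, -0.4]   # hypothetical output of the gradient estimator
W_new = local_update(W, x, delta_hat, lr=0.1)
```

The point of the sketch is that `local_update` never touches the loss: everything it needs is either local to the layer or handed over by the estimator.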

The estimator \hat{\delta_i} can be anything, but the most obvious thing is to train another mini neural network M_{i+1}.

*Image courtesy: Jaderberg et al. (2016). A Decoupled Neural Interface.*

The mini neural network M_{i+1} has its own parameters to be trained in a supervised setting with the real loss L.

Big fleas have little fleas,

Upon their backs to bite ’em,

And little fleas have lesser fleas,

and so, ad infinitum.

The gradient prediction model M_{i+1} is simply a fully-connected NN regressor with 0-3 hidden layers. The supervision comes from the fact that the true loss L will eventually get computed, and its gradient will trickle down to serve as the regression target for learning the params of M_{i+1}. This parameter update can happen independently and asynchronously, removing any kind of network locking. As expected, this model can produce noisy gradients, but as we know from the literature, some noise in the gradients is often desirable because it tends to make training more robust.
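As a toy illustration of that supervision signal, the sketch below trains about the simplest gradient predictor imaginable by regressing its output onto the true gradient once it becomes available. Everything specific here is an assumption chosen to keep the example tiny: a scalar layer h = theta * x, a teacher y = 2 * x, a squared-error loss, a bias-free linear predictor M(h) = a * h, and the function name `train`. The real setup uses vector-valued activations and runs the two updates asynchronously, which this sequential loop does not attempt to model.

```python
import random

def train(steps=5000, lr=0.05, seed=0):
    """Toy decoupled training of a scalar layer h = theta * x.

    Assumptions (not from the paper): targets come from a teacher y = 2 * x,
    the loss is L = 0.5 * (h - y)^2, and the gradient predictor is the
    bias-free linear model M(h) = a * h.
    """
    rng = random.Random(seed)
    theta, a = 0.5, 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        h = theta * x

        # 1) Update the layer right away using the synthetic gradient.
        delta_hat = a * h                  # estimate of dL/dh
        theta -= lr * delta_hat * x        # dh/dtheta = x

        # 2) Eventually the true gradient dL/dh = h - y becomes available.
        delta_true = h - 2.0 * x

        # 3) Train M by regression: minimize 0.5 * (delta_hat - delta_true)^2.
        a -= lr * (delta_hat - delta_true) * h
    return theta, a

theta, a = train()
```

With these settings, theta should settle near the teacher's value of 2.0 while a decays toward zero, since the true gradient vanishes once the layer fits the teacher. Note that the regression target for M moves as theta changes, which is part of why convergence is not obvious in general.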

At this point, if you are scratching your head and asking yourself how this can be fast, how much compute it needs, and whether it will work on your tiny GPU cluster, I’m with you.

**Parting notes:**

* The authors show speedup improvements in the paper, but that’s on Google/DeepMind’s hardware setup, which is infinitely better than what most startups and even big companies have. With no details on the hardware, the timing improvements are hard to interpret. For instance, on a low-end GPU like a Titan X, there could be other systems issues that wipe out any performance gains.

* If you are implementing this in anything other than TensorFlow (sorry torch/caffe peeps), it will require a lot of engineering to get it right in a truly distributed system.

* If you don’t have a large gpu-borg farm like Google/Facebook, I’m betting this will not work out well.

* It’s a cool idea, really. The obviousness of it makes me think it is something others who have worked long and hard in the field have thought of but never had the resources to execute. But then, some things are obvious only in retrospect. So, nice job DeepMind people.

* This paper also gets the award for best figures. I want to know what latex package or software tools they used for generating the figs!

* This setup with the gradient estimator between layers is now called a “decoupled neural interface”. You should be calling your regular matrix multiplications “neural interfaces” from now on :P

**See also**: the next post, where I ask if backpropagation is necessary.

I tried downloading their arxiv source and found that those (PDF) figures are drawn using Keynote. Hope that helps!

Isn’t there a much broader implication here? If you can approximate the loss per layer without actually having the final layer… Those mini networks basically have a way to judge where mistakes were made? How?

How accurate are they?

If I understand correctly, no. Training the mini-network approximator depends on at least sometimes running real backpropagation, so all of this falls apart if you don’t have a final layer and corresponding loss. (Plus, it’s not even clear what error would mean without a final layer.)

See the last sentence of the second paragraph of Sec. 2.

That’s right. Without the feedback from the final loss, you’re just making random updates. Although, I still don’t see a proof of why synthetic gradients should converge. A short paper for anyone interested ;-)

Say there’s a 4-layer network using DNI. Does the first layer get updated 4 times during each forward-backward cycle? And during learning, does each layer still update based on the real gradients?

Isn’t this similar to the TD algorithm in RL? Perhaps a proof can be obtained from that.

I found the follow-up paper https://arxiv.org/abs/1703.00522 much clearer.

There is a lot of work along these lines (target propagation, ADMM, “MAC Learning”), and they are all quite same-y when you work out the math. Synthetic gradients are similar to these approaches but still stand out as the most novel, I think. The neologisms here are a bit dramatic (“Neural Interfaces”?), but it’s otherwise a cool paper that I hope gets more follow-up work.

I have a PyTorch implementation for MNIST classification here: https://github.com/andrewliao11/dni.pytorch

I really like the idea they proposed, and the follow-up work makes us rethink some of the problems and potential of SGs.

Say there’s a 4-layer network using DNI. Does the first layer get updated 4 times during each forward-backward cycle? I couldn’t figure it out from the paper.

Here is a TensorFlow implementation for the RNN case, along with explanations.