Is BackPropagation Necessary?

In the previous post, we saw how the backprop algorithm itself is a bottleneck in training, and how the Synthetic Gradient approach proposed by DeepMind reduces/avoids network locking during training. While very clever, there is something unsettling about the solution. It seems very contrived, and definitely resource intensive.  For example, a simple feed forward network under the scheme has a Rube-Goldbergesque feel to it.

FeedFoward with DNIimage courtesy: Jaderberg et al (2016). A fully unlocked feed forward net using DNI.

Every time you see a solution that looks unnatural, you want to go back and ask are we solving the right problem or are we even asking the right question. Naturally, this begs the question: Is backprop is the right way to train neural networks? To answer look at this, let’s step back a few paces.

All machine learning algorithms are solving one kind of optimization problem or another. A majority of those optimization problems (esp. those involving real-world tasks) are non-convex. While there are several attempts to handle non-convexity in machine learning, the AI community appears to have converged upon deep neural networks as a way to handle that nonconvexity in a reasonable manner. While it requires massive amounts of compute (and data) to learn meaningful parameters, the deep learning methods at least allow us to capture representations that were not previously possible in some situations. So until we find alternate ways of modeling nonconvexity, deep networks have a role to play in many real-world problems, and we have to find efficient ways to train them. The original question “Is backpropagation necessary?”, has two embedded questions:

1. Are there other ways to train traditional (deep) neural networks without backpropagation?
2. Are there other architectures (neural or otherwise) will obviate the need for backprop while still producing models with similar capacities?

These are questions people in the field have pondered a lot about (by “field”, I mean not just the DL community but the whole ML/neuroscience/cognitive science community), and I am just poking the monster here.

1. Are there other ways to train traditional (deep) neural networks without backpropagation?

Consider the Hidden Markov Model — one of the simplest yet widely used models for speech and language before DNNs came along. HMM training has a similar (although not identical) forward backward dataflow algorithm — the Baum-Welch algorithm. Since its inception in the 1970s, Baum-Welch and their variants were de jure in training HMMs until, in 2008, Daniel Hsu, Sham Kakade, and Tong Zhang showed how to learn HMM parameters using a spectral method. This was very exciting for me because I was a spectral graph theory nut in grad school back then. Around the same time, I also worked with Deepak Ravichandran at Google on applying Label Propagation to various NLP problems. Label Propagation is another example of a dataflow algorithm that can be naturally written down as an eigen decomposition problem, and we used the fact that these algorithms have a flow-based iterative analog to build a mapreduce-based solution that scaled to large datasets. (Digression: today you would not implement iterative algorithms using mapreduce, but that was a thing to do back then.)

So, you see a pattern here. Most data flow problems can be transformed into spectral learning problems. How about training neural networks?  If you follow Anima Anandakumar‘s work (and you should!), she has shown how spectral methods (tensor decomposition to be general) can be applied to a variety of tasks ranging from learning parameters of neural networks to latent variables in general probabilistic models (not just HMMs). Furthermore, the tensor decomposition methods are embarrassingly parallel, enabling large-scale distributed training. For more details on the neural network training with spectral methods, see:

Majid Janzamin, Hanie Sedghi, Anima Anandkumar, Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods, Jan. 2016.

The core idea is to use the method of moments recover to the latent variables (in case NNs, hidden layer params) via tensor decomposition. In fact, you can look at Omer Levy and Yoav Goldberg’s word2vec as a matrix factorization problem as an instantiation of this for autoencoders.

In the context of deep networks, there is also the interesting idea of Target Propagation by Yoshua Bengio, which claims to be backprop-free.
Yoshua Bengio, How Auto-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation, Sep. 2014

All of these solutions take high energy, that’s in direct contrast to the human brain which supposedly operates on 20W or less. Computational Neuroscientists often refer to this as “biologically plausible”. This leads us to the next question:

2. Are there other (neural) architectures will obviate the need for backprop while still producing models with similar capacities?

There’s an entire line a biologically plausible learning work, with a long history. Now you’re entering the realm of computational neuroscience, about which I know nothing about. What are some of results here that one should be watching? Spiking NNs? This to me is very exciting about future of NN research, than simply throwing tons of compute and approximating functions.

 

  • Simon Alexander Hewat

    Nice post. Succinct and informative.

    If you want to learn more about the state of the art in biological plausibility in relation to DNN learning, the below link is a pretty good survey.

    http://biorxiv.org/content/early/2016/08/22/058545

    • Delip Rao

      This is awesome reference! Thanks!

  • The Lone Wolfling

    I seem to remember a paper about randomly dropping layers while training
    (analogous to dropping nodes while training). Apparently it works
    surprisingly well.

    So: how do you train a deep neural network? You don’t. You train a bunch of shallow subsets of said network.

    • Shaohua Li

      This idea sounds questionable to me. Each layer is a nonlinear transformation, and removing or keeping it results in totally different input for the next layer. Couldn’t see why fitting both input will improve the generalization.

© 2016 Delip Rao. All Rights Reserved.