Yesterday, I wrote (excitedly) about stochastic depth in neural networks. The reactions I saw for that paper ranged from, “dang! I should’ve thought of that” to, umm, shall we say annoyed?
This reaction is not surprising at all. The idea was one of those “Frustratingly Simple” ideas that worked. If you read the paper, there is no new theory or model there. Nor do the authors spend much time on why things work, beyond a hand-wavy explanation in terms of ensembles. Critics might question whether there was any “contribution to science”; I’m sure some reviewers will.
The fact of the matter is nobody knows *exactly* why this works. My guess is this: a lot of the regularization we are seeing probably comes from preventing the layers from co-adapting with each other. Just as dropout discourages adjacent layers from co-adapting with each other, my guess is that stochastic depth discourages entire subsets of layers from co-adapting with each other. No doubt, there is an army of people out there ready to science the hell out of this and explain better what’s going on. Kudos to them, and I look forward to those works.
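To make the mechanism concrete, here is a minimal NumPy sketch of stochastic depth as I understand it from the paper: during training each residual block is kept with some survival probability and otherwise skipped (the input passes straight through), and at test time every block is kept but its residual branch is scaled by that probability. The names `survival_probs`, `residual_branch`, `forward` and `p_last` are my own, and the toy linear-plus-ReLU branch is a stand-in for the real convolutional block; treat this as an illustration under those assumptions, not the authors’ implementation.

```python
import numpy as np

def survival_probs(num_blocks, p_last=0.5):
    """Survival probability per block, decaying linearly down to p_last at the last block."""
    return [1.0 - (l / num_blocks) * (1.0 - p_last) for l in range(1, num_blocks + 1)]

def residual_branch(x, w):
    # Toy residual function (assumption): one linear map followed by ReLU.
    return np.maximum(x @ w, 0.0)

def forward(x, weights, probs, rng, training):
    """Run x through a stack of residual blocks with stochastic depth."""
    for w, p in zip(weights, probs):
        if training:
            # Keep the block with probability p; otherwise it acts as the identity.
            if rng.random() < p:
                x = x + residual_branch(x, w)
        else:
            # At test time all blocks are active, scaled by their survival probability.
            x = x + p * residual_branch(x, w)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))
weights = [0.1 * rng.normal(size=(3, 3)) for _ in range(4)]
probs = survival_probs(len(weights), p_last=0.5)

train_out = forward(x, weights, probs, rng, training=True)   # random subset of blocks
test_out = forward(x, weights, probs, rng, training=False)   # all blocks, rescaled
```

Because a skipped block costs no forward or backward computation, the expected depth during training is shorter than the full network, which is where the training-time savings come from.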
But that is Science. The realities of practice, however, are different. As a practitioner, if you are in the business of approximating functions, there is no escaping the methods now branded as Deep Learning, and (old and new) ensemble methods. In fact, all top submissions at Kaggle, for instance, use one or both of these. As a practitioner, I do care about the accuracy improvements I can get by squeezing in more parameters without losing generalization performance. But more importantly, I care about training/testing turnaround time. A lot. Fancy tree-structured models that take forever to converge for a marginal improvement in accuracy? No, thank you.
Whenever I see something that fits the bill, I don’t hesitate to co-opt it. We don’t understand some of these things well today. A lot of deep learning is like that. I have used my Twitter stream to call this out, but it hasn’t stopped me from using deep learning where I see fit. At the same time, I will be the first to criticize anyone claiming things like “deep learning is intelligence” or “deep learning will solve all problems”.
So back to stochastic depth. I think this idea has a lot of promise. I’m bullish on the time savings in training. While we don’t fully understand it, we will use it, and eventually figure out the *exact* reasons for its success. This reminds me of Random Projections, where the idea of dimensionality reduction is to simply multiply with random binary matrices, i.e. you throw away columns at random. Sounds stupid? When it came out (Kaski, 1998), Kaski’s paper also had only a hand-wavy explanation. It wasn’t until a few years later that connections were made to the Johnson-Lindenstrauss Lemma, and the method was understood in depth.
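As a rough illustration of why random projections are less stupid than they sound, here is a small NumPy sketch. It assumes one common variant, a matrix of random ±1 entries scaled by 1/sqrt(k), which is not necessarily the exact construction in Kaski’s paper; the function names and the sizes (20 points, 1000 dimensions down to 500) are arbitrary choices of mine. It compares pairwise distances before and after projection.

```python
import numpy as np

def random_sign_projection(X, k, rng):
    """Project rows of X from d dimensions down to k using a random +/-1 matrix."""
    d = X.shape[1]
    # Scaling by 1/sqrt(k) keeps squared lengths unchanged in expectation.
    R = rng.choice([-1.0, 1.0], size=(d, k)) / np.sqrt(k)
    return X @ R

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1000))              # 20 points in 1000 dimensions
Y = random_sign_projection(X, k=500, rng=rng)

# Ratio of projected to original distance for each distinct pair of points.
iu = np.triu_indices(X.shape[0], k=1)
ratios = pairwise_distances(Y)[iu] / pairwise_distances(X)[iu]
print(ratios.min(), ratios.max())
```

The ratios should cluster around 1, which is the kind of approximate distance preservation the Johnson-Lindenstrauss Lemma describes.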
Until then, I will keep an open mind, be critical, and use whatever works.