The Twelve Truths of Machine Learning for the Real World
Delip Rao
Dec 25, 2019

Last month I gave an informal talk with this title to an intimate gathering of friends, and I am now putting it down in words. This post is mainly for people who are using machine learning to build something, as opposed to people who are working on machine learning for its own sake (god bless them), although the latter group would do well to listen to these truths and introspect about their work. In the spirit of holiday baking, I will throw in a Truth Zero to make this a baker’s dozen.

0. You are Not a Scientist.

Yes, that’s all of you building stuff with machine learning with “scientist” in the title, including all of you with PhDs, has-been academics, and academics with one foot in industry. Machine learning (and other AI application areas, like NLP, Vision, Speech, …) is an engineering research discipline (as opposed to a science research discipline).

Tangent I won’t get into here: I think everything is art. The C. P. Snow “Two Cultures” divide is complete BS, but that’s for another day.

What’s the difference between science research and engineering research, you ask? I cannot improve upon what George A. Hazelrigg wrote in his “HONING YOUR PROPOSAL WRITING SKILLS” memo (emphasis mine):

Some scientists are taught how to frame research projects. Few engineers are, even PhD-level engineers. So let’s first try to understand the difference between science research and engineering research. To me, the difference is quite clear. The scientist seeks to understand nature at its core, to get to the fundamental essence. To do this, the scientist typically strips away extraneous effects and dives deeply into a very narrow element of nature. And from this look comes what we refer to as the laws of nature: energy and mass are the same thing, for every action there is an equal and opposite reaction, and so on. There are lots of laws of nature, and they apply everywhere all the time. Engineers live with the laws of nature. They have no choice. Their goal is to design things that work within what nature allows. To do this, they have to be able to predict the behavior of systems. So a big question for engineers is, how do we understand and predict the behavior of systems in which all the laws of nature apply everywhere all the time. This is an issue of integration, and it is every bit as difficult as finding the laws in the first place. To account for all the laws of nature everywhere all the time is an impossible task. So the engineer must find ways of determining which laws are important and which can be neglected, and how to approximate those laws that are important over space and time. Engineers do more than merely predict the future. They make decisions based in part on their predictions in the knowledge that their predictions cannot be both precise and certain. Understanding and applying the mathematics of this is also important. This includes the application of probability theory, decision theory, game theory, optimization, control theory, and other such mathematics in the engineering decision making context. This also is a legitimate area of research for engineering.

As an ML researcher and practitioner, you have to worry about the right models for the data you have, as opposed to the right datasets for the models you have (as many research papers do). If you’ve ever asked “what is the right dataset for this model”, then you are not in the Real World. What the heck is this Real World anyway? The Real World is where you don’t have a choice about the data you have to deal with. Here the data defines the problem and not the other way around. Sometimes ML practitioners of the Real World pretend they are scientists by creating their own worlds as playgrounds for their modeling enterprise, such as “inventing” a language for doing NLP (hello bAbI!) or creating closed environments with simplifying assumptions for reinforcement learning. These produce interesting results, but their scope is limited to the worlds they emerge from, even if researchers like to sell them in their papers as something applicable to the Real World. In the Real World, the input distribution is more likely to change than not, “curve balls” from long tails come out of nowhere, and you don’t always have an answer.

When working in the Real World, there are several truths one has to contend with, and they make up the main body of this post. But this prologue is essential: if you do ML research in the Real World, you are an engineer and not a scientist. Keep that in mind. There are a few recurring themes we find as we practice the craft. Interestingly, these themes can be airlifted almost verbatim from another engineering research discipline, Networking (RFC 1925, “The Twelve Networking Truths”), to make a point.

1. It has to work

While this sounds like a no-brainer, I am amazed how many people, new and experienced, get carried away with fancy-sounding names, or with something just because it came out of DeepMind or OpenAI or Stanford/MIT/what have you. Participating in the Real World has no room for ideology or specific research agendas. If your fancy model does not work with their dataset, environment, and resource constraints, the Real World will mercilessly reject it. There are many results on arXiv that only work on a handful of datasets or need a bajillion GPUs that only Google infrastructure can support. Do the community a favor and stop publishing those as general results. It has to work. Not just as “kosher” science in your paper but also in others’ situations. It is for the same reason that we don’t think of doing anything in Computer Vision without ConvNets today, or that we readily use Attention with sequence models. It has to work.

Conjecture: So many, especially folks new to ML, get carried away with fancy model names and can’t wait to try them, write blog posts about them, and so on. I think this is like someone newly learning to write: they think using big words will make their writing better, but experience will teach them otherwise.

2. No matter how hard you push and no matter what the priority, you can’t increase the speed of light

Cache hierarchies must be respected, network overheads will throw a wrench in your distributed training, there is only so much you can cram in a vector, and so on.

3. With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea

A sufficiently motivated graduate student or a large hyperparameter sweep at a massive datacenter can find a set of hyperparameters that will make a crazily complicated model work well or even produce outstanding results, but no one in the Real World ships models that are so hard to tune. A dirty secret I found while helping companies with their ML teams back when running Joostware: most did not know or care about hyperparameter tuning.
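To make the “sufficient thrust” point concrete, here is a minimal sketch of a blind random hyperparameter sweep. Everything in it is hypothetical: the parameter names, the ranges, and the fake scoring function that only rewards a narrow sliver of the search space stand in for an expensive real training loop.

```python
import random

def train_and_score(params):
    # Stand-in for an expensive training run; in the Real World this would be
    # hours of GPU time. Here we fake a noisy score that only looks good in a
    # narrow sliver of the hyperparameter space.
    lr, depth, dropout = params["lr"], params["depth"], params["dropout"]
    in_sweet_spot = 1e-4 < lr < 5e-4 and depth == 6 and 0.05 < dropout < 0.25
    base = 0.92 if in_sweet_spot else 0.60
    return base + random.gauss(0, 0.01)

best_score, best_params = float("-inf"), None
for _ in range(500):  # 500 full training runs just to stumble on a usable config
    params = {
        "lr": 10 ** random.uniform(-5, -2),          # log-uniform learning rate
        "depth": random.choice([2, 4, 6, 8]),        # number of layers
        "dropout": random.uniform(0.0, 0.5),
    }
    score = train_and_score(params)
    if score > best_score:
        best_score, best_params = score, params

print(f"best score {best_score:.3f} with {best_params}")
```

The sweep will eventually land in the sweet spot, but a model that only performs well in a tiny region of hyperparameter space, found by burning hundreds of runs, is exactly the kind of pig-with-thrust you don’t want to ship.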

4. Some things in life can never be fully appreciated nor understood unless experienced firsthand

Some things in machine learning can never be fully understood by someone who neither builds production ML models nor maintains them. No amount of courseware, MOOCs, or Kaggling will prepare you for that. There is no substitute for deploying a model, observing user interactions with the model, dealing with code/model rot, and so on.

5. It is always possible to agglutinate multiple separate problems into a single complex interdependent solution. In most cases, this is a bad idea

End-to-end learning sounds like a good idea on paper, but for most deployment scenarios, pipelined architectures that are piecewise optimized are here to stay. That doesn’t mean we will not have end-to-end systems at all (speech recognition and machine translation have decent production-worthy end-to-end solutions), but for most situations having observable paths for debugging will trump other options.
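Here is a minimal sketch of what those observable paths buy you. The stages and the toy keyword “model” are made up for illustration, not any production system: the point is that every intermediate output of a pipeline can be inspected, so a failure can be localized to a stage instead of being buried inside one opaque end-to-end model.

```python
def normalize(text):
    return text.lower().strip()

def tokenize(text):
    return text.split()

def classify(tokens):
    # Stand-in for an actual model; a toy keyword rule keeps the sketch self-contained.
    return "refund_request" if "refund" in tokens else "other"

def run_pipeline(text, debug=False):
    value = text
    for name, stage in [("normalize", normalize), ("tokenize", tokenize), ("classify", classify)]:
        value = stage(value)
        if debug:
            print(f"after {name}: {value!r}")  # an observable seam to debug against
    return value

print(run_pipeline("  I want a REFUND for this order ", debug=True))
```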

6. It is easier to ignore or move a problem around than it is to solve it

For example, in speech, acoustic modeling is hard, but you can let your network figure out those details on the way to solving a different problem (say, speech recognition). In NLP, parsing is hard to get right, but thankfully, for 99% of Real World tasks, we can get by without parsing. In Vision, don’t solve a segmentation problem first if all you need is a classifier. The list is endless.

Corollary: Don’t solve a problem unless you absolutely have to.

7. You always have to trade off something

Speed vs. memory, battery life vs. accuracy, fairness vs. accuracy, precision vs. recall, ease of implementation vs. maintainability, …
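As one concrete instance of such a trade-off, here is a minimal, self-contained sketch (the labels and scores below are made up for illustration) showing how moving a classifier’s decision threshold buys precision at the cost of recall:

```python
# Compute precision and recall for a binary classifier at a given score threshold.
def precision_recall(y_true, scores, threshold):
    preds = [s >= threshold for s in scores]
    tp = sum(p and t for p, t in zip(preds, y_true))
    fp = sum(p and not t for p, t in zip(preds, y_true))
    fn = sum((not p) and t for p, t in zip(preds, y_true))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy ground-truth labels and model scores, sorted by score for readability.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
scores = [0.95, 0.80, 0.75, 0.65, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold makes the classifier pickier: precision goes up while recall goes down. Which point on that curve you ship is a product decision, not a modeling one.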

8. Everything is more complicated than you think

Analogous to sticker shock in shopping, there is “effort shock” in working. Even the most seasoned researchers and engineers experience effort shock, because they underestimate 1) the engineering issues in dealing with large datasets, 2) the complexity of the domain they are wrestling with, and 3) adversaries. There is also a fourth reason for effort shock I call the Karate Kid effect: most papers we read make things appear simpler than they are by not noting the million failures that came before the linear success narrative that gets documented. As a result, papers are not research but the result of doing research. For that reason, you will never experience doing research by reading papers, and you will develop a skewed sense of effort.

9. You will always under-provision resources

This is a combination of #8 and the fact that any remotely successful model can collapse under its own success if capacity is not planned properly.

10. One size never fits all. Your model will make embarrassing errors all the time despite your best intentions

Corner cases and the long, fat tail of failure modes will haunt you. For many (thankfully non-critical) ML deployments, this is no big deal; at worst, it makes for a funny tweet. But if you work in healthcare or some other high-stakes situation, these will make your ML deployments a nightmare.

11. Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works

Schmidhuber may be making a larger point. Nobody listens to the man, and, like him, we rehash old wine in new bottles and are forced to repeat historical mistakes.

12. Perfection has been reached not when there is nothing left to add, but when there is nothing left to take away

This is true of everything in life, and also of machine learning for the Real World. Alas, our conference reviewing process, with its penchant for “novelty”, creates unwanted arXiv-spam with a lot of garbage that doesn’t need to exist in the first place. Unless doing “science” starts incentivizing publicizing what works as opposed to what’s new, I don’t see this changing.

P.S. I am intentionally using Machine Learning, instead of AI or Deep Learning, everywhere here. AI is pretentious, and Deep Learning is a branding exercise; fighting either is a lost cause. But you can substitute whichever terminology suits your taste.