In this post, I will talk about Language Models, when (and when not) to use LSTMs for language modeling, and some state of the art results. While I mostly discuss the “Exploring Limits” paper, I’m adding a few things elementary (for some) here for completeness sake. The Exploring Limits paper is not new, but I think it’s a good illustration of how deep sequence models should be used in practice. This is bigger than my usual posts, but hey we are exploring limits here. Ok, running out of italics now.
One can say the ImageNet corpus, in part, fueled the modern (deep learning based) Computer Vision research due to its sheer size that allowed models to regularize better and kaggleification of the engineering behind the tasks. The Billion Word Corpus (LM1B hereafter) aims to be that for language modeling. A billion words is not big for language modeling (for e.g., Google Translate folks routinely use several orders more for training their LMs), but having a largish common corpus is always a nice idea. Just as image/object recognition is a small sliver of Computer Vision, so is language modeling for Natural Language Processing. And like object recognition, it’s a primitive task that allows one to learn representations for atomic units — such as words — that could be used in many upstream tasks.
One of the problems with natural language is, like Vision, it is hard to pinpoint the definition of “understanding”. In some ways, it is harder than computer vision. For example, if I show you this image, you can probably tell it is, mostly, of a cow.
But with words and sentences, things get complicated. One prerequisite to understanding language is making sure the sentence or phrase we are reading or generating at least looks like the language we see (as sampled by the corpus) in daily life — i.e., what is the probability of seeing that sequence. Sometimes people describe language modeling as the probability of predicting the next word given a sequence. With a little bit of probability math, you can convince yourself that these are just two ways of looking at the same thing. So, a (probabilistic) language model will tell you how *likely* some sentence or phrase can be modulo LM errors. Sometimes people attribute more to this likeliness and hallucinate their models are syntactic or semantic or “creative”. For example, if you have a partial sentence as:
I eat my soup with _________
and your LM completes this partial sentence with:
1. “a spoon” — you could hallucinate that your LM is doing semantic understanding.
2. “a pillow” — that’s semantically garbage. But you could be kind to your model and say, “Oh it’s grammatically correct, so my LM is producing grammatically valid sentences.”
3. “my mind” — that’s rubbish too, except there will be some folks who will brand this as “Creative AI”, and say “Look, our models are creative in their answers”. We saw some of that nonsense happening with image captioning.
4. “the the” — Now you would probably say, “That’s unfortunate. Probably the training data didn’t have enough contexts with soup and eat.” (seeing nonsensical LM output is not uncommon with OOVs and long tail of the vocabulary; we will visit this later).
So, a language model scores likelihood of some sentence or a fragment w.r.t a corpus it has been trained on. No more. No less. Decades of language modeling research could be summarized by innovations in the sources of data used, the amount of data used, and modeling techniques used. The latter depending on the former two factors more closely than researchers will claim in their papers. The LM1B corpus further exercises the empirically well-founded narrative that more data will lead to better language models — the notion of “better” as measured by perplexity which itself is a questionable metric. It’s important however to note that improvements in perplexity doesn’t always mean an improvement in the upstream task, esp. when you notice small improvements. Big improvements in perplexity, however, should be taken seriously (as a rule of thumb, for language modeling, consider “small” as < 5 and “big” as > 10 points).
Earlier this year, Google Brain published a paper that showed massive improvements in perplexity on the LM1B corpus in a paper titled, “Exploring The Limits of Language Modeling“. This paper is impressive because:
1. It knocks perplexity on LM1B all the way down to 23.7 (in ensemble or 30.0 if you consider just their single best model).
2. AFAIK, it is the first pure LSTM model to do that, i.e. without interpolating with an ngram model — not that it’s advisable to do, but I’m jumping the gun again.
3. It shows how char CNN inputs could be used to control the LSTM model sizes — not sure if it’s the first paper to use char CNNs explicitly for that purpose, but it’s a useful trick.
To put the performance gains in perspective, a standard interpolated Kneser-Ney 5-gram model (KN5) gives you 67.6 perplexity for around 1.8 billion params (floats). Their single bi-layer LSTM-CNN model with perplexity 30.0 has just over a 1 billion params. That’s more than 55% reduction in perplexity and over 44% fewer parameters! Impressive indeed.
With such improvements staring at you, shouldn’t you always be using an LSTM? The answer, unfortunately, is not that simple. So let me list out some of the pros and cons (Data for some of the charts below come from Cantab Research):
These include Recurrent Neural Nets, and gated variants like LSTMs and GRUs.
1. Captures long range history instead of being fixed-order Markov
Typical n-gram based language model makes a fixed-order Markov assumption — a word relies only on a fixed length history in the past.
With LSTMs, this context could, in theory, be unbounded, but in practice go over the extents of the sentence. One could argue that adding more context can provide better language models even for count-based models. Let’s see how this does with a regular n-gram model with Kneser-Ney smoothing:
So just by throwing in more context in an n-gram model, we observe diminishing returns in the reduction of test set perplexity. Going from a bigram to a trigram model gives a huge win. However going from trigrams to higher order ngrams plateaus improvements while drastically increasing the size of the models. This effect is noticeable between the 4-gram and 5-gram models in the figure above.
The n-gram models, like most other ML models, improve with data. However, the improvement asymptotes quickly. In fact, if you interpolate from the 5-gram curve, getting even around a perplexity of 60 will require around a trillion words. A consequence of this is the LM will easily be several hundred gigabytes, and becomes impossible to deploy on most servers.
2. Competitive perplexity
Ever since Tomás Mikolov published his results on RNN language models, we know RNNs beat n-grams hands down with increasing data size.
Kneser-Ney 5-gram vs. RNN (different #states) on LM1B
And this effect is pronounced when more states are added.
Many of the difficulties in recurrent neural network training are now reasonably understood, and gated architectures (LSTMs and GRUs) are de jure for sequence modeling because of their robustness than vanilla RNNs.
3 . Can control/limit number of parameters in the RNN
An excellent demonstration of the Exploring Limits paper is how to model to tradeoff between model performance and the number of parameters in a sequence model.
A trick I found useful is to use character CNN to model word inputs as a means to reduce parameter sizes of sequence models. Another paper that explore this in the context of language modeling is by Yoon Kim and friends, “Character Aware Language Modeling“, where they claim up to 60% reduction in parameters. This is a better motivation of CharCNNs than, say for text classification.
4. RNN-based models do well on rare words
My favorite graph (somewhat unexpected for me) from the Exploring Limits paper is one where they show the difference in entropy of their best LSTM model (a 2 layer LSTM with CNN inputs) and KN5 for words of increasing rarity.
1. Super fast to train — at least an order of magnitude fewer hours needed to train.
2. Works well for small quantities of data. With the right smoothing/priors, you could make this work decently with a fraction of the data. Let’s say if the government comes to you asking to build a language model in Dari or Pashto, and you have just a few hundred thousand words, you really cannot do better than using an n-gram model and a visit to your friendly neighborhood linguistics department. (Using those languages as an example; quite sure, given the interest, by now we have enough data for those languages).
3. Still, the best option to use in combination: Even when you have tonnes of data, you want to use the n-gram model in combination, as they usually give better results, when interpolated, than using just the LSTM model alone. Even when you add more states to your LSTM, you will continue to get better performance interpolating with a 5-gram model. Another case in point is the Speech Recognition system from MSR that’s making the news now for getting the best Word Error Rate: They use an n-gram interpolated RNN model too.
The Exploring Limits paper counters this argument. They suggest n-gram interpolation is not necessary and with careful tuning it is possible to find an LSTM+CNN architecture that will provide competitive results as an interpolated model. That’s just an exercise in exploring limits of your patience or your GPU infrastructure.
I’m going to leave you all with a decision tree of what to do when faced with building LMs in your startups, if LMs are a means to an end for the problem you are solving.