Word Embedding as a Learning To Rank Problem

by  on

I'm writing more about word embeddings. Weird. I know. They are frustratingly useful in product development, and opaque when it comes to understanding the gotchas. So every time I come across a paper that improves my understanding I get all excited. Like when Levy et al O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings, TACL 2015 showed how the competing embedding methods were pretty much the same if you took care while setting up your experiment. That was cool.

Recently I came across WordRank -- a fresh new approach to embedding words by looking at it as a ranking problem. In hindsight, this makes sense. In typical language modeling situation, NN based or otherwise, we are interested in this: you have a context cc, and you want to predict which word w^\hat{w} from your vocabulary Σ\Sigma will follow it. Naturally, this can be setup either as a ranking problem or a classification problem. If you are coming from the learning the rank camp, all sorts of bells might be going off at this point, and you might have several good reasons for favoring the ranking formulation. As an aside, an excellent resource on learning to rank (LTR) is the tutorial by Hang Li. Hang, Li. "A short introduction to learning to rank.", 2011. LTR is a versatile toolkit to have in your ML war chest; you will be amazed where you can use them. That's exactly what we see in this paper. By setting up word embedding as a ranking problem, you get a discriminative training regimen and built in attention-like capability (more on that later).

Every interesting paper has at least one eye-catcher. For the WordRank paper:

... with 17 million tokens our method performs almost as well as existing methods using 7.2 billion tokens on a popular word similarity benchmark.

I mean look at these plots showing word similarity task accuracy as a function of  the number of tokens in the training corpus. Red is WordRank, Blue is Word2Vec, and Green is Glove.


While reading this paper, a few thoughts came to mind:

1. I think this is interesting from a science perspective, but in real life situations, I rarely see small/tiny monolingual corpora, unless I'm working with LCTLs.
2. Besides, in the absence of error bars on those red and blue dots, your guess is as good as mine when it comes to telling how different they really are. ¯\_(ツ)_
3. Remember, from the Swivel post, when comparing analogy-task accuracies of various word embedding methods it is a good idea to show their accuracies on tokens in different frequency quantiles: swivel_chart

I would love to see a similar chart for WordRank. Will it do well on highly frequent words where all methods suck right now (polysemous words are important in this category)?

Given all these caveats, should you care about this? I think so. Here's why:

1. A lot of the techniques in this paper are general enough to be useful for embedding things other than words, where instance sizes could be small.
2. There's less black magic! Cannot underestimate this.
3. Ranking has a built-in robustness to noise (coming from one off co-occurrences or typos). Why? That's because the loss functions used in ranking, like (N)DCG, focus more on getting items right at the top of the list than penalizing the models equally for not doing well at the bottom, which we don't really care. So what the ranking loss function is doing here is similar to the recent visual attention models we see in Deep Learning literature, and 90s computational neuroscience literature.

Olshausen et al 93

So if I'm not happy with Word2Vec performance, I wouldn't mind trying this, as opposed to something that's very brittle/depending heavily on hyper param optimization. The authors provide a reference C++ implementation of WordRank in case you want to try it out.

Bonus: This paper also brought to my attention RoBiRankYun, Hyokun, Parameswaran Raman, and S. Vishwanathan. "Ranking via robust binary classification."  NIPS 2014., a Learning To Rank method, from the same authors that I did not know. That's a shame because I used to follow Vishy Vishwanathan`s work closely when I used to geek out on graph kernels, but glad to find this. RoBiRank is a very clever reformulation of the NDCG loss that makes it easier to optimize. They also describe a near-linear scalable distributed version of RoBiRank. Check it out. The source for RobiRank is also available.


Copyright © 2021. Delip Rao