# Word Embedding as a Learning To Rank Problem

I’m writing more about word embeddings. Weird. I know. They are frustratingly useful in product development, and opaque when it comes to understanding the gotchas. So every time I come across a paper that improves my understanding I get all excited. Like when Levy et al. [1] showed how the competing embedding methods were pretty much the same if you took care while setting up your experiment. That was cool.

Recently I came across WordRank, a fresh new approach to embedding words by looking at it as a ranking problem. In hindsight, this makes sense. In a typical language modeling situation, NN-based or otherwise, we are interested in this: you have a context $c$, and you want to predict which word $\hat{w}$ from your vocabulary $\Sigma$ will follow it. Naturally, this can be set up either as a ranking problem or as a classification problem. If you are coming from the learning-to-rank camp, all sorts of bells might be going off at this point, and you might have several good reasons for favoring the ranking formulation. That’s exactly what we see in this paper. By setting up word embedding as a ranking problem, you get a discriminative training regimen and a built-in attention-like capability (more on that later).
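To make the two formulations concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy vectors, the function names, and especially the `log2(1 + rank)` transform, which is a common concave choice in learning to rank and not necessarily the objective the WordRank paper optimizes. Both losses start from the same dot-product scores; they differ only in what they penalize.

```python
import math

def scores(context_vec, word_vecs):
    """Dot-product score of every vocabulary word against one context."""
    return {w: sum(a * b for a, b in zip(context_vec, v))
            for w, v in word_vecs.items()}

def classification_loss(s, target):
    """Softmax cross-entropy: next-word prediction as |vocab|-way classification."""
    m = max(s.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in s.values()))
    return log_z - s[target]

def rank_of(s, target):
    """Number of other words scoring at least as high as the target (0 = top)."""
    return sum(1 for w, x in s.items() if w != target and x >= s[target])

def ranking_loss(s, target):
    """Concave transform of the rank (illustrative choice, see lead-in)."""
    return math.log2(1 + rank_of(s, target))

# Toy 2-d embeddings with made-up numbers.
word_vecs = {"cat": [1.0, 0.0], "dog": [0.8, 0.2], "car": [0.0, 1.0]}
context = [1.0, 0.5]
s = scores(context, word_vecs)

for w in word_vecs:
    print(w, rank_of(s, w), ranking_loss(s, w), classification_loss(s, w))
```

Because the transform is concave, moving the target from rank 1 to rank 0 reduces the loss more than moving it from rank 101 to rank 100, so the loss concentrates on the top of the list. A real implementation would need a differentiable surrogate for the rank, which this sketch does not attempt.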

As an aside, an excellent resource on learning to rank (LTR) is the tutorial by Hang Li [2]. LTR is a versatile toolkit to have in your ML war chest; you will be amazed where you can use it. But we digress.

Every interesting paper has at least one eye-catcher. For the WordRank paper:

> … with 17 million tokens our method performs almost as well as existing methods using 7.2 billion tokens on a popular word similarity benchmark.

I mean, look at these plots showing word similarity task accuracy as a function of the number of tokens in the training corpus. Red is WordRank, blue is Word2Vec, and green is GloVe.

While reading this paper, a few thoughts came to mind:

1. I think this is interesting from a science perspective, but in real-life situations, I rarely see small/tiny monolingual corpora, unless I’m working with less commonly taught languages (LCTLs).
2. Besides, in the absence of error bars on those red and blue dots, your guess is as good as mine when it comes to telling how different they really are. ¯\_(ツ)_/¯
3. Remember, from the Swivel post, when comparing analogy-task accuracies of various word embedding methods it is a good idea to show their accuracies on tokens in different frequency quantiles:

I would love to see a similar chart for WordRank. Will it do well on highly frequent words where all methods suck right now (polysemous words are important in this category)?
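For anyone who wants to produce that kind of breakdown, here is a rough sketch of the bookkeeping, assuming you already have, for each test item, the corpus frequency of the word in question and whether the model got it right. The function name, the equal-sized bucketing, and the toy numbers are all hypothetical; the Swivel post may well have bucketed differently.

```python
def accuracy_by_frequency_quantile(results, n_bins=4):
    """results: list of (word_frequency, is_correct) pairs, one per test item.

    Sorts items by frequency, splits them into n_bins near-equal-sized
    buckets (lowest-frequency bucket first), and returns the accuracy
    within each bucket. Empty buckets are reported as None.
    """
    ordered = sorted(results, key=lambda r: r[0])
    n = len(ordered)
    accuracies = []
    for b in range(n_bins):
        bucket = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not bucket:
            accuracies.append(None)
            continue
        accuracies.append(sum(1 for _, ok in bucket if ok) / len(bucket))
    return accuracies

# Made-up results: (corpus frequency of the query word, was the answer correct?)
toy = [(3, False), (5, False), (40, True), (55, True),
       (300, True), (420, True), (9000, True), (12000, False)]
per_bucket = accuracy_by_frequency_quantile(toy, n_bins=4)
print(per_bucket)
```

Plotting one such list per embedding method, side by side, shows at a glance whether a method's headline number comes from rare words, frequent words, or both.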