Word Embedding as a Learning To Rank Problem

I’m writing more about word embeddings. Weird. I know. They are frustratingly useful in product development, and opaque when it comes to understanding the gotchas. So every time I come across a paper that improves my understanding I get all excited. Like when Levy et al.1 showed how the competing embedding methods were pretty much the same if you took care while setting up your experiment. That was cool.

Recently I came across WordRank, a fresh new approach to embedding words by treating the problem as one of ranking. In hindsight, this makes sense. In a typical language modeling setup, NN-based or otherwise, we are interested in this: you have a context c, and you want to predict which word \hat{w} from your vocabulary \Sigma will follow it. Naturally, this can be set up either as a ranking problem or as a classification problem. If you are coming from the learning to rank camp, all sorts of bells might be going off at this point, and you might have several good reasons for favoring the ranking formulation. That’s exactly what we see in this paper. By setting up word embedding as a ranking problem, you get a discriminative training regimen and a built-in attention-like capability (more on that later).
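To make the ranking framing concrete, here is a minimal sketch in plain NumPy. This is emphatically not WordRank’s actual objective; it only illustrates the generic “rank the observed word above other words” idea, and the vocabulary size, margin, learning rate, and negative sampling are made up for illustration.

```python
# Sketch only: pairwise hinge-ranking updates for observed (word, context) pairs.
# Illustrates the ranking framing, not the WordRank objective itself.
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM, MARGIN, LR = 1000, 50, 1.0, 0.05    # toy sizes, not from the paper
W = rng.normal(scale=0.1, size=(VOCAB, DIM))    # word embeddings
C = rng.normal(scale=0.1, size=(VOCAB, DIM))    # context embeddings

def ranking_update(word, context, n_negatives=5):
    """One SGD step: push score(word, context) above sampled negatives by a margin."""
    c_vec = C[context].copy()
    for neg in rng.integers(0, VOCAB, size=n_negatives):
        if neg == word:
            continue
        pos_score = W[word] @ c_vec
        neg_score = W[neg] @ c_vec
        if MARGIN - pos_score + neg_score > 0:   # margin violated: adjust
            C[context] += LR * (W[word] - W[neg])
            W[word] += LR * c_vec
            W[neg] -= LR * c_vec

# Training would call ranking_update(word_id, context_id) for every
# co-occurrence pulled from a sliding window over the corpus.
```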

As an aside, an excellent resource on learning to rank (LTR) is the tutorial by Hang Li.2 LTR is a versatile toolkit to have in your ML war chest; you will be amazed where you can use it. But we digress.

Every interesting paper has at least one eye-catcher. For the WordRank paper:

… with 17 million tokens our method performs almost as well as existing methods using 7.2 billion tokens on a popular word similarity benchmark.

I mean, look at these plots showing word similarity task accuracy as a function of the number of tokens in the training corpus. Red is WordRank, blue is Word2Vec, and green is GloVe.

[Figure 2 from the WordRank paper: word similarity accuracy vs. number of training tokens for WordRank, Word2Vec, and GloVe]

While reading this paper, a few thoughts came to mind:

1. I think this is interesting from a science perspective, but in real life situations, I rarely see small/tiny monolingual corpora, unless I’m working with LCTLs (less commonly taught languages).
2. Besides, in the absence of error bars on those red and blue dots, your guess is as good as mine when it comes to telling how different they really are. ¯\_(ツ)_/¯
3. Remember from the Swivel post that when comparing analogy-task accuracies of various word embedding methods, it is a good idea to show their accuracies on tokens in different frequency quantiles:
[Chart from the Swivel post: analogy-task accuracy broken down by token frequency quantile]

I would love to see a similar chart for WordRank. Will it do well on highly frequent words where all methods suck right now (polysemous words are important in this category)?
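If you want to produce that kind of breakdown yourself, a rough sketch follows. The function name and input format here are hypothetical, just to show the bucketing: score each analogy question as right or wrong, look up the corpus frequency of its target word, and average accuracy within frequency quantiles.

```python
# Sketch: break analogy-task accuracy out by frequency quantile of the target word.
from collections import Counter
import numpy as np

def accuracy_by_frequency_quantile(results, word_counts, n_buckets=5):
    """results: list of (target_word, was_correct) pairs from an analogy eval.
    word_counts: Counter mapping word -> corpus frequency."""
    freqs = np.array([word_counts[w] for w, _ in results], dtype=float)
    correct = np.array([was_correct for _, was_correct in results], dtype=float)
    # Quantile edges over the target words' corpus frequencies.
    edges = np.quantile(freqs, np.linspace(0, 1, n_buckets + 1))
    buckets = np.clip(np.searchsorted(edges, freqs, side="right") - 1, 0, n_buckets - 1)
    return {b: correct[buckets == b].mean()
            for b in range(n_buckets) if (buckets == b).any()}

# Usage sketch:
# word_counts = Counter(corpus_tokens)
# results = [("paris", True), ("serendipity", False), ...]   # from your analogy eval
# print(accuracy_by_frequency_quantile(results, word_counts))
```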

Given all these caveats, should you care about this? I think so. Here’s why:

1. A lot of the techniques in this paper are general enough to be useful for embedding things other than words, where instance sizes could be small.
2. There’s less black magic! You can’t overstate the value of this.
3. Ranking has built-in robustness to noise (coming from one-off co-occurrences or typos). Why? Because the loss functions used in ranking, like (N)DCG, focus on getting items right at the top of the list rather than penalizing the model equally for mistakes at the bottom, which we don’t really care about (see the small numerical example below). In that sense, what the ranking loss function is doing here is similar to the recent visual attention models we see in the deep learning literature, and in the 90s computational neuroscience literature.

[Image: the visual attention model of Olshausen et al., 1993]
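To see that top-heaviness concretely, here is a tiny numerical example using the standard DCG discount of 1/log2(rank + 1); this is plain NumPy, unrelated to the paper’s code. With a single relevant item, sliding it from rank 1 to rank 2 costs a large chunk of the score, while burying it at rank 100 leaves almost nothing, so the metric (and hence the loss) is dominated by the top of the list.

```python
# Illustration: DCG rewards getting the top of the list right.
import numpy as np

def dcg(relevances):
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum(relevances / np.log2(ranks + 1))

ideal = np.array([1.0] + [0.0] * 99)         # one relevant item, ranked first

swap_top = ideal.copy()
swap_top[[0, 1]] = swap_top[[1, 0]]          # relevant item slips to rank 2

swap_bottom = ideal.copy()
swap_bottom[[0, 99]] = swap_bottom[[99, 0]]  # relevant item buried at rank 100

print(dcg(ideal), dcg(swap_top), dcg(swap_bottom))
# ~1.00, ~0.63, ~0.15: a mistake near the top costs far more than one near the bottom.
```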

So if I’m not happy with Word2Vec performance, I wouldn’t mind trying this, as opposed to something that’s very brittle or depends heavily on hyperparameter optimization. The authors provide a reference C++ implementation of WordRank in case you want to try it out.

Bonus: This paper also brought to my attention RoBiRank3, a learning to rank method from the same authors that I did not know about. That’s a shame, because I used to follow Vishy Vishwanathan’s work closely back when I geeked out on graph kernels, but I’m glad to find this. RoBiRank is a very clever reformulation of the NDCG loss that makes it easier to optimize. They also describe a near-linear, scalable distributed version of RoBiRank. Check it out. The source for RoBiRank is also available.


  • Christophe Cerisara

    Thanks a lot for your very interesting blog and tweets.

    Do the authors make available some pretrained WordRank embeddings so that we can try them easily in our task?

    • Delip Rao

      @cerisara, I don’t think they do, but the authors have released the code (which I think is more important), which you can run on any of the existing large data collections.

  • Xiangnan He

    Great post. Thank you for sharing!

    I totally agree with these points from the post:

    1. “I think this is interesting from a science perspective, but in real life situations, I rarely see small/tiny monolingual corpora”

    3. “it is a good idea to show their accuracies on tokens in different frequency quantiles”

    My own experience of training word vectors (full wiki):
    1. On a large corpus (at least the full wiki), word2vec and GloVe have very good performance, and it’s really hard to further improve them on both analogy and similarity tasks. Usually, word2vec yields better performance on similarity, while GloVe is stronger on analogy (I got over 0.73 accuracy with GloVe vectors when training on wiki).

    2. Most other papers that claim they can improve over word2vec and GloVe are not trustworthy. They have either used a smaller dataset or did not use the correct settings/implementation of the baselines.

    3. Swivel is probably the only work that can beat GloVe and word2vec on both analogy and similarity tasks (I had a similar idea to Swivel, but the Google researchers were faster than me…). But it has very high computational complexity, and it’s not easy for others to train.

© 2016 Delip Rao. All Rights Reserved.