Swivel by Google — a bizarre word embedding paper

A new word embedding paper came out of Google that promises to look at things missed by Word2Vec and Glove, provide a better understanding, and produce a better embedding. All word embedding schemes try to map a discrete word to a real vector. So if \Sigma is your vocabulary, a word embedding is a function \Phi:

\Phi: \Sigma \rightarrow \mathbb{R}^d
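
To make the definition concrete, here is a minimal sketch of \Phi as a lookup table; the toy vocabulary and dimension are made up for illustration:

```python
# A word embedding is just a lookup table Sigma -> R^d:
# one row of a real-valued matrix per element of the vocabulary.
import numpy as np

vocab = ["cat", "dog", "pizza"]                 # a toy Sigma
d = 4                                           # embedding dimension
index = {w: i for i, w in enumerate(vocab)}
E = np.random.randn(len(vocab), d)              # |Sigma| x d matrix

def phi(word):
    """Phi: Sigma -> R^d, realized as a row lookup."""
    return E[index[word]]

print(phi("cat"))                               # a length-d real vector
```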

Of course, the elements of \Sigma need not be words but any discrete “thing” — a feature, a named entity, what have you. While finding good embeddings has long been an art in science, the recent work on neural embeddings caught the imagination of audiences outside the traditional research community because 1) they’re easy to use, and 2) they work pretty well.

Way too many people have written about Word2Vec. Read them if you are new to this. Glove has done well on analogy tasks but just that (more on this later). This post is mostly to discuss the Swivel paper.

First off, the Swivel paper does one thing very well: it explains the differences between the various embedding methods in a very intuitive manner. Kudos to the authors for that. Now that that’s out of the way, I found the paper itself quite bizarre:

– The paper provides a lot of implementation detail on how to parallelize the embedding learning problem so we can deal with gigantic co-occurrence matrices. That’s nice. But this detail also confounds the paper. For example, there is an elaborate construction for how to shard the co-occurrence matrix (training happens on these individual shards). Is all that sorting and sharding done only to distribute jobs in a parameter-server setup and keep communication overhead low in the process? Or is it key to the learning? My guess is the former, but it’s hard to tease out the effects without direct experimentation (say, with a matrix that could easily fit in memory).
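
For what it’s worth, here is my rough reading of that sharding scheme: sort rows and columns by frequency, deal them out round-robin across shards, and train on the resulting sub-blocks. The shard counts and toy matrix below are assumptions, not the paper’s implementation:

```python
# Rough sketch of the sharding idea as I read it: sort row/column ids by
# frequency, assign them to shards round-robin (so each shard mixes frequent
# and rare items), and carve the co-occurrence matrix into sub-blocks that
# workers can train on independently. Toy sizes; not the paper's code.
import numpy as np

def shard_ids(frequencies, num_shards):
    order = np.argsort(-np.asarray(frequencies))    # most frequent first
    return [order[s::num_shards] for s in range(num_shards)]

X = np.random.poisson(1.0, size=(8, 8))             # toy co-occurrence counts
row_shards = shard_ids(X.sum(axis=1), num_shards=2)
col_shards = shard_ids(X.sum(axis=0), num_shards=2)

# Every (row shard, column shard) pair is one sub-matrix / training job.
blocks = [X[np.ix_(r, c)] for r in row_shards for c in col_shards]
print([b.shape for b in blocks])                    # four 4x4 blocks
```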

– “Swivel is an attempt to have our cake and eat it, too”. The central claim is that none of the competing methods gives any special treatment to unobserved word-context co-occurrences. Swivel deals with this by constructing a piecewise loss function, where the penalty depends on whether the (word, context) pair co-occurs in the training set. This is cute, and in general a useful trick to know when designing ML methods. It also highlights the beauty of computational graphs and automatic differentiation. But the construction of these loss functions is El Bizzaro. If the (word, context) pair co-occurs in training, nothing special is done. The loss function is like everything we’ve seen before: \mathcal{L}_0 = \frac{1}{2}f(x_{ij})(w_i^\intercal \tilde{w}_j - \mathbf{pmi}(i, j))^2. This is a weighted squared loss (sketched in code below). The weight (or “confidence”, as the paper calls it) is a monotone function of the count x_{ij}. However, the choice of that function in Swivel is nothing but black magic: f(x) = 0.1 + \frac{1}{4}\sqrt{x}.

 

WTF
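
For concreteness, here is roughly what that observed-pair loss looks like in code. The PMI bookkeeping and the toy counts are my own filling-in, not lifted from the paper’s implementation:

```python
# Sketch of the observed-pair loss L0: squared error between the dot product
# and PMI, weighted by the "confidence" f(x) = 0.1 + sqrt(x)/4.
# The PMI computation and toy numbers below are illustrative assumptions.
import numpy as np

def confidence(x):
    return 0.1 + 0.25 * np.sqrt(x)

def pmi(x_ij, x_i, x_j, x_total):
    return np.log(x_ij * x_total / (x_i * x_j))

def loss_observed(w_i, w_j_tilde, x_ij, x_i, x_j, x_total):
    err = w_i @ w_j_tilde - pmi(x_ij, x_i, x_j, x_total)
    return 0.5 * confidence(x_ij) * err ** 2

rng = np.random.default_rng(0)
w_i, w_j_tilde = rng.normal(size=4), rng.normal(size=4)
print(loss_observed(w_i, w_j_tilde, x_ij=12, x_i=100, x_j=80, x_total=10_000))
```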

– Now for the other part of the loss function — cases where the (word, context) pair is not seen. For this case a count of 1 is assumed, and a different loss function is used: \mathcal{L}_1 = \log[1 + \exp(w_i^\intercal \tilde{w}_j - \mathbf{pmi^*}(i, j))]. If you are an NLP researcher, you might be asking yourself: a) why just assume the count is 1, as opposed to doing a proper add-1 smoothing? b) why not do add-\lambda, or, even better, Good-Turing smoothing? I’m inclined to think this is purely for efficiency. It is hacky, but at this point the hacks don’t surprise us in Deep Learning. Hey, it works. Two more questions about this loss function: why not just leave it as (w_i^\intercal \tilde{w}_j - \mathbf{pmi^*}(i, j))? After some “careful” calculation on a napkin, I don’t see the “1 + ” serving any role here, but I could be wrong. I’m willing to bet that if you just used an \ell_1 loss, you would get comparable results. A valid question at this point would be: why not use the other loss function, \mathcal{L}_0, for this case as well, just with the count set to 1?
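
And here is the unobserved-pair branch as I read it, with \mathbf{pmi^*} computed from the assumed count of 1. Again, the marginal bookkeeping is my own sketch, not the paper’s released code:

```python
# Sketch of the unobserved-pair loss L1: assume a count of 1, compute the
# smoothed pmi*, and apply log(1 + exp(.)) -- the softplus / "soft hinge".
# Marginal counts and sizes below are illustrative assumptions.
import numpy as np

def loss_unobserved(w_i, w_j_tilde, x_i, x_j, x_total):
    pmi_star = np.log(1.0 * x_total / (x_i * x_j))   # pmi with an assumed count of 1
    z = w_i @ w_j_tilde - pmi_star
    return np.log1p(np.exp(z))                        # penalizes over-estimating pmi*

rng = np.random.default_rng(1)
w_i, w_j_tilde = rng.normal(size=4), rng.normal(size=4)
print(loss_unobserved(w_i, w_j_tilde, x_i=100, x_j=80, x_total=10_000))
```

A full training step would simply dispatch between \mathcal{L}_0 and \mathcal{L}_1 depending on whether x_{ij} is zero.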

– The most useful part of this paper is the chart showing, for the word analogy task, how performance varies with the frequency of the words involved in each analogy.

[Chart from the Swivel paper: word analogy performance broken down by word frequency]

Basically, it tells us four things: 1) you are out of luck if the words you are dealing with are all very frequent (typically stopwords and polysemous words), 2) for rare words, Word2Vec is better than Glove, 3) for everything else, there’s not much difference between Word2Vec and Glove (my personal experience too — actually, Word2Vec has almost always worked better for me), 4) Swivel does better than both Glove and Word2Vec on the word analogy task (but is word analogy even the right task for evaluating learned representations? That’s a different debate…).

To their credit, the authors caveat this:

While this works well, it does seem ad hoc: we hope that future investigation can yield a more principled approach.

Overall, a nice read, but unless I see consistent wins from Swivel across a wide variety of tasks in the literature, I will stick with Word2Vec and task-specific embeddings.

  • Thanks for sharing your thoughts on this. The loss function L1 is actually the “soft hinge” or log loss for a perceptron, well known in the neural net community, so the choice is not that arbitrary. Here are some slides on this by Yann LeCun: http://www.mit.edu/~9.520/spring07/Classes/lecun-20070502-mit.pdf
    and this paper: http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf

    • Delip Rao

      Daniel, the question isn’t whether the loss function expression is arbitrary; it’s about the choice of that loss function. Why not the \ell_1 loss here instead? If you tweak the \mathcal{L}_1 expression a bit, my comment should become obvious.
