## Can you compare perplexity across different segmentations?

2019-04-23
In recent years, models in NLP have strayed from the old assumption that the word is the atomic unit of choice: subword-based models (using BPE or sentencepiece) and character-based (or even byte-based!) models have proven to be very effective for many tasks, allowing for a small vocabulary size that can still cover all possible occuring text (i.e., that is open-vocabulary).
With all this change comes a different question though: can we compare metrics over words with those over characters or subword units?
In this post, we will be looking at at metrics for *language models*, specifically the most popular one: perplexity.

### What is perplexity?

“Perplexity is the exponentiated average negative log-likelihood per token.” What does that mean? Fundamentally, a language model is a probability distribution that assigns probabilities to entire strings, for example:

\[ \begin{align*} p(\texttt{the cat sat on the mat}) &= 0.000000000341\\ p(\texttt{the mat sat on the cat}) &= 0.000000000239\\ p(\texttt{the cat the on mat mat}) &= 0.000000000001\\ \end{align*} \]

These probabilities are extremely low because there are so many sentences that the distribution all has to cover!
^{[1]}

Say, the first sentence is our test data. Then, given its likelihood under our model, we can compute a *perplexity per word*, counting the \(\mathrm{EOS}\) (end of string) as a seventh word:^{[2]}
\[ ppl_{word} = \exp\dfrac{-\log 0.000000000341}{6+1} = 22.5 \]

This perplexity is what people usually mean when they say “perplexity”: the perplexity per word on the test data. But we can compute other perplexities, too! The sentence had \(6+1\) words, yes, but it also had \(22+1\) characters: \[ ppl_{word} = \exp\dfrac{-\log 0.000000000341}{22+1} = 2.7 \]

Note how we only changed the denominator! Both these perplexities have an intuitive interpretation: the first one tells us that the model was “trying to pick” between about 23 words, the second one that it had a choice of two to three characters it was usually unsure between. Or more precisely: if we let it guess token by token (i.e., word by word or character by character), on average it would have taken 22.5 tries and 2.7 tries, respectively, to find the right one.^{[3]}

### Decomposing the string into tokens

If you have seen a standard exposition to language models, you might be a bit unhappy with what we've done so far: we never really established *how* the probability value of \(0.000000000341\) was calculated!
Let's remedy that.

Language models are usually implemented as *autoregressive* models, i.e., they predict the output piece by piece, each new piece *conditioned* on the previously generated pieces:
\[ p(\texttt{the cat sat}) = p(\texttt{the} | \varepsilon) \cdot p(\texttt{cat} | \texttt{the}) \cdot p(\texttt{sat} | \texttt{the cat}) \cdot p(\mathrm{EOS} | \texttt{the cat sat}) \]
where \(\varepsilon\) refers to the empty string and the equality holds true because of the chain rule of probability. This decomposition makes things easy enough: instead of trying to assign probabilities to infinitely many things all at once, we now only have to assign probabilities to a finite vocabulary at each *timestep* — easy enough with a softmax layer in a neural model.

Now you should see why people report perplexities per word if this is indeed the decomposition they choose! Say a word-based RNN/Transformer language model predicts: \[ \begin{align*} p(\texttt{the} | \varepsilon) &= 0.01 \\ p(\texttt{cat} | \texttt{the}) &= 0.001 \\ p(\texttt{sat} | \texttt{the cat}) &= 0.008 \\ p(\mathrm{EOS} | \texttt{the cat sat}) &= 0.04 \\ \end{align*} \] \[ \Rightarrow p(\texttt{the cat sat}) = 0.1 \cdot 0.01 \cdot 0.008 \cdot 0.04 = 0.00000032 \]

Then we can of course compute the perplexity as we have before: \[ ppl_{word} = \exp\dfrac{-\log 0.00000032}{3+1} = 42.04 \] But it is common in modern LM implementations to instead just use the token-level probabilities to compute the average: \[ ppl_{word} = \exp\dfrac{-(\log 0.01 + \log 0.001 + \log 0.008 + \log 0.04)}{3+1} = 42.04 \]

### Different decompositions

Of course, this decomposition is only one of many! If we do our prediction character by character we end up with: \[ p(\texttt{the cat sat}) = p(\texttt{t} | \varepsilon) \cdot p(\texttt{h} | \texttt{t}) \cdot p(\texttt{e} | \texttt{th}) \cdot \ldots \cdot p(\mathrm{EOS} | \texttt{the cat sat}) \] And again we could compute a perplexity, this time over characters: \[ ppl_{char} = \exp\dfrac{-(\log p(\texttt{t} | \varepsilon) + \ldots + \log p(\mathrm{EOS} | \texttt{the cat sat}))}{11+1} = 3.48 \] assuming that indeed the character-level model is equally good (i.e., assigns the entire string the same likelihood) as the word-level model.

Both these decompositions are equally valid — they result in a distribution over the same strings! That is, as long as...

### “The importance of being open-vocab” or “What strings does the model cover?”

One potential confound we have ignored so far is the fact that word-level models tend to have what is called a “closed vocabulary” — they cannot produce all words but only those that are in its vocabulary. That means that \(\texttt{cat}\) is okay, but \(\texttt{wolpertinger}\) may not be. A word-level model thus could not explain (i.e., assign positive probability to) the sentence \(\texttt{the wolpertinger sat}\) — a character-level model on the other hand would have little issue predicting that novel word one character at a time, yielding a positive, even if small, probability for the entire sentence.

In summary, the character-level model assigns probability to all sorts of novel words, in fact, to infinitely many of them. All this probability mass that it loses on them, the word-level model can use to make \(\texttt{the cat sat}\) and other simple in-vocabulary sentences more likely. Or, on our example, if we imagine that \(\texttt{the cat sat}\) and \(\texttt{the wolpertinger sat}\) were the only two sentences that existed: because of \[ p_{word}(\texttt{the wolpertinger sat}) < p_{char}(\texttt{the wolpertinger sat}) \] we have that \[ p_{word}(\texttt{the cat sat}) > p_{char}(\texttt{the cat sat}) \]

So, in a sense the closed-vocabulary can “cheat” by ignoring all these low-probability words while the character-level model has to spread its probability mass thin over them. Unfair!
Specifically, if we evaluate these two models on a test set that doesn't contain any of these novel words, the closed-vocabulary model will end up with a much lower perplexity.^{[4]}

Takeaway: we can only compare distributions on likelihood-based metrics like perplexity if they have support on the exact same set of strings.

Corollary: closed-vocab perplexities and open-vocab perplexities are not comparable at all.

To make this divide more explicit, the open-vocab language modeling community has generally avoided measuring “word-level perplexity” and instead reports “bits per character”, i.e., the average negative log-likelihood over characters, using the dual logarithm.

Finally, it should be said that “word-level == closed-vocab” and “character-level == open-vocab” is roughly true, but only part of the whole story.
For one, augmenting word-level models with character-level information can make them open-vocab — and vice versa, augmenting character-level models with word information can help, too (and a number of methods have been proposed to do these things).
More critically, however, even a character-level model assumes a closed set of characters — but with the Unicode consortium coming up with new enojis every year, that’s not really correct either. ^{[5]}

### How then do we compare on, say, subword units?

Where in this divide do subword units fall? The big selling point for both BPE and sentencepiece is that they do indeed allow open-vocab language modeling, much like a pure character-level model, but also allow the formation of bigger word-like tokens to simplify the prediction task (shorter sequences == easier modeling for the RNN), i.e., instead of splitting \(\texttt{the deforestation}\) into \(\texttt{t h e _ d e f o r e s t a t i o n}\) we split it into \(\texttt{the de@@ forest@@ ation}\), where @@ signifies that a word isn't finished yet. Because this “splitting” can go down all the way to characters, we will always find a way to cover any string, so we are open-vocab and thus comparable to characters. We are also comparable between different numbers of BPE merges or sentencepiece splits or even between these two different methods!

We should be more explicit: when we say “comparable”, it should be clear that we are not talking about the perplexity per prediction token (i.e., here per subword unit) that is usually reported by your RNNLM toolkit — those are very much not comparable. But the perplexities per character or per word — or really any metric whose denominator does *not* depend on the segmentation — are comparable!

Very practically, consider the segmentations \(s_1 = \texttt{the de@@ forest@@ ation}\) and \(s_2 = \texttt{the defor@@ estation}\). If someone handed you their precomputer perplexities-per-subword unit for these two, say, \(ppl^{sw}_1 = 19\) and \(ppl^{sw}_2 = 24\), you now know that these aren't directly comparable, but you can compute comparable metrics:

First, get the total negative log-likelihood of the entire string: \[ \begin{align*} nll_1 &= \log 19 \cdot (4+1) &= 14.7 \\ nll_2 &= \log 24 \cdot (3+1) &= 12.7 \\ \end{align*} \]

These numbers you can already fairly compare (and you will see that the second model, despite its “higher subword perplexity” is actually the better one), but if you prefer word-level perplexities, you can compute these, too: \[ \begin{align*} ppl^w_1 &= \exp\dfrac{14.7}{2+1} &= 134.3 \\ ppl^w_2 &= \exp\dfrac{12.7}{2+1} &= 68.9 \\ \end{align*} \]

You could even compute character-level perplexities: \[ \begin{align*} ppl^c_1 &= \exp\dfrac{14.7}{16+1} &= 2.37 \\ ppl^c_2 &= \exp\dfrac{12.7}{16+1} &= 2.11 \\ \end{align*} \]

Note that there is a little caveat here: we assume that one string corresponds to exactly one segmentation. That's not quite true, but that problem is a much trickier one.^{[6]}

### Conclusion

Language models are distributions over strings. The fact that we implement them as autoregressive models over tokens at most changes their support, i.e., whether they can deal with all strings or only those that their vocabulary covers.

Perplexities over different segmentation granularities (i.e., words, subwords, or characters) aren't directly comparable, but the log-likelihoods that are hidden inside them are — and those can always be converted to perplexities for any given level. Fix the level, say words, and you can (and should!) compare.

*Thanks to Annabelle Carell, Ryan Cotterell, Jason Eisner, Matthew Francis-Landau, and Chu-Cheng Lin for their proof-reading and suggestions!*

### Interested in follow-up reading?

We used this realization that equal denominators matter in our NAACL 2018 paper “Are All Languages Equally Hard to Language-Model?”, to compare language models across different languages even — you should find the idea there fairly obvious after reading this post.

If you are more curious, in our AAAI 2019 paper “Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model”, we took this a step farther and compared a likelihood-based metric not only across subword unit segmentations, but even across tokenization — if that tokenization is reversible! If it is, i.e., if ever tokenized string corresponds to exactly one untokenized string, then any distribution (i.e., any language model) we obtain over the tokenized text implies one over the untokenized text — they have the same support and that's all we needed.

In fact, there are infinitely many sentences that all should receive positive probability. However, because the infinitely large set of all sentences is still countable, it should not be hard to think of a distribution that can cover these. If you have trouble believing that, consider the Poisson distribution or the geometric distribution that also assigns positive probability to the countably infinite set of positive integers. ↩

We need an \(\mathrm{EOS}\) symbol to make sure we obtain the probability of the string as an entire string and not just as a prefix of some longer string. This will be more obvious in the next section. ↩

This “guessing game” is known as the Shannon Game and is in fact one of the origins of all this theory! Check out the original paper from 1950. ↩

Of course, this means that if your open-vocab character-level model still beats a closed-vocab word-level model on this metric, you know that it really is better. In practice, however, that is really hard to achieve — a closed vocabulary usually gives you a very pronounced advantage in these scores that is hard to overcome with just better models. ↩

And even if it weren’t, building a character-level model over the thousands of CJK characters sounds like an equally wasteful task. It’s turtles all the way down… to bytes? Bits? ↩

It is true that your segmentation tool will give you one segmentation, but that's also not the entire truth: even if the segmentation tool gives you \(\texttt{rain@@ ing}\) as the canonical segmentation of \(\texttt{raining}\), had the RNN over these units produced \(\texttt{rain@@ i@@ n@@ g}\) (and all these things are necessarily in its vocabulary!), we would have just as happily taken it. So, the issue is that \(p(\texttt{raining})\) is not just \(\texttt{rain@@ ing}\), but the sum of all probabilities for segmented strings that are consistent with this string (a set way too large to enumerate)! Is that an issue in practice? Most likely not. We found that in our experiments for this paper the difference was neglegible. Still, make claims of a BPE model being worse in likelihood with some caution, lest you forget these other segmentations! ↩