How Generalized Language Models outperform Modified Kneser-Ney Smoothing with a perplexity reduction of up to 25%

After two years of hard work I can finally and proudly present the core of my PhD thesis. Starting from Till Speicher and Paul Georg Wagner implementing one of my ideas for next word prediction as an award-winning project for the Young Scientists competition, and after several iterations over this idea which gave me a deeper understanding of what I am actually doing, I have developed the theory of Generalized Language Models and evaluated its strength together with Martin Körner over the last years.
As I will present in this blog article, and as you can read in my publication (ACL 2014), Generalized Language Models seem to outperform Modified Kneser-Ney Smoothing, which has been accepted as the de facto state-of-the-art method for the last 15 years.

So what is the idea of Generalized Language Models in plain, non-scientific terms?

When you want to assign a probability to a sequence of words you will run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and by interpolating higher-order models (models over longer word sequences) with lower-order language models. While this idea is strong and helpful, it is usually applied in only one way: in order to use a shorter model, the first word of the sequence is omitted, and this is iterated. The problem occurs if one of the last words of the sequence is the really rare word; in that case omitting words at the front does not help.
So the simple trick of Generalized Language Models is to smooth a sequence of n words with n-1 shorter models which skip a word at position 1 to n-1, respectively.
Then we combine everything with Modified Kneser-Ney Smoothing, just as it was done with the previous smoothing methods.
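To make the difference concrete, here is a minimal Python sketch (illustrative only, not the code we used) that contrasts the standard back-off histories with the skipped histories of the generalized model; the '_' marks the skipped position:

    def backoff_histories(words):
        """Standard lower-order models: drop words from the front of the
        history, one at a time; the last word is the one being predicted."""
        history, target = words[:-1], words[-1]
        return [(history[i:], target) for i in range(1, len(history))]

    def glm_skip_histories(words):
        """Generalized lower-order models: keep the history length but skip
        exactly one position (marked '_') instead of always the first one."""
        history, target = words[:-1], words[-1]
        return [(history[:i] + ['_'] + history[i + 1:], target)
                for i in range(len(history))]

    seq = ['the', 'quick', 'brown', 'fox', 'jumps']
    print(backoff_histories(seq))
    # [(['quick', 'brown', 'fox'], 'jumps'), (['brown', 'fox'], 'jumps'), (['fox'], 'jumps')]
    print(glm_skip_histories(seq))
    # [(['_', 'quick', 'brown', 'fox'], 'jumps'), (['the', '_', 'brown', 'fox'], 'jumps'),
    #  (['the', 'quick', '_', 'fox'], 'jumps'), (['the', 'quick', 'brown', '_'], 'jumps')]

Even if a rare word sits in the middle or at the end of the history, at least one of the skipped models still provides useful counts.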

Why would you do all this stuff?

Language models have a huge variety of applications, such as spell checking, speech recognition, next word prediction (autocompletion), machine translation, question answering, …
Most of these problems make use of a language model at some point. Creating language models with lower perplexity lets us hope to increase the performance of the above-mentioned applications.

Evaluation setup, methodology, download of data sets and source code

The data sets come in the form of structured text corpora which we cleaned from markup and tokenized to generate word sequences.
We filtered the word tokens by removing all character sequences which did not contain any letter, digit, or common punctuation mark.
Eventually, the word token sequences were split into word sequences of length n which provided the basis for the training and test sets for all algorithms.
Note that we did not perform case-folding nor did we apply stemming algorithms to normalize the word forms.
Also, we did our evaluation using case-sensitive training and test data.
Additionally, we kept all tokens for named entities such as names of persons or places.
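As an illustration, the token filter described above could look roughly like the following Python sketch (the exact regular expression and the set of punctuation marks we used may differ):

    import re

    # A token is kept if it contains at least one letter, digit, or common
    # punctuation mark. \w covers letters and digits (Unicode-aware in
    # Python 3); the punctuation set here is only an example.
    KEEP = re.compile(r"[\w.,;:!?-]")

    def filter_tokens(tokens):
        return [t for t in tokens if KEEP.search(t)]

    print(filter_tokens(['Hello', ',', 'world', '###', '2014', '***']))
    # ['Hello', ',', 'world', '2014']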
All data sets have been randomly split into a training and a test set on a sentence level.
The training sets consist of 80% of the sentences, which have been used to derive n-grams, skip n-grams and corresponding continuation counts for values of n between 1 and 5.
Note that we have trained a prediction model for each data set individually.
From the remaining 20% of the sequences we have randomly sampled a separate set of 100,000 sequences of 5 words each.
These test sequences have also been shortened to sequences of length 3 and 4 and provide the basis to conduct our final experiments to evaluate the performance of the different algorithms.
We learnt the generalized language models on the same split of the training corpus as the standard language model using modified Kneser-Ney smoothing and we also used the same set of test sequences for a direct comparison.
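The following sketch shows how such a split and sampling step could be implemented in Python (names and details are illustrative, not the actual experiment code):

    import random

    def split_and_sample(sentences, train_fraction=0.8, n_test=100_000,
                         seq_len=5, seed=42):
        """Sentence-level train/test split, then sampling of fixed-length
        word sequences from the test part."""
        rng = random.Random(seed)
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]

        # All windows of seq_len consecutive words in the test sentences.
        windows = [s[i:i + seq_len]
                   for s in test
                   for i in range(len(s) - seq_len + 1)]
        test_seqs = rng.sample(windows, min(n_test, len(windows)))

        # Shorter test sequences (length 4 and 3) obtained by truncation;
        # which end of the sequence is kept is a detail of the original setup.
        return train, test_seqs, [s[:4] for s in test_seqs], [s[:3] for s in test_seqs]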
To ensure rigour and openness of research you can download the data set for training as well as the test sequences and you can download the entire source code.
We compared the probabilities of our language model implementation (which is a subset of the generalized language model) using KN as well as MKN smoothing with the Kyoto Language Model Toolkit. Since we got the same results for small n and small data sets we believe that our implementation is correct.
In a second experiment we have investigated the impact of the size of the training data set.
The wikipedia corpus consists of 1.7 bn. words.
Thus, the 80% split for training consists of 1.3 bn. words.
We have iteratively created smaller training sets by decreasing the split factor by an order of magnitude.
So we created an 8% / 92% split, a 0.8% / 99.2% split, and so on.
We have stopped at the 0.008% / 99.992% split, as the training data set in this case consisted of fewer words than our 100k test sequences, which we still randomly sampled from the test data of each split.
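In code, the shrinking training sets could be produced roughly as follows (a sketch assuming a simple random sample of sentences for each split factor):

    import random

    def shrinking_training_sets(sentences, seed=42):
        """Training splits of 80%, 8%, 0.8%, 0.08% and 0.008% of the
        sentences, each an order of magnitude smaller than the previous one."""
        rng = random.Random(seed)
        fractions = [0.8 / 10 ** k for k in range(5)]
        return [rng.sample(sentences, int(len(sentences) * f)) for f in fractions]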
Then we trained a generalized language model as well as a standard language model with modified Kneser-Ney smoothing on each of these samples of the training data.
Again we have evaluated these language models on the same random sample of 100,000 sequences as mentioned above.
We have used perplexity as the standard metric to evaluate our language models.
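For reference, perplexity is the inverse geometric mean of the probabilities the model assigns to the test data. A minimal sketch of how it can be computed over a set of fixed-length test sequences, assuming a hypothetical model_prob(history, word) function that returns P(word | history); the sketch scores only the last word of each sequence, which is one common evaluation choice:

    import math

    def perplexity(test_sequences, model_prob):
        """Perplexity = 2 ** cross-entropy, where each test sequence
        contributes the probability of its last word given the preceding words."""
        log_prob_sum = 0.0
        for seq in test_sequences:
            history, word = seq[:-1], seq[-1]
            log_prob_sum += math.log2(model_prob(history, word))
        return 2 ** (-log_prob_sum / len(test_sequences))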

Results

As a baseline for our generalized language model (GLM) we have trained standard language models using modified Kneser-Ney Smoothing (MKN).
These models have been trained for model lengths 3 to 5.
For unigram and bigram models MKN and GLM are identical.
The perplexity values for all data sets and various model orders can be seen in the next table.
In this table we also present the relative reduction of perplexity in comparison to the baseline.

Absolute perplexity values and relative reduction of perplexity from MKN to GLM on all data sets for models of order 3 to 5

As we can see, the GLM clearly outperforms the baseline for all model lengths and data sets.
In general we see a larger improvement in performance for models of higher orders (n=5).
The gain for 3-gram models, in contrast, is negligible.
For German texts the increase in performance is the highest (12.7%) for a model of order 5.
We also note that GLMs seem to work better on broad-domain text rather than special-purpose text, as the reduction of perplexity on the wiki corpora is consistently higher than the reduction of perplexity on the JRC corpora.
We made consistent observations in our second experiment where we iteratively shrank the size of the training data set.
We calculated the relative reduction in perplexity from MKN to GLM for various model lengths and the different sizes of the training data.
The results for the English Wikipedia data set are illustrated in the next figure:
Variation of the size of the training data on 100k test sequences on the English Wikipedia data set with different model lengths for GLM.

We see that the GLM performs particularly well on small training data.
As the size of the training data set becomes smaller (even smaller than the evaluation data), the GLM achieves a reduction of perplexity of up to 25.7% compared to language models with modified Kneser-Ney smoothing on the same data set.
The absolute perplexity values for this experiment are presented in the next table.
Our theory as well as the results so far suggest that the GLM performs particularly well on sparse training data.
This conjecture has been investigated in a last experiment.
For each model length we have split the test data of the largest English Wikipedia corpus into two disjoint evaluation data sets.
The data set unseen consists of all test sequences which have never been observed in the training data.
The set observed consists only of test sequences which have been observed at least once in the training data.
Again we have calculated the perplexity of each set.
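A minimal sketch of how such a partition can be derived, assuming the training sequences are available as a set (names hypothetical, not the actual experiment code):

    def split_unseen_observed(test_sequences, training_sequences):
        """Partition test sequences into those never seen in the training
        data ('unseen') and those observed at least once ('observed')."""
        seen = {tuple(s) for s in training_sequences}
        unseen = [s for s in test_sequences if tuple(s) not in seen]
        observed = [s for s in test_sequences if tuple(s) in seen]
        return unseen, observed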
For reference, the values for the complete test data set are also shown in the following table.
Absolute perplexity values and relative reduction of perplexity from MKN to GLM for the complete test file as well as the split into observed and unseen sequences, for models of order 3 to 5. The data set is the largest English Wikipedia corpus.

As expected we see the overall perplexity values rise for the unseen test case and decline for the observed test case.
More interestingly we see that the relative reduction of perplexity of the GLM over MKN increases from 10.5% to 15.6% on the unseen test case.
This indicates that the superior performance of the GLM on small training corpora and for higher order models indeed comes from its good performance properties with regard to sparse training data.
It also confirms that our motivation to produce lower-order n-grams by systematically omitting each word of the local context, and not only the first word, has been fruitful.
However, we also see that for the observed sequences the GLM performs slightly worse than MKN, although we find the relative change to be negligible.

Conclusion and links

With these improvements we will continue to evaluate other methods of generalization and also try to see whether the novel methodology works well in the applications of language models. You can find more resources at the following links:

If you have questions, research ideas or want to collaborate on one of my ideas feel free to contact me.
