Over one year ago I started thinking about indexing scored strings for auto-completion queries. I stumbled upon this problem after seeing the strength of the predictions of the typology approach for next-word prediction on smartphones. The typology approach had one major drawback: though its suggestions had high precision, at about 50 milliseconds per suggestion it was rather slow, especially for a server-side application.
- On August 16th, 2012 I found a first solution, building on Nicolai Diethelm's Suggest Tree. Though the speedup was great, the suggest tree at that time had several major drawbacks: (1) the number of suggestions had to be known before building the tree, (2) large memory overhead and high redundancy, and (3) no possibility of updating weights or even inserting new strings after building the tree. (The last two issues have been fixed just last month.)
- So I tried to find a solution with less redundancy. Still, for indexing gigabytes of 5-grams we needed a persistent method. We tried Lucene and MySQL in December and January. After seeing that MySQL does not provide any index suited to this kind of query, I decided to misuse MySQL's multidimensional trees in a highly redundant way, just to be able to evaluate the strength of typology on large data sets with gigabytes of n-grams. Creating one of the dirtiest hacks of my life, I could at least handle the data, but the solution was over-engineered and amounted to throwing hardware at the problem.
- After Christoph tried to solve this using bitmap indices, which was quite fast but had issues with scaling and index maintainability, we had a discussion, and the solution finally popped into my mind at the beginning of March this year.
Even though I had thought about scored tries before, they always had the problem that they could only find the top-1 element efficiently. Then I realized that one has to sort the children of each node by score and use a priority queue during retrieval. In this way one gets the best possible runtime. I was doing this in a rather redundant way, because I was aiming for fast prefix retrieval of the trie node and then fast retrieval of its top children.
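The retrieval idea can be sketched roughly as follows. This is a minimal Python sketch, not our actual implementation: here each node stores the maximum score in its subtree, which plays the same role as keeping children sorted by score, and a priority queue drives a best-first traversal so completions come out in descending score order.

```python
import heapq

class Node:
    def __init__(self):
        self.children = {}          # char -> Node
        self.score = None           # score if a complete string ends here
        self.best = float("-inf")   # maximum score anywhere in this subtree

class ScoredTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, word, score):
        node = self.root
        node.best = max(node.best, score)
        for ch in word:
            node = node.children.setdefault(ch, Node())
            node.best = max(node.best, score)
        node.score = score

    def top_k(self, prefix, k):
        # 1. Fast prefix retrieval: walk down to the trie node of the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # 2. Best-first search, ordered by the best score reachable below
        #    each node. Entries with node=None are finished completions.
        results = []
        heap = [(-node.best, prefix, node)]
        while heap and len(results) < k:
            neg, word, nd = heapq.heappop(heap)
            if nd is None:
                results.append((word, -neg))
                continue
            if nd.score is not None:
                heapq.heappush(heap, (-nd.score, word, None))
            for ch, child in nd.children.items():
                heapq.heappush(heap, (-child.best, word + ch, child))
        return results
```

For example, after inserting `("food", 10)`, `("foo", 7)` and `("football", 9)`, `top_k("foo", 2)` returns `[("food", 10), ("football", 9)]`. Only the paths that can still beat the current candidates are ever expanded, which is why the query touches far less of the trie than a full subtree scan.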
After I came up with my solution, and after talking to Lucene contributors from IBM in Haifa, I realized that Lucene had a pretty similar solution as a less popular "hidden feature", which I tested. In my experiments, however, the Lucene solution also showed a large memory overhead, so my friend Heinrich and I started to develop my trie-based solution and benchmark it against various baselines in order to produce solid results.
Development started last month and we made quite some progress. Our goal was always to be about as fast as Nicolai Diethelm's suggest tree without running into the drawbacks of his solution. In our coding session yesterday we realized that Nicolai had improved his data structure a lot, getting rid of the memory overhead and making it possible to update, insert and delete items in his index (though the number of suggestions still has to be known before the tree is built).
Yet while learning more about the ternary tree data structure he used to build his solution, I found a paper that will be presented TODAY at the WWW conference. Guess what: independently of us, Giuseppe Ottaviano explains in Chapter 4 the exact solution and algorithm that I came up with this March. Combined with an efficient trie implementation and many compression techniques (even respecting processor cache locality), he even beats Nicolai Diethelm's suggest tree.
I looked up Giuseppe Ottaviano, and the only two things I have to say are:
- Congratulations, Giuseppe. You have worked on this kind of problem for a long time and created an amazing paper. This is also reflected in the related work section and in all the small details of your paper that we were still in the process of figuring out.
- If anyone needs an auto-completion service, this is the way to go. Being able to serve suggestions from a dictionary with 10 million entries in a few microseconds (yes, micro, not milli!) means that a single computer can handle about 100,000 requests per second (at, say, 10 microseconds per suggestion), which is certainly web scale. Even the updated suggest tree by Nicolai is now a way to go, and maybe much easier to use, since it is Java-based rather than C++ and the full code is open source.
Anyway, life goes on. While thinking about the trie-based solution we already came up with a decent list of future work, which we can most certainly use for follow-up work, and I will certainly contact the authors; maybe a collaboration will be possible in the future.