suggest tree – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany https://www.rene-pickhardt.de Extract knowledge from your data and be ahead of your competition Tue, 17 Jul 2018 12:12:43 +0000 en-US hourly 1 https://wordpress.org/?v=4.9.6 The best way to create an autocomplete service: And the winner is…. Giuseppe Ottaviano https://www.rene-pickhardt.de/the-best-way-to-create-an-autocomplete-service-and-the-winner-is-giuseppe-ottaviano/ https://www.rene-pickhardt.de/the-best-way-to-create-an-autocomplete-service-and-the-winner-is-giuseppe-ottaviano/#respond Wed, 15 May 2013 13:06:42 +0000 http://www.rene-pickhardt.de/?p=1594 Over one year ago I was starting to think about indexing scored stings for auto completion queries. I stumbled upon this problem after seeing the strength of the predictions of  the typology approach for next word prediction on smartphones. The typology approach had one major drawback: Though its suggestions had a high precision the speed with 50 milliseconds per suggestion was rather slow especially when working on a server side application.

  • On August 16 th 2012 I found a first solution building on Nicolai Diethelms Suggest Tree. Though the speedup was great the suggest tree at this time had several major drawbacks (1. the amount of suggestions had to be known before building the tree 2.) large memory overhead and high redundancy 3.) no possibility of updating weights or even inserting new strings after building the tree) (the last 2 issues have been fixed just last month)
  • So I tried to find a solution which required less redundancy. But still for indexing Gigabytes of 5-grams we needed a persistent method. We tried Lucene and MySql in December and January. After seeing that MySQL does not provide any indices for this kind of query I decided to  misuse multidimensional trees of MySQL in a highly redundant way to somehow be able to evaluate the strength of typology on large data sets with gigabytes of n-grams. Creating one of the dirtiest hacks in my life I could at least handle the data but the solution was rather engineered and consisted of throwing hardware at the problem.
  • After Christoph tried to solve this using bitmap indices which was quite fast but had issues with scaling and index maintainability we had a discussion and finally the solution popped in my mind in the beginning of march this year.

Even though I was thinking of scored tries before they always lacked the problem that they could only find the top-1 element efficiently. Then I realized that one had to sort the children of a node by score and use a priority queue during retrieval. In this way one would get the maximum possible runtime I was doing this in a rather redundant way because I was aiming for fast prefix retrieval of the trie node and then fast retrieval of the top children.
After I came up with my solution and after talking to Lucene contributers from IBM in Haifa I realized that Lucene had a pretty similar solution as a less popular “hidden feature” which I tested. Anyway in my experiment I also experienced a large memory overhead with the Lucene solution so my friend Heinrich and me started to develop my trie based solution and benchmark it with various baselines in order to produce a good solid output.
The developement started last month and we had quite some progress. Our goal was always to be about as fast as Nicolai Diethelms suggest tree but not running into all the drawbacks of his solution. In our coding session yesterday we realized that Nicolai improved his data structure a lot by getting rid of his memory overhead and also being able to update, insert and delete new items to his index (still the amount of suggestions has to be known before the tree was build)
Yet while learning more about the ternary tree data structure he used to build up his solution I found a paper that will be presented TODAY at WWW conference. Guess what: Independently of us Giuseppe Ottaviano explains in Chapter 4 the exact solution and algorithm that I came up with this march. Combined with an efficient implementation of the tries and many compression techniques (even respecting cache locality of the processor) he even beats Nicolai Diethelms suggest tree. 
I looked up Giuseppe Ottaviano and the only thing two things I have to say are:

  1. Congratulations Giuseppe. You really worked on that kind of problems for a long time an created an amazing paper. This is also reflected by the related work section and all the small details that are in your paper which we were still in the process of figuring out. 
  2. If anyone needs an auto completion service this is the way to go. Being able to provide suggestions from a dictionary with 10 Mio. entries  in a few micro seconds (yes micro not milli!) means that a single computer can handle about 100’000 requests per second which is certainly web scale.  Even the updated suggest tree by Nicolai is now the way to go and maybe much easier to use since it is java based and not C++ and the full code is open sourced.
Ok so much for the history of events and the congratulations to Giuseppe. I am happy to see that the algorithm really performs that well but there is one little thing that really bothers me a lot: 
 
How come our community of researchers hasn’t come up with a good way of sharing credits to a person like me who came up independently with the solution? As for me I feel that the strongest chapter of my dissertation just collapsed and one year of research just burnt away. I mean personally I gained and learnt a lot from it but from a carrier point of view this seems rather like a huge drawback.

Anyway life goes on and by thinking about the trie based solution we already came up with a decent list of future work which we can most certainly use for follow up work and I will certainly contact the authors maybe a collaboration in future will be possible. 

]]>
https://www.rene-pickhardt.de/the-best-way-to-create-an-autocomplete-service-and-the-winner-is-giuseppe-ottaviano/feed/ 0
Typology Oberseminar talk and Speed up of retrieval by a factor of 1000 https://www.rene-pickhardt.de/typology-oberseminar-talk-and-speed-up-of-retrieval-by-a-factor-of-1000/ https://www.rene-pickhardt.de/typology-oberseminar-talk-and-speed-up-of-retrieval-by-a-factor-of-1000/#comments Thu, 16 Aug 2012 11:39:25 +0000 http://www.rene-pickhardt.de/?p=1396 Almost 2 months ago I talked in our oberseminar about Typology. Update: Download slides Most readers of my blog will already know the project which was initially implemented by my students Till and Paul. I am just about to share some slides with you. They explain on one hand how the systems works and on the other hand give some overview of the related work.
As you can see from the slides we are planning to submit our results to SIGIR conference. So one year after my first blogpost on graphity which devoloped in a full paper for socialcom2012 (graphity blog post and blog post for source code) there is the yet informal typology blog post with the slides about the Typology Oberseminar talk and 3 months left for our SIGIR submission. I expect this time the submission will not be such a hassle as graphity since I shuold have learnt some lessons and also have a good student who is helping me with the implementation of all the tests.
Additionally I have finally uploaded some source code to git hub that makes the typology retrieval algorithm pretty fast. There are still some issues with this code since it lowers the quality of predictions a little bit. Also the index has to be built first. Last but not least the original SuggestTree code did not save the weights of the items to be suggested. I need those weights in the aggregation phase. Since i did not want to extend the original code I placed the weights at the end of the suggested Items. This is a little inefficent.
The main idea why retrieval speeds up with the new algorithm is that typology needs to make sorting over all outedges of a node. This is rather slow especially if one only needs the top k elements. Since neo4j as a graph data base does not provide indices for this kind of data I was forced to look for another way to presort the data. Additionally if a prefix is known one does not have to look at all outgoing edges. I found the Suggest Tree class by Nicolai Diethelm. Which solved the problem in a very good way and lead to such a great speed. The index is not persistent yet and it also needs quite some memory. On the other hand for every node a suggest tree is built. This means that the index can be distributed in a very easy manner over several machines allowing for horizontal scaling!
Anyway the old algorithm was only able to handle like 20 requests per second and now we have something like 14 k requests and as I mentioned there is still a little space for more (:
I hope indices like this will be standard in neo4j soon. This would open up the range of applications that could make good use of neo4j.
Like always I am happy for any suggestions and I am looking forward to do the complete evaluation and paper writing for typology.

]]>
https://www.rene-pickhardt.de/typology-oberseminar-talk-and-speed-up-of-retrieval-by-a-factor-of-1000/feed/ 2