auto completion – Data Science, Data Analytics and Machine Learning Consulting in Koblenz, Germany
https://www.rene-pickhardt.de – Extract knowledge from your data and be ahead of your competition

GWT + database connection in Servlet ContextListener – Auto Complete Video Tutorial Part 5
https://www.rene-pickhardt.de/gwt-database-connection-in-servlet-contextlistener-auto-complete-video-tutorial-part-5/
Mon, 24 Jun 2013 11:44:47 +0000

Finally we have all the basics that are needed for building an autocomplete service, and now comes the juicy part. From now on we are looking at how to make it fast and robust. In the current approach we open a new database connection for every HTTP request. This takes quite some time to lock the database (at least when using neo4j in embedded mode) and then to run the query, without any opportunity to use the database's caching.
In this tutorial I will introduce you to the concept of a ContextListener. Roughly speaking, this is a way of storing objects in the Java servlet container's global memory as key-value pairs. Once we understand this, the roadmap is very clear: we can store objects like database connections or search indices in the memory of our web server. As far as I currently understand, this could also be used to implement some server-side caching. I have not done any benchmarking yet to test how fast retrieving objects from the context works in Tomcat. Also, this method of caching does not scale horizontally as well as memcached.
Anyway, have fun learning about the ContextListener.
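To make the pattern concrete, here is a minimal sketch of such a listener. It assumes an embedded Neo4j database and the Neo4j 1.9/2.x-era Java API; the attribute key and database path are made-up illustrations, not taken from the video:

```java
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Opens the embedded database once at deployment time and stores it in the
// ServletContext so that every servlet can reuse the same connection.
public class DatabaseContextListener implements ServletContextListener {

    public static final String DB_KEY = "graphDB";                    // illustrative key
    private static final String DB_PATH = "/var/lib/autocomplete-db"; // illustrative path

    private GraphDatabaseService graphDb;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(DB_PATH);
        sce.getServletContext().setAttribute(DB_KEY, graphDb);
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        if (graphDb != null) {
            graphDb.shutdown(); // release the store lock when the webapp is undeployed
        }
    }
}
```

A servlet (or a GWT RemoteServiceServlet) can then fetch the shared instance with getServletContext().getAttribute(DatabaseContextListener.DB_KEY) instead of opening a new connection per request; the listener itself is registered with a <listener> element in web.xml.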

If you have any suggestions, comments or thoughts, or even know of some solid benchmarks on caching with the ServletContext (I did a quick web search for a few minutes and didn't find any), feel free to contact me and discuss this!

Building an Autocomplete Service in GWT screencast Part 4: Integrating the neo4j Database
https://www.rene-pickhardt.de/building-an-autocomplete-service-in-gwt-screencast-part-4-integrating-the-neo4j-data-base/
Thu, 20 Jun 2013 12:38:46 +0000

In this screencast of my series I explain, at a very basic level, how to integrate a database to pull data for autocomplete queries. Since we were working with neo4j at the time, I used a neo4j database. Only in the next two parts of this series will I introduce an efficient way of handling the database (using the context listener of the web server) and building fast indices. So in this lesson the resulting autocomplete service will be really slow and impractical to use, but I am sure that for didactic reasons it is ok to invest 7 minutes in a rather bad design.
Anyway, if you want to use the same data set as I used in this screencast, you can go to http://data.related-work.net and find the data set as well as a description of the database schema.
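For orientation, the naive per-request pattern used in this part looks roughly like the sketch below. It assumes the Neo4j 2.0-era embedded Java API (ExecutionEngine for Cypher); the node label, property name, Cypher query and database path are illustrative guesses, not the exact code from the screencast:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class NaiveSuggestionFetcher {

    // Opens (and locks) the embedded store for every single autocomplete request --
    // exactly the anti-pattern that parts 5 and 6 of the series get rid of.
    public List<String> fetchSuggestions(String prefix) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("/path/to/related-work-db");
        List<String> result = new ArrayList<String>();
        try {
            ExecutionEngine engine = new ExecutionEngine(db);
            Map<String, Object> params = new HashMap<String, Object>();
            params.put("regex", "(?i)" + Pattern.quote(prefix) + ".*");
            try (Transaction tx = db.beginTx()) {
                Iterator<String> titles = engine.execute(
                        "MATCH (p:Publication) WHERE p.title =~ {regex} "
                        + "RETURN p.title LIMIT 10", params).columnAs("p.title");
                while (titles.hasNext()) {
                    result.add(titles.next());
                }
                tx.success();
            }
        } finally {
            db.shutdown(); // releases the store lock after every request
        }
        return result;
    }
}
```

Opening and shutting down the embedded store on every call is exactly what makes this version slow; parts 5 and 6 replace it with a shared connection and an in-memory index.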

Metalcon finally gets a redesign – Thinking about high scalability
https://www.rene-pickhardt.de/metalcon-finally-becomes-a-redesign-thinking-about-high-scalability/
Mon, 17 Jun 2013 15:21:30 +0000

Finally metalcon.de, the social networking site which Jonas, Jens and I created in 2008, gets a redesign. Thanks to the great opportunities at the Institute for Web Science and Technologies here in Koblenz (why don't you apply for a PhD position with us?) I will have the chance to code up the new version of metalcon. Kicking off on July 15th, I will lead a team of 5 programmers for a duration of 4 months. Not only will the development be open source, but during this time I will constantly (hopefully on a daily basis) write in this blog about the design decisions we take in order to achieve a well-scaling web service.
Before I share my thoughts on highly scalable architectures for web sites, I want to give a little history and background on what metalcon is and why this redesign is so necessary:

Metalcon is a social networking site for German fans of metal music. It currently has

  • a user base of 10’000 users.
  • about 500 registered bands
  • a highly semantic and interlinked database (bands, geographical coordinates, friendships, events)
  • 624 MB of text and structured data about the mentioned topics.
  • fairly good visibility in search engines.
  • > 30k lines of code (mostly PHP)
  • a badly scaling architecture (own OR-mapper, own AJAX libraries, big monolithic database design, bad usage of PHP, …)
  • no unit tests (so code maintenance is almost impossible)
  • no music and audio files
  • no processes for content moderation
  • no processes to fight spam and block users
  • really bad usability (I could write tons of posts about where the usability is lacking)
  • no clear distinction of features for users to understand

When we built metalcon, no one on the team had experience with highly scalable web applications and we were happy just to get it running at all. After returning from China and starting my PhD program in 2011, I was about to shut down metalcon. Though we had become close friends, the core team had already moved on to new projects and we were lacking manpower. On the other hand, everyone kept telling me that metalcon would be a great place to do research. So in 2011 Jonas and I decided to give it another shot and do an open redevelopment. We set up a wiki to document our features and the software, and we created a developer blog which we used to exchange ideas. We also created some open source projects to which we hardly contributed any code due to the lack of manpower…
Well, at that time we already knew of too many problems, so that fixing them was not the way to go. At least we did learn a lot. Thinking about highly scalable architectures at that time, I knew that a news feed (which the old version of metalcon already had) was core to the user experience. From reading many Stack Exchange discussions I knew that you wouldn't build such a stream on MySQL. Also, playing around with graph databases like neo4j led me to my first research paper: Graphity, a data structure designed to distribute highly personalized news streams to users. Since our development was not proceeding, we never deployed Graphity within metalcon. Building an autocomplete service for the site should not be a problem anymore either.

Roadmap for the redesign

  • Over the next weeks I hope to read as many interesting articles about technologies and high scalability as I can possibly find, and I will be more than happy to get your feedback and suggestions here. I will start by reading many articles from http://highscalability.com/ – this blog is pure gold for serious web developers.
  • During a nice discussion about scalability with Heinrich we already came up with a potential architecture for metalcon. I will introduce this architecture soon, but first want to check it against the best practices from the high scalability blog.
  • In parallel I will also collect the features needed for the new metalcon version and hopefully be able to pair them with useful technologies. I have already started a wiki page about features and the planned technologies to support them.
  • I will also need to decide on the programming language and paradigms for the development. Right now I am weighing Ruby on Rails against GWT. We have had some great experiences with the power of GWT, but one major drawback is certainly that the result is more of an application than a lightweight website.

So again, feel free to give input and share your ideas and experiences with me and with the community. I will be very grateful for every recommendation of articles, videos, books and so on.

Building an Autocomplete Service in GWT screencast Part 3: Getting the Server code to send a basic response
https://www.rene-pickhardt.de/building-an-autocomplete-service-in-gwt-screencast-part-3-getting-the-server-code-to-send-a-basic-response/
Mon, 17 Jun 2013 12:20:11 +0000

In this screencast of my series on building an autocomplete service you will learn how to implement a server-side servlet in GWT so that autocomplete queries receive a response. In this video the response will always be static and very naive. It will be up to the fourth part of this series, which will follow later this week, to make the server do something meaningful with the query. This part is rather meant to show how the server is supposed to be invoked and what kind of tools and classes are needed. So see it as preparation for the really interesting stuff.
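For reference, a static response servlet of this kind can be as small as the following sketch; the service interface, its method signature and the hard-coded suggestions are illustrative assumptions, not the exact code from the video:

```java
import java.util.ArrayList;
import java.util.Arrays;

import com.google.gwt.user.client.rpc.RemoteService;
import com.google.gwt.user.client.rpc.RemoteServiceRelativePath;
import com.google.gwt.user.server.rpc.RemoteServiceServlet;

// Hypothetical RPC interface shared between client and server
// (normally placed in its own file inside the client package).
@RemoteServiceRelativePath("suggest")
interface SuggestionService extends RemoteService {
    ArrayList<String> getSuggestions(String query);
}

// Server-side implementation: ignores the query entirely and always
// returns the same static list – the "very naive" behaviour of part 3.
public class SuggestionServiceImpl extends RemoteServiceServlet
        implements SuggestionService {

    @Override
    public ArrayList<String> getSuggestions(String query) {
        return new ArrayList<String>(
                Arrays.asList("Metallica", "Megadeth", "Motörhead"));
    }
}
```

In a real GWT project the interface (plus its generated Async counterpart) lives in the client package, and the servlet is mapped in web.xml to the path given in @RemoteServiceRelativePath.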

If you have any questions, suggestions or comments, feel free to discuss them.

The best way to create an autocomplete service: And the winner is…. Giuseppe Ottaviano
https://www.rene-pickhardt.de/the-best-way-to-create-an-autocomplete-service-and-the-winner-is-giuseppe-ottaviano/
Wed, 15 May 2013 13:06:42 +0000

Over one year ago I started to think about indexing scored strings for autocomplete queries. I stumbled upon this problem after seeing the strength of the predictions of the typology approach for next-word prediction on smartphones. The typology approach had one major drawback: though its suggestions had a high precision, at 50 milliseconds per suggestion it was rather slow, especially for a server-side application.

  • On August 16th, 2012 I found a first solution building on Nicolai Diethelm's Suggest Tree. Though the speedup was great, the Suggest Tree at that time had several major drawbacks: (1) the number of suggestions had to be known before building the tree, (2) a large memory overhead and high redundancy, and (3) no possibility of updating weights or inserting new strings after building the tree (the last two issues have been fixed just last month).
  • So I tried to find a solution which required less redundancy. But for indexing gigabytes of 5-grams we still needed a persistent method. We tried Lucene and MySQL in December and January. After seeing that MySQL does not provide any indices for this kind of query, I decided to misuse MySQL's multidimensional trees in a highly redundant way just to be able to evaluate the strength of typology on large data sets with gigabytes of n-grams. Creating one of the dirtiest hacks of my life, I could at least handle the data, but the solution was more engineering than science and consisted of throwing hardware at the problem.
  • After Christoph tried to solve this using bitmap indices, which was quite fast but had issues with scaling and index maintainability, we had a discussion, and finally the solution popped into my mind at the beginning of March this year.

Even though I had been thinking of scored tries before, they always had the problem that they could only find the top-1 element efficiently. Then I realized that one has to sort the children of each node by score and use a priority queue during retrieval. In this way one gets the best possible runtime. I was doing this in a rather redundant way because I was aiming for fast prefix retrieval of the trie node and then fast retrieval of the top children.
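To illustrate the principle, here is my own minimal sketch (not the benchmarked implementation): every trie node stores the best score found anywhere in its subtree, and retrieval walks to the prefix node and then expands subtrees best-first with a priority queue, so it can stop as soon as k completions have been emitted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Minimal scored trie with top-k completion via best-first search.
public class ScoredTrie {

    private static class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        String word;     // non-null if a word ends in this node
        int score;       // score of that word
        int bestScore;   // best score of any word in this subtree
    }

    // A queue entry is either a whole subtree (node != null, priority = bestScore)
    // or a single finished word (node == null, priority = the word's own score).
    private static class Entry {
        final Node node;
        final String word;
        final int priority;
        Entry(Node node, String word, int priority) {
            this.node = node;
            this.word = word;
            this.priority = priority;
        }
    }

    private final Node root = new Node();

    public void insert(String word, int score) {
        Node n = root;
        n.bestScore = Math.max(n.bestScore, score);
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            Node child = n.children.get(c);
            if (child == null) {
                child = new Node();
                n.children.put(c, child);
            }
            n = child;
            n.bestScore = Math.max(n.bestScore, score);
        }
        n.word = word;
        n.score = score;
    }

    // Best-first search below the prefix node: always expand the entry with the
    // highest priority, so words are emitted in descending score order and the
    // search stops after k results without visiting the rest of the subtree.
    public List<String> topK(String prefix, int k) {
        List<String> result = new ArrayList<String>();
        Node n = root;
        for (int i = 0; i < prefix.length(); i++) {
            n = n.children.get(prefix.charAt(i));
            if (n == null) {
                return result; // prefix not in the trie
            }
        }
        PriorityQueue<Entry> queue = new PriorityQueue<Entry>(11,
                new Comparator<Entry>() {
                    public int compare(Entry a, Entry b) {
                        return Integer.compare(b.priority, a.priority);
                    }
                });
        queue.add(new Entry(n, null, n.bestScore));
        while (!queue.isEmpty() && result.size() < k) {
            Entry e = queue.poll();
            if (e.node == null) {
                result.add(e.word); // a finished word: nothing left can beat it
                continue;
            }
            if (e.node.word != null) {
                queue.add(new Entry(null, e.node.word, e.node.score));
            }
            for (Node child : e.node.children.values()) {
                queue.add(new Entry(child, null, child.bestScore));
            }
        }
        return result;
    }
}
```

In a tuned implementation the children are kept sorted by their subtree's best score, which avoids pushing every child onto the queue; the sketch keeps a plain map for brevity.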
After I came up with my solution, and after talking to Lucene contributors from IBM in Haifa, I realized that Lucene had a pretty similar solution as a less popular "hidden feature", which I tested. Anyway, in my experiments I also saw a large memory overhead with the Lucene solution, so my friend Heinrich and I started to develop my trie-based solution and benchmark it against various baselines in order to produce a solid result.
The development started last month and we made quite some progress. Our goal was always to be about as fast as Nicolai Diethelm's Suggest Tree without running into all the drawbacks of his solution. In our coding session yesterday we realized that Nicolai had improved his data structure a lot, getting rid of the memory overhead and also supporting updates, inserts and deletes on his index (still, the number of suggestions has to be known before the tree is built).
Yet while learning more about the ternary tree data structure he used to build his solution, I found a paper that will be presented TODAY at the WWW conference. Guess what: independently of us, Giuseppe Ottaviano explains in Chapter 4 the exact solution and algorithm that I came up with this March. Combined with an efficient implementation of the tries and many compression techniques (even respecting the cache locality of the processor), he even beats Nicolai Diethelm's Suggest Tree.
I looked up Giuseppe Ottaviano and the only two things I have to say are:

  1. Congratulations Giuseppe. You have really worked on this kind of problem for a long time and created an amazing paper. This is also reflected by the related work section and all the small details in your paper which we were still in the process of figuring out.
  2. If anyone needs an autocomplete service, this is the way to go. Being able to provide suggestions from a dictionary with 10 million entries in a few microseconds (yes, micro, not milli!) means that a single computer can handle about 100'000 requests per second, which is certainly web scale. The updated Suggest Tree by Nicolai is also a way to go and maybe much easier to use, since it is Java-based rather than C++ and the full code is open source.
Ok so much for the history of events and the congratulations to Giuseppe. I am happy to see that the algorithm really performs that well but there is one little thing that really bothers me a lot: 
 
How come our community of researchers hasn't come up with a good way of giving credit to a person like me, who came up with the solution independently? As for me, I feel that the strongest chapter of my dissertation just collapsed and one year of research just burnt away. Personally I gained and learnt a lot from it, but from a career point of view this seems like a huge drawback.

Anyway, life goes on, and by thinking about the trie-based solution we already came up with a decent list of future work which we can most certainly use for follow-up research. I will certainly contact the authors; maybe a collaboration will be possible in the future.

Building an Autocompletion on GWT screencast Part 2: Invoking The Remote Procedure Call
https://www.rene-pickhardt.de/building-an-autocompletion-on-gwt-screencast-part-2-invoking-the-remote-procedure-call/
Tue, 12 Mar 2013 07:25:00 +0000

Hey everyone, after posting my first screencast in this series, reviewing the basic process for creating remote procedure calls in GWT, we are now finally starting with the real tutorial for building an autocomplete service.
This tutorial (again hosted on Wikipedia) covers the basic user interface, meaning:

  • how to integrate a SuggestBox instead of a text field into the GWT starter project
  • how to set up the necessary parts (extending a SuggestOracle) to fire a remote procedure call that requests suggestions once the user has typed something (see the sketch after this list)
  • how to override the necessary methods of SuggestOracle
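The client-side wiring looks roughly like the following sketch; the SuggestionService interface, its Async counterpart and the class names are illustrative assumptions rather than the exact code from the screencast:

```java
import java.util.ArrayList;
import java.util.List;

import com.google.gwt.core.client.GWT;
import com.google.gwt.user.client.rpc.AsyncCallback;
import com.google.gwt.user.client.rpc.RemoteService;
import com.google.gwt.user.client.rpc.RemoteServiceRelativePath;
import com.google.gwt.user.client.ui.MultiWordSuggestOracle;
import com.google.gwt.user.client.ui.SuggestOracle;

// Hypothetical RPC pair (normally defined in their own files in the client package).
@RemoteServiceRelativePath("suggest")
interface SuggestionService extends RemoteService {
    ArrayList<String> getSuggestions(String query);
}

interface SuggestionServiceAsync {
    void getSuggestions(String query, AsyncCallback<ArrayList<String>> callback);
}

// A SuggestOracle that forwards every query to the server via GWT-RPC.
public class RpcSuggestOracle extends SuggestOracle {

    private final SuggestionServiceAsync service = GWT.create(SuggestionService.class);

    @Override
    public void requestSuggestions(final Request request, final Callback callback) {
        service.getSuggestions(request.getQuery(), new AsyncCallback<ArrayList<String>>() {
            @Override
            public void onSuccess(ArrayList<String> suggestions) {
                List<Suggestion> result = new ArrayList<Suggestion>();
                for (String s : suggestions) {
                    // reuse the ready-made suggestion type from MultiWordSuggestOracle
                    result.add(new MultiWordSuggestOracle.MultiWordSuggestion(s, s));
                }
                callback.onSuggestionsReady(request, new Response(result));
            }

            @Override
            public void onFailure(Throwable caught) {
                callback.onSuggestionsReady(request, new Response());
            }
        });
    }
}
```

The oracle is then plugged into the UI with new SuggestBox(new RpcSuggestOracle()), which is what replaces the plain text field from the starter project.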

So here we go with the second part of the screencast, which you can of course directly download from Wikipedia:

Feel free to ask questions, give comments and improve the screencast!

Building an Autocompletion on GWT screencast Part 1: Getting Warm – Reviewing remote procedure calls
https://www.rene-pickhardt.de/building-an-autocompletion-on-gwt-screencast-part-1-getting-warm-reviewing-remote-procedure-calls/
Tue, 19 Feb 2013 09:11:29 +0000

Quite a while ago I promised to create some screencasts on how to build a (personalized) autocompletion in GWT. Even though the screencasts have been finished for quite some time now, I had to wait to publish them for various reasons.
Finally the time has come to go public with the first video. I really do start from scratch, so the first video might be a little bit boring since I am only reviewing the remote procedure calls of GWT.
A little note: the video is hosted on Wikipedia! I think it is important to spread knowledge under a Creative Commons licence, and the YouTubes, Vimeos, … of this world are rather trying to do a vendor lock-in. So if the embedded player does not work well for you, you can go directly to Wikipedia for a fullscreen version or a direct download of the video.

Another note: I did not publish the source code! This has a pretty simple reason (and yes, you can call me crazy): if you really want to learn something, copying and pasting code doesn't give you the full understanding. Doing it step by step, e.g. watching the screencasts and reproducing the steps, is the way to go.
As always I am open to suggestions and feedback, but please keep in mind that the entire series of videos is already recorded.

Swiftkey XKCD comic: Sorry this has not happened before…
https://www.rene-pickhardt.de/swiftkey-xkcd-comic-sorry-this-has-not-happened-before/
Wed, 13 Jun 2012 15:29:17 +0000

Current readers of my blog know about typology, the project which predicts what you will type next on your smartphone. This is of course pretty similar to SwiftKey. It is amazing to see that xkcd took SwiftKey as the topic of its current comic. Even more amazing is that 3 people (none of them in the development team of typology) sent me the link independently.
I liked the comic and I think it really demonstrates the importance of software like this, so I am really looking forward to seeing how typology will evolve. It also demonstrates the downsides of personalizing everything; in this sense I think there is also a hint at the filter bubble.

Building an Autocompletion on GWT with RPC, ContextListener and a Suggest Tree: Part 0
https://www.rene-pickhardt.de/building-an-autocompletion-on-gwt-with-rpc-contextlistener-and-a-suggest-tree-part-0/
Wed, 13 Jun 2012 13:15:29 +0000

Over the last weeks there was quite some quality programming time for me. First of all I built some indices on the typology database, which increased the retrieval speed of typology by a factor of over 1000, something that rarely happens in computer science. I will blog about this soon. But having those techniques at hand, I also used them to build a better autocompletion for the search function of my online social network metalcon.de.
The search functionality is not deployed to the real site yet, but on the demo page you can see how the completion helps you while typing. Right now the network requests are faster than Google search (which, I admit, is quite easy if you only have to handle one request per second and also have a much smaller concept space). Still, I was amazed by the ease and beauty of the program and by the fact that the autocomplete suggestions are actually more accurate than our current database search. So feel free to have a look at the demo:
http://134.93.129.135:8080/wiki.html
Right now it consists of about 150 thousand concepts which come from 4 different data sources (metal bands, metal records, tracks and German venues for heavy metal). I am pretty sure that increasing the size of the concept space by 2 orders of magnitude should not be a problem. And if everything works out fine, I will be able to test this hypothesis on my joint project related-work.net, which will have a database with at least 1 million concepts that need to be autocompleted.
Even though everything I used except the ContextListener and my small but effective caching strategy can be found at http://developer-resource.blogspot.de/2008/07/google-web-toolkit-suggest-box-rpc.html, and the data structure (Suggest Tree) is open source and can be found at http://sourceforge.net/projects/suggesttree/, I am planning to produce a series of screencasts and release the source code of my implementation together with some test data over the next weeks in order to spread the knowledge of how to build strong autocomplete engines. The planned structure of these articles will be:

part 1: introduction of which parts exist and where to find them

  • Set up a GWT project
  • Erase all files that are not required
  • Create a basic design

part 2: AutoComplete via RPC

  • Necessary client-side stuff
  • Integration of SuggestBox and SuggestOracle
  • Setting up the remote procedure call

part 3: A basic AutoComplete Server

  • show how to fill it with data and where to include it in the autocomplete
  • disclaimer: not a good solution yet
  • always returns the same suggestions

part 4: AutoComplete Pulling suggestions from a database

  • including a database
  • locking the database for every autocomplete HTTP request
  • show how this is a poor design
  • demonstrate the slow response times

part 5: Introducing the ContextListener

  • introducing a ContextListener
  • demonstrate the speed that is still lacking with every network request

part 6: Introducing a fast Index (Suggest Tree)

  • include the Suggest Tree
  • demonstrate increased speed

part 7: Introducing client side caching and formatting

  • introducing caching (a minimal sketch of one possible cache follows at the end of this post)
  • demonstrate no network traffic for cached completions

not covered topics (but for some points I am happy to get hints):

  • on user login: create a personalized suggest tree and save it in some context data structure
  • merging results from the personalized AND the global index (Google will only display 2 or 3 personalized results)
  • index compression
  • scheduling / caching / precalculation of the index
  • non-prefix retrieval (merging?)
  • CSS of the retrieval box
  • parallel architectures for searching
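
As a small teaser for part 7, here is a minimal sketch of one possible client-side cache: a map from already-typed prefixes to the server's responses, wrapped around any other SuggestOracle. It only illustrates the idea and is not necessarily the caching strategy used in my implementation:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import com.google.gwt.user.client.ui.SuggestOracle;

// Wraps any SuggestOracle and answers repeated queries from a client-side map,
// so no network request is sent for prefixes that were already completed once.
public class CachingSuggestOracle extends SuggestOracle {

    private final SuggestOracle delegate;
    private final Map<String, Collection<? extends Suggestion>> cache =
            new HashMap<String, Collection<? extends Suggestion>>();

    public CachingSuggestOracle(SuggestOracle delegate) {
        this.delegate = delegate;
    }

    @Override
    public void requestSuggestions(final Request request, final Callback callback) {
        final String query = request.getQuery();
        if (cache.containsKey(query)) {
            callback.onSuggestionsReady(request, new Response(cache.get(query)));
            return;
        }
        // Not cached yet: ask the wrapped oracle (e.g. an RPC-based one) and
        // remember its answer before passing it on.
        delegate.requestSuggestions(request, new Callback() {
            @Override
            public void onSuggestionsReady(Request req, Response response) {
                cache.put(query, response.getSuggestions());
                callback.onSuggestionsReady(request, response);
            }
        });
    }
}
```

Wrapping the RPC-based oracle this way means a repeated prefix is answered instantly from the map, with no network traffic at all, which is exactly what part 7 demonstrates.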