gwt – Data Science, Data Analytics and Machine Learning Consulting in Koblenz, Germany (https://www.rene-pickhardt.de)

My ranked list of priorities for Backend Web Programming: Scalability > Maintainable code > Performance
https://www.rene-pickhardt.de/my-ranked-list-of-priorities-for-backend-web-programming-scalability-maintainable-code-performance/
Sat, 27 Jul 2013

The redevelopment of Metalcon is going on, and so far I have been very concerned about performance and web scale. Due to the progress of Martin's bachelor thesis we did a review of his code for calculating generalized language models with Kneser-Ney smoothing. Even though his code is a standalone (but very performance-sensitive) piece of software, I realized that for a web application writing maintainable code seems to be as important as thinking about scalability.

Scaling vs performance

I am a performance guy. I love algorithms and data structures. When I was 16 years old I had already programmed software that could play chess against you, using a high-performance programming technique called bitboards.
But thinking about Metalcon I realized that web scale is not so much about the performance of single services or parts of the software, but rather about the scalability of the entire architecture. After many discussions with colleagues, Heinrich Hartmann and I came up with a software architecture which we believe will scale for web sites that are supposed to handle several million monthly active users (probably not billions, though). After discussing the architecture with my team of developers, Patrik wrote a nice blog article about the service-oriented, data-denormalized architecture for scalable web applications (which of course was not invented by Heinrich and me; Patrik found out that it was already described in a WWW publication from 2008).
Anyway, this discussion showed me that scalability is more important than performance! That said, the standalone services should also be very performant: if a service can only handle 20 requests per second, even if it easily scales horizontally, you will just need too many machines.

Performance vs. Maintainable code

Especially after the code review, but also having the currently running Metalcon version in mind, I came to the conclusion that there is incredibly high value in maintainable code. The hacker community seems to agree that maintainability takes precedence over performance (only one of many examples).
At this point I want to recall my initial post on the redesign of Metalcon. I had in mind that performance is the same as scaling (which is a wrong assumption) and asked about Ruby on Rails vs. GWT. I am totally convinced that GWT is much more performant than Ruby. But I have seen GWT code and it seems almost impractical to maintain. On the other hand, from all that I know, Ruby on Rails is very easy to maintain but less performant. The good thing is that it scales horizontally easily, so it seems almost like a no-brainer to use Ruby on Rails rather than GWT for the front end and middle layer of Metalcon.

Maintainable code vs scalability

Now comes the most interesting fact that I realized. A software architecture scales best if it consists of many independent services. If services need to interact, they should do so asynchronously and without blocking. Creating a clear software architecture with clear communication protocols between its parts will do two things for you:

  1. It will help you to maintain the code. This will cut down development cost and time. In particular, it will be easy to add, remove or exchange functionality in the entire software architecture. The last point is crucial, since
  2. being easily able to exchange parts of the software or single services will help you to scale. Every time you identify the bottleneck you can fix it by exchanging that part of the software for a better-performing system.

In order to achieve scalable code one needs to include some middle layer for caching, and one needs to abstract certain things. The same things are done in order to get maintainable code (often at the cost of some performance).
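The caching middle layer mentioned above can be sketched in a few lines of Java. This is a minimal illustration, not code from the Metalcon project; all names are made up, and a real deployment would of course add eviction and invalidation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal sketch of a caching middle layer: the cache sits between the
// application and an expensive backend (database, remote service), so the
// backend can later be swapped out without touching the callers.
// Names (CachingLayer, backend) are illustrative, not from the post.
class CachingLayer<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> backend; // the expensive lookup we wrap

    CachingLayer(Function<K, V> backend) {
        this.backend = backend;
    }

    // Returns the cached value, computing it from the backend on a miss.
    V get(K key) {
        return cache.computeIfAbsent(key, backend);
    }

    int size() {
        return cache.size();
    }
}
```

The point is exactly the one made above: the same abstraction that keeps callers ignorant of the backend (maintainability) is what lets you later replace the backend with a faster one (scalability).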

Summary

I find this very interesting and counterintuitive. One would think that performance is a core element of scalability, but I have the strong feeling that writing maintainable code is much more important. So my ranked list of priorities for backend web programming (!) looks like this:

  1. Scalability first: no amount of maintainable code helps you if the system doesn't scale and can't be served to millions of users.
  2. Maintainable code: as stated above, this should go almost hand in hand with scalability.
  3. Performance: of course we can't have a database design where queries need seconds or minutes to run. Everything should happen within a few milliseconds. But if the code can become more maintainable at the cost of another few milliseconds, I guess that's a good investment.
GWT + database connection in Servlet ContextListener – Auto Complete Video Tutorial Part 5
https://www.rene-pickhardt.de/gwt-database-connection-in-servlet-contextlistener-auto-complete-video-tutorial-part-5/
Mon, 24 Jun 2013

Finally we have all the basics that are needed for building an autocomplete service, and now comes the juicy part. From now on we are looking at how to make it fast and robust. In the current approach we open a new database connection for every HTTP request. This takes quite some time to lock the database (at least when using neo4j in embedded mode) and then to run the query, without any opportunity to use the database's caching strategy.
In this tutorial I will introduce you to the concept of a ContextListener. Roughly speaking, this is a way of storing objects in the Java servlet's global memory as key-value pairs. Once we understand this, the roadmap is very clear: we can store objects like database connections or search indices in the memory of our web server. From what I currently understand, this could also be used to implement some server-side caching. I have not done any benchmarking yet to test how fast retrieving objects from the context works in Tomcat. Also, this method of caching does not scale horizontally as well as memcached.
Anyway have fun learning about the context listener.
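To make the idea concrete, here is a small, self-contained Java sketch of the pattern. The real interface is javax.servlet.ServletContextListener (with contextInitialized and contextDestroyed callbacks, registered in web.xml); since the servlet API is not part of the JDK, this sketch substitutes a minimal stand-in for the ServletContext, and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for javax.servlet.ServletContext: a shared key/value store that
// lives as long as the web application does.
class ServletContextStandIn {
    private final Map<String, Object> attributes = new ConcurrentHashMap<>();
    void setAttribute(String key, Object value) { attributes.put(key, value); }
    Object getAttribute(String key) { return attributes.get(key); }
    void removeAttribute(String key) { attributes.remove(key); }
}

// Sketch of the listener: expensive objects (a database connection, a search
// index) are created once on startup and parked in the context, instead of
// being re-created on every HTTP request.
class DatabaseConnectionListener {
    static final String DB_KEY = "database"; // illustrative attribute name

    // Corresponds to ServletContextListener.contextInitialized(event).
    void contextInitialized(ServletContextStandIn context) {
        Object connection = openDatabase();
        context.setAttribute(DB_KEY, connection);
    }

    // Corresponds to contextDestroyed(event): clean up on shutdown.
    void contextDestroyed(ServletContextStandIn context) {
        context.removeAttribute(DB_KEY);
    }

    private Object openDatabase() {
        return "connection"; // placeholder for the real, expensive setup
    }
}
```

A servlet would then fetch the shared object via getServletContext().getAttribute(DB_KEY) instead of opening its own connection per request.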

If you have any suggestions, comments or thoughts, or even know of some solid benchmarks about caching using the ServletContext (I did a quick web search for a few minutes and didn't find any), feel free to contact me and discuss this!

Building an Autocomplete Service in GWT screencast Part 4: Integrating the neo4j Data base
https://www.rene-pickhardt.de/building-an-autocomplete-service-in-gwt-screencast-part-4-integrating-the-neo4j-data-base/
Thu, 20 Jun 2013

In this screencast of my series I explain, at a very basic level, how to integrate a database to pull data for autocomplete queries. Since we were working with neo4j at the time, I used a neo4j database. Only in the next two parts of this series will I introduce an efficient way of handling the database (using the context listener of the web server) and building fast indices. So in this lesson the resulting autocomplete service will be really slow and impractical to use, but I am sure that for didactic reasons it is fine to invest 7 minutes in a rather bad design.
Anyway, if you want to use the same data set as I used in this screencast, you can go to http://data.related-work.net and find the data set as well as a description of the database schema:

Metalcon finally gets a redesign – Thinking about high scalability
https://www.rene-pickhardt.de/metalcon-finally-becomes-a-redesign-thinking-about-high-scalability/
Mon, 17 Jun 2013

Finally metalcon.de, the social networking site which Jonas, Jens and I created in 2008, gets a redesign. Thanks to the great opportunities at the Institute for Web Science and Technologies here in Koblenz (why don't you apply for a PhD position with us?) I will have the chance to code up the new version of Metalcon. Kicking off on July 15th, I will lead a team of 5 programmers for a duration of 4 months. Not only will the development be open source, but during this time I will constantly (hopefully on a daily basis) write in this blog about the design decisions we take in order to achieve a well-scaling web service.
Before I share my thoughts on highly scalable architectures for web sites, I want to give a little history and background on what Metalcon is and why this redesign is so necessary:

Metalcon is a social networking site for German fans of metal music. It currently has

  • a user base of 10,000 users
  • about 500 registered bands
  • a highly semantic and interlinked database (bands, geographical coordinates, friendships, events)
  • 624 MB of text and structured data about the mentioned topics
  • fairly good visibility in search engines
  • more than 30k lines of code (mostly PHP)
  • a badly scaling architecture (our own OR mapper, our own AJAX libraries, a big monolithic database design, bad usage of PHP, …)
  • no unit tests (so code maintenance is almost impossible)
  • no music and audio files
  • no processes for content moderation
  • no processes to fight spam and block users
  • really bad usability (I could write tons of posts about where the usability is lacking)
  • no clear distinction of features for users to understand

When we built Metalcon, no one on the team had experience with highly scalable web applications and we were just about happy to get it running at all. After returning from China and starting my PhD program in 2011, I was about to shut down Metalcon. Though we had become close friends, the core team had already moved on to new projects and we were lacking manpower. On the other hand, everyone kept telling me that Metalcon would be a great place to do research. So in 2011 Jonas and I decided to give it another shot and do an open redevelopment. We set up a wiki to document our features and the software, and we created a developer blog which we used to exchange ideas. We also created some open source projects, to which we hardly contributed any code due to the lack of manpower…
Well, at that time we already knew of too many problems, so that fixing them was not the way to go. At least we learned a lot. Thinking about highly scalable architectures at that time, I knew that a news feed (which the old version of Metalcon already had) was central to the user experience. Having read many Stack Exchange discussions, I knew that you wouldn't build such a stream on MySQL. Also, playing around with graph databases like neo4j, I came to my first research paper, building Graphity, a piece of software designed to distribute highly personalized news streams to users. Since our development was not proceeding, we never deployed Graphity within Metalcon. Building an autocomplete service for the site should also not be a problem anymore.

Roadmap for the redesign

  • Over the next weeks I hope to read as many interesting articles about technologies and high scalability as I can possibly find, and I will be more than happy to get your feedback and suggestions here. I will start by reading many articles from http://highscalability.com/ – this blog is pure gold for serious web developers.
  • During a nice discussion about scalability with Heinrich we already came up with a potential architecture for Metalcon. I will introduce this architecture soon, but first I want to check the best practices in the High Scalability blog.
  • In parallel I will collect the features needed for the new Metalcon version and hopefully be able to pair them with useful technologies. I have already started a wiki page about features and the planned technologies to support them.
  • I will also need to decide on the programming language and paradigms for the development. Right now I am weighing Ruby on Rails against GWT. We had some great experiences with the power of GWT, but one major drawback is certainly that the resulting website is more an application than a lightweight website.

So again, feel free to give input and share your ideas and experiences with me and with the community. I will be very grateful for every recommendation of articles, videos, books and so on.

Building an Autocomplete Service in GWT screencast Part 3: Getting the Server code to send a basic response
https://www.rene-pickhardt.de/building-an-autocomplete-service-in-gwt-screencast-part-3-getting-the-server-code-to-send-a-basic-response/
Mon, 17 Jun 2013

In this screencast of my series on building an autocomplete service you will learn how to implement a server servlet in GWT so that autocomplete queries receive a response. In this video the response will always be static and very naive. It will be up to the fourth part of this series, which follows later this week, to make the server do something meaningful with the query. This part mainly shows how the server is supposed to be invoked and what kinds of tools and classes are needed. So see it as preparation for the really interesting stuff.
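For reference, the naive server side described above boils down to something like the following sketch. In the real project the class would extend GWT's RemoteServiceServlet and implement the shared service interface; that dependency is omitted here so the sketch stays self-contained, and the class and method names are made up:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the first, deliberately naive server side: the service method
// ignores the query and always returns the same static list. That is good
// enough to verify that the RPC round trip works, and nothing more.
// In the real GWT project this class would extend
// com.google.gwt.user.server.rpc.RemoteServiceServlet.
class NaiveSuggestionService {
    List<String> getSuggestions(String query) {
        // Static response regardless of what the user typed.
        return Arrays.asList("Metallica", "Megadeth", "Motorhead");
    }
}
```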

If you have any questions, suggestions or comments, feel free to discuss them.

Building an Autocompletion on GWT screencast Part 2: Invoking The Remote Procedure Call
https://www.rene-pickhardt.de/building-an-autocompletion-on-gwt-screencast-part-2-invoking-the-remote-procedure-call/
Tue, 12 Mar 2013

Hey everyone, after posting my first screencast in this series, reviewing the basic process for creating remote procedure calls in GWT, we are now finally starting with the real tutorial for building an autocomplete service.
This tutorial (again hosted on Wikipedia) covers the basic user interface, meaning:

  • how to integrate a SuggestBox instead of a text field into the GWT starter project
  • how to set up the necessary pieces (extending a SuggestOracle) to fire a remote procedure call that requests suggestions once the user has typed something
  • how to override the necessary methods of the SuggestOracle interface
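The oracle's job can be illustrated with a plain-Java sketch of the matching step. In GWT the real class extends SuggestOracle and overrides requestSuggestions(Request, Callback), firing the RPC from there; this standalone sketch (with made-up names) only shows the prefix filtering that produces the suggestions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of the logic behind a custom SuggestOracle: take the current query
// from the SuggestBox and produce matching suggestions, capped at a limit.
// In GWT the real class extends com.google.gwt.user.client.ui.SuggestOracle
// and the candidates would come back from the server via RPC.
class PrefixOracle {
    private final List<String> candidates;

    PrefixOracle(List<String> candidates) {
        this.candidates = candidates;
    }

    List<String> requestSuggestions(String query, int limit) {
        String q = query.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (candidate.toLowerCase(Locale.ROOT).startsWith(q)) {
                result.add(candidate);
                if (result.size() == limit) break;
            }
        }
        return result;
    }
}
```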

So here we go with the second part of the screencast, which you can of course directly download from Wikipedia:

Feel free to ask questions, give comments and improve the screencast!

Building an Autocompletion on GWT screencast Part 1: Getting Warm – Reviewing remote procedure calls
https://www.rene-pickhardt.de/building-an-autocompletion-on-gwt-screencast-part-1-getting-warm-reviewing-remote-procedure-calls/
Tue, 19 Feb 2013

Quite a while ago I promised to create some screencasts on how to build a (personalized) autocompletion in GWT. Even though the screencasts had been recorded for quite some time, I had to wait to publish them for various reasons.
Finally the time has come to go public with the first video. I really do start from scratch, so the first video might be a little bit boring, since I am only reviewing the remote procedure calls of GWT.
A little note: the video is hosted on Wikipedia! I think it is important to spread knowledge under a Creative Commons license, and the YouTubes and Vimeos of this world are rather trying to achieve vendor lock-in. So if the embedded player does not work well for you, you can go directly to Wikipedia for a fullscreen version or a direct download of the video.

Another note: I did not publish the source code! This has a pretty simple reason (and yes, you can call me crazy): if you really want to learn something, copying and pasting code doesn't give you the full understanding. Doing it step by step, e.g. watching the screencasts and reproducing the steps, is the way to go.
As always I am open to suggestions and feedback, but please keep in mind that the entire course of videos has already been recorded.

Slides of Related work application presented in the Graphdevroom at FOSDEM
https://www.rene-pickhardt.de/slides-of-related-work-application-presented-in-the-graphdevroom-at-fosdem/
Sat, 02 Feb 2013

Download the slide deck of our talk at FOSDEM 2013, including all the resources that we pointed to.
Most important other links are:

It was great talking there, and again: we are open source, open data and so on. So if you have suggestions or want to contribute, feel free – just do it or contact us. We are really looking forward to meeting some other hackers who just want to go geek and change the world.
The video of the talk can be found here:

Building an Autocompletion on GWT with RPC, ContextListener and a Suggest Tree: Part 0
https://www.rene-pickhardt.de/building-an-autocompletion-on-gwt-with-rpc-contextlistener-and-a-suggest-tree-part-0/
Wed, 13 Jun 2012

Over the last weeks there was quite some quality programming time for me. First of all I built some indices on the Typology database, which increased the retrieval speed of Typology by a factor of over 1,000 – something that rarely happens in computer science. I will blog about this soon. Having those techniques at hand, I also used them to build a better autocompletion for the search function of my online social network metalcon.de.
The search functionality is not deployed to the real site yet, but on the demo page you can find a demo showing how the completion helps you while typing. Right now the network requests are faster than Google search (which, I admit, is quite easy if you only have to handle one request per second and also have a much smaller concept space). Still, I was amazed by the ease and beauty of the program and by the fact that the suggestions for autocompletion are actually more accurate than our current database search. So feel free to have a look at the demo:
http://134.93.129.135:8080/wiki.html
Right now it consists of about 150 thousand concepts which come from 4 different data sources (metal bands, metal records, tracks and German venues for heavy metal). I am pretty sure that increasing the size of the concept space by 2 orders of magnitude should not be a problem. And if everything works out fine I will be able to test this hypothesis on my joint project Related Work, which will have a database with at least 1 million concepts that need to be autocompleted.
Even though everything I used – except the ContextListener and my small but effective caching strategy – can be found at http://developer-resource.blogspot.de/2008/07/google-web-toolkit-suggest-box-rpc.html, and the data structure (suggest tree) is open source and can be found at http://sourceforge.net/projects/suggesttree/, I am planning to produce a series of screencasts and release the source code of my implementation together with some test data over the next weeks, in order to spread the knowledge of how to build strong autocompletion engines. The planned structure of these articles is:
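To give an idea of why a suggest tree is so fast, here is a minimal trie sketch in Java. The SuggestTree library linked above is far more refined (for instance, it can keep a precomputed top-k suggestion list per node); this sketch only shows the core idea of answering a prefix query with one walk down the tree instead of a scan over the whole database:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal trie over concept names: complete(prefix, limit) walks down the
// tree along the prefix and then collects up to `limit` stored words below
// that node. Illustrative only; not the SuggestTree implementation.
class SuggestTrie {
    private final Map<Character, SuggestTrie> children = new HashMap<>();
    private boolean isWord;

    void insert(String word) {
        SuggestTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new SuggestTrie());
        }
        node.isWord = true;
    }

    List<String> complete(String prefix, int limit) {
        SuggestTrie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new ArrayList<>(); // no completions
        }
        List<String> out = new ArrayList<>();
        node.collect(prefix, limit, out);
        return out;
    }

    private void collect(String prefix, int limit, List<String> out) {
        if (out.size() >= limit) return;
        if (isWord) out.add(prefix);
        for (Map.Entry<Character, SuggestTrie> e : children.entrySet()) {
            e.getValue().collect(prefix + e.getKey(), limit, out);
        }
    }
}
```

Looking up a prefix costs time proportional to the prefix length plus the number of suggestions returned, which is what makes the big jump in retrieval speed over a database scan plausible.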

part 1: introduction of which parts exist and where to find them

  • Set up a GWT project
  • Erase all files that are not required
  • Create a basic design

part 2: AutoComplete via RPC

  • Necessary client-side stuff
  • Integration of SuggestBox and SuggestOracle
  • Setting up the remote procedure call

part 3: A basic AutoComplete Server

  • show how to fill it with data and where to include it in the autocompletion
  • disclaimer: not a good solution yet
  • always the same suggestions

part 4: AutoComplete Pulling suggestions from a data base

  • including a database
  • locking the database for every autocomplete HTTP request
  • show why this is a poor design
  • demonstrate slow response times

part 5: Introducing the context Listener

  • introducing a context listener
  • demonstrate the speed problems that remain with every network request

part 6: Introducing a fast Index (Suggest Tree)

  • include the suggest tree
  • demonstrate the increased speed

part 7: Introducing client side caching and formatting

  • introducing caching
  • demonstrate no network traffic for cached completions

Topics not covered (but I am happy about hints on some of these points):

  • on user login: create a personalized suggest tree and save it in some context data structure
  • merging of personalized AND global indices (Google will only display 2 or 3 personalized results)
  • index compression
  • scheduling / caching / precalculation of the index
  • non-prefix retrieval (merging?)
  • CSS of the retrieval box
  • parallel architectures for searching
neo4j based social news feed demo on wikipedia graph running
https://www.rene-pickhardt.de/neo4j-news-stream-demo-on-wikipedia-graph-running/
Wed, 21 Sep 2011

UPDATE: you can find an evaluation of the following blog post and idea at: http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/
Hey everyone, I can finally demonstrate the neo4j and GWT system that I have been blogging about over the last weeks, here and here. Please find the demo at the following address:
http://gwt.metalcon.de/GWT-Modelling
The code will be available soon! But it was hacked together pretty nastily and definitely needs some cleanup! I also think that I still have a memory leak in the maintenance of the index for my news feed. Hopefully I will be able to fix this while writing my paper and working more on the evaluation. Already a big thanks to Jonas Kunze and Dr. Thomas Gottron, as well as Peter from Neo Technology, for the great support during this project!

The graph that I used is extracted from the complete revision dump of the Bavarian Wikipedia and has the following properties:

  • 8,760 entities
  • ~95,000 content items
  • ==> together more than 100,000 nodes
  • almost 20,000 different relationship types (more to come in bigger graphs!)
  • about 100,000 edges connecting the 8,760 entities
  • maintaining the index was possible with (only :[ ) 177 writes / second on a slow notebook computer

I interpret the Wikipedia graph in the following way as a social network:

  • every article corresponds to an entity
  • every link corresponds to a directed friendship (follower)
  • every revision corresponds either to a status update (content item) or to a change in the friendship graph

I have no idea yet how many reads I can do per second. Even though I am a little disappointed about the low write speed on the graph, I am sure that I will achieve my theoretical goals for reads per second. I also hope to increase the write speed by introducing better transaction management. Anyway, I will blog about the reads-per-second results later.
