neo4j – Data Science, Data Analytics and Machine Learning Consulting in Koblenz, Germany
https://www.rene-pickhardt.de – Extract knowledge from your data and be ahead of your competition

Graphity Server for social activity streams released (GPLv3)
https://www.rene-pickhardt.de/graphity-server-for-social-activity-streams-released-gplv3/
Mon, 02 Sep 2013 07:11:22 +0000

It has been almost two years since I published my first ideas and work on Graphity, which is nowadays a collection of algorithms to support efficient storage and retrieval of more than 10k social activity streams per second. Think of the typical use case of Twitter, Facebook and co.: retrieving the most recent status updates from your circle of friends.
Today I proudly present the first version of the Graphity News Stream Server. Big thanks to Sebastian Schlicht, who worked for me implementing most of the servlet and did an amazing job! The Graphity Server is a neo4j-powered servlet with the following properties:

  • Response times for requests are usually below 10 milliseconds (plus network I/O, e.g. TCP round trips from HTTP).
  • The Graphity News Stream Server is free open source software (GPLv3) and hosted in the metalcon git repository. (Please also use the bug tracker there to submit bugs and feature requests.)
  • It runs two Graphity algorithms: one is read-optimized, the other is write-optimized, in case you expect your application to receive more write than read requests.
  • The server comes with a REST API which makes it easy to plug the server into whatever application you have.
  • The server’s responses follow the activitystrea.ms format, so out of the box a large number of clients are available to render them.
  • The server ships with unit tests and extensive documentation, especially of the news stream server protocol (NSSP), which specifies how to talk to the server. The server can currently handle about 100 write requests in medium-sized (about a million nodes) networks. I do not recommend using this server if you expect your user base to grow beyond 10 million users (though we are working on making the server scale). This is mostly because our database currently won’t really scale beyond one machine and some internal operations have to be handled synchronously.
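To give a feeling for what the server's responses look like, here is a minimal Java sketch of an activitystrea.ms-style activity entry. The field names ("actor", "verb", "object", "published", "objectType") come from the JSON Activity Streams 1.0 specification; the class name and the concrete values are made up for illustration and are not taken from the Graphity Server's code.

```java
// Builds a minimal activitystrea.ms-style activity entry as a JSON string.
// Field names follow the Activity Streams 1.0 spec; values are illustrative.
public class ActivityExample {
    static String activityJson(String actor, String verb,
                               String content, String published) {
        return String.format(
            "{\"actor\":{\"objectType\":\"person\",\"id\":\"%s\"}," +
            "\"verb\":\"%s\"," +
            "\"object\":{\"objectType\":\"note\",\"content\":\"%s\"}," +
            "\"published\":\"%s\"}",
            actor, verb, content, published);
    }

    public static void main(String[] args) {
        // an example status update as an activity entry
        System.out.println(activityJson("acct:rene@example.org", "post",
                "Graphity server released!", "2013-09-02T07:11:22Z"));
    }
}
```

Because the format is standardized, any of the existing activitystrea.ms clients can render such a response without custom parsing code.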

Koding.com is currently considering implementing Graphity-like algorithms to power their activity streams. It was Richard from their team who pointed out, in a very fruitful discussion, how to avoid neo4j’s limit of 2^15 = 32768 relationship types by using an overlay network. His ideas for an overlay network have been implemented in the read-optimized Graphity algorithm. Big thanks to him!
Now I am really excited to see what kinds of applications you will build using Graphity.

If you use Graphity

Please tell me if you start using Graphity; that would be awesome to know, and I will most certainly include you in a list of testimonials.
By the way, if you want to help spread the server (which is also good for you, since more developers using it means a higher chance of newer versions), you can vote up my answer on Stack Overflow:
http://stackoverflow.com/questions/202198/whats-the-best-manner-of-implementing-a-social-activity-stream/13171306#13171306

How to get started

It’s darn simple!

  1. Clone the git repository or get hold of the source code.
  2. Then switch into the repository and type sudo ./install.sh.
  3. Copy the war file into your Tomcat webapps folder (if you don’t know how to set up Tomcat and Maven, which are needed, we have a detailed setup guide).
  4. And you’re done! More configuration details are in our README.md.
  5. Look into the newswidget folder to find a simple HTML/JavaScript client which can interact with the server.
I also created a small, simple screencast to demonstrate the setup:

Get involved

There are plenty of ways to get involved:

  • Fork the server
  • Submit a bug report
  • Fix a bug
  • Subscribe to the mailing list.

Further links:

Drug junkie steals my neo4j t-shirt out of my physical mailbox
https://www.rene-pickhardt.de/drug-junkie-steals-my-neo4j-t-shirt-out-of-my-physical-mailbox/
Wed, 17 Jul 2013 18:51:46 +0000
Me wearing my stolen neo4j shirt which I took back from the thief

At FOSDEM 2013, Peter from Neo4j asked me whether I would like to get a neo4j shirt sent to my home address. Keep in mind that I had just moved back to Koblenz from China. And I did not just move to Koblenz but to Koblenz-Lützel. I knew from my colleagues that this part of Koblenz is supposed to be the Harlem of NYC. But I found a really nice flat where I live together with three other nice students. Even though I was skeptical when looking at the flat, I had to move there after meeting my future roommates.
A couple of weeks after moving in I started smelling pot more and more frequently on the streets, especially from people smoking it in front of my front door. I had even observed people exchanging packages in the backyard of our house. Of course I cannot say for sure, but even at the time I was pretty confident that they were dealing drugs. Over the last couple of weeks we have had several problems in our house:

  • People broke into our basement and stole some stuff.
  • Another time people broke into our basement and stored bikes there which did not belong to any of our neighbors.
  • There is a bullet hole in our front door.
  • Last but not least: I was about to leave our backyard when I saw a guy wearing my neo4j shirt.

OK, let me elaborate on the last one:
Neo4j, the graph database, is not exactly known as the most famous fashion brand in the world. So I almost didn’t recognize the shirt when I saw him wearing it. But I somehow recognized the design and decided to turn around to get a second look. Who in my neighborhood would wear such a shirt, and what connection would he have to this rather new piece of technology?
When I turned around, things got even stranger. I saw the back of the guy, and his shirt said:
“My stream is faster than yours”, which is certainly a reference to Graphity,
and also displayed the Cypher query:
(renepickhardt) <-[:cites]- (peter)
I was so perplexed that I didn’t realize I was alone while the guy was standing there with two other men. I said: “Sorry, you are wearing my shirt!” His friends came in and told me I was crazy and asked how I could come up with such an idea. I insisted that my name was written on the shirt. My full name, in fact! And I knew the quote was exactly what Peter had planned to print on my shirt.
The guys started mocking me and telling me to f… off. But I somehow resisted and pointed out again that this was certainly my shirt. At that moment the door of the Kung Fu school opened, and the coach, Mr. Lai, came out and asked whether the guys had again stolen packages from our post box. At that moment the guy with my shirt had to turn around again, so everybody could see my name. He started telling me some weird lie about how he got the shirt as a present and just thought it looked nice, but he finally returned it to me.
Most interestingly, the police didn’t care. The policeman only said: “It’s your own fault when you move to a place like Koblenz-Lützel.” I find this very disappointing. I always thought our police should be objective and neutral. Stealing and opening other people’s mail is a crime in Germany. So is possessing drugs or stealing bikes… It is sad that the police refuse to help us with our situation.
Well, anyway: if you have an important message for me, why don’t you use email rather than physical mail? My email is also potentially read by third parties, but at least it still gets delivered safely. Anyway, big thanks to the guys from neo4j for my new shirt (:

Video of FOSDEM talk finally online
https://www.rene-pickhardt.de/video-of-fosdem-talk-finally-online/
Tue, 25 Jun 2013 15:55:57 +0000

I visited FOSDEM 2013 with Heinrich Hartmann to talk about related-work.net. The video of this talk is finally online, and of course I would like to share it with the community:

The slides can be found here.

GWT + database connection in Servlet ContextListener – Auto Complete Video Tutorial Part 5
https://www.rene-pickhardt.de/gwt-database-connection-in-servlet-contextlistener-auto-complete-video-tutorial-part-5/
Mon, 24 Jun 2013 11:44:47 +0000

Finally we have all the basics needed for building an autocomplete service, and now comes the juicy part. From now on we look at how to make it fast and robust. In the current approach we open a new database connection for every HTTP request. This takes quite some time to lock the database (at least when using neo4j in embedded mode) and then to run the query, without any opportunity to benefit from the database’s caching strategy.
In this tutorial I introduce you to the concept of a ContextListener. This is, roughly speaking, a way of storing objects in the Java servlet container’s global memory using key-value pairs. Once we understand this, the roadmap is very clear: we can store objects like database connections or search indices in the memory of our web server. From what I currently understand, this could also be used to implement some server-side caching. I have not done any benchmarking yet to test how fast retrieving objects from the context is in Tomcat. Also, this method of caching does not scale horizontally as well as using memcached.
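To illustrate the pattern (this is not the tutorial's actual code): the ServletContext is essentially an application-wide key-value store. The real version implements javax.servlet.ServletContextListener, which needs a servlet container to run, so this self-contained sketch mimics the idea with a plain map; the class name and attribute key are my own invention.

```java
// Mimics the ServletContextListener pattern: open the expensive database
// connection once at "application startup" and share it across requests,
// instead of opening a new connection per HTTP request.
import java.util.HashMap;
import java.util.Map;

public class ContextListenerSketch {
    // stand-in for the ServletContext attribute map
    static final Map<String, Object> context = new HashMap<>();
    static final String DB_KEY = "graphDb"; // illustrative attribute key

    // corresponds to contextInitialized(ServletContextEvent) in the real API
    static void contextInitialized() {
        // in the real webapp this would be e.g. an embedded neo4j instance
        Object dbConnection = "expensive-database-connection";
        context.put(DB_KEY, dbConnection);
    }

    // every "request" reuses the same connection instead of reopening it
    static Object handleRequest() {
        return context.get(DB_KEY);
    }

    public static void main(String[] args) {
        contextInitialized();
        // both requests see the identical shared connection object
        System.out.println(handleRequest() == handleRequest());
    }
}
```

In the real servlet you would register the listener in web.xml and call servletContext.setAttribute(...) / getAttribute(...) instead of the map, but the request-handling side looks just like this: one shared object, fetched by key.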
Anyway have fun learning about the context listener.

If you have any suggestions, comments or thoughts, or even know of some solid benchmarks about caching using the ServletContext (I did a quick web search for a few minutes and didn’t find any), feel free to contact me and discuss this!

Building an Autocomplete Service in GWT screencast Part 4: Integrating the neo4j Data base
https://www.rene-pickhardt.de/building-an-autocomplete-service-in-gwt-screencast-part-4-integrating-the-neo4j-data-base/
Thu, 20 Jun 2013 12:38:46 +0000

In this screencast of my series I explain at a very basic level how to integrate a database to pull data for autocomplete queries. Since we were working with neo4j at the time, I used a neo4j database. Only in the next two parts of this series will I introduce an efficient way of handling the database (using the web server’s context listener) and building fast indices. So in this lesson the resulting autocomplete service will be really slow and impractical to use, but I am sure that for didactic reasons it is ok to invest 7 minutes in a rather bad design.
Anyway, if you want to use the same data set as I used in this screencast, you can go to http://data.related-work.net and find the data set as well as a description of the database schema:
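The embedded neo4j calls from the screencast need the database on the classpath, so here is only a hypothetical stand-in that shows the shape of the naive per-request lookup: scan all stored titles for the prefix the user typed and return the matches. The titles and names below are illustrative, not the actual related-work data set.

```java
// Naive, index-free autocomplete lookup: a linear scan over all titles
// on every request. This is exactly the slow design this part of the
// screencast ends up with, before indices are introduced.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NaiveAutocomplete {
    // illustrative stand-in for the titles stored in the database
    static final List<String> TITLES = Arrays.asList(
            "Graphity: an efficient graph model for retrieving news feeds",
            "Graph databases in practice",
            "Generalized language models");

    // one "database query" per request: scan everything, keep prefix matches
    static List<String> suggest(String prefix) {
        List<String> result = new ArrayList<>();
        for (String title : TITLES) {
            if (title.toLowerCase().startsWith(prefix.toLowerCase())) {
                result.add(title);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints the two titles starting with "Graph"
        System.out.println(suggest("graph"));
    }
}
```

The real screencast replaces the list with a query against the neo4j schema from data.related-work.net, but the per-request cost profile is the same: without an index, every keystroke pays for a full scan.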

Metalcon finally gets a redesign – Thinking about high scalability
https://www.rene-pickhardt.de/metalcon-finally-becomes-a-redesign-thinking-about-high-scalability/
Mon, 17 Jun 2013 15:21:30 +0000

Finally metalcon.de, the social networking site which Jonas, Jens and I created in 2008, gets a redesign. Thanks to the great opportunities at the Institute for Web Science and Technologies here in Koblenz (why don’t you apply for a PhD position with us?), I will have the chance to code up the new version of Metalcon. Kicking off on July 15th, I will lead a team of five programmers for four months. Not only will the development be open source, but during this time I will constantly (hopefully on a daily basis) write in this blog about the design decisions we make in order to achieve a well-scaling web service.
Before I share my thoughts on high-scalability architectures for websites, I want to give a little history and background on what Metalcon is and why this redesign is so necessary:

Metalcon is a social networking site for German fans of metal music. It currently has

  • a user base of 10,000 users
  • about 500 registered bands
  • a highly semantic and interlinked database (bands, geographical coordinates, friendships, events)
  • 624 MB of text and structured data about these topics
  • fairly good visibility in search engines
  • more than 30k lines of code (mostly PHP)
  • a badly scaling architecture (our own OR mapper, our own AJAX libraries, a big monolithic database design, bad usage of PHP, …)
  • no unit tests (so code maintenance is almost impossible)
  • no music and audio files
  • no processes for content moderation
  • no processes to fight spam and block users
  • really bad usability (I could write tons of posts about where the usability is lacking)
  • no clear distinction of features for users to understand

When we built Metalcon, no one on the team had experience with high-scaling web applications, and we were happy to get it running at all. After returning from China and starting my PhD program in 2011, I was about to shut down Metalcon. Though we had become close friends, the core team had already moved on to new projects and we were lacking manpower. On the other hand, everyone kept telling me that Metalcon would be a great place to do research. So in 2011 Jonas and I decided to give it another shot and do an open redevelopment. We set up a wiki to document our features and the software, and we created a developer blog which we used to exchange ideas. We also created some open source projects, to which we hardly contributed code due to the lack of manpower…
Well, at that time we already knew of too many problems, so fixing them one by one was not the way to go. At least we learned a lot. Thinking about high-scaling architectures at that time, I knew that a news feed (which the old version of Metalcon already had) was core to the user experience. Reading many Stack Exchange discussions, I knew that you wouldn’t build such a stream on MySQL. Also, playing around with graph databases like neo4j, I came to my first research paper, building Graphity, a piece of software designed to distribute highly personalized news streams to users. Since our development was not proceeding, we never deployed Graphity within Metalcon. Building an autocomplete service for the site should also no longer be a problem.

Roadmap for the redesign

  • Over the next weeks I hope to read as many interesting articles about technologies and high scalability as I can possibly find, and I will be more than happy to get your feedback and suggestions here. I will start with many articles from http://highscalability.com/; this blog is pure gold for serious web developers.
  • During a nice discussion about scalability with Heinrich, we already came up with a potential architecture for Metalcon. I will introduce this architecture soon, but first I want to check the best practices in the High Scalability blog.
  • In parallel I will collect the features needed for the new Metalcon version and hopefully be able to pair them with useful technologies. I have already started a wiki page about features and the technologies planned to support them.
  • I will also need to decide on the programming language and paradigms for the development. Right now I am weighing Ruby on Rails against GWT. We made some great experiences with the power of GWT, but one major drawback is certainly that the resulting website is more an application than a lightweight website.

So again, feel free to give input and to share your ideas and experiences with me and the community. I will be very grateful for every recommendation of articles, videos, books and so on.

Building an Autocomplete Service in GWT screencast Part 3: Getting the Server code to send a basic response
https://www.rene-pickhardt.de/building-an-autocomplete-service-in-gwt-screencast-part-3-getting-the-server-code-to-send-a-basic-response/
Mon, 17 Jun 2013 12:20:11 +0000

In this screencast of my series on building an autocomplete service, you will learn how to implement a server servlet in GWT such that autocomplete queries receive a response. In this video the response will always be static and very naive. It will be up to the fourth part of this series, which follows later this week, to make the server do something meaningful with the query. This part is rather meant to show how the server is supposed to be invoked and what kinds of tools and classes are needed. So see this as preparation for the really interesting stuff.
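The static stage can be sketched without the GWT servlet classes (which need a servlet container to run); this hypothetical snippet, with names of my own choosing, just shows the idea: whatever query arrives, the server answers with the same fixed suggestions.

```java
// The naive first version of the suggestion endpoint: the query is
// accepted but ignored, and a fixed list comes back every time.
import java.util.Arrays;
import java.util.List;

public class StaticSuggestServlet {
    // in the screencast this logic sits inside a GWT service servlet
    static List<String> getSuggestions(String query) {
        // static response: the query has no influence yet
        return Arrays.asList("suggestion 1", "suggestion 2", "suggestion 3");
    }

    public static void main(String[] args) {
        // both calls yield the identical static answer
        System.out.println(getSuggestions("foo"));
        System.out.println(getSuggestions("bar"));
    }
}
```

Once this round trip works, part 4 swaps the fixed list for an actual database lookup without touching the invocation path.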

If you have any questions, suggestions and comments feel free to discuss them.

Building an Autocompletion on GWT screencast Part 2: Invoking The Remote Procedure Call
https://www.rene-pickhardt.de/building-an-autocompletion-on-gwt-screencast-part-2-invoking-the-remote-procedure-call/
Tue, 12 Mar 2013 07:25:00 +0000

Hey everyone, after posting the first screencast in this series, reviewing the basic process for creating remote procedure calls in GWT, we are now finally starting with the real tutorial for building an autocomplete service.
This tutorial (again hosted on Wikipedia) covers the basic user interface, meaning

  • how to integrate a SuggestBox instead of a text field into the GWT starter project
  • how to set up the necessary parts (extending a SuggestOracle) to fire a remote procedure call that requests suggestions once the user has typed something
  • how to override the necessary methods of the SuggestOracle interface

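The GWT client classes are not available outside a GWT project, so the bullet points above can only be sketched in shape. This self-contained, hypothetical snippet mirrors the SuggestOracle contract: the SuggestBox hands your oracle the typed text, and you answer asynchronously through a callback (in the tutorial, after the RPC returns). All names here stand in for the real GWT types.

```java
// Shape sketch of GWT's SuggestOracle contract using plain Java types.
import java.util.Arrays;
import java.util.List;

public class OracleSketch {
    // stands in for SuggestOracle.Callback
    interface Callback {
        void onSuggestionsReady(String query, List<String> suggestions);
    }

    // stands in for SuggestOracle.requestSuggestions(Request, Callback)
    static void requestSuggestions(String query, Callback callback) {
        // the tutorial fires a GWT-RPC here; we answer immediately instead
        List<String> fromServer = Arrays.asList(query + "-a", query + "-b");
        callback.onSuggestionsReady(query, fromServer);
    }

    public static void main(String[] args) {
        requestSuggestions("metal",
                (q, s) -> System.out.println(q + " -> " + s));
    }
}
```

The asynchronous callback is the important part: since the real suggestions arrive over the network, the oracle can never return them directly, which is why the interface is shaped this way.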
So here we go with the second part of the screencast, which you can of course directly download from Wikipedia:

Feel free to ask questions, give comments and improve the screencast!

Building an Autocompletion on GWT screencast Part 1: Getting Warm – Reviewing remote procedure calls
https://www.rene-pickhardt.de/building-an-autocompletion-on-gwt-screencast-part-1-getting-warm-reviewing-remote-procedure-calls/
Tue, 19 Feb 2013 09:11:29 +0000

Quite a while ago I promised to create some screencasts on how to build a (personalized) autocompletion in GWT. Even though the screencasts were created quite some time ago, I had to wait to publish them for various reasons.
Finally it is time to go public with the first video. I really do start from scratch, so the first video might be a little bit boring since I am only reviewing GWT’s remote procedure calls.
A little note: the video is hosted on Wikipedia! I think it is important to spread knowledge under a Creative Commons license, and the YouTubes, Vimeos, … of this world are rather trying to achieve a vendor lock-in. So if the embedded player does not work well for you, you can go directly to Wikipedia for a fullscreen version or a direct download of the video.

Another note: I did not publish the source code! This has a pretty simple reason (and yes, you can call me crazy): if you really want to learn something, copying and pasting code doesn’t give you full understanding. Doing it step by step, e.g. watching the screencasts and reproducing the steps, is the way to go.
As always I am open to suggestions and feedback, but please keep in mind that the entire course of videos is already recorded.

The start of the Linked Data benchmark council EU FP7 Big Data pro
https://www.rene-pickhardt.de/the-start-of-the-linked-data-benchmark-council-eu-fp7-big-data-pro/
Sat, 02 Feb 2013 14:56:55 +0000

Peter, who works for Neo4j, is an industry partner of http://www.ldbc.eu/, an EU FP7 project in the Big Data call.
The goal of this project is to put out good methodologies for benchmarking linked open data and RDF stores as well as graph databases. In this context the council should also provide data sets for benchmarking.
Peter points out that a simple problem exists with benchmarks: “whoever puts it out wins”. One simple reason is that benchmarking has so many flexible variables that it is really hard. He compared the challenges to those of the TPC: http://de.wikipedia.org/wiki/Transaction_Processing_Performance_Council
After talking about the need for good benchmarks, he pointed out again why the Transaction Processing Performance Council benchmarks are no longer sufficient, giving many examples of exploding big graphs around the world (Facebook, Google Knowledge Graph, Linked Open Data, DBpedia).
Since the project is really new, Peter could not report any results yet. Anyway, I am pretty sure that anyone interested in graph databases and graph data should look into the project, which has the following list of deliverables:

  • overview of current graph benchmarks and designs
  • benchmark principles and methods
  • query languages (Cypher, Gremlin, SPARQL)
  • analysis and classification of choke points (supernodes, data generators)
  • benchmarking transactions (which are in general very slow)
  • benchmarking the complexity of queries
  • analysis (if anyone has data sets and use cases, contact the LDBC; actually, I think we have data coming from related-work)
  • navigational benchmarks (e.g. OpenStreetMap)
  • benchmark design for pattern matching (e.g. SPARQL and Cypher)

As you could hear on the side, there is a huge discussion going on about query languages, which I like. Creating a query language is a tough task. The more expressive a language is (like SPARQL), the less efficient it might become. So I hope the EU project will really create some good, solid output. I am also happy that many different industry vendors are part of this project. In this sense the results will hopefully be objective and not suffer from the “whoever puts it out wins” paradigm.
Interestingly, the LDBC makes a separation between graph databases and RDF stores, which I am very pleased to see and have been thinking about a lot.
