Aurelius Titan graph enables realtime querying with 2400 concurrent users on a distributed graph database!

Rene — Wed, 21 Aug 2013 10:43:02 +0000

Sorry to start with a conclusion first… To me Titan graph seems to be the egg-laying wool-milk-sow that people would dream of when working with graph data. Especially if one needs graph data in a web context and in real time. I will certainly try to free some time to check this out and get hands on. Also this thing is really new and revolutionary. This is not just another Hadoop or Giraph approach for big data processing this is distributed in real time! I am almost confident if the benchmark hold what it promised Titan will be one of the fastest growing technologies we have seen so far.

I met Matthias Bröcheler (CTO of Aurelius the company behind titan graph) 4 years ago in a teaching situation for the German national student high school academy. It was the time when I was still more mathematician than computer scientist but my journey in becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experiences a lot. It was for him that my eyes got open for the first time about what big data means and how companies like facebook, google and so on knit their business model around collecting data. Anyway Matthias influenced me in quite some way and I have a lot of respect of him.
I did not start my PhD right away and we lost contact. I knew he was interested in graphs but that was about it. First when I started to use neo4j more and more I realized that Matthias was also one of the authors of the tinkerpop blueprints which are interfaces to talk to graphs which most vendors of graph data bases use. At that time I looked him up again and I realized he was working on titan graph a distributed graph data base. I found this promising looking slide deck:

Slide 106:

Slide 107:

But at that time for me there wasn’t much evidence that Titan would really hold the promise that is given in slides 106 and 107. In fact those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases (By the way: Reading the PhD Proposal now I am kind of amused since I did not really aim for the important and big points like Titan did.)
During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button and especially to be able to integrate this with recommendations coming from mahout. I started to realize the big fundamental differences between HBase (Implementation of Google Bigtable) and Cassandra (Implementation of Amazon Dynamo) which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines I stumbled again on titan and seeing Matthias’ talk on the Cassandra summit 2013. Around minute 21 / 22 the talk is getting really interesting. I can also suggest to skip the first 15 minutes of the talk:

Let me sum up the amazing parts of the talk:

2400 concurrent users against a graph cluster!
real time!
16 different (non trivial queries) queries
achieving more than 10k requests answered per second!
graph with more than a billion nodes!
graph partitioning is plugable
graph schema helps indexing for queries

So far I was not sure what kind of queries were really involved. Especially if there where also write transactions and unfortunately no one in the audience asked that question. So I started googleing and found this blog post by aurelius. As we can see there is an entire overview on the queries and much more detailed the results are presented. Unfortunately I was not able to find the source code of that very benchmark (which Matthias promised to open in his talk). On Average most queries take less than half a second.

Even though the source code is not available this talk together with the Aurelius blog post looks to me like the most interesting and hottest piece of technology I came across during my PhD program. Aurelius started to think distributed right away and made some clever design decisions:

Scaling data size
scaling data access in terms of concurrent users (especially write operations) is fundamentally integrated and seems also to be successful integrated.
making partitioning pluggable
requiring an schema for the graph (to enable efficient indexing)
being able on runtime to extend the schema.
building on top of ether Cassandra (for realtime) or HBase for consistency
being compatible with the tinkerpop techstack
bringing up an entire framework for analytics and graph processing.

Further resources:

https://github.com/thinkaurelius/titan
https://github.com/renepickhardt/metalcon/wiki/Technologytitan (our wiki page in metalcon where we will collect more information about titan.)
There is also a screencast by Marko available on how to set up titan on an amazon cluster and querying music brainz rdf data:

Metalcon finally gets a redesign – Thinking about high scalability

Rene — Mon, 17 Jun 2013 15:21:30 +0000

Finally metalcon.de the social networking site which Jonas, Jens and me created in 2008 gets a redesign. Thanks to the great opportunities at the Institute for Web Science and Technologies here in Koblenz (why don’t you apply for a PhD position with us?) I will have the chance to code up the new version of metalcon. Kicking off on July 15th I will lead a team of 5 programmers for the duration of 4 months. Not only will the development be open source but during this time I will constantly (hopefully on a daily basis) write in this blog about the design decisions we took in order to achieve a good scaling web service.
Before I share my thoughts on high scaling architectures for web sites I want to give a little history and background on what metalcon is and why this redesign is so necessary:

Metalcon is a social networking site for german fans of metal music. It currently has

a user base of 10’000 users.
about 500 registered bands
highly semantic and interlinked data base (bands, geographical coordinates, friendships, events)
624 MB of text and structured data about the mentioned topics.
fairly good visibility in search engines.
> 30k lines of code (mostly PHP)
a bad scaling architecture (own OR-mapper, own AJAX libraries, big monolithic data base design, bad usage of PHP,…)
no unit tests (so code maintenance is almost impossible)
no music and audio files
no processes for content moderation
no processes to fight spam and block users
a really bad usability (I could write tons of posts at which points the usability lacks)
no clear distinction of features for users to understand
…

When we built metalcon no one on the team had experience with high scaling web applications and we were about happy to get it running any way. After returning from china and starting my PhD program in 2011 I was about to shut down metalcon. Though we became close friends the core team was already up on new projects and we have been lacking manpower. On the other side everyone kept on telling me that metalcon would be a great place to do research. So in 2011 Jonas and me decided to give it another shot and do an open redevelopment. We set up a wiki to document our features and the software and we created a developer blog which we used to exchange ideas. Also we created some open source project to which we hardly contributed code due to the lacking manpower…
Well at that time we already knew of too many problems so that fixing was not the way to go. At least we did learn a lot. Thinking about high scaling architectures at that time I new that a news feed (which the old version of metalcon already had) was very core for the user experience. Reading many stack exchange discussions I knew that you wouldn’t build such a stream on MySQL. Also playing around with graph databases like neo4j I came to my first research paper building graphity a software which is designed to distribute highly personalized news streams to users. Since our development was not proceeding we never deployed Graphity within metalcon. Also building an autocomplete service for the site should not be a problem anymore.

Roadmap for the redesign

Over the next weeks I hope to read as many interesting articles about technologies and high scalability as I can possibly find and I will be more than happy to get your feedback and suggestions here. I will start reading many articles of http://highscalability.com/ This blog is pure gold for serious web developers.
During a nice discussion about scalability with Heinrich we already came up with a potential architecture of metalcon. I will soon introduce this architecture but want to check first about the best practices in the high scalability blog.
In parallel I will also collect the features needed for the new metalcon version and hopefully be able to pair them with usefull technologies. I already started a wikipage about features and planned technologies to support them.
I will also need to decide the programming language and paradigms for the development. Right now I am playing around with ruby on rails vs GWT. We made some greate experiences with the power of GWT but one major drawback is for sure that the website is more an application than some lightweight website.

So again feel free to give input, share your ideas and experiences with me and with the community. I will be ver greatfull for every recommendation of articles, videos, books and so on.

high scalability – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany

Aurelius Titan graph enables realtime querying with 2400 concurrent users on a distributed graph database!

Further resources:

Metalcon finally gets a redesign – Thinking about high scalability

Roadmap for the redesign