Reading Club Management of Big Data

Rene — Thu, 05 Sep 2013 10:01:28 +0000

Even though the reading club on distributed graph databases stopped, I never really lost interest in the management of big data and graph data. Due to the development of research grants and some new coworkers in our group, I decided to create a new reading club. (The first meeting will be Thursday, September 12th, at 15:30 Central European Time.) The reading club won’t be on a weekly basis but rather something like once a month. Tell me if you want to join via Hangout or something similar! But I would like to be clear: if you didn’t carefully prepare the reading assignments by bringing questions and points for discussion to the meeting, then don’t join the meeting. I don’t consider skimming a paper careful preparation.
The road map for the reading club on big data is quite clear: we will reread some papers that we read before, but we will also look deeper and check out some existing technologies. So the reading will not only consist of scientific work (though this will form the basis) but also of hands-on and practical sessions, which we draw from blogs, tutorials, documentation and handbooks.
Here is the preliminary structure and road map for the reading club on big data, which of course could easily vary over time!

Google File System (implemented in the Hadoop file system)
MapReduce (implemented in Hadoop)
Google Bigtable (implemented in HBase)
Google Pregel (implemented in Giraph)
Amazon Dynamo (implemented in Cassandra)
On the way we will probably dig into some basics like the Message Passing Interface, the gossip protocol or the CAP theorem.

Along these lines we want to understand

Why do these technologies scale?
How do they handle concurrent traffic (especially write requests)?
How can performance be increased, and is there another way of building such highly scalable systems?
What kind of applications (like Titan or Mahout) are built on top of these systems?

At some point I would also love to do some side reading on distributed algorithms and on the design of distributed and parallel algorithms and data structures.

As stated above, the reading club will be much more hands-on in the future than before. I expect us to also deliver tutorials like the one on getting Nutch running on top of HBase and Solr.
Even though we want to get hands-on with current technologies, the goal is rather to understand the principles behind them and find ways of improving them instead of just applying them to various problems.
I am considering starting a wiki page on Wikiversity to create something like a course on big data management, but I would only do this if I find a couple of people who would actively help to contribute to such a course. So please contact me if you are interested!
So to sum up, the reading assignments for the first meeting are the Google File System paper and the MapReduce paper.
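For anyone preparing the MapReduce paper, here is a minimal single-process sketch of the programming model using the classic word count example. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this post, and a real framework such as Hadoop would run the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over a dict of documents."""
    pairs = (pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    docs = {"d1": "big data big graphs", "d2": "graph data"}
    print(word_count(docs))
```

The point to take to the discussion is that the user only writes the map and the reduce function, while grouping by key (and, in a real system, distribution and fault tolerance) is left to the framework.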

Aurelius Titan graph enables realtime querying with 2400 concurrent users on a distributed graph database!

Rene — Wed, 21 Aug 2013 10:43:02 +0000

Sorry to start with the conclusion… To me, Titan graph seems to be the egg-laying wool-milk-sow that people dream of when working with graph data, especially if one needs graph data in a web context and in real time. I will certainly try to free some time to check this out and get hands-on. Also, this thing is really new and revolutionary. This is not just another Hadoop or Giraph approach to big data processing; this is distributed in real time! I am fairly confident that, if the benchmark holds what it promises, Titan will be one of the fastest-growing technologies we have seen so far.

I met Matthias Bröcheler (CTO of Aurelius, the company behind Titan graph) 4 years ago in a teaching situation for the German national student high school academy. It was the time when I was still more mathematician than computer scientist, but my journey to becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experiences a lot. It was thanks to him that my eyes were opened for the first time to what big data means and how companies like Facebook, Google and so on knit their business models around collecting data. Anyway, Matthias influenced me in quite some way and I have a lot of respect for him.
I did not start my PhD right away and we lost contact. I knew he was interested in graphs, but that was about it. Only when I started to use Neo4j more and more did I realize that Matthias was also one of the authors of the TinkerPop Blueprints, which are interfaces for talking to graphs that most vendors of graph databases use. At that time I looked him up again and realized he was working on Titan graph, a distributed graph database. I found this promising-looking slide deck:

Slide 106:

Slide 107:

But at that time there wasn’t much evidence for me that Titan would really keep the promise given in slides 106 and 107. In fact, those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases. (By the way: reading the PhD proposal now, I am kind of amused, since I did not really aim for the important and big points like Titan did.)
During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button and especially to be able to integrate this with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (an implementation of Google Bigtable) and Cassandra (an implementation of Amazon Dynamo), which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines, I stumbled again on Titan and saw Matthias’ talk at the Cassandra Summit 2013. Around minute 21 or 22 the talk gets really interesting. I would also suggest skipping the first 15 minutes of the talk:

Let me sum up the amazing parts of the talk:

2400 concurrent users against a graph cluster!
real time!
16 different (non-trivial) queries
achieving more than 10k requests answered per second!
graph with more than a billion nodes!
graph partitioning is pluggable
graph schema helps indexing for queries
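To get a feeling for how the numbers above fit together, here is a rough back-of-the-envelope check using Little's law (N = X · T). It is a sketch under assumptions that the talk does not confirm: that the 2400 users and the 10k requests per second describe the same steady-state run, and that each user has at most one request in flight. The function name `mean_cycle_time` is made up for this post.

```python
def mean_cycle_time(concurrent_users, throughput_per_second):
    """Little's law: N = X * T, so the mean time per user cycle is T = N / X.

    The cycle covers the response time plus any pause ("think time")
    a user makes before sending the next request.
    """
    if throughput_per_second <= 0:
        raise ValueError("throughput must be positive")
    return concurrent_users / throughput_per_second


if __name__ == "__main__":
    # Figures quoted from the talk; assumed to belong to the same run.
    cycle = mean_cycle_time(2400, 10_000)
    print(f"{cycle:.2f} s per user cycle")
```

Since the throughput was reported as more than 10k requests per second, the result is an upper bound on the average cycle under these assumptions, and it includes any think time between requests.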

So far I was not sure what kind of queries were really involved, and in particular whether there were also write transactions. Unfortunately, no one in the audience asked that question. So I started googling and found this blog post by Aurelius. As we can see there, it gives an entire overview of the queries and presents the results in much more detail. Unfortunately, I was not able to find the source code of that very benchmark (which Matthias promised to open in his talk). On average, most queries take less than half a second.

Even though the source code is not available, this talk together with the Aurelius blog post looks to me like the most interesting and hottest piece of technology I have come across during my PhD program. Aurelius started to think distributed right away and made some clever design decisions:

scaling data size
scaling data access in terms of concurrent users (especially write operations), which is fundamentally integrated and also seems to work successfully
making partitioning pluggable
requiring a schema for the graph (to enable efficient indexing)
being able to extend the schema at runtime
building on top of either Cassandra (for real time) or HBase (for consistency)
being compatible with the TinkerPop tech stack
bringing up an entire framework for analytics and graph processing
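The choice between Cassandra and HBase in the list above reflects the CAP trade-off. In Dynamo-style stores such as Cassandra, consistency is tunable through quorum sizes: with N replicas, W write acknowledgements and R replicas read, every read is guaranteed to see at least one replica holding the latest write when R + W > N. The sketch below checks this rule by brute force. It is not Titan or Cassandra code, and the function names are made up for this post.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """The quorum rule: read and write sets must intersect when r + w > n."""
    return r + w > n


def overlap_by_enumeration(n, w, r):
    """Check every possible write set against every possible read set."""
    replicas = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(replicas, w)
               for read_set in combinations(replicas, r))


if __name__ == "__main__":
    for w, r in [(1, 1), (2, 2), (3, 1)]:
        print(f"N=3 W={w} R={r}:",
              quorums_overlap(3, w, r), overlap_by_enumeration(3, w, r))
```

Small quorums (for example W = 1, R = 1) give fast answers but allow stale reads, which is one way to read the real-time versus consistency distinction made in the list.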

Further resources:

https://github.com/thinkaurelius/titan
https://github.com/renepickhardt/metalcon/wiki/Technologytitan (our wiki page in metalcon where we will collect more information about titan.)
There is also a screencast by Marko available on how to set up Titan on an Amazon cluster and query MusicBrainz RDF data:
