Sorry to start with a conclusion first… To me Titan graph seems to be the egg-laying wool-milk-sow that people would dream of when working with graph data. Especially if one needs graph data in a web context and in real time. I will certainly try to free some time to check this out and get hands on. Also this thing is really new and revolutionary. This is not just another Hadoop or Giraph approach for big data processing this is distributed in real time! I am almost confident if the benchmark hold what it promised Titan will be one of the fastest growing technologies we have seen so far.
 

I met Matthias Bröcheler (CTO of Aurelius the company behind titan graph) 4 years ago in a teaching situation for the German national student high school academy. It was the time when I was still more mathematician than computer scientist but my journey in becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experiences a lot. It was for him that my eyes got open for the first time about what big data means and how companies like facebook, google and so on knit their business model around collecting data. Anyway Matthias influenced me in quite some way and I have a lot of respect of him.

I did not start my PhD right away and we lost contact. I knew he was interested in graphs but that was about it. First when I started to use neo4j more and more I realized that Matthias was also one of the authors of the tinkerpop blueprints which are interfaces to talk to graphs which most vendors of graph data bases use. At that time I looked him up again and I realized he was working on titan graph a distributed graph data base. I found this promising looking slide deck:

Slide 106:

Slide 107:

But at that time for me there wasn’t much evidence that Titan would really hold the promise that is given in slides 106 and 107. In fact those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases (By the way: Reading the PhD Proposal now I am kind of amused since I did not really aim for the important and big points like Titan did.)

During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button and especially to be able to integrate this with recommendations coming from mahout. I started to realize the big fundamental differences between HBase (Implementation of Google Bigtable) and Cassandra (Implementation of Amazon Dynamo) which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines I stumbled again on titan and seeing Matthias’ talk on the Cassandra summit 2013. Around minute 21 / 22 the talk is getting really interesting. I can also suggest to skip the first 15 minutes of the talk:

Let me sum up the amazing parts of the talk:

  • 2400 concurrent users against a graph cluster!
  • real time!
  • 16 different (non trivial queries) queries 
  • achieving more than 10k requests answered per second!
  • graph with more than a billion nodes!
  • graph partitioning is plugable
  • graph schema helps indexing for queries
So far I was not sure what kind of queries were really involved. Especially if there where also write transactions and unfortunately no one in the audience asked that question. So I started googleing and found this blog post by aurelius. As we can see there is an entire overview on the queries and much more detailed the results are presented. Unfortunately  I was not able to find the source code of that very benchmark (which Matthias promised to open in his talk). On Average most queries take less than half a second.
 
Even though the source code is not available this talk together with the Aurelius blog post looks to me like the most interesting and hottest piece of technology I came across during my PhD program. Aurelius started to think distributed right away and made some clever design decisions:
  • Scaling data size
  • scaling data access in terms of concurrent users (especially write operations) is fundamentally integrated and seems also to be successful integrated. 
  • making partitioning pluggable
  • requiring an schema for the graph (to enable efficient indexing)
  • being able on runtime to extend the schema.
  • building on top of ether Cassandra (for realtime) or HBase for consistency
  • being compatible with the tinkerpop techstack
  • bringing up an entire framework for analytics and graph processing.

Further resources:

If you like this post, you might like these related posts:

  1. Related work of the Reading club on distributed graph data bases (Beehive, Scalable SPARQL Querying of Large RDF Graphs, memcached) Today we finally had our reading club and discussed several...
  2. PhD proposal on distributed graph data bases Over the last week we had our off campus meeting...
  3. From Graph (batch) processing towards a distributed graph data base Yesterdays meeting of the reading club was quite nice. We...
  4. Wishlist of features for a distributed graph data base technology I am just dreaming this does not exist and needs...
  5. Reading Club on distributed graph db returns with a new Format on April 4th 2012 The reading club was quite inactive due to traveling and...

Sharing:

Tags: , , , , , , , , ,

1 Comment on Aurelius Titan graph enables realtime querying with 2400 concurrent users on a distributed graph database!

  1. Jack Jones says:

    Hi René!

    Did you happen to see this post on Titan? http://thinkaurelius.com/2012/.....raph-data/

    Note the O(log(n)) you get from making the datetime the “primary key” index on the stream edges! Clever stuff the “vertex-centric indices”. Blew my mind.

    Keep in touch.

Leave a Reply

*

Close

Subscribe to my newsletter

You don't like mail?