distributed graph data base – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany

Aurelius Titan graph enables realtime querying with 2400 concurrent users on a distributed graph database!

Rene — Wed, 21 Aug 2013 10:43:02 +0000

Sorry to start with the conclusion… To me Titan graph seems to be the egg-laying wool-milk-sow (a German idiom for an all-in-one solution) that people dream of when working with graph data, especially if one needs graph data in a web context and in real time. I will certainly try to free some time to check this out and get hands-on. This thing is also really new and revolutionary. It is not just another Hadoop or Giraph approach to big data processing: it is distributed and works in real time! I am almost confident that if the benchmark holds what it promises, Titan will be one of the fastest growing technologies we have seen so far.

I met Matthias Bröcheler (CTO of Aurelius, the company behind Titan graph) 4 years ago in a teaching situation for the German national student high school academy. At that time I was still more mathematician than computer scientist, but my journey to becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experiences a lot. It was thanks to him that my eyes were opened for the first time to what big data means and how companies like Facebook, Google and so on knit their business models around collecting data. Anyway, Matthias influenced me in quite some way and I have a lot of respect for him.
I did not start my PhD right away and we lost contact. I knew he was interested in graphs but that was about it. Only when I started to use neo4j more and more did I realize that Matthias was also one of the authors of the TinkerPop Blueprints, the interfaces for talking to graphs that most vendors of graph data bases use. At that time I looked him up again and realized he was working on Titan graph, a distributed graph data base. I found this promising looking slide deck:

Slide 106:

Slide 107:

But at that time there wasn’t much evidence for me that Titan would really keep the promises made in slides 106 and 107. In fact those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases. (By the way: reading the PhD proposal now I am kind of amused, since I did not really aim for the important and big points like Titan did.)
During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button and especially to be able to integrate this with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (an open-source implementation of Google's Bigtable design) and Cassandra (whose distribution model follows Amazon's Dynamo), which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines I stumbled on Titan again and saw Matthias’ talk at the Cassandra Summit 2013. Around minute 21 / 22 the talk gets really interesting. I suggest skipping the first 15 minutes of the talk:

Let me sum up the amazing parts of the talk:

2400 concurrent users against a graph cluster!
real time!
16 different non-trivial queries
achieving more than 10k requests answered per second!
graph with more than a billion nodes!
graph partitioning is pluggable
graph schema helps indexing for queries

So far I was not sure what kind of queries were really involved, especially whether there were also write transactions, and unfortunately no one in the audience asked that question. So I started googling and found this blog post by Aurelius. It gives an overview of all the queries and presents the results in much more detail. Unfortunately I was not able to find the source code of that very benchmark (which Matthias promised to open in his talk). On average most queries take less than half a second.
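As a rough plausibility check, the headline numbers can be related through Little's law (concurrency = throughput × mean latency). The sketch below assumes a closed system in which each of the 2400 users always has exactly one request in flight, with no think time between requests. That assumption is not stated in the talk, and the function name is made up for illustration.

```python
def mean_latency_seconds(concurrent_users: int, requests_per_second: float) -> float:
    """Little's law: L = lambda * W, hence W = L / lambda.

    Assumes every user always has exactly one outstanding request.
    """
    return concurrent_users / requests_per_second


# Numbers quoted in the talk summary: 2400 users, about 10k requests per second.
latency = mean_latency_seconds(2400, 10_000)
print(f"{latency:.2f} s")  # 0.24 s
```

Under this assumption the mean latency comes out at roughly a quarter of a second, which is at least consistent with most queries taking less than half a second. If the users pause between requests, the mean latency would be lower still.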

Even though the source code is not available this talk together with the Aurelius blog post looks to me like the most interesting and hottest piece of technology I came across during my PhD program. Aurelius started to think distributed right away and made some clever design decisions:

scaling data size
scaling data access in terms of concurrent users (especially write operations) is fundamentally integrated and also seems to be successfully integrated.
making partitioning pluggable
requiring a schema for the graph (to enable efficient indexing)
being able to extend the schema at runtime.
building on top of either Cassandra (for real time) or HBase (for consistency)
being compatible with the tinkerpop techstack
bringing up an entire framework for analytics and graph processing.
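To make the idea of pluggable partitioning a bit more concrete, here is a minimal sketch. It is not Titan's API: the class `PartitionedVertexStore`, the two partitioner functions and the `kind:name` id convention are all hypothetical, and each in-memory dictionary merely stands in for one machine.

```python
import zlib
from typing import Any, Callable, Dict, List, Optional

# A partitioner maps (vertex id, number of partitions) to a partition index.
Partitioner = Callable[[str, int], int]


def hash_partitioner(vertex_id: str, num_partitions: int) -> int:
    """Spread vertices evenly using a stable hash of the full id."""
    return zlib.crc32(vertex_id.encode("utf-8")) % num_partitions


def prefix_partitioner(vertex_id: str, num_partitions: int) -> int:
    """Co-locate vertices that share the id prefix before ':' (e.g. 'band')."""
    prefix = vertex_id.split(":", 1)[0]
    return zlib.crc32(prefix.encode("utf-8")) % num_partitions


class PartitionedVertexStore:
    """Toy in-memory store; each partition stands in for one machine."""

    def __init__(self, num_partitions: int,
                 partitioner: Partitioner = hash_partitioner) -> None:
        self.num_partitions = num_partitions
        self.partitioner = partitioner  # the pluggable part
        self.partitions: List[Dict[str, Dict[str, Any]]] = [
            {} for _ in range(num_partitions)
        ]

    def put(self, vertex_id: str, properties: Dict[str, Any]) -> int:
        """Store a vertex and return the partition it was placed on."""
        index = self.partitioner(vertex_id, self.num_partitions)
        self.partitions[index][vertex_id] = properties
        return index

    def get(self, vertex_id: str) -> Optional[Dict[str, Any]]:
        """Look the vertex up on the partition the partitioner points to."""
        index = self.partitioner(vertex_id, self.num_partitions)
        return self.partitions[index].get(vertex_id)


store = PartitionedVertexStore(4, partitioner=prefix_partitioner)
first = store.put("band:metallica", {"genre": "metal"})
second = store.put("band:slayer", {"genre": "metal"})
print(first == second)  # True: both ids share the prefix "band"
```

The store never needs to know how placement is decided, so swapping `prefix_partitioner` for `hash_partitioner` changes data locality without touching the read and write paths.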

Further resources:

https://github.com/thinkaurelius/titan
https://github.com/renepickhardt/metalcon/wiki/Technologytitan (our wiki page in metalcon where we will collect more information about titan.)
There is also a screencast by Marko available on how to set up Titan on an Amazon cluster and query MusicBrainz RDF data:

PhD proposal on distributed graph data bases

Rene — Tue, 27 Mar 2012 10:19:22 +0000

Over the last week we had our off-campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss their research ideas with a postdoc and other PhD candidates for more than 6 hours. This led to the structure, todos and timetable of my PhD proposal. It has to be finalized over the next couple of days, but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology.

[TODO] 0. Find a template for the PhD proposal

That is straightforward. The task is just to look at other students' PhD proposals, also at some major conferences, and see what kind of structure they use. A very common structure for papers is Jennifer Widom’s structure for writing a good research paper. This or a similar template will help to make the proposal easy to read. For this blog article I will follow Jennifer Widom more or less.

1. Write an Introduction

Here I will describe the use case(s) of a distributed graph data base. These could be

indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
storing web log files and click streams of users
doing information retrieval (recommender systems) in the above scenarios

There could also be very different use cases, like graphs from

biology
finance
regular graphs
geographic maps like road and traffic networks

2. Discuss all the related work

This is done to name all the existing approaches and challenges that come with a distributed graph data base. It is also important to set oneself apart from existing frameworks, for example for graph processing. Here I will name at least the related work in the following fields:

graph processing (Signal Collect, Pregel,…)
graph theory (especially data structures and algorithms)
(dynamic/adaptive) graph partitioning
distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
redundancy vs fault tolerance
network programming (protocols, latency vs bandwidth)
data bases (ACID, multiple user access, …)
graph data base query languages (SPARQL, Gremlin, Cypher,…)
Social Network and graph analysis and modelling.

3. Formalize the problem of distributed graph data bases

After describing the related work and knowing the standard terminology it makes sense to really formalize the problem. Several steps have to be taken: a notation for distributed graph data bases needs to be fixed. This has to respect two things:
a) The real (so far unknown) problems that will be solved during the PhD. This makes fixing the notation and formalizing the (unknown) problem kind of hard.
b) The use cases: for the web use case this will probably translate to scale-free, small-world network graphs with a very small diameter. In order to respect use cases other than the web it will probably make sense to cite different graph models from the related work, e.g. mathematical models that generate graphs with certain properties.
The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to choose a use case that is still general enough to include all special cases and borderline cases. In particular, the use case should be a real extension to graph processing, which should of course be possible with a distributed graph data base.
One very important part of the formalization will lead to a first research question:

4. Graph Query languages – Graph Algebra

I think graph data bases are not really general purpose data bases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information about the local neighborhood of data points is frequently needed. They also often seem to be useful when schemaless data is processed. This leads to the question of a query language. Obviously (?), the more general the query language, the harder it is to find a very efficient solution. The model of a relational algebra was a very successful concept in relational data bases. I guess a similar graph algebra is needed as a mathematical concept for distributed graph data bases as a foundation of their query languages.
Note that this chapter does not have much to do with distributed graph data bases in particular but with graph data bases in general.
The graph algebra I have in mind so far is pretty similar to neo4j and consists of some atomic CRUD operations. Once the results are known (either as an answer from the related work or from my own research) I will be able to run my first experiments in a distributed environment.
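A minimal sketch of what such atomic CRUD operations could look like on a single machine is given below. This is only an illustration under my own assumptions: the class `Graph`, its method names and the property-dictionary layout are hypothetical and neither a finished graph algebra nor neo4j's API.

```python
from collections import defaultdict
from typing import Any, Dict, List


class Graph:
    """Directed property graph with atomic create/read/update/delete operations."""

    def __init__(self) -> None:
        self.vertices: Dict[str, Dict[str, Any]] = {}
        # source id -> {target id -> edge properties}
        self.out_edges: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(dict)

    # --- create ---
    def create_vertex(self, vertex_id: str, **properties: Any) -> None:
        self.vertices[vertex_id] = dict(properties)

    def create_edge(self, source: str, target: str, **properties: Any) -> None:
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must exist before creating an edge")
        self.out_edges[source][target] = dict(properties)

    # --- read ---
    def read_vertex(self, vertex_id: str) -> Dict[str, Any]:
        return self.vertices[vertex_id]

    def read_neighbors(self, vertex_id: str) -> List[str]:
        return list(self.out_edges.get(vertex_id, {}))

    # --- update ---
    def update_vertex(self, vertex_id: str, **properties: Any) -> None:
        self.vertices[vertex_id].update(properties)

    # --- delete ---
    def delete_edge(self, source: str, target: str) -> None:
        self.out_edges.get(source, {}).pop(target, None)

    def delete_vertex(self, vertex_id: str) -> None:
        del self.vertices[vertex_id]
        self.out_edges.pop(vertex_id, None)      # drop outgoing edges
        for targets in self.out_edges.values():  # drop incoming edges
            targets.pop(vertex_id, None)


graph = Graph()
graph.create_vertex("alice", kind="user")
graph.create_vertex("metallica", kind="band")
graph.create_edge("alice", "metallica", label="likes")
print(graph.read_neighbors("alice"))  # ['metallica']
```

In a distributed setting each of these operations would additionally have to decide which machine holds the affected vertices and edges, which is where the experiments start to get interesting.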

5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

As expected, the graph algebra will consist of some atomic CRUD operations. Those operations have to be tested against all the different data structures one can think of, in the different known distributed environments, over several different real world data sets. This task will be rather straightforward. It will be possible to know the theoretical results of most implementations in advance. The reason for this experiment is to collect experimental experience in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting are. Already in the evaluation of graphity I realized that there is a huge gap between theoretical predictions and the real results. So I am convinced that this experiment is a good step forward and that the deep understanding gained from actually implementing all this will hopefully lead to:

6. Development of hybrid data structures (creative input)

It would be the first time in my life that I ran such an experiment without any new ideas coming up to tweak and tune. So I expect to have learnt a lot from the first experiment and to have some creative ideas on how to combine several data structures and distribution techniques in order to make a better (especially better scaling) distributed graph data base technology.

7. Analysis of multiple user access and ACID

One important aspect of a distributed graph data base that was not in the focus of my research so far is the part that actually makes it a data base and sets it apart from a graph processing framework. Even after finding a good data structure and distribution model there are new limitations once multiple user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above and more reading on ACID in distributed computing will lead to more research questions and ideas on how to test several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension of chapter 5.

8. Again creative input for multiple user access and ACID

After having learnt what the best data structures for basic query operations in a distributed setting are and also what the best methods to achieve ACID are, it is time for more creative input. The goal will be to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease of achieving ACID. Once this is done everything is straightforward again.

9. Comprehensive benchmark of my solution with existing frameworks

My own solution has to be benchmarked against all the standard technologies for distributed graph data bases and graph processing frameworks.

10. Conclusion of my PhD proposal

So the goal of my PhD is to analyse different data structures and distribution techniques for the realization of a distributed graph data base. This will be done with respect to a good runtime of some basic graph queries (CRUD) respecting a standardized graph query algebra as well as multi-user access and the paradigms of ACID.

11. Timetable and milestones

This is a rough schedule fixing some of the major milestones.

2012 / 04: hand in PhD proposal
2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
2012 / 10: experiments of basic CRUD operations done
2013 / 02: paper with results from basic CRUD operations done
2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
2013 / 08: min. 3 months of research internship in a company, benchmarking my system on real data
end of 2013: publishing the results
2014: 9 months of writing my dissertation

If you have input, know of papers or can point me to similar research, I would be more than happy if you contacted me or started the discussion!
Thank you very much for reading so far!

Wishlist of features for a distributed graph data base technology

Rene — Fri, 24 Feb 2012 12:53:59 +0000

I am just dreaming. This does not exist yet and needs to be refined at a later stage.

Fast traversals:
- Jumping from one vertex of the graph to another should be possible in O(1)
Online processing:
- “Standard queries” (whatever this means) should compute within milliseconds.
- As an example: Local recommendations e.g. similar users in a bipartite “User – Band” graph should be possible to process online in less than a second.
Query language:
- A programming model that supports pattern matching and traversals with one (or possibly several) starting nodes
- No SPARQL support needed (too general for a reasonable graph application).
- Support for reading and writing new data (to disk)!
Distribution effort:
- The programmer should not have to care about the distribution techniques.
- He should just be able to use the technology.
Fault tolerance:
- The system has to keep running stably if working computers are added or removed.
- Probably by introducing redundancy in some way [1]
Persistence:
- Transactions and persistence are important for any data base service.
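The local recommendation example from the list above (similar users in a bipartite “User – Band” graph) can be sketched with plain adjacency sets, where each hop is a dictionary lookup with expected constant cost. The data, the function names and the similarity measure (number of shared bands) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Adjacency = Dict[str, Set[str]]


def build_bipartite(likes: Iterable[Tuple[str, str]]) -> Tuple[Adjacency, Adjacency]:
    """Index (user, band) pairs in both directions for constant-time hops."""
    user_to_bands: Adjacency = defaultdict(set)
    band_to_users: Adjacency = defaultdict(set)
    for user, band in likes:
        user_to_bands[user].add(band)
        band_to_users[band].add(user)
    return user_to_bands, band_to_users


def similar_users(user: str, user_to_bands: Adjacency, band_to_users: Adjacency,
                  top_n: int = 3) -> List[Tuple[str, int]]:
    """Two-hop traversal user -> band -> user, ranked by number of shared bands."""
    shared: Counter = Counter()
    for band in user_to_bands.get(user, ()):
        for other in band_to_users[band]:
            if other != user:
                shared[other] += 1
    return shared.most_common(top_n)


likes = [
    ("alice", "metallica"), ("alice", "slayer"),
    ("bob", "metallica"), ("bob", "slayer"),
    ("carol", "slayer"),
    ("dave", "abba"),
]
user_to_bands, band_to_users = build_bipartite(likes)
print(similar_users("alice", user_to_bands, band_to_users))
# [('bob', 2), ('carol', 1)]
```

The query only touches the two-hop neighborhood of the starting user, which is why it can stay fast on a large graph as long as both hops are served locally.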

It is very clear that this wish list is very high level. But I think these are reasonable assumptions from which we can break down the problem and discuss the pros and cons of all the techniques needed to build such a system.

[1] on the Redundancy discussion:

Depending on the techniques used, introducing redundancy has probably two positive effects on:

Fast traversals
Fault tolerance

On the other hand it has a deep impact on

Persistence (which is hard to achieve in a distributed setting anyway and is even harder to achieve once redundancy is included)

It is not clear if we really need redundancy. Maybe there are some other techniques that enable us to reach our goals, but I personally have the feeling that a good model for redundancy will “solve” the problem.
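The trade-off can be illustrated with a toy model. The placement rule below (primary partition `vertex % num_partitions`, copies on the following partitions) and the tiny edge list are assumptions chosen for illustration, not a proposal for the actual redundancy model.

```python
from typing import List, Set, Tuple


def replica_partitions(vertex: int, num_partitions: int, replicas: int) -> Set[int]:
    """Primary partition plus the next (replicas - 1) partitions, wrapping around."""
    primary = vertex % num_partitions
    return {(primary + offset) % num_partitions for offset in range(replicas)}


def local_hop_fraction(edges: List[Tuple[int, int]], num_partitions: int,
                       replicas: int) -> float:
    """Fraction of edges whose target has a copy on the source's primary partition."""
    local = 0
    for source, target in edges:
        source_primary = source % num_partitions
        if source_primary in replica_partitions(target, num_partitions, replicas):
            local += 1
    return local / len(edges)


def copies_per_write(replicas: int) -> int:
    """Every update of a vertex has to reach all of its copies."""
    return replicas


edges = [(1, 0), (2, 1), (3, 2), (0, 3), (0, 2), (1, 3)]
for replicas in (1, 2, 4):
    fraction = local_hop_fraction(edges, num_partitions=4, replicas=replicas)
    print(replicas, round(fraction, 2), copies_per_write(replicas))
```

In this toy setting more replicas turn more hops into local lookups, while every write has to update more copies consistently, which is exactly the tension with persistence described above.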

Relation to the reading club

I already found the time to look over our current reading assignments. Especially the VLDB paper (Topology partitioning applied to SPARQL, HADOOP and TripleStores) and the Challenges in parallel graph processing strengthen my confidence that the approach described above is very reasonable.

What is your opinion?

Do you think I am missing some features or should keep a focus on one particular feature? What about methods to achieve those goals? I am happy to discuss your thoughts!