Reading Club Management of Big Data
https://www.rene-pickhardt.de/reading-club-management-of-big-data/
Thu, 05 Sep 2013

Even though the reading club on distributed graph databases stopped, I never really lost interest in the management of big data and graph data. Due to developments in research grants and some new colleagues in our group, I decided to create a new reading club. (The first meeting will be Thursday, September 12th, 15:30 Central European Time.) The reading club won't run on a weekly basis but rather something like once a month. Tell me if you want to join via hangout or something similar! But I would like to be clear: if you haven't carefully prepared the reading assignments by bringing questions and points for discussion to the meeting, then don't join the meeting. I don't consider skimming a paper careful preparation.
The road map for the reading club on big data is quite clear: we will read again some papers that we read before, but we will also look deeper and check out some existing technologies. So the reading will not only consist of scientific work (though this will build the basis) but also of hands-on and practical sessions based on blogs, tutorials, documentation and handbooks.
Here is the preliminary structure and road map for the reading club on big data, which of course may vary over time!

Along these lines we want to understand:

  • Why do these technologies scale?
  • How do they handle concurrent traffic (especially write requests)?
  • How can performance be increased, and are there other ways of building such highly scalable systems?
  • What kind of applications (like Titan or Mahout) are built on top of these systems?
At some point I would also love to do some side reading on distributed algorithms and on the design of distributed and parallel algorithms and data structures.

As stated above, the reading club will be much more hands-on in the future than before. I expect us to also deliver tutorials like the one on getting Nutch running on top of HBase and Solr.
Even though we want to get hands-on experience with current technologies, the goal is to understand the principles behind them and to find ways of improving them, rather than just applying them to various problems.
I am considering starting a wiki page on Wikiversity to create something like a course on big data management, but I would only do this if I find a couple of people who would actively help to contribute to such a course. So please contact me if you are interested!
So to sum up: the reading assignments for the first meeting are the Google File System paper and the MapReduce paper.
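Since the MapReduce paper is one of the assignments, here is a toy, single-process Python sketch of the programming model it introduces, using the paper's canonical word-count example. This is only meant to make the map/shuffle/reduce phases concrete; a real implementation distributes each phase across machines.

```python
from collections import defaultdict

# Word count in the MapReduce model: map() emits (key, value) pairs,
# the framework groups values by key (the "shuffle"), and reduce()
# folds each group into a final result.

def map_fn(document):
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def run_mapreduce(documents):
    groups = defaultdict(list)  # the shuffle phase
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Each group is reduced independently -- this is what parallelizes.
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["big data big graphs", "graphs scale"]
print(run_mapreduce(docs))  # {'big': 2, 'data': 1, 'graphs': 2, 'scale': 1}
```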

Aurelius Titan graph enables realtime querying with 2400 concurrent users on a distributed graph database!
https://www.rene-pickhardt.de/aurelius-titan-graph-enables-realtime-querying-with-2400-concurrent-users-on-a-distributed-graph-database/
Wed, 21 Aug 2013

Sorry to start with a conclusion first… To me Titan graph seems to be the egg-laying wool-milk-sow that people dream of when working with graph data, especially if one needs graph data in a web context and in real time. I will certainly try to free some time to check this out and get hands-on. This thing is really new and revolutionary: it is not just another Hadoop or Giraph approach to big data processing, it is distributed and works in real time! I am almost confident that if the benchmark holds what it promises, Titan will be one of the fastest-growing technologies we have seen so far.
 

I met Matthias Bröcheler (CTO of Aurelius, the company behind Titan graph) 4 years ago in a teaching situation at the German national student high school academy. It was the time when I was still more mathematician than computer scientist, but my journey to becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experiences a lot. He was the one who first opened my eyes to what big data means and how companies like Facebook, Google and so on knit their business models around collecting data. Anyway, Matthias influenced me in quite some ways and I have a lot of respect for him.
I did not start my PhD right away and we lost contact. I knew he was interested in graphs, but that was about it. Only when I started to use Neo4j more and more did I realize that Matthias was also one of the authors of the TinkerPop Blueprints, the interfaces for talking to graphs that most graph database vendors implement. At that time I looked him up again and realized he was working on Titan, a distributed graph database. I found this promising-looking slide deck:

Slide 106:

Slide 107:

But at that time there wasn't much evidence for me that Titan would really hold the promise given in slides 106 and 107. In fact those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases. (By the way: reading the PhD proposal now, I am kind of amused, since I did not really aim for the important and big points the way Titan did.)
During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button, and especially to be able to integrate it with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (an implementation of Google's Bigtable) and Cassandra (an implementation of Amazon's Dynamo), which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines I stumbled upon Titan again and saw Matthias' talk at the Cassandra Summit 2013. Around minute 21/22 the talk gets really interesting; I can also suggest skipping the first 15 minutes:

Let me sum up the amazing parts of the talk:

  • 2400 concurrent users against a graph cluster!
  • real time!
  • 16 different (non-trivial) queries
  • more than 10,000 requests answered per second!
  • a graph with more than a billion nodes!
  • graph partitioning is pluggable
  • a graph schema helps indexing for queries
So far I was not sure what kind of queries were really involved, and especially whether there were also write transactions; unfortunately no one in the audience asked that question. So I started googling and found this blog post by Aurelius. It gives an entire overview of the queries and presents the results in much more detail. Unfortunately I was not able to find the source code of that very benchmark (which Matthias promised to open in his talk). On average most queries take less than half a second.
 
Even though the source code is not available, this talk together with the Aurelius blog post looks to me like the most interesting and hottest piece of technology I have come across during my PhD program. Aurelius started to think distributed right away and made some clever design decisions:
  • scaling data size
  • scaling data access in terms of concurrent users (especially write operations) is fundamentally, and apparently successfully, integrated
  • making partitioning pluggable (see the sketch below)
  • requiring a schema for the graph (to enable efficient indexing)
  • being able to extend the schema at runtime
  • building on top of either Cassandra (for real time) or HBase (for consistency)
  • being compatible with the TinkerPop tech stack
  • bringing along an entire framework for analytics and graph processing
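To make the "pluggable partitioning" point concrete, here is a minimal Python sketch of the concept: a toy graph store that delegates vertex placement to an exchangeable partitioner object. This is my own illustration of the idea, not Titan's actual API; all class names are made up.

```python
import hashlib

class HashPartitioner:
    """Place each vertex by hashing its id: balanced, but traversal-blind."""
    def __init__(self, n):
        self.n = n
    def place(self, vertex, graph):
        digest = hashlib.md5(str(vertex).encode()).hexdigest()
        return int(digest, 16) % self.n

class NeighborPartitioner:
    """Co-locate a vertex with the plurality of its already-placed neighbors,
    so traversals touch fewer machines."""
    def __init__(self, n):
        self.fallback = HashPartitioner(n)
    def place(self, vertex, graph):
        parts = [graph.placement[v] for v in graph.adjacency.get(vertex, ())
                 if v in graph.placement]
        if parts:
            return max(set(parts), key=parts.count)
        return self.fallback.place(vertex, graph)

class ToyDistributedGraph:
    """The store only talks to the partitioner interface, so strategies swap freely."""
    def __init__(self, partitioner):
        self.partitioner = partitioner
        self.adjacency = {}  # vertex -> set of neighbors
        self.placement = {}  # vertex -> machine index

    def add_edge(self, u, v):
        self.adjacency.setdefault(u, set()).add(v)
        self.adjacency.setdefault(v, set()).add(u)
        for w in (u, v):
            if w not in self.placement:
                self.placement[w] = self.partitioner.place(w, self)

g = ToyDistributedGraph(NeighborPartitioner(4))
g.add_edge("alice", "bob")
g.add_edge("bob", "carol")
print(g.placement)  # bob and carol end up on alice's machine under this heuristic
```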

Further resources:

Michael Hunger talks about High Availability of Neo4j built on Paxos in the GraphDevroom @ FOSDEM
https://www.rene-pickhardt.de/michael-hunger-talks-about-high-availability-of-neo4j-built-on-paxos-in-the-graphdevroom-fosdem/
Sat, 02 Feb 2013

As we know, Neo4j has master-slave replication with eventual consistency, so the typical ACID requirements do not hold. The usual way is to write to the master, which pushes the changes to the slaves. But it is also possible to write to the slaves directly, which is super safe but much slower, since synchronization between the slaves is required.
In general (not very specific to Neo4j) there are a few concerns:

  • Cluster management (how to handle new machines joining or leaving the cluster, as well as heartbeat messages); this also holds true for failover (master election, distribution of the master status)
  • Replication (synchronized id generation, distributed locks, and so on)

Neo4j was building on Apache Zookeeper to take care of these concerns. Michael points out that there have been problems with using Zookeeper:

  • how to coordinate the Zookeeper cluster with the Neo4j cluster
  • unreliable operations
  • people did not like the topology required by the Zookeeper architecture
  • Zookeeper elects a new master too often, which is especially bad in a heavy-load environment
  • no dynamic reconfiguration of the Zookeeper cluster

Neo4j's solution was to reimplement the multi-Paxos paradigm and replace Zookeeper. Michael especially suggests reading the "Paxos Made Simple" paper by Leslie Lamport. The core consists of state machines implemented using Java enums.
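Since "Paxos Made Simple" came up: here is a minimal Python sketch of the single-decree acceptor logic from that paper, i.e. the two promises an acceptor makes in the prepare and accept phases. It is nothing like Neo4j's production multi-Paxos code, just the bare core of the paradigm.

```python
from typing import Optional, Tuple

class Acceptor:
    """Single-decree Paxos acceptor (Lamport, "Paxos Made Simple", 2001).
    promised: highest proposal number this acceptor promised to honor.
    accepted: (number, value) of the highest accepted proposal, if any."""

    def __init__(self):
        self.promised: int = -1
        self.accepted: Optional[Tuple[int, object]] = None

    def on_prepare(self, n: int):
        # Phase 1b: promise to ignore proposals numbered below n, and report
        # any already-accepted value so the proposer must adopt it.
        if n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted)
        return ("reject", self.promised)

    def on_accept(self, n: int, value):
        # Phase 2b: accept unless a higher-numbered prepare arrived meanwhile.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("accepted", n, value)
        return ("reject", self.promised)

a = Acceptor()
print(a.on_prepare(1))        # ('promise', 1, None)
print(a.on_accept(0, "old"))  # ('reject', 1) -- the promise holds
print(a.on_accept(1, "new"))  # ('accepted', 1, 'new')
```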
I still remember a lot of discussions in the reading club on distributed graph databases. We never actually looked into Apache Zookeeper and the Paxos paradigm, which would certainly be an interesting technique to learn!
In the next part there were a lot of detailed discussions, which were hard for me to follow since I am not yet familiar with the Paxos paradigm.
If you are curious about the HA features of Neo4j (and you can bet I am), you can look into Peter's screencast that leads you through setting up Neo4j HA:

Setting up a local HA cluster in Neo4j 1.9 from Peter Neubauer on Vimeo.

Reading Club on distributed graph db returns with a new Format on April 4th 2012
https://www.rene-pickhardt.de/reading-club-on-distributed-graph-db-returns-with-a-new-format-on-april-4th-2012/
Mon, 02 Apr 2012

The reading club has been quite inactive due to traveling and a suboptimal process for choosing the literature. That is why a new format for the reading club has been discussed and agreed upon.
The new format comes with 4 new rules:

  1. We will only discuss up to 3 papers in 90 minutes. So roughly speaking we have 30 minutes per paper, but this does not have to be strict.
  2. The chosen papers should be read by everyone before the reading club takes place.
  3. For every paper there is one responsible person (moderator) who has read the entire paper before suggesting it as a common reading.
  4. Open questions about the (potential) reading assignments and ideas for reading can and should be discussed on http://related-work.rene-pickhardt.de/ (use the same template as I used for the reading assignments in this blog post), e.g.:

Moderator:
Paper download:
Why to read it:
Topics to discuss / open questions:

For the next meeting on April 4th, 2 pm CET (in two days), the literature will be:

While preparing these papers we might come across some other interesting literature.
If you want to suggest a piece of literature, you should also read that piece of work before the reading club meeting takes place and know why you want everybody to prepare the same paper and discuss it (rule 3). Additionally you should open a topic on the paper on http://related-work.rene-pickhardt.de/ using the above template before the reading club takes place (rule 4).
I hope this is of help for the entire project and I am looking forward to the next meeting!

PhD proposal on distributed graph data bases
https://www.rene-pickhardt.de/phd-proposal-on-distributed-graph-data-bases/
Tue, 27 Mar 2012

Over the last week we had our off-campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called "massage your diss". I was one of the lucky students who got to discuss their research ideas with a postdoc and other PhD candidates for more than 6 hours. This led to the structure, todos and timetable of my PhD proposal. It still has to be finalized over the next couple of days, but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph database technology.

[TODO] 0. Find a template for the PhD proposal

That is straightforward. The task is just to look at other students' PhD proposals and at some major conferences to see what kind of structure they use. A very common structure for papers is Jennifer Widom's structure for writing a good research paper. This or a similar template will help to make the proposal readable. For this blog article I will follow Jennifer Widom more or less.

1. Write an Introduction

Here I will describe the use case(s) of a distributed graph database. These could be:

  • indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
  • running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
  • storing web log files and click streams of users
  • doing information retrieval (recommender systems) in the above scenarios

There could also be quite different use cases, like graphs from:

  • biology
  • finance
  • regular graphs 
  • geographic maps like road and traffic networks

2. Discuss all the related work

This is done to name all the existing approaches and challenges that come with a distributed graph database. It is also important to set oneself apart from existing frameworks like graph processing. Here I will name at least the related work in the following fields:

  • graph processing (Signal Collect, Pregel,…)
  • graph theory (especially data structures and algorithms)
  • (dynamic/adaptive) graph partitioning
  • distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
  • redundancy vs fault tolerance
  • network programming (protocols, latency vs bandwidth)
  • data bases (ACID, multiple user access, …)
  • graph data base query languages (SPARQL, Gremlin, Cypher,…)
  • social network and graph analysis and modelling

3. Formalize the problem of distributed graph data bases

After describing the related work and knowing the standard terminology, it makes sense to really formalize the problem. Several steps have to be taken: a notation for distributed graph databases needs to be fixed. This has to respect two things:
a) The real, so far unknown, problems that will be solved during the PhD. Fixing the notation and formalizing the (unknown) problem will therefore be kind of hard.
b) The use cases: for the web use case this will probably translate to scale-free small-world graphs with a very small diameter. In order to respect use cases other than the web, it will probably make sense to cite from the related work different graph models, e.g. mathematical models to generate graphs with certain properties.
The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to choose the use case general enough that all special and borderline cases are included. In particular, the use case should be a real extension of graph processing, which should of course still be possible with a distributed graph database.
One very important part of the formalization will lead to a first research question:

4. Graph Query languages – Graph Algebra

I think graph databases are not really general-purpose databases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information from the local neighborhood of data points is frequently needed, and often when schemaless data is processed. This leads to the question of a query language. Obviously (?) the more general the query language, the harder it is to find a very efficient solution. The model of relational algebra was a very successful concept in relational databases. I guess a similar graph algebra is needed as a mathematical foundation for the query languages of distributed graph databases.
Remark that this chapter does not have much to do with distributed graph databases, but with graph databases in general.
The graph algebra I have in mind so far is pretty similar to Neo4j's and consists of some atomic CRUD operations (see the sketch below). Once the results are known (either as an answer from the related work or from my own research) I will be able to run my first experiments in a distributed environment.
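To make this concrete, here is a minimal Python sketch of the kind of atomic CRUD operations I have in mind; the operation set (create/read/update/delete on vertices and edges plus a neighborhood read) is my own guess at a reasonable core, not a fixed algebra.

```python
class GraphStore:
    """Minimal property-graph store exposing atomic CRUD operations --
    the candidate 'atoms' a graph algebra would compose."""

    def __init__(self):
        self.vertices = {}   # vertex id -> property dict
        self.edges = {}      # (source, target) -> property dict
        self.out_edges = {}  # vertex id -> set of target ids

    # Create
    def create_vertex(self, v, **props):
        self.vertices[v] = dict(props)
        self.out_edges.setdefault(v, set())

    def create_edge(self, u, v, **props):
        self.edges[(u, v)] = dict(props)
        self.out_edges[u].add(v)

    # Read
    def read_vertex(self, v):
        return self.vertices[v]

    def neighborhood(self, v):
        """The local read that graph databases are optimized for."""
        return self.out_edges[v]

    # Update
    def update_vertex(self, v, **props):
        self.vertices[v].update(props)

    # Delete
    def delete_edge(self, u, v):
        del self.edges[(u, v)]
        self.out_edges[u].discard(v)

g = GraphStore()
g.create_vertex("metallica", kind="band")
g.create_vertex("rene", kind="fan")
g.create_edge("rene", "metallica", relation="likes")
print(g.neighborhood("rene"))  # {'metallica'}
```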

5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

As expected, the graph algebra will consist of some atomic CRUD operations. Those operations have to be tested against all the different data structures one can think of, in the different known distributed environments, over several different real-world data sets. This task will be rather straightforward, and the theoretical results of most implementations will be known in advance. The reason for this experiment is to collect experimental experience in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting lie. Already in the evaluation of Graphity I realized that there is a huge gap between theoretical predictions and real results. So I am convinced that this experiment is a good step forward, and the deep understanding gained from actually implementing all this will hopefully lead to:

6. Development of hybrid data structures (creative input)

It would be the first time in my life that I ran such an experiment without new ideas for tweaking and tuning coming up. So I expect to have learnt a lot from the first experiment, and to have some creative ideas for combining several data structures and distribution techniques into a better (especially bigger-scaling) distributed graph database technology.

7. Analysis of multiple user access and ACID

One important aspect of a distributed graph database that has not been in the focus of my research so far is the part that actually makes it a database and sets it apart from a graph processing framework. Even after finding a good data structure and distribution model, new limitations arise once multi-user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above, plus more reading on ACID in distributed computing, will lead to more research questions and to ideas for testing several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension of chapter 5.
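To illustrate why multi-user write access is a problem of its own, here is a minimal Python sketch of one standard strategy such experiments would have to cover: optimistic concurrency control with versioned compare-and-set writes. The scenario and names are made up for illustration.

```python
class VersionedVertexStore:
    """Optimistic concurrency control: every vertex carries a version number,
    and a write only succeeds if nobody changed the vertex since it was read."""

    def __init__(self):
        self.data = {}  # vertex id -> (version, properties)

    def read(self, v):
        return self.data[v]  # the caller remembers the version it saw

    def compare_and_set(self, v, expected_version, new_props):
        version, _ = self.data[v]
        if version != expected_version:
            return False  # a concurrent writer got there first: re-read and retry
        self.data[v] = (version + 1, new_props)
        return True

store = VersionedVertexStore()
store.data["page"] = (0, {"views": 0})

# Two clients read version 0, then both try to increment the counter:
v1, props1 = store.read("page")
v2, props2 = store.read("page")
print(store.compare_and_set("page", v1, {"views": props1["views"] + 1}))  # True
print(store.compare_and_set("page", v2, {"views": props2["views"] + 1}))  # False
```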

8. Again creative input for multiple user access and ACID

After having learnt what the best data structures for basic query operations in a distributed setting are, and what the best methods to achieve ACID are, it is time for more creative input. The goal will be to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease of achieving ACID. Once this is done, everything is straightforward again.

9. Comprehensive benchmark of my solution with existing frameworks

My own solution has to be benchmarked against all the standard technologies for distributed graph data bases and graph processing frameworks.

10. Conclusion of my PhD proposal

So the goal of my PhD is to analyse different data structures and distribution techniques for the realization of a distributed graph database. This will be done with respect to good runtimes of some basic graph queries (CRUD) respecting a standardized graph query algebra, as well as multi-user access and the ACID paradigms.

11. Timetable and milestones

This is a rough schedule fixing some of the major milestones.

  • 2012 / 04: hand in PhD proposal
  • 2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
  • 2012 / 10: experiments of basic CRUD operations done
  • 2013 / 02: paper with results from basic CRUD operations done
  • 2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
  • 2013 / 08: at least 3 months of research internship in a company, benchmarking my system on real data
  • end of 2013: publishing the results
  • 2014: 9 months of writing my dissertation

If you have input, know of relevant papers, or can point me to similar research, I would be more than happy if you contacted me or started the discussion!
Thank you very much for reading so far!

Related work of the Reading club on distributed graph data bases (Beehive, Scalable SPARQL Querying of Large RDF Graphs, memcached)
https://www.rene-pickhardt.de/related-work-of-the-reading-club-on-distributed-graph-data-bases-beehive-scalable-sparql-querying-of-large-rdf-graphs-memcached/
Wed, 07 Mar 2012

Today we finally had our reading club and discussed several papers from last week's assignments.
Before I give my usual summary I want to introduce our new infrastructure for the reading club. Go to:
http://related-work.rene-pickhardt.de/
There you can find a question-and-answer system which we will use to discuss questions and answers about papers. Due to the included voting system we think this is much more convenient than a closed, unstructured mailing list. I hope this is of help to the entire community, and I can only invite everyone to read and discuss with us on http://related-work.rene-pickhardt.de/

Reading list for next meeting Wed March 14th 2 pm CET

We first discussed the memcached paper:

One of the first topics we discussed was how the dynamic hashing is actually done. We also wondered how DHTs take care of overloading in general; in the memcached paper this is not discussed very well. Schegi knows a good paper that explains the dynamics behind DHTs and will provide the link soon.
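Until Schegi digs up that link: the usual answer to "how is the dynamic hashing done" is consistent hashing, which memcached clients commonly use on the client side. Here is a minimal Python sketch of the idea, without the virtual nodes and replication that real deployments add:

```python
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Servers sit on a hash ring; a key belongs to the first server
    clockwise from the key's hash. Adding or removing a server only
    remaps the keys in that server's arc, not the whole key space."""

    def __init__(self, servers):
        self._ring = sorted((_hash(s), s) for s in servers)

    def server_for(self, key):
        hashes = [h for h, _ in self._ring]
        idx = bisect.bisect(hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

    def add_server(self, server):
        bisect.insort(self._ring, (_hash(server), server))

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.server_for("vertex:42:adjacency"))  # one of the three servers
ring.add_server("cache-4")  # only the keys in one arc move to the new server
```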
Afterwards we discussed what would happen if a distributed key-value store like memcached were used to implement a graph store. Obviously, creating a graph store on the key-value model is possible, and memcached is very fast in its lookups. One could also add another persistence layer to memcached that would enable disk writes.
We think the main counter-arguments are:

  • In this setting the distribution of the graph across worker nodes is done randomly.
  • No performance gain through graph partitioning is possible.

We realized that we should really read up on how graphs are distributed across machines.
When using memcached you can store much more than an adjacency list in the value of one key, thereby reducing the number of lookups needed.
Again I pointed out that separating the data model from the data storage could help essentially. I will soon write an entire blog article about this idea in the setting of relational/graph models and relational database management systems.
Personally, I am still convinced that memcached could be used to improve asynchronous message passing in distributed systems like Signal/Collect.

Scalable SPARQL Querying of Large RDF Graphs:

We agreed that one of the core principles in this paper is that they remove supernodes (everything connected via rdf:type) in order to obtain a much sparser graph, do the partitioning (which speeds up the computation a lot), and afterwards add the supernodes as a redundancy to all workers where they could be needed. This methodology could generalize pretty well to arbitrary graphs: you just look at the node degrees, remove the x% of nodes with the highest degree from the graph, run a clustering algorithm, and then add the supernodes redundantly to the workers.
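A minimal Python sketch of that generalization; note that the partitioning of the sparser remainder is stubbed out with a trivial round-robin assignment standing in for a real clustering algorithm:

```python
def partition_without_supernodes(adjacency, num_workers, top_fraction=0.01):
    """Remove the highest-degree nodes, partition the sparser remainder,
    then replicate the removed supernodes to every worker."""
    by_degree = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
    cutoff = max(1, int(len(by_degree) * top_fraction))
    supernodes = set(by_degree[:cutoff])

    # Stand-in for a real (distributed) clustering algorithm:
    partitions = [set() for _ in range(num_workers)]
    for i, v in enumerate(by_degree[cutoff:]):
        partitions[i % num_workers].add(v)

    # Redundancy: every worker gets a copy of all supernodes.
    for p in partitions:
        p.update(supernodes)
    return partitions, supernodes

adjacency = {"type": {"a", "b", "c", "d"}, "a": {"type", "b"},
             "b": {"type", "a"}, "c": {"type"}, "d": {"type"}}
parts, supers = partition_without_supernodes(adjacency, 2, top_fraction=0.2)
print(supers)  # {'type'} -- the hub is replicated to both workers
```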
Thomas pointed out a drawback of the paper: it does not use a distributed clustering algorithm, even though it then uses a framework like MapReduce anyway.

Beehive:

We all agreed that the Beehive paper solves its problem with a really great methodology: first looking at the query distribution and then applying proactive caching strategies. The interesting point is that they create an analytical model which they can solve in closed form. The P2P protocols are enhanced with gossip talk to distribute the parameters of the protocol. In this way an adaptive system is created which adjusts its caching strategy once the queries change.
We thought that the Beehive approach could be generalized to various settings. In particular, it might be possible to analyze not only Zipf distributions but also other query distributions, and to derive various analytical models which could even coexist in such a system.
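As a toy illustration of the core idea (greatly simplified: Beehive derives closed-form replication levels inside a DHT, whereas this sketch just proactively replicates the observed hottest keys):

```python
from collections import Counter
import random

def zipf_queries(keys, n, s=1.0):
    """Draw n queries from a Zipf-like popularity distribution over keys."""
    weights = [1.0 / (rank + 1) ** s for rank in range(len(keys))]
    return random.choices(keys, weights=weights, k=n)

def proactive_replicas(queries, num_nodes, budget=2):
    """Replicate the most popular keys to every node so the hottest
    lookups are answered locally instead of being routed through the DHT."""
    popularity = Counter(queries)
    hot = {key for key, _ in popularity.most_common(budget)}
    return {node: set(hot) for node in range(num_nodes)}

keys = [f"item-{i}" for i in range(100)]
queries = zipf_queries(keys, 10_000)
replicas = proactive_replicas(queries, num_nodes=4)
print(replicas[0])  # very likely {'item-0', 'item-1'} under Zipf popularity
```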
You can find our questions and thoughts, and join our discussion about Beehive, online!

Challenges in Parallel Graph Processing:

Unfortunately we did not really have time to discuss this, in my opinion, great paper. I created a discussion in our new question board, so feel free to discuss this paper at: http://related-work.rene-pickhardt.de/questions/13/challenges-in-parallel-graph-processing-what-do-you-think
