graph processing – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany

Davy Suvee on FluxGraph – Towareds a time aware graph built on Datomic

Rene — Sat, 02 Feb 2013 13:40:09 +0000

Davy really nicely introduced the problem of looking at a snapshot of a data base. This problem obviously exists for any data base technology. You have a lot of timestamped records but running a query as if you fired it a couple of month ago is always a difficult challange.
With FluxGraph a solution to this is introduced.
How I understood him in the talk he introduces new versions of a vertex or an edge everytime it gets updated, added or removed. So far I am wondering about scaling and runtime. This approach seems like a lot of overhead to me. Later during Q & A I began to have the feeling that he has a more efficient way of storing this information so I really have to get in touch with davy to rediscuss the internals.
FluxGraph anyway provides a very clean API to access these temporal information.
On the various snapshots of the graph one is able to calculate for example the difference graph of the two checkpoints and gets a fully blueprints compatible result graph.
github.com/datablend/fluxgraph
His use case comes from a data set with 15000 cancer patients from 2001 to 2010 on which he could ask questions.
As a resume I can say that Davy used his software for his work and open sourced it which is cool.

PhD proposal on distributed graph data bases

Rene — Tue, 27 Mar 2012 10:19:22 +0000

Over the last week we had our off campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss our research ideas with a post doc and other PhD candidates for more than 6 hours. This lead to the structure, todos and time table of my PhD proposal. This has to be finalized over the next couple days but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology

[TODO] 0. Find a template for the PhD proposal

That is straight forward. The task is just to look at other students PhD proposals also at some major conferences and see what kind of structure they use. A very common structure for papers is Jennifer Widom’s structure for writing a good research paper. This or a similar template will help to make the proposal readable in a good way. For this blog article I will follow Jennifer Widom more or less.

1. Write an Introduction

Here I will describe the use case(s) of a distributed graph data base. These could be

indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
storing web log files and click streams of users
doing information retrieval (recommender systems) in the above scenarios

There could also be very other use cases like graphs from

biology
finance
regular graphs
geographic maps like road and traffic networks

2. Discuss all the related work

This is done to name all the existing approaches and challenges that come with a distributed graph data base. It is also important to set onself apart from existing frameworks like graph processing. Here I will name the at least the related work in the following fields:

graph processing (Signal Collect, Pregel,…)
graph theory (especially data structures and algorithms)
(dynamic/adaptive) graph partitioning
distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
redundancy vs fault tolerance
network programming (protocols, latency vs bandwidth)
data bases (ACID, multiple user access, …)
graph data base query languages (SPARQL, Gremlin, Cypher,…)
Social Network and graph analysis and modelling.

3. Formalize the problem of distributed graph data bases

After describing the related work and knowing the standard terminology it makes sense to really formalize the problem. Several steps have to be taken: There needs to be notation for distributed graph data bases fixed. This has to respect two things:
a) the real – so far unknown – problems that will be solved during PhD. In this way fixing the notation and formalizing the (unknown) problem will be kind of hard.
b) The use cases: For the web use case this will probably translate to scale free small world network graphs with a very small diameter. Probably in order to respect other use cases than the web it will make sense to cite different graph models e.g. mathematical models to generate graphs with certain properties from the related work.
The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to choose the use case still so general that all special cases and boarder line cases are included. Especially the use case should be a real extension to graph processing which should of course be possible with a distributed graph data base.
One very important part of the formalization will lead to a first research question:

4. Graph Query languages – Graph Algebra

I think graph data bases are not really general purpose data bases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information of a local neighborhood of data points is frequently needed. They also often seem to be useful when schemaless data is processed. This leads to the question of a query language. Obviously (?) the more general the query language the harder to have a very efficient solution. The model of a relational algebra was a very successful concept in relational data bases. I guess a similar graph algebra is needed as a mathmatical concept for distributed graph data bases as a foundation of their query languages.
Remark that this chapter has nothing much to do with distributed graph data bases but with graph data bases in general.
The graph algebra I have in mind so far is pretty similar to neo4j and consists of some atomic CRUD operations. Once the results are known (ether as an answer from the related work or by own research) I will be able to run my first experiments in a distributed environment.

5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

As expected the graph algebra will consist of some atomic CRUD operations those operations have to be tested against all different data structures one can think of in the different known distributed environments over several different real world data sets. This task will be rather straight forward. It will be possible to know the theoretical results of most implementations. The reason for this experiment is to collect experimental experiences in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting are. Already in the evaluation of graphity I realized that there is a huge gap between theoretical predictions and the real results. In this way I am convinced that this experiment is a good step forward and the deep understanding of actually implementing all this will hopefully lead to:

6. Development of hybrid data structures (creative input)

It would be the first time in my life where I am running such an experiment without any new ideas coming up to tweak and tune. So I am expecting to have learnt a lot from the first experiment in order to have some creative ideas how to combine several data structures and distribution techniques in order to make a better (especially bigger scaling) distributed graph data base technology.

7. Analysis of multiple user access and ACID

One important fact of a distributed graph data base that was not in the focus of my research so far is the part that actually makes it a data base and sets it apart from some graph processing frame work. Even after finding a good data structure and distributed model there are new limitations coming once multiple user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above and more reading on ACID in distributed computing will lead to more reasearch questions and ideas how to test several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension to the 5. paragraph.

8. Again creative input for multiple user access and ACID

After heaving learnt what the best data structures for basic query operations in a distributed setting are and also what the best methods to achieve ACID are it is time for more creative input. This will have the goal to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease for ACID. Once this done everything is straight forward again.

9. Comprehensive benchmark of my solution with existing frameworks

My own solution has to be benchmarked against all the standard technologies for distributed graph data bases and graph processing frameworks.

10. Conclusion of my PhD proposal

So the goal of my PhD is to analyse different data structures and distribution techniques for a realization of distributed graph data base. This will be done with respect to a good runtime of some basic graph queries (CRUD) respecting a standardized graph query algebra as well as muli user access and the paradigms of ACID.

11 Timetable and mile stones

This is a rough schedual fixing some of the major mile stones.

2012 / 04: hand in PhD proposal
2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
2012 / 10: experiments of basic CRUD operations done
2013 / 02: paper with results from basic CRUD operations done
2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
2013 /08: min 3 month research internship in a company benchmarking my system on real data
end of 2013: publishing the results
2014: 9 months of writing my dissertation

For anyone who has input, knows of papers or can point me to similar research I am more than happy if you could contact me or start the discussion!
Thank you very much for reading so far!

Wishlist of features for a distributed graph data base technology

Rene — Fri, 24 Feb 2012 12:53:59 +0000

I am just dreaming this does not exist and needs to be refined in a later stage.

Fast traversals:
- Jumping from one vertex of the graph to another should be possible in O(1)
Online processing:
- “Standard queries” (<–whatever this means) should compute within miliseconds.
- As an example: Local recommendations e.g. similar users in a bipartite “User – Band” graph should be possible to process online in less than a second.
Query language:
- A programming model that supports pattern matching and traversals with one (or possibly several) starting nodes
- No SPARQL (too general for a reasonable graph application) support needed.
- Support for reading and writing new data (to disk)!
Distribution effort:
- The programmer should not have to care about the distribution techniques.
- He should just be able to use the technology.
Fault tolerance:
- The system has to run stable if working computers are added or removed.
- Probably by introducing redundancy in some way [1]
Persistence:
- Transactions and persistence are important for any data base service.

It is very clear that this wish list is very high level. But I think these are reasonable assumptions from which we can break down the problem and discuss pros and cons of all the techniques needed to built such a system.

[1] on the Redundancy discussion:

Depending on the techniques used, introducing redundancy has probably two positive effects on:

Fast traversals
Fault tolerance

On the other hand it has a deep impact on

Persistence (which is hard to achieve in a distributed setting anyway is even harder to achieve once redundancies are included.)

It is not clear if we really need redundancy. Maybe there are some other techniques that enable us to find our goals but I personally have the feeling that a good model for redundancy will “solve” the problem.

relation to the reading club

I already found the time to look over our courrent reading assignments. Especially the VLDB paper (Topology partitioning applied to SPARQL, HADOOP and TripleStores) and the Challenges in parallel graph processing strengthen my confidence that an approach described above seems very reasonable.

What is your oppinion?

Do you think I am missing some features or should keep a focus on one particular feature? What about methods to achieve those goals? I am happy to discuss your thoughts!

From Graph (batch) processing towards a distributed graph data base

Rene — Thu, 23 Feb 2012 12:45:22 +0000

Yesterdays meeting of the reading club was quite nice. We all agreed that the papers where of good quality and we gained some nice insights. The only drawback of the papers was that it did not directly tell us how to achieve our goal for a real time distributed graph data base technology. In the readings for next meeting (which will take place Wednesday March 7th 2pm CET) we tried to choose papers that don’t discuss these distributed graph / data processing techniques but focus more on speed or point out the general challenges in parallel graph processing.

Readinglist for next Meeting (Wednesday March 7th 2pm CET)

memcached paper: To understand how for distributed shared memory works which could essentially speed up approaches like Signal Collect
Beehive: to see a p2p aproach for graph distribution.
Challenges in parallel graph processing. For obvious reasons since it points out the large picture.
http://www.boost.org/doc/libs/1_48_0/libs/graph_parallel/doc/html/index.html The boos library is a general parallel graph processing framework. In any case it is interesting and good to understand what is going on there.
Topology partitioning applied to SPARQL, HADOOP and TripleStores Shows how a speedup of 1000x can be achieved due to smart partitioning of a graph

Again while reading an preparing stuff feel free to add more reading wishes to the comments of this blog post or drop me a mail!

Summary of yesterdays meeting

As written in the introduction we agreed that the papers where interesting but not heading in our direction. Claudio pointed out that everyone should consider the following set of questions.

Do we want the graph to be mutable or is it supposed to writable or is it supposed to be read only?
- writing makes sens. If it is read only it is called batch processing
- Writing is hard you care about locking consistancy

Do we want to answer queries (Cypher/gremlin/whatever)?
Do we want to provide an API for processing?
How big is the data set we want to support
- many people do in memory
- If you go to the disk you open a whole new bottle of topics
- One approach would be to solve the problem in memory first.

I am very confident that it was a good idea to start with graph processing but that we are taking the right steps now to go in the direction of real distributed graph data base systems. I think there are some more questions and high level assumptions that one has to fix which I will post in a few days on this blog. Sorry I am in a hurry for this day / rest of the week.

Infrastructure

Schegi just suggested to create a Mailingliste for the reading club or to switch to Google Groups. He pointed out that a private blog is kind of a weired medium to be so central. What is your opinion on that? Do we need some other / more formal infrastructure?

Google Pregel vs Signal Collect for distributed Graph Processing – pros and cons

Rene — Sun, 19 Feb 2012 17:05:49 +0000

One of the reading club assignments was to read the paper about Google Pregel and Signal Collect, compare them and point out pros and cons of both approaches.
So after I read both papers as well as Claudios overview on Pregel clones and took some notes here are my thoughts but first a short summary of both papers.

Summary of Google Pregel

The methodology is heavily based on Bulk Sychronous Parallel Model (BSP) and also has some similarties to MapReduce (with just one superstep). The main idea is to spread the data over several machines and introduce some supersteps. For each superstep every vertex of the graph calculates a certain function that is given by the programmer.
This enables one to process large graphs which are distributed over several machines. The paper describes how to use Checkpoints to increase fault tolerance and also how to make good use of the Google File System in order to partition the graph data on the workers. The authors mention that smarter hashing functions could help to distribute the vertices not randomly but rather in a way they are connected on the graph which could potentially increase performance.
Overall the goal of Google Pregel seems to enable one to process large graph data and gain knowledge from it. The focus does not seem to increase the usage of the calculation power of the distributed system efficiently. In stead it rather seems to create a system that makes distribution of data – that will not fit into one machine – possible at a decent speed and without much effort for the programmer by introducing methods for increasing fault tolerance.

Summary of Signal Collect

Signal Collect as a system is pretty similar to Google Pregel. The main difference is that the authors introduce a threshold score which is used to decide weather a node should collect its signals or weather it should send signals. Using this score the processing of algorithms can be accelerated in a way that for every super step only signals and collects are performed if a certain threashhold is hit.
From here the authors say that one can get rid of the superstep model and make the entire calculation asynchronous. This is done by introducing randomization on the set of vertices on which signal and collect computations have to be computed (as long as the threasholdscores are overcome)
The entire system is implemented on a single machine but the vertices of the compute graph are processed by different workers (in this setting Threads). All Threads are able to share the main memory of the system which makes message passing of Signal and Collect computations unnecessary. The authors show how the increasing number of workers actually antiproportionally lower the runtime of the algorithm in the asynchronous setting. They also give evidence that different scheduling strategies seem to fit the needs for different graphs or algorithms.

Discussion of Pros and Cons

From the text above it seems very obvious that Signal Collect with its Asynchronous Programming model seems superior. But – in opposite to the authors – I have hidden to mention the drawbacks of one small but important detail. The fact that all the workers share a common knowledge which they can access due random access in main memory of the machine allows their model to be so fast while being asynchronous. It is not clear how to maintain this speed with a real distributed system. So in this way Signal Collect only give a proof of concept that an abstract programming model for graph processing exists and it enables fast distribution in theory.
Pregel actually is a real frame work that can really achieve distribution of large data to clusters of several thousand machines which for sure is a huge pro.
Signal Collect proposes to be more general than Pregel since Pregel can only respect one vertex type and edges are stored implicitly. Whereas Signal Collect is able to store RDF Graphs. I personally understand that Signal Collect can only send signals from one vertex to another if and edge exists and is also not able to add or remove edges or vertices. In this sense I still think that Pregel is the more general system. But I guess one can still argue on my point of view.
Pregel’s big drawbacks in my opinion are that the system is not optimized for speed. As already discussed in the last meeting of the reading club Map Reduce – with its one Superstep attitude – is able to start Backup tasks towards the end of the computation in order to fight stragglers. Pregel has to wait for those stragglers in every superstep in order to make synchronous Barriers possible.
Another point that is unique with Pregel is the deep integration with Google File System (btw. I am almost through the google file system paper and even if you already know of the idea it is absolutely worthwhile reading it and understanding the arguments for the design decisions of the google file system). So far I am not sure weather this integration is a strong or a weak point. This is due to the fact that I can’t see all the implications. However it gives strenght to my argument that for a distributed system some things like network protocols and file systems should be considered since they seem to have a strong impact on the entire system.
Both systems in my opinion fail to consider partitioning of the graph and a different network protocol as an important task. Especially for Pregel I do not understand this since it already has so much network traffic. Partitioning the graph might increase start up Traffic on the one hand but could increase overall traffic on the long term.

Outlooks and personal thoughts:

I am considering to invite the authors of both papers to next weeks reading club. It would be even more interesting to discuss these and other questions directly with the guys who built that stuff.
Also I like Schegi’s idea to see what happens if one actually runs several neo4j servers on different machines and just use a model similar to Signal Collect or Pregel to perform some computations. In this way a programming model could be given and research on the core distribution framework – relying on good technologies for the workers – could be done.
For the development of the first version of metalcon we used memcached. I read a lot that memcached scales perfectly horizontal over several machines. I wonder how an integration of memcached to Signal Collect would work in order to make the asynchronous computation possible in a distributed fashion. Since random access memory is a bottleneck in any application I suggest to put the original memcached paper on our reading list.
One last point to mention is that both systems still don’t seem to be useful as a technology to built a distributed graph data base which enables online query processing.

Nils Grunwald from Linkfluence talks at FOSDEM about Cascalog for graph processing

Rene — Sun, 05 Feb 2012 10:10:03 +0000

Nils Grunwald works at the french startup Linkefluence. Their product is more or less social network analysis and graph processing. They crawl the web and blogs or get other social network data and provide solutions with statistics and insights for their customers.
In this scenario obviously big data is envolved and the data carries a natural structure of a graph. He sais a system to process the data has the following constrains:

The processing should not compromise the rest of the system
Low maintenance costs
Used for queries and rapid prototyping (so they want a “general” graph processing solution as customer needs changes)
Flexible, hard to tell which field or metadata will be used beforehand.

He afterwards introduces their solution Cascalog based on Hadoop and is also inspired by cascading a workflow managment system and datalog a subset of prolog which as a declarative, expressive language is very concise way of writing queries and enable quick prototyping
For me personally it is not a very interesting solution since it is not able to answer queries in realtime which of course is obvious if you consider the technologies it is based on. But I quess for people that have time and just do analysis this solution will properly work pretty well!
What I really liked about his the solution is that after processing the graph you can export the data to Gephi or to Neo4j to have fast query processing.
Hey then explained alot specific details about the syntax of cascalog:

nils grundwald from linkfluence talks about cascalog at fosdem

Claudio Martella talks @ FOSDEM about Apache Giraph: Distributed Graph Processing in the Cloud

Rene — Sun, 05 Feb 2012 09:01:45 +0000

Claudio Martella introduces Apache Giraph which according to him is a loose implementation of Google Pregel which was introduced on SIGMOD in 2010. He points out that Map Reduce cannot be used to do graph processing.

He then gave an example on how MapReduce can be used to to do page rank calculation. He points out that Pagerank can be calculated as a local property of a graph in a distributed way by calculating local pagerank from the knowledge of the neighbours. He did this to show what the Drawbacks of this method are in his oppinion:

job boostrap take some time
disk is hit about 6 times
Data is sorted
Graph is passed through

Like in the Pregel Paper he says that other Graphalgorithms like singlesource shortest paths have the same problems.

: Claudio Martella from Apache explains how giraph works at in the graph dev room @ Fosdem 2012

After introducing more about implementing Pregle ontop of the existing MapReduce structure for distributing he says that this system has some advantages over MapReduce

it’s a stateful computation
Disk is hit if/only for checkpoints
No sorting is necessary
Only messages hit the network

He points out that the advantages of Giraph over other methods (Hama, GoldenOrb, Signal/Collect) are especially an active community (Facebook, Yahoo, Linkedin, Twitter) behind this project. I personally think another advantage is that it is run by Apache who already run MapReduce (Hadoop) with great success. So it is something that people trust…
Claudio points out explicitly that they are searching for more contributors and I think this is really an interesting topic to work on! So thank Claudio for your inspiring work!

here the video streams from the graph dev room: