FOSDEM – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany

Drug junkie steals my neo4j t-shirt out of my physical mailbox

Rene — Wed, 17 Jul 2013 18:51:46 +0000

Me wearing my stolen neo4j shirt which I took back from the thief

At FOSDEM 2013, Peter from Neo4j asked me if I would like to have a neo4j shirt sent to my home address. Keep in mind that I had just moved back to Koblenz from China. And I did not just move to Koblenz, I moved to Koblenz Lützel. I knew from my colleagues that this part of Koblenz is supposed to be like the Harlem of NYC. But I found a really nice flat where I live together with 3 other nice students. Even though I was skeptical when looking at that flat, I had to move there after I met my future roommates.
A couple of weeks after moving in I realized that I smelled pot more and more frequently from the streets, especially from people smoking it in front of my front door. I had even observed people exchanging packages in the backyard of our house. Of course I cannot say for sure, but even at the time I was pretty confident that they were dealing drugs. Over the last couple of weeks we had several problems in our house:

People broke into our basement and stole some stuff.
Another time people broke into our basement and stored bikes there which did not belong to any of our neighbors.
There is a bullet hole in our front door.
Last but not least: I was about to leave our backyard when I saw a guy wearing my neo4j shirt.

OK, let me elaborate on the last one:
Neo4j the graph database is not exactly known as a famous fashion brand. So I barely recognized the shirt when I saw him wearing it. But I somehow recognized the design and decided to turn around to get a second look. Who in my neighborhood would wear such a shirt, and what connection would he have to this rather new piece of technology?
When I turned around things got even stranger. I saw the back of the guy and his shirt said:
“My stream is faster than yours”, which certainly is a reference to Graphity,
and also displayed the Cypher Query:
“(renepickhardt) <-[:cites]- (peter)”
I was so perplexed that I didn’t realize that I was alone and the guy was standing there with 2 other men. I said: “Sorry, you are wearing my shirt!” His friends stepped in and told me I was crazy and asked how I could come up with this idea. I insisted that my name was written on the shirt, in fact my full name! I also knew the quote, which was exactly what Peter had planned to print on my shirt.
The guys started mocking me and telling me to f… off. But I somehow stood my ground and pointed out again that this was certainly my shirt. At that moment the door of the Kung Fu School opened and the coach, Mr. Lai, came out and asked if the guys had stolen packages from our mailbox again. At that moment the guy with my shirt had to turn around again, so everybody could see my name. He started telling me some weird lie about how he got this shirt as a present and just thought it was nice looking, but he finally returned it to me.
Most interestingly, the police didn’t care. The policeman only said: “It’s your own fault when you move to a place like Koblenz Lützel.” I find this very disappointing. I always thought that our policemen should be objective and neutral. Stealing and opening other people’s mail is a crime in Germany. So is owning drugs or stealing bikes… It is sad that the police refuse to help us with our situation.
Well, anyway: if you have an important message for me, why don’t you use email rather than physical mail? My email is also potentially read by third parties, but at least it is still safely delivered. Anyway, big thanks to the guys from neo4j for my new shirt (:

Video of FOSDEM talk finally online

Rene — Tue, 25 Jun 2013 15:55:57 +0000

I visited FOSDEM 2013 with Heinrich Hartmann and we talked about related-work.net. The video of this talk is finally online and of course I would like to share it with the community:

The slides can be found here.

Slides of Related work application presented in the Graphdevroom at FOSDEM

Rene — Sat, 02 Feb 2013 15:13:02 +0000

Download the slide deck of our talk at FOSDEM 2013, including all the resources that we pointed to.
Most important other links are:

It was great talking here, and again: we are open source, open data and so on. So if you have suggestions or want to contribute, feel free. Just do it or contact us. We are really looking forward to meeting other hackers who just want to go geek and change the world.
The video of the talk can be found here:

The start of the Linked Data benchmark council Eu FP7 Big Data pro

Rene — Sat, 02 Feb 2013 14:56:55 +0000

Peter, who works for Neo4j, is an industry partner of the LDBC (http://www.ldbc.eu/), which is an EU FP7 project in the Big Data call.
The goal of this project is to put out good methodologies for benchmarking linked open data and RDF stores as well as graph databases. In this context the council should also provide data sets for benchmarking.
Peter points out that a simple problem exists with benchmarks: “whoever puts it out wins”. One simple reason is that benchmarking has so many flexible variables that it is really hard. He compared the challenges to those of the TPC (http://de.wikipedia.org/wiki/Transaction_Processing_Performance_Council).
After talking about the need for good benchmarks, he pointed out again why the Transaction Processing Performance Council benchmarks are not sufficient anymore, giving many examples of huge, rapidly growing graphs out in the world (Facebook, Google Knowledge Graph, Linked Open Data, DBpedia).
Since the project is really new, Peter could not report any results yet. Anyway, I am pretty sure that anyone interested in graph databases and graph data should look into the project, which has the following list of deliverables:

Overview of current graph benchmarks and designs
Benchmark principles and methods
Query languages (Cypher, Gremlin, SPARQL)
Analysis and classification of choke points (supernodes, data generators)
Benchmark transactions (which are in general very slow)
Benchmark the complexity of queries
Analysis (if anyone has data sets and use cases, contact the LDBC; actually I think we have data coming from related work)
Navigational benchmark (e.g. OpenStreetMap)
Benchmarking design for pattern matching (e.g. SPARQL and Cypher)

As you could hear on the sidelines, there is a huge discussion going on about query languages, which I like. Creating a query language is a tough task. The more expressive a language is (like SPARQL), the less efficient it might become. So I hope the EU project will really create some good, solid output. I am also happy that many different industry vendors are part of this project. In this sense the results will hopefully be objective and won’t suffer from the “whoever puts it out wins” paradigm.
Interestingly, the LDBC makes a separation between graph databases and RDF stores, which I am very pleased to see and which I have been thinking about a lot.

Davy Suvee on FluxGraph – Towards a time aware graph built on Datomic

Rene — Sat, 02 Feb 2013 13:40:09 +0000

Davy really nicely introduced the problem of looking at a snapshot of a database. This problem obviously exists for any database technology. You have a lot of timestamped records, but running a query as if you had fired it a couple of months ago is always a difficult challenge.
FluxGraph introduces a solution to this.
As I understood him in the talk, he creates a new version of a vertex or an edge every time it gets updated, added or removed. So far I am wondering about scaling and runtime, since this approach seems like a lot of overhead to me. Later, during Q&A, I began to have the feeling that he has a more efficient way of storing this information, so I really have to get in touch with Davy to discuss the internals again.
In any case, FluxGraph provides a very clean API to access this temporal information.
On the various snapshots of the graph one is able to calculate, for example, the difference graph of two checkpoints, and gets a fully Blueprints-compatible result graph.
github.com/datablend/fluxgraph
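To make the versioning idea a bit more concrete, here is a minimal sketch in Python. It is not FluxGraph's actual API or storage layout (as said, the real internals are probably more efficient); the class VersionedGraph and the function difference are hypothetical names made up for this illustration. Every change appends a new version of an edge, a snapshot replays the versions up to a point in time, and the difference of two checkpoints is just a set comparison.

```python
from collections import defaultdict


class VersionedGraph:
    """Toy time-aware graph: changes append versions instead of overwriting."""

    def __init__(self):
        # edge (src, dst) -> list of (timestamp, present) versions
        self._history = defaultdict(list)

    def add_edge(self, src, dst, timestamp):
        self._history[(src, dst)].append((timestamp, True))

    def remove_edge(self, src, dst, timestamp):
        self._history[(src, dst)].append((timestamp, False))

    def snapshot(self, timestamp):
        """Return the set of edges that existed at the given time."""
        edges = set()
        for edge, versions in self._history.items():
            present = False
            # Replay the versions in time order up to the requested timestamp.
            for ts, state in sorted(versions):
                if ts <= timestamp:
                    present = state
            if present:
                edges.add(edge)
        return edges


def difference(graph, t_old, t_new):
    """Edges added and removed between two checkpoints (hypothetical helper)."""
    old, new = graph.snapshot(t_old), graph.snapshot(t_new)
    return {"added": new - old, "removed": old - new}
```

This naive version scans the whole history for every snapshot, which is exactly the kind of overhead I was wondering about in the talk.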
His use case comes from a data set of 15,000 cancer patients from 2001 to 2010, on which he could ask questions.
To sum up, Davy built this software for his own work and open sourced it, which is cool.

Frank Celler introduces ArangoDB

Rene — Sat, 02 Feb 2013 13:06:41 +0000

Frank Celler (https://twitter.com/fceller) introduces his ArangoDB, which is basically a document store (key, value) that also offers a Blueprints graph interface.
Interestingly, he is doing his demonstrations on the DBLP data set, which is highly relevant for the related work project that Heinrich and I are introducing in our talk.
ArangoDB has several APIs to interact with it:

available from JavaScript
RESTful HTTP API
Blueprints bindings (Gremlin support is available: nice!)

After playing around with the ArangoDB and Gremlin consoles, Frank started to introduce AQL (= ArangoDB Query Language). ArangoDB also includes a traverser framework.
After the talk I now know about the APIs of ArangoDB. What I am missing is a benchmark against some other technologies. Still, the technology looked very promising and I am sure we will have a closer look at it.

Michael Hunger talks about High Availability of Neo4j built on Paxos in the GraphDevroom @ FOSDEM

Rene — Sat, 02 Feb 2013 12:01:24 +0000

As we know, neo4j has master-slave replication with eventual consistency, so the typical ACID requirements do not hold. The usual way is either to write to the master, which pushes to the slaves, or to write to the slaves directly, which is super safe but much slower since synchronization between slaves is required.
In general (not very specific to neo4j) there are a few concerns:

Cluster management (how to handle machines joining or leaving the cluster, as well as heartbeat messages); this also holds true for failover (master election, distribution of master status)
Replication (synchronized id generation, distributed locks, and so on)

Neo4j used to build on Apache ZooKeeper to take care of these concerns. Michael points out that there have been problems with using ZooKeeper:

How to coordinate ZooKeeper with the neo4j cluster
Unreliable operations
People did not like the topology required by the ZooKeeper architecture
ZooKeeper also elects a new master too often, which is especially bad in a heavy-load environment
No dynamic reconfiguration of the ZooKeeper cluster

Neo4j's solution was to write its own implementation of the multi-Paxos paradigm and replace ZooKeeper. Michael especially suggests reading the Paxos Made Simple paper by Leslie Lamport. The core consists of state machines implemented using Java enums.
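To give a rough idea of what such an enum-based state machine can look like, here is a small sketch of a single Paxos acceptor, following the two phases described in Paxos Made Simple. This is my own illustration and not Neo4j's code: it is written in Python (with Python's Enum standing in for Java enums), it covers only single-decree Paxos rather than multi-Paxos, and the names AcceptorState and Acceptor are assumptions made up for the example.

```python
from enum import Enum


class AcceptorState(Enum):
    IDLE = "idle"          # nothing seen yet
    PROMISED = "promised"  # promised a proposal number, accepted nothing
    ACCEPTED = "accepted"  # accepted at least one proposal


class Acceptor:
    def __init__(self):
        self.state = AcceptorState.IDLE
        self.promised = -1     # highest proposal number promised so far
        self.accepted = None   # (number, value) of the last accepted proposal

    def on_prepare(self, number):
        """Phase 1: promise to ignore proposals numbered below `number`."""
        if number > self.promised:
            self.promised = number
            if self.state is AcceptorState.IDLE:
                self.state = AcceptorState.PROMISED
            # Report any previously accepted proposal back to the proposer.
            return ("promise", number, self.accepted)
        return ("nack", self.promised, None)

    def on_accept(self, number, value):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if number >= self.promised:
            self.promised = number
            self.accepted = (number, value)
            self.state = AcceptorState.ACCEPTED
            return ("accepted", number, value)
        return ("nack", self.promised, None)
```

A real implementation additionally needs proposers, learners, persistence of the acceptor state and the message transport, all of which are left out here.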
I still remember a lot of discussions in the reading club on distributed graph databases. We never actually looked into Apache ZooKeeper and the Paxos paradigm, which would certainly be an interesting technique to learn!
In the next part there were a lot of detailed discussions which were hard for me to follow, since I am so far not familiar with the Paxos paradigm.
If you are curious about the HA of neo4j (and you can bet I am), you can look into Peter’s screencast that leads you through setting up neo4j HA:

Setting up a local HA cluster in Neo4j 1.9 from Peter Neubauer on Vimeo.

Birds of a feather: Graph processing future trends in Graph Devroom

Rene — Sun, 05 Feb 2012 11:00:56 +0000

Since one of the talks got canceled, the organisers of the Graph Devroom at FOSDEM used the opportunity to hold a public discussion with all the developers about future trends in graph processing. I really liked the idea, but unfortunately the discussion didn’t really take off. I guess for a discussion like this people have to be better prepared.

Topics were:
Blueprints (a common graph access API) created by Marko Rodriguez
The problem of real-time graph processing
Benchmarking issues (we need standards for benchmarking)
A guy from OrientDB raised the question whether graph databases should really have ACID transactions.
Max De Marzi raised the question whether graphs change while being processed or are rather static.
Achim pointed out that relational databases are actually a special case of graph databases. He demands that vendors generalize more and consolidate the technologies…

The room was not as full as during the talks before, but still half of the seats were filled, as you can see in this short video:

My thoughts on the ACID discussion

I think that the ACID question was interesting. Alistair from neo4j gave a fine response to this, saying that it clearly depends on the use case and the kind of transactions that you really need. He compared it to the relational database world, where you might have the option of switching ACID off.
He says that in neo4j you cannot switch ACID off, as neo4j believes that for most of their customers this is the best choice. But he admits that there are use cases where you might want to switch ACID off.

Changing graphs vs static graphs

I think that this is also a very important question. On the one hand we have static models like Giraph that are able to find answers on huge static graphs; on the other hand you have situations like Graphity where the data fluctuates quickly. Unfortunately, for the latter there is no technology I know of (besides the hidden in-house systems of Facebook, Twitter and Google Plus).
The entire discussion was recorded here:

Nils Grunwald from Linkfluence talks at FOSDEM about Cascalog for graph processing

Rene — Sun, 05 Feb 2012 10:10:03 +0000

Nils Grunwald works at the French startup Linkfluence. Their product is more or less social network analysis and graph processing. They crawl the web and blogs or get other social network data and provide their customers with statistics and insights.
In this scenario big data is obviously involved and the data naturally has the structure of a graph. He says a system to process the data has the following constraints:

The processing should not compromise the rest of the system
Low maintenance costs
Used for queries and rapid prototyping (so they want a “general” graph processing solution, as customer needs change)
Flexible, since it is hard to tell beforehand which field or metadata will be used

He afterwards introduces their solution, Cascalog, which is based on Hadoop and inspired by Cascading, a workflow management system, and Datalog, a subset of Prolog. As a declarative, expressive language, it is a very concise way of writing queries and enables quick prototyping.
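For readers who have never seen Datalog, the flavour of such declarative queries can be sketched like this. The example is deliberately not Cascalog syntax (which is Clojure based and which I will not try to reproduce from memory); it is plain Python that naively evaluates the two classic Datalog rules reachable(X, Y) :- links(X, Y) and reachable(X, Z) :- reachable(X, Y), links(Y, Z) until no new facts appear. The function name reachable and the links relation are assumptions for illustration.

```python
def reachable(links):
    """Naive fixed-point evaluation of the two reachability rules above.

    links: set of (source, target) facts.
    Returns the set of all derivable (source, target) reachability facts.
    """
    facts = set(links)  # rule 1: every link is a reachability fact
    while True:
        # rule 2: join the known facts with the links on the shared vertex
        derived = {(x, z) for (x, y) in facts for (y2, z) in links if y == y2}
        if derived <= facts:  # nothing new was derived: fixed point reached
            return facts
        facts |= derived
```

You only state which facts follow from which; how the joins are executed (here a loop, in Cascalog a series of Hadoop jobs) is left to the system.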
For me personally it is not a very interesting solution, since it is not able to answer queries in real time, which of course is obvious if you consider the technologies it is based on. But I guess for people who have time and just do analysis this solution will probably work pretty well!
What I really liked about his solution is that after processing the graph you can export the data to Gephi or to Neo4j to have fast query processing.
He then explained a lot of specific details about the syntax of Cascalog:

Nils Grunwald from Linkfluence talks about Cascalog at FOSDEM

Claudio Martella talks @ FOSDEM about Apache Giraph: Distributed Graph Processing in the Cloud

Rene — Sun, 05 Feb 2012 09:01:45 +0000

Claudio Martella introduces Apache Giraph, which according to him is a loose implementation of Google Pregel, which was introduced at SIGMOD in 2010. He points out that MapReduce is a poor fit for graph processing.

He then gave an example of how MapReduce can be used to do a PageRank calculation. He points out that PageRank can be calculated in a distributed way as a local property of the graph, by computing each vertex's rank from the knowledge of its neighbours. He did this to show what the drawbacks of this method are in his opinion:

Job bootstrap takes some time
Disk is hit about 6 times
Data is sorted
Graph is passed through

Like in the Pregel paper, he says that other graph algorithms such as single-source shortest paths have the same problems.
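The "local property" argument for PageRank can be written down in a few lines. The following is a minimal single-machine sketch in Python, not the MapReduce job Claudio showed; the function name pagerank, the damping factor of 0.85 and the fixed number of iterations are my own assumptions. It also assumes that every vertex has at least one outgoing link and that every link target appears as a key. In a MapReduce setting each pass of the outer loop would presumably be a separate job, which is where the bootstrap, disk and sorting costs from the list above pile up.

```python
def pagerank(out_links, damping=0.85, iterations=20):
    """out_links: dict mapping each vertex to the list of vertices it links to."""
    nodes = list(out_links)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        # "Map" step: every vertex splits its rank evenly among its neighbours.
        incoming = {v: 0.0 for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    incoming[t] += share
        # "Reduce" step: every vertex combines what it received locally.
        rank = {v: (1 - damping) / len(nodes) + damping * incoming[v]
                for v in nodes}
    return rank
```

Each vertex only ever needs the contributions of its direct neighbours, which is why the computation distributes well in principle.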

Claudio Martella from Apache explains how Giraph works in the graph dev room @ FOSDEM 2012

After explaining more about implementing Pregel on top of the existing MapReduce infrastructure for distribution, he says that this system has some advantages over MapReduce:

It’s a stateful computation
Disk is hit only for checkpoints
No sorting is necessary
Only messages hit the network
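The points above are easier to see in a toy version of the vertex-centric model. Below is a small single-process sketch in Python of Pregel-style supersteps for single-source shortest paths. It is not the Giraph API (which is Java); the function name pregel_sssp and the inbox/outbox dictionaries are hypothetical and only mimic the idea. The assumptions are that every neighbour also appears as a key in edges and that all weights are non-negative.

```python
import math


def pregel_sssp(edges, source):
    """edges: dict vertex -> list of (neighbour, weight) pairs.

    Returns the shortest distance from source to every vertex.
    """
    # Per-vertex state that survives across supersteps (the stateful part).
    distance = {v: math.inf for v in edges}
    # Messages to be delivered in the next superstep.
    inbox = {source: [0.0]}
    while inbox:  # the computation halts when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                # Only messages travel; the graph itself stays where it is.
                for neighbour, weight in edges[vertex]:
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox
    return distance
```

Compared to the MapReduce variant, nothing is sorted and the graph structure is never shipped around between iterations; vertices that receive no messages simply do nothing in that superstep.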

He points out that the advantages of Giraph over other systems (Hama, GoldenOrb, Signal/Collect) are especially the active community (Facebook, Yahoo, LinkedIn, Twitter) behind the project. I personally think another advantage is that it is run by Apache, who already run MapReduce (Hadoop) with great success. So it is something that people trust…
Claudio points out explicitly that they are searching for more contributors, and I think this is really an interesting topic to work on! So thanks, Claudio, for your inspiring work!

Here are the video streams from the graph dev room: