related work – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany Extract knowledge from your data and be ahead of your competition Tue, 17 Jul 2018 12:12:43 +0000 Video of FOSDEM talk finally online Tue, 25 Jun 2013 15:55:57 +0000 I visited FOSDEM 2013 with Heinrich Hartmann and gave a talk there. The video of this talk is finally online, and of course I would like to share it with the community:

The slides can be found here.

Building an Autocomplete Service in GWT screencast Part 4: Integrating the neo4j Data base Thu, 20 Jun 2013 12:38:46 +0000 In this screencast of my series I explain at a very basic level how to integrate a database to pull data for autocomplete queries. Since we were working with neo4j at the time, I used a neo4j database. Only in the next two parts of this series will I introduce an efficient way of handling the database (using the context listener of the web server) and building fast indices. So in this lesson the resulting autocomplete service will be really slow and impractical to use, but I am sure that for didactic reasons it is OK to invest 7 minutes in a rather bad design.
Anyway, if you want to use the same data set as I used in this screencast, the data set as well as a description of the database schema are available online:

Related work slides from Rigour and Openness @ Oxford 2013 Fri, 12 Apr 2013 11:54:50 +0000 Please find all the information about the talk in Oxford.

By the way, it will be built on Graphity so that the newsfeed scales.

Organization of the Open Access event 2013 in Oxford. Tue, 19 Mar 2013 19:08:07 +0000 During the past two months I invested quite a bit of my spare time in helping to organize the open access event Rigor and Openness in 21st century science, which will take place at the University of Oxford on April 11th and 12th.
The idea for the conference came up during Heinrich’s time in Oxford, where he and I worked a lot on related work and he worked closely with people from the Akorn project (cf. Heinrich’s blog article on why openness benefits research).
Thanks to a great effort by the organizing team (mostly students from Oxford, plus Heinrich and me) we can finally announce the conference publicly.
As you can see from the web page, we have been able to attract quite a few well-known speakers to Oxford for the various sessions, keynotes and the debate. I am particularly proud that we could even get Amelia Andersdotter, MEP and member of the Swedish Pirate Party, for the public debate, together with people from publishers…

Please help us spread the word about the conference. Even if you cannot come to Oxford or live abroad, some of your friends might be close by and might want to attend.
The topic of open access is important since we have to preserve our knowledge (c.f.

Slides of Related work application presented in the Graphdevroom at FOSDEM Sat, 02 Feb 2013 15:13:02 +0000 Download the slide deck of our talk at FOSDEM 2013, including all the resources that we pointed to.
The most important other links are:

It was great talking here, and again: we are open source, open data and so on. So if you have suggestions or want to contribute, feel free. Just do it or contact us. We are really looking forward to meeting other hackers who just want to go geek and change the world.
The video of the talk can be found here:

Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language Tue, 06 Nov 2012 11:55:02 +0000 As I said yesterday, I have been busy over the last months producing content, so here you go. For related work we will most likely use neo4j as our core database. This makes sense since we are basically building a kind of social network. Most queries that we need to answer while offering the service or during data mining have a friend of a friend structure.
For some of the queries we do counting or aggregations, so I was wondering what the most efficient way of querying a neo4j database is. So I ran a benchmark, with quite surprising results.
Just a quick remark: we used a database consisting of papers and authors extracted from one of the biggest preprint sites available on the web. The data set is available for download and reproduction of the benchmark results at
The database as a neo4j file is 2 GB (zipped). The schema looks roughly like this:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we were trying to find coauthors, which is basically a friend of a friend query following the author relationship (or a breadth-first search of depth 2).
As we know, there are basically 3 ways of communicating with the neo4j database:

Java Core API

Here you work directly on the node and relationship objects within Java. Once you have fixed an author node, formulating a query looks pretty much like this:

// Walk author -> paper -> co-author along AUTHOROF relationships.
for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
    Node paper = rel.getOtherNode(author);
    for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node coAuthor = coAuthorRel.getOtherNode(paper);
        if (coAuthor.getId() == author.getId()) continue; // skip the start author
        // ... work with coAuthor ...
    }
}

We see that the code can easily become confusing (once queries get more complicated). On the other hand, one can easily combine several similar traversals into one big query, which makes readability worse but increases performance.
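To make the shape of this two-hop traversal concrete without a running neo4j instance, here is a minimal sketch on plain in-memory maps. The class `CoAuthorSketch`, the two maps and the author and paper names are hypothetical stand-ins for the graph, not part of the neo4j API or of the benchmark code.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the author/paper graph (not neo4j code).
final class CoAuthorSketch {

    // Same nested loop as the Core API version: author -> papers -> authors of those papers.
    static Set<String> coAuthors(String author,
                                 Map<String, List<String>> papersByAuthor,
                                 Map<String, List<String>> authorsByPaper) {
        Set<String> result = new TreeSet<>();
        for (String paper : papersByAuthor.getOrDefault(author, List.of())) {
            for (String coAuthor : authorsByPaper.getOrDefault(paper, List.of())) {
                if (coAuthor.equals(author)) continue; // skip the start author
                result.add(coAuthor);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> papersByAuthor = Map.of(
                "alice", List.of("p1", "p2"),
                "bob", List.of("p1"),
                "carol", List.of("p2"));
        Map<String, List<String>> authorsByPaper = Map.of(
                "p1", List.of("alice", "bob"),
                "p2", List.of("alice", "carol"));
        System.out.println(coAuthors("alice", papersByAuthor, authorsByPaper)); // prints [bob, carol]
    }
}
```

Collecting into a set removes duplicates when two authors share several papers; for counting queries one would use a counter map instead.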

Traverser Framework

The Traverser Framework ships with the Java API and I really like the idea behind it. I think it makes the meaning of a query really easy to understand, and in my opinion it helps a lot to keep the code readable.

Traversal t = new Traversal();
for (Path p : t.description().breadthFirst(). … ) {
    Node coAuthor = p.endNode();
    // ... work with coAuthor ...
}

Especially if you have a lot of similar queries or queries that are refinements of other queries you can save them and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is the Cypher Query Language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why. It is a really beautiful language that is close to SQL (looking at Stack Overflow, it is actually frightening how many people are trying to answer FOAF queries using MySQL) but still emphasizes the graph-like structure.

ExecutionEngine engine = new ExecutionEngine(graphDB);
// Two-hop pattern: author -> paper -> coAuthor along AUTHOROF relationships.
String query = "START author=node(" + author.getId() + ") MATCH author-[:"
        + RelationshipTypes.AUTHOROF.name() + "]-()-[:"
        + RelationshipTypes.AUTHOROF.name() + "]-coAuthor RETURN coAuthor";
ExecutionResult result = engine.execute(query);
scala.collection.Iterator<Node> it = result.columnAs("coAuthor");
while (it.hasNext()) {
    Node coAuthor = it.next();
    // ... work with coAuthor ...
}

I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let's just point out that Cypher is quite expressive.

What were the results?

All queries were executed 11 times; the first run was thrown away since it only warms up the neo4j caches. The values are averages over the other 10 executions.
  • The Core API is able to answer about 2000 friend of a friend queries per second (I have to admit, on a very sparse network).
  • The Traverser Framework is about 25% slower than the Core API.
  • Worst is Cypher, which is at least one order of magnitude slower and only able to answer about 100 FOAF-like queries per second.
  • I was shocked, so I talked with Andres Taylor from neo4j, who mainly works on Cypher. He asked me which neo4j version I used, and I said it was 1.7. He told me I should check out 1.9, since Cypher has become more performant. So I ran the benchmarks on neo4j 1.8 and neo4j 1.9. Unfortunately, Cypher became slower in the newer neo4j releases.
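The measurement procedure described above (one discarded warm-up run followed by ten timed runs that are averaged) can be sketched roughly as follows. This is not the original benchmark code; the class `WarmupBenchmark`, its method names and the dummy workload in `main` are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical timing harness: one discarded warm-up run, then an average over the measured runs.
final class WarmupBenchmark {

    // Runs the task once untimed (warm-up), then measuredRuns times, and returns the mean duration in nanoseconds.
    static double averageNanos(Runnable task, int measuredRuns) {
        task.run(); // warm-up: fills caches, timing is thrown away
        long total = 0;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            total += System.nanoTime() - start;
        }
        return (double) total / measuredRuns;
    }

    // Converts "operations done in a given number of nanoseconds" into operations per second.
    static double perSecond(long operations, double nanos) {
        return operations / (nanos / 1e9);
    }

    public static void main(String[] args) {
        // Dummy workload standing in for a batch of 1000 queries.
        Map<Integer, Integer> store = new HashMap<>();
        for (int i = 0; i < 1000; i++) store.put(i, i * 2);
        long[] sink = {0};
        Runnable batch = () -> {
            for (int i = 0; i < 1000; i++) sink[0] += store.get(i);
        };
        double avg = averageNanos(batch, 10);
        System.out.println("average ns per batch: " + avg);
        System.out.println("lookups per second: " + perSecond(1000, avg));
    }
}
```

Averaging over only ten runs is sensitive to JIT compilation and garbage collection pauses, so numbers from such a harness are best read as rough orders of magnitude.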

    One can see that the Core API outperforms Cypher by an order of magnitude and the Traverser Framework by about 25%. In newer neo4j versions the Core API became faster and Cypher became slower.

    Quotes from Andres Taylor:

    Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on the focus in this first phase has been to explore the language, and learn about how our users use the query language, and to expand the feature set to a reasonable level

    I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. like every language ever created, in the beginning you can always do better than the compiler by writing by hand but eventually,the compiler catches up


    So far I have only used the Java Core API when working with neo4j, and I will continue to do so.
    If you are in a high-speed scenario (I believe every web application is one) you should really think about switching to the neo4j Java Core API for writing your queries. It might not look as nice as Cypher or the Traverser Framework, but the gain in speed pays off.
    I also personally like the amount of control you have when traversing over the core yourself.
    Additionally, I will soon post an article on why scripting languages like PHP, Python or Ruby aren’t suitable for building web applications anyway. So switching to the Core API makes sense for several reasons.
    The complete source code of the benchmark can be found at (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
    The detailed results can be found in this spreadsheet.

    Related work of the Reading club on distributed graph data bases (Beehive, Scalable SPARQL Querying of Large RDF Graphs, memcached) Wed, 07 Mar 2012 16:34:00 +0000 Today we finally had our reading club and discussed several papers from last week’s assignments.
    Before I give my usual summary I want to introduce our new infrastructure for the reading club. Go to:
    There you can find a question and answer system which we will use to discuss questions about the papers and their answers. Because of the included voting system we thought this would be much more convenient than a closed, unstructured mailing list. I hope this is of help to the entire community, and I can only invite anyone to read and discuss with us on

    Reading list for next meeting Wed March 14th 2 pm CET

    We first discussed the memcached paper:

    One of the first topics we discussed was how the dynamic hashing is done. We also wondered how DHTs take care of overloading in general. The memcached paper does not discuss this very well. Schegi knows a good paper that explains the dynamics behind DHTs and will provide the link soon.
    Afterwards we discussed what would happen if a distributed key-value store like memcached were used to implement a graph store. Obviously, creating a graph store on the key-value model is possible. Additionally, memcached is very fast in its lookups. One could add another persistence layer to memcached that would enable disk writes.
    We think the main counter arguments are:

    • In this setting the graph is distributed to the worker nodes randomly.
    • No performance gain through graph partitioning is possible.
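The two counter-arguments can be illustrated with a small sketch of a graph store on top of a key-value model. The class `KeyValueGraphSketch` and its methods are hypothetical and only mimic hash-based key placement; this is not memcached client code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: adjacency lists stored as values, keys placed on workers by hash only.
final class KeyValueGraphSketch {
    private final List<Map<String, List<String>>> workers = new ArrayList<>();

    KeyValueGraphSketch(int workerCount) {
        for (int i = 0; i < workerCount; i++) workers.add(new HashMap<>());
    }

    // Placement depends only on the key's hash, never on the graph structure.
    int workerFor(String nodeId) {
        return Math.floorMod(nodeId.hashCode(), workers.size());
    }

    void addEdge(String from, String to) {
        workers.get(workerFor(from)).computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // One key lookup returns the whole adjacency list of a node.
    List<String> neighbours(String nodeId) {
        return workers.get(workerFor(nodeId)).getOrDefault(nodeId, List.of());
    }

    // Counts edges whose two endpoints are stored on different workers.
    int crossWorkerEdges() {
        int cut = 0;
        for (Map<String, List<String>> worker : workers) {
            for (Map.Entry<String, List<String>> entry : worker.entrySet()) {
                for (String to : entry.getValue()) {
                    if (workerFor(entry.getKey()) != workerFor(to)) cut++;
                }
            }
        }
        return cut;
    }
}
```

Because `workerFor` ignores the edges, closely connected nodes end up on arbitrary workers, so a traversal usually has to cross worker boundaries; `crossWorkerEdges` makes that cost visible for a given graph.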

    We realized that we should really read about distributed graph distribution.
    When using memcached you can store much more than an adjacency list in the value of one key, which reduces the amount of information that has to be fetched separately.
    Again I pointed out that separating the data model from the data storage could help considerably. I will soon write an entire blog article about this idea in the setting of relational / graph models and relational database management systems.
    Personally, I am still convinced that memcached could be used to improve asynchronous message passing in distributed systems like signal / collect.

    Scalable SPARQL Query of Large RDF graphs:

    We agreed that one of the core principles in this paper is that they remove supernodes (everything connected via RDF type) in order to get a much sparser graph, and then do the partitioning, which speeds up the computation a lot. Afterwards they add the supernodes back redundantly to all workers where they could be needed. This methodology could generalize pretty well to arbitrary graphs: you just look at the node degree, remove the x% of nodes with the highest degree from the graph, run a clustering algorithm and then add the supernodes redundantly to the workers.
    Thomas pointed out that this paper has the drawback of not using a distributed clustering algorithm, even though it then uses a framework like MapReduce.
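The generalization to arbitrary graphs suggested above can be sketched as follows. This is only an illustration of the idea, not the paper's algorithm: the class `SupernodeSketch` is hypothetical, the graph is assumed to be undirected with a symmetric adjacency map, and a trivial modulo placement stands in for a real clustering algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: drop the highest-degree nodes, partition the rest, replicate the dropped nodes.
final class SupernodeSketch {

    // Returns the ceil(fraction * n) nodes with the highest degree (ties broken by node id).
    static Set<Integer> supernodes(Map<Integer, Set<Integer>> adjacency, double fraction) {
        int count = (int) Math.ceil(fraction * adjacency.size());
        List<Integer> byDegree = new ArrayList<>(adjacency.keySet());
        byDegree.sort((a, b) -> {
            int cmp = Integer.compare(adjacency.get(b).size(), adjacency.get(a).size());
            return cmp != 0 ? cmp : Integer.compare(a, b);
        });
        return new HashSet<>(byDegree.subList(0, count));
    }

    // Removes the supernodes and all edges touching them, leaving a sparser graph.
    static Map<Integer, Set<Integer>> without(Map<Integer, Set<Integer>> adjacency, Set<Integer> removed) {
        Map<Integer, Set<Integer>> sparse = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> entry : adjacency.entrySet()) {
            if (removed.contains(entry.getKey())) continue;
            Set<Integer> kept = new HashSet<>(entry.getValue());
            kept.removeAll(removed);
            sparse.put(entry.getKey(), kept);
        }
        return sparse;
    }

    // Placeholder partitioning (node id modulo worker count) of the sparse graph,
    // followed by copying each supernode to every worker that holds one of its neighbours.
    static List<Set<Integer>> assign(Map<Integer, Set<Integer>> sparse,
                                     Map<Integer, Set<Integer>> adjacency,
                                     Set<Integer> supernodes, int workers) {
        List<Set<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < workers; i++) parts.add(new HashSet<>());
        for (int node : sparse.keySet()) {
            parts.get(Math.floorMod(node, workers)).add(node);
        }
        for (int supernode : supernodes) {
            for (int neighbour : adjacency.get(supernode)) {
                if (!supernodes.contains(neighbour)) {
                    parts.get(Math.floorMod(neighbour, workers)).add(supernode);
                }
            }
        }
        return parts;
    }
}
```

In a real setting the modulo placement would be replaced by a proper graph clustering run on the sparse graph returned by `without`; the replication step stays the same.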


    Beehive:

    We all agreed that the Beehive paper solves a problem with a really great methodology, by first looking into the query distribution and then using proactive caching strategies. The interesting point is that they create an analytical model which they can solve in closed form. The P2P protocols are enhanced by gossip to distribute the parameters of the protocol. In this way an adaptive system is created which adjusts its caching strategy once the queries change.
    We thought that the Beehive approach could be generalized to various settings. In particular, it might be possible to analyze not only Zipf distributions but also other query distributions, and to derive various analytical models which could even coexist in such a system.
    You can find our questions and thoughts and join our discussion about Beehive online!
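To see why a Zipf-like query distribution makes proactive caching attractive, one can compute which share of all queries hits the k most popular of n items. The sketch below is a simple illustration of that effect and not the analytical model from the Beehive paper; the class `ZipfCacheSketch`, the exponent parameter `alpha` and the item counts in `main` are assumptions.

```java
// Hypothetical illustration: share of queries served when the k most popular items are cached,
// assuming the i-th most popular item is queried with probability proportional to 1 / i^alpha.
final class ZipfCacheSketch {

    // Generalized harmonic number H(n, alpha) = sum over i = 1..n of 1 / i^alpha.
    static double harmonic(int n, double alpha) {
        double sum = 0.0;
        for (int i = 1; i <= n; i++) {
            sum += 1.0 / Math.pow(i, alpha);
        }
        return sum;
    }

    // Fraction of queries that hit the cache when the `cached` most popular of `items` items are cached.
    static double hitRate(int cached, int items, double alpha) {
        return harmonic(cached, alpha) / harmonic(items, alpha);
    }

    public static void main(String[] args) {
        for (int k : new int[] {10, 100, 1000}) {
            System.out.printf("cache top %d of 10000 items: hit rate %.3f%n",
                    k, hitRate(k, 10_000, 1.0));
        }
    }
}
```

Swapping `harmonic` for the cumulative mass of a different popularity distribution is the kind of generalization mentioned above.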

    Challenges in parallel Graph processing:

    Unfortunately we did not really have the time to discuss this paper, which in my opinion is a great one. I created a discussion on our new question board, so feel free to discuss this paper at:
