As I said yesterday, I have been busy producing content over the last months, so here you go. For Related Work we will most likely use neo4j as our core database. This makes sense since we are basically building a kind of social network: most queries that we need to answer, while offering the service or during data mining, have a friend-of-a-friend structure.

Some of these queries involve counting or aggregation, so I was wondering what the most efficient way of querying a neo4j database is. So I ran a benchmark, with quite surprising results.

Just a quick remark: we used a database consisting of papers and authors extracted from arxiv.org, one of the biggest preprint sites on the web. The data set is available for download, so that the benchmark results can be reproduced, at http://blog.related-work.net/data/
The database as a neo4j file is 2 GB (zipped). The schema looks roughly like this:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we were trying to find coauthors, which is basically a friend-of-a-friend query following the author relationship (in other words, a breadth-first search to depth 2).
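
To make the shape of that query concrete, here is a minimal sketch in plain Java that runs the same two-hop walk (author, then paper, then coauthor) over an in-memory map instead of neo4j. The class `CoAuthorSketch` and its methods are hypothetical names made up for this illustration; they are not part of the neo4j API and not the benchmark code itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the graph; NOT the neo4j API.
class CoAuthorSketch {
    // The author edge stored in both directions: author -> papers, paper -> authors.
    private final Map<String, List<String>> papersOf = new HashMap<>();
    private final Map<String, List<String>> authorsOf = new HashMap<>();

    void addAuthorship(String author, String paper) {
        papersOf.computeIfAbsent(author, k -> new ArrayList<>()).add(paper);
        authorsOf.computeIfAbsent(paper, k -> new ArrayList<>()).add(author);
    }

    // Counts every author -> paper -> coauthor path of length 2,
    // skipping paths that lead back to the start author.
    // A coauthor sharing several papers is counted once per shared paper.
    int countCoAuthorPaths(String author) {
        int resCnt = 0;
        for (String paper : papersOf.getOrDefault(author, Collections.<String>emptyList())) {
            for (String coAuthor : authorsOf.getOrDefault(paper, Collections.<String>emptyList())) {
                if (coAuthor.equals(author)) continue;
                resCnt++;
            }
        }
        return resCnt;
    }

    public static void main(String[] args) {
        CoAuthorSketch g = new CoAuthorSketch();
        g.addAuthorship("Author1", "Paper1");
        g.addAuthorship("Author2", "Paper1");
        g.addAuthorship("Author1", "Paper2");
        g.addAuthorship("Author2", "Paper2");
        g.addAuthorship("Author3", "Paper2");
        // Author2 via Paper1, plus Author2 and Author3 via Paper2.
        System.out.println(g.countCoAuthorPaths("Author1")); // prints 3
    }
}
```

Note that this counts paths, not distinct coauthors. If distinct coauthors are wanted, the inner loop would collect names into a set instead of incrementing a counter.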

As we know, there are basically three ways of communicating with the neo4j database:

Java Core API

Here you work on the node and relationship objects within Java. Once you have fixed an author node, formulating the query looks pretty much like this:

    // First hop: from the author to each of their papers.
    for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node paper = rel.getOtherNode(author);
        // Second hop: from the paper to all of its authors.
        for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
            Node coAuthor = coAuthorRel.getOtherNode(paper);
            // Skip the author we started from.
            if (coAuthor.getId() == author.getId()) continue;
            resCnt++;
        }
    }

We see that the code can easily become confusing once queries get more complicated. On the other hand, one can easily combine several similar traversals into one big query, which makes readability worse but increases performance.

Traverser Framework

The Traverser Framework ships with the Java API, and I really like the idea of it. I think it makes the meaning of a query really easy to understand, and in my opinion it really helps keep the code readable.

    Traversal t = new Traversal();
    for (Path p : t.description().breadthFirst()
            .relationships(RelationshipTypes.AUTHOROF)
            .evaluator(Evaluators.atDepth(2))
            .uniqueness(Uniqueness.NONE)
            .traverse(author)) {
        Node coAuthor = p.endNode();
        resCnt++;
    }

Especially if you have a lot of similar queries, or queries that are refinements of other queries, you can save them and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is the Cypher Query Language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why. It is a really beautiful language that is close to SQL (looking at Stack Overflow, it is actually frightening how many people are trying to answer FOAF queries using MySQL) but still emphasizes the graph-like structure.

    ExecutionEngine engine = new ExecutionEngine(graphDB);
    String query = "START author=node(" + author.getId()
            + ") MATCH author-[:" + RelationshipTypes.AUTHOROF.name()
            + "]-()-[:" + RelationshipTypes.AUTHOROF.name()
            + "]- coAuthor RETURN coAuthor";
    ExecutionResult result = engine.execute(query);
    scala.collection.Iterator<Node> it = result.columnAs("coAuthor");
    while (it.hasNext()) {
        Node coAuthor = it.next();
        resCnt++;
    }

I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let's just point out that Cypher is quite expressive.

What were the results?

All queries were executed 11 times; the first run was thrown away since it warms up the neo4j caches. The reported values are averages over the other 10 executions.
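
The measurement scheme can be sketched as follows. This is only an illustration of "run 11 times, drop the first, average the rest"; the class `WarmupBenchmark`, its methods and the placeholder workload are hypothetical and are not the actual benchmark code.

```java
// Hypothetical timing harness; the workload below is a placeholder, not a neo4j query.
class WarmupBenchmark {
    static long sink; // keeps the placeholder workload's result in use

    // Runs the query `runs` times and records the wall-clock time of each run in nanoseconds.
    static long[] timeRuns(Runnable query, int runs) {
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    // Averages all runs except the first one, which only serves to warm up caches.
    static double averageWithoutWarmup(long[] runNanos) {
        if (runNanos.length < 2) {
            throw new IllegalArgumentException("need a warm-up run plus at least one measured run");
        }
        long sum = 0;
        for (int i = 1; i < runNanos.length; i++) {
            sum += runNanos[i];
        }
        return (double) sum / (runNanos.length - 1);
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for one of the three query variants.
        Runnable query = () -> {
            long s = 0;
            for (int i = 0; i < 1_000_000; i++) s += i;
            sink = s;
        };
        long[] nanos = timeRuns(query, 11); // 1 warm-up run + 10 measured runs
        System.out.printf("average over %d runs: %.3f ms%n",
                nanos.length - 1, averageWithoutWarmup(nanos) / 1e6);
    }
}
```

In a real run, `query` would wrap one of the three coauthor queries shown earlier, executed against the same author node each time.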

8 Comments on Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language

  1. Jacob Hansson says:

    Awesome work, really interesting input. Reading the benchmark though, I noticed that you create a new ExecutionEngine in the inner-most loop of the cypher benchmark. That means cypher needs to re-parse and re-calculate an execution plan for each invocation, which should have some impact on performance.

    If possible, would you mind trying it with a single ExecutionEngine, and using a parameterized query instead of creating a fixed-string query like it is today? So it would be something like (in pseudo code):

    engine = new ExecutionEngine(db)
    query = "START author=node({authorId})
    MATCH author-[:"+RelationshipTypes.AUTHOROF.name()+"]-()-[:"+RelationshipTypes.AUTHOROF.name()+"]- coAuthor
    RETURN coAuthor"

    # You could even ask cypher to prepare the query here, either by
    # executing it once, or by using the Scala ExecutionEngine which exposes a prepare method

    startTimer()
    for author in authors:
        result = engine.execute(query, {"authorId": author.id})
        for row in result:
            pass
    stopTimer()

    • Hey Jacob thanks for your comment!

      Starting the execution engine only once makes almost no difference, and using parameters increases the speed of Cypher by 10 to 20 %, which is still much slower than the core API or the Traverser Framework, which I like better anyway.

      I am too lazy right now to redraw the plots and rerun the complete benchmark, since it takes quite a while to compute.

      But thanks for pointing this out! The corrected benchmark will be in my git repo next time I push.

  2. [...] title: “Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of C…“, makes you appreciate why René’s day job is “computer scientist” and not [...]

  3. robinkc says:

    Have you tried gremlin? Where does gremlin fit in the performance graph?

    • No, I did not try Gremlin, but I assume that since Gremlin also builds on top of neo4j it will not be faster than the rest. In particular, I have had some bad experiences with Gremlin being rather slow.

      But the code is open, as is the data set. Feel free to write a Gremlin query.

  4. rpiccand says:

    Hi René! Thanks for putting this together. Very interesting outcomes. Have you had a chance to re-evaluate Cypher v. 2.x ?

    Best,
    -R.
