Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language

As I said yesterday, I have been busy over the last months producing content, so here you go. For related work we will most likely use neo4j as the core data base. This makes sense since we are basically building a kind of social network. Most queries that we need to answer, whether while offering the service or during data mining, have a friend of a friend structure.
For some of the queries we are counting or aggregating, so I was wondering what the most efficient way of querying a neo4j data base is. So I ran a benchmark, with quite surprising results.
Just a quick remark: we used a data base consisting of papers and authors extracted from arxiv.org, one of the biggest preprint sites on the web. The data set is available for download and reproduction of the benchmark results at http://blog.related-work.net/data/
The data base as a neo4j file is 2 GB (zipped). The schema looks roughly like this:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we were trying to find coauthors, which is basically a friend of a friend query following the author relationship (in other words, a breadth first search of depth 2).
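To make the shape of this query concrete, here is a minimal sketch on a toy in-memory graph. It does not use the neo4j API at all: the class `CoauthorSketch`, its two maps and the sample names are made up for illustration. It shows the two hops (author to paper, paper to the other authors) and the difference between counting paths and counting distinct coauthors.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory stand-in for the paper/author graph (hypothetical, not the neo4j API). */
final class CoauthorSketch {
    // The author relationship, stored once per direction.
    private final Map<String, List<String>> papersOf = new HashMap<>();
    private final Map<String, List<String>> authorsOf = new HashMap<>();

    void addAuthorship(String author, String paper) {
        papersOf.computeIfAbsent(author, k -> new ArrayList<>()).add(paper);
        authorsOf.computeIfAbsent(paper, k -> new ArrayList<>()).add(author);
    }

    /** Counts every author -> paper -> coauthor path (a coauthor on two papers counts twice). */
    int countCoauthorPaths(String author) {
        int count = 0;
        for (String paper : papersOf.getOrDefault(author, Collections.emptyList())) {
            for (String coAuthor : authorsOf.getOrDefault(paper, Collections.emptyList())) {
                if (!coAuthor.equals(author)) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Same two hops, but every coauthor is reported only once. */
    Set<String> distinctCoauthors(String author) {
        Set<String> result = new TreeSet<>();
        for (String paper : papersOf.getOrDefault(author, Collections.emptyList())) {
            for (String coAuthor : authorsOf.getOrDefault(paper, Collections.emptyList())) {
                if (!coAuthor.equals(author)) {
                    result.add(coAuthor);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        CoauthorSketch g = new CoauthorSketch();
        g.addAuthorship("alice", "p1");
        g.addAuthorship("bob", "p1");
        g.addAuthorship("alice", "p2");
        g.addAuthorship("bob", "p2");
        g.addAuthorship("carol", "p2");
        System.out.println(g.countCoauthorPaths("alice")); // 3 (bob via p1, bob and carol via p2)
        System.out.println(g.distinctCoauthors("alice"));  // [bob, carol]
    }
}
```

Whether a benchmark should count paths or distinct coauthors is a design choice; the sketch only makes the difference visible.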
As we know, there are basically 3 ways of communicating with the neo4j data base:

Java Core API

Here you work on the node and relationship objects within Java. Once you have fixed an author node, formulating the query looks pretty much like this:

    for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node paper = rel.getOtherNode(author);
        for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
            Node coAuthor = coAuthorRel.getOtherNode(paper);
            // skip the path that leads back to the author we started from
            if (coAuthor.getId() == author.getId()) continue;
            resCnt++;
        }
    }

We see that the code can easily become confusing (once queries get more complicated). On the other hand, one can easily combine several similar traversals into one big query, which makes readability worse but increases performance.
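As a rough illustration of combining two traversals into one pass, the following sketch answers two questions (how many papers did the author write, and how many coauthor paths are there) while walking the author's papers only once. It runs on plain Java maps rather than on neo4j nodes; `FusedTraversalSketch`, `Counts` and the sample data are hypothetical names introduced here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: two similar traversals answered in a single pass. */
final class FusedTraversalSketch {
    /** Result of the combined pass. */
    static final class Counts {
        final int papers;
        final int coauthorPaths;

        Counts(int papers, int coauthorPaths) {
            this.papers = papers;
            this.coauthorPaths = coauthorPaths;
        }
    }

    static Counts countInOnePass(String author,
                                 Map<String, List<String>> papersOf,
                                 Map<String, List<String>> authorsOf) {
        int papers = 0;
        int coauthorPaths = 0;
        for (String paper : papersOf.getOrDefault(author, Collections.emptyList())) {
            papers++; // first question: number of papers
            for (String other : authorsOf.getOrDefault(paper, Collections.emptyList())) {
                if (!other.equals(author)) {
                    coauthorPaths++; // second question: coauthor paths
                }
            }
        }
        return new Counts(papers, coauthorPaths);
    }

    public static void main(String[] args) {
        Map<String, List<String>> papersOf = new HashMap<>();
        Map<String, List<String>> authorsOf = new HashMap<>();
        papersOf.put("alice", Arrays.asList("p1", "p2"));
        authorsOf.put("p1", Arrays.asList("alice", "bob"));
        authorsOf.put("p2", Arrays.asList("alice", "bob", "carol"));

        Counts c = countInOnePass("alice", papersOf, authorsOf);
        System.out.println(c.papers + " papers, " + c.coauthorPaths + " coauthor paths");
    }
}
```

The loop body now mixes two concerns, which is exactly the readability cost mentioned above; the benefit is that the first hop is only walked once.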

Traverser Framework

The Traverser Framework ships with the Java API and I really like the idea of it. I think it is really easy to understand the meaning of a query, and in my opinion it really helps to keep the code readable.

    Traversal t = new Traversal();
    for (Path p : t.description()
            .breadthFirst()
            .relationships(RelationshipTypes.AUTHOROF)
            .evaluator(Evaluators.atDepth(2))
            .uniqueness(Uniqueness.NONE)
            .traverse(author)) {
        Node coAuthor = p.endNode();
        resCnt++;
    }

Especially if you have a lot of similar queries, or queries that are refinements of other queries, you can save them and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is the Cypher Query Language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why. It is a really beautiful language that is close to SQL (looking at Stackoverflow, it is actually frightening how many people are trying to answer FOAF queries using MySQL) but still emphasizes the graph-like structure.

    ExecutionEngine engine = new ExecutionEngine(graphDB);
    String query = "START author=node(" + author.getId() + ") MATCH author-[:"
            + RelationshipTypes.AUTHOROF.name() + "]-()-[:"
            + RelationshipTypes.AUTHOROF.name() + "]- coAuthor RETURN coAuthor";
    ExecutionResult result = engine.execute(query);
    scala.collection.Iterator<Node> it = result.columnAs("coAuthor");
    while (it.hasNext()) {
        Node coAuthor = it.next();
        resCnt++;
    }

I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let's just point out that Cypher is quite expressive.

What were the results?

Figure (foafQueryesOnNeo4j): All queries were executed 11 times; the first run was thrown away since it warms up the neo4j caches. The values are averages over the other 10 executions.
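The measuring scheme described here (11 runs, discard the first, average the remaining 10) can be sketched as follows. This is not the benchmark code from the repository; `WarmupBenchmarkSketch` and its method names are invented for illustration, and the workload in `main` is only a placeholder loop.

```java
/** Hypothetical sketch of "run 11 times, drop the warm-up run, average the rest". */
final class WarmupBenchmarkSketch {
    /** Runs the workload the given number of times and records each wall-clock duration. */
    static long[] timeRuns(Runnable workload, int runs) {
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Mean of all timings except the first one, which is treated as cache warm-up. */
    static double averageAfterWarmup(long[] nanos) {
        if (nanos.length < 2) {
            throw new IllegalArgumentException("need a warm-up run plus at least one measured run");
        }
        long sum = 0;
        for (int i = 1; i < nanos.length; i++) {
            sum += nanos[i];
        }
        return (double) sum / (nanos.length - 1);
    }

    /** Converts an average run time into a queries-per-second figure. */
    static double queriesPerSecond(int queriesPerRun, double avgNanosPerRun) {
        return queriesPerRun / (avgNanosPerRun / 1e9);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real benchmark would execute the queries here.
        Runnable workload = () -> {
            long x = 0;
            for (int i = 0; i < 1_000_000; i++) {
                x += i;
            }
            if (x < 0) {
                System.out.println(x); // keeps the loop from being trivially removable
            }
        };
        long[] timings = timeRuns(workload, 11);
        double avg = averageAfterWarmup(timings);
        System.out.println("average ns per run: " + avg);
    }
}
```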

The Core API is able to answer about 2000 friend of a friend queries per second (I have to admit, on a very sparse network).
The Traverser Framework is about 25% slower than the Core API.
Worst is Cypher, which is at least one order of magnitude slower and only able to answer about 100 FOAF-like queries per second.

I was shocked, so I talked with Andres Taylor from neo4j, who mainly works on Cypher. He asked me which neo4j version I used and I said it was 1.7. He told me I should check out 1.9, since Cypher has become more performant. So I ran the benchmarks on neo4j 1.8 and neo4j 1.9. Unfortunately, Cypher became slower in the newer neo4j releases.

Figure (neo4jBenchmarkFOAFQueriesPerSecond): One can see that the Core API outperforms Cypher by an order of magnitude and the Traverser Framework by about 25%. In newer neo4j versions the Core API became faster and Cypher became slower.

Quotes from Andres Taylor:

Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on. The focus in this first phase has been to explore the language, and learn about how our users use the query language, and to expand the feature set to a reasonable level.

I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. Like every language ever created, in the beginning you can always do better than the compiler by writing by hand, but eventually the compiler catches up.

Conclusion:

So far I have only used the Java Core API when working with neo4j, and I will continue to do so.
If you are in a high speed scenario (I believe every web application is one), you should really think about switching to the neo4j Java Core API for writing your queries. It might not look as nice as Cypher or the Traverser Framework, but the gain in speed pays off.
Also, I personally like the amount of control that you have when traversing over the core yourself.
Additionally, I will soon post an article on why scripting languages like PHP, Python or Ruby aren’t suitable for building web applications anyway. So switching to the Core API makes sense for several reasons.
The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
The detailed results can be found in this spreadsheet.

16 Comments

Jacob Hansson says:

November 6, 2012 at 2:24 pm

Awesome work, really interesting input. Reading the benchmark though, I noticed that you create a new ExecutionEngine in the inner-most loop of the cypher benchmark. That means cypher needs to re-parse and re-calculate an execution plan for each invocation, which should have some impact on performance.
If possible, would you mind trying it with a single ExecutionEngine, and using a parameterized query instead of creating a fixed-string query like it is today? So it would be something like (in pseudo code):
    engine = new ExecutionEngine(db)
    query = "START author=node({authorId})
             MATCH author-[:" + RelationshipTypes.AUTHOROF.name() + "]-()-[:" + RelationshipTypes.AUTHOROF.name() + "]- coAuthor
             RETURN coAuthor"
    # You could even ask cypher to prepare the query here, either by
    # executing it once, or by using the Scala ExecutionEngine which exposes a prepare method
    startTimer()
    for author in authors:
        result = engine.execute(query, {"authorId": author.id})
        for row in result:
            pass
    stopTimer()

René Pickhardt says:
November 6, 2012 at 7:29 pm

Hey Jacob, thanks for your comment!
Starting the execution engine only once makes almost no difference, and using parameters increases the speed of Cypher by 10 – 20 %, which is still much slower than the Core API or the Traverser Framework, which I like better anyway.
I am too lazy right now to redraw the plots and rerun the complete benchmark, since it takes quite a while to compute.
But thanks for pointing this out! The corrected benchmark will be in my git repo next time I push.
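The point about parameters in this exchange can be illustrated with a toy model. The sketch below assumes an engine that caches one plan per distinct query text. That is a simplification made up for this example, not a description of how neo4j's ExecutionEngine works internally, and `PlanCacheSketch` is a hypothetical class. With concatenated ids every query text is different, so every call has to be parsed; with a `{authorId}` placeholder the text stays the same and is parsed once.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a plan cache keyed by query text (an assumption, not neo4j internals). */
final class PlanCacheSketch {
    private final Map<String, String> plans = new HashMap<>();
    private int parseCount = 0;

    /** Returns the cached plan for this query text and "parses" only on a cache miss. */
    String plan(String queryText) {
        return plans.computeIfAbsent(queryText, q -> {
            parseCount++; // stands in for the expensive parse and planning step
            return "plan(" + q + ")";
        });
    }

    int parseCount() {
        return parseCount;
    }

    public static void main(String[] args) {
        long[] authorIds = {1, 2, 3};
        String tail = " MATCH author-[:AUTHOROF]-()-[:AUTHOROF]-coAuthor RETURN coAuthor";

        // Id concatenated into the text: three different strings, three parses.
        PlanCacheSketch concatenated = new PlanCacheSketch();
        for (long id : authorIds) {
            concatenated.plan("START author=node(" + id + ")" + tail);
        }

        // Placeholder in the text: the id would travel separately as a parameter.
        PlanCacheSketch parameterized = new PlanCacheSketch();
        for (long id : authorIds) {
            parameterized.plan("START author=node({authorId})" + tail);
        }

        System.out.println(concatenated.parseCount());  // 3
        System.out.println(parameterized.parseCount()); // 1
    }
}
```

How much this saves in practice depends on how expensive parsing is relative to executing the query; the 10 to 20 % reported above suggests that parsing was only a small part of the cost in this benchmark.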

Full power to the Neo4j engines, Mr. Scott! « Another Word For It says:

November 6, 2012 at 5:43 pm

[…] title: “Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of C…“, makes you appreciate why René’s day job is “computer scientist” and not […]

robinkc says:

March 21, 2013 at 4:27 pm

Have you tried gremlin? Where does gremlin fit in the performance graph?

René Pickhardt says:
March 26, 2013 at 2:37 pm

No, I did not try Gremlin, but I assume that since Gremlin also builds on top of neo4j, it will not be faster than the rest. In particular, I had some bad experiences with Gremlin being rather slow.
But the code is open, as is the data set. Feel free to write a Gremlin query.

Slides of Related work application presented in the Graphdevroom at FOSDEM says:

June 25, 2013 at 4:44 pm

[…] cypher benchmark […]

rpiccand says:

March 13, 2014 at 8:10 am

Hi René! Thanks for putting this together. Very interesting outcomes. Have you had a chance to re-evaluate Cypher v. 2.x ?
Best,
-R.

René Pickhardt says:
March 17, 2014 at 11:22 pm

Sorry, I haven’t found the time to do that yet, but the code is open source, so feel free to fork it and run the Cypher v. 2.x benchmark.
