CC-0 via https://www.pexels.com/photo/speed-racing-speedway-racing-car-12801/

Time lines and news streams: Neo4j is 377 times faster than MySQL

Over the last weeks I did some more work on neo4j. And I am ready to present some more results on the speed (In my use case neo4j outperformed MySQL by a factor of 377 ! That is more than two magnitudes). As known one part of my PhD thesis is to create a social newsstream application around my social networking site metalcon.de. It is very obvious that a graph structure for social newsstreams are very natural: You go to a user. Travers to all his friends or objects of interest and then traverse one step deeper to the newly created content items. A problem with this kind of application is the sorting by Time or relvance of the content items. But before I discuss those problems I just want to present another comparission between MySQL and neo4j.

Setting:

I took a fairly small dataset with the following properties:

  • 14986 content items (entries in a forum of a band)
  • 1391 Bands
  • 854 user having at leas one of those bands in their list of fav bands
  • a bipartit graph of fan relations between users and bands

For every User I wanted to select all content items from his favourite bands. I know this is far away from the social newsstream application I am about to build but I used it as a first test to see weather graph databases really are the more natural setting for my questions.

Results:

In MySQL this would look something like this:

for all (User_ID in Users){
SELECT ce.Text
FROM Entry e
JOIN Topic t ON t.ID = e.Topic_ID
JOIN UserBand ub ON ct.Band_ID = ub.Band_ID
WHERE ub.User_ID = User_ID
}
Even though we have all relevant colums indexed those joins are expensive. Especially because the Entry Table is much bigger than 14986 Items.
Using MySQL It took 152 Seconds = 2:32 Minutes to create the interesting newsstreams for all 854 users or 177 ms per user

Let’s look at neo4j:

Not using any traverser but just some hacked in code I have something like the following
for all (user in index){
for (Relationship rel: user.getRelationships(DynamicRelationshipType.withName(“FAN”), Direction.BOTH)){
Node n1 = rel.getOtherNode(user);

for (Relationship r2: n1.getRelationships(DynamicRelationshipType.withName(“CONTENTITEM”), Direction.BOTH)){
Node n2 = r2.getOtherNode(n1)
edgecnt++;
}
}
}

Even thogh we only have 854 users and 1391 bands we end up with  1368270 relations that we have traversed to retrieve all content items for all favourite bands of each user.
Using neo4j it took  3,4 Seconds in doing this or 4 ms per user
That is about 44 times faster than MySQL

After warming the caches.

When repeating this experiment MySQL does not get faster. I know they have a query cache but I guess it gets emptied before the first queries are run again. In neo4j though this result gets even better. Each time I repeated this experiment the runtime decreased until I came to something like 0,4 Seconds for the job which now is 377 times faster than MySQL. The best part of this is scaling. When my Social graph grows the search in neo4j stays a local search. more users and more discussions does not mean more edges to traverse from the user to his favourite bands. in MySQl though the cost of these heavy joins will just explode.
Yes I know in MySQL I would denormalize my tables to create such an application in order to gain speed. But denormalizing means more redundancy and again a graph is a much more natural structure for a social application.

Open Questions! Feel free to discuss them (-:

There is an very important open question though and that is sorting. A social newsstream of course should be sorted by time. in MySQL it is easy to create an Index over a colum with timestamps on the Contentitems. Sorting my resultset by time will in this case not increase the runtime.
In a graph database with this particular schema of my graph I am not quite sure how to retrieve the results in a sorted way. (Anyone has some ideas?) There needs to be further testing to see if sorting after retrieving still makes my solution faster than MySQL or (my prefered solution) if there is some way of designing the graph in a way that for any user with any set of favourite bands there is a quick way of traversing through the content items and receiving them in an ordered way. I even guess this is already possible in neo4j using traversers with bredth first search and tell them in wich order to traverse relations. Just have too look deeper into this and i will keep you updated.
I am happy and open for comments and suggestions! Oh and could anyone suggest a nice syntax highlightning plugin for wordpress?

You may also like...

Popular Posts

28 Comments

  1. Hi there,
    great post! Also, related to news streams, there has been a Quora discussion that might be interesting in this context: http://www.quora.com/Which-Relational-NoSQL-approach-will-you-recommend-for-storing-Trees-with-real-time-aggregation-from-child-nodes-to-Parent-nodes
    /peter

  2. Hi there,
    great post! Also, related to news streams, there has been a Quora discussion that might be interesting in this context: http://www.quora.com/Which-Relational-NoSQL-approach-will-you-recommend-for-storing-Trees-with-real-time-aggregation-from-child-nodes-to-Parent-nodes
    /peter

  3. Aliabbas Petiwala says:

    Great job rene,
    can you share with me the code and dataset , so that i can tweak it right away and perform the same tests with my dataset?

    1. the dataset comes from my social networking site. According to our terms and conditions I cannot share the data set.
      As for the source code. We are about to develope the new version of metalcon open source. As soon as we make the source code available (which should happen in less than a month) you can have access to the source. Just follow me on twitter or subscribe to my blog and you will know immidiately once we upload our source code.

  4. Aliabbas Petiwala says:

    Great job rene,
    can you share with me the code and dataset , so that i can tweak it right away and perform the same tests with my dataset?

    1. the dataset comes from my social networking site. According to our terms and conditions I cannot share the data set.
      As for the source code. We are about to develope the new version of metalcon open source. As soon as we make the source code available (which should happen in less than a month) you can have access to the source. Just follow me on twitter or subscribe to my blog and you will know immidiately once we upload our source code.

  5. Hi René,
    thanks for your post about neo4j. I might give it a try as I’m currently working with lots of graphs.
    edit: I’ve just asked if it might be interesting. According to other programmers in the team we don’t need a database for those graphs.
    You asked for a WP-Plugin for syntax highlighting. As I don’t use WP, I don’t know a plugin for syntax highlighting in WP. But you could try snipt.

    1. I just looked at snipt. It seems that I have to embed some javascript from their site. this is a little scary to me. What happens if the site doesn’t offer the service anymore or if the siteowner at some point in time silently changes the javascript to do some harmful stuff.

  6. Hi René,
    thanks for your post about neo4j. I might give it a try as I’m currently working with lots of graphs.
    edit: I’ve just asked if it might be interesting. According to other programmers in the team we don’t need a database for those graphs.
    You asked for a WP-Plugin for syntax highlighting. As I don’t use WP, I don’t know a plugin for syntax highlighting in WP. But you could try snipt.

    1. I just looked at snipt. It seems that I have to embed some javascript from their site. this is a little scary to me. What happens if the site doesn’t offer the service anymore or if the siteowner at some point in time silently changes the javascript to do some harmful stuff.

  7. It is worth noting that Twitter’s FlockDB graph database does run on the MySQL engine, though. I think you should state that you’re comparing Neo4j with a graph schema to MySQL and a relational schema.

    1. Hey Jamie,
      I totally agree on the point you make! In the paper I am writing I already mentioned FlockDB and the fact that twitter as well as facebook are both systems running on MySQL. Which definately is a proof of concept and speaks for relational systems.
      I am totally aware of the fact, that I could denormalize my MySQL data base and have less than two joins or create some adjacancy index structures on mySQL like FlockDB does.
      I am sure that I won’t gain a factor of 377 then. But it is nice to see that neo4j with almost no effort comes out of the box much faster than a standard relational approach! I didn’t say that you couldn’t tweak mySQL. but I also didn’t tweak neo4j.
      I am also by no means saying that MySQL is bad. I love that data base and it has done a great job for our site! But I am still convinced that graph data bases are much more natural for my particular usecase (-:

      1. Thanks, Rene. I’m hardly an evangelist for MySQL, but it’s important that everyone notes the distinction.
        Peter Neubauer and the Neo4j team are doing great work.

  8. It is worth noting that Twitter’s FlockDB graph database does run on the MySQL engine, though. I think you should state that you’re comparing Neo4j with a graph schema to MySQL and a relational schema.

    1. Hey Jamie,
      I totally agree on the point you make! In the paper I am writing I already mentioned FlockDB and the fact that twitter as well as facebook are both systems running on MySQL. Which definately is a proof of concept and speaks for relational systems.
      I am totally aware of the fact, that I could denormalize my MySQL data base and have less than two joins or create some adjacancy index structures on mySQL like FlockDB does.
      I am sure that I won’t gain a factor of 377 then. But it is nice to see that neo4j with almost no effort comes out of the box much faster than a standard relational approach! I didn’t say that you couldn’t tweak mySQL. but I also didn’t tweak neo4j.
      I am also by no means saying that MySQL is bad. I love that data base and it has done a great job for our site! But I am still convinced that graph data bases are much more natural for my particular usecase (-:

      1. Thanks, Rene. I’m hardly an evangelist for MySQL, but it’s important that everyone notes the distinction.
        Peter Neubauer and the Neo4j team are doing great work.

  9. […] Time lines and news streams: Neo4j is 377 times faster than MySQL by René Pickhardt. […]

  10. […] Time lines and news streams: Neo4j is 377 times faster than MySQL by René Pickhardt. […]

  11. very nice and have a try
    thanks

  12. very nice and have a try
    thanks

  13. Junegunn Choi says:

    Hi Rene,
    I think your conclusion here is somewhat misleading.
    I don’t think you should execute 854 SQLs (which is extremely inefficient), instead Users table should be included in the SQL above, so with one more join, you can traverse all the items with only one SQL. Considering the size of the data set, it shouldn’t take longer than a few seconds.

  14. Junegunn Choi says:

    Hi Rene,
    I think your conclusion here is somewhat misleading.
    I don’t think you should execute 854 SQLs (which is extremely inefficient), instead Users table should be included in the SQL above, so with one more join, you can traverse all the items with only one SQL. Considering the size of the data set, it shouldn’t take longer than a few seconds.

    1. as already said there is a lot of room to tweak the sql here. But I want to retrieve the stream for every user sperately. So as far as I understand your solution I think this won’t be of any help.
      The question is: “How much time does it take to generate the stream for a single user?” Since this ist still pretty fast (especially with neo4j) I looped over all users so I could build an average and also have all different kind of user types included.
      If i would build a streaming system on SQL in a realworld I would just denormalize these tables and create one newsstream table that will just handle the updates and store date in a redundant way.
      the following two discussions on stack overview are helpfull.
      http://stackoverflow.com/questions/2835075/php-news-feed-database-design
      http://stackoverflow.com/questions/202198/whats-the-best-manner-of-implementing-a-social-activity-stream
      But I don’t like a denormalized system. This is not flexible if the social graph of my network changes (e.g. edges are added or deleted) and it is also not very flexible if the functionalilty of the network grows.

      1. Junegunn Choi says:

        “How much time does it take to generate the stream for a single user?”
        So it’s about latency of each request. It seems that the huge difference in the latency comes from the fact that Neo4j is embedded. (I’m just assuming that you’re using an embedded one, correct me if I’m wrong.) Cause it’s just much simpler to access the data in the same address space. However, scalability is an issue for embedded database. So I would love to see a follow-up experiment employing REST API of Neo4j server. I myself haven’t tried it yet.
        Cheers.

  15. I’m still learning SQL but wouldn’t a single ordered query be faster?
    SELECT ce.Text
    FROM Entry e
    JOIN Topic t ON t.ID = e.Topic_ID
    JOIN UserBand ub ON ct.Band
    ORDER by ub.User_ID
    I could be totally off here, as it’s a little past 4am, and I don’t have the data or a shell to try this out

  16. I’m still learning SQL but wouldn’t a single ordered query be faster?
    SELECT ce.Text
    FROM Entry e
    JOIN Topic t ON t.ID = e.Topic_ID
    JOIN UserBand ub ON ct.Band
    ORDER by ub.User_ID
    I could be totally off here, as it’s a little past 4am, and I don’t have the data or a shell to try this out

Leave a Reply

Your email address will not be published. Required fields are marked *