Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks

UPDATE: the paper got accepted at SOCIALCOM2012, and the source code and data sets are online. In particular, the source code of the Graphity server software is now online!
UPDATE II: Download the paper (11 pages, from SocialCom 2012, with co-authors Thomas Gottron, Jonas Kunze, Ansgar Scherp and Steffen Staab) and the slides
I already mentioned that my first research results had been submitted to the SIGMOD conference, to the social networks and graph databases track. Time to sum up the results and blog about them. You can find a demo of the system here.
I created a data model that makes retrieval of social news feeds in social networks very efficient. It is able to dynamically retrieve more than 10,000 temporally ordered news feeds per second in social networks with millions of users, like Facebook and Twitter, by using graph databases (like neo4j).
In order to achieve this I had several points in mind:

  1. I wanted to use a graph database to store the social network data as the core technology. As anyone can guess, my choice was neo4j, which turned out to be a very good idea: the technology is robust and the guys in Sweden gave me great support.
  2. I wanted retrieval of a user's news stream to depend only on the number of items that are to be displayed in it. In particular, fetching a news feed should not depend on the number of nodes in the network or on the number of friends a user has.
  3. I wanted to provide a technology that is as fast as relational databases or flat files (which gain their speed through denormalization) but still has no redundancy and can dynamically handle changes in the underlying network.

How Graphity works is explained in my first presentation and my poster, both of which I already talked about in an older blog post. But you can also watch this presentation to get an idea of it and learn about the evaluation results:

Presentation at FOSDEM

I gave a presentation at FOSDEM 2012 in the Graph Devroom, which was videotaped. Feel free to have a look at it. You can also find the slides from that talk at: http://www.rene-pickhardt.de/wp-content/uploads/2012/11/FOSDEMGraphity.pdf

Summary of results

In order to be among the first to receive the paper, the source code and the used data sets as soon as the paper is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.
With my data model Graphity, built on neo4j, I am able to retrieve more than 10,000 dynamically generated news streams per second from the database. Even in big databases with several million users, Graphity is able to handle more than 100 newly created content items (e.g. status updates) per second, which is still high if one considers that Twitter only had about 600 tweets being created per second as of last year. This means that Graphity is almost able to handle the amount of data that Twitter needs to handle, on a single machine! Graphity creates streams dynamically, so if friendships in the network change, users still get accurate news feeds!
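
The core idea behind the index (explained in detail in the paper and in the talk below) is that every user keeps a chain of the users he follows, linked together by a relationship type that is unique to that user and ordered by the followees' most recent activity. The following is only a rough sketch in the neo4j Java API of the basic pointer manipulation such a chain is built from – it is not the code used in the evaluation, it assumes that every user node carries a key property (as in the snippet further down in the comments), and it leaves out transactions as well as the case where the followee is already linked somewhere in the chain:

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

public class EgoChainSketch {

    // one relationship type per user marks that user's ego chain
    static RelationshipType egoType(Node follower) {
        return DynamicRelationshipType.withName("ego:" + follower.getProperty("key"));
    }

    // link 'followee' in as the head of 'follower's ego chain, i.e. as the
    // followed user with the most recently created content item
    static void setAsHead(Node follower, Node followee) {
        RelationshipType ego = egoType(follower);
        Relationship oldHeadRel = follower.getSingleRelationship(ego, Direction.OUTGOING);
        if (oldHeadRel != null) {
            Node oldHead = oldHeadRel.getEndNode();
            oldHeadRel.delete();
            followee.createRelationshipTo(oldHead, ego); // new head points at the old head
        }
        follower.createRelationshipTo(followee, ego);    // follower points at the new head
    }
}

Reading a feed is then a matter of walking such chains and merging the content lists of the users they point to, which is what the evaluation below measures.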

Evaluation:

Although we used some data from Metalcon to test Graphity, we realized that Metalcon is a rather small data set. To overcome this issue we used the German Wikipedia as a data set. We interpreted every Wikipedia article as a node in a social network, a link between articles as a follow relation, and revisions of articles as status updates. With this in mind we ran the following tests.

Characteristics of the used data sets

Characteristics of the data sets, in millions. A is the number of users in the graph, A_{d>10} the number of users with node degree greater than 10, E the number of edges between users, and C the number of content items (e.g. status updates) created by the users.

As you can see the biggest data sets have 2 million users and 38 million status updates.

Degree distribution of our data sets

Nothing surprising here

STOU as a Baseline

Our baseline method STOU retrieves all the nodes from the ego network of a node and orders them by the time of their most recently created content item. Afterwards, feeds are generated as in Graphity by using a top-k n-way merge algorithm.
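
Put differently, the only extra work STOU does at read time is to establish this ordering of the ego network when a feed is requested, while GRAPHITY keeps it precomputed in its index. A small illustration of that step (not the evaluation code; it assumes the newest timestamp of every friend is already at hand):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class StouSketch {

    // STOU's read-time step: order the ego network by each friend's newest content
    // item, O(d log d) in the node degree d. Afterwards both STOU and GRAPHITY run
    // the same top-k n-way merge over the per-friend content lists (a sketch of
    // that merge is further down in the comment section).
    static List<String> orderEgoNetwork(Map<String, Long> newestItemByFriend) {
        List<String> friends = new ArrayList<>(newestItemByFriend.keySet());
        friends.sort((a, b) -> Long.compare(newestItemByFriend.get(b),
                                            newestItemByFriend.get(a)));
        return friends;
    }
}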

Retrieving News Feeds

For every snapshot of the Wikipedia data set and the Metalcon data set we retrieved the news feeds for every aggregating node in the data set and measured the time in order to calculate the rate of retrieved news feeds per second. With bigger data sets we discovered a drop in the retrieval rate for STOU as well as for GRAPHITY. A detailed analysis revealed that this was due to the fact that more than half of the aggregating nodes have a node degree of less than 10, which becomes visible when looking at the degree distribution. Retrieval of news feeds for those aggregating nodes showed that on average the news feeds were shorter than our desired length of k = 15. Since retrieval of smaller news feeds is significantly faster and a huge percentage of nodes have this small degree, we conducted an experiment in which we retrieved the feeds only for nodes with a degree higher than 10.

Retrieval of news feeds for the wikipedia data set. GRAPHITY was able to retrieve 15

We see that for the smallest data set GRAPHITY retrieves news feeds as fast as STOU. Then the retrieval speed for GRAPHITY rises and afterwards stays constant, as expected – in particular, it is independent of the size of the graph database. The retrieval speed for STOU also stays constant, which we did not expect. Therefore we conducted another evaluation to gain a deeper understanding.

Independence of the Node Degree

After binning articles with the same node degree together and creating bins of equal size by randomly selecting articles, we retrieved the news feeds for each bin.

Rate of retrieved news streams for nodes with a fixed node degree. GRAPHITY clearly stays constant and is thus independent of the node degree.

Dependency on k for news feed retrieval

For our tests, we chose k = 15 for retrieving the news feeds. In this section, we argue for this choice of k and show the influence of k on the rate of retrieved news feeds per second. On the Wikipedia 2009 snapshot, we retrieved the news feeds for all aggregating nodes with a node degree d > 10 while varying k.

Rate of retrieved streams in GRAPHITY with varying k, where k is the number of news items to be fetched. We see that GRAPHITY performs particularly well for small k.

There is a clear dependency of GRAPHITY's retrieval rate on the selected k. For small k, STOU's retrieval rate is almost constant, since sorting the ego networks (which is independent of k) is the dominant factor. With bigger k, STOU's speed drops as both merging, O(k log(k)), and sorting, O(d log(d)), need to be conducted. The dashed line shows the interpolation of the measured retrieval frequency with the function 1/(k log(k)), while the dotted line is the interpolation based on the function 1/k. As we can see, the dotted line is a better approximation to the actually measured values. This indicates that our theoretical estimate of k log(k) for the retrieval complexity is rather pessimistic compared to the empirically measured value, which is close to k.

Index Maintenance and Updating

Here we investigate the runtime of STOU and GRAPHITY in maintaining changes of the network as follow edges are added and removed and content nodes are created. We have evaluated this for the snapshots of Wikipedia from 2004 to 2008. For Metalcon this data was not available.
For every snapshot we simulated adding and removing follow edges as well as creating content nodes, in the order in which these events occur in the Wikipedia history dump.

Handling new status updates

Number of status updates that graphity is able to handle depending on the size of the data set

We see that the number of updates the algorithms are able to handle drops as the data set grows. Their ratio, however, stays almost constant, between a factor of 10 and 20. While the retrieval rate of GRAPHITY for big data sets stays at 12k retrieved news feeds per second, the update rate on the biggest data set is only about 170 updated GRAPHITY indexes per second. For us this is acceptable, since we designed Graphity with the assumption in mind that retrieving news feeds happens much more frequently than creating new status updates.
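
For readers who wonder what "updating a GRAPHITY index" means operationally: when a user creates a content item, that user has to be moved to the head of the ego chain of each of his followers, which is where the O(d) write cost comes from. A conceptual sketch with plain Java lists (illustration only – the server manipulates neo4j relationships instead of Java lists):

import java.util.LinkedList;
import java.util.Map;

public class GraphityUpdateSketch {

    // egoChains: for every user, the users he follows, ordered by their most recent
    // activity (newest first). Called whenever 'author' creates a new content item;
    // the loop over the followers is where the O(d) cost comes from.
    static void onNewContentItem(String author,
                                 Iterable<String> followersOfAuthor,
                                 Map<String, LinkedList<String>> egoChains) {
        for (String follower : followersOfAuthor) {
            LinkedList<String> chain = egoChains.get(follower);
            chain.remove(author);     // unlink the author from his current position
            chain.addFirst(author);   // the author is now the most recently active followee
        }
    }
}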

Handling changes in the social network graph

Number of new friendship relations that graphity is able to handle per second depending on the network size

The ratio for adding follow edges is about the same as the one for adding new content nodes and updating GRAPHITY indexes. This makes perfect sense, since both operations are linear in the node degree, O(d). Overall, STOU was expected to outperform GRAPHITY in this case, since the complexity class of STOU for these tasks is O(1).

Number of broken friendships that graphity is able to handle per second

As we can see from the figure, removing friendships has a ratio of about one, meaning that this task is as fast in GRAPHITY as in STOU.
This is as expected, since the complexity class of this task is O(1) for both algorithms.
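
To see why removing a follow edge is cheap for GRAPHITY as well: because every ego chain uses a relationship type that is unique to the follower, the unfollowed user's neighbours in that chain can be looked up directly instead of walking the chain. A sketch along the lines of the snippet above (illustration only; it assumes the followee actually is part of the chain and again leaves out transactions):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

public class UnfollowSketch {

    // unlink 'followee' from 'follower's ego chain in constant time
    static void removeFromChain(Node follower, Node followee) {
        RelationshipType ego =
            DynamicRelationshipType.withName("ego:" + follower.getProperty("key"));
        Relationship fromPredecessor = followee.getSingleRelationship(ego, Direction.INCOMING);
        Relationship toSuccessor = followee.getSingleRelationship(ego, Direction.OUTGOING);
        Node predecessor = fromPredecessor.getStartNode();   // may be the follower itself
        fromPredecessor.delete();
        if (toSuccessor != null) {
            Node successor = toSuccessor.getEndNode();
            toSuccessor.delete();
            predecessor.createRelationshipTo(successor, ego); // close the gap in the chain
        }
    }
}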

Build the index

We have analyzed how long it takes to build the GRAPHITY and STOU indices for an entire network. Both indices have been computed on a graph with existing follow relations.
To compute the GRAPHITY and STOU indices, for every aggregating node $a$ all content nodes are inserted into the linked list C(a). Subsequently, and only for the GRAPHITY index, the ego network of every aggregating node $a$ is sorted by time in descending order. For both indices, we have measured the rate of processed aggregating nodes per second, as shown in the following graph.

Time to build the index with respect to network size

As one can see, the time needed for computing the indices increases over time. This can be explained by the two steps of creating the indices: for the first step, the time needed for inserting content nodes increases as the average number of content nodes per aggregating node grows over time. For the second step, the time for sorting increases as the ego networks grow and the sorting part becomes more time consuming. Overall, we can say that for the largest Wikipedia data set from 2011, a rate of 433 indexed nodes per second with GRAPHITY is still possible. Creating the GRAPHITY index for the entire Wikipedia 2011 data set (roughly 2 million aggregating nodes at 433 nodes per second) can thus be conducted in 77 minutes.
Computing the STOU index alone takes 42 minutes.

Data sets:

The used data sets are available in another blog article.
In order to be among the first to receive the paper as soon as it is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.

Source code:

The source code of the evaluation framework can be found in my blog post about the Graphity source. The source code of the Graphity server is also online.

Future work & Application:

The plan was actually to use these results in Metalcon. But I am currently thinking about implementing my solution for Diaspora. What do you think about that?

Thanks to

First of all, many thanks go to my co-authors (Steffen Staab, Jonas Kunze, Thomas Gottron and Ansgar Scherp). I also want to thank Mattias Persson and Peter Neubauer from neotechnology.com and the community on the neo4j mailing list for helpful advice on their technology and for providing a neo4j fork that was able to store that many different relationship types.
Thanks to Knut Schumach for coming up with the name GRAPHITY and to Matthias Thimm for helpful discussions.


37 Comments

  1. […] Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks by Rene Pickhardt. […]

  2. […] mindful that this work underlies the approach in Graphity which is retrieving 10,000 nodes per second from social […]

  3. This is very interesting — thanks for sharing!

  4. After thinking about it some more, I’m a bit confused, Rene.
    You say you don’t like the simple timeline approach (what you call “baseline”; 3:08 in the video) because it depends on the out-degree — the number of other users the selected user follows. I agree with that so far.
    But then you give the example of Lady Gaga, who has some million followers. But those are her *followers* — in-degree — not the number of users *she* follows — out-degree.
    I’m further confused, because Lady Gaga is actually a more relevant example for the Graphity index. Anytime she tweets, you have to update the Graphity indices for all her million followers, no?
    Seems like with Graphity, the read performance is indeed O(1), which is awesome, but doesn’t it have a heavy write-time cost for popular users?
    With the baseline approach, the write performance OTOH is O(1), whereas read performance is roughly O(d), maybe O(d log d) more precisely, but given that d — out-degree — is usually quite small, is that not acceptable?
    Thanks again!

    1. Hey Aseem,
      First of all thanks for your feedback and your question!
      I agree with your interpretation of my Lady Gaga example and the fact that it is not a good example. Thanks for pointing this out! (The social network I am running has a friendship model like Facebook, in which in-degree = out-degree. In that case the example works, and this is what I had in mind during the video.)
      But let us move to the more exciting topic of runtime scenarios:
      What you wrote is absolutely true!
      My index is O(1) in reading and O(d) in writing.
      Whereas the baseline approach is O(1) in writing and O(d log d) in retrieval.
      When I was thinking about the social networking site I am operating and how to create a scalable news feed system, I was driven by the fact that reading operations happen far more frequently than writing operations and should be the fastest.
      On the other hand, I looked into many tech blogs and discussion groups, and everyone was talking about denormalization of MySQL tables or even a flat file approach in which you would redundantly save the personalized stream for every user. That operation also happens for every follower (meaning O(d) updates at write time), so I concluded that a runtime of O(d) while writing is acceptable: most approaches seem to work this way, and looking at my own site I see that retrieving happens far more often than writing.
      So yes, users like Lady Gaga are a “problem” for Graphity. But I think it is an acceptable one. The worst that could happen is that Lady Gaga tweets and a fan retrieves his stream while his index is not updated yet.
      Due to the top-k n-way merge that I use for retrieval, chances are pretty high that he will see the tweet of Lady Gaga anyway, and if that does not happen because she hasn't published anything for a year and sits at the end of the user's Graphity index, then he will simply have to retrieve his news feed a couple of seconds later.
      What I am saying is that end users don't notice the few seconds between the tweet and the update of their index, whereas the approach should overall take a lot of load off the servers and the database.
      I am happy for strong counter arguments on this viewpoint!
      To go a step further, let me point out some other interesting properties of Graphity:
      1.) It is dynamic: if friendships change, the retrieved news streams respect these changes, which is not that easy in a denormalized or flat file approach.
      2.) It has no redundancy. I know disk space is almost a non-issue, but I like the idea of no redundancy since it helps to keep everything consistent and contributes to the dynamic nature.
      3.) It lives in a graph database (-: This means that once I fetch my news feed I have all the nodes that appear in it, which allows me to extend my graph traversals and display some other metadata, or change the metadata I want to display later on. So again I gain a lot of flexibility.
      I hope this answers your question; if not, I am happy to discuss more!

  5. Very interesting. I am implementing some very similar things, but I often sub-divide things in various ways with in-graph indexes that are helpful in certain use cases. I also do fan-out writes for notifications, all processed asynchronously.
    I am curious how you’ve implemented the many relationship types. Can you provide more neo4j detail, perhaps a link to which post on the mailing list helped you? Did you use dynamic relationships named according to some convention with the owning entity id or something else?
    Also I have some questions on the part where you mentioned how the stream would update itself because it points to the underlying node that is the resource for the stream (ex: new status update). It’s clear to me how an update work without any further actions, but won’t deletes be expensive now? I may be following this wrong, but since you are following a “next” pointer to the next item in the stream depth first, if you delete a node in the middle of that stream, won’t you have to reconnect all of the incoming edges to the next node to avoid holes in the path for certain users?
    I think the above could possibly be mitigated by marking nodes for deletion and not actually deleting them, or doing the same but having a deletion job that realigns everything. In both cases, though you have to consider deleted nodes in your traversals and ignore them which would have a slight effect on list generation time if you had to check a status property for every node.
    Another approach would be creating proxy nodes for every status item that point to the real node, and you can always delete the real node. Then when you retrieve the items, you know the node is deleted because it won’t have an outgoing relationship of the type to the underlying nodes. Of course you still have to eventually fix the proxy nodes and this adds slightly to the read time which is not good.
    Finally, regarding Diaspora, I don’t mean any offense, but I wouldn’t bother. The project has good intentions, but I don’t see it going too far and they’ve already gotten more press and attention than deserved. I’m not too impressed with what I’ve seen so far. I think you’d do better to release your code on github or somewhere as a regular open source project on neo4j.
    Overall, happy to see someone doing similar work. It validates a lot of our design in a sense even though we differ in some slight ways. In general, our principles are the same, but we haven’t had as much time to test things (after the end of the year) due to other projects that take priority. Thanks.

    1. I cannot point you to particular posts on the mailing list; I was just thinking about how to model a graph that can efficiently handle this particular problem.
      Concerning your questions: I will release the source code and the pseudocode of the retrieval algorithms as soon as the paper is accepted at SIGMOD or some other conference.
      I used dynamic relationship types following this pattern:

      // uses org.neo4j.graphdb.Node and org.neo4j.graphdb.DynamicRelationshipType
      void createIndex(Node user) {
          String key = (String) user.getProperty("key");
          // one relationship type per user, marking that user's ego network
          DynamicRelationshipType egoType = DynamicRelationshipType.withName("ego:" + key);
      }

      As for deleting an entity: yes, that is true. I would have to connect all its predecessors to its respective successors. But again, I had a real social network in mind. How often do you delete an entity there?
      As far as I understand your suggestions to cope with the deletion problem, I am sure both ways would work, but as written above I think deletion is not a frequent operation in my use case.
      Anyway, may I ask what you are working on, since you talk about your problems and approaches?

  6. There is a typo, it’s “Sweden” not “sweeden”. Thanks for the article.

  7. […] website Metalcon), who is now researching in the context of Web technologies, has come up with a new concept how to efficiently retrieve Twitter-like newsfeeds in social networks, using graph databases such as Neo4j. If you are interested in Web technology in general or social […]

  8. Mathieu Durand says:

    Hi Rene,
    This is very interesting indeed, but something is not clear to me; I'll try to be clear explaining it 🙂
    If I take back your example A, having 3 feeds 8->4, 7->1, 5->3.
    How does having the feeds in order based on the latest items help when merging the 3 feeds to provide the top elements? The feeds are provided in order by the graphity index based on their first (latest timestamp) elements (8, 7 and 5), but after the first element, there is no assumption about the order of the next elements.
    Let's take a modified version of your example, 8->4, 7->5. First I pick 8, then I look at the second element, 4, but since the top of the next feed is bigger, I pick 7. But then I have to check both lists to determine the next element, so I fail to see the gain of having the feeds sorted by a graphity index.
    Let’s hope it’s clear enough.
    Thanks

    1. Hey Mathieu,
      thanks for your question! I use a self-programmed version of a top-k n-way merge algorithm. The pseudocode is in the paper (hopefully soon to be published) and the source code of my graphity implementation is to be published at the same time.
      I successively push pointers to the top elements of my lists in the graphity index onto a priority queue. Once an element is retrieved from the priority queue, I insert the next element from the corresponding list into the priority queue, and if necessary I also include the top element of the next list.
      In order to show everything that happens I use your example 8 -> 4, 7 -> 5 and extend it with 2 -> 1:
      push(8) (first list)
      push(7) (second list) (technically I don't push all top elements onto the priority queue, but include more and more lists as needed by keeping one pointer to the top element of the last included list – which here would be a pointer to the list with the 7)
      retrieve top ( = 8 )
      now realize that 8 belongs to 8->4, therefore:
      push(4)
      retrieve top ( = 7 )
      now realize that 7 came from the last included list, so push the top element of the next list:
      push(2) (third list)
      also realize that 7 belongs to 7->5, therefore:
      push(5)
      my priority queue now has three elements (2, 4 and 5) and it is clear which one to proceed with
      retrieve top ( = 5), which belongs to the second list!
      since inserts into a priority queue are log(k), the whole retrieval is at most k log(k) (the way I described it here, at most k+1 elements are pushed onto the queue during retrieval, but in practice I could even make sure it is at most k)
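      To make the walkthrough above a bit more concrete, here is a rough Java sketch of that loop (simplified: it uses the built-in PriorityQueue instead of the TreeMap I actually used, works on plain timestamp lists instead of the graph, and assumes every list is non-empty):

      import java.util.ArrayList;
      import java.util.List;
      import java.util.PriorityQueue;

      public class TopKMergeSketch {

          // lists: per followed user the timestamps of his content items, newest first;
          // the lists themselves arrive ordered by their first entry, newest first
          // (which is exactly the order the graphity index hands them to us)
          static List<Long> topK(List<List<Long>> lists, int k) {
              // max-heap on the timestamp; an entry is {timestamp, listIndex, positionInList}
              PriorityQueue<long[]> queue =
                  new PriorityQueue<>((x, y) -> Long.compare(y[0], x[0]));
              int included = 0; // number of lists pushed onto the queue so far
              if (!lists.isEmpty()) {
                  queue.add(new long[]{lists.get(0).get(0), 0, 0});
                  included = 1;
              }
              List<Long> feed = new ArrayList<>();
              while (feed.size() < k && !queue.isEmpty()) {
                  long[] top = queue.poll();
                  feed.add(top[0]);
                  int li = (int) top[1], pos = (int) top[2];
                  // the popped element came from the last included list, so the head of
                  // the next list could now be the most recent unseen item: include it
                  if (li == included - 1 && included < lists.size()) {
                      queue.add(new long[]{lists.get(included).get(0), included, 0});
                      included++;
                  }
                  // push the successor of the popped element from the same list
                  if (pos + 1 < lists.get(li).size()) {
                      queue.add(new long[]{lists.get(li).get(pos + 1), li, pos + 1});
                  }
              }
              return feed;
          }
      }

      For the example above, calling topK(Arrays.asList(Arrays.asList(8L, 4L), Arrays.asList(7L, 5L), Arrays.asList(2L, 1L)), 3) returns [8, 7, 5], matching the steps listed above.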
      I hope that could clarify your question?

      1. Mathieu Durand says:

        Hi Rene,
        Thanks for the answer.
        You say
        push(4)
        push(2)
        push(5)
        but the priority list would be 4,2,5 (ordered), so to pick the next element from the priority list you have to scan the entire list ? or keep the list sorted at all time … Am I missing the point ?
        And why would insert in priority list be log(k) ?

        1. It is not a priority list but a priority queue! You can find more information in its Wikipedia article: http://en.wikipedia.org/wiki/Priority_queue#Usual_implementation
          though I did not use a heap for the implementation but just used the Java built-in TreeMap.

  9. Hello Rene,
    I have a question regarding your graph model – given the relational model (Figure 2), couldn’t such a system also be optimized in a similar fashion by keeping a table of the most recent updates from a user?
    Instead of relying on joins in a user, follower, and creatednews table, maybe have a table – named StatusCache – that contains the most recent status updates for each user, and on a write from a user update this table?
    StatusCache
    userid time activity_json
    a 32443 {“action”: “liked”, “url”: “/photos/funny.jpg”, “label”: “funny photo”, “objectid”: 323}
    a 32440 {“action”: “shared”, “url”: “/photos/funny.jpg”, “label”: “funny photo”, “objectid”: 323}
    a 32330 {“action”: “is now friends with”, “url”: “/users/44”, “label”: “Ozzy Ozbourne”, “objectid”: 32342}
    b 32444 {“action”: “shared”, “url”: “/photos/sun.jpg”, “label”: “the sun”, “objectid”: 32332}
    b 32330 {“action”: “is now friends with”, “url”: “/users/32”, “label”: “Jonas Kunze”, “objectid”: 323233}
    of course this would have to be joined to a follower table, but it offers the following advantages:
    1. You can store the most recent n status updates, and to build an activity feed you simply read the most recent n status updates in a single table instead of fanning out statuses on writes – one copy is required. For example you could decide that you only want to store the 10 most recent status updates for users in this table – would that make creating a newsfeed much more efficient?
    2. You don’t have to apply a date range to the criteria, just simply pull their records and sort by time. The activity_json field contains sufficient data for displaying and linking.
    3. a single table can be used (or multiple tables if sharding) to store a subset of status updates.
    4. It would be optimized for reads – updates would also be expensive, but as you point out in general that’s an acceptable tradeoff..
    5. To generate a news feed, one only has to grab the list of their friend’s id numbers, and select from this table. This can be done as a join, or in memory and iterating through the list of friends or creating dynamic sql statements (whatever works best)..
    Though this solution still involves some level of redundancy and an issue with decay, it's far less than copying statuses for every friend one may have. A reasonable compromise given the problems of using an RDBMS?
    In your opinion, would this be similar to your Graphity model? Would it capture the “spirit” of your model for use in an RDBMS?
    Thank your for your time,
    Ruben

    1. Hello Ruben, those are very nice and relevant questions. First of all, your solution of denormalizing the relational database tables is nowadays in production in many applications and does in fact work for, let's say, medium-size social networks, but it will not scale, for several reasons.
      We have to think about the numbers we are dealing with. We have
      k = number of items to be retrieved
      d = average node degree of the network
      n = number of user nodes in the network
      and usually k << d << n. Anything that works with a big table like yours is O(log n) for a query against the status update table and O(log(n^2)) = O(log n) against the follow relationship table. Graphity instead is O(k log(k)). I think this answers the points you made in 1 to 5, except for 2: in Graphity I don't require any sorting, and a sort on the full result set or on an intermediate step might be too cost-intensive. I am not sure I get your decay argument correctly. Graphity only stores a status update once, so there is no redundancy. It might still take some time to update all the Graphity indices, but as our paper showed we were able to update more than 100 Graphity indices per second. As a result, I think RDBMSs are not well suited for the news feed problem, though your way of using an RDBMS is probably the most effective solution you can achieve in an RDBMS; at least it agrees with various discussions on Stack Overflow. I hope that was of help; if not, feel free to ask more.

      1. Rene,
        Thank you – not being a computer science nor math major, I’m not sure what this is: “k << d << n" – Could you please explain this in layman's terms?
        After doing a bit of research, I understand this is better – O(k log (k))..
        Thanks for you help!
        Ruben

      2. k << d << n means k is not only smaller than d but much smaller than d. In our case we usually want to get the 10 (= k) most recent news items from our 150 (= d) friends out of 1 million (= n) users. In each case the next number is bigger by an order of magnitude (more than a factor of 10), which is why physicists started to use the “<<” notation.

  10. If I were to use Neo4j with Graphity, I could only support 32,000 users with newsfeeds, since Neo4j currently only supports that many relationship types (ref: http://docs.neo4j.org/chunked/milestone/capabilities-capacity.html). Is my understanding correct?

    1. Yes, that is correct. I used a fork of neo4j by http://stackoverflow.com/users/211920/mattias-persson that supports many more relationship types. I have jars of this fork on my computer, but that is neo4j version 1.5M01. As far as I know, neo4j will increase the support for more relationship types in future releases.
      Mattias could probably also point you to his fork.
      Currently we are creating a Graphity server and client application that will be open source and work out of the box as a standalone news feed server.

  11. Hi Rene,
    Great work on this model, just have a question. Could you not implement the same Graphity idea on the posts themselves instead of the users? In other words, why not simply connect each post in temporal order via the same per-user relationships? For example, user Matt follows A, B, & C, with the following posts A:(20, 15) B:(19, 16) C:(18, 14). In the graphity model, you link list A->B->C , why not link list 20->:Matt_feed->19->:Matt_feed->18->etc?
    It would take the same write time, as you would still need an O(1) for each follower, but now the read requires no merge at all. Obviously you will be creating more relationships over time, but this allows for O(k) reads to any point in history for the user. Graphity on the other hand becomes more difficult to traverse through history as k grows.
    The only drawback I can see, is for adding/removing follower relationships, with regards to prior histories. If you don’t want to modify the feed prior to a relationship changes, then there is no problem.
    Do you have any thoughts on this model? It appears to have the same write times, faster read times O(k), with only a memory cost of creating more relationships, but this allows for reading to any point in history easily.
    Would love to discuss this some time.

    1. Hey Matt,
      actually what you are saying will work and is something I was thinking of in the beginning. The problem is that friendships in a social network change over time. If you add a new friend or, even worse, remove a friend, you want to take all the status updates of that friend out of your feed list. This could be a little bit annoying, since you have to go through the entire list. But if you are willing to take this into account, then you could use this method.
      Also take into consideration that O(k) and O(k log k) are very similar for small k. What you don't want is something that depends linearly or worse on the node degree d.
      Finally, if you look at neo4j, your approach will yield high node degrees on the content nodes; in my approach there is only a high node degree on users, which exists anyway due to the friendship relations.
      So anyway, to sum up: I think your approach works fine, but if you go for a simple system like that, why not create a plain text file with the feed of a user? This would use more storage space but would also scale out much better.

      1. Thanks for the reply Rene, just a couple more questions.
        It wouldn’t happen too often I don’t think, but how would graphity perform if the users wanted to go quite far back in time with their feed?
        Also, do you have any links/pointers to more information on the merge algorithm used in graphity? Is this something achievable through Cypher alone, or do you need more direct access to the DB? I ask because my system is dependent upon the REST API.

  12. Yes, an algorithm/example for news feeds using Cypher would be awesome. I know Cypher can be up to 10 times slower than using the API, but for purposes of rapid application development Cypher offers many advantages… Also, as far as I understand, the high availability option for Neo4j does not work in embedded mode – one has to use REST.
    I don’t expect a Cypher & REST based version to have the same performance , but it still should perform very well considering.
    Thanks!
    Ruben

    1. I’m using Ruby on Rails (not Jruby), REST, and the Neography GEM which simply encapsulates the REST calls. So a cypher based solution, even one consisting of separate and multiple queries, is what I am looking for…

      1. hey Ruben,
        the idea of the Graphity algorithm is really to do a good database design in combination with controlling how the traversal goes. Apart from the fact that Cypher can be 10x slower, as shown in this post: http://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/ I am not even sure whether it is possible to express the way I am collecting nodes for the news feed in Cypher. So using Cypher might give you another performance downside.
        But maybe there is another solution for you: my student Sebastian is currently implementing a standalone Graphity server which is written in Java and runs in Tomcat. It has a REST interface for creating status updates and friendships as well as for requesting streams.
        The development is almost done and can currently be followed at https://github.com/renepickhardt/Graphity-Server (look at the develop branch). The Graphity server provides news feeds following the activitystrea.ms format. So if you are willing to “install” another piece of software as a standalone news feed server, feel free to think about it.

  13. Rene,
    Thank you.. I will check out Sebastian’s server – I don’t mind having a standalone newsfeed server – I’ve been thinking along those lines as it is..
    Ruben

  14. In the video (slide 18/28 “updating graphity – insert content items (3)”), there’s something I really don’t get.
    When getting the graphity index of user A : I agree that I’ll get content items 9,8,7.
    But how do you get the following content items? I would expect content items 5, 4, 3 and 1.
    I really don’t see how you can retrieve that since there are no further connections between the entities.
    I looked at the Java source code, and the only thing I see is that when reading the graphity index: if the current user has more status updates, he is pulled back into the queue.
    But that does not mean that the next status update of this user is more recent than another one of another user in the queue, right?

  15. It is not the user that is pulled back; it is the next status update in the linked list. The key for the priority queue is the timestamp. So after the items 9, 8 and 7 are pulled from the queue, the queue looks at the items it currently holds, {1, 4, 5}, and pulls the one with the highest priority (which would be 5). After 5 is pulled, the next item from that list is pushed, which is 3, so the queue now contains the items {1, 3, 4}. As you can see, dequeuing twice results in 4 and 3, which is what you wished for…
    Did you see that Graphity is fully implemented in a standalone HTTP REST server? http://www.rene-pickhardt.de/graphity-server-for-social-activity-streams-released-gplv3/

  16. I know this is a bit late, but I’m really curious how you would handle thread safety in this scenario. If a reader is traversing the edges of the nodes – doing a merge sort as it goes – and suddenly a writer swaps the pointer for a user node that has already been traversed, there is a good chance the algorithm could walk over the node again and yield duplicates, or even yield nodes out of order.
    Am I missing something here?
    Obviously putting a global lock would prevent this, but clearly that’s going to kill the readers…

    1. I don’t that is the drawback. Yout can look at Sebastian Schlichts Work who tried a distributed version but also there concurrency was a problem.

  17. […] there is another way to model a news feed called “Graphity” by Rene Pickhardt that trades great read performance for somewhat complicated write and […]
