social news stream – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany
https://www.rene-pickhardt.de – Extract knowledge from your data and be ahead of your competition

Why Facebook likes do not really matter
Wed, 23 Jul 2014 12:16:36 +0000
https://www.rene-pickhardt.de/why-facabook-likes-do-not-really-matter/

For a long time I have been trying to educate people about social networks and marketing within social networking services. I thought I would not write another article about the topic on my blog, but I just stumbled upon a really nice video that summarizes a lot of my concerns from a different angle.
The video shows how much click fraud exists around likes on Facebook, why you should not buy any likes, and what the problems with Facebook's news stream are in general.

I really wonder at what point the industry will realize that Facebook is not a vivid ecosystem and that this thing will break down. I have predicted the downfall of Facebook quite often and it has not happened so far. I still expect it to happen, and I guess we all just have to wait a little bit longer.

Graphity Server for social activity streams released (GPLv3)
Mon, 02 Sep 2013 07:11:22 +0000
https://www.rene-pickhardt.de/graphity-server-for-social-activity-streams-released-gplv3/

It has been almost 2 years since I published my first ideas and work on graphity, which is nowadays a collection of algorithms to support efficient storage and retrieval of more than 10k social activity streams per second. You know the typical application from Twitter, Facebook and co.: retrieve the most current status updates from your circle of friends.
Today I proudly present the first version of the Graphity News Stream Server. Big thanks to Sebastian Schlicht, who worked for me implementing most of the servlet and did an amazing job! The Graphity Server is a neo4j-powered servlet with the following properties:

  • Response times for requests are usually less than 10 milliseconds (plus network I/O, e.g. TCP round trips coming from HTTP).
  • The Graphity News Stream Server is free open source software (GPLv3) and hosted in the metalcon git repository. (Please also use the bug tracker there to submit bugs and feature requests.)
  • It runs two Graphity algorithms: one is read-optimized, the other write-optimized, in case you expect your application to have more write than read requests.
  • The server comes with a REST API which makes it easy to plug the server into whatever application you have (a minimal client sketch follows after this list).
  • The server's responses follow the activitystrea.ms format, so out of the box a large number of clients are available to render the server's output.
  • The server ships together with unit tests and extensive documentation, especially of the news stream server protocol (NSSP) which specifies how to talk to the server. The server can currently handle about 100 write requests per second in medium-sized (about a million nodes) networks. I do not recommend using this server if you expect your user base to grow beyond 10 million users (though we are working on making the server scale). This is mostly due to the fact that our database right now won't really scale beyond one machine and some internal operations have to be synchronized.
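For a first impression of how a client talks to the server, here is a minimal Java sketch. Note that the request path used below is a made-up placeholder; the actual paths and parameters are defined by the news stream server protocol (NSSP) in the repository's documentation, and the response body follows the activitystrea.ms format mentioned above.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal client sketch. The path "/Graphity-Server/users/rene/stream" is a
// HYPOTHETICAL placeholder; the real paths and parameters are defined by the
// news stream server protocol (NSSP) in the repository's documentation.
public class GraphityClientSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/Graphity-Server/users/rene/stream");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            for (String line; (line = in.readLine()) != null; ) {
                body.append(line);
            }
        }
        // The response is an activitystrea.ms JSON document, so any Activity
        // Streams client library can be used to render it.
        System.out.println(body);
    }
}
```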

Koding.com is currently considering implementing Graphity-like algorithms to power their activity streams. It was Richard from their team who pointed out, in a very fruitful discussion, how to avoid the neo4j limit of 2^15 = 32768 relationship types by using an overlay network. His ideas for an overlay network have been implemented in the read-optimized graphity algorithm. Big thanks to him!
Now I am really excited to see what kind of applications you will build when using Graphity.

If you use Graphity

Please tell me if you start using Graphity; that would be awesome to know, and I will most certainly add you to a list of testimonials.
By the way, if you want to help spread the server (which is also good for you, since more developers using it means a higher chance of getting newer versions) you can vote up my answer on Stack Overflow:
http://stackoverflow.com/questions/202198/whats-the-best-manner-of-implementing-a-social-activity-stream/13171306#13171306

How to get started

It's darn simple!

  1. Clone the git repository or get hold of the source code.
  2. Switch into the repository and run sudo ./install.sh.
  3. Copy the war file to your Tomcat webapps folder (if you don't know how to set up Tomcat and Maven, which are needed, we have a detailed setup guide).
  4. You're done; more configuration details are in our README.md!
  5. Look in the newswidget folder to find a simple HTML / JavaScript client which can interact with the server.
I also created a small, simple screencast to demonstrate the setup:

Get involved

There are plenty of ways to get involved:

  • Fork the server
  • Submit a bug report
  • Fix a bug
  • Subscribe to the mailing list.

Further links:

Drug junkie steals my neo4j t-shirt out of my physical mailbox
Wed, 17 Jul 2013 18:51:46 +0000
https://www.rene-pickhardt.de/drug-junkie-steals-my-neo4j-t-shirt-out-of-my-physical-mailbox/
Me wearing my stolen neo4j shirt which I took back from the thief

At FOSDEM 2013 Peter from Neo4j asked me if I would like to get a neo4j shirt sent to my home address. Keep in mind that I had just moved back to Koblenz from China. I did not only move to Koblenz, I moved to Koblenz Lützel. I knew from my colleagues that this part of Koblenz is supposed to be like the Harlem of NYC. But I found a really nice flat where I live together with 3 other nice students. Even though I was skeptical when looking at that flat, I had to move there after I met my future roommates.
A couple of weeks after moving in I realized that I smelled pot more and more frequently from the streets, especially from people smoking it in front of my front door. I had even observed people exchanging packages in the backyard of our house. Of course I cannot say for sure, but even at the time of seeing this I was pretty confident that they were dealing drugs. Over the last couple of weeks we had several problems in our house.

  • People broke into our basement and stole some stuff.
  • Another time people broke into our basement and stored bikes there which did not belong to any of our neighbors.
  • There is a bullet hole in our front door.
  • Last but not least: I was about to leave our backyard when I saw a guy wearing my neo4j shirt.

Ok, let me elaborate on the last one:
Neo4j, the graph database, is not exactly known as the most famous fashion brand in the world. So at first I hardly recognized the shirt when I saw him wearing it. But I somehow recognized the design of the shirt and decided to turn around in order to get a second look. Who in my neighborhood would wear such a shirt, and what connection would he have to this rather new piece of technology?
When I turned around things got even stranger. I saw the back of the guy and his shirt said:
“My stream is faster than yours”, which certainly is a reference to graphity,
and also displayed the Cypher query:
“(renepickhardt) <-[:cites]- (peter)”
I was so perplexed that I didn't realize that I was alone and the guy was standing there with 2 other men. I said: “Sorry, you are wearing my shirt!” His friends came in and told me I was crazy and asked how I could come up with such an idea. I insisted that my name was written on the shirt. In particular my full name! And I knew the quote, which was exactly what Peter had planned to print on my shirt.
The guys started mocking me and telling me to f… off. But I somehow resisted and pointed out again that this was certainly my shirt. At that moment the door of the Kung Fu School opened and the coach, Mr. Lai, came out and asked whether the guys had again stolen packages from our post box. At that moment the guy with my shirt had to turn around again, so that anybody could see my name. He started telling me some weird lie about how he got this shirt as a present and just thought it looked nice, but he finally returned it to me.
Most interestingly, the police didn't care. The policeman only said: “It's your own fault when you move to a place like Koblenz Lützel.” I find this very disappointing. I always thought that our policemen should be objective and neutral. Stealing and opening other people's mail is a crime in Germany. So is possessing drugs or stealing bikes… It is sad that the police refuse to help us with our situation.
Well, anyway: if you have an important message for me, why don't you use email rather than physical mail? My email is also potentially read by third parties, but at least it is still safely delivered. Anyway, big thanks to the guys from neo4j for my new shirt (:

Metalcon finally gets a redesign – Thinking about high scalability
Mon, 17 Jun 2013 15:21:30 +0000
https://www.rene-pickhardt.de/metalcon-finally-becomes-a-redesign-thinking-about-high-scalability/

Finally metalcon.de, the social networking site which Jonas, Jens and I created in 2008, gets a redesign. Thanks to the great opportunities at the Institute for Web Science and Technologies here in Koblenz (why don't you apply for a PhD position with us?) I will have the chance to code up the new version of metalcon. Kicking off on July 15th, I will lead a team of 5 programmers for a duration of 4 months. Not only will the development be open source, but during this time I will constantly (hopefully on a daily basis) write in this blog about the design decisions we take in order to achieve a well-scaling web service.
Before I share my thoughts on highly scalable architectures for web sites, I want to give a little history and background on what metalcon is and why this redesign is so necessary:

Metalcon is a social networking site for German fans of metal music. It currently has

  • a user base of 10’000 users.
  • about 500 registered bands
  • a highly semantic and interlinked database (bands, geographical coordinates, friendships, events)
  • 624 MB of text and structured data about the mentioned topics.
  • fairly good visibility in search engines.
  • > 30k lines of code (mostly PHP)
  • a badly scaling architecture (own OR mapper, own AJAX libraries, big monolithic database design, bad usage of PHP, …)
  • no unit tests (so code maintenance is almost impossible)
  • no music and audio files
  • no processes for content moderation
  • no processes to fight spam and block users
  • really bad usability (I could write tons of posts about where the usability is lacking)
  • no clear distinction of features for users to understand

When we built metalcon no one on the team had experience with highly scalable web applications, and we were happy to get it running at all. After returning from China and starting my PhD program in 2011 I was about to shut down metalcon. Though we had become close friends, the core team had already moved on to new projects and we were lacking manpower. On the other hand everyone kept telling me that metalcon would be a great place to do research. So in 2011 Jonas and I decided to give it another shot and do an open redevelopment. We set up a wiki to document our features and the software, and we created a developer blog which we used to exchange ideas. We also created some open source projects to which we hardly contributed any code due to the lack of manpower…
Well, at that time we already knew of too many problems, so that fixing them one by one was not the way to go. At least we did learn a lot. Thinking about highly scalable architectures at that time I knew that a news feed (which the old version of metalcon already had) was at the very core of the user experience. Reading many Stack Exchange discussions I knew that you wouldn't build such a stream on MySQL. Also, playing around with graph databases like neo4j, I came to my first research paper, building graphity, a piece of software which is designed to distribute highly personalized news streams to users. Since our development was not proceeding we never deployed Graphity within metalcon. Also, building an autocomplete service for the site should not be a problem anymore.

Roadmap for the redesign

  • Over the next weeks I hope to read as many interesting articles about technologies and high scalability as I can possibly find, and I will be more than happy to get your feedback and suggestions here. I will start by reading many articles from http://highscalability.com/ – this blog is pure gold for serious web developers.
  • During a nice discussion about scalability with Heinrich we already came up with a potential architecture for metalcon. I will introduce this architecture soon, but first want to check it against the best practices in the high scalability blog.
  • In parallel I will also collect the features needed for the new metalcon version and hopefully be able to pair them with useful technologies. I have already started a wiki page about features and the planned technologies to support them.
  • I will also need to decide on the programming language and paradigms for the development. Right now I am weighing Ruby on Rails vs. GWT. We had some great experiences with the power of GWT, but one major drawback is certainly that the result is more of an application than a lightweight website.

So again, feel free to give input and share your ideas and experiences with me and with the community. I will be very grateful for every recommendation of articles, videos, books and so on.

Submission history of my first academic research paper (graphity at SocialCom 2012)
Tue, 10 Jul 2012 09:29:38 +0000
https://www.rene-pickhardt.de/submission-history-of-my-first-academic-research-paper-graphity-at-socialcom-2012/

My graph index graphity was – as mentioned in another blog post – accepted at SocialCom 2012. After explaining how it works and sharing the source code, I now want to share some information about the history of submissions, the reviews, the quality of the reviews, the actions taken and so on. So if you are a coder, hacker, geek or whatsoever, just skip this post and dig into the source code of graphity. If you are a researcher you might enjoy learning from my experiences.

2011 April: Defining the research question

Wondering about possible research topics, many people told me that running a social network like metalcon and seeing so much data would be a treasure trove of research questions. Even though at that time our goal was just to keep metalcon running in its current state, I started thinking about what research questions could be asked. For most questions we just did not collect good data. In this sense I realized that if metalcon was really to give rise to research questions, I would have to reimplement big parts of it.
Since metalcon also had the problem of slow page loading times, I decided to recode the project efficiently. I knew that running metalcon on a graph database like neo4j would be a good idea. That is why I started wondering how to implement a news feed system efficiently and in a scalable manner using a graph database.
==> A research question was formulated.

2011 mid-August: A solution to the problem was found

After 3 months of hard thinking and trying out different database designs I had a meeting with my potential supervisor Prof. Dr. Steffen Staab. He would have loved to see some preliminary results. While summing up everything for the meeting, the solution, or rather the idea of Graphity, was suddenly clear before my eyes.
In late August I talked about the ideas in the Oberseminar and presented a poster at the European Summer School in Information Retrieval. I wondered where to submit this work.
Steffen thought that the idea was pretty smart and could be published in a top journal like VLDB, which he suggested I choose. He said it would be my first scientific work, and VLDB has a very short reviewing cycle of less than one month that could provide me with fast feedback. A piece of advice which I did not understand and also did not follow. Well, sometimes you have to learn the hard way…
Being confident about the quality of the index, also due to Steffen's positive feedback and plans for VLDB, my plan was instead to submit to the WWW conference. I thought social networks would be a relevant topic for this conference. Steffen agreed but pointed out that he was a program chair of that conference, which meant that submissions from his own institute would not be the best idea.
Even though graphity was not implemented or evaluated yet and no related work had been read, we decided that the paper should be submitted to the SIGMOD conference, with a deadline only 2 months away.
By the way, having these results and starting this hard work earned me funding for 3 years starting in October 2012!

2011 November: Submission to SIGMOD

After two months of very hard work, especially together with Jonas Kunze (a physicist from metalcon from whom I really learnt a lot about setting up software and evaluation frameworks), the paper was submitted to SIGMOD. Meanwhile I tried to get an overview of the related work (which at that time was kind of the hardest part), since I really did not know where to start: my research question came from a very practical and relevant use case and had not emerged from reading many papers.
Anyway, we finished our submission just in time, together with all the experiments and a – I would still say decent – text about graphity. (You can have the final copy of the SIGMOD submission on request.)

2011 mid-November: Blog post with the best content from the paper

I talked to my now supervisor Steffen and asked him what he thought about blogging the results of graphity early, in order to get some awareness in the community. He was a little skeptical since blogs cannot really be cited and the paper was not published yet. I argued that blogs are becoming more and more important in research. I said that even if a blog post is not a publication in the scientific sense, it still is a form of publication and gains visibility. Steffen agreed to test this out, and I have to say the idea to do this was perfect. I put the core results on my blog and received very positive feedback from people working at LinkedIn, at Microsoft and from other hackers. I also realized that the problem was currently unpublished / unsolved and relevant to many people.
Interestingly, one of my co-authors tried to persuade me to publish the SIGMOD version of the paper as a technical report. I did not get the point of doing so (he said something like “then it is officially published and no one can steal the idea”… Up to today I don't get the point of doing this other than manipulating one's official publication count…).

2012 February: Reject from SIGMOD

After all the strong feedback on my blog and via mail, the reviews from the SIGMOD conference were quite disappointing. Every single reviewer highly valued the idea and the relevance of the problem. But they criticized the evaluation and the related work.

  • For the evaluation they criticized that using a memory disk was manipulative and that the distribution and scaling behaviour could not be shown by theoretical and empirical arguments about some big-O bounds.
  • As for the related work, they were missing some papers, especially from the database community, and as I said related work was definitely one of the weaknesses of the paper.
  • Other feedback was on our notation and naming for some baselines.

The most disappointing of all the feedback was the following quote:

W1: I cannot find any strong and novel technical contribution in their algorithms. Their proposed method is too simple and straightforward. It is simply an application of using doubly linked list to graph structures.

To give an answer to this at least once: yes, the algorithm is simple and only based on doubly linked lists. BUT it is in the best complexity class one can imagine for the problem, it scales perfectly, and this was proven. How can one throw away an answer because it is too simple if it is the best possible answer? I sense that this is one of the biggest differences between a mathematician and a computer scientist. In particular, the same reviewer pointed out that the problem was relevant and important.
Given this feedback we came to the conclusion that the database community might have been the wrong community for the problem. Again, we would have figured this out much faster had we submitted to VLDB as Steffen had suggested.

2012 February: Resubmission to Hypertext

The submission deadline for the Hypertext conference was only one week after we received the feedback. Since one week was not really much time, we decided not to incorporate any of the feedback and just check what another community would say. My supervisor Steffen was not really convinced by this idea, but we argued that the notification from the Hypertext conference was only one month later and this quick additional feedback might be of help.

2012 late March: Reject from Hypertext

The reviews from the Hypertext conference were rather strange: one strong accept, but the reviewer said he was not an expert in the field; one strong reject from a person who did not understand the correctness of our top-k n-way merge (which was not even core to the paper); and a borderline from one reviewer who really had some interesting comments on our terminology and the related work.
Overall the paper was accepted as a poster presentation, and we decided that this was not sufficient for our work, so we withdrew our submission.

2012 May: Resubmission to SocialCom

Discussions arose about which conference we should target next. We were sure that the style of the evaluation was nothing for the database community. On the other hand graph databases alone would not be sufficient for semantic conferences, and social is such a hype right now that we really were not sure.
Steffen would have liked to submit the paper to ISWC since this is an important community for our institute. I suggested SocialCom since the core of our paper really lay in social networking. All reviews so far had valued the problem as important and relevant for social networks, and people had basically liked the idea.
We tried to find related work for the used baselines and figured out that for our strongest baseline we could not find any related work. Three days before the submission to SocialCom, Steffen argued that we should drop the name of the paper and not call it graphity anymore. He said that we should just sell the work as two new indices for social news streams (which I thought was kind of reasonable, since the baseline to graphity really makes sense to use and was not presented in any related work). Also we could enhance the paper with some feedback from my blog and the neo4j mailing list. The only downside of this strategy was that changing the title and the story line would mean a complete rewrite of the paper, something I was not too keen on doing within 3 days. My coauthor and I were convinced that we should stick to our changes from working in the feedback and to our argument for the baseline without related work.
We talked back to Steffen. He said he left the final decision to us but would strongly recommend changing the storyline of the paper and dropping the name. Even though I was not 100% convinced, and my coauthor also did not want to rewrite the paper, I decided to trust Steffen's experience. We rewrote the story line.

2012 July: Accept at SocialCom

The paper was accepted as a full paper at SocialCom. So following Steffen's advice was a good idea. Also, not getting down over bad reviews from the wrong community and staying convinced that the work had some strong results turned out to be a good thing. I am sure that posting preliminary results on the blog was really helpful: in this way we could integrate openly accessible feedback from the community into our third submission. I am excited to see how much I will have learnt from this experience by the time of later submissions.
I will publish the paper on the blog as soon as the camera-ready version is done! Subscribe to my newsletter, RSS or Twitter if you don't want to miss the final version of the paper!

PhD proposal on distributed graph data bases
Tue, 27 Mar 2012 10:19:22 +0000
https://www.rene-pickhardt.de/phd-proposal-on-distributed-graph-data-bases/

Over the last week we had our off-campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss their research ideas with a postdoc and other PhD candidates for more than 6 hours. This led to the structure, todos and timetable of my PhD proposal. It has to be finalized over the next couple of days, but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology.

[TODO] 0. Find a template for the PhD proposal

That is straightforward. The task is just to look at other students' PhD proposals and also at some major conferences and see what kind of structure they use. A very common structure for papers is Jennifer Widom's structure for writing a good research paper. This or a similar template will help to make the proposal read well. For this blog article I will follow Jennifer Widom more or less.

1. Write an Introduction

Here I will describe the use case(s) of a distributed graph database. These could be

  • indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
  • running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
  • storing web log files and click streams of users
  • doing information retrieval (recommender systems) in the above scenarios

There could also be quite different use cases, like graphs from

  • biology
  • finance
  • regular graphs 
  • geographic maps like road and traffic networks

2. Discuss all the related work

This is done to name all the existing approaches and challenges that come with a distributed graph database. It is also important to set oneself apart from existing frameworks like graph processing. Here I will name at least the related work in the following fields:

  • graph processing (Signal Collect, Pregel,…)
  • graph theory (especially data structures and algorithms)
  • (dynamic/adaptive) graph partitioning
  • distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
  • redundancy vs fault tolerance
  • network programming (protocols, latency vs bandwidth)
  • data bases (ACID, multiple user access, …)
  • graph data base query languages (SPARQL, Gremlin, Cypher,…)
  • Social Network and graph analysis and modelling.

3. Formalize the problem of distributed graph data bases

After describing the related work and knowing the standard terminology it makes sense to really formalize the problem. Several steps have to be taken: a notation for distributed graph databases needs to be fixed. This has to respect two things:
a) the real – so far unknown – problems that will be solved during the PhD. For this reason, fixing the notation and formalizing the (unknown) problem will be kind of hard.
b) the use cases: for the web use case this will probably translate to scale-free small-world network graphs with a very small diameter. In order to respect use cases other than the web, it will probably make sense to cite different graph models from the related work, e.g. mathematical models to generate graphs with certain properties.
The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to choose the use case general enough that all special and borderline cases are included. In particular, the use case should be a real extension of graph processing, which should of course still be possible with a distributed graph database.
One very important part of the formalization will lead to a first research question:

4. Graph Query languages – Graph Algebra

I think graph databases are not really general-purpose databases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information about the local neighborhood of data points is frequently needed. They also often seem to be useful when schemaless data is processed. This leads to the question of a query language. Obviously (?) the more general the query language, the harder it is to have a very efficient solution. The model of a relational algebra was a very successful concept in relational databases. I guess a similar graph algebra is needed as a mathematical concept for distributed graph databases as a foundation of their query languages.
Note that this chapter does not have much to do with distributed graph databases in particular but with graph databases in general.
The graph algebra I have in mind so far is pretty similar to neo4j and consists of some atomic CRUD operations. Once the results are known (either as an answer from the related work or from my own research) I will be able to run my first experiments in a distributed environment.
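To illustrate what such an atomic operation set looks like in practice, here is a small sketch against a single-machine neo4j instance, written for the embedded Java API of the neo4j 2.x line (an assumption about the setup; the distributed graph algebra discussed above would have to expose essentially these create, read, update and delete primitives):

```java
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Sketch of atomic CRUD operations on a single-machine neo4j instance
// (embedded Java API of the neo4j 2.x line).
public class CrudSketch {
    public static void main(String[] args) {
        GraphDatabaseService db =
            new GraphDatabaseFactory().newEmbeddedDatabase("target/crud-sketch-db");

        try (Transaction tx = db.beginTx()) {
            // Create: two vertices and an edge between them
            Node alice = db.createNode();
            alice.setProperty("name", "alice");
            Node bob = db.createNode();
            bob.setProperty("name", "bob");
            Relationship follows =
                alice.createRelationshipTo(bob, DynamicRelationshipType.withName("FOLLOWS"));

            // Read: properties and the local neighbourhood of a vertex
            for (Relationship r : alice.getRelationships(Direction.OUTGOING)) {
                System.out.println(alice.getProperty("name") + " -> "
                    + r.getEndNode().getProperty("name"));
            }

            // Update: change a property value
            bob.setProperty("name", "bob jr.");

            // Delete: edges have to go before their vertices
            follows.delete();
            bob.delete();
            alice.delete();

            tx.success();
        }
        db.shutdown();
    }
}
```

The interesting research question is of course what happens to exactly these primitives once the vertices live on different machines, which is what the following experiments are about.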

5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

As expected, the graph algebra will consist of some atomic CRUD operations. Those operations have to be tested against all the different data structures one can think of, in the different known distributed environments, over several different real-world data sets. This task will be rather straightforward, and it will be possible to know the theoretical results of most implementations in advance. The reason for this experiment is to collect experimental experience in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting lie. Already in the evaluation of graphity I realized that there is a huge gap between theoretical predictions and real results. So I am convinced that this experiment is a good step forward, and the deep understanding gained from actually implementing all of this will hopefully lead to:

6. Development of hybrid data structures (creative input)

It would be the first time in my life that I run such an experiment without any new ideas coming up for tweaking and tuning. So I expect to have learnt a lot from the first experiment and to have some creative ideas on how to combine several data structures and distribution techniques in order to build a better (especially larger-scaling) distributed graph database technology.

7. Analysis of multiple user access and ACID

One important aspect of a distributed graph database that has not been in the focus of my research so far is the part that actually makes it a database and sets it apart from a graph processing framework. Even after finding a good data structure and distribution model, new limitations arise once multiple-user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above and more reading on ACID in distributed computing will lead to more research questions and ideas on how to test several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension of section 5.

8. Again creative input for multiple user access and ACID

After having learnt what the best data structures for basic query operations in a distributed setting are, and also what the best methods to achieve ACID are, it is time for more creative input. The goal will be to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease of achieving ACID. Once this is done, everything is straightforward again.

9. Comprehensive benchmark of my solution with existing frameworks

My own solution has to be benchmarked against all the standard technologies for distributed graph data bases and graph processing frameworks.

10. Conclusion of my PhD proposal

So the goal of my PhD is to analyse different data structures and distribution techniques for the realization of a distributed graph database. This will be done with respect to good runtimes of some basic graph queries (CRUD) respecting a standardized graph query algebra, as well as multi-user access and the paradigms of ACID.

11. Timetable and milestones

This is a rough schedule fixing some of the major milestones.

  • 2012 / 04: hand in PhD proposal
  • 2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
  • 2012 / 10: experiments of basic CRUD operations done
  • 2013 / 02: paper with results from basic CRUD operations done
  • 2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
  • 2013 / 08: at least a 3-month research internship in a company, benchmarking my system on real data
  • end of 2013: publishing the results
  • 2014: 9 months of writing my dissertation

For anyone who has input, knows of papers or can point me to similar research: I would be more than happy if you contacted me or started the discussion!
Thank you very much for reading so far!

Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks
Tue, 15 Nov 2011 16:03:33 +0000
https://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/

UPDATE: the paper was accepted at SocialCom 2012 and the source code and data sets are online; in particular, the source code of the graphity server software is now online!
UPDATE II: Download the paper (11 pages, from SocialCom 2012, with co-authors Thomas Gottron, Jonas Kunze, Ansgar Scherp and Steffen Staab) and the slides.
I already mentioned that my first research results had been submitted to the SIGMOD conference, to the social networks and graph databases track. Time to sum up the results and blog about them. You can find a demo of the system here.
I created a data model to make retrieval of social news feeds in social networks very efficient. It is able to dynamically retrieve more than 10’000 temporally ordered news feeds per second in social networks with millions of users like Facebook and Twitter, by using graph databases (like neo4j).
In order to achieve this I had several points in mind:

  1. I wanted to use a graph database to store the social network data as the core technology. As anyone can guess, my choice was neo4j, which turned out to be a very good idea as the technology is robust and the guys in Sweden gave me great support.
  2. I wanted retrieval of a user's news stream to depend only on the number of items that are to be displayed in the news stream. E.g. fetching a news feed should not depend on the number of nodes in the network or the number of friends a user has.
  3. I wanted to provide a technology that is as fast as relational databases or flat files (due to denormalization) but still does not have redundancy and can dynamically handle changes in the underlying network.

How Graphity works is explained in my first presentation and my poster, both of which I already talked about in an older blog post. But you can also watch this presentation to get an idea of it and learn about the evaluation results:

Presentation at FOSDEM

I gave a presentation at FOSDEM 2012 in the Graph Devroom which was videotaped. Feel free to have a look at it. You can also find the slides from that talk at: http://www.rene-pickhardt.de/wp-content/uploads/2012/11/FOSDEMGraphity.pdf

Summary of results

In order to be among the first to receive the paper, the source code and the used data sets as soon as the paper is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.
With my data model graphity, built on neo4j, I am able to retrieve more than 10’000 dynamically generated news streams per second from the database. Even in big databases with several million users, graphity is able to handle more than 100 newly created content items (e.g. status updates) per second, which is still high if one considers that Twitter only had 600 tweets being created per second as of last year. This means that graphity is almost able to handle the amount of data that Twitter needs to handle, on a single machine! Graphity creates streams dynamically, so if the friendships in the network change the users still get accurate news feeds!

Evaluation:

Although we used some data from metalcon to test graphity, we realized that metalcon is a rather small data set. To overcome this issue we used the German Wikipedia as a data set. We interpreted every Wikipedia article as a node in a social network, a link between articles as a follow relation, and revisions of articles as status updates. With this in mind we did the following tests.

characteristics of the used data sets

Characteristics of the data sets in millions. A is the number of users in the graph, A_{d>10} the number of users with node degree bigger than 10, E the number of edges between users, and C the number of content items (e.g. status updates) created by the users

As you can see the biggest data sets have 2 million users and 38 million status updates.

Degree distribution of our data sets

Nothing surprising here

STOU as a Baseline

Our baseline method STOU retrieves all the nodes from the ego network of a node and orders them by the time of their most recently created content item. Afterwards feeds are generated as in graphity by using a top-k n-way merge algorithm.
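For readers who prefer code to prose, the following is a small self-contained sketch (an assumed simplification in plain Java, not the evaluation code) of how such a baseline can be realized: the ego network is first ordered by each neighbour's newest content item, and the feed is then produced by a top-k n-way merge over the per-neighbour item lists.

```java
import java.util.*;

// Sketch of a STOU-style feed retrieval: sort the ego network by the newest
// item of each neighbour, then run a top-k n-way merge over the item lists.
public class StouSketch {

    /** A neighbour together with its content item timestamps, newest first. */
    static class Neighbour {
        final String name;
        final long[] items;     // timestamps, descending
        int cursor = 0;         // next item to consider in the merge
        Neighbour(String name, long[] items) { this.name = name; this.items = items; }
        long head() { return items[cursor]; }
        boolean exhausted() { return cursor >= items.length; }
    }

    /** Returns the k newest timestamps from the ego network of one user. */
    static List<Long> newsFeed(List<Neighbour> egoNetwork, int k) {
        // STOU first orders the ego network by each neighbour's newest item ...
        egoNetwork.sort(Comparator.comparingLong((Neighbour n) -> n.items[0]).reversed());

        // ... and then performs a top-k n-way merge over the per-neighbour lists.
        PriorityQueue<Neighbour> heap =
            new PriorityQueue<>(Comparator.comparingLong(Neighbour::head).reversed());
        for (Neighbour n : egoNetwork) if (!n.exhausted()) heap.add(n);

        List<Long> feed = new ArrayList<>();
        while (feed.size() < k && !heap.isEmpty()) {
            Neighbour top = heap.poll();
            feed.add(top.head());
            top.cursor++;
            if (!top.exhausted()) heap.add(top);   // re-insert with its next item
        }
        return feed;
    }

    public static void main(String[] args) {
        List<Neighbour> ego = new ArrayList<>(Arrays.asList(
            new Neighbour("anna", new long[]{90, 40, 10}),
            new Neighbour("bob",  new long[]{80, 70, 5}),
            new Neighbour("carl", new long[]{60, 50})));
        System.out.println(newsFeed(ego, 4));      // prints [90, 80, 70, 60]
    }
}
```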

7.2 Retrieving News Feeds

For every snapshot of the Wikipedia data set and the Metalcon data set we retrieved the news feeds for every aggregating node in the data set. We measured the time in order to calculate the rate of retrieved news feeds per second. With bigger data sets we discovered a drop in the retrieval rate for STOU as well as for GRAPHITY. A detailed analysis revealed that this was due to the fact that more than half of the aggregating nodes have a node degree of less than 10. This becomes visible when looking at the degree distribution. Retrieval of news feeds for those aggregating nodes showed that on average the news feeds were shorter than our desired length of k = 15. Due to the fact that retrieval of smaller news feeds is significantly faster and a huge percentage of nodes have this small degree, we conducted an experiment in which we retrieved the feeds only for nodes with a degree higher than 10.

Retrieval of news feeds for the wikipedia data set. GRAPHITY was able to retrieve 15

We see that for the smallest data set GRAPHITY retrieves news feeds as fast as STOU. Then the retrieval speed for GRAPHITY rises and, as expected, stays constant afterwards, in particular independent of the size of the graph database. The retrieval speed for STOU also stays constant, which we did not expect. Therefore we conducted another evaluation to gain a deeper understanding.

Independence of the Node Degree

After binning articles with the same node degree and creating bins of the same size by randomly selecting articles, we retrieved the news feeds for each bin.

Rate of retrieved news streams for nodes having a fixed node degree. GRAPHITY clearly stays constant and is thus independent of the node degree

7.2.3 Dependency on k for news feeds retrieval

For our tests, we chose k = 15 for retrieving the news feeds. In this section, we argue for the choice of this value for k and show the influence of selecting k on the performance of retrieving news feeds per second. On the Wikipedia 2009 snapshot, we retrieved the news feeds for all aggregating nodes with a node degree d > 10 and varied k.

Number of retrieved streams in graphity with varying k. k is the number of news items that are supposed to be fetched. We see that Graphity performs particularly well for small k.

There is a clear dependency of GRAPHITY's retrieval rate on the selected k. For small k's, STOU's retrieval rate is almost constant and sorting of ego networks (which is independent of k) is the dominant factor. With bigger k's, STOU's speed drops as both merging O(k log(k)) and sorting O(d log(d)) need to be conducted. The dashed line shows the interpolation of the measured frequency of retrieving the news feeds given the function 1/(k log(k)), while the dotted line is the interpolation based on the function 1/k. As we can see, the dotted line is a better approximation to the actually measured values. This indicates that our theoretical estimate for the retrieval complexity of k log(k) is quite high compared to the empirically measured value, which is close to k.

Index Maintaining and Updating

Here we investigate the runtime of STOU and GRAPHITY in maintaining changes of the network as follow edges are added and removed as well as content nodes are created. We have evaluated this for the snapshots of Wikipedia from 2004 to 2008. For Metalcon this data was not available.
For every snapshot we simulated adding / removing follow edges as well as creating content nodes. We did this in the order in which these events occur in the Wikipedia history dump.

Handling new status updates

Number of status updates that graphity is able to handle depending on the size of the data set

We see that the number of updates that the algorithms are able to handle drops as the data set grows. Their ratio, however, stays almost constant, between a factor of 10 and 20. As the retrieval rate of GRAPHITY for big data sets stays at 12k retrieved news feeds per second, the update rate for the biggest data set is only about 170 updated GRAPHITY indexes per second. For us this is OK, since we designed graphity with the assumption in mind that retrieving news feeds happens much more frequently than creating new status updates.

Handling changes in the social network graph

Number of new friendship relations that graphity is able to handle per second depending on the network size

The ratio for adding follow edges is about the same as the one for adding new content nodes and updating GRAPHITY indexes. This makes perfect sense since both operations are linear in the node degree, O(d). Overall, STOU was expected to outperform GRAPHITY in this case, since the complexity class of STOU for these tasks is O(1).

Number of broken friendships that graphity is able to handle per second

As we can see from the figure, removing friendships has a ratio of about one, meaning that this task is as fast in GRAPHITY as in STOU.
This is also as expected since the complexity class of this task is O(1) for both algorithms.

Build the index

We have analyzed how long it takes to build the GRAPHITY and STOU indices for an entire network. Both indices have been computed on a graph with existing follow relations.
To compute the GRAPHITY and STOU indices, for every aggregating node a all content nodes are inserted into the linked list C(a). Subsequently, and only for the GRAPHITY index, for every aggregating node a the ego network is sorted by time in descending order. For both indices, we have measured the rates of processing the aggregating nodes per second, as shown in the following graph.

Time to build the index with respect to network size

As one can see, the time needed for computing the indices increases over time. This can be explained by the two steps of creating the indices: for the first step, the time needed for inserting content nodes increases as the average number of content nodes per aggregating node grows over time. For the second step, the time for sorting increases as the sizes of the ego networks grow and the sorting part becomes more time consuming. Overall, we can say that for the largest Wikipedia data set from 2011, a rate of indexing 433 nodes per second with GRAPHITY is still possible. Creating the GRAPHITY index for the entire Wikipedia 2011 data set can be done in 77 minutes.
For computing the STOU index only, 42 minutes are needed.
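To summarize the construction procedure just described in code, here is a small sketch of the two build steps on plain in-memory maps (an assumed simplification, not the actual neo4j build code):

```java
import java.util.*;

// Sketch of the two index construction steps: step one fills the content list
// C(a) of every aggregating node a; step two, needed for GRAPHITY only, sorts
// every ego network by the time of its members' newest content item.
public class IndexBuildSketch {

    static void buildIndices(Map<String, List<Long>> contentOf,          // C(a): item timestamps of node a
                             Map<String, List<String>> egoNetworkOf) {   // followees of node a
        // Step 1 (STOU and GRAPHITY): keep every content list C(a) ordered newest first.
        for (List<Long> items : contentOf.values()) {
            items.sort(Comparator.reverseOrder());
        }
        // Step 2 (GRAPHITY only): sort every ego network by its members' newest item, descending.
        for (List<String> ego : egoNetworkOf.values()) {
            ego.sort(Comparator.comparingLong(
                    (String u) -> contentOf.getOrDefault(u, Collections.singletonList(0L)).get(0))
                .reversed());
        }
    }

    public static void main(String[] args) {
        Map<String, List<Long>> content = new HashMap<>();
        content.put("anna", new ArrayList<>(Arrays.asList(10L, 90L)));
        content.put("bob",  new ArrayList<>(Arrays.asList(40L)));
        Map<String, List<String>> ego = new HashMap<>();
        ego.put("rene", new ArrayList<>(Arrays.asList("bob", "anna")));
        buildIndices(content, ego);
        System.out.println(ego.get("rene"));   // [anna, bob]: anna's newest item (90) is newer
    }
}
```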

Data sets:

The used data sets are available in another blog article.
In order to be among the first to receive the paper as soon as it is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.

Source code:

The source code of the evaluation framework can be found in my blog post about the graphity source. The source code of the graphity server is also online.
In order to be among the first to receive the paper as soon as it is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.

Future work & Application:

The plan was actually to use these results in metalcon. But I am currently thinking about implementing my solution for Diaspora. What do you think about that?

Thanks to

First of all, many thanks go to my co-authors (Steffen Staab, Jonas Kunze, Thomas Gottron and Ansgar Scherp). But I also want to thank Mattias Persson and Peter Neubauer from neotechnology.com and the community on the neo4j mailing list for helpful advice on their technology and for providing a neo4j fork that was able to store that many different relationship types.
Thanks to Knut Schumach for coming up with the name GRAPHITY and to Matthias Thimm for helpful discussions.

11 lessons learnt after my first scientific paper was submitted
Wed, 02 Nov 2011 00:58:28 +0000
https://www.rene-pickhardt.de/11-lessons-learnt-after-my-first-scientific-paper-was-submitted/

During the last month my blog was rather quiet. I decided to aim at submitting my first paper to a top conference with a deadline of November 1st. Well, besides the fact that I almost forgot that I also have a private life (as did my colleagues helping me with the paper), there were several lessons learnt:

  1. If your advisor tells you that the deadline is too short, he is probably right! We beat the deadline but the cost of doing so was really high.
  2. Physicists rock like hell. Evaluating my algorithms I did many experiments together with Jonas Kunze, my partner at metalcon. I was totally amazed by the way he approached measuring things. I remembered my time as an undergrad standing in the lab measuring things for my physics classes. Despite the fact that I knew a useful skill was being taught to me, I hated it and decided to go for pure mathematics… Well, I have now learnt the hard way what I didn't learn as an undergraduate.
  3. Things become clearer when you really dig into them. It is amazing how all the practical runtimes of my graph database index for social news feeds – let's call it GRAPHITY – matched the theoretical runtimes. But while evaluating you see how badly experiments were designed in the planning phase, and you readjust. Even when things work out right away, several times I got a deeper understanding just by seeing and feeling it. What I want to say is: things are more complicated than you might think after 2 minutes, half a day or half a month of thinking.
  4. The whole learning experience was really nice, and it was more about techniques for scientific work than about the graph database index itself.
  5. If your advisor tells you to change notation, it is most likely true that even though he is not as deep into the topic as you, he has more experience, and changing notation is a good idea! Even though I was totally convinced that my notation was great (at least I learnt how to model things while studying mathematics), it made things more complicated. After I finally listened to my advisor things worked out like magic (at least concerning notation).
  6. People at university have a very different approach from people in consultancies. But when the deadline comes closer, both work until late at night!
  7. Freedom is perfect! Thinking of the problem and solution I did not have many conferences and current research topics in mind. I thought of practical problems for improving metalcon. While coming up with my ideas, the first criticism was that my motivation was not scientific. Well, screw that! As soon as you really work on describing the problem and doing the evaluation, you are doing the science!
  8. You can always generalize. I was pretty sure my skills in doing so were quite good. Well, now I know there is room for improvement.
  9. Structure, structure, structure. You cannot have enough structure!
  10. Making a traffic-light status overview document in Google Docs or some other collaborative system, as my friend Heinrich Harmann showed me during the “Schülerakademie” and as he learnt at McKinsey & Company, is really well-invested effort and time!
  11. neo4j is really a cool and exciting technology, and the guys in Sweden are really helpful and cool.

I guess I could bore you all as hell for the next couple of pages. I actually should even do this, because I know myself: I will never come back and write down what is in my mind right now. That is the reason why I publish this here right now at 2 o'clock in the morning!

neo4j based social news feed demo on wikipedia graph running
Wed, 21 Sep 2011 22:12:04 +0000
https://www.rene-pickhardt.de/neo4j-news-stream-demo-on-wikipedia-graph-running/

UPDATE: you can find an evaluation of the following blog post and idea at: http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/
Hey everyone, I can finally demonstrate the neo4j and GWT system that I have been blogging about over the last weeks here and here. Please find the demo at the following address:
http://gwt.metalcon.de/GWT-Modelling
The code will be available soon! But it was hacked together pretty nastily and definitely needs some cleanup! Also I think that I still have a memory leak while maintaining the index for my news feed. Hopefully I will be able to fix this while writing my paper and do more on the evaluation. Already thanks a lot to Jonas Kunze and Dr. Thomas Gottron as well as Peter from neo technologies for the great support during this project!

The graph that I used is extracted from the complete revision dump of the Bavarian Wikipedia and has the following properties:

  • 8’760 entities
  • ~95’000 Content items
  • ==> together more than 100’000 nodes
  • almost 20’000 different relationship types (more to come in bigger graphs!)
  • about 100’000 edges connecting the 8’760 entities
  • maintaining the index was possible with (only :[ ) 177 writes / second on a slow notebook computer

I interpret the Wikipedia graph as a social network in the following way:

  • every article corresponds to an entity
  • every link corresponds to a directed friendship (follower)
  • every revision corresponds either to a status update (content item) or to a change in the friendship graph

I have no idea yet how many reads I can do per second. Even though I am a little disappointed about the low write speed on the graph, I am sure that I will achieve my theoretical goals for reads per second. I also hope to increase the writing speed if I introduce better transaction management. Anyway, I will blog about the results for reads / second later.

Data structure for Social news streams on Graph data bases
Mon, 05 Sep 2011 12:36:57 +0000
https://www.rene-pickhardt.de/data-structure-for-social-news-streams-on-graph-data-bases/

UPDATE: look at http://www.rene-pickhardt.de/graphity for a more scientific survey and evaluation of this data structure.
Ok, you guys did not hear much from me recently. I was on vacation and then at a summer school, and I worked on my first scientific poster and on a talk which will hopefully contribute to my PhD thesis. Well, at least I can now share some resources, which include my poster and the slides from my talk. But let me first show you two pictures.
The standard graph of a social network. You see several people and attached to them content items identified by numbers which are supposed to be time stamps

My model of a social network graph. The ego network changed from a star topology to a list topology and each ego network has a certain edge type which is modeled by edge color here. This graph stores exactly the same information as the standard model but makes retrieval of news streams much faster

Poster

Feel free to download and look at my first poster with the title: A model for social news streams and time indices on graph data bases.
You will probably not get much out of it without the slides from my talk, so let me explain some things. I was looking into data structures to model social news streams. Basically there is the approach of normalized or denormalized relational databases, which I call the Twitter approach for the reason that Twitter is doing something similar with FlockDB.
I also looked into the case of saving the news stream as a flat file for every user, in which the events from his friends are stored. For some reason I thought I had picked up somewhere that Facebook is running on such a system, but right now I can't find the source anymore. If you can, please tell me! Anyway, while studying these different approaches I realized that the flat file approach, even though it seems primitive, makes perfect sense: it scales to infinity and is very fast for reading! Even though I can't find the source anymore I will still call this approach the Facebook approach.
I was now wondering how you would store a social news stream in a graph database like neo4j in a way that gives you some nice properties. More specifically I wanted to combine the advantages of both the Facebook and the Twitter approach and try to get rid of the downsides. And guess what! To me this actually seems possible on graph databases. The key idea is to store the social network and the content items created by the users not only in a star topology but also in a list topology ordered by the time of occurring events. The crucial part is to maintain this topology, which is actually possible in O(1) while updates occur to the graph.
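To make the list-topology idea more concrete, here is a minimal in-memory sketch (plain Java, deliberately not the neo4j implementation and not the exact data model from the poster): every ego keeps its followees in a doubly linked list ordered by the time of their latest content item, and when a followee posts something it is spliced to the head of that list, which is an O(1) operation per follower.

```java
import java.util.*;

// In-memory sketch of the list topology: each ego network is a doubly linked
// list of followees, ordered so that the most recently posting followee is first.
public class EgoListSketch {

    static class ListNode {
        final String user;
        ListNode prev, next;
        ListNode(String user) { this.user = user; }
    }

    /** One ego network stored as a doubly linked list, newest-posting followee first. */
    static class EgoList {
        private ListNode head;
        private final Map<String, ListNode> index = new HashMap<>();

        void follow(String user) {                 // new followees start at the head
            ListNode n = new ListNode(user);
            index.put(user, n);
            pushFront(n);
        }

        /** Called when `user` created a new content item: splice it to the front, O(1). */
        void bumpToFront(String user) {
            ListNode n = index.get(user);
            if (n == null || n == head) return;
            // unlink the node ...
            if (n.prev != null) n.prev.next = n.next;
            if (n.next != null) n.next.prev = n.prev;
            n.prev = n.next = null;
            // ... and relink it at the head
            pushFront(n);
        }

        private void pushFront(ListNode n) {
            n.next = head;
            if (head != null) head.prev = n;
            head = n;
        }

        List<String> order() {
            List<String> out = new ArrayList<>();
            for (ListNode n = head; n != null; n = n.next) out.add(n.user);
            return out;
        }
    }

    public static void main(String[] args) {
        EgoList rene = new EgoList();
        rene.follow("anna"); rene.follow("bob"); rene.follow("carl");
        rene.bumpToFront("anna");                  // anna posts a status update
        System.out.println(rene.order());          // prints [anna, carl, bob]
    }
}
```

In the graph database the same splice is expressed by rewiring a constant number of relationships per follower, which is where the per-ego edge types from the picture above come in.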

Talk

As mentioned, together with this poster I gave a talk on social news streams and time indices on social network graphs. Please feel free to download the slides. Unfortunately I improved the example while making the poster, so some pictures are not consistent with those from the poster! If I find the time I will not only update the slides but also give the talk as a video lecture on YouTube! I think this will be helpful to spread the idea!

Future Work

  1. I need to publish all these results at a good conference or in a journal.
  2. Relevance filtering and recommendations, which is the problem I am really interested in.
  3. Implementing this stuff for my social network (see blog post)

Open Questions

  1. Is it possible in neo4j to specify edge types (relationship types) at runtime rather than at compile time?
  2. If so: Is the time of accessing them O(1) with respect to the node degree?
  3. If not: is there a graph database that is capable of doing this?

Discussion

Anyway, it is great to see how much more insight you get when you think about a problem in a scientific way instead of only implementing it right away!
