data – Data Science, Data Analytics and Machine Learning Consulting in Koblenz, Germany
https://www.rene-pickhardt.de
Extract knowledge from your data and be ahead of your competition

Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language https://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/ Tue, 06 Nov 2012 11:55:02 +0000

As I said yesterday, I have been busy over the last months producing content, so here you go. For related-work.net we are most likely to use neo4j as the core database. This makes sense since we are basically building some kind of a social network, and most queries that we need to answer while offering the service or during data mining have a friend-of-a-friend structure.
For some of the queries we are doing counting or aggregations, so I was wondering what the most efficient way of querying against a neo4j database is. So I did a benchmark with quite surprising results.
Just a quick remark: we used a database consisting of papers and authors extracted from arxiv.org, one of the biggest preprint sites available on the web. The data set is available for download and reproduction of the benchmark results at http://blog.related-work.net/data/
The database as a neo4j file is 2 GB (zipped); the schema looks pretty much like this:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we were trying to find co-authors, which is basically a friend-of-a-friend query following the author relationship (or a breadth-first search of depth 2).
As we know, there are basically 3 ways of communicating with the neo4j database:

Java Core API

Here you work on the node and relationship objects within Java. Formulating a query, once you have fixed an author node, looks pretty much like this:

// walk from the author to all of his papers and count every other author of those papers
for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
    Node paper = rel.getOtherNode(author);
    for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node coAuthor = coAuthorRel.getOtherNode(paper);
        if (coAuthor.getId() == author.getId()) continue; // skip the author himself
        resCnt++;
    }
}

We see that the code can easily look very confusing (if queries get more complicated). On the other hand, one can easily combine several similar traversals into one big query, making readability worse but increasing performance.

Traverser Framework

The Traverser Framework ships with the Java API and I really like the idea of it. I think it is really easy to understand the meaning of a query, and in my opinion it really helps the readability of the code.

Traversal t = new Traversal();
// breadth-first traversal over AUTHOROF relationships, stopping at depth 2 (the co-authors)
for (Path p : t.description().breadthFirst()
        .relationships(RelationshipTypes.AUTHOROF)
        .evaluator(Evaluators.atDepth(2))
        .uniqueness(Uniqueness.NONE)
        .traverse(author)) {
    Node coAuthor = p.endNode();
    resCnt++;
}

Especially if you have a lot of similar queries, or queries that are refinements of other queries, you can save them and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is the Cypher Query Language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why. It is a really beautiful language that is close to SQL (looking at Stack Overflow it is actually frightening how many people are trying to answer FOAF queries using MySQL) but still emphasizes the graph-like structure.

ExecutionEngine engine = new ExecutionEngine(graphDB);
// FOAF query: two AUTHOROF hops away from the fixed author node
String query = "START author=node(" + author.getId() +
        ") MATCH author-[:" + RelationshipTypes.AUTHOROF.name() +
        "]-()-[:" + RelationshipTypes.AUTHOROF.name() +
        "]- coAuthor RETURN coAuthor";
ExecutionResult result = engine.execute(query);
scala.collection.Iterator it = result.columnAs("coAuthor");
while (it.hasNext()) {
    Node coAuthor = (Node) it.next();
    resCnt++;
}
I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let's just point out: Cypher is quite expressive.

What were the results?

All queries have been executed 11 times, where the first run was thrown away since it warms up the neo4j caches. The values are averages over the other 10 executions.
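For illustration, a warm-up-and-average loop of this kind might look roughly like the following sketch. It is not the actual benchmark code (which is linked at the end of this post); runFoafQuery is a hypothetical stand-in for any of the three query variants above.

// hypothetical timing harness: run the query once to warm up the caches,
// then average the wall clock time over the remaining runs
static double averageQueryTimeMs(Node author, int runs) {
    long totalNs = 0;
    for (int i = 0; i < runs; i++) {
        long start = System.nanoTime();
        runFoafQuery(author);              // stand-in for one of the three query variants above
        long elapsed = System.nanoTime() - start;
        if (i == 0) continue;              // discard the first run: it only warms up the caches
        totalNs += elapsed;
    }
    return totalNs / (double) (runs - 1) / 1_000_000.0;
}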
• The Core API is able to answer about 2000 friend-of-a-friend queries per second (I have to admit, on a very sparse network).
• The Traverser Framework is about 25% slower than the Core API.
• Worst is Cypher, which is slower by at least one order of magnitude and only able to answer about 100 FOAF-like queries per second.
• I was shocked, so I talked with Andres Taylor from neo4j, who mainly works on Cypher. He asked me which neo4j version I used, and I said it was 1.7. He told me I should check out 1.9 since Cypher has become more performant. So I ran the benchmarks on neo4j 1.8 and neo4j 1.9; unfortunately Cypher became slower in newer neo4j releases.

One can see that the Core API outperforms Cypher by an order of magnitude and the Traverser Framework by about 25%. In newer neo4j versions the Core API became faster and Cypher became slower.

    Quotes from Andres Taylor:

Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on. The focus in this first phase has been to explore the language, learn about how our users use the query language, and to expand the feature set to a reasonable level.

I believe that Cypher is our future API. I know you can very easily outperform Cypher by hand-writing queries. Like every language ever created, in the beginning you can always do better than the compiler by writing by hand, but eventually the compiler catches up.

    Conclusion:

So far I have only been using the Java Core API to work with neo4j, and I will continue to do so.
If you are in a high-speed scenario (I believe every web application is one) you should really think about switching to the neo4j Java Core API for writing your queries. It might not look as nice as Cypher or the Traverser Framework, but the gain in speed pays off.
Also, I personally like the amount of control that you have when doing the traversals over the core yourself.
Additionally, I will soon post an article on why scripting languages like PHP, Python or Ruby aren't suitable for building web applications anyway. So changing to the Core API makes sense for several reasons.
The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
The detailed results can be found in this spreadsheet.

Graphity source code and wikipedia raw data is online (neo4j based social news stream framework) https://www.rene-pickhardt.de/graphity-source-code/ Mon, 09 Jul 2012 15:43:57 +0000

UPDATE: there is now the source code of an entire Graphity server application online!
8 months ago I posted the results of my research about fast retrieval of social news feeds, and in particular my graph index Graphity. The index is able to serve more than 12 thousand personalized social news streams per second in social networks with several million active users. I was able to show that the system is independent of the node degree and the network size. Therefore it scales to graphs of arbitrary size.
Today I am pleased to announce that our joint work was accepted as a full research paper at the IEEE SocialCom conference 2012. The conference will take place in early September 2012 in Amsterdam. As promised before, I will now open the source code of Graphity to the community. Its documentation could and might be improved in the future; also, I am sure that one could even use a better data structure for our implementation of the priority queue.
Still, the attention from the developer community for Graphity was quite high, so maybe the source code is of help to someone. The source code consists of the entire evaluation framework that we used for our evaluation against other baselines, which will also help anyone to reproduce our evaluation.
There are some nice things one can learn about setting up multithreading for time measurements and also about how to set up a good logging mechanism.
The code can be found at https://github.com/renepickhardt/graphity-evaluation and the main algorithm should lie in the file:
https://github.com/renepickhardt/graphity-evaluation/blob/master/src/de/metalcon/neo/evaluation/GraphityBuilder.java
Other files of high interest should be:

I did not touch it again over the last couple of months and it really has a lot of debugging comments inside. My apologies for this bad practice. I hope you can overlook this, keeping in mind that I am a mathematician and this was one of my first bigger evaluation projects. In my own interest I promise that next time I will produce code that is easier to read, understand and reuse.
Still, if you have any questions, suggestions or comments, feel free to contact me.
The raw data can be downloaded at:

The format of these files is straightforward:
de-nodeIs.txt contains an ID, then a tab, then the title of the Wikipedia article; this is only necessary if you want to display your data with titles rather than IDs.
The interesting file is de-events.log. In this file there are 4 columns:
timestamp TAB FromNodeID TAB [ToNodeID] TAB U/R/A
So every line tells exactly when an article FromNodeID changes. If only 3 columns are present and a U is written, then the article itself just changed. If links in the article changed, there is another node ID in the third column and an A or an R for add or remove respectively.
I think processing these files is rather straightforward. With this file you can completely simulate the growth of Wikipedia over time. The file is sorted by the second column. If you want to use it in our evaluation framework you should sort it by the first column. This can be done on a Unix shell in less than 10 minutes with the sort command.
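For illustration, a minimal Java reader for de-events.log could look like the following sketch (class and variable names are mine and not part of the evaluation framework):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class EventLogReader {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("de-events.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t");
                long timestamp = Long.parseLong(cols[0]);
                long fromNode = Long.parseLong(cols[1]);
                if (cols.length == 3) {
                    // timestamp TAB FromNodeID TAB U : the article itself changed
                    System.out.println(timestamp + "\tarticle " + fromNode + " was updated");
                } else {
                    // timestamp TAB FromNodeID TAB ToNodeID TAB A|R : a link was added or removed
                    long toNode = Long.parseLong(cols[2]);
                    String op = cols[3];
                    System.out.println(timestamp + "\t" + fromNode + " -> " + toNode + " (" + op + ")");
                }
            }
        }
    }
}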
Sorry, I cannot publish the paper on my blog yet since the camera-ready version has to be prepared and checked in to IEEE. But follow me on Twitter or subscribe to my newsletter so I can let you know as soon as the entire paper is available as a PDF.

Download Google n gram data set and neo4j source code for storing it https://www.rene-pickhardt.de/download-google-n-gram-data-set-and-neo4j-source-code-for-storing-it/ Sun, 27 Nov 2011 13:28:20 +0000

At the end of September I discovered an amazing data set which is provided by Google! It is called the Google n-gram data set. Even though the English Wikipedia article about n-grams needs some cleanup, it explains nicely what an n-gram is.
    http://en.wikipedia.org/wiki/N-gram
    The data set is available in several languages and I am sure it is very useful for many tasks in web retrieval, data mining, information retrieval and natural language processing.
This data set is very well described on the official Google n-gram page, which I also include as an iframe directly here on my blog.

    So let me rather talk about some possible applications with this source of pure gold:
I forwarded this data set to two high school students whom I taught last summer at the DSA. Now they are working on a project for a German student competition. They are using the n-grams and neo4j to predict sentences and help people to improve their typing.
The idea is that once a user has started to type a sentence, statistics about the n-grams can be used to predict, semantically and syntactically correctly, what the next word will be, and in this way increase the speed of typing by making suggestions to the user. This will be particularly useful on all these mobile devices where typing is really annoying.
    You can find some source code of the newer version at: https://github.com/renepickhardt/typology/tree/develop
Note that this is just a primitive algorithm to process the n-grams and store the information in a neo4j graph database. Interestingly, it can already produce decent recommendations, and it uses less storage space than the n-grams data set since the graph format is much more natural (and also due to the fact that we did not store all of the data contained in the n-grams in neo4j, e.g. n-grams of different years have been aggregated).
From what I know the roadmap is very clear now: normalize the weights, for prediction use a weighted sum of all different kinds of n-grams, and use machine learning (supervised learning) to learn those weights. As a training data set a corpus from different domains could be used (e.g. a Wikipedia corpus as a general-purpose corpus, or a corpus of a certain domain for a special purpose).
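To illustrate the idea of a weighted sum over different n-gram orders, here is a toy sketch; the count maps, weights and method names are assumptions of mine and not the students' actual code:

import java.util.HashMap;
import java.util.Map;

public class NextWordSketch {
    // counts read from the n-gram data set, e.g. bigramCounts.get("the").get("car") = 42
    static Map<String, Map<String, Long>> bigramCounts = new HashMap<>();
    // trigram counts keyed by the last two typed words, e.g. "drive the"
    static Map<String, Map<String, Long>> trigramCounts = new HashMap<>();

    // score each candidate next word by a weighted sum of its bigram and trigram counts
    static String predictNext(String w1, String w2, double lambdaBi, double lambdaTri) {
        Map<String, Long> bi = bigramCounts.getOrDefault(w2, new HashMap<>());
        Map<String, Long> tri = trigramCounts.getOrDefault(w1 + " " + w2, new HashMap<>());
        String best = null;
        double bestScore = -1;
        for (String candidate : bi.keySet()) {
            double score = lambdaBi * bi.get(candidate)
                         + lambdaTri * tri.getOrDefault(candidate, 0L);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}

In the roadmap above the raw counts would first be normalized and the weights (lambdaBi, lambdaTri, ...) learned by supervised learning instead of being set by hand.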
If you have any suggestions regarding the work the students did and their approach of using graph databases and neo4j to process and store n-grams as well as to predict sentences, feel free to join the discussion right here!

Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks https://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/ Tue, 15 Nov 2011 16:03:33 +0000

UPDATE: the paper got accepted at SocialCom 2012 and the source code and data sets are online; in particular, the source code of the Graphity server software is now online!
UPDATE II: Download the paper (11 pages from SocialCom 2012 with co-authors Thomas Gottron, Jonas Kunze, Ansgar Scherp and Steffen Staab) and the slides.
I already said that my first research results had been submitted to the SIGMOD conference, to the social networks and graph databases track. Time to sum up the results and blog about them. You can find a demo of the system here.
I created a data model to make retrieval of social news feeds in social networks very efficient. It is able to dynamically retrieve more than 10'000 temporally ordered news feeds per second in social networks with millions of users like Facebook and Twitter by using graph databases (like neo4j).
    In order to achieve this I had several points in mind:

1. I wanted to use a graph database to store the social network data as the core technology. As anyone can guess, my choice was neo4j, which turned out to be a very good idea as the technology is robust and the guys in Sweden gave me great support.
2. I wanted to make retrieval of a user's news stream depend only on the number of items that are to be displayed in the news stream. E.g. fetching a news feed should not depend on the number of nodes in the network or the number of friends a user has.
3. I wanted to provide a technology that is as fast as relational databases or flat files (due to denormalization) but still does not have redundancy and can dynamically handle changes in the underlying network.

How Graphity works is explained in my first presentation and my poster, both of which I already talked about in an older blog post. But you can also watch this presentation to get an idea of it and learn about the evaluation results:

    Presentation at FOSDEM

I gave a presentation at FOSDEM 2012 in the Graph Devroom which was videotaped. Feel free to have a look at it. You can also find the slides from that talk at: http://www.rene-pickhardt.de/wp-content/uploads/2012/11/FOSDEMGraphity.pdf

    Summary of results

In order to be among the first to receive the paper, the source code and the used data sets as soon as the paper is accepted, sign up for my newsletter or follow me on Twitter or subscribe to my RSS feed.
With my data model Graphity, built on neo4j, I am able to retrieve more than 10'000 dynamically generated news streams per second from the database. Even in big databases with several million users Graphity is able to handle more than 100 newly created content items (e.g. status updates) per second, which is still high if one considers that Twitter only had 600 tweets being created per second as of last year. This means that Graphity is almost able to handle the amount of data that Twitter needs to handle, on a single machine! Graphity creates streams dynamically, so if the friendships in the network change the users still get accurate news feeds!

    Evaluation:

Although we used some data from metalcon to test Graphity, we realized that metalcon is a rather small data set. To overcome this issue we used the German Wikipedia as a data set. We interpreted every Wikipedia article as a node in a social network, a link between articles as a follow relation, and revisions of articles as status updates. With this in mind we did the following tests.

Characteristics of the used data sets

Characteristics of the data sets in millions. A is the number of users in the graph, A_{d>10} the number of users with node degree bigger than 10, E the number of edges between users, and C the number of content items (e.g. status updates) created by the users.

    As you can see the biggest data sets have 2 million users and 38 million status updates.

    Degree distribution of our data sets

    Nothing surprising here

    STOU as a Baseline

Our baseline method STOU retrieves all the nodes from the ego network of a node and orders them by the time of their most recently created content item. Afterwards feeds are generated, as in Graphity, by using a top-k n-way merge algorithm.
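To give an idea of what such a top-k n-way merge looks like, here is a generic sketch over per-friend lists that are already sorted by timestamp (schematic only, with made-up types; it is not the GRAPHITY or STOU implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TopKMerge {
    // a cursor into one friend's list of content items, newest first; item = {timestamp, contentId}
    static class Cursor {
        final List<long[]> items;
        int pos = 0;
        Cursor(List<long[]> items) { this.items = items; }
        long[] head() { return items.get(pos); }
    }

    // merge the newest k items of all lists into one feed using a max-heap on the head timestamps
    static List<long[]> topK(List<List<long[]>> sources, int k) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> Long.compare(b.head()[0], a.head()[0]));
        for (List<long[]> s : sources) {
            if (!s.isEmpty()) heap.add(new Cursor(s));
        }
        List<long[]> feed = new ArrayList<>();
        while (feed.size() < k && !heap.isEmpty()) {
            Cursor c = heap.poll();          // cursor with the newest remaining item
            feed.add(c.head());
            c.pos++;
            if (c.pos < c.items.size()) heap.add(c);
        }
        return feed;
    }
}

This is the reason why retrieval only depends on k and the number of source lists, not on the total size of the network.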

    7.2 Retrieving News Feeds

For every snapshot of the Wikipedia data set and the Metalcon data set we retrieved the news feeds for every aggregating node in the data set. We measured the time in order to calculate the rate of retrieved news feeds per second. With bigger data sets we discovered a drop in the retrieval rate for STOU as well as for GRAPHITY. A detailed analysis revealed that this was due to the fact that more than half of the aggregating nodes have a node degree of less than 10. This becomes visible when looking at the degree distribution. Retrieval of news feeds for those aggregating nodes showed that on average the news feeds were shorter than our desired length of k = 15. Due to the fact that retrieval of smaller news feeds is significantly faster and a huge percentage of nodes have this small degree, we conducted an experiment in which we only retrieved the feeds for nodes with a degree higher than 10.

    Retrieval of news feeds for the wikipedia data set. GRAPHITY was able to retrieve 15

We see that for the smallest data set GRAPHITY retrieves news feeds as fast as STOU. Then the retrieval speed for GRAPHITY rises and afterwards stays constant – in particular independent of the size of the graph database – as expected. The retrieval speed for STOU also stays constant, which we did not expect. Therefore we conducted another evaluation to gain a deeper understanding.

    Independence of the Node Degree

    After binning articles together with the same node degree and creating bins of the same size by randomly selecting articles we retrieved the news feeds for each bin.

Rate of retrieved news streams for nodes having a fixed node degree. GRAPHITY clearly stays constant and is thus independent of the node degree.

    7.2.3 Dependency on k for news feeds retrieval

For our tests, we chose k = 15 for retrieving the news feeds. In this section, we argue for this choice of k and show the influence of selecting k on the performance of retrieving news feeds per second. On the Wikipedia 2009 snapshot, we retrieved the news feeds for all aggregating nodes with a node degree d > 10 and varied k.

Number of retrieved streams in GRAPHITY with varying k. k is the number of news items that are supposed to be fetched. We see that GRAPHITY performs particularly well for small k.

There is a clear dependency of GRAPHITY's retrieval rate on the selected k. For small k, STOU's retrieval rate is almost constant and the sorting of ego networks (which is independent of k) is the dominant factor. With bigger k, STOU's speed drops as both merging, O(k log(k)), and sorting, O(d log(d)), need to be conducted. The dashed line shows the interpolation of the measured frequency of retrieving the news feeds given the function 1/(k log(k)), while the dotted line is the interpolation based on the function 1/k. As we can see, the dotted line is a better approximation to the actually measured values. This indicates that our theoretical estimate for the retrieval complexity of k log(k) is quite high compared to the empirically measured value, which is close to k.

Index Maintenance and Updating

Here we investigate the runtime of STOU and GRAPHITY in maintaining changes of the network as follow edges are added and removed and content nodes are created. We have evaluated this for the snapshots of Wikipedia from 2004 to 2008. For Metalcon this data was not available.
For every snapshot we simulated adding / removing follow edges as well as creating content nodes. We did this in the order in which these events would occur in the Wikipedia history dump.

    handling new status updates

    Number of status updates that graphity is able to handle depending on the size of the data set

We see that the number of updates that the algorithms are able to handle drops as the data set grows. Their ratio, however, stays almost constant, between a factor of 10 and 20. As the retrieval rate of GRAPHITY for big data sets stays at 12k retrieved news feeds per second, the update rate on the biggest data set is only about 170 updated GRAPHITY indexes per second. For us this is ok since we designed Graphity with the assumption in mind that retrieving news feeds happens much more frequently than creating new status updates.

    handling changes in the social network graph

    Number of new friendship relations that graphity is able to handle per second depending on the network size

The ratio for adding follow edges is about the same as the one for adding new content nodes and updating GRAPHITY indexes. This makes perfect sense since both operations are linear in the node degree, O(d). Overall, STOU was expected to outperform GRAPHITY in this case since the complexity class of STOU for these tasks is O(1).

    Number of broken friendships that graphity is able to handle per second

As we can see from the figure, removing friendships has a ratio of about one, meaning that this task is as fast in GRAPHITY as in STOU.
This is also as expected, since the complexity class of this task is O(1) for both algorithms.

    Build the index

We have analyzed how long it takes to build the GRAPHITY and STOU index for an entire network. Both indices have been computed on a graph with existing follow relations.
To compute the GRAPHITY and STOU indices, for every aggregating node a all content nodes are inserted into the linked list C(a). Subsequently, only for the GRAPHITY index, for every aggregating node a the ego network is sorted by time in descending order. For both indices, we have measured the rates of processing the aggregating nodes per second, as shown in the following graph.

    Time to build the index with respect to network size

As one can see, the time needed for computing the indices increases over time. This can be explained by the two steps of creating the indices: For the first step, the time needed for inserting content nodes increases as the average amount of content nodes per aggregating node grows over time. For the second step, the time for sorting increases as the size of the ego networks grows and the sorting part becomes more time consuming. Overall, we can say that for the largest Wikipedia data set from 2011, a rate of indexing 433 nodes per second with GRAPHITY is still possible. Creating the GRAPHITY index for the entire Wikipedia 2011 data set can be done in 77 minutes.
Computing only the STOU index takes 42 minutes.

    Data sets:

The used data sets are available in another blog article.
In order to be among the first to receive the paper as soon as it is accepted, sign up for my newsletter or follow me on Twitter or subscribe to my RSS feed.

    Source code:

The source code of the evaluation framework can be found in my blog post about the Graphity source. There is also the source code of the Graphity server online.

    Future work & Application:

The plan was actually to use these results in metalcon. But I am currently thinking about implementing my solution for Diaspora. What do you think about that?

    Thanks to

First of all, many thanks go to my co-authors Steffen Staab, Jonas Kunze, Thomas Gottron and Ansgar Scherp. But I also want to thank Mattias Persson and Peter Neubauer from neotechnology.com and the community on the neo4j mailing list for helpful advice on their technology and for providing a neo4j fork that was able to store that many different relationship types.
Thanks to Knut Schumach for coming up with the name GRAPHITY and to Matthias Thimm for helpful discussions.

Download your open source Youtube insights statistics tool https://www.rene-pickhardt.de/youtube-statistics-and-analytics-tool/ Sun, 13 Nov 2011 11:54:45 +0000

Today I figured out that you can download a lot of statistics about your own videos from YouTube. That is actually very nice, since I was always sceptical that you lose the knowledge of who is watching your videos once you do not host them yourself. Unfortunately there are some drawbacks to these statistics:

• YouTube only lets you download statistics for a period of 30 days. So if you want to download your statistics for an entire year, you have to download 12 files.
• The statistics are only available as raw data, and you need to process them in order to obtain some useful information.

For me and In Legend these statistics are very interesting for three reasons:

1. We want to figure out from which websites people come to YouTube in order to watch our video. This is important for us so we can contact the webmasters of these sites as soon as new videos are available.
2. We want to analyze whether the user behavior is influenced by our YouTube Ballads n Bullets DVD and, if so, we want to see whether the influence is actually positive (which we of course expect).
3. We want to see from which videos our video is linked as a related video. This could enable us to have some really useful collaborations.

Since there are so many reasons to process the statistics, I wrote a little Python script to analyze the statistics and output them in a human-readable format. Because open source is really important for our society and the web, I made the decision to share this little tool with everyone on the web under a GPL license (which means you can use it for free!).
So feel free to download this little Python script and run it on your website. It is also available in the Google Code SVN repository.

    Screencast and instructions:

Since the script only displays some information, there is almost no configuration to be done. But I know from the time when my programming skills were not as good as today that it was always hard to run the source code of someone else. To make it even easier for you, I created a little screencast that explains how to download the program, how to download the statistics from YouTube, and how to run the statistics tool.

Oh, by the way, some background knowledge about YouTube Insights can be found in the official YouTube Data API documentation.
I hope you like the tool. If you find some bugs or have some suggestions, I would be more than happy to hear your thoughts!

My Blog guesses your name – Binary Search Exercise for Algorithms and data structures class https://www.rene-pickhardt.de/my-blog-guesses-your-name-binary-search-exercise-for-algorithms-and-data-structures-class/ Mon, 07 Nov 2011 11:59:05 +0000

Binary search http://en.wikipedia.org/wiki/Binary_search_algorithm is a very basic algorithm in computer science. Despite this fact it is important to understand the fundamental principle behind it. Unfortunately the algorithm is taught so early and is so simple that beginning students sometimes have a hard time understanding the abstract principle behind it. Also, many exercises are just focused on implementing the algorithm.
I tried to provide an exercise that focuses on the core principle of binary search rather than on implementation. The exercise is split into three implementation parts and a discussion part.

Exercise – Binary Search

    Your task is to write a computer program that is able to guess your name!

    Feel free to check out the following applet that enables my blog to guess your name in less than 16 steps!


    In order to achieve this task we will look at three different approaches.

    Part 1

• Download this file containing 48'000 names. It is important for me to state that this file is under the GNU public license and I just processed the original, much richer file which you can find at: http://www.heise.de/ct/ftp/07/17/182/.
• Now you can apply the binary search algorithm (an imperative implementation as well as a recursive implementation) to let the computer guess the name of the user.
• Provide a small user interface that makes it possible for the user to give feedback on whether the name would be found before or after that guess in a telephone book (a sketch of such a guessing loop follows below).
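Here is a minimal sketch of how such a guessing loop could look. It assumes the names are already loaded into a sorted array (loadSortedNames is a placeholder) and keeps the console interaction deliberately simple; it is one possible solution, not the reference implementation:

import java.util.Scanner;

public class NameGuesser {
    public static void main(String[] args) {
        String[] names = loadSortedNames(); // placeholder: read and sort the 48'000 names
        Scanner in = new Scanner(System.in);
        int lo = 0, hi = names.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            System.out.println("Is your name " + names[mid]
                    + "? (y = yes, b = before it in the telephone book, a = after it)");
            String answer = in.nextLine().trim();
            if (answer.equals("y")) {
                System.out.println("Got it!");
                return;
            }
            if (answer.equals("b")) hi = mid - 1; else lo = mid + 1;
        }
        System.out.println("Your name does not seem to be in the list.");
    }

    static String[] loadSortedNames() {
        // stub with a few sorted names so the sketch is runnable
        return new String[] { "ANNA", "BEN", "CLARA", "RENE", "ZOE" };
    }
}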

    Part 2

    It could happen though that some rare names are not included in this name list. That is why your task now is to use a different approach:

• Let the computer ask the user how many letters his name consists of.
• Create a function BinarySearch(String from, String to, length) and call it for example with the parameters ("AAAA", "ZZZZ", 4).
• Use the function StringToLong in order to map a string to a long that respects the lexicographical order and enables finding the middle value of two strings.
• Use the function LongToString and the user interface from part one in order to provide the output to the user.

// NOTE: the loop bodies were truncated in the original post; this is one possible
// reconstruction, assuming names consist only of the capital letters A-Z (base-26 encoding).
private static String LongToString(long l, int length) {
    String result = "";
    for (int i = 0; i < length; i++) {
        result = (char) ('A' + l % 26) + result;
        l /= 26;
    }
    return result;
}

private static long StringToLong(String from) {
    long result = 0;
    int length = from.length();
    for (int i = 0; i < length; i++)
        result = result * 26 + (from.charAt(i) - 'A');
    return result;
}
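Building on these two helpers, the BinarySearch(from, to, length) function asked for in part 2 could be sketched roughly as follows (askUser is a hypothetical stub for the user interface from part 1; this is a sketch, not a reference solution):

// guess a name of the given length by binary search over the range [from, to],
// using StringToLong/LongToString to compute the lexicographically "middle" string
private static void BinarySearch(String from, String to, int length) {
    long lo = StringToLong(from);
    long hi = StringToLong(to);
    while (lo <= hi) {
        long mid = (lo + hi) / 2;
        String guess = LongToString(mid, length);
        int feedback = askUser(guess); // hypothetical stub: 0 = correct, -1 = before, +1 = after
        if (feedback == 0) return;     // name found
        if (feedback < 0) hi = mid - 1; else lo = mid + 1;
    }
}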

    Part 3

Our program from part 2 is still not able to guess names with a special character set, which is important for many languages. So your task is to fix this problem by improving the approach that you have just implemented.
One way to do this is to understand that char(65)=A, char(66)=B,... and to transfer this idea to create a sorted array of the letters that are allowed to be within a name.
Now choose freely one of the following methods:
Method A
Improve LongToString and StringToLong so they use this array of allowed characters.
Method B
Start guessing the name letter by letter by using this array. You would guess every letter with the help of binary search.

    Part 4 (discussion)

• Explain shortly why approach number 2 will in general take more guesses than approach number 1.
• Explain shortly why Method A and Method B will need the same amount of guesses in the worst case.
• If you already knew some letters of the name, would this have an influence on the method you choose? Why?
Download network graph data sets from Konect – the koblenz network colection https://www.rene-pickhardt.de/download-network-graph-data-sets-from-konect-the-koblenz-network-colection/ Mon, 26 Sep 2011 14:14:13 +0000

UPDATE: now with a link to the PhD thesis. At the time of blogging the thesis was not published. Thanks to Patrick Durusau for pointing out the missing link.
One of the first things I did at my institute when starting my PhD program was reading the PhD thesis of Jérôme Kunegis. For a mathematician, a nice piece of work to read. For his thesis he analyzed the evolution of networks. Over the last years Jérôme has collected several (119!) data sets with network graphs. All have different properties.
He provides the data sets and some basic statistics at http://konect.uni-koblenz.de/networks
Sometimes edges are directed, sometimes they have timestamps, sometimes even content. Some graphs are bipartite, and the graphs come from different application domains such as trust, social networks, web graphs, co-citation, semantics, features, ratings and communications…
Jérôme has also calculated some metrics such as degree distributions and eigenvalue distributions for all these data sets and provides links to the sources. Most of the time he also states the paper for which these data sets have been created.
Overall, the Koblenz network collection is a very nice resource for everyone who wants to do a PhD and analyze graphs, do data mining, or just needs some data sets for evaluation and testing! I am excited to see how this site will evolve over time and want to thank Jérôme for making this data available!

Business Model of Metaweb with freebase https://www.rene-pickhardt.de/business-model-of-metaweb-with-freebase/ Thu, 09 Jun 2011 21:08:47 +0000

Today I want to talk about one of my favorite Internet startups. It is called Metaweb. Metaweb was founded in 2005. To my knowledge it had two rounds of investment (altogether about 60 million US dollars) and it was bought by Google in June 2010.
So why do I love Metaweb? Well, they started to work on a really important and great problem, using modern technology to solve it. To my knowledge they did not have a direct business model. They just knew that solving this problem would reward them, and that after gaining experience on the market they would find their business model.

The problem Metaweb was solving

We are lucky: modern and great companies have very informative videos about their product.

I also love the open approach they took. Freebase, the database created by Metaweb, is not protected from others. No, it is under a Creative Commons license. Everyone in the community knows that the semantic database they run is nice to have, but the key point is really the technology they built to run queries against the database and handle data reconciliation. Making it Creative Commons helps others to improve it, like Wikipedia. It also created trust and built a community around Freebase.

    Summary of the business model

• openness
• focusing on a real problem
• learning with the market

I know it sounds as if I were 5 years old, but I am fully convinced that there is an abstract concept behind the success of Metaweb, and I call it their business model. Of course I can think of many ways other than selling to Google to monetize this technology. But in the case of Metaweb this was just not necessary.

Google, Facebook & co. are not free! https://www.rene-pickhardt.de/google-facebook-co-are-not-free/ Thu, 09 Jun 2011 20:05:05 +0000

Besides e-commerce, one really big Internet business model obviously is the trade or monetization of data. This blog post reminded me to write an article about the topic.
I think the author points out the most important facts! Services like Facebook / Gmail / YouTube and so on are not free. Money-wise they are free of charge, but with your data you pay a price for using them. Luckily for Facebook / Google and co., most people don't know the value of their data, which turns Facebook, Google and co. into very successful business models generating billions of US dollars.
Since it is very easy to collect data on the web and only very few people understand the dynamics of the Web, I would guess that right now this is one of the most profitable business models on the Internet.
    Your contact data vs. data about your interests that define you
Think about it. Giving your name and address to me, even in public on the internet, doesn't allow me to do any business. While building my social network metalcon I experienced something interesting. People have been scared as hell to provide their real name while registering with metalcon. But talking to some of the users in real life, they did not realize that the data they produce while using the platform is what should rather scare them.
You don't believe it? Let us look into the data Facebook is collecting from you in order to monetize it.

What obvious facts does Facebook know about you?

• first of all, the interests from your profile
• who your friends are
• the interests of your friends

This is already very strong and, combined with artificial intelligence, will lead to amazing services like friend recommenders and highly personalized advertising systems.

    What not so obvious facts does Facebook know about you?

It is awkward that almost no one is aware of the fact that you don't have to fill out your social networking profile or have friendships to tell Facebook your interests (I don't have friends on Facebook and the friend recommender works). Facebook knows:

• Who you communicate with and how frequently.
• What you are really interested in (you constantly tell Facebook by clicking on things and visiting profiles and pages!).
• Which people you really know (being tagged together in a lot of photos is a strong indicator!).
• What websites you surf on (there are like buttons everywhere, and most of the time you are still logged in with Facebook) – it is like criterio.
• In which geographical regions you log in to Facebook.
• How many people are interested in you and whether you are an opinion leader.

The boldest part is how Facebook uses all the knowledge about users and brands to earn money by just enabling people to communicate.

Summary of the data trading business model

You run a website that people love and you collect user behavior and other data they give you implicitly. This data just needs to define them and their interests. You don't care about a name! You make it anonymous, just like Google search was for a long time. Now you use this data to create an ad system or anything similar. You can even go out and start selling products on your own!

    Jumping back in time

I still remember the first days of the internet. You registered on some site and they asked you to fill out an old-fashioned survey => very annoying, and it was not even helping to earn good money.

    • age
    • sex
    • income
    • children
    • education
    • choose 3 interests
    • what papers do you read
    • what tv do you watch

In comparison to the data Facebook and Google collect about you, that was really innocent. Comparing those modern companies with their sophisticated approaches to the pioneers in the online user data business, it is incredible to see how the .com bubble could really rise up.

Where to upload your music – The perfect Band website: part 5 https://www.rene-pickhardt.de/where-to-upload-your-music-the-perfect-band-website-part-5/ Tue, 07 Jun 2011 22:51:38 +0000

While writing my article on link baiting for musicians I realized that this topic is overdue. Besides the fact that Facebook is highly overrated, this is probably the most valuable piece of advice I have written on my blog so far.

    Quick answer: “where to upload your music”:

    • You upload all of your music to the internet.
    • It is your best asset and marketing tool. Nothing will help more to attract fans
    • But you don’t upload it everywhere. You only upload it on your website.

There might be exceptions to the rule, but the fact of the matter is that your music is the best content you have. Why would you want to put it on some other site and increase that site's value?
In fact everything I describe here works even better if you give your music away as a download.

Explanations

If there is one thing your fan is most certainly interested in from you, it is probably your music. On the other side you are – or at least should be – interested in talking to your fans. Myspace became so popular because it was the first more or less legal space where fans could listen to music. Great for the founders, but what about the musicians? There have been complaints that Myspace wasn't paying royalties for the music…
Look at the following graphic by http://www.informationisbeautiful.net/2010/how-much-do-music-artists-earn-online/ and understand the flow of money and the value chain in the music industry.

    Even though you don’ tmake money directly when streaming from your site. You give away user data something much more valueable if your music is being streamed on other sites (no matter if they pay you royalties like last.fm and spotify or not just like our good old friend Facebook!)
    You don’t belief how valueable user data are? Remember Google one of the most valuable companies in the world has business models around user data and making 30 billion $ per year of it.
    Of course having music on other sites might lead some of the customers from those sites to get to know your music but let us have a look at the numbers and understand the potentials of streaming the music yourself and making profit of the data.

Having music on your page results in huge benefits and sales opportunities

From the image we see that 1.5 million listens on last.fm are equivalent to selling and distributing your music all by yourself to 143 people. Let us see how many sales 1.5 million streams on your homepage should produce once you start a conversion with the people streaming your music.
Let us assume the player on your homepage allows a user to stream 2 or 3 songs right away but asks him to register before he streams more music. Since you also needed to be registered with last.fm, let us assume that the 1.5 million streams really came from people who registered on your site! Also assume that next to the player there are several shopping opportunities for merchandise and CDs.

50'000 fans produce 1.5 million streams

According to Lady Gaga's last.fm profile, 2,389,882 fans produced 122,908,193 streams. This means 51 streams per fan. With the In Legend Facebook app I recorded about 10 streams per registered fan. Let us take the average of 30 streams per registered fan. Under this assumption 1.5 million streams are produced by 50'000 registered fans!

50'000 fans streaming music on your page can result in sales of 28'400 $

To understand this, let us talk about conversion rates in online marketing. For selling Ballads n Bullets I have experienced conversion rates around 4%, which is not too high for online marketing but still reasonable.
4% of 50'000 fans = 2'000 CDs your fans should buy. If you earn $8 per CD because you distribute it yourself, this makes you earn 16'000 $, which is about 14 times more than your royalties at last.fm.
If you distribute via Amazon you will additionally get commissions for the other purchases of users. At this amount of purchases you should earn 7.5% commission on Amazon. From my experience, only about 1/3 of your Amazon sales come from your music. So you can add another 7.5% of 32'000 $, which is 2'400 $.
Let us assume the conversion rate on merchandise is half the one for your CD. On merch you should also have earnings of 10$ per item ==> 10'000 $ of merchandise earnings.
Result: 16'000$ (CD) + 2'400$ (Amazon commission) + 10'000$ (merchandise) = 28'400 $ in earnings!
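The same back-of-the-envelope calculation as a tiny snippet, using only the numbers assumed above:

public class RevenueEstimate {
    public static void main(String[] args) {
        int fans = 50_000;                         // registered fans behind 1.5 million streams
        double cdEarnings = fans * 0.04 * 8;       // 4% conversion, $8 per self-distributed CD
        double amazonCommission = 0.075 * 32_000;  // 7.5% commission on the other Amazon purchases
        double merchEarnings = fans * 0.02 * 10;   // half the CD conversion, $10 earnings per item
        System.out.println(cdEarnings + amazonCommission + merchEarnings); // 28400.0
    }
}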
Do you see how much potential you give to other sites if they have your music or videos? I did not even include selling tour tickets or increased benefits in search engine optimization and other options for customer relationship management!
