PhD proposal on distributed graph data bases
https://www.rene-pickhardt.de/phd-proposal-on-distributed-graph-data-bases/ (Tue, 27 Mar 2012)

Over the last week we had our off-campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who got to discuss their research ideas with a postdoc and other PhD candidates for more than six hours. This led to the structure, todos and timetable of my PhD proposal. It still has to be finalized over the next couple of days, but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology.

[TODO] 0. Find a template for the PhD proposal

That is straightforward. The task is just to look at other students' PhD proposals and at some major conferences and see what kind of structure they use. A very common structure for papers is Jennifer Widom’s structure for writing a good research paper. This or a similar template will help to make the proposal well readable. For this blog article I will follow Jennifer Widom more or less.

1. Write an Introduction

Here I will describe the use case(s) of a distributed graph data base. These could be

  • indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
  • running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
  • storing web log files and click streams of users
  • doing information retrieval (recommender systems) in the above scenarios

There could also be very different use cases, like graphs from

  • biology
  • finance
  • regular graphs 
  • geographic maps like road and traffic networks

2. Discuss all the related work

This is done to name all the existing approaches and challenges that come with a distributed graph data base. It is also important to set oneself apart from existing frameworks like graph processing. Here I will name at least the related work in the following fields:

  • graph processing (Signal Collect, Pregel,…)
  • graph theory (especially data structures and algorithms)
  • (dynamic/adaptive) graph partitioning
  • distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
  • redundancy vs fault tolerance
  • network programming (protocols, latency vs bandwidth)
  • data bases (ACID, multiple user access, …)
  • graph data base query languages (SPARQL, Gremlin, Cypher,…)
  • Social Network and graph analysis and modelling.

3. Formalize the problem of distributed graph data bases

After describing the related work and knowing the standard terminology it makes sense to really formalize the problem. Several steps have to be taken: a notation for distributed graph data bases needs to be fixed. This has to respect two things:
a) the real – so far unknown – problems that will be solved during the PhD. In this way fixing the notation and formalizing the (unknown) problem will be kind of hard.
b) the use cases: for the web use case this will probably translate to scale-free small-world network graphs with a very small diameter. In order to respect other use cases than the web it will probably make sense to cite different graph models from the related work, e.g. mathematical models to generate graphs with certain properties.
The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to choose the use case so general that all special cases and borderline cases are included. In particular the use case should be a real extension to graph processing, which should of course still be possible with a distributed graph data base.
One very important part of the formalization will lead to a first research question:

4. Graph Query languages – Graph Algebra

I think graph data bases are not really general-purpose data bases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information from the local neighborhood of data points is frequently needed. They also often seem to be useful when schemaless data is processed. This leads to the question of a query language. Obviously (?) the more general the query language, the harder it is to have a very efficient solution. The model of a relational algebra was a very successful concept in relational data bases. I guess a similar graph algebra is needed as a mathematical concept for distributed graph data bases as a foundation of their query languages.
Note that this chapter is not so much about distributed graph data bases but about graph data bases in general.
The graph algebra I have in mind so far is pretty similar to neo4j and consists of some atomic CRUD operations. Once the results are known (either as an answer from the related work or from my own research) I will be able to run my first experiments in a distributed environment.
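To make the idea of atomic CRUD operations a bit more concrete, here is a minimal single-machine sketch of what such an interface could look like. The class and method names are purely illustrative assumptions of mine, not the algebra that will actually be defined in the proposal:

```python
class GraphStore:
    """Toy in-memory graph store exposing atomic CRUD operations."""

    def __init__(self):
        self.nodes = {}       # node_id -> property dict
        self.adjacency = {}   # node_id -> set of neighbor ids

    # Create
    def create_node(self, node_id, **properties):
        self.nodes[node_id] = dict(properties)
        self.adjacency.setdefault(node_id, set())

    def create_edge(self, source, target):
        self.adjacency[source].add(target)

    # Read -- local neighborhood access is what graph stores
    # are supposed to be particularly good at
    def read_node(self, node_id):
        return self.nodes[node_id]

    def read_neighbors(self, node_id):
        return self.adjacency[node_id]

    # Update
    def update_node(self, node_id, **properties):
        self.nodes[node_id].update(properties)

    # Delete
    def delete_edge(self, source, target):
        self.adjacency[source].discard(target)

    def delete_node(self, node_id):
        self.nodes.pop(node_id, None)
        self.adjacency.pop(node_id, None)
        for neighbors in self.adjacency.values():
            neighbors.discard(node_id)
```

A distributed version of such an algebra would keep the same operations but hide the partitioning and network communication behind them.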

5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

As explained, the graph algebra will consist of some atomic CRUD operations. Those operations have to be tested against all the different data structures one can think of, in the different known distributed environments, over several different real-world data sets. This task will be rather straightforward. It will be possible to know the theoretical results of most implementations. The reason for this experiment is to collect experimental experience in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting are. Already in the evaluation of Graphity I realized that there is a huge gap between theoretical predictions and the real results. In this way I am convinced that this experiment is a good step forward, and the deep understanding gained from actually implementing all this will hopefully lead to:

6. Development of hybrid data structures (creative input)

It would be the first time in my life that I run such an experiment without any new ideas coming up for tweaking and tuning. So I am expecting to have learnt a lot from the first experiment and to have some creative ideas on how to combine several data structures and distribution techniques in order to build a better (especially bigger-scaling) distributed graph data base technology.

7. Analysis of multiple user access and ACID

One important aspect of a distributed graph data base that was not in the focus of my research so far is the part that actually makes it a data base and sets it apart from a graph processing framework. Even after finding a good data structure and distribution model there are new limitations coming in once multiple user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above and more reading on ACID in distributed computing will lead to more research questions and ideas on how to test several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension of section 5.
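One classic example of such a standard strategy for atomic commits across partitions is two-phase commit. The sketch below is my own toy illustration of that protocol (the shard names and the example operation are made up), not a claim about which strategy the thesis will end up using:

```python
import threading

class Participant:
    """One partition of a distributed graph store taking part in a commit."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.staged = None

    def prepare(self, operation):
        """Phase 1: vote on whether the operation can be committed locally."""
        if not self.lock.acquire(blocking=False):
            return False          # another transaction holds this partition
        self.staged = operation
        return True

    def commit(self):
        """Phase 2a: make the staged operation durable (here: just print it)."""
        print(f"{self.name}: committed {self.staged}")
        self.staged = None
        self.lock.release()

    def abort(self):
        """Phase 2b: throw the staged operation away."""
        self.staged = None
        self.lock.release()

def two_phase_commit(participants, operation):
    """Coordinator: commit only if every participant voted yes."""
    votes = [p.prepare(operation) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p, vote in zip(participants, votes):
        if vote:          # only roll back the shards that had prepared
            p.abort()
    return False

shards = [Participant("shard-0"), Participant("shard-1")]
two_phase_commit(shards, ("create_edge", "alice", "bob"))
```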

8. Again creative input for multiple user access and ACID

After having learnt what the best data structures for basic query operations in a distributed setting are and also what the best methods to achieve ACID are, it is time for more creative input. The goal will be to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease of achieving ACID. Once this is done everything is straightforward again.

9. Comprehensive benchmark of my solution with existing frameworks

My own solution has to be benchmarked against all the standard technologies for distributed graph data bases and graph processing frameworks.

10. Conclusion of my PhD proposal

So the goal of my PhD is to analyse different data structures and distribution techniques for the realization of a distributed graph data base. This will be done with respect to a good runtime of some basic graph queries (CRUD) respecting a standardized graph query algebra as well as multi-user access and the paradigms of ACID.

11. Timetable and milestones

This is a rough schedule fixing some of the major milestones.

  • 2012 / 04: hand in PhD proposal
  • 2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
  • 2012 / 10: experiments of basic CRUD operations done
  • 2013 / 02: paper with results from basic CRUD operations done
  • 2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
  • 2013 / 08: min. 3-month research internship in a company, benchmarking my system on real data
  • end of 2013: publishing the results
  • 2014: 9 months of writing my dissertation

For anyone who has input, knows of papers or can point me to similar research I am more than happy if you could contact me or start the discussion!
Thank you very much for reading so far!

Question by Filip Stilin (House on Mars): What do you think of Bandcamp?
https://www.rene-pickhardt.de/question-by-filip-stilin-house-on-mars-what-do-you-think-of-bandcamp/ (Sat, 21 Jan 2012)

Filip Stilin is the frontman of House on Mars, a promising young Croatian band (check out their music on Bandcamp). He loves music and online marketing, so he read my blog and sent me an email with a couple of interesting observations and questions. I got his permission to publish parts of his mail and answer the questions to a wider audience on my blog.

Filip: Even though I wasn’t agreeing with your Facebook skepticism in the beginning, I realized that I was overestimating Facebook in its promotion role. I’ve been creating extremely successful, targeted (extremely low budget though – just 5 or 10 euros at a time) campaigns for my band on Facebook.

Results?

Even though I managed to inflate the number of fans (with 12 fans / 1 euro on average), the interaction stayed the same. These campaigns aren’t entirely useless, though – having this number of fans or more looks nice in the smaller Croatian market and can help in booking bigger shows (thus getting to more fans) … but it is still disappointing. I won’t campaign until we release a website/album/have something to sell.

Rene: I like your observations. The first was that gaining these paid likes does not really increase interaction or your reach. As I am saying to the In Legend guys all the time: “Money invested in Facebook or even effort invested in Facebook reach is not the best way to increase one’s reach.” I am very glad that you came to the same conclusion and shared your insights!
Secondly, I partially agree with the effect of large fan numbers when booking gigs. It certainly looks good to business partners like bookers, labels, distributors,… if your social media numbers burst. But again I would say the price is too high. At 1 euro / 12 fans you would need to invest 1,000 euros for 12,000 fans, and 12,000 isn’t even sky-high (well, I don’t know about Croatian standards). But as we know, only a very small fraction of these 12,000 fans would actually become real fans and start interacting with you. All this for getting a gig! I guess this money could be much better invested in a high-quality video, which especially for a young band is a very good investment. With the video in combination with smart music downloads you will be able to increase your reach. Maybe not to 12,000 fans, but still to a solid number of real fans that actually come to your concerts because they really care! In this way your social media fan count (especially Facebook) will also grow.

Filip: Question 1 – sharing music!
About the thesis of providing music only on the band site – I think it’s hard for someone to become our fan if there is no music on Facebook. Choosing one song for preview and directing a fan to the .com might work, but they are attracting entirely different audiences – think poppy indie rock vs. an oriental, modern metal ballad. Is it okay to let this promo run free and spread like wildfire until the album release? Or should I provide 2 of 3 songs for free, and ask for a mail address for the third one? This might be a good model.

Rene: I agree with what you say. At the time of writing the blog post you are referring to I wasn’t aware of the existing Facebook music apps. The important thing is getting a sustainable contact to the person interested in your music. This is ultimately achieved by getting their email address. But your question is very important. Of course you have to give people a bait. This could be

  1. snippets
  2. entire song(s) for streaming
  3. a music video (In Legend offers downloads under every music video)
  4. a free download (without registration)

and I really don’t know where to set the border.
In the early days of In Legend we had 3 songs for streaming on MySpace and 4 songs on the EP for download in exchange for an email address. That turned out to be a good solution. People who liked the first songs were curious to download them together with one additional song. So I guess a 2/3 split would work as well. I will just warn you: asking people for their mail address scares 4 out of 5 people away. But hey, at least you get the addresses of the fans that are really willing to give something for the music!
Now about the place where to make the connection. Whether you manage to get the fan’s mail address via a smart Facebook music player or via a download on your homepage, I don’t care. Once people like your music (and chances are higher once you can talk to them frequently) they will also turn into Facebook fans. So I recommend switching from rootmusic, which you are using right now on your Facebook profile, to bandRX or Songpier, since both services allow you to give access to your music in return for mail addresses. Songpier is a very new service, but they also offer a cool mobile app (right now without collecting mail addresses).
Here is a video about bandRX

Filip: Question 2 – What do you think of Bandcamp?
I personally think it’s a great platform for selling music, and there is an option for collecting mail addresses. The downside is that there is no valuable content that can be published, like blogs. It would be perfect if it was just a music-streaming and checkout widget on my site.

Rene: One of my favourite (but retired) bands, Jester’s Funeral, have just published all their songs to Bandcamp and have linked from Bandcamp to their homepage and from the homepage to Bandcamp. It is not quite the widget you are asking for, but I guess this stays an option, especially if you don’t have the technical know-how to program a homepage that enables you to offer your music as a download in exchange for mail addresses.
There is only one thing that bothers me about Bandcamp. They only let 200 fans per month download your music for free (email exchange) and offer a pay-as-much-as-you-want option. From the money raised they keep 15% as a service charge. If you want more free downloads you have to buy them or have people pay for your music.
To some extent they offer a fair deal. It is a good service for a reasonable price. I as a programmer would just do it on my own, have full control of my data and keep the 15%, but I am pretty convinced that Bandcamp should be a pretty good option for many musicians.
I hope I could answer your questions to your satisfaction! Sorry that there isn’t always a clear black or white, right or wrong. Things are complex on the web, but from reading your mail I am very convinced that you are asking the right questions, which means that at least in online marketing you are far ahead of 95% of all musicians!

Wikipedia to Blackout for 24 hours to fight SOPA and PIPA – Copy of the user discussion and poll on my blog
https://www.rene-pickhardt.de/wikipedia-to-blackout-for-24-hours-to-fight-sopa-and-pipa-copy-of-the-user-discussion-and-poll-on-my-blog/ (Tue, 17 Jan 2012)

I am one of the web pioneers, but this is about the most amazing thing that I will witness on the web for as long as I can remember. Tomorrow, on January 18th, the English version of Wikipedia will shut down for 24 hours to protest two upcoming (?) American laws (SOPA and PIPA) that set the legal foundations to censor the web. This is happening in the country that is so proud of its freedom of speech.

This is such an important move for democracy that I stood still for a couple of minutes after I heard of it! 1,800 active Wikipedia authors, moderators and administrators collectively agreed to make this move in order to protest! I am very excited to see where this will be going and what impact it has. Freedom of the Internet is what makes it such a beautiful space. Everyone, spread the word! Discuss this! Don’t let anyone take the freedom of speech and information sharing from you!
Since the user discussion and poll won’t be available tomorrow I attached them to my blog post.
http://www.rene-pickhardt.de/wp-content/uploads/2012/01/Wikipedia-SOPA-initiative-Action-Wikipedia-the-free-encyclopedia.html
I will not comment on this any further. Please, everyone: have your own opinion and act with responsibility.

Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks
https://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/ (Tue, 15 Nov 2011)

UPDATE: The paper got accepted at SocialCom 2012, and the source code and data sets are online; in particular, the source code of the Graphity server software is now online!
UPDATE II: Download the paper (11 pages, from SocialCom 2012, with co-authors Thomas Gottron, Jonas Kunze, Ansgar Scherp and Steffen Staab) and the slides.
I already said that my first research results have been submitted to the SIGMOD conference, to the social networks and graph databases track. Time to sum up the results and blog about them. You can find a demo of the system here.
I created a data model to make retrieval of social news feeds in social networks very efficient. It is able to dynamically retrieve more than 10,000 temporally ordered news feeds per second in social networks with millions of users, like Facebook and Twitter, by using graph data bases (like neo4j).
In order to achieve this I had several points in mind:

  1. I wanted to use a graph data base to store the social network data as the core technology. As anyone can guess my choice was neo4j, which turned out to be a very good idea as the technology is robust and the guys in Sweden gave me great support.
  2. I wanted to make retrieval of a user’s news stream depend only on the number of items that are to be displayed in the news stream. E.g. fetching a news feed should not depend on the number of nodes in the network or the number of friends a user has.
  3. I wanted to provide a technology that is as fast as relational data bases or flat files (due to denormalization) but still does not have redundancy and can dynamically handle changes in the underlying network.

How Graphity works is explained in my first presentation and my poster, both of which I already talked about in an older blog post. But you can also watch this presentation to get an idea of it and learn about the evaluation results:

Presentation at FOSDEM

I gave a presentation at FOSDEM 2012 in the Graph Devroom which was videotaped. Feel free to have a look at it. You can also find the slides from that talk at: http://www.rene-pickhardt.de/wp-content/uploads/2012/11/FOSDEMGraphity.pdf

Summary of results

In order to be among the first to receive the paper, the source code and the used data sets as soon as the paper is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.
With my data model Graphity, built on neo4j, I am able to retrieve more than 10,000 dynamically generated news streams per second from the data base. Even in big data bases with several million users Graphity is able to handle more than 100 newly created content items (e.g. status updates) per second, which is still high if one considers that Twitter only had 600 tweets being created per second as of last year. This means that Graphity is almost able to handle the amount of data that Twitter needs to handle, on a single machine! Graphity creates streams dynamically, so if the friendships in the network change the users still get accurate news feeds!

Evaluation:

Although we used some data from metalcon to test Graphity we realized that metalcon is a rather small data set. To overcome this issue we used the German Wikipedia as a data set. We interpreted every Wikipedia article as a node in a social network, a link between articles as a follow relation and revisions of articles as status updates. With this in mind we did the following tests.

Characteristics of the used data sets

Characteristics of the data sets in millions. A is the number of users in the graph, A_{d>10} the number of users with node degree greater than 10, E the number of edges between users and C the number of content items (e.g. status updates) created by the users.

As you can see the biggest data sets have 2 million users and 38 million status updates.

Degree distribution of our data sets

Nothing surprising here

STOU as a Baseline

Our baseline method STOU retrieves all the nodes from the ego network of a node and orders them by the time of their most recently created content item. Afterwards feeds are generated, as in Graphity, by using a top-k n-way merge algorithm.
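To illustrate the top-k n-way merge step, here is a minimal sketch of the idea (my own simplification for this blog post, not the code used in the evaluation): every followed node contributes its content items already sorted by time, newest first, and the k globally newest items are merged out of those lists with a heap.

```python
import heapq

def top_k_merge(streams, k=15):
    """Merge per-friend content streams into one top-k news feed.

    `streams` is a list of lists; each inner list holds (timestamp, item)
    pairs already sorted by timestamp in descending order (newest first).
    """
    heap = []  # max-heap simulated by negating the timestamps
    for stream_id, stream in enumerate(streams):
        if stream:
            timestamp, item = stream[0]
            heapq.heappush(heap, (-timestamp, stream_id, 0, item))

    feed = []
    while heap and len(feed) < k:
        neg_ts, stream_id, index, item = heapq.heappop(heap)
        feed.append(item)
        # advance within the stream the item came from
        next_index = index + 1
        if next_index < len(streams[stream_id]):
            timestamp, next_item = streams[stream_id][next_index]
            heapq.heappush(heap, (-timestamp, stream_id, next_index, next_item))
    return feed

# toy usage with two followed nodes
alice = [(50, "a3"), (20, "a2"), (10, "a1")]
bob = [(40, "b2"), (30, "b1")]
print(top_k_merge([alice, bob], k=3))  # ['a3', 'b2', 'b1']
```

The merge only ever touches the first few items of each stream, which is why retrieval can be made independent of the total number of content items a friend has produced.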

7.2 Retrieving News Feeds

For every snapshot of the Wikipedia data set and the Metalcon data set we retrieved the news feeds for every aggregating node in the data set. We measured the time in order to calculate the rate of retrieved news feeds per second. With bigger data sets we discovered a drop in retrieval rate for STOU as well as GRAPHITY. A detailed analysis revealed that this was due to the fact that more than half of the aggregating nodes have a node degree of less than 10. This becomes visible when looking at the degree distribution. Retrieval of news feeds for those aggregating nodes showed that on average the news feeds were shorter than our desired length of k = 15. Due to the fact that retrieval of smaller news feeds is significantly faster and that a huge percentage of nodes have this small degree, we conducted an experiment in which we just retrieved the feeds for nodes with a degree higher than 10.

Retrieval of news feeds for the wikipedia data set. GRAPHITY was able to retrieve 15

We see that for the smallest data set GRAPHITY retrieves news feeds as fast as STOU. Then the retrieval speed for GRAPHITY rises and afterwards stays constant – in particular independent of the size of the graph data base – as expected. The retrieval speed for STOU also stays constant, which we did not expect. Therefore we conducted another evaluation to gain a deeper understanding.

Independence of the Node Degree

After binning articles together with the same node degree and creating bins of the same size by randomly selecting articles we retrieved the news feeds for each bin.

Rate of retrieved news streams for nodes having a fixed node degree. GRAPHITY clearly stays constant and is thus independent of the node degree.

7.2.3 Dependency on k for news feeds retrieval

For our tests, we chose k = 15 for retrieving the news feeds. In this section, we argue for this choice of k and show the influence of selecting k on the performance of retrieving the news feeds per second. On the Wikipedia 2009 snapshot, we have retrieved the news feeds for all aggregating nodes with a node degree d > 10 and varied k.

Rate of retrieved streams in GRAPHITY with varying k. k is the number of news items that are supposed to be fetched. We see that GRAPHITY performs particularly well for small k.

There is a clear dependency of GRAPHITY’s retrieval rate on the selected k. For small k, STOU’s retrieval rate is almost constant and sorting of ego networks (which is independent of k) is the dominant factor. With bigger k, STOU’s speed drops as both merging, O(k log(k)), and sorting, O(d log(d)), need to be conducted. The dashed line shows the interpolation of the measured frequency of retrieving the news feeds given the function 1/(k log(k)), while the dotted line is the interpolation based on the function 1/k. As we can see, the dotted line is a better approximation to the actually measured values. This indicates that our theoretical estimate for the retrieval complexity of k log(k) is quite high compared to the empirically measured value, which is close to k.

Index Maintaining and Updating

Here we investigate the runtime of STOU and GRAPHITY in maintaining changes of the network as follow edges are added and removed as well as content nodes are created. We have evaluated this for the snapshots of Wikipedia from 2004 to 2008. For Metalcon this data was not available.
For every snapshot we simulated adding / removing follow edges as well as creating content nodes. We did this in the order in which these events would occur in the Wikipedia history dump.

handling new status updates

Number of status updates that graphity is able to handle depending on the size of the data set

We see that the number of updates that the algorithms are able to handle drops as the data set grows. Their ratio, however, stays almost constant, between a factor of 10 and 20. As the retrieval rate of GRAPHITY for big data sets stays at 12k retrieved news feeds per second, the update rate on the biggest data set is only about 170 updated GRAPHITY indexes per second. For us this is ok since we designed Graphity with the assumption in mind that retrieving news feeds happens much more frequently than creating new status updates.

handling changes in the social network graph

Number of new friendship relations that graphity is able to handle per second depending on the network size

The ratio for adding follow edges is about the same as the one for adding new content nodes and updating GRAPHITY indexes. This makes perfect sense since both operations are linear in the node degree, O(d). Overall, STOU was expected to outperform GRAPHITY in this case since the complexity class of STOU for these tasks is O(1).

Number of broken friendships that graphity is able to handle per second

As we can see from the figure, removing friendships has a ratio of about one, meaning that this task is as fast in GRAPHITY as in STOU.
This is also as expected since the complexity class of this task is O(1) for both algorithms.

Build the index

We have analyzed how long it takes to build the GRAPHITY and STOU index for an entire network. Both indices have been computed on a graph with existing follow relations.
To compute the GRAPHITY and STOU indices, for every aggregating node a all content nodes are inserted into the linked list C(a). Subsequently, for the GRAPHITY index only, for every aggregating node a the ego network is sorted by time in descending order. For both indices, we have measured the rates of processing the aggregating nodes per second, as shown in the following graph.

Time to build the index with respect to network size

As one can see, the time needed for computing the indices increases over time. This can be explained by the two steps of creating the indices: for the first step, the time needed for inserting content nodes increases as the average amount of content nodes per aggregating node grows over time. For the second step, the time for sorting increases as the size of the ego networks grows and the sorting part becomes more time consuming. Overall, we can say that for the largest Wikipedia data set from 2011, a rate of indexing 433 nodes per second with GRAPHITY is still possible. Creating the GRAPHITY index for the entire Wikipedia 2011 data set can be done in 77 minutes.
For computing the STOU index only, 42 minutes are needed.
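As a rough sketch of the two build steps just described (my own simplified reconstruction in plain Python for illustration, not the actual implementation on top of neo4j):

```python
from collections import defaultdict

def build_indices(follow_edges, content_nodes):
    """Sketch of building the STOU and GRAPHITY indices.

    follow_edges:  list of (aggregating_node, followed_node) pairs
    content_nodes: list of (author_node, timestamp, item) triples
    """
    # Step 1 (both indices): insert every content node into the
    # per-node content list C(a), kept newest first.
    content_list = defaultdict(list)            # plays the role of C(a)
    for author, timestamp, item in content_nodes:
        content_list[author].append((timestamp, item))
    for items in content_list.values():
        items.sort(key=lambda entry: entry[0], reverse=True)

    # Step 2 (GRAPHITY only): order every ego network by the time of the
    # most recently created content item, in descending order.
    ego_network = defaultdict(list)
    for follower, followed in follow_edges:
        ego_network[follower].append(followed)
    graphity_index = {}
    for follower, followed_nodes in ego_network.items():
        graphity_index[follower] = sorted(
            followed_nodes,
            key=lambda node: content_list[node][0][0] if content_list[node] else float("-inf"),
            reverse=True,
        )
    return content_list, graphity_index
```

The real index additionally keeps this ordering as a linked structure inside the graph so that it can be updated incrementally when new content arrives.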

Data sets:

The used data sets are available in another blog article.
In order to be among the first to receive the paper as soon as it is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.

Source code:

The source code of the evaluation framework can be found in my blog post about the Graphity source. There is also the source code of the Graphity server online.
In order to be among the first to receive the paper as soon as it is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.

Future work & Application:

The plan was actually to use these results in metalcon. But I am currently thinking of implementing my solution for Diaspora. What do you think about that?

Thanks to

First of all, many thanks go to my co-authors (Steffen Staab, Jonas Kunze, Thomas Gottron and Ansgar Scherp). But I also want to thank Mattias Persson and Peter Neubauer from neotechnology.com and the community on the neo4j mailing list for helpful advice on their technology and for providing a neo4j fork that was able to store that many different relationship types.
Thanks to Knut Schumach for coming up with the name GRAPHITY and to Matthias Thimm for helpful discussions.

Teach First Germany – Why education is a Human right.
https://www.rene-pickhardt.de/teach-first-germany-why-education-is-a-human-right/ (Wed, 17 Aug 2011)

Today I want to introduce you to a project that is very important to me! It is called Teach First Germany and follows the model of the UK-based idea of Teach First (see Wikipedia). I don’t know if one could really call it social entrepreneurship, but I would do so!
Teach First was first introduced to me by Simon Turschner during a summer school in Guidel in 2008. For the last two years he has been working as a fellow teacher with Teach First and has just helped to create this video:

I think the video speaks for itself!
Last year I applied for Teach First Germany. I am very happy about their effort and think it is a project that needs support! When I got accepted to my PhD program I decided to drop out of the application process, which was anything but an easy decision. Even though I put the PhD over Teach First at that time, this program is still very important to me and I can perfectly see myself applying again after finishing my PhD. I would also like to encourage you to consider applying at Teach First!
Last but not least I want to thank all the fellows at Teach First, especially Simon, for their effort and commitment. You guys are great! We need more people like you!

List of Internet Business Models
https://www.rene-pickhardt.de/list-of-internet-business-models/ (Mon, 13 Jun 2011)

I just realized that I wasn’t too active in this section of my blog. While thinking about Internet business models I tried to generalize the ideas behind some products and companies I know of. This results in my rather incomplete list of Internet business models that I know of. This list will be extended every time I discover a new business model. I will also try to write articles about each of these business models in the future, including their monetization models.
  1. Data Trading
    • Google
    • Facebook
  2. Problem Solving
  3. The open source model
    • loads of Software companies
    • Google
  4. Content creation
    • Newspaper
    • Mags
    • Blogs
    • message boards
  5. E Commerce
    • C2C Ebay
    • B2C
    • Amazon
    • smaller stores
  6. Premium / freemium Content / Services
    • LinkedIn
    • music
    • Porn
  7. Communication Services
    • Email
    • Skype
    • Social Networking
    • Chat
  8. Filesharing Services
    • Youtube
    • Flickr
    • 1click hosting
    • p2p networks
    • Dropbox
  9. Collaboration Services
    • dropbox
    • Google Docs
  10. Technical Infrastructure Software
    • Webserver
    • Database
    • Webbrowser
    • CMS
  11. Technical Infrastructure Hardware
    • Hosting
    • Cloud computing
    • Internet Service Provider
  12. Knowledge distribution
    • Wikipedia
    • e learning
  13. Advertising
    • Affiliate
    • Lead generation
    • per Click
    • per Impression
    • retargeting
  14. Web design
    • Usability
    • Software engineering
    • Graphic Designer
  15. Information distribution
    • wikipedia
    • googlemaps
    • directories
    • google scholar
  16. Web search
    • universal search
    • specialized search
    • semantic search
  17. (Micro) payment
    • PayPal
    • google checkout
    • Facebook
  18. Consulting
    • Online Marketing
    • Web Design
    • Technical

Maybe it would be better to create a matrix with ways to monetize these models as one dimension and business sectors and product ideas as the other dimension.
If you have any suggestions or know of ideas missing I am happy if you post about those in the comments!

Business Model of Metaweb with freebase
https://www.rene-pickhardt.de/business-model-of-metaweb-with-freebase/ (Thu, 09 Jun 2011)

Today I want to talk about one of my favourite Internet start-ups. It is called Metaweb. Metaweb was founded in 2005. To my knowledge it had two rounds of investment (altogether about 60 million US dollars) and it was bought by Google in June 2010.
So why do I love Metaweb? Well, they started to work on a really important and great problem, using modern technology to solve it. To my knowledge they did not have a direct business model. They just knew that solving this problem would reward them and that after gaining experience on the market they would find their business model.

The problem metaweb was solving

We are lucky: modern and great companies have very informative videos about their product.

I also love the open approach they took. Freebase, the database created by Metaweb, is not protected from others; no, it is under a Creative Commons license. Everyone in the community knows that the semantic database they run is nice to have, but the key point is really the technology they built to run queries against the data base and handle data reconciliation. Making it Creative Commons helps others to improve it, like Wikipedia. It also created trust and built a community around Freebase.

Summary of the business model

  • openess
  • focussing on a real problem
  • learn with the market

I know it sounds as if I were 5 years old, but I am fully convinced that there is an abstract concept behind the success of Metaweb, and I call it their business model. Of course I can think of many other ways to monetize this technology than selling to Google. But in the case of Metaweb this was just not necessary.

What are the 57 signals google uses to filter search results?
https://www.rene-pickhardt.de/google-uses-57-signals-to-filter/ (Tue, 17 May 2011)

Since my blog post on Eli Pariser’s TED talk about the filter bubble became quite popular and a lot of people seem to be interested in which 57 signals Google would use to filter search results, I decided to extend the list from my article and list the signals I would use if I were Google. It might not be 57 signals but I guess it is enough to get an idea:

  1. Our Search History.
  2. Our location – verified -> more information
  3. the browser we use.
  4. the browsers version
  5. The computer we use
  6. The language we use
  7. the time we need to type in a query
  8. the time we spend on the search result page
  9. the time between selecting different results for the same query
  10. our operating system
  11. our operating systems version
  12. the resolution of our computer screen
  13. average amount of search requests per day
  14. average amount of search requests per topic (to finish search)
  15. distribution of search services we use (web / images / videos / real time / news / mobile)
  16. average position of search results we click on
  17. time of the day
  18. current date
  19. topics of ads we click on
  20. frequency we click advertising
  21. topics of adsense advertising we click while surfing other websites
  22. frequency we click on adsense advertising on other websites
  23. frequency of searches of domains on Google
  24. use of google.com or google toolbar
  25. our age
  26. our sex
  27. use of the “I’m feeling lucky” button
  28. do we use the enter key or mouse to send a search request
  29. do we use keyboard shortcuts to navigate through search results
  30. do we use advanced search commands  (how often)
  31. do we use igoogle (which widgets / topics)
  32. where on the screen do we click besides the search results (how often)
  33. where do we move the mouse and mark text in the search results
  34. amount of typos while searching
  35. how often do we use related search queries
  36. how often do we use autosuggestion
  37. how often do we use spell correction
  38. distribution of short / general  queries vs. specific / long tail queries
  39. which other google services do we use (gmail / youtube/ maps / picasa /….)
  40. how often do we search for ourselves

Uff I have to say after 57 minutes of brainstorming I am running out of ideas for the moment. But this might be because it is already one hour after midnight!
If you have some other ideas for signals or think some of my guesses are totally unreasonable, why don’t you tell me in the comments?
Disclaimer: this list of signals is a pure guess based on my knowledge and education in data mining. Not one signal I name might correspond to the 57 signals Google is using. In the future I might discuss why each of these signals could be interesting. But remember: as long as you have a high diversity in the distribution you are fine with any list of signals.

What is Linked open data or the Web of data?
https://www.rene-pickhardt.de/what-is-linked-open-data-or-the-web-of-data/ (Wed, 16 Mar 2011)

Linked open data or the web of data is the main concept and idea of the semantic web proposed by Tim Berners-Lee. The semantic web is often also referred to as Web 3.0. But what is the difference between the Internet as we know it now and the web of data?
On the Internet there are hypertext documents which are interlinked with each other. As we all know, search engines like Google use the hyperlinks to calculate which websites are most relevant. A hypertext document is created for humans to read and understand. Even though Google can search those documents in a pretty efficient way and does amazing things with them, Google is not able to read and interpret these documents or understand their semantics. Even though the search result quality is already very high, it could be increased by a lot if search engines were able to understand the semantics of the documents people put on the web.
The idea of linked open data is that data should also be published in a way that machines can easily read and “understand”. Once the data is published, web documents could even be annotated with this data, invisible to humans but helping computers to understand the semantics and thereby resulting in a better distribution of information. Also a lot of new web services would become possible, increasing the quality of our daily lives just as Google and Wikipedia have done.

An Example of linked data:

You might now wonder how data can be linked. This is pretty easy. Let us take me for example. Take the following statement about me: “Rene lives in Koblenz which is a city in Germany.”
So I could create the following data triples:

  • (Rene Pickhardt, lives in, Koblenz)
  • (Koblenz, type, city)
  • (Koblenz, is in, Germany)
  • (Germany, type, country)

Linked data triples

As you can see from the picture, these triples form a graph, just like the Internet. Now comes the cool part: the graph is easily processable by a computer and we can do some reasoning. By pure reasoning a computer can conclude that I also live in Germany and that cities are in countries.
Linked open data triples with reasoning
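To make the reasoning step concrete, here is a tiny Python sketch of how a machine could derive the implicit facts from the explicit triples above. It is a toy illustration with two hand-written rules, not how real RDF reasoners are implemented:

```python
# Explicit triples, written as (subject, predicate, object)
triples = {
    ("Rene Pickhardt", "lives in", "Koblenz"),
    ("Koblenz", "type", "city"),
    ("Koblenz", "is in", "Germany"),
    ("Germany", "type", "country"),
}

def infer(triples):
    """Apply two toy inference rules until a fixed point is reached."""
    inferred = set(triples)
    types = {s: o for (s, p, o) in triples if p == "type"}
    while True:
        new = set()
        for (s1, p1, o1) in inferred:
            # Rule 1: X lives in Y  and  Y is in Z   =>   X lives in Z
            if p1 == "lives in":
                for (s2, p2, o2) in inferred:
                    if p2 == "is in" and s2 == o1:
                        new.add((s1, "lives in", o2))
            # Rule 2: X is in Y   =>   type(X) is in type(Y)
            if p1 == "is in" and s1 in types and o1 in types:
                new.add((types[s1], "is in", types[o1]))
        if new.issubset(inferred):
            return inferred - triples
        inferred |= new

print(infer(triples))
# e.g. {('Rene Pickhardt', 'lives in', 'Germany'), ('city', 'is in', 'country')}
```

Running this on the four explicit triples yields exactly the two derived facts from the picture: that I also live in Germany and that cities are in countries.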

Now assume everyone published their little data graphs using an open standard and there was a way to connect those graphs. Pretty quickly we would have a lot of knowledge about René Pickhardt, Koblenz, cities, Germany, countries, … Especially if we automatically process a web document and detect some data, we can use this background knowledge to fight ambiguities in language or just to better interconnect parts of the text by using the semantics from the web of data.

Is linked open data a good or a bad thing?

I think it is clear what great things can be achieved once we have this linked open data. But there are still a lot of challenges to tackle.

  • How to fight inconsistencies and Spam?
  • How to trust a certain source of data?
  • How can we easily connect data from different sources and identify identical nodes?
  • How to fight ambiguities?

Fortunately we already have more or less satisfying answers to these questions. But as with any science we have to watch out carefully, since linked open data is accessible to everyone and probably enables as many bad things as it enables people to do good things with it. So of course we should all become very euphoric about this great thing, but bear in mind that nuclear science was not only a good thing. At the end of the day it led to really bad things like nuclear bombs!
I am happy to get your feedback and opinion about linked open data! I will very soon publish some articles with links to sources of linked open data. If you know some why don’t you tell me in the comments?

IBM's Watson & Google – What is the difference?
https://www.rene-pickhardt.de/ibms-watson-google-what-is-the-the-difference/ (Tue, 15 Feb 2011)

Recently there was a lot of news on the web about IBM’s natural language processing system Watson. As you might have heard, Watson is right now challenging two of the best Jeopardy players in the US. A lot of news magazines compare Watson with Google, which is the reason for this article. Even though the algorithms behind Watson and Google are not open source, a lot of estimates and guesses can still be made about the algorithms both computer systems use in order to give intelligent answers to the questions people ask them. Based on these guesses I will explain the differences between Google and Watson.
Even though both systems have a lot of things in common (natural language processing, apparent intelligence, machine learning algorithms,…) I will compare the intelligence behind Google and Watson to demonstrate the differences and the limitations both systems still have.
Google is an information retrieval system. It has indexed a lot of text documents and uses heavy machine learning and data mining algorithms to decide which document is most relevant for any given keyword or combination of keywords. To do so Google uses several techniques. The main concept when Google started was the calculation of PageRank and other graph algorithms that evaluate the trust and relevance of a given resource (which means the domain of a website). This is a huge difference to Watson. A given hypertext document being hosted on two different domains will most probably result in completely different Google rankings for the same keyword. This is quite interesting because the information and data within the document are completely identical. So for deciding which hypertext document is most relevant, Google does much more than study this particular document. Backlinks, neighborhood and context (and maybe some more?) are metrics besides formatting, term frequency and other internal factors.
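Since PageRank is mentioned here, a minimal sketch of the power-iteration idea behind it may help. This is the textbook simplification, certainly not Google's production ranking, and the toy link graph below is made up:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Tiny power-iteration PageRank; `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, targets in links.items():
            for target in targets:
                # each page passes its rank on in equal shares along its outlinks
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# toy web of three pages
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```

In this toy graph page "c" ends up with the highest rank because both other pages link to it, which is exactly the intuition of trust flowing along hyperlinks.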
Watson, on the other hand, doesn’t want to justify its answer by returning the text documents where it found the evidence. Also, Watson doesn’t want to find documents that are most suitable to a given keyword. For Watson the task is rather to understand the semantics behind a given key phrase or question. Once this is done Watson will use its huge knowledge base to find the correct answer. I would guess that Watson uses a lot more artificial intelligence algorithms than Google, especially supervised learning and prediction and classification models. If anyone has some evidence for these statements I will be happy if you tell me!
An interesting fact worth mentioning is that both information retrieval systems first of all use collective intelligence. Google does so by using the structure of the web to calculate the trust of information. It also uses the set of all text documents to calculate synonyms and other things specific to the semantics of words. Watson also uses collective intelligence. It is provided with a lot of information human beings have published in books, on the web or probably even in knowledge systems like ontologies. The systems also have in common that they use a huge amount of computation power and caching in order to provide their answers at a decent speed.
So is Google or Watson more intelligent?
Even though I think that Watson uses many more AI algorithms, the answer should clearly be Google. Watson is highly specialized for one certain task. It can solve it amazingly accurately. But Google solves a much more universal problem. Also, Google has (as does IBM, of course) some of the best engineers in the world working for them. The Watson team might have been around 5 years with 40 people, whereas Google is more like 10 years with nowadays over 20,000 coworkers.
I am excited to get to know your opinion!
