We propose to create a new website for the scientific community that brings together people who are reading the same paper. The basic idea is to mix the functionality of a Q&A platform (like MathOverflow) with a paper database (like arXiv). We follow a strict openness principle by making available the source code and the data we collect.
We start with an analysis of how the internet is currently used in different fields and explain the shortcomings. The actual product description can be found in the section “Basic idea”. At the end we present an overview of websites that follow a similar approach.
This document – as well as the whole project – is a work in progress. We are happy about any kind of comments or other contributions.
Every scientist has to stay up to date with the developments in their area of research. The basic sources for finding new information are:
Every community has found its own way of using these tools.
To stay up to date with recent developments I check arxiv.org on a daily basis (via RSS feed), participate in mathoverflow.net, and search for papers via Google Scholar or MathSciNet. Occasionally interesting work is shared by people in my Google+ circles. In general the pace of pure mathematics is very slow. New research often builds upon work which has been out for a few years. To stay reasonably up to date it is enough to go to conferences every 3-5 months.
I read many papers on my own because I am the only one at the department who does research on that particular topic. We have a reading class where we read papers/lecture notes which are relevant to more people. Usually these are introductions to certain kinds of theory. We have weekly seminars where people talk about their recently published work. There are some very active blogs by famous mathematicians, but in my area blogs play virtually no role.
In computer science, topics evolve and change very quickly. It is always important to have both an overview of upcoming technologies (which you get from tech blogs) and access to current research trends.
Since the field moves so fast and the review process in journals often takes a long time, our main sources of information and papers are conferences and Twitter.
Another rich source for computer scientists is, of course, the related work sections of papers and Google Scholar. Especially useful is the method of finding a very influential paper with more than 1000 citations and then finding newer papers that cite it and contain a certain keyword – one of the features of Google Scholar.
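The “cited-by plus keyword” strategy described above can be sketched as a simple filter. This is an illustrative toy example with made-up paper records; a real version would query a citation database such as Google Scholar.

```python
# Toy list of papers that cite a seminal, highly cited paper.
# The records below are invented for illustration only.
seminal_citations = [
    {"title": "Fast graph kernels for large networks", "year": 2011,
     "abstract": "We study graph kernels for node classification."},
    {"title": "A survey of recommender systems", "year": 2009,
     "abstract": "Collaborative filtering and content-based methods."},
    {"title": "Streaming graph partitioning", "year": 2012,
     "abstract": "Partitioning massive graph streams with kernels."},
]

def citing_papers_with_keyword(citations, keyword, min_year=2010):
    """Return newer citing papers that mention the keyword."""
    keyword = keyword.lower()
    return [p for p in citations
            if p["year"] >= min_year
            and keyword in (p["title"] + " " + p["abstract"]).lower()]

hits = citing_papers_with_keyword(seminal_citations, "kernel")
for p in hits:
    print(p["year"], p["title"])
```

The filter keeps only recent papers whose title or abstract contains the keyword, which is exactly the narrowing step the search engine performs.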
The main problem in computer science is not finding a rare paper or idea but rather filtering the huge number of publications (including bad ones) and keeping track of trends. A system that ranks and summarizes papers (not only by abstract and citation counts) would help me a lot in selecting which related work of a paper I should read!
As a psychologist/neuroscientist, I receive recommendations for scientific papers via Google Scholar alerts or ScienceDirect alerts (http://www.sciencedirect.com/); I receive alerts for keywords or for entire journal issues. When I search for a certain publication, I use pubmed.org or scholar.google.com. This can sometimes be kind of annoying, as I receive multiple alerts from different sources; but I guess it is the best way to stay up to date on recent developments. This is especially important in my field, as we feel a great deal of “publication pressure”; I work on a method which is considered “quite fancy” at the moment, so I also use the alerts to make sure nobody has published “my” experiment yet.
Sometimes a Facebook friend recommends a certain publication or a colleague points me to it. Most of the time, I read articles on my own, as I am the only person working on this specific topic at my institution. Additionally, we have a weekly journal club where everyone in turn presents work related to our focus of research, e.g. a certain part of the human brain. There is also a weekly seminar dedicated to presentations about ongoing projects.
Blogs (e.g. mindhacks.com, http://neuroskeptic.blogspot.com/) can be a source for an overview of recent developments, but I have to admit I use them mainly for work-related entertainment.
All in all, it is easy to stay up to date using alerts from different platforms; the annoying part is the flood of emails you receive and that you are quite often alerted to articles that don’t fit your interests (no matter how exactly you try to specify your keywords).
In the biological sciences, in research at the bench, communication is one of the most fundamental tools a scientist can have. Communication with other scientists may open up possibilities for new collaborations, can lead to a completely new viewpoint on a known question and to the integration and expansion of methods, and allows a scientist to have a good understanding of what is known, what is not known, and what other people have – both successfully and unsuccessfully – tried to investigate.
Yet communication is something that is currently very much lacking in academic science – lacking to an extent that, most scientists will agree, hinders the progress of research. Nonetheless, the lack of communication and the issues it brings with it are something most scientists have accepted as a necessary evil, not knowing how to change it.
Progress is only reported in peer-reviewed journals – many of which are greatly affected not only by what is currently “sexy” in research but also by politics, connections, and the “publish or perish” pressure. Because of this pressure to publish in journals, and the weight a publication list carries for any young scientist’s chances of success, scientists also tend to be very reluctant to share any information pre-publication.
Furthermore, one of the major issues is that there is currently no real way of publishing or communicating either negative results or minor findings, which causes many questions and methods to be repeatedly investigated, as well as a loss of information.
Given how much social networks and the internet have changed communication and access to information over the past years, there is a need for this change to reach research and communication in the life sciences and to transform the way we think not only about approaching and solving research questions but also about the information and insights we gain as a whole.
The most important source of information for philosophers is http://philpapers.org/. You can follow trends going on in your field of interest. Philpapers has a list of almost all papers together with their abstracts, keywords and categories as well as a link to the publisher. Additional information about similar papers is displayed.
Every category of papers is managed by an editor, and for each category it is possible to subscribe to a newsletter. In this way I am informed once per month about current publications in journals related to my topic of interest. Every user can create an account to manage their literature and the papers they are interested in.
Other research and information exchange methods among philosophers are mailing lists, reading clubs, and blogs. Have a look at David Chalmers’ blog list. Blogs are becoming more and more important, but unfortunately they usually cover general topics and developments of the community (e.g. Leiter’s blog, Chalmers’ blog and Schwitzgebel’s blog).
But altogether I still think that a centralized service like Philpapers is my favourite tool because it aggregates most information. If I don’t hear about something on Philpapers, it usually is not that important. Among philosophers this platform – though incomplete – seems set to be the standard for the next couple of years.
As a scientist it is crucial to be informed about the current developments in one’s research area. Abstracting from the reports above, we divide the tasks roughly into the following stages.
1. Finding and filtering new publications:
2. Getting more information about a given paper:
Finally there is a fundamental decision: shall I read the whole paper or not? This leads us to the next task.
3. Understanding a paper: Understanding a paper in depth can be a very time-consuming and tedious process. The presentation is often very short and much knowledge is assumed of the reader. The notation choices can be bad, so that even the statements are hard to understand. In effect the paper is easily readable only by a very small circle of specialists in the area. If one is not in the lucky situation of belonging to that circle, one usually applies the following strategies:
In mathematics most papers are not read through to the end. One uses strategies 1 & 2 until one gets stuck and moves on to something more exciting. The chances of survival are much better with strategy 3, where one is committed to putting in a lot of effort over weeks.
4. Finding related work. Where to go from here? Is the paper superseded by a more recent development? Which are the relevant papers the author builds upon? What are the historic influences? What are the founding ideas of the subject? Finding related work is very time consuming. It is easy to overlook things, given that the reference lists are often vast and sometimes hard to get hold of. Getting information about citations often requires access to commercial databases.
All researchers around the world are faced with the same problems and come up with their individual solutions. There are great synergies in bringing these people together on an online platform! Most of the problems described above are solved by a paper-centric service which allows you to…
We want to do that with a new mixture of a traditional Q&A system like StackExchange or MathOverflow with a paper database and social features. The key features of this system are as follows:
Openness: We follow a strict openness principle. The software will be developed as open source. All data generated on this site will be under a Creative Commons license (like Wikipedia) and will be made available to the community in the form of database dumps or an API (open data).
We use two different types of content sites in our system: Papers and Discussions.
Paper sites. A paper site is dedicated to a single publication and has the following features:
The point “Related Work” deserves some further explanation. The citation graph offers a great deal more information than just a list of references. Together with user-generated content like votes, individual paper bookmarks, and the social graph, one obtains a very interesting data set which can be harvested. We want to present this view at least with respect to popularity, topics, and papers read by friends. Later on one could add more sophisticated, even graphical, views on this graph.
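As a rough illustration of how such signals could be combined, here is a minimal scoring sketch. The field names, weights, and data are assumptions for illustration, not a specification of the actual ranking.

```python
# Hypothetical "related work" ranking that combines citation counts
# with user-generated signals: votes and bookmarks by friends.

def related_work_score(paper, my_friends,
                       w_votes=1.0, w_cites=0.1, w_friend=2.0):
    """Weighted mix of community votes, citations, friend bookmarks."""
    friend_bookmarks = len(set(paper["bookmarked_by"]) & set(my_friends))
    return (w_votes * paper["votes"]
            + w_cites * paper["citations"]
            + w_friend * friend_bookmarks)

# Invented candidate papers related to the one being viewed.
candidates = [
    {"id": "A", "votes": 12, "citations": 340,  "bookmarked_by": ["carol"]},
    {"id": "B", "votes": 3,  "citations": 25,   "bookmarked_by": ["alice", "bob"]},
    {"id": "C", "votes": 0,  "citations": 1200, "bookmarked_by": []},
]

my_friends = ["alice", "bob"]
ranked = sorted(candidates,
                key=lambda p: related_work_score(p, my_friends),
                reverse=True)
print([p["id"] for p in ranked])
```

Changing the weights shifts the view between “popular”, “highly cited”, and “read by friends” – the three perspectives mentioned above.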
Discussion sites. A discussion site looks more like a traditional Q&A question, with the difference that each discussion may have (many) related papers. A discussion site contains:
Besides the content sites we want to provide the following features:
News Stream. This is the start page of our website. It will be generated from the network consisting of friends, papers and authors. There should be several modes like:
Moreover, filter by tag should be always available.
Search bar:
Social: (hard to implement, maybe for second version!)
Our proposed website addresses the problems mentioned above in the following ways.
1. Finding and filtering new publications: This step can be improved with even very little community effort:
2. Getting more information about a given paper:
Many reviews of new papers are already written. For example, MathSciNet and Zentralblatt maintain large databases of reviews which are provided by the community but are not freely available. Many authors would be much happier to write them for an open system!
3. Understanding a paper: Here are the major synergies which we want to address with our project.
4. Finding related work. Having a well-structured and easily navigable view on related papers simplifies the search a lot. The filtering benefits from the content generated by the users (votes) and individual information, like friends who have written or bookmarked a paper.
There are several discussions in Q&A forums addressing precisely this problem:
We found three sites on the internet that follow a similar approach, which we examined more carefully.
1. There is a social network which has most of our features implemented:
researchgate.net
“Connect with researchers, make your work visible, and stay current.”
The Economist has dedicated an article to them. It is essentially a Facebook clone with special features for scientists.
Differences to our approach:
2. Another website which comes reasonably close is:
“an academic network that aggregates links to research paper preprints, then categorizes them into proceedings.”
Differences to our approach:
3. Another very similar site is:
journalfire.com – beta
“Share what you read – connect to colleagues – create journal clubs.”
It has the following features:
They are very skilled: maintained by three PhD students/postdocs from Caltech and MIT.
Differences to our approach:
The site is still at its very beginning, with few users. The project started in 2010 and has not gained much momentum since.
The other sites are roughly classified into the following categories:
1. Single people who are following a very similar idea:
Analysis: Although the principal idea of connecting people reading papers is there, the implementation is very poor in terms of usability and even basic programming. The voting features are also missing.
2. (Semi) Professional sites.
Analysis: The goal of all these tools is to simplify reference management by providing metadata like references, citations, abstracts, and author profiles. Commenting features on the paper sites are either absent or not promoted.
3. Vaguely related sites which solve different problems:
Upshot of all this:
If you like our approach you can contact us or contribute to the source code, where you will find some starting documentation!
So the plan is to fork an open-source question-and-answer system and enrich it with features fulfilling the needs of scientists, plus some social aspects which will eventually help to rank the related work of a paper.
Feel free to provide us with feedback and wishes and join our effort!
We are lucky: modern, great companies have very informative videos about their products.
I also love the open approach they took. Freebase, the database created by Metaweb, is not protected from others; it is under a Creative Commons license. Everyone in the community knows that the semantic database they run is nice to have, but the key point is really the technology they built to run queries against the database and handle data reconciliation. Making it Creative Commons helps others improve it, like Wikipedia; it also created trust and built a community around Freebase.
I know it sounds as if I were 5 years old, but I am fully convinced that there is an abstract concept behind the success of Metaweb, and I call it their business model. Of course I can think of many ways other than selling to Google to monetize this technology. But in the case of Metaweb this was just not necessary.
In order to create a news stream, one possibility is to just show the most recent information to the user (as Twitter does). Due to the huge amount of information created, one wants to filter the results in order to improve the user experience. Facebook first started to filter the news stream on their site, which led to the widespread discussion about their ironically named EdgeRank algorithm. Many users seem to be unhappy with the user experience of Facebook’s Top News.
Also, for some information, such as an event in the future, the moment it becomes available might not be the best moment to display it.
I observed these trends and realized that this problem can be seen as a special case of search or, more generally, of recommendation engines in information retrieval. We want to present the most relevant information updates within a certain time window to every specific user.
This problem seems to me algorithmically much harder than web search, where the results don’t have this time component and for a long time weren’t personalized to the user’s interests. The time component makes the question of relevance hard to decide: the information is new, and you don’t have any votes or other indicators of relevance. Consider a news source or person in someone’s environment that wasn’t important before; all of a sudden this person could provide highly relevant and useful information to the user.
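One common way to capture the time component is to discount an item’s base relevance exponentially with its age. The following is a minimal sketch under assumed parameters (the half-life and the sample items are invented), not a description of any deployed algorithm.

```python
import math

# Exponential time decay: an item loses half its score every
# HALF_LIFE_HOURS. The half-life value is an illustrative assumption.
HALF_LIFE_HOURS = 24.0

def stream_score(base_relevance, age_hours, half_life=HALF_LIFE_HOURS):
    """Base relevance (votes, source affinity, ...) discounted by age."""
    decay = math.exp(-math.log(2) * age_hours / half_life)
    return base_relevance * decay

# Invented stream items: a strong but old item vs. a weaker fresh one.
items = [
    {"text": "new CD release",          "relevance": 5.0, "age_hours": 48.0},
    {"text": "friend's concert review", "relevance": 3.0, "age_hours": 2.0},
]

for item in sorted(items,
                   key=lambda i: stream_score(i["relevance"], i["age_hours"]),
                   reverse=True):
    print(item["text"])
```

With these numbers the fresh, lower-relevance item outranks the older, higher-relevance one, which is exactly the behavior a chronologically biased stream needs.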
Fortunately, in the past I created metalcon.de together with several friends. Metalcon is a social network for heavy metal fans. On Metalcon, users can access information (CD releases, upcoming concerts, discussions, news, reviews, …) about their favorite bands, concerts and venues in their region, and updates from their friends. This information can be displayed perfectly in a social news stream. On the other hand, Metalcon users share information about their taste in music, the venues they go to, and the people they are friends with.
This means that I have a perfect sandbox to develop and test (with real users) smart social news algorithms that aggregate and filter the most relevant news for our users based on their interests.
Furthermore, regional information and information about music are available as linked open data, so the news stream can easily be enriched with semantic components.
Since I am about to redesign Metalcon (a lot of work) for the purpose of research, and since I am about to go in this direction for my PhD thesis, I would be very happy to receive some feedback and thoughts about my suggested future research topic. You can leave a comment or contact me.
Thank you!
You might wonder now how data can be linked. This is pretty easy. Let us take me as an example, with the following statement about me: “Rene lives in Koblenz, which is a city in Germany.”
So I could create the following data triples:
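The statement above breaks down into subject-predicate-object triples. The sketch below shows them in plain Python; the URIs are illustrative placeholders, not real identifiers (real linked data would use dereferenceable IRIs, e.g. from DBpedia).

```python
# "Rene lives in Koblenz, which is a city in Germany" as three triples.
# All URIs below are made-up placeholders for illustration.
triples = [
    ("http://example.org/Rene",    "livesIn",   "http://example.org/Koblenz"),
    ("http://example.org/Koblenz", "isA",       "http://example.org/City"),
    ("http://example.org/Koblenz", "locatedIn", "http://example.org/Germany"),
]

def objects(triples, subject, predicate):
    """Follow a link: all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Where does Rene live?
print(objects(triples, "http://example.org/Rene", "livesIn"))
```

Because each entity is named by a URI, triples from different sources can be merged into one graph and traversed across data sets – that is the “linking” in linked open data.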
I think it is clear what great things can be achieved once we have this linked open data. But there are still a lot of challenges to tackle.
Fortunately we already have more or less satisfying answers to these questions. But as with any science we have to watch out carefully: since linked open data is accessible to everyone, it probably enables as many bad uses as good ones. So of course we should all become very euphoric about this great thing, but bear in mind that nuclear science was not only a good thing; in the end it led to really bad things like nuclear bombs!
I am happy to get your feedback and opinions about linked open data! I will very soon publish some articles with links to sources of linked open data. If you know some, why don’t you tell me in the comments?
Some data sets might only be available through an API. If no good tutorials exist, I will introduce the API in those cases.
In the end, I just want to encourage you: if you think I am missing out on some good data sets, please contact me and tell me about it immediately! That would be very helpful!