collaboration – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany

Experiences on semantifying a Mediawiki for the biggest recource about Chinese rock music: rockinchina .com

Rene — Mon, 07 Jan 2013 09:38:45 +0000

During my trip in China I was visiting Beijing on two weekends and Maceau on another weekend. These trips have been mainly motivated to meet old friends. Especially the heads behind the biggest English resource of Chinese Rock music Rock in China who are Max-Leonhard von Schaper and the founder of the biggest Chinese Rock Print Magazin Yang Yu. After looking at their wiki which is pure gold in terms of content but consists mainly of plain text I introduced them the idea of putting semantics inside the project. While consulting them a little bit and pointing them to the right resources Max did basically the entire work (by taking a one month holiday from his job. Boy this is passion!).
I am very happy to anounce that the data of rock in china is published as linked open data and the process of semantifying the website is in great shape. In the following you can read about Max experiences doing the work. This is particularly interesting because Max has no scientific background in semantic technologies. So we can learn a lot on how to improve these technologies to be ready to be used by everybody:

Max report on semantifying

Max-Leonhard von Schaper in Beijing.

To summarize, for a non-scientific greenhorn experimenting with semantic mediawiki and the semantic data principle in general, a good two months were required to bring our system to the point where it is today. As easy as it seems in the beginning, there is still a lot of manual coding and changing to be done as well as trial-and-error to understand how the new system is working.
Apart from the great learning experience and availability of our data in RDF format, our own website expanded in the process by ~20% of content pages (from 4000 to above 5000), adding over 10000 real property triplets and gaining an additional 300 thousand pageviews.
Lessons learnt in a comprised way:

DBPedia resources are to be linked with “resources” in the URI not with “page”
SMW requires the pre-fix “foaf:” or “mo:” or something else for EACH imported property
Check the Special:ExportRDF early to see if your properties work
Properties / Predicates , no difference with SMW
How to get data to freebase depends on the backlinks and sameas to other ontologies as well as entering data in semantic search engines
Forms for user data entry are very important!
As a non-scientific person without feedback I would not have been able to implement that.
DBPedia and music ontology ARE not interlinked with SAMEAS (as checked on sameas.org).
Factbox only works with the standard skin (monoskin). For other skins one has to include it in the PHP code oneself.

Main article

The online wiki Rock in China has been online for a number of years and focusses on Chinese underground music. Prior to starting implementing Semantic Mediawikia our wiki had roughly 4000 content pages with over 1800 artists and 900 records. We used a number of templates for bands, CDs, venues and labels, but apart from using numerous categories and the DynamicPageList extension for a few joints, we were not able to tangibly use the available data.
DPL example for JOINT between two Wikipedia Categories:


category = Metal Artists
category = Beijing Artists
mode     = ricstyle
order  = ascending

Results of a simple mashup query: display venues in beijing on a Google Map

After having had an interesting discussion with Rene on the benefits of semantic data and Open Linked Data, we decided to go Semantic. As total greenhorns to the field and with only limited programming skills timely available, we started off googeling the respective key terms and quickly enough came to the websites of the Music Ontology and the Semantic Mediawiki, which we decided to install.
Being an electrical engineer with basic IT backgrounds and many years of working on the web in PHP, HTML, Joomla or Mediawiki, it was still a challenge to get used to the new semantic way of talking and understanding the principles behind. Not so much because there might not be enough tutorials or data information out in the web, but because the guiding principle is somewhere but not where I was looking. Without the help of Rene and several feedback discussions I don’t it would have been possible for us to implement this system within the month that it took us.
Our first difficulty (after getting the extension on our FTP server) was to upgrade our existing Mediawiki from version 1.16 to version 1.19. An upgrade that used up the better part of two days, including updating all other extensions as well (with five of them not working anymore at all, as they are not being further developed) and finally getting our first Semantic Property running.
Upon starting of implementing the semantic approach, I read a lot online on the various ontologies available and intensively checked the Music Ontology. However Music Ontology is by far the wrong use case for our wiki, as Music Ontology is going more into the musical creation process and Rock in China is describing the scene developments. All our implementations were tracked on the wiki page Rock in China – Semantic Approach for other team members to understand the current process and to document workarounds and problems.
Our first test class had been Venue, a category in which we had 40 – 50 live houses of China with various level of data depth that we could put into the following template SemanticVenue:

{{SemanticVenue
|Image=
|ImageDescription=
|City=
|Address=
|Phone=
|Opened=
|Closed=
|GeoLocation=
}}

As can be seen from the above template both predicates (City) and properties (Opened) are being proposed for the semantic class VENUE. Semantic Mediawiki is implementing this decisive difference in a very user-friendly way by setting the TYPE of each SMW property to either PAGE or something else. As good as this is, it somehow confuses if one is talking with someone else about the semantic concept in principle.
A major problem had been the implementation of external ontologies which was not sufficiently documented on the semantic mediawiki page, most probably due to a change in versioning. Especially the cross-referencing to the URI was a major problem. As per Semantic Mediawiki documentation, aliases would be allowed, however with trial and error, it was revealed that only a property with a domain prefix, e.g. foaf:phone or owl:sameas would be correctly recognized. We used the Special:RDFExport function to find most of these errors, everytime our URI referencing was wrong, we would get a parser function error.
First, the wrong way for the following two wiki pages:

Mediawiki:smw_import_mo
Property:genre

Mediawiki:smw_import_mo:

http://purl.org/ontology/mo/ |[http://musicontology.com/ Music Ontology Specification]
activity_end|Type:Date
activity_start|Type:Date
MusicArtist|Category
genre|Type:Page
Genre|Category
track|Type:String
media_type|Type:String
publisher|Type:Page
origin|Type:Page
lyrics|Type:Text
free_download|Type:URL

Property:genre:

[[Has type::Page]][[Imported from::mo:genre]]

And now the correct way how it should be actually implemented to work:
Mediawiki:smw_import_mo:

http://purl.org/ontology/mo/|[http://musicontology.com/ Music Ontology Specification]
activity_end|Type:Date
activity_start|Type:Date
MusicArtist|Category
genre|Type:Page
Genre|Category
track|Type:String
media_type|Type:String
publisher|Type:Page
origin|Type:Page
lyrics|Type:Text
free_download|Type:URL

Property:mo:genre:

[[Has type::Page]][[Imported from::mo:genre]]

The ontology with most problems was the dbpedia, which documentation did not tell us what the correct URI was. Luckily the mailing list provided support and we got to know which the correct URI was:

http://www.dbpedia.org/ontology/

Being provided that, we were able to implement a number of semantic properties for a number of classes and start updating our wiki pages to get the data on our semantic database.
To utilize semantic properties within a wiki, there is a number of extensions available, such as Semantic Forms, Semantic Result Formats and Semantic Maps. The benefits we were able to gain were tremendous. For example the original JOINT query that we had been running at the beginning of the blog post with DPL was now able to be utilized with the following ASK query:

{{#ask: [[Category:Artists]] [[mo:origin:Beijing]]
|format=list
}}

However with the major benefit that the extension would NOT be broken after setting the inline query within a page. Dynamic Page List breaks the , rendering a lot of information lost. Other examples of how we benefitted from semantics is that previously we were only able to use Categories and read information of joining one or two categories, e.g. Artist pages that were both categorized as BEIJING artists and METAL artists. However now, with semantic properties, we had a lot of more data to play around with and could create mashup pages such as ROCK or Category:Records on which we were able to implement random videos from any ROCK artists or on which we were able to include a TIMELINE view of released records.

Mashup Page with a suitable video

With the help of the mailing list of Semantic Mediawiki itself (which was of great help when we were struggling) we implemented inline queries using templates to avoid later data changes on multiple pages. That step taken, the basic semantic structures were set up at our wiki and it was time for our next step: Bringing the semantic data of our wiki to others!
And here we are, asking ourselves: How will Freebase or DBpedia actually find our data? How will they include it? Discussing this with Rene a few structural problems became apparent. Being used to work with Wikipedia we usually set the property same:

Owl:sameas (or sameas)

On various of our pages directly to Wikipedia pages.
However we learnt that the property

foaf:primaryTopic

is a much better and accurate property for this. The sameas property should be used for semantic RDF pages, i.e. the respective DBPedia RESOURCE page (not the PAGE page). Luckily we already implemented the sameas property mostly in templates, so it was easy enough to exchange the properties.
Having figured out this issue, we checked out both the freebase page as well as other pages, such as DBpedia or musicbrainz, but there seems to be no “submit RDF” form. Hence we decided that the best way for getting recognized in the Semantic Web is to include more links to other RDF resources, e.g. for our Category:Artists we set sameas links to dbpedia and music ontology. For dbpedia we linked to the class and for music ontology to the URI for the class.
Note on the side here, when checking on sameas.org, it seems that music ontology is NOT cross-linked to dbpedia so far.
Following the recommendations set forth at Sindice, we changed our robots.txt to include our semantic sitemap(s):

Sitemap: http://www.music-china.org/wiki/index.php?title=Special:RecentChanges&feed=atom
Sitemap: http://www.rockinchina.com/wiki/index.php?title=Special:RecentChanges&feed=atom

Going the next step we analyzed how we can include external data on our SMW, e.g. from musicbrainz or from youtube. Being a music-oriented page especially Youtube was of particular interest for us. We found the SMW extension External Data that we could use to connect with the Google API:

{{#get_web_data:
url=https://www.googleapis.com/youtube/v3/search?part=snippet&q=carsick+cars&topicId=%2Fm%2F03cmgbv&type=video&key=Googlev3API&maxResults=50
|format=JSON
|data= videoId=videoId,title=title
}}

And

{{#for_external_table:
{{Youtube|ID={{{videoId}}}|title={{{title}}} }}

{{{videoId}}} and {{{title}}}

}}

See our internal TESTPAGE for the live example.
Youtube is using its in-house Freebase ID system to generate auto-channels filled with official music videos of bands and singers. The Freebase ID can be found on the individual freebase RESOURCE page after pressing the EDIT button. Alternatively one could use the Google API to receive the ID, but would need a Youtube internal HC ID prior to that. Easy implementation for our wiki: Include the FreebaseID as semantic property on artist pages within our definitions template:

{{Definitions
|wikipedia=
|dbpedia=
|freebase=
|freebaseID=
|musicbrainz=
|youtubeautochannel=
}}

Voila, with the additional SQL-based caching of request queries (e.g. JSON) our API load on Google is extremely low as well as increasing speed for loading a page at our wiki. Using this method we were able to increase our saved YOUTUBE id tags from the original 500 to way over 1000 within half a day.

A big variety of videos for an act like carsick cars is now available thanks to semantifying

With these structures in place it was time to inform the people in our community not only on the changes that have been made but also on the additional benefits and possibilities. We used our own blog as well as our Facebook page and Facebook group to spread the word.

Report of Socialcom2012 online in our new WeST blog

Rene — Thu, 20 Sep 2012 06:31:33 +0000

Hey everyone,
longtime no see! Well yeah the summer time usually means vaccation and traveling and so on. But I also have been busy creating some cool content and visiting some conference!
So first of all together with my friend Leon Kastler I have established our new WeST Blog. The goal is to be more transparent with the research of our institute! Quite some people in our institute already shared some nice information. So why don’t you give our blog a try and follow it at:
http://blog.west.uni-koblenz.de/
Since our institute now has its own blog I will in future publish some of my articles over there and will just point to them from this blog. This will happen especially if the content of the articles is founded by research money from our institute. So my presentation at SocialCom 2012 is most certainly founded by our institute so thats why you can read my summery together with some suggestions of interesting paper at my article:
http://blog.west.uni-koblenz.de/2012-09-20/west-researchers-summary-of-socialcom-2012-in-amsterdam/

PhD proposal on distributed graph data bases

Rene — Tue, 27 Mar 2012 10:19:22 +0000

Over the last week we had our off campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss our research ideas with a post doc and other PhD candidates for more than 6 hours. This lead to the structure, todos and time table of my PhD proposal. This has to be finalized over the next couple days but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology

[TODO] 0. Find a template for the PhD proposal

That is straight forward. The task is just to look at other students PhD proposals also at some major conferences and see what kind of structure they use. A very common structure for papers is Jennifer Widom’s structure for writing a good research paper. This or a similar template will help to make the proposal readable in a good way. For this blog article I will follow Jennifer Widom more or less.

1. Write an Introduction

Here I will describe the use case(s) of a distributed graph data base. These could be

indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
storing web log files and click streams of users
doing information retrieval (recommender systems) in the above scenarios

There could also be very other use cases like graphs from

biology
finance
regular graphs
geographic maps like road and traffic networks

2. Discuss all the related work

This is done to name all the existing approaches and challenges that come with a distributed graph data base. It is also important to set onself apart from existing frameworks like graph processing. Here I will name the at least the related work in the following fields:

graph processing (Signal Collect, Pregel,…)
graph theory (especially data structures and algorithms)
(dynamic/adaptive) graph partitioning
distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
redundancy vs fault tolerance
network programming (protocols, latency vs bandwidth)
data bases (ACID, multiple user access, …)
graph data base query languages (SPARQL, Gremlin, Cypher,…)
Social Network and graph analysis and modelling.

3. Formalize the problem of distributed graph data bases

After describing the related work and knowing the standard terminology it makes sense to really formalize the problem. Several steps have to be taken: There needs to be notation for distributed graph data bases fixed. This has to respect two things:
a) the real – so far unknown – problems that will be solved during PhD. In this way fixing the notation and formalizing the (unknown) problem will be kind of hard.
b) The use cases: For the web use case this will probably translate to scale free small world network graphs with a very small diameter. Probably in order to respect other use cases than the web it will make sense to cite different graph models e.g. mathematical models to generate graphs with certain properties from the related work.
The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to choose the use case still so general that all special cases and boarder line cases are included. Especially the use case should be a real extension to graph processing which should of course be possible with a distributed graph data base.
One very important part of the formalization will lead to a first research question:

4. Graph Query languages – Graph Algebra

I think graph data bases are not really general purpose data bases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information of a local neighborhood of data points is frequently needed. They also often seem to be useful when schemaless data is processed. This leads to the question of a query language. Obviously (?) the more general the query language the harder to have a very efficient solution. The model of a relational algebra was a very successful concept in relational data bases. I guess a similar graph algebra is needed as a mathmatical concept for distributed graph data bases as a foundation of their query languages.
Remark that this chapter has nothing much to do with distributed graph data bases but with graph data bases in general.
The graph algebra I have in mind so far is pretty similar to neo4j and consists of some atomic CRUD operations. Once the results are known (ether as an answer from the related work or by own research) I will be able to run my first experiments in a distributed environment.

5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

As expected the graph algebra will consist of some atomic CRUD operations those operations have to be tested against all different data structures one can think of in the different known distributed environments over several different real world data sets. This task will be rather straight forward. It will be possible to know the theoretical results of most implementations. The reason for this experiment is to collect experimental experiences in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting are. Already in the evaluation of graphity I realized that there is a huge gap between theoretical predictions and the real results. In this way I am convinced that this experiment is a good step forward and the deep understanding of actually implementing all this will hopefully lead to:

6. Development of hybrid data structures (creative input)

It would be the first time in my life where I am running such an experiment without any new ideas coming up to tweak and tune. So I am expecting to have learnt a lot from the first experiment in order to have some creative ideas how to combine several data structures and distribution techniques in order to make a better (especially bigger scaling) distributed graph data base technology.

7. Analysis of multiple user access and ACID

One important fact of a distributed graph data base that was not in the focus of my research so far is the part that actually makes it a data base and sets it apart from some graph processing frame work. Even after finding a good data structure and distributed model there are new limitations coming once multiple user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above and more reading on ACID in distributed computing will lead to more reasearch questions and ideas how to test several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension to the 5. paragraph.

8. Again creative input for multiple user access and ACID

After heaving learnt what the best data structures for basic query operations in a distributed setting are and also what the best methods to achieve ACID are it is time for more creative input. This will have the goal to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease for ACID. Once this done everything is straight forward again.

9. Comprehensive benchmark of my solution with existing frameworks

My own solution has to be benchmarked against all the standard technologies for distributed graph data bases and graph processing frameworks.

10. Conclusion of my PhD proposal

So the goal of my PhD is to analyse different data structures and distribution techniques for a realization of distributed graph data base. This will be done with respect to a good runtime of some basic graph queries (CRUD) respecting a standardized graph query algebra as well as muli user access and the paradigms of ACID.

11 Timetable and mile stones

This is a rough schedual fixing some of the major mile stones.

2012 / 04: hand in PhD proposal
2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
2012 / 10: experiments of basic CRUD operations done
2013 / 02: paper with results from basic CRUD operations done
2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
2013 /08: min 3 month research internship in a company benchmarking my system on real data
end of 2013: publishing the results
2014: 9 months of writing my dissertation

For anyone who has input, knows of papers or can point me to similar research I am more than happy if you could contact me or start the discussion!
Thank you very much for reading so far!

Google Video on Search Quality Meeting: Spelling for Long Queries by Lars Hellsten

Rene — Mon, 12 Mar 2012 19:11:04 +0000

Amazing! Today I had a discussion with a coworker about transparency and the way companies should be more open about what they are doing! And what happens on the same day? One of my favourite webcompanies has decided to publish a short video taken from the weekly search quality meeting!
The proposed change by Lars Hellsten is that instead of only checking the first 10 words for possible spelling corrections one could predict which two words are most likely spelled wrong and add an additional window of +-5 words around them. They discuss how this change has much better scores than the old one.
The entire video is interesting because they say that semantic context is usually given by using 3 grams. My students used up to 5 grams in order to make their scentence prediction and the machine learning already told them that 4grams would be sufficient to make syntactically and semantically correct predictions.
Anyway enjoy this great video by Google and thanks to Google for sharing this:

Related-work.net – Product Requirement Document released!

Rene — Mon, 12 Mar 2012 10:26:50 +0000

Recently I visited my friend Heinrich Hartmann in Oxford. We talked about various issues how research is done in these days and how the web could theoretically help to spread information faster and more efficiently connect people interested in the same paper / topics.
The idea of http://www.related-work.net was born. A scientific platform which is open source and open data and tries to solve those problems.
But we did not want to reinvent the wheel. So we did some research on existing online solutions and also asked people from various disciplines to name their problems. Find below our product requirement document! If you like our approach you can contact us or contribute on the source code find some starting documentation!
So the plan is to fork an open source question answer system and enrich it with the features fulfilling the needs of scientists and some social aspects (hopefully using neo4j as a supporting data base technology) which will eventually help to rank related work of a paper.
Feel free to provide us with feedback and wishes and join our effort!

Beginning of our Product Requirement Document

We propose to create a new website for the scientific community which brings together people which are reading the same paper. The basic idea is to mix the functionality of a Q&A platform (like MathOverflow) with a paper database (like arXiv). We follow a strict openness principal by making available the source code and the data we collect.
We start with an analysis how the internet is currently used in different fields and explain the shortcomings. The actual product description can be found under the section “Basic idea”. At the end we present an overview over the websites which follow a similar approach.
This document – as well as the whole project – is work in progress. We are happy about any kind of comments or other contributions.

The distribution of scientific knowledge

Every scientist hast to stay up to date with the developments in his area of research. The basic sources for finding new information are:

Conferences
Research Seminars
Journals
Preprint-servers (arXiv)
Review Databases (MathSciNet, Zentralblatt, …)
Q&A Sites (MathOverflow, StackOverflow, …)
Blogs
Social Networks (Twitter, Google+)
Bibliograhpic Databases (Mendeley, nNode, Medline, etc. )

Every community has found its very own way of how to use this tools.

Mathematics by Heinrich Hartmann – Oxford:

To stay up to date with recent developments I check arxiv.org on a daily basis (RSS feed) participate in mathoverflow.net and search for papers over Google Scholar or MathSciNet. Occasionally interesting work is shared by people in my Google+ circles. In general the speed of pure mathematics is very slow. New research often builds upon work which has been out for a few years. To stay reasonably up to date it is enough to go to conferences every 3-5 months.
I read many papers on myself because I am the only one at the department who does research on that particular topic. We have a reading class where we read papers/lecture notes which are relevant for more people. Usually they are concerned with introductions to certain kinds of theory. We have weekly seminars where people talk about their recently published work. There are some very active blogs by famous mathematicians, but in my area blogs play virtually no role.

Computer Science by René Pickhardt – Uni Koblenz

In Computer Science topics are evolving but also changing very quickly. It is always important to have both an overview of upcoming technologies (which you get from tech blogs) as well as access to current research trends.
Since the speed in computer science is so fast and the review process in Journals often takes much time our main source of information and papers are conferences and twitter.

Usually conference papers are distributed digitally to participants. If one is interested in those papers google queries like “conference name year papers” are frequently used. Sites like http://www.sciweavers.org/ host and aggregate preprints of papers and organize them by conference.
The general method to follow a conference that one is not attending is to follow the hashtag of the conference on Twitter. In general Twitter is the most used tool to share distribute and find information not only for papers but also for the above mentioned news about upcoming technologies.

Another rich source for computer scientists is, of course, the related work of papers and google scholar. Especially useful is the method of finding a very influential paper with more than 1000 citations and find newer papers that quote this paper containing a certain keyword which is one of the features of google scholar.
The main problem in computer science is not to find a rare paper or idea but rather to filter the huge amount of publications and also bad publications and also keep track of trends. In this way a system that ranks and summarize papers (not only by abstract and citation counts) would help me a lot to select what related work of a paper I should read!

Psychology by Elisa Scheller – Uni Freiburg

As a psychologist/neuroscientist, I receive recommendations for scientific papers via google scholar alerts or science direct alerts (http://www.sciencedirect.com/); I receive alerts regarding keywords or regarding entire journal issues. When I search for a certain publication, I use pubmed.org or scholar.google.com. This can sometimes be kind of annoying, as I receive multiple alerts from different sources; but I guess it is the best way to stay up to date regarding recent developments. This is especially important in my field, as we feel a big amount of “publication pressure”; I work on a method which is considered as “quite fancy” at the moment, so I also use the alerts to make sure nobody has published “my” experiment yet.
Sometimes a facebook friend recommends a certain publication or a colleague points me to it. Most of the time, I read articles on my own, as I am the only person working on this specific topic at my institution. Additionally, we have a weekly journal club where everyone in turn presents work which is related to our focus of research, e.g. a certain part of the human brain. There is also a weekly seminar dedicated to presentations about ongoing projects.
Blogs (e.g. mindhacks.com, http://neuroskeptic.blogspot.com/) can be a source to get an overview about recent developments, but I have to admit I use them mainly for work-related entertainment.
All in all, it is easy to stay up to date using alerts from different platforms; the annoying part of it is the flood of emails you receive and that you are quite often alerted to articles that don’t fit your interests (no matter how exact you try to specify your keywords).

Biomedical Research by Johanna Goldmann – MIT

In the biological sciences, in research at the bench – communication is one of the most fundamental tools a scientist can have. Communication with other scientist may open up the possibilities of new collaborations, can lead to a completely new view point of a known question, the integration and expansion of methods as well as allowing a scientist to have a good understanding of what is known, what is not known and what other people have – both successfully and unsuccessfully – tried to investigate.
Yet communication is something that is currently very much lacking in academic science – lacking to the extent that most scientist will agree hinders the progress of research. Nonetheless the lack of communication and the issues it brings with it is something that most scientists will have accepted as a necessary evil – not knowing how to possibly change it.
Progress is only reported in peer-reviewed journals – many which are greatly affected not only but what is currently “sexy” in research but also by politics and connections and the “publish or perish” pressure. Due to the amount of this pressure in publishing in journals and the amount of weight the list of your publications will have upon any young scientists chances of success, scientist tend also to be very reluctant in sharing any information pre-publication.
Furthermore one of the major issues is that currently there really is no way of publishing or communicating either negative results or minor findings, which causes may questions or methods to be repeatedly investigated as well as a loss of information.
Given how much social networks and the internet has changed communication as well as the access to information over the past years – there is a need for this change to affect research and communication in the life science and transform the way we think not only about solving and approaching research questions we gather but the information and insights we gain as a whole.

Philosophy by Sascha Benjamin Fink – Uni Osnabrück

The most important source of information for philosophers is http://philpapers.org/. You can follow trends going on in your field of interest. Philpapers has a list of almost all papers together with their abstracts, keywords and categories as well as a link to the publisher. Additional information about similar papers is displayed.
Every category of papers is managed by some editor. For each category it is possible to subscribe to a newsletter. In this way once per month I will be informed about current publications in journals related to my topic of interest. Every User is able to create an account and manage his literature and the papers of his he is interested in.
Other research and information exchange methods among philosophers consist of mailing lists, reading clubs and Blogs. Have a look at David Chalmers blog list. Blogs are also becoming more and more important. Unfortunately they are usually on general topics and discussing developments of the community (e.g. Leiter’s Blog, Chalmers’ Blog and Schwitzgebel’s Blog).
But all together I still think that for me a centralized service like Philpapers is my favourite tool because it aggregates most information. If I don’t hear about it on Philpapers usually it is not that important. I think among Philosophers this platform – though incomplete – seems to be the standard for the next couple of years.

Problems

As a scientist it is crucial to be informed about the current developments in the research area. Abstracting from the reports above we divide the tasks roughly into the following stages.

1. Finding and filtering new publications:

What is happening right now? What are the current hot topics my area? What are current trends? (→ Check arXiv/Twitter)
Did a friend of mine write something? Did a “big shot” write something?
(→ Check meta information: title, authors)
Are my colleagues excited about a new development? (→ Talk to them.)

2. Getting more information about a given paper:

What is actually done in a given paper? Is it relevant for me? Is it really new? Is it a breakthrough? (→ Read abstracts. Find a good readable summary/review.)
Judge the quality of a paper: Is it correct? Is it well written?
( → Where is it published, if at all? Skim through content.)

Finally there is a fundamental decision: Shall I read the whole paper, or not? which leads us to the next task.

3. Understanding a paper: Understanding a paper in depth can be a very time consuming and tedious process. The presentation is often very short and much knowledge is assumed from the reader. The notation choices can be bad, so that even the statements are hard to understand. In effect the paper is easily readable only for a very small circle of specialist in the area. If one is not in the lucky situation to belong to that circle, one usually applies the following strategies:

Lookup references. This forces you to process a whole tree of older papers which might be hard to read, and hard to get hold of. Sometimes it is worthwhile to consult a textbook to polish up fundamentals.
Finding additional resources. Is there a review? Is there a related video lecture or slides explaining the material in more detail? Is the author going to a conference in the near future, or even giving a seminar in the area?
Join forces. Find people thinking about the same paper: Has somebody at my department already read the paper, so that I can ask some questions? Is there enough interest to make a reading group, or more formally, run a seminar about that paper.
Contact the author. This a last resort. If you have struggled with understanding the paper for a very long time and really need/want to get it, you might eventually write an email to the author – who might respond, or not. Sometimes even errors are found! – and not published! An indeed, there is no easy way to publish “errata” anywhere on the net.

In mathematics most papers are not getting read though the end. One uses strategies 1 & 2 till one gets stuck and moves on to something more exciting. The chances of survival are much better with strategy 3 where one is committed putting a lot of effort in it over weeks.

4. Finding related work. Where to go from there? Is the paper superseded by a more recent development? Which are the relevant papers which the author builds upon? What are the historic influences? What are the founding ideas of the subject? Finding related work is very time consuming. It is easy to overlook things given that the references are often vast, and sometimes hard to get hold of. Getting information over citations requires often access to commercial databases.

Basic idea:

All researchers around the world are faced with the same problems and come up with their individual solutions. There are great synergies in bringing these people together with an online platform! Most of the addressed problems are solved with a paper centric service which allows you to…

…get to know other readers of the paper.
…exchange with the other readers: ask questions, write comments, reviews.
…share the gained insights with the community.
…ask questions about the paper.
…discuss the paper.
…review the paper.

We want to do that with a new mixture of a traditional Q&A system like StackExchange or MathOverflow with a paper database and social features. The key features of this system are as follows:

Openness: We follow a strict openness principle. The software will be developed in open source. All data generated on this site will be under a creative commons license (like Wikipedia) and will be made available to the community in form of database dumps or an API (open data).

We use two different types of content sites in our system: Papers and Discussions.

Paper sites. A paper site is dedicated to a single publication. And has the following features:

Paper meta information
– show title, author, abstract, journal, tags
– leave a comment
– write a review (with wiki option)
– vote up/down
Paper resources
– show pdfs, slides, notes, video lectures, etc.
– add a resource
Related Work
– show the reference-tree and citations in an intelligent way.
Discussions:
– show related discussions
– start a new discussion
Social features
– bookmark
– share on G+, twitter

The point “Related Work” deserves some further explanation. The citation graph offers a great deal more information than just a list of references. Together with the user generated content like votes and the individual paper bookmarks and social graph one has a very interesting data set which can be harvested. We want this point at least view with respect to: Popularity/Topics/Read by Friends. Later on one could add more sophisticated, even graphical views on this graph.

Discussion sites. A discussion looks more like a traditional QA-question, with the difference, that each discussion may have related (many) papers. A discussion site contains:

Discussion meta information (title, author, body)
Discussion content
Related papers
Voting
Follow/Bookmark

Besides the content sides we want to provide the following features:

News Stream. This is the start page of our website. It will be generated from the network consisting of friends, papers and authors. There should be several modes like:

hot: heavily discussed papers/discussions
new papers: list new publications (filtered by tag, like arXiv feed)
social: What did your friends do lately
default: intelligent mix of recent activity that is relevant to the logged in user

Moreover, filter by tag should be always available.

Search bar:

Searches contents of the site, but should also find papers on freely available databases (e.g. arXiv). Adding a paper should be very seamless process from there.
Search result ranking uses vote and view information.
Personalized search information. (Physicists usually do not want sociology results.)
Auto completion on paper titles, author, discussions.

Social: (hard to implement, maybe for second version!)

Easily refer to users by @-syntax familiar from Twitter/Google+
Maintain a friendship / trust graph
Friendship recommendations
Find friends from Google+ on the site

Benefits

Our proposed websites improves the above mentioned problems in the following ways.
1. Finding and filtering new publications:This step can be improved with even very little community effort:

Tell other people, that you are interested in the paper. Vote it up or leave a comment if you are very excited about it.

Point out a paper to a colleague.

2. Getting more information about a given paper:

Write a summary or review about a paper you have read or skimmed through. Maybe the introduction is hard to read or some results are not clearly stated.
Can you recommend reading this paper? Vote it up!
Ask a colleague for his opinion on the paper. Maybe he can write a summary?

Many reviews of new papers are already written. E.g. MathSciNet and Zentralblatt maintain a large database of Reviews which are provided by the community and are not freely available. Many authors would be much more happy to write them to an open system!
3. Understanding a paper:Here are the mayor synergies which we want to address with our project.

Ask a question: Why is the author using this experimental method? How does Lemma 3.4 work? Why do I need this assumption? What is the intiution behind the “virtual truncation”? What implications does this work have?
Start a discussion: (might involve more than one paper.) What is the difference of these two papers? Is there a reference explaining this more clearly? What should I read in advance to understand the theory?
Add resources. Tell the community about related videos, notes, books etc. which are available on other sites.
Share your notes. If you have discussed a paper in a reading class or seminar. Collect your notes or opinions and make them available for the community.
Restate interesting statements. Tell the community when you have found a helpful result which is buried inside the paper. In that way Google may find it!

4. Finding related work. Having a well structured and easily navigable view on related papers simplifies the search a lot. The filtering benefits from the content generated by the users (votes) and individual information, like friends who have written/bookmarked a paper.

Similar Sites on the Web

There are several discussions in QA forum which are discussing precisely this problem:

Quora.com: Where can I comment on sci papers?
MathOverflow: Good websites for discussions of mathematical papers?
MathOverflow: Is a free alternative to MathSciNet possible?
MathOverflow: Errata-Database

We found three sites on the internet which follow a similar approach which we examined more carefully.
1. There is a social network which has most of our features implemented:

researchgate.net
“Connect with researchers, make your work visible, and stay current.”

The Economist has dedicated an article to them. It is essentially a facebook clone, with special features for scientist.

Large, fast growing community. 1.4m +50.000/m. Mainly Biology and Medicine.
(As Daniel Mietchen points out, the size might be misleading due to institutional accounts)
Very professional Look and Feel. Company from Berlin, Germany, funded by VC. (48 People involved, 10 Jobs advertised)
Huge Feature set:

Profile site, Connect to friends
News Feed
Publication Database, Conference Finder, Jobmarket
Every Paper its own page: with

Voting up/down
Comments
Metadata (Title, Author, Abstract, Preveiw)
Social Media (Share, Bookmark, Follow author)

Organize Workgroups/Reading Classes.

Differences to our approach:

Closed Data / Closed Source
Very complex site which solves a lot of purposes
Only very basic features on paper site: vote/comment.
QA system is not linked well to paper database
No MathML
Mainly populated by undergraduates

2. Another website which comes reasonably close is:

http://www.sciweavers.org/

“an academic network that aggregates links to research paper preprints
then categorizes them into proceedings.”

Includes a large collection of online tools for various purposes
Have a big library of papers/software/datasets/conferences for computer science.
Paper sites have:

Meta information and preview
Vote functionality and view statistics, tags
Comments
Related work
Bookmarking
Author information

User profiles (no friendships)

Differences to our approach:

Focus on computer science community
Comment and Discussions are well hidden on paper sites
No News stream
Very spacious design

3. Another very similar site is:

journalfire.com – beta
“Share what your read – connect to colleagues – create journal clubs.”

It has the following features:

Comment on Papers. Activity feed (?). Follow articles.
Host Journal Clubs. Create Events related to papers.
Powerful search box fetching papers from Arxiv and Pubmed (slow)
Social features on site: User profiles, friend finder (no fb/g+ integration yet)
News feed – from subscribed papers and friends
Easy paper import via Bookmarklet
Good usability!! (but slow loading times)
Private reading clubs cost money!

They are very skilled: Maintained by 3 PhD students/postdocs from Caltec and MIT.

Differences to our approach:

Closed Data, Closed Source
Also this site misses (currently) misses out ranking features
Very Closed model – Signup required
Weak Crowd sourcing: Cannot add Meta information

The site is still at its very beginning with little users. The project started in 2010 and did not gain much momentum since.

The other sites are roughly classified in the following categories:
1. Single people who are following a very similar idea:

annotatr.appspot.com. Combines a metadata-base with the disqus plugin. You can comment but not rate. Good usability. Nice CSS. Good search function. No MathML. No related article suggestion. Maintained by two academics in private time. Hosted on Google Apps. Closed Source – Closed Data.
r-Forum – a resource where mathematicians can collect record reviews, corrections of a resource (e.g. paper, talk, …). A simple Vanilla-Forum/Wiki with almost no content used by maybe 12 people in US. No automated Data import. No rating system.
http://math-arch.org/ – Post comments to math papers. very bad usability – get even errors. Maintained by a group of russian programmers LogicSun. Closed Source – Closed Data.

Analysis: Although the principal idea to connect people reading papers is there. The implementation is very bad in terms of usability and even basic programming. Also the voting features are missed out.

2. (Semi) Professional sites.

Public Libary of Science very professional, huge paper data base for mainly biology, medicine. Features full text papers, lots of interesting meta information including references. Has comment features (not very visible) and news stream on the start page.
No QA features (+1, Ask question) on the site. Only published articles are on the site.
Mendeley.com – Huge Bibliographic database with bookmarking and social features. You can organize reading groups in there, with comments and notes shared among the participants. Features a news stream with papers by friends. Nice import. Impressive fulltext data and Reference features.
No QA features for paper. No comments for paper. Requires Signup to do anything useful.
papercritic.com – Open review database. Connected to Mendely bibliographic libary. You can post reviews. No rating. No comments. Not open: Mendely is commercial.
webofknowledge.com. Commercial academic citation index.
zotero.org – features programm that runs inside a browser. “easy-to-use tool to help you collect, organize, cite, and share your research sources”

Analysis: The goal of all these tools is to simplify the reference management, by providing metadata like references, citations, abstracts, author profiles. Commenting features on the paper site are not there or not promoted.
3. Vaguely related sites which solve different problems:

citeulike.org – Social bookmarking for papers. Closed Source – Open Data.
http://www.scholarpedia.org. A peer reviewed open access encyclopedia.
Philica.com Online Journal which publishes articles from any field along with its reviews.
MathSciNet/Zentralblatt – Review database for math community. Closed Source – Commercial.
http://f1000research.com/ – Online Journal with a public, post publish review process. “Open Science – Open Data – Open Review”
http://altmetrics.org/manifesto/ as an emerging trend from the web-science trust community. Their goal is to revolutionize the review process and create better filters for scientific publications making use of link structures and public discussions. (Might be interesting for us).
http://meta.wikimedia.org/wiki/WikiScholar – one of several ideas under discussion at Wikimedia as to a central repository for references (that are cited on Wikipedias and other Wikimedia projects)

Upshot of all this:

There is not a single site featuring good Q&A features for papers.

If you like our approach you can contact us or contribute on the source code find some starting documentation!
So the plan is to fork an open source question answer system and enrich it with the features fulfilling the needs of scientists and some social aspects which will eventually help to rank related work of a paper.
Feel free to provide us with feedback and wishes and join our effort!

Wikipedia to Blackout for 24 hours to fight SOPA and PIPA – Copy of the user discussion and poll on my blog

Rene — Tue, 17 Jan 2012 11:31:49 +0000

I am one of the web pioneers but this is about the most amazing thing that I will be witnessing on the web as long as I can remember. Tomorrow on January 18th the english version of Wikipedia will shut down for 24 hours to protest two upcoming (?) american laws (SOPA and PIPA) that set the legal foundations to censor the web. This is happening in the country that is so proud of it’s freedom of speech.

Press release: http://wikimediafoundation.org/wiki/Press_releases/English_Wikipedia_to_go_dark
Letter to Users: http://wikimediafoundation.org/wiki/English_Wikipedia_anti-SOPA_blackout
User discussion: http://en.wikipedia.org/wiki/Wikipedia:SOPA_initiative/Action see below for a saved copy!

This is such an important move of democracy that I was standing still for a couple of minutes after I heard of this! 1’800 active wikipedia authors moderators and administrators collectively agreed to make this move in order to show a protest! I am very excited to see where this will be going and what impact this has. Freedom of the internet is what makes this such a beautiful space. Everyone spread this word! discuss this! Don’t let anyone take the freedom of speech and information sharing from you!
Since the user discussion and poll won’t be available tomorrow I attached them to my blogpost.
http://www.rene-pickhardt.de/wp-content/uploads/2012/01/Wikipedia-SOPA-initiative-Action-Wikipedia-the-free-encyclopedia.html
I will not comment on this any further. Please everyone Have your own oppinion and act with responsability.

Open Source Facebook music streaming App for free download!

Rene — Tue, 03 Jan 2012 14:54:14 +0000

In an earlier post I have explained the need for a Facebook streaming app that has to be enhanced with some features in order to create viral word of mouth effects. Together with Yann Leretaille and Robert Naumann we programmed the facebook API and developed such an app for my band In legend. Today ( even though xmas is gone and 2012 has already started ) it is the time for me to share the source code of this app.

have a look at the app here:

Your Browser can’t display iframes. go to: this page in order to listen

Features and Problems

works on facebook and on any other webpage
enables setting more and more songs free for streaming while more people install the app (in order to spread the word)
users need to connect (with facebook or via email adress) in order to listen
some lightweight statistics
encrypted flash player (not open source yet) that makes it hard to download the music (Though I myself have some moral problems with this kind of feature. But well it is how the industry works…)
Slideshow of pictures to improve listening experience
optimized usability for high conversion rates

The app runs on PHP, MySQL, JavaScript (MooTools) and you will need your own webspace in order to host it

A kind warning

The App was developed with a lot of time pressure and we had some nasty bugs that needed to be fixed. That is why the source code is messed up with some really fast and dirty quick fixes. Afterwards I never really had the time to clean up the source and make a good documentation. As my PhD progresses this situation will not change in the foreseeable future. Since my prediction says that Facebook will be overrun by Google+ within this year it is more than time to share the app!
The good part: most of the stuff can be reused once Google+ opens its API and the app can be transformed to a great social network.

Source code on google code

http://code.google.com/p/in-legend-facebook-music-streaming-app/source/checkout

Download Google n gram data set and neo4j source code for storing it

Rene — Sun, 27 Nov 2011 13:28:20 +0000

In the end of September I discovered an amazing data set which is provided by Google! It is called the Google n gram data set. Even thogh the english wikipedia article about ngrams needs some clen up it explains nicely what an ngram is.
http://en.wikipedia.org/wiki/N-gram
The data set is available in several languages and I am sure it is very useful for many tasks in web retrieval, data mining, information retrieval and natural language processing.
This data set is very well described on the official google n gram page which I also include as an iframe directly here on my blog.

So let me rather talk about some possible applications with this source of pure gold:
I forwarded this data set to two high school students which I was teaching last summer at the dsa. Now they are working on a project for a German student competition. They are using the n-grams and neo4j to predict sentences and help people to improve typing.
The idea is that once a user has started to type a sentence statistics about the n-grams can be used to semantically and syntactically correctly predict what the next word will be and in this way increase the speed of typing by making suggestions to the user. This will be in particular usefull with all these mobile devices where typing is really annoying.
You can find some source code of the newer version at: https://github.com/renepickhardt/typology/tree/develop
Note that this is just a primitive algorithm to process the ngrams and store the information in a neo4j graph data base. Interestingly it can already produce decent recommendations and it uses less storage space than the ngrams dataset since the graph format is much more natural (and also due to the fact that we did not store all of the data saved in the ngrams to neo4j e.g. n-grams of different years have been aggregated.)
From what I know the roadmap is very clear now. Normalize the weights and for prediction use a weighed sum of all different kinds of n-grams and use machine learning (supervised learning) to learn those weights. As a training data set a corpus from different domains could be used (e.g. wikipedia corpus as a general purpose corpus or a corpus of a certain domain for a special porpus)
If you have any suggestions to the work the students did and their approach using graph data bases and neo4j to process and store ngrams as well as predicting sentences feel free to join the discussion right here!

11 lessons learnt after my first scientific paper was submitted

Rene — Wed, 02 Nov 2011 00:58:28 +0000

During the last month my blog was rather quite. I dicided that I was aiming to submit my first paper to a top conference with a deadline of november first. Well besides the fact that I almost forgot about the fact that I also have a private life – as well as my collegues helping me with the paper – there were several lessons learnt:

If your advisor tells you that the deadline is to short he is probably right! We beat the deadline but the cost for doing so was really high.
Physicists rock like hell. Evaluating my algorithms I did many experiments together with Jonas Kunze my partner at metalcon. I was totally amazed by the way he approached meassuring things. I rememberd my time as an undergrad standing in the lab meassuring things for my physics classes. Despite the fact that I knew a usefull skill was being tought to me I hated it and decided to go for pure mathmatics… Well I now learnt the hard way what I didn’t learn as an under graduate.
Things become clearer when you really dig into it. It is amazing how all the practical runtimes of my graph data base index for social news feeds – let’s call it GRAPHITY – matched the theoretical runtimes. But while evaluating you see how bad experiments have been designed in the planning phase and you reajust. Even if things work out right a way several times I got a deeper understanding by just seeing and feeling it. What I want to say things are more complicated than you might think after 2 minutes, half a day or half a month of thinking.
The whole learning experience was really nice and it was more about techniques for scientific working than the graph databes index.
If your advisor tells you to change notation it is most likely true that even though he is not as deep as you in the topic he has more experience and changing notation is a good idea! Even though I was totally convinced that my notation was great ( at least I have learnt how to model things while studying mathmatics) it made things more complicated. After I finally listened to my advisor things worked out like magic (at least concerning notation)
people in university have a very different approach to people in consultancies. But if the deadline comes closer both work until late at night!
Freedom is perfect! Thinking of the problem and solution I did not have many conferences and current research topics in mind. I thought of practical problems for improving metalcon. While emerging with my ideas the first criticism was that my motivation was not scientific. Well screw that! As soon as you really work on describing the problem and doing evaluation you do the science!
You can always generalize. I was pretty sures my skills in doing so are quite good. Well now I now there is space for improvement.
Structure, structure, structure. You cannot have enough structure!
Making a traffic light status overview document in Google docs or some other collaborative system as my friend Heinrich Harmann showed me during “schüler akademie” as he has learnt with McKinsey & Company is really good invested effort and time!
neo4j is really a cool and exciting technology and the guys in sweeden are really helpfull and cool.

I guess I could boar you all has hell for the next couple pages. I actually should even do this because I know myself i will never come back and right down what is in my mind right now. That is the reason why I publish this here right now at 2 o’clock in the morning!

Analysing user behaviour in internet communities

Rene — Fri, 06 May 2011 17:55:58 +0000

Yesterday Julia Preusse – who started her PhD studies at WeST just a month before I did – had a talk about her research interests. She is interested in communities and the behaviour of users in discussions. I thought she was having some interesting ideas. That is why I asked her for a meeting today. We sat down and discussed some of her ideas. While doing so it seemed that she liked my aproach to solve some of her problems.
After only 30 minutes of discussion we had a concrete roadmap for some interesting research on user behaviour in internet message boards. Of course I cannot talk about the concrete idea here – most part is Julias interlectual property and she has not published it yet – but the general idea is that we want to analyze the reply structure of users within topics in order to cluster the users to different groups. From there Julia wants to go to the next levle. We talked back with Dr. Thomas Gottron about our ideas and he seemed to be very happy with our approach.
Even though my focus is on social news streams I think Julias topics are very interesting and working with her together could also help me to gain a deeper insight on users behaviour for metalcon. That is why I will help her within the next weeks to work on these questions. I am excited to see the output and I will of course keep you updated on the topic. Of course I won’t forget to meanwhile work on my own research topics (-: