The best way to create an autocomplete service: And the winner is… Giuseppe Ottaviano
https://www.rene-pickhardt.de/the-best-way-to-create-an-autocomplete-service-and-the-winner-is-giuseppe-ottaviano/ (Wed, 15 May 2013)
Over one year ago I started to think about indexing scored strings for autocompletion queries. I stumbled upon this problem after seeing the strength of the predictions of the typology approach for next-word prediction on smartphones. The typology approach had one major drawback: though its suggestions had a high precision, at 50 milliseconds per suggestion it was rather slow, especially for a server-side application.

  • On August 16th, 2012 I found a first solution building on Nicolai Diethelm's Suggest Tree. Though the speedup was great, the suggest tree at that time had several major drawbacks: 1) the number of suggestions had to be known before building the tree, 2) large memory overhead and high redundancy, and 3) no possibility of updating weights or inserting new strings after building the tree (the last two issues have been fixed just last month).
  • So I tried to find a solution that required less redundancy. But for indexing gigabytes of 5-grams we still needed a persistent method. We tried Lucene and MySQL in December and January. After seeing that MySQL does not provide any indices for this kind of query, I decided to misuse MySQL's multidimensional trees in a highly redundant way just to be able to evaluate the strength of typology on large data sets with gigabytes of n-grams. Creating one of the dirtiest hacks of my life I could at least handle the data, but the solution was brute-force engineering and consisted of throwing hardware at the problem.
  • After Christoph tried to solve this using bitmap indices, which was quite fast but had issues with scaling and index maintainability, we had a discussion and the solution finally popped into my mind at the beginning of March this year.

Even though I had been thinking of scored tries before, they always had the problem that they could only find the top-1 element efficiently. Then I realized that one has to sort the children of a node by score and use a priority queue during retrieval. In this way one gets the best possible runtime. I was doing this in a rather redundant way because I was aiming for fast prefix retrieval of the trie node and then fast retrieval of the top children.
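A minimal sketch of that idea (just an illustration, not the exact implementation and not the algorithm from Ottaviano's paper; all class and method names are made up): every trie node stores the maximum score reachable in its subtree, and retrieval always expands the most promising entry first with a priority queue, so completions come out in exact score order.

```java
import java.util.*;

/** Minimal sketch of top-k prefix completion over a score-annotated trie. */
class ScoredTrie {

    static class Node {
        Map<Character, Node> children = new TreeMap<>();
        double wordScore = Double.NEGATIVE_INFINITY; // score if this node ends a word
        double maxScore  = Double.NEGATIVE_INFINITY; // best score anywhere in the subtree
        String word;                                 // set only on word-ending nodes
    }

    private final Node root = new Node();

    /** Insert a scored string; propagate the maximum subtree score on the way down. */
    void insert(String s, double score) {
        Node cur = root;
        cur.maxScore = Math.max(cur.maxScore, score);
        for (char c : s.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
            cur.maxScore = Math.max(cur.maxScore, score);
        }
        cur.wordScore = Math.max(cur.wordScore, score);
        cur.word = s;
    }

    /** Queue entry: either an inner node (word == null) or a finished word. */
    private record Entry(double priority, Node node, String word) {}

    /** Return the k highest-scored completions of the given prefix, best first. */
    List<String> topK(String prefix, int k) {
        Node cur = root;
        for (char c : prefix.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return List.of();
        }
        PriorityQueue<Entry> pq =
                new PriorityQueue<>(Comparator.comparingDouble(Entry::priority).reversed());
        pq.add(new Entry(cur.maxScore, cur, null));
        List<String> result = new ArrayList<>();
        while (!pq.isEmpty() && result.size() < k) {
            Entry e = pq.poll();
            if (e.word() != null) {      // a completed word: by queue order it is the next best
                result.add(e.word());
                continue;
            }
            Node n = e.node();
            if (n.word != null)          // the node itself ends a word: requeue it as a word entry
                pq.add(new Entry(n.wordScore, null, n.word));
            for (Node child : n.children.values())
                pq.add(new Entry(child.maxScore, child, null));
        }
        return result;
    }
}
```

In this sketch the priority queue alone does the ordering work; a more refined version keeps each node's children pre-sorted by subtree score and pushes them lazily.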
After I came up with my solution and after talking to Lucene contributors from IBM in Haifa, I realized that Lucene had a pretty similar solution as a less popular “hidden feature”, which I tested. Anyway, in my experiments I also saw a large memory overhead with the Lucene solution, so my friend Heinrich and I started to develop my trie-based solution and benchmark it against various baselines in order to produce solid results.
The development started last month and we made quite some progress. Our goal was always to be about as fast as Nicolai Diethelm's suggest tree without running into all the drawbacks of his solution. In our coding session yesterday we realized that Nicolai has improved his data structure a lot, getting rid of the memory overhead and adding the ability to update, insert and delete items in his index (still, the number of suggestions has to be known before the tree is built).
Yet while learning more about the ternary tree data structure he used to build his solution, I found a paper that will be presented TODAY at the WWW conference. Guess what: independently of us, Giuseppe Ottaviano explains in Chapter 4 the exact solution and algorithm that I came up with this March. Combined with an efficient implementation of the tries and many compression techniques (even respecting the cache locality of the processor), he even beats Nicolai Diethelm's suggest tree.
I looked up Giuseppe Ottaviano and the only two things I have to say are:

  1. Congratulations Giuseppe. You really worked on this kind of problem for a long time and created an amazing paper. This is also reflected by the related work section and all the small details in your paper which we were still in the process of figuring out.
  2. If anyone needs an autocompletion service, this is the way to go. Being able to provide suggestions from a dictionary with 10 million entries in a few microseconds (yes, micro, not milli!) means that a single computer can handle about 100,000 requests per second, which is certainly web scale. The updated suggest tree by Nicolai is also a way to go and maybe much easier to use, since it is Java based rather than C++ and the full code is open sourced.
Ok so much for the history of events and the congratulations to Giuseppe. I am happy to see that the algorithm really performs that well but there is one little thing that really bothers me a lot: 
 
How come our community of researchers hasn't come up with a good way of giving credit to a person like me who came up with the solution independently? As for me, I feel that the strongest chapter of my dissertation just collapsed and one year of research just burnt away. Personally I gained and learnt a lot from it, but from a career point of view this seems like a huge drawback.

Anyway, life goes on, and by thinking about the trie-based solution we already came up with a decent list of future work which we can most certainly use for follow-up work. I will certainly contact the authors; maybe a collaboration will be possible in the future.

Typology Oberseminar talk and Speed up of retrieval by a factor of 1000
https://www.rene-pickhardt.de/typology-oberseminar-talk-and-speed-up-of-retrieval-by-a-factor-of-1000/ (Thu, 16 Aug 2012)
Almost 2 months ago I talked in our Oberseminar about Typology. Update: Download slides. Most readers of my blog will already know the project, which was initially implemented by my students Till and Paul. I just want to share some slides with you. They explain on one hand how the system works and on the other hand give some overview of the related work.
As you can see from the slides we are planning to submit our results to the SIGIR conference. So one year after my first blog post on Graphity, which developed into a full paper for SocialCom 2012 (Graphity blog post and blog post for source code), here is the still informal Typology blog post with the slides from the Oberseminar talk and 3 months left until our SIGIR submission. I expect this submission will not be such a hassle as Graphity, since I should have learnt some lessons and also have a good student who is helping me with the implementation of all the tests.
Additionally I have finally uploaded some source code to GitHub that makes the Typology retrieval algorithm pretty fast. There are still some issues with this code since it lowers the quality of predictions a little bit. Also the index has to be built first. Last but not least, the original SuggestTree code did not save the weights of the items to be suggested. I need those weights in the aggregation phase. Since I did not want to extend the original code I placed the weights at the end of the suggested items. This is a little inefficient.
The main reason retrieval speeds up with the new algorithm is that Typology needs to sort over all out-edges of a node. This is rather slow, especially if one only needs the top-k elements. Since neo4j as a graph database does not provide indices for this kind of data, I was forced to look for another way to presort the data. Additionally, if a prefix is known, one does not have to look at all outgoing edges. I found the SuggestTree class by Nicolai Diethelm, which solved the problem very well and led to such a great speedup. The index is not persistent yet and it also needs quite some memory. On the other hand, a suggest tree is built for every node. This means that the index can be distributed very easily over several machines, allowing for horizontal scaling!
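To illustrate the presorting idea, here is a much simpler stand-in (this is not the actual SuggestTree API; a real suggest tree even precomputes the top-k list for every prefix): for each word node we keep its successors in a sorted map, so that the candidates for a typed prefix come from a small contiguous key range instead of from sorting all out-edges on every request.

```java
import java.util.*;

/**
 * Sketch of a per-node prefix index: for every word node in the graph we keep its
 * successors in a sorted structure, so prefix candidates are found with log-time
 * lookups instead of sorting all out-edges per request. Illustrative only.
 */
class PerNodePrefixIndex {

    // previous word -> (successor word -> transition weight), successors sorted lexicographically
    private final Map<String, TreeMap<String, Integer>> index = new HashMap<>();

    void addEdge(String from, String to, int weight) {
        index.computeIfAbsent(from, k -> new TreeMap<>()).merge(to, weight, Integer::sum);
    }

    /** Top-k successors of `from` that start with `prefix`, ordered by weight. */
    List<String> suggest(String from, String prefix, int k) {
        TreeMap<String, Integer> successors = index.get(from);
        if (successors == null) return List.of();
        // restrict to the contiguous key range of all strings starting with the prefix
        SortedMap<String, Integer> range = successors.subMap(prefix, prefix + Character.MAX_VALUE);
        return range.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```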
Anyway, the old algorithm was only able to handle about 20 requests per second and now we are at something like 14k requests per second, and as I mentioned there is still a little room for more (:
I hope indices like this will be standard in neo4j soon. This would open up the range of applications that could make good use of neo4j.
As always I am happy about any suggestions, and I am looking forward to doing the complete evaluation and paper writing for Typology.

Building an Autocompletion on GWT with RPC, ContextListener and a Suggest Tree: Part 0
https://www.rene-pickhardt.de/building-an-autocompletion-on-gwt-with-rpc-contextlistener-and-a-suggest-tree-part-0/ (Wed, 13 Jun 2012)
Over the last weeks there was quite some quality programming time for me. First of all I built some indices on the Typology database, by which I was able to increase the retrieval speed of Typology by a factor of over 1000, which is something that rarely happens in computer science. I will blog about this soon. But having those techniques at hand I also used them to build a better autocompletion for the search function of my online social network metalcon.de.
The search functionality is not deployed to the real site yet, but on the demo page you can find a demo showing how the completion helps you while typing. Right now the network requests are faster than Google search (which I admit is quite easy if you only have to handle one request per second and also have a much smaller concept space). Still I was amazed by the ease and beauty of the program and by the fact that the suggestions for autocompletion are actually more accurate than our current database search. So feel free to have a look at the demo:
http://134.93.129.135:8080/wiki.html
Right now it consists of about 150 thousand concepts which come from 4 different data sources (metal bands, metal records, tracks and German venues for heavy metal). I am pretty sure that increasing the size of the concept space by 2 orders of magnitude should not be a problem. And if everything works out fine I will be able to test this hypothesis on my joint project related-work.net, which will have a database with at least 1 million concepts that need to be autocompleted.
Even though everything I used except the ContextListener and my small but effective caching strategy can be found at http://developer-resource.blogspot.de/2008/07/google-web-toolkit-suggest-box-rpc.html, and the data structure (suggest tree) is open source and can be found at http://sourceforge.net/projects/suggesttree/, I am planning to produce a series of screencasts and release the source code of my implementation together with some test data over the next weeks in order to spread the knowledge of how to build strong autocompletion engines. The planned structure of these articles will be as follows (a small sketch of the RPC plumbing from parts 2, 3 and 5 follows after the outline):

part 1: introduction of which parts exist and where to find them

  • Set up a gwt project
  • Erase all files that are not required
  • Create a basic Design

part 2: AutoComplete via RPC

  • Necessary client-side stuff
  • Integration of SuggestBox and SuggestOracle
  • Setting up the Remote procedure call

part 3: A basic AutoComplete Server

  • show how to fill it with data and where to plug it into the autocompletion
  • disclaimer: not a good solution yet
  • always returns the same suggestions

part 4: AutoComplete pulling suggestions from a database

  • including a database
  • locking the database for every autocomplete HTTP request
  • show how this is a poor design
  • demonstrate the resulting slow response times

part 5: Introducing the ContextListener

  • introducing a ContextListener
  • demonstrate the speed problems that remain with every network request

part 6: Introducing a fast Index (Suggest Tree)

  • include the suggest tree
  • demonstrate increased speed

part 7: Introducing client side caching and formatting

  • introducing caching
  • demonstrate no network traffic for cached completions

topics not covered (but I am happy about hints for some of these points):

  • on user login: create a personalized suggest tree and save it in some context data structure
  • merging the personalized AND global index (Google will only display 2 or 3 personalized results)
  • index compression
  • scheduling / caching / precalculation of the index
  • non-prefix retrieval (merging?)
  • CSS of the retrieval box
  • parallel architectures for searching
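As promised above, here is a rough sketch of the RPC plumbing from parts 2, 3 and 5. Type names such as SuggestService, the “suggest” path and the “suggestIndex” attribute are invented for this sketch and may differ from the code I will actually release; in a real GWT project the client interfaces and the server classes also live in separate packages and files.

```java
import com.google.gwt.user.client.rpc.AsyncCallback;
import com.google.gwt.user.client.rpc.RemoteService;
import com.google.gwt.user.client.rpc.RemoteServiceRelativePath;
import com.google.gwt.user.server.rpc.RemoteServiceServlet;
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

// Shared RPC interface (part 2), referenced by the client-side SuggestOracle.
@RemoteServiceRelativePath("suggest")
interface SuggestService extends RemoteService {
    List<String> getSuggestions(String prefix, int limit);
}

// Async counterpart used by the GWT client code (part 2).
interface SuggestServiceAsync {
    void getSuggestions(String prefix, int limit, AsyncCallback<List<String>> callback);
}

// Builds the in-memory index once at deployment (part 5) instead of hitting the
// database for every request (the problem demonstrated in part 4).
class IndexContextListener implements ServletContextListener {
    @Override
    public void contextInitialized(ServletContextEvent event) {
        // stand-in for the real suggest tree: concept -> weight, sorted for prefix lookups
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("metallica", 100);   // in reality loaded once from the database
        index.put("megadeth", 80);
        event.getServletContext().setAttribute("suggestIndex", index);
    }

    @Override
    public void contextDestroyed(ServletContextEvent event) { /* nothing to clean up */ }
}

// Server-side RPC implementation (part 3) answering from the prebuilt index (part 6).
class SuggestServiceImpl extends RemoteServiceServlet implements SuggestService {
    @Override
    public List<String> getSuggestions(String prefix, int limit) {
        @SuppressWarnings("unchecked")
        TreeMap<String, Integer> index =
                (TreeMap<String, Integer>) getServletContext().getAttribute("suggestIndex");
        return index.subMap(prefix, prefix + Character.MAX_VALUE)
                    .keySet().stream().limit(limit).collect(Collectors.toList());
    }
}
```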
Foundations of statistical natural language processing: Review of chapter 1
https://www.rene-pickhardt.de/foundations-of-statistical-natural-language-processing-review-of-chapter-1/ (Tue, 12 Jun 2012)
Due to the interesting results we found by creating Typology I am currently reading the related work about query prediction and autocompletion of sentences. There is quite some interesting academic work available in this area of information retrieval.
While reading these papers I realized that I am not that strong in the field of natural language processing, which seems to have a deep impact on my current research interests. That's why I decided to put in a reading session of some basic work. A short trip to the library and a look into the cited work of the papers led me to the book Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze. Even though they write in the introduction that their book is far from being a complete coverage of the topic, I think this book will be a perfect entry point to become familiar with some of the basic concepts in NLP. Since one understands and remembers better what one has read if one writes it down, I am planning to write summaries and reviews of the book chapters here in my blog:

Chapter 1 Introduction

This chapter is split into several sections and is supposed to give some motivation. It already demonstrates in a good way that in order to understand natural language processing you really have to bridge the gap between mathematics and computer science on the one hand and linguistics on the other hand. Even in the basic examples given from linguistics there was some notation that I did not understand right away. In this sense I am really looking forward to reading chapter 3 which is supposed to give a rich overview of all the concepts needed from linguistics.
I personally also found the motivating section of chapter 1 too long. I am about to learn some concepts and right now I don't really have the feeling that I need to understand all the philosophical discussions about grammar, syntax and semantics.
What I really loved on the other hand was the last section “dirty hands”. In this section a small corpus (Tom Sawyer) is used to introduce some of the phenomena that one faces in natural language processing, some of which I already discussed, without knowing it, in the article about text mining for linguists on Ulysses. In the book of course they are discussed in a more structured way, but one can easily download the source code from the above mentioned article and play around to understand the basic concepts from the book. Among these were:
Word counts / tokens / types The most basic operation one can do on a text is counting words. This is something that I already did in the Ulysses article. Counting words is interesting since in today's world it can be automated. What I didn't see in my last blog post is that counting words already leads to some more insights than just a distribution of words, which I will discuss now:
Distinctive words can be spotted. Once I have a corpus consisting of many different texts I can count all words and create a ranking of the most frequent words. I will realize that for any given text the ranking looks quite similar to that global ranking. But once in a while I might spot some words in a single text in the top 100 of most frequent words that would not appear (let's say) in the top 200 of the global ranking. Those words seem to be distinctive words that are of particular interest for the current text. In the example of Tom Sawyer, “Tom” is such a distinctive word.
Hapax legomena If one looks at all the words in a given text one will realize that most words occur fewer than 3 times. This phenomenon is called a hapax legomenon and demonstrates the difficulty of natural language processing: the data one analyses is very sparse. The frequent words are most of the time grammatical structures, whereas the infrequent words carry the semantics. From this phenomenon one gets very quickly to:
Zipf's law Roughly speaking, Zipf's law says that when you count the word frequencies of a text and order the words from most frequent to least frequent, you get a table in which the rank of a word multiplied by its frequency is always about the same number (saying that the rank is inversely proportional to the frequency of the word). This is of course only an approximation: just imagine the most frequent word occurs an odd number of times. Then there is no integer frequency for the second most frequent word which, multiplied by 2, gives exactly the frequency of the most frequent word.
Anyway, Zipf's law was a very important discovery and has been generalized by Mandelbrot's law (which I so far only knew from chaos theory and fractals). Maybe sometime in the near future I will find some time to calculate the word frequencies of my blog and see if Zipf's law holds (:
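A quick check of that kind could look like the following small sketch (the input file name and the naive tokenization are placeholders):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
import java.util.stream.Collectors;

// Count word frequencies in a text file, order them by frequency and print
// rank * frequency, which Zipf's law predicts to be roughly constant.
public class ZipfCheck {
    public static void main(String[] args) throws Exception {
        String text = Files.readString(Path.of("corpus.txt")).toLowerCase(); // hypothetical input file
        Map<String, Long> counts = Arrays.stream(text.split("\\W+"))         // naive tokenization
                .filter(w -> !w.isBlank())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        List<Map.Entry<String, Long>> ranking = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());

        for (int rank = 1; rank <= Math.min(20, ranking.size()); rank++) {
            Map.Entry<String, Long> e = ranking.get(rank - 1);
            System.out.printf("%2d  %-15s f=%6d  rank*f=%d%n",
                    rank, e.getKey(), e.getValue(), rank * e.getValue());
        }
    }
}
```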
Collocations / bigrams Another important concept was that of collocations. Many words only have a meaning together. In this sense “New York” or “United States” are more than the sum of the single words. The text pointed out that it is not sufficient to find the most frequent bigrams in order to find good collocations: those bigrams have to be filtered, either with grammatical structures or by normalization. I think calculating a Jaccard coefficient might be interesting (even though it was not mentioned in the text). Should I really try to verify Zipf's law in the near future I will also try to verify my method for calculating collocations. I hope that I would find collocations in my blog like social network, graph database or news stream…
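A sketch of how I imagine the Jaccard idea for bigrams (my own illustration, not from the book): score a bigram (a, b) by count(a b) / (count(a) + count(b) - count(a b)), so that pairs of otherwise rare words that almost always occur together score highest, while pairs of very common words are pushed down.

```java
import java.util.*;

// Illustrative sketch: score candidate collocations with a Jaccard-style coefficient
// jaccard(a, b) = count(a b) / (count(a) + count(b) - count(a b)).
public class CollocationScorer {

    public static Map<String, Double> score(List<String> tokens) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            unigrams.merge(tokens.get(i), 1, Integer::sum);
            if (i + 1 < tokens.size())
                bigrams.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
        }
        Map<String, Double> scores = new HashMap<>();
        bigrams.forEach((bigram, ab) -> {
            String[] parts = bigram.split(" ");
            int a = unigrams.get(parts[0]);
            int b = unigrams.get(parts[1]);
            scores.put(bigram, (double) ab / (a + b - ab));
        });
        return scores;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "he", "moved", "to", "new", "york", "and", "new", "york", "welcomed", "him");
        score(tokens).entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(5)
                .forEach(e -> System.out.println(e.getKey() + "  " + e.getValue()));
    }
}
```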
KWIC What I did not have in mind so far is keyword-in-context (KWIC) analysis. What happens here is that you look at all text snippets that occur in a certain window around a keyword. This seems more like work from linguistics, but I think automating this task would also be useful in natural language processing. So far it never came to my mind when using a computer system that it would also make sense to describe words by their context. Actually pretty funny, since this is the most natural operation we do when learning a new language.
Exercises
I really liked the exercises in the book. Some of them were really straightforward, but I had the feeling they were carefully chosen to demonstrate some of the information given in the book so far.
What I was missing around these basic word statistics is the order of the words. This was somewhat reflected in the collocation problem. Even though I am not an expert yet, I have the feeling that most methods in statistical natural language processing seem to “forget” the order in which the words appear in the text. I am convinced that this is an important piece of information; it already inspired me in my diploma thesis to create a method similar to explicit semantic analysis, and it is a core element of Typology!
Anyway, reading the first chapter of the book I did not really learn something new, but it helped me take a certain point of view. I am really excited to proceed. The next chapter will be about probability theory. I already saw that it is written in a pure math style with examples like rolling dice rather than examples from corpus linguistics, which I find a bit sad.

Typology using neo4j wins 2 awards at the German federal competition young scientists.
https://www.rene-pickhardt.de/typology-using-neo4j-wins-2-awards-at-the-german-federal-competition-young-scientists/ (Mon, 21 May 2012)
Two days ago I arrived in Erfurt in order to visit the federal competition for young scientists (Jugend Forscht). I have reported before about the project Typology by Till Speicher and Paul Wagner, which I supervised over the last half year and which has already won many awards.
On Saturday night they already won a special award donated by the Gesellschaft für Informatik, titled “special award for a contribution which demonstrates particularly the usefulness of computer science for society” (Sonderpreis für eine Arbeit, die in besonderer Art und Weise den Nutzen der Informatik verdeutlicht). This award came with 1500 Euro in cash!
Yesterday was the final award ceremony. I was quite excited to see how the hard work that Till and Paul put into their research project would be judged by the jury. Out of 457 submissions in a tough competition the top 5 projects were awarded. Till and Paul came in 4th and will now be allowed to visit the German chancellor Angela Merkel in Berlin.
With the use of spreading activation, Typology is able to make precise predictions of what you are going to type next on your smartphone. It outperforms the current scientific standard (language models) by more than 100% and has a precision of 67%!
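For readers who have not met spreading activation before, the rough idea (a simplified sketch with made-up names, not the actual Typology implementation) is to inject activation at the nodes of the words already typed, let it flow along weighted edges, and suggest the words that collect the most activation:

```java
import java.util.*;

// Simplified sketch of spreading activation over a word graph:
// activation starts at the already-typed words and spreads along weighted edges;
// the words collecting the most activation are suggested next.
public class SpreadingActivation {

    // word -> (neighbour word -> edge weight), e.g. learned from n-gram counts
    private final Map<String, Map<String, Double>> graph = new HashMap<>();

    public void addEdge(String from, String to, double weight) {
        graph.computeIfAbsent(from, k -> new HashMap<>()).merge(to, weight, Double::sum);
    }

    public List<String> suggest(List<String> typedWords, int k) {
        Map<String, Double> activation = new HashMap<>();
        for (String word : typedWords) {
            Map<String, Double> neighbours = graph.getOrDefault(word, Map.of());
            double total = neighbours.values().stream().mapToDouble(Double::doubleValue).sum();
            // spread one unit of activation from each typed word, proportional to edge weight
            neighbours.forEach((next, w) -> activation.merge(next, w / total, Double::sum));
        }
        return activation.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```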
A demo of the project can be found at www.typology.de
The Android app of this system is available in the app store. It is currently only available for German text and is in beta. But the source code is open and anybody who wants to contribute is welcome to check out the code. A mailing list will be set up soon, but anyone who is interested can already drop a short message at mail@typology.de and will be added to the mailing list as soon as it is established.
You can also look at the documentation, which will soon be available in English. Alternatively you can report bugs and request features in our bug tracker.
I am currently trying to improve the data structures and the database design in order to make the retrieval of suggestions faster and to decrease the computational load on our web server. If you have expertise in top-k aggregation joins in combination with prefix filtering, drop me a line and we can discuss this issue!
Happy winners Paul Wagner (left) and Till Speicher (right) with me

Smartphones of Policemen could give criminals a competitive advantage
https://www.rene-pickhardt.de/smartphones-of-policemen-could-give-criminals-a-competitive-advantage/ (Fri, 27 Apr 2012)
If I were a criminal I would create a smartphone app which would give me the possibility to geographically and socially track policemen. Here is some background on this thought.
Yesterday I was sitting in the German summit on “Facebook, Google & Co – Chances and Risks” (which I will blog about soon). Today, during my train trip to the second day of the summit, I was talking to a very friendly police officer. He agreed with what was said at the summit: the police are using social networks to find potential criminals. They also use cellphone tracking together with mobile providers to find people they are looking for. Nothing new and special so far. But now my interesting observation.
The police officer proudly told me that he is not using any social networking service because he enjoys his life in privacy. I understood that he believed this to be necessary in his job. While telling me this he was holding his iPhone in his hand. Again this shows one of the most crucial parts of this entire privacy discussion: even highly educated people often lack an understanding of how much private information they implicitly give to third parties.
So I asked him if he used it during work hours and he told me that he did, since only mobile providers would know where he is and they could not give away data that easily. I was amazed! A policeman using an iPhone during work. That is such a security gap. If I were a terrorist organization I would create an iPhone and Android app (or if possible an open mobile HTML5 app as Tim Berners-Lee suggests – you see, the ethics overwhelm me, I am just not a criminal (-:). I would design this app to support policemen: help them communicate, or offer a cool map integration, anything that is useful for the police. In this way I would create a database with real movement data of policemen. This data I could use for a different service similar to http://girlsaround.me/, displaying the current position and face of policemen (including, if requested, a list of people they recently communicated with, including their phone numbers) on a map to anyone of my terrorist organization. The police could just never catch me since I would always know where they are (without asking any mobile provider!). I could even give them fake phone calls pretending I am one of the people they recently communicated with, feeding them false information or just distracting them.

Of course this setting is only half realistic:

  • Every policeman would have to have a smartphone and use it during work time
  • Every policeman would have to install the app of the criminal
  • The criminal can distinguish between policemen and other people using the app (should be possible with data mining)
  • The criminal can decide whether the policeman is currently working or off duty

But it should show and demonstrate the dangers…

To conclude:

We have to disallow policemen to use private smartphones during work! Or if they do so, they must not install any applications from a source they don't trust. And here is the crucial point: who to trust and who not? Trust is usually created through social ties. So if the app is there and some policemen like the service and recommend it to their coworkers, trust is created. Who really asks about the source of an app and about who is running/owning the data servers? A service that is well known on the web can easily be run by 2 or 3 people, and even if they are nice it is easy to manipulate or blackmail them in order to get access to this very sensitive data.
And on another more technical topic: We need a decentralized mobile space. There has to be a frequency on which people are able to set up their own transmitters and create decentralized mobile networks. It is a shame that those frequencies are all owned by companies creating centralized services.
By the way this would be a good solution since it would also enable the police to have their own decentralized mobile networks giving them privacy against third parties!

Disclaimer:

I never thought I would write an article in this paranoid way, telling people what is possible and where the risks are. I almost feel like a member of the CCC, of Anonymous, or finally like a real pirate. But one year of PhD in a very data-driven environment with social networks, information retrieval and the web as a focus really makes me understand more and more what is possible (and in particular what is easy to achieve). Also the low awareness in society about these dangers (probably due to the complex technologies) overwhelms me and makes me feel like I have to act and at least inform people.
Too bad that mostly people who are already aware of these topics read my blog. Maybe I have to go geek and create this app to demonstrate the functionality in order to really raise awareness. But there are just too many interesting things to do during a PhD program, so I think this time only writing about it has to be sufficient.

PhD proposal on distributed graph data bases
https://www.rene-pickhardt.de/phd-proposal-on-distributed-graph-data-bases/ (Tue, 27 Mar 2012)
Over the last week we had our off-campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss their research ideas with a postdoc and other PhD candidates for more than 6 hours. This led to the structure, todos and timetable of my PhD proposal. It has to be finalized over the next couple of days, but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph database technology.

[TODO] 0. Find a template for the PhD proposal

That is straightforward. The task is just to look at other students' PhD proposals and also at some major conferences and see what kind of structure they use. A very common structure for papers is Jennifer Widom's structure for writing a good research paper. This or a similar template will help to make the proposal readable. For this blog article I will follow Jennifer Widom more or less.

1. Write an Introduction

Here I will describe the use case(s) of a distributed graph database. These could be:

  • indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
  • running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
  • storing web log files and click streams of users
  • doing information retrieval (recommender systems) in the above scenarios

There could also be quite different use cases, like graphs from:

  • biology
  • finance
  • regular graphs 
  • geographic maps like road and traffic networks

2. Discuss all the related work

This is done to name all the existing approaches and challenges that come with a distributed graph database. It is also important to set oneself apart from existing frameworks like graph processing. Here I will name at least the related work in the following fields:

  • graph processing (Signal Collect, Pregel,…)
  • graph theory (especially data structures and algorithms)
  • (dynamic/adaptive) graph partitioning
  • distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
  • redundancy vs fault tolerance
  • network programming (protocols, latency vs bandwidth)
  • databases (ACID, multiple user access, …)
  • graph database query languages (SPARQL, Gremlin, Cypher, …)
  • Social Network and graph analysis and modelling.

3. Formalize the problem of distributed graph databases

After describing the related work and knowing the standard terminology it makes sense to really formalize the problem. Several steps have to be taken: a notation for distributed graph databases needs to be fixed. This has to respect two things:
a) the real – so far unknown – problems that will be solved during the PhD. Fixing the notation and formalizing the (still unknown) problem will therefore be kind of hard.
b) the use cases: for the web use case this will probably translate to scale-free small-world graphs with a very small diameter. In order to respect use cases other than the web it will probably make sense to cite different graph models from the related work, e.g. mathematical models to generate graphs with certain properties.
The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to choose the use case general enough that all special cases and borderline cases are included. In particular the use case should be a real extension to graph processing, which should of course still be possible with a distributed graph database.
One very important part of the formalization will lead to a first research question:

4. Graph Query languages – Graph Algebra

I think graph databases are not really general-purpose databases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information about the local neighborhood of data points is frequently needed. They also often seem to be useful when schemaless data is processed. This leads to the question of a query language. Obviously (?) the more general the query language, the harder it is to have a very efficient solution. The model of relational algebra was a very successful concept in relational databases. I guess a similar graph algebra is needed as a mathematical concept for distributed graph databases as a foundation of their query languages.
Note that this chapter does not have much to do with distributed graph databases but with graph databases in general.
The graph algebra I have in mind so far is pretty similar to neo4j and consists of some atomic CRUD operations. Once the results are known (either as an answer from the related work or from my own research) I will be able to run my first experiments in a distributed environment.
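To make “atomic CRUD operations” a bit more concrete, the kind of minimal interface I have in mind could look like the following sketch (names and the exact operation set are illustrative, not a finished algebra):

```java
// Illustrative sketch of an atomic CRUD interface for a (distributed) graph database.
// The operation set is deliberately minimal; a real graph algebra would also have to
// specify how these operations compose and how they behave under distribution.
public interface GraphCrud<NodeId, EdgeId> {

    // Create
    NodeId createNode();
    EdgeId createEdge(NodeId from, NodeId to, String label);

    // Read
    boolean nodeExists(NodeId node);
    Iterable<EdgeId> outgoingEdges(NodeId node);   // local neighborhood access
    NodeId edgeTarget(EdgeId edge);

    // Update
    void setNodeProperty(NodeId node, String key, Object value);
    void setEdgeProperty(EdgeId edge, String key, Object value);

    // Delete
    void deleteEdge(EdgeId edge);
    void deleteNode(NodeId node);                  // implies deleting incident edges
}
```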

5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

As expected the graph algebra will consist of some atomic CRUD operations. Those operations have to be tested against all the different data structures one can think of, in the different known distributed environments, over several different real-world data sets. This task will be rather straightforward. It will be possible to know the theoretical results of most implementations. The reason for this experiment is to collect experimental experience in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting are. Already in the evaluation of Graphity I realized that there is a huge gap between theoretical predictions and the real results. In this way I am convinced that this experiment is a good step forward, and the deep understanding of actually implementing all this will hopefully lead to:

6. Development of hybrid data structures (creative input)

It would be the first time in my life that I run such an experiment without any new ideas coming up to tweak and tune. So I am expecting to have learnt a lot from the first experiment and to have some creative ideas for how to combine several data structures and distribution techniques in order to build a better (especially larger-scaling) distributed graph database technology.

7. Analysis of multiple user access and ACID

One important aspect of a distributed graph database that was not in the focus of my research so far is the part that actually makes it a database and sets it apart from a graph processing framework. Even after finding a good data structure and distribution model there are new limitations once multiple-user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above and more reading on ACID in distributed computing will lead to more research questions and ideas for how to test several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension of paragraph 5.

8. Again creative input for multiple user access and ACID

After having learnt what the best data structures for basic query operations in a distributed setting are, and also what the best methods to achieve ACID are, it is time for more creative input. The goal will be to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease of ACID. Once this is done everything is straightforward again.

9. Comprehensive benchmark of my solution with existing frameworks

My own solution has to be benchmarked against all the standard technologies for distributed graph databases and graph processing frameworks.

10. Conclusion of my PhD proposal

So the goal of my PhD is to analyse different data structures and distribution techniques for the realization of a distributed graph database. This will be done with respect to a good runtime for some basic graph queries (CRUD), respecting a standardized graph query algebra as well as multi-user access and the paradigms of ACID.

11. Timetable and milestones

This is a rough schedule fixing some of the major milestones.

  • 2012 / 04: hand in PhD proposal
  • 2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
  • 2012 / 10: experiments of basic CRUD operations done
  • 2013 / 02: paper with results from basic CRUD operations done
  • 2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
  • 2013 / 08: min. 3-month research internship in a company benchmarking my system on real data
  • end of 2013: publishing the results
  • 2014: 9 months of writing my dissertation

If you have input, know of papers or can point me to similar research, I am more than happy if you contact me or start the discussion!
Thank you very much for reading so far!

Paul Wagner and Till Speicher won State Competition "Jugend Forscht Hessen" and best Project award using neo4j
https://www.rene-pickhardt.de/paul-wagner-and-till-speicher-won-state-competition-jugend-forscht-hessen-and-best-project-award-using-neo4j/ (Fri, 16 Mar 2012)
6 months of hard coding and supervising by me are over and end with a huge success! After analyzing 80 GB of Google n-grams data, Paul and Till put them into a neo4j graph database in order to make predictions for fast sentence completion. Today was the award ceremony, and the two students from Darmstadt and Saarbrücken (respectively) won first place. Additionally they received the “beste schöpferische Arbeit” award, which is the award for the best project in the entire competition (across all disciplines).
With their technology and the almost finished Android app, typing will be revolutionized! While typing a sentence they are able to predict the next word with a recall of 67%, creating huge additional value for today's smartphones.
So stay tuned for the upcoming news and the federal competition in May in Erfurt.
Have a look at their website where you can find the (still German) documentation, as well as the source code and a demo (which I also include here; use tab completion (-: as in the Unix bash).
Right now it only works for the German language – since only German data was processed – so try sentences like:

  • “Warum ist die Banane krumm” (where the rare word krumm is correctly predicted due to the relation to the famous question “why is the banana curved?”)
  • “Das kann ich doch auch” (I am also able to do that)
  • “geht wirklich nur deutsche Sprache ?” (Is really only German language possible?)


Demo: http://complet.typology.de

Google Video on Search Quality Meeting: Spelling for Long Queries by Lars Hellsten
https://www.rene-pickhardt.de/google-video-on-search-quality-meeting-spelling-for-long-queries-by-lars-hellsten/ (Mon, 12 Mar 2012)
Amazing! Today I had a discussion with a coworker about transparency and the way companies should be more open about what they are doing. And what happens on the same day? One of my favourite web companies has decided to publish a short video taken from their weekly search quality meeting!
The change proposed by Lars Hellsten is that instead of only checking the first 10 words for possible spelling corrections, one could predict which two words are most likely misspelled and add an additional window of ±5 words around each of them. They discuss how this change scores much better than the old approach.
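Just to illustrate the windowing idea (a toy sketch of my own; the scoring signals Google actually uses are of course not public):

```java
import java.util.*;
import java.util.stream.IntStream;

// Toy sketch of the windowing idea: score every query word for how likely it is
// misspelled, take the two most suspicious ones and mark a +-5 word window around
// each of them for spelling correction. The suspicion score is a placeholder.
public class SpellWindowSelector {

    /** Placeholder suspicion score: words missing from a dictionary look more suspicious. */
    static double suspicion(String word, Set<String> dictionary) {
        return dictionary.contains(word.toLowerCase()) ? 0.0 : 1.0 + word.length() / 10.0;
    }

    /** Returns the inclusive index ranges of the words that should be checked for corrections. */
    static List<int[]> windows(List<String> query, Set<String> dictionary) {
        List<Integer> mostSuspicious = IntStream.range(0, query.size()).boxed()
                .sorted(Comparator.comparingDouble(
                        (Integer i) -> -suspicion(query.get(i), dictionary)))
                .limit(2)                                  // the two most likely misspellings
                .toList();
        List<int[]> result = new ArrayList<>();
        for (int center : mostSuspicious) {
            int from = Math.max(0, center - 5);
            int to = Math.min(query.size() - 1, center + 5);
            result.add(new int[]{from, to});
        }
        return result;
    }
}
```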
The entire video is interesting because they say that semantic context is usually captured by using 3-grams. My students used up to 5-grams in order to make their sentence predictions, and the machine learning already told them that 4-grams would be sufficient to make syntactically and semantically correct predictions.
Anyway enjoy this great video by Google and thanks to Google for sharing this:

Related-work.net – Product Requirement Document released!
https://www.rene-pickhardt.de/related-work-net-product-requirement-document-released/ (Mon, 12 Mar 2012)
Recently I visited my friend Heinrich Hartmann in Oxford. We talked about various issues regarding how research is done these days and how the web could theoretically help to spread information faster and to connect people interested in the same paper or topics more efficiently.
The idea of http://www.related-work.net was born. A scientific platform which is open source and open data and tries to solve those problems.
But we did not want to reinvent the wheel. So we did some research on existing online solutions and also asked people from various disciplines to name their problems. Find our product requirement document below! If you like our approach you can contact us or contribute to the source code, where you will find some starting documentation!
So the plan is to fork an open source question-answer system and enrich it with features fulfilling the needs of scientists and with some social aspects (hopefully using neo4j as a supporting database technology), which will eventually help to rank the related work of a paper.
Feel free to provide us with feedback and wishes and join our effort!

Beginning of our Product Requirement Document

We propose to create a new website for the scientific community which brings together people who are reading the same paper. The basic idea is to mix the functionality of a Q&A platform (like MathOverflow) with a paper database (like arXiv). We follow a strict openness principle by making available the source code and the data we collect.
We start with an analysis of how the internet is currently used in different fields and explain the shortcomings. The actual product description can be found under the section “Basic idea”. At the end we present an overview of websites which follow a similar approach.
This document – as well as the whole project – is work in progress. We are happy about any kind of comments or other contributions.

The distribution of scientific knowledge

Every scientist has to stay up to date with the developments in his area of research. The basic sources for finding new information are:

  • Conferences
  • Research Seminars
  • Journals
  • Preprint-servers (arXiv)
  • Review Databases (MathSciNet, Zentralblatt, …)
  • Q&A Sites (MathOverflow, StackOverflow, …)
  • Blogs
  • Social Networks (Twitter, Google+)
  • Bibliographic Databases (Mendeley, nNode, Medline, etc.)

Every community has found its very own way of how to use these tools.

Mathematics by Heinrich Hartmann – Oxford:

To stay up to date with recent developments I check arxiv.org on a daily basis (RSS feed), participate in mathoverflow.net and search for papers via Google Scholar or MathSciNet. Occasionally interesting work is shared by people in my Google+ circles. In general the speed of pure mathematics is very slow. New research often builds upon work which has been out for a few years. To stay reasonably up to date it is enough to go to conferences every 3-5 months.
I read many papers by myself because I am the only one at the department who does research on that particular topic. We have a reading class where we read papers/lecture notes which are relevant for more people. Usually they are concerned with introductions to certain kinds of theory. We have weekly seminars where people talk about their recently published work. There are some very active blogs by famous mathematicians, but in my area blogs play virtually no role.

Computer Science by René Pickhardt – Uni Koblenz

In Computer Science topics are evolving but also changing very quickly. It is always important to have both an overview of upcoming technologies (which you get from tech blogs) as well as access to current research trends.
Since the pace in computer science is so fast and the review process in journals often takes a long time, our main sources of information and papers are conferences and Twitter.

  • Usually conference papers are distributed digitally to participants. If one is interested in those papers google queries like “conference name year papers” are frequently used. Sites like http://www.sciweavers.org/ host and aggregate preprints of papers and organize them by conference.
  • The general method to follow a conference that one is not attending is to follow the hashtag of the conference on Twitter. In general Twitter is the most used tool to share, distribute and find information, not only about papers but also about the above mentioned news on upcoming technologies.

Another rich source for computer scientists is, of course, the related work of papers and Google Scholar. Especially useful is the method of finding a very influential paper with more than 1000 citations and then finding newer papers that cite this paper and contain a certain keyword, which is one of the features of Google Scholar.
The main problem in computer science is not to find a rare paper or idea but rather to filter the huge amount of publications (including bad ones) and to keep track of trends. A system that ranks and summarizes papers (not only by abstract and citation counts) would therefore help me a lot to select which related work of a paper I should read!

Psychology by Elisa Scheller – Uni Freiburg

As a psychologist/neuroscientist, I receive recommendations for scientific papers via google scholar alerts or science direct alerts (http://www.sciencedirect.com/); I receive alerts regarding keywords or regarding entire journal issues. When I search for a certain publication, I use pubmed.org or scholar.google.com. This can sometimes be kind of annoying, as I receive multiple alerts from different sources; but I guess it is the best way to stay up to date regarding recent developments. This is especially important in my field, as we feel a big amount of “publication pressure”; I work on a method which is considered as “quite fancy” at the moment, so I also use the alerts to make sure nobody has published “my” experiment yet.
Sometimes a facebook friend recommends a certain publication or a colleague points me to it. Most of the time, I read articles on my own, as I am the only person working on this specific topic at my institution. Additionally, we have a weekly journal club where everyone in turn presents work which is related to our focus of research, e.g. a certain part of the human brain. There is also a weekly seminar dedicated to presentations about ongoing projects.
Blogs (e.g. mindhacks.com, http://neuroskeptic.blogspot.com/) can be a source to get an overview about recent developments, but I have to admit I use them mainly for work-related entertainment.
All in all, it is easy to stay up to date using alerts from different platforms;  the annoying part of it is the flood of emails you receive and that you are quite often alerted to articles that don’t fit your interests (no matter how exact you try to specify your keywords).

Biomedical Research by Johanna Goldmann – MIT

In the biological sciences, in research at the bench, communication is one of the most fundamental tools a scientist can have. Communication with other scientists may open up the possibility of new collaborations, can lead to a completely new viewpoint on a known question and to the integration and expansion of methods, as well as giving a scientist a good understanding of what is known, what is not known and what other people have – both successfully and unsuccessfully – tried to investigate.
Yet communication is something that is currently very much lacking in academic science – lacking to an extent that most scientists will agree hinders the progress of research. Nonetheless the lack of communication and the issues it brings with it is something most scientists have accepted as a necessary evil, not knowing how to possibly change it.
Progress is only reported in peer-reviewed journals – many of which are greatly affected not only by what is currently “sexy” in research but also by politics, connections and the “publish or perish” pressure. Due to the amount of this pressure to publish in journals and the weight the list of your publications has on any young scientist's chances of success, scientists also tend to be very reluctant to share any information pre-publication.
Furthermore, one of the major issues is that currently there really is no way of publishing or communicating either negative results or minor findings, which causes many questions and methods to be repeatedly investigated, as well as a loss of information.
Given how much social networks and the internet have changed communication and access to information over the past years, there is a need for this change to reach research and communication in the life sciences and to transform the way we think not only about solving and approaching research questions but also about the information and insights we gain as a whole.

Philosophy by Sascha Benjamin Fink – Uni Osnabrück

The most important source of information for philosophers is http://philpapers.org/. You can follow trends going on in your field of interest. Philpapers has a list of almost all papers together with their abstracts, keywords and categories as well as a link to the publisher. Additional information about similar papers is displayed.
Every category of papers is managed by some editor. For each category it is possible to subscribe to a newsletter; in this way, once per month I am informed about current publications in journals related to my topic of interest. Every user is able to create an account and manage their literature and the papers they are interested in.
Other research and information exchange methods among philosophers consist of mailing lists, reading clubs and blogs. Have a look at David Chalmers' blog list. Blogs are also becoming more and more important; unfortunately they are usually on general topics and discuss developments of the community (e.g. Leiter's Blog, Chalmers' Blog and Schwitzgebel's Blog).
But all together I still think that for me a centralized service like Philpapers is my favourite tool because it aggregates most information. If I don’t hear about it on Philpapers usually it is not that important. I think among Philosophers this platform – though incomplete – seems to be the standard for the next couple of years.

Problems

As a scientist it is crucial to be informed about the current developments in the research area. Abstracting from the reports above we divide the tasks roughly into the following stages.

1. Finding and filtering new publications:

  • What is happening right now? What are the current hot topics in my area? What are current trends? (→ Check arXiv/Twitter)
  • Did a friend of mine write something? Did a “big shot” write something?
    (→ Check meta information: title, authors)
  • Are my colleagues excited about a new development? (→ Talk to them.)

2. Getting more information about a given paper:

  • What is actually done in a given paper? Is it relevant for me? Is it really new? Is it a breakthrough? (→ Read abstracts. Find a good readable summary/review.)
  • Judge the quality of a paper: Is it correct? Is it well written?
    ( → Where is it published, if at all? Skim through content.)

Finally there is a fundamental decision: shall I read the whole paper, or not? This leads us to the next task.

3. Understanding a paper: Understanding a paper in depth can be a very time consuming and tedious process. The presentation is often very short and much knowledge is assumed from the reader. The notation choices can be bad, so that even the statements are hard to understand. In effect the paper is easily readable only for a very small circle of specialists in the area. If one is not in the lucky situation of belonging to that circle, one usually applies the following strategies:

  1. Lookup references. This forces you to process a whole tree of older papers which might be hard to read, and hard to get hold of. Sometimes it is worthwhile to consult a textbook to polish up fundamentals.
  2. Finding additional resources. Is there a review? Is there a related video lecture or slides explaining the material in more detail? Is the author going to a conference in the near future, or even giving a seminar in the area?
  3. Join forces. Find people thinking about the same paper: Has somebody at my department already read the paper, so that I can ask some questions? Is there enough interest to make a reading group, or more formally, run a seminar about that paper.
  4. Contact the author. This is a last resort. If you have struggled with understanding the paper for a very long time and really need/want to get it, you might eventually write an email to the author – who might respond, or not. Sometimes even errors are found – and not published! And indeed, there is no easy way to publish “errata” anywhere on the net.

In mathematics most papers do not get read through to the end. One uses strategies 1 & 2 till one gets stuck and moves on to something more exciting. The chances of survival are much better with strategy 3, where one is committed to putting a lot of effort into it over weeks.

4. Finding related work. Where to go from there? Is the paper superseded by a more recent development? Which are the relevant papers the author builds upon? What are the historic influences? What are the founding ideas of the subject? Finding related work is very time consuming. It is easy to overlook things given that the references are often vast, and sometimes hard to get hold of. Getting information about citations often requires access to commercial databases.

Basic idea:

All researchers around the world are faced with the same problems and come up with their individual solutions. There are great synergies in bringing these people together with an online platform! Most of the addressed problems are solved with a paper centric service which allows you to…

  • …get to know other readers of the paper.
  • …exchange ideas with the other readers.
  • …share the gained insights with the community.
  • …ask questions about the paper.
  • …discuss the paper.
  • …review the paper.

We want to do that with a new mixture of a traditional Q&A system (like StackExchange or MathOverflow), a paper database and social features. The key features of this system are as follows:

Openness: We follow a strict openness principle. The software will be developed as open source. All data generated on this site will be under a Creative Commons license (like Wikipedia) and will be made available to the community in the form of database dumps or an API (open data).

We use two different types of content sites in our system: Papers and Discussions.

Paper sites. A paper site is dedicated to a single publication and has the following features:

  1. Paper meta information
    – show title, author, abstract, journal, tags
    – leave a comment
    – write a review (with wiki option)
    – vote up/down
  2. Paper resources
    – show pdfs, slides, notes, video lectures, etc.
    – add a resource
  3. Related Work
    – show the reference-tree and citations in an intelligent way.
  4. Discussions:
    – show related discussions
    – start a new discussion
  5. Social features
    – bookmark
    – share on G+, twitter

The point “Related Work” deserves some further explanation. The citation graph offers a great deal more information than just a list of references. Together with user generated content like votes, the individual paper bookmarks and the social graph, one has a very interesting data set which can be harvested. We want to offer at least views with respect to: popularity / topics / read by friends. Later on one could add more sophisticated, even graphical views on this graph. A sketch of such a combined ranking follows below.
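A minimal sketch of what such a combined ranking could look like, under assumed data structures (a paper record with votes and a set of referenced paper ids, plus the set of paper ids bookmarked by friends); the weights are made up for illustration and are not a specification:

```python
# Hypothetical sketch: combine the citation graph, community votes and
# friends' bookmarks into one "related work" score.  The Paper class,
# the field names and the weights are all assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class Paper:
    paper_id: str
    title: str
    votes: int = 0                                 # community up-votes minus down-votes
    references: set = field(default_factory=set)   # ids of papers this paper cites

def rank_related(paper, corpus, friend_bookmarks,
                 w_cite=1.0, w_vote=0.5, w_friend=2.0, top_k=10):
    """Score every other paper in `corpus` (dict id -> Paper) by citation
    proximity, popularity and friends' bookmarks; return the top_k ids."""
    scores = {}
    for other_id, other in corpus.items():
        if other_id == paper.paper_id:
            continue
        # a direct citation in either direction counts as "related"
        cited = other_id in paper.references or paper.paper_id in other.references
        score = (w_cite * (1.0 if cited else 0.0)
                 + w_vote * other.votes
                 + w_friend * (1.0 if other_id in friend_bookmarks else 0.0))
        if score > 0:
            scores[other_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A real implementation would of course also exploit the full reference tree and topic information rather than direct citation links only.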


Discussion sites.
A discussion looks more like a traditional Q&A question, with the difference that each discussion may have (many) related papers. A discussion site contains:

  1. Discussion meta information (title, author, body)
  2. Discussion content
  3. Related papers
  4. Voting
  5. Follow/Bookmark

Besides the content sites we want to provide the following features:

News Stream. This is the start page of our website. It will be generated from the network consisting of friends, papers and authors. There should be several modes like:

  • hot: heavily discussed papers/discussions
  • new papers: list new publications (filtered by tag, like arXiv feed)
  • social: What did your friends do lately?
  • default: an intelligent mix of recent activity that is relevant to the logged-in user (a sketch of such a mix follows below)


Moreover, filtering by tag should always be available.
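As an illustration of the “default” mode, here is a minimal sketch under assumed data shapes (each activity item carries a timestamp, a vote count and the acting user); the exponential decay, the 24-hour half-life and the friend bonus are arbitrary assumptions, not the planned algorithm:

```python
# Hypothetical sketch of the "default" news-stream mode: each activity item
# is scored by recency (exponential decay), popularity (votes) and whether a
# friend caused it.  The item format, half-life and bonus are assumptions.
import math
import time

def feed_score(item, friends, now=None, half_life_hours=24.0):
    """item is assumed to look like {"timestamp": ..., "votes": ..., "actor": ...}."""
    now = now if now is not None else time.time()
    age_hours = (now - item["timestamp"]) / 3600.0
    recency = math.exp(-math.log(2.0) * age_hours / half_life_hours)  # halves every 24 h
    friend_bonus = 2.0 if item["actor"] in friends else 0.0
    return recency * (1.0 + item["votes"]) + friend_bonus

def default_feed(items, friends, limit=20):
    """Return the highest scoring activity items, best first."""
    return sorted(items, key=lambda it: feed_score(it, friends), reverse=True)[:limit]
```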

Search bar:

  • Searches the contents of the site, but should also find papers in freely available databases (e.g. arXiv). Adding a paper should be a very seamless process from there.
  • Search result ranking uses vote and view information.
  • Personalized search information. (Physicists usually do not want sociology results.)
  • Auto completion on paper titles, authors and discussions (sketched below).
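
As a minimal sketch of the auto completion item above, assuming only paper titles and view counts are available as a ranking signal, a sorted list plus binary search already yields prefix lookup with top-k ranking; this is an illustration, not the planned implementation:

```python
# Hypothetical sketch of auto completion on paper titles: keep the titles
# sorted, locate the prefix range with binary search, then rank the matches
# by view count.  The data layout and the scoring are assumptions.
import bisect

class TitleCompleter:
    def __init__(self, titles_with_views):
        # titles_with_views: iterable of (title, view_count) pairs
        self.entries = sorted((t.lower(), t, v) for t, v in titles_with_views)
        self.keys = [key for key, _, _ in self.entries]

    def complete(self, prefix, top_k=5):
        p = prefix.lower()
        lo = bisect.bisect_left(self.keys, p)
        hi = bisect.bisect_right(self.keys, p + "\uffff")  # crude upper bound for the prefix range
        matches = list(self.entries[lo:hi])
        matches.sort(key=lambda e: e[2], reverse=True)      # most viewed first
        return [title for _, title, _ in matches[:top_k]]

# Example:
# completer = TitleCompleter([("On the Electrodynamics of Moving Bodies", 120),
#                             ("On Computable Numbers", 300)])
# completer.complete("on c")   # -> ["On Computable Numbers"]
```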

Social: (hard to implement, maybe for second version!)

  • Easily refer to users by @-syntax familiar from Twitter/Google+
  • Maintain a friendship / trust graph
  • Friendship recommendations
  • Find friends from Google+ on the site

Benefits

Our proposed website improves on the above-mentioned problems in the following ways.
1. Finding and filtering new publications: This step can be improved with even very little community effort:

  • Tell other people that you are interested in the paper. Vote it up or leave a comment if you are very excited about it.
  • Point out a paper to a colleague.

2. Getting more information about a given paper:

  • Write a summary or review about a paper you have read or skimmed through. Maybe the introduction is hard to read or some results are not clearly stated.
  • Can you recommend reading this paper? Vote it up!
  • Ask a colleague for his opinion on the paper. Maybe he can write a summary?

Many reviews of new papers are already being written. E.g. MathSciNet and Zentralblatt maintain large databases of reviews which are provided by the community but are not freely available. Many authors would be much happier to contribute them to an open system!
3. Understanding a paper: Here are the major synergies which we want to address with our project.

  • Ask a question: Why is the author using this experimental method? How does Lemma 3.4 work? Why do I need this assumption? What is the intuition behind the “virtual truncation”? What implications does this work have?
  • Start a discussion (which might involve more than one paper): What is the difference between these two papers? Is there a reference explaining this more clearly? What should I read in advance to understand the theory?
  • Add resources. Tell the community about related videos, notes, books etc. which are available on other sites.
  • Share your notes. If you have discussed a paper in a reading class or seminar, collect your notes or opinions and make them available to the community.
  • Restate interesting statements. Tell the community when you have found a helpful result which is buried inside the paper. In that way Google may find it!

4. Finding related work. Having a well structured and easily navigable view on related papers simplifies the search a lot. The filtering benefits from the content generated by the users (votes) and individual information, like friends who have written/bookmarked a paper.

Similar Sites on the Web

There are several discussions in Q&A forums which address precisely this problem.

We found three sites on the internet which follow a similar approach and which we examined more carefully.
1. There is a social network which has most of our features implemented:

researchgate.net
“Connect with researchers, make your work visible, and stay current.”

The Economist has dedicated an article to them. It is essentially a Facebook clone with special features for scientists.

  • Large, fast-growing community: about 1.4 million users, growing by roughly 50,000 per month. Mainly biology and medicine.
    (As Daniel Mietchen points out, the size might be misleading due to institutional accounts.)
  • Very professional look and feel. Company from Berlin, Germany, funded by VC (48 people involved, 10 jobs advertised).
  • Huge Feature set:
    • Profile site, Connect to friends
    • News Feed
    • Publication Database, Conference Finder, Jobmarket
    • Every paper has its own page, with:
      • Voting up/down
      • Comments
      • Metadata (title, author, abstract, preview)
      • Social Media (Share, Bookmark, Follow author)
    • Organize Workgroups/Reading Classes.

Differences to our approach:

  • Closed Data / Closed Source
  • Very complex site which solves a lot of purposes
  • Only very basic features on paper site: vote/comment.
  • QA system is not linked well to paper database
  • No MathML
  • Mainly populated by undergraduates

2. Another website which comes reasonably close is:

http://www.sciweavers.org/

“an academic network that aggregates links to research paper preprints
then categorizes them into proceedings.”

  • Includes a large collection of online tools for various purposes
  • Has a big library of papers/software/datasets/conferences for computer science.
    Paper sites have:
    • Meta information and preview
    • Vote functionality and view statistics, tags
    • Comments
    • Related work
    • Bookmarking
    • Author information
  • User profiles (no friendships)


Differences to our approach:

  • Focus on computer science community
  • Comments and discussions are well hidden on paper sites
  • No News stream
  • Very spacious design

 
3. Another very similar site is:

journalfire.com – beta
“Share what you read – connect to colleagues – create journal clubs.”

It has the following features:

  • Comment on Papers. Activity feed (?). Follow articles.
  • Host Journal Clubs. Create Events related to papers.
  • Powerful search box fetching papers from arXiv and PubMed (slow)
  • Social features on site: User profiles, friend finder (no fb/g+ integration yet)
  • News feed – from subscribed papers and friends
  • Easy paper import via Bookmarklet
  • Good usability!! (but slow loading times)
  • Private reading clubs cost money!

They are very skilled: maintained by 3 PhD students/postdocs from Caltech and MIT.

Differences to our approach:

  • Closed Data, Closed Source
  • This site also (currently) misses out on ranking features
  • Very Closed model – Signup required
  • Weak crowdsourcing: cannot add meta information

The site is still at its very beginning with few users. The project started in 2010 and has not gained much momentum since.

The other sites are roughly classified into the following categories:
1. Single people who are following a very similar idea:

  • annotatr.appspot.com. Combines a metadata base with the Disqus plugin. You can comment but not rate. Good usability. Nice CSS. Good search function. No MathML. No related-article suggestions. Maintained by two academics in their private time. Hosted on Google Apps. Closed Source – Closed Data.
  • r-Forum – a resource where mathematicians can collect reviews and corrections of a resource (e.g. paper, talk, …). A simple Vanilla-Forum/Wiki with almost no content, used by maybe 12 people in the US. No automated data import. No rating system.
  • http://math-arch.org/ – Post comments on math papers. Very bad usability – one even gets errors. Maintained by a group of Russian programmers, LogicSun. Closed Source – Closed Data.

Analysis: Although the principal idea of connecting people reading papers is there, the implementation is very poor in terms of usability and even basic programming. Voting features are also missing.

2. (Semi) Professional sites.

  • Public Library of Science – very professional, huge paper database for mainly biology and medicine. Features full-text papers and lots of interesting meta information including references. Has comment features (not very visible) and a news stream on the start page.
    No Q&A features (+1, ask question) on the site. Only published articles are on the site.
  • Mendeley.com – Huge bibliographic database with bookmarking and social features. You can organize reading groups in there, with comments and notes shared among the participants. Features a news stream with papers by friends. Nice import. Impressive full-text data and reference features.
    No Q&A features for papers. No comments on papers. Requires signup to do anything useful.
  • papercritic.com – Open review database. Connected to the Mendeley bibliographic library. You can post reviews. No rating. No comments. Not open: Mendeley is commercial.
  • webofknowledge.com. Commercial academic citation index.
  • zotero.org – features a program that runs inside the browser; an “easy-to-use tool to help you collect, organize, cite, and share your research sources”.

Analysis: The goal of all these tools is to simplify reference management by providing metadata like references, citations, abstracts and author profiles. Commenting features on the paper sites are absent or not promoted.
3. Vaguely related sites which solve different problems:

  • citeulike.org – Social bookmarking for papers. Closed Source – Open Data.
  • http://www.scholarpedia.org. A peer reviewed open access encyclopedia.
  • Philica.com – Online journal which publishes articles from any field along with their reviews.
  • MathSciNet/Zentralblatt – Review database for math community. Closed Source – Commercial.
  • http://f1000research.com/ – Online Journal with a public, post publish review process. “Open Science – Open Data – Open Review”
  • http://altmetrics.org/manifesto/ as an emerging trend from the web-science trust community. Their goal is to revolutionize the review process and create better filters for scientific publications making use of link structures and public discussions. (Might be interesting for us).
  • http://meta.wikimedia.org/wiki/WikiScholar – one of several ideas under discussion at Wikimedia for a central repository of references (that are cited on Wikipedias and other Wikimedia projects)

Upshot of all this:

There is not a single site featuring good Q&A features for papers.

If you like our approach you can contact us, contribute to the source code or find some starting documentation!
So the plan is to fork an open source question-and-answer system and enrich it with features fulfilling the needs of scientists and some social aspects, which will eventually help to rank the related work of a paper.
Feel free to provide us with feedback and wishes and join our effort!
