How Generalized Language Models outperform Modified Kneser Ney Smoothing by a Perplexity drop of up to 25%
https://www.rene-pickhardt.de/how-generalized-language-models-outperform-modified-kneser-ney-smoothing-by-a-perplexity-drop-of-up-to-25/ (Tue, 15 Apr 2014)

After 2 years of hard work I can finally and proudly present the core of my PhD thesis. It started with Till Speicher and Paul Georg Wagner implementing one of my ideas for next word prediction as an award-winning project for the Young Scientists competition, followed by several iterations over this idea which resulted in a deeper understanding of what I am actually doing. Out of this I have developed the theory of Generalized Language Models and evaluated its strength together with Martin Körner over the last years.
As I will present in this blog article, and as you can read in my publication (ACL 2014), it seems that Generalized Language Models outperform Modified Kneser-Ney Smoothing, which has been accepted as the de facto state-of-the-art method for the last 15 years.

So what is the idea of Generalized Language Models in non-scientific terms?

When you want to assign a probability to a sequence of words you will run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and by interpolating higher order models (models over longer word sequences) with lower order language models. While this idea is strong and helpful, it is usually applied in the same way: in order to use a shorter model, the first word of the sequence is omitted, and this is iterated. The problem occurs if one of the last words of the sequence is the really rare word; omitting words at the front will then not help.
So the simple trick of Generalized Language Models is to smooth a sequence of n words with n-1 shorter models which skip a word at position 1 to n-1 respectively.
Then we combine everything with Modified Kneser-Ney Smoothing just like it was done with the previous smoothing methods.
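
To make the skipping concrete, here is a minimal sketch (for illustration only; class and method names are made up and this is not taken from our actual implementation, which is linked below) that builds, for a sequence of n words, the n-1 lower-order patterns in which exactly one word at positions 1 to n-1 is replaced by a wildcard:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipPatterns {

    // For a sequence of n words, build the n-1 generalized lower-order
    // patterns: each one replaces exactly one word at position 1 .. n-1
    // (i.e. anywhere except the predicted last word) with a wildcard "*".
    static List<String> skipPatterns(String[] words) {
        List<String> patterns = new ArrayList<String>();
        for (int skip = 0; skip < words.length - 1; skip++) {
            String[] copy = Arrays.copyOf(words, words.length);
            copy[skip] = "*";
            patterns.add(String.join(" ", copy));
        }
        return patterns;
    }

    public static void main(String[] args) {
        // prints [* very rare word, the * rare word, the very * word]
        System.out.println(skipPatterns(new String[]{"the", "very", "rare", "word"}));
    }
}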

Why would you do all this stuff?

Language models have a huge variety of applications like spell checking, speech recognition, next word prediction (autocompletion), machine translation, question answering, and so on.
Most of these problems make use of a language model at some place. Creating language models with lower perplexity lets us hope to increase the performance of the above mentioned applications.

Evaluation Setup, methodology, download of data sets and source code

The data sets come in the form of structured text corpora which we cleaned from markup and tokenized to generate word sequences.
We filtered the word tokens by removing all character sequences which did not contain any letter, digit or common punctuation marks.
Eventually, the word token sequences were split into word sequences of length n which provided the basis for the training and test sets for all algorithms.
Note that we did not perform case-folding nor did we apply stemming algorithms to normalize the word forms.
Also, we did our evaluation using case sensitive training and test data.
Additionally, we kept all tokens for named entities such as names of persons or places.
All data sets have been randomly split into a training and a test set on a sentence level.
The training sets consist of 80% of the sentences, which have been used to derive n-grams, skip n-grams and corresponding continuation counts for values of n between 1 and 5.
Note that we have trained a prediction model for each data set individually.
From the remaining 20% of the sequences we have randomly sampled a separate set of 100,000 sequences of 5 words each.
These test sequences have also been shortened to sequences of length 3 and 4 and provide the basis for our final experiments to evaluate the performance of the different algorithms.
We learnt the generalized language models on the same split of the training corpus as the standard language model using modified Kneser-Ney smoothing and we also used the same set of test sequences for a direct comparison.
To ensure rigour and openness of research you can download the data set for training as well as the test sequences and you can download the entire source code.
We compared the probabilities of our language model implementation (which is a subset of the generalized language model) using KN as well as MKN smoothing with the Kyoto Language Model Toolkit. Since we got the same results for small n and small data sets we believe that our implementation is correct.
In a second experiment we have investigated the impact of the size of the training data set.
The wikipedia corpus consists of 1.7 bn. words.
Thus, the 80% split for training consists of 1.3 bn. words.
We have iteratively created smaller training sets by decreasing the split factor by an order of magnitude.
So we created an 8% / 92% split, a 0.8% / 99.2% split, and so on.
We have stopped at the 0.008% / 99.992% split as the training data set in this case consisted of fewer words than our 100k test sequences, which we still randomly sampled from the test data of each split.
Then we trained a generalized language model as well as a standard language model with modified Kneser-Ney smoothing on each of these samples of the training data.
Again we have evaluated these language models on the same random sample of 100,000 sequences as mentioned above.
We have used Perplexity as a standard metric to evaluate our Language Model.
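
For readers who do not have the definition at hand: perplexity is the exponentiated average negative log-probability that a model assigns to the predicted words of the test set, so lower values are better. A minimal sketch of the metric itself (how the conditional probabilities are obtained from the test sequences is a matter of the evaluation setup; the class below is only illustrative):

public class PerplexityExample {
    // Perplexity = exp( -(1/N) * sum_i ln p_i ), where p_i is the probability
    // the model assigns to the i-th predicted word given its context.
    static double perplexity(double[] conditionalProbabilities) {
        double logSum = 0.0;
        for (double p : conditionalProbabilities) {
            logSum += Math.log(p);
        }
        return Math.exp(-logSum / conditionalProbabilities.length);
    }

    public static void main(String[] args) {
        // a model that always assigns probability 0.1 has perplexity 10
        System.out.println(perplexity(new double[]{0.1, 0.1, 0.1, 0.1}));
    }
}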

Results

As a baseline for our generalized language model (GLM) we have trained standard language models using modified Kneser-Ney Smoothing (MKN).
These models have been trained for model lengths 3 to 5.
For unigram and bigram models MKN and GLM are identical.
The perplexity values for all data sets and various model orders can be seen in the next table.
In this table we also present the relative reduction of perplexity in comparison to the baseline.

Absolute perplexity values and relative reduction of perplexity from MKN to GLM on all data sets for models of order 3 to 5

As we can see, the GLM clearly outperforms the baseline for all model lengths and data sets.
In general we see a larger improvement in performance for models of higher orders (n=5).
The gain for 3-gram models, by contrast, is negligible.
For German texts the increase in performance is the highest (12.7%) for a model of order 5.
We also note that GLMs seem to work better on broad domain text rather than special purpose text, as the reduction on the wiki corpora is consistently higher than the reduction of perplexity on the JRC corpora.
We made consistent observations in our second experiment where we iteratively shrank the size of the training data set.
We calculated the relative reduction in perplexity from MKN to GLM for various model lengths and the different sizes of the training data.
The results for the English Wikipedia data set are illustrated in the next figure:
Variation of the size of the training data on 100k test sequences on the English Wikipedia data set with different model lengths for GLM.

We see that the GLM performs particularly well on small training data.
As the size of the training data set becomes smaller (even smaller than the evaluation data), the GLM achieves a reduction of perplexity of up to 25.7% compared to language models with modified Kneser-Ney smoothing on the same data set.
The absolute perplexity values for this experiment are presented in the next table.
Our theory as well as the results so far suggest that the GLM performs particularly well on sparse training data.
This conjecture has been investigated in a last experiment.
For each model length we have split the test data of the largest English Wikipedia corpus into two disjoint evaluation data sets.
The data set unseen consists of all test sequences which have never been observed in the training data.
The set observed consists only of test sequences which have been observed at least once in the training data.
Again we have calculated the perplexity of each set.
For reference, the values for the complete test data set are also shown in the following table.
Absolute perplexity values and relative reduction of perplexity from MKN to GLM for the complete test file and for the split into observed and unseen sequences, for models of order 3 to 5. The data set is the largest English Wikipedia corpus.

As expected we see the overall perplexity values rise for the unseen test case and decline for the observed test case.
More interestingly we see that the relative reduction of perplexity of the GLM over MKN increases from 10.5% to 15.6% on the unseen test case.
This indicates that the superior performance of the GLM on small training corpora and for higher order models indeed comes from its good performance properties with regard to sparse training data.
It also confirms that our motivation to produce lower order n-grams by omitting not only the first word of the local context but systematically all words has been fruitful.
However, we also see that for the observed sequences the GLM performs slightly worse than MKN; for these cases we find the relative change to be negligible.

Conclusion and links

With these improvements we will continue to evaluate other methods of generalization and also try to see if the novel methodology works well with the applications of language models. You can find more resources at the following links:

If you have questions, research ideas or want to collaborate on one of my ideas feel free to contact me.

Teaching Web Science (web architecture and Web ethics) to students
https://www.rene-pickhardt.de/teaching-web-science-web-architecture-and-web-ethics-to-students/ (Tue, 26 Mar 2013)

In July 2012 we taught a course for the German National Summer School for high school students. The course consisted of 50 hours over 14 days. Due to some specific settings of the Summer School we had to make a few adjustments to the format of our curriculum and lectures. Still we gathered some good experiences for future teaching. The main lesson learnt was that knowledge of the Internet protocol suite contributes to a better understanding of the decentralized and open aspects of the web, which in turn leads to a better comprehension of the ethical aspects of the web such as net neutrality, copyright, the relevance paradox, censorship and others. We propose that any curriculum about Web Science should include a fair part of lectures on web architecture and the Internet protocol stack.

Course context (level, students, discipline, etc.)

The course was designed for 16 highly gifted high school students (11th and 12th grade). The level was supposed to be manageable for a second year undergraduate student. Since our students came from different grades and schools we were forced to sacrifice some course time to teach some basic programming skills. Thus we could not cover all the aspects of Web Science. Instead we focused on three main course objectives:

Course objectives and targeted competencies

By the end of the course our students should…

  1. understand the current web architecture, in particular its decentralized and open aspects.
  2. gain the ability to form and defend a solid opinion on currently ongoing ethical discussions related to the Web.
  3. realize that the study of the Web needs a much broader skill set than knowledge of Computer Science alone.

Course content (Structure, sections, topics, references)

All students were asked to prepare a talk and to read the book ”Weaving the Web” by Sir Tim Berners-Lee before the summer school started. Ten of the talks covered the technical foundations, starting with binary numbers and going all the way to the application layer and all the necessary protocols. This included the theoretical study of IP, TCP and HTTP as well as routing algorithms (BGP) and DNS. To ensure a better understanding, the students had to form groups and implement a simple web server and a web client that were able to process HTTP/1.0 GET requests during course time. This was done using the Java programming language and the socket classes from the Java API. These topics were covered in the first week of the course. In the second half we focused on the ethics of the web. After each talk on an ethical topic, which was supposed to give an overview in about 20 minutes, we entered a 2-hour group discussion. For example, for the discussion on net neutrality we knew the following interest groups from the overview talk: large internet providers, big web companies, small web companies, politicians, and consumers. Students were randomly assigned to one of these groups. Within 10 minutes they had to prepare a list of arguments that would reflect the interests of their particular group as well as arguments they would expect from other groups. While discussing the issue at a round table they had to find a good solution respecting the technical nature of the web and the interests of their group.
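
To give a rough impression of the scope of that hacking exercise, a bare-bones HTTP/1.0 client over plain sockets along the following lines already demonstrates the essential ideas (this is a sketch for illustration, not the students' actual code):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class MiniHttpClient {
    public static void main(String[] args) throws Exception {
        // open a TCP connection to the web server (port 80 = HTTP)
        Socket socket = new Socket("www.wikipedia.org", 80);
        PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream()));

        // send a minimal HTTP/1.0 GET request; the empty line ends the header
        out.print("GET / HTTP/1.0\r\n");
        out.print("Host: www.wikipedia.org\r\n");
        out.print("\r\n");
        out.flush();

        // print the status line, headers and body returned by the server
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        socket.close();
    }
}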

Evaluation methods (Tests, projects, papers, etc.)

Even though the Summer School is very competitive, participation is voluntary, so there can't be an exam or anything similar at the end. Also, all work had to be completed during the 50 hours of course time without any homework assignments. We had three evaluation methods to ensure the comprehension of the course content.
1. Hacking project: As already mentioned, students implemented a web server and web client during the first half of the course. Since they worked in groups of 2 or 3 students and were new to programming, we teachers helped them out, which gave us good feedback on whether or not students understood the content.
2. Oral presentation: After the middle of the course students had to prepare and give a presentation for an interdisciplinary audience, i.e. the students from the other courses of the summer school, none of which covered any IT topics. We asked the students to create a theatre role-play of what happens if someone types www.wikipedia.org into a web browser and hits the enter key. All students placed routing tables on the seats for the audience, created TCP/IP packets (filled with candy that represented the time to live) and routed DNS requests as well as HTTP requests together with the TCP handshake around the audience in the classroom, demonstrating that the basic decentralized web architecture was understood by everyone in the course.
3. Paper writing: During the last days of the course the students were expected to collectively prepare a 25-page documentation, meeting scientific standards, of what they had learned during the summer school. The process of creating this documentation is not only guided by us teachers but also gives us a nice feedback loop to see if the goals of the course have been achieved.
Overall we can say that the concept of the course worked really well. Especially putting such a high focus on the Web Architecture and actually letting students implement protocols helped to gain a deeper understanding.

Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language
https://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/ (Tue, 06 Nov 2012)

As I said yesterday, I have been busy over the last months producing content, so here you go. For our project Related Work we are most likely to use neo4j as the core database. This makes sense since we are basically building some kind of a social network, and most queries that we need to answer while offering the service or during data mining have a friend-of-a-friend structure.
For some of the queries we are doing counting or aggregations, so I was wondering what the most efficient way of querying against a neo4j database is. So I did a benchmark with quite surprising results.
Just a quick remark: we used a database consisting of papers and authors extracted from arxiv.org, one of the biggest preprint sites available on the web. The data set is available for download and reproduction of the benchmark results at http://blog.related-work.net/data/
The database as a neo4j file is 2 GB (zipped), and the schema looks pretty much like this:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we were trying to find coauthors, which is basically a friend-of-a-friend query following the author relationship (or a breadth first search of depth 2).
As we know, there are basically 3 ways of communicating with the neo4j database:

Java Core API

Here you work on the node and relationship objects within Java. Formulating a query once you have fixed an author node looks pretty much like this:

for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
    // every paper the author has written
    Node paper = rel.getOtherNode(author);
    for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
        // every author of that paper
        Node coAuthor = coAuthorRel.getOtherNode(paper);
        if (coAuthor.getId() == author.getId()) continue;   // skip the start author
        resCnt++;
    }
}

We see that the code can easily look very confusing (if queries are getting more complicated). On the other hand, one can easily combine several similar traversals into one big query, making readability worse but increasing performance.

Traverser Framework

The Traverser Framework ships with the Java API and I really like the idea of it. I think it is really easy to understand the meaning of a query, and in my opinion it really helps to achieve good readability of the code.

// Traversal.description() is a static factory in the neo4j 1.x kernel API
TraversalDescription coAuthors = Traversal.description()
        .breadthFirst()
        .relationships(RelationshipTypes.AUTHOROF)
        .evaluator(Evaluators.atDepth(2))
        .uniqueness(Uniqueness.NONE);
for (Path p : coAuthors.traverse(author)) {
    Node coAuthor = p.endNode();
    resCnt++;
}

Especially if you have a lot of similar queries or queries that are refinements of other queries you can save them and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is the Cypher Query Language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why: it is a really beautiful language that is close to SQL (looking at Stack Overflow it is actually frightening how many people are trying to answer FOAF queries using MySQL) but still emphasizes the graph-like structure.

ExecutionEngine engine = new ExecutionEngine(graphDB);
String query = "START author=node(" + author.getId() +
        ") MATCH author-[:" + RelationshipTypes.AUTHOROF.name() +
        "]-()-[:" + RelationshipTypes.AUTHOROF.name() +
        "]- coAuthor RETURN coAuthor";
ExecutionResult result = engine.execute(query);
// columnAs returns an iterator over the values of the requested result column
scala.collection.Iterator<Node> it = result.columnAs("coAuthor");
while (it.hasNext()) {
    Node coAuthor = it.next();
    resCnt++;
}
I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let's just point out that Cypher is quite expressive.

What were the results?

All queries have been executed 11 times, where the first run was thrown away since it warms up the neo4j caches. The values are averages over the other 10 executions.
  • The Core API is able to answer about 2000 friend-of-a-friend queries per second (I have to admit, on a very sparse network).
  • The Traverser Framework is about 25% slower than the Core API.
  • Worst is Cypher, which is slower by at least one order of magnitude and only able to answer about 100 FOAF-like queries per second.
  • I was shocked, so I talked with Andres Taylor from neo4j who is mainly working on Cypher. He asked me which neo4j version I used and I said it was 1.7. He told me I should check out 1.9 since Cypher has become more performant. So I ran the benchmarks on neo4j 1.8 and neo4j 1.9; unfortunately Cypher became slower in the newer neo4j releases.

One can see that the Core API outperforms Cypher by an order of magnitude and the Traverser Framework by about 25%. In newer neo4j versions the Core API became faster and Cypher became slower.

Quotes from Andres Taylor:

"Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on. The focus in this first phase has been to explore the language, to learn about how our users use the query language, and to expand the feature set to a reasonable level."

"I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. Like every language ever created, in the beginning you can always do better than the compiler by writing by hand, but eventually the compiler catches up."

Conclusion:

So far I have only been using the Java Core API to work with neo4j, and I will continue to do so.
If you are in a high-speed scenario (I believe every web application is one) you should really think about switching to the neo4j Java Core API for writing your queries. It might not be as nice looking as Cypher or the Traverser Framework, but the gain in speed pays off.
Also I personally like the amount of control that you have when doing the traversals over the core API yourself.
Additionally I will soon post an article on why scripting languages like PHP, Python or Ruby aren't suitable for building web applications anyway. So changing to the Core API makes sense for several reasons.
The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
The detailed results can be found in this spreadsheet.

Open access and data from my research. Old resources for various topics finally online.
https://www.rene-pickhardt.de/open-access-and-data-from-my-research-old-resources-for-various-topics-finally-online/ (Mon, 05 Nov 2012)

Being strongly pro open access, I always try to publish all my work on my blog, but sometimes I am busy or I forget to update, so today I took the time to look at all my old drafts and the stuff that hasn't been published yet. So here is a list of new content on my blog that should have been published long ago; I also linked it in the articles of interest:

In the last months I have created quite some content for my blog and it will be published over the next weeks. So watch out for screencasts on how to create an autocompletion in GWT with neo4j, how to create n-grams from Wikipedia, thoughts and techniques for Related Work, and research ideas and questions that we found but probably do not have the time to work on.

How to do a presentation in China? Some of my experiences
https://www.rene-pickhardt.de/how-to-do-a-presentation-in-china-some-of-my-experiences/ (Fri, 02 Nov 2012)

So the culture is different from Western culture, we all know that! I am certainly not an expert on China, but after living in China for almost 2 years, knowing some of the language, working in a Chinese company, seeing presentations every week and also visiting over 30 Western and Chinese companies based in China, I think I have some insights about how you should organize your presentation in China.
Since I recently went to Shanghai to do a research exchange with Jiaotong University, I was about to give a presentation to introduce my institute and myself. So here you can find my rather uncommon presentation and some remarks on why some slides were designed the way they are.
    http://www.rene-pickhardt.de/wp-content/uploads/2012/11/ApexLabIntroductionOfWeST.pdf

    Guanxi – your relations

First of all I think it is really important to understand that in China everything is related to your relations (http://en.wikipedia.org/wiki/Guanxi). A Chinese business card will always name a few of your best and strongest contacts. This is more important than your address, for example. When a conference starts, people exchange name cards before they sit down and discuss.
This principle of Guanxi is also reflected in the style in which presentations are made. Here are some basic rules:

    • Show pictures of people you worked together with
    • Show pictures of groups while you organized events
    • Show pictures of the panels that run events
    • Show your partners (for business not only clients but also people you are buying from or working together with in general)

    My way of respecting these principles:

    • I first showed a group picture of our institute!
    • For almost every project I also showed pictures of the people that are responsible for it, wherever I could get hold of them.
    • I did not only show the European research projects our university is part of but listed all the different partners and showed their logos.

    Family

The second thing is that in China the concept of family is very important. As a rule of thumb I would say: if you want to do business with someone in China and you haven't been introduced to their family, things are not going to go the way you might expect.
For this reason I included some slides with a world map, zooming further down to the place where I was born, where I studied and where my parents still live!

    Localizing

When I chose a world map I did not only take one in the Chinese language but also one where China was centered. In my contact data I also put Chinese social networks. Remember that Twitter, Facebook and many other sites are blocked in China. So if you really want to communicate with Chinese people, why not get a QQ number or a Weibo account?

    Design of the slides

You have seen this at conferences many times: Chinese people just put a heck of a lot of stuff on a slide. I strongly believe this is due to the fact that reading and recognizing Chinese characters is much faster than reading Western characters. So if your presentation is in the Chinese language, don't be afraid to stuff your slides with information. I have seen many talks by Chinese people that were literally reading word by word what was written on the slides. Whereas in Western countries this is considered bad practice, in China this is all right.

    Language

Speaking of language: of course, if you know some Chinese it shows respect if you at least try to include some of it. I split my presentation into 2 parts, one in Chinese and one in English.

Have an interesting take-away message

So in my case I included the fact that we have PhD positions and scholarships open, and that our institute is really international with English as the working language. Of course I also included some slides about my past and current research like Graphity and Typology.

    During the presentation:

In China it is not rude at all if one's cellphone rings and one has more important stuff to do. You as the presenter should switch off your phone, but you should not be disturbed or annoyed if people in the audience receive phone calls and leave the room to do that business. This is very common in China.
I am sure there are many more rules on how to hold a presentation in China, and maybe I even made some mistakes in my presentation, but at least I have the feeling that the reaction was quite positive. So if you have questions, suggestions or feedback, feel free to drop a line; I am more than happy to discuss cultural topics!

Report of SocialCom2012 online in our new WeST blog
https://www.rene-pickhardt.de/report-of-socialcom2012-online-in-our-new-west-blog/ (Thu, 20 Sep 2012)

Hey everyone,
long time no see! Well yeah, the summer time usually means vacation and traveling and so on. But I have also been busy creating some cool content and visiting a conference!
So first of all, together with my friend Leon Kastler I have established our new WeST blog. The goal is to be more transparent with the research of our institute! Quite a few people in our institute have already shared some nice information. So why don't you give our blog a try and follow it at:
    http://blog.west.uni-koblenz.de/
Since our institute now has its own blog I will in future publish some of my articles over there and will just point to them from this blog. This will happen especially if the content of the articles is funded by research money from our institute. My presentation at SocialCom 2012 was most certainly funded by our institute, so that's why you can read my summary together with some suggestions of interesting papers in my article:
    http://blog.west.uni-koblenz.de/2012-09-20/west-researchers-summary-of-socialcom-2012-in-amsterdam/

Big step towards open access by Great Britain and a comment from Neelie Kroes
https://www.rene-pickhardt.de/big-step-towards-open-access-by-great-britain-and-a-comment-from-neelie-kroes/ (Sun, 19 Aug 2012)

During my vacation a lot of stuff has happened, and it was just today that I came across the following article and discussion: http://royalsociety.org/policy/projects/science-public-enterprise/report/. Yes, you read correctly: the Royal Society wants to create open access to all publications financed by the British government. What a big step! Congratulations to all British people for being such a role model.
It fits perfectly with my project Related Work and other discussions I have been joining, e.g.

Even though this development is very good to see, I am not happy about how the subsequent discussion about models for fulfilling the goals of the Royal Society is going.

Neelie Kroes from the European Commission posted a really nice answer!

I am glad to see this step forward. After my successful submission of Graphity and reading the copyright form of IEEE which I had to sign, I really did have concerns about publishing my work with them.
I am still considering not submitting to big journals and conferences anymore but just publishing on my university's website, my blog and/or on open preprint archives.

Submission history of my first academic research paper (Graphity at SocialCom 2012)
https://www.rene-pickhardt.de/submission-history-of-my-first-academic-research-paper-graphity-at-socialcom-2012/ (Tue, 10 Jul 2012)

My graph index Graphity was – as mentioned in another blog post – accepted at SocialCom 2012. After I explained how it works and shared the source code, I now want to share some information about the history of submissions, reviews, quality of reviews, actions taken and so on. So if you are a coder, hacker, geek or whatsoever, just skip this post and dig into the source code of Graphity. If you are a researcher you might enjoy learning from my experiences.

    2011 April: Defining the research question

While I was wondering about possible research topics, many people told me that running a social network like metalcon and seeing so much data would be like a treasure for research questions. Even though at that time our goal was just to keep metalcon running in its current state, I started thinking about what research questions could be asked. For most questions we just did not collect good data. In this sense I realized that if metalcon should really give rise to research questions, I would have to reimplement big parts of it.
Since metalcon also had the problem of slow page loading times, I decided to efficiently recode the project. I knew that running metalcon on a graph database like neo4j would be a good idea. That is why I started wondering about how to implement a news feed system efficiently and in a scalable manner using a graph database.
    ==> A research question was formulated.

2011 Mid August: Solution to the problem was found

After 3 months of hard thinking and trying out different database designs I had a meeting with my potential supervisor Prof. Dr. Steffen Staab. He would have loved to see some preliminary results. While summing up everything for the meeting, the solution, or better said the idea of Graphity, was suddenly clear before my eyes.
In late August I talked about the ideas in the Oberseminar and I presented a poster at the European Summer School of Information Retrieval. I wondered about where to submit this.
Steffen thought that the idea was pretty smart and could be published in a top journal like VLDB, which he suggested I choose. He said it would be my first scientific work and VLDB has a very short reviewing cycle of less than one month that could provide me with fast feedback. A piece of advice which I did not understand and also did not follow. Well, sometimes you have to learn the hard way…
Being confident about the quality of the index, also due to Steffen's positive feedback and plans for VLDB, my plan was rather to submit to the WWW conference. I thought social networks would be a relevant topic for this conference. Steffen agreed but pointed out that he was a program chair for that conference, which would mean that submissions from his own institute would not be the best idea.
Even though Graphity was not implemented or evaluated yet and no related work had been read, we decided that the paper should be submitted to the SIGMOD conference, with an upcoming deadline only 2 months later.
By the way, having these results and starting this hard work gave me funding for 3 years starting in October 2012!

2011 November: Submission to SIGMOD

After two months of very hard work, especially together with Jonas Kunze, a physicist from metalcon from whom I really learnt a lot about setting up software and evaluation frameworks, the paper was submitted to SIGMOD. Meanwhile I tried to get an overview of the related work (which at that time was kind of the hardest part), since I really did not know how to start and my research question came from a very practical and relevant use case rather than emerging from reading many papers.
Anyway, we finished our submission just in time, together with all the experiments and a – I would still say decent – text about Graphity. (You can have the final copy of the SIGMOD submission on request.)

    2011 middle of November: Blogpost with best content from the paper

I talked to my now supervisor Steffen and asked him what he thought about blogging the results of Graphity already, trying to get some awareness in the community. He was a little skeptical since blogs cannot really be cited and the paper was not published yet. I argued that blogs are becoming more and more important in research. I said that even if a blog post is not a publication in the scientific sense, it is still a form of publication and gains visibility. Steffen agreed to test this out and I have to say the idea to do this was perfect. I put the core results on my blog and received very positive feedback from people working at LinkedIn, Microsoft and from other hackers. I also realized that the problem was currently unpublished / unsolved and relevant to many people.
Interestingly, one of my co-authors tried to persuade me to publish the SIGMOD version of the paper as a technical report. I did not get the point of doing so (he said something like "then it is officially published and no one can steal the idea…"). Up to today I don't get the point of doing this, other than manipulating one's official publication count…

2012 February: Reject from SIGMOD

After all the strong feedback on my blog and via mail, the reviews from the SIGMOD conference were quite disappointing. Every single reviewer highly valued the idea and the relevance of the problem, but they criticized the evaluation and the related work.

    • For the evaluation they criticized that using a memory disk is manipulating the results, and that the distribution and scaling behaviour could not be shown by theoretical and empirical proofs of some big-O notations alone.
    • As for the related work, they were missing some papers, especially from the database community, and as I said related work was definitely one of the weaknesses of the paper.
    • Other feedback was on our notation and naming for some baselines.

    The most disappointing of all the feedback was the following quote:

    W1: I cannot find any strong and novel technical contribution in their algorithms. Their proposed method is too simple and straightforward. It is simply an application of using doubly linked list to graph structures.

To give an answer to this at least once: yes, the algorithm is simple and only based on doubly linked lists. BUT it is in the best complexity class one can imagine for the problem, it scales perfectly, and this was proven. How can one throw away an answer because it is too simple if it is the best possible answer? I sense that this is one of the biggest differences between a mathematician and a computer scientist. In particular, the same reviewer pointed out that the problem was relevant and important.
Having this feedback, we came to the conclusion that the database community might have been the wrong community for the problem. Again, we would have figured this out much faster if we had submitted to VLDB as Steffen had suggested.

2012 February: Resubmission to Hypertext

Only 1 week after the feedback was the submission deadline for the Hypertext conference. Since one week was not really much time, we decided not to implement any of the feedback and just check what another community would say. My supervisor Steffen was not really convinced of this idea, but we argued that the notification of the Hypertext conference was only one month later and this quick and additional feedback might be of help.

2012 late March: Reject from Hypertext

The reviews from the Hypertext conference were rather strange: one strong accept, but the reviewer said he was not an expert in the field; one strong reject from a person that did not understand the correctness of our top-k n-way merge (which was not even core to the paper); and a borderline from one reviewer who really had some interesting comments on our terminology and the related work.
Overall the paper was accepted as a poster presentation, and we decided that this was not sufficient for our work, so we withdrew our submission.

2012 May: Resubmission to SocialCom

Discussions arose about which conference we should target next. We were sure that the style of the evaluation was nothing for the database community. On the other hand, graph databases alone would not be sufficient for semantic conferences, and social is such a hype right now that we really were not sure.
Steffen would have liked to submit the paper to ISWC since this is an important community for our institute. I suggested SocialCom since the core of our paper really lay in social networking. All reviews so far had valued the problem as important and relevant for social networks, and people basically liked the idea.
We tried to find related work for the used baselines and figured out that for our strongest baseline we could not find any related work. 3 days before the submission to SocialCom, Steffen argued that we should drop the name of the paper and not call it Graphity anymore. He said that we should just sell the work as two new indices for social news streams (which I thought was kind of reasonable since the baseline to Graphity really makes sense to use and was not presented in any related work). Also we could enhance the paper with some feedback from my blog and the neo4j mailing list. The only downside of this strategy was that changing the title and changing the story line would require a complete rewrite of the paper, something I was not too keen on doing within 3 days. My co-author and I were convinced that we should stick with our changes from working in the feedback and stick with our argument for the baseline without related work.
We talked back to Steffen. He said he left the final decision to us but would strongly recommend changing the storyline of the paper and dropping the name. Even though I was not 100% convinced and my co-author also did not want to rewrite the paper, I decided to follow Steffen's experience. We rewrote the story line.

2012 July: Accept at SocialCom

The paper was accepted as a full paper at SocialCom, so following Steffen's advice was a good idea. Also, not being discouraged by bad reviews from the wrong community and staying convinced that the work had some strong results turned out to be a good thing. I am sure that posting preliminary results on the blog was really helpful; in this way we could integrate openly accessible feedback from the community into our third submission. I am excited to see how much I will have learnt from this experience for later submissions.
I will publish the paper on the blog as soon as the camera-ready version is done! Subscribe to my newsletter, RSS or Twitter if you don't want to miss the final version of the paper!

Foundations of statistical natural language processing: Review of chapter 1
https://www.rene-pickhardt.de/foundations-of-statistical-natural-language-processing-review-of-chapter-1/ (Tue, 12 Jun 2012)

Due to the interesting results we found by creating Typology, I am currently reading the related work about query prediction and auto completion of sentences. There is quite some interesting academic work available in this area of information retrieval.
While reading these papers I realized that I am not that strong in the field of natural language processing, which seems to have a deep impact on my current research interests. That's why I decided to put in a reading session of some basic work. A short trip to the library and also a look into the cited work of the papers made me find the book Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze. Even though they write in the introduction that their book is far away from being a complete coverage of the topic, I think this book will be a perfect entry point to become familiar with some of the basic concepts in NLP. Since one understands and remembers better what one has read if one writes it down, I am planning to write summaries and reviews of the book chapters here in my blog:

    Chapter 1 Introduction

    This chapter is split into several sections and is supposed to give some motivation. It already demonstrates in a good way that in order to understand natural language processing you really have to bridge the gap between mathematics and computer science on the one hand and linguistics on the other hand. Even in the basic examples given from linguistics there was some notation that I did not understand right away. In this sense I am really looking forward to reading chapter 3 which is supposed to give a rich overview of all the concepts needed from linguistics.
    I personally found the motivating section of chapter 1 also too long. I am about to learn some concepts and right now I don’t really have the feeling that I need to understand all the philosophical discussions about grammar, syntax and semantics.
What I really loved, on the other hand, was the last section “dirty hands”. In this section a small corpus (Tom Sawyer) is used to introduce some of the phenomena that one faces in natural language processing, some of which I already discussed, without knowing that I did, in the article about text mining for linguists on Ulysses. In the book they are of course discussed in a more structured way, but one can easily download the source code from the above mentioned article and play around to understand the basic concepts from the book. Among these were:
Word counts / tokens / types: The basic operation one can do on a text is counting words. This is something that I already did in the Ulysses article. Counting words is interesting since in today's world it can be automated. What I didn't see in my last blog post was that counting words already leads to more insights than just a distribution of words, which I will discuss now:
Distinctive words can be spotted. Once I have a corpus consisting of many different texts I can count all words and create a ranking of the most frequent words. I will realize that for any given text the ranking looks quite similar to that global ranking. But once in a while I might spot some words in a single text in the top 100 of most frequent words that would not appear (let's say) in the top 200 of the global ranking. Those words seem to be distinctive words that are of particular interest for the current text. In the example of Tom Sawyer “Tom” is such a distinctive word.
Hapax legomena: If one looks at all the words in a given text one will realize that most words occur less than 3 times. This phenomenon is called a hapax legomenon and demonstrates the difficulty of natural language processing: the data one analyses is very sparse. The frequent words are most of the time grammatical structure words, whereas the infrequent words carry the semantics. From this phenomenon one gets very quickly to:
Zipf's law: Roughly speaking, Zipf's law says that when you count the word frequencies of a text and order them from the most frequent to the least frequent word, you get a table in which the rank of a word multiplied by its frequency always gives about the same number (saying that the rank is anti-proportional to the frequency of the word). This is of course only an approximation: just imagine that the most frequent word occurs an odd number of times; then the second most frequent word cannot have exactly half that frequency, so rank times frequency cannot be exactly constant.
Anyway, Zipf's law was a very important discovery and has been generalized by Mandelbrot's law (which I so far only knew from chaos theory and fractals). Maybe sometime in the near future I will find some time to calculate the word frequencies of my blog and see if Zipf's law holds (:
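
The experiment mentioned above is easy to script. A rough sketch of how one could check it (the letters-only tokenization is a crude assumption and the class is purely illustrative): count the word frequencies, sort the types by frequency and print rank times frequency for the top ranks; if Zipf's law holds, the last column stays roughly constant.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ZipfCheck {
    public static void main(String[] args) {
        String text = "...";   // put the corpus text here
        // crude tokenization: lowercase and split on anything that is not a letter
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-zäöüß]+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
        }
        // sort the types by descending frequency
        List<Map.Entry<String, Integer>> ranking = new ArrayList<>(counts.entrySet());
        ranking.sort((a, b) -> b.getValue() - a.getValue());
        // Zipf's law predicts rank * frequency to be roughly constant
        for (int rank = 1; rank <= Math.min(50, ranking.size()); rank++) {
            Map.Entry<String, Integer> e = ranking.get(rank - 1);
            System.out.println(rank + "\t" + e.getKey() + "\t" + e.getValue()
                    + "\t rank*freq = " + rank * e.getValue());
        }
    }
}
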
Collocations / bigrams: Another important concept was that of collocations. Many words only have a meaning together; in this sense “New York” or “United States” are more than the sum of the single words. The text pointed out that it is not sufficient to find the most frequent bigrams in order to find good collocations; those bigrams have to be filtered, either with grammatical structures or by normalization. I think calculating a Jaccard coefficient might be interesting (even though it was not suggested in the text), as sketched below. Should I really try to verify Zipf's law in the near future, I will also try to verify my method for calculating collocations. I hope that I would find collocations in my blog like social network, graph database or news stream…
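
My reading of that Jaccard idea (this is my own interpretation, not something from the book, and the counts in the example are made up): score a candidate bigram by the overlap of the occurrences of the two words, i.e. the bigram count divided by the size of the union of the two words' occurrences.

public class JaccardCollocation {
    // Jaccard coefficient for a candidate collocation (w1, w2):
    // bigram count divided by the size of the union of the occurrences of w1 and w2.
    static double jaccard(long countW1, long countW2, long countBigram) {
        return (double) countBigram / (countW1 + countW2 - countBigram);
    }

    public static void main(String[] args) {
        // e.g. "new" occurs 1000 times, "york" 200 times, "new york" 150 times
        System.out.println(jaccard(1000, 200, 150));   // roughly 0.14
    }
}
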
KWIC: What I did not have in mind so far for the analysis of text is the keyword in context (KWIC) analysis. What happens here is that you look at all text snippets that occur in a certain window around a keyword. This seems more like work from linguistics, but I think automating this task would also be useful in natural language processing. So far it never came to my mind when using a computer system that it would also make sense to describe words by their context. Actually pretty funny, since this is the most natural operation we do when learning a new language.
    Exercises
I really liked the exercises in the book. Some of them were really straightforward, but I had the feeling they were carefully chosen in order to demonstrate some of the information given in the book so far.
What I was missing around these basic word statistics is the order of the words. This was somewhat reflected in the collocation problem. Even though I am not an expert yet, I have the feeling that most methods in statistical natural language processing seem to “forget” the order in which the words appear in the text. I am convinced that this is an important piece of information which already inspired me in my diploma thesis to create a method similar to Explicit Semantic Analysis and which is a core element of Typology!
Anyway, reading the first chapter of the book I did not really learn something new, but it helped me to take a certain point of view. I am really excited to proceed. The next chapter will be about probability theory. I already saw that it is written in a pure math style with examples like rolling a dice rather than examples from corpus linguistics, which I find sad.

Neo4j based Typology also awarded top 90 at Google Science Fair
https://www.rene-pickhardt.de/neo4j-based-typology-also-awarded-top-90-at-google-science-fair/ (Tue, 22 May 2012)

Yesterday I shared the good news about Till and Paul, who have been awarded one of the top 5 projects at the German federal competition Young Scientists. Today the good news continues. Together with more than 1000 competitors they also submitted the project to the Google Science Fair.
And guess what! Google likes graph databases, information retrieval, auto completion and sentence prediction. Till and Paul have been selected (together with 5 other Germans) to be in the second round with the 90 remaining projects. Even Heise reported about this! Within the next month they will have phone calls with Google employees, and with a little bit of luck they will be selected to be in the top 15, which would mean they would be invited to the Googleplex in Mountain View, California.
Remember, Typology is open source and already available as an Android app (only in German, but this will be fixed soon).
    Btw I am just wondering when I will be traveling to Google in Mountain View?
