How Generalized Language Models outperform Modified Kneser Ney Smoothing by a Perplexity drop of up to 25%
https://www.rene-pickhardt.de/how-generalized-language-models-outperform-modified-kneser-ney-smoothing-by-a-perplexity-drop-of-up-to-25/ (Tue, 15 Apr 2014)

After 2 years of hard work I can finally and proudly present the core of my PhD thesis. Starting from Till Speicher and Paul Georg Wagner implementing one of my ideas for next word prediction as an award-winning project for the Young Scientists competition, and after several iterations over this idea which gave me a deeper understanding of what I am actually doing, I have developed the theory of Generalized Language Models and evaluated its strength together with Martin Körner over the last years.
As I will present in this blog article, and as you can read in my publication (ACL 2014), it seems that Generalized Language Models outperform Modified Kneser Ney Smoothing, which has been accepted as the de facto state-of-the-art method for the last 15 years.

So what is the idea of Generalized Language Models in non-scientific terms?

When you want to assign a probability to a sequence of words you will run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and by interpolating higher-order models (models over longer word sequences) with lower-order language models. While this idea is strong and helpful, it is usually applied in the same way: in order to use a shorter model, the first word of the sequence is omitted, and this is iterated. The problem occurs when one of the last words of the sequence is the really rare word; in that case omitting words at the front will not help.
So the simple trick of Generalized Language Models is to smooth a sequence of n words with n-1 shorter models which skip a word at position 1 to n-1, respectively.
Then we combine everything with Modified Kneser Ney Smoothing just like it was done with the previous smoothing methods.
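To make this concrete, here is a minimal sketch (an illustration in Python, not our actual implementation) of which lower-order contexts are consulted for a given context: standard back-off only drops words from the front, while the generalized model additionally skips exactly one word at each position and marks it with a wildcard.

    def backoff_contexts(context):
        """Standard back-off: drop words from the front, one at a time."""
        return [tuple(context[i:]) for i in range(1, len(context))]

    def generalized_contexts(context):
        """Generalized lower-order models: skip exactly one word of the
        context and replace it with a wildcard '*'."""
        skips = []
        for i in range(len(context)):
            skipped = list(context)
            skipped[i] = "*"
            skips.append(tuple(skipped))
        return skips

    context = ("the", "very", "rare")     # we want to predict the word following this context
    print(backoff_contexts(context))      # [('very', 'rare'), ('rare',)]
    print(generalized_contexts(context))  # [('*', 'very', 'rare'), ('the', '*', 'rare'), ('the', 'very', '*')]

In the full model the counts of these skipped patterns are then interpolated with Modified Kneser-Ney, just like the standard lower-order counts.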

Why would you do all this stuff?

Language models have a huge variety of applications such as spell checking, speech recognition, next word prediction (autocompletion), machine translation, question answering, and more.
Most of these applications make use of a language model at some point. Creating language models with lower perplexity lets us hope to increase the performance of the above-mentioned applications.

Evaluation Setup, methodology, download of data sets and source code

The data sets come in the form of structured text corpora which we cleaned from markup and tokenized to generate word sequences.
We filtered the word tokens by removing all character sequences which did not contain any letter, digit or common punctuation marks.
Eventually, the word token sequences were split into word sequences of length n which provided the basis for the training and test sets for all algorithms.
Note that we did not perform case-folding nor did we apply stemming algorithms to normalize the word forms.
Also, we did our evaluation using case sensitive training and test data.
Additionally, we kept all tokens for named entities such as names of persons or places.
All data sets have been randomly split into a training and a test set on a sentence level.
The training sets consist of 80% of the sentences, which have been used to derive n-grams, skip n-grams and corresponding continuation counts for values of n between 1 and 5.
Note that we have trained a prediction model for each data set individually.
From the remaining 20% of the sequences we have randomly sampled a separate set of 100,000 sequences of 5 words each.
These test sequences have also been shortened to sequences of length 3 and 4 and provide the basis for our final experiments evaluating the performance of the different algorithms.
We learnt the generalized language models on the same split of the training corpus as the standard language model using modified Kneser-Ney smoothing and we also used the same set of test sequences for a direct comparison.
To ensure rigour and openness of research you can download the data set for training as well as the test sequences and you can download the entire source code.
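A rough sketch of this preprocessing pipeline (the helper below is hypothetical and only illustrates the steps described above; the released code differs in detail):

    import random
    import re

    TOKEN = re.compile(r"\S+")
    VALID = re.compile(r"[A-Za-z0-9.,;:!?'\"-]")  # token must contain a letter, digit or common punctuation

    def prepare(sentences, n=5, test_samples=100_000, seed=42):
        """Sentence-level 80/20 split, token filtering without case-folding or
        stemming, and extraction of n-word sequences for training and testing."""
        random.seed(seed)
        random.shuffle(sentences)
        cut = int(0.8 * len(sentences))
        train, test = sentences[:cut], sentences[cut:]

        def sequences(sents):
            for s in sents:
                tokens = [t for t in TOKEN.findall(s) if VALID.search(t)]
                for i in range(len(tokens) - n + 1):
                    yield tuple(tokens[i:i + n])

        train_sequences = list(sequences(train))
        test_pool = list(sequences(test))
        test_sequences = random.sample(test_pool, min(test_samples, len(test_pool)))
        return train_sequences, test_sequences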
We compared the probabilities of our language model implementation (which is a subset of the generalized language model) using KN as well as MKN smoothing with the Kyoto Language Model Toolkit. Since we got the same results for small n and small data sets we believe that our implementation is correct.
In a second experiment we have investigated the impact of the size of the training data set.
The Wikipedia corpus consists of 1.7 bn. words.
Thus, the 80% split for training consists of 1.3 bn. words.
We have iteratively created smaller training sets by decreasing the split factor by an order of magnitude.
So we created an 8% / 92% split, a 0.8% / 99.2% split, and so on.
We stopped at the 0.008% / 99.992% split, as the training data set in this case consisted of fewer words than our 100k test sequences, which we still randomly sampled from the test data of each split.
Then we trained a generalized language model as well as a standard language model with modified Kneser-Ney smoothing on each of these samples of the training data.
Again we have evaluated these language models on the same random sample of 100,000 sequences as mentioned above.
We have used perplexity as the standard metric to evaluate our language models.
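For reference, perplexity is the exponentiated average negative log probability that a model assigns to the predicted words of the test sequences. A minimal sketch (the model.probability(word, context) call is only a placeholder, not a function from our released code):

    import math

    def perplexity(model, test_sequences):
        """Perplexity = 2 ** (-1/N * sum of log2 p(word | context))."""
        log_sum, count = 0.0, 0
        for sequence in test_sequences:
            context, word = sequence[:-1], sequence[-1]
            log_sum += math.log2(model.probability(word, context))
            count += 1
        return 2 ** (-log_sum / count)

Lower perplexity means the model is less surprised by the test data.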

Results

As a baseline for our generalized language model (GLM) we have trained standard language models using modified Kneser-Ney Smoothing (MKN).
These models have been trained for model lengths 3 to 5.
For unigram and bigram models MKN and GLM are identical.
The perplexity values for all data sets and various model orders can be seen in the next table.
In this table we also present the relative reduction of perplexity in comparison to the baseline.

Absolute perplexity values and relative reduction of perplexity from MKN to GLM on all data sets for models of order 3 to 5

As we can see, the GLM clearly outperforms the baseline for all model lengths and data sets.
In general we see a larger improvement in performance for models of higher orders (n=5).
The gain for 3-gram models, by contrast, is negligible.
For German texts the increase in performance is the highest (12.7%) for a model of order 5.
We also note that GLMs seem to work better on broad-domain text rather than special-purpose text, as the reduction on the wiki corpora is consistently higher than the reduction of perplexity on the JRC corpora.
We made consistent observations in our second experiment where we iteratively shrank the size of the training data set.
We calculated the relative reduction in perplexity from MKN to GLM for various model lengths and the different sizes of the training data.
The results for the English Wikipedia data set are illustrated in the next figure:
Variation of the size of the training data on 100k test sequences on the English Wikipedia data set with different model lengths for GLM.

We see that the GLM performs particularly well on small training data.
As the size of the training data set becomes smaller (even smaller than the evaluation data), the GLM achieves a reduction of perplexity of up to 25.7% compared to language models with modified Kneser-Ney smoothing on the same data set.
The absolute perplexity values for this experiment are presented in the next table.
Our theory as well as the results so far suggest that the GLM performs particularly well on sparse training data.
This conjecture has been investigated in a last experiment.
For each model length we have split the test data of the largest English Wikipedia corpus into two disjoint evaluation data sets.
The data set unseen consists of all test sequences which have never been observed in the training data.
The set observed consists only of test sequences which have been observed at least once in the training data.
Again we have calculated the perplexity of each set.
For reference, also the values of the complete test data set are shown in the following Table.
Absolute perplexity values and relative reduction of perplexity from MKN to GLM for the complete test file and for its split into observed and unseen sequences, for models of order 3 to 5. The data set is the largest English Wikipedia corpus.

As expected we see the overall perplexity values rise for the unseen test case and decline for the observed test case.
More interestingly we see that the relative reduction of perplexity of the GLM over MKN increases from 10.5% to 15.6% on the unseen test case.
This indicates that the superior performance of the GLM on small training corpora and for higher order models indeed comes from its good performance properties with regard to sparse training data.
It also confirms that our motivation to produce lower order n-grams by omitting not only the first word of the local context but systematically all words has been fruitful.
However, we also see that for the observed sequences the GLM performs slightly worse than MKN, although for these observed cases the relative change is negligible.

Conclusion and links

With these improvements we will continue to evaluate other methods of generalization and also try to see if the novel methodology works well within the applications of language models. You can find more resources at the following links:

If you have questions, research ideas or want to collaborate on one of my ideas feel free to contact me.

My ranked list of priorities for Backend Web Programming: Scalability > Maintainable code > performance
https://www.rene-pickhardt.de/my-ranked-list-of-priorities-for-backend-web-programming-scalability-maintainable-code-performance/ (Sat, 27 Jul 2013)

The redevelopment of metalcon is going on, and so far I have been very concerned about performance and web scale. Due to the progress of Martin on his bachelor thesis we did a review of the code that calculates Generalized Language Models with Kneser Ney Smoothing. Even though his code is standalone (but very performance-sensitive) software, I realized that for a web application writing maintainable code seems to be as important as thinking about scalability.

Scaling vs performance

I am a performance guy. I love algorithms and data structures. When I was 16 years old I had already programmed software that could play chess against you, using a high-performance programming technique called bitboards.
But thinking about metalcon I realized that web scale is not so much about the performance of single services or parts of the software but rather about the scalability of the entire architecture. After many discussions with colleagues, Heinrich Hartmann and I came up with a software architecture which we believe will scale for web sites that are supposed to handle several million monthly active users (probably not billions though). After discussing the architecture with my team of developers, Patrik wrote a nice blog article about the service-oriented, data-denormalized architecture for scalable web applications (which of course was not invented by Heinrich and me; Patrik found out that it was already described in a WWW publication from 2008).
Anyway, this discussion showed me that scalability is more important than performance! Though I have to say that the standalone services should also be very performant: if a service can only handle 20 requests per second, even if it easily scales horizontally, you will just need too many machines.

Performance vs. Maintainable code

Especially after the code review, but also having the currently running metalcon version in mind, I came to the conclusion that there is an incredibly high value in maintainable code. The hacker community seems to agree that maintainability trumps performance (only one of many examples).
At this point I want to recall my initial post on the redesign of metalcon. I had in mind that performance is the same as scaling (which is a wrong assumption) and asked about Ruby on Rails vs GWT. I am totally convinced that GWT is much more performant than Ruby. But I have seen GWT code and it seems almost impractical to maintain. On the other hand, from all that I know, Ruby on Rails is very easy to maintain but it is less performant. The good thing is that it easily scales horizontally, so it seems almost like a no-brainer to use Ruby on Rails rather than GWT for the front-end design and middle layer of metalcon.

Maintainable code vs scalability

Now comes the most interesting fact that I realized. A software architecture scales best if it has a lot of independent services. If services need to interact they should do so asynchronously and in a non-blocking way (see the small sketch after the list below). Creating a clear software architecture with clear communication protocols between its parts will do two things for you:

  1. It will help you to maintain the code. This will cut down development cost and time. In particular, it will be easy to add, remove or exchange functionality in the entire software architecture. This last point is crucial, since
  2. being easily able to exchange parts of the software or single services will help you to scale. Every time you identify the bottleneck you can fix it by replacing that part of the software with a better performing system.
In order to achieve scalable code one needs to include some middle layer for caching and one needs to abstract certain things. The same things are done in order to get maintainable code (often at the cost of some performance).
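As a toy illustration of what asynchronous, non-blocking communication between services means (the service names and latencies below are made up, nothing metalcon-specific), a request handler can fire calls to several independent services concurrently and wait only once for all answers:

    import asyncio

    async def call_service(name, latency):
        """Stand-in for a non-blocking call to an independent backend service."""
        await asyncio.sleep(latency)      # real network I/O would go here
        return f"{name}: ok"

    async def handle_request():
        # Fire the calls concurrently instead of waiting for each one in turn.
        return await asyncio.gather(
            call_service("news-feed", 0.05),
            call_service("user-profile", 0.03),
            call_service("recommendations", 0.08),
        )

    print(asyncio.run(handle_request()))  # total wait is ~0.08s, not the sum of all latencies

Because no service blocks on another, each one can be scaled out or replaced independently.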

Summary

I find this to be very interesting and counterintuitive. One would think that performance is a core element of scalability, but I have the strong feeling that writing maintainable code is much more important. So my ranked list of priorities for backend web programming (!) looks like this:

  1. Scalability first: no maintainable code helps you if the system doesn’t scale and can’t serve millions of users.
  2. Maintainable code: as stated above, this should go almost hand in hand with scalability.
  3. Performance: of course we can’t have a database design where queries need seconds or minutes to run. Everything should happen within a few milliseconds. But if the code can become more maintainable at the cost of another few milliseconds, I guess that’s a good investment.
Open access and data from my research. Old resources for various topics finally online.
https://www.rene-pickhardt.de/open-access-and-data-from-my-research-old-resources-for-various-topics-finally-online/ (Mon, 05 Nov 2012)

Being a strong proponent of open access, I always try to publish all my work on my blog, but sometimes I am busy or I forget to update. So today I took the time to look at all my old drafts and the material that hasn’t been published yet. Here is a list of new content on my blog that should have been published long ago; I also linked it in the articles of interest:

In the last month I have created quite some content for my blog and it will be published over the next weeks. So watch out for screencasts on how to create an autocompletion in GWT with neo4j, how to create n-grams from Wikipedia, thoughts and techniques for related work, and research ideas and questions that we found but probably do not have the time to work on.

How to do a presentation in China? Some of my experiences
https://www.rene-pickhardt.de/how-to-do-a-presentation-in-china-some-of-my-experiences/ (Fri, 02 Nov 2012)

The culture is different from Western culture, we all know that! I am certainly not an expert on China, but after living in China for almost 2 years, knowing some of the language, working in a Chinese company, seeing presentations every week, and also visiting over 30 Western and Chinese companies based in China, I think I have some insights about how you should organize your presentation in China.
Since I recently went to Shanghai for a research exchange with Jiaotong University, I had to give a presentation to introduce my institute and myself. Here you can find my rather uncommon presentation and some remarks on why some slides were designed the way they are.
http://www.rene-pickhardt.de/wp-content/uploads/2012/11/ApexLabIntroductionOfWeST.pdf

Guanxi – your relations

First of all I think it is really important to understand that in China everything is related to your relations (http://en.wikipedia.org/wiki/Guanxi). A Chinese business card will always name a few of your best and strongest contacts. This is more important than your address, for example. When a conference starts, people exchange name cards before they sit down and discuss.
This principle of Guanxi is also reflected in the style presentations are made. Here are some basic rules:

  • Show pictures of people you worked together with
  • Show pictures of groups while you organized events
  • Show pictures of the panels that run events
  • Show your partners (for business not only clients but also people you are buying from or working together with in general)

My way of respecting these principles:

  • I first showed a group picture of our institute!
  • For almost every project I showed pictures of the people responsible for it, wherever I could get hold of them
  • I did not only show the European research projects our university is part of but listed all the different partners and showed their logos

Family

The second thing is that in China the concept of family is very important. As a rule of thumb, if you want to do business with someone in China and you haven’t been introduced to their family, things will not go the way you might expect.
For this reason I included some slides with a world map, zooming in further and further on the place where I was born, where I studied, and where my parents still live!

Localizing

When I chose a world map I did not only take one in Chinese, I also took one where China was centered. In my contact data I also put Chinese social networks. Remember that Twitter, Facebook and many other sites are blocked in China. So if you really want to communicate with Chinese people, why not get a QQ number or a Weibo account?

Design of the slides

You have seen this at conferences many times: Chinese people just put a heck of a lot of stuff on a slide. I strongly believe this is due to the fact that reading and recognizing Chinese characters is much faster than reading Western characters. So if your presentation is in Chinese, don’t be afraid to fill your slides with information. I have seen many talks by Chinese people who were literally reading word by word what was written on the slides. Whereas in Western countries this is considered bad practice, in China it is all right.

Language

Speaking of language: of course, if you know some Chinese, it shows respect if you at least try to include some of it. I split my presentation into two parts, one in Chinese and one in English.

Have an interesting take away message

In my case I included the fact that we have PhD positions and scholarships open, and that our institute is really international with English as the working language. Of course I also included some slides about my past and current research, like Graphity and Typology.

During the presentation:

In China it is not rude at all if one’s cellphone rings and one has more important things to do. You as the presenter should switch off your phone, but you should not be disturbed or annoyed if people in the audience receive phone calls and leave the room to take them. This is very common in China.
I am sure there are many more rules on how to give a presentation in China, and maybe I even made some mistakes in my presentation, but at least I have the feeling that the reaction was quite positive. So if you have questions, suggestions or feedback, feel free to drop a line; I am more than happy to discuss cultural topics!

Typology Oberseminar talk and Speed up of retrieval by a factor of 1000
https://www.rene-pickhardt.de/typology-oberseminar-talk-and-speed-up-of-retrieval-by-a-factor-of-1000/ (Thu, 16 Aug 2012)

Almost 2 months ago I talked in our Oberseminar about Typology. Update: Download slides. Most readers of my blog will already know the project, which was initially implemented by my students Till and Paul. I am just about to share some slides with you. They explain on the one hand how the system works and on the other hand give some overview of the related work.
As you can see from the slides we are planning to submit our results to the SIGIR conference. So one year after my first blog post on Graphity, which developed into a full paper for SocialCom2012 (Graphity blog post and blog post for the source code), there is this still informal Typology blog post with the slides of the Typology Oberseminar talk and 3 months left until our SIGIR submission. I expect this time the submission will not be such a hassle as Graphity, since I should have learnt some lessons and also have a good student who is helping me with the implementation of all the tests.
Additionally I have finally uploaded some source code to GitHub that makes the Typology retrieval algorithm pretty fast. There are still some issues with this code since it lowers the quality of the predictions a little bit. Also, the index has to be built first. Last but not least, the original SuggestTree code did not save the weights of the items to be suggested. I need those weights in the aggregation phase. Since I did not want to extend the original code I placed the weights at the end of the suggested items, which is a little inefficient.
The main reason why retrieval speeds up with the new algorithm is that Typology used to sort over all outgoing edges of a node, which is rather slow, especially if one only needs the top k elements. Since neo4j as a graph database does not provide indices for this kind of data, I was forced to look for another way to presort the data. Additionally, if a prefix is known, one does not have to look at all outgoing edges. I found the SuggestTree class by Nicolai Diethelm, which solved the problem in a very good way and led to such a great speed. The index is not persistent yet and it also needs quite some memory. On the other hand, a suggest tree is built for every node. This means that the index can be distributed very easily over several machines, allowing for horizontal scaling!
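The gist of the speed-up, in a much simplified Python sketch (the real implementation uses Nicolai Diethelm's Java SuggestTree, which additionally precomputes the top-k list per prefix; the class below only illustrates the idea of a presorted, prefix-filtered index per node instead of sorting all outgoing edges at query time):

    from bisect import bisect_left, bisect_right

    class PrefixIndex:
        """Per-node index: suggestions are stored sorted by word, so a prefix
        maps to a contiguous range, and only that range is scanned for the top k."""

        def __init__(self, weighted_suggestions):
            # weighted_suggestions: dict word -> weight (e.g. edge count in the graph)
            self.words = sorted(weighted_suggestions)
            self.weights = dict(weighted_suggestions)

        def top_k(self, prefix, k=5):
            lo = bisect_left(self.words, prefix)
            hi = bisect_right(self.words, prefix + "\uffff")   # end of the prefix range
            candidates = self.words[lo:hi]
            return sorted(candidates, key=lambda w: -self.weights[w])[:k]

    index = PrefixIndex({"banana": 12, "band": 30, "bank": 25, "bar": 7, "cat": 40})
    print(index.top_k("ban", k=2))    # ['band', 'bank']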
Anyway, the old algorithm was only able to handle about 20 requests per second and now we handle something like 14k requests per second, and as I mentioned there is still a little room for more (:
I hope indices like this will be standard in neo4j soon. This would open up the range of applications that could make good use of neo4j.
As always I am happy about any suggestions, and I am looking forward to doing the complete evaluation and the paper writing for Typology.

Swiftkey XKCD comic: Sorry this has not happened before…
https://www.rene-pickhardt.de/swiftkey-xkcd-comic-sorry-this-has-not-happened-before/ (Wed, 13 Jun 2012)

Current readers of my blog know about Typology, the project which predicts what you will type next on your smartphone. This is of course pretty similar to SwiftKey. It is amazing to see that xkcd took SwiftKey as the topic of its current comic. Even more amazing is that 3 people (not in the development team of Typology) sent me the link to this comic independently.
I liked the comic and I think it really demonstrates the importance of software like this. So I am really looking forward to seeing how Typology will evolve. It also demonstrates the pitfalls of personalizing everything; in this sense I think there is also a hint at the filter bubble.

Foundations of statistical natural language processing Review of chapter 1
https://www.rene-pickhardt.de/foundations-of-statistical-natural-language-processing-review-of-chapter-1/ (Tue, 12 Jun 2012)

Due to the interesting results we found by creating Typology, I am currently reading the related work about query prediction and autocompletion of sentences. There is quite some interesting academic work available in this area of information retrieval.
While reading these papers I realized that I am not that strong in the field of natural language processing which seems to have a deep impact on my current research interests. That’s why I decided to put in a reading session of some basic work. A short trip to the library and also a look into the cited work of the papers made me find the book Foundations of statistical natural language processing by Christopher D. Manning and Hinrich Schütze. Even though they write in the introduction that their book is far away from being a complete coverage of the topic I think this book will be a perfect entry point to become familiar with some of the basic concepts in NLP. Since one understands and remembers better what one has read if one writes it down I am planning to write summaries and reviews of the book chapters here in my blog:

Chapter 1 Introduction

This chapter is split into several sections and is supposed to give some motivation. It already demonstrates in a good way that in order to understand natural language processing you really have to bridge the gap between mathematics and computer science on the one hand and linguistics on the other hand. Even in the basic examples given from linguistics there was some notation that I did not understand right away. In this sense I am really looking forward to reading chapter 3 which is supposed to give a rich overview of all the concepts needed from linguistics.
I personally found the motivating section of chapter 1 also too long. I am about to learn some concepts and right now I don’t really have the feeling that I need to understand all the philosophical discussions about grammar, syntax and semantics.
What I really loved, on the other hand, was the last section “dirty hands”. In this section a small corpus (Tom Sawyer) is used to introduce some of the phenomena that one faces in natural language processing, some of which I had already discussed, without knowing it, in the article about text mining for linguists on Ulysses. In the book they are of course discussed in a more structured way, but one can easily download the source code from the above-mentioned article and play around to understand the basic concepts from the book. Among these were:
Word counts / tokens / types: The most basic operation one can do on a text is counting words. This is something that I already did in the Ulysses article. Counting words is interesting since in today’s world it can be automated. What I did not see in my last blog post is that counting words already leads to more insights than just a distribution of words, which I will discuss now:
Distinctive words can be spotted: Once I have a corpus consisting of many different texts I can count all words and create a ranking of the most frequent words. I will realize that for any given text the ranking looks quite similar to that global ranking. But once in a while I might spot some words in a single text in the top 100 of the most frequent words that would not appear (let’s say) in the top 200 of the global ranking. Those words seem to be distinctive words that are of particular interest for the current text. In the example of Tom Sawyer, “Tom” is such a distinctive word.
Hapax legomena: If one looks at all the words in a given text one will realize that most words occur fewer than 3 times. This phenomenon is called a hapax legomenon and demonstrates the difficulty of natural language processing: the data one analyses is very sparse. The frequent words are most of the time grammatical structures whereas the infrequent words carry the semantics. From this phenomenon one moves very quickly to:
Zipf’s law: Roughly speaking, Zipf’s law says that when you count the word frequencies of a text and order them from the most frequent to the least frequent word, you get a table in which the rank of a word multiplied by its frequency is always roughly the same number (that is, the rank is inversely proportional to the frequency of the word). This is of course only an approximation: just imagine the most frequent word occurs an odd number of times; then no word can occur exactly half as often, so the product for the second-ranked word cannot exactly match the product for the first.
Anyway, Zipf’s law was a very important discovery and has been generalized by Mandelbrot’s law (which I so far only knew from chaos theory and fractals). Maybe sometime in the near future I will find some time to calculate the word frequencies of my blog and see if Zipf’s law holds (:
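If I ever get around to that, the check itself is only a few lines; a minimal sketch (the corpus file name is of course just a placeholder):

    from collections import Counter

    def zipf_table(text, top=10):
        """Rank the word frequencies and print rank * frequency, which Zipf's law
        predicts to be roughly constant."""
        counts = Counter(text.lower().split())
        for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
            print(f"{rank:>4}  {word:<15} {freq:>8} {rank * freq:>10}")

    # zipf_table(open("blog_corpus.txt", encoding="utf-8").read())  # hypothetical dump of the blog posts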
Collocations / bigrams: Another important concept was that of collocations. Many words only have a meaning together; in this sense “New York” or “United States” are more than the sum of the single words. The text pointed out that it is not sufficient to find the most frequent bigrams in order to find good collocations. Those bigrams have to be filtered, either with grammatical structures or by normalization. I think calculating a Jaccard coefficient might be interesting (even though it was not written in the text). Should I really try to verify Zipf’s law in the near future, I will also try to verify my method for calculating collocations. I hope that I would find collocations in my blog like social network, graph data base or news stream…
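To be explicit about the Jaccard idea (my own guess at a filter, not something taken from the book): score a bigram (a, b) as count(a, b) / (count(a) + count(b) - count(a, b)), so that pairs which mostly occur together rank higher than pairs merely built from two individually frequent words.

    from collections import Counter

    def jaccard_collocations(tokens, top=10):
        """Rank bigrams by the Jaccard coefficient of their two words."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))

        def score(pair):
            a, b = pair
            return bigrams[pair] / (unigrams[a] + unigrams[b] - bigrams[pair])

        return sorted(bigrams, key=score, reverse=True)[:top]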
KWIC: What I did not have in mind so far is keyword-in-context analysis. Here you look at all text snippets that occur within a certain window around a keyword. This seems more like work from linguistics, but I think automating this task would also be useful in natural language processing. So far it never came to my mind when using a computer system that it would also make sense to describe words by their context. Actually pretty funny, since this is the most natural operation we do when learning a new language.
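Automating it is indeed straightforward; a minimal sketch that prints a fixed window of words around every occurrence of a keyword:

    def kwic(tokens, keyword, window=4):
        """Print every occurrence of keyword with `window` words of context on each side."""
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                print(f"{left:>40} | {token} | {right}")

    kwic("the quick brown fox jumps over the lazy dog".split(), "the")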
Exercises
I really liked the exercises in the book. Some of them were really straightforward, but I had the feeling they were carefully chosen to reinforce the information given in the book so far.
What I was missing around these basic word statistics is the order of the words. This was somewhat reflected in the collocation problem. Even though I am not an expert yet, I have the feeling that most methods in statistical natural language processing seem to “forget” the order in which the words appear in the text. I am convinced that this is an important piece of information, which already inspired me in my diploma thesis to create a method similar to explicit semantic analysis and which is a core element of Typology!
Anyway, reading the first chapter of the book I did not really learn something new, but it helped me take a certain point of view. I am really excited to proceed. The next chapter will be about probability theory. I already saw that it is written in a pure math style with examples like rolling dice rather than examples from corpus linguistics, which I find sad.

Neo4j based Typology also awarded top 90 at Google Science fair.
https://www.rene-pickhardt.de/neo4j-based-typology-also-awarded-top-90-at-google-science-fair/ (Tue, 22 May 2012)

Yesterday I shared the good news about Till and Paul, who have been awarded one of the top 5 projects at the German federal competition for young scientists. Today the good news continues: together with more than 1000 competitors they also submitted the project to the Google Science Fair.
And guess what! Google likes graph databases, information retrieval, autocompletion, and sentence prediction. Till and Paul have been selected (together with 5 other Germans) for the second round with the 90 remaining projects. Even Heise reported about this! Within the next month they will have phone calls with Google employees, and with a little bit of luck they will be selected for the top 15, which would mean they would be invited to the Googleplex in Mountain View, California.
Remember that Typology is open source and already available as an Android app (currently only for German, but this will be fixed soon).
Btw I am just wondering when I will be traveling to Google in Mountain View?

Typology using neo4j wins 2 awards at the German federal competition young scientists.
https://www.rene-pickhardt.de/typology-using-neo4j-wins-2-awards-at-the-german-federal-competition-young-scientists/ (Mon, 21 May 2012)

Two days ago I arrived in Erfurt in order to visit the German federal competition for young scientists (Jugend forscht). I have reported before about the project Typology by Till Speicher and Paul Wagner, which I supervised over the last half year and which has already won many awards.
On Saturday night they already won a special award donated by the Gesellschaft für Informatik. This award carries the title “special award for a contribution which particularly demonstrates the usefulness of computer science for society” (Sonderpreis für eine Arbeit, die in besonderer Art und Weise den Nutzen der Informatik verdeutlicht) and came with 1500 Euro in cash!
Yesterday was the final award ceremony. I was quite excited to see how the hard work that Till and Paul put into their research project would be evaluated by the jury. Out of 457 submissions in a tough competition, the top 5 projects were awarded. Till and Paul came in 4th and will now be allowed to visit the German chancellor Angela Merkel in Berlin.
With the use of spreading activation, Typology is able to make precise predictions of what you are going to type next on your smartphone. It outperforms the current scientific standard (language models) by more than 100% and has a precision of 67%!
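For readers who wonder what spreading activation looks like in principle, here is a generic toy sketch over a small weighted word graph. The graph, the weights and the decay parameter are made up for illustration; this is the general technique, not the actual Typology algorithm or its data:

    def spread_activation(graph, start_words, decay=0.5, iterations=2):
        """Generic spreading activation: energy flows from the typed words along
        weighted edges, and the most activated neighbours become suggestions."""
        activation = {w: 1.0 for w in start_words}
        for _ in range(iterations):
            new_activation = dict(activation)
            for node, energy in activation.items():
                for neighbour, weight in graph.get(node, {}).items():
                    new_activation[neighbour] = new_activation.get(neighbour, 0.0) + decay * energy * weight
            activation = new_activation
        candidates = {w: a for w, a in activation.items() if w not in start_words}
        return sorted(candidates, key=candidates.get, reverse=True)

    graph = {
        "warum": {"ist": 0.9},
        "ist": {"die": 0.8, "das": 0.6},
        "die": {"banane": 0.7, "katze": 0.4},
        "banane": {"krumm": 0.9},
    }
    print(spread_activation(graph, ["warum", "ist", "die"]))  # most activated continuations first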
A demo of the project can be found at www.typology.de
The Android app of this system is available in the app store. It currently only supports German text and is in beta. But the source code is open and we are happy about anybody who wants to contribute and check out the code. A mailing list will be set up soon, but anyone who is interested can already drop a short message to mail@typology.de and will be added to the mailing list as soon as it is established.
You can also look at the documentation, which will soon be available in English. Alternatively you can report bugs and request features in our bug tracker.
I am currently trying to improve the data structures and the database design in order to make the retrieval of suggestions faster and to decrease the load on our web server. If you have expertise in top-k aggregation joins in combination with prefix filtering, drop me a line and we can discuss this issue!
Happy winners Paul Wagner (left) and Till Speicher (right) with me

Paul Wagner and Till Speicher won State Competition "Jugend Forscht Hessen" and best Project award using neo4j
https://www.rene-pickhardt.de/paul-wagner-and-till-speicher-won-state-competition-jugend-forscht-hessen-and-best-project-award-using-neo4j/ (Fri, 16 Mar 2012)

6 months of hard coding and supervising by me are over and end with a huge success! After analyzing 80 GB of Google n-grams data, Paul and Till put them into a neo4j graph database in order to make predictions for fast sentence completion. Today was the award ceremony, and the two students from Darmstadt and Saarbrücken (respectively) won first place. Additionally they received the “beste schöpferische Arbeit” award, which is the award for the best project in the entire competition (across all disciplines).
With their technology and the almost finished Android app, typing will be revolutionized! While typing a sentence they are able to predict the next word with a recall of 67%, creating huge additional value for today’s smartphones.
So stay tuned for the upcoming news and the federal competition in May in Erfurt.
Have a look at their website where you can find the (still German) documentation, as well as the source code and a demo (which I also include here; use tab completion (-: as in the Unix bash).
Right now it only works for German, since only German data was processed, so try sentences like

  • “Warum ist die Banane krumm” (where the rare word krumm is correctly predicted thanks to the famous question “Why is the banana curved?”)
  • “Das kann ich doch auch” (I am also able to do that)
  • “geht wirklich nur deutsche Sprache ?” (Is really only German language possible?)


Demo: http://complet.typology.de
