René Pickhardt on April 24th, 2014

I was just reading through the recent notes of Heinrich, which I can recommend reading (as well as his old notes), when I stumbled upon the note called “Monitor /etc using git” and was confused. Why would one do this?

So I talked to Heinrich and he said:

“Well you want to monitor changes of your system config. You want to be able to revert them and you don’t want to care about this when you do something.”

I really liked this and think it is such a useful and smart idea that I wanted to share it with you. Just keep in mind not to push the git repository to some public space, since the config files might include a lot of passwords. Also have a look at his .gitignore: in his case the printer makes a lot of automatic changes and is thus ignored. You might have similar settings in your configs.
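To make the workflow concrete, here is a minimal sketch of what such a setup amounts to (my own reconstruction, run as root; the ignored path and file names below are just examples, and tools like etckeeper automate exactly this):

```shell
# Put /etc under version control (run as root)
cd /etc
git init
git add -A
git commit -m "initial snapshot of /etc"

# Ignore files the system rewrites automatically (e.g. printer state)
echo "cups/" >> .gitignore

# After changing a config file, record the change
git add -A
git commit -m "why the config was changed"

# Later: inspect the history, or revert a single file
git log --stat
git checkout HEAD~1 -- network/interfaces
```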

I hope sharing this was useful for you!



After 2 years of hard work I can finally and proudly present the core of my PhD thesis. Starting from Till Speicher and Paul Georg Wagner, who implemented one of my ideas for next-word prediction as an award-winning project for the Young Scientists competition, and after several iterations over this idea which led to a deeper understanding of what I was actually doing, I have developed the theory of Generalized Language Models and evaluated its strength together with Martin Körner over the last years.

As I will present in this blog article, and as you can read in my publication (ACL 2014), it seems that Generalized Language Models outperform Modified Kneser-Ney smoothing, which has been accepted as the de facto state-of-the-art method for the last 15 years.

So what is the idea of Generalized Language Models in non-scientific slang?

When you want to assign a probability to a sequence of words, you run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and by interpolating higher-order models (models with longer word sequences) with lower-order language models. While this idea is strong and helpful, it is usually applied in the same way: in order to use a shorter model, the first word of the sequence is omitted, and this is iterated. The problem occurs when one of the last words of the sequence is the really rare word; omitting words at the front then does not help.

So the simple trick of Generalized Language Models is to smooth a sequence of n words with n-1 shorter models, each of which skips the word at one of the positions 1 to n-1.

Then we combine everything with Modified Kneser-Ney smoothing, just as it was done with the previous smoothing methods.

Why would you do all this stuff?

Language models have a huge variety of applications: spell checking, speech recognition, next-word prediction (autocompletion), machine translation, question answering, …
Most of these problems make use of a language model at some point. Creating language models with lower perplexity lets us hope to increase the performance of the applications mentioned above.

Evaluation Setup, methodology, download of data sets and source code

The data sets come in the form of structured text corpora which we cleaned from markup and tokenized to generate word sequences.
We filtered the word tokens by removing all character sequences which did not contain any letter, digit or common punctuation marks.
Eventually, the word token sequences were split into word sequences of length n which provided the basis for the training and test sets for all algorithms.
Note that we did not perform case-folding nor did we apply stemming algorithms to normalize the word forms.
Also, we did our evaluation using case-sensitive training and test data.
Additionally, we kept all tokens for named entities such as names of persons or places.

All data sets have been randomly split into a training and a test set on a sentence level.
The training sets consist of 80% of the sentences, which have been used to derive n-grams, skip n-grams and corresponding continuation counts for values of n between 1 and 5.
Note that we have trained a prediction model for each data set individually.
From the remaining 20% of the sequences we have randomly sampled a separate set of 100,000 sequences of 5 words each.
These test sequences have also been shortened to sequences of length 3 and 4, and provide the basis for our final experiments to evaluate the performance of the different algorithms.
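For readers who want to reproduce the setup: extracting the n-grams from a tokenized sentence is the easy part. A minimal sketch (the function name is my own, not from our released code):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a tokenized sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the quick brown fox jumps".split()
print(ngrams(sentence, 2))
# → [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
```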

We learnt the generalized language models on the same split of the training corpus as the standard language model using modified Kneser-Ney smoothing and we also used the same set of test sequences for a direct comparison.
To ensure rigour and openness of research you can download the data set for training as well as the test sequences and you can download the entire source code.
We compared the probabilities of our language model implementation (which is a subset of the generalized language model) using KN as well as MKN smoothing with the Kyoto Language Model Toolkit. Since we got the same results for small n and small data sets we believe that our implementation is correct.

In a second experiment we have investigated the impact of the size of the training data set.
The wikipedia corpus consists of 1.7 bn. words.
Thus, the 80% split for training consists of 1.3 bn. words.
We have iteratively created smaller training sets by decreasing the split factor by an order of magnitude.
So we created an 8% / 92% split, a 0.8% / 99.2% split, and so on.
We stopped at the 0.008% / 99.992% split, as the training data set in this case consisted of fewer words than our 100k test sequences, which we still randomly sampled from the test data of each split.
Then we trained a generalized language model as well as a standard language model with modified Kneser-Ney smoothing on each of these samples of the training data.
Again we have evaluated these language models on the same random sample of 100,000 sequences as mentioned above.

We have used perplexity as a standard metric to evaluate our language models.
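For readers unfamiliar with the metric: perplexity is 2 to the power of the negative average log2 probability the model assigns to the test words, so lower values are better. A minimal sketch:

```python
import math

def perplexity(log2_probs):
    """Perplexity = 2 ** (negative average per-word log2 probability)."""
    avg = sum(log2_probs) / len(log2_probs)
    return 2 ** (-avg)

# A model that assigns every word probability 1/4 has perplexity 4:
print(perplexity([math.log2(0.25)] * 10))  # → 4.0
```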


As a baseline for our generalized language model (GLM) we have trained standard language models using modified Kneser-Ney Smoothing (MKN).
These models have been trained for model lengths 3 to 5.
For unigram and bigram models MKN and GLM are identical.

The perplexity values for all data sets and various model orders can be seen in the next table.
In this table we also present the relative reduction of perplexity in comparison to the baseline.

Absolute perplexity values and relative reduction of perplexity from MKN to GLM on all data sets for models of order 3 to 5


As we can see, the GLM clearly outperforms the baseline for all model lengths and data sets.
In general we see a larger improvement in performance for models of higher orders (n=5).
The gain for 3-gram models, instead, is negligible.
For German texts the increase in performance is the highest (12.7%) for a model of order 5.
We also note that GLMs seem to work better on broad-domain text than on special-purpose text, as the reduction of perplexity on the wiki corpora is consistently higher than on the JRC corpora.

We made consistent observations in our second experiment where we iteratively shrank the size of the training data set.
We calculated the relative reduction in perplexity from MKN to GLM for various model lengths and the different sizes of the training data.
The results for the English Wikipedia data set are illustrated in the next figure:

Variation of the size of the training data on 100k test sequences on the English Wikipedia data set with different model lengths for GLM.


We see that the GLM performs particularly well on small training data.
As the size of the training data set becomes smaller (even smaller than the evaluation data), the GLM achieves a reduction of perplexity of up to 25.7% compared to language models with modified Kneser-Ney smoothing on the same data set.
The absolute perplexity values for this experiment are presented in the next table.

Our theory as well as the results so far suggest that the GLM performs particularly well on sparse training data.
This conjecture has been investigated in a last experiment.
For each model length we have split the test data of the largest English Wikipedia corpus into two disjoint evaluation data sets.
The data set unseen consists of all test sequences which have never been observed in the training data.
The set observed consists only of test sequences which have been observed at least once in the training data.
Again we have calculated the perplexity of each set.
For reference, the values for the complete test data set are also shown in the following table.

Absolute perplexity values and relative reduction of perplexity from MKN to GLM for the complete and split test file into observed and unseen sequences for models of order 3 to 5. The data set is the largest English Wikipedia corpus.


As expected we see the overall perplexity values rise for the unseen test case and decline for the observed test case.
More interestingly we see that the relative reduction of perplexity of the GLM over MKN increases from 10.5% to 15.6% on the unseen test case.
This indicates that the superior performance of the GLM on small training corpora and for higher order models indeed comes from its good performance properties with regard to sparse training data.
It also confirms that our motivation to produce lower order n-grams by omitting not only the first word of the local context but systematically all words has been fruitful.
However, we also see that for the observed sequences the GLM performs slightly worse than MKN.
For the observed cases we find the relative change to be negligible.

Conclusion and links

With these improvements we will continue to evaluate other methods of generalization and also try to see whether the novel methodology works well with the applications of language models. You can find more resources at the following links:

If you have questions, research ideas or want to collaborate on one of my ideas feel free to contact me.



There is a trend on the web that artists, startups (and other people) try crowdfunding. I really like the concept: ask your true supporters, fans and customers to help realize the next big project, which they will like. But sometimes I have the feeling that crowdfunding, especially in music, might not go as well as it could, or to say it simply: people unfortunately just expect to get music for free. I am not saying that everyone is participating in peer-to-peer platforms and committing copyright violations, but music as a good is nowadays available in a way it never was before, and most people seem to forget the intrinsic value of music. After helping to build up the band In Legend and having some experience with music and the cost of creating it, I want to share another story that brings me to my sad conclusion:

There is this German band called Mock Unit. In 2009 they released their first video, “ready to mock”, which was followed by “is doch kärb” (2010) and “seid ihr dabei?!” (2011). Until now these videos have generated almost half a million views on YouTube. You might think that is nothing in comparison to Psy’s Gangnam Style, but consider that the band has a very unique style:

  • Their lyrics consist almost exclusively of swearwords, there is a lot of sexism in the texts, and they seem to make a lot of fun of a certain part of our society. Until now I hope that this is satire and a way of expressing disappointment about the existence of problems like sexism in our society. (Otherwise I would reconsider my positive feelings towards the band.)
  • They sing in a very local dialect which only few Germans would understand right away (I would guess about 1–2 million).
  • The musical style itself is also not for everybody and rather appeals to young people.

Yet the band has quite some success in this very small target group.

  • Their Facebook profile has about 5k fans (I assume most of them live within 50 km of Mainz).
  • Fans have started to shoot their own videos in the same style but with other songs.
  • Concerts are sold out. 4 months ago they had a concert in the Red Cat and had to send people home because the venue was completely sold out.
  • Fans dress up in the same “trashy” style, and when I did the same and walked through Mainz, people on the street would associate me with this band.

From all this together I conclude that this band (which obviously was not cast by some label) has a strong fan base in a very small region, has no real commercial interest, and just wants to continue creating music for their fans. Now the Mock Unit wants to create a second record and decided to go for a crowdfunding campaign on Startnext (basically the German Kickstarter). They provided a nice advertising video in their style, and the rewards fit the style as well. Everything is really authentic.

Also, the Mock Unit only asks for 6,000 euros, which relative to the number of Facebook fans is an average of only 1.2 euros per fan. Apparently even this seems impossible to achieve. Currently the band has not even collected 2k euros, which I really do not understand. I don’t ask you to support this band or become a fan. I just think it is incredible that a band that has delivered music and videos, that has a solid fan base, and that is able to sell out venues is not able to collect 1.2 euros per Facebook fan for the sake of creating a new record and video.

I think it is really a shame that culture has apparently become so worthless in our society. By the way, I would have much more faith in a musician’s crowdfunding campaign if he promised to release the record under a Creative Commons Attribution-ShareAlike license. That way I would know that in return for my support I would not only get some entertainment but would also obtain the legal right to use the material which I directly funded.

For those who know and like the band, I created a video in their style in order to get other fans to think about whether they really do not want to support the band:



René Pickhardt on January 18th, 2014

Hey everyone, I wonder if you could help me out. I am currently at my parents’ home, where there are some old PCs from the time when I was young (even my very first own PC is among them). They were bought between 1997 and 2002 and have single-core processors ranging from 333 MHz up to 1800 MHz. Memory varies between 64 MB and 1 GB, as does the hard disk. Those computers need way too much energy, the fan is really loud, and so on…

In general the electronic parts of these computers are still in good shape and they have served a good purpose for a long time. I can imagine many use cases, yet looking at eBay you would only be able to sell these computers for 1 euro each.

I kind of refuse to give them away for free or, even worse, throw them away. But apparently these computers are worth nothing. Which, sorry to say it again, I refuse to accept. Computing power is an amazing thing.

Does anyone have a cool idea what one could do with them? Maybe install some lightweight Linux and use them to control some hardware, or investigate some networking projects. I even considered using them as a file / backup server, but this also seems not to be a good idea, since the energy consumption as mentioned above is too high and the network storage devices you can buy nowadays seem to fulfill that purpose much better.

I tried to google the problem, but I only found boring articles without any good ideas. So if any of you have an idea, it would be highly appreciated.



2 months ago I started to create the Web Science MOOC, and now you can join our MOOC as a student. We will start online-streamed flipped-classroom lessons on October 29th. Our MOOC is truly open, meaning that all the teaching material will be provided as open educational resources under a Creative Commons Attribution-ShareAlike 3.0 license.

 In the first month we will learn about the following topics

  • Ethernet
  • Internet Protocol
  • Transmission Control Protocol
  • Domain Name System
  • URIs
  • HTTP
  • HTML
  • RDF
  • Javascript / CSS

The Ethernet lessons can be found at:


The Internet protocol lessons can be found at:


Since Wikiversity, in comparison to other MOOC platforms, is truly open, you might also want to watch some of my introductory videos. They are particularly helpful for showing how to make the best use of Wikiversity as a MOOC platform and how one can really engage in the discussions. You can find the videos at:


but maybe you are already interested in watching some of the content right here, right away:




René Pickhardt on September 14th, 2013

I would like to have a discussion with people who have experience with, or are interested in, MOOCs and Wikiversity. The goal is to check out the possibilities for creating (otherwise overcommercialized) MOOCs in an OER environment (especially Wikiversity).


According to my former blog post, there are 3 ways of creating a MOOC that is truly OER:

Out of these, I would love to discuss what possibilities exist in the context of Wikiversity and how such a MOOC could benefit from the ecosystem of the other Wikimedia projects (e.g. Wikibooks, Commons, Wikipedia and of course Wikiversity itself).

I would also love to create a list of requirements for the Wikiversity software, with the functionality needed to create an OER MOOC (e.g. access to students’ multiple-choice results). This list could be presented to the Wikimedia Foundation in order to extend the Wikiversity software.

My experiences:






René Pickhardt on September 5th, 2013

Even though the reading club on distributed graph databases stopped, I never really lost interest in the management of big data and graph data. Due to the development of research grants and some new colleagues in our group, I decided to create a new reading club. (The next and first meeting will be Thursday, September 12th, 15:30 Central European Time.) The reading club won’t meet on a weekly basis but rather something like once a month. Tell me if you want to join via hangout or something similar! But I would like to be clear: if you haven’t carefully prepared the reading assignments by bringing questions and points for discussion to the meeting, then don’t join the meeting. I don’t consider skimming a paper careful preparation.

The road map for the reading club on big data is quite clear: we will read again some papers that we have read before, but we will also look deeper and check out some existing technologies. So the reading will not only consist of scientific work (though this will build up the basis) but will also include hands-on and practical sessions based on blogs, tutorials, documentation and handbooks.

Here is the preliminary structure and road map for the reading club on big data, which of course could easily vary over time!

Along these lines we want to understand

  • Why do these technologies scale?
  • How do they handle concurrent traffic (especially write requests)?
  • How can performance be increased? Is there another way of building up such highly scalable systems?
  • What kinds of applications (like Titan or Mahout) are built on top of these systems?
At some point I would also love to do some side reading on distributed algorithms and on distributed and parallel algorithm and data structure design.

As stated above, the reading club will be much more hands-on in the future than before; I expect us to also deliver tutorials like the one on getting Nutch running on top of HBase and Solr.

Even though we want to get hands on in current technologies the goal is rather to understand the principles behind them and find ways of improving them instead of just applying them to various problems.

I am considering starting a wiki page on Wikiversity to create something like a course on big data management, but I would only do this if I found a couple of people who would actively help contribute to such a course. So please contact me if you are interested!

So, to sum up, the reading assignments for the first meeting are the Google File System paper and the MapReduce paper.



René Pickhardt on September 2nd, 2013

It is almost 2 years since I published my first ideas and work on Graphity, which is nowadays a collection of algorithms to support the efficient storage and retrieval of more than 10k social activity streams per second. You know the typical application from Twitter, Facebook and co.: retrieve the most recent status updates from your circle of friends.
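Functionally, a news-stream request is a k-way merge of the friends' time-sorted streams. The following toy sketch shows only this behaviour, not the actual Graphity index structures (whose whole point is to answer such requests without scanning every friend's full stream); all names and data are made up:

```python
import heapq

# Each friend keeps their status updates sorted by time, newest first.
streams = {
    "alice": [(1699, "new album out"), (1650, "hello world")],
    "bob":   [(1680, "concert tonight")],
}

def news_feed(friends, limit=10):
    """Merge the friends' pre-sorted streams, newest update first."""
    merged = heapq.merge(*(streams[f] for f in friends),
                         key=lambda update: update[0], reverse=True)
    return [text for _, text in list(merged)[:limit]]

print(news_feed(["alice", "bob"]))
# → ['new album out', 'concert tonight', 'hello world']
```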

Today I proudly present the first version of the Graphity News Stream Server. Big thanks to Sebastian Schlicht, who worked for me implementing most of the servlet and did an amazing job! The Graphity server is a neo4j-powered servlet with the following properties:

  • Response times for requests are usually less than 10 milliseconds (plus network I/O, e.g. TCP round trips coming from HTTP).
  • The Graphity News Stream Server is free open source software (GPLv3) and hosted in the metalcon git repository. (Please also use the bug tracker there to submit bugs and feature requests.)
  • It runs two Graphity algorithms: one is read-optimized and the other is write-optimized, in case you expect your application to have more write than read requests.
  • The server comes with a REST API which makes it easy to hook the server into whatever application you have.
  • The server’s response also follows the format, so out of the box there is a large number of clients available to render the response of the server.
  • The server ships with unit tests and extensive documentation, especially of the news stream server protocol (NSSP), which specifies how to talk to the server.

The server can currently handle about 100 write requests in medium-size (about a million nodes) networks. I do not recommend using this server if you expect your user base to grow beyond 10 million users (though we are working on getting the server to scale). This is mostly because our database right now won’t really scale beyond one machine and some internal stuff has to be handled synchronously.

One platform is currently considering implementing Graphity-like algorithms to power their activity streams. It was Richard from their team who pointed out, in a very fruitful discussion, how to avoid the neo4j limit of 2^15 = 32768 relationship types by using an overlay network. His ideas for an overlay network have been implemented in the read-optimized Graphity algorithm. Big thanks to him!

Now I am really excited to see what kind of applications you will build using Graphity.

If you use Graphity

Please tell me if you start using Graphity, that would be awesome to know and I will most certainly include you to a list of testimonials.

By the way, if you want to help spread the server (which is also good for you, since more developers using it means a higher chance of getting newer versions), you can vote up my answer on Stack Overflow:

How to get started

It’s darn simple!

  1. Clone the git repository or get hold of the source code.
  2. Then switch to the repo and run the build script with sudo.
  3. Copy the war file to your tomcat webapps folder (if you don’t know how to set up tomcat and maven, which are needed, we have a detailed setup guide).
  4. And you’re done! More configuration details are in our documentation.
  5. Look in the newswidget folder to find a simple HTML / JavaScript client which can interact with the server.
I also created a small, simple screencast to demonstrate the setup:

Get involved

There are plenty of ways to get involved:

  • Fork the server.
  • Submit a bug report.
  • Fix a bug.
  • Subscribe to the mailing list.

Further links:



Sorry to start with a conclusion first… To me Titan graph seems to be the egg-laying wool-milk-sow that people dream of when working with graph data, especially if one needs graph data in a web context and in real time. I will certainly try to free some time to check this out and get hands-on experience. Also, this thing is really new and revolutionary: this is not just another Hadoop or Giraph approach for big data processing, this is distributed and in real time! I am almost confident that, if the benchmarks hold what they promise, Titan will be one of the fastest-growing technologies we have seen so far.

I met Matthias Bröcheler (CTO of Aurelius, the company behind Titan graph) 4 years ago in a teaching situation at a German national academy for high school students. It was the time when I was still more mathematician than computer scientist, and my journey towards becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experience a lot. It was through him that my eyes were first opened to what big data means and how companies like Facebook, Google and so on knit their business models around collecting data. Anyway, Matthias influenced me in quite some ways and I have a lot of respect for him.

I did not start my PhD right away and we lost contact. I knew he was interested in graphs, but that was about it. Only when I started to use neo4j more and more did I realize that Matthias was also one of the authors of the TinkerPop Blueprints, the interfaces for talking to graphs which most graph database vendors implement. At that time I looked him up again and realized he was working on Titan, a distributed graph database. I found this promising-looking slide deck:

Slide 106:

Slide 107:

But at that time there wasn’t much evidence for me that Titan would really hold the promise given in slides 106 and 107. In fact, those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases. (By the way: reading the PhD proposal now, I am kind of amused, since I did not really aim for the big, important points the way Titan did.)

During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button, and especially to be able to integrate it with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (an implementation of Google’s Bigtable) and Cassandra (an implementation of Amazon’s Dynamo), which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines, I stumbled upon Titan again and saw Matthias’ talk at the Cassandra Summit 2013. Around minute 21 / 22 the talk gets really interesting; I can even suggest skipping the first 15 minutes:

Let me sum up the amazing parts of the talk:

  • 2400 concurrent users against a graph cluster!
  • real time!
  • 16 different (non-trivial) queries
  • achieving more than 10k requests answered per second!
  • graph with more than a billion nodes!
  • graph partitioning is pluggable
  • graph schema helps indexing for queries
So far I was not sure what kinds of queries were really involved, especially whether there were also write transactions, and unfortunately no one in the audience asked that question. So I started googling and found this blog post by Aurelius. There we get an entire overview of the queries, and the results are presented in much more detail. Unfortunately I was not able to find the source code of that very benchmark (which Matthias promised in his talk to open up). On average, most queries take less than half a second.
Even though the source code is not available, this talk together with the Aurelius blog post looks to me like the most interesting and hottest piece of technology I have come across during my PhD program. Aurelius started to think distributed right away and made some clever design decisions:
  • scaling data size
  • scaling data access in terms of concurrent users (especially write operations) is fundamentally integrated, and seems also to be successfully integrated
  • making partitioning pluggable
  • requiring a schema for the graph (to enable efficient indexing)
  • being able to extend the schema at runtime
  • building on top of either Cassandra (for real time) or HBase (for consistency)
  • being compatible with the TinkerPop tech stack
  • bringing up an entire framework for analytics and graph processing

Further resources:



The redevelopment of metalcon is going on, and so far I have been very concerned about performance and web scale. Due to Martin’s progress on his bachelor thesis, we did a review of the code that calculates Generalized Language Models with Kneser-Ney smoothing. Even though his code is standalone (but very performance-sensitive) software, I realized that for a web application writing maintainable code seems to be as important as thinking about scalability.

Scaling vs performance

I am a performance guy. I love algorithms and data structures. When I was 16 years old I had already programmed software that could play chess against you, using a high-performance programming technique called bitboards.
But thinking about metalcon I realized that web scale is not so much about the performance of single services or parts of the software, but rather about the scalability of the entire architecture. After many discussions with colleagues, Heinrich Hartmann and I came up with a software architecture which we believe will scale for websites that are supposed to handle several million monthly active users (probably not billions, though). After discussing the architecture with my team of developers, Patrik wrote a nice blog article about the service-oriented, data-denormalized architecture for scalable web applications (which of course was not invented by Heinrich and me; Patrik found out that it had already been described in a WWW publication from 2008).

Anyway, this discussion showed me that scalability is more important than performance! Though I have to say that the standalone services should also be very performant: if a service can only handle 20 requests per second, then even if it easily scales horizontally, you will just need too many machines.

Performance vs. Maintainable code

Especially after the code review, but also with the currently running metalcon version in mind, I came to the conclusion that there is incredibly high value in maintainable code. The hacker community seems to agree that maintainability comes before performance (only one of many examples).

At that point I want to recall my initial post on the redesign of metalcon. I had in mind that performance is the same as scaling (which is a wrong assumption) and asked about Ruby on Rails vs. GWT. I am totally convinced that GWT is much more performant than Ruby. But I have seen GWT code, and it seems almost impractical to maintain. On the other side, from all that I know, Ruby on Rails is very easy to maintain but less performant. The good thing is that it easily scales horizontally, so it seems almost like a no-brainer to use Ruby on Rails rather than GWT for the front-end design and middle layer of metalcon.

Maintainable code vs scalability

Now comes the most interesting fact that I realized: a software architecture scales best if it consists of many independent services. If services need to interact, they should do so asynchronously and without blocking. Creating a clear software architecture with clear communication protocols between its parts will do two things for you:

  1. It will help you to maintain the code. This will cut down development cost and time. In particular, it will be easy to add, remove or exchange functionality in the entire software architecture. The last point is crucial, since
  2. being easily able to exchange parts of the software or single services will help you to scale. Every time you identify the bottleneck, you can fix it by exchanging this part of the software for a better-performing system.
In order to achieve scalable code, one needs to include some middle layer for caching and one needs to abstract certain things. The same things are done in order to get maintainable code (often decreasing performance).
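A tiny example of the kind of abstraction I mean (purely illustrative; plain dict-like objects stand in for the real database and cache): the same indirection that keeps callers away from the database, and thus makes the code maintainable, is what later lets you swap the backing store when it becomes the bottleneck.

```python
class UserService:
    """Cache-aside lookup behind a stable interface: callers never
    touch the database or the cache directly, so either backend can
    be exchanged without touching the callers."""

    def __init__(self, db, cache):
        self.db = db        # any dict-like store
        self.cache = cache  # any dict-like cache

    def get_user(self, user_id):
        user = self.cache.get(user_id)
        if user is None:
            user = self.db[user_id]      # cache miss: hit the store
            self.cache[user_id] = user   # and remember the result
        return user

db = {42: {"name": "Ada"}}
cache = {}
service = UserService(db, cache)
print(service.get_user(42))  # → {'name': 'Ada'}  (now also cached)
```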


I find this very interesting and counterintuitive. One would think that performance is a core element of scalability, but I have the strong feeling that writing maintainable code is much more important. So my ranked list of priorities for backend web programming (!) looks like this:

  1. Scalability first: no maintainable code helps you if the system doesn’t scale and can’t be served to millions of users.
  2. Maintainable code: as stated above, this should go almost hand in hand with scalability.
  3. Performance: of course we can’t have a database design where queries need seconds or minutes to run; everything should happen within a few milliseconds. But if the code can become more maintainable at the cost of another few milliseconds, I guess that’s a good investment.


