René Pickhardt on July 23rd, 2014

For a long time I have been trying to educate people about social networks and marketing within social networking services. I thought I would not write another article about this topic on my blog, but I just stumbled upon a really nice video summarizing a lot of my concerns from a different angle.

So here you can see how much click fraud exists with likes on Facebook, why you should not buy any likes, and what the problems with Facebook's news stream are in general.

I am really wondering at what point the industry will realize that Facebook is not a vibrant ecosystem and that this thing will break down. I have predicted the downfall of Facebook quite often and it has not happened so far. I still expect it to happen; I guess we all just have to wait a little bit longer.


René Pickhardt on May 29th, 2014

This is a copy of an answer I left on Larry Page’s Google Plus profile after another announcement about the progress of the self driving car:

How come you have a patent for the self driving taxi company in the USA? I always thought of you as a creative role model who wanted to change the world towards a better place.

Why not leave the competition open? I really loved Jonathan Rosenberg's blog entry in which you ask for openness. It inspired me a lot in my PhD program to make everything open.

I was so excited when you announced the self driving car, as it was my childhood dream to have self driving cars in the world. (Visiting the United States in 2001 I learnt programming in high school, and in my book I drafted my own first class schema for a self driving car.)

After you announced the self driving car I was keen to build up the taxi company, contributing to progress in transportation and logistics and helping the process of automation.

There is much criticism of Google, but after your recent TED talk I thought: "Larry is really the guy who doesn't care for money but wants to bring progress to the world!" Before that I often wondered why you didn't take part in Bill Gates' Giving Pledge, but I realized that you needed your capital for building things like the car or the glasses. I thought that this is also an amazing way to change the world towards a better place.

In the talk you mentioned the sharing economy and the fact that there are too many cars in LA. I thought, "Great, this is exactly where the taxi company comes in," and I was happy to realize that you already seem to have the same ideas. I even expected Google to create a taxi company.

Anyway, please give the self driving car to the world and don't own all of its applications. Please build open systems and give people the chance to contribute to the ecosystem around the self driving car.
Closed systems will slow down progress, as you say yourself in the above-mentioned blog article.


René Pickhardt on April 24th, 2014

I was just reading through the recent notes of Heinrich (which I can recommend, as well as his old notes) when I stumbled upon the note called Monitor /etc/ using git. I was confused: why would one do this?

So I talked to Heinrich and he said:

“Well you want to monitor changes of your system config. You want to be able to revert them and you don’t want to care about this when you do something.”

I really liked this and think it is such a useful and smart idea that I wanted to share it with you. Just keep in mind that you should not push the git repository to some public space, since the config files might include a lot of passwords. Also have a look at his .gitignore: in his case the printer does a lot of automatic changes and is thus ignored. You might need similar settings for your configs.
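
If you want to set this up yourself, here is a minimal sketch of how it could look (my own illustration, not Heinrich's actual setup; it assumes git is installed, that you run it as root, and the ignored cups/ entry is just a placeholder for whatever changes automatically on your machine):

    import subprocess

    ETC = "/etc"

    def git(*args):
        # run a git command inside /etc (requires root privileges)
        subprocess.run(["git", *args], cwd=ETC, check=True)

    git("init")

    # ignore files that are rewritten automatically; cups/ is just a placeholder,
    # adjust it to whatever changes on its own on your system
    with open(ETC + "/.gitignore", "w") as gitignore:
        gitignore.write("cups/\n")

    git("add", "-A")
    git("commit", "-m", "initial snapshot of /etc")

After that, a plain git diff inside /etc shows you exactly what a package upgrade or a colleague changed, and git checkout lets you revert it.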

I hope sharing this was useful for you!


After two years of hard work I can finally and proudly present the core of my PhD thesis. Starting from Till Speicher and Paul Georg Wagner implementing one of my ideas for next word prediction as an award-winning project for the Young Scientists competition, and after several iterations over this idea which resulted in a deeper understanding of what I am actually doing, I have developed the theory of Generalized Language Models and evaluated its strength together with Martin Körner over the last years.

As I will present in this blog article, and as you can read in my publication (ACL 2014), it seems that Generalized Language Models outperform Modified Kneser-Ney Smoothing, which has been accepted as the de facto state-of-the-art method for the last 15 years.

So what is the idea of Generalized Language Models in non-scientific terms?

When you want to assign a probability to a sequence of words you will run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and by interpolating higher-order models (models over longer word sequences) with lower-order language models. While this idea is strong and helpful, it is usually applied in the same way: in order to use a shorter model, the first word of the sequence is omitted, and this is iterated. The problem occurs if one of the last words of the sequence is the really rare one; in that case omitting words at the front will not help.
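
For readers who like formulas: the usual interpolation this paragraph refers to backs off by dropping the first word of the context, roughly like this (standard textbook form with a single discount D and a normalizing weight lambda; real modified Kneser-Ney additionally uses continuation counts and several discounts, so take this only as a simplified sketch):

    P(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\, 0\}}{c(w_{i-n+1}^{i-1})}
                                  + \lambda(w_{i-n+1}^{i-1}) \, P(w_i \mid w_{i-n+2}^{i-1})

The lower-order model on the right always drops the first word of the context; the generalization described below replaces this single back-off path by several skipped variants.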

So the simple trick of Generalized Language Models is to smooth a sequence of n words with n-1 shorter models which skip a word at positions 1 to n-1, respectively.

Then we combine everything with Modified Kneser-Ney Smoothing, just like it was done with the previous smoothing methods.
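
To make the trick concrete, here is a toy sketch (my own illustration, not the code from the paper) of which lower-order patterns a 4-gram contributes when we skip one context word at a time; the word to be predicted is always kept:

    def skip_patterns(ngram):
        # return the n-1 skipped variants of an n-gram: each variant replaces
        # one context word (never the last, predicted word) by a wildcard '*'
        n = len(ngram)
        variants = []
        for position in range(n - 1):
            skipped = list(ngram)
            skipped[position] = "*"
            variants.append(tuple(skipped))
        return variants

    print(skip_patterns(("the", "quick", "brown", "fox")))
    # [('*', 'quick', 'brown', 'fox'),
    #  ('the', '*', 'brown', 'fox'),
    #  ('the', 'quick', '*', 'fox')]

Each of these skipped patterns is counted on the training data just like a regular n-gram and then enters the interpolation, so a rare word near the end of the sequence no longer forces us to throw away all the information at the front.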

Why would you do all this stuff?

Language models have a huge variety of applications, like spell checking, speech recognition, next word prediction (autocompletion), machine translation, question answering, and more.
Most of these problems make use of a language model at some place. Creating language models with lower perplexity lets us hope to increase the performance of the above-mentioned applications.

Evaluation Setup, methodology, download of data sets and source code

The data sets come in the form of structured text corpora which we cleaned from markup and tokenized to generate word sequences.
We filtered the word tokens by removing all character sequences which did not contain any letter, digit or common punctuation marks.
Eventually, the word token sequences were split into word sequences of length n which provided the basis for the training and test sets for all algorithms.
Note that we did not perform case-folding nor did we apply stemming algorithms to normalize the word forms.
Also, we did our evaluation using case sensitive training and test data.
Additionally, we kept all tokens for named entities such as names of persons or places.
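
Just to illustrate the kind of filter rule described above (the exact expression in our code may differ), this keeps a token only if it contains at least one letter, digit or common punctuation mark:

    import re

    # a token survives if it contains at least one word character (letter, digit,
    # underscore) or one of a few common punctuation marks
    KEEP = re.compile(r"[\w.,;:!?'\"-]")

    def filter_tokens(tokens):
        return [token for token in tokens if KEEP.search(token)]

    print(filter_tokens(["Hello", "world", "42", "~~~", "?!"]))
    # -> ['Hello', 'world', '42', '?!']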

All data sets have been randomly split into a training and a test set on a sentence level.
The training sets consist of 80% of the sentences, which have been used to derive n-grams, skip n-grams and corresponding continuation counts for values of n between 1 and 5.
Note that we have trained a prediction model for each data set individually.
From the remaining 20% of the sequences we have randomly sampled a separate set of 100,000 sequences of 5 words each.
These test sequences have also been shortened to sequences of length 3 and 4 and provide the basis for our final experiments to evaluate the performance of the different algorithms.

We learnt the generalized language models on the same split of the training corpus as the standard language model using modified Kneser-Ney smoothing and we also used the same set of test sequences for a direct comparison.
To ensure rigour and openness of research you can download the data set for training as well as the test sequences and you can download the entire source code.
We compared the probabilities of our language model implementation (which is a subset of the generalized language model) using KN as well as MKN smoothing with the Kyoto Language Model Toolkit. Since we got the same results for small n and small data sets we believe that our implementation is correct.

In a second experiment we have investigated the impact of the size of the training data set.
The Wikipedia corpus consists of 1.7 bn words.
Thus, the 80% split for training consists of 1.3 bn words.
We have iteratively created smaller training sets by decreasing the split factor by an order of magnitude.
So we created an 8% / 92% split, a 0.8% / 99.2% split, and so on.
We stopped at the 0.008% / 99.992% split, as the training data set in this case consisted of fewer words than our 100k test sequences, which we still randomly sampled from the test data of each split.
Then we trained a generalized language model as well as a standard language model with modified Kneser-Ney smoothing on each of these samples of the training data.
Again we have evaluated these language models on the same random sample of 100,000 sequences as mentioned above.

We have used perplexity as a standard metric to evaluate our language models.
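
For readers not familiar with the metric: perplexity is the inverse probability of the test data normalized by the number of words, so lower values are better. A minimal sketch of the computation (standard definition, not our evaluation code):

    import math

    def perplexity(log_probs):
        # log_probs: natural-log probabilities log P(w_i | context_i) for every
        # word of the test data; perplexity = exp(-1/N * sum of the log probs)
        n = len(log_probs)
        return math.exp(-sum(log_probs) / n)

    # toy example: a model that assigns probability 0.25 to each of four test words
    print(perplexity([math.log(0.25)] * 4))  # -> 4.0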

Results

As a baseline for our generalized language model (GLM) we have trained standard language models using modified Kneser-Ney Smoothing (MKN).
These models have been trained for model lengths 3 to 5.
For unigram and bigram models MKN and GLM are identical.

The perplexity values for all data sets and various model orders can be seen in the next table.
In this table we also present the relative reduction of perplexity in comparison to the baseline.

Absolute perplexity values and relative reduction of perplexity from MKN to GLM on all data sets for models of order 3 to 5

As we can see, the GLM clearly outperforms the baseline for all model lengths and data sets.
In general we see a larger improvement in performance for models of higher orders (n=5).
For 3-gram models, in contrast, the gain is negligible.
For German texts the increase in performance is the highest (12.7%) for a model of order 5.
We also note that GLMs seem to work better on broad-domain text than on special-purpose text, as the reduction of perplexity on the wiki corpora is consistently higher than on the JRC corpora.

We made consistent observations in our second experiment where we iteratively shrank the size of the training data set.
We calculated the relative reduction in perplexity from MKN to GLM for various model lengths and the different sizes of the training data.
The results for the English Wikipedia data set are illustrated in the next figure:

Variation of the size of the training data on 100k test sequences on the English Wikipedia data set with different model lengths for GLM.

We see that the GLM performs particularly well on small training data.
As the size of the training data set becomes smaller (even smaller than the evaluation data), the GLM achieves a reduction of perplexity of up to 25.7% compared to language models with modified Kneser-Ney smoothing on the same data set.
The absolute perplexity values for this experiment are presented in the next table.

Our theory as well as the results so far suggest that the GLM performs particularly well on sparse training data.
This conjecture has been investigated in a final experiment.
For each model length we have split the test data of the largest English Wikipedia corpus into two disjoint evaluation data sets.
The data set unseen consists of all test sequences which have never been observed in the training data.
The set observed consists only of test sequences which have been observed at least once in the training data.
Again we have calculated the perplexity of each set.
For reference, the values for the complete test data set are also shown in the following table.

Absolute perplexity values and relative reduction of perplexity from MKN to GLM for the complete and split test file into observed and unseen sequences for models of order 3 to 5. The data set is the largest English Wikipedia corpus.

As expected we see the overall perplexity values rise for the unseen test case and decline for the observed test case.
More interestingly we see that the relative reduction of perplexity of the GLM over MKN increases from 10.5% to 15.6% on the unseen test case.
This indicates that the superior performance of the GLM on small training corpora and for higher order models indeed comes from its good performance properties with regard to sparse training data.
It also confirms that our motivation to produce lower order n-grams by omitting not only the first word of the local context but systematically all words has been fruitful.
However, we also see that for the observed sequences the GLM performs slightly worse than MKN.
For the observed cases we find the relative change to be negligible.

Conclusion and links

With these improvements we will continue to evaluate other methods of generalization and also try to see whether the novel methodology works well within the applications of language models. You can find more resources at the following links:

If you have questions, research ideas or want to collaborate on one of my ideas feel free to contact me.


There is a trend on the web that artists, start-ups (and other people) try to do some crowdfunding. I really like the concept: ask your true supporters, fans and customers to help realize the next big project, which they will like. But sometimes I have the feeling that, especially in music, crowdfunding might not go as well as it could, or to put it simply: people unfortunately just have the attitude of expecting to get music for free. I am not saying that everyone is participating in peer-to-peer platforms and committing copyright violations, but music as a good is nowadays available in a way it never was before, and most people seem to forget the intrinsic value of music. Having helped to build up the band In Legend and having some experience with music and the cost of creating it, I want to share another story that brings me to my sad conclusion:

There is this German band called Mock Unit. In 2009 they released their first video "ready to mock", which was followed by "is doch kärb" (2010) and "seid ihr dabei?!" (2011). Until now these videos have generated almost half a million views on YouTube. You might think that is nothing in comparison to Psy's Gangnam Style, but consider that the band has a very particular style:

  • Their lyrics consist almost exclusively of swearwords, there is a lot of sexism in the texts, and they seem to make a lot of fun of a certain part of our society. Until now I hope that this is satire and a way of expressing disappointment about the existence of problems like sexism in our society. (Otherwise I would have to reconsider my positive feelings towards the band.)
  • They sing in a very local dialect which only few Germans would understand right away (I would guess about 1-2 million).
  • The musical style itself is also not something for everybody and rather appeals to young people.

Yet the band has quite some success within this very small target group.

  • Their Facebook profile has about 5k fans (I assume most of them live within 50 km of Mainz).
  • Fans have started to shoot their own videos in the same style but with other songs.
  • Concerts are sold out. Four months ago they had a concert in the Red Cat and had to send people home because the venue was completely sold out.
  • Fans are dressing up in the same "trashy" style, and when I did the same and walked through Mainz, people on the street would associate me with this band.

From all this I conclude that this band (which obviously is not cast by some label) has a strong fan base in a very small region and does not really have a commercial interest but just wants to continue creating music for its fans. Now the Mock Unit wants to create a second record and decided to go for a crowdfunding campaign on Startnext (basically the German Kickstarter). They provided a nice advertising video in their style, and the rewards also fit that style. Everything is really authentic.

Also, the Mock Unit only asks for 6,000 euros, which according to the number of Facebook fans is an average of only 1.20 euros per fan. Apparently even this seems impossible to achieve. Currently the band has not even collected 2,000 euros, which I really do not understand. I am not asking you to support this band or become a fan of it. I just think it is incredible that a band that has delivered music and videos, that has a solid fan base, and that is able to sell out venues is not able to collect 1.20 euros per Facebook fan for the sake of creating a new record and video.

I think it is really a shame that culture has apparently become so worthless in our society. By the way, I would have much more faith in a musician's crowdfunding campaign if they promised to release the record under a Creative Commons Attribution-ShareAlike licence. In this way I would know that in return for my support I would not only be able to expect some entertainment but would also obtain the legal right to use the material which I directly funded.

For those who know and like the band, I created a video in their style in order to encourage other fans to think about whether they really do not want to support the band:


René Pickhardt on January 18th, 2014

Hey everyone, I wonder if you could help me out. I am currently at my parents' home and there are some old PCs from the time when I was young (even my very first own PC is among them). They were probably bought between 1997 and 2002 and have single-core processors ranging from 333 MHz up to 1800 MHz. Memory varies between 64 MB and 1 GB, as does the hard disk. Those computers need way too much energy, the fans are really loud, and so on…

In general the electronic parts of these computers are still in good shape and they have served a good purpose for a long time. I can imagine many use cases, yet looking at eBay you would only be able to sell these computers for 1 euro each.

I kind of refuse to give them away for free or, even worse, throw them away. But apparently these computers are worth nothing, which, sorry to say it again, I refuse to accept. Computing power is an amazing thing.

Does anyone have a cool idea of what one could do with them? Maybe install some lightweight Linux and use them to control some hardware or investigate some networking projects. I even considered using them as a file / backup server, but this also does not seem to be a good idea, since the energy consumption, as mentioned above, is too high and the network storage devices you can buy nowadays seem to fulfil that purpose much better.

I tried to google for the problem but only found boring articles without any good ideas. So if any of you have an idea, it would be highly appreciated.


Two months ago I started to create the Web Science MOOC, and now you can join our MOOC as a student. We will start online streamed flipped-classroom lessons on October 29th. Our MOOC is truly open, meaning that all the teaching material will be provided as open educational resources under a Creative Commons Attribution-ShareAlike 3.0 licence.

In the first month we will learn about the following topics:

  • Ethernet
  • Internet Protocol
  • Transmission Control Protocol
  • Domain Name System
  • URIs
  • HTTP
  • HTML
  • RDF
  • Javascript / CSS

The Ethernet lessons can be found at:

https://en.wikiversity.org/wiki/Topic:Web_Science/Part1:_Foundations_of_the_web/Internet_Architecture/Ethernet

 

The Internet protocol lessons can be found at:

https://en.wikiversity.org/wiki/Topic:Web_Science/Part1:_Foundations_of_the_web/Internet_Architecture/Internet_Protocol

 

Since Wikiversity, in comparison to other MOOC platforms, is truly open, you might also want to watch some of my introductory videos. They are particularly helpful for showing how to make the best use of Wikiversity as a MOOC platform and how one can really engage in the discussion. You can find the videos at:

https://en.wikiversity.org/wiki/Topic:Web_Science/New_here

 

But maybe you are already interested in watching some of the content right here right away:

 


René Pickhardt on September 14th, 2013

I would like to have a discussion with people who have experience with or are interested in MOOCs and Wikiversity. The goal is to check out the possibilities for creating (otherwise over-commercialized) MOOCs in an OER environment (especially Wikiversity).

Background:

According to my former blog post there are 3 ways of creating a MOOC that is truly OER:

Out of these I would love to discuss which possibilities exist in the context of Wikiversity and how such a MOOC could benefit from the ecosystem of other Wikimedia projects (e.g. Wikibooks, Wikimedia Commons, Wikipedia and of course Wikiversity itself).

I would also love to create a list of requirements for the Wikiversity software with the functionality needed to create an OER MOOC (e.g. access to students' multiple-choice results). This list could be presented to the Wikimedia Foundation in order to extend the Wikiversity software.

My experiences:


René Pickhardt on September 5th, 2013

Even though the reading club on distributed graph databases stopped, I never really lost interest in the management of big data and graph data. Due to the development of research grants and some new colleagues in our group I decided to create a new reading club. (The first meeting will be on Thursday, September 12th, at 15:30 Central European Time.) The reading club won't be on a weekly basis but rather something like once a month. Tell me if you want to join via Hangout or something similar! But I would like to be clear: if you haven't carefully prepared the reading assignments by bringing questions and points for discussion to the meeting, then don't join the meeting. I don't consider skimming a paper to be careful preparation.

The road map for the reading club on big data is quite clear: we will read again some papers that we have read before, but we will also look deeper and check out some existing technologies. So the reading will not only consist of scientific work (though this will form the basis) but will also include hands-on and practical sessions based on blogs, tutorials, documentation and handbooks.

Here is the preliminary structure and road map for the reading club on big data, which of course could easily vary over time!

Along these lines we want to understand:

  • Why do these technologies scale?
  • How do they handle concurrent traffic (especially write requests)?
  • How can performance be increased, and is there another way of building up such highly scalable systems?
  • What kinds of applications (like Titan or Mahout) are built on top of these systems?

At some point I would also love to do some side reading on distributed algorithms and on distributed and parallel algorithm and data structure design.

As stated above, the reading club will be much more hands-on in the future than before; I expect us to also deliver tutorials like the one on getting Nutch running on top of HBase and Solr.

Even though we want to get hands-on with current technologies, the goal is rather to understand the principles behind them and to find ways of improving them instead of just applying them to various problems.

I am considering starting a wiki page on Wikiversity to create something like a course on big data management, but I would only do this if I find a couple of people who would actively help to contribute to such a course. So please contact me if you are interested!

So to sum up, the reading assignments for the first meeting are the Google File System paper and the MapReduce paper.
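
As a tiny teaser of what the MapReduce paper is about, here is a toy word count in plain Python (sequential and purely illustrative; the whole point of the real framework is that the map and reduce phases are distributed over many machines and the shuffle is handled for you):

    from collections import defaultdict

    def map_phase(document):
        # emit a (word, 1) pair for every word in the document
        for word in document.split():
            yield word, 1

    def reduce_phase(word, counts):
        # sum all partial counts for one word
        return word, sum(counts)

    documents = ["the web of data", "the web of documents"]

    # shuffle step: group the intermediate pairs by key
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)

    print(dict(reduce_phase(word, counts) for word, counts in grouped.items()))
    # {'the': 2, 'web': 2, 'of': 2, 'data': 1, 'documents': 1}

The GFS paper then describes the distributed file system that such jobs read their input from and write their output to.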


René Pickhardt on September 2nd, 2013

It has been almost two years since I published my first ideas and work on Graphity, which is nowadays a collection of algorithms to support efficient storage and retrieval of more than 10k social activity streams per second. You know the typical application from Twitter, Facebook and co.: retrieve the most recent status updates from your circle of friends.
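
To make the task concrete, here is a naive sketch of what such a query has to do (my own illustration of the problem with hypothetical example data, not the Graphity algorithm itself, which answers this inside the graph database without scanning every friend's full stream):

    import heapq
    from itertools import islice

    def newest_items(friend_streams, k):
        # friend_streams maps each friend to a list of (timestamp, item) pairs,
        # newest first; return the k most recent items over all friends
        iterators = (
            ((-ts, item) for ts, item in stream)   # negate so merge sees ascending keys
            for stream in friend_streams.values()
        )
        return [(-neg_ts, item) for neg_ts, item in islice(heapq.merge(*iterators), k)]

    streams = {
        "alice": [(1378100000, "new song online"), (1378000000, "rehearsal photos")],
        "bob":   [(1378050000, "concert announced")],
    }
    print(newest_items(streams, 2))
    # [(1378100000, 'new song online'), (1378050000, 'concert announced')]

Graphity's contribution is a graph layout that answers exactly this kind of query efficiently.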

Today I proudly present the first version of the Graphity News Stream Server. Big thanks to Sebastian Schlicht, who worked for me implementing most of the servlet and did an amazing job! The Graphity Server is a neo4j-powered servlet with the following properties:

  • Response times for requests are usually less than 10 milliseconds (plus network I/O, e.g. TCP round trips coming from HTTP).
  • The Graphity News Stream Server is free open source software (GPLv3) and hosted in the metalcon git repository. (Please also use the bug tracker there to submit bugs and feature requests.)
  • It runs two Graphity algorithms: one is read-optimized and the other one is write-optimized, in case you expect your application to have more write than read requests.
  • The server comes with a REST API which makes it easy to hook the server into whatever application you have.
  • The server's responses also follow the activitystrea.ms format, so out of the box there is a large number of clients available to render the server's output.
  • The server ships with unit tests and extensive documentation, especially of the news stream server protocol (NSSP), which specifies how to talk to the server. The server can currently handle about 100 write requests in medium-size (about a million nodes) networks. I do not recommend using this server if you expect your user base to grow beyond 10 million users (though we are working on making the server scale). This is mostly due to the fact that our database right now won't really scale beyond one machine and some internal operations have to be handled synchronously.

Koding.com is currently thinking about implementing Graphity-like algorithms to power their activity streams. It was Richard from their team who pointed out in a very fruitful discussion how to avoid the neo4j limit of 2^15 = 32768 relationship types by using an overlay network. So his idea of an overlay network has been implemented in the read-optimized Graphity algorithm. Big thanks to him!

Now I am really excited to see what kinds of applications you will build using Graphity.

If you use Graphity

Please tell me if you start using Graphity; that would be awesome to know, and I will most certainly include you in a list of testimonials.

By the way, if you want to help spread the server (which is also good for you, since more developers using it means a higher chance of getting newer versions), you can vote up my answer on Stack Overflow:

http://stackoverflow.com/questions/202198/whats-the-best-manner-of-implementing-a-social-activity-stream/13171306#13171306

How to get started

It's darn simple!

  1. Clone the git repository or get hold of the source code.
  2. Switch to the repo and type sudo ./install.sh.
  3. Copy the WAR file to your Tomcat webapps folder (if you don't know how to set up Tomcat and Maven, which are needed, we have a detailed setup guide).
  4. You're done! More configuration details are in our README.md.
  5. Look in the newswidget folder to find a simple HTML / JavaScript client which can interact with the server.

I also created a small, simple screencast to demonstrate the setup:

Get involved

There are plenty of ways to get involved:

  • Fork the server
  • File a bug report
  • Fix a bug
  • Subscribe to the mailing list.

Further links:

