After winning the community award in the Wikipedia video contest, in the category documentation and interview, with my video on pointers in C, I would like to share some experiences of creating educational videos. This is mainly to encourage everyone to do the same as I did.

Have a look at the winning video again if you don’t remember it:

In my opinion it doesn’t take much more than a real interest in education. The video that won the award was used by me in a real teaching scenario. I only had one dry run before recording the one and only version, which is the one I published (with more iterations it could still be a little shorter, more focused and slicker). Most of the time (about 3 hours) went into planning how to present the learning content, something everyone who teaches should do anyway. The entire process took me less than 5 hours, including planning, dry run, recording, uploading and sharing with students.

The impact: starting from the 16 students who were participants in my class, the video has been watched about 10 thousand times by now. In particular it was included in the Wikipedia article on pointers and thus is hopefully a helpful resource for people interested in that topic.

Most importantly, I did not need expensive technology. As you can see from the attached picture, I did not even have a proper way of mounting my digital camera. The microphone was the internal one of that very camera. I used a couple of books together with a ruler to bring the camera into the correct position for a nice shot of the whiteboard I was using. Other than that I used two lamps for proper light and lowered the outside curtains of the window.

What I am basically saying: everyone who owns a camera (which most people nowadays do) can take a video and explain something. You can contribute your explanatory video to the growing knowledge base on Wikimedia Commons. You can contribute to the ongoing discussion on whether Wikipedia articles should be enhanced with videos or not. Most importantly, if you do everything on the whiteboard like me, you will most certainly not run into any of the copyright problems that I ran into before.

So what are you waiting for? I am sure you are an expert on something. Go and give it a shot, and share your video here in the comments but also via Wikimedia Commons, and maybe even include it in a Wikipedia article where it fits well.



In the following article I want to give an overview of the discussions and movements that are going on about video and multimedia content for Wikipedia and its sister projects. I will start with some positive experiences and then tell you about some bad ones. This article is not meant to whine about some edit war; it is more about observing an undecided, open topic within the community of Wikipedians.

During my time as a PhD student I actively contributed to open educational resources by uploading 52 educational videos so far to Wikimedia Commons. Some of those videos were created together with Robert Naumann. Another share of the videos was uploaded by him. A large fraction of those videos were made for the Web Science MOOC and can be found at:

Last week we submitted the following video to the OPERA Award, which is an award for OER video material. It was established with the goal of having more of such content. 

As you can see, it was selected to be the media file of the day on Wikimedia Commons on November 2nd (*cheering*). Can anyone show me how this happened? I was looking for the process but did not find it.

I have also included another video about pointers in C (in German: Zeiger in C) in a Wikipedia article.

Does wikipedia like videos within articles?

In my experience the pointer video was removed a couple of times from the German Wikipedia article on that topic and then also brought back into the article. So it seems that there isn’t any consensus within the community yet about having videos. Interestingly enough, I was asked by some Wikipedians to submit my video to a video competition they are running. The goal of this competition is to get more content creators like me to upload their material to Commons and include it in Wikipedia articles. This effort seems to be funded by money that was donated by the users. There seems to be a similar project in the English Wikipedia. So at least money is flowing in the direction of creating more video content.

Even though these seem to be strong arguments, I have the feeling that not the entire Wikipedia community supports this movement, or, as one could call it, this strategic move. One year ago, without knowing about these kinds of efforts, I tried to include some of the Web Science videos in Wikipedia articles. For example I included the following video:

in the corresponding Wikipedia article. It was removed with a statement saying this would be video SPAM, which in my opinion is a bit of an overreaction.

A summary of the discussion can be found in the slides of my talk at the German open educational resources conference:
If you are interested, you can find the entire discussion on the discussion page of the Ethernet frame article.

Problems for creating video Content for Commons:

Obviously there is a problem with copyright. For example, I have pointed out in the past that creating a screencast during a lecture on a Windows machine means committing a copyright violation, since under the Microsoft EULA the start button and the Windows interface are protected by copyright. Also, in former discussions at #OER13de we agreed that it is hard to edit videos collaboratively (sorry, the link is in German) because the software often is not free and Wikimedia Commons does not support uploading the source files of the videos anyway.


It is not clear whether video content will survive in Wikipedia, even though some strategic effort is being put into that idea. The people who are against it have pretty decent arguments, and I also say that it is really hard to have a tool for collaboratively editing video files. Without such tools, even access to the source files of the videos would not make it easy for people to work on them together. So I am curious to see what the competitions will bring and how the discussions on videos will evolve over time.
At least in Wikiversity we are able to use our videos for teaching as we anticipated, and I am pretty sure this space won’t be affected by the ongoing discussion.



René Pickhardt on November 4th, 2014

Last year we created a MOOC on Web Science. We chose Wikiversity as the platform for hosting the MOOC. The reason for this was the high trust we had in the Wikimedia Foundation to strengthen the open movement. The main problem we experienced with Wikiversity was that the software running it is MediaWiki, which is great for collaboratively building an encyclopedia but not so well suited to providing a learning environment in which students can focus on an interactive learning experience. It is also hard for teachers to learn how to use the MediaWiki software.

So I decided to spend some time together with Sebastian Schlicht (my student assistant, who did an excellent job) to build a bit more infrastructure on top of the MediaWiki on Wikiversity to provide a better interface for learning. Watch the demo here:

As you can see we created a platform that supports:

  • A point-and-click experience for teachers to create classes
  • On-page discussion for students, which supports the standard discussion system in MediaWiki
  • A nice modern navigation which adapts to users while they interact with the page

For me, with this system our videos, quizzes and script content shine in a much brighter light than they did before. For the first time I have fun consuming the content of the MOOC.

For me this was an important step towards my goal of freeing educational content. Not only is our MOOC completely OER, we now also create core infrastructure for any teacher to create more classes that are OER. If you consider doing such a class, feel free to drop me a message and receive free support. You could also start by reading the documentation of the MOOC-Interface or looking at the slides (:


I am looking forward to hearing back from you.



I understand that the following article is written in a very personal way. But this thing seems so unjust to me that it is just unbelievable. So this is the sad story of me trying to bring free educational resources to the world and Microsoft indirectly not allowing me to do so :( The following article is dedicated to Aaron Swartz:


Copyright is f*** up on this planet.

We have created almost all of the 69 videos produced so far by ourselves. The videos which we did not produce ourselves have been published under a Creative Commons licence by the copyright owners. In one case I even called a professor in the United States and asked him to change the licence of his videos on YouTube so that we could reuse them within the Wikimedia Commons ecosystem, which he did (:

So you might think everything is alright. The guys paid attention to proper licences if they used material by others and for the rest they created everything themselves. Unfortunately this is not true.

For some of our flipped classroom sessions we created Hangouts on Air with screencasts of our smartboard. Currently our university only supports the smartboard software SMARTNotebook on Microsoft Windows. Creating a screencast on a Microsoft operating system is critical, since the Microsoft start button is visible and so is the user interface of SMARTNotebook. At least the Microsoft interfaces are protected by copyright, and I believe similar constraints will hold for SMARTNotebook. As a consequence we cannot put a Creative Commons licence on these materials. Consequently we must not host the materials on Wikimedia Commons, as Wikimedia Commons supports only free content.

What we can do now is move the videos to Wikiversity, which allows material under a fair use licence. OK, great, I can still host my course, but parts of it are not free anymore. Don’t be afraid, you don’t have to pay like you have to at other sites. But you lose a lot of your freedom. You cannot remix, correct, translate, [...] the videos. In particular I am not even sure whether I am legally allowed to publish the videos under the terms of fair use. I am not an American citizen and my university clearly is a German institution. Fair use is a United States law. OK, we are hosting the materials on an American website, but will this be sufficient? The last time I had a similar legal problem and asked the legal consultants of our university, the only answer I received was: “Better take the material down. You don’t want to end up in a legal fight”. So not only do we have absurd laws influenced by money-making industries, we are also scared of those industries.

On the other hand, being forced to move to a fair use licence will allow me to include a lot of Creative Commons material whose licence carries the NC tag. Not that I no longer want to create open educational resources, but the quality of the MOOC also suffered from not being able to include CC-NC material.

Think about this again:

We as a university – and in the very end as a society, since the university is paid for by tax money – pay high licence fees to Microsoft in order to be allowed to use their crappy software. We are then forced by the administration to use Microsoft software if we want to use modern technology like smartboards. We pay high wages for professors, me and technical staff to create a free and open online course. And now Microsoft – which I did not even choose to use but was forced to use by our university, which is just following the majority vote of computer users – is telling me that I cannot publish the content I created under the licence that I want.

You might say: Hey guy, calm down. What’s the problem? The course is still online and nothing has changed. But that is exactly the problem: everything has changed. We as a society don’t pay attention to the subtleties and then wonder why we have unjust laws.


We need to think about our laws. It is we who make them anyway! Regional laws conflict with the idea of a global network (fair use for example). Many ideas of copyright are just not suited to a tech-driven world in which sharing, citing and giving attribution and fame to people who create something have fundamentally changed. Laws like the ones mentioned are just outdated and ridiculous. Also other laws like (sorry for a link to German Wikipedia. I might translate the article at some point in time) fall into this category.



René Pickhardt on September 1st, 2014

Out of my personal offline social network I have been frequently approached by non techies that ask me questions like

  • “I want to learn programming, which language should I learn?”
  • “At what school / course can I participate to learn programming?”
  • “Which book can you recommend if I want to learn programming?”

These questions are in my opinion very tough, but I will try to give an answer. A lot of what I will say is probably also true for people like me, who learned programming out of curiosity while in high school, or for people who wish to make a career in IT engineering. Still, I aim this article at people with a university degree who already know how to approach learning and who want programming as an additional skill, not as their main living.

The outline of this article will be the following.

  • Why do you want to learn programming? What I will assume about you.
  • What does it mean to learn programming?
  • What is really necessary in order to learn programming?
  • Discussing various kinds of languages for certain purposes.
  • Your concrete road map.

Why do you want to learn programming? What I will assume about you.

I will assume that you do not want to learn programming because you want to change your major to IT or because you are suddenly totally interested in programming, computers, nerds and geek talk. I will rather assume that you might be a linguist who realizes that statistical natural language processing might be useful. Or you are a biologist or psychologist and need to automate some work with your experimental data from the lab. You could also be a sociologist who wants to do some experiments on the web or study how people behave on the web. Maybe you ended up in a company because jobs in your major are rare and you are seeking an additional qualification. You might have other, similar reasons, but I guess you get the gist of the article.

What does it mean to learn programming? 

Being able to program is a somewhat weird skill. Those who are able to code know clearly what they can expect from another hacker. Yet I have the feeling that people who want to learn programming often have wrong expectations of what programming means. So I will try to work on your expectation management.

Knowing how to code is a skill that exists on various levels. Becoming a good programmer and acquiring programming skills that are really useful for you might be a long road and might require some effort (cf. Peter Norvig’s “Teach Yourself Programming in Ten Years”). Still I have to state (in a somewhat arrogant way) that the first step is very easy (at least in theory).

The first step would probably be to learn the syntax, that is, the commands or instructions of a programming language. This is simple because programming languages usually have a very small syntax. Learning the syntax and playing around with it should seriously not take you more than a few hours in total. If you include practice and repetition, this time will be distributed over a couple of days. This is what books with titles like “how to learn XXX in YYY days”, which are heavily criticized by Norvig, will teach you. The same can be learnt from online courses. The main problem here is that this skill, even though at its core it is all you need, seems very useless to a beginner.

In particular, once you have acquired this skill you will feel that you know as much as before. Compare it with having the Force like a Jedi knight. You might feel it, but you don’t know how to use it yet.

In my opinion the main problem starts now. In order to do something meaningful with your new skill you will have to do much more, and that will distract you from using or learning the skill. You will become frustrated. You will find everything complicated, and you will most likely often have the feeling that you don’t know why something does not work as explained; if it finally does work, you might not know why either.

Common pitfalls for distraction are:

  • Interacting with an Integrated development environment.
  • The same operator in a programming language might have a different semantics depending on the context.
  • Interacting with a compiler or interpreter.
  • (Most likely implicitly) interacting with the operating system
  • Using third party APIs and libraries.
  • Interacting with an editor that has its special properties
  • Making a typo in all the new syntax and receiving error messages that do not tell you “hey you forgot a semicolon here…” but rather look like total chaos.
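The operator pitfall from the list above can be made concrete with a minimal sketch. It uses Python (one of the languages discussed later in this article), and `describe_plus` is a hypothetical helper invented only for this illustration:

```python
# The same "+" operator means different things depending on its operands.
# describe_plus is a made-up helper for this illustration only.
def describe_plus(a, b):
    """Apply + and return the result together with the name of its type."""
    result = a + b
    return result, type(result).__name__

print(describe_plus(2, 3))      # numbers: + adds
print(describe_plus("2", "3"))  # strings: + glues them together
print(describe_plus([2], [3]))  # lists: + joins them into one list

# Mixing the contexts does not guess what you meant, it raises an error.
try:
    describe_plus(2, "3")
except TypeError as error:
    print("TypeError:", error)
```

Running it and then changing the arguments is a cheap way to see which meaning of `+` applies in which context.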

The problem is these things are almost not documented at all and are just assumed to be learnt implicitly or while talking to people or while being in a classroom. You might want to have a look at my German screen cast where I explain how to run “hello world in c“.

The program is the most basic program there is, and it involves almost no programming at all; it just has all the overhead described above. Even though the screencast is 20 minutes long, I still omit so much information that you might think: “Why the hell should I learn all about this?” The point I am trying to make is that it is all the other stuff that is complex.

With this in mind I sometimes wonder how anyone can learn programming at all. The bad news is: books won’t tell you all the differences. Most of the time they don’t even help you with setting up your development environment, so they leave you lost in learning a task which – again – at its core is actually pretty simple.

I would say a subset of being able to program is actually being aware of all these things and being able to distinguish what is happening at what point in time and not becoming confused by all this “magic”.

It already takes some experience to know all of this. The good news is: once you master these things you know just about everything you need in order to move on.

What is really necessary?

Even though you just want to program a little bit I think what will help you most is to bring curiosity about computers in general and the willingness to iterate, fail and play around. Don’t get frustrated if you seem to not understand anything anymore. Abstract and structured thinking is a very helpful skill which you probably bring along when you read this article and have been able to follow until here.

Most likely you will have to move out of your comfort zone. I remember learning touch typing with ten fingers. In the beginning I thought chatting with friends was faster with my old system. I had to move out of my comfort zone to use the new skill, and I had to use it in order to get used to it. The same will hold true for programming. In the beginning you might think that the tasks of your day-to-day job go faster the way you have always done them. Take the challenge. This time, using programming might take longer. But over time it will save you a heck of a lot of time.

But moving out of your comfort zone will mean that you have to be open to geek culture. Computer scientists are nerds and they celebrate it. I honestly learnt stuff about hacking from xkcd comics (or similar stuff).

Having the courage to try out stuff and play around will be essential, but many people still seem to confuse “trial and error” with meaningful testing of “what actually happens if I …” (sorry, that was so important I had to print it in bold and I will repeat it later…). Reading the fucking manual (or documentation, even though it is often missing or badly written) will be of tremendous help for you. What you have to learn is to ask as many questions as possible. A good practice is to write down the questions and explain your problems. While doing so you will understand your problem better and most likely solve it on your own. (Ask questions even about this article. I am well aware of the fact that this article leaves open questions. Go and ask me in the comments as an exercise. (NOW (:)) In general, googling before asking questions is a great idea. This of course is another way of asking questions. Honestly, using a search engine is a skill. I have seen so many students who are not able to properly use a search engine to solve their problems. The fact is, if you want to learn programming, almost all the questions you have have already been asked on the web and are indexed by a search engine. Being aware of this fact and learning to find those answers is a large portion of what you need to become a programmer.

Finally you will need a lot of patience. The complexity of programming comes from the sheer number of technologies that play together. It is inevitable that you will be confused. Stay focused, take a deep breath, start over and try once more.

Discussing various kinds of languages for certain purposes

There are various kinds of languages which serve certain purposes. If you are coming from the outside world you will most likely know what language you want to learn. Honestly, the syntax is almost the same all the time anyway. What is different are the libraries and APIs. As you gain more experience you will realize that you use tools to remember APIs, and at some point you will have a feeling for what functions already exist in some API and can probably guess the name or at least google for it. What I am trying to say is that it doesn’t really matter which language you pick for learning how to program.

Still I think there are three languages that might be particularly interesting candidates. All of these languages have object oriented concepts. Even though I think object oriented programming is great and probably not so difficult for you, you could probably ignore the object oriented concepts.

C/C++: This language is great for people who want to focus on understanding the fundamentals of computers. It is certainly the most difficult to learn and it is easy to mess things up. The level of frustration will be highest but the reward will be the biggest. I learnt programming with C first and from there I could very easily move on to other languages.

Python: I think Python has a very beautiful syntax and it minimizes the distracting elements I wrote about so broadly. Especially for people who want to use programming as an additional skill, Python is very quick for solving smaller tasks with little overhead in the amount of code that you need. Also the standard API that comes with Python is easy to use. Python can be used on the web and in the shell. So it is probably a very nice compromise among all scripting languages.

Java: I would say Java is one of the most widespread languages. It has many strong concepts and especially a large tool chain around it. I did not find any other language that has so much code completion and so many tools to make your life easier. Still, having tools think for you while you have not understood something is a difficult thing.

There are other languages and excellent reasons why you would want to learn them. The only language I would not recommend is JavaScript. This language is a mess for various reasons. So unless you explicitly want to build user interfaces on the web, I see almost no reason why a non-IT person should learn JavaScript.

Your concrete road map

Let’s finally get concrete now. How to approach this project of learning to program:

10 relatively easy tips for the first week:

  1. Choose a language
  2. Install the IDE, compiler, toolchain, …
  3. Do it the hard way. Install Linux (e.g. Ubuntu), use a fancy editor like emacs or vim and go for a fancy tool chain or version control system like git. Not that you need it. But as I said, it forces you out of your comfort zone and brings you into a hacking mindset. This is most important: you need to lose your fear of technology.
  4. Learn the syntax with some random book / course about how to learn programming or your selected language in 7 days. (Again, the book could nowadays also be an online course.)
  5. Understand that most books are didactically very bad and don’t separate concepts, APIs, operating systems and language-specific parts.
  6. Use a cheat sheet for learning the syntax. Not in order to cheat all the time, but because it helps you to distinguish between what is syntax and what is all the distracting overhead. Just google for “c cheat sheet” or “java cheat sheet” or “python cheat sheet”.
  7. Understand the syntax and know it by heart.
  8. Try not to read the examples and memorize them, but try to reproduce them without looking at them. If you can’t, read the chapter again and iterate.
  9. Most important: While working through the book / course, play around! If something worked, start asking questions. What happens if I change the code here? Try to answer the question and then try it out. Even try to break the code in order to see error messages from the compiler / interpreter. Don’t get confused by naming. Try, try, try and try again.
  10. Understand the concept of variable scope - this can also be done best while playing around.
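To make tips 9 and 10 concrete, here is a minimal sketch for playing around with variable scope. It is written in Python purely as an illustration; other languages have different scope rules, and all names (`counter`, `shadow`, `read_global`, `broken`) are hypothetical:

```python
counter = 10  # a module-level (global) variable

def shadow():
    counter = 1  # creates a new local variable; the global one is untouched
    return counter

def read_global():
    return counter  # no assignment inside the function, so the global is read

def broken():
    # An assignment anywhere in a function makes the name local to it,
    # so reading it before it has a value raises UnboundLocalError.
    counter += 1
    return counter

print(shadow())
print(read_global())
print(counter)  # still the original global value
try:
    broken()
except UnboundLocalError as error:
    print("UnboundLocalError:", error)
```

A good exercise in the spirit of tip 9: predict what each line prints before running it, then add the line `global counter` at the top of `broken` and check your prediction again.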

Once you are here, and as I said this should not take you too long, go to the next level: the transfer / experience stage.

  1. Ask yourself what is a task that you frequently do by hand (on a computer or elsewhere) which IT people might automate. Try to automate it.
  2. Find some open source project. Look for the bug tracker, mailing list or IRC channel. Go and contribute. Experience, experience, experience. Talk to the people. Almost every IT person is happy if you try to solve their problems (which they just might not find the time to solve themselves). Tell the people you are a beginner. Chances are pretty high that people will help you out.
  3. Learn blackboxing. (I learned programming at 12 and created a large scale application at 22, and 4 years ago I started to really dig into what computers and programming languages are. I understood many things down to the physical process. Yet most stuff I use is still a black box to me. It is much less of a black box than 10 years ago, but still most of it is a black box!) You cannot understand everything. Sometimes it is just enough to know “hey, if I enter print('a'); the letter a will be output to the terminal.” You don’t have to understand how system I/O (input and output) really works. It will be sufficient for you to be able to use it.
  4. Understand what libraries and APIs are. Understand that some instructions and commands that at first seem to be typical code (or syntax) of your language are just APIs that you should use as a black box. A typical example is the above-mentioned hello world program.
  5. To some degree understand what an operating system is and what its purpose is.
  6. Have a concrete roadmap project. Nothing is more boring than a computer scientist telling you “hey I can automatically calculate square numbers…” I once wrote a blog article explaining with the Python language how to find the most frequent words in books or the longest sentence and decide how long it is… (I admit this program was probably too complex for a beginner, but you could still try.)
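As an idea of what such a roadmap project could look like, here is a minimal sketch of the word counting task from item 6. It is not the program from the blog article mentioned there; the function names, the sample text and the simple rules for splitting words and sentences are assumptions made for this illustration:

```python
import re
from collections import Counter

def most_frequent_words(text, n=3):
    """Return the n most common words as (word, count) pairs."""
    # Assumption: a "word" is a run of letters or apostrophes, case ignored.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

def longest_sentence(text):
    """Return the sentence with the most words."""
    # Assumption: sentences end with ".", "!" or "?"; real books need more care.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return max(sentences, key=lambda s: len(s.split()))

sample = "The cat sat. The cat saw the dog and the dog ran away! Why?"
print(most_frequent_words(sample, 1))
sentence = longest_sentence(sample)
print(sentence, "->", len(sentence.split()), "words")
```

To try it on a real book you could replace `sample` with the contents of a plain text file; `Counter` and `re` are used here as black boxes in the sense of item 3.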

Finally the hard parts

  1. As with all learning tasks, find a) a sparring partner and b) teach the stuff to others as early as possible.
  2. Get a feeling on what is important to know while programming.
  3. Listen to the fucking nerd. We sometimes might not communicate well (guess what: communication is a two-way thing. Chances are high you are not asking well either and not communicating your needs in a good way). But most nerds will be very happy to help you out. They might assume you have knowledge which you do not have. If they do, stop them or slow them down. They might lose themselves in details. If they do, ask them whether this is still relevant to the question and going in the right direction. Do your job on good communication. Remember you’re talking to someone from a different subculture (:
  4. Have clear questions and clear goals.

I wish you good luck and a lot of fun while acquiring your new skills. If this article was helpful for you or changed your life, please tell me in 10 years or so what kind of amazing things you have built since then. Just put a mark in your calendar right now. You don’t use an electronic calendar system yet? Ohhh, much to learn, young Padawan (:



René Pickhardt on August 27th, 2014

This is my personal list of people that I admire. In a sense I would say that if you want to know what I stand for, you can just have a brief look at this list and at the values, norms and ideas the people on it stand for. I have been heavily criticized that this list contains too many white men and not enough people from other cultures and genders. I think the main reason is that I am a western person, and even though I lived in China I can only see as far as the horizon of my culture, and of course I am influenced by my culture. This is also where my values come from. So if you know people with a similar set of ideas and beliefs from other cultures, feel free to contact me or leave a comment and point them out to me. I am very excited to “meet” more exciting people, especially outside of my current horizon.

Also the following list has a randomized order.

Tank man


A man who stood in front of a column of tanks on June 5, 1989, the morning after the Chinese military had suppressed the Tiananmen Square protests of 1989 by force, became known as the Tank Man or Unknown Protester. The tanks manoeuvred to pass by the man, and he moved to continue to obstruct them, in something like a dance. The incident was filmed and seen worldwide.

further info:

own reason:
This is an unbelievable example of civil courage. Obviously his actions did not really change how things went on around Tiananmen, but I think this is truly heroic and brave.
I wish I will always have similar courage when it comes to fighting for a good cause or idea.

Aaron Swartz


Aaron Hillel Swartz (November 8, 1986 – January 11, 2013) was an American computer programmer, writer, political organizer and Internet Hacktivist.
Swartz was involved in the development of the web feed format RSS, the organization Creative Commons, the website framework and the social news site, Reddit, in which he became a partner after its merger with his company, Infogami.
Swartz’s work also focused on sociology, civic awareness and activism. He helped launch the Progressive Change Campaign Committee in 2009 to learn more about effective online activism. In 2010 he became a research fellow at Harvard University’s Safra Research Lab on Institutional Corruption, directed by Lawrence Lessig. He founded the online group Demand Progress, known for its campaign against the Stop Online Piracy Act.
On January 6, 2011, Swartz was arrested by MIT police on state breaking-and-entering charges, after systematically downloading academic journal articles from JSTOR. Federal prosecutors later charged him with two counts of wire fraud and 11 violations of the Computer Fraud and Abuse Act, carrying a cumulative maximum penalty of $1 million in fines, 35 years in prison, asset forfeiture, restitution and supervised release.
Swartz declined a plea bargain under which he would serve six months in federal prison. Two days after the prosecution rejected a counter-offer by Swartz, he was found dead in his Brooklyn, New York apartment, where he had hanged himself.
In June 2013, Swartz was posthumously inducted into the Internet Hall of Fame.

further info:

own reason:

Just read the Guerilla Open Access Manifesto. Writing something like this and understanding the impact of open access is terrific. But living it, through the PACER project and also through the JSTOR case at MIT, is a completely different story.
I strongly believe that unjust laws exist, but we have to understand that law is a relative thing. It is we as a society who make the laws, so it is also up to us to change them. I think the norms and values of a society should stand above any particular law. So what Aaron did was follow a very strong set of norms and values and fight for a better law. One might wonder whether his actions were too radical and not in line with the democratic processes we as a society have agreed on, but I am sure Aaron was driven by the deep wish to make the world a place with more justice.

Lawrence Lessig


Lawrence “Larry” Lessig (born June 3, 1961) is an American academic and political activist. He is a proponent of reduced legal restrictions on copyright, trademark, and radio frequency spectrum, particularly in technology applications, and he has called for state-based activism to promote substantive reform of government with a Second Constitutional Convention. In May 2014, he launched a crowd-funded political action committee which he termed May Day PAC with the purpose of electing candidates to Congress who would pass campaign finance reform.
Lessig is director of the Edmond J. Safra Center for Ethics at Harvard University and a Professor of Law at Harvard Law School. Previously, he was a professor of law at Stanford Law School and founder of the Center for Internet and Society. Lessig is a founding board member of Creative Commons and the founder of Rootstrikers, and is on the board of MapLight. He is on the advisory boards of the Democracy Café, Sunlight Foundation and Americans Elect. He is a former board member of the Free Software Foundation, Software Freedom Law Center and the Electronic Frontier Foundation.

further info:

own reason:
I have to admit that I have not gotten around to reading his book Code 2.0, which is said to be excellent. But from his talks and actions I love how Lessig points out problems within society and how he tries to educate people about them. He seems to have a very similar set of norms and values as Aaron had (and I have), but he follows “the protocol” of our society to fight for them. Above all he seems to be a true intellectual and not just a person who made a career in academia.

Geschwister Scholl


Hans and Sophie Scholl, often referred to in German as die Geschwister Scholl (literally: the Scholl siblings), were a brother and sister who were members of the White Rose, a student group in Munich that was active in the non-violent resistance movement in Nazi Germany, especially in distributing flyers against the war and the dictatorship of Adolf Hitler. In post-war Germany, Hans and Sophie Scholl are recognized as symbols of the humanist German resistance movement against the totalitarian Nazi regime.

further info:

own reason:
It is always hard to pick a single person, or in this case two siblings, when it comes to role models in opposing a regime that is harmful to the people of a society. Of course the Geschwister Scholl were not the only people in the resistance movement of Nazi Germany, and there have been other regimes in other places that also had resistance movements. Still I believe their actions are very remarkable. I think it is the role of students to point out problems in our society. Nowadays many students seem to just accept everything that is happening. Distributing the flyers with the “truth” about Nazi Germany was not only brave; doing it at the university also reached many people who could spread the message further.

I think it is similar to Aaron Swartz. Students and young people have the role of pointing out problems within society more radically, and the Geschwister Scholl most certainly fulfilled this role.

Randy Pausch


Randolph Frederick “Randy” Pausch (October 23, 1960 – July 25, 2008) was an American professor of computer science, human-computer interaction, and design at Carnegie Mellon University (CMU) in Pittsburgh, Pennsylvania.
Pausch learned that he had pancreatic cancer in September 2006, and in August 2007 he was given a terminal diagnosis: “3 to 6 months of good health left”. He gave an upbeat lecture titled “The Last Lecture: Really Achieving Your Childhood Dreams” on September 18, 2007, at Carnegie Mellon, which became a popular YouTube video and led to other media appearances. He then co-authored a book called The Last Lecture on the same theme, which became a New York Times best-seller. Pausch died of complications from pancreatic cancer on July 25, 2008.

further info:

own reason:
It might be the American optimism that is behind Randy Pausch’s lecture and talk, but I actually do not admire him for giving an inspiring lecture even though he was dying. I admire him much more for the fact that he seemed to have lived his life in a very positive way. His goal of enabling the dreams of others sounds very honest to me. I also like the statement that he made: “If you live your life in the right way, the dreams come to you”. I think Randy is a very good example showing that no matter what fate does to a person, it is the person’s responsibility to respond to it. When people cry out they might receive pity, but they will probably not really improve their situation. I guess one can summarize Randy with his quote:

We cannot change the cards we are dealt, only the way we play them.

By the way, I especially like the idea that he gave this talk for his kids, to teach them a lesson at a time when they are grown up and he is no longer around.

Tim Berners-Lee


Sir Timothy John “Tim” Berners-Lee, OM, KBE, FRS, FREng, FRSA, DFBCS (born 8 June 1955), also known as “TimBL”, is an English computer scientist, best known as the inventor of the World Wide Web. He made a proposal for an information management system in March 1989, and he implemented the first successful communication between a Hypertext Transfer Protocol (HTTP) client and server via the Internet sometime around mid November of that same year.
Berners-Lee is the director of the World Wide Web Consortium (W3C), which oversees the Web’s continued development. He is also the founder of the World Wide Web Foundation, and is a senior researcher and holder of the Founders Chair at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He is a director of the Web Science Research Initiative (WSRI), and a member of the advisory board of the MIT Center for Collective Intelligence.
In 2004, Berners-Lee was knighted by Queen Elizabeth II for his pioneering work. In April 2009, he was elected a foreign associate of the United States National Academy of Sciences. He was honoured as the “Inventor of the World Wide Web” during the 2012 Summer Olympics opening ceremony, in which he appeared in person, working with a vintage NeXT Computer at the London Olympic Stadium. He tweeted “This is for everyone”, which instantly was spelled out in LCD lights attached to the chairs of the 80,000 people in the audience.

further info:
Even though he is not a great speaker and reading his book (Weaving the Web) will help much more, I link a video here:

own reason:
In my opinion there are many reasons to admire Tim Berners-Lee. Of course he is famous for inventing the World Wide Web. But I think the time was ripe for this invention. The Internet by itself was not very useful. The ideas of hypertext were around and similar systems existed. As always on the Internet we have a strong winner-takes-it-all phenomenon. So bringing us the World Wide Web is certainly something Tim should get credit for, but it is not the main reason why I admire him.
What is really cool about Tim Berners-Lee is that he seems to have a very clear sense and abstraction of technical things and especially of their impact. Maybe it is easy to develop this sense after creating a technology that literally everyone on the Internet is using, but still I like his activism for openness, interoperability, net neutrality and freedom in general, and freedom of speech in particular. He also addressed me directly after I asked a question in a Q&A session at a conference. His attitude of saying “if you want to change the world you have the tools, don’t talk, just go geek and do it” will certainly stick with me for the rest of my life.

Other than that I like that he is not afraid to make political statements about the problems with the web and where it should go, and that he seems to have no interest whatsoever in becoming a multi-billionaire, which he could easily have achieved by sitting on the invention of the World Wide Web and being so central to its development.

Albert Einstein


Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn]; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist and philosopher of science. He developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). He is best known in popular culture for his mass–energy equivalence formula E = mc² (which has been dubbed “the world’s most famous equation”). He received the 1921 Nobel Prize in Physics “for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect”. The latter was pivotal in establishing quantum theory.
Near the beginning of his career, Einstein thought that Newtonian mechanics was no longer enough to reconcile the laws of classical mechanics with the laws of the electromagnetic field. This led to the development of his special theory of relativity. He realized, however, that the principle of relativity could also be extended to gravitational fields, and with his subsequent theory of gravitation in 1916, he published a paper on the general theory of relativity. He continued to deal with problems of statistical mechanics and quantum theory, which led to his explanations of particle theory and the motion of molecules. He also investigated the thermal properties of light which laid the foundation of the photon theory of light. In 1917, Einstein applied the general theory of relativity to model the large-scale structure of the universe.
He was visiting the United States when Adolf Hitler came to power in 1933 and, being Jewish, did not go back to Germany, where he had been a professor at the Berlin Academy of Sciences. He settled in the U.S., becoming an American citizen in 1940. On the eve of World War II, he endorsed a letter to President Franklin D. Roosevelt alerting him to the potential development of “extremely powerful bombs of a new type” and recommending that the U.S. begin similar research. This eventually led to what would become the Manhattan Project. Einstein supported defending the Allied forces, but largely denounced the idea of using the newly discovered nuclear fission as a weapon. Later, with the British philosopher Bertrand Russell, Einstein signed the Russell–Einstein Manifesto, which highlighted the danger of nuclear weapons. Einstein was affiliated with the Institute for Advanced Study in Princeton, New Jersey, until his death in 1955.
Einstein published more than 300 scientific papers along with over 150 non-scientific works. His great intellectual achievements and originality have made the word “Einstein” synonymous with genius.

further info:

own reason:
He was probably one of my first role models. I admire him for two reasons.
The first, which I nowadays actually find a stupid reason to admire someone, is just his pure intellect. Creating the theory of relativity was an amazing achievement of ignoring what we seem to know and just following the facts (as all good mathematicians and computer scientists should do all the time). But the list of his achievements in physics does not stop at relativity. (To be fair, David Hilbert arrived at the field equations of general relativity at almost the same time as Einstein, after Einstein had presented his ideas to him; who was first is still disputed.) Beyond that, the list of independent fields in physics that he worked on is just incredibly long.
The second reason is the way Einstein behaved regarding the development of the nuclear bomb. He first pointed out, by signing a letter to the American president of that time, Roosevelt, that there was a danger that Nazi Germany might create a nuclear weapon. This led to the Manhattan Project. The interesting part comes at the moment when Einstein regrets signing the letter. He said that if he had known that this weapon would be used against civilians and that Nazi Germany would not be successful in developing such a bomb, he would have done nothing.
Many scientists have a great responsibility. Knowledge can quickly become very dangerous or can be misused for a strategic advantage in harmful actions. Unfortunately I have the feeling that many scientists do not have the time or courage to think about ethics and the real impact of their research (I mean the impact that is not measured by citations and impact factors…). Even Einstein seemed not to be aware of the impact he would have by signing the letter that led to the Manhattan Project. Still he took responsibility after the bombs had been used in Japan. I think many people in Einstein’s position would have found a way of justifying how the Americans had used the bomb against Japan. He did not. He publicly regretted what he had done and started. Finally, he was a key figure behind the open letter (the Russell–Einstein Manifesto mentioned above) which calls on the governments of this world to resolve conflicts in a peaceful way.

Chelsea Manning


Chelsea Elizabeth Manning (born Bradley Edward Manning, December 17, 1987) is a United States Army soldier who was convicted in July 2013 of violations of the Espionage Act and other offenses, after releasing the largest set of classified documents ever leaked to the public. Manning was sentenced in August 2013 to 35 years confinement with the possibility of parole in eight years, and to be dishonorably discharged from the Army. Manning is a trans woman who, in a statement the day after sentencing, said she had felt female since childhood, wanted to be known as Chelsea, and desired to begin hormone replacement therapy. From early life and through much of her Army life, Manning was known as Bradley; she was diagnosed with gender identity disorder while in the Army.
Assigned in 2009 to an Army unit in Iraq as an intelligence analyst, Manning had access to classified databases. In early 2010, she leaked classified information to WikiLeaks and confided this to Adrian Lamo, an online acquaintance. Lamo informed Army Counterintelligence, and Manning was arrested in May that same year. The material included videos of the July 12, 2007 Baghdad airstrike, and the 2009 Granai airstrike in Afghanistan; 250,000 U.S. diplomatic cables; and 500,000 Army reports that came to be known as the Iraq War logs and Afghan War logs. Much of the material was published by WikiLeaks or its media partners between April and November 2010.
Manning was ultimately charged with 22 offenses, including aiding the enemy, which was the most serious charge and could have resulted in a death sentence. She was held at the Marine Corps Brig, Quantico in Virginia, from July 2010 to April 2011 under Prevention of Injury status—which entailed de facto solitary confinement and other restrictions that caused domestic and international concern—before being transferred to Fort Leavenworth, Kansas, where she could interact with other detainees. She pleaded guilty in February 2013 to 10 of the charges. The trial on the remaining charges began on June 3, 2013, and on July 30 she was convicted of 17 of the original charges and amended versions of four others, but was acquitted of aiding the enemy. She is serving her sentence at the maximum-security U.S. Disciplinary Barracks at Fort Leavenworth.
Reaction to Manning’s disclosures, arrest, and sentence was mixed. Denver Nicks, one of her biographers, writes that the leaked material, particularly the diplomatic cables, was widely seen as a catalyst for the Arab Spring that began in December 2010, and that Manning was viewed as both a 21st-century Tiananmen Square Tank Man and an embittered traitor. Reporters Without Borders condemned the length of the sentence, saying that it demonstrated how vulnerable whistleblowers are.

further info:

own reason:
Obviously I did not have the time to read everything that Manning has made public, so I might be blinded by the media coverage of her case. From what I know I can say that, like many others on the list, Manning was bound to her morals and not to what she was or was not allowed to do. I think she was truly trying to point out unjust things, and I think the way she did it was actually pretty smart. I guess there is a lot of structural violence in politics and the military. Pointing out problems in the “correct way” does not seem to really change anything. Therefore she just had to release the video of American soldiers randomly shooting civilians. Did she have to make everything else public? Who knows. Actually, who cares? Making this video public is in itself heroic and should have had a much bigger impact than it did.
Her going to jail for 35 years, and society accepting this, just makes me sad. I really wonder what has to happen for people to start a revolution. Not that I believe in such drastic action, but having Manning in prison for 35 years is f***ed up. I strongly hope that one day Chelsea Manning will receive the Nobel Peace Prize.

Noam Chomsky


Avram Noam Chomsky (/ˈnoʊm ˈtʃɒmski/; born December 7, 1928) is an American linguist, philosopher, cognitive scientist, logician, political commentator and activist. Sometimes described as the “father of modern linguistics”, Chomsky is also a major figure in analytic philosophy. He has spent most of his career at the Massachusetts Institute of Technology (MIT), where he is currently Professor Emeritus, and has authored over 100 books. He has been described as a prominent cultural figure, and was voted the “world’s top public intellectual” in a 2005 poll.
Born to a middle-class Ashkenazi Jewish family in Philadelphia, Chomsky developed an early interest in anarchism from relatives in New York City. He later undertook studies in linguistics at the University of Pennsylvania, where he obtained his BA, MA, and PhD, while from 1951 to 1955 he was appointed to Harvard University’s Society of Fellows. In 1955 he began work at MIT, soon becoming a significant figure in the field of linguistics for his publications and lectures on the subject. He is credited as the creator or co-creator of the Chomsky hierarchy, the universal grammar theory, and the Chomsky–Schützenberger theorem. Chomsky also played a major role in the decline of behaviorism, and was especially critical of the work of B.F. Skinner. In 1967 he gained public attention for his vocal opposition to U.S. involvement in the Vietnam War, in part through his essay The Responsibility of Intellectuals, and came to be associated with the New Left while being arrested on multiple occasions for his anti-war activism. While expanding his work in linguistics over subsequent decades, he also developed the propaganda model of media criticism with Edward S. Herman. Following his retirement from active teaching, he has continued his vocal public activism, praising the Occupy movement for example.
Chomsky has been a highly influential academic figure throughout his career, and was cited within the field of Arts and Humanities more often than any other living scholar between 1980 and 1992. He was also the eighth most cited scholar overall within the Arts and Humanities Citation Index during the same period. His work has influenced fields such as artificial intelligence, cognitive science, computer science, logic, mathematics, music theory and analysis, political science, programming language theory and psychology. Chomsky continues to be well known as a political activist, and a leading critic of U.S. foreign policy, state capitalism, and the mainstream news media. Ideologically, he aligns himself with anarcho-syndicalism and libertarian socialism.

further info:

own reason:
Chomsky is very new on the list so I cannot say very much about him. I have watched several interviews and talks by him, and I just find it amazing how he turned completely towards ethics and political activism and is highly educated, rational and fact driven (he always seems to just have the better argument). In particular I like his point of view on power systems (as far as I understand him he is not blaming individual people for injustice but sees the problem of structural violence). I also like his critical view of mass media, therefore I am eager to read his book Manufacturing Consent.
I particularly like his very clear view of fundamental issues and of how certain policies inevitably lead to certain abuses.

Melinda Gates (also Bill Gates)


Bill & Melinda Gates Foundation (BMGF or the Gates Foundation) is one of the largest private foundations in the world, founded by Bill and Melinda Gates. It was launched in 2000 and is said to be the largest transparently operated private foundation in the world. It is “driven by the interests and passions of the Gates family”. The primary aims of the foundation are, globally, to enhance healthcare and reduce extreme poverty, and in America, to expand educational opportunities and access to information technology. The foundation, based in Seattle, Washington, is controlled by its three trustees: Bill Gates, Melinda Gates and Warren Buffett. Other principal officers include Co-Chair William H. Gates, Sr. and Chief Executive Officer Susan Desmond-Hellmann.
It had an endowment of US$38.3 billion as of 30 June 2013. The scale of the foundation and the way it seeks to apply business techniques to giving makes it one of the leaders in the philanthrocapitalism revolution in global philanthropy, though the foundation itself notes that the philanthropic role has limitations. In 2007, its founders were ranked as the second most generous philanthropists in America, and Warren Buffett the first. As of May 16, 2013, Bill Gates had donated US$28 billion to the foundation.

further info:

own reason:
OK, I admit it is not fair to just name her. I mean, it is still the Bill and Melinda Gates Foundation. But in my perception it is Melinda who was the driving force and the eye-opener for Bill Gates. I always perceived Bill Gates as one of the coldest and most disgusting businessmen out there (on the same list as Steve Jobs and Mark Zuckerberg), using patents, licence agreements and closed systems just for the purpose of becoming incredibly rich. Like other computer scientists he already had a deep impact on people, and bringing us the operating systems and the office suite was probably not that bad after all. I mean, they were still useful tools for most people. Still, he could have chosen a more ethical business model. Well, how should he have seen these things when he was young? I guess he was even bound to investors and to what they wanted.
I guess with the help of Melinda he also realized that it would be too late to make drastic changes to Microsoft, so he changed the focus of his life to create something new. Something that is much more sustainable and that feels very good.
Now, using their wealth, Bill and Melinda Gates are starting to tackle really important issues that we as humans could all tackle but which seem economically unimportant. This feels a little bit like a modern version of Robin Hood. Microsoft is pulling money out of the rich part of the world with nowadays OK software at high cost and with vendor lock-in, but Bill and Melinda are distributing this money, e.g. to fight diseases in areas of the world where the western world simply doesn’t care to fight them. They also act as multipliers who convince other rich people to do the same. I think this contributes a lot to more justice and progress.

Despite my love for technological topics, the Bill and Melinda Gates Foundation is, next to the Wikimedia Foundation, probably the only interesting NGO I am aware of that I would be willing to work for and sacrifice my tech career for. But I guess this could still be done after a successful tech career (:

By the way, fun fact: the rich-get-richer principle holds incredibly well in the case of Bill and Melinda Gates. Warren Buffett, Gates’ “opponent” for the title of wealthiest person in the world, donated almost all his money to the Bill and Melinda Gates Foundation, which I think is an incredible vote of confidence in what Bill and Melinda are doing.

Uncertain candidates, since it's hard to say

There are some borderline candidates which I am still not sure about.

Julian Assange

I do not even know how to make up my mind. On the one hand Julian Assange seems to be an incredibly important person who is really doing a lot of good. On the other hand he seems very self-centered and sometimes not authentic. I understand that he of course has operational costs and no fixed income. Still I am not sure how much of it is real.

RAF – resp. Ulrike Meinhof

I guess in Germany it is almost as impossible to say that one sympathizes with the RAF as it would be to state that one sympathizes with the NSDAP. Yet I liked the fundamental problems the RAF addressed. Their methods were stupid, and I guess there were a lot of “dead fish” swimming with the RAF and taking part in all the terror the RAF committed, but judging from their core beliefs and their problems with German society they seemed to have some really valid points.

Richard Stallman

Inventing the GPL was an incredibly smart move. I am not sure if this was the first copyleft licence or if Stallman really came up with the idea himself. Even if he didn’t, he probably could and would have.
Stallman is often perceived to be too radical and unable to compromise. From what I understand (and within this article I believe that this is the topic where I have the most expertise) this is just the only way. There cannot be such a thing as “half free software”: you are free or you are not free. The impact of being free is so incredibly big that I think it is indeed one of the few points in life where people really should not compromise. So I think that what Stallman is frequently criticized for is actually one of his strongest points.

Linus Torvalds

I am not sure if he is just a winner-takes-it-all guy or if there is more to him. Besides Linux, bringing git to the hacker community is the second and, in the long term, maybe even more impactful innovation by Linus Torvalds. Also, the way he seems to work and the way he seems to understand the dynamics and social processes of the open source community are crazy.

Larry Page

People might ask: “Rene, why are Steve Jobs and Zuckerberg on your bad list and Larry Page not? Where did he donate his money to, and did he do all the philanthropic work like Bill Gates?” My only response is: yes, that is a problem, and that is part of the reason why I am still undecided about Page. What speaks for Page is his creativity combined with his strong will to use technology and financial power to change the world and make it more automated and efficient. In pursuing this goal he seems to ignore economic principles. Google has released a bunch of products that are hard to monetize (even indirectly) or that are really “moonshot” projects. I have the feeling that Page cannot donate money or give up power within Google until he has brought the amount of innovation to the world that he wanted.
* Self driving cars (probably as a shared economy with taxis, logistics and online shopping, and not for sale)
* a better “semantic” search (in combination with Android and more knowledge of the user’s context)
Even though not everything Google does is perfect, it is still incredible that a company with so many employees is able to maintain such a great company culture. At least Google is a company that started with a clear mission statement (“to make the world’s knowledge universally accessible for everyone everywhere”), and as said, Page probably cannot rest and focus on other things until he has fulfilled his noble goal.



René Pickhardt on July 23rd, 2014

For a long time I have been trying to educate people about social networks and marketing within social networking services. I thought I would not write another article about the topic in my blog, but I just stumbled upon a really nice video summarizing a lot of my concerns from a different angle.

So here you can see how much click fraud exists with likes on Facebook, why you should not buy any likes, and in general what the problems with Facebook’s news stream are.

I am really wondering at what point in time the industry will realize that Facebook is not a vibrant ecosystem and that this thing will break down. I have predicted the downfall of Facebook quite often and it has not happened so far. I am still expecting it to happen, and I guess we all have to wait just a little bit longer.



René Pickhardt on May 29th, 2014

This is a copy of an answer I left on Larry Page’s Google Plus profile after another announcement about the progress of the self driving car:

How come you have a patent for the self driving taxi company in the USA? I always thought of you as a creative role model who wanted to change the world towards a better place.

Why not leave the competition open? I really loved Jonathan Rosenberg’s blog entry in which you ask for openness. This inspired me a lot in my PhD program to make everything open.

I was so excited when you announced the self driving car, as it was my childhood dream to have self driving cars in the world. (While visiting the United States in 2001 I learnt programming in high school, and in my book I drafted my own first class schema for a self driving car.)

After you announced the self driving car I was keen to build up the taxi company, contributing to the progress in transportation and logistics and helping the process of automation.

There is much criticism of Google, but after your recent TED talk I thought “Larry is really the guy who doesn’t care for money but wants to bring progress to the world!” Before that I often wondered why you didn’t take part in Bill Gates’ Giving Pledge. But I realized that you needed your capital for building stuff like the car or the glasses. I thought that this is also an amazing way to change the world towards a better place.

In the talk you mentioned shared economy and too many cars in LA. I thought “Great, this is exactly where the Taxi company comes in.” and I was happy to realize that you already seem to have the same ideas. I even expected Google to create a taxi company.

Anyway please give the self driving car to the world and don’t own all the applications of it. Please build open systems and give people the chance to contribute to the ecosystem around the self driving car.
Closed systems will slow down the progress as you say yourself in the above mentioned blog article.



René Pickhardt on April 24th, 2014

I was just reading through Heinrich’s recent notes, which I can recommend reading along with his older ones. When I stumbled upon the note called Monitor /etc/ using git I was confused. Why would one do this?

So I talked to Heinrich and he said:

“Well you want to monitor changes of your system config. You want to be able to revert them and you don’t want to care about this when you do something.”

I really liked this and think it is such a useful and smart idea that I wanted to share it with you. Just keep in mind that you should not push the git repository to a public place, since the config files might include a lot of passwords. Also have a look at his .gitignore: in his case the printer makes a lot of automatic changes and is thus ignored. You might need similar settings for your configs.
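To make the idea a bit more concrete, here is a minimal sketch in Python of what such a snapshot step could look like. It is not taken from Heinrich's note: the function names `snapshot_commands` and `snapshot` and the default commit message are assumptions made for illustration, and the script only uses the standard `git init`, `git status --porcelain`, `git add --all` and `git commit` commands.

```python
import subprocess
from pathlib import Path


def snapshot_commands(repo_dir, message):
    """Return the git commands that record the current state of repo_dir.

    Hypothetical helper for illustration; `git -C <dir>` runs git as if
    it had been started inside <dir>.
    """
    git = ["git", "-C", str(repo_dir)]
    return [
        git + ["add", "--all"],
        git + ["commit", "--message", message],
    ]


def snapshot(repo_dir, message="config snapshot"):
    """Initialise the repository on first use, then commit pending changes.

    Returns True if a commit was made and False if nothing had changed.
    """
    repo_dir = Path(repo_dir)
    git = ["git", "-C", str(repo_dir)]
    if not (repo_dir / ".git").exists():
        subprocess.run(git + ["init"], check=True)
    # `git status --porcelain` prints nothing when the working tree is clean.
    status = subprocess.run(
        git + ["status", "--porcelain"],
        check=True, capture_output=True, text=True,
    )
    if not status.stdout.strip():
        return False
    for command in snapshot_commands(repo_dir, message):
        subprocess.run(command, check=True)
    return True
```

Calling `snapshot("/etc")` with sufficient permissions (and a configured git user name and email) would commit whatever changed since the last call. Anything that changes automatically belongs in a `.gitignore` inside the directory, and the repository should stay local for the password reasons mentioned above.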

I hope sharing this was useful for you!



After 2 years of hard work I can finally proudly present the core of my PhD thesis. It started with Till Speicher and Paul Georg Wagner implementing one of my ideas for next word prediction as an award-winning project for the Young Scientists competition. Several iterations over this idea gave me a deeper understanding of what I am actually doing. Out of this I have developed the theory of Generalized Language Models and evaluated its strength together with Martin Körner over the last years.

As I will present in this blog article and as you can read in my publication (ACL 2014), Generalized Language Models seem to outperform Modified Kneser-Ney Smoothing, which has been accepted as the de facto state-of-the-art method for the last 15 years.

So what is the idea of Generalized Language Models in non-scientific slang?

When you want to assign a probability to a sequence of words you run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and interpolating higher order models (models with longer word sequences) with lower order language models. While this idea is strong and helpful, it is usually applied in the same way: in order to get a shorter model, the first word of the sequence is omitted, and this is iterated. The problem occurs if one of the last words of the sequence is the really rare word. In that case omitting words at the front will not help.

So the simple trick of Generalized Language models is to smooth a sequence of n words with n-1 shorter models which skip a word at position 1 to n-1 respectively.
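
To make the difference concrete, here is a small sketch that contrasts the usual way of shortening a sequence with the skipping described above. The `_` placeholder and the function names are assumptions chosen for this illustration; only the skipping pattern itself follows the description.

```python
SKIP = "_"  # placeholder marking a skipped word (notation assumed for this sketch)


def standard_backoff(words):
    """Shorter sequences obtained by repeatedly dropping the first word."""
    return [words[i:] for i in range(1, len(words))]


def skipped_sequences(words):
    """The n-1 sequences that skip exactly one word at positions 1 to n-1.

    The last word, which is the one being predicted, is never skipped.
    """
    return [words[:i] + [SKIP] + words[i + 1:] for i in range(len(words) - 1)]


sequence = ["the", "quick", "brown", "fox"]
print(standard_backoff(sequence))
print(skipped_sequences(sequence))
```

If "brown" is the rare word, every standard shorter sequence except the last one still contains it, whereas the skipped sequence `the quick _ fox` keeps the frequent words and leaves only the rare one out.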

Then we combine everything with Modified Kneser Ney Smoothing just like it was done with the previous smoothing methods.

Why would you do all this stuff?

Language models have a huge variety of applications, such as spell checking, speech recognition, next word prediction (autocompletion), machine translation, question answering, …
Most of these problems make use of a language model at some point. Creating language models with lower perplexity lets us hope to increase the performance of the above-mentioned applications.

Evaluation Setup, methodology, download of data sets and source code

The data sets come in the form of structured text corpora, which we cleaned of markup and tokenized to generate word sequences.
We filtered the word tokens by removing all character sequences which did not contain any letter, digit or common punctuation marks.
Eventually, the word token sequences were split into word sequences of length n which provided the basis for the training and test sets for all algorithms.
Note that we did not perform case-folding nor did we apply stemming algorithms to normalize the word forms.
Also, we did our evaluation using case sensitive training and test data.
Additionally, we kept all tokens for named entities such as names of persons or places.
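
The preprocessing described above could look roughly like the following sketch. The whitespace tokenizer, the exact set of punctuation marks and the sliding window are assumptions made for this example; the published source code may do this differently.

```python
COMMON_PUNCTUATION = set(".,;:!?'\"()-")  # assumed set of common punctuation marks


def keep_token(token):
    """Keep a token if it contains at least one letter, digit or punctuation mark."""
    return any(ch.isalpha() or ch.isdigit() or ch in COMMON_PUNCTUATION
               for ch in token)


def tokenize(text):
    """Split on whitespace and drop tokens without any useful character.

    No case folding and no stemming is applied, so "Berlin" and "berlin"
    stay different tokens.
    """
    return [token for token in text.split() if keep_token(token)]


def sequences(tokens, n):
    """All word sequences of length n, taken with a sliding window."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


tokens = tokenize("Berlin is ### a big city .")
print(tokens)
print(sequences(tokens, 3))
```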

All data sets have been randomly split into a training and a test set on a sentence level.
The training sets consist of 80% of the sentences, which have been used to derive n-grams, skip n-grams and corresponding continuation counts for values of n between 1 and 5.
Note that we have trained a prediction model for each data set individually.
From the remaining 20% of the sequences we have randomly sampled a separate set of 100,000 sequences of 5 words each.
These test sequences have also been shortened to sequences of length 3 and 4 and provide a basis to conduct our final experiments to evaluate the performance of the different algorithms.
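
A rough sketch of such a split, under a few assumptions that are not stated in the text: sentences are given as lists of tokens, the random seed is fixed for reproducibility, and a 5-word sequence is shortened by keeping its last words (so the predicted word stays the same). All function names are hypothetical.

```python
import random


def split_sentences(sentences, train_fraction=0.8, seed=0):
    """Randomly split sentences into a training and a test set."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def sample_test_sequences(test_sentences, n=5, k=100_000, seed=0):
    """Sample up to k sequences of n words from the test sentences."""
    rng = random.Random(seed)
    candidates = [s[i:i + n] for s in test_sentences
                  for i in range(len(s) - n + 1)]
    return rng.sample(candidates, min(k, len(candidates)))


def shorten(sequence, length):
    """Keep only the last `length` words of a sequence (assumed convention)."""
    return sequence[-length:]
```

Because the split happens on the sentence level, no test sentence can leak into the training data, although short word sequences can of course still occur in both sets.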

We learnt the generalized language models on the same split of the training corpus as the standard language model using modified Kneser-Ney smoothing and we also used the same set of test sequences for a direct comparison.
To ensure rigour and openness of research you can download the data set for training as well as the test sequences and you can download the entire source code.
We compared the probabilities of our language model implementation (which is a subset of the generalized language model) using KN as well as MKN smoothing with the Kyoto Language Model Toolkit. Since we got the same results for small n and small data sets we believe that our implementation is correct.

In a second experiment we have investigated the impact of the size of the training data set.
The Wikipedia corpus consists of 1.7 bn. words.
Thus, the 80% split for training consists of 1.3 bn. words.
We have iteratively created smaller training sets by decreasing the split factor by an order of magnitude.
So we created 8% / 92% and 0.8% / 99.2% splits, and so on.
We stopped at the 0.008% / 99.992% split, as the training data set in this case consisted of fewer words than our 100k test sequences, which we still randomly sampled from the test data of each split.
Then we trained a generalized language model as well as a standard language model with modified Kneser-Ney smoothing on each of these samples of the training data.
Again we have evaluated these language models on the same random sample of 100,000 sequences as mentioned above.
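
The arithmetic behind the stopping point can be checked with a few lines. The sketch below uses the rounded figure of 1.7 bn. words, so all sizes are approximate, and it assumes that the 100k test sequences of 5 words amount to 500,000 words.

```python
TOTAL_WORDS = 1_700_000_000   # rounded size of the Wikipedia corpus
TEST_WORDS = 100_000 * 5      # 100k test sequences of 5 words each


def training_sizes(total_words=TOTAL_WORDS, steps=5):
    """Approximate training set sizes for the 80%, 8%, 0.8%, ... splits."""
    sizes = []
    for k in range(steps):
        percent = 80 / 10 ** k
        # integer arithmetic: 80% divided by 10**k of the corpus
        words = total_words * 8 // 10 ** (k + 1)
        sizes.append((percent, words))
    return sizes


for percent, words in training_sizes():
    note = "smaller than the test data" if words < TEST_WORDS else ""
    print(f"{percent}% split: about {words:,} training words {note}")
```

Only the last split falls below the size of the test data, which matches the point where the series was stopped.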

We have used perplexity, a standard metric, to evaluate our language models.
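
For readers who have not met the metric before, here is a minimal sketch of the usual definition: perplexity is 2 to the power of the average negative log2 probability that the model assigns to the test items. How exactly the probabilities were collected in our evaluation is not spelled out here, so treating them as one probability per test sequence is an assumption of this example.

```python
import math


def perplexity(probabilities):
    """Perplexity of a model given the probabilities it assigned to test items.

    Lower is better: a model that is less "surprised" by the test data
    assigns higher probabilities and gets a lower perplexity.
    """
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / len(probabilities))


# A model that always assigns probability 1/4 is as uncertain as a
# uniform choice between four words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```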


As a baseline for our generalized language model (GLM) we have trained standard language models using modified Kneser-Ney Smoothing (MKN).
These models have been trained for model lengths 3 to 5.
For unigram and bigram models MKN and GLM are identical.

The perplexity values for all data sets and various model orders can be seen in the next table.
In this table we also present the relative reduction of perplexity in comparison to the baseline.

Absolute perplexity values and relative reduction of perplexity from MKN to GLM on all data sets for models of order 3 to 5

As we can see, the GLM clearly outperforms the baseline for all model lengths and data sets.
In general we see a larger improvement in performance for models of higher orders (n=5).
The gain for 3-gram models, in contrast, is negligible.
For German texts the increase in performance is the highest (12.7%) for a model of order 5.
We also note that GLMs seem to work better on broad domain text than on special purpose text, as the reduction on the wiki corpora is consistently higher than the reduction of perplexity on the JRC corpora.
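
The relative reduction quoted here is the drop in perplexity from MKN to GLM, expressed as a percentage of the MKN value. A small sketch, where the two perplexity values in the example call are invented for illustration and are not taken from our tables:

```python
def relative_reduction(mkn_perplexity, glm_perplexity):
    """Relative reduction of perplexity from MKN to GLM in percent."""
    return (mkn_perplexity - glm_perplexity) / mkn_perplexity * 100


# Hypothetical values: a drop from 200 to 180 is a reduction of about 10%.
print(round(relative_reduction(200.0, 180.0), 1))
```

A negative result would mean that the GLM has a higher perplexity than the baseline.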

We made consistent observations in our second experiment where we iteratively shrank the size of the training data set.
We calculated the relative reduction in perplexity from MKN to GLM for various model lengths and the different sizes of the training data.
The results for the English Wikipedia data set are illustrated in the next figure:

Variation of the size of the training data on 100k test sequences on the English Wikipedia data set with different model lengths for GLM.

We see that the GLM performs particularly well on small training data.
As the size of the training data set becomes smaller (even smaller than the evaluation data), the GLM achieves a reduction of perplexity of up to 25.7% compared to language models with modified Kneser-Ney smoothing on the same data set.
The absolute perplexity values for this experiment are presented in the next table.

Our theory as well as the results so far suggest that the GLM performs particularly well on sparse training data.
This conjecture has been investigated in a last experiment.
For each model length we have split the test data of the largest English Wikipedia corpus into two disjoint evaluation data sets.
The data set unseen consists of all test sequences which have never been observed in the training data.
The set observed consists only of test sequences which have been observed at least once in the training data.
Again we have calculated the perplexity of each set.
For reference, the values for the complete test data set are also shown in the following table.

Absolute perplexity values and relative reduction of perplexity from MKN to GLM for the complete and split test file into observed and unseen sequences for models of order 3 to 5. The data set is the largest English Wikipedia corpus.
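
The split of the test data into observed and unseen sequences can be sketched as follows, assuming that both the training and the test sequences are available as lists of tokens of the same length. The function name is hypothetical.

```python
def split_by_observation(test_sequences, training_sequences):
    """Split test sequences into those seen in training and those never seen."""
    seen = {tuple(sequence) for sequence in training_sequences}
    observed = [s for s in test_sequences if tuple(s) in seen]
    unseen = [s for s in test_sequences if tuple(s) not in seen]
    return observed, unseen


training = [["a", "b", "c"], ["b", "c", "d"]]
test = [["a", "b", "c"], ["c", "d", "e"]]
observed, unseen = split_by_observation(test, training)
print(observed)
print(unseen)
```

The two sets are disjoint and together contain every test sequence, so the perplexity on the complete test data lies between the values of the two parts.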

As expected we see the overall perplexity values rise for the unseen test case and decline for the observed test case.
More interestingly we see that the relative reduction of perplexity of the GLM over MKN increases from 10.5% to 15.6% on the unseen test case.
This indicates that the superior performance of the GLM on small training corpora and for higher order models indeed comes from its good performance properties with regard to sparse training data.
It also confirms that our motivation to produce lower order n-grams by omitting not only the first word of the local context but systematically all words has been fruitful.
However, we also see that for the observed sequences the GLM performs slightly worse than MKN.
For the observed cases we find the relative change to be negligible.

Conclusion and links

With these improvements we will continue to evaluate other methods of generalization and also try to see whether the novel methodology works well in the applications of language models. You can find more resources at the following links:

If you have questions, research ideas or want to collaborate on one of my ideas feel free to contact me.

