Download Google n-gram data set and neo4j source code for storing it

At the end of September I discovered an amazing data set provided by Google: the Google n-gram data set. Even though the English Wikipedia article about n-grams needs some cleanup, it explains nicely what an n-gram is:
http://en.wikipedia.org/wiki/N-gram
The data set is available in several languages and I am sure it is very useful for many tasks in web retrieval, data mining, information retrieval and natural language processing.
The data set is very well described on the official Google n-gram page, which I also include as an iframe directly here on my blog.

So let me rather talk about some possible applications of this source of pure gold:
I forwarded this data set to two high school students whom I taught last summer at the DSA. They are now working on a project for a German student competition, using the n-grams and neo4j to predict sentences and help people improve their typing.
The idea is that once a user has started to type a sentence, statistics about the n-grams can be used to predict, in a semantically and syntactically correct way, what the next word will be, and in this way increase typing speed by making suggestions to the user. This will be particularly useful on all these mobile devices where typing is really annoying.
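To make the idea concrete, here is a minimal sketch of such a next-word suggestion based on bigram counts. This is my own illustration, not the students' code; the toy corpus, the class name and the assumption that suggestions come from simple bigram frequencies are all mine:

import java.util.*;

// Toy next-word predictor: count bigrams in a corpus, then suggest the most
// frequent follower of the last typed word.
public class BigramPredictor {
    public static void main(String[] args) {
        String corpus = "the quick brown fox jumps over the lazy dog the quick brown cat";
        Map<String, Map<String, Integer>> bigrams = new HashMap<>();
        String[] tokens = corpus.split(" ");
        for (int i = 0; i + 1 < tokens.length; i++)
            bigrams.computeIfAbsent(tokens[i], k -> new HashMap<>())
                   .merge(tokens[i + 1], 1, Integer::sum);

        String lastWord = "quick"; // the word the user just typed
        bigrams.getOrDefault(lastWord, Collections.emptyMap()).entrySet().stream()
               .max(Map.Entry.comparingByValue())
               .ifPresent(e -> System.out.println("Suggestion: " + e.getKey()));
    }
}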
You can find some source code of the newer version at: https://github.com/renepickhardt/typology/tree/develop
Note that this is just a primitive algorithm to process the n-grams and store the information in a neo4j graph database. Interestingly, it can already produce decent recommendations, and it uses less storage space than the n-gram data set, since the graph format is much more natural (and also because we did not store all of the data contained in the n-grams in neo4j; for example, n-grams from different years have been aggregated).
From what I know, the roadmap is now very clear: normalize the weights, use a weighted sum over all the different kinds of n-grams for prediction, and use (supervised) machine learning to learn those weights. As a training data set, corpora from different domains could be used (e.g. a Wikipedia corpus as a general-purpose corpus, or a corpus from a certain domain for a special purpose).
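To sketch what such a weighted sum could look like (the lambda weights and probability inputs below are pure placeholders, not the values the students will learn):

// Interpolated n-gram score: a weighted sum over unigram, bigram and trigram
// probabilities for one candidate next word. The lambda weights would be
// learned with supervised learning on a held-out corpus.
public class InterpolatedScore {
    static double score(double pUnigram, double pBigram, double pTrigram,
                        double l1, double l2, double l3) {
        return l1 * pUnigram + l2 * pBigram + l3 * pTrigram; // l1 + l2 + l3 = 1
    }

    public static void main(String[] args) {
        System.out.println(score(0.01, 0.15, 0.40, 0.1, 0.3, 0.6)); // placeholder numbers
    }
}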
If you have any suggestions regarding the students' work and their approach of using graph databases and neo4j to process and store n-grams and to predict sentences, feel free to join the discussion right here!

My Blog guesses your name – Binary Search Exercise for Algorithms and data structures class

Binary search (http://en.wikipedia.org/wiki/Binary_search_algorithm) is a very basic algorithm in computer science, but it is still important to understand the fundamental principle behind it. Unfortunately, the algorithm is taught so early and is so simple that beginning students sometimes have a hard time grasping the abstract principle behind it. Also, many exercises focus only on implementing the algorithm.
I tried to provide an exercise that focuses on the core principle of binary search rather than on its implementation. The exercise is split into three parts plus a discussion.

Exercise – Binary Search

Your task is to write a computer program that is able to guess your name!

Feel free to check out the following applet that enables my blog to guess your name in less than 16 steps!


In order to achieve this task we will look at three different approaches.

Part 1

  • Download this file containing 48,000 names. It is important for me to state that this file is under the GNU Public License; I just processed the original, much richer file, which you can find at: http://www.heise.de/ct/ftp/07/17/182/.
  • Now you can apply the binary search algorithm (both an imperative and a recursive implementation) to let the computer guess the name of the user.
  • Provide a small user interface that makes it possible for the user to give feedback on whether the name would be found before or after that guess in a telephone book (a minimal sketch follows this list).
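Here is a minimal sketch of what Part 1 could look like, assuming the name list has already been loaded into a sorted array (the toy name list and the class name are my own placeholders, not part of the exercise material):

import java.util.Scanner;

// Binary search over a sorted name array with telephone-book feedback.
public class NameGuesser {
    public static void main(String[] args) {
        String[] names = {"ANNA", "BEN", "CLARA", "DAVID", "EMMA", "FELIX", "MARIA", "RENE", "ZOE"};
        Scanner in = new Scanner(System.in);
        int lo = 0, hi = names.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            System.out.println("Is your name " + names[mid] + "? (y = yes, b = before, a = after)");
            String answer = in.nextLine().trim();
            if (answer.equals("y")) { System.out.println("Guessed it!"); return; }
            if (answer.equals("b")) hi = mid - 1; // the name comes earlier in the telephone book
            else lo = mid + 1;                    // the name comes later
        }
        System.out.println("Your name does not seem to be in the list.");
    }
}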

Part 2

It could happen though that some rare names are not included in this name list. That is why your task now is to use a different approach:

  • Let the computer ask the user how many letters his name consists of.
  • Create a function BinarySearch(String from, String to, int length) and call it, for example, with the parameters ("AAAA", "ZZZZ", 4).
  • Use the function StringToLong below to map a string to a long that respects the lexicographic order; this makes it possible to compute the middle value between two strings.
  • Use the function LongToString and the user interface from Part 1 in order to present the guesses to the user.

// Inverse mappings between capital-letter strings and base-26 longs
// (StringToLong respects lexicographic order; LongToString decodes it back).
private static String LongToString(long l, int length) {
    String result = "";
    for (int i = 0; i < length; i++) { result = (char) ('A' + l % 26) + result; l /= 26; }
    return result;
}
private static long StringToLong(String from) {
    long result = 0;
    for (int i = 0; i < from.length(); i++) result = result * 26 + (from.charAt(i) - 'A');
    return result;
}
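A possible driver for Part 2 could then look as follows; this is my own sketch, assuming the two helper methods above are in scope:

// Guesses a name of the given length by binary search over the encoded range.
private static void BinarySearch(String from, String to, int length) {
    java.util.Scanner in = new java.util.Scanner(System.in);
    long lo = StringToLong(from), hi = StringToLong(to);
    while (lo <= hi) {
        long mid = (lo + hi) / 2;
        String guess = LongToString(mid, length);
        System.out.println("Is it " + guess + "? (y = yes, b = before, a = after)");
        String answer = in.nextLine().trim();
        if (answer.equals("y")) { System.out.println("Guessed it!"); return; }
        if (answer.equals("b")) hi = mid - 1; else lo = mid + 1;
    }
}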

Part 3

Our program from Part 2 is still not able to guess names that contain special characters, which is important for many languages. So your task is to fix this problem by improving the approach you have just implemented.
One way to do this is to recognize that char(65) = 'A', char(66) = 'B', ..., and to transfer this idea to create a sorted array of all letters that are allowed to appear within a name.
Now choose freely one of the following methods:
Method A
Improve LongToString and StringToLong so that they use this array (and thus also cover special characters) instead of the fixed A–Z alphabet.
Method B
Start guessing the name letter by letter using this array; every letter is guessed with the help of binary search (see the sketch below).
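A minimal sketch of the binary search for a single letter in Method B, assuming the sorted array of allowed characters is given (the method name and prompt are my own placeholders):

// Guesses one letter by binary search over the sorted array of allowed characters.
private static char guessLetter(char[] alphabet, java.util.Scanner in) {
    int lo = 0, hi = alphabet.length - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        System.out.println("Does the letter come after " + alphabet[mid] + "? (y/n)");
        if (in.nextLine().trim().equals("y")) lo = mid + 1;
        else hi = mid; // the letter is alphabet[mid] or comes before it
    }
    return alphabet[lo];
}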

Part 4 (discussion)

  • Explain briefly why the approach from Part 2 will in general need more guesses than the approach from Part 1.
  • Explain briefly why Method A and Method B need the same number of guesses in the worst case.
  • If you already knew some letters of the name, would this influence which method you choose? Why?
Google, Facebook & co. are not free!

Besides e-commerce, one really big Internet business model is obviously the trade and monetization of data, and this blog post reminded me to write an article about the topic.
I think the author points out the most important facts! Services like Facebook, Gmail, YouTube and so on are not free. Moneywise they are free of charge, but with your data you pay a price for using them. Luckily for Facebook, Google and co., most people don't know the value of their data, which turns Facebook, Google and co. into very successful businesses generating billions of US dollars.
Since it is very easy to collect data on the web and only very few people understand the dynamics of the Web, I would guess that right now this is one of the most profitable business models on the Internet.
Your contact data vs. data about your interests that define you
Think about it: giving your name and address to me, even in public on the Internet, doesn't allow me to do any business. While building my social network metalcon I experienced something interesting. People were scared as hell to provide their real name when registering with metalcon. But talking to some of the users in real life, they did not realize that the data they produce while using the platform is what should really scare them.
You don't believe it? Let us look into the data Facebook is collecting from you in order to monetize it.

What obvious facts does Facebook know about you?

  • First of all, the interests from your profile.
  • Who your friends are.
  • The interests of your friends.

This is already very strong, and combined with artificial intelligence it will lead to amazing services like friend recommenders and highly personalized advertising systems.

What not so obvious facts does Facebook know about you?

It is strange that almost no one is aware of the fact that you don't have to fill out your social networking profile or have friendships to tell Facebook your interests (I don't have friends on Facebook and the friend recommender still works). Facebook knows:

  • Who you communicate with and how frequently.
  • What you are really interested in (you constantly tell Facebook by clicking on things and visiting profiles and pages!).
  • Which people you really know (being tagged together in a lot of photos is a strong indicator!).
  • What websites you surf on (there are Like buttons everywhere, and most of the time you are still logged in to Facebook!) – much like the retargeting company Criteo.
  • In which geographical regions you log in to Facebook.
  • How many people are interested in you and whether you are an opinion leader.

The boldest part is how Facebook uses all this knowledge about users and brands to earn money by simply enabling people to communicate.

Summary of the data trading business model

You run a website that people love, and you collect user behavior and other data they tell you implicitly. This data just needs to define them and their interests; you don't care about a name! You make it anonymous, just like Google search was for a long time. Now you use this data to create an ad system or anything similar. You can even go out and start selling products on your own!

Jumping back in time

I still remember the first days of the Internet. You registered on some site and they asked you to fill out an old-fashioned survey – very annoying, and it did not even help to earn good money:

  • age
  • sex
  • income
  • children
  • education
  • choose 3 interests
  • what papers do you read
  • what tv do you watch

In comparison to the data Facebook and Google collect about you, that was really innocent. Comparing those modern companies and their sophisticated approaches with the pioneers of the online user data business, it is incredible to see how the .com bubble could rise up at all.

What are the 57 signals google uses to filter search results?

Since my blog post on Eli Pariser's TED talk about the filter bubble became quite popular, and a lot of people seem to be interested in which 57 signals Google might use to filter search results, I decided to extend the list from my article and name the signals I would use if I were Google. It might not be 57 signals, but I guess it is enough to get an idea:

  1. Our search history.
  2. Our location – verified.
  3. The browser we use.
  4. The browser's version.
  5. The computer we use.
  6. The language we use.
  7. The time we need to type in a query.
  8. The time we spend on the search result page.
  9. The time between selecting different results for the same query.
  10. Our operating system.
  11. Our operating system's version.
  12. The resolution of our computer screen.
  13. Average number of search requests per day.
  14. Average number of search requests per topic (until the search is finished).
  15. Distribution of the search services we use (web / images / videos / real time / news / mobile).
  16. Average position of the search results we click on.
  17. Time of day.
  18. Current date.
  19. Topics of the ads we click on.
  20. Frequency with which we click on advertising.
  21. Topics of the AdSense ads we click on while surfing other websites.
  22. Frequency with which we click on AdSense ads on other websites.
  23. Frequency of searches for domains on Google.
  24. Use of google.com or the Google Toolbar.
  25. Our age.
  26. Our sex.
  27. Use of the "I'm Feeling Lucky" button.
  28. Whether we use the enter key or the mouse to send a search request.
  29. Whether we use keyboard shortcuts to navigate through search results.
  30. Whether we use advanced search commands (and how often).
  31. Whether we use iGoogle (and which widgets / topics).
  32. Where on the screen we click besides the search results (and how often).
  33. Where we move the mouse and mark text in the search results.
  34. Number of typos while searching.
  35. How often we use related search queries.
  36. How often we use autosuggestion.
  37. How often we use spell correction.
  38. Distribution of short / general queries vs. specific / long-tail queries.
  39. Which other Google services we use (Gmail / YouTube / Maps / Picasa / ...).
  40. How often we search for ourselves.

Phew – I have to say that after 57 minutes of brainstorming I am running out of ideas for the moment. But this might be because it is already one hour after midnight!
If you have some other ideas for signals or think some of my guesses are totally unreasonable, why don’t you tell me in the comments?
Disclaimer: this list of signals is a pure guess based on my knowledge of and education in data mining. It may be that not a single signal I name corresponds to the 57 signals Google is using. In the future I might discuss why each of these signals could be interesting. But remember: as long as you have high diversity in the distribution, you are fine with any list of signals.

Social news streams – a possible PhD research topic?

It has been two months of reading papers since I started my PhD program – enough time to think about possible research topics. I am more and more interested in search, social networks in general, and social news streams in particular. It is obvious that it is becoming more and more important to aggregate news around a user's interests and social circle and display them to the user in an efficient manner. Facebook and Twitter are doing this in an obvious way, but Google, Google News and a lot of other sites have similar products.

Too much information in one's social environment

In order to create a news stream, one possibility is to just show the most recent information to the user (as Twitter does). Due to the huge amount of information created, however, one wants to filter the results in order to provide a better user experience. Facebook started to filter the news stream on their site, which led to the widely spread discussion about their ironically named EdgeRank algorithm. Many users seem to be unhappy with the user experience of Facebook's Top News.
Also, for some information – such as an event taking place in the future – the moment it becomes available might not be the best moment to display it.

Interesting research hook points and difficulties

I observed these trends and realized that this problem can be seen as a special case of search or, more generally, of recommendation engines in information retrieval: we want to obtain the most relevant information updates within a certain time window for every specific user.
This problem seems algorithmically much harder to me than web search, where the results don't have this time component and for a long time weren't personalized to the user's interests either. The time component makes the question of relevance hard to decide: the information is new, and you don't have any votes or indicators of relevance yet. Consider a news source or a person in someone's environment that wasn't important before – all of a sudden this person could provide a highly relevant and useful piece of information to the user.
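To make the time component a little more concrete, here is a toy scoring sketch – my own illustration, neither an algorithm from the literature nor anything implemented in metalcon – that combines a relevance estimate with an exponential time decay; all names and the half-life constant are assumptions:

// Toy news-item score: relevance estimate weighted by exponential time decay.
public class NewsScorer {
    static final double HALF_LIFE_HOURS = 24.0; // arbitrary choice

    // relevance: estimated interest match in [0,1]; ageHours: time since publication.
    static double score(double relevance, double ageHours) {
        return relevance * Math.pow(0.5, ageHours / HALF_LIFE_HOURS);
    }

    public static void main(String[] args) {
        System.out.println(score(0.9, 2));  // fresh, relevant item scores high
        System.out.println(score(0.9, 48)); // the same item two days later scores much lower
    }
}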

My goal and roadmap

Fortunately, in the past I created metalcon.de together with several friends. Metalcon is a social network for heavy metal fans. On metalcon, users can access information (CD releases, upcoming concerts, discussions, news, reviews, ...) about their favorite bands, concerts and venues in their region, plus updates from their friends. This information can perfectly be displayed in a social news stream. On the other hand, metalcon users share information about their taste in music, the venues they go to and the people they are friends with.
This means that I have a perfect sandbox to develop and test (with real users) some smart social news algorithms that are supposed to aggregate and filter the most relevant news to our users based on their interests.
Furthermore, regional information and information about music are available as linked open data, so the news stream can easily be enriched with semantic components.
Since I am about to redesign metalcon (a lot of work) for the purpose of research, and I am about to go in this direction for my PhD thesis, I would be very happy to receive some feedback and thoughts about my suggested future research topic. You can leave a comment or contact me.
Thank you!

Current Achievements:

IBM's Watson & Google – What is the difference?

Recently there was a lot of news on the Web about IBM's natural language processing system Watson. As you might have heard, Watson is right now challenging two of the best Jeopardy players in the US. A lot of news magazines compare Watson with Google, which is the reason for this article. Even though the algorithms behind Watson and Google are not open source, a lot of estimates and guesses can still be made about the algorithms both computer systems use in order to give intelligent answers to the questions people ask them. Based on these guesses I will explain the differences between Google and Watson.
Even though both systems have a lot in common (natural language processing, apparent intelligence, machine learning algorithms, ...), I will compare the intelligence behind Google and Watson to demonstrate the differences and the limitations both systems still have.
Google is an information retrieval system. It has indexed a lot of text documents and uses heavy machine learning and data mining algorithms to decide which document is most relevant for any given keyword or combination of keywords. To do so, Google uses several techniques. The main concept when Google started was the calculation of PageRank and other graph algorithms that evaluate the trust and relevance of a given resource (meaning the domain of a website). This is a huge difference to Watson: a given hypertext document hosted on two different domains will most probably end up with completely different Google rankings for the same keyword. This is quite interesting, because the information and data within the document are completely identical. So to decide which hypertext document is most relevant, Google does much more than study this particular document: backlinks, neighborhood and context (and maybe more?) are metrics besides formatting, term frequency and other internal factors.
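For illustration, here is a minimal PageRank sketch using power iteration. This is only the textbook formulation (the class name, damping factor and toy graph are my own choices), certainly not Google's actual implementation:

// Textbook PageRank via power iteration; the graph is given as adjacency lists.
public class PageRank {
    public static double[] rank(int[][] outLinks, int n, double damping, int iterations) {
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - damping) / n);
            for (int u = 0; u < n; u++) {
                if (outLinks[u].length == 0) continue; // toy version: ignore dangling nodes
                double share = damping * pr[u] / outLinks[u].length;
                for (int v : outLinks[u]) next[v] += share;
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Tiny example: page 0 links to 1 and 2; pages 1 and 2 link back to 0.
        int[][] graph = {{1, 2}, {0}, {0}};
        for (double p : rank(graph, 3, 0.85, 50)) System.out.println(p);
    }
}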
Watson, on the other hand, doesn't need to justify its answer by returning the text documents where it found the evidence, and it doesn't aim to find the documents best matching a given keyword. For Watson, the task is rather to understand the semantics behind a given key phrase or question. Once this is done, Watson uses its huge knowledge base to find the correct answer. I would guess that Watson uses many more artificial intelligence algorithms than Google, especially supervised learning and prediction and classification models. If anyone has some evidence for these statements, I will be happy if you tell me!
An interesting fact worth mentioning is that both information retrieval systems first of all use collective intelligence. Google does so by using the structure of the Web to calculate the trustworthiness of information; it also uses the set of all text documents to calculate synonyms and other things specific to the semantics of words. Watson uses collective intelligence too: it is provided with a lot of information that human beings have published in books, on the web, or probably even in knowledge systems like ontologies. The systems also have in common that they use a huge amount of computing power and caching in order to provide their answers at a decent speed.
So is Google or Watson more intelligent?
Even though I think that Watson uses many more AI algorithms, the answer should clearly be Google. Watson is highly specialized for one particular task, which it solves amazingly accurately, but Google solves a much more universal problem. Also, Google (like IBM, of course) has some of the best engineers in the world working for it. The Watson team has existed for maybe 5 years with around 40 people, whereas Google has been around for more like 10 years and nowadays has over 20,000 employees.
I am excited to hear your opinion!

Why blogging about collective intelligence

Collective intelligence is a phenomenon that has existed in human society for many years. The British mathematician Francis Galton basically discovered collective intelligence when he found out, after averaging all individual guesses, that a crowd of 787 visitors at a county fair indeed accurately guessed the weight of an ox. On his blog, Ryan Tomayko provides a text copy of the original article "Vox Populi" that appeared in the scientific journal Nature on March 7, 1907.
In fact, collective intelligence is such a significant issue (particularly on the Internet) that I will devote a whole section of my blog to this topic. Great examples of collective intelligence are Wikipedia, Google and most open source projects, for example the content management system behind my blog (WordPress) or the software I use to run this website: Debian Linux, Apache, MySQL and PHP.
Even though artificial intelligence and collective intelligence often appear together, there is quite a difference between these two concepts. The crowd observed by Francis Galton and the people contributing to Wikipedia are not artificial at all. Google, on the other hand, uses the structure of the linked documents on the web and combines this data with artificial intelligence and some heuristics to obtain search engine rankings and gain information and knowledge.
Future articles about collective intelligence will address, discuss and explain these issues in more detail. I will demonstrate what kind of knowledge is already available on the Internet, and I plan to discuss how to use data mining or ideas from the semantic web to make this knowledge more easily accessible to humans.
