Text mining – Data Science, Data Analytics and Machine Learning Consulting in Koblenz, Germany (https://www.rene-pickhardt.de)

Foundations of statistical natural language processing – Review of chapter 1
https://www.rene-pickhardt.de/foundations-of-statistical-natural-language-processing-review-of-chapter-1/ – Tue, 12 Jun 2012

Due to the interesting results we found by creating Typology I am currently reading the related work on query prediction and auto-completion of sentences. There is quite a lot of interesting academic work available in this area of information retrieval.
While reading these papers I realized that I am not that strong in the field of natural language processing, which seems to have a deep impact on my current research interests. That's why I decided to put in a reading session on some of the basic work. A short trip to the library and a look at the work cited in those papers led me to the book Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze. Even though they write in the introduction that the book is far from a complete coverage of the topic, I think it will be a perfect entry point to become familiar with some of the basic concepts in NLP. Since one understands and remembers better what one has read if one writes it down, I am planning to write summaries and reviews of the book chapters here in my blog:

Chapter 1 Introduction

This chapter is split into several sections and is supposed to provide some motivation. It already demonstrates nicely that in order to understand natural language processing you really have to bridge the gap between mathematics and computer science on the one hand and linguistics on the other. Even in the basic linguistic examples there was some notation that I did not understand right away. In this sense I am really looking forward to reading chapter 3, which is supposed to give a rich overview of all the concepts needed from linguistics.
I personally also found the motivating section of chapter 1 too long. I am about to learn some concepts, and right now I don't really feel that I need to understand all the philosophical discussions about grammar, syntax and semantics.
What I really loved, on the other hand, was the last section, "dirty hands". In this section a small corpus (Tom Sawyer) is used to introduce some of the phenomena one faces in natural language processing, some of which I had already discussed, without knowing it, in the article about text mining for linguists on Ulysses. In the book they are of course discussed in a more structured way, but one can easily download the source code from the above-mentioned article and play around with it to understand the basic concepts from the book. Among them were:
Word counts / tokens / types: The most basic operation one can perform on a text is counting words, which is something I already did in the Ulysses article. Counting words is interesting because in today's world it can be automated. What I didn't see in my last blog post is that counting words already leads to more insights than just a distribution of words, which I will discuss now:
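As a minimal sketch of this counting step (assuming a local plain-text copy of the corpus under the hypothetical file name tom_sawyer.txt), something like the following already gives token and type counts:

```python
# Minimal sketch: count tokens (running words) and types (distinct words).
# Assumes a plain-text file "tom_sawyer.txt" in the working directory (hypothetical name).
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    text = f.read().lower()

tokens = re.findall(r"[a-z]+", text)   # crude tokenization: letter sequences only
counts = Counter(tokens)

print("tokens (running words):", len(tokens))
print("types (distinct words):", len(counts))
print("10 most frequent words:", counts.most_common(10))
```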
Distinctive words can be spotted. Once I have a corpus consisting of many different texts I can count all words and create a ranking of the most frequent ones. I will realize that for any given text the ranking looks quite similar to that global ranking. But every once in a while I might spot words in the top 100 of a single text that would not appear in, say, the top 200 of the global ranking. Those words seem to be distinctive words that are of particular interest for the current text. In the example of Tom Sawyer, "Tom" is such a distinctive word.
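A sketch of that comparison, using toy counts in place of real corpus data (the thresholds 100 and 200 simply mirror the example above):

```python
# Sketch: spot distinctive words of a single text against a global corpus ranking.
# The two Counters below are toy stand-ins for real local and global word counts.
from collections import Counter

def ranking(counter, n):
    """Return the set of the n most frequent words."""
    return {word for word, _ in counter.most_common(n)}

def distinctive_words(local_counts, global_counts, top_local=100, top_global=200):
    """Words in the text's local top list that are missing from the global top list."""
    return ranking(local_counts, top_local) - ranking(global_counts, top_global)

local_counts = Counter({"tom": 50, "the": 300, "and": 250, "river": 40})
global_counts = Counter({"the": 100000, "and": 80000, "of": 60000, "river": 500})
print(distinctive_words(local_counts, global_counts))  # prints {'tom'}
```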
Hapax legomena: If one looks at all the words in a given text one will realize that most words occur fewer than three times, and a large fraction occurs exactly once; a word that occurs only once in a text is called a hapax legomenon. This demonstrates the difficulty of natural language processing: the data one analyses is very sparse. The frequent words are most of the time function words carrying grammatical structure, whereas the infrequent words carry the semantics. From this phenomenon one moves very quickly to:
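The same word counts make it easy to measure this sparsity; a small sketch on a toy sentence:

```python
# Sketch: how sparse word frequencies are, using a toy sentence instead of a real corpus.
from collections import Counter

counts = Counter("the the cat sat on the mat with another cat".split())

hapaxes = [w for w, c in counts.items() if c == 1]  # words occurring exactly once
rare = [w for w, c in counts.items() if c < 3]      # words occurring fewer than three times

print(f"{len(hapaxes)} of {len(counts)} types are hapax legomena")
print(f"{len(rare)} of {len(counts)} types occur fewer than three times")
```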
Zipf's law: Roughly speaking, Zipf's law says that when you count the word frequencies of a text and order the words from most frequent to least frequent, the product of a word's rank and its frequency stays roughly constant (in other words, frequency is inversely proportional to rank). This is of course only an approximation: just imagine that the most frequent word occurs an odd number of times. Then there is no integer frequency for the second most frequent word that, multiplied by its rank of 2, would yield exactly the frequency of the most frequent word.
Anyway, Zipf's law was a very important discovery and has been generalized by Mandelbrot's law (which I had so far only known from chaos theory and fractals). Maybe sometime in the near future I will find the time to calculate the word frequencies of my blog and see whether Zipf's law holds (:
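A quick way to check this on any text is to tabulate rank times frequency; a minimal sketch, assuming the same crude letter-sequence tokenization as above and a hypothetical file blog.txt:

```python
# Sketch: check Zipf's law by printing rank, frequency and their product.
# Assumes a plain-text file "blog.txt" (hypothetical name) in the working directory.
import re
from collections import Counter

with open("blog.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

for rank, (word, freq) in enumerate(Counter(tokens).most_common(20), start=1):
    # If Zipf's law holds, rank * freq should stay roughly constant down the table.
    print(f"{rank:>4}  {word:<15} {freq:>8} {rank * freq:>10}")
```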
Collocations / bigrams: One other important concept was that of collocations. Many words only have a meaning together; in this sense "New York" or "United States" are more than the sum of their single words. The text pointed out that it is not sufficient to find the most frequent bigrams in order to find good collocations. Those bigrams have to be filtered, either with grammatical structures or by normalization. I think calculating a Jaccard coefficient might be interesting (even though it was not mentioned in the text), as sketched below. Should I really try to verify Zipf's law in the near future, I will also try to verify my method for calculating collocations. I hope I would find collocations in my blog like "social network", "graph database" or "news stream"…
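A minimal sketch of that idea, scoring adjacent word pairs with a Jaccard coefficient; the tokenization and corpus file are the same assumptions as in the earlier sketches, and the scoring is my own choice, not the book's:

```python
# Sketch: rank adjacent word pairs by a Jaccard-style score instead of raw frequency,
# so that pairs of merely frequent function words get diluted by their unigram counts.
from collections import Counter

def collocations(tokens, min_count=5):
    """Score adjacent word pairs with a Jaccard coefficient and return them sorted."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # ignore rare pairs; their scores are too noisy
        # Jaccard: co-occurrences divided by occurrences of either word
        score = c / (unigrams[w1] + unigrams[w2] - c)
        scored.append((score, w1, w2, c))
    return sorted(scored, reverse=True)

# usage with a token list from the earlier sketches:
# tokens = re.findall(r"[a-z]+", open("tom_sawyer.txt", encoding="utf-8").read().lower())
# print(collocations(tokens)[:20])
```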
KWIC: What I did not have in mind so far as part of text analysis is keyword-in-context analysis. Here you look at all text snippets that occur in a certain window around a keyword. This seems more like work from linguistics, but I think automating this task would also be useful in natural language processing. It never came to my mind when using a computer system that it would also make sense to describe words by their context. Actually pretty funny, since this is the most natural operation we perform when learning a new language.
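A sketch of such an automated KWIC listing, again assuming a token list as produced in the earlier sketches:

```python
# Sketch: keyword-in-context (KWIC) listing with a fixed window of neighbouring words.
def kwic(tokens, keyword, window=4):
    """Yield the surrounding words for every occurrence of `keyword`."""
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield f"{left:>35}  [{keyword}]  {right}"

tokens = "the boy ran to the river and the boy jumped into the river".split()
for line in kwic(tokens, "river", window=3):
    print(line)
```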
Exercises
I really liked the exercises in the book. Some of them were really straightforward, but I had the feeling they were carefully chosen to demonstrate some of the information given in the book so far.
What I was missing around these basic word statistics is the order of the words; this was only somewhat reflected in the collocation problem. Even though I am not an expert yet, I have the feeling that most methods in statistical natural language processing seem to "forget" the order in which words appear in the text. I am convinced that this is an important piece of information, which already inspired me in my diploma thesis to create a method similar to explicit semantic analysis, and which is a core element of Typology!
Anyway, reading the first chapter of the book I did not really learn anything new, but it helped me take a certain point of view. I am really excited to proceed. The next chapter will be about probability theory. I already saw that it is written in a pure math style, with examples like rolling dice rather than examples from corpus linguistics, which I find a bit sad.

How to download Wikipedia
https://www.rene-pickhardt.de/how-to-download-wikipedia/ – Wed, 16 Feb 2011

Wikipedia is an amazing data set for all different kinds of research that go far beyond text mining. The best thing about Wikipedia is that it is licensed under a Creative Commons license, so you are allowed to download Wikipedia and use it in any way you want. The articles have almost no spelling mistakes and a great structure with meaningful headings and subheadings. This makes Wikipedia a frequently used data set in computer science. No surprise that I decided to download and examine Wikipedia. I first wanted to gain experience in natural language processing. Furthermore I wanted to test some graph mining algorithms, and I wanted to obtain some statistics about my mother tongue, German.
Even though it is well documented how to download this great data set, there are some tiny obstacles that made me struggle once in a while. For the experienced data miner these challenges will probably be easy to master, but I still think it is worthwhile blogging about them.
Don’t crawl Wikipedia please!
After reading Toby Segaran's book "Programming Collective Intelligence" about two years ago I wanted to build my first simple web crawler and download Wikipedia by crawling it. After installing Python and the library Beautiful Soup, which is recommended by Toby, I realized that my script could not download Wikipedia pages. I also didn't get any meaningful error message that I could have typed into Google. After a moment of thinking I realized that Wikipedia might not be happy with too many unwanted crawlers, since crappy crawlers can produce a lot of load on the web servers. So I had a quick look at http://de.wikipedia.org/robots.txt and quickly realized that Wikipedia is not too happy with strangers crawling and downloading it.
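For what it's worth, the polite way to find this out programmatically is to ask robots.txt first; a small sketch using only Python's standard library (the user agent string is a made-up example):

```python
# Sketch: check whether a given user agent may fetch a Wikipedia URL according to robots.txt.
# "MyLittleCrawler/0.1" is a hypothetical user agent string used only for illustration.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://de.wikipedia.org/robots.txt")
parser.read()

url = "https://de.wikipedia.org/wiki/Mark_Twain"
allowed = parser.can_fetch("MyLittleCrawler/0.1", url)
print("may fetch?", allowed)
```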
I had once heard that database dumps of Wikipedia are available for download. So why not download a database dump, install a web server on my notebook and crawl my local version of Wikipedia? This should be much faster anyway.
Before going over to the step of downloading a database dump I tried to change my script to send "better" HTTP headers to Wikipedia while requesting pages. That was not because I wanted to go on crawling Wikipedia anyway; I just wanted to see whether I would be able to trick them. Even though I set my user agent to Mozilla, I was not able to download a single Wikipedia page with my Python script.

Wikipedia is huge!

Even though I went for the German Wikipedia, which is not even half the size of the English one, I ran into serious trouble due to the huge amount of data. As we know, data mining is usually not complex because the algorithms are so difficult but rather because the amount of data is so large. I would consider Wikipedia to be a relatively small data set, but as stated above it is sufficiently big to cause problems.
After downloading the correct database dump, which was about 2 GB in size, I had to decompress it. Amazingly, no zip program was able to unpack the 7.9 GB XML file that contains all current Wikipedia articles. I realized that switching to my Linux operating system might have been a better idea, so I put the file on my external hard drive and rebooted my system. Well, Linux didn't work either: after exactly 4 GB the unzipping process would stop. Even though I am aware of the fact that 2^32 bytes = 4 GB, I was confused and asked Yann for advice, and he immediately asked whether I was using Windows or Linux. I told him that I had just switched to Linux, but then it also came to my mind: my external hard drive uses FAT32 as its file system, which cannot handle files bigger than 4 GB.
After copying the zipped database dump to my Linux file system the unzipping problem was solved. I installed MediaWiki on my local system in order to have all the necessary database tables. MediaWiki also comes with an import script. This script is PHP based and incredibly slow: about two articles per second are parsed and imported into the database, so one million articles would need about 138 hours, or more than five and a half days. For a small data set and experiment this is unacceptable. Fortunately Wikipedia also provides a Java program called mwdumper which can process about 70 articles per second. The first time I ran it, it crashed after 150,000 articles. I am still not too sure what caused the crash, but I decided to tweak the MySQL settings in /etc/mysql/my.cnf a little. After assigning more memory to MySQL I started the import a second time, only to realize that it could not continue importing the dump. After truncating all tables I restarted the whole import process. Right now it is still ongoing, but it has already imported 1,384,000 articles and my system still seems stable.

Summary: How to download Wikipedia in a nutshell

  1. Install some Linux on your computer. I recommend Ubuntu.
  2. Use the package manager to install MySQL, Apache and PHP.
  3. Install and set up MediaWiki (this can also be done via the package manager).
  4. Read http://en.wikipedia.org/wiki/Wikipedia:Database_download – in contrast to the German version, file size issues are discussed within the article.
  5. Find the Wikipedia dump that you want to download on the above page (see the sketch after this list for a streaming download).
  6. Use the Java program MWDumper and read its instructions.
  7. Install Java if not already done (can be done with the package manager).
  8. Donate some money to the Wikimedia Foundation, or at least contribute to Wikipedia by correcting some articles.
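As a rough illustration of the download in step 5 (the mwdumper import in step 6 is not covered here), the following sketch streams the compressed dump and decompresses it on the fly, so the full XML never needs to be unpacked in one step on a FAT32 drive. The URL follows the usual dumps.wikimedia.org naming pattern but should be checked against the current dump listing, and it assumes the single-stream (non-multistream) dump file:

```python
# Sketch: stream the compressed German Wikipedia dump and decompress it on the fly.
# The dump URL is an example of the usual naming pattern; verify it on dumps.wikimedia.org.
import bz2
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2"

decompressor = bz2.BZ2Decompressor()  # assumes a single bz2 stream (not the multistream dump)
with urllib.request.urlopen(DUMP_URL) as response, \
        open("dewiki-pages-articles.xml", "wb") as out:
    while True:
        chunk = response.read(1 << 20)  # read 1 MB of compressed data at a time
        if not chunk:
            break
        out.write(decompressor.decompress(chunk))
```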

So obviously I haven't even started with the really interesting research on the German Wikipedia, but I thought it might already be interesting to share my experience so far. A nice but not surprising side effect, by the way, is that the local version of Wikipedia is amazingly fast. After I have crawled and indexed Wikipedia and transferred the data to a format I can work with more easily, I might write another article, and maybe I will even publish a dump of my data structures.

IBM's Watson & Google – What is the difference?
https://www.rene-pickhardt.de/ibms-watson-google-what-is-the-the-difference/ – Tue, 15 Feb 2011

Recently there has been a lot of news on the web about IBM's natural language processing system Watson. As you might have heard, Watson is currently challenging two of the best Jeopardy players in the US. A lot of news magazines compare Watson with Google, which is the reason for this article. Even though the algorithms behind Watson and Google are not open source, a lot of estimates and guesses can be made about the algorithms both computer systems use in order to give intelligent answers to the questions people ask them. Based on these guesses I will explain the differences between Google and Watson.
Even though both systems have a lot in common (natural language processing, apparent intelligence, machine learning algorithms, …), I will compare the intelligence behind Google and Watson to demonstrate the differences and the limitations both systems still have.
Google is an information retrieval system. It has indexed a lot of text documents and uses heavy machine learning and data mining algorithms to decide which document is most relevant for any given keyword or combination of keywords. To do so Google uses several techniques. The main concept when Google started was the calculation of PageRank and other graph algorithms that evaluate the trust and relevance of a given resource (meaning the domain of a website). This is a huge difference from Watson: a given hypertext document hosted on two different domains will most probably end up with completely different Google rankings for the same keyword, even though the information and data within the document are completely identical. So to decide which hypertext document is most relevant, Google does much more than study that particular document. Backlinks, neighborhood and context (and maybe more?) are metrics used besides formatting, term frequency and other internal factors.
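To illustrate the kind of graph algorithm meant here, a toy sketch of the basic PageRank iteration follows; this is the textbook formulation on a made-up link graph, certainly not Google's production ranking:

```python
# Sketch: basic PageRank power iteration on a toy link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_web))  # page "c" collects the most rank in this toy graph
```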
Watson, on the other hand, does not aim to justify its answer by returning the text documents where it found the evidence, and it does not try to find the documents most suitable to a given keyword. For Watson the task is rather to understand the semantics behind a given key phrase or question. Once this is done, Watson uses its huge knowledge base to find the correct answer. I would guess that Watson uses many more artificial intelligence algorithms than Google, especially supervised learning and prediction and classification models. If anyone has some evidence for these statements I will be happy if you tell me!
An interesting fact worth mentioning is that both information retrieval systems rely first of all on collective intelligence. Google does so by using the structure of the web to calculate the trustworthiness of information; it also uses the set of all text documents to calculate synonyms and other things specific to the semantics of words. Watson also uses collective intelligence: it is provided with a lot of information human beings have published in books, on the web, or probably even in knowledge systems like ontologies. The systems also have in common that they use a huge amount of computing power and caching in order to provide their answers at a decent speed.
So is Google or Watson more intelligent?
Even though I think that Watson uses many more AI algorithms, the answer should clearly be Google. Watson is highly specialized for one particular task, which it can solve amazingly accurately, but Google solves a much more universal problem. Also, Google (as does IBM, of course) has some of the best engineers in the world working for it. The Watson team has been around for maybe 5 years with 40 people, whereas Google has been at it for more like 10 years and nowadays has over 20,000 employees.
I am excited to get to know your opinion!
