Foundations of statistical natural language processing: Review of chapter 1
https://www.rene-pickhardt.de/foundations-of-statistical-natural-language-processing-review-of-chapter-1/ (Tue, 12 Jun 2012)

Due to the interesting results we found by creating Typology, I am currently reading the related work on query prediction and autocompletion of sentences. There is quite some interesting academic work available in this area of information retrieval.
While reading these papers I realized that I am not that strong in the field of natural language processing, which seems to have a deep impact on my current research interests. That's why I decided to put in a reading session on some basic work. A short trip to the library, together with a look at the works cited in those papers, led me to the book Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze. Even though they write in the introduction that their book is far from a complete coverage of the topic, I think it will be a perfect entry point for becoming familiar with some of the basic concepts in NLP. Since one understands and remembers better what one has read if one writes it down, I am planning to write summaries and reviews of the book chapters here in my blog:

Chapter 1 Introduction

This chapter is split into several sections and is meant to provide some motivation. It already demonstrates nicely that in order to understand natural language processing you really have to bridge the gap between mathematics and computer science on the one hand and linguistics on the other. Even in the basic examples from linguistics there was some notation that I did not understand right away. In this sense I am really looking forward to reading chapter 3, which is supposed to give a rich overview of all the concepts needed from linguistics.
I personally also found the motivating section of chapter 1 too long. I am about to learn some concepts, and right now I don't really feel that I need to understand all the philosophical discussions about grammar, syntax and semantics.
What I really loved, on the other hand, was the last section, "Dirty Hands". In this section a small corpus (Tom Sawyer) is used to introduce some of the phenomena that one faces in natural language processing. Some of these I had already discussed, without knowing it, in the article about text mining for linguists on Ulysses. In the book they are of course discussed in a more structured way, but one can easily download the source code from the above-mentioned article and play around to understand the basic concepts from the book. Among these were:
Word counts / tokens / types: The most basic operation one can perform on a text is counting words. This is something that I already did in the Ulysses article. Counting words is interesting because in today's world it can be automated. What I didn't see in my last blog post is that counting words already leads to more insights than just a distribution of words, which I will discuss now:
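As a small sketch of how one might do this in Python (my own illustration, not code from the book), counting tokens and types with collections.Counter could look like this:

from collections import Counter

def token_type_counts(text):
    # naive whitespace tokenization; real tokenization is of course more involved
    tokens = text.lower().split()
    counts = Counter(tokens)
    # tokens are the running words, types are the distinct words
    return len(tokens), len(counts)

n_tokens, n_types = token_type_counts("the cat sat on the mat near the dog")
print(n_tokens, n_types)  # 9 tokens, 7 types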
Distinctive words can be spotted. Once I have a corpus consisting of many different texts I can count all words and create a ranking of the most frequent words. I will realize that for any given text the ranking looks quite similar to that global ranking. But once in a while I might spot some words in the top 100 most frequent words of a single text that would not appear (let's say) in the top 200 of the global ranking. Those words seem to be distinctive words that are of particular interest for the current text. In the example of Tom Sawyer "Tom" is such a distinctive word.
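A sketch of how one might spot such distinctive words (again my own illustration, with arbitrary cut-offs): take the most frequent words of a single text and keep those that do not appear near the top of the ranking built from the whole corpus.

from collections import Counter

def distinctive_words(text_tokens, corpus_tokens, local_top=100, global_top=200):
    # frequent in this particular text, but not frequent in the corpus as a whole
    local_rank = [w for w, _ in Counter(text_tokens).most_common(local_top)]
    global_rank = set(w for w, _ in Counter(corpus_tokens).most_common(global_top))
    return [w for w in local_rank if w not in global_rank]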
Hapax legomena: If one looks at all the words in a given text one will realize that most words occur fewer than 3 times; words that occur only once are called hapax legomena. This demonstrates the difficulty of natural language processing: the data one analyses is very sparse. The frequent words are most of the time grammatical function words, whereas the infrequent words carry the semantics. From this phenomenon one moves very quickly to:
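Building on the counting sketch above (again a hypothetical illustration), the sparsity becomes visible when one groups words by how often they occur:

from collections import Counter

def frequency_of_frequencies(tokens):
    # how many words occur once, twice, three times, ...
    word_counts = Counter(tokens)
    freq_of_freq = Counter(word_counts.values())
    # the words occurring exactly once are the hapax legomena
    return freq_of_freq[1], dict(sorted(freq_of_freq.items()))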
Zipf's law: Roughly speaking, Zipf's law says that when you count the word frequencies of a text and order the words from the most frequent to the least frequent, you get a table in which multiplying the rank of a word by its frequency always gives about the same number (in other words, the frequency of a word is inversely proportional to its rank). This is of course only an approximation: just imagine the most frequent word occurs an odd number of times. Then there is no integer frequency for the second-ranked word which, multiplied by 2, gives exactly the frequency of the most frequent word.
Anyway, Zipf's law was a very important discovery and has been generalized by Mandelbrot's law (which I so far only knew from chaos theory and fractals). Maybe sometime in the near future I will find some time to calculate the word frequencies of my blog and see whether Zipf's law holds (:
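Checking this on a concrete text takes only a few lines; the following is a rough sketch of how I might do it (my own code, not taken from the book):

from collections import Counter

def zipf_table(tokens, top=20):
    # if Zipf's law holds, rank * frequency stays roughly constant
    for rank, (word, freq) in enumerate(Counter(tokens).most_common(top), start=1):
        print(rank, word, freq, rank * freq)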
Collocations / bigrams: Another important concept was that of collocation. Many words only have a meaning together; in this sense "New York" or "United States" are more than the sum of the single words. The text pointed out that it is not sufficient to find the most frequent bigrams in order to find good collocations. Those bigrams have to be filtered, either using grammatical structure or by normalization. I think calculating a Jaccard coefficient might be interesting (even though it was not mentioned in the text). Should I really try to verify Zipf's law in the near future, I will also try to verify my method for calculating collocations. I hope that I would find collocations in my blog like social network, graph database or news stream…
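A rough sketch of how one could combine both ideas (the Jaccard-style score is my own assumption, not something taken from the book): count bigrams, drop the rare ones, and score each pair by how often the two words occur together relative to how often they occur at all.

from collections import Counter

def scored_bigrams(tokens, min_count=5):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), both in bigrams.items():
        if both < min_count:
            continue  # frequency filter before scoring
        # Jaccard-style score: co-occurrences divided by occurrences of either word
        scores[(w1, w2)] = float(both) / (unigrams[w1] + unigrams[w2] - both)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)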
KWIC: What I did not have in mind so far for the analysis of text is keyword in context (KWIC) analysis. What happens here is that you look at all text snippets that occur in a certain window around a keyword. This seems more like work from linguistics, but I think automating this task would also be useful in natural language processing. So far it never came to my mind when using a computer system that it would also make sense to describe words by their context. Actually pretty funny, since this is the most natural operation we do when learning a new language.
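A minimal KWIC sketch (my own illustration, assuming the text is already split into a list of tokens): slide over the tokens and print a fixed-size window around every occurrence of the keyword.

def kwic(tokens, keyword, window=5):
    # print each occurrence of the keyword with `window` tokens of context on both sides
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(left + " [" + token + "] " + right)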
Exercises
I really liked the exercises in the book. Some of them were really straightforward, but I had the feeling they were carefully chosen to demonstrate some of the information given in the book so far.
What I was missing around these basic word statistics is the order of the words. This was somewhat reflected in the collocation problem. Even though I am not an expert yet, I have the feeling that most methods in statistical natural language processing seem to "forget" the order in which the words appear in the text. I am convinced that this is an important piece of information; it already inspired me in my diploma thesis to create a method similar to explicit semantic analysis, and it is a core element in Typology!
Anyway, reading the first chapter of the book I did not really learn anything new, but it helped me take a certain point of view. I am really excited to proceed. The next chapter will be about probability theory. I already saw that it is written in a pure math style, with examples like rolling dice rather than examples from corpus linguistics, which I find sad.

Data mining (text analysis) for linguists on Ulysses by James Joyce & Faust by Goethe
https://www.rene-pickhardt.de/data-mining-text-analysis-for-linguists-on-ulysses-by-james-joyce-faust-by-goethe/ (Mon, 19 Sep 2011)

Over the weekend I met some students studying linguistics. Methods from linguistics are very important for text retrieval and data mining. That is why, in my opinion, linguistics is also a very important part of web science. I am always concerned that most people doing web science are actually computer scientists, and that much of the potential in web science is being lost by not paying attention to all the disciplines that could contribute to it!
That is why I tried to teach the linguists some basic Python in order to do some basic analysis on literature. The following script, which is hacked together rather than beautiful code, can be used to analyze texts by different authors. It will display the following statistics:

  • Count how many words are in the text
  • Count how many sentences there are
  • Calculate the average number of words per sentence
  • Count how many different words are in the text
  • Count how many times each word appears
  • Count how many words appear only once, twice, three times, and so on…
  • Display the longest sentence in the text

You could probably ask even more interesting questions, analyze texts from different centuries and languages, and do a lot of interesting stuff! I am a computer scientist / mathematician; I don't know what questions to ask. So if you are a linguist, feel free to give me feedback and suggest some more interesting questions (-:

Some statistics I calculated

ulysses
264965 words in 27771 sentences
==> 9.54 words per sentence
30086 different words
==> every word was used 8.82 times on average
faust1
30632 words in 4178 sentences
==> 7.33 words per sentence
6337 different words
==> every word was used 4.83 times on average
faust2
44534 words in 5600 sentences
==> 7.95 words per sentence
10180 different words
==> every word was used 4.39 times on average

Disclaimer

I know that this is not yet a tutorial and that I don't explain the code very well. To be honest, I don't explain the code at all, which is sad. When I was trying to teach Python to the linguists I started the way you would always start: "This is a loop and that is a list. Now let's loop over the list and display the items…" There wasn't much motivation left. The script below was created after I realized that coding is not supposed to be abstract and that an interesting example has to be used.
If people are interested (please tell me in the comments!) I will consider creating a Python tutorial for linguists that starts right away with small scripts doing useful stuff.
By the way, you can download the texts that I used for the analysis from the following spots:


# this code is licensed under a Creative Commons licence as long as you
# cite the author: Rene Pickhardt / www.rene-pickhardt.de
# note: the script uses Python 2 print statements

# adds leading zeros to a string so all result strings can be ordered
def makeSortable(w):
    l = len(w)
    tmp = ""
    for i in range(5 - l):
        tmp = tmp + "0"
    tmp = tmp + w
    return tmp

# replaces all kinds of delimiters passed in l in a text s with the 2nd argument
def removeDelimiter(s, new, l):
    for c in l:
        s = s.replace(c, new)
    return s

# counts how often every word occurs and how many words share the same frequency
def analyzeWords(s):
    s = removeDelimiter(s, " ", [".", ",", ";", "_", "-", ":", "!", "?", "\"", ")", "("])
    wordlist = s.split()
    dictionary = {}
    for word in wordlist:
        if word in dictionary:
            tmp = dictionary[word]
            dictionary[word] = tmp + 1
        else:
            dictionary[word] = 1
    # print every word prefixed with its zero-padded frequency so the output can be sorted
    l = [makeSortable(str(dictionary[k])) + " # " + k for k in dictionary.keys()]
    for w in sorted(l):
        print w
    # frequency of frequencies: how many words appear once, twice, three times, ...
    count = {}
    for k in dictionary.keys():
        if dictionary[k] in count:
            tmp = count[dictionary[k]]
            count[dictionary[k]] = tmp + 1
        else:
            count[dictionary[k]] = 1
    for k in sorted(count.keys()):
        print str(count[k]) + " words appear " + str(k) + " times"

# counts the distinct words and the average number of uses per word
def differentWords(s):
    s = removeDelimiter(s, " ", [".", ",", ";", "_", "-", ":", "!", "?", "\"", ")", "("])
    wordlist = s.split()
    count = 0
    dictionary = {}
    for word in wordlist:
        if word in dictionary:
            tmp = dictionary[word]
            dictionary[word] = tmp + 1
        else:
            dictionary[word] = 1
            count = count + 1
    print str(count) + " different words"
    print "every word was used " + str(float(len(wordlist)) / float(count)) + " times on average"
    return count

# counts words and sentences and displays the longest sentence of the text
def analyzeSentences(s):
    s = removeDelimiter(s, ".", [".", ";", ":", "!", "?"])
    sentenceList = s.split(".")
    wordList = s.split()
    wordCount = len(wordList)
    sentenceCount = len(sentenceList)
    print str(wordCount) + " words in " + str(sentenceCount) + " sentences ==> " + str(float(wordCount) / float(sentenceCount)) + " words per sentence"
    max = 0
    satz = ""
    for w in sentenceList:
        if len(w) > max:
            max = len(w)
            satz = w
    print satz + " length: " + str(len(satz))

texts = ["ulysses.txt", "faust1.txt", "faust2.txt"]
for text in texts:
    print text
    datei = open(text, 'r')
    s = datei.read().lower()
    analyzeSentences(s)
    differentWords(s)
    analyzeWords(s)
    datei.close()

If you call this script getstats.py, then on a Linux machine you can pipe the output directly into a file, which you can then keep working with, by using
python getstats.py > out.txt
