Open access and data from my research. Old resources for various topics finally online.
https://www.rene-pickhardt.de/open-access-and-data-from-my-research-old-resources-for-various-topics-finally-online/
Mon, 05 Nov 2012

Being a strong proponent of open access, I always try to publish all my work on my blog. Sometimes I am busy or forget to update, so today I took the time to go through all my old drafts and the material that had not been published yet. Here is a list of new content on my blog that should have been published long ago; I have also linked it in the articles of interest:

Over the last month I have created quite a bit of content for my blog, which will be published over the next weeks. So watch out for screencasts on how to build an autocompletion in GWT with neo4j, how to create n-grams from Wikipedia, thoughts and techniques for related work, and research ideas and questions that we found but probably do not have the time to work on.

Typology Oberseminar talk and Speed up of retrieval by a factor of 1000
https://www.rene-pickhardt.de/typology-oberseminar-talk-and-speed-up-of-retrieval-by-a-factor-of-1000/
Thu, 16 Aug 2012

Almost two months ago I talked in our Oberseminar about Typology. Update: Download slides. Most readers of my blog will already know the project, which was initially implemented by my students Till and Paul. I am sharing some slides with you that explain on the one hand how the system works and on the other hand give an overview of the related work.
As you can see from the slides, we are planning to submit our results to the SIGIR conference. So one year after my first blog post on Graphity, which developed into a full paper for SocialCom 2012 (Graphity blog post and blog post for the source code), there is now this still informal Typology blog post with the slides from the Oberseminar talk, and three months left until our SIGIR submission. I expect the submission will not be such a hassle as Graphity, since I should have learnt some lessons and also have a good student helping me with the implementation of all the tests.
Additionally, I have finally uploaded some source code to GitHub that makes the Typology retrieval algorithm pretty fast. There are still some issues with this code: it lowers the quality of the predictions a little bit, and the index has to be built first. Last but not least, the original SuggestTree code did not store the weights of the items to be suggested, and I need those weights in the aggregation phase. Since I did not want to extend the original code, I appended the weights to the suggested items, which is a little inefficient.
The main reason retrieval speeds up with the new algorithm is that Typology otherwise has to sort all outgoing edges of a node, which is rather slow, especially if one only needs the top-k elements. Since neo4j as a graph database does not provide indices for this kind of data, I had to look for another way to pre-sort the data. Additionally, if a prefix is known, one does not have to look at all outgoing edges. I found the SuggestTree class by Nicolai Diethelm, which solved the problem in a very good way and led to this great speed. The index is not persistent yet and it also needs quite some memory. On the other hand, a suggest tree is built for every node, which means the index can be distributed very easily over several machines, allowing for horizontal scaling!
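To illustrate the idea with a minimal Python sketch (this is not the SuggestTree code or the actual Typology implementation, just a toy illustration under my own simplifying assumptions): instead of sorting all outgoing edges of a node at query time, one can build, per node, a map from prefixes to suggestion lists that are pre-sorted by weight, so a top-k lookup becomes a simple slice. The weight is kept next to each suggestion, similar to the workaround of storing the weights alongside the suggested items.

from collections import defaultdict

class PrefixTopK(object):
    """Toy per-node prefix index: prefix -> suggestions pre-sorted by weight."""

    def __init__(self, suggestions, max_prefix_len=5):
        # suggestions: list of (word, weight) pairs, e.g. the outgoing edges of one node
        index = defaultdict(list)
        for word, weight in suggestions:
            for i in range(1, min(len(word), max_prefix_len) + 1):
                index[word[:i]].append((weight, word))
        # sort once at build time so queries never have to sort
        self.index = {p: sorted(items, reverse=True) for p, items in index.items()}

    def top_k(self, prefix, k):
        # a top-k query is just a slice of the pre-sorted list
        return self.index.get(prefix, [])[:k]

# usage
tree = PrefixTopK([("the", 120), ("there", 40), ("then", 75), ("apple", 10)])
print(tree.top_k("the", 2))   # [(120, 'the'), (75, 'then')]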
Anyway, the old algorithm was only able to handle about 20 requests per second, and now we are at something like 14,000 requests per second, and as I mentioned there is still a little room for more (:
I hope indices like this will be standard in neo4j soon. This would open up the range of applications that could make good use of neo4j.
As always, I am happy about any suggestions, and I am looking forward to doing the complete evaluation and paper writing for Typology.

Graphity source code and wikipedia raw data is online (neo4j based social news stream framework)
https://www.rene-pickhardt.de/graphity-source-code/
Mon, 09 Jul 2012

UPDATE: the source code of an entire Graphity server application is now online!
Eight months ago I posted the results of my research on fast retrieval of social news feeds, in particular my graph index Graphity. The index is able to serve more than 12 thousand personalized social news streams per second in social networks with several million active users. I was able to show that retrieval is independent of the node degree and the network size; therefore it scales to graphs of arbitrary size.
Today I am pleased to announce that our joint work was accepted as a full research paper at the IEEE SocialCom 2012 conference. The conference will take place in early September 2012 in Amsterdam. As promised, I am now opening the source code of Graphity to the community. Its documentation could and might be improved in the future, and I am sure one could even use a better data structure for our implementation of the priority queue.
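To give a rough idea of the kind of priority-queue-based retrieval this is about (a minimal sketch under my own simplifying assumptions, not the actual Graphity data structure or the code in the repository): a personalized news stream can be assembled as a k-way merge over the followees' time-sorted item lists, so only the newest k items are ever touched.

import heapq

def news_stream(followee_items, k):
    """Return the k newest items from the followees' item lists.

    followee_items: list of lists, each sorted newest-first and
    containing (timestamp, item) pairs.
    """
    # seed the heap with the newest item of every followee; negate the
    # timestamp to turn Python's min-heap into a max-heap
    heap = []
    for fid, items in enumerate(followee_items):
        if items:
            ts, item = items[0]
            heapq.heappush(heap, (-ts, fid, 0, item))

    stream = []
    while heap and len(stream) < k:
        neg_ts, fid, idx, item = heapq.heappop(heap)
        stream.append((-neg_ts, item))
        nxt = idx + 1
        if nxt < len(followee_items[fid]):
            ts, nxt_item = followee_items[fid][nxt]
            heapq.heappush(heap, (-ts, fid, nxt, nxt_item))
    return stream

# usage: two followees with time-sorted status updates
alice = [(1720000300, "alice: new paper"), (1720000100, "alice: hello")]
bob = [(1720000200, "bob: concert tonight")]
print(news_stream([alice, bob], 2))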
Still, the attention Graphity received from the developer community was quite high, so maybe the source code is of help to someone. It contains the entire evaluation framework that we used to compare Graphity against the other baselines, which should also make it possible to reproduce our evaluation.
There are some nice things one can learn from it about setting up multithreading for time measurements and also about setting up a good logging mechanism.
The code can be found at https://github.com/renepickhardt/graphity-evaluation and the main algorithm lies in the file:
https://github.com/renepickhardt/graphity-evaluation/blob/master/src/de/metalcon/neo/evaluation/GraphityBuilder.java
Other files of high interest are:

I have not touched the code again over the last couple of months, and it really has a lot of debugging comments inside. My apologies for this bad practice. I hope you can overlook this, keeping in mind that I am a mathematician and this was one of my first bigger evaluation projects. In my own interest I promise that next time I will produce code that is easier to read, understand and reuse.
Still, if you have any questions, suggestions or comments, feel free to contact me.
The raw data can be downloaded at:

The format of these files is straightforward:
de-nodeIs.txt contains an ID, then a tab, and then the title of the Wikipedia article; this is only necessary if you want to display your data with titles rather than names.
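For instance, loading that mapping could look like this (just a sketch, assuming the tab-separated layout described above):

# loads the mapping from node ID to Wikipedia article title
def load_titles(path):
    titles = {}
    with open(path) as f:
        for line in f:
            node_id, title = line.rstrip("\n").split("\t", 1)
            titles[node_id] = title
    return titles

titles = load_titles("de-nodeIs.txt")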
The interesting file is de-events.log. It has up to 4 columns:
timestamp TAB FromNodeID TAB [ToNodeID] TAB U/R/A
Every line tells exactly when the article FromNodeID changed. If only 3 columns are present and the last one is a U, the article itself was just updated. If the links in the article changed, there is another node ID in the third column and the last column is an A or an R, for a link to ToNodeID being added or removed respectively.
Processing these files is rather straightforward. With this file you can completely simulate the growth of Wikipedia over time. The file is sorted by the second column; if you want to use it in our evaluation framework you should sort it by the first column instead. This can be done on a Unix shell in less than 10 minutes with the sort command. A small parsing sketch follows below.
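As a rough illustration of how such a file could be processed (just a sketch based on the format described above; the exact sort invocation and the handling of the timestamp column are my assumptions, not taken from the evaluation framework):

# the pre-sorting by the first column mentioned above could be done on the
# shell with something like: sort -n -k1,1 de-events.log > de-events-sorted.log
# (assuming numeric timestamps)

def read_events(path):
    # yields (timestamp, from_node, to_node, action) tuples; to_node is None
    # for plain article updates (the 3-column "U" lines)
    with open(path) as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 3:
                timestamp, from_node, action = cols
                to_node = None
            else:
                timestamp, from_node, to_node, action = cols
            yield timestamp, from_node, to_node, action

# usage: count how many updates, link additions and link removals there are
from collections import Counter
print(Counter(action for _, _, _, action in read_events("de-events.log")))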
Sorry, I cannot publish the paper on my blog yet, since the camera-ready version still has to be prepared and submitted to IEEE. But follow me on Twitter or subscribe to my newsletter and I will let you know as soon as the entire paper is available as a PDF.

Data mining (text analysis) for linguists on Ulysses by James Joyce & Faust by Goethe
https://www.rene-pickhardt.de/data-mining-text-analysis-for-linguists-on-ulysses-by-james-joyce-faust-by-goethe/
Mon, 19 Sep 2011

Over the weekend I met some students studying linguistics. Methods from linguistics are very important for text retrieval and data mining, which is why, in my opinion, linguistics is also a very important part of web science. I am always concerned that most people doing web science are actually computer scientists and that much of the potential in web science is being lost by not paying attention to all the disciplines that could contribute to it!
That is why I tried to teach the linguists some basic Python in order to do some basic analysis on literature. The following script, which is hacked together rather than beautiful code, can be used to analyse texts by different authors. It will display the following statistics:

  • Count how many words are in the text
  • Count how many sentences there are
  • Calculate the average number of words per sentence
  • Count how many different words are in the text
  • Count how many times each word appears
  • Count how many words appear only once, twice, three times, and so on…
  • Display the longest sentence in the text

You could probably ask even more interesting questions, analyze texts from different centuries and languages, and do a lot of interesting stuff! I am a computer scientist / mathematician, so I don’t know which questions to ask. If you are a linguist, feel free to give me feedback and suggest some more interesting questions (-:

Some statistics I calculated

ulysses
264965 words in 27771 sentences
==> 9.54 words per sentence
30086 different words
==> every word was used 8.82 times on average
faust1
30632 words in 4178 sentences
==> 7.33 words per sentence
6337 different words
==> every word was used 4.83 times on average
faust2
44534 words in 5600 sentences
==> 7.95 words per sentence
10180 different words
==> every word was used 4.39 times on average

Disclaimer

I know that this is not yet a tutorial and that I don’t explain the code very well. To be honest, I don’t explain the code at all, which is sad. When I was trying to teach Python to the linguists, I started the way one always starts: “This is a loop and that is a list. Now let’s loop over the list and display the items…” There wasn’t much motivation left after that. The script below was created after I realized that coding should not be taught in the abstract and that an interesting example has to be used.
If people are interested (please tell me in the comments!) I will consider creating a Python tutorial for linguists that starts right away with small scripts doing useful stuff.
By the way, you can download the texts that I used for the analysis from the following places:


# this code is licensed under the creative commons licence as long as you
# cite the author: Rene Pickhardt / www.rene-pickhardt.de

# adds leading zeros to a string so all result strings can be ordered
# (assumes counts with at most 5 digits)
def makeSortable(w):
    l = len(w)
    tmp = ""
    for i in range(5-l):
        tmp = tmp + "0"
    tmp = tmp + w
    return tmp

# replaces every delimiter passed in l within the text s by the 2nd argument
def removeDelimiter(s, new, l):
    for c in l:
        s = s.replace(c, new)
    return s

# counts how often every word occurs and how many words occur once, twice, ...
def analyzeWords(s):
    s = removeDelimiter(s, " ", [".", ",", ";", "_", "-", ":", "!", "?", "\"", ")", "("])
    wordlist = s.split()
    dictionary = {}
    for word in wordlist:
        if word in dictionary:
            tmp = dictionary[word]
            dictionary[word] = tmp + 1
        else:
            dictionary[word] = 1
    # print every word together with its (zero-padded) frequency
    l = [makeSortable(str(dictionary[k])) + " # " + k for k in dictionary.keys()]
    for w in sorted(l):
        print w
    # count how many words appear exactly k times
    count = {}
    for k in dictionary.keys():
        if dictionary[k] in count:
            tmp = count[dictionary[k]]
            count[dictionary[k]] = tmp + 1
        else:
            count[dictionary[k]] = 1
    for k in sorted(count.keys()):
        print str(count[k]) + " words appear " + str(k) + " times"

# counts the number of different words and the average usage per word
def differentWords(s):
    s = removeDelimiter(s, " ", [".", ",", ";", "_", "-", ":", "!", "?", "\"", ")", "("])
    wordlist = s.split()
    count = 0
    dictionary = {}
    for word in wordlist:
        if word in dictionary:
            tmp = dictionary[word]
            dictionary[word] = tmp + 1
        else:
            dictionary[word] = 1
            count = count + 1
    print str(count) + " different words"
    print "every word was used " + str(float(len(wordlist))/float(count)) + " times on average"
    return count

# counts words and sentences and displays the longest sentence of the text
def analyzeSentences(s):
    s = removeDelimiter(s, ".", [".", ";", ":", "!", "?"])
    sentenceList = s.split(".")
    wordList = s.split()
    wordCount = len(wordList)
    sentenceCount = len(sentenceList)
    print str(wordCount) + " words in " + str(sentenceCount) + " sentences ==> " + str(float(wordCount)/float(sentenceCount)) + " words per sentence"
    max = 0
    satz = ""
    for w in sentenceList:
        if len(w) > max:
            max = len(w)
            satz = w
    print satz + " (length: " + str(len(satz)) + ")"

texts = ["ulysses.txt", "faust1.txt", "faust2.txt"]
for text in texts:
    print text
    datei = open(text, 'r')
    s = datei.read().lower()
    analyzeSentences(s)
    differentWords(s)
    analyzeWords(s)
    datei.close()

If you save this script as getstats.py on a Linux machine, you can redirect the output directly into a file to work with afterwards by using
python getstats.py > out.txt
