Download your open source YouTube Insights statistics tool

Rene — Sun, 13 Nov 2011 11:54:45 +0000

Today I figured out that you can download a lot of statistics about your own videos from YouTube. That is actually very nice, since I was always sceptical about losing the knowledge of who is watching your videos once you no longer host them yourself. Unfortunately there are some drawbacks to these statistics:

YouTube only lets you download statistics for a period of 30 days. So if you want to download your statistics for an entire year, you have to download 12 files.
Next, the statistics are only available as raw data, and you need to process them in order to get some useful information out of them.
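To give an idea of how the twelve monthly downloads could be stitched back together, here is a minimal sketch. It assumes that every download is a CSV file with the same header row; the file name pattern `insights_*.csv` and the function name `merge_monthly_files` are made up for this illustration and are not part of the tool described in this post.

```python
import csv
import glob


def merge_monthly_files(pattern):
    """Merge all CSV files matching pattern into one list of rows.

    Assumption: every file starts with the same header row. A file with
    a different header is rejected instead of being merged silently.
    """
    header = None
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            file_header = next(reader, None)
            if file_header is None:
                continue  # skip empty files
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError("unexpected header in " + path)
            rows.extend(reader)
    return header, rows


if __name__ == "__main__":
    # "insights_*.csv" is a hypothetical naming scheme for the downloads.
    header, rows = merge_monthly_files("insights_*.csv")
    print(len(rows))
```

Sorting the file names keeps the months in order as long as the names sort chronologically.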

For me and In Legend, these statistics are very interesting for three reasons:

We want to figure out from which websites people come to YouTube in order to watch our video. This is important for us so that we can contact the webmasters of these sites as soon as new videos are available.
We want to analyze whether user behaviour is influenced by our YouTube Ballads n Bullets DVD, and if so, we want to see whether the influence is actually positive (which we of course expect).
We want to see from which videos our video is linked as a related video. This could enable us to set up some really useful collaborations.
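The first of these questions boils down to summing views per referring website. The sketch below shows one way to do that. The input shape, pairs of referrer URL and view count, is purely an assumption for illustration and not the real layout of the YouTube export; `views_per_site` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse


def views_per_site(rows):
    """Sum views per referring host.

    rows: iterable of (referrer_url, views) pairs. This shape is an
    assumption for illustration, not the real export layout.
    """
    totals = Counter()
    for referrer_url, views in rows:
        host = urlparse(referrer_url).netloc.lower()
        if host:  # URLs without a scheme yield no host and are skipped
            totals[host] += int(views)
    return totals


if __name__ == "__main__":
    sample = [
        ("http://example.org/post-1", "3"),
        ("http://example.org/post-2", "2"),
        ("http://blog.example.com/", "4"),
    ]
    for host, views in views_per_site(sample).most_common():
        print(host, views)
```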

Since there are so many reasons to process the statistics, I wrote a little Python script to analyze them and present them in a human-readable format. Because open source is really important for our society and the web, I made the decision to share this little tool with everyone on the web under a GPL licence (which means you can use it for free!).
So feel free to download this little Python script and run it on your website. It is also available in the Google Code SVN repository.

Screencast and instructions:

Since the script only displays some information, there is almost no configuration to be done. But I know from the time when my programming skills were not as good as they are today that it was always hard to run someone else's source code. To make it even easier for you, I created a little screencast that explains how to download the program, how to download the statistics from YouTube and how to run the statistics tool.

Oh, by the way, some background knowledge about YouTube Insights can be found in the official YouTube Data API documentation.
I hope you like the tool. If you find any bugs or have any suggestions, I would be more than happy to hear your thoughts!

Data mining (text analysis) for linguists on Ulysses by James Joyce & Faust by Goethe

Rene — Mon, 19 Sep 2011 20:37:04 +0000

Over the weekend I met some students studying linguistics. Methods from linguistics are very important for text retrieval and data mining. That is why, in my opinion, linguistics is also a very important part of web science. I am always concerned that most people doing web science are actually computer scientists, and that much of the potential in web science is being lost by not paying attention to all the disciplines that could contribute to it!
That is why I tried to teach the linguists some basic Python in order to do some basic analysis of literature. The following script, which is more hacked together than beautiful code, can be used to analyse texts by different authors. It will display the following statistics:

Count how many words are in the text
Count how many sentences are in the text
Calculate the average number of words per sentence
Count how many different words are in the text
Count how many times each word appears
Count how many words appear only once, twice, three times, and so on…
Display the longest sentence in the text
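For readers who prefer a compact view, the statistics in this list can be sketched in a few lines with `collections.Counter`. This is a Python 3 illustration and not the script from this post: it tokenizes with a regular expression and drops empty sentences, so its numbers will differ somewhat from the ones the original script reports. The function name `text_statistics` is made up for this example.

```python
import re
from collections import Counter


def text_statistics(text):
    """Compute the statistics from the list above for one text."""
    # Sentences end at the same delimiters the original script uses.
    sentences = [s.strip() for s in re.split(r"[.;:!?]", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    word_counts = Counter(words)
    # How many words appear once, twice, three times, and so on.
    frequency_of_counts = Counter(word_counts.values())
    return {
        "words": len(words),
        "sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "different_words": len(word_counts),
        "word_counts": word_counts,
        "frequency_of_counts": frequency_of_counts,
        "longest_sentence": max(sentences, key=len, default=""),
    }


if __name__ == "__main__":
    stats = text_statistics("The cat sat. The dog ran fast! Why?")
    print(stats["words"], stats["sentences"], stats["different_words"])
    print(stats["longest_sentence"])
```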

You could probably ask even more interesting questions, analyze texts from different centuries and languages, and do a lot of other interesting stuff! I am a computer scientist / mathematician, so I don’t know what questions to ask. If you are a linguist, feel free to give me feedback and suggest some more interesting questions (-:

Some statistics I calculated

ulysses
264965 words in 27771 sentences
==> 9.54 words per sentence
30086 different words
==> every word was used 8.82 times on average
faust1
30632 words in 4178 sentences
==> 7.33 words per sentence
6337 different words
==> every word was used 4.83 times on average
faust2
44534 words in 5600 sentences
==> 7.95 words per sentence
10180 different words
==> every word was used 4.39 times on average
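The words-per-sentence figures are simply the word count divided by the sentence count. A quick check in Python 3, using only the numbers reported above (the helper name `words_per_sentence` is made up for this example):

```python
# (words, sentences) exactly as reported above
reported = {
    "ulysses": (264965, 27771),
    "faust1": (30632, 4178),
    "faust2": (44534, 5600),
}


def words_per_sentence(words, sentences):
    """Average sentence length, rounded to two decimals."""
    return round(words / sentences, 2)


for name, (words, sentences) in reported.items():
    print(name, words_per_sentence(words, sentences))
```

The "every word was used" averages cannot be reproduced exactly this way. The script computes them from a second word list that is built after punctuation, including hyphens and underscores, has been replaced by spaces, so that list can contain more words than the count shown next to the sentences.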

Disclaimer

I know that this is not yet a tutorial and that I don’t explain the code very well. To be honest, I don’t explain the code at all. This is sad. When I was trying to teach Python to the linguists, I started the way you would always start: “This is a loop and that is a list. Now let’s loop over the list and display the items…” There wasn’t much motivation left. The script below was created after I realized that coding is not supposed to be abstract and that an interesting example has to be used.
If people are interested (please tell me in the comments!), I will consider creating a Python tutorial for linguists that starts right away with small scripts doing useful stuff.
By the way, you can download the texts that I used for the analysis at the following locations.

# this code is licenced under creative commons licence as long as you
# cite the author: Rene Pickhardt / www.rene-pickhardt.de

# adds leading zeros to a string so all result strings can be ordered
def makeSortable(w):
    l = len(w)
    tmp = ""
    for i in range(5-l):
        tmp = tmp + "0"
    tmp = tmp + w
    return tmp

# replaces all kind of structures passed in l in a text s with the 2nd argument
def removeDelimiter(s, new, l):
    for c in l:
        s = s.replace(c, new)
    return s

# prints every word with its count, then how many words appear n times
def analyzeWords(s):
    s = removeDelimiter(s, " ", [".",",",";","_","-",":","!","?","\"",")","("])
    wordlist = s.split()
    dictionary = {}
    for word in wordlist:
        if word in dictionary:
            tmp = dictionary[word]
            dictionary[word] = tmp+1
        else:
            dictionary[word] = 1
    l = [makeSortable(str(dictionary[k])) + " # " + k for k in dictionary.keys()]
    for w in sorted(l):
        print(w)
    count = {}
    for k in dictionary.keys():
        if dictionary[k] in count:
            tmp = count[dictionary[k]]
            count[dictionary[k]] = tmp + 1
        else:
            count[dictionary[k]] = 1
    for k in sorted(count.keys()):
        print(str(count[k]) + " words appear " + str(k) + " times")

# prints the number of different words and the average use per word
def differentWords(s):
    s = removeDelimiter(s, " ", [".",",",";","_","-",":","!","?","\"",")","("])
    wordlist = s.split()
    count = 0
    dictionary = {}
    for word in wordlist:
        if word in dictionary:
            tmp = dictionary[word]
            dictionary[word] = tmp+1
        else:
            dictionary[word] = 1
            count = count + 1
    print(str(count) + " different words")
    print("every word was used " + str(float(len(wordlist))/float(count)) + " times on average")
    return count

# prints word and sentence counts and the longest sentence
def analyzeSentences(s):
    s = removeDelimiter(s, ".", [".",";",":","!","?"])
    sentenceList = s.split(".")
    wordList = s.split()
    wordCount = len(wordList)
    sentenceCount = len(sentenceList)
    print(str(wordCount) + " words in " + str(sentenceCount) + " sentences ==> " + str(float(wordCount)/float(sentenceCount)) + " words per sentence")
    maxLength = 0
    satz = ""  # "Satz" is German for sentence
    for w in sentenceList:
        if len(w) > maxLength:
            maxLength = len(w)
            satz = w
    print(satz + "laenge " + str(len(satz)))  # "laenge" is German for length

texts = ["ulysses.txt","faust1.txt","faust2.txt"]
for text in texts:
    print(text)
    datei = open(text, 'r')  # "Datei" is German for file
    s = datei.read().lower()
    analyzeSentences(s)
    differentWords(s)
    analyzeWords(s)
    datei.close()
If you save this script as getstats.py on a Linux machine, you can redirect the output directly into a file, which you can then work on, by using
python getstats.py > out.txt
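Once the output is in a file, the per-word lines can be read back in. The script prints each word as a zero-padded count, then " # ", then the word, so a small sketch for extracting the most frequent words could look like this. The function name `most_frequent` is made up for this example, and because out.txt contains the output for all three texts, the result mixes them unless you split the file first.

```python
import re

# Matches lines such as "00012 # word" as printed by the script.
WORD_LINE = re.compile(r"^(\d+) # (\S+)$")


def most_frequent(lines, n=10):
    """Return the n (count, word) pairs with the highest counts."""
    pairs = []
    for line in lines:
        match = WORD_LINE.match(line.strip())
        if match:  # all other output lines are ignored
            pairs.append((int(match.group(1)), match.group(2)))
    # Highest count first; ties are ordered alphabetically.
    return sorted(pairs, key=lambda pair: (-pair[0], pair[1]))[:n]


if __name__ == "__main__":
    sample_lines = ["00002 # the", "00001 # cat", "2 words appear 1 times"]
    print(most_frequent(sample_lines, 1))
```

To run it on the real output, pass an open file instead of the sample list, for example `most_frequent(open("out.txt"))`.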