MySql – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany
https://www.rene-pickhardt.de

Open Source Facebook music streaming App for free download!
https://www.rene-pickhardt.de/open-source-facebook-music-streaming-app-for-free-download/
Tue, 03 Jan 2012 14:54:14 +0000

In an earlier post I explained the need for a Facebook streaming app that is enhanced with some features in order to create viral word-of-mouth effects. Together with Yann Leretaille and Robert Naumann I programmed against the Facebook API and developed such an app for my band In Legend. Today (even though Christmas is gone and 2012 has already started) it is time for me to share the source code of this app.

Have a look at the app here:

Features and Problems

  • works on Facebook and on any other webpage
  • unlocks more and more songs for streaming as more people install the app, in order to spread the word (a sketch of this idea follows below)
  • users need to connect (with Facebook or via email address) in order to listen
  • some lightweight statistics
  • an encrypted Flash player (not open source yet) that makes it hard to download the music (though I have some moral problems with this kind of feature; but well, it is how the industry works…)
  • a slideshow of pictures to improve the listening experience
  • usability optimized for high conversion rates

The app runs on PHP, MySQL and JavaScript (MooTools), and you will need your own webspace in order to host it.
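
In the real app the unlock logic lives in the PHP/MySQL backend; the following is only a minimal Java sketch of the idea, with invented threshold numbers, to show how the number of streamable songs grows with the install count.

import java.util.TreeMap;

// Minimal sketch of the "unlock songs as installs grow" idea.
// The thresholds below are invented for illustration; the real app
// implements this in PHP against its MySQL install counter.
public class UnlockSketch {

    // install count -> number of songs streamable once that count is reached
    private static final TreeMap<Integer, Integer> THRESHOLDS = new TreeMap<Integer, Integer>();
    static {
        THRESHOLDS.put(0, 1);      // one song is always free
        THRESHOLDS.put(100, 3);    // hypothetical: 100 installs unlock three songs
        THRESHOLDS.put(500, 6);
        THRESHOLDS.put(1000, 10);  // hypothetical: the full record
    }

    public static int unlockedSongs(int installCount) {
        // floorEntry returns the highest threshold that has been reached
        return THRESHOLDS.floorEntry(installCount).getValue();
    }

    public static void main(String[] args) {
        System.out.println(unlockedSongs(742)); // prints 6 with the numbers above
    }
}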

A kind warning

The app was developed under a lot of time pressure and we had some nasty bugs that needed to be fixed. That is why the source code is littered with some really quick and dirty fixes. Afterwards I never really had the time to clean up the source and write good documentation. As my PhD progresses, this situation will not change in the foreseeable future. Since my prediction is that Facebook will be overrun by Google+ within this year, it is more than time to share the app!
The good part: most of the code can be reused once Google+ opens its API, so the app can be carried over to that great social network.

Source code on Google Code

http://code.google.com/p/in-legend-facebook-music-streaming-app/source/checkout

Time lines and news streams: Neo4j is 377 times faster than MySQL
https://www.rene-pickhardt.de/time-lines-and-news-streams-neo4j-is-377-times-faster-than-mysql/
Mon, 18 Jul 2011 10:20:49 +0000

Over the last weeks I did some more work on Neo4j, and I am ready to present some more results on speed: in my use case Neo4j outperformed MySQL by a factor of 377, which is more than two orders of magnitude. As you may know, one part of my PhD thesis is to create a social news stream application around my social networking site metalcon.de. It is very obvious that a graph structure is a very natural fit for social news streams: you go to a user, traverse to all his friends or objects of interest, and then traverse one step deeper to the newly created content items. A problem with this kind of application is sorting the content items by time or relevance. But before I discuss those problems, I just want to present another comparison between MySQL and Neo4j.

Setting:

I took a fairly small dataset with the following properties:

  • 14986 content items (entries in band forums)
  • 1391 bands
  • 854 users having at least one of those bands in their list of favourite bands
  • a bipartite graph of fan relations between users and bands

For every user I wanted to select all content items from his favourite bands. I know this is far away from the social news stream application I am about to build, but I used it as a first test to see whether graph databases really are the more natural setting for my questions.

Results:

In MySQL this would look something like this:

for all (User_ID in Users) {
    SELECT e.Text
    FROM Entry e
    JOIN Topic t ON t.ID = e.Topic_ID
    JOIN UserBand ub ON t.Band_ID = ub.Band_ID   -- fan relation between users and bands
    WHERE ub.User_ID = User_ID;
}
Even though we have all relevant columns indexed, those joins are expensive, especially because the Entry table is much bigger than 14986 items.
Using MySQL it took 152 seconds (2:32 minutes) to create the interesting news streams for all 854 users, which is about 177 ms per user.

Let’s look at Neo4j:

I am not using any traverser yet, just some hacked-in code along the following lines:
for all (user in index) {   // pseudocode: iterate over every user node in the index
    for (Relationship rel : user.getRelationships(
            DynamicRelationshipType.withName("FAN"), Direction.BOTH)) {
        Node n1 = rel.getOtherNode(user);   // a band the user is a fan of

        for (Relationship r2 : n1.getRelationships(
                DynamicRelationshipType.withName("CONTENTITEM"), Direction.BOTH)) {
            Node n2 = r2.getOtherNode(n1);  // the content item node
            edgecnt++;                      // count every traversed content item
        }
    }
}

Even though we only have 854 users and 1391 bands, we end up with 1368270 relations that we have to traverse to retrieve all content items for all favourite bands of each user.
Using Neo4j it took 3.4 seconds to do this, or 4 ms per user.
That is about 44 times faster than MySQL.

After warming the caches:

When repeating this experiment, MySQL does not get faster. I know it has a query cache, but I guess it gets emptied before the first queries are run again. With Neo4j, though, the result gets even better: each time I repeated the experiment the runtime decreased, until I came down to something like 0.4 seconds for the job, which is 377 times faster than MySQL. The best part of this is scaling: when my social graph grows, the search in Neo4j stays a local search. More users and more discussions do not mean more edges to traverse from a user to his favourite bands. In MySQL, though, the cost of these heavy joins will just explode.
Yes, I know that in MySQL I would denormalize my tables to gain speed for such an application. But denormalizing means more redundancy, and again, a graph is a much more natural structure for a social application.

Open Questions! Feel free to discuss them (-:

There is a very important open question, though, and that is sorting. A social news stream of course should be sorted by time. In MySQL it is easy to create an index over a column with timestamps on the content items; sorting my result set by time will in this case not increase the runtime.
In a graph database with this particular schema I am not quite sure how to retrieve the results in a sorted way. (Anyone have some ideas?) There needs to be further testing to see whether sorting after retrieving still makes my solution faster than MySQL, or (my preferred solution) whether there is some way of designing the graph so that for any user with any set of favourite bands there is a quick way of traversing the content items and receiving them in an ordered way. I even guess this is already possible in Neo4j using traversers with breadth-first search and telling them in which order to traverse relations. I just have to look deeper into this and I will keep you updated.
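
To make the "sorting after retrieving" variant concrete, here is a minimal Java sketch. It assumes that every content item node carries a timestamp property stored as a long; that property name and the surrounding setup are my assumptions for illustration, not part of the current schema. The relationship type names are the ones used in the measurement above.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

// Sketch: collect all content items of a user's favourite bands,
// then sort them in memory by an assumed "timestamp" property (newest first).
class NewsStreamSketch {
    static List<Node> sortedNewsStream(Node user) {
        List<Node> items = new ArrayList<Node>();
        for (Relationship fan : user.getRelationships(
                DynamicRelationshipType.withName("FAN"), Direction.BOTH)) {
            Node band = fan.getOtherNode(user);
            for (Relationship ci : band.getRelationships(
                    DynamicRelationshipType.withName("CONTENTITEM"), Direction.BOTH)) {
                items.add(ci.getOtherNode(band));
            }
        }
        Collections.sort(items, new Comparator<Node>() {
            public int compare(Node a, Node b) {
                Long ta = (Long) a.getProperty("timestamp");
                Long tb = (Long) b.getProperty("timestamp");
                return tb.compareTo(ta);   // descending: newest first
            }
        });
        return items;
    }
}

The traversal itself stays untouched; the only additional cost is an in-memory sort over the retrieved items, so this variant should be easy to benchmark against the unsorted numbers above.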
I am happy and open for comments and suggestions! Oh, and could anyone suggest a nice syntax highlighting plugin for WordPress?

Neo4j Graph Database vs MySQL
https://www.rene-pickhardt.de/neo4j-graph-database-vs-mysql/
Thu, 05 May 2011 21:36:32 +0000

For my social news stream application I am thinking hard about the right software to support my backend. After I designed a database model in MySQL, I reported back to Jonas, and he suggested searching for a better-suited technology. A little bit of research brought me to a graph database called Neo4j.
After I downloaded the open-source Java libs and got them running with Eclipse and Maven, I did some testing with my Metalcon data set. I have been very satisfied, and the whole project looks very promising to me. I exported four relations from my MySQL database:

  1. UserUserFriend containing all the friendship requests
  2. UserProfileVisit containing the profiles a user visited
  3. UserMessage containing the messages between users
  4. UserComment containing the profile comments between users

These relations obviously form a graph on my data set. Reading the several hundred thousand lines of data, putting them into the graph data structure, and building a search index on the nodes took only a few seconds of runtime. But I was even more impressed by the speed with which it was possible to traverse the graph!
Retrieving a shortest path of length 4 between two users took only 150 milliseconds. Doing a full breadth-first search on a different, heavily connected graph with 290'000 edges took only 2.7 seconds, which means that Neo4j is capable of traversing about 100'000 edges per second.
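
For reference, a lookup like this with the graph-algo component of the Neo4j 1.x releases I am using looks roughly like the sketch below; the relationship type is the exported UserUserFriend relation, while the class wrapper and node handles are assumptions for illustration.

import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.kernel.Traversal;

// Sketch (Neo4j 1.x API): shortest path over friendship edges, at most 4 hops.
class ShortestPathSketch {
    static Path shortestFriendPath(Node userA, Node userB) {
        PathFinder<Path> finder = GraphAlgoFactory.shortestPath(
                Traversal.expanderForTypes(
                        DynamicRelationshipType.withName("UserUserFriend"), Direction.BOTH),
                4);   // maximum search depth, matching the length-4 example above
        return finder.findSinglePath(userA, userB);
    }
}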
Now I will have to look more carefully at my use case. Obviously I want to have edges that are labeled with timestamps and retrieve them in ordered lists. Adding key-value pairs to the edges and including an index is possible, which makes me optimistic that I will be able to answer a lot of my queries of interest in an efficient manner.
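
Attaching such a timestamp at write time is just a key-value pair on the relationship. A minimal sketch, using the exported UserMessage relation as the example; the property name "timestamp" is my choice for illustration:

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

// Sketch: label an edge with a timestamp so it can later be indexed and
// retrieved in time order. Must run inside a Neo4j Transaction.
class TimestampedEdgeSketch {
    static Relationship createMessageEdge(Node from, Node to) {
        Relationship msg = from.createRelationshipTo(
                to, DynamicRelationshipType.withName("UserMessage"));
        msg.setProperty("timestamp", System.currentTimeMillis());
        return msg;
    }
}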
Unfortunately I am still battling with Google Web Toolkit, Eclipse and Neo4j, which I want to combine for the new Metalcon version. I even asked a very embarrassing question on the Neo4j mailing list, and the people from Neo Technology have been very kind and helpful (even though I still couldn't manage to get it running). I will post an article here as soon as I know how to set everything up.
In general I am still a huge fan of relational databases, but for the use case of social networks I see why graph databases seem to be the more sophisticated technology. I am pretty sure that I could not have performed so well using MySQL.
What is your experience with graph databases, and especially with Neo4j?

How to move a WordPress directory: problems with upload_path wp-content/uploads
https://www.rene-pickhardt.de/how-to-move-wordpress-directory-problems-with-upload_path-wp-contentuploads/
Thu, 24 Mar 2011 13:30:36 +0000

Recently I was forced to move a WordPress site to a different web server. Because the new server has different hosting policies, I also had to change the physical directory in which the WordPress files were stored: before it was something like /var/www/wordpress, afterwards it became /home/domainname/www/ . So after moving the database, moving the WordPress installation, changing the DNS entry for the domain, and some minor downtime in the middle of the night, I figured out that some plugins on the new server would not work.
Especially those that needed to access uploaded files had problems.

What happened?

After some debugging I did a fulltext search on the database for %/var/www/wordpress/ and found one entry in the wp_options table with option_name = "upload_path" and option_value = "/var/www/wordpress/wp-content/uploads". So obviously WordPress saves the path of the upload folder in its options table at installation time. I am still not sure whether this is a bug or a feature, but I don't have the time to investigate further. Anyone have some ideas?
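
For reference, this is roughly the search that turned up the culprit, wrapped in a small JDBC sketch; the connection URL and credentials are placeholders for your own setup, while the table and column names are the standard WordPress ones.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: find every option that still contains the old server path.
public class FindStaleUploadPath {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/wordpress", "user", "password");
        PreparedStatement ps = con.prepareStatement(
                "SELECT option_name, option_value FROM wp_options WHERE option_value LIKE ?");
        ps.setString(1, "%/var/www/wordpress/%");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            // In my case this printed: upload_path -> /var/www/wordpress/wp-content/uploads
            System.out.println(rs.getString(1) + " -> " + rs.getString(2));
        }
        con.close();
    }
}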

How to download Wikipedia
https://www.rene-pickhardt.de/how-to-download-wikipedia/
Wed, 16 Feb 2011 18:58:00 +0000

Wikipedia is an amazing data set for all kinds of research that go far beyond text mining. The best thing about Wikipedia is that it is licensed under a Creative Commons license, so you are allowed to download Wikipedia and use it in any way you want. The articles have almost no spelling mistakes and a great structure with meaningful headings and subheadings. This makes Wikipedia a frequently used data set in computer science. No surprise that I decided to download and examine Wikipedia. I first wanted to gain experience in natural language processing. Furthermore, I wanted to test some graph mining algorithms and obtain some statistics about my mother tongue, German.
Even though it is very well documented how to download this great data set, there are some tiny obstacles which made me struggle once in a while. For the experienced data miner these challenges will probably be easy to master, but I still think it is worthwhile blogging about them.

Don't crawl Wikipedia, please!

After reading Toby Segaran's book "Programming Collective Intelligence" about two years ago, I wanted to build my first simple web crawler and download Wikipedia by crawling it. After installing Python and the library Beautiful Soup, which is recommended by Toby, I realized that my script could not download Wikipedia pages. I also didn't get any meaningful error message which I could have typed into Google. After a moment of thinking I realized that Wikipedia might not be happy with too many unwanted crawlers, since crappy crawlers can produce a lot of load on the web servers. So I had a quick look at http://de.wikipedia.org/robots.txt and quickly realized that Wikipedia is not too happy with strangers crawling and downloading it.
I once heard that database dumps of Wikipedia are available for download. So why not download a database dump, install a web server on my notebook and crawl my local version of Wikipedia? This should be much faster anyway.
Before going over to the step of downloading a database dump, I tried to change my script to send "better" HTTP headers to Wikipedia while requesting pages. That was not because I wanted to go on crawling Wikipedia anyway; I just wanted to see whether I would be able to trick them. Even though I set my user agent to Mozilla, I was not able to download a single Wikipedia page using my Python script.

Wikipedia is huge!

Even though I went for the German Wikipedia, which doesn't even have half the size of the English one, I ran into serious trouble due to the huge amount of data. As we know, data mining is usually not complex because the algorithms are so difficult, but rather because the amount of data is so huge. I would consider Wikipedia to be a relatively small data set, but as stated above, big enough to cause problems.
After downloading the correct database dump, which was about 2 GB in size, I had to unzip it. Amazingly, no zip program was able to unzip the 7.9 GB XML file that contained all current Wikipedia articles. I realized that changing to my Linux operating system might have been a better idea. So I put the file on my external hard drive and rebooted my system. Well, Linux didn't work either: after exactly 4 GB the unzipping process would stop. Even though I am aware of the fact that 2^32 bytes = 4 GB, I was confused and asked Yann for advice, and he immediately asked whether I was using Windows or Linux. I told him that I had just switched to Linux, but then it also came to my mind: my external hard drive uses FAT32 as its file system, which cannot handle files bigger than 4 GB.
After copying the zipped database dump to my Linux file system, the unzipping problem was solved. I installed a MediaWiki on my local system in order to have all the necessary database tables. MediaWiki also comes with an import script. This script is PHP-based and incredibly slow: about two articles per second get parsed and imported into the database, so 1 million articles would need about 138 hours, or more than five and a half days. Even for a small data set and experiment this is unacceptable. Fortunately Wikipedia also provides a Java tool called mwdumper which can process about 70 articles per second. The first time I ran this Java program it crashed after 150'000 articles. I am still not too sure what caused the crash, but I decided to tweak the MySQL settings in /etc/mysql/my.cnf a little. So after assigning more memory to MySQL I started the import a second time, only to realize that it could not continue from the partially imported dump. After truncating all tables I restarted the whole import process again. Right now it is still ongoing, but it has already imported 1'384'000 articles and my system still seems stable.

Summary: How to download Wikipedia in a nutshell

  1. Install some Linux on your computer; I recommend Ubuntu.
  2. Use the package manager to install MySQL, Apache and PHP.
  3. Install and set up MediaWiki (this can also be done via the package manager).
  4. Read http://en.wikipedia.org/wiki/Wikipedia:Database_download (unlike the German version, file size issues are discussed within the article).
  5. Find the Wikipedia dump that you want to download on the above page.
  6. Use the Java program MWDumper and read its instructions (a typical invocation is sketched below).
  7. Install Java if you haven't already (this can also be done with the package manager).
  8. Donate some money to the Wikimedia Foundation, or at least contribute to Wikipedia by correcting some articles.
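
For step 6, the MWDumper invocation documented at the time looked roughly like the line below; the dump file name, database name and user are placeholders, and you should check the current MWDumper instructions since the options may have changed.

java -jar mwdumper.jar --format=sql:1.5 dewiki-latest-pages-articles.xml.bz2 | mysql -u wikiuser -p wikidb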

So obviously I haven't even started with the really interesting research on the German Wikipedia. But I thought it might already be interesting to share the experience I have had so far. A nice but not surprising side effect, by the way, is that the local version of Wikipedia is amazingly fast. After I have crawled and indexed Wikipedia and transferred the data to some format that I can use better, I might write another article, and maybe I will even publish a dump of my data structures.
