Data Mining – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany https://www.rene-pickhardt.de Extract knowledge from your data and be ahead of your competition Tue, 17 Jul 2018 12:12:43 +0000 en-US hourly 1 https://wordpress.org/?v=4.9.6

Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language https://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/ Tue, 06 Nov 2012 11:55:02 +0000

As I said yesterday, I have been busy over the last months producing content, so here you go. For Related Work we will most likely use neo4j as our core database. This makes sense since we are basically building a kind of social network, and most queries we need to answer while offering the service or during data mining have a friend-of-a-friend structure.
Since some of our queries involve counting and aggregation, I was wondering what the most efficient way of querying a neo4j database is. So I ran a benchmark, with quite surprising results.
A quick remark: we used a database consisting of papers and authors extracted from arxiv.org, one of the biggest preprint sites on the web. The data set is available for download, so the benchmark results can be reproduced, at http://blog.related-work.net/data/
The database as a neo4j file is 2 GB (zipped), and the schema looks roughly like this:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we were trying to find coauthors, which is basically a friend-of-a-friend query following the author relationship (i.e. a breadth-first search of depth 2).
There are basically three ways of communicating with the neo4j database:

Java Core API

Here you work directly on the node and relationship objects within Java. Once you have fixed an author node, formulating the query looks roughly like this:

for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
    Node paper = rel.getOtherNode(author);
    for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node coAuthor = coAuthorRel.getOtherNode(paper);
        if (coAuthor.getId() == author.getId()) continue; // skip the author himself
        resCnt++;
    }
}

We see that such code can easily become confusing once queries get more complicated. On the other hand, one can easily combine several similar traversals into one big query, which hurts readability but increases performance.

Traverser Framework

The Traverser Framework ships with the Java API, and I really like the idea behind it. It is really easy to understand the meaning of a query, and in my opinion it helps a lot with code readability.

TraversalDescription description = Traversal.description()
        .breadthFirst()
        .relationships(RelationshipTypes.AUTHOROF)
        .evaluator(Evaluators.atDepth(2))
        .uniqueness(Uniqueness.NONE);
for (Path p : description.traverse(author)) {
    Node coAuthor = p.endNode();
    resCnt++;
}

Especially if you have many similar queries, or queries that are refinements of other queries, you can save the traversal descriptions and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is the Cypher Query Language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why: it is a really beautiful language that is close to SQL (looking at Stack Overflow, it is actually frightening how many people try to answer FOAF queries using MySQL) but still emphasizes the graph-like structure.

ExecutionEngine engine = new ExecutionEngine(graphDB);
String query = "START author=node(" + author.getId() +
        ") MATCH author-[:" + RelationshipTypes.AUTHOROF.name() +
        "]-()-[:" + RelationshipTypes.AUTHOROF.name() +
        "]-coAuthor RETURN coAuthor";
ExecutionResult result = engine.execute(query);
scala.collection.Iterator<Node> it = result.columnAs("coAuthor");
while (it.hasNext()) {
    Node coAuthor = it.next();
    resCnt++;
}
I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let's just point out: Cypher is quite expressive.

What were the results?

All queries have been executed 11 times; the first run was thrown away since it only warms up the neo4j caches. The values are averages over the other 10 executions.
  • The Core API is able to answer about 2000 friend-of-a-friend queries per second (I have to admit, on a very sparse network).
  • The Traverser Framework is about 25% slower than the Core API.
  • Worst is Cypher, which is at least one order of magnitude slower, only able to answer about 100 FOAF-like queries per second.
  • I was shocked, so I talked with Andres Taylor from neo4j, who mainly works on Cypher. He asked which neo4j version I used, and I said 1.7. He told me to check out 1.9, since Cypher had become more performant. So I ran the benchmarks on neo4j 1.8 and neo4j 1.9; unfortunately Cypher became slower in the newer releases.

One can see that the Core API outperforms Cypher by an order of magnitude and the Traverser Framework by about 25%. In newer neo4j versions the Core API became faster and Cypher became slower.

    Quotes from Andres Taylor:

Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on; the focus in this first phase has been to explore the language, to learn about how our users use the query language, and to expand the feature set to a reasonable level.

I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. Like every language ever created, in the beginning you can always do better than the compiler by writing by hand, but eventually, the compiler catches up.

    Conclusion:

So far I have only been using the Java Core API to work with neo4j, and I will continue to do so.
If you are in a high-speed scenario (and I believe every web application is one), you should really think about switching to the neo4j Java Core API for writing your queries. It might not look as nice as Cypher or the Traverser Framework, but the gain in speed pays off.
Also, I personally like the amount of control you have when traversing over the core yourself.
Additionally, I will soon post an article on why scripting languages like PHP, Python or Ruby aren't suitable for building web applications anyway. So changing to the Core API makes sense for several reasons.
    The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
    The detailed results can be found in this spreadsheet.

Foundations of statistical natural language processing Review of chapter 1 https://www.rene-pickhardt.de/foundations-of-statistical-natural-language-processing-review-of-chapter-1/ Tue, 12 Jun 2012 13:31:51 +0000

Due to the interesting results we found by creating Typology, I am currently reading the related work about query prediction and auto completion of sentences. There is quite some interesting academic work available in this area of information retrieval.
While reading these papers I realized that I am not that strong in the field of natural language processing, which seems to have a deep impact on my current research interests. That's why I decided to put in a reading session of some basic work. A short trip to the library, and a look into the work cited by the papers, made me find the book Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze. Even though they write in the introduction that their book is far from a complete coverage of the topic, I think it will be a perfect entry point to become familiar with some of the basic concepts of NLP. Since one understands and remembers better what one has read if one writes it down, I am planning to write summaries and reviews of the book chapters here in my blog:

    Chapter 1 Introduction

    This chapter is split into several sections and is supposed to give some motivation. It already demonstrates in a good way that in order to understand natural language processing you really have to bridge the gap between mathematics and computer science on the one hand and linguistics on the other hand. Even in the basic examples given from linguistics there was some notation that I did not understand right away. In this sense I am really looking forward to reading chapter 3 which is supposed to give a rich overview of all the concepts needed from linguistics.
I personally found the motivating sections of chapter 1 a bit too long. I am about to learn some concepts, and right now I don't really have the feeling that I need to understand all the philosophical discussions about grammar, syntax and semantics.
What I really loved, on the other hand, was the last section, "dirty hands". In this section a small corpus (Tom Sawyer) is used to introduce some of the phenomena one faces in natural language processing, some of which I had already discussed, without knowing it, in the article about text mining for linguists on Ulysses. In the book they are of course discussed in a more structured way, but one can easily download the source code from the above-mentioned article and play around to understand the basic concepts from the book. Among these were:
Word counts / tokens / types. The most basic operation on a text is counting words. This is something I already did in the Ulysses article. Counting words is interesting since in today's world it can be automated. What I didn't see in my last blog post is that counting words already leads to more insights than just a distribution of words, which I will discuss now:
Distinctive words can be spotted. Once I have a corpus consisting of many different texts, I can count all words and create a ranking of the most frequent ones. I will realize that for any given text the ranking looks quite similar to that global ranking. But once in a while I might spot words in the top 100 of a single text that would not appear in, let's say, the top 200 of the global ranking. Those seem to be distinctive words that are of particular interest for the current text. In the example of Tom Sawyer, "Tom" is such a distinctive word.
Hapax legomena. If one looks at all the words in a given text, one will realize that most words occur less than 3 times; words that occur only once are called hapax legomena. This demonstrates the difficulty of natural language processing: the data one analyses is very sparse. The frequent words are mostly grammatical structures, whereas the infrequent words carry the semantics. From this phenomenon one gets very quickly to:
Zipf's law. Roughly speaking, Zipf's law says that when you count the word frequencies of a text and order the words from the most frequent to the least frequent, the rank of a word multiplied by its frequency gives about the same number for every word (that is, the rank is anti-proportional to the frequency). This is of course only an approximation: just imagine the most frequent word occurs an odd number of times; then there is no integer frequency for the second most frequent word that, multiplied by 2, gives exactly the frequency of the most frequent word.
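To make the claim concrete, here is a minimal Java sketch (class and method names are mine) that computes rank · frequency for a toy token list; under Zipf's law the resulting numbers should be roughly constant:

```java
import java.util.*;

public class ZipfCheck {

    /** Rank * frequency for every word, ordered from rank 1 (most frequent) downwards. */
    public static List<Long> rankTimesFrequency(List<String> words) {
        // Count how often every word occurs.
        Map<String, Long> freq = new HashMap<>();
        for (String w : words) freq.merge(w, 1L, Long::sum);
        // Sort the frequencies in descending order; the position in this list is the rank.
        List<Long> counts = new ArrayList<>(freq.values());
        counts.sort(Comparator.reverseOrder());
        List<Long> products = new ArrayList<>();
        for (int rank = 1; rank <= counts.size(); rank++) {
            products.add(rank * counts.get(rank - 1));
        }
        return products;
    }
}
```

On a real corpus the products will only be approximately equal, which is exactly the estimation caveat discussed above.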
Anyway, Zipf's law was a very important discovery and has been generalized by Mandelbrot's law (which I so far only knew from chaos theory and fractals). Maybe sometime in the near future I will find the time to calculate the word frequencies of my blog and see if Zipf's law holds (:
Collocations / bigrams. Another important concept was that of collocations. Many words only have a meaning together; in this sense "New York" or "United States" are more than the sum of the single words. The text pointed out that it is not sufficient to find the most frequent bigrams in order to find good collocations; those bigrams have to be filtered, either with grammatical structures or by normalization. I think calculating a Jaccard coefficient might be interesting (even though it was not mentioned in the text). Should I really try to verify Zipf's law in the near future, I will also try to verify my method for calculating collocations. I hope that I would find collocations in my blog like social network, graph data base or news stream…
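Since the book does not spell out the Jaccard idea, the following sketch is my own interpretation: count adjacent word pairs, then score a bigram by c(w1 w2) / (c(w1) + c(w2) − c(w1 w2)), so that pairs which rarely occur apart score close to 1 (all names are mine):

```java
import java.util.*;

public class Collocations {

    /** Counts every adjacent word pair in a token list. */
    public static Map<String, Integer> bigramCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    /** Jaccard score of a bigram: c(w1 w2) / (c(w1) + c(w2) - c(w1 w2)). */
    public static double jaccard(int bigramCount, int count1, int count2) {
        return (double) bigramCount / (count1 + count2 - bigramCount);
    }
}
```

A bigram like "New York", whose two words almost never occur apart, would score near 1, while a frequent-but-accidental pair like "of the" would score much lower.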
KWIC. What I did not have in mind so far for the analysis of text is keyword-in-context analysis. Here you look at all text snippets that occur in a certain window around a keyword. This seems more like work from linguistics, but I think automating this task would also be useful in natural language processing. So far it never came to my mind when using a computer system that it would also make sense to describe words by their context. Actually pretty funny, since this is the most natural operation we do when learning a new language.
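Automating KWIC really is only a few lines: for every occurrence of the keyword, collect the tokens in a fixed window around it. A small sketch (names are mine):

```java
import java.util.*;

public class Kwic {

    /** Returns the token window of the given radius around every occurrence of the keyword. */
    public static List<String> contexts(List<String> tokens, String keyword, int radius) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(keyword)) continue;
            int from = Math.max(0, i - radius);                 // clip at the text start
            int to = Math.min(tokens.size(), i + radius + 1);   // clip at the text end
            result.add(String.join(" ", tokens.subList(from, to)));
        }
        return result;
    }
}
```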
    Exercises
I really liked the exercises in the book. Some of them were really straightforward, but I had the feeling they were carefully chosen to demonstrate the information given in the book so far.
What I was missing around these basic word statistics is the order of the words. This was somewhat reflected in the collocation problem. Even though I am not an expert yet, I have the feeling that most methods in statistical natural language processing seem to "forget" the order in which words appear in the text. I am convinced that this is an important piece of information; it already inspired me in my diploma thesis to create a method similar to Explicit Semantic Analysis, and it is a core element of Typology!
Anyway, reading the first chapter of the book I did not really learn anything new, but it helped me take a certain point of view. I am really excited to proceed. The next chapter will be about probability theory. I already saw that it is written in a pure math style, with examples like rolling dice rather than examples from corpus linguistics, which I find a bit sad.

Neo4j based Typology also awarded top 90 at Google Science fair. https://www.rene-pickhardt.de/neo4j-based-typology-also-awarded-top-90-at-google-science-fair/ Tue, 22 May 2012 13:08:48 +0000

Yesterday I shared the good news about Till and Paul, who were awarded one of the top 5 projects at the German federal competition for young scientists. Today the good news continues: together with more than 1000 competitors, they also submitted their project to the Google Science Fair.
And guess what: Google likes graph databases, information retrieval, auto completion, and sentence prediction. Till and Paul have been selected (together with 5 other Germans) for the second round, among the 90 remaining projects. Even Heise reported about this! Within the next month they will have phone calls with Google employees, and with a little bit of luck they will be selected for the top 15, which would mean being invited to the Googleplex in Mountain View, California.
Remember: Typology is open source and already available as an Android app (only in German, but this will be fixed soon).
Btw, I am just wondering when I will be traveling to Google in Mountain View?

Typology using neo4j wins 2 awards at the German federal competition young scientists. https://www.rene-pickhardt.de/typology-using-neo4j-wins-2-awards-at-the-german-federal-competition-young-scientists/ Mon, 21 May 2012 09:44:54 +0000

Two days ago I arrived in Erfurt to visit the federal competition for young scientists (Jugend forscht). I have reported before about the project Typology by Till Speicher and Paul Wagner, which I supervised over the last half year and which has already won many awards.
On Saturday night they already won a special award donated by the Gesellschaft für Informatik, titled "special award for a contribution which demonstrates in a particular way the usefulness of computer science for society" (Sonderpreis für eine Arbeit, die in besonderer Art und Weise den Nutzen der Informatik verdeutlicht). This award came with 1500 Euro in cash!
Yesterday was the final award ceremony. I was quite excited to see how the hard work that Till and Paul put into their research project would be evaluated by the jury. Out of 457 submissions, in a tough competition, the top 5 projects were awarded. Till and Paul came in 4th and will now be allowed to visit the German chancellor Angela Merkel in Berlin.
Using spreading activation, Typology is able to make precise predictions of what you are going to type next on your smartphone. It outperforms the current scientific standard (language models) by more than 100% and has a precision of 67%!
    A demo of the project can be found at www.typology.de
The Android app of this system is available in the app store. It is currently only available for German text and is in beta, but the source code is open, and anybody who wants to contribute can check out the code. A mailing list will be set up soon; anyone who is interested can already drop a short message at mail@typology.de and will be added as soon as it is established.
You can also look at the documentation, which will soon be available in English. Alternatively, you can file bugs and request features in our bug tracker.
I am currently trying to improve the data structures and the database design in order to make retrieval of suggestions faster and decrease the load on our web server. If you have expertise in top-k aggregation joins in combination with prefix filtering, drop me a line and we can discuss this issue!
    Happy winners Paul Wagner (left) and Till Speicher (right) with me

Amazed by neo4j, gwt and my apache tomcat webserver https://www.rene-pickhardt.de/amazed-by-neo4j-gwt-and-my-apache-tomcat-webserver/ Thu, 15 Sep 2011 09:56:48 +0000

Edit: the demo is finally online, though on a different data set: check out the demo and read about the new data set. An evaluation of Graphity can be found here.
Besides reading papers I am currently implementing the infrastructure of my social news stream for the new metalcon version. For the very first time I was really using neo4j on a remote webserver in a real web application built on GWT. This combined the advantages of all these technologies and our new fast server! After seeing the results I am so excited I almost couldn't sleep last night!

    Setting

I selected a very small bipartite subgraph of metalcon, consisting of just the fans and bands together with the fanship relation between them. This graph has 12'198 nodes (6'870 bands and 5'328 users) and 119'379 edges.

    Results

    • For every user I displayed all their favourite bands
    • for each of those bands I calculated similar bands (on the fly during the page request!)
    • this was done by breadth-first search (depth 2), counting nodes on the fly

A page load for a random user with 56 favourite bands ends up traversing 555'372 relationships. Together with sending the result via GWT over the web, this was done in about 0.9 seconds!
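Stripped of neo4j, the on-the-fly similarity computation above is just a depth-2 count: for every fan of a band, tally the other bands in that fan's favourite list. A plain-Java sketch of that counting step (the maps stand in for the actual graph traversal; all names are mine):

```java
import java.util.*;

public class SimilarBands {

    /** For one band, counts how often each other band occurs among its fans' favourites. */
    public static Map<String, Integer> similar(Map<String, Set<String>> fanToBands, String band) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> bands : fanToBands.values()) {
            if (!bands.contains(band)) continue;   // this user is not a fan of the band
            for (String other : bands) {           // second traversal step: fan -> other bands
                if (!other.equals(band)) counts.merge(other, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Sorting the resulting counts in descending order gives the "similar bands" list that the page request computes on the fly.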

    Comparison to mySQL

    I calculated the most similar bands using this query:
select ub.Band_ID, count(*) as anzahl
from UserBand ub
join UserBand ub1 on ub.User_ID = ub1.User_ID
where ub1.Band_ID = 3006
group by ub.Band_ID
order by anzahl desc
This took 0.17 seconds on average, for just one band!
Multiply this number by 56 and you get 9.5 seconds! And we haven't even included sending the data and rendering the HTML yet.

    Demo

Though we will release the software as open source soon, right now I cannot provide a demo. This is due to the fact that currently browsing this data reveals more user data than the users' privacy settings would allow! But I can encourage you to bookmark this link and check it out once in a while, since we are about to get rid of these privacy problems and demonstrate our results:
    http://gwt.metalcon.de/GWT-Modelling/

    Summary

I am really excited. Very seldom have I been so keen to keep programming on something in order to see further results! Unfortunately it is still a long way down the road, but we will make it. What is the speed going to be once I have really implemented the efficient data structures and caching in the live system? And what happens when multiple users use it and also write to the database?
    If you want to join our open source project feel free to contact me!

Time lines and news streams: Neo4j is 377 times faster than MySQL https://www.rene-pickhardt.de/time-lines-and-news-streams-neo4j-is-377-times-faster-than-mysql/ Mon, 18 Jul 2011 10:20:49 +0000

Over the last weeks I did some more work on neo4j, and I am ready to present some more results on speed (in my use case neo4j outperformed MySQL by a factor of 377! That is more than two orders of magnitude). As you may know, one part of my PhD thesis is to create a social news stream application around my social networking site metalcon.de. It is very obvious that a graph structure is very natural for social news streams: you go to a user, traverse to all his friends or objects of interest, and then traverse one step deeper to the newly created content items. A problem with this kind of application is sorting the content items by time or relevance. But before I discuss those problems, I just want to present another comparison between MySQL and neo4j.

    Setting:

    I took a fairly small dataset with the following properties:

    • 14986 content items (entries in bands' forums)
    • 1391 bands
    • 854 users having at least one of those bands in their list of favourite bands
    • a bipartite graph of fan relations between users and bands

For every user I wanted to select all content items from his favourite bands. I know this is far from the social news stream application I am about to build, but I used it as a first test to see whether graph databases really are the more natural setting for my questions.

    Results:

In MySQL this would look something like the following (pseudo-code around the actual query):

for all (User_ID in Users) {
    SELECT e.Text
    FROM Entry e
    JOIN Topic t ON t.ID = e.Topic_ID
    JOIN UserBand ub ON t.Band_ID = ub.Band_ID
    WHERE ub.User_ID = User_ID
}
Even though we have all relevant columns indexed, those joins are expensive, especially because the Entry table is much bigger than 14986 items.
Using MySQL it took 152 seconds = 2:32 minutes to create the interesting news streams for all 854 users, or 177 ms per user.

    Let’s look at neo4j:

Not using any traverser, but just some quickly hacked code, I have something like the following:

for (Node user : usersFromIndex) {
    for (Relationship rel : user.getRelationships(DynamicRelationshipType.withName("FAN"), Direction.BOTH)) {
        Node n1 = rel.getOtherNode(user);
        for (Relationship r2 : n1.getRelationships(DynamicRelationshipType.withName("CONTENTITEM"), Direction.BOTH)) {
            Node n2 = r2.getOtherNode(n1);
            edgecnt++;
        }
    }
}

Even though we only have 854 users and 1391 bands, we end up traversing 1'368'270 relationships to retrieve all content items for all favourite bands of each user.
Using neo4j it took 3.4 seconds, or 4 ms per user.
That is about 44 times faster than MySQL.

    After warming the caches.

When repeating this experiment, MySQL does not get faster. I know it has a query cache, but I guess it gets emptied before the first queries are run again. In neo4j, though, the result gets even better: each time I repeated the experiment, the runtime decreased, until I reached something like 0.4 seconds for the whole job, which is 377 times faster than MySQL. The best part of this is scaling. When my social graph grows, the search in neo4j stays a local search: more users and more discussions do not mean more edges to traverse from a user to his favourite bands. In MySQL, though, the cost of these heavy joins will just explode.
Yes, I know that in MySQL I would denormalize my tables to gain speed for such an application. But denormalizing means more redundancy, and again, a graph is a much more natural structure for a social application.

    Open Questions! Feel free to discuss them (-:

There is a very important open question though, and that is sorting. A social news stream should of course be sorted by time. In MySQL it is easy to create an index over a column with timestamps on the content items; sorting my result set by time will in this case not increase the runtime.
In a graph database with this particular schema I am not quite sure how to retrieve the results in a sorted way (anyone have some ideas?). There needs to be further testing to see whether sorting after retrieval still makes my solution faster than MySQL, or (my preferred solution) whether there is some way of designing the graph such that for any user with any set of favourite bands there is a quick way of traversing the content items and receiving them in an ordered way. I even guess this is already possible in neo4j, using traversers with breadth-first search and telling them in which order to traverse relations. I just have to look deeper into this, and I will keep you updated.
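The "sort after retrieving" option at least is straightforward to write down: collect the traversed content items together with their timestamp property and sort once at the end. A sketch with plain objects standing in for neo4j nodes (all names are mine):

```java
import java.util.*;

public class NewsStreamSort {

    /** Stand-in for a neo4j content-item node with a timestamp property. */
    public static class ContentItem {
        public final String text;
        public final long timestamp;
        public ContentItem(String text, long timestamp) {
            this.text = text;
            this.timestamp = timestamp;
        }
    }

    /** Returns the retrieved items sorted newest first. */
    public static List<ContentItem> newestFirst(List<ContentItem> items) {
        List<ContentItem> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingLong((ContentItem c) -> c.timestamp).reversed());
        return sorted;
    }
}
```

The open question remains whether this final O(n log n) sort eats up the advantage over MySQL once the number of retrieved items gets large.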
I am happy and open for comments and suggestions! Oh, and could anyone suggest a nice syntax highlighting plugin for WordPress?

Analysing user behaviour in internet communities https://www.rene-pickhardt.de/analysing-user-behaviour-in-internet-communities/ Fri, 06 May 2011 17:55:58 +0000

Yesterday Julia Preusse – who started her PhD studies at WeST just a month before I did – gave a talk about her research interests. She is interested in communities and the behaviour of users in discussions. I thought she had some interesting ideas, which is why I asked her for a meeting today. We sat down and discussed some of her ideas, and it seemed that she liked my approach to solving some of her problems.
After only 30 minutes of discussion we had a concrete roadmap for some interesting research on user behaviour in internet message boards. Of course I cannot talk about the concrete idea here – most of it is Julia's intellectual property and she has not published it yet – but the general idea is that we want to analyze the reply structure of users within topics in order to cluster users into different groups. From there Julia wants to go to the next level. We talked with Dr. Thomas Gottron about our ideas, and he seemed very happy with our approach.
Even though my focus is on social news streams, I think Julia's topics are very interesting, and working together with her could also help me gain deeper insight into user behaviour on metalcon. That is why I will help her over the next weeks to work on these questions. I am excited to see the output and will of course keep you updated on the topic. And I won't forget to work on my own research topics in the meantime (-:

Neo4j Graph Database vs MySQL https://www.rene-pickhardt.de/neo4j-graph-database-vs-mysql/ Thu, 05 May 2011 21:36:32 +0000

For my social news stream application I am thinking hard about the right software for my backend. After I designed a database model in MySQL I talked to Jonas, and he suggested searching for a better suited technology. A little bit of research brought me to a graph database called Neo4j.
After I downloaded the open-source Java libs and got them running with Eclipse and Maven, I did some testing with my metalcon data set, and I have been very satisfied: the whole project looks very promising to me. I exported 4 relations from my MySQL database:

    1. UserUserFriend containing all the friendship requests
    2. UserProfileVisit containing the profiles a user visited
    3. UserMessage containing the messages between users
    4. UserComment containing the profile comments between users

These relations obviously form a graph on my data set. Reading the several 100'000 lines of data, putting them into the graph data structure and building a search index on the nodes took only a few seconds of runtime. But I was even more impressed by the speed with which it was possible to traverse the graph!
Retrieving the shortest path (of length 4) between two users took me only 150 milliseconds. Doing a full breadth-first search on a different, heavily connected graph with 290'000 edges took only 2.7 seconds, which means that neo4j is capable of traversing about 100'000 edges per second.
Now I will have to look more carefully at my use case. Obviously I want edges that are labeled with timestamps, and I want to retrieve them in ordered lists. Adding key-value pairs to the edges and including an index is possible, which makes me optimistic that I will be able to solve a lot of my queries of interest in an efficient manner.
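To illustrate the ordered-retrieval idea, here is a plain-Java sketch in which a TreeMap keyed by timestamp stands in for a timestamp index on the edges of one node (all names are mine; for simplicity it assumes distinct timestamps):

```java
import java.util.*;

public class TimestampedEdges {

    // Edges of one node, keyed by timestamp; a TreeMap iterates in ascending key order.
    private final TreeMap<Long, String> edges = new TreeMap<>();

    public void addEdge(long timestamp, String target) {
        edges.put(timestamp, target);
    }

    /** Returns the targets of the n most recent edges, newest first, without a full sort. */
    public List<String> latest(int n) {
        List<String> result = new ArrayList<>();
        for (String target : edges.descendingMap().values()) {
            if (result.size() == n) break;
            result.add(target);
        }
        return result;
    }
}
```

Because the structure is kept ordered at insertion time, reading the newest n edges costs only O(n) per request instead of a sort over all edges.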
Unfortunately I am battling with Google Web Toolkit, Eclipse and Neo4j, which I want to combine for the new metalcon version. I even asked the neo4j mailing list a very embarrassing question, and the guys from Neo Technology were very kind and helpful (even though I still couldn't manage to get it running). I will post an article here as soon as I know how to set everything up.
In general I am still a huge fan of relational databases, but for the use case of social networks I see why graph databases seem to be the more sophisticated technology. I am pretty sure that I could not have performed so well using MySQL.
    What is your experience with graph data bases and especially neo4j?

    Social news streams – a possible PhD research topic? https://www.rene-pickhardt.de/social-news-streams-a-possible-phd-research-topic/ https://www.rene-pickhardt.de/social-news-streams-a-possible-phd-research-topic/#comments Mon, 25 Apr 2011 22:03:08 +0000 http://www.rene-pickhardt.de/?p=351 I have now been reading papers for two months since I started my PhD program. Enough time to think about possible research topics. I am more and more interested in search, social networks in general and social news streams in particular. It is becoming more and more important to aggregate news around a user’s interests and social circle and display them to the user in an efficient manner. Facebook and Twitter do this in an obvious way, but Google, Google News and a lot of other sites have similar products as well.

    Too much information in one’s social environment

    To create a news stream, one possibility is to simply show the most recent information to the user (as Twitter does). Due to the huge amount of information created, one wants to filter the results in order to improve the user experience. Facebook was the first to filter the news stream on its site, which led to the widespread discussion about its ironically named EdgeRank algorithm. Many users seem to be unhappy with the user experience of Facebook’s Top News.
    Also, for some information, such as an event taking place in the future, the moment it becomes available might not be the best moment to display it.

    Interesting research hook points and difficulties

    I observed these trends and realized that this problem can be seen as a special case of search or, more generally, of recommendation engines in information retrieval. We want to obtain the most relevant information updates around a certain time window for every specific user.
    This problem seems to me algorithmically much harder than web search, where the results don’t have this time component and for a long time weren’t personalized to the user’s interests either. The time component makes it hard to judge relevance: the information is new, and you don’t have any votes or other indicators of relevance yet. Consider a news source or a person in someone’s environment that wasn’t important before. All of a sudden this person could provide a highly relevant and useful piece of information to the user.

    My goal and roadmap

    Fortunately, in the past I created metalcon.de together with several friends. Metalcon is a social network for heavy metal fans. On Metalcon users can access information (CD releases, upcoming concerts, discussions, news, reviews, …) about their favorite bands, concerts and venues in their region, as well as updates from their friends. This information can perfectly be displayed in a social news stream. On the other hand, Metalcon users share information about their taste in music, the venues they go to and the people they are friends with.
    This means that I have a perfect sandbox to develop and test (with real users) some smart social news algorithms that aggregate and filter the most relevant news for our users based on their interests.
    Furthermore, regional information and information about music are available as linked open data, so the news stream can easily be enriched with semantic components.
    Since I am about to redesign Metalcon (a lot of work) for the purpose of research, and since I am about to go in this direction for my PhD thesis, I would be very happy to receive some feedback and thoughts on my suggested research topic. You can leave a comment or contact me.
    Thank you!


    Facebook User Search: Ever wondered how Facebook is more social than others? https://www.rene-pickhardt.de/facebook-user-search-ever-wondered-how-facebook-is-more-social-than-others/ https://www.rene-pickhardt.de/facebook-user-search-ever-wondered-how-facebook-is-more-social-than-others/#comments Mon, 14 Mar 2011 21:58:03 +0000 http://www.rene-pickhardt.de/?p=299 After Eli’s TED talk and my recent article about the filter bubble I decided to dig a little deeper into Facebook’s EdgeRank algorithm, which decides which updates appear in your news feed. I found some more scientific background on how EdgeRank really works. Even though EdgeRank was first mentioned at Facebook F8 Live on April 21st, 2010, it had already been mentioned in a footnote of the scientific paper “All friends are not equal: using weights in social graphs to improve search” by Sudheendra Hangal, Diana MacLean, Monica S. Lam and Jeffrey Heer, all from the Computer Science department at Stanford University.
    Inspired by this paper I ran a little test to compare the user search of Facebook and of Germany’s once biggest social network StudiVZ (a long time ago). Not surprisingly, Facebook clearly won the battle. But let me first give a brief overview of how social networks rose in Germany.

    History of Facebook and StudiVZ

    In Germany there was this Facebook clone – let’s call it StudiVZ – starting in late 2005. Due to the fact that hardly anyone knew of Facebook, and because StudiVZ started some great word-of-mouth marketing (and stole the entire design from Facebook), it spread very quickly and became THE social network in Germany. In 2007/2008 no one would have imagined how the most popular German website could ever fall behind. StudiVZ (having been acquired by a traditional media company) tried to make advertising dollars, while Facebook gained real social networking know-how. Not surprisingly, Facebook overtook StudiVZ within a couple of months during 2010.

    The Experiment: How good is the user search on social networks?

    A must-have feature of every social networking site is the user search. So I wanted to test how well the user search works on both sites (already knowing that Facebook would easily win this battle). I thought of a person with a very common name who is not a friend of mine on either of these social networking sites.
    After a little bit of thinking I settled on Sebastian Jung. On Facebook as well as on StudiVZ he is registered with his real name (along with about 140 other Sebastian Jungs in Germany). Sebastian was in my grade in high school together with 130 other students. I hardly know him.

    Search for Sebastian Jung on StudiVZ:

    Typing his name into StudiVZ brings up his profile at the 4th position. Lucky me that he has recently updated his StudiVZ profile, which is to my knowledge the variable the user search results are sorted by. If he hadn’t done this, he would have disappeared somewhere among those 140 other Sebastian Jungs that have a StudiVZ profile with the same name.

    Search for Sebastian Jung on Facebook:

    Typing his name into the Facebook search immediately shows his profile in the first position. In my case this is particularly interesting, but let us first explore why Facebook does so well.

    How does Facebook user search rank the results?

    Of course the exact algorithm is secret, but the idea is easy. As everyone knows, we can measure the distance between two people in a social network by the shortest path of people between them. Uff. Shortest path?!? What does this mean?
    For Sebastian Jung and me this shortest path would be of length 1, since I have some friend from my old school who is a friend of Sebastian Jung, which in turn means there is one person between Sebastian Jung and me.
    For our German Chancellor and me the distance would probably be 3 (a wild guess), but I think you get the point. So what Facebook does is sort all the Sebastian Jungs on the result page according to their distance from me. Pretty smart, isn’t it? But Facebook is probably using even a little bit more information. Let us assume I have 4 common friends with this Sebastian Jung and maybe 1 common friend with another Sebastian Jung. The distance in both cases would be 1, but the one I have more common friends with is probably still more relevant to me and will most probably be shown first.
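    That ranking idea – smaller distance first, then more common friends as a tie-breaker – can be sketched as a comparator in a few lines of Java. To be clear, the real Facebook algorithm is unknown; all names here are hypothetical and only illustrate the principle:

    ```java
    import java.util.*;

    // A candidate search result with its graph distance to the searcher
    // and the number of friends shared with the searcher.
    record Candidate(String name, int distance, int commonFriends) {}

    class UserSearchRanking {
        // Sort by smaller distance first, then by more common friends.
        static List<Candidate> rank(List<Candidate> candidates) {
            List<Candidate> sorted = new ArrayList<>(candidates);
            sorted.sort(Comparator.comparingInt(Candidate::distance)
                    .thenComparing(Comparator.comparingInt(Candidate::commonFriends).reversed()));
            return sorted;
        }
    }
    ```

    With this ordering, the Sebastian Jung who shares 4 friends with me beats the one who shares only 1, even though both are at distance 1, exactly as described above.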

    Oh, and why is this particularly interesting in my case?

    You can call me paranoid, but I am still afraid that Facebook knows too much about me if I tell it more about my friendships. That’s why I decided to have 0 friends on Facebook. Obviously Facebook is not only using the actual friendships that exist but also the 120 friendship requests I have received so far and other knowledge (maybe people have uploaded my email address together with their address books). Anyway, this experiment shows that my fear obviously has a reason, but it also shows that I clearly failed to protect my most sensitive data from Facebook.

    Conclusion:

    1. I am still very convinced that Facebook’s success is due to the fact that these little things just silently work perfectly in the background, producing great user satisfaction.
    2. As I always say: you cannot steal an idea on the Internet. If you don’t understand the idea you might have a short-lived success, but then you’ll fail because your product will just not be as good as your competitor’s product.
    3. If you want to be successful on the Internet don’t focus on selling ads and making money in the first place. Look what the big players have been doing! Focus on user satisfaction. If your users are happy I am pretty sure the money and reward will come to you!
    4. Even though the pages look alike and StudiVZ is still copying features from Facebook, they obviously don’t understand the essence of these features and what exactly makes them great. Otherwise, after 5 years of operation, they would have a well-working user search, which should be the kernel of any social networking service.
    5. Much to learn and improve for my own social network Metalcon, which has a crappy search function overall (-:
    6. I still haven’t dug deeper into the EdgeRank algorithm 🙁

    I am happy to read your comments and thoughts as well as your experiments with user search on Facebook and other social networks. What other (technical (!)) reasons do you think make Facebook the superior social network in comparison to sites like MySpace, Orkut, StudiVZ, Bebo, …?
