Social Network – Data Science, Data Analytics and Machine Learning Consulting in Koblenz, Germany – https://www.rene-pickhardt.de

Extracting 2 social network graphs from the Democratic National Committee Email Corpus on Wikileaks
https://www.rene-pickhardt.de/extracting-2-social-network-graphs-from-the-democratic-national-committee-email-corpus-on-wikileaks/ – Thu, 28 Jul 2016
tl;dr version: source code at GitHub!
A couple of days ago a data set was released on Wikileaks consisting of about 23 thousand emails sent within the Democratic National Committee. It is supposed to demonstrate how the DNC was actively trying to prevent Bernie Sanders from becoming the Democratic candidate in the general election. I am interested in who the people with a lot of influence are, so I decided to have a closer look at the data.
Yesterday I crawled the data set and processed it. I extracted two graphs in the KONECT format. Since I am not sure whether I am legally allowed to publish the processed data sets, I will only link to the source code so you can generate the data sets yourself; if you don't know how to run the code but need the information, drop me a mail. I also hope that Jérôme Kunegis will do an analysis of the networks and include them in KONECT.

First we have the temporal graph

This graph consists of 39338 edges. There is a directed edge for each email sent from one person to another person, together with a timestamp of when this happened. If a person puts n recipients in CC, n edges are added to the graph.

rpickhardt$ wc -l temporalGraph.tsv
39338 temporalGraph.tsv
rpickhardt$ head -5 temporalGraph.tsv
GardeM@dnc.org DavisM@dnc.org 1 17 May 2016 19:51:22
ShapiroA@dnc.org KaplanJ@dnc.org 1 4 May 2016 06:58:23
JacquelynLopez@perkinscoie.com EMail-Vetting_D@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com LykinsT@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com ReifE@dnc.org 1 13 May 2016 21:27:16

Clearly the format is: sender TAB receiver TAB 1 TAB date
The data is currently not sorted by the fourth column, but that can easily be done. Clearly an email network is directed and can have multi-edges.
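For illustration, here is a minimal Python sketch of how such temporal edges could be extracted with the standard library's email parser. The directory name and the header handling are my own assumptions for the sketch; the actual code on GitHub builds on Alain Spineux's email library instead.

import os
from email.parser import BytesParser
from email.utils import getaddresses

def temporal_edges(mail_dir):
    # yield one (sender, recipient, date) triple per To/CC recipient of each mail
    parser = BytesParser()
    for name in sorted(os.listdir(mail_dir)):
        with open(os.path.join(mail_dir, name), 'rb') as f:
            msg = parser.parse(f, headersonly=True)
        senders = getaddresses(msg.get_all('From', []))
        recipients = getaddresses(msg.get_all('To', []) + msg.get_all('Cc', []))
        date = msg.get('Date', '')
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient:
                    yield sender, recipient, date

if __name__ == '__main__':
    for sender, recipient, date in temporal_edges('dnc-mails'):  # hypothetical mail directory
        print(sender, recipient, '1', date, sep='\t')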

Second we have the weighted co-recipient network

Looking at the data I discovered that many mails have more than one recipient, so I thought it would be nice to expose the social network structure by counting how often two people occur together in the recipient list of an email. This can reveal a lot about the social network structure of the DNC.

rpickhardt$ wc -l weightedCCGraph.tsv
20864 weightedCCGraph.tsv
rpickhardt$ head -5 weightedCCGraph.tsv
PaustenbachM@dnc.org MirandaL@dnc.org 848
MirandaL@dnc.org PaustenbachM@dnc.org 848
WalkerE@dnc.org PaustenbachM@dnc.org 624
PaustenbachM@dnc.org WalkerE@dnc.org 624
WalkerE@dnc.org MirandaL@dnc.org 596

Clearly the format is: recipient1 TAB recipient2 TAB count
where count counts how often recipient1 and recipient2 have appeared together in mails.
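A minimal sketch of how these weighted pairs could be computed, assuming the recipient lists have already been extracted per mail (e.g. with the parser sketched above); the toy input below is invented:

from collections import Counter
from itertools import combinations

def co_recipient_counts(recipient_lists):
    # count how often each pair of addresses occurs together in a recipient list
    counts = Counter()
    for recipients in recipient_lists:
        for a, b in combinations(sorted(set(recipients)), 2):
            counts[a, b] += 1
    return counts

mails = [['MirandaL@dnc.org', 'PaustenbachM@dnc.org', 'WalkerE@dnc.org'],
         ['MirandaL@dnc.org', 'PaustenbachM@dnc.org']]
for (a, b), n in co_recipient_counts(mails).most_common():
    print(a, b, n, sep='\t')  # the TSV above lists every pair in both directions
    print(b, a, n, sep='\t')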
 

Simple statistics

There have been

  • 1226 senders
  • 1384 recipients
  • 2030 people

included in the mails. The top 7 senders are:

MirandaL@dnc.org 1482
ComerS@dnc.org 1449
ParrishD@dnc.org 750
DNCPress@dnc.org 745
PaustenbachM@dnc.org 608
KaplanJ@dnc.org 600
ManriquezP@dnc.org 567

And the top 7 receivers are:

MirandaL@dnc.org 2951
Comm_D@dnc.org 2439
ComerS@dnc.org 1841
PaustenbachM@dnc.org 1550
KaplanJ@dnc.org 1457
WalkerE@dnc.org 1110
kaplanj@dnc.org 987

As you can see, both kaplanj@dnc.org and KaplanJ@dnc.org occur in the data set, so, as I mention in the Roadmap section at the end of the article, more data cleanup might be necessary to get a more precise picture.
Still, at first glance the data looks pretty natural. In the following I provide a diagram showing the rank-frequency plot of senders and receivers. One can see that some people are way more active than others. Also, the recipient curve is above the sender curve, which makes sense since every mail has one sender but at least 1 recipient.

You can also see the rank / co-occurrence-count diagram of the co-occurrence network. When the ranks go above 2000, the typical network structure picture changes a little bit. I have no plausible explanation for this; maybe it is due to the fact that the data dump is not complete. Still, the data looks pretty natural to me, so further investigation might make sense.
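If you want to reproduce such a rank-frequency plot from the generated temporalGraph.tsv, a minimal matplotlib sketch could look like this (file name as above; the plotting details are my own choice):

from collections import Counter
import matplotlib.pyplot as plt

senders, recipients = Counter(), Counter()
with open('temporalGraph.tsv') as f:
    for line in f:
        sender, recipient = line.split('\t')[:2]
        senders[sender] += 1
        recipients[recipient] += 1

for counter, label in ((senders, 'senders'), (recipients, 'recipients')):
    frequencies = sorted(counter.values(), reverse=True)  # frequency by rank
    plt.loglog(range(1, len(frequencies) + 1), frequencies, label=label)
plt.xlabel('rank')
plt.ylabel('number of mails')
plt.legend()
plt.show()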

Code

The crawler code is a two-liner: just some wget and sleep magic.
The Python code for processing the mails builds upon the Python email library by Alain Spineux, which is released under the LGPL license. My code on top is released under GPLv3 and can be found on GitHub.

Roadmap

  • Use the Generalized Language Model Toolkit to build Language Models on the data
  • Compare with the social graph from twitter – many email addresses or at least names will be linked to twitter accounts. Comparing the Twitter network with the email network might reveal the differences in internal and external communication
  • Improve the quality of the data, i.e. better cleanup. Sometimes people in the recipient list have more than one email address; currently they are treated as two different people. On the other hand, sometimes email addresses are missing and just names are included. These could probably be inferred from the other mail addresses. Also, names in this case serve as unique identifiers, so if two different people are called ‘Bob’ they become one person in the data set.

Why Facebook likes do not really matter
https://www.rene-pickhardt.de/why-facabook-likes-do-not-really-matter/ – Wed, 23 Jul 2014
For a long time I have been trying to educate people about social networks and marketing within social networking services. I thought I would not write about the topic in my blog again, but I just stumbled upon a really nice video summarizing a lot of my concerns from a different angle.
It shows how much click fraud exists with likes on Facebook, why you should not buy any likes, and in general the problems with Facebook's news stream.

I am really wondering at what point in time the industry will realize that Facebook is not a vibrant ecosystem and this thing will break down. I have often predicted the downfall of Facebook and so far it has not happened. I am still expecting it to happen, and I guess we all have to wait just a little bit longer.

Why would musicians use online social networking sites?
https://www.rene-pickhardt.de/why-would-musicians-use-online-social-networking-sites/ – Wed, 17 Jul 2013
For the last 5 years I have been running Metalcon, an online social network for metal fans and metal bands.
As written recently, I have the chance to rewrite the entire platform with a team of 6 programmers.
This time we want to do it the right way. Instead of thinking of features right away, we are now thinking about the various stakeholders for whom we are creating Metalcon.
Of course, being a fan of metal music, I know pretty well what requirements a social network for metal music should fulfill to add value for me.
Running such a platform and being a member of the In Legend team, I can also think of various requirements from musicians, but I want to open the discussion and ask the musicians:

What do you expect from a social networking site and for what reasons would you use it?

We have already created a small list at:
https://github.com/renepickhardt/metalcon/wiki/requirementsBand
which I am sharing here and asking musicians to contribute to, either here in the comment section or via the GitHub wiki. Thanks a lot!

Self promotion

Bands want to advertise their music and get a lot of attention.

Music hosting

Bands often lack technical knowledge to host their music. Metalcon can provide them with the functionality to host promotional songs and also share the player on other websites and with other services.

Control

The worst thing that could happen to a band is that they spend a lot of effort building up their fan base on a social networking site, like they did in the early 2000s, and then the site becomes irrelevant. Similar problems hold with Facebook, where page owners nowadays have to pay money to get their message spread to everybody who liked the fan page.

Therefore a requirement for musicians is to keep control of their fan base that they have grown so far.

Contact with fans and Streetteam

Bands want contact with their most important fans and with people who can organize stuff for them. Having a street team is essential to the success of many bands. Using a social networking service can help the band fulfill this goal.

Privacy

Famous musicians want to use the band profile of a social networking site without revealing their private account.

Staying in contact with other industry players

Musicians might want to stay in contact with partners from

  • labels
  • booking agencies
  • promoters
  • photographers
  • video producers
  • music producers
  • venue owners

Sell products

Bands want the possibility to sell

  • Tickets
  • Merchandise
  • Music (MP3 and CD)

Booking

Bands want the opportunity to book gigs. Giving them a way to get in contact with bookers will help them.

Release management

Bands often have a completely produced record that should be shared with some players from the industry. In this process the music should be

  • hosted in the web
  • watermarked (to prevent leaking)
  • shared privately with selected partners
  • only be streamed from the web

Metalcon finally gets a redesign – Thinking about high scalability
https://www.rene-pickhardt.de/metalcon-finally-becomes-a-redesign-thinking-about-high-scalability/ – Mon, 17 Jun 2013
Finally metalcon.de, the social networking site which Jonas, Jens and I created in 2008, gets a redesign. Thanks to the great opportunities at the Institute for Web Science and Technologies here in Koblenz (why don't you apply for a PhD position with us?) I will have the chance to code up the new version of Metalcon. Kicking off on July 15th, I will lead a team of 5 programmers for the duration of 4 months. Not only will the development be open source, but during this time I will constantly (hopefully on a daily basis) write in this blog about the design decisions we take in order to achieve a well-scaling web service.
Before I share my thoughts on highly scalable architectures for websites, I want to give a little history and background on what Metalcon is and why this redesign is so necessary:

Metalcon is a social networking site for German fans of metal music. It currently has

  • a user base of 10,000 users
  • about 500 registered bands
  • a highly semantic and interlinked database (bands, geographical coordinates, friendships, events)
  • 624 MB of text and structured data about the mentioned topics
  • fairly good visibility in search engines
  • > 30k lines of code (mostly PHP)
  • a badly scaling architecture (own OR-mapper, own AJAX libraries, big monolithic database design, bad usage of PHP, …)
  • no unit tests (so code maintenance is almost impossible)
  • no music and audio files
  • no processes for content moderation
  • no processes to fight spam and block users
  • really bad usability (I could write tons of posts on the points where the usability lacks)
  • no clear distinction of features for users to understand

When we built Metalcon no one on the team had experience with highly scalable web applications, and we were just happy to get it running at all. After returning from China and starting my PhD program in 2011 I was about to shut down Metalcon. Though we had become close friends, the core team was already off to new projects and we were lacking manpower. On the other hand, everyone kept telling me that Metalcon would be a great place to do research. So in 2011 Jonas and I decided to give it another shot and do an open redevelopment. We set up a wiki to document our features and the software, and we created a developer blog which we used to exchange ideas. We also created some open source projects to which we hardly contributed code due to the lack of manpower…
Well, at that time we already knew of too many problems, so patching things up was not the way to go. At least we learned a lot. Thinking about highly scalable architectures at that time, I knew that a news feed (which the old version of Metalcon already had) was at the very core of the user experience. Reading many Stack Exchange discussions I knew that you wouldn't build such a stream on MySQL. Also, playing around with graph databases like Neo4j, I arrived at my first research paper, building Graphity, a system designed to distribute highly personalized news streams to users. Since our development was not proceeding, we never deployed Graphity within Metalcon. Also, building an autocomplete service for the site should not be a problem anymore.

Roadmap for the redesign

  • Over the next weeks I hope to read as many interesting articles about technologies and high scalability as I can possibly find, and I will be more than happy to get your feedback and suggestions here. I will start with the many articles of http://highscalability.com/ – this blog is pure gold for serious web developers.
  • During a nice discussion about scalability with Heinrich we already came up with a potential architecture for Metalcon. I will introduce this architecture soon, but first I want to check the best practices in the high scalability blog.
  • In parallel I will collect the features needed for the new Metalcon version and hopefully be able to pair them with useful technologies. I have already started a wiki page about features and planned technologies to support them.
  • I will also need to decide on the programming language and paradigms for the development. Right now I am weighing Ruby on Rails against GWT. We had some great experiences with the power of GWT, but one major drawback is for sure that the resulting website is more an application than a lightweight website.

So again, feel free to give input and share your ideas and experiences with me and with the community. I will be very grateful for every recommendation of articles, videos, books and so on.

PhD proposal on distributed graph databases
https://www.rene-pickhardt.de/phd-proposal-on-distributed-graph-data-bases/ – Tue, 27 Mar 2012
Over the last week we had our off-campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss their research ideas with a postdoc and other PhD candidates for more than 6 hours. This led to the structure, todos and timetable of my PhD proposal, which has to be finalized over the next couple of days, but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph database technology.

[TODO] 0. Find a template for the PhD proposal

That is straightforward. The task is just to look at other students' PhD proposals and at some major conferences and see what kind of structure they use. A very common structure for papers is Jennifer Widom's structure for writing a good research paper. This or a similar template will help to make the proposal readable in a good way. For this blog article I will follow Jennifer Widom more or less.

1. Write an Introduction

Here I will describe the use case(s) of a distributed graph database. These could be

  • indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
  • running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
  • storing web log files and click streams of users
  • doing information retrieval (recommender systems) in the above scenarios

There could also be quite different use cases, like graphs from

  • biology
  • finance
  • regular graphs 
  • geographic maps like road and traffic networks

2. Discuss all the related work

This is done to name all the existing approaches and challenges that come with a distributed graph database. It is also important to set oneself apart from existing frameworks like graph processing. Here I will name at least the related work in the following fields:

  • graph processing (Signal Collect, Pregel,…)
  • graph theory (especially data structures and algorithms)
  • (dynamic/adaptive) graph partitioning
  • distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
  • redundancy vs fault tolerance
  • network programming (protocols, latency vs bandwidth)
  • databases (ACID, multiple user access, …)
  • graph database query languages (SPARQL, Gremlin, Cypher, …)
  • social network and graph analysis and modelling

3. Formalize the problem of distributed graph databases

After describing the related work and knowing the standard terminology it makes sense to really formalize the problem. Several steps have to be taken: a notation for distributed graph databases needs to be fixed. This has to respect two things:
a) the real – so far unknown – problems that will be solved during the PhD. In this way fixing the notation and formalizing the (unknown) problem will be kind of hard.
b) the use cases: for the web use case this will probably translate to scale-free small-world graphs with a very small diameter. In order to respect use cases other than the web, it will probably make sense to cite different graph models from the related work, e.g. mathematical models to generate graphs with certain properties.
The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to choose the use case so general that all special cases and borderline cases are included. In particular, the use case should be a real extension to graph processing, which should of course be possible with a distributed graph database.
One very important part of the formalization will lead to a first research question:

4. Graph Query languages – Graph Algebra

I think graph databases are not really general purpose databases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information about a local neighborhood of data points is frequently needed. They also often seem to be useful when schemaless data is processed. This leads to the question of a query language. Obviously (?) the more general the query language, the harder it is to have a very efficient solution. The model of a relational algebra was a very successful concept in relational databases. I guess a similar graph algebra is needed as a mathematical concept for distributed graph databases as a foundation of their query languages.
Remark that this chapter has not much to do with distributed graph databases but with graph databases in general.
The graph algebra I have in mind so far is pretty similar to Neo4j and consists of some atomic CRUD operations. Once the results are known (either as an answer from the related work or through my own research) I will be able to run my first experiments in a distributed environment.
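To make these atomic operations concrete, here is a toy, single-machine Python sketch of the kind of CRUD interface I have in mind; the operation names and the adjacency-list layout are only an illustration, not a fixed algebra:

class GraphStore:
    # toy in-memory graph store exposing atomic CRUD operations

    def __init__(self):
        self.properties = {}  # node -> dict of properties
        self.adjacency = {}   # node -> set of successor nodes

    def create_node(self, node, **props):
        self.properties.setdefault(node, {}).update(props)
        self.adjacency.setdefault(node, set())

    def create_edge(self, source, target):
        self.adjacency.setdefault(source, set()).add(target)

    def read_properties(self, node):
        return dict(self.properties.get(node, {}))

    def read_neighbors(self, node):
        return set(self.adjacency.get(node, ()))

    def update_node(self, node, **props):
        self.properties.setdefault(node, {}).update(props)

    def delete_edge(self, source, target):
        self.adjacency.get(source, set()).discard(target)

    def delete_node(self, node):
        self.properties.pop(node, None)
        self.adjacency.pop(node, None)
        for successors in self.adjacency.values():
            successors.discard(node)

The experiments described below would then swap the dictionary-of-sets representation for other data structures and distribution strategies while keeping such an interface stable.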

5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

As expected, the graph algebra will consist of some atomic CRUD operations. Those operations have to be tested against all the different data structures one can think of, in the different known distributed environments, over several different real-world data sets. This task will be rather straightforward. It will be possible to know the theoretical results of most implementations. The reason for this experiment is to collect experimental experience in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting are. Already in the evaluation of Graphity I realized that there is a huge gap between theoretical predictions and the real results. So I am convinced that this experiment is a good step forward, and the deep understanding gained by actually implementing all this will hopefully lead to:

6. Development of hybrid data structures (creative input)

It would be the first time in my life that I ran such an experiment without new ideas coming up for tweaking and tuning. So I expect to have learnt a lot from the first experiment and to have some creative ideas on how to combine several data structures and distribution techniques in order to build a better (especially larger-scaling) distributed graph database technology.

7. Analysis of multiple user access and ACID

One important aspect of a distributed graph database that was not in the focus of my research so far is the part that actually makes it a database and sets it apart from a graph processing framework. Even after finding a good data structure and distribution model, there are new limitations once multi-user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above and more reading on ACID in distributed computing will lead to more research questions and ideas on how to test several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension to paragraph 5.

8. Again creative input for multiple user access and ACID

After having learnt what the best data structures for basic query operations in a distributed setting are, and also what the best methods to achieve ACID are, it is time for more creative input. The goal will be to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease of ACID. Once this is done, everything is straightforward again.

9. Comprehensive benchmark of my solution with existing frameworks

My own solution has to be benchmarked against all the standard technologies for distributed graph databases and graph processing frameworks.

10. Conclusion of my PhD proposal

So the goal of my PhD is to analyse different data structures and distribution techniques for the realization of a distributed graph database. This will be done with respect to a good runtime for some basic graph queries (CRUD), respecting a standardized graph query algebra as well as multi-user access and the paradigms of ACID.

11. Timetable and milestones

This is a rough schedule fixing some of the major milestones.

  • 2012 / 04: hand in PhD proposal
  • 2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
  • 2012 / 10: experiments of basic CRUD operations done
  • 2013 / 02: paper with results from basic CRUD operations done
  • 2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
  • 2013 / 08: min. 3-month research internship in a company benchmarking my system on real data
  • end of 2013: publishing the results
  • 2014: 9 months of writing my dissertation

For anyone who has input, knows of papers or can point me to similar research I am more than happy if you could contact me or start the discussion!
Thank you very much for reading so far!

Why Musicians should have a Bandpage on Google Plus!
https://www.rene-pickhardt.de/why-musicians-should-have-a-bandpage-on-google-plus/ – Tue, 13 Mar 2012
The following infographic for businesses was released by Chris Brogan and demonstrates quite well why musicians should get on Google Plus and how to use it. It is released under a Creative Commons licence: http://creativecommons.org/licenses/by-nc-sa/2.0/
Very good work!

Related-work.net – Product Requirement Document released!
https://www.rene-pickhardt.de/related-work-net-product-requirement-document-released/ – Mon, 12 Mar 2012
Recently I visited my friend Heinrich Hartmann in Oxford. We talked about how research is done these days and how the web could theoretically help to spread information faster and connect people interested in the same papers and topics more efficiently.
The idea of http://www.related-work.net was born. A scientific platform which is open source and open data and tries to solve those problems.
But we did not want to reinvent the wheel. So we did some research on existing online solutions and also asked people from various disciplines to name their problems. Find below our product requirement document! If you like our approach you can contact us, contribute to the source code or find some starting documentation!
So the plan is to fork an open source question-answering system and enrich it with features fulfilling the needs of scientists and with some social aspects (hopefully using Neo4j as a supporting database technology), which will eventually help to rank the related work of a paper.
Feel free to provide us with feedback and wishes and join our effort!

Beginning of our Product Requirement Document

We propose to create a new website for the scientific community which brings together people who are reading the same paper. The basic idea is to mix the functionality of a Q&A platform (like MathOverflow) with a paper database (like arXiv). We follow a strict openness principle by making available the source code and the data we collect.
We start with an analysis of how the internet is currently used in different fields and explain the shortcomings. The actual product description can be found under the section “Basic idea”. At the end we present an overview of the websites which follow a similar approach.
This document – as well as the whole project – is work in progress. We are happy about any kind of comments or other contributions.

The distribution of scientific knowledge

Every scientist has to stay up to date with the developments in his area of research. The basic sources for finding new information are:

  • Conferences
  • Research Seminars
  • Journals
  • Preprint-servers (arXiv)
  • Review Databases (MathSciNet, Zentralblatt, …)
  • Q&A Sites (MathOverflow, StackOverflow, …)
  • Blogs
  • Social Networks (Twitter, Google+)
  • Bibliographic Databases (Mendeley, nNode, Medline, etc.)

Every community has found its very own way of using these tools.

Mathematics by Heinrich Hartmann – Oxford:

To stay up to date with recent developments I check arxiv.org on a daily basis (RSS feed), participate in mathoverflow.net and search for papers via Google Scholar or MathSciNet. Occasionally interesting work is shared by people in my Google+ circles. In general the speed of pure mathematics is very slow. New research often builds upon work which has been out for a few years. To stay reasonably up to date it is enough to go to conferences every 3-5 months.
I read many papers on my own because I am the only one at the department who does research on that particular topic. We have a reading class where we read papers/lecture notes which are relevant for more people. Usually they are concerned with introductions to certain kinds of theory. We have weekly seminars where people talk about their recently published work. There are some very active blogs by famous mathematicians, but in my area blogs play virtually no role.

Computer Science by René Pickhardt – Uni Koblenz

In Computer Science topics are evolving but also changing very quickly. It is always important to have both an overview of upcoming technologies (which you get from tech blogs) as well as access to current research trends.
Since the speed in computer science is so fast and the review process in journals often takes a long time, our main sources of information and papers are conferences and Twitter.

  • Usually conference papers are distributed digitally to participants. If one is interested in those papers google queries like “conference name year papers” are frequently used. Sites like http://www.sciweavers.org/ host and aggregate preprints of papers and organize them by conference.
  • The general method to follow a conference that one is not attending is to follow the hashtag of the conference on Twitter. In general Twitter is the most-used tool to share, distribute and find information, not only for papers but also for the above-mentioned news about upcoming technologies.

Another rich source for computer scientists is, of course, the related work of papers and Google Scholar. Especially useful is the method of finding a very influential paper with more than 1000 citations and then finding newer papers that cite it and contain a certain keyword, which is one of the features of Google Scholar.
The main problem in computer science is not to find a rare paper or idea but rather to filter the huge amount of publications (including bad ones) and to keep track of trends. A system that ranks and summarizes papers (not only by abstract and citation counts) would help me a lot in selecting which related work of a paper I should read!

Psychology by Elisa Scheller – Uni Freiburg

As a psychologist/neuroscientist, I receive recommendations for scientific papers via google scholar alerts or science direct alerts (http://www.sciencedirect.com/); I receive alerts regarding keywords or regarding entire journal issues. When I search for a certain publication, I use pubmed.org or scholar.google.com. This can sometimes be kind of annoying, as I receive multiple alerts from different sources; but I guess it is the best way to stay up to date regarding recent developments. This is especially important in my field, as we feel a big amount of “publication pressure”; I work on a method which is considered as “quite fancy” at the moment, so I also use the alerts to make sure nobody has published “my” experiment yet.
Sometimes a facebook friend recommends a certain publication or a colleague points me to it. Most of the time, I read articles on my own, as I am the only person working on this specific topic at my institution. Additionally, we have a weekly journal club where everyone in turn presents work which is related to our focus of research, e.g. a certain part of the human brain. There is also a weekly seminar dedicated to presentations about ongoing projects.
Blogs (e.g. mindhacks.com, http://neuroskeptic.blogspot.com/) can be a source to get an overview about recent developments, but I have to admit I use them mainly for work-related entertainment.
All in all, it is easy to stay up to date using alerts from different platforms;  the annoying part of it is the flood of emails you receive and that you are quite often alerted to articles that don’t fit your interests (no matter how exact you try to specify your keywords).

Biomedical Research by Johanna Goldmann – MIT

In the biological sciences, in research at the bench, communication is one of the most fundamental tools a scientist can have. Communication with other scientists may open up the possibility of new collaborations, can lead to a completely new viewpoint on a known question and to the integration and expansion of methods, as well as allowing a scientist to have a good understanding of what is known, what is not known and what other people have – both successfully and unsuccessfully – tried to investigate.
Yet communication is something that is currently very much lacking in academic science – lacking to the extent that most scientists will agree it hinders the progress of research. Nonetheless, the lack of communication and the issues it brings with it is something that most scientists have accepted as a necessary evil – not knowing how to possibly change it.
Progress is only reported in peer-reviewed journals – many of which are greatly affected not only by what is currently “sexy” in research but also by politics, connections and the “publish or perish” pressure. Due to the amount of this pressure to publish in journals, and the amount of weight the list of publications has on any young scientist's chances of success, scientists also tend to be very reluctant to share any information pre-publication.
Furthermore, one of the major issues is that currently there really is no way of publishing or communicating either negative results or minor findings, which causes many questions or methods to be repeatedly investigated, as well as a loss of information.
Given how much social networks and the internet have changed communication as well as the access to information over the past years, there is a need for this change to affect research and communication in the life sciences and to transform the way we think, not only about solving and approaching the research questions we gather, but about the information and insights we gain as a whole.

Philosophy by Sascha Benjamin Fink – Uni Osnabrück

The most important source of information for philosophers is http://philpapers.org/. You can follow trends going on in your field of interest. Philpapers has a list of almost all papers together with their abstracts, keywords and categories as well as a link to the publisher. Additional information about similar papers is displayed.
Every category of papers is managed by an editor. For each category it is possible to subscribe to a newsletter; in this way, once per month I am informed about current publications in journals related to my topic of interest. Every user is able to create an account and manage his literature and the papers he is interested in.
Other research and information exchange methods among philosophers are mailing lists, reading clubs and blogs. Have a look at David Chalmers' blog list. Blogs are becoming more and more important, but unfortunately they usually cover general topics and discuss developments of the community (e.g. Leiter's blog, Chalmers' blog and Schwitzgebel's blog).
All together I still think that for me a centralized service like PhilPapers is the favourite tool because it aggregates most information. If I don't hear about something on PhilPapers, usually it is not that important. Among philosophers this platform – though incomplete – seems set to be the standard for the next couple of years.

Problems

As a scientist it is crucial to be informed about the current developments in one's research area. Abstracting from the reports above, we divide the tasks roughly into the following stages.

1. Finding and filtering new publications:

  • What is happening right now? What are the current hot topics in my area? What are current trends? (→ Check arXiv/Twitter)
  • Did a friend of mine write something? Did a “big shot” write something?
    (→ Check meta information: title, authors)
  • Are my colleagues excited about a new development? (→ Talk to them.)

2. Getting more information about a given paper:

  • What is actually done in a given paper? Is it relevant for me? Is it really new? Is it a breakthrough? (→ Read abstracts. Find a good readable summary/review.)
  • Judge the quality of a paper: Is it correct? Is it well written?
    ( → Where is it published, if at all? Skim through content.)

Finally there is a fundamental decision: shall I read the whole paper, or not? This leads us to the next task.

3. Understanding a paper: Understanding a paper in depth can be a very time-consuming and tedious process. The presentation is often very short and much knowledge is assumed of the reader. The notation choices can be bad, so that even the statements are hard to understand. In effect the paper is easily readable only for a very small circle of specialists in the area. If one is not in the lucky situation of belonging to that circle, one usually applies the following strategies:

  1. Lookup references. This forces you to process a whole tree of older papers which might be hard to read, and hard to get hold of. Sometimes it is worthwhile to consult a textbook to polish up fundamentals.
  2. Finding additional resources. Is there a review? Is there a related video lecture or slides explaining the material in more detail? Is the author going to a conference in the near future, or even giving a seminar in the area?
  3. Join forces. Find people thinking about the same paper: Has somebody at my department already read the paper, so that I can ask some questions? Is there enough interest to make a reading group, or more formally, run a seminar about that paper.
  4. Contact the author. This is a last resort. If you have struggled with understanding the paper for a very long time and really need/want to get it, you might eventually write an email to the author – who might respond, or not. Sometimes even errors are found – and not published! And indeed, there is no easy way to publish “errata” anywhere on the net.

In mathematics most papers do not get read through to the end. One uses strategies 1 & 2 until one gets stuck and moves on to something more exciting. The chances of survival are much better with strategy 3, where one is committed to putting a lot of effort into it over weeks.

4. Finding related work. Where to go from there? Is the paper superseded by a more recent development? Which are the relevant papers the author builds upon? What are the historic influences? What are the founding ideas of the subject? Finding related work is very time consuming. It is easy to overlook things given that the references are often vast, and sometimes hard to get hold of. Getting information about citations often requires access to commercial databases.

Basic idea:

All researchers around the world are faced with the same problems and come up with their individual solutions. There are great synergies in bringing these people together on an online platform! Most of the addressed problems are solved with a paper-centric service which allows you to…

  • …get to know other readers of the paper.
  • …exchange with the other readers: ask questions, write comments, reviews.
  • …share the gained insights with the community.
  • …ask questions about the paper.
  • …discuss the paper.
  • …review the paper.

We want to do that with a new mixture of a traditional Q&A system like StackExchange or MathOverflow with a paper database and social features. The key features of this system are as follows:

Openness: We follow a strict openness principle. The software will be developed in open source. All data generated on this site will be under a creative commons license (like Wikipedia) and will be made available to the community in form of database dumps or an API (open data).

We use two different types of content sites in our system: Papers and Discussions.

Paper sites. A paper site is dedicated to a single publication and has the following features:

  1. Paper meta information
    – show title, author, abstract, journal, tags
    – leave a comment
    – write a review (with wiki option)
    – vote up/down
  2. Paper resources
    – show pdfs, slides, notes, video lectures, etc.
    – add a resource
  3. Related Work
    – show the reference-tree and citations in an intelligent way.
  4. Discussions:
    – show related discussions
    – start a new discussion
  5. Social features
    – bookmark
    – share on G+, twitter

The point “Related Work” deserves some further explanation. The citation graph offers a great deal more information than just a list of references. Together with user-generated content like votes, the individual paper bookmarks and the social graph, one has a very interesting data set which can be harvested. We want to offer views on this point at least with respect to popularity, topics and read-by-friends. Later on one could add more sophisticated, even graphical views on this graph.
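As a toy illustration of how such signals could be combined, here is a small Python sketch; the weights, field names and data layout are all invented for the example and are not part of the actual design:

def related_work_score(paper, candidate, votes, bookmarks, friends):
    # combine citation links, community votes and friends' bookmarks into one score
    score = 0.0
    if candidate['id'] in paper['references'] or paper['id'] in candidate['references']:
        score += 1.0  # direct citation link
    score += 0.1 * votes.get(candidate['id'], 0)  # community popularity
    readers = bookmarks.get(candidate['id'], set())
    score += 0.5 * len(readers & friends)  # read by friends
    return score

paper = {'id': 'p1', 'references': {'p2'}}
candidate = {'id': 'p2', 'references': set()}
print(related_work_score(paper, candidate,
                         votes={'p2': 12},
                         bookmarks={'p2': {'alice', 'bob'}},
                         friends={'alice'}))  # 1.0 + 1.2 + 0.5 = 2.7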


Discussion sites.
A discussion looks more like a traditional Q&A question, with the difference that each discussion may have (many) related papers. A discussion site contains:

  1. Discussion meta information (title, author, body)
  2. Discussion content
  3. Related papers
  4. Voting
  5. Follow/Bookmark

Besides the content sites we want to provide the following features:

News Stream. This is the start page of our website. It will be generated from the network consisting of friends, papers and authors. There should be several modes like:

  • hot: heavily discussed papers/discussions
  • new papers: list new publications (filtered by tag, like arXiv feed)
  • social: What did your friends do lately
  • default: intelligent mix of recent activity that is relevant to the logged in user


Moreover, filter by tag should be always available.

Search bar:

  • Searches the contents of the site, but should also find papers in freely available databases (e.g. arXiv). Adding a paper should be a very seamless process from there.
  • Search result ranking uses vote and view information.
  • Personalized search information. (Physicists usually do not want sociology results.)
  • Auto completion on paper titles, author, discussions.
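As a small sketch of the last point, prefix completion over paper titles can be done with a sorted list and binary search (a production system would rather use a trie or a search engine's suggester; all names here are invented):

import bisect

class TitleCompleter:
    # prefix completion over a sorted list of titles

    def __init__(self, titles):
        self.titles = sorted(titles)

    def complete(self, prefix, limit=5):
        start = bisect.bisect_left(self.titles, prefix)
        matches = []
        for title in self.titles[start:]:
            if not title.startswith(prefix) or len(matches) == limit:
                break
            matches.append(title)
        return matches

completer = TitleCompleter(['Graph algebra', 'Graph partitioning', 'Graphity'])
print(completer.complete('Graph '))  # ['Graph algebra', 'Graph partitioning']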

Social: (hard to implement, maybe for second version!)

  • Easily refer to users by @-syntax familiar from Twitter/Google+
  • Maintain a friendship / trust graph
  • Friendship recommendations
  • Find friends from Google+ on the site

Benefits

Our proposed website improves on the above-mentioned problems in the following ways.
1. Finding and filtering new publications: This step can be improved with even very little community effort:

  • Tell other people that you are interested in the paper. Vote it up or leave a comment if you are very excited about it.
  • Point out a paper to a colleague.

2. Getting more information about a given paper:

  • Write a summary or review about a paper you have read or skimmed through. Maybe the introduction is hard to read or some results are not clearly stated.
  • Can you recommend reading this paper? Vote it up!
  • Ask a colleague for his opinion on the paper. Maybe he can write a summary?

Many reviews of new papers are already being written. E.g. MathSciNet and Zentralblatt maintain large databases of reviews which are provided by the community but are not freely available. Many authors would be much happier to write them for an open system!
3. Understanding a paper: Here are the major synergies which we want to address with our project.

  • Ask a question: Why is the author using this experimental method? How does Lemma 3.4 work? Why do I need this assumption? What is the intuition behind the “virtual truncation”? What implications does this work have?
  • Start a discussion: (might involve more than one paper.) What is the difference of these two papers? Is there a reference explaining this more clearly? What should I read in advance to understand the theory?
  • Add resources. Tell the community about related videos, notes, books etc. which are available on other sites.
  • Share your notes. If you have discussed a paper in a reading class or seminar. Collect your notes or opinions and make them available for the community.
  • Restate interesting statements. Tell the community when you have found a helpful result which is buried inside the paper. In that way Google may find it!

4. Finding related work. Having a well structured and easily navigable view on related papers simplifies the search a lot. The filtering benefits from the content generated by the users (votes) and individual information, like friends who have written/bookmarked a paper.

Similar Sites on the Web

There are several discussions in Q&A forums addressing precisely this problem.

We found three sites on the internet which follow a similar approach which we examined more carefully.
1. There is a social network which has most of our features implemented:

researchgate.net
“Connect with researchers, make your work visible, and stay current.”

The Economist has dedicated an article to them. It is essentially a Facebook clone, with special features for scientists.

  • Large, fast-growing community: 1.4m users, growing by 50,000 per month. Mainly biology and medicine.
    (As Daniel Mietchen points out, the size might be misleading due to institutional accounts)
  • Very professional Look and Feel. Company from Berlin, Germany, funded by VC. (48 People involved, 10 Jobs advertised)
  • Huge Feature set:
    • Profile site, Connect to friends
    • News Feed
    • Publication Database, Conference Finder, Jobmarket
    • Every Paper its own page: with
      • Voting up/down
      • Comments
      • Metadata (Title, Author, Abstract, Preview)
      • Social Media (Share, Bookmark, Follow author)
    • Organize Workgroups/Reading Classes.

Differences to our approach:

  • Closed Data / Closed Source
  • Very complex site which serves a lot of purposes
  • Only very basic features on paper site: vote/comment.
  • QA system is not linked well to paper database
  • No MathML
  • Mainly populated by undergraduates

2. Another website which comes reasonably close is:

http://www.sciweavers.org/

“an academic network that aggregates links to research paper preprints
then categorizes them into proceedings.”

  • Includes a large collection of online tools for various purposes
  • Has a big library of papers/software/datasets/conferences for computer science.
    Paper sites have:
    • Meta information and preview
    • Vote functionality and view statistics, tags
    • Comments
    • Related work
    • Bookmarking
    • Author information
  • User profiles (no friendships)


Differences to our approach:

  • Focus on computer science community
  • Comments and discussions are well hidden on paper sites
  • No News stream
  • Very spacious design

 
3. Another very similar site is:

journalfire.com – beta
“Share what you read – connect to colleagues – create journal clubs.”

It has the following features:

  • Comment on Papers. Activity feed (?). Follow articles.
  • Host Journal Clubs. Create Events related to papers.
  • Powerful search box fetching papers from Arxiv and Pubmed (slow)
  • Social features on site: User profiles, friend finder (no fb/g+ integration yet)
  • News feed – from subscribed papers and friends
  • Easy paper import via Bookmarklet
  • Good usability!! (but slow loading times)
  • Private reading clubs cost money!

They are very skilled: maintained by 3 PhD students/postdocs from Caltech and MIT.

Differences to our approach:

  • Closed Data, Closed Source
  • This site also (currently) misses out on ranking features
  • Very closed model – signup required
  • Weak crowdsourcing: cannot add meta information

The site is still at its very beginning with few users. The project started in 2010 and has not gained much momentum since.

The other sites can be roughly classified into the following categories:
1. Single people who are following a very similar idea:

  • annotatr.appspot.com. Combines a metadata base with the Disqus plugin. You can comment but not rate. Good usability. Nice CSS. Good search function. No MathML. No related-article suggestions. Maintained by two academics in their private time. Hosted on Google Apps. Closed source – closed data.
  • r-Forum – a resource where mathematicians can collect reviews and corrections of a resource (e.g. paper, talk, …). A simple Vanilla forum/wiki with almost no content, used by maybe 12 people in the US. No automated data import. No rating system.
  • http://math-arch.org/ – post comments to math papers. Very bad usability – you even get errors. Maintained by LogicSun, a group of Russian programmers. Closed source – closed data.

Analysis: although the principal idea of connecting people who read papers is there, the implementation is very bad in terms of usability and even basic programming. The voting features are also missing.

2. (Semi) Professional sites.

  • Public Library of Science – very professional, huge paper database, mainly for biology and medicine. Features full-text papers and lots of interesting meta information including references. Has comment features (not very visible) and a news stream on the start page.
    No QA features (+1, ask question). Only published articles are on the site.
  • Mendeley.com – huge bibliographic database with bookmarking and social features. You can organize reading groups in there, with comments and notes shared among the participants. Features a news stream with papers by friends. Nice import. Impressive full-text data and reference features.
    No QA features for papers. No comments for papers. Requires signup to do anything useful.
  • papercritic.com – open review database. Connected to the Mendeley bibliographic library. You can post reviews. No rating. No comments. Not open: Mendeley is commercial.
  • webofknowledge.com. Commercial academic citation index.
  • zotero.org – features a program that runs inside a browser. “easy-to-use tool to help you collect, organize, cite, and share your research sources”

Analysis: the goal of all these tools is to simplify reference management by providing metadata like references, citations, abstracts and author profiles. Commenting features on the paper sites are not there or not promoted.
3. Vaguely related sites which solve different problems:

  • citeulike.org – Social bookmarking for papers. Closed Source – Open Data.
  • http://www.scholarpedia.org. A peer reviewed open access encyclopedia.
  • Philica.com Online Journal which publishes articles from any field along with its reviews.
  • MathSciNet/Zentralblatt – Review database for math community. Closed Source – Commercial.
  • http://f1000research.com/ – Online Journal with a public, post publish review process. “Open Science – Open Data – Open Review”
  • http://altmetrics.org/manifesto/ as an emerging trend from the web-science trust community. Their goal is to revolutionize the review process and create better filters for scientific publications making use of link structures and public discussions. (Might be interesting for us).
  • http://meta.wikimedia.org/wiki/WikiScholar – one of several ideas under discussion at Wikimedia for a central repository of references (that are cited on Wikipedias and other Wikimedia projects)

Upshot of all this:

There is not a single site featuring good Q&A features for papers.

If you like our approach you can contact us, contribute to the source code or find some starting documentation!
So the plan is to fork an open source question-answering system and enrich it with features fulfilling the needs of scientists and with some social aspects which will eventually help to rank the related work of a paper.
Feel free to provide us with feedback and wishes and join our effort!

Nils Grunwald from Linkfluence talks at FOSDEM about Cascalog for graph processing
https://www.rene-pickhardt.de/nils-grunwald-from-linkfluence-talks-at-fosdem-about-cascalog-for-graph-processing/ – Sun, 05 Feb 2012
Nils Grunwald works at the French startup Linkfluence. Their product is more or less social network analysis and graph processing. They crawl the web and blogs or get other social network data, and provide solutions with statistics and insights for their customers.
In this scenario big data is obviously involved and the data naturally has the structure of a graph. He says a system to process the data has the following constraints:

  • The processing should not compromise the rest of the system
  • Low maintenance costs
  • Used for queries and rapid prototyping (so they want a “general” graph processing solution as customer needs change)
  • Flexible, hard to tell which field or metadata will be used beforehand.

He then introduces their solution, Cascalog, which is based on Hadoop and inspired by Cascading, a workflow management system, and Datalog, a subset of Prolog which, as a declarative, expressive language, offers a very concise way of writing queries and enables quick prototyping.
For me personally it is not a very interesting solution since it is not able to answer queries in real time, which of course is obvious if you consider the technologies it is based on. But I guess for people who have time and just do analysis, this solution will probably work pretty well!
What I really liked about the solution is that after processing the graph you can export the data to Gephi or to Neo4j to get fast query processing.
He then explained a lot of specific details about the syntax of Cascalog:

[Photo: Nils Grunwald from Linkfluence talks about Cascalog at FOSDEM]

My first PhD year summarized: What a great choice of mine!
https://www.rene-pickhardt.de/my-first-phd-year-summerized-what-a-great-choice-of-mine/ – Wed, 28 Dec 2011
2011 is almost over and more than 9 months of my PhD have already passed by. During my math diploma I was funded by the German national academic foundation. Besides some really nice benefits that came along with this, every 6 months I was forced to write reports about my study progress. Even though these reports were sometimes quite annoying, I realized that they are a good method to focus and work more efficiently. That is why I decided to continue writing these reports – this time just in English and for a wider audience.
So here is the layout for this longer article:

Things I have done in 2011

  • I started my PhD in Koblenz on a scholarship and I felt I almost had too much freedom. No one to report to. I have to admit that in the beginning it was hard to focus with that much freedom.
  • I attended the TET workshop where I learned some techniques and methods of design thinking. What a great subject and topic. I also met some students from WHU, which was also nice!
  • I was allowed to visit the Web Science summer school at DERI at NUI Galway, Ireland. That was really fantastic.
  • I attended the Future Music Camp where I learned a lot about the business and why I don't think I fit there. In particular I ran a session on band page SEO.
  • Since my university organized it I also attended the European Summer School on Information Retrieval. It was nice since I had my first poster presentation, which reminded me of my old days in school when I attended Jugend Forscht.
  • I quit my scholarship and moved to a three-year contract as a research assistant. That was nice from a money perspective, but in particular I wanted to have teaching responsibilities and the safety of being funded for longer than 2 years.
  • I had the idea for my first paper, Graphity, and I conducted an evaluation and created the paper in a team of 5 people.
  • I attended the Social Sensor kick-off meeting in Thessaloniki.
  • For the second time I taught a class at “Deutsche Schüler Akademie”. This time with students from 5 different countries and only 20% native German speakers.
  • I am supervising a Jugend Forscht project of 2 highly gifted and talented high school students, which also uses Neo4j and works on software to improve typing.
  • I learned more about the impact of social networks by creating the In Legend Facebook streaming app.
  • I am advising a bachelor thesis on graph databases and linked open data.
  • I became a Most Valued Blogger at DZone.
  • I had my first blog article, or thing on the internet, that went kind of viral.
  • I read Tim Berners-Lee's book Weaving the Web.

Things that I have learned

  • It took some time, but I got to know computer scientists and their culture and way of thinking.
  • By now I am finally less afraid of programming.
  • I am also less afraid of using, configuring and fixing Linux.
  • Even though there is room for improvement, I have a much more structured approach to getting things done (especially writing papers).
  • I realized the power of blogging. It is amazing how much feedback you receive if you share your thoughts. You also get to know better resources and get to know people! It is really amazing how much reach a blog can create and how much it can grow. I am really excited to see where this will be going! And I encourage anyone to start blogging!
  • I have gained more background on internet technology (protocols / technologies / general understanding).
  • Reading the law might help more than talking to a lawyer.
  • There is a lot of diplomacy involved in teamwork, and it is really good if someone (not necessarily oneself) is able to apply it.
  • Smart and creative ideas are very much appreciated in university.
  • Amazingly, motivating people is still one of my greatest assets.
  • I kind of understand how EU-funded research projects are applied for, how they work, and where a lot of the money in our institute comes from.
  • I am becoming more familiar with how the system within computer science works.
  • I experienced how easy it is (at least theoretically) to create a paper.
  • Lenovo ThinkPads are just amazing, and having a suitable business notebook really makes you take it everywhere and work on your stuff. 8 hours of battery are just perfect! (No, I am not sponsored by Lenovo, but honestly it is the first notebook I am literally taking everywhere, and it just works fine.)
  • Diversity is the key to everything. Diversity in teams and between human beings in general will almost every time lead to the most amazing things.
  • Unfortunately, I am mentally not as flexible, dynamic and fast-learning as I used to be when I was younger (thinking in used patterns seems to be very comfortable).
  • Mathematicians really have an amazing ability to understand complex abstract concepts in any context. They are able to generalize almost everything.
  • It is incredibly easy to become an authority or gain social proof while making statements on something. In particular, it is interesting how much more credible things are if someone else gives you trust (e.g. being cited or being invited to give a talk).
  • I increased my marketing knowledge and experience in how to create a brand.
  • If you want to be a successful entrepreneur or company, focus on great products and outstanding service! This is how you beat the market. Marketing is not about selling and promoting stuff; it is about having the best product / service and communicating this in a smart way.
  • I understood many different levels of the general information retrieval problem: what different levels of search exist and the concept of information need. In particular, I understood the many non-technical challenges of this problem and the problems of language and semantics.
  • I finally realized why social networks should not be monetized via advertising (almost no click-through rates on banner ads). I also understood why search is such a cash cow (at least revenue-wise): the information need (also for advertising) comes from the user, which naturally leads to high click-through rates.
  • I understood how big the Facebook bubble is that is being created. I almost hope they will enter the stock market soon. I will definitely bet some money and buy puts.
  • Speaking of this, I got introduced to the concept of an ego network and what its implications are for a social network and for running a social network. It is actually embarrassing that I never realized those concepts myself while running Metalcon. Furthermore, it is just amazing to me how much impact the concept of a person’s ego network has on his everyday life and on his mindset.
  • I did learn why people using Linux after a while only show a sad smile to Windows users. It is unbelievable how much of a pain in the ass Windows is.

Weaknesses I still have

  • Even though I have accomplished many things, I still have the feeling that I am procrastinating a lot (I guess due to bad time management).
  • I still show a tendency towards overcommitment, which means too many parallel projects.
  • Together with this comes my hard time focusing (especially on scientific output).
  • For some reason I am still not too keen on reading. I am not reading enough papers / blogs / magazines / mailing lists / news …
  • My written communication skills could improve a lot, especially spelling and structure.
  • Along with this, my communication in general has a long way to improve.
  • I am not doing enough physical exercise; I have gained weight and I am tired a lot.
  • I am not learning enough Chinese, not to say I am forgetting my Chinese.
  • I have to make things happen and state the obvious, and especially realize the moments in which I think outside the box and have creative ideas. In my research I was constantly talking about a social circle in order to reference the ego network of a user. Some months later Google Plus comes up with the circle concept. I was talking about this for ages without realizing that I should separate this from all the rest in order to create something big!

Goals for 2012

  • The main goal is to write a solid PhD proposal. I am still battling between two topics: 1) organizing social news feeds from your circle of friends; 2) distributing graph databases. The first problem is more application-oriented; the second one seems to be more technical. There are many reasons that speak for both topics. I guess I will move towards the second problem. In any case I will have to write the proposal and submit it to a suitable conference. I will also have to write a German version of it in order to apply for the PhD scholarship from the German National Academic Foundation.
  • I want to go back to China, in the best case for a 3-month research trip to Jiaotong University in Shanghai.
  • I am still very interested in doing an internship. Since my professor said I could either do an internship or go abroad, I will have to choose. But if I go for the internship I will have to look into Google (research), Yahoo Research, LinkedIn, Facebook, Last.fm, maybe even Simfy, or find another interesting company.
  • There is most certainly the need to write more papers, and I already have some very concrete and specific ideas (including solutions), so if anyone is interested in real joint work, contact me any time! The ideas are in the areas of information retrieval, differential geometry, logging in graph databases, SPARQL queries as graph traversals, and sentence prediction using n-grams and neo4j.
  • There has to be progress with Metalcon and with In Legend.
  • I want to run a seminar on one or two topics, which could be search engines or graph databases and their applications.
  • I want to spend more time contributing to Wikipedia. Especially, I want to include this in teaching at university.
  • I already started to improve my time management skills and want to improve further. I want to use todo lists more efficiently and also make more use of tools like a calendar system. Also, the balance between free time and work time has to improve.
  • I want to improve my Chinese language skills.
  • I want to teach my third course with Deutsche Schüler Akademie.
  • I want to do even more teamwork projects.
  • I want to create a reading class at university in order to do more efficient research.
  • And as some private goals: I want to make more music, model, do the tongue-twister video, and do physical exercise on a regular basis.

Some final thoughts

I really received a lot of help from my advisor and university. Going back to university was the best idea of my life so far. I totally know why I returned. I am enjoying the time at university from both perspectives: the freedom as well as the topic.
So what have you guys been doing in 2011, and what are your goals for 2012? In any case I wish you a happy new year!

Graphity: An efficient Graph Model for Retrieving the Top-k News Feeds for users in social networks https://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/ https://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/#comments Tue, 15 Nov 2011 16:03:33 +0000 http://www.rene-pickhardt.de/?p=901 UPDATE: The paper got accepted at SocialCom 2012, and the source code and data sets are online; in particular, the source code of the Graphity server software is now online!
UPDATE II: Download the paper (11 pages, from SocialCom 2012, with co-authors Thomas Gottron, Jonas Kunze, Ansgar Scherp and Steffen Staab) and the slides.
I already said that my first research results had been submitted to the SIGMOD conference, to the social networks and graph databases track. Time to sum up the results and blog about them. You can find a demo of the system here.
I created a data model that makes retrieval of social news feeds in social networks very efficient. It is able to dynamically retrieve more than 10,000 temporally ordered news feeds per second in social networks with millions of users, like Facebook and Twitter, by using graph databases (like neo4j).
In order to achieve this I had several points in mind:

  1. I wanted to use a graph database as the core technology to store the social network data. As anyone can guess, my choice was neo4j, which turned out to be a very good idea as the technology is robust and the guys in Sweden gave me great support.
  2. I wanted retrieval of a user’s news stream to depend only on the number of items that are to be displayed in the news stream. E.g., fetching a news feed should not depend on the number of nodes in the network or the number of friends a user has.
  3. I wanted to provide a technology that is as fast as relational databases or flat files (due to denormalization) but still does not have redundancy and can dynamically handle changes in the underlying network.

How Graphity works is explained in my first presentation and my poster, both of which I already talked about in an older blog post. But you can also watch this presentation to get an idea of it and learn about the evaluation results:

Presentation at FOSDEM

I gave a presentation at FOSDEM 2012 in the Graph Devroom, which was videotaped. Feel free to have a look at it. You can also find the slides from that talk at: http://www.rene-pickhardt.de/wp-content/uploads/2012/11/FOSDEMGraphity.pdf

Summary of results

In order to be among the first to receive the paper, the source code and the used data sets as soon as the paper is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.
With my data model Graphity, built on neo4j, I am able to retrieve more than 10,000 dynamically generated news streams per second from the database. Even in big databases with several million users, Graphity is able to handle more than 100 newly created content items (e.g. status updates) per second, which is still high if one considers that Twitter only had 600 tweets being created per second as of last year. This means that Graphity is almost able to handle the amount of data that Twitter needs to handle on a single machine! Graphity creates streams dynamically, so if the friendships in the network change the users still get accurate news feeds!

Evaluation:

Although we used some data from Metalcon to test Graphity, we realized that Metalcon is a rather small data set. To overcome this issue we used the German Wikipedia as a data set: we interpreted every Wikipedia article as a node in a social network, a link between articles as a follow relation, and revisions of articles as status updates. With this in mind we ran the following tests.

Characteristics of the used data sets

Characteristics of the data sets in millions. A is the number of users in the graph, A_{d>10} the number of users with node degree greater than 10, E the number of edges between users, and C the number of content items (e.g. status updates) created by the users.

As you can see the biggest data sets have 2 million users and 38 million status updates.

Degree distribution of our data sets

Nothing surprising here

STOU as a Baseline

Our baseline method STOU retrieves all the nodes from the ego network of a node and orders them by the time of their most recently created content item. Afterwards, feeds are generated as in Graphity by using a top-k n-way merge algorithm.
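To illustrate the mechanics, here is a minimal Python sketch of such a top-k n-way merge. This is a toy stand-in, not the Java/neo4j implementation from the paper: it assumes every followed node’s content items are already available as a list of (timestamp, item) pairs sorted newest-first.

import heapq
from itertools import islice

def top_k_merge(neighbour_streams, k=15):
    # Each stream is sorted newest-first; heapq.merge combines them lazily,
    # so we only pay for the k items we actually take from the front.
    merged = heapq.merge(*neighbour_streams, key=lambda t: t[0], reverse=True)
    return list(islice(merged, k))

# Toy example with three followed nodes:
streams = [
    [(50, "a2"), (10, "a1")],
    [(40, "b1")],
    [(60, "c2"), (30, "c1")],
]
print(top_k_merge(streams, k=3))  # [(60, 'c2'), (50, 'a2'), (40, 'b1')]

Because the merge is lazy, the work done per feed depends only on k and the node degree, not on the total size of the network, which is exactly the property stated in the design goals above.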

7.2 Retrieving News Feeds

For every snapshot of the Wikipedia data set and the Metalcon data set we retrieved the news feeds for every aggregating node in the data set. We measured the time in order to calculate the rate of retrieved news feeds per second. With bigger data sets we discovered a drop in retrieval rate for STOU as well as for GRAPHITY. A detailed analysis revealed that this was due to the fact that more than half of the aggregating nodes have a node degree of less than 10, which becomes visible when looking at the degree distribution. Retrieval of news feeds for those aggregating nodes showed that on average the news feeds were shorter than our desired length of k = 15. Because retrieval of smaller news feeds is significantly faster and a huge percentage of nodes have this small degree, we conducted an experiment in which we retrieved the feeds only for nodes with a degree higher than 10.

Retrieval of news feeds for the Wikipedia data set. GRAPHITY was able to retrieve 15

We see that for the smallest data set GRAPHITY retrieves news feeds as fast as STOU. Then the retrieval speed for GRAPHITY rises and, as expected, stays constant afterwards, in particular independent of the size of the graph database. The retrieval speed for STOU also stays constant, which we did not expect. Therefore we conducted another evaluation to gain a deeper understanding.

Independence of the Node Degree

After binning together articles with the same node degree and creating bins of the same size by randomly selecting articles, we retrieved the news feeds for each bin.

Rate of retrieved news streams for nodes with a fixed node degree. GRAPHITY clearly stays constant and is thus independent of the node degree.

7.2.3 Dependency on k for news feed retrieval

For our tests, we chose k = 15 for retrieving the news feeds. In this section, we argue for this choice of k and show the influence of the selected k on the performance of retrieving news feeds per second. On the Wikipedia 2009 snapshot, we retrieved the news feeds for all aggregating nodes with a node degree d > 10 and varied k.

Rate of retrieved streams in GRAPHITY with varying k. k is the number of news items that are supposed to be fetched. We see that GRAPHITY performs particularly well for small k.

There is a clear dependency of GRAPHITY’s retrieval rate on the selected k. For small k, STOU’s retrieval rate is almost constant and sorting of ego networks (which is independent of k) is the dominant factor. With bigger k, STOU’s speed drops as both merging, O(k log(k)), and sorting, O(d log(d)), need to be conducted. The dashed line shows the interpolation of the measured frequency of retrieving the news feeds given the function 1/(k log(k)), while the dotted line is the interpolation based on the function 1/k. As we can see, the dotted line is a better approximation to the actually measured values. This indicates that our theoretical estimate of k log(k) for the retrieval complexity is quite high compared to the empirically measured behavior, which is close to k.
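To condense the asymptotics of the discussion above into one line (my own summary of the statements made here, with node degree $d$ and feed length $k$): $T_{STOU}(d,k) = O(d \log d + k \log k)$ per feed, while $T_{GRAPHITY}(k) = O(k \log k)$ in theory, with the measurements suggesting that GRAPHITY behaves closer to $O(k)$ in practice.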

Index Maintenance and Updating

Here we investigate the runtime of STOU and GRAPHITY in maintaining changes of the network as follow edges are added and removed and content nodes are created. We evaluated this for the snapshots of Wikipedia from 2004 to 2008. For Metalcon this data was not available.
For every snapshot we simulated adding / removing follow edges as well as creating content nodes. We did this in the order in which these events occur in the Wikipedia history dump.

Handling new status updates

Number of status updates that GRAPHITY is able to handle, depending on the size of the data set.

We see that the number of updates the algorithms are able to handle drops as the data set grows; their ratio, however, stays almost constant between a factor of 10 and 20. While the retrieval rate of GRAPHITY for big data sets stays at 12k retrieved news feeds per second, the update rate on the biggest data set is only about 170 updated GRAPHITY indexes per second. For us this is OK, since we designed Graphity with the assumption in mind that retrieving news feeds happens much more frequently than creating new status updates.

Handling changes in the social network graph

Number of new friendship relations that GRAPHITY is able to handle per second, depending on the network size.

The ratio for adding follow edges is about the same as the one for adding new content nodes and updating GRAPHITY indexes. This makes perfect sense, since both operations are linear in the node degree, O(d). Overall, STOU was expected to outperform GRAPHITY in this case, since the complexity class of STOU for these tasks is O(1).

Number of broken friendships that GRAPHITY is able to handle per second.

As we can see from the figure, removing friendships has a ratio of about one, meaning that this task is as fast in GRAPHITY as in STOU.
This is also as expected, since the complexity class of this task is O(1) for both algorithms.

Build the index

We analyzed how long it takes to build the GRAPHITY and STOU indices for an entire network. Both indices have been computed on a graph with existing follow relations.
To compute the GRAPHITY and STOU indices, for every aggregating node $a$ all content nodes are inserted into the linked list C(a). Subsequently, only for the GRAPHITY index, for every aggregating node $a$ the ego network is sorted by time in descending order. For both indices, we measured the rate of processed aggregating nodes per second, as shown in the following graph.

Time to build the index with respect to network size

As one can see, the time needed for computing the indices increases over the snapshots. This can be explained by the two steps of creating the indices: For the first step, the time needed for inserting content nodes increases as the average number of content nodes per aggregating node grows over time. For the second step, the time for sorting increases as the ego networks grow and the sorting part becomes more time consuming. Overall, we can say that for the largest Wikipedia data set from 2011, a rate of indexing 433 nodes per second with GRAPHITY is still possible. Creating the GRAPHITY index for the entire Wikipedia 2011 data set can thus be done in 77 minutes (about 2 million aggregating nodes at 433 nodes per second).
Computing only the STOU index takes 42 minutes.
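To make the two construction steps concrete, here is a minimal Python sketch. It is a toy stand-in using plain dictionaries instead of the neo4j graph from the paper; the names follows and content are hypothetical placeholders for the graph data.

def build_indices(follows, content):
    # Step 1 (STOU and GRAPHITY): for every aggregating node a, store its
    # own content items (timestamp, item) newest-first -- the list C(a).
    C = {a: sorted(items, key=lambda t: t[0], reverse=True)
         for a, items in content.items()}

    # Step 2 (GRAPHITY only): sort every ego network by the timestamp of
    # the most recent content item of each followed node, descending.
    def newest(node):
        return C[node][0][0] if C.get(node) else float("-inf")

    ego = {a: sorted(neighbours, key=newest, reverse=True)
           for a, neighbours in follows.items()}
    return C, ego

The two dictionary comprehensions correspond directly to the two steps described above; in the sketch, as in the measurements, the second step is the one whose cost grows with the size of the ego networks.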

Data sets:

The used data sets are available in another blog article.
In order to be among the first to receive the paper as soon as it is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.

Source code:

The source code of the evaluation framework can be found in my blog post about the Graphity source. The source code of the Graphity server is also online.
In order to be among the first to receive the paper as soon as it is accepted, sign up for my newsletter, follow me on Twitter or subscribe to my RSS feed.

Future work & Application:

The plan was actually to use these results in Metalcon. But I am currently thinking about implementing my solution for Diaspora. What do you think about that?

Thanks to

First of all, many thanks go to my co-authors Steffen Staab, Jonas Kunze, Thomas Gottron and Ansgar Scherp. But I also want to thank Mattias Persson and Peter Neubauer from neotechnology.com and the community on the neo4j mailing list for helpful advice on their technology and for providing a neo4j fork that was able to store that many different relationship types.
Thanks to Knut Schumach for coming up with the name GRAPHITY and to Matthias Thimm for helpful discussions.
