Reading Club – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany (https://www.rene-pickhardt.de)

Reading Club Management of Big Data
https://www.rene-pickhardt.de/reading-club-management-of-big-data/ (Thu, 05 Sep 2013)

Even though the reading club on distributed graph databases stopped, I never really lost interest in the management of big data and graph data. Due to new research grants and new colleagues in our group I decided to create a new reading club. (The first meeting will be on Thursday, September 12th, at 15:30 Central European Time.) The reading club won't be on a weekly basis but rather something like once a month. Tell me if you want to join via hangout or something similar! But I would like to be clear: if you didn't carefully prepare the reading assignments by bringing questions and points for discussion to the meeting, then don't join the meeting. I don't consider skimming a paper careful preparation.
The road map for the reading club on big data is quite clear: we will reread some papers that we have read before, but we will also dig deeper and check out some existing technologies. So the reading will not only consist of scientific work (though this will form the basis) but also of hands-on, practical sessions based on blogs, tutorials, documentation and handbooks.
Here is the preliminary structure and road map for the reading club on big data, which of course may easily vary over time!

Along these lines we want to understand:

  • Why do these technologies scale?
  • How do they handle concurrent traffic (especially write requests)?
  • How can performance be increased, and is there another way of building such highly scalable systems?
  • What kinds of applications (like Titan or Mahout) are built on top of these systems?
At some point I would also love to do some side reading on distributed algorithms and on parallel algorithm and data structure design.

As stated above, the reading club will be much more hands-on than before, so I expect us to also deliver tutorials like the one on getting Nutch running on top of HBase and Solr.
Even though we want to get hands-on with current technologies, the goal is to understand the principles behind them and find ways of improving them rather than just applying them to various problems.
I am considering starting a wiki page on Wikiversity to create something like a course on big data management, but I would only do this if I find a couple of people who would actively help to contribute to such a course. So please contact me if you are interested!
So to sum up, the reading assignments for the first meeting are the Google File System paper and the MapReduce paper.
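For anyone who has not seen MapReduce before, here is a minimal, purely local sketch of the programming model the paper describes: user-defined map and reduce functions, with the shuffle phase simulated by an in-memory dictionary. The function names and the word-count example are only my own illustration, not anything taken from the papers.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # User-defined map function: emit (word, 1) for every word in the document.
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # User-defined reduce function: sum up all counts emitted for one key.
    return word, sum(counts)

def run_mapreduce(documents):
    intermediate = defaultdict(list)
    # Map phase: run map_fn over every input record.
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)   # the "shuffle": group values by key
    # Reduce phase: run reduce_fn once per key.
    return dict(reduce_fn(key, values) for key, values in intermediate.items())

docs = {"d1": "the graph of the web", "d2": "the web is a graph"}
print(run_mapreduce(docs))   # {'the': 3, 'graph': 2, 'of': 1, 'web': 2, 'is': 1, 'a': 1}
```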

Reading Club on distributed graph db returns with a new format on April 4th 2012
https://www.rene-pickhardt.de/reading-club-on-distributed-graph-db-returns-with-a-new-format-on-april-4th-2012/ (Mon, 02 Apr 2012)

The reading club has been quite inactive due to traveling and also due to a suboptimal process for choosing the literature. That is why a new format for the reading club has been discussed and agreed upon.
The new format means that we have four new rules:

  1. We will only discuss up to 3 papers in 90 minutes. So roughly speaking we have 30 minutes per paper, but this does not have to be strict.
  2. The decided papers should be read by everyone before the reading club takes place.
  3. For every paper there is one responsible person (moderator) who has read the entire paper before suggesting it as a common reading.
  4. Open questions about the (potential) reading assignments and ideas for further reading can and should be discussed on http://related-work.rene-pickhardt.de/ (use the same template as I used for the reading assignments in this blog post), e.g.:

Moderator:
Paper download:
Why to read it:
Topics to discuss / open questions:

For the next meeting on April 4th at 2 pm CET (in two days) the literature will be:

While preparing these papers we might come across some other interesting literature.
If you want to suggest some literature you should have read that piece of work yourself before the reading club meeting takes place and know why you want everybody to prepare and discuss the same paper (rule 3). Additionally you should open a topic on the paper on http://related-work.rene-pickhardt.de/ using the above template before the reading club takes place (rule 4).
I hope this helps the entire project, and I am looking forward to the next meeting!

Related work of the Reading Club on distributed graph databases (Beehive, Scalable SPARQL Querying of Large RDF Graphs, memcached)
https://www.rene-pickhardt.de/related-work-of-the-reading-club-on-distributed-graph-data-bases-beehive-scalable-sparql-querying-of-large-rdf-graphs-memcached/ (Wed, 07 Mar 2012)

Today we finally had our reading club and discussed several papers from last week's assignments.
Before I give my usual summary I want to introduce our new infrastructure for the reading club. Go to: 
http://related-work.rene-pickhardt.de/
There you can find a question and answer system which we will use to discuss questions and answers about the papers. Due to the included voting system we think this is much more convenient than a closed, unstructured mailing list. I hope this is of help to the entire community and I can only invite anyone to read and discuss with us on http://related-work.rene-pickhardt.de/

Reading list for next meeting Wed March 14th 2 pm CET

We first discussed the memcached paper:

One of the first topics we discussed was how the dynamic hashing is done. We also wondered how DHTs take care of overloading in general; the memcached paper does not discuss this very well. Schegi knows a good paper that explains the dynamics behind DHTs and will provide the link soon.
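For context, a common answer to the key-placement part of that question is consistent hashing, which many memcached clients use: adding or removing a node only remaps the keys adjacent to its positions on the ring. This is a small sketch of the idea, not necessarily the exact scheme any particular paper describes; node names and the number of virtual replicas are arbitrary assumptions.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing ring with virtual nodes."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas          # virtual nodes per physical node
        self.ring = []                    # sorted list of (hash, node) tuples
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Only the keys adjacent to the new virtual nodes move, instead of
        # (almost) everything moving as with a plain modulo scheme.
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{node}:{i}"), node))

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key):
        # Walk clockwise to the first virtual node at or after hash(key).
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ""))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("user:42"))   # the same key always maps to the same node
```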
Afterwards we discussed what would happen if a distributed key-value store like memcached were used to implement a graph store. Obviously creating a graph store on the key-value model is possible. Additionally memcached is very fast in its lookups. One could add another persistence layer to memcached that would enable disk writes.
We think the main counterarguments are:

  • In this setting the graph is distributed to the worker nodes randomly.
  • No performance gain through graph partitioning is possible.

We realized that we should really read more about how to distribute and partition graphs.
When using memcached you can store much more than a plain adjacency list in the value of one key, thereby reducing the number of separate lookups needed.
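To make that point concrete, here is a tiny sketch of packing a node's properties and adjacency list into a single value of a key-value store. An ordinary Python dict stands in for the memcached client, and the key layout node:<id> is just an assumed convention for the example.

```python
import json

kv_store = {}   # a plain dict stands in for a memcached client (get/set interface)

def store_node(node_id, properties, neighbours):
    # Pack the node's properties and its full adjacency list into one value,
    # so a single lookup returns everything needed to expand the node.
    kv_store[f"node:{node_id}"] = json.dumps({"props": properties, "adj": neighbours})

def load_node(node_id):
    return json.loads(kv_store[f"node:{node_id}"])

store_node(1, {"label": "band", "name": "Metalcon"}, [2, 3, 5])
record = load_node(1)
print(record["adj"])   # [2, 3, 5] -- properties and neighbours in one round trip
```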
Again I pointed out that separating the data model from the data storage could help substantially. I will soon write an entire blog article about this idea in the setting of relational/graph models and relational database management systems.
Personally I am still convinced that memcached could be used to improve asynchronous message passing in distributed systems like Signal/Collect.

Scalable SPARQL Querying of Large RDF Graphs:

We agreed that one of the core principles in this paper is that they remove supernodes (everything connected via rdf:type) in order to obtain a much sparser graph and then do the partitioning (which speeds up the computation a lot); afterwards they add the supernodes redundantly to all workers where they could be needed. This methodology could generalize pretty well to arbitrary graphs: you just look at the node degree, remove the x% of nodes with the highest degree from the graph, run a clustering algorithm, and then add the supernodes redundantly to the workers.
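As a rough illustration of that generalization (not the exact procedure from the paper), the sketch below removes the top x% highest-degree nodes, partitions the rest, and replicates the removed supernodes to every worker. The round-robin partitioning is only a placeholder for a real graph-clustering algorithm, and the default threshold is an arbitrary assumption.

```python
import itertools

def partition_with_supernode_replication(adjacency, num_workers, top_fraction=0.01):
    """Remove the highest-degree nodes, partition the rest, then replicate
    the removed supernodes to every worker."""
    # 1. Treat the top_fraction highest-degree nodes as supernodes.
    by_degree = sorted(adjacency, key=lambda n: len(adjacency[n]), reverse=True)
    cutoff = max(1, int(len(by_degree) * top_fraction))
    supernodes = set(by_degree[:cutoff])

    # 2. Partition the remaining, much sparser graph.  A naive round-robin
    #    assignment stands in here for a real graph-clustering algorithm.
    rest = [n for n in adjacency if n not in supernodes]
    partitions = [set() for _ in range(num_workers)]
    for node, worker in zip(rest, itertools.cycle(range(num_workers))):
        partitions[worker].add(node)

    # 3. Add every supernode redundantly to every worker.
    for part in partitions:
        part.update(supernodes)
    return partitions

graph = {1: [2, 3, 4, 5], 2: [1], 3: [1], 4: [1], 5: [1], 6: [7], 7: [6]}
print(partition_with_supernode_replication(graph, num_workers=2, top_fraction=0.15))
# e.g. [{1, 2, 4, 6}, {1, 3, 5, 7}] -- node 1 is replicated to both workers
```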
Thomas pointed out that a drawback of this paper is that it does not use a distributed clustering algorithm, even though it otherwise relies on a framework like MapReduce.

Beehive:

We all agreed that the Beehive paper solves its problem with a really great methodology by first looking into the query distribution and then using proactive caching strategies. The interesting point is that they create an analytical model which they can solve in closed form. The P2P protocols are enhanced with gossip to distribute the parameters of the protocol. In this way an adaptive system is created which adjusts its caching strategy once the queries change.
We thought that the Beehive approach could be generalized to various settings. In particular it might be possible to analyze not only Zipf distributions but also other query distributions and derive various analytical models which could even coexist in such a system.
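To illustrate the basic intuition (this is only the idea, not the closed-form analytical model the paper actually derives), a sketch of popularity-driven proactive replication could look like this; the replication fraction is an arbitrary assumption.

```python
from collections import Counter

def choose_replicated_objects(query_log, replicate_fraction=0.1):
    """Proactively replicate the most popular fraction of objects to all
    nodes.  With a Zipf-like query distribution a small fraction of the
    objects covers a large share of all lookups."""
    popularity = Counter(query_log)
    ranked = [obj for obj, _ in popularity.most_common()]
    cutoff = max(1, int(len(ranked) * replicate_fraction))
    return set(ranked[:cutoff])

queries = ["a", "a", "a", "b", "b", "c", "d", "a", "b", "e"]
print(choose_replicated_objects(queries))   # {'a'}
```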
You can find our questions and thoughts and join our discussion about Beehive online!

Challenges in Parallel Graph Processing:

Unfortunately we did not really have the time to discuss this (in my opinion great) paper. I created a discussion in our new question board, so feel free to discuss this paper at: http://related-work.rene-pickhardt.de/questions/13/challenges-in-parallel-graph-processing-what-do-you-think

From Graph (batch) processing towards a distributed graph database
https://www.rene-pickhardt.de/from-graph-batch-processing-towards-a-distributed-graph-data-base/ (Thu, 23 Feb 2012)

Yesterday's meeting of the reading club was quite nice. We all agreed that the papers were of good quality and that we gained some nice insights. The only drawback was that the papers did not directly tell us how to achieve our goal of a real-time distributed graph database technology. In the readings for the next meeting (which will take place Wednesday, March 7th, 2 pm CET) we tried to choose papers that do not discuss these distributed graph/data processing techniques but focus more on speed or point out the general challenges in parallel graph processing.

Reading list for next meeting (Wednesday March 7th 2 pm CET)

Again, while reading and preparing, feel free to add more reading wishes to the comments of this blog post or drop me a mail!

Summary of yesterday's meeting

As written in the introduction, we agreed that the papers were interesting but not heading in our direction. Claudio pointed out that everyone should consider the following set of questions.

  • Do we want the graph to be mutable and writable, or is it supposed to be read-only?
    • Writing makes sense; if it is read-only we are talking about batch processing.
    • Writing is hard: you have to care about locking and consistency.
  • Do we want to answer queries (Cypher/Gremlin/whatever)?
  • Do we want to provide an API for processing?
  • How big is the data set we want to support?
    • Many people work in memory.
    • If you go to disk you open up a whole new set of topics.
    • One approach would be to solve the problem in memory first.

I am very confident that it was a good idea to start with graph processing and that we are now taking the right steps towards real distributed graph database systems. I think there are some more questions and high-level assumptions that one has to fix, which I will post on this blog in a few days. Sorry, I am in a hurry today and for the rest of the week.

Infrastructure

Schegi just suggested creating a mailing list for the reading club or switching to Google Groups. He pointed out that a private blog is kind of a weird medium to be so central. What is your opinion on that? Do we need some other / more formal infrastructure?

Google Pregel vs Signal Collect for distributed Graph Processing – pros and cons
https://www.rene-pickhardt.de/google-pregel-vs-signal-collect-for-distributed-graph-processing-pros-and-cons/ (Sun, 19 Feb 2012)

One of the reading club assignments was to read the papers about Google Pregel and Signal/Collect, compare them, and point out pros and cons of both approaches.
So after reading both papers as well as Claudio's overview of Pregel clones and taking some notes, here are my thoughts, but first a short summary of both papers.

Summary of Google Pregel

The methodology is heavily based on the Bulk Synchronous Parallel model (BSP) and also has some similarities to MapReduce (which has just one superstep). The main idea is to spread the data over several machines and introduce supersteps. In each superstep every vertex of the graph evaluates a certain function that is given by the programmer.
This enables one to process large graphs which are distributed over several machines. The paper describes how checkpoints are used to increase fault tolerance and how the Google File System is used to partition the graph data across the workers. The authors mention that smarter hashing functions could help to distribute the vertices not randomly but rather according to how they are connected in the graph, which could potentially increase performance.
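To make the programming model concrete, here is a minimal single-machine simulation of the superstep idea: every vertex runs a small compute step, and messages are only delivered at the superstep barrier. The example propagates the minimum vertex id (a toy connected-components computation); it is a sketch of the model only, not of Pregel's distributed implementation, and real Pregel terminates via vote-to-halt rather than by simply running out of messages.

```python
def pregel_min_id(adjacency):
    """Toy BSP run: every vertex repeatedly adopts the smallest id it has
    heard of and forwards it along its edges."""
    value = {v: v for v in adjacency}
    # Superstep 0: every vertex announces its own id to its neighbours.
    inbox = {v: [] for v in adjacency}
    for v in adjacency:
        for neighbour in adjacency[v]:
            inbox[neighbour].append(value[v])

    # Later supersteps: a vertex only stays active (and keeps sending)
    # while its value still decreases; an empty outbox ends the run.
    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():
            if messages and min(messages) < value[v]:
                value[v] = min(messages)
                for neighbour in adjacency[v]:
                    outbox[neighbour].append(value[v])
        inbox = outbox          # synchronous barrier between supersteps
    return value

graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(pregel_min_id(graph))     # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```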
Overall the goal of Google Pregel seems to be to enable one to process large graph data and gain knowledge from it. The focus does not seem to be on using the computing power of the distributed system as efficiently as possible. Instead, it seems to be on creating a system that makes the distribution of data that will not fit into one machine possible at a decent speed and without much effort for the programmer, by introducing methods for increasing fault tolerance.

Summary of Signal Collect

Signal/Collect as a system is pretty similar to Google Pregel. The main difference is that the authors introduce threshold scores which are used to decide whether a node should collect its signals or whether it should send signals. Using these scores the processing of algorithms can be accelerated: in every superstep, signal and collect operations are only performed if a certain threshold is exceeded.
From here the authors argue that one can get rid of the superstep model and make the entire computation asynchronous. This is done by introducing randomization over the set of vertices on which signal and collect operations are performed (as long as the threshold scores are exceeded).
The entire system is implemented on a single machine, but the vertices of the compute graph are processed by different workers (in this setting, threads). All threads share the main memory of the system, which makes explicit message passing for signal and collect computations unnecessary. The authors show that in the asynchronous setting the runtime of the algorithm decreases roughly inversely proportionally with the number of workers. They also give evidence that different scheduling strategies seem to fit the needs of different graphs or algorithms.
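As an illustration of the threshold idea, here is a small synchronous sketch in the style of Signal/Collect computing single-source shortest paths. The score functions are simplified assumptions and not the exact ones from the paper.

```python
import math

def signal_collect_sssp(edges, source, signal_threshold=0.0):
    """Tiny synchronous Signal/Collect-style loop for single-source shortest
    paths.  edges maps every vertex to a list of (target, weight) pairs.
    A vertex only signals when its signal score (how much its state improved
    since it last signalled) exceeds the threshold."""
    state = {v: (0.0 if v == source else math.inf) for v in edges}
    last_signalled = {v: math.inf for v in edges}
    inbox = {v: [] for v in edges}

    changed = True
    while changed:
        changed = False
        # Signal phase: score how much the state improved since the last signal.
        for v, out_edges in edges.items():
            score = last_signalled[v] - state[v]   # inf - inf is nan, and nan > t
            if score > signal_threshold:           # is False, so unreached vertices stay quiet
                for target, weight in out_edges:
                    inbox[target].append(state[v] + weight)
                last_signalled[v] = state[v]
                changed = True
        # Collect phase: vertices with pending signals shrink their state.
        for v in edges:
            if inbox[v]:
                state[v] = min([state[v]] + inbox[v])
                inbox[v] = []
    return state

graph = {"a": [("b", 1.0)], "b": [("c", 2.0)], "c": []}
print(signal_collect_sssp(graph, "a"))   # {'a': 0.0, 'b': 1.0, 'c': 3.0}
```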

Discussion of Pros and Cons

  • From the text above it seems obvious that Signal/Collect with its asynchronous programming model is superior. But, unlike the authors, I have so far not mentioned the drawbacks of one small but important detail. The fact that all the workers share common state which they can access via random access to the main memory of the machine is what allows their model to be so fast while being asynchronous. It is not clear how to maintain this speed in a truly distributed system. So in this way Signal/Collect only gives a proof of concept that an abstract programming model for graph processing exists and that it enables fast distribution in theory.
  • Pregel actually is a real framework that can achieve distribution of large data across clusters of several thousand machines, which for sure is a huge pro.
  • Signal/Collect claims to be more general than Pregel, since Pregel only supports one vertex type and stores edges implicitly, whereas Signal/Collect is able to store RDF graphs. My personal understanding is that Signal/Collect can only send signals from one vertex to another if an edge exists, and that it is not able to add or remove edges or vertices. In this sense I still think that Pregel is the more general system, but I guess one can still argue with my point of view.
  • Pregel's big drawback in my opinion is that the system is not optimized for speed. As already discussed in the last meeting of the reading club, MapReduce (with its single-superstep nature) is able to start backup tasks towards the end of the computation in order to fight stragglers. Pregel has to wait for those stragglers in every superstep in order to make the synchronous barriers possible.
  • Another point that is unique to Pregel is the deep integration with the Google File System (by the way, I am almost through the Google File System paper, and even if you already know the idea it is absolutely worthwhile reading it and understanding the arguments for its design decisions). So far I am not sure whether this integration is a strong or a weak point, since I can't see all the implications. However, it strengthens my argument that for a distributed system things like network protocols and file systems should be considered, since they seem to have a strong impact on the entire system.
  • Both systems, in my opinion, fail to treat partitioning of the graph and a different network protocol as important tasks. Especially for Pregel I do not understand this, since it already has so much network traffic. Partitioning the graph might increase start-up traffic on the one hand, but could reduce overall traffic in the long run.

Outlook and personal thoughts:

I am considering inviting the authors of both papers to next week's reading club. It would be even more interesting to discuss these and other questions directly with the people who built this stuff.
Also I like Schegi's idea to see what happens if one actually runs several Neo4j servers on different machines and just uses a model similar to Signal/Collect or Pregel to perform some computations. In this way a programming model would be given and research on the core distribution framework (relying on good technologies for the workers) could be done.
For the development of the first version of metalcon we used memcached. I have read in many places that memcached scales horizontally over several machines very well. I wonder how an integration of memcached into Signal/Collect would work in order to make the asynchronous computation possible in a distributed fashion. Since random access memory is a bottleneck in any application, I suggest putting the original memcached paper on our reading list.
One last point to mention is that both systems still don't seem to be suitable as a technology for building a distributed graph database which enables online query processing.

Some thoughts on Google MapReduce and Google Pregel after our discussions in the Reading Club
https://www.rene-pickhardt.de/some-thoughts-on-google-mapeduce-and-google-pregel-after-our-discussions-in-the-reading-club/ (Wed, 15 Feb 2012)

The first meeting of our reading club was quite a success. Everyone was well prepared, we discussed some issues about Google's MapReduce framework, and I had the feeling that everyone now understands much better what is going on there. I will now post a summary of what was discussed, along with some feedback and the reading for next week at the end of this post. Most importantly: the reading club will meet next week on Wednesday, February 22nd, at 2 pm CET.

Summary

The first takeaway, which was already well known, is that there is a certain stack of Google papers and corresponding Apache implementations:

  1. Google File System vs the Apache Hadoop Distributed File System (HDFS)
  2. Google Bigtable vs Apache HBase
  3. Google MapReduce vs Apache Hadoop MapReduce
  4. Google Pregel vs Apache Giraph

The latter ones are all based either on GFS or HDFS. Therefore we agreed that a detailed understanding of GFS (the Google File System) is mandatory to fully understand the MapReduce implementation. We don't want to discuss GFS together yet, but we think everyone should be well aware of it, and we will give room for further questions about it at next week's reading club.
We discussed MapReduce's advantage over Pregel's approach in handling stragglers. In MapReduce, since it is a one-step system, it is easy to deal with stragglers: just reassign the job to a different machine as soon as it takes too long. This handles stragglers that occur due to faulty machines very well. The superstep model in Pregel has, to our knowledge, no clear solution for this kind of straggler (coming up with a strategy to handle them would be a very nice research topic!). On the other hand Pregel has another kind of straggler that comes from supernodes. There are some papers that fix those problems; one of them is the paper that will be read for next week.
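The backup-task idea itself is simple enough to sketch: run a second copy of a suspected straggler and take whichever copy finishes first. This is only the concept, not Google's or Hadoop's actual scheduler; the task function and its timings are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
import random
import time

def flaky_task(task_id):
    # Simulate a map task that is sometimes a straggler.
    time.sleep(random.choice([0.1, 2.0]))
    return f"result of task {task_id}"

def run_with_backup(task_id):
    """Launch a backup copy of the task and return whichever finishes first.
    (The executor here still waits for the slower copy on shutdown; a real
    scheduler would simply discard or kill it.)"""
    with ThreadPoolExecutor(max_workers=2) as pool:
        copies = [pool.submit(flaky_task, task_id) for _ in range(2)]
        done, _ = wait(copies, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()

print(run_with_backup(7))
```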
We discussed that partitioning the data in a smart way would make the process more efficient. We agreed that for MapReduce and Pregel, where you just want to process the graph on a cloud, this is not the most important thing. But for a real-time graph database the partitioning of the data will most certainly be a crucial point. Here again we saw the strong connection to the Google File System, since GFS does a lot of the partitioning in the current approaches.
Achim pointed out that Microsoft also has some proprietary products; it would be nice if someone could provide more detailed resources. He also wished that we would focus on the problems first and then talk about distribution, i.e. work top-down.
We also discussed whether frameworks that use MapReduce to process large graphs have been compared with Pregel or Apache Giraph so far. Such an evaluation would also be a very interesting research topic. For that reason, and to better understand what happens when large graphs are processed with MapReduce, we included the last two papers for reading.

Feedback from you guys

After the club was over I asked everyone for suggestions and I got some useful feedback:

  • We should prepare more than one paper.
  • Google Hangout in combination with many people in one room is a little hard (introduce everyone at the beginning, or everyone brings a notebook, or groups of people should sit in front of one camera).
  • We need more focus on the paper we are currently discussing. Comprehension problems should be collected one or two days before we meet and be integrated into the agenda.
  • We need some checkpoints for every paper. Everyone should state: what do I like, what do I not like, what could be further research, what do I want to discuss, what do I not understand.
  • We need a reading pool to which everyone can contribute.

New Rules

In order to incorporate the feedback from you guys I thought of some rules for next week's meeting. I am not sure if they are the best rules, and if they don't work we can easily change them back.

  • There is a list of papers to be discussed (see below).
  • At the end of the club we fix 3-6 papers from the paper pool that are to be prepared for next week.
  • Before the club meets everyone should add some more papers to the pool that they would like to read the week after (you can do this in the comments here or via email).
  • If several people are in the same room they should sit together in front of one camera.
  • A short introduction of who is there at the beginning.
  • Use the checkpoints to discuss the papers.
  • No discussions of brand new solutions and ideas: write them down, send a mail, discuss them in a different place. The reading club is for collectively understanding the papers that we are reading.

Last but not least: the focus is on creating ideas and research about distributed real-time graph database solutions. That is why we first want to understand the graph processing stuff.

Reading tasks for next week

For a better understanding of the basics (these should not be discussed):

To understand Pregel and another approach that does not have this rigid superstep model. The last paper introduces some methods to fight stragglers that come from the graph topology.

And finally two more papers that discuss how MapReduce can be used to process large graphs without a Pregel-like framework.

More feedback is welcome

If you have suggestions about the rules or other remarks that we haven't thought of, or if you just want to read other papers, feel free to comment here. This way everyone who is interested can contribute to the discussion.

Reading club on Graph databases and distributed systems
https://www.rene-pickhardt.de/reading-club-on-graph-databases-and-distributed-systems/ (Wed, 08 Feb 2012)

Update: find a summary of the last meeting and the current reading list for next week's meeting here.
Teaching is over for this term, so for the next couple of weeks I want to spend a lot of time working on some research topics that are on my mind. My goal is to finally write down my PhD proposal and have a well-organized written structure for the rest of my PhD time.

The main topics for 2012 in the scientific part of my life can be summarized by these bullet points:

  • Graph databases
  • Distributed systems
  • Distributed computing
  • Distribution of graph databases
  • Dynamic hash tables
  • Peer-to-peer networks
  • Graph database query languages (since these seem to have a deep impact on the technologies that support everything)
  • Real-time graph processing

So the reading club will read and most importantly understand and discuss papers that belong to those categories.

I will start with the following selection of papers:

Time and Place

The reading club will take place in D116, the “Kreuzverweisraum”, every Wednesday at 2 pm CET.
For next week I expect anyone who wants to join to have read the MapReduce paper by Wednesday.
I will keep everyone up to date with the results from the reading club and the announcements of next week's readings.

How to join on the web!

I shared a Google Plus circle with all people who are interested:
https://plus.google.com/115250982031867883098/posts/AhSZgvbKYs8
You can contact me to be included in the circle. The circle will be invited to a hangout every Wednesday at around 2 pm CET (Central European Time).
If anyone knows a better technology for the telco, feel free to tell me.
