Some thoughts on Google MapReduce and Google Pregel after our discussions in the Reading Club
https://www.rene-pickhardt.de/some-thoughts-on-google-mapeduce-and-google-pregel-after-our-discussions-in-the-reading-club/
Wed, 15 Feb 2012

The first meeting of our reading club was quite a success. Everyone was well prepared, we discussed some issues about Google’s MapReduce framework, and I had the feeling that everyone now understands better what is going on there. Below is a summary of what was discussed; feedback and the reading list for next week follow at the end of this post. Most importantly: the reading club will meet next Wednesday, February 22nd, at 2 pm CET.

Summary

The first takeaway, which was already well known, is that there is a stack of Google papers and corresponding Apache implementations:

  1. Google File System (GFS) vs Apache Hadoop Distributed File System (HDFS)
  2. Google Bigtable vs Apache HBase
  3. Google MapReduce vs Apache Hadoop MapReduce
  4. Google Pregel vs Apache Giraph

The latter ones are all based either on GFS or on HDFS. We therefore agreed that a detailed understanding of GFS (Google File System) is mandatory to fully understand the MapReduce implementation. We don’t want to discuss GFS together yet, but everyone should at least be well aware of it, and we will leave room for further questions about it at next week’s reading club.
We discussed MapReduce’s advantage over Pregel in handling stragglers. Since MapReduce is a one-step system, stragglers are easy to deal with: just reassign the job to a different machine as soon as it takes too long. This handles stragglers that occur due to faulty machines very well. The superstep model in Pregel has, to our knowledge, no clear solution to this kind of straggler (coming up with a strategy to handle them would be a very nice research topic!). On the other hand, Pregel has another kind of straggler that comes from super nodes. There are some papers that address those problems; one of them is the paper that will be read for next week.
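To make the reassignment idea above a bit more concrete, here is a minimal sketch in Python. It is a simulation in made-up time units, not how Hadoop or Google’s MapReduce actually schedule backup tasks; the function names, the `timeout` parameter and all durations are assumptions chosen for illustration. The second function shows why a barrier after every superstep makes one slow worker hold back everyone else.

```python
def run_with_backups(durations, timeout):
    """Simulate one task with backup copies (simulated time, no real waiting).

    durations[i] is the hypothetical running time of the task on the i-th
    machine. A new copy is started every `timeout` time units for as long
    as no earlier copy has finished. Returns (index of the winning machine,
    finish time of the task).
    """
    best_finish = float("inf")
    winner = None
    for attempt, duration in enumerate(durations):
        start = attempt * timeout
        if best_finish <= start:
            break  # an earlier copy is already done, no further backup needed
        finish = start + duration
        if finish < best_finish:
            best_finish, winner = finish, attempt
    return winner, best_finish


def superstep_duration(worker_times):
    """With a barrier after every superstep, the slowest worker sets the pace."""
    return max(worker_times)


# Machine 0 is faulty (100 time units); the backup on machine 1 needs only 3.
winner, finish = run_with_backups([100.0, 3.0], timeout=5.0)
print(winner, finish)  # prints: 1 8.0

# Without reassignment, one slow worker delays the whole superstep.
print(superstep_duration([3.0, 3.0, 100.0]))  # prints: 100.0
```

In the first example the task is done after 8 time units instead of 100, because the backup copy wins. In the superstep example nobody can continue before 100 time units have passed.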
We discussed whether partitioning the data in a smart way would make processing more efficient. We agreed that for MapReduce and Pregel, where you just want to process the graph on a cloud, this is not the most important thing. But for a real-time graph database, the partitioning of the data will most certainly be a crucial point. Here again we saw the strong connection to the Google File System, since it does a lot of the partitioning in the current approaches.
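As a small illustration of why the partitioning matters, the following Python sketch counts how many edges cross partition boundaries for two different assignments of vertices to machines. The toy graph, the function names and the two strategies are assumptions made for this example; they are not taken from any of the papers.

```python
def edge_cut(edges, assign):
    """Count the edges whose endpoints end up on different partitions."""
    return sum(1 for u, v in edges if assign(u) != assign(v))


def hash_partition(vertex, num_parts=2):
    """Simple default: spread integer vertex ids by id modulo partition count."""
    return vertex % num_parts


def community_partition(vertex):
    """Hypothetical 'smart' assignment that knows the two triangles below."""
    return 0 if vertex < 3 else 1


# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the single edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

print(edge_cut(edges, hash_partition))       # prints: 5
print(edge_cut(edges, community_partition))  # prints: 1
```

Every cut edge means communication between machines whenever a query or a message follows that edge. For a batch job this costs some extra time, while for a real-time graph database it sits on the critical path of each request.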
Achim pointed out that Microsoft also has some proprietary products in this area. It would be nice if someone could provide more detailed resources. He also wished that we would focus on the problems first and only then talk about distribution. His suggestion was to approach this top down.
We also discussed whether frameworks that use MapReduce to process large graphs have been compared with Pregel or Apache Giraph so far. Such an evaluation would also be a very interesting research topic. For that reason, and to better understand what happens when large graphs are processed with MapReduce, we included the last two papers in the reading list.
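To get a feeling for what processing a graph with plain MapReduce looks like, here is a minimal in-memory sketch in Python. It computes connected components by repeatedly propagating the smallest vertex id, with every round written as one map phase and one reduce phase. This is only an illustration under our own assumptions (function names, toy graph, everything in one process) and not the method of any particular paper or framework.

```python
from collections import defaultdict


def map_phase(adjacency, labels):
    """Each vertex emits its current label to itself and to all neighbours."""
    for vertex, neighbours in adjacency.items():
        yield vertex, labels[vertex]
        for neighbour in neighbours:
            yield neighbour, labels[vertex]


def reduce_phase(pairs):
    """Group the emitted labels by vertex and keep the smallest one."""
    grouped = defaultdict(list)
    for vertex, label in pairs:
        grouped[vertex].append(label)
    return {vertex: min(values) for vertex, values in grouped.items()}


def connected_components(adjacency):
    """Run one map/reduce round after another until no label changes."""
    labels = {vertex: vertex for vertex in adjacency}
    rounds = 0
    while True:
        new_labels = reduce_phase(map_phase(adjacency, labels))
        rounds += 1
        if new_labels == labels:
            return labels, rounds
        labels = new_labels


# Undirected toy graph: a path 1-2-3 and a separate edge 4-5.
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, rounds = connected_components(adjacency)
print(labels)  # prints: {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
print(rounds)  # prints: 3
```

Note that every round corresponds to a complete MapReduce job over the whole graph, including one final round that only detects that nothing changed. In a Pregel-like system the same propagation would be a sequence of supersteps in which the graph stays on the workers.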

Feedback from you guys

After the club was over I asked everyone for suggestions and got some useful feedback:

  • We should prepare more than one paper.
  • Google Hangout in combination with many people in one room is a little hard (introduce everyone at the beginning, or everyone brings a notebook, or groups of people sit in front of one camera).
  • We need more focus on the paper we are currently discussing. Comprehension problems should be collected 1 or 2 days before we meet and be integrated into the agenda.
  • We need some checkpoints for every paper. Everyone should state: what do I like, what do I not like, what could be further research, what do I want to discuss, what do I not understand.
  • We need a reading pool to which everyone can commit papers.

New Rules

In order to incorporate your feedback I thought of some rules for next week’s meeting. I am not sure if they are the best rules, and if they don’t work we can easily change them again.

  • There is a list of papers to be discussed (see below).
  • At the end of the club we fix 3-6 papers from the paper pool that are to be prepared for next week.
  • Before the club meets, everyone should commit some more papers to the pool that they would like to read the week after (you can do this in the comments here or via email).
  • If several people are in the same room, they should sit together in front of one camera.
  • Short introduction of who is there at the beginning.
  • Use the checkpoints to discuss papers.
  • No discussions of brand new solutions and ideas. Write them down, send a mail, discuss them at a different place. The reading club is for collectively understanding the papers that we are reading.

Last but not least: the focus is on developing ideas and research on distributed real-time graph database solutions. That is why we first want to understand the graph processing stuff.

Reading tasks for next week

For a better understanding of the basics (these should not be discussed)

To understand Pregel and another approach that does not have this rigid superstep model. The last paper introduces some methods to fight stragglers that come from the graph topology.

And finally two more papers that discuss how MapReduce can be used to process large graphs without a Pregel-like framework.

More feedback is welcome

If you have suggestions for the rules or other remarks that we haven’t thought of, or if you just want to read other papers, feel free to comment here. This way everyone who is interested can contribute to the discussion.
