Some thoughts on Google MapReduce and Google Pregel after our discussions in the Reading Club

The first meeting of our reading club was quite a success. Everyone was well prepared, we discussed several issues around Google’s MapReduce framework, and I had the feeling that everyone now understands better what is going on there. Below I post a summary of what was discussed, followed by the feedback I received and the reading list for next week at the end of this post. Most importantly: the reading club will meet next week on Wednesday, February 22nd, at 2 pm CET.

Summary

The first takeaway, which was already well known, is that there is a stack of Google papers with corresponding Apache implementations:

  1. Google File System (GFS) vs. Apache Hadoop Distributed File System (HDFS)
  2. Google Bigtable vs. Apache HBase
  3. Google MapReduce vs. Apache Hadoop MapReduce
  4. Google Pregel vs. Apache Giraph

The latter three are all based on either GFS or HDFS. We therefore agreed that a detailed understanding of GFS (Google File System) is mandatory to fully understand the MapReduce implementation. We don’t want to discuss GFS together yet, but everyone should be well aware of it, and there will be room for further questions about it at next week’s reading club.
We discussed MapReduce’s advantage over Pregel in handling stragglers. Since MapReduce is a one-step system, stragglers are easy to deal with: just reassign the task to a different machine as soon as it takes too long. This handles stragglers that occur due to faulty machines very well. The superstep model in Pregel has, to our knowledge, no clear solution for this kind of straggler (coming up with a strategy to handle them would be a very nice research topic!). On the other hand, Pregel has another kind of straggler, which comes from super nodes, i.e. vertices with very many edges. Some papers address this problem; one of them is on the reading list for next week.
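
To make the reassignment idea concrete, here is a minimal sketch in Python. It is a toy simulation, not how Hadoop or Google’s MapReduce actually schedule tasks: the `Worker` class, the `speed` numbers and the `timeout` rule are all assumptions made up for illustration. It only shows why a one-step job, which finishes when its slowest task finishes, benefits from starting a backup copy of a slow task on a spare machine.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    speed: float  # work units per second; a faulty machine is modelled as very slow


def completion_time(work, primary, backup=None, timeout=None):
    """Seconds until the result of one task is available.

    Without a backup we simply wait for the primary worker. With a backup,
    a second copy of the task is started on `backup` once `timeout` seconds
    have passed, and whichever copy finishes first wins.
    """
    primary_done = work / primary.speed
    if backup is None or timeout is None or primary_done <= timeout:
        return primary_done
    backup_done = timeout + work / backup.speed
    return min(primary_done, backup_done)


def job_time(tasks, backup=None, timeout=None):
    """A one-step job is finished when its slowest task is finished."""
    return max(completion_time(work, worker, backup, timeout)
               for work, worker in tasks)


if __name__ == "__main__":
    healthy = Worker("healthy", speed=10.0)
    faulty = Worker("faulty", speed=0.5)   # the straggler
    spare = Worker("spare", speed=10.0)
    tasks = [(100, healthy), (100, faulty)]

    print(job_time(tasks))                            # no reassignment
    print(job_time(tasks, backup=spare, timeout=20))  # with a backup copy
```

In this made-up setting the job without reassignment is held up by the slow machine for the full duration of its task, while with the backup it only pays the timeout plus one normal task run. In a superstep model the same trick is less obvious, because a worker carries state from one superstep to the next.
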
We discussed whether partitioning the data in a smart way would make the processing more efficient. We agreed that for MapReduce and Pregel, where you just want to process the graph in the cloud, this is not the most important thing. But for a real-time graph database the partitioning of the data will most certainly be a crucial point. Here again we saw the strong connection to the Google File System, since in the current approaches the file system does a lot of the partitioning.
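
As a small illustration of why partitioning matters, the following sketch counts how many edges cross machine boundaries (the edge cut) under two placements of the same graph. The example graph, the two-machine setup and the function name `edge_cut` are assumptions chosen for illustration; a real system would also have to weigh load balance and other factors.

```python
def edge_cut(edges, assignment):
    """Number of edges whose two endpoints are stored on different machines."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# Two triangles (0, 1, 2) and (3, 4, 5), joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Naive placement: vertex id modulo the number of machines.
modulo = {v: v % 2 for v in range(6)}

# Placement that keeps each triangle on one machine.
clustered = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

if __name__ == "__main__":
    print(edge_cut(edges, modulo))     # edges that need network traffic
    print(edge_cut(edges, clustered))
```

Every cut edge means that following it requires talking to another machine. For a batch job that cost is paid once per pass; for a real-time graph database it shows up in the latency of every traversal.
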
Achim pointed out that Microsoft also has some proprietary products in this area. It would be nice if someone could provide more detailed resources. He also wished that we would focus on the problems first and only then talk about distribution; his suggestion was to approach this top-down.
We also discussed whether frameworks that use MapReduce to process large graphs have been compared with Pregel or Apache Giraph so far. Such an evaluation would also be a very interesting research topic. For that reason, and to better understand what happens when large graphs are processed with MapReduce, we included the last two papers in the reading list.
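
To get a feeling for what processing a graph with MapReduce can look like, here is a rough sketch of breadth-first search in which every round is one map/reduce job. It runs in a single Python process; `map_fn`, `reduce_fn`, `run_job` and `bfs` are hypothetical names, and this is just one common way to phrase the algorithm, not necessarily the approach taken in the papers on the reading list.

```python
from collections import defaultdict

INF = float("inf")


def map_fn(node, record):
    """Emit the node's own data plus a distance offer for each neighbour."""
    dist, neighbours = record
    yield node, ("graph", neighbours)  # pass the structure on to the next round
    yield node, ("dist", dist)
    if dist < INF:
        for n in neighbours:
            yield n, ("dist", dist + 1)


def reduce_fn(node, values):
    """Keep the adjacency list and the smallest distance seen for this node."""
    neighbours, best = [], INF
    for kind, payload in values:
        if kind == "graph":
            neighbours = payload
        else:
            best = min(best, payload)
    return node, (best, neighbours)


def run_job(state):
    """One map/reduce job: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for node, record in state.items():
        for key, value in map_fn(node, record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


def bfs(adjacency, source):
    """Repeat jobs until nothing changes; return the distances and the job count."""
    state = {n: (0 if n == source else INF, nbrs) for n, nbrs in adjacency.items()}
    jobs = 0
    while True:
        new_state = run_job(state)
        jobs += 1
        if new_state == state:
            return {n: d for n, (d, _) in state.items()}, jobs
        state = new_state


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(bfs(graph, "a"))
```

Note how the adjacency lists have to be re-emitted in every round so that the next job still knows the graph, and how an extra job is needed just to detect that nothing changed. Overheads of this kind are part of what a comparison with Pregel or Giraph would have to measure.
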

Feedback from you guys

After the club was over I asked everyone for suggestions and got some useful feedback:

  • We should prepare more than one paper.
  • Google Hangout in combination with many people in one room is a little hard (introduce everyone at the beginning, or everyone brings a notebook, or a group of people should sit in front of one camera).
  • We need more focus on the paper we are currently discussing. Comprehension problems should be collected 1 or 2 days before we meet and be integrated into the agenda.
  • We need some checkpoints for every paper. Everyone should state: what do I like, what do I not like, what could be further research, what do I want to discuss, what do I not understand.
  • We need a reading pool to which everyone can commit papers.

New Rules

In order to incorporate your feedback I thought of some rules for next week’s meeting. I am not sure if they are the best rules, and if they don’t work we can easily change them back.

  • There is a list of papers to be discussed (see below).
  • At the end of the club we fix 3-6 papers from the paper pool that are to be prepared for next week.
  • Before the club meets, everyone should commit some more papers to the pool that they would like to read the week after (you can do this in the comments here or via email).
  • If several people are in the same room, they should sit together in front of one camera.
  • Short introduction of who is there at the beginning.
  • Use the checkpoints to discuss papers.
  • No discussions of brand new solutions and ideas. Write them down, send a mail, discuss them at a different place. The reading club is for collectively understanding the papers that we are reading.

Last but not least: the focus is on creating ideas and research about distributed real-time graph database solutions. That is why we first want to understand the graph processing stuff.

Reading tasks for next week

For a better understanding of the basics (should not be discussed)

To understand Pregel and another approach that does not have this rigid superstep model. The last paper introduces some methods to fight stragglers that come from the graph topology.

And finally two more papers that discuss how MapReduce can be used to process large graphs without a Pregel-like framework.

More feedback is welcome

If you have suggestions for the rules or other remarks that we haven’t thought of, or if you just want to read other papers, feel free to comment here. That way everyone who is interested can contribute to the discussion.

15 Comments

    1. My list of further papers that I suggest reading:
      memcached paper: for distributed memory
      Beehive: to see a p2p approach for graph distribution
      PEGASUS [37]: Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication: to understand the best synthetic graphs
      Signal Collect [12]: Probabilistic Graphical Models: Principles and Techniques: to give justification why asynchronous message passing is better.
      GFS [9]: A case for redundant arrays of inexpensive disks (RAID)
      GFS [5]: Scale and performance in a distributed file system (AFS)
      HipG [1][2]: A parallel algorithm for multilevel graph partitioning and sparse matrix ordering and polylog approximation of the minimum bisection SIAM.
      HipG [3]: Challenges in parallel graph processing.
      HipG [26]: Compressed and distributed file formats for labeled transition systems
      something on network protocols and
      something on query languages

  1. For Pregel, did you mean to cite:
    http://kowshik.github.com/JPregel/pregel_paper.pdf
    ? The “Google Pregel” cite above is to the blog post with no link to the paper. (There is also the ACM version but it is pay-per-view.)

    1. Thanks Patrick,
      that is the link. I only found the ACM version and did not want to link it since it was pay-per-view…
      Sorry you could not make the hangout yesterday. Let’s try next week!

  2. […] Some thoughts on Google Mapeduce and Google Pregel after our discussions in the Reading Club by René Pickhardt. […]

  3. […] Update: find a summary of last meeting and the current reading list for next week’s meeting here. […]

  4. […] I became aware of Signal/Collect because of René Pickhardt’s graph reading club assignment for 22 February 2012. […]

  5. […] of the reading club assignments was to read the paper about Google Pregel and Signal Collect, compare them and point out pros and […]
