hadoop – Data Science, Data Analytics and Machine Learning Consulting in Koblenz, Germany – https://www.rene-pickhardt.de

Aurelius Titan graph enables realtime querying with 2400 concurrent users on a distributed graph database!
Wed, 21 Aug 2013

Sorry to start with a conclusion first: to me, Titan graph seems to be the egg-laying wool-milk-sow (the fabled all-in-one animal) that people dream of when working with graph data, especially if one needs graph data in a web context and in real time. I will certainly try to free some time to check this out and get hands-on. This thing is really new and revolutionary: it is not just another Hadoop or Giraph approach to batch processing of big data, it is distributed and works in real time! I am almost certain that, if the benchmarks hold what they promise, Titan will be one of the fastest-growing technologies we have seen so far.
 

I met Matthias Bröcheler (CTO of Aurelius, the company behind Titan graph) 4 years ago in a teaching situation at the German national high school academy. It was the time when I was still more mathematician than computer scientist, but my journey towards becoming a computer scientist had just started. Matthias was in the middle of his PhD program, and I valued his insights and experiences a lot. It was through him that my eyes were opened for the first time to what big data means and how companies like Facebook, Google and so on knit their business models around collecting data. Anyway, Matthias influenced me in quite some ways and I have a lot of respect for him.
I did not start my PhD right away, and we lost contact. I knew he was interested in graphs, but that was about it. Only when I started to use Neo4j more and more did I realize that Matthias was also one of the authors of the TinkerPop Blueprints, the interfaces for talking to graphs that most graph database vendors support. At that time I looked him up again and realized he was working on Titan, a distributed graph database. I found this promising-looking slide deck:

Slide 106:

Slide 107:

But at that time there wasn't much evidence for me that Titan would really keep the promise given in slides 106 and 107. In fact those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases. (By the way: reading the PhD proposal now, I am kind of amused, since I did not really aim for the important and big points the way Titan did.)
During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button, and especially to be able to integrate this with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (an implementation of Google's Bigtable) and Cassandra (an implementation of Amazon's Dynamo), which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines, I stumbled upon Titan again and watched Matthias' talk at the Cassandra Summit 2013. Around minute 21/22 the talk gets really interesting; I can also suggest skipping the first 15 minutes:

Let me sum up the amazing parts of the talk:

  • 2400 concurrent users against a graph cluster!
  • real time!
  • 16 different (non-trivial) queries
  • more than 10k requests answered per second!
  • a graph with more than a billion nodes!
  • graph partitioning is pluggable
  • a graph schema helps indexing for queries
So far I was not sure what kind of queries were really involved, and especially whether there were also write transactions; unfortunately no one in the audience asked that question. So I started googling and found this blog post by Aurelius. It gives a complete overview of the queries, and the results are presented in much more detail. Unfortunately I was not able to find the source code of that very benchmark (which Matthias promised in his talk to open up). On average, most queries take less than half a second.
 
Even though the source code is not available, this talk together with the Aurelius blog post looks to me like the hottest and most interesting piece of technology I have come across during my PhD program. Aurelius started to think distributed right away and made some clever design decisions:
  • scaling data size
  • scaling data access in terms of concurrent users (especially write operations) is fundamentally integrated, and seems also to be successfully integrated
  • making partitioning pluggable
  • requiring a schema for the graph (to enable efficient indexing)
  • being able to extend the schema at runtime
  • building on top of either Cassandra (for real-time availability) or HBase (for consistency)
  • being compatible with the TinkerPop tech stack
  • bringing an entire framework for analytics and graph processing
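The point about a schema enabling efficient indexing can be illustrated with a toy sketch (plain Python, not Titan's actual API): properties the schema declares as indexed get a value-to-vertex lookup table, everything else falls back to a full scan.

```python
# Illustrative sketch of schema-declared indexes (not Titan's API).
# Vertices are plain dicts; properties marked indexed in the "schema"
# get a reverse lookup table maintained on insert.

class TinyGraph:
    def __init__(self, indexed_properties):
        self.vertices = {}                          # vertex id -> property dict
        self.indexes = {p: {} for p in indexed_properties}

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        for prop, index in self.indexes.items():    # keep indexes up to date
            if prop in props:
                index.setdefault(props[prop], set()).add(vid)

    def find(self, prop, value):
        if prop in self.indexes:                    # schema declared it indexed
            return self.indexes[prop].get(value, set())
        # otherwise: full scan over every vertex
        return {vid for vid, p in self.vertices.items() if p.get(prop) == value}

g = TinyGraph(indexed_properties=["name"])
g.add_vertex(1, name="alice", age=30)
g.add_vertex(2, name="bob", age=30)
print(g.find("name", "alice"))   # index hit
print(g.find("age", 30))         # full scan
```

In Titan the same idea shows up as indexes declared in the graph schema and backed by the storage layer or an external index; the sketch only shows why declaring the schema up front pays off at query time.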

Further resources:

Some thoughts on Google MapReduce and Google Pregel after our discussions in the Reading Club
Wed, 15 Feb 2012

The first meeting of our reading club was quite a success. Everyone was well prepared, and we discussed some issues about Google's MapReduce framework; I had the feeling that everyone now understands better what is going on there. I will now post a summary of what has been discussed, and at the end of this post some feedback and the reading for next week. Most importantly: the reading club will meet again next week, Wednesday, February 22nd, at 2 pm CET.

Summary

First take away which was well known is that there is a certain stack of Google papers and corresponding Apache implementations:

  1. Google File System (GFS) vs. Apache Hadoop Distributed File System (HDFS)
  2. Google Bigtable vs. Apache HBase
  3. Google MapReduce vs. Apache Hadoop MapReduce
  4. Google Pregel vs. Apache Giraph

The latter ones are all based on either GFS or HDFS. Therefore we agreed that a detailed understanding of GFS (Google File System) is mandatory to fully understand the MapReduce implementation. We don't want to discuss GFS together yet, but we think everyone should be well aware of it, and there will be room for further questions about it at next week's reading club.
We discussed MapReduce's advantage over Pregel in handling stragglers. Since MapReduce is a one-step system, it is easy to deal with stragglers: just reassign the job to a different machine as soon as it takes too long. This perfectly handles stragglers that occur due to faulty machines. The superstep model in Pregel has, to our knowledge, no clear solution for this kind of straggler (coming up with a strategy to handle them would be a very nice research topic!). On the other hand, Pregel suffers from another kind of straggler caused by super nodes. There are some papers that fix those problems; one of them is the paper that will be read for next week.
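The one-step straggler strategy (reassign a task once it runs too long) can be sketched with a tiny simulation; the durations and budget below are made-up numbers, not measurements:

```python
# Toy simulation of MapReduce-style speculative execution: if a task's
# primary run exceeds a time budget, a backup copy is launched on a
# healthy machine at that moment, and whichever copy finishes first wins.

def completion_times(durations, budget, backup_duration):
    times = []
    for d in durations:
        if d <= budget:
            times.append(d)                          # primary finished in time
        else:
            # straggler detected at `budget`: backup starts then
            times.append(min(d, budget + backup_duration))
    return times

durations = [10, 12, 11, 300]                        # one task on a faulty machine
with_spec = completion_times(durations, budget=30, backup_duration=12)
print(max(with_spec))                                # job completes at 42, not 300
```

For Pregel this trick is harder: a superstep cannot complete until every worker has finished, and a super-node straggler is slow on any machine, so simply re-executing the task elsewhere does not help.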
We had the discussion that partitioning the data in a smart way would make the processing more efficient. We agreed that for MapReduce and Pregel, where you just want to process the graph in the cloud, this is not the most important thing. But for a real-time graph database, the partitioning of the data will almost certainly be a crucial point. Here again we saw the strong connection to the Google File System, since GFS does a lot of the partitioning in the current approaches.
Achim pointed out that Microsoft also has some proprietary products in this area; it would be nice if someone could provide more detailed resources. He also wished that we would focus on the problems first and only then talk about distribution; his suggestion was to approach this top-down.
We also discussed whether frameworks that use MapReduce to process large graphs have been compared with Pregel or Apache Giraph so far. Such an evaluation would also be a very interesting research topic. For that reason, and to better understand what happens when large graphs are processed with MapReduce, we included the last two papers in the reading list.

Feedback from you guys

After the club was over I asked everyone for suggestions and got some useful feedback:

  • We should prepare more than one paper.
  • A Google Hangout in combination with many people in one room is a little hard (introduce everyone in the beginning, or everyone brings a notebook, or groups of people should sit in front of one camera).
  • We need more focus on the paper we are currently discussing. Comprehension problems should be collected 1 or 2 days before we meet and be integrated into the agenda.
  • We need some checkpoints for every paper. Everyone should state: what do I like, what do I not like, what could be further research, what do I want to discuss, what do I not understand?
  • We need a reading pool to which everyone can commit papers.

New Rules

In order to incorporate your feedback I thought of some rules for next week's meeting. I am not sure whether they are the best rules, and if they don't work we can easily change them back.

  • There is a list of papers to be discussed (see below).
  • At the end of the club we fix 3-6 papers from the paper pool that are to be prepared for next week.
  • Before the club meets, everyone should commit some more papers to the pool that they would like to read the week after (you can do this in the comments here or via email).
  • If several people are in the same room, they should sit together in front of one camera.
  • Short introduction of who is there at the beginning.
  • Use the checkpoints to discuss papers.
  • No discussions of brand-new solutions and ideas: write them down, send a mail, discuss them at a different place. The reading club is for collectively understanding the papers that we are reading.

Last but not least: the focus is on creating ideas and research about distributed real-time graph database solutions. That is why we first want to understand the graph processing material.

Reading tasks for next week

For a better understanding of the basics (should not be discussed):

To understand Pregel, and another approach that does not have this rigid superstep model. The last paper introduces some methods to fight stragglers that come from the graph topology.

And finally two more papers that discuss how MapReduce can be used to process large graphs without a Pregel-like framework.

More feedback is welcome

If you have suggestions about the rules or other remarks that we haven't thought of, or if you just want to read other papers, feel free to comment here. This way everyone who is interested can contribute to the discussion.

Nils Grunwald from Linkfluence talks at FOSDEM about Cascalog for graph processing
Sun, 05 Feb 2012

Nils Grunwald works at the French startup Linkfluence. Their product is more or less social network analysis and graph processing: they crawl the web and blogs, or get other social network data, and provide their customers with solutions including statistics and insights.
In this scenario big data is obviously involved, and the data carries the natural structure of a graph. He says a system to process the data has the following constraints:

  • The processing should not compromise the rest of the system
  • Low maintenance costs
  • Used for queries and rapid prototyping (so they want a "general" graph processing solution, as customer needs change)
  • Flexible: it is hard to tell beforehand which fields or metadata will be used

He then introduces their solution, Cascalog, which is based on Hadoop and inspired by Cascading, a workflow management system, and by Datalog, a subset of Prolog which, as a declarative, expressive language, offers a very concise way of writing queries and enables quick prototyping.
For me personally it is not a very interesting solution, since it cannot answer queries in real time, which of course is obvious if you consider the technologies it is based on. But I guess for people who have time and just do analysis, this solution will probably work pretty well!
What I really liked about the solution is that after processing the graph you can export the data to Gephi or to Neo4j to get fast query processing.
He then explained a lot of specific details about the syntax of Cascalog:
 

Nils Grunwald from Linkfluence talks about Cascalog at FOSDEM

Claudio Martella talks @ FOSDEM about Apache Giraph: Distributed Graph Processing in the Cloud
Sun, 05 Feb 2012

Claudio Martella introduces Apache Giraph, which according to him is a loose implementation of Google Pregel, which was introduced at SIGMOD in 2010. He points out that MapReduce is poorly suited to graph processing.
He then gave an example of how MapReduce can nevertheless be used to do a PageRank calculation. He points out that PageRank can be calculated in a distributed way as a local property of the graph, by computing each vertex's local PageRank from the knowledge of its neighbours. He did this to show what, in his opinion, the drawbacks of this method are:

  • the job bootstrap takes some time
  • the disk is hit about 6 times per iteration
  • the data is sorted
  • the whole graph is passed through
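To make the criticized pattern concrete, here is a minimal sketch (my own illustration, not from the talk) of one PageRank iteration written map/reduce-style: the "map" phase emits rank shares along outgoing links, the "reduce" phase sums them per page, and every iteration is a full job that rereads the whole graph, which is exactly where the drawbacks above come from.

```python
from collections import defaultdict

def pagerank_iteration(links, ranks, damping=0.85):
    # "map" phase: every page emits rank/out_degree along each outgoing link
    contributions = defaultdict(float)
    for page, outs in links.items():
        for dest in outs:
            contributions[dest] += ranks[page] / len(outs)
    # "reduce" phase: sum the contributions per page into the new rank
    n = len(links)
    return {page: (1 - damping) / n + damping * contributions[page]
            for page in links}

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # tiny made-up web graph
ranks = {page: 1 / len(links) for page in links}
for _ in range(50):        # in MapReduce each pass would be a separate job
    ranks = pagerank_iteration(links, ranks)
print(ranks)
```

In Hadoop terms, each loop iteration is a complete job: bootstrap, read the graph from disk, shuffle and sort, write results back to disk.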

As in the Pregel paper, he says that other graph algorithms, like single-source shortest paths, have the same problems.
 

Claudio Martella from Apache explains how Giraph works in the graph dev room @ FOSDEM 2012
After explaining more about how Pregel is implemented on top of the existing MapReduce infrastructure for distribution, he says that this system has some advantages over MapReduce:

  • it's a stateful computation
  • the disk is hit only for checkpoints
  • no sorting is necessary
  • only messages hit the network
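Those properties fall out of the vertex-centric programming model. A minimal Pregel-style sketch (illustrative, not Giraph's actual API): vertices keep their state across supersteps, send messages only when their state changes, and the computation halts when no messages are in flight. Here every vertex converges on the maximum value in its connected component.

```python
# Toy Pregel-style superstep loop: stateful vertices, synchronized
# message exchange, halting when no messages remain in flight.

def pregel_max(graph, values):
    # superstep 0: every vertex announces its value to its neighbours
    messages = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            messages[n].append(values[v])
    # later supersteps: update state from the inbox, send only on change
    while any(messages.values()):
        new_messages = {v: [] for v in graph}
        for v, inbox in messages.items():
            if inbox and max(inbox) > values[v]:
                values[v] = max(inbox)           # stateful update, kept in memory
                for n in graph[v]:
                    new_messages[n].append(values[v])
        messages = new_messages                  # only messages cross the network
    return values

graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}  # two components
values = {0: 3, 1: 6, 2: 2, 3: 1, 4: 9}
result = pregel_max(graph, values)
print(result)   # {0: 6, 1: 6, 2: 6, 3: 9, 4: 9}
```

Because each vertex's state survives between supersteps, nothing has to be rewritten to disk or re-sorted between iterations; only the messages travel.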

He points out that the advantage of Giraph over other methods (Hama, GoldenOrb, Signal/Collect) is especially the active community (Facebook, Yahoo, LinkedIn, Twitter) behind the project. I personally think another advantage is that it is run by Apache, who already run MapReduce (Hadoop) with great success, so it is something that people trust.
Claudio points out explicitly that they are looking for more contributors, and I think this is really an interesting topic to work on. So thanks, Claudio, for your inspiring work!

Here are the video streams from the graph dev room:
