To be clear: I am just dreaming here. This technology does not exist yet, and the list will need to be refined at a later stage.

  • Fast traversals:
    • Jumping from one vertex of the graph to another should be possible in O(1)
  • Online processing:
    • “Standard queries” (whatever exactly that means) should compute within milliseconds.
    • As an example: local recommendations (e.g. finding similar users in a bipartite “User – Band” graph) should be computable online in less than a second.
  • Query language:
    • A programming model that supports pattern matching and traversals with one (or possibly several) starting nodes
    • No SPARQL support needed (it is too general for a reasonable graph application).
    • Support for reading and writing new data (to disk)!
  • Distribution effort:
    • The programmer should not have to care about the distribution techniques.
    • They should just be able to use the technology.
  • Fault tolerance:
    • The system has to keep running stably when working computers are added or removed.
    • Probably by introducing redundancy in some way [1]
  • Persistence:
    • Transactions and persistence are important for any data base service.
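To make the recommendation item above a bit more concrete, here is a tiny sketch. The graph, names and data are purely hypothetical; the point is that the query is just two hops (user → bands → co-listeners) plus a counter, which is exactly the kind of “standard query” that should run online:

```python
from collections import defaultdict

# Hypothetical in-memory bipartite "User - Band" graph as adjacency lists.
# In the envisioned system each of these lookups would be an O(1) vertex hop.
user_bands = {
    "alice": {"metallica", "tool"},
    "bob": {"metallica", "tool", "opeth"},
    "carol": {"abba"},
}

def similar_users(user, graph):
    """Rank other users by the number of bands they share with `user`."""
    # Invert the graph on the fly: band -> set of listeners.
    listeners = defaultdict(set)
    for u, bands in graph.items():
        for b in bands:
            listeners[b].add(u)
    overlap = defaultdict(int)
    for band in graph[user]:            # hop 1: user -> bands
        for other in listeners[band]:   # hop 2: band -> co-listeners
            if other != user:
                overlap[other] += 1
    return sorted(overlap.items(), key=lambda kv: -kv[1])

print(similar_users("alice", user_bands))  # -> [('bob', 2)]
```

In a distributed setting the interesting question is how many of these hops cross machine boundaries, which is exactly where the partitioning and redundancy questions below come in.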

It is very clear that this wish list is very high level. But I think these are reasonable assumptions from which we can break down the problem and discuss the pros and cons of all the techniques needed to build such a system.

[1] on the Redundancy discussion:

Depending on the techniques used, introducing redundancy probably has two positive effects on:

  1. Fast traversals
  2. Fault tolerance

On the other hand, it has a deep impact on:

  1. Persistence (which is hard to achieve in a distributed setting anyway, and becomes even harder once redundancy is added).

It is not clear whether we really need redundancy. Maybe there are other techniques that would let us reach our goals, but I personally have the feeling that a good model for redundancy will “solve” the problem.

Relation to the reading club

I already found the time to look over our current reading assignments. Especially the VLDB paper (topology partitioning applied to SPARQL, Hadoop and triple stores) and the one on challenges in parallel graph processing strengthen my confidence that the approach described above is very reasonable.

What is your opinion?

Do you think I am missing some features, or should I keep the focus on one particular feature? What about methods to achieve those goals? I am happy to discuss your thoughts!



2 Comments on Wishlist of features for a distributed graph data base technology

  1. Stefan says:

    I agree with the main points of your “wishlist”. Only two little comments. As a query language I would also prefer a traversal language, but for some use cases a graph matching language like SPARQL could also be interesting.
    For example, in social networks, searching for persons by different features (age, home base, interests, …). I was thinking about transforming SPARQL queries into index lookups and traversals. Supported by rules (e.g. for type subsumption), such a system could even be more powerful than SPARQL. SPARQL does not support the arbitrary-depth tree traversal you would need for e.g. the “give me all ancestors of a person” query.
    A combined system would be able to solve such problems.
    Second, the redundancy point: I agree that redundancy seems very interesting, especially regarding graph distribution. But I also think that redundancy could lead to multiple side effects with write/delete operations. Keeping all redundant copies up to date in a distributed system where multiple competing jobs are running at the same time is a challenging task. To solve this, you would probably have to apply versioning, version control and/or locking or log mechanisms to graph nodes or atomic subgraphs.
    From my point of view this would be work for a later part of the problem.
    Thirdly and lastly, for me fault tolerance is the last thing to think about.
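
    [Editor's note: the arbitrary-depth “ancestors” query mentioned above is essentially a transitive closure over parent edges, which a plain traversal handles naturally. A minimal sketch, assuming a simple adjacency-list representation; all names and data are hypothetical:]

```python
# Hypothetical parent edges; "ancestors" = transitive closure over parents.
parents = {
    "child": ["mother", "father"],
    "mother": ["grandma"],
    "grandma": [],
    "father": [],
}

def ancestors(person, graph):
    """Collect all ancestors of `person` by iterative depth-first traversal."""
    seen, stack = set(), list(graph.get(person, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(graph.get(p, []))  # follow parent edges transitively
    return seen

print(sorted(ancestors("child", parents)))  # -> ['father', 'grandma', 'mother']
```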

  2. [...] Over the last week we had our off campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss our research ideas with a post doc and other PhD candidates for more than 6 hours. This led to the structure, to-dos and timetable of my PhD proposal. This has to be finalized over the next couple of days, but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology [...]
