graph database – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany
https://www.rene-pickhardt.de
Extract knowledge from your data and be ahead of your competition
Tue, 17 Jul 2018 12:12:43 +0000

Graphity Server for social activity streams released (GPLv3)
https://www.rene-pickhardt.de/graphity-server-for-social-activity-streams-released-gplv3/
Mon, 02 Sep 2013 07:11:22 +0000

It has been almost two years since I published my first ideas and work on Graphity, which is nowadays a collection of algorithms to support efficient storage and retrieval of more than 10k social activity streams per second. You know the typical application from Twitter, Facebook and co.: retrieve the most current status updates from your circle of friends.
Today I proudly present the first version of the Graphity News Stream Server. Big thanks to Sebastian Schlicht, who worked for me implementing most of the servlet and did an amazing job! The Graphity Server is a neo4j-powered servlet with the following properties:

  • Response times for requests are usually less than 10 milliseconds (plus network I/O, e.g. TCP round trips coming from HTTP).
  • The Graphity News Stream Server is free open source software (GPLv3) and is hosted in the metalcon git repository. (Please also use the bug tracker there to submit bugs and feature requests.)
  • It runs two Graphity algorithms: one is read optimized and the other is write optimized, for the case that you expect your application to have more write than read requests.
  • The server comes with a REST API, which makes it easy to plug the server into whatever application you have.
  • The server’s response follows the activitystrea.ms format, so out of the box a large number of clients are available to render the response of the server.
  • The server ships together with unit tests and extensive documentation, especially of the news stream server protocol (NSSP), which specifies how to talk to the server.

The server can currently handle about 100 write requests in medium-sized networks (about a million nodes). I do not recommend using this server if you expect your user base to grow beyond 10 million users (though we are working on getting the server to scale). This is mostly due to the fact that our data base right now won’t really scale beyond one machine and some internal operations have to be synchronized.
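The central read request, fetching the newest status updates from a circle of friends, can be pictured as a k-way merge over per-friend lists that are already sorted by time. The following is only a minimal in-memory sketch of that idea and not the Graphity implementation; the names `StreamMergeSketch`, `Update` and `newest` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not the Graphity implementation:
// k-way merge of per-friend update lists that are sorted newest first.
final class StreamMergeSketch {

    // A status update; the fields are assumptions made for this example.
    static final class Update {
        final String author;
        final long timestamp;

        Update(String author, long timestamp) {
            this.author = author;
            this.timestamp = timestamp;
        }
    }

    // Position inside one friend's list.
    private static final class Cursor {
        final List<Update> list;
        final int pos;

        Cursor(List<Update> list, int pos) {
            this.list = list;
            this.pos = pos;
        }

        Update head() {
            return list.get(pos);
        }
    }

    // Returns up to k updates, newest first, taken from all friend lists.
    static List<Update> newest(List<List<Update>> friendStreams, int k) {
        // Max-heap ordered by the timestamp of each list's current head.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                (a, b) -> Long.compare(b.head().timestamp, a.head().timestamp));
        for (List<Update> stream : friendStreams) {
            if (!stream.isEmpty()) {
                heap.add(new Cursor(stream, 0));
            }
        }
        List<Update> result = new ArrayList<>();
        while (result.size() < k && !heap.isEmpty()) {
            Cursor c = heap.poll();
            result.add(c.head());
            if (c.pos + 1 < c.list.size()) {
                heap.add(new Cursor(c.list, c.pos + 1)); // advance within this list
            }
        }
        return result;
    }
}
```

Only the heads of the lists are compared, so retrieving k updates touches at most k list entries plus one entry per friend, no matter how long the individual streams are.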

Koding.com is currently thinking about implementing Graphity-like algorithms to power their activity streams. It was Richard from their team who pointed out, in a very fruitful discussion, how to avoid the neo4j limit of 2^15 = 32768 relationship types by using an overlay network. His ideas for an overlay network have been implemented in the read-optimized Graphity algorithm. Big thanks to him!
Now I am really excited to see what kind of applications you will build using Graphity.

If you use Graphity

Please tell me if you start using Graphity. That would be awesome to know, and I will most certainly include you in a list of testimonials.
By the way, if you want to help spread the server (which is also good for you, since more developers using it means a higher chance of getting newer versions), you can vote up my answer on Stack Overflow:
http://stackoverflow.com/questions/202198/whats-the-best-manner-of-implementing-a-social-activity-stream/13171306#13171306

How to get started

It’s darn simple!

  1. Clone the git repository or get hold of the source code.
  2. Switch to the repo and type sudo ./install.sh
  3. Copy the war file to your Tomcat webapps folder (if you don’t know how to set up Tomcat and Maven, which are needed, we have a detailed setup guide).
  4. And you’re done! More configuration details are in our README.md.
  5. Look in the newswidget folder to find a simple HTML / JavaScript client which can interact with the server.
I also created a small, simple screencast to demonstrate the setup:

Get involved

There are plenty of ways to get involved:

  • Fork the server
  • Submit a bug report
  • Fix a bug
  • Subscribe to the mailing list

Aurelius Titan graph enables realtime querying with 2400 concurrent users on a distributed graph database!
https://www.rene-pickhardt.de/aurelius-titan-graph-enables-realtime-querying-with-2400-concurrent-users-on-a-distributed-graph-database/
Wed, 21 Aug 2013 10:43:02 +0000

Sorry to start with a conclusion first… To me Titan graph seems to be the egg-laying wool-milk-sow that people would dream of when working with graph data, especially if one needs graph data in a web context and in real time. I will certainly try to free some time to check this out and get hands-on. Also, this thing is really new and revolutionary. This is not just another Hadoop or Giraph approach for big data processing; this is distributed in real time! I am almost confident that, if the benchmark holds what it promises, Titan will be one of the fastest growing technologies we have seen so far.
 

I met Matthias Bröcheler (CTO of Aurelius, the company behind Titan graph) 4 years ago in a teaching situation for the German national student high school academy. It was the time when I was still more mathematician than computer scientist, but my journey towards becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experiences a lot. It was thanks to him that my eyes were opened for the first time to what big data means and how companies like Facebook, Google and so on knit their business models around collecting data. Anyway, Matthias influenced me in quite some way and I have a lot of respect for him.
I did not start my PhD right away and we lost contact. I knew he was interested in graphs, but that was about it. Only when I started to use neo4j more and more did I realize that Matthias was also one of the authors of the TinkerPop Blueprints, which are interfaces for talking to graphs that most vendors of graph data bases use. At that time I looked him up again and realized he was working on Titan graph, a distributed graph data base. I found this promising-looking slide deck:

Slide 106:

Slide 107:

But at that time there wasn’t much evidence for me that Titan would really keep the promise made in slides 106 and 107. In fact those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases. (By the way: reading the PhD proposal now, I am kind of amused, since I did not really aim for the important and big points like Titan did.)
During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button and especially to be able to integrate this with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (modeled on Google Bigtable) and Cassandra (whose distribution model follows Amazon Dynamo), which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines, I stumbled on Titan again and saw Matthias’ talk at the Cassandra Summit 2013. Around minute 21 / 22 the talk gets really interesting. I can also suggest skipping the first 15 minutes of the talk:

Let me sum up the amazing parts of the talk:

  • 2400 concurrent users against a graph cluster!
  • real time!
  • 16 different non-trivial queries
  • achieving more than 10k requests answered per second!
  • graph with more than a billion nodes!
  • graph partitioning is pluggable
  • graph schema helps indexing for queries
So far I was not sure what kind of queries were really involved, especially whether there were also write transactions, and unfortunately no one in the audience asked that question. So I started googling and found this blog post by Aurelius. As we can see, it gives an entire overview of the queries and presents the results in much more detail. Unfortunately I was not able to find the source code of that very benchmark (which Matthias promised to open in his talk). On average most queries take less than half a second.
 
Even though the source code is not available this talk together with the Aurelius blog post looks to me like the most interesting and hottest piece of technology I came across during my PhD program. Aurelius started to think distributed right away and made some clever design decisions:
  • scaling data size
  • scaling data access in terms of concurrent users (especially write operations), which is fundamentally built in and also seems to work successfully
  • making partitioning pluggable
  • requiring a schema for the graph (to enable efficient indexing)
  • being able to extend the schema at runtime
  • building on top of either Cassandra (for real time) or HBase (for consistency)
  • being compatible with the TinkerPop tech stack
  • bringing up an entire framework for analytics and graph processing

Video of FOSDEM talk finally online
https://www.rene-pickhardt.de/video-of-fosdem-talk-finally-online/
Tue, 25 Jun 2013 15:55:57 +0000

I visited FOSDEM 2013 with Heinrich Hartmann and talked about related-work.net. The video of this talk is finally online and of course I would like to share it with the community:

The slides can be found here.

GWT + database connection in Servlet ContextListener – Auto Complete Video Tutorial Part 5
https://www.rene-pickhardt.de/gwt-database-connection-in-servlet-contextlistener-auto-complete-video-tutorial-part-5/
Mon, 24 Jun 2013 11:44:47 +0000

Finally we have all the basics that are needed for building an autocomplete service, and now comes the juicy part. From now on we are looking at how to make it fast and robust. In the current approach we open a new data base connection for every HTTP request. This needs quite some time to lock the data base (at least when using neo4j in embedded mode) and then to run the query without any opportunity to use the caching strategy of the data base.
In this tutorial I will introduce you to the concept of a ContextListener. Roughly speaking, this is a way of storing objects in the global memory of the Java servlet container as key-value pairs. Once we understand this, the roadmap is very clear: we can store objects like data base connections or search indices in the memory of our web server. As far as I currently understand, this could also be used to implement some server-side caching. I have not yet done any benchmarking to test how fast retrieving objects from the context works in Tomcat. Also, this method of caching does not scale horizontally as well as memcached does.
Anyway have fun learning about the context listener.
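To make the pattern concrete without a servlet container, here is a minimal sketch that uses stand-in types. In a real web application one would implement `javax.servlet.ServletContextListener` (with its callbacks `contextInitialized` and `contextDestroyed`) and use `ServletContext.setAttribute` / `getAttribute`. The classes `AppContext`, `FakeDatabase`, `DatabaseContextListener` and `AutocompleteHandler` below are hypothetical and only mimic that behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for javax.servlet.ServletContext: a global key-value store.
final class AppContext {
    private final Map<String, Object> attributes = new ConcurrentHashMap<>();

    void setAttribute(String key, Object value) { attributes.put(key, value); }
    Object getAttribute(String key) { return attributes.get(key); }
    void removeAttribute(String key) { attributes.remove(key); }
}

// Hypothetical expensive resource, e.g. an embedded data base handle.
final class FakeDatabase {
    private final List<String> words = Arrays.asList("graph", "graphity", "neo4j");
    private boolean open = true;

    boolean isOpen() { return open; }
    void shutdown() { open = false; }

    // Returns the first stored word starting with the prefix, or null.
    String complete(String prefix) {
        for (String word : words) {
            if (word.startsWith(prefix)) return word;
        }
        return null;
    }
}

// Mirrors the two callbacks of a ServletContextListener.
final class DatabaseContextListener {
    static final String DB_KEY = "database";

    // Called once at application start: open the data base a single time.
    void contextInitialized(AppContext context) {
        context.setAttribute(DB_KEY, new FakeDatabase());
    }

    // Called once at application shutdown: release the shared resource.
    void contextDestroyed(AppContext context) {
        FakeDatabase db = (FakeDatabase) context.getAttribute(DB_KEY);
        if (db != null) {
            db.shutdown();
            context.removeAttribute(DB_KEY);
        }
    }
}

// Every request reuses the shared handle instead of opening a new one.
final class AutocompleteHandler {
    static String handle(AppContext context, String prefix) {
        FakeDatabase db = (FakeDatabase) context.getAttribute(DatabaseContextListener.DB_KEY);
        return db.complete(prefix);
    }
}
```

The point of the design is that the expensive object is created once per application lifetime rather than once per HTTP request.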

If you have any suggestions, comments or thoughts, or even know of some solid benchmarks about caching using the ServletContext (I did a quick web search for a few minutes and didn’t find any), feel free to contact me and discuss this!

Michael Hunger talks about High Availability of Neo4j built on Paxos in the GraphDevroom @ FOSDEM
https://www.rene-pickhardt.de/michael-hunger-talks-about-high-availability-of-neo4j-built-on-paxos-in-the-graphdevroom-fosdem/
Sat, 02 Feb 2013 12:01:24 +0000

As we know, neo4j has master-slave replication with eventual consistency, so the typical ACID requirements are not met. One way is to write to the master, which pushes the changes to the slaves. It is also possible to write to the slaves directly, which is super safe but much slower since synchronization between the slaves is required.
In general (not very specific to neo4j) there are a few concerns:

  • Cluster management (how to handle machines joining or leaving the cluster, as well as heartbeat messages); this also covers failover (master election, distribution of the master status)
  • Replication (synchronized id generation, distributed locks, and so on)

Neo4j used to build on Apache ZooKeeper to take care of these concerns. Michael points out that there have been problems with using ZooKeeper:

  • how to coordinate ZooKeeper with the neo4j cluster
  • unreliable operations
  • people did not like the topology required by the ZooKeeper architecture
  • ZooKeeper elects a new master too often, which is especially bad in a heavy-load environment
  • no dynamic reconfiguration of the ZooKeeper cluster

The solution of neo4j was to implement the Multi-Paxos protocol themselves and replace ZooKeeper. Michael especially suggests reading the Paxos Made Simple paper by Leslie Lamport. The core consists of state machines implemented using Java enums.
I still remember a lot of discussions in the reading club on distributed graph data bases. We never actually looked into Apache ZooKeeper and the Paxos paradigm, which would certainly be an interesting technique to learn!
In the next part there were a lot of detailed discussions which were hard for me to follow, since I am so far not familiar with the Paxos paradigm.
If you are curious about the HA of neo4j (and you can bet I am), you can look into Peter’s screencast, which leads you through setting up neo4j HA:
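To illustrate what a state machine implemented with Java enums can look like, here is a small sketch. The states and messages are simplified placeholders loosely inspired by the role of a Paxos proposer. They are assumptions made for this example, not neo4j’s actual classes and not a complete Paxos implementation.

```java
// Hypothetical sketch of an enum-based state machine: every enum constant
// is a state and implements its own transition function.
enum ProposerState {
    IDLE {
        @Override
        ProposerState handle(Message message) {
            return message == Message.CLIENT_REQUEST ? PREPARING : this;
        }
    },
    PREPARING {
        @Override
        ProposerState handle(Message message) {
            switch (message) {
                case PROMISE_QUORUM: return ACCEPTING; // a majority promised
                case REJECTED:       return IDLE;      // start over later
                default:             return this;      // ignore other messages
            }
        }
    },
    ACCEPTING {
        @Override
        ProposerState handle(Message message) {
            switch (message) {
                case ACCEPTED_QUORUM: return CHOSEN;   // a majority accepted
                case REJECTED:        return IDLE;
                default:              return this;
            }
        }
    },
    CHOSEN {
        @Override
        ProposerState handle(Message message) {
            return this; // terminal state in this sketch
        }
    };

    // Simplified message types, invented for this example.
    enum Message { CLIENT_REQUEST, PROMISE_QUORUM, ACCEPTED_QUORUM, REJECTED }

    // Transition function: returns the state that follows after the message.
    abstract ProposerState handle(Message message);

    // Feeds a sequence of messages through the machine.
    static ProposerState run(ProposerState start, Message... messages) {
        ProposerState state = start;
        for (Message message : messages) {
            state = state.handle(message);
        }
        return state;
    }
}
```

Keeping each transition inside the state it belongs to makes the machine easy to read and to test in isolation, because a state is just a pure function from message to next state.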

Setting up a local HA cluster in Neo4j 1.9 from Peter Neubauer on Vimeo.

Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language
https://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/
Tue, 06 Nov 2012 11:55:02 +0000

As I said yesterday, I have been busy over the last months producing content, so here you go. For related work we are most likely to use neo4j as the core data base. This makes sense since we are basically building some kind of social network. Most queries that we need to answer while offering the service or during data mining carry a friend-of-a-friend structure.
For some of the queries we are doing counting or aggregations so I was wondering what is the most efficient way of querying against a neo4j data base. So I did a Benchmark with quite surprising results.
Just a quick remark, we used a data base consisting of papers and authors extracted from arxiv.org one of the biggest pre print sites available on the web. The data set is available for download and reproduction of the benchmark results at http://blog.related-work.net/data/
The data base as a neo4j file is 2GB (zipped) the schema looks pretty much like that:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we were trying to find coauthors, which is basically a friend-of-a-friend query following the author relationship (or a breadth-first search of depth 2).
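Independent of neo4j, the structure of this query can be shown on plain in-memory maps. The sketch below is an assumption-laden illustration: the class `CoauthorSketch` and the two maps (author to papers, paper to authors) are hypothetical stand-ins for the author relationships in the graph.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory version of the coauthor (friend-of-a-friend) query.
final class CoauthorSketch {

    // Hop 1: author -> papers. Hop 2: paper -> authors. The start author is skipped.
    static Set<String> coauthors(String author,
                                 Map<String, Set<String>> papersOf,
                                 Map<String, Set<String>> authorsOf) {
        Set<String> result = new TreeSet<>();
        Set<String> noEntries = Collections.emptySet();
        for (String paper : papersOf.getOrDefault(author, noEntries)) {
            for (String other : authorsOf.getOrDefault(paper, noEntries)) {
                if (!other.equals(author)) {
                    result.add(other);
                }
            }
        }
        return result;
    }
}
```

Note that this variant returns distinct coauthors, whereas a counter that is incremented once per path would count a coauthor once for every shared paper.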
As we know there are basically 3 ways of communicating with the neo4j Database:

Java Core API

Here you work on the nodes and relationship objects within java. Formulating a query once you have fixed an author node looks pretty much like this.

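// Count co-author paths: author -> paper -> co-author.
// resCnt is a counter declared outside of this snippet.
for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
    Node paper = rel.getOtherNode(author);
    for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node coAuthor = coAuthorRel.getOtherNode(paper);
        // Skip the path that leads back to the starting author.
        if (coAuthor.getId() == author.getId()) continue;
        resCnt++;
    }
}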

We see that the code can easily look very confusing (if queries get more complicated). On the other hand, one can easily combine several similar traversals into one big query, making readability worse but increasing performance.

Traverser Framework

The Traverser Framework ships with the Java API and I really like the idea of it. I think it is really easy to understand the meaning of a query, and in my opinion it really helps to make the code readable.

// Breadth-first traversal along AUTHOROF relationships, keeping only paths of depth 2.
Traversal t = new Traversal();
for (Path p : t.description().breadthFirst()
        .relationships(RelationshipTypes.AUTHOROF)
        .evaluator(Evaluators.atDepth(2))
        .uniqueness(Uniqueness.NONE)
        .traverse(author)) {
    Node coAuthor = p.endNode();
    resCnt++;
}

Especially if you have a lot of similar queries or queries that are refinements of other queries you can save them and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is the Cypher Query Language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why. It is a really beautiful language that is close to SQL (looking at Stack Overflow, it is actually frightening how many people are trying to answer FOAF queries using MySQL) but still emphasizes the graph-like structure.

// Build and run the Cypher query for the fixed author node.
ExecutionEngine engine = new ExecutionEngine(graphDB);
String query = "START author=node(" + author.getId()
        + ") MATCH author-[:" + RelationshipTypes.AUTHOROF.name()
        + "]-()-[:" + RelationshipTypes.AUTHOROF.name()
        + "]- coAuthor RETURN coAuthor";
ExecutionResult result = engine.execute(query);
// The raw Scala iterator yields Object, so the element has to be cast.
scala.collection.Iterator it = result.columnAs("coAuthor");
while (it.hasNext()) {
    Node coAuthor = (Node) it.next();
    resCnt++;
}

I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let’s just point out that Cypher is quite expressive.

What were the results?

All queries have been executed 11 times where the first time was thrown away since it warms up neo4j caches. The values are average values over the other 10 executions.
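The measurement procedure described above (one discarded warm-up run followed by averaged measured runs) can be sketched as follows. This is a minimal illustration and not the benchmark code itself; `WarmupBenchmark`, `averageNanos` and `averageWithoutFirst` are hypothetical names.

```java
// Hypothetical sketch of a benchmark loop with discarded warm-up runs.
final class WarmupBenchmark {

    // Runs the task warmupRuns times without timing it, then measuredRuns
    // times with timing, and returns the average duration in nanoseconds.
    static double averageNanos(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // warms up caches; the duration is thrown away
        }
        long total = 0;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            total += System.nanoTime() - start;
        }
        return (double) total / measuredRuns;
    }

    // Same idea on recorded timings: drop the first value, average the rest.
    static double averageWithoutFirst(long[] timings) {
        if (timings.length < 2) {
            throw new IllegalArgumentException("need at least two timings");
        }
        long total = 0;
        for (int i = 1; i < timings.length; i++) {
            total += timings[i];
        }
        return (double) total / (timings.length - 1);
    }
}
```

With `averageNanos(query, 1, 10)` the task is executed 11 times in total, which matches the setup of one warm-up run plus ten measured runs.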
  • The Core API is able to answer about 2000 friend-of-a-friend queries per second (I have to admit, on a very sparse network).
  • The Traverser Framework is about 25% slower than the Core API.
  • Worst is Cypher, which is at least one order of magnitude slower, only able to answer about 100 FOAF-like queries per second.
  • I was shocked, so I talked with Andres Taylor from neo4j, who is mainly working on Cypher. He asked me which neo4j version I used and I said it was 1.7. He told me I should check out 1.9, since Cypher had become more performant. So I ran the benchmarks on neo4j 1.8 and neo4j 1.9; unfortunately Cypher became slower in the newer neo4j releases.

    One can see that the Core API outperforms Cypher by an order of magnitude and the Traverser Framework by about 25%. In newer neo4j versions the Core API became faster and Cypher became slower.

    Quotes from Andres Taylor:

    Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on the focus in this first phase has been to explore the language, and learn about how our users use the query language, and to expand the feature set to a reasonable level

    I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. like every language ever created, in the beginning you can always do better than the compiler by writing by hand but eventually,the compiler catches up

    Conclusion:

    So far I have only been using the Java Core API when working with neo4j, and I will continue to do so.
    If you are in a high-speed scenario (I believe every web application is one), you should really think about switching to the neo4j Java Core API for writing your queries. It might not be as nice looking as Cypher or the Traverser Framework, but the gain in speed pays off.
    Also, I personally like the amount of control that you have when traversing over the core yourself.
    Additionally, I will soon post an article on why scripting languages like PHP, Python or Ruby aren’t suitable for building web applications anyway. So switching to the Core API makes sense for several reasons.
    The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
    The detailed results can be found in this spreadsheet.

    Typology using neo4j wins 2 awards at the German federal competition young scientists.
    https://www.rene-pickhardt.de/typology-using-neo4j-wins-2-awards-at-the-german-federal-competition-young-scientists/
    Mon, 21 May 2012 09:44:54 +0000

    Two days ago I arrived in Erfurt in order to visit the federal competition young scientists (Jugend Forscht). I have reported about the project Typology by Till Speicher and Paul Wagner, which I supervised over the last half year and which has already won many awards.
    On Saturday night they had already won a special award donated by the Gesellschaft fuer Informatik. The award has the title “special award for a contribution which demonstrates particularly the usefulness of computer science for society” (Sonderpreis fuer eine Arbeit, die in besonderer Art und Weise den Nutzen der Informatik verdeutlicht). This award came with 1500 Euro in cash!
    Yesterday there was the final award ceremony. I was quite excited to see how the hard work that Till and Paul put into their research project would be evaluated by the jury. Out of 457 submissions in a tough competition, the top 5 projects were awarded. Till and Paul came in 4th and will now be allowed to visit the German chancellor Angela Merkel in Berlin.
    With the use of spreading activation, Typology is able to make precise predictions of what you are going to type next on your smartphone. It outperforms the current scientific standards (language models) by more than 100% and has a precision of 67%!
    A demo of the project can be found at www.typology.de
    The Android app of this system is available in the app store. It is currently only available for German text and is in beta. But the source code is open, and anybody who wants to contribute is welcome to check out the code. A mailing list will be set up soon; anyone who is interested can already drop a short message at mail@typology.de and will be added to the mailing list as soon as it is established.
    You can also look at the documentation, which will soon be available in English. Alternatively, you can report bugs and request features in our bug tracker.
    I am currently trying to improve the data structures and the data base design in order to make retrieval of suggestions faster and to decrease the computational load on our web server. If you have expertise in top-k aggregation joins in combination with prefix filtering, drop me a line and we can discuss this issue!
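To make the retrieval problem mentioned above tangible, here is a naive baseline for top-k aggregation with prefix filtering. It is a sketch under assumptions and not Typology’s actual data structure: each map is imagined to hold candidate next words with a score contributed by one of the preceding words, and `PrefixTopK` and `suggest` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical baseline: sum the scores of all candidates that match the
// typed prefix and return the k candidates with the highest total score.
final class PrefixTopK {

    static List<String> suggest(List<Map<String, Double>> sources, String prefix, int k) {
        // Aggregation step: add up the scores per word across all sources.
        Map<String, Double> total = new HashMap<>();
        for (Map<String, Double> source : sources) {
            for (Map.Entry<String, Double> entry : source.entrySet()) {
                if (entry.getKey().startsWith(prefix)) { // prefix filter
                    total.merge(entry.getKey(), entry.getValue(), Double::sum);
                }
            }
        }
        // Top-k step: sort by descending score, ties broken alphabetically.
        List<String> words = new ArrayList<>(total.keySet());
        words.sort((a, b) -> {
            int byScore = Double.compare(total.get(b), total.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(k, words.size())));
    }
}
```

This version scans every entry of every source, which is exactly the cost one would like to avoid; smarter approaches would keep the sources sorted or indexed by prefix so that most entries never have to be touched.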
    Happy winners Paul Wagner (left) and Till Speicher (right) with me

    Reading Club on distributed graph db returns with a new Format on April 4th 2012
    https://www.rene-pickhardt.de/reading-club-on-distributed-graph-db-returns-with-a-new-format-on-april-4th-2012/
    Mon, 02 Apr 2012 09:37:05 +0000

    The reading club was quite inactive due to traveling and also a suboptimal process for choosing the literature. That is why a new format for the reading club has been discussed and agreed upon.
    The new format means that we have 4 new rules:

    1. We will only discuss up to 3 papers in 90 minutes. Roughly speaking, we have 30 minutes per paper, but this does not have to be strict.
    2. The chosen papers should be read by everyone before the reading club takes place.
    3. For every paper there is one responsible person (moderator) who read the entire paper before suggesting it as a common reading.
    4. Open questions on the (potential) reading assignments and ideas for reading can and should be discussed on http://related-work.rene-pickhardt.de/ (use the same template as I used for the reading assignments in this blog post), e.g.:

    Moderator:
    Paper download:
    Why to read it
    topics to discuss / open questions:

    For next meeting on April 4th 2 pm CET (in two days) the literature will be:

    While preparing these papers we might come across some other interesting literature.
    If you want to suggest some of this literature, you should also read that piece of work before the reading club meeting takes place and know why you want everybody to prepare the same paper and discuss it (rule 3). Additionally, you should open a topic on the paper on http://related-work.rene-pickhardt.de/ using the above template before the reading club takes place (rule 4).
    I hope this is of help for the entire project and I am looking forward to the next meeting!

    PhD proposal on distributed graph data bases
    https://www.rene-pickhardt.de/phd-proposal-on-distributed-graph-data-bases/
    Tue, 27 Mar 2012 10:19:22 +0000

    Over the last week we had our off-campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss their research ideas with a postdoc and other PhD candidates for more than 6 hours. This led to the structure, todos and timetable of my PhD proposal. This has to be finalized over the next couple of days, but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology.

    [TODO] 0. Find a template for the PhD proposal

    That is straightforward. The task is just to look at other students’ PhD proposals, and also at some major conferences, and see what kind of structure they use. A very common structure for papers is Jennifer Widom’s structure for writing a good research paper. This or a similar template will help to make the proposal easy to read. For this blog article I will follow Jennifer Widom more or less.

    1. Write an Introduction

    Here I will describe the use case(s) of a distributed graph data base. These could be

    • indexing the web graph for a general purpose search engine like Google, Bing, Baidu, Yandex…
    • running the backend of a social network like Facebook, Google+, Twitter, LinkedIn,…
    • storing web log files and click streams of users
    • doing information retrieval (recommender systems) in the above scenarios

    There could also be very different use cases, like graphs from

    • biology
    • finance
    • regular graphs 
    • geographic maps like road and traffic networks

    2. Discuss all the related work

    This is done to name all the existing approaches and challenges that come with a distributed graph data base. It is also important to set oneself apart from existing frameworks, for example for graph processing. Here I will name at least the related work in the following fields:

    • graph processing (Signal Collect, Pregel,…)
    • graph theory (especially data structures and algorithms)
    • (dynamic/adaptive) graph partitioning
    • distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
    • redundancy vs fault tolerance
    • network programming (protocols, latency vs bandwidth)
    • data bases (ACID, multiple user access, …)
    • graph data base query languages (SPARQL, Gremlin, Cypher,…)
    • Social Network and graph analysis and modelling.

    3. Formalize the problem of distributed graph data bases

    After describing the related work and knowing the standard terminology, it makes sense to really formalize the problem. Several steps have to be taken. A notation for distributed graph data bases needs to be fixed. This has to respect two things:
    a) The real, so far unknown, problems that will be solved during the PhD. Because of this, fixing the notation and formalizing the (unknown) problem will be kind of hard.
    b) The use cases: for the web use case this will probably translate to scale-free small-world network graphs with a very small diameter. In order to respect use cases other than the web, it will probably make sense to cite different graph models from the related work, e.g. mathematical models that generate graphs with certain properties.
    The important step here is that fixing a use case will also fix a notation and help to formalize the problem. The crucial part is to keep the use case general enough that all special cases and borderline cases are included. In particular, the use case should be a real extension of graph processing, which should of course be possible with a distributed graph data base.
    One very important part of the formalization will lead to a first research question:

    4. Graph Query languages – Graph Algebra

    I think graph data bases are not really general purpose data bases. They exist to solve a certain class of problems in a certain range. They seem to be especially useful where information about the local neighborhood of data points is frequently needed. They also often seem to be useful when schemaless data is processed. This leads to the question of a query language. Obviously (?) the more general the query language, the harder it is to have a very efficient solution. The model of a relational algebra was a very successful concept in relational data bases. I guess a similar graph algebra is needed as a mathematical concept for distributed graph data bases, as a foundation of their query languages.
    Note that this chapter does not have much to do with distributed graph data bases in particular, but with graph data bases in general.
    The graph algebra I have in mind so far is pretty similar to neo4j and consists of some atomic CRUD operations. Once the results are known (either as an answer from the related work or from my own research), I will be able to run my first experiments in a distributed environment.
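As an illustration of what “atomic CRUD operations” on a graph could mean, here is a small in-memory sketch. It is neither neo4j’s API nor the graph algebra envisioned above; the class `TinyGraph` and all of its method names are assumptions made for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of atomic CRUD operations on a directed property graph.
final class TinyGraph {
    private final Map<Long, Map<String, Object>> properties = new HashMap<>();
    private final Map<Long, Set<Long>> outgoing = new HashMap<>();
    private long nextId = 0;

    // Create: adds a node and returns its id.
    long createNode() {
        long id = nextId++;
        properties.put(id, new HashMap<String, Object>());
        outgoing.put(id, new HashSet<Long>());
        return id;
    }

    // Create: adds a directed edge if both endpoints exist.
    boolean createEdge(long from, long to) {
        if (!properties.containsKey(from) || !properties.containsKey(to)) {
            return false;
        }
        return outgoing.get(from).add(to);
    }

    // Read: the local neighborhood (outgoing edges) of a node.
    Set<Long> neighbors(long id) {
        Set<Long> targets = outgoing.get(id);
        if (targets == null) {
            return Collections.<Long>emptySet();
        }
        return Collections.unmodifiableSet(targets);
    }

    // Read: a single property of a node, or null if absent.
    Object getProperty(long id, String key) {
        Map<String, Object> props = properties.get(id);
        return props == null ? null : props.get(key);
    }

    // Update: sets a property on an existing node.
    boolean setProperty(long id, String key, Object value) {
        Map<String, Object> props = properties.get(id);
        if (props == null) {
            return false;
        }
        props.put(key, value);
        return true;
    }

    // Delete: removes a single edge.
    boolean deleteEdge(long from, long to) {
        Set<Long> targets = outgoing.get(from);
        return targets != null && targets.remove(to);
    }

    // Delete: removes a node together with all edges touching it.
    boolean deleteNode(long id) {
        if (properties.remove(id) == null) {
            return false;
        }
        outgoing.remove(id);
        for (Set<Long> targets : outgoing.values()) {
            targets.remove(id);
        }
        return true;
    }
}
```

Even this tiny example hints at the distributed difficulty: `deleteNode` has to touch the adjacency sets of other nodes, which may live on different machines once the graph is partitioned.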

    5. Analysis of Basic graph data structures vs distribution strategies vs Basic CRUD operations

    As expected, the graph algebra will consist of some atomic CRUD operations. Those operations have to be tested against all the different data structures one can think of, in the different known distributed environments, over several different real-world data sets. This task will be rather straightforward. It will be possible to know the theoretical results of most implementations. The reason for this experiment is to collect experimental experience in a distributed setting and to understand what is really happening and where the difficulties in a distributed setting are. Already in the evaluation of Graphity I realized that there is a huge gap between theoretical predictions and the real results. I am therefore convinced that this experiment is a good step forward, and the deep understanding gained from actually implementing all this will hopefully lead to:

    6. Development of hybrid data structures (creative input)

    It would be the first time in my life that I ran such an experiment without any new ideas for tweaking and tuning coming up. So I expect to have learnt a lot from the first experiment and to have some creative ideas on how to combine several data structures and distribution techniques into a better (especially better scaling) distributed graph data base technology.

    7. Analysis of multiple user access and ACID

    One important aspect of a distributed graph data base that has not been in the focus of my research so far is the part that actually makes it a data base and sets it apart from a graph processing framework. Even after finding a good data structure and distribution model, new limitations arise once multiple user access and ACID are introduced. These topics are to some degree orthogonal to the CRUD operations examined in my first planned experiment. I am pretty sure that the experiments from above and more reading on ACID in distributed computing will lead to more research questions and to ideas on how to test several standard ACID strategies for several data structures in several distributed environments. In this sense this chapter will be an extension of chapter 5.

    8. Again creative input for multiple user access and ACID

    After having learnt what the best data structures for basic query operations in a distributed setting are, and what the best methods to achieve ACID are, it is time for more creative input. The goal will be to find a solution (data structure and distribution mechanism) that respects both the speed of basic query operations and the ease of achieving ACID. Once this is done, everything is straightforward again.

    9. Comprehensive benchmark of my solution with existing frameworks

    My own solution has to be benchmarked against all the standard technologies for distributed graph data bases and graph processing frameworks.
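
As a rough idea of what such a benchmark harness could look like, here is a small Python sketch that runs one and the same workload against several interchangeable backends and reports the best wall-clock time for each. Everything in it is an assumption made for illustration: the two toy backends `DictOfSets` and `EdgeList`, the `ring_workload`, and the function `benchmark` are hypothetical, run in a single process, and do not model distribution or any existing framework.

```python
import time


def benchmark(backends, workload, repeats=3):
    """Run the same workload on every backend; return best time in seconds per backend."""
    results = {}
    for name, factory in backends.items():
        timings = []
        for _ in range(repeats):
            store = factory()                  # fresh, empty instance for every run
            start = time.perf_counter()
            workload(store)
            timings.append(time.perf_counter() - start)
        results[name] = min(timings)           # best of `repeats` runs
    return results


class DictOfSets:
    """Hypothetical backend: adjacency sets keyed by source node."""

    def __init__(self):
        self.adj = {}

    def add_edge(self, src, dst):
        self.adj.setdefault(src, set()).add(dst)

    def has_edge(self, src, dst):
        return dst in self.adj.get(src, ())


class EdgeList:
    """Hypothetical backend: flat list of (source, target) pairs."""

    def __init__(self):
        self.edges = []

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

    def has_edge(self, src, dst):
        return (src, dst) in self.edges        # linear scan over all edges


def ring_workload(store, n=200):
    """Create a ring of n edges, then read every edge back."""
    for i in range(n):
        store.add_edge(i, (i + 1) % n)
    for i in range(n):
        assert store.has_edge(i, (i + 1) % n)


if __name__ == "__main__":
    report = benchmark({"dict_of_sets": DictOfSets, "edge_list": EdgeList}, ring_workload)
    for name, seconds in sorted(report.items(), key=lambda kv: kv[1]):
        print(f"{name}: {seconds:.6f} s")
```

The point of the design is that every system under test only has to implement the same small interface, so the workload (ideally derived from real data) stays identical across all of them.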

    10. Conclusion of my PhD proposal

    So the goal of my PhD is to analyse different data structures and distribution techniques for the realization of a distributed graph data base. This will be done with respect to a good runtime of some basic graph queries (CRUD) that respect a standardized graph query algebra, as well as with respect to multi-user access and the paradigms of ACID. 

    11. Timetable and milestones

    This is a rough schedule fixing some of the major milestones.

    • 2012 / 04: hand in PhD proposal
    • 2012 / 07: graph query algebra is fixed. Maybe a paper is submitted
    • 2012 / 10: experiments of basic CRUD operations done
    • 2013 / 02: paper with results from basic CRUD operations done
    • 2013 / 07: preliminary results on ACID and multi user experiments are done and submitted to a conference
    • 2013 / 08: research internship of at least 3 months in a company, benchmarking my system on real data
    • end of 2013: publishing the results
    • 2014: 9 months of writing my dissertation

    If you have input, know of relevant papers or can point me to similar research, I would be more than happy if you contacted me or started a discussion!
    Thank you very much for reading so far!

    ]]>
    https://www.rene-pickhardt.de/phd-proposal-on-distributed-graph-data-bases/feed/ 11
    Paul Wagner and Till Speicher won State Competition "Jugend Forscht Hessen" and best Project award using neo4j https://www.rene-pickhardt.de/paul-wagner-and-till-speicher-won-state-competition-jugend-forscht-hessen-and-best-project-award-using-neo4j/ https://www.rene-pickhardt.de/paul-wagner-and-till-speicher-won-state-competition-jugend-forscht-hessen-and-best-project-award-using-neo4j/#comments Fri, 16 Mar 2012 11:18:38 +0000 http://www.rene-pickhardt.de/?p=1204 6 months of hard coding and of supervision by me are over and have ended in a huge success! After analyzing 80 GB of Google ngrams data, Paul and Till loaded it into a neo4j graph data base in order to make predictions for fast sentence completion. Today was the award ceremony, and the two students from Darmstadt and Saarbrücken (respectively) won first place. Additionally they received the “beste schöpferische Arbeit” award, which is the award for the best project in the entire competition (over all disciplines).
    With their technology and the almost finished android app, typing will be revolutionized! While a sentence is being typed they are able to predict the next word with a recall of 67%, creating a huge additional value for today’s smartphones.
    So stay tuned for upcoming news and for the federal competition in May in Erfurt.
    Have a look at their website, where you can find the documentation (still in German) as well as the source code and a demo, which I also include here (use tab completion as in the unix bash (-: ).
    Right now it only works for the German language, since only German data was processed, so try sentences like

    • “Warum ist die Banane krumm” (where the rare word krumm is correctly predicted because of its relation to the famous question “why is the banana curved?”)
    • “Das kann ich doch auch” (I am also able to do that)
    • “geht wirklich nur deutsche Sprache ?” (Does really only the German language work?)
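
To give an impression of how next-word prediction from n-gram counts can work in principle, here is a minimal bigram sketch in Python. It is not Paul and Till's implementation and does not use neo4j or the Google ngrams data: the functions `train_bigrams` and `predict_next` are hypothetical, and any corpus passed in is up to the caller. In a graph view, each counted word pair corresponds to a weighted edge from the previous word to the next word.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word directly follows each other word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def predict_next(following, prev_word, prefix="", k=3):
    """Return up to k most frequent successors of prev_word that start with prefix."""
    candidates = following.get(prev_word.lower(), Counter())
    matches = [(w, c) for w, c in candidates.items() if w.startswith(prefix.lower())]
    # Highest count first; ties are broken alphabetically to keep the result stable.
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in matches[:k]]
```

The `prefix` parameter mimics tab completion: the letters typed so far narrow down the candidates, so even a rare word can be completed once its first characters are known.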


    Unfortunately your browser cannot display embedded frames. You can open the embedded page via the following link: Demo (http://complet.typology.de)

    ]]>
    https://www.rene-pickhardt.de/paul-wagner-and-till-speicher-won-state-competition-jugend-forscht-hessen-and-best-project-award-using-neo4j/feed/ 11