Virtual Payment Channels for the Lightning Network – A Lightning Network Improvement Proposal draft https://www.rene-pickhardt.de/virtual-payment-channels-for-the-lightning-network-a-lightning-network-improvement-proposal-draft/ Tue, 17 Jul 2018 11:07:50 +0000

I suggest adding another feature to the lightning network which I call a virtual payment channel (or VPC for short), to be described in this article. Such a virtual payment channel will NOT be backed by the blockchain and thus cannot operate in a trustless manner. However, I see several use cases for virtual payment channels. The most important one seems to be that two parties who agree to create a virtual payment channel can instantly open a channel of basically arbitrary capacity. This will help to improve the max-flow properties of the network and make routing much easier. A virtual payment channel basically allows both channel partners to access the funds of the other channel partner, which yields some flexibility and obviously a high risk in case of fraudulent behavior. While writing down the article I realized that there are also some dangers to virtual payment channels. The main one is that to some degree VPCs resemble the fractional reserve banking system, even though it is still only credit based on positive money (I think the English language should borrow the German word “Vollgeld-System“). At first I considered dropping the article because the last thing we need is a cryptocurrency crisis similar to the banking crisis that we have been through. However, I think that when regulated and used properly VPCs will add more benefit than danger to the lightning network. Despite the drawbacks, with this article I will still propose this feature.

I came up with the concept of virtual payment channels since I finally started my work on a splicing protocol specification for BOLT 1.1. On paper I have already laid out the data flow for asynchronous, non-blocking splicing in and splicing out operations – which pretty much resemble the ones discussed on the github issue. I will write about them soon. Before digging deeper into laying out my specification I wanted to increase the usefulness of the splicing operations. So I started to wonder whether splicing could take place without ever needing to hit the blockchain. While playing around with some ideas I realized (though I have not proven it) that splicing without hitting the blockchain seems impossible if the trustless property for payment channels is mandatory. Hence I asked myself what would happen if we dropped this central property, and as a result I accidentally came up with the concept of virtual payment channels.

Having talked to larger exchanges and potential hubs of the lightning network in the past, and having seen how much they wonder about rebalancing their funds and about how they could provide capacity without uselessly locking their liquidity, I realized that virtual payment channels might be a solution to some of their core problems. Throughout this article and at its end I will provide use case stories from the perspective of a central node in the network as well as from the perspective of a mining pool.

Let us begin by looking at one of the two core elements of the lightning network: the (standard) payment channel. A standard payment channel operates in a trustless manner, which allows two parties to create a payment channel without the need to trust each other. The ability to create trustless payment channels is certainly a very useful property that the lightning network must have. However, this trustless property comes at a rather high price and brings the following three drawbacks (which would still exist even if we upgraded payment channels to follow the eltoo protocol proposal):

  1. Even when splicing is implemented, channels need to lock the bitcoins that provide the capacity for the channel. Meanwhile those bitcoins cannot be used elsewhere, making the lightning network and its topology somewhat rigid. (In particular if you take into consideration how much the topology could at most change within one bitcoin block.)
  2. Creating (and closing) channels (and splicing funds) are blockchain transactions which in general are slow and expensive. Slow because we have a block time of roughly ten minutes. Expensive because, independently of the block size discussion, there is a physical limit on how many transactions fit into a block. Remember that block space for a transaction is generally bought by paying mining fees.
  3. Bitcoins within the payment channel are in a hot wallet and it is difficult to secure them properly. Even if the output of the commitment transaction script is spendable only by a cold wallet address, we have no guarantee that an attacker taking over the lightning node(s) could not negotiate a different output for the next commitment transaction (for example if splicing out of funds is implemented), or, even more easily, the attacker could just make a payment to some address, emptying the channel.

Additionally an indirect problem will be that companies like exchanges, large scale shops, payment providers or even banks will have many payment channels since they are likely to be central nodes in the network. The number of channels might exceed the capacity of the hardware running such a node. So people providing core financial infrastructure might need to operate several lightning nodes (let us call them A, B, C, …) to serve their customers. Companies then have to decide, by collecting and evaluating data, which customers to connect to which of their nodes. They also have to connect their own nodes to each other via heavily funded payment channels. Even if done in some smart circle-like topology (with channels so that each node can be reached in log N hops) this is not preferable, since far too many funds are unnecessarily locked just to clear funds between one’s own nodes. Another drawback is that those additional funds are also in hot wallets again and need to be available in the first place.

Let us assume in the above setting some node N1 wants to make a payment to some node N2. The route for the payment happens to be:

N1 –> A –> B –> N2.

In the classical lightning network the routing could only take place if A could literally send over the requested money to node B within a payment channel between A and B. For brevity I omitted the exchange of the preimages and the negotiation of HTLCs and new commitment transactions. Those details are defined in the protocol specification so that A and B don’t lose money while they route a payment from N1 to N2.

However, if A and B are owned by the same entity (as assumed above) there would be no harm if A received some money and B forwarded it onward without A actually transferring the money to B. In particular a payment channel with bitcoins locked between A and B would be superfluous. This observation is exactly the idea behind a virtual payment channel.

A high-level protocol for operating a virtual payment channel could look like this (a minimal data-structure sketch follows the list):

  1. Two nodes would agree on an amount both sides provide to the virtual channel. The money does not necessarily have to be backed somewhere (though having the money available is strongly advisable) and the initial balance also does not have to be symmetric.
  2. The channel balance is maintained by both sides as a balance sheet that needs to be cryptographically signed by both sides for every new update. Even though it will not be possible to enforce the final balance via the blockchain those balance sheets might be enforceable by real courts. (Which technically should never happen as VPCs should only be created if there is complete mutual trust between the channel partners)
  3. The contract between both sides includes the fact that the balance sheet with the highest number of mutually signed updates is the correct one and considered to be used for settlement. (I will repeat myself: Such a contract could only be enforced by a real court and not by the blockchain.)
  4. Channel creation is the first update on the balance sheet and needs to be signed by both parties. In that way a party can decline to open a virtual payment channel by refusing to sign the balance sheet.
  5. The channel is then announced to the network as a virtual payment channel via an extension of the gossip protocol. The extension is necessary because currently the blockchain funding transaction needs to be referenced as a proof of existence of the channel. The transaction is encoded in the short_channel_id field of the announcement_signatures message as well as the channel_announcement message. From the perspective of the network it is not any different from other channels. It can be used for routing payments since every hop is a bilateral contract. The rest of the network doesn’t have to care whether that contract is actually enforceable by the blockchain or whether there is some other trust provider between the channel partners. I literally see no risk for the sender and recipient of the funds in routing money through a third-party virtual payment channel.
  6. The channel announcement should be done by both sides and should at least be signed with their own bitcoin keys. This will prevent spamming of channels which can’t be used for routing because the partners do not agree that the channels exist. (cf. point 4)
  7. Routing along a VPC takes place without an HTLC; the channel partners just update the channel balance. (In fact they could even replicate the HTLC behavior on their balance sheet, but I don’t see the benefit.)
  8. When closing the channel one party has to somehow reimburse the other side depending on the end balance in comparison to the start balance. This problem of settlement is the problem of the two partners and not part of the protocol. From a protocol perspective any channel partner can just announce the closing of the channel, which takes effect immediately. This is precisely the moment where trust and IOUs come into play.
  9. Closing of the payment channel also has to be announced via the gossip protocol.
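
To make the balance-sheet mechanics above (points 2–4 and 7) more concrete, here is a minimal Python sketch. It is purely my own illustration and not part of any BOLT specification: the message format and the field names are made up, and the HMAC only stands in for the real digital signatures (e.g. secp256k1 / ECDSA) that an actual implementation would use.

import hmac, hashlib, json

def sign(priv_key: bytes, payload: dict) -> str:
    # stand-in for a real ECDSA signature over a canonical serialization
    msg = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(priv_key, msg, hashlib.sha256).hexdigest()

class VirtualChannel:
    def __init__(self, key_a, key_b, balance_a, balance_b):
        self.keys = {"A": key_a, "B": key_b}
        self.updates = []                       # every state, mutually signed (point 2)
        self._append(balance_a, balance_b)      # update 0 = channel creation (point 4)

    def _append(self, balance_a, balance_b):
        base = {"update": len(self.updates), "balance_A": balance_a, "balance_B": balance_b}
        state = dict(base, sig_A=sign(self.keys["A"], base), sig_B=sign(self.keys["B"], base))
        self.updates.append(state)

    def pay(self, payer, amount):
        last = self.updates[-1]
        a, b = last["balance_A"], last["balance_B"]
        a, b = (a - amount, b + amount) if payer == "A" else (a + amount, b - amount)
        self._append(a, b)                      # highest mutually signed update wins (point 3)

chan = VirtualChannel(b"priv_A", b"priv_B", balance_a=0, balance_b=1_000_000)
chan.pay("B", 25_000)   # B "routes" 25k sat to A by just updating the sheet (point 7)
print(chan.updates[-1])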

Attention! If for some reason the other party turns out not to be trustworthy, one could lose as many funds as the other side virtually provided. In the best case the two nodes are owned by the same person or company, or there is some written agreement which backs the virtual payment channel. However, obviously one cannot enforce such circumstances and prohibit arbitrary people from creating a virtual payment channel. That is why virtual payment channels have to be used with extreme caution. As soon as you accept an inbound virtual payment channel from a third party you vouch with your own funds for the payments of your channel partner. Besides that there are a couple of other drawbacks I see with this proposal.

The 3 biggest drawbacks that I see are:

  1. A virtual payment channel can be seen as an IOU, resembling to some degree the mechanics of a demand deposit. One side can provide funds to a VPC that it doesn’t possess and have the other side spend those funds. One cannot tackle this issue by implementing a rule that the network would not accept virtual payment channels if the sum of the amounts in the virtual payment channels is bigger than the sum of the capacities of the standard payment channels. The reason is that the network cannot prove, after a virtual payment channel is closed and reopened, that settlement between both partners took place. Hence the potential channel capacity is really infinite. This does not mean that an infinite number of bitcoins can be created out of thin air.
  2. Infinite channel capacity is not a direct problem: in the worst case, when routing takes place from A to B, the standard channels of B might eventually be dried out in the sense that no further routing away from B can take place. However, a network or circle of virtual payment channels could lead to weird chain reactions in which I pay someone and the money is only routed through virtual payment channels, creating literally a money flow out of thin air. I did not calculate this through, but I guess at some point it will lead to imbalances with real channels; still, in the first place creating more virtual payment channels is not prohibited.
  3. Since the balance sheet is signed by both parties, leaking the key used for signing is like leaking the keys of a traditional payment channel. So even in a virtual payment channel the bitcoins are somewhat hot. One cannot directly steal bitcoins, but one could initiate sending the virtual balance to nodes one controls, and the virtual balance on the routes eventually becomes real balance. So again there needs to be complete trust among channel partners.

Things to take into consideration:

  • It is always safe to send money over a virtual payment channel.
  • There is no risk in providing virtual balance in a virtual payment channel.
  • The risk lies in receiving money on a virtual payment channel, because the recipient needs to be sure she will get settled later. This holds true in particular if receiving the money was linked to a task like routing it further through standard channels or selling some goods, both of which should be the standard use cases.
  • One could adopt routing fees that reflect those risks, depending on the direction in which the virtual payment channel is used and on who provides it.

I will now describe a realistic setting which would benefit hugely from the chance to create virtual payment channels.

A use case for virtual payment channels: Payouts in mining pools.

Let us assume you are operating a mining pool. You want to do daily payouts for your – let us say 100k – customers. Obviously your customers already trust you, since they commit their hash power to your pool and trust that you will eventually do a payout.

The payout scenario without virtual payment channels and without the lightning network.

Every day one has to fit 100k transactions onto the blockchain. With batching one could obviously use fewer transactions, but the number of transactions and the fees will still be considerable.

With the lightning network the situation does not even get better:

For the payouts on day one we have to create payment channels. Obviously a mining pool cannot overfund those channels by too much, since it would already need to have the liquidity to do so. So on day two the channels either have to have funds spliced in or be closed and reopened, unless of course the money has already been spent by the customer over the lightning network. (This however will only be possible if additional funds of the mining pool are used to fund other external channels, giving customers the possibility to spend their funds in a meaningful way.) We see that using lightning in that case actually seems to increase the number of blockchain transactions and the amount of liquidity that is being used.

Obviously the answer to the sketched problems could be that the mining pool does not have to operate channels to all its customers but can just rely on the lightning network to work in order to find routes to the customers while doing payouts. I predict that in many cases it won’t be possible for the pool to find a route to the customer when she creates an invoice and a channel has to be funded anyway.

Situation with virtual payment channels on the lightning network: 

The pool could accept a virtual payment channel in which the customer’s side of the VPC holds funds as high as the amount the pool owes to the customer. This does not require a blockchain transaction and saves fees. For the second payout on the next day the balance can just be increased by the amount the pool owes to the user. Obviously this kind of payout is not real, as the customer does not directly control the funds. However the pool provides cryptographic proof and a signature to the customer, which again cannot be enforced by the blockchain. The customer can at any time pay via lightning using exactly this virtual payment channel.

In order for this to be useful for the customer, his virtual payment channel needs to be connected to the “real” lightning network. This is achieved by the mining pool, which has to have several outgoing channels funded to rather central nodes in the network (e.g. shops, exchanges, payment providers, …). Since the pool can assume that not all customers will spend or transfer all their funds immediately, the pool only has to provide a fraction of the funds allocated in the virtual payment channels as outgoing capacity in the other, real channels. This reduces the amount of liquidity that needs to be in the hot wallet of the lightning network node. In case a channel dries out, funds will be spliced in or the channel is closed and reopened. This will happen far less frequently than payouts to all customers and thus reduces load on the blockchain. Also, the customer can already receive bitcoins over the lightning network on this virtual payment channel (if he decides to trust even more funds to the virtual channel). In any case the customer can obviously also accept or demand payments on a standard payment channel by disallowing inbound routing over the virtual payment channel.

Finally the customer could even get hold of his funds by fulfilling inbound routing requests on his standard payment channels which do outbound routing over the virtual payment channel.

For customers of the pool this mechanism probably also has other great advantages:

  • They could configure their software so that spending and routing go first through the VPC, so that their money is spent quickly.
  • They will probably be connected much better to many shops, since the mining pool is maintaining and funding those real payment channels.
  • One could even think of a feature in which the pool “splices out” money of a virtual payment channel by funding a standard payment channel for the user in case the pool can’t find a route to the recipient of the customer. The pool could obviously just open that channel itself in order to have better connectivity to the lightning network in the future, improving the service for all its customers.

Conclusion:

We see that virtual payment channels increase the flexibility for groups of users that for some reason trust each other. We can think of a virtual payment channel as an agreement to spend the funds of the channel partner in one’s own name. Applications are for services that have plenty of customers and don’t want to reallocate funds all the time, as well as for very central power nodes that cannot handle all of their traffic on one computer and need to distribute the node across several computers. However, virtual payment channels should be used with great caution as one needs to trust another party. In the case of just distributing one’s own node this will never be an issue. In the case of the mining pool scenario trust already needed to exist in the first place, so it should be fine, but it still poses a risk for the customer. The biggest downside is that a virtual payment channel can be seen as creating liquidity out of thin air and is a kind of credit that might break, forcing chain reactions. However, in this case it is just the channel partner who should have known better than to trust blindly. In particular, if large operators go bankrupt as a follow-up of such a chain reaction, everyone who has standard payment channels with those operators won’t lose his or her funds, thanks to the trustless property of the lightning network.

To the best of my knowledge the concept of virtual payment channels for the lightning network has not been proposed before. If I am mistaken I will be happy to receive a pointer. Now I am very excited to see what you as a community will think about my proposal. In particular I am excited since I understand that my proposal violates the core philosophy of the lightning network, namely that payment channels can operate in a trustless manner. Yet I think progress will only happen if creativity is allowed by thinking about all possibilities instead of fixed limitations. Also it will not change the trustless nature of the lightning network, as this will only occur if two consenting parties agree to engage in such an endeavor. As outlined in this article they might have good reasons to do so. In the great tradition of Bitcoin Improvement Proposals (BIPs) I suggest that we start a process for Lightning Network Improvement Proposals (aka LIPs). This draft would need to be specified a little bit more if people think it would be a useful feature and could be such a LIP in the end.

Obviously, even if my proposal doesn’t make it into future BOLT specifications, the idea is out and we cannot prevent companies like a mining pool operator from implementing this within the proprietary lightning-supported wallet which they provide to their customers. However I think it would be much more preferable if virtual payment channels either become part of the protocol standard or are never used.

If you liked this article and are interested in the follow ups you can follow me on twitter.

 

 

Thoughts about eltoo: Another protocol for payment channel management in the lightning network https://www.rene-pickhardt.de/thoughts-about-eltoo-another-protocol-for-payment-channel-management-in-the-lightning-network/ Sun, 08 Jul 2018 21:56:08 +0000

In this article I will give an overview of the proposed lightning network extension eltoo. There has been quite some buzz about the eltoo paper proposed by Christian Decker, Rusty Russell and Olaoluwa Osuntokun in May 2018. Bitcoin magazines and blogs have been making promising statements that lightning will become even faster than it currently is. According to them eltoo allows payment channels to interact less with bitcoin’s blockchain (which, I will state right away, is just plain wrong).

Having read Christian Decker’s main publication for his dissertation, in which the authors suggest the use of invalidation trees to invalidate old states of a payment channel, I was confident that eltoo would be a really nice improvement for the lightning network protocol. However, following the “Don’t trust. Verify!” philosophy of the Bitcoin community, I obviously had to study the eltoo protocol proposal myself. In particular I did not believe the bullish news from the magazines, since from a technical perspective I do not see how the speed of the lightning network can be significantly increased. The speed of the lightning network in its current state is basically bound by the speed of TCP/IP sockets at the payment channel level and, during routing, by the time locks of the HTLCs. Since eltoo is only looking at channel management we can obviously omit the latter case. As I am starting to work on the splicing part of the BOLT 1.1 specification, which is a protocol extension of payment channel management, I obviously need to understand eltoo anyway.

Yesterday, after holding an openly licensed German-language lecture on the basics of the lightning network technology (not to be confused with BOLT, the Lightning network RFC), I had a 5-hour train ride home from the event and finally found the time to carefully study the eltoo paper. The paper is very well structured and written, and I thought it might be helpful if I share my understanding of the proposed protocol back with the community. However, my main reason is that I hope that by writing down and explaining the paper I get an even clearer understanding myself. In case I misunderstood something about the eltoo paper please let me know so that I can update this article. Also there are some rather technical questions I still have (for example this one, which I have already posted on bitcoin stack exchange).

So let us analyze eltoo by understanding the motivation for yet another payment channel protocol and by reviewing some well-known facts about current payment channel management.

Unpublished commitment transactions encode the balance (or, more technically speaking, the state) of a payment channel in the lightning network.

For the current version of the lightning network payment channel protocol we need to store information about all the old states of a channel. With the state of a channel we mean the current distribution of the channel’s capacity (aka the balance) among the channel partners. At its core the current payment channel (and also the solution proposed in eltoo) is just a 2-of-2 multisignature wallet. Currently the balance of a payment channel is encoded by two unpublished transactions that would spend the channel’s capacity according to the balance sheet, with the channel partners as recipients. We call those transactions commitment transactions. These transactions are both signed by both channel partners and each partner has one of them stored on his lightning node. This creates some kind of information asymmetry and a risk of funds being stolen should the information leak to the channel partner.

Funds within the channel are maintained by creating a new commitment transaction (kind of a double spend) that spends the funds of the multisignature wallet (the channel’s capacity) in a different way. In order for both parties to be able to trust in the newly agreed upon channel balance we obviously have the problem of invalidating the old commitment transactions. In a trustless setting (which we wish to achieve) channel partners cannot rely on their counterparts to delete the old commitment transaction once they agree on a new channel balance. The reason is that those commitment transactions are valid bitcoin transactions signed by both channel partners, and both partners own a copy and could theoretically have made more copies in the meantime.

All currently known protocols for maintaining a full duplex payment channel have in common that old channel states need to be invalidated. 

The current implementation of the lightning network achieves the goal of invalidating old channel states by including the hash of a revocation key within the output script of a commitment transaction. The revocation key itself can only be known by the channel partner that does not control a signed copy of that transaction. (Again this is where the information asymmetry kicks in.) The output script of the commitment transaction will eventually spend the funds according to the channel balance that was agreed upon for this particular commitment transaction. However, it allows the output of a published commitment transaction to be spent directly, sending the entire capacity of the channel to one’s own bitcoin address, once one can present the signature from the secret revocation key. If channel partners wish to update to a new channel state they will collaboratively compute the revocation key whose hash is stored in the most current commitment transaction. The important requirement of the protocol design is that neither partner can compute the revocation key without the help of the other partner. Owning the revocation key prevents the channel partner from publishing an old commitment transaction (aka state), since the publishing side would be punished by the other side releasing the revocation key and spending the outputs of the commitment transaction directly. Note that releasing the revocation key itself is not dangerous since it can only be useful in the output script of the transaction which includes the hash of the revocation key. As you would imagine, each new state / commitment transaction includes the hash of a new revocation key. Therefore it is really the old transaction that needs to be kept secret or, even better, destroyed. Obviously with such a setup both sides need to store all old revocation keys in order to be able to protect themselves from the other side publishing an old channel state (which for example could be a state in which all the funds belonged to the side that tries to publish this state).
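
The punishment mechanism described above is essentially a commit-and-reveal pattern. The following toy Python sketch is only meant to illustrate that pattern; real lightning channels derive per-commitment secrets and enforce everything in Bitcoin script, and the plain SHA-256 commitment and the dictionary “transaction” below are simplifications I made up.

import hashlib, secrets

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# While negotiating a state, the hash of a fresh revocation secret is embedded
# in the commitment transaction held by the other side.
revocation_secret = secrets.token_bytes(32)
old_commitment = {"state": 7, "to_local": 40_000, "to_remote": 60_000,
                  "revocation_hash": h(revocation_secret)}

# When the partners move on to state 8 they exchange the secret for state 7,
# which makes publishing the old commitment unsafe for the would-be cheater.
def outcome(published_commitment, known_secret):
    if known_secret is not None and h(known_secret) == published_commitment["revocation_hash"]:
        return "punished: the honest side sweeps the entire channel capacity"
    return "no revocation secret known: the commitment stands after its timelock"

print(outcome(old_commitment, revocation_secret))        # publishing the revoked state 7
print(outcome({"state": 8, "to_local": 35_000, "to_remote": 65_000,
               "revocation_hash": h(secrets.token_bytes(32))}, None))  # publishing the current state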

Eltoo suggests a different protocol for channel management such that neither side needs to store all the old states but just the current one. This is supposed to simplify implementing lightning nodes and also running them. According to Christian Decker this will also simplify implementing features like splicing. (On a side note I have not thought the splicing operation completely through for eltoo, but it seems obvious to me that the chain of update transactions in eltoo can easily be extended by at least splice-out operations.)

The initial setup of a payment channel within the Eltoo protocol is in my opinion exactly the same as in the current specification of the lightning network.
A payment channel is created by funding a 2-of-2 multisignature wallet. The state of the channel is again encoded within a transaction that is just not published yet. Only the process of invalidating old state differs. Both channel parties create two sets of public and private keys. The first set is called the update keys and the second set is called the settlement keys. Actually the settlement keys should be derived from a hierarchical deterministic wallet so that settlement transactions can only be bound to one specific update transaction (but I guess that is a rather subtle though important detail). According to Christian Decker what is new is that

… commitment is no longer a single transaction customized for each side, but an update/commitment that is symmetric and a settlement built on top.

A chain of update and settlement transactions allows for encoding channel state.

The state of the channel is encoded as the most recent element of a sequence of settlement transactions, with the current settlement transaction encoding the current channel state. All the old settlement transactions are not needed and can safely be thrown away. Though not the same as commitment transactions in the current lightning network implementations, the settlement transactions can probably be considered the analogous counterpart. Each settlement transaction spends a so-called update transaction. The update transactions are formed as a “linked list” starting from the funding transaction (which could already be interpreted as an update transaction). The update transactions spend the output of the funding transaction (or of the previous update transaction(s!)). The output of the new update transaction is a two-path multisignature script: in the first path the output can be spent directly with the pair of update keys; in the second path the output can be spent with the settlement keys once a timelock has expired. This always gives priority to successfully double spending an old settlement transaction with a newer update transaction. In case the newest settlement transaction is being broadcast there exists no newer update transaction, which is why the double spend by another update transaction cannot be achieved by a single entity in the channel. Therefore, after the timelock is over, the channel is closed.
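
The priority rule between the two spending paths can be summarized in a small Python sketch. This is a toy model only: timelocks, scripts and the actual transaction structure are heavily simplified, and the function below is my own paraphrase of the rule, not code from the eltoo paper.

def spend_of_update_output(published_state, newest_state, timelock_expired):
    """Which spend of a published update transaction's output wins?"""
    if newest_state > published_state:
        # Update-key path: a later update transaction can be bound to the
        # published one immediately and double spends the old settlement.
        return f"update tx with state {newest_state} replaces state {published_state}"
    if timelock_expired:
        # Settlement-key path: no newer update exists, so after the timelock
        # the settlement transaction closes the channel.
        return f"settlement of state {published_state} closes the channel"
    return "wait: the settlement path is still timelocked"

print(spend_of_update_output(published_state=3, newest_state=9, timelock_expired=False))
print(spend_of_update_output(published_state=9, newest_state=9, timelock_expired=True))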

This setup is bad, however, since the current channel balance is encoded in a settlement transaction which is the child of the latest update transaction, which is the child of the previous update transaction, and so on. Therefore, in order to close the channel one would need to broadcast the complete chain of update transactions. This would be bad as it would not take any load off the blockchain in comparison to directly transferring bitcoins every time one would want to update a payment channel, and we would also need to store all old update transactions, not really decreasing the overhead. With the current version of the bitcoin protocol there is nothing we can do about this, and eltoo would be nothing but a nice thought and completely impractical. However, a minor update to bitcoin (which was actually already talked about in the current lightning network paper) has finally been proposed as BIP 118 and would make eltoo useful.

Eltoo achieves the simplification by introducing the idea of floating transactions which are realized by introducing a modifier SIGHASH_NOINPUT for the existing OP_CODES check{,multi}sig{,verify}

With SIGHASH_NOINPUT the transaction that is being created to spend some output will ignore the prevout_point (tx hash, output index), sequences, and scriptCode for its signature calculation. This allows changing the output that is being spent as long as the output script matches the input script. Since the output scripts of the update transactions are all the same (at least for the path that spends the output again to the update keys of the 2-of-2 multisig wallet), one can bind a later update transaction to the parent or grandparent or great-grandparent or … of the most recent update transaction. (I don’t see the point of summarizing all the technical details here since they are very well described in the paper and I don’t see how I would formulate them differently or more briefly.) What this means, however, is that instead of needing to publish all previous update transactions, the current one can just be bound to the funding transaction and be broadcast together with its settlement transaction. Since there is a timelock in the update transaction (it could be an invalid one) the settlement transaction should obviously only be broadcast after the timeout, since it will be rejected by the bitcoin network beforehand.
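
A stripped-down analogy in Python shows why leaving out the prevout makes a signature reusable across different parents. This is explicitly NOT the BIP 118 digest algorithm (which operates on the real serialized transaction); the dictionaries and the JSON hashing are just an illustration of which fields are or are not committed to.

import hashlib, json

def toy_sighash(tx: dict, noinput: bool) -> str:
    fields = dict(tx)
    if noinput:
        fields.pop("prevout", None)   # the prevout (txid, index) is not committed to
    return hashlib.sha256(json.dumps(fields, sort_keys=True).encode()).hexdigest()

update_k = {"prevout": ("funding_txid", 0), "script": "2-of-2 update keys", "outputs": ["state k script"]}
rebound  = {"prevout": ("update_k_minus_1_txid", 0), "script": "2-of-2 update keys", "outputs": ["state k script"]}

print(toy_sighash(update_k, noinput=False) == toy_sighash(rebound, noinput=False))  # False: classic sighash ties the signature to one parent
print(toy_sighash(update_k, noinput=True)  == toy_sighash(rebound, noinput=True))   # True: the same signature binds to either parent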

That being said, the current channel state is defined by the already mined funding transaction as well as the most recent update transaction and its child, the settlement transaction. In case channel partner Mallory wanted to publish an older state (encoded within some settlement transaction) she would first need to publish the update transaction that is the parent of this very settlement transaction. She can do so, as she can also bind that corresponding update transaction to the funding transaction. However, if she now tries to spend the settlement transaction there is the path in the output script of the update transaction that allows for direct spending by a newer update transaction (which can be the newest update transaction – known by the other side – being bound to that particular update transaction). This will prevent old state from entering the blockchain.

There are some minor details about including a trigger transaction which I will omit here. However, it seems important to mention that this mechanism also requires the other side to be online or to have a watchtower mechanism in place. Also, with this mechanism, in the uncollaborative closing case, the collaborative one and the case where one party tries to steal funds, at least three transactions need to be broadcast to the blockchain (the trigger transaction, the most recent update transaction and the settlement transaction). In that sense I understand eltoo as a tradeoff between convenience of channel management and minimal blockchain involvement. As I mentioned during the autopilot discussion at the lightning hackday, the blockchain will still need about 10 years to have one payment channel open for every internet user in the world. On the other side eltoo seems to have many advantages over the current protocol for payment channel management.

Advantages of eltoo:

  • The biggest pro seems to be that it is easily compatible with the current lightning network. It does not matter if the network uses various protocols for payment channel management.
  • As fraudulent behavior can also happen by accident (e.g. a bug in the implementation) it is nice that with eltoo the punishment of fraudulent behavior is not achieved by giving all bitcoins to the honest side but rather by forcing a channel close and distributing the funds according to the last agreed upon channel balance.
  • Way less protocol overhead on the implementation side (only true if we eventually only support eltoo channels and don’t have to support both types of payment channels).
  • (Although I did not explain the details here) Fees for settling (closing) the channel do not need to be decided on while creating the payment channel but can be chosen a posteriori.
  • Possibility to easily create multi-user payment channels (though all parties have to be online in order to update the channel balance, which is sad because otherwise we could easily create payment caves in which thousands of participants would share the balance, increasing liquidity in the network); also the transaction size increases (but still stays smaller than having several 2-party channels created among the participants). Christian Decker pointed out that with Schnorr signatures there won’t be a difference in the size of the transactions for 2-party channels or n-party channels. I see that the signature part of the transaction won’t grow, but if more parties are on the channel more outputs are being created. (Maybe I am missing something. I haven’t had the chance to look at Schnorr signatures in detail.)
  • Since honest nodes can safely forget old channel state we should never run into memory issues while maintaining one channel.

Disadvantages of eltoo:

  • It might incentivize people to try to behave in a fraudulent way since there is only the risk of losing fees involved.
  • SIGHASH_NOINPUT needs to be implemented in future bitcoin versions. Though the authors claim the protocol is backward compatible with bitcoin and the lightning network, this seems like a typical IT mess. It will probably take time for bitcoin and lightning network nodes to update. Obviously a routing network can be created with various channel management protocols as long as they support HTLCs and the internal API for routing. However it seems quite cumbersome to implement future features to support both eltoo and the classical lightning payment channels. One strategy could be to just implement splicing on top of eltoo or to implement it with a higher abstraction of channel management (not sure yet if that is possible).

Open Questions:

  • What happens if we have an integer overflow in the sequence number?

Christian Decker answered this by stating that in case this really happens one can reset the counter by splicing or re-anchoring (splicing without changing the values).

If you are more interested in technical write ups about the technology of the lightning network you can follow me on twitter. If you are also an enthusiast of the lightning network you could connect to my lightning node via: 036f464b54416ea583dcfae3872d28516dbe85414ed838513b1c34fb3a4aee4e7a@144.76.235.20:9735 and help to grow the lightning network. Also I want to shout out a thanks to Christian Decker who has reviewed this article and pointed out some mistakes which have been fixed.

Improve the autopilot of bitcoin’s lightning network (Summary of the bar camp Session at the 2nd lightninghackday in Berlin) https://www.rene-pickhardt.de/improve-the-autopilot-of-bitcoins-lightning-network-summary-of-the-bar-camp-session-at-the-2nd-lightninghackday-in-berlin/ Sun, 24 Jun 2018 22:02:55 +0000

I have been visiting Berlin to attend the second lightninghackday and want to give a brief wrap-up of the event. This article will basically cover two topics. First, as promised within my bar camp session on “Building an automated topology for autopilot features of the lightning network nodes”, I will give an extensive protocol / summary of the session itself. Second, I will talk about an already well-known technique called splicing, which I realized during the event might be one of the more important yet unimplemented features of lightning nodes.

Let me just make a quick shoutout to Jeff and his team from fulmo lightning: Thanks for organizing this excellent event. Especially bringing together such a group of high profile lightning network enthusiasts was astonishing for me. Out of the many tech events and conferences that I have attended in the past this was one of my top experiences.

In my bar camp session we had roughly 30 people attending. Luckily Conner Fromknecht and Olaoluwa Osuntokun from the San Francisco based Lightning Labs also joined the session and gave many insights resulting from their hands-on experience. I started the session with a short presentation of my thoughts about the problem. I had previously formulated those in my github repo as a rough draft / sketch for a whitepaper on the topic. That happened after I opened Issue 677 in the LND project criticizing the current design choices for the autopilot. The main points of those thoughts are a list of metrics and criteria I previously thought I would want to monitor and optimize in order to find a good network topology. Before the hackday that list looked like this (we discussed the list and people basically agreed on it; a toy scoring sketch over some of these criteria follows the list):

  • Diameter: A small diameter produces short paths for onion routing. Short paths are preferable because failure is less likely to happen if fewer nodes are involved in routing a payment.
  • Channel balance: Channels should be properly funded but the funds should also be balanced to some degree.
  • Connectivity / Redundancy: Removing nodes (in particular strongly connected nodes) should not be a problem for the remaining nodes / the connectivity of the network.
  • Uptime: It seems obvious that nodes with a high uptime are better candidates to open a channel to.
  • Blockchain transactions: Realizing that the blockchain only supports around 300k transactions per day, the opening, closing and updating of channels should be minimized.
  • Fees for routing: Maybe opening a channel (which is cost intensive) is cheaper overall.
  • Bandwidth, latency, …: Nodes can only process a certain amount of routing requests. I assume that the HTLCs will also lock channels for a certain amount of time during onion routing.
  • Internet topology: Obviously routing through the network becomes faster if the P2P network has a similar topology to the underlying physical network. It also makes sense since even on the internet people might most of the time use products and services within their geographic region. (Check assumptions.)
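
The following toy scoring function shows how a node might combine a few of these criteria (reachability within a few hops, redundancy, uptime) when picking a channel partner. The weights, the uptime values and the overall shape of the heuristic are invented for illustration; this is not what the lnd autopilot does.

import networkx as nx

def candidate_score(g, me, candidate, uptime, max_hops=3):
    trial = g.copy()
    trial.add_edge(me, candidate)                      # pretend we opened the channel
    hops = nx.single_source_shortest_path_length(trial, me, cutoff=max_hops)
    reachable = len(hops) - 1                          # nodes reachable within max_hops
    redundancy = min(trial.degree(candidate), 5)       # crude proxy for connectivity / redundancy
    return 1.0 * reachable + 0.5 * redundancy + 2.0 * uptime.get(candidate, 0.0)

g = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "E")])
uptime = {"B": 0.99, "D": 0.60, "E": 0.95}
best = max(g.nodes, key=lambda n: candidate_score(g, "me", n, uptime))
print("open a channel to:", best)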

Before I state the new ideas that came from the attendees and the discussion I want to briefly sum up the great keynote by the guys from Lightning Labs that preceded the bar camp session. In particular I think the insights and techniques (atomic multi-path routing and splicing) they talked about have a huge impact on the autopilot and the topology generation problems of the lightning network. Long story short, the magic concept in my opinion is splicing. For those that are unfamiliar with the topic: splicing is the process of updating the channel balance of a payment channel by adding or removing (partial) funds. In the past I always thought that, even though it was not implemented in any of the lightning nodes, this is a problem which is technically rather trivial to solve and thus of minor importance. The guys from Lightning Labs basically stated the same, emphasizing that splicing funds out of a channel is trivial and can even be achieved easily in a non-blocking way such that the channel can be used again directly after splicing out, even if the spent transaction is not yet mined. However, splicing in (aka adding more funds to a channel) seems to be a little bit more cumbersome. Without going into too many technical details, the real eye-opener for me was the fact that splicing (together with the autopilot) seems to make lightning wallets in the way they exist these days obsolete. This obviously is a game changer. So to make it crystal clear:

With splicing and the autopilot implemented in a standard bitcoin wallet (and running in the background without the need for users to be aware of this) users can efficiently, quickly and cheaply send funds from one to another. If a path via lightning exists the transaction will be lightning fast. If no path can be found one could just splice out some funds from existing channels to create a new channel for the transaction. This would basically boil down to a common on-chain transaction, which happened before we had the lightning network anyway. However it doesn’t matter if all funds are currently frozen up in payment channels. Also it doesn’t waste the blockchain transaction but rather uses this opportunity to create a funding transaction for the next channel, increasing the size of the lightning network. I dare to say a bigger lightning network is a better lightning network in general. Basically the goal would be that eventually all bitcoin funds would be locked in some payment channels (which with splicing obviously doesn’t lower the control or flexibility of the user). In case a user really needs to do a transaction which can’t be achieved via lightning it will just be spliced out and takes as much processing time as a regular on-chain transaction. As a disclaimer: obviously watchtowers are still needed, in particular in this scenario in which users might not even realize they are using the lightning network.

Taking the opportunities of splicing into consideration I think that the topology problem of the autopilot becomes an issue of only minor importance. One can easily splice out from existing payment channels to new payment channels if needed. The only bad thing is that such a transaction is not routed at lightning speed but rather takes a couple of block times to be mined and processed. However it eventually creates a user-generated network topology that hopefully pretty much follows the actual flows of funds and would thus be rather robust. The only drawback of such a process would be that transactions frequently include channel creations, which takes some time, and that only a maximum of 300k channels can be altered per day on top of the current bitcoin protocol. This observation explains why topology generation by the autopilot is still a reasonable topic to think about, since it should still help to move traffic away from the blockchain.

Finally I will now list some new thoughts that have been collected during the session. I will also update my whitepaper soon. Feel free to fork me on github and do a pull request in order to fix mistakes or add your own thoughts.

Number of nodes reachable within x hops:
It was pointed out that this metric would look quite nice. As a result of the discussion we came to realize that this greedy heuristic would basically lead to the scenario in which every node would open a channel to the most central node in the network. Such a central node would increase the number of nodes that can be reached within x hops by the maximum amount. Still it looks like an important number to somehow optimize for.
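
Counting how many nodes are reachable within x hops is easy to compute with networkx; the random graph below is just a stand-in for the real channel graph, and x = 2 is an arbitrary choice.

import networkx as nx

g = nx.erdos_renyi_graph(n=200, p=0.03, seed=42)   # stand-in for the channel graph
x = 2
reachable = {n: len(nx.single_source_shortest_path_length(g, n, cutoff=x)) - 1 for n in g}
hub = max(reachable, key=reachable.get)
print(f"node {hub} reaches {reachable[hub]} nodes within {x} hops")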

Honesty and good behavior of nodes:
Following a similar philosophy we discussed whether a distributed topology creation algorithm should aim for the global health of the network, in comparison to greedy strategies in which every node tries to optimize its own view of the network. Though it wasn’t pointed out in the session, I think that a strategy where every node tries to optimize its own access to the network will at some point yield a Nash equilibrium. With my rather limited understanding of game theory I think this might not necessarily be the best solution from a global perspective. Also we discussed that, in the later mentioned setting where clients share information with neighbors, an algorithm must be robust against fraudulent behavior or lying nodes.

Stakeholder models:
Pretty much everyone agreed right away that different nodes might have different needs for the lightning network. So the topology creation should be configurable (or learnable by the node), taking into account whether the node is just a casual consumer, a shop, a bank / exchange, or …

Privacy vs information sharing:
We discussed quite extensively that, for a distributed algorithm to make predictions about which channels should be created, it would be great if channel balances were public (or at least if there were some rough information available about the distribution of the balance within one channel). We realized that as a first step, following the ideas of lnd Issue 1253, nodes should start collecting historic information about their own operations and routing activities. Actually I just saw that a pull request that claims to have resolved issue 1253 already exists.
We also realized that channel fees might act as a reasonably good proxy for the channel balance. Assume Alice and Bob have a channel and the balance is very skewed in the sense that Alice has almost no funds and Bob has all of them. If Bob was asked to route a payment through that channel he would probably charge a smaller fee than Alice would if she was asked to route a payment through her almost dried-up channel.
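
A small sketch of that fee-as-a-proxy idea: the side with plenty of outbound liquidity quotes a lower fee because it wants funds pushed out. The linear formula and all the numbers are made up for illustration; real nodes set their fee policies independently of anything like this.

def quoted_fee_ppm(local_balance, capacity, min_ppm=10, max_ppm=1000):
    outbound_ratio = local_balance / capacity
    return round(max_ppm - (max_ppm - min_ppm) * outbound_ratio)

capacity = 1_000_000
print("Bob   (900k local):", quoted_fee_ppm(900_000, capacity), "ppm")  # cheap: happy to push funds out
print("Alice (100k local):", quoted_fee_ppm(100_000, capacity), "ppm")  # expensive: almost dried up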

Still, the idea circulated that nodes could share their channel balances with neighbors in a similar fashion to how routing information in the IP network is shared with neighbors. In this way a map of paths would eventually be constructed for each node.

A point mentioned – which in my opinion is important but only somewhat related to these issues – was the fact that nodes should of course charge higher routing fees if the timelock of the routing request is high, since in the worst case this makes a channel or many other paths unusable for quite some time, maybe even without a payment taking place. As a side note, I just realized that this could be a more sophisticated strategy for nodes to estimate their fees if they are aware of the number of routing paths their channel makes possible.

Some technical details:
One could use the number of disjoint paths between nodes as a good criterion, since it also enables heavy use of atomic multi-path transactions. Also it was mentioned that one could look at the AS numbers of the underlying internet hosts.

Why not use machine learning instead of trying to find some smart algorithm / heuristics?
For me the most surprising idea was the fact that this entire autopilot problem could easily be turned into a machine learning problem. There are some obvious flaws, because single nodes might not have enough training data, and even if one has enough data, sharing the model would probably not work out of the box. So we discussed a little bit whether this would be a use case for transfer learning. Here I will not dig deeper into the topic, since the article is already quite long. But working in the field of machine learning, being a data scientist, and not having put even the slightest thought into this idea before the event took place was a little bit shocking and surprising for me.

Anyway I hope my summary of the session will be useful for you and the community! My personal roadmap now consists of four things.

  1. I am thinking of adding a splicing protocol specification to the BOLT (lightning-rfc).
  2. I want to get up and running with go-lang and the codebase of lnd in order to be able to do some hands-on experiments.
  3. I plan to update the very rough draft of the white paper.
  4. Finally I will hopefully find the time to hack a little python script that simulates how my above described splicing strategy would create a lightning network which is able to route most payment requests. If you want to be updated just follow me on twitter where I will inform you once I am done. Also feel free to leave a comment with more ideas or extend the draft of the white paper. I would love to join forces working on this topic.

Also kudos to Susette Bader who took this really nice snapshot and cover image for this post while we were hacking.

Extracting 2 social network graphs from the Democratic National Committee Email Corpus on Wikileaks https://www.rene-pickhardt.de/extracting-2-social-network-graphs-from-the-democratic-national-committee-email-corpus-on-wikileaks/ Thu, 28 Jul 2016 12:15:05 +0000

tl;dr version: Source code at github!
A couple of days ago a data set was released on Wikileaks consisting of about 23 thousand emails sent within the Democratic National Committee that would demonstrate how the DNC was actively trying to prevent Bernie Sanders from becoming the Democratic candidate for the general election. I am interested in who the people with a lot of influence are, so I decided to have a closer look at the data.
Yesterday I crawled the dataset and processed it. I extracted two graphs in the Konect format. Since I am not sure whether I am legally allowed to publish the processed data sets I will only link to the source code so you can generate the data sets yourself; if you don’t know how to run the code but need the information, drop me a mail. I also hope that Jérôme Kunegis will do an analysis of the networks and include them in Konect.

First we have the temporal graph

This graph consists of 39338 edges. There is a directed edge for each email sent from one person to another person and a timestamp when this has happened. If a person puts n recipients in CC there will be n edges added to the graph.

rpickhardt$ wc -l temporalGraph.tsv
39338 temporalGraph.tsv
rpickhardt$ head -5 temporalGraph.tsv
GardeM@dnc.org DavisM@dnc.org 1 17 May 2016 19:51:22
ShapiroA@dnc.org KaplanJ@dnc.org 1 4 May 2016 06:58:23
JacquelynLopez@perkinscoie.com EMail-Vetting_D@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com LykinsT@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com ReifE@dnc.org 1 13 May 2016 21:27:16

Clearly the format is: sender TAB receiver TAB 1 TAB date
The data is currently not sorted by the fourth column, but this can easily be done. Clearly an email network is directed and can have multi-edges.
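
Sorting by the fourth column is a few lines of Python. The file names and the date format are taken from the sample lines above; everything else is a sketch.

import csv
from datetime import datetime

with open("temporalGraph.tsv") as f:
    rows = list(csv.reader(f, delimiter="\t"))

# the fourth column looks like "17 May 2016 19:51:22"
rows.sort(key=lambda r: datetime.strptime(r[3], "%d %b %Y %H:%M:%S"))

with open("temporalGraphSorted.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)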

Second we have the weighted co-recipient network

Looking at the data I have discovered that many mails have more than one recipient so I thought it would be nice to see the social network structure by looking at how often two people occur in the recipient list for an email. This can reveal a lot about the social network structure of the DNC.

rpickhardt$ wc -l weightedCCGraph.tsv
20864 weightedCCGraph.tsv
rpickhardt$ head -5 weightedCCGraph.tsv
PaustenbachM@dnc.org MirandaL@dnc.org 848
MirandaL@dnc.org PaustenbachM@dnc.org 848
WalkerE@dnc.org PaustenbachM@dnc.org 624
PaustenbachM@dnc.org WalkerE@dnc.org 624
WalkerE@dnc.org MirandaL@dnc.org 596

Clearly the format is: recipient1 TAB recipient2 TAB count
where count counts how often recipient1 and recipient2 have been together in mails.
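
The counts can be produced by incrementing a counter for every ordered pair of addresses on one mail's recipient list (which yields the symmetric edges seen above). The two recipient lists below are a made-up stand-in for the parsed mails.

from collections import Counter
from itertools import permutations

mails = [
    ["MirandaL@dnc.org", "PaustenbachM@dnc.org", "WalkerE@dnc.org"],
    ["MirandaL@dnc.org", "PaustenbachM@dnc.org"],
]

cooccurrence = Counter()
for recipients in mails:
    for a, b in permutations(set(recipients), 2):   # ordered pairs -> both directions
        cooccurrence[(a, b)] += 1

for (a, b), count in cooccurrence.most_common(4):
    print(f"{a}\t{b}\t{count}")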
 

Simple statistics

There have been

  • 1226 senders
  • 1384 recipients
  • 2030 people

included in the mails. In total I found 1226 different senders and 1384 different receivers. The top 7 senders are:

MirandaL@dnc.org 1482
ComerS@dnc.org 1449
ParrishD@dnc.org 750
DNCPress@dnc.org 745
PaustenbachM@dnc.org 608
KaplanJ@dnc.org 600
ManriquezP@dnc.org 567

And the top 7 receivers are:

MirandaL@dnc.org 2951
Comm_D@dnc.org 2439
ComerS@dnc.org 1841
PaustenbachM@dnc.org 1550
KaplanJ@dnc.org 1457
WalkerE@dnc.org 1110
kaplanj@dnc.org 987

As you can see, kaplanj@dnc.org and KaplanJ@dnc.org both occur in the data set, so as I mention in the Roadmap section at the end of the article, more clean-up of the data might be necessary to get a more precise picture.
Still, at first glimpse the data looks pretty natural. In the following I provide a diagram showing the rank-frequency plot of senders and receivers. One can see that some people are way more active than other people. Also the recipient curve is above the sender curve, which makes sense since every mail has one sender but at least one recipient.

You can also see the rank / co-occurrence count diagram of the co-occurrence network. When the ranks are above 2000 the standard network structure picture changes a little bit. I have no plausible explanation for this. Maybe this is due to the fact that the data dump is not complete. Still, the data looks pretty natural to me, so further investigation might make sense.
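
The rank-frequency plots can be reproduced with a few lines of matplotlib. The two dictionaries below hold only the top counts listed above as a stand-in for the full sender / receiver counts.

import matplotlib.pyplot as plt

sender_counts = {"MirandaL@dnc.org": 1482, "ComerS@dnc.org": 1449, "ParrishD@dnc.org": 750}
receiver_counts = {"MirandaL@dnc.org": 2951, "Comm_D@dnc.org": 2439, "ComerS@dnc.org": 1841}

for label, counts in [("senders", sender_counts), ("recipients", receiver_counts)]:
    freqs = sorted(counts.values(), reverse=True)
    plt.loglog(range(1, len(freqs) + 1), freqs, marker="o", label=label)

plt.xlabel("rank")
plt.ylabel("number of mails")
plt.legend()
plt.savefig("rank_frequency.png")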

Code

The crawler code is a two-liner: just some wget and sleep magic.
The Python code for processing the mails builds upon the Python email library by Alain Spineux, which is released under the LGPL license. My code on top is released under GPLv3 and can be found on github.
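To give an idea of what the processing looks like, here is a minimal sketch using only Python’s standard email module; the actual code in the repository builds on Alain Spineux’s library and does more clean-up.

# Minimal sketch: extract sender/recipient edges from one raw .eml file.
from email import message_from_binary_file
from email.utils import getaddresses, parseaddr

def edges_from_mail(path):
    """Return a list of (sender, recipient, date) tuples for one mail file."""
    with open(path, "rb") as f:
        msg = message_from_binary_file(f)
    _, sender = parseaddr(msg.get("From", ""))
    sent = msg.get("Date", "")
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    # one directed edge per recipient, as in temporalGraph.tsv
    return [(sender, addr, sent) for _, addr in recipients if addr]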

Roadmap

  • Use the Generalized Language Model Toolkit to build Language Models on the data
  • Compare with the social graph from Twitter – many email addresses, or at least names, will be linked to Twitter accounts. Comparing the Twitter network with the email network might reveal the differences between internal and external communication.
  • Improve the quality of the data, i.e. better clean-up. Sometimes people in the recipient list have more than one email address; currently these are treated as two different people. On the other hand, sometimes mail addresses are missing and only names are included; these could probably be inferred from the other mail addresses. Also, names in this case serve as unique identifiers, so if two different people are called ‘Bob’ they become one person in the dataset.
]]>
Aurelius Titan graph enables realtime querying with 2400 concurrent users on a distributed graph database! https://www.rene-pickhardt.de/aurelius-titan-graph-enables-realtime-querying-with-2400-concurrent-users-on-a-distributed-graph-database/ https://www.rene-pickhardt.de/aurelius-titan-graph-enables-realtime-querying-with-2400-concurrent-users-on-a-distributed-graph-database/#comments Wed, 21 Aug 2013 10:43:02 +0000 http://www.rene-pickhardt.de/?p=1736 Sorry to start with a conclusion first… To me Titan graph seems to be the egg-laying wool-milk-sow that people dream of when working with graph data, especially if one needs graph data in a web context and in real time. I will certainly try to free some time to check this out and get hands-on. Also, this thing is really new and revolutionary: this is not just another Hadoop or Giraph approach for big data processing, this is distributed in real time! I am almost confident that if the benchmark holds what it promises, Titan will be one of the fastest-growing technologies we have seen so far.
 

I met Matthias Bröcheler (CTO of Aurelius, the company behind Titan graph) four years ago while teaching at the German national student high school academy. It was the time when I was still more mathematician than computer scientist, but my journey towards becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experiences a lot. It was through him that my eyes were opened for the first time to what big data means and how companies like Facebook, Google and so on knit their business models around collecting data. Anyway, Matthias influenced me in quite some way and I have a lot of respect for him.
I did not start my PhD right away and we lost contact. I knew he was interested in graphs but that was about it. Only when I started to use neo4j more and more did I realize that Matthias was also one of the authors of the Tinkerpop Blueprints, the interfaces for talking to graphs that most graph database vendors use. At that time I looked him up again and realized he was working on Titan graph, a distributed graph database. I found this promising-looking slide deck:

Slide 106:

Slide 107:

But at that time for me there wasn’t much evidence that Titan would really hold the promise that is given in slides 106 and 107. In fact those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases (By the way: Reading the PhD Proposal now I am kind of amused since I did not really aim for the important and big points like Titan did.)
During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button and especially to be able to integrate this with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (an implementation of Google Bigtable) and Cassandra (an implementation of Amazon Dynamo), which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines I stumbled upon Titan again and watched Matthias’ talk at the Cassandra Summit 2013. Around minute 21/22 the talk gets really interesting; I can also suggest skipping the first 15 minutes of the talk:

Let me sum up the amazing parts of the talk:

  • 2400 concurrent users against a graph cluster!
  • real time!
  • 16 different (non-trivial) queries
  • achieving more than 10k requests answered per second!
  • graph with more than a billion nodes!
  • graph partitioning is pluggable
  • graph schema helps indexing for queries
So far I was not sure what kind of queries were really involved, especially whether there were also write transactions, and unfortunately no one in the audience asked that question. So I started googling and found this blog post by Aurelius. There we get an entire overview of the queries and the results are presented in much more detail. Unfortunately I was not able to find the source code of that very benchmark (which Matthias promised to open in his talk). On average most queries take less than half a second.
 
Even though the source code is not available this talk together with the Aurelius blog post looks to me like the most interesting and hottest piece of technology I came across during my PhD program. Aurelius started to think distributed right away and made some clever design decisions:
  • Scaling data size
  • scaling data access in terms of concurrent users (especially write operations) is fundamentally integrated and seems to have been integrated successfully
  • making partitioning pluggable
  • requiring a schema for the graph (to enable efficient indexing)
  • being able to extend the schema at runtime
  • building on top of either Cassandra (for real time) or HBase (for consistency)
  • being compatible with the tinkerpop techstack
  • bringing up an entire framework for analytics and graph processing.

Further resources:

]]>
https://www.rene-pickhardt.de/aurelius-titan-graph-enables-realtime-querying-with-2400-concurrent-users-on-a-distributed-graph-database/feed/ 2
Analyzing the final and intermediate results of the iversity MOOC Fellowship online voting https://www.rene-pickhardt.de/analyzing-the-final-and-intermediate-results-of-the-iversity-mooc-fellowship-online-voting/ https://www.rene-pickhardt.de/analyzing-the-final-and-intermediate-results-of-the-iversity-mooc-fellowship-online-voting/#comments Thu, 23 May 2013 23:07:24 +0000 http://www.rene-pickhardt.de/?p=1609 As written before, Steffen and I participated in the online voting for the MOOC fellowship. Today the competition finished and I would like to say thank you to everyone who participated in the voting, in particular to the 435 people supporting our course. I never imagined we would get that many people interested in our course!
The voting period went from May 1st until today. During this period the user interface of the iversity website changed several times, providing different kinds of information about the voting to us users. Since I observed a drastic change in rankings on May 9th, and since the process and scores have not been very transparent, I decided on that very day to collect some data about the rankings. I already did some quick analysis on the data and found some interesting facts, but I am running out of time right now to conduct an extensive data analysis. So I will share the data set with the public domain:
http://rene-pickhardt.de/mooc.tar.bz2 (33MB)
If you download the archive and extract it you’ll find folders for every hour after May 9th. In every folder you will find 26 html files representing the current ranking of the courses at that time and a transaction log of the http requests which were done to download the 26 html files. There are 26 html files since 10 courses were displayed per page and we had 255 courses participating.
During the time of data collection I had 2 or 3 short down times of my web server so it could be possible that some data points are missing.
I already wrote a “dirty hack” and pushed it to github; it also extracts the interesting information from the downloaded html files.

  1. There is a file rank.tsv (334 kb) that contains the hourly ranking of every course.
  2. There is a file vote.tsv (113 kb) that contains for every course on an hourly basis (between May 20th and today) the number of votes the course acquired. The period of time for vote.tsv is so short since the votes have only been available in the html files during this time. A rough loading sketch for rank.tsv follows below.
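Since the exact column layout is easiest to check by opening the files, this is only a rough sketch of mine; I assume a tab-separated layout of course, timestamp and rank with one row per course and hour, so adjust the column indices to the real file.

# Rough sketch for exploring rank.tsv (assumed layout: course <TAB> timestamp <TAB> rank).
from collections import defaultdict

ranks = defaultdict(list)  # course -> list of (timestamp, rank)
with open("rank.tsv") as f:
    for line in f:
        course, timestamp, rank = line.rstrip("\n").split("\t")[:3]
        ranks[course].append((timestamp, int(rank)))

# courses with the largest rank swings over the observation period
swings = {c: max(r for _, r in rows) - min(r for _, r in rows) for c, rows in ranks.items()}
for course, swing in sorted(swings.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(course, swing)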

Skimming the data with my eyes there are already some facts that make me very curious for a deeper data analysis:

  1. Some courses gained several hundred votes within a short period of time (usually only 2 or 3 hours), whereas most courses (especially those gaining such a large number of votes at once) otherwise stayed far under 1000 votes in total.
  2. Also it is interesting to see how much variation has been going on in the last couple of days. 
  3. I haven’t crawled the view counts of the courses’ Youtube videos, and even now after observing the following I did not take a snapshot of them, but it is interesting that there is such a large difference in conversion rate. Especially the top courses seem to have many more votes than views of their application video, whereas some really high-class and outstanding applications, like the ones from Christian Spannagel (Math) or Oliver Vornberger (Algorithms and data structures), have two or three times as many views on Youtube as votes. In fact they have about the same number of Youtube views as the top voted courses.

I am pretty sure there are some more interesting facts, and maybe someone else has collected a better data set over the complete period of time, including Youtube snapshots as well as Facebook and Twitter mentions.
Since I have been asked several times already: here are the final rankings to download and also as a table in the blog post:

  Rank Course name Number of votes
1 sectio chirurgica anatomie interaktiv 8013
2 internationales agrarmanagement 2 7557
3 ingenieurmathematik fur jedermann 2669
4 harry potter and issues in international politics 2510
5 online surgery 2365
6 l3t s mooc der offene online kurs uber das lernen und lehren mit technologien 2270
7 design 101 or design basics 2 2216
8 einfuhrung in das sozial und gesundheitswesen sozialraume entdecken und entwickeln 2124
9 changeprojekte planen nachhaltige entwicklung durch social entrepreneurship 2083
10 social work open online course swooc14 2059
11 understanding sustainability environmental problems collective action and institutions 1912
12 the dance of functional programming languaging with haskell and python 1730
13 zyklenbasierte grundung systematische entwicklung von geschaftskonzepten 1698
14 a virtual living lab course for sustainable housing and lifestyle 1682
15 family politics domestic life revolution and dictatorships between 1900 1950 1476
16 h2o extrem 1307
17 dark matter in galaxies the last mystery 1261
18 algorithmen und datenstrukturen 1207
19 psychology of judgment and decision making 1168
20 the future of storytelling 1164
21 web engineering 1152
22 die autoritat der wissenschaften eine einfuhrung in das wissenschaftstheoretische denken 2 1143
23 magic and logic of music a comprehensive course on the foundations of music and its place in life 1138
24 nmooc nachhaltigkeit fur alle 1130
25 sovereign bond pricing 1115
26 soziale arbeit eine einfuhrung 1034
27 mathematische denk und arbeitsweisen in geometrie und arithmetik 1016
28 social entrepreneurship wir machen gesellschaftlichen wandel moglich 1010
29 molecular gastronomy an experimental lecture about food food processing and a bit of physiology 984
30 fundamentals of remote sensing for earth observation 920
31 kompetenzkurs ernahrungswissenschaft 891
32 erfolgreich studieren 879
33 deciphering ancient texts in the digital age 868
34 qualitative methods 861
35 karl der grosse pater europae 855
36 who am i mind consciousness and body between science and philosophy 837
37 programmieren mit java 835
38 systemisches projektmanagement 811
39 lernen ist sexy 764
40 modelling and simulation using matlab one mooc more brains an interdisciplinary course not just for experts 760
41 suchmaschinen verstehen 712
42 hands on course on embedded computing systems with raspberry pi 679
43 introduction to mixed methods and doing research online 676
44 game ai 649
45 game theory and experimental economic research 633
46 cooperative innovation 613
47 blue engineering ingenieurinnen und ingenieure mit sozialer und okologischer verantwortung 612
48 my car the unkown technical being 612
49 gesundheit ein besonderes gut eine multidisziplinare erkundung des deutschen gesundheitssystems 608
50 teaching english as a foreign language tefl part i pronunciation 597
51 wie kann lesen gelernt gelehrt und gefordert werden lesesozialisation lesedidaktik und leseforderung vom grundschulunterricht bis zur erwachsenenbildung 593
52 the european dream 576
53 education of the present what is the future of education 570
54 faszination kristalle und symmetrie 561
55 italy today a girlfriend in a coma a walk through today s italy 557
56 dna from structure to therapy 556
57 grundlagen der mensch computer interaktion 549
58 malnutrition in developing countries 548
59 marketing als strategischer erfolgsfaktor von der produktinnovation bis zur kundenbindung 540
60 environmental ethics for scientists 540
61 stem cells in biology and medicine 528
62 praxiswissen fur den kunstlerischen alltagsdschungel 509
63 physikvision 506
64 high five evidence based practice 505
65 future climate water 484
66 diversity and communication challenges for integration and mobility 477
67 social entrepreneurship 469
68 die kunst des argumentierens 466
69 der hont feat mit dem farat wek wie kinder schreiben und lesen lernen 455
70 antikrastination moocen gegen chronisches aufschieben 454
71 exercise for a healthier life 454
72 the startup source code 438
73 web science 435
74 medizinische immunologie 433
75 governance in and through human rights 431
76 europe in the world law and policy aspects of the eu in global governance 419
77 komplexe welt strukturen selbstorganisation und chaos 419
78 mooc basics of surgery want to become a real surgeon 416
79 statistical data analysis for the humanities 414
80 business math r edux 406
81 analyzing behavioral dynamics non linear approaches to social and cognitive sciences 402
82 space technology 397
83 der erzahler materialitat und virtualitat vom mittelalter bis zur gegenwart 396
84 kriminologie 395
85 von e mail skype und xing kommunikation fuhrung und berufliche zusammenarbeit im netz 394
86 wissenschaft erzahlen das phanomen der grenze 392
87 nachhaltige entwicklung 389
88 die nachste gesellschaft gesellschaft unter bedingungen der elektrizitat des computers und des internets 388
89 die grundrechte 376
90 medienbildung und mediendidaktik grundbegriffe und praxis 368
91 bubbles everywhere speculative bubbles in financial markets and in everyday life 364
92 the heart of creativity 363
93 physik und weltraum 358
94 sim suchmaschinenimplementierung als mooc 354
95 order of magnitude physics from atomic nuclei to the universe 350
96 entwurfsmethodik eingebetteter systeme 343
97 monte carlo methods in finance 335
98 texte professionell mit latex erstellen 331
99 wissenschaftlich arbeiten wissenschaftlich schreiben 330
100 e x cite join the game of social research 330
101 forschungsmethoden 323
102 complex problem solving 321
103 programmieren lernen mit effekt 317
104 molecular devices and machines 317
105 wie man erfolgreich ein startup aufbaut 315
106 grundlagen der prozeduralen und objektorientierten programmierung 314
107 introduction to disability studies 314
108 eu2c the european union explained by two partners cologne and cife 313
109 the english language a linguistic introduction 2 311
110 allgemeine betriebswirtschaftslehre 293
111 interaction design open design 293
112 how we learn nowadays possibilities and difficulties 288
113 foundations of educational technology 288
114 projektmanagement und designbasiertes lernen 281
115 human rights 278
116 kompetenz des horens technische gehorbildung 278
117 it infrastructure management 276
118 a media history in 10 artefacts 274
119 introduction to the practice of statistics and regression 271
120 what is a good society introduction to social philosophy 268
121 modellierungsmethoden in der wirtschaftsinformatik 265
122 objektorientierte programmierung von web anwendungen von anfang an 262
123 intercultural diversity networking vielfalt interkulturell vernetzen 260
124 foundations of entrepreneurship 259
125 business communication for impact and results 257
126 gamification 257
127 creativity and design in innovation management 256
128 mechanik i 252
129 global virtual project management 252
130 digital signal processing for everyone 249
131 kompetenzen fur klimaschutz anpassung 248
132 digital economy and social innovation 246
133 synthetic biology 245
134 english phonetics and phonology 245
135 leibspeisen nahrung im wandel der zeiten molekule brot kase fleisch schokolade und andere lebensmittel 243
136 critical decision making in the contemporary globalized world 238
137 einfuhrung in die allgemeine betriebswirtschaftslehre schwerpunkt organisation personalmanagement und unternehmensfuhrung 236
138 didaktisches design 235
139 an invitation to complex analysis 235
140 grundlagen der programmierung teil 1 234
141 allgemein und viszeralchirurgie 233
142 mathematik 1 fur ingenieure 231
143 consumption and identity you are what you buy 231
144 vampire fictions 230
145 grundlagen der anasthesiologie 228
146 marketing strategy and brand management 227
147 political economy an introduction 225
148 gesundheit 221
149 object oriented databases 219
150 lebenswelten perspektiven fur menschen mit demenz 217
151 applications of graphs to real life problems 210
152 introduction to epidemiology epimooc 207
153 network security 207
154 global civics 207
155 wissenschaftliches arbeiten 204
156 annaherungen an zukunfte wie lassen sich mogliche wahrscheinliche und wunschbare zukunfte bestimmen 202
157 einstieg wissenschaft 200
158 engineering english 199
159 das erklaren erklaren wie infografik klart erklart und wissen vermittelt 198
160 betriebswirtschaftliche und rechtliche grundlagen fur das nonprofit management 192
161 art and mathematics 191
162 vom phanomen zum modell mathematische modellierung von natur und alltag an ausgewahlten beispielen 190
163 design interaktiver medien technische grundlagen 189
164 business englisch 187
165 erziehung sehen analysieren gestalten 184
166 basic clinical research methods 184
167 ordinary differential equations and laplace transforms 180
168 mathematische logik 179
169 die geburt der materie in der evolution des universums 179
170 innovationsmanagement von kleinen und mittelstandischen unternehmen kmu 176
171 introduction to qualitative methods in the social sciences 175
172 advert retard wirkung industrieller interessen auf rationale arzneimitteltherapie 175
173 animation beyond the bouncing ball 174
174 entropie einfuhrung in die physikalische chemie 172
175 edufutur education for a sustainable future 165
176 social network effects on everyday life 164
177 pharmaskills for africa 163
178 nachhaltige energiewirtschaft 162
179 qualitat in der fruhpadagogik auf den anfang kommt es an 158
180 dementias 157
181 beyond armed confrontation multidisciplinary approaches and challenges from colombia s conflict 154
182 investition und finanzierung 150
183 praxis des wissensmanagements 149
184 gutenberg to google the social construction of the communciations revolution 145
185 value innovation and blue oceans 145
186 kontrapunkt 144
187 shakespeare s politics 142
188 jetzt erst recht wissen schaffen uber recht 141
189 rechtliche probleme von sozialen netzwerken 138
190 augmented tuesday suppers 137
191 positive padagogik 137
192 digital storytelling mit bewegenden bildern erzahlen 136
193 wirtschaftsethik 134
194 energieeffizientes bauen 134
195 advising startups 133
196 urban design and communication 133
197 bildungsreform 2 0 132
198 mooc management basics 130
199 healthy teeth a life long course of preventive dentistry 129
200 digitales tourismus marketing 127
201 the arctic game the struggle for control over the melting ice 127
202 disease mechanisms 127
203 special operations from raids to drones 125
204 introduction to geospatial technology 120
205 social media marketing strategy smms 119
206 korpusbasierte analyse sprechsprachlichen problemlosungsverhaltens 116
207 introduction to marketing 115
208 creative coding 114
209 mooc meets 3d 110
210 unternehmenswert die einzig sinnvolle spitzenkennzahl fur unternehmen 110
211 forming behaviour gestaltung und konzeption von web applications 109
212 technology demonstration 108
213 lebensmittelmikrobiologie und hygiene 105
214 estudi erfolgreich studieren mit dem internet 105
215 moderne geldtheorie eine paische perspektive 103
216 kollektive intelligenz 103
217 geschichte der optischen medien 100
218 alter und soziale arbeit 99
219 semantik eine theorie visueller kommunikation 97
220 erziehung und beratung in familie und schule 96
221 foreign language learning in indian context 95
222 bildgebende verfahren 92
223 applied biology 92
224 bildung in der wissensgesellschaft gerechtigkeit 92
225 standortmanagement 92
226 europe a solution from history 90
227 methodology of research in international law 90
228 when african americans came to paris 90
229 contemporary architecture 89
230 past recent encounters turkey and germany 88
231 wars to end all wars 83
232 online learning management systems 82
233 software applications 81
234 business in germany 78
235 requirements engineering 77
236 anything relationship management xrm 77
237 global standards and local practices 76
238 prodima professionalisation of disaster medicine and management 75
239 cytology with a virtual correlative light and electron microscope 75
240 the organisation of innovation 75
241 sensors for all 75
242 diagnostik in der beruflichen bildung 73
243 scientific working 71
244 escience saxony lectures 71
245 internet marketing strategy how to gain influence and spread your message online 69
246 grundlagen des e business 69
247 principles of public health 64
248 methods for shear wave velocity measurements in urban areas 64
249 democracy in america 64
250 building typology studies gebaudelehre 63
251 multi media based learning environments at the interface of science and practice hamburg university of applied sciences prof dr andrea berger klein 61
252 math mooc challenge 60
253 the value of the social 58
254 dienstleistungsmanagement und informationssysteme 57
255 ict integration in education systems e readiness e integration e transformation 56
]]>
https://www.rene-pickhardt.de/analyzing-the-final-and-intermediate-results-of-the-iversity-mooc-fellowship-online-voting/feed/ 8
Experiences on semantifying a Mediawiki for the biggest recource about Chinese rock music: rockinchina .com https://www.rene-pickhardt.de/experiences-on-semantifying-a-mediawiki-for-the-biggest-recource-about-chinese-rock-music-rockinchina-com/ https://www.rene-pickhardt.de/experiences-on-semantifying-a-mediawiki-for-the-biggest-recource-about-chinese-rock-music-rockinchina-com/#comments Mon, 07 Jan 2013 09:38:45 +0000 http://www.rene-pickhardt.de/?p=1486 During my trip in China I visited Beijing on two weekends and Macau on another weekend. These trips were mainly motivated by meeting old friends, especially the heads behind the biggest English resource on Chinese rock music, Rock in China: Max-Leonhard von Schaper and Yang Yu, the founder of the biggest Chinese rock print magazine. After looking at their wiki, which is pure gold in terms of content but consists mainly of plain text, I introduced them to the idea of putting semantics into the project. While I only consulted a little and pointed them to the right resources, Max did basically the entire work (by taking a one-month holiday from his job; boy, this is passion!).
I am very happy to announce that the data of Rock in China is published as linked open data and the process of semantifying the website is in great shape. In the following you can read about Max’s experiences doing the work. This is particularly interesting because Max has no scientific background in semantic technologies, so we can learn a lot about how to improve these technologies to make them ready to be used by everybody:

Max’s report on semantifying

Max-Leonhard von Schaper in Beijing.
To summarize, for a non-scientific greenhorn experimenting with semantic mediawiki and the semantic data principle in general, a good two months were required to bring our system to the point where it is today. As easy as it seems in the beginning, there is still a lot of manual coding and changing to be done as well as trial-and-error to understand how the new system is working.
Apart from the great learning experience and availability of our data in RDF format, our own website expanded in the process by ~20% of content pages (from 4000 to above 5000), adding over 10000 real property triplets and gaining an additional 300 thousand pageviews.
Lessons learnt, in brief:

  • DBPedia resources are to be linked with “resources” in the URI not with “page”
  • SMW requires the prefix “foaf:” or “mo:” or something else for EACH imported property
  • Check the Special:ExportRDF early to see if your properties work
  • Properties / predicates: there is no difference in SMW
  • Getting data into Freebase depends on backlinks and sameas links to other ontologies as well as on entering data in semantic search engines
  • Forms for user data entry are very important!
  • As a non-scientific person without feedback I would not have been able to implement that.
  • DBpedia and the Music Ontology are NOT interlinked with sameas (as checked on sameas.org).
  • Factbox only works with the standard skin (monoskin). For other skins one has to include it in the PHP code oneself.

Main article

The online wiki Rock in China has been online for a number of years and focuses on Chinese underground music. Prior to implementing Semantic Mediawiki our wiki had roughly 4000 content pages with over 1800 artists and 900 records. We used a number of templates for bands, CDs, venues and labels, but apart from using numerous categories and the DynamicPageList extension for a few joins, we were not able to tangibly use the available data.
DPL example for a JOIN between two wiki categories:

<DynamicPageList>
category = Metal Artists
category = Beijing Artists
mode     = ricstyle
order  = ascending
</DynamicPageList>

Results of a simple mashup query: display venues in Beijing on a Google Map

After having had an interesting discussion with Rene on the benefits of semantic data and Linked Open Data, we decided to go semantic. As total greenhorns to the field and with only limited programming time available, we started off googling the respective key terms and quickly enough came to the websites of the Music Ontology and Semantic Mediawiki, which we decided to install.
Being an electrical engineer with a basic IT background and many years of working on the web in PHP, HTML, Joomla or Mediawiki, it was still a challenge to get used to the new semantic way of talking and to understand the principles behind it. Not so much because there are not enough tutorials or information out on the web, but because the guiding principle is always somewhere other than where I was looking. Without the help of Rene and several feedback discussions I don’t think it would have been possible for us to implement this system within the month that it took us.
Our first difficulty (after getting the extension onto our FTP server) was to upgrade our existing Mediawiki from version 1.16 to version 1.19, an upgrade that used up the better part of two days, including updating all other extensions as well (with five of them not working anymore at all, as they are not being developed further) and finally getting our first semantic property running.
When starting to implement the semantic approach, I read a lot online about the various ontologies available and intensively checked the Music Ontology. However the Music Ontology is by far the wrong use case for our wiki, as the Music Ontology goes more into the musical creation process while Rock in China describes the scene developments. All our implementations were tracked on the wiki page Rock in China – Semantic Approach for other team members to understand the current process and to document workarounds and problems.
Our first test class was Venue, a category in which we had 40 – 50 live houses in China with various levels of data depth that we could put into the following template SemanticVenue:

{{SemanticVenue
|Image=
|ImageDescription=
|City=
|Address=
|Phone=
|Opened=
|Closed=
|GeoLocation=
}}

As can be seen from the above template, both predicates (City) and properties (Opened) are proposed for the semantic class VENUE. Semantic Mediawiki implements this decisive difference in a very user-friendly way by setting the TYPE of each SMW property to either PAGE or something else. As good as this is, it is somewhat confusing when talking with someone else about the semantic concepts in principle.
A major problem was the integration of external ontologies, which was not sufficiently documented on the Semantic Mediawiki page, most probably due to a change in versioning. Especially cross-referencing the URI was a major problem. As per the Semantic Mediawiki documentation aliases should be allowed; however, trial and error revealed that only a property with a domain prefix, e.g. foaf:phone or owl:sameas, would be correctly recognized. We used the Special:ExportRDF function to find most of these errors; every time our URI referencing was wrong we would get a parser function error.
First, the wrong way for the following two wiki pages:

  • Mediawiki:smw_import_mo
  • Property:genre

Mediawiki:smw_import_mo:

http://purl.org/ontology/mo/ |[http://musicontology.com/ Music Ontology Specification]
activity_end|Type:Date
activity_start|Type:Date
MusicArtist|Category
genre|Type:Page
Genre|Category
track|Type:String
media_type|Type:String
publisher|Type:Page
origin|Type:Page
lyrics|Type:Text
free_download|Type:URL

Property:genre:

[[Has type::Page]][[Imported from::mo:genre]]

And now the correct way how it should be actually implemented to work:
Mediawiki:smw_import_mo:

http://purl.org/ontology/mo/|[http://musicontology.com/ Music Ontology Specification]
activity_end|Type:Date
activity_start|Type:Date
MusicArtist|Category
genre|Type:Page
Genre|Category
track|Type:String
media_type|Type:String
publisher|Type:Page
origin|Type:Page
lyrics|Type:Text
free_download|Type:URL

Property:mo:genre:

[[Has type::Page]][[Imported from::mo:genre]]

The ontology with the most problems was DBpedia, whose documentation did not tell us what the correct URI was. Luckily the mailing list provided support and we got to know the correct URI:

http://www.dbpedia.org/ontology/

With that provided, we were able to implement a number of semantic properties for a number of classes and start updating our wiki pages to get the data into our semantic database.
To utilize semantic properties within a wiki, there are a number of extensions available, such as Semantic Forms, Semantic Result Formats and Semantic Maps. The benefits we were able to gain were tremendous. For example, the original JOIN query that we ran at the beginning of this post with DPL could now be expressed with the following ASK query:

{{#ask: [[Category:Artists]] [[mo:origin::Beijing]]
|format=list
}}

However, this comes with the major benefit that the <references/> extension is NOT broken after putting the inline query into a page. DynamicPageList breaks the <references/>, losing a lot of information. Another example of how we benefited from semantics is that previously we were only able to use categories and join one or two of them, e.g. artist pages that were categorized as both BEIJING artists and METAL artists. Now, with semantic properties, we have a lot more data to play around with and could create mashup pages such as ROCK or Category:Records, on which we were able to show random videos from any ROCK artist or include a TIMELINE view of released records.

Mashup Page with a suitable video

With the help of the mailing list of Semantic Mediawiki itself (which was of great help when we were struggling) we implemented inline queries using templates to avoid later data changes on multiple pages. That step taken, the basic semantic structures were set up at our wiki and it was time for our next step: Bringing the semantic data of our wiki to others!
And here we are, asking ourselves: How will Freebase or DBpedia actually find our data? How will they include it? Discussing this with Rene, a few structural problems became apparent. Being used to working with Wikipedia, we usually set the property

Owl:sameas (or sameas)

on various of our pages, pointing directly to Wikipedia pages.
However we learnt that the property

foaf:primaryTopic

is a much better and more accurate property for this. The sameas property should be used for semantic RDF pages, i.e. the respective DBpedia RESOURCE page (not the PAGE page). Luckily we had already implemented the sameas property mostly in templates, so it was easy enough to exchange the properties.
Having figured out this issue, we checked both the Freebase page and other pages, such as DBpedia or MusicBrainz, but there seems to be no “submit RDF” form. Hence we decided that the best way to get recognized in the Semantic Web is to include more links to other RDF resources, e.g. for our Category:Artists we set sameas links to DBpedia and the Music Ontology. For DBpedia we linked to the class and for the Music Ontology to the URI of the class.
A note on the side here: when checking on sameas.org, it seems that the Music Ontology is NOT cross-linked to DBpedia so far.
Following the recommendations set forth at Sindice, we changed our robots.txt to include our semantic sitemap(s):

Sitemap: http://www.music-china.org/wiki/index.php?title=Special:RecentChanges&feed=atom
Sitemap: http://www.rockinchina.com/wiki/index.php?title=Special:RecentChanges&feed=atom

As the next step we analyzed how we can include external data on our SMW, e.g. from MusicBrainz or from Youtube. Being a music-oriented page, Youtube was of particular interest for us. We found the SMW extension External Data that we could use to connect to the Google API:

{{#get_web_data:
url=https://www.googleapis.com/youtube/v3/search?part=snippet&q=carsick+cars&topicId=%2Fm%2F03cmgbv&type=video&key=Googlev3API&maxResults=50
|format=JSON
|data= videoId=videoId,title=title
}}

And

{{#for_external_table:
{{Youtube|ID={{{videoId}}}|title={{{title}}} }}<br/>
{{{videoId}}} and {{{title}}}<br/>
}}

See our internal TESTPAGE for the live example.
Youtube uses its in-house Freebase ID system to generate auto-channels filled with official music videos of bands and singers. The Freebase ID can be found on the individual Freebase RESOURCE page after pressing the EDIT button. Alternatively one could use the Google API to receive the ID, but would need a Youtube-internal HC ID prior to that. Easy implementation for our wiki: include the Freebase ID as a semantic property on artist pages within our definitions template:

{{Definitions
|wikipedia=
|dbpedia=
|freebase=
|freebaseID=
|musicbrainz=
|youtubeautochannel=
}}

Voilà: with the additional SQL-based caching of request results (e.g. JSON) our API load on Google is extremely low, and page loading speed on our wiki increases as well. Using this method we were able to increase our saved Youtube ID tags from the original 500 to way over 1000 within half a day.

A big variety of videos for an act like Carsick Cars is now available thanks to semantifying

With these structures in place it was time to inform the people in our community not only on the changes that have been made but also on the additional benefits and possibilities. We used our own blog as well as our Facebook page and Facebook group to spread the word.

]]>
https://www.rene-pickhardt.de/experiences-on-semantifying-a-mediawiki-for-the-biggest-recource-about-chinese-rock-music-rockinchina-com/feed/ 3
Get the full neo4j power by using the Core Java API for traversing your Graph data base instead of Cypher Query Language https://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/ https://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/#comments Tue, 06 Nov 2012 11:55:02 +0000 http://www.rene-pickhardt.de/?p=1460 As I said yesterday, I have been busy over the last months producing content, so here you go. For our Related Work project we will most likely use neo4j as the core database. This makes sense since we are basically building some kind of a social network, and most queries that we need to answer while offering the service or during data mining have a friend-of-a-friend structure.
For some of the queries we are doing counting or aggregations, so I was wondering what the most efficient way of querying a neo4j database is. So I did a benchmark with quite surprising results.
Just a quick remark: we used a database consisting of papers and authors extracted from arxiv.org, one of the biggest preprint sites available on the web. The data set is available for download and reproduction of the benchmark results at http://blog.related-work.net/data/
The database as a neo4j file is 2 GB (zipped); the schema looks pretty much like this:

 Paper1  <--[ref]-->  Paper2
   |                    |
   |[author]            |[author]
   v                    v
 Author1              Author2

For the benchmark we were trying to find co-authors, which is basically a friend-of-a-friend query following the author relationship (or a breadth-first search of depth 2).
As we know, there are basically 3 ways of communicating with the neo4j database:

Java Core API

Here you work on the node and relationship objects within Java. Formulating a query, once you have fixed an author node, looks pretty much like this:

for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
    Node paper = rel.getOtherNode(author);
    for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node coAuthor = coAuthorRel.getOtherNode(paper);
        // skip the author we started from
        if (coAuthor.getId() == author.getId()) continue;
        resCnt++;
    }
}

We see that the code can easily look very confusing (if queries get more complicated). On the other hand one can easily combine several similar traversals into one big query, making readability worse but increasing performance.

Traverser Framework

The Traverser Framework ships with the Java API and I really like the idea of it. I think it is really easy to understand the meaning of a query, and in my opinion it really helps the readability of the code.

Traversal t = new Traversal();
for (Path p : t.description().breadthFirst()
        .relationships(RelationshipTypes.AUTHOROF)
        .evaluator(Evaluators.atDepth(2))
        .uniqueness(Uniqueness.NONE)
        .traverse(author)) {
    Node coAuthor = p.endNode();
    resCnt++;
}

Especially if you have a lot of similar queries or queries that are refinements of other queries you can save them and extend them using the Traverser Framework. What a cool technique.

Cypher Query Language

And then there is the Cypher Query Language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why. It is a really beautiful language that is close to SQL (looking at Stackoverflow it is actually frightening how many people are trying to answer FOAF queries using MySQL) but still emphasizes the graph-like structure.

ExecutionEngine engine = new ExecutionEngine(graphDB);
String query = "START author=node(" + author.getId() +
        ") MATCH author-[:" + RelationshipTypes.AUTHOROF.name() +
        "]-()-[:" + RelationshipTypes.AUTHOROF.name() +
        "]- coAuthor RETURN coAuthor";
ExecutionResult result = engine.execute(query);
scala.collection.Iterator<Node> it = result.columnAs("coAuthor");
while (it.hasNext()) {
    Node coAuthor = it.next();
    resCnt++;
}

I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let’s just point out that Cypher is quite expressive.

What were the results?

All queries have been executed 11 times where the first time was thrown away since it warms up neo4j caches. The values are average values over the other 10 executions.
  • The Core API is able to answer about 2000 friend-of-a-friend queries per second (I have to admit, on a very sparse network).
  • The Traverser framework is about 25% slower than the Core API
  • Worst is Cypher, which is slower by at least one order of magnitude, only able to answer about 100 FOAF-like queries per second.
  • I was shocked, so I talked with Andres Taylor from neo4j, who mainly works on Cypher. He asked me which neo4j version I used and I said it was 1.7. He told me I should check out 1.9, since Cypher has become more performant. So I ran the benchmarks on neo4j 1.8 and neo4j 1.9; unfortunately Cypher became slower in the newer neo4j releases.

    One can see that the Core API outperforms Cypher by an order of magnitude and the Traverser Framework by about 25%. In newer neo4j versions the Core API became faster and Cypher became slower.

    Quotes from Andres Taylor:

    Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on. The focus in this first phase has been to explore the language, and learn about how our users use the query language, and to expand the feature set to a reasonable level.

    I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. Like every language ever created, in the beginning you can always do better than the compiler by writing by hand, but eventually the compiler catches up.

    Conclusion:

    So far I was only using the Java Core API working with neo4j and I will continue to do so.
    If you are in a high-speed scenario (and I believe every web application is one) you should really think about switching to the neo4j Java Core API for writing your queries. It might not be as nice looking as Cypher or the Traverser Framework, but the gain in speed pays off.
    Also I personally like the amount of control that you have when traversing over the core yourself.
    Additionally I will soon post an article on why scripting languages like PHP, Python or Ruby aren’t suitable for building web applications anyway. So changing to the Core API makes sense for several reasons.
    The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
    The detailed results can be found in this spreadsheet.

    ]]>
    https://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/feed/ 16
    Foundations of statistical natural language processing Review of chapter 1 https://www.rene-pickhardt.de/foundations-of-statistical-natural-language-processing-review-of-chapter-1/ https://www.rene-pickhardt.de/foundations-of-statistical-natural-language-processing-review-of-chapter-1/#comments Tue, 12 Jun 2012 13:31:51 +0000 http://www.rene-pickhardt.de/?p=1354 Due to the interesting results we found by creating Typology I am currently reading the related work about query prediction and auto-completion of sentences. There is quite some interesting academic work available in this area of information retrieval.
    While reading these papers I realized that I am not that strong in the field of natural language processing, which seems to have a deep impact on my current research interests. That’s why I decided to put in a reading session of some basic work. A short trip to the library and a look into the cited work of the papers made me find the book Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze. Even though they write in the introduction that their book is far from being a complete coverage of the topic, I think this book will be a perfect entry point to become familiar with some of the basic concepts in NLP. Since one understands and remembers better what one has read if one writes it down, I am planning to write summaries and reviews of the book chapters here in my blog:

    Chapter 1 Introduction

    This chapter is split into several sections and is supposed to give some motivation. It already demonstrates in a good way that in order to understand natural language processing you really have to bridge the gap between mathematics and computer science on the one hand and linguistics on the other hand. Even in the basic examples given from linguistics there was some notation that I did not understand right away. In this sense I am really looking forward to reading chapter 3 which is supposed to give a rich overview of all the concepts needed from linguistics.
    I personally found the motivating section of chapter 1 also too long. I am about to learn some concepts and right now I don’t really have the feeling that I need to understand all the philosophical discussions about grammar, syntax and semantics.
    What I really loved on the other hand was the last section, “dirty hands”. In this section a small corpus (Tom Sawyer) is used to introduce some of the phenomena that one faces in natural language processing, some of which I had already discussed, without knowing it, in the article about text mining for linguists on Ulysses. In the book of course they are discussed in a more structured way, but one can easily download the source code from the above-mentioned article and play around to understand the basic concepts from the book. Among these were:
    Word counts / tokens / types The most basic operation one can do on a text is counting words. This is something that I already did in the Ulysses article. Counting words is interesting since in today’s world it can be automated. What I didn’t see in my last blog post is that counting words already leads to more insights than just a distribution of words, which I will discuss now:
    Distinctive words can be spotted. Once I have a corpus consisting of many different texts I can count all words and create a ranking of the most frequent words. I will realize that for any given text the ranking looks quite similar to that global ranking. But once in a while I might spot some words in a single text in the top 100 of most frequent words that would not appear (let’s say) in the top 200 of the global ranking. Those words seem to be distinctive words that are of particular interest for the current text. In the example of Tom Sawyer, “Tom” is such a distinctive word.
    Hapax legomena If one looks at all the words in a given text one will realize that most words occur less than 3 times. This phenomenon is called a hapax legomenon and demonstrates the difficulty of natural language processing: the data one analyses is very sparse. The frequent words are most of the time grammatical structures, whereas the infrequent words carry the semantics. From this phenomenon one moves very quickly to:
    Zipf’s law Roughly speaking, Zipf’s law says that when you count the word frequencies of a text and order them from the most frequent to the least frequent word, you get a table. In this table you can multiply the position (rank) of a word with its frequency and you will always get about the same number (saying that the rank is inversely proportional to the frequency of the word). This is of course only an approximation: just imagine the most frequent word occurring an odd number of times; then there is no integer frequency for the second most frequent word which, multiplied by 2, gives the frequency of the most frequent word.
    Anyway, Zipf’s law was a very important discovery and has been generalized by Mandelbrot’s law (which I so far only knew from chaos theory and fractals). Maybe sometime in the near future I will find some time to calculate the word frequencies of my blog and see if Zipf’s law holds (:
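    Checking Zipf’s law on any plain-text corpus only takes a few lines. The following is a minimal sketch of my own (not from the book); the file name tomsawyer.txt is just a placeholder for whatever corpus you use.

    # Minimal sketch: print rank * frequency for a few ranks; by Zipf's law the
    # product should stay roughly constant.
    import re
    from collections import Counter

    with open("tomsawyer.txt") as f:  # placeholder file name, any plain-text corpus will do
        words = re.findall(r"[a-z']+", f.read().lower())

    freqs = Counter(words).most_common()
    for rank in (1, 10, 100, 1000):
        if rank <= len(freqs):
            word, freq = freqs[rank - 1]
            print(rank, word, freq, rank * freq)
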
    Collocations / bigrams Another important concept is that of collocation. Many words only have a meaning together; in this sense “New York” or “United States” are more than the sum of the single words. The text points out that it is not sufficient to find the most frequent bigrams in order to find good collocations; those bigrams have to be filtered either with grammatical structures or normalized. I think calculating a Jaccard coefficient might be interesting (even though it was not written in the text). Should I really try to verify Zipf’s law in the near future, I will also try to verify my method for calculating collocations. I hope that I would find collocations in my blog like social network, graph database or news stream…
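    The Jaccard idea mentioned above is my own speculation, not something from the book, but it is easy to sketch: score a bigram by count(w1 w2) / (count(w1) + count(w2) - count(w1 w2)).

    # Minimal sketch: rank bigrams by their Jaccard coefficient.
    from collections import Counter

    def jaccard_bigrams(words, top=20):
        unigrams = Counter(words)
        bigrams = Counter(zip(words, words[1:]))
        scores = {
            (w1, w2): n / (unigrams[w1] + unigrams[w2] - n)
            for (w1, w2), n in bigrams.items()
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]
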
    KWIC What I did not have in mind so far is the keyword-in-context analysis. What happens here is that you look at all text snippets that occur in a certain window around a keyword. This seems more like work from linguistics, but I think automating this task would also be useful in natural language processing. So far it never came to my mind when using a computer system that it would also make sense to describe words by their context. Actually pretty funny, since this is the most natural operation we do when learning a new language.
    Exercises
    I really liked the exercises in the book. Some of them were really straightforward. But I had the feeling they were carefully chosen in order to demonstrate some of the information given in the book so far.
    What I was missing around these basic word statistics is the order of the words. This was somewhat reflected in the collocation problem. Even though I am not an expert yet, I have the feeling that most methods in statistical natural language processing seem to “forget” the order in which the words appear in the text. I am convinced that this is an important piece of information, which already inspired me in my diploma thesis to create a method similar to Explicit Semantic Analysis, and which is a core element of Typology!
    Anyway, reading the first chapter of the book I did not really learn anything new, but it helped me take a certain point of view. I am really excited to proceed. The next chapter will be about probability theory. I already saw that it is written in a pure math style, with examples like rolling a die rather than examples from corpus linguistics, which I find a bit sad.

    ]]>
    https://www.rene-pickhardt.de/foundations-of-statistical-natural-language-processing-review-of-chapter-1/feed/ 2
    Neo4j based Typology also awarded top 90 at Google Science fair. https://www.rene-pickhardt.de/neo4j-based-typology-also-awarded-top-90-at-google-science-fair/ https://www.rene-pickhardt.de/neo4j-based-typology-also-awarded-top-90-at-google-science-fair/#respond Tue, 22 May 2012 13:08:48 +0000 http://www.rene-pickhardt.de/?p=1350 Yesterday I shared the good news about Till and Paul, who have been awarded one of the top 5 projects at the German federal competition Young Scientists. Today the good news continues: together with more than 1000 competitors they also submitted the project to the Google Science Fair.
    And guess what! Google likes graph databases, information retrieval, auto-completion, and sentence prediction. Till and Paul have been selected (together with 5 other Germans) to be in the second round with the 90 remaining projects. Even Heise reported about this! Within the next month they will have phone calls with Google employees, and with a little bit of luck they will be selected for the top 15, which would mean they would be invited to the Googleplex in Mountain View, California.
    Remember, Typology is open source and already available as an Android app (only in German, but this will be fixed soon).
    Btw I am just wondering when I will be traveling to Google in Mountain View?

    ]]>
    https://www.rene-pickhardt.de/neo4j-based-typology-also-awarded-top-90-at-google-science-fair/feed/ 0