I came up with the concept of virtual payment channels when I finally started my work on a splicing protocol specification for BOLT 1.1. On paper I have already laid out the data flow for asynchronous, non-blocking splice-in and splice-out operations – they resemble pretty much the ones discussed in the GitHub issue. I will write about them soon. Before digging deeper into my specification I wanted to increase the usefulness of the splicing operations, so I started to wonder whether splicing could take place without ever hitting the blockchain. While playing around with some ideas I realized (though I have not proven it) that splicing without hitting the blockchain seems impossible as long as the trustless property of payment channels is mandatory. Hence I asked myself what would happen if we dropped this central property. As a result I accidentally came up with the concept of virtual payment channels.
Having talked to larger exchanges and potential hubs of the lightning network in the past, I have seen that they worry a lot about rebalancing their funds and about how to provide capacity without uselessly locking up their liquidity. I realized that virtual payment channels might be a solution to some of their core problems. Throughout this article, and at its end, I will provide use case stories from the perspective of a central node in the network as well as from the perspective of a mining pool.
Let us begin by looking at one of the two core elements of the lightning network: the (standard) payment channel. A standard payment channel operates in a trustless manner, allowing two parties to create a payment channel without having to trust each other. The ability to create trustless payment channels is certainly a property the lightning network must have. However, this trustless property comes at a rather high price, bringing the following three drawbacks (which would still exist even if we upgraded payment channels to follow the eltoo protocol proposal):
Additionally, an indirect problem is that companies like exchanges, large-scale shops, payment providers or even banks will have many payment channels, since they are likely to be central nodes in the network. The number of channels might exceed the capacity of the hardware running the node, so providers of core financial infrastructure might need to operate several lightning nodes (let us call them A, B, C, …) to serve their customers. These companies then have to decide, by collecting and evaluating data, which customers to connect to which of their nodes. They also have to connect their own nodes to each other via heavily funded payment channels. Even if this is done in some smart ring-like topology (with channels arranged so that each node can be reached in log N hops), it is not preferable, since far too many funds are unnecessarily locked up just to clear funds between one's own nodes. Another drawback is that those additional funds again sit in hot wallets and need to be available in the first place.
Let us assume in the above setting some node N1 wants to make a payment to some node N2. The route for the payment happens to be:
N1 –> A –> B –> N2.
In the classical lightning network the routing can only take place if A literally sends the requested money to node B within a payment channel between A and B. For brevity I omit the exchange of preimages and the negotiation of HTLCs and new commitment transactions. Those details are defined in the protocol specification so that A and B don't lose money while they route a payment from N1 to N2.
However, if A and B are owned by the same entity (as assumed above), there would be no harm if A received some money and B forwarded it without A actually forwarding the money to B. In particular, a payment channel with bitcoins locked between A and B would be superfluous. This observation is exactly the idea behind a virtual payment channel.
A high level protocol for operating a virtual payment channel could look like this:
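The exact message flow is up for discussion, but to make the idea concrete here is a minimal sketch in Python of the bookkeeping such a channel needs (all names and fields are my own hypothetical choices, nothing from an existing BOLT): the state is just a mutually signed balance sheet, and unlike with a real channel there is no funding transaction that could enforce it on-chain.

from dataclasses import dataclass

@dataclass
class VirtualChannelState:
    channel_id: str
    balance_a: int      # satoshis B vouches for on behalf of A
    balance_b: int      # satoshis A vouches for on behalf of B
    version: int        # strictly increasing, newest state wins
    sig_a: bytes = b""  # A's signature over (channel_id, balances, version)
    sig_b: bytes = b""  # B's signature over the same data

def update(state, amount_a_to_b, sign_a, sign_b):
    # A pure bookkeeping step: no on-chain transaction is involved.
    # The signatures only serve as proof of what both sides agreed to;
    # they are not enforceable by the blockchain.
    new = VirtualChannelState(
        channel_id=state.channel_id,
        balance_a=state.balance_a - amount_a_to_b,
        balance_b=state.balance_b + amount_a_to_b,
        version=state.version + 1,
    )
    assert new.balance_a >= 0 and new.balance_b >= 0
    new.sig_a = sign_a(new)
    new.sig_b = sign_b(new)
    return new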
Attention! If for some reason the other party turns out not to be trustworthy, one could lose as many funds as the other side virtually provided. In the best case the two nodes are owned by the same person or company, or there is some written agreement backing the virtual payment channel. Obviously, however, one cannot enforce such circumstances and prohibit arbitrary people from creating a virtual payment channel. That is why virtual payment channels have to be used with extreme caution. As soon as you accept an inbound virtual payment channel from a third party, you vouch with your own funds for the payments of your channel partner. Besides that, there are a couple of other drawbacks I see with this proposal.
The 3 biggest drawbacks that I see are:
Things to take into consideration:
I will now describe a realistic setting which would benefit hugely from the chance to create virtual payment channels.
A use case for virtual payment channels: Payouts in mining pools.
Let us assume you are operating a mining pool. You want to do daily payouts for your – let us say 100k – customers. Obviously your customers already trust you, since they commit their hash power to your pool and rely on the fact that you will eventually do a payout.
The payout scenario without virtual payment channels and without the lightning network.
Every day one has to fit 100k transactions onto the blockchain. With batching one could obviously use fewer transactions, but the number of transactions and the fees will still be considerable.
If we have the lightning network, the situation does not even get better:
For the payouts on day one we have to create payment channels. Obviously a mining pool cannot overfund those channels by much, since it would need to have that liquidity in the first place. So on day two the channels either have to be spliced in or closed and reopened – unless, of course, the money has already been spent by the customer over the lightning network. (This, however, will only be possible if additional funds of the mining pool are used to fund other, external channels, giving customers the possibility to spend their funds in a meaningful way.) We see that using lightning in this case actually seems to increase the number of blockchain transactions as well as the amount of liquidity being used.
Obviously the answer to the sketched problems could be that the mining pool does not operate channels to all its customers but just relies on the lightning network to find routes to the customers while doing payouts. I predict that in many cases it won't be possible for the pool to find a route to the customer when she creates an invoice, and a channel has to be funded anyway.
Situation with virtual payment channels on the lightning network:
The pool could accept a virtual payment channel in which the customer's side of the VPC holds funds as high as the amount the pool owes the customer. This does not require a blockchain transaction and saves fees. For the second payout on the next day, the balance can just be increased by the amount the pool owes the user. Obviously this kind of payout is not real, as the customer does not directly control the funds. However, the pool provides a cryptographic proof and a signature to the customer, which again cannot be enforced by the blockchain. The customer can at any time pay via lightning using exactly this virtual payment channel.
In order for this to be useful for the customer, his virtual payment channel needs to be connected to the “real” lightning network. This is achieved by the mining pool, which has to have several outgoing channels funded towards rather central nodes on the network (e.g. shops, exchanges, payment providers, …). Since the pool can assume that not all customers will spend or transfer all their funds immediately, the pool only has to provide a fraction of the funds allocated in the virtual payment channels as outgoing capacity on the real channels. This reduces the amount of liquidity that needs to sit on the hot lightning node wallet (see the back-of-the-envelope sketch below). In case a channel dries out, funds will be spliced in or the channel is closed and reopened. This will happen far less frequently than payouts to all customers and thus reduces the load on the blockchain. Also, the customer can already receive bitcoins over the lightning network in this virtual payment channel (if he decides to trust even more funds to the virtual channel). In any case the customer can obviously also accept or demand payments on a standard payment channel by disallowing inbound routing over the virtual payment channel.
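As a back-of-the-envelope calculation (every number here is invented purely for illustration), the saving in hot liquidity looks roughly like this:

# Illustrative only: all numbers are made up, not taken from any real pool.
customers = 100_000
avg_owed_sats = 50_000                       # average payout balance per customer
total_virtual = customers * avg_owed_sats    # 50 BTC of virtual liabilities

spend_fraction = 0.05                        # assume 5% of balances leave per day
hot_capacity = int(total_virtual * spend_fraction)

print(f"virtual liabilities: {total_virtual / 1e8:.1f} BTC")           # 50.0 BTC
print(f"real outgoing capacity needed: {hot_capacity / 1e8:.1f} BTC")  # 2.5 BTC

Instead of locking 50 BTC in customer channels, the pool would only fund 2.5 BTC of real channels towards the network, topping them up by splicing as they deplete.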
Finally the customer could even get hold of his funds by fulfilling inbound routing requests on his standard payment channels which do outbound routing over the virtual payment channel.
For customers of the pool this mechanism probably also has other great advantages:
Conclusion:
We see that virtual payment channels increase flexibility for groups of users that for some reason trust each other. We can think of a virtual payment channel as an agreement to spend the funds of the channel partner in one's own name. Applications include services that have plenty of customers and don't want to reallocate funds all the time, as well as very central power nodes that cannot handle all of their traffic on one computer and need to distribute the node across several machines. However, virtual payment channels should be used with great caution, as one needs to trust another party. In the case of merely distributing one's own node this will never be an issue. In the mining pool scenario the trust already needed to exist in the first place, so it should be acceptable, but it still carries a risk for the customer. The biggest downside is that a virtual payment channel can be seen as creating liquidity out of thin air; it is a kind of credit that might default and force chain reactions. However, in that case it is just the channel partner who should have known better than to trust blindly. In particular, if large operators go bankrupt as a consequence of such a chain reaction, everyone who has standard payment channels with those operators won't lose his or her funds, thanks to the trustless property of the lightning network.
To the best of my knowledge the concept of virtual payment channels for the lightning network has not been proposed before. If I am mistaken I will be happy to receive a pointer. Now I am very excited to see what you as a community think about my proposal. In particular I am excited because I understand that my proposal violates the core philosophy of the lightning network, namely that payment channels can operate in a trustless manner. Yet I think progress only happens if creativity is allowed to explore all possibilities instead of fixed limitations. Also, it will not change the trustless nature of the lightning network, since a virtual channel only comes into existence if two consenting parties agree to engage in such an endeavor. As outlined in this article, they might have good reasons to do so. In the great tradition of Bitcoin Improvement Proposals (BIPs) I suggest that we start a process for Lightning Network Improvement Proposals (aka LIPs). This draft would need to be specified in a little more detail if people think it would be a useful feature, and could eventually become such a LIP.
Obviously, if my proposal doesn't make it into future BOLT specifications, the idea is out anyway, and we cannot prevent companies like a mining pool operator from implementing this within the proprietary lightning-supported wallet they provide to their customers. However, I think it would be much more preferable if virtual payment channels either become part of the protocol standard or are never used.
If you liked this article and are interested in the follow ups you can follow me on twitter.
Having read Christian Decker's main publication for his dissertation, in which the authors suggest the use of invalidation trees to invalidate old states of a payment channel, I was confident that eltoo would be a really nice improvement to the lightning network protocol. However, following the “Don't trust. Verify!” philosophy of the Bitcoin community, I obviously had to study the eltoo protocol proposal myself. In particular I did not believe the bullish news from the magazines, since from a technical perspective I do not see how the speed of the lightning network could significantly be increased: at its current state it is basically bound by the speed of TCP/IP sockets at the payment channel level and, during routing, by the time locks of the HTLCs. Since eltoo only looks at channel management we can obviously omit the latter case. As I am starting to work on the splicing part of the BOLT 1.1 specification, which is a protocol extension of payment channel management, I obviously need to understand eltoo anyway.
Yesterday, after holding an openly licensed German-language lecture on the basics of the lightning network technology (not to be confused with BOLT, the Lightning Network RFC), I had a 5-hour train ride home from the event and finally found the time to carefully study the eltoo paper. The paper is very well structured and written, but I thought it might still be helpful to share my understanding of the proposed protocol back with the community. My main reason, however, is that I hope that by writing down and explaining the paper I get an even clearer understanding myself. In case I misunderstood something about the eltoo paper please let me know so that I can update this article. There are also some rather technical questions I still have (for example this one, which I have already posted on Bitcoin Stack Exchange).
So let us analyze eltoo by first understanding the motivation for yet another payment channel protocol, reviewing some well-known facts about current payment channel management.
Unpublished commitment transactions encode the balance (or, more technically speaking, the state) of a payment channel in the lightning network.
For the current version of the lightning network payment channel protocol we need to store information about all the old states of a channel. By the state of a channel we mean the current distribution of the channel's capacity (aka the balance) among the channel partners. At its core, the current payment channel (and also the solution proposed in eltoo) is just a 2-2 multisignature wallet. Currently, the balance of a payment channel is encoded by two unpublished transactions that would spend the channel's capacity according to the balance sheet, with the channel partners as recipients. We call those transactions commitment transactions. They are signed by both channel partners, and each partner stores one of them on his lightning node. This creates a kind of information asymmetry and a risk of funds being stolen should the information leak to the channel partner.
The funds within the channel are reallocated by creating a new commitment transaction (a kind of double spend) that spends the funds of the multisignature wallet (the channel's capacity) in a different way. In order for both parties to be able to trust the newly agreed-upon channel balance, we obviously have the problem of invalidating the old commitment transactions. In a trustless setting (which we wish to achieve) channel partners cannot rely on their counterparts to delete the old commitment transaction once they agree on a new channel balance. The reason is that those commitment transactions are valid bitcoin transactions signed by both channel partners, and both partners own a copy and could theoretically have made more copies in the meantime.
All currently known protocols for maintaining a full duplex payment channel have in common that old channel states need to be invalidated.
The current implementation of the lightning network achieves the goal of invalidating old channel states by including the hash of a revocation key within the output script of a commitment transaction. The revocation key itself is initially known only to the channel partner that does not hold a signed copy of that transaction. (Again, this is where the information asymmetry kicks in.) The output script of the commitment transaction will eventually spend the funds according to the channel balance that was agreed upon for this particular commitment transaction. However, it also allows the output of a published commitment transaction to be spent directly – sending the entire capacity of the channel to one's own bitcoin address – once one can present a signature from the secret revocation key. If the channel partners wish to update to a new channel state, they collaboratively compute the revocation key whose hash is stored in the most current commitment transaction. The important requirement of the protocol design is that neither partner can compute the revocation key without the help of the other partner. Owning the revocation key deters the channel partner from publishing an old commitment transaction (aka state), since the publishing side would be punished by the other side releasing the revocation key and spending the outputs of the commitment transaction directly. Note that releasing the revocation key itself is not dangerous, since it is only useful against the transaction whose output script includes the hash of that very key. As you would imagine, each new state / commitment transaction includes the hash of a new revocation key. Therefore it is really the old transaction that needs to be kept secret, or better yet destroyed. Obviously, with such a setup both sides need to store all old revocation keys in order to protect themselves from the other side publishing an old channel state (which, for example, could be a state in which all the funds belonged to the side that tries to publish it).
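As a toy model of this bookkeeping (deliberately ignoring Bitcoin Script, the actual per-commitment secret derivation of BOLT 3 and everything else that makes the real protocol safe), one side's storage can be pictured like this:

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class ChannelSide:
    # Toy bookkeeping for one side of a channel, not real BOLT 3 logic.
    def __init__(self):
        self.current_commitment = None   # the one transaction I may broadcast
        self.old_revocation_keys = []    # grows with every state update

    def update_state(self, new_commitment, revocation_key_of_old_state):
        # Receiving the revocation key for the previous commitment is what
        # makes it unsafe for my partner to ever broadcast that old state.
        self.old_revocation_keys.append(revocation_key_of_old_state)
        self.current_commitment = new_commitment

    def punish(self, broadcast_commitment):
        # If the broadcast transaction commits to the hash of a key I hold,
        # it is an old state and I can sweep its entire output directly.
        for key in self.old_revocation_keys:
            if broadcast_commitment["revocation_hash"] == h(key):
                return key   # use it to spend the output to my own address
        return None          # it was the current state: nothing to punish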
Eltoo suggests a different protocol for channel management, such that neither side needs to store all old states but just the current one. This is supposed to simplify implementing lightning nodes and also running them. According to Christian Decker this will also simplify implementing features like splicing. (On a side note, I have not thought the splicing operation completely through for eltoo, but it seems obvious to me that the chain of update transactions in eltoo can easily be extended by at least splice-out operations.)
The initial setup of a payment channel within the Eltoo protocol is in my opinion exactly the same as in the current specification of the lightning network.
A payment channel is created by funding a 2-2 multisignature wallet. The state of the channel is again encoded within a transaction that is just not published yet; only the process of invalidating old state differs. Both channel parties create two sets of public and private keys: the first set is called the update keys and the second set the settlement keys. Actually, the settlement keys should be derived from a hierarchical deterministic wallet so that settlement transactions can only be bound to one specific update transaction (a rather subtle though important detail). According to Christian Decker, what is new is that
… commitment is no longer a single transaction customized for each side, but an update/commitment that is symmetric and a settlement built on top.
A chain of update and settlement transactions allows for encoding channel state.
The state of the channel is encoded as the most recent element of a sequence of settlement transactions: the current settlement transaction encodes the current channel state, and all old settlement transactions are not needed and can safely be thrown away. Though not the same as the commitment transactions in current lightning network implementations, the settlement transactions can probably be considered their analogous counterpart. Each settlement transaction spends a so-called update transaction. The update transactions form a “linked list” starting from the funding transaction (which can itself be interpreted as an update transaction): each update transaction spends the output of the funding transaction or of a previous update transaction. The output of a new update transaction is a two-path multisignature script. In the first path the output can be spent directly with the pair of update keys; in the second path the output can be spent with the settlement keys once a timelock has expired. This always gives priority to double-spending an old settlement transaction with a newer update transaction. If the newest settlement transaction is broadcast, no newer update transaction exists, so neither channel partner alone can double-spend it with another update transaction; once the timelock is over, the channel is therefore closed.
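A minimal data-structure sketch of this chain (my own simplification: real update and settlement transactions are bitcoin transactions with the two-path script written in Bitcoin Script, not Python objects):

from dataclasses import dataclass

@dataclass
class UpdateTx:
    state_num: int             # increases with every channel update
    spends: "UpdateTx | None"  # the funding tx (None here) or an earlier update tx
    # output script, informally:
    #   path 1: update_key_A AND update_key_B          -- spendable immediately
    #   path 2: settlement_key_A AND settlement_key_B  -- only after the timelock

@dataclass
class SettlementTx:
    spends: UpdateTx           # may only use path 2, i.e. after the timelock
    balance_a: int
    balance_b: int

funding = UpdateTx(state_num=0, spends=None)
u1 = UpdateTx(state_num=1, spends=funding)
u2 = UpdateTx(state_num=2, spends=u1)   # publishing u2 invalidates state 1
s2 = SettlementTx(spends=u2, balance_a=40_000, balance_b=60_000)
# only (u2, s2) need to be kept; u1 and its settlement tx can be thrown away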
This setup is still bad, however, since the current channel balance is encoded in a settlement transaction which is the child of the latest update transaction, which is the child of the previous update transaction, and so on. Therefore, in order to close the channel, one would need to broadcast the complete chain of update transactions. This would take no load off the blockchain compared to directly transferring bitcoins every time one wants to update a payment channel, and we would also need to store all old update transactions, not really decreasing the overhead. With the current version of the bitcoin protocol there is nothing we can do about this, and eltoo would be nothing but a nice thought and completely impractical. However, a minor update to bitcoin (which was actually already discussed in the original lightning network paper) has finally been proposed as BIP 118 and would make eltoo useful.
Eltoo achieves the simplification by introducing the idea of floating transactions, which are realized by introducing the sighash modifier SIGHASH_NOINPUT for the existing OP codes check{,multi}sig{,verify}.
With SIGHASH_NOINPUT the signature of a transaction that spends some output ignores the prevout point (tx hash, output index), the sequences and the scriptCode in its signature calculation. This allows the output being spent to be changed later, as long as the output script matches the input script. Since the output scripts of the update transactions are all the same (at least for the path that spends the output to the update keys of the 2-2 multisig wallet again), one can bind a later update transaction to the parent, grandparent, great-grandparent, … of the most recent update transaction. (I don't see the point of summarizing all the technical details here, since they are very well described in the paper and I don't see how I would formulate them differently or more concisely.) The consequence, however, is that instead of having to publish all previous update transactions, the current one can simply be bound to the funding transaction and broadcast together with its settlement transaction. Since there is a timelock in the update transaction, the settlement transaction (which would be invalid before the timeout) should obviously only be broadcast after the timeout, as it would be rejected by the bitcoin network beforehand.
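The rebinding trick can be pictured as follows (again a toy model of my own: a real implementation signs the transaction serialization under the BIP 118 rules, here the “signature” is just a hash that omits the prevout from the signed data):

import hashlib

def sighash_noinput(tx: dict) -> bytes:
    # Toy sighash: commit to everything except which output is being spent.
    signed = {k: v for k, v in tx.items() if k != "prevout"}
    return hashlib.sha256(repr(sorted(signed.items())).encode()).digest()

# the latest update transaction is signed once ...
update_42 = {"prevout": "update_41:0", "state": 42, "script": "2-2 update"}
sig = sighash_noinput(update_42)   # stand-in for an actual ECDSA signature

# ... and can later be rebound directly onto the funding output: the
# signature still verifies because the prevout was never part of it.
update_42["prevout"] = "funding:0"
assert sighash_noinput(update_42) == sig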
That being said, the current channel state is defined by the already mined funding transaction together with the most recent update transaction and its child, the settlement transaction. If channel partner Mallory wanted to publish an older state (encoded within some settlement transaction), she would first need to publish the update transaction that is the parent of this very settlement transaction. She can do so, as she can also bind that update transaction to the funding transaction. However, before she can spend the settlement transaction, the path in the output script of that update transaction allows direct spending by a newer update transaction (the newest update transaction – known to the other side – being bound to that particular update transaction). This prevents old state from entering the blockchain.
There are some minor details about including a trigger transaction which I will omit here. However, it seems important to mention that this mechanism also requires the other side to be online, or a watchtower mechanism to be included. Also, with this mechanism, in the uncollaborative closing case, the collaborative one and the case where one party tries to steal funds, at least three transactions need to be broadcast to the blockchain (the trigger transaction, the most recent update transaction and the settlement transaction). In that sense I understand eltoo as a tradeoff between convenience of channel management and minimal blockchain involvement. As I mentioned during the autopilot discussion at the lightning hackday, the blockchain will still need about 10 years to have one payment channel open for every internet user in the world. On the other side, eltoo seems to have many advantages over the current protocol for payment channel management.
Advantages of eltoo:
Disadvantages of eltoo:
Open Questions:
Which Christian Decker answered by stating that, should this really happen, one can reset the counter by splicing or re-anchoring (a splice without changing the values).
If you are more interested in technical write-ups about the technology of the lightning network you can follow me on Twitter. If you are also an enthusiast of the lightning network you could connect to my lightning node via 036f464b54416ea583dcfae3872d28516dbe85414ed838513b1c34fb3a4aee4e7a@144.76.235.20:9735 and help grow the lightning network. Also I want to give a shoutout and thanks to Christian Decker, who reviewed this article and pointed out some mistakes, which have since been fixed.
I have been visiting Berlin to attend the second Lightning Hackday and want to give a brief wrap-up of the event. This article basically covers two topics. First, as promised in my bar camp session on “Building an automated topology for autopilot features of the lightning network nodes”, I will give an extensive protocol / summary of the session itself. Second, I will talk about an already well-known technique called splicing, which I realized during the event might be one of the more important yet unimplemented features of lightning nodes.
Let me just make a quick shoutout to Jeff and his team from Fulmo Lightning: thanks for organizing this excellent event. Especially bringing together such a group of high-profile lightning network enthusiasts was astonishing to me. Of the many tech events and conferences I have attended in the past, this was one of my top experiences.
In my bar camp session we had roughly 30 people attending. Luckily Conner Fromknecht and Olaoluwa Osuntokun from the San Francisco based Lightning Labs also joined the session and gave many insights resulting from their hands-on experience. I started the session with a short presentation of my thoughts about the problem. I had previously formulated those in my GitHub repo as a rough draft / sketch for a whitepaper on the topic. That happened after I opened issue 677 in the LND project criticizing the current design choices for the autopilot. The main points of those thoughts are a list of metrics and criteria I thought I would want to monitor and optimize in order to find a good network topology. Before the hackday that list looked like this (we discussed the list and people basically agreed on it):
Before I state the new ideas that came from the attendees and the discussion, I want to briefly sum up the great keynote by the guys from Lightning Labs that preceded the bar camp session. In particular I think the insights and techniques they talked about (atomic multi-path routing and splicing) have a huge impact on the autopilot and the topology generation problems of the lightning network. Long story short, the magic concept in my opinion is splicing. For those who are unfamiliar with the topic: splicing is the process of updating the channel balance of a payment channel by adding or removing (partial) funds. In the past I always thought that, even though it is not implemented in any of the lightning nodes, this was a problem that is technically rather trivial to solve and thus of minor importance. The guys from Lightning Labs basically stated the same, emphasizing that splicing funds out of a channel is trivial and can even be achieved easily in a non-blocking way, such that the channel can be used again directly after splicing out, even if the splicing transaction is not yet mined. However, splicing in (aka adding more funds to a channel) seems to be a little more cumbersome. Without going into too many technical details, the real eye-opener for me was the fact that splicing (together with the autopilot) seems to make lightning wallets, as they exist these days, obsolete. This obviously is a game changer. So to make it crystal clear:
With splicing and the autopilot implemented in a standard bitcoin wallet (running in the background without users needing to be aware of it), users can efficiently, quickly and cheaply send funds from one to another. If a path via lightning exists, the transaction will be lightning fast. If no path can be found, one could just splice out some funds from existing channels to create a new channel for the transaction. This basically boils down to a common on-chain transaction, which happened before we had the lightning network anyway – but it no longer matters if all funds are currently frozen up in payment channels. Also, it doesn't waste the blockchain transaction but rather uses the opportunity to create a funding transaction for the next channel, increasing the size of the lightning network. I dare to say a bigger lightning network is a better lightning network in general. The goal would basically be that eventually all bitcoin funds are locked in payment channels (which, with splicing, doesn't reduce the control or flexibility of the user). In case a user really needs to do a transaction which can't be achieved via lightning, it will just be spliced out and takes as much processing time as a regular on-chain transaction. As a disclaimer: obviously watchtowers are still needed, particularly in this scenario in which users might not even realize they are using the lightning network.
Taking the opportunities of splicing into consideration, I think the topology problem of the autopilot becomes an issue of only minor importance. One can easily splice out from existing payment channels to new payment channels if needed. The only bad thing is that such a transaction is not routed at lightning speed but rather takes a couple of block times to be mined and processed. However, it eventually creates a user-generated network topology that hopefully follows actual flows of funds rather closely and would thus be rather robust. The only drawback of such a process is that transactions would frequently include channel creations, which takes some time, and that only a maximum of 300k channels can be altered per day on top of the current bitcoin protocol. This observation explains why topology generation by the autopilot is still a reasonable topic to think about, since it should still help move traffic away from the blockchain.
Finally, I will now list some new thoughts that were collected during the session. I will also update my whitepaper soon. Feel free to fork me on GitHub and open a pull request to fix mistakes or add your own thoughts.
Number of nodes reachable within x hops:
It was pointed out that this metric looks quite nice. As a result of the discussion we came to realize that this greedy heuristic would basically lead to a scenario in which every node opens a channel to the most central node in the network: such a central node increases the number of nodes that can be reached within x hops by the maximum amount. Still, it looks like an important number to somehow optimize for; a sketch of how one might compute it follows below.
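For completeness, here is a small sketch (my own, assuming networkx as a dependency) of how a node could score a candidate channel by this metric before opening it:

import networkx as nx

def reachable_within(graph: nx.Graph, node, max_hops: int) -> int:
    # count nodes reachable from `node` in at most `max_hops` hops
    lengths = nx.single_source_shortest_path_length(graph, node, cutoff=max_hops)
    return len(lengths) - 1   # exclude the node itself

def gain_of_channel(graph: nx.Graph, me, candidate, max_hops: int = 3) -> int:
    # how many additional nodes a new channel to `candidate` would unlock
    before = reachable_within(graph, me, max_hops)
    trial = graph.copy()
    trial.add_edge(me, candidate)
    return reachable_within(trial, me, max_hops) - before

A greedy strategy that always maximizes gain_of_channel is exactly what would funnel everyone towards the most central node, which is why we were skeptical of optimizing this metric in isolation.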
Honesty and well behavior of nodes:
Following a similar philosophy, we discussed whether a distributed topology creation algorithm should aim for the global health of the network, in comparison to greedy strategies in which every node tries to optimize its own view of the network. Though it wasn't pointed out in the session, I think that a strategy where every node tries to optimize its own access to the network will at some point yield a Nash equilibrium. With my rather limited understanding of game theory, I think this might not necessarily be the best solution from a global perspective. We also discussed that, in the later mentioned settings where clients share information with neighbors, an algorithm must be robust against fraudulent behavior or lying nodes.
Stakeholder models:
Pretty much everyone agreed right away that different nodes might have different needs on the lightning network, so the topology creation should be configurable (or learnable by the node), taking into account whether the node is just a casual consumer, a shop, a bank / exchange, or …
Privacy vs information sharing:
We discussed quite extensively that, for a distributed algorithm to predict which channels should be created, it would be great if channel balances were public (or at least if some rough information were available about the distribution of the balance within a channel). We realized that as a first step, following the ideas of LND issue 1253, nodes should start collecting historic information about their own operations and routing activities. Actually, I just saw that a pull request claiming to resolve issue 1253 already exists.
We also realized that channel fees might act as a reasonably good proxy for the channel balance. Assume Alice and Bob have a channel and the balance is very skewed, in the sense that Alice has almost no funds and Bob has all of them. If Bob is asked to route a payment through that channel, he will probably charge a smaller fee than Alice would if she were asked to route a payment through her almost dried-up channel.
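A toy version of this heuristic (entirely my own guess at how a node might price routing; I am not aware of lnd or any other implementation doing this) could look as follows:

def routing_fee(base_fee: int, local_balance: int, capacity: int) -> int:
    # charge more the less outbound liquidity is left on our side:
    # a full channel quotes base_fee, an empty one about ten times that
    depletion = 1.0 - local_balance / capacity   # 0.0 = full, 1.0 = empty
    return int(base_fee * (1 + 9 * depletion))

print(routing_fee(base_fee=1000, local_balance=900_000, capacity=1_000_000))  # 1900
print(routing_fee(base_fee=1000, local_balance=50_000, capacity=1_000_000))   # 9550

An observer can then invert the logic: a low quoted fee hints at a well-funded side of the channel, a high fee at an almost depleted one.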
Still, the idea circulated that nodes could share their channel balances with neighbors, in a similar fashion to how routing information in the IP network is shared with neighbors. In such a way each node would eventually construct a map of paths.
A point mentioned – which in my opinion is important but only somewhat related to these issues – was the fact that nodes should of course charge higher routing fees if the timelock of the routing request is high, since in the worst case this makes a channel, or many other paths, unusable for quite some time, maybe even without a payment taking place. As a side note, I just realized that this could be a more sophisticated strategy for nodes to estimate their fees if they are aware of the number of routing paths their channel makes possible.
Some technical details:
One could use the number of disjoint paths between nodes as a good criterion, since it also enables heavy use of atomic multi-path transactions. It was also mentioned that one could look at the AS number of the underlying internet hosts.
Why not use machine learning instead of trying to find some smart algorithm / heuristics?
For me the most surprising idea was that this entire autopilot problem could easily be turned into a machine learning problem. There are some obvious flaws, because single nodes might not have enough training data, and even if one had enough data, sharing the model would probably not work out of the box. So we discussed a little whether this would be a use case for transfer learning. I will not dig deeper into the topic here, since the article is already quite long. But working in the field of machine learning, being a data scientist, and never having put even the slightest thought into this idea before the event took place was a little shocking and surprising to me.
Anyway I hope my summary of the session will be useful for you and the community! My personal roadmap now consists of four things.
Also kudos to Susette Bader, who took this really nice snapshot and cover image of this post while we were hacking.
This graph consists of 39338 edges. There is a directed edge for each email sent from one person to another person, together with a timestamp of when this happened. If a person puts n recipients in CC, n edges are added to the graph.
rpickhardt$ wc -l temporalGraph.tsv
39338 temporalGraph.tsv
rpickhardt$ head -5 temporalGraph.tsv
GardeM@dnc.org DavisM@dnc.org 1 17 May 2016 19:51:22
ShapiroA@dnc.org KaplanJ@dnc.org 1 4 May 2016 06:58:23
JacquelynLopez@perkinscoie.com EMail-Vetting_D@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com LykinsT@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com ReifE@dnc.org 1 13 May 2016 21:27:16
Clearly the format is: sender TAB receiver TAB 1 TAB date
The data is currently not sorted by the fourth column, but this can easily be done. Clearly an email network is directed and can have multi-edges.
Looking at the data I discovered that many mails have more than one recipient, so I thought it would be nice to look at the social network structure by counting how often two people occur together in the recipient list of an email. This can reveal a lot about the social network structure of the DNC.
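Deriving this weighted co-occurrence graph from the recipient lists is straightforward; here is a sketch (the parsing of the raw mails into per-mail recipient lists is assumed to have happened already):

from itertools import combinations
from collections import Counter

def cc_graph(mails):
    # `mails` is a list of recipient lists, one per email,
    # e.g. [["DavisM@dnc.org", "KaplanJ@dnc.org"], ...]
    counts = Counter()
    for recipients in mails:
        for a, b in combinations(sorted(set(recipients)), 2):
            counts[(a, b)] += 1   # undirected: count each pair once
    return counts

def write_tsv(counts, path="weightedCCGraph.tsv"):
    with open(path, "w") as f:
        for (a, b), n in counts.most_common():
            # emit both directions, matching the symmetric rows shown below
            f.write(f"{a}\t{b}\t{n}\n")
            f.write(f"{b}\t{a}\t{n}\n")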
rpickhardt$ wc -l weightedCCGraph.tsv
20864 weightedCCGraph.tsv
rpickhardt$ head -5 weightedCCGraph.tsv
PaustenbachM@dnc.org MirandaL@dnc.org 848
MirandaL@dnc.org PaustenbachM@dnc.org 848
WalkerE@dnc.org PaustenbachM@dnc.org 624
PaustenbachM@dnc.org WalkerE@dnc.org 624
WalkerE@dnc.org MirandaL@dnc.org 596
Clearly the format is: recipient1 TAB recipient2 TAB count
where count counts how often recipient1 and recipient2 have appeared together in mails.
There have been
included in the mails. In total I found 1226 different senders and 1384 different receivers. The top 7 senders are:
MirandaL@dnc.org 1482
ComerS@dnc.org 1449
ParrishD@dnc.org 750
DNCPress@dnc.org 745
PaustenbachM@dnc.org 608
KaplanJ@dnc.org 600
ManriquezP@dnc.org 567
And the top 7 receivers are:
MirandaL@dnc.org 2951
Comm_D@dnc.org 2439
ComerS@dnc.org 1841
PaustenbachM@dnc.org 1550
KaplanJ@dnc.org 1457
WalkerE@dnc.org 1110
kaplanj@dnc.org 987
As you can see, both kaplanj@dnc.org and KaplanJ@dnc.org occur in the data set, so – as I mention in the roadmap section at the end of the article – more cleanup of the data might be necessary to get a more precise picture.
Still, at first glimpse the data looks pretty natural. In the following diagram I show the rank-frequency plot of senders and receivers. One can see that some people are way more active than others. Also, the recipient curve is above the sender curve, which makes sense since every mail has exactly one sender but at least one recipient.
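Such a rank-frequency plot can be reproduced along these lines (a sketch; senders and receivers are assumed to be dicts mapping each address to its mail count, computed from the TSV file above):

import matplotlib.pyplot as plt

def rank_frequency(counts: dict):
    # sort activity counts descending: index = rank, value = frequency
    return sorted(counts.values(), reverse=True)

def plot(senders: dict, receivers: dict):
    plt.loglog(range(1, len(senders) + 1), rank_frequency(senders), label="senders")
    plt.loglog(range(1, len(receivers) + 1), rank_frequency(receivers), label="receivers")
    plt.xlabel("rank")
    plt.ylabel("number of mails")
    plt.legend()
    plt.show()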
You can also see the rank / co-occurrence-count diagram of the co-occurrence network. When the ranks are above 2000, the standard network structure picture changes a little. I have no plausible explanation for this; maybe it is due to the fact that the data dump is not complete. Still, the data looks pretty natural to me, so further investigations might make sense.
The crawler code is a two-liner – just some wget and sleep magic.
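I am not reproducing the exact two-liner here, but a Python equivalent would be something like the following (the URL pattern and id range are placeholders, not the actual endpoint I crawled):

import time, urllib.request

for mail_id in range(1, 20000):   # placeholder id range
    url = f"https://example.org/dnc-emails/{mail_id}"   # placeholder URL
    urllib.request.urlretrieve(url, f"{mail_id}.eml")
    time.sleep(1)   # throttle the requests – the "sleep magic" of the original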
The Python code for processing the mails builds upon the Python email library by Alain Spineux, which is released under the LGPL license. My code on top is released under GPLv3 and can be found on GitHub.
I met Matthias Bröcheler (CTO of Aurelius, the company behind Titan graph) 4 years ago in a teaching situation at the German national student high school academy. It was a time when I was still more mathematician than computer scientist, and my journey towards becoming a computer scientist had just started. Matthias was in the middle of his PhD program and I valued his insights and experiences a lot. It was through him that my eyes were first opened to what big data means and how companies like Facebook, Google and so on knit their business model around collecting data. Anyway, Matthias influenced me in quite some ways and I have a lot of respect for him.
I did not start my PhD right away and we lost contact. I knew he was interested in graphs, but that was about it. Only when I started to use neo4j more and more did I realize that Matthias was also one of the authors of the TinkerPop Blueprints, the interfaces for talking to graphs that most graph database vendors use. At that time I looked him up again and realized he was working on Titan, a distributed graph database. I found this promising-looking slide deck:
Slide 106:
Slide 107:
But at that time, for me there wasn't much evidence that Titan would really hold the promise given in slides 106 and 107. In fact those goals seemed as crazy and unreachable as my former PhD proposal on distributed graph databases. (By the way: reading the PhD proposal now, I am kind of amused, since I did not really aim for the important and big points like Titan did.)
During the redesign phase of metalcon we started playing around with HBase to support the architecture of our like button, and especially to be able to integrate this with recommendations coming from Mahout. I started to realize the big fundamental differences between HBase (an implementation of Google's Bigtable) and Cassandra (an implementation of Amazon's Dynamo), which result from the CAP theorem about distributed systems. Looking around for information about distributed storage engines I stumbled upon Titan again and saw Matthias' talk at the Cassandra Summit 2013. Around minute 21 / 22 the talk gets really interesting; I can also suggest skipping the first 15 minutes:
Let me sum up the amazing parts of the talk:
Skimming the data with my eyes, there are already some facts that make me very curious for a deeper data analysis:
I am pretty sure there are some more interesting facts, and maybe someone else has collected a better data set over the complete period of time, including YouTube snapshots as well as Facebook and Twitter mentions.
Since I have been asked several times already: here are the final rankings to download and also as a table in the blog post:
Rank | Course name | Number of votes |
1 | sectio chirurgica anatomie interaktiv | 8013 |
2 | internationales agrarmanagement 2 | 7557 |
3 | ingenieurmathematik fur jedermann | 2669 |
4 | harry potter and issues in international politics | 2510 |
5 | online surgery | 2365 |
6 | l3t s mooc der offene online kurs uber das lernen und lehren mit technologien | 2270 |
7 | design 101 or design basics 2 | 2216 |
8 | einfuhrung in das sozial und gesundheitswesen sozialraume entdecken und entwickeln | 2124 |
9 | changeprojekte planen nachhaltige entwicklung durch social entrepreneurship | 2083 |
10 | social work open online course swooc14 | 2059 |
11 | understanding sustainability environmental problems collective action and institutions | 1912 |
12 | the dance of functional programming languaging with haskell and python | 1730 |
13 | zyklenbasierte grundung systematische entwicklung von geschaftskonzepten | 1698 |
14 | a virtual living lab course for sustainable housing and lifestyle | 1682 |
15 | family politics domestic life revolution and dictatorships between 1900 1950 | 1476 |
16 | h2o extrem | 1307 |
17 | dark matter in galaxies the last mystery | 1261 |
18 | algorithmen und datenstrukturen | 1207 |
19 | psychology of judgment and decision making | 1168 |
20 | the future of storytelling | 1164 |
21 | web engineering | 1152 |
22 | die autoritat der wissenschaften eine einfuhrung in das wissenschaftstheoretische denken 2 | 1143 |
23 | magic and logic of music a comprehensive course on the foundations of music and its place in life | 1138 |
24 | nmooc nachhaltigkeit fur alle | 1130 |
25 | sovereign bond pricing | 1115 |
26 | soziale arbeit eine einfuhrung | 1034 |
27 | mathematische denk und arbeitsweisen in geometrie und arithmetik | 1016 |
28 | social entrepreneurship wir machen gesellschaftlichen wandel moglich | 1010 |
29 | molecular gastronomy an experimental lecture about food food processing and a bit of physiology | 984 |
30 | fundamentals of remote sensing for earth observation | 920 |
31 | kompetenzkurs ernahrungswissenschaft | 891 |
32 | erfolgreich studieren | 879 |
33 | deciphering ancient texts in the digital age | 868 |
34 | qualitative methods | 861 |
35 | karl der grosse pater europae | 855 |
36 | who am i mind consciousness and body between science and philosophy | 837 |
37 | programmieren mit java | 835 |
38 | systemisches projektmanagement | 811 |
39 | lernen ist sexy | 764 |
40 | modelling and simulation using matlab one mooc more brains an interdisciplinary course not just for experts | 760 |
41 | suchmaschinen verstehen | 712 |
42 | hands on course on embedded computing systems with raspberry pi | 679 |
43 | introduction to mixed methods and doing research online | 676 |
44 | game ai | 649 |
45 | game theory and experimental economic research | 633 |
46 | cooperative innovation | 613 |
47 | blue engineering ingenieurinnen und ingenieure mit sozialer und okologischer verantwortung | 612 |
48 | my car the unkown technical being | 612 |
49 | gesundheit ein besonderes gut eine multidisziplinare erkundung des deutschen gesundheitssystems | 608 |
50 | teaching english as a foreign language tefl part i pronunciation | 597 |
51 | wie kann lesen gelernt gelehrt und gefordert werden lesesozialisation lesedidaktik und leseforderung vom grundschulunterricht bis zur erwachsenenbildung | 593 |
52 | the european dream | 576 |
53 | education of the present what is the future of education | 570 |
54 | faszination kristalle und symmetrie | 561 |
55 | italy today a girlfriend in a coma a walk through today s italy | 557 |
56 | dna from structure to therapy | 556 |
57 | grundlagen der mensch computer interaktion | 549 |
58 | malnutrition in developing countries | 548 |
59 | marketing als strategischer erfolgsfaktor von der produktinnovation bis zur kundenbindung | 540 |
60 | environmental ethics for scientists | 540 |
61 | stem cells in biology and medicine | 528 |
62 | praxiswissen fur den kunstlerischen alltagsdschungel | 509 |
63 | physikvision | 506 |
64 | high five evidence based practice | 505 |
65 | future climate water | 484 |
66 | diversity and communication challenges for integration and mobility | 477 |
67 | social entrepreneurship | 469 |
68 | die kunst des argumentierens | 466 |
69 | der hont feat mit dem farat wek wie kinder schreiben und lesen lernen | 455 |
70 | antikrastination moocen gegen chronisches aufschieben | 454 |
71 | exercise for a healthier life | 454 |
72 | the startup source code | 438 |
73 | web science | 435 |
74 | medizinische immunologie | 433 |
75 | governance in and through human rights | 431 |
76 | europe in the world law and policy aspects of the eu in global governance | 419 |
77 | komplexe welt strukturen selbstorganisation und chaos | 419 |
78 | mooc basics of surgery want to become a real surgeon | 416 |
79 | statistical data analysis for the humanities | 414 |
80 | business math r edux | 406 |
81 | analyzing behavioral dynamics non linear approaches to social and cognitive sciences | 402 |
82 | space technology | 397 |
83 | der erzahler materialitat und virtualitat vom mittelalter bis zur gegenwart | 396 |
84 | kriminologie | 395 |
85 | von e mail skype und xing kommunikation fuhrung und berufliche zusammenarbeit im netz | 394 |
86 | wissenschaft erzahlen das phanomen der grenze | 392 |
87 | nachhaltige entwicklung | 389 |
88 | die nachste gesellschaft gesellschaft unter bedingungen der elektrizitat des computers und des internets | 388 |
89 | die grundrechte | 376 |
90 | medienbildung und mediendidaktik grundbegriffe und praxis | 368 |
91 | bubbles everywhere speculative bubbles in financial markets and in everyday life | 364 |
92 | the heart of creativity | 363 |
93 | physik und weltraum | 358 |
94 | sim suchmaschinenimplementierung als mooc | 354 |
95 | order of magnitude physics from atomic nuclei to the universe | 350 |
96 | entwurfsmethodik eingebetteter systeme | 343 |
97 | monte carlo methods in finance | 335 |
98 | texte professionell mit latex erstellen | 331 |
99 | wissenschaftlich arbeiten wissenschaftlich schreiben | 330 |
100 | e x cite join the game of social research | 330 |
101 | forschungsmethoden | 323 |
102 | complex problem solving | 321 |
103 | programmieren lernen mit effekt | 317 |
104 | molecular devices and machines | 317 |
105 | wie man erfolgreich ein startup aufbaut | 315 |
106 | grundlagen der prozeduralen und objektorientierten programmierung | 314 |
107 | introduction to disability studies | 314 |
108 | eu2c the european union explained by two partners cologne and cife | 313 |
109 | the english language a linguistic introduction 2 | 311 |
110 | allgemeine betriebswirtschaftslehre | 293 |
111 | interaction design open design | 293 |
112 | how we learn nowadays possibilities and difficulties | 288 |
113 | foundations of educational technology | 288 |
114 | projektmanagement und designbasiertes lernen | 281 |
115 | human rights | 278 |
116 | kompetenz des horens technische gehorbildung | 278 |
117 | it infrastructure management | 276 |
118 | a media history in 10 artefacts | 274 |
119 | introduction to the practice of statistics and regression | 271 |
120 | what is a good society introduction to social philosophy | 268 |
121 | modellierungsmethoden in der wirtschaftsinformatik | 265 |
122 | objektorientierte programmierung von web anwendungen von anfang an | 262 |
123 | intercultural diversity networking vielfalt interkulturell vernetzen | 260 |
124 | foundations of entrepreneurship | 259 |
125 | business communication for impact and results | 257 |
126 | gamification | 257 |
127 | creativity and design in innovation management | 256 |
128 | mechanik i | 252 |
129 | global virtual project management | 252 |
130 | digital signal processing for everyone | 249 |
131 | kompetenzen fur klimaschutz anpassung | 248 |
132 | digital economy and social innovation | 246 |
133 | synthetic biology | 245 |
134 | english phonetics and phonology | 245 |
135 | leibspeisen nahrung im wandel der zeiten molekule brot kase fleisch schokolade und andere lebensmittel | 243 |
136 | critical decision making in the contemporary globalized world | 238 |
137 | einfuhrung in die allgemeine betriebswirtschaftslehre schwerpunkt organisation personalmanagement und unternehmensfuhrung | 236 |
138 | didaktisches design | 235 |
139 | an invitation to complex analysis | 235 |
140 | grundlagen der programmierung teil 1 | 234 |
141 | allgemein und viszeralchirurgie | 233 |
142 | mathematik 1 fur ingenieure | 231 |
143 | consumption and identity you are what you buy | 231 |
144 | vampire fictions | 230 |
145 | grundlagen der anasthesiologie | 228 |
146 | marketing strategy and brand management | 227 |
147 | political economy an introduction | 225 |
148 | gesundheit | 221 |
149 | object oriented databases | 219 |
150 | lebenswelten perspektiven fur menschen mit demenz | 217 |
151 | applications of graphs to real life problems | 210 |
152 | introduction to epidemiology epimooc | 207 |
153 | network security | 207 |
154 | global civics | 207 |
155 | wissenschaftliches arbeiten | 204 |
156 | annaherungen an zukunfte wie lassen sich mogliche wahrscheinliche und wunschbare zukunfte bestimmen | 202 |
157 | einstieg wissenschaft | 200 |
158 | engineering english | 199 |
159 | das erklaren erklaren wie infografik klart erklart und wissen vermittelt | 198 |
160 | betriebswirtschaftliche und rechtliche grundlagen fur das nonprofit management | 192 |
161 | art and mathematics | 191 |
162 | vom phanomen zum modell mathematische modellierung von natur und alltag an ausgewahlten beispielen | 190 |
163 | design interaktiver medien technische grundlagen | 189 |
164 | business englisch | 187 |
165 | erziehung sehen analysieren gestalten | 184 |
166 | basic clinical research methods | 184 |
167 | ordinary differential equations and laplace transforms | 180 |
168 | mathematische logik | 179 |
169 | die geburt der materie in der evolution des universums | 179 |
170 | innovationsmanagement von kleinen und mittelstandischen unternehmen kmu | 176 |
171 | introduction to qualitative methods in the social sciences | 175 |
172 | advert retard wirkung industrieller interessen auf rationale arzneimitteltherapie | 175 |
173 | animation beyond the bouncing ball | 174 |
174 | entropie einfuhrung in die physikalische chemie | 172 |
175 | edufutur education for a sustainable future | 165 |
176 | social network effects on everyday life | 164 |
177 | pharmaskills for africa | 163 |
178 | nachhaltige energiewirtschaft | 162 |
179 | qualitat in der fruhpadagogik auf den anfang kommt es an | 158 |
180 | dementias | 157 |
181 | beyond armed confrontation multidisciplinary approaches and challenges from colombia s conflict | 154 |
182 | investition und finanzierung | 150 |
183 | praxis des wissensmanagements | 149 |
184 | gutenberg to google the social construction of the communciations revolution | 145 |
185 | value innovation and blue oceans | 145 |
186 | kontrapunkt | 144 |
187 | shakespeare s politics | 142 |
188 | jetzt erst recht wissen schaffen uber recht | 141 |
189 | rechtliche probleme von sozialen netzwerken | 138 |
190 | augmented tuesday suppers | 137 |
191 | positive padagogik | 137 |
192 | digital storytelling mit bewegenden bildern erzahlen | 136 |
193 | wirtschaftsethik | 134 |
194 | energieeffizientes bauen | 134 |
195 | advising startups | 133 |
196 | urban design and communication | 133 |
197 | bildungsreform 2 0 | 132 |
198 | mooc management basics | 130 |
199 | healthy teeth a life long course of preventive dentistry | 129 |
200 | digitales tourismus marketing | 127 |
201 | the arctic game the struggle for control over the melting ice | 127 |
202 | disease mechanisms | 127 |
203 | special operations from raids to drones | 125 |
204 | introduction to geospatial technology | 120 |
205 | social media marketing strategy smms | 119 |
206 | korpusbasierte analyse sprechsprachlichen problemlosungsverhaltens | 116 |
207 | introduction to marketing | 115 |
208 | creative coding | 114 |
209 | mooc meets 3d | 110 |
210 | unternehmenswert die einzig sinnvolle spitzenkennzahl fur unternehmen | 110 |
211 | forming behaviour gestaltung und konzeption von web applications | 109 |
212 | technology demonstration | 108 |
213 | lebensmittelmikrobiologie und hygiene | 105 |
214 | estudi erfolgreich studieren mit dem internet | 105 |
215 | moderne geldtheorie eine paische perspektive | 103 |
216 | kollektive intelligenz | 103 |
217 | geschichte der optischen medien | 100 |
218 | alter und soziale arbeit | 99 |
219 | semantik eine theorie visueller kommunikation | 97 |
220 | erziehung und beratung in familie und schule | 96 |
221 | foreign language learning in indian context | 95 |
222 | bildgebende verfahren | 92 |
223 | applied biology | 92 |
224 | bildung in der wissensgesellschaft gerechtigkeit | 92 |
225 | standortmanagement | 92 |
226 | europe a solution from history | 90 |
227 | methodology of research in international law | 90 |
228 | when african americans came to paris | 90 |
229 | contemporary architecture | 89 |
230 | past recent encounters turkey and germany | 88 |
231 | wars to end all wars | 83 |
232 | online learning management systems | 82 |
233 | software applications | 81 |
234 | business in germany | 78 |
235 | requirements engineering | 77 |
236 | anything relationship management xrm | 77 |
237 | global standards and local practices | 76 |
238 | prodima professionalisation of disaster medicine and management | 75 |
239 | cytology with a virtual correlative light and electron microscope | 75 |
240 | the organisation of innovation | 75 |
241 | sensors for all | 75 |
242 | diagnostik in der beruflichen bildung | 73 |
243 | scientific working | 71 |
244 | escience saxony lectures | 71 |
245 | internet marketing strategy how to gain influence and spread your message online | 69 |
246 | grundlagen des e business | 69 |
247 | principles of public health | 64 |
248 | methods for shear wave velocity measurements in urban areas | 64 |
249 | democracy in america | 64 |
250 | building typology studies gebaudelehre | 63 |
251 | multi media based learning environments at the interface of science and practice hamburg university of applied sciences prof dr andrea berger klein | 61 |
252 | math mooc challenge | 60 |
253 | the value of the social | 58 |
254 | dienstleistungsmanagement und informationssysteme | 57 |
255 | ict integration in education systems e readiness e integration e transformation | 56 |
The online wiki Rock in China has been online for a number of years and focusses on Chinese underground music. Prior to starting to implement Semantic Mediawiki, our wiki had roughly 4000 content pages, with over 1800 artists and 900 records. We used a number of templates for bands, CDs, venues and labels, but apart from using numerous categories and the DynamicPageList extension for a few joins, we were not able to tangibly use the available data.
DPL example for a JOIN between two wiki categories:
<DynamicPageList>
category = Metal Artists
category = Beijing Artists
mode = ricstyle
order = ascending
</DynamicPageList>
For the semantic conversion we created new templates, for example our venue template:

{{SemanticVenue
|Image=
|ImageDescription=
|City=
|Address=
|Phone=
|Opened=
|Closed=
|GeoLocation=
}}
As can be seen from the above template, both predicates (City, relating the venue to another page) and properties (Opened, a literal value) are proposed for the semantic class VENUE. Semantic MediaWiki implements this decisive difference in a very user-friendly way by setting the TYPE of each SMW property to either PAGE or one of the literal datatypes. As good as this is, it can be confusing when one talks with someone else about the semantic concepts in principle.
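For illustration, the corresponding property pages could look like this (a sketch; the property names are taken from the template above and Date is just one of the available literal types):

Property:City:
[[Has type::Page]]

Property:Opened:
[[Has type::Date]]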
A major problem was the implementation of external ontologies, which was not sufficiently documented on the Semantic MediaWiki page, most probably due to a change in versioning. Especially the cross-referencing to the URI was a major problem. As per the Semantic MediaWiki documentation, aliases should be allowed; however, trial and error revealed that only a property with a domain prefix, e.g. foaf:phone or owl:sameas, would be correctly recognized. We used the Special:RDFExport function to find most of these errors: every time our URI referencing was wrong, we would get a parser function error.
First, the wrong way for the following two wiki pages:
Mediawiki:smw_import_mo:
http://purl.org/ontology/mo/ |[http://musicontology.com/ Music Ontology Specification]
activity_end|Type:Date
activity_start|Type:Date
MusicArtist|Category
genre|Type:Page
Genre|Category
track|Type:String
media_type|Type:String
publisher|Type:Page
origin|Type:Page
lyrics|Type:Text
free_download|Type:URL
Property:genre:
[[Has type::Page]][[Imported from::mo:genre]]
And now the correct way, how it actually has to be implemented to work (note that the only difference is that there is no space between the base URI and the first pipe):
Mediawiki:smw_import_mo:
http://purl.org/ontology/mo/|[http://musicontology.com/ Music Ontology Specification]
activity_end|Type:Date
activity_start|Type:Date
MusicArtist|Category
genre|Type:Page
Genre|Category
track|Type:String
media_type|Type:String
publisher|Type:Page
origin|Type:Page
lyrics|Type:Text
free_download|Type:URL
Property:genre:
[[Has type::Page]][[Imported from::mo:genre]]
The ontology with the most problems was DBpedia, whose documentation did not tell us what the correct URI was. Luckily the mailing list provided support and we learned the correct URI:
http://www.dbpedia.org/ontology/
With that in place, we were able to implement a number of semantic properties for several classes and start updating our wiki pages to get the data into our semantic database.
To utilize semantic properties within a wiki, there are a number of extensions available, such as Semantic Forms, Semantic Result Formats and Semantic Maps. The benefits we were able to gain were tremendous. For example, the original JOIN query from the beginning of this blog post, which we had been running with DPL, can now be expressed with the following ASK query:
{{#ask: [[Category:Artists]] [[mo:origin::Beijing]]
|format=list
}}
This comes with the major benefit that the <references/> extension is NOT broken by placing the inline query on a page; DynamicPageList breaks <references/>, losing a lot of information. Another example of how we benefitted from semantics: previously we were only able to use categories and read information by joining one or two of them, e.g. artist pages that were categorized as both BEIJING artists and METAL artists. Now, with semantic properties, we have a lot more data to play around with and can create mashup pages such as ROCK or Category:Records, on which we can embed random videos from any ROCK artist or include a TIMELINE view of released records.
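A timeline of releases, for instance, can be sketched with an inline query like this (the property Has release date is hypothetical; the timeline format ships with Semantic Result Formats):

{{#ask: [[Category:Records]] [[Has release date::+]]
|format=timeline
|sort=Has release date
}}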
We had been using the property owl:sameas (or sameas) on various of our pages to link directly to Wikipedia pages. However, we learnt that the property foaf:primaryTopic is a much better and more accurate property for this. The sameas property should be used for semantic RDF pages, i.e. the respective DBpedia RESOURCE page (not the PAGE page). Luckily we had implemented the sameas property mostly in templates, so it was easy enough to exchange the properties.
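On an artist page the distinction then looks roughly like this (a sketch using Carsick Cars as an example; note that sameas points to the DBpedia RESOURCE, not the Wikipedia PAGE):

[[foaf:primaryTopic::http://en.wikipedia.org/wiki/Carsick_Cars]]
[[owl:sameas::http://dbpedia.org/resource/Carsick_Cars]]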
Having figured out this issue, we checked both the Freebase page and other pages, such as DBpedia or MusicBrainz, but there seems to be no “submit RDF” form. Hence we decided that the best way to get recognized in the Semantic Web is to include more links to other RDF resources, e.g. for our Category:Artists we set sameas links to DBpedia and the Music Ontology. For DBpedia we linked to the class and for the Music Ontology to the URI of the class.
As a side note: when checking on sameas.org, it seems that the Music Ontology is NOT cross-linked to DBpedia so far.
Following the recommendations set forth at Sindice, we changed our robots.txt to include our semantic sitemap(s):
Sitemap: http://www.music-china.org/wiki/index.php?title=Special:RecentChanges&feed=atom
Sitemap: http://www.rockinchina.com/wiki/index.php?title=Special:RecentChanges&feed=atom
Going the next step, we analyzed how we could include external data on our SMW, e.g. from MusicBrainz or from YouTube. Being a music-oriented site, YouTube was of particular interest for us. We found the SMW extension External Data, which we could use to connect to the Google API:
{{#get_web_data: url=https://www.googleapis.com/youtube/v3/search?part=snippet&q=carsick+cars&topicId=%2Fm%2F03cmgbv&type=video&key=Googlev3API&maxResults=50
|format=JSON
|data=videoId=videoId,title=title
}}
And
{{#for_external_table:
{{Youtube|ID={{{videoId}}}|title={{{title}}} }}<br/>
{{{videoId}}} and {{{title}}}<br/>
}}
See our internal TESTPAGE for the live example.
YouTube is using its in-house Freebase ID system to generate auto-channels filled with official music videos of bands and singers. The Freebase ID can be found on the individual Freebase RESOURCE page after pressing the EDIT button. Alternatively one could use the Google API to retrieve the ID, but would need a YouTube-internal HC ID prior to that. The implementation for our wiki was easy: include the Freebase ID as a semantic property on artist pages within our definitions template:
{{Definitions
|wikipedia=
|dbpedia=
|freebase=
|freebaseID=
|musicbrainz=
|youtubeautochannel=
}}
Voilà: with the additional SQL-based caching of request results (e.g. the JSON responses), our API load on Google is extremely low, and pages on our wiki load faster as well. Using this method we were able to increase our saved YouTube ID tags from the original 500 to well over 1000 within half a day.
Paper1 <--[ref]--> Paper2
  |                   |
  |[author]           |[author]
  v                   v
Author1             Author2
For the benchmark we were trying to find coauthors, which is basically a friend-of-a-friend query following the author relationship (or a breadth-first search of depth 2).
As we know, there are basically 3 ways of communicating with the neo4j database: the Java core API, the Traverser framework, and the Cypher query language.
With the Java core API you work directly on the node and relationship objects within Java. Once you have fixed an author node, formulating the query looks pretty much like this:
// iterate over all papers written by the author
for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
    Node paper = rel.getOtherNode(author);
    // iterate over all authors of each of those papers
    for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node coAuthor = coAuthorRel.getOtherNode(paper);
        if (coAuthor.getId() == author.getId()) continue; // skip the start author
        resCnt++;
    }
}
We see that the code can easily become confusing (when queries get more complicated). On the other hand, one can easily combine several similar traversals into one big query, making readability even worse but increasing performance.
The Traverser framework ships with the Java API and I really like the idea of it. I think it is really easy to understand the meaning of a query, and in my opinion it really helps the readability of the code:
// Traversal is the static helper class shipped with the neo4j kernel
for (Path p : Traversal.description().breadthFirst()
        .relationships(RelationshipTypes.AUTHOROF)
        .evaluator(Evaluators.atDepth(2))
        .uniqueness(Uniqueness.NONE)
        .traverse(author)) {
    Node coAuthor = p.endNode();
    resCnt++;
}
Especially if you have a lot of similar queries, or queries that are refinements of other queries, you can save the traversal descriptions and extend them later. What a cool technique.
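A minimal sketch of what I mean (assuming the neo4j 1.x API; since traversal descriptions are immutable, every refinement returns a new description that can be stored and reused):

// a saved base description fixing the relationship type and ordering
TraversalDescription base = Traversal.description()
        .breadthFirst()
        .relationships(RelationshipTypes.AUTHOROF)
        .uniqueness(Uniqueness.NONE);

// refinements of the saved base query
TraversalDescription papers    = base.evaluator(Evaluators.atDepth(1)); // papers of the author
TraversalDescription coAuthors = base.evaluator(Evaluators.atDepth(2)); // coauthors

for (Path p : coAuthors.traverse(author)) {
    Node coAuthor = p.endNode(); // same result as the traversal above
}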
And then there is the Cypher query language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why: it is a really beautiful language that is close to SQL (looking at Stack Overflow it is actually frightening how many people try to answer friend-of-a-friend queries using MySQL) but still emphasizes the graph-like structure:
ExecutionEngine engine = new ExecutionEngine(graphDB);
String query = "START author=node(" + author.getId()
        + ") MATCH author-[:" + RelationshipTypes.AUTHOROF.name()
        + "]-()-[:" + RelationshipTypes.AUTHOROF.name()
        + "]-coAuthor RETURN coAuthor";
ExecutionResult result = engine.execute(query);
scala.collection.Iterator<Node> it = result.columnAs("coAuthor");
while (it.hasNext()) {
    Node coAuthor = it.next();
    resCnt++;
}
I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let's just point out that Cypher is quite expressive.
I was shocked by the results, so I talked with Andres Taylor from neo4j, who mainly works on Cypher. He asked me which neo4j version I used, and I said it was 1.7. He told me I should check out 1.9, since Cypher had become more performant. So I ran the benchmarks against neo4j 1.8 and neo4j 1.9; unfortunately, Cypher became slower in the newer neo4j releases. Andres commented:
“Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on; the focus in this first phase has been to explore the language, learn about how our users use the query language, and to expand the feature set to a reasonable level.”
“I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. Like every language ever created, in the beginning you can always do better than the compiler by writing by hand, but eventually the compiler catches up.”
So far I have only been using the Java core API when working with neo4j, and I will continue to do so.
If you are in a high-speed scenario (I believe every web application is one) you should really think about switching to the neo4j Java core API for writing your queries. It might not look as nice as Cypher or the Traverser framework, but the gain in speed pays off.
Also, I personally like the amount of control that you have when doing the traversals in the core API yourself.
Additionally, I will soon post an article on why scripting languages like PHP, Python or Ruby aren't suitable for building web applications anyway. So changing to the core API makes sense for several reasons.
The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
The detailed results can be found in this spreadsheet.
This chapter is split into several sections and is supposed to give some motivation. It already demonstrates nicely that in order to understand natural language processing you really have to bridge the gap between mathematics and computer science on the one hand and linguistics on the other. Even in the basic examples given from linguistics there was some notation that I did not understand right away. In this sense I am really looking forward to reading chapter 3, which is supposed to give a rich overview of all the concepts needed from linguistics.
I personally found the motivating section of chapter 1 too long as well. I am about to learn some concepts, and right now I don't really have the feeling that I need to understand all the philosophical discussions about grammar, syntax and semantics.
What I really loved on the other hand was the last section, “dirty hands”. In this section a small corpus (Tom Sawyer) is used to introduce some of the phenomena that one faces in natural language processing, some of which I had already discussed, without knowing it, in the article about text mining for linguists on Ulysses. In the book they are of course discussed in a more structured way, but one can easily download the source code from the above-mentioned article and play around to understand the basic concepts from the book. Among these were:
Word counts / tokens / types. The most basic operation one can perform on a text is counting words. This is something that I already did in the Ulysses article. Counting words is interesting since in today's world it can be automated. What I didn't see in my last blog post was that counting words already leads to more insights than just a distribution of words, which I will discuss now:
Distinctive words can be spotted. Once I have a corpus consisting of many different texts I can count all words and create a ranking of the most frequent words. I will realize that for any given text the ranking looks quite similar to that global ranking. But once in a while I might spot words in a single text's top 100 most frequent words that would not appear (let's say) in the top 200 of the global ranking. Those words seem to be distinctive words that are of particular interest for the current text. In the example of Tom Sawyer, “Tom” is such a distinctive word.
Hapax legomena. If one looks at all the words in a given text one will realize that most words occur less than 3 times; a word that occurs only once is called a hapax legomenon. This phenomenon demonstrates the difficulty of natural language processing: the data one analyses is very sparse. The frequent words are most of the time grammatical structures, whereas the infrequent words carry the semantics. From this phenomenon one gets very quickly to:
Zipf’s law. Roughly speaking, Zipf’s law says that when you count the word frequencies of a text and order the words from the most frequent to the least frequent, you get a table in which the rank of a word multiplied by its frequency is always roughly the same number (saying that the frequency is anti-proportional to the rank, f ≈ C/r). This is of course only an approximation: just imagine the most frequent word occurs an odd number of times; then there is no integer frequency for the second-ranked word that, multiplied by 2, yields exactly the frequency of the most frequent word.
Anyway, Zipf’s law was a very important discovery and has been generalized by Mandelbrot’s law (which I so far only knew from chaos theory and fractals). Maybe sometime in the near future I will find some time to calculate the word frequencies of my blog and see if Zipf’s law holds (:
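Such a check is quick to sketch in Java (a minimal sketch; the file name corpus.txt and the crude tokenization are assumptions):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// count word frequencies and print rank * frequency, which Zipf's law
// predicts to be roughly constant across ranks
public class ZipfCheck {
    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get("corpus.txt"))).toLowerCase();
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("[^a-z]+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranking = new ArrayList<>(counts.entrySet());
        ranking.sort((a, b) -> b.getValue() - a.getValue());
        for (int rank = 1; rank <= Math.min(20, ranking.size()); rank++) {
            Map.Entry<String, Integer> e = ranking.get(rank - 1);
            System.out.printf("%2d %-15s f=%6d rank*f=%d%n",
                    rank, e.getKey(), e.getValue(), rank * e.getValue());
        }
    }
}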
Collocations / bigrams. Another important concept is that of collocations. Many words only have a meaning together; in this sense “New York” or “United States” are more than the sum of the single words. The text pointed out that it is not sufficient to find the most frequent bigrams in order to find good collocations: those bigrams have to be filtered, either with grammatical structures or by normalization. I think calculating a Jaccard coefficient might be interesting (even though it was not mentioned in the text). Should I really try to verify Zipf’s law in the near future, I will also try to verify my method for calculating collocations. I hope that I would find collocations in my blog like social network, graph data base or news stream…
KWIC. What I did not have in mind so far is keyword-in-context (KWIC) analysis. Here you look at all text snippets that occur in a certain window around a keyword. This seems more like work from linguistics, but I think automating this task would also be useful in natural language processing. So far it never came to my mind that a computer system could also describe words by their context. Actually pretty funny, since this is the most natural operation we do when learning a new language.
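Automating KWIC is straightforward; a minimal sketch (the window size of 4 tokens and the whitespace tokenization are assumptions):

import java.util.ArrayList;
import java.util.List;

// for every occurrence of the keyword, collect the surrounding window
// of tokens as one context snippet
public class Kwic {
    public static List<String> contexts(String text, String keyword, int window) {
        String[] tokens = text.toLowerCase().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(keyword)) continue;
            StringBuilder sb = new StringBuilder();
            for (int j = Math.max(0, i - window); j <= Math.min(tokens.length - 1, i + window); j++) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(tokens[j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Tom said that Tom and Huck would meet at the river";
        for (String c : contexts(text, "tom", 4)) {
            System.out.println(c);
        }
    }
}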
Exercises
I really liked the exercises in the book. Some of them were really straightforward. But I had the feeling they were carefully chosen to demonstrate some of the information given in the book so far.
What I was missing around these basic words is the order of the words. This was somewhat reflected in the collocation problem. Even though I am not an expert yet, I have the feeling that most methods in statistical natural language processing seem to “forget” the order in which the words appear in the text. I am convinced that this is an important piece of information; it already inspired me in my diploma thesis to create a method similar to explicit semantic analysis, and it is a core element in Typology!
Anyway, reading the first chapter of the book I did not really learn anything new, but it helped me to take a certain point of view. I am really excited to proceed. The next chapter will be about probability theory. I already saw that it is written in a purely mathematical style, with examples like rolling a die rather than examples from corpus linguistics, which I find sad.