tl;dr version: Source code at GitHub!
A couple of days ago, Wikileaks released a data set of about 23 thousand emails sent within the Democratic National Committee, which are said to demonstrate how the DNC actively tried to prevent Bernie Sanders from becoming the Democratic candidate for the general election. I am interested in which people have a lot of influence, so I decided to take a closer look at the data.
Yesterday I crawled the data set and processed it, extracting two graphs in the Konect format. Since I am not sure whether I am legally allowed to publish the processed data sets, I will only link to the source code so that you can generate them yourself. If you don’t know how to run the code but need the information, drop me a mail. I also hope that Jérôme Kunegis will analyze the networks and include them in Konect.
First we have the temporal graph
This graph consists of 39338 edges. There is a directed edge for each email sent from one person to another, labeled with the timestamp at which the mail was sent. If a person puts n recipients in CC, n edges are added to the graph.
rpickhardt$ wc -l temporalGraph.tsv
rpickhardt$ head -5 temporalGraph.tsv
GardeM@dnc.org DavisM@dnc.org 1 17 May 2016 19:51:22
ShapiroA@dnc.org KaplanJ@dnc.org 1 4 May 2016 06:58:23
JacquelynLopez@perkinscoie.com EMail-Vetting_D@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com LykinsT@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com ReifE@dnc.org 1 13 May 2016 21:27:16
The format is: sender TAB receiver TAB 1 TAB date
The data is currently not sorted by the fourth column (the date), but this can easily be done. Note that an email network is directed and can have multiple edges between the same pair of people.
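Sorting by the date column could look roughly like the following sketch. It is not part of the published code: the names `Edge`, `parse_edge` and `sort_edges` are made up for illustration, and it assumes that every line has exactly four tab-separated fields and that the date always looks like `17 May 2016 19:51:22` (English month abbreviations, no time zone).

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Edge(NamedTuple):
    """One line of temporalGraph.tsv (hypothetical helper type)."""
    sender: str
    receiver: str
    weight: int
    sent_at: datetime


def parse_edge(line: str) -> Edge:
    # Assumed layout: sender TAB receiver TAB 1 TAB date
    sender, receiver, weight, date = line.rstrip("\n").split("\t")
    # Assumed date layout, e.g. "4 May 2016 06:58:23"
    sent_at = datetime.strptime(date, "%d %b %Y %H:%M:%S")
    return Edge(sender, receiver, int(weight), sent_at)


def sort_edges(lines: Iterable[str]) -> List[Edge]:
    """Parse all non-empty lines and order the edges by timestamp."""
    edges = (parse_edge(line) for line in lines if line.strip())
    return sorted(edges, key=lambda edge: edge.sent_at)
```

With a file at hand, `sort_edges(open("temporalGraph.tsv", encoding="utf-8"))` would return the edges in chronological order.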
Second we have the weighted co-recipient network
Looking at the data, I discovered that many mails have more than one recipient, so I thought it would be nice to count how often two people occur together in the recipient list of an email. This can reveal a lot about the social network structure of the DNC.
rpickhardt$ wc -l weightedCCGraph.tsv
rpickhardt$ head -5 weightedCCGraph.tsv
PaustenbachM@dnc.org MirandaL@dnc.org 848
MirandaL@dnc.org PaustenbachM@dnc.org 848
WalkerE@dnc.org PaustenbachM@dnc.org 624
PaustenbachM@dnc.org WalkerE@dnc.org 624
WalkerE@dnc.org MirandaL@dnc.org 596
The format is: recipient1 TAB recipient2 TAB count
where count is the number of mails in which recipient1 and recipient2 appear together in the recipient list. Each pair is listed in both directions.
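To make the counting concrete, here is a minimal sketch of how such co-recipient counts can be derived from the recipient lists of the mails. It is an illustration and not the code from the repository. The function names are hypothetical, and it assumes that a recipient who appears twice in the same mail is counted only once for that mail.

```python
from collections import Counter
from itertools import permutations
from typing import Iterable, List, Tuple


def co_recipient_counts(recipient_lists: Iterable[List[str]]) -> Counter:
    """Count, for every ordered pair (a, b), the mails that list both."""
    counts: Counter = Counter()
    for recipients in recipient_lists:
        # Assumption: duplicates within one mail count only once.
        unique = sorted(set(recipients))
        # Ordered pairs, so (a, b) and (b, a) both appear, as in the file.
        counts.update(permutations(unique, 2))
    return counts


def to_tsv_lines(counts: Counter) -> List[str]:
    """Render 'recipient1 TAB recipient2 TAB count', largest counts first."""
    ordered: List[Tuple[Tuple[str, str], int]] = sorted(
        counts.items(), key=lambda item: (-item[1], item[0])
    )
    return [f"{a}\t{b}\t{n}" for (a, b), n in ordered]
```

A mail with a single recipient contributes no pairs, so people who only ever receive mails alone do not show up in this network.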
There are
- 1226 distinct senders
- 1384 distinct recipients
- 2030 people
involved in the mails. The top 7 senders are:
And the top 7 receivers are:
As you can see, both email@example.com and KaplanJ@dnc.org occur in the data set. As I mention in the Roadmap section at the end of the article, more cleanup of the data might be necessary to get a more precise picture.
Still, at first glance the data looks pretty natural. In the following I provide a diagram showing the rank frequency plot of senders and receivers. One can see that some people are far more active than others. The recipient curve also lies above the sender curve, which makes sense since every mail has exactly one sender but at least one recipient.
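The numbers behind such a rank frequency plot can be computed along the following lines. This is a sketch with hypothetical function names, not the published code. It takes the temporal edges as (sender, receiver, timestamp) triples and, as an assumption, treats all edges that share sender and timestamp as one mail, so that a mail with n recipients counts once for its sender and once for each recipient.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def activity_counts(
    edges: Iterable[Tuple[str, str, Hashable]]
) -> Tuple[Counter, Counter]:
    """Return (mails sent per sender, mails received per receiver)."""
    senders: Counter = Counter()
    receivers: Counter = Counter()
    seen_mails = set()
    for sender, receiver, sent_at in edges:
        receivers[receiver] += 1
        # Assumption: same sender and same timestamp means same mail.
        if (sender, sent_at) not in seen_mails:
            seen_mails.add((sender, sent_at))
            senders[sender] += 1
    return senders, receivers


def rank_frequency(counter: Counter) -> List[Tuple[int, int]]:
    """List (rank, frequency) pairs, rank 1 being the most active person."""
    return [
        (rank, freq)
        for rank, (_, freq) in enumerate(counter.most_common(), start=1)
    ]
```

`senders.most_common(7)` and `receivers.most_common(7)` then give the top 7 lists, and `len(set(senders) | set(receivers))` gives the number of distinct people.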
You can also see the rank versus co-occurrence count diagram of the co-occurrence network. For ranks above 2000, the curve deviates a little from the usual picture. I have no plausible explanation for this. Maybe it is due to the data dump not being complete. Still, the data looks pretty natural to me, so further investigation might make sense.
The crawler code is a two-liner: just some wget and sleep magic.
The Python code for processing the mails builds upon the Python email library by Alain Spineux, which is released under the LGPL license. My code on top is released under GPLv3 and can be found on GitHub.
Roadmap
- Use the Generalized Language Model Toolkit to build language models on the data
- Compare with the social graph from Twitter: many email addresses, or at least names, will be linked to Twitter accounts. Comparing the Twitter network with the email network might reveal differences between internal and external communication.
- Improve the quality of the data, i.e. clean it up better. Sometimes a person in the recipient list has more than one email address; currently these are treated as different people. On the other hand, sometimes mail addresses are missing and only names are included. These could probably be inferred from the other mail addresses. In this case the names also serve as unique identifiers, so if two different people are called ‘Bob’ they become one person in the data set.
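For the last point, one possible direction is sketched below. It is only an idea and not implemented in the published code; all names are hypothetical. The sketch lowercases addresses and, when a recipient entry contains only a name, reuses the address seen elsewhere with that name, but only if that address is unambiguous. It does not yet merge several addresses of the same person.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Deliberately simple pattern; real-world addresses can be more exotic.
ADDRESS_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_recipient(raw: str) -> Tuple[str, str]:
    """Return (name, address) in lowercase; either part may be empty."""
    match = ADDRESS_RE.search(raw)
    address = match.group(0).lower() if match else ""
    name = re.sub(r"[<>\"']", "", ADDRESS_RE.sub("", raw)).strip().lower()
    return name, address


def build_name_index(raw_recipients: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each display name to all addresses it was seen with."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for raw in raw_recipients:
        name, address = split_recipient(raw)
        if name and address:
            index[name].add(address)
    return index


def canonical_id(raw: str, name_index: Dict[str, Set[str]]) -> str:
    """Prefer the address; fall back to an unambiguous address for the name."""
    name, address = split_recipient(raw)
    if address:
        return address
    candidates = name_index.get(name, set())
    # An ambiguous or unknown name stays as it is rather than being guessed.
    return next(iter(candidates)) if len(candidates) == 1 else name
```

The ‘Bob’ problem remains whenever a name maps to zero or several addresses; in that case the sketch keeps the bare name as identifier, exactly as the current data set does.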