Extracting 2 social network graphs from the Democratic National Committee Email Corpus on Wikileaks – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany

tl;dr version: Source code at GitHub!
A couple of days ago a data set was released on Wikileaks consisting of about 23 thousand emails sent within the Democratic National Committee. The emails are said to demonstrate how the DNC was actively trying to prevent Bernie Sanders from becoming the Democratic candidate for the general election. I am interested in which people have a lot of influence, so I decided to take a closer look at the data.
Yesterday I crawled the data set and processed it. I extracted two graphs in the Konect format. Since I am not sure if I am legally allowed to publish the processed data sets, I will only link to the source code so you can generate the data sets yourself. If you don't know how to run the code but need the information, drop me a mail. I also hope that Jérôme Kunegis will do an analysis of the networks and include them in Konect.

First we have the temporal graph

This graph consists of 39338 edges. There is a directed edge for each email sent from one person to another, together with a timestamp of when it was sent. If a person puts n recipients in CC, n edges are added to the graph.
rpickhardt$ wc -l temporalGraph.tsv
39338 temporalGraph.tsv
rpickhardt$ head -5 temporalGraph.tsv
GardeM@dnc.org	DavisM@dnc.org	1	17 May 2016 19:51:22
ShapiroA@dnc.org	KaplanJ@dnc.org	1	4 May 2016 06:58:23
JacquelynLopez@perkinscoie.com	EMail-Vetting_D@dnc.org	1	13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com	LykinsT@dnc.org	1	13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com	ReifE@dnc.org	1	13 May 2016 21:27:16
The format is: sender TAB receiver TAB 1 TAB date
The data is currently not sorted by the fourth column (the date), but this can easily be done. An email network is directed and can have multi-edges, since the same sender can write to the same recipient more than once.
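Sorting by the date column could look roughly like the following sketch. It is not taken from the repository; the function names are made up for illustration, and it assumes every line has exactly the four tab-separated fields shown above with dates like "17 May 2016 19:51:22" (English month abbreviations).

```python
import io
from datetime import datetime

# Assumed date layout, matching e.g. "17 May 2016 19:51:22".
DATE_FORMAT = "%d %b %Y %H:%M:%S"


def read_temporal_edges(lines):
    """Parse 'sender TAB receiver TAB 1 TAB date' lines into tuples."""
    edges = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed format
        sender, receiver, weight, date_text = fields
        when = datetime.strptime(date_text.strip(), DATE_FORMAT)
        edges.append((sender, receiver, int(weight), when))
    return edges


def sort_by_time(edges):
    """Return the edges ordered by their timestamp (fourth field)."""
    return sorted(edges, key=lambda edge: edge[3])


# Three of the sample lines from the head output above.
sample = (
    "GardeM@dnc.org\tDavisM@dnc.org\t1\t17 May 2016 19:51:22\n"
    "ShapiroA@dnc.org\tKaplanJ@dnc.org\t1\t4 May 2016 06:58:23\n"
    "JacquelynLopez@perkinscoie.com\tLykinsT@dnc.org\t1\t13 May 2016 21:27:16\n"
)
sorted_edges = sort_by_time(read_temporal_edges(io.StringIO(sample)))
for sender, receiver, _, when in sorted_edges:
    print(when.isoformat(), sender, receiver)
```

To run it on the real file, pass an open file handle for temporalGraph.tsv instead of the in-memory sample.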

Second we have the weighted co-recipient network

Looking at the data I discovered that many mails have more than one recipient, so I thought it would be nice to look at how often two people occur together in the recipient list of an email. This can reveal a lot about the social network structure of the DNC.
rpickhardt$ wc -l weightedCCGraph.tsv
20864 weightedCCGraph.tsv
rpickhardt$ head -5 weightedCCGraph.tsv
PaustenbachM@dnc.org	MirandaL@dnc.org	848
MirandaL@dnc.org	PaustenbachM@dnc.org	848
WalkerE@dnc.org	PaustenbachM@dnc.org	624
PaustenbachM@dnc.org	WalkerE@dnc.org	624
WalkerE@dnc.org	MirandaL@dnc.org	596
The format is: recipient1 TAB recipient2 TAB count
where count is the number of mails in which recipient1 and recipient2 appear together in the recipient list.
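The counting idea can be sketched as follows. This is an illustration and not the code from the repository: the function name and the example addresses are hypothetical. It assumes that each pair is stored in both directions (the head output above shows both orders with the same count) and that an address listed twice in one mail is counted only once, which is my own assumption.

```python
from collections import Counter
from itertools import permutations


def co_recipient_counts(recipient_lists):
    """Count, for every ordered pair, the mails in which both received it."""
    counts = Counter()
    for recipients in recipient_lists:
        # Assumption: duplicates within one mail count once.
        unique = sorted(set(recipients))
        # permutations yields (a, b) and (b, a), so both directions are stored.
        for first, second in permutations(unique, 2):
            counts[(first, second)] += 1
    return counts


# Hypothetical recipient lists of three mails.
mails = [
    ["a@example.org", "b@example.org", "c@example.org"],
    ["a@example.org", "b@example.org"],
    ["c@example.org"],
]
counts = co_recipient_counts(mails)
for (first, second), n in sorted(counts.items()):
    print(first, second, n, sep="\t")
```

A mail with k distinct recipients contributes k * (k - 1) ordered pairs, so mails with very long recipient lists dominate the edge count.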

Simple statistics

There have been

1226 senders
1384 recipients
2030 people

included in the mails. The top 7 senders are:
MirandaL@dnc.org	1482
ComerS@dnc.org	1449
ParrishD@dnc.org	750
DNCPress@dnc.org	745
PaustenbachM@dnc.org	608
KaplanJ@dnc.org	600
ManriquezP@dnc.org	567
And the top 7 receivers are:
MirandaL@dnc.org	2951
Comm_D@dnc.org	2439
ComerS@dnc.org	1841
PaustenbachM@dnc.org	1550
KaplanJ@dnc.org	1457
WalkerE@dnc.org	1110
kaplanj@dnc.org	987
As you can see, both kaplanj@dnc.org and KaplanJ@dnc.org occur in the data set, so as I mention in the Roadmap section at the end of the article, more clean up of the data might be necessary to get a more precise picture.
Still, at first glimpse the data looks pretty natural. In the following I provide a diagram showing the rank frequency plot of senders and receivers. One can see that some people are way more active than others. Also, the recipient curve is above the sender curve, which makes sense since every mail has exactly one sender but one or more recipients.
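The numbers behind such a rank frequency plot can be computed along these lines. This is a sketch under assumptions: the record shape (sender, list of recipients), the function name and the toy addresses are made up for illustration and are not the representation used in the repository.

```python
from collections import Counter


def rank_frequency(mails):
    """Return (sender_curve, receiver_curve), each sorted from rank 1 down."""
    sent = Counter()
    received = Counter()
    for sender, recipients in mails:
        sent[sender] += 1  # every mail has exactly one sender
        for recipient in set(recipients):
            received[recipient] += 1  # and one or more recipients
    sender_curve = sorted(sent.values(), reverse=True)
    receiver_curve = sorted(received.values(), reverse=True)
    return sender_curve, receiver_curve


# Hypothetical mails as (sender, recipients) records.
mails = [
    ("a@example.org", ["b@example.org", "c@example.org"]),
    ("a@example.org", ["b@example.org"]),
    ("b@example.org", ["a@example.org", "c@example.org", "d@example.org"]),
]
sender_curve, receiver_curve = rank_frequency(mails)
print("senders:  ", sender_curve)
print("receivers:", receiver_curve)
```

Plotting each list against its index plus one (the rank), ideally on log-log axes, gives the kind of curve described above. Because each mail adds one to the sender total and at least one to the receiver total, the receiver counts sum to at least as much as the sender counts.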

You can also see the rank co-occurrence count diagram of the co-occurrence network. When the ranks are above 2000, the usual picture of the network structure changes a little bit. I have no plausible explanation for this. Maybe it is due to the fact that the data dump is not complete. Still, the data looks pretty natural to me, so further investigations might make sense.

Code

The crawler code is a two-liner, just some wget and sleep magic.
The Python code for processing the mails builds upon the Python email library by Alain Spineux, which is released under the LGPL license. My code on top is released under GPLv3 and can be found on GitHub.
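To give an idea of the processing step, here is a minimal sketch that turns one raw mail into edges of the temporal graph. It deliberately uses Python's standard library email package instead of the library mentioned above, so it is not the code from the repository; the function name and the sample message are hypothetical, and it assumes the sender is in the From header and the recipients are in To and Cc.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def mail_to_edges(raw_message):
    """Return one (sender, recipient, 1, date) tuple per recipient."""
    message = message_from_string(raw_message)
    senders = getaddresses(message.get_all("From", []))
    if not senders:
        return []  # no usable sender, so no edges
    sender = senders[0][1]
    # Assumption: recipients are whatever appears in To and Cc.
    recipients = getaddresses(
        message.get_all("To", []) + message.get_all("Cc", [])
    )
    date_header = message.get("Date")
    when = parsedate_to_datetime(date_header) if date_header else None
    return [(sender, address, 1, when) for _, address in recipients if address]


# Hypothetical sample message, not taken from the corpus.
raw = (
    "From: Alice <alice@example.org>\n"
    "To: Bob <bob@example.org>, carol@example.org\n"
    "Cc: dave@example.org\n"
    "Date: Tue, 17 May 2016 19:51:22 -0400\n"
    "Subject: test\n"
    "\n"
    "Body text.\n"
)
edges = mail_to_edges(raw)
for sender, recipient, weight, when in edges:
    print(sender, recipient, weight, when, sep="\t")
```

Entries that consist only of a name without an address come back with an empty address from getaddresses and are skipped here, which is one of the data quality issues listed in the Roadmap.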

Roadmap

Use the Generalized Language Model Toolkit to build language models on the data.
Compare with the social graph from Twitter. Many email addresses, or at least names, will be linked to Twitter accounts. Comparing the Twitter network with the email network might reveal the differences between internal and external communication.
Improve the quality of the data, i.e. better clean up. Sometimes people in the recipient list have more than one email address. Currently they are treated as two different people. On the other hand, sometimes mail addresses are missing and just names are included. These could probably be inferred from the other mail addresses. Also, names in this case serve as unique identifiers, so if two different people are called 'Bob' they become one person in the data set.
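As a first, very small clean up step, addresses that differ only in case could be merged. The sketch below is not part of the repository, the function names are hypothetical, and it assumes that addresses differing only in case belong to the same mailbox. It does not handle people with several addresses or entries that consist only of a name.

```python
from collections import Counter


def normalize_address(address):
    """Assumption: addresses that differ only in case are the same mailbox."""
    return address.strip().lower()


def merge_counts(counts):
    """Sum the counts of all addresses that normalize to the same key."""
    merged = Counter()
    for address, n in counts.items():
        merged[normalize_address(address)] += n
    return merged


# Three entries from the top receivers list above.
receivers = Counter({
    "KaplanJ@dnc.org": 1457,
    "kaplanj@dnc.org": 987,
    "WalkerE@dnc.org": 1110,
})
merged = merge_counts(receivers)
for address, n in merged.most_common():
    print(address, n, sep="\t")
```

With this merge the two KaplanJ entries add up to 1457 + 987 = 2444 received mails.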
