Lesson 7: Mention network analysis

Another type of analysis that we can run over a large set of tweets is a network analysis of the different networks that emerge between users. These networks can be retweet networks or mention networks, and we’ve seen from prior studies that these networks can be dramatically different depending on what people are actually doing on Twitter.

For the current analysis we are just going to look at mention networks. Because of the way tweets are structured, we can easily pull out mentions of other users.

{
        "id_str":"265631953204686848",        
        "text":"@AllenVaughan In all fairness that \"talented\" team was 5-6...",
        "in_reply_to_user_id_str":"166318986",
        ...
        "entities":{
                "urls":[],
                "hashtags":[],
                "user_mentions":[                
                        {
                                "id_str":"166318986",
                                "indices":[0,13],
                                "name":"Allen Vaughan",
                                "screen_name":"AllenVaughan",
                                "id":166318986
                        }
                ]}
        ...
}

A potential pitfall of pulling out mentions this way is that retweets are also recorded like this, so we could be measuring both retweets and mentions. However, we can bracket this consideration for now. There are ways of filtering out retweets that we can cover in future modules (a quick sketch appears below). For now, we will focus on how to create mention networks.
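As a quick preview, one common heuristic (a sketch only, and not necessarily the filter a later module will use) relies on the fact that native retweets carry a retweeted_status object, while manual retweets conventionally start with "RT @". A hypothetical helper along those lines:

def is_retweet(data):
    # Heuristic: native retweets carry a 'retweeted_status' object,
    # and manual retweets conventionally begin with "RT @"
    return 'retweeted_status' in data or data.get('text', '').startswith('RT @')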

Algorithm

The most common way that network packages like R’s igraph or sna read in networks is through edge lists: pairs of nodes (dyads) indexed by some name. For example, let’s define a network based on shared department and whether you are a student or faculty.

user1,user2
YoungMie,Chris
Dhavan,YoungMie
Dhavan,Chris
Tim,Itay
Tim,BenS
Itay,BenS
Erika,Erika
Alex,Alex
BenT,Dave
Jane,BenT
Jane,Dave

That produces a graph that looks like this.
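If you want to draw the graph yourself, here is a minimal sketch using the python-igraph package, assuming the edge list above is saved as a hypothetical file called dept.csv (R’s igraph reads edge lists in much the same way):

import csv
import igraph

# Read the dyads, skipping the user1,user2 header row
with open("dept.csv") as f:
    reader = csv.reader(f)
    next(reader)
    edges = [tuple(row) for row in reader]

# Build an undirected graph; vertex names come straight from the edge list
g = igraph.Graph.TupleList(edges, directed=False)
igraph.plot(g, vertex_label=g.vs["name"])  # plotting requires pycairo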

We need to produce edge lists, then. We can also take into account the quality of the tie with a third value. This could be factored in as something like the strength of the tie, as measured by how often people interact with each other.

Here is the process for creating that output:

  1. Mapper
    for each tweet:
        user1 = current user
        for each mention of another user (user2):
            emit user1, user2, 1
  2. Intermediate – sort keys
  3. Reducer
    for each key:
        n = sum of all the values
        emit key, n

The reducer should look familiar, since it’s the one we used in the wordcount lesson.
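The nReduce.py script itself ships with the workshop materials and isn’t reproduced here, but as a rough sketch, a generic key-summing reducer along those lines might look like this (the actual script may differ in its details); it takes the number of key fields as its first argument:

#!/usr/bin/env python

import sys

def main():
    # Number of leading tab-separated fields that make up the key
    n = int(sys.argv[1])

    current_key = None
    total = 0

    for line in sys.stdin:
        fields = line.strip().split("\t")
        key, value = fields[:n], int(fields[n])

        # Input is sorted, so identical keys arrive consecutively
        if key == current_key:
            total += value
        else:
            if current_key is not None:
                print("\t".join(current_key + [str(total)]))
            current_key, total = key, value

    # Don't forget to emit the final key
    if current_key is not None:
        print("\t".join(current_key + [str(total)]))

if __name__ == '__main__':
    main()

Because the intermediate sort step groups identical keys together, the reducer only needs to keep a running total for the current key rather than holding everything in memory.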

The following is the code for the mapper, which is in the file called mentionMapper.py.

#!/usr/bin/env python

import json, sys

def main():

    for line in sys.stdin:
        line = line.strip()

        # Skip any line that is not valid JSON
        try:
            data = json.loads(line)
        except ValueError as detail:
            sys.stderr.write(str(detail) + "\n")
            continue

        # Only emit edges for tweets that actually contain user mentions
        if 'entities' in data and len(data['entities'].get('user_mentions', [])) > 0:
            user          = data['user']
            user_mentions = data['entities']['user_mentions']

            # One line per mention: mentioning user, mentioned user, count of 1
            for u2 in user_mentions:
                print("\t".join([
                    user['id_str'],
                    u2['id_str'],
                    "1"
                    ]))

if __name__ == '__main__':
    main()

The program checks whether the tweet has entities and whether any user mentions are present. It then loops through the mentions and prints an edge for each one. Now let’s implement this in practice.

Implementation

Let’s start off with a small subset of tweets, the first 100, and run them through the mapper.

me@blogclub:~/sandbox/november-tworkshop/bin$ head -100 ../data/elex2012.2012November5.json | python mentionMapper.py

You should get a lot of output, over 100 lines (154, to be exact). That’s pretty incredible: an average of more than 1.5 mentions per tweet.
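If you want to verify the count yourself, pipe the mapper’s output through wc -l:

me@blogclub:~/sandbox/november-tworkshop/bin$ head -100 ../data/elex2012.2012November5.json | python mentionMapper.py | wc -l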

Following the same pattern as before, let’s see if we can reduce them down. Start with the first 100 tweets, sort them, then reduce them. Make sure to write “2” after the python nReduce.py command, since two fields make up the key.

me@blogclub:~/sandbox/november-tworkshop/bin$ head -100 ../data/elex2012.2012November5.json | python mentionMapper.py  | sort | python nReduce.py 2

Inspect the output. Does anything seem to have summed to more than 1? Not really. It seems there are not many repeat interactions in those first 100 tweets (which makes sense: who is going to tweet at someone twice in less than a minute?).

Lastly, run the full dataset. This may take a minute or two, especially if you are running it at the same time as everyone else.

me@blogclub:~/sandbox/november-tworkshop/bin$ cat ../data/elex2012.2012November5.json | python mentionMapper.py | sort | python nReduce.py 2

This is going to spit out a lot of output. You can use your favorite network analysis program (igraph for R, NodeXL, UCINET) to run analyses on these data and to represent them visually. Here is the current dataset as plotted in igraph.

Its complexity is quite remarkable, especially considering it represents only about 10 minutes of tweets. The larger nodes are those that have been mentioned more. The red edges connect people who have interacted more than three times. So you see a pretty low incidence of interaction in this short time period, but a lot of mentions of elite users. You can also see a bit of polarization developing around the two big nodes in the center, which are Obama and Romney. Once you run these analyses across time, I’m sure more patterns will emerge.
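To produce a roughly similar plot yourself in Python, here is a minimal sketch using the python-igraph package, assuming you redirected the reducer’s output to a hypothetical file called mentions.tsv; the styling below only approximates the figure described above:

import csv
import igraph

# Load the reducer output: user1 <tab> user2 <tab> count
edges, weights = [], []
with open("mentions.tsv") as f:
    for u1, u2, n in csv.reader(f, delimiter="\t"):
        edges.append((u1, u2))
        weights.append(int(n))

g = igraph.Graph.TupleList(edges, directed=True)
g.es["weight"] = weights

# Scale node size by how often a user is mentioned (in-degree),
# and color edges red when a dyad interacted more than three times
indegree = g.degree(mode="in")
g.vs["size"] = [5 + 2 * d ** 0.5 for d in indegree]
g.es["color"] = ["red" if w > 3 else "gray" for w in g.es["weight"]]

igraph.plot(g, "mentions.png")  # plotting requires pycairo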