Lesson 6: Basic sentiment analysis

Moving on to something a little more useful, we turn to actual Twitter data and to transforming the text itself into something meaningful: sentiment across time. More generally, this gives us a way to compute metrics across time, which lets us do some cool stuff to lots of data at once.

There are lots of ways to do sentiment analysis, including Naive Bayes classifiers, support vector machines, and other flavors of machine learning. What we’re going to use today is incredibly naive: it’s based on a derivative of the MPQA Subjectivity Lexicon, using word lists that Neal Caren, sociology faculty at UNC-Chapel Hill, put together. There’s not a one-to-one correspondence between the two word lists, but for the sake of demonstration we’re not going to worry about that right now.

All I’m saying is that you probably don’t want to try to publish the results of this analysis.
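Despite that caveat, the core idea is easy to see in a few lines of Python. Here’s a minimal sketch of lexicon-based scoring; the tiny word lists are made up for illustration and just stand in for the real positive.txt and negative.txt files we’ll load in the mapper below.

positive = set(['great', 'win', 'hope'])     # toy stand-ins for positive.txt
negative = set(['awful', 'lose', 'fear'])    # toy stand-ins for negative.txt

text  = 'what a great debate so much hope now'
words = set(text.lower().split())

pos = len(words & positive)                  # distinct positive words in the text
neg = len(words & negative)                  # distinct negative words in the text

# score = share of positive words minus share of negative words
print((pos - neg) / float(len(words)))       # prints 0.25

The mapper below does essentially this, once per tweet and once per candidate mentioned in it.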

Outlining the algorithms

Let’s break out the three processes like we did for the word count.

  1. Mapper
    positiveWords = load positive words
    negativeWords = load negative words
    for each tweet:
        parse the tweet
        date  = date of the tweet down to the minute
        tweetWords = all the words in the tweet text
        positiveCount = 0
        negativeCount = 0
        for candidate in 'obama' and 'romney':
            if candidate is in the text:
                for each positive word in the text:
                    positiveCount = positiveCount + 1
                for each negative word in the text:
                    negativeCount = negativeCount + 1
                positiveRatio = positiveCount / count of all words
                negativeRatio = negativeCount / count of all words

                emit date, candidate, positiveRatio - negativeRatio
  2. Intermediate – sort keys
  3. Reducer
    for each key:
       sum = sum of the values associated with this key
       n   = number of values
       emit key, sum/n

The basic idea for the mapper is that we take the tweet, normalize the text, look for a keyword of interest (in this case, “Obama” or “Romney”), count the positive and negative words, and then subtract the ratio of negative words from the ratio of positive words.

Here’s the actual code, which you should have in sentimentMapper.py. This code is a little more complicated than other code you’ve seen, mostly because it relies heavily on the use of self-defined functions like main. You don’t have to worry about this right now, but functions are an important part of Python, so if you do more work with the language in the future you should learn about them.
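If you’ve never written a function before, all def does is give a name to a block of code so you can call it later; that’s how loadSentiment and main are used below. A trivial, made-up example:

def shout(word):
    # return the word in upper case with an exclamation point
    return word.upper() + '!'

print(shout('hadoop'))   # prints HADOOP!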

#!/usr/bin/env python

from __future__ import division
import json, string, sys, time

sentimentDict = {
    'positive': {},
    'negative': {}
}

def loadSentiment():
    f = open('../data/positive.txt', 'r')
    for line in f:
        sentimentDict['positive'][line.strip()] = 1
    f.close()

    f = open('../data/negative.txt', 'r')
    for line in f:
        sentimentDict['negative'][line.strip()] = 1
    f.close()

def main():
    loadSentiment()

    for line in sys.stdin:
        line = line.strip()

        data = ''
        try:
            data = json.loads(line)
        except ValueError as detail:
            sys.stderr.write(detail.__str__() + "\n")
            continue

        if 'text' in data:
            # Parse data in the format of
            # Sat Mar 12 01:49:55 +0000 2011
            d  = data['created_at'].split(' ')
            ds = ' '.join([d[1], d[2], d[3], d[5] ])
            dt = time.strptime(ds, '%b %d %H:%M:%S %Y')

            date = time.strftime('%Y-%m-%d %H:%M:00', dt)

            ## turn text into lower case
            text = data['text'].lower()

            ## encode in UTF-8 to get rid of Unicode errors
            text   = text.encode('utf-8')
            text   = text.translate( string.maketrans(string.punctuation, ' ' * len(string.punctuation)) )

            words = {}
            # record each distinct word in the tweet
            for w in text.split():
                words[ w ] = 1

            lwords = len(words)

            counts = {
                'positive':0,
                'negative':0
                }

            ratios = {
                'positive':0,
                'negative':0
            }

            ## if a candidate is mentioned, count the sentiment words in the
            ## tweet, turn the counts into ratios, and emit the difference
            for c in ['obama', 'romney']:
                if c in text:
                    for a in ['positive', 'negative']:
                        for w in sentimentDict[a]:
                            if w in words:
                                counts[a] += 1

                        ratios[a] = counts[a]/lwords

                    ## calculate overall sentiment by subtracting one from another 
                    print "\t".join([date, c, str(ratios['positive'] - ratios['negative'])])

if __name__ == '__main__':
    main()
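Before touching real data, you can sanity-check the mapper by piping it a single made-up tweet (this JSON line is invented for illustration; the exact score will depend on what’s in positive.txt and negative.txt):

me@blogclub:~/sandbox/november-tworkshop/bin$ echo '{"created_at": "Mon Nov 05 19:49:55 +0000 2012", "text": "Obama is doing a great job"}' | python sentimentMapper.py

You should get back a single line for obama with the date down to the minute and a score, which will be 0.0 unless some of the tweet’s words appear in the sentiment lists.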

Now let’s look at the reducer, which is implemented in the file avgNReduce.py. It’s very similar to nReducer.py except that it outputs the average of the values for each key instead of their sum.

#!/usr/bin/env python

from __future__ import division
import sys

def main():
    if len(sys.argv) < 2:
        print "Usage: avgNReduce.py <number of key fields>"
        sys.exit(1)

    c_key  = None
    c_Savg = 0
    c_n    = 0
    nkey   = int(sys.argv[1])

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()

        # parse the input we got from mapper
        row  = line.split('\t')
        key  = "\t".join( row[0:nkey] )
        Savg = row[nkey]

        # convert the sentiment value (currently a string) to a float
        try:
            Savg = float(Savg)
        except ValueError:
            # the value was not a number, so silently
            # ignore/discard this line
            continue

        # this IF-switch only works because the mapper output is sorted
        # by key (here: the date and the candidate) before it reaches the reducer
        if c_key == key:
            c_Savg += Savg
            c_n    += 1
        else:
            if c_key:
                # write result to STDOUT
                print '%s\t%s' % (c_key, c_Savg/c_n)
            c_n    = 1
            c_Savg = Savg
            c_key  = key

    # do not forget to output the last key if needed!
    if c_key:
        print '%s\t%s' % (c_key, c_Savg/c_n)

if __name__ == '__main__':
    main()
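You can test the reducer the same way with a couple of handmade lines (the values here are made up): two scores of 0.1 and 0.3 for the same date and candidate should come back as a single line with their average, 0.2.

me@blogclub:~/sandbox/november-tworkshop/bin$ printf '2012-11-06 01:49:00\tobama\t0.1\n2012-11-06 01:49:00\tobama\t0.3\n' | python avgNReduce.py 2
2012-11-06 01:49:00	obama	0.2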

Let’s move on to running this on real data.

Implementation

First you should grab the file called elex2012.2012November5.json from ~ahanna/public and stash it in sandbox/november-tworkshop/data. I’m having you grab it from this directory for two reasons: 1) it’s a somewhat large file (at most 20,000 tweets, 72 MB); and 2) Twitter’s Terms of Service don’t allow for public distribution of raw tweets. These tweets come from our focused sample during a fairly busy stretch on Monday, November 5, the day before the election.

me@blogclub:~/sandbox/november-tworkshop/data$ cp /home/ahanna/public/elex2012.2012November5.json .

Next, check out the first 10 lines of the file using head.

me@blogclub:~/sandbox/november-tworkshop/data$ head -10 elex2012.2012November5.json

Generally messy. Let’s try to make sense of it by running the mapper across these first 10 tweets. cd into your bin directory.

me@blogclub:~/sandbox/november-tworkshop/bin$ head -10 ../data/elex2012.2012November5.json | python sentimentMapper.py 
2012-11-06 01:49:00	obama	0.0
2012-11-06 01:49:00	obama	0.0

Pretty boring: two tweets mention Obama, but they score 0.0 on our sentiment measure. By the way, the time is “01:49” because tweets record time in UTC; that’s equivalent to 7:49 PM CST.
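If you’d rather see those timestamps in your local time, one option is to shift them after the fact. Here’s a rough sketch using Python’s datetime module; the six-hour offset for CST is hard-coded and ignores daylight saving, so treat it as illustrative only.

from datetime import datetime, timedelta

utc = datetime.strptime('2012-11-06 01:49:00', '%Y-%m-%d %H:%M:%S')
cst = utc - timedelta(hours=6)                 # CST is UTC-6
print(cst.strftime('%Y-%m-%d %H:%M:%S'))       # 2012-11-05 19:49:00

Now try the first 100 lines.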

me@blogclub:~/sandbox/november-tworkshop/bin$ head -100 ../data/elex2012.2012November5.json | python sentimentMapper.py 
...

If you did this right you should see a few more values. Now try sorting them.

me@blogclub:~/sandbox/november-tworkshop/bin$ head -100 ../data/elex2012.2012November5.json | python sentimentMapper.py | sort
...

If you did this right you should see the values sorted by date, candidate, then value.

Finally, run it through the reducer. You need to put the number “2” after python avgNReduce.py; it tells the reducer that the first two tab-separated fields (the date and the candidate) make up the key, and the script won’t run without it. You should get the following.

me@blogclub:~/sandbox/november-tworkshop/bin$ head -100 ../data/elex2012.2012November5.json | python sentimentMapper.py | sort | python avgNReduce.py 2
2012-11-06 01:49:00	obama	0.00725749559082
2012-11-06 01:49:00	romney	0.0195280564846

So what this presumably tells us is that sentiment toward Romney was a little more positive than sentiment toward Obama in those 100 tweets.

Finally, run the entire file through the process.

me@blogclub:~/sandbox/november-tworkshop/bin$ cat ../data/elex2012.2012November5.json | python sentimentMapper.py | sort | python avgNReduce.py 2

This should take a few seconds. Compare your output with your neighbor’s.

Again, once you have this output you can throw it into your favorite stats package and graph the change over time. Here’s the output of this algorithm across a few months of our focused sample.
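If you’d rather stay in Python than reach for a separate stats package, here’s a rough sketch of how you might plot the reducer output with matplotlib. It assumes you’ve redirected the output of the pipeline above into a file called sentiment.tsv; the filename and the bare-bones plotting are just illustrative.

import matplotlib.pyplot as plt

# collect the average sentiment values for each candidate
series = {'obama': [], 'romney': []}
for line in open('sentiment.tsv'):
    date, candidate, score = line.strip().split('\t')
    series[candidate].append(float(score))

# the reducer output is already sorted by date, so plotting the values
# in order gives a rough time series for each candidate
for candidate, scores in series.items():
    plt.plot(scores, label=candidate)

plt.legend()
plt.ylabel('average sentiment')
plt.show()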

Going Forward

That’s it for this portion of the analysis. Again, since there are 101 ways to process text, this only scratches the surface of what’s possible with this kind of analysis. If you want to see what else you can do with Python, check out the Natural Language Toolkit (NLTK) for other kinds of text processing.