## Lesson 6: Basic sentiment analysis

Moving on to something a little more useful, we turn to actual Twitter data and to transforming the raw text into something meaningful: sentiment across time. More generally, this lesson is about computing metrics across time, which lets us do something interesting with lots of data at once.

There are lots of ways to do sentiment analysis, including machine-learning approaches like Naive Bayes classifiers and support vector machines. What we’re going to use today is incredibly naive: it’s based on a derivative of the MPQA Subjectivity Lexicon, with word lists that Neal Caren, sociology faculty at UNC-Chapel Hill, put together. There’s not a one-to-one correspondence between the two word lists, but for the sake of demonstration we’re not going to worry about that right now.

All I’m saying is that you probably don’t want to try to publish the results of this analysis.
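
For reference, each word list is just a plain-text file with one word per line, which is how the mapper below reads them in. The specific words here are invented for illustration; the actual `positive.txt` will differ:

```
great
happy
win
```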

## Outlining the algorithms

Let’s break out the three processes like we did for the word count.

1. Mapper

    ```
    positiveWords = load positive words
    negativeWords = load negative words
    for each tweet:
        parse the tweet
        date       = date of the tweet down to the minute
        tweetWords = all the words in the tweet text
        positiveCount = 0
        negativeCount = 0
        for candidate in 'obama' and 'romney':
            if candidate is in the text:
                for each positive word that appears in tweetWords:
                    positiveCount = positiveCount + 1
                for each negative word that appears in tweetWords:
                    negativeCount = negativeCount + 1
                positiveRatio = positiveCount / count of all words
                negativeRatio = negativeCount / count of all words

                emit date, candidate, positiveRatio - negativeRatio
    ```
2. Intermediate – `sort keys`
3. Reducer

    ```
    for each key:
        sum = sum of the values associated with this key
        n   = number of values
        emit key, sum / n
    ```

The basic idea for the mapper is that we take the tweet, normalize the text, look for a keyword of interest (in this case, “Obama” or “Romney”), count the positive and negative keywords, and then subtract the ratio of negative words from the ratio of positive words.
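
To make the arithmetic concrete, here’s a toy run of that scoring step on one made-up tweet, with invented word lists:

```
# a minimal sketch of the scoring arithmetic, with made-up inputs
positiveWords = {'win': 1, 'great': 1}
negativeWords = {'lose': 1, 'sad': 1}

words = 'obama will win this great race'.split()   # 6 words total

positiveCount = sum(1 for w in positiveWords if w in words)   # 'win', 'great' -> 2
negativeCount = sum(1 for w in negativeWords if w in words)   # none -> 0

score = positiveCount / float(len(words)) - negativeCount / float(len(words))
print score   # 2/6 - 0/6 = 0.333...
```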

Here’s the actual code, which you should have in `sentimentMapper.py`. This code is a little more complicated than other code you’ve seen, mostly because it relies heavily on the use of self-defined functions like `main` (there’s a small standalone example of a function right after the listing). You don’t have to worry about this right now, but functions are an important part of Python, so if you do more work with the language in the future you should learn about them.

```
#!/usr/bin/env python

from __future__ import division
import json, string, sys, time

## load the positive and negative word lists into one dictionary
sentimentDict = {
    'positive': {},
    'negative': {}
}

f = open('../data/positive.txt', 'r')
for line in f:
    sentimentDict['positive'][line.strip()] = 1
f.close()

f = open('../data/negative.txt', 'r')
for line in f:
    sentimentDict['negative'][line.strip()] = 1
f.close()

def main():

    for line in sys.stdin:
        line = line.strip()

        data = ''
        try:
            data = json.loads(line)
        except ValueError as detail:
            sys.stderr.write(detail.__str__() + "\n")
            continue

        if 'text' in data:
            # Parse the created_at date, which looks like
            # Sat Mar 12 01:49:55 +0000 2011
            d  = string.split(data['created_at'], ' ')
            ds = ' '.join([d[1], d[2], d[3], d[5]])
            dt = time.strptime(ds, '%b %d %H:%M:%S %Y')

            date = time.strftime('%Y-%m-%d %H:%M:00', dt)

            ## turn text into lower case
            text = data['text'].lower()

            ## encode in UTF-8 to get rid of Unicode errors
            text = text.encode('utf-8')

            ## replace punctuation with spaces
            text = text.translate(string.maketrans(string.punctuation, ' ' * len(string.punctuation)))

            ## collect the set of unique words in the tweet
            words = {}
            for w in text.split(None):
                if len(w) > 0:
                    words[w] = 1

            lwords = len(words)

            counts = {
                'positive': 0,
                'negative': 0
            }

            ratios = {
                'positive': 0,
                'negative': 0
            }

            for c in ['obama', 'romney']:
                if c in text:
                    for a in ['positive', 'negative']:
                        for w in sentimentDict[a]:
                            if w in words:
                                counts[a] += 1

                        ratios[a] = counts[a] / lwords

                    ## calculate overall sentiment by subtracting one ratio from the other
                    print "\t".join([date, c, str(ratios['positive'] - ratios['negative'])])

if __name__ == '__main__':
    main()
```
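
If the function business is new to you, here’s the idea in miniature; the function name and numbers below are made up for illustration:

```
def average(values):
    # add up the values and divide by how many there are
    return sum(values) / float(len(values))

print average([1, 2, 3])   # prints 2.0
```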

Now let’s look at the reducer, which is implemented in the file `avgNReduce.py`. It’s very similar to `nReducer.py` except that it outputs averages across keys instead of sums.

```
#!/usr/bin/env python

from __future__ import division
from operator import itemgetter
import sys

def main():
    if len(sys.argv) < 2:
        print "Usage: avgNReduce.py <number of key fields>"
        sys.exit(0)

    c_key  = None
    c_Savg = 0
    c_n    = 0
    nkey   = int(sys.argv[1])

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()

        # parse the input we got from the mapper: the first nkey
        # tab-separated fields are the key, the next one is the value
        row  = line.split('\t')
        key  = "\t".join(row[0:nkey])
        Savg = row[nkey]

        # convert the value (currently a string) to a float
        try:
            Savg = float(Savg)
        except ValueError:
            # the value was not a number, so silently skip this line
            continue

        # this IF-switch only works because Hadoop sorts map output
        # by key (here: date and candidate) before it is passed to the reducer
        if c_key == key:
            c_Savg += Savg
            c_n    += 1
        else:
            if c_key:
                # write result to STDOUT
                print '%s\t%s' % (c_key, c_Savg / c_n)
            c_n    = 1
            c_Savg = Savg
            c_key  = key

    # do not forget to output the last key if needed!
    if c_key == key:
        print '%s\t%s' % (c_key, c_Savg / c_n)

if __name__ == '__main__':
    main()
```
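
To see the reducer in isolation, you can feed it a couple of hand-typed lines (the scores here are made up). With the argument `2`, it treats the first two tab-separated columns as the key and averages the third:

```
me@blogclub:~/sandbox/november-tworkshop/bin$ printf '2012-11-06 01:49:00\tobama\t0.5\n2012-11-06 01:49:00\tobama\t0.1\n' | python avgNReduce.py 2
2012-11-06 01:49:00	obama	0.3
```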

Let’s move on to running this on actual data.

## Implementation

First you should grab the file called `elex2012.2012November5.json` from `~ahanna/public`. Stash it in `sandbox/november-tworkshop/data`. I’m having you grab it from this directory for two reasons: 1) it’s a somewhat large file (at most 20,000 tweets, 72 MB); and 2) Twitter’s Terms of Service don’t allow for public distribution of raw tweets. These tweets come from our focused sample during a fairly busy period on Monday, November 5, the day before the election.

`me@blogclub:~/sandbox/november-tworkshop/data$ cp /home/ahanna/public/elex2012.2012November5.json .`

Next, check out the first 10 lines of the file using `head`.

`me@blogclub:~/sandbox/november-tworkshop/data$ head -10 elex2012.2012November5.json`
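
That’s raw JSON, one tweet per line. If you’d like a friendlier view of a single tweet, Python’s built-in `json.tool` module will pretty-print it:

`me@blogclub:~/sandbox/november-tworkshop/data$ head -1 elex2012.2012November5.json | python -m json.tool`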

Generally messy. Let’s try to make sense of it by running the mapper across these first 10 tweets. `cd` into your `bin` directory.

```
me@blogclub:~/sandbox/november-tworkshop/bin$ head -10 ../data/elex2012.2012November5.json | python sentimentMapper.py
2012-11-06 01:49:00	obama	0.0
2012-11-06 01:49:00	obama	0.0
```

Pretty boring: two tweets mention Obama but have no sentiment attached to them. Now try the first 100 lines. By the way, the time reads “01:49” because tweets record time in UTC, which is easy enough to convert to your preferred timezone; it is equivalent to 7:49 PM CST.
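
If you want to do that conversion in Python, here’s a minimal sketch that assumes a fixed UTC-6 offset for CST (no daylight-saving handling):

```
from datetime import datetime, timedelta

# parse the UTC timestamp emitted by the mapper
utc = datetime.strptime('2012-11-06 01:49:00', '%Y-%m-%d %H:%M:%S')
cst = utc - timedelta(hours=6)   # CST is UTC-6
print cst.strftime('%Y-%m-%d %I:%M %p')   # 2012-11-05 07:49 PM
```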

```
me@blogclub:~/sandbox/november-tworkshop/bin$ head -100 ../data/elex2012.2012November5.json | python sentimentMapper.py
...
```

If you did this right you should see a few more values. Now try sorting them.

```
me@blogclub:~/sandbox/november-tworkshop/bin$ head -100 ../data/elex2012.2012November5.json | python sentimentMapper.py | sort
...
```

If you did this right you should see the values sorted by date, candidate, then value.

Finally, run it through the reducer. You need to put the number `2` after the command `python avgNReduce.py`; that tells the reducer that the first two tab-separated columns (date and candidate) make up the key, and the third column is the value to average. You should get the following.

```
me@blogclub:~/sandbox/november-tworkshop/bin$ head -100 ../data/elex2012.2012November5.json | python sentimentMapper.py | sort | python avgNReduce.py 2
2012-11-06 01:49:00	obama	0.00725749559082
2012-11-06 01:49:00	romney	0.0195280564846
```

So what this presumably tells us is that sentiment toward Romney was a little more positive than sentiment toward Obama in those 100 tweets.

Finally, run the entire file through the process.

`me@blogclub:~/sandbox/november-tworkshop/bin$ cat ../data/elex2012.2012November5.json | python sentimentMapper.py | sort | python avgNReduce.py 2`

This should take a few seconds. Compare your output with your neighbor’s.

Again, once you have this you can throw the output into your favorite stat package and graph the change over time. Here’s the output of this algorithm across a few months of our focused sample.
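
If you’d rather stay in Python, here’s a rough matplotlib sketch; it assumes you’ve redirected the reducer output to a hypothetical file called `sentiment.tsv`:

```
import matplotlib.pyplot as plt

# collect one time series of scores per candidate
series = {'obama': [], 'romney': []}
for line in open('../data/sentiment.tsv'):
    date, candidate, score = line.strip().split('\t')
    series[candidate].append(float(score))

for candidate in series:
    plt.plot(series[candidate], label=candidate)
plt.legend()
plt.ylabel('mean sentiment score')
plt.savefig('sentiment.png')
```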

## Going Forward

That’s it for this portion of the analysis. Again, since there are 101 ways to process text, this only scratches the surface of what’s possible with this kind of analysis. If you want to see what else you can do with Python, check out the Natural Language Toolkit (NLTK) for other sorts of text processing.