Lesson 1: Twitter API and an Introduction to the Terminal

Accessing the Twitter API

Researchers and others who want large, publicly available Twitter datasets get them through Twitter’s API. API stands for Application Programming Interface, and many services that want to start a developer community around their product release one. Facebook has an API that is somewhat restrictive, while Klout has an API that lets you automatically look up Klout scores and all their different facets.

The Twitter API has two different flavors: RESTful and Streaming. The RESTful API is useful for getting things like lists of a user’s followers and of the accounts that user follows, and is what most Twitter clients are built on. We are not going to deal with the RESTful API right now, but you can find more information on it here: https://dev.twitter.com/docs/api. Right now we are going to focus on the Streaming API (more info here: https://dev.twitter.com/docs/streaming-api). The Streaming API works by making a request for a specific type of data (filtered by keyword, user, or geographic area, or a random sample) and then keeping the connection open as long as there are no errors in the connection.

For my own purposes, I’ve been using the tweepy package to access the Streaming API.  Using this package takes a little bit of knowledge of Python.  We won’t get into that right now, but once we do, it’s a very simple package to set up and use.

For more advanced users, I’ve incorporated two changes in my own fork that have worked well for me on both Linux and OSX systems: https://github.com/raynach/tweepy

Understanding Twitter Data

Once you’ve connected to the Twitter API, whether via the RESTful API or the Streaming API, you’re going to start getting a bunch of data back.  The data you get back will be encoded in JSON, or JavaScript Object Notation. JSON is a way to encode complicated information in a platform-independent way.  It could be considered the lingua franca of information exchange on the Internet.  When you click a snazzy Web 2.0 button on Facebook or Amazon and the page produces a lightbox (a box that hovers above a page without leaving the page you’re on now), there was probably some JSON involved.

JSON isn’t pretty to look at with the human eye, but it’s a rather simple and elegant way to encode complex data structures. When a tweet comes back from the API, this is what it looks like:

{"possibly_sensitive_editable":true,"text":"TeeMinus24's Shirt of the Day is Palpatine\/Vader '12. Support the Sith. Change you can't stop. http:\/\/t.co\/wFh1cCep","id_str":"175090352598945794","entities":{"urls":[{"indices":[95,115],"expanded_url":"http:\/\/fb.me\/1isEdQJSq","display_url":"fb.me\/1isEdQJSq","url":"http:\/\/t.co\/wFh1cCep"}],"hashtags":[],"user_mentions":[]},"retweeted":false,"place":null,"retweet_count":0,"in_reply_to_status_id_str":null,"coordinates":null,"source":"\u003Ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003EFacebook\u003C\/a\u003E","in_reply_to_user_id_str":null,"in_reply_to_status_id":null,"favorited":false,"geo":null,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"truncated":false,"created_at":"Thu Mar 01 05:29:27 +0000 2012","possibly_sensitive":false,"contributors":null,"user":{"geo_enabled":false,"profile_link_color":"009999","id_str":"281077639","listed_count":1,"lang":"en","notifications":null,"location":"","is_translator":false,"follow_request_sent":null,"statuses_count":461,"profile_background_color":"131516","followers_count":43,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1428484273\/TeeMinus24_logo_normal.jpg","default_profile":false,"profile_background_tile":true,"description":"We are a limited edition t-shirt company. We make tees that are designed for the fan; movies, television shows, video games, sci-fi, web, and tech. We have it!","following":null,"profile_sidebar_fill_color":"efefef","contributors_enabled":false,"profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme14\/bg.gif","verified":false,"profile_sidebar_border_color":"eeeeee","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1428484273\/TeeMinus24_logo_normal.jpg","default_profile_image":false,"protected":false,"show_all_inline_media":false,"profile_use_background_image":true,"favourites_count":0,"created_at":"Tue Apr 12 15:48:23 +0000 2011","name":"Vincent Genovese","friends_count":52,"profile_text_color":"333333","url":"http:\/\/www.teeminus24.com","id":281077639,"profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme14\/bg.gif","time_zone":"Eastern Time (US & Canada)","utc_offset":-18000,"screen_name":"TeeMinus24"},"id":175090352598945794}

Ugh. Yeah, it’s a big long line of ugly.  Let’s break it down a little more.

{
    "contributors": null, 
    "truncated": false, 
    "text": "TeeMinus24's Shirt of the Day is Palpatine/Vader '12. Support the Sith. Change you can't stop. http://t.co/wFh1cCep", 
    "in_reply_to_status_id": null, 
    "id": 175090352598945794, 
    "entities": {
        "user_mentions": [], 
        "hashtags": [], 
        "urls": [
            {
                "indices": [
                    95, 
                    115
                ], 
                "url": "http://t.co/wFh1cCep", 
                "expanded_url": "http://fb.me/1isEdQJSq", 
                "display_url": "fb.me/1isEdQJSq"
            }
        ]
    }, 
    "retweeted": false, 
    "coordinates": null, 
    "source": "<a href=\"http://www.facebook.com/twitter\" rel=\"nofollow\">Facebook</a>", 
    "in_reply_to_screen_name": null, 
    "id_str": "175090352598945794", 
    "retweet_count": 0, 
    "in_reply_to_user_id": null, 
    "favorited": false, 
    "user": {
        "follow_request_sent": null, 
        "profile_use_background_image": true, 
        "default_profile_image": false, 
        "profile_background_image_url_https": "https://si0.twimg.com/images/themes/theme14/bg.gif", 
        "verified": false, 
        "profile_image_url_https": "https://si0.twimg.com/profile_images/1428484273/TeeMinus24_logo_normal.jpg", 
        "profile_sidebar_fill_color": "efefef", 
        "is_translator": false, 
        "id": 281077639, 
        "profile_text_color": "333333", 
        "followers_count": 43, 
        "protected": false, 
        "location": "", 
        "profile_background_color": "131516", 
        "id_str": "281077639", 
        "utc_offset": -18000, 
        "statuses_count": 461, 
        "description": "We are a limited edition t-shirt company. We make tees that are designed for the fan; movies, television shows, video games, sci-fi, web, and tech. We have it!", 
        "friends_count": 52, 
        "profile_link_color": "009999", 
        "profile_image_url": "http://a0.twimg.com/profile_images/1428484273/TeeMinus24_logo_normal.jpg", 
        "notifications": null, 
        "show_all_inline_media": false, 
        "geo_enabled": false, 
        "profile_background_image_url": "http://a0.twimg.com/images/themes/theme14/bg.gif", 
        "screen_name": "TeeMinus24", 
        "lang": "en", 
        "profile_background_tile": true, 
        "favourites_count": 0, 
        "name": "Vincent Genovese", 
        "url": "http://www.teeminus24.com", 
        "created_at": "Tue Apr 12 15:48:23 +0000 2011", 
        "contributors_enabled": false, 
        "time_zone": "Eastern Time (US & Canada)", 
        "profile_sidebar_border_color": "eeeeee", 
        "default_profile": false, 
        "following": null, 
        "listed_count": 1
    }, 
    "geo": null, 
    "in_reply_to_user_id_str": null, 
    "possibly_sensitive": false, 
    "created_at": "Thu Mar 01 05:29:27 +0000 2012", 
    "possibly_sensitive_editable": true, 
    "in_reply_to_status_id_str": null, 
    "place": null
}

Okay. Much better. Comparatively.
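
If you’re curious how to get from the one-line version to the indented one, Python’s built-in json module can do it. This is only a minimal sketch: the tweet here is a shortened stand-in that keeps a few keys from the example, and the variable names (raw, tweet, pretty) are mine, just for illustration.

```python
import json

# A shortened stand-in for the one-line tweet (only a few keys kept).
raw = '{"text":"Support the Sith.","retweet_count":0,"entities":{"hashtags":[],"urls":[{"indices":[95,115]}]}}'

# json.loads turns the JSON text into Python objects.
tweet = json.loads(raw)

# json.dumps turns the objects back into text; indent=4 adds the
# line breaks and four-space indentation that make it readable.
pretty = json.dumps(tweet, indent=4)
print(pretty)
```

Nothing about the data changes here. The indented text and the one-line text decode to exactly the same tweet.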

What does all of the above mean? Well, let’s break down what each of the indents means first. When we have some stuff enclosed in {curly brackets}, we can call that a dictionary. You can think of a dictionary just like you think of a dictionary in common language: you pick a word (a key) and you get back a definition (a value). You can see that the whole tweet is treated like one big dictionary. The second type that is used is the array, which is like a vector in math. Arrays are denoted by [square brackets]. You can see a few of these a few lines from the top.

Keep those two terms — dictionary and array — tucked away somewhere. We’re going to revisit them when we talk about basic Python programming. For now, just remember that a dictionary is based on key-value pairs, while an array is just a collection of values, indexed by numbers.
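
As a small preview of the Python to come, here is roughly what those two terms look like in code. It’s a sketch with values borrowed from the example tweet; Python calls an array a list, and the variable names are just for illustration.

```python
# A dictionary maps keys to values and is written with curly brackets.
user = {"screen_name": "TeeMinus24", "followers_count": 43}
print(user["followers_count"])   # look a value up by its key

# A list (Python's name for an array) keeps values in order and is
# written with square brackets.
indices = [95, 115]
print(indices[0])                # look a value up by its position; counting starts at 0
```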

Let’s move our focus now to the actual elements of the tweet. Most of the keys, that is, the words on the left of the colon, are self-explanatory. The most important ones are “text”, “entities”, and “user”. “Text” is the text of the tweet. “Entities” are the user mentions, hashtags, and links used in the tweet, separated out for easy access. “User” contains a lot of information on the user, from the URL of their profile image to the date they joined Twitter.
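
To make those three keys concrete, here is a minimal sketch of pulling them out of a tweet with Python’s json module. The tweet is a trimmed copy of the example (most keys dropped, text shortened), and the variable names are my own.

```python
import json

# A trimmed copy of the example tweet, keeping only a few of its keys.
raw = """{"text": "Support the Sith. Change you can't stop.",
 "entities": {"urls": [{"expanded_url": "http://fb.me/1isEdQJSq"}],
              "hashtags": [], "user_mentions": []},
 "user": {"screen_name": "TeeMinus24",
          "created_at": "Tue Apr 12 15:48:23 +0000 2011"}}"""
tweet = json.loads(raw)

# "text" is a plain string.
text = tweet["text"]

# "entities" is a dictionary of lists; each link is itself a small dictionary.
links = [url["expanded_url"] for url in tweet["entities"]["urls"]]

# "user" is a dictionary nested inside the tweet dictionary.
screen_name = tweet["user"]["screen_name"]
joined = tweet["user"]["created_at"]

print(screen_name, "tweeted:", text)
print("links:", links)
print("joined Twitter:", joined)
```

Notice that getting to a nested value is just a chain of lookups: first the key in the tweet, then the key (or position) inside whatever came back.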

Now that you see what data you get with a tweet, you can envision interesting types of analysis that can emerge by analyzing a whole lot of them.
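
As one hedged example of what that could look like, the sketch below counts hashtags and tweets per user across a handful of tweets. The three tweets are made up for illustration (they are not real data), and I’m assuming each hashtag entry carries a "text" key, as Twitter’s documentation describes; the example tweet above has no hashtags to check against.

```python
from collections import Counter

# Three made-up tweets, trimmed to the keys we need (not real data).
# Assumption: each hashtag entry has a "text" key.
tweets = [
    {"user": {"screen_name": "alice"},
     "entities": {"hashtags": [{"text": "sith"}]}},
    {"user": {"screen_name": "bob"},
     "entities": {"hashtags": [{"text": "Sith"}, {"text": "tees"}]}},
    {"user": {"screen_name": "alice"},
     "entities": {"hashtags": []}},
]

hashtag_counts = Counter()
tweets_per_user = Counter()

for tweet in tweets:
    tweets_per_user[tweet["user"]["screen_name"]] += 1
    for tag in tweet["entities"]["hashtags"]:
        # Lowercase so "Sith" and "sith" are counted together.
        hashtag_counts[tag["text"].lower()] += 1

print(hashtag_counts.most_common())
print(tweets_per_user.most_common())
```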

A Disclaimer on Collecting Tweets

Unfortunately, you do not have carte blanche to share the tweets you collect. Twitter restricts publicly releasing datasets according to their API Terms of Service (https://dev.twitter.com/terms/api-terms). This is unfortunate for collaboration when colleagues have collected unique datasets.  However, you can share derivative analysis from tweets, such as content analysis and aggregate statistics.

Introducing the Terminal

[I’m borrowing heavily from this tutorial, because surely I’m not the first one to write a Terminal tutorial.]

The Terminal is how you access a UNIX-based machine remotely.  The data processing we’re going to be doing will be done on a Hadoop cluster with each node running Linux.  A cluster is just a fancy way of saying many computers working together.  A node is one computer in that cluster.

For now, we’re not going to access that cluster (and given that it hasn’t been set up yet, it’s not possible).  Instead we’re going to connect to an old computer I have running out of my living room, which means you should be very sympathetic to my bandwidth restrictions.  Here is how you connect to it.

For Windows:

  1. Download PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/)
  2. Save and open it.
  3. Type blogclub.alex-hanna.com in the “host” field.
  4. Use the credential information that I give you in the workshop.

For Mac:

  1. Go to Applications -> Utilities -> Terminal and open Terminal.
  2. Type ssh <username>@blogclub.alex-hanna.com
  3. Use the credential information I give you in the workshop.


Like in Windows or Mac, files in Linux are organized hierarchically. There is the root of the filesystem at the “/” directory. From there, files and directories are ordered in a tree structure.

Right now, having just logged in, you are in what is called your home directory. The home directory is your own workspace, where you can do whatever you want and you own all of your own files.

I’m going to introduce two commands for looking around in your working directory: ls and pwd. ls (with nothing typed after it) lists all the files in your working directory. pwd tells you what directory you are in.

Let’s try both of those commands:

me@blogclub:~$ ls
me@blogclub:~$ pwd

Those are both pretty boring. There are, by default, no files in our home directories. But what happens when we look at what’s in other directories? Try typing ls /:

me@blogclub:~$ ls /
bin  boot  cdrom  dev  etc  home  lib  lost+found  media  mnt  opt  proc  root  sbin  selinux  srv  sys  tmp  usr  var

Okay, more interesting. We see what’s in the root level directory of the Linux filesystem. Let’s give ls one more argument, -l, so we get more information on each item in that directory.

me@blogclub:~$ ls -l /
total 88
drwxr-xr-x  2 root root  4096 2012-02-26 15:42 bin
drwxr-xr-x  2 root root  4096 2012-02-27 06:29 boot
drwxr-xr-x  2 root root  4096 2012-02-26 15:20 cdrom
drwxr-xr-x 15 root root  3680 2012-02-27 19:17 dev
drwxr-xr-x 85 root root  4096 2012-03-01 00:55 etc
drwxr-xr-x 15 root root  4096 2012-02-26 17:49 home
drwxr-xr-x 13 root root 12288 2012-02-26 16:00 lib
drwx------  2 root root 16384 2012-02-26 15:10 lost+found
drwxr-xr-x  2 root root  4096 2012-02-26 15:11 media
drwxr-xr-x  2 root root  4096 2010-04-23 05:12 mnt
drwxr-xr-x  2 root root  4096 2012-02-26 15:12 opt
dr-xr-xr-x 85 root root     0 2012-02-27 19:17 proc
drwx------  4 root root  4096 2012-02-26 15:24 root
drwxr-xr-x  2 root root  4096 2012-02-26 15:45 sbin
drwxr-xr-x  2 root root  4096 2009-12-05 16:04 selinux
drwxr-xr-x  2 root root  4096 2012-02-26 15:12 srv
drwxr-xr-x 12 root root     0 2012-02-27 19:17 sys
drwxrwxrwt  2 root root  4096 2012-03-01 00:55 tmp
drwxr-xr-x 11 root root  4096 2012-02-26 15:12 usr
drwxr-xr-x 15 root root  4096 2012-02-26 15:30 var

Okay, this looks a bit more interesting. In this listing, we get information on each item’s permissions, its number of links, the user who owns it, the group who owns it, its size, and the date and time it was last modified.
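
If it helps to see those columns labeled, here is a small Python sketch that splits one line of the listing into its fields. The line is copied from the output above; the field names are my own labels, and the simple split would not work for file names that contain spaces.

```python
# One line copied from the "ls -l /" listing.
line = "drwxr-xr-x 15 root root  4096 2012-02-26 17:49 home"

# split() with no arguments breaks the line apart on any run of spaces.
permissions, links, owner, group, size, date, time, name = line.split()

# A leading "d" in the permissions column marks a directory.
is_directory = permissions.startswith("d")

print("name:", name)
print("permissions:", permissions, "(directory)" if is_directory else "(file)")
print("links:", links)
print("owner:", owner, "group:", group)
print("size in bytes:", size)
print("last modified:", date, time)
```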

One last command that is integral for navigation is the cd command, which changes directories. Let’s try using several of these commands together.

me@blogclub:~$ cd /
me@blogclub:/$ pwd
me@blogclub:/$ ls
bin  boot  cdrom  dev  etc  home  lib  lost+found  media  mnt  opt  proc  root  sbin  selinux  srv  sys  tmp  usr  var

Manipulating Files

There are four commands that are integral to changing files and directories in Linux:

  • cp – copy files and directories
  • mv – move or rename files and directories
  • rm – remove files
  • mkdir – create directories

Let’s start with mkdir and create a standard directory structure for our purposes.

First, you can make sure you are back in your home directory by typing cd ~. The tilde (~) is a shortcut for your home directory.

me@blogclub:/$ pwd
me@blogclub:/$ cd ~
me@blogclub:~$ pwd

Now let’s create a directory which we can play in. Let’s call it a “sandbox”. Then we can use ls to make sure that it was created.

me@blogclub:~$ mkdir sandbox
me@blogclub:~$ ls 

Now that we have one directory, let’s create several more. As an exercise, create a directory named tworkshop inside of sandbox. Then create two more directories within that one: bin and data.

Now, enter the data directory that you created. We are going to copy some files into it.

me@blogclub:~/sandbox/tworkshop/data$ cp /home/ahanna/public/example.json .
me@blogclub:~/sandbox/tworkshop/data$ cp /home/ahanna/public/prettyExample.json .

cp has two required arguments — source is the first, and destination is the second. These are the files that we saw in the previous section. The destination argument in this case is the current directory, which is expressed by a period (.).

Now, let’s see how to rename and remove items. First make a copy of the example.json file.

me@blogclub:~/sandbox/tworkshop/data$ cp example.json copyOfExample.json

Then let’s rename that file.

me@blogclub:~/sandbox/tworkshop/data$ mv copyOfExample.json copy.json
me@blogclub:~/sandbox/tworkshop/data$ ls
copy.json  example.json  prettyExample.json

Finally, let’s get rid of it.

me@blogclub:~/sandbox/tworkshop/data$ rm copy.json
me@blogclub:~/sandbox/tworkshop/data$ ls
example.json  prettyExample.json

To get out of the ~/sandbox/tworkshop/data directory, we can go up a level by using cd followed by two periods. The two periods (..) represent the current working directory’s parent directory.

me@blogclub:~/sandbox/tworkshop/data$ cd ..
me@blogclub:~/sandbox/tworkshop$ cd ..
me@blogclub:~/sandbox$ cd ..

Viewing files

The last thing I want to touch on in this section is how to look at files. The relevant command here is more. It’s pretty simple to use. Just type the command followed by the file you want to look at:

me@blogclub:~/sandbox/tworkshop/data$ more prettyExample.json
{
    "contributors": null, 
    "truncated": false, 
    "text": "TeeMinus24's Shirt of the Day is Palpatine/Vader '12. Support the Sith. Change you can't stop. http://t.co/wFh1cCep", 
    "in_reply_to_status_id": null, 
    "id": 175090352598945794, 
    "entities": {
        "user_mentions": [], 
        "hashtags": [], 
        "urls": [
            {
                "indices": [
                    95, 
                    115
                ], 
                "url": "http://t.co/wFh1cCep", 
                "expanded_url": "http://fb.me/1isEdQJSq", 
                "display_url": "fb.me/1isEdQJSq"

Changing passwords

The last thing I want to tell you how to do is change your password. It’s a pretty simple command: passwd. You type in your old password, then your new one. Nothing will show up as you type, to keep it safe from anybody who is snooping over your shoulder.

The things we didn’t get to this time…

There’s a lot that I didn’t cover in this short introduction, including input/output redirection and important commands such as cat (concatenate files), grep (match lines according to a given pattern), and head and tail (print the first [or last] n lines of a file). I’m going to go through these in the next section, but if you want to move ahead, you can go through the tutorial linked above.