Lesson 3: Basic Python

Today we’re going to focus on learning Python.  Python is a programming language that has become incredibly popular within the last few years.  It’s object-oriented, which means that you can use add-ons without having to know many of their idiosyncrasies. It’s also extendable, so if you want to add on to the language it’s rather simple to do.  It has an intuitive syntax that is more appealing than Perl and C/C++, and it’s largely platform-independent, which means you can run it on your own computer.  If you’ve never programmed before, Python is a good place to start because it lets you cut past some of the more idiosyncratic features of programming languages and dive right into writing code.

Python can play nice with Hadoop MapReduce, which makes it good for our data processing.  We’ll get into how to do that down the line.  But even before we get to data processing, Python is great for data collection. All the Streaming API stuff I do is in Python.

A last thing to note is that Python is a programming language like Java or like C. It doesn’t work like SPSS or Stata, which are meant to operate on statistical data and take a dataset as input. That’s not how Python works. You can manipulate data in Python but it must be contained within certain data structures.  We’ll get to those below.  For now, we need to start at square 1.

As an aside, I’m basing a lot of this tutorial on An Informal Introduction to Python, which I would highly urge you to complete after this workshop.

Setting up

Python has a file mode and an interactive mode. If you have used Stata, this is like using .do files vs using Stata like a terminal and inputting one line at a time. Today we’re just going to focus on building .py files and running them.

Like in the last lesson, you need to open up jEdit and your Terminal to get started. We’re going to be primarily saving new files here, so you should remember how to save files via FTP.

Start a new file in jEdit. Let’s get to coding.

Variables

The first thing you need to learn about is variables. A variable is where you store values. For the most basic variables you can store numbers and strings. Let’s try to store two numbers. Type the below stuff into jEdit.

a = 10
b = 5

Alright, pretty simple. Let’s add two more variables that use some arithmetic operators.

c = 10 + 5
d = 6 * 7

Finally, let’s try to print all of this stuff to see if it actually worked. Add the following lines.

print a
print b
print c
print d

So the whole file should look like this:

a = 10
b = 5
c = 10 + 5
d = 6 * 7

print a
print b
print c
print d

Now save this file via Secure FTP to the server. Instructions on how to do this are in the last lesson. You should save it to ~/sandbox/tworkshop/bin. Call the file lab3.py.

After you’ve done that, go to your Terminal. Navigate to ~/sandbox/tworkshop/bin. Now run the lab3.py file. You should get the following output.

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py 
10
5
15
42

You can also treat variables like numbers. For instance, remove all the other print statements and add one that does this:

a = 10
b = 5
c = 10 + 5
d = 6 * 7

print (a + b) * c

Run it.

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py 
225
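The parentheses matter here. As a quick sketch of why (the names with_parens and without_parens are mine, and I'm writing print(...) with parentheses, which is an assumption on my part so that the snippet runs on both Python 2 and Python 3):

```python
a = 10
b = 5
c = 10 + 5

# Parentheses force the addition to happen first: (10 + 5) * 15
with_parens = (a + b) * c

# Without them, multiplication happens before addition: 10 + (5 * 15)
without_parens = a + b * c

print(with_parens)
print(without_parens)
```

The first line printed is the 225 from above; the second comes out as 85.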

Great. For giggles, let’s add another line to the file. Go back to jEdit and add a line at the end that contains only the name of a variable we never defined, wrong. Save the file again (by the way, did you know that if you press Ctrl+S after initially having saved the file to the server, it automatically gets uploaded to the server? Cool!). Run the file again.

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py 
225
Traceback (most recent call last):
  File "lab3.py", line 12, in <module>
    wrong
NameError: name 'wrong' is not defined

Python prints the result of the arithmetic like usual. But then you get an error because you never assigned anything to the variable wrong.

There’s a lot more we can do with numerical variables. You can use non-integer numbers (e.g. 3.141), round numbers, etc. But we’re okay on this for now. Let’s go to strings.
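If you're curious, here is a minimal sketch of non-integer numbers and rounding. The variable names (pi, radius, area, rounded) are just ones I made up for illustration, and print(...) is written with parentheses so the snippet should run on either Python 2 or Python 3:

```python
pi = 3.141
radius = 2

# Mixing a non-integer number with an integer gives a non-integer result
area = pi * radius * radius

# round() takes the number and how many decimal places to keep
rounded = round(area, 1)

print(area)
print(rounded)
```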

Strings

Strings are structured text that can be stored in variables. We’ve already used strings before in the “Hello World!” example from the previous lesson. Let’s try to replicate that. Add these two variables to your file and remove the previous print statement so that it looks like this.

a = 10
b = 5
c = 10 + 5
d = 6 * 7

hello = "Hello"
world = "World!"

print hello
print world

You should get this.

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py 
Hello
World!

Hrm. Close, but no cigar. This is where string concatenation comes in. Try changing the print statements to one print statement that does this:

print hello + world

Notice the + operator? That concatenates two strings together. Run it again.

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py 
HelloWorld!

Bluh. No go. One more try: we can add a string literal containing a single space between those two strings.

print hello + " " + world

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
Hello World!

Now we’re talking.
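One thing to watch out for: + only concatenates strings with other strings. If you want to glue a number onto some text, you have to turn it into a string first with str(). A minimal sketch, assuming the variables from our file (greeting and answer are names I made up, and print(...) is written with parentheses so it runs on Python 2 and Python 3 alike):

```python
hello = "Hello"
world = "World!"
d = 6 * 7

# String + string works directly
greeting = hello + " " + world

# String + number does not, so convert the number with str() first
answer = "The answer is " + str(d)

print(greeting)
print(answer)
```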

What if we want to deal with a string with apostrophes and quotes, though? If the end of a string is marked by another quote, does that mean we can’t use quotes inside it at all? Not quite: there are actually two solutions to this problem.

Add another variable, call it text. Print it.

text = "what we are doing here is \"programming\""
print text

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
Hello World!
what we are doing here is "programming"

What we just did is called escaping the character: putting a backslash in front of the quote tells Python to treat it as part of the string. There are other escape sequences that serve other purposes in Python. Two of the most used ones are tab (\t) and newline (\n). They… they do what you think they would do. Let’s use them in the string, to add some dramatic effect to our text.

text = "what \t we are doing here is\n \"programming\""
print text

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
Hello World!
what 	 we are doing here is
 "programming"

Crazysauce. Now for the second way of getting around the quote problem. There are actually two different ways of quoting a string in Python. You can use single quotes (') or double quotes ("). So with our original text, we could just do this.

text = 'what we are doing here is "programming"'
print text

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
Hello World!
what we are doing here is "programming"

We can do vice versa too.

text = "what we are doing here is 'programming'"
print text

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
Hello World!
what we are doing here is 'programming'

We can do a whole bunch of things with strings, like splitting them up by a certain character or matching parts of them with these crazy things called regular expressions, but I just wanted to get your feet wet with them.
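Just to give you a taste, here is a rough sketch of both ideas using our text variable. The names words and quoted are mine, and the regular expression is only one simple pattern I picked for illustration (it looks for a run of letters between double quotes). print(...) is written with parentheses so this runs on Python 2 and Python 3:

```python
import re

text = 'what we are doing here is "programming"'

# Split the string into pieces wherever there is a space
words = text.split(" ")

# Find every word that sits between double quotes
quoted = re.findall(r'"(\w+)"', text)

print(words)
print(quoted)
```

Here split() hands back a list of seven pieces, and findall() hands back a list containing just programming.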

More info on string methods can be found in the Python documentation.

Arrays

Now we get into the very fun things. We are going to start with two data structures. Data structures are elements of programming that contain data. The first is the array, which has been a standard feature of most programming languages for longer than anyone really cares to remember. In Python, an array is called a list. I’ll probably use the two terms interchangeably. List appeals to a more intuitive sort of sensibility: a list is just an ordered sequence of things.

So let’s try creating a list. First, delete all the print statements so we don’t have all the stuff printing from before.

Now, let’s add a variable called mylist that is just a list of numbers 42 through 46.

mylist = [42, 43, 44, 45, 46]

Now, you can access each element of a list with a subscript, which is a number that indicates where in the list the element is. Like in most programming languages, indexes start from 0.

mylist = [42, 43, 44, 45, 46]
print mylist[0]
print mylist[4]

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
42
46

There are also different ways to subscript the list. You can actually subscript the list with negative numbers to go backwards. Conceptually, this is what it looks like:

 +---+---+---+---+---+
 |42 |43 |44 |45 |46 |
 +---+---+---+---+---+
 0   1   2   3   4   5
-5  -4  -3  -2  -1

(this is very much lifted from the tutorial above)

So try this:

mylist = [42, 43, 44, 45, 46]
print mylist[-1]

And it should print 46.

You can also take what are called slices of the array. Try this.

print mylist[0:2]

What does it print out? You can also leave out the first or the last number of the slice, which bounds the slice on one end only. So try this.

print mylist[:2]
print mylist[2:]

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
[42, 43]
[44, 45, 46]
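A few more slices, as a sketch, to show how the two numbers work: the slice starts at the first number and stops just before the second one, and negative numbers count from the end just like they do for single subscripts. The names middle, last_two and whole are ones I made up, and print(...) is written with parentheses so it runs on Python 2 and Python 3:

```python
mylist = [42, 43, 44, 45, 46]

# Starts at position 1 and stops just before position 4
middle = mylist[1:4]

# Negative numbers work in slices too: the last two items
last_two = mylist[-2:]

# Leaving out both numbers gives a copy of the whole list
whole = mylist[:]

print(middle)    # [43, 44, 45]
print(last_two)  # [45, 46]
print(whole)     # [42, 43, 44, 45, 46]
```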

Lists are great. Taken alone, they are useful. But combined with loops, they are incredibly powerful. We’ll get to that in the last section.

Dictionaries

Dictionaries are like lists, but instead of being indexed by numbers, they are indexed by keys, and they are unordered. Dictionaries — called hash tables in some languages like Java and associative arrays in PHP — are useful for indexing values by particular keys. The key-value idiom is an important one and is at the heart of Hadoop MapReduce. In any case, let’s look at some very basic dictionaries here.

Let’s start with a variable called mydict. You’ll see that the syntax of dictionaries is similar to lists, but differs slightly. Add this dictionary, which neatly stores surnames with first names for our illustrious faculty. Surnames are the key, while first names are the value.

mydict = {'shah': 'dhavah', 'wells': 'chris', 'kim': 'young mie'} 

Now, what was the first name of that Wells fellow?

print mydict['wells']

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
chris

We can also change individual elements.

mydict['shah'] = 'chirag'
print mydict['shah']

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
chirag
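Assigning to a key that isn't in the dictionary yet adds a new entry, and you can check whether a key exists with in. A minimal sketch (the 'doe': 'jane' entry is a made-up placeholder, not a real faculty member, and the variable names are mine; print(...) uses parentheses so it runs on Python 2 and Python 3):

```python
mydict = {'shah': 'dhavah', 'wells': 'chris', 'kim': 'young mie'}

# Assigning to a new key adds an entry (hypothetical placeholder name)
mydict['doe'] = 'jane'

# "in" checks whether a key is present
has_wells = 'wells' in mydict
has_smith = 'smith' in mydict

# len() counts the key-value pairs
count = len(mydict)

print(has_wells)  # True
print(has_smith)  # False
print(count)      # 4
```

Checking with in first is handy because asking for a key that isn't there, like mydict['smith'], stops the program with a KeyError.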

Loops

Lists and dictionaries by themselves are nice, but we usually want to run a similar operation on all the items in each sequence. This is where loops come in. Python is very nice because, unlike programming languages like C, we can tell it to run an operation on all elements of a sequence.

There are two types of loops, for and while loops. Today we’re just going to focus on the first kind, because it is the one most easily used with lists.

Let’s start with lists. Using the list from earlier, let’s get the sum of all the numbers in that list.

sum = 0
for i in mylist:
	sum = sum + i

print sum

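Notice that for anything to be executed in the loop, it has to be indented under the for line, and every line in the loop has to be at the same indent level. This is what is known as being in the loop’s body. The print statement is not indented, so it runs only once, after the loop has finished.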

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py 
220
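To make the indentation point concrete, here is a small sketch with two lines inside the loop and two outside it. I've used the names total and steps (both mine) rather than sum, and print(...) with parentheses so it runs on Python 2 and Python 3:

```python
mylist = [42, 43, 44, 45, 46]

total = 0
steps = 0
for i in mylist:
    total = total + i   # indented: runs once for every item
    steps = steps + 1   # same indent level: also runs once for every item

# Not indented: these run once, after the loop is done
print(total)  # 220
print(steps)  # 5
```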

From here, it’s not hard to see how you could do things like generating averages. For this we need a function that tells us how many items are in the list. Change the last line to this:

print sum / len(mylist)

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
44
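The average came out as a whole number here because the sum divides evenly. When it doesn't, dividing one integer by another in the Python 2 we are running throws away the remainder. A sketch of the difference, using a made-up list called scores (all the names here are mine); it is written so that it behaves the same on Python 2 and Python 3:

```python
scores = [42, 43, 45]

total = 0
for s in scores:
    total = total + s

# Whole-number division: the remainder is dropped
whole = total // len(scores)

# Converting to a non-integer number first keeps the fraction
exact = float(total) / len(scores)

print(whole)
print(exact)
```

Here whole is 43, while exact is 43.33 and change.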

Now turning to dictionaries, it’s a little different because we have both the key and the value to deal with. The easiest way to do this is to use the iteritems() method, which comes with the dictionary data structure.

Let’s try to fuse the names that we got from earlier.

for k,v in mydict.iteritems():
	print v + " " + k

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
dhavah shah
young mie kim
chris wells

Notice how these were output in no particular order? That’s because dictionaries are unordered. If you wanted to output these in some kind of order, you could sort by another list, or you could sort by key.

for key in sorted(mydict.keys()):
	print mydict[key] + " " + key

me@blogclub:~/sandbox/tworkshop/bin$ python lab3.py
young mie kim
dhavah shah
chris wells
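One caveat if you ever run this somewhere other than our server: iteritems() only exists in Python 2. The items() method works in both Python 2 and Python 3, so here is a sketch of the same sorted loop written that way. The names lines, key and value are mine, and print(...) uses parentheses for the same reason:

```python
mydict = {'shah': 'dhavah', 'wells': 'chris', 'kim': 'young mie'}

lines = []
# items() gives (key, value) pairs; sorted() puts them in key order
for key, value in sorted(mydict.items()):
    lines.append(value + " " + key)

for line in lines:
    print(line)
```

This should print the same three names in the same order as the sorted loop above.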

The above looping strategies are the most basic for both the list and the dictionary. There’s a lot more you can do. For more info on data structures, you can go to the Python documentation.