Week 3


Review of Homework 2

Getting Started with Python (Strings): Slides from Object-Oriented Programming in Python (Goldwasser and Letscher)

Installing NLTK (make sure your virtual environment is active):

$ pip install nltk

Open the Python interpreter:

$ python

From inside the Python interpreter:

>>> import nltk
>>> nltk.download() # first time on this machine

This should open a new window with a list of corpora to download. They don’t take up much space, so feel free to download them all. Then close the window and exit the interpreter:

>>> exit()

Learning Objectives

(color key: Python/Programming, NLP/CL, Software Engineering)


Control Flow

The Python tutorial has a good and concise explanation of Python’s basic control-flow mechanisms:


This section on defining functions extends what we talked about in class in Week 2:
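As a quick illustration of how these pieces fit together, here is a minimal sketch that combines if/elif/else, a for loop, and a while loop with a function definition. The function name classify_token and the sample tokens are made up for this example; they are not from the tutorial.

```python
def classify_token(token):
    """Return a coarse label for a token (hypothetical helper for illustration)."""
    if token.isalpha():
        return "word"
    elif token.isdigit():
        return "number"
    else:
        return "other"


tokens = ["Call", "me", "Ishmael", ".", "1851"]

# for loop: visit each token in order and collect its label
labels = []
for token in tokens:
    labels.append(classify_token(token))

# while loop: advance until the first token that is not a word
i = 0
while i < len(tokens) and classify_token(tokens[i]) == "word":
    i += 1

print(labels)  # ['word', 'word', 'word', 'other', 'number']
print(i)       # 3, the index of '.'
```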

String Methods

For now we will cover a subset of the available string methods:
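As a rough sketch, here are a few commonly used string methods applied to a sample string. This selection is an assumption for illustration and may not match the exact subset in the reading. Note that strings are immutable, so each method returns a new value and leaves the original unchanged.

```python
line = "  Call me Ishmael.  "

stripped = line.strip()                          # remove leading/trailing whitespace
lowered = stripped.lower()                       # lowercase copy
words = stripped.split()                         # split on whitespace into a list
joined = "-".join(words)                         # join list items with a separator
replaced = stripped.replace("Ishmael", "Ahab")   # substitute a substring
starts = stripped.startswith("Call")             # test the beginning of the string
position = stripped.find("me")                   # index of first match, or -1
count_l = stripped.count("l")                    # number of non-overlapping matches

print(stripped)   # Call me Ishmael.
print(lowered)    # call me ishmael.
print(words)      # ['Call', 'me', 'Ishmael.']
print(joined)     # Call-me-Ishmael.
print(replaced)   # Call me Ahab.
print(starts, position, count_l)  # True 5 3
```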

List Methods and Other Uses

Lists also have a number of useful methods and other uses:
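A small sketch of some common list methods and operations follows. Again, the selection is an assumption for illustration rather than the exact list from the reading. Unlike strings, lists are mutable, so methods such as append and insert change the list in place.

```python
tokens = ["the", "whale"]

tokens.append("swam")             # add one element at the end
tokens.extend(["far", "away"])    # add each element of another list
tokens.insert(1, "white")         # insert before index 1
# tokens is now ['the', 'white', 'whale', 'swam', 'far', 'away']

last = tokens.pop()               # remove and return the last element
idx = tokens.index("whale")       # position of the first matching element
first_two = tokens[:2]            # slicing returns a new list
length = len(tokens)              # number of elements
ordered = sorted(tokens)          # new sorted list; tokens is unchanged

print(last)       # away
print(idx)        # 2
print(first_two)  # ['the', 'white']
print(length)     # 5
print(ordered)    # ['far', 'swam', 'the', 'whale', 'white']
```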

The in Operator

Many kinds of “containers” in Python (including strings, lists, sets, and other structures) work with Python’s in operator. For most containers, an x in y operation returns True if x is one of the elements contained in y. For strings, it returns True if x is a substring of y, that is, a contiguous run of y’s characters:

>>> my_list = [1, 2, 3]
>>> 2 in my_list           # check for an individual element
>>> [1, 2] in my_list      # False: [1, 2] is not itself an element of my_list
>>> [1, 2] in [[1, 2], 3]  # True: this list actually has [1, 2] as an element
>>> my_str = '123'
>>> '2' in my_str          # a single character is just a string with one character
>>> '12' in my_str         # `in` with strings checks for substrings
>>> '12' in '1 2 3'        # substrings must match exactly (spaces count)

Question: How might you check for the presence of subsequences in lists? (hint: consider the methods or operations in the reading above)


Finally, read this section of the NLTK book, but only the part about “stopwords” (a few sentences and code blocks):
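To give a feel for what stopword filtering does, here is a minimal sketch that uses a tiny hand-written stopword set as a stand-in. The set and the sample sentence are assumptions for illustration; NLTK’s English stopword list is much longer and is available through nltk.corpus.stopwords once the ‘stopwords’ corpus is downloaded.

```python
# Tiny stand-in stopword set (an assumption for illustration only).
# With NLTK and its 'stopwords' corpus installed you could instead use:
#   from nltk.corpus import stopwords
#   STOPWORDS = set(stopwords.words('english'))
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "it"}

tokens = "The whale is the king of the sea".split()

content_tokens = []
for token in tokens:
    # compare in lowercase so that 'The' matches 'the'
    if token.lower() not in STOPWORDS:
        content_tokens.append(token)

print(content_tokens)  # ['whale', 'king', 'sea']
```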

Testing Your Knowledge

Make sure NLTK’s ‘gutenberg’ and ‘stopwords’ corpora are downloaded by importing nltk and running the following two download commands at the Python prompt (after >>>):

>>> import nltk
>>> nltk.download('gutenberg')
[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

You can list the texts available in the Gutenberg corpus like this:

>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Then get NLTK’s “raw” (string) version of one of these as follows (here I get “Moby Dick”):

>>> moby_dick = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

Now moby_dick is a big string containing the entire book. Use this string and the string methods in your reading to answer the following questions:

Also use list or set comprehensions to filter tokens to answer the following questions:
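As a reminder of the syntax, here is a short sketch of list and set comprehensions that transform and filter tokens. The sample sentence is made up for illustration and is not a quotation from the book.

```python
text = "The whale, the White Whale, swam on."
tokens = text.split()

# list comprehension: transform every token (order and duplicates are kept)
cleaned = [t.strip(".,").lower() for t in tokens]

# list comprehension with a condition: keep only tokens longer than 3 characters
long_words = [t for t in cleaned if len(t) > 3]

# set comprehension: same syntax with braces; duplicates collapse
vocab = {t for t in cleaned}

print(cleaned)     # ['the', 'whale', 'the', 'white', 'whale', 'swam', 'on']
print(long_words)  # ['whale', 'white', 'whale', 'swam']
print(len(vocab))  # 5
```

A list is the better choice when order or counts matter; a set is handy when you only care which distinct tokens occur.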

Word frequencies

Now that we understand what corpora are like, we can try to understand them in more detail. For example, we can look at frequency distributions (NLTK 1.3) to identify words that may be particularly informative about a text. Let’s use NLTK’s FreqDist class to find the 50 most frequent words of Moby Dick:

>>> fdist1 = nltk.FreqDist(moby_dick.split())
>>> print(fdist1)
<FreqDist with 33265 samples and 212030 outcomes>
>>> fdist1.most_common(50)
>>> fdist1['whale']

How informative are these words?
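FreqDist behaves much like Python’s collections.Counter (in NLTK 3 it is a subclass of it), so you can experiment with the same operations on a toy example without loading a corpus. The sentence below is made up for illustration; it is a sketch of the counting behavior, not a result about Moby Dick.

```python
from collections import Counter

text = "the whale saw the ship and the ship saw the sea"
counts = Counter(text.split())

top = counts.most_common(1)      # list of (token, count) pairs, highest count first
ship_count = counts["ship"]      # look up a single token
missing = counts["harpoon"]      # unseen tokens count as 0 instead of raising KeyError
total = sum(counts.values())     # total number of tokens counted
distinct = len(counts)           # number of distinct tokens

print(top)              # [('the', 4)]
print(ship_count)       # 2
print(missing)          # 0
print(distinct, total)  # 6 11
```

In the print(fdist1) output above, “samples” corresponds to the number of distinct tokens and “outcomes” to the total number of tokens counted.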

On your own

Explore these outcomes following the text linked above. Why do our results differ from the book?