Week 3

Lecture

Review of Homework 2

Getting Started with Python (Strings): Slides from Object-Oriented Programming in Python (Goldwasser and Letscher)

Learning Objectives

(color key: Python/Programming NLP/CL Software Engineering)

Reading

Control Flow

The Python tutorial has a good and concise explanation of Python’s basic control-flow mechanisms:
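As a quick illustration of those mechanisms, here is a minimal sketch that combines `for`, `if`/`elif`/`else`, and `while`. The token list is made up for this example and does not come from the reading.

```python
# Toy token list (made up for illustration, not taken from a corpus).
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]

# for + if/elif/else: sort each token into one of three buckets.
capitalized = []
punctuation = []
other = []
for token in tokens:
    if token[0].isupper():
        capitalized.append(token)
    elif not token.isalnum():
        punctuation.append(token)
    else:
        other.append(token)

# while: keep looping as long as the condition holds.
# Here we advance i until we reach the first non-alphanumeric token.
i = 0
while i < len(tokens) and tokens[i].isalnum():
    i += 1

print(capitalized)   # ['Call', 'Ishmael', 'Some']
print(punctuation)   # ['.']
print(other)         # ['me', 'years', 'ago']
print(i)             # 3
```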

Functions

This section on defining functions extends what we talked about in class in Week 2:

String Methods

For now we will cover a subset of the available string methods:
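The sketch below tries out a handful of commonly used string methods. Which methods the course actually covers is an assumption here, and the example sentence is made up. Note that strings are immutable, so each method returns a new value and leaves `line` unchanged.

```python
line = "  Call me Ishmael. Some years ago...  "

stripped = line.strip()                  # remove leading/trailing whitespace
lowered = stripped.lower()               # lowercase copy
words = stripped.split()                 # split on runs of whitespace
n_me = lowered.count("me")               # counts substrings, so "some" matches too
starts = stripped.startswith("Call")     # prefix test
position = stripped.find("Ishmael")      # index of first match, or -1 if absent
replaced = stripped.replace("Ishmael", "Queequeg")
joined = "_".join(words[:3])             # join a list of strings with a separator

print(words)      # ['Call', 'me', 'Ishmael.', 'Some', 'years', 'ago...']
print(n_me)       # 2
print(position)   # 8
print(joined)     # Call_me_Ishmael.
```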

List Methods and Other Uses

Lists also have a number of useful methods and other uses:
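A minimal sketch of some common list methods and operations follows. The selection is an assumption about what is most useful here, not a summary of the reading. Unlike strings, lists are mutable, so methods such as `append` and `sort` change the list in place.

```python
words = ["whale", "sea", "ship"]

words.append("captain")           # add one element at the end
words.extend(["sea", "boat"])     # add every element of another list
n_sea = words.count("sea")        # how many times "sea" occurs
first_sea = words.index("sea")    # position of the first "sea"

words.sort()                      # sorts in place and returns None
last = words.pop()                # removes and returns the last element
first_two = words[:2]             # slicing returns a new list

print(n_sea)       # 2
print(first_sea)   # 1
print(last)        # whale
print(words)       # ['boat', 'captain', 'sea', 'sea', 'ship']
print(first_two)   # ['boat', 'captain']
```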

The in Operator

Many kinds of “containers” in Python (which include strings, lists, sets, and other structures) work with Python’s in operator. For most containers, an x in y operation returns True if x is one of the elements contained in y. Strings are the exception: x in y returns True if x is a contiguous subsequence of the characters of y (that is, a substring):

>>> my_list = [1, 2, 3]
>>> 2 in my_list           # check for an individual element
True
>>> [1, 2] in my_list      # a sub-list is not itself an element of the list
False
>>> [1, 2] in [[1, 2], 3]  # unless the list actually has [1, 2] as an element
True
>>> my_str = '123'
>>> '2' in my_str          # a single character is just a string with one character
True
>>> '12' in my_str         # `in` with strings checks for substrings
True
>>> '12' in '1 2 3'        # subsequences must be exact (spaces count)
False
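The same operator works with sets and dictionaries. The sketch below uses made-up data to show that a dictionary checks its keys rather than its values, and that `not in` gives the negated test.

```python
vocab_list = ["whale", "sea", "ship"]
vocab_set = set(vocab_list)
counts = {"whale": 3, "sea": 1}

in_list = "whale" in vocab_list          # True: element of the list
in_set = "whale" in vocab_set            # True: element of the set
key_in_dict = "whale" in counts          # True: dictionaries check their keys
value_in_dict = 3 in counts              # False: 3 is a value, not a key
value_in_values = 3 in counts.values()   # True: look in the values explicitly
absent = "boat" not in vocab_set         # True: `not in` negates the test

print(in_list, in_set, key_in_dict, value_in_dict, value_in_values, absent)
```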

Question: How might you check for the presence of subsequences in lists? (hint: consider the methods or operations in the reading above)

Stopwords

Finally, also read this section of the NLTK book, but just the part about “stopwords” (just a few sentences and code blocks):
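To make the idea concrete, here is a minimal sketch of stopword filtering. It uses a tiny hand-made stopword set as a stand-in so that it runs without any corpus data; the comment shows how NLTK's English list could be swapped in. The sentence and the stand-in set are made up for this example.

```python
# Tiny hand-made stand-in so this runs without NLTK data (an assumption
# for illustration, far smaller than a real stopword list).
# With NLTK installed and the 'stopwords' corpus downloaded, one could use:
#     from nltk.corpus import stopwords
#     stop_words = set(stopwords.words('english'))
stop_words = {"the", "of", "and", "a", "in", "to", "is"}

tokens = "The whale is the king of the sea".split()

# Compare lowercased tokens, since the stopword set is all lowercase.
content_tokens = [t for t in tokens if t.lower() not in stop_words]

print(content_tokens)   # ['whale', 'king', 'sea']
```

Using a set rather than a list for `stop_words` keeps each `not in` test fast, which matters once the token list is as long as a whole book.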

Testing Your Knowledge

Make sure you have NLTK’s ‘gutenberg’ and ‘stopwords’ corpora downloaded by importing nltk and running the two nltk.download() commands below in Python (at the >>> prompt):

>>> import nltk
>>> nltk.download('gutenberg')
[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
True
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

You can list the texts available in the Gutenberg corpus like this:

>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Then get NLTK’s “raw” (string) version of one of these texts as follows (here I load “Moby Dick”):

>>> moby_dick = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

Now moby_dick is a big string containing the entire book. Use this string and the string methods in your reading to answer the following questions:

Also use list or set comprehensions to filter tokens to answer the following questions:
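As a reminder of the syntax, the sketch below shows a list comprehension, a set comprehension, and a comprehension that both transforms and filters. The short text is a stand-in for the full book string, and the filter conditions are arbitrary examples.

```python
# Short stand-in text; in the exercise this would be the full book string.
text = "Call me Ishmael. Some years ago, never mind how long precisely."
tokens = text.split()

# List comprehension: keeps order and duplicates.
long_tokens = [t for t in tokens if len(t) > 5]

# Set comprehension: keeps only unique values, in no particular order.
capitalized = {t for t in tokens if t[0].isupper()}

# Transform and filter at once: strip punctuation, lowercase, keep alphabetic.
cleaned = [t.strip(".,").lower() for t in tokens if t.strip(".,").isalpha()]

print(long_tokens)          # ['Ishmael.', 'precisely.']
print(sorted(capitalized))  # ['Call', 'Ishmael.', 'Some']
print(len(cleaned))         # 11
```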

Word frequencies

Now that we understand what corpora are like, we can try to understand them in more detail. For example, we can look at frequency distributions (NLTK 1.3) to identify words that may be particularly informative about a text. Let’s use NLTK’s FreqDist class to find the 50 most frequent words of Moby Dick:

>>> fdist1 = nltk.FreqDist(moby_dick.split())
>>> print(fdist1)
<FreqDist with 33265 samples and 212030 outcomes>
>>> fdist1.most_common(50)
...
>>> fdist1['whale']
392
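To see what these calls do on a scale small enough to check by hand, here is a sketch using `collections.Counter` from the standard library as a stand-in. In NLTK 3, `FreqDist` is built on `Counter`, so `most_common()` and indexing by word behave the same way. The toy sentence is made up.

```python
from collections import Counter

# Toy text (made up); whitespace splitting mirrors moby_dick.split().
text = "the whale and the sea and the ship"
counts = Counter(text.split())

top_two = counts.most_common(2)     # (word, count) pairs, most frequent first
n_whale = counts["whale"]           # count for a single word
n_missing = counts["harpoon"]       # unseen words count as 0, no KeyError
n_samples = len(counts)             # distinct words ("samples")
n_outcomes = sum(counts.values())   # total tokens ("outcomes")

print(top_two)                 # [('the', 3), ('and', 2)]
print(n_whale, n_missing)      # 1 0
print(n_samples, n_outcomes)   # 5 8
```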

How informative are these words?

On your own

Explore these outcomes following the text linked above. Why do our results differ from the book?