Lecture
Review of Homework 2
Getting Started with Python (Strings): Slides from Object-Oriented Programming in Python (Goldwasser and Letscher)
Installing NLTK (make sure your virtual environment is active):
$ pip install nltk
Open the Python interpreter:
$ python
From inside the Python interpreter:
>>> import nltk
>>> nltk.download() # first time on this machine
This should open a new window with a list of corpora to download. They don’t take up much space so feel free to download them all. Then you can close the window and exit the interpreter:
>>> exit()
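If you prefer to skip the GUI, the downloader can also be run non-interactively from your shell. For example, this should fetch just the two corpora used later on this page:
$ python -m nltk.downloader gutenberg stopwords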
Learning Objectives
- Data types: str, list, set
- Concepts: comparisons, conditionals, loops, comprehensions, functions, filtering, stopwords, efficiency
- Tools: NLTK
Reading
Control Flow
The Python tutorial has a good and concise explanation of Python’s basic control-flow mechanisms:
Functions
This section on defining functions extends what we talked about in class in Week 2:
String Methods
For now we will cover a subset of the available string methods:
- str.startswith()
- str.endswith()
- str.isalpha()
- str.isdigit()
- str.split()
- str.splitlines()
- str.join()
- str.lower()
- str.replace()
- str.strip()
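As a quick illustration, here is how a few of these behave (a sketch with a made-up string; try the rest yourself in the interpreter):
>>> s = '  Moby Dick; or, The Whale  '
>>> s.strip()                      # remove surrounding whitespace
'Moby Dick; or, The Whale'
>>> s.lower()                      # case-normalize
'  moby dick; or, the whale  '
>>> s.strip().startswith('Moby')   # methods can be chained
True
>>> s.split()                      # whitespace-delimited tokens
['Moby', 'Dick;', 'or,', 'The', 'Whale']
>>> '-'.join(['a', 'b', 'c'])      # join is a method of the separator string
'a-b-c'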
List Methods and Other Uses
Lists also have a number of useful methods and other uses:
- 5.1 – More on Lists
- 5.1.1 – Using Lists as Stacks
- 5.1.3 – List Comprehensions
- 5.2 – The `del` statement
The `in` Operator
Many kinds of “containers” in Python (which include strings, lists, sets, and other structures) work with Python’s `in` operator. For most containers, an `x in y` operation returns `True` if `x` is one of the elements contained in `y`. For strings, it returns `True` if `x` is a substring of `y`:
>>> my_list = [1, 2, 3]
>>> 2 in my_list # check for an individual element
True
>>> [1, 2] in my_list # this does not work
False
>>> [1, 2] in [[1, 2], 3] # unless the list actually has [1, 2] as an element
True
>>> my_str = '123'
>>> '2' in my_str # a single character is just a string with one character
True
>>> '12' in my_str # `in` with strings checks for substrings
True
>>> '12' in '1 2 3' # substring matches must be exact and contiguous (spaces count)
False
Question: How might you check for the presence of subsequences in lists? (hint: consider the methods or operations in the reading above)
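One possible answer, sketched below (the helper name contains_sublist is my own, and this is not the only approach): compare the candidate against every same-length slice of the list.
>>> def contains_sublist(lst, sub):
...     n = len(sub)
...     # a slice of lst equals sub only if sub occurs there contiguously
...     return any(lst[i:i+n] == sub for i in range(len(lst) - n + 1))
...
>>> contains_sublist([1, 2, 3], [1, 2])
True
>>> contains_sublist([1, 2, 3], [1, 3])
False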
Stopwords
Finally, also read this section of the NLTK book, but just the part about “stopwords” (only a few sentences and code blocks):
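Once the corpus is downloaded (see below), you can peek at the list in the interpreter (a quick sketch; the exact contents may vary by NLTK version):
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')[:5]
['i', 'me', 'my', 'myself', 'we']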
Testing Your Knowledge
Ensure you have the NLTK’s ‘gutenberg’ and ‘stopwords’ corpora downloaded by importing `nltk` and running the following two commands in Python (after `>>>`):
>>> import nltk
>>> nltk.download('gutenberg')
[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
True
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
You can find the available corpora like this:
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
Then get the NLTK’s “raw” (string) version of one of these as follows (here I get “Moby Dick”):
>>> moby_dick = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
Now `moby_dick` is a big string containing the entire book. Use this string and the string methods in your reading to answer the following questions:
- How many lines are in the file?
- Is each line exactly one complete sentence?
- How many tokens are in the file?
- What is the average number of tokens per line?
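Here is a sketch of how you might start (variable names are my own; the outputs are left for you to discover, and whether str.split()’s notion of a token is the one you want is part of the exercise):
>>> lines = moby_dick.splitlines()   # one string per line of the file
>>> tokens = moby_dick.split()       # whitespace-separated tokens
>>> len(lines), len(tokens), len(tokens) / len(lines)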
Also use list or set comprehensions to filter tokens and answer the following questions:
- How many unique, case-normalized tokens are in the book?
- What proportion of the case-normalized tokens are, or are not, stopwords?
- What is the set of tokens in the book that begin with “whale”?
- What is the set of tokens in the book that begin with “whale” and are all alphabetic characters?
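A sketch for these, assuming the NLTK stopwords corpus from the reading (the names stops, vocab, etc. are my own, and whether “proportion” should count unique tokens or all tokens is for you to decide):
>>> from nltk.corpus import stopwords
>>> stops = set(stopwords.words('english'))          # a set makes membership tests fast
>>> vocab = {t.lower() for t in moby_dick.split()}   # unique, case-normalized tokens
>>> non_stop = {t for t in vocab if t not in stops}  # filter out stopwords
>>> len(non_stop) / len(vocab)                       # proportion that are not stopwords
>>> {t for t in vocab if t.startswith('whale')}
>>> {t for t in vocab if t.startswith('whale') and t.isalpha()}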
Word frequencies
Now that we understand what corpora are like, we can try to understand them in more detail. For example, we can look at frequency distributions (NLTK 1.3) to identify words that may be particularly informative about a text.
Let’s use NLTK’s `FreqDist` class to find the 50 most frequent words of Moby Dick:
>>> fdist1 = nltk.FreqDist(moby_dick.split())
>>> print(fdist1)
<FreqDist with 33265 samples and 212030 outcomes>
>>> fdist1.most_common(50)
...
>>> fdist1['whale']
392
How informative are these words?
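Most of the top entries are likely function words like “the” and “of”. One follow-up to try (a sketch reusing the stops set from the comprehension sketch above) is to recount over content words only:
>>> content = [t for t in moby_dick.lower().split() if t.isalpha() and t not in stops]
>>> nltk.FreqDist(content).most_common(20)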
On your own
Explore these outcomes following the text linked above. Why do our results differ from the book?
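As a hint (a sketch, not the full answer): the book builds its distribution from NLTK’s own tokenization, available via nltk.corpus.gutenberg.words(), while we used str.split(), which leaves punctuation attached to neighboring words.
>>> words = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
>>> fdist2 = nltk.FreqDist(words)
>>> fdist2['whale']   # compare with fdist1['whale'] above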