Lecture
Review of Homework 2
Getting Started with Python (Strings): Slides from Object-Oriented Programming in Python (Goldwasser and Letscher)
Installing NLTK (make sure your virtual environment is active):
$ pip install nltk
Open the Python interpreter:
$ python
From inside the Python interpreter:
>>> import nltk
>>> nltk.download() # first time on this machine
This should open a new window with a list of corpora to download. They don’t take up much space so feel free to download them all. Then you can close the window and exit the interpreter:
>>> exit()
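If you prefer to skip the GUI, the downloader can also be run non-interactively from your shell. For example, this should fetch just the two corpora used later on this page:
$ python -m nltk.downloader gutenberg stopwords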
Learning Objectives
- Data types: str, list, set
- Concepts: comparisons, conditionals, loops, comprehensions, functions, filtering, stopwords, efficiency
- Tools: NLTK
Reading
Control Flow
The Python tutorial has a good and concise explanation of Python’s basic control-flow mechanisms:
Functions
This section on defining functions extends what we talked about in class in Week 2:
String Methods
For now we will cover a subset of the available string methods:
- str.startswith()
- str.endswith()
- str.isalpha()
- str.isdigit()
- str.split()
- str.splitlines()
- str.join()
- str.lower()
- str.replace()
- str.strip()
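As a quick illustration, here is how a few of these behave (a sketch with a made-up string; try the rest yourself in the interpreter):
>>> s = '  Moby Dick; or, The Whale  '
>>> s.strip()                      # remove surrounding whitespace
'Moby Dick; or, The Whale'
>>> s.lower()                      # case-normalize
'  moby dick; or, the whale  '
>>> s.strip().startswith('Moby')   # methods can be chained
True
>>> s.split()                      # whitespace-delimited tokens
['Moby', 'Dick;', 'or,', 'The', 'Whale']
>>> '-'.join(['a', 'b', 'c'])      # join is a method of the separator string
'a-b-c'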
List Methods and Other Uses
Lists also have a number of useful methods and other uses:
- 5.1 – More on Lists
- 5.1.1 – Using Lists as Stacks
- 5.1.3 – List Comprehensions
- 5.2 – The `del` statement
The `in` Operator
Many kinds of “containers” in Python (which include strings, lists, sets, and other structures) work with Python’s `in` operator. For most containers, an `x in y` operation returns `True` if `x` is one of the elements contained in `y`. For strings, it returns `True` if `x` is a substring of `y`:
>>> my_list = [1, 2, 3]
>>> 2 in my_list # check for an individual element
True
>>> [1, 2] in my_list # this does not work
False
>>> [1, 2] in [[1, 2], 3] # unless the list actually has [1, 2] as an element
True
>>> my_str = '123'
>>> '2' in my_str # a single character is just a string with one character
True
>>> '12' in my_str # `in` with strings checks for substrings
True
>>> '12' in '1 2 3' # substring matches must be exact and contiguous (spaces count)
False
Question: How might you check for the presence of subsequences in lists? (hint: consider the methods or operations in the reading above)
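One possible answer, sketched below (the helper name contains_sublist is my own, and this is not the only approach): compare the candidate against every same-length slice of the list.
>>> def contains_sublist(lst, sub):
...     n = len(sub)
...     # a slice of lst equals sub only if sub occurs there contiguously
...     return any(lst[i:i+n] == sub for i in range(len(lst) - n + 1))
...
>>> contains_sublist([1, 2, 3], [1, 2])
True
>>> contains_sublist([1, 2, 3], [1, 3])
False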
Stopwords
Finally, also read this section of the NLTK book, but just the part about “stopwords” (only a few sentences and code blocks):
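Once the corpus is downloaded (see below), you can peek at the list in the interpreter (a quick sketch; the exact contents may vary by NLTK version):
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')[:5]
['i', 'me', 'my', 'myself', 'we']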
Testing Your Knowledge
Ensure you have the NLTK’s ‘gutenberg’ and ‘stopwords’ corpora downloaded by importing `nltk` and running the following two commands in Python (after `>>>`):
>>> import nltk
>>> nltk.download('gutenberg')
[nltk_data] Downloading package gutenberg to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
True
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/goodmami/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
You can find the available corpora like this:
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
Then get the NLTK’s “raw” (string) version of one of these as follows (here I get “Moby Dick”):
>>> moby_dick = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
Now `moby_dick` is a big string containing the entire book. Use this string and the string methods in your reading to answer the following questions:
- How many lines are in the file?
- Is each line exactly one complete sentence?
- How many tokens are in the file?
- What is the average number of tokens per line?
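Here is a sketch of how you might start (variable names are my own; the outputs are left for you to discover, and whether str.split()’s notion of a token is the one you want is part of the exercise):
>>> lines = moby_dick.splitlines()   # one string per line of the file
>>> tokens = moby_dick.split()       # whitespace-separated tokens
>>> len(lines), len(tokens), len(tokens) / len(lines)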
Also use list or set comprehensions to filter tokens and answer the following questions:
- How many unique, case-normalized tokens are in the book?
- What proportion of the case-normalized tokens are, or are not, stopwords?
- What is the set of tokens in the book that begin with “whale”?
- What is the set of tokens in the book that begin with “whale” and are all alphabetic characters?
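A sketch for these, assuming the NLTK stopwords corpus from the reading (the names stops, vocab, etc. are my own, and whether “proportion” should count unique tokens or all tokens is for you to decide):
>>> from nltk.corpus import stopwords
>>> stops = set(stopwords.words('english'))          # a set makes membership tests fast
>>> vocab = {t.lower() for t in moby_dick.split()}   # unique, case-normalized tokens
>>> non_stop = {t for t in vocab if t not in stops}  # filter out stopwords
>>> len(non_stop) / len(vocab)                       # proportion that are not stopwords
>>> {t for t in vocab if t.startswith('whale')}
>>> {t for t in vocab if t.startswith('whale') and t.isalpha()}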
Word frequencies
Now that we understand what corpora are like, we can try to understand them in more detail. For example, we can look at frequency distributions (NLTK 1.3) to identify words that may be particularly informative about a text.
Let’s use NLTK’s `FreqDist` class to find the 50 most frequent words of Moby Dick:
>>> fdist1 = nltk.FreqDist(moby_dick.split())
>>> print(fdist1)
<FreqDist with 33265 samples and 212030 outcomes>
>>> fdist1.most_common(50)
...
>>> fdist1['whale']
392
How informative are these words?
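Most of the top entries are likely function words like “the” and “of”. One follow-up to try (a sketch reusing the stops set from the comprehension sketch above) is to recount over content words only:
>>> content = [t for t in moby_dick.lower().split() if t.isalpha() and t not in stops]
>>> nltk.FreqDist(content).most_common(20)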
On your own
Explore these outcomes following the text linked above. Why do our results differ from the book?
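As a hint (a sketch, not the full answer): the book builds its distribution from NLTK’s own tokenization, available via nltk.corpus.gutenberg.words(), while we used str.split(), which leaves punctuation attached to neighboring words.
>>> words = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
>>> fdist2 = nltk.FreqDist(words)
>>> fdist2['whale']   # compare with fdist1['whale'] above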