Learning Objectives
- Concepts: higher-order functions recursive functions map() filter() n-grams collocations part-of-speech tags
(color key: Python/Programming NLP/CL Software Engineering)
In-class topics
- Project 1
- Project 2
- General questions
Reading
Higher-order and Recursive Functions
First-class Functions
By now you know how to define a function. For instance, to square a number, you might write:
def square(x):
return x ** 2
Note that this definition does two things: (1) it defines the code
that, when called, accepts the x
argument then computes and
returns the square of x
; and (2) it assigns this “function
object” to the name square
. When we invoke the name
square
with parentheses and the appropriate arguments, we
get a result:
>>> square(4)
16
And if you just type the function name without calling it in an interactive interpreter, Python will tell you it’s a function:
>>> square
<function square at 0x7ff9d780f790>
That is, the value of square
is the function object.
This object is a value just like more traditional values in Python, such
as integers and strings, and we can reassign the function to another
variable if we like:
>>> sq = square
>>> sq(4)
16
Python is said to have first-class functions, meaning that functions are not merely something to be called, but something that can be assigned, passed as arguments to other functions, and used as the return value of another function.
Higher-order Functions
Functions are said to be higher-order functions if they can accept other functions as arguments or if they return a function as their result.
Continuing our example from before, we may wish to generalize
square()
to a function that, given an exponent
n
, returns a function that raises x
to the
power of n
:
def power(n):
def raise_to(x):
return x ** n
return raise_to
Then we could create square
simply as follows:
>>> square = power(2)
>>> square(3)
9
>>> cube = power(3)
>>> cube(3)
27
You don’t even need to assign the created function to use it:
>>> power(3)(3)
27
In Python, map()
and filter()
are built-in functions that take a function as their first argument and
an iterable as their second argument. The map()
function
applies the function argument to each item in the iterable and yields
the result:
>>> list(map(square, range(10))) # apply square() to each value in range(10)
0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[>>> [square(x) for x in range(10)] # this does the same thing in a comprehension
0, 1, 4, 9, 16, 25, 36, 49, 64, 81] [
Note: list()
is necessary to get the full list;
otherwise map()
returns an iterator that yields one value
at a time.
The filter()
function yields values from the iterable if
the function argument applied to the value evaluates as true.
>>> list(filter(str.isnumeric, '1 two 3 four 5'.split()))
'1', '3', '5'] [
We can see how this works by using map()
to see the
True
/False
values, and noting that
str.isnumeric(x)
is the same as x.isnumeric()
when x
is a string:
>>> '1'.isnumeric() # the normal way to call `isnumeric()`
True
>>> str.isnumeric('1') # the functional way to call it
True
>>> list(map(str.isnumeric, '1 two 3 four 5'.split()))
True, False, True, False, True] [
To see how higher-order functions might help with NLP, please read this short section from the NLTK book:
Recursive Functions
You may know of recursive linguistic structures, such as grammatical
constructs which allow sentences like Kim knows that Sandy knows
that Pat knows that the dog barked. Similarly, many programming
languages allow recursive
functions which call themselves. For example, you may want to write
a function that computes the factorial of some integer
x
:
def factorial(x):
# e.g., if x is 7, this would return 7 * factorial(6),
# which returns 6 * factorial(5), etc.
return x * factorial(x - 1)
Let’s try it out:
>>> factorial(7)
Traceback (most recent call last):"<stdin>", line 1, in <module>
File "<stdin>", line 2, in factorial
File "<stdin>", line 2, in factorial
File "<stdin>", line 2, in factorial
File 996 more times]
[Previous line repeated RecursionError: maximum recursion depth exceeded
Oops. This did 7 * (6 * (5 * ...))
but it did not stop
at 1, so it continued: ... * (0 * (-1 * ...))
. Recursive
functions must have a base case which decides when it should
stop. For instance:
def factorial(x):
if x == 0:
return 1 # base case
else:
return x * factorial(x - 1)
And try again:
>>> factorial(7)
5040
>>> 7 * 6 * 5 * 4 * 3 * 2 * 1 # confirm
5040
For some problems, recursive functions are the natural way to compute
a result, but they have limits (as seen by the
RecursionError
above) and can lead to very inefficient
code. Recursive algorithms are often more efficient as an iterative loop
(for
, while
, etc.), but the iterative code can
be less intuitive.
N-grams
Bigrams, Trigrams, and more generally n-grams are subsequences of
some sequence with length 2, 3, or more. In computational linguistics we
often use n-grams for sequences of tokens, words, characters, etc. The
NLTK has the nltk.ngrams()
function which takes an iterable
and a number n
and yields subsequence tuples of length
n
from the iterable:
>>> import nltk
>>> four_words = 'many dogs chase cats'.split()
>>> list(nltk.ngrams(four_words, 1)) # unigrams
'many',), ('dogs',), ('chase',), ('cats',)]
[(>>> list(nltk.ngrams(four_words, 2)) # bigrams
'many', 'dogs'), ('dogs', 'chase'), ('chase', 'cats')]
[(>>> list(nltk.ngrams(four_words, 3)) # trigrams
'many', 'dogs', 'chase'), ('dogs', 'chase', 'cats')]
[(>>> list(nltk.ngrams(four_words, 4)) # 4-grams
'many', 'dogs', 'chase', 'cats')]
[(>>> list(nltk.ngrams(four_words, 5)) # 5-grams
[]
See also: Google’s Ngram Viewer
The items in the n-grams might be complex objects, such as n-grams of word-tag pairs:
>>> nltk.download('brown') # if you don't have it yet
>>> from nltk.corpus import brown
>>> tagged_sent = brown.tagged_sents()[0]
>>> tagged_sent[0:3] # inspect some items
'The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
[(>>> tagged_bigrams = list(nltk.ngrams(tagged_sent, 2))
>>> for bigram in tagged_bigrams[:2]: # inspect the first two bigrams
print(bigram) # note that each bigram is a word-tag pair
... 'The', 'AT'), ('Fulton', 'NP-TL'))
(('Fulton', 'NP-TL'), ('County', 'NN-TL')) ((
Collocations
N-grams of words that occur more often than you would expect in a random distribution are called collocations. Furthermore, these words may have some special, perhaps non-compositional meaning that wouldn’t be obvious by the individual words. By counting the occurrences of n-gram pairs and comparing them to the counts of the words individually, we can find those that tend to appear together. For example:
def collocations(words):
from operator import itemgetter
# Count the words and bigrams
= nltk.FreqDist(words)
wfd = nltk.FreqDist(nltk.bigrams(words)) # bigrams is the same as nltk.ngrams(words, 2)
pfd # Compute the scores for each pair
= [((w1, w2), score(w1, w2, wfd, pfd)) for w1, w2 in pfd]
scored ## sort according to the score
=itemgetter(1), reverse=True)
scored.sort(keyreturn [p for (p, s) in scored]
def score(word1, word2, wfd, pfd, power=3):
'''return the collocation score f(w1,w2)^power/(f(w1)*f(w2))'''
= wfd[word1]
freq1 = wfd[word2]
freq2 = pfd[(word1, word2)]
freq12 return freq12 ** power / float(freq1 * freq2)
And in use:
>>> from nltk.corpus import gutenberg
>>> words = [w for w in gutenberg.words('melville-moby_dick.txt') if len(w) > 2]
>>> print(collocations(words)[:10])
'Moby', 'Dick'), ('Sperm', 'Whale'), ('White', 'Whale'), ('Dough', 'Boy'),
[('Mrs', 'Hussey'), ('Sag', 'Harbor'), ('Father', 'Mapple'), ('New', 'Bedford'),
('Fast', 'Fish'), ('have', 'been')] (
For further information (not required reading), see:
Parts-of-Speech
I assume you are familiar with word categories, namely “parts of speech”. If not, please read or skim the Wikipedia article. Then please read the following sections of the NLTK book about part-of-speech tagging:
Testing Your Knowledge
Questions
Q: If you have a list of length
m
, how many bigrams will it have? Trigrams? n-grams in general?Q: How might you deal with the problem of getting an n-gram from a sequence where the length of the sequence is less than
n
?Q: Do all languages use the same parts of speech? Does a single language have a clear and complete set of parts-of-speech?
Q: What is a tagset? What are examples of tagsets?
Practical Work
Write a recursive function
fib(x)
to compute the Fibonacci number ofx
. Callingfib(1)
should return 1,fib(5)
returns 5,fib(8)
returns 21, and so on. What happens if you callfib(20)
,fib(30)
, orfib(40)
? (use Ctrl-C to kill a stuck process). Now try rewriting the recursive function as an iterative function. Can you go higher thanfib(40)
?Write a function
bigrams(xs)
that takes a listxs
and returns a list of bigrams fromxs
.Write a function that takes a string of
word/TAG
formatted pairs and returns the list of (word, tag) tuples (hint: usenltk.str2tuple()
)Write a function that uses
nltk.map_tag()
to map a list of (word, tag) pairs to the universal tagset.