Learning Objectives
- Concepts: regular expressions, re, stemming, lemmatization, segmentation
Algorithmic Thinking
- What is the task? What is the minimum I need to do to complete the task?
Read the instructions! Laziness is a virtue!
- What are the tools and components I have at my disposal? What format or structure are they in? How can I access the information I need?
Know your basics well! list, dict, for, in, etc.
- What is the first step? How can I make sure it is working?
Start simply! Introduce complexity slowly!
- Develop using a clear feedback system.
- Address errors as they arise. Google errors if you don’t understand them.
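As one possible illustration of this workflow, the sketch below builds a small word counter in two steps and checks each step with assert before adding more. The function names (count_words, count_words_normalized) and the sample sentence are hypothetical and chosen only for illustration.

```python
# Hypothetical example: names and data are illustrative only.

def count_words(text):
    """Step 1: the simplest version, splitting on whitespace."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# Feedback: check the simple version before adding anything else.
assert count_words("the cat saw the dog") == {"the": 2, "cat": 1, "saw": 1, "dog": 1}

def count_words_normalized(text):
    """Step 2: add one piece of complexity (lowercasing) and re-check."""
    return count_words(text.lower())

# "The" and "the" should now be counted together.
assert count_words_normalized("The cat saw the dog") == count_words("the cat saw the dog")
print(count_words_normalized("The cat saw the dog"))
```

If an assert fails, Python raises an AssertionError at that line, which points directly at the step that broke.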
Reading
Please see the slides from M. W. Goodman’s 2019 workshop on regular expressions for a quick introduction:
For applications of regular expressions to NLP, please read these sections from the NLTK book:
- NLTK 3.4 – Regular Expressions for Detecting Word Patterns
- NLTK 3.5 – Useful Applications of Regular Expressions
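As a rough preview of what detecting word patterns looks like with Python's re module, here is a minimal sketch. The short word list is made up for illustration; the NLTK book works with a much larger corpus.

```python
import re

# Hypothetical mini word list, for illustration only.
words = ["walked", "walking", "red", "bed", "jumped", "abed", "play"]

# $ anchors the pattern at the end of the string.
ends_in_ed = [w for w in words if re.search(r"ed$", w)]

# ^ anchors at the start; . matches any single character.
# Together: exactly three characters, the last two being "ed".
three_letters = [w for w in words if re.search(r"^.ed$", w)]

print(ends_in_ed)     # ['walked', 'red', 'bed', 'jumped', 'abed']
print(three_letters)  # ['red', 'bed']
```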
Also, we’ve discussed tokenization and basic normalization already, but now see the following to better understand stemming, lemmatization, and segmentation.
- NLTK 3.6 – Normalizing Text
- NLTK 3.7 – Regular Expressions for Tokenizing Text
- NLTK 3.8 – Segmentation
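To make the three ideas concrete before reading, the following is a toy sketch using only the standard library. The suffix list, the lemma table, and the function names (toy_stem, toy_lemmatize) are assumptions for illustration; they are not how NLTK's stemmers or lemmatizers are implemented.

```python
import re

text = "The dogs weren't running in 2019."

# Tokenization: a run of word characters (optionally with an internal
# apostrophe), or else any single character that is neither word nor space.
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(tokens)  # ['The', 'dogs', "weren't", 'running', 'in', '2019', '.']

def toy_stem(word):
    """Strip one common suffix; the result need not be a real word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemma table: lemmatization maps a form to a dictionary headword.
TOY_LEMMAS = {"dogs": "dog", "running": "run", "flies": "fly"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)

print(toy_stem("running"), toy_lemmatize("running"))  # runn run

# Sentence segmentation: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", "It rained. We stayed in! Did you?")
print(sentences)  # ['It rained.', 'We stayed in!', 'Did you?']
```

The contrast to notice is that the toy stemmer produces runn, which is not a word, while the lookup returns the headword run. The naive sentence splitter would also break on abbreviations such as "Dr. Smith", which is one reason segmentation is harder than it looks.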
Additional Reading
These links may be helpful, but are not assigned reading:
- Regular Expression HOWTO (Python documentation, by A.M. Kuchling)
- Python Regular Expressions (Google for Education)
- regex101 (Useful web-app for constructing and inspecting regular expressions)
Testing Your Knowledge
Questions
- Q: What are regex metacharacters?
- Q: How is stemming different from lemmatization?
- Q: What is a kind of segmentation that is not tokenization/word-segmentation?
Practical Work
- Write regular expressions to match the following classes of strings:
  - A single determiner (assume that a, an, and the are the only determiners)
  - An arithmetic expression using integers, addition, and multiplication, such as 2*3+8
  - Phone numbers (e.g., +65 8012 3456)
- Create a function plural() that takes an English word and returns its plural form. Test it on dog, apple, fly, boy, woman.
- Find all verb-particle constructions (phrasal verbs such as give up, look out) in WordNet.
  - Try to expand them to different inflectional forms: give up, giving up, gave up, given up
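For the regular expression exercises, one way to get quick feedback is a small harness that runs a pattern against strings it should and should not match. The sketch below is only a suggestion: check_pattern is a hypothetical helper, and the 24-hour time pattern is a stand-in example rather than a solution to any of the exercises.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return the test strings on which the pattern gives the wrong answer."""
    regex = re.compile(pattern)
    failures = []
    for s in should_match:
        # fullmatch requires the whole string to match, not just a prefix.
        if regex.fullmatch(s) is None:
            failures.append(s)
    for s in should_not_match:
        if regex.fullmatch(s) is not None:
            failures.append(s)
    return failures

# Stand-in pattern: 24-hour times such as 09:30 (not one of the exercises).
TIME = r"(?:[01]\d|2[0-3]):[0-5]\d"

print(check_pattern(TIME, ["00:00", "09:30", "23:59"], ["24:00", "9:30", "12:60"]))  # []
```

An empty list means every test string behaved as expected. Adding a few negative examples is worthwhile, since a pattern that is too permissive will pass every positive test.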