HG2051 – Week 11

Learning Objectives

Concepts: machine learning supervised learning unsupervised learning classification evaluation decision trees entropy linguistic features

(color key: Python/Programming NLP/CL Software Engineering)

Reading

NLTK 6 – Learning to Classify Text

This week covers an introduction to machine learning. Last week we looked at part-of-speech taggers that were trained by looking at examples of gold-standard pos-tagged words, and these were a kind of very basic supervised classification based on statistical inference. Machine learning also uses statistical inference, but often it builds multi-dimensional models using a variety of features of data instances instead of just counting them. For example, instead of just counting to find the most frequent part-of-speech tag for some word, we could consider additional features such as whether the word is capitalized, whether it appears after the word “to”, or the domain of the sentence (e.g., “news”, if it is known). Using all of these features as conditions would make a conditional frequency distribution too sparse to be useful (the features/conditions are too discriminating to get a model that generalizes to unseen data), so instead these features are used in a way that does not overfit to the training data.

Additional Reading

Testing Your Knowledge

Questions

Q: What does training data need to be used for supervised classification?
Q: What are features in a machine learning model?
Q: Why is overfitting a problem?
Q: How is a development set different from a test set?

Practical Work

Try to split the Brown corpus into train, dev, and test sets, but try to make representative of the corpus. Think about features you might use to split the instances (e.g., the category, length of sentence, perplexity, etc.)
Build a part-of-speech tagger using a Decision Tree, Naive Bayes, or Maximum Entropy classifier (pick one, or all three). Design your features, train your model, evaluate.