Introduction
This project constitutes 30% of your final grade for HG2051. Please work on the final program in groups of 2-3 and submit a joint report.
The goal of this assignment is to demonstrate your programming and problem-solving abilities through teamwork. If your team has an idea for another project that you would like to do instead, talk to the instructor for approval.
Project 2 involves developing a POS-tagger for a (low-resource) language. As with the individual project, your team will be required to submit data, output, and annotated code along with a short writeup that describes your goals, process, and results. Your code will be assessed based on its functionality and simplicity. Your writeup will be assessed on its organization, clarity, comprehensiveness, and quality of the discussion.
There are two options for completing Project 2, both of which involve developing a POS tagger. The first (Project 2a) uses the annotated data from Project 1 to develop a POS tagger for a low-resource language. The second (Project 2b) uses annotated data from the Universal Dependencies Treebanks project to train a POS tagger. If your team was successful in annotating one of the languages chosen for Project 1, Project 2a is the preferred option. If not, you can focus on using the hand-annotated data from a UDT corpus for training/testing (Project 2b). Make sure to read through both project descriptions below to understand the scope of the tasks.
Preliminaries
As noted in Project 1, part of speech (POS) tagging is a way to automatically identify the word class of a particular lexical item in a string of text. A decent part of speech tagger can help to facilitate other downstream tasks such as machine translation. In the individual project you focused on the process of developing materials for POS tagging for a low-resource language. In this project you will work with other students to train a POS-tagger using a combination of the data you annotated and additional data that you will develop or source.
Typically, identifying or classifying word types in sentences is done on the basis of FDA (Function, Distribution, Associated grammatical categories), as per your Morphology and Syntax course. There are two basic approaches to automated POS tagging:
- rule-based
- statistical inference
The first approach can be useful for complex cases, particularly when the language has already been analyzed, while the second often involves machine learning and can be helpful when little is known about the language. Some taggers benefit from a hybrid approach, and development can be an iterative process.
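To make the two approaches concrete, here is a minimal sketch of a hybrid tagger: a frequency-based (statistical) lexicon learned from tagged data, with a hand-written suffix rule as the fallback. The toy data, tag labels, and the `-ing` rule are purely illustrative, not a recommendation for any particular language:

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Learn the most frequent tag for each word from tagged sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def hybrid_tag(sentence, lexicon):
    """Statistical lookup first; fall back to a simple rule, then a default tag."""
    out = []
    for word in sentence:
        if word in lexicon:
            out.append((word, lexicon[word]))
        elif word.endswith("ing"):          # toy rule-based fallback
            out.append((word, "VERB"))
        else:
            out.append((word, "NOUN"))      # default guess for unknown words
    return out

# Toy training data (hypothetical):
train_sents = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")]]
lexicon = train_unigram(train_sents)
print(hybrid_tag(["the", "dog", "barking"], lexicon))
# → [('the', 'DET'), ('dog', 'NOUN'), ('barking', 'VERB')]
```

Iterating here means inspecting the errors of the statistical component and adding or refining rules (or more training data) in response.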
Project 2a: Part of Speech tagging for low-resource languages
For Project 2a you will work with the data that you or a group member annotated previously. If you planned ahead, you will all have worked on languages within the same family; this familiarity should help your group succeed in the task.
Choose the language
- In your group, choose one of the low-resource languages that a member of your group worked on for Project 1. Using the dataset, develop a POS-tagger for the chosen language. Then use the trained tagger to re-tag the complete corpus in the taggedPBC.
Project 2b: Part of Speech tagging using a UDT dataset
For Project 2b you will work with data from the Universal Dependencies Treebanks project for a language that is also found in the taggedPBC. This will allow you to use hand-tagged data to train a tagger for a relatively low-resource language.
Choose the language
- A list of languages available is given here with links to the respective treebanks. Using this dataset, develop a POS-tagger. Then use this trained tagger to re-tag the respective taggedPBC corpus.
Guide for both project options
For both projects, the goal is to develop a POS tagger and use it to tag the remaining data in the respective taggedPBC corpus for the language you chose. You will evaluate the result on several grounds. The following guide will walk you through the process.
Determine your approach
Decide how you want to approach the task of developing an automated POS tagger: will you train a tagger from data, or write rules, or try some combination of the two?
You also need to consider evaluation/validation of your tagger, which will be based either on the hand-tagged sentences/verses annotated in Project 1 or on a portion of the hand-tagged sentences of the UDT corpus for the language you chose. This set of annotated sentences must be data that your tool has not been trained on, and will serve as the “gold standard” evaluation set. In order to train/develop your tagger, you may need to find or develop additional tagged data, either from other sources or via hand-annotation. For best results, your training/development data should contain many more sentences (at least twice as many) than your evaluation set.
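The train/evaluation division can be enforced with a simple split; the proportions below mirror the "at least twice as many" guideline, and the function name and 2:1 ratio are illustrative choices, not a required interface:

```python
def train_eval_split(sentences, eval_fraction=1/3):
    """Hold out the final eval_fraction of sentences as the gold-standard set."""
    cut = int(len(sentences) * (1 - eval_fraction))
    return sentences[:cut], sentences[cut:]

data = list(range(30))          # stand-in for 30 tagged sentences
train_set, gold_set = train_eval_split(data)
print(len(train_set), len(gold_set))  # → 20 10
```

Whatever split you use, keep the held-out portion untouched during development so that it remains a genuinely unseen test of your tagger.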
Write code to train and evaluate your tagger
Together with your group, develop code that parses the training/development data and re-tags either the annotated dataset from Project 1 or a set of 20 sentences from the UDT for POS, evaluating its accuracy by comparing the newly generated tags with the “gold standard” tags.
Starter code in your repository shows how you can use tools in the NLTK library for training POS-taggers. You can also use other tools/libraries to train a tagging model, and/or write parsing rules based on your understanding of the language.
The starter code also illustrates how you can evaluate on the “gold standard” data.
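In outline, evaluation amounts to comparing predicted tags against gold tags token by token. A minimal accuracy function, independent of any particular library (the starter code in your repository may do this differently), might look like:

```python
def accuracy(gold_sents, predicted_sents):
    """Token-level accuracy: fraction of tokens whose predicted tag matches gold."""
    correct = total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (g_word, g_tag), (p_word, p_tag) in zip(gold, pred):
            assert g_word == p_word, "tokenisation mismatch between gold and prediction"
            correct += g_tag == p_tag
            total += 1
    return correct / total

# Toy example: one sentence, one of two tags correct.
gold = [[("the", "DET"), ("dog", "NOUN")]]
pred = [[("the", "DET"), ("dog", "VERB")]]
print(accuracy(gold, pred))  # → 0.5
```

Beyond a single accuracy figure, reporting per-tag error counts will make your writeup's discussion of results much more informative.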
BONUS: in addition to POS tagging, a central concern when parsing language is understanding dependencies between word classes. Develop a parser for the language that identifies dependencies between words, i.e. phrasal heads and their subordinate elements. For example, nouns are heads of noun phrases but dependents of verb phrases: how would you link a particular noun with its head, and in what role relation (subject, object, etc.)?
Predict tags for the taggedPBC corpus
Using the POS tagger that you trained, predict tags for the verses in the taggedPBC for the language that you chose. Code is included to illustrate the basics of this process.
Compare the result with the initial state of the corpus by examining the number/type of each POS tag in your final corpus. Was your POS-tagger able to increase the number of identified items in a given word class? Was it able to identify a larger number of classes?
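One way to make this comparison concrete is to count the tags in each version of the corpus and compare the counts. The toy data below is hypothetical, with "X" standing in for an unknown/unidentified tag:

```python
from collections import Counter

def tag_distribution(tagged_sents):
    """Count how often each POS tag appears in the corpus."""
    return Counter(tag for sent in tagged_sents for _, tag in sent)

# Toy corpora: before and after re-tagging (hypothetical words and tags).
before = [[("ama", "NOUN"), ("ka", "X"), ("tumu", "X")]]
after  = [[("ama", "NOUN"), ("ka", "DET"), ("tumu", "VERB")]]

dist_before = tag_distribution(before)
dist_after  = tag_distribution(after)
print(sorted(dist_after))                  # tag classes identified after re-tagging
print(dist_before["X"], dist_after["X"])   # → 2 0 (unknown-tag count drops)
```

A drop in unknown/unidentified tags and a growth in the set of distinct tag classes are exactly the kinds of changes the questions above ask you to report.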
Output
Your final repository should include:
- code that trains a tagger on existing data, then parses the (unseen) “gold standard” data for the language and outputs tagged data, evaluating the automatic tags against the known tags of the “gold” data. In addition, your code should re-tag the initial corpus and collect statistics on the POS tags in both the initial and the resulting corpora. Code should be self-contained, i.e. I should be able to run it in my terminal and get output
- a subfolder containing all the data used for training/testing and evaluation, with clear names for each file
- a writeup that lays out the task, process, and results, as well as a discussion of the challenges, concerns, and potential improvements