HG2051 Project 2: Group assignment

Introduction

This project constitutes 30% of your final grade for HG2051. Please work on the final program in groups of 2-3 and submit your report together.

There are two options for completing Project 2. Both involve developing a POS tagger. The first (Project 2a) uses annotated data from Project 1 to develop a POS tagger for a low-resource language. The second (Project 2b) uses annotated data from the Universal Dependencies Treebanks (UDT) project to train a POS tagger. If you were successful in annotating one of the languages your team members chose for Project 1, Project 2a is the preferred choice. If not, choose Project 2b and use the hand-annotated data from a UDT corpus for training and testing. Make sure to read through both projects below to understand the scope of the tasks.

Preliminaries

As noted in Project 1, part of speech (POS) tagging is a way to automatically identify the word class of each lexical item in a string of text. A reliable POS tagger can facilitate downstream tasks such as machine translation. In the individual project you focused on the process of developing materials for POS tagging for a low-resource language. In this project you will work with other students to train a POS tagger using a combination of the data you annotated and additional data that you will develop or source.

Typically, identifying or classifying word types in sentences is done on the basis of FDA (Function, Distribution, Associated grammatical categories), as per your Morphology and Syntax course. There are two basic approaches to automated POS tagging:

  1. rule-based
  2. statistical inference.

The first approach can be useful for complex cases, particularly when the language has already been analyzed, while the second often involves machine learning and can be helpful when little is known about the language. Some taggers benefit from a hybrid approach, and developing a tagger is often an iterative process.
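To make the contrast concrete, here is a minimal Python sketch of the two approaches and one possible hybrid. Everything in it is an assumption for illustration: the suffix rules, the toy English training sentences, and the function names are hypothetical, and the "statistical" model is only a most-frequent-tag baseline rather than a full machine learning tagger.

```python
import re
from collections import Counter, defaultdict

# Hypothetical suffix rules for English-like toy data; a real rule set
# would come from your own analysis of the language.
RULES = [
    (re.compile(r".*ing$"), "VERB"),
    (re.compile(r".*ly$"), "ADV"),
    (re.compile(r".*s$"), "NOUN"),
]


def rule_based_tag(tokens, default="NOUN"):
    """Tag each token with the first matching rule, else the default tag."""
    tagged = []
    for tok in tokens:
        tag = default
        for pattern, rule_tag in RULES:
            if pattern.match(tok.lower()):
                tag = rule_tag
                break
        tagged.append((tok, tag))
    return tagged


def train_most_frequent(tagged_sentences):
    """Learn the most frequent tag for each word form in annotated data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def statistical_tag(tokens, model, default="NOUN"):
    """Look up each token in the learned model; unseen words get the default."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]


def hybrid_tag(tokens, model):
    """Use the learned tag for seen words and fall back to the rules otherwise."""
    rule_tags = rule_based_tag(tokens)
    return [(tok, model.get(tok.lower(), rtag)) for tok, rtag in rule_tags]


# Toy annotated data, invented for this example.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("dogs", "NOUN"), ("run", "VERB")],
]
model = train_most_frequent(train)
print(hybrid_tag(["the", "dog", "runs", "quickly"], model))
```

Note how the hybrid tagger tags "runs" as a verb because it was seen in training, even though the suffix rule alone would call it a noun. Inspecting such disagreements is one way to decide which rules to add or remove in the next iteration.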

Project 2a: Part of Speech tagging for low-resource languages

For Project 2a you will work with the data that you or a group member annotated previously. If you planned ahead, you will all have worked on languages within the same family; this familiarity should help you succeed in the group task.

Choose the language

Project 2b: Part of Speech tagging using a UDT dataset

For Project 2b you will work with data from the Universal Dependencies Treebanks project for a language that is also found in the taggedPBC. This will allow you to use hand-tagged data to train a tagger for a relatively low-resource language.
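UD treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated fields (the word form is the second field and the universal POS tag the fourth), comment lines starting with "#", and a blank line between sentences. The sketch below shows one way to read such data into (word, tag) pairs. The function name and the sample sentences are made up for illustration, and you may prefer an existing library for this step.

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, upos) pairs from CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        token_id = cols[0]
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1");
        # only the syntactic words carry a UPOS tag.
        if "-" in token_id or "." in token_id:
            continue
        sentence.append((cols[1], cols[3]))
    if sentence:
        yield sentence


# Invented sample data in CoNLL-U layout.
SAMPLE = """# sent_id = 1
# text = Dogs bark.
1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_

# sent_id = 2
1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_
1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_
2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_
3\tperro\tperro\tNOUN\t_\t_\t0\troot\t_\t_
"""

sentences = list(read_conllu(SAMPLE.splitlines()))
print(sentences)
```

To read a real treebank file, you could pass an open file object instead of the sample, for example `with open(path, encoding="utf-8") as f: sentences = list(read_conllu(f))`, where `path` points to a `.conllu` file you have downloaded.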

Choose the language

Guide for both project options

For both projects, the goal is to develop a POS tagger and use it to tag the remaining data in the respective taggedPBC corpus for the language you chose. You will evaluate the result on several grounds. The following guide will walk you through the process.

Determine your approach

Write code to train and evaluate your tagger
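
As a starting point, the following sketch shows one possible shape for the evaluation side of this step: hold out part of the annotated data, tag it, and compare the predictions with the gold tags. The function names, the 80/20 split, and the toy data are assumptions for illustration; `tag_fn` stands in for whatever tagger you build, and `always_noun` is a deliberately weak placeholder.

```python
import random
from collections import Counter


def split_data(sentences, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and return (train, test) lists of sentences."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def evaluate(tag_fn, test_sentences):
    """Return token accuracy and a Counter of (gold, predicted) errors."""
    correct = total = 0
    errors = Counter()
    for sentence in test_sentences:
        words = [word for word, _ in sentence]
        gold = [tag for _, tag in sentence]
        predicted = tag_fn(words)
        for g, p in zip(gold, predicted):
            total += 1
            if g == p:
                correct += 1
            else:
                errors[(g, p)] += 1
    accuracy = correct / total if total else 0.0
    return accuracy, errors


def always_noun(words):
    """Placeholder tagger: replace this with your own model."""
    return ["NOUN"] * len(words)


# Toy annotated data, invented for this example.
data = [
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("cats", "NOUN"), ("sleep", "VERB")],
    [("birds", "NOUN"), ("sing", "VERB")],
    [("fish", "NOUN"), ("swim", "VERB")],
    [("bees", "NOUN"), ("buzz", "VERB")],
]
train_set, test_set = split_data(data)
acc, errors = evaluate(always_noun, test_set)
print(f"accuracy: {acc:.2f}")
print(errors.most_common())
```

The error counter records which gold tags your tagger confuses with which predicted tags, which is often more informative than a single accuracy figure when deciding what to improve.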

BONUS: In addition to POS tagging, a central concern when parsing language is understanding dependencies between word classes. Develop a parser for the language that identifies dependencies between words, i.e. phrasal heads and their subordinate elements. For example, nouns are heads of noun phrases but dependents of verb phrases: how would you link a particular noun with its head, and in what role relation (subject, object, etc.)?
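
Before writing a parser, it may help to decide how a dependency analysis will be stored. The sketch below shows one possible representation, not a parser: each word points to its head with a labeled arc. The `Arc` class, the helper functions, and the hand-written English analysis are assumptions for illustration; the relation labels follow Universal Dependencies naming (`nsubj`, `obj`, `det`, `root`).

```python
from typing import NamedTuple


class Arc(NamedTuple):
    dependent: int  # 1-based index of the dependent word
    head: int       # 1-based index of the head word; 0 marks the root
    relation: str   # role of the dependent with respect to its head


# Hand-written analysis of a toy sentence, invented for this example.
words = ["the", "dog", "chased", "a", "cat"]
arcs = [
    Arc(1, 2, "det"),    # "the" depends on the noun "dog"
    Arc(2, 3, "nsubj"),  # "dog" is the subject of "chased"
    Arc(3, 0, "root"),   # "chased" is the root of the sentence
    Arc(4, 5, "det"),    # "a" depends on the noun "cat"
    Arc(5, 3, "obj"),    # "cat" is the object of "chased"
]


def head_of(index, arcs, words):
    """Return (head word, relation) for the word at a 1-based index."""
    for arc in arcs:
        if arc.dependent == index:
            head_word = "ROOT" if arc.head == 0 else words[arc.head - 1]
            return head_word, arc.relation
    return None


def dependents_of(index, arcs, words):
    """Return (dependent word, relation) pairs for the word at a 1-based index."""
    return [(words[a.dependent - 1], a.relation) for a in arcs if a.head == index]


print(head_of(2, arcs, words))
print(dependents_of(3, arcs, words))
```

Here the noun "dog" is the head of "the" and, at the same time, a dependent of the verb "chased" in the subject relation. A parser for the bonus task would need to predict such arcs automatically from the tagged words.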

Predict tags for the taggedPBC corpus

Output

Your final repository should include: