EARL: Efficient Annotation of Resources by Learning

Project completed!

The grant's project period has now ended. See links below for project members, papers, software, data and resources related to the project. Please contact Jason Baldridge (jbaldrid@mail.utexas.edu) if you have any questions. We are grateful to the NSF for its support of this project.

About

The goal of this project is to reduce the annotation effort involved in documenting languages with insufficient resources by means of machine learning and active learning. The output of the research will allow not only field linguists but language scholars in general to concentrate on the more pressing issues of data collection and linguistic analysis, while minimizing the time required for the highly labor-intensive task of linguistic annotation. Harnessing recent developments in natural language processing and machine learning, especially in active learning and machine translation, our research focuses on maximizing performance in terms of both coverage and precision while using as little human-annotated material as possible. To test our approach in a real-world situation, we will be directly involved in the annotation of an actual underdocumented language, Uspanteko, working closely with Mayan language experts to maximize the robustness and applicability of our results.

People

Principal investigators: Jason Baldridge, Katrin Erk

Research assistants: Alexis Palmer, Taesun Moon

Annotators: Eric Campbell, Telma Can Pixabaj

Publications

Software

  • TexNLP: Java code for HMM and MEMM part-of-speech tagging [LGPL]
  • IGT Editor: An interlinearized text editor and machine labeler
  • EARL Morph: C code for morphology acquisition [LGPL]

Data

  • See AILLA for the original OKMA Uspanteko data.
  • Our updated and cleaned XML version of the Uspanteko data will be available soon from AILLA, and we hope to post it to the EARL website as well. Until then, please contact Jason Baldridge (jbaldrid@mail.utexas.edu) if you are interested in the dataset.

Resources