EARL: Efficient Annotation of Resources by Learning
Project completed!
The grant's project period has now ended. See links below for project members, papers, software, data and resources related to the project. Please contact Jason Baldridge (jbaldrid@mail.utexas.edu) if you have any questions. We are grateful to the NSF for its support of this project.
About
The goal of this project is to reduce the annotation effort involved in documenting languages with insufficient resources by using machine learning and active learning. The output of the research will allow field linguists, and language scholars in general, to concentrate on the more pressing issues of data collection and linguistic analysis while minimizing the time required for the highly labor-intensive task of linguistic annotation. Harnessing recent developments in natural language processing and machine learning, especially in active learning and machine translation, our research focuses on maximizing performance in terms of both coverage and precision while using as little human-annotated material as possible. To test our approach in a real-world situation, we will be directly involved in the annotation of an actual underdocumented language, Uspanteko, working closely with Mayan language experts and thus maximizing the robustness and applicability of our results.
People
Principal investigators: Jason Baldridge, Katrin Erk
Research assistants: Alexis Palmer, Taesun Moon
Annotators: Eric Campbell, Telma Can Pixabaj
Publications
- Alexis Palmer, Taesun Moon, Jason Baldridge, Katrin Erk, Eric Campbell, and Telma Can. Computational strategies for reducing annotation effort in language documentation: A case study in creating interlinear texts for Uspanteko. To appear in Linguistic Issues in Language Technology.
- Taesun Moon, Katrin Erk, and Jason Baldridge. Unsupervised morphological segmentation and clustering with document boundaries. In Proceedings of EMNLP-2009. Singapore. 2009.
- Jason Baldridge and Alexis Palmer. How well does active learning actually work? Time-based evaluation of effectiveness for language documentation. In Proceedings of EMNLP-2009. Singapore. 2009.
- Alexis Palmer, Taesun Moon, and Jason Baldridge. Evaluating automation strategies in language documentation. In NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing. Boulder, CO. 2009.
- Taesun Moon and Katrin Erk. Minimally supervised lemmatization scheme induction through bilingual parallel corpora. In Proceedings of International Conference on Global Interoperability for Language Resources. Hong Kong. 2008.
- Alexis Palmer and Katrin Erk. IGT-XML: An XML format for interlinearized glossed text. In ACL 2007 Linguistic Annotation Workshop. Prague. 2007.
- Taesun Moon and Jason Baldridge. Part-of-Speech Tagging for Middle English through Alignment and Projection of Parallel Diachronic Texts. In Proceedings of EMNLP/CONLL-2007. Prague. 2007.
Software
- TexNLP: Java code for HMM and MEMM part-of-speech tagging [LGPL]
- IGT Editor: an interlinearized text editor and machine labeler.
- EARL Morph: C code for morphology acquisition [LGPL]
Data
- See AILLA for the original OKMA Uspanteko data.
- Our updated and cleaned XML version of the Uspanteko data will be available soon from AILLA, and we hope to post it to the EARL website as well. Until then, please contact Jason Baldridge (jbaldrid@mail.utexas.edu) if you are interested in the dataset.
Resources
- IGT-XML is an XML format for representation of interlinearized glossed text (IGT).
- Preliminary schema for IGT-XML
- Q'anjob'al text fragment in IGT-XML
- Uspanteko text fragment in IGT-XML
Sponsor
EARL is supported by the Documenting Endangered Languages program of the National Science Foundation.
