EARL: Efficient Annotation of Resources by Learning
About
The goal of this project is to reduce the annotation effort involved in documenting languages with insufficient resources through machine learning and active learning. The results of the research will allow not only field linguists but language scholars in general to concentrate on the more pressing issues of data collection and linguistic analysis while minimizing the time required for the highly labor-intensive task of linguistic annotation. Drawing on recent developments in natural language processing and machine learning, especially in active learning and machine translation, our research focuses on maximizing both coverage and precision while using as little human-annotated material as possible. To test our approach in a real-world situation, we will be directly involved in the annotation of an underdocumented language, Q'anjob'al, working closely with native speakers to make our results as robust and applicable as possible.
People
Principal investigators: Jason Baldridge, Katrin Erk
Research assistants: Alexis Palmer, Taesun Moon
Publications
- Taesun Moon and Katrin Erk. Minimally supervised lemmatization scheme induction through bilingual parallel corpora. In Proceedings of International Conference on Global Interoperability for Language Resources. Hong Kong. 2008.
- Alexis Palmer and Katrin Erk. IGT-XML: An XML format for interlinearized glossed text. In ACL 2007 Linguistic Annotation Workshop. Prague. 2007.
- Taesun Moon and Jason Baldridge. Part-of-Speech Tagging for Middle English through Alignment and Projection of Parallel Diachronic Texts. In Proceedings of EMNLP-CoNLL 2007. Prague. 2007.
Resources
- IGT-XML is an XML format for representing interlinearized glossed text (IGT).
- Preliminary schema for IGT-XML
- Q'anjob'al text fragment in IGT-XML
- Uspanteko text fragment in IGT-XML
Sponsor
EARL is supported by the Documenting Endangered Languages program of the National Science Foundation.
Project Page
The project wiki page is accessible here.
