====== Morphological Analysis Project ======

[[morph/earl_al:AL exps]] \\
[[morph/earlite:EARL-ITE Annotation Tool]]

===== Reading list =====

==== Highly relevant ====

[[http://www.stanford.edu/~sgwater/papers/thesis_1spc.pdf|Sharon Goldwater's thesis]]

==== Toolbox, NLTK, and Python ====

[[http://nflrc.hawaii.edu/ldc/June2007/robinson/robinson.pdf|Managing Fieldwork Data with Toolbox and the Natural Language Toolkit]]

==== May be something to check out ====

[[http://linguistlist.org/issues/18/18-1455.html|Machine Recognition and Morphological Analysis of Subanta-Padas]]

==== Papers on morphology and POS tagging ====

  * Tseng, Jurafsky and Manning's 2005 [[http://acl.ldc.upenn.edu/I/I05/I05-3005.pdf|paper]] on how morphology helps POS tagging. It seems to be one of the few papers that deal with this subject.

==== Papers on unsupervised morphology induction ====

=== Must read ===

  * Schone and Jurafsky's 2000 [[http://acl.ldc.upenn.edu/W/w00/w00-0712.pdf|paper]] on using word distribution (latent semantic analysis) to refine morphology induction.

=== Relevant ===

  * Schone and Jurafsky's 2001 [[http://portal.acm.org/citation.cfm?id=1073360&dl=GUIDE,|paper]] that expands on their 2000 study. (recommended)
  * Baroni et al's 2002 [[http://portal.acm.org/citation.cfm?id=1118653&dl=|paper]] that uses mutual information and minimal edit distance to induce morphology clusters.
  * Dayne Freitag's 2005 [[http://acl.ldc.upenn.edu/W/W05/W05-0617.pdf|paper]] that uses information theoretic co-clustering to refine morphology induction. (recommended)

==== Papers on alignment and transfer methods ====

=== Must reads ===

  * Yarowsky and Wicentowski's 2000 [[http://portal.acm.org/citation.cfm?id=1075245|paper]] on lemmatization and morphological analysis through minimally supervised methods. Provides background on Yarowsky et al's 2001 paper below.
  * Yarowsky and Ngai's 2001 [[http://portal.acm.org/citation.cfm?id=1073336.1073362|paper]] on inducing multilingual POS taggers and NP bracketers via projection across aligned corpora.
  * Yarowsky et al's 2001 [[http://portal.acm.org/citation.cfm?id=1072187|paper]] on inducing more NLP tools using the methods outlined above.

=== Relevant ===

  * Mona Diab's 2001 [[http://portal.acm.org/citation.cfm?id=1073126&dl=GUIDE|paper]] on word sense tagging using parallel corpora.
  * Drábek and Yarowsky's 2005 [[http://acl.ldc.upenn.edu/W/W05/W05-0807.pdf|paper]] on inducing fine-grained POS taggers through alignment.

===== Software =====

  * [[http://comp.ling.utexas.edu/~tsmoon/giza.tgz|GIZA++]]: This is a slightly revised version of the one that can be found at Franz Och's [[http://www.fjoch.com/GIZA++.html|website]]. The source has been modified so that it compiles with gcc 4.0 or later. The relevant paper can be found [[http://portal.acm.org/citation.cfm?id=778822.778824|here]].
  * [[http://michel.jacobson.free.fr/ITE/index_en.html|Interlinear Text Editor]]: ITE is software designed specifically for interlinear glossing. It was developed by [[http://michel.jacobson.free.fr/|Michel Jacobson]]. Good, thorough documentation, including a user's manual and a link to the source code.
  * [[http://nltk.sourceforge.net|NLTK]]: NLTK is a suite of open source tools for natural language processing, written in Python. NLTK includes a set of tools for processing data from Toolbox: [[http://nltk.org/doc/en/data.html|Managing Linguistic Data with NLTK]]. These tools in turn make use of Python's [[http://effbot.org/zone/element-index.htm|ElementTree module]].
===== ITE Resources =====

  * LACITO archive [[http://lacito.vjf.cnrs.fr/archivage/dtd/|DTD documentation]] describes the default XML format for ITE
  * Some [[http://michel.jacobson.free.fr/ITE/stylesheets_en.htm|example XSLTs]] used with ITE

===== Data =====

[[http://www.hlt.utdallas.edu/~sajib/dataset.html|Bengali datasets from Sajib Dasgupta]]

===== Tools and resources for Portuguese =====

NOTE: on the lab machines, Floresta (the Bosque portion) is available in /groups/corpora, and the smaller datasets from the [[http://nextens.uvt.nl/~conll/free_data.html|CoNLL 2006 shared task]] are available in /groups/projects/earl/data. We also have a number of in-house tools (created by Ben Wing) for working with Floresta.

[[http://www.linguateca.pt/|Linguateca]]: a clearinghouse for Portuguese NLP resources, with links to a number of them

  * [[http://acdc.linguateca.pt/cetempublico/whatisCETEMP.html|CETEMPúblico]]: 180 million words of European Portuguese
  * [[http://acdc.linguateca.pt/cetenfolha/|CETENFolha]]: 24 million words of Brazilian Portuguese
  * [[http://www.linguateca.pt/COMPARA/Welcome.html|COMPARA]]: parallel Portuguese-English corpus, roughly 150K words in each language; this page links to a web-based search interface. The website does not spell out whether the entire corpus is available.
  * [[http://acdc.linguateca.pt/treebank/info_floresta_English.html|Floresta Sintá(c)tica]]: Portuguese treebank consisting of texts from both CETEMPúblico and CETENFolha, parsed with the [[http://visl.sdu.dk|Palavras parser]] --> Floresta is broken into two subsets: Bosque and Floresta Virgem, both available from this same page
    * [[http://acdc.linguateca.pt/treebank/info_floresta_English.html|Bosque]]: manually-corrected portion of Floresta, 9431 trees, roughly 180K words
    * [[http://acdc.linguateca.pt/treebank/info_floresta_English.html|Floresta Virgem]]: uncorrected, automatically-parsed treebank; 41K trees, over 1 million words

[[http://www.nilc.icmc.usp.br/lacioweb/english/index.htm|Lácio-Web]]

  * [[http://www.nilc.icmc.usp.br/lacioweb/english/plancamento.htm|Lácio-Ref corpus]]: 8.2 million words, text with metadata (no tagging)
  * [[http://www.nilc.icmc.usp.br/lacioweb/english/plancamento.htm|MAC-MORPHO corpus]]: 1.2 million words, POS-tagged with [[http://visl.sdu.dk/|Palavras]]
  * [[http://www.nilc.icmc.usp.br/lacioweb/english/ferramentas.htm|POS taggers]]: several taggers trained on the MAC-MORPHO corpus

===== Tools and resources for Mayan languages =====

==== OKMA texts ====

Texts produced by [[http://www.okma.org/|La Asociación Oxlajuuj Keej Maya' Ajtz'iib']]. IMPORTANT: [[:morph#Citation_information|Citation information]] below.

For now, here are the (uncleaned) numbers:

  * Awakateko: 81564 words glossed, 67637 additional unglossed
  * Sakapulteko: 59890 words glossed, 172709 additional unglossed
  * Tektiteko: 101185 words glossed, 97071 additional unglossed
  * Uspanteko: 73231 words glossed, 218617 additional unglossed

=== Examples ===

Texts are glossed in and translated into Spanish. The format shown below is the data as exported from Shoebox.
  * ''\ref'' annotator's reference number for the clause
  * ''\t'' raw text (with the exception of the clause boundary marker '+')
  * ''\m'' morphological segmentation
  * ''\g'' gloss line: a combination of lemmas and morphological tags
  * ''\c'' appears to hold mostly POS (word-class) information
  * ''\l'' Spanish translation

== Awakateko ==

  \ref trtex001awa-parte1 012
  \t pero kyi na eel qatxuum tetz,+
  \m poro kye' na eel qa- txuum t- eetz
  \g pero neg. INC salir A1p- pensar E3s- de
  \c conj. adv. procl. v.i. pref.- v.t. pref.- s. rel.
  \l pero no lo entendemos

== Sakapulteko ==

  \t K'o jun chek rii'... kasi qast mas etz'eneem tziij,+
  \m k'o jun chek rii' kasi qas - taj mas etz' en eem tziij
  \g EXS ART PART este casi muy - IRR mas jugar AP SS palabra
  \c exist. art. part. dem. adv. part. - part. conj. no. cl. suf. suf. s.
  \l Hay otro casi no es tan chistoso

== Tektiteko ==

  \ref trtex01.1tek03 009
  \t Y tzan qaq'unan+
  \m y tzan q- aq'una -n
  \g y PREP E1p- trabajar -AP
  \c conj. prep. pers.- v.t. -suf.
  \l Y para que trabajemos

== Uspanteko ==

  \t juntir chi'ntayik,+
  \m juntiir ch-in-tay-ik
  \g todo PRE-E1s-escuchar-SV
  \c adv. prep.-pers.-v.t.-suf.
  \l Todos los que me están escuchando.
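
The field markers used in these examples lend themselves to a simple line-based reader. The following is a minimal Python sketch, not the project's actual tooling: ''parse_records'' and ''align_tiers'' are hypothetical names, and the rule for detecting record boundaries (a ''\ref'' line or a repeated marker) is an assumption, motivated by the fact that some of the examples have no ''\ref'' line. It also assumes that the ''\m'' and ''\g'' tiers contain the same number of whitespace-separated tokens, which holds for the four examples shown here but may not hold across the uncleaned data.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Two records copied from the examples on this page (Tektiteko, then Uspanteko).
SAMPLE = r"""
\ref trtex01.1tek03 009
\t Y tzan qaq'unan+
\m y tzan q- aq'una -n
\g y PREP E1p- trabajar -AP
\c conj. prep. pers.- v.t. -suf.
\l Y para que trabajemos
\t juntir chi'ntayik,+
\m juntiir ch-in-tay-ik
\g todo PRE-E1s-escuchar-SV
\c adv. prep.-pers.-v.t.-suf.
\l Todos los que me están escuchando.
"""


def parse_records(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group backslash-marked lines into records, one dict per clause."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything that is not a field
        marker, _, value = line[1:].partition(" ")
        # Assumption: a \ref line, or a marker already seen in the current
        # record, signals the start of a new record.
        if current and (marker == "ref" or marker in current):
            records.append(current)
            current = {}
        current[marker] = value.strip()
    if current:
        records.append(current)
    return records


def align_tiers(
    record: Dict[str, str], first: str = "m", second: str = "g"
) -> Optional[List[Tuple[str, str]]]:
    """Pair up tokens of two tiers; return None if the token counts differ."""
    left = record.get(first, "").split()
    right = record.get(second, "").split()
    if len(left) != len(right):
        return None
    return list(zip(left, right))


records = parse_records(SAMPLE.splitlines())
for record in records:
    print(record.get("ref", "(no ref)"), align_tiers(record))
```

Returning ''None'' on a token-count mismatch, rather than truncating, makes it easy to list the records that need manual cleaning before any alignment-based processing.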
=== Citation Information ===

This data should be cited as follows:

Text Collections in Four Mayan Languages, 2003-2007 \\
OKMA (Oxlajuuj Keej Maya' Ajtz'iib')

  * **Awakateko**
    * Supported by the Norwegian Ministry of External Relations
    * B'alam Mateo Toledo (coordinator)
    * Edna Patricia Delgado Rojas (coordinator)
    * Johanna Liseth Mendoza Solís
    * María Virginia Rodríguez Rodríguez
  * **Sakapulteko**
    * Supported by Endangered Languages Documentation Programme (SOAS, University of London)
    * Romelia Mó Isém (coordinator)
    * Juan Carlos Vásquez Aceituno
    * Ana Luciana Arcón Puzul
    * Juan Adolfo Solís Baltazar
  * **Tektiteko**
    * Supported by the Norwegian Ministry of External Relations
    * José Reginaldo Pérez Vaíl (coordinator)
    * Erico Simón Morales
    * Ernesto Baltazar Gutiérrez
  * **Uspanteko**
    * Supported by Endangered Languages Documentation Programme (SOAS, University of London)
    * Telma Can Pixabaj (coordinator)
    * Miguel Angel Vicente Méndez
    * María Vicente Méndez
    * Oswaldo Ajcot Damián

==== Q'anjob'al ====

Q'anjob'al data is located in ''/groups/projects/earl/data/qanjobal''

We have access to data from a number of sources, in various formats and with varied levels of annotation. **NOTE**: the texts vary in orthography and/or analysis. I've tried to group them below according to provenance.

=== Bible (not actually Q'anjob'al) ===

''KSM-complete-NT.line-format'' is an **Akateko** (not Q'anjob'al) translation of the New Testament, one verse per line. Each verse is numbered, with chapter numbers appearing only before the first verse of each chapter, as follows:

  4:1 Catuý jix iýletoj Jesús yu Yespíritu Dios bey jun cusiltaj txýotxý, yutol oj akýle yijbale naj yin spenail yu naj diablo.
  2 Cýam tzet jix sloý yin cawinaj cýual, cawinaj akýbal. Catuý jix tit swail Jesús.

This one file contains all 27 books of the New Testament with no explicit separation of the books. Each book begins with a line numbered ''1:1'', which is the only signal of a book boundary.
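
The numbering scheme of the line-format file can be turned into explicit (book, chapter, verse) triples. The following is a minimal Python sketch under stated assumptions: ''iter_verses'' and ''Verse'' are hypothetical names, books are identified only by their position in the file (1 to 27) because the file carries no book titles, and the sample lines are placeholders rather than text from the file.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Verse(NamedTuple):
    book: int      # position of the book in the file, starting at 1
    chapter: int
    verse: int
    text: str


# Optional "chapter:" prefix, then the verse number, then the verse text.
_VERSE_LINE = re.compile(r"^(?:(\d+):)?(\d+)\s+(.*)$")


def iter_verses(lines: Iterable[str]) -> Iterator[Verse]:
    """Yield one Verse per numbered line, tracking book and chapter."""
    book = 0
    chapter = 0
    for raw in lines:
        match = _VERSE_LINE.match(raw.strip())
        if match is None:
            continue  # not a numbered verse line
        chapter_str, verse_str, text = match.groups()
        verse = int(verse_str)
        if chapter_str is not None:
            chapter = int(chapter_str)
            # A line numbered 1:1 is the only marker of a new book.
            if chapter == 1 and verse == 1:
                book += 1
        yield Verse(book, chapter, verse, text)


# Placeholder lines that only imitate the numbering scheme.
SAMPLE = [
    "1:1 first verse of the first book",
    "2 second verse",
    "2:1 first verse of chapter two",
    "1:1 first verse of the second book",
    "2 second verse",
]

verses = list(iter_verses(SAMPLE))
for v in verses:
    print(v.book, v.chapter, v.verse, v.text)
```

If the whole file is read this way, the final book counter should equal 27; any other value would suggest a numbering irregularity worth checking by hand.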
Here are some notes from B'alam re: this text, orthography, and comparison of Akateko and Q'anjob'al (some consider Akateko a distant dialect of Q'anjob'al, others consider it a separate language).

I checked the extract of the bible you sent me. Here are some comments.

[1] It is written in Akateko. Linguistically speaking this is a far distant dialect of Q'anjob'al. Some people (like the Academy of Mayan languages, the ministry of education, etc.) consider it a different language. There are several differences between what you would call Q'anjob'al and Akateko. My undergraduate thesis is a description of these differences.

[2] The orthography used in the bible is the non-official one (which is basically the one that the SIL people used, and they still use it).

[3] The orthography is partially different from the one I use (or writing Q'anjob'al in general) for dialectal variations and because it is an old alphabet.

[3a] I have found the following correspondences between the official alphabet for Q'anjob'al and the one in the bible. There could be other differences but these are the ones that I can recognize right away. I haven't found a word with /xh/ but this would be another one to look for.

  Q'anjob'al-bible
  b'-b
  k-c
  k'-c'
  h- (nothing)
  q'-k', q'
  q-j
  j- (nothing)

[3b] Things might look confusing with respect to /q, q', k', j/. There is a whole sound movement going on with these sounds in Akateko.

  * In Akateko, /q', k'/ are fusing into one single /k'/ sound. Those words with /q'/ in the bible might be the remaining words with /q'/.
  * The /j/ in the *old q'anjob'al* was lost in Akateko. In some contexts its loss was compensated with vowel lengthening (that is why there are some long vowels in there).
  * The /q/ in the *OLD Q'ANJOB'AL* became a /j/ in Akateko (the old /j/ was lost and the old /q/ took its place and became a /j/ in the actual language).

All these changes result in:

a. There are long vowels in some contexts. \\
b. /q, q'/ are almost lost \\
c.
The /q/ in q'anjob'al corresponds to /j/ in Akateko.

[3c] Then, the *real orthographic differences/mistakes* in the bible and Q'anjob'al would be:

  b'-b
  k-c, k'-c'
  h- (nothing)

=== Earlier texts ===

This is a set of texts used for work reported in Kuhn and Mateo-Toledo 2004 (Applying Computational Linguistic Techniques in a Documentary Project for Q'anjob'al (Mayan, Guatemala), LREC 2004).

  * ''xhapmat'' is divided into 7 separate files (raw text, needs preprocessing)
  * the directory ''tagged'' contains a number of texts as tagged by the POS tagger developed by Kuhn. These are **not** gold-standard tags.

=== Current texts ===

These are the texts from ongoing Q'anjob'al documentation projects. Numbers may be a bit high, as the texts have not yet been cleaned for our purposes.

  * Q'anjob'al: 26317 glossed, 56333 additional unglossed