NOTE: On the lab machines, Floresta (the Bosque portion) is available in /groups/corpora, and the smaller datasets from the CoNLL 2006 shared task are available in /groups/projects/earl/data. We also have a number of in-house tools (created by Ben Wing) for working with Floresta.
Linguateca: a clearinghouse of sorts for Portuguese NLP resources, with links to a number of them
Texts produced by La Asociación Oxlajuuj Keej Maya' Ajtz'iib'. IMPORTANT: Citation information below.
For now, here are the (uncleaned) word counts:
Awakateko: 81564 words glossed, 67637 additional unglossed
Sakapulteko: 59890 words glossed, 172709 additional unglossed
Tektiteko: 101185 words glossed, 97071 additional unglossed
Uspanteko: 73231 words glossed, 218617 additional unglossed
Texts are glossed in and translated into Spanish. The format shown below is the data as exported from Shoebox.
\ref annotator's reference number for the clause
\t raw text (with the exception of the clause boundary marker '+')
\m morphological segmentation
\g gloss line – combination of lemmas and morphological tags
\c seems to be primarily POS information, more or less
\l Spanish translation

\ref trtex001awa-parte1 012
\t pero kyi na eel qatxuum tetz,+
\m poro kye' na eel qa- txuum t- eetz
\g pero neg. INC salir A1p- pensar E3s- de
\c conj. adv. procl. v.i. pref.- v.t. pref.- s. rel.
\l pero no lo entendemos
\t K'o jun chek rii'... kasi qast mas etz'eneem tziij,+
\m k'o jun chek rii' kasi qas - taj mas etz' en eem tziij
\g EXS ART PART este casi muy - IRR mas jugar AP SS palabra
\c exist. art. part. dem. adv. part. - part. conj. no. cl. suf. suf. s.
\l Hay otro casi no es tan chistoso
\ref trtex01.1tek03 009
\t Y tzan qaq'unan+
\m y tzan q- aq'una -n
\g y PREP E1p- trabajar -AP
\c conj. prep. pers.- v.t. -suf.
\l Y para que trabajemos
\t juntir chi'ntayik,+
\m juntiir ch-in-tay-ik
\g todo PRE-E1s-escuchar-SV
\c adv. prep.-pers.-v.t.-suf.
\l Todos los que me están escuchando.
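The records above can be read into simple dictionaries keyed by field marker. The following is a minimal sketch, not one of the in-house tools: the function name `parse_records` is hypothetical, and it assumes that only the six markers listed above occur and that a new record starts at `\ref` or, for records without `\ref`, when a marker repeats.

```python
import re

# Field markers listed above; the real export may contain others (assumption).
MARKER = re.compile(r"\\(ref|t|m|g|c|l)\s")


def parse_records(text):
    """Split Shoebox-style text into a list of dicts, one per clause."""
    records = []
    current = {}
    # re.split with a capturing group returns
    # [preamble, marker1, value1, marker2, value2, ...]
    pieces = MARKER.split(text)
    for marker, value in zip(pieces[1::2], pieces[2::2]):
        # Start a new record at \ref, or when a marker repeats
        # (some records in the examples have no \ref line).
        if marker == "ref" or marker in current:
            if current:
                records.append(current)
            current = {}
        # Collapse runs of whitespace used for column alignment.
        current[marker] = " ".join(value.split())
    if current:
        records.append(current)
    return records


sample = r"""
\ref trtex01.1tek03 009
\t Y tzan qaq'unan+
\m y tzan q- aq'una -n
\g y PREP E1p- trabajar -AP
\c conj. prep. pers.- v.t. -suf.
\l Y para que trabajemos
"""

records = parse_records(sample)
# Pair each morpheme with its gloss; this only works when both lines
# have the same number of whitespace-separated tokens.
pairs = list(zip(records[0]["m"].split(), records[0]["g"].split()))
```

Pairing the `\m` and `\g` tokens by position is only safe when the two lines have equal token counts, which is not guaranteed in the uncleaned data, so it is worth checking the lengths before relying on the alignment.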
This data should be cited as follows:
Text Collections in Four Mayan Languages, 2003-2007 OKMA (Oxlajuuj Keej Maya' Ajtz'iib')
Supported by the Norwegian Ministry of External Relations
B'alam Mateo Toledo (coordinator)
Edna Patricia Delgado Rojas (coordinator)
Johanna Liseth Mendoza Solís
María Virginia Rodríguez Rodríguez

Supported by Endangered Languages Documentation Programme (SOAS, University of London)
Romelia Mó Isém (coordinator)
Juan Carlos Vásquez Aceituno
Ana Luciana Arcón Puzul
Juan Adolfo Solís Baltazar

Supported by the Norwegian Ministry of External Relations
José Reginaldo Pérez Vaíl (coordinator)
Erico Simón Morales
Ernesto Baltazar Gutiérrez

Supported by Endangered Languages Documentation Programme (SOAS, University of London)
Telma Can Pixabaj (coordinator)
Miguel Angel Vicente Méndez
María Vicente Méndez
Oswaldo Ajcot Damián
Q'anjob'al data is located in /groups/projects/earl/data/qanjobal
We have access to data from a number of sources, in various formats and with varied levels of annotation. NOTE: the texts vary in orthography and/or analysis. I've tried to group them below according to provenance.
KSM-complete-NT.line-format is a Q'anjob'al (Akateko) translation of the New Testament, one verse per line. Each verse is numbered, with chapter numbers appearing only before the first verse of each chapter, as follows:
4:1 Catu' jix i'letoj Jesús yu Yespíritu Dios bey jun cusiltaj tx'otx', yutol oj ak'le yijbale naj yin spenail yu naj diablo.
2 C'am tzet jix slo' yin cawinaj c'ual, cawinaj ak'bal. Catu' jix tit swail Jesús.
This one file contains all 27 books of the New Testament with no explicit separation of the books. Each book of course begins with a line numbered 1:1.
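Because the books are not marked explicitly, book boundaries have to be inferred from the numbering. Here is a minimal sketch of one way to do that; the function name `read_verses` is hypothetical, the sample lines are made-up placeholders rather than text from the file, and it assumes every verse line starts with either `chapter:verse` or a bare verse number.

```python
import re

# Optional "chapter:" prefix, then the verse number, then the verse text.
VERSE = re.compile(r"^(?:(\d+):)?(\d+)\s+(.*)$")


def read_verses(lines):
    """Yield (book, chapter, verse, text) for each verse line.

    `book` is a 1-based counter that is incremented whenever a line
    numbered 1:1 is seen; it is not a book name.
    """
    book = 0
    chapter = None
    for line in lines:
        match = VERSE.match(line.strip())
        if match is None:
            continue  # skip blank or unnumbered lines
        chap, verse, text = match.groups()
        if chap is not None:
            chapter = int(chap)
            # A line numbered 1:1 marks the start of a new book.
            if chapter == 1 and int(verse) == 1:
                book += 1
        yield book, chapter, int(verse), text


# Placeholder lines that only imitate the numbering scheme.
sample = ["1:1 aaa", "2 bbb", "2:1 ccc", "1:1 ddd", "2 eee"]
verses = list(read_verses(sample))
```

If the file really does contain all 27 books, the last `book` value yielded over the whole file should be 27, which makes a cheap sanity check on the numbering assumption.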
Here are some notes from B'alam re: this text, orthography, and comparison of Akateko and Q'anjob'al (some consider Akateko a distant dialect of Q'anjob'al, others consider it a separate language).
I checked the extract of the bible you sent me. Here are some comments.

[1] It is written in Akateko. Linguistically speaking this is a far distant dialect of Q'anjob'al. Some people (like the Academy of Mayan languages, the ministry of education, etc.) consider it a different language. There are several differences between what you would call Q'anjob'al and Akateko. My undergraduate thesis is a description of these differences.

[2] The orthography used in the bible is the non-official one (which is basically the one that the SIL people used, and they still use it).

[3] The orthography is partially different from the one I use (or for writing Q'anjob'al in general), both because of dialectal variation and because it is an old alphabet.

[3a] I have found the following correspondences between the official alphabet for Q'anjob'al and the one in the bible. There could be other differences, but these are the ones that I can recognize right away. I haven't found a word with /xh/, but this would be another one to look for.

Q'anjob'al    bible
b'            b
k             c
k'            c'
h             (nothing)
q'            k', q'
q             j
j             (nothing)

[3b] Things might look confusing with respect to /q, q', k', j/. There is a whole sound movement going on with these sounds in Akateko.
- In Akateko, /q', k'/ are fusing into one single /k'/ sound. Those words with /q'/ in the bible might be the remaining words with /q'/.
- The /j/ in the *old Q'anjob'al* was lost in Akateko. In some contexts its loss was compensated with vowel lengthening (that is why there are some long vowels in there).
- The /q/ in the *old Q'anjob'al* became a /j/ in Akateko (the old /j/ was lost and the old /q/ took its place and became a /j/ in the actual language).
All these changes result in:
a. There are long vowels in some contexts.
b. /q, q'/ are almost lost.
c. The /q/ in Q'anjob'al corresponds to /j/ in Akateko.

[3c] Then, the *real orthographic differences/mistakes* between the bible and Q'anjob'al would be:
b'-b
k-c, k'-c'
h- (nothing)
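The purely orthographic correspondences in [3c] can be applied mechanically as a first normalization pass. The sketch below is a naive illustration only: the function name `to_official` is hypothetical, and leaving the digraph "ch" untouched is an assumption not stated in the notes. It cannot restore the unwritten "h", it does not undo the Akateko sound changes described in [3b], and it will wrongly rewrite Spanish loans and names that contain "b" or "c".

```python
import re


def to_official(word):
    """Map bible spellings to the official alphabet per note [3c].

    Handles only c -> k, c' -> k' and b -> b'. The unwritten "h"
    cannot be recovered from the spelling alone.
    """
    # c -> k; the apostrophe of c' is kept, so c' becomes k'.
    # "ch" is left alone (assumption: it is a separate digraph).
    word = re.sub(r"c(?!h)", "k", word)
    word = re.sub(r"C(?!h)", "K", word)
    # b -> b', unless the apostrophe is already there.
    word = re.sub(r"b(?!')", "b'", word)
    word = re.sub(r"B(?!')", "B'", word)
    return word


# Words taken from the verse sample earlier in this section.
converted = [to_official(w) for w in ["cawinaj", "cusiltaj", "yijbale"]]
```

A word list of known Spanish loans and proper names to skip would probably be needed before using anything like this on the full text.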
This is a set of texts used for work reported in Kuhn and Mateo-Toledo 2004 (Applying Computational Linguistic Techniques in a Documentary Project for Q'anjob'al (Mayan, Guatemala), LREC 2004).
xhapmat is divided into 7 separate files (raw text, needs preprocessing).
tagged contains a number of texts as tagged by the POS tagger developed by Kuhn. These are not gold-standard tags.
These are the texts from ongoing Q'anjob'al documentation projects. Numbers may be a bit high, as the texts have not yet been cleaned for our purposes.
Q'anjob'al: 26317 glossed, 56333 additional unglossed