Morphological Analysis Project

Reading list

Highly relevant

Toolbox, NLTK, and Python

May be something to check out

Papers on morphology and POS tagging

  • Tseng, Jurafsky and Manning's 2005 paper on how morphology helps POS tagging. It seems to be one of the few papers that deal with this subject.

Papers on unsupervised morphology induction

Must read

  • Schone and Jurafsky's 2000 paper on using word distribution (latent semantic analysis) to refine morphology induction.

Relevant

  • Schone and Jurafsky's 2001 paper that expands on their 2000 study. (recommended)
  • Baroni et al.'s 2002 paper that uses mutual information and minimal edit distance to induce morphology clusters.
  • Dayne Freitag's 2005 paper that uses information theoretic co-clustering to refine morphology induction. (recommended)

Papers on alignment and transfer methods

Must read

  • Yarowsky and Wicentowski's 2000 paper on lemmatization and morphological analysis through supervised methods. Provides background on Yarowsky et al.'s 2001 paper below.
  • Yarowsky and Ngai's 2001 paper on inducing multilingual POS taggers and NP bracketers via projection across aligned corpora.
  • Yarowsky et al.'s 2001 paper on inducing more NLP tools using the methods outlined above.

Relevant

  • Mona Diab's 2001 paper on word sense tagging using parallel corpora.
  • Drábek and Yarowsky's 2005 paper on inducing fine-grained POS taggers through alignment.

Software

  • GIZA++: A slightly revised version of the one available at Franz Och's website. The source has been modified so that it compiles with gcc versions 4.0 or later. The relevant paper can be found here.
  • Interlinear Text Editor: ITE is software designed particularly for interlinear glossing. The software was developed by Michel Jacobson. Good, thorough documentation, including users' manual and link to source code.

ITE Resources

Data

Tools and resources for Portuguese

NOTE: on the lab machines, Floresta (the Bosque portion) is available in /groups/corpora, and the smaller datasets from the CoNLL 2006 shared task are available in /groups/projects/earl/data. We also have a number of in-house tools (created by Ben Wing) for working with Floresta.

Linguateca: a clearinghouse of sorts for Portuguese NLP resources, with links to a number of them

  • CETEMPúblico: 180 million words of European Portuguese
  • CETENFolha: 24 million words of Brazilian Portuguese
  • COMPARA: parallel Portuguese-English corpus: roughly 150K words in each language; this page links to a web-based search interface. Availability of entire corpus not spelled out on website.
  • Floresta Sintá(c)tica: Portuguese treebank consisting of texts from both CETEMPúblico and CETENFolha, parsed with the Palavras parser. Floresta is broken into two subsets, Bosque and Floresta Virgem, both available from this same page.
  • Bosque: manually-corrected portion of Floresta, 9431 trees, roughly 180K words
  • Floresta Virgem: uncorrected, automatically-parsed treebank; 41K trees, over 1 million words

Lácio-Web

Tools and resources for Mayan languages

OKMA texts

Texts produced by La Asociación Oxlajuuj Keej Maya' Ajtz'iib'. IMPORTANT: Citation information below.

For now, here are the (uncleaned) numbers:

Awakateko: 81564 words glossed, 67637 additional unglossed
Sakapulteko: 59890 words glossed, 172709 additional unglossed
Tektiteko: 101185 words glossed, 97071 additional unglossed
Uspanteko: 73231 words glossed, 218617 additional unglossed

Examples

Texts are glossed in and translated into Spanish. The format shown below is the data as exported from Shoebox.

  • \ref annotator's reference number for the clause
  • \t raw text (with the exception of the clause boundary marker '+')
  • \m morphological segmentation
  • \g gloss line – combination of lemmas and morphological tags
  • \c seems to be primarily POS information, more or less
  • \l Spanish translation
Awakateko
\ref trtex001awa-parte1 012
\t pero  kyi  na     eel   qatxuum       tetz,+
\m poro  kye' na     eel   qa-    txuum  t-     eetz
\g pero  neg. INC    salir A1p-   pensar E3s-   de
\c conj. adv. procl. v.i.  pref.- v.t.   pref.- s. rel.
\l pero no lo entendemos
Sakapulteko
\t K'o    jun  chek  rii'... kasi qast          mas   etz'eneem         tziij,+
\m k'o    jun  chek  rii'    kasi qas   - taj   mas   etz'    en   eem  tziij
\g EXS    ART  PART  este    casi muy   - IRR   mas   jugar   AP   SS   palabra
\c exist. art. part. dem.    adv. part. - part. conj. no. cl. suf. suf. s.
\l Hay otro casi no es tan chistoso
Tektiteko
\ref trtex01.1tek03 009
\t Y     tzan  qaq'unan+
\m y     tzan  q-     aq'una   -n
\g y     PREP  E1p-   trabajar -AP
\c conj. prep. pers.- v.t.     -suf.
\l Y para que trabajemos
Uspanteko
\t juntir  chi'ntayik,+
\m juntiir ch-in-tay-ik
\g todo    PRE-E1s-escuchar-SV
\c adv.    prep.-pers.-v.t.-suf.
\l Todos los que me están escuchando.
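
As a rough illustration of working with this export format, the sketch below reads one record into a dictionary keyed by field marker and pairs each token on the \m line with the token in the same position on the \g line. The function names (parse_record, pair_morphemes) are hypothetical and are not part of any in-house tool. The sketch assumes that every field sits on a single line and that the \m and \g lines contain the same number of whitespace-separated tokens, which holds for the examples above but has not been checked against the full corpus.

```python
# Sample record copied from the Tektiteko example above.
SAMPLE = [
    r"\ref trtex01.1tek03 009",
    r"\t Y     tzan  qaq'unan+",
    r"\m y     tzan  q-     aq'una   -n",
    r"\g y     PREP  E1p-   trabajar -AP",
    r"\c conj. prep. pers.- v.t.     -suf.",
    r"\l Y para que trabajemos",
]


def parse_record(lines):
    """Map each field marker (without the backslash) to its content.

    Assumption: one field per line, marker and content separated by a space.
    """
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("\\"):
            continue  # skip labels or blank lines between records
        marker, _, content = line[1:].partition(" ")
        record[marker] = content.strip()
    return record


def pair_morphemes(record):
    """Pair \\m tokens with \\g tokens by position.

    Assumption: both lines split into the same number of tokens.
    """
    morphs = record["m"].split()
    glosses = record["g"].split()
    if len(morphs) != len(glosses):
        raise ValueError("\\m and \\g lines have different token counts")
    return list(zip(morphs, glosses))


if __name__ == "__main__":
    for morph, gloss in pair_morphemes(parse_record(SAMPLE)):
        print(morph, gloss)
```

Positional pairing like this would not work for the \c line as it stands: some category labels contain a space (for example "s. rel." in the Awakateko example), so that line would need column-based alignment instead.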

Citation Information

This data should be cited as follows:

Text Collections in Four Mayan Languages, 2003-2007
OKMA (Oxlajuuj Keej Maya' Ajtz'iib')
  • Awakateko
Supported by the Norwegian Ministry of External Relations

B'alam Mateo Toledo (coordinator)
Edna Patricia Delgado Rojas (coordinator)
Johanna Liseth Mendoza Solís
María Virginia Rodríguez Rodríguez
  • Sakapulteko
Supported by Endangered Languages Documentation Programme
(SOAS, University of London)

Romelia Mó Isém (coordinator)
Juan Carlos Vásquez Aceituno
Ana Luciana Arcón Puzul
Juan Adolfo Solís Baltazar
  • Tektiteko
Supported by the Norwegian Ministry of External Relations

José Reginaldo Pérez Vaíl (coordinator)
Erico Simón Morales
Ernesto Baltazar Gutiérrez
  • Uspanteko
Supported by Endangered Languages Documentation Programme
(SOAS, University of London)

Telma Can Pixabaj (coordinator)
Miguel Angel Vicente Méndez
María Vicente Méndez
Oswaldo Ajcot Damián 

Q'anjob'al

Q'anjob'al data is located in /groups/projects/earl/data/qanjobal

We have access to data from a number of sources, in various formats and with varied levels of annotation. NOTE: the texts vary in orthography and/or analysis. I've tried to group them below according to provenance.

Bible (not actually Q'anjob'al)

KSM-complete-NT.line-format is an Akateko translation of the New Testament, one verse per line. Each verse is numbered, with chapter numbers appearing only before the first verse of each chapter, as follows:

4:1 Catuý jix iýletoj Jesús yu Yespíritu Dios bey jun 
cusiltaj txýotxý, yutol oj akýle yijbale naj yin spenail 
yu naj diablo.
2 Cýam tzet jix sloý yin cawinaj cýual, cawinaj akýbal. 
Catuý jix tit swail Jesús.

This one file contains all 27 books of the New Testament with no explicit separation of the books. Each book of course begins with a line numbered 1:1.
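
The numbering scheme described above can be unpacked with a short script. The following is only a sketch under stated assumptions: the function name parse_verses is hypothetical, a line that does not begin with a verse number is treated as a continuation of the previous verse (in case verses wrap, as in the excerpt above), and a new book is counted each time a line numbered 1:1 appears. A continuation line that happens to begin with a number would be misread as a new verse, so the output should be spot-checked.

```python
import re

# Matches "4:1 text" (chapter and verse) or "2 text" (verse only).
VERSE_START = re.compile(r"^(?:(\d+):)?(\d+)\s+(.*)$")


def parse_verses(lines):
    """Return (book_index, chapter, verse, text) tuples.

    book_index counts the 1:1 lines seen so far, so it is 1 for the
    first book; it stays 0 if the input starts mid-book.
    """
    book = 0
    chapter = None
    verses = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = VERSE_START.match(line)
        if match:
            chap, verse, text = match.groups()
            if chap is not None:
                chapter = int(chap)
                if chapter == 1 and int(verse) == 1:
                    book += 1  # a line numbered 1:1 opens a new book
            verses.append([book, chapter, int(verse), text])
        elif verses:
            # Assumption: unnumbered lines continue the previous verse.
            verses[-1][3] += " " + line
    return [tuple(v) for v in verses]


if __name__ == "__main__":
    # Placeholder text, not taken from the actual file.
    demo = ["1:1 first verse", "2 second verse", "2:1 new chapter",
            "1:1 next book", "wrapped line", "2 another verse"]
    for entry in parse_verses(demo):
        print(entry)
```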

Here are some notes from B'alam re: this text, orthography, and comparison of Akateko and Q'anjob'al (some consider Akateko a distant dialect of Q'anjob'al, others consider it a separate language).

I checked the extract of the bible you sent me.  Here are some comments.

[1] It is written in Akateko.  Linguistically speaking this is a far distant dialect 
of Q'anjob'al. Some people (like the Academy of Mayan languages, the ministry of education, 
etc.) consider it a different Language.

There are several difference between what you would call Q'anjob'al and Akateko.  My 
undergraduate thesis is a description of these differences.

[2] The orthography used in the bible is the non-official one (which is basically 
the one that the SIL people used, and they still use it).

[3] the orthography is partially different from the one I use (or writing Q'anjob'al 
in general) for dialectal variations and because it is an old alphabet.

[3a] I have found the following correspondences between the official alphabet for Q'anjob'al 
and the one in the bible.  There could be other differences but this are the ones that I can 
recognize right away.  I haven't found a word with /xh/ but this would be another one to look for.

Q'anjob'al-bible
b'-b
k-c
k'-c'
h- (nothing)
q'-k', q'
q-j
j- (nothing)

[3b] things might look confusing with respect to /q, q', k',j/.  There is a whole sound 
movement going on with these sounds in Akateko.
-In Akateko, /q', k'/ are fusing into one single /k'/ sound. Those words with /q'/ in the 
bible might be the remaining words with /q'/.
-The /j/ in the *old q'anjob'al* was lost in Akateko.  In some contexts its lose was 
compensated with vowel lengthening (that is why there are some long vowels in there).
-The /q/ in the *OLD Q'ANJOB'AL* became a /j/ in Akateko (the old /j/ was lost and the 
old /q/ took its place and became a /j/ in the actual language).

All these changes results into:
a. There is long vowel in some contexts.
b. /q, q'/ are almost lost
c. The /q/ in q'anjob'al corresponds to /j/ in Akateko.

[3c] Then, the *real orthographic differences/mistakes* in the bible and Q'anjob'al would be:
b'-b
k-c,
k'-c'
h- (nothing)
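
The correspondences listed in [3c] could be applied mechanically as a first pass at normalizing the Bible text toward the official alphabet. The sketch below is deliberately naive and rests on assumptions not stated in the notes: the function name bible_to_official is hypothetical, "ch" is assumed to be a digraph and is left alone, and the glottal mark is assumed to be a plain apostrophe. The missing h cannot be restored this way, and the sound changes in [3b] are not handled at all. Spanish loanwords and proper names would be over-converted (for example, every plain b gains an apostrophe), so the output would need checking by someone who knows the language.

```python
import re


def bible_to_official(text):
    """Apply the [3c] correspondences, Bible spelling -> official alphabet.

    c  -> k   (also turns c' into k'); "ch" is left untouched (assumption)
    b  -> b'  (unless already followed by an apostrophe)
    """
    text = re.sub(r"c(?!h)", "k", text)
    text = re.sub(r"C(?!h)", "K", text)
    text = re.sub(r"b(?!')", "b'", text)
    text = re.sub(r"B(?!')", "B'", text)
    return text


if __name__ == "__main__":
    for word in ["cawinaj", "bey", "c'ual"]:
        print(word, bible_to_official(word))
```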

Earlier texts

This is a set of texts used for work reported in Kuhn and Mateo-Toledo 2004 (Applying Computational Linguistic Techniques in a Documentary Project for Q'anjob'al (Mayan, Guatemala), LREC 2004).

  • xhapmat is divided into 7 separate files (raw text, needs preprocessing)
  • the directory tagged contains a number of texts as tagged by the POS tagger developed by Kuhn. These are not gold-standard tags.

Current texts

These are the texts from ongoing Q'anjob'al documentation projects. Numbers may be a bit high, as the texts have not yet been cleaned for our purposes.

Q'anjob'al: 26317 glossed, 56333 additional unglossed
 
morph.txt · Last modified: 2008/05/12 17:03 (external edit)
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported