Task list: annotation tool

Task list: data

  • split and organize uspanteko data
  • clean up uspanteko data: multi-line clauses, morph-gloss alignments, untagged words

potential issues with data

  1. some loan (and other) words get no representation in the morph, gloss, and pos lines (\m, \g, \c)
  2. some glossed texts include occasional unglossed words or even clauses
  3. some clauses (i.e. single refID) carry over multiple sets of lines
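
The three issues above lend themselves to automated checks. The following is a minimal sketch, not the project's actual cleanup code. It assumes each clause has already been read into a list of (marker, text) pairs, that whitespace-separated tokens on the \m, \g and \c lines correspond one-to-one to words, and that the transcription line uses a marker named `t` (a hypothetical name; only \m, \g and \c are named above). The tokens in the example are placeholders, not Uspanteko data.

```python
from collections import Counter

TIERS = ("m", "g", "c")  # morph, gloss and pos lines named above


def check_clause(fields, text_marker="t"):
    """Return a list of problem descriptions for one clause.

    fields: list of (marker, text) pairs, markers without the backslash.
    text_marker: hypothetical marker of the transcription line.
    """
    problems = []

    # Issue 3: a marker occurring more than once means the clause
    # carries over multiple sets of lines.
    counts = Counter(marker for marker, _ in fields)
    repeated = sorted(m for m, n in counts.items() if n > 1)
    if repeated:
        problems.append("multiple line sets for: " + ", ".join(repeated))

    # Pool the tokens of each marker across all of its lines.
    tokens = {}
    for marker, text in fields:
        tokens.setdefault(marker, []).extend(text.split())

    n_words = len(tokens.get(text_marker, []))
    for tier in TIERS:
        n_tier = len(tokens.get(tier, []))
        if n_tier == 0 and n_words > 0:
            # Issue 2: the clause is entirely unrepresented on this tier.
            problems.append("tier %s is empty" % tier)
        elif n_tier != n_words:
            # Issue 1: some words have no representation on this tier.
            problems.append(
                "tier %s has %d tokens for %d words" % (tier, n_tier, n_words)
            )
    return problems


# Placeholder clause: the second word is missing from \m, \g and \c.
example = [
    ("t", "w1 w2 w3"),
    ("m", "m1 m3"),
    ("g", "g1 g3"),
    ("c", "c1 c3"),
]
for problem in check_clause(example):
    print(problem)
```

A count mismatch only says that something is missing on a tier, not which word; locating the unrepresented word would still need manual inspection or a word-level alignment.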

data clean-up and preprocessing

This section is intended to be a log of the changes we make to the OKMA data in order to work with it in this project. These changes will include both general cleanup, which may be folded into the eventual distribution of these data, and EARL-specific preprocessing.

May 17 [AP] -- changed \nom to \_nom in glossed files.
reason: the NLTK toolbox reader takes the first tag that does **not** begin with '_'
to be the record marker, i.e. the tag indicating the start of a new record
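
The rename and the rule behind it can be sketched in a few lines. This is an illustration only, not the script that was used and not NLTK's own code: `first_record_marker` re-implements the rule as stated above, and the sample file contents (including the `\ref` and `\t` markers and the header line) are placeholders.

```python
import re


def rename_marker(text, old="nom", new="_nom"):
    """Rename a field marker at the start of each line, e.g. \\nom -> \\_nom."""
    # Match the marker only as a whole word at the beginning of a line.
    pattern = re.compile(r"^\\" + re.escape(old) + r"(?=\s|$)", flags=re.MULTILINE)
    return pattern.sub(lambda match: "\\" + new, text)


def first_record_marker(text):
    """Return the first marker that does not begin with '_' (rule stated above)."""
    for line in text.splitlines():
        match = re.match(r"\\(\S+)", line)
        if match and not match.group(1).startswith("_"):
            return match.group(1)
    return None


# Placeholder file contents; marker names other than \nom are assumptions.
sample = "\n".join([
    r"\_sh placeholder header",
    r"\nom placeholder",
    r"\ref 001",
    r"\t w1 w2",
])
renamed = rename_marker(sample)

print(first_record_marker(sample))   # nom
print(first_record_marker(renamed))  # ref
```

Before the rename, \nom would be picked as the record marker; afterwards the next marker without a leading underscore is picked instead.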

the data

Data are located in …/earlite/data/uspanteko
For each data set (train/dev/test) there are two directories: tb (Shoebox/Toolbox format) and igtxml (IGT-XML)

cat     #words  #clauses  avg. clause  texts
TRAIN    38802      8099  4.79 wds     030,035,036,037,049,050,052,053,054,055,
                                       056,057,059,063,066,067,068,071,072,076,077
DEV      16792      3847  4.36 wds     020,022,023,025,029
TEST     18704      3785  4.94 wds     001,002,004,008a,008b,014,016
TRANS     7361                         005,033
RAW     210157                         003,006,007,009,010,011,012,013,017,018,
                                       019,021,024,026,027,031,032,034,041,047,
                                       048,060,061,062,064,069,070,073,074,075,
                                       080,081,110
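
The "avg. clause" column is the word count divided by the clause count. As a small worked check, using only the figures from the table above (the function name is illustrative):

```python
# (#words, #clauses) per data set, copied from the table above
SUMMARY = {
    "TRAIN": (38802, 8099),
    "DEV": (16792, 3847),
    "TEST": (18704, 3785),
}


def avg_clause_length(words, clauses):
    """Average number of words per clause, rounded to two decimals."""
    return round(words / clauses, 2)


for name, (words, clauses) in SUMMARY.items():
    print(name, avg_clause_length(words, clauses))
# TRAIN 4.79
# DEV 4.36
# TEST 4.94
```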

Information re: individual texts

text#   #words  #clauses  cat    genre                status
001       1921       343  TEST   oral history
002       1976       403  TEST   story
003       8740            RAW
004       4619      1207  TEST   personal experience
005       3438            TRANS                       translation only
006       1914            RAW
007       1937            RAW
008a+b    6673      1137  TEST   personal experience  glossing incomplete, split (by OKMA) into two files
009       3135            RAW
010       4328            RAW
011       5680            RAW
012       2456            RAW
013       1414            RAW
014       1740       381  TEST   story
016       1775       314  TEST   story
017       7463            RAW
018       7465            RAW
019       6600            RAW
020       3153       669  DEV    advice               glossing incomplete
021      14654            RAW
022       3782       858  DEV    oral history         conversion error
023       3576       872  DEV    personal experience  conversion error
024       8773            RAW
025       4022      1018  DEV    story
026       2237            RAW
027       5957            RAW
029       2259       430  DEV    personal experience  glossing incomplete
030       1615       321  TRAIN  story
031       5160            RAW
032       7506            RAW
033       3923            TRANS                       translation only
034       8258            RAW
035       2525       587  TRAIN  story
036        807       147  TRAIN  story
037       2063       460  TRAIN  story
041      15063            RAW
047      14340            RAW
048       4712            RAW
049       1071       254  TRAIN  story
050        531        96  TRAIN  oral history         conversion error
052        798       177  TRAIN  story
053       1240       311  TRAIN  story
054       1560       331  TRAIN  story
055        893       178  TRAIN  story
056       1101       245  TRAIN  story
057        849       208  TRAIN  story
059       2142       468  TRAIN  oral history
060       8227            RAW
061       5806            RAW
062       8351            RAW
063       2694       662  TRAIN  story
064       6801            RAW
066       2439       434  TRAIN  recipe
067       3821       505  TRAIN  oral history
068       3196       571  TRAIN  personal experience
069        481            RAW
070       6614            RAW
071       2668       570  TRAIN  story                added line breaks btw. clauses
072       2603       589  TRAIN  story                added line breaks btw. clauses
073       4743            RAW
074       4331            RAW
075       6489            RAW
076       2012       486  TRAIN  story
077       2174       499  TRAIN  story                conversion error
080       7661            RAW
081       8175            RAW
110       4686            RAW
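
The per-text counts can be cross-checked against the per-category totals given earlier. The sketch below copies the glossed rows (TRAIN, DEV, TEST) from the table above and sums them by category; the function name and data layout are illustrative only.

```python
from collections import defaultdict

# (text, #words, #clauses, cat) for the glossed texts, copied from the table above
ROWS = [
    ("001", 1921, 343, "TEST"), ("002", 1976, 403, "TEST"),
    ("004", 4619, 1207, "TEST"), ("008a+b", 6673, 1137, "TEST"),
    ("014", 1740, 381, "TEST"), ("016", 1775, 314, "TEST"),
    ("020", 3153, 669, "DEV"), ("022", 3782, 858, "DEV"),
    ("023", 3576, 872, "DEV"), ("025", 4022, 1018, "DEV"),
    ("029", 2259, 430, "DEV"),
    ("030", 1615, 321, "TRAIN"), ("035", 2525, 587, "TRAIN"),
    ("036", 807, 147, "TRAIN"), ("037", 2063, 460, "TRAIN"),
    ("049", 1071, 254, "TRAIN"), ("050", 531, 96, "TRAIN"),
    ("052", 798, 177, "TRAIN"), ("053", 1240, 311, "TRAIN"),
    ("054", 1560, 331, "TRAIN"), ("055", 893, 178, "TRAIN"),
    ("056", 1101, 245, "TRAIN"), ("057", 849, 208, "TRAIN"),
    ("059", 2142, 468, "TRAIN"), ("063", 2694, 662, "TRAIN"),
    ("066", 2439, 434, "TRAIN"), ("067", 3821, 505, "TRAIN"),
    ("068", 3196, 571, "TRAIN"), ("071", 2668, 570, "TRAIN"),
    ("072", 2603, 589, "TRAIN"), ("076", 2012, 486, "TRAIN"),
    ("077", 2174, 499, "TRAIN"),
]


def totals_by_category(rows):
    """Sum words and clauses per category; returns {cat: (words, clauses)}."""
    totals = defaultdict(lambda: [0, 0])
    for _text, words, clauses, cat in rows:
        totals[cat][0] += words
        totals[cat][1] += clauses
    return {cat: tuple(pair) for cat, pair in totals.items()}


for cat, (words, clauses) in sorted(totals_by_category(ROWS).items()):
    print(cat, words, clauses)
# DEV 16792 3847
# TEST 18704 3785
# TRAIN 38802 8099
```

The sums agree with the TRAIN, DEV and TEST rows of the category table.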

Genre categories supplied by Telma Caan:

  • story : kids' stories, folk stories, etc.
  • oral history : usually have to do with history of the village and the community
  • personal experience : stories from people's lives
  • recipe/advice : recipe is self-explanatory, the story labeled 'advice' talks about how things should be done in order to take care of nature and the environment

New clause info

text#   div    cl.tot   cl.g   cl.b  genre
001     test                         oral history
002     test                         story
004     test                         personal experience
008a    test                         personal experience
014     test                         story
016     test                         story
020     dev                          advice
022     dev                          oral history
023     dev                          personal experience
025     dev                          story
029     dev                          personal experience
030     train     322    318      4  story
035     train                        story
036     train                        story
037     train                        story
049     train                        story
050     train                        oral history
052     train                        story
053     train                        story
054     train                        story
055     train                        story
056     train                        story
057     train                        story
059     train                        oral history
063     train                        story
066     train                        recipe
067     train                        oral history
068     train                        personal experience
071     train                        story
072     train                        story
076     train                        story
077     train                        story
 
morph/earlite/earl-ite_annotation_tool.txt · Last modified: 2008/11/04 01:26 (external edit)
 
Except where otherwise noted, content on this wiki is licensed under the following license:CC Attribution-Noncommercial-Share Alike 3.0 Unported