This section is intended to be a log of the changes we make to the OKMA data in order to work with it in this project. These changes will include both general cleanup, which may be folded into the eventual distribution of these data, and EARL-specific preprocessing.
May 17 [AP] -- changed \nom to \_nom in glossed files. Reason: the NLTK toolbox reader takes the first tag that does **not** begin with '_' to be the tag indicating the start of a new record.
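
The rename above could be scripted roughly as follows. This is a minimal standard-library sketch, not the procedure actually used: it assumes markers sit at the start of a line as a backslash plus a name, and `first_record_marker` only mimics the rule described in the log entry rather than calling NLTK itself. The sample record and its `\ref` and `\tx` fields are made up for illustration.

```python
import re

# Assumption: a marker is a backslash plus a name at the start of a line.
# The lookahead keeps longer markers that merely start with "nom" untouched.
NOM_MARKER = re.compile(r"^\\nom(?=\s|$)", flags=re.MULTILINE)


def rename_nom(text: str) -> str:
    """Rename every line-initial \\nom marker to \\_nom."""
    # A function is used as the replacement so the backslash is taken literally.
    return NOM_MARKER.sub(lambda match: "\\_nom", text)


def first_record_marker(text: str):
    """Mimic the rule from the log entry: return the first marker
    whose name does not begin with an underscore."""
    for line in text.splitlines():
        match = re.match(r"\\(\S+)", line)
        if match and not match.group(1).startswith("_"):
            return match.group(1)
    return None


# Hypothetical record, for illustration only.
sample = "\\nom text_030\n\\ref 030.001\n\\tx some text\n"

print(first_record_marker(sample))              # marker chosen before the rename
print(first_record_marker(rename_nom(sample)))  # marker chosen after the rename
```

Under these assumptions the record marker shifts from `nom` to the next field once the rename is applied, which is the effect the log entry describes.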
Data located in …/earlite/data/uspanteko
For each data set (train/dev/test) there are two directories: tb (shoebox/toolbox format) and igtxml (IGT-XML)
| cat | #words | #clauses | avg. clause | texts |
|---|---|---|---|---|
| TRAIN | 38802 | 8099 | 4.79 wds | 030,035,036,037,049,050,052,053,054,055,056,057,059,063,066,067,068,071,072,076,077 |
| DEV | 16792 | 3847 | 4.36 wds | 020,022,023,025,029 |
| TEST | 18704 | 3785 | 4.94 wds | 001,002,004,008a,008b,014,016 |
| TRANS | 7361 | | | 005,033 |
| RAW | 210157 | | | 003,006,007,009,010,011,012,013,017,018,019,021,024,026,027,031,032,034,041,047,048,060,061,062,064,069,070,073,074,075,080,081,110 |
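
The "avg. clause" column is the word count divided by the clause count. A small sketch, using only the totals from the table above, shows how the figures can be re-derived; the names `SPLITS` and `avg_clause_length` are illustrative and not part of any project script.

```python
# (category, #words, #clauses) copied from the summary table above.
SPLITS = [
    ("TRAIN", 38802, 8099),
    ("DEV", 16792, 3847),
    ("TEST", 18704, 3785),
]


def avg_clause_length(words: int, clauses: int) -> float:
    """Average number of words per clause, rounded to two decimals."""
    return round(words / clauses, 2)


for cat, words, clauses in SPLITS:
    print(f"{cat}: {avg_clause_length(words, clauses)} wds")
```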
Information re: individual texts
| text# | #words | #clauses | cat | genre | status |
|---|---|---|---|---|---|
| 001 | 1921 | 343 | TEST | oral history | |
| 002 | 1976 | 403 | TEST | story | |
| 003 | 8740 | | RAW | | |
| 004 | 4619 | 1207 | TEST | personal experience | |
| 005 | 3438 | | TRANS | | translation only |
| 006 | 1914 | | RAW | | |
| 007 | 1937 | | RAW | | |
| 008a+b | 6673 | 1137 | TEST | personal experience | glossing incomplete, split (by OKMA) into two files |
| 009 | 3135 | | RAW | | |
| 010 | 4328 | | RAW | | |
| 011 | 5680 | | RAW | | |
| 012 | 2456 | | RAW | | |
| 013 | 1414 | | RAW | | |
| 014 | 1740 | 381 | TEST | story | |
| 016 | 1775 | 314 | TEST | story | |
| 017 | 7463 | | RAW | | |
| 018 | 7465 | | RAW | | |
| 019 | 6600 | | RAW | | |
| 020 | 3153 | 669 | DEV | advice | glossing incomplete |
| 021 | 14654 | | RAW | | |
| 022 | 3782 | 858 | DEV | oral history | conversion error |
| 023 | 3576 | 872 | DEV | personal experience | conversion error |
| 024 | 8773 | | RAW | | |
| 025 | 4022 | 1018 | DEV | story | |
| 026 | 2237 | | RAW | | |
| 027 | 5957 | | RAW | | |
| 029 | 2259 | 430 | DEV | personal experience | glossing incomplete |
| 030 | 1615 | 321 | TRAIN | story | |
| 031 | 5160 | | RAW | | |
| 032 | 7506 | | RAW | | |
| 033 | 3923 | | TRANS | | translation only |
| 034 | 8258 | | RAW | | |
| 035 | 2525 | 587 | TRAIN | story | |
| 036 | 807 | 147 | TRAIN | story | |
| 037 | 2063 | 460 | TRAIN | story | |
| 041 | 15063 | | RAW | | |
| 047 | 14340 | | RAW | | |
| 048 | 4712 | | RAW | | |
| 049 | 1071 | 254 | TRAIN | story | |
| 050 | 531 | 96 | TRAIN | oral history | conversion error |
| 052 | 798 | 177 | TRAIN | story | |
| 053 | 1240 | 311 | TRAIN | story | |
| 054 | 1560 | 331 | TRAIN | story | |
| 055 | 893 | 178 | TRAIN | story | |
| 056 | 1101 | 245 | TRAIN | story | |
| 057 | 849 | 208 | TRAIN | story | |
| 059 | 2142 | 468 | TRAIN | oral history | |
| 060 | 8227 | | RAW | | |
| 061 | 5806 | | RAW | | |
| 062 | 8351 | | RAW | | |
| 063 | 2694 | 662 | TRAIN | story | |
| 064 | 6801 | | RAW | | |
| 066 | 2439 | 434 | TRAIN | recipe | |
| 067 | 3821 | 505 | TRAIN | oral history | |
| 068 | 3196 | 571 | TRAIN | personal experience | |
| 069 | 481 | | RAW | | |
| 070 | 6614 | | RAW | | |
| 071 | 2668 | 570 | TRAIN | story | added line breaks btw. clauses |
| 072 | 2603 | 589 | TRAIN | story | added line breaks btw. clauses |
| 073 | 4743 | | RAW | | |
| 074 | 4331 | | RAW | | |
| 075 | 6489 | | RAW | | |
| 076 | 2012 | 486 | TRAIN | story | |
| 077 | 2174 | 499 | TRAIN | story | conversion error |
| 080 | 7661 | | RAW | | |
| 081 | 8175 | | RAW | | |
| 110 | 4686 | | RAW | | |
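
The per-text counts can be cross-checked against the summary table by summing them per category. The sketch below does this for the DEV and TEST rows only, with the numbers copied from the table above; `ROWS` and `totals_by_category` are illustrative names, not existing project code.

```python
from collections import defaultdict

# (text#, #words, #clauses, cat) copied from the DEV and TEST rows above.
ROWS = [
    ("001", 1921, 343, "TEST"),
    ("002", 1976, 403, "TEST"),
    ("004", 4619, 1207, "TEST"),
    ("008a+b", 6673, 1137, "TEST"),
    ("014", 1740, 381, "TEST"),
    ("016", 1775, 314, "TEST"),
    ("020", 3153, 669, "DEV"),
    ("022", 3782, 858, "DEV"),
    ("023", 3576, 872, "DEV"),
    ("025", 4022, 1018, "DEV"),
    ("029", 2259, 430, "DEV"),
]


def totals_by_category(rows):
    """Sum word and clause counts for each category."""
    totals = defaultdict(lambda: [0, 0])
    for _text_id, words, clauses, cat in rows:
        totals[cat][0] += words
        totals[cat][1] += clauses
    return {cat: tuple(counts) for cat, counts in totals.items()}


for cat, (words, clauses) in totals_by_category(ROWS).items():
    print(cat, words, clauses)
```

The resulting sums match the DEV and TEST lines of the summary table (16792 words / 3847 clauses and 18704 words / 3785 clauses).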
Genre categories were supplied by Telma Caan.
New clause info
| text# | div | cl.tot | cl.g | cl.b | genre |
|---|---|---|---|---|---|
| 001 | test | | | | oral history |
| 002 | test | | | | story |
| 004 | test | | | | personal experience |
| 008a | test | | | | personal experience |
| 014 | test | | | | story |
| 016 | test | | | | story |
| 020 | dev | | | | advice |
| 022 | dev | | | | oral history |
| 023 | dev | | | | personal experience |
| 025 | dev | | | | story |
| 029 | dev | | | | personal experience |
| 030 | train | 322 | 318 | 4 | story |
| 035 | train | | | | story |
| 036 | train | | | | story |
| 037 | train | | | | story |
| 049 | train | | | | story |
| 050 | train | | | | oral history |
| 052 | train | | | | story |
| 053 | train | | | | story |
| 054 | train | | | | story |
| 055 | train | | | | story |
| 056 | train | | | | story |
| 057 | train | | | | story |
| 059 | train | | | | oral history |
| 063 | train | | | | story |
| 066 | train | | | | recipe |
| 067 | train | | | | oral history |
| 068 | train | | | | personal experience |
| 071 | train | | | | story |
| 072 | train | | | | story |
| 076 | train | | | | story |
| 077 | train | | | | story |