Use this page to propose or vote for corpora to order from the LDC for 2006, 2007, and 2008, the years for which we are now getting memberships. Membership for a given year entitles us to 16 corpora from that year. Since the 2006 and 2007 catalogs are complete, we should select everything we want from those years now.

For 2008, list only corpora you definitely need, not ones that would merely be nice to have. More corpora will still be released this year, so we want to reserve some of the 16 slots for those as they become available.

You may want to look at what we already have available from previous years.

Go to the LDC Catalog to see what is available in each year. Note that some corpora (like TimeBank) are actually free, so we don't need to list them here.

Put your name down if you want something, even if someone else has already put it on the list. This will help us decide in case more than 16 corpora are selected for a year.

2009

Selected

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^

Wishlist

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| LDC2009T13 | English Gigaword V 4 | Jason (medium), Katrin (medium), Joe (high), Matt (medium) |
| LDC2009T04 | BioProp | Elias (somewhat) |
| LDC2009T03 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 | Taesun (interested) |
| LDC2009T09 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 | Taesun (interested) |
| LDC2009T02 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 | Taesun (interested) |
| LDC2009T06 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 | Taesun (interested) |
| LDC2009T15 | GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 | Taesun (interested) |
| LDC2009T08 | Japanese Web N-gram Version 1 | Joe (low) |
| LDC2009T27 | Chinese Gigaword Fourth Edition | Joe (med) |
| LDC2009T25 | Web 1T 5-gram, 10 European Languages | Katrin (medium) |
| LDC2009T24 | OntoNotes Release 3.0 | Katrin (high) |
| LDC2009T23 | FactBank 1.0 (free for non-members) | Joey (medium) |
| LDC2009T10 | Language Understanding Annotation Corpus | Joey (high) |
| LDC2009T11 | REFLEX Entity Translation Training/DevTest | Joey (high) |

Restricted License Corpora

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^

2008

Selected

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| LDC2008T05 | Penn Discourse Treebank Version 2 | Jason (high priority), Steve (interested), Joey (interested), Lars (high priority) |
| LDC2008T04 | OntoNotes Release 2.0 | Jason, Katrin (high priority), Dan (interested) |
| LDC2008T01 | Hungarian-English Parallel Text, Version 1.0 | Taesun (somewhat interested) |

Wishlist

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| LDC2008T23 | NomBank v 1.0 | Katrin (medium) |
| LDC2008T13 | BLLIP North American News Text, Complete | Elias (medium) |
| LDC2008T21 | PennBioIE Oncology 1.0 | Elias (medium) |
| LDC2008T20 | PennBioIE CYP 1.0 | Elias (medium) |
| LDC2008T03 | English SpatialML | Jason (priority), Matt (medium) |
| LDC2008T19 | NYT Annotated Corpus | Jason (priority), Joey (medium), Matt (high) |
| LDC2008T25 | AQUAINT-2 IR Text Research Collection | Jason (medium), Elias (medium), Matt (medium+) |

Restricted License Corpora

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^

Previous Years -- Here for historical purposes only

2007

Selected

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| LDC2007T02 | English Chinese Translation Treebank v 1.0 | Jason (high priority), Taesun (somewhat interested) |
| LDC2007T24 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 | James (high interest), Fred (medium interest), Taesun (somewhat interested) |
| LDC2007T23 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 | Jason (medium interest), Taesun (somewhat interested) |
| LDC2007T07 | English Gigaword Third Edition | Jason, Katrin, Steve (medium priority), Lars (high priority) |
| LDC2007S01 | Levantine Arabic Conversational Telephone Speech | Farzan (high priority), Fred (medium priority) |
| LDC2007T01 | Levantine Arabic Conversational Telephone Speech, Transcripts | Farzan (high priority), Fred (medium priority) |
| LDC2007T40 | Arabic Gigaword Third Edition | Farzan (high priority), Fred (medium priority) |
| LDC2007T08 | ISI Arabic-English Automatically Extracted Parallel Text | James (very interested), Elias (somewhat interested), Taesun (somewhat interested) |
| LDC2007T09 | ISI Chinese-English Automatically Extracted Parallel Text | Jason, Katrin (very interested), Elias (somewhat interested), Taesun (somewhat interested) |
| LDC2007T38 | Chinese Gigaword Third Edition | Jason, Katrin (very interested), I-Wen (somewhat interested) |

Wishlist

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| LDC2007T36 | Chinese Treebank 6.0 (CTB6.0) | Jason (somewhat interested) |
| LDC2007V01 | TRECVID 2005 Keyframes & Transcripts | I-Wen, Sudipta (interested) |
| LDC2007V02 | TRECVID 2003 Keyframes & Transcripts | Sudipta (interested) |
| LDC2007T03 | Tagged Chinese Gigaword | Elias (interested) |

Restricted License Corpora

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| LDC2007T22 | 2001 Topic Annotated Enron Email Data Set | Jason, Elias (somewhat interested), Steve (interested), Taesun (interested), Lars (high priority), Matt (somewhat interested) |
| LDC2007S08 | CSLU: Foreign Accented English Release 1.2 | Jason (priority), Taesun (somewhat interested) |
| LDC2007S18 | CSLU: Kids' Speech Version 1.1 | Jason (priority), I-Wen (interested) |
| LDC2007S13 | CSLU: Apple Words and Phrases | Jason (medium priority) |
| LDC2007S09 | Mandarin Affective Speech | Jason (medium priority) |
| LDC2007S15 | Nationwide Speech Project | I-Wen (interested) |
| LDC2007T19 | MITRE 1997 Mandarin Broadcast News Speech | I-Wen (somewhat interested) |

2006

Selected

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| LDC2006T17 | French Gigaword First Edition | David, Emmy, Knud and Luis (high priority) |
| LDC2006S31 | NIST 2003 Language Recognition Evaluation | David, Emmy, Knud and Luis (high priority) |
| LDC2006T12 | Spanish Gigaword First Edition | Fred, David, Emmy, Knud and Luis (high priority), Jason, Fred, I-Wen (somewhat interested) |
| LDC2006S37 | West Point Heroico Spanish Speech | David, Emmy, Knud and Luis (high priority) |
| LDC2006S34 | Russian through Switched Telephone Network | Rajka (interested) |
| LDC2006S36 | West Point Korean Speech | Rajka (somewhat interested) |
| LDC2006T06 | ACE 2005 Multilingual Training Corpus | Jason, Katrin (high interest), Steve (somewhat interested) |
| LDC2006T18 | TDT5 Multilingual Text | Steve (somewhat interested), I-Wen (interested) |
| LDC2006T19 | TDT5 Topics and Annotations | Elias, Steve (somewhat interested), Matt (somewhat interested) |
| LDC2006S29 | Levantine Arabic QT Training Data Set 5, Speech | Fred (priority) |
| LDC2006T07 | Levantine Arabic QT Training Data Set 5, Transcripts | Fred (priority) |

Wishlist

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| LDC2006S37 | West Point Heroico Spanish Speech | I-Wen (somewhat interested) |

Restricted License Corpora

^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| LDC2006T01 | Prague Dependency Treebank 2.0 | Jason, Katrin, Elias (medium interest), Joey (interested) |
| LDC2006S35 | CSLU: Multilanguage Telephone Speech Version 1.2 | David, Emmy, Knud and Luis (high priority), I-Wen (interested) |
| LDC2006T13 | Web 1T 5-gram Version 1 | Sudipta, Steve (interested), Dan (somewhat interested) |
| LDC2006S14 | CSLU: Stories v 1.2 | Jason, I-Wen (interested) |
| LDC2006S13 | N4 NATO Native and Non-Native Speech | I-Wen (interested) |

Previous years

Note: we are not considering these for the current order, but they will be significantly cheaper to purchase once we have the 2008 membership.

^ Year ^ LDC ID ^ Corpus Name ^ People/groups interested (indicate level of interest) ^
| 1996 | LDC96L14 | CELEX2 | Scott (very interested) |
 
ldc_order.txt · Last modified: 2009/11/02 15:13 by matt
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported