Use this page to list and/or vote for the corpora we should get from the LDC for 2006, 2007, and 2008, the years for which we are now getting memberships. Membership for a given year entitles us to 16 corpora from that year, and since the 2006 and 2007 releases are complete, we should go ahead and select everything we want from those years now.
For 2008, list only corpora you definitely need, not ones you would merely like to have around. More corpora will still be released this year, so we want to reserve some of the remaining 16 for those as they become available.
You may want to look at what we already have available from previous years.
Go to the LDC Catalog to see what is available in each year. Note that some corpora (like TimeBank) are actually free, so we don't need to list them here.
Make sure to put your name down if you want something, even if someone else has already put it on the list. This will help us decide in case more than 16 corpora are selected.
| LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|
| LDC2009T13 | English Gigaword V 4 | Jason (medium), Katrin (medium), Joe (high), Matt (medium) |
| LDC2009T04 | BioProp | Elias (somewhat) |
| LDC2009T03 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 | Taesun (interested) |
| LDC2009T09 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 | Taesun (interested) |
| LDC2009T02 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 | Taesun (interested) |
| LDC2009T06 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 | Taesun (interested) |
| LDC2009T15 | GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 | Taesun (interested) |
| LDC2009T08 | Japanese Web N-gram Version 1 | Joe (low) |
| LDC2009T27 | Chinese Gigaword Fourth Edition | Joe (med) |
| LDC2009T25 | Web 1T 5-gram, 10 European Languages | Katrin (medium) |
| LDC2009T24 | OntoNotes Release 3.0 | Katrin (high) |
| LDC2009T23 | FactBank 1.0 (free for non-members) | Joey (medium) |
| LDC2009T10 | Language Understanding Annotation Corpus | Joey (high) |
| LDC2009T11 | REFLEX Entity Translation Training/DevTest | Joey (high) |
| LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|
| LDC2008T05 | Penn Discourse Treebank Version 2 | Jason (High priority), Steve (interested), Joey (interested), Lars (high priority) |
| LDC2008T04 | OntoNotes Release 2.0 | Jason, Katrin (High priority), Dan (interested) |
| LDC2008T01 | Hungarian-English Parallel Text, Version 1.0 | Taesun (somewhat interested) |
| LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|
| LDC2008T23 | NomBank v 1.0 | Katrin (medium) |
| LDC2008T13 | BLLIP North American News Text, Complete | Elias (medium) |
| LDC2008T21 | PennBioIE Oncology 1.0 | Elias (medium) |
| LDC2008T20 | PennBioIE CYP 1.0 | Elias (medium) |
| LDC2008T03 | English SpatialML | Jason (priority), Matt (medium) |
| LDC2008T19 | NYT Annotated Corpus | Jason (priority), Joey (medium), Matt (high) |
| LDC2008T25 | AQUAINT-2 IR Text Research Collection | Jason (medium), Elias (medium), Matt (medium+) |
| LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|
| LDC2007T02 | English Chinese Translation Treebank v 1.0 | Jason (high priority), Taesun (somewhat interested) |
| LDC2007T24 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 | James (high interest), Fred (medium interest), Taesun (somewhat interested) |
| LDC2007T23 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 | Jason (medium interest), Taesun (somewhat interested) |
| LDC2007T07 | English Gigaword Third Edition | Jason, Katrin, Steve (medium priority), Lars (high priority) |
| LDC2007S01 | Levantine Arabic Conversational Telephone Speech | Farzan (high priority), Fred (medium priority) |
| LDC2007T01 | Levantine Arabic Conversational Telephone Speech, Transcripts | Farzan (high priority), Fred (medium priority) |
| LDC2007T40 | Arabic Gigaword Third Edition | Farzan (high priority), Fred (medium priority) |
| LDC2007T08 | ISI Arabic-English Automatically Extracted Parallel Text | James (very interested), Elias (somewhat interested), Taesun (somewhat interested) |
| LDC2007T09 | ISI Chinese-English Automatically Extracted Parallel Text | Jason, Katrin (very interested), Elias (somewhat interested), Taesun (somewhat interested) |
| LDC2007T38 | Chinese Gigaword Third Edition | Jason, Katrin (very interested), I-Wen (somewhat interested) |
| LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|
| LDC2007T36 | Chinese Treebank 6.0 (CTB6.0) | Jason (somewhat interested) |
| LDC2007V01 | TRECVID 2005 Keyframes & Transcripts | I-Wen, Sudipta (interested) |
| LDC2007V02 | TRECVID 2003 Keyframes & Transcripts | Sudipta (interested) |
| LDC2007T03 | Tagged Chinese Gigaword | Elias (interested) |
| LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|
| LDC2007T22 | 2001 Topic Annotated Enron Email Data Set | Jason, Elias (somewhat interested), Steve (interested), Taesun (interested), Lars (high priority), Matt (somewhat interested) |
| LDC2007S08 | CSLU: Foreign Accented English Release 1.2 | Jason (priority), Taesun (somewhat interested) |
| LDC2007S18 | CSLU: Kids' Speech Version 1.1 | Jason (priority), I-Wen (interested) |
| LDC2007S13 | CSLU: Apple Words and Phrases | Jason (medium priority) |
| LDC2007S09 | Mandarin Affective Speech | Jason (medium priority) |
| LDC2007S15 | Nationwide Speech Project | I-Wen (interested) |
| LDC2007T19 | MITRE 1997 Mandarin Broadcast News Speech | I-Wen (somewhat interested) |
| LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|
| LDC2006T17 | French Gigaword First Edition | David, Emmy, Knud and Luis (high priority) |
| LDC2006S31 | NIST 2003 Language Recognition Evaluation | David, Emmy, Knud and Luis (high priority) |
| LDC2006T12 | Spanish Gigaword First Edition | Fred, David, Emmy, Knud and Luis (high priority), Jason, Fred, I-Wen (somewhat interested) |
| LDC2006S37 | West Point Heroico Spanish Speech | David, Emmy, Knud and Luis (high priority) |
| LDC2006S34 | Russian through Switched Telephone Network | Rajka (interested) |
| LDC2006S36 | West Point Korean Speech | Rajka (somewhat interested) |
| LDC2006T06 | ACE 2005 Multilingual Training Corpus | Jason, Katrin (high interest), Steve (somewhat interested) |
| LDC2006T18 | TDT5 Multilingual Text | Steve (somewhat interested), I-Wen (interested) |
| LDC2006T19 | TDT5 Topics and Annotations | Elias, Steve (somewhat interested), Matt (somewhat interested) |
| LDC2006S29 | Levantine Arabic QT Training Data Set 5, Speech | Fred (priority) |
| LDC2006T07 | Levantine Arabic QT Training Data Set 5, Transcripts | Fred (priority) |
| LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|
| LDC2006S37 | West Point Heroico Spanish Speech | I-Wen (somewhat interested) |
| LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|
| LDC2006T01 | Prague Dependency Treebank 2.0 | Jason, Katrin, Elias (medium interest), Joey (interested) |
| LDC2006S35 | CSLU: Multilanguage Telephone Speech Version 1.2 | David, Emmy, Knud and Luis (high priority), I-Wen (interested) |
| LDC2006T13 | Web 1T 5-gram Version 1 | Sudipta, Steve (interested), Dan (somewhat interested) |
| LDC2006S14 | CSLU: Stories v 1.2 | Jason, I-Wen (interested) |
| LDC2006S13 | N4 NATO Native and Non-Native Speech | I-Wen (interested) |
Note: we are not considering these for the current order, but they will be significantly cheaper to purchase once we have the 2008 membership.
| Year | LDC ID | Corpus Name | People/groups interested (indicate level of interest) |
|---|---|---|---|
| 1996 | LDC96L14 | CELEX2 | Scott (very interested) |