Resources for Corpus Linguistics

Peyton Todd posted the following information about references and resources for corpus linguistics, with thanks to Roger Levy, Maria Giagkou, Balint Tanos, Aida Zitouni, Holly Jacobson, Cedric Krummes, Karen Englander, Gill Philip, Martin Volk, N. Wiedenmann, and Josh Viau.

Feel free to add more info to this wiki.

BOOKS AND ARTICLES

  1. Baker, Paul (2006). Using Corpora in Discourse Analysis. London: Continuum. (ISBN: 0-8264-7725-9)
  2. Biber, Douglas. Dimensions of Register Variation using Multifeature/multidimensional analysis.
  3. Hunston, S. & G. Francis. Pattern Grammar. (J. Benjamins)
  4. Meyer, Charles F. (2002). English Corpus Linguistics: An Introduction. Cambridge University Press. (ISBN: 052100490X)
  5. Roland, Douglas, Frederic Dick, and Jeffrey L. Elman (2007). Frequency of basic English grammatical structures: A corpus analysis. Journal of Memory and Language 57(3):348-379.
  6. Sinclair, John. Reading Concordances.
  7. Sinclair, John. Trust the Text.
  8. Also, ‘the works of Joan Bybee’, listed at http://www.unm.edu/~jbybee/
  9. List of articles, books, theses and dissertations in applied Corpus Linguistics from around the world that use the WordSmith Tools software: http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
  10. Comprehensive and well-organized list of all sorts of publications about Corpus-based Linguistics, by David Lee: http://devoted.to/corpora/ (link: Reference, Papers, Journals)

HANDS-ON SEARCHES

  1. Collins Wordbanks Online English Corpus - composed of 56 million words of contemporary written and spoken text, POS-tagged; online sample offers 40 randomly selected lines of concordance per search: http://www.collins.co.uk/Corpus/CorpusSearch.aspx
  2. British National Corpus: http://www.natcorp.ox.ac.uk/
  3. BNC-based online search tool, by Mark Davies: http://corpus.byu.edu/bnc/
  4. Linguistic Data Consortium (LDC) at the University of Pennsylvania: WebSearch
  5. Phrases in English: http://pie.usna.edu/, which uses the BNC
  6. CHILDES - child language component of the TalkBank system, which is a system for sharing and studying conversational interactions: http://childes.psy.cmu.edu/
  7. TIGER-Search (freely available from the University of Stuttgart): http://www.ims.uni-stuttgart.de/projekte/TIGER/
  8. The Penn Treebank (for English)
  9. CELT - Corpus of Electronic Texts is an open-ended searchable online corpus of multilingual texts of Irish literature and history with approx. 10 million words, SGML/TEI tagged: http://www.ucc.ie/celt/search.html
  10. CorTec - comparable corpora of original texts in English and Portuguese (5 technical areas at the moment: Ecotourism; Information Technology; Cardiology; Contract Law; Cooking), each one with approx. 200,000 words in each language, searchable through a concordancer, a frequency counter and an n-gram extractor (the server is not very stable...): http://www.fflch.usp.br/dlm/comet/consulta_cortec.html
  11. COMPARA - bi-directional parallel corpus based on an open-ended collection of original texts in several English and Portuguese variants, mainly published fiction, aligned with their corresponding translations (some with more than one translation), fully searchable online: http://www.linguateca.pt/COMPARA/Welcome.html
  12. O Corpus do Português - historical annotated corpus with 45 million words of Portuguese texts from the 1300s to the 1900s; searchable online with the same tool developed for the BNC by Mark Davies: http://www.corpusdoportugues.org/
  13. Lácio-Web - collection of corpora containing texts of written contemporary Brazilian Portuguese, some of them annotated, together with a set of computational tools; you need to subscribe (for free) to search / download the corpora: http://www.nilc.icmc.usp.br/lacioweb/english/index.htm
  14. Tycho Brahe - Parsed Corpus of Historical Portuguese: texts written by authors born between 1435 and 1835, with approx. 2,000,000 words; some texts have been annotated; searchable online and available for downloading: http://www.ime.usp.br/~tycho/corpus/files/index.html
  15. Google Trends - displays graphs of frequencies of terms in Google searches and in newspaper texts.

SOFTWARE

  1. WordSmith Tools (free demo; the complete version is inexpensive): http://www.lexically.net/wordsmith/
  2. AntConc: downloadable for free at: http://www.antlab.sci.waseda.ac.jp/software.html
  3. ConcApp: available from www.edict.com.hk/PUB/concapp/
  4. Dexter: free suite of software tools that facilitate the annotation of language data; it’s written in Java, and works equally well on Windows, Macintosh, Unix and Linux platforms - http://www.dextercoder.org/

COURSES

OTHER

  1. Corpora list: Corpora@uib.no
  2. Prof. Dr. Dietmar Zaefferer (Ludwig-Maximilians-University, Munich, Germany; Computational Linguistics), who is very friendly and has data on all languages of the world
  3. David Lee's Bookmarks for Corpus-based Studies: the most complete reference to Corpus Linguistics on the web, with many links to English and non-English corpora, courses, e-lists, FAQs, tutorials, online tools, software, journals, articles, people, conferences, teaching material, etc.

Summary of entry points into corpus linguistics

The following comes from a summary email that went out to corpora-list – feel free to clean up, merge with the above info, etc. -Jason

I had asked for recommendations for entry points into corpus linguistics. I have compiled below the responses I received. I hope this is of use to some of you; otherwise, my apologies.

Thank you, Karon, Geoffrey, Linda, Eva, Stefan, and Alex.
Shekhar Pradhan

From Linda Bawcom

David Lee’s web site http://devoted.to/corpora

which is a vast summary (certainly enough to get you started) of just about everything regarding corpus-based research, from the different corpora available (both commercial and online) to online tutorials, books, references, software and so on. The only thing I am unsure of is when it was last updated.

Dr. Graeme Hirst’s (computational linguistics) home page http://www.cs.toronto.edu/~gh/ Click on ‘my publications’. You will see a (very long) list of names.

Phil Edmonds URL http://www.cs.toronto.edu/~pedmonds/papers.html

McEnery, A. M. and Xiao, R. Z. and Tono, Y. (2005) Corpus-based Language Studies: An advanced resource book. (Also recommended by Karon Harden.)

http://bowland-files.lancs.ac.uk/corplang/cbls/


From Eva Kerbler

- Adolphs, S. (2006): Introducing Electronic Text Analysis. Abingdon and New York: Routledge.

- Teubert, W. & Čermáková, A. (2007): Corpus Linguistics. Continuum.

- Baker, P., Hardie, A. & McEnery, A. (2006): A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press.

- International Journal of Corpus Linguistics http://www.benjamins.com/cgi-bin/t_seriesview.cgi?series=Ijcl

- Scott, Mike (2005): Textual patterns: key words and corpus analysis in language education.

- Tognini-Bonelli, Elena (2001): Corpus linguistics at work.

- a collection of key texts in the field: Wolfgang Teubert & Ramesh Krishnamurthy, ed. (2007): Corpus Linguistics. http://www.routledgelanguages.com/books/Corpus-Linguistics-isbn9780415338950

- and, of course, John Sinclair’s work!!!

http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/contents.htm

http://www.uteroemer.com

http://www.corpus-linguistics.de/

For courses I can recommend the Tuscan Word Centre.


From Geoffrey Williams

The New Trends seminar would be a nice starting point: http://www.ugr.es/local/newtrends/callpapers.php

For reading lists, you are sure to get a large number of recommended books as there are a number of different approaches about. However, my favourite introduction is through John Sinclair’s 1991 book.

Sinclair J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.

No one has more clearly described the problem of looking at language in an inductive manner through corpora. I’d then recommend ‘Trust the Text’, a collection of papers by John Sinclair (Routledge 2004).

Other stimulating reads include:

Tognini Bonelli E. 2001. Corpus Linguistics at Work. Benjamins.

Kennedy G. 1998. An introduction to corpus linguistics. Longman.

Hunston S. 2002. Corpora in Applied Linguistics. CUP.

Sampson G. & McCarthy D. eds. 2004. Corpus linguistics: readings in a widening discipline. Continuum.

That ought to get you off to a good start. You could then go to the Corpus Linguistics conference in Birmingham to meet people.

There is also the International Journal of Corpus Linguistics, which is a constant source of inspiration.


From Stefan Th. Gries

It may be early to talk about it yet, but let us be optimistic: Dagmar Divjak <http://perswww.kuleuven.be/~u0015217/> and I are currently trying to pull together an intensive boot camp for corpus linguistics (six full days for approx. 20 people) at the beginning of August 2009 at UCSB. Once we know more, we’ll post it here (and elsewhere ;-)).

 
corpus_linguistics.txt · Last modified: 2008/05/05 09:47 by jason
 