Introduction to working with corpora and programming in Python

Instructor: Katrin Erk
Office: Calhoun, Room 512.
Office hours: Tuesday 2-3:30 pm, and Wednesday 9:30-11 am.
Phone: 471-9020
Email: katrin dot erk at gmail dot com

Course description

This course is a combined introduction to working with text corpora and to the basics of programming in Python. It is aimed at graduate students in linguistics who would like to use text corpora in their research; no previous programming experience is required.

We will study the design, annotation formats, and analysis of text corpora. Topics to be discussed include: what types of corpora there are, and what kinds of research questions can be answered using a corpus; corpus annotation: principles and standards, formats, examples, and tests for annotation guidelines; tools and methods for searching and extracting information in corpora; and the basics of statistical modeling of corpus phenomena, including the selection and evaluation of models.

The introduction to programming in Python will start with a general introduction to key concepts of the language. Later, merging the two topics of the course, we will use Python to access and analyze corpus data.
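To give a rough idea of what using Python on corpus data can look like, here is a minimal sketch that counts word frequencies in a short text. It is only an illustration, not course material: the function name `word_frequencies`, the sample sentence, and the very simple tokenization (lowercased runs of letters and apostrophes) are all assumptions made for this example.

```python
from collections import Counter
import re


def word_frequencies(text):
    """Count how often each lowercased word form occurs in a text."""
    # Crude tokenization, assumed for illustration only:
    # lowercase the text and keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# A made-up sample text standing in for a real corpus.
sample = "The cat sat on the mat. The dog sat too."
freqs = word_frequencies(sample)

# Print the three most frequent word forms with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

A real corpus would be read from files and would usually need more careful tokenization, but the basic pattern of reading text, splitting it into tokens, and counting stays the same.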

Relevant links

Syllabus
Schedule
Course documents

Students enrolled in the course can access its Blackboard page.

Martin Wynne (ed.): Developing Linguistic Corpora: a Guide to Good Practice. This is a nice collection of hands-on advice on corpus collection and annotation.

The Python tutorial, also available as a PDF document from this page. This tutorial contains rather condensed information, which may be more helpful for lookup than for learning Python in the first place. For a more extensive list of Python documentation sources, take a look at the Python Documentation Index.

Allen B. Downey, Jeffrey Elkner and Chris Meyers: How to Think Like a Computer Scientist: Learning with Python. This text is better suited to learning Python from scratch. Its only drawback for our purposes is that it is not specifically geared toward working with text. The first chapter, which discusses what a computer program does and what it is made of, stands out in particular.