CCGBank Info

CCGBank is a collection of CCG analyses for Wall Street Journal texts, converted from the Penn Treebank by Julia Hockenmaier.

CCGBank is installed on the lab computers in the directory /groups/corpora/ccgbank-LDC2005T13. (The Penn Treebank is installed in /groups/corpora/penn-treebank-rel3/.)

CCG Parsers trained on CCGBank

  • Julia Hockenmaier's StatCCG Parser
  • James Curran and Stephen Clark's C&C Parser, which includes a number of other tools for POS tagging, supertagging, and named entity recognition.

tgrep and CCGBank

The script ccggrep calls the tgrep2 utility with the location of CCGBank already specified. It lets you search CCGBank for particular words, categories, or combinations of the two.

If you are outside UT Austin and want to set up your own ccggrep script, you can model it on the following:

#!/bin/sh
tgrep2 -c /groups/corpora/ccgbank-LDC2005T13/data/TGREP/ccgbank.00-24.t2c "$@"

This assumes that tgrep2 is installed and that the path is changed to point to your own copy of the CCGBank tgrep2 corpus file.
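A slightly more defensive variant of the wrapper is sketched below, written as a shell function so it can be pasted into a startup file or a script. It is only an illustration: the environment variable CCGBANK_T2C is a made-up name for overriding the corpus location, and the error messages and return codes are arbitrary choices, not anything tgrep2 or CCGBank defines.

```shell
#!/bin/sh
# Hypothetical ccggrep variant. CCGBANK_T2C is an assumed variable name
# used here only to override the default corpus path.
ccggrep() {
    corpus="${CCGBANK_T2C:-/groups/corpora/ccgbank-LDC2005T13/data/TGREP/ccgbank.00-24.t2c}"

    # Fail early if tgrep2 cannot be found.
    if ! command -v tgrep2 >/dev/null 2>&1; then
        echo "ccggrep: tgrep2 not found on PATH" >&2
        return 127
    fi

    # Fail early if the corpus file is missing or unreadable.
    if [ ! -r "$corpus" ]; then
        echo "ccggrep: cannot read corpus file: $corpus" >&2
        return 1
    fi

    # "$@" passes every argument through unchanged, so patterns that
    # contain spaces or glob characters reach tgrep2 intact.
    tgrep2 -c "$corpus" "$@"
}
```

With this in place, something like CCGBANK_T2C=/path/to/your/ccgbank.t2c in the environment would redirect the search to a local copy without editing the function.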

Here are some examples (from the CCGBank documentation):

Find all occurrences of words starting with 'buy' or 'Buy' and their lexical categories.

% ccggrep "/.*/</^[Bb]uy/"

Find all occurrences of the special subject-extracting category ((S[dcl]\NP)/NP)/(S[dcl]\NP), and print out their file name and sentence number.

% ccggrep -C  "((S[dcl]\NP)/NP)/(S[dcl]\NP)"

Find all occurrences of transitive verbs (with any morphosyntactic feature) that have the prefix “tak”.

% ccggrep -C  '/^\(S\[.*\]\\NP\)\/NP/</^tak/'

In regular expressions over category labels, all slashes, parentheses and brackets must be escaped with a backslash (e.g., '\[', '\\').
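Escaping long categories by hand is error-prone, so a small helper can do it mechanically. The function below is a sketch under that assumption: escape_cat is a hypothetical name, not part of CCGBank or tgrep2, and it simply puts a backslash in front of every slash, backslash, parenthesis and bracket in its argument.

```shell
#!/bin/sh
# escape_cat: hypothetical helper (not shipped with CCGBank or tgrep2).
# Prefixes each of / \ ( ) [ ] in a category label with a backslash so
# the label can be placed inside a tgrep2 regular expression.
escape_cat() {
    # A comma is used as the sed delimiter so that "/" needs no special
    # treatment; "&" in the replacement stands for the matched character.
    printf '%s\n' "$1" | sed 's,[][\\/()],\\&,g'
}

# Example use in sh or bash (assumes a ccggrep wrapper is available):
#   ccggrep -C "/^$(escape_cat '(S[dcl]\NP)/NP')/</^tak/"
```

The category is passed in single quotes because sh and bash reduce a doubled backslash inside double quotes to a single one, which would silently change the pattern.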

 
ccg/ccgbank.txt · Last modified: 2007/04/19 15:58 (external edit)
 