At present this page is simply a collection of things which may be useful when writing grammars in OpenCCG. Soon these scattered bits of information will be organized into something more coherent.
The logical forms produced by the OpenCCG parser (recall that the semantic derivation occurs in parallel with the syntactic derivation) use predicates supplied by word declarations. The predicate is specified as an attribute of the word. If you’re interested in the details, see the section below on word declarations.
Here are some examples from an itty-bitty Spanish grammar:
word tortuga:N (animal): sg fem;
word tortuga:N (pred=turtle class=animal): sg fem;
word pajaro:N (pred=bird class=animal): sg masc;
Note the two different declarations for tortuga – in the first, no predicate is supplied, so this word will be represented in the logical form with the predicate tortuga. If semantic class is the only attribute specified, there’s no need to include the name of the attribute. (This is the only attribute for which this is true.) To use only the predicate attribute and *not* the semantic class attribute, you would use this sort of declaration:
word tortuga:N (pred=turtle): sg fem;
Recall from the determiner section of the basic tutorial that feature structures can be inherited in the process of the derivation. This inheritance is indicated in the category definition by giving the two elements the same feature structure number:
family Det(indexRel=det) {
  entry: np<2> /^ n<2>[X]:
         X:sem-obj(<det>*);
}
Sometimes you only want some of the features to be shared rather than inheriting the whole feature structure. In that case, use the inheritsFrom construct, indicated with ~. The determiner category shown below inherits the PERS feature from the n it combines with, but any CASE feature associated with the n is overridden, and the result np will have accusative case.
family Det(indexRel=det) {
  entry: np<~2>[CASE=acc] /^ n<2>[X PERS]:
         X:sem-obj(<det>*);
}
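As a rough mental model only (OpenCCG works by unification over feature structures, not Python dictionaries), the effect of ~ can be pictured as copying the features of the source structure and letting locally specified values win. The function name inherit_features and the feature values below are made up for illustration.

```python
def inherit_features(source, local):
    """Return the features of a structure that inheritsFrom `source`.

    Features given locally take priority; everything else is copied
    from the source structure. Simplified model, not OpenCCG's code.
    """
    merged = dict(source)   # start from the inherited features
    merged.update(local)    # locally specified values override them
    return merged

# Hypothetical features carried by the n that the determiner combines with.
n_features = {"PERS": "3rd", "CASE": "nom"}

# np<~2>[CASE=acc]: inherit from structure 2, but fix CASE to acc.
np_features = inherit_features(n_features, {"CASE": "acc"})
print(np_features)  # {'PERS': '3rd', 'CASE': 'acc'}
```

The n keeps its own CASE value; only the resulting np carries the override.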
Ben Wing’s notes from the tiny.ccg grammar
Each statement specifies a single rule; it is also possible for statements to cancel some or all rules.
Note that some rules are enabled by default; this includes application, composition and crossed composition (forward and backward in each case), as well as forward type-raising from np to s/(s\np) and backward type-raising from np to s$1\(s$1/np).
rule {
  # turn off forward cross-composition
  no xcomp +;
  # this is how we could turn off all type-raising rules.
  # no typeraise;
  # Declare a backward type-raising rule from pp to s$1\(s$1/pp).
  # The $ causes a dollar-sign raise category to be created, as shown;
  # without it, we'd just get s\(s/pp).
  typeraise - $: pp => s;
  # Declare a type-changing rule to enable pro-drop (not useful in English!)
  # typechange: s[finite]\np[nom]$1 => s[finite]$1 ;
}
This shows how you can turn off all defaults and specify your own properties from scratch, if you want.
rule {
  no; # remove all defaults
  app +-;
  comp +-; # +- means both forward and backward
  xcomp -;
  sub +-;
  xsub +-;
  # Defaults for typeraising are np => s, if omitted.
  typeraise +;
  typeraise - $;
}
Here’s a more complex version of expansions for English nouns. Note that expansions are processed recursively: if the text of an expansion contains calls to other expansions, they will also be processed. This makes ‘inheritance’ very easy to implement.
Inside of an expansion, the operator . can be used to concatenate two words together into a single word. Remember that arguments functioning as variables within the expansion must begin with an upper-case letter (Stem and Class in the example). For example, look at this expansion called normal-noun:
def normal-noun(Stem, Class) {
  word Stem:N(Class) {
    *: sg sg-X;
    Stem . s: pl pl-X;
  }
}
We can declare regular nouns in English as simply as this:
normal-noun(book, thing)
normal-noun(car, thing)
normal-noun(bee, animal)
Or we could do this with two nested expansions, basic-noun and normal-noun:
def basic-noun(Sing, Plur, Class) {
  word Sing:N(Class) {
    *: sg sg-X;
    Plur: pl pl-X;
  }
}
def normal-noun(Stem, Class) {
  basic-noun(Stem, Stem . s, Class)
}
And again, the same normal-noun declarations work:
normal-noun(book, thing)
normal-noun(car, thing)
normal-noun(bee, animal)
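To see what the nested expansions amount to, here is a small Python sketch that treats each expansion as a function returning the text of a word declaration. It only imitates the substitution and the . concatenation; the function names and the whitespace of the output are assumptions of this sketch, not what ccg2xml does internally.

```python
def basic_noun(sing, plur, cls):
    """Imitate the basic-noun expansion: fill in the word{} template."""
    return (
        f"word {sing}:N({cls}) {{\n"
        f"  *: sg sg-X;\n"
        f"  {plur}: pl pl-X;\n"
        f"}}"
    )

def normal_noun(stem, cls):
    """Imitate normal-noun: 'Stem . s' becomes plain string concatenation."""
    return basic_noun(stem, stem + "s", cls)

print(normal_noun("book", "thing"))
```

Because normal_noun simply calls basic_noun, changing the template in one place changes every noun declared through either expansion, which is the 'inheritance' effect described above.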
We can do something even more clever to handle pluralization. This section discusses three built-in expansion functions: regsub, ifmatch, and ifmatch-nocase. All three do some sort of text replacement, and all three follow normal Python conventions for regular expressions.
regsub
This is a conditional replacement function. It takes three arguments, in this order: a regular expression (PATTERN), a replacement text (REPLACEMENT), and a text (TEXT) to be searched for the regular expression. Every instance of PATTERN found in TEXT is replaced with REPLACEMENT; if PATTERN does not occur in TEXT, TEXT is left unchanged.
This is the syntax of the function:
regsub(PATTERN, REPLACEMENT, TEXT)
A simple example – the regsub function shown below will replace any occurrence of a, b, or c with d.
regsub('[abc]','d',TEXT)
If we apply this to the text bad, we get the result ddd. (Why one would ever want such a function is a different question...) A more realistic use of regsub is illustrated below in the expansion pluralize.
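Since regsub follows Python's regular expression conventions, its behaviour can be tried out directly in Python. The sketch below assumes regsub behaves like re.sub with the same argument order; the wrapper function is only an illustration.

```python
import re

def regsub(pattern, replacement, text):
    """Replace every match of pattern in text with replacement."""
    return re.sub(pattern, replacement, text)

print(regsub('[abc]', 'd', 'bad'))         # ddd
print(regsub('^(.*)y$', r'\1i', 'lady'))   # ladi
print(regsub('^(.*)y$', r'\1i', 'book'))   # book (no match, unchanged)
```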
ifmatch
This function differs from regsub in two important ways. First, regsub does a localized replacement on any occurrence of PATTERN that it finds within TEXT. If PATTERN does not occur in TEXT, no replacements are made. ifmatch instead does a global replacement of TEXT which is triggered only when the regular expression PATTERN is found at the beginning of TEXT.
The second major difference is that ifmatch works like an if-else statement. It requires specification of one replacement text (IF-TEXT) to be used when there is a match between PATTERN and the beginning of TEXT and a second replacement text (ELSE-TEXT) to be used when there is not such a match.
This is the syntax of the function:
ifmatch(PATTERN, TEXT, IF-TEXT, ELSE-TEXT)
If the regular expression PATTERN is found at the beginning of TEXT, the function will replace TEXT with IF-TEXT. If PATTERN is not found at the beginning of TEXT, the function will replace TEXT with ELSE-TEXT.
And here’s another weird example – imagine a group of publishers has decided that the world of linguistics is suffering from too much negativity, and that any word starting with un should be removed from their publications. Words such as unhappy will be replaced with the text CENSORED, and any other words will be left unchanged.
ifmatch('un',TEXT,'CENSORED',TEXT)
Again, a more relevant example of ifmatch appears below in pluralize.
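The "found at the beginning of TEXT" behaviour corresponds to Python's re.match, which only matches at the start of a string. A minimal sketch, assuming ifmatch is equivalent to a re.match test (the helper censor is a made-up name for the un example):

```python
import re

def ifmatch(pattern, text, if_text, else_text):
    """Return if_text when pattern matches at the start of text, else else_text."""
    return if_text if re.match(pattern, text) else else_text

def censor(word):
    # Same call as ifmatch('un', TEXT, 'CENSORED', TEXT)
    return ifmatch('un', word, 'CENSORED', word)

print(censor('unhappy'))  # CENSORED
print(censor('happy'))    # happy
print(censor('sunny'))    # sunny ('un' is present, but not at the beginning)
```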
ifmatch-nocase
This third built-in function works just like ifmatch but with case-insensitive pattern matching. So a case-insensitive version of the un-censorship function would censor unhappy, Unhappy, UNHAPPY, etc.
ifmatch-nocase('un',TEXT,'CENSORED',TEXT)
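In Python terms, the case-insensitive variant corresponds to passing the re.IGNORECASE flag. This is a sketch under that assumption; the function name with an underscore is just the Python spelling of ifmatch-nocase.

```python
import re

def ifmatch_nocase(pattern, text, if_text, else_text):
    """Like ifmatch, but the pattern is matched without regard to case."""
    matched = re.match(pattern, text, re.IGNORECASE)
    return if_text if matched else else_text

for word in ['unhappy', 'Unhappy', 'UNHAPPY', 'Sunny']:
    print(word, '->', ifmatch_nocase('un', word, 'CENSORED', word))
```

The first three words come out as CENSORED; Sunny is left alone because the match must still start at the beginning of the word.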
Bringing these all together
Now we’ll show how these built-in expansion functions can be used to write a very powerful expansion to handle plural morphology in English. The expansion pluralize shows a complicated expression using the built-ins ifmatch and regsub. Here are the parts of the expression, in the order they appear:
If the word ends in a vowel followed by o or y, the plural is formed by adding s.
Otherwise, if the word ends in o or y, or if it ends in s, sh, ch, or x, the plural is formed by adding es (and in the case of words ending in y, we first change the y to an i).
Otherwise, the plural is formed by adding s.
Examples of each of these cases:
buy –> buys, boy –> boys, goo –> goos
go –> goes, try –> tries, lady –> ladies
cat –> cats, etc.
Of course there are some exceptions which would need to be handled manually, such as the usual irregular plurals (children, deer, etc.) and other forms which don’t follow the rules described above (piano –> pianos, which these rules would turn into pianoes).
def pluralize(Word) {
  ifmatch('^.*[aeiou][oy]$', Word, Word . s,
    ifmatch('^.*([sxoy]|sh|ch)$', Word, regsub('^(.*)y$', '\1i', Word) . es,
      Word . s))
}
This expansion uses nested ifmatch statements, as the ELSE-TEXT argument of the first instance of ifmatch is itself an ifmatch statement. The IF-TEXT argument of the second ifmatch statement is a use of regsub. If the regular expressions aren’t making sense, take a look at this tutorial on regular expressions in Python. If they still don’t make sense, ask someone for assistance.
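One way to check the regular expressions is to port the expansion to Python and try it on a few words. The sketch below reuses the two patterns and the regsub replacement unchanged, and writes the nested ifmatch calls as ordinary if statements. It is an illustration of the logic, not code taken from OpenCCG.

```python
import re

def pluralize(word):
    # Vowel followed by o or y: just add s (buy -> buys, goo -> goos).
    if re.match(r'^.*[aeiou][oy]$', word):
        return word + 's'
    # Ends in s, x, o, y, sh or ch: change a final y to i, then add es.
    if re.match(r'^.*([sxoy]|sh|ch)$', word):
        return re.sub(r'^(.*)y$', r'\1i', word) + 'es'
    # Everything else: add s.
    return word + 's'

for w in ['buy', 'boy', 'goo', 'go', 'try', 'lady', 'cat', 'church', 'deer']:
    print(w, '->', pluralize(w))
```

The last word shows the limitation mentioned above: deer comes out as deers, so irregular nouns still need to be declared by hand.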
Now we can replace the normal-noun expansion we wrote above with this noun expansion which, together with the pluralize expansion above and the basic-noun expansion discussed earlier, allows for very concise word declarations for nouns.
def noun(Sing, Class) {
  basic-noun(Sing, pluralize(Sing), Class)
}
noun(book, thing)
noun(DVD, thing)
noun(glass, thing)
noun(church, thing)
noun(flower, thing)
noun(bath, thing)
noun(teacher, person)
noun(lady, person) # Pluralized (correctly) to 'ladies'
noun(boy, person) # Pluralized (correctly) to 'boys'
Nouns with irregular plurals are declared with the basic-noun expansion, bypassing the pluralize expansion, which of course doesn’t handle the irregular cases.
basic-noun(policeman, policemen, person)
basic-noun(volcano, volcanoes, thing)
basic-noun(deer, deer, thing)
The English pluralization example above shows in great detail how to use expansions and the built-in expansion functions to perform complex morphological analysis with OpenCCG.
Now here are some examples from a truly complex morphological system – Arabic nominal morphology.
UNDER CONSTRUCTION
The testbed function of OpenCCG provides a nice way for testing the effects of changes in analysis throughout the grammar. A well-designed testbed contains a set of sentences (both grammatical and ungrammatical sentences) which cover the range of phenomena you want your grammar to cover, making sure the grammar gets all of the examples you want it to get but doesn’t overgenerate.
To run the testbed, run the following command from the command line:
$ ccg-test -norealization tinytiny-grammar.xml
This text is from Ben Wing’s comments in tiny.ccg.
The format of word declarations is
word STEM:FAMILY ...(ATTRS): FEATURES;
or
word STEM:FAMILY ...(ATTRS) { INFLECTED-FORM: FEATURES; ...}
where STEM is the word’s stem, FAMILY is a list of the families that a word is part of, and ATTRS specifies any other attributes associated with the word.
FEATURES gives the word’s features; these come from the feature{} declarations above. (NOTE: Only feature values whose features specify a “macro-tie” value – something in <> following the feature’s name – can be used. See above.)
ATTRS is a list; each attribute is either a specification ATTRIBUTE=VALUE or a single VALUE (equivalent to class=VALUE). The useful attributes are
class: Semantic class of a word.
pred: Semantic predicate of a word, used in the logical form; if omitted, defaults to the word’s stem.
excluded: List of excluded lexical categories.
coart: Boolean indicating that this entry is a coarticulation, e.g. a pitch accent, gesture, or other word-associated element.
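The two conventions for ATTRS (a bare VALUE means class=VALUE, and pred falls back to the stem) can be illustrated with a few lines of Python. This is only a sketch of the rules as described here; parse_attrs is a hypothetical helper, not part of ccg2xml, and it assumes attributes are separated by whitespace as in the examples on this page.

```python
def parse_attrs(stem, attrs):
    """Turn an ATTRS string such as 'pred=turtle class=animal' into a dict."""
    result = {}
    for item in attrs.split():
        if '=' in item:
            key, value = item.split('=', 1)
        else:
            # A single VALUE is equivalent to class=VALUE.
            key, value = 'class', item
        result[key] = value
    # If no predicate is given, the stem itself is used.
    result.setdefault('pred', stem)
    return result

print(parse_attrs('tortuga', 'animal'))
print(parse_attrs('tortuga', 'pred=turtle class=animal'))
print(parse_attrs('tortuga', 'pred=turtle'))
```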
Any of FAMILY, ATTRS and/or FEATURES can be omitted.
The second form above, with braces, is used for words with different inflections. Instead of specifying the features directly after the word, you list the features for each inflection separately. Note that * is shorthand for the stem itself.
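The * shorthand can be pictured as a simple substitution, sketched here in Python. The function inflected_forms and the list-of-pairs input are illustrative assumptions; the real format is the word{} block shown above.

```python
def inflected_forms(stem, forms):
    """Resolve the * shorthand: '*' stands for the stem itself."""
    return [(stem if form == '*' else form, features.split())
            for form, features in forms]

# Corresponds to:  word book:N { *: sg sg-X; books: pl pl-X; }
book = inflected_forms('book', [('*', 'sg sg-X'), ('books', 'pl pl-X')])
print(book)  # [('book', ['sg', 'sg-X']), ('books', ['pl', 'pl-X'])]
```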
Note that there can be more than one word{} declaration for a single stem.
The families in FAMILY can be either a family name, from a family{} block, or a part of speech. (ccg2xml will derive the appropriate parts of speech from any families given when creating the XML file.) Note that the words associated with a particular family can be specified either by tagging each word with its family, by listing a family’s words explicitly using the member declaration inside of a family{} block, or by a combination of the two.