The C&C tools are installed on the compling lab computers. To get started using them, go to the official C&C wiki. Add your tips and tricks here!
On the lab computers, save the following in a file called candc_demo.sh:
export CANDC=/usr/local/candc
export PATH=$PATH:$CANDC/bin
echo "Pierre thinks that Mary persuaded Bill to eat apples" \
  | pos --model /usr/local/candc/models/pos/ \
  | parser --parser /usr/local/candc/models/parser/ --super /usr/local/candc/models/super
Then on the command line, run:
> sh candc_demo.sh
The C&C tools come with extremely limited documentation, so it is not clear at first inspection exactly what tools are available and how to use them.
One of the more hidden ones is the chunker. This is trained and used just like the tagger:
$ source candc.env
$ train_chunk --model models/my_chunk_model --comment "Chunks are fun" --ifmt '%w/%p/%c \n' --input my_chunk.train.txt
$ chunk --model models/my_chunk_model --ifmt '%w/%p \n' --ofmt '%w/%p/%c \n' --input my_tags.dev.txt --output my_chunks.dev.out
If you have a custom tag set (i.e., a set of tags which differs from the standard Penn set), you may get an error message like the following when you run the tagger using pos:
pos:tag 'urg' is not a member of klasses:unknowns
The problem is that your tag for the unknowns is not a member of the tagset used in the training data. So if (for example) 'urg' is your unknowns tag, you have to have used 'urg' in the training data.
The problem is (in principle) easy to fix. Just edit the /models/your_model/unknowns and /models/your_model/number_unknowns files so that they include all and only the tags you want to assign to unknown words.
However, it's important to consider that the number of tag types included in the unknowns files can affect the accuracy of the model. A better way to do it is to extract a list of tag types from the tagdict file, which resides in the same directory as the unknowns file.
Here's a little shell script for doing this:
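#!/bin/sh
# Run this from the model directory (the one containing tagdict and unknowns).
# Keep backups of the original files.
mv unknowns BACKUP_unknowns
mv number_unknowns BACKUP_number_unknowns
# The first 3 lines of tagdict are treated as the header; entries start on line 4.
head -n 3 tagdict > header.tmp
tail -n +4 tagdict > tags.tmp
# The second column of each entry is the tag; keep one copy of each.
awk '{print $2}' tags.tmp | sort -u > list.tmp
cat header.tmp list.tmp > unknowns
# number_unknowns gets the header plus the single tag NUM.
echo NUM > numlist.tmp
cat header.tmp numlist.tmp > number_unknowns
rm header.tmp tags.tmp list.tmp numlist.tmp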
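To double-check the result, a small helper along these lines can list any tag in the unknowns file that never shows up in tagdict. This is only a sketch: check_unknowns is a name made up for this page, and it assumes the same layout the script above relies on (a 3-line header in both files, the tag in the second column of tagdict, one tag per line in unknowns).

```shell
#!/bin/sh
# Hypothetical helper: print tags that are listed in MODEL_DIR/unknowns
# but do not occur in the second column of MODEL_DIR/tagdict.
# Assumes a 3-line header in both files, as in the script above.
# Usage: check_unknowns MODEL_DIR
check_unknowns() {
    dir=$1
    known=$(mktemp) || return 1
    # Unique tags seen in tagdict, sorted so comm can compare them.
    tail -n +4 "$dir/tagdict" | awk '{print $2}' | sort -u > "$known"
    # comm -13 keeps only the lines that are unique to the second input.
    tail -n +4 "$dir/unknowns" | sort -u | comm -13 "$known" -
    rm -f "$known"
}
```

If check_unknowns models/your_model prints nothing, every unknowns tag also appears in tagdict.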
The C&C tools are very particular about whitespace. Be sure to remove whitespace from the beginnings and ends of lines, and to collapse runs of multiple spaces into a single space. This is easy to do with a perl one-liner:
perl -pe 's/^[ \t]+//; s/[ \t]+$//; s/[ \t]+/ /g' my_data
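To see whether a file needs cleaning in the first place, something like the following may be handy. It is a rough sketch, and find_bad_whitespace is just a name invented here: it prints every line (with its line number) that starts or ends with a space or tab, or contains two or more in a row.

```shell
#!/bin/sh
# Hypothetical helper: print "line_number:line" for each line with leading,
# trailing, or repeated spaces/tabs. Prints nothing if the file looks clean.
# Usage: find_bad_whitespace FILE
find_bad_whitespace() {
    grep -n -E '^[[:blank:]]|[[:blank:]]$|[[:blank:]]{2,}' "$1"
}
```

Running find_bad_whitespace my_data before and after the cleanup is a quick way to confirm that nothing was missed.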
Should you find yourself using a character set which is a superset of the ASCII alphanumeric characters, avoid using the pipe character “|”. The C&C tagger uses this as its default tag separator, and even if you specify another separator, it will still interpret “|” as a separator and give you an error message.
In the unlikely event that you find yourself working with the Penn Arabic Treebank, note that the transliteration scheme it uses includes the pipe character, so you will want to replace it with another character.
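One way to do the replacement is sketched below. The helper name replace_pipe and the substitute character are made up for illustration; which character is actually safe depends on your data, so the function refuses to run if the substitute already occurs in the file. Run it on the raw text before you attach tags, so that you do not clobber separators you put there on purpose.

```shell
#!/bin/sh
# Hypothetical helper: write FILE to standard output with every "|"
# replaced by SUBSTITUTE (assumed to be a single plain ASCII character).
# Fails without output if SUBSTITUTE already occurs in FILE, since the
# replacement could then not be undone unambiguously.
# Usage: replace_pipe FILE SUBSTITUTE
replace_pipe() {
    file=$1
    sub=$2
    if grep -q -F -e "$sub" "$file"; then
        echo "replace_pipe: '$sub' already occurs in $file" >&2
        return 1
    fi
    tr '|' "$sub" < "$file"
}
```

For example, replace_pipe my_data '@' > my_data.nopipe would work only if '@' is not already used anywhere in my_data.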
If you are using the C&C tools with data that includes really long sentences with complex, character-rich tags, you may get an error like the following:
train_pos:unexpected stream error (probably the sentence is too long):my_tags:548
There seems to be a limit on the number of characters that can occur in a string (the limit is around 5000 chars). The thing to do is to try to simplify your tag labels so that they are more compact. If that doesn't work, you may need to break up the sentence.
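To track down the offending sentences before training, a check like the following might help. It is a sketch under a few assumptions: the data has one sentence per line, the rough 5000-character figure is used as the default threshold, and find_long_sentences is a name invented for this example. Some awk implementations count bytes rather than characters, so treat the lengths as approximate for non-ASCII data.

```shell
#!/bin/sh
# Hypothetical helper: print "line_number length" for every line of FILE
# that is longer than LIMIT characters (default 5000, the approximate
# limit mentioned above).
# Usage: find_long_sentences FILE [LIMIT]
find_long_sentences() {
    awk -v limit="${2:-5000}" 'length($0) > limit { print NR, length($0) }' "$1"
}
```

For example, find_long_sentences my_tags lists the lines worth simplifying or splitting.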