This page is for adding various tips and tricks for programming. Please feel free to add anything you think would be helpful for others!
The Software Carpentry site has a nice collection of tutorials and tips regarding software development.
Here are some brief notes on some of the languages you might consider using for work in NLP. This is in no way complete, just a starter to get the ball rolling. If your favorite language has been slighted or under-represented, speak up in its defense!
Advantages:
Disadvantages:
To get started:
Main link: java.sun.com
Advantages:
Disadvantages:
TBA
Advantages:
Disadvantages:
We have a separate page for perl tips:
Advantages:
Disadvantages:
Main link: www.python.org/
Our lab has some tips for Python programming that might be useful for you.
Advantages:
Disadvantages:
Ruby is a lot like Python, so it shares many of its advantages and disadvantages. Choosing between Ruby and Python is more a matter of taste than of functionality, really.
Advantages:
Disadvantages:
Downloading files remotely
Sometimes you are working remotely on the lab machines, and you want to download a file from somewhere that you found on a web page. Rather than downloading it to your home machine and then uploading it to the lab, use “wget”:
jbaldrid@quiche:~/tmp$ wget http://nltk.googlecode.com/files/nltk-2.0b7.zip
It can also be handy to use the text-mode web browser lynx, which can be used for such downloads too. It is especially useful when you want a journal publication that can be downloaded from UT machines but not from off campus: log on to the lab machines, run lynx in your terminal, go to the page, and tab to the paper you want to download.
Reinvoking a previous command
If you have entered a bunch of commands and want to rerun the most recent one that began with a particular prefix, you can invoke it again using “!” followed by that prefix, as done here to recall the earlier “cat …” command with “!ca”:
/groups/corpora/nltk-data/gutenberg$ cat austen-emma.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | sort -nr | grep -E "\b[Hh]er?\b"
2400 her
1368 he
443 He
90 Her
/groups/corpora/nltk-data/gutenberg$ ls
austen-emma.txt blake-songs.txt README
austen-persuasion.txt chesterton-ball.txt shakespeare-caesar.txt
austen-sense.txt chesterton-brown.txt shakespeare-hamlet.txt
bible-kjv.txt chesterton-thursday.txt shakespeare-macbeth.txt
blake-poems.txt milton-paradise.txt whitman-leaves.txt
/groups/corpora/nltk-data/gutenberg$ wc austen-emma.txt
17078 159826 914529 austen-emma.txt
/groups/corpora/nltk-data/gutenberg$ !ca
cat austen-emma.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | sort -nr | grep -E "\b[Hh]er?\b"
2400 her
1368 he
443 He
90 Her
Command Line NLP Tools
Chris Brew (OSU) and Marc Moens wrote a draft of a book on NLP that has 40 or so pages that describe many UNIX shell commands which are useful for building n-grams, etc. Fred Hoyt extracted these pages to a single PDF.
Splitting a file into a series of files by taking every nth line
Say you have a file with a bunch of lines, and you want to put all even numbered lines into one file and odd numbered ones into another. Here's an easy way to do it:
$ cat <file> | awk '{print > ("file" (2 - NR%2))}'
For three files with every third line, do:
$ cat <file> | awk '{print > ("file" (3 - NR%3))}'
And so on.
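The same idea can be wrapped in a small shell function so that the number of output files is a parameter. This is only a sketch: the function name split_nth and the output prefix "part" are made up for illustration, and the files are numbered in line order (line 1 goes to part1, line 2 to part2, and so on). The output name is built in a variable first, because some awk implementations parse an unparenthesized concatenation after > differently.

```shell
#!/bin/sh
# split_nth N FILE: write line k of FILE to part<j>, where j = ((k-1) mod N) + 1.
# The name split_nth and the "part" prefix are illustrative choices.
split_nth() {
    awk -v n="$1" '{ out = "part" ((NR - 1) % n + 1); print > out }' "$2"
}

# Demo in a scratch directory
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' one two three four five six > lines.txt
split_nth 3 lines.txt
cat part1    # lines 1 and 4
cat part2    # lines 2 and 5
cat part3    # lines 3 and 6
```

Calling split_nth 2 on a file gives the odd/even split described above.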
Removing Lines from a File
If you want to remove all lines from a file that match some pattern, the sed ('stream editor') utility is useful:
$ cat <file> | sed '/<pattern>/d' > outfile
For example, if you have a LaTeX file and you want to remove all the lines containing comments:
$ cat my_tex.tex | sed '/%/d' > my_commentless_tex.tex
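One caveat about the LaTeX recipe: /%/d deletes every line that contains a percent sign anywhere, including lines where a comment follows real text or where the percent sign is escaped as \%. If you only want to drop lines that consist entirely of a comment, you can anchor the pattern. A small sketch, using an invented sample file:

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample LaTeX-like input (contents invented for this demo)
cat > sample.tex <<'EOF'
% a full-line comment
Profits rose 5\% this year.
\section{Intro} % trailing comment
  % indented comment
Plain text line.
EOF

# Original recipe: drops every line containing a percent sign
sed '/%/d' sample.tex > all_percent_removed.tex

# Anchored variant: drops only lines whose first non-blank character is %
sed '/^[[:space:]]*%/d' sample.tex > comment_lines_removed.tex

cat all_percent_removed.tex
cat comment_lines_removed.tex
```

The first output keeps only the plain text line; the second keeps the three lines that carry real content.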
Extracting Columns from a File
The awk utility is useful for extracting columns from a file that contains columnar data.
For example, the following pulls out the first column:
$ awk '{print $1}' my_file > column_one
To pull out the 2nd and 3rd columns separated by a comma:
$ awk '{print $2 "," $3}' my_file
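If the input is not whitespace-separated, or you want a particular separator in the output, awk's -F option and its OFS variable take care of that. A brief sketch with an invented tab-separated file (the file names are placeholders):

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented sample data: word, tag, count (tab-separated)
printf 'fish\tNN\t100\nheads\tNNS\t100\nyum\tUH\t81\n' > tagged.tsv

# -F sets the input field separator; OFS is what print puts between
# its comma-separated arguments
awk -F '\t' -v OFS=',' '{ print $2, $3 }' tagged.tsv > cols_2_3.csv
cat cols_2_3.csv
```

Here the comma inside print means "insert OFS", so the output lines look like NN,100 with no stray spaces.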
Performing Calculations on a Column in a File
You can also use awk to perform arithmetic operations on a column containing numbers.
For example, say you are extracting unigram counts for words in a text. You use the usual command-line tools to calculate word counts and print them to a file:
$ cat I.like.fish.heads.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn > fish.count.txt
(If you don't know what these commands mean, see the PDF file on command line NLP tools just above.)
This produces output like the following (displaying the results rather than printing to a file):
$ cat I.like.fish.heads.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn | less
100 fish
100 heads
81 yum
23 chomp
If you want to get the sum of all these counts, do the following:
$ cat fish.count.txt | awk '{total = total + $1} END {print total}'
304
The key is that both bracketed expressions in the awk command are within the quote marks. The first block runs once per input line and adds the value of the first column ($1) to the variable total, which starts at 0. The second block, marked END, runs once after the last line has been read and prints the total.
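Putting the pieces together, here is a sketch of the whole pipeline on a tiny invented text. As an extra, the END block also prints NR, the number of lines awk has read, which for a count file is the number of distinct words. The file names are placeholders.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tiny invented corpus
printf 'fish heads fish heads\nroly poly fish heads\n' > song.txt

# One word per line, then count and sort by frequency
tr -s '[:space:]' '\n' < song.txt | sort | uniq -c | sort -rn > song.count.txt

# Sum the first column; "total + 0" prints 0 instead of an empty
# field if the input happens to be empty
awk '{ total += $1 } END { print total + 0, NR }' song.count.txt > song.summary.txt
cat song.summary.txt
```

For this text the summary is 8 tokens and 4 distinct words.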
Leaving a process running while you log out
Say you have an experiment to run or code to compile and it's going to take a long time. You need to leave the lab, you don't want to tie up one of the machines by leaving it screen locked, and you want to be able to check the process from another machine.
A nice way to do this is by using the screen utility. It works like this:
(1) Log in or ssh into the machine you are going to use.
(2) type:
''$ screen''
(3) You will see an intro window. Press any key and you will then be presented with a prompt. Start your process:
''$ my-process''
(4) Now you have to leave. Press ctrl-a followed by “d”. This “detaches” you from the process. Now you can log out, lock up the lab, and go about your business.
(5) To re-attach to the process later, log in or ssh back into the computer on which the process is running.
(6) Type:
''$ screen -ls''
This will show you a list of screen processes running:
''$ screen -ls
There are screens on:
28354.pts-0.odyssey (Detached)
28302.pts-0.odyssey (Detached)
2 Sockets in /var/run/screen/S-bubba.''
Say you want to re-attach to 28302. Type:
''$ screen -r 28302.pts-0.odyssey''
The process will re-appear in the terminal and you're ready to go.
For more details on how to use screen, see http://www.rackaid.com/resources/tips/linux-screen.cfm
How to leave a process running without needing to see it again
If you want to leave a process running until it finishes, and you only need to see its output afterwards, you can use the nohup utility (“nohup” stands for “no hangup”). Say you want to run a program called fishheads.py. Assuming the script is executable and on your path (otherwise put the interpreter in front of it), do the following:
''$ nohup fishheads.py > dog_treats.txt &''
Now you can log out and go about your business, and the process will keep running until it's through. Be sure to include the ampersand.
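Two details are worth adding. Redirecting standard error explicitly keeps error messages from getting mixed into (or missing from) the results file, and the shell variable $! holds the process ID of the job just put in the background, which is handy if you want to check on it or kill it later. A sketch, with a stand-in command in place of fishheads.py and invented file names:

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for a long-running job (fishheads.py is not assumed to exist here)
nohup sh -c 'echo "result line"; echo "warning line" >&2' > job.out 2> job.err &

# $! is the PID of the most recent background job; save it so the job
# can be checked or killed later from another login
echo $! > job.pid

# In this demo we simply wait for the job to finish
wait "$(cat job.pid)"
cat job.out
```

On a real run you would log out instead of waiting, and later use the saved PID with ps or kill.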
A Quick and Easy Command for Checking Disk Space
du is a useful command-line utility for taking inventory of disk space usage. Say you're in your home directory. The following command lists each subdirectory, preceded by its total size in human-readable units (K, M, G):
''$ du -h --max-depth=1''
If you use du without any options, it reports every subdirectory recursively, giving a potentially long list. The -h argument returns the results in eye-friendly format, and --max-depth=1 limits the listing to the current directory and its immediate subdirectories.
If you just want a sum of disk space usage for the contents of the directory, do the following:
''$ du -hs''
If you aren't using a version control system like CVS or Subversion, you should start doing so now. If you haven't used either, Subversion is the better one to start with. If you are already using CVS and want to switch over to Subversion, check out these CVS-to-Subversion switching notes.
For the big picture and full documentation, check out the Subversion manual or get the quick start with Subversion cheat sheet. For help specific to using Subversion on the lab computers, read the following.
Also this is a nice general tutorial.
First things first, you may want to set up a password-free login with SSH public-key authentication. Then, create a repository in your home directory, like this:
efp@quiche:~$ mkdir newsvn
efp@quiche:~$ svnadmin create --fs-type fsfs /home/efp/newsvn
On your own machine (at home or on a laptop) or in your own filespace on the lab computers, create a new project:
dib:~ efp$ mkdir newproject
dib:~ efp$ cd newproject/
dib:~/newproject efp$ touch foo.c bar.c baz.c
dib:~/newproject efp$ ls
bar.c baz.c foo.c
Add the project like this:
dib:~/newproject efp$ svn import . svn+ssh://quiche.ling.utexas.edu/home/efp/newsvn/newproject -m"Test import"
Adding foo.c
Adding bar.c
Adding baz.c
Committed revision 1.
Note that you should put your own directory in place of /home/efp/newsvn. Also note that the name in the repository does not have to be “newproject” just because that was the name of the directory you imported from. You could have used /home/efp/newsvn/cool_parser, or whatever your project should be called when it gets checked out.
Now, this may seem a bit weird (and it's the same with CVS): move the original project out of the way and check out a fresh working copy to start working with:
dib:~ efp$ mv newproject/ tmp/
dib:~ efp$ svn co svn+ssh://quiche.ling.utexas.edu/home/efp/newsvn/newproject
A newproject/foo.c
A newproject/bar.c
A newproject/baz.c
Checked out revision 1.
You need to do that because once you have added those files to the repository, the original directory containing them is in no way connected to the subversion repository itself. (It was just a data source.)
With your new checkout, you have a fully functioning view on the repository. So, now you can edit and all:
dib:~ efp$ cd newproject/
dib:~/newproject efp$ echo "#include <stdio.h>" >> foo.c
dib:~/newproject efp$ svn status
M foo.c
dib:~/newproject efp$ svn ci -m"Change foo"
Sending foo.c
Transmitting file data .
Committed revision 2.
Now, you can do some kind of fun svn stuff:
dib:~/newproject efp$ svn mkdir newdir
A newdir
dib:~/newproject efp$ svn mv foo.c newdir
A newdir/foo.c
D foo.c
dib:~/newproject efp$ svn ci -m"Added newdir"
Deleting foo.c
Adding newdir
Adding newdir/foo.c
Committed revision 3.
dib:~/newproject efp$ svn mkdir newdir2
A newdir2
dib:~/newproject efp$ svn mv newdir/foo.c newdir2/
A newdir2/foo.c
D newdir/foo.c
dib:~/newproject efp$ svn rm newdir
D newdir/foo.c
D newdir
dib:~/newproject efp$ svn ci -m"Created newdir2, dropped newdir"
Deleting newdir
Adding newdir2
Adding newdir2/foo.c
On the lab machines, you can check the project out like this:
efp@quiche:~$ svn co file:///home/efp/newsvn/newproject
A newproject/bar.c
A newproject/newdir2
A newproject/newdir2/foo.c
A newproject/baz.c
Checked out revision 4.
You can always retrieve any prior version of a file from svn, like this:
svn cat -r134 expand.tex
This command prints out version 134, which is the version prior to the change that I made. You could save it to a temporary file like this:
svn cat -r134 expand.tex > expand.tex.134
To figure out which version to retrieve, use this:
svn log expand.tex
This prints out all the available versions, who committed each version and when, and what the comment was.
A lot of work is still done with CVS, so even though SVN may be easier to use in some ways, it is still useful to know how to work with CVS.
If you need to access a CVS repository from a non-local site, you use a command of the following form (thanks to Jason for this):
cvs -d:pserver:id@url:/cvsroot/project login
cvs -z3 -d:pserver:id@url:/cvsroot/project co -P project
For example, if you want to download the latest version of OpenCCG from the Sourceforge repository:
cvs -d:pserver:anonymous@openccg.cvs.sourceforge.net:/cvsroot/openccg login
cvs -z3 -d:pserver:anonymous@openccg.cvs.sourceforge.net:/cvsroot/openccg co -P openccg
Note that you may get error messages even when you successfully checkout a project. Error messages should not be taken at face value. First check to see if the checkout actually worked, and if not, then start Google-ing the error message(s) that you got.
If you are trying to access a repository in the UT CompLing network from off-site, then you will need to tell CVS to ssh into the lab network:
export CVS_RSH=ssh
Then the checkout command:
cvs -d:ext:url/path/ checkout project
If you are accessing the lab, the url will be something like quiche.ling.utexas.edu, and the path will be the (absolute) path to the directory containing the repository. If you get failure messages involving permissions (and you are sure that you do have the correct permissions), then there is probably an error in the path as you gave it.
If you are developing a system that involves lots of code, possibly spread out over many files, using a “build” tool of some sort is highly advisable. A standard approach is to use the original make setup, using GNU Make for example. It provides a simple and very standardized way to declare the “targets” for compilation, and has a great deal of configurability and power. To help out with configuration settings so that code can be compiled easily on lots of different systems, check out Autoconf.
If you are working with Java, the Apache Ant system is quite nice (it can be used for other languages too). Check out the source code of OpenCCG, specifically the build XML file that Ant uses to perform its build actions.
If you like to generate your LaTeX output on the command line, here are some quick, simple scripts for streamlining the process.
Say you have a latex document with a bibliography. The following script generates the document once, generates the bibliography, then re-runs the latex twice so that the references all appear in the text, and then prints it first to .ps and then to .pdf. I call the script `fulltex.sh':
#!/bin/sh
# Usage: fulltex.sh mydoc   (the document name without the .tex extension)
A=$1
latex $A.tex
bibtex $A
latex $A.tex
latex $A.tex
dvips -t letter $A.dvi -o $A.ps
ps2pdf $A.ps
For a simpler version, just eliminate the bibtex command and one or two of the latex commands.
Now, say you have a directory full of .tex documents, and you want to generate all of them. You can put a loop in the shell script:
for i in *.tex
do
latex "$i"
dvips -t letter "${i%.tex}.dvi" -o "${i%.tex}.ps"
done
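A loop like this needs each file name without its .tex suffix in order to name the matching .dvi and .ps files; the shell's ${i%.tex} expansion does that. Before running such a loop for real, it can help to have it print the commands it would run. A sketch of that kind of dry run (the file names are invented and nothing is actually compiled):

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch paper.tex slides.tex    # invented, empty placeholder files

# Print the commands instead of running them (a dry run)
for i in *.tex; do
    base=${i%.tex}            # paper.tex -> paper
    echo "latex $i"
    echo "dvips -t letter $base.dvi -o $base.ps"
done > plan.txt
cat plan.txt
```

Once the printed commands look right, remove the echo wrappers to run them.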
Let's say you have a bunch of files that have complex names that you'd like to simplify, like changing “1997A-Document1.txt” to “Document1.txt”. For example, you have the following files in a directory:
$ ls
1997A-Document1.txt 1997A-Document2.txt 1997A-Document3.txt
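You could do a bunch of mv commands that would handle each one individually. Or, you can do the following (using bash). The pattern is stored in a variable because bash 3.2 and later treat a quoted pattern after =~ as a literal string rather than a regular expression:
$ re='-([[:alnum:]_]+\.txt)$'; for i in *; do if [[ $i =~ $re ]]; then mv "$i" "${BASH_REMATCH[1]}"; fi; done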
Now the names of the files have been changed:
$ ls
Document1.txt Document2.txt Document3.txt
You can enter that bash program on a single line, but if you'd like to make an actual script, you can spread things out a bit:
re='-([[:alnum:]_]+\.txt)$'
for i in *; do
if [[ $i =~ $re ]]; then
mv "$i" "${BASH_REMATCH[1]}"
fi
done
This example should be a good pointer for how to do lots of other similar things in bash.
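When all you need is to strip a prefix up to the first hyphen, plain parameter expansion does the job without a regular expression, and it works in any POSIX shell, not just bash. A sketch in a scratch directory; the extra file notes.txt is invented to show that files without a hyphen are left alone, and the existence check is an added safeguard against overwriting.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 1997A-Document1.txt 1997A-Document2.txt 1997A-Document3.txt notes.txt

for i in *-*.txt; do
    [ -e "$i" ] || continue   # the glob matched nothing
    new=${i#*-}               # drop everything up to and including the first hyphen
    if [ -e "$new" ]; then    # do not overwrite an existing file
        echo "skipping $i: $new already exists" >&2
    else
        mv -- "$i" "$new"
    fi
done
ls
```

The ${i#*-} form removes the shortest matching prefix; ${i##*-} would remove the longest, which matters when a name contains more than one hyphen.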
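In certain cases, the rename command is a simpler way to accomplish such tasks. With the Perl-based rename found on Debian and Ubuntu systems, you can rename all files ending in “.txt” to “.foo” as follows (the rename that ships with util-linux on some other distributions takes different arguments, so check man rename first):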
$ rename 's/\.txt$/.foo/' *.txt
I like using emacs with a black background. This can be invoked on startup with the following command:
emacs -bg black -fg white
If you don't want to have to type all that each time you start emacs you can stick it in a script (which I call 'blackmacs.sh'):
#!/bin/sh
emacs -bg black -fg white "$@"
There are a number of ways to produce graphs. Probably the best thing to do if you are learning for the first time is to use the R language.
In the meantime, there are some old ways of doing it too: xgraph and gnuplot.
xgraph provides a simple way to create graphs. Here's how you can create a graph, quickly. Say you have an xgraph specification file like this:
TitleText: Sample Data

"Plot one"
1 2
2 3
3 4
4 5
5 6

"Plot two"
1 1
2 4
3 9
4 16
5 25

"Plot three"
1 10
2 8
3 6
4 4
5 2
This should be pretty self-explanatory: there are three different relationships being plotted, and we can name them by putting a string in quotes along with the block giving the data. The first column gives x values, the second gives y values.
To see the graph, save this to a file like foo.txt and do this:
$ xgraph foo.txt
Unfortunately, you have to either be logged onto the machine or have X forwarding working in order for this to work. It seems to be a known issue that xgraph segfaults when you try to output directly to a file, which you are supposed to be able to do this way:
$ xgraph -device ps -o foo.ps foo.txt
Segmentation fault
Another alternative is gnuplot, which also works if you are working remotely. First, let's set it up to work on a machine you are in front of, or one where you have X forwarding working.
Let's start by creating two data files 'numbers1.dat' and 'numbers2.dat'.
numbers1.dat:
1 2
2 3
3 4
4 5
5 6
numbers2.dat:
1 1
2 4
3 9
4 16
5 25
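Data files like these can also be generated rather than typed in. A small sketch that writes the same two files with awk (x with x+1 for the linear data, x with x squared for the other); it runs in a scratch directory so it will not overwrite anything:

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# x and x+1 for x = 1..5
awk 'BEGIN { for (x = 1; x <= 5; x++) print x, x + 1 }' > numbers1.dat

# x and x squared for x = 1..5
awk 'BEGIN { for (x = 1; x <= 5; x++) print x, x * x }' > numbers2.dat

cat numbers1.dat numbers2.dat
```

An awk program with only a BEGIN block reads no input, so no input file is needed.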
Now, create a file called myplot.gp with the following contents (and in the same directory):
set xlabel 'My X-axis label'
set ylabel 'My Y-axis label'
plot 'numbers1.dat' title 'linear' with l, \
'numbers2.dat' title 'squared' with l
To see the output of visualizing the data, do this:
$ gnuplot -persist myplot.gp
If you instead want to save the graph to a Postscript file (which you'll need to do if doing this remotely), define your gnuplot specifications in myplot.gp as follows:
set terminal postscript landscape enhanced mono dashed lw 1 'Helvetica' 14
set output 'myplot.ps'
set xlabel 'My X-axis label'
set ylabel 'My Y-axis label'
plot 'numbers1.dat' title 'linear' with l, \
'numbers2.dat' title 'squared' with l
This saves the graph to the file myplot.ps. You can then use scp to retrieve the file from the remote machine and look at it on yours. If you are on a *nix machine, you can do it as follows. Say your login name is johndoe, and you have myplot.ps in your home directory /home/johndoe:
$ scp johndoe@ssh.ling.utexas.edu:/home/johndoe/myplot.ps .
That will securely copy myplot.ps to your machine.
If your home machine is a Windows box, use PSCP from PuTTY to copy the file to your home machine.