Unix/Programming Tips and Tricks

This page is for adding various tips and tricks for programming. Please feel free to add anything you think would be helpful for others!

The Software Carpentry site has a nice collection of tutorials and tips regarding software development.

Notes on Programming Languages

Here are some brief notes on some of the languages you might consider using for work in NLP. This is in no way complete, just a starter to get the ball rolling. If your favorite language has been slighted or under-represented, speak up in its defense!

C/C++

Advantages:

  • speed, speed, speed
  • fine-tuned control
  • cross-platform
  • standardization
  • speed
  • great on your resume
  • speed
  • efficient data types that can be used to reduce memory consumption, and more
  • did we mention speed?

Disadvantages:

  • not as high-level as other languages – can require oodles of code to do things that Python does in a breath (string operations are particularly painstaking)
  • segmentation faults galore
  • installation of external libraries can sometimes lead to tricky dependencies and the need for root access on a machine (unlike Java)
  • pointers can be tricky (though are part of C's greatness)

To get started:

  • For a list of book recommendations on C++ programming: here (and Accelerated C++ is rumored to be a pretty good introductory text)
  • You do NOT have to know C to learn C++! (it might even help not to know C going into C++)
  • For solutions to some infuriating template programming issues, see cplusplus

Java

Main link: java.sun.com

Advantages:

  • cross-platform (more so than most languages)
  • object-orientation
  • lots of external packages which can be easily imported and used in your system
  • high-level yet has a full range of data types, like boolean/byte/short/int distinctions, for more efficient memory usage (though Java is *much* more of a memory hog than C, for example)
  • great on your resume
  • excellent for web-based applications

Disadvantages:

  • uses a lot more memory than C/C++ to handle the same stuff
  • *obsessive* object-orientation leads to verbose code
  • setting classpaths (for using other modules) correctly can be very confusing for novices

Lisp

TBA

Perl

Advantages:

  • completely native use of regular expressions
  • complex text processing tasks can be written with very little code
  • fast at text processing
  • great for small scripts
  • the same bit of code can be expressed in many ways (like in real language)

Disadvantages:

  • Perl syntax is ugly, ugly, ugly, and often very hard to read
  • not ideal for large system development
  • object-orientation available, but it is tacked on
  • over-permissive syntax that leads to most errors surfacing at run time (though compilation flags can help illuminate code problems)
  • the same bit of code can be expressed in many ways (can be hard to decode the original programmer's intent – even when you *were* the original programmer)

We have a separate page for perl tips:

http://comp.ling.utexas.edu/wiki/doku.php/perl_tips

Prolog

Advantages:

  • programs are declared in terms of their goals rather than imperatively specified – you say *what* should be computed, not *how* it should be computed
  • the same function can be used to return results for different arguments depending on how it is called
  • certain algorithms can be straightforwardly coded, like chart parsers
  • a lot of important NLP tools have been built in Prolog

Disadvantages:

  • the hype about not working with imperative procedures doesn't quite hold in practice when building real systems
  • generally difficult to build large systems
  • painful, painful, painful for text processing, system control

Python

Main link: www.python.org/

Our lab has some tips for Python programming that might be useful for you.

Advantages:

  • a lot of work can get done with very little code, but the code remains quite readable
  • works as a great front-end or glue language to code written in other languages
  • great for text-processing applications
  • very high-level and easy to learn as a first language (hence its choice for NLTK)
  • can call C functions directly

Disadvantages:

  • slow
  • high-memory usage
  • not really suitable for intensive computing applications
  • indentation (rather than delimiters like braces {}) indicates block structure, which can lead to errors when moving or re-nesting blocks (and this simply doesn't happen in most languages)
  • variables live on outside of the block in which they are introduced (we hates it!)

Ruby

Ruby is a lot like Python, so it shares many of its advantages and disadvantages. Choosing between Ruby and Python is more a matter of taste than of functionality, really.

Advantages:

  • Ruby code is very readable, and often you can get a lot done with a short program
  • great for text processing
  • very high level and easy to learn
  • proper brackets rather than indentation for indicating block structure (yay!)
  • Great online reference available: the Ruby book. If you have ever searched around the Python library documentation and hated it, you will appreciate this.
  • very, very object-oriented.
  • Everything is a first-class object. A class is in fact an object of class Class, so you can manipulate it accordingly. This gives you the power to express some high-level concepts directly in the language (for example some Design Patterns, see the Ruby book), and the power to change a lot of things about the language…
  • proper separation between class variables and object variables (hear this, Python?)

Disadvantages:

  • slow
  • high memory usage
  • not really suitable for intensive computing applications
  • very, very object-oriented. I mean, even a loop over the numbers from 1 to 10 is a method of the number 1: “1.upto(10) { … }”
  • sometimes too many ways to express something. Do I really have to have a choice between using {…} and begin…end?

Handy Unix Commands

Downloading files remotely

Sometimes you are working remotely on the lab machines, and you want to download a file from somewhere that you found on a web page. Rather than downloading it to your home machine and then uploading it to the lab, use “wget”:

jbaldrid@quiche:~/tmp$ wget http://nltk.googlecode.com/files/nltk-2.0b7.zip

The lynx textual web browser can also be used for such downloads. It is especially handy for journal publications that can be downloaded from UT machines but not from off campus: log on to the lab machines, run lynx in your terminal, go to the page, and tab to the paper you want to download.

Reinvoking a previous command

If you have entered a bunch of commands and want to recall one that began with a particular prefix, you can invoke it again using “!” followed by that prefix, as done below to recall the first “cat …” command with “!ca”:

/groups/corpora/nltk-data/gutenberg$ cat austen-emma.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | sort -nr | grep -E "\b[Hh]er?\b"
   2400 her
   1368 he
    443 He
     90 Her
/groups/corpora/nltk-data/gutenberg$ ls
austen-emma.txt        blake-songs.txt          README
austen-persuasion.txt  chesterton-ball.txt      shakespeare-caesar.txt
austen-sense.txt       chesterton-brown.txt     shakespeare-hamlet.txt
bible-kjv.txt          chesterton-thursday.txt  shakespeare-macbeth.txt
blake-poems.txt        milton-paradise.txt      whitman-leaves.txt
/groups/corpora/nltk-data/gutenberg$ wc austen-emma.txt
 17078 159826 914529 austen-emma.txt
/groups/corpora/nltk-data/gutenberg$ !ca
cat austen-emma.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | sort -nr | grep -E "\b[Hh]er?\b"
   2400 her
   1368 he
    443 He
     90 Her

Command Line NLP Tools

Chris Brew (OSU) and Marc Moens wrote a draft of a book on NLP that has 40 or so pages that describe many UNIX shell commands which are useful for building n-grams, etc. Fred Hoyt extracted these pages to a single PDF.


Splitting a file into a series of files by taking every nth line

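Say you have a file with a bunch of lines, and you want to put all odd numbered lines into one file (file1) and even numbered ones into another (file2). Here's an easy way to do it:

$ cat <file> | awk '{print > ("file" ((NR-1)%2+1))}'

The parentheses around the file name expression keep the redirection unambiguous across awk implementations.

For three files with every third line (lines 1, 4, 7, … in file1, and so on), do:

$ cat <file> | awk '{print > ("file" ((NR-1)%3+1))}'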

And so on.
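
The same idea can be wrapped in a small shell function that takes the number of output files as a parameter. This is only a sketch: the function name split_every_nth, the output prefix, and the temporary demo directory are made up for illustration, and it assumes a POSIX awk. Output files are numbered 1 to N in line order.

```shell
# split_every_nth FILE N PREFIX
# Writes line k of FILE to the file PREFIX<((k-1) mod N) + 1>, so with N=3
# lines 1,4,7,... go to PREFIX1, lines 2,5,8,... to PREFIX2, and
# lines 3,6,9,... to PREFIX3.
split_every_nth() {
    awk -v n="$2" -v prefix="$3" '{
        out = sprintf("%s%d", prefix, (NR - 1) % n + 1)
        print > out
    }' "$1"
}

# Demo on a throwaway file containing the numbers 1 to 9, one per line.
demo_dir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 > "$demo_dir/input.txt"
split_every_nth "$demo_dir/input.txt" 3 "$demo_dir/part"
cat "$demo_dir/part2"
```

The demo prints the contents of the second output file, which holds lines 2, 5 and 8 of the input.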


Removing Lines from a File

If you want to remove all lines from a file which match some pattern, the sed ('stream editor') utility is useful:

$ cat <file> | sed '/<pattern>/d' > outfile

For example, if you have a LaTeX file and you want to remove all the lines containing comments:

$ cat my_tex.tex | sed '/%/d' > my_commentless_tex.tex
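
Because the pattern /%/ matches a percent sign anywhere on a line, the command above also deletes lines that mix text with a trailing comment or that contain an escaped \%. A more conservative variant, sketched here on the assumption that you only want to drop lines consisting entirely of a comment, anchors the pattern to the start of the line. The function name and the sample text are hypothetical.

```shell
# strip_comment_lines: read LaTeX on stdin, delete every line whose first
# non-blank character is %, and pass all other lines through unchanged.
strip_comment_lines() {
    sed '/^[[:space:]]*%/d'
}

# Demo: only the second and fourth lines survive.
printf '%s\n' \
    '% a full-line comment' \
    'Fish heads cost 50\% less today. % trailing comment' \
    '   % an indented comment' \
    'Roly-poly fish heads.' | strip_comment_lines
```

To use it on a file, redirect as usual: strip_comment_lines < my_tex.tex > my_commentless_tex.tex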


Extracting Columns from a File

The awk utility is useful for extracting columns from a file that contains columnar data.

For example, the following pulls out the first column:

$ awk '{print $1}' my_file > column_one

To pull out the 2nd and 3rd columns separated by a comma:

$ awk '{print $2 "," $3}' my_file
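
Another way to control what goes between the columns is awk's built-in OFS (output field separator) variable, which is what a comma inside print expands to. A minimal sketch follows; the function name cols_2_3_csv and the sample input are invented for illustration.

```shell
# cols_2_3_csv [FILE...]
# Prints fields 2 and 3 of each whitespace-separated input line, joined by
# a comma. Setting OFS in BEGIN changes what "print a, b" puts between
# a and b (the default is a single space).
cols_2_3_csv() {
    awk 'BEGIN { OFS = "," } { print $2, $3 }' "$@"
}

# Demo on two lines of made-up word/tag/count data.
printf '%s\n' 'the DT 120' 'fish NN 100' | cols_2_3_csv
```

With no file argument the function reads standard input, so it can sit at the end of a pipeline.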


Performing Calculations on a Column in a File

You can also use awk to perform arithmetic operations on a column containing numbers.

For example, say you are extracting unigram counts for words in a text. You use the usual command-line tools to calculate word counts and print them to a file:

$ cat I.like.fish.heads.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn > fish.count.txt

(if you don't know what these commands mean, see the PDF file on command line NLP tools just above).

This produces output like the following (displaying the results rather than printing to a file):

$ cat I.like.fish.heads.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn | less

100 fish
100 heads
81  yum
23  chomp

If you want to get the sum of all these counts, do the following:

$ cat fish.count.txt | awk '{total = total + $1} END {print total}'
304

The key is that both bracketed expressions in the awk command are within the quote marks. The first expression runs once per input line and adds the value of the first column (expressed by the $1 variable) to the variable total, which starts at 0. The END expression runs once, after the last line has been read, and prints the total.
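
The END block is also the place for anything that needs the grand total, such as relative frequencies. The following sketch stores the counts in an awk array and prints each word's share of the total once all input has been read. The function name relative_freqs is hypothetical, and the input is assumed to be "count word" lines like those in fish.count.txt.

```shell
# relative_freqs: read "count word" lines on stdin and print, for each word,
# a tab-separated line with the word, its count, and count/total.
# The per-line block only accumulates; the division has to wait for END
# because the total is not known until every line has been seen.
relative_freqs() {
    awk '{ count[$2] += $1; total += $1 }
         END {
             for (w in count)
                 printf "%s\t%d\t%.3f\n", w, count[w], count[w] / total
         }' |
    sort -k2,2nr -k1,1    # highest count first, ties broken alphabetically
}

# Demo using the counts shown above.
printf '%s\n' '100 fish' '100 heads' '81 yum' '23 chomp' | relative_freqs
```

The trailing sort is there because awk's "for (w in count)" visits array keys in no particular order.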


Leaving a process running while you log out

Say you have an experiment to run or code to compile and it's going to take a long time. You need to leave the lab, you don't want to tie up one of the machines by leaving it screen locked, and you want to be able to check the process from another machine.

A nice way to do this is by using the screen utility. It works like this:

(1) Log in or ssh into the machine you are going to use.

(2) Type:

  $ screen

(3) You will see an intro window. Press any key and you will then be presented with a prompt. Start your process:

  $ my-process

(4) Now you have to leave. Press ctrl-a followed by “d”. This “detaches” you from the process. Now you can log out, lock up the lab, and go about your business.

(5) To re-attach to the process later, log in or ssh back into the computer on which the process is running, and use screen again as follows.

(6) Type:

  $ screen -ls

This will show you a list of the screen sessions that are running:

  $ screen -ls
    There are screens on:
              28354.pts-0.odyssey     (Detached)
              28302.pts-0.odyssey     (Detached)
    2 Sockets in /var/run/screen/S-bubba.

Say you want to re-attach to 28302. Type:

  $ screen -r 28302.pts-0.odyssey

The process will re-appear in the terminal and you're ready to go.

For more details on how to use screen, see http://www.rackaid.com/resources/tips/linux-screen.cfm


How to leave a process running without needing to see it again

If you want to leave a process running until it finishes, and you only need to see its output afterwards, you can use the nohup utility (“nohup” stands for “no hangup”). Say you want to run a program called fishheads.py; do the following:

   $ nohup fishheads.py > dog_treats.txt &

Now you can log out and go about your business, and the process will keep running until it's through. Be sure to include the ampersand, which puts the process in the background and gives you your prompt back.
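
In practice it often helps to capture error messages in the same file and to note the process ID, so that you can check on the job (or kill it) later. The following is a sketch under those assumptions; the function name run_detached, the variable detached_pid, and the log file are invented for illustration, and the demo job is a trivial echo rather than a real experiment.

```shell
# run_detached LOGFILE COMMAND [ARGS...]
# Starts COMMAND under nohup in the background, sends both stdout and
# stderr to LOGFILE, and stores the job's process ID in detached_pid.
run_detached() {
    log=$1
    shift
    nohup "$@" > "$log" 2>&1 &
    detached_pid=$!
}

# Demo with a short-lived stand-in for a long-running program.
log_file=$(mktemp)
run_detached "$log_file" sh -c 'echo "fish heads done"'
echo "started job $detached_pid, logging to $log_file"

wait "$detached_pid"    # only for this demo; normally you would log out here
cat "$log_file"
```

Later, from another login, ps -p followed by the recorded process ID shows whether the job is still running.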


A Quick and Easy Command for Checking Disk Space

du is a useful command-line utility for taking inventory of disk space usage. Say you're in your home directory; the following command lists each immediate subdirectory, preceded by its total size:

   $ du -h --max-depth=1

If you use du without any options, it reports every subdirectory at every depth, giving a potentially long list. The -h option prints the sizes in an eye-friendly format (K, M, G), and --max-depth=1 limits the listing to the first level of subdirectories (their sizes still include everything beneath them).

If you just want a sum of disk space usage for the contents of the directory, do the following:

   $ du -hs

Revisioning systems

If you aren't using a revisioning system like CVS or Subversion, you should start doing so now. If you haven't used either, Subversion is the better one to get started on. If you are already using CVS and want to switch over to Subversion, check out these CVS-to-Subversion switching notes.

Subversion

For the big picture and full documentation, check out the Subversion manual or get the quick start with Subversion cheat sheet. For help specific to using Subversion on the lab computers, read the following.

Also this is a nice general tutorial.

First things first, you may want to set up a password-free login with SSH public-key authentication. Then, create a repository in your home directory, like this:

efp@quiche:~$ mkdir newsvn
efp@quiche:~$ svnadmin create --fs-type fsfs /home/efp/newsvn

On your own machine (at home or on a laptop) or in your own filespace on the lab computers, create a new project:

dib:~ efp$ mkdir newproject
dib:~ efp$ cd newproject/
dib:~/newproject efp$ touch foo.c bar.c baz.c
dib:~/newproject efp$ ls
bar.c   baz.c   foo.c

Add the project like this:

dib:~/newproject efp$ svn import . svn+ssh://quiche.ling.utexas.edu/home/efp/newsvn/newproject -m"Test import"
Adding         foo.c
Adding         bar.c
Adding         baz.c

Committed revision 1.

Note that you should put your own directory in place of /home/efp/newsvn. Also note that the name in the repository doesn't have to be “newproject” just because that is the name of the local directory you imported from. You could have put /home/efp/newsvn/cool_parser, or whatever your project should be called when it gets checked out.

Now, this may seem a bit weird (and it's the same with CVS) – move the original project out of the way to start working with the current working copy:

dib:~ efp$ mv newproject/ tmp/
dib:~ efp$ svn co svn+ssh://quiche.ling.utexas.edu/home/efp/newsvn/newproject
A    newproject/foo.c
A    newproject/bar.c
A    newproject/baz.c
Checked out revision 1.

You need to do that because once you have added those files to the repository, the original directory containing them is in no way connected to the subversion repository itself. (It was just a data source.)

With your new checkout, you have a fully functioning view on the repository. So, now you can edit and all:

dib:~ efp$ cd newproject/
dib:~/newproject efp$ echo "#include <stdio.h>" >> foo.c
dib:~/newproject efp$ svn status
M      foo.c
dib:~/newproject efp$ svn ci -m"Change foo"
Sending        foo.c
Transmitting file data .
Committed revision 2.

Now, you can do some fun svn stuff:

dib:~/newproject efp$ svn mkdir newdir
A         newdir
dib:~/newproject efp$ svn mv foo.c newdir
A         newdir/foo.c
D         foo.c
dib:~/newproject efp$ svn ci -m"Added newdir"
Deleting       foo.c
Adding         newdir
Adding         newdir/foo.c

Committed revision 3.
dib:~/newproject efp$ svn mkdir newdir2
A         newdir2
dib:~/newproject efp$ svn mv newdir/foo.c newdir2/
A         newdir2/foo.c
D         newdir/foo.c
dib:~/newproject efp$ svn rm newdir
D         newdir/foo.c
D         newdir
dib:~/newproject efp$ svn ci -m"Created newdir2, dropped newdir"
Deleting       newdir
Adding         newdir2
Adding         newdir2/foo.c

On the lab machines, you can check the project out like this:

efp@quiche:~$ svn co file:///home/efp/newsvn/newproject
A  newproject/bar.c
A  newproject/newdir2
A  newproject/newdir2/foo.c
A  newproject/baz.c
Checked out revision 4.

Getting old versions

You can always retrieve any prior version of a file from svn, like this:

svn cat -r134 expand.tex

This command prints out revision 134 of the file (in this example, the revision just prior to a change that I made). You could save it to a temporary file like this:

svn cat -r134 expand.tex > expand.tex.134

To figure out which version to retrieve, use this:

svn log expand.tex

This prints out all the available versions, who committed each version and when, and what the comment was.

CVS

A lot of work is still done with CVS, so even though SVN may be easier to use in some ways, it is still useful to know how to work with CVS.

If you need to access a CVS repository from a non-local site, you use a command of the following form (thanks to Jason for this):

cvs -d:pserver:id@url:/cvsroot/project login
cvs -z3 -d:pserver:id@url:/cvsroot/project co -P project

For example, if you want to download the latest version of OpenCCG from the Sourceforge repository:

cvs -d:pserver:anonymous@openccg.cvs.sourceforge.net:/cvsroot/openccg login
cvs -z3 -d:pserver:anonymous@openccg.cvs.sourceforge.net:/cvsroot/openccg co -P openccg

Note that you may get error messages even when you successfully check out a project, so error messages should not be taken at face value. First check to see if the checkout actually worked, and if not, then start Googling the error message(s) that you got.

If you are trying to access a repository in the UT CompLing network from off-site, then you will need to tell CVS to ssh into the lab network:

export CVS_RSH=ssh 

Then the checkout command:

cvs -d:ext:id@url:/path checkout project

If you are accessing the lab, the url will be something like quiche.ling.utexas.edu, and the path will be the (absolute) path to the directory containing the repository. If you get failure messages involving permissions (and you are sure that you do have the correct permissions) then there is probably an error in the path as you gave it.

Build tools

If you are developing a system that involves lots of code, possibly spread out over many files, using a “build” tool of some sorts is highly advisable. A standard approach is to use the original make setup, using GNU Make for example. It provides a simple and very standardized way to declare what are the “targets” for compilation, and has a great deal of configurability and power. To help out with configuration settings so that code can be compiled easily on lots of different systems, check out Autoconf.

If you are working with Java, the Apache Ant system is quite nice (it can be used for other languages too). Check out the source code of OpenCCG, specifically its build XML file, which Ant uses to perform its build actions.

Shell Scripts for Fun and Profit

Automated LaTeX-ing

If you like to generate your LaTeX output on the command line, here are some quick, simple scripts for streamlining the process.

Say you have a latex document with a bibliography. The following script generates the document once, generates the bibliography, then re-runs the latex twice so that the references all appear in the text, and then prints it first to .ps and then to .pdf. I call the script `fulltex.sh':

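#!/bin/sh
# Usage: fulltex.sh mydoc   (base name, without the .tex extension)

A=$1                  # no spaces around = in a shell assignment
latex "$A.tex"        # first pass: writes the citation keys to $A.aux
bibtex "$A"           # builds the bibliography from $A.aux
latex "$A.tex"        # second pass: pulls in the bibliography
latex "$A.tex"        # third pass: resolves the references
dvips -t letter "$A.dvi" -o
ps2pdf "$A.ps"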

For a simpler version, just eliminate the bibtex command and one or two of the latex commands.

Now, say you have a directory full of .tex documents, and you want to generate all of them. You can put a loop in the shell script:

for i in *.tex
do
    latex "$i"
    dvips -t letter "${i%.tex}.dvi" -o
done

Bashing your files to rename them

Let's say you have a bunch of files that have complex names that you'd like to simplify, like changing “1997A-Document1.txt” to “Document1.txt”. For example, you have the following files in a directory:

$ ls
1997A-Document1.txt  1997A-Document2.txt  1997A-Document3.txt

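You could do a bunch of mv commands that would handle each one individually. Or, you can do the following (using bash; the regular expression is kept in a variable because bash 3.2 and later treat a quoted pattern after =~ as a literal string):

$ re='-([[:alnum:]_]+\.txt)$'; for i in *; do if [[ $i =~ $re ]]; then mv -- "$i" "${BASH_REMATCH[1]}"; fi; done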

Now the names of the files have been changed:

$ ls
Document1.txt  Document2.txt  Document3.txt

You can enter that bash program on a single line, but if you'd like to make an actual script, you can spread things out a bit:

# keep the regex in a variable: a quoted pattern after =~ is matched literally
re='-([[:alnum:]_]+\.txt)$'
for i in *; do
  if [[ $i =~ $re ]]; then
    mv -- "$i" "${BASH_REMATCH[1]}"
  fi
done

This example should be a good pointer for how to do lots of other similar things in bash.
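
For a simple prefix like this one, shell parameter expansion can do the same job without a regular expression. The following is a sketch rather than a drop-in tool: the function name strip_dash_prefix is hypothetical, it only looks at .txt files containing a dash, and the demo runs in a temporary directory so that no real files are touched.

```shell
# strip_dash_prefix DIR
# Renames every DIR/<prefix>-<rest>.txt to DIR/<rest>.txt.
# "${name#*-}" removes the shortest leading match of "*-", that is,
# everything up to and including the first dash.
strip_dash_prefix() {
    for path in "$1"/*-*.txt; do
        [ -e "$path" ] || continue      # the glob matched nothing
        name=${path##*/}                # file name without the directory
        new=${name#*-}
        if [ -e "$1/$new" ]; then
            echo "skipping $name: $new already exists" >&2
        else
            mv -- "$path" "$1/$new"
        fi
    done
}

# Demo in a throwaway directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/1997A-Document1.txt" "$demo_dir/1997A-Document2.txt"
strip_dash_prefix "$demo_dir"
ls "$demo_dir"
```

The existence check keeps the function from silently overwriting a file when two different prefixes map to the same short name.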

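In certain cases, the rename command is a simpler way to accomplish such tasks. The example below uses the Perl-based rename (the default on Debian-style systems), which takes a Perl substitution; the rename that ships with util-linux uses a different syntax. For example, to rename all files ending in “.txt” to “.foo”, do the following: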

$ rename 's/\.txt$/.foo/' *.txt

Making Emacs Look the Way You Want It

I like using emacs with a black background. This can be invoked on startup with the following command:

emacs -bg black -fg white

If you don't want to have to type all that each time you start emacs you can stick it in a script (which I call 'blackmacs.sh'):

#!/bin/sh
emacs -bg black -fg white "$@"

Graphing

There are a number of ways to produce graphs. Probably the best thing to do if you are learning for the first time is to use the R language.

In the meantime, there are some old ways of doing it too: xgraph and gnuplot.

xgraph

xgraph provides a simple way to create graphs. Here's how you can create a graph, quickly. Say you have an xgraph specification file like this:

TitleText: Sample Data

"Plot one"
1 2
2 3
3 4
4 5
5 6

"Plot two"
1 1
2 4
3 9
4 16
5 25

"Plot three"
1 10
2 8
3 6
4 4
5 2

This should be pretty self-explanatory: there are three different relationships being plotted, and we can name them by putting a string in quotes along with the block giving the data. The first column gives x values, the second gives y values.

To see the graph, save this to a file like foo.txt and do this:

$ xgraph foo.txt

Unfortunately, you have to either be logged onto the machine directly or have X forwarding working in order for this to work. It seems to be a known issue that xgraph segfaults when you try to output directly to a file, which you are supposed to be able to do this way:

$ xgraph -device ps -o foo.ps foo.txt
Segmentation fault

gnuplot

Another alternative is gnuplot, which will also work if you are working remotely. First, let's set it up to work on a machine you are in front of, or on which you have X forwarding working.

Let's start by creating two data files 'numbers1.dat' and 'numbers2.dat'.

  • numbers1.dat
1 2
2 3
3 4
4 5
5 6
  • numbers2.dat
1 1
2 4
3 9
4 16
5 25

Now, create a file called myplot.gp with the following contents (and in the same directory):

set xlabel 'My X-axis label'
set ylabel 'My Y-axis label'
plot 'numbers1.dat' title 'linear' with l, \
     'numbers2.dat' title 'squared' with l

To see the output of visualizing the data, do this:

$ gnuplot -persist myplot.gp

If you instead want to save the graph to a Postscript file (which you'll need to do if doing this remotely), define your gnuplot specifications in myplot.gp as follows:

set terminal postscript landscape enhanced mono dashed lw 1 'Helvetica' 14
set output 'myplot.ps'
set xlabel 'My X-axis label'
set ylabel 'My Y-axis label'
plot 'numbers1.dat' title 'linear' with l, \
     'numbers2.dat' title 'squared' with l

This saves the graph to the file myplot.ps. You can then use scp to retrieve the file from the remote machine and look at it on yours. If you are on a *nix machine, you can do it as follows. Say your login name is johndoe, and you have myplot.ps in your home directory /home/johndoe:

$ scp johndoe@ssh.ling.utexas.edu:/home/johndoe/myplot.ps .

That will securely copy myplot.ps to your machine.

If your home machine is a Windows box, use PSCP from Putty to copy the file to your home machine.

 
tips_and_tricks.txt · Last modified: 2009/11/10 22:46 by jason
 
Except where otherwise noted, content on this wiki is licensed under the following license:CC Attribution-Noncommercial-Share Alike 3.0 Unported