This page is both for tips on using Python more effectively and for explanations of basic Python concepts.
Throw your tips in here if they help you get jobs done more quickly or more elegantly, or even if they are just cool because you can do them at all.
When you divide a number by a float, you get a float with lots of decimal places, e.g.:
>>> print 4/19.0
0.210526315789
You can round to a certain precision with the following kind of print statement:
>>> print "%.3f" % (4/19.0)
0.211
>>> print "%.3f %.4f" % (4/19.0, 46/21.0)
0.211 2.1905
If you want nice percentages like “21.05%”, do this:
>>> print "%.2f%%" % (4/19.0 * 100)
21.05%
(Because % is a special character in this kind of print statement, you write it twice, as %%, to get an actual % in the output.)
You can also add in strings with %s:
>>> print "A: %.3f B: %.4f C: %s" % (4/19.0, 46/21.0, "hello")
A: 0.211 B: 2.1905 C: hello
Where this is truly handy is when you have variables you are trying to print:
>>> x = 4/19.0
>>> y = 46/21.0
>>> z = "hello"
>>> print "A: %.3f B: %.4f C: %s" % (x, y, z)
A: 0.211 B: 2.1905 C: hello
Say you have a list containing the numbers 0 through 9 (which you can easily obtain using range(10)). Now say you want to create a new list that contains the squares of 0 through 9. You could do it this way:
>>> squared = []
>>> for x in range(10):
...     squared.append(x**2)
...
>>> squared
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Here's a more concise way using list comprehensions:
>>> squared = [x**2 for x in range(10)]
>>> squared
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
You can also filter some values out using conditionals:
>>> squared = [x**2 for x in range(10) if x > 3]
>>> squared
[16, 25, 36, 49, 64, 81]
>>> squared = [x**2 for x in range(10) if x % 2 == 0]
>>> squared
[0, 4, 16, 36, 64]
Another alternative is to use the map() function with a lambda expression:
>>> squared = map(lambda x:x**2, range(10))
>>> squared
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
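One caveat, assuming you may also run this on Python 3: there map() returns a lazy iterator rather than a list, so it needs to be wrapped in list() to give the same result. The following is a minimal sketch comparing the two styles; the variable names are made up for illustration.

```python
# Squares of 0..9, written two ways.
squared_comp = [x**2 for x in range(10)]
# list() is redundant on Python 2 but needed on Python 3, where map() is lazy.
squared_map = list(map(lambda x: x**2, range(10)))

# Filtering with map() needs a separate filter() step;
# a list comprehension does both in one expression.
even_squares_comp = [x**2 for x in range(10) if x % 2 == 0]
even_squares_map = list(map(lambda x: x**2,
                            filter(lambda x: x % 2 == 0, range(10))))

print(squared_comp == squared_map)            # True
print(even_squares_comp == even_squares_map)  # True
```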
Sometimes, you need to store data structures to a file, and then later read them back in. For example, you might have a bunch of values that correspond to the parameters of a model, or a dictionary representing the number of occurrences of various words in a text. You can convert them to strings using repr() and then back to the original data structures using eval(). Here's an example with a list:
>>> list1 = [0, [3,4]]
>>> list1_string = repr(list1)
>>> list1_string
'[0, [3, 4]]'
>>> eval(list1_string)
[0, [3, 4]]
>>> eval(list1_string)[1][1]
4
And here's one with a dictionary:
>>> dict1 = {"a":0,"b":1}
>>> dict1
{'a': 0, 'b': 1}
>>> repr(dict1)
"{'a': 0, 'b': 1}"
>>> dict1_string = repr(dict1)
>>> eval(dict1_string)
{'a': 0, 'b': 1}
>>> eval(dict1_string)["a"]
0
>>> eval(dict1_string)["b"]
1
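Because eval() will run any expression it is handed, it is risky to use on strings you did not write yourself. One alternative, which is not part of the original tip, is ast.literal_eval() from the standard library (Python 2.6 and later); it accepts only literals such as numbers, strings, lists, tuples and dictionaries. A minimal sketch:

```python
import ast

dict1 = {"a": 0, "b": [3, 4]}
dict1_string = repr(dict1)

# literal_eval rebuilds the structure without executing arbitrary code.
restored = ast.literal_eval(dict1_string)
print(restored == dict1)   # True

# Anything that is not a plain literal is rejected with a ValueError.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print(rejected)            # True
```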
Python also has a module for “pickling” data structures to files:
>>> import pickle
>>> my_dict = {"a":0, "b":1}
>>> f = open("mypickle", "wb")
>>> pickle.dump(my_dict, f)
>>> f.close()
Here's how you read the pickles back in:
>>> import pickle
>>> f = open("mypickle", "rb")
>>> my_dict = pickle.load(f)
>>> f.close()
This is highly recommended if you have code that runs for days or even weeks: if you pickle intermediate results, occasional crashes of the machine you're working on will not be a problem.
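As a rough sketch of a full round trip, the snippet below writes a dictionary to a throwaway temporary file and reads it back. The temporary directory and the variable names are only for illustration; in real code you would choose your own file name. The with statement closes the file even if an error occurs.

```python
import os
import pickle
import tempfile

results = {"a": 0, "b": 1}
# A throwaway location, used here only so the example cleans up easily.
path = os.path.join(tempfile.mkdtemp(), "mypickle")

# Binary mode ("wb"/"rb") is required on Python 3 and harmless on Python 2.
with open(path, "wb") as f:
    pickle.dump(results, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == results)   # True
```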
Here's a tip I found online.
I understand that if-statements inside loops tend to be expensive speed-wise. To get around this, in some situations you can replace an if-statement with a try-except pair, which runs significantly faster (at least on older versions of Python - see link below).
Say you are building a dictionary/array of words and their associated frequencies. The loop requires checking each token to see if it's already in the dictionary.
wdict = {}
for word in words:
    if word not in wdict:
        wdict[word] = 1
    else:
        wdict[word] += 1
As the dictionary grows, the if-test will fail more and more frequently, because most incoming words are already keys of the dictionary and only need their values incremented. This means that later in the process the if-test is wasting time, because the else-branch is doing most of the work.
So, you can replace the if-else with the following:
wdict = {}
for word in words:
    try:
        wdict[word] += 1
    except KeyError:
        wdict[word] = 1
This inverts the procedure by trying to augment the value of word in the dictionary first. Only if that fails does it add the word-value pair to the dictionary.
Note that the error type in the except-statement must be appropriate to the kind of procedure and/or data being treated by the try-statement. In other words, in this example, the error needs to be KeyError, since the try-statement is trying to look up a key in a dictionary.
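To check that the two approaches really do the same job, here is a small sketch that wraps each one in a function (the function names are made up for this example) and compares the results. Which version is faster depends on your Python version and on how often words repeat, so it is worth timing both on your own data, for instance with the timeit module.

```python
def count_with_if(words):
    """Count words by testing for the key first."""
    wdict = {}
    for word in words:
        if word not in wdict:
            wdict[word] = 1
        else:
            wdict[word] += 1
    return wdict

def count_with_try(words):
    """Count words by incrementing first and handling the missing key."""
    wdict = {}
    for word in words:
        try:
            wdict[word] += 1
        except KeyError:
            wdict[word] = 1
    return wdict

words = "the cat saw the dog and the dog saw the cat".split()
print(count_with_if(words) == count_with_try(words))   # True
print(count_with_try(words)["the"])                     # 4
```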
For more on this tip and others:
Many programming languages have a construct similar to the following:
print x ? y : z;
This means: if x is true, print y; otherwise print z. In Python, you can use the and/or trick
to achieve similarly concise expressions. Technically, x and y returns y if x is true, or x
otherwise; x or y returns x if x is true (that is, non-zero, non-empty, and so on), or y otherwise. So,
print x and y or z
is basically equivalent to the expression above, as long as y itself is a true value. If y is false (0, an empty string, an empty list, ...), the expression yields z even when x is true.
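The trick has a pitfall when y is itself a false value. The sketch below shows the pitfall next to the conditional expression (y if x else z), which Python has supported since version 2.5 and which does not have this problem. The function names are invented for the example.

```python
def pick_and_or(x, y, z):
    # The and/or trick: meant to give y if x is true, else z.
    return x and y or z

def pick_conditional(x, y, z):
    # The conditional expression, available since Python 2.5.
    return y if x else z

print(pick_and_or(True, "yes", "no"))     # yes
print(pick_and_or(False, "yes", "no"))    # no
# The pitfall: y is 0, which is false, so the trick falls through to z.
print(pick_and_or(True, 0, "no"))         # no
print(pick_conditional(True, 0, "no"))    # 0
```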
You can write a loop that counts down, like a decrementing for-loop in C, with:
for i in range(n,0,-1): ...
Careful! The above loop starts with i = n and ends with i = 1.
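The off-by-one matters when the loop variable is used as a list index, since valid indexes run from n-1 down to 0. A short sketch of the variants, with n chosen arbitrarily:

```python
n = 5
countdown = [i for i in range(n, 0, -1)]       # n down to 1
indexes = [i for i in range(n - 1, -1, -1)]    # n-1 down to 0
# reversed() (Python 2.4 and later) avoids the index arithmetic altogether.
same_indexes = list(reversed(range(n)))

print(countdown)                 # [5, 4, 3, 2, 1]
print(indexes)                   # [4, 3, 2, 1, 0]
print(indexes == same_indexes)   # True
```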
This one is a super pain to debug, and can lead to some very confusing errors. Basically, the no-no is: never have a constructor that takes an optional argument whose default is a mutable object, as in
class Spam:
    def __init__(self, x=[]):
        self.x = x
What happens is that the mutable object, in this case the empty list given as the default for parameter x, is created only once, when the function is defined. All subsequent calls that rely on the default get that same object, which may have changed in the meantime. Consider:
>>> class Spam:
...     def __init__(self, x=[]):
...         self.x = x
...
>>> a = Spam()
>>> a.x.append('a')
>>> b = Spam()
>>> b.x
['a']
A better way to handle this situation is to use an immutable value as the default, such as a simple boolean:
>>> class Spam:
...     def __init__(self, x=False):
...         if x:
...             self.x = x
...         else:
...             self.x = []
...
>>> a = Spam()
>>> a.x.append('a')
>>> a.x
['a']
>>> b = Spam()
>>> b.x
[]
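A common variant of this fix uses None as the default and tests for it with "is None". The difference from the truth test above is that an empty list passed in explicitly is kept rather than silently replaced by a new one. This is a sketch of that idiom, reusing the Spam class from the example:

```python
class Spam:
    def __init__(self, x=None):
        # None means "no argument given"; a fresh list is made on every call.
        if x is None:
            x = []
        self.x = x

a = Spam()
a.x.append('a')
b = Spam()

shared = []
c = Spam(shared)   # an explicitly passed empty list is kept as it is

print(a.x)             # ['a']
print(b.x)             # []
print(c.x is shared)   # True
```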
If you use a chained assignment in the style of C to create two variables and initialize them to the same value, both names end up referring to one and the same object instead of two separate copies. With a mutable value such as a list, changing it through one name therefore changes what the other name sees. Take a look at the following example:
>>> list_a = list_b = ['d','c']
>>> list_a
['d', 'c']
>>> list_b
['d', 'c']
>>> list_b.sort() # Note here that we are sorting list_b and not touching list_a at all
>>> list_a
['c', 'd']
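If you want two independent lists, make a copy instead of a second name. The sketch below uses list() for this; slicing with list_a[:] works as well. Both give a shallow copy, meaning that nested lists inside would still be shared.

```python
list_a = ['d', 'c']
list_b = list(list_a)     # a new list holding the same elements

list_b.sort()
print(list_a)             # ['d', 'c']
print(list_b)             # ['c', 'd']
print(list_a is list_b)   # False

alias = list_a            # no copy: both names refer to one object
print(alias is list_a)    # True
```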
These are explanations and/or examples of basic concepts that are well covered in introductions to Python, but which come up often in teaching and merit an entry.
Here is what learners of Python quickly get accustomed to doing:
>>> my_list = [1,4,5,2,8,4]
>>> for number in my_list:
...     print number
...
1
4
5
2
8
4
Here is how you do the same thing with indexes:
>>> for i in range(0, len(my_list)):
...     print my_list[i]
...
1
4
5
2
8
4
The range function just gives you a list of numbers, e.g.:
>>> print range(1,10)
[1, 2, 3, 4, 5, 6, 7, 8, 9]
It gave back the numbers starting from the first argument (1) and going up to, but not including, the last one (10). You can also make it increase in increments greater than one by providing a third argument:
>>> print range(1,10,2)
[1, 3, 5, 7, 9]
So, now you can do stuff like this:
>>> my_list = [1,4,5,2,8,4]
>>> for i in range(0, len(my_list), 2):
...     print i, my_list[i], my_list[i+1]
...
0 1 4
2 5 2
4 8 4
You can get “slices” of lists if you only want part of one. For example:
>>> foo = [1,6,3,3,5,92,5]
>>> foo[2:5]
[3, 3, 5]
>>> foo[1:]
[6, 3, 3, 5, 92, 5]
Using “foo[2:5]” returns a list containing the elements from index 2 up to, but not including, index 5. Providing nothing after the colon gives you the rest of the list from the index you have provided, as in “foo[1:]”. You can probably guess what “foo[:3]” does – try it out and see.
Lists and dictionaries can store more than basic values. For example, you can store lists of lists of dictionaries of lists of … and so on. Here is a simple example to show some basic manipulations of a dictionary that maps integers to lists of strings:
>>> my_list = ["a", "b", "c"]
>>> my_dictionary = {}
>>> my_dictionary[1] = my_list
>>> print my_dictionary[1][2]
c
>>> my_dictionary[1][2] = "hello"
>>> print my_dictionary
{1: ['a', 'b', 'hello']}
>>> my_dictionary[302] = ["Sam", "Bill", "George"]
>>> print my_dictionary
{1: ['a', 'b', 'hello'], 302: ['Sam', 'Bill', 'George']}
Here is an example that maps strings to lists of integers. It shows how to use the append function to add new members to lists that are stored inside the dictionary, and how to append a new value (here, 111) when the keys one wants to use may or may not already be in the dictionary:
>>> foo = {}
>>> foo["a"] = [1]
>>> print foo
{'a': [1]}
>>> foo["a"].append(3)
>>> print foo
{'a': [1, 3]}
>>> foo["b"] = [5,8,0]
>>> print foo
{'a': [1, 3], 'b': [5, 8, 0]}
>>> for x in foo:
...     foo[x].append(201)
...
>>> print foo
{'a': [1, 3, 201], 'b': [5, 8, 0, 201]}
>>> possibly_new_keys = ["a", "c", "d"]
>>> for x in possibly_new_keys:
...     foo[x] = foo.get(x,[]) + [111]
...
>>> print foo
{'a': [1, 3, 201, 111], 'c': [111], 'b': [5, 8, 0, 201], 'd': [111]}
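The get() version builds a new list on every pass through the loop. A hedged alternative is the dictionary method setdefault(), which returns the list already stored under the key, or stores and returns the given default if the key is new, so that you can append in place. A minimal sketch with the same keys as above:

```python
foo = {"a": [1, 3], "b": [5, 8, 0]}
possibly_new_keys = ["a", "c", "d"]

for x in possibly_new_keys:
    # setdefault returns the existing list, or stores and returns a new empty one.
    foo.setdefault(x, []).append(111)

print(foo["a"])      # [1, 3, 111]
print(foo["c"])      # [111]
print(sorted(foo))   # ['a', 'b', 'c', 'd']
```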
If a dictionary doesn't have a key and you ask for the value associated with that key, you'll get an error:
>>> foo = {}
>>> foo[1] = "a"
>>> foo[2] = "b"
>>> print foo[3]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
KeyError: 3
One way to avoid this error is to always check for the presence of the key with the has_key() method (the expression 3 in foo performs the same check). Let's say you want to get the empty string back if the key isn't in the dictionary. Here is how you would do it:
>>> if foo.has_key(3):
...     print foo[3]
... else:
...     print ""
...
You can instead use the get() method on the dictionary. It takes two arguments, the first of which is the key you are trying to access, and the second of which is the default return value if the key is not in the dictionary. Here's an example that shows the default first being the empty string, and then being the string “Hello hello!”:
>>> print foo.get(3, "")

>>> print foo.get(3, "Hello hello!")
Hello hello!
This is quite handy for storing counts of things. Let's say you want to count the number of times each letter is seen in a string. Here's how you could do it:
>>> list_of_chars = list("How many times is each letter seen in this string?")
>>> print list_of_chars
['H', 'o', 'w', ' ', 'm', 'a', 'n', 'y', ' ', 't', 'i', 'm', 'e', 's', ' ', 'i', 's', ' ', 'e', 'a', 'c', 'h', ' ', 'l', 'e', 't', 't', 'e', 'r', ' ', 's', 'e', 'e', 'n', ' ', 'i', 'n', ' ', 't', 'h', 'i', 's', ' ', 's', 't', 'r', 'i', 'n', 'g', '?']
>>> counts = {}
>>> for letter in list_of_chars:
...     counts[letter] = counts.get(letter, 0) + 1
...
>>> print counts["i"]
5
>>> print counts
{'a': 2, ' ': 9, 'c': 1, 'e': 6, 'g': 1, 'i': 5, 'H': 1, 'm': 2, 'l': 1, 'o': 1, 'n': 4, 's': 5, 'r': 2, 't': 5, 'w': 1, 'h': 2, 'y': 1, '?': 1}
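If you are on Python 2.7 or later, the standard library also offers collections.Counter, a dictionary subclass made for exactly this kind of counting. It is not what the example above uses; the sketch below just shows that it gives the same counts for the same string.

```python
from collections import Counter

text = "How many times is each letter seen in this string?"
counts = Counter(text)        # iterating over a string yields its characters

print(counts["i"])            # 5
print(counts["z"])            # 0  (missing keys count as zero, no KeyError)
print(counts.most_common(1))  # [(' ', 9)]
```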
Say your program should take two numbers as arguments on the command line, so you want to be able to do stuff like:
> python myprogram.py 23 51
To have access to those values, you need to import the sys module and then access the values of the list sys.argv:
import sys
num1 = int(sys.argv[1])
num2 = int(sys.argv[2])
Note that if you add the statement “print sys.argv” to your program, you'll see the following:
['myprogram.py', '23', '51']
as part of the output. This is a list with contents like any other, so if you want those values, you need to index into them. That's what “sys.argv[1]” is doing in the above code snippet – accessing the *second* element of that list.
The second thing going on is that the function int() takes a string and turns it into an integer, so int("5") gives back 5. (Try it in the Python interactive prompt.) However, when you need the command line arguments as strings, you should *not* convert them to ints. In that case you can just do the following:
first_argument = sys.argv[1]
And if you want all the arguments you can do this:
all_command_line_args = sys.argv[1:]
This uses the slice of the list that doesn't include the name of the program (e.g., “myprogram.py”).
If you want to convert all the arguments into ints, you can use list comprehensions (see above) as follows:
all_command_line_args_as_ints = [int(x) for x in sys.argv[1:]]
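Indexing sys.argv fails with an IndexError if the user supplies too few arguments, and int() raises a ValueError if an argument is not a number. One way to guard against both is sketched below. The function name parse_two_ints is made up for this example, and it takes the argument list as a parameter so that it can be tried out without a command line.

```python
def parse_two_ints(argv):
    """Return the two integer arguments as a tuple, or None if unusable."""
    if len(argv) != 3:      # program name plus exactly two arguments
        return None
    try:
        return int(argv[1]), int(argv[2])
    except ValueError:      # one of the arguments was not a whole number
        return None

print(parse_two_ints(["myprogram.py", "23", "51"]))   # (23, 51)
print(parse_two_ints(["myprogram.py", "23"]))         # None
print(parse_two_ints(["myprogram.py", "23", "x"]))    # None
```

In a script you would import sys and call parse_two_ints(sys.argv), printing a usage message when the result is None.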
If you're familiar with other programming languages, like Java or C++, Python's typing system may seem somewhat unusual to you. If you don't have a programming background: many languages are statically typed, that is, before you use a variable you must declare what type it is (an integer, a character, a string, etc.), and afterwards you can only use that variable to store values of that type. Python, on the other hand, is dynamically typed, meaning that any variable can hold any kind of data, without your telling Python that the variable is a 'list' or a 'string' or an 'integer'; types belong to the values rather than to the variable names. Importantly, this means lists and dictionaries can hold elements of different types. For example, the following code runs fine:
a = "a string"
b = 9
c = (2, 4)
d = [102.5, 6, "pancakes"]
e = [a, b, c, d]
print e
But what if we are putting data into a list, or any other data structure for that matter, and later want to pull it out and perform different functions on it depending on what type of data it contains? For example, we may have a list that we know contains both numbers and lists of numbers. How would we print only the numbers to the screen? A naive approach may be to do a simple for loop:
l = [1, 2, [3, 4, 5], 6, [7]]
for num in l:
    print num
But running this code gives the following (incorrect) output:
1
2
[3, 4, 5]
6
[7]
In order to print the output correctly, we have to test each element in the list to see what its type is, and we do this with a Python function isinstance, which takes two arguments: the variable to be tested, followed by the type you are testing for. This can be used for all native Python types (int, list, dict, etc.) and all user-defined types. So to get back to our problem, we modify the code to the following:
l = [1, 2, [3, 4, 5], 6, [7]]
for num in l:
    if isinstance(num, list):
        for n in num:
            print n
    else:
        print num
This gives the following correct output:
1
2
3
4
5
6
7
So although Python allows you to do many interesting things as a dynamically typed language, the isinstance function allows you to check types when it is desirable to differentiate between different kinds of data.
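The loop above handles one level of nesting. If the lists may be nested to any depth, the same isinstance test can be applied recursively. This is a sketch of that idea; the function name flatten is made up for the example.

```python
def flatten(items):
    """Return a flat list of the non-list elements, at any nesting depth."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))   # recurse into the nested list
        else:
            flat.append(item)
    return flat

print(flatten([1, 2, [3, 4, 5], 6, [7]]))    # [1, 2, 3, 4, 5, 6, 7]
print(flatten([1, [2, [3, [4]]], "five"]))   # [1, 2, 3, 4, 'five']
```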
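Python is picky about the syntax for back references inside regex substitutions. The following will not work:
$ python
>>> import re
>>> derf = "peas porridge hot"
>>> derf = re.sub('(peas porridge )(hot)','\1cold',derf)
>>> print derf
^Acold
The problem is that the match succeeds, but the back reference is lost: in an ordinary (non-raw) string literal, Python reads '\1' as an octal escape for the control character with code 1 (shown here as ^A) before re.sub ever sees it.
One fix is to mark the back reference with the \g<> syntax, written in a raw string so that the backslash reaches re.sub intact (r'\1cold' also works):
$ python
>>> import re
>>> derf = "peas porridge hot"
>>> derf = re.sub('(peas porridge )(hot)',r'\g<1>cold',derf)
>>> print derf
peas porridge cold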
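One way to see what is going on is to compare an ordinary string literal with a raw one. In an ordinary literal, '\1' is an escape sequence for the character with code 1, so re.sub never receives a back reference at all. The sketch below runs the substitution three ways; the variable names are only for illustration.

```python
import re

text = "peas porridge hot"
pattern = '(peas porridge )(hot)'

broken = re.sub(pattern, '\1cold', text)      # '\1' is the character chr(1)
raw = re.sub(pattern, r'\1cold', text)        # raw string: re.sub sees \1
named = re.sub(pattern, r'\g<1>cold', text)   # explicit group syntax

print(repr(broken))   # '\x01cold'
print(raw)            # peas porridge cold
print(named)          # peas porridge cold
```

The \g<1> form is still useful when the reference is followed by a digit, since r'\g<1>0' means group 1 followed by a zero, whereas r'\10' would be read as group 10.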
This tip is specific to taggers or other classifiers that use tag dictionaries. It is useful if your tag set includes a dummy tag or some other tag that appears too often and skews your predictions.
Here's a little script for cleaning up a tag dictionary file. It grinds through the file looking for lines with the form word \t tag \t count. If two lines have the same word, and one of them has the overzealous tag, it reprints one line with the word, the desired tag, and the sum of their counts:
#!/usr/bin/python
import sys

# Reads "word tag count" lines from standard input two at a time.
# 'none' is the overzealous tag whose counts get folded into the other tag.
line = sys.stdin.readline()
while line:
    line = line.strip()
    line = line.split()
    try:
        stem = line[0]
        tag = line[1]
        count = int(line[2])  # convert to int so that counts are summed, not concatenated
        next_line = sys.stdin.readline()
        next_line = next_line.strip()
        next_line = next_line.split()
        next_stem = next_line[0]
        next_tag = next_line[1]
        next_count = int(next_line[2])
        if stem == next_stem:
            if tag == 'none':
                print stem,'\t',next_tag,'\t', (count + next_count)
            elif next_tag == 'none':
                print stem,'\t',tag,'\t', (count + next_count)
            else:
                print stem,'\t',tag,'\t',count
                print next_stem,'\t',next_tag,'\t',next_count
        else:
            print stem,'\t',tag,'\t',count
            print next_stem,'\t',next_tag,'\t',next_count
    except IndexError:
        # Skip lines (or pairs) that do not have three fields
        pass
    line = sys.stdin.readline()
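The script above relies on the two lines for a word being adjacent and being read as a pair. As a rough alternative sketch, not the script's own method, the same merge can be done by first collecting all counts in a dictionary. The function name merge_dummy_tag and the sample data are made up, and the sketch assumes that a word is merged only when it has the dummy tag and exactly one other tag.

```python
def merge_dummy_tag(lines, dummy="none"):
    """Fold the dummy tag's count into a word's only other tag."""
    counts = {}   # word -> {tag: count}
    order = []    # words in order of first appearance
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue                      # skip malformed lines
        word, tag, count = fields[0], fields[1], int(fields[2])
        if word not in counts:
            counts[word] = {}
            order.append(word)
        counts[word][tag] = counts[word].get(tag, 0) + count

    merged = []
    for word in order:
        tags = counts[word]
        if dummy in tags and len(tags) == 2:
            # Exactly one real tag: it receives the dummy tag's count too.
            real_tag = [t for t in tags if t != dummy][0]
            merged.append((word, real_tag, sum(tags.values())))
        else:
            for tag in sorted(tags):
                merged.append((word, tag, tags[tag]))
    return merged

sample = ["walk\tnone\t3", "walk\tVB\t7", "dog\tNN\t5"]
print(merge_dummy_tag(sample))   # [('walk', 'VB', 10), ('dog', 'NN', 5)]
```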
The following script is an example of feeding data from a python script into an external program.
This particular script calls Praat, a widely used freeware acoustic analysis program, and takes as its input data extracted from annotated corpora produced by the Linguistic Data Consortium.
Transcripts for the LDC audio corpora can be searched using grep commands. For example, say you are searching LDC's Levantine QT training data (available on the lab machines in /groups/corpora/) using grep.
$ grep 'wlA HdA' *
fsa_25680.txt.out:565.73 566.39 B: wlA HdA
fsa_25803.txt.out:462.57 463.71 A: mA fyh wlA HdA
The output consists of lines that begin with the name of the file containing the match, followed by the matching line itself. This can be redirected to a file:
$ grep 'wlA HdA' * > results.txt
The following script reads files like results.txt as standard input (i.e., via a pipe) and takes as arguments a path and the name of a Praat script, and searches the directory pointed to by that path for the appropriate file and then calls a Praat script which extracts the audio segment in question. Crucially, the script uses the os module which lets you call external programs.
Here's the python script:
#!/usr/bin/python
import re,sys,os,shutil

# Matches grep output lines such as: fsa_25680.txt.out:565.73 566.39 B: wlA HdA
file_match = re.compile(r'(fsa\_[0-9]{1,6})\.txt\.out\:([0-9]{1,3}\.[0-9]{1,3}) ([0-9]{1,3}\.[0-9]{1,3}) [AB]\: (.*)$')
inpath = str(sys.argv[1])
praat_script = str(sys.argv[2])
line = sys.stdin.readline()
while line:
    match = file_match.match(line)
    if match is None:
        # Skip lines that do not look like grep output
        line = sys.stdin.readline()
        continue
    file = match.group(1)
    start = match.group(2)
    end = match.group(3)
    datum_string = match.group(4)
    # Wrap the matched text in single quotes so the shell passes it as one argument
    datum = "'%s'" % datum_string
    print file,start,end
    filepath = '%s.sph' % file
    print filepath
    # Copy the audio file from the corpus directory into the current directory
    shutil.copy(os.path.join(inpath,filepath),filepath)
    command = '/Applications/Spectroph_Software/Praat.app/Contents/MacOS/Praat %s.praat %s %s %s %s' % (praat_script,file,start,end,str(datum))
    # Run Praat as an external program
    os.system(command)
    line = sys.stdin.readline()
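The script builds one command string and hands it to the shell through os.system, which is why the transcript text has to be wrapped in quotes. A possible alternative, sketched here under assumptions, is the subprocess module, which takes the command as a list with one element per argument, so that no shell quoting is needed. The helper build_command and the names "praat" and "extract.praat" are hypothetical, and the example runs the Python interpreter as a stand-in for Praat so that it works on any machine.

```python
import subprocess
import sys

def build_command(program, script, name, start, end, datum):
    # One list element per argument, so spaces in datum need no quoting.
    return [program, script, name, str(start), str(end), datum]

# Stand-in for the Praat call: run the Python interpreter on a one-line program.
command = [sys.executable, "-c", "import sys; sys.exit(0)"]
status = subprocess.call(command)   # returns the program's exit status
print(status)   # 0

# What an argument list for the Praat call might look like (hypothetical names).
print(build_command("praat", "extract.praat", "fsa_25680", 565.73, 566.39, "wlA HdA"))
```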
For completeness, here's the Praat script (which has a Basic-like syntax). What it does is extract the relevant time sequence from the audio file and generate a spectrum for it, which is stored in two files.
form Extract sequence
    text name
    real start_point
    real end_point
    text data
endform
Open long sound file... 'name$'.sph
Extract part... start_point end_point yes
select Sound 'name$'
To TextGrid... "Input"
select Sound 'name$'
To Pitch... 0 75 600
select TextGrid 'name$'
plus Sound 'name$'
plus Pitch 'name$'
Write to text file... 'name$'.Collection
Remove
Read from file... 'name$'.Collection
select TextGrid 'name$'
Set interval text... 1 1 'data$'
plus Sound 'name$'
plus Pitch 'name$'
Write to text file... 'name$'.Collection
This link to the Python IAQ contains some handy information.