Structuring, Testing, and Maintaining Python Programs¶
Python is really the first programming language in which I started re-using code significantly. In part, this is because it is rather easy to compartmentalize functions and classes in Python. Something else that Python makes relatively easy is building testing into your program structure. Combined, reusability and testing can have a huge effect on maintenance.
Programming for reusability¶
It’s difficult to come up with any hard and fast rules for programming for reusability, but my main rules of thumb are: don’t plan too much, and don’t hesitate to refactor your code [1].
In any project, you will write code that you want to re-use in a slightly different context. It will often be easiest to cut and paste this code rather than to copy the module it’s in – but try to resist this temptation a bit, and see if you can make the code work for both uses, and then use it in both places.
[1] If you haven’t read Martin Fowler’s Refactoring, do so – it describes how to incrementally make your code better. I’ll discuss it some more in the context of testing, below.
Modules and scripts¶
The organization of your code source files can help or hurt you with code re-use.
Most people start their Python programming out by putting everything in a script:
calc-squares.py:
#! /usr/bin/env python
for i in range(0, 10):
    print i**2
This is great for experimenting, but you can’t re-use this code at all!
(UNIX folk: note the use of #! /usr/bin/env python, which tells UNIX to execute this script using whatever python program is first in your path. This is more portable than putting #! /usr/local/bin/python or #! /usr/bin/python in your code, because not everyone puts python in the same place.)
Back to reuse. What about this?
calc-squares.py:
#! /usr/bin/env python
def squares(start, stop):
    for i in range(start, stop):
        print i**2

squares(0, 10)
I think that’s a bit better for re-use – you’ve made squares flexible and re-usable – but there are two mechanistic problems. First, it’s named calc-squares.py, which means it can’t readily be imported. (Import filenames have to be valid Python names, of course!) And, second, were it importable, it would execute squares(0, 10) on import – hardly what you want!
To fix the first, just change the name:
calc_squares.py:
#! /usr/bin/env python
def squares(start, stop):
    for i in range(start, stop):
        print i**2

squares(0, 10)
Good, but now if you do import calc_squares, the squares(0, 10) code will still get run! There are a couple of ways to deal with this. The first is to look at the module name: if it’s calc_squares, then the module is being imported, while if it’s __main__, then the module is being run as a script:
calc_squares.py:
#! /usr/bin/env python
def squares(start, stop):
    for i in range(start, stop):
        print i**2

if __name__ == '__main__':
    squares(0, 10)
Now, if you run calc_squares.py directly, it will run squares(0, 10); if you import it, it will simply define the squares function and leave it at that. This is probably the most standard way of doing it.
I actually prefer a different technique, because of my fondness for testing. (I also think this technique lends itself to reusability, though.) I would actually write two files:
squares.py:
def squares(start, stop):
    for i in range(start, stop):
        print i**2

if __name__ == '__main__':
    pass   # ...run automated tests...
calc-squares:
#! /usr/bin/env python
import squares
squares.squares(0, 10)
A few notes – first, this is eminently reusable code, because squares.py is completely separate from the context-specific call. Second, you can look at the directory listing in an instant and see that squares.py is probably a library, while calc-squares must be a script, because the latter cannot be imported. Third, you can add automated tests to squares.py (as described below), and run them simply by running python squares.py. Fourth, you can add script-specific code such as command-line argument handling to the script, and keep it separate from your data handling and algorithm code.
Packages¶
A Python package is a directory full of Python modules containing a special file, __init__.py, that tells Python that the directory is a package. Packages are for collections of library code that are too big to fit into single files, or that have some logical substructure (e.g. a central library along with various utility functions that all interact with the central library).
For an example, look at this directory tree:
package/
__init__.py -- contains functions a(), b()
other.py -- contains function c()
subdir/
__init__.py -- contains function d()
From this directory tree, you would be able to access the functions like so:
import package
package.a()
package.b()
import package.other
package.other.c()
import package.subdir
package.subdir.d()
Note that __init__.py is just another Python file; there’s nothing special about it except for the name, which tells Python that the directory is a package directory. __init__.py is the only code executed on import, so if you want names and symbols from other modules to be accessible at the package top level, you have to import or create them in __init__.py.
There are two ways to use packages: you can treat them as a convenient code organization technique, and make most of the functions or classes available at the top level; or you can use them as a library hierarchy. In the first case you would make all of the names above available at the top level:
package/__init__.py:
from other import c
from subdir import d
...
which would let you do this:
import package
package.a()
package.b()
package.c()
package.d()
That is, the names of the functions would all be immediately available at the top level of the package, but the implementations would be spread out among the different files and directories. I personally prefer this because I don’t have to remember as much ;). The down side is that everything gets imported all at once, which (especially for large bodies of code) may be slow and memory intensive if you only need a few of the functions.
Alternatively, if you wanted to keep the library hierarchy, just leave out the top-level imports. The advantage here is that you only import the names you need; however, you need to remember more.
Some people are fond of package trees, but I’ve found that hierarchies of packages more than two deep are annoying to develop on: you spend a lot of your time browsing around between directories, trying to figure out exactly which function you need to use and what it’s named. (Your mileage may vary.) I think this is one of the main reasons why the Python stdlib looks so big, because most of the packages are top-level.
One final note: you can restrict what objects are exported from a module or package by listing the names in the __all__ variable. So, if you had a module some_mod.py that contained this code:
some_mod.py:
__all__ = ['fn1']
def fn1(...):
...
def fn2(...):
...
then only fn1 would be pulled in by from some_mod import * – __all__ restricts what a star-import exposes (an explicit import some_mod would still let you reach some_mod.fn2). This is a good way to cut down on “namespace pollution” – the presence of “private” objects and code in imported modules – which in turn makes introspection more useful.
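To see __all__ in action, here is a small self-contained sketch: it writes a throwaway some_mod.py to a temporary directory (the file contents follow the example above, but the temp-directory plumbing is my own addition) and confirms that a star-import only pulls in the listed name:

```python
import os
import sys
import tempfile

# Write a throwaway module to a temp directory (fn1/fn2 as in the text).
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, 'some_mod.py'), 'w') as f:
    f.write("__all__ = ['fn1']\n"
            "def fn1(): return 'fn1'\n"
            "def fn2(): return 'fn2'\n")

sys.path.insert(0, tmpdir)
ns = {}
exec('from some_mod import *', ns)   # the star-import honors __all__

# ns now contains fn1, but not fn2.
```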
A short digression: naming and formatting¶
You may have noticed that a lot of Python code looks pretty similar – this is because there’s an “official” style guide for Python, called PEP 8. It’s worth a quick skim, and an occasional deeper read for some sections.
Here are a few tips that will make your code look internally consistent, if you don’t already have a coding style of your own:
- use four spaces (NOT a tab) for each indentation level;
- use lowercase, _-separated names for module and function names, e.g. my_module;
- use CapWords style to name classes, e.g. MySpecialClass;
- use ‘_’-prefixed names to indicate a “private” variable that should not be used outside this module, e.g. _some_private_variable.
Another short digression: docstrings¶
Docstrings are strings of text attached to Python objects like modules, classes, and methods/functions. They can be used to provide human-readable help when building a library of code. “Good” docstring coding is used to provide additional information about functionality beyond what can be discovered automatically by introspection; compare
def is_prime(x):
    """
    is_prime(x) -> true/false. Determines whether or not x is prime,
    and returns true or false.
    """
versus
def is_prime(x):
    """
    Returns true if x is prime, false otherwise.

    is_prime() uses the Bernoulli-Schmidt formalism for figuring out
    if x is prime. Because the BS form is stochastic and hysteretic,
    multiple calls to this function will be increasingly accurate.
    """
The top example is good (documentation is good!), but the bottom example is better, for a few reasons. First, it is not redundant: the arguments to is_prime are discoverable by introspection and don’t need to be specified. Second, it’s summarizable: the first line stands on its own, and people who are interested in more detail can read on. This enables certain document extraction tools to do a better job.
For more on docstrings, see PEP 257.
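As a concrete illustration of why the first line matters, here is a runnable sketch – using plain trial division rather than the tongue-in-cheek formalism above – showing that the docstring rides along on the function object, where introspection and documentation tools can find it:

```python
def is_prime(x):
    """Return True if x is prime, False otherwise.

    Uses simple trial division, which is plenty for small x.
    """
    if x < 2:
        return False
    # x is prime if no d in [2, sqrt(x)] divides it evenly.
    return all(x % d for d in range(2, int(x ** 0.5) + 1))

# The docstring is a plain attribute; tools grab the first line as a summary.
summary = is_prime.__doc__.splitlines()[0]
```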
Sharing data between code¶
There are three levels at which data can be shared between Python code: module globals, class attributes, and object attributes. You can also sneak data into functions by dynamically defining a function within another scope, and/or binding them to keyword arguments.
Scoping: a digression¶
Just to make sure we’re clear on scoping, here are a few simple examples. In this first example, f() gets x from the module namespace.
>>> x = 1
>>> def f():
... print x
>>> f()
1
In this second example, f() shadows x, but only within f()’s own namespace.
>>> x = 1
>>> def f():
... x = 2
... print x
>>> f()
2
>>> print x
1
In this third example, outer() shadows x, and inner() obtains x from within outer(), because inner() was defined within outer():
>>> x = 1
>>> def outer():
... x = 2
...
... def inner():
... print x
...
... return inner
>>> inner = outer()
>>> inner()
2
In all cases, without a global declaration, assignments will simply create a new local variable of that name, and not modify the value in any other scope:
>>> x = 1
>>> def outer():
... x = 2
...
... def inner():
... x = 3
...
... inner()
...
... print x
>>> outer()
2
However, with a global declaration, the outermost scope is used:
>>> x = 1
>>> def outer():
... x = 2
...
... def inner():
... global x
... x = 3
...
... inner()
...
... print x
>>> outer()
2
>>> print x
3
I generally suggest avoiding scope trickery as much as possible, in the interests of readability. There are two common patterns that I use when I have to deal with scope issues.
First, module globals are sometimes necessary. For one such case, imagine that you have a centralized resource that you must initialize precisely once, and you have a number of functions that depend on that resource. Then you can use a module global to keep track of the initialization state. Here’s a (contrived!) example for a random number generator that initializes the random number seed precisely once:
import random

_initialized = False

def init():
    global _initialized
    if not _initialized:
        import time
        random.seed(time.time())
        _initialized = True

def randint(start, stop):
    init()
    ...
This code ensures that the random number seed is initialized only once by making use of the _initialized module global. A few points, however:
- this code is not threadsafe. If it was really important that the resource be initialized precisely once, you’d need to use thread locking. Otherwise two functions could call randint() at the same time and both could get past the if statement.
- the module global code is very isolated and its use is very clear. Generally I recommend having only one or two functions that access the module global, so that if I need to change its use I don’t have to understand a lot of code.
The other “scope trickery” that I sometimes engage in is passing data into dynamically generated functions. Consider a situation where you have to use a callback API: that is, someone has given you a library function that will call your own code in certain situations. For our example, let’s look at the re.sub function that comes with Python, which takes a callback function to apply to each match.
Here’s a callback function that uppercases words:
>>> def replace(m):
... match = m.group()
... print 'replace is processing:', match
... return match.upper()
>>> s = "some string"
>>> import re
>>> print re.sub('\\S+', replace, s)
replace is processing: some
replace is processing: string
SOME STRING
What’s happening here is that the replace function is called each time the regular expression ‘\S+’ (a set of non-whitespace characters) is matched. The matching substring is replaced by whatever the function returns.
Now let’s imagine a situation where we want to pass information into replace; for example, we want to process only words that match in a dictionary. (I told you it was contrived!) We could simply rely on scoping:
>>> d = { 'some' : True, 'string' : False }
>>> def replace(m):
... match = m.group()
... if match in d and d[match]:
... return match.upper()
... return match
>>> print re.sub('\\S+', replace, s)
SOME string
but I would argue against it on the grounds of readability: passing information implicitly between scopes is bad. (At this point advanced Pythoneers might sneer at me, because scoping is natural to Python, but nuts to them: readability and transparency is also very important.) You could also do it this way:
>>> d = { 'some' : True, 'string' : False }
>>> def replace(m, replace_dict=d): # <-- explicit declaration
... match = m.group()
... if match in replace_dict and replace_dict[match]:
... return match.upper()
... return match
>>> print re.sub('\\S+', replace, s)
SOME string
The idea is to use keyword arguments on the function to pass in required information, thus making the information passing explicit.
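An equivalent way to make the information passing explicit, without touching the function’s default arguments, is functools.partial; this is my own variation on the example above, with the same dictionary:

```python
import functools
import re

d = {'some': True, 'string': False}

def replace(m, replace_dict):
    match = m.group()
    if replace_dict.get(match):
        return match.upper()
    return match

# Bind the dictionary explicitly instead of hiding it in a keyword default.
bound = functools.partial(replace, replace_dict=d)
result = re.sub(r'\S+', bound, 'some string')   # -> 'SOME string'
```

The binding now happens at the call site that builds the callback, which makes it easy to reuse replace with several different dictionaries.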
Back to sharing data¶
I started discussing scope in the context of sharing data, but we got a bit sidetracked from data sharing. Let’s get back to that now.
The key to thinking about data sharing in the context of code reuse is to think about how that data will be used.
If you use a module global, then any code in that module has access to that global.
If you use a class attribute, then any object of that class type (including inherited classes) shares that data.
And, if you use an object attribute, then every object of that class type will have its own version of that data.
How do you choose which one to use? My ground rule is to minimize the use of more widely shared data. If it’s possible to use an object variable, do so; otherwise, use either a module or class attribute. (In practice I almost never use class attributes, and infrequently use module globals.)
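A minimal sketch of the difference between class attributes and object attributes (the Counter name is invented for the demo):

```python
class Counter(object):
    shared = []           # class attribute: one list shared by every instance

    def __init__(self):
        self.own = []     # object attribute: a fresh list per instance

a = Counter()
b = Counter()
a.shared.append('x')      # visible through b as well
a.own.append('y')         # visible only on a
```

After this runs, b.shared contains 'x' (both instances see the same list), while b.own is still empty.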
How modules are loaded (and when code is executed)¶
Something that has been implicit in the discussion of scope and data sharing, above, is the order in which module code is executed. There shouldn’t be any surprises here if you’ve been using Python for a while, so I’ll be brief: in general, the code at the top level of a module is executed at first import, and all other code is executed in the order you specify when you start calling functions or methods.
Note that because the top level of a module is executed precisely once, at first import, the following code prints “hello, world” only once:
mod_a.py:
def f():
    print 'hello, world'

f()
mod_b.py:
import mod_a
The reload function will reload the module and force re-execution at the top level:
reload(sys.modules['mod_a'])
It is also worth noting that the module name is bound to the local namespace prior to the execution of the code in the module, so not all symbols in the module are immediately available. This really only impacts you if you have interdependencies between modules: for example, this will work if mod_a is imported before mod_b:
mod_a.py:
import mod_b
mod_b.py:
import mod_a
while this will not:
mod_a.py:
import mod_b
x = 5
mod_b.py:
import mod_a
y = mod_a.x
To see why, let’s put in some print statements:
mod_a.py:
print 'at top of mod_a'
import mod_b
print 'mod_a: defining x'
x = 5
mod_b.py:
print 'at top of mod_b'
import mod_a
print 'mod_b: defining y'
y = mod_a.x
Now try import mod_a and import mod_b, each time in a new interpreter:
>> import mod_a
at top of mod_a
at top of mod_b
mod_b: defining y
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mod_a.py", line 2, in <module>
import mod_b
File "mod_b.py", line 4, in <module>
y = mod_a.x
AttributeError: 'module' object has no attribute 'x'
>> import mod_b
at top of mod_b
at top of mod_a
mod_a: defining x
mod_b: defining y
PYTHONPATH, and finding packages & modules during development¶
So, you’ve got your re-usable code nicely defined in modules, and now you want to ... use it. How can you import code from multiple locations?
The simplest way is to set the PYTHONPATH environment variable to contain a list of directories from which you want to import code; e.g. in UNIX bash,
% export PYTHONPATH=/path/to/directory/one:/path/to/directory/two
or in csh,
% setenv PYTHONPATH /path/to/directory/one:/path/to/directory/two
Under Windows,
> set PYTHONPATH=directory1;directory2
should work.
However, setting the PYTHONPATH explicitly can make your code less movable in practice, because you will forget (and fail to document) the modules and packages that your code depends on. I prefer to modify sys.path directly:
import sys
sys.path.insert(0, '/path/to/directory/one')
sys.path.insert(0, '/path/to/directory/two')
which has the advantage that you are explicitly specifying the location of packages that you depend upon in the dependent code.
Note also that you can put modules and packages in zip files and Python will be able to import directly from the zip file; just place the path to the zip file in either sys.path or your PYTHONPATH.
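A quick sketch of zip imports in action; the bundle.zip and zipped_mod names are invented for the demo:

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip containing one module, then import straight from it.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, 'bundle.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('zipped_mod.py',
                'def hello():\n    return "hi from the zip"\n')

sys.path.insert(0, zip_path)   # same effect as adding it to PYTHONPATH
import zipped_mod              # served by Python's zip importer
```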
Now, I tend to organize my projects into several directories, with a bin/ directory that contains my scripts, and a lib/ directory that contains modules and packages. If I want to deploy this code in multiple locations, I can’t rely on inserting absolute paths into sys.path; instead, I want to use relative paths. Here’s the trick I use: in my script directory, I write a file _mypath.py.
_mypath.py:
import os, sys
thisdir = os.path.dirname(__file__)
libdir = os.path.join(thisdir, '../relative/path/to/lib/from/bin')
if libdir not in sys.path:
    sys.path.insert(0, libdir)
Now, in each script I put import _mypath at the top of the script. When running scripts, Python automatically enters the script’s directory into sys.path, so the script can import _mypath. Then _mypath uses the special attribute __file__ to calculate its own location, from which it can calculate the absolute path to the library directory and insert the library directory into sys.path.
setup.py and distutils: the old fashioned way of installing Python packages¶
While developing code, it’s easy to simply work out of the development directory. However, if you want to pass the code onto others as a finished module, or provide it to systems admins, you might want to consider writing a setup.py file that can be used to install your code in a more standard way. setup.py lets you use distutils to install the software by running:

python setup.py install
Writing a setup.py is simple, especially if your package is pure Python and doesn’t include any extension files. A setup.py file for a pure Python install looks like this:
from distutils.core import setup

setup(name='your_package_name',
      py_modules=['module1', 'module2'],
      packages=['package1', 'package2'],
      scripts=['script1', 'script2'])
Once this script is written, just drop it into the top-level directory and type python setup.py build. This will make sure that distutils can find all the files.
Once your setup.py works for building, you can package up the entire directory with tar or zip and anyone should be able to install it by unpacking the package and typing
% python setup.py install
This will copy the packages and modules into Python’s site-packages directory, and install the scripts into Python’s script directory.
setup.py, eggs, and easy_install: the new fangled way of installing Python packages¶
A somewhat newer (and better) way of distributing Python software is to use easy_install, a system developed by Phillip Eby as part of the setuptools package. Many of the capabilities of easy_install/setuptools are probably unnecessary for scientific Python developers (although it’s an excellent way to install Python packages from other sources), so I will focus on three capabilities that I think are most useful for “in-house” development: versioning, user installs, and binary eggs.
First, install easy_install/setuptools. You can do this by downloading http://peak.telecommunity.com/dist/ez_setup.py and running python ez_setup.py. (If you can’t do this as the superuser, see the note below about user installs.) Once you’ve installed setuptools, you should be able to run the script easy_install.
The first thing this lets you do is easily install any software that is distutils-compatible. You can do this from a number of sources: from an unpackaged directory (as with python setup.py install); from a tar or zip file; from the project’s URL or Web page; from an egg (see below); or from PyPI, the Python Package Index (see http://cheeseshop.python.org/pypi/).
Let’s try installing nose, a unit test discovery package we’ll be looking at in the testing section (below). Type:
easy_install --install-dir=~/.packages nose
This will go to the Python Package Index, find the URL for nose, download it, and install it in your ~/.packages directory. We’re specifying an install-dir so that you can install it for your use only; if you were the superuser, you could install it for everyone by omitting ‘--install-dir’.
(Note that you need to add ~/.packages to your PATH and your PYTHONPATH, something I’ve already done for you.)
So, now, you can go do ‘import nose’ and it will work. Neat, eh?
Moreover, the nose-related scripts (nosetests, in this case) have been installed for your use as well.
You can also install specific versions of software; right now, the latest version of nose is 0.9.3, but if you wanted 0.9.2, you could specify easy_install nose==0.9.2 and it would do its best to find it.
This leads to the next setuptools feature of note, pkg_resources.require. pkg_resources.require lets you specify that certain packages must be installed. Let’s try it out by requiring that CherryPy 3.0 or later is installed:
>> import pkg_resources
>> pkg_resources.require('CherryPy >= 3.0')
Traceback (most recent call last):
...
DistributionNotFound: CherryPy >= 3.0
OK, so that failed... but now let’s install CherryPy:
% easy_install --install-dir=~/.packages CherryPy
Now the require will work:
>> pkg_resources.require('CherryPy >= 3.0')
>> import cherrypy
This version requirement capability is quite powerful, because it lets you specify exactly the versions of the software you need for your own code to work. And, if you need multiple versions of something installed, setuptools lets you do that, too – see the --multi-version flag for more information. While you still can’t use different versions of the same package in the same program, at least you can have multiple versions of the same package installed!
Throughout this, we’ve been using another great feature of setuptools: user installs. By specifying the --install-dir, you can install most Python packages for yourself, which lets you take advantage of easy_install’s capabilities without being the superuser on your development machine.
This brings us to the last feature of setuptools that I want to mention: eggs, and in particular binary eggs. We’ll explore binary eggs later; for now let me just say that easy_install makes it possible for you to package up multiple binary versions of your software (with extension modules) so that people don’t have to compile it themselves. This is an invaluable and somewhat underutilized feature of easy_install, but it can make life much easier for your users.