Measuring and Increasing Performance

“Premature optimization is the root of all evil (or at least most of it) in programming.” – Donald Knuth

In other words, know thy code! The only way to find performance bottlenecks is to profile your code. Unfortunately, the situation is a bit more complex in Python than you would like it to be: see http://docs.python.org/lib/profile.html. Briefly, there are three (!?) standard profiling systems that come with Python: profile, cProfile (only since Python 2.5!), and hotshot (though note that profile and cProfile are Python and C implementations of the same API). There is also a separately distributed one called statprof, which I nominally maintain.

The ones included with Python are deterministic profilers, while statprof is a statistical profiler. What’s the difference? To steal from the Python docs:

Deterministic profiling is meant to reflect the fact that all function call, function return, and exception events are monitored, and precise timings are made for the intervals between these events (during which time the user’s code is executing). In contrast, statistical profiling randomly samples the effective instruction pointer, and deduces where time is being spent. The latter technique traditionally involves less overhead (as the code does not need to be instrumented), but provides only relative indications of where time is being spent.

Let’s look at some examples. Suppose we have two functions, ‘count1’ and ‘count2’, and we want to run both and see where the time is spent.
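
(The bodies of these functions don’t matter for the profiling examples; to make things concrete, here’s a hypothetical count.py in which count2 simply does about three times as much work as count1:)

# count.py -- hypothetical workloads for the profiling examples below
def count1():
    total = 0
    for i in range(5000000):
        total = total + i
    return total

def count2():
    total = 0
    for i in range(15000000):
        total = total + i
    return total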


Here’s some example hotshot code:

import hotshot, hotshot.stats
from count import count1, count2

prof = hotshot.Profile('hotshot.prof')
prof.runcall(count1)
prof.runcall(count2)
prof.close()

stats = hotshot.stats.load('hotshot.prof')
stats.sort_stats('time', 'calls')
stats.print_stats(20)

and the resulting output:

      2 function calls in 5.769 CPU seconds

Ordered by: internal time, call count

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    4.335    4.335    4.335    4.335 count.py:8(count2)
     1    1.434    1.434    1.434    1.434 count.py:1(count1)
     0    0.000             0.000          profile:0(profiler)

Here’s some example cProfile code:

from count import count1, count2

def runboth():
    count1()
    count2()

import cProfile, pstats
cProfile.run('runboth()', 'cprof.out')

p = pstats.Stats('cprof.out')
p.sort_stats('time').print_stats(10)

and the resulting output:

Wed Jun 13 00:11:55 2007    cprof.out

         7 function calls in 5.643 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.817    3.817    4.194    4.194 count.py:8(count2)
        1    1.282    1.282    1.450    1.450 count.py:1(count1)
        2    0.545    0.272    0.545    0.272 {range}
        1    0.000    0.000    5.643    5.643 run-cprofile:8(runboth)
        1    0.000    0.000    5.643    5.643 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

And here’s an example of statprof, the statistical profiler:

import statprof
from count import count1, count2

statprof.start()
count1()
count2()
statprof.stop()
statprof.display()

And the output:

  %   cumulative      self
 time    seconds   seconds  name
 74.66      4.10      4.10  count.py:8:count2
 25.34      1.39      1.39  count.py:1:count1
  0.00      5.49      0.00  run-statprof:2:<module>
---
Sample count: 296
Total time: 5.490000 seconds

Which profiler should you use?

statprof used to report more accurate numbers than hotshot or cProfile, because hotshot and cProfile had to instrument the code (insert tracing statements, basically). However, the numbers shown above are pretty similar to each other and I’m not sure there’s much of a reason to choose between them any more. So, I recommend starting with cProfile, because it’s the officially supported one.

One note – none of these profilers really work all that well with threads, for a variety of reasons. You’re best off doing performance measurements on non-threaded code.

Measuring code snippets with timeit

There’s also a simple timing tool called timeit:

from timeit import Timer

t1 = Timer("count1()", "from count import count1")
print 'count1:', t1.timeit(number=1)

t2 = Timer("count2()", "from count import count2")
print 'count2:', t2.timeit(number=1)
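
timeit also has a handy command-line interface, so for quick measurements you don’t need to write a script at all; the -s option supplies a setup statement, just like the Timer constructor:

python -m timeit -s "from count import count1" "count1()"

By default it picks a sensible loop count automatically and reports the best time per loop.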

Speeding Up Python

There are a couple of options for speeding up Python.

psyco

(Taken almost verbatim from the psyco introduction!)

psyco is a specializing compiler that lets you run your existing Python code much faster, with absolutely no change in your source code. It acts like a just-in-time compiler by rewriting several versions of your code blocks and then optimizing them by specializing the variables they use.

The main benefit is that you get a 2-100x speed-up with an unmodified Python interpreter and unmodified source code. (You just need to import psyco.)

The main drawbacks are that it only runs on i386-compatible processors (so, not PPC Macs) and it’s a bit of a memory hog.

For example, if you use the prime number generator code (see Idiomatic Python) to generate all primes under 100000, it takes about 10.4 seconds on my development server. With psyco, it takes about 1.6 seconds (that’s about a 6x speedup). Even when doing less numerical stuff, I see at least a 2x speedup.
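
(If you want to reproduce this kind of measurement once psyco is installed, here’s a rough sketch; the module and function names are hypothetical, so substitute your own prime generator:)

import time
from primes_gen import primes_under   # hypothetical module and function

start = time.time()
list(primes_under(100000))
print 'without psyco:', time.time() - start, 'seconds'

import psyco
psyco.full()                          # enable psyco for everything

start = time.time()
list(primes_under(100000))
print 'with psyco:', time.time() - start, 'seconds'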

Installing psyco

(Note: psyco is an extension module and does not come in pre-compiled form. Therefore, you will need to have a Python-compatible C compiler installed in order to install psyco.)

Grab the latest psyco snapshot from here:

http://psyco.sourceforge.net/psycoguide/sources.html

unpack it, and run ‘python setup.py install’.

Using psyco

Put the following code at the top of your __main__ Python script:

try:
    import psyco
    psyco.full()
except ImportError:
    pass        # psyco isn't installed; just run at normal speed

...and you’re done. (Yes, it’s magic!)
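
If the memory overhead of full() turns out to be a problem, psyco also offers more selective entry points; here’s a sketch using its bind() and profile() functions, binding the (hypothetical) count2 from the profiling examples:

import psyco
from count import count2

# compile just this one known-hot function:
psyco.bind(count2)

# ...or let psyco watch the running program and compile only what is hot:
# psyco.profile()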

The only place where psyco won’t help you much is when you have already recoded the CPU-intensive component of your code into an extension module.

pyrex

pyrex is a Python-like language used to create C modules for Python. You can use it for two purposes: to increase performance by (re)writing your code in C (but with a friendly extension language), and to make C libraries available to Python.

In the context of speeding things up, here’s an example program:

def primes(int maxprime):
    cdef int n, k, i
    cdef int p[100000]        # fixed-size C array; ample for all primes below maxprime
    result = []
    k = 0
    n = 2
    while n < maxprime:
        i = 0

        # trial division against the primes found so far
        while i < k and n % p[i] != 0:
            i = i + 1

        # no divisor found, so n is prime; save it
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1

    return result

To compile this, you would execute:

pyrexc primes.pyx
gcc -c -fPIC -I /usr/local/include/python2.5 primes.c
gcc -shared primes.o -o primes.so

Or, more nicely, you can write a setup.py using some of the Pyrex helper functions:

from distutils.core import setup
from distutils.extension import Extension
from Pyrex.Distutils import build_ext                # <--

setup(
  name = "primes",
  ext_modules=[
    Extension("primes", ["primes.pyx"], libraries = [])
    ],
  cmdclass = {'build_ext': build_ext}
)
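
With that setup.py in place, the usual distutils command builds the extension in place, without your having to remember any compiler flags:

python setup.py build_ext --inplace

Once built (by either route), the module imports like any other Python module:

import primes
print primes.primes(20)        # prints [2, 3, 5, 7, 11, 13, 17, 19]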

A few notes:

  • ‘cdef’ is a C definition statement
  • this is a “Python-alike” language, but not Python per se ;)
  • Pyrex handles a lot of the nasty C extension stuff (reference counting, argument conversion, and so on) for you.

There’s an excellent guide to Pyrex available online here: http://ldots.org/pyrex-guide/.

I haven’t used Pyrex much myself, but I have a friend who swears by it. My concerns are that it’s a “C/Python-alike” language but not C or Python, and I have already memorized too many weird rules about too many languages!

We’ll encounter Pyrex a bit further down the road in the context of linking existing C/C++ code into your own code.