I have done some performance benchmarking for Python's ctypes library. I am
planning to use ctypes as an alternative to writing a C extension module for
performance enhancement. My use case is therefore slightly different from the
typical one of accessing existing third-party C libraries. In this case I
am both the user and the implementer of the C library.
To determine the right granularity for switching between Python and C, I
mainly want to measure the function call overhead. The test functions are
therefore trivial, such as returning the first character of a string. I compare
a pure Python function, a C extension module function, and a ctypes function.
The tests were run under Python 2.6 on Windows XP with an Intel 2.33 GHz Core
Duo.
First of all I want to compare ways of getting the first character of a
string. The most basic case is to reference it as the 0th element of a sequence
without calling any function. This produces the fastest result at 0.0659 usec per
loop.
$ timeit "'abc'[0]"
10000000 loops, best of 3: 0.0659 usec per loop
As soon as I build a function around it, the cost goes up substantially. The
pure Python function and the C extension function show similar performance at
around 0.5 usec. The ctypes function takes about 2.5 times as long at 1.37 usec.
$ timeit -s "f=lambda s: s[0]" "f('abc')"
1000000 loops, best of 3: 0.506 usec per loop
$ timeit -s "import mylib" "mylib.py_first('abc')"
1000000 loops, best of 3: 0.545 usec per loop
$ timeit -s "import ctypes; dll = ctypes.CDLL('mylib.pyd')"
"dll.first('abc')"
1000000 loops, best of 3: 1.37 usec per loop
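The pure Python part of this comparison can also be reproduced from a script. The following is a minimal sketch, assuming Python 3 and the standard timeit module; the helper name usec_per_loop is made up for illustration. The string is kept in a variable because a modern compiler may fold a constant expression such as 'abc'[0] at compile time, which would leave nothing to measure.

```python
import timeit

def usec_per_loop(stmt, setup="pass", number=100000, repeat=3):
    """Return the best per-loop time in microseconds (hypothetical helper)."""
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=repeat))
    return best / number * 1e6

# Subscript a variable rather than a literal so the expression is not
# folded into a constant before it runs.
inline = usec_per_loop("s[0]", setup="s = 'abc'")

# The same operation wrapped in a function call.
wrapped = usec_per_loop("f(s)", setup="s = 'abc'; f = lambda s: s[0]")

print("inline subscript: %.4f usec per loop" % inline)
print("lambda wrapper:   %.4f usec per loop" % wrapped)
```

Absolute numbers will differ from the ones above, since they depend on the interpreter version and the machine.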
I repeated the test with a long string (1MB). There is not much difference in
performance, so I can be quite confident that the parameter is passed by
reference (to the internal buffer) rather than copied.
$ timeit -s "f=lambda s: s[0]; lstr='abcde'*200000"
"f(lstr)"
1000000 loops, best of 3: 0.465 usec per loop
$ timeit -s "import mylib; lstr='abcde'*200000"
"mylib.py_first(lstr)"
1000000 loops, best of 3: 0.539 usec per loop
$ timeit -s "import ctypes; dll = ctypes.CDLL('mylib.pyd')"
-s "lstr='abcde'*200000"
"dll.first(lstr)"
1000000 loops, best of 3: 1.4 usec per loop
Next I made some attempts to speed up ctypes. A measurable improvement can be
attained by eliminating the attribute lookup for the function. Curiously, the
same change shows no improvement for the C extension.
$ timeit -s "import ctypes; dll = ctypes.CDLL('mylib.pyd')"
-s "f=dll.first"
"f('abcde')"
1000000 loops, best of 3: 1.18 usec per loop
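The lookup-caching idea can be tried without mylib.pyd. The sketch below is only an illustration under assumptions: it uses Python 3, and the C runtime's abs function stands in for dll.first, which is not available here.

```python
import ctypes
import sys
import timeit

# Assumption: the C runtime stands in for mylib.pyd.
libc = ctypes.CDLL("msvcrt") if sys.platform == "win32" else ctypes.CDLL(None)

# Look the function up once and keep a direct reference to it.
f = libc.abs

def best_usec(stmt, number=100000):
    """Best of 3 runs, in microseconds per loop."""
    runs = timeit.repeat(stmt, globals=globals(), number=number, repeat=3)
    return min(runs) / number * 1e6

with_lookup = best_usec("libc.abs(-5)")   # attribute lookup on every call
cached = best_usec("f(-5)")               # lookup done once, outside the loop

print("libc.abs(-5): %.3f usec per loop" % with_lookup)
print("f(-5):        %.3f usec per loop" % cached)
```

How large the gap is, or whether there is one at all, depends on the Python version.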
Secondly, I tried specifying the ctypes function prototype. This actually
decreases performance noticeably, from 1.18 to 1.57 usec.
$ timeit -s "import ctypes; dll = ctypes.CDLL('mylib.pyd')"
-s "f=dll.first"
-s "f.argtypes=[ctypes.c_char_p]"
-s "f.restype=ctypes.c_int"
"f('abcde')"
1000000 loops, best of 3: 1.57 usec per loop
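For readers unfamiliar with prototypes, here is a small sketch of what declaring one looks like and what it buys in exchange for the extra cost. It assumes Python 3 (where C strings are passed as bytes) and uses the C runtime's strlen as a stand-in for dll.first.

```python
import ctypes
import sys

# Assumption: the C runtime stands in for mylib.pyd.
libc = ctypes.CDLL("msvcrt") if sys.platform == "win32" else ctypes.CDLL(None)

strlen = libc.strlen
strlen.argtypes = [ctypes.c_char_p]   # one argument: a NUL-terminated string
strlen.restype = ctypes.c_size_t      # the result is an unsigned size

length = strlen(b"abcde")
print(length)

# With argtypes set, ctypes checks each argument before the call is made,
# so a wrong type raises an exception instead of reaching the C code.
try:
    strlen(5)
    rejected = False
except ctypes.ArgumentError:
    rejected = True
print("bad argument rejected:", rejected)
```

The argument checking and conversion happen on every call, which is one plausible reason for the slowdown measured above.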
Finally I tested passing multiple parameters into the function. One of the
parameters is passed by reference in order to return a value. Performance
decreases as the number of parameters increases.
$ timeit -s "charAt = lambda s, size, pos: s[pos]"
-s "s='this is a test'"
"charAt(s, len(s), 1)"
1000000 loops, best of 3: 0.758 usec per loop
$ timeit -s "import mylib; s='this is a test'"
"mylib.py_charAt(s, len(s), 1)"
1000000 loops, best of 3: 0.929 usec per loop
$ timeit -s "import ctypes"
-s "dll = ctypes.CDLL('mylib.pyd')"
-s "s='this is a test'"
-s "ch = ctypes.c_char()"
"dll.charAt(s, len(s), 1, ctypes.byref(ch))"
100000 loops, best of 3: 2.5 usec per loop
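The by-reference pattern used for the ch argument can be shown with a standard function. This is a sketch under assumptions: Python 3, with the C runtime's strtol standing in for dll.charAt, since strtol also returns a second value through a pointer argument.

```python
import ctypes
import sys

# Assumption: the C runtime stands in for mylib.pyd.
libc = ctypes.CDLL("msvcrt") if sys.platform == "win32" else ctypes.CDLL(None)

# long strtol(const char *str, char **endptr, int base);
strtol = libc.strtol
strtol.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_char_p), ctypes.c_int]
strtol.restype = ctypes.c_long

text = b"42abc"            # keep a reference: 'end' will point into this buffer
end = ctypes.c_char_p()    # storage for the output parameter

# byref() passes a pointer to 'end' so the C function can write into it.
value = strtol(text, ctypes.byref(end), 10)

print(value)       # the parsed number
print(end.value)   # the rest of the string after the number
```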
One style of coding that improves the performance somewhat is to build a
C struct to hold all the parameters.
$ timeit -s "from test_mylib import dll, charAt_param"
-s "s='this is a test'"
-s "obj = charAt_param(s=s, size=len(s), pos=3, ch='')"
"dll.charAt_struct(obj)"
1000000 loops, best of 3: 1.71 usec per loop
This may work because most of the fields in the charAt_param struct are
invariant in the loop. Keeping them in the same struct object saves them from
being rebuilt on each call.
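The definition of charAt_param lives in test_mylib and is not shown here. The sketch below guesses at a plausible layout from the keyword arguments used above; the class name CharAtParam and the field types are assumptions, and it is written for Python 3, where C strings are bytes.

```python
import ctypes

class CharAtParam(ctypes.Structure):
    # Hypothetical layout, inferred from the call charAt_param(s=..., size=...,
    # pos=..., ch=...); the real struct may differ.
    _fields_ = [
        ("s", ctypes.c_char_p),   # the input string
        ("size", ctypes.c_int),   # its length
        ("pos", ctypes.c_int),    # index of the character to fetch
        ("ch", ctypes.c_char),    # output: filled in by the C function
    ]

s = b"this is a test"

# Build the struct once, outside the loop. The invariant fields are
# converted to C values a single time.
obj = CharAtParam(s=s, size=len(s), pos=3)

# Inside the loop only the field that changes needs to be updated
# before the struct is handed to the C function.
obj.pos = 5
print(obj.size, obj.pos)
```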
My overall observation is that a ctypes function call has an overhead 2 to 3
times that of a similar C extension function. This may become a limiting factor
if the function calls are fine-grained. Using ctypes for performance enhancement
is a lot more productive if the interface can be made medium- or coarse-grained.
A snapshot of the source code used for testing is available for download. It is
also useful if you want boilerplate for building your own ctypes library.
2009.07.16 [python]