python - 戇人日記

Blog

San Francisco, USA

Python CSV reader is much faster than pickle

If you are considering to serialize a large amount of data to the disk, performance may become a concern to you. Python provides a serialization tool in the pickle module. There is also an optimized version called the cPickle. But how do they perform?

The data of concern to me is tabular data. In order to do a bake off, I have generated 50,000 records of sample data. The CSV representation is shown below:

seq, name, address, city, age, birthday
1000,John M. Doe,2147 Main St.,Middle Town 14,47,1985-05-15
1001,John N. Doe,2148 Main St.,Middle Town 15,48,1985-05-16
1002,John O. Doe,2149 Main St.,Middle Town 16,49,1985-05-17
1003,John P. Doe,2150 Main St.,Middle Town 17,50,1985-05-18
1004,John Q. Doe,2151 Main St.,Middle Town 18,51,1985-05-19
1005,John R. Doe,2152 Main St.,Middle Town 19,52,1985-05-20
1006,John S. Doe,2153 Main St.,Middle Town 20,53,1985-05-21
1007,John T. Doe,211 Main St.,Middle Town 21,1,1985-05-22
...

Naturally, CSV is a contender for storing tabular data. (Indeed the data source I'm working with is in CSV format.) The two pickle modules produce identical data output. In addition, Python 2.6 also provides a JSON module that do the similar task as pickle but outputs a standard text based format. I included it in the comparison below.

First observation, CSV output the most compact data at 3MB. Pickle output is 40% larger at 4.2MB. JSON is somewhere in between. The speed? CSV is the winner among them all.

Method	Load Time (ms)	File size (MB)
CSV	188	3
CSV int	289	3
cPickle	692	4.2
pickle	1,815	4.2
JSON	4,975	3.9

Note that CSV reader create data items as string. In the sample data, two out of the six columns are integer fields. In order to do an apple-to-apple comparison I have another test that do integer conversion after loading such that the data loaded is identical to pickle's. This impacted the performance somewhat. But it is still more than twice as fast as the faster cPickle module. The standard library's JSON's performance trailing far behind, making it unsuitable for anything performance intensive. FYI, unlike the other modules, JSON's output is in unicode.

The test is done by Python 2.6 on Windows XP machine with 2.33GHz Core2 CPU (Download source code).

2010.05.12 [programming, python] - comments

ctype performance benchmark

I have done some performance benchmarking for Python's ctypes library. I am planning to use ctypes as an alternative to writing C extension module for performance enhancement. Therefore my use case is slight different from the typical use case for accessing existing third party C libraries. In this case I am both the user and the implementer of the C library.

In order to determine what is the right granularity for context switching between Python and C, I have done some benchmarking. I mainly want to measure the function call overhead. So the test functions are trivial function like returning the first character of a string. I compare a pure Python function versus C module function versus ctypes function. The tests are ran under Python 2.6 on Windows XP with Intel 2.33Ghz Core Duo.

First of all I want to compare the function to get the first character of a string. The most basic case is to reference it as the 0th element of a sequence without calling any function. The produce the fastest result at 0.0659 usec per loop.

  $ timeit "'abc'[0]"

  10000000 loops, best of 3: 0.0659 usec per loop

As soon as I build a function around it, the cost goes up substantially. Both pure Python and C extension method shows similar performance at around 0.5 usec. ctypes function takes about 2.5 times as long at 1.37 usec.

  $ timeit -s "f=lambda s: s[0]"  "f('abc')"

  1000000 loops, best of 3: 0.506 usec per loop

  $ timeit -s "import mylib" "mylib.py_first('abc')"

  1000000 loops, best of 3: 0.545 usec per loop

  $ timeit -s "import ctypes; dll = ctypes.CDLL('mylib.pyd')"
              "dll.first('abc')"

  1000000 loops, best of 3: 1.37 usec per loop

I repeated the test with a long string (1MB). There are not much difference in performance. So I can be quite confident that the parameter is passed by reference (of the internal buffer).

  $ timeit -s "f=lambda s: s[0]; lstr='abcde'*200000"
              "f(lstr)"

  1000000 loops, best of 3: 0.465 usec per loop

  $ timeit -s "import mylib; lstr='abcde'*200000"
              "mylib.py_first(lstr)"

  1000000 loops, best of 3: 0.539 usec per loop

  $ timeit -s "import ctypes; dll = ctypes.CDLL('mylib.pyd')"
           -s "lstr='abcde'*200000"
              "dll.first(lstr)"

  1000000 loops, best of 3: 1.4 usec per loop

Next I have make some attempts to speed up ctypes performance. A measurable improvement can be attained by eliminating the attribute look up for the function. Curiously this shows no improvement in the similar case for C extension.

  $ timeit -s "import ctypes; dll = ctypes.CDLL('mylib.pyd');
           -s "f=dll.first"
              "f('abcde')"

  1000000 loops, best of 3: 1.18 usec per loop

Secondary I have tried to specify the ctypes function prototype. This actually decrease the performance significantly.

  $ timeit -s "import ctypes; dll = ctypes.CDLL('mylib.pyd')"
           -s "f=dll.first"
           -s "f.argtypes=[ctypes.c_char_p]"
           -s "f.restype=ctypes.c_int"
              "f('abcde')"

  1000000 loops, best of 3: 1.57 usec per loop

Finally I have tested passing multiple parameters into the function. One of the parameter is passed by reference in order to return a value. Performance decrease as the number of parameter increase.

  $ timeit -s "charAt = lambda s, size, pos: s[pos]"
           -s "s='this is a test'"
              "charAt(s, len(s), 1)"

  1000000 loops, best of 3: 0.758 usec per loop

  $ timeit -s "import mylib; s='this is a test'"
              "mylib.py_charAt(s, len(s), 1)"

  1000000 loops, best of 3: 0.929 usec per loop

  $ timeit -s "import ctypes"
           -s "dll = ctypes.CDLL('mylib.pyd')"
           -s "s='this is a test'"
           -s "ch = ctypes.c_char()"
              "dll.charAt(s, len(s), 1, ctypes.byref(ch))"

  100000 loops, best of 3: 2.5 usec per loop

One style of coding that improve the performance somewhat is to build a C struct to hold all the parameters.

  $ timeit -s "from test_mylib import dll, charAt_param"
           -s "s='this is a test'"
           -s "obj = charAt_param(s=s, size=len(s), pos=3, ch='')"
              "dll.charAt_struct(obj)"

  1000000 loops, best of 3: 1.71 usec per loop

This may work because most of the fields in the charAt_param struct are invariant in the loop. Having them in the same struct object save them from getting rebuilt each time.

My overall observation is that ctypes function has an overhead that is 2 to 3 times to a similar C extension function. This may become a limiting factor if the function calls are fine grained. Using ctypes for performance enhancement is a lot more productive if the interface can be made to medium or coarse grained.

A snapshot of the source code used for testing is available for download. This is also useful if you want a boiler plate for building your own ctypes library.

2009.07.16 [python] - comments

Exploring parallel programming

When I first started blogging, I have a lot of geeky posting about software development. The idea of blogging is to write on short subjects frequently. It is different from writing a long article, which is so burdensome many people end up not doing it at all. But in practice, I have not lived up to expectation of a prolific blogger. Sometimes the gap between postings can be a few months.

Anyway this is the quick posting on geeky subject that I think I should do more. Recently I am working on high performance computing and number crunching task. I have dipped into writing C extension for Python for performance. I got spectacular result from the exercise. It is the best of both world. Programming in Python supplement by C is still very agile and iterative. On the other hand C is giving the performance unachievable with pure Python.

I am started to look into the multiprocessing module introduced in Python 2.6. It is a library to run multiple python processes. With an interface similar to the threading module, it is positioning itself as an alternative approach to threading to circumvent the limitation of Python's GIL. I haven't give much thought to it initially. But looking closer, I find it is really more than just an alternative to threading. It is really a parallel programming framework. For example, the method Pool.map() is really a prototype of a mapreduce framework. I am quite excited to explore this module in the coming days.

Another module I am looking at is the ctypes module. It allows Python code to call C functions directly. Again I haven't thought much about it when it was introduced in Python 2.5. That's because I haven't have a use case of it. Now that I am writing C module, I begin to realize how revolutionary it is. C module delivery great performance. But writing one is a pain. Just the need to keep track of reference count may double the number of lines of your code. For me I am not building a reusable module for other people. It is one off function for me to do number crunching. Using ctypes may allow me easier access to C without the burden of actually writing a Python module.

2009.07.12 [python, tech] - comments

The World's Shortest Short URL Generator

(2009-07-07 Update: I have suspended this demo program because I find spammers has started to exploit it. What a lesson for me to learn.)

There was a long winding discussion about a short URL generator bit.ly. Many people are familiar with the pioneer in this field tinyurl.com, which for many people, is the synonym of this service. Nevertheless, with fast raising popularity, many people see a lot of promises in bit.ly as a business. On the hand, just as many people brush off it as a trivial service easy to replicate. Someone dismiss it as an application that can be done in 10 lines of code. This prompted the posting and set off a discussion on Fred Wilson's blog.

I should say I agree with most of the comments from both sides :) But as a geek, the idea of building a short URL generator in 10 lines has latched on my head. So I set out to work. An hour later out came http://tungwaiyip.info/shorturl. In just 10 lines of code, it is the world's shortest short URL generator, or so I thought.

This is a real application. It has UI, URL shortening algorithm and database persistence, all in 10 lines of code. Feel free to check out the source code. I have a lot of fun doing this. There is a fatal flaw though. It does not handle concurrent update very well. To squeeze the code even more to handle concurrency does not seem fun. So I just stopped there.

2009.05.12 [python, programming] - comments

Python Stock Quote

There are many free tools available to help people manage their stock portfolio. For example, you can use Yahoo! Finance to track your current holdings. For each stock it list many detail financial data as well as stock chart for various periods. Still, when you need to do more sophisticated analysis, there is nothing that beats the power of spreadsheet.

Fortunately it is easy to pull the data from the web tool to a spreadsheet. You can simple copy and paste the data into a spreadsheet to get a properly formatted table. For there you can add your custom formula or charts. The problem is unlike the web view, data downloaded on a spreadsheet is not updated with current market information. On the other hand if you download the again you will lose all custom formula you have added.

Download sample spreadsheet

To get the best of both world, I have created a script to so that you can update your spreadsheet with the latest stock quote from the web. This is a simple Python script. It first looks for the stock symbols in your portfolio in your spreadsheet. It then fetches the (15 minute delayed) stock quote from a Yahoo web services. Finally it updates the cells in the spreadsheet with new data from Yahoo.

This tool requires Python 2.5, Excel and pywin32 extension. Click here to download the stock quote update script.

2008.06.22 [python] - comments

Python's half open index notation

Beginner programmers often wonder about Python's sequence indexing and slicing notation. Array index starts from 0. Slicing uses half open notations, where L[a:b] is a subsequence with index x where a <= x < b.

Why is the endpoint excluded? Isn't it more intuitive if array index starts from 1 and the endpoint is included, so that a 3 elements array is referenced as L[1:3] with items L[1], L[2], L[3]?

It turns out this notation is an elegant and deliberate design and it has some excellent properties.

We write programs to operate on arrays, to find their length, traverse the subsequences, split them or join them. The half open notation always show a simple pattern. But the inclusive notation often requires adding 1 or substracting 1 to the indexes in many operations. Thus it is more vulnerable to off-by-one-error. This article One True Way of array indexing discuss this at length. I have reproduce its example (with corrections) below:

Operation	Half open	Inclusive
length of a slice L[a:b]	b-a	(b-a+1)
first n characters of L	L[:n]	L[1:n]
last n characters of L	L[-n:]	L[len(L)-n+1:]
The identity slice	L == L[0:len(L)] == L[:]	L[1:len(L)]
The empty slice	L[a:a] is empty for any a.	perhaps L[a:a-1]?
A slice of length n, from point a	L[a:a+n]	L[a:a+n-1]
Split L[a:b] at index c	L[a:b] == L[a:c]+L[c:b]	L[a:c-1]+L[c:b]

Another important property is an empty sequence can be expressed by L[a:a], while there is no natural way to express an empty sequence with the inclusive notation. But do we really need to care about a special case? Absolutely! In fact failure to account for empty input is one of the most common error. Just like zero is a fundamental concept in mathematics, always think how you program can handle null input. An inferior approach is to represent empty sequence by None or null pointer. This creates a special case so that a variable need to be tested before dereferencing. Failure to do so contributes to unexpected exceptions. It is an elegant design that L[a:b] can also represent sequences with 0 length.

C++'s STL also choose this notation to represent a range. According to the literature this is crucial because "algorithms that operate on n things frequently require n+1 positions. Linear search, for example (find) must be able to return some value to indicate that the search was unsuccessful." I have seen so many people flunked link list or data structure exercises because they have trouble dealing with the end of a list. Often a good solution is shift the focus beyond the n concrete objects to the n+1 positions around them. I hope this help to make sense of the half open notation.

2005.06.16 [software, python] - comments

PyCon2005 day 3

The third day's keynote is delivered by Greg Stein from Google. He gave some insight about evangelizing Python in his last few companies. Small companies are more readily to adopt Python and consider it a competitive advantage. Whereas large company would hold on until the support environment is present. Nevertheless he believes the growth of Python has passed the tipping point and it was never a problem to train any new programmer Python.
He went on to describe the use of Python in Google and emphasized SWIG as a great glue for integrating code build using various languages.
Andi Vajda, whom's search engine PyLucene is what powers my MindRetrieve project, is giving a talk in PyCon. He outlined the challenges to compile a Java application into C executable and making it into Python extension library using GCJ and SWIG. The issues including different memory management, different thread model and cross language error reporting. The success of PyLucene draw a lot of interests in compiling other Java projects into executable and provide more language binding.
I enjoyed yesterday's lightning talks so much that I have stepped up to demonstrate my own MindRetrieve project today. Again the room was packed. I'm glad that I went thought the 5 minutes presentation reasonably well and as at least a few people seems to appreciate my idea.
Geek biker Peter Kropf has made a cross country motorcycle trip. With the bike was a custom built hardware censors and cameras recording everything. He made his videos available on his website .
Chris Tismer shown a web demo using stackless Python to maintain server state. Stateless Python sounds like a mystery. But his few lines of code is a great introduction.

I thoroughly enjoyed this three days of PyCon, met lots of great people and learned a whole lot. I cherish this supportive open source community and look forward to more exciting development in the coming year.

Read more about day 1, day 2 and day 3 of PyCon.

2005.03.25 [software, python] - comments

PyCon2005 day 2

Guido van Rossum delivered the State of Python keynote on the second day. First he mentioned a security issue in the Python standard library was reported recently. While the scope of this issue is limited, this has prompted the development team to setup a structure to response to future security problems. He then described some incremental improvement proposed. This is followed by some contentious "optional static type checking proposals". We can expect Python would continue its slow growth policy with few major change in coming releases.
I am missing more formal sessions because of the continuous discussion of web development in python. Shannon Behrens is giving a improvised tutorial of his Aquarium web framework to a user. Using this fairly straightforward framework he has covered the essence of web development within an hour. The Aquarium framework is comprised of mere a few thousands lines of code. This gave another perspective to the framework proliferation problem. Python is so productive that it is well within a single talented developer's capability to build a complete framework.
The open source movement give great opportunities to geeks to produce and contribute independently. But that could also leads to divergence is most apparent in Python's web development environment. A truly successful project will need not only technical excellency but also the ability to find consensus and to build coalitions.
Richard Jones has shown us the Roundup issue tracker. It seems to be well build and has rich functionalities. If you are starting a new project it is definitively an alternative to Bugzilla. Another similar project mentioned is trac with also has subversion integration.
PyCon has two sessions of Lightning talks make of of a series of informal 5 minutes presentations. This provides a low pressure environment and encourages people to show case smaller projects or ideas that might not warrant a full session. Given its unofficial nature I'm surprised to find the lightning talks is actually very well attended.
Armin Rigo has demonstrated a neat collect class that build a sequence from iterator on demand.
The Holger Krekel and Armin Rigo team has even more neat tools to show. The rlcomplete2 seems to be a must have command line completion tool. shpy enable people on two different computers to share screen and edit the same file simultaneous. That's what you call pair programming!
Wayne Yamamoto from Rustic Canyon Partners come to solicit talents to build startups base specifically on Python technologies. So far the Python community seems to be remarkably uncommercial. Many being merely closet Pythonistas. I think we really need to do more to let the larger world know how incredibly productive these Python technologies really are.
A few more sessions worth mentioning. Christopher Gillett from Compete Inc described the use of Python for large scale data mining. Michael Salib try to save all of us from the software patent machine. He has built a US Patent Database using Xapian as the search engine. Anna Ravenscroft shown us some important libraries dealing with date and time including Dateutil and pytz.

Read more about day 1, day 2 and day 3 of PyCon.

2005.03.24 [software, python] - comments

PyCon2005 day 1

I am really excited to go to PyCon for the first time. This is some notes about what happened in this 3 day conference in Washington D.C.

PyCon2005 starts with a keynote from Jim Hugunin from Microsoft, who started the IronPython project that ports Python on Microsoft's .NET platform.
Coming from Microsoft automatically put one into defensive when confronted with the non-Microsoft community. Jim certainly knows when to crack Microsoft jokes and what to say when a demo crash. Putting this aside he did delivered some great demos and made a strong case about the value of python on .NET platform. On the other hand I can't help thinking about how much ill will and negative publicity Microsoft has created.
The next interesting session is Holger Krekel talking about a novel testing tool py.test. He find the JUnit inspired unittest.py clumsy to use. With his test tool, user create test cases just using the assert statement, instead of the function call based unittest module, which he find quite clumsy. He then done some clever analysis when there is exception and generate an informative report.
He then went on to show another tool that bring a twist to RPC. Instead of the usual approach of transferring the objects to the remote host, he simply creates a two way channel and let the local and remote code communicates in their own way. Smart tool! Unfortunately the website http://codespeak.net/py seems to be down throughout the conference.
Next session Grig Gheorghiu cover a lot of ground about agile testing. He touches on various tools and the XP principles. Finally he demonstrated using wiki to let customers design test cases and provide instant testing and feedback. Don't you think all software should some have something like that? Check out FitNesse and Selenium.
I really love the PyWebOff presentation Michelle Levesque gave in the afternoon. She hit the nail on the head that having far too many web application frameworks in Python causes great confusion to the users. It was a fabulous and very entertaining presentation. The message is clear, users need a clear guidance on what framework to use given certain requirements.
Ian Bicking's talk about WSGI is exactly an effort to bring order to chaos about the proliferation the frameworks. While it is good to define a standard interface between certain layers, it is less clear to me if this effort would weed out the number of frameworks, at least not in the short run.

I think Python is missing the opportunity to establish itself as a premier web development platform due to these issues. Otherwise it could easily double or triple its user base. Instead it is losing market to some less capable tools like PHP. I was so passionate about this problem that I have spent most of the afternoon discussing this in open sessions rather than attending talks.

Read more about day 1, day 2 and day 3 of PyCon.

2005.03.23 [software, python] - comments

David Ascher's paper on Dynamic Languages

For the past year I have been engaged with the Python language and have very much impressed by it. So it is excited to find a recent white paper from David Ascher to speak for dynamic languages, a term he coined for the class of languages such as Perl, Python, PHP, etc, which are often referred to as scripting languages. He observed that these languages are widely used beyond the scripting area and their dynamic nature is really what set them apart from the system language such as C++ and Java.

The most interesting part in his paper is he look more than the technical competence but also the social aspect as a defining characteristics. These languages all have primary implementation in open source model and have active grassroots participation. Being open source also make them fertile ground for experimentation in academic language research. Despite having nearly no formal budget, they are able to evolve and succeed against other corporate made development tools.

I believe dynamic language is going to play an important role in the future of computing. And I see this paper to serve as the "The Cathedral and the Bazaar" in the programming language domain.

2004.09.09 [software, python] - comments

Python screen scrapper

At last I have built the application I always wanted, a weather and news email alert for my mobile phone. While my T-Mobile phone is capable of internet browsing, the connection is so poor that it make web surfing an unenjoyable experience. The email alert deliver the information and does not incur the painful delay in interactive browsing.

This is mostly a HTML screen scrapper in only 200 lines of Python code. Within a few hours I have success in the implementation. That is while I am still learning the language! If it is not so easy because of Python this would probably remain on the drawing board for a long time.

2003.04.21 [python, web, software] - comments

past articles »

BBC News

•	Prague gunman killed himself on roof as police approached (22 Dec 2023)
•	Bodycam footage shows police hunting Prague gunman (22 Dec 2023)
•	Alex Batty: Police launch abduction investigation into disappearance of British teen (22 Dec 2023)
•	Banksy stop sign drones art removed in London (22 Dec 2023)
•	Martin Kemp refunds disabled ticket after fans' difficulty with seller (22 Dec 2023)
•	Queues at Dover as Christmas getaway begins for millions (22 Dec 2023)
•	New Â£38,700 visa rule will be introduced in early 2025, says Rishi Sunak (22 Dec 2023)
•	UK at risk of recession after economy shrinks (22 Dec 2023)
•	Mohamed Al Bared: Student jailed for life for building IS drone (22 Dec 2023)
•	Andrew Tate denied request to visit ill mother in UK (22 Dec 2023)

SF Gate