tungwaiyip.info


Python 3 is not better, I am moving back to Python 2

After Python 3.3 was released last year, I began to embrace Python 3 in a big way. I use it in my personal projects whenever possible, when I do not have to worry about compatibility with an existing code base. I use it as my default REPL every day.

While I am not an early adopter, I am still ahead of many other Python programmers in actually making production use of it. Unfortunately, my experience so far has not been favorable (note that I have not used unicode strings much). I ran into miscellaneous problems. At first I thought it was a learning curve I had to overcome. Slowly I came to realize that some design decisions are really problematic.

Recently there have been more discussions around the problems of porting to Python 3, notably from Armin Ronacher of Flask. I want to add my experience to the discussion. Much has been written about the core issue of unicode string handling and the death of the binary string. My grievance is from a different area: the systematic replacement of list results with iterators.

Iterator Blues

Iterators and generators are some of the best features of Python. In many cases, a list is interchangeable with its corresponding iterator. When the result is large, an iterator is certainly more memory efficient. If you want to loop 100 million times, you probably don't want to build a list of 100 million items with range, but rather use the iterator version, xrange.
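In Python 2 the difference is a single name (a trivial sketch):

# range would build the full 100-million-item list; xrange yields values lazily.
total = 0
for i in xrange(100 * 1000 * 1000):   # constant memory
    total += i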

Since iterators are so great, the argument goes, from now on Python 3 should only provide iterator results.

This turns out to be a huge annoyance for me.

For a start, when I run a function in the interactive console, I don't see the result anymore. I get an iterator object. No big deal, the Python 3 people say, you can render the result by wrapping it in the list function. Fine, it is an inconvenience, a minor inconvenience you may argue. But you just cannot spin it as a positive change. When I first started using Python, coming from a Java background, I was delighted to find how easy things are in Python. Why bother building a chain of stream handlers to read a file, as in Java? The Pythonic way is to build the entire input in memory. 95% of the time the data is so small it does not matter. This was the Python magic, and it is beginning to be lost.

Using list would be a tolerable workaround if it were exceptional, if it were only needed in a small number of cases. But I have bumped into the wall so many times that I have begun to think the reverse is closer to the truth.

Other than wanting to see the result in the REPL, there is actually a long list of use cases for a list that cannot be conveniently expressed with an iterator, like

  • building nested data structures
  • processes that need the length of the collection
  • taking just one element, say the first, from the collection
  • many libraries that anticipate a list rather than an iterator (like numpy)

For example, I want to parse CSV-like input into a nested data structure.

In [22]: INPUT = """\
   ....: 1,2
   ....: 3,4
   ....: """.splitlines()

In [24]: [map(int, line.split(',')) for line in INPUT]
Out[24]: [<builtins.map at 0x321f270>, <builtins.map at 0x321f090>]

Ouch, I got a list of iterators instead. It used to be easy in Python 2, like this:

In [2]: [map(int, line.split(',')) for line in INPUT]
Out[2]: [[1, 2], [3, 4]]

The problem is you almost never want a nested data structure with iterators inside. When I accidentally created one, it usually caused a bug a few lines down, and I had to dig hard into the data structure to find out what had gone wrong.
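For the record, the Python 3 workaround is an extra list call inside the comprehension (a sketch):

[list(map(int, line.split(','))) for line in INPUT]
# -> [[1, 2], [3, 4]]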

Trying to pull a value from a dictionary adds further insult. Sometimes I want to inspect a value in a dictionary. Which one does not matter; I just need one. In Python 2, it is d.items()[0]. It would be dumb to write a for loop in Python 3 to do this. As an experienced programmer, I know I can use next(). But this gives me an exception?! How about d.items().next()? Fail. How about d.items().__next__()? It fails too. I spent hours before I found out that in Python 3, d.keys() corresponds not to iterkeys() but to the unfamiliar viewkeys() of Python 2. To get any value, I have to first turn the view into an iterator; only then can I apply next. When you have to apply an extra function like list once, it is an inconvenience. When you have to do it twice or more, it becomes a big clutter and a big annoyance.
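What I eventually settled on looks like this (a minimal sketch):

d = {'a': 1, 'b': 2}
next(iter(d.items()))   # some (key, value) pair; the view must be wrapped in iter() first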

Python 3 renders the map function nearly useless because of the extra list needed. In Python 2, we often have two alternatives to express a similar construct: the map function or a list comprehension. Usually I choose map when there is a function readily available, like int in my example above. But because of the extra clutter of list needed with map, the balance has tipped decisively toward list comprehensions in Python 3. I should be thankful, because they could have removed list comprehensions too and forced me to use generator expressions and list.

The bottom line is that this change is strictly feature removal. With unicode, it is a necessary pain to go through, and we gain predictable unicode handling as part of the bargain. With iterators, there is no new feature to be gained. Existing code is broken for nothing. All the Python 3 people tell you is to just wrap your function call in list, no big deal.

Suffice it to say I am not convinced. To me this is torture.

Feature Removal Pain

Just a few days ago I was bitten by another feature removal. The sort method used to have a cmp parameter, which was removed in Python 3. Oh, it is dumb to use cmp anyway, because the implementation using key is faster. Except in my case, I was working on a bioinformatics problem that required sorting all suffix substrings of a long string. With a string that is millions of characters long, generating millions of substrings quickly exhausts all memory. The trick is to use cmp to generate and compare the substrings on demand. This may be slower, but it works. Removing cmp is not merely an inconvenience; the algorithm breaks with no easy workaround.
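The idea, in Python 2, is roughly this (a sketch, not my actual code):

# Sort suffix start positions; cmp builds each pair of substrings
# only transiently during a comparison, so memory stays bounded.
s = 'banana'
indices = range(len(s))
indices.sort(cmp=lambda i, j: cmp(s[i:], s[j:]))
# indices now holds the suffix starting points in sorted suffix order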

I solved the problem by going back to Python 2.

Python 3 is the dead end

The official story line is that Python 2 is a dead end and Python 3 is the future. I am beginning to see it differently. Python 2 is actually alive and well. The development of the language and the interpreter has stalled, but innovation continues in third-party libraries and tools. For example, Pandas is a big step forward for Python in the data analysis space.

It is Python 3 we should worry about. I fear it will become a de facto dead end because of lack of adoption. Outside of my personal use, there is absolutely no proposal at my workplace to move to Python 3. The two companies and hundreds of programmers I have worked with recently are cranking out Python 2 code every day, not Python 3. At various times I considered championing Python 3 at work. I am not considering that anymore for the near future.

Sorry for the critical opinion. I just wish to open up some honest discussion about the merits of Python 3.

2014.01.23

 

Cython for the win, 177x speed increase!

Putting Cython into use, I have had great success in speeding up a computation algorithm. Previously I had great success getting a 10x speed increase just by swapping in pypy. This time, with a little bit of work, I got a 177x speed improvement using Cython. This vastly exceeds my expectations.

The code solves a string alignment problem in bioinformatics. With strings s and t, the algorithm has complexity O(|s||t|). The first step of writing Cython is to identify the performance-critical region of the code. In this case it is quite obvious that the bottleneck is the inner loop, with complexity O(n^2). No profiling is needed. The original Python inner loop is shown below. Note that M and B are numpy arrays.

def overlap_alignment_inner(s, t, sigma, M, B):
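  # Fill the score matrix M and the backtrack matrix B cell by cell;
  # each cell takes the best of a gap in s, a gap in t, or a (mis)match.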
  for i in range(1,M.shape[0]):
    for j in range(1,M.shape[1]):
      match_score = M[i-1, j-1] + (1 if (s[i-1] == t[j-1]) else -2)
      M[i,j], B[i,j] = max([
                             (M[i-1, j] - sigma, (i-1, j  )),
                             (M[i, j-1] - sigma, (i  , j-1)),
                             (match_score,       (i-1, j-1)),
                           ])

After a few iterations, I arrived at the optimized code below. It looks somewhat different at first glance, but I will walk through the changes I have made. I have taken a lot of clues from the tutorial Working with NumPy.

import numpy as np
cimport numpy as np
DTYPE = np.int
ctypedef np.int_t DTYPE_t

def overlap_alignment_inner(s, t, int sigma, np.ndarray[DTYPE_t, ndim=2] M, np.ndarray[DTYPE_t, ndim=3] B):
    cdef int i, j
    cdef int s10, s01, s11
    for i in range(1,M.shape[0]):
        for j in range(1,M.shape[1]):
            s01 = M[i, j-1] - sigma
            s10 = M[i-1, j] - sigma
            s11 = M[i-1, j-1] + (1 if (s[i-1] == t[j-1]) else -2)
            if s11 >= s01 and s11 >= s10:
                M[i,j] = s11
                B[i,j,0] = i-1
                B[i,j,1] = j-1
            elif s01 >= s10 and s01 >= s11:
                M[i,j] = s01
                B[i,j,0] = i
                B[i,j,1] = j-1
            else:
                M[i,j] = s10
                B[i,j,0] = i-1
                B[i,j,1] = j

The first thing to do is to add types to certain variables for early binding. The simple int declaration below increases speed by about 10%.

cdef int i, j

The next thing to do is to type the numpy ndarray objects, like below. This allows faster indexing compared to normal Python operations and speeds things up a few times.

... np.ndarray[DTYPE_t, ndim=2] M, np.ndarray[DTYPE_t, ndim=3] B ...

The most significant problem turns out to be the use of the max function. In the original code it is a concise and stylish way to pick the best score and simultaneously assign two values to M and B. But it keeps the statement a costly Python function call. Unwinding the function into if statements allows the code to be fully optimized, attaining the 177x speed improvement! This brings the performance into the league of C.

The unwound code, while longer, is actually quite straightforward. So I backported it to the pure Python code, and this also resulted in a 3x improvement! It turns out the use of max here is rather costly.

My first use of Cython has been very successful. By identifying and optimizing only a few lines of critical code, it dramatically speeds up the performance to the level of C. The rest of the code is not performance critical and remains easy to write and debug in Python.
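For completeness, here is a minimal build script sketch (assuming the Cython source is saved as overlap_alignment.pyx; the cimport numpy line requires the numpy headers):

# setup.py -- build with: python setup.py build_ext --inplace
from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize('overlap_alignment.pyx'),
    include_dirs=[numpy.get_include()],
)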

2013.12.19

 

Getting Cython to work with Python 3.3 on Windows

I have spent a fair bit of time getting Cython to work with Python 3.3 on Windows, so I am adding a note here in case anyone wants a similar setup. These are the components I use:

  • Windows 7
  • Python 3.3 (Anaconda distribution)
  • Visual C++ 2010 Express

To install the Python 3.3 version of Anaconda, there is a twist. The Anaconda Windows installer (as of 2013-12) ships with Python 2.7. You could add the Python 3.3 environment after Anaconda is installed, but then you would be adding the 3.3 packages on top of the 2GB base 2.7 installation. If you don't need 2.7, there is a way to save the disk space. Download and install the Miniconda3 installer. It should be only a 32MB base installer. Note that I have only successfully used the x86 32-bit version; I will explain the problem with the 64-bit version later. Next, run this to install the Python 3.3 Anaconda packages:

conda update conda
conda create -n py33 python=3.3 anaconda
activate py33

Previously I tried to download Miniconda and install individual packages as needed. But then I found that I did not save much disk space compared to the 2GB full set of Anaconda packages. So I would rather save myself the hassle and install everything.

Next you need a C compiler. The idea is that your C compiler should be the same version that compiled the Python interpreter. How do you find out which compiler version was used? Notice that when you start the Python interpreter, it outputs one line stating the system version. From this we can find out what it was compiled with. The Anaconda Python version string looks like:

Python 3.3.3 |Anaconda 1.8.0 (32-bit)| (default, Dec 3 2013, 11:58:39) [MSC v.1600 32 bit (Intel)]

1600 is the MSVC version string. Divide it by 100 and then subtract 6 to get 10. This corresponds to Visual Studio 2010 Express. If you have Python 2.7, the version string reads 1500, which corresponds to Visual Studio 2008 Express. I found this out only after tracing the source code of msvc9compiler.py in the distutils package.
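A quick sketch to pull the number out programmatically:

# Read the MSC compiler version out of sys.version.
import re
import sys

m = re.search(r'MSC v\.(\d+)', sys.version)
if m:
    vc = int(m.group(1)) // 100 - 6   # 1600 -> 10 (VS 2010), 1500 -> 9 (VS 2008)
    print('built with Visual C++ %d.0' % vc)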

Originally I installed the 64-bit version of Miniconda. In that case the version string looks like:

Python 3.3.3 |Continuum Analytics, Inc.| (default, Dec 3 2013, 11:56:40) [MSC v.1600 64 bit (AMD64)] on win32

I don't know the significance of the (AMD64) label in there (my PC is an Intel i7). But it prevents msvc9compiler from finding the correct compiler. It gave me a cryptic error when I tried compiling:

ValueError: ['path']

I succeeded only after reinstalling with the 32-bit Miniconda. With this I am able to compile Cython's hello world example. I hope this helps.

2013.12.18

 

Technical notes: numpy and anaconda

I'm making a big move into Python data processing. Here are some technical notes to myself.

Notes on Parallel Programming with numpy and scipy
http://wiki.scipy.org/ParallelProgramming

My question about numpy matrix multiplication with a custom dot product got answered quickly
http://stackoverflow.com/questions/19278313/numpy-matrix-multiplication-with-custom-dot-product/19278448

How to hash a numpy array
Answer: use hash(a.tostring())
http://stackoverflow.com/questions/16589791/most-efficient-property-to-hash-for-numpy-array
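A minimal sketch:

import numpy as np

a = np.arange(10)
print(hash(a.tostring()))   # arrays are unhashable; hash the raw bytes instead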

I am trying out the Anaconda scipy stack distribution. The download is a gigantic 300MB file. Then I created a Python 3 environment, which downloaded another few hundred megabytes of Python 3 packages. I should follow the note on Using Python 3 as my default Python instead.

2013.10.09

 

Ace the Rosalind Bioinformatics Challenge

Here is something to brag about. It has taken me 6 weeks, but at last I have cracked all 123 problems on the Rosalind Bioinformatics Challenge website. I became the 11th person in the world to have solved all the problems posted.

[Image: Rosalind Top Players leaderboard]

I have always been curious about the science of molecular biology and the related field of bioinformatics. But as a layman in biology, I found it difficult to grasp the concepts in any depth. After someone introduced me to the Rosalind project, a website that uses hands-on problem solving to teach bioinformatics, I was quickly hooked on working through the problems. These range from basic programming and algorithm problems at the beginning, which I quickly knocked down, to rather advanced problems at the end. For a few hard problems I had to reach for research papers to find the approach. It feels like quite an accomplishment to have done it.

2013.06.13

 

Collaborative Filtering in 9 Lines of Code

I purchased Toby Segaran's book Programming Collective Intelligence when it first came out in 2007, yet I still had not finished it after all these years. Now I am giving myself another push. While revisiting the book, I tried to implement its techniques using Pandas. I am quite happy to be able to do it in just 9 lines of code, so I have put together a little article showing collaborative filtering using Pandas. This was also a great exercise in making a very nice document using the IPython notebook for the first time.

The main reason for not having completed the book is not that it is uninteresting or not useful. It is very much an excellent book for introducing readers to the topic of collective intelligence. The problem is that I strive to work through all the coding examples, not just read the text. Many times I felt the examples given, designed for the average programmer without additional background, had a low level of abstraction. I tried to improve them, including building some tools myself. This unfortunately doomed me, because I never had the time to build the tools or even just to work through the exercises in sufficient detail. This time around I feel that I have gained a lot of experience and have access to many tools like Pandas. I hope this will be an enjoyable exercise.
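To give a flavor of why Pandas makes this so short, here is a sketch of the idea with made-up ratings (not the article's exact 9 lines): correlate users on their overlapping ratings, then score each unseen item by a similarity-weighted average.

import pandas as pd

# rows = items, columns = users; NaN = not rated (made-up data)
ratings = pd.DataFrame({
    'Alice': {'Snakes': 4.5, 'Dupree': 1.0, 'Lady': 3.0},
    'Bob':   {'Snakes': 4.0, 'Dupree': 1.5, 'Luck': 3.0},
    'Carol': {'Snakes': 3.5, 'Lady': 2.5, 'Luck': 2.5},
})

def recommend(user):
    sims = ratings.corrwith(ratings[user]).drop(user)   # similarity to each other user
    unseen = ratings.drop(user, axis=1)[ratings[user].isnull()]
    scores = unseen.mul(sims, axis=1).sum(axis=1)       # similarity-weighted votes
    norm = unseen.notnull().mul(sims, axis=1).sum(axis=1)
    return (scores / norm).sort_values(ascending=False)

print(recommend('Alice'))   # predicted ratings for the items Alice hasn't rated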

2012.12.17

 

My view on HTML5 Apps vs. Native Mobile Apps

In May, we participated in our business partner Chartboost's mobile hackathon. We weren't there to win. We were there partly to be cheerleaders, and partly to use it as an opportunity to learn to build mobile apps. Most of us didn't have much experience at all. Unfortunately we flopped: after two days we had failed to build a usable application.

Two days wasn't long enough for me to become proficient in the Android platform SDK. But it was long enough for me to form a negative opinion. Jeez, it was like a flashback to the old days when we were building Windows apps. It is the same style of GUI toolkit as Swing or the Microsoft frameworks, the style we have largely left behind in favor of web apps. Hundreds of thousands of developers are toiling to develop on these platforms. How long is this rush of native mobile app development going to last?

Web development has come a long way in the past decade. The power of HTML, CSS and JavaScript has unleashed a great deal of creativity and design that constantly awes people. It isn't just flashiness or novelty. The fact is that the web layout engine has so much power and fine-grained control that it leaves the old-school GUI toolkits far behind. While I was scratching my head over how to lay out Android widgets, I realized I was thinking in terms of the CSS box model and really missed it on the native platform. Web development is far more productive. Its language is declarative. The development environment is interactive. Going back to a compile-and-execute development cycle just to change one attribute is so frustrating. The other thing I realized is that we have built up a vast community-based knowledge network on web development through Q&A websites and the diligent work of many individuals. HTML and JavaScript are no longer the unpredictable compatibility headache they were in the era just after the browser wars. Today we have a rich set of libraries and the expertise to use web technologies to their fullest.

When Steve Jobs first introduced the iPhone, he said the web is the mobile SDK. Steve was right all along.

2012.07.05

 

Python Whitespace Misconception

Whitespace is significant in Python. This uncommon syntax rule is one of the refrains of many non-Python programmers about Python. Given that most computer languages since Fortran have adopted free-layout syntax, this feels like an objectionable design.

Now that I think of it, I find this mostly a problem of framing. The term "whitespace" already embeds the meaning of unimportant filler. In this sense "significant whitespace" sounds like a problem. Actually the right way to describe it is that Python uses indentation to determine the block structure. Rather than feeling wrong, this is a lot more likely to evoke an "aha" reaction from non-Python programmers. Of course it makes a lot of sense for the indentation to follow the block structure. I can hardly find any legitimate case for the two to differ.

Astute programmers may notice that there are two ways to indicate block structure in mainstream languages: by using language constructs like "begin" and "end", and by laying out the code using indentation. Whenever there are two ways to express one thing, there will be problems of inconsistency. This extra degree of freedom can get you into trouble. One of the nastiest classes of programming bug is the dangling else: the case, in a free-layout language, where the apparent block structure differs from the actual block structure. A dangling else is not possible in Python, because the block structure is determined by the indentation, as the sketch below illustrates.
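In Python, the two possible readings of a nested if/else must be spelled as two different programs, so the ambiguity cannot arise:

a, b = True, False

if a:
    if b:
        print('inner then')
    else:                   # this else belongs to the inner if
        print('inner else')

if a:
    if b:
        print('inner then')
else:                       # this else belongs to the outer if
    print('outer else')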

2012.06.08

 

&D\anger'"+<b>@?!mb Against Code Injection

I have to defend my web app against code injection. I find that the problem requires us to consider how an input string is used in several different contexts. [more...]

2012.05.01

 

A Program That Outputs Itself

A long time ago a schoolmate of mine posed a challenge: write a program whose output is itself. I took on the challenge thinking I was fairly good at programming and mathematics. But I could never get it right, and the problem haunted me for many years. [more...]
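For the curious, the classic trick looks like this (a textbook quine sketch, not the solution described in the article):

# The string holds a template of the whole source; %r re-quotes it when printed.
s = 's = %r\nprint(s %% s)'
print(s % s)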

2011.11.17

 

Facebook OAuth Authentication Flow

I was trying to follow the Facebook OAuth documentation, and I finally figured it out. There are three parties and multiple steps involved. I have created a diagram to show the flow (the server-side flow). [more...]

2011.02.19

 

Traffic Data Analysis

I have been doing a traffic data analysis based on the vehicle location data pulled from the SF Muni website. This is an extremely interesting project. I picked up a whole lot of new skills while doing this, most notably data analysis and computational geometry. The project also turned out to be a challenging one. A few months (of part-time work) have gone by and I still haven't nearly achieved my original vision. [more...]

2010.09.02

 

Find Sophie Solution (Facebook Programming Puzzle)

This is the fourth and probably my last post about solving Facebook programming puzzles. The Find Sophie problem is a buffet (hardest) level puzzle. It is basically a variation of the traveling salesman problem, which is, of course, NP-complete. But before you throw up your hands, consider that the puzzle refers to locations in your home where you may find a cat. Just how many locations are there? Let's say living room, dining room, bedroom, on the bed, under the desk, behind the curtain, etc. If Facebook is not being very wicked, this puzzle should be bounded to the low hundreds of locations. That means it is still possible to solve with some relatively simple search algorithm.

Here is the outline of my solution: [more...]

2010.07.22

 

Python CSV reader is much faster than pickle

If you are considering serializing a large amount of data to disk, performance may become a concern. Python provides a serialization tool in the pickle module. There is also an optimized version called cPickle. But how do they perform? [more...]
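A rough sketch of how such a comparison can be run (a modern equivalent, not the article's exact test):

import csv
import io
import pickle
import time

rows = [[i, i * 2, i * 3] for i in range(100000)]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()
blob = pickle.dumps(rows)

start = time.time()
parsed = [list(map(int, r)) for r in csv.reader(io.StringIO(text))]
print('csv:   ', time.time() - start)

start = time.time()
parsed = pickle.loads(blob)
print('pickle:', time.time() - start)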

2010.05.12

 

Data Compression Comparison

This is a follow-up on my last post about data compression. After encoding my numerical data in a compact CSV format, I apply data compression before storing it on disk. I have done a quick study of the two algorithms available in the standard Python library, gzip and bzip2. The result is shown below. The original message's size is 537,776 bytes.
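The study itself only takes a few lines (a minimal sketch; the file name is hypothetical):

import bz2
import gzip

msg = open('data.csv', 'rb').read()
print('original:', len(msg))
print('gzip:    ', len(gzip.compress(msg)))
print('bzip2:   ', len(bz2.compress(msg)))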

Gzip compression result [more...]

2010.04.21

 

The Power of Gzip

I know information theory says that good data compression will shrink a message down to its entropy. So for application developers, it is not productive to design our own space-saving encoding scheme if we plan to apply data compression at the end anyway. Because the original message and the encoded message contain the same amount of information, the compressed data will end up approximately the same size. [more...]
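A quick way to test the claim (a sketch): compress a verbose and a compact encoding of the same data and compare the sizes.

import gzip
import json

data = list(range(10000))
verbose = json.dumps(data).encode()           # '[0, 1, 2, ...]'
compact = ','.join(map(str, data)).encode()   # '0,1,2,...'
print(len(gzip.compress(verbose)), len(gzip.compress(compact)))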

2010.04.19

 

Peak Traffic Solution (Facebook Programming Puzzle)

The Peak Traffic puzzle is easy to interpret: it is a graph problem to find all maximal cliques. Implementing it, like many graph problems, is quite difficult though. The number of possible combinations is huge, and there is no simple ordering of graph subsets. It ended up costing me several days. Here is the outline of my solution:

  1. Parse the input. Generate an undirected graph.
  2. Generate a list of cliques using a recursive algorithm.
  3. Eliminate subsets from the result to find all maximal cliques.
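For reference, the textbook recursion for this problem is the Bron-Kerbosch algorithm, which finds the maximal cliques directly and avoids the separate subset-elimination step; here is a minimal sketch (not my un-tidied code below):

def bron_kerbosch(R, P, X, graph, out):
    # R: current clique, P: candidate vertices, X: already-explored vertices
    if not P and not X:
        out.append(R)         # R can grow no further: a maximal clique
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & graph[v], X & graph[v], graph, out)
        P.remove(v)
        X.add(v)

# adjacency sets: graph[v] = neighbours of v
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
cliques = []
bron_kerbosch(set(), set(graph), set(), graph, cliques)
print(cliques)   # [{1, 2, 3}, {4, 5}]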

After working on it for several days, I'm quite done. I haven't spent any more effort to optimize or explain it, so here is my un-tidied code. [more...]

2010.03.03

 

Visualization Using Variable Width Bar Chart

I was plotting a chart to visualize various development projects in my area. The primary concern is development density, but the project footprint itself is also a factor. We want to focus on big projects that matter. We also want to identify outliers that have very high or very low density but otherwise have small overall size and less relevance. [more...]
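The sketch below shows the idea with made-up numbers: bar width encodes a project's footprint, bar height its density.

import matplotlib.pyplot as plt

footprint = [4.0, 1.5, 8.0, 2.5]   # project footprints (hypothetical units)
density = [120, 300, 45, 80]       # development density (hypothetical units)

# stack the bars side by side so that width carries meaning
left = [0]
for w in footprint[:-1]:
    left.append(left[-1] + w)

plt.bar(left, density, width=footprint, align='edge', edgecolor='white')
plt.xlabel('cumulative footprint')
plt.ylabel('density')
plt.show()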

2010.02.26

 

Refrigerator Madness solution (Facebook programming puzzle)

Refrigerator Madness is the second Facebook programming puzzle I have tried. The problem description is complicated. I won't try to repeat it here; instead I suggest you refer to the original Facebook page. [more...]

2010.02.22

 

smallworld solution (Facebook programming puzzle)

Facebook has published a number of programming puzzles to challenge software developers. Their problems are not trivial at all. This is my attempt at the "Snack" level problem It's A Small World. Even this one took me some time to solve. [more...]

2010.02.21

 
