tungwaiyip.info

home

about me

links

my software

Media

Yucatán Photos

St Lucia Photos

Photo Album

Videos

Blog

< November 2011 >
SuMoTuWeThFrSa
   1 2 3 4 5
6 7 8 9101112
13141516171819
20212223242526
27282930   

past articles »

Click for San Francisco, California Forecast

San Francisco, USA

 

A Program That Output Itself

Long time ago my schoolmate pose a challenge, write a program those output is itself. I take on the challenge think I'm fairly good in programming and mathematics. But I can never get it right and this problem haunted me for many years.

def rep(template):
    print template + 'rep(' + '"'*3 + template + '"'*3 + ")"

rep("""def rep(template):
    print template + 'rep(' + '"'*3 + template + '"'*3 + ")"

""")

Now that I'm reading Gödel, Escher, Bach I finally beginning to learn the technique. I am really happy to finally successfully written the program that output itself in Python above!

2011.11.17 [] - comments

 

Facebook OAuth Authentication Flow

I was trying to follow the Facebook OAuth documentation. I finally have it figured out. There are three parties and multiple steps involved. I have created a diagram to show the flow (the server-side flow).

Facebook OAuth Authentication Flow

This is a condensed version of Facebook's documentation of the steps required.

  1. Redirect the user to Facebook's OAuth Dialog /dialog/oauth?app_id.
    1. User authentication - If the user is not logged in, they are prompted to enter their credentials.
    2. App authorization - After the user is successfully authenticated, the OAuth Dialog will prompt the user to authorize the app.
  2. After the app is authorized, the OAuth Dialog will redirect (via HTTP 302) the user's browser to the URL you passed in the redirect_uri parameter with an authorization code.
  3. App authentication - In order to authenticate your app, you must pass the authorization code and your app secret to the Graph API token endpoint /oauth/access_token?app_id&secret&code. If your app is successfully authenticated and the authorization code from the user is valid, the authorization server will return the access token.

If you got a "Error validating verification code" in step 3, note that the redirect_uri should be the same as in step 1. See the issue on StackOverflow.

2011.02.19 [] - comments

 

Traffic Data Analysis

I was doing a traffic data analysis base on the vehicle location data pull from the SF Muni website. This is an extremely interesting project. I pick up a whole lot of new skills while doing this, most notably data analysis and computational geometry. The project also turns out to be a challenging one. A few months (of part time work) has gone by I still haven't nearly achieved my original vision.

Still I think I am getting better at this work. So I'm sharing some interesting data I'm working with. On one bus route I've collected 20,000 location readings on a single day. The goal is to organize them into individual vehicle trip. After a first pass, the data are reduced to 336 trips. But how good is the quality of data? And how good is my trip segregation algorithm? So I plotted the trips' distance against their duration on a chart. The graph quickly helped me to evaluate the quality of the result.

Bus #38 trips characteristic

The first impression is the result looks fairly good. A dense band of points shows that most trip length is about 10 km, and they takes about 40-75 minutes. This seems to match real world experience. At the lower left hand corner is a number of trips that last very short time and cover very little distance. I'll probably treat them as noise and discard them. A curiosity is some trips seem to cover a long distance from 15 to 20 km, much long than the official route. Where did they went?

Closer examination on their track reveal the problem. These are all legitimate eastbound trips. However the bus starts from a depot in the middle of the city. They went all the way to the terminal in the west end. Only from there the real eastbound trip begins. The challenge for me is I'm only interested in the eastbound portion. The trip from the depot to the west end is actually data pollution. I need to find a way to handle these extraneous data.

This still leaves a scatter of points of around 5 km long and last 15 to 35 minutes to inspect. Meanwhile I have difficulty to examine the track because Google Earth is crashing on me all the time. I guess this is a good time to turn away from technical work and to write a blog instead.

2010.09.02 [, ] - comments

 

Find Sophie Solution (Facebook Programming Puzzle)

This is the fourth and probably my last post about solving Facebook programming puzzle. The Find Sophie problem is a buffet (hardest) level puzzle. It is basically a variation of the traveling salesman problem. Traveling salesman is of course, a NP-complete problem. But before you throw up your hand, consider the puzzle refers to locations in your home where you may find a cat. Just how many locations are there. Let's say living room, dining room, bedroom, on the bed, under the desk, behind the curtain, ... etc. If Facebook is not very wicked, this puzzle should be bounded to a low hundreds of locations. This mean it is still possible to solve using some relatively simple search algorithm.

Here is the outline of my solution:

  1. Prepare to enumerate all paths.
  2. Prefer the local optimal during the search. Remember the best solution encountered so far.
  3. Prune the search tree. For example if the current best solution has a score of 50. When you are considering another path, as soon as its score exceed 50, you should abandon it.
  4. Improve the pruning by taking the additional score to complete the path into account. For example, if the current score of a partial path is S, the minimum score to complete the remaining walk is Rmin. Then if S+Rmin > best score, prune it.

Actually I have submitted the solution a few months ago. Only today do I find out I have not posted my solution in my blog. So I'm really speaking from my memory. I hope my plan is still accurate. Again my un-tidied code is available.

The other 3 facebook puzzles I have solved was posted below:

2010.07.22 [] - comments

 

Python CSV reader is much faster than pickle

If you are considering to serialize a large amount of data to the disk, performance may become a concern to you. Python provides a serialization tool in the pickle module. There is also an optimized version called the cPickle. But how do they perform?

The data of concern to me is tabular data. In order to do a bake off, I have generated 50,000 records of sample data. The CSV representation is shown below:

seq, name, address, city, age, birthday
1000,John M. Doe,2147 Main St.,Middle Town 14,47,1985-05-15
1001,John N. Doe,2148 Main St.,Middle Town 15,48,1985-05-16
1002,John O. Doe,2149 Main St.,Middle Town 16,49,1985-05-17
1003,John P. Doe,2150 Main St.,Middle Town 17,50,1985-05-18
1004,John Q. Doe,2151 Main St.,Middle Town 18,51,1985-05-19
1005,John R. Doe,2152 Main St.,Middle Town 19,52,1985-05-20
1006,John S. Doe,2153 Main St.,Middle Town 20,53,1985-05-21
1007,John T. Doe,211 Main St.,Middle Town 21,1,1985-05-22
...

Naturally, CSV is a contender for storing tabular data. (Indeed the data source I'm working with is in CSV format.) The two pickle modules produce identical data output. In addition, Python 2.6 also provides a JSON module that do the similar task as pickle but outputs a standard text based format. I included it in the comparison below.

First observation, CSV output the most compact data at 3MB. Pickle output is 40% larger at 4.2MB. JSON is somewhere in between. The speed? CSV is the winner among them all.


Method Load Time (ms) File size (MB)
CSV 188 3
CSV int 289 3
cPickle 692 4.2
pickle 1,815 4.2
JSON 4,975 3.9

Note that CSV reader create data items as string. In the sample data, two out of the six columns are integer fields. In order to do an apple-to-apple comparison I have another test that do integer conversion after loading such that the data loaded is identical to pickle's. This impacted the performance somewhat. But it is still more than twice as fast as the faster cPickle module. The standard library's JSON's performance trailing far behind, making it unsuitable for anything performance intensive. FYI, unlike the other modules, JSON's output is in unicode.

The test is done by Python 2.6 on Windows XP machine with 2.33GHz Core2 CPU (Download source code).


2010.05.12 [, ] - comments

 

Data Compression Comparison

This is a follow up on my last post about data compression. After encoded my numerical data in a compact CSV format, I apply data compression before storing it in the disk. I have done a quick study on the two algorithm available in standard Python library, gzip and bzip2. The result is shown below. The original message's size is 537,776 bytes.

Gzip compression Result

Compression Level Compressed Size Compress time Decompress time
9 183,019 179 ms 5.51 ms
6 184,532 125 ms 5.48 ms
3 203,105 38.2 ms 5.54 ms

Bzip2 compression Result

Compression Level Compressed Size Compress time Decompress time
9 152,283 84.3 ms 29 ms
6 152,283 84.9 ms 29 ms
3 157,065 80.6 ms 26.9 ms
1 166,949 79.8 ms 26.7 ms

Surprisingly, bzip compress faster than gzip at level 9. Unfortunately compression performance is the least important for me. Compression ratio and decompression performance is far more important. Compression is only done one time. But fetching and decompressing the data is going to be done many times. It is hard for me to choose between the better compression ratio of bzip or the faster decompression time of gzip. For now I think I will stick with gzip.

2010.04.21 [] - comments

 

The Power of Gzip

I know Information Theory says that good data compression will shrink a message down to its entropy. So for application developers, it is not productive to design our own spacing saving encoding scheme if we plan to apply data compression at the end anyway. Because the original message and the encoded message contain the same amount of information, the compressed data will end up with approximately the same size.

I don't realize how true it is until I have actually tried it. I am working with a CSV file with mostly integer data. I am very keen on reducing its size to save storage and network bandwidth. So I tried several schemes. They all failed to make significant saving once gzipped.

The first attempt is on the minus sign. I notice there are a lot of negative numbers. The '-' sign occupies one bytes, but it only carries one bit of information. What if I apply a simple encoding, e.g. using 'A' to stand for '-1', and 'B' stand for '-2' and so on? Trimming the negative sign with this encoding cut down the storage by 6%.

  e.g.
    "108,-2,-10"  ->  "108,B,A0"

What about the result after gzipping? Gzip shrinks the original data down to 34%. For the encoded message, it is 36%. The difference between the two? A negligible 0.1%.

Next attempt, it seems wasteful to store an integer as string using only 10 decimal digits per character. What if we use the hexadecimal representation? The conversion is trivial and it should cut down the string length a bit. If this is fruitful we may even try to use a higher base. Using the hexadecimal scheme, we reduce the storage by 7%. But once gzipped, the saving again evaporates.

A far more lucrative approach is to abandon text format altogether and use binary encoding for the numbers. Since the order of the number differ a lot, I use a kind of variable length integer encoding to make it economical for both small and large numbers. The binary encoding deliver the most significant saving by cutting down the storage by 44%. The text data and the binary encoded data seem very different initially, not to mention its size is nearly half of the original. But once gzipped, the binary data is only 4% smaller. Despite the big difference in representation, the compressed data is still proportional to the entropy. The 4% gain is hardly enough to justify using binary format over text.

The lesson learned? Don't be too concern about the efficiency of storing number in text format like CSV. Data compression will take out the inefficiency in one easy step.

Finally I like to mention some encoding that works for me. The data is initially available in XML format. Dropping the XML baggage and store it in CSV format saves a lot. Secondly, storing only the delta of the numbers works very well in my application. Furthermore, slightly reducing the precision of the numbers, a sort of lossy compression, also deliver a meaningful saving. More importantly, the saving still present after compression.

2010.04.19 [] - comments

 

Peak Traffic Solution (Facebook Programming Puzzle)

The Peak Traffic puzzle is easy to interpret. It is a graph problem to find all maximal cliques. Implementating it, like many graph problems, is quite difficult though. The possible combination is huge. And there is no simple ordering graph subset. It end up costed me several days to do it. Here is the outline of my solution:

  1. Parse the input. Generate an undirected graph.
  2. Generate a list of cliques using recursive algorithm.
  3. Eliminate subsets from the result to find all maximal cliques.

After working on it for several days, I'm quite done. I haven't spent any more effort to optimize it and expain it. So here is my un-tidied code.

2010.03.03 [] - comments

 

Visualization Using Variable Width Bar Chart

I was plotting a chart to visualize various development project in my area. The primary concern is in development density. But the project footprint itself is also a factor. We want to focus on big project that matters. We also want to identify outliers that has very high or very low density, but otherwise has small overall size and less relevance. [more...]

2010.02.26 [, ] - comments

 

Refrigerator Madness solution (Facebook programming puzzle)

Refrigerator Madness is the second Facebook programming puzzle I have tried. The problem description is complicated. I won't try to repeat it here but to suggest you to refer to the original facebook page. [more...]

2010.02.22 [] - comments

 

smallworld solution (Facebook programming puzzle)

Facebook has published a number of programming puzzles to challenge software developers. Their problems are not trivial at all. This is my attempt on "Snack" level problem It's A Small World. Even this one take me sometimes to solve. [more...]

2010.02.21 [] - comments

 

The World's Shortest Short URL Generator

(2009-07-07 Update: I have suspended this demo program because I find spammers has started to exploit it. What a lesson for me to learn.)[more...]

2009.05.12 [, ] - comments

 

past articles »

 

Kontagent

Kontagent is hiring software engineers

BBC News

 

Gulf states fuel Syria isolation (07 Feb 2012)

 

New Maldives leader pledges order (07 Feb 2012)

 

Iranian president summoned by MPs (07 Feb 2012)

 

LA abuse school removes all staff (07 Feb 2012)

 

Uganda MP revives anti-gay bill (07 Feb 2012)

 

California court backs gay unions (07 Feb 2012)

 

Monaco royals in privacy defeat (07 Feb 2012)

 

Royals celebrate Dickens' legacy (07 Feb 2012)

 

Euro 'could survive Greece exit' (07 Feb 2012)

 

Obama to give back campaign funds (07 Feb 2012)

more »

 

Slashdot News for nerds, stuff that matters

 

Fracture Putty Can Heal a Broken Bone In Days (2012-02-07T21:32:00+00:00)

 

Apple Could Lose .6 Billion In iPad Lawsuit (2012-02-07T20:51:00+00:00)

 

Google Releases Chrome For Android Beta (2012-02-07T20:09:00+00:00)

 

The 20th IOCCC Winners Announced (2012-02-07T19:46:00+00:00)

 

Programming Error Doomed Russian Mars Probe (2012-02-07T19:25:00+00:00)

 

Honeywell Vs Nest: When the Establishment Sues Silicon Valley (2012-02-07T18:42:00+00:00)

 

DARPA Investing In Electric Brain Stimulation To Train Snipers Quickly (2012-02-07T18:19:00+00:00)

 

Delayed Outrage Over A Censored Site; What's a Better Way To Spread News? (2012-02-07T18:00:00+00:00)

more »

 

TechPsychic Tech Rumors and Invented News

 

TechPsychic: AT&T: more money, says it's disruptive in funding from. (08 May 2010)

 

TechPsychic: I know that Apple is close to Apple Dominates, Hires ex-Googler - Yes, Android phones. (08 May 2010)

 

TechPsychic: AT&T says: Facebook Connect. (08 May 2010)

 

TechPsychic: Google's Nexus One of Google Chrome Release Adds Support subscriptions accounted for Amazon: Apple. (08 May 2010)

 

TechPsychic: Another stat: Twitter's Design of this is giving rise of BlackBerry Foursquare Map App store end. (07 May 2010)

 

TechPsychic: Like educational sales Up around Apple iPad makes money Plan costs half an Apple. (07 May 2010)

 

TechPsychic: Instead added extensions, social Networks than double, everyone jumps in Silicon Valley? (07 May 2010)

 

TechPsychic: So why iTunes App lets Social Networks Verizon Wireless Internet. (07 May 2010)

more »

 

SF Gate

 

APNewsBreak: Komen exec quits after funding flap (2012-02-07T19:44:17PST)

 

Why S.F. still counts on street fire alarm boxes (2012-02-07T19:44:17PST)

 

A U.S. appeals court rules Prop. 8 unconstitutional (2012-02-07T19:44:17PST)

 

Gingrich hits Romney, Obama, on Catholic rights (2012-02-07T19:44:17PST)

 

Job openings jump to near a 3-year high (2012-02-07T19:44:17PST)

 

Apple TV product may be imminent, analyst says (2012-02-07T19:44:17PST)

 

Russia Seeks to Prod Syria's Assad After Blocking UN Vote (2012-02-07T19:44:17PST)

 

Man stabbed to death in S.F. homeless shelter (2012-02-07T19:44:17PST)

 

Job openings jump to near a 3-year high (2012-02-07T19:42:25PST)

 

Stocks turn higher on strong job openings (2012-02-07T19:42:25PST)

 

Oracle rejects SAP award, demands new trial (2012-02-07T19:42:25PST)

 

Catholic hospitals split on giving contraceptives (2012-02-07T19:42:25PST)

 

Time for taxes means time for questions (2012-02-07T19:42:25PST)

 

BofA Investor Suit Over Merrill Deal Granted Class Status (2012-02-07T19:42:25PST)

more »

 

Asia Times Online

 

Pakistan snubs US over Osama informer (7 Feb 2012)

 

Obama switches play on war with Iran (7 Feb 2012)

 

Gulf crisis ripples across the globe (7 Feb 2012)

 

US weighs options as Syrian violence rises (7 Feb 2012)

 

Kidnaps highlight urgent task for China (7 Feb 2012)

 

Beijing finds vulnerable ally in Berlin (7 Feb 2012)

 

Kicking down the world's door (7 Feb 2012)

 

Chinese give boost to illegal ivory trade (7 Feb 2012)

 

Confidence in Nabucco fades (7 Feb 2012)

 

Obama may win on pessimism (7 Feb 2012)

 

THE BEAR'S LAIR : The government bubble (7 Feb 2012)

more »

 


Site feed Updated: 2012-Feb-07 14:00