tungwaiyip.info


Data Compression Comparison

This is a follow-up to my last post about data compression. After encoding my numerical data in a compact CSV format, I apply data compression before storing it on disk. I did a quick study of the two algorithms available in the standard Python library, gzip and bzip2. The results are shown below. The original message is 537,776 bytes.

Gzip compression results

  Level   Compressed size (bytes)   Compress time   Decompress time
  9       183,019                   179 ms          5.51 ms
  6       184,532                   125 ms          5.48 ms
  3       203,105                   38.2 ms         5.54 ms

Bzip2 compression results

  Level   Compressed size (bytes)   Compress time   Decompress time
  9       152,283                   84.3 ms         29 ms
  6       152,283                   84.9 ms         29 ms
  3       157,065                   80.6 ms         26.9 ms
  1       166,949                   79.8 ms         26.7 ms

Surprisingly, bzip2 compresses faster than gzip at level 9 (84.3 ms versus 179 ms). Unfortunately, compression speed is the least important factor for me. Compression ratio and decompression speed are far more important. Compression is done only once, but fetching and decompressing the data is going to be done many times. It is hard for me to choose between the better compression ratio of bzip2 (about 28% of the original size at level 9, versus 34% for gzip) and the faster decompression of gzip (about 5.5 ms versus 29 ms). For now I think I will stick with gzip.
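A minimal sketch of how a comparison like this could be reproduced with the standard `gzip` and `bz2` modules. The helper names (`benchmark`, `compare`) and the stand-in sample data are assumptions for illustration. This is not the script behind the numbers above, and timings will differ from machine to machine.

```python
import bz2
import gzip
import time


def benchmark(compress, decompress, data, repeat=5):
    """Return (compressed size, best compress seconds, best decompress seconds)."""
    packed = compress(data)
    assert decompress(packed) == data  # sanity check: the round trip is lossless

    def best_time(func, arg):
        # Take the fastest of several runs to reduce timing noise.
        timings = []
        for _ in range(repeat):
            start = time.perf_counter()
            func(arg)
            timings.append(time.perf_counter() - start)
        return min(timings)

    return len(packed), best_time(compress, data), best_time(decompress, packed)


def compare(data, levels=(9, 6, 3)):
    """Collect one (codec, level, size, compress s, decompress s) row per pair."""
    rows = []
    for level in levels:
        rows.append(("gzip", level) + benchmark(
            lambda d, lv=level: gzip.compress(d, compresslevel=lv),
            gzip.decompress, data))
        rows.append(("bz2", level) + benchmark(
            lambda d, lv=level: bz2.compress(d, lv),
            bz2.decompress, data))
    return rows


if __name__ == "__main__":
    # Stand-in data (an assumption): the original 537,776-byte message is not available.
    sample = ",".join(str((i * 37) % 1000 - 500) for i in range(20000)).encode("ascii")
    for codec, level, size, c_time, d_time in compare(sample):
        print(f"{codec} level {level}: {size:,} bytes, "
              f"compress {c_time * 1000:.2f} ms, decompress {d_time * 1000:.2f} ms")
```

Taking the best of several runs is a design choice to damp scheduling noise; an average would be just as defensible.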

2010.04.21

The Power of Gzip

I know that information theory says good data compression will shrink a message down to its entropy. So for application developers, it is not productive to design our own space-saving encoding scheme if we plan to apply data compression at the end anyway. Because the original message and the encoded message contain the same amount of information, the compressed data will end up at approximately the same size.

I did not realize how true this is until I actually tried it. I am working with a CSV file containing mostly integer data. I am very keen on reducing its size to save storage and network bandwidth. So I tried several schemes. All of them failed to yield a significant saving once gzipped.

The first attempt is on the minus sign. I noticed there are a lot of negative numbers. The '-' sign occupies one byte, but it carries only one bit of information. What if I apply a simple encoding, e.g. using 'A' to stand for a leading '-1', 'B' for a leading '-2', and so on, so that the sign is folded into the first digit? Trimming the negative sign with this encoding cuts the storage by 6%.

  e.g.
    "108,-2,-10"  ->  "108,B,A0"
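The sign-folding idea can be sketched as follows. This is a minimal illustration inferred from the example above. The function names and the choice to pass unexpected fields through unchanged are assumptions.

```python
NEGATIVE_LEAD = "ABCDEFGHI"  # 'A' stands for "-1", 'B' for "-2", ... 'I' for "-9"


def encode_field(field):
    """Fold the minus sign of a negative integer into its first digit."""
    if len(field) >= 2 and field[0] == "-" and field[1] in "123456789":
        return NEGATIVE_LEAD[int(field[1]) - 1] + field[2:]
    return field  # non-negative numbers (and anything unexpected) pass through


def decode_field(field):
    """Reverse encode_field: a leading letter becomes '-' plus its digit."""
    if field and field[0] in NEGATIVE_LEAD:
        return "-" + str(NEGATIVE_LEAD.index(field[0]) + 1) + field[1:]
    return field


def encode_line(line):
    return ",".join(encode_field(f) for f in line.split(","))


def decode_line(line):
    return ",".join(decode_field(f) for f in line.split(","))


print(encode_line("108,-2,-10"))  # 108,B,A0
```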

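What about the result after gzipping? Gzip shrinks the original data down to 34% of its size. For the encoded message, it is 36%. Each figure is relative to its own input: 36% of a message that is already 6% smaller is about 33.8% of the original size. The difference between the two? A negligible 0.1%.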

For the next attempt: it seems wasteful to store an integer as a string using only the 10 decimal digits per character. What if we use the hexadecimal representation? The conversion is trivial and it should cut down the string length a bit. If this is fruitful we may even try a higher base. Using the hexadecimal scheme, we reduce the storage by 7%. But once gzipped, the saving again evaporates.
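A minimal sketch of the hexadecimal conversion for one CSV line, assuming every field is an integer. The function names are made up for illustration, and the handling of negative numbers (a plain '-' prefix) is an assumption, since the post does not say how it was done.

```python
def to_hex_line(line):
    """Rewrite each decimal integer field in hexadecimal, e.g. 108 -> 6c."""
    return ",".join(format(int(field), "x") for field in line.split(","))


def from_hex_line(line):
    """Reverse to_hex_line; int() accepts a leading '-' with base 16."""
    return ",".join(str(int(field, 16)) for field in line.split(","))


print(to_hex_line("108,-2,-10"))  # 6c,-2,-a
```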

A far more lucrative approach is to abandon the text format altogether and use a binary encoding for the numbers. Since the magnitudes of the numbers differ a lot, I use a kind of variable-length integer encoding to make it economical for both small and large numbers. The binary encoding delivers the most significant saving, cutting the storage by 44%. The text data and the binary-encoded data look very different, not to mention that the binary version is nearly half the size of the original. But once gzipped, the binary data is only 4% smaller. Despite the big difference in representation, the compressed data is still proportional to the entropy. The 4% gain is hardly enough to justify using a binary format over text.
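The post does not say which variable-length scheme was used. As one possible illustration, here is a sketch of a common approach: zigzag mapping of signed integers followed by a base-128 varint, where small magnitudes take one byte and larger ones take more. All names here are assumptions, and this is not necessarily the encoding behind the 44% figure.

```python
def zigzag(n):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(u):
    """Reverse zigzag."""
    return u // 2 if u % 2 == 0 else -((u + 1) // 2)


def encode_varint(u):
    """Emit 7 bits per byte, low bits first; the high bit marks 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_numbers(numbers):
    return b"".join(encode_varint(zigzag(n)) for n in numbers)


def decode_numbers(data):
    numbers = []
    value = shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation bit set: keep accumulating
        else:
            numbers.append(unzigzag(value))
            value = shift = 0
    return numbers
```

With this scheme the three numbers in "108,-2,-10" (10 characters of text) take 4 bytes: two for 108 and one each for -2 and -10.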

The lesson learned? Don't be too concerned about the efficiency of storing numbers in a text format like CSV. Data compression will take out the inefficiency in one easy step.

Finally, I would like to mention some encodings that do work for me. The data is initially available in XML format. Dropping the XML baggage and storing it in CSV format saves a lot. Secondly, storing only the deltas of the numbers works very well in my application. Furthermore, slightly reducing the precision of the numbers, a sort of lossy compression, also delivers a meaningful saving. More importantly, these savings are still present after compression.
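A minimal sketch of delta encoding and precision reduction, assuming the deltas are taken between successive integers and that reducing precision means rounding to a coarser step. Both are guesses at the general idea, not the actual code used, and the function names are hypothetical.

```python
def delta_encode(numbers):
    """Keep the first value, then only the difference from the previous value."""
    previous = 0
    deltas = []
    for n in numbers:
        deltas.append(n - previous)
        previous = n
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    total = 0
    numbers = []
    for d in deltas:
        total += d
        numbers.append(total)
    return numbers


def reduce_precision(numbers, step):
    """Lossy: express each value as a count of `step` units, rounded to nearest."""
    return [round(n / step) for n in numbers]


print(delta_encode([1000, 1003, 1001, 1010]))  # [1000, 3, -2, 9]
```

Delta encoding is lossless and pays off when neighbouring values are close, since the deltas are then short and repetitive. Precision reduction discards information, which is why its saving survives compression.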

2010.04.19


Site feed Updated: 2012-May-21 15:00