tungwaiyip.info


The Power of Gzip

I know that Information Theory says good data compression will shrink a message down to its entropy. So for application developers, it is not productive to design our own space-saving encoding scheme if we plan to apply data compression at the end anyway. Because the original message and the encoded message contain the same amount of information, the compressed data will end up approximately the same size.

I didn't realize how true this was until I actually tried it. I am working with a CSV file of mostly integer data, and I am very keen on reducing its size to save storage and network bandwidth. So I tried several schemes. They all failed to make a significant saving once gzipped.

The first attempt is on the minus sign. I notice there are a lot of negative numbers. The '-' sign occupies one byte, but it carries only one bit of information. What if I apply a simple encoding that folds the sign into the leading digit, e.g. using 'A' to stand for '-1', 'B' to stand for '-2' and so on? Trimming the negative sign with this encoding cuts down the storage by 6%.

  e.g.
    "108,-2,-10"  ->  "108,B,A0"
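
The mapping above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the choice of letters 'A' to 'I' for leading digits 1 to 9 are assumptions extrapolated from the example, not the exact code used for the experiment.

```python
# Hypothetical helpers: fold the '-' sign and the leading digit into one letter.
# 'A' stands for '-1', 'B' for '-2', ..., 'I' for '-9' (assumed from the example).
NEG_LETTERS = "ABCDEFGHI"

def encode_field(field):
    """Replace a leading '-d' (d = 1..9) with a single letter."""
    if len(field) > 1 and field[0] == "-" and field[1] in "123456789":
        return NEG_LETTERS[int(field[1]) - 1] + field[2:]
    return field  # non-negative numbers (and anything unexpected) pass through

def decode_field(field):
    """Inverse of encode_field."""
    if field and field[0] in NEG_LETTERS:
        return "-" + str(NEG_LETTERS.index(field[0]) + 1) + field[1:]
    return field

def encode_line(line):
    return ",".join(encode_field(f) for f in line.split(","))

def decode_line(line):
    return ",".join(decode_field(f) for f in line.split(","))

print(encode_line("108,-2,-10"))  # prints 108,B,A0
```

Each negative number loses exactly one character, which is where the saving on the uncompressed file comes from.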

What about the result after gzipping? Gzip shrinks the original data down to 34% of its size. The encoded message compresses to 36% of its own, already smaller, size. Measured against the original, the difference between the two gzipped files is a negligible 0.1%.

For the next attempt, it seems wasteful to store an integer as a string in which each character can take only 10 values, the decimal digits. What if we use the hexadecimal representation? The conversion is trivial and it should cut down the string length a bit. If this is fruitful we may even try a higher base. Using the hexadecimal scheme, we reduce the storage by 7%. But once gzipped, the saving again evaporates.
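
Here is a rough sketch of how such a comparison could be run. The helper names are made up for illustration, and the sample rows are random numbers rather than my actual data, so the sizes it prints will not match the percentages quoted here. Random numbers are also much harder to compress than real-world data.

```python
import gzip
import random

def hex_line(line):
    """Convert every decimal field of a CSV line to hexadecimal text."""
    return ",".join(format(int(f), "x") for f in line.split(","))

def dec_line(line):
    """Inverse of hex_line: convert hexadecimal fields back to decimal."""
    return ",".join(str(int(f, 16)) for f in line.split(","))

def gzip_size(text):
    """Number of bytes after gzipping the ASCII text."""
    return len(gzip.compress(text.encode("ascii")))

# Synthetic stand-in for the real CSV file (an assumption for this demo).
random.seed(0)
rows = [",".join(str(random.randint(-100000, 100000)) for _ in range(10))
        for _ in range(1000)]
dec_text = "\n".join(rows)
hex_text = "\n".join(hex_line(r) for r in rows)

print("raw bytes:     decimal", len(dec_text), "hex", len(hex_text))
print("gzipped bytes: decimal", gzip_size(dec_text), "hex", gzip_size(hex_text))
```

Replacing `rows` with lines read from a real file is the interesting experiment: compare how far apart the two raw sizes are with how far apart the two gzipped sizes are.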

A far more lucrative approach is to abandon the text format altogether and use a binary encoding for the numbers. Since the magnitude of the numbers varies a lot, I use a kind of variable-length integer encoding to make it economical for both small and large numbers. The binary encoding delivers the most significant saving, cutting down the storage by 44%. The text data and the binary encoded data look very different, not to mention that the binary data is nearly half the size of the original. But once gzipped, the binary data is only 4% smaller. Despite the big difference in representation, the compressed size is still determined by the entropy. The 4% gain is hardly enough to justify using a binary format over text.
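
To give an idea of what a variable-length integer encoding looks like, below is a minimal sketch of one common scheme: a zigzag mapping for the sign followed by 7 bits of payload per byte, with the high bit marking "more bytes follow". This is an assumed stand-in for illustration and not necessarily the exact encoding used in the experiment. All function names are hypothetical.

```python
def zigzag(n):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(z):
    """Inverse of zigzag."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2

def encode_varint(n):
    """Encode one signed integer as 1 or more bytes, 7 payload bits per byte."""
    z = zigzag(n)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte of this number
            return bytes(out)

def encode_all(numbers):
    return b"".join(encode_varint(n) for n in numbers)

def decode_all(data):
    """Decode a byte string produced by encode_all back into a list of ints."""
    values, z, shift = [], 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(unzigzag(z))
            z, shift = 0, 0
    return values

packed = encode_all([108, -2, -10])
print(len("108,-2,-10"), "text bytes vs", len(packed), "binary bytes")
```

Numbers with a magnitude below 64 fit in one byte, and no separator is needed between values, which is why the uncompressed binary file is so much smaller than the text.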

The lesson learned? Don't be too concerned about the efficiency of storing numbers in a text format like CSV. Data compression will take out the inefficiency in one easy step.

Finally, I would like to mention some encodings that did work for me. The data is initially available in XML format. Dropping the XML baggage and storing it in CSV format saves a lot. Secondly, storing only the delta of the numbers works very well in my application. Furthermore, slightly reducing the precision of the numbers, a sort of lossy compression, also delivers a meaningful saving. More importantly, these savings are still present after compression.
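
Delta encoding is simple enough to show in a few lines. The sketch below assumes the numbers form a sequence in which neighbouring values are close to each other. The function names and the sample values are made up for illustration.

```python
from itertools import accumulate

def delta_encode(values):
    """Keep the first value, then store only the difference to the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original sequence with a running sum."""
    return list(accumulate(deltas))

series = [1000, 1002, 1001, 1005]
print(delta_encode(series))  # prints [1000, 2, -1, 4]
```

Unlike the schemes that failed, this one changes what the compressor sees: when neighbouring values are close, the deltas are small numbers drawn from a narrow range that repeat often, which is much easier for gzip to squeeze than the original values.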

2010.04.19

 

 
