Per Erik Strandberg

The other day I was creating a large log file that would rarely be looked at, so I decided to compress it to save space. Since we are on a GNU/Linux system, gzip seemed a wise choice - and it turned out to be almost as easy as writing a regular file.

Python and gzip compression

The trick is to import gzip and use gzip.open instead of open when creating a file. I wanted to experiment with the efficiency of the compression, so I made four different files.

import gzip
ghandle0 = gzip.open('values-0000.txt.gz', 'w')
ghandle1 = gzip.open('values-1234.txt.gz', 'w')
ghandle2 = gzip.open('values-semi-rnd.txt.gz', 'w')
ghandle3 = gzip.open('values-rnd.txt.gz', 'w')

I want to compare this to a regular file where I'll write the same values:

fhandle = open('values-uncompressed.txt', 'w')
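
For readers on Python 3, here is a minimal sketch of the same idea. It assumes text mode ('wt'), because gzip.open defaults to binary mode there and would reject plain strings. The helper write_lines and the demo file names are hypothetical and exist only for illustration.

```python
import gzip
import os
import tempfile


def write_lines(path, lines, compressed):
    """Write text lines to path, gzip-compressed or plain."""
    # gzip.open and open share the same interface, so only the opener differs.
    opener = gzip.open if compressed else open
    with opener(path, "wt") as handle:
        for line in lines:
            handle.write(line)


# Hypothetical demo data: the same 1000 lines written both ways.
demo_dir = tempfile.mkdtemp()
lines = ["%06d %06d\n" % (i, i) for i in range(1000)]
gz_path = os.path.join(demo_dir, "demo.txt.gz")
txt_path = os.path.join(demo_dir, "demo.txt")
write_lines(gz_path, lines, compressed=True)
write_lines(txt_path, lines, compressed=False)

# Sizes in bytes: plain file first, compressed file second.
print(os.path.getsize(txt_path), os.path.getsize(gz_path))
```

The with statement closes each file when the block ends, which matters for gzip because the last compressed data is only written on close.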

One file gets only zeros, one gets increasing numbers, one gets an increasing number followed by a random one, and the last one gets only random numbers. The uncompressed file gets the same random numbers as the last one.

from random import randint
fmt = "%06d %06d\n"

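for i in xrange(99999):
    m = randint(0, 999999)
    n = randint(0, 999999)
    msg = fmt % (m, n)
    ghandle3.write(msg)            # only randomness
    fhandle.write(msg)             # same random values, uncompressed
    ghandle2.write(fmt % (i, m))   # increasing number and randomness
    ghandle1.write(fmt % (i, i))   # increasing numbers
    ghandle0.write(fmt % (0, 0))   # only zeros

# close the handles so that everything is flushed to disk
for handle in (ghandle0, ghandle1, ghandle2, ghandle3, fhandle):
    handle.close()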

As expected, the files contain numbers (in text format!) of varying randomness:

$ zcat values-semi-rnd.txt.gz | head -n 4
000000 033446
000001 863643
000002 146831
000003 572757

$ zcat values-rnd.txt.gz | head -n 4
033446 325772
863643 493616
146831 223209
572757 428555

$ head -n 4 values-uncompressed.txt
033446 325772
863643 493616
146831 223209
572757 428555
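
The same spot check can be done from Python instead of zcat. This is a small Python 3 sketch; the helper head and the sample file are hypothetical, and the sample is created here only so that the snippet runs on its own.

```python
import gzip
import os
import tempfile
from itertools import islice


def head(path, n=4):
    """Return the first n lines of a gzip-compressed text file."""
    # 'rt' decompresses on the fly and decodes to str, one line at a time,
    # so only the beginning of the file is actually read.
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Hypothetical sample file with ten lines of increasing numbers.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(sample_path, "wt") as handle:
    for i in range(10):
        handle.write("%06d %06d\n" % (i, i))

for line in head(sample_path):
    print(line)
```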

The compressed files are all smaller, but how much smaller depends on how random the content is. The very high efficiency is related to properties of ASCII text (every line here uses only digits, a space and a newline) - but the trend would be the same with most data.

$ ls -1s *txt*
1376 values-uncompressed.txt  # 100.0% size
 684 values-rnd.txt.gz        #  49.7% size
 588 values-semi-rnd.txt.gz   #  42.7% size
 432 values-1234.txt.gz       #  31.4% size
  12 values-0000.txt.gz       #   0.9% size
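
The comparison can also be repeated in memory, without touching the disk. The following is a Python 3 sketch under a few assumptions: gzip.compress stands in for writing through gzip.open, the names build_samples, ratios and digit_lower_bound are made up for this example, and the exact percentages will differ a little from the listing above because ls -s counts allocated blocks rather than bytes. The last function estimates why even the purely random file shrinks: a 14-byte line holds only 12 decimal digits, which carry about 40 bits of information.

```python
import gzip
import math
from random import randint

FMT = "%06d %06d\n"
COUNT = 99999  # same number of lines as in the loop above


def build_samples(count=COUNT):
    """Build the four kinds of content as bytes, keyed by a short label."""
    zeros, increasing, semi, rnd = [], [], [], []
    for i in range(count):
        m = randint(0, 999999)
        n = randint(0, 999999)
        zeros.append(FMT % (0, 0))
        increasing.append(FMT % (i, i))
        semi.append(FMT % (i, m))
        rnd.append(FMT % (m, n))
    parts = {"zeros": zeros, "increasing": increasing,
             "semi-random": semi, "random": rnd}
    return {label: "".join(lines).encode("ascii")
            for label, lines in parts.items()}


def ratios(samples):
    """Compressed size divided by raw size for each sample."""
    return {label: len(gzip.compress(raw)) / len(raw)
            for label, raw in samples.items()}


def digit_lower_bound():
    """Fraction of a 14-byte line needed for 12 uniformly random digits."""
    bits = 12 * math.log2(10)  # information content of 12 decimal digits
    return bits / (14 * 8)


result = ratios(build_samples())
for label, ratio in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print("%-12s %5.1f%%" % (label, 100 * ratio))
print("bound for random digits: %5.1f%%" % (100 * digit_lower_bound()))
```

No lossless compressor can get below that bound on uniformly random digits, so the gap between the bound and the measured ratio for the random sample is the overhead of gzip itself on this kind of data.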

Read more in Doug Hellmann's Python Module of the Week on compression: [1]