Python Compressed Files
The other day I was generating a large log file that would rarely be inspected, so I decided to compress it to save disk space. Since we are on a GNU/Linux system, gzip seemed a wise choice, and it turned out to be almost as easy as writing a regular file.
Python and gzip compression
The trick is to use import gzip and gzip.open when creating a file. I wanted to experiment with the efficiency of the compression, so I made four different files.
import gzip

# One compressed file per kind of content
ghandle0 = gzip.open('values-0000.txt.gz', 'w')
ghandle1 = gzip.open('values-1234.txt.gz', 'w')
ghandle2 = gzip.open('values-semi-rnd.txt.gz', 'w')
ghandle3 = gzip.open('values-rnd.txt.gz', 'w')
I want to compare this to a regular file where I'll write the same values:
fhandle = open('values-uncompressed.txt', 'w')
One file will get only zeros, one will get increasing numbers, one will get a mix of increasing and random numbers, and the last one will get only random numbers.
from random import randint

fmt = "%06d %06d\n"
for i in xrange(99999):
    m = randint(0, 999999)
    n = randint(0, 999999)
    msg = fmt % (m, n)
    fhandle.write(msg)              # uncompressed reference
    ghandle3.write(msg)             # random, random
    ghandle2.write(fmt % (i, m))    # increasing, random
    ghandle1.write(fmt % (i, i))    # increasing, increasing
    ghandle0.write(fmt % (0, 0))    # zeros only
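The code above is written for Python 2. On Python 3 two things change: xrange is now range, and gzip.open defaults to binary mode, so text has to be written through mode 'wt'. Here is a minimal sketch under that assumption; write_samples is a hypothetical helper name that does not appear in the original, and the with block is added so that every file gets closed properly:

```python
import gzip
import os
import tempfile
from random import randint

FMT = "%06d %06d\n"

def write_samples(directory, count=99999):
    """Write the four compressed files and the plain one into `directory`.

    Hypothetical helper, assuming Python 3.
    """
    def path(name):
        return os.path.join(directory, name)

    # 'wt' opens the gzip stream in text mode, so str can be written directly.
    with gzip.open(path('values-0000.txt.gz'), 'wt') as g0, \
         gzip.open(path('values-1234.txt.gz'), 'wt') as g1, \
         gzip.open(path('values-semi-rnd.txt.gz'), 'wt') as g2, \
         gzip.open(path('values-rnd.txt.gz'), 'wt') as g3, \
         open(path('values-uncompressed.txt'), 'w') as plain:
        for i in range(count):
            m = randint(0, 999999)
            n = randint(0, 999999)
            msg = FMT % (m, n)
            plain.write(msg)
            g3.write(msg)
            g2.write(FMT % (i, m))
            g1.write(FMT % (i, i))
            g0.write(FMT % (0, 0))

# Example call: write a small batch into a throwaway directory.
write_samples(tempfile.mkdtemp(), count=1000)
```

Leaving the with block closes each handle, which also flushes the last of the compressed data to disk.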
As expected, the files contain numbers (in text format!) of varying randomness:
$ zcat values-semi-rnd.txt.gz | head -n 4
000000 033446
000001 863643
000002 146831
000003 572757
$ zcat values-rnd.txt.gz | head -n 4
033446 325772
863643 493616
146831 223209
572757 428555
$ head -n 4 values-uncompressed.txt
033446 325772
863643 493616
146831 223209
572757 428555
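The same peek can be done from Python without zcat. A minimal sketch, assuming Python 3; head_gz is a hypothetical helper name, and the demo file is created only so the example runs on its own:

```python
import gzip
import os
import tempfile
from itertools import islice

def head_gz(path, n=4):
    """Return the first n lines of a gzip-compressed text file."""
    # 'rt' decompresses and decodes to str; islice stops reading after n lines.
    with gzip.open(path, 'rt') as handle:
        return [line.rstrip('\n') for line in islice(handle, n)]

# Small demonstration on a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt.gz')
with gzip.open(demo_path, 'wt') as handle:
    for i in range(10):
        handle.write("%06d %06d\n" % (i, i))

print(head_gz(demo_path))
```

Reading works just like writing: gzip.open returns a file-like object, so the usual iteration over lines applies.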
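All compressed files are smaller than the uncompressed one, but the compression ratio varies with how random the content is. The very high efficiency is related to properties of ASCII text (every line here uses only digits, a space and a newline), but the trend would be the same with most data.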
$ ls -1s *txt*
1376 values-uncompressed.txt   # 100.0% size
 684 values-rnd.txt.gz         #  49.7% size
 588 values-semi-rnd.txt.gz    #  42.7% size
 432 values-1234.txt.gz        #  31.4% size
  12 values-0000.txt.gz        #   0.9% size
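The trend can also be checked without touching the disk by compressing the same four kinds of lines in memory. A rough sketch, assuming Python 3; ratio is a hypothetical helper name. The percentages it prints will not match the listing exactly, because ls -s reports allocated disk blocks rather than bytes and the random values differ from run to run:

```python
import gzip
from random import randint, seed

FMT = "%06d %06d\n"
N = 99999

def ratio(lines):
    """Return compressed size divided by raw size for an iterable of lines."""
    raw = "".join(lines).encode("ascii")
    return len(gzip.compress(raw)) / len(raw)

seed(0)  # fixed seed only so that repeated runs are comparable
rnd = [randint(0, 999999) for _ in range(N)]

samples = {
    "zeros":    (FMT % (0, 0) for _ in range(N)),
    "counting": (FMT % (i, i) for i in range(N)),
    "semi-rnd": (FMT % (i, m) for i, m in enumerate(rnd)),
    "rnd":      (FMT % (m, randint(0, 999999)) for m in rnd),
}

results = {name: ratio(lines) for name, lines in samples.items()}
for name, value in results.items():
    print("%-9s %5.1f%%" % (name, 100 * value))
```

Measuring bytes in memory separates the compression ratio itself from filesystem effects such as the minimum block allocation that affects the smallest file in the listing.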
Read more in Doug Hellmann's Python Module of the Week on compression: [1]