bzip, bunzip - a block-sorting file compressor, v0.21
bzip [ -cdfkvVL123456789 ] [ filenames ... ]
bunzip [ -kvVL ] [ filenames ... ]
Bzip compresses files using the Burrows-Wheeler-Fenwick block-sorting
text compression algorithm. Compression is generally considerably better
than that achieved by more conventional LZ77/LZ78-based compressors, and
competitive with all but the best of the PPM family of statistical
compressors.
The command-line options are deliberately very similar to those of GNU
Gzip, but they are not identical.
Bzip expects a list of file names to follow the command-line flags.
Each file is replaced by a compressed version of itself, with the name
"original_name.bz". Each compressed file has the same modification
date and permissions as the corresponding original, so that these prop‐
erties can be correctly restored at decompression time. File name han‐
dling is naive in the sense that there is no mechanism for preserving
original file names, permissions and dates in filesystems which lack
these concepts, or have serious file name length restrictions.
Bzip and bunzip will not overwrite existing files; if you want this to
happen, you should delete them first.
If no file names are specified, bzip compresses from standard input to
standard output. In this case, bzip will decline to write compressed
output to a terminal, as this would be entirely incomprehensible.
Bunzip (or bzip -d ) decompresses and restores all specified files
whose names end in ".bz". Files without this suffix are ignored.
Again, supplying no filenames causes decompression from standard input
to standard output.
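The naming rule described above can be sketched as a pair of helpers. This is a minimal illustration in Python, not bzip's own code: the function names are hypothetical, and only the ".bz" suffix rule and the "files without this suffix are ignored" behaviour come from the text.

```python
SUFFIX = ".bz"  # suffix given in the text for compressed files


def compressed_name(original):
    """Name a file would receive after compression: original_name.bz."""
    return original + SUFFIX


def restored_name(compressed):
    """Name restored on decompression, or None if the file would be ignored."""
    if compressed.endswith(SUFFIX):
        return compressed[: -len(SUFFIX)]
    return None  # no ".bz" suffix, so the file is skipped


print(compressed_name("notes.txt"))
print(restored_name("notes.txt.bz"))
print(restored_name("notes.txt"))
```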
You can also compress or decompress exactly one named file to the stan‐
dard output by giving the -c flag.
Compression is always performed, even if the compressed file is
slightly larger than the original. The worst case expansion is for
files of zero length, which expand to seventeen bytes. Random data
(including the output of most file compressors) is coded at about 8.1
bits per byte, giving an expansion of around 1%.
As a self-check for your protection, bzip uses 32-bit CRCs to make sure
that the decompressed version of a file is identical to the original.
This guards against corruption of the compressed data, and against
undetected bugs in bzip (hopefully very unlikely). The chance of data
corruption going undetected is microscopic, about one in four billion
for each file processed. Be aware, though, that the check occurs upon
decompression, so it can only tell you that something is wrong. It
can't help you recover the original uncompressed data.
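The idea of the self-check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: zlib.crc32 stands in for whatever 32-bit CRC bzip actually uses (the polynomial and bit order may differ), and the helper names are hypothetical.

```python
import zlib


def stored_crc(original: bytes) -> int:
    # Stand-in 32-bit CRC; bzip's own CRC parameters are not given in the text.
    return zlib.crc32(original) & 0xFFFFFFFF


def check(decompressed: bytes, crc: int) -> bool:
    # The check runs after decompression: it can report a mismatch,
    # but it cannot reconstruct the original data.
    return stored_crc(decompressed) == crc


original = b"some file contents"
crc = stored_crc(original)          # recorded at compression time
damaged = b"some file c0ntents"     # one byte differs

print(check(original, crc))
print(check(damaged, crc))

# A 32-bit check value has 2**32 possible values, which is where the
# "one chance in four billion" figure comes from.
MISS_PROBABILITY = 1 / 2**32
```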
Return values: 1 for an abnormal exit, otherwise 0.
MEMORY MANAGEMENT
Bzip compresses large files in blocks. The block size affects both the
compression ratio achieved and the amount of memory needed for
compression and decompression. The flags -1 through -9 specify the
block size to be 100,000 bytes through 900,000 bytes (the default)
respectively. At decompression-time, the block size used for compres‐
sion is read from the header of the compressed file, and bunzip then
allocates itself just enough memory to decompress the file. Since
block sizes are stored in compressed files, it follows that the flags
-1 to -9 are irrelevant to and so ignored during decompression. Com‐
pression and decompression requirements, in bytes, can be estimated as:
       Compression:   300k + ( 8 x block size )
       Decompression: 6 x block size
The 300k constant is for a frequency-count table, used in the sorting
phase of compression.
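The two estimates above are easy to evaluate for each flag. The following Python sketch assumes that "k" means 1000 bytes, which matches the 100,000-byte block steps; the function name is hypothetical.

```python
BLOCK_UNIT = 100_000  # bytes per step of the -1 .. -9 flags


def estimate_memory(flag):
    """Return (compression, decompression) memory estimates in bytes."""
    if not 1 <= flag <= 9:
        raise ValueError("flag must be between 1 and 9")
    block_size = flag * BLOCK_UNIT
    compress = 300_000 + 8 * block_size   # 300k + ( 8 x block size )
    decompress = 6 * block_size           # 6 x block size
    return compress, decompress


for flag in range(1, 10):
    c, d = estimate_memory(flag)
    print(f"-{flag}: compress {c // 1000}k, decompress {d // 1000}k")
```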
Larger block sizes give rapidly diminishing marginal returns; most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using bzip on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression-time by the choice of block size.
So, for example, if you are compressing files which you think might
possibly be decompressed on a 4-megabyte machine, you might want to
select a block size of 200k or 300k, so the decompressor will draw 1200
kbytes or 1800 kbytes respectively, which is probably the limit of
what's comfortable on a 4-meg machine. In general, though, you should
try to use the largest block size memory constraints allow. Compression
and decompression speed is virtually unaffected by block size.
Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size. The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block. For example, compressing a
file 20,000 bytes long with the flag -9 will cause the compressor to
allocate [by the formula, in practice a little more] 7500k of memory,
but only touch 300k + 20000 * 8 = 460 kbytes of it. Similarly, the
decompressor will allocate 5400k but only touch 20000 * 6 = 120 kbytes.
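The arithmetic in this example can be checked with a short Python sketch. It assumes, as the paragraph implies, that the memory touched depends on the smaller of the file size and the block size; the function name is hypothetical and "k" is taken as 1000 bytes.

```python
def memory_touched(file_size, flag):
    """Return (compression, decompression) bytes actually touched."""
    block_size = flag * 100_000
    used = min(file_size, block_size)  # a small file fills only part of a block
    return 300_000 + 8 * used, 6 * used


# The 20,000-byte file compressed with -9 from the text:
print(memory_touched(20_000, 9))
```

For a file at least as large as the block, the result equals the full allocation given by the formulas.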
Here is a table which summarises the maximum memory usage for different
block sizes. Also recorded is the total compressed size for 14 files
of the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes
for larger files, since the Corpus is dominated by smaller files.
            Compress   Decompress   Corpus
    Flag     usage       usage       Size

     -1      1100k        600k      905958
     -2      1900k       1200k      870646
     -3      2700k       1800k      853650
     -4      3500k       2400k      840140
     -5      4300k       3000k      838355
     -6      5100k       3600k      831695
     -7      5900k       4200k      827104
     -8      6700k       4800k      821652
     -9      7500k       5400k      821652
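To put the Corpus column in more familiar terms, the sizes can be converted to bits per input byte. This Python sketch uses only the corpus total and the compressed sizes listed in the table; the names are hypothetical.

```python
CORPUS_BYTES = 3_141_622  # total size of the 14 corpus files

# Compressed corpus size for each flag, taken from the table.
COMPRESSED = {
    1: 905958, 2: 870646, 3: 853650,
    4: 840140, 5: 838355, 6: 831695,
    7: 827104, 8: 821652, 9: 821652,
}


def bits_per_byte(compressed_size):
    """Average number of output bits per input byte."""
    return 8 * compressed_size / CORPUS_BYTES


for flag, size in COMPRESSED.items():
    print(f"-{flag}: {bits_per_byte(size):.3f} bits per byte")
```

The step from -1 to -3 saves far more than the step from -7 to -9, which is the diminishing return described earlier.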
OPTIONS
-c Compress or decompress to standard output. -c requires you to
supply exactly one file name, and this file is compressed or
decompressed to standard output.
-d Force decompression. Bzip and bunzip are really the same pro‐
gram, and the decision about whether to compress or decompress
is done on the basis of which name is used. This flag overrides
that mechanism, and forces bzip to decompress.
-f The complement to -d: forces compression, regardless of the
invocation name.
-k Keep (don't delete) input files during compression or decompression.
-v Verbose mode -- show the compression ratio for each file processed.
-V Be very verbose. This spews out lots of information during com‐
pression which is primarily of interest for debugging purposes.
-L Display the software license terms and conditions.
-1 to -9
Set the block size to 100 k, 200 k .. 900 k when compressing.
Has no effect when decompressing. See MEMORY MANAGEMENT above.
The sorting phase of compression gathers together similar strings in
the file. Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..." (repeated several hundred times) may
compress extraordinarily slowly. You can use the -V option to monitor
progress in great detail, if you want. Decompression speed is unaf‐
fected. Such pathological cases seem rare in practice.
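To see how a sorting phase gathers similar strings together, here is a textbook block-sorting transform in Python. It is a deliberately naive sketch, not bzip's sorting code: it sorts every rotation of the input outright, and the function names are hypothetical. It also hints at why repetitive input is costly, since rotations of such input share long prefixes and each comparison has to scan further before it finds a difference.

```python
def bwt(s):
    """Naive block-sorting transform: sort all rotations, keep the last column.

    Returns the last column and the row index of the original string.
    Expects a non-empty string.
    """
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(row[-1] for row in rotations)
    return last_column, rotations.index(s)


def inverse_bwt(last_column, index):
    """Rebuild the sorted rotation table column by column, then pick the row."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    return table[index]


transformed, index = bwt("banana")
print(transformed, index)
print(inverse_bwt(transformed, index))
```

Equal symbols end up next to each other in the output, which is what makes the later coding stages effective.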
Incompressible or virtually-incompressible data may decompress rather
more slowly than one would hope. This is due to a naive implementation
of the move-to-front coder, and of the frequency tables for the
arithmetic coder.
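For readers unfamiliar with move-to-front coding, a straightforward version looks like the Python sketch below. It is a textbook illustration, not bzip's implementation, and the function names are hypothetical. The linear search and list shuffle on every symbol show why a naive coder gets slow when symbols are rarely near the front, as happens with incompressible data.

```python
def mtf_encode(data: bytes) -> list:
    """Replace each byte by its current position in a recency list."""
    table = list(range(256))
    out = []
    for byte in data:
        index = table.index(byte)          # linear search: the naive part
        out.append(index)
        table.insert(0, table.pop(index))  # move the symbol to the front
    return out


def mtf_decode(indices) -> bytes:
    """Invert mtf_encode by replaying the same list updates."""
    table = list(range(256))
    out = bytearray()
    for index in indices:
        byte = table.pop(index)
        out.append(byte)
        table.insert(0, byte)
    return bytes(out)


codes = mtf_encode(b"nnbaaa")
print(codes)
print(mtf_decode(codes))
```

Runs of equal symbols turn into runs of zeros, which a later entropy coder can represent cheaply.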
Decompression on Sun Sparc 1's (and other low-range Sparcs) can be
slow, because of the lack of hardware implementations of integer multi‐
ply and divide in the SPARC v7 instruction set. The situation is much
exacerbated if bzip is compiled for a full SPARC v8 instruction set,
since this causes the machine to trap on each multiply and divide
instruction. These traps take control to the relevant software emula‐
tion of the offending instruction, but it is much quicker for the com‐
piler simply to plant a call to the emulation routine. Moral: be care‐
ful how you compile bzip for a Sparc. If you use GNU C, investigate
the effects of the -msupersparc and -mcypress flags.
Wildcard expansion for Windows 95 and NT loses leading directory infor‐
mation. For example, the pathspec "sources\*.c" is searched correctly
for matching files, but the "sources\" bit is ignored when the files
come to be processed, which means bzip won't be able to find any of
them. This is easy to fix; perhaps some enterprising soul will send me
a patch.
I/O error messages are not as helpful as they could be. Bzip tries
hard to detect I/O errors and exit cleanly, but the details of what the
problem is sometimes seem rather misleading.
There is no -t option to test the integrity of a compressed file. How‐
ever, Unix folks can do the following:
       bzip -dcV file.bz > /dev/null
which causes bzip to do a trial decompression of file.bz, throwing away
the result. You'll be shown the computed and stored CRCs. If these
are identical, the file is almost certainly OK -- see the discussion
above on CRCs for a definition of "almost certainly". If they're not,
bzip will complain loudly. Note that file.bz is left unchanged regard‐
less of the outcome. Win95/NT folks can do the same, but /dev/null
will have to be replaced with something suitable, perhaps NUL.
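The same trial decompression can be scripted portably. The Python sketch below is an illustration with hypothetical function names: it uses os.devnull so that the right null device is chosen on both Unix and Windows, and it assumes bzip is on the search path and writes its CRC report somewhere other than standard output. Whether a CRC mismatch produces a non-zero exit status is not stated in the text, so the return code is handed back without interpretation.

```python
import os
import subprocess


def trial_command(path, program="bzip"):
    """Command line for a trial decompression, using the flags from the text."""
    return [program, "-dcV", path]


def trial_decompress(path, program="bzip"):
    """Run the trial decompression, discarding the decompressed data."""
    with open(os.devnull, "wb") as sink:  # /dev/null on Unix, nul on Windows
        completed = subprocess.run(trial_command(path, program), stdout=sink)
    return completed.returncode


print(trial_command("file.bz"))
```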
This manual page pertains to version 0.21 of bzip. It may well happen
that some future version will use a different compressed file format.
If you try to decompress, using 0.21, a .bz file created with some
future version which uses a different compressed file format, 0.21 will
complain that your file "is not a BZIP file". If that happens, you
should obtain a more recent version of bzip and use that to decompress
the file.
Julian Seward, firstname.lastname@example.org.
The ideas embodied in bzip are due to (at least) the following people:
Michael Burrows and David Wheeler (for the block sorting transforma‐
tion), Peter Fenwick (for the structured coding model, and many refine‐
ments), and Alistair Moffat, Radford Neal and Ian Witten (for the
arithmetic coder). I am much indebted for their help, support and
advice. See the file ALGORITHMS in the source distribution for point‐
ers to sources of documentation. Christian von Roques encouraged me to
look for faster sorting algorithms, so as to speed up compression.
Many people sent patches, helped with portability problems, lent
machines, gave advice and were generally helpful.