dirfile-encoding(5) DATA FORMATS dirfile-encoding(5)
NAME
dirfile-encoding - dirfile database encoding schemes
DESCRIPTION
The Dirfile Standards indicate that RAW fields defined in the database
are accompanied by binary files containing the field data in the speci‐
fied simple data type. In certain situations, it may be advantageous
to convert the binary files in the database into a more convenient
form. This is accomplished by encoding the binary file into the alter‐
nate form. A common use-case for encoding a binary file is to compress
it to save disk space. Only data is modified by an encoding scheme.
Database metadata is never encoded.
Support for encoding schemes is optional. An implementation need not
support any particular encoding scheme, or may only support certain
operations with it, but should expect to encounter unknown encoding
schemes and fail gracefully in such situations.
Additionally, how a particular encoding is implemented is not specified
by the Dirfile Standards, but, for purposes of interoperability, all
dirfile implementations are encouraged to support the encoding imple‐
mentation used by the GetData dirfile reference implementation, elabo‐
rated below.
An encoding scheme is local to the particular format specification
fragment in which it is indicated. This allows a single dirfile to
have binary files which are stored using multiple encodings, by having
them defined in multiple fragments.
The rest of this manual page discusses specifics of the encoding frame‐
work implemented in the GetData library, and does not constitute part
of the Dirfile Standards.
THE GETDATA ENCODING FRAMEWORK
The GetData library provides an encoding framework which abstracts
binary file I/O, allowing for generic support for a wide variety of
encoding schemes. Functions which may make use of the encoding frame‐
work are:
gd_add(3), gd_add_raw(3), gd_add_spec(3), gd_alter_encoding(3),
gd_alter_endianness(3), gd_alter_frameoffset(3),
gd_alter_entry(3), gd_alter_raw(3), gd_alter_spec(3), gd_get‐
data(3), gd_move(3), gd_nframes(3), gd_putdata(3), and
gd_rename(3).
Most of the encodings supported by GetData are implemented through
external libraries which handle the actual file I/O and data transla‐
tion. All such libraries are optional; a build of the library which
omits an external library will lack support for the associated encoding
scheme. In this case, GetData will still properly identify the encod‐
ing scheme, but attempts to use GetData for file I/O via the encoding
will fail with the GD_E_UNSUPPORTED error code.
GetData discovers the encoding scheme of a particular RAW field by
noting the filename extension of files associated with the field.
Binary files which form an unencoded dirfile have no file extension.
The file extensions used by the other encodings are noted below.
Encoding discovery proceeds by searching for files with the known list
of file extensions (in an unspecified order) and stopping when the
first successful match is made. Because of this, when a field has
multiple data files with different, supported file extensions which
could legitimately be associated with it, the encoding scheme
discovered by GetData is not well defined.
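The ambiguity described above can be illustrated with a short Python
sketch. It is not GetData's implementation: the function name
candidate_encodings and the extension table are hypothetical, built
only from the extensions listed in this page. GetData stops at the
first match; the sketch instead lists every match, so a field with more
than one candidate data file is easy to spot.

```python
import os
import tempfile

# Extension table built from this manual page; the names are hypothetical.
# ZZip and ZZSlim are omitted: their data live in a shared ZIP archive.
KNOWN_EXTENSIONS = {
    "": "unencoded",
    ".txt": "text",
    ".bz2": "bzip2",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".sie": "sie",
    ".slm": "slim",
}


def candidate_encodings(dirfile_path, field_name):
    """Return the sorted names of all encodings with a matching data file."""
    return sorted({
        encoding
        for ext, encoding in KNOWN_EXTENSIONS.items()
        if os.path.isfile(os.path.join(dirfile_path, field_name + ext))
    })


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as dirfile:
        for name in ("chan.gz", "chan.txt"):
            open(os.path.join(dirfile, name), "wb").close()
        # Two supported extensions match, so discovery would be ambiguous.
        print(candidate_encodings(dirfile, "chan"))
```

A result with more than one entry marks exactly the case in which the
discovered encoding is not well defined.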
In addition to raw (unencoded) data, GetData supports eight other
encoding schemes: text encoding, bzip2 encoding, gzip encoding, lzma
encoding, sie (sample-index encoding), slim encoding, zzip encoding,
and zzslim encoding, all discussed below.
The text encoding and the sample-index encoding are implemented by Get‐
Data natively and need no external library. As a result, they are
always present in the library.
BZip2 Encoding
The BZip2 Encoding reads compressed raw binary files using the Burrows-
Wheeler block sorting text compression algorithm and Huffman coding, as
implemented in the bzip2 format. GetData's BZip2 Encoding scheme is
implemented through the bzip2 compression library written by Julian
Seward. GetData's BZip2 Encoding framework currently lacks write capa‐
bilities; as a result the BZip2 Encoding does not support functions
which modify binary data.
GetData caches an uncompressed megabyte of data at a time to speed
access times. A call to gd_nframes(3) requires decompression of the
entire binary file to determine its uncompressed size, and may take
some time to complete. The file extension of the BZip2 Encoding is
.bz2.
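The cost of gd_nframes(3) on compressed data can be seen in the
following Python sketch, which uses the standard bz2 module rather than
GetData. The function names, the one megabyte chunk size, and the
parameters sample_size and samples_per_frame are illustrative
assumptions. The bzip2 format does not record the uncompressed length,
so the only way to learn it is to decompress the whole stream and count
bytes.

```python
import bz2
import io


def uncompressed_size(fileobj, chunk_size=1 << 20):
    """Count uncompressed bytes by decompressing the entire stream."""
    total = 0
    with bz2.BZ2File(fileobj) as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def count_frames(fileobj, sample_size, samples_per_frame):
    """Number of complete frames held by a bzip2-compressed data file."""
    return uncompressed_size(fileobj) // (sample_size * samples_per_frame)


if __name__ == "__main__":
    raw = bytes(2000)  # e.g. 1000 samples of a 16-bit type
    compressed = io.BytesIO(bz2.compress(raw))
    print(count_frames(compressed, sample_size=2, samples_per_frame=1))
```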
GZip Encoding
The GZip Encoding compresses raw binary files using Lempel-Ziv coding
(LZ77) as implemented in the gzip format. GetData's GZip Encoding
scheme is implemented through the zlib compression library written
by Jean-loup Gailly and Mark Adler. All operations are supported by the
GZip Encoding. Writes to GZip encoded data occur out-of-place; that
is: writing GZip Encoded data requires making a copy of the whole
binary data file. A side effect of this is that concurrently reading a
GZip Encoded Dirfile which is being written to usually doesn't work.
To speed the operation of gd_nframes(3), the GZip Encoding takes the
uncompressed size of the file from the gzip footer, which records that
size in bytes, modulo 2**32. As a result, using a field with an
(uncompressed) binary file size of 4 GiB or larger as the reference
field will result in the wrong number of frames being reported. The
file extension of the GZip Encoding is .gz.
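The footer lookup can be sketched in Python as follows. This is an
illustration using the standard gzip and struct modules, not GetData
code, and the function names are hypothetical. The gzip format stores
the uncompressed length, modulo 2**32, as a little-endian 32-bit
integer in the last four bytes of the file.

```python
import gzip
import struct


def gzip_footer_size(blob):
    """Return the size field from the last four bytes of a gzip file."""
    (isize,) = struct.unpack("<I", blob[-4:])
    return isize


def reported_size(true_size):
    """Size the footer would hold for a file of true_size bytes."""
    return true_size % 2**32


if __name__ == "__main__":
    blob = gzip.compress(bytes(12345))
    print(gzip_footer_size(blob))
    # A 5 GiB file wraps around and appears to hold only 1 GiB.
    print(reported_size(5 * 2**30) == 2**30)
```

Reading four bytes replaces a full decompression, which is why the
wrap-around at 2**32 bytes goes unnoticed.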
LZMA Encoding
The LZMA Encoding reads compressed raw binary files using the Lempel-
Ziv Markov Chain Algorithm (LZMA) as implemented in the xz container
format. GetData's LZMA Encoding scheme is implemented through the lzma
library, part of the XZ Utils suite written by Lasse Collin, Ville
Koskinen, and Igor Pavlov. GetData's LZMA Encoding framework currently
lacks write capabilities; as a result the LZMA Encoding does not sup‐
port functions which modify binary data.
As with the BZip2 Encoding, GetData caches an uncompressed megabyte of
data at a time to speed access times. A call to gd_nframes(3) requires
decompression of the entire binary file to determine its uncompressed
size, and may take some time to complete. The file extension of the
LZMA Encoding is .xz or .lzma.
Sample-Index Encoding
The Sample-Index Encoding (SIE) compresses raw binary data by replacing
runs of repeated data, similar to run-length encoding. SIE files con‐
tain binary records consisting of a 64-bit sample number followed by a
datum (the size and format of which is determined by the RAW field's
data type in the format metadata). The sample number indicates the
last sample of the field which has the specified value. The first sam‐
ple with the value is the sample immediately following the data in the
previous record, or sample number zero, for the first record. Sample
numbers are relative to any /FRAMEOFFSET specified in the Dirfile meta‐
data. All operations are supported by the Sample-Index Encoding. The
file extension of the Sample-Index Encoding is .sie.
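The record layout can be made concrete with a small Python sketch. It
is not GetData code and makes several assumptions that this page does
not state: records are little-endian and unpadded, the sample number is
unsigned, and the datum type is given as a struct format character
("d" for a 64-bit float, "h" for a 16-bit integer). The names
encode_sie and decode_sie are hypothetical, and /FRAMEOFFSET is
ignored.

```python
import struct


def encode_sie(samples, datum_code="d"):
    """Pack samples into (last sample number, value) records."""
    record = struct.Struct("<Q" + datum_code)
    out = bytearray()
    for i, value in enumerate(samples):
        # Emit a record only where a run of equal values ends.
        if i + 1 == len(samples) or samples[i + 1] != value:
            out += record.pack(i, value)
    return bytes(out)


def decode_sie(blob, datum_code="d"):
    """Expand (last sample number, value) records back into samples."""
    record = struct.Struct("<Q" + datum_code)
    samples = []
    for last_sample, value in record.iter_unpack(blob):
        # The run starts right after the previous record's last sample
        # (or at sample zero) and ends at last_sample, inclusive.
        samples.extend([value] * (last_sample + 1 - len(samples)))
    return samples


if __name__ == "__main__":
    data = [1.0, 1.0, 1.0, 2.5, 2.5, 7.0]
    blob = encode_sie(data)
    print(len(blob), decode_sie(blob) == data)
```

Six samples with three distinct runs need three records, so the saving
grows with the length of the runs.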
Slim Encoding
The Slim Encoding reads compressed raw binary files using the slimlib
compression library written by Joseph Fowler. The slimlib library was
developed at Princeton University to compress dirfile-like data. Get‐
Data's Slim Encoding framework currently lacks write capabilities; as a
result, the Slim Encoding does not support functions which modify binary
files. The file extension of the Slim Encoding is .slm.
Using the Slim Encoding with GetData may result in unexpected, but man‐
ageable, memory usage. See the gd_getdata(3) manual page for details.
Text Encoding
The Text Encoding replaces the binary data files with 7-bit ASCII files
containing a decimal text encoding of the data, one sample per line.
All operations are supported by the Text Encoding. The file extension
of the Text Encoding is .txt.
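A text-encoded file can be pictured with the following Python sketch.
The function names are hypothetical, and the formatting of
floating-point values is Python's own, which is an assumption: this
page specifies only decimal ASCII text with one sample per line.

```python
def encode_text(samples):
    """Render samples as ASCII decimal text, one sample per line."""
    return "".join(f"{sample}\n" for sample in samples).encode("ascii")


def decode_text(blob, convert=int):
    """Parse one sample per line, using convert for the field's type."""
    return [convert(line) for line in blob.decode("ascii").splitlines()]


if __name__ == "__main__":
    blob = encode_text([1, -2, 300])
    print(blob)
    print(decode_text(blob))
```

The convert argument stands in for the RAW field's data type, which the
text file itself does not record.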
ZZip Encoding
The ZZip Encoding reads compressed raw binary files using the DEFLATE
algorithm as implemented in the PKWARE ZIP archive container format.
GetData's ZZip Encoding scheme is implemented through the zzip library
written by Tomi Ollila and Guido Draheim. The ZZip Encoding framework
currently lacks write capabilities; as a result the ZZip Encoding does
not support functions which modify binary data.
Unlike most encoding schemes, the ZZip encoding merges all binary data
files defined in a given fragment into a single ZIP archive. The name
of this archive is raw.zip by default, but a different name may be
specified using the second parameter to the /ENCODING directive. For
example,
/ENCODING zzip archive
indicates that the ZIP archive is called archive.zip. The file exten‐
sion of the ZZip Encoding is .zip.
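The naming rule can be sketched in Python as follows. The function
zzip_archive_name is hypothetical and deliberately minimal: it splits
the directive on whitespace, ignores the quoting rules of the format
file, and accepts only the /ENCODING form shown in this page.

```python
def zzip_archive_name(directive_line):
    """Return the ZIP archive name implied by an /ENCODING directive."""
    tokens = directive_line.split()
    if len(tokens) < 2 or tokens[0] != "/ENCODING":
        raise ValueError("not an /ENCODING directive")
    if tokens[1] not in ("zzip", "zzslim"):
        raise ValueError("encoding does not use a ZIP archive")
    # The optional second parameter names the archive; raw is the default.
    base = tokens[2] if len(tokens) > 2 else "raw"
    return base + ".zip"


if __name__ == "__main__":
    print(zzip_archive_name("/ENCODING zzip archive"))
    print(zzip_archive_name("/ENCODING zzip"))
```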
ZZSlim Encoding
The ZZSlim Encoding is a convolution of the Slim Encoding and the ZZip
Encoding. To create ZZSlim Encoded files, first the raw data are com‐
pressed using the slim library, and then these slim-compressed files
are archived (and compressed again) into a ZIP archive. As with the
ZZip Encoding, the ZIP archive is raw.zip by default, but a different
name may be specified with the /ENCODING directive.
Notably, since the archives have the same name as ZZip Encoded data,
automatic encoding detection on ZZSlim Encoded data always fails: they
are incorrectly identified as simply ZZip Encoded. As a result, an
/ENCODING directive in the format file or else a GD_ZZSLIM_ENCODED flag
passed to gd_open(3) is required to read ZZSlim encoded data. The file
extension of the ZZSlim Encoding is .zip.
Using the ZZSlim Encoding with GetData may result in unexpected, but
manageable, memory usage. See the gd_getdata(3) manual page for
details.
AUTHOR
This manual page was written by D. V. Wiebe <dvw@ketiltrout.net>.
SEE ALSO
dirfile(5), dirfile-format(5), bzip2(1), gzip(1), xz(1), zlib(3).
Standards Version 9 26 January 2013 dirfile-encoding(5)