home
intention
The intended purpose of RaXeJac
A. Project Summary
Project name: RaXeJac - Random accessed XML extreme-fast Java compressor
Languague: Java (1.4)
OS: Any OS supporting Java 1.4
Project size: Small, about 20 classes
Usage: As a library for using in other Java-written applications. Not as a stand-alone application.
B. Aims
- 1. Compression speed: Compressing >5 MB/s (on a 2 GHz Pentium)
- 2. Decompression speed: Decompression >80 MB/s (on a 2 GHz Pentium)
- 3. Compression rate: ~40% of original size on a normal looking XML-document
C. Problem statement
XML are used in a variety of programs. Both as a data-container and means for communication. RaXeJax aims to solve 3 problems:
- 1. Faster XML-based communication:
Much data communication today is done in XML (soap etc). However communication is fast, and no
compression algorithms can match >1 MB/s, so that compression can provide any increase in communication
speed. RaXeJac is intended to be a library that manages to compress more than 5 MB/s.
- 2. Compression and decompression to match disc-access speed:
XML data are often in many GB. When XML needs to be alternated, data is fetched from disc, altered,
and stored again. Disc access is fast, 10 MB/s, which beats most decompression algorithms.
LZ77 and similar context dependent compression algorithms reach that decompression speed, but ANY
alteration in data, forces a total re-compression of data, since compressed data is totally dependent
of its context, and no "local" change in the data stream can be made. Those algorithms are far below
10 MB/s in compression speed. RaXeJac can decompress in 80 MB/s, and make local changes in the
data-stream, so that only the local changes need to be re-compressed.
- 3. Match speed of in-memory access of XML-data:
Since data can be decompressed in 80 MB/s, it can match the speed of memory access. Hence, data could
as well be kept in compressed format as in plain format. And since data can be localy changed, data
manipulation in memory will almost be as fast as manipulation in plain format. Further more,
when searching for data, the search condition can be expressed in the compressed format, and quick
search algorithms such as BMH (Boyer Moore Horspool) can be applied directly to compressed data, and
in-memory search for data will in some cases be even faster than searches in plain format.
Compression rate is low, only compresses to 40% of original size. But the library will meet the special
need for extremely fast compression and decompression.
D. Intended usage
- 1. Faster access of XML documents.
- 2. Data communication.
For compression in data communication, e.g. soap, Web Services etc.
- 3. Precompression to slower compression algorithms.
Since other compression algorithms are slow, a fast precompression to 40% of original size,
might give total of doubled compression speed.
- 4. In-memory compression of XML data.
Without significant loss of access or modification speed.
E. Technical guidelines
- 1. Java
I intend to develop the project in Java, to enable it for use in other Java projects.
C would probably produce faster code, but the project can be translated to C later.
- 2. Speed
Speed is essential for the project. Therefor RaXeJax will follow a few guidelines:
- i) Compression and decompression results will be stored in buffers that may
be accessed. This to avoid coying into other buffers.
- ii) The normal way to access data is by streams. This will be possible,
but also the underlying data will be accessable
- 3. Both stream and random access
The compression will hava a static code-table. This table can be inserted into the stream,
in case of stream compression, or be kept separate, for enabling random access of data
- 4. General data compression design
The compression will be done by making assumption on the structure of xml.
This will not be coded into the program, but be kept as a definition.
Hence, RaXeJac might be suitable for fast compression of other repetive data where such assumption can be made.
F. References
home
intention