Assume I have a regular fixed width file that is sorted on one of the fields. Given that I know the length of the records, I can use lseek to implement a binary search to find records with fields that match a given value without having to read the entire file.
Now the difficulty is that the file is gzipped. Is it possible to do this without completely inflating the file? If not with gzip, is there any compression format that supports this kind of behavior?
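For concreteness, here is roughly what I mean for the uncompressed case (the record length, key width, and file name are made up for illustration):

```python
import os

RECORD_LEN = 32   # hypothetical fixed record length
KEY_LEN = 8       # sort key occupies the first 8 bytes of each record

def find_record(path, key):
    """Binary-search a sorted fixed-width file via seek, without reading it all."""
    with open(path, "rb") as f:
        n = os.fstat(f.fileno()).st_size // RECORD_LEN
        lo, hi = 0, n - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_LEN)      # the lseek step
            rec = f.read(RECORD_LEN)
            k = rec[:KEY_LEN]
            if k == key:
                return rec
            elif k < key:
                lo = mid + 1
            else:
                hi = mid - 1
    return None
```

Each probe reads one record, so the whole lookup touches O(log n) records.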
This is totally impossible with a file compressed with gzip/zip and derivatives. Those are based on a rolling dictionary window, typically with Huffman coding of the output codes on top of that. The bottom line is that a particular sequence of bytes in such a file is meaningless without the context that precedes it.
If you want to be able to randomly read a particular record out of a compressed file, you have to compress each record independently and then have an index into the file. Depending on your data, this would probably render the compression step worthless.
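As a sketch of what "compress each record independently plus an index" would look like (the record contents are made up; for short records the per-record zlib overhead often makes the output larger than the input, which is the point about compression becoming worthless):

```python
import zlib

def build_compressed_index(records):
    """Compress each record independently; return (blob, offsets),
    where offsets[i] is (start, length) of record i inside blob."""
    blob, offsets = bytearray(), []
    for rec in records:
        comp = zlib.compress(rec)
        offsets.append((len(blob), len(comp)))
        blob.extend(comp)
    return bytes(blob), offsets

def read_record(blob, offsets, i):
    """Random access: slice out one record's bytes and inflate only those."""
    start, length = offsets[i]
    return zlib.decompress(blob[start:start + length])
```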
The bzip2 file format consists of multiple independently-compressed blocks. If you're willing to maintain an index alongside of your bzip2 file, you could know where to lseek to.
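One caveat: the blocks inside a single bzip2 stream are bit-aligned, so an index into them needs bit offsets. A simpler variant you can produce yourself is to concatenate independently compressed bz2 streams (one per group of records) and record their byte offsets — a hypothetical sketch:

```python
import bz2

def write_indexed_bz2(path, groups):
    """Write each group of records as its own bz2 stream; return the
    byte offset where each stream starts."""
    offsets = []
    with open(path, "wb") as f:
        for group in groups:
            offsets.append(f.tell())
            f.write(bz2.compress(group))
    return offsets

def read_group(path, offsets, i):
    """lseek straight to stream i and decompress only that stream."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        end = offsets[i + 1] if i + 1 < len(offsets) else None
        data = f.read() if end is None else f.read(end - offsets[i])
    return bz2.decompress(data)
```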
Note: This is a duplicate of the questions:
- Compression formats with good support for random access within archives?
- Random access gzip stream
- Multi-part gzip file random access (in Java)
These answer the same question, but also identify BGZF as a gzip-compatible output format with sync points inserted to reset the compression state.
Pretty much all the compression algorithms I know of work sequentially, meaning a random seek isn't possible. Even LZMA, which doesn't use an initial dictionary, requires sequential decompression.
Stream compression usually means adaptive compression, possibly with sync points that reset the state (effectively cutting the stream into blocks). The details are more complex.
Now here are a couple of ideas to solve this:
- Create an index: Like when you open a ZIP you can see all files in it
- Cut your compressed file into blocks and then use a binary search within each block (similar actually to the first one)
- Decompress in memory but actually discard any data until you find the beginning of the data you're looking for.
The last way is good for small compressed files, and the block method is good for larger compressed files. You can mix the two.
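The decompress-and-discard idea can be sketched like this for a gzipped fixed-width file (the chunk size and record layout are arbitrary):

```python
import gzip

def record_at(path, index, record_len):
    """Read record `index` from a gzipped fixed-width file by streaming
    from the start and discarding everything before the target."""
    target = index * record_len
    with gzip.open(path, "rb") as f:
        while target > 0:
            skipped = f.read(min(target, 1 << 16))  # decompress and throw away
            if not skipped:
                return None                          # past end of file
            target -= len(skipped)
        return f.read(record_len)
```

Everything before the target still gets decompressed, but never accumulates in memory.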
PS: Fixed width in the input doesn't mean the compressed file will be fixed width, so that's pretty useless information here.
Building on what Wernight said, you can split your file into many fixed-size subfiles and gzip each one. Your binary search can start by finding the subfile that contains the target range; then it only needs to decompress that small subfile rather than the whole thing. You can optimize by creating an upper-level index file that contains the first row of each subfile.
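A rough sketch of the two levels (chunk size, key width, and file naming are all made up):

```python
import bisect, gzip, os

def split_and_compress(records, chunk_size, key_len, dirname):
    """Gzip fixed groups of sorted records into numbered subfiles and
    return the first key of each subfile as the top-level index."""
    first_keys = []
    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        first_keys.append(chunk[0][:key_len])
        name = os.path.join(dirname, "part%04d.gz" % (i // chunk_size))
        with gzip.open(name, "wb") as f:
            f.write(b"".join(chunk))
    return first_keys

def lookup(key, first_keys, dirname, record_len):
    """Pick the one subfile whose range can contain `key`, inflate only it."""
    part = bisect.bisect_right(first_keys, key) - 1
    if part < 0:
        return None
    with gzip.open(os.path.join(dirname, "part%04d.gz" % part), "rb") as f:
        data = f.read()
    for off in range(0, len(data), record_len):
        rec = data[off:off + record_len]
        if rec[:len(key)] == key:
            return rec
    return None
```

The scan inside the subfile could itself be a binary search, since the inflated subfile is fixed-width again.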
Continuing on what Liudvikas Bukys says: if your compressed blocks have a unique header, you don't need the index. That's similar to how seeking is done in some compressed video formats: you seek to a point and scan for the next header. This does need robust validation (e.g. a checksum), though, as mis-identification is possible.
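A minimal sketch of that resynchronisation scheme (the magic marker and header layout are invented for the example):

```python
import struct, zlib

MAGIC = b"\xf0\x9fBLK"   # made-up sync marker

def write_blocks(path, chunks):
    """Prefix each independently compressed block with magic, length, CRC."""
    with open(path, "wb") as f:
        for chunk in chunks:
            comp = zlib.compress(chunk)
            f.write(MAGIC + struct.pack("<II", len(comp), zlib.crc32(comp)))
            f.write(comp)

def next_block(path, pos):
    """Start at an arbitrary byte position, resynchronise on the next block
    header, and validate the CRC to reject false matches."""
    with open(path, "rb") as f:
        data = f.read()              # fine for a sketch; mmap for real files
    while True:
        pos = data.find(MAGIC, pos)
        if pos < 0:
            return None
        length, crc = struct.unpack_from("<II", data, pos + len(MAGIC))
        start = pos + len(MAGIC) + 8
        payload = data[start:start + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            return zlib.decompress(payload)
        pos += 1                     # false positive, keep scanning
```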
What you want is seekable compression. The dict server has dictzip, which is format-compatible with gzip since it stores its seek table in a gzip extension field in the header; and The Sleuth Kit has sgzip, which isn't, since it stores block lengths at the beginning of each block.
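The underlying trick in dictzip is to deflate fixed-size chunks with a full flush between them, which resets the compressor's state and aligns each chunk to a byte boundary, and then store the compressed length of each chunk. Here's a sketch of just that principle with raw deflate (not the actual dictzip header format; the chunk size is arbitrary):

```python
import zlib

CHUNK = 4096   # uncompressed bytes per chunk

def compress_seekable(data):
    """Deflate `data` with a full flush after every chunk, returning the
    stream plus the compressed length of each chunk (the seek table)."""
    c = zlib.compressobj(9, zlib.DEFLATED, -15)   # raw deflate
    out, lengths = bytearray(), []
    for i in range(0, len(data), CHUNK):
        piece = c.compress(data[i:i + CHUNK]) + c.flush(zlib.Z_FULL_FLUSH)
        lengths.append(len(piece))
        out.extend(piece)
    out.extend(c.flush())
    return bytes(out), lengths

def read_chunk(stream, lengths, i):
    """Inflate one chunk on its own: the full flush reset the dictionary,
    so no back-references cross the chunk boundary."""
    start = sum(lengths[:i])
    d = zlib.decompressobj(-15)
    return d.decompress(stream[start:start + lengths[i]])
```

Frequent full flushes cost some compression ratio, which is the usual trade-off for seekability.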