searching within a compressed sorted fixed width file

Assume I have a regular fixed width file that is sorted on one of the fields. Given that I know the length of the records, I can use lseek to implement a binary search to find records with fields that match a given value without having to read the entire file.
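The lseek-based binary search described above can be sketched like this. This is a minimal illustration, not taken from the question: the function name, record length, and key position are all assumptions.

```python
import os

def binary_search_fixed_width(path, rec_len, key_slice, target):
    """Binary-search a sorted fixed-width file without reading it all.

    rec_len   -- bytes per record (including the trailing newline, if any)
    key_slice -- slice selecting the sort key inside a record
    target    -- bytes value to look for
    Returns the byte offset of a matching record, or None.
    """
    with open(path, "rb") as f:
        n_records = os.fstat(f.fileno()).st_size // rec_len
        lo, hi = 0, n_records - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(mid * rec_len)          # the lseek step: jump straight to record `mid`
            record = f.read(rec_len)
            key = record[key_slice]
            if key == target:
                return mid * rec_len
            elif key < target:
                lo = mid + 1
            else:
                hi = mid - 1
    return None
```

This reads O(log n) records instead of the whole file, which is exactly what breaks once the bytes on disk are compressed.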

Now the difficulty is that the file is gzipped. Is it possible to do this without completely inflating the file? If not with gzip, is there any compression format that supports this kind of behavior?

-------------Answers------------

This is essentially impossible with a file compressed with zip, gzip, or their derivatives. Those formats are based on a rolling dictionary window, typically with Huffman coding of the output codes layered on top. The bottom line is that a particular sequence of bytes in such a file is meaningless without the context that precedes it.

If you want to be able to randomly read a particular record out of a compressed file, you have to compress each record independently and then have an index into the file. Depending on your data, this would probably render the compression step worthless.
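A minimal sketch of that per-record scheme, using zlib for illustration (the function names are hypothetical). It also shows why the answer hedges about the compression becoming worthless: each record is compressed with no shared dictionary, so short records barely shrink or even grow.

```python
import zlib

def build_compressed_with_index(records):
    """Compress each record independently; return (blob, index).

    index[i] = (offset, length) of compressed record i inside blob.
    Because no dictionary state is shared between records, the
    compression ratio is usually far worse than compressing the
    whole file in one stream.
    """
    blob = bytearray()
    index = []
    for rec in records:
        comp = zlib.compress(rec)
        index.append((len(blob), len(comp)))
        blob.extend(comp)
    return bytes(blob), index

def read_record(blob, index, i):
    """Random access: slice out and inflate just record i."""
    off, length = index[i]
    return zlib.decompress(blob[off:off + length])
```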

The bzip2 file format consists of multiple independently-compressed blocks. If you're willing to maintain an index alongside your bzip2 file, you can know where to lseek to.

Note: This is a duplicate of the questions:

  • Compression formats with good support for random access within archives?
  • Random access gzip stream
  • Multi-part gzip file random access (in Java)

These answer the same question, but also identify BGZF as a gzip-compatible output format with sync points inserted to reset the compression state.
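To make the BGZF idea concrete: each BGZF block is an ordinary gzip member whose FEXTRA "BC" subfield records the compressed block size, so a reader can hop from block to block without inflating anything. The sketch below builds toy blocks and walks them; the helper names are mine, and the layout follows the published BGZF description (assumed here, not taken from the answers above).

```python
import struct
import zlib

def make_bgzf_block(data):
    """Build one BGZF-style block: a gzip member with a 'BC' extra
    subfield holding (total block size - 1) as a little-endian uint16."""
    c = zlib.compressobj(6, zlib.DEFLATED, -15)     # raw deflate, no wrapper
    body = c.compress(data) + c.flush()
    bsize = 18 + len(body) + 8                      # 18-byte header + payload + 8-byte trailer
    header = (b"\x1f\x8b\x08\x04"                   # magic, CM=deflate, FLG=FEXTRA
              + b"\x00\x00\x00\x00"                 # MTIME = 0
              + b"\x00\xff"                         # XFL, OS = unknown
              + struct.pack("<H", 6)                # XLEN: one 6-byte subfield
              + b"BC" + struct.pack("<H", 2)        # subfield id 'BC', SLEN = 2
              + struct.pack("<H", bsize - 1))       # BSIZE - 1
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + body + trailer

def bgzf_blocks(buf):
    """Return (offset, size) of every block by reading the 'BC' subfield,
    without decompressing any payload."""
    pos, out = 0, []
    while pos < len(buf):
        assert buf[pos:pos + 2] == b"\x1f\x8b", "not a gzip member"
        xlen = struct.unpack_from("<H", buf, pos + 10)[0]
        extra = buf[pos + 12: pos + 12 + xlen]
        bsize, i = None, 0
        while i < len(extra):                       # scan the extra subfields
            slen = struct.unpack_from("<H", extra, i + 2)[0]
            if extra[i:i + 2] == b"BC":
                bsize = struct.unpack_from("<H", extra, i + 4)[0] + 1
            i += 4 + slen
        out.append((pos, bsize))
        pos += bsize
    return out
```

Because each block is a complete gzip member, a standard gzip reader still decompresses the concatenation; a BGZF-aware reader can instead seek to any `(offset, size)` pair and inflate one block in isolation.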

Pretty much all compression algorithms I know of work in stream or block mode, meaning a random seek isn't possible. Even LZMA, which doesn't use an initial dictionary, requires sequential decompression.

Stream compression usually means adaptive compression with occasional reset points that clear the state (effectively cutting the stream into blocks). The details are more complex than that.

Now here are a couple of ideas to solve this:

  • Create an index: just as a ZIP archive keeps a directory, so that opening it shows every file it contains
  • Cut your compressed file into independently compressed blocks, then use a binary search to pick the right block (similar, really, to the first idea)
  • Decompress in memory, but discard everything until you reach the beginning of the data you're looking for

The last way is good for small compressed files, and the block method is good for larger compressed files. You can mix the two.
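The last idea, decompress and discard, can be sketched with zlib's streaming decompressor. The function name and parameters are illustrative; the early exit relies on the file being sorted, as in the original question.

```python
import zlib

def stream_search_gzip(path, rec_len, key_slice, target, chunk=64 * 1024):
    """Inflate a gzipped, sorted, fixed-width file front to back,
    discarding records until the target key appears; stop early
    once a key exceeds the target (the file is sorted)."""
    d = zlib.decompressobj(wbits=31)       # 31 = expect a gzip wrapper
    buf = b""
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk)
            buf += d.decompress(raw) if raw else d.flush()
            while len(buf) >= rec_len:
                rec, buf = buf[:rec_len], buf[rec_len:]
                key = rec[key_slice]
                if key == target:
                    return rec             # found: stop inflating here
                if key > target:
                    return None            # passed it: target is absent
            if not raw:
                return None
```

On average this still inflates half the file, which is why the answer recommends it only for small inputs.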

PS: Fixed width in the input doesn't mean the compressed output will be fixed width, so that piece of information doesn't help here.

Building on what Wernight said, you can split your file into many fixed size subfiles before you gzip it. Your binary search can start by searching for the subfile that contains the range, then it will only need to decompress the small subfile rather than the whole thing. You can optimize by creating an upper-level file in the archive that contains the first row of each subfile.

Continuing on what Liudvikas Bukys says: if your compressed blocks have a unique header, you don't need the index. That's similar to how seeking works in some compressed video formats: you seek to a point and scan forward for the next header. This does need robust validation (using a checksum), though, as mis-identification is possible.

What you want is seekable compression. The dict server has dictzip, which stays format-compatible with gzip by storing its seek table in a gzip extra field in the header. The Sleuth Kit has sgzip, which isn't gzip-compatible, as it stores the block lengths at the beginning of each block.

Category: file Time: 2010-04-23

