Changed stream format to make compatible with MiGz by cielavenir · Pull Request #2 · vinlyx/mgzip

cielavenir · 2019-09-23T08:09:16Z

First, I should mention that the -8 at https://github.com/vinlyx/mgzip/blob/0.1.0/mgzip/multiProcGzip.py#L492 corresponds to the +8 at https://github.com/vinlyx/mgzip/blob/0.1.0/mgzip/multiProcGzip.py#L316, which is the CRC32 plus the block size. You have actually already figured out the -8.
Also, for an indexed gzip, you should not need a QWORD for the member size or the block size; the latter can be read from the gzip footer.
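
To illustrate the footer point, here is a minimal sketch (not code from mgzip) that reads the last 8 bytes of a gzip member. Per RFC 1952 these are the CRC32 followed by ISIZE, both little-endian 32-bit integers. The function name `read_gzip_trailer` and the sample data are assumptions made for the example.

```python
import gzip
import struct
import zlib


def read_gzip_trailer(member: bytes):
    """Return (crc32, isize) taken from the last 8 bytes of one gzip member."""
    # RFC 1952 trailer: CRC32 then ISIZE, each a little-endian uint32.
    crc32, isize = struct.unpack("<II", member[-8:])
    return crc32, isize


data = b"hello mgzip " * 1000      # sample payload, well under 4GB
member = gzip.compress(data)       # a single gzip member
crc, isize = read_gzip_trailer(member)

# For payloads under 4GB, ISIZE is the exact uncompressed size.
print(isize == len(data), crc == zlib.crc32(data))
```

This only works when the end of the member is known, which is why an index of compressed sizes is still needed to locate each footer.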

Now to the main topic. Although your concept is great, your format is very rare; no other tool can open it as indexed gzip.
LinkedIn recently introduced a format named MiGz: https://github.com/linkedin/migz
It implements the points mentioned above and records only compressed_size in the extra header.
So I changed your code a little for interoperability with MiGz.
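
A rough sketch of the idea of recording only the compressed size in the gzip extra field, and then using it to jump between members without inflating them. This is not the mgzip or MiGz implementation: the subfield ID `b"XX"`, the helper names, and the assumption that each header has FEXTRA set with exactly one subfield and no other optional fields are all hypothetical choices for illustration.

```python
import gzip
import struct
import zlib

SUBFIELD_ID = b"XX"  # hypothetical ID; not taken from mgzip or MiGz


def make_member(data: bytes) -> bytes:
    """Build one gzip member whose extra field stores the compressed size."""
    comp = zlib.compressobj(wbits=-15)            # raw deflate, no zlib wrapper
    body = comp.compress(data) + comp.flush()
    # Subfield: 2-byte ID, 2-byte length (4), 4-byte compressed size.
    extra = SUBFIELD_ID + struct.pack("<HI", 4, len(body))
    # ID1, ID2, CM=8, FLG=FEXTRA(0x04), MTIME, XFL, OS=255 (unknown), XLEN
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 0x04, 0, 0, 255, len(extra))
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + extra + body + trailer


def member_offsets(stream: bytes) -> list:
    """Find the start offset of every member using only the extra fields."""
    offsets, pos = [], 0
    while pos < len(stream):
        offsets.append(pos)
        (xlen,) = struct.unpack_from("<H", stream, pos + 10)
        sid, slen, comp_size = struct.unpack_from("<2sHI", stream, pos + 12)
        if sid != SUBFIELD_ID or slen != 4:
            raise ValueError("unexpected extra subfield")
        # 12-byte fixed header + extra field + deflate body + 8-byte trailer
        pos += 12 + xlen + comp_size + 8
    return offsets


stream = make_member(b"a" * 1000) + make_member(b"b" * 2000)
offsets = member_offsets(stream)
second = gzip.decompress(stream[offsets[1]:])   # start directly at member 2
```

Because each member is still a valid gzip member, an ordinary gzip reader can decompress the whole stream, while an index-aware reader can hand each offset to a separate worker.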

By the way, note that get_index() needs to be rewritten. (I can work on it after I hear from you.)

vinlyx · 2019-09-24T18:59:32Z

Hi cielavenir, thank you very much for your explanation and suggestion. I have already merged your first PR, #1.

Thanks for the information. I hadn't noticed LinkedIn's MiGz repository, but it seems the idea is almost the same.

I would like to explain the original reason for putting a QWORD into the extra field to record the size of the raw data:
As you may know, there is an inherent issue in the gzip format: it is impossible to get the exact uncompressed size without decompressing the file when the raw file is >4GB (https://bugzilla.redhat.com/show_bug.cgi?id=752040). This is caused by the original design of the gzip format in RFC 1952, which uses a 32-bit ISIZE to record the raw file size.
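
A small sketch of why the 32-bit ISIZE is ambiguous for large inputs: RFC 1952 stores the uncompressed size modulo 2^32, so several true sizes map to the same field value. The helper names and the 10GiB upper bound are assumptions chosen for the example.

```python
GIB = 2**30


def isize_field(uncompressed_size: int) -> int:
    """Value stored in the gzip ISIZE field (size modulo 2**32)."""
    return uncompressed_size % 2**32


def candidate_sizes(isize: int, max_size: int) -> list:
    """All true sizes up to max_size that would produce this ISIZE."""
    return list(range(isize, max_size + 1, 2**32))


stored = isize_field(5 * GIB)                  # a 5GiB file stores 1GiB
candidates = candidate_sizes(stored, 10 * GIB)
print(stored // GIB, [c // GIB for c in candidates])
```

With a 10GiB bound, the same ISIZE is consistent with 1GiB, 5GiB and 9GiB inputs, so the footer alone cannot tell them apart.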

The primary purpose of mgzip is to provide a faster way to process large files, specifically files that may be larger than 100GB. Using the original 32-bit ISIZE to represent the raw size would potentially limit the member size to 4GB, but I want to keep the possibility of using a member size >4GB.

But this is open to discussion, and I am also reading the BGZF and RAZF documentation to see whether there is a better solution.

Any comments and suggestions are welcome. Thanks!

cielavenir · 2019-09-25T00:43:28Z

Yes, but that applies only if the block size is more than 4GB. mgzip (IndexedGzip) or MiGz should not be designed that way (to be clear, the block size is what you mention as 200MB in readme.md).

By the way, I have a suite to handle such files: https://github.com/cielavenir/7bgzf/tree/dev
(And actually I was looking for other formats and happened to find this module)

cielavenir · 2020-06-06T17:13:45Z

moved to #6

Close at end of read

Changed stream format to make compatible with MiGz

90d7380

vinlyx force-pushed the master branch from 5066abe to 20c3328 Compare November 26, 2019 14:39

cielavenir closed this Jun 6, 2020

timhughes referenced this pull request in pgzip/pgzip Sep 11, 2021

Merge pull request #2 from timhughes/tests

c1b517a

cielavenir wants to merge 1 commit into vinlyx:master from cielavenir:migz_compatibility