Changed stream format to make compatible with MiGz #2
cielavenir wants to merge 1 commit into vinlyx:master
Hi cielavenir, thank you very much for your explanation and suggestion. I have already merged your first PR #1. Thanks for the information; I hadn't noticed LinkedIn's MiGz repository, but the idea seems almost the same. I would like to explain the original reason for putting a QWORD into the extra field to record the size of the raw data: the primary purpose of mgzip is to provide a faster way to process large files, specifically files that may be larger than 100GB. Using the original 32-bit ISIZE to represent the raw size would potentially limit the member size to 4GB, and I want to keep the possibility of using member sizes >4GB. But it is open to discussion, and I am also looking at the BGZF and RAZF documentation to see whether there is a better solution. Any comments and suggestions are welcome. Thanks!
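To make the 4GB concern concrete, here is a small sketch of the arithmetic. It assumes nothing about mgzip's own code; `raw_size` is just an illustrative value. The gzip footer stores the raw size modulo 2^32, so a member larger than 4GB cannot be recovered from ISIZE alone, whereas a QWORD holds it exactly.

```python
import struct

raw_size = 5 * 1024**3  # hypothetical 5 GiB member, for illustration only

# ISIZE in the gzip footer is the raw size modulo 2**32.
isize = raw_size & 0xFFFFFFFF

# A little-endian QWORD (8 bytes) stores the full value without wrapping.
qword = struct.pack("<Q", raw_size)
(restored,) = struct.unpack("<Q", qword)

print(isize, restored)
```

With blocks kept below 4GB, ISIZE is exact for each block and the total size is simply the sum over blocks.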
Yes, but that applies only if the block size is more than 4GB. mgzip (IndexedGzip) or MiGz should not be designed that way (to be clear, block size is what you mention as [...]). By the way, I have a suite to handle such files: https://github.com/cielavenir/7bgzf/tree/dev
Moved to #6.
Firstly, I should mention that the -8 at https://github.com/vinlyx/mgzip/blob/0.1.0/mgzip/multiProcGzip.py#L492 corresponds to the +8 at https://github.com/vinlyx/mgzip/blob/0.1.0/mgzip/multiProcGzip.py#L316, which is CRC32 + block size (4 bytes each). Actually, you have already figured out the -8. Also, as an indexed gzip, you should not need a QWORD for member size or block size: the latter can be read from the gzip footer.
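As a minimal sketch of the footer point above (standard library only, not mgzip's code; `read_gzip_footer` is a hypothetical helper name), the last 8 bytes of a gzip member are the CRC32 followed by the raw size, both little-endian 32-bit values:

```python
import gzip
import struct
import zlib

def read_gzip_footer(member: bytes) -> tuple:
    """Return (crc32, isize) taken from the last 8 bytes of one gzip member."""
    crc32, isize = struct.unpack("<II", member[-8:])
    return crc32, isize

raw = b"hello mgzip" * 1000
member = gzip.compress(raw)  # produces a single gzip member
crc32, isize = read_gzip_footer(member)

# The footer matches the CRC32 and length of the uncompressed data.
print(crc32 == zlib.crc32(raw), isize == len(raw))
```

This is why an 8-byte adjustment appears on both the write and read side: it accounts for these two footer fields.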
Now to the main topic: although your concept is great, your format is very rare, and no other tool can open it as indexed gzip.
Recently LinkedIn invented a format named MiGz: https://github.com/linkedin/migz
It implements the points mentioned above and records only compressed_size in the extra header.
So I tried to change your code a little for interoperability with MiGz.
By the way, note that get_index() needs to be rewritten. (I could work on it after I hear from you.)
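To illustrate the idea of recording only the compressed size in the extra field, here is a rough sketch of writing such members and walking them to build an index. It is not the code in this PR and not MiGz's implementation: the subfield ID `MZ`, the 4-byte size width, and the helper names `write_member` and `member_offsets` are assumptions for illustration, so check the MiGz source for the real layout.

```python
import gzip
import struct
import zlib

SUBFIELD_ID = b"MZ"  # assumed subfield ID, for illustration only

def write_member(raw: bytes) -> bytes:
    """Build one gzip member whose extra field holds the deflate body size."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw deflate
    body = comp.compress(raw) + comp.flush()
    extra = SUBFIELD_ID + struct.pack("<HI", 4, len(body))
    # ID1, ID2, CM=8, FLG=FEXTRA, MTIME=0, XFL=0, OS=255 (unknown), XLEN
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 0x04, 0, 0, 255, len(extra))
    footer = struct.pack("<II", zlib.crc32(raw), len(raw) & 0xFFFFFFFF)
    return header + extra + body + footer

def member_offsets(stream: bytes) -> list:
    """Return the start offset of each member without inflating any data."""
    offsets = []
    pos = 0
    while pos < len(stream):
        offsets.append(pos)
        (xlen,) = struct.unpack_from("<H", stream, pos + 10)
        # Assumes the size subfield is the first one and no FNAME/FCOMMENT follow.
        sid, slen, csize = struct.unpack_from("<2sHI", stream, pos + 12)
        assert sid == SUBFIELD_ID and slen == 4
        pos += 12 + xlen + csize + 8  # header + extra + body + footer
    return offsets

a, b = b"first block " * 500, b"second block " * 500
stream = write_member(a) + write_member(b)
offsets = member_offsets(stream)
```

Because each member is still a valid gzip member, an ordinary reader such as `gzip.decompress(stream)` can open the whole stream, while an index-aware reader can jump straight to `offsets[1]` and decompress only the second block.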