
Conversation


@AMZN-hgoffin AMZN-hgoffin commented Jan 5, 2026

Add a new internal ZipIndex data structure with an API similar to IndexMap, but with reduced memory usage. My local test that repeatedly opens ZIP files containing hundreds of thousands of files indicates that total process heap size is reduced by over 75%. (Edit: this figure was an artifact of imprecisely eyeballing Activity Monitor to measure memory; further instrumentation shows that only the index structure shrinks by 75%, while the ZipFileData entries and their associated data still make up the bulk of process memory, so the total reduction is only about 15%, making this not worth the effort.) Performance of lookup-by-name is theoretically reduced to O(log n), but CPU cache effects and the smaller memory footprint mean that lookup performance is unchanged in practice. The move from IndexMap to ZipIndex is implemented only for the reader code, but a future change could extend this data structure to ZipWriter use, too. This change also lays groundwork for loading ZIP entries with duplicate names, which IndexMap has historically stripped out in a non-order-preserving way. The existing non-order-preserving behavior is maintained for now.
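For readers unfamiliar with the approach, here is a minimal sketch of the idea behind a structure like ZipIndex. All names here are hypothetical and strings stand in for ZipFileData; the real layout lives in the PR. Instead of a hash map from name to entry, entries stay in central-directory order and a single Vec<u32> of positions, sorted by file name, serves as the index. Lookup is a binary search over that sorted index, which is O(log n) comparisons but only ~4 bytes of per-entry overhead instead of a full hash-table slot:

```rust
// Hypothetical sketch, not the crate's actual API. `String` stands in
// for ZipFileData, with the string itself acting as the file name.
struct ZipIndex {
    // Entries in central-directory (insertion) order.
    entries: Vec<String>,
    // Positions into `entries`, sorted by file name.
    by_name: Vec<u32>,
}

impl ZipIndex {
    fn from_entries(entries: Vec<String>) -> Self {
        let mut by_name: Vec<u32> = (0..entries.len() as u32).collect();
        // Sort positions by the name they point at; the entries Vec
        // itself is never reordered.
        by_name.sort_by(|&a, &b| entries[a as usize].cmp(&entries[b as usize]));
        ZipIndex { entries, by_name }
    }

    // O(log n) lookup-by-name via binary search over the sorted index.
    // Returns the entry's position in central-directory order.
    fn get(&self, name: &str) -> Option<usize> {
        self.by_name
            .binary_search_by(|&i| self.entries[i as usize].as_str().cmp(name))
            .ok()
            .map(|pos| self.by_name[pos] as usize)
    }
}
```

This is why the description above says lookup is "theoretically reduced to O(log n)": a hash map answers in O(1) expected time, but the sorted index trades that for a much denser memory layout, and binary search over a small contiguous `Vec<u32>` is very cache-friendly in practice.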

@gemini-code-assist
Contributor

Summary of Changes

Hello @AMZN-hgoffin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request introduces a new ZipIndex data structure to significantly optimize memory consumption when processing ZIP files, particularly those with a large number of entries. By replacing the less memory-efficient IndexMap in the ZipArchive's core components, this change aims to enhance the overall efficiency of ZIP file reading without compromising lookup performance. The implementation also prepares the codebase for more sophisticated handling of duplicate filenames in the future.

Highlights

  • New Data Structure: Introduced ZipIndex, a new internal data structure designed for optimized memory usage in ZIP file reading operations.
  • Memory Optimization: Replaced IndexMap with ZipIndex within the ZipArchive's internal Shared structure, resulting in a significant memory reduction (over 75% in local tests for large ZIP files).
  • Performance Characteristics: Lookup-by-name performance, while theoretically O(log n), remains practically unchanged due to improved CPU cache efficiency and reduced memory footprint.
  • Reader-Side Implementation: The ZipIndex is currently implemented for the reader code, with future potential for extension to ZipWriter.
  • Duplicate Name Handling: The new structure lays groundwork for handling ZIP entries with duplicate names, while maintaining the existing non-order-preserving behavior for now.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new ZipIndex data structure to optimize memory usage, which is a significant improvement for handling ZIP files with a large number of entries. The implementation is well-done, and the API changes to accommodate the new structure are clean. However, I've found a critical performance issue in the construction of the ZipIndex that could lead to very slow load times for large archives, negating some of the performance benefits. My review includes a detailed comment with a suggested fix for this issue.

src/read.rs Outdated
Comment on lines 74 to 77
let mut files = ZipIndex::with_capacity(self.files.len());
for file in self.files {
    files.push_or_replace_existing(file);
}

critical

The current implementation for building the ZipIndex has a time complexity of O(N^2), where N is the number of files, because ZipIndex::push_or_replace_existing is an O(N) operation and it is called in a loop. For archives with a large number of files (e.g., hundreds of thousands), this will be very slow and could negate the performance benefits of this PR.

A more performant approach is to implement FromIterator<ZipFileData> for ZipIndex and use collect(). This will reduce the complexity to a much more acceptable O(N log N).

First, add the following implementation to src/zipindex.rs:

use std::iter::FromIterator;

impl<S: BuildHasher + Default> FromIterator<ZipFileData> for ZipIndex<S> {
    fn from_iter<T: IntoIterator<Item = ZipFileData>>(iter: T) -> Self {
        let iter = iter.into_iter();
        let mut unique_files = Vec::with_capacity(iter.size_hint().0);
        let mut seen = IndexMap::with_capacity(iter.size_hint().0);
        for file in iter {
            if let Some(index) = seen.get(file.file_name.as_ref()) {
                unique_files[*index] = file;
            } else {
                seen.insert(file.file_name.clone(), unique_files.len());
                unique_files.push(file);
            }
        }
        unique_files.into()
    }
}

Then, you can change this block to be much more efficient and concise.

Suggested change
let mut files = ZipIndex::with_capacity(self.files.len());
for file in self.files {
    files.push_or_replace_existing(file);
}
let files: ZipIndex = self.files.into_iter().collect();

Add a new internal ZipIndex data structure with an API similar to IndexMap, but
with massive memory reduction. My local test that repeatedly opens ZIP files
containing hundreds of thousands of files indicates that total process heap
size is reduced by over 75%. Performance of lookup-by-name is theoretically
reduced but seems unchanged in practice. This is only implemented for the
reader code, but a future change could extend this data structure for
ZipWriter use, too. This change also lays groundwork for loading ZIP entries
with duplicate names, which have been historically stripped out by IndexMap in
a non-order-preserving way. The existing non-order-preserving behavior is
maintained for now.

AMZN-hgoffin commented Jan 5, 2026

The memory savings on the IndexMap are real, but the process I was using to measure overall ZipReader memory usage turns out to be inaccurate. ZipIndex does reduce the IndexMap overhead for a 5M-entry ZIP file from 275 MB to 50 MB, which is quite substantial, but there is also over 1.1 gigabytes of data owned directly by ZipFileData, so it's no longer clear that saving 225 MB out of roughly 1500 MB is worth the performance tradeoff. (And yes, Gemini is correct that I wrote a silly O(N^2) loop through the use of push, so this shouldn't be committed anyway unless it is reworked to use Into with a deduplication pass afterwards; the push functions should also go away, as they were meant for testing only.)
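The "deduplication pass afterwards" mentioned above could look roughly like the following sketch. All names are hypothetical and strings stand in for ZipFileData; the real rework would live inside the ZipIndex Into/From conversion. The positions are stably sorted by name, then each run of equal names is collapsed, keeping only the last occurrence, which mirrors the replace-existing semantics of the current push loop but in O(N log N) total instead of O(N^2):

```rust
// Hypothetical sketch of a sort-then-dedup index build. `entries` stands
// in for the Vec<ZipFileData>, with the string acting as the file name.
fn build_index(entries: &[String]) -> Vec<u32> {
    let mut by_name: Vec<u32> = (0..entries.len() as u32).collect();
    // Stable sort: duplicates stay in central-directory order within a run.
    by_name.sort_by(|&a, &b| entries[a as usize].cmp(&entries[b as usize]));
    // Collapse each run of equal names, keeping the LAST occurrence
    // (same winner as push_or_replace_existing). In dedup_by the first
    // closure argument is the later element; copying it into the kept
    // slot before returning true makes the final survivor the last one.
    by_name.dedup_by(|next, prev| {
        if entries[*next as usize] == entries[*prev as usize] {
            *prev = *next;
            true
        } else {
            false
        }
    });
    by_name
}
```

One sort plus one linear dedup pass avoids both the quadratic loop and the temporary hash map from the suggested FromIterator fix, at the cost of changing which duplicate survives only if the keep-last convention is not what the crate ultimately wants.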
