
Conversation


@AMZN-hgoffin AMZN-hgoffin commented Jan 5, 2026

Add a new internal ZipIndex data structure with an API similar to IndexMap, but with reduced memory usage. My local test that repeatedly opens ZIP files containing hundreds of thousands of files indicates that total process heap size is reduced by over 75%. (Edit: this figure was an artifact of imprecisely eyeballing Activity Monitor to measure memory; further instrumentation shows that only the index structure shrinks by 75%, while the ZipFileData entries and their associated data still make up the bulk of process memory, so the total reduction is only about 15%, making this not worth the effort.) Performance of lookup-by-name is theoretically reduced to O(log n), but CPU cache effects and the smaller memory footprint mean that lookup performance is unchanged in practice. The move from IndexMap to ZipIndex is implemented only for the reader code, but a future change could extend this data structure to ZipWriter use, too. This change also lays groundwork for loading ZIP entries with duplicate names, which IndexMap has historically stripped out in a non-order-preserving way. The existing non-order-preserving behavior is maintained for now.
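For readers unfamiliar with the approach, here is a minimal sketch of the idea behind a structure like ZipIndex. All names here are hypothetical and strings stand in for ZipFileData; the real layout lives in the PR. Instead of a hash map from name to entry, entries stay in central-directory order and a single Vec<u32> of positions, sorted by file name, serves as the index. Lookup is a binary search over that sorted index, which is O(log n) comparisons but only ~4 bytes of per-entry overhead instead of a full hash-table slot:

```rust
// Hypothetical sketch, not the crate's actual API. `String` stands in
// for ZipFileData, with the string itself acting as the file name.
struct ZipIndex {
    // Entries in central-directory (insertion) order.
    entries: Vec<String>,
    // Positions into `entries`, sorted by file name.
    by_name: Vec<u32>,
}

impl ZipIndex {
    fn from_entries(entries: Vec<String>) -> Self {
        let mut by_name: Vec<u32> = (0..entries.len() as u32).collect();
        // Sort positions by the name they point at; the entries Vec
        // itself is never reordered.
        by_name.sort_by(|&a, &b| entries[a as usize].cmp(&entries[b as usize]));
        ZipIndex { entries, by_name }
    }

    // O(log n) lookup-by-name via binary search over the sorted index.
    // Returns the entry's position in central-directory order.
    fn get(&self, name: &str) -> Option<usize> {
        self.by_name
            .binary_search_by(|&i| self.entries[i as usize].as_str().cmp(name))
            .ok()
            .map(|pos| self.by_name[pos] as usize)
    }
}
```

This is why the description above says lookup is "theoretically reduced to O(log n)": a hash map answers in O(1) expected time, but the sorted index trades that for a much denser memory layout, and binary search over a small contiguous `Vec<u32>` is very cache-friendly in practice.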

@gemini-code-assist
Contributor

Summary of Changes

Hello @AMZN-hgoffin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request introduces a new ZipIndex data structure to significantly optimize memory consumption when processing ZIP files, particularly those with a large number of entries. By replacing the less memory-efficient IndexMap in the ZipArchive's core components, this change aims to enhance the overall efficiency of ZIP file reading without compromising lookup performance. The implementation also prepares the codebase for more sophisticated handling of duplicate filenames in the future.

Highlights

  • New Data Structure: Introduced ZipIndex, a new internal data structure designed for optimized memory usage in ZIP file reading operations.
  • Memory Optimization: Replaced IndexMap with ZipIndex within the ZipArchive's internal Shared structure, resulting in a significant memory reduction (over 75% in local tests for large ZIP files).
  • Performance Characteristics: Lookup-by-name performance, while theoretically O(log n), remains practically unchanged due to improved CPU cache efficiency and reduced memory footprint.
  • Reader-Side Implementation: The ZipIndex is currently implemented for the reader code, with future potential for extension to ZipWriter.
  • Duplicate Name Handling: The new structure lays groundwork for handling ZIP entries with duplicate names, while maintaining the existing non-order-preserving behavior for now.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new ZipIndex data structure to optimize memory usage, which is a significant improvement for handling ZIP files with a large number of entries. The implementation is well-done, and the API changes to accommodate the new structure are clean. However, I've found a critical performance issue in the construction of the ZipIndex that could lead to very slow load times for large archives, negating some of the performance benefits. My review includes a detailed comment with a suggested fix for this issue.

src/read.rs Outdated
Comment on lines 74 to 77
let mut files = ZipIndex::with_capacity(self.files.len());
for file in self.files {
    files.push_or_replace_existing(file);
}

critical

The current implementation for building the ZipIndex has a time complexity of O(N^2), where N is the number of files, because ZipIndex::push_or_replace_existing is an O(N) operation and it is called in a loop. For archives with a large number of files (e.g., hundreds of thousands), this will be very slow and could negate the performance benefits of this PR.

A more performant approach is to implement FromIterator<ZipFileData> for ZipIndex and use collect(). This will reduce the complexity to a much more acceptable O(N log N).

First, add the following implementation to src/zipindex.rs:

use std::iter::FromIterator;

impl<S: BuildHasher + Default> FromIterator<ZipFileData> for ZipIndex<S> {
    fn from_iter<T: IntoIterator<Item = ZipFileData>>(iter: T) -> Self {
        let iter = iter.into_iter();
        let mut unique_files = Vec::with_capacity(iter.size_hint().0);
        let mut seen = IndexMap::with_capacity(iter.size_hint().0);
        for file in iter {
            if let Some(index) = seen.get(file.file_name.as_ref()) {
                unique_files[*index] = file;
            } else {
                seen.insert(file.file_name.clone(), unique_files.len());
                unique_files.push(file);
            }
        }
        unique_files.into()
    }
}

Then, you can change this block to be much more efficient and concise.

Suggested change
let mut files = ZipIndex::with_capacity(self.files.len());
for file in self.files {
    files.push_or_replace_existing(file);
}
let files: ZipIndex = self.files.into_iter().collect();

Add a new internal ZipIndex data structure with an API similar to IndexMap, but
with massive memory reduction. My local test that repeatedly opens ZIP files
containing hundreds of thousands of files indicates that total process heap
size is reduced by over 75%. Performance of lookup-by-name is theoretically
reduced but seems unchanged in practice. This is only implemented for the
reader code, but a future change could extend this data structure for
ZipWriter use, too. This change also lays groundwork for loading ZIP entries
with duplicate names, which have been historically stripped out by IndexMap in
a non-order-preserving way. The existing non-order-preserving behavior is
maintained for now.

AMZN-hgoffin commented Jan 5, 2026

The memory savings on the IndexMap are real, but the process I was using to measure overall ZipReader memory usage turns out to be inaccurate. ZipIndex does reduce the IndexMap overhead for a 5M-entry ZIP file from 275 MB to 50 MB, which is quite substantial, but there is also over 1.1 gigabytes of data owned directly by ZipFileData, so it's no longer clear that saving 225 MB out of roughly 1500 MB is worth the performance tradeoff. (And yes, Gemini is correct that I wrote a silly O(N^2) loop through the use of push, so this shouldn't be committed anyway unless it is reworked to use Into with a deduplication pass afterwards; the push functions should also go away, as they were meant for testing only.)
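The "deduplication pass afterwards" mentioned above could look roughly like the following sketch. All names are hypothetical and strings stand in for ZipFileData; the real rework would live inside the ZipIndex Into/From conversion. The positions are stably sorted by name, then each run of equal names is collapsed, keeping only the last occurrence, which mirrors the replace-existing semantics of the current push loop but in O(N log N) total instead of O(N^2):

```rust
// Hypothetical sketch of a sort-then-dedup index build. `entries` stands
// in for the Vec<ZipFileData>, with the string acting as the file name.
fn build_index(entries: &[String]) -> Vec<u32> {
    let mut by_name: Vec<u32> = (0..entries.len() as u32).collect();
    // Stable sort: duplicates stay in central-directory order within a run.
    by_name.sort_by(|&a, &b| entries[a as usize].cmp(&entries[b as usize]));
    // Collapse each run of equal names, keeping the LAST occurrence
    // (same winner as push_or_replace_existing). In dedup_by the first
    // closure argument is the later element; copying it into the kept
    // slot before returning true makes the final survivor the last one.
    by_name.dedup_by(|next, prev| {
        if entries[*next as usize] == entries[*prev as usize] {
            *prev = *next;
            true
        } else {
            false
        }
    });
    by_name
}
```

One sort plus one linear dedup pass avoids both the quadratic loop and the temporary hash map from the suggested FromIterator fix, at the cost of changing which duplicate survives only if the keep-last convention is not what the crate ultimately wants.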
