feat: optimize memory usage with new ZipIndex lookup structure #490
Conversation
Code Review
This pull request introduces a new ZipIndex data structure to optimize memory usage, which is a significant improvement for handling ZIP files with a large number of entries. The implementation is well-done, and the API changes to accommodate the new structure are clean. However, I've found a critical performance issue in the construction of the ZipIndex that could lead to very slow load times for large archives, negating some of the performance benefits. My review includes a detailed comment with a suggested fix for this issue.
src/read.rs
Outdated
```rust
let mut files = ZipIndex::with_capacity(self.files.len());
for file in self.files {
    files.push_or_replace_existing(file);
}
```
The current implementation for building the ZipIndex has a time complexity of O(N^2), where N is the number of files. This is because ZipIndex::push_or_replace_existing is an O(N) operation, and it's called in a loop. For archives with a large number of files (e.g., hundreds of thousands), this will be very slow and could negate the performance benefits of this PR.
A more performant approach is to implement FromIterator<ZipFileData> for ZipIndex and use collect(). This will reduce the complexity to a much more acceptable O(N log N).
First, add the following implementation to src/zipindex.rs:
```rust
use std::iter::FromIterator;

impl<S: BuildHasher + Default> FromIterator<ZipFileData> for ZipIndex<S> {
    fn from_iter<T: IntoIterator<Item = ZipFileData>>(iter: T) -> Self {
        let iter = iter.into_iter();
        let mut unique_files = Vec::with_capacity(iter.size_hint().0);
        let mut seen = IndexMap::with_capacity(iter.size_hint().0);
        for file in iter {
            if let Some(index) = seen.get(file.file_name.as_ref()) {
                unique_files[*index] = file;
            } else {
                seen.insert(file.file_name.clone(), unique_files.len());
                unique_files.push(file);
            }
        }
        unique_files.into()
    }
}
```

Then, you can change this block to be much more efficient and concise.
```diff
-let mut files = ZipIndex::with_capacity(self.files.len());
-for file in self.files {
-    files.push_or_replace_existing(file);
-}
+let files: ZipIndex = self.files.into_iter().collect();
```
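To make the replace-in-place deduplication logic in the suggestion above concrete, here is a self-contained sketch of the same pattern, with the PR's `ZipFileData` stood in for by a plain `(String, u32)` pair and the hypothetical helper name `dedup_last_wins` invented for illustration. Each name keeps its first position in the output, but a later duplicate overwrites the earlier value, matching the existing non-order-preserving duplicate handling described in the PR:

```rust
use std::collections::HashMap;

/// Deduplicate (name, value) pairs: each name keeps its first position,
/// but a later duplicate replaces the earlier value in place.
/// One expected-O(1) hash probe per entry, so the whole pass is O(N).
fn dedup_last_wins(items: Vec<(String, u32)>) -> Vec<(String, u32)> {
    let mut unique = Vec::with_capacity(items.len());
    let mut seen: HashMap<String, usize> = HashMap::with_capacity(items.len());
    for (name, value) in items {
        if let Some(&index) = seen.get(&name) {
            // Duplicate name: overwrite the earlier entry at its original slot.
            unique[index] = (name, value);
        } else {
            seen.insert(name.clone(), unique.len());
            unique.push((name, value));
        }
    }
    unique
}

fn main() {
    let items = vec![
        ("a.txt".to_string(), 1),
        ("b.txt".to_string(), 2),
        ("a.txt".to_string(), 3), // duplicate name: value 3 wins, position of value 1 kept
    ];
    let deduped = dedup_last_wins(items);
    assert_eq!(
        deduped,
        vec![("a.txt".to_string(), 3), ("b.txt".to_string(), 2)]
    );
    println!("{:?}", deduped);
}
```

This is a sketch of the technique, not the PR's actual code; the real `FromIterator` impl would build the final `ZipIndex` from the deduplicated `Vec` via `into()`.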
Add a new internal ZipIndex data structure with an API similar to IndexMap, but with massive memory reduction. My local test that repeatedly opens ZIP files containing hundreds of thousands of files indicates that total process heap size is reduced by over 75%. Performance of lookup-by-name is theoretically reduced but seems unchanged in practice. This is only implemented for the reader code, but a future change could extend this data structure for ZipWriter use, too. This change also lays groundwork for loading ZIP entries with duplicate names, which have historically been stripped out by IndexMap in a non-order-preserving way. The existing non-order-preserving behavior is maintained for now.
The memory savings on the IndexMap is real, but the process I was using to measure overall ZipReader memory usage turns out to be inaccurate. It does reduce the IndexMap overhead for a 5M-entry zip file from 275 MB to 50 MB, which is quite substantial, but it turns out that there is also over 1.1 gigabytes of data owned by ZipFileData directly, so it's no longer clear that saving 225 MB out of 1500 MB is worth the performance tradeoff. (And yes, Gemini is correct that I wrote a silly N^2 loop via the use of push, so it shouldn't be committed anyway unless it is reworked to use Into with a deduplication pass afterwards; the push functions should go away, as they were meant for testing only.)
Add a new internal ZipIndex data structure with an API similar to IndexMap, but with memory reduction. My local test that repeatedly opens ZIP files containing hundreds of thousands of files indicates that total process heap size is reduced by over 75%. (Edit: this was an artifact of me eyeballing Activity Monitor imprecisely to attempt memory measurement; further instrumentation shows that only the index structure is reduced by 75%, but the ZipFileData entries and associated data still make up the bulk of process memory, so the total reduction is only 15%, making this not worth the effort.) Performance of lookup-by-name is theoretically reduced to O(log n), but CPU cache effects and reduced memory usage mean that lookup performance seems unchanged in practice. The move from IndexMap to ZipIndex is only implemented for the reader code, but a future change could extend this data structure for ZipWriter use, too. This change also lays groundwork for loading ZIP entries with duplicate names, which have historically been stripped out by IndexMap in a non-order-preserving way. The existing non-order-preserving behavior is maintained for now.
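The description above mentions O(log n) lookup-by-name with far less memory overhead than a hash map. One plausible layout that matches those properties (a hypothetical sketch only; the PR's actual `ZipIndex` layout may differ) stores entries in insertion order in one `Vec` and keeps a second `Vec` of indices sorted by name, so lookup is a binary search with no per-bucket hash-table overhead. The `NameIndex` type and method names below are invented for illustration:

```rust
/// Sketch of a memory-lean name index: entries stay in insertion order,
/// and a separate index vector, sorted by name, gives O(log n) lookup.
struct NameIndex {
    names: Vec<String>, // entry names in insertion (central-directory) order
    sorted: Vec<u32>,   // indices into `names`, sorted by the name they point at
}

impl NameIndex {
    fn new(names: Vec<String>) -> Self {
        let mut sorted: Vec<u32> = (0..names.len() as u32).collect();
        // One O(n log n) sort at construction replaces per-entry hashing
        // and the memory overhead of a hash table's buckets.
        sorted.sort_by(|&a, &b| names[a as usize].cmp(&names[b as usize]));
        NameIndex { names, sorted }
    }

    /// Binary search over the sorted index vector: O(log n) string comparisons.
    /// Returns the entry's position in insertion order.
    fn get(&self, name: &str) -> Option<usize> {
        self.sorted
            .binary_search_by(|&i| self.names[i as usize].as_str().cmp(name))
            .ok()
            .map(|pos| self.sorted[pos] as usize)
    }
}

fn main() {
    let idx = NameIndex::new(vec!["b.txt".into(), "a.txt".into(), "c.txt".into()]);
    assert_eq!(idx.get("a.txt"), Some(1));
    assert_eq!(idx.get("c.txt"), Some(2));
    assert_eq!(idx.get("missing"), None);
    println!("ok");
}
```

Using `u32` indices rather than owned keys is one way such a structure could shrink the index itself while leaving the entry data untouched, which is consistent with the author's observation that the bulk of memory is owned by `ZipFileData` rather than the index.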