-
Notifications
You must be signed in to change notification settings - Fork 211
feat: reading PivotTable (PivotCache) #559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Overall it looks okay. However there are a number of issues to fix before review:
|
|
@jmcnamara I may have accidentally resubmitted the PR. I took your advice on moving to a branch on my own repo but didn't realize it would automatically resubmit once I rebased it to my master. If that was the case I still have a little cleanup left removing the unnecessary comments. Also, do you plan to have this released for just .xlsx workbooks then have the remaining prd seperately? |
That is fine, and normal. People often submit the PR as "draft" (it is an option in the initial GitHub dialog or you can do it in git) while they are iterating and then move it to full once it is ready to merge.
I would think it would be too hard to do this for xls. It may be possible to do it for xlsb and I don't know about ods. So I think it is probably ok to work it out for just xlsx in this PR. Also, I will run the AI code review on this later. Use the usual amount of judgement in relation to the suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds pivot table data reading functionality to the Calamine library, specifically for XLSX files. The feature allows users to extract the underlying data (pivot cache) that supports pivot tables, which can be useful for auditing filtered content in externally sourced pivot tables.
Key changes:
- Adds a new
pivot-cachefeature flag to enable pivot table functionality - Implements pivot table metadata parsing and data extraction from XLSX files
- Provides public API methods for accessing pivot table names and data
Reviewed Changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
Cargo.toml |
Adds the new pivot-cache feature flag |
src/lib.rs |
Includes the new pivot module when the feature is enabled |
src/pivot.rs |
Core pivot table data structures and parsing utilities |
src/xlsx/mod.rs |
Main implementation of pivot table reading functionality for XLSX files |
src/auto.rs |
Commented placeholder for future pivot table support in auto-detection |
src/ods.rs |
Commented placeholder for future ODS pivot table support |
src/xls.rs |
Commented placeholder for future XLS pivot table support |
src/xlsb.rs |
Commented placeholder for future XLSB pivot table support |
tests/test.rs |
Comprehensive test cases for the new pivot table functionality |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
@jmcnamara thank you for the quick feedback and shared git knowledge. Since this branch is now a draft I'll keep the same approach. Also, the copilot suggestions were actually decent for identifying places I intended to come back to (unwraps and such). @sftse thanks as well for feedback. I took your approach in the many of the suggestions (those with the thumbs up) and addressed comments / questions for the others. |
sftse
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think most uses of ref here can be removed.
|
@jmcnamara I believe the failing check for stable appears to be due to the addition of clippy in 458b8ca. The image installs the toolchain with
|
|
@siqpush Thanks. I'll fix that. I wonder why it was working previously. |
|
@siqpush Thanks it is fixed on master. You will need to do a fetch and rebase to pick up the changes. |
Using github "sync" your code was merged not rebased onto my branch. I saw your note too late - my local branch was merge squashed by then. Tried the rebase follow up but it was useless. @jmcnamara |
Don't worry about it we can sort it out at merge. |
|
@siqpush Could you "Resolve" the comments that have been already addressed to make the review cleaner. Thanks. |
|
I fixed the failing typo check on master. If you rebase to that it will fix the failing CI check. |
|
@siqpush I would like to merge this for the next release but there is a conflict due to recent changes. Could you fix this? |
|
@jmcnamara I don't think we should merge this in its current state. The commit history is very confusing (empty rebase commits, merge commits inside this PR) and I think the code could still use some improvements. |
@sftse I could probably deal with that but if it still needs reviewing I will pause the merge. @siqpush Could you try rebase this down into a single commit that is up to date with main. Or start a clean branch and a new PR. |
|
@sftse @jmcnamara does commit 0b7a884 better present the fix for you? On my side it shows i want to merge both my commits and others into master (very confusing unless i drill into the commit link). Feels like a new PR may be the way... |
Yes. The looks clean, from the point of view of a review but it seems to be detached from a branch. |
I think we should only use feature flags for features that require an additional dependency like "chrono". I think all other features like "pictures" should be initialised when the user calls them. (This isn't 100% simple in some cases since it could require a second parsing of parts of a file but in general it should be achievable). |
Co-authored-by: sftse <c@farsight.net>
Co-authored-by: sftse <c@farsight.net>
* adding pivotref vec wrapper for public facing * removing pivot cache mod * tag enumeration * misc design changes --------- Co-authored-by: GitHub Actions <actions@github.com>
Co-authored-by: GitHub Actions <actions@github.com>
Co-authored-by: GitHub Actions <actions@github.com>
sftse
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please look proactively for similar code patterns to the ones we highlight and consider how additional commits can help reviewers understand the changes.
Git diff relies on heuristics that are easily trashed by big diffs, so I'd overindex on small diffs and rather too many commits than too few.
| Ok(pivots_on_sheet) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my pattern-matching kicking in, is it correct that multiple pivots per sheet are permitted? Other functions error when more than one candidate item is found in the zip.
If you recheck, can you add a reference to the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link / screenshot to the reference is below. Originally this was just something I noticed during a few random tests. The test referenced in this comment #559 (comment) was to address.
https://web.mit.edu/~stevenj/www/ECMA-376-new-merged.pdf
See section 12.3.11 (page 78 in the spec, or 90 in pdf pages) otherwise screenshot below:
sftse
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this PR across the finish line, nearly there!
src/xlsx/mod.rs
Outdated
| pub fn get_pivot_tables_by_name_and_sheet(&self) -> Vec<(String, String)> { | ||
| self.0 | ||
| .iter() | ||
| .map(|pt| (pt.sheet().to_string(), pt.name().to_string())) | ||
| .collect() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the proposed API is hard to judge as-is, can we have a complete example how to use this in practice?
As an alternative, would it make sense to expose fn iter(&self) -> impl Iterator<Item = &'_ PivotTableRef> and let the caller use the API on PivotTableRef to call pt.name() and pt.sheet() as needed? This function just changes the representation of the data, which is not in itself a reason to exclude it, but would need more justification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it felt strange because I exposed functionality to PivotTables and even PivotTableRef that felt like they should be able to get data on their own. In turn this made functions like the above feel extra.
As for the alternative, my gut thinking is that because we also have get_pivot_tables_by_sheet, which would then also need be removed. But if we give the filter map to the user, then at best case they are left having to do some extra doc reading on some of the uniqueness subtleties.
As for a concrete example, I imagine for auditing data in workbook that might be sensitive to expose to the wrong user / client / ..etc. Using the relevant workbook from tests, this is how I would avoid leaking the top secret details of Category Blue:
let mut workbook: Xlsx<_> = open_workbook(path)?;
let pivot_tables = workbook.read_pivot_table_metadata()?;
for (sheet, pt) in pivot_tables.get_pivot_tables_by_name_and_sheet() {
let mut check_col = 0;
for (row_number, row) in workbook.pivot_table_data(&pivot_tables, sheet, pt)?.enumerate() {
// header is the first row
if row_number == 0 {
for (col_number, field) in row.iter().enumerate() {
if field.eq(&crate::calamine::Data::String("Category".to_string())) {
check_col = col_number
}
}
} else if row[check_col].eq(&crate::calamine::Data::String("blue".to_string())) {
panic!("Blue should not be included in this report.")
}
}
}```
Review Cleanup
Aligning worksheet - pivot table hierarchy
sftse
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Havent forgotten about this PR, sorry about the delays.
I'd have to find time to use the current API to endorse it, but the internal details are getting close to being right.
Dont be. I was happy you got a healthy break from reviewing my mess. Let me know what you think once you get a moment! |
import not needed


Feature to read all data available to a pivot table.
The data supporting a pivot table is referred to as the pivotCache. And I like to consider this feature "Calamine for your Cache".
An example use case would be auditing filtered content in an externally sourced pivot table.
Pivot Table's require both xl/pivotCache/PivotCacheDefinitions and xl/pivotCache/PivotCacheRecords files. The definitions file has relevant metadata as well as shared items. While the records file has values - rows are delimited with the <r> tag. <x> indicates only their position in the Definitions file is given (sample below from tests/pivots.xlsx)
xl/pivotCache/PivotCacheRecords1.xml
xl/pivotCache/PivotCacheDefinitions1.xml
Example
This may be determined to go outside the scope Calamine - but if it fits then it will need to be applied to other workbook formats (only .xlsx currently) and worked on in a few places like error handling.