
Conversation

@siqpush

@siqpush siqpush commented Sep 18, 2025

Feature to read all data available to a pivot table.

The data supporting a pivot table is referred to as the pivotCache. And I like to consider this feature "Calamine for your Cache".

An example use case would be auditing filtered content in an externally sourced pivot table.

Pivot tables require both the xl/pivotCache/PivotCacheDefinitions and xl/pivotCache/PivotCacheRecords files. The definitions file holds the relevant metadata as well as the shared items, while the records file holds the values, with each row delimited by an <r> tag. An <x> tag means the value itself is not stored in the record; only its position among the shared items in the definitions file is given (sample below from tests/pivots.xlsx).

image

xl/pivotCache/PivotCacheRecords1.xml

<r>
    <n v="10"/>
    <s v="j"/>

    <!-- use data from PivotCacheDefinitions1.xml's 3rd 'CacheField Tag'  -->
    <x v="1"/>

    <x v="3"/>
    <n v="1.20452"/>
    <x v="5"/>
    <n v="4.1510311161292464"/>
    <b v="0"/>
    <m/>
    <s v="blue"/>
  </r>

xl/pivotCache/PivotCacheDefinitions1.xml

  <cacheFields count="10">
    <cacheField name="Id" numFmtId="0">
      <sharedItems containsSemiMixedTypes="0" containsString="0" containsNumber="1" containsInteger="1" minValue="1" maxValue="10"/>
    </cacheField>
    <cacheField name="Name" numFmtId="0">
      <sharedItems/>
    </cacheField>

    <!-- <x v="1"/> above is a zero-based index into these shared items, so it refers to <s v="yellow"/> -->
    <cacheField name="Category" numFmtId="0">
      <sharedItems count="2">
        <s v="blue"/>
        <s v="yellow"/>
      </sharedItems>
    </cacheField>

    <cacheField name="Value" numFmtId="0">
      <sharedItems containsSemiMixedTypes="0" containsString="0" containsNumber="1" containsInteger="1" minValue="5" maxValue="20" count="4">
        <n v="10"/>
        <n v="20"/>
        <n v="15"/>
        <n v="5"/>
      </sharedItems>
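
The lookup described above can be sketched in a few lines of Rust. This is only an illustration, assuming `v` is a zero-based index into the shared items of the same column; `CacheValue` and `resolve_record` are hypothetical names and are not part of calamine or of this PR.

```rust
// Hypothetical model of one value inside an <r> record.
// `Shared(i)` stands for <x v="i"/>, an index into the column's shared items.
#[derive(Debug, Clone, PartialEq)]
enum CacheValue {
    Number(f64),
    Text(String),
    Bool(bool),
    Missing,
    Shared(usize),
}

// Replaces every `Shared(i)` with item `i` (zero-based) from that column's
// shared items. Returns `None` if a column has no shared-items entry or an
// index is out of range.
fn resolve_record(
    record: &[CacheValue],
    shared_items: &[Vec<CacheValue>],
) -> Option<Vec<CacheValue>> {
    record
        .iter()
        .enumerate()
        .map(|(col, value)| match value {
            CacheValue::Shared(i) => shared_items.get(col)?.get(*i).cloned(),
            other => Some(other.clone()),
        })
        .collect()
}
```

With the "Category" shared items from the sample (`blue`, `yellow`), a record holding `Shared(1)` in the third column would resolve to `Text("yellow")`.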

Example



    let mut wb: Xlsx<_> = wb("pivots.xlsx");
    for result in wb.get_pivot_data_by_name_ref("PivotTable1").unwrap() {
        println!("{:?}", result);
    }

    /*
     prints the following:
     
    [String("Id"), String("Name"), String("Category"), String("Value"), String("Size"), String("Date"), String("Value / Size"), String("IsBlue"), String("Null"), String("Misc")]
    [Int(1), String("a"), String("blue"), Int(10), Float(1.78), DateTimeIso("2024-11-01T00:00:00"), Float(5.617977528089887), Bool(true), Empty, Empty]
    [Int(2), String("b"), String("blue"), Int(20), Float(2.012), DateTimeIso("2024-01-04T00:00:00"), Float(9.940357852882704), Bool(true), Empty, Float(2.012)]
...
*/

This may be determined to be outside the scope of Calamine, but if it fits then it will need to be applied to the other workbook formats (only .xlsx is supported currently) and will need more work in a few places, such as error handling.

@jmcnamara
Collaborator

jmcnamara commented Sep 18, 2025

Overall it looks okay. However there are a number of issues to fix before review:

  • Rebase to master on tafia/calamine. There are some fixes on master relative to your clone.
  • Probably best to move the changes to a branch on your repo for easier PRs.
  • Turn on the CI on your branch and fix any issues before resubmitting the PR. This PR fails in the calamine CI.
  • Run cargo fmt on the code.
  • Fix any cargo clippy issues.
  • Fix the warnings from cargo build -F pivot-cache.
  • Comment style should be proper sentence case with period at the end.
  • Don't use /// comments for non public comments. Use //.
  • Explain why fn xml_reader() is being made public. Use pub(crate) if necessary or don't make it public if it isn't necessary.
  • Probably best to upgrade to Rust v1.89.0 or 1.90.0, if not already using them. This will give you the latest clippy at least.
  • Avoid making whitespace changes like removing blank lines unless there is a valid reason.
  • Rebase your local changes, to fix the above issues, into a single commit.
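
As an illustration of the comment-style and visibility points in the list above, here is a minimal sketch. The module and function names are made up for the example and do not come from the PR.

```rust
mod xlsx {
    /// Returns the pivot table names found in a comma-separated list.
    ///
    /// This function is public, so it carries a `///` doc comment.
    pub fn pivot_table_names(raw: &str) -> Vec<String> {
        split_names(raw)
    }

    // Internal helper: visible inside the crate, not exported to users.
    // It uses a plain `//` comment because it is not public documentation.
    pub(crate) fn split_names(raw: &str) -> Vec<String> {
        raw.split(',')
            .map(str::trim)
            .filter(|name| !name.is_empty())
            .map(str::to_string)
            .collect()
    }
}
```

`pub(crate)` keeps a helper reachable from other modules of the same crate without making it part of the published API.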

@jmcnamara jmcnamara self-assigned this Sep 18, 2025
@jmcnamara jmcnamara added enhancement awaiting user changes Awaiting changes to a PR to fix requested changes or CI issues. xlsx labels Sep 18, 2025
@siqpush
Author

siqpush commented Sep 20, 2025

@jmcnamara I may have accidentally resubmitted the PR. I took your advice on moving to a branch on my own repo but didn't realize it would automatically resubmit once I rebased it to my master. If that was the case I still have a little cleanup left removing the unnecessary comments.

Also, do you plan to have this released for just .xlsx workbooks then have the remaining prd separately?

@jmcnamara
Collaborator

but didn't realize it would automatically resubmit once I rebased it to my master. If that was the case I still have a little cleanup left removing the unnecessary comments.

That is fine, and normal. People often submit the PR as "draft" (it is an option in the initial GitHub dialog or you can do it in git) while they are iterating and then move it to full once it is ready to merge.

Also, do you plan to have this released for just .xlsx workbooks then have the remaining prd separately?

I would think it would be too hard to do this for xls. It may be possible to do it for xlsb and I don't know about ods. So I think it is probably ok to work it out for just xlsx in this PR.

Also, I will run the AI code review on this later. Use the usual amount of judgement in relation to the suggestions.

@jmcnamara jmcnamara requested a review from Copilot September 20, 2025 23:22

Copilot AI left a comment


Pull Request Overview

This PR adds pivot table data reading functionality to the Calamine library, specifically for XLSX files. The feature allows users to extract the underlying data (pivot cache) that supports pivot tables, which can be useful for auditing filtered content in externally sourced pivot tables.

Key changes:

  • Adds a new pivot-cache feature flag to enable pivot table functionality
  • Implements pivot table metadata parsing and data extraction from XLSX files
  • Provides public API methods for accessing pivot table names and data

Reviewed Changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 9 comments.

File              Description
Cargo.toml        Adds the new pivot-cache feature flag
src/lib.rs        Includes the new pivot module when the feature is enabled
src/pivot.rs      Core pivot table data structures and parsing utilities
src/xlsx/mod.rs   Main implementation of pivot table reading functionality for XLSX files
src/auto.rs       Commented placeholder for future pivot table support in auto-detection
src/ods.rs        Commented placeholder for future ODS pivot table support
src/xls.rs        Commented placeholder for future XLS pivot table support
src/xlsb.rs       Commented placeholder for future XLSB pivot table support
tests/test.rs     Comprehensive test cases for the new pivot table functionality


@jmcnamara jmcnamara marked this pull request as draft September 23, 2025 10:51
@siqpush
Author

siqpush commented Sep 23, 2025

@jmcnamara thank you for the quick feedback and shared git knowledge. Since this branch is now a draft I'll keep the same approach. Also, the copilot suggestions were actually decent for identifying places I intended to come back to (unwraps and such).

@sftse thanks as well for the feedback. I took your approach in many of the suggestions (those with the thumbs up) and addressed comments / questions for the others.

Contributor

@sftse sftse left a comment


Think most uses of ref here can be removed.

@siqpush
Author

siqpush commented Sep 25, 2025

@jmcnamara I believe the failing check for stable is due to the addition of clippy in 458b8ca. The image installs the toolchain with --profile minimal, and that profile does not include clippy by default. See line 48 in the screenshot, which would conflict.

image

@jmcnamara
Collaborator

@siqpush Thanks. I'll fix that.

I wonder why it was working previously.

@jmcnamara
Collaborator

@siqpush Thanks it is fixed on master. You will need to do a fetch and rebase to pick up the changes.

@siqpush
Author

siqpush commented Sep 25, 2025

@siqpush Thanks it is fixed on master. You will need to do a fetch and rebase to pick up the changes.

Using GitHub "sync", your code was merged, not rebased, onto my branch. I saw your note too late; my local branch had already been merge-squashed by then. Tried the rebase follow up but it was useless. @jmcnamara

@jmcnamara
Collaborator

Tried the rebase follow up but it was useless.

Don't worry about it, we can sort it out at merge.

@siqpush siqpush marked this pull request as ready for review September 27, 2025 20:26
@jmcnamara
Collaborator

@siqpush Could you "Resolve" the comments that have been already addressed to make the review cleaner. Thanks.

@jmcnamara
Collaborator

I fixed the failing typo check on master. If you rebase to that it will fix the failing CI check.

@jmcnamara
Collaborator

@siqpush I would like to merge this for the next release but there is a conflict due to recent changes. Could you fix this?

@sftse
Contributor

sftse commented Nov 18, 2025

@jmcnamara I don't think we should merge this in its current state. The commit history is very confusing (empty rebase commits, merge commits inside this PR) and I think the code could still use some improvements.
I can make another review.

@jmcnamara
Collaborator

The commit history is very confusing (empty rebase commits, merge commits inside this PR)

@sftse I could probably deal with that but if it still needs reviewing I will pause the merge.

@siqpush Could you try rebasing this down into a single commit that is up to date with main? Or start a clean branch and a new PR.

@siqpush
Author

siqpush commented Nov 18, 2025

@sftse @jmcnamara does commit 0b7a884 better present the fix for you? On my side it shows that I want to merge both my commits and others into master (very confusing unless I drill into the commit link).

Feels like a new PR may be the way...

@jmcnamara
Collaborator

does commit 0b7a884 better present the fix for you

Yes. That looks clean from the point of view of a review, but it seems to be detached from a branch.

@sftse
Contributor

sftse commented Nov 19, 2025

It may be easier to visualize what is going on with this PR. git log --all --graph may help with orientation for what to merge into what and what a clear history should look like.

Here is a more compact rendering.

2025-11-19-103449_1031x989_scrot

As a rule of thumb, do not merge master into other branches, as this makes for a confusing history. GitHub is confused by it as well, as you can see from how it renders the commits belonging to this PR: it treats the commits that were merged from master into your branch (labeled pr-559 in the rendered graph) as belonging to this PR, even though they do not contribute to the final diff or to understanding what has changed. At that point the individual commits are less useful than they could have been, and it may be clearer to just squash all of them into a single one.

Of note as well: it seems you rebased your commits beginning at 175f and then decided to merge the rebased branch into the old one. This duplicates a large number of commits; see 175f and aff6.

@jmcnamara
Collaborator

This eliminates the problem of having to add a feature flag to initialize the pivot tables deep within Xlsx::new and the additional costs for every caller, even the ones that don't need the feature.

I think we should only use feature flags for features that require an additional dependency like "chrono". I think all other features like "pictures" should be initialised when the user calls them. (This isn't 100% simple in some cases since it could require a second parsing of parts of a file but in general it should be achievable).
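
One way to picture the "initialise when the user calls it" approach is a lazily filled cache. The sketch below is an assumption about how such a design could look, not the PR's code: `Workbook`, `raw_pivot_part`, and `parse_count` are hypothetical, and the "parsing" is just splitting lines.

```rust
use std::cell::{Cell, OnceCell};

// Hypothetical stand-in for a workbook reader; not calamine's real `Xlsx` type.
struct Workbook {
    // Raw text standing in for the pivot parts of the archive.
    raw_pivot_part: String,
    // Filled in on the first request, then reused.
    pivot_names: OnceCell<Vec<String>>,
    // Counts how often the expensive parse ran (for illustration only).
    parse_count: Cell<u32>,
}

impl Workbook {
    // Opening the workbook does no pivot work at all.
    fn new(raw_pivot_part: &str) -> Self {
        Workbook {
            raw_pivot_part: raw_pivot_part.to_string(),
            pivot_names: OnceCell::new(),
            parse_count: Cell::new(0),
        }
    }

    // Parses the pivot metadata on first use and caches the result.
    fn pivot_names(&self) -> &[String] {
        self.pivot_names.get_or_init(|| {
            self.parse_count.set(self.parse_count.get() + 1);
            self.raw_pivot_part.lines().map(str::to_string).collect()
        })
    }
}
```

Callers that never ask for pivot data pay nothing, and repeated calls reuse the first result. The trade-off mentioned above still applies: a real reader may have to reopen parts of the file on that first call.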

siqpush and others added 5 commits November 25, 2025 21:10
Co-authored-by: sftse <c@farsight.net>
Co-authored-by: sftse <c@farsight.net>
* adding pivotref vec wrapper for public facing

* removing pivot cache mod

* tag enumeration

* misc design changes

---------

Co-authored-by: GitHub Actions <actions@github.com>
Co-authored-by: GitHub Actions <actions@github.com>
Co-authored-by: GitHub Actions <actions@github.com>
Contributor

@sftse sftse left a comment


Please look proactively for similar code patterns to the ones we highlight and consider how additional commits can help reviewers understand the changes.

Git diff relies on heuristics that are easily trashed by big diffs, so I'd overindex on small diffs and rather too many commits than too few.

Comment on lines +2506 to +2507
Ok(pivots_on_sheet)
}
Contributor


This is my pattern-matching kicking in: is it correct that multiple pivots per sheet are permitted? Other functions error when more than one candidate item is found in the zip.

If you recheck, can you add a reference to the spec?

Author


The link / screenshot to the reference is below. Originally this was just something I noticed during a few random tests. The test referenced in this comment #559 (comment) was added to address it.

https://web.mit.edu/~stevenj/www/ECMA-376-new-merged.pdf

See section 12.3.11 (page 78 in the spec, or page 90 of the PDF); otherwise, see the screenshot below:

image

Contributor

@sftse sftse left a comment


Thanks for bringing this PR across the finish line, nearly there!

src/xlsx/mod.rs Outdated
Comment on lines 2694 to 2699
pub fn get_pivot_tables_by_name_and_sheet(&self) -> Vec<(String, String)> {
    self.0
        .iter()
        .map(|pt| (pt.sheet().to_string(), pt.name().to_string()))
        .collect()
}
Contributor


Some of the proposed API is hard to judge as-is. Can we have a complete example of how to use this in practice?

As an alternative, would it make sense to expose fn iter(&self) -> impl Iterator<Item = &'_ PivotTableRef> and let the caller use the API on PivotTableRef to call pt.name() and pt.sheet() as needed? This function just changes the representation of the data, which is not in itself a reason to exclude it, but would need more justification.
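
To make the alternative concrete, here is a minimal sketch of an `iter()`-based design. The struct layout is an assumption: `PivotTableRef` and `PivotTables` below are simplified stand-ins for the PR's types, and `names_on_sheet` is a hypothetical caller-side helper.

```rust
// Simplified stand-in for the PR's `PivotTableRef`; the fields are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct PivotTableRef {
    name: String,
    sheet: String,
}

impl PivotTableRef {
    fn new(sheet: &str, name: &str) -> Self {
        PivotTableRef {
            name: name.to_string(),
            sheet: sheet.to_string(),
        }
    }

    fn name(&self) -> &str {
        &self.name
    }

    fn sheet(&self) -> &str {
        &self.sheet
    }
}

// Tuple-struct wrapper, mirroring the `self.0` access in the snippet above.
struct PivotTables(Vec<PivotTableRef>);

impl PivotTables {
    // Borrowing iterator: callers pick whichever fields they need.
    fn iter(&self) -> impl Iterator<Item = &PivotTableRef> {
        self.0.iter()
    }
}

// Caller-side equivalent of a "pivot tables on this sheet" helper.
fn names_on_sheet<'a>(tables: &'a PivotTables, sheet: &str) -> Vec<&'a str> {
    tables
        .iter()
        .filter(|pt| pt.sheet() == sheet)
        .map(|pt| pt.name())
        .collect()
}
```

With this shape, per-sheet or per-name lookups become ordinary iterator chains on the caller's side, and no `String` is allocated unless the caller asks for one.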

Author

@siqpush siqpush Nov 30, 2025


I think it felt strange because I exposed functionality to PivotTables and even PivotTableRef that felt like they should be able to get data on their own. In turn this made functions like the above feel extra.

As for the alternative, my gut feeling is that, because we also have get_pivot_tables_by_sheet, that function would then need to be removed as well. And if we hand the filter/map to the user, then at best they are left having to do some extra doc reading on some of the uniqueness subtleties.

As for a concrete example, I imagine auditing data in a workbook that might be sensitive to expose to the wrong user, client, etc. Using the relevant workbook from the tests, this is how I would avoid leaking the top secret details of Category Blue:

let mut workbook: Xlsx<_> = open_workbook(path)?;
let pivot_tables = workbook.read_pivot_table_metadata()?;
for (sheet, pt) in pivot_tables.get_pivot_tables_by_name_and_sheet() {
    let mut check_col = 0;
    for (row_number, row) in workbook.pivot_table_data(&pivot_tables, sheet, pt)?.enumerate() {
        // The header is the first row: locate the "Category" column.
        if row_number == 0 {
            for (col_number, field) in row.iter().enumerate() {
                if field.eq(&crate::calamine::Data::String("Category".to_string())) {
                    check_col = col_number;
                }
            }
        } else if row[check_col].eq(&crate::calamine::Data::String("blue".to_string())) {
            panic!("Blue should not be included in this report.");
        }
    }
}
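
The same audit can be sketched without the workbook plumbing, returning the offending rows instead of panicking. This is a simplified illustration: `Value` is a reduced stand-in for calamine's `Data` enum, and `find_forbidden_rows` is a hypothetical helper, not part of the PR.

```rust
// Reduced stand-in for calamine's `Data` enum, limited to what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Text(String),
    Int(i64),
    Empty,
}

// Returns the indices of rows (the header is row 0) whose `column` cell equals
// `forbidden`. Errors if there is no header row or the header lacks `column`.
fn find_forbidden_rows(
    rows: &[Vec<Value>],
    column: &str,
    forbidden: &str,
) -> Result<Vec<usize>, String> {
    let header = rows.first().ok_or("no header row")?;
    let col = header
        .iter()
        .position(|cell| matches!(cell, Value::Text(name) if name == column))
        .ok_or_else(|| format!("column {column:?} not found in header"))?;

    Ok(rows
        .iter()
        .enumerate()
        .skip(1) // Skip the header row.
        .filter(|(_, row)| matches!(row.get(col), Some(Value::Text(v)) if v == forbidden))
        .map(|(row_number, _)| row_number)
        .collect())
}
```

Unlike the loop above, a missing "Category" header is reported as an error here rather than silently falling back to column 0.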

Contributor

@sftse sftse left a comment


Haven't forgotten about this PR, sorry about the delays.
I'd have to find time to use the current API to endorse it, but the internal details are getting close to being right.

@siqpush
Author

siqpush commented Dec 10, 2025

Haven't forgotten about this PR, sorry about the delays. I'd have to find time to use the current API to endorse it, but the internal details are getting close to being right.

Don't be. I was happy you got a healthy break from reviewing my mess.

Let me know what you think once you get a moment!


Labels

awaiting user changes Awaiting changes to a PR to fix requested changes or CI issues. enhancement xlsx


4 participants