Function To Cast InferenceData Into `tidy_draws` Format #36

AFg6K7h4fhy2 · 2024-10-28T14:20:40Z

For the scope of this PR, please refer to issue #18.


AFg6K7h4fhy2 · 2025-03-12T14:58:05Z

A small note: the following historically worked, but it assumes all chains have the same number of iterations (in tidybayes terms):

tidy_dfs = {
        group: (
            idata_df.select("chain", "draw", cs.starts_with(f"('{group}',"))
            .rename(
                {
                    col: col.split(", ")[1].strip("')")
                    for col in idata_df.columns
                    if col.startswith(f"('{group}',")
                }
            )
            # draw in arviz is iteration in tidybayes
            .rename({"draw": ".iteration", "chain": ".chain"})
            .unpivot(
                index=[".chain", ".iteration"],
                variable_name="variable",
                value_name="value",
            )
            .with_columns(
                pl.col("variable").str.replace(r"\[.*\]", "").alias("variable")
            )
            .with_columns(pl.col(".iteration") + 1, pl.col(".chain") + 1)
            .with_columns(
                (pl.col(".iteration").n_unique()).alias("draws_per_chain"),
            )
            .with_columns(
                (
                    ((pl.col(".chain") - 1) * pl.col("draws_per_chain"))
                    + pl.col(".iteration")
                ).alias(".draw")
            )
            .pivot(
                values="value",
                index=[".chain", ".iteration", ".draw"],
                columns="variable",
                aggregate_function="first",
            )
            .sort([".chain", ".iteration", ".draw"])
        )
        for group in groups
    }

The following method does take into account the number of iterations per chain:

tidy_dfs = {
        group: (
            idata_df.select("chain", "draw", cs.starts_with(f"('{group}',"))
            .rename(
                {
                    col: col.split(", ")[1].strip("')")
                    for col in idata_df.columns
                    if col.startswith(f"('{group}',")
                }
            )
            # draw in arviz is iteration in tidybayes
            .rename({"draw": ".iteration", "chain": ".chain"})
            .unpivot(
                index=[".chain", ".iteration"],
                variable_name="variable",
                value_name="value",
            )
            .with_columns(
                pl.col("variable").str.replace(r"\[.*\]", "").alias("variable")
            )
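            .with_columns(pl.col(".iteration") + 1, pl.col(".chain") + 1)
            .pivot(
                values="value",
                index=[".chain", ".iteration"],
                columns="variable",
                aggregate_function="first",
            )
            # sort before numbering; ".draw" does not exist until the next step
            .sort([".chain", ".iteration"])
            # newer Polars versions deprecate with_row_count in favor of with_row_index
            .with_row_count(name=".draw", offset=1)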
        )
        for group in groups
    }
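To make the difference between the two snippets concrete, here is a minimal pure-Python sketch (not the Polars code above) comparing the two ways of numbering `.draw`. The chain lengths and the helper names `draw_by_formula` and `draw_by_row_count` are hypothetical, chosen only for illustration.

```python
def draw_by_formula(rows):
    """Number draws as (chain - 1) * draws_per_chain + iteration.

    Mirrors the first snippet: draws_per_chain is the number of unique
    iteration values seen across all chains.
    """
    draws_per_chain = len({iteration for _, iteration in rows})
    return [(chain - 1) * draws_per_chain + iteration for chain, iteration in rows]


def draw_by_row_count(rows):
    """Number draws 1..N in (chain, iteration) order, as a row count would."""
    return [draw for draw, _ in enumerate(sorted(rows), start=1)]


# Hypothetical 1-indexed (chain, iteration) pairs:
# chain 1 has 2 iterations, chain 2 has 3.
unequal = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]

formula_draws = draw_by_formula(unequal)      # leaves a gap after chain 1
row_count_draws = draw_by_row_count(unequal)  # consecutive numbering

print(formula_draws)
print(row_count_draws)
```

With equal chain lengths the two numberings agree; with unequal lengths the formula skips values, which is the assumption called out in the note above.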

dylanhmorris

A few small but important things. Thanks, @AFg6K7h4fhy2!

forecasttools/idata_to_tidy.py

dylanhmorris · 2025-03-13T22:56:02Z

forecasttools/idata_to_tidy.py

+                values="value",
+                index=[".chain", ".iteration"],
+                columns="variable",
+                aggregate_function="first",


`aggregate_function` should be None per the docs, no? (Unless I'm misunderstanding what your goal is with this operation.) You want one column for each unique value of "variable" for a given ".chain" and ".iteration".

Suggested change

-                aggregate_function="first",
+                aggregate_function=None,

https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.pivot.html

For the actual pyrenew inference data object, but not for the other examples I have made, using None instead of "first" produces the error below. (I originally used None, before making a notebook and tests with the pyrenew-hew InferenceData object given to me.)

FAILED tests/test_idata_to_tidy.py::test_posterior_predictive_group - polars.exceptions.ComputeError: found multiple elements in the same group, please specify an aggregation function

That suggests something is not correct upstream. Perhaps the variable name regex?
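One way to probe that hypothesis is to look for duplicate (.chain, .iteration, variable) keys in the long-format data just before the pivot, since the error message says a group contained more than one element. The sketch below is a pure-Python illustration with hypothetical variable names (`alpha[0]`, `alpha[1]`, `beta`). It only shows that stripping a bracketed index, as the `str.replace(r"\[.*\]", "")` step does, can make distinct columns collide; whether this is what happens with the pyrenew-hew data is not established in this thread.

```python
import re
from collections import Counter

# Hypothetical variable names after the group prefix has been removed
variables = ["alpha[0]", "alpha[1]", "beta"]

# Same pattern as the str.replace step in the snippet: drop a bracketed index
stripped = [re.sub(r"\[.*\]", "", name) for name in variables]

# Long-format keys for a single (chain, iteration) pair
keys = [(1, 1, name) for name in stripped]

# Keys that occur more than once would need an aggregation function in a pivot
duplicates = {key: count for key, count in Counter(keys).items() if count > 1}
print(duplicates)
```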

I believe the source of this issue is the expression col.split(", ")[1].strip("')") in:

.rename(
    {
        col: col.split(", ")[1].strip("')")
        for col in idata_df.columns
        if col.startswith(f"('{group}',")
    }
)

I changed

col: col.split(", ")[1].strip("')")

to

col: re.search(r",\s*'?(.+?)'?\)", col).group(1)

which I got incorrect a handful of times but which I believe works now.

The target of this regex looks like, e.g., "('posterior', 'alpha')".

In the expression:

- `,` matches the comma.
- `\s*` matches zero or more whitespace characters (`\s` is whitespace, `*` means zero or more).
- `'?` matches an optional single quote (`?` means zero or one time).
- `(.+?)` is the capture group: `.` is any non-newline character, `+` means one or more, and the trailing `?` (which I am not sure is necessary here) makes the match lazy, so the capture is as short as possible. In the example target this captures alpha.
- The second `'?` is the same as before.
- `\)` matches the single closing parenthesis after the capture group (the `\` escapes the parenthesis).
- `.group(1)` in re returns the first captured item, i.e. what `(.+?)` matched.

If I am missing anything or wrote something inaccurately, please, reader, let me know.
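For readers who want to check the pattern, here is a small sketch using only Python's `re` module. The first target is the one quoted above. The second, with a comma and space inside the variable name, is a hypothetical input chosen to show where the two approaches can diverge; it is not a column name confirmed to occur in the pyrenew-hew data. The helper names are illustrative.

```python
import re

PATTERN = r",\s*'?(.+?)'?\)"


def name_via_split(col):
    """Original approach: second ", "-separated piece, with quotes/parens trimmed."""
    return col.split(", ")[1].strip("')")


def name_via_regex(col):
    """Revised approach: text after the first comma, up to the first closing
    parenthesis, without the optional surrounding quotes."""
    return re.search(PATTERN, col).group(1)


simple = "('posterior', 'alpha')"
# Hypothetical name containing ", " inside the brackets
tricky = "('posterior', 'alpha[0, 1]')"

print(name_via_split(simple), name_via_regex(simple))
print(name_via_split(tricky), name_via_regex(tricky))

# With a greedy (.+) the optional quote is absorbed into the capture,
# so the lazy ? in (.+?) is what keeps the trailing quote out.
greedy = re.search(r",\s*'?(.+)'?\)", simple).group(1)
print(greedy)
```

On the simple target both helpers agree. On the hypothetical target the split approach cuts the name at the inner ", " while the regex keeps it whole.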

The aggregate function is now successfully set to None rather than first.

Can you write a test to demonstrate the case where the previous version was failing and the new version succeeds? They both work correctly on the example provided in your comment: "('posterior', 'alpha')".

I do understand the subtle difference between the two approaches, but I do not understand why the split approach did not work in practice.

The split approach did not work on the InferenceData object from pyrenew-hew in tests when the aggregate function was set to None but did work when the aggregate function was set to first. Yes, both the split and re approaches work on "('posterior', 'alpha')".

Thank you for the comment; I will write a test.

> The split approach did not work on the InferenceData object from pyrenew-hew in tests when the aggregate function was set to None but did work when the aggregate function was set to first. Yes, both the split and re approaches work on "('posterior', 'alpha')".

I should probably investigate why, exactly, aggregate_function=None with the split approach doesn't work for the pyrenew inference data.

dylanhmorris · 2025-03-13T22:59:22Z

tests/test_idata_to_tidy.py

+    assert set(result.keys()) == set(simple_inference_data.groups())
+
+
+def test_tidydraws_format(simple_inference_data):


This is nice, but I think it's also important to check the correct tidying of array-valued parameters. The example idata and associated test from forecasttools-R are nice and should be easily portable to Python:

https://github.com/CDCgov/forecasttools/blob/main/data-raw/ex_inferencedata_dataframe.R
https://github.com/CDCgov/forecasttools/blob/main/tests/testthat/test_inferencedata_dataframe_to_tidydraws.R

I will save this task for 2025-03-14.


dylanhmorris · 2025-03-18T16:43:34Z

forecasttools/idata_to_tidy.py

@@ -54,7 +56,7 @@ def convert_inference_data_to_tidydraws(
            idata_df.select("chain", "draw", cs.starts_with(f"('{group}',"))
            .rename(
                {
-                    col: col.split(", ")[1].strip("')")
+                    col: re.search(r",\s*'?(.+?)'?\)", col).group(1)


Feels like this would be better handled by providing a lambda rename mapping to .rename() rather than via dictionary comprehension. See https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rename.html

Hadn't thought of this. In the back of my mind there are reservations about lambda mappings that I've seen from others and haven't verified, but they're not sufficient to hold off here. Polars rename seems apt. Thank you for the quick comment.

I changed the code to:

.rename(
    lambda col: re.search(r",\s*'?(.+?)'?\)", col).group(1)
    if col.startswith(f"('{group}',")
    else col
)

got

forecasttools/idata_to_tidy.py:66:40: B023 Function definition does not bind loop variable `group`

then changed to:

.rename(
    lambda col, group=group: re.search(r",\s*'?(.+?)'?\)", col).group(1)
    if col.startswith(f"('{group}',")
    else col
)

which works w/ tests & linting.

Commenting above for my future self.

Readability concern: at least two different senses of "group" present here.

I'd make it a one-argument lambda and exclude it from linting. Making "group" an argument that defaults to the value of group satisfies the linter but hurts readability, imo.

forecasttools/idata_to_tidy.py:66:40: B023 Function definition does not bind loop variable `group`

Re: #36 (comment)

Will do. I agree, group=group seems extraneous but my usual response is to defer to the linter.
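For context on what B023 guards against, here is a minimal standalone sketch of Python's late-binding closures. It is not the forecasttools code; the group names and list names are made up for illustration. A lambda that refers to a loop variable looks that variable up when it is called, not when it is defined, so the problem only shows up if the call is deferred until after the loop has moved on.

```python
groups = ["posterior", "prior"]

# Deferred calls: every lambda sees the final value of `group`
late = [lambda col: f"{group}:{col}" for group in groups]
late_results = [f("alpha") for f in late]

# Default argument: the current value of `group` is bound at definition time
bound = [lambda col, group=group: f"{group}:{col}" for group in groups]
bound_results = [f("alpha") for f in bound]

# Immediate calls: the lambda is used within the same iteration, so the
# one-argument form gives the same answer as the bound form
immediate_results = [(lambda col: f"{group}:{col}")("alpha") for group in groups]

print(late_results)
print(bound_results)
print(immediate_results)
```

If the callable passed to `.rename()` is applied right away inside each iteration of the comprehension, the third pattern is the relevant one, which is consistent with keeping the one-argument lambda and excluding the check from linting.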


initial commit for this PR; begin skeleton experimentation file

3865001

AFg6K7h4fhy2 self-assigned this Oct 28, 2024

AFg6K7h4fhy2 linked an issue Oct 28, 2024 that may be closed by this pull request

Function to cast InferenceData into tidy_draws format #18

Open

AFg6K7h4fhy2 added feature A new tool or utility being added. High Priority A task that is of higher relative priority. labels Oct 28, 2024

AFg6K7h4fhy2 added this to the [October 28, November 8] milestone Oct 28, 2024

AFg6K7h4fhy2 added Medium Priority A task that is of medium relative priority. and removed High Priority A task that is of higher relative priority. labels Oct 28, 2024

AFg6K7h4fhy2 added 5 commits October 28, 2024 15:27

some unfinished experimentation code; priority status change high to …

763355e

…medium

add first semi-failed attempt at converting entire idata object to ti…

44e7fe2

…dy_draws

add attempt at option 2

31c7b72

slightly modify spread draws example

9a87902

more minor changes to tidy draws notebook

c632ae8

AFg6K7h4fhy2 mentioned this pull request Nov 6, 2024

Utilities Pipeline #16

Open

light edits during DHM convo

123ad51

AFg6K7h4fhy2 modified the milestones: [October 28, November 8], [November 11, November 22] Nov 8, 2024

AFg6K7h4fhy2 modified the milestones: [November 11, November 22], [November 25, December 6] Nov 22, 2024

AFg6K7h4fhy2 added 5 commits November 25, 2024 10:04

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

a3c2d17

…nferencedata-into-tidy_draws-format

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

f44a6ee

…nferencedata-into-tidy_draws-format

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

cb883e3

…nferencedata-into-tidy_draws-format

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

df922d4

…nferencedata-into-tidy_draws-format

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

21968be

…nferencedata-into-tidy_draws-format

AFg6K7h4fhy2 modified the milestones: [November 25, December 6], [December 9, December 20] Dec 9, 2024

AFg6K7h4fhy2 added 4 commits December 9, 2024 11:22

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

7dcd7d3

…nferencedata-into-tidy_draws-format

a DB conversion attempt

718ba85

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

7394d4d

…nferencedata-into-tidy_draws-format

begin references file; create external program folder

4a77d50

AFg6K7h4fhy2 mentioned this pull request Feb 6, 2025

Namespace Resolution #62

Open

AFg6K7h4fhy2 added 2 commits February 6, 2025 09:37

remove tab ignoral

b3f6367

switch from melt to pivot

84bb99e

damonbayer reviewed Feb 7, 2025

.pre-commit-config.yaml (resolved)

AFg6K7h4fhy2 added 2 commits February 10, 2025 11:18

lightweight change to dev deps

1a94da4

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

7a737f6

…nferencedata-into-tidy_draws-format

dylanhmorris reviewed Feb 11, 2025

pyproject.toml (outdated, resolved)

AFg6K7h4fhy2 added 3 commits February 11, 2025 13:32

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

740f9c8

…nferencedata-into-tidy_draws-format

commented change; was examining results

b3544e5

Merge remote-tracking branch 'origin/main' into 18-function-to-cast-i…

3c9609c

…nferencedata-into-tidy_draws-format

AFg6K7h4fhy2 modified the milestones: [December 9, December 20], [February 24, March 7] Mar 3, 2025

AFg6K7h4fhy2 added 3 commits March 12, 2025 11:04

have draw calculation take into account row count

ca7c340

remove extraneous metaflow

9b2cdc9

some of the tests using simple idata class

84ec6e1

AFg6K7h4fhy2 requested a review from dylanhmorris March 13, 2025 14:03

AFg6K7h4fhy2 mentioned this pull request Mar 13, 2025

Arviz To Scoring Vignette #73

Open

additional test

07e83a9

dylanhmorris requested changes Mar 13, 2025

AFg6K7h4fhy2 and others added 5 commits March 13, 2025 19:52

Update forecasttools/idata_to_tidy.py

238c80c

Co-authored-by: Dylan H. Morris <dylanhmorris@users.noreply.github.com>

revert to original aggregate function argument value

27436cd

add base posterior predictive test for pyrenew idata

70219ad

change test path; capture aggregate function error

60e0b5a

change col search method from strip to re

61849e0

dylanhmorris reviewed Mar 18, 2025

AFg6K7h4fhy2 added 3 commits March 18, 2025 13:02

add lambda rename rather than dictionary comprehension

ba2847e

remove comment

15d920b

update pre-commit config file to remove loop binding linting at reque…

0604de1

…st of dhm

damonbayer mentioned this pull request Apr 22, 2025

Save intermediate parquet files instead of csvs CDCgov/pyrenew-hew#436

Closed

