Improve sample output type page by lshandross · Pull Request #439 · hubverse-org/hubDocs

lshandross · 2026-01-28T17:40:32Z

Major overhaul of the sample output type docs page.

This doesn't require any code review, but I'd like a fairly detailed review to ensure the changes I've made 1) are accurate, 2) address the questions and comments in #169, and 3) are understandable to the wider hubverse community (not just us developers). It will likely be useful to have multiple people provide their reviews. Not time sensitive, but it would be good to get this merged sooner rather than later.

…sks" sections

micokoch

Overall this is a good improvement, and you address all the comments Zhian had made. It's still a bit tricky to follow, but that may be the nature of the beast. I made various comments and suggestions, and I hope they make sense. My overall suggestions are:

- Ensure consistency with terms. Task ID is my main issue, as it is written differently in different parts, and I wish we were consistent across hubDocs.
- I think you should have a mini-glossary or a summary at the very beginning of key terms like compound_idx, task ID, etc. I know this would be duplicative, but as I was reading, I thought it would be helpful to go back to the beginning and see what those concepts mean.
- Finally, I really think you should include values in the value column. It would help illustrate.

Great job, and I hope others comment, as it's important to get many eyes to ensure these explanations are clear.

docs/source/user-guide/sample-output-type.md

- Introduce "modeling task" as a slice of task ID space
- Add "Sampling modeling tasks" section bridging to marginal/joint concepts
- Use "response dependence" terminology consistently throughout
- Reorder examples (A→B→C→D) to build complexity progressively: A: No dependence, B: Variants, C: Horizons (trajectories), D: Both
- Restructure examples: show data tables first, then configuration
- Fix typos in compound_taskid_set field name
- Update validation table and references to match new example ordering
- Add .claude/ to gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…rizons

- Accept reviewer suggestion to say "single prediction" instead of "single predicted value"
- Add "that we are interested in predicting" qualifier per reviewer suggestion
- Change all horizon values from days to weeks throughout the page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-authored-by: Li Shandross <57642277+lshandross@users.noreply.github.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…stions Suggestions for sample output type documentation clarity

lshandross · 2026-03-18T20:43:09Z

@micokoch I'm still working on adding values for the value column, but I want to get your opinion on the section on key terms with definitions. Are there any terms that I should add? And what do you think of the definitions I provided? (I tweaked the definitions for the terms already given elsewhere in the docs to fit the information on this page)

micokoch · 2026-03-19T16:09:56Z

@lshandross - thanks for the changes, Li! I'm way past my 5 weekly hours for this week, so I'll take a closer look at this next week. Hope that's okay, but let me know if you need it sooner.

lshandross · 2026-03-23T15:39:28Z

@micokoch Some other questions about the values for the example model outputs: For the section with the four examples, should the sample prediction values be the same in all cases (i.e., the value column is identical across examples A-D)? Or would it be more useful for them to be different?

nickreich

I have a handful of mostly minor phrasing comments.

nickreich · 2026-03-23T20:21:47Z

docs/source/user-guide/sample-output-type.md

+: A definition of the goals of a modeling effort (i.e., what it is hoping to predict), possibly including conditions, assumptions, and task ID variables. Generally, they are defined by unique combinations of the task IDs.
+
+[task IDs (AKA task ID variables, task ID columns)]{#key-term-task-ids}
+: Column(s) in model output that provide details about what is being predicted. These "task ID" columns may also include additional information, such as any conditions or assumptions used to generate the predictions. Common examples include `target`, `location`, `reference_date`, and `horizon`.


I'd suggest moving the middle sentence about additional information to the end of this paragraph and maybe adding an additional example here, as I think this is taken from a scenario hub?

nickreich · 2026-03-23T20:22:51Z

docs/source/user-guide/sample-output-type.md

+: Column(s) in model output that provide details about what is being predicted. These "task ID" columns may also include additional information, such as any conditions or assumptions used to generate the predictions. Common examples include `target`, `location`, `reference_date`, and `horizon`.
+
+[output type ID]{#key-term-output-type-id}
+: A column in model output that specifies more identifying information specific to the output type. For samples, an integer or string providing a unique identifier for the sample. If one or more task IDs display of response dependence, rows of sample predictions from a particular model that share an output type ID may be assumed to represent a single sample from a joint distribution across multiple levels of the task ID variables.


I don't understand the sentence that starts "If one or more ..."

nickreich · 2026-03-23T20:24:08Z

docs/source/user-guide/sample-output-type.md

+: A column in model output that specifies more identifying information specific to the output type. For samples, an integer or string providing a unique identifier for the sample. If one or more task IDs display of response dependence, rows of sample predictions from a particular model that share an output type ID may be assumed to represent a single sample from a joint distribution across multiple levels of the task ID variables.
+
+[compound_idx]{#key-term-compound_idx}
+: A column used a visual aid to indicate which rows belong to the same group in the example model output on this page. In the marginal case, each group contains samples for one modeling task. The `compound_idx` column is not a task ID variable and is not typically present in actual model output data.


It's not just a "visual aid", right? It's actually a specific identifier in this example. That isn't clear at this point in the description.

It also feels like these definitions come a bit early in the description.
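
To make the "specific identifier" reading concrete, here is a minimal Python sketch of how a `compound_idx` column could be derived from a set of task ID columns. The function name `assign_compound_idx` and the example rows are hypothetical, invented for illustration; this is not how hubverse tooling computes anything, only one way to picture the grouping.

```python
def assign_compound_idx(rows, compound_taskid_set):
    """Give rows that share values for every task ID in
    compound_taskid_set the same 1-based group index."""
    index_by_key = {}
    labelled = []
    for row in rows:
        # The group key is the tuple of values for the chosen task IDs.
        key = tuple(row[task_id] for task_id in compound_taskid_set)
        if key not in index_by_key:
            index_by_key[key] = len(index_by_key) + 1
        labelled.append({**row, "compound_idx": index_by_key[key]})
    return labelled


# Illustrative rows only (values are made up for this sketch).
rows = [
    {"reference_date": "2024-03-15", "horizon": 1, "location": "MA"},
    {"reference_date": "2024-03-15", "horizon": 1, "location": "MA"},
    {"reference_date": "2024-03-15", "horizon": 2, "location": "MA"},
    {"reference_date": "2024-03-15", "horizon": 2, "location": "MA"},
]

# Marginal case: every task ID is in the set, so each modeling task is its own group.
marginal = assign_compound_idx(rows, ["reference_date", "horizon", "location"])
# Grouped case: horizon is left out, so one group spans both horizons.
grouped = assign_compound_idx(rows, ["reference_date", "location"])

print([r["compound_idx"] for r in marginal])  # [1, 1, 2, 2]
print([r["compound_idx"] for r in grouped])   # [1, 1, 1, 1]
```

Read this way, `compound_idx` is fully determined by the task ID values and the chosen set, which is why it does not need to be stored in the model output itself.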

nickreich · 2026-03-23T20:25:28Z

docs/source/user-guide/sample-output-type.md


-## Individual modeling tasks
-In many settings, forecasts will be made for individual modeling tasks, with no notion of modeling tasks being related to each other or collected into sets (for more on this, see the [compound modeling tasks section](#compound-modeling-tasks)). In the situations where forecasts are assumed to be made for individual modeling tasks, every modeling task is treated as distinct, as is implied by the `compound_idx` column in the table below (grayed out to indicate that such a column exists implicitly in the dataset and is not typically present in the actual tabular data). In this setting, the `output_type_id` column indexes the samples that exist for each modeling task.
+While the mean output type produces a single predicted value per modeling task, the sample output type captures uncertainty by providing multiple possible values at each slice of task ID space. Each sample represents one possible outcome, and together, the collection of samples describes a distribution of predicted values.


I don't love the "slice of task ID space" language. It feels overly jargony.

Suggested change

While the mean output type produces a single predicted value per modeling task, the sample output type captures uncertainty by providing multiple possible values at each slice of task ID space. Each sample represents one possible outcome, and together, the collection of samples describes a distribution of predicted values.

While the mean output type produces a single predicted value per modeling task, the sample output type captures uncertainty by providing multiple possible values for each task. Each sample represents one possible outcome, and together, the collection of samples describes a distribution of predicted values.

nickreich · 2026-03-23T20:26:32Z

docs/source/user-guide/sample-output-type.md

-In many settings, forecasts will be made for individual modeling tasks, with no notion of modeling tasks being related to each other or collected into sets (for more on this, see the [compound modeling tasks section](#compound-modeling-tasks)). In the situations where forecasts are assumed to be made for individual modeling tasks, every modeling task is treated as distinct, as is implied by the `compound_idx` column in the table below (grayed out to indicate that such a column exists implicitly in the dataset and is not typically present in the actual tabular data). In this setting, the `output_type_id` column indexes the samples that exist for each modeling task.
+While the mean output type produces a single predicted value per modeling task, the sample output type captures uncertainty by providing multiple possible values at each slice of task ID space. Each sample represents one possible outcome, and together, the collection of samples describes a distribution of predicted values.
+
+How modeling tasks relate to each other determines the structure of this distribution. When modeling tasks are treated independently, samples at each slice form a univariate (single-variable) distribution. When modeling tasks are grouped together, samples capture a multivariate (joint) distribution across the group.


Suggested change

How modeling tasks relate to each other determines the structure of this distribution. When modeling tasks are treated independently, samples at each slice form a univariate (single-variable) distribution. When modeling tasks are grouped together, samples capture a multivariate (joint) distribution across the group.

How modeling tasks relate to each other determines the structure of this distribution. When modeling tasks are treated independently, samples for each task form a univariate (single-variable) distribution. When modeling tasks are grouped together, samples capture a multivariate (joint) distribution across the group of tasks.

nickreich · 2026-03-23T20:30:38Z

docs/source/user-guide/sample-output-type.md

+| 3 | 2024-03-15 | 3 | MA | sample | 8| - |
+| 3 | 2024-03-15 | 3 | MA | sample | 9| - |
+
+In this setting, a hub will specify a minimum and maximum number of required samples per group in the configuration for the prediction task. In the marginal case, each group corresponds to a single modeling task, but as we will see in the [compound modeling tasks section](#compound-modeling-tasks), a group can span multiple modeling tasks. The associated configuration might look like:


Suggested change

In this setting, a hub will specify a minimum and maximum number of required samples per group in the configuration for the prediction task. In the marginal case, each group corresponds to a single modeling task, but as we will see in the [compound modeling tasks section](#compound-modeling-tasks), a group can span multiple modeling tasks. The associated configuration might look like:

In this setting, a hub will specify a minimum and maximum number of required samples per group in the configuration for the prediction task. In this "marginal" case, each group corresponds to a single modeling task, but as we will see in the [compound modeling tasks section](#compound-modeling-tasks), a group can span multiple modeling tasks. The associated configuration might look like:
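
Separately from the hub configuration itself, the "minimum and maximum number of required samples per group" idea can be sketched as a small Python check. The helper names, the row layout, and the sample counts below are assumptions made for illustration; they are not taken from the docs page or from any hubverse validation package.

```python
from collections import defaultdict


def samples_per_group(rows, compound_taskid_set):
    """Count distinct output_type_id values within each group of rows
    that share values for the task IDs in compound_taskid_set."""
    ids_by_group = defaultdict(set)
    for row in rows:
        key = tuple(row[task_id] for task_id in compound_taskid_set)
        ids_by_group[key].add(row["output_type_id"])
    return {key: len(ids) for key, ids in ids_by_group.items()}


def groups_out_of_range(rows, compound_taskid_set, min_samples, max_samples):
    """Return the groups whose sample count falls outside [min_samples, max_samples]."""
    counts = samples_per_group(rows, compound_taskid_set)
    return {k: n for k, n in counts.items() if not (min_samples <= n <= max_samples)}


# Illustrative marginal-case data: 3 samples for each of 3 horizons,
# with sample IDs 1..9 that are unique across modeling tasks.
task_ids = ["reference_date", "horizon", "location"]
rows = []
sample_id = 0
for horizon in (1, 2, 3):
    for _ in range(3):
        sample_id += 1
        rows.append({
            "reference_date": "2024-03-15",
            "horizon": horizon,
            "location": "MA",
            "output_type": "sample",
            "output_type_id": sample_id,
        })

print(groups_out_of_range(rows, task_ids, min_samples=3, max_samples=3))  # {}
```

In the marginal case each key is a single modeling task; passing a smaller set of task IDs would count samples over groups that span several modeling tasks.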

nickreich · 2026-03-23T20:33:17Z

docs/source/user-guide/sample-output-type.md

 ## Compound modeling tasks

-In some settings, modeling hubs may wish to identify sets of modeling tasks that the hub will treat as related, for example, when multiple distinct values can be seen as representations of a single multivariate outcome of interest. In these settings, a subset of the task-id columns (a `"compound_taskid_set"`) will be used to identify what values are shared for the modeling tasks related to each other.
+In the previous section, we saw that when sampling from marginal distributions, each sample is drawn from a single modeling task (a single slice of task ID space). In some settings, however, modeling hubs may wish to capture relationships between modeling tasks by sampling from a joint distribution. This means drawing values for multiple modeling tasks at once as a coherent set.


Suggested change

In the previous section, we saw that when sampling from marginal distributions, each sample is drawn from a single modeling task (a single slice of task ID space). In some settings, however, modeling hubs may wish to capture relationships between modeling tasks by sampling from a joint distribution. This means drawing values for multiple modeling tasks at once as a coherent set.

In the previous section, we saw that when sampling from marginal distributions, each sample is drawn from a single modeling task. In some settings, however, modeling hubs may wish to capture relationships between modeling tasks by sampling from a joint distribution. This means drawing values for multiple modeling tasks at once as a coherent set.

If other people find the "slice" language to be helpful then I will defer. I mostly find it distracting when we have this other clear concept of a "modeling task" in place already. I think we should just use the clear term since we have one for it.

nickreich · 2026-03-23T20:34:47Z

docs/source/user-guide/sample-output-type.md

+- **Variant proportions**: A model predicting proportions of multiple disease variants might draw all variant proportions together, a sample from a joint distribution across variants.

-As a running example of how compound modeling tasks could be specified differently, we will look at a hub reporting on variant proportions observed at a given location and time. In the table below, a single modeling task is a unique combination of values from the task-id variables `origin_date`, `horizon`, `variant`, and `location`.  In the table below, one set of four rows with the same values in the `origin_date`, `horizon`, and `location` columns, but different variant values below represent four predicted variant proportions.
+In both cases, joint sampling introduces additional dimensions to the predictive distribution. Instead of a univariate distribution at each modeling task, we have a multivariate distribution spanning multiple modeling tasks. We refer to this as **response dependence** across the task IDs that vary within a group, because the predicted values (responses) for different modeling tasks are statistically dependent.


Suggested change

In both cases, joint sampling introduces additional dimensions to the predictive distribution. Instead of a univariate distribution at each modeling task, we have a multivariate distribution spanning multiple modeling tasks. We refer to this as **response dependence** across the task IDs that vary within a group, because the predicted values (responses) for different modeling tasks are statistically dependent.

In both cases, joint sampling introduces additional dimensions to the predictive distribution. Instead of a univariate distribution at each modeling task, we have a multivariate distribution spanning multiple modeling tasks. We refer to this as **dependence** across the task IDs that vary within a group, because the predicted values for different modeling tasks are statistically dependent.
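
As an illustration of what dependence across horizons looks like in tabular form, here is a toy Python sketch. The random-walk model, the function name `sample_trajectories`, and all parameter values are hypothetical; the only point is that each draw produces one value per horizon and that those rows share an `output_type_id`.

```python
import random


def sample_trajectories(n_samples, horizons, start=100.0, step_sd=10.0, seed=0):
    """Draw whole trajectories from a toy random walk (an assumed model).

    Each trajectory is one sample from a joint distribution over horizons:
    the value at a horizon builds on the value at the previous horizon.
    """
    rng = random.Random(seed)
    rows = []
    for sample_id in range(1, n_samples + 1):
        value = start
        for horizon in horizons:
            # The next value depends on the previous one, so horizons are dependent.
            value = max(0.0, value + rng.gauss(0.0, step_sd))
            rows.append({
                "horizon": horizon,
                "output_type": "sample",
                "output_type_id": sample_id,  # shared across horizons of one trajectory
                "value": value,
            })
    return rows


rows = sample_trajectories(n_samples=3, horizons=[1, 2, 3])
for row in rows:
    print(row["output_type_id"], row["horizon"], round(row["value"], 1))
```

Sampling each horizon independently would instead give every row its own `output_type_id`, as in the marginal case.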

nickreich · 2026-03-23T20:35:48Z

docs/source/user-guide/sample-output-type.md

+
+Consider two common scenarios:
+- **Trajectories over time**: A model might predict values across multiple time horizons as a coherent path. Rather than drawing each horizon's prediction independently, the model draws an entire trajectory, a sample from a joint distribution over horizons.
+- **Variant proportions**: A model predicting proportions of multiple disease variants might draw all variant proportions together, a sample from a joint distribution across variants.


We could consider adding locations as another (I think more intuitive) dimension in addition to trajectories over time. E.g., does a model simulate trajectories across space and time together, or for each location separately?

nickreich · 2026-03-23T20:42:27Z

docs/source/user-guide/sample-output-type.md

+**Derived task IDs** are a type of task IDs whose values depend wholly on that of other task ID variables. A common example is the `target_end_date` task ID, which tends to be derived from the combination of the `reference_date` (or `origin_date`) and `horizon` task IDs.

-There is a class of task-ids that can cause problems for validation of compound modeling tasks if not properly configured, that of **derived task-ids** i.e. task-ids whose values depend on the values of other task-id variables. An example is the `target_end_date` task-id which is most commonly derived from the combination of the `reference_date` or `origin_date` and `horizon` task-ids.
+These derived task IDs must be properly configured, or they can cause problems when validating compound modeling tasks by throwing erroneous errors. *If **all** the task ID variables a derived task ID is derived from are part of the `compound_taskid_set`, then that derived task ID must also be a part of the `compound_taskid_set`; otherwise, that derived task ID should be excluded.*


Suggested change

These derived task IDs must be properly configured, or they can cause problems when validating compound modeling tasks by throwing erroneous errors. *If **all** the task ID variables a derived task ID is derived from are part of the `compound_taskid_set`, then that derived task ID must also be a part of the `compound_taskid_set`; otherwise, that derived task ID should be excluded.*

These derived task IDs must be properly configured, or they can cause problems when validating compound modeling tasks by throwing errors when they should not. *If **all** the task ID variables a derived task ID is derived from are part of the `compound_taskid_set`, then that derived task ID must also be a part of the `compound_taskid_set`; otherwise, that derived task ID should be excluded.*
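
The italicized rule can also be restated as a short Python sketch. The function name `resolve_compound_taskid_set` and the `derived_from` mapping are hypothetical helpers for illustration; this restates the rule as quoted here and is not the validation code.

```python
def resolve_compound_taskid_set(compound_taskid_set, derived_from):
    """Apply the rule for derived task IDs.

    derived_from maps each derived task ID to the task IDs it is derived from.
    A derived task ID belongs in the set only if ALL of its source task IDs
    are in the set; otherwise it is left out.
    """
    # Start from the non-derived task IDs the hub listed.
    base = [t for t in compound_taskid_set if t not in derived_from]
    resolved = list(base)
    for derived, sources in derived_from.items():
        if all(source in base for source in sources):
            resolved.append(derived)
    return resolved


# target_end_date is derived from reference_date and horizon.
derived_from = {"target_end_date": ["reference_date", "horizon"]}

# horizon is not in the set, so target_end_date is excluded.
print(resolve_compound_taskid_set(["reference_date", "location"], derived_from))
# ['reference_date', 'location']

# Both sources are in the set, so target_end_date must be included too.
print(resolve_compound_taskid_set(["reference_date", "horizon", "location"], derived_from))
# ['reference_date', 'horizon', 'location', 'target_end_date']
```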

lshandross added 2 commits January 26, 2026 17:33

Clarify sample output type "Introduction" and "Individual modeling ta…

ae74101

…sks" sections

Rework sample output type docs page

4cddf1d

lshandross linked an issue Jan 28, 2026 that may be closed by this pull request

sample-output-type.md: Clarifying questions and editing suggestions #169

Open

micokoch approved these changes Feb 6, 2026

micokoch added the documentation Improvements or additions to documentation label Feb 6, 2026

micokoch added this to hubverse Development overview Feb 6, 2026

github-project-automation bot moved this to Reviewed/Ready to Merge in hubverse Development overview Feb 6, 2026

micokoch moved this from Reviewed/Ready to Merge to Ready for Review in hubverse Development overview Feb 6, 2026

annakrystalli and others added 13 commits February 11, 2026 12:57

Update docs/source/user-guide/sample-output-type.md

6355469

Co-authored-by: Li Shandross <57642277+lshandross@users.noreply.github.com>

Clarify Example D subtitle to specify horizons and variants

b1e0324

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #442 from hubverse-org/ak/sample-output-type-sugge…

cc36987

…stions Suggestions for sample output type documentation clarity

Fix typos and improve clarity

8ea8e73

Task ID is written consistently throughout

fe7626c

Remove quotes around compound_taskid_set

f347635

Edits for clarity

a2b5d71

Update to use 1 indexing scheme

b1f74c9

Fix table row shading in Sample Output Type page

e5d67a4

More clarity edits

603d4e7

Add Definitions section to Sample Output Type page

42dcd78

lshandross moved this from Ready for Review to Reviewed/Awaiting Changes in hubverse Development overview Mar 20, 2026

nickreich reviewed Mar 23, 2026

4 participants
