From cc70794b202dd04f0fb3ee79869015be6632e2f5 Mon Sep 17 00:00:00 2001 From: Nicholas Reich Date: Fri, 27 Feb 2026 13:52:25 -0500 Subject: [PATCH 1/8] Add eval-metrics-expansion project poster and fix README folder path Adds the project poster for expanding hubverse evaluation metrics and dashboard support (five mini-sprints: UI polish, config-driven enhancements, scale transforms, variogram score, documentation). Also corrects the README poster instructions to use `project-posters/` instead of `posters/`, matching the actual convention in the repo. Co-Authored-By: Claude Sonnet 4.6 --- README.md | 8 +- .../eval-metrics-expansion/AGENTS.md | 86 +++++ .../eval-metrics-expansion.md | 324 ++++++++++++++++++ 3 files changed, 414 insertions(+), 4 deletions(-) create mode 100644 project-posters/eval-metrics-expansion/AGENTS.md create mode 100644 project-posters/eval-metrics-expansion/eval-metrics-expansion.md diff --git a/README.md b/README.md index c4f50fc..ed7df38 100644 --- a/README.md +++ b/README.md @@ -42,11 +42,11 @@ New projects require a project poster. To create a project poster, 1. create a new branch with your initials, the word "poster", and the project - name (e.g. (znk/poster/hub-docker-containers) + name (e.g. `ngr/poster/eval-metrics-expansion`) 2. make a copy of [templates/poster-template.md](templates/poster-template.md) - into the a subfolder `posters/` with the format `/.md` - where `` is the name of the project. You can place any supporting - files inside the `posters//` folder. + into the subfolder `project-posters//` with the filename + `.md`, where `` is the name of the project. You can place + any supporting files inside the `project-posters//` folder. 3. follow the instructions in the template 4. create a pull request with the format `[poster] project title` 5. 
request a review from your collaborators with a timeline (suggestion of 1 week) diff --git a/project-posters/eval-metrics-expansion/AGENTS.md b/project-posters/eval-metrics-expansion/AGENTS.md new file mode 100644 index 0000000..d829584 --- /dev/null +++ b/project-posters/eval-metrics-expansion/AGENTS.md @@ -0,0 +1,86 @@ +# AGENTS.md — Eval Metrics Expansion Project + +This file provides context for anyone (human or AI agent) arriving at this project poster for the first time. + +## What this project is + +A 1–3 month development sprint to expand the hubverse forecast evaluation ecosystem in two areas: +1. Better end-user and developer experience in the existing dashboard +2. Two new metric capabilities: log-scale scoring and the variogram score + +The full plan is in **`eval-metrics-expansion.md`** in this directory. + +## The hubverse evaluation pipeline (4 repos) + +``` +hubEvals (R pkg) + └── core scoring library; wraps scoringutils; exposes score_model_out() + ↓ +hubPredEvalsData (R pkg) + └── orchestrates scoring for a hub; reads predevals-config.yml; + outputs scores.csv files organized by target/eval_set/disaggregate_by + ↓ +hubPredEvalsData-docker + └── Docker container wrapping hubPredEvalsData::generate_eval_data(); + used in hub CI/CD pipelines + ↓ +predevals (JavaScript) + └── client-side module that reads the scores CSV files and renders + interactive tables, heatmaps, and line plots +``` + +Documentation: +- User guide: https://docs.hubverse.io/en/latest/user-guide/dashboards.html#predevals-evaluation-optional +- Developer guide (infrastructure): https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html +- predevals repo: https://github.com/hubverse-org/predevals +- hubPredEvalsData repo: https://github.com/hubverse-org/hubPredEvalsData +- hubEvals repo: https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals +- hubPredEvalsData-docker repo: https://github.com/hubverse-org/hubPredEvalsData-docker + +## Current state (as of 
2026-02-26) + +- **hubEvals v0.1.0** supports: quantile, mean, median, pmf output types; log/sqrt scale transforms already implemented in `score_model_out()` +- **hubEvals PR #103** (branch `ak/sample-scoring/94`) is open and nearly ready to merge — adds `sample` output type with CRPS, bias, DSS, energy score, and dynamic compound metric list (will pick up variogram score automatically once scoringutils#1114 lands) +- **hubPredEvalsData schema v1.0.1** — no support yet for scale transforms or sample output type +- **predevals** — functional but has multiple known UX gaps (see issues) +- **scoringutils** — variogram score being added upstream in issue #1114 + +## Sprint structure + +The sprint is broken into 5 mini-sprints ordered by implementation complexity: + +| Sprint | Scope | Repos touched | Approx. duration | +|--------|-------|--------------|-----------------| +| A | UI-only polish (bugs, ergonomics, metric docs) | predevals only | ~2 weeks | +| B | Config-driven enhancements (decimal places, sort, target names) | hubPredEvalsData schema + predevals | ~3 weeks | +| C | Scale transformation pipeline | hubPredEvalsData schema + predevals | ~4 weeks | +| D | Variogram score / sample scoring | hubEvals (PR #103) + hubPredEvalsData + predevals | ~3–4 weeks | +| E | Developer docs (extend existing guide) | hubDocs | ~2 weeks, can overlap | + +Sprint A should go first; Sprint C depends on Sprint A's table ergonomics fix (#49). Sprints B, C, D share the hubPredEvalsData schema so should release sequentially (B → C → D). Sprint E can begin in parallel with Sprint B but cannot fully close until Sprint D is complete (the worked example depends on Sprints C + D as prior art). + +## Key design decisions made + +- **Scale metrics treated as separate metrics**: When `append=true` in transform config, log-scaled and natural-scale metrics appear as distinct columns/dropdown items (e.g., "wis (natural)" and "wis (log)"), not as a filter/toggle. 
This requires the frozen-column table fix (Sprint A #49) to land first. +- **Variogram score is compound (across locations)**: The variogram score is computed jointly across locations via `compound_taskid_set`. It is NOT disaggregatable by location. hubPredEvalsData should enforce this with a validation error. +- **Scoringutils metrics are automatically available**: Any metric that scoringutils returns as a default for a given output type flows through automatically via `get_standard_metrics()`. No schema/code changes needed beyond adding the output type — just config. + +## Open questions (as of 2026-02-26) + +- When does scoringutils#1114 (variogram score) land? +- How many hubs currently use sample-format forecasts? +- Should `disaggregate_by: location` + variogram score be a silent skip, warning, or error in hubPredEvalsData? +- Is hubPredEvalsData#21 (target metadata) resolved? Unblocks predevals#44. +- ~~Issue #4 (default sort column): schema change or JS-only?~~ **Resolved**: `predevals-config.yml` schema change, included in Sprint B v1.0.2. 
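
The "scale metrics are separate metrics" decision above implies a simple expansion step on the predevals side. A minimal sketch, with hypothetical function and variable names (the actual predevals source will differ):

```javascript
// Hypothetical sketch: expand each base metric into one entry per scale
// present in the scores data, so "wis" with scales ["natural", "log"]
// yields "wis (natural)" and "wis (log)" — distinct dropdown items and
// table columns, not a filter/toggle.
function expandMetricsByScale(metrics, scales) {
  // When the scores data has no `scale` column, metrics pass through unchanged.
  if (!scales || scales.length === 0) return metrics.slice();
  const expanded = [];
  for (const metric of metrics) {
    for (const scale of scales) {
      expanded.push(`${metric} (${scale})`);
    }
  }
  return expanded;
}
```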
+ +## Key files to know + +| File | Purpose | +|------|---------| +| `hubEvals/R/score_model_out.R` | Core scoring function; transform + joint_across params live here | +| `hubEvals/R/validate.R` | Output type validation | +| `hubPredEvalsData/R/utils-metrics.R` | `get_standard_metrics()` maps output types → metric names | +| `hubPredEvalsData/R/generate_eval_data.R` | Top-level orchestration; calls hubEvals | +| `hubPredEvalsData/R/config.R` | Config parsing + validation | +| `hubPredEvalsData/inst/schema/` | JSON schema versions for predevals-config.yml | +| `predevals/src/predevals.js` | Main JS source; reads CSV files, renders UI | diff --git a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md new file mode 100644 index 0000000..2ec7532 --- /dev/null +++ b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md @@ -0,0 +1,324 @@ +# Project Poster: Expanding Hubverse Evaluation Metrics and Dashboard Support + +- Date: 2026-02-26 + +- Owner: Nicholas Reich + +- Status: draft + +## ❓ Problem space + +### What are we doing? + +Extending the hubverse forecast evaluation ecosystem across five goals: + +1. **UI polish for end users**: Fix known usability gaps in the predevals dashboard—metric documentation, score direction indicators, table ergonomics, and several priority bugs—without touching R packages or config schemas. Issues: [predevals#13](https://github.com/hubverse-org/predevals/issues/13), [#31](https://github.com/hubverse-org/predevals/issues/31), [#42](https://github.com/hubverse-org/predevals/issues/42), [#41](https://github.com/hubverse-org/predevals/issues/41), [#49](https://github.com/hubverse-org/predevals/issues/49), [#5](https://github.com/hubverse-org/predevals/issues/5). Explore whether column hiding (so users can simplify the table to focus on just the desired metrics) is possible using DataTables. + +2. 
**Config-driven UI enhancements**: Small additions to the hubPredEvalsData schema enabling per-target decimal precision, human-readable target names, and a configurable default sort column for the evaluation table. Issues: [predevals#48](https://github.com/hubverse-org/predevals/issues/48), [#44](https://github.com/hubverse-org/predevals/issues/44), [#4](https://github.com/hubverse-org/predevals/issues/4). + +3. **Scale transformation pipeline**: Wire the already-implemented log/sqrt transform support in `hubEvals::score_model_out()` through the hubPredEvalsData config schema and the predevals UI so hub admins can evaluate forecasts on transformed scales. Issue: [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34). + +4. **Variogram score**: Add `sample` output type support to hubEvals and expose `variogram_score_multivariate()` and `variogram_score_multivariate_point()` (recently added to scoringutils) as a metric evaluating ensemble spatial correlation structure across locations. + +5. **Developer documentation**: A hubDocs guide explaining the full metric pipeline for future contributors, plus standardized end-user metric definitions in the dashboard. + +### Why are we doing this? + +- Increasing evaluation diversity was the most highly ranked priority in a recent survey of hubverse users. This project surfaces existing metrics, adds new ones, and makes all evaluations more interpretable. +- Many predevals usability gaps have been open for over a year; the dashboard is hard for non-expert users to interpret (no metric definitions, no "lower is better" cues, table ergonomics issues). +- Infectious disease forecasts predict count data that spans orders of magnitude across locations. Log-scale evaluation provides fairer cross-location comparisons. +- The variogram score is a multivariate proper scoring rule capturing spatial correlation—something WIS cannot measure—and scoringutils now supports it natively. 
+- The [existing developer guide](https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html) covers infrastructure (Docker, build, testing) but not metric development workflow, making it hard for new contributors to add metrics end-to-end. + +### What are we _not_ trying to do? + +- Not adding new chart types (existing table/heatmap/line plot are sufficient). +- Not changing the existing WIS/AE/interval coverage metrics. +- Not implementing calibration/reliability diagram visualizations in this sprint. +- Not adding the variogram score for quantile-format forecasts (requires ensemble/sample format). + +### How do we judge success? + +- A first-time user reading the dashboard understands what WIS means, which direction is better, and which models are performing best—without leaving the page. +- A hub admin can configure log-scale evaluation in `predevals-config.yml` and see log-scaled metrics appear as distinct items alongside natural-scale metrics in all dropdowns and table columns. +- A hub submitting sample-format ensemble forecasts can configure and view the variogram score in the dashboard. +- A new developer can follow the hubDocs guide to add a hypothetical metric end-to-end without reading source code across repos. + +### What are possible solutions? + +See the **mini-sprint breakdown** in "Ready to make it" below. Each sprint is self-contained and releasable. + +--- + +## ✅ Validation + +### What do we already know? + +- `hubEvals::score_model_out()` (v0.1.0) already supports `transform`, `transform_append`, and `transform_label`—no hubEvals changes needed for the scale transform pipeline. +- scoringutils dev branch has `as_forecast_multivariate_sample()`, `variogram_score_multivariate()`, and `variogram_score_multivariate_point()`. +- hubPredEvalsData schema versioning is established (v0.1.0 → v1.0.0 → v1.0.1); v1.1.0 is the natural target for transform + variogram + config-driven enhancements. 
+- The predevals JS dashboard builds via webpack into `dist/predevals.bundle.js`; the existing table/heatmap/line plot handle new metric columns without new chart types. +- predevals issues #31, #5, #42, #41, #49 are pure JS/CSS changes with no schema dependencies. + +### What do we need to answer? + +- **Variogram score in scoringutils**: [scoringutils#1114](https://github.com/epiforecasts/scoringutils/issues/1114) tracks adding `variogram_score_multivariate()` as a default compound metric. Confirm its merge status before closing Sprint D. hubEvals PR #103 is already wired to pick it up dynamically once it lands. +- **How many active hubs submit `sample`-format forecasts?** This determines Sprint D's near-term impact. Note, however, that the point variogram score could be used by almost all hubs; any variogram score would likely need additional configuration specifying which dimension the forecasts are assumed to be joint across. +- **Variogram + disaggregate_by conflict**: If a target has `disaggregate_by: location` and also uses the variogram score (computed *across* locations), should hubPredEvalsData skip disaggregation silently, warn, or error? +- **Issue #44 (human-readable target names)**: Is [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) resolved? If not, #44 has an unresolved upstream dependency. +- **Issue #4 (default sort column)**: ~~Should this be configured in `predevals-config.yml` (schema change) or via a standalone predevals options object (JS only)?~~ **Resolved**: configure in `predevals-config.yml` as a schema change; included in Sprint B's v1.0.2 bump. +- **Which scoringutils metrics are already surfaceable?** For quantile forecasts, any metric in `scoringutils::get_metrics(scoringutils::example_quantile)` is already available via the existing `get_standard_metrics()` pathway — no schema or code changes needed, just config. 
For sample forecasts, once PR #103 merges, CRPS/bias/DSS/energy score become available the same way. New scoringutils defaults propagate automatically. + + +--- + +## 👍 Ready to make it + +### Proposed solution + +Five mini-sprints ordered by implementation complexity: UI-only first, then incremental schema additions, then new metric features. Each sprint is independently releasable. + +The core complexity axis is: + +> **UI-only** (predevals JS/CSS, no schema changes) → **Config-driven** (small schema additions + JS) → **Full pipeline** (R packages + schema + JS across 4 repos) + +--- + +### Development standards + +Every issue follows a **two-phase workflow** before any code is written: + +#### Phase 1 — Issue refinement + +Before an implementer picks up an issue, the issue must be rewritten (if necessary) so that it is: + +- **Specific and testable**: describes a concrete, observable outcome rather than a vague intent. For example, not "fix the target-change bug" but "when a new target is selected, the metric dropdown resets to the first valid metric for that target and the disaggregate_by dropdown resets to 'overall'." +- **Single-concern**: one responsibility per issue; split if needed. +- **Unblocked**: all upstream dependencies are resolved. + +If the issue text isn't clear enough to write a test from, update it before starting implementation. + +#### Phase 2 — Test-Driven Development (TDD) + +1. **Write a failing test** that encodes the specific outcome from the refined issue. +2. **Implement the minimum code** to make the test pass. +3. **Refactor** with the test suite green. + +**Special cases where the TDD sequence is adapted:** + +| Case | Adapted sequence | +|------|-----------------| +| **Refactors** (#27, #28, #30) | Write *characterization tests* against the existing behaviour first → refactor → confirm tests still pass. No observable behaviour should change. | +| **Rename** (#34) | Not TDD. 
Completion is verified by a grep/search confirming no old string remains. | +| **Documentation** (Sprint E) | Not TDD. Acceptance criterion: a developer unfamiliar with the codebase can follow the guide and add a toy metric in a local dev environment (verified by peer walkthrough). | +| **Docker integration** (docker#6) | Write the integration test against current oracle-fetching behaviour → migrate to hubData → confirm test still passes. | + +**Tooling by repo:** + +| Repo | Test framework | Notes | +|------|---------------|-------| +| hubEvals (R) | `testthat` via `devtools::test()` | Hand-computed expected values where possible (see PR #103 pattern) | +| hubPredEvalsData (R) | `testthat` via `devtools::test()` | Integration tests using example hub data | +| predevals (JS) | To be established in Sprint A [#22](https://github.com/hubverse-org/predevals/issues/22) | Unit test framework (e.g. Jest or Vitest) chosen during Sprint A setup | +| hubPredEvalsData-docker | Integration tests in GitHub Actions | Compare CSV outputs between image versions (existing pattern) | + +**Universal Definition of Done** — every issue is closed only when: + +- [ ] The issue was refined to a specific, testable outcome *before* any code was written +- [ ] A failing test encoding that outcome was written *before* the implementation (or the appropriate adapted sequence above was followed) +- [ ] All existing tests continue to pass (`R CMD check` / CI green) +- [ ] The change is documented (inline comments for non-obvious logic; function-level docs for new public API) +- [ ] A PR is reviewed and approved by at least one other team member +- [ ] The relevant GitHub issue is referenced in the PR and closed on merge + +--- + +### Mini-Sprint A — UI-only polish (~2 weeks) +*Scope: predevals JS/CSS only. No schema changes. 
No R package changes.* + +| Issue | Change | Complexity | +|-------|--------|-----------| +| [#31](https://github.com/hubverse-org/predevals/issues/31) 🐛 **priority** | Auto-update metric + disaggregate_by selectors when target changes | JS logic fix | +| [#5](https://github.com/hubverse-org/predevals/issues/5) **priority** | Preserve model selection state when switching plots | JS state management | +| [#49](https://github.com/hubverse-org/predevals/issues/49) **priority** | Freeze model name column; horizontal scroll for other columns | CSS/JS table layout | +| [#42](https://github.com/hubverse-org/predevals/issues/42) **priority** | Add "lower is better" / "closer to nominal is better" cues to axes and table headers | JS + display logic | +| [#13](https://github.com/hubverse-org/predevals/issues/13) | Add metric definitions panel with abbreviations and direction indicators, shown by default and togglable | JS; metric glossary baked into predevals.js | +| [#50](https://github.com/hubverse-org/predevals/issues/50) | Enable column hiding via ColumnControl visibility toggle (one-line config change; ColumnControl already loaded) | JS config (trivial) | +| [#30](https://github.com/hubverse-org/predevals/issues/30) | Refactor code for getting full metrics list (prerequisite for Sprint C's per-scale metric treatment) | JS refactor | +| [#34](https://github.com/hubverse-org/predevals/issues/34) | Rename all `predeval` → `predevals` references in source | JS cleanup (trivial) | +| [#22](https://github.com/hubverse-org/predevals/issues/22) | Add unit tests for existing JS functionality; establishes test harness for TDD in subsequent sprints | JS testing infrastructure | + +**Deliverable**: New predevals release. No changes to hubPredEvalsData, hubEvals, or Docker. 
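
For issue #42, the direction cues could be driven by a small registry keyed on metric name. A sketch under assumed names (nothing here is existing predevals code):

```javascript
// Hypothetical registry mapping each metric to its "better" direction,
// used to annotate table headers and axis labels (issue #42) and reusable
// by the glossary panel (issue #13).
const METRIC_DIRECTIONS = {
  wis: "lower is better",
  ae_point: "lower is better",
  se_point: "lower is better",
  interval_coverage_50: "closer to 50% is better",
  interval_coverage_95: "closer to 95% is better",
};

function headerLabel(metric) {
  const direction = METRIC_DIRECTIONS[metric];
  // Fall back to the bare metric name when no direction is registered.
  return direction ? `${metric} (${direction})` : metric;
}
```

This keeps the Sprint A DoD test ("each metric name has a registered direction") to a lookup-table check rather than scattered conditionals.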
+ +**Sprint A — Definition of Done:** +- [ ] #22: Test harness chosen and one passing test for an existing function exists; all subsequent issues in this sprint add tests to it +- [ ] #31: Issue refined to specify exact reset behaviour → failing test written → implemented. Test: selecting a new target resets metric and disaggregate_by to the first valid value for that target +- [ ] #5: Issue refined to specify which state is preserved and when → failing test written → implemented. Test: model selection is unchanged after switching between table/heatmap/line views +- [ ] #49: Issue refined to name the exact column and scroll behaviour → failing DOM test written → implemented. Test: first column has `position: sticky` and table body scrolls independently +- [ ] #42: Issue refined per metric (which direction, what text) → failing tests written → implemented. Tests: each metric name has a registered direction; that direction renders correctly in headers and axis labels +- [ ] #13: Issue refined with complete metric glossary entries → failing test written → implemented. Test: glossary panel present on load; toggles; every metric name used in the dashboard has a glossary entry +- [ ] #50: Add `'visibility'` to `columnControl` array; test: each metric column header has a working hide/show toggle; `model_id` column is excluded from hiding +- [ ] #30: Characterization tests written against *existing* metrics-list behaviour → refactored → tests still pass. No behaviour change +- [ ] #34: Grep confirms zero occurrences of `predeval` (without trailing `s`) in `src/` +- [ ] CI passes; new predevals release tagged + +--- + +### Mini-Sprint B — Config-driven enhancements (~3 weeks) +*Scope: Small hubPredEvalsData schema additions + corresponding predevals JS. 
Schema bump to v1.0.2.* + +| Issue | Change | Where | +|-------|--------|-------| +| [predevals#28](https://github.com/hubverse-org/predevals/issues/28) | Refactor numeric rounding into a shared helper (prerequisite for #48) | predevals JS refactor | +| [predevals#48](https://github.com/hubverse-org/predevals/issues/48) **priority** | Add `decimal_places` per-target in config schema; JS reads and applies it for table display | hubPredEvalsData schema + predevals JS | +| [predevals#27](https://github.com/hubverse-org/predevals/issues/27) | Refactor score-sorting into a helper function (prerequisite for #4) | predevals JS refactor | +| [predevals#4](https://github.com/hubverse-org/predevals/issues/4) **priority** | Add `default_sort_metric` to config/options; JS uses it for initial table sort (e.g., sort models by relative WIS ascending on first load rather than alphabetically) | hubPredEvalsData schema (or predevals options object) + predevals JS | +| [predevals#44](https://github.com/hubverse-org/predevals/issues/44) | Display `target_name` from `tasks.json` instead of `target_id` in dropdowns | hubPredEvalsData (depends on [#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21)) + predevals JS | + +> ⚠️ Issue #44 depends on upstream hubPredEvalsData work ([#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21)). Include only if that work is complete. + +**Deliverable**: hubPredEvalsData v1.0.2 schema, new predevals release, Docker rebuild. + +**Sprint B — Definition of Done:** +- [ ] #28: Characterization tests written against existing rounding behaviour → refactored into shared helper → tests still pass. 
Must be merged before #48 is started +- [ ] #48: Issue refined to specify which targets need non-default precision and what values are valid → failing tests written (R: schema rejects invalid `decimal_places`; JS: rendered table rounds to configured places) → implemented +- [ ] #27: Characterization tests written against existing sorting behaviour → refactored into helper → tests still pass. Must be merged before #4 is started +- [ ] #4: Issue refined to specify fallback behaviour when metric is absent → failing tests written (R: schema accepts/rejects new field; JS: initial sort matches config, falls back to alphabetical when field absent) → implemented +- [ ] **Before starting #44**: confirm [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) is resolved; skip #44 in this sprint if not +- [ ] #44 (if included): Issue refined to specify behaviour when `target_name` is missing from data → failing test written → implemented. Test uses fixture data with and without `target_name` +- [ ] R `testthat` tests cover new schema properties end-to-end through `generate_eval_data()`, written before config.R is modified +- [ ] Docker image rebuilt and integration test passes + +--- + +### Mini-Sprint C — Scale transformation pipeline (~4 weeks) +*Scope: hubPredEvalsData schema v1.1.0 additions + predevals JS for scale UI. 
hubEvals unchanged (transforms already implemented).* + +**hubPredEvalsData changes:** +- Add `transform_defaults` (top-level) and per-target `transform` to `inst/schema/v1.1.0/config_schema.json` +- Allowed transform functions: `log_shift`, `sqrt`, `log1p`, `log`, `log10`, `log2` +- `append: true/false` — when true, scores.csv gains a `scale` column (`"natural"` or transform label) +- Add `validate_config_transforms()` in `R/config.R` +- Wire resolved transform config into `get_scores_for_output_type()` → `hubEvals::score_model_out(transform=..., transform_append=..., ...)` + +**predevals JS changes:** +- When the `scale` column is present in scores data, treat each (metric × scale) combination as a distinct metric: e.g., "wis (natural)" and "wis (log)" appear as separate items in dropdowns and as separate columns in tables +- No separate filter/toggle; scales are just more metrics +- Info banner when any transformed metrics are present +- **Note**: Sprint A's table ergonomics work ([#49](https://github.com/hubverse-org/predevals/issues/49) fixed column) should land before or alongside this sprint, since adding scale variants doubles the number of metric columns + +**Deliverable**: hubPredEvalsData v1.1.0 schema, new predevals release, Docker rebuild. Resolves [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34). 
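
To make the proposed schema additions concrete, a hypothetical `predevals-config.yml` fragment — field names follow the bullets above, but this is a sketch of the not-yet-finalized v1.1.0 schema, and the target IDs are illustrative:

```yaml
# Proposed (not finalized) v1.1.0 transform configuration — a sketch only.
transform_defaults:
  function: log_shift    # one of: log_shift, sqrt, log1p, log, log10, log2
  append: true           # true: scores.csv gains a `scale` column with both
                         # "natural" and transformed rows

targets:
  - target_id: wk inc flu hosp
    # Inherits transform_defaults; scored on natural and log_shift scales.
  - target_id: wk flu hosp rate category
    transform: null      # pmf target: per-target override opts out of the
                         # default transform (transforms on pmf targets fail
                         # validation)
```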
+ +**Sprint C — Definition of Done:** +- [ ] Each behaviour below has its failing test written and merged to a test branch *before* the corresponding R or JS implementation is written +- [ ] R: `generate_eval_data()` with `transform_defaults: {function: log_shift, append: true}` produces `scores.csv` with both `scale = "natural"` and `scale = "log_shift"` rows +- [ ] R: config with an invalid transform function name fails `validate_config_transforms()` with a clear error message +- [ ] R: config applying a transform to a `pmf` target fails validation +- [ ] R: per-target `transform: null` correctly overrides `transform_defaults` (hierarchical override) +- [ ] JS: when `scale` column present, metric dropdown contains `"wis (natural)"` and `"wis (log_shift)"` as distinct entries +- [ ] JS: table has one column per (metric × scale) combination +- [ ] JS: info banner visible when transformed metrics are present; absent otherwise +- [ ] Schema v1.1.0 is backward-compatible: all existing example configs validate against it without changes + +--- + +### Mini-Sprint D — Variogram score (~3–4 weeks) +*Scope: Merge near-complete hubEvals PR, then wire sample scoring + variogram through hubPredEvalsData and predevals.* + +**hubEvals changes — mostly done:** +[PR #103](https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals/pull/103) (branch `ak/sample-scoring/94`) is open and awaiting final review. 
It adds: +- `transform_sample_model_out()` — converts hubverse sample format to scoringutils-compatible objects +- `"sample"` as a valid output type in `validate_output_type()` +- Marginal scoring (CRPS, bias, DSS) and compound/multivariate scoring (energy score) via `compound_taskid_set` +- Dynamically generates compound metric list from scoringutils — so variogram score will be picked up automatically once [scoringutils#1114](https://github.com/epiforecasts/scoringutils/issues/1114) lands and bumps the default metrics +- Requires scoringutils ≥ 2.1.2.9000 +- Open issues on the branch: [#99](https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals/issues/99) (NaN/Inf validation), [#100](https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals/issues/100) (test updates), [#101](https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals/issues/101) (compound_taskid_set validation), [#102](https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals/issues/102) (test warnings) + +**Action**: Review and merge PR #103; track scoringutils#1114 for variogram score availability. + +**hubPredEvalsData changes** (extends v1.1.0 schema from Sprint C): +- Add `joint_across` optional property to target config +- Add `"variogram_score"` as a recognized metric name for sample output types +- `R/utils-metrics.R` — add `sample = "variogram_score"` case in `get_standard_metrics()` +- `R/generate_eval_data.R` — extract and propagate `joint_across`; skip location-based disaggregation for sample metrics when `joint_across = "location"` + +**predevals JS changes:** +- No new chart types; variogram score appears as another column in the overall scores table +- Graceful handling of missing metric columns in disaggregated views (may already work) + +> ⚠️ **Scope constraint**: The variogram score is computed jointly across locations and is **not** disaggregable by location. Document this in the config schema and validation error messages. 
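
A hypothetical target entry illustrating the constraint — the `joint_across` field is the proposed addition from this sprint, and the surrounding layout is a sketch, not the finalized schema:

```yaml
# Proposed (not finalized) target config fragment — illustrative only.
targets:
  - target_id: wk inc flu hosp
    joint_across: location     # sample forecasts treated as joint across locations
    disaggregate_by:
      - reference_date         # fine: the variogram score can vary by date
      # - location             # must be absent: combining location
      #                        # disaggregation with the variogram score
      #                        # should raise a validation error
```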
+ +**hubPredEvalsData-docker changes:** +- [docker#6](https://github.com/hubverse-org/hubPredEvalsData-docker/issues/6): Replace ad-hoc oracle data fetching with `hubData` tooling, so the container fetches oracle output through the standard hubverse data access layer rather than direct file paths + +**Deliverable**: hubEvals new minor version, hubPredEvalsData v1.1.0 (with Sprint C changes), Docker rebuild (with hubData oracle fetching). + +**Sprint D — Definition of Done:** +- [ ] Issues #99–#102 each refined and resolved via TDD before PR #103 is merged +- [ ] PR #103: each open issue has a failing test written first; hand-computed expected values for CRPS and energy score are confirmed correct before the implementation is accepted +- [ ] hubEvals: failing test for `score_model_out()` with `output_type = "sample"` returning the expected variogram score is written against energy score first (interim), then updated when scoringutils#1114 lands — written before implementation is touched +- [ ] hubPredEvalsData: issue for sample output type support is refined to specify exact config fields and error conditions → failing R tests written → `utils-metrics.R` and `generate_eval_data.R` implemented. 
Tests: `get_standard_metrics("sample")` returns expected names; `generate_eval_data()` with `joint_across: location` produces correct `scores.csv` +- [ ] hubPredEvalsData: failing validation test for the `joint_across` + `disaggregate_by` conflict written before `config.R` validation is modified +- [ ] docker#6: integration test written against *current* oracle-fetching behaviour → hubData migration implemented → same test still passes +- [ ] JS: issue refined to specify how missing variogram score column is handled in disaggregated views → failing test written → implemented +- [ ] `R CMD check` passes for both hubEvals and hubPredEvalsData + +--- + +### Mini-Sprint E — Documentation (~2 weeks, can overlap with D) +*Scope: Extend the [existing developer guide](https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html) + predevals JS (end-user metric definitions, partly overlapping with Sprint A #13).* + +The existing guide covers infrastructure well (Docker setup, renv, build process, integration testing). It does **not** cover metric development workflow. Sprint E adds that missing layer as a new section or sibling page within the same developer guide. + +**New content to add to hubDocs:** +1. Architecture diagram: hubEvals → hubPredEvalsData → predevals → dashboard (with pointers to existing infrastructure docs) +2. How to add a metric in hubEvals: implement a transform function, update `score_model_out`, update `get_metrics`, write tests +3. How to wire it through hubPredEvalsData: add metric name to schema, `get_standard_metrics`, `get_metric_name_to_output_type`, config validation +4. How predevals reads scores CSVs — when new metrics appear automatically vs. when JS changes are needed +5. 
Worked example: variogram score end-to-end (referencing Sprints C + D as concrete prior art) + +**End-user metric documentation** (builds on Sprint A #13): +- If not fully addressed in Sprint A, finalize a metric glossary embedded in predevals covering: WIS, ae_point, se_point, interval coverage, variogram score +- Hub-level task variable definitions handled separately (via hub model-output README, not in predevals) + +**Deliverable**: New or extended hubDocs page; predevals minor release if glossary was deferred from Sprint A. + +**Sprint E — Definition of Done:** +- [ ] The guide explicitly describes the issue-refinement + TDD workflow expected for each repo, including the special cases (refactors, renames, Docker) +- [ ] The guide is validated by a walkthrough: a developer unfamiliar with the codebase reads it and successfully adds a toy metric in a local dev environment without asking for help +- [ ] All metric names mentioned in the guide are present in the predevals JS glossary (#13) +- [ ] hubDocs CI (link checks, build) passes + +--- + +### Files to modify across all sprints + +| Repo | Files | Sprint | +|------|-------|--------| +| hubEvals | `R/validate.R`, `R/score_model_out.R`, new `R/transform_sample_model_out.R` | D | +| hubPredEvalsData | `inst/schema/v1.0.2/config_schema.json` (new), `inst/schema/v1.1.0/config_schema.json` (new), `R/config.R`, `R/utils-metrics.R`, `R/generate_eval_data.R` | B, C, D | +| predevals | `src/predevals.js` and related source files | A, B, C, D | +| hubPredEvalsData-docker | Dockerfile / entrypoint | B, C, D | +| hubDocs | New developer + end-user guide page | E | + +### Scale and scope + +- **Duration**: 1–3 months total across all sprints; Sprints A and E can run in parallel with others +- **Sequencing**: + +``` +Sprint A (UI polish, ~2w) ──────────────────────────► release +Sprint B (config enhancements, ~3w) ─────────────────► release +Sprint C (scale transforms, ~4w) ────────────► release +Sprint D (variogram score, 
~3–4w) ────────► release +Sprint E (docs, ~2w) ──────────────────────► release +``` + +> **Note**: Sprint E can begin in parallel with Sprint B, but cannot fully close until Sprint D is complete. The worked example (item 5 in the content list) depends on Sprints C and D as concrete prior art. + +### Key risks + +1. **scoringutils#1114 timing**: Variogram score will land in hubEvals automatically once scoringutils#1114 merges, but the timeline is upstream-dependent. The rest of Sprint D (hubPredEvalsData schema, pipeline) can proceed with energy score in the meantime. +2. **hubPredEvalsData#21 dependency**: Sprint B's issue #44 (human-readable target names) cannot land until that upstream issue is resolved. +3. **Schema versioning coordination**: Sprints B, C, and D all modify the hubPredEvalsData schema. Recommend sequential release of B → C → D to avoid conflicting bumps. +4. **Table width**: Treating scaled metrics as separate columns (Sprint C) will widen the evaluation table significantly. Sprint A's frozen-column fix ([#49](https://github.com/hubverse-org/predevals/issues/49)) is a soft prerequisite for Sprint C. 
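Taken together, the schema items above suggest a config shape like the following. This `predevals-config.yml` fragment is purely illustrative: the `targets` structure and exact field spellings are assumptions, `transform: false` and `compound_taskid_set` follow the poster's terminology, and [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34) remains the authoritative design for the v1.1.0 schema:

```yaml
# Illustrative sketch only: field names follow the poster's terminology;
# hubPredEvalsData#34 defines the authoritative v1.1.0 schema.
transform_defaults:            # hub-wide default transform (Sprint C)
  function: log_shift          # e.g. one of log_shift, sqrt, log1p, log, log10, log2
  append: true                 # keep natural-scale scores; scores.csv gains a
                               # `scale` column, so "wis (natural)" and
                               # "wis (log_shift)" become distinct metrics

targets:
  - target_id: wk inc flu hosp     # target structure assumed for illustration
    transform: false               # per-target override disabling the default
  - target_id: wk inc flu hosp samples
    compound_taskid_set: location  # variogram score computed jointly across
                                   # locations (Sprint D); combining this with
                                   # `disaggregate_by: location` should fail validation
```

Under `append: true`, the dashboard would list each (metric × scale) combination separately, which is why the frozen-column work in predevals#49 is a soft prerequisite for the transform pipeline.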
From 3e8046d47caf812c46cdd7ff655f03c9cc0221d9 Mon Sep 17 00:00:00 2001 From: Nicholas Reich Date: Tue, 3 Mar 2026 23:14:55 -0500 Subject: [PATCH 2/8] Address PR #34 review feedback: remove AGENTS.md, trim poster - Remove AGENTS.md (duplicated poster content); migrate pipeline diagram and repo links into the poster's What do we already know section - Remove Development Standards section (team methodology, not project scope) - Trim per-sprint DoD checklists to brief acceptance criteria - Slim down Sprint C to reference hubPredEvalsData#34 as the authoritative implementation plan, eliminating duplicated/conflicting detail - Fix transform: null to transform: false (matching hubPredEvalsData#34) - Replace joint_across with compound_taskid_set throughout (matching established hubverse terminology) Co-Authored-By: Claude Opus 4.6 --- .../eval-metrics-expansion/AGENTS.md | 86 --------- .../eval-metrics-expansion.md | 165 ++++++------------ 2 files changed, 56 insertions(+), 195 deletions(-) delete mode 100644 project-posters/eval-metrics-expansion/AGENTS.md diff --git a/project-posters/eval-metrics-expansion/AGENTS.md b/project-posters/eval-metrics-expansion/AGENTS.md deleted file mode 100644 index d829584..0000000 --- a/project-posters/eval-metrics-expansion/AGENTS.md +++ /dev/null @@ -1,86 +0,0 @@ -# AGENTS.md — Eval Metrics Expansion Project - -This file provides context for anyone (human or AI agent) arriving at this project poster for the first time. - -## What this project is - -A 1–3 month development sprint to expand the hubverse forecast evaluation ecosystem in two areas: -1. Better end-user and developer experience in the existing dashboard -2. Two new metric capabilities: log-scale scoring and the variogram score - -The full plan is in **`eval-metrics-expansion.md`** in this directory. 
- -## The hubverse evaluation pipeline (4 repos) - -``` -hubEvals (R pkg) - └── core scoring library; wraps scoringutils; exposes score_model_out() - ↓ -hubPredEvalsData (R pkg) - └── orchestrates scoring for a hub; reads predevals-config.yml; - outputs scores.csv files organized by target/eval_set/disaggregate_by - ↓ -hubPredEvalsData-docker - └── Docker container wrapping hubPredEvalsData::generate_eval_data(); - used in hub CI/CD pipelines - ↓ -predevals (JavaScript) - └── client-side module that reads the scores CSV files and renders - interactive tables, heatmaps, and line plots -``` - -Documentation: -- User guide: https://docs.hubverse.io/en/latest/user-guide/dashboards.html#predevals-evaluation-optional -- Developer guide (infrastructure): https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html -- predevals repo: https://github.com/hubverse-org/predevals -- hubPredEvalsData repo: https://github.com/hubverse-org/hubPredEvalsData -- hubEvals repo: https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals -- hubPredEvalsData-docker repo: https://github.com/hubverse-org/hubPredEvalsData-docker - -## Current state (as of 2026-02-26) - -- **hubEvals v0.1.0** supports: quantile, mean, median, pmf output types; log/sqrt scale transforms already implemented in `score_model_out()` -- **hubEvals PR #103** (branch `ak/sample-scoring/94`) is open and nearly ready to merge — adds `sample` output type with CRPS, bias, DSS, energy score, and dynamic compound metric list (will pick up variogram score automatically once scoringutils#1114 lands) -- **hubPredEvalsData schema v1.0.1** — no support yet for scale transforms or sample output type -- **predevals** — functional but has multiple known UX gaps (see issues) -- **scoringutils** — variogram score being added upstream in issue #1114 - -## Sprint structure - -The sprint is broken into 5 mini-sprints ordered by implementation complexity: - -| Sprint | Scope | Repos touched | Approx. 
duration | -|--------|-------|--------------|-----------------| -| A | UI-only polish (bugs, ergonomics, metric docs) | predevals only | ~2 weeks | -| B | Config-driven enhancements (decimal places, sort, target names) | hubPredEvalsData schema + predevals | ~3 weeks | -| C | Scale transformation pipeline | hubPredEvalsData schema + predevals | ~4 weeks | -| D | Variogram score / sample scoring | hubEvals (PR #103) + hubPredEvalsData + predevals | ~3–4 weeks | -| E | Developer docs (extend existing guide) | hubDocs | ~2 weeks, can overlap | - -Sprint A should go first; Sprint C depends on Sprint A's table ergonomics fix (#49). Sprints B, C, D share the hubPredEvalsData schema so should release sequentially (B → C → D). Sprint E can begin in parallel with Sprint B but cannot fully close until Sprint D is complete (the worked example depends on Sprints C + D as prior art). - -## Key design decisions made - -- **Scale metrics treated as separate metrics**: When `append=true` in transform config, log-scaled and natural-scale metrics appear as distinct columns/dropdown items (e.g., "wis (natural)" and "wis (log)"), not as a filter/toggle. This requires the frozen-column table fix (Sprint A #49) to land first. -- **Variogram score is compound (across locations)**: The variogram score is computed jointly across locations via `compound_taskid_set`. It is NOT disaggregatable by location. hubPredEvalsData should enforce this with a validation error. -- **Scoringutils metrics are automatically available**: Any metric that scoringutils returns as a default for a given output type flows through automatically via `get_standard_metrics()`. No schema/code changes needed beyond adding the output type — just config. - -## Open questions (as of 2026-02-26) - -- When does scoringutils#1114 (variogram score) land? -- How many hubs currently use sample-format forecasts? -- Should `disaggregate_by: location` + variogram score be a silent skip, warning, or error in hubPredEvalsData? 
-- Is hubPredEvalsData#21 (target metadata) resolved? Unblocks predevals#44. -- ~~Issue #4 (default sort column): schema change or JS-only?~~ **Resolved**: `predevals-config.yml` schema change, included in Sprint B v1.0.2. - -## Key files to know - -| File | Purpose | -|------|---------| -| `hubEvals/R/score_model_out.R` | Core scoring function; transform + joint_across params live here | -| `hubEvals/R/validate.R` | Output type validation | -| `hubPredEvalsData/R/utils-metrics.R` | `get_standard_metrics()` maps output types → metric names | -| `hubPredEvalsData/R/generate_eval_data.R` | Top-level orchestration; calls hubEvals | -| `hubPredEvalsData/R/config.R` | Config parsing + validation | -| `hubPredEvalsData/inst/schema/` | JSON schema versions for predevals-config.yml | -| `predevals/src/predevals.js` | Main JS source; reads CSV files, renders UI | diff --git a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md index 2ec7532..45a1cbe 100644 --- a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md +++ b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md @@ -54,6 +54,35 @@ See the **mini-sprint breakdown** in "Ready to make it" below. Each sprint is se ### What do we already know? 
+**The hubverse evaluation pipeline (4 repos):** + +``` +hubEvals (R pkg) + └── core scoring library; wraps scoringutils; exposes score_model_out() + ↓ +hubPredEvalsData (R pkg) + └── orchestrates scoring for a hub; reads predevals-config.yml; + outputs scores.csv files organized by target/eval_set/disaggregate_by + ↓ +hubPredEvalsData-docker + └── Docker container wrapping hubPredEvalsData::generate_eval_data(); + used in hub CI/CD pipelines + ↓ +predevals (JavaScript) + └── client-side module that reads the scores CSV files and renders + interactive tables, heatmaps, and line plots +``` + +Repos: +- [hubEvals](https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals) +- [hubPredEvalsData](https://github.com/hubverse-org/hubPredEvalsData) +- [hubPredEvalsData-docker](https://github.com/hubverse-org/hubPredEvalsData-docker) +- [predevals](https://github.com/hubverse-org/predevals) + +Documentation: +- [User guide](https://docs.hubverse.io/en/latest/user-guide/dashboards.html#predevals-evaluation-optional) +- [Developer guide (infrastructure)](https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html) + - `hubEvals::score_model_out()` (v0.1.0) already supports `transform`, `transform_append`, and `transform_label`—no hubEvals changes needed for the scale transform pipeline. - scoringutils dev branch has `as_forecast_multivariate_sample()`, `variogram_score_multivariate()`, and `variogram_score_multivariate_point()`. - hubPredEvalsData schema versioning is established (v0.1.0 → v1.0.0 → v1.0.1); v1.1.0 is the natural target for transform + variogram + config-driven enhancements. 
@@ -84,55 +113,6 @@ The core complexity axis is: --- -### Development standards - -Every issue follows a **two-phase workflow** before any code is written: - -#### Phase 1 — Issue refinement - -Before an implementer picks up an issue, the issue must be rewritten (if necessary) so that it is: - -- **Specific and testable**: describes a concrete, observable outcome rather than a vague intent. For example, not "fix the target-change bug" but "when a new target is selected, the metric dropdown resets to the first valid metric for that target and the disaggregate_by dropdown resets to 'overall'." -- **Single-concern**: one responsibility per issue; split if needed. -- **Unblocked**: all upstream dependencies are resolved. - -If the issue text isn't clear enough to write a test from, update it before starting implementation. - -#### Phase 2 — Test-Driven Development (TDD) - -1. **Write a failing test** that encodes the specific outcome from the refined issue. -2. **Implement the minimum code** to make the test pass. -3. **Refactor** with the test suite green. - -**Special cases where the TDD sequence is adapted:** - -| Case | Adapted sequence | -|------|-----------------| -| **Refactors** (#27, #28, #30) | Write *characterization tests* against the existing behaviour first → refactor → confirm tests still pass. No observable behaviour should change. | -| **Rename** (#34) | Not TDD. Completion is verified by a grep/search confirming no old string remains. | -| **Documentation** (Sprint E) | Not TDD. Acceptance criterion: a developer unfamiliar with the codebase can follow the guide and add a toy metric in a local dev environment (verified by peer walkthrough). | -| **Docker integration** (docker#6) | Write the integration test against current oracle-fetching behaviour → migrate to hubData → confirm test still passes. 
| - -**Tooling by repo:** - -| Repo | Test framework | Notes | -|------|---------------|-------| -| hubEvals (R) | `testthat` via `devtools::test()` | Hand-computed expected values where possible (see PR #103 pattern) | -| hubPredEvalsData (R) | `testthat` via `devtools::test()` | Integration tests using example hub data | -| predevals (JS) | To be established in Sprint A [#22](https://github.com/hubverse-org/predevals/issues/22) | Unit test framework (e.g. Jest or Vitest) chosen during Sprint A setup | -| hubPredEvalsData-docker | Integration tests in GitHub Actions | Compare CSV outputs between image versions (existing pattern) | - -**Universal Definition of Done** — every issue is closed only when: - -- [ ] The issue was refined to a specific, testable outcome *before* any code was written -- [ ] A failing test encoding that outcome was written *before* the implementation (or the appropriate adapted sequence above was followed) -- [ ] All existing tests continue to pass (`R CMD check` / CI green) -- [ ] The change is documented (inline comments for non-obvious logic; function-level docs for new public API) -- [ ] A PR is reviewed and approved by at least one other team member -- [ ] The relevant GitHub issue is referenced in the PR and closed on merge - ---- - ### Mini-Sprint A — UI-only polish (~2 weeks) *Scope: predevals JS/CSS only. No schema changes. No R package changes.* @@ -150,17 +130,10 @@ If the issue text isn't clear enough to write a test from, update it before star **Deliverable**: New predevals release. No changes to hubPredEvalsData, hubEvals, or Docker. -**Sprint A — Definition of Done:** -- [ ] #22: Test harness chosen and one passing test for an existing function exists; all subsequent issues in this sprint add tests to it -- [ ] #31: Issue refined to specify exact reset behaviour → failing test written → implemented. 
Test: selecting a new target resets metric and disaggregate_by to the first valid value for that target -- [ ] #5: Issue refined to specify which state is preserved and when → failing test written → implemented. Test: model selection is unchanged after switching between table/heatmap/line views -- [ ] #49: Issue refined to name the exact column and scroll behaviour → failing DOM test written → implemented. Test: first column has `position: sticky` and table body scrolls independently -- [ ] #42: Issue refined per metric (which direction, what text) → failing tests written → implemented. Tests: each metric name has a registered direction; that direction renders correctly in headers and axis labels -- [ ] #13: Issue refined with complete metric glossary entries → failing test written → implemented. Test: glossary panel present on load; toggles; every metric name used in the dashboard has a glossary entry -- [ ] #50: Add `'visibility'` to `columnControl` array; test: each metric column header has a working hide/show toggle; `model_id` column is excluded from hiding -- [ ] #30: Characterization tests written against *existing* metrics-list behaviour → refactored → tests still pass. No behaviour change -- [ ] #34: Grep confirms zero occurrences of `predeval` (without trailing `s`) in `src/` -- [ ] CI passes; new predevals release tagged +**Sprint A — Acceptance criteria:** +- All listed issues resolved and closed +- JS test harness established (#22) with tests covering new functionality +- CI passes; new predevals release tagged --- @@ -179,46 +152,24 @@ If the issue text isn't clear enough to write a test from, update it before star **Deliverable**: hubPredEvalsData v1.0.2 schema, new predevals release, Docker rebuild. -**Sprint B — Definition of Done:** -- [ ] #28: Characterization tests written against existing rounding behaviour → refactored into shared helper → tests still pass. 
Must be merged before #48 is started -- [ ] #48: Issue refined to specify which targets need non-default precision and what values are valid → failing tests written (R: schema rejects invalid `decimal_places`; JS: rendered table rounds to configured places) → implemented -- [ ] #27: Characterization tests written against existing sorting behaviour → refactored into helper → tests still pass. Must be merged before #4 is started -- [ ] #4: Issue refined to specify fallback behaviour when metric is absent → failing tests written (R: schema accepts/rejects new field; JS: initial sort matches config, falls back to alphabetical when field absent) → implemented -- [ ] **Before starting #44**: confirm [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) is resolved; skip #44 in this sprint if not -- [ ] #44 (if included): Issue refined to specify behaviour when `target_name` is missing from data → failing test written → implemented. Test uses fixture data with and without `target_name` -- [ ] R `testthat` tests cover new schema properties end-to-end through `generate_eval_data()`, written before config.R is modified -- [ ] Docker image rebuilt and integration test passes +**Sprint B — Acceptance criteria:** +- All listed issues resolved and closed (#44 only if [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) is resolved) +- hubPredEvalsData schema v1.0.2 released; R tests pass +- Docker image rebuilt and integration test passes +- New predevals release tagged --- ### Mini-Sprint C — Scale transformation pipeline (~4 weeks) *Scope: hubPredEvalsData schema v1.1.0 additions + predevals JS for scale UI. 
hubEvals unchanged (transforms already implemented).* -**hubPredEvalsData changes:** -- Add `transform_defaults` (top-level) and per-target `transform` to `inst/schema/v1.1.0/config_schema.json` -- Allowed transform functions: `log_shift`, `sqrt`, `log1p`, `log`, `log10`, `log2` -- `append: true/false` — when true, scores.csv gains a `scale` column (`"natural"` or transform label) -- Add `validate_config_transforms()` in `R/config.R` -- Wire resolved transform config into `get_scores_for_output_type()` → `hubEvals::score_model_out(transform=..., transform_append=..., ...)` +The detailed implementation plan for this sprint is in [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34), which covers the schema design, config validation, R pipeline changes, and predevals JS behavior. -**predevals JS changes:** -- When the `scale` column is present in scores data, treat each (metric × scale) combination as a distinct metric: e.g., "wis (natural)" and "wis (log)" appear as separate items in dropdowns and as separate columns in tables -- No separate filter/toggle; scales are just more metrics -- Info banner when any transformed metrics are present -- **Note**: Sprint A's table ergonomics work ([#49](https://github.com/hubverse-org/predevals/issues/49) fixed column) should land before or alongside this sprint, since adding scale variants doubles the number of metric columns - -**Deliverable**: hubPredEvalsData v1.1.0 schema, new predevals release, Docker rebuild. Resolves [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34). 
- -**Sprint C — Definition of Done:** -- [ ] Each behaviour below has its failing test written and merged to a test branch *before* the corresponding R or JS implementation is written -- [ ] R: `generate_eval_data()` with `transform_defaults: {function: log_shift, append: true}` produces `scores.csv` with both `scale = "natural"` and `scale = "log_shift"` rows -- [ ] R: config with an invalid transform function name fails `validate_config_transforms()` with a clear error message -- [ ] R: config applying a transform to a `pmf` target fails validation -- [ ] R: per-target `transform: null` correctly overrides `transform_defaults` (hierarchical override) -- [ ] JS: when `scale` column present, metric dropdown contains `"wis (natural)"` and `"wis (log_shift)"` as distinct entries -- [ ] JS: table has one column per (metric × scale) combination -- [ ] JS: info banner visible when transformed metrics are present; absent otherwise -- [ ] Schema v1.1.0 is backward-compatible: all existing example configs validate against it without changes +**Summary**: Add `transform_defaults` (top-level) and per-target `transform` override to the hubPredEvalsData config schema. Wire the resolved transform config through to `hubEvals::score_model_out()`. When `append: true`, scores.csv gains a `scale` column and the predevals dashboard treats each (metric × scale) combination as a distinct item in dropdowns and table columns. + +**Note**: Sprint A's table ergonomics work ([#49](https://github.com/hubverse-org/predevals/issues/49) fixed column) should land before or alongside this sprint, since adding scale variants increases the number of metric columns. + +**Deliverable**: hubPredEvalsData v1.1.0 schema, new predevals release, Docker rebuild. --- @@ -237,10 +188,10 @@ If the issue text isn't clear enough to write a test from, update it before star **Action**: Review and merge PR #103; track scoringutils#1114 for variogram score availability. 
**hubPredEvalsData changes** (extends v1.1.0 schema from Sprint C): -- Add `joint_across` optional property to target config +- Add `compound_taskid_set` optional property to target config - Add `"variogram_score"` as a recognized metric name for sample output types - `R/utils-metrics.R` — add `sample = "variogram_score"` case in `get_standard_metrics()` -- `R/generate_eval_data.R` — extract and propagate `joint_across`; skip location-based disaggregation for sample metrics when `joint_across = "location"` +- `R/generate_eval_data.R` — extract and propagate `compound_taskid_set`; skip location-based disaggregation for sample metrics when `compound_taskid_set = "location"` **predevals JS changes:** - No new chart types; variogram score appears as another column in the overall scores table @@ -253,15 +204,12 @@ If the issue text isn't clear enough to write a test from, update it before star **Deliverable**: hubEvals new minor version, hubPredEvalsData v1.1.0 (with Sprint C changes), Docker rebuild (with hubData oracle fetching). -**Sprint D — Definition of Done:** -- [ ] Issues #99–#102 each refined and resolved via TDD before PR #103 is merged -- [ ] PR #103: each open issue has a failing test written first; hand-computed expected values for CRPS and energy score are confirmed correct before the implementation is accepted -- [ ] hubEvals: failing test for `score_model_out()` with `output_type = "sample"` returning the expected variogram score is written against energy score first (interim), then updated when scoringutils#1114 lands — written before implementation is touched -- [ ] hubPredEvalsData: issue for sample output type support is refined to specify exact config fields and error conditions → failing R tests written → `utils-metrics.R` and `generate_eval_data.R` implemented. 
Tests: `get_standard_metrics("sample")` returns expected names; `generate_eval_data()` with `joint_across: location` produces correct `scores.csv` -- [ ] hubPredEvalsData: failing validation test for the `joint_across` + `disaggregate_by` conflict written before `config.R` validation is modified -- [ ] docker#6: integration test written against *current* oracle-fetching behaviour → hubData migration implemented → same test still passes -- [ ] JS: issue refined to specify how missing variogram score column is handled in disaggregated views → failing test written → implemented -- [ ] `R CMD check` passes for both hubEvals and hubPredEvalsData +**Sprint D — Acceptance criteria:** +- hubEvals PR #103 merged with issues #99–#102 resolved +- hubPredEvalsData supports `sample` output type with `compound_taskid_set` config; validates against `disaggregate_by` conflicts +- Variogram score appears in predevals dashboard (once scoringutils#1114 lands) +- Docker image uses `hubData` for oracle fetching ([docker#6](https://github.com/hubverse-org/hubPredEvalsData-docker/issues/6)) +- `R CMD check` passes for both hubEvals and hubPredEvalsData --- @@ -283,11 +231,10 @@ The existing guide covers infrastructure well (Docker setup, renv, build process **Deliverable**: New or extended hubDocs page; predevals minor release if glossary was deferred from Sprint A. 
-**Sprint E — Definition of Done:** -- [ ] The guide explicitly describes the issue-refinement + TDD workflow expected for each repo, including the special cases (refactors, renames, Docker) -- [ ] The guide is validated by a walkthrough: a developer unfamiliar with the codebase reads it and successfully adds a toy metric in a local dev environment without asking for help -- [ ] All metric names mentioned in the guide are present in the predevals JS glossary (#13) -- [ ] hubDocs CI (link checks, build) passes +**Sprint E — Acceptance criteria:** +- A developer unfamiliar with the codebase can follow the guide to add a toy metric end-to-end without reading source across repos +- All metric names in the guide are present in the predevals JS glossary (#13) +- hubDocs CI passes --- From 4c38e092f9feaed7055c4f8fac7dde23dbcaf657 Mon Sep 17 00:00:00 2001 From: Nicholas Reich Date: Thu, 5 Mar 2026 16:29:24 -0500 Subject: [PATCH 3/8] update based on re-org of some of the tasks --- .../eval-metrics-expansion.md | 143 +++++++++--------- 1 file changed, 75 insertions(+), 68 deletions(-) diff --git a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md index 45a1cbe..a16e139 100644 --- a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md +++ b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md @@ -10,17 +10,15 @@ ### What are we doing? -Extending the hubverse forecast evaluation ecosystem across five goals: +Extending the hubverse forecast evaluation ecosystem across four goals: -1. **UI polish for end users**: Fix known usability gaps in the predevals dashboard—metric documentation, score direction indicators, table ergonomics, and several priority bugs—without touching R packages or config schemas. 
Issues: [predevals#13](https://github.com/hubverse-org/predevals/issues/13), [#31](https://github.com/hubverse-org/predevals/issues/31), [#42](https://github.com/hubverse-org/predevals/issues/42), [#41](https://github.com/hubverse-org/predevals/issues/41), [#49](https://github.com/hubverse-org/predevals/issues/49), [#5](https://github.com/hubverse-org/predevals/issues/5). Explore whether column hiding (so users can make the tabler simpler to focus on just desired metrics) is possible using datatables. +1. **Dashboard UI improvements**: Fix known usability gaps in the predevals dashboard—table ergonomics for wider tables (column hiding, frozen model column), metric documentation, score direction indicators, and several priority bugs—without touching R packages or config schemas. Table readiness issues ([#49](https://github.com/hubverse-org/predevals/issues/49), [#50](https://github.com/hubverse-org/predevals/issues/50)) are critical prerequisites for the scaled-metrics work in goal 2. Additional polish issues: [#13](https://github.com/hubverse-org/predevals/issues/13), [#31](https://github.com/hubverse-org/predevals/issues/31), [#42](https://github.com/hubverse-org/predevals/issues/42), [#41](https://github.com/hubverse-org/predevals/issues/41), [#5](https://github.com/hubverse-org/predevals/issues/5). -2. **Config-driven UI enhancements**: Small additions to the hubPredEvalsData schema enabling per-target decimal precision, human-readable target names, and a configurable default sort column for the evaluation table. Issues: [predevals#48](https://github.com/hubverse-org/predevals/issues/48), [#44](https://github.com/hubverse-org/predevals/issues/44), [#4](https://github.com/hubverse-org/predevals/issues/4). +2. 
**Evaluation config schema updates**: Consolidate all hubPredEvalsData schema changes into a single v1.1.0 release, including config-driven UI enhancements (per-target decimal precision, human-readable target names, configurable default sort column) and the scale transformation pipeline (wiring the already-implemented log/sqrt transform support in `hubEvals::score_model_out()` through the config schema and predevals UI). Issues: [predevals#48](https://github.com/hubverse-org/predevals/issues/48), [#44](https://github.com/hubverse-org/predevals/issues/44), [#4](https://github.com/hubverse-org/predevals/issues/4), [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34).
 
-3. **Scale transformation pipeline**: Wire the already-implemented log/sqrt transform support in `hubEvals::score_model_out()` through the hubPredEvalsData config schema and the predevals UI so hub admins can evaluate forecasts on transformed scales. Issue: [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34).
 
+3. **Sample-based scoring with compound_taskid_set awareness**: Add `sample` output type support to hubEvals and build pipeline infrastructure for multivariate/compound metrics that require [`compound_taskid_set`](https://docs.hubverse.io/en/latest/user-guide/sample-output-type.html#compound-modeling-tasks) awareness. This enables the variogram score (`variogram_score_multivariate()` and `variogram_score_multivariate_point()`, recently added to scoringutils) as a metric evaluating ensemble spatial correlation structure across locations, and lays groundwork for future compound metrics on sample forecasts. See also [hubDocs PR #439](https://github.com/hubverse-org/hubDocs/pull/439) for incoming updates to compound modeling task documentation.
 
-4. 
**Variogram score**: Add `sample` output type support to hubEvals and expose `variogram_score_multivariate()` and `variogram_score_multivariate_point()` (recently added to scoringutils) as a metric evaluating ensemble spatial correlation structure across locations. - -5. **Developer documentation**: A hubDocs guide explaining the full metric pipeline for future contributors, plus standardized end-user metric definitions in the dashboard. +4. **Developer documentation**: A hubDocs guide explaining the full metric pipeline for future contributors, plus standardized end-user metric definitions in the dashboard. ### Why are we doing this? @@ -34,14 +32,14 @@ Extending the hubverse forecast evaluation ecosystem across five goals: - Not adding new chart types (existing table/heatmap/line plot are sufficient). - Not changing the existing WIS/AE/interval coverage metrics. -- Not implementing calibration/reliability diagram visualizations in this sprint. +- Not implementing calibration/reliability diagram visualizations in this project. - Not adding the variogram score for quantile-format forecasts (requires ensemble/sample format). ### How do we judge success? - A first-time user reading the dashboard understands what WIS means, which direction is better, and which models are performing best—without leaving the page. - A hub admin can configure log-scale evaluation in `predevals-config.yml` and see log-scaled metrics appear as distinct items alongside natural-scale metrics in all dropdowns and table columns. -- A hub submitting sample-format ensemble forecasts can configure and view the variogram score in the dashboard. +- A hub submitting sample-format ensemble forecasts can configure compound_taskid_set and view compound metrics (e.g., variogram score) in the dashboard. - A new developer can follow the hubDocs guide to add a hypothetical metric end-to-end without reading source code across repos. ### What are possible solutions? 
@@ -85,17 +83,17 @@ Documentation: - `hubEvals::score_model_out()` (v0.1.0) already supports `transform`, `transform_append`, and `transform_label`—no hubEvals changes needed for the scale transform pipeline. - scoringutils dev branch has `as_forecast_multivariate_sample()`, `variogram_score_multivariate()`, and `variogram_score_multivariate_point()`. -- hubPredEvalsData schema versioning is established (v0.1.0 → v1.0.0 → v1.0.1); v1.1.0 is the natural target for transform + variogram + config-driven enhancements. +- hubPredEvalsData schema versioning is established (v0.1.0 → v1.0.0 → v1.0.1); v1.1.0 is the natural target for all config-driven enhancements and scale transform additions in a single release. The schema update checklist from the last version bump is tracked in [hubPredEvalsData#32](https://github.com/hubverse-org/hubPredEvalsData/issues/32). - The predevals JS dashboard builds via webpack into `dist/predevals.bundle.js`; the existing table/heatmap/line plot handle new metric columns without new chart types. - predevals issues #31, #5, #42, #41, #49 are pure JS/CSS changes with no schema dependencies. ### What do we need to answer? -- **Variogram score in scoringutils**: [scoringutils#1114](https://github.com/epiforecasts/scoringutils/issues/1114) tracks adding `variogram_score_multivariate()` as a default compound metric. Confirm its merge status before closing Sprint D. hubEvals PR #103 is already wired to pick it up dynamically once it lands. -- **How many active hubs submit `sample`-format forecasts?** This determines Sprint D's near-term impact. Although, note that the variogram point score could be used by almost all hubs, with some additional specification about joint across. It is possible that some specification would be needed for any variogram score to specify which dimension to assume the forecasts are joint across. 
-- **Variogram + disaggregate_by conflict**: If a target has `disaggregate_by: location` and also uses the variogram score (computed *across* locations), should hubPredEvalsData skip disaggregation silently, warn, or error? +- **Variogram score in scoringutils**: [scoringutils#1114](https://github.com/epiforecasts/scoringutils/issues/1114) tracks adding `variogram_score_multivariate()` as a default compound metric. Confirm its merge status before closing Sprint C. hubEvals PR #103 is already wired to pick it up dynamically once it lands. +- **How many active hubs submit `sample`-format forecasts?** This determines Sprint C's near-term impact. Although, note that the variogram point score could be used by almost all hubs, with some additional specification about joint across. It is possible that some specification would be needed for any variogram score to specify which dimension to assume the forecasts are joint across. +- **Compound_taskid_set + disaggregate_by conflict**: If a target has `disaggregate_by: location` and also uses a compound metric computed *across* locations (e.g., variogram score), should hubPredEvalsData skip disaggregation silently, warn, or error? - **Issue #44 (human-readable target names)**: Is [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) resolved? If not, #44 has an unresolved upstream dependency. -- **Issue #4 (default sort column)**: ~~Should this be configured in `predevals-config.yml` (schema change) or via a standalone predevals options object (JS only)?~~ **Resolved**: configure in `predevals-config.yml` as a schema change; included in Sprint B's v1.0.2 bump. +- **Issue #4 (default sort column)**: ~~Should this be configured in `predevals-config.yml` (schema change) or via a standalone predevals options object (JS only)?~~ **Resolved**: configure in `predevals-config.yml` as a schema change; included in Sprint B's v1.1.0 schema bump. 
- **Which scoringutils metrics are already surfaceable?** For quantile forecasts, any metric in `scoringutils::get_metrics(scoringutils::example_quantile)` is already available via the existing `get_standard_metrics()` pathway — no schema or code changes needed, just config. For sample forecasts, once PR #103 merges, CRPS/bias/DSS/energy score become available the same way. New scoringutils defaults propagate automatically. @@ -105,76 +103,89 @@ Documentation: ### Proposed solution -Five mini-sprints ordered by implementation complexity: UI-only first, then incremental schema additions, then new metric features. Each sprint is independently releasable. +Four mini-sprints ordered by implementation complexity: UI-only first, then schema additions, then new metric pipeline features. Each sprint is independently releasable. The core complexity axis is: -> **UI-only** (predevals JS/CSS, no schema changes) → **Config-driven** (small schema additions + JS) → **Full pipeline** (R packages + schema + JS across 4 repos) +> **UI-only** (predevals JS/CSS, no schema changes) → **Schema-driven** (hubPredEvalsData schema + R pipeline + JS) → **Full pipeline** (R packages + schema + JS across 4 repos) --- -### Mini-Sprint A — UI-only polish (~2 weeks) +### Mini-Sprint A — Dashboard UI improvements *Scope: predevals JS/CSS only. No schema changes. No R package changes.* +Sprint A is split into two tiers. **A1** contains table readiness work that is a prerequisite for Sprint B's scaled-metrics columns. **A2** contains important usability improvements that are not on the critical path for scaled metrics. 
+ +**A1 — Table readiness (blocking for Sprint B)** + | Issue | Change | Complexity | |-------|--------|-----------| -| [#31](https://github.com/hubverse-org/predevals/issues/31) 🐛 **priority** | Auto-update metric + disaggregate_by selectors when target changes | JS logic fix | -| [#5](https://github.com/hubverse-org/predevals/issues/5) **priority** | Preserve model selection state when switching plots | JS state management | | [#49](https://github.com/hubverse-org/predevals/issues/49) **priority** | Freeze model name column; horizontal scroll for other columns | CSS/JS table layout | +| [#50](https://github.com/hubverse-org/predevals/issues/50) | Enable column hiding via ColumnControl visibility toggle (one-line config change; ColumnControl already loaded) | JS config (trivial) | + +**A2 — UI polish (non-blocking)** + +| Issue | Change | Complexity | +|-------|--------|-----------| +| [#31](https://github.com/hubverse-org/predevals/issues/31) **priority** | Auto-update metric + disaggregate_by selectors when target changes | JS logic fix | +| [#5](https://github.com/hubverse-org/predevals/issues/5) **priority** | Preserve model selection state when switching plots | JS state management | | [#42](https://github.com/hubverse-org/predevals/issues/42) **priority** | Add "lower is better" / "closer to nominal is better" cues to axes and table headers | JS + display logic | | [#13](https://github.com/hubverse-org/predevals/issues/13) | Add metric definitions panel with abbreviations and direction indicators, shown by default and togglable | JS; metric glossary baked into predevals.js | -| [#50](https://github.com/hubverse-org/predevals/issues/50) | Enable column hiding via ColumnControl visibility toggle (one-line config change; ColumnControl already loaded) | JS config (trivial) | -| [#30](https://github.com/hubverse-org/predevals/issues/30) | Refactor code for getting full metrics list (prerequisite for Sprint C's per-scale metric treatment) | JS refactor | | 
[#34](https://github.com/hubverse-org/predevals/issues/34) | Rename all `predeval` → `predevals` references in source | JS cleanup (trivial) | | [#22](https://github.com/hubverse-org/predevals/issues/22) | Add unit tests for existing JS functionality; establishes test harness for TDD in subsequent sprints | JS testing infrastructure | **Deliverable**: New predevals release. No changes to hubPredEvalsData, hubEvals, or Docker. **Sprint A — Acceptance criteria:** -- All listed issues resolved and closed +- A1 issues resolved and closed (required before Sprint B) +- A2 issues resolved and closed - JS test harness established (#22) with tests covering new functionality - CI passes; new predevals release tagged --- -### Mini-Sprint B — Config-driven enhancements (~3 weeks) -*Scope: Small hubPredEvalsData schema additions + corresponding predevals JS. Schema bump to v1.0.2.* +### Mini-Sprint B — Evaluation config schema updates (v1.1.0) +*Scope: All hubPredEvalsData schema changes consolidated into a single v1.1.0 release + corresponding predevals JS + R pipeline changes. hubEvals unchanged (transforms already implemented).* + +This sprint consolidates config-driven UI enhancements and the scale transformation pipeline into a single schema bump, since schema updates are painful to coordinate across repos. The schema update checklist from the previous version bump is in [hubPredEvalsData#32](https://github.com/hubverse-org/hubPredEvalsData/issues/32). 
+ +**Config-driven UI enhancements:** | Issue | Change | Where | |-------|--------|-------| | [predevals#28](https://github.com/hubverse-org/predevals/issues/28) | Refactor numeric rounding into a shared helper (prerequisite for #48) | predevals JS refactor | | [predevals#48](https://github.com/hubverse-org/predevals/issues/48) **priority** | Add `decimal_places` per-target in config schema; JS reads and applies it for table display | hubPredEvalsData schema + predevals JS | | [predevals#27](https://github.com/hubverse-org/predevals/issues/27) | Refactor score-sorting into a helper function (prerequisite for #4) | predevals JS refactor | -| [predevals#4](https://github.com/hubverse-org/predevals/issues/4) **priority** | Add `default_sort_metric` to config/options; JS uses it for initial table sort (e.g., sort models by relative WIS ascending on first load rather than alphabetically) | hubPredEvalsData schema (or predevals options object) + predevals JS | +| [predevals#4](https://github.com/hubverse-org/predevals/issues/4) **priority** | Add `default_sort_metric` to config/options; JS uses it for initial table sort (e.g., sort models by relative WIS ascending on first load rather than alphabetically) | hubPredEvalsData schema + predevals JS | | [predevals#44](https://github.com/hubverse-org/predevals/issues/44) | Display `target_name` from `tasks.json` instead of `target_id` in dropdowns | hubPredEvalsData (depends on [#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21)) + predevals JS | > ⚠️ Issue #44 depends on upstream hubPredEvalsData work ([#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21)). Include only if that work is complete. -**Deliverable**: hubPredEvalsData v1.0.2 schema, new predevals release, Docker rebuild. 
+**Scale transformation pipeline:** + +The detailed implementation plan is in [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34), which covers the schema design, config validation, R pipeline changes, and predevals JS behavior. + +| Issue | Change | Where | +|-------|--------|-------| +| [predevals#30](https://github.com/hubverse-org/predevals/issues/30) | Refactor code for getting full metrics list (prerequisite for per-scale metric treatment) | predevals JS refactor | +| [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34) | Add `transform_defaults` (top-level) and per-target `transform` override to config schema; wire through to `hubEvals::score_model_out()` | hubPredEvalsData schema + R pipeline | +| — | When `append: true`, scores.csv gains a `scale` column; predevals treats each (metric × scale) combination as a distinct item in dropdowns and table columns | predevals JS | + +**Deliverable**: hubPredEvalsData v1.1.0 schema, new predevals release, Docker rebuild. 
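As a rough illustration of the config surface this sprint adds, a resolved `predevals-config.yml` fragment might look like the sketch below. The exact field names and nesting are still being designed in [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34), so treat every key here as a hypothetical placeholder, not the final schema:

```yaml
# Hypothetical predevals-config.yml fragment — key names and nesting are
# assumptions pending the schema design in hubPredEvalsData#34.
schema_version: "v1.1.0"

# Top-level defaults applied to every target unless overridden per target.
transform_defaults:
  transform: "log"   # forwarded to hubEvals::score_model_out()
  append: true       # keep natural-scale scores and add log-scale ones
  label: "log"       # value written to the scores.csv `scale` column

targets:
  - target_id: "wk inc flu hosp"
    decimal_places: 1     # per-target rounding (predevals#48)
    transform:            # per-target override of transform_defaults
      transform: "sqrt"
      append: true
```

With `append: true`, each configured metric would appear twice in scores.csv (natural scale plus the transformed scale), and predevals would list each (metric × scale) combination as a distinct dropdown item and table column.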
**Sprint B — Acceptance criteria:** -- All listed issues resolved and closed (#44 only if [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) is resolved) -- hubPredEvalsData schema v1.0.2 released; R tests pass +- All config-driven UI issues resolved and closed (#44 only if [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) is resolved) +- Scale transformation pipeline functional end-to-end: hub admin can configure log-scale evaluation in `predevals-config.yml` and see log-scaled metrics appear alongside natural-scale metrics +- hubPredEvalsData schema v1.1.0 released; R tests pass; schema update checklist ([#32](https://github.com/hubverse-org/hubPredEvalsData/issues/32)) followed - Docker image rebuilt and integration test passes - New predevals release tagged --- -### Mini-Sprint C — Scale transformation pipeline (~4 weeks) -*Scope: hubPredEvalsData schema v1.1.0 additions + predevals JS for scale UI. hubEvals unchanged (transforms already implemented).* - -The detailed implementation plan for this sprint is in [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34), which covers the schema design, config validation, R pipeline changes, and predevals JS behavior. - -**Summary**: Add `transform_defaults` (top-level) and per-target `transform` override to the hubPredEvalsData config schema. Wire the resolved transform config through to `hubEvals::score_model_out()`. When `append: true`, scores.csv gains a `scale` column and the predevals dashboard treats each (metric × scale) combination as a distinct item in dropdowns and table columns. 
+### Mini-Sprint C — Sample-based scoring with compound_taskid_set +*Scope: Merge near-complete hubEvals PR, then wire sample scoring + compound_taskid_set awareness through hubPredEvalsData and predevals.* -**Note**: Sprint A's table ergonomics work ([#49](https://github.com/hubverse-org/predevals/issues/49) fixed column) should land before or alongside this sprint, since adding scale variants increases the number of metric columns. - -**Deliverable**: hubPredEvalsData v1.1.0 schema, new predevals release, Docker rebuild. - ---- - -### Mini-Sprint D — Variogram score (~3–4 weeks) -*Scope: Merge near-complete hubEvals PR, then wire sample scoring + variogram through hubPredEvalsData and predevals.* +This sprint builds pipeline support for sample-format output types that require [`compound_taskid_set`](https://docs.hubverse.io/en/latest/user-guide/sample-output-type.html#compound-modeling-tasks) awareness. The variogram score is the motivating use case, but the infrastructure enables any future compound/multivariate metric for sample forecasts. See also [hubDocs PR #439](https://github.com/hubverse-org/hubDocs/pull/439) for incoming updates to compound modeling task documentation. **hubEvals changes — mostly done:** [PR #103](https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals/pull/103) (branch `ak/sample-scoring/94`) is open and awaiting final review. It adds: @@ -187,24 +198,24 @@ The detailed implementation plan for this sprint is in [hubPredEvalsData#34](htt **Action**: Review and merge PR #103; track scoringutils#1114 for variogram score availability. 
-**hubPredEvalsData changes** (extends v1.1.0 schema from Sprint C): +**hubPredEvalsData changes** (extends v1.1.0 schema from Sprint B or adds v1.2.0): - Add `compound_taskid_set` optional property to target config - Add `"variogram_score"` as a recognized metric name for sample output types - `R/utils-metrics.R` — add `sample = "variogram_score"` case in `get_standard_metrics()` -- `R/generate_eval_data.R` — extract and propagate `compound_taskid_set`; skip location-based disaggregation for sample metrics when `compound_taskid_set = "location"` +- `R/generate_eval_data.R` — extract and propagate `compound_taskid_set`; skip location-based disaggregation for compound metrics when `compound_taskid_set = "location"` **predevals JS changes:** - No new chart types; variogram score appears as another column in the overall scores table - Graceful handling of missing metric columns in disaggregated views (may already work) -> ⚠️ **Scope constraint**: The variogram score is computed jointly across locations and is **not** disaggregable by location. Document this in the config schema and validation error messages. +> ⚠️ **Scope constraint**: Compound metrics computed jointly across a dimension (e.g., variogram score across locations) are **not** disaggregable by that dimension. Document this in the config schema and validation error messages. **hubPredEvalsData-docker changes:** - [docker#6](https://github.com/hubverse-org/hubPredEvalsData-docker/issues/6): Replace ad-hoc oracle data fetching with `hubData` tooling, so the container fetches oracle output through the standard hubverse data access layer rather than direct file paths -**Deliverable**: hubEvals new minor version, hubPredEvalsData v1.1.0 (with Sprint C changes), Docker rebuild (with hubData oracle fetching). +**Deliverable**: hubEvals new minor version, hubPredEvalsData schema update, Docker rebuild (with hubData oracle fetching). 
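A hypothetical target entry illustrating the `compound_taskid_set` / `disaggregate_by` interaction described above — property names and nesting are assumptions until the schema work lands:

```yaml
# Hypothetical predevals-config.yml target entry — illustrative only.
targets:
  - target_id: "wk inc flu hosp"
    compound_taskid_set: ["location"]  # variogram score is joint across locations
    metrics:
      sample: ["variogram_score"]
    # Config validation would need to warn or error on the following,
    # since a compound metric computed *across* locations cannot also be
    # disaggregated *by* location:
    # disaggregate_by: ["location"]
```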
-**Sprint D — Acceptance criteria:** +**Sprint C — Acceptance criteria:** - hubEvals PR #103 merged with issues #99–#102 resolved - hubPredEvalsData supports `sample` output type with `compound_taskid_set` config; validates against `disaggregate_by` conflicts - Variogram score appears in predevals dashboard (once scoringutils#1114 lands) @@ -213,17 +224,17 @@ The detailed implementation plan for this sprint is in [hubPredEvalsData#34](htt --- -### Mini-Sprint E — Documentation (~2 weeks, can overlap with D) -*Scope: Extend the [existing developer guide](https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html) + predevals JS (end-user metric definitions, partly overlapping with Sprint A #13).* +### Mini-Sprint D — Documentation +*Scope: Extend the [existing developer guide](https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html) + predevals JS (end-user metric definitions, partly overlapping with Sprint A #13). Can overlap with other sprints.* -The existing guide covers infrastructure well (Docker setup, renv, build process, integration testing). It does **not** cover metric development workflow. Sprint E adds that missing layer as a new section or sibling page within the same developer guide. +The existing guide covers infrastructure well (Docker setup, renv, build process, integration testing). It does **not** cover metric development workflow. Sprint D adds that missing layer as a new section or sibling page within the same developer guide. **New content to add to hubDocs:** 1. Architecture diagram: hubEvals → hubPredEvalsData → predevals → dashboard (with pointers to existing infrastructure docs) 2. How to add a metric in hubEvals: implement a transform function, update `score_model_out`, update `get_metrics`, write tests 3. How to wire it through hubPredEvalsData: add metric name to schema, `get_standard_metrics`, `get_metric_name_to_output_type`, config validation 4. 
How predevals reads scores CSVs — when new metrics appear automatically vs. when JS changes are needed -5. Worked example: variogram score end-to-end (referencing Sprints C + D as concrete prior art) +5. Worked example: sample-based scoring with compound_taskid_set end-to-end (referencing Sprints B + C as concrete prior art) **End-user metric documentation** (builds on Sprint A #13): - If not fully addressed in Sprint A, finalize a metric glossary embedded in predevals covering: WIS, ae_point, se_point, interval coverage, variogram score @@ -231,7 +242,7 @@ The existing guide covers infrastructure well (Docker setup, renv, build process **Deliverable**: New or extended hubDocs page; predevals minor release if glossary was deferred from Sprint A. -**Sprint E — Acceptance criteria:** +**Sprint D — Acceptance criteria:** - A developer unfamiliar with the codebase can follow the guide to add a toy metric end-to-end without reading source across repos - All metric names in the guide are present in the predevals JS glossary (#13) - hubDocs CI passes @@ -242,30 +253,26 @@ The existing guide covers infrastructure well (Docker setup, renv, build process | Repo | Files | Sprint | |------|-------|--------| -| hubEvals | `R/validate.R`, `R/score_model_out.R`, new `R/transform_sample_model_out.R` | D | -| hubPredEvalsData | `inst/schema/v1.0.2/config_schema.json` (new), `inst/schema/v1.1.0/config_schema.json` (new), `R/config.R`, `R/utils-metrics.R`, `R/generate_eval_data.R` | B, C, D | -| predevals | `src/predevals.js` and related source files | A, B, C, D | -| hubPredEvalsData-docker | Dockerfile / entrypoint | B, C, D | -| hubDocs | New developer + end-user guide page | E | +| hubEvals | `R/validate.R`, `R/score_model_out.R`, new `R/transform_sample_model_out.R` | C | +| hubPredEvalsData | `inst/schema/v1.1.0/config_schema.json` (new), `R/config.R`, `R/utils-metrics.R`, `R/generate_eval_data.R` | B, C | +| predevals | `src/predevals.js` and related source files | A, B, C | 
+| hubPredEvalsData-docker | Dockerfile / entrypoint | B, C | +| hubDocs | New developer + end-user guide page | D | ### Scale and scope -- **Duration**: 1–3 months total across all sprints; Sprints A and E can run in parallel with others -- **Sequencing**: +- **Sequencing**: Sprint A1 must complete before Sprint B. All other ordering is flexible. ``` -Sprint A (UI polish, ~2w) ──────────────────────────► release -Sprint B (config enhancements, ~3w) ─────────────────► release -Sprint C (scale transforms, ~4w) ────────────► release -Sprint D (variogram score, ~3–4w) ────────► release -Sprint E (docs, ~2w) ──────────────────────► release +Sprint A1 (table readiness) ──► Sprint B (schema v1.1.0) ──► Sprint C (sample scoring) +Sprint A2 (UI polish) ──────────────────────────────────────────► release +Sprint D (docs) ──────────────────────────────────────────► release ``` -> **Note**: Sprint E can begin in parallel with Sprint B, but cannot fully close until Sprint D is complete. The worked example (item 5 in the content list) depends on Sprints C and D as concrete prior art. +> **Note**: Sprint D can begin early, but cannot fully close until Sprint C is complete. The worked example (item 5 in the content list) depends on Sprints B and C as concrete prior art. ### Key risks -1. **scoringutils#1114 timing**: Variogram score will land in hubEvals automatically once scoringutils#1114 merges, but the timeline is upstream-dependent. The rest of Sprint D (hubPredEvalsData schema, pipeline) can proceed with energy score in the meantime. +1. **scoringutils#1114 timing**: Variogram score will land in hubEvals automatically once scoringutils#1114 merges, but the timeline is upstream-dependent. The rest of Sprint C (hubPredEvalsData pipeline, compound_taskid_set support) can proceed with energy score in the meantime. 2. **hubPredEvalsData#21 dependency**: Sprint B's issue #44 (human-readable target names) cannot land until that upstream issue is resolved. -3. 
**Schema versioning coordination**: Sprints B, C, and D all modify the hubPredEvalsData schema. Recommend sequential release of B → C → D to avoid conflicting bumps. -4. **Table width**: Treating scaled metrics as separate columns (Sprint C) will widen the evaluation table significantly. Sprint A's frozen-column fix ([#49](https://github.com/hubverse-org/predevals/issues/49)) is a soft prerequisite for Sprint C. +3. **Table width**: Treating scaled metrics as separate columns (Sprint B) will widen the evaluation table significantly. Sprint A1's frozen-column fix ([#49](https://github.com/hubverse-org/predevals/issues/49)) is a hard prerequisite. From dc9e815eddef7c7e0b4ff16bd64f878f2aefb804 Mon Sep 17 00:00:00 2001 From: Nicholas Reich Date: Fri, 6 Mar 2026 16:35:45 -0500 Subject: [PATCH 4/8] minor manual updates to project poster --- .../eval-metrics-expansion.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md index a16e139..91e2c01 100644 --- a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md +++ b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md @@ -16,7 +16,7 @@ Extending the hubverse forecast evaluation ecosystem across four goals: 2. **Evaluation config schema updates**: Consolidate all hubPredEvalsData schema changes into a single v1.1.0 release, including: config-driven UI enhancements (per-target decimal precision, human-readable target names, configurable default sort column), and the scale transformation pipeline (wiring the already-implemented log/sqrt transform support in `hubEvals::score_model_out()` through the config schema and predevals UI). 
Issues: [predevals#48](https://github.com/hubverse-org/predevals/issues/48), [#44](https://github.com/hubverse-org/predevals/issues/44), [#4](https://github.com/hubverse-org/predevals/issues/4), [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34). -3. **Sample-based scoring with compound_taskid_set awareness**: Add `sample` output type support to hubEvals and build pipeline infrastructure for multivariate/compound metrics that require [`compound_taskid_set`](https://docs.hubverse.io/en/latest/user-guide/sample-output-type.html#compound-modeling-tasks) awareness. This enables the variogram score (`variogram_score_multivariate()` and `variogram_score_multivariate_point()`, recently added to scoringutils) as a metric evaluating ensemble spatial correlation structure across locations, and lays groundwork for future compound metrics on sample forecasts. See also [hubDocs PR #439](https://github.com/hubverse-org/hubDocs/pull/439) for incoming updates to compound modeling task documentation. +3. **Sample-based scoring with compound_taskid_set awareness**: Add `sample` output type support to hubEvals and build pipeline infrastructure for multivariate/compound metrics that require [`compound_taskid_set`](https://docs.hubverse.io/en/latest/user-guide/sample-output-type.html#compound-modeling-tasks) awareness. This enables the variogram score (`variogram_score_multivariate()` and `variogram_score_multivariate_point()`, recently added to scoringutils) as a metric evaluating spatial correlation structure across locations (or horizons, or more generally, across some subset of task-id variables), and lays groundwork for future compound metrics on sample forecasts. For background on this topic, see also [hubDocs PR #439](https://github.com/hubverse-org/hubDocs/pull/439) for incoming updates to compound modeling task documentation. 4. 
**Developer documentation**: A hubDocs guide explaining the full metric pipeline for future contributors, plus standardized end-user metric definitions in the dashboard. @@ -25,7 +25,7 @@ Extending the hubverse forecast evaluation ecosystem across four goals: - Increasing evaluation diversity was the most highly ranked priority in a recent survey of hubverse users. This project both surfaces existing metrics, adds new ones, and makes all evaluations more interpretable. - Many predevals usability gaps have been open for over a year; the dashboard is hard for non-expert users to interpret (no metric definitions, no "lower is better" cues, table ergonomics issues). - Infectious disease forecasts predict count data that spans orders of magnitude across locations. Log-scale evaluation provides fairer cross-location comparisons. -- The variogram score is a multivariate proper scoring rule capturing spatial correlation—something WIS cannot measure—and scoringutils now supports it natively. +- The variogram score is a multivariate proper scoring rule capturing correlation across linked prediction tasks—something WIS cannot measure—and scoringutils now supports it natively. - The [existing developer guide](https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html) covers infrastructure (Docker, build, testing) but not metric development workflow, making it hard for new contributors to add metrics end-to-end. ### What are we _not_ trying to do? @@ -33,11 +33,11 @@ Extending the hubverse forecast evaluation ecosystem across four goals: - Not adding new chart types (existing table/heatmap/line plot are sufficient). - Not changing the existing WIS/AE/interval coverage metrics. - Not implementing calibration/reliability diagram visualizations in this project. -- Not adding the variogram score for quantile-format forecasts (requires ensemble/sample format). +- Not adding the variogram score for quantile-format forecasts (requires sample or mean/median format). 
### How do we judge success?

-- A first-time user reading the dashboard understands what WIS means, which direction is better, and which models are performing best—without leaving the page.
+- A first-time user reading the dashboard understands what WIS and other metrics mean, which direction is better, and which models are performing best, all without leaving the page.
- A hub admin can configure log-scale evaluation in `predevals-config.yml` and see log-scaled metrics appear as distinct items alongside natural-scale metrics in all dropdowns and table columns.
- A hub submitting sample-format ensemble forecasts can configure compound_taskid_set and view compound metrics (e.g., variogram score) in the dashboard.
- A new developer can follow the hubDocs guide to add a hypothetical metric end-to-end without reading source code across repos.
@@ -82,7 +82,7 @@ Documentation:
- [Developer guide (infrastructure)](https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html)

- `hubEvals::score_model_out()` (v0.1.0) already supports `transform`, `transform_append`, and `transform_label`—no hubEvals changes needed for the scale transform pipeline.
-- scoringutils dev branch has `as_forecast_multivariate_sample()`, `variogram_score_multivariate()`, and `variogram_score_multivariate_point()`.
+- scoringutils dev branch has `as_forecast_multivariate_sample()`, `variogram_score_multivariate()`, and `variogram_score_multivariate_point()` (to be released by mid-March).
- The predevals JS dashboard builds via webpack into `dist/predevals.bundle.js`; the existing table/heatmap/line plot handle new metric columns without new chart types. - predevals issues #31, #5, #42, #41, #49 are pure JS/CSS changes with no schema dependencies. @@ -90,9 +90,9 @@ Documentation: ### What do we need to answer? - **Variogram score in scoringutils**: [scoringutils#1114](https://github.com/epiforecasts/scoringutils/issues/1114) tracks adding `variogram_score_multivariate()` as a default compound metric. Confirm its merge status before closing Sprint C. hubEvals PR #103 is already wired to pick it up dynamically once it lands. -- **How many active hubs submit `sample`-format forecasts?** This determines Sprint C's near-term impact. Although, note that the variogram point score could be used by almost all hubs, with some additional specification about joint across. It is possible that some specification would be needed for any variogram score to specify which dimension to assume the forecasts are joint across. -- **Compound_taskid_set + disaggregate_by conflict**: If a target has `disaggregate_by: location` and also uses a compound metric computed *across* locations (e.g., variogram score), should hubPredEvalsData skip disaggregation silently, warn, or error? -- **Issue #44 (human-readable target names)**: Is [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) resolved? If not, #44 has an unresolved upstream dependency. +- **What is the best way to specify the compound-task-id set to allow for multivariate scoring?** To compute a score on a multivariate outcome, such as variogram score or the energy score, we need to identify which task id variables define a compound-task-id. How can these be simply and flexibly specified and validated to enable these metrics? 
+- **Compound_taskid_set + disaggregate_by conflict**: If a target has `disaggregate_by: location` and also uses a compound metric computed *across* locations (e.g., variogram score), should hubPredEvalsData skip disaggregation silently, warn, or error? Relatedly, how should overall aggregation work, and how would we turn off disaggregation by the task-id variable that defines the compound task-id set? For example, if you wanted to compute the variogram score across all locations, you couldn't disaggregate the computation by location.
+- **Issue #44 (human-readable target names)**: ~~Is [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) resolved? If not, #44 has an unresolved upstream dependency.~~ **Resolved**: Issue #21 is not yet resolved, so #44 still has an upstream dependency.
- **Issue #4 (default sort column)**: ~~Should this be configured in `predevals-config.yml` (schema change) or via a standalone predevals options object (JS only)?~~ **Resolved**: configure in `predevals-config.yml` as a schema change; included in Sprint B's v1.1.0 schema bump.
- **Which scoringutils metrics are already surfaceable?** For quantile forecasts, any metric in `scoringutils::get_metrics(scoringutils::example_quantile)` is already available via the existing `get_standard_metrics()` pathway — no schema or code changes needed, just config. For sample forecasts, once PR #103 merges, CRPS/bias/DSS/energy score become available the same way. New scoringutils defaults propagate automatically.
@@ -134,7 +134,7 @@ Sprint A is split into two tiers.
**A1** contains table readiness work that is a | [#34](https://github.com/hubverse-org/predevals/issues/34) | Rename all `predeval` → `predevals` references in source | JS cleanup (trivial) | | [#22](https://github.com/hubverse-org/predevals/issues/22) | Add unit tests for existing JS functionality; establishes test harness for TDD in subsequent sprints | JS testing infrastructure | -**Deliverable**: New predevals release. No changes to hubPredEvalsData, hubEvals, or Docker. +**Deliverable**: New predevals release(s). No changes to hubPredEvalsData, hubEvals, or Docker. **Sprint A — Acceptance criteria:** - A1 issues resolved and closed (required before Sprint B) From 608c4ec861f1a1b27d6e68b8e04504cd82dda805 Mon Sep 17 00:00:00 2001 From: Nicholas Reich Date: Fri, 6 Mar 2026 16:36:19 -0500 Subject: [PATCH 5/8] adding DS_Store to gitignores. --- .gitignore | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.gitignore b/.gitignore index c9ce40a..81ffe67 100644 --- a/.gitignore +++ b/.gitignore @@ -4,3 +4,5 @@ .Ruserdata *.Rproj *arrow-issue.qmd + +*.DS_Store \ No newline at end of file From 585deaebd171ef075d48c4aef85d26d36ede1b02 Mon Sep 17 00:00:00 2001 From: Nicholas Reich Date: Fri, 6 Mar 2026 16:43:05 -0500 Subject: [PATCH 6/8] one other minor change in response to comments. --- .../eval-metrics-expansion/eval-metrics-expansion.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md index 91e2c01..7749ee0 100644 --- a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md +++ b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md @@ -39,7 +39,7 @@ Extending the hubverse forecast evaluation ecosystem across four goals: - A first-time user reading the dashboard understands what WIS and other metrics mean, which direction is better, and which models are performing best, all without leaving the page. 
- A hub admin can configure log-scale evaluation in `predevals-config.yml` and see log-scaled metrics appear as distinct items alongside natural-scale metrics in all dropdowns and table columns. -- A hub submitting sample-format ensemble forecasts can configure compound_taskid_set and view compound metrics (e.g., variogram score) in the dashboard. +- A hub submitting multivariate forecasts can configure compound_taskid_set and view compound metrics (e.g., variogram score) in the dashboard. - A new developer can follow the hubDocs guide to add a hypothetical metric end-to-end without reading source code across repos. ### What are possible solutions? From 87df936dc83ff572328f8958ee7bd4daabffa1c0 Mon Sep 17 00:00:00 2001 From: Nicholas Reich Date: Mon, 23 Mar 2026 11:46:08 -0400 Subject: [PATCH 7/8] updating to reflect recent conversations --- .../eval-metrics-expansion.md | 76 ++++++++++++++----- 1 file changed, 57 insertions(+), 19 deletions(-) diff --git a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md index 7749ee0..663e978 100644 --- a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md +++ b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md @@ -1,6 +1,6 @@ # Project Poster: Expanding Hubverse Evaluation Metrics and Dashboard Support -- Date: 2026-02-26 +- Date: 2026-03-23 - Owner: Nicholas Reich @@ -10,23 +10,26 @@ ### What are we doing? -Extending the hubverse forecast evaluation ecosystem across four goals: +Extending the hubverse forecast evaluation ecosystem across five goals: 1. **Dashboard UI improvements**: Fix known usability gaps in the predevals dashboard—table ergonomics for wider tables (column hiding, frozen model column), metric documentation, score direction indicators, and several priority bugs—without touching R packages or config schemas. 
Table readiness issues ([#49](https://github.com/hubverse-org/predevals/issues/49), [#50](https://github.com/hubverse-org/predevals/issues/50)) are critical prerequisites for the scaled-metrics work in goal 2. Additional polish issues: [#13](https://github.com/hubverse-org/predevals/issues/13), [#31](https://github.com/hubverse-org/predevals/issues/31), [#42](https://github.com/hubverse-org/predevals/issues/42), [#41](https://github.com/hubverse-org/predevals/issues/41), [#5](https://github.com/hubverse-org/predevals/issues/5). 2. **Evaluation config schema updates**: Consolidate all hubPredEvalsData schema changes into a single v1.1.0 release, including: config-driven UI enhancements (per-target decimal precision, human-readable target names, configurable default sort column), and the scale transformation pipeline (wiring the already-implemented log/sqrt transform support in `hubEvals::score_model_out()` through the config schema and predevals UI). Issues: [predevals#48](https://github.com/hubverse-org/predevals/issues/48), [#44](https://github.com/hubverse-org/predevals/issues/44), [#4](https://github.com/hubverse-org/predevals/issues/4), [hubPredEvalsData#34](https://github.com/hubverse-org/hubPredEvalsData/issues/34). -3. **Sample-based scoring with compound_taskid_set awareness**: Add `sample` output type support to hubEvals and build pipeline infrastructure for multivariate/compound metrics that require [`compound_taskid_set`](https://docs.hubverse.io/en/latest/user-guide/sample-output-type.html#compound-modeling-tasks) awareness. This enables the variogram score (`variogram_score_multivariate()` and `variogram_score_multivariate_point()`, recently added to scoringutils) as a metric evaluating spatial correlation structure across locations (or horizons, or more generally, across some subset of task-id variables), and lays groundwork for future compound metrics on sample forecasts. 
For background on this topic, see also [hubDocs PR #439](https://github.com/hubverse-org/hubDocs/pull/439) for incoming updates to compound modeling task documentation.
+3. **Sample-based scoring with compound_taskid_set awareness**: Add `sample` output type support to hubEvals and build pipeline infrastructure for multivariate/compound metrics that require [`compound_taskid_set`](https://docs.hubverse.io/en/latest/user-guide/sample-output-type.html#compound-modeling-tasks) awareness. This enables the variogram score (`variogram_score_multivariate()` and `variogram_score_multivariate_point()`, recently added to scoringutils) as a metric evaluating spatial correlation structure across locations (or horizons, or more generally, across some subset of task-id variables), and lays groundwork for future compound metrics on sample forecasts. For background on this topic, see also [hubDocs PR #439](https://github.com/hubverse-org/hubDocs/pull/439) for incoming updates to compound modeling task documentation. Note that as part of this sprint, we plan to evaluate the complexity and feasibility of integrating compound-taskid-sets into the dashboard, considering that at least [one more complex hub](https://reichlab.io/variant-nowcast-hub-dashboard/eval.html) has already rolled its own evaluation pipeline that plugs into predevals to leverage existing UI tools.
-4. **Developer documentation**: A hubDocs guide explaining the full metric pipeline for future contributors, plus standardized end-user metric definitions in the dashboard.
+4. **Developer documentation**: A hubDocs guide explaining the full metric pipeline for future contributors, plus standardized end-user metric definitions in the dashboard.
+
+5. **Formalize functional links between hubverse and scoringutils**: It [has been suggested](https://github.com/orgs/hubverse-org/discussions/40#discussioncomment-14882858) that we create a data connector utility between hubverse and scoringutils formats.
This would enable analysts to more easily move between the two data formats without having to actually do scoring. The code for some of this is already written; it just needs to be surfaced formally.

### Why are we doing this?
- Increasing evaluation diversity was the most highly ranked priority in a recent survey of hubverse users. This project surfaces existing metrics, adds new ones, and makes all evaluations more interpretable.
- Many predevals usability gaps have been open for over a year; the dashboard is hard for non-expert users to interpret (no metric definitions, no "lower is better" cues, table ergonomics issues).
-- Infectious disease forecasts predict count data that spans orders of magnitude across locations. Log-scale evaluation provides fairer cross-location comparisons.
+- Infectious disease forecasts predict count data that spans orders of magnitude across locations. Log-scale evaluation provides a different take on model comparisons that aggregate across spatial scales.
- The variogram score is a multivariate proper scoring rule capturing correlation across linked prediction tasks—something WIS cannot measure—and scoringutils now supports it natively.
- The [existing developer guide](https://docs.hubverse.io/en/latest/developer/dashboard-predevals.html) covers infrastructure (Docker, build, testing) but not metric development workflow, making it hard for new contributors to add metrics end-to-end.
+- Improving interoperability between hubverse and scoringutils tool sets will help users.

### What are we _not_ trying to do?
@@ -39,8 +39,9 @@ Extending the hubverse forecast evaluation ecosystem across four goals:
- A first-time user reading the dashboard understands what WIS and other metrics mean, which direction is better, and which models are performing best, all without leaving the page.
- A hub admin can configure log-scale evaluation in `predevals-config.yml` and see log-scaled metrics appear as distinct items alongside natural-scale metrics in all dropdowns and table columns.
-- A hub submitting multivariate forecasts can configure compound_taskid_set and view compound metrics (e.g., variogram score) in the dashboard.
+- A hub submitting multivariate forecasts can configure compound_taskid_set and view compound metrics (e.g., variogram score) in the dashboard (depending on the outcome of the feasibility analysis).
- A new developer can follow the hubDocs guide to add a hypothetical metric end-to-end without reading source code across repos.
+- Users can move data between hubverse and scoringutils formats.

### What are possible solutions?
@@ -95,7 +99,7 @@ Documentation:
- **Issue #44 (human-readable target names)**: ~~Is [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) resolved? If not, #44 has an unresolved upstream dependency.~~ **Resolved**: Issue #21 is not yet resolved, so #44 still has an upstream dependency.
- **Issue #4 (default sort column)**: ~~Should this be configured in `predevals-config.yml` (schema change) or via a standalone predevals options object (JS only)?~~ **Resolved**: configure in `predevals-config.yml` as a schema change; included in Sprint B's v1.1.0 schema bump.
- **Which scoringutils metrics are already surfaceable?** For quantile forecasts, any metric in `scoringutils::get_metrics(scoringutils::example_quantile)` is already available via the existing `get_standard_metrics()` pathway — no schema or code changes needed, just config. For sample forecasts, once PR #103 merges, CRPS/bias/DSS/energy score become available the same way. New scoringutils defaults propagate automatically.
-
+- **What happens when hubs have multiple output types for the same target?** Sometimes, who output types might enable multiple scores to be shown, or the same score to be calculated in different ways.
How do we support this? or allow users to specify which output-type to use when calculating metrics? or just set defaults for prioritizing one output-type over another? This may already happen for example if we have quantile with output_type_id == 0.50 (median) forecasts and point forecasts - they both could be used to compute point forecast metrics. --- @@ -103,7 +107,7 @@ Documentation: ### Proposed solution -Four mini-sprints ordered by implementation complexity: UI-only first, then schema additions, then new metric pipeline features. Each sprint is independently releasable. +Five mini-sprints ordered by implementation complexity: UI-only first, then schema additions, then new metric pipeline features. Each sprint is independently releasable. The core complexity axis is: @@ -125,12 +129,14 @@ Sprint A is split into two tiers. **A1** contains table readiness work that is a **A2 — UI polish (non-blocking)** +In order of priority: + | Issue | Change | Complexity | |-------|--------|-----------| +| [#13](https://github.com/hubverse-org/predevals/issues/13) **priority** | Add metric definitions panel with abbreviations and direction indicators, shown by default and togglable | JS; metric glossary baked into predevals.js | | [#31](https://github.com/hubverse-org/predevals/issues/31) **priority** | Auto-update metric + disaggregate_by selectors when target changes | JS logic fix | | [#5](https://github.com/hubverse-org/predevals/issues/5) **priority** | Preserve model selection state when switching plots | JS state management | | [#42](https://github.com/hubverse-org/predevals/issues/42) **priority** | Add "lower is better" / "closer to nominal is better" cues to axes and table headers | JS + display logic | -| [#13](https://github.com/hubverse-org/predevals/issues/13) | Add metric definitions panel with abbreviations and direction indicators, shown by default and togglable | JS; metric glossary baked into predevals.js | | 
[#34](https://github.com/hubverse-org/predevals/issues/34) | Rename all `predeval` → `predevals` references in source | JS cleanup (trivial) |
| [#22](https://github.com/hubverse-org/predevals/issues/22) | Add unit tests for existing JS functionality; establishes test harness for TDD in subsequent sprints | JS testing infrastructure |
@@ -147,16 +153,23 @@ Sprint A is split into two tiers. **A1** contains table readiness work that is a
### Mini-Sprint B — Evaluation config schema updates (v1.1.0)
*Scope: All hubPredEvalsData schema changes consolidated into a single v1.1.0 release + corresponding predevals JS + R pipeline changes. hubEvals unchanged (transforms already implemented).*

-This sprint consolidates config-driven UI enhancements and the scale transformation pipeline into a single schema bump, since schema updates are painful to coordinate across repos. The schema update checklist from the previous version bump is in [hubPredEvalsData#32](https://github.com/hubverse-org/hubPredEvalsData/issues/32).
+This sprint consolidates config-driven UI enhancements and the scale transformation pipeline.
+It may be completed in a single schema bump, or may be spread out over several updates, depending on the pace of progress and level of coordination.
+Schema updates are painful to coordinate across repos, but new Claude skills may make coordination easier, and we may want a few separate use cases to test out new releases.
+The schema update checklist from the previous version bump is in [hubPredEvalsData#32](https://github.com/hubverse-org/hubPredEvalsData/issues/32).
+
+The current plan is to make concurrent progress on the scale transformation pipeline and a few of the config-driven UI enhancements, making one schema bump when the scale transformation work is ready.
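For concreteness, the combined v1.1.0 changes might surface in a hub's `predevals-config.yml` roughly as follows. `decimal_places` (predevals#48) and `default_sort_metric` (predevals#4) are named in the issues, but the exact shape of this fragment, and the `transforms` block in particular, is an illustrative guess, not the settled schema:

```yaml
# Hypothetical predevals-config.yml fragment under the proposed v1.1.0 schema.
# Key placement and the `transforms` structure are illustrative only.
targets:
  - target_id: wk inc flu hosp
    decimal_places: 1             # per-target rounding for table display (predevals#48)
    default_sort_metric: wis      # initial table sort column on first load (predevals#4)
    metrics: [wis, ae_median, interval_coverage_50, interval_coverage_95]
    transforms:
      - fun: log                  # wired through to score_model_out()'s transform argument
        label: log                # log-scaled metrics shown alongside natural-scale ones
```

One design question this sketch surfaces is whether transforms belong per-target (as drawn) or at the top level of the config; the v1.1.0 discussion should settle that before the schema bump.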
**Config-driven UI enhancements:** +In order of priority: + | Issue | Change | Where | |-------|--------|-------| -| [predevals#28](https://github.com/hubverse-org/predevals/issues/28) | Refactor numeric rounding into a shared helper (prerequisite for #48) | predevals JS refactor | -| [predevals#48](https://github.com/hubverse-org/predevals/issues/48) **priority** | Add `decimal_places` per-target in config schema; JS reads and applies it for table display | hubPredEvalsData schema + predevals JS | | [predevals#27](https://github.com/hubverse-org/predevals/issues/27) | Refactor score-sorting into a helper function (prerequisite for #4) | predevals JS refactor | | [predevals#4](https://github.com/hubverse-org/predevals/issues/4) **priority** | Add `default_sort_metric` to config/options; JS uses it for initial table sort (e.g., sort models by relative WIS ascending on first load rather than alphabetically) | hubPredEvalsData schema + predevals JS | +| [predevals#28](https://github.com/hubverse-org/predevals/issues/28) | Refactor numeric rounding into a shared helper (prerequisite for #48) | predevals JS refactor | +| [predevals#48](https://github.com/hubverse-org/predevals/issues/48) **priority** | Add `decimal_places` per-target in config schema; JS reads and applies it for table display | hubPredEvalsData schema + predevals JS | | [predevals#44](https://github.com/hubverse-org/predevals/issues/44) | Display `target_name` from `tasks.json` instead of `target_id` in dropdowns | hubPredEvalsData (depends on [#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21)) + predevals JS | > ⚠️ Issue #44 depends on upstream hubPredEvalsData work ([#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21)). Include only if that work is complete. 
@@ -185,7 +198,14 @@ The detailed implementation plan is in [hubPredEvalsData#34](https://github.com/
### Mini-Sprint C — Sample-based scoring with compound_taskid_set
*Scope: Merge near-complete hubEvals PR, then wire sample scoring + compound_taskid_set awareness through hubPredEvalsData and predevals.*

-This sprint builds pipeline support for sample-format output types that require [`compound_taskid_set`](https://docs.hubverse.io/en/latest/user-guide/sample-output-type.html#compound-modeling-tasks) awareness. The variogram score is the motivating use case, but the infrastructure enables any future compound/multivariate metric for sample forecasts. See also [hubDocs PR #439](https://github.com/hubverse-org/hubDocs/pull/439) for incoming updates to compound modeling task documentation.
+This sprint builds pipeline support for sample-format output types that require [`compound_taskid_set`](https://docs.hubverse.io/en/latest/user-guide/sample-output-type.html#compound-modeling-tasks) awareness.
+The variogram score is the motivating use case, but the infrastructure enables any future compound/multivariate metric for sample forecasts.
+See also [hubDocs PR #439](https://github.com/hubverse-org/hubDocs/pull/439) for incoming updates to compound modeling task documentation.
+
+However, we will begin by assessing the feasibility and challenges of adding support for univariate sample scores vs. multivariate sample scores.
+The first deliverable here will be to define the scope of which features to roll out and what the challenges will be with each of them.
+Example complexities include how/whether to handle multivariate sample scoring and how to treat compound_taskid_sets in the config files.
+Another is how/whether to handle clashes when multiple output_types can be used to compute the same score.

**hubEvals changes — mostly done:** [PR #103](https://github.com/Infectious-Disease-Modeling-Hubs/hubEvals/pull/103) (branch `ak/sample-scoring/94`) is open and awaiting final review.
It adds: @@ -204,10 +224,6 @@ This sprint builds pipeline support for sample-format output types that require - `R/utils-metrics.R` — add `sample = "variogram_score"` case in `get_standard_metrics()` - `R/generate_eval_data.R` — extract and propagate `compound_taskid_set`; skip location-based disaggregation for compound metrics when `compound_taskid_set = "location"` -**predevals JS changes:** -- No new chart types; variogram score appears as another column in the overall scores table -- Graceful handling of missing metric columns in disaggregated views (may already work) - > ⚠️ **Scope constraint**: Compound metrics computed jointly across a dimension (e.g., variogram score across locations) are **not** disaggregable by that dimension. Document this in the config schema and validation error messages. **hubPredEvalsData-docker changes:** @@ -240,6 +256,7 @@ The existing guide covers infrastructure well (Docker setup, renv, build process - If not fully addressed in Sprint A, finalize a metric glossary embedded in predevals covering: WIS, ae_point, se_point, interval coverage, variogram score - Hub-level task variable definitions handled separately (via hub model-output README, not in predevals) + **Deliverable**: New or extended hubDocs page; predevals minor release if glossary was deferred from Sprint A. **Sprint D — Acceptance criteria:** @@ -249,15 +266,35 @@ The existing guide covers infrastructure well (Docker setup, renv, build process --- +### Mini-Sprint E — hubverse ↔ scoringutils data connector +*Scope: hubEvals (new exported functions) + documentation. No schema or dashboard changes.* + +Formalize the existing code that converts between hubverse and scoringutils data formats into a documented, exported utility. This lets analysts use scoringutils tools directly on hubverse data (and vice versa) without running the full scoring pipeline. 
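To make the planned converter concrete, the core column mapping it would perform can be sketched as below. Python/pandas is used purely for illustration (the real implementation would be R code in hubEvals), the function name is hypothetical, and the specific task-id columns and the `oracle_value` column name are assumptions based on common hubverse conventions:

```python
import pandas as pd

def model_out_to_scoringutils(model_out: pd.DataFrame, oracle: pd.DataFrame) -> pd.DataFrame:
    """Illustrative sketch: reshape the quantile rows of a hubverse model_out_tbl
    into the long format scoringutils (>= 2.0) expects: model, quantile_level,
    predicted, observed, plus the task-id columns that define the forecast unit."""
    q = model_out[model_out["output_type"] == "quantile"].copy()
    q["quantile_level"] = q["output_type_id"].astype(float)   # e.g. "0.25" -> 0.25
    q = q.rename(columns={"value": "predicted", "model_id": "model"})
    task_ids = ["target", "location", "horizon"]              # assumed task-id columns
    # join the observed values from the hub's oracle-output table
    obs = oracle.rename(columns={"oracle_value": "observed"})
    merged = q.merge(obs[task_ids + ["observed"]], on=task_ids)
    return merged[["model", *task_ids, "quantile_level", "predicted", "observed"]]
```

A real connector would also need to validate output types, handle `sample` and `point` rows, and carry along any extra task-id columns; the point of the sketch is only that the conversion is a mechanical reshape plus a join, which is why formalizing the existing internal code is comparatively low-risk.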
+ +| Step | Change | Where | +|------|--------|-------| +| 1 | Identify and consolidate existing internal conversion code (e.g., format transformations in `score_model_out()` and `transform_sample_model_out()`) into exported helper functions | hubEvals | +| 2 | Export converter functions (e.g., `as_scoringutils()` / `as_hubverse()`) with documentation and input validation | hubEvals | +| 3 | Add vignette or hubDocs section showing round-trip usage examples | hubEvals vignette or hubDocs | + +**Deliverable**: hubEvals minor release with exported converter functions; vignette or docs page. + +**Sprint E — Acceptance criteria:** +- A user can convert a hubverse `model_out_tbl` to a scoringutils forecast object (and back) using documented exported functions +- Round-trip conversion preserves data fidelity (test coverage) +- `R CMD check` passes + +--- + ### Files to modify across all sprints | Repo | Files | Sprint | |------|-------|--------| -| hubEvals | `R/validate.R`, `R/score_model_out.R`, new `R/transform_sample_model_out.R` | C | +| hubEvals | `R/validate.R`, `R/score_model_out.R`, new `R/transform_sample_model_out.R`, new converter functions | C, E | | hubPredEvalsData | `inst/schema/v1.1.0/config_schema.json` (new), `R/config.R`, `R/utils-metrics.R`, `R/generate_eval_data.R` | B, C | | predevals | `src/predevals.js` and related source files | A, B, C | | hubPredEvalsData-docker | Dockerfile / entrypoint | B, C | -| hubDocs | New developer + end-user guide page | D | +| hubDocs | New developer + end-user guide page | D, E | ### Scale and scope @@ -267,6 +304,7 @@ The existing guide covers infrastructure well (Docker setup, renv, build process Sprint A1 (table readiness) ──► Sprint B (schema v1.1.0) ──► Sprint C (sample scoring) Sprint A2 (UI polish) ──────────────────────────────────────────► release Sprint D (docs) ──────────────────────────────────────────► release +Sprint E (data connector) ──────────────────────────────────────────► release ``` > **Note**: 
Sprint D can begin early, but cannot fully close until Sprint C is complete. The worked example (item 5 in the content list) depends on Sprints B and C as concrete prior art. From 704c7e27e340cb7ce7a6780d1a28ac92c9cfb989 Mon Sep 17 00:00:00 2001 From: Nicholas G Reich Date: Mon, 23 Mar 2026 15:59:45 -0400 Subject: [PATCH 8/8] Apply suggestions from code review Co-authored-by: Nicholas G Reich --- .../eval-metrics-expansion/eval-metrics-expansion.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md index 663e978..8c8d686 100644 --- a/project-posters/eval-metrics-expansion/eval-metrics-expansion.md +++ b/project-posters/eval-metrics-expansion/eval-metrics-expansion.md @@ -99,7 +99,7 @@ Documentation: - **Issue #44 (human-readable target names)**: ~~Is [hubPredEvalsData#21](https://github.com/hubverse-org/hubPredEvalsData/issues/21) resolved? If not, #44 has an unresolved upstream dependency.~~ **Resolved**: Issue #21 is not resolved so there is an upstream dependency. - **Issue #4 (default sort column)**: ~~Should this be configured in `predevals-config.yml` (schema change) or via a standalone predevals options object (JS only)?~~ **Resolved**: configure in `predevals-config.yml` as a schema change; included in Sprint B's v1.1.0 schema bump. - **Which scoringutils metrics are already surfaceable?** For quantile forecasts, any metric in `scoringutils::get_metrics(scoringutils::example_quantile)` is already available via the existing `get_standard_metrics()` pathway — no schema or code changes needed, just config. For sample forecasts, once PR #103 merges, CRPS/bias/DSS/energy score become available the same way. New scoringutils defaults propagate automatically. 
-- **What happens when hubs have multiple output types for the same target?** Sometimes, who output types might enable multiple scores to be shown, or the same score to be calculated in different ways. How do we support this? or allow users to specify which output-type to use when calculating metrics? or just set defaults for prioritizing one output-type over another? This may already happen for example if we have quantile with output_type_id == 0.50 (median) forecasts and point forecasts - they both could be used to compute point forecast metrics.
+- **What happens when hubs have multiple output types for the same target?** Sometimes, output types present in the data might enable multiple scores to be shown, or the same score to be calculated in different ways. How do we support this? Should we allow users to specify which output-type to use when calculating metrics, or just set defaults for prioritizing one output-type over another? This may already happen: for example, if we have quantile forecasts with output_type_id == 0.50 (median) and point forecasts, both could be used to compute point-forecast metrics.

---