second half custom_epiworkflows.Rmd

nmdefries · dsweber2 · commit 05f1507008fd · 2025-04-10T10:30:02.000-05:00
diff --git a/vignettes/custom_epiworkflows.Rmd b/vignettes/custom_epiworkflows.Rmd
@@ -91,10 +91,10 @@ The `{parsnip}` front-end abstracts away the differences in interface between a
 
 ### Postprocessor
 
-Postprocessors are unique to this `{epipredict}`.
+Postprocessors are unique to `{epipredict}`.
 A postprocessor modifies and formats the prediction after a model has been fit.
 
-Each operation within a postprocessor is called a `layer_`, and the stack of layers is known as `frosting()`,
+Each operation within a postprocessor is called a "layer" (functions are named `layer_*`), and the stack of layers is known as `frosting()`,
 continuing the metaphor of baking a cake established in `{recipes}`.
 Some example operations include:
 
@@ -314,23 +314,23 @@ data.
 
 # Extending `four_week_ahead`
 
-Now that we've recreated `four_week_ahead`, we can start modifying parts of it to get custom behavior.
+Now that we know how to create `four_week_ahead` from scratch, we can start modifying the workflow to get custom behavior.
 
 There are many ways we could modify `four_week_ahead`. We might consider:
 
-- Including a growth rate estimate as part of the model
-- Converting from rates to counts, for example if that were the
-preferred prediction format
-- Including a time component to the prediction, useful if we
-expect there to be a strong seasonal component
-- Scaling by some factor
+- Converting from rates to counts
+- Including a growth rate estimate as a predictor
+- Including a time component as a predictor -- useful if we
+expect there to be a strong seasonal component to the outcome
+- Scaling by a factor
 
 We will demo a couple of these modifications below.
 
 ## Growth rate
 
-One feature that may potentially improve our forecast is looking at the growth
-rate
+Let's say we're interested in including growth rate as a predictor in our model because
+we think it may improve our forecast.
+We can easily create a new growth rate column as a step in the `epi_recipe`.
 
 ```{r growth_rate_recipe}
 growth_rate_recipe <- epi_recipe(
@@ -341,6 +341,7 @@ growth_rate_recipe <- epi_recipe(
   step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
   step_epi_ahead(death_rate, ahead = 4 * 7) |>
   step_epi_naomit() |>
+  # Calculate growth rate from death rate column.
   step_growth_rate(death_rate) |>
   step_training_window()
 ```
@@ -354,7 +355,8 @@ growth_rate_recipe |>
   select(
     geo_value, time_value, case_rate,
     death_rate, gr_7_rel_change_death_rate
-  )
+  ) |>
+  tail()
 ```
 
 And the role:
@@ -396,8 +398,7 @@ gr_predictions <- gr_fit_workflow |>
 <details>
 <summary> Plot </summary>
 
-Plotting the result; this is reusing some code from the landing page to print
-the forecast date.
+We'll reuse some code from the landing page to plot the result.
 
 ```{r plotting}
 forecast_date_label <-
@@ -416,7 +417,7 @@ result_plot <- autoplot(
 ) +
   geom_vline(aes(xintercept = forecast_date)) +
   geom_text(
-    data = forecast_date_label %>% filter(.response_name == "death_rate"),
+    data = forecast_date_label |> filter(.response_name == "death_rate"),
     aes(x = dates, label = "forecast\ndate", y = heights),
     size = 3, hjust = "right"
   ) +
@@ -428,14 +429,13 @@ result_plot <- autoplot(
 result_plot
 ```
 
-TODO `get_test_data` isn't actually working here...
+<!-- TODO `get_test_data` isn't actually working here... -->
 
 ## Population scaling
 
-Suppose we're sending our predictions to someone who is looking to understand
-counts, rather than rates.
-Then we can adjust just the frosting to get a forecast for the counts from our
-rates forecaster:
+Suppose we want to modify our predictions to apply to counts, rather than rates.
+To do that, we can adjust _just_ the `frosting` to perform post-processing on our existing rates forecaster.
+Since rates are calculated as counts per 100 000 people, we will convert back to counts by multiplying rates by the factor $\frac{\text{regional population}}{100\,000}$.
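+
+As a worked example with made-up numbers (not taken from the data), a predicted rate of 2.5 deaths per 100 000 in a region of 10 000 000 people converts to
+
+$$
+2.5 \times \frac{10\,000\,000}{100\,000} = 2.5 \times 100 = 250 \text{ deaths}.
+$$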
 
 ```{r rate_scale}
 count_layers <-
@@ -445,15 +445,18 @@ count_layers <-
   layer_population_scaling(
     .pred,
     .pred_distn,
+    # `df` contains scaling values for all regions; in this case it is the state populations
     df = epidatasets::state_census,
     df_pop_col = "pop",
     create_new = FALSE,
+    # `rate_rescaling` gives the denominator of the existing rate predictions
     rate_rescaling = 1e5,
     by = c("geo_value" = "abbr")
   ) |>
   layer_add_forecast_date() |>
   layer_add_target_date() |>
   layer_threshold()
+
 # building the new workflow
 count_workflow <- epi_workflow(
   four_week_recipe,
@@ -464,66 +467,67 @@ count_pred_data <- get_test_data(four_week_recipe, training_data)
 count_predictions <- count_workflow |>
   fit(training_data) |>
   predict(count_pred_data)
+
 count_predictions
 ```
 
-which are 2-3 orders of magnitude larger than the corresponding rates above.
-`df` represents the scaling value; in this case it is the state populations,
-while `rate_rescaling` gives the denominator of the rate (our fit values were
-per 100,000).
-
 # Custom classifier workflow
 
-As a more complicated example of the kind of pipeline that you can build using
-this framework, here is an example of a hotspot prediction model, which predicts
-whether the case rates are increasing (`up`), decreasing (`down`) or flat
+Let's work through an example of a more complicated kind of pipeline you can build using
+the `epipredict` framework.
+This is a hotspot prediction model, which predicts whether case rates are increasing (`up`), decreasing (`down`) or flat
 (`flat`).
-This comes from a paper by McDonald, Bien, Green, Hu et al[^3], and roughly
+The model comes from a paper by McDonald, Bien, Green, Hu et al.[^3], and roughly
 serves as an extension of `arx_classifier()`.
 
-First, we need to add a factor version of the `geo_value`, so that it can
-actually be used as a feature.
+First, we need to add a factor version of `geo_value`, so that it can be used as a feature.
 
 ```{r training_factor}
 training_data <-
-  training_data %>%
+  training_data |>
   mutate(geo_value_factor = as.factor(geo_value))
 ```
 
 Then we put together the recipe, using a combination of base `{recipes}`
-functions such as `add_role()` and `step_dummy()`, and `{epipreict}` functions
+functions such as `add_role()` and `step_dummy()`, and `{epipredict}` functions
 such as `step_growth_rate()`.
 
 ```{r class_recipe}
-classifier_recipe <- epi_recipe(training_data) %>%
-  add_role(time_value, new_role = "predictor") %>%
-  step_dummy(geo_value_factor) %>%
-  step_growth_rate(case_rate, role = "none", prefix = "gr_") %>%
-  step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) %>%
-  step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") %>%
-  # note recipes::step_cut() has a bug in it, or we could use that here
+classifier_recipe <- epi_recipe(training_data) |>
+  # Turn `time_value` into a predictor
+  add_role(time_value, new_role = "predictor") |>
+  # Turn `geo_value_factor` into a predictor
+  step_dummy(geo_value_factor) |>
+  # Create and lag growth rate
+  step_growth_rate(case_rate, role = "none", prefix = "gr_") |>
+  step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) |>
+  step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") |>
+  # Divide growth rate into 3 bins.
+  # Note `recipes::step_cut()` has a bug, or we could use that here.
   step_mutate(
     response = cut(
       ahead_7_gr_7_rel_change_case_rate,
-      breaks = c(-Inf, -0.2, 0.25, Inf) / 7, # division gives weekly not daily
+      # Define bin thresholds as weekly relative changes.
+      # Divide by 7 to match the daily scale of the growth rate column.
+      breaks = c(-Inf, -0.2, 0.25, Inf) / 7,
       labels = c("down", "flat", "up")
     ),
     role = "outcome"
-  ) %>%
-  step_rm(has_role("none"), has_role("raw")) %>%
+  ) |>
+  # Drop unused columns.
+  step_rm(has_role("none"), has_role("raw")) |>
   step_epi_naomit()
 ```
 
+This adds as predictors:
 
-Roughly, this adds as predictors:
+- time value (via `add_role()`)
+- `geo_value` (via `step_dummy()` and the previous `as.factor()`)
+- growth rate of case rate, both at prediction time (no lag), and lagged by one and two weeks
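+
+To illustrate what dummy coding a factor produces, here is a minimal base-R sketch. It is only an analogy: it uses `model.matrix()` rather than `step_dummy()`, the `toy` data frame and its three locations are made up, and the resulting column names differ from the ones `{recipes}` generates.
+
+```{r dummy_sketch}
+# Toy data: three made-up locations, not the vignette's training data.
+toy <- data.frame(geo_value = c("ca", "ny", "tx", "ca"))
+toy$geo_value_factor <- as.factor(toy$geo_value)
+
+# Treatment-style dummy coding: the first level ("ca") is the reference,
+# so 3 levels yield 2 indicator columns. Drop the intercept column.
+dummies <- model.matrix(~geo_value_factor, data = toy)[, -1, drop = FALSE]
+dummies
+```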
 
-1. the time value (via `add_role()`)
-2. the `geo_value` (via `step_dummy()` and the `as.factor()` above)
-3. the growth rate, both at prediction time and lagged by one and two weeks
-
-The outcome is created by composing several steps together: `step_epi_ahead()`
-creates a column with the growth rate one week into the future, while
-`step_mutate()` creates a factor with the 3 values:
+The outcome variable is created by composing several steps together. `step_epi_ahead()`
+creates a column with the growth rate one week into the future, and
+`step_mutate()` turns that column into a factor with 3 possible values,
 
 $$
  Z_{\ell, t}=
@@ -538,47 +542,49 @@ where $Y^{\Delta}_{\ell, t}$ is the growth rate at location $\ell$ and time $t$.
 `up` means that the `case_rate` has increased by at least 25%, while `down`
 means it has decreased by at least 20%.
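+
+To make the binning concrete, here is a minimal base-R sketch of the same `cut()` call applied to made-up weekly changes. The input values and the names `weekly_change`, `daily_growth` and `bins` are illustrative assumptions. The division by 7 mirrors the recipe's conversion of weekly thresholds to the daily growth rate scale.
+
+```{r cut_sketch}
+# Hypothetical weekly relative changes: -30%, +10%, +40%.
+weekly_change <- c(-0.30, 0.10, 0.40)
+# Express them as average daily changes, matching the growth rate column.
+daily_growth <- weekly_change / 7
+
+# Same breaks and labels as in `classifier_recipe`.
+bins <- cut(
+  daily_growth,
+  breaks = c(-Inf, -0.2, 0.25, Inf) / 7,
+  labels = c("down", "flat", "up")
+)
+bins
+```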
 
-Note that both `step_growth_rate()` and `step_epi_ahead()` assign the role
-`none` explicitly; this is because they're used as intermediate steps to create
-both predictors and the outcome.
-`step_rm()` drops them after they're done, along with the original `raw` columns
-`death_rate` and `case_rate` (both `geo_value` and `time_value` are retained
-because their roles have been reassigned).
+Note that in both `step_growth_rate()` and `step_epi_ahead()` we explicitly assign the role
+`none`. This is because those columns are used as intermediaries to create
+predictor and outcome columns.
+Afterwards, `step_rm()` drops the temporary columns, along with the original `role = "raw"` columns
+`death_rate` and `case_rate`. Both `geo_value_factor` and `time_value` are retained
+because their roles have been reassigned.
 
 
-To fit a classification model like this, we will need to use a parsnip model
-with mode classification; the simplest example is `multinomial_reg()`.
-We don't actually need to apply any layers, so we can skip adding one to the `epiworkflow()`:
+To fit a classification model like this, we will need to use a `{parsnip}` model
+that has `mode = "classification"`.
+The simplest example of a `{parsnip}` classification-mode model is `multinom_reg()`.
+We don't need to do any post-processing, so we can skip adding layers to the `epi_workflow()`.
+So our workflow looks like:
 
 ```{r, warning=FALSE}
 wf <- epi_workflow(
   classifier_recipe,
   multinom_reg()
-) %>%
+) |>
   fit(training_data)
 
-forecast(wf) %>% filter(!is.na(.pred_class))
+forecast(wf) |> filter(!is.na(.pred_class))
 ```
 
-And comparing the result with the actual growth rates at that point:
+And comparing the result with the actual growth rates at that point in time,
 ```{r growth_rate_results}
 growth_rates <- covid_case_death_rates |>
   filter(geo_value %in% used_locations) |>
   group_by(geo_value) |>
   mutate(
-    # multiply by 7 to get to weekly equivalents
+    # Multiply by 7 to estimate weekly equivalents
     case_gr = growth_rate(x = time_value, y = case_rate) * 7
   ) |>
   ungroup()
 
 growth_rates |> filter(time_value == "2021-08-01")
 ```
 
-So they're all increasing at significantly higher than 25% per week (36%-62%),
-which matches the classification.
+we see that they're all significantly higher than 25% per week (36%-62%),
+which matches the classification model's predictions.
 
 
-See the [tooling book](https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html) for a more in depth discussion of this example.
+See the [tooling book](https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html) for a more in-depth discussion of this example.
 
 
 [^1]: Think of baking a cake, where adding the frosting is the last step in the