@@ -91,10 +91,10 @@ The `{parsnip}` front-end abstracts away the differences in interface between a
9191
9292### Postprocessor
9393
94- Postprocessors are unique to this ` {epipredict} ` .
94+ Postprocessors are unique to ` {epipredict} ` .
9595A postprocessor modifies and formats the prediction after a model has been fit.
9696
97- Each operation within a postprocessor is called a ` layer_ ` , and the stack of layers is known as ` frosting() ` ,
97+ Each operation within a postprocessor is called a "layer" (functions are named ` layer_* ` ) , and the stack of layers is known as ` frosting() ` ,
9898continuing the metaphor of baking a cake established in ` {recipes} ` .
9999Some example operations include:
100100
@@ -314,23 +314,23 @@ data.
314314
315315# Extending ` four_week_ahead `
316316
317- Now that we've recreated ` four_week_ahead ` , we can start modifying parts of it to get custom behavior.
317+ Now that we know how to create ` four_week_ahead ` from scratch , we can start modifying the workflow to get custom behavior.
318318
319319There are many ways we could modify ` four_week_ahead ` . We might consider:
320320
321- - Including a growth rate estimate as part of the model
322- - Converting from rates to counts, for example if that were the
323- preferred prediction format
324- - Including a time component to the prediction, useful if we
325- expect there to be a strong seasonal component
326- - Scaling by some factor
321+ - Converting from rates to counts
322+ - Including a growth rate estimate as a predictor
323+ - Including a time component as a predictor -- useful if we
324+ expect there to be a strong seasonal component to the outcome
325+ - Scaling by a factor
327326
328327We will demo a couple of these modifications below.
329328
330329## Growth rate
331330
332- One feature that may potentially improve our forecast is looking at the growth
333- rate
331+ Let's say we're interested in including growth rate as a predictor in our model because
332+ we think it may potentially improve our forecast.
333+ We can easily create a new growth rate column as a step in the ` epi_recipe ` .
334334
335335``` {r growth_rate_recipe}
336336growth_rate_recipe <- epi_recipe(
@@ -341,6 +341,7 @@ growth_rate_recipe <- epi_recipe(
341341 step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
342342 step_epi_ahead(death_rate, ahead = 4 * 7) |>
343343 step_epi_naomit() |>
344+ # Calculate growth rate from death rate column.
344345 step_growth_rate(death_rate) |>
345346 step_training_window()
346347```
@@ -354,7 +355,8 @@ growth_rate_recipe |>
354355 select(
355356 geo_value, time_value, case_rate,
356357 death_rate, gr_7_rel_change_death_rate
357- )
358+ ) |>
359+ tail()
358360```
359361
360362And the role:
@@ -396,8 +398,7 @@ gr_predictions <- gr_fit_workflow |>
396398<details >
397399<summary > Plot </summary >
398400
399- Plotting the result; this is reusing some code from the landing page to print
400- the forecast date.
401+ We'll reuse some code from the landing page to plot the result.
401402
402403``` {r plotting}
403404forecast_date_label <-
@@ -416,7 +417,7 @@ result_plot <- autoplot(
416417) +
417418 geom_vline(aes(xintercept = forecast_date)) +
418419 geom_text(
419- data = forecast_date_label %>% filter(.response_name == "death_rate"),
420+ data = forecast_date_label |> filter(.response_name == "death_rate"),
420421 aes(x = dates, label = "forecast\ndate", y = heights),
421422 size = 3, hjust = "right"
422423 ) +
@@ -428,14 +429,13 @@ result_plot <- autoplot(
428429result_plot
429430```
430431
431- TODO ` get_test_data ` isn't actually working here...
432+ <!-- TODO `get_test_data` isn't actually working here... -->
432433
433434## Population scaling
434435
435- Suppose we're sending our predictions to someone who is looking to understand
436- counts, rather than rates.
437- Then we can adjust just the frosting to get a forecast for the counts from our
438- rates forecaster:
436+ Suppose we want to modify our predictions to apply to counts, rather than rates.
437+ To do that, we can adjust _ just_ the ` frosting ` to perform post-processing on our existing rates forecaster.
438+ Since rates are calculated as counts per 100 000 people, we will convert back to counts by multiplying rates by the factor $\frac{regional \text{ } population}{100000}$.
439439
440440``` {r rate_scale}
441441count_layers <-
@@ -445,15 +445,18 @@ count_layers <-
445445 layer_population_scaling(
446446 .pred,
447447 .pred_distn,
448+ # `df` contains scaling values for all regions; in this case it is the state populations
448449 df = epidatasets::state_census,
449450 df_pop_col = "pop",
450451 create_new = FALSE,
452+ # `rate_rescaling` gives the denominator of the existing rate predictions
451453 rate_rescaling = 1e5,
452454 by = c("geo_value" = "abbr")
453455 ) |>
454456 layer_add_forecast_date() |>
455457 layer_add_target_date() |>
456458 layer_threshold()
459+
457460# building the new workflow
458461count_workflow <- epi_workflow(
459462 four_week_recipe,
@@ -464,66 +467,67 @@ count_pred_data <- get_test_data(four_week_recipe, training_data)
464467count_predictions <- count_workflow |>
465468 fit(training_data) |>
466469 predict(count_pred_data)
470+
467471count_predictions
468472```
469473
470- which are 2-3 orders of magnitude larger than the corresponding rates above.
471- ` df ` represents the scaling value; in this case it is the state populations,
472- while ` rate_rescaling ` gives the denominator of the rate (our fit values were
473- per 100,000).
474-
475474# Custom classifier workflow
476475
477- As a more complicated example of the kind of pipeline that you can build using
478- this framework, here is an example of a hotspot prediction model, which predicts
479- whether the case rates are increasing (` up ` ), decreasing (` down ` ) or flat
476+ Let's work through an example of a more complicated kind of pipeline you can build using
477+ the ` epipredict ` framework.
478+ This is a hotspot prediction model, which predicts whether case rates are increasing (` up ` ), decreasing (` down ` ) or flat
480479(` flat ` ).
481- This comes from a paper by McDonald, Bien, Green, Hu et al[ ^ 3 ] , and roughly
480+ The model comes from a paper by McDonald, Bien, Green, Hu et al[ ^ 3 ] , and roughly
482481serves as an extension of ` arx_classifier() ` .
483482
484- First, we need to add a factor version of the ` geo_value ` , so that it can
485- actually be used as a feature.
483+ First, we need to add a factor version of ` geo_value ` , so that it can be used as a feature.
486484
487485``` {r training_factor}
488486training_data <-
489- training_data %>%
487+ training_data |>
490488 mutate(geo_value_factor = as.factor(geo_value))
491489```
492490
493491Then we put together the recipe, using a combination of base ` {recipe} `
494- functions such as ` add_role() ` and ` step_dummy() ` , and ` {epipreict } ` functions
492+ functions such as ` add_role() ` and ` step_dummy() ` , and ` {epipredict } ` functions
495493such as ` step_growth_rate() ` .
496494
497495``` {r class_recipe}
498- classifier_recipe <- epi_recipe(training_data) %>%
499- add_role(time_value, new_role = "predictor") %>%
500- step_dummy(geo_value_factor) %>%
501- step_growth_rate(case_rate, role = "none", prefix = "gr_") %>%
502- step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) %>%
503- step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") %>%
504- # note recipes::step_cut() has a bug in it, or we could use that here
496+ classifier_recipe <- epi_recipe(training_data) |>
497+ # Turn `time_value` into predictor
498+ add_role(time_value, new_role = "predictor") |>
499+ # Turn `geo_value_factor` into predictor
500+ step_dummy(geo_value_factor) |>
501+ # Create and lag growth rate
502+ step_growth_rate(case_rate, role = "none", prefix = "gr_") |>
503+ step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) |>
504+ step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") |>
505+ # Divide growth rate into 3 bins.
506+ # Note `recipes::step_cut()` has a bug, or we could use that here.
505507 step_mutate(
506508 response = cut(
507509 ahead_7_gr_7_rel_change_case_rate,
508- breaks = c(-Inf, -0.2, 0.25, Inf) / 7, # division gives weekly not daily
510+ # Define bin thresholds.
511+ # Divide by 7 to create weekly values.
512+ breaks = c(-Inf, -0.2, 0.25, Inf) / 7,
509513 labels = c("down", "flat", "up")
510514 ),
511515 role = "outcome"
512- ) %>%
513- step_rm(has_role("none"), has_role("raw")) %>%
516+ ) |>
517+ # Drop unused columns.
518+ step_rm(has_role("none"), has_role("raw")) |>
514519 step_epi_naomit()
515520```
516521
522+ This adds as predictors:
517523
518- Roughly, this adds as predictors:
524+ - time value (via ` add_role() ` )
525+ - ` geo_value ` (via ` step_dummy() ` and the previous ` as.factor() ` )
526+ - growth rate of case rate, both at prediction time (no lag), and lagged by one and two weeks
519527
520- 1 . the time value (via ` add_role() ` )
521- 2 . the ` geo_value ` (via ` step_dummy() ` and the ` as.factor() ` above)
522- 3 . the growth rate, both at prediction time and lagged by one and two weeks
523-
524- The outcome is created by composing several steps together: ` step_epi_ahead() `
525- creates a column with the growth rate one week into the future, while
526- ` step_mutate() ` creates a factor with the 3 values:
528+ The outcome variable is created by composing several steps together. ` step_epi_ahead() `
529+ creates a column with the growth rate one week into the future, and
530+ ` step_mutate() ` turns that column into a factor with 3 possible values,
527531
528532$$
529533 Z_{\ell, t}=
@@ -538,47 +542,49 @@ where $Y^{\Delta}_{\ell, t}$ is the growth rate at location $\ell$ and time $t$.
538542` up ` means that the ` case_rate ` is has increased by at least 25%, while ` down `
539543means it has decreased by at least 20%.
540544
541- Note that both ` step_growth_rate() ` and ` step_epi_ahead() ` assign the role
542- ` none ` explicitly; this is because they're used as intermediate steps to create
543- both predictors and the outcome.
544- ` step_rm() ` drops them after they're done , along with the original ` raw ` columns
545- ` death_rate ` and ` case_rate ` (both ` geo_value ` and ` time_value ` are retained
546- because their roles have been reassigned) .
545+ Note that in both ` step_growth_rate() ` and ` step_epi_ahead() ` we explicitly assign the role
546+ ` none ` . This is because those columns are used as intermediaries to create
547+ predictor and outcome columns .
548+ Afterwards, ` step_rm() ` drops the temporary columns , along with the original ` role = " raw" ` columns
549+ ` death_rate ` and ` case_rate ` . Both ` geo_value_factor ` and ` time_value ` are retained
550+ because their roles have been reassigned.
547551
548552
549- To fit a classification model like this, we will need to use a parsnip model
550- with mode classification; the simplest example is ` multinomial_reg() ` .
551- We don't actually need to apply any layers, so we can skip adding one to the ` epiworkflow() ` :
553+ To fit a classification model like this, we will need to use a ` {parsnip} ` model
554+ that has ` mode = "classification" ` .
555+ The simplest example of a ` {parsnip} ` ` classification ` -` mode ` model is ` multinomial_reg() ` .
556+ We don't need to do any post-processing, so we can skip adding ` layer ` s to the ` epiworkflow() ` .
557+ So our workflow looks like:
552558
553559``` {r, warning=FALSE}
554560wf <- epi_workflow(
555561 classifier_recipe,
556562 multinom_reg()
557- ) %>%
563+ ) |>
558564 fit(training_data)
559565
560- forecast(wf) %>% filter(!is.na(.pred_class))
566+ forecast(wf) |> filter(!is.na(.pred_class))
561567```
562568
563- And comparing the result with the actual growth rates at that point:
569+ And comparing the result with the actual growth rates at that point in time,
564570``` {r growth_rate_results}
565571growth_rates <- covid_case_death_rates |>
566572 filter(geo_value %in% used_locations) |>
567573 group_by(geo_value) |>
568574 mutate(
569- # multiply by 7 to get to weekly equivalents
575+ # Multiply by 7 to estimate weekly equivalents
570576 case_gr = growth_rate(x = time_value, y = case_rate) * 7
571577 ) |>
572578 ungroup()
573579
574580growth_rates |> filter(time_value == "2021-08-01")
575581```
576582
577- So they're all increasing at significantly higher than 25% per week (36%-62%),
578- which matches the classification.
583+ we see that they're all significantly higher than 25% per week (36%-62%),
584+ which matches the classification model's predictions .
579585
580586
581- See the [ tooling book] ( https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html ) for a more in depth discussion of this example.
587+ See the [ tooling book] ( https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html ) for a more in- depth discussion of this example.
582588
583589
584590[ ^ 1 ] : Think of baking a cake, where adding the frosting is the last step in the
0 commit comments