Skip to content

Commit bb4025c

Browse files
nmdefriesdsweber2
authored andcommitted
first half custom_epiworkflows.Rmd
1 parent 9bbf7bf commit bb4025c

File tree

1 file changed

+150
-112
lines changed

1 file changed

+150
-112
lines changed

vignettes/custom_epiworkflows.Rmd

Lines changed: 150 additions & 112 deletions
Original file line numberDiff line numberDiff line change
@@ -24,9 +24,12 @@ used_locations <- c("ca", "ma", "ny", "tx")
2424
library(epidatr)
2525
```
2626

27-
To get a better handle on custom `epi_workflow()`s, lets recreate and then
28-
modify the example of `four_week_ahead` from the [landing
29-
page](../index.html#motivating-example)
27+
If you want to do custom data preprocessing or fit a model that isn't included in the canned workflows, you'll need to write a custom `epi_workflow()`.
28+
29+
To get understand how to work with custom `epi_workflow()`s, let's recreate and then
30+
modify the `four_week_ahead` example from the [landing
31+
page](../index.html#motivating-example).
32+
Let's first remind ourselves how to use a simple canned workflow:
3033

3134
```{r make-four-forecasts, warning=FALSE}
3235
training_data <- covid_case_death_rates |>
@@ -43,93 +46,111 @@ four_week_ahead <- arx_forecaster(
4346
)
4447
four_week_ahead$epi_workflow
4548
```
49+
4650
# Anatomy of an `epi_workflow`
47-
An `epi_workflow()` is an extension of a `workflows::workflow()` to handle panel
51+
52+
An `epi_workflow()` is an extension of a `workflows::workflow()` that handles panel
4853
data and post-processing.
49-
It consists of 3 components, shown above:
50-
51-
1. Preprocessor: transform the data before model training and prediction, such
52-
as convert counts to rates, create smoothed columns, or [any of the recipes
53-
steps](https://recipes.tidymodels.org/reference/index.html).
54-
Think of it as a more flexible `formula` that you would pass to `lm()`: `y ~
55-
x1 + log(x2) + lag(x1, 5)`.
56-
The above model has 6 of these steps.
57-
In general, there are 2 broad classes of transformation that `{recipes}`
58-
handles:
59-
- Transforms of both training and test data that are always applied.
60-
Examples include taking the log of a variable, leading or lagging,
61-
filtering out rows, handling dummy variables, calculating growth rates,
62-
etc.
63-
- Operations that fit parameters during training to apply during prediction,
64-
such as centering by the mean.
65-
This is a major benefit of `{recipes}`, since it prevents data leakage,
66-
where information about the test/predict time data "leaks" into the
67-
parameters. <!-- TODO unsure if worth even keeping, as we effectively
68-
can't have data leakage. -->
69-
However, the main mechanism we rely on to prevent data leakage is proper
70-
[backtesting](backtesting.html).
71-
For the case of centering, we need to store the mean of the predictor from
72-
the training data and use that value on the prediction data rather than
73-
accidentally calculating the mean of the test predictor for centering.
74-
2. Trainer: use a `{parsnip}` model to train a model on data, resulting in a
75-
fitted model object.
76-
Examples include linear regression, quantile regression, or [any parsnip
77-
engine](https://www.tidymodels.org/find/parsnip/).
78-
Parsnip serves as a front-end that abstracts away the differences in interface
79-
between a wide collection of statistical models.
80-
3. Postprocessor: unique to this package, and used to format and modify the
81-
prediction after the model has been fit.
82-
Each operation is a `layer_`, and the stack of layers is known as `frosting()`,
83-
continuing the metaphor of baking a cake established in the recipe.
84-
Some example operations include:
85-
- generate quantiles from purely point-prediction models,
86-
- undo operations done in the steps, such as convert back to counts from rates
87-
- threshold so the forecast doesn't include negative values
88-
- generally adapt the format of the prediction to it's eventual use.
54+
All `epi_workflows`, including simple and canned workflows, consist of 3 components.
55+
56+
### Preprocessor
57+
58+
A preprocessor transforms the data before model training and prediction.
59+
Transformations can include converting counts to rates, applying a running average
60+
to columns, or [any of the `step`s found in `{recipes}`](https://recipes.tidymodels.org/reference/index.html).
61+
62+
You can think of a preprocessor as a more flexible `formula` that you would pass to `lm()`: `y ~ x1 + log(x2) + lag(x1, 5)`.
63+
The simple model above internally runs 6 of these steps, such as creating lagged predictor columns.
64+
65+
In general, there are 2 broad classes of transformation that `{recipes}` `step`s handle:
66+
67+
- Operations that are applied to both training and test data without using stored information.
68+
Examples include taking the log of a variable, leading or lagging columns,
69+
filtering out rows, handling dummy variables, calculating growth rates,
70+
etc.
71+
- Operations that rely on stored information (parameters fit during training) to modify train and test data.
72+
Examples include centering by the mean, and normalizing the variance (whitening).
73+
74+
We differentiate between these types of transformations because the second type can result in information leakage if not done properly.
75+
Information leakage or data leakage happens when a system has access to information that would not have been available at prediction time and could change our evaluation of the model's real-world performance.
76+
77+
In the case of centering, we need to store the mean of the predictor from
78+
the training data and use that value on the prediction data, rather than
79+
using the mean of the test predictor for centering or including test data in the mean calculation.
80+
81+
A major benefit of `{recipes}` is that it prevents information leakage.
82+
<!-- TODO unsure if worth even keeping, as we effectively can't have data leakage. -->
83+
However, the _main_ mechanism we rely on to prevent data leakage is proper
84+
[backtesting](backtesting.html).
85+
86+
### Trainer
87+
88+
A trainer fits a `{parsnip}` model on data, and outputs a fitted model object.
89+
Examples include linear regression, quantile regression, or [any `{parsnip}` engine](https://www.tidymodels.org/find/parsnip/).
90+
The `{parsnip}` front-end abstracts away the differences in interface between a wide collection of statistical models.
91+
92+
### Postprocessor
93+
94+
Postprocessors are unique to this `{epipredict}`.
95+
A postprocessor modifies and formats the prediction after a model has been fit.
96+
97+
Each operation within a postprocessor is called a `layer_`, and the stack of layers is known as `frosting()`,
98+
continuing the metaphor of baking a cake established in `{recipes}`.
99+
Some example operations include:
100+
101+
- generating quantiles from purely point-prediction models
102+
- reverting transformations done in prior steps, such as converting from rates back to counts
103+
- thresholding forecasts to remove negative values
104+
- generally adapting the format of the prediction to a downstream use.
89105

90106
# Recreating `four_week_ahead` in an `epi_workflow()`
91-
To be able to extend this beyond what `arx_forecaster()` itself will let us do,
92-
we first need to understand how to recreate it using a custom `epi_workflow()`.
93107

94-
To use a custom workflow, there are a couple of steps:
108+
To understand how to create custom workflows, let's first recreate the simple canned `arx_forecaster()` from scratch.
95109

96-
1. define the `epi_recipe()`, which contains the preprocessing steps
97-
2. define the `frosting()` which contains the post-processing layers
110+
We'll think through the following sub-steps:
111+
112+
1. Define the `epi_recipe()`, which contains the preprocessing steps
113+
2. Define the `frosting()` which contains the post-processing layers
98114
3. Combine these with a trainer such as `quantile_reg()` into an
99115
`epi_workflow()`, which we can then fit on the training data
100116
4. `fit()` the workflow on some data
101-
5. grab the right prediction data using `get_test_data()` and apply the fit data
117+
5. Grab the right prediction data using `get_test_data()` and apply the fit data
102118
to generate a prediction
103119

104120
## Define the `epi_recipe()`
105-
To do this, we'll first take a look at the steps as they're found in
106-
`four_week_ahead`:
121+
122+
The steps found in `four_week_ahead` look like:
107123

108124
```{r inspect_fwa_steps, warning=FALSE}
109125
hardhat::extract_recipe(four_week_ahead$epi_workflow)
110126
```
111127

112-
So there are 6 steps we will need to recreate.
113-
One thing to note about the extracted recipe is that it has already been
114-
trained; for steps such as `recipes::step_BoxCox()` which have parameters, this means that their
115-
parameters have been calculated.
116-
Before defining steps, we need to create an `epi_recipe()` to hold them
128+
There are 6 steps we will need to recreate.
129+
Note that all steps in the extracted recipe are marked as already been
130+
`Trained`. For steps such as `recipes::step_BoxCox()` that have parameters that change their behavior, this means that their
131+
parameters have already been calculated based on the training data set.
132+
133+
Let's create an `epi_recipe()` to hold the 6 steps:
134+
117135
```{r make_recipe}
118136
four_week_recipe <- epi_recipe(
119137
covid_case_death_rates |>
120138
filter(time_value <= forecast_date, geo_value %in% used_locations)
121139
)
122140
```
123141

124-
The data here, `covid_case_death_rates` doesn't strictly need to be the actual
125-
dataset on which you are going to train.
126-
However, it should have the same columns and the same metadata, such as `as_of`
127-
or `other_keys`; it is typically easiest to just use the training data itself.
142+
The data set passed to `epi_recipe` isn't required to be the actual
143+
data set on which you are going to train the model.
144+
However, it should have the same columns and the same metadata (such as `as_of`
145+
and `other_keys`); it is typically easiest just to use the training data itself.
128146

147+
This means that you can use the same workflow for multiple data sets as long as the format remains the same.
148+
This might be useful if you continue to get updates to a data set over time and you want to train a new instance of the same model.
129149

130-
Then we add each step via pipes; in principle the order matters, though for this
150+
Then we can append each `step` using pipes. In principle, the order matters, though for this
131151
recipe only `step_epi_naomit()` and `step_training_window()` depend on the steps
132-
before them:
152+
before them.
153+
The other steps can be thought of as setting parameters that help specify later processing and computation.
133154

134155
```{r make_steps}
135156
four_week_recipe <- four_week_recipe |>
@@ -140,55 +161,56 @@ four_week_recipe <- four_week_recipe |>
140161
step_training_window()
141162
```
142163

143-
One thing to note: we only added 5 steps here because `step_epi_naomit()` is
144-
actually a wrapper around adding 2 base `step_naomit()`s, one for
145-
`all_predictors()` and one for `all_outcomes()`, differing in their treatment at
146-
predict time.
164+
Note we said before that `four_week_ahead` contained 6 steps.
165+
We've only added _5_ top-level steps here because `step_epi_naomit()` is
166+
actually a wrapper around adding two `step_naomit()`s, one for
167+
`all_predictors()` and one for `all_outcomes()`.
168+
The `step_naomit()`s differ in their treatment of the data at predict time.
147169

148-
`step_epi_lag()` and `step_epi_ahead()` both accept tidy syntax, so if for
149-
example we wanted the same lags for both `case_rate` and `death_rate`, we could
150-
have done `step_epi_lag(ends_with("rate"), lag = c(0, 7, 14))`.
170+
`step_epi_lag()` and `step_epi_ahead()` both accept ["tidy" syntax](https://dplyr.tidyverse.org/reference/select.html) so processing can be applied to multiple columns at once.
171+
For example, if we wanted to use the same lags for both `case_rate` and `death_rate`, we could
172+
specify them in a single step, like `step_epi_lag(ends_with("rate"), lag = c(0, 7, 14))`.
151173

152-
In general for the `{recipes}` package, steps assign roles, such as `predictor`
153-
and `outcome` to columns, either by adding new columns or adjusting existing
154-
ones.
155-
`step_epi_lag()` for example, creates a new column for each lag with the name
156-
`lag_x_column_name` and assigns them as predictors, while `step_epi_ahead()`
157-
creates `ahead_x_column_name` columns and assigns them as outcomes.
174+
In general, `{recipes}` `step`s assign roles (`predictor`, `outcome`) to columns either by adding new columns or adjusting existing
175+
ones.
176+
`step_epi_lag()`, for example, creates a new column for each lag with the name
177+
`lag_x_column_name` and labels them each with the `predictor` role.
178+
`step_epi_ahead()` creates `ahead_x_column_name` columns and labels each with the `outcome` role.
158179

159-
One way to inspect the roles assigned is to use `prep()`:
180+
We can inspect assigned roles with `prep()` to make sure that we are training on the correct columns:
160181

161182
```{r prep_recipe}
162183
prepped <- four_week_recipe |> prep(training_data)
163184
prepped$term_info |> print(n = 14)
164185
```
165186

166-
The way to inspect the columns created is by using `bake()` on the resulting
167-
recipe:
187+
We can inspect newly-created columns by running `bake()` on the
188+
recipe so far:
168189

169190
```{r bake_recipe}
170191
four_week_recipe |>
171192
prep(training_data) |>
172193
bake(training_data)
173194
```
174195

175-
This is also useful for debugging malfunctioning pipelines; if you define
176-
`four_week_recipe` only up to the step that is misbehaving, you can get a
177-
partial evaluation to see the exact data the step is being applied to.
178-
It also allows you to see the exact data that the parsnip model is fitting on.
196+
This is also useful for debugging malfunctioning pipelines.
197+
You can run `prep()` and `bake()` on a new recipe containing a subset of `step`s -- all `step`s from the beginning up to the one that is misbehaving -- from the full, original recipe.
198+
This will return an evaluation of the `recipe` up to that point so that you can see the data that the misbehaving `step` is being applied to.
199+
It also allows you to see the exact data that a later `{parsnip}` model is trained on.
179200

180201
## Define the `frosting()`
181-
Since the post-processing frosting layers[^1] are unique to this package, to
182-
inspect them we use `extract_frosting()` from `{epipredict}`:
202+
203+
The post-processing `frosting` layers[^1] found in `four_week_ahead` look like:
183204

184205
```{r inspect_fwa_layers, warning=FALSE}
185206
epipredict::extract_frosting(four_week_ahead$epi_workflow)
186207
```
187208

188-
The above gives us detailed descriptions of the arguments to the functions named
189-
above in the postprocessor of `four_week_ahead$epiworkflow`.
190-
Creating the layers is a similar process, except with frosting instead of a
191-
`recipe`[^2]:
209+
Note: since `frosting` is unique to this package, we've defined a custom function `extract_frosting()` to inspect these steps.
210+
211+
Using the detailed information in the output above,
212+
we can recreate the layers similar to how we defined the
213+
`recipe` `step`s[^2]:
192214

193215
```{r make_frosting}
194216
four_week_layers <- frosting() |>
@@ -199,22 +221,23 @@ four_week_layers <- frosting() |>
199221
layer_threshold()
200222
```
201223

224+
`layer_predict()` needs to be included in every postprocessor to actually train the model.
202225

203-
Most layers will work for any engine or steps; `layer_predict()` you will want
204-
to call in every case.
205-
There are a couple of layers, however, which depend on whether the engine used predicts quantiles or point estimates.
226+
Most layers work with any engine or `step`s.
227+
There are a couple of layers, however, which depend on whether the engine predicts quantiles or point estimates.
206228

207229
The layers that are only supported by point estimate engines (such as
208-
`linear_reg()`) are
230+
`linear_reg()`) are:
209231

210232
- `layer_residual_quantiles()`: the preferred method of generating quantiles for
211-
non-quantile models, it uses the error residuals of the engine.
212-
This will work for most parsnip engines.
213-
- `layer_predictive_distn()`: alternate method of generating quantiles, it uses
233+
models that don't generate quantiles themselves.
234+
This function uses the error residuals of the engine to calculate quantiles.
235+
This will work for most `{parsnip}` engines.
236+
- `layer_predictive_distn()`: alternate method of generating quantiles using
214237
an approximate parametric distribution. This will work for linear regression
215238
specifically.
216239

217-
TODO check this
240+
<!-- TODO check this -->
218241

219242
On the other hand, the layers that are only supported by quantile estimating
220243
engines (such as `quantile_reg()`) are
@@ -223,11 +246,13 @@ engines (such as `quantile_reg()`) are
223246
If they differ from the ones actually fit, they will be interpolated and/or
224247
extrapolated.
225248
- `layer_point_from_distn()`: this adds the median quantile as a point estimate,
226-
and should be called after `layer_quantile_distn()` if called at all.
249+
and, if called, should be included after `layer_quantile_distn()`.
227250

228251
## Fitting an `epi_workflow()`
229252

230-
Given that we now have a recipe and some layers, we need to assemble the workflow:
253+
Now that we have a recipe and some layers, we can assemble the workflow.
254+
This is as simple as passing the component preprocessor, model, and postprocessor into `epi_workflow()`.
255+
231256
```{r workflow_building}
232257
four_week_workflow <- epi_workflow(
233258
four_week_recipe,
@@ -236,15 +261,19 @@ four_week_workflow <- epi_workflow(
236261
)
237262
```
238263

239-
And fit it to recreate `four_week_ahead$epi_workflow`
264+
After fitting it, we will have recreated `four_week_ahead$epi_workflow`.
265+
240266
```{r workflow_fitting}
241267
fit_workflow <- four_week_workflow |> fit(training_data)
242268
```
243269

270+
Running `fit()` calculates all preprocessor-required parameters, and trains the model on the data passed in `fit()`.
271+
However, it does not generate any predictions; predictions need to be created in a separate step.
272+
244273
## Predicting
245274

246-
To do a prediction, we need to first narrow the dataset down to the relevant
247-
datapoints:
275+
To make a prediction, we need to first narrow a data set down to the relevant observations.
276+
This process removes observations that will not be used for training because, for example, they contain missing values or <!-- TODO other reasons?-->.
248277

249278
```{r grab_data}
250279
relevant_data <- get_test_data(
@@ -253,22 +282,25 @@ relevant_data <- get_test_data(
253282
)
254283
```
255284

256-
With a fit workflow and test data in hand, we can actually make our predictions:
285+
In this example, we're creating `relevant_data` from `training_data`, but the data set we want predictions for could be an entirely new data set.
286+
287+
With a trained workflow and data in hand, we can actually make our predictions:
257288

258289
```{r workflow_pred}
259290
fit_workflow |> predict(relevant_data)
260291
```
261292

262-
Note that if we had simply plugged `training_data` into `predict()` we still get
293+
Note that if we simply plug `training_data` into `predict()` we will still get
263294
predictions:
264295

265296
```{r workflow_pred_training}
266297
fit_workflow |> predict(training_data)
267298
```
268299

269300
The resulting tibble is 800 rows long, however.
270-
This produces forecasts for not just the actual `forecast_date`, but for every
271-
day in the dataset it has enough data to actually make a prediction.
301+
Not running `get_test_data()` means that we're providing irrelevant data along with relevant, valid data.
302+
Passing the non-subsetted data set produces forecasts for not just the requested `forecast_date`, but for every
303+
day in the data set that has sufficient data to produce a prediction.
272304
To narrow this down, we could filter to rows where the `time_value` matches the `forecast_date`:
273305

274306
```{r workflow_pred_training_filter}
@@ -282,12 +314,18 @@ data.
282314

283315
# Extending `four_week_ahead`
284316

285-
There are many ways we could modify `four_week_ahead`; one simple modification
286-
would be to include a growth rate estimate as part of the model.
287-
Another would be to convert from rates to counts, for example if that were the
288-
preferred prediction format.
289-
Another would be to include a time component to the prediction (useful if we
290-
expect there to be a strong seasonal component).
317+
Now that we've recreated `four_week_ahead`, we can start modifying parts of it to get custom behavior.
318+
319+
There are many ways we could modify `four_week_ahead`. We might consider:
320+
321+
- Including a growth rate estimate as part of the model
322+
- Converting from rates to counts, for example if that were the
323+
preferred prediction format
324+
- Including a time component to the prediction, useful if we
325+
expect there to be a strong seasonal component
326+
- Scaling by some factor
327+
328+
We will demo a couple of these modifications below.
291329

292330
## Growth rate
293331

0 commit comments

Comments
 (0)