If you want to do custom data preprocessing or fit a model that isn't included in the canned workflows, you'll need to write a custom `epi_workflow()`.

To understand how to work with custom `epi_workflow()`s, let's recreate and then
modify the `four_week_ahead` example from the [landing
page](../index.html#motivating-example).

Let's first remind ourselves how to use a simple canned workflow:
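
```{r canned_workflow}
library(epipredict)
# A sketch of the canned call from the landing page; the dataset comes from
# `{epidatasets}`, and the exact lags shown here are illustrative stand-ins
four_week_ahead <- arx_forecaster(
  covid_case_death_rates,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate"),
  args_list = arx_args_list(
    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
    ahead = 4 * 7
  )
)
four_week_ahead
```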

All `epi_workflows`, including simple and canned workflows, consist of 3 components.

### Preprocessor

A preprocessor transforms the data before model training and prediction.
Transformations can include converting counts to rates, applying a running average
to columns, or [any of the `step`s found in `{recipes}`](https://recipes.tidymodels.org/reference/index.html).

You can think of a preprocessor as a more flexible `formula` that you would pass to `lm()`: `y ~ x1 + log(x2) + lag(x1, 5)`.
The simple model above internally runs 6 of these steps, such as creating lagged predictor columns.
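
To make the analogy concrete, here is a minimal sketch of that formula re-expressed as `{recipes}` steps; the data frame `dat` and the columns `y`, `x1`, and `x2` are hypothetical:

```{r formula_analogy, eval = FALSE}
library(recipes)

# `dat`, `y`, `x1`, and `x2` are stand-ins for the formula above
rec <- recipe(y ~ x1 + x2, data = dat) |>
  step_log(x2) |> # log(x2)
  step_lag(x1, lag = 5) # lag(x1, 5)
```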

In general, there are 2 broad classes of transformation that `{recipes}` `step`s handle:

- Operations that are applied to both training and test data without using stored information.
  Examples include taking the log of a variable, leading or lagging columns,
  filtering out rows, handling dummy variables, calculating growth rates,
  etc.
- Operations that rely on stored information (parameters fit during training) to modify both train and test data.
  Examples include centering by the mean and normalizing the variance (whitening).

We differentiate between these types of transformations because the second type can result in information leakage if not done properly.
Information leakage or data leakage happens when a system has access to information that would not have been available at prediction time and could change our evaluation of the model's real-world performance.

In the case of centering, we need to store the mean of the predictor from
the training data and use that value on the prediction data, rather than
using the mean of the test predictor for centering or including test data in the mean calculation.

A major benefit of `{recipes}` is that it prevents information leakage.
However, the _main_ mechanism we rely on to prevent data leakage is proper
[backtesting](backtesting.html).
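
Here is a minimal self-contained sketch of that second class using plain `{recipes}` and made-up data; note how the centering constant is learned from the training set only:

```{r leakage_example}
library(recipes)

train <- data.frame(y = 1:5, x = c(10, 20, 30, 40, 50))
test <- data.frame(y = 6:7, x = c(60, 70))

centered <- recipe(y ~ x, data = train) |>
  step_center(x) |>
  prep(training = train) # estimates and stores mean(train$x), which is 30

# The stored training mean is reused on the test set;
# the test values never influence the centering constant
bake(centered, new_data = test)
```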

### Trainer

A trainer fits a `{parsnip}` model on data and outputs a fitted model object.
Examples include linear regression, quantile regression, or [any `{parsnip}` engine](https://www.tidymodels.org/find/parsnip/).
The `{parsnip}` front-end abstracts away the differences in interface between a wide collection of statistical models.
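
For instance, here is a minimal sketch showing the same front-end call driving two different backends (both are standard `{parsnip}` engines):

```{r parsnip_engines, eval = FALSE}
library(parsnip)

# Identical front-end, different underlying implementations;
# the rest of the workflow is unchanged when the engine changes
linear_reg() |> set_engine("lm") # ordinary least squares
linear_reg(penalty = 0.1) |> set_engine("glmnet") # penalized regression
```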

### Postprocessor

Postprocessors are unique to `{epipredict}`.
A postprocessor modifies and formats the prediction after a model has been fit.

Each operation within a postprocessor is called a `layer_`, and the stack of layers is known as `frosting()`,
continuing the metaphor of baking a cake established in `{recipes}`.
Some example operations include:

- generating quantiles from purely point-prediction models
- reverting transformations done in prior steps, such as converting from rates back to counts
- thresholding forecasts to remove negative values
- generally adapting the format of the prediction to a downstream use.
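
As an illustrative sketch, a frosting stack might look like the following (these are real `{epipredict}` layers, but this particular combination is ours for illustration, not necessarily the one `four_week_ahead` uses):

```{r frosting_sketch, eval = FALSE}
frosting() |>
  layer_predict() |>
  # generate quantiles from a point-prediction model, via residuals
  layer_residual_quantiles() |>
  # keep forecasts from going negative
  layer_threshold(.pred, lower = 0) |>
  layer_add_forecast_date()
```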

# Recreating `four_week_ahead` in an `epi_workflow()`

To understand how to create custom workflows, let's first recreate the simple canned `arx_forecaster()` from scratch.

We'll think through the following sub-steps (a skeleton sketch follows the list):

1. Define the `epi_recipe()`, which contains the preprocessing steps
2. Define the `frosting()`, which contains the post-processing layers
3. Combine these with a trainer such as `quantile_reg()` into an
   `epi_workflow()`
4. `fit()` the workflow on the training data
5. Grab the right prediction data using `get_test_data()` and apply the fitted
   workflow to generate a prediction
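
Put together, the skeleton looks roughly like this (a sketch; `four_week_recipe`, `four_week_frosting`, and `training_data` are placeholder names we fill in below):

```{r workflow_skeleton, eval = FALSE}
four_week_workflow <- epi_workflow(four_week_recipe, quantile_reg()) |>
  add_frosting(four_week_frosting)

fit_workflow <- fit(four_week_workflow, training_data)

test_data <- get_test_data(four_week_recipe, training_data)
predict(fit_workflow, test_data)
```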

## Define the `epi_recipe()`
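
As a sketch, the recipe might be defined as follows. The lag and horizon choices mirror the motivating example, and for simplicity we treat the whole dataset as training data; treat the exact arguments as illustrative:

```{r define_recipe}
# For illustration, use the whole dataset as training data
# (a real forecast would cut the data off at the forecast date)
training_data <- covid_case_death_rates

four_week_recipe <- epi_recipe(training_data) |>
  step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14)) |>
  step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
  step_epi_ahead(death_rate, ahead = 4 * 7) |>
  step_epi_naomit() |>
  step_training_window()
```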

Note we said before that `four_week_ahead` contained 6 steps.
We've only added _5_ top-level steps here because `step_epi_naomit()` is
actually a wrapper around adding two `step_naomit()`s, one for
`all_predictors()` and one for `all_outcomes()`.
The `step_naomit()`s differ in their treatment of the data at predict time.

`step_epi_lag()` and `step_epi_ahead()` both accept ["tidy" syntax](https://dplyr.tidyverse.org/reference/select.html), so processing can be applied to multiple columns at once.
For example, if we wanted to use the same lags for both `case_rate` and `death_rate`, we could
specify them in a single step, like `step_epi_lag(ends_with("rate"), lag = c(0, 7, 14))`.

In general, `{recipes}` `step`s assign roles (`predictor`, `outcome`) to columns, either by adding new columns or adjusting existing
ones.
`step_epi_lag()`, for example, creates a new column for each lag with the name
`lag_x_column_name` and labels each with the `predictor` role.
`step_epi_ahead()` creates `ahead_x_column_name` columns and labels each with the `outcome` role.

We can inspect assigned roles with `prep()` to make sure that we are training on the correct columns:
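
```{r prep_recipe}
# Printing the prepped recipe shows each step and the roles it assigns
four_week_recipe |>
  prep(training_data)
```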

We can inspect newly created columns by running `bake()` on the
recipe so far:

```{r bake_recipe}
four_week_recipe |>
  prep(training_data) |>
  bake(training_data)
```

This is also useful for debugging malfunctioning pipelines.
You can run `prep()` and `bake()` on a new recipe containing a subset of `step`s -- all `step`s from the beginning up to the one that is misbehaving -- from the full, original recipe.
This will return an evaluation of the `recipe` up to that point so that you can see the data that the misbehaving `step` is being applied to.
It also allows you to see the exact data that a later `{parsnip}` model is trained on.
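
For example, a hypothetical debugging session might look like the following sketch (the cut-off step is arbitrary):

```{r debug_partial, eval = FALSE}
# Rebuild the recipe keeping only the steps before the suspect one,
# then inspect exactly the data that the suspect step would receive
partial_recipe <- epi_recipe(training_data) |>
  step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14))

partial_recipe |>
  prep(training_data) |>
  bake(training_data)
```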

## Define the `frosting()`

The post-processing `frosting` layers[^1] found in `four_week_ahead` look like:
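
```{r show_frosting}
# `extract_frosting()` pulls the post-processing layers out of a workflow;
# we assume `four_week_ahead$epi_workflow` holds the canned forecaster's
# fitted workflow, as on the landing page
extract_frosting(four_week_ahead$epi_workflow)
```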