11
22<!-- README.md is generated from README.Rmd. Please edit that file -->
33
4- # epipredict
4+ # Epipredict
55
66<!-- badges: start -->
77
88[![R-CMD-check](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml)
99<!-- badges: end -->
1010
11- ** Note:** This package is currently in development and may not work as
12- expected. Please file bug reports as issues in this repo, and we will do
13- our best to address them quickly.
11+ Epipredict is a framework for building transformation and forecasting
12+ pipelines for epidemiological and other panel time-series datasets. In
13+ addition to tools for building forecasting pipelines, it contains a
14+ number of “canned” forecasters meant to run with little modification as
15+ an easy way to get started forecasting.
16+
17+ It is designed to work well with
18+ [`epiprocess`](https://cmu-delphi.github.io/epiprocess/), a package of
19+ time-series and geographic processing tools for use in an
20+ epidemiological context. Both packages are meant to work well
21+ with the panel data provided by
22+ [`epidatr`](https://cmu-delphi.github.io/epidatr/).
23+
24+ If you are looking for more detail beyond the package documentation, see
25+ our [forecasting
26+ book](https://cmu-delphi.github.io/delphi-tooling-book/).
1427
1528## Installation
1629
17- To install (unless you’re making changes to the package, use the stable
18- version):
30+ To install the package (unless you plan to contribute to package
31+ development, we suggest using the stable version):
1932
2033``` r
2134# Stable version
@@ -25,52 +38,14 @@ pak::pkg_install("cmu-delphi/epipredict@main")
2538pak::pkg_install("cmu-delphi/epipredict@dev")
2639```
2740
28- ## Documentation
29-
30- You can view documentation for the ` main ` branch at
31- < https://cmu-delphi.github.io/epipredict > .
32-
33- ## Goals for ` epipredict `
34-
35- ** We hope to provide:**
36-
37- 1 . A set of basic, easy-to-use forecasters that work out of the box.
38- You should be able to do a reasonably limited amount of
39- customization on them. For the basic forecasters, we currently
40- provide:
41- - Baseline flatline forecaster
42- - Autoregressive forecaster
43- - Autoregressive classifier
44- - CDC FluSight flatline forecaster
45- 2 . A framework for creating custom forecasters out of modular
46- components. There are four types of components:
47- - Preprocessor: do things to the data before model training
48- - Trainer: train a model on data, resulting in a fitted model object
49- - Predictor: make predictions, using a fitted model object
50- - Postprocessor: do things to the predictions before returning
41+ The documentation for the stable version is at
42+ <https://cmu-delphi.github.io/epipredict>, while the development version
43+ is at <https://cmu-delphi.github.io/epipredict/dev>.
5144
52- ** Target audiences: **
45+ ## Motivating example
5346
54- - Basic. Has data, calls forecaster with default arguments.
55- - Intermediate. Wants to examine changes to the arguments, take
56- advantage of built in flexibility.
57- - Advanced. Wants to write their own forecasters. Maybe willing to build
58- up from some components.
59-
60- The Advanced user should find their task to be relatively easy. Examples
61- of these tasks are illustrated in the [ vignettes and
62- articles] ( https://cmu-delphi.github.io/epipredict ) .
63-
64- See also the (in progress) [ Forecasting
65- Book] ( https://cmu-delphi.github.io/delphi-tooling-book/ ) .
66-
67- ## Intermediate example
68-
69- The package comes with some built-in historical data for illustration,
70- but up-to-date versions of this could be downloaded with the
71- [ ` {epidatr} ` package] ( https://cmu-delphi.github.io/epidatr/ ) and
72- processed using
73- [ ` {epiprocess} ` ] ( https://cmu-delphi.github.io/epiprocess/ ) .[ ^ 1 ]
47+ To demonstrate the kind of forecast epipredict can make, say we’re
48+ predicting COVID deaths per 100k for each state on
7449
7550``` r
7651forecast_date <- as.Date("2021-08-01")
@@ -95,17 +70,19 @@ cases <- pub_covidcast(
9570  signals = "confirmed_incidence_prop",
9671  time_type = "day",
9772  geo_type = "state",
98-   time_values = epirange(20200601, 20220101),
99-   geo_values = "*") |>
73+   time_values = epirange(20200601, 20211231),
74+   geo_values = "*"
75+ ) |>
10076  select(geo_value, time_value, case_rate = value)
10177
10278deaths <- pub_covidcast(
10379  source = "jhu-csse",
10480  signals = "deaths_incidence_prop",
10581  time_type = "day",
10682  geo_type = "state",
107-   time_values = epirange(20200601, 20220101),
108-   geo_values = "*") |>
83+   time_values = epirange(20200601, 20211231),
84+   geo_values = "*"
85+ ) |>
10986  select(geo_value, time_value, death_rate = value)
11087cases_deaths <-
11188  full_join(cases, deaths, by = c("time_value", "geo_value")) |>
@@ -123,6 +100,7 @@ cases_deaths |>
123100```
124101
125102<img src="man/figures/README-case_death-1.png" width="90%" style="display: block; margin: auto;" />
103+
126104As with basically any dataset, there is some cleaning that we will need
127105to do to make it actually usable; we’ll use some utilities from
128106[ ` {epiprocess} ` ] ( https://cmu-delphi.github.io/epiprocess/ ) for this.
@@ -131,7 +109,7 @@ First, to eliminate some of the noise coming from daily reporting, we do
131109
132110``` r
133111cases_deaths <-
134-   cases_deaths |>
112+   cases_deaths |>
135113  group_by(geo_value) |>
136114  epi_slide(
137115    cases_7dav = mean(case_rate, na.rm = TRUE),
@@ -150,47 +128,54 @@ cases_deaths <-
150128  cases_deaths |>
151129  group_by(geo_value) |>
152130  mutate(
153-     outlr_death_rate = detect_outlr_rm(time_value, death_rate, detect_negatives = TRUE),
154-     outlr_case_rate = detect_outlr_rm(time_value, case_rate, detect_negatives = TRUE)
131+     outlr_death_rate = detect_outlr_rm(
132+       time_value, death_rate, detect_negatives = TRUE
133+     ),
134+     outlr_case_rate = detect_outlr_rm(
135+       time_value, case_rate, detect_negatives = TRUE
136+     )
155137  ) |>
156138  unnest(cols = starts_with("outlr"), names_sep = "_") |>
157139  ungroup() |>
158140  mutate(
159141    death_rate = outlr_death_rate_replacement,
160-     case_rate = outlr_case_rate_replacement) |>
142+     case_rate = outlr_case_rate_replacement
143+   ) |>
161144  select(geo_value, time_value, case_rate, death_rate)
162145cases_deaths
163- # > An `epi_df` object, 32,480 x 4 with metadata:
146+ # > An `epi_df` object, 32,424 x 4 with metadata:
164147# > * geo_type = state
165148# > * time_type = day
166- # > * as_of = 2022-05-31
149+ # > * as_of = 2022-01-01
167150# >
168- # > # A tibble: 20,496 × 4
169- # > geo_value time_value case_rate death_rate
170- # > * <chr> <date> <dbl> <dbl>
171- # > 1 ak 2020-12-31 35.9 0.158
172- # > 2 al 2020-12-31 65.1 0.438
173- # > 3 ar 2020-12-31 66.0 1.27
174- # > 4 as 2020-12-31 0 0
175- # > 5 az 2020-12-31 76.8 1.10
176- # > 6 ca 2020-12-31 96.0 0.751
177- # > 7 co 2020-12-31 35.8 0.649
178- # > 8 ct 2020-12-31 52.1 0.819
179- # > 9 dc 2020-12-31 31.0 0.601
180- # > 10 de 2020-12-31 64.3 0.912
181- # > # ℹ 20,486 more rows
151+ # > # A tibble: 32,424 × 4
152+ # > geo_value time_value case_rate death_rate
153+ # > * <chr> <date> <dbl> <dbl>
154+ # > 1 ak 2020-06-01 2.31 0
155+ # > 2 ak 2020-06-02 1.94 0
156+ # > 3 ak 2020-06-03 2.63 0
157+ # > 4 ak 2020-06-04 2.59 0
158+ # > 5 ak 2020-06-05 2.43 0
159+ # > 6 ak 2020-06-06 2.35 0
160+ # > # ℹ 32,418 more rows
182161```
183162
184- To create and train a simple auto-regressive forecaster to predict the
185- death rate two weeks into the future using past (lagged) deaths and
186- cases, we could use the following function.
163+ </details >
164+
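The smoothing step in the cleaning code above computes `cases_7dav` as a mean of `case_rate` within each `geo_value`. As a rough illustration of what such a step does, here is a minimal base-R sketch. It assumes a trailing 7-day window and uses made-up toy data; `toy` and `trailing_mean()` are hypothetical names for this example, not part of `epiprocess`, and `epi_slide()` handles dates and gaps more carefully than this.

```r
# Toy data: column names mirror the README, values are invented for illustration.
toy <- data.frame(
  geo_value  = rep(c("ca", "ny"), each = 10),
  time_value = rep(as.Date("2021-07-01") + 0:9, times = 2),
  case_rate  = c(1:10, seq(20, 2, by = -2))
)

# Hypothetical helper: mean of the current value and up to six preceding values.
# Early rows use a shorter window because fewer past values exist.
trailing_mean <- function(x, window = 7) {
  vapply(seq_along(x), function(i) {
    mean(x[max(1, i - window + 1):i], na.rm = TRUE)
  }, numeric(1))
}

# ave() applies the helper separately within each geo_value group.
# This relies on rows already being sorted by time_value within each group.
toy$cases_7dav <- ave(toy$case_rate, toy$geo_value, FUN = trailing_mean)

head(toy, 8)
```

Grouping before averaging matters here: without it, the first few days of one location would be averaged with the last few days of another.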
165+ After having downloaded and cleaned the data in ` cases_deaths ` , we plot
166+ a subset of the states, noting the actual forecast date:
167+
168+ <details >
169+ <summary >
170+ Plot
171+ </summary >
187172
188173``` r
189174forecast_date_label <-
190175  tibble(
191176    geo_value = rep(plot_locations, 2),
192-     source = c(rep("case_rate",4), rep("death_rate", 4)),
193-     dates = rep(forecast_date - 7 * 2, 2 * length(plot_locations)),
177+     source = c(rep("case_rate", 4), rep("death_rate", 4)),
178+     dates = rep(forecast_date - 7 * 2, 2 * length(plot_locations)),
194179    heights = c(rep(150, 4), rep(1.0, 4))
195180  )
196181processed_data_plot <-
@@ -202,7 +187,10 @@ processed_data_plot <-
202187  facet_grid(source ~ geo_value, scale = "free") +
203188  geom_vline(aes(xintercept = forecast_date)) +
204189  geom_text(
205-     data = forecast_date_label, aes(x = dates, label = "forecast\ndate", y = heights), size = 3, hjust = "right") +
190+     data = forecast_date_label,
191+     aes(x = dates, label = "forecast\ndate", y = heights),
192+     size = 3, hjust = "right"
193+   ) +
206194  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
207195  theme(axis.text.x = element_text(angle = 90, hjust = 1))
208196```
@@ -222,26 +210,26 @@ four_week_ahead <- arx_forecaster(
222210  predictors = c("case_rate", "death_rate"),
223211  args_list = arx_args_list(
224212    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
225-     ahead = 14
213+     ahead = 4 * 7
226214  )
227215)
228- two_week_ahead
229- # > ══ A basic forecaster of type ARX Forecaster ═══════════════════════════════
216+ four_week_ahead
217+ # > ══ A basic forecaster of type ARX Forecaster ════════════════════════════════
230218# >
231- # > This forecaster was fit on 2025-01-23 14:01:04 .
219+ # > This forecaster was fit on 2025-01-24 14:47:38 .
232220# >
233221# > Training data was an <epi_df> with:
234222# > • Geography: state,
235223# > • Time type: day,
236- # > • Using data up-to-date as of: 2022-05-31 .
237- # > • With the last data available on 2021-12-31
224+ # > • Using data up-to-date as of: 2022-01-01 .
225+ # > • With the last data available on 2021-08-01
238226# >
239- # > ── Predictions ─────────────────────────────────────────────────────────────
227+ # > ── Predictions ──────────────────────────────────────────────────────────────
240228# >
241229# > A total of 56 predictions are available for
242230# > • 56 unique geographic regions,
243- # > • At forecast date: 2021-12-31 ,
244- # > • For target date: 2022-01-14 ,
231+ # > • At forecast date: 2021-08-01 ,
232+ # > • For target date: 2021-08-29 ,
245233# >
246234```
247235
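To make the `lags` and `ahead` arguments more concrete, here is a minimal sketch of an autoregressive fit with exogenous predictors on simulated data for a single location. It is not how `arx_forecaster()` is implemented: it assumes the first lag vector applies to `case_rate` and the second to `death_rate`, fits a plain `stats::lm()`, and skips the per-location handling and quantile postprocessing. The helpers `lag_by()` and `lead_by()` and all simulated values are hypothetical.

```r
set.seed(1)
n <- 200

# Simulated series for one location (invented for illustration only).
case_rate  <- 50 + as.numeric(arima.sim(list(ar = 0.8), n = n, sd = 5))
death_rate <- 0.02 * c(rep(case_rate[1], 14), head(case_rate, -14)) +
  rnorm(n, sd = 0.05)

ahead      <- 4 * 7                 # forecast horizon in days
case_lags  <- c(0, 1, 2, 3, 7, 14)  # lags assumed to apply to case_rate
death_lags <- c(0, 7, 14)           # lags assumed to apply to death_rate

# Hypothetical helpers: shift a series k days into the past or the future.
lag_by  <- function(x, k) c(rep(NA, k), head(x, length(x) - k))
lead_by <- function(x, k) c(tail(x, length(x) - k), rep(NA, k))

# One row per day: the target is the death rate `ahead` days later,
# the predictors are lagged copies of both series.
design <- data.frame(
  lead_by(death_rate, ahead),
  sapply(case_lags,  function(k) lag_by(case_rate, k)),
  sapply(death_lags, function(k) lag_by(death_rate, k))
)
names(design) <- c(
  "target",
  paste0("case_lag_", case_lags),
  paste0("death_lag_", death_lags)
)

# lm() drops rows with missing lags or a missing (future) target.
fit <- lm(target ~ ., data = design)

# Predict from the most recent day, whose target is not yet observed.
prediction  <- predict(fit, newdata = design[n, -1])
target_date <- as.Date("2021-08-01") + ahead  # 28 days after the forecast date
```

With `ahead = 4 * 7`, a forecast date of 2021-08-01 corresponds to a target date of 2021-08-29, which matches the printed forecaster summary.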
@@ -272,15 +260,16 @@ narrow_data_plot <-
272260  facet_grid(source ~ geo_value, scale = "free") +
273261  geom_vline(aes(xintercept = forecast_date)) +
274262  geom_text(
275-     data = forecast_date_label, aes(x = dates, label = "forecast\ndate", y = heights), size = 3, hjust = "right") +
263+     data = forecast_date_label,
264+     aes(x = dates, label = "forecast\ndate", y = heights),
265+     size = 3, hjust = "right"
266+   ) +
276267  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
277268  theme(axis.text.x = element_text(angle = 90, hjust = 1))
278269```
279270
280- The fitted model here involved preprocessing the data to appropriately
281- generate lagged predictors, estimating a linear model with ` stats::lm() `
282- and then postprocessing the results to be meaningful for epidemiological
283- tasks. We can also examine the predictions.
271+ We put that together with a plot of the bands and a plot of the median
272+ prediction.
284273
285274``` r
286275epiworkflow <- four_week_ahead$epi_workflow
@@ -294,16 +283,17 @@ forecast_plot <-
294283  epipredict:::plot_bands(
295284    restricted_predictions,
296285    levels = 0.9,
297-     fill = primary) +
298-   geom_point(data = restricted_predictions, aes(y = .data$value), color = secondary)
286+     fill = primary
287+   ) +
288+   geom_point(data = restricted_predictions,
289+              aes(y = .data$value),
290+              color = secondary)
299291```
300292
301- The results above show a distributional forecast produced using data
302- through the end of 2021 for the 14th of January 2022. A prediction for
303- the death rate per 100K inhabitants is available for every state
304- (` geo_value ` ) along with a 90% predictive interval.
293+ </details >
305294
306295<img src="man/figures/README-show-single-forecast-1.png" width="90%" style="display: block; margin: auto;" />
296+
307297The yellow dot gives the median prediction, while the red interval gives
308298the 5-95% inter-quantile range. For this particular day and these
309299locations, the forecasts are relatively accurate, with the true data