
Commit 9a8fc7c

consistent naming, 7dav pull instead of manually
1 parent ad236b6 commit 9a8fc7c

File tree

7 files changed: +152 −150 lines changed


README.Rmd

Lines changed: 28 additions & 41 deletions
@@ -77,7 +77,7 @@ scale_colour_delphi <- scale_color_delphi
 [![R-CMD-check](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml)
 <!-- badges: end -->
 
-`{epipredict}` is a framework for building transformation and forecasting pipelines for epidemiological and other panel time-series datasets.
+[`{epipredict}`](https://cmu-delphi.github.io/epipredict/) is a framework for building transformation and forecasting pipelines for epidemiological and other panel time-series datasets.
 In addition to tools for building forecasting pipelines, it contains a number of “canned” forecasters meant to run with little modification as an easy way to get started forecasting.
 
 It is designed to work well with
@@ -111,7 +111,7 @@ The documentation for the stable version is at
 
 ## Motivating example
 
-To demonstrate using `{epipredict}` for forecasting, say we want to
+To demonstrate using [`{epipredict}`](https://cmu-delphi.github.io/epipredict/) for forecasting, say we want to
 predict COVID-19 deaths per 100k people for each of a subset of states
 
 ```{r subset_geos}
@@ -136,24 +136,33 @@ library(ggplot2)
 ```
 </details>
 
+```{r setting_cases_deaths}
+cases_deaths <- covid_case_death_rates |>
+  filter(
+    geo_value %in% used_locations,
+    time_value <= "2021-12-31"
+  )
+attr(cases_deaths, "metadata")$as_of <- as.Date("2022-01-01")
+```
 
-Below the fold, we construct this dataset as an `epiprocess::epi_df` from
-[Johns Hopkins Center for Systems Science and Engineering deaths data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html).
+`covid_case_death_rates` is a subset of
+[Johns Hopkins Center for Systems Science and Engineering deaths data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html) stored in [`{epidatasets}`](https://cmu-delphi.github.io/epidatasets/).
+Below the fold, we clean this dataset and demonstrate pulling it from the epidata API.
 
 <details>
 <summary> Creating the dataset using `{epidatr}` and `{epiprocess}` </summary>
 
 This section is intended to demonstrate some of the ubiquitous cleaning operations needed to be able to forecast.
-The dataset prepared here is also included ready-to-go in `{epipredict}` as `covid_case_death_rates`.
+The dataset prepared here is also included ready-to-go in [`{epipredict}`](https://cmu-delphi.github.io/epipredict/) as `covid_case_death_rates`.
 
 First we pull both `jhu-csse` cases and deaths data from the
 [Delphi API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html) using the
 [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:
 
-```{r case_death, warning = FALSE}
+```{r case_death, warning = FALSE, eval = FALSE}
 cases <- pub_covidcast(
   source = "jhu-csse",
-  signals = "confirmed_incidence_prop",
+  signals = "confirmed_7dav_incidence_prop",
   time_type = "day",
   geo_type = "state",
   time_values = epirange(20200601, 20211231),
@@ -163,23 +172,23 @@ cases <- pub_covidcast(
 
 deaths <- pub_covidcast(
   source = "jhu-csse",
-  signals = "deaths_incidence_prop",
+  signals = "deaths_7dav_incidence_prop",
   time_type = "day",
   geo_type = "state",
   time_values = epirange(20200601, 20211231),
   geo_values = "*"
 ) |>
   select(geo_value, time_value, death_rate = value)
+cases_deaths <-
+  full_join(cases, deaths, by = c("time_value", "geo_value")) |>
+  filter(geo_value %in% used_locations) |>
+  as_epi_df(as_of = as.Date("2022-01-01"))
 ```
 
 Since visualizing the results on every geography is somewhat overwhelming,
 we’ll only train on a subset of locations.
 
 ```{r date, warning = FALSE}
-cases_deaths <-
-  full_join(cases, deaths, by = c("time_value", "geo_value")) |>
-  filter(geo_value %in% used_locations) |>
-  as_epi_df(as_of = as.Date("2022-01-01"))
 # plotting the data as it was downloaded
 cases_deaths |>
   autoplot(
@@ -199,29 +208,7 @@ cases_deaths |>
 As with the typical dataset, we will need to do some cleaning to
 make it actually usable; we’ll use some utilities from
 [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.
-
-First, to reduce noise from daily reporting, we will compute a 7 day
-average over a trailing window[^1]:
-
-[^1]: This makes it so that any given day of the processed time-series only
-  depends on the previous week, which means that we avoid leaking future
-  values when making a forecast.
-
-```{r smooth}
-cases_deaths <-
-  cases_deaths |>
-  group_by(geo_value) |>
-  epi_slide(
-    cases_7dav = mean(case_rate, na.rm = TRUE),
-    death_rate_7dav = mean(death_rate, na.rm = TRUE),
-    .window_size = 7
-  ) |>
-  ungroup() |>
-  mutate(case_rate = NULL, death_rate = NULL) |>
-  rename(case_rate = cases_7dav, death_rate = death_rate_7dav)
-```
-
-Then we'll trim outliers, especially negative values:
+Specifically we'll trim outliers, especially negative values:
 
 ```{r outlier}
 cases_deaths <-
@@ -262,7 +249,7 @@ forecast_date_label <-
   heights = c(rep(150, 4), rep(0.75, 4))
 )
 processed_data_plot <-
-  covid_case_death_rates |>
+  cases_deaths |>
   filter(geo_value %in% used_locations) |>
   autoplot(
     case_rate,
@@ -296,7 +283,7 @@ cases.
 
 ```{r make-forecasts, warning=FALSE}
 four_week_ahead <- arx_forecaster(
-  covid_case_death_rates |> filter(time_value <= forecast_date),
+  cases_deaths |> filter(time_value <= forecast_date),
   outcome = "death_rate",
   predictors = c("case_rate", "death_rate"),
   args_list = arx_args_list(
@@ -317,9 +304,9 @@ date.
 
 Plotting the prediction intervals on the true values for our location subset[^2]:
 
-[^2]: Alternatively, you could call `autoplot(four_week_ahead)` to get the full
-  collection of forecasts. This is too busy for the space we have for plotting
-  here.
+[^2]: Alternatively, you could call `autoplot(four_week_ahead, plot_data =
+  cases_deaths)` to get the full collection of forecasts. This is too busy for
+  the space we have for plotting here.
 
 <details>
 <summary> Plot </summary>
@@ -354,7 +341,7 @@ four_week_ahead$predictions |>
   pivot_quantiles_longer(.pred_distn)
 ```
 
-The black dot gives the median prediction, while the blue intervals give the
+The yellow dot gives the median prediction, while the blue intervals give the
 25-75%, the 10-90%, and 2.5-97.5% inter-quantile ranges[^4].
 For this particular day and these locations, the forecasts are relatively
 accurate, with the true data being at least within the 10-90% interval.
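The gist of this commit: instead of computing 7-day averages by hand with `epi_slide` (the `{r smooth}` chunk the diff removes), the README now pulls the pre-smoothed `*_7dav_*` signals directly from the API. As a rough sanity check of that swap, the manual computation from the removed chunk can be run against the raw signal; this is an illustrative sketch, not part of the commit, and the `time_values` range and `geo_values = "ca"` here are arbitrary choices for a small pull:

```r
# Illustrative sketch: a manual trailing 7-day mean of the raw
# "confirmed_incidence_prop" signal, mirroring the chunk this commit removes.
# The API's "confirmed_7dav_incidence_prop" signal should roughly agree.
library(epidatr)
library(epiprocess)
library(dplyr)

raw_cases <- pub_covidcast(
  source = "jhu-csse",
  signals = "confirmed_incidence_prop",
  time_type = "day",
  geo_type = "state",
  time_values = epirange(20200601, 20200801),  # arbitrary short window
  geo_values = "ca"                            # arbitrary single state
) |>
  select(geo_value, time_value, case_rate = value) |>
  as_epi_df(as_of = as.Date("2022-01-01"))

# Trailing window, as in the removed chunk: each smoothed value depends only
# on the current day and the 6 days before it.
manual_7dav <- raw_cases |>
  group_by(geo_value) |>
  epi_slide(
    case_rate_7dav = mean(case_rate, na.rm = TRUE),
    .window_size = 7
  ) |>
  ungroup()
```

Exact equality with the server-side `7dav` signal is not guaranteed, since the pre-computed signal reflects the data as reported at smoothing time and may differ across revisions.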
