@@ -13,3 +13,137 @@ options(tibble.print_min = 4L, tibble.print_max = 4L, max.print = 4L)
13 13 library(epidatr)
14 14 library(dplyr)
15 15 ```
16+
17+
18+ The Epidata API records not just each signal's estimate for a given location
19+ on a given day, but also *when* that estimate was made, and all updates to that
20+ estimate.
21+
22+ For example, let's look at the [doctor visits
23+ signal](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html)
24+ from the [`covidcast` endpoint](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html),
25+ which estimates the percentage of outpatient doctor visits that are
26+ COVID-related. Consider a result row with `time_value` 2020-05-01 for
27+ `geo_values = "pa"`. This is an estimate for Pennsylvania on
28+ May 1, 2020. That estimate was *issued* on May 5, 2020, the delay being due to
29+ the aggregation of data by our source and the time taken by the Epidata API to
30+ ingest the data provided. Later, the estimate for May 1st could be updated,
31+ perhaps because additional visit data from May 1st arrived at our source and was
32+ reported to us. This constitutes a new *issue* of the data.
33+
34+
35+ ### Data known "as of" a specific date
36+
37+ By default, endpoint functions fetch the most recent issue available. This
38+ is the best option for users who simply want to graph the latest data or
39+ construct dashboards. But if we are interested in knowing *when* data was
40+ reported, we can request specific data versions using the `as_of`, `issues`, or
41+ `lag` arguments.
42+
43+ _Note_ that these are mutually exclusive; only one can be specified
44+ at a time. Also, not all endpoints support all three parameters, so please
45+ check the documentation for that specific endpoint.
46+
47+ First, we can request the data that was available *as of* a specific date, using
48+ the `as_of` argument:
49+
50+
51+ ```{r}
52+ epidata <- pub_covidcast(
53+   source = "doctor-visits",
54+   signals = "smoothed_adj_cli",
55+   time_type = "day",
56+   time_values = epirange("2020-05-01", "2020-05-01"),
57+   geo_type = "state",
58+   geo_values = "pa",
59+   as_of = "2020-05-07"
60+ )
61+ knitr::kable(epidata)
62+ ```
63+
64+ This shows that an estimate of about 2.3% was issued on May 7. If we don't
65+ specify `as_of`, we get the most recent estimate available:
66+
67+
68+ ```{r}
69+ epidata <- pub_covidcast(
70+   source = "doctor-visits",
71+   signals = "smoothed_adj_cli",
72+   time_type = "day",
73+   time_values = epirange("2020-05-01", "2020-05-01"),
74+   geo_type = "state",
75+   geo_values = "pa"
76+ )
77+ knitr::kable(epidata)
78+ ```
79+
80+ Note the substantial change in the estimate, from less than 3% to almost 6%,
81+ reflecting new data that became available after May 7 about visits *occurring on*
82+ May 1. This illustrates the importance of issue date tracking, particularly
83+ for forecasting tasks. To backtest a forecasting model on past data, it is
84+ important to use the data that would have been available *at the time* the model
85+ was or would have been fit, not data that arrived much later.
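
To make the "as of" idea concrete, here is a minimal base-R sketch run on a small, invented version history rather than on API output. The `versions` data frame, its values, and the helper `snapshot_as_of()` are all hypothetical; the sketch only assumes that an "as of" snapshot means "for each `time_value`, the latest issue on or before the requested date".

```r
# Toy version history for one location and one time_value.
# All values are invented for illustration; they are not API output.
versions <- data.frame(
  time_value = as.Date(rep("2020-05-01", 4)),
  issue = as.Date(c("2020-05-05", "2020-05-07", "2020-05-12", "2020-06-01")),
  value = c(2.1, 2.3, 4.0, 5.9)
)

# Hypothetical helper: for each time_value, keep the latest issue that was
# available on or before `as_of`.
snapshot_as_of <- function(df, as_of) {
  known <- df[df$issue <= as.Date(as_of), , drop = FALSE]
  # Sort so that the newest issue comes last within each time_value.
  known <- known[order(known$time_value, known$issue), , drop = FALSE]
  # Keep only the last (newest) row per time_value.
  newest <- !duplicated(known$time_value, fromLast = TRUE)
  known[newest, , drop = FALSE]
}

early <- snapshot_as_of(versions, "2020-05-07")
late <- snapshot_as_of(versions, "2020-07-01")
print(early)
print(late)
```

In a backtest, a model fit "on" May 7 would see only the `early` snapshot, not the later revision in `late`.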
86+
87+
88+ ### Multiple issues of observations
89+
90+ By using the `issues` argument, we can request all issues in a certain time
91+ period:
92+
93+ ```{r}
94+ epidata <- pub_covidcast(
95+   source = "doctor-visits",
96+   signals = "smoothed_adj_cli",
97+   time_type = "day",
98+   time_values = epirange("2020-05-01", "2020-05-01"),
99+   geo_type = "state",
100+   geo_values = "pa",
101+   issues = epirange("2020-05-01", "2020-05-15")
102+ )
103+ knitr::kable(epidata)
104+ ```
105+
106+ This estimate was clearly updated many times as new data for May 1st arrived.
107+
108+ Note that these results include only data issued or updated between
109+ 2020-05-01 and 2020-05-15, inclusive. If a value was first reported on
110+ 2020-04-15 and never updated, a query for issues between 2020-05-01 and
111+ 2020-05-15 will not include that value among its results.
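
The inclusive issue window can be illustrated with a small base-R sketch on invented data. The `versions` data frame and the helper `issues_between()` are hypothetical and only mimic the filtering described above; they are not part of `epidatr`.

```r
# Toy version history; all values are invented for illustration.
# The 2020-04-10 observation was issued once, on 2020-04-15, and never updated.
versions <- data.frame(
  time_value = as.Date(c("2020-04-10", "2020-05-01", "2020-05-01", "2020-05-01")),
  issue = as.Date(c("2020-04-15", "2020-05-05", "2020-05-07", "2020-05-20")),
  value = c(1.0, 2.1, 2.3, 5.0)
)

# Hypothetical helper: keep rows whose issue date falls in [start, end],
# with both endpoints included.
issues_between <- function(df, start, end) {
  keep <- df$issue >= as.Date(start) & df$issue <= as.Date(end)
  df[keep, , drop = FALSE]
}

in_window <- issues_between(versions, "2020-05-01", "2020-05-15")
print(in_window)
```

Under these assumptions the 2020-04-10 observation drops out entirely, because its only issue predates the window, and so does the 2020-05-20 revision of the May 1 value.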
112+
113+
114+ ### Observations issued with a specific lag
115+
116+ Finally, we can use the `lag` argument to request only data reported with a
117+ certain lag. For example, requesting a lag of 7 days fetches only data issued
118+ exactly 7 days after the corresponding `time_value`:
119+
120+ ```{r}
121+ epidata <- pub_covidcast(
122+   source = "doctor-visits",
123+   signals = "smoothed_adj_cli",
124+   time_type = "day",
125+   time_values = epirange("2020-05-01", "2020-05-07"),
126+   geo_type = "state",
127+   geo_values = "pa",
128+   lag = 7
129+ )
130+ knitr::kable(epidata)
131+ ```
132+
133+ Note that although this query requested all values between 2020-05-01 and
134+ 2020-05-07, May 3rd and May 4th were *not* included in the result set. This is
135+ because the query includes a result for May 3rd only if a value was issued
136+ on May 10th (a 7-day lag), but in fact the value was not updated on that day:
137+
138+ ```{r}
139+ epidata <- pub_covidcast(
140+   source = "doctor-visits",
141+   signals = "smoothed_adj_cli",
142+   time_type = "day",
143+   time_values = epirange("2020-05-03", "2020-05-03"),
144+   geo_type = "state",
145+   geo_values = "pa",
146+   issues = epirange("2020-05-09", "2020-05-15")
147+ )
148+ knitr::kable(epidata)
149+ ```
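
The exact-lag behavior can be reproduced on a toy version history in base R. The data below are invented, and the `lag` column is computed locally as `issue - time_value` in days. This is a sketch of the filtering logic described above, not a description of how the API implements it.

```r
# Toy version history; all values are invented for illustration.
versions <- data.frame(
  time_value = as.Date(c("2020-05-01", "2020-05-01", "2020-05-02",
                         "2020-05-03", "2020-05-03")),
  issue = as.Date(c("2020-05-05", "2020-05-08", "2020-05-09",
                    "2020-05-09", "2020-05-11")),
  value = c(2.1, 2.4, 2.6, 2.2, 2.9)
)

# Lag in whole days between the observation date and the issue date.
versions$lag <- as.integer(versions$issue - versions$time_value)

# Keep only the rows issued exactly 7 days after their time_value.
lag7 <- versions[versions$lag == 7L, , drop = FALSE]
print(lag7)
```

Here May 3rd has issues at lags of 6 and 8 days but none at exactly 7, so it is missing from `lag7` even though the observation itself exists.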