Use dplyr::distinct in {latest,earliest}_issue

capnrefsmmat · capnrefsmmat · commit e98f92495022 · 2020-11-14T17:06:54.000-05:00
Profiling revealed that latest_issue was responsible for a large portion
of the time taken in building correlation-utils.Rmd (apart from
downloading the data). Much of this time was spent in dplyr::filter.

Rather than grouping by geography and time, we can use dplyr::distinct,
knowing that each (geo_value, time_value) pair should appear only once
per issue date. Sorting by issue date (descending for latest_issue,
ascending for earliest_issue) and keeping the first row for each pair
gives the desired result.
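
As a rough illustration of why the two approaches agree, here is a
minimal sketch on made-up data. The tibble `d` and its values are
hypothetical, and bare column names are used instead of the `.data`
pronoun for brevity; this is not the package code itself.

```r
library(dplyr)

# Hypothetical toy data: two locations, one time_value, several issues.
d <- tibble(
  geo_value  = c("pa", "pa", "pa", "tx", "tx"),
  time_value = as.Date("2020-11-01"),
  issue      = as.Date(c("2020-11-02", "2020-11-03", "2020-11-05",
                         "2020-11-02", "2020-11-04")),
  value      = c(1.0, 1.5, 1.7, 3.0, 3.2)
)

# Old approach: keep rows whose issue equals the per-group maximum.
old <- d %>%
  group_by(geo_value, time_value) %>%
  filter(issue == max(issue)) %>%
  ungroup()

# New approach: sort newest first, then keep the first row per pair.
# With .keep_all = TRUE, distinct() keeps the first row it encounters.
new <- d %>%
  arrange(desc(issue)) %>%
  distinct(geo_value, time_value, .keep_all = TRUE)

# Row order differs between the two results, so sort before comparing.
same <- isTRUE(all.equal(
  as.data.frame(arrange(old, geo_value, time_value)),
  as.data.frame(arrange(new, geo_value, time_value))
))
```

Note that the equivalence leans on the uniqueness assumption: if a
pair had two rows with the same issue date, the filter version would
return both, while the distinct version would return only one.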

dplyr does not document its algorithms, so I can't easily give a big-O
comparison here. Algorithmic details notwithstanding, the results are
extraordinary:

> nrow(d)
[1] 203360
> system.time(latest_issue_old(d))
   user  system elapsed
  6.395   0.037   6.465
> system.time(latest_issue(d))
   user  system elapsed
  0.025   0.003   0.027
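
For anyone wanting to try a similar comparison without downloading
data, the sketch below times both approaches on synthetic input. The
data frame, its sizes, and the helper names `latest_old` and
`latest_new` are all assumptions made for illustration; timings will
vary with machine and dplyr version.

```r
library(dplyr)

# Synthetic stand-in for d: 200 locations x 100 days x 5 issues each.
d <- expand.grid(
  geo_value = sprintf("g%03d", 1:200),
  day       = 0:99,
  lag       = 1:5,
  stringsAsFactors = FALSE
)
d$time_value <- as.Date("2020-06-01") + d$day
d$issue      <- d$time_value + d$lag   # each issue follows its time_value
d$value      <- seq_len(nrow(d))

# Simplified version of the old approach (no attribute handling).
latest_old <- function(df) {
  df %>%
    group_by(geo_value, time_value) %>%
    filter(issue == max(issue)) %>%
    ungroup()
}

# Simplified version of the new approach (no attribute handling).
latest_new <- function(df) {
  df %>%
    arrange(desc(issue)) %>%
    distinct(geo_value, time_value, .keep_all = TRUE)
}

t_old <- system.time(res_old <- latest_old(d))
t_new <- system.time(res_new <- latest_new(d))
print(t_old)
print(t_new)
```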
diff --git a/R-packages/covidcast/R/utils.R b/R-packages/covidcast/R/utils.R
@@ -15,9 +15,9 @@ latest_issue <- function(df) {
   attrs <- attrs[!(names(attrs) %in% c("row.names", "names"))]
 
   df <- df %>%
-    dplyr::group_by(.data$geo_value, .data$time_value) %>%
-    dplyr::filter(.data$issue == max(.data$issue)) %>%
-    dplyr::ungroup()
+    dplyr::arrange(dplyr::desc(.data$issue)) %>%
+    dplyr::distinct(.data$geo_value, .data$time_value,
+                    .keep_all = TRUE)
 
   attributes(df) <- c(attributes(df), attrs)
 
@@ -41,9 +41,9 @@ earliest_issue <- function(df) {
   attrs <- attrs[!(names(attrs) %in% c("row.names", "names"))]
 
   df <- df %>%
-    dplyr::group_by(.data$geo_value, .data$time_value) %>%
-    dplyr::filter(.data$issue == min(.data$issue)) %>%
-    dplyr::ungroup()
+    dplyr::arrange(.data$issue) %>%
+    dplyr::distinct(.data$geo_value, .data$time_value,
+                    .keep_all = TRUE)
 
   attributes(df) <- c(attributes(df), attrs)