Commit e98f924
Use dplyr::distinct in {latest,earliest}_issue
Profiling revealed that latest_issue was responsible for a large portion
of the time taken in building correlation-utils.Rmd (apart from
downloading the data). Much of this time was spent in dplyr::filter.
Rather than grouping by geography and time, we can use dplyr::distinct,
knowing that each geo_value and time_value should appear only once per
issue date. By taking the first or last (after sorting by issue date),
we get the desired result.
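
As a minimal sketch of the idea described above (not the package's actual code), the following uses a small made-up data frame with `geo_value`, `time_value`, and `issue` columns. The function names ending in `_sketch` and `_grouped` are hypothetical, and the grouped version is only an assumption about what the older, slower approach roughly looked like. It relies on `dplyr::distinct()` keeping the first row for each key when `.keep_all = TRUE`.

```r
library(dplyr)

# Toy data (made up for illustration): two issues for ("pa", 2020-05-01),
# one issue for ("tx", 2020-05-01).
d <- data.frame(
  geo_value  = c("pa", "pa", "tx"),
  time_value = as.Date(c("2020-05-01", "2020-05-01", "2020-05-01")),
  issue      = as.Date(c("2020-05-02", "2020-05-05", "2020-05-03")),
  value      = c(1.0, 1.5, 2.0),
  stringsAsFactors = FALSE
)

# Sort newest issue first; distinct() then keeps the first row it sees
# for each (geo_value, time_value) pair, i.e. the latest issue.
latest_issue_sketch <- function(df) {
  df %>%
    arrange(desc(issue)) %>%
    distinct(geo_value, time_value, .keep_all = TRUE)
}

# Same idea with ascending order, so the first row is the earliest issue.
earliest_issue_sketch <- function(df) {
  df %>%
    arrange(issue) %>%
    distinct(geo_value, time_value, .keep_all = TRUE)
}

# Assumed shape of a grouped filter, for comparison only.
latest_issue_grouped <- function(df) {
  df %>%
    group_by(geo_value, time_value) %>%
    filter(issue == max(issue)) %>%
    ungroup()
}

latest   <- latest_issue_sketch(d)
earliest <- earliest_issue_sketch(d)
grouped  <- latest_issue_grouped(d)
```

Both `latest` and `grouped` should contain one row per location and date; the sorted-then-distinct version avoids evaluating a filter expression within every group.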
dplyr does not document its algorithmic details, so I can't easily give a
big-O comparison here. Algorithmic details notwithstanding, the results are
extraordinary:
> nrow(d)
[1] 203360
> system.time(latest_issue_old(d))
   user  system elapsed
  6.395   0.037   6.465
> system.time(latest_issue(d))
   user  system elapsed
  0.025   0.003   0.027

1 parent 4e22d6e commit e98f924
1 file changed, with 6 additions and 6 deletions. Two hunks: lines 18-20 replaced, and lines 44-46 replaced.