
Commit 4abaceb

Update 2023-04-10-is-latest-patch.Rmd
1 parent bf60288 commit 4abaceb

File tree: 1 file changed (+5 -5 lines)

content/blog/2023-04-10-is-latest-patch.Rmd

Lines changed: 5 additions & 5 deletions
@@ -20,20 +20,20 @@ In August 2022, the Delphi team discovered a fault in the data that we were send

 ## What went wrong?

-Due to the huge volume of COVIDcast data, we are limited in the real-time calculations that we can perform on it. This is because we must maintain a highly available API that can deliver data to end users very quickly. With this in mind, in the previous version of our database, we used a statically set flag that delineated whether a row in the database was the latest version of the data.^[Much of the data that is used to create the COVIDCast API is not complete the first day that it is reported. For instance, COVID cases for a specific day will change for many days to weeks afterwards as the reporting source revises its data. Because of this, we store many different versions of the same reference day for each signal. Usually, our users are most interested in the most recent version of the data. In our previous version of Epidata, version 3, we kept a statically set flag in our table to delineate the latest version of a certain row of data. This flag was set when we ingested a new version of said data. This workflow was very prone to data faults when patching the database outside of the acquisition pipeline (that typically sets the flag). For more information, check out our blog post on Epidata version 4.] We believe that this problem arose when we applied a patch^[A patch, in this context, is a set of data that matches to a database that contains incorrect information. The patch contains the keys to find these rows and update them with the correct information.] to our database and this flag was not properly recalculated.
+Due to the huge volume of COVIDcast data, we are limited in the real-time calculations that we can perform on it. This is because we must maintain a highly available API that can deliver data to end users very quickly. With this in mind, in the previous version of our database, we used a statically set flag that delineated whether a row in the database was the latest version of the data.^[Much of the data that is used to create the COVIDCast API is not complete the first day that it is reported. For instance, COVID cases for a specific day will change for many days to weeks afterwards as the reporting source revises its data. Because of this, we store many different versions of the same reference day for each signal. Usually, our users are most interested in the most recent version of the data. In our previous version of Epidata, version 3, we kept a statically set flag in our table to delineate the latest version of a certain row of data. This flag was set when we ingested a new version of said data. This workflow was very prone to data faults when patching the database outside of the acquisition pipeline (that typically sets the flag). For more information, [check out our blog post on Epidata version 4](https://delphi.cmu.edu/blog/2022/12/14/introducing-epidata-v4/).] We believe that this problem arose when we applied a patch^[A patch, in this context, is a set of data that matches to a database that contains incorrect information. The patch contains the keys to find these rows and update them with the correct information.] to our database and this flag was not properly recalculated.

 ## How did we identify this?
-Data faults like this are difficult to identify. In this case, this fault was found by accident while a member of the Delphi team was working on a new system to calculate metadata. During this, they found that some of the JHU-CSSE data was not matching up and looked deeper into it. The team’s analysis identified 11,987,335 rows that were labeled as the latest issue in which had more recent issues in the database; this constituted about 20% of our JHU-CSSE data at the time.
+Data faults like this are difficult to identify. In this case, the fault was found by accident while a member of the Delphi team was working on a new system to calculate metadata. During this work, they found that [some of the JHU-CSSE data was not matching up and looked deeper into it](https://github.com/cmu-delphi/covidcast-indicators/issues/1685). The team’s analysis identified 11,987,335 rows that were labeled as the latest issue but which had more recent issues in the database; this constituted about 20% of our JHU-CSSE data at the time.

 ## How did we fix it?

-As noted above, this particular fault was identified in the previous version of our database. At the time that it was found, we were in the process of rolling out a new schema, in which we changed the way we stored the latest versions of the data. Because of this, we identified the size of the patch and did validation on the previous version (v3) to ensure that we understood the extent of the fault. Once the new version was live, we recalculated the patch on the new database and applied it to the data on September 28, 2022.
+As noted above, this particular fault was identified in the previous version of our database. At the time that it was found, we were in the process of rolling out a [new schema](https://delphi.cmu.edu/blog/2022/12/14/introducing-epidata-v4/), in which we changed the way we stored the latest versions of the data. Because of this, we identified the size of the patch and did validation on the previous version (v3) to ensure that we understood the extent of the fault. Once the new version was live, we recalculated the patch on the new database and applied it to the data on September 28, 2022.

 ## What did we learn?

 There were many takeaways from this data fault that played directly into our development planning:

-We plan to build an automated system of data validation to identify these types of faults while they happen. Of course, the automated system is only as good as the tests we give it, but the suite of tests will continue to grow as new types of faults are identified.
-We will set up a fault record keeping system that is publicly available so that we can be more transparent about issues that arise and what data are affected. This will also be the system that we update as we fix faults and patch the data, so this will be useful to our end users as they can keep track of the status of the data that they use.
+1. We plan to build an automated system of data validation to identify these types of faults while they happen. Of course, the automated system is only as good as the tests we give it, but the suite of tests will continue to grow as new types of faults are identified.
+2. We will set up a fault record keeping system that is publicly available so that we can be more transparent about issues that arise and what data are affected. This will also be the system that we update as we fix faults and patch the data, so this will be useful to our end users as they can keep track of the status of the data that they use.

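The footnotes in the changed paragraph above describe how Epidata v3 marked the latest version of each row with a statically set flag, and how a patch applied outside the acquisition pipeline could leave that flag stale. Below is a minimal sketch of that failure mode and of the kind of consistency check that exposes it. The `signal_history` table, its column names, and the use of SQLite are illustrative assumptions for the sketch, not Delphi's actual schema or tooling.

```python
# Toy model (not Delphi's real schema): a versioned signal table with a
# statically set is_latest_issue flag, plus the check that finds rows
# flagged as latest even though a newer issue of the same key exists.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE signal_history (
    signal          TEXT,
    geo_value       TEXT,
    time_value      TEXT,     -- the reference date the value describes
    issue           TEXT,     -- the date this version was published
    value           REAL,
    is_latest_issue INTEGER   -- statically set: 1 = latest version of this row
);

-- Normal ingestion: the acquisition pipeline moves the flag to the newest issue.
INSERT INTO signal_history VALUES
  ('cases', 'pa', '2022-08-01', '2022-08-02', 10, 0),
  ('cases', 'pa', '2022-08-01', '2022-08-05', 12, 1);

-- A patch applied outside that pipeline: a newer issue lands, but the flag
-- on the row that used to be latest is never recalculated.
INSERT INTO signal_history VALUES
  ('cases', 'ny', '2022-08-01', '2022-08-02', 30, 1),
  ('cases', 'ny', '2022-08-01', '2022-08-09', 28, 0);
""")

# Consistency check: any row flagged as latest for which a more recent issue
# of the same (signal, geo_value, time_value) key exists has a stale flag.
stale = conn.execute("""
    SELECT a.signal, a.geo_value, a.time_value, a.issue
    FROM signal_history AS a
    JOIN signal_history AS b
      ON  a.signal = b.signal
      AND a.geo_value = b.geo_value
      AND a.time_value = b.time_value
      AND b.issue > a.issue
    WHERE a.is_latest_issue = 1
""").fetchall()

print(stale)  # [('cases', 'ny', '2022-08-01', '2022-08-02')]
```

A check of this shape, run across the JHU-CSSE rows, is the kind of analysis that would surface the 11,987,335 mislabeled rows mentioned above, and it is the sort of test an automated validation suite could run on a schedule.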
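In the same spirit, the repair described under "How did we fix it?" amounts to applying the corrected rows and then recomputing the latest-issue flag for every key the patch touched, rather than trusting the statically set value. The sketch below continues the toy `signal_history` table from above and is only an illustration of that idea, not the actual patch applied on September 28, 2022.

```python
# Continues the toy signal_history table from the previous sketch.
# For each (signal, geo_value, time_value) key touched by a patch,
# recompute is_latest_issue from the issues actually stored.
patched_keys = [("cases", "ny", "2022-08-01")]  # hypothetical keys taken from a patch file

for signal, geo_value, time_value in patched_keys:
    # Clear the flag on every issue of this key...
    conn.execute(
        "UPDATE signal_history SET is_latest_issue = 0 "
        "WHERE signal = ? AND geo_value = ? AND time_value = ?",
        (signal, geo_value, time_value),
    )
    # ...then set it only on the most recent issue.
    conn.execute(
        "UPDATE signal_history SET is_latest_issue = 1 "
        "WHERE signal = ? AND geo_value = ? AND time_value = ? "
        "AND issue = (SELECT MAX(issue) FROM signal_history "
        "             WHERE signal = ? AND geo_value = ? AND time_value = ?)",
        (signal, geo_value, time_value, signal, geo_value, time_value),
    )
conn.commit()
```

Rerunning the consistency check from the first sketch after this step should return no rows.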