Commit e8ff69e

Jingjing Tang authored and committed
updated DETAILS
1 parent a3aa123 commit e8ff69e

1 file changed: +16 -114 lines changed

google_symptoms/DETAILS.md

Lines changed: 16 additions & 114 deletions
@@ -1,124 +1,26 @@
1-
# USA Facts Cases and Deaths
1+
# Google Symptoms
2 2

3-
We import the confirmed case and deaths data from USA Facts website and export
4-
the county-level data as-is. We also aggregate the data to the MSA, HRR, and
5-
State levels.
6-
7-
In order to avoid confusing public consumers of the data, we maintain
8-
consistency how USA Facts reports the data, please refer to [Exceptions](#Exceptions).
3+
We import symptom search data from Google Research's
4+
Open COVID-19 Data project and export the county-level and state-level data
5+
as-is.
9 6

10 7
## Geographical Levels (`geo`)
11-
* `county`: reported using zero-padded FIPS codes. There are some exceptions
12-
that lead to inconsistency with the other COVIDcast data (but are necessary
13-
for internal consistency), noted below.
14-
* `msa`: reported using cbsa (consistent with all other COVIDcast sensors)
15-
* `hrr`: reported using HRR number (consistent with all other COVIDcast sensors)
16-
* `state`: reported using two-letter postal code
8+
* `county`: reported using zero-padded FIPS codes. The county-level data is derived
9+
from `/subregions/state/2020_US_state_daily_symptoms_dataset.csv`.
10+
* `state`: reported using two-letter postal code. The state-level data is derived from
11+
`2020_US_daily_symptoms_dataset.csv`, which includes data for the District of Columbia.
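As an illustration of the zero-padded FIPS convention mentioned above, here is a minimal Python sketch. The helper name `zero_pad_fips` is hypothetical and is not taken from the indicator code; it only assumes that a county FIPS code is five digits long (two for the state, three for the county), so a code read as an integer can lose its leading zero.

```python
def zero_pad_fips(code) -> str:
    """Return a county FIPS code as a 5-character, zero-padded string.

    Hypothetical helper for illustration; not part of the indicator code.
    """
    text = str(code).strip()
    # Reject anything that is not 1 to 5 decimal digits.
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a valid county FIPS code: {code!r}")
    return text.zfill(5)


# A code parsed as an integer loses its leading zero; padding restores it.
padded_from_int = zero_pad_fips(2270)
padded_from_str = zero_pad_fips("36061")
```

Keeping the codes as strings throughout (rather than converting back to integers) avoids dropping the leading zero again.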
17 12

18 13
## Metrics, Level 1 (`m1`)
19-
* `confirmed`: Confirmed cases
20-
* `deaths`
14+
* `Anosmia`: Google search volume for Anosmia-related searches
15+
* `Ageusia`: Google search volume for Ageusia-related searches
21 16

22 17
Recoveries are _not_ reported.
23 18

24 19
## Metrics, Level 2 (`m2`)
25-
* `new_counts`: number of new {confirmed cases, deaths} on a given day
26-
* `cumulative_counts`: total number of {confirmed cases, deaths} up until the
27-
first day of data (January 22nd)
28-
* `incidence`: `new_counts` / population * 100000
29-
30-
All three `m2` are ultimately derived from `cumulative_counts`, which is first
31-
available on January 22nd. In constructing `new_counts`, we take the first
32-
discrete difference of `cumulative_counts`, and assume that the
33-
`cumulative_counts` for January 21st is uniformly zero. This should not be a
34-
problem, because there there is only one county with a nonzero
35-
`cumulative_count` on January 22nd, with a value of 1.
36-
37-
For deriving `incidence`, we use the estimated 2019 county population values
38-
from the US Census Bureau. https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-total.html
39-
40-
## Exceptions
41-
42-
At the County (FIPS) level, we report the data _exactly_ as USA Facts reports their
43-
data, to prevent confusing public consumers of the data.
44-
The visualization and modeling teams should take note of these exceptions.
45-
46-
### New York City
47-
48-
New York City comprises of five boroughs:
49-
50-
|Borough Name |County Name |FIPS Code |
51-
|-------------------|-------------------|---------------|
52-
|Manhattan |New York County |36061 |
53-
|The Bronx |Bronx County |36005 |
54-
|Brooklyn |Kings County |36047 |
55-
|Queens |Queens County |36081 |
56-
|Staten Island |Richmond County |36085 |
57-
58-
**New York City Unallocated cases/deaths are reported by USA Facts independently.** We split them evenly among the five NYC FIPS, which results in float numbers.
59-
60-
All NYC counts are mapped to the MSA with CBSA ID 35620, which encompasses
61-
all five boroughs. All NYC counts are mapped to HRR 303, which intersects
62-
all five boroughs (297 also intersects the Bronx, 301 also intersects
63-
Brooklyn and Queens, but absent additional information, We are leaving all
64-
counts in 303).
65-
66-
67-
### Mismatched FIPS Codes
68-
69-
There are two FIPS codes that were changed in 2015, leading to
70-
mismatch between us and USA Facts. We report the data using the FIPS code used
71-
by USA Facts, again to promote consistency and avoid confusion by external users
72-
of the dataset. For the mapping to MSA, HRR, these two counties are
73-
included properly.
74-
75-
|County Name |State |"Our" FIPS |USA Facts FIPS |
76-
|-------------------|---------------|-------------------|---------------|
77-
|Oglala Lakota |South Dakota |46113 |46102 |
78-
|Kusilvak |Alaska |02270 |02158 \& 02270 |
79-
80-
Documentation for the changes made by the US Census Bureau in 2015:
81-
https://www.census.gov/programs-surveys/geography/technical-documentation/county-changes.html
82-
83-
Besides, Wade Hampton Census Area and Kusilvak Census Area are reported by USA Facts with FIPS 02270 and 02158 respectively, though there is always 0 cases/deaths reported for Wade Hampton Census Area (02270). According to US Census Bureau, Wade Hampton Census Area has changed name and code from Wade Hampton Census Area, Alaska (02270) to Kusilvak Census Area, Alaska (02158) effective July 1, 2015.
84-
https://www.census.gov/quickfacts/kusilvakcensusareaalaska
85-
86-
### Grand Princess Cruise Ship
87-
Data from Grand Princess Cruise Ship is given its own dedicated line, with FIPS code 6000. We just ignore these cases/deaths.
88-
89-
90-
91-
92-
## Negative incidence
93-
94-
Negative incidence is possible because figures are sometimes revised
95-
downwards, e.g., when a public health authority moves cases from County X
96-
to County Y, County X may have negative incidence.
97-
98-
## Non-integral counts
99-
100-
Because the MSA and HRR numbers are computed by taking population-weighted
101-
averages, the count data at those geographical levels may be non-integral.
102-
103-
## Counties not in our canonical dataset
104-
105-
Some FIPS codes do not appear as the primary FIPS for any ZIP code in our
106-
canonical `02_20_uszips.csv`; they appear in the `county` exported files, but
107-
for the MSA/HRR mapping, we disburse them equally to the counties with whom
108-
they appear as a secondary FIPS code. The identification of such "secondary"
109-
FIPS codes are documented in `notebooks/create-mappings.ipynb`. The full list
110-
of `secondary, [mapped]` is:
20+
* `raw_search`: Google search volume reported as-is
21+
* `smoothed_search`: Google search volume smoothed with a 7-day moving average
111 22

112-
```
113-
SECONDARY_FIPS = [ # generated by notebooks/create-mappings.ipynb
114-
('51620', ['51093', '51175']),
115-
('51685', ['51153']),
116-
('28039', ['28059', '28041', '28131', '28045', '28059', '28109',
117-
'28047']),
118-
('51690', ['51089', '51067']),
119-
('51595', ['51081', '51025', '51175', '51183']),
120-
('51600', ['51059', '51059', '51059']),
121-
('51580', ['51005']),
122-
('51678', ['51163']),
123-
]
124-
```
23+
This data reflects the volume of Google searches mapped to symptoms such as Anosmia
24+
and Ageusia. The resulting daily dataset shows, for each region, the relative frequency
25+
of searches for each symptom. This signal is measured in arbitrary units that are normalized
26+
for population. Larger values represent more symptom-related searches.
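
The relationship between `raw_search` and `smoothed_search` can be sketched as follows. This is only an illustration under stated assumptions: the diff does not say whether the 7-day window is trailing or centered, or how the first days of a series are handled. The sketch assumes a trailing window and averages over the days available so far when fewer than seven exist. The function name `moving_average` is hypothetical.

```python
from typing import List


def moving_average(raw: List[float], window: int = 7) -> List[float]:
    """Trailing moving average over up to `window` days.

    Assumption: for the first window-1 days, the average is taken over
    the days available so far rather than being left undefined.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(raw):
        running_sum += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running_sum -= raw[i - window]
        days_in_window = min(i + 1, window)
        smoothed.append(running_sum / days_in_window)
    return smoothed


# Made-up daily values for one region, in the signal's arbitrary units.
raw_search = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
smoothed_search = moving_average(raw_search)
```

With this trailing convention, each smoothed value depends only on the current and previous days, so the most recent day can be smoothed as soon as its raw value is available.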
