Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions augur/curate/apply_geolocation_rules.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,16 @@ def load_geolocation_rules(geolocation_rules_file, case_sensitive):

geolocation_rules[raw[0]][raw[1]][raw[2]][raw[3]] = annot

# We want to match '?' values to the empty string as well to allow them to be used as empty values
# but if they are used it's expected they are used for all "empty" fields
annot_using_question_marks = tuple(['?' if x=='' else x for x in annot])
if raw[1]=='' and raw[2]=='' and raw[3]=='':
geolocation_rules[raw[0]]['?']['?']['?'] = annot_using_question_marks
elif raw[2]=='' and raw[3]=='':
geolocation_rules[raw[0]][raw[1]]['?']['?'] = annot_using_question_marks
elif raw[3]=='':
geolocation_rules[raw[0]][raw[1]][raw[2]]['?'] = annot_using_question_marks
Comment on lines +76 to +84
Copy link
Member

@victorlin victorlin Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: instead of adding duplicate rule entries for a new null value, add logic for different null values on the data. Something like this: 5588acb

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer to @joverlee521 here - my heads mostly focused on the downstream fixes this entails rather than the specific way it's fixed. I'm happy to change to your approach as desired.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @victorlin, I like 5588acb!

We can come back to this at a later date since the seasonal-flu ingest is changing the ? values to empty strings as discussed in nextstrain/seasonal-flu#267 (comment).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, easy fix!


return geolocation_rules


Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,13 +13,13 @@ Attempting to use the rules with no matches results in no changes.

$ echo '{"region": "r_old", "country": "c_old", "division": "d_old", "location": "l_something_else"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "r_old", "country": "c_old", "division": "d_old", "location": "l_something_else"}

Attempting to use the rules with a match results in an error.

$ echo '{"region": "r_old", "country": "c_old", "division": "d_old", "location": "l_old"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
ERROR: More than 1000 geolocation rules applied on the same entry ['r_old', 'c_old', 'd_old', 'l_old'].
[2]
Original file line number Diff line number Diff line change
Expand Up @@ -10,13 +10,13 @@ Test that an empty field can be loaded in the raw and annotated columns.

$ echo '{"region": "r_raw", "country": "c_raw", "division": "", "location": ""}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "r_annotated", "country": "c_annotated", "division": "", "location": ""}

This is effectively an exact rule, so it will not apply to any record that does
not match the empty string.

$ echo '{"region": "r_raw", "country": "c_raw", "division": "d_raw", "location": ""}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "r_raw", "country": "c_raw", "division": "d_raw", "location": ""}
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ Field 1

$ echo '{"region": "a", "country": "b", "division": "c", "location": "d"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "a", "country": "2", "division": "3", "location": "4"}

Field 2
Expand All @@ -23,7 +23,7 @@ Field 2

$ echo '{"region": "a", "country": "b", "division": "c", "location": "d"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "1", "country": "b", "division": "3", "location": "4"}

Field 3
Expand All @@ -34,7 +34,7 @@ Field 3

$ echo '{"region": "a", "country": "b", "division": "c", "location": "d"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "1", "country": "2", "division": "c", "location": "4"}

Field 4
Expand All @@ -45,7 +45,7 @@ Field 4

$ echo '{"region": "a", "country": "b", "division": "c", "location": "d"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "1", "country": "2", "division": "3", "location": "d"}

Fields 2,3
Expand All @@ -56,7 +56,7 @@ Fields 2,3

$ echo '{"region": "a", "country": "b", "division": "c", "location": "d"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "1", "country": "b", "division": "c", "location": "4"}

Fields 1,2,3
Expand All @@ -67,5 +67,5 @@ Fields 1,2,3

$ echo '{"region": "a", "country": "b", "division": "c", "location": "d"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "a", "country": "b", "division": "c", "location": "4"}
Original file line number Diff line number Diff line change
Expand Up @@ -13,15 +13,15 @@ more generic wildcard matches.

$ echo '{"region": "a", "country": "b", "division": "c", "location": "x"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "1", "country": "2", "division": "3", "location": "x"}

$ echo '{"region": "a", "country": "b", "division": "x", "location": "x"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "1", "country": "2", "division": "x", "location": "x"}

$ echo '{"region": "a", "country": "x", "division": "x", "location": "x"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "1", "country": "x", "division": "x", "location": "x"}
Original file line number Diff line number Diff line change
Expand Up @@ -11,5 +11,5 @@ Test that comments are ignored.

$ echo '{"region": "r_raw", "country": "c_raw", "division": "d_raw", "location": "l_raw"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "r_annotated", "country": "c_annotated", "division": "d_annotated", "location": "l_annotated"}
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ Valid rule matches are still applied.

$ echo '{"region": "r_old", "country": "c_old", "division": "d_old", "location": "l_old"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
WARNING: Could not decode geolocation rule 'r_old/c_old/d_old/l_old\n'. Please make sure rules are formatted as 'region/country/division/location<tab>region/country/division/location'.
WARNING: Could not decode the annotated geolocation 'r_new/c_new/d_new'. Please make sure it is formatted as 'region/country/division/location'.
WARNING: Could not decode the raw geolocation 'r_old/c_old/d_old'. Please make sure it is formatted as 'region/country/division/location'.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -14,5 +14,5 @@ It applies the last rules that matched.

$ echo '{"region": "r_raw", "country": "c_raw", "division": "d_raw", "location": "l_raw"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "r_annotated_custom", "country": "c_annotated_custom", "division": "d_annotated_custom", "location": "l_annotated_custom"}
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
Setup

$ export AUGUR="${AUGUR:-$TESTDIR/../../../../../bin/augur}"

We sometimes use question marks '?' as an unknown field. `augur curate apply-record-annotations`
special-cases these so that they aren't interpreted as a valid value.

This test uses real data from seasonal-flu where this bug was first observed

Rules taken directly from `augur/data/geolocation_rules.tsv` as at c8181c7

$ cat >rules.tsv <<~~
> Asia/Bangladesh/Jashore/ Asia/Bangladesh/Khulna/Jashore
> Asia/Bangladesh/Jashore/* Asia/Bangladesh/Khulna/*
> ~~

Fist check that if the division is EMPTY then the first rule is applied, as we expect

$ echo '{"region": "Asia", "country": "Bangladesh", "division": "Jashore", "location": ""}' \
> | ${AUGUR} curate apply-geolocation-rules --no-default-rules --geolocation-rules rules.tsv
{"region": "Asia", "country": "Bangladesh", "division": "Khulna", "location": "Jashore"}

but if the location is a "?", which is the case in our seasonal-flu ingest pipeline we want to use the first rule too
rather than interpreting "?" as a valid location to preserve


$ echo '{"region": "Asia", "country": "Bangladesh", "division": "Jashore", "location": "?"}' \
> | ${AUGUR} curate apply-geolocation-rules --no-default-rules --geolocation-rules rules.tsv
{"region": "Asia", "country": "Bangladesh", "division": "Khulna", "location": "Jashore"}


Edge case: preserve '?' in the output if we don't replace it via a rule

$ echo '{"region": "Asia", "country": "X", "division": "?", "location": "?"}' \
> | ${AUGUR} curate apply-geolocation-rules --no-default-rules --geolocation-rules rules.tsv
{"region": "Asia", "country": "X", "division": "?", "location": "?"}


This works for multiple fields in a work-backwards fashion

$ cat >rules.tsv <<~~
> North America/USA Alabama// North America/USA/Alabama/
> ~~

$ echo '{"region": "North America", "country": "USA Alabama", "division": "", "location": ""}' \
> | ${AUGUR} curate apply-geolocation-rules --no-default-rules --geolocation-rules rules.tsv
{"region": "North America", "country": "USA", "division": "Alabama", "location": ""}

Note that the corrected location is '?', not '' which the rule may suggest -- we attempt to preserve the empty-string value which is used
$ echo '{"region": "North America", "country": "USA Alabama", "division": "?", "location": "?"}' \
> | ${AUGUR} curate apply-geolocation-rules --no-default-rules --geolocation-rules rules.tsv
{"region": "North America", "country": "USA", "division": "Alabama", "location": "?"}

NOTE: but division=? and location=empty string (or vice-versa) doesn't work - it's expected that the missing-value is used consistently
$ echo '{"region": "North America", "country": "USA Alabama", "division": "?", "location": ""}' \
> | ${AUGUR} curate apply-geolocation-rules --no-default-rules --geolocation-rules rules.tsv
{"region": "North America", "country": "USA Alabama", "division": "?", "location": ""}

$ echo '{"region": "North America", "country": "USA Alabama", "division": "", "location": "?"}' \
> | ${AUGUR} curate apply-geolocation-rules --no-default-rules --geolocation-rules rules.tsv
{"region": "North America", "country": "USA Alabama", "division": "", "location": "?"}
Original file line number Diff line number Diff line change
Expand Up @@ -12,20 +12,20 @@ Rule matching is case-insensitive and the output matches the casing of the annot

$ echo '{"region": "North America", "country": "USA", "division": "CA", "location": "Los Angeles"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "North America", "country": "USA", "division": "California", "location": "Los Angeles"}

Rule matching is case-insensitive by default, so raw values with mismatched casing will still get updated.

$ echo '{"region": "North America", "country": "USA", "division": "Ca", "location": "Los Angeles"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "North America", "country": "USA", "division": "California", "location": "Los Angeles"}

Using the `--case-sensitive` flag will make rule matching case-sensitive.

$ echo '{"region": "North America", "country": "USA", "division": "Ca", "location": "Los Angeles"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv \
> --no-default-rules --geolocation-rules rules.tsv \
> --case-sensitive
{"region": "North America", "country": "USA", "division": "Ca", "location": "Los Angeles"}
Original file line number Diff line number Diff line change
Expand Up @@ -11,5 +11,5 @@ Test that matching rules are used over general rules.

$ echo '{"region": "a", "country": "b", "division": "c", "location": "d"}' \
> | ${AUGUR} curate apply-geolocation-rules \
> --geolocation-rules rules.tsv
> --no-default-rules --geolocation-rules rules.tsv
{"region": "1", "country": "2", "division": "3", "location": "4"}
Loading