geo parsing issue #227

@liniiiiii

Description

I tested the event lUwwjfR.json, where the raw model output for the administrative areas is:
"Administrative_Areas": [
"Philippines",
"Taiwan",
"China"
],

The parse_events.py output is:

['Tukchenzam', 'Valley County, Nebraska, United States', 'Taiwan', 'Philippines', 'Aksai Chin', 'China']

I added logging to parse_events.py for the steps below:

  1. In the Admin_Area_Norm step, the output is fine:
parse_events: 2025-09-18 11:49:11 INFO     Ensuring that all admin area data in Administrative_Areas is of type <list>
parse_events: 2025-09-18 11:49:11 INFO     Normalizing administrative areas...
parse_events: 2025-09-18 11:49:11 INFO     Processing area: Philippines
parse_events: 2025-09-18 11:49:11 INFO     Processing area: Taiwan
parse_events: 2025-09-18 11:49:11 INFO     Processing area: China
  2. In the step that infers the country from the list of locations, extra areas appear in the result:
parse_events: 2025-09-18 11:49:11 INFO     STEP: Infer country from list of locations
parse_events: 2025-09-18 11:49:11 INFO     Getting GID from GADM for Administrative Areas
parse_events: 2025-09-18 11:49:12 INFO     STEP: Infer country result ['Philippines', 'Pa-li-chia-ssu', 'Taiwan', 'Shaksgam Valley', 'China', 'Aksai Chin']
parse_events: 2025-09-18 11:49:12 INFO     Processing GID area: Philippines
parse_events: 2025-09-18 11:49:12 INFO     Processing GID area: Pa-li-chia-ssu
parse_events: 2025-09-18 11:49:12 INFO     Processing GID area: Taiwan
parse_events: 2025-09-18 11:49:12 INFO     Processing GID area: Shaksgam Valley
parse_events: 2025-09-18 11:49:12 INFO     Processing GID area: China
parse_events: 2025-09-18 11:49:12 INFO     Processing GID area: Aksai Chin

This happens because China has four GIDs: ['Z03', 'CHN', 'Z08', 'Z02']. Apart from 'CHN' (mainland China), the other GIDs are for conflict areas along the border with India, etc.

So I updated the get_gid_0 function in normalize_locations.py so that only purely alphabetic GIDs (those that pass isalpha) are allowed in this country inference step.
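
A minimal sketch of the filtering idea, for illustration only. This is not the actual get_gid_0 code; the helper name keep_country_gids is hypothetical, and it assumes the GIDs arrive as a plain list of strings:

```python
def keep_country_gids(gids):
    """Keep only GIDs made up purely of letters (e.g. 'CHN').

    GIDs that contain digits (e.g. 'Z03') are dropped, on the assumption
    that they refer to conflict areas rather than countries.
    """
    return [gid for gid in gids if gid.isalpha()]


# The four GIDs reported for China in this issue
china_gids = ["Z03", "CHN", "Z08", "Z02"]
print(keep_country_gids(china_gids))  # ['CHN']
```
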
Then I tested event OzSN6a4.json, where the raw model output is:
"Administrative_Areas": [
"Bangladesh",
"India",
"Sri Lanka",
"Yemen",
"Pakistan",
"Vietnam",
"Thailand",
"Burma",
"Nepal"
],
Before the change, the parsed output was ['San Francisco, California, United States', 'Pakistan', 'Sri Lanka', 'Jammu and Kashmir', 'Kaurik', 'Vietnam', 'Bangladesh', 'Arunachal Pradesh', 'India', 'Azad Kashmir', 'Myanmar', 'Nepal', 'Yemen', 'Thailand', 'Lapthal post'], since India also has several GIDs apart from the mainland.
After the change, the output is ['Thailand', 'Sri Lanka', 'Bangladesh', 'Yemen', 'Vietnam', 'Myanmar', 'India', 'Nepal', 'Pakistan']

There are 8 GIDs in total that contain digits. None of them are countries; all of them are conflict areas.
