@liniiiiii
Collaborator

I updated get_gid_0 for the country code. Please help test it.

@liniiiiii liniiiiii mentioned this pull request Sep 18, 2025
@liniiiiii
Collaborator Author

I added one if condition in parse_events.py to handle the administrative area type for Taiwan.

(wikimpacts-py3.11) [node380:~/bvo00012/vsc10684/WikimpactsV1/Wikimpacts]$ poetry run python3 Database/output/geo_parsing_fix_3/geoparse.py
['Philippines', 'Taiwan', 'China']
(wikimpacts-py3.11) [node380:~/bvo00012/vsc10684/WikimpactsV1/Wikimpacts]$ poetry run python3 Database/output/geo_parsing_fix_3/geoparse.py
['administrative:country', 'administrative:state', 'administrative:country']

Collaborator

@i-be-snek i-be-snek left a comment

It seems the only change now is checking that the gid is an alphabetic string. Adding as much validation as possible is encouraged, so this code is fine. My question is: what problem were you trying to solve with this change?

@i-be-snek
Collaborator

I just checked the attached issue. So the thing is, some GADM codes are alphanumeric and not just alphabetic. The ones I was able to find all consist of Z followed by two digits and relate to China, India, and some other regions. And your reason for trying to remove them is that they include disputed areas, specifically with regard to China.

Because China has four GIDs: apart from mainland China, the other GIDs are for disputed areas along the border with India, etc. ['Z03', 'CHN', 'Z08', 'Z02']
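
To make the two code shapes discussed here concrete, below is a minimal sketch that tells them apart. It assumes country-level codes are either three uppercase letters or Z followed by two digits, which is only the pattern observed in this thread and not a documented GADM rule. classify_gid_0 is a hypothetical helper, not a function in this repository.

```python
import re

# Assumption: these two shapes cover the codes seen in this thread.
# They are not taken from any official GADM specification.
ALPHA3 = re.compile(r"[A-Z]{3}")   # e.g. CHN
Z_CODE = re.compile(r"Z[0-9]{2}")  # e.g. Z03


def classify_gid_0(gid: str) -> str:
    """Label a country-level code by its shape (hypothetical helper)."""
    if ALPHA3.fullmatch(gid):
        return "alphabetic"
    if Z_CODE.fullmatch(gid):
        return "z_code"
    return "unknown"


# The four codes quoted above for China.
codes = ["Z03", "CHN", "Z08", "Z02"]
by_kind = {gid: classify_gid_0(gid) for gid in codes}
```

Labelling the codes instead of discarding them keeps the alphanumeric ones available for a later decision.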

My opinion here is that this is not a conflict. It is by design: this is how GADM chooses to break down areas based on its own criteria. If your goal is to remove the location information of these areas so that they all appear to be part of China (GADM code CHN), then you may be reducing the granularity of our data and erasing data that is essential. If a flood happens in Tibet, how useful is the data if we remove Tibet (GADM code Z03) and lump it all under China (GADM code CHN)? And if a flood happens in all of China, then that would by extension include Tibet given its geographical location.

You may also be introducing unfair breakdowns based on political reasons. Sorry, I will not pass this PR or spend more time on this until we have had a discussion about this in the group. I need to understand the pros and cons of doing such filtering, but I'm still struggling to do that at the moment.

@liniiiiii
Collaborator Author

I think the issue is not about excluding them from the GIDs, but about making sure we don't get unexpected parsed output such as 'Tukchenzam', 'Valley County, Nebraska, United States' from the original output ["Philippines", "Taiwan", "China"]. Maybe something is wrong in the normalize_location code, but I haven't found it yet. For now, an easy solution is to filter them out in the GID_0 function to prevent this error.
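
A minimal sketch of the kind of filter described above, assuming the check is simply str.isalpha() on each code. split_gid_0 is a hypothetical helper and not the project's actual get_gid_0. It returns the rejected codes alongside the kept ones, so whatever is filtered out stays visible and is not silently lost.

```python
def split_gid_0(gids):
    """Split codes into alphabetic ones and the rest (hypothetical helper)."""
    kept = [gid for gid in gids if gid.isalpha()]
    dropped = [gid for gid in gids if not gid.isalpha()]
    return kept, dropped


# The four codes mentioned for China earlier in this thread.
kept, dropped = split_gid_0(["Z03", "CHN", "Z08", "Z02"])
print(kept)     # codes that pass the alphabetic check
print(dropped)  # codes the filter would remove
```

Whether the dropped codes should be discarded, logged, or kept is the open question for the group discussion. This sketch only makes the effect of the filter explicit.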

3 participants