This project provides a cleaning, correction, and normalization pipeline for the historical basemaps dataset from aourednik/historical-basemaps. The goal is to make the GeoJSON files more consistent, usable, and reliable for research, visualization, and exploration.
👉 The final usable GeoJSON files are available in the geojson/ folder.
AI assistance was used during this project.
The cleaning script performs the following operations:
- 🚫 Removes features without a `NAME`.
- 🧹 Cleans and normalizes feature properties.
- ❌ Drops unused fields (e.g. `ABBREVN`).
- 🔄 Renames fields for consistency:
  - `BORDERPRECISION` → `BORDER_PRECISION`
  - merges `PARTOF` and `SUBJECTO` → `PART_OF`
- 🗂️ Deduplicates features by `NAME` and merges their geometries.
- 📐 Flattens geometries to ensure valid `Polygon` or `MultiPolygon` output.
- 📏 Ensures all `BORDER_PRECISION` values are `1`, `2`, or `3` (defaults to `1`).
- ✂️ Trims whitespace in text fields (`NAME`, `PART_OF`).
- 📝 Generates a detailed report (`reports/clean_report.txt`) with statistics per year.
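To make the rules above concrete, here is a minimal sketch of the per-feature property cleaning. It is illustrative only; the helper name and the exact order of operations are assumptions, not taken from the actual script.

```python
# Illustrative sketch of the cleaning rules listed above (not the actual script).
def clean_feature(feature: dict) -> dict | None:
    props = dict(feature.get("properties") or {})

    # Drop features without a NAME (trimming whitespace first).
    name = (props.get("NAME") or "").strip()
    if not name:
        return None
    props["NAME"] = name

    # Drop unused fields and rename for consistency.
    props.pop("ABBREVN", None)
    if "BORDERPRECISION" in props:
        props["BORDER_PRECISION"] = props.pop("BORDERPRECISION")

    # Merge PARTOF / SUBJECTO into a single PART_OF field.
    part_of = props.pop("PARTOF", None)
    subjecto = props.pop("SUBJECTO", None)
    merged = ((part_of or subjecto) or "").strip()
    props["PART_OF"] = merged or None

    # Clamp BORDER_PRECISION to 1, 2, or 3 (defaults to 1).
    if props.get("BORDER_PRECISION") not in (1, 2, 3):
        props["BORDER_PRECISION"] = 1

    feature["properties"] = props
    return feature
```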
Extracts all unique `NAME` and `PART_OF` values from the cleaned files:
- 📋 Creates `reports/names.json` with all unique names per year
- 🎯 Prepares data for AI-powered correction generation
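A minimal sketch of what this extraction amounts to. The `data_clean/` file layout and the exact structure written to `names.json` are assumptions for illustration.

```python
import json
from pathlib import Path

# Collect unique NAME and PART_OF values per year from the cleaned files.
names_by_year = {}
for path in sorted(Path("data_clean").glob("*.geojson")):
    year = path.stem  # illustrative: assumes the year is encoded in the file name
    collection = json.loads(path.read_text(encoding="utf-8"))
    names, part_of = set(), set()
    for feature in collection.get("features", []):
        props = feature.get("properties", {})
        if props.get("NAME"):
            names.add(props["NAME"])
        if props.get("PART_OF"):
            part_of.add(props["PART_OF"])
    names_by_year[year] = {"names": sorted(names), "part_of": sorted(part_of)}

Path("reports").mkdir(exist_ok=True)
Path("reports/names.json").write_text(
    json.dumps(names_by_year, indent=2, ensure_ascii=False), encoding="utf-8"
)
```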
Uses Google Gemini AI to generate corrections for historical inaccuracies:
- 🤖 Leverages Google Gemini 2.5 Flash model for historical geography expertise
- 🏛️ Normalizes country names to canonical forms
- 🌍 Corrects historical relationships (what was part of what in specific years)
- 📚 Provides reliable sources (Wikipedia, etc.) for corrections
- 💾 Saves corrections to `reports/correction_map.json`
- 🔄 Supports incremental processing and retry of failed years
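A hedged sketch of the Gemini call using the `google-generativeai` package. The package choice and function layout are assumptions; only the model name and the opening of the prompt come from this README (the full prompt template is reproduced further down).

```python
import os
import google.generativeai as genai

# Abbreviated version of the prompt template shown in full later in this README.
PROMPT_TEMPLATE = (
    "You are an expert in historical geography.\n"
    "For the year {year}, here are extracted values:\n"
    "- NAME: {names}\n"
    "- PART_OF: {part_of}\n"
    # ... remaining instructions and JSON response format omitted in this sketch ...
)

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")

def generate_corrections_for_year(year: str, names: list[str], part_of: list[str]) -> str:
    """Ask Gemini for one year's correction map; returns the raw reply text."""
    prompt = PROMPT_TEMPLATE.format(
        year=year, names=", ".join(names), part_of=", ".join(part_of)
    )
    response = model.generate_content(prompt)
    return response.text  # expected to contain a fenced JSON code block
```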
Applies the AI-generated corrections to cleaned files:
- ✏️ Renames countries to canonical forms
- 🔗 Updates `PART_OF` relationships based on historical context
- 📝 Generates a detailed correction report (`reports/corrections_report.txt`)
- 🎯 Outputs final corrected files to `data_corrected/`
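As an illustration only (the real `apply_corrections.py` may differ), applying one year's entries from the correction map could look like this. The entry layout follows the JSON schema shown in the prompt template below.

```python
def apply_corrections(features: list[dict], corrections: dict) -> None:
    """Apply {ENTITY_NAME: {"rename_to": ..., "include_in": ...}} corrections in place."""
    for feature in features:
        props = feature["properties"]
        entry = corrections.get(props.get("NAME"))
        if not entry:
            continue
        # rename_to is the canonical name, or null if unchanged.
        if entry.get("rename_to"):
            props["NAME"] = entry["rename_to"]
        # include_in is the parent entity, or null if independent in that year.
        props["PART_OF"] = entry.get("include_in")
```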
Runs a last consistency check after corrections and exports the final cleaned dataset to geojson/.
- 🚀 Ensures all geometries are valid and deduplicated
- ✂️ Cleans properties one more time
- 📦 Saves the definitive, ready-to-use dataset in `geojson/`
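A rough sketch of the geometry side of this pass, assuming shapely is used for validation and merging; this is an assumption for illustration, not necessarily the actual implementation.

```python
from collections import defaultdict
from shapely.geometry import shape, mapping
from shapely.ops import unary_union
from shapely.validation import make_valid

def merge_duplicates(features: list[dict]) -> list[dict]:
    """Group features by NAME, repair geometries, and merge duplicates into one feature."""
    groups = defaultdict(list)
    for feature in features:
        groups[feature["properties"]["NAME"]].append(feature)

    merged = []
    for name, group in groups.items():
        geoms = [make_valid(shape(f["geometry"])) for f in group if f.get("geometry")]
        union = unary_union(geoms)  # typically a Polygon or MultiPolygon
        merged.append({
            "type": "Feature",
            "properties": group[0]["properties"],
            "geometry": mapping(union),
        })
    return merged
```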
.
├── data_raw/ # Original GeoJSON files (input)
├── data_clean/ # Cleaned GeoJSON files (intermediate)
├── data_corrected/ # Corrected GeoJSON files (before final pass)
├── geojson/ # ✅ Final cleaned & corrected GeoJSON files
├── reports/ # Reports and intermediate results
├── scripts/ # Processing pipeline
│ ├── clean_geojson.py # Step 1: Data cleaning
│ ├── extract_names.py # Step 2: Name extraction
│ ├── generate_corrections.py # Step 3: AI correction generation
│ ├── apply_corrections.py # Step 4: Apply corrections
│ ├── final_clean.py # Step 5: Final clean & export
│ ├── constants.py
│ ├── geo_types.py
│ ├── reports.py
│ └── utils.py
└── README.md
If you want to rebuild the dataset from the raw data, follow the steps below.
- Clone this repository:
  `git clone https://github.com/Uspectacle/historical-basemaps-cleaned.git`
  `cd historical-basemaps-cleaned`
- Create the environment and install dependencies:
  `conda env create -f environment.yml`
  `conda activate hb_clean`
- Set up the Google API key (for AI corrections):
  `export GOOGLE_API_KEY="your_google_api_key_here"`
- Place the original raw GeoJSON files in `data_raw/`.
Run the complete pipeline in order:
# Step 1: Clean the raw data
python -m scripts.clean_geojson
# Step 2: Extract unique names for correction
python -m scripts.extract_names
# Step 3: Generate AI corrections (requires API key)
python -m scripts.generate_corrections
# Step 3.5: You may want to re-generate corrections for specific years
# python -m scripts.generate_corrections --years "1914,1939,1945"
# Step 4: Apply corrections
python -m scripts.apply_corrections
# Step 5: Final clean & export to geojson/
python -m scripts.final_clean

Example summary from the cleaning step (`reports/clean_report.txt`):

OVERALL SUMMARY
---------------
Total features before: 17523
Total features after: 9919
Total removed features without NAME: 7310
Total countries merged (had duplicates): 192
Total border precision fixed (set to 1): 4

Example summary from the correction step (`reports/corrections_report.txt`):

OVERALL SUMMARY
---------------
Total renamed features: 1685
Total PART_OF corrections: 1072
The prompt template used for correction generation:

You are an expert in historical geography.
For the year {year}, here are extracted values:
- NAME: {names}
- PART_OF: {part_of}
Your tasks:
1. Normalize NAMEs so variants map to a canonical form.
2. Normalize PART_OF values and ensure they match the historical context of {year}.
3. If a country was part of another in {year}, set 'include_in' accordingly.
If independent, set 'include_in' to null.
4. Provide a reliable source (Wikipedia or similar) and optionally some additional notes.
Respond ONLY with valid JSON, wrapped inside a fenced JSON code block like this:
```json
{{
"{year}": {{
"ENTITY_NAME": {{
"rename_to": "Canonical Name or null if unchanged",
"include_in": "Parent entity or null if independent",
"source": "URL",
"note": "Additional comments"
}}
}}
}}
```
The placeholders `{year}`, `{names}`, and `{part_of}` are filled in from `reports/names.json`.
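For completeness, here is a small sketch of how such a fenced JSON reply can be parsed back into the correction map; the function name is illustrative and this is not necessarily how the script does it.

```python
import json
import re

def parse_correction_reply(reply_text: str) -> dict:
    """Extract the fenced JSON block from the model reply and parse it."""
    match = re.search(r"```json\s*(\{.*\})\s*```", reply_text, re.DOTALL)
    if not match:
        raise ValueError("No fenced JSON block found in the model reply")
    return json.loads(match.group(1))
```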
Suggestions, issues, and pull requests are welcome! If you have expertise in history or historical geography, your feedback would be especially valuable.
You can either:
- Open an issue here
- Submit a pull request on this repository
- Or, if appropriate, propose a PR to the original historical-basemaps project
This project follows the same license as the original historical-basemaps dataset. GNU General Public License v3.0 – see LICENSE for details.