
🗺️ Historical Basemaps Cleaned

This project provides a cleaning, correction, and normalization pipeline for the historical basemaps dataset from aourednik/historical-basemaps. The goal is to make the GeoJSON files more consistent, usable, and reliable for research, visualization, and exploration.

👉 The final usable GeoJSON files are available in the geojson/ folder.

AI assistance was used during the development of this project.


✨ Features

1. Data Cleaning (scripts/clean_geojson.py)

The cleaning script performs the following operations:

  • 🚫 Removes features without a NAME.
  • 🧹 Cleans and normalizes feature properties.
  • ❌ Drops unused fields (e.g. ABBREVN).
  • 🔄 Renames fields for consistency:
    • BORDERPRECISION → BORDER_PRECISION
    • merges PARTOF and SUBJECTO → PART_OF
  • 🗂️ Deduplicates features by NAME and merges their geometries.
  • 📐 Flattens geometries to ensure valid Polygon or MultiPolygon output.
  • 📏 Ensures all BORDER_PRECISION values are 1, 2, or 3 (defaults to 1).
  • ✂️ Trims whitespace in text fields (NAME, PART_OF).
  • 📝 Generates a detailed report (reports/clean_report.txt) with statistics per year.
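In outline, the cleaning step can be sketched as below. This is a simplified illustration, not the actual scripts/clean_geojson.py: the real script likely performs proper geometry unions for duplicates rather than this naive coordinate concatenation, and handles more fields.

```python
VALID_PRECISION = {1, 2, 3}

def polygons_of(geometry):
    """Yield Polygon coordinate arrays from a Polygon or MultiPolygon."""
    gtype = (geometry or {}).get("type")
    if gtype == "Polygon":
        yield geometry["coordinates"]
    elif gtype == "MultiPolygon":
        yield from geometry["coordinates"]

def clean_features(features):
    """Drop unnamed features, normalize properties, merge duplicates by NAME."""
    merged = {}
    for feat in features:
        props = feat.get("properties") or {}
        name = (props.get("NAME") or "").strip()
        if not name:
            continue  # remove features without a NAME
        precision = props.get("BORDER_PRECISION", 1)
        if precision not in VALID_PRECISION:
            precision = 1  # out-of-range values default to 1
        part_of = (props.get("PART_OF") or "").strip() or None
        polys = list(polygons_of(feat.get("geometry")))
        if name in merged:
            # duplicate NAME: fold its polygons into the existing MultiPolygon
            merged[name]["geometry"]["coordinates"].extend(polys)
        else:
            merged[name] = {
                "type": "Feature",
                "properties": {"NAME": name, "PART_OF": part_of,
                               "BORDER_PRECISION": precision},
                "geometry": {"type": "MultiPolygon", "coordinates": polys},
            }
    return list(merged.values())
```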

2. Name Extraction (scripts/extract_names.py)

Extracts all unique NAME and PART_OF values from cleaned files:

  • 📋 Creates reports/names.json with all unique names per year
  • 🎯 Prepares data for AI-powered correction generation
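The extraction itself is straightforward; a minimal sketch might look like this (the exact schema of reports/names.json is an assumption here, as the README does not show it):

```python
def extract_names(files_by_year):
    """Collect unique NAME and PART_OF values per year.

    `files_by_year` maps a year to a parsed GeoJSON FeatureCollection dict;
    returns a dict ready to be dumped to reports/names.json.
    """
    names = {}
    for year, collection in sorted(files_by_year.items()):
        entry = {"names": set(), "part_of": set()}
        for feat in collection.get("features", []):
            props = feat.get("properties") or {}
            if props.get("NAME"):
                entry["names"].add(props["NAME"])
            if props.get("PART_OF"):
                entry["part_of"].add(props["PART_OF"])
        names[str(year)] = {key: sorted(values) for key, values in entry.items()}
    return names
```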

3. AI-Powered Corrections (scripts/generate_corrections.py)

Uses Google Gemini AI to generate corrections for historical inaccuracies:

  • 🤖 Leverages Google Gemini 2.5 Flash model for historical geography expertise
  • 🏛️ Normalizes country names to canonical forms
  • 🌍 Corrects historical relationships (what was part of what in specific years)
  • 📚 Provides reliable sources (Wikipedia, etc.) for corrections
  • 💾 Saves corrections to reports/correction_map.json
  • 🔄 Supports incremental processing and retry of failed years
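Since the prompt (see the AI Correction Prompt section) asks the model to answer with a fenced JSON block, the response has to be unwrapped before use. A sketch of that parsing step, assuming the response format the prompt requests:

```python
import json
import re

# Matches a ```json ... ``` fenced block and captures the JSON object inside.
FENCE = re.compile(r"```json\s*(\{.*?\})\s*```", re.DOTALL)

def parse_correction_response(text):
    """Extract the fenced JSON block from a model reply.

    Falls back to parsing the whole reply if no fence is present.
    Raises json.JSONDecodeError on malformed output, so callers can
    mark the year as failed and retry it later.
    """
    match = FENCE.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload)
```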

4. Correction Application (scripts/apply_corrections.py)

Applies the AI-generated corrections to cleaned files:

  • ✏️ Renames countries to canonical forms
  • 🔗 Updates PART_OF relationships based on historical context
  • 📝 Generates detailed correction report (reports/corrections_report.txt)
  • 🎯 Outputs final corrected files to data_corrected/
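In essence, applying a year's corrections is a property rewrite. A hedged sketch, assuming `corrections` is one year's entry from reports/correction_map.json (the per-entity mapping shown in the AI Correction Prompt section):

```python
def apply_corrections(features, corrections):
    """Apply rename_to / include_in entries to each feature's properties.

    Returns (renamed, part_of_fixed) counts for the corrections report.
    """
    renamed = part_of_fixed = 0
    for feat in features:
        props = feat["properties"]
        entry = corrections.get(props.get("NAME"))
        if not entry:
            continue
        if entry.get("rename_to"):
            props["NAME"] = entry["rename_to"]
            renamed += 1
        # include_in may legitimately be null (independent entity),
        # so check for the key rather than for truthiness.
        if "include_in" in entry and entry["include_in"] != props.get("PART_OF"):
            props["PART_OF"] = entry["include_in"]
            part_of_fixed += 1
    return renamed, part_of_fixed
```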

5. Final Clean (scripts/final_clean.py)

Runs a last consistency check after corrections and exports the final cleaned dataset to geojson/.

  • 🚀 Ensures all geometries are valid and deduplicated
  • ✂️ Cleans properties one more time
  • 📦 Saves the definitive, ready-to-use dataset in geojson/

📂 Project Structure

.
├── data_raw/           # Original GeoJSON files (input)
├── data_clean/         # Cleaned GeoJSON files (intermediate)
├── data_corrected/     # Corrected GeoJSON files (before final pass)
├── geojson/            # ✅ Final cleaned & corrected GeoJSON files
├── reports/            # Reports and intermediate results
├── scripts/            # Processing pipeline
│   ├── clean_geojson.py        # Step 1: Data cleaning
│   ├── extract_names.py        # Step 2: Name extraction
│   ├── generate_corrections.py # Step 3: AI correction generation
│   ├── apply_corrections.py    # Step 4: Apply corrections
│   ├── final_clean.py          # Step 5: Final clean & export
│   ├── constants.py
│   ├── geo_types.py
│   ├── reports.py
│   └── utils.py
└── README.md

🚀 Usage

Follow these steps if you want to rebuild the dataset from the raw data.

Prerequisites

  1. Clone this repository:
     git clone https://github.com/Uspectacle/historical-basemaps-cleaned.git
     cd historical-basemaps-cleaned
  2. Create the environment and install dependencies:
     conda env create -f environment.yml
     conda activate hb_clean
  3. Set up your Google API key (required for AI corrections):
     export GOOGLE_API_KEY="your_google_api_key_here"
  4. Place the original raw GeoJSON files in data_raw/.

Full Pipeline

Run the complete pipeline in order:

# Step 1: Clean the raw data
python -m scripts.clean_geojson

# Step 2: Extract unique names for correction
python -m scripts.extract_names

# Step 3: Generate AI corrections (requires API key)
python -m scripts.generate_corrections

# Step 3.5: You may want to re-generate corrections for specific years
# python -m scripts.generate_corrections --years "1914,1939,1945"

# Step 4: Apply corrections
python -m scripts.apply_corrections

# Step 5: Final clean & export to geojson/
python -m scripts.final_clean

📊 Sample Reports

Cleaning Report (excerpt)

OVERALL SUMMARY
---------------
Total features before: 17523
Total features after:  9919
Total removed features without NAME: 7310
Total countries merged (had duplicates): 192
Total border precision fixed (set to 1): 4

Corrections Report (excerpt)

OVERALL SUMMARY
---------------
Total renamed features: 1685
Total PART_OF corrections: 1072

🤖 AI Correction Prompt

You are an expert in historical geography.

For the year {year}, here are extracted values:
- NAME: {names}
- PART_OF: {part_of}

Your tasks:
1. Normalize NAMEs so variants map to a canonical form.
2. Normalize PART_OF values and ensure they match the historical context of {year}.
3. If a country was part of another in {year}, set 'include_in' accordingly.
   If independent, set 'include_in' to null.
4. Provide a reliable source (Wikipedia or similar) and optionally some additional notes.

Respond ONLY with valid JSON, wrapped inside a fenced JSON code block like this:
```json
{{
  "{year}": {{
    "ENTITY_NAME": {{
      "rename_to": "Canonical Name or null if unchanged",
      "include_in": "Parent entity or null if independent",
      "source": "URL",
      "note": "Additional comments"
    }}
  }}
}}

```

The placeholders {year}, {names}, and {part_of} are filled in from reports/names.json.
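A minimal sketch of that substitution step (the `build_prompt` helper and the shape of each names.json entry are assumptions for illustration, not the actual code in scripts/generate_corrections.py):

```python
# Abbreviated template; the doubled braces in the full prompt render as
# literal braces under str.format, so the JSON example survives formatting.
PROMPT_TEMPLATE = """You are an expert in historical geography.

For the year {year}, here are extracted values:
- NAME: {names}
- PART_OF: {part_of}
"""

def build_prompt(year, entry):
    """Fill the template from one year's entry in reports/names.json,
    assumed to hold 'names' and 'part_of' lists."""
    return PROMPT_TEMPLATE.format(
        year=year,
        names=", ".join(entry["names"]),
        part_of=", ".join(entry["part_of"]),
    )
```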


🤝 Contributing

Suggestions, issues, and pull requests are welcome! If you have expertise in history or historical geography, your feedback would be especially valuable.

You can either:

  • Open an issue on this repository
  • Submit a pull request
  • Or, if appropriate, propose a PR to the original historical-basemaps project

📜 License

This project follows the same license as the original historical-basemaps dataset: GNU General Public License v3.0 – see LICENSE for details.
