This project provides a cleaning, correction, and normalization pipeline for the historical basemaps dataset from aourednik/historical-basemaps. The goal is to make the GeoJSON files more consistent, usable, and reliable for research, visualization, and exploration.
👉 The final usable GeoJSON files are available in the geojson/ folder.
AI assistance was used during this project.
The cleaning script performs the following operations:
- 🚫 Removes features without a `NAME`.
- 🧹 Cleans and normalizes feature properties.
- ❌ Drops unused fields (e.g. `ABBREVN`).
- 🔄 Renames fields for consistency:
  - `BORDERPRECISION` → `BORDER_PRECISION`
  - merges `PARTOF` and `SUBJECTO` → `PART_OF`
- 🗂️ Deduplicates features by `NAME` and merges their geometries.
- 📐 Flattens geometries to ensure valid `Polygon` or `MultiPolygon` output.
- 📏 Ensures all `BORDER_PRECISION` values are `1`, `2`, or `3` (defaults to `1`).
- ✂️ Trims whitespace in text fields (`NAME`, `PART_OF`).
- 📝 Generates a detailed report (`reports/clean_report.txt`) with statistics per year.
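To make the rules above concrete, here is a minimal sketch of the per-feature property cleaning. It is illustrative only; the helper name and the exact order of operations are assumptions, not taken from the actual script.

```python
# Illustrative sketch of the cleaning rules listed above (not the actual script).
def clean_feature(feature: dict) -> dict | None:
    props = dict(feature.get("properties") or {})

    # Drop features without a NAME (trimming whitespace first).
    name = (props.get("NAME") or "").strip()
    if not name:
        return None
    props["NAME"] = name

    # Drop unused fields and rename for consistency.
    props.pop("ABBREVN", None)
    if "BORDERPRECISION" in props:
        props["BORDER_PRECISION"] = props.pop("BORDERPRECISION")

    # Merge PARTOF / SUBJECTO into a single PART_OF field.
    part_of = props.pop("PARTOF", None)
    subjecto = props.pop("SUBJECTO", None)
    merged = ((part_of or subjecto) or "").strip()
    props["PART_OF"] = merged or None

    # Clamp BORDER_PRECISION to 1, 2, or 3 (defaults to 1).
    if props.get("BORDER_PRECISION") not in (1, 2, 3):
        props["BORDER_PRECISION"] = 1

    feature["properties"] = props
    return feature
```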
Extracts all unique `NAME` and `PART_OF` values from the cleaned files:
- 📋 Creates `reports/names.json` with all unique names per year
- 🎯 Prepares data for AI-powered correction generation
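A minimal sketch of what this extraction amounts to. The `data_clean/` file layout and the exact structure written to `names.json` are assumptions for illustration.

```python
import json
from pathlib import Path

# Collect unique NAME and PART_OF values per year from the cleaned files.
names_by_year = {}
for path in sorted(Path("data_clean").glob("*.geojson")):
    year = path.stem  # illustrative: assumes the year is encoded in the file name
    collection = json.loads(path.read_text(encoding="utf-8"))
    names, part_of = set(), set()
    for feature in collection.get("features", []):
        props = feature.get("properties", {})
        if props.get("NAME"):
            names.add(props["NAME"])
        if props.get("PART_OF"):
            part_of.add(props["PART_OF"])
    names_by_year[year] = {"names": sorted(names), "part_of": sorted(part_of)}

Path("reports").mkdir(exist_ok=True)
Path("reports/names.json").write_text(
    json.dumps(names_by_year, indent=2, ensure_ascii=False), encoding="utf-8"
)
```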
Uses Google Gemini AI to generate corrections for historical inaccuracies:
- 🤖 Leverages Google Gemini 2.5 Flash model for historical geography expertise
- 🏛️ Normalizes country names to canonical forms
- 🌍 Corrects historical relationships (what was part of what in specific years)
- 📚 Provides reliable sources (Wikipedia, etc.) for corrections
- 💾 Saves corrections to `reports/correction_map.json`
- 🔄 Supports incremental processing and retry of failed years
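A hedged sketch of the Gemini call using the `google-generativeai` package. The package choice and function layout are assumptions; only the model name and the opening of the prompt come from this README (the full prompt template is reproduced further down).

```python
import os
import google.generativeai as genai

# Abbreviated version of the prompt template shown in full later in this README.
PROMPT_TEMPLATE = (
    "You are an expert in historical geography.\n"
    "For the year {year}, here are extracted values:\n"
    "- NAME: {names}\n"
    "- PART_OF: {part_of}\n"
    # ... remaining instructions and JSON response format omitted in this sketch ...
)

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")

def generate_corrections_for_year(year: str, names: list[str], part_of: list[str]) -> str:
    """Ask Gemini for one year's correction map; returns the raw reply text."""
    prompt = PROMPT_TEMPLATE.format(
        year=year, names=", ".join(names), part_of=", ".join(part_of)
    )
    response = model.generate_content(prompt)
    return response.text  # expected to contain a fenced JSON code block
```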
Applies the AI-generated corrections to cleaned files:
- ✏️ Renames countries to canonical forms
- 🔗 Updates `PART_OF` relationships based on historical context
- 📝 Generates a detailed correction report (`reports/corrections_report.txt`)
- 🎯 Outputs final corrected files to `data_corrected/`
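As an illustration only (the real `apply_corrections.py` may differ), applying one year's entries from the correction map could look like this. The entry layout follows the JSON schema shown in the prompt template below.

```python
def apply_corrections(features: list[dict], corrections: dict) -> None:
    """Apply {ENTITY_NAME: {"rename_to": ..., "include_in": ...}} corrections in place."""
    for feature in features:
        props = feature["properties"]
        entry = corrections.get(props.get("NAME"))
        if not entry:
            continue
        # rename_to is the canonical name, or null if unchanged.
        if entry.get("rename_to"):
            props["NAME"] = entry["rename_to"]
        # include_in is the parent entity, or null if independent in that year.
        props["PART_OF"] = entry.get("include_in")
```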
Runs a last consistency check after corrections and exports the final cleaned dataset to geojson/.
- 🚀 Ensures all geometries are valid and deduplicated
- ✂️ Cleans properties one more time
- 📦 Saves the definitive, ready-to-use dataset in `geojson/`
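A rough sketch of the geometry side of this pass, assuming shapely is used for validation and merging; this is an assumption for illustration, not necessarily the actual implementation.

```python
from collections import defaultdict
from shapely.geometry import shape, mapping
from shapely.ops import unary_union
from shapely.validation import make_valid

def merge_duplicates(features: list[dict]) -> list[dict]:
    """Group features by NAME, repair geometries, and merge duplicates into one feature."""
    groups = defaultdict(list)
    for feature in features:
        groups[feature["properties"]["NAME"]].append(feature)

    merged = []
    for name, group in groups.items():
        geoms = [make_valid(shape(f["geometry"])) for f in group if f.get("geometry")]
        union = unary_union(geoms)  # typically a Polygon or MultiPolygon
        merged.append({
            "type": "Feature",
            "properties": group[0]["properties"],
            "geometry": mapping(union),
        })
    return merged
```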
.
├── data_raw/ # Original GeoJSON files (input)
├── data_clean/ # Cleaned GeoJSON files (intermediate)
├── data_corrected/ # Corrected GeoJSON files (before final pass)
├── geojson/ # ✅ Final cleaned & corrected GeoJSON files
├── reports/ # Reports and intermediate results
├── scripts/ # Processing pipeline
│ ├── clean_geojson.py # Step 1: Data cleaning
│ ├── extract_names.py # Step 2: Name extraction
│ ├── generate_corrections.py # Step 3: AI correction generation
│ ├── apply_corrections.py # Step 4: Apply corrections
│ ├── final_clean.py # Step 5: Final clean & export
│ ├── constants.py
│ ├── geo_types.py
│ ├── reports.py
│ └── utils.py
└── README.md
If you want to rebuild the dataset from the raw data, follow the steps below.
- Clone this repository:
  `git clone https://github.com/Uspectacle/historical-basemaps-cleaned.git`
  `cd historical-basemaps-cleaned`
- Create the environment and install dependencies:
  `conda env create -f environment.yml`
  `conda activate hb_clean`
- Set up the Google API key (for AI corrections):
  `export GOOGLE_API_KEY="your_google_api_key_here"`
- Place the original raw GeoJSON files in `data_raw/`.
Run the complete pipeline in order:
# Step 1: Clean the raw data
python -m scripts.clean_geojson
# Step 2: Extract unique names for correction
python -m scripts.extract_names
# Step 3: Generate AI corrections (requires API key)
python -m scripts.generate_corrections
# Step 3.5: You may want to re-generate corrections for specific years
# python -m scripts.generate_corrections --years "1914,1939,1945"
# Step 4: Apply corrections
python -m scripts.apply_corrections
# Step 5: Final clean & export to geojson/
python -m scripts.final_clean

Example summary from the cleaning step (`reports/clean_report.txt`):

OVERALL SUMMARY
---------------
Total features before: 17523
Total features after: 9919
Total removed features without NAME: 7310
Total countries merged (had duplicates): 192
Total border precision fixed (set to 1): 4

Example summary from the correction step (`reports/corrections_report.txt`):

OVERALL SUMMARY
---------------
Total renamed features: 1685
Total PART_OF corrections: 1072
The prompt template used for correction generation:

You are an expert in historical geography.
For the year {year}, here are extracted values:
- NAME: {names}
- PART_OF: {part_of}
Your tasks:
1. Normalize NAMEs so variants map to a canonical form.
2. Normalize PART_OF values and ensure they match the historical context of {year}.
3. If a country was part of another in {year}, set 'include_in' accordingly.
If independent, set 'include_in' to null.
4. Provide a reliable source (Wikipedia or similar) and optionally some additional notes.
Respond ONLY with valid JSON, wrapped inside a fenced JSON code block like this:
```json
{{
"{year}": {{
"ENTITY_NAME": {{
"rename_to": "Canonical Name or null if unchanged",
"include_in": "Parent entity or null if independent",
"source": "URL",
"note": "Additional comments"
}}
}}
}}
```
The placeholders `{year}`, `{names}`, and `{part_of}` are filled in from `reports/names.json`.
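For completeness, here is a small sketch of how such a fenced JSON reply can be parsed back into the correction map; the function name is illustrative and this is not necessarily how the script does it.

```python
import json
import re

def parse_correction_reply(reply_text: str) -> dict:
    """Extract the fenced JSON block from the model reply and parse it."""
    match = re.search(r"```json\s*(\{.*\})\s*```", reply_text, re.DOTALL)
    if not match:
        raise ValueError("No fenced JSON block found in the model reply")
    return json.loads(match.group(1))
```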
Suggestions, issues, and pull requests are welcome! If you have expertise in history or historical geography, your feedback would be especially valuable.
You can either:
- Open an issue here
- Submit a pull request on this repository
- Or, if appropriate, propose a PR to the original historical-basemaps project
This project follows the same license as the original historical-basemaps dataset. GNU General Public License v3.0 – see LICENSE for details.