Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -144,6 +144,7 @@ data.zip
data/*
data1
data
data_raw/

data_mid/
data_out/
9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
@@ -1,4 +1,13 @@
fail_fast: true

exclude: |
  (?x)^(
      notebooks/process_sandbox\.ipynb|
      src/cerf/data_acquisition_scrape\.py|
      src/disaster_charter/data_acquisition_scrape\.py|
      src/glide/data_acquisition_scrape\.py
  )$

repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.8
24 changes: 24 additions & 0 deletions Makefile
@@ -24,6 +24,30 @@ clean:
	@rm -rf .venv
	@poetry env remove --all

run_gdacs_download:
	@echo "Running GDACS download"
	@poetry run python -m src.gdacs.data_acquisition_api

run_glide_download:
	@echo "Running Glide download"
	@poetry run python -m src.glide.data_acquisition_scrape

run_cerf_download:
	@echo "Running CERF download"
	@poetry run python -m src.cerf.data_acquisition_scrape

run_disaster_charter_download:
	@echo "Running Disaster-Charter download"
	@poetry run python -m src.disaster_charter.data_acquisition_scrape

run_idus_download:
	@echo "Downloading IDUS dump → data_raw/idmc_idu/idus_all.json"
	@mkdir -p data_raw/idmc_idu
	@curl -L --compressed \
		-o data_raw/idmc_idu/idus_all.json \
		"https://helix-copilot-prod-helix-media-external.s3.amazonaws.com/external-media/api-dump/idus-all/2025-06-04-10-00-32/5mndO/idus_all.json"
	@echo "✅ Saved (decompressed): data_raw/idmc_idu/idus_all.json"

run_glide_normal:
	@echo "Running Glide normalisation"
	@poetry run python -m src.glide.data_normalisation_glide
16 changes: 15 additions & 1 deletion docs/DATASETS.md
@@ -21,6 +21,21 @@ update any of the datasets:
**Important**: Always preserve the folder structure to avoid breaking downstream
processes.

## API-Based Dataset: GDACS

Unlike the other datasets, GDACS has an official API, provided directly by the GDACS team.

To update this dataset:

1. Modify the date range in the `main` method of the `src/gdacs/data_acquisition_api.py` script.
2. Run the following command:

```sh
make run_gdacs_download
```

This process will automatically download and save the updated records to the appropriate location.
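
If editing the script for every update becomes tedious, the date range could instead be passed on the command line. The following is only a minimal sketch of that idea, not the current behaviour of `src/gdacs/data_acquisition_api.py`: the function name `parse_date_range` and the `--start`/`--end` flags are hypothetical and do not exist in the repository.

```python
import argparse
from datetime import date


def parse_date_range(argv=None):
    """Parse hypothetical --start/--end flags (ISO dates) and check their order."""
    parser = argparse.ArgumentParser(
        description="Date range for a GDACS download (illustrative only)"
    )
    # date.fromisoformat raises ValueError on bad input, which argparse
    # reports as a normal usage error.
    parser.add_argument("--start", type=date.fromisoformat, required=True)
    parser.add_argument("--end", type=date.fromisoformat, default=date.today())
    args = parser.parse_args(argv)
    if args.start > args.end:
        parser.error("--start must not be after --end")
    return args.start, args.end


# Example call with an explicit argument list instead of sys.argv.
start, end = parse_date_range(["--start", "2024-01-01", "--end", "2024-03-31"])
print(f"Would download GDACS records from {start} to {end}")
```

The returned pair would then replace the hard-coded range inside `main`; how the dates are passed on to the GDACS API depends on the existing script and is not shown here.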

## Web-Scraped Legacy Datasets (Now Blocked or Fragile)

Some datasets were initially extracted using automated **web scraping scripts**.
@@ -31,7 +46,6 @@ category:
- CERF Activations
- Disaster Charter Activations
- GLIDE Events
- GDACS Events
- WFP ADAM

### How to Update These