diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 1652a8b..e5b0496 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -5,7 +5,8 @@ exclude: |
  notebooks/process_sandbox\.ipynb|
  src/cerf/data_acquisition_scrape\.py|
  src/disaster_charter/data_acquisition_scrape\.py|
- src/glide/data_acquisition_scrape\.py
+ src/glide/data_acquisition_scrape\.py|
+ docs/NOTEBOOK_DATASETS\.md
  )$
 
 repos:
diff --git a/docs/NOTEBOOK_DATASETS.md b/docs/NOTEBOOK_DATASETS.md
new file mode 100644
index 0000000..f9ac577
--- /dev/null
+++ b/docs/NOTEBOOK_DATASETS.md
@@ -0,0 +1,61 @@
+# Notebook Data Sources
+
+This document outlines the datasets consumed by the analysis notebooks. All datasets were downloaded in June 2025 and represent the latest available versions.
+
+## Dataset Sources
+
+- **GDACS** (API):
+  - Description: Global Disaster Alert and Coordination System events.
+  - Download Method: API
+  - Location: `data/gdacs/`
+  - Files: `gdacs_events_*.csv`
+
+- **IDMC IDUS** (API):
+  - Description: Internal Displacement Monitoring Centre IDUS dump.
+  - Download Method: API
+  - Location: `data/idmc_idu/`
+  - File: `idus_all.json`
+
+- **CERF** (Web Scrape):
+  - Description: CERF emergency data.
+  - Download Method: Web scrape
+  - Location: `data/cerf/`
+  - File: `cerf_emergency_data_dynamic_web_scrape.csv`
+
+- **Disaster Charter** (Web Scrape):
+  - Description: Charter activations.
+  - Download Method: Web scrape
+  - Location: `data/disaster-charter/`
+  - File: `charter_activations_web_scrape_2000_2025.csv`
+
+- **GLIDE** (Web Scrape):
+  - Description: GLIDE events.
+  - Download Method: Web scrape
+  - Location: `data/glide/`
+  - File: `glide_events.csv`
+
+- **EMDAT** (Manual Download):
+  - Description: EM-DAT custom request.
+  - Download Method: Manual download
+  - Location: `data/emdat/`
+  - File: `public_emdat_custom_request_2025-06-04_c1e3334f-e027-4f8a-92d5-7ce401c7654c.xlsx`
+
+- **IFRC EME** (Manual Download):
+  - Description: IFRC emergencies.
+  - Download Method: Manual download
+  - Location: `data/ifrc_dref/`
+  - File: `IFRC_emergencies.csv`
+
+## Usage in Notebooks
+
+The notebooks (`exploration.ipynb`, `process_sandbox.ipynb`) read these files directly from the `data/` directory to perform exploratory analysis and data processing.
+
+## Notebook Analysis Logic
+
+The `process_sandbox.ipynb` notebook performs the following steps:
+
+- **Data Preprocessing**: Loads each dataset, renames and cleans columns, standardizes event dates, and maps event types to a common taxonomy.
+- **Event Matching**: Combines all sources into a single `disaster_df` by merging events based on date proximity, event type, and country, creating boolean flags for each source.
+- **Summary Table ("Number of events per source")**: Shows the total number of events originally in each dataset and the number of unique valid events after preprocessing.
+- **Matching Events Table**: Compares, for each source, how many events are exclusive (only appear in that source) versus matched across multiple sources, and displays match percentages.
+- **Overlap Matrix & Chord Diagram**: Constructs a matrix counting the number of shared events for every pair of sources and visualizes these overlaps as a circular chord diagram using `pycirclize`.
diff --git a/notebooks/process_sandbox.ipynb b/notebooks/process_sandbox.ipynb
index a995f31..120cab2 100644
--- a/notebooks/process_sandbox.ipynb
+++ b/notebooks/process_sandbox.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 47,
+   "execution_count": 24,
    "id": "7fd4519c-da46-4871-a4fc-7a3901a72ada",
    "metadata": {},
    "outputs": [],
@@ -12,8 +12,6 @@
     "\n",
     "import numpy as np\n",
     "import pandas as pd\n",
-    "\n",
-    "## pycirclize package is only necessary for the last plot. Installed it with pip instead of conda on my machine\n",
     "from pycirclize import Circos\n",
     "\n"
    ]
@@ -30,7 +28,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 48,
+   "execution_count": 25,
    "id": "838deefe-151a-4450-8136-f638f6e59c54",
    "metadata": {},
    "outputs": [],
@@ -48,7 +46,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 49,
+   "execution_count": 26,
    "id": "2879a7f7-2a67-4cb2-b210-2f567c79fa40",
    "metadata": {},
    "outputs": [],
@@ -71,20 +69,11 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 50,
+   "execution_count": 27,
    "id": "13c3271a-9d45-4d6a-832f-55e0453254f4",
    "metadata": {},
    "outputs": [],
    "source": [
-    "\n",
-    "# This dictionary is just a placeholder for a preliminary event taxonomy. Has to be replaced\n",
-    "# by a proper function with less hard-coding. Event type who are not on this dictionary\n",
-    "# are not converted on the current version.\n",
-    "\n",
-    "# The following events are covered as output:\n",
-    "# Drought (DR) / Floods (FL) / Epidemics (EP) / Earthquake (EQ)\n",
-    "# Volcano (VO) / Landslide (LS) / Storm + Tropical Cyclone (TC/ST)\n",
-    "# Fires (WF) / Heat + Cold Waves (HW / CW) / Displacement (-)\n",
     "\n",
     "event_dict = {\n",
     " \"Drought\":\"DR\",\n",
@@ -151,7 +140,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 51,
+   "execution_count": 28,
    "id": "1f56bb68-7e47-46bd-a892-e2db7f20e668",
    "metadata": {},
    "outputs": [],
@@ -163,7 +152,6 @@
     "input_gdacs_df = pd.DataFrame()\n",
     "for file_year in range(2000,2025):\n",
     " filename = \"../data/gdacs/gdacs_events_\" + str(file_year) +\".csv\"\n",
-    " #filename = '../data_raw/gdacs/gdacs_events_' + str(file_year) +'.csv'\n",
     " new_df = pd.read_csv(filename)\n",
     "\n",
     " input_gdacs_df = pd.concat([input_gdacs_df, new_df])\n",
@@ -195,7 +183,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 52,
+   "execution_count": 29,
    "id": "fe157c75-b24e-4c5c-ae2a-cf7638162c1d",
    "metadata": {},
    "outputs": [
@@ -203,7 +191,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -217,7 +205,6 @@
     "#Load data\n",
     "date_col = \"event_date\"\n",
     "filename = \"../data/glide/glide_events.csv\"\n",
-    "#filename = '../data_raw/glide/glide_events.csv'\n",
     "input_glide_df = pd.read_csv(filename)\n",
     "\n",
     "#Rename key columns\n",
@@ -238,7 +225,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 53,
+   "execution_count": 30,
    "id": "1495b634-8a88-454d-8b30-2efbad9da4ab",
    "metadata": {},
    "outputs": [
@@ -246,7 +233,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -259,7 +246,6 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/cerf/cerf_emergency_data_dynamic_web_scrape.csv\"\n",
-    "#filename = '../data_raw/cerf/cerf_emergency_data_dynamic_web_scrape.csv'\n",
     "input_cerf_df = pd.read_csv(filename)\n",
     "date_col = \"approval_date\"\n",
     "\n",
@@ -281,7 +267,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 54,
+   "execution_count": 31,
    "id": "a9bf0407-3d1c-4c3e-bf04-8f9b0069d99b",
    "metadata": {},
    "outputs": [
@@ -289,7 +275,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -302,7 +288,6 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/disaster-charter/charter_activations_web_scrape_2000_2025.csv\"\n",
-    "#filename = '../data_raw/disaster-charter/charter_activations_web_scrape_2000_2025.csv'\n",
     "input_charter_df = pd.read_csv(filename)\n",
     "date_col = \"date\"\n",
     "\n",
@@ -332,7 +317,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 55,
+   "execution_count": 32,
    "id": "2355c4c3-4bce-49ef-a7b6-84ec89f0cc81",
    "metadata": {},
    "outputs": [
@@ -340,7 +325,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -353,7 +338,6 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/emdat/public_emdat_custom_request_2025-06-04_c1e3334f-e027-4f8a-92d5-7ce401c7654c.xlsx\"\n",
-    "#filename = '../data_raw/emdat/public_emdat_custom_request_2025-06-04_c1e3334f-e027-4f8a-92d5-7ce401c7654c.xlsx'\n",
     "input_emdat_df = pd.read_excel(filename)\n",
     "date_col = \"start_date\"\n",
     "\n",
@@ -381,7 +365,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 56,
+   "execution_count": 33,
    "id": "88cadbb1-1716-408a-86c0-1546f8d8cd22",
    "metadata": {},
    "outputs": [
@@ -389,7 +373,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -402,12 +386,10 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/idmc_idu/idus_all.json\"\n",
-    "#filename = '../data_raw/idmc_idu/idus_all.json'\n",
     "input_idmc_df = pd.read_json(filename)\n",
     "date_col = \"event_start_date\"\n",
     "\n",
     "df = input_idmc_df.copy()\n",
-    "#df = df[df['displacement_type'] == 'Disaster'] # we may want to filter out events who are not disaster\n",
     "\n",
     "#Rename key columns\n",
     "col_list = [\"id\",\"type\",\"iso3\",\"country\",\"event_start_date\"]\n",
@@ -427,7 +409,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 57,
+   "execution_count": 34,
    "id": "89887462-b053-4769-84e0-9e7f680bf19f",
    "metadata": {},
    "outputs": [
@@ -435,7 +417,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -448,7 +430,6 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/ifrc_dref/IFRC_emergencies.csv\"\n",
-    "#filename = '../data_raw/ifrc_dref/IFRC_emergencies.csv'\n",
     "input_dref_df = pd.read_csv(filename, on_bad_lines=\"skip\")\n",
     "date_col = \"disaster_start_date\"\n",
     "df = input_dref_df.copy()\n",
@@ -484,7 +465,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 58,
+   "execution_count": 35,
    "id": "757c5a8d-f9b0-4045-820b-58a0ff533e35",
    "metadata": {},
    "outputs": [],
@@ -507,7 +488,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 59,
+   "execution_count": 36,
    "id": "63f0d274-f799-499c-a0a9-668e055d9f9f",
    "metadata": {},
    "outputs": [],
@@ -562,7 +543,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 60,
+   "execution_count": 37,
    "id": "497d8a4a-a991-4498-8ffb-ac87ae3ef5f5",
    "metadata": {},
    "outputs": [],
@@ -608,7 +589,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 61,
+   "execution_count": 38,
    "id": "cb4a66c2-3f1c-47dc-9c17-5d2d9150fb09",
    "metadata": {},
    "outputs": [],
@@ -622,26 +603,6 @@
     "disaster_df = add_new_source(disaster_df, dref_df, \"dref\")\n"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "93e177c7-14de-4a60-a355-d2dea4df8c3e",
-   "metadata": {},
-   "source": []
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6d698df3-2e11-4c41-abc0-01638a6ea280",
-   "metadata": {},
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "62d10bcf-3a0b-492a-a3da-b8fd2e2195b4",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
   {
    "cell_type": "markdown",
    "id": "23d981b1-7e84-4383-ba63-5a9e60948860",
@@ -657,7 +618,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 62,
+   "execution_count": 39,
    "id": "902a85d2-4d59-4a65-9289-51117cc0dd15",
    "metadata": {},
    "outputs": [],
@@ -672,15 +633,12 @@
     "analysis_df[\"idmc\"] = analysis_df[\"idmc_id\"].notna()\n",
     "analysis_df[\"dref\"] = analysis_df[\"dref_id\"].notna()\n",
     "\n",
-    "analysis_df[\"nb_sources\"] = analysis_df[[\"gdacs\",\"glide\",\"cerf\",\"charter\",\"emdat\",\"idmc\", \"dref\"]].sum(axis = 1)\n",
-    "\n",
-    "#analysis_df.loc[analysis_df['nb_sources'] > 1, ['nb_sources','gdacs','glide','cerf','charter','emdat','idmc', 'dref']].value_counts().reset_index().head(10)\n",
-    "#analysis_df['nb_sources'].value_counts()\n"
+    "analysis_df[\"nb_sources\"] = analysis_df[[\"gdacs\",\"glide\",\"cerf\",\"charter\",\"emdat\",\"idmc\", \"dref\"]].sum(axis = 1)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 63,
+   "execution_count": 40,
    "id": "5c94a4c3-ca89-4e1b-9376-681b64a55d24",
    "metadata": {},
    "outputs": [
@@ -690,7 +648,7 @@
        "Text(0.5, 1.0, 'Number of events per source')"
       ]
      },
-     "execution_count": 63,
+     "execution_count": 40,
      "metadata": {},
      "output_type": "execute_result"
     },
@@ -738,28 +696,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 64,
-   "id": "38b0d03c-894c-46f9-83fe-0ec10de76b10",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "\n",
-    "#df = input_gdacs_df['event_type'].copy()\n",
-    "#df = input_idmc_df['type'].copy()\n",
-    "#df = input_cerf_df['emergency_type'].copy()\n",
-    "#df = input_emdat_df['Disaster Type'].copy()\n",
-    "#df = input_glide_df['Event_Code'].copy()\n",
-    "#df = input_dref_df['dtype.name'].copy()\n",
-    "#df = input_charter_df['Type of Event'].copy()\n",
-    "\n",
-    "#df.fillna('', inplace = True)\n",
-    "#print(df.value_counts().shape[0])\n",
-    "#df.value_counts(True).head(20)\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 65,
+   "execution_count": 41,
    "id": "1e27b2fc-3563-4eb4-b17e-94c2bbfdfb12",
    "metadata": {},
    "outputs": [],
@@ -788,7 +725,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 66,
+   "execution_count": 42,
    "id": "2fcb7241-8b08-495b-81e4-35bf06685cf3",
    "metadata": {},
    "outputs": [
@@ -798,7 +735,7 @@
        "Text(0.5, 1.0, 'Matching events')"
       ]
      },
-     "execution_count": 66,
+     "execution_count": 42,
      "metadata": {},
      "output_type": "execute_result"
     },
@@ -822,7 +759,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 67,
+   "execution_count": 43,
    "id": "0b38c9b8-7320-459c-bded-eaafd679e1fb",
    "metadata": {},
    "outputs": [],
@@ -851,7 +788,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 68,
+   "execution_count": 44,
    "id": "28dbe9cc-4162-4970-933b-ced26899c489",
    "metadata": {},
    "outputs": [
@@ -906,73 +843,6 @@
     "fig = circos.plotfig()\n",
     "\n"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "47802143-877e-47fd-a7bc-933817b98bfc",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c677540c-0bf5-48c8-9bd1-e85f542b5a89",
-   "metadata": {
-    "jp-MarkdownHeadingCollapsed": true
-   },
-   "source": [
-    "## Notes"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6f30423f-3d06-47f3-a29e-69f3c5f3f58f",
-   "metadata": {},
-   "source": [
-    "Notes\n",
-    "\n",
-    "\n",
-    "- Miscellaneous:\n",
-    "\n",
-    " - No unique ID for Charter source as activation id is sometimes NULL.\n",
-    " - GDACS seems to have lots of duplicates. Some cleanup is needed\n",
-    " - Should we drop conflict events? What about epidemics? What about displacement without a identified cause?\n",
-    " - Some sources have a glide_number id (but very often empty). Add this to the matching\n",
-    " - GDACS matching with other sources is highly dependent on the alert level (72% for red, 61% for orange, 19% for green)\n",
-    "\n",
-    "\n",
-    "####################################\n",
-    "\n",
-    "To do:\n",
-    "\n",
-    "- Create a mapping for GLIDE event type categories\n",
-    "\n",
-    "- Evaluate CERF event categories, specially if we can have more info on displacement activations\n",
-    "\n",
-    "- Approval date (CERF) may be a few days later than the event date. For now I've used 14 days as a threshold but more investigation is needed\n",
-    " (maybe depending on event type)\n",
-    "\n",
-    "- Map CERF and Charter country names to iso_code\n",
-    "\n",
-    "\n",
-    "- Some sources have multiple event types => think about how to add multi-event index\n",
-    "\n",
-    "- Think about how to handle event with only month and year data (no day)\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f585e090",
-   "metadata": {},
-   "source": []
-  },
-  {
-   "cell_type": "markdown",
-   "id": "348f4e39",
-   "metadata": {},
-   "source": []
   }
  ],
  "metadata": {