diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 1652a8b..e5b0496 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -5,7 +5,8 @@ exclude: |
       notebooks/process_sandbox\.ipynb|
       src/cerf/data_acquisition_scrape\.py|
       src/disaster_charter/data_acquisition_scrape\.py|
-      src/glide/data_acquisition_scrape\.py
+      src/glide/data_acquisition_scrape\.py|
+      docs/NOTEBOOK_DATASETS\.md
   )$
 
 repos:
diff --git a/docs/NOTEBOOK_DATASETS.md b/docs/NOTEBOOK_DATASETS.md
new file mode 100644
index 0000000..f9ac577
--- /dev/null
+++ b/docs/NOTEBOOK_DATASETS.md
@@ -0,0 +1,61 @@
+# Notebook Data Sources
+
+This document outlines the datasets consumed by the analysis notebooks. All datasets were downloaded in June 2025 and were the latest versions available at that time.
+
+## Dataset Sources
+
+- **GDACS** (API):
+  - Description: Global Disaster Alert and Coordination System events.
+  - Download Method: API
+  - Location: `data/gdacs/`
+  - Files: `gdacs_events_<year>.csv`, one file per year (`process_sandbox.ipynb` loads 2000 through 2024)
+
+- **IDMC IDUS** (API):
+  - Description: Full dump of Internal Displacement Updates (IDUs) from the Internal Displacement Monitoring Centre.
+  - Download Method: API
+  - Location: `data/idmc_idu/`
+  - File: `idus_all.json`
+
+- **CERF** (Web Scrape):
+  - Description: Emergency data from the Central Emergency Response Fund (CERF).
+  - Download Method: Web scrape
+  - Location: `data/cerf/`
+  - File: `cerf_emergency_data_dynamic_web_scrape.csv`
+
+- **Disaster Charter** (Web Scrape):
+  - Description: Activations of the International Charter "Space and Major Disasters".
+  - Download Method: Web scrape
+  - Location: `data/disaster-charter/`
+  - File: `charter_activations_web_scrape_2000_2025.csv`
+
+- **GLIDE** (Web Scrape):
+  - Description: Disaster events identified by GLIDE (GLobal IDEntifier) numbers.
+  - Download Method: Web scrape
+  - Location: `data/glide/`
+  - File: `glide_events.csv`
+
+- **EM-DAT** (Manual Download):
+  - Description: EM-DAT custom request.
+  - Download Method: Manual download
+  - Location: `data/emdat/`
+  - File: `public_emdat_custom_request_2025-06-04_c1e3334f-e027-4f8a-92d5-7ce401c7654c.xlsx`
+
+- **IFRC EME** (Manual Download):
+  - Description: IFRC emergencies.
+  - Download Method: Manual download
+  - Location: `data/ifrc_dref/`
+  - File: `IFRC_emergencies.csv`
+
+## Usage in Notebooks
+
+The notebooks (`exploration.ipynb`, `process_sandbox.ipynb`) read these files directly from the repository's `data/` directory for exploratory analysis and data processing (`process_sandbox.ipynb` references it as `../data/`).
+
+## Notebook Analysis Logic
+
+The `process_sandbox.ipynb` notebook performs the following steps:
+
+- **Data Preprocessing**: Loads each dataset, renames and cleans columns, standardizes event dates, and maps event types to a common taxonomy.
+- **Event Matching**: Combines all sources into a single `disaster_df` by merging events based on date proximity, event type, and country, creating boolean flags for each source.
+- **Summary Table ("Number of events per source")**: Shows the total number of events originally in each dataset and the number of unique valid events after preprocessing.
+- **Matching Events Table**: Compares, for each source, how many events are exclusive (only appear in that source) versus matched across multiple sources, and displays match percentages.
+- **Overlap Matrix & Chord Diagram**: Constructs a matrix counting the number of shared events for every pair of sources and visualizes these overlaps as a circular chord diagram using `pycirclize`.
diff --git a/notebooks/process_sandbox.ipynb b/notebooks/process_sandbox.ipynb
index a995f31..120cab2 100644
--- a/notebooks/process_sandbox.ipynb
+++ b/notebooks/process_sandbox.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 47,
+   "execution_count": 24,
    "id": "7fd4519c-da46-4871-a4fc-7a3901a72ada",
    "metadata": {},
    "outputs": [],
@@ -12,8 +12,6 @@
     "\n",
     "import numpy as np\n",
     "import pandas as pd\n",
-    "\n",
-    "## pycirclize package is only necessary for the last plot. Installed it with pip instead of conda on my machine\n",
     "from pycirclize import Circos\n",
     "\n"
    ]
@@ -30,7 +28,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 48,
+   "execution_count": 25,
    "id": "838deefe-151a-4450-8136-f638f6e59c54",
    "metadata": {},
    "outputs": [],
@@ -48,7 +46,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 49,
+   "execution_count": 26,
    "id": "2879a7f7-2a67-4cb2-b210-2f567c79fa40",
    "metadata": {},
    "outputs": [],
@@ -71,20 +69,11 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 50,
+   "execution_count": 27,
    "id": "13c3271a-9d45-4d6a-832f-55e0453254f4",
    "metadata": {},
    "outputs": [],
    "source": [
-    "\n",
-    "# This dictionary is just a placeholder for a preliminary event taxonomy. Has to be replaced\n",
-    "# by a proper function with less hard-coding. Event type who are not on this dictionary\n",
-    "# are not converted on the current version.\n",
-    "\n",
-    "# The following events are covered as output:\n",
-    "# Drought (DR) / Floods (FL) / Epidemics (EP) / Earthquake (EQ)\n",
-    "# Volcano (VO) / Landslide (LS) / Storm + Tropical Cyclone (TC/ST)\n",
-    "# Fires (WF) / Heat + Cold Waves (HW / CW) / Displacement (-)\n",
     "\n",
     "event_dict = {\n",
     "    \"Drought\":\"DR\",\n",
@@ -151,7 +140,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 51,
+   "execution_count": 28,
    "id": "1f56bb68-7e47-46bd-a892-e2db7f20e668",
    "metadata": {},
    "outputs": [],
@@ -163,7 +152,6 @@
     "input_gdacs_df = pd.DataFrame()\n",
     "for file_year in range(2000,2025):\n",
     "    filename = \"../data/gdacs/gdacs_events_\" + str(file_year) +\".csv\"\n",
-    "    #filename = '../data_raw/gdacs/gdacs_events_' + str(file_year) +'.csv'\n",
     "    new_df = pd.read_csv(filename)\n",
     "\n",
     "    input_gdacs_df = pd.concat([input_gdacs_df, new_df])\n",
@@ -195,7 +183,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 52,
+   "execution_count": 29,
    "id": "fe157c75-b24e-4c5c-ae2a-cf7638162c1d",
    "metadata": {},
    "outputs": [
@@ -203,7 +191,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -217,7 +205,6 @@
     "#Load data\n",
     "date_col = \"event_date\"\n",
     "filename = \"../data/glide/glide_events.csv\"\n",
-    "#filename = '../data_raw/glide/glide_events.csv'\n",
     "input_glide_df = pd.read_csv(filename)\n",
     "\n",
     "#Rename key columns\n",
@@ -238,7 +225,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 53,
+   "execution_count": 30,
    "id": "1495b634-8a88-454d-8b30-2efbad9da4ab",
    "metadata": {},
    "outputs": [
@@ -246,7 +233,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -259,7 +246,6 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/cerf/cerf_emergency_data_dynamic_web_scrape.csv\"\n",
-    "#filename = '../data_raw/cerf/cerf_emergency_data_dynamic_web_scrape.csv'\n",
     "input_cerf_df = pd.read_csv(filename)\n",
     "date_col = \"approval_date\"\n",
     "\n",
@@ -281,7 +267,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 54,
+   "execution_count": 31,
    "id": "a9bf0407-3d1c-4c3e-bf04-8f9b0069d99b",
    "metadata": {},
    "outputs": [
@@ -289,7 +275,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -302,7 +288,6 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/disaster-charter/charter_activations_web_scrape_2000_2025.csv\"\n",
-    "#filename = '../data_raw/disaster-charter/charter_activations_web_scrape_2000_2025.csv'\n",
     "input_charter_df = pd.read_csv(filename)\n",
     "date_col = \"date\"\n",
     "\n",
@@ -332,7 +317,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 55,
+   "execution_count": 32,
    "id": "2355c4c3-4bce-49ef-a7b6-84ec89f0cc81",
    "metadata": {},
    "outputs": [
@@ -340,7 +325,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -353,7 +338,6 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/emdat/public_emdat_custom_request_2025-06-04_c1e3334f-e027-4f8a-92d5-7ce401c7654c.xlsx\"\n",
-    "#filename = '../data_raw/emdat/public_emdat_custom_request_2025-06-04_c1e3334f-e027-4f8a-92d5-7ce401c7654c.xlsx'\n",
     "input_emdat_df = pd.read_excel(filename)\n",
     "date_col = \"start_date\"\n",
     "\n",
@@ -381,7 +365,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 56,
+   "execution_count": 33,
    "id": "88cadbb1-1716-408a-86c0-1546f8d8cd22",
    "metadata": {},
    "outputs": [
@@ -389,7 +373,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -402,12 +386,10 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/idmc_idu/idus_all.json\"\n",
-    "#filename = '../data_raw/idmc_idu/idus_all.json'\n",
     "input_idmc_df = pd.read_json(filename)\n",
     "date_col = \"event_start_date\"\n",
     "\n",
     "df = input_idmc_df.copy()\n",
-    "#df = df[df['displacement_type'] == 'Disaster'] # we may want to filter out events who are not disaster\n",
     "\n",
     "#Rename key columns\n",
     "col_list =  [\"id\",\"type\",\"iso3\",\"country\",\"event_start_date\"]\n",
@@ -427,7 +409,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 57,
+   "execution_count": 34,
    "id": "89887462-b053-4769-84e0-9e7f680bf19f",
    "metadata": {},
    "outputs": [
@@ -435,7 +417,7 @@
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "/tmp/ipykernel_785094/621467075.py:13: SettingWithCopyWarning: \n",
+      "/tmp/ipykernel_132203/2704803438.py:11: SettingWithCopyWarning: \n",
       "A value is trying to be set on a copy of a slice from a DataFrame\n",
       "\n",
       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
@@ -448,7 +430,6 @@
     "\n",
     "#Load data\n",
     "filename = \"../data/ifrc_dref/IFRC_emergencies.csv\"\n",
-    "#filename = '../data_raw/ifrc_dref/IFRC_emergencies.csv'\n",
     "input_dref_df = pd.read_csv(filename, on_bad_lines=\"skip\")\n",
     "date_col = \"disaster_start_date\"\n",
     "df = input_dref_df.copy()\n",
@@ -484,7 +465,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 58,
+   "execution_count": 35,
    "id": "757c5a8d-f9b0-4045-820b-58a0ff533e35",
    "metadata": {},
    "outputs": [],
@@ -507,7 +488,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 59,
+   "execution_count": 36,
    "id": "63f0d274-f799-499c-a0a9-668e055d9f9f",
    "metadata": {},
    "outputs": [],
@@ -562,7 +543,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 60,
+   "execution_count": 37,
    "id": "497d8a4a-a991-4498-8ffb-ac87ae3ef5f5",
    "metadata": {},
    "outputs": [],
@@ -608,7 +589,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 61,
+   "execution_count": 38,
    "id": "cb4a66c2-3f1c-47dc-9c17-5d2d9150fb09",
    "metadata": {},
    "outputs": [],
@@ -622,26 +603,6 @@
     "disaster_df = add_new_source(disaster_df, dref_df, \"dref\")\n"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "93e177c7-14de-4a60-a355-d2dea4df8c3e",
-   "metadata": {},
-   "source": []
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6d698df3-2e11-4c41-abc0-01638a6ea280",
-   "metadata": {},
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "62d10bcf-3a0b-492a-a3da-b8fd2e2195b4",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
   {
    "cell_type": "markdown",
    "id": "23d981b1-7e84-4383-ba63-5a9e60948860",
@@ -657,7 +618,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 62,
+   "execution_count": 39,
    "id": "902a85d2-4d59-4a65-9289-51117cc0dd15",
    "metadata": {},
    "outputs": [],
@@ -672,15 +633,12 @@
     "analysis_df[\"idmc\"] = analysis_df[\"idmc_id\"].notna()\n",
     "analysis_df[\"dref\"] = analysis_df[\"dref_id\"].notna()\n",
     "\n",
-    "analysis_df[\"nb_sources\"] = analysis_df[[\"gdacs\",\"glide\",\"cerf\",\"charter\",\"emdat\",\"idmc\", \"dref\"]].sum(axis = 1)\n",
-    "\n",
-    "#analysis_df.loc[analysis_df['nb_sources'] > 1, ['nb_sources','gdacs','glide','cerf','charter','emdat','idmc', 'dref']].value_counts().reset_index().head(10)\n",
-    "#analysis_df['nb_sources'].value_counts()\n"
+    "analysis_df[\"nb_sources\"] = analysis_df[[\"gdacs\",\"glide\",\"cerf\",\"charter\",\"emdat\",\"idmc\", \"dref\"]].sum(axis = 1)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 63,
+   "execution_count": 40,
    "id": "5c94a4c3-ca89-4e1b-9376-681b64a55d24",
    "metadata": {},
    "outputs": [
@@ -690,7 +648,7 @@
        "Text(0.5, 1.0, 'Number of events per source')"
       ]
      },
-     "execution_count": 63,
+     "execution_count": 40,
      "metadata": {},
      "output_type": "execute_result"
     },
@@ -738,28 +696,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 64,
-   "id": "38b0d03c-894c-46f9-83fe-0ec10de76b10",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "\n",
-    "#df = input_gdacs_df['event_type'].copy()\n",
-    "#df = input_idmc_df['type'].copy()\n",
-    "#df = input_cerf_df['emergency_type'].copy()\n",
-    "#df = input_emdat_df['Disaster Type'].copy()\n",
-    "#df = input_glide_df['Event_Code'].copy()\n",
-    "#df = input_dref_df['dtype.name'].copy()\n",
-    "#df = input_charter_df['Type of Event'].copy()\n",
-    "\n",
-    "#df.fillna('', inplace = True)\n",
-    "#print(df.value_counts().shape[0])\n",
-    "#df.value_counts(True).head(20)\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 65,
+   "execution_count": 41,
    "id": "1e27b2fc-3563-4eb4-b17e-94c2bbfdfb12",
    "metadata": {},
    "outputs": [],
@@ -788,7 +725,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 66,
+   "execution_count": 42,
    "id": "2fcb7241-8b08-495b-81e4-35bf06685cf3",
    "metadata": {},
    "outputs": [
@@ -798,7 +735,7 @@
        "Text(0.5, 1.0, 'Matching events')"
       ]
      },
-     "execution_count": 66,
+     "execution_count": 42,
      "metadata": {},
      "output_type": "execute_result"
     },
@@ -822,7 +759,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 67,
+   "execution_count": 43,
    "id": "0b38c9b8-7320-459c-bded-eaafd679e1fb",
    "metadata": {},
    "outputs": [],
@@ -851,7 +788,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 68,
+   "execution_count": 44,
    "id": "28dbe9cc-4162-4970-933b-ced26899c489",
    "metadata": {},
    "outputs": [
@@ -906,73 +843,6 @@
     "fig = circos.plotfig()\n",
     "\n"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "47802143-877e-47fd-a7bc-933817b98bfc",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c677540c-0bf5-48c8-9bd1-e85f542b5a89",
-   "metadata": {
-    "jp-MarkdownHeadingCollapsed": true
-   },
-   "source": [
-    "## Notes"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6f30423f-3d06-47f3-a29e-69f3c5f3f58f",
-   "metadata": {},
-   "source": [
-    "Notes\n",
-    "\n",
-    "\n",
-    "- Miscellaneous:\n",
-    "\n",
-    "  - No unique ID for Charter source as activation id is sometimes NULL.\n",
-    "  - GDACS seems to have lots of duplicates. Some cleanup is needed\n",
-    "  - Should we drop conflict events? What about epidemics? What about displacement without a identified cause?\n",
-    "  - Some sources have a glide_number id (but very often empty). Add this to the matching\n",
-    "  - GDACS matching with other sources is highly dependent on the alert level (72% for red, 61% for orange, 19% for green)\n",
-    "\n",
-    "\n",
-    "####################################\n",
-    "\n",
-    "To do:\n",
-    "\n",
-    "- Create a mapping for GLIDE event type categories\n",
-    "\n",
-    "- Evaluate CERF event categories, specially if we can have more info on displacement activations\n",
-    "\n",
-    "- Approval date (CERF) may be a few days later than the event date. For now I've used 14 days as a threshold but more investigation is needed\n",
-    " (maybe depending on event type)\n",
-    "\n",
-    "- Map CERF and Charter country names to iso_code\n",
-    "\n",
-    "\n",
-    "- Some sources have multiple event types => think about how to add multi-event index\n",
-    "\n",
-    "- Think about how to handle event with only month and year data (no day)\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f585e090",
-   "metadata": {},
-   "source": []
-  },
-  {
-   "cell_type": "markdown",
-   "id": "348f4e39",
-   "metadata": {},
-   "source": []
   }
  ],
  "metadata": {