
mkorrand/ML2026_Orrand

GeOceanographers - ML-Ready Training Data Pipeline

This folder contains the complete workflow for preparing oceanographic satellite and buoy data for machine learning model training. The project combines NOAA satellite sea surface temperature (SST) data with buoy-measured water chemistry (pCO2) observations across 7 coastal monitoring locations.

Project Overview

  • Goal: Generate cleaned, ML-ready training datasets using measured (non-interpolated) continuous buoy data periods
  • Data Sources:
    • Satellite SST: JPL MUR (0.042° resolution, ~4.6 km)
    • Buoys: NOAA water chemistry (7 locations, 2013-2025)
  • Key Output: Training tables of matched satellite/buoy observations within a 4 km spatial grid around each buoy

Directory Structure

GeOceanProject/
├── README.md                          # This file
├── requirements.txt                   # Python package dependencies
├── notebooks/                         # Jupyter notebooks organized by workflow stage
│   ├── 01_data_exploration/          # Initial data inspection and diagnostics
│   │   ├── explore_data.ipynb        # General data exploration
│   │   ├── NOAA_buoy_data.ipynb      # Buoy data inspection
│   │   └── API_attempt_SST_ERDDAP.ipynb  # Satellite data retrieval & quality check
│   ├── 02_data_preparation/          # Data cleaning and standardization
│   │   └── ML_DataPrep_SST_pCO2.ipynb    # Cleaning, merging, scaling
│   └── 03_model_training/            # ML model training datasets
│       └── ML_Training_Continuous_Data.ipynb  # Final training data generation
├── data/                              # All data files (CSV files excluded via .gitignore)
│   ├── raw/                          # Original source data (do not modify)
│   │   ├── buoy_sources/            # Individual NOAA buoy CSV files
│   │   │   ├── noaa_water_chem_data/    # Original NOAA files
│   │   │   └── (other buoy sources)
│   │   └── satellite_sources/       # ERDDAP satellite query results
│   │       ├── (6 location SST files)
│   │       └── satellite_sst_daily.csv
│   ├── processed/                    # Cleaned & standardized files
│   │   ├── buoy_data_cleaned.csv            # All buoys, all dates (master)
│   │   ├── satellite_sst_cleaned.csv       # All locations, all dates (master)
│   │   ├── buoy_continuous_data_periods.csv # Analysis output - data windows
│   │   └── combined_satellite_buoy.csv      # Intermediate merged file
│   └── training/                     # ML-ready outputs
│       ├── ml_data_standardized.csv         # Scaled features (mean=0, std=1)
│       ├── ml_data_minmax_scaled.csv        # Scaled features (0-1 range)
│       ├── ml_training_continuous_data_*.csv  # Final training tables
│       └── test_datasets/            # Test/validation data subsets
├── plots/                             # Visualization outputs
│   ├── exploration/                  # Data exploration plots
│   │   ├── (timeseries plots: 01-07)
│   │   └── (distribution plots: 08-14)
│   └── analysis/                     # Summary analysis plots
│       ├── 15_all_locations_comparison.png
│       ├── 16_geographic_sorted.png
│       └── 17_buoy_regional_context.png
└── docs/                              # Documentation
    ├── NOAA_BUOY_DATA_README.md      # Detailed data source documentation
    └── (other project documentation)

Workflow Overview

Phase 1: Data Exploration (01_data_exploration/)

  • Load NOAA satellite and buoy data
  • Inspect data availability and quality across locations
  • Identify continuous measurement windows
  • Generate exploratory plots
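
The "continuous measurement windows" step can be sketched as a gap-based grouping of observation dates. This is a minimal illustration, not the notebooks' actual code: the function name `find_continuous_windows` and the `max_gap_days` threshold are assumptions, and the dates are made up.

```python
from datetime import date, timedelta


def find_continuous_windows(dates, max_gap_days=1):
    """Group observation dates into runs with no gap longer than max_gap_days.

    Returns a list of (start, end, n_observations) tuples.
    max_gap_days is a hypothetical threshold; the project may define
    "continuous" differently.
    """
    ordered = sorted(set(dates))
    if not ordered:
        return []

    windows = []
    start = prev = ordered[0]
    count = 1
    for current in ordered[1:]:
        if (current - prev) > timedelta(days=max_gap_days):
            # Gap too large: close the current window and start a new one.
            windows.append((start, prev, count))
            start = current
            count = 0
        count += 1
        prev = current
    windows.append((start, prev, count))
    return windows


# Illustrative dates only (not real buoy data): two runs separated by a gap.
sample_dates = [date(2020, 1, d) for d in (1, 2, 3, 10, 11)]
windows = find_continuous_windows(sample_dates)
for start, end, n in windows:
    print(start, end, n)
```

Raising `max_gap_days` merges windows separated by short outages, which trades strict continuity for longer training periods.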

Phase 2: Data Preparation (02_data_preparation/)

  • Clean and standardize data (treat -999 sentinel values as nulls, normalize date formats)
  • Create master files combining all locations
  • Scale features (StandardScaler and MinMaxScaler options)
  • Generate quality reports
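
As a rough sketch of the cleaning and scaling steps above, the following uses only the Python standard library so that the arithmetic is visible. The notebooks presumably use pandas and scikit-learn instead. The helper names and the sample values are assumptions for illustration.

```python
from statistics import fmean, pstdev

SENTINEL = -999.0  # assumed missing-value marker, per the "-999" note above


def replace_sentinel(values, sentinel=SENTINEL):
    """Return a copy of values with the sentinel replaced by None."""
    return [None if v == sentinel else v for v in values]


def standardize(values):
    """Z-score scaling (mean 0, std 1) computed over the non-missing values.

    Uses the population standard deviation, as scikit-learn's
    StandardScaler does. Assumes the values are not all identical.
    """
    present = [v for v in values if v is not None]
    mean, std = fmean(present), pstdev(present)
    return [None if v is None else (v - mean) / std for v in values]


def minmax_scale(values):
    """Scale non-missing values linearly into the 0-1 range.

    Assumes the values are not all identical.
    """
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# Made-up SST values with one sentinel entry.
raw_sst = [10.0, -999.0, 12.0, 14.0]
cleaned = replace_sentinel(raw_sst)
z_scores = standardize(cleaned)
unit_range = minmax_scale(cleaned)
print(cleaned, z_scores, unit_range)
```

Replacing the sentinel before scaling matters: a stray -999 left in the column would dominate both the mean and the min-max range.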

Phase 3: Training Data Creation (03_model_training/)

  • Filter to continuous data periods only (no interpolation)
  • Create a 4 km spatial grid around each buoy location
  • Match satellite observations to buoy dates/locations within grid
  • Export ML-ready training tables with features and target variable
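
One way to picture the matching step is a great-circle distance filter on same-date observations. The sketch below assumes the "4 km grid" is a radius around the buoy (as `GRID_RADIUS_KM` suggests) and that matching is done on calendar date. The function names, the record layout, and the coordinates are all hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def match_satellite_to_buoy(buoy, satellite_rows, radius_km=4.0):
    """Return satellite rows on the buoy's date within radius_km of the buoy."""
    return [
        row
        for row in satellite_rows
        if row["date"] == buoy["date"]
        and haversine_km(
            buoy["latitude"], buoy["longitude"], row["latitude"], row["longitude"]
        )
        <= radius_km
    ]


# Made-up records for illustration only.
buoy = {"date": "2020-06-01", "latitude": 40.0, "longitude": -70.0}
satellite_rows = [
    # About 2.2 km north of the buoy, same date: kept.
    {"date": "2020-06-01", "latitude": 40.02, "longitude": -70.0, "sst_celsius": 18.1},
    # About 11 km north of the buoy: outside the radius.
    {"date": "2020-06-01", "latitude": 40.10, "longitude": -70.0, "sst_celsius": 18.4},
    # Close enough, but a different date.
    {"date": "2020-06-02", "latitude": 40.02, "longitude": -70.0, "sst_celsius": 18.3},
]
matches = match_satellite_to_buoy(buoy, satellite_rows)
print(matches)
```

If several satellite cells fall inside the radius, they would still need to be reduced to one value per buoy measurement (for example the nearest cell or the mean); which rule the notebook uses is not stated here.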

Quick Start

Setup Environment

pip install -r requirements.txt

Run Notebooks in Order

  1. Exploration - Start with notebooks/01_data_exploration/explore_data.ipynb
  2. Preparation - Run notebooks/02_data_preparation/ML_DataPrep_SST_pCO2.ipynb
  3. Training Data - Execute notebooks/03_model_training/ML_Training_Continuous_Data.ipynb

Key Configuration

In ML_Training_Continuous_Data.ipynb, edit the configuration section:

  • APPROACH: Choose 'single', 'multi', or 'all' locations
  • SELECTED_LOCATIONS: List specific buoys to include
  • GRID_RADIUS_KM: Currently set to 4 km (chosen to approximate the ~4.6 km satellite resolution)
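
A configuration cell along these lines might look as follows. Only the three setting names come from the notebook description. The location names are placeholders, and the meaning assigned to 'single', 'multi', and 'all' in `resolve_locations` (a hypothetical helper) is an assumption.

```python
APPROACH = "multi"                          # 'single', 'multi', or 'all'
SELECTED_LOCATIONS = ["buoy_a", "buoy_b"]   # placeholder names, not real buoy IDs
GRID_RADIUS_KM = 4


def resolve_locations(approach, selected, available):
    """Return the locations to process for the given settings.

    Assumed semantics: 'all' ignores `selected`, 'single' requires exactly
    one selected location, and 'multi' accepts one or more.
    """
    if approach not in ("single", "multi", "all"):
        raise ValueError(
            f"APPROACH must be 'single', 'multi', or 'all', got {approach!r}"
        )
    if approach == "all":
        return list(available)
    unknown = [name for name in selected if name not in available]
    if unknown:
        raise ValueError(f"Unknown locations: {unknown}")
    if approach == "single" and len(selected) != 1:
        raise ValueError("'single' expects exactly one entry in SELECTED_LOCATIONS")
    if not selected:
        raise ValueError("SELECTED_LOCATIONS is empty")
    return list(selected)


available = ["buoy_a", "buoy_b", "buoy_c"]
locations = resolve_locations(APPROACH, SELECTED_LOCATIONS, available)
print(locations, GRID_RADIUS_KM)
```

Validating the settings up front surfaces a misspelled location name before any data is loaded.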

Data Files Guide

Master Files (Output of Phase 2)

  • buoy_data_cleaned.csv: All 7 buoys, all dates, cleaned values

    • Columns: datetime, latitude, longitude, sst_celsius, pco2_sw_sat, xco2_sw_dry, location
    • ~26,000 records
  • satellite_sst_cleaned.csv: All 6 locations, all dates

    • Columns: datetime, latitude, longitude, sst_celsius, location
    • ~61,500 records
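
Before running later phases, it can help to confirm that a master file carries the columns listed above. The sketch below checks a CSV header against those lists. The helper name `missing_columns` is hypothetical, and an in-memory string stands in for a real file.

```python
import csv
import io

# Column lists as documented above.
BUOY_COLUMNS = [
    "datetime", "latitude", "longitude", "sst_celsius",
    "pco2_sw_sat", "xco2_sw_dry", "location",
]
SATELLITE_COLUMNS = ["datetime", "latitude", "longitude", "sst_celsius", "location"]


def missing_columns(csv_file, expected):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in expected if name not in header]


# In-memory stand-in for a file; with a real file, pass open(path, newline="").
sample = io.StringIO("datetime,latitude,longitude,sst_celsius,location\n")
print(missing_columns(sample, SATELLITE_COLUMNS))  # prints []

sample.seek(0)
print(missing_columns(sample, BUOY_COLUMNS))  # prints ['pco2_sw_sat', 'xco2_sw_dry']
```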

Analysis Files

  • buoy_continuous_data_periods.csv: Data availability windows per location
    • Identifies continuous measurement periods suitable for training

Training Outputs (Phase 3)

  • ml_training_continuous_data_YYYYMMDD.csv: Final training table
    • One row per buoy measurement with matched satellite data
    • No NaN values; every value is measured (no interpolation)
    • Ready for ML model training
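
The dated filename pattern and the "no NaN values" guarantee can both be checked with a few lines of standard-library Python. This is a sketch under assumptions: the helper names are hypothetical, and it treats empty fields and the literal text "nan" as missing, which may not match how the notebook writes missing values.

```python
import csv
import io
from datetime import date


def training_filename(run_date):
    """Build a filename following the ml_training_continuous_data_YYYYMMDD.csv pattern."""
    return f"ml_training_continuous_data_{run_date:%Y%m%d}.csv"


def is_missing(value):
    """True for empty or NaN-like fields, and for malformed rows.

    csv.DictReader yields None for fields missing from a short row and a
    list for surplus fields, so any non-string value is flagged as well.
    """
    if not isinstance(value, str):
        return True
    text = value.strip()
    return text == "" or text.lower() == "nan"


def rows_with_missing_values(csv_file):
    """Return 1-based data-row numbers that contain at least one missing field."""
    return [
        number
        for number, row in enumerate(csv.DictReader(csv_file), start=1)
        if any(is_missing(value) for value in row.values())
    ]


# Made-up table: row 2 has an empty field, row 3 has a NaN.
sample = io.StringIO("sst_celsius,pco2_sw_sat\n18.1,390.2\n18.3,\n18.4,NaN\n")
bad_rows = rows_with_missing_values(sample)
print(training_filename(date(2025, 1, 31)), bad_rows)
```

An empty result from `rows_with_missing_values` is what a finished training table should produce.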

Important Notes

  • Data Quality: All training data uses only measured values; no interpolation or estimation
  • Spatial Resolution: 4 km grid radius chosen to match ~4.6 km satellite resolution
  • Continuous Periods: Training filtered to date windows where buoys had continuous measurements
  • File Paths: Notebooks assume data structure shown above; update paths if reorganizing
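
The relationship between the 0.042° grid spacing and the quoted ~4.6 km can be checked with a short calculation. The sketch below uses a spherical Earth, which is an approximation; `cell_size_km` is a hypothetical helper. At the equator it gives roughly 4.7 km, in line with the figure above, while the east-west spacing shrinks with the cosine of latitude.

```python
from math import cos, pi, radians

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, spherical approximation
KM_PER_DEGREE = 2 * pi * EARTH_RADIUS_KM / 360  # about 111.2 km per degree of latitude


def cell_size_km(resolution_deg, latitude_deg):
    """Return (north-south, east-west) grid spacing in km at a given latitude."""
    north_south = resolution_deg * KM_PER_DEGREE
    east_west = north_south * cos(radians(latitude_deg))
    return north_south, east_west


for latitude in (0.0, 30.0, 45.0):
    ns, ew = cell_size_km(0.042, latitude)
    print(f"lat {latitude:4.1f}: N-S {ns:.2f} km, E-W {ew:.2f} km")
```

Because the east-west spacing narrows at higher latitudes, a fixed 4 km radius can cover more satellite cells for a northern buoy than for one near the equator.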

Team Members

  • Mary, Colin, Ellie, Arya

Questions & Support

For data documentation details, see docs/NOAA_BUOY_DATA_README.md. For the full data processing workflow, see data/DATA_ANALYSIS_WORKFLOW.md.

About

Mary Orrand class repo
