This folder contains the complete workflow for preparing oceanographic satellite and buoy data for machine learning model training. The project combines NOAA satellite SST data with buoy-measured water chemistry (pCO2) observations across 7 coastal monitoring locations.
- Goal: Generate cleaned, ML-ready training datasets using only measured (non-interpolated) buoy data from continuous measurement periods
- Data Sources:
  - Satellite SST: JPL MUR (0.042° resolution, ~4.6 km)
  - Buoys: NOAA water chemistry (7 locations, 2013-2025)
- Key Output: Training tables of matched satellite/buoy observations within a 4 km spatial grid
GeOceanProject/
├── README.md  # This file
├── requirements.txt  # Python package dependencies
├── notebooks/  # Jupyter notebooks organized by workflow stage
│   ├── 01_data_exploration/  # Initial data inspection and diagnostics
│   │   ├── explore_data.ipynb  # General data exploration
│   │   ├── NOAA_buoy_data.ipynb  # Buoy data inspection
│   │   └── API_attempt_SST_ERDDAP.ipynb  # Satellite data retrieval & quality check
│   ├── 02_data_preparation/  # Data cleaning and standardization
│   │   └── ML_DataPrep_SST_pCO2.ipynb  # Cleaning, merging, scaling
│   └── 03_model_training/  # ML model training datasets
│       └── ML_Training_Continuous_Data.ipynb  # Final training data generation
├── data/  # All data files (ignore .gitignore for CSV)
│   ├── raw/  # Original source data (do not modify)
│   │   ├── buoy_sources/  # Individual NOAA buoy CSV files
│   │   │   ├── noaa_water_chem_data/  # Original NOAA files
│   │   │   └── (other buoy sources)
│   │   └── satellite_sources/  # ERDDAP satellite query results
│   │       ├── (6 location SST files)
│   │       └── satellite_sst_daily.csv
│   ├── processed/  # Cleaned & standardized files
│   │   ├── buoy_data_cleaned.csv  # All buoys, all dates (master)
│   │   ├── satellite_sst_cleaned.csv  # All locations, all dates (master)
│   │   ├── buoy_continuous_data_periods.csv  # Analysis output - data windows
│   │   └── combined_satellite_buoy.csv  # Intermediate merged file
│   └── training/  # ML-ready outputs
│       ├── ml_data_standardized.csv  # Scaled features (mean=0, std=1)
│       ├── ml_data_minmax_scaled.csv  # Scaled features (0-1 range)
│       ├── ml_training_continuous_data_*.csv  # Final training tables
│       └── test_datasets/  # Test/validation data subsets
├── plots/  # Visualization outputs
│   ├── exploration/  # Data exploration plots
│   │   ├── (timeseries plots: 01-07)
│   │   └── (distribution plots: 08-14)
│   └── analysis/  # Summary analysis plots
│       ├── 15_all_locations_comparison.png
│       ├── 16_geographic_sorted.png
│       └── 17_buoy_regional_context.png
└── docs/  # Documentation
    ├── NOAA_BUOY_DATA_README.md  # Detailed data source documentation
    └── (other project documentation)
Phase 1: Data Exploration (01_data_exploration/)
- Load NOAA satellite and buoy data
- Inspect data availability and quality across locations
- Identify continuous measurement windows
- Generate exploratory plots
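One way to identify continuous measurement windows is to sort a location's observation dates and start a new window wherever the gap between consecutive dates exceeds a tolerance. The sketch below is illustrative only: the helper name `find_continuous_windows`, the `max_gap_days` parameter, and the one-day tolerance are assumptions, not values taken from the notebooks.

```python
from datetime import date, timedelta


def find_continuous_windows(dates, max_gap_days=1):
    """Group dates into (start, end, n_obs) runs with no gap above max_gap_days.

    Hypothetical helper: the one-day tolerance is an assumption.
    """
    ordered = sorted(set(dates))
    if not ordered:
        return []
    windows = []
    start = prev = ordered[0]
    count = 1
    for current in ordered[1:]:
        if current - prev > timedelta(days=max_gap_days):
            # Gap too large: close the current window and start a new one.
            windows.append((start, prev, count))
            start = current
            count = 0
        prev = current
        count += 1
    windows.append((start, prev, count))
    return windows


# Made-up dates for illustration, not real buoy data.
obs = [
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3),
    date(2020, 1, 10), date(2020, 1, 11),
]
windows = find_continuous_windows(obs)
for start, end, n_obs in windows:
    print(start, end, n_obs)
```

Running this per location would give a table of start date, end date, and observation count, which is roughly the shape one would expect for a data-windows file.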
Phase 2: Data Preparation (02_data_preparation/)
- Clean and standardize data (handle -999 nulls, date formats)
- Create master files combining all locations
- Scale features (StandardScaler and MinMaxScaler options)
- Generate quality reports
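The cleaning and scaling steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the helper names and the two date formats are assumptions, and the notebook itself may rely on pandas and scikit-learn instead. `standardize` uses the population standard deviation, the same convention as scikit-learn's `StandardScaler`.

```python
import math
import statistics
from datetime import datetime

MISSING_SENTINEL = -999.0  # null marker named in this README


def clean_value(raw):
    """Convert a raw field to float, mapping blanks and the -999 sentinel to NaN."""
    text = str(raw).strip()
    if not text:
        return math.nan
    value = float(text)
    return math.nan if value == MISSING_SENTINEL else value


def parse_date(raw, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Try each candidate format in turn (the formats are assumed, not confirmed)."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def standardize(values):
    """Z-score scaling to mean 0 and std 1; assumes the values are not all equal."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def minmax_scale(values):
    """Rescale to the 0-1 range; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Made-up SST readings for illustration.
raw_sst = ["14.0", "-999", "16.0", "18.0"]
cleaned = [clean_value(v) for v in raw_sst]
measured = [v for v in cleaned if not math.isnan(v)]  # drop nulls, no interpolation
z_scores = standardize(measured)
unit_range = minmax_scale(measured)
print(z_scores)
print(unit_range)
```

Dropping the null rows rather than filling them keeps the sketch consistent with the project's measured-values-only rule.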
Phase 3: Training Data Creation (03_model_training/)
- Filter to continuous data periods only (no interpolation)
- Create 4 km spatial grid around each buoy location
- Match satellite observations to buoy dates/locations within grid
- Export ML-ready training tables with features and target variable
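The spatial matching step can be sketched with a great-circle (haversine) distance test. This is an assumption-laden illustration rather than the notebook's implementation: it treats the grid as a 4 km radius around the buoy, averages all same-day satellite pixels inside it, and drops buoy rows that have no match. The function names, the `date` key, and the `sat_sst_mean` and `n_pixels` output fields are hypothetical.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
GRID_RADIUS_KM = 4.0      # radius quoted in this README


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def match_satellite_to_buoy(buoy_rows, sat_rows, radius_km=GRID_RADIUS_KM):
    """Attach the mean same-day satellite SST within radius_km to each buoy row."""
    matched = []
    for b in buoy_rows:
        nearby = [
            s["sst_celsius"]
            for s in sat_rows
            if s["date"] == b["date"]
            and haversine_km(b["latitude"], b["longitude"],
                             s["latitude"], s["longitude"]) <= radius_km
        ]
        if nearby:  # unmatched buoy rows are dropped; nothing is interpolated
            matched.append({**b, "sat_sst_mean": sum(nearby) / len(nearby),
                            "n_pixels": len(nearby)})
    return matched


# Made-up rows for illustration only.
buoys = [
    {"date": date(2020, 1, 1), "latitude": 36.0, "longitude": -122.0, "pco2_sw_sat": 400.0},
    {"date": date(2020, 1, 5), "latitude": 36.0, "longitude": -122.0, "pco2_sw_sat": 410.0},
]
pixels = [
    {"date": date(2020, 1, 1), "latitude": 36.00, "longitude": -122.0, "sst_celsius": 14.0},
    {"date": date(2020, 1, 1), "latitude": 36.02, "longitude": -122.0, "sst_celsius": 15.0},
    {"date": date(2020, 1, 1), "latitude": 36.10, "longitude": -122.0, "sst_celsius": 20.0},
    {"date": date(2020, 1, 2), "latitude": 36.00, "longitude": -122.0, "sst_celsius": 30.0},
]
matched = match_satellite_to_buoy(buoys, pixels)
print(matched)
```

In this example the pixel about 11 km away and the pixel from a different day are excluded, and the buoy row with no same-day pixel is dropped.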
Install dependencies: pip install -r requirements.txt
- Exploration: Start with notebooks/01_data_exploration/explore_data.ipynb
- Preparation: Run notebooks/02_data_preparation/ML_DataPrep_SST_pCO2.ipynb
- Training Data: Execute notebooks/03_model_training/ML_Training_Continuous_Data.ipynb
In ML_Training_Continuous_Data.ipynb, edit the configuration section:
- APPROACH: Choose 'single', 'multi', or 'all' locations
- SELECTED_LOCATIONS: List specific buoys to include
- GRID_RADIUS_KM: Currently set to 4 km (matches satellite resolution)
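As a rough illustration, the configuration section might look like the sketch below. Only the three setting names and the 4 km value come from this README; the location names are placeholders, and `resolve_locations` is a hypothetical helper showing one plausible reading of how `APPROACH` and `SELECTED_LOCATIONS` could interact.

```python
# Settings named in the README; the values are illustrative.
APPROACH = "multi"                          # 'single', 'multi', or 'all'
SELECTED_LOCATIONS = ["buoy_a", "buoy_b"]   # placeholder names, not real buoy IDs
GRID_RADIUS_KM = 4

# Placeholder list standing in for the locations found in the data.
AVAILABLE_LOCATIONS = ["buoy_a", "buoy_b", "buoy_c"]


def resolve_locations(approach, selected, available):
    """Return the locations to process (hypothetical helper, assumed semantics)."""
    if approach == "all":
        return list(available)
    if approach not in ("single", "multi"):
        raise ValueError(f"Unknown APPROACH: {approach!r}")
    unknown = [name for name in selected if name not in available]
    if unknown:
        raise ValueError(f"Unknown locations: {unknown}")
    if approach == "single" and len(selected) != 1:
        raise ValueError("APPROACH 'single' expects exactly one location")
    if not selected:
        raise ValueError("APPROACH 'multi' expects at least one location")
    return list(selected)


chosen = resolve_locations(APPROACH, SELECTED_LOCATIONS, AVAILABLE_LOCATIONS)
print(chosen, GRID_RADIUS_KM)
```

Validating the selection up front surfaces a misspelled buoy name before any matching work begins.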
- buoy_data_cleaned.csv: All 7 buoys, all dates, cleaned values
  - Columns: datetime, latitude, longitude, sst_celsius, pco2_sw_sat, xco2_sw_dry, location
  - ~26,000 records
- satellite_sst_cleaned.csv: All 6 locations, all dates
  - Columns: datetime, latitude, longitude, sst_celsius, location
  - ~61,500 records
- buoy_continuous_data_periods.csv: Data availability windows per location
  - Identifies continuous measurement periods suitable for training
- ml_training_continuous_data_YYYYMMDD.csv: Final training table
  - One row per buoy measurement with matched satellite data
  - No NaN values, all measured (no interpolation)
  - Ready for ML model training
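A quick sanity check of a table against the columns listed above might look like the following sketch. The `check_table` helper is hypothetical, and the sample rows are made up; the column names are the ones documented for buoy_data_cleaned.csv.

```python
import csv
import io
import math

# Column names as documented for buoy_data_cleaned.csv.
BUOY_COLUMNS = ["datetime", "latitude", "longitude", "sst_celsius",
                "pco2_sw_sat", "xco2_sw_dry", "location"]


def check_table(csv_file, expected_columns, numeric_columns):
    """Return (n_rows, bad_cells); bad_cells lists (line_number, column) for blanks or NaN.

    Hypothetical helper; raises ValueError if an expected column is absent.
    """
    reader = csv.DictReader(csv_file)
    missing = [c for c in expected_columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    n_rows = 0
    bad_cells = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        n_rows += 1
        for col in numeric_columns:
            text = (row[col] or "").strip()
            try:
                value = float(text)
            except ValueError:
                value = math.nan  # blank or non-numeric cell
            if math.isnan(value):
                bad_cells.append((line_no, col))
    return n_rows, bad_cells


# Made-up sample standing in for a real file.
sample = io.StringIO(
    "datetime,latitude,longitude,sst_celsius,pco2_sw_sat,xco2_sw_dry,location\n"
    "2020-01-01,36.0,-122.0,14.2,401.5,398.0,site_a\n"
    "2020-01-02,36.0,-122.0,,402.1,399.0,site_a\n"
)
n_rows, bad_cells = check_table(sample, BUOY_COLUMNS, ["sst_celsius", "pco2_sw_sat"])
print(n_rows, bad_cells)
```

To check a real file, pass an open handle instead of the in-memory sample, for example `open("data/processed/buoy_data_cleaned.csv", newline="")`. For a final training table, an empty `bad_cells` list would be consistent with the no-NaN claim.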
- Data Quality: All training data uses only measured values; no interpolation or estimation
- Spatial Resolution: 4 km grid radius chosen to match ~4.6 km satellite resolution
- Continuous Periods: Training data is filtered to date windows in which buoys recorded continuous measurements
- File Paths: Notebooks assume data structure shown above; update paths if reorganizing
- Mary, Colin, Ellie, Arya
For data documentation details, see docs/NOAA_BUOY_DATA_README.md. For the full data processing workflow, see data/DATA_ANALYSIS_WORKFLOW.md.