Add GEFS atmospheric ensemble for UFS-Coastal SECOFS #68
Open
mansurjisan wants to merge 40 commits into feature/unified-nowcast-forecast from
Conversation
Enable 6-member GEFS atmospheric ensemble for secofs_ufs with DATM forcing. Members use per-member DATM forcing directories instead of shared sflux files:
- 000: GFS+HRRR blended control (existing prep output)
- 001-004: GEFS pgrb2sp25 0.25 deg (no HRRR blend)
- 005: RRFS (GFS fallback for now)

Changes:
- secofs_ufs.yaml: Add ensemble section with perturb_physics: false
- nos_ofs_create_datm_forcing.sh: Add GEFS DBASE case + find_gefs_file()
- nos_ofs_create_datm_forcing_blended.sh: Parameterize primary source via DATM_PRIMARY_SOURCE env var, add DATM_SKIP_UFS_CONFIG toggle
- JNOS_OFS_ENSEMBLE_ATMOS_PREP: Add USE_DATM branch that generates per-member DATM forcing via blended orchestrator with cfp parallelism
- nos_ofs_ensemble_run.sh: Read met_source_1 from params.json to stage member-specific DATM input, patch datm_in grid dims
- New PBS scripts for UFS ensemble member, atmos prep, and launcher
…e files

Remove the 3 separate UFS-specific PBS scripts and instead make the existing ensemble PBS scripts OFS-aware via --ufs flag:
- launch_secofs_ensemble.sh: Add --ufs flag (sets OFS=secofs_ufs, UFS_MODE=true), pass OFS to all qsub calls, use UFS prep/nowcast/forecast PBS for det workflow
- jnos_secofs_ensemble_member.pbs: Branch module loading on OFS name (*_ufs* loads hpc-stack modules.fv3, standalone loads standard WCOSS2 modules), auto-detect USE_DATM and TOTAL_TASKS for UFS
- jnos_secofs_ensemble_atmos_prep.pbs: Accept OFS from launcher, auto-detect USE_DATM, add COMINgefs/COMINrrfs exports
…ribes

Three critical fixes for UFS-Coastal ensemble members:
1. ihot=2 → ihot=1 for UFS-Coastal (USE_DATM=true). ihot=2 causes SCHISM/ESMF clock desync and a ghost node pressure transient. Standalone SCHISM ensemble still uses ihot=2 for continuous staout timeseries.
2. Patch model_configure start_year/month/day/hour from time_nowcastend, matching the deterministic workflow. Without this, the ESMF clock starts at the wrong time for ensemble forecast members.
3. Force nws=4 and nscribes=0 in param.nml for UFS-Coastal members. Prevents mismatches if param.nml originated from standalone SCHISM (nws=2, nscribes=6/7).
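The namelist forcing in fix 3 can be sketched as a simple regex patcher; this is illustrative, not the actual script, and assumes param.nml uses plain `key = value` lines:

```python
import re

def patch_param_nml(text):
    """Force the UFS-Coastal settings (ihot=1, nws=4, nscribes=0) in a
    param.nml string, whatever values the staged file carried."""
    for key, val in (("ihot", 1), ("nws", 4), ("nscribes", 0)):
        # Replace the value token after "key =" at the start of a line.
        text = re.sub(rf"(?m)^(\s*{key}\s*=\s*)\S+", rf"\g<1>{val}", text)
    return text
```

Applied to a param.nml that originated from standalone SCHISM (nws=2, nscribes=6), this rewrites the three keys in place and leaves everything else untouched.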
prod_envir module sets PACKAGEROOT to /lfs/h1/ops/prod/packages, overriding the dev path. Save PACKAGEROOT before module loads and restore it after, so J-jobs resolve to the correct dev installation directory.
….json
params.json nests met_source_1 under atmospheric_source:{}, but the
staging code was reading it as a top-level key, always getting empty
string. This caused all ensemble members to fall through to the control
datm_input/ directory (blended GFS+HRRR) instead of their member-specific
GEFS-only DATM forcing.
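The nesting bug reduces to one line of lookup logic; a minimal sketch with a hypothetical params.json fragment mirroring the structure described above:

```python
import json

# Hypothetical params.json fragment: met_source_1 is nested under
# atmospheric_source, not a top-level key.
params = json.loads('{"atmospheric_source": {"met_source_1": "gefs"}}')

# Buggy read: top-level lookup silently falls back to the empty string,
# so every member staged the control datm_input/ directory.
met_buggy = params.get("met_source_1", "")

# Fixed read: descend into atmospheric_source first.
met_fixed = params.get("atmospheric_source", {}).get("met_source_1", "")
```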
…tion

Three fixes for UFS-Coastal GEFS ensemble member failures:
1. Explicit datm_forcing.nc check: Replace silent `cp *.nc || true` with an explicit check for datm_forcing.nc and datm_esmf_mesh.nc. Fail early with a clear error instead of launching the model with missing input.
2. ESMF mesh regeneration: When the member forcing grid (e.g., GEFS 1440x721) differs from the control mesh (blended 1721x1721), regenerate the ESMF mesh in-place from the member's forcing file using Python/netCDF4.
3. Atmos prep wrapper archival: Replace glob `cp *.nc` with explicit per-file copy with warnings, so missing files are visible in logs.
ESMF_Scrip2Unstruct fails when multiple cfp ranks run it concurrently (only rank 0 succeeds; ranks 1-4 produce empty mesh files). Fix:
- Atmos prep: Generate the GEFS 0.25-deg ESMF mesh once via Python/netCDF4 before cfp launches. Pass the pre-generated mesh to all wrappers via the DATM_ESMF_MESH env var.
- Blended script: Check DATM_ESMF_MESH first in the Step 4 priority chain, skipping ESMF_Scrip2Unstruct entirely when a pre-generated mesh exists.
- Remove the broken heredoc attempt; keep only the python3 -c version.
Two issues causing member failures:
1. The mesh mismatch check compared forcing dims against the control's mesh nodeCount. The control mesh (from ESMF_Scrip2Unstruct) has 2,965,284 nodes while the blended forcing is 1721x1721 = 2,961,841. This triggered unnecessary mesh regeneration for ALL members, including control. Fix: compare against the member's own staged mesh instead.
2. The Python for-loop for element connectivity was slow for ~3M elements. Replace it with vectorized numpy (np.mgrid + column_stack) in both the atmos prep and ensemble run mesh generators.
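The vectorized replacement in fix 2 can be sketched as follows; this is a minimal version of the np.mgrid + column_stack approach, and the real generators also write coordinates and masks:

```python
import numpy as np

def quad_connectivity(nx, ny):
    """1-based 4-node element connectivity for an nx-by-ny node grid,
    built without a Python for-loop (np.mgrid + column_stack)."""
    j, i = np.mgrid[0:ny - 1, 0:nx - 1]   # lower-left node of each quad
    ll = (j * nx + i).ravel() + 1         # 1-based flat node index
    # Counter-clockwise: lower-left, lower-right, upper-right, upper-left.
    return np.column_stack([ll, ll + 1, ll + 1 + nx, ll + nx])
```

For a 1721x1721 grid this produces the ~3M-row connectivity array in a single vectorized pass instead of ~3M loop iterations.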
… mesh

Two fixes for UFS-Coastal ensemble failures:
1. Force OMP_NUM_THREADS=1 unconditionally in the ensemble run. Cray PBS pre-sets OMP_NUM_THREADS to ncpus (128), causing massive thread oversubscription with 120 MPI ranks per node. All members (001-005) crashed at timestep 1 because of this. The deterministic model_run.sh already forces OMP_NUM_THREADS=1; the ensemble was using a conditional default that didn't override the PBS value.
2. Pre-generate the correct control ESMF mesh during atmos prep. The det prep's ESMF_Scrip2Unstruct produces a 1722x1722 mesh (2,965,284 nodes) but the blended forcing is 1721x1721 (2,961,841 points). Member 000 was spending time regenerating the mesh at runtime. Now atmos prep checks and fixes the mesh before members launch.
GEFS pgrb2sp25 product lacks SPFH_2maboveground and PRATE_surface, and uses PRMSL_meansealevel instead of MSLMA_meansealevel. The datm.streams template references all 8 variables from the GFS+HRRR blended control, causing DATM to crash at the first coupled timestep when variables aren't found in the forcing file. After staging datm.streams and datm_forcing.nc, read actual variable names from the forcing NetCDF and rebuild stream_data_variables01 to only include variables that exist. Maps both PRMSL and MSLMA to the same CMEPS field (Sa_pslv).
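The rebuild logic amounts to filtering a variable map against what the forcing file actually contains. A sketch follows; only the PRMSL/MSLMA → Sa_pslv pair and the variable names quoted above come from this PR, and the other map entries are illustrative stand-ins for the full 8-variable template:

```python
# Illustrative subset of the datm.streams variable map
# (forcing-file variable -> CMEPS field).
VAR_MAP = {
    "PRMSL_meansealevel": "Sa_pslv",   # GEFS pgrb2sp25 naming
    "MSLMA_meansealevel": "Sa_pslv",   # GFS+HRRR blended naming
    "SPFH_2maboveground": "Sa_q2m",    # absent from GEFS pgrb2sp25
    "PRATE_surface": "Faxa_rain",      # absent from GEFS pgrb2sp25 (field name assumed)
}

def rebuild_stream_vars(forcing_vars):
    """Keep only (file_var, cmeps_field) entries whose file variable is
    actually present in the staged datm_forcing.nc."""
    return [f"{v} {f}" for v, f in VAR_MAP.items() if v in forcing_vars]
```

For a GEFS member whose file carries PRMSL but not SPFH/PRATE, the rebuilt stream_data_variables01 drops the missing entries instead of letting DATM crash at the first coupled timestep.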
ESMF_Scrip2Unstruct (used by det prep) includes elementMask in the mesh file, but our Python mesh generator did not. DATM's InitializeRealize calls ESMF_MeshGet which requires element mask info. Without it: "mesh doesn't contain element mask information" → crash during ATM component initialization. Add elementMask (all zeros = unmasked) to all three Python mesh generation sites: GEFS mesh, control mesh (atmos prep), and runtime mesh regeneration (ensemble run).
…nodeCount The runtime ESMF mesh validation only checked nodeCount, so a mesh with correct dimensions but missing elementMask (generated before the elementMask fix) was accepted as-is. Member 000 (GFS+HRRR control) used such a stale mesh and crashed with "mesh doesn't contain element mask information." Now also verify elementMask variable exists before skipping regeneration.
…neration The 1721x1721 control mesh (3M nodes) full regeneration was killed by OOM (exit 137) on the PBS launch node, leaving the old mesh without elementMask. Split the fix into two paths: full regeneration only when nodeCount mismatches, lightweight in-place append of elementMask when the mesh geometry is correct but the variable is missing.
…ching

The datm.streams template and ensemble var_map used incorrect CMEPS field names that don't match the SCHISM NUOPC cap interface:
- Sa_t2m → Sa_tbot (temperature)
- Sa_q2m → Sa_shum (specific humidity)
- Faxa_swnet → Faxa_swdn (shortwave radiation)
The working deterministic scripts (modify_gfs_4_esmfmesh.py) already used the correct names. The wrong names caused SCHISM to receive zero values for all atmospheric fields including wind, resulting in zero staout_2/3/4 (pressure, wind_u, wind_v) in ensemble runs.
…reams Reverts the wrong CMEPS field name change (Sa_tbot→Sa_t2m etc) — the deterministic run confirms Sa_t2m/Sa_q2m/Faxa_swnet are correct. Root cause: the ensemble patching code stripped SPFH/PRATE from datm.streams and renamed MSLMA→PRMSL, making the config differ from the working deterministic (which has all 8 variables). This caused SCHISM NUOPC cap to not fully initialize atmospheric coupling, resulting in zero wind/pressure in staout. Fix: instead of stripping variables from datm.streams, add missing SPFH (zeros) and PRATE (zeros) to the forcing NetCDF and rename PRMSL→MSLMA, so the ensemble datm.streams stays identical to the deterministic with all 8 fields.
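The forcing-file patch described above (zero-fill the missing variables, rename PRMSL → MSLMA) can be sketched with a plain dict standing in for the NetCDF variable set; the real script operates on datm_forcing.nc via netCDF4:

```python
import numpy as np

def patch_member_forcing(nc_vars, grid_shape):
    """Sketch: make a GEFS member's variable set match the deterministic
    8-field datm.streams. `nc_vars` is a dict stand-in for the NetCDF
    variables; `grid_shape` is the forcing grid (e.g. (721, 1440))."""
    # Rename PRMSL -> MSLMA so the stream entry matches the blended control.
    if "PRMSL_meansealevel" in nc_vars and "MSLMA_meansealevel" not in nc_vars:
        nc_vars["MSLMA_meansealevel"] = nc_vars.pop("PRMSL_meansealevel")
    # Add the fields GEFS pgrb2sp25 lacks, filled with zeros.
    for missing in ("SPFH_2maboveground", "PRATE_surface"):
        nc_vars.setdefault(missing, np.zeros(grid_shape))
    return nc_vars
```

The design point is that datm.streams stays byte-identical to the working deterministic configuration; only the forcing file is adapted to it, not the other way around.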
CMEPS uses srcMaskValues=(/0/) for ATM→OCN regridding, meaning elements with mask=0 are EXCLUDED from bilinear interpolation. The ensemble mesh code was setting elementMask to all zeros, causing every DATM element to be masked out. The mediator then produced zeros for all fields sent to SCHISM — confirmed by PET logs showing DATM sends real wind (Sa_u10m: -12.8 to 2.0) but SCHISM receives all zeros (inst_zonal_wind_height10m: 0.0). The DET mesh (from ESMF_Scrip2Unstruct) works because it has NO elementMask variable — ESMF defaults to all elements active. Fix: set elementMask=1 (active) in both full mesh regeneration and the lightweight in-place patch paths.
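The masking interaction can be stated compactly; srcMaskValues=(/0/) is from the commit message above, and the helper is a sketch of the fixed mask construction:

```python
import numpy as np

def active_element_mask(element_count):
    """elementMask for a DATM mesh under CMEPS with srcMaskValues=(/0/):
    mask==0 means "exclude from regridding", so an all-zeros mask blanks
    every field sent to the ocean. 1 marks an element active."""
    return np.ones(element_count, dtype=np.int32)
```

With this, both the full mesh regeneration path and the lightweight in-place patch write mask=1 for every element, matching ESMF's default (all active) when no elementMask variable is present at all.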
The previous fix (65c6843) only fixed elementMask in the member job's mesh regeneration/patching code. But the actual mesh used by GEFS members is pre-generated in JNOS_OFS_ENSEMBLE_ATMOS_PREP and archived to $COMOUT/datm_input_gefs_NN/. The member job sees "ESMF mesh OK" (correct node count + elementMask exists) and uses the ATMOS_PREP mesh as-is — with elementMask=0 (all masked out). Fix: set elementMask=1 in both the GEFS global mesh generator (line 645) and the control mesh regenerator (line 740, also add elementMask which was previously missing).
The pre-generated GEFS ESMF mesh used a global 0.25° grid (1440×721 = 1,038,240 nodes), but the actual GEFS forcing is extracted to the SECOFS subdomain (1721×1721 = 2,961,841 nodes). The node count mismatch caused nos_ofs_create_datm_forcing_blended.sh to fall through to ESMF_Scrip2Unstruct, and 4 parallel cfp tasks racing that tool caused members 02-04 to fail. Fix: after regenerating the control mesh with correct dimensions, copy it to GEFS_MESH so all cfp wrappers inherit the corrected mesh.
Log files now include the PDY date to distinguish runs on different days:
- secofs_ufs_ens000_12.20260311.out (was: secofs_ufs_ens000_12.out)
- secofs_ufs_gefs_prep_12.20260311.out
- secofs_ufs_enspost_12.20260311.out
Also added explicit -o/-e for the ensemble post job.
Three fixes for UFS-SECOFS ensemble GEFS atmos_prep failures:
1. Remove the wrong control→GEFS mesh copy: the control mesh is 1721x1721 (SECOFS subdomain) but GEFS forcing is 1440x721 (global 0.25deg). Copying the control mesh over the GEFS mesh broke all members.
2. Pre-copy the GEFS mesh into each member's blend dir as datm_forcing_esmf_mesh.nc before calling the blended script. The DATM_ESMF_MESH env var is set but not visible to the blended script under cfp parallel execution (unknown cause). Pre-copying uses the blended script's existing fallback path at line 276, completely avoiding ESMF_Scrip2Unstruct and eliminating the cfp race condition that caused 4/5 members to fail.
3. Fix IndexError in the ensemble_run.sh variable checker: control forcing uses x/y dimensions (from blended output) but the code only searched for latitude/longitude. Added y/x with graceful fallback.
- Fix ensemble post PBS script: OFS now from qsub -v (was hardcoded
secofs), add udunits before nco, use ${OFS} in paths, 2hr walltime
- Add parameterized schism_combine_outputs.py that reads grid dimensions
dynamically (replaces hardcoded schism_fields_station_redo.py)
- JNOS_OFS_ENSEMBLE_POST Step 1: per-member combining produces
fields.fNNN.nc + stations.forecast.nc matching deterministic format
- FIX file staging handles _ufs prefix variants (secofs_ufs → secofs)
- Replace CLI-args schism_combine_outputs.py with control-file version from feature/comf-schism-post (same pattern as nos_ofs_nowcast_forecast.sh)
- Adds UFS-Coastal fallbacks: hgrid.gr3 for missing depth/coords, zeros for missing windSpeedX/Y in out2d
- J-job now creates schism_standard_output.ctl per member (5-line format), copies FIX files to outputs dir, runs python script in CWD
- Matches the established COMF inline post-processing approach
- DET _comf_execute_ufs_coastal(): patch datm_in nx_global/ny_global to match actual datm_forcing.nc dimensions (was using stale template values 1881x1841 instead of actual 1721x1721 blended grid). Regenerate ESMF mesh from forcing file coordinates to guarantee consistency.
- Ensemble post: accept staout_* as valid member output (UFS-Coastal ensemble produces only station text files, not out2d_*.nc field output). schism_combine_outputs.py now runs in stations-only mode when no out2d_*.nc files are present.
- PBS script: add gsl module before nco (required dependency on WCOSS2).
Expands the atmospheric ensemble from 6 to 7 members:
- Members 001-005: GEFS perturbation forecasts (was 001-004)
- Member 006: RRFS 3km (was member 005)
All ensemble members were copying datm.streams from the DET prep, which contains an absolute path to the DET's datm_forcing.nc. Even though each member staged its own GEFS-specific datm_forcing.nc to $MEMBER_DATA/INPUT/, DATM read from the DET path — producing identical output for all GEFS members. Fix: After staging member-specific forcing files, sed-replace the paths in datm.streams to point to $MEMBER_DATA/INPUT/ instead of the DET's datm_input directory.
All GEFS ensemble members produced identical output because datm_in model_meshfile and model_maskfile still pointed to the DET's ESMF mesh (1721x1721 GFS+HRRR grid) while the member's forcing uses 1440x721 GEFS grid. The nx_global/ny_global were correctly patched but the mesh file reference was not, causing DATM to decompose forcing incorrectly. Now patch both datm.streams (stream_data_files01, stream_mesh_file01) AND datm_in (model_meshfile, model_maskfile) to point to each member's own INPUT/datm_esmf_mesh.nc and INPUT/datm_forcing.nc.
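The two path rewrites above reduce to string substitutions on datm.streams and datm_in; a sketch follows, assuming the files reference the forcing and mesh by their standard names (the key syntax around the paths is not shown in the PR, so the regexes target only the filenames):

```python
import re

def point_datm_to_member(config_text, member_input):
    """Rewrite any absolute path ending in datm_forcing.nc or
    datm_esmf_mesh.nc (stream_data_files01/stream_mesh_file01 in
    datm.streams, model_meshfile/model_maskfile in datm_in) to the
    member's own INPUT/ copies."""
    config_text = re.sub(r"\S*/datm_forcing\.nc",
                         f"{member_input}/datm_forcing.nc", config_text)
    config_text = re.sub(r"\S*/datm_esmf_mesh\.nc",
                         f"{member_input}/datm_esmf_mesh.nc", config_text)
    return config_text
```

Run after staging, this guarantees DATM cannot silently fall back to the DET's files even if the templates carried absolute paths.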
…cing
find_gfs_file() and find_gefs_file() now never use a GFS/GEFS cycle
newer than ${PDY}${cyc}. This ensures datm_forcing.nc is identical
regardless of when prep runs, matching operational SECOFS behavior.
Previously, running prep later would find newer GFS cycles with shorter
forecast hours, producing different forcing data and causing UFS-Coastal
ensemble control to diverge from operational standalone SECOFS.
Override via MAX_GFS_CYCLE env var if needed.
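The cap logic can be sketched in a few lines; candidate cycles as YYYYMMDDHH strings sort chronologically, so a string comparison is enough (this mirrors the intent of find_gfs_file()/find_gefs_file(), not the shell implementation):

```python
def usable_cycles(candidates, pdy, cyc):
    """Filter candidate GFS/GEFS cycles (YYYYMMDDHH strings) to those
    not newer than ${PDY}${cyc}, newest first. Fixed-width date strings
    compare in the same order as the times they encode."""
    cap = f"{pdy}{cyc}"
    return sorted((c for c in candidates if c <= cap), reverse=True)
```

A prep run launched hours late would previously pick 2026031118 here; with the cap it deterministically selects 2026031112 no matter when it runs.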
The DET prep creates an ESMF mesh with 1722x1722 nodes (padding row/column) for a 1721x1721 forcing grid. The ensemble script's node count check (mesh_nodes != nx*ny) triggered full mesh regeneration with 1721x1721 nodes, producing different interpolation weights and causing member 000 to diverge from DET despite having identical datm_forcing.nc. Fix: detect control members (is_control_det flag) that stage from the DET prep's datm_input directory, and skip mesh regeneration. Only add elementMask if missing (required by CMEPS). Non-control members (GEFS, RRFS) continue to use mesh regeneration as before.
The DET model run (_comf_execute_ufs_coastal) always regenerates the ESMF mesh from datm_forcing.nc at runtime. The ensemble run was using the prep's mesh from $COMOUT, which is generated by a different method and has different coordinate values. This caused member 000 to diverge from DET despite having identical forcing. Fix: always regenerate the mesh from the forcing file using the same Python code as the DET model run. This ensures identical interpolation weights for all members. The mesh generation code now also handles both coordinate conventions (1D lon/lat arrays and 2D x/y arrays) matching nos_ofs_model_run.sh exactly. Replaces the previous approach of skipping regeneration for control members, which was incorrect.
All three mesh generation points now use the prep's SCRIP-based mesh (proc_scrip.py + ESMF_Scrip2Unstruct) instead of inline Python center-based mesh generation. This ensures DET and ensemble member 000 use identical interpolation weights, eliminating the divergence caused by different mesh coordinate conventions.

Changes:
- JNOS_OFS_ENSEMBLE_ATMOS_PREP: Use SCRIP pipeline for GEFS mesh; preserve control mesh from prep (only add elementMask if missing)
- nos_ofs_model_run.sh: Remove Python mesh regeneration from _comf_execute_ufs_coastal(); keep datm_in nx/ny patching and add elementMask check only
- nos_ofs_ensemble_run.sh: Remove "always regenerate" mesh block; use prep's staged mesh with elementMask check only
The skip-check used || (OR), so having just stations.forecast.nc (from inline post) caused all members to skip combining. Changed to && (AND) so it only skips when both field and station files exist.
Ensemble launcher, member, and atmos_prep PBS scripts used
rpt/v3.7.0 while DET UFS scripts used rpt/secofs_ufs. Now all
use rpt/${OFS} so logs are in the same directory.
Remove ${nosofs_ver} from WORKDIR paths so all jobs (DET, UFS,
ensemble) use the same work directory structure: work/secofs_ufs
for UFS, work/secofs for standalone.
Remove ${nosofs_ver} from DATAROOT paths to match prep scripts.
All jobs now create DATA under work/${OFS}/ instead of some using
work/v3.7.0/${OFS}/.
The Python netCDF4 dimension reader was failing silently due to LD_PRELOAD=libnetcdff.so conflicts (memory item #6/#8). This caused the entire datm_in patching AND elementMask addition to be skipped. Fix: Use ncdump -h (always available, no LD_PRELOAD issues) to read forcing file dimensions. For elementMask Python, run in subshell with LD_PRELOAD unset.
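The ncdump-based dimension read can be sketched as a small parser over the header text; this is illustrative (it parses a captured `ncdump -h` string rather than invoking the tool, and skips UNLIMITED dimensions):

```python
import re

def dims_from_ncdump(header_text):
    """Extract {name: size} from the `dimensions:` section of
    `ncdump -h` output, avoiding opening the file with netCDF4
    (and thus the LD_PRELOAD=libnetcdff.so conflict)."""
    dims = {}
    in_dims = False
    for line in header_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("dimensions:"):
            in_dims = True
            continue
        if stripped.startswith("variables:"):
            break  # end of the dimensions section
        if in_dims:
            m = re.match(r"(\w+)\s*=\s*(\d+)\s*;", stripped)
            if m:
                dims[m.group(1)] = int(m.group(2))
    return dims
```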
Checks forcing dims, ESMF mesh consistency (node count, elementMask, coordinates), datm_in nx/ny match, and DET vs member 000 identity.
SCRIP meshes for global grids (GEFS 0.25°) wrap in longitude, so node count = nx*(ny+1) not (nx+1)*(ny+1). Also accept PRMSL as equivalent to MSLMA (renamed at runtime by ensemble run script).
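The wrap rule translates to a one-line expected node count, which the validation check can compare against:

```python
def scrip_node_count(nx, ny, wraps_lon):
    """Expected SCRIP mesh node count for an nx-by-ny cell grid: a
    longitude-wrapping global grid shares its seam column, giving
    nx*(ny+1) nodes; a regional grid keeps the extra column, giving
    (nx+1)*(ny+1)."""
    return nx * (ny + 1) if wraps_lon else (nx + 1) * (ny + 1)
```

For the GEFS 0.25-degree global grid (1440x721) this accepts 1440*722 nodes, while the regional 1721x1721 case expects the 1722x1722 padded mesh described earlier in this PR.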
CDEPS tintalgo=linear needs data records bracketing each model timestep. Without buffer, the last GEFS forcing record (20 of 20) coincided exactly with model end time, causing: shr_stream_findBounds ERROR: limit on and rDateIn gt rDategvd Adding 3h buffer ensures at least one extra record past forecast end. DET already had enough hourly records; this mainly fixes GEFS members which use 3-hourly data (20 records → 21 records).
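The record arithmetic can be made explicit; the 3 h buffer and 3-hourly GEFS step are from the commit above, while the 57 h forecast length used in the check is an illustrative assumption consistent with the 20-record count:

```python
def records_needed(fcst_hours, step_hours, buffer_hours):
    """Forcing records (inclusive of t=0) covering fcst_hours plus a
    buffer, so tintalgo=linear always has a record strictly past the
    last model timestep."""
    return (fcst_hours + buffer_hours) // step_hours + 1
```

Without a buffer the final record lands exactly on the model end time and shr_stream_findBounds has nothing to bracket with; the buffer adds one more record past forecast end.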
- schism_combine_outputs.py: Add convert_schout_to_split() to auto-detect and convert combined schout_*.nc format to split files (out2d, temperature, salinity, horizontalVelX/Y) before processing. Existing split format (out2d_*.nc) continues to work unchanged.
- JNOS_OFS_ENSEMBLE_POST: Update file existence checks to also look for schout_1.nc when deciding whether a member has field outputs, and copy FIX files (nv.nc, hgrid.gr3) when schout format is detected.
Summary
- --ufs flag (no separate scripts)

Files Changed (8 files, +652/-141)
- parm/systems/secofs_ufs.yaml: ensemble section with GEFS/RRFS config
- jobs/JNOS_OFS_ENSEMBLE_ATMOS_PREP: USE_DATM branch for per-member DATM forcing
- ush/nos_ofs_ensemble_run.sh: UFS execution path, ihot/nws/nscribes fixes, model_configure patching
- ush/nosofs/nos_ofs_create_datm_forcing.sh: GEFS DBASE case + find_gefs_file()
- ush/nosofs/nos_ofs_create_datm_forcing_blended.sh: Parameterized primary source
- pbs/launch_secofs_ensemble.sh: --ufs flag, OFS passthrough
- pbs/jnos_secofs_ensemble_member.pbs: OFS-aware module loading
- pbs/jnos_secofs_ensemble_atmos_prep.pbs: OFS/USE_DATM auto-detection

Test plan
launch_secofs_ensemble.sh 00 --ufs --gefs on WCOSS2