
Add GEFS atmospheric ensemble for UFS-Coastal SECOFS#68

Open
mansurjisan wants to merge 40 commits into feature/unified-nowcast-forecast from ufs-ens

Conversation

@mansurjisan
Owner

Summary

  • Add 6-member GEFS atmospheric ensemble support for UFS-Coastal SECOFS (DATM+SCHISM)
    • Member 000: GFS+HRRR blended control
    • Members 001-004: GEFS pgrb2sp25 0.25° (no HRRR blend)
    • Member 005: RRFS (GFS fallback for now)
  • Per-member DATM forcing directories with cfp parallel generation
  • Consolidate UFS ensemble into existing PBS scripts via --ufs flag (no separate scripts)
  • Fix UFS ensemble: ihot=1 (not ihot=2), model_configure start time patching, nws=4/nscribes=0 enforcement

Files Changed (8 files, +652/-141)

  • parm/systems/secofs_ufs.yaml — ensemble section with GEFS/RRFS config
  • jobs/JNOS_OFS_ENSEMBLE_ATMOS_PREP — USE_DATM branch for per-member DATM forcing
  • ush/nos_ofs_ensemble_run.sh — UFS execution path, ihot/nws/nscribes fixes, model_configure patching
  • ush/nosofs/nos_ofs_create_datm_forcing.sh — GEFS DBASE case + find_gefs_file()
  • ush/nosofs/nos_ofs_create_datm_forcing_blended.sh — Parameterized primary source
  • pbs/launch_secofs_ensemble.sh — --ufs flag, OFS passthrough
  • pbs/jnos_secofs_ensemble_member.pbs — OFS-aware module loading
  • pbs/jnos_secofs_ensemble_atmos_prep.pbs — OFS/USE_DATM auto-detection

Test plan

  • Run launch_secofs_ensemble.sh 00 --ufs --gefs on WCOSS2
  • Verify atmos prep generates per-member DATM forcing (datm_input_gefs_01..04, datm_input_rrfs)
  • Verify ensemble members run with ihot=1, nws=4, nscribes=0
  • Verify model_configure has correct start time for forecast phase
  • Verify standalone (non-UFS) ensemble still works with ihot=2

Enable 6-member GEFS atmospheric ensemble for secofs_ufs with DATM
forcing. Members use per-member DATM forcing directories instead of
shared sflux files:
- 000: GFS+HRRR blended control (existing prep output)
- 001-004: GEFS pgrb2sp25 0.25 deg (no HRRR blend)
- 005: RRFS (GFS fallback for now)

Changes:
- secofs_ufs.yaml: Add ensemble section with perturb_physics: false
- nos_ofs_create_datm_forcing.sh: Add GEFS DBASE case + find_gefs_file()
- nos_ofs_create_datm_forcing_blended.sh: Parameterize primary source
  via DATM_PRIMARY_SOURCE env var, add DATM_SKIP_UFS_CONFIG toggle
- JNOS_OFS_ENSEMBLE_ATMOS_PREP: Add USE_DATM branch that generates
  per-member DATM forcing via blended orchestrator with cfp parallelism
- nos_ofs_ensemble_run.sh: Read met_source_1 from params.json to stage
  member-specific DATM input, patch datm_in grid dims
- New PBS scripts for UFS ensemble member, atmos prep, and launcher
…e files

Remove the 3 separate UFS-specific PBS scripts and instead make the existing
ensemble PBS scripts OFS-aware via --ufs flag:

- launch_secofs_ensemble.sh: Add --ufs flag (sets OFS=secofs_ufs, UFS_MODE=true),
  pass OFS to all qsub calls, use UFS prep/nowcast/forecast PBS for det workflow
- jnos_secofs_ensemble_member.pbs: Branch module loading on OFS name (*_ufs* loads
  hpc-stack modules.fv3, standalone loads standard WCOSS2 modules), auto-detect
  USE_DATM and TOTAL_TASKS for UFS
- jnos_secofs_ensemble_atmos_prep.pbs: Accept OFS from launcher, auto-detect
  USE_DATM, add COMINgefs/COMINrrfs exports
…ribes

Three critical fixes for UFS-Coastal ensemble members:

1. ihot=2 → ihot=1 for UFS-Coastal (USE_DATM=true). ihot=2 causes
   SCHISM/ESMF clock desync and ghost node pressure transient. Standalone
   SCHISM ensemble still uses ihot=2 for continuous staout timeseries.

2. Patch model_configure start_year/month/day/hour from time_nowcastend,
   matching the deterministic workflow. Without this, ESMF clock starts
   at the wrong time for ensemble forecast members.

3. Force nws=4 and nscribes=0 in param.nml for UFS-Coastal members.
   Prevents mismatches if param.nml originated from standalone SCHISM
   (nws=2, nscribes=6/7).
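The model_configure start-time patch in fix 2 can be sketched as a small Python helper. This is a hedged sketch: the `start_*` key names follow the UFS model_configure convention, but the actual patching in nos_ofs_ensemble_run.sh is done in shell.

```python
import re

def patch_model_configure(text: str, time_nowcastend: str) -> str:
    """Rewrite start_* keys from a YYYYMMDDHH timestamp (e.g. time_nowcastend)."""
    fields = {
        "start_year":  time_nowcastend[0:4],
        "start_month": time_nowcastend[4:6],
        "start_day":   time_nowcastend[6:8],
        "start_hour":  time_nowcastend[8:10],
    }
    for key, val in fields.items():
        # replace the numeric value while keeping the "key:" prefix intact
        text = re.sub(rf"^({key}:\s*)\d+", rf"\g<1>{val}", text, flags=re.M)
    return text
```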
prod_envir module sets PACKAGEROOT to /lfs/h1/ops/prod/packages, overriding
the dev path. Save PACKAGEROOT before module loads and restore it after, so
J-jobs resolve to the correct dev installation directory.
….json

params.json nests met_source_1 under atmospheric_source:{}, but the
staging code was reading it as a top-level key, always getting empty
string. This caused all ensemble members to fall through to the control
datm_input/ directory (blended GFS+HRRR) instead of their member-specific
GEFS-only DATM forcing.
…tion

Three fixes for UFS-Coastal GEFS ensemble member failures:

1. Explicit datm_forcing.nc check: Replace silent `cp *.nc || true` with
   explicit check for datm_forcing.nc and datm_esmf_mesh.nc. Fail early
   with clear error instead of launching model with missing input.

2. ESMF mesh regeneration: When member forcing grid (e.g., GEFS 1440x721)
   differs from control mesh (blended 1721x1721), regenerate ESMF mesh
   in-place from the member's forcing file using Python/netCDF4.

3. Atmos prep wrapper archival: Replace glob `cp *.nc` with explicit
   per-file copy with warnings, so missing files are visible in logs.
ESMF_Scrip2Unstruct fails when multiple cfp ranks run it concurrently
(only rank 0 succeeds, ranks 1-4 produce empty mesh files). Fix:

- Atmos prep: Generate GEFS 0.25-deg ESMF mesh once via Python/netCDF4
  before cfp launches. Pass pre-generated mesh to all wrappers via
  DATM_ESMF_MESH env var.
- Blended script: Check DATM_ESMF_MESH first in Step 4 priority chain,
  skipping ESMF_Scrip2Unstruct entirely when pre-generated mesh exists.
- Remove broken heredoc attempt, keep only the python3 -c version.
Two issues causing member failures:

1. Mesh mismatch check compared forcing dims against the control's mesh
   nodeCount. The control mesh (from ESMF_Scrip2Unstruct) has 2,965,284
   nodes while the blended forcing is 1721x1721=2,961,841. This triggered
   unnecessary mesh regeneration for ALL members including control.
   Fix: compare against the member's own staged mesh instead.

2. Python for-loop for element connectivity was slow for ~3M elements.
   Replace with vectorized numpy (np.mgrid + column_stack) in both
   the atmos prep and ensemble run mesh generators.
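The vectorized connectivity build can be sketched as follows — a sketch assuming row-major node numbering on an ny×nx grid; the actual generators live in the atmos prep J-job and the ensemble run script.

```python
import numpy as np

def quad_connectivity(nx: int, ny: int) -> np.ndarray:
    """Counter-clockwise quad connectivity for an ny x nx node grid, row-major ids."""
    # lower-left node index of every quad, computed without a Python loop
    j, i = np.mgrid[0:ny - 1, 0:nx - 1]
    ll = (j * nx + i).ravel()
    return np.column_stack([ll, ll + 1, ll + nx + 1, ll + nx])
```

For a 1721×1721 grid this builds 1720² element rows in a single vectorized pass, versus iterating element by element in pure Python.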
… mesh

Two fixes for UFS-Coastal ensemble failures:

1. Force OMP_NUM_THREADS=1 unconditionally in ensemble run. Cray PBS
   pre-sets OMP_NUM_THREADS to ncpus (128), causing massive thread
   oversubscription with 120 MPI ranks per node. All members (001-005)
   crashed at timestep 1 due to this. The deterministic model_run.sh
   already forces OMP_NUM_THREADS=1; ensemble was using conditional
   default which didn't override the PBS value.

2. Pre-generate correct control ESMF mesh during atmos prep. The det
   prep's ESMF_Scrip2Unstruct produces a 1722x1722 mesh (2,965,284
   nodes) but the blended forcing is 1721x1721 (2,961,841 points).
   Member 000 was spending time regenerating the mesh at runtime.
   Now atmos prep checks and fixes the mesh before members launch.
GEFS pgrb2sp25 product lacks SPFH_2maboveground and PRATE_surface,
and uses PRMSL_meansealevel instead of MSLMA_meansealevel. The
datm.streams template references all 8 variables from the GFS+HRRR
blended control, causing DATM to crash at the first coupled timestep
when variables aren't found in the forcing file.

After staging datm.streams and datm_forcing.nc, read actual variable
names from the forcing NetCDF and rebuild stream_data_variables01 to
only include variables that exist. Maps both PRMSL and MSLMA to the
same CMEPS field (Sa_pslv).
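The rebuild amounts to filtering a variable→field map against what the forcing file actually contains. In this sketch, only Sa_pslv and the NetCDF variable names are taken from the fix above; the other CMEPS field names are illustrative placeholders.

```python
# NetCDF variable -> CMEPS field; PRMSL and MSLMA both map to Sa_pslv.
# Field names other than Sa_pslv are illustrative assumptions.
VAR_TO_FIELD = {
    "PRMSL_meansealevel": "Sa_pslv",
    "MSLMA_meansealevel": "Sa_pslv",
    "SPFH_2maboveground": "Sa_q2m",
    "PRATE_surface":      "Faxa_rain",
}

def build_stream_vars(forcing_vars) -> list:
    """Keep only stream entries whose source variable exists in the forcing file."""
    return [f'"{var}" "{field}"' for var, field in VAR_TO_FIELD.items()
            if var in forcing_vars]
```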
ESMF_Scrip2Unstruct (used by det prep) includes elementMask in the
mesh file, but our Python mesh generator did not. DATM's
InitializeRealize calls ESMF_MeshGet which requires element mask
info. Without it: "mesh doesn't contain element mask information"
→ crash during ATM component initialization.

Add elementMask (all zeros = unmasked) to all three Python mesh
generation sites: GEFS mesh, control mesh (atmos prep), and
runtime mesh regeneration (ensemble run).
…nodeCount

The runtime ESMF mesh validation only checked nodeCount, so a mesh
with correct dimensions but missing elementMask (generated before the
elementMask fix) was accepted as-is.  Member 000 (GFS+HRRR control)
used such a stale mesh and crashed with "mesh doesn't contain element
mask information."  Now also verify elementMask variable exists before
skipping regeneration.
…neration

The 1721x1721 control mesh (3M nodes) full regeneration was killed by
OOM (exit 137) on the PBS launch node, leaving the old mesh without
elementMask.  Split the fix into two paths: full regeneration only when
nodeCount mismatches, lightweight in-place append of elementMask when
the mesh geometry is correct but the variable is missing.
…ching

The datm.streams template and ensemble var_map used incorrect CMEPS field
names that don't match the SCHISM NUOPC cap interface:
  - Sa_t2m  → Sa_tbot  (temperature)
  - Sa_q2m  → Sa_shum  (specific humidity)
  - Faxa_swnet → Faxa_swdn (shortwave radiation)

The working deterministic scripts (modify_gfs_4_esmfmesh.py) already used
the correct names.  The wrong names caused SCHISM to receive zero values
for all atmospheric fields including wind, resulting in zero staout_2/3/4
(pressure, wind_u, wind_v) in ensemble runs.
…reams

Reverts the wrong CMEPS field name change (Sa_tbot→Sa_t2m etc) — the
deterministic run confirms Sa_t2m/Sa_q2m/Faxa_swnet are correct.

Root cause: the ensemble patching code stripped SPFH/PRATE from
datm.streams and renamed MSLMA→PRMSL, making the config differ from
the working deterministic (which has all 8 variables).  This caused
SCHISM NUOPC cap to not fully initialize atmospheric coupling,
resulting in zero wind/pressure in staout.

Fix: instead of stripping variables from datm.streams, add missing
SPFH (zeros) and PRATE (zeros) to the forcing NetCDF and rename
PRMSL→MSLMA, so the ensemble datm.streams stays identical to the
deterministic with all 8 fields.
CMEPS uses srcMaskValues=(/0/) for ATM→OCN regridding, meaning
elements with mask=0 are EXCLUDED from bilinear interpolation.
The ensemble mesh code was setting elementMask to all zeros,
causing every DATM element to be masked out. The mediator then
produced zeros for all fields sent to SCHISM — confirmed by PET
logs showing DATM sends real wind (Sa_u10m: -12.8 to 2.0) but
SCHISM receives all zeros (inst_zonal_wind_height10m: 0.0).

The DET mesh (from ESMF_Scrip2Unstruct) works because it has NO
elementMask variable — ESMF defaults to all elements active.

Fix: set elementMask=1 (active) in both full mesh regeneration
and the lightweight in-place patch paths.
The previous fix (65c6843) only fixed elementMask in the member
job's mesh regeneration/patching code. But the actual mesh used
by GEFS members is pre-generated in JNOS_OFS_ENSEMBLE_ATMOS_PREP
and archived to $COMOUT/datm_input_gefs_NN/. The member job sees
"ESMF mesh OK" (correct node count + elementMask exists) and uses
the ATMOS_PREP mesh as-is — with elementMask=0 (all masked out).

Fix: set elementMask=1 in both the GEFS global mesh generator
(line 645) and the control mesh regenerator (line 740, also add
elementMask which was previously missing).
The pre-generated GEFS ESMF mesh used a global 0.25° grid (1440×721 =
1,038,240 nodes), but the actual GEFS forcing is extracted to the SECOFS
subdomain (1721×1721 = 2,961,841 nodes). The node count mismatch caused
nos_ofs_create_datm_forcing_blended.sh to fall through to ESMF_Scrip2Unstruct,
and 4 parallel cfp tasks racing that tool caused members 02-04 to fail.

Fix: after regenerating the control mesh with correct dimensions, copy it
to GEFS_MESH so all cfp wrappers inherit the corrected mesh.
Log files now include the PDY date to distinguish runs on different days:
  secofs_ufs_ens000_12.20260311.out  (was: secofs_ufs_ens000_12.out)
  secofs_ufs_gefs_prep_12.20260311.out
  secofs_ufs_enspost_12.20260311.out

Also added explicit -o/-e for ensemble post job.
Three fixes for UFS-SECOFS ensemble GEFS atmos_prep failures:

1. Remove wrong control→GEFS mesh copy: control mesh is 1721x1721
   (SECOFS subdomain) but GEFS forcing is 1440x721 (global 0.25deg).
   Copying control mesh over GEFS mesh broke all members.

2. Pre-copy GEFS mesh into each member's blend dir as
   datm_forcing_esmf_mesh.nc before calling blended script.
   The DATM_ESMF_MESH env var is set but not visible to the
   blended script under cfp parallel execution (unknown cause).
   Pre-copying uses the blended script's existing fallback path
   at line 276, completely avoiding ESMF_Scrip2Unstruct and
   eliminating the cfp race condition that caused 4/5 members
   to fail.

3. Fix IndexError in ensemble_run.sh variable checker: control
   forcing uses x/y dimensions (from blended output) but code
   only searched for latitude/longitude. Added y/x with graceful
   fallback.
- Fix ensemble post PBS script: OFS now from qsub -v (was hardcoded
  secofs), add udunits before nco, use ${OFS} in paths, 2hr walltime
- Add parameterized schism_combine_outputs.py that reads grid dimensions
  dynamically (replaces hardcoded schism_fields_station_redo.py)
- JNOS_OFS_ENSEMBLE_POST Step 1: per-member combining produces
  fields.fNNN.nc + stations.forecast.nc matching deterministic format
- FIX file staging handles _ufs prefix variants (secofs_ufs → secofs)
- Replace CLI-args schism_combine_outputs.py with control-file version
  from feature/comf-schism-post (same pattern as nos_ofs_nowcast_forecast.sh)
- Adds UFS-Coastal fallbacks: hgrid.gr3 for missing depth/coords,
  zeros for missing windSpeedX/Y in out2d
- J-job now creates schism_standard_output.ctl per member (5-line format),
  copies FIX files to outputs dir, runs python script in CWD
- Matches the established COMF inline post-processing approach
- DET _comf_execute_ufs_coastal(): patch datm_in nx_global/ny_global to
  match actual datm_forcing.nc dimensions (was using stale template values
  1881x1841 instead of actual 1721x1721 blended grid). Regenerate ESMF
  mesh from forcing file coordinates to guarantee consistency.

- Ensemble post: accept staout_* as valid member output (UFS-Coastal
  ensemble produces only station text files, not out2d_*.nc field output).
  schism_combine_outputs.py now runs in stations-only mode when no
  out2d_*.nc files are present.

- PBS script: add gsl module before nco (required dependency on WCOSS2).
Expands atmospheric ensemble from 6 to 7 members:
- Members 001-005: GEFS perturbation forecasts (was 001-004)
- Member 006: RRFS 3km (was member 005)
All ensemble members were copying datm.streams from the DET prep,
which contains an absolute path to the DET's datm_forcing.nc. Even
though each member staged its own GEFS-specific datm_forcing.nc to
$MEMBER_DATA/INPUT/, DATM read from the DET path — producing
identical output for all GEFS members.

Fix: After staging member-specific forcing files, sed-replace the
paths in datm.streams to point to $MEMBER_DATA/INPUT/ instead of
the DET's datm_input directory.
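The path rewrite is done with sed in the ensemble script; the equivalent logic, as a minimal sketch:

```python
def repoint_datm_streams(streams_text: str, det_dir: str, member_input: str) -> str:
    """Redirect absolute forcing paths from the DET datm_input dir to the member's INPUT/."""
    # normalize trailing slashes so whole directory prefixes are replaced
    old = det_dir.rstrip("/") + "/"
    new = member_input.rstrip("/") + "/"
    return streams_text.replace(old, new)
```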
All GEFS ensemble members produced identical output because datm_in
model_meshfile and model_maskfile still pointed to the DET's ESMF mesh
(1721x1721 GFS+HRRR grid) while the member's forcing uses 1440x721
GEFS grid. The nx_global/ny_global were correctly patched but the mesh
file reference was not, causing DATM to decompose forcing incorrectly.

Now patch both datm.streams (stream_data_files01, stream_mesh_file01)
AND datm_in (model_meshfile, model_maskfile) to point to each member's
own INPUT/datm_esmf_mesh.nc and INPUT/datm_forcing.nc.
…cing

find_gfs_file() and find_gefs_file() now never use a GFS/GEFS cycle
newer than ${PDY}${cyc}. This ensures datm_forcing.nc is identical
regardless of when prep runs, matching operational SECOFS behavior.

Previously, running prep later would find newer GFS cycles with shorter
forecast hours, producing different forcing data and causing UFS-Coastal
ensemble control to diverge from operational standalone SECOFS.

Override via MAX_GFS_CYCLE env var if needed.
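The capping rule amounts to: among available cycles, take the newest one not after ${PDY}${cyc}. A sketch of that selection (the real find_gefs_file() walks COMIN directories in shell):

```python
def pick_cycle(candidates, pdy, cyc, max_gfs_cycle=None):
    """Newest YYYYMMDDHH cycle string that is <= the cap (default PDY+cyc)."""
    cap = max_gfs_cycle or f"{pdy}{cyc}"
    # for fixed-width YYYYMMDDHH strings, lexicographic order == chronological order
    eligible = [c for c in candidates if c <= cap]
    return max(eligible) if eligible else None
```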
The DET prep creates an ESMF mesh with 1722x1722 nodes (padding
row/column) for a 1721x1721 forcing grid. The ensemble script's
node count check (mesh_nodes != nx*ny) triggered full mesh
regeneration with 1721x1721 nodes, producing different interpolation
weights and causing member 000 to diverge from DET despite having
identical datm_forcing.nc.

Fix: detect control members (is_control_det flag) that stage from
the DET prep's datm_input directory, and skip mesh regeneration.
Only add elementMask if missing (required by CMEPS). Non-control
members (GEFS, RRFS) continue to use mesh regeneration as before.
The DET model run (_comf_execute_ufs_coastal) always regenerates
the ESMF mesh from datm_forcing.nc at runtime. The ensemble run
was using the prep's mesh from $COMOUT, which is generated by a
different method and has different coordinate values. This caused
member 000 to diverge from DET despite having identical forcing.

Fix: always regenerate the mesh from the forcing file using the
same Python code as the DET model run. This ensures identical
interpolation weights for all members. The mesh generation code
now also handles both coordinate conventions (1D lon/lat arrays
and 2D x/y arrays) matching nos_ofs_model_run.sh exactly.

Replaces the previous approach of skipping regeneration for
control members, which was incorrect.
All three mesh generation points now use the prep's SCRIP-based mesh
(proc_scrip.py + ESMF_Scrip2Unstruct) instead of inline Python center-
based mesh generation. This ensures DET and ensemble member 000 use
identical interpolation weights, eliminating the divergence caused by
different mesh coordinate conventions.

Changes:
- JNOS_OFS_ENSEMBLE_ATMOS_PREP: Use SCRIP pipeline for GEFS mesh;
  preserve control mesh from prep (only add elementMask if missing)
- nos_ofs_model_run.sh: Remove Python mesh regeneration from
  _comf_execute_ufs_coastal(); keep datm_in nx/ny patching and
  add elementMask check only
- nos_ofs_ensemble_run.sh: Remove "always regenerate" mesh block;
  use prep's staged mesh with elementMask check only
The skip-check used || (OR), so having just stations.forecast.nc
(from inline post) caused all members to skip combining. Changed
to && (AND) so it only skips when both field and station files exist.
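The corrected predicate, in boolean form (a trivial sketch of the shell test's logic):

```python
def should_skip_combining(has_fields: bool, has_stations: bool) -> bool:
    # skip only when BOTH outputs already exist (was: either one, via ||)
    return has_fields and has_stations
```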
Ensemble launcher, member, and atmos_prep PBS scripts used
rpt/v3.7.0 while DET UFS scripts used rpt/secofs_ufs. Now all
use rpt/${OFS} so logs are in the same directory.
Remove ${nosofs_ver} from WORKDIR paths so all jobs (DET, UFS,
ensemble) use the same work directory structure: work/secofs_ufs
for UFS, work/secofs for standalone.
Remove ${nosofs_ver} from DATAROOT paths to match prep scripts.
All jobs now create DATA under work/${OFS}/ instead of some using
work/v3.7.0/${OFS}/.
The Python netCDF4 dimension reader was failing silently due to
LD_PRELOAD=libnetcdff.so conflicts (memory item #6/#8). This caused
the entire datm_in patching AND elementMask addition to be skipped.

Fix: Use ncdump -h (always available, no LD_PRELOAD issues) to read
forcing file dimensions. For elementMask Python, run in subshell
with LD_PRELOAD unset.
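Reading dimensions from `ncdump -h` output sidesteps importing netCDF4 under a hostile LD_PRELOAD. The parsing amounts to the following — a sketch of the idea, not the script's exact shell pipeline:

```python
import re

def dims_from_ncdump(header: str) -> dict:
    """Parse 'name = N ;' entries from the dimensions: section of `ncdump -h` output."""
    dims, in_dims = {}, False
    for line in header.splitlines():
        stripped = line.strip()
        if stripped.startswith("dimensions:"):
            in_dims = True
        elif stripped.startswith("variables:"):
            break
        elif in_dims:
            m = re.match(r"(\w+)\s*=\s*(\d+)\s*;", stripped)
            if m:
                dims[m.group(1)] = int(m.group(2))
    return dims
```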
Checks forcing dims, ESMF mesh consistency (node count, elementMask,
coordinates), datm_in nx/ny match, and DET vs member 000 identity.
SCRIP meshes for global grids (GEFS 0.25°) wrap in longitude, so
node count = nx*(ny+1) not (nx+1)*(ny+1). Also accept PRMSL as
equivalent to MSLMA (renamed at runtime by ensemble run script).
CDEPS tintalgo=linear needs data records bracketing each model
timestep. Without buffer, the last GEFS forcing record (20 of 20)
coincided exactly with model end time, causing:
  shr_stream_findBounds ERROR: limit on and rDateIn gt rDategvd

Adding 3h buffer ensures at least one extra record past forecast end.
DET already had enough hourly records; this mainly fixes GEFS members
which use 3-hourly data (20 records → 21 records).
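The record-count arithmetic behind the buffer, assuming records at a fixed interval starting at hour 0 (the 57 h span is inferred from 20 three-hourly records ending exactly at forecast end):

```python
def forcing_records(forecast_hours: int, interval_h: int, buffer_h: int = 3) -> int:
    """Records covering [0, forecast_hours + buffer_h] at interval_h spacing."""
    # tintalgo=linear needs a record strictly past the last model timestep,
    # hence the buffer beyond forecast end
    return (forecast_hours + buffer_h) // interval_h + 1
```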
schism_combine_outputs.py: Add convert_schout_to_split() to auto-detect
and convert combined schout_*.nc format to split files (out2d, temperature,
salinity, horizontalVelX/Y) before processing. Existing split format
(out2d_*.nc) continues to work unchanged.

JNOS_OFS_ENSEMBLE_POST: Update file existence checks to also look for
schout_1.nc when deciding whether a member has field outputs, and copy
FIX files (nv.nc, hgrid.gr3) when schout format is detected.