adds cron scripts for nightly tests #482
base: main
Changes from all commits
760ff9a
e0b6827
bebd6c1
b6c1ba0
0666164
c12641d
9d683ae
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,102 @@ | ||||||
| # OMEGA Cron Scripts | ||||||
|
|
||||||
| Automated cron job scripts for continuous testing and CDash reporting of OMEGA ocean modeling projects across multiple HPC systems. | ||||||
|
|
||||||
| ## Overview | ||||||
|
|
||||||
| This repository orchestrates compiling and testing two OMEGA ocean model components and submitting the results to [CDash](https://my.cdash.org): | ||||||
|
|
||||||
| - **Omega** - Next-generation ocean model | ||||||
|
Collaborator: Suggested change
Contributor (Author): updated. |
||||||
| - **Polaris** - MPAS-Ocean model with Omega integration | ||||||
|
Collaborator: Suggested change
Contributor (Author): updated. |
||||||
|
|
||||||
| ## Supported Systems | ||||||
|
|
||||||
| | Machine | Location | Compilers | | ||||||
| |---------|----------|-----------| | ||||||
| | Frontier | ORNL | craygnu, craycray, crayamd (with mphipcc variants) | | ||||||
| | Chrysalis | ANL (LCRC) | gnu, intel | | ||||||
| | pm-gpu | NERSC (Perlmutter GPU) | gnugpu | | ||||||
| | pm-cpu | NERSC (Perlmutter CPU) | gnu | | ||||||
|
|
||||||
| ## Repository Structure | ||||||
|
|
||||||
| ``` | ||||||
| cron-scripts/ | ||||||
| ├── launch_all.sh # Main entry point | ||||||
| ├── machines/ # Machine-specific configurations | ||||||
| │ ├── config_machine.sh # Auto-detection dispatcher | ||||||
| │ ├── config_frontier.sh | ||||||
| │ ├── config_chrysalis.sh | ||||||
| │ ├── config_pm-gpu.sh | ||||||
| │ └── config_pm-cpu.sh | ||||||
| └── tasks/ # Scheduled job definitions | ||||||
| ├── omega_cdash/ # Omega model CDash testing | ||||||
| │ ├── launch_omega_cdash.sh | ||||||
| │ └── job_*.sbatch | ||||||
| └── polaris_cdash/ # Polaris model CDash testing | ||||||
| ├── launch_polaris_ctest.sh | ||||||
| ├── polaris_cdash.py | ||||||
| └── CTestScript.txt | ||||||
| ``` | ||||||
|
|
||||||
| ## Usage | ||||||
|
|
||||||
| ### Run on auto-detected machine | ||||||
|
|
||||||
| ```bash | ||||||
| ./launch_all.sh | ||||||
| ``` | ||||||
|
|
||||||
| ### Run on a specific machine | ||||||
|
|
||||||
| ```bash | ||||||
| ./launch_all.sh -m frontier | ||||||
| ./launch_all.sh -m chrysalis | ||||||
| ./launch_all.sh -m pm-gpu | ||||||
| ./launch_all.sh -m pm-cpu | ||||||
| ``` | ||||||
|
|
||||||
| ### Set up in crontab | ||||||
|
|
||||||
| ```bash | ||||||
| # Run daily at 1 AM | ||||||
| 0 1 * * * /path/to/cron-scripts/launch_all.sh | ||||||
| ``` | ||||||
|
|
||||||
| ## How It Works | ||||||
|
|
||||||
| 1. `launch_all.sh` auto-detects the machine via hostname or accepts a `-m` flag | ||||||
| 2. Sources the appropriate machine configuration (compilers, paths, modules) | ||||||
| 3. Uses file locking to prevent concurrent executions | ||||||
| 4. Discovers and executes all `launch*.sh` scripts in task subdirectories | ||||||
| 5. Each task clones/updates repos, submits SBATCH jobs, and reports to CDash | ||||||
|
|
||||||
| ## Environment Variables | ||||||
|
|
||||||
| | Variable | Description | | ||||||
| |----------|-------------| | ||||||
| | `CRONJOB_BASEDIR` | Root directory for job outputs | | ||||||
| | `CRONJOB_MACHINE` | Detected/specified machine name | | ||||||
| | `CRONJOB_LOGDIR` | Log directory, set to `${CRONJOB_BASEDIR}/logs` by `launch_all.sh` | | ||||||
| | `E3SM_COMPILERS` | Space-separated list of compilers to test | | ||||||
|
|
||||||
| ## Adding a New Machine | ||||||
|
|
||||||
| 1. Create `machines/config_<machine>.sh` with: | ||||||
| - `CRONJOB_BASEDIR` path | ||||||
| - `E3SM_COMPILERS` list | ||||||
| - Module loads and environment setup | ||||||
| 2. Add hostname pattern to `machines/config_machine.sh` | ||||||
| 3. Create machine-specific SBATCH scripts in task directories if needed | ||||||
|
|
||||||
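For reference, here is a minimal sketch of what a `machines/config_<machine>.sh` file could look like, modeled on the existing configs in this PR. The machine name `mymachine`, the `SCRATCH_ROOT` variable with its `/tmp` default, and the module names are placeholders for illustration and are not part of the PR.

```bash
#!/usr/bin/env bash
# machines/config_mymachine.sh: hypothetical example, not part of this PR
set -eo pipefail

# Load whatever the site needs to build and test (placeholder module names).
# Commented out so the sketch also runs on systems without environment modules.
# module load cmake python

# SCRATCH_ROOT is an assumed helper variable; point it at the site's scratch filesystem.
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp}"

# Root directory for clones, builds, and logs
export CRONJOB_BASEDIR="${SCRATCH_ROOT}/${USER:-unknown}/cronjobs"

# Space-separated list of compilers to test on this machine
export E3SM_COMPILERS="gnu"

mkdir -p "$CRONJOB_BASEDIR"
```

The matching hostname pattern would then go into the `case` statement in `machines/config_machine.sh` as one more branch that sets `CRONJOB_MACHINE="mymachine"`.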
| ## Adding a New Task | ||||||
|
|
||||||
| 1. Create a new directory under `tasks/` | ||||||
| 2. Add a `launch_<taskname>.sh` script | ||||||
| 3. The script will be auto-discovered and executed by `launch_all.sh` | ||||||
|
|
||||||
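Auto-discovery depends on the `find` call in `launch_all.sh`, which only matches files named `launch*.sh` exactly one directory below `tasks/`. The sketch below wraps that same expression in a helper (the name `discover_tasks` and the demo file names are made up here) and runs it against a throwaway tree to show what is and is not picked up.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Same find expression as launch_all.sh, wrapped in a hypothetical helper.
discover_tasks() {
    local tasks_dir="$1"
    find "$tasks_dir" -mindepth 2 -maxdepth 2 \
        -type f -name 'launch*.sh' | sort
}

# Build a throwaway tasks/ tree for the demonstration.
demo="$(mktemp -d)"
mkdir -p "$demo/tasks/my_task/helpers" "$demo/tasks/other_task"
touch "$demo/tasks/my_task/launch_my_task.sh"          # discovered
touch "$demo/tasks/my_task/helpers/launch_nested.sh"   # too deep, skipped
touch "$demo/tasks/other_task/run_other.sh"            # wrong name, skipped
touch "$demo/tasks/launch_top.sh"                      # too shallow, skipped

discover_tasks "$demo/tasks"
```

Since `launch_all.sh` invokes each script with `/bin/bash "$script"`, the executable bit is not required. Because it also runs under `set -e`, a non-zero exit from one launch script stops the remaining ones.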
| ## CDash Integration | ||||||
|
|
||||||
| Test results are submitted to: | ||||||
| - E3SM project: https://my.cdash.org/submit.php?project=E3SM | ||||||
| - Omega project: https://my.cdash.org/submit.php?project=omega | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| #!/usr/bin/env bash | ||
| set -eo pipefail | ||
|
|
||
| HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | ||
| SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")" | ||
|
|
||
| # --- Parse command-line arguments --- | ||
| CLI_MACHINE="" | ||
| while [[ $# -gt 0 ]]; do | ||
| case "$1" in | ||
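| -m|--machine) | ||
| # Guard against a missing value so "shift 2" cannot fail silently under set -e | ||
| if [[ $# -lt 2 ]]; then | ||
| echo "ERROR: '$1' requires a machine name" >&2 | ||
| exit 1 | ||
| fi | ||
| CLI_MACHINE="$2" | ||
| shift 2 | ||
| ;; | ||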
| *) | ||
| echo "ERROR: Unknown option '$1'" >&2 | ||
| echo "Usage: $SCRIPT_NAME [-m|--machine MACHINE_NAME]" >&2 | ||
| exit 1 | ||
| ;; | ||
| esac | ||
| done | ||
|
|
||
| echo "[$(date)] Starting $SCRIPT_NAME" | ||
|
|
||
| # set CRONJOB_BASEDIR and machine-specific variables | ||
| # pass -m through so config_machine.sh uses CLI override if provided | ||
| if [[ -n "$CLI_MACHINE" ]]; then | ||
| source "${HERE}/machines/config_machine.sh" -m "$CLI_MACHINE" | ||
| else | ||
| source "${HERE}/machines/config_machine.sh" | ||
| fi | ||
|
|
||
| export CRONJOB_LOGDIR="${CRONJOB_BASEDIR}/logs" | ||
| mkdir -p "$CRONJOB_LOGDIR" | ||
|
|
||
| export CRONJOB_DATE=$(date +"%d") | ||
| export CRONJOB_TIME=$(date +"%T") | ||
|
|
||
| LOCKFILE="/tmp/${USER}_cronjob.lock" | ||
| exec 9>"$LOCKFILE" | ||
| if ! flock -n 9; then | ||
| echo "[$(date)] launch_all.sh is already running, exiting." | ||
| exit 0 | ||
| fi | ||
| #LOCKFILE="${HERE}/cronjob.lock" | ||
| #exec 9>"$LOCKFILE" | ||
| #if ! flock -n 9; then | ||
| # echo "[$(date)] launch_all.sh is already running, exiting." | ||
| # exit 0 | ||
| #fi | ||
|
|
||
| # Run all launch*.sh scripts under immediate subdirectories of $HERE/tasks | ||
| while IFS= read -r script; do | ||
| /bin/bash "$script" | ||
| done < <( | ||
| find "$HERE/tasks" -mindepth 2 -maxdepth 2 \ | ||
| -type f -name 'launch*.sh' | sort | ||
| ) | ||
|
|
||
| echo "[$(date)] Finished $SCRIPT_NAME" |
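The locking step uses `flock -n` on file descriptor 9, so a second cron invocation gives up immediately instead of queuing behind the first. The sketch below shows that behavior in isolation, assuming the util-linux `flock` tool is available. The `run_exclusive` helper, the temporary lock file, and the exit code 99 are choices made for this illustration and do not appear in the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command only if the lock is free.
# Returns 99 (an arbitrary value chosen for this sketch) when another
# process already holds the lock.
run_exclusive() {
    local lockfile="$1"
    shift
    (
        flock -n 9 || exit 99   # non-blocking: give up at once if locked
        "$@"
    ) 9>"$lockfile"
}

lock="$(mktemp)"

# First caller takes the lock and holds it for two seconds.
run_exclusive "$lock" sleep 2 &
sleep 0.5

# Second caller cannot get the lock while the first is still running.
rc=0
run_exclusive "$lock" true || rc=$?
echo "second caller exit code while locked: $rc"

wait
# Once the first caller has finished, the lock is free again.
rc_after=0
run_exclusive "$lock" true || rc_after=$?
echo "exit code after release: $rc_after"
rm -f "$lock"
```

The lock is released when the descriptor is closed, that is, when the process exits, so no cleanup is needed after a crash. One caveat: `/tmp` is usually local to a node, so a lock file there would not stop a second run started from a different login node. The commented-out variant that places the lock under `${HERE}` would cover that case if `${HERE}` sits on a shared filesystem.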
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| #!/usr/bin/env bash | ||
| set -eo pipefail | ||
|
|
||
| source /etc/bashrc | ||
|
|
||
| export CRONJOB_BASEDIR=/lcrc/globalscratch/${USER}/cronjobs | ||
| export E3SM_COMPILERS="gnu intel" | ||
|
|
||
| mkdir -p "$CRONJOB_BASEDIR" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| set -eo pipefail | ||
|
|
||
| module load cray-python cmake | ||
|
|
||
| export all_proxy=socks://proxy.ccs.ornl.gov:3128/ | ||
| export ftp_proxy=ftp://proxy.ccs.ornl.gov:3128/ | ||
| export http_proxy=http://proxy.ccs.ornl.gov:3128/ | ||
| export https_proxy=http://proxy.ccs.ornl.gov:3128/ | ||
| export no_proxy='localhost,127.0.0.0/8,*.ccs.ornl.gov' | ||
|
|
||
| export CRONJOB_BASEDIR=/lustre/orion/cli115/scratch/${USER}/cronjobs | ||
| export E3SM_COMPILERS="craygnu-mphipcc craycray-mphipcc crayamd-mphipcc craygnu craycray crayamd" | ||
|
|
||
| mkdir -p "$CRONJOB_BASEDIR" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| #!/usr/bin/env bash | ||
| set -eo pipefail | ||
| SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | ||
|
|
||
| # --- Parse command-line arguments --- | ||
| usage() { | ||
| echo "Usage: $(basename "$0") [-m|--machine MACHINE_NAME] [-h|--help]" | ||
| echo " -m, --machine Override the auto-detected machine name" | ||
| echo " -h, --help Show this help message" | ||
| exit "${1:-0}" | ||
| } | ||
|
|
||
| CLI_MACHINE="" | ||
| while [[ $# -gt 0 ]]; do | ||
| case "$1" in | ||
| -m|--machine) | ||
| CLI_MACHINE="$2" | ||
| shift 2 | ||
| ;; | ||
| -h|--help) | ||
| usage 0 | ||
| ;; | ||
| *) | ||
| echo "ERROR: Unknown option '$1'" >&2 | ||
| usage 1 | ||
| ;; | ||
| esac | ||
| done | ||
|
|
||
| # --- Get a stable hostname / FQDN (try multiple methods) --- | ||
| get_fqdn() { | ||
| local fqdn="" | ||
| fqdn="$(hostname -f 2>/dev/null || true)" | ||
| if [[ -z "$fqdn" || "$fqdn" == "(none)" ]]; then | ||
| fqdn="$(hostname --fqdn 2>/dev/null || true)" | ||
| fi | ||
| if [[ -z "$fqdn" || "$fqdn" == "(none)" ]]; then | ||
| fqdn="$(hostname 2>/dev/null || true)" | ||
| fi | ||
| echo "$fqdn" | ||
| } | ||
|
|
||
| FQDN="$(get_fqdn)" | ||
|
|
||
| # --- Determine CRONJOB_MACHINE --- | ||
| if [[ -n "$CLI_MACHINE" ]]; then | ||
| # Command-line argument takes highest priority | ||
| CRONJOB_MACHINE="$CLI_MACHINE" | ||
| else | ||
| # Fall back to FQDN-based detection | ||
| CRONJOB_MACHINE="unknown" | ||
| case "$FQDN" in | ||
| *.frontier.olcf.ornl.gov) | ||
| CRONJOB_MACHINE="frontier" | ||
| ;; | ||
| *.polaris.alcf.anl.gov) | ||
| CRONJOB_MACHINE="polaris" | ||
| ;; | ||
| *.perlmutter.nersc.gov) | ||
| CRONJOB_MACHINE="pm-gpu" | ||
| ;; | ||
| *.lcrc.anl.gov) | ||
| CRONJOB_MACHINE="chrysalis" | ||
| ;; | ||
| esac | ||
| fi | ||
|
|
||
| export CRONJOB_MACHINE | ||
| echo "FQDN=$FQDN" | ||
| echo "CRONJOB_MACHINE=$CRONJOB_MACHINE" | ||
|
|
||
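| # Fail early with a clear message if no config exists for this machine | ||
| MACHINE_CONFIG="${SCRIPT_DIR}/config_${CRONJOB_MACHINE}.sh" | ||
| if [[ ! -f "$MACHINE_CONFIG" ]]; then | ||
| echo "ERROR: no machine config found at $MACHINE_CONFIG" >&2 | ||
| exit 1 | ||
| fi | ||
| source "$MACHINE_CONFIG" |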
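To make the hostname matching easier to reason about, the sketch below lifts the `case` statement into a standalone function and runs it on a few sample names. The function name `detect_machine` and the example hostnames are made up for illustration; the patterns are copied from the script above.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the case statement in config_machine.sh.
detect_machine() {
    local fqdn="$1"
    case "$fqdn" in
        *.frontier.olcf.ornl.gov) echo "frontier" ;;
        *.polaris.alcf.anl.gov)   echo "polaris" ;;
        *.perlmutter.nersc.gov)   echo "pm-gpu" ;;
        *.lcrc.anl.gov)           echo "chrysalis" ;;
        *)                        echo "unknown" ;;
    esac
}

# Example hostnames are invented for this demonstration.
for host in login01.frontier.olcf.ornl.gov \
            login12.chn.perlmutter.nersc.gov \
            chrlogin1.lcrc.anl.gov \
            my-laptop.local; do
    printf '%-40s -> %s\n' "$host" "$(detect_machine "$host")"
done
```

Two things follow from the patterns. Perlmutter hosts always resolve to `pm-gpu`, so the CPU configuration is only reachable with `-m pm-cpu`. The names `polaris` and `unknown` have no `config_*.sh` among the files added in this PR.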
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| set -eo pipefail | ||
|
|
||
| module load cray-python cmake | ||
|
|
||
| export CRONJOB_BASEDIR=/pscratch/sd/${USER:0:1}/${USER}/omega/cronjobs_pm-cpu | ||
| export E3SM_COMPILERS="gnu" | ||
|
|
||
| mkdir -p "$CRONJOB_BASEDIR" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| set -eo pipefail | ||
|
|
||
| module load cray-python cmake | ||
|
|
||
| export CRONJOB_BASEDIR=/pscratch/sd/${USER:0:1}/${USER}/omega/cronjobs_pm-gpu | ||
| export E3SM_COMPILERS="gnugpu" | ||
|
|
||
| mkdir -p "$CRONJOB_BASEDIR" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| #!/bin/bash -l | ||
| #SBATCH --nodes=1 | ||
| #SBATCH --qos=high | ||
| #SBATCH --time 02:00:00 | ||
|
|
||
| source /etc/bashrc | ||
|
|
||
| exec bash "$(dirname "$0")/run_omega_cdash.sh" |
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
| @@ -0,0 +1,10 @@ | ||||
| #!/bin/bash -l | ||||
| #SBATCH --nodes=1 | ||||
| #SBATCH -q debug | ||||
| #SBATCH --account=cli115 | ||||
|
Collaborator: Should we get
Contributor (Author): @cbegeman, I think it would be better to get the account from a common place. Do any of the
Collaborator: It looks like they don't, but they could/should, e.g., polaris/polaris/job/__init__.py Line 76 in 1981a58
|
||||
| #SBATCH --time 02:00:00 | ||||
|
|
||||
|
|
||||
| source /etc/bashrc | ||||
|
|
||||
| exec bash "$(dirname "$0")/run_omega_cdash.sh" | ||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| #!/bin/bash -l | ||
| #SBATCH --job-name=OmegaSCron | ||
| #SBATCH --nodes=1 | ||
| #SBATCH --ntasks-per-node=64 | ||
| #SBATCH --output=/global/cfs/cdirs/e3sm/omega/cronjobs_pm-cpu/logs/OmegaSCronCPU_%j.out | ||
| #SBATCH --error=/global/cfs/cdirs/e3sm/omega/cronjobs_pm-cpu/logs/OmegaSCronCPU_%j.err | ||
| #SBATCH --constraint=cpu | ||
| #SBATCH --account=e3sm | ||
| #SBATCH --qos regular | ||
| #SBATCH --exclusive | ||
| #SBATCH --time 01:00:00 | ||
|
|
||
|
|
||
| source /etc/bashrc | ||
|
|
||
| exec bash "$(dirname "$0")/run_omega_cdash.sh" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| #!/bin/bash -l | ||
| #SBATCH --job-name=OmegaSCron | ||
| #SBATCH --nodes=1 | ||
| #SBATCH --ntasks-per-node=8 | ||
| #SBATCH --gpus-per-node=4 | ||
| #SBATCH --output=/global/cfs/cdirs/e3sm/omega/cronjobs_pm-gpu/logs/OmegaSCronGPU_%j.out | ||
| #SBATCH --error=/global/cfs/cdirs/e3sm/omega/cronjobs_pm-gpu/logs/OmegaSCronGPU_%j.err | ||
| #SBATCH --constraint=gpu | ||
| #SBATCH --account=e3sm_g | ||
| #SBATCH --qos regular | ||
| #SBATCH --exclusive | ||
| #SBATCH --time 01:00:00 | ||
|
|
||
|
|
||
| source /etc/bashrc | ||
|
|
||
| exec bash "$(dirname "$0")/run_omega_cdash.sh" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| #!/usr/bin/env bash | ||
| set -eo pipefail | ||
|
|
||
| HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | ||
| SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")" | ||
| echo "[$(date)] Starting $SCRIPT_NAME" | ||
|
|
||
| export OMEGA_CDASH_BASEDIR="${CRONJOB_BASEDIR}/tasks/omega_cdash" | ||
| export TESTROOT="${OMEGA_CDASH_BASEDIR}/tests" | ||
| mkdir -p "$OMEGA_CDASH_BASEDIR" | ||
| mkdir -p "$TESTROOT" | ||
|
|
||
| export OMEGA_HOME="${OMEGA_CDASH_BASEDIR}/Omega" | ||
|
|
||
| if [[ ! -d $OMEGA_HOME ]]; then | ||
| cd "${OMEGA_CDASH_BASEDIR}" | ||
| git clone https://github.com/E3SM-Project/Omega.git | ||
| fi | ||
|
|
||
| cd "${OMEGA_HOME}" | ||
| git checkout develop | ||
| git fetch origin | ||
| git reset --hard origin/develop | ||
| git submodule update --init --recursive || true | ||
|
|
||
| if [[ ! -f ${TESTROOT}/OmegaMesh.nc ]]; then | ||
| wget -O ${TESTROOT}/OmegaMesh.nc https://web.lcrc.anl.gov/public/e3sm/inputdata/ocn/mpas-o/oQU240/ocean.QU.240km.151209.nc | ||
| fi | ||
|
|
||
| if [[ ! -f ${TESTROOT}/OmegaSphereMesh.nc ]]; then | ||
| wget -O ${TESTROOT}/OmegaSphereMesh.nc https://web.lcrc.anl.gov/public/e3sm/polaris/ocean/polaris_cache/global_convergence/icos/cosine_bell/Icos480/init/initial_state.230220.nc | ||
| fi | ||
|
|
||
| if [[ ! -f ${TESTROOT}/OmegaPlanarMesh.nc ]]; then | ||
| wget -O ${TESTROOT}/OmegaPlanarMesh.nc https://gist.github.com/mwarusz/f8caf260398dbe140d2102ec46a41268/raw/e3c29afbadc835797604369114321d93fd69886d/PlanarPeriodic48x48.nc | ||
| fi | ||
|
|
||
| sbatch \ | ||
| --job-name=OmegaCdash \ | ||
| --output="$CRONJOB_LOGDIR/omega_cdash_%j.out" \ | ||
| --error="$CRONJOB_LOGDIR/omega_cdash_%j.err" \ | ||
| "${HERE}/job_${CRONJOB_MACHINE}_omega_cdash.sbatch" | ||
|
|
||
| echo "[$(date)] Finished $SCRIPT_NAME" |
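One thing to watch with the `wget -O` calls: `wget` creates the output file before the transfer finishes, so a failed download can leave an empty or partial `.nc` file that satisfies the `-f` check on the next run. A possible hardening, sketched here with a hypothetical `fetch_if_missing` helper (not part of the PR), is to download to a temporary name and rename only on success.

```bash
#!/usr/bin/env bash
# Hypothetical helper: download to a temporary name, then rename on success.
fetch_if_missing() {
    local dest="$1" url="$2"
    local tmp="${dest}.part.$$"

    # -s: file exists and is non-empty, so a zero-byte leftover is retried
    [[ -s "$dest" ]] && return 0

    if wget -q -O "$tmp" "$url"; then
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        echo "ERROR: download failed: $url" >&2
        return 1
    fi
}

# Demonstration without network access: an existing file is left untouched.
# The URL below is a placeholder and is never contacted.
demo="$(mktemp -d)"
echo "cached" > "$demo/mesh.nc"
fetch_if_missing "$demo/mesh.nc" "https://example.invalid/mesh.nc" \
    && echo "already present, no download"
```

In the launch script, each `if [[ ! -f ... ]]; then wget ...; fi` block could then become a single call such as `fetch_if_missing "${TESTROOT}/OmegaMesh.nc" "<mesh URL>"`, with the URLs unchanged.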