updates to factor handling, MoD/MESS map generation and error messaging #46
spearmangillman wants to merge 6 commits into main
Conversation
…ist; fix MoD/MESS restriction masking - categorical vars are simply removed from the MoD and MESS calculation and a warning is written to the run log
…rror/warning messaging; updated NA filtering to improve accuracy and consistency across models
| Cohort / File(s) | Summary |
|---|---|
| Train/Test & CV Splitting (`src/05-training-testing-data-prep-functions.R`, `src/5-prepare-training-testing-data.R`) | Major rewrite of `testTrainSplit` and `crossValidationSplit`: stronger input validation, explicit ratio-presence/absence branches, balanced/random sampling paths, clearer background removal/reattachment, expanded factor-level warnings (pandoc table), and explicit NA-defaulting for split/CV options. |
| Model Fitting & CV guards (`src/08-fit-model-functions.R`, `src/8-fit-gam.R`) | GAM fit wrapped in `tryCatch` (returns `NULL` on failure); CV fold handling now filters NA predictions, emits warnings for excluded sites, and stops with informative errors when folds yield no usable predictions or insufficient classes. |
| Apply-model / MESS & MoD (`src/9-apply-model.R`) | MESS/MoD computation limited to continuous covariates: introduces `contVars = setdiff(modVars, messFactorVars)`, updates prep/prediction to use `contVars`, rescales/masks to match the prediction raster, and gates outputs and warnings when categorical vars are present. |
| Ensemble messaging (`src/10-ensemble-model.py`) | Formatting change to constant-raster warning text (capitalization and prefixed newline); no behavioral change to normalization logic. |
| Manifest / DataSheet metadata (`src/package.xml`) | Updated several DataSheet `displayName` values and added `Parameter2`/`Parameter2Values` columns for `brtTuning` and `rfTuning`. |
Sequence Diagram(s)
(Skipped)
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
- apply model revamp; updates to brt fitting and brt hyperparameter tuning; minor bug fixes and general code clean-up #38: Overlaps model-fitting and CV logic edits in `src/08-fit-model-functions.R` (tryCatch / per-fold guards).
- maxent updates and other minor bug fixes #39: Edits to fitModel/CV handling and apply-model paths that intersect with CV and MESS/MoD changes.
- spg code cleanup, apply and ensemble model efficiency improvements, maxent cat vars fix #45: Overlapping changes touching train/test splitting, CV handling, and apply/ensemble behavior.
Suggested reviewers
- katieb1
- afilazzola
Poem
"I hop through code and tidy splits with care,
catching GAM fits that crumble in the air,
I filter NAs, sort continuous from the rest,
tune two parameters to help models do their best. 🐰✨"
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main changes across multiple files: factor handling improvements, MoD/MESS map generation refactoring, and enhanced error messaging throughout the codebase. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
- Create PR with unit tests
- Commit unit tests in branch `dev`
📝 Coding Plan
- Generate coding plan for human review comments
Comment @coderabbitai help to get the list of available commands and usage tips.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 288-290: The absence-data guard currently uses !is.null(stratify)
which is always TRUE because stratify is a logical; change the condition to
check for an actual TRUE stratification (e.g., isTRUE(stratify) or stratify ==
TRUE) so the stop("Cross Validation requires absence data.") only fires when
stratified CV is requested; update the if that inspects sum(as.numeric(response)
== 0) to use isTRUE(stratify) and keep the rest of the logic unchanged.
- Around line 90-97: The warning message built in the paste0 call inside the
conditional checking length(response) renders "/n" literally; replace those
literal "/n" fragments with proper newline escapes "\\n" (i.e., "\n" in the R
string) so updateRunLog receives real newlines; modify the paste0 block that
references dat and response (the block using updateRunLog and ncol(dat) - 7) to
use "\n" between lines (and ensure the string quotes remain valid).
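The stratify pitfall flagged above is easy to reproduce: a logical flag is never `NULL`, so `!is.null()` is not a stratification check. A minimal sketch of the failing and corrected guards (variable name is illustrative, not the actual function's argument):

```r
# stratify is a logical flag, so !is.null(stratify) is TRUE for both values
stratify <- FALSE
stopifnot(!is.null(stratify))      # always TRUE: the buggy guard fires regardless

# isTRUE() only passes for a single, non-NA TRUE, which is the intended check
stopifnot(isTRUE(TRUE))
stopifnot(!isTRUE(FALSE))
stopifnot(!isTRUE(NA))
stopifnot(!isTRUE(c(TRUE, TRUE)))  # vectors are rejected too
```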
In `@src/08-fit-model-functions.R`:
- Around line 1165-1176: The per-fold NA-filtering currently drops entries from
y_i, u_i, and weights.subset but leaves the corresponding rows in cv.train.data
/ data$predicted, so downstream functions like makeModelEvalPlots() rebuild fold
stats from out$cvResults$cv.train.data and can fail or produce mismatched
counts; when valid_i is computed (the !is.na(u_i) mask) apply the same
subsetting to the fold's cv.train.data (and any data$predicted slice written for
that fold) so the stored training/prediction rows match y_i/u_i/weights.subset,
and mirror this change in the similar block around the other occurrence (lines
~1208-1217) so both places filter cv.train.data consistently.
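The core of that fix is that every parallel per-fold structure must be filtered by the same mask. A hedged sketch with generic names (the actual fold objects in the reviewed code differ):

```r
# Parallel per-fold structures: predictions, observations, weights, and the
# stored training-data rows must stay aligned after NA filtering.
u_i <- c(0.2, NA, 0.7, NA, 0.9)      # predictions (some NA)
y_i <- c(0, 1, 1, 0, 1)              # observed responses
w_i <- c(1, 1, 2, 1, 1)              # weights
fold_rows <- data.frame(site = 1:5)  # stand-in for cv.train.data rows

valid_i <- !is.na(u_i)               # one mask, applied everywhere
u_i <- u_i[valid_i]
y_i <- y_i[valid_i]
w_i <- w_i[valid_i]
fold_rows <- fold_rows[valid_i, , drop = FALSE]

# All structures now have matching lengths, so downstream fold stats line up
stopifnot(length(u_i) == length(y_i),
          length(u_i) == length(w_i),
          length(u_i) == nrow(fold_rows))
```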
In `@src/10-ensemble-model.py`:
- Around line 160-162: The f-string that builds the constant-raster warning
currently begins with a stray "1\n" (f'1\nWarning: Raster {i+1} ...'), causing
an unwanted "1" to be logged; remove the leading "1\n" from that f-string so it
starts with "Warning: Raster {i+1} ..." (edit the f-string where the
constant-raster warning is constructed/issued).
In `@src/9-apply-model.R`:
- Around line 723-739: contVars currently uses factorVars which was earlier
built to exclude mod$catVars for Maxent, so Maxent categorical predictors are
wrongly treated as continuous for MESS/MoD; fix by constructing a map-specific
categorical set (e.g. mapFactorVars <- union(factorVars, mod$catVars)) and then
compute contVarsForMaps <- setdiff(modVars, mapFactorVars) and use
contVarsForMaps (instead of contVars) when slicing trainingData and calling
prepare_mess, and update the warning logic to reference the new set; apply the
same change where MESS/MoD output gating is done (the block around lines
955-964).
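The proposed `mapFactorVars` / `contVarsForMaps` construction is plain set algebra; a sketch with made-up predictor names (in the reviewed code these come from the fitted model object):

```r
# Hypothetical predictor sets; illustrative only
modVars    <- c("elev", "slope", "landcover", "soil", "precip")
factorVars <- c("landcover")  # categoricals known to the generic path
maxentCats <- c("soil")       # stand-in for mod$catVars (Maxent-only)

# Map-specific categorical set: anything categorical from either source
mapFactorVars <- union(factorVars, maxentCats)

# Continuous variables eligible for MESS/MoD
contVarsForMaps <- setdiff(modVars, mapFactorVars)

stopifnot(identical(contVarsForMaps, c("elev", "slope", "precip")))
```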
In `@src/package.xml`:
- Line 166: The displayName for the rfTuning dataSheet is missing a space before
the bracket; update the displayName attribute on the dataSheet with name
"rfTuning" from "Random Forest[Tuning]" to "Random Forest [Tuning]" so the UI
shows a consistent user-facing label.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 84e7e1ca-0aa4-4775-bf96-942fff0bd0b6
📒 Files selected for processing (7)
- src/05-training-testing-data-prep-functions.R
- src/08-fit-model-functions.R
- src/10-ensemble-model.py
- src/5-prepare-training-testing-data.R
- src/8-fit-gam.R
- src/9-apply-model.R
- src/package.xml
Actionable comments posted: 4
♻️ Duplicate comments (1)
src/05-training-testing-data-prep-functions.R (1)
90-96: ⚠️ Potential issue | 🟡 Minor: Use real newline escapes in this advisory.
`"/n"` is rendered literally, so the run-log warning loses its formatting.
Suggested fix

```diff
- "\nThere are less than 200 observations. Cross-validation might be preferable to a test/training split./n",
+ "\nThere are less than 200 observations. Cross-validation might be preferable to a test/training split.\n",
  "Weigh the decision while keeping in mind the number of predictors being considered: ",
  ncol(dat) - 7,
- "/n"
+ "\n"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/05-training-testing-data-prep-functions.R` around lines 90 - 96, The advisory message passed to updateRunLog incorrectly uses "/n" instead of real newline escapes, so replace the literal "/n" usages in the paste0 call with "\n" (e.g., the strings containing "/n" must be changed to "\n") so the message formats correctly; update the code around the updateRunLog(...) invocation that references response and dat (and uses ncol(dat) - 7) to use proper "\n" escapes in both places within that paste0 call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 138-243: The discarded rows computed for the presence/absence
balancing are never explicitly marked in dat$UseInModelEvaluation, leaving
stale/NA values; after you finish computing TrainSplit and TestSplit in each
branch (the block that sets dat$UseInModelEvaluation[TrainSplit] and
dat$UseInModelEvaluation[TestSplit]), compute RemovedIndx <-
setdiff(seq_along(response), c(TrainSplit, TestSplit)) and then set
dat$UseInModelEvaluation[RemovedIndx] <- NA (or another sentinel you prefer) so
RmvIndx/any other removed samples are explicitly recorded and won't remain stale.
- Around line 105-135: The code currently uses a temp shortcut to take the
per-class rounding/random-sample path when ratioPresAbs is supplied but the
overall dataset already matches the requested ratio, which can still change
per-class train/test ratios; change the conditional so that the per-class
random-sample block (the code building TrainSplit/TestSplit and setting
dat$UseInModelEvaluation) runs only when ratioPresAbs is NULL, and the balancing
branch runs whenever ratioPresAbs is supplied. Concretely, replace the condition
if (is.null(ratioPresAbs) | temp) with if (is.null(ratioPresAbs)) and remove or
ignore the temp shortcut logic (the temp calculation and its is.null handling)
so that when ratioPresAbs is provided the code always executes the balancing
branch instead of the TrainSplit rounding path.
- Around line 279-282: The nFolds validation currently only checks for integer
>1 (nFolds) but allows nFolds greater than the number of available CV rows,
which can create empty folds; add an additional check after the existing one to
compare nFolds against the number of CV rows (e.g., nrow(cvData) or
length(cvRows)) and call stop(...) if nFolds is greater than the number of CV
rows so every fold will have at least one withheld observation. Ensure you
reference and use the existing nFolds variable in the new condition so the error
is raised early.
- Around line 326-331: The if-check should ignore NA values and treat NA in
UseInModelEvaluation as "not split yet" (assign to training); change the
condition to any(dat$UseInModelEvaluation == TRUE, na.rm = TRUE) and compute
split.mask so NAs count as train, e.g. split.mask <- (dat$UseInModelEvaluation
== FALSE) | is.na(dat$UseInModelEvaluation) and then set index <-
seq_len(nrow(dat))[split.mask]; use these symbols (dat, UseInModelEvaluation,
split.mask, index) when updating the code.
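The NA-handling point in the last comment generalizes: in R, `==` propagates NA, so a mask built from a column containing NAs silently misbehaves in `if ()` and logical indexing unless NAs are handled explicitly. A small demonstration:

```r
flag <- c(TRUE, FALSE, NA, FALSE)

# Naive mask: the NA entry stays NA instead of becoming TRUE or FALSE
naive <- flag == FALSE
stopifnot(is.na(naive[3]))

# any() without na.rm can return NA instead of FALSE
stopifnot(is.na(any(c(NA, FALSE) == TRUE)))

# Treat NA as "not split yet" (assign to training), as the review suggests
split.mask <- (flag == FALSE) | is.na(flag)
stopifnot(identical(split.mask, c(FALSE, TRUE, TRUE, TRUE)))
```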
---
Duplicate comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 90-96: The advisory message passed to updateRunLog incorrectly
uses "/n" instead of real newline escapes, so replace the literal "/n" usages in
the paste0 call with "\n" (e.g., the strings containing "/n" must be changed to
"\n") so the message formats correctly; update the code around the
updateRunLog(...) invocation that references response and dat (and uses
ncol(dat) - 7) to use proper "\n" escapes in both places within that paste0
call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 60c08f0d-f5c1-4ed9-a16e-f573fbc3d064
📒 Files selected for processing (5)
- src/05-training-testing-data-prep-functions.R
- src/08-fit-model-functions.R
- src/10-ensemble-model.py
- src/9-apply-model.R
- src/package.xml
🚧 Files skipped from review as they are similar to previous changes (2)
- src/9-apply-model.R
- src/10-ensemble-model.py
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/05-training-testing-data-prep-functions.R (1)
298-305: ⚠️ Potential issue | 🟠 Major: Validate folds against CV-eligible rows (`index`), not total rows.
Line 298 checks against `nrow(dat)`, but CV assignment later uses `index` (Line 334+). If a train/test split already exists, `length(index)` can be much smaller than `nrow(dat)`, allowing empty folds.
🩹 Proposed fix

```diff
- if (nFolds > nrow(dat)) {
-   stop(paste0(
-     "Number of CV folds (",
-     nFolds,
-     ") exceeds the number of available observations (",
-     nrow(dat),
-     "). Reduce the number of folds before continuing."
-   ))
- }
@@
  if (any(dat$UseInModelEvaluation == TRUE)) {
    split.mask <- dat$UseInModelEvaluation == FALSE # FALSE = train, TRUE = test
    index <- seq(1:nrow(dat))[split.mask]
  } else {
    split.mask <- index <- seq(1:nrow(dat))
  }
+
+ if (nFolds > length(index)) {
+   stop(paste0(
+     "Number of CV folds (",
+     nFolds,
+     ") exceeds the number of observations assigned to cross-validation (",
+     length(index),
+     "). Reduce the number of folds before continuing."
+   ))
+ }
```

Also applies to: 334-339
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/05-training-testing-data-prep-functions.R` around lines 298 - 305, The current validation uses nrow(dat) to compare against nFolds but CV assignment later uses the CV-eligible rows in index, which can be smaller when a train/test split exists; update the fold-count checks (the one that references nrow(dat) around the block with nFolds, and the similar validation in the block 334-339) to compare nFolds against length(index) instead of nrow(dat) so folds are validated against the actual eligible observations; ensure you update all occurrences where nFolds is compared to total rows to use length(index) and keep the error message wording consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 33-39: The current validation for ratioPresAbs performs a direct
numeric comparison (ratioPresAbs <= 0) which will error for NA, non-finite or
non-scalar values; update the validation around the ratioPresAbs checks to first
ensure ratioPresAbs is a single finite numeric scalar (e.g., check is.numeric(),
length(...) == 1, !is.na(), and is.finite()) and only then perform the > 0
comparison, and if any of those checks fail call stop with the same error
message; apply the same hardened validation pattern to the other similar block
mentioned (the checks around lines 45-48) so both spots guard against
NA/non-finite/non-scalar inputs before comparing to 0.
- Around line 63-64: The for-loop using 1:length(factorVars) can produce an
invalid sequence when factorVars is empty; replace that pattern in the loops
referencing factorVars (the loop that builds factor.table from dat[,
factorVars[i]] and the second occurrence around the later factor handling) with
a safe iteration (e.g., use seq_along(factorVars)) or wrap the loop body in a
guard if(length(factorVars) > 0) to skip iteration when factorVars is
character(0), ensuring no out-of-range indexing occurs in the code that creates
factor.table.
---
Duplicate comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 298-305: The current validation uses nrow(dat) to compare against
nFolds but CV assignment later uses the CV-eligible rows in index, which can be
smaller when a train/test split exists; update the fold-count checks (the one
that references nrow(dat) around the block with nFolds, and the similar
validation in the block 334-339) to compare nFolds against length(index) instead
of nrow(dat) so folds are validated against the actual eligible observations;
ensure you update all occurrences where nFolds is compared to total rows to use
length(index) and keep the error message wording consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: a9db46b1-2cc9-4d1b-918d-e3e456399840
📒 Files selected for processing (2)
- src/05-training-testing-data-prep-functions.R
- src/5-prepare-training-testing-data.R
🚧 Files skipped from review as they are similar to previous changes (1)
- src/5-prepare-training-testing-data.R
```r
if (!is.null(ratioPresAbs)) {
  if (ratioPresAbs <= 0) {
    stop(
      "The ratio of presence to absence (ratioPresAbs) must be a \ngreater than 0"
    )
  }
}
```
Harden ratioPresAbs validation before numeric comparisons.
Line 34 can fail when ratioPresAbs is NA (or non-finite / non-scalar), because if (ratioPresAbs <= 0) cannot evaluate NA to TRUE/FALSE.
🛡️ Proposed fix

```diff
  if (!is.null(ratioPresAbs)) {
-   if (ratioPresAbs <= 0) {
+   if (
+     length(ratioPresAbs) != 1 ||
+     is.na(ratioPresAbs) ||
+     !is.finite(ratioPresAbs) ||
+     ratioPresAbs <= 0
+   ) {
      stop(
-       "The ratio of presence to absence (ratioPresAbs) must be a \ngreater than 0"
+       "The ratio of presence to absence (ratioPresAbs) must be a single finite number greater than 0."
      )
    }
  }
```

Also applies to: 45-48
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/05-training-testing-data-prep-functions.R` around lines 33 - 39, The
current validation for ratioPresAbs performs a direct numeric comparison
(ratioPresAbs <= 0) which will error for NA, non-finite or non-scalar values;
update the validation around the ratioPresAbs checks to first ensure
ratioPresAbs is a single finite numeric scalar (e.g., check is.numeric(),
length(...) == 1, !is.na(), and is.finite()) and only then perform the > 0
comparison, and if any of those checks fail call stop with the same error
message; apply the same hardened validation pattern to the other similar block
mentioned (the checks around lines 45-48) so both spots guard against
NA/non-finite/non-scalar inputs before comparing to 0.
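The failure mode this fix guards against can be seen directly: `if ()` errors on an NA condition, so the unhardened comparison crashes rather than rejecting the input cleanly. A short sketch:

```r
ratioPresAbs <- NA

# if (ratioPresAbs <= 0) errors: "missing value where TRUE/FALSE needed"
res <- tryCatch(
  if (ratioPresAbs <= 0) "neg" else "pos",
  error = function(e) "error"
)
stopifnot(identical(res, "error"))

# The hardened scalar check evaluates safely thanks to && short-circuiting
ok <- is.numeric(ratioPresAbs) && length(ratioPresAbs) == 1 &&
  !is.na(ratioPresAbs) && is.finite(ratioPresAbs) && ratioPresAbs > 0
stopifnot(identical(ok, FALSE))
```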
```r
for (i in 1:length(factorVars)) {
  factor.table <- table(dat[, factorVars[i]])
```
Prevent invalid loop indexing when factorVars is empty.
Line 63 and Line 310 use 1:length(factorVars), which evaluates to 1:0 when factorVars is character(0) and can trigger invalid indexing/runtime errors.
🩹 Proposed fix
```diff
- if (!is.null(factorVars)) {
-   for (i in 1:length(factorVars)) {
+ if (!is.null(factorVars) && length(factorVars) > 0) {
+   for (i in seq_along(factorVars)) {
      factor.table <- table(dat[, factorVars[i]])
      if (any(factor.table < 10)) {
        updateRunLog(paste0(
@@
- if (!is.null(factorVars)) {
-   for (i in 1:length(factorVars)) {
+ if (!is.null(factorVars) && length(factorVars) > 0) {
+   for (i in seq_along(factorVars)) {
      factor.table <- table(dat[, factorVars[i]])
      if (any(factor.table < nFolds)) {
        updateRunLog(paste0(
```

Also applies to: 310-311
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/05-training-testing-data-prep-functions.R` around lines 63 - 64, The
for-loop using 1:length(factorVars) can produce an invalid sequence when
factorVars is empty; replace that pattern in the loops referencing factorVars
(the loop that builds factor.table from dat[, factorVars[i]] and the second
occurrence around the later factor handling) with a safe iteration (e.g., use
seq_along(factorVars)) or wrap the loop body in a guard if(length(factorVars) >
0) to skip iteration when factorVars is character(0), ensuring no out-of-range
indexing occurs in the code that creates factor.table.
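The underlying hazard is that `1:length(x)` counts backwards when `x` is empty, while `seq_along()` yields a zero-length sequence. A quick demonstration:

```r
factorVars <- character(0)

# 1:length(x) on an empty vector gives c(1L, 0L): the loop body runs twice
# with out-of-range indices instead of not at all
stopifnot(identical(1:length(factorVars), c(1L, 0L)))

# seq_along() is the safe idiom: zero iterations for an empty vector
stopifnot(length(seq_along(factorVars)) == 0)

for (i in seq_along(factorVars)) {
  stop("never reached for empty factorVars")
}
```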
Summary by CodeRabbit
- Bug Fixes
- New Features
- UI Improvements