updates to factor handling, MoD/MESS map generation and error messaging #46
spearmangillman wants to merge 6 commits into main
Conversation
…ist; fix MoD/MESS restriction masking - categorical vars are simply removed from the MoD and MESS calculation and a warning is written to the run log
…rror/warning messaging; updated NA filtering to improve accuracy and consistency across models
| Cohort / File(s) | Summary |
|---|---|
| Train/Test & CV Splitting (`src/05-training-testing-data-prep-functions.R`, `src/5-prepare-training-testing-data.R`) | Major rewrite of `testTrainSplit` and `crossValidationSplit`: stronger input validation, explicit ratio-presence/absence branches, balanced/random sampling paths, clearer background removal/reattachment, expanded factor-level warnings (pandoc table), and explicit NA-defaulting for split/CV options. |
| Model Fitting & CV guards (`src/08-fit-model-functions.R`, `src/8-fit-gam.R`) | GAM fit wrapped in `tryCatch` (returns `NULL` on failure); CV fold handling now filters NA predictions, emits warnings for excluded sites, and stops with informative errors when folds yield no usable predictions or insufficient classes. |
| Apply-model / MESS & MoD (`src/9-apply-model.R`) | MESS/MoD computation limited to continuous covariates: introduces `contVars = setdiff(modVars, messFactorVars)`, updates prep/prediction to use `contVars`, rescales/masks to match the prediction raster, and gates outputs and warnings when categorical vars are present. |
| Ensemble messaging (`src/10-ensemble-model.py`) | Formatting change to constant-raster warning text (capitalization and prefixed newline); no behavioral change to normalization logic. |
| Manifest / DataSheet metadata (`src/package.xml`) | Updated several DataSheet `displayName` values and added `Parameter2`/`Parameter2Values` columns for `brtTuning` and `rfTuning`. |
Sequence Diagram(s)
(Skipped)
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
- apply model revamp; updates to brt fitting and brt hyperparameter tuning; minor bug fixes and general code clean-up #38: Overlaps model-fitting and CV logic edits in `src/08-fit-model-functions.R` (tryCatch / per-fold guards).
- maxent updates and other minor bug fixes #39: Edits to fitModel/CV handling and apply-model paths that intersect with CV and MESS/MoD changes.
- spg code cleanup, apply and ensemble model efficiency improvements, maxent cat vars fix #45: Overlapping changes touching train/test splitting, CV handling, and apply/ensemble behavior.
Suggested reviewers
- katieb1
- afilazzola
Poem
"I hop through code and tidy splits with care,
catching GAM fits that crumble in the air,
I filter NAs, sort continuous from the rest,
tune two parameters to help models do their best. 🐰✨"
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main changes across multiple files: factor handling improvements, MoD/MESS map generation refactoring, and enhanced error messaging throughout the codebase. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
- Create PR with unit tests
- Commit unit tests in branch `dev`
📝 Coding Plan
- Generate coding plan for human review comments
Comment @coderabbitai help to get the list of available commands and usage tips.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 288-290: The absence-data guard currently uses !is.null(stratify)
which is always TRUE because stratify is a logical; change the condition to
check for an actual TRUE stratification (e.g., isTRUE(stratify) or stratify ==
TRUE) so the stop("Cross Validation requires absence data.") only fires when
stratified CV is requested; update the if that inspects sum(as.numeric(response)
== 0) to use isTRUE(stratify) and keep the rest of the logic unchanged.
- Around line 90-97: The warning message built in the paste0 call inside the
conditional checking length(response) renders "/n" literally; replace those
literal "/n" fragments with proper newline escapes "\\n" (i.e., "\n" in the R
string) so updateRunLog receives real newlines; modify the paste0 block that
references dat and response (the block using updateRunLog and ncol(dat) - 7) to
use "\n" between lines (and ensure the string quotes remain valid).
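The stratify pitfall flagged above is easy to reproduce: a logical flag is never `NULL`, so `!is.null()` is not a stratification check. A minimal sketch of the failing and corrected guards (variable name is illustrative, not the actual function's argument):

```r
# stratify is a logical flag, so !is.null(stratify) is TRUE for both values
stratify <- FALSE
stopifnot(!is.null(stratify))      # always TRUE: the buggy guard fires regardless

# isTRUE() only passes for a single, non-NA TRUE, which is the intended check
stopifnot(isTRUE(TRUE))
stopifnot(!isTRUE(FALSE))
stopifnot(!isTRUE(NA))
stopifnot(!isTRUE(c(TRUE, TRUE)))  # vectors are rejected too
```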
In `@src/08-fit-model-functions.R`:
- Around line 1165-1176: The per-fold NA-filtering currently drops entries from
y_i, u_i, and weights.subset but leaves the corresponding rows in cv.train.data
/ data$predicted, so downstream functions like makeModelEvalPlots() rebuild fold
stats from out$cvResults$cv.train.data and can fail or produce mismatched
counts; when valid_i is computed (the !is.na(u_i) mask) apply the same
subsetting to the fold's cv.train.data (and any data$predicted slice written for
that fold) so the stored training/prediction rows match y_i/u_i/weights.subset,
and mirror this change in the similar block around the other occurrence (lines
~1208-1217) so both places filter cv.train.data consistently.
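The core of that fix is that every parallel per-fold structure must be filtered by the same mask. A hedged sketch with generic names (the actual fold objects in the reviewed code differ):

```r
# Parallel per-fold structures: predictions, observations, weights, and the
# stored training-data rows must stay aligned after NA filtering.
u_i <- c(0.2, NA, 0.7, NA, 0.9)      # predictions (some NA)
y_i <- c(0, 1, 1, 0, 1)              # observed responses
w_i <- c(1, 1, 2, 1, 1)              # weights
fold_rows <- data.frame(site = 1:5)  # stand-in for cv.train.data rows

valid_i <- !is.na(u_i)               # one mask, applied everywhere
u_i <- u_i[valid_i]
y_i <- y_i[valid_i]
w_i <- w_i[valid_i]
fold_rows <- fold_rows[valid_i, , drop = FALSE]

# All structures now have matching lengths, so downstream fold stats line up
stopifnot(length(u_i) == length(y_i),
          length(u_i) == length(w_i),
          length(u_i) == nrow(fold_rows))
```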
In `@src/10-ensemble-model.py`:
- Around line 160-162: The f-string that builds the constant-raster warning
currently begins with a stray "1\n" (f'1\nWarning: Raster {i+1} ...'), causing
an unwanted "1" to be logged; remove the leading "1\n" from that f-string so it
starts with "Warning: Raster {i+1} ..." (edit the f-string where the
constant-raster warning is constructed/issued).
In `@src/9-apply-model.R`:
- Around line 723-739: contVars currently uses factorVars which was earlier
built to exclude mod$catVars for Maxent, so Maxent categorical predictors are
wrongly treated as continuous for MESS/MoD; fix by constructing a map-specific
categorical set (e.g. mapFactorVars <- union(factorVars, mod$catVars)) and then
compute contVarsForMaps <- setdiff(modVars, mapFactorVars) and use
contVarsForMaps (instead of contVars) when slicing trainingData and calling
prepare_mess, and update the warning logic to reference the new set; apply the
same change where MESS/MoD output gating is done (the block around lines
955-964).
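The proposed `mapFactorVars` / `contVarsForMaps` construction is plain set algebra; a sketch with made-up predictor names (in the reviewed code these come from the fitted model object):

```r
# Hypothetical predictor sets; illustrative only
modVars    <- c("elev", "slope", "landcover", "soil", "precip")
factorVars <- c("landcover")  # categoricals known to the generic path
maxentCats <- c("soil")       # stand-in for mod$catVars (Maxent-only)

# Map-specific categorical set: anything categorical from either source
mapFactorVars <- union(factorVars, maxentCats)

# Continuous variables eligible for MESS/MoD
contVarsForMaps <- setdiff(modVars, mapFactorVars)

stopifnot(identical(contVarsForMaps, c("elev", "slope", "precip")))
```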
In `@src/package.xml`:
- Line 166: The displayName for the rfTuning dataSheet is missing a space before
the bracket; update the displayName attribute on the dataSheet with name
"rfTuning" from "Random Forest[Tuning]" to "Random Forest [Tuning]" so the UI
shows a consistent user-facing label.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 84e7e1ca-0aa4-4775-bf96-942fff0bd0b6
📒 Files selected for processing (7)
- src/05-training-testing-data-prep-functions.R
- src/08-fit-model-functions.R
- src/10-ensemble-model.py
- src/5-prepare-training-testing-data.R
- src/8-fit-gam.R
- src/9-apply-model.R
- src/package.xml
Actionable comments posted: 4
♻️ Duplicate comments (1)
src/05-training-testing-data-prep-functions.R (1)
90-96: ⚠️ Potential issue | 🟡 Minor: Use real newline escapes in this advisory.
`"/n"` is rendered literally, so the run-log warning loses its formatting.
Suggested fix

```diff
- "\nThere are less than 200 observations. Cross-validation might be preferable to a test/training split./n",
+ "\nThere are less than 200 observations. Cross-validation might be preferable to a test/training split.\n",
  "Weigh the decision while keeping in mind the number of predictors being considered: ",
  ncol(dat) - 7,
- "/n"
+ "\n"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/05-training-testing-data-prep-functions.R` around lines 90 - 96, The advisory message passed to updateRunLog incorrectly uses "/n" instead of real newline escapes, so replace the literal "/n" usages in the paste0 call with "\n" (e.g., the strings containing "/n" must be changed to "\n") so the message formats correctly; update the code around the updateRunLog(...) invocation that references response and dat (and uses ncol(dat) - 7) to use proper "\n" escapes in both places within that paste0 call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 138-243: The discarded rows computed for the presence/absence
balancing are never explicitly marked in dat$UseInModelEvaluation, leaving
stale/NA values; after you finish computing TrainSplit and TestSplit in each
branch (the block that sets dat$UseInModelEvaluation[TrainSplit] and
dat$UseInModelEvaluation[TestSplit]), compute RemovedIndx <-
setdiff(seq_along(response), c(TrainSplit, TestSplit)) and then set
dat$UseInModelEvaluation[RemovedIndx] <- NA (or another sentinel you prefer) so
RmvIndx/any other removed samples are explicitly recorded and won't remain stale.
- Around line 105-135: The code currently uses a temp shortcut to take the
per-class rounding/random-sample path when ratioPresAbs is supplied but the
overall dataset already matches the requested ratio, which can still change
per-class train/test ratios; change the conditional so that the per-class
random-sample block (the code building TrainSplit/TestSplit and setting
dat$UseInModelEvaluation) runs only when ratioPresAbs is NULL, and the balancing
branch runs whenever ratioPresAbs is supplied. Concretely, replace the condition
if (is.null(ratioPresAbs) | temp) with if (is.null(ratioPresAbs)) and remove or
ignore the temp shortcut logic (the temp calculation and its is.null handling)
so that when ratioPresAbs is provided the code always executes the balancing
branch instead of the TrainSplit rounding path.
- Around line 279-282: The nFolds validation currently only checks for integer
>1 (nFolds) but allows nFolds greater than the number of available CV rows,
which can create empty folds; add an additional check after the existing one to
compare nFolds against the number of CV rows (e.g., nrow(cvData) or
length(cvRows)) and call stop(...) if nFolds is greater than the number of CV
rows so every fold will have at least one withheld observation. Ensure you
reference and use the existing nFolds variable in the new condition so the error
is raised early.
- Around line 326-331: The if-check should ignore NA values and treat NA in
UseInModelEvaluation as "not split yet" (assign to training); change the
condition to any(dat$UseInModelEvaluation == TRUE, na.rm = TRUE) and compute
split.mask so NAs count as train, e.g. split.mask <- (dat$UseInModelEvaluation
== FALSE) | is.na(dat$UseInModelEvaluation) and then set index <-
seq_len(nrow(dat))[split.mask]; use these symbols (dat, UseInModelEvaluation,
split.mask, index) when updating the code.
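The NA-handling point in the last comment generalizes: in R, `==` propagates NA, so a mask built from a column containing NAs silently misbehaves in `if ()` and logical indexing unless NAs are handled explicitly. A small demonstration:

```r
flag <- c(TRUE, FALSE, NA, FALSE)

# Naive mask: the NA entry stays NA instead of becoming TRUE or FALSE
naive <- flag == FALSE
stopifnot(is.na(naive[3]))

# any() without na.rm can return NA instead of FALSE
stopifnot(is.na(any(c(NA, FALSE) == TRUE)))

# Treat NA as "not split yet" (assign to training), as the review suggests
split.mask <- (flag == FALSE) | is.na(flag)
stopifnot(identical(split.mask, c(FALSE, TRUE, TRUE, TRUE)))
```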
---
Duplicate comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 90-96: The advisory message passed to updateRunLog incorrectly
uses "/n" instead of real newline escapes, so replace the literal "/n" usages in
the paste0 call with "\n" (e.g., the strings containing "/n" must be changed to
"\n") so the message formats correctly; update the code around the
updateRunLog(...) invocation that references response and dat (and uses
ncol(dat) - 7) to use proper "\n" escapes in both places within that paste0
call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 60c08f0d-f5c1-4ed9-a16e-f573fbc3d064
📒 Files selected for processing (5)
- src/05-training-testing-data-prep-functions.R
- src/08-fit-model-functions.R
- src/10-ensemble-model.py
- src/9-apply-model.R
- src/package.xml
🚧 Files skipped from review as they are similar to previous changes (2)
- src/9-apply-model.R
- src/10-ensemble-model.py
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/05-training-testing-data-prep-functions.R (1)
298-305: ⚠️ Potential issue | 🟠 Major: Validate folds against CV-eligible rows (`index`), not total rows.
Line 298 checks against `nrow(dat)`, but CV assignment later uses `index` (Line 334+). If a train/test split already exists, `length(index)` can be much smaller than `nrow(dat)`, allowing empty folds.
🩹 Proposed fix

```diff
- if (nFolds > nrow(dat)) {
-   stop(paste0(
-     "Number of CV folds (",
-     nFolds,
-     ") exceeds the number of available observations (",
-     nrow(dat),
-     "). Reduce the number of folds before continuing."
-   ))
- }
@@
  if (any(dat$UseInModelEvaluation == TRUE)) {
    split.mask <- dat$UseInModelEvaluation == FALSE # FALSE = train, TRUE = test
    index <- seq(1:nrow(dat))[split.mask]
  } else {
    split.mask <- index <- seq(1:nrow(dat))
  }
+
+ if (nFolds > length(index)) {
+   stop(paste0(
+     "Number of CV folds (",
+     nFolds,
+     ") exceeds the number of observations assigned to cross-validation (",
+     length(index),
+     "). Reduce the number of folds before continuing."
+   ))
+ }
```

Also applies to: 334-339
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/05-training-testing-data-prep-functions.R` around lines 298 - 305, The current validation uses nrow(dat) to compare against nFolds but CV assignment later uses the CV-eligible rows in index, which can be smaller when a train/test split exists; update the fold-count checks (the one that references nrow(dat) around the block with nFolds, and the similar validation in the block 334-339) to compare nFolds against length(index) instead of nrow(dat) so folds are validated against the actual eligible observations; ensure you update all occurrences where nFolds is compared to total rows to use length(index) and keep the error message wording consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 33-39: The current validation for ratioPresAbs performs a direct
numeric comparison (ratioPresAbs <= 0) which will error for NA, non-finite or
non-scalar values; update the validation around the ratioPresAbs checks to first
ensure ratioPresAbs is a single finite numeric scalar (e.g., check is.numeric(),
length(...) == 1, !is.na(), and is.finite()) and only then perform the > 0
comparison, and if any of those checks fail call stop with the same error
message; apply the same hardened validation pattern to the other similar block
mentioned (the checks around lines 45-48) so both spots guard against
NA/non-finite/non-scalar inputs before comparing to 0.
- Around line 63-64: The for-loop using 1:length(factorVars) can produce an
invalid sequence when factorVars is empty; replace that pattern in the loops
referencing factorVars (the loop that builds factor.table from dat[,
factorVars[i]] and the second occurrence around the later factor handling) with
a safe iteration (e.g., use seq_along(factorVars)) or wrap the loop body in a
guard if(length(factorVars) > 0) to skip iteration when factorVars is
character(0), ensuring no out-of-range indexing occurs in the code that creates
factor.table.
---
Duplicate comments:
In `@src/05-training-testing-data-prep-functions.R`:
- Around line 298-305: The current validation uses nrow(dat) to compare against
nFolds but CV assignment later uses the CV-eligible rows in index, which can be
smaller when a train/test split exists; update the fold-count checks (the one
that references nrow(dat) around the block with nFolds, and the similar
validation in the block 334-339) to compare nFolds against length(index) instead
of nrow(dat) so folds are validated against the actual eligible observations;
ensure you update all occurrences where nFolds is compared to total rows to use
length(index) and keep the error message wording consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: a9db46b1-2cc9-4d1b-918d-e3e456399840
📒 Files selected for processing (2)
- src/05-training-testing-data-prep-functions.R
- src/5-prepare-training-testing-data.R
🚧 Files skipped from review as they are similar to previous changes (1)
- src/5-prepare-training-testing-data.R
```r
if (!is.null(ratioPresAbs)) {
  if (ratioPresAbs <= 0) {
    stop(
      "The ratio of presence to absence (ratioPresAbs) must be a \ngreater than 0"
    )
  }
}
```
Harden ratioPresAbs validation before numeric comparisons.
Line 34 can fail when ratioPresAbs is NA (or non-finite / non-scalar), because if (ratioPresAbs <= 0) cannot evaluate NA to TRUE/FALSE.
🛡️ Proposed fix

```diff
  if (!is.null(ratioPresAbs)) {
-   if (ratioPresAbs <= 0) {
+   if (
+     length(ratioPresAbs) != 1 ||
+     is.na(ratioPresAbs) ||
+     !is.finite(ratioPresAbs) ||
+     ratioPresAbs <= 0
+   ) {
      stop(
-       "The ratio of presence to absence (ratioPresAbs) must be a \ngreater than 0"
+       "The ratio of presence to absence (ratioPresAbs) must be a single finite number greater than 0."
      )
    }
  }
```

Also applies to: 45-48
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/05-training-testing-data-prep-functions.R` around lines 33 - 39, The
current validation for ratioPresAbs performs a direct numeric comparison
(ratioPresAbs <= 0) which will error for NA, non-finite or non-scalar values;
update the validation around the ratioPresAbs checks to first ensure
ratioPresAbs is a single finite numeric scalar (e.g., check is.numeric(),
length(...) == 1, !is.na(), and is.finite()) and only then perform the > 0
comparison, and if any of those checks fail call stop with the same error
message; apply the same hardened validation pattern to the other similar block
mentioned (the checks around lines 45-48) so both spots guard against
NA/non-finite/non-scalar inputs before comparing to 0.
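The failure mode this fix guards against can be seen directly: `if ()` errors on an NA condition, so the unhardened comparison crashes rather than rejecting the input cleanly. A short sketch:

```r
ratioPresAbs <- NA

# if (ratioPresAbs <= 0) errors: "missing value where TRUE/FALSE needed"
res <- tryCatch(
  if (ratioPresAbs <= 0) "neg" else "pos",
  error = function(e) "error"
)
stopifnot(identical(res, "error"))

# The hardened scalar check evaluates safely thanks to && short-circuiting
ok <- is.numeric(ratioPresAbs) && length(ratioPresAbs) == 1 &&
  !is.na(ratioPresAbs) && is.finite(ratioPresAbs) && ratioPresAbs > 0
stopifnot(identical(ok, FALSE))
```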
```r
for (i in 1:length(factorVars)) {
  factor.table <- table(dat[, factorVars[i]])
```
Prevent invalid loop indexing when factorVars is empty.
Line 63 and Line 310 use 1:length(factorVars), which evaluates to 1:0 when factorVars is character(0) and can trigger invalid indexing/runtime errors.
🩹 Proposed fix
```diff
- if (!is.null(factorVars)) {
-   for (i in 1:length(factorVars)) {
+ if (!is.null(factorVars) && length(factorVars) > 0) {
+   for (i in seq_along(factorVars)) {
      factor.table <- table(dat[, factorVars[i]])
      if (any(factor.table < 10)) {
        updateRunLog(paste0(
@@
- if (!is.null(factorVars)) {
-   for (i in 1:length(factorVars)) {
+ if (!is.null(factorVars) && length(factorVars) > 0) {
+   for (i in seq_along(factorVars)) {
      factor.table <- table(dat[, factorVars[i]])
      if (any(factor.table < nFolds)) {
        updateRunLog(paste0(
```

Also applies to: 310-311
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/05-training-testing-data-prep-functions.R` around lines 63 - 64, The
for-loop using 1:length(factorVars) can produce an invalid sequence when
factorVars is empty; replace that pattern in the loops referencing factorVars
(the loop that builds factor.table from dat[, factorVars[i]] and the second
occurrence around the later factor handling) with a safe iteration (e.g., use
seq_along(factorVars)) or wrap the loop body in a guard if(length(factorVars) >
0) to skip iteration when factorVars is character(0), ensuring no out-of-range
indexing occurs in the code that creates factor.table.
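The underlying hazard is that `1:length(x)` counts backwards when `x` is empty, while `seq_along()` yields a zero-length sequence. A quick demonstration:

```r
factorVars <- character(0)

# 1:length(x) on an empty vector gives c(1L, 0L): the loop body runs twice
# with out-of-range indices instead of not at all
stopifnot(identical(1:length(factorVars), c(1L, 0L)))

# seq_along() is the safe idiom: zero iterations for an empty vector
stopifnot(length(seq_along(factorVars)) == 0)

for (i in seq_along(factorVars)) {
  stop("never reached for empty factorVars")
}
```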
Summary by CodeRabbit
- Bug Fixes
- New Features
- UI Improvements