
Add check_nm_dataset() for automated NONMEM dataset validation#7

Merged
roninsightrx merged 5 commits into main from check-nm-dataset
Apr 3, 2026

Conversation

@roninsightrx
Contributor

Summary

  • Adds check_nm_dataset() — a comprehensive validation function that runs 35+ automated checks on NONMEM datasets
  • Returns a tidy data.frame with check name, description, and result (PASS/FAIL/NA) without throwing errors
  • Checks cover: column naming & reserved names, EVID/AMT/DV/MDV/CMT consistency, TIME ordering, ADDL/II overlap detection, exclusion flags, covariate missingness, numeric precision, CSV safety, and more
  • Includes 31 tests covering all check categories
  • Version bumped to 0.0.0.9005
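
A hypothetical usage sketch of the function described above, on a toy NONMEM-style dataset (the check names shown in the comments are illustrative, not the PR's exact check identifiers):

```r
# Toy NONMEM-style dataset using standard column names
dat <- data.frame(
  ID   = c(1, 1, 1),
  TIME = c(0, 1, 2),
  EVID = c(1, 0, 0),
  AMT  = c(100, NA, NA),
  DV   = c(NA, 4.2, 3.1),
  MDV  = c(1, 0, 0)
)

# res <- check_nm_dataset(dat)
# 'res' is a tidy data.frame with one row per check, e.g.:
#   check        description                        result
#   time_order   TIME is non-decreasing within ID   PASS
#   ...          ...                                ...
# result is always "PASS", "FAIL", or NA; no errors are thrown.
```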

Test plan

  • devtools::test(filter = "check_nm_dataset") — 31/31 passing
  • Manual testing with real NONMEM datasets

🤖 Generated with Claude Code

Adds a comprehensive validation function that runs 35+ checks on NONMEM
datasets and returns a summary table (check/description/result) without
throwing errors. Covers column naming, EVID/AMT/DV/MDV consistency,
TIME ordering, ADDL/II overlap detection, reserved names, covariate
missingness, and more.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR introduces a new exported validation utility, check_nm_dataset(), to run a broad suite of automated consistency/sanity checks on NONMEM-style datasets and return a tidy PASS/FAIL/NA summary table without throwing errors.

Changes:

  • Added the check_nm_dataset() implementation with 35+ checks and optional verbose logging.
  • Added a dedicated testthat file with 31 unit tests covering the new check categories.
  • Exported the function, generated Rd documentation, and bumped the package version.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
R/check_nm_dataset.R New dataset validation function implementing the check suite and verbose reporting.
tests/testthat/test-check_nm_dataset.R New unit tests validating expected PASS/FAIL behavior for each check.
NAMESPACE Exports check_nm_dataset.
man/check_nm_dataset.Rd Generated documentation for the new function.
DESCRIPTION Version bumped to 0.0.0.9005.


Comment on lines +50 to +52
has_col <- function(col) col %in% names(data)
cols <- toupper(names(data))


Copilot AI Apr 3, 2026


Column handling is inconsistent: some checks use cols <- toupper(names(data)) (case-insensitive), but has_col() and most downstream accessors use case-sensitive names(data) / data$TIME / data$EVID. This can cause required columns to be reported as present while subsequent checks are silently skipped (or covariates misclassified) when inputs use lowercase/mixed-case column names. Consider normalizing names once (e.g., map to uppercase internally) or implementing case-insensitive lookup helpers used consistently throughout the function.
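
One way to implement the reviewer's suggestion is a pair of case-insensitive lookup helpers used consistently throughout the function (a sketch only; the names has_col_ci and get_col are illustrative, not the PR's code):

```r
# Case-insensitive column helpers: normalize names once at lookup time
# so "id", "Id", and "ID" all resolve to the same column.
has_col_ci <- function(data, col) {
  toupper(col) %in% toupper(names(data))
}
get_col <- function(data, col) {
  idx <- match(toupper(col), toupper(names(data)))
  if (is.na(idx)) NULL else data[[idx]]
}

dat <- data.frame(id = 1:2, Time = c(0, 1))
has_col_ci(dat, "ID")   # TRUE despite the lowercase "id" column
get_col(dat, "TIME")    # c(0, 1)
```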

Comment on lines +511 to +516
max_len <- max(nchar(ids))
log_check(
  "id_digits",
  paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"),
  if (max_len <= max_id_digits) "PASS" else "FAIL"
)

Copilot AI Apr 3, 2026


max_len <- max(nchar(ids)) will return NA if any ID is missing, and the subsequent if (max_len <= max_id_digits) will error with “missing value where TRUE/FALSE needed”. Since the function promises to not throw errors, this should use na.rm = TRUE (and handle the all-NA case explicitly) or validate/flag missing IDs before computing max_len.

Suggested change

max_len <- max(nchar(ids))
log_check(
  "id_digits",
  paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"),
  if (max_len <= max_id_digits) "PASS" else "FAIL"
)

non_missing_ids <- ids[!is.na(ids)]
if (length(non_missing_ids) == 0) {
  log_check(
    "id_digits",
    paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"),
    "NA"
  )
} else {
  max_len <- max(nchar(non_missing_ids))
  log_check(
    "id_digits",
    paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"),
    if (max_len <= max_id_digits) "PASS" else "FAIL"
  )
}

Comment on lines +471 to +486
if (has_col("ADDL") && has_col("II") && has_col("EVID") && has_col("ID") &&
has_col("TIME") && is.numeric(data$ADDL) && is.numeric(data$II)) {
overlap_found <- FALSE
for (id in unique(data$ID)) {
sub <- data[data$ID == id, ]
dose_idx <- which(sub$EVID %in% c(1, 4))
if (length(dose_idx) < 2) next
for (i in seq_len(length(dose_idx) - 1)) {
row_i <- dose_idx[i]
row_next <- dose_idx[i + 1]
addl_i <- sub$ADDL[row_i]
ii_i <- sub$II[row_i]
if (!is.na(addl_i) && !is.na(ii_i) && addl_i > 0 && ii_i > 0) {
last_expanded <- sub$TIME[row_i] + addl_i * ii_i
if (last_expanded >= sub$TIME[row_next]) {
overlap_found <- TRUE

Copilot AI Apr 3, 2026


addl_overlap detection iterates dosing records in their current row order (dose_idx), but overlap is a time-based concept and the dataset may be unsorted (you separately check data_ordered). As written, an unsorted dataset can yield false positives/negatives for overlap. Consider sorting sub by TIME (and possibly by additional tie-breakers) before computing overlaps, or only running this check when the ordering check passes.

roninsightrx and others added 4 commits April 3, 2026 12:39
Allows mapping non-standard column names to expected NONMEM names via a
dictionary argument (e.g., list(ID = "SUBJ", TIME = "TAD")). All checks
now resolve column names through the dictionary. Keys are
case-insensitive and unknown keys produce a warning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
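
The dictionary resolution described in this commit might look like the following (a self-contained sketch; resolve_cols is an illustrative name, not necessarily the PR's internal implementation):

```r
# Map non-standard column names to expected NONMEM names via a dictionary,
# e.g. list(ID = "SUBJ", TIME = "TAD"). Keys are matched case-insensitively;
# unknown keys produce a warning and are skipped.
resolve_cols <- function(data, dictionary = list()) {
  expected <- c("ID", "TIME", "EVID", "AMT", "DV", "MDV", "CMT", "ADDL", "II")
  for (key in names(dictionary)) {
    key_std <- toupper(key)
    if (!key_std %in% expected) {
      warning("Unknown dictionary key: ", key)
      next
    }
    names(data)[names(data) == dictionary[[key]]] <- key_std
  }
  data
}

dat <- data.frame(SUBJ = 1:2, TAD = c(0, 1))
dat <- resolve_cols(dat, list(ID = "SUBJ", TIME = "TAD"))
names(dat)  # "ID" "TIME"
```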
Reviewer feedback: the MDV/DV checks were internally inconsistent.
mdv_dv_consistent used only is.na(DV), but dv_leq_zero_mdv expected
DV <= 0 to have MDV=1, and mdv_nonmissing_obs used DV != 0 instead
of DV > 0. A valid dataset with BLQ records could not pass all three.

Changes:
- Unified "missing/BLQ" = is.na(DV) | DV <= 0 across all MDV checks
- mdv_dv_consistent now expects MDV=1 for NA or DV<=0, MDV=0 for DV>0
- Removed redundant dv_leq_zero_mdv check (covered by the updated mdv_dv_consistent)
- Fixed mdv_nonmissing_obs filter to use DV > 0 (matching description)
- Added test for BLQ (DV <= 0) MDV consistency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
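
The unified rule from this commit can be sketched as follows (a self-contained illustration of the logic described above, not the PR's exact code):

```r
# Unified definition: a record is "missing/BLQ" when DV is NA or DV <= 0.
# MDV is then expected to be 1 for missing/BLQ records and 0 otherwise.
dv  <- c(NA, 0, -1, 4.2, 3.1)
mdv <- c(1,  1,  1, 0,   0)

# In R, TRUE | NA is TRUE, so the is.na() branch absorbs NA DVs cleanly.
is_blq <- is.na(dv) | dv <= 0

mdv_consistent <- all(mdv[is_blq] == 1) && all(mdv[!is_blq] == 0)
mdv_consistent  # TRUE: a BLQ-containing dataset now passes the unified check
```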
The overlap check iterated dose records in row order, which could
produce false positives/negatives on unsorted datasets. Now sorts
dose_idx by TIME within each ID before comparing expanded dose
windows against subsequent doses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
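
The fix described in this commit can be sketched on toy data (illustrative only; variable names mirror the excerpt above):

```r
# Sort dose-record indices by TIME within a subject before pairwise
# comparison, so overlap detection is robust to unsorted input rows.
sub <- data.frame(
  TIME = c(24, 0, 12),  # deliberately out of order
  EVID = c(1, 1, 1)
)
dose_idx <- which(sub$EVID %in% c(1, 4))
dose_idx <- dose_idx[order(sub$TIME[dose_idx])]
sub$TIME[dose_idx]  # 0 12 24: consecutive doses now compare in time order
```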
@roninsightrx roninsightrx merged commit cb628f1 into main Apr 3, 2026
5 checks passed