
Remove parent spec to enable code-first workflow execution#493

Open
smcolby wants to merge 44 commits into main from remove-parent-spec

Conversation


@smcolby smcolby commented Feb 27, 2026

Description

This PR decouples anvil runtime execution from YAML-backed spec state by removing parent_spec from the workflow classes. Workflows now receive only grouped domain kwargs (model_kwargs, ensemble_kwargs, feat_kwargs), so they can be initialized and run programmatically from Python without coupling to a raw recipe.

In addition, this pull request represents a comprehensive overhaul of the openadmet testing architecture. The primary goal was to transition from lengthy, high-dependency integration-style tests to true, isolated unit tests.

Key changes

1. Remove entangled parent_spec attribute

  • Replaced the parent_spec attribute with per-spec dictionaries of "class-external" fields (fields that don't explicitly belong to a BaseSpec subclass and are external to its instantiation): model_kwargs, ensemble_kwargs, feat_kwargs, etc. These provide access to global workflow state for component "communication" without maintaining a YAML-dependent specification object.
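The resulting code-first pattern might look like this (a minimal sketch; Workflow, its fields, and run are illustrative stand-ins, not the actual openadmet API):

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    # Grouped "class-external" kwargs replace a single parent_spec object.
    model_kwargs: dict = field(default_factory=dict)
    ensemble_kwargs: dict = field(default_factory=dict)
    feat_kwargs: dict = field(default_factory=dict)

    def run(self) -> str:
        # Components read shared state from the grouped dicts rather than
        # reaching back into a YAML-backed specification.
        feat_type = self.feat_kwargs.get("type", "fingerprint")
        return f"featurizing with {feat_type}"

# Code-first: build and run without any YAML recipe on disk.
wf = Workflow(feat_kwargs={"type": "fingerprint", "params": {"radius": 2}})
assert wf.run() == "featurizing with fingerprint"
```

A YAML-backed specification can still construct the same object by translating its sections into these dicts, which is what the to_workflow path in this PR does.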

2. Unit tests in tests/unit/ must be unit tests

  • Replaced slow, end-to-end integration tests (hiding in unit tests folder) with lightweight unit tests across heavy modules (anvil, inference, and CLI).
  • Introduced synthetic data fixtures (e.g., small PyTorch tensors and diverse SMILES sets) to validate data plumbing without requiring heavy disk access or RDKit overhead.
  • On a local MacBook Pro, the unit test suite previously ran in 15 minutes; it now completes in 55 seconds.

3. Standardized mocking and isolation

  • Eliminated "tautological mocks" (mocks that only test themselves) and the custom dummy classes introduced by an initial agentic pass (bad AI).
  • Standardized on the pytest-mock library (mocker fixture) to properly isolate testing boundaries and assert internal state transitions.
  • Used "real" fixtures wherever possible.

4. Mathematical and data rigor

  • Data Leakage Prevention: Implemented strict disjoint set validation (via set intersection of indices) for all chemical splitters to mathematically guarantee no overlap between train, validation, and test sets. (@dwwest)
  • UQ Validation: Updated inference tests to explicitly calculate and verify uncertainty quantification math (e.g., UCB bounds) instead of just checking for column existence.
  • Floating-Point Stability: Replaced fragile strict equality (==) with pytest.approx and numpy.testing.assert_almost_equal to ensure cross-platform consistency.
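The disjoint-split and approximate-comparison checks reduce to a few lines; a stdlib-only sketch (assert_disjoint_splits is a hypothetical helper, and math.isclose stands in here for pytest.approx / numpy.testing.assert_almost_equal):

```python
import math

def assert_disjoint_splits(train_idx, val_idx, test_idx):
    # Pairwise intersections must all be empty: no index may leak across splits.
    train, val, test = set(train_idx), set(val_idx), set(test_idx)
    assert not (train & val or train & test or val & test), "split leakage detected"

assert_disjoint_splits([0, 1], [2, 3], [4, 5])

# Approximate comparison instead of fragile strict equality on floats.
assert 0.1 + 0.2 != 0.3            # strict == fails due to binary rounding
assert math.isclose(0.1 + 0.2, 0.3)
```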

5. Feature concatenation bug fix

  • Exposed and fixed a critical bug in FeatureConcatenator where feature arrays were not being filtered by the intersection of valid indices. This fix prevents shape mismatches and silent data misalignment when featurizers drop different molecules.
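A minimal sketch of the intersection-and-mask logic (concat_features is an illustrative reduction, not the FeatureConcatenator source):

```python
import numpy as np

def concat_features(feats, index_lists):
    # Intersect the surviving molecule indices across all featurizers...
    common = set(index_lists[0])
    for idx in index_lists[1:]:
        common &= set(idx)
    common = sorted(common)
    # ...then mask each feature block to those common rows before concatenating.
    filtered = [feat[np.isin(idx, common)] for feat, idx in zip(feats, index_lists)]
    return np.concatenate(filtered, axis=1)

a = np.arange(8).reshape(4, 2)   # featurizer A kept molecules 0-3
b = np.arange(9).reshape(3, 3)   # featurizer B dropped molecule 2
out = concat_features([a, b], [[0, 1, 2, 3], [0, 1, 3]])
assert out.shape == (3, 5)       # 3 common molecules, 2 + 3 feature columns
```

Without the mask, the raw (4, 2) and (3, 3) arrays would be concatenated directly, producing a shape mismatch or, worse, silently misaligned rows.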

6. Meaningful artifact validation

  • Removed assert True statements in evaluation modules.
  • Added assertions to verify that plotting classes actually generate valid matplotlib and seaborn objects.

Status

  • Ready to go

Developer Certificate of Origin

smcolby and others added 19 commits February 25, 2026 16:19
Added an index filtering step to FeatureConcatenator. Previously, if different featurizers dropped different molecules, the raw arrays were still concatenated, resulting in shape mismatches or mismatched rows. The features are now strictly masked to the common indices prior to concatenation.
This overhaul replaces slow, high-dependency integration tests with true unit tests utilizing pytest-mock and synthetic data fixtures. Key changes include swapping tautological file-writing mocks for internal state assertions, enforcing strict disjoint set validation for chemical splitters, and implementing rigorous mathematical validation for uncertainty quantification and evaluation metrics. These updates significantly improve execution speed and cross-platform stability by replacing fragile floating-point equality with robust approximate comparisons and isolating testing boundaries for featurizers, inference orchestration, and CLI logic.

codecov-commenter commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 91.48936% with 8 lines in your changes missing coverage. Please review.



smcolby commented Feb 27, 2026

I'm not super happy with the tests in openadmet.models.tests.anvil. We want these to be truly unit tests (low execution time), but also actually meaningful/useful.

- Strip out unmaintained concrete stub classes.
- Use mocker.create_autospec(..., instance=True) to satisfy Pydantic validations.
- Centralize mock state injection in the build_workflow helper.
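The autospec approach above can be illustrated with unittest.mock (Featurizer here is a hypothetical stand-in; pytest-mock's mocker fixture exposes the same create_autospec call):

```python
from unittest import mock

class Featurizer:
    def featurize(self, smiles): ...

# instance=True yields an object that passes isinstance checks,
# which is what Pydantic-style validation needs.
feat = mock.create_autospec(Featurizer, instance=True)
feat.featurize.return_value = [0.0, 1.0]

assert isinstance(feat, Featurizer)
assert feat.featurize("CCO") == [0.0, 1.0]
feat.featurize.assert_called_once_with("CCO")   # signature-checked by autospec
```

Because autospec mirrors the real signature, a call like feat.featurize() with missing arguments fails at test time instead of silently passing, which is exactly what plain Mock objects get wrong.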
Remove tautological mocking of the model, featurizer, metadata, and data spec in test_inference.py. Replace with real instantiated FingerprintFeaturizer, DummyRegressorModel, CommitteeRegressor, Metadata, and DataSpec objects so that SMILES physically flow through the featurization and prediction pipeline.

Only the file I/O boundary (load_anvil_model_and_metadata) remains patched. Assertions now verify mathematically derived values: single model PRED=1.0 (training mean), ensemble PRED=2.0/STD=1.0, and UCB=4.0 (mean + beta*std = 2.0 + 2.0*1.0).
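The ensemble arithmetic can be checked by hand (illustrative member predictions; beta = 2.0 as in the assertion above):

```python
import statistics

preds = [1.0, 3.0]                    # hypothetical committee member predictions
mean = statistics.fmean(preds)        # ensemble PRED = 2.0
std = statistics.pstdev(preds)        # ensemble STD (population) = 1.0
beta = 2.0
ucb = mean + beta * std               # UCB = 2.0 + 2.0 * 1.0 = 4.0
assert ucb == 4.0
```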
Comment thread openadmet/models/anvil/specification.py Outdated
feat_kwargs = {
"type": self.procedure.feat.type,
"params": self.procedure.feat.params,
}
Contributor Author

@hmacdope This is just one idea of how to get around parent_spec for what I've been calling "class-external" fields, i.e. fields that aren't defined in the class returned by the spec section (but still required externally, as they modulate behavior of how the class gets used). It's actually my idea, not AI (though I did tell AI to implement this pattern...)

data_yaml=recipe_components / "data.yaml",
report_yaml=recipe_components / "eval.yaml",
)
return result
Contributor Author

This pattern got added, I think because it allows YAMLs to be written here, instead of trying to write them in AnvilWorkflow.run. The latter requires complete passthrough of everything needed to write the YAML, i.e. parent_spec, so YAML writing needed to be removed from AnvilWorkflow.

Certainly can remove if we want to handle YAML output somewhere else (or just write the YAMLs in the to_workflow call, then still run the workflow from AnvilWorkflow.run).

Contributor

this doesn't bother me, but if you asked me if this task should be a part of the global workflow, or a part of specification defining, i would say the global workflow.

Contributor

eh actually, I am thinking about it more. I like it here.

Contributor Author

Leaving comment open for @hmacdope to weigh in

data_yaml=recipe_components / "data.yaml",
report_yaml=recipe_components / "eval.yaml",
)

Contributor Author

@hmacdope Here's the YAML writing within workflow that got removed. Necessary to decouple parent_spec.

Comment thread openadmet/models/anvil/workflow.py Outdated
if not Path(s).exists():
raise ValueError(f"serial_path '{s}' does not exist.")
return self

Contributor Author

I added this to prevent upstream execution (loading data, featurization, etc.) before we hit the error that says "oops, your paths aren't specified correctly, or don't even exist"

Comment thread openadmet/models/anvil/workflow.py Outdated
raise ValueError(
f"The model has {self.model.n_tasks} tasks but the data specification has {len(self.data_spec.target_cols)} target columns."
f"The model has {self.model._n_tasks} tasks but the data specification has {len(self.data_spec.target_cols)} target columns."
)
Contributor Author

Only _n_tasks is guaranteed to be defined; n_tasks only sometimes exists. See #498.

if y_arr.ndim == 2 and y_arr.shape[1] == 1:
y_arr = y_arr.ravel()
self.estimator = self.estimator.fit(X, y_arr)
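The coercion above flattens single-column targets to 1-D, the shape scikit-learn-style estimators generally expect for single-task regression. A standalone sketch:

```python
import numpy as np

y = np.array([[1.0], [2.0], [3.0]])   # single-task targets as a (3, 1) column
if y.ndim == 2 and y.shape[1] == 1:
    y = y.ravel()                     # flatten to (3,) before estimator.fit
assert y.shape == (3,)
```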

Contributor Author

@dwwest Make sure this jibes with our shape coercion elsewhere.

Contributor

This is the only place we do this. If we're going to make this change, can we ask the agent to make the equivalent change in every architecture?

Contributor Author

Give it a try!

Contributor Author

I thought we do check shapes in other places though

Contributor Author

See #509

wf = spec.to_workflow()
click.echo(f"Workflow initialized successfully with recipe: {recipe_path}")
wf.run(tag=tag, debug=debug, output_dir=output_dir)
spec.run(tag=tag, debug=debug, output_dir=output_dir)
Contributor Author

@hmacdope Implements the run method off of Specification instead of Workflow so YAMLs get saved. See other comment for more details / options.

# find where common_indices are in idx
mask = np.isin(idx, common_indices)
filtered_feats.append(feat[mask])

Contributor Author

@dwwest Fixes a bug where the feature index intersection wasn't actually being leveraged. Please double check!

Contributor

I fully missed this, whoops, but lgtm

@smcolby smcolby self-assigned this Mar 1, 2026
@smcolby smcolby requested review from khuddzu March 4, 2026 17:36
Comment thread openadmet/models/anvil/workflow.py
section_name: ClassVar[str] = "feat"
type: str | None = None
params: dict = Field(default_factory=dict)

Contributor

I'm confused what we're talking about here, do you mean the logic that reads the driver type from the trainer? What does that have to do with featurization?


Comment on lines +494 to +499
param_paths = self.ensemble_kwargs.get("param_paths")
serial_paths = self.ensemble_kwargs.get("serial_paths")
if (param_paths is None) != (serial_paths is None):
raise ValueError(
"Both param_paths and serial_paths must be provided together for ensemble finetuning."
)
Contributor

This section is the same whether ensemble is true or false.

Contributor Author

Not quite! param_path and serial_path come from model_kwargs if ensemble is False. If True, param_paths and serial_paths (note plural) come from ensemble_kwargs.
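The branching can be sketched like so (resolve_paths is a hypothetical helper illustrating the described behavior, not code from the PR):

```python
def resolve_paths(ensemble: bool, model_kwargs: dict, ensemble_kwargs: dict):
    """Singular keys from model_kwargs, plural keys from ensemble_kwargs."""
    if ensemble:
        # Plural keys: one path per ensemble member.
        return ensemble_kwargs.get("param_paths"), ensemble_kwargs.get("serial_paths")
    # Singular keys: a single model's paths.
    return model_kwargs.get("param_path"), model_kwargs.get("serial_path")

assert resolve_paths(False, {"param_path": "m.json", "serial_path": "m.pkl"}, {}) == ("m.json", "m.pkl")
assert resolve_paths(True, {}, {"param_paths": ["a.json"], "serial_paths": ["a.pkl"]}) == (["a.json"], ["a.pkl"])
```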

Contributor

We should keep this for now, but in the future I wouldn't place much importance on updating any of the mtenn code or tests. It might be something that is removed or archived.

Contributor Author

Yeah... I at least wanted what's here to run quickly. But agreed this may end up phased out.

Contributor

@khuddzu khuddzu left a comment

Looks good, I left a couple of comments, but nothing wild. I want to ensure that you worked out the issue with mock loading data. Did you resolve the issue of the tests being coded in order to pass, not actually testing the package? I didn't notice this, but could have missed something.


smcolby commented Mar 4, 2026

> Looks good, I left a couple of comments, but nothing wild. I want to ensure that you worked out the issue with mock loading data. Did you resolve the issue of the tests being coded in order to pass, not actually testing the package? I didn't notice this, but could have missed something.

Yep, as far as I was able to convince myself, the tests are not tautological. That was an artifact of an earlier pass.

