nemo_retriever: Add structured harness metrics reports #1699

Open
jioffe502 wants to merge 7 commits into NVIDIA:main from jioffe502:fix/harness_metrics

Conversation

Collaborator

@jioffe502 jioffe502 commented Mar 23, 2026

Switch the retriever harness from stdout scraping to structured run reports shared across supported run modes. Follow-on commits also harden artifact metadata handling so these reports stay reliable in real container/worktree runs.

  • add shared batch, inprocess, and fused run-report plumbing and wire the harness to consume it
  • persist runtime and detection metrics as structured artifacts instead of parsing console output
  • keep latest_commit populated in harness artifacts by falling back to .git/HEAD refs when git rev-parse fails in container/worktree setups
  • preserve detection counters through LanceDB round-trips by writing metadata as JSON and accepting legacy metadata literals during detection-summary reads
  • validate behavior with real bo20/jp20 harness runs as integration coverage

- route harness runs through shared batch, fused, and inprocess reports
- persist runtime and detection metrics without scraping stdout
- tighten tests around config and LanceDB metadata behavior

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
@jioffe502 jioffe502 requested a review from a team as a code owner March 23, 2026 21:52
@jioffe502 jioffe502 marked this pull request as draft March 23, 2026 22:00
jioffe502 and others added 5 commits March 24, 2026 16:55
Revert non-essential docs/example/utility churn so this branch only carries the structured run report metrics path in harness and mode runners.
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Keep harness run metadata stable across container/worktree setups by falling back to git HEAD refs when rev-parse fails, and persist detection counters as JSON metadata so detection summaries survive LanceDB round-trips.

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
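The "persist detection counters as JSON metadata" part of this commit can be sketched roughly as follows. The function names here are hypothetical, not the PR's actual helpers; the point is the round-trip: write counters as JSON, and on read accept both JSON and the legacy `str(dict)` literals older artifacts may contain.

```python
import ast
import json


def encode_detection_metadata(counters: dict) -> str:
    """Write detection counters as JSON so they survive a storage round-trip."""
    return json.dumps(counters)


def decode_detection_metadata(raw: str) -> dict:
    """Read metadata written as JSON, accepting legacy Python-literal strings."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Older artifacts stored metadata via str(dict), e.g. "{'tables': 3}",
        # which is not valid JSON but is a valid Python literal.
        return ast.literal_eval(raw)
```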
@jioffe502 jioffe502 marked this pull request as ready for review March 30, 2026 21:17
@jioffe502 jioffe502 requested a review from a team as a code owner March 30, 2026 21:17
@jioffe502 jioffe502 requested a review from edknv March 30, 2026 21:17
)


class FusedPipelineConfig(ModePipelineConfigModel):
Collaborator

This is almost exactly the same as the other PipelineConfig. Is there nothing we can reuse across both files? Maybe create a generalized version of the config and tweak it only where necessary. Also, we have no fused support yet, so I don't think you need to add it.
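The reuse the reviewer is suggesting might look like the following sketch. The real code builds on a Pydantic `ModePipelineConfigModel`; this uses plain dataclasses and invented field names purely to illustrate the pattern of one generalized config with minimal per-mode overrides.

```python
from dataclasses import dataclass


@dataclass
class ModePipelineConfig:
    """Fields shared by every run mode; subclasses override only what differs."""

    run_mode: str = "batch"
    max_workers: int = 4
    gpu_devices: int = 1


@dataclass
class InProcessPipelineConfig(ModePipelineConfig):
    """In-process mode reuses everything, changing only the mode tag."""

    run_mode: str = "inprocess"
```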

shutdown_ray_safely()


def render_fused_run_report(report: RunReport) -> None:
Collaborator

Rendering the report should be the same no matter the pipeline. We should not have mode-specific rendering logic in each module.

Collaborator Author

Consolidated rendering into shared report helpers in application/modes/reports.py and switched batch/inprocess to use the same render path.
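A shared render path of the kind described here could be sketched as below. `RunReport` and `render_run_report` are stand-ins, not the PR's actual types; the point is that one function handles every mode's report.

```python
from dataclasses import dataclass, field


@dataclass
class RunReport:
    """Minimal stand-in for the structured report every run mode returns."""

    run_mode: str
    metrics: dict = field(default_factory=dict)


def render_run_report(report: RunReport) -> str:
    """One shared render path, regardless of which mode produced the report."""
    lines = [f"run_mode: {report.run_mode}"]
    lines.extend(f"  {name}: {value}" for name, value in sorted(report.metrics.items()))
    return "\n".join(lines)
```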

)


class InProcessPipelineConfig(ModePipelineConfigModel):
Collaborator

Same as above: DRY this up.

return ingestor, file_patterns


def run_inprocess_pipeline(cfg: InProcessPipelineConfig) -> RunReport:
Collaborator

Shouldn't the only difference between this run pipeline and batch be the run_mode string? If so, this shouldn't exist as separate code; it should just be a switch into one generalized function.
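The "switch into one generalized function" idea can be sketched like this. The setup values and executor names are invented for illustration; the shape is one entry point with a per-mode dispatch table, so the shared logic is written exactly once.

```python
def run_pipeline(cfg: dict, run_mode: str) -> dict:
    """One generalized entry point; per-mode setup is a dispatch table, not a copy."""
    mode_setup = {
        "batch": lambda c: {"executor": "ray"},
        "inprocess": lambda c: {"executor": "local"},
    }
    if run_mode not in mode_setup:
        raise ValueError(f"unsupported run_mode: {run_mode}")
    runtime = mode_setup[run_mode](cfg)
    # Everything after setup is identical across modes, so it lives here once.
    return {"run_mode": run_mode, **runtime}
```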

model_config = ConfigDict(extra="forbid")


class _TeeStream:
Collaborator

Why do we need stream controls? I thought this PR was removing the need for that. We don't care what the on-screen display looks like (we leave that to lower-level libraries, like Ray); we just need the datapoints from the run, and we no longer get those from stdout/stderr, right?

Collaborator Author

Removed stream tee/logging controls from application/modes/shared.py to keep this path focused on structured datapoints and mode orchestration.
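The "structured datapoints instead of stdout" approach the thread converges on can be sketched as below. The file name and function are hypothetical; the idea is that metrics land in a JSON artifact the harness reads back, so nothing ever parses console output.

```python
import json
from pathlib import Path


def write_metrics_artifact(out_dir: Path, metrics: dict) -> Path:
    """Persist run metrics as a structured JSON artifact; nothing reads stdout."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "run_metrics.json"
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return path
```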

return import_module("lancedb")


def ensure_lancedb_table(uri: str, table_name: str) -> None:
Collaborator

Don't we have this in other parts of the codebase already? If it's not a duplicate, why move it here? This is LanceDB functionality, nothing to do with run modes; I should not have to import the lance package in this module.

Collaborator Author

Moved LanceDB lifecycle/read helpers out of application/modes/shared.py into vector_store/lancedb_store.py, and updated mode/detection consumers to import from there.

return datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S_UTC")


_COMMIT_RE = re.compile(r"^[0-9a-fA-F]{7,40}$")
Collaborator

Why are we changing commit string conversion logic in this PR? Did you do other stuff in the PR as well?

Collaborator Author

Kept the commit fallback hardening because it stabilizes latest_commit in container/worktree runs; it came from running inside a dev container. Added focused tests for the gitdir ref-file and packed-refs fallbacks to make the scope explicit.
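The fallback described here, resolving HEAD by reading `.git` files directly when `git rev-parse` fails, might look roughly like this. The function name is hypothetical; `_COMMIT_RE` mirrors the pattern visible in the diff context, and the loose-ref and packed-refs paths follow git's standard on-disk repository layout.

```python
import re
from pathlib import Path
from typing import Optional

_COMMIT_RE = re.compile(r"^[0-9a-fA-F]{7,40}$")


def latest_commit_fallback(git_dir: Path) -> Optional[str]:
    """Resolve HEAD by reading .git files directly, for setups where
    `git rev-parse` fails (containers, worktrees)."""
    head = (git_dir / "HEAD").read_text().strip()
    if _COMMIT_RE.match(head):
        return head  # detached HEAD: the file holds the SHA itself
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        ref_file = git_dir / ref
        if ref_file.exists():
            return ref_file.read_text().strip()  # loose ref file holds the SHA
        packed = git_dir / "packed-refs"
        if packed.exists():
            for line in packed.read_text().splitlines():
                if line.endswith(" " + ref):
                    return line.split(" ", 1)[0]  # packed-refs entry: "<sha> <ref>"
    return None
```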

"max_workers",
"gpu_devices",
}
FUSED_TUNING_FIELDS = {
Collaborator

We don't have support for fused, so why add the fused entries? Also, why are some of the kwargs prefixed with fused... is that supposed to mean something different?

Collaborator Author

Removed the fused harness scope from this PR: harness/config.py and harness/run.py now only support batch/inprocess for this metrics work, fused-specific harness test coverage was removed, and run_fused is back to wrapper-only.

Align the harness to structured batch/inprocess reporting boundaries so artifacts and runtime metadata remain interpretable and reviewable. This also removes stale fused-path plumbing and consolidates LanceDB/detection/report helpers to reduce drift in future harness iterations.

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>