P2: Sweep configs for TTT/threshold/gate tuning + Rust inference spec #14
Conversation
Co-authored-by: thinksyncs <42225585+thinksyncs@users.noreply.github.com>
Pull request overview
Adds example sweep configurations (TTT/threshold/gate-weight tuning) and accompanying documentation, plus a Rust production-inference specification doc, to complete the P2 roadmap item around sweep exploration and production inference guidance.
Changes:
- Added three new sweep JSON configs for TTT, score-threshold tuning, and gate-weight tuning.
- Added documentation for running/analyzing sweeps and a Rust inference “spec template”.
- Added a test file intended to validate sweep config structure; updated README and added a P2 security summary.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/test_sweep_configs.py | Adds a config-validation “test” for sweep example JSON files. |
| docs/sweep_ttt_example.json | Adds a 64-run param grid for TTT tuning (method/reset/steps/lr). |
| docs/sweep_threshold_example.json | Adds an 8-run score-threshold sweep chaining export + COCO eval. |
| docs/sweep_gate_weights_example.json | Adds an 8-run sweep over gate-weight grid specs / metric target. |
| docs/sweep_examples.md | Adds a usage guide and analysis tips for the new sweep configs. |
| docs/rust_inference_template.md | Adds a Rust inference implementation guide/spec (not an implementation). |
| SECURITY_SUMMARY_P2.md | Adds a security review summary for the P2 sweep/doc additions. |
| README.md | Marks P2 as done and links to sweep examples + production inference cores. |
> ### Sweep Configurations
> The sweep configs execute shell commands via `hpo_sweep.py`. Security notes:
> - Commands are parameterized via config file (user controls all inputs)
> - Environment variables are set explicitly in config (no inheriting untrusted env)
This security note is factually incorrect for the current implementation: `tools/hpo_sweep.py` starts from `os.environ.copy()` and then overlays values from the config, so it does inherit the caller's environment in addition to the explicit env block. Please reword to reflect that behavior (or change the implementation if the intention is to run with a clean env).
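The inherit-then-overlay behavior described here can be sketched as follows (a minimal sketch only; `build_run_env` and `config_env` are hypothetical names, not the actual `hpo_sweep.py` code):

```python
import os

def build_run_env(config_env: dict) -> dict:
    """Hypothetical helper mirroring the described behavior: start from the
    caller's environment, then overlay the explicit values from the config."""
    env = os.environ.copy()   # inherits everything from the caller's environment
    env.update(config_env)    # config-specified variables win on key collisions
    return env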
Suggested change:
```diff
-- Environment variables are set explicitly in config (no inheriting untrusted env)
+- Environment starts from the caller's environment, with variables from the config explicitly overlaying it (callers should ensure their environment is trusted or run with a sanitized env)
```
```json
"run_dir": "runs/sweep_threshold/{run_id}",
"metrics": {
  "path": "{run_dir}/metrics.json",
  "keys": ["map50", "map50_95", "mar_100"]
```
The `metrics.keys` paths don't match the actual JSON structure written by `tools/eval_coco.py`. That tool writes metrics under `metrics.*` (e.g. `metrics.map50_95`, `metrics.map50`, `metrics.ar100`), so the current keys (`map50`, `map50_95`, `mar_100`) will all extract as null/None.
Suggested change:
```diff
-"keys": ["map50", "map50_95", "mar_100"]
+"keys": ["metrics.map50", "metrics.map50_95", "metrics.ar100"]
```
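Why wrong keys silently become null: dot-path extraction typically walks nested dicts and returns `None` on any missing segment. A minimal sketch of that behavior (assumed semantics; `extract_metric` is a hypothetical stand-in, not the actual `hpo_sweep.py` code):

```python
def extract_metric(report: dict, dotted_key: str):
    """Walk nested dicts along a dotted key path; None if any segment is missing."""
    node = report
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None   # a missing key surfaces as null in the sweep output
        node = node[part]
    return node

# Nested shape as described in the review comment above:
report = {"metrics": {"map50": 0.61, "map50_95": 0.42, "ar100": 0.55}}
extract_metric(report, "metrics.map50")  # -> 0.61
extract_metric(report, "map50")          # -> None (no top-level "map50" key)
```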
> **Command breakdown**:
> 1. Export predictions with varying thresholds
> 2. Evaluate each with COCO mAP (`eval_coco.py`)
> 3. Extract `map50`, `map50_95`, `mar_100` from metrics JSON
This section describes extracting `map50`, `map50_95`, and `mar_100` directly from the eval output JSON, but `tools/eval_coco.py` nests these under `metrics.*` and uses `ar100` (not `mar_100`). Update the doc to match the actual output schema so readers can copy/paste the sweep config successfully.
> Each run produces a `gate_tuning_report.json` with:
> - `best_weights` (`{det, tmp, unc}`): optimal weights found
> - `best_score`: mAP achieved with those weights
> - `grid_results`: full grid search results
The document claims `gate_tuning_report.json` contains `best_weights` and `best_score`, but `tools/tune_gate_weights.py` writes a metrics report where the best row is under `metrics.tuning.best` (with `det`/`tmp`/`unc` and `map50`/`map50_95`). If you want `best_weights` in the report, it needs to be added by the tool; otherwise update this section to reflect the current report structure.
Suggested change:
```diff
-Each run produces a `gate_tuning_report.json` with:
-- `best_weights` (`{det, tmp, unc}`): optimal weights found
-- `best_score`: mAP achieved with those weights
-- `grid_results`: full grid search results
+Each run produces a `gate_tuning_report.json` metrics report with:
+- `metrics.tuning.best.det`, `metrics.tuning.best.tmp`, `metrics.tuning.best.unc`: optimal gate weights found
+- `metrics.tuning.best.map50`, `metrics.tuning.best.map50_95`: mAP scores achieved with those weights
+- additional tuning rows under `metrics.tuning` that the sweep harness can aggregate into CSV/Markdown
```
```python
#!/usr/bin/env python3
"""
Test suite for sweep configuration examples.
Validates JSON structure and parameter combinations.
"""
import json
import sys
from pathlib import Path
```
This module won't execute any checks under the repo's CI (`python -m unittest`) because it defines no `unittest.TestCase` (or `test_*` functions). As a result, these sweep configs won't actually be validated in CI. Convert this into a `unittest.TestCase` with test methods that assert validity for each config path (and drop the custom `main()`/`sys.exit()` entrypoint, or keep it only under `if __name__ == "__main__"`).
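A minimal sketch of the suggested conversion (the required keys and file layout here are assumptions for illustration, not the repo's actual schema):

```python
import json
import unittest
from pathlib import Path

# Hypothetical minimal schema; the real required keys may differ.
REQUIRED_KEYS = {"param_grid", "run_dir"}

def missing_keys(config: dict) -> list:
    """Return the sorted required keys absent from a sweep config (empty = valid)."""
    return sorted(REQUIRED_KEYS - config.keys())

class SweepConfigTest(unittest.TestCase):
    """Discovered by `python -m unittest` because it subclasses TestCase."""

    def test_example_configs_are_valid(self):
        for path in Path("docs").glob("sweep_*_example.json"):
            with self.subTest(config=path.name):
                config = json.loads(path.read_text())
                self.assertEqual(missing_keys(config), [])
```

Keeping the validation logic in a plain function (`missing_keys`) lets both CI discovery and any manual entrypoint reuse it.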
```json
"grid_unc": ["0.0,0.25,0.5,0.75,1.0", "0.0,0.5,1.0"],
"metric": ["map50_95", "map50"]
},
"param_order": ["metric", "grid_tmp", "grid_unc"],
```
`param_order` omits `grid_det` even though it's in `param_grid`. If someone changes `grid_det` to have multiple values later, different runs could collapse to the same `run_id` (affecting `--resume` behavior and output directories). Consider including `grid_det` in `param_order` for safety/consistency.
Suggested change:
```diff
-"param_order": ["metric", "grid_tmp", "grid_unc"],
+"param_order": ["metric", "grid_det", "grid_tmp", "grid_unc"],
```
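The collision risk can be illustrated with a small sketch (the exact `run_id` naming scheme in `hpo_sweep.py` may differ; `make_run_id` is a hypothetical stand-in):

```python
def make_run_id(params: dict, param_order: list) -> str:
    """Join only the parameters listed in param_order, in that order."""
    return "_".join(f"{key}-{params[key]}" for key in param_order)

order = ["metric", "grid_tmp", "grid_unc"]  # grid_det omitted, as in the config
run_a = make_run_id({"metric": "map50", "grid_det": "0.5,1.0",
                     "grid_tmp": "0.0,0.5", "grid_unc": "0.0,0.5"}, order)
run_b = make_run_id({"metric": "map50", "grid_det": "0.25,0.75",
                     "grid_tmp": "0.0,0.5", "grid_unc": "0.0,0.5"}, order)
run_a == run_b  # True: two runs differing only in grid_det collapse to one run_id
```

With `--resume`, the second run would be skipped as "already done" and both would write into the same output directory.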
```json
"run_dir": "runs/sweep_gate_weights/{run_id}",
"metrics": {
  "path": "{run_dir}/gate_tuning_report.json",
  "keys": ["best_weights.det", "best_weights.tmp", "best_weights.unc", "best_score"]
```
The `metrics.keys` paths don't match the structure of `gate_tuning_report.json` produced by `tools/tune_gate_weights.py` (it's a metrics report with fields under `metrics.tuning.*`). As written, these keys will extract as null/None. Update them to the correct dot-paths (e.g., pull `metrics.tuning.best.det`/`tmp`/`unc` and the selected metric value).
Suggested change:
```diff
-"keys": ["best_weights.det", "best_weights.tmp", "best_weights.unc", "best_score"]
+"keys": [
+  "metrics.tuning.best.det",
+  "metrics.tuning.best.tmp",
+  "metrics.tuning.best.unc",
+  "metrics.tuning.best.score"
+]
```
```
| Sweep Type | Config File | Typical Runs | Outputs | Use Case |
|------------|-------------|--------------|---------|----------|
| TTT | `sweep_ttt_example.json` | 64 | `sweep_ttt.{jsonl,csv,md}` | Find best TTT hyperparams |
| Threshold | `sweep_threshold_example.json` | 8 | `sweep_threshold.{jsonl,csv,md}` | Find optimal score cutoff |
| Gate Weights | `sweep_gate_weights_example.json` | 8 | `sweep_gate_weights.{jsonl,csv,md}` | Tune inference-time score fusion |
```
The summary table is using `||` at the start of each row, which is not valid GitHub-flavored Markdown table syntax and renders incorrectly. Use single leading/trailing pipes (`| ... |`) like the other tables in the repo.
Completes P2 roadmap item: extends the existing sweep harness for hyperparameter exploration and documents production inference paths.

**Sweep Configurations**

Added three parameterized sweep configs building on `hpo_sweep.py`:

- TTT (`sweep_ttt_example.json`): 64 runs exploring tent/mim methods, adaptation steps (1-10), learning rates (1e-5 to 5e-4), reset policies
- Threshold (`sweep_threshold_example.json`): 8 runs spanning score thresholds 0.001-0.5, chains export + COCO eval
- Gate weights (`sweep_gate_weights_example.json`): 8 runs grid-searching detection/template/uncertainty fusion weights

All configs use environment variables for fixed settings (dataset, checkpoint, device), support `--resume` for incremental runs, and emit CSV/MD tables.

Example TTT sweep invocation:

```
python3 tools/hpo_sweep.py --config docs/sweep_ttt_example.json --resume
# Outputs: reports/sweep_ttt.{jsonl,csv,md}
```

**Production Inference**

- C++ (`examples/infer_cpp/`): verified complete, with stub/ONNXRuntime/TensorRT runners and a CMake build
- Rust (`docs/rust_inference_template.md`): implementation guide covering ort/tract/candle backends, preprocessing/postprocessing, schema compliance

**Documentation**

- `docs/sweep_examples.md`: usage guide with plotting examples and tips for reproducible comparisons

Cache/re-run is already implemented via the `--cache` flag (SHA256 config fingerprinting, automatic `runs/yolozu_runs/<hash>/` organization).
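The SHA256 config fingerprinting mentioned above can be sketched as follows (the canonicalization step is an assumption; the actual hashing in `hpo_sweep.py` may differ):

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON serialization so key order doesn't change the hash."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

config_fingerprint({"sweep": "ttt", "runs": 64})  # 64-char hex digest
```

Sorting keys before hashing is what makes the fingerprint stable across semantically identical configs.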