Context
The STAMP deploy test currently validates that swarm training completes successfully, but it does not compare the federated model's quality against a local-only training baseline.
In the initial deploy test run (2 rounds, VIT, synthetic data), both clients achieved an AUROC of 1.0. Because the synthetic classes are easy to separate, that result is not a meaningful quality benchmark.
Proposed Solution
- Automated local baseline: Before swarm training, run training with `--local_training` on each client's data and record the final metrics (val_loss, val_auroc)
- Comparison: After swarm training completes, compare the federated metrics against each client's local baseline
- Reporting: Output a comparison table in the deploy test results JSON (illustrative values; `improvement` here is relative to the best local AUROC):
```json
{
"federated_auroc": 0.85,
"local_auroc_site_1": 0.78,
"local_auroc_site_2": 0.72,
"improvement": "+9.0%"
}
```
- Regression check (optional): Fail the test if the federated model is worse than the best local model by more than a configurable tolerance
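
The comparison and regression check above could be sketched roughly as follows. This is a minimal illustration, not existing code in the repository: the function name `build_comparison`, the `tolerance` parameter and its default, and the extra `regression` field are all hypothetical, and it assumes that `improvement` means the relative gain over the best local AUROC.

```python
import json


def build_comparison(federated_auroc, local_aurocs, tolerance=0.02):
    """Compare a federated AUROC against per-site local baselines.

    local_aurocs maps a site name (e.g. "site_1") to that site's local AUROC.
    tolerance is a hypothetical absolute AUROC margin, not an existing setting.
    """
    if not local_aurocs:
        raise ValueError("at least one local baseline is required")

    best_local = max(local_aurocs.values())

    report = {"federated_auroc": federated_auroc}
    for site, auroc in sorted(local_aurocs.items()):
        report[f"local_auroc_{site}"] = auroc

    # Assumption: improvement is the relative gain over the best local model.
    if best_local > 0:
        gain = (federated_auroc - best_local) / best_local * 100
        report["improvement"] = f"{gain:+.1f}%"
    else:
        report["improvement"] = "n/a"

    # Flag a regression only when the federated model falls below the
    # best local model by more than the tolerance.
    report["regression"] = federated_auroc < best_local - tolerance
    return report


report = build_comparison(0.85, {"site_1": 0.78, "site_2": 0.72})
print(json.dumps(report, indent=2))
```

The deploy test orchestrator could then exit with a non-zero status when `report["regression"]` is true. An absolute tolerance is used here so that small run-to-run noise in AUROC does not fail the test; the right margin would need to be chosen from real data.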
Why This Matters
The core value proposition of swarm learning is that the federated model should perform at least as well as (ideally better than) any single site's local model. Without this check, we can't validate that model aggregation is actually beneficial.
Related Files
- `scripts/deploy/run_stamp_deploy_test.sh` — deploy test orchestrator
- `docker_config/master_template_STAMP.yml` — has `--local_training` flag
- `application/jobs/STAMP_classification/app/custom/main.py` — supports `local_training` mode