Why
Issue #15 was a one-line wire-format bug: `Chronos2Backend.load()` passed `dtype=` to `BaseChronosPipeline.from_pretrained` where the signature accepts `torch_dtype=`. The existing `Chronos2Backend` unit tests mock the pipeline and only assert on mock call args, so they kept passing while the real integration was broken.
The bug only surfaced when running `examples/quickstart_batch.py` end-to-end against `amazon/chronos-bolt-tiny`, i.e. when a human happened to try the quickstart. This is the class of bug we should catch in CI, not at the demo.
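To illustrate why the mocked tests stayed green, here is a minimal sketch using only `unittest.mock`. The `from_pretrained` function is a hypothetical stand-in with a strict keyword signature, not the actual chronos API, and `load` only mimics the misspelled keyword. A bare `MagicMock` accepts any keyword arguments, so asserting on its call args merely confirms the typo. Calling the real function fails, and so does an autospecced mock. Autospec would not help if the real signature forwards `**kwargs`, which is one more reason to exercise the real library.

```python
from unittest import mock


def from_pretrained(model_id, *, torch_dtype=None, device_map=None):
    """Hypothetical stand-in with a strict signature (not the chronos API)."""
    return {"model_id": model_id, "torch_dtype": torch_dtype, "device_map": device_map}


def load(factory):
    """Mimics the bug: the keyword is spelled `dtype` instead of `torch_dtype`."""
    return factory("amazon/chronos-bolt-tiny", dtype="float32")


def mock_only_test_passes():
    # A plain MagicMock accepts any kwargs, so this assertion only
    # re-states what the buggy code does.
    loose = mock.MagicMock()
    load(loose)
    loose.assert_called_once_with("amazon/chronos-bolt-tiny", dtype="float32")
    return True


def real_call_fails():
    # Against the strict stand-in, the wrong keyword raises TypeError.
    try:
        load(from_pretrained)
    except TypeError:
        return True
    return False


def autospec_call_fails():
    # create_autospec copies the stand-in's signature, so the mock
    # rejects the unknown keyword just like the real function.
    strict = mock.create_autospec(from_pretrained)
    try:
        load(strict)
    except TypeError:
        return True
    return False
```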
What
Add a smoke test gated on `SHEAF_SMOKE_TEST=1` (consistent with `tests/test_smoke_ray.py`, `tests/test_smoke_whisper.py`) that:
- Constructs `Chronos2Backend(model_id="amazon/chronos-bolt-tiny", device_map="cpu", torch_dtype="float32")`
- Calls `.load()` against the real chronos library — no mocks
- Calls `.predict(TimeSeriesRequest(history=[1.0, 2.0, 3.0, 4.0], horizon=3, frequency="1h"))` and asserts the response is a valid `TimeSeriesResponse` with `len(mean) == 3`
Skip if `chronos` isn't importable so the test is a no-op without `[time-series]` installed.
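The steps above could look roughly like the sketch below. It uses the standard-library `unittest` (which pytest also collects) so the gating logic has no extra dependencies. The import paths `sheaf.backends.chronos2` and `sheaf.api.time_series` are assumptions, as is the helper name `smoke_skip_reason`; the existing smoke tests may already have an equivalent gate that should be reused instead.

```python
import importlib.util
import os
import unittest


def smoke_skip_reason(env, module_name):
    """Return a skip reason, or None when the smoke test should run.

    Hypothetical helper: gates on SHEAF_SMOKE_TEST=1 and on the
    optional dependency being importable.
    """
    if env.get("SHEAF_SMOKE_TEST") != "1":
        return "set SHEAF_SMOKE_TEST=1 to run smoke tests"
    if importlib.util.find_spec(module_name) is None:
        return f"{module_name} is not installed"
    return None


_SKIP = smoke_skip_reason(os.environ, "chronos")


@unittest.skipIf(_SKIP is not None, _SKIP or "")
class Chronos2SmokeTest(unittest.TestCase):
    def test_load_and_predict(self):
        # Assumed import paths; adjust to where these classes actually live.
        from sheaf.api.time_series import TimeSeriesRequest, TimeSeriesResponse
        from sheaf.backends.chronos2 import Chronos2Backend

        backend = Chronos2Backend(
            model_id="amazon/chronos-bolt-tiny",
            device_map="cpu",
            torch_dtype="float32",
        )
        backend.load()  # real chronos library, no mocks

        response = backend.predict(
            TimeSeriesRequest(
                history=[1.0, 2.0, 3.0, 4.0], horizon=3, frequency="1h"
            )
        )
        self.assertIsInstance(response, TimeSeriesResponse)
        self.assertEqual(len(response.mean), 3)
```

Keeping the project imports inside the test body means the module can still be collected when neither `sheaf`'s time-series extras nor `chronos` are installed.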
`amazon/chronos-bolt-tiny` is ~80MB — small enough for a CI smoke run if we ever wire one up; for now it just runs locally when a contributor sets the env var.
Stretch
Same shape for the other backends whose `load()` makes a real `from_pretrained` call: ESM-3, Nucleotide Transformer, MolFormer, MACE, Prithvi, GraphCast, FLUX, SDXL, VideoMAE, ViTPose, RAFT, DINOv2, OpenCLIP, SAM2, Depth Anything, DETR, MusicGen, Bark, Kokoro, Whisper (already covered by `test_smoke_whisper.py`). All can be the same pattern — gated on `SHEAF_SMOKE_TEST=1` and the relevant import-skipif.
A follow-up question worth answering once these exist: should CI run the cheap ones (chronos-bolt-tiny, dinov2-small, whisper-tiny.en, MACE-MP-0-small) on a separate `smoke` job that's allowed to take 5-10 min? That would catch wire-level regressions in mainline-relevant backends without blocking the fast unit suite.
Acceptance criteria
Related