fix: release image router models after compression #280
eggrollofchaos wants to merge 1 commit into chopratejas:main
Conversation
eggrollofchaos
left a comment
Reviewing the lifecycle change. Overall the per-request create-and-close pattern is sound and the registry unload_prefix shutdown hook is well-placed. Two findings that look worth addressing before merge, plus a few smaller notes.
High — ContentRouter._image_optimizer cached, then closed in finally
File: headroom/transforms/content_router.py:1349,1413-1415
_get_image_optimizer caches the compressor on self._image_optimizer, but optimize_images_in_messages now always closes it in the finally. Two concurrent calls against the same ContentRouter instance both grab the cached compressor; whichever returns first calls compressor.close() → TrainedRouter.release_models(), which sets _classifier, _siglip_model, etc. to None while the second call is still inside compressor.compress(...). That second call will then AttributeError/NoneType partway through.
Today the only callers are tests, so it does not blow up in production — but the method is public on a hot-path class and the footgun is shaped exactly like _get_image_compressor() in proxy/helpers.py, which sidesteps the issue by creating a fresh instance every call. Two viable fixes:
- Mirror the helpers.py pattern: drop the `self._image_optimizer` cache, instantiate per call, close in `finally`.
- Keep the cache but stop closing it inline; release on `ContentRouter` shutdown instead.
Either is fine; the current shape is the worst of both worlds.
High — _default_compressor singleton contradicts the PR's intent
File: headroom/image/compressor.py:574-583, plus tests/test_image_compressor.py:780-784
The PR description says "Avoid retaining proxy-level image compressors longer than needed", but the module-level _default_compressor + get_compressor() still hand out a process-lifetime singleton that nothing ever closes. Any code path that touches get_compressor() (and the test test_get_compressor_singleton actively pins this behavior) keeps a TrainedRouter resident for the lifetime of the process — exactly what the rest of the diff is trying to prevent.
Two cleanups, pick one:
- Remove `_default_compressor`, `get_compressor()`, and `test_get_compressor_singleton`. Repoint any remaining callers at `compress_images()` (which already does create-use-close) or have them instantiate directly.
- If a singleton must stay for back-compat, register its `close()` with the proxy `shutdown()` next to the `unload_prefix` calls so it is not unbounded.
Medium — MLModelRegistry._release_runtime_memory hard-imports torch
File: headroom/models/ml_models.py:82-93
The bare import torch at the top of _release_runtime_memory will raise ImportError if the optional image stack isn't installed, which then propagates out of unload/unload_prefix. The rest of the codebase treats torch as optional (helpers.py wraps the image import in try/except ImportError). Suggest wrapping the torch block in with contextlib.suppress(ImportError): so a config without torch can still call unload_prefix during shutdown without crashing.
While here, when release_models unloads both classifier and siglip the registry runs gc.collect() + torch.cuda.empty_cache() twice in a row. Not a bug, but unload_prefix already exists — feel free to consolidate to one call so the runtime-memory pass runs once.
Medium — Per-request reload thrashing under image-heavy load
File: headroom/proxy/handlers/anthropic.py:712-725, openai.py:240-255
Every image request now loads classifier + SigLIP, runs once, releases, and the next image request reloads from scratch. The PR description acknowledges this; flagging it because under burst image traffic the cost is real (model load + torch device transfer per request). Worth either:
- Adding a config knob (`image_release_after_use: bool`, default `True`) so heavy-image deployments can opt back into a longer-lived compressor.
- Or an idle-timer release: keep the compressor warm for N seconds after last use, then unload.
Not a blocker, but I would not merge without at least the config knob if you expect any image-heavy users.
Low
- `headroom/proxy/helpers.py:66` — the name `_get_image_compressor` reads like a getter; it allocates per call. Consider `_make_image_compressor` / `_create_image_compressor` to make the lifecycle obvious to future readers.
- `headroom/proxy/handlers/anthropic.py:712` — `compressor = _get_image_compressor()` is outside the `try`. If that helper ever starts raising anything other than `ImportError` (which it currently catches), the compressor leaks. Cheap fix: pull the call inside the `try`.
- `headroom/models/ml_models.py:111-125` — `unload_prefix` returns a `list[str]` of removed keys but neither shutdown caller logs them. A single `logger.info` at shutdown listing what was released would make the new behavior observable in proxy logs.
Nits
- A CHANGELOG entry would be useful given this changes user-visible memory behavior.
- `release_models` zeroes `_text_embeddings` etc. before calling `MLModelRegistry.unload`. The registry path doesn't depend on those being None, but the test `test_release_models_unloads_registry_entries` only checks the post-state — adding an ordering assertion (or a comment about why state-then-registry) would freeze the intent.
Tests look good — the lifecycle coverage in test_proxy_pipeline_lifecycle.py and test_proxy_handler_helpers.py is the right shape. Once the two High items are addressed, this is close to mergeable.
Force-pushed cc2efdf to e8d9b6f (Compare)
Codex Fix — Round 1
Addressed the 2 High findings from the review and picked up the small safety notes that fit the same lifecycle change.
Changes
Verification
Deferred
Fixes pushed in |
eggrollofchaos
left a comment
Independent review for #280. Covers full file context, not just diff. No Critical/High findings — verdict: LGTM (commenting for transparency on Mediums/Lows).
Summary
| Severity | Count |
|---|---|
| Critical | 0 |
| High | 0 |
| Medium | 4 |
| Low | 2 |
Medium
M1 — Per-request load/unload thrashing on bursty image traffic — `headroom/image/compressor.py:105-111`, `headroom/image/trained_router.py:207-229`, `headroom/proxy/handlers/{anthropic,openai}.py`, `headroom/transforms/content_router.py:1411-1414`. With the new lifecycle, every image request loads technique-router + SigLIP, runs one compression pass, then immediately calls `MLModelRegistry.unload_many(...)` and best-effort `gc.collect()` / `torch.{cuda,mps}.empty_cache()`. For sparse image traffic this is the right call (and the PR description names it explicitly). For sustained image traffic (e.g. repeated multi-modal prompts in one agent loop), each request pays the full `from_pretrained` allocation cost and the runtime memory cleanup. There is no idle-timeout / TTL middle ground between "release immediately" and "pin forever". Suggested fix (follow-up, not blocking): add an optional idle-timeout cache so back-to-back image requests reuse models, while a configurable idle window (e.g. 30–60s) lets memory return after a quiet period. Worth a pm/backlog.md entry.
M2 — Dead self._image_optimizer = None in finally — headroom/transforms/content_router.py:1414. After the refactor, _get_image_optimizer() no longer assigns to self._image_optimizer (it constructs and returns directly). The _image_optimizer instance attribute is initialized to None at content_router.py:662 and never set to anything else, so resetting it in the finally is a no-op. Suggested fix: remove the line (and ideally remove the unused instance attribute at content_router.py:662), or — if you want to keep a defensive guard against future reintroduction — leave a one-line comment explaining the assignment is intentional. Test test_content_router_releases_image_optimizer_after_use asserts router._image_optimizer is None post-call, so removing the line still passes the assertion (the attribute starts at None), but the assertion stops verifying anything meaningful — consider replacing it with fake.close.assert_called_once() only.
M3 — Test mutates module global without cleanup — `tests/test_proxy_handler_helpers.py:159-172` (`test_proxy_helper_creates_fresh_image_compressors`). Sets `helpers._image_compressor_available = None` to reset state but never restores it after the test. After this test runs, `_image_compressor_available` is left as `True` (the import inside the patch context succeeds: patching the module attribute `headroom.image.ImageCompressor` does take effect even though the helper uses a `from headroom.image import ImageCompressor` lookup). Subsequent tests that depend on the "first-call logs once" behavior or on the import-failure short-circuit may see stale memo state. Suggested fix: wrap in a small fixture that snapshots/restores `helpers._image_compressor_available`, or use `monkeypatch.setattr(helpers, "_image_compressor_available", None)`. Not a current failure, just an order-dependence trap.
M4 — Unnecessary `contextlib.suppress(Exception)` around `gc.collect()` — `headroom/image/trained_router.py:226-229`. The non-registry branch of `release_models` does:

```python
with contextlib.suppress(Exception):
    import gc
    gc.collect()
```

`import gc` is a stdlib import that cannot fail at runtime in a working interpreter, and `gc.collect()` returns an int — it doesn't raise. The suppress masks no real failure mode. Suggested fix: drop the suppress and the local import (move `import gc` to module top, where `import gc` already exists in `ml_models.py` — and could exist here too). Minor, but error-masking patterns are worth pruning.
Low
L1 — Inconsistent default local_path resolution between callers — `headroom/image/trained_router.py:174-179` resolves the local technique-router path as `Path(__file__).parent.parent.parent / "models" / "technique-router-mini" / "final"` (file-relative absolute path), while `headroom/models/ml_models.py:282` resolves it as `Path("headroom/models/technique-router-mini/final/")` (CWD-relative). Today this never triggers a mismatch because `TrainedRouter._load_models` always passes a fully-resolved `model_id` to `MLModelRegistry.get_technique_router(model_path=...)`, so the registry never reaches its own `if model_path is None` branch. If any other caller invokes `MLModelRegistry.get_technique_router(model_path=None)` directly while the local path exists, it will use a different cache key than `TrainedRouter` would produce, creating a duplicate entry. The prefix-based shutdown unload would still cover it via `unload_prefix("technique_router:")` — so still safe at shutdown, but routers and registry could end up with two entries for the "same" model. Not introduced by this PR; worth a small follow-up to centralize the resolution. Suggested fix: extract `_resolve_default_technique_router_path()` and reuse it from both sites.
L2 — Public API semantic change for get_compressor() — headroom/image/compressor.py:574-580. Previous behavior: returned a process-wide singleton. New behavior: returns a fresh instance each call. The new docstring notes "Kept for backwards-compatible imports", but this is a quiet behavior change — any external consumer that imports get_compressor and relies on identity / shared last_result state will silently break. Suggested fix: optional — add a DeprecationWarning on get_compressor() pointing callers at ImageCompressor() directly, since the helper now adds nothing. Or note in CHANGELOG that the function returns a fresh instance now.
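If the deprecation route is chosen, a sketch — the `ImageCompressor` here is a stub standing in for `headroom.image.ImageCompressor`:

```python
import warnings

class ImageCompressor:
    """Stub standing in for headroom.image.ImageCompressor."""

def get_compressor():
    """Deprecated shim: returns a fresh instance and says so loudly."""
    warnings.warn(
        "get_compressor() now returns a fresh ImageCompressor per call; "
        "construct ImageCompressor() directly.",
        DeprecationWarning,
        stacklevel=2,
    )
    return ImageCompressor()
```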
Verified positively
- Cache key consistency across `trained_router._classifier_key` / `_siglip_key` and `MLModelRegistry.get_technique_router` / `get_siglip` (both build `f"technique_router:{model_path}"` and `f"siglip:{model_name}"` from the same resolved values when `TrainedRouter` is the caller).
- Idempotence of `release_models` (a re-call sees `classifier_key=None`, filters to `[]`, and `unload_many([])` short-circuits without runtime cleanup).
- Order in `release_models` — local refs nulled before registry unload, so the registry can actually free memory.
- Order in `proxy/server.py:shutdown` — http client + memory handler closed before model unload, so no in-flight request is using a freshly-evicted registry entry. `contextlib.suppress(Exception)` around the unload is defensible (shutdown should not raise).
- `unload_many` and `unload_prefix` thread-safety via `instance._model_lock`. `_release_runtime_memory` defensively guards `torch.cuda.is_available()` and `torch.mps.empty_cache()` (handles both backends and missing-attr versions).
- Try/finally placement in both proxy handlers is correct: `compressor = None` defaulted before the `try`, `_get_image_compressor()` inside the `try` (so a future failure during construction also closes any partial state), `compressor.close()` guarded by `compressor and hasattr(compressor, "close")` in the `finally`.
- Behaviour on text-only (no-images) requests: a fresh `ImageCompressor()` is constructed → `has_images()` returns `False` → `close()` early-returns because `self._router is None`. Cost is one object allocation; no model load triggered. Good.
- The `compress_images` convenience function constructs `ImageCompressor()` directly rather than going through `get_compressor()`, decoupling it from M2.
- Test coverage for the close path: `test_release_models_unloads_registry_entries`, `test_compress_images_function_closes_temporary_compressor`, `test_proxy_helper_creates_fresh_image_compressors`, `test_content_router_releases_image_optimizer_after_use`, `test_proxy_shutdown_unloads_image_models` — comprehensive.
Verdict
LGTM — approve. The PR is well-scoped to the stated lifecycle fix, the tradeoff (per-request load vs. memory pinning) is intentional and documented, and the test surface covers the new release path on every code site that constructs a compressor. None of the Mediums block merge; M1 is worth a follow-up backlog entry, M2/M3/M4 are cleanups, L1/L2 are defensive notes.
Force-pushed e8d9b6f to 3501616 (Compare)
Codex Fix — Round 2
Cleaned up the easy Medium notes from the independent review.
Changes
Verification
Remaining deferred notes
Fixes pushed in |
eggrollofchaos
left a comment
Final Confirmation Review (commit 3501616)
Independent re-pass against the full PR, not just the Round 2 delta. No new Critical/High findings — verdict: LGTM (commenting for transparency on the small deltas).
Summary
| Severity | Count |
|---|---|
| Critical | 0 |
| High | 0 |
| Medium | 0 |
| Low | 2 |
Round 2 fixes verified
- M2 (ContentRouter dead `_image_optimizer`) — instance attribute removed at `headroom/transforms/content_router.py:654-664`; `optimize_images_in_messages` no longer touches `self._image_optimizer` in `finally`. The matching test `test_content_router_releases_image_optimizer_after_use` (`tests/test_image_compressor.py:970-988`) now asserts `fake.close.assert_called_once()` only — the meaningless post-state check is gone. ✅
- M3 (test global mutation) — `tests/test_proxy_handler_helpers.py:159-172` uses `monkeypatch.setattr(helpers, "_image_compressor_available", None)`. Restores cleanly after the test. ✅
- M4 (unnecessary `contextlib.suppress(Exception)` around `gc.collect()`) — `headroom/image/trained_router.py:220-226` now does `else: gc.collect()` directly; module-level `import gc` is in place at `headroom/image/trained_router.py:13`. No suppress, no re-import. ✅
- L2 (public-API behaviour change for `get_compressor()`) — `CHANGELOG.md:19-23` explicitly calls out the singleton → fresh-instance shift. ✅
Other invariants spot-checked on this commit
- `ImageCompressor.close()` at `headroom/image/compressor.py:105-111` is a no-op when `_router is None`, so text-only requests cost only one allocation.
- `TrainedRouter.release_models()` at `headroom/image/trained_router.py:207-228` nulls local refs before calling `MLModelRegistry.unload_many` — the registry can free, and it is idempotent on repeat calls (`classifier_key=None` → `keys=[]` → `unload_many([])` short-circuits).
- `MLModelRegistry.unload_many` / `unload_prefix` (`headroom/models/ml_models.py:103-136`) hold `instance._model_lock` during mutation and only run `_release_runtime_memory()` when something was actually evicted.
- Proxy handlers (`headroom/proxy/handlers/anthropic.py:712-726`, `headroom/proxy/handlers/openai.py:242-256`) initialise `compressor = None` before the `try`, allocate inside the `try`, and guard the `finally` with `compressor and hasattr(compressor, "close")`. Construction failures cannot leak.
- `ProxyServer.shutdown` (`headroom/proxy/server.py:846-853`) wraps the unload pass in `contextlib.suppress(Exception)` (defensible at shutdown) and logs the released keys.
Low
- L-new-1 — `test_content_router_loads_image_compressor` / `..._is_image_compressor` allocate but never close (`tests/test_image_compressor.py:920-943`). With the new lifecycle, each `_get_image_optimizer()` call is supposed to be caller-owned; these two tests construct compressors and discard them. No model load happens (compress isn't called), so it's hygiene rather than a leak. Optional follow-up: add `compressor.close()` for symmetry, or wrap in a `try/finally`.
- L-new-2 — The PR is currently `mergeable: CONFLICTING` against `main`, behind `origin/main` by ~10 commits (last merged: #277 rust-stage-3b-diff-bridge). The PR body documents pre-existing CRLF/LF noise as intentional, but the underlying CONFLICTING state still needs a rebase before squash-merge. Not a code finding; flagging so it isn't missed at merge time.
Carried-over deferred items (acknowledged, not blocking)
- M1 (carry-forward) — per-request load/unload thrashing on bursty image traffic. Idle-timeout/TTL warm cache is a worthwhile follow-up; backlog entry recommended.
- L1 (carry-forward) — default local technique-router path drift between `TrainedRouter._load_models` and `MLModelRegistry.get_technique_router(model_path=None)`. Pre-existing; safe at shutdown via `unload_prefix("technique_router:")`. Centralise in a follow-up.
Verdict
LGTM — approve. Round 2 fixes are correctly applied, no regressions introduced, lifecycle invariants hold. Triggering auto-approval workflow.
Force-pushed 3501616 to 4ff2e8e (Compare)
Rebase Update
Rebased the PR onto current `main`.
Conflict resolution
Resolved conflicts in:
Kept upstream's current main changes and reapplied only the image-model lifecycle cleanup. Verification
Pushed as |
eggrollofchaos
left a comment
Re-reviewed against 4ff2e8e. All v1 findings addressed.
Resolved
- High — `ContentRouter._image_optimizer` race: cache attribute removed entirely; `_get_image_optimizer` returns a fresh instance. New `test_content_router_compressor_is_fresh_per_call` and `test_content_router_releases_image_optimizer_after_use` pin both halves of the contract.
- High — `_default_compressor` singleton leak: singleton dropped, `get_compressor()` returns a fresh instance, `compress_images()` wraps in try/finally with `close()`. The old singleton test is replaced with `test_get_compressor_returns_fresh_instance`; the new `test_compress_images_function_closes_temporary_compressor` pins the close path.
- Medium — torch hard-import: wrapped in try/except ImportError with an early return. Bonus: outer torch ops are now wrapped in `contextlib.suppress(Exception)` for defensive teardown.
- Medium — duplicate gc/empty_cache passes: the new `unload_many` consolidates both keys behind one runtime cleanup, and `release_models` uses it.
- Low — `_get_image_compressor()` outside try: both anthropic and openai handlers now init `compressor = None` first and assign inside the try. Cleaner.
- Low — silent unload: server.py shutdown now logs `Released image optimizer models: {keys}`.
- Nit — CHANGELOG: entry added explaining the lifecycle change and the `get_compressor()` semantics shift.
Deferred by design
- Medium — per-request reload thrashing: no config knob added. PR description acknowledges the trade-off and the user-facing tone is "good neighbor on memory". Acceptable as-is; flag for follow-up if image-burst workloads ever become a deployment target.
New observations on v2
- The `close = release_models` alias on TrainedRouter (`headroom/image/trained_router.py:225`) is a direct attribute bind rather than a method wrapper, so subclasses overriding `release_models` won't see the override via `close`. Tiny footgun; it only matters if subclassing happens later — feel free to convert it to a one-line method for subclass-safety, otherwise fine.
- `helpers._image_compressor_available` stays sticky across the process lifetime. If the `ImageCompressor` import ever fails once and then becomes available later (e.g., the user installs the optional extra mid-session), the helper will keep returning `None`. Practically never happens; skipping unless someone hits it.
- Shutdown's `contextlib.suppress(Exception)` wraps both registry calls and the `logger.info`. If logging happens to fail during teardown, the suppression hides it — defensible for a shutdown path; documenting here so reviewers don't get surprised.
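The subclass-safe variant of the first note is a one-line method wrapper. A sketch with minimal stand-in classes, not the real `TrainedRouter`:

```python
class TrainedRouter:
    """Minimal stand-in; only the close/release relationship matters here."""

    def __init__(self):
        self.released = False

    def release_models(self):
        self.released = True

    # `close = release_models` would freeze the base function at class-creation
    # time; a method wrapper re-dispatches, so subclass overrides are honored.
    def close(self):
        self.release_models()

class AuditingRouter(TrainedRouter):
    def release_models(self):
        super().release_models()
        self.audited = True
```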
LGTM
Clean lifecycle, tests cover the intent, shutdown observable. Approve from my side.
Force-pushed 4ff2e8e to 8a3591d (Compare)
Description
Releases image-router model resources after image compression so the proxy does not keep one-off vision model loads resident for the lifetime of the process.
Fixes: N/A
Impact
Memory
With Headroom's default image stack, the retained models are estimated at roughly `google/siglip-base-patch16-224` (~400 MB) plus `chopratejas/technique-router` (~100 MB), using the estimates already tracked in `headroom/models/config.py`. So the expected steady-state improvement after an image request is ~500 MB of reclaimable model memory, plus whatever backend allocator overhead PyTorch/CPU/GPU/MPS decides to hand back. Not a precise benchmark, but certainly nonzero.
Runtime behavior
For normal text-only proxy traffic, this should be a no-op. The change matters after image compression has caused the trained router / SigLIP stack to load; once that request finishes, Headroom drops those model refs and asks the shared registry + backend allocators to give the memory back.
Tradeoff
The tradeoff is that bursty image-heavy workloads may reload the image stack more often. For this support-infra / local-agent case, that is the right default: don't keep half a gig warm forever because one image request went through once. If someone is running sustained image traffic, a future keep-warm knob may make sense.
Provenance
I was running the Headroom proxy with Codex on a machine that was also doing local LLM code review and other model-heavy work, and the goal was to track and reduce memory footprint. Codex noticed the image route could leave the image-model stack resident after a one-off image request, and decided (autonomously) to clone the source checkout and patch it locally. It seemed like a good idea, so I decided to open this PR and share.
Type of Change
Changes Made
- `TrainedRouter` and `ImageCompressor`
Describe the tests you ran to verify your changes:
- `pytest`
- `ruff check .` (targeted to changed files)
- `mypy headroom`
Checklist
Screenshots (if applicable)
N/A
Additional Notes
This PR intentionally avoids repo-wide formatting or line-ending normalization. The local checkout currently reports pre-existing CRLF/LF noise because `.gitattributes` declares `*.py text eol=lf` while some committed Python blobs still contain CRLF or mixed endings. The committed diff is limited to the image-router lifecycle fix and related tests.
The lifecycle change is intentionally boring in the best way: load image models when an image request needs them, then hand the memory back.
That keeps the proxy a better neighbor on machines already juggling local review or other model-heavy work.