feat(common): add retriability signal to Result pattern Err type (#121)#168
feat(common): add retriability signal to Result pattern Err type (#121)#168b3lz3but wants to merge 2 commits intocaptainpragmatic:masterfrom
Conversation
…tainpragmatic#121) Add retriable: bool = False field to the Err dataclass so callers (e.g. Django-Q tasks) can distinguish transient errors (DB timeout, lock contention) from permanent ones (validation failure). Default False preserves backward compatibility with all 577 existing Err() call sites. Closes captainpragmatic#121 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Ciprian Radulescu <craps2003@gmail.com>
|
@mostlyvirtual — small arch-debt cleanup, requesting review. |
mostlyvirtual
left a comment
There was a problem hiding this comment.
Review — Request changes
Mechanically the change is clean: field ordering is correct, frozen=True invariant is preserved, the existing one-positional case Err(x) pattern matches still bind to .error correctly via __match_args__, and the Ok.map() / Ok.and_then() exception path is tested. No syntactic or type-system issues. Two design issues need a second pass before this can be the basis for retry decisions across the codebase.
Blocker 1 — retriable: bool = False is the wrong shape for this signal
A boolean works when the producer is forced to know the answer. Here, most producers are legacy Err("...") sites in code that doesn't know whether the underlying error is transient or permanent — bare except Exception as e: return Err(str(e)) patterns. With bool = False the answer for "I don't know" silently becomes "definitely not retriable," which is the most dangerous wrong answer for transient infrastructure errors.
There are three real states here:
RETRIABLE— caller knows it's transient (DB timeout, lock contention, upstream 5xx)NOT_RETRIABLE— caller knows it's permanent (validation, business rule, 4xx)UNKNOWN— caller caught an unclassified exception and cannot determine retriability at construction time
A bool collapses (2) and (3), which means consumers reading .retriable == False cannot tell whether the producer asserted permanence or simply didn't think about it. Once consumers start gating retries on .retriable, this collapse causes silent downgrades.
Suggested fix: either an enum (Retriability.RETRIABLE | NOT_RETRIABLE | UNKNOWN) with UNKNOWN as the default, or retriable: bool | None = None with None documented as "caller did not assert." Consumers then choose conservatively per-workflow — Django-Q for non-idempotent payment/provisioning treats UNKNOWN as "do not retry," for idempotent reads it can treat UNKNOWN as "retry once with backoff."
Blocker 2 — No transient producer was updated, so the signal is inert at merge time
Every existing Err(...) call site continues to default to retriable=False after this PR. Because no consumer reads .retriable yet, the immediate behavior is unchanged — but the moment any consumer starts honoring the signal, all of the following provably-transient sites will be incorrectly classified as permanent:
apps/provisioning/virtualmin_gateway.py:751,812,934,937— rate limit, retry exhaustion, connection failuresapps/provisioning/virtualmin_service.py:316,632,755,913,1028,1202— DB/connection/API re-wraps to stringErrapps/billing/stripe_metering.py:94,108,119,131,205,297,320,359—StripeErrorcatches that include 429/5xx/timeoutapps/billing/refund_service.py:166,323— explicit"database error"stringsapps/infrastructure/hcloud_service.py—:126,144,160,172,184,196,212,243,312,334,374,387provider SDK/API failuresapps/infrastructure/digitalocean_service.py:130,166,192,311,339,357,370— including the_wait_for_action()timeoutapps/provisioning/provider_config.py:336—Err(f"Command timed out after {timeout} seconds")
If the API shape is settled first (Blocker 1) and these are not backfilled, then once any task starts gating retries on .retriable, DB/network/provider outages will stop retrying, failed deployments and Stripe usage syncs will be marked terminal, and manual repair will replace automatic recovery.
Suggested fix: before this lands, decide the shape (Blocker 1) and update the highest-confidence transient producers (Stripe StripeError, virtualmin connection failures, the explicit timeout strings, the DB-error strings) to pass the retriable signal explicitly. Leaving the long tail for follow-up is fine; the explicitly-named ones above should not be left at the default.
Smaller items (non-blocking but worth fixing in the same revision)
- No consumer reads
.retriabletoday. Confirmed via grep acrossservices/platform/apps/. The docstring states "Callers such as Django-Q tasks can use this to decide whether to re-queue" but no task does. The natural first consumer is the_process_paid_orderpath inapps/orders/tasks.pythat handlesOrderPaymentConfirmationService.confirm_order(). Wiring at least one consumer in this PR would prove the contract is usable end-to-end. - Equality/hash contract change is undocumented.
Err("x") != Err("x", retriable=True), andhash(...)differs. Add two tests that document this explicitly:Err("x") != Err("x", retriable=True)andErr("x") == Err("x", retriable=False). This is the contract-documentation tier of test that the current 127 LOC suite is missing. - Tests verify mechanics, not behavior. Add at least one test that proves a retry loop or task code reads
.retriableand changes its decision based on it. Without that, the test suite proves the dataclass works, not that the feature works. - Docs not updated.
docs/ADRs/ADR-0003-comprehensive-type-safety-implementation.md:55documents the Result pattern anddocs/ADRs/ADR-0022-project-structure-strategic-seams.md:120points to sharedOk/Err. Neither describes retryability semantics.docs/domain/REFUND_SERVICE.md:177also discusses Result usage and would benefit from a one-paragraph addition about when to setretriable=True. - Docstring vagueness. The current docstring says the flag "signals whether the operation might succeed on retry." It does not specify what the default value means, what consumers should do with each state, or whether the default is conservative-by-design or an artifact of the dataclass field ordering rule.
Verdict
Rework. The shape of retriable is the load-bearing decision and locking in bool before the codebase adopts it makes the migration to tri-state much harder later. Once the shape is settled and the highest-confidence transient producers are backfilled, this should land cleanly with a behavioral test and a brief ADR addition.
Address PR captainpragmatic#168 review (mostlyvirtual, 2026-05-05): Blocker 1 — replace `retriable: bool = False` with a tri-state `Retriability` StrEnum (RETRIABLE / NOT_RETRIABLE / UNKNOWN), default UNKNOWN. A bool collapsed "caller did not classify" into "definitely not retriable" — the most dangerous wrong answer for transient infrastructure errors. With UNKNOWN as default, legacy `Err(str(e))` sites no longer falsely assert permanence; consumers choose policy per-workflow via `is_retriable` (conservative — only True for RETRIABLE) or by inspecting `retriability` directly. Blocker 2 — backfill highest-confidence transient producers so the signal is not inert at merge time: - apps/billing/stripe_metering.py: classify all 8 StripeError catches via `_classify_stripe_error` (RateLimitError / APIConnectionError / Timeout / TryAgain / APIError → RETRIABLE; InvalidRequestError / AuthenticationError / CardError / etc. → NOT_RETRIABLE) - apps/provisioning/virtualmin_gateway.py: rate limit, retry exhaustion, connection test failures → RETRIABLE - apps/provisioning/virtualmin_service.py: gateway-result-rewrap sites (316, 632, 755, 913, 1198) propagate inner retriability via new `retriability_of(result)` helper; connection test (1028) → RETRIABLE - apps/billing/refund_service.py: explicit "database error" strings (166, 323) → RETRIABLE - apps/infrastructure/provider_config.py: `Command timed out` (336) → RETRIABLE Tests: rewrite ErrRetriabilityTests around the enum; new ErrEqualityContractTests document that two Errs with different retriability compare unequal (and hash differs) — equality contract change is now explicit. New ErrPatternMatchTests confirm `case Err(msg)` still binds to `.error` via positional `__match_args__`. ADR-0003 updated with the tri-state signal and consumer policy. Refs: captainpragmatic#121 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@mostlyvirtual ready for re-review. Pushed Blocker 1 —
Blocker 2 — backfill transient producers:
I deliberately left bare Smaller items:
Deferred to follow-up (acknowledged):
mypy clean across all 6 changed modules, 33 Result-type tests passing. |
Address PR captainpragmatic#168 review (mostlyvirtual, 2026-05-05): Blocker 1 — replace `retriable: bool = False` with a tri-state `Retriability` StrEnum (RETRIABLE / NOT_RETRIABLE / UNKNOWN), default UNKNOWN. A bool collapsed "caller did not classify" into "definitely not retriable" — the most dangerous wrong answer for transient infrastructure errors. With UNKNOWN as default, legacy `Err(str(e))` sites no longer falsely assert permanence; consumers choose policy per-workflow via `is_retriable` (conservative — only True for RETRIABLE) or by inspecting `retriability` directly. Blocker 2 — backfill highest-confidence transient producers so the signal is not inert at merge time: - apps/billing/stripe_metering.py: classify all 8 StripeError catches via `_classify_stripe_error` (RateLimitError / APIConnectionError / Timeout / TryAgain / APIError → RETRIABLE; InvalidRequestError / AuthenticationError / CardError / etc. → NOT_RETRIABLE) - apps/provisioning/virtualmin_gateway.py: rate limit, retry exhaustion, connection test failures → RETRIABLE - apps/provisioning/virtualmin_service.py: gateway-result-rewrap sites (316, 632, 755, 913, 1198) propagate inner retriability via new `retriability_of(result)` helper; connection test (1028) → RETRIABLE - apps/billing/refund_service.py: explicit "database error" strings (166, 323) → RETRIABLE - apps/infrastructure/provider_config.py: `Command timed out` (336) → RETRIABLE Tests: rewrite ErrRetriabilityTests around the enum; new ErrEqualityContractTests document that two Errs with different retriability compare unequal (and hash differs) — equality contract change is now explicit. New ErrPatternMatchTests confirm `case Err(msg)` still binds to `.error` via positional `__match_args__`. ADR-0003 updated with the tri-state signal and consumer policy. Refs: captainpragmatic#121 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ciprian Radulescu <craps2003@gmail.com>
c6060f4 to
7b167e1
Compare
Summary
Adds a
retriable: bool = Falsefield to theErrdataclass so callers can distinguish transient errors (DB timeout, lock contention) from permanent ones (validation failure, business rule violation).Closes #121
Change
Backward compatibility
All 577 existing
Err("message")call sites continue to work unchanged — the defaultretriable=Falseis applied automatically. No migration needed.Usage
Files changed
apps/common/types.py— Addedretriablefield toErrdataclass (+5 lines)tests/common/test_result_types.py— 24 new tests covering Ok/Err behavior and retriable signalTest plan
🤖 Generated with Claude Code