
feat(common): add retriability signal to Result pattern Err type (#121) #168

Open
b3lz3but wants to merge 2 commits into captainpragmatic:master from b3lz3but:feat/result-retriable-signal

b3lz3but commented Apr 2, 2026

Summary

Adds a retriable: bool = False field to the Err dataclass so callers can distinguish transient errors (DB timeout, lock contention) from permanent ones (validation failure, business rule violation).

Closes #121

Change

# Before
@dataclass(frozen=True)
class Err[E]:
    error: E

# After
@dataclass(frozen=True)
class Err[E]:
    error: E
    retriable: bool = False

Backward compatibility

All 577 existing Err("message") call sites continue to work unchanged — the default retriable=False is applied automatically. No migration needed.
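A minimal sketch of why the default keeps existing call sites valid. It uses `Generic[E]` with a `TypeVar` instead of the `class Err[E]` syntax shown above, purely so it also runs on Python versions before 3.12; that substitution is an assumption of this sketch, not part of the PR.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

E = TypeVar("E")


@dataclass(frozen=True)
class Err(Generic[E]):
    error: E
    retriable: bool = False  # new field; the default keeps old call sites valid


# A legacy call site that passes only the error still constructs fine.
legacy = Err("invoice amount must be positive")

# A new call site opts in explicitly.
transient = Err("database connection timeout", retriable=True)

print(legacy.retriable, transient.retriable)
```

Because the new field has a default and comes after `error`, positional construction with a single argument is unaffected.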

Usage

# Transient error — caller may retry
return Err("database connection timeout", retriable=True)

# Permanent error — retry will always fail (default)
return Err("invoice amount must be positive")

# Django-Q task can check:
result = refund_service.process_refund(order_id)
if result.is_err() and result.retriable:
    raise RuntimeError(result.unwrap_err())  # Raise for Django-Q retry; the error payload is a str, not an exception

Files changed

  • apps/common/types.py — Added retriable field to Err dataclass (+5 lines)
  • tests/common/test_result_types.py — 24 new tests covering Ok/Err behavior and retriable signal

Test plan

  • 24 Result type tests pass
  • All pre-commit hooks pass
  • DCO sign-off

🤖 Generated with Claude Code

…tainpragmatic#121)

Add retriable: bool = False field to the Err dataclass so callers (e.g.
Django-Q tasks) can distinguish transient errors (DB timeout, lock
contention) from permanent ones (validation failure). Default False
preserves backward compatibility with all 577 existing Err() call sites.

Closes captainpragmatic#121

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ciprian Radulescu <craps2003@gmail.com>

b3lz3but commented Apr 2, 2026

@mostlyvirtual — small arch-debt cleanup, requesting review.


mostlyvirtual left a comment


Review — Request changes

Mechanically the change is clean: field ordering is correct, frozen=True invariant is preserved, the existing one-positional case Err(x) pattern matches still bind to .error correctly via __match_args__, and the Ok.map() / Ok.and_then() exception path is tested. No syntactic or type-system issues. Two design issues need a second pass before this can be the basis for retry decisions across the codebase.

Blocker 1 — retriable: bool = False is the wrong shape for this signal

A boolean works when the producer is forced to know the answer. Here, most producers are legacy Err("...") sites in code that doesn't know whether the underlying error is transient or permanent — bare except Exception as e: return Err(str(e)) patterns. With bool = False the answer for "I don't know" silently becomes "definitely not retriable," which is the most dangerous wrong answer for transient infrastructure errors.

There are three real states here:

  1. RETRIABLE — caller knows it's transient (DB timeout, lock contention, upstream 5xx)
  2. NOT_RETRIABLE — caller knows it's permanent (validation, business rule, 4xx)
  3. UNKNOWN — caller caught an unclassified exception and cannot determine retriability at construction time

A bool collapses (2) and (3), which means consumers reading .retriable == False cannot tell whether the producer asserted permanence or simply didn't think about it. Once consumers start gating retries on .retriable, this collapse causes silent downgrades.

Suggested fix: either an enum (Retriability.RETRIABLE | NOT_RETRIABLE | UNKNOWN) with UNKNOWN as the default, or retriable: bool | None = None with None documented as "caller did not assert." Consumers then choose conservatively per-workflow — Django-Q for non-idempotent payment/provisioning treats UNKNOWN as "do not retry," for idempotent reads it can treat UNKNOWN as "retry once with backoff."
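One way the enum variant and the per-workflow policy could look, as a rough sketch. The field name `retriability`, the member values, and the `should_retry` helper are all hypothetical; the "retry once with backoff" part is left out, and only the decision itself is shown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

E = TypeVar("E")


class Retriability(Enum):
    RETRIABLE = "retriable"
    NOT_RETRIABLE = "not_retriable"
    UNKNOWN = "unknown"  # producer did not assert either way


@dataclass(frozen=True)
class Err(Generic[E]):
    error: E
    retriability: Retriability = Retriability.UNKNOWN


def should_retry(err: Err, *, idempotent: bool) -> bool:
    """Hypothetical consumer policy: UNKNOWN is resolved per workflow."""
    if err.retriability is Retriability.RETRIABLE:
        return True
    if err.retriability is Retriability.NOT_RETRIABLE:
        return False
    # UNKNOWN: only idempotent work may gamble on a retry.
    return idempotent


unclassified = Err("connection reset by peer")  # legacy-style call site
print(should_retry(unclassified, idempotent=False))  # payment-like workflow
print(should_retry(unclassified, idempotent=True))   # read-like workflow
```

The point of the sketch is that the producer never has to guess: an unclassified error stays `UNKNOWN`, and the consumer decides what that means for its own workflow.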

Blocker 2 — No transient producer was updated, so the signal is inert at merge time

Every existing Err(...) call site continues to default to retriable=False after this PR. Because no consumer reads .retriable yet, the immediate behavior is unchanged — but the moment any consumer starts honoring the signal, all of the following provably-transient sites will be incorrectly classified as permanent:

  • apps/provisioning/virtualmin_gateway.py:751,812,934,937 — rate limit, retry exhaustion, connection failures
  • apps/provisioning/virtualmin_service.py:316,632,755,913,1028,1202 — DB/connection/API re-wraps to string Err
  • apps/billing/stripe_metering.py:94,108,119,131,205,297,320,359 — StripeError catches that include 429/5xx/timeout
  • apps/billing/refund_service.py:166,323 — explicit "database error" strings
  • apps/infrastructure/hcloud_service.py:126,144,160,172,184,196,212,243,312,334,374,387 — provider SDK/API failures
  • apps/infrastructure/digitalocean_service.py:130,166,192,311,339,357,370 — including the _wait_for_action() timeout
  • apps/provisioning/provider_config.py:336 — Err(f"Command timed out after {timeout} seconds")

If the API shape is settled first (Blocker 1) and these are not backfilled, then once any task starts gating retries on .retriable, DB/network/provider outages will stop retrying, failed deployments and Stripe usage syncs will be marked terminal, and manual repair will replace automatic recovery.

Suggested fix: before this lands, decide the shape (Blocker 1) and update the highest-confidence transient producers (Stripe StripeError, virtualmin connection failures, the explicit timeout strings, the DB-error strings) to pass the retriable signal explicitly. Leaving the long tail for follow-up is fine; the explicitly-named ones above should not be left at the default.
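A sketch of what "pass the signal explicitly" could mean at a producer, assuming the tri-state shape from Blocker 1. `call_provider` and the exception-to-state mapping are hypothetical and use only built-in exception types; real sites would map their own SDK errors. For brevity the success path returns the raw value rather than an `Ok`.

```python
from dataclasses import dataclass
from enum import Enum


class Retriability(Enum):
    RETRIABLE = "retriable"
    NOT_RETRIABLE = "not_retriable"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Err:
    error: str
    retriability: Retriability = Retriability.UNKNOWN


def call_provider(operation):
    """Hypothetical producer wrapping a provider or database call."""
    try:
        return operation()
    except (TimeoutError, ConnectionError) as exc:
        # Known-transient: say so instead of falling back to the default.
        return Err(f"provider unavailable: {exc}", Retriability.RETRIABLE)
    except ValueError as exc:
        # Known-permanent: retrying the same input cannot succeed.
        return Err(f"invalid request: {exc}", Retriability.NOT_RETRIABLE)
    except Exception as exc:
        # Unclassified: leave the default rather than guess.
        return Err(str(exc))


def timed_out():
    raise TimeoutError("command timed out after 30 seconds")


print(call_provider(timed_out).retriability)
```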

Smaller items (non-blocking but worth fixing in the same revision)

  • No consumer reads .retriable today. Confirmed via grep across services/platform/apps/. The docstring states "Callers such as Django-Q tasks can use this to decide whether to re-queue" but no task does. The natural first consumer is the _process_paid_order path in apps/orders/tasks.py that handles OrderPaymentConfirmationService.confirm_order(). Wiring at least one consumer in this PR would prove the contract is usable end-to-end.
  • Equality/hash contract change is undocumented. Err("x") != Err("x", retriable=True), and hash(...) differs. Add two tests that document this explicitly: Err("x") != Err("x", retriable=True) and Err("x") == Err("x", retriable=False). This is the contract-documentation tier of test that the current 127 LOC suite is missing.
  • Tests verify mechanics, not behavior. Add at least one test that proves a retry loop or task code reads .retriable and changes its decision based on it. Without that, the test suite proves the dataclass works, not that the feature works.
  • Docs not updated. docs/ADRs/ADR-0003-comprehensive-type-safety-implementation.md:55 documents the Result pattern and docs/ADRs/ADR-0022-project-structure-strategic-seams.md:120 points to shared Ok/Err. Neither describes retriability semantics. docs/domain/REFUND_SERVICE.md:177 also discusses Result usage and would benefit from a one-paragraph addition about when to set retriable=True.
  • Docstring vagueness. The current docstring says the flag "signals whether the operation might succeed on retry." It does not specify what the default value means, what consumers should do with each state, or whether the default is conservative-by-design or an artifact of the dataclass field ordering rule.
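To illustrate the "behavior, not mechanics" item, here is roughly what such a test could look like against the bool shape as it stands in this PR. `run_with_retry` and `make_flaky` are hypothetical helpers written for this sketch; nothing like them exists in the codebase as far as this review is aware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ok:
    value: object


@dataclass(frozen=True)
class Err:
    error: str
    retriable: bool = False


def run_with_retry(operation, max_attempts=3):
    """Hypothetical consumer: retry only while the Err says it is retriable."""
    result = operation()
    attempts = 1
    while isinstance(result, Err) and result.retriable and attempts < max_attempts:
        result = operation()
        attempts += 1
    return result, attempts


def make_flaky(failures, retriable):
    """Return an operation that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            return Err("database connection timeout", retriable=retriable)
        return Ok("done")

    return operation


# Transient failures are retried until the operation succeeds.
result, attempts = run_with_retry(make_flaky(2, retriable=True))
assert result == Ok("done") and attempts == 3

# A non-retriable failure stops after the first attempt.
result, attempts = run_with_retry(make_flaky(2, retriable=False))
assert isinstance(result, Err) and attempts == 1
```

The second assertion is the one that matters: it fails if the consumer ignores the flag, which a pure dataclass test cannot detect.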

Verdict

Rework. The shape of retriable is the load-bearing decision and locking in bool before the codebase adopts it makes the migration to tri-state much harder later. Once the shape is settled and the highest-confidence transient producers are backfilled, this should land cleanly with a behavioral test and a brief ADR addition.

b3lz3but added a commit to b3lz3but/PRAHO that referenced this pull request May 7, 2026
Address PR captainpragmatic#168 review (mostlyvirtual, 2026-05-05):

Blocker 1 — replace `retriable: bool = False` with a tri-state
`Retriability` StrEnum (RETRIABLE / NOT_RETRIABLE / UNKNOWN), default
UNKNOWN. A bool collapsed "caller did not classify" into "definitely
not retriable" — the most dangerous wrong answer for transient
infrastructure errors. With UNKNOWN as default, legacy `Err(str(e))`
sites no longer falsely assert permanence; consumers choose policy
per-workflow via `is_retriable` (conservative — only True for
RETRIABLE) or by inspecting `retriability` directly.

Blocker 2 — backfill highest-confidence transient producers so the
signal is not inert at merge time:

- apps/billing/stripe_metering.py: classify all 8 StripeError catches
  via `_classify_stripe_error` (RateLimitError / APIConnectionError /
  Timeout / TryAgain / APIError → RETRIABLE; InvalidRequestError /
  AuthenticationError / CardError / etc. → NOT_RETRIABLE)
- apps/provisioning/virtualmin_gateway.py: rate limit, retry
  exhaustion, connection test failures → RETRIABLE
- apps/provisioning/virtualmin_service.py: gateway-result-rewrap sites
  (316, 632, 755, 913, 1198) propagate inner retriability via new
  `retriability_of(result)` helper; connection test (1028) → RETRIABLE
- apps/billing/refund_service.py: explicit "database error" strings
  (166, 323) → RETRIABLE
- apps/infrastructure/provider_config.py: `Command timed out` (336)
  → RETRIABLE

Tests: rewrite ErrRetriabilityTests around the enum; new
ErrEqualityContractTests document that two Errs with different
retriability compare unequal (and hash differs) — equality contract
change is now explicit. New ErrPatternMatchTests confirm
`case Err(msg)` still binds to `.error` via positional `__match_args__`.

ADR-0003 updated with the tri-state signal and consumer policy.

Refs: captainpragmatic#121

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

b3lz3but commented May 7, 2026

@mostlyvirtual ready for re-review. Pushed c6060f41 addressing both blockers.

Blocker 1 — retriable: bool → Retriability enum (default UNKNOWN):

  • New Retriability StrEnum with three states: RETRIABLE, NOT_RETRIABLE, UNKNOWN
  • Err.retriability: Retriability = Retriability.UNKNOWN (default UNKNOWN, not the conservative-collapsed False)
  • New is_retriable property — only returns True when explicitly RETRIABLE; non-idempotent consumers (payments/provisioning) get fail-closed semantics via this property; idempotent consumers can inspect retriability directly per your suggested policy.
  • Docstrings on both Retriability and Err cover the three states, why a bool was wrong, and consumer policy.
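For readers without the diff open, the shape described above is approximately the following. This is a sketch, not the committed code: the member string values are assumed, a `str`-mixin `Enum` stands in for `StrEnum` so it runs on Python before 3.11, and `Generic[E]` replaces the `class Err[E]` syntax.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

E = TypeVar("E")


class Retriability(str, Enum):  # stand-in for StrEnum on Python < 3.11
    RETRIABLE = "retriable"
    NOT_RETRIABLE = "not_retriable"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Err(Generic[E]):
    error: E
    retriability: Retriability = Retriability.UNKNOWN

    @property
    def is_retriable(self) -> bool:
        # Fail closed: UNKNOWN and NOT_RETRIABLE both answer False.
        return self.retriability is Retriability.RETRIABLE


print(Err("boom").is_retriable)
print(Err("timeout", Retriability.RETRIABLE).is_retriable)
```

A consumer that wants a different policy for `UNKNOWN` reads `retriability` directly instead of the property.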

Blocker 2 — backfill transient producers:

  • stripe_metering.py: all 8 StripeError catches now use a new _classify_stripe_error helper that maps by class name (lazy-import safe) — RateLimitError/APIConnectionError/Timeout/TryAgain/APIError → RETRIABLE; InvalidRequestError/AuthenticationError/CardError/PermissionError/IdempotencyError/SignatureVerificationError → NOT_RETRIABLE; unknown StripeError subclasses fall through to UNKNOWN.
  • virtualmin_gateway.py: VirtualminRateLimitedError, VirtualminTransientError (retry exhaustion), and connection-test exceptions all marked RETRIABLE.
  • virtualmin_service.py: the 5 gateway-result-rewrap sites you flagged (316, 632, 755, 913, 1198) now propagate inner retriability via a new retriability_of(result) helper (handles the union-type narrowing cleanly); connection test at 1028 → RETRIABLE.
  • refund_service.py:166,323: explicit "database error" → RETRIABLE.
  • infrastructure/provider_config.py:336: Command timed out → RETRIABLE.
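The class-name mapping for Stripe errors could be sketched like this. The function name `classify_by_class_name` and the exact-name lookup are assumptions for illustration (the actual `_classify_stripe_error` body is not quoted in this thread), and the exception classes are local stand-ins so the sketch runs without the Stripe SDK installed.

```python
from enum import Enum


class Retriability(str, Enum):  # stand-in for StrEnum on Python < 3.11
    RETRIABLE = "retriable"
    NOT_RETRIABLE = "not_retriable"
    UNKNOWN = "unknown"


# Class names taken from the list above; matched as strings so the Stripe
# SDK does not have to be imported at module load time.
_RETRIABLE = frozenset(
    {"RateLimitError", "APIConnectionError", "Timeout", "TryAgain", "APIError"}
)
_NOT_RETRIABLE = frozenset(
    {
        "InvalidRequestError",
        "AuthenticationError",
        "CardError",
        "PermissionError",
        "IdempotencyError",
        "SignatureVerificationError",
    }
)


def classify_by_class_name(exc: BaseException) -> Retriability:
    """Hypothetical classifier: exact class-name lookup, UNKNOWN otherwise."""
    name = type(exc).__name__
    if name in _RETRIABLE:
        return Retriability.RETRIABLE
    if name in _NOT_RETRIABLE:
        return Retriability.NOT_RETRIABLE
    return Retriability.UNKNOWN


# Local stand-ins for SDK exception classes.
class StripeError(Exception):
    pass


class RateLimitError(StripeError):
    pass


class CardError(StripeError):
    pass


class BrandNewError(StripeError):
    pass


print(classify_by_class_name(BrandNewError("unseen subclass")))
```

Note that an exact-name lookup like this one does not catch subclasses of the listed errors; whether the real helper walks the class hierarchy is not stated here.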

I deliberately left bare except Exception as e: return Err(str(e)) sites at the default UNKNOWN — those are exactly the "caller did not classify" cases that motivate the tri-state in the first place.
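For the re-wrap sites, a helper along the lines of `retriability_of(result)` might look like the sketch below. The behavior for non-`Err` inputs (returning `UNKNOWN`) and the `create_domain` wrapper are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Retriability(str, Enum):  # stand-in for StrEnum on Python < 3.11
    RETRIABLE = "retriable"
    NOT_RETRIABLE = "not_retriable"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Ok:
    value: object


@dataclass(frozen=True)
class Err:
    error: str
    retriability: Retriability = Retriability.UNKNOWN


def retriability_of(result: Union[Ok, Err]) -> Retriability:
    """Assumed behavior: an Err reports its own state, anything else UNKNOWN."""
    if isinstance(result, Err):
        return result.retriability
    return Retriability.UNKNOWN


def create_domain(gateway_result: Union[Ok, Err]) -> Union[Ok, Err]:
    """Hypothetical service method that re-wraps a gateway result."""
    if isinstance(gateway_result, Err):
        # Keep the inner classification instead of resetting it to the default.
        return Err(
            f"domain creation failed: {gateway_result.error}",
            retriability_of(gateway_result),
        )
    return gateway_result


inner = Err("rate limited by gateway", Retriability.RETRIABLE)
print(create_domain(inner).retriability)
```

Without the helper, re-wrapping with `Err(f"...")` alone would silently reset a classified error back to `UNKNOWN`.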

Smaller items:

  • New ErrEqualityContractTests make the equality/hash contract explicit (Err("x") != Err("x", retriability=RETRIABLE), hashes differ; Err("x") == Err("x", retriability=UNKNOWN), hashes match).
  • New ErrPatternMatchTests confirm case Err(msg) still binds to .error via positional __match_args__.
  • ADR-0003 updated with the tri-state signal and consumer policy.

Deferred to follow-up (acknowledged):

  • Wiring an actual .retriability-reading consumer (e.g., _process_paid_order in orders/tasks.py). The current shape is a building block; routing existing batch tasks to honor it is a separate change because the natural retry unit is per-order, not the whole batch.
  • Long-tail backfill across hcloud/digitalocean/etc. The reviewer-named, highest-confidence sites are done.

mypy clean across all 6 changed modules, 33 Result-type tests passing.

b3lz3but force-pushed the feat/result-retriable-signal branch from c6060f4 to 7b167e1 on May 8, 2026 06:24