[14.0][FIX] connector_oxigesti: wrap MSSQL connection errors as retryable #896
Conversation
A transient MSSQL outage (server restart, network blip, TCP EOF) while an
export job is running raises pymssql.OperationalError or InterfaceError.
queue_job does not recognize those as retryable, so the job moves straight
to `failed` on its first attempt — and by the time the adapter returns, the
per-backend `since_date` cursor has already advanced past the record, so
the next cron pass never picks it up. The configured retry_pattern
({1: 10, 5: 30, 10: 60, 15: 300}) is inert for this class of error.
Observed on task #2548: during an MSSQL restart on 2026-04-13 at 06:30, 120
DreamStation 2 serial numbers landed in queue_job `failed` with retry=1/5
and were lost to future cron cycles until manual requeue.
Fix: a new `mssql_connection_retryable` contextmanager re-raises
pymssql.OperationalError / pymssql.InterfaceError as NetworkRetryableError,
applied around every code path in the adapter that opens a pymssql
connection (_exec_sql, write, delete). Data-integrity errors (IntegrityError,
InternalError) are left untouched — they are not transient. The existing
`api_handle_errors` context on the interactive path keeps surfacing a
UserError (its first clause catches NetworkRetryableError and translates it).
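For illustration, a minimal sketch of such a wrapper. The import path of `NetworkRetryableError` (OCA connector's `odoo.addons.connector.exception`) and the message text are assumptions; the PR's actual code may differ:

```python
# Minimal sketch, assuming NetworkRetryableError comes from OCA connector
# (odoo.addons.connector.exception); the message text is illustrative.
from contextlib import contextmanager

import pymssql

from odoo.addons.connector.exception import NetworkRetryableError


@contextmanager
def mssql_connection_retryable():
    """Re-raise transient pymssql connection errors as retryable job errors."""
    try:
        yield
    except (pymssql.OperationalError, pymssql.InterfaceError) as e:
        # ``raise ... from e`` keeps the original pymssql error in
        # __cause__, preserving the forensic trail the tests check.
        raise NetworkRetryableError(
            "MSSQL connection error, the job will be retried: %s" % e
        ) from e
    # IntegrityError / InternalError fall through untouched: not transient.
```

Each connection-opening path in the adapter then runs inside the wrapper, e.g. `with mssql_connection_retryable(): ...` around the `pymssql.connect()` call.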
Tests (8, post_install): they verify that each exception class is wrapped or
passed through correctly and that __cause__ preserves the original pymssql
error for the forensic trail, and include an end-to-end case against an
unreachable host (127.0.0.1:1) confirming the adapter path raises
NetworkRetryableError.
task-2549
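A hypothetical sketch of the core assertions; the helper's import path and the test class name are assumptions, not the PR's actual code:

```python
# Hypothetical test sketch; the helper's import path and the class name
# are assumptions based on the commit message above.
import pymssql

from odoo.addons.connector.exception import NetworkRetryableError
from odoo.addons.connector_oxigesti.components.adapter import (
    mssql_connection_retryable,
)
from odoo.tests.common import TransactionCase, tagged


@tagged("post_install", "-at_install")
class TestMssqlRetryable(TransactionCase):
    def test_operational_error_is_wrapped(self):
        with self.assertRaises(NetworkRetryableError) as cm:
            with mssql_connection_retryable():
                raise pymssql.OperationalError(20003, b"server unavailable")
        # __cause__ keeps the original pymssql error for the forensic trail
        self.assertIsInstance(cm.exception.__cause__, pymssql.OperationalError)

    def test_integrity_error_is_not_wrapped(self):
        # Data-integrity errors are not transient and must not be retried.
        with self.assertRaises(pymssql.IntegrityError):
            with mssql_connection_retryable():
                raise pymssql.IntegrityError(2627, b"duplicate key")
```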
Codecov Report
❌ Patch coverage is …

@@            Coverage Diff             @@
##             14.0     #896      +/-   ##
==========================================
+ Coverage   50.37%   50.93%   +0.55%
==========================================
  Files        1174     1177       +3
  Lines       20087    20269     +182
  Branches     4267     4273       +6
==========================================
+ Hits        10119    10324     +205
+ Misses       9734     9703      -31
- Partials      234      242       +8
Add defensive tests to prove every production path still behaves correctly after wrapping pymssql connection errors as NetworkRetryableError. This is the core of a long-running production connector, so the extra coverage means higher confidence for the deploy. Now 24 tests (was 8), covering:

* Production-observed error variants (2026-04-13): error codes 2 / 20003 / 20004 / 6005 (SHUTDOWN).
* Wrapper narrowness: the `DatabaseError` base class and `ProgrammingError` pass through untouched (not silently retried; those need fixing — see the sketch after this list).
* Happy path: the wrapper is transparent when no exception is raised.
* Count-mismatch integrity checks (write/delete): a generic Exception and a pymssql.IntegrityError raised inside the wrapper propagate unchanged.
* create()'s IntegrityError 2627 workaround is preserved: the wrapper does NOT interfere with create() catching it and converting it to ValidationError.
* A non-2627 IntegrityError still re-raises through create() untouched.
* Interactive-path regression: an IntegrityError through api_handle_errors still surfaces as UserError (the dedicated clause is intact).
* A nested wrapper is a no-op (it does not double-wrap).
* End-to-end write/delete paths against an unreachable host surface NetworkRetryableError (the schema-check _exec_sql at the top of each method exercises the wrapper for those entry points too).

task-2549
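As referenced in the list above, a hypothetical sketch of the narrowness and nesting checks; the helper's import path and names are assumptions, as in the previous sketch:

```python
# Hypothetical sketch of the narrowness and nesting checks; the helper's
# import path and class name are assumptions, not the PR's actual code.
import pymssql

from odoo.addons.connector.exception import NetworkRetryableError
from odoo.addons.connector_oxigesti.components.adapter import (
    mssql_connection_retryable,
)
from odoo.tests.common import TransactionCase, tagged


@tagged("post_install", "-at_install")
class TestWrapperNarrowness(TransactionCase):
    def test_programming_error_passes_through(self):
        # Non-transient classes must propagate unchanged, not be retried.
        with self.assertRaises(pymssql.ProgrammingError):
            with mssql_connection_retryable():
                raise pymssql.ProgrammingError("syntax error near 'FROM'")

    def test_nested_wrapper_is_noop(self):
        # The wrapper only catches pymssql errors, so a NetworkRetryableError
        # raised by the inner wrapper passes through the outer one unchanged.
        with self.assertRaises(NetworkRetryableError) as cm:
            with mssql_connection_retryable():
                with mssql_connection_retryable():
                    raise pymssql.OperationalError(20003, b"connection failed")
        self.assertIsInstance(cm.exception.__cause__, pymssql.OperationalError)
```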
Every oxigesti queue.job.function declares
retry_pattern={1: 10, 5: 30, 10: 60, 15: 300} — a tiered backoff whose
keys go up to 15, so it assumes the job is allowed to reach at least
retry 15-20. OCA queue_job's DEFAULT_MAX_RETRIES is 5, which clipped
every oxigesti job at the first bucket: 4 postpones of 10s each, ~40s
window, with the 30s/60s/300s tiers never reached.
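For context: queue_job resolves the postpone delay by taking the value of the highest retry_pattern key that the current retry count has reached. A simplified model of that rule (an illustration, not the library's actual code; the fallback below the first key is omitted):

```python
# Simplified model of how a queue_job retry_pattern maps a retry count to
# a postpone delay: use the value of the highest key <= the current retry.
# Illustration only; not queue_job's actual implementation.
def postpone_seconds(retry, pattern):
    seconds = None  # queue_job falls back to a default interval below key 1
    for threshold in sorted(pattern):
        if retry >= threshold:
            seconds = pattern[threshold]
        else:
            break
    return seconds


PATTERN = {1: 10, 5: 30, 10: 60, 15: 300}
assert postpone_seconds(4, PATTERN) == 10    # only tier reachable at max_retries=5
assert postpone_seconds(15, PATTERN) == 300  # needs retries to reach 15 and beyond
```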
Override with_delay on the abstract oxigesti.binding model so every
concrete binding (partners, products, lots, sale orders, ...)
transparently gets max_retries=MAX_RETRIES_NETWORK (20) unless the
caller passes an explicit value. No changes to the data XML, no changes
to any call site.
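A minimal sketch of that override; the model name and the constant's value come from this commit message, while the surrounding module layout is assumed:

```python
# Minimal sketch of the override on the abstract binding model; the
# module's actual file layout may differ.
from odoo import models

MAX_RETRIES_NETWORK = 20  # lets the retry_pattern's 15: 300 tier be reached


class OxigestiBinding(models.AbstractModel):
    _inherit = "oxigesti.binding"

    def with_delay(self, *args, max_retries=None, **kwargs):
        # Fill in the default only when the caller passed nothing; an
        # explicit value still wins, including max_retries=0 (infinite).
        if max_retries is None:
            max_retries = MAX_RETRIES_NETWORK
        return super().with_delay(*args, max_retries=max_retries, **kwargs)
```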
With max_retries=20 and the unchanged pattern the job now gets:
* attempts 1-4 @ 10s ≈ 40s (network blips, fast restarts)
* attempts 5-9 @ 30s ≈ 2.5m (medium maintenance)
* attempts 10-14 @ 60s ≈ 5m (longer restarts)
* attempts 15-19 @ 300s ≈ 25m (extended outages)
Total window per job ≈ 33 min — aligned with the pattern's original
design intent and Sidekiq-style background-queue practice, and enough
to absorb transient MSSQL restarts, VPN blips and short planned
maintenance on the Oxigesti server without spamming it.
An explicit max_retries= on the caller still wins (including
max_retries=0 for infinite retries), and generic, non-oxigesti
recordsets keep the OCA default unchanged.
Summary
A transient MSSQL outage (server restart, network blip, TCP EOF) while an export job is running raises `pymssql.OperationalError` or `pymssql.InterfaceError`. queue_job does not recognize those as retryable, so the job moves straight to `failed` on its first attempt; by the time the adapter returns, the per-backend `since_date` cursor has already advanced past the record, so the next cron pass never picks it up. The configured `retry_pattern` ({1: 10, 5: 30, 10: 60, 15: 300}) is inert for this class of error. Observed on task #2548: during an MSSQL restart on 2026-04-13 at 06:30, 120 DreamStation 2 serial numbers landed in `queue.job` `failed` with `retry=1/5` and were lost to future cron cycles until manual requeue.

Fix

A new `mssql_connection_retryable` contextmanager in `connector_oxigesti/components/adapter.py` re-raises `pymssql.OperationalError` / `pymssql.InterfaceError` as `NetworkRetryableError`, applied around every code path that opens a pymssql connection (`_exec_sql`, `write`, `delete`). Data-integrity errors (`IntegrityError`, `InternalError`) are left untouched; they are not transient. The existing `api_handle_errors` context on the interactive path keeps surfacing a `UserError` (its first clause catches `NetworkRetryableError` and translates it).

Tests
Added `connector_oxigesti/tests/test_mssql_retryable.py` (8 tests, `post_install`):

* `pymssql.OperationalError` / `pymssql.InterfaceError` are wrapped as `NetworkRetryableError`.
* `pymssql.IntegrityError` / `pymssql.InternalError` are NOT wrapped.
* The interactive path (`api_handle_errors`) still surfaces `UserError`.
* `__cause__` preserves the original pymssql error for the forensic trail.
* End-to-end: pointing the backend at `127.0.0.1:1` and calling `get_version()` through the adapter raises `NetworkRetryableError` (the wrapper is actually on the real code path).

Test plan
* `-u connector_oxigesti --test-enable --test-tags=/connector_oxigesti` on `guijarron14test` → 8/8 pass.
* `pre-commit run` clean on staged files.
* A job hitting a connection error now lands in `state=pending` with an `eta`, not `state=failed`.

Refs task-2549 (child of #2548).