
[14.0][FIX] connector_oxigesti: wrap MSSQL connection errors as retryable#896

Open
eantones wants to merge 3 commits into 14.0 from 14.0-fix-connector_oxigesti-pymssql-retryable

Conversation

@eantones
Member

Summary

A transient MSSQL outage (server restart, network blip, TCP EOF) while an export job is running raises pymssql.OperationalError or pymssql.InterfaceError. queue_job does not recognize those as retryable, so the job moves straight to failed on its first attempt — and by the time the adapter returns, the per-backend since_date cursor has already advanced past the record, so the next cron pass never picks it up. The configured retry_pattern ({1: 10, 5: 30, 10: 60, 15: 300}) is inert for this class of error.

Observed on task #2548: during an MSSQL restart on 2026-04-13 at 06:30, 120 DreamStation 2 serial numbers landed in queue.job state failed with retry=1/5 and were lost to future cron cycles until manually requeued.

Fix

A new mssql_connection_retryable context manager in connector_oxigesti/components/adapter.py re-raises pymssql.OperationalError / pymssql.InterfaceError as NetworkRetryableError; it is applied around every code path that opens a pymssql connection (_exec_sql, write, delete). Data-integrity errors (IntegrityError, InternalError) are left untouched — they are not transient. The existing api_handle_errors context on the interactive path keeps surfacing a UserError (its first clause catches NetworkRetryableError and translates it).
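
For reference, a minimal sketch of what such a wrapper can look like; the context manager name and its module come from this PR, while the import path for NetworkRetryableError and the message text are assumptions:

```python
# Illustrative sketch only; the real implementation lives in
# connector_oxigesti/components/adapter.py. NetworkRetryableError is
# assumed to come from OCA connector's exception module.
from contextlib import contextmanager

import pymssql

from odoo.addons.connector.exception import NetworkRetryableError


@contextmanager
def mssql_connection_retryable():
    """Re-raise transient pymssql connection errors as retryable."""
    try:
        yield
    except (pymssql.OperationalError, pymssql.InterfaceError) as e:
        # queue_job reschedules the job per retry_pattern instead of
        # failing it; "from e" keeps the original error as __cause__.
        raise NetworkRetryableError(
            "Transient MSSQL connection error, will retry: %s" % e
        ) from e
    # pymssql.IntegrityError / pymssql.InternalError deliberately
    # propagate untouched: they are not transient.
```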

Tests

Added connector_oxigesti/tests/test_mssql_retryable.py (8 tests, post_install); a rough sketch of the test shape follows the list:

  • pymssql.OperationalError / pymssql.InterfaceError are wrapped as NetworkRetryableError.
  • pymssql.IntegrityError / pymssql.InternalError are NOT wrapped.
  • Unrelated exceptions pass through untouched.
  • Interactive path (api_handle_errors) still surfaces UserError.
  • __cause__ preserves the original pymssql error for the forensic trail.
  • End-to-end: creating a backend pointed at 127.0.0.1:1 and calling get_version() through the adapter raises NetworkRetryableError (the wrapper is actually on the real code path).
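
A hypothetical shape for these tests, assuming mssql_connection_retryable is importable at module level from the adapter (the error arguments are made up):

```python
# Illustrative only; the real suite is
# connector_oxigesti/tests/test_mssql_retryable.py (8 tests, post_install).
import pymssql

from odoo.addons.connector.exception import NetworkRetryableError
from odoo.addons.connector_oxigesti.components.adapter import (
    mssql_connection_retryable,
)
from odoo.tests.common import TransactionCase, tagged


@tagged("post_install", "-at_install")
class TestMssqlRetryable(TransactionCase):
    def test_operational_error_is_wrapped(self):
        with self.assertRaises(NetworkRetryableError) as cm:
            with mssql_connection_retryable():
                raise pymssql.OperationalError(20003, b"connection failed")
        # __cause__ keeps the original pymssql error for the forensic trail
        self.assertIsInstance(cm.exception.__cause__, pymssql.OperationalError)

    def test_integrity_error_not_wrapped(self):
        with self.assertRaises(pymssql.IntegrityError):
            with mssql_connection_retryable():
                raise pymssql.IntegrityError("duplicate key")
```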

Test plan

  • -u connector_oxigesti --test-enable --test-tags=/connector_oxigesti on guijarron14test → 8/8 pass.
  • pre-commit run clean on staged files.
  • CI green.
  • Verify on production after deploy: the next MSSQL blip should produce jobs in state=pending with an eta, not state=failed (a shell spot check is sketched below).
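
One way to run that production spot check from `odoo shell`; the fields used are standard queue_job fields, while the search domain is illustrative:

```python
# Illustrative post-deploy check: after the next MSSQL blip, transient
# failures should sit in "pending" with an eta set, not in "failed".
jobs = env["queue.job"].search(
    [("model_name", "like", "oxigesti"), ("date_created", ">=", "2026-04-24")]
)
for job in jobs:
    print(job.uuid, job.state, "retry %s/%s" % (job.retry, job.max_retries), job.eta)
```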

Refs task-2549 (child of #2548).

…yableError

A transient MSSQL outage (server restart, network blip, TCP EOF) while an
export job is running raises pymssql.OperationalError or InterfaceError.
queue_job does not recognize those as retryable, so the job moves straight
to `failed` on its first attempt — and by the time the adapter returns, the
per-backend `since_date` cursor has already advanced past the record, so
the next cron pass never picks it up. The configured retry_pattern
({1: 10, 5: 30, 10: 60, 15: 300}) is inert for this class of error.

Observed on task #2548: during an MSSQL restart on 2026-04-13 at 06:30, 120
DreamStation 2 serial numbers landed on queue_job `failed` with retry=1/5
and were lost to future cron cycles until manual requeue.

Fix: a new `mssql_connection_retryable` contextmanager re-raises
pymssql.OperationalError / pymssql.InterfaceError as NetworkRetryableError,
applied around every code path in the adapter that opens a pymssql
connection (_exec_sql, write, delete). Data-integrity errors (IntegrityError,
InternalError) are left untouched — they are not transient. The existing
`api_handle_errors` context on the interactive path keeps surfacing a
UserError (its first clause catches NetworkRetryableError and translates it).

Tests (8, post_install): verify that each exception class is wrapped or
passed through correctly, that __cause__ preserves the original pymssql
error for the forensic trail, and that an end-to-end case against an
unreachable host (127.0.0.1:1) raises NetworkRetryableError through the
real adapter path.

task-2549
@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 80.84112% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.93%. Comparing base (ba6ec84) to head (39ae73d).

Files with missing lines                            Patch %   Lines
connector_oxigesti/components/adapter.py            17.50%    33 Missing ⚠️
connector_oxigesti/tests/test_mssql_retryable.py    94.24%    6 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             14.0     #896      +/-   ##
==========================================
+ Coverage   50.37%   50.93%   +0.55%     
==========================================
  Files        1174     1177       +3     
  Lines       20087    20269     +182     
  Branches     4267     4273       +6     
==========================================
+ Hits        10119    10324     +205     
+ Misses       9734     9703      -31     
- Partials      234      242       +8     


…apper

Add defensive tests to prove every production path still behaves correctly
after wrapping pymssql connection errors as NetworkRetryableError. This
adapter is the core of a long-running production connector, so the extra
coverage buys higher confidence for the deploy.

Now 24 tests (was 8), covering:

* Production-observed error variants (2026-04-13): error codes 2 / 20003 /
  20004 / 6005 (SHUTDOWN).
* Wrapper narrowness: `DatabaseError` base class and `ProgrammingError`
  pass through untouched (not silently retried — those need fixing);
  sketched below.
* Happy path: wrapper is transparent when no exception is raised.
* Count-mismatch integrity checks (write/delete): generic Exception and
  pymssql.IntegrityError raised inside the wrapper propagate unchanged.
* create()'s IntegrityError 2627 workaround preserved: the wrapper does
  NOT interfere with create() catching and converting to ValidationError.
* A non-2627 IntegrityError still re-raises through create() untouched.
* Interactive path regression: an IntegrityError through api_handle_errors
  still surfaces as UserError (dedicated clause intact).
* Nested wrapper is a no-op (does not double-wrap).
* End-to-end write/delete paths against an unreachable host surface
  NetworkRetryableError (the schema-check _exec_sql at the top of each
  method exercises the wrapper for those call entries too).

task-2549
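
A sketch of two of those checks; the shapes are assumed, with the same imports as the earlier test sketch:

```python
# Illustrative shapes only; the real assertions live in
# connector_oxigesti/tests/test_mssql_retryable.py.
@tagged("post_install", "-at_install")
class TestMssqlRetryableNarrowness(TransactionCase):
    def test_programming_error_passes_through(self):
        # Not transient: must NOT be silently retried, it needs fixing.
        with self.assertRaises(pymssql.ProgrammingError):
            with mssql_connection_retryable():
                raise pymssql.ProgrammingError("incorrect syntax near 'FORM'")

    def test_nested_wrapper_does_not_double_wrap(self):
        with self.assertRaises(NetworkRetryableError) as cm:
            with mssql_connection_retryable(), mssql_connection_retryable():
                raise pymssql.OperationalError(20003, b"connection failed")
        # The outer wrapper lets NetworkRetryableError through unchanged,
        # so __cause__ is still the original pymssql error, not a double wrap.
        self.assertIsInstance(cm.exception.__cause__, pymssql.OperationalError)
```
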
Every oxigesti queue.job.function declares
retry_pattern={1: 10, 5: 30, 10: 60, 15: 300} — a tiered backoff whose
keys go up to 15, so it assumes the job is allowed to reach at least
retry 15-20. OCA queue_job's DEFAULT_MAX_RETRIES is 5, which clipped
every oxigesti job at the first bucket: 4 postpones of 10s each, ~40s
window, with the 30s/60s/300s tiers never reached.

Override with_delay on the abstract oxigesti.binding model so every
concrete binding (partners, products, lots, sale orders, ...)
transparently gets max_retries=MAX_RETRIES_NETWORK (20) unless the
caller passes an explicit value. No changes to the data XML, no changes
to any call site.
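
A minimal sketch of such an override; the constant value (20) and the abstract model name follow the commit text, the rest is an assumption about the actual code:

```python
# Illustrative sketch; an explicit max_retries from the caller
# (including 0 = infinite) must still win over the injected default.
from odoo import models

MAX_RETRIES_NETWORK = 20


class OxigestiBinding(models.AbstractModel):
    _inherit = "oxigesti.binding"

    def with_delay(self, *args, max_retries=None, **kwargs):
        if max_retries is None:
            # Caller passed nothing: default to the network-sized budget.
            max_retries = MAX_RETRIES_NETWORK
        return super().with_delay(*args, max_retries=max_retries, **kwargs)
```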

With max_retries=20 and the unchanged pattern the job now gets:
* attempts 1-4  @ 10s  ≈ 40s    (network blips, fast restarts)
* attempts 5-9  @ 30s  ≈ 2.5m   (medium maintenance)
* attempts 10-14 @ 60s ≈ 5m     (longer restarts)
* attempts 15-19 @ 300s ≈ 25m   (extended outages)

Total window per job ≈ 33 min — aligned with the pattern's original
design intent and Sidekiq-style background-queue practice, and enough
to absorb transient MSSQL restarts, VPN blips and short planned
maintenance on the Oxigesti server without spamming it.

An explicit max_retries= on the caller still wins (including
max_retries=0 for infinite retries), and generic, non-oxigesti
recordsets keep the OCA default unchanged.