DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862

baileympearson · 2025-12-02T17:15:50Z

Overview

This PR adds support for a new class of errors (SystemOverloadedError) to drivers' operation retry logic, as outlined in the design document.

Additionally, it includes a new argument to the MongoDB handshake (also defined in the design document).

Python will be the second implementer.
Node implementation: mongodb/node-mongodb-native#4806

Testing

The testing strategy is two-fold:

1. Building off of Ezra's work to generate unified tests for retryable handshake errors, this PR generates unified tests to confirm that:
   - operations are retried using the new SystemOverloadedError label
   - operations are retried no more than 5 (current MAX_ATTEMPTS, as defined in the spec) times
2. Following Iris's work in DRIVERS-1934: withTransaction API retries too frequently #1851, this PR adds a prose test that ensures drivers apply exponential backoff in the retryability loop.

- Update changelog.
- Test changes in at least one language driver.
- Test these changes against all server versions and topologies (including standalone, replica set, and sharded clusters).

baileympearson · 2025-12-02T18:57:17Z

source/logging/logging.md

    > - If the value is "stderr" (case-insensitive), log to stderr.
    > - Else, if direct logging to files is supported, log to a file at the specified path. If the file already exists, it
-    >     MUST be appended to.
+    >   MUST be appended to.


this is failing lint on main. I'll fix separately and rebase this PR.

blink1073 · 2025-12-03T16:31:19Z

It looks like you also need to bump the schema version:

source/client-backpressure/tests/backpressure-retry-loop.yml invalid
[
  {
    instancePath: '/tests/0/operations/3/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3
source/client-backpressure/tests/backpressure-retry-max-attempts.yml invalid
[
  {
    instancePath: '/tests/0/operations/1/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3

blink1073 · 2025-12-03T21:05:18Z

WIP Python implementation: mongodb/mongo-python-driver#2635

blink1073 · 2025-12-03T23:11:38Z

All unified and prose tests are passing in the Python implementation.

Edit: we're still failing one unified test, "client.clientBulkWrite retries using operation loop", investigating...

Edit 2: we're all good now

jyemin

I only reviewed the specification changes, not the pseudocode or tests. Those are best reviewed by implementers.

jyemin · 2025-12-10T19:26:49Z

source/client-backpressure/client-backpressure.md

+    - This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the
+        timeout to avoid retry storms.
+5. If the previous error includes the `SystemOverloadedError` label, the client MUST apply exponential backoff according
+    to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`


Suggested change

to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`

to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
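As a minimal sketch of the formula in this suggestion, the delay could be computed as below. The values of `BASE_BACKOFF_MS` and `MAX_BACKOFF_MS`, the interpretation of `j` as a uniform random jitter in [0, 1), and the function name are assumptions made for illustration; the thread does not state them.

```python
import random

# Assumed values for illustration only; the thread does not state them.
BASE_BACKOFF_MS = 100
MAX_BACKOFF_MS = 10_000


def backoff_delay_ms(i, jitter=None,
                     base_backoff=BASE_BACKOFF_MS,
                     max_backoff=MAX_BACKOFF_MS):
    """Return delayMS = j * min(maxBackoff, baseBackoff * 2^i)."""
    # j is assumed to be a uniform random value in [0, 1); callers may
    # pass a fixed jitter to make the result deterministic in tests.
    j = random.random() if jitter is None else jitter
    return j * min(max_backoff, base_backoff * 2 ** i)
```

Passing a fixed `jitter` makes the cap visible: under these assumed constants, large values of `i` stop growing once `base_backoff * 2 ** i` exceeds `max_backoff`.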

jyemin · 2025-12-10T19:33:11Z

source/client-backpressure/client-backpressure.md

+
+This specification expands the driver's retry ability to all commands, including those not currently considered
+retryable such as updateMany, create collection, getMore, and generic runCommand. The new command execution method obeys
+the following rules:


Since the rules include all the deposits into the token bucket, consider adding withdrawals as well.

jyemin · 2025-12-10T19:36:28Z

source/client-backpressure/client-backpressure.md

+
+## Q&A
+
+TODO


Anything to add here, or just remove?

jyemin · 2025-12-10T19:37:07Z

source/client-backpressure/client-backpressure.md

+
+## Changelog
+
+- 2025-XX-XX: Initial version.


Not sure how we handle the date... Is there an automation for this?

Not that I know of. Usually the spec author fills it out before merging

I'll just leave this thread open to remind myself to add changelog dates before merging once all changes are completed.

Jibola · 2025-12-10T20:34:13Z

source/client-backpressure/tests/README.md

+    ```python
+    assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1)
+    ```


Thoughts:
Could this stick to being javascript from top-to-bottom?
Can we also do BIG_TIME - SMALL_TIME >= 2.1?

To me that's more human-readable.

It maintains that BIG_TIME must always be bigger (removing the need for absolute_value).

It captures the 1 second variation whilst still being rigid to the 3.1 second window and stays minimally invasive.
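To compare the two assertion shapes discussed here, the sketch below writes both as plain Python predicates. The function names and default parameters are hypothetical; only the 3.1, 1, and 2.1 second figures come from the thread.

```python
def within_window(with_backoff_time, no_backoff_time,
                  expected=3.1, tolerance=1.0):
    """Two-sided form from the README diff: the difference must be
    within `tolerance` seconds of `expected`."""
    return abs(with_backoff_time - (no_backoff_time + expected)) < tolerance


def at_least(with_backoff_time, no_backoff_time, minimum=2.1):
    """One-sided form proposed in the review: the backoff run must be
    at least `minimum` seconds slower."""
    return with_backoff_time - no_backoff_time >= minimum
```

One design consequence worth noting: the one-sided form no longer rejects a run that is much slower than expected. For example, with `no_backoff_time = 0.5` and `with_backoff_time = 6.0`, `at_least` passes while `within_window` fails.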

Jibola · 2025-12-10T20:42:12Z

source/client-backpressure/client-backpressure.md

+the following rules:
+
+1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.
+    - The value is 0.1 and non-configurable.


Per the concerns for golang, I thought updating these values by a scalar of 10 in those cases was fine?
cc: @matthewdale

I think the outcome was the opposite: https://docs.google.com/document/d/1teqNgeWbW6dpRQOALrJTEBRoO6sYcrpq9T3_NJE0QfU/edit?disco=AAABtSVfUxI

Jibola · 2025-12-10T20:51:37Z

source/client-backpressure/client-backpressure.md

+to identify clients which do and do not support backpressure. Currently, this flag is unused but in the future the
+server may offer different rate limiting behavior for clients that do not support backpressure.
+
+##### Implementation notes


Suggested change

##### Implementation notes

#### Implementation notes

stIncMale · 2025-12-11T04:27:47Z

source/client-backpressure/client-backpressure.md

+#### Goodput
+
+The throughput of positive, useful output. In the context of drivers, this refers to the number of non-error results
+that the driver processes per unit of time.


"the number of non-error results that the driver processes per unit of time" is neither throughput, nor the "good throughput" ("goodput"). Throughput is the characteristic of a system (the combination of the application, the driver, the DBMS, their configuration, the network connecting them, the hardware, etc.), which is a constant for a given system, and tells about system capacity at its peak. SPECjbb2012: Updated Metrics for a Business Benchmark explains nicely what a throughput is, and how it may be measured.

"the number of non-error results
that the driver processes per unit of time" is not a characteristic of a system, but rather a metric whose value may vary. Trivially, if an application does not request any operations via the driver, then "the number of non-error results
that the driver processes per unit of time" is zero, but the throughput is still not.

If, however, we want to define "throughput"/"goodput" the way it is currently proposed, then when we use the term in

"negatively affect goodput"

"stable but lowered throughput"

we have to say "max goodput" / "max throughput", or something like that, instead of just "goodput"/"throughput".

stIncMale · 2025-12-11T06:24:44Z

source/client-backpressure/client-backpressure.md

+    - The value is 0.1 and non-configurable.
+2. If the command succeeds on a retry attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE`+1 tokens.
+3. If a retry attempt fails with an error that does not include `SystemOverloadedError` label, drivers MUST deposit 1
+    token.


I fail to find item 3 in the pseudocode (lines [109, 155]). Could you please point where it is there?
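For reference, the three deposit rules quoted above can be sketched as a small token bucket. This is not the spec's pseudocode: the bucket capacity, the cost of one token per retry, and every name other than `RETRY_TOKEN_RETURN_RATE` are assumptions for illustration.

```python
RETRY_TOKEN_RETURN_RATE = 0.1  # value stated in the quoted spec text


class TokenBucket:
    def __init__(self, capacity=1000.0):  # capacity is an assumption
        self.capacity = capacity
        self.tokens = capacity

    def deposit(self, amount):
        # Never exceed the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + amount)

    def try_acquire(self, amount=1.0):  # one token per retry is assumed
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False


def deposit_for_outcome(bucket, succeeded, is_retry, overloaded):
    """Apply rules 1-3: what to deposit after an attempt finishes."""
    if succeeded:
        # Rule 1 (first attempt) and rule 2 (retry attempt).
        bucket.deposit(RETRY_TOKEN_RETURN_RATE + (1 if is_retry else 0))
    elif is_retry and not overloaded:
        # Rule 3: a retry failed without the SystemOverloadedError label.
        bucket.deposit(1)
```

Keeping rule 3 as its own branch makes it easy to check whether an implementation covers it, which is the gap raised in this comment.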

stIncMale · 2025-12-11T06:26:03Z

source/client-backpressure/client-backpressure.md

+    token.
+4. A retry attempt will only be permitted if the error includes the `RetryableError` label, we have not reached
+    `MAX_ATTEMPTS`, the CSOT deadline has not expired, and a token can be acquired from the token bucket.
+    - The value of `MAX_ATTEMPTS` is 5 and non-configurable.


Other constants are defined in the pseudocode, but MAX_ATTEMPTS is not. Let's define it.

stIncMale · 2025-12-11T06:40:45Z

source/client-backpressure/client-backpressure.md

+                raise
+
+            # Raise if the error is non retryable.
+            is_retryable = exc.has_error_label("RetryableError") or is_retryable_write_error() or is_retryable_read_error()


The specification says "A retry attempt will only be permitted if the error includes the RetryableError label, we have not reached MAX_ATTEMPTS, the CSOT deadline has not expired, and a token can be acquired from the token bucket." The exc.has_error_label("RetryableError") is the "error includes the RetryableError label" condition. But what are the is_retryable_write_error(), is_retryable_read_error() conditions doing here? If they are supposed to be here, then the specification must reflect that.

is_retryable_write_error()/is_retryable_read_error() are neither called on exc, nor is exc passed to them. That does not seem right.

stIncMale · 2025-12-11T06:50:32Z

source/client-backpressure/client-backpressure.md

+            attempt += 1
+
+            if attempt > MAX_ATTEMPTS:
+                raise


The specification says "A retry attempt will only be permitted if ... we have not reached MAX_ATTEMPTS ...". At this point in execution, attempt specifies the number of completed attempts. Therefore, according to the specification, when attempt == MAX_ATTEMPTS, a retry attempt should not be permitted, yet the pseudocode clearly allows another attempt in such a situation, and with MAX_ATTEMPTS being 5, the actual maximum number of attempts the pseudocode allows is 6.

For MAX_ATTEMPTS to correctly represent the maximum number of attempts, the code must be

if attempt >= MAX_ATTEMPTS: raise
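The off-by-one described above can be checked with a small simulation in which every attempt fails. The `count_attempts` helper is hypothetical, and it assumes, as this comment does, that `attempt` starts at 0 and is incremented after each failed attempt.

```python
MAX_ATTEMPTS = 5  # value stated in the quoted spec text


def count_attempts(should_stop):
    """Count how many attempts run when every attempt fails."""
    attempt = 0
    calls = 0
    while True:
        calls += 1    # the operation is attempted (and fails)
        attempt += 1  # attempt now equals the number of completed attempts
        if should_stop(attempt):
            return calls


attempts_with_gt = count_attempts(lambda attempt: attempt > MAX_ATTEMPTS)
attempts_with_ge = count_attempts(lambda attempt: attempt >= MAX_ATTEMPTS)
```

Under these assumptions `attempts_with_gt` is 6 and `attempts_with_ge` is 5, which matches the reasoning in the comment.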

stIncMale · 2025-12-11T06:53:21Z

source/client-backpressure/client-backpressure.md

+            # Raise if the error is non retryable.
+            is_retryable = exc.has_error_label("RetryableError") or is_retryable_write_error() or is_retryable_read_error()
+            if not is_retryable:
+                raise error


What is error here? I would have expected to see exc here, given the except PyMongoError as exc: code above.

Why does this line say raise error, but all other lines with raise say merely raise without specific error?
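A minimal sketch of the consistent form this comment asks about is shown below: the caught exception is bound to one name, passed explicitly to the retryability check, and re-raised with a bare `raise`. `DriverError`, `is_retryable`, and `run_once` are hypothetical stand-ins and not the spec's pseudocode.

```python
class DriverError(Exception):
    """Hypothetical stand-in for the driver's error type."""

    def __init__(self, message, labels=()):
        super().__init__(message)
        self._labels = set(labels)

    def has_error_label(self, label):
        return label in self._labels


def is_retryable(exc):
    # The exception is passed in explicitly, so the check cannot
    # silently refer to some other error object.
    return exc.has_error_label("RetryableError")


def run_once(operation):
    try:
        return operation()
    except DriverError as exc:
        if not is_retryable(exc):
            # A bare raise re-raises the exception currently being
            # handled (exc) and preserves its traceback.
            raise
        return "retry"
```

Any additional read or write retryability helpers would, in the same style, take `exc` as an argument.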

baileympearson added 8 commits December 1, 2025 13:44

initial commit

588f1f2

new files

e467f5b

add tests for handshake changes

d55fdb9

add generated tests

8e74b41

test fixes and add prose test

072b453

- add prose test - add assertions on the number of retries for maxAttempts tests - don't run clientBulkWrite tests on <8.0 servers

fix run on requirements

52e2a35

fix run on requirements?

391c951

fix CI

92501c0

baileympearson marked this pull request as ready for review December 2, 2025 18:59

baileympearson requested review from a team as code owners December 2, 2025 18:59

baileympearson requested review from jmikola and jyemin and removed request for a team December 2, 2025 18:59

blink1073 reviewed Dec 3, 2025

baileympearson added 3 commits December 3, 2025 11:36

comments

0fdef39

Fix broken unified tests

82acab8

fix UTR linting failures

b3a7b6c

blink1073 mentioned this pull request Dec 3, 2025

PYTHON-5528 & PYTHON-5651 Add exponential backoff to operation retry loop for server overloaded errors mongodb/mongo-python-driver#2635

remove broken deleteMany() from unified tests

60a87b8

add backwards compat section

399a56b

jyemin requested changes Dec 10, 2025

Jibola requested changes Dec 10, 2025

stIncMale reviewed Dec 11, 2025

stIncMale reviewed Dec 11, 2025

stIncMale requested changes Dec 11, 2025
