Contributor

@baileympearson baileympearson commented Dec 2, 2025

DRIVERS-3239

Overview

This PR adds support for a new class of errors (SystemOverloadedError) to drivers' operation retry logic, as outlined in the design document.

Additionally, it includes a new argument to the MongoDB handshake (also defined in the design document).

Python will be the second implementer.
Node implementation: mongodb/node-mongodb-native#4806

Testing

The testing strategy is two-fold:

  • Building off of Ezra's work to generate unified tests for retryable handshake errors, this PR generates unified tests to confirm that:

    • operations are retried using the new SystemOverloadedError label
    • operations are retried no more than 5 (current MAX_ATTEMPTS, as defined in the spec) times
  • Following Iris's work in DRIVERS-1934: withTransaction API retries too frequently #1851, this PR adds a prose test that ensures drivers apply exponential backoff in the retryability loop.

  • Update changelog.

  • Test changes in at least one language driver.

  • Test these changes against all server versions and topologies (including standalone, replica set, and sharded
    clusters).

> - If the value is "stderr" (case-insensitive), log to stderr.
> - Else, if direct logging to files is supported, log to a file at the specified path. If the file already exists, it
> MUST be appended to.
Contributor Author

this is failing lint on main. I'll fix separately and rebase this PR.

@baileympearson baileympearson marked this pull request as ready for review December 2, 2025 18:59
@baileympearson baileympearson requested review from a team as code owners December 2, 2025 18:59
@baileympearson baileympearson requested review from jmikola and jyemin and removed request for a team December 2, 2025 18:59
@blink1073
Member

It looks like you also need to bump the schema version:

source/client-backpressure/tests/backpressure-retry-loop.yml invalid
[
  {
    instancePath: '/tests/0/operations/3/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3
source/client-backpressure/tests/backpressure-retry-max-attempts.yml invalid
[
  {
    instancePath: '/tests/0/operations/1/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3

@blink1073
Member

WIP Python implementation: mongodb/mongo-python-driver#2635

@blink1073
Member

blink1073 commented Dec 3, 2025

All unified and prose tests are passing in the Python implementation.

Edit: we're still failing one unified test, "client.clientBulkWrite retries using operation loop", investigating...

Edit 2: we're all good now

Contributor

@jyemin jyemin left a comment

I only reviewed the specification changes, not the pseudocode or tests. Those are best reviewed by implementers.

- This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the
timeout to avoid retry storms.
5. If the previous error includes the `SystemOverloadedError` label, the client MUST apply exponential backoff according
to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
Contributor

Suggested change
to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
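
As a minimal sketch of the quoted formula `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`: the constants below (100 ms base, 10 s cap) are assumptions for illustration, since this thread does not quote the spec's values. The sketch also assumes `j` is a random jitter factor in `[0, 1)` and that `i` counts retries starting from 0.

```python
import random

# Assumed values for illustration only; the thread does not quote the spec's constants.
BASE_BACKOFF_MS = 100
MAX_BACKOFF_MS = 10_000


def backoff_delay_ms(i, jitter=None, base_backoff=BASE_BACKOFF_MS, max_backoff=MAX_BACKOFF_MS):
    """Return delayMS = j * min(maxBackoff, baseBackoff * 2^i).

    `jitter` can be fixed for deterministic tests; otherwise a random
    value in [0, 1) is drawn (an assumption about what `j` means).
    """
    j = random.random() if jitter is None else jitter
    return j * min(max_backoff, base_backoff * 2 ** i)


# Upper bounds (jitter pinned to 1) for the first five retries under the assumed constants.
upper_bounds = [backoff_delay_ms(i, jitter=1.0) for i in range(5)]
```

Under these assumed constants the five upper bounds sum to 3100 ms, which lines up with the 3.1 second figure in the prose-test assertion quoted in this thread, though that agreement may be coincidental.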


This specification expands the driver's retry ability to all commands, including those not currently considered
retryable such as updateMany, create collection, getMore, and generic runCommand. The new command execution method obeys
the following rules:
Contributor

Since the rules include all the deposits into the token bucket, consider adding withdrawals as well.


## Q&A

TODO
Contributor

Anything to add here, or just remove?


## Changelog

- 2025-XX-XX: Initial version.
Contributor

Not sure how we handle the date... Is there an automation for this?

Contributor Author

Not that I know of. Usually the spec author fills it out before merging

I'll just leave this thread open to remind myself to add changelog dates before merging once all changes are completed.

Comment on lines +53 to +55
```python
assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1)
```
Contributor

Thoughts:
Could this stick to being javascript from top-to-bottom?
Can we also do BIG_TIME - SMALL_TIME >= 2.1?

  • To me that's more human-readable.
  • It maintains that BIG_TIME must always be bigger (removing the need for absolute_value).
  • It captures the 1 second variation whilst still being rigid to the 3.1 second window and stays minimally invasive.
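
To make the difference between the two assertions concrete, here is a small sketch with times in seconds. The function names are hypothetical, and Python is used only for illustration. One point worth noting: the original check is two-sided (it also fails when the backoff run is far too slow), while the suggested check only enforces a lower bound.

```python
def original_check(with_backoff_time, no_backoff_time):
    # Two-sided: the difference must lie strictly within 1 second of 3.1.
    return abs(with_backoff_time - (no_backoff_time + 3.1)) < 1


def suggested_check(with_backoff_time, no_backoff_time):
    # One-sided: the difference must be at least 2.1 seconds.
    return with_backoff_time - no_backoff_time >= 2.1


# A run that took 5 seconds longer passes the suggested check but not the original.
slow_run_original = original_check(5.0, 0.0)
slow_run_suggested = suggested_check(5.0, 0.0)
```

If the upper bound matters to the test, the suggested form would need a second comparison to keep it.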

the following rules:

1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.
- The value is 0.1 and non-configurable.
Contributor

Per the concerns for golang, I thought updating these values by a scalar of 10 in those cases was fine?
cc: @matthewdale

Contributor Author

to identify clients which do and do not support backpressure. Currently, this flag is unused but in the future the
server may offer different rate limiting behavior for clients that do not support backpressure.

##### Implementation notes
Contributor

Suggested change
##### Implementation notes
#### Implementation notes

#### Goodput

The throughput of positive, useful output. In the context of drivers, this refers to the number of non-error results
that the driver processes per unit of time.
Member

@stIncMale stIncMale Dec 11, 2025

"the number of non-error results that the driver processes per unit of time" is neither throughput, nor the "good throughput" ("goodput"). Throughput is the characteristic of a system (the combination of the application, the driver, the DBMS, their configuration, the network connecting them, the hardware, etc.), which is a constant for a given system, and tells about system capacity at its peak. SPECjbb2012: Updated Metrics for a Business Benchmark explains nicely what a throughput is, and how it may be measured.

"the number of non-error results
that the driver processes per unit of time"
is not a characteristic of a system, but rather a metric whose value may vary. Trivially, if an application does not request any operations via the driver, then "the number of non-error results
that the driver processes per unit of time"
is zero, but the throughput is still not.

Member

@stIncMale stIncMale Dec 11, 2025

If, however, we want to define "throughput"/"goodput" the way it is currently proposed, then when we use the term in

  • "negatively affect goodput"
  • "stable but lowered throughput"

we have to say "max goodput" / "max throughput", or something like that, instead of just "goodput"/"throughput".

- The value is 0.1 and non-configurable.
2. If the command succeeds on a retry attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE`+1 tokens.
3. If a retry attempt fails with an error that does not include `SystemOverloadedError` label, drivers MUST deposit 1
token.
Member

@stIncMale stIncMale Dec 11, 2025

I fail to find item 3 in the pseudocode (lines [109, 155]). Could you please point where it is there?

token.
4. A retry attempt will only be permitted if the error includes the `RetryableError` label, we have not reached
`MAX_ATTEMPTS`, the CSOT deadline has not expired, and a token can be acquired from the token bucket.
- The value of `MAX_ATTEMPTS` is 5 and non-configurable.
Member

@stIncMale stIncMale Dec 11, 2025

Other constants are defined in the pseudocode, but MAX_ATTEMPTS is not. Let's define it.
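
The deposit rules quoted in this thread, together with the withdrawal a retry needs, could be sketched roughly as follows. `RETRY_TOKEN_RETURN_RATE` and `MAX_ATTEMPTS` come from the quoted spec text; the bucket capacity, the cost of one token per retry, and the names `TokenBucket` and `record_outcome` are assumptions made for this illustration and are not taken from the spec.

```python
import threading

RETRY_TOKEN_RETURN_RATE = 0.1  # from the quoted spec text
MAX_ATTEMPTS = 5               # from the quoted spec text
RETRY_COST = 1                 # assumption: one token is withdrawn per retry
DEFAULT_CAPACITY = 1000        # assumption: capacity is not stated in this thread


class TokenBucket:
    """Thread-safe token bucket with a fixed capacity (hypothetical helper)."""

    def __init__(self, capacity=DEFAULT_CAPACITY):
        self._capacity = capacity
        self._tokens = capacity
        self._lock = threading.Lock()

    def deposit(self, amount):
        with self._lock:
            # Deposits never push the balance above capacity.
            self._tokens = min(self._capacity, self._tokens + amount)

    def try_withdraw(self, amount=RETRY_COST):
        with self._lock:
            if self._tokens < amount:
                return False
            self._tokens -= amount
            return True

    @property
    def tokens(self):
        with self._lock:
            return self._tokens


def record_outcome(bucket, *, succeeded, is_retry, overloaded=False):
    """Apply deposit rules 1-3 as quoted in the thread."""
    if succeeded:
        # Rule 1 (first attempt) or rule 2 (retry attempt adds one extra token).
        bucket.deposit(RETRY_TOKEN_RETURN_RATE + (1 if is_retry else 0))
    elif is_retry and not overloaded:
        # Rule 3: a retry failed with an error lacking the SystemOverloadedError label.
        bucket.deposit(1)
```

In this sketch a retry would be attempted only when `bucket.try_withdraw()` returns `True`, alongside the label, `MAX_ATTEMPTS`, and deadline checks from rule 4.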

raise

# Raise if the error is non retryable.
is_retryable = exc.has_error_label("RetryableError") or is_retryable_write_error() or is_retryable_read_error()
Member

  1. The specification says "A retry attempt will only be permitted if the error includes the RetryableError label, we have not reached MAX_ATTEMPTS, the CSOT deadline has not expired, and a token can be acquired from the token bucket." The exc.has_error_label("RetryableError") is the "error includes the RetryableError label" condition. But what are the is_retryable_write_error(), is_retryable_read_error() conditions doing here? If they are supposed to be here, then the specification must reflect that.
  2. is_retryable_write_error()/is_retryable_read_error() are neither called on exc, nor is exc passed to them. That does not seem right.
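
One way the second point could be addressed is to pass the exception explicitly to every predicate. The sketch below is illustrative only: `DriverError` is a hypothetical stand-in for the driver's error type, and the legacy read/write checks are injected as callables rather than implemented, since their real logic is not shown in this thread.

```python
class DriverError(Exception):
    """Hypothetical stand-in for the driver's error type."""

    def __init__(self, message, labels=()):
        super().__init__(message)
        self._labels = frozenset(labels)

    def has_error_label(self, label):
        return label in self._labels


def is_retryable(exc, legacy_checks=()):
    """Return True if `exc` carries RetryableError or any legacy check accepts it.

    Each legacy check receives the exception as its argument, so nothing
    depends on implicit state.
    """
    if exc.has_error_label("RetryableError"):
        return True
    return any(check(exc) for check in legacy_checks)
```

Whether the legacy checks belong in this condition at all is the open question raised in the first point; the sketch only shows the calling convention.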

attempt += 1

if attempt > MAX_ATTEMPTS:
    raise
Member

The specification says "A retry attempt will only be permitted if ... we have not reached MAX_ATTEMPTS ...". At this point in execution, attempt specifies the number of completed attempts. Therefore, according to the specification, when attempt == MAX_ATTEMPTS, a retry attempt should not be permitted, yet the pseudocode clearly allows another attempt in such a situation, and with MAX_ATTEMPTS being 5, the actual maximum number of attempts the pseudocode allows is 6.

For MAX_ATTEMPTS to correctly represent the maximum number of attempts, the code must be

if attempt >= MAX_ATTEMPTS:
    raise
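
The off-by-one described above can be checked with a small simulation. This is a simplified sketch and not the spec's pseudocode: `AlwaysFails`, `run_with_retries`, and `count_attempts` are hypothetical names, and the loop omits backoff, labels, and the token bucket so that only the attempt counting is exercised.

```python
MAX_ATTEMPTS = 5


class AlwaysFails(Exception):
    """Hypothetical error raised by an operation that never succeeds."""


def run_with_retries(operation, give_up):
    attempt = 0
    while True:
        try:
            return operation()
        except AlwaysFails:
            attempt += 1  # number of completed attempts so far
            if give_up(attempt):
                raise


def count_attempts(give_up):
    """Count how many times an always-failing operation is invoked."""
    calls = 0

    def operation():
        nonlocal calls
        calls += 1
        raise AlwaysFails()

    try:
        run_with_retries(operation, give_up)
    except AlwaysFails:
        pass
    return calls


attempts_with_gt = count_attempts(lambda attempt: attempt > MAX_ATTEMPTS)
attempts_with_ge = count_attempts(lambda attempt: attempt >= MAX_ATTEMPTS)
```

With the `>` comparison the operation runs six times, while `>=` stops after five, which matches the reviewer's reading.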

# Raise if the error is non retryable.
is_retryable = exc.has_error_label("RetryableError") or is_retryable_write_error() or is_retryable_read_error()
if not is_retryable:
    raise error
Member

  1. What is error here? I would have expected to see exc here, given the except PyMongoError as exc: code above.
  2. Why does this line say raise error, but all other lines with raise say merely raise without specific error?
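
For readers less familiar with Python's `raise` forms, the following sketch shows why the distinction matters if the pseudocode is read as Python. `OperationFailed` is a hypothetical exception type used only for this illustration.

```python
class OperationFailed(Exception):
    """Hypothetical error type for this illustration."""


def reraise_bare():
    try:
        raise OperationFailed("boom")
    except OperationFailed as exc:
        raise  # re-raises the exception currently being handled


def reraise_named():
    try:
        raise OperationFailed("boom")
    except OperationFailed as exc:
        raise exc  # raises the same exception object by name


def reraise_undefined_name():
    try:
        raise OperationFailed("boom")
    except OperationFailed as exc:
        # `error` is never bound here, so Python raises NameError instead.
        raise error  # noqa: F821
```

Read literally as Python, `raise error` would therefore mask the original failure with a `NameError`; using a bare `raise` (or `raise exc`) consistently avoids that.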
