
Conversation

@ShaneHarvey (Owner) commented Nov 19, 2025

Changes:

  • DRIVERS-3217 Investigate error rate tracking for server selection
  • Add a test for overload using activationProbability (sync only, so it would need to be converted to async)

The driver tracks the pass/fail rate of every operation for each server. During server selection, rather than purely using "operation count", we bias toward the server with the lower error rate. So if server A has an error rate of 75% and server B has 25%, then 75% - 25% = 50% of the time we select server B, and the other 50% of the time we use the existing "operation count" logic. This appears to work well. Before this change, the test sees a ~60% error rate:

Overloaded server: ('localhost', 27017)
{('localhost', 27017): 0.77975, ('localhost', 27018): 0.22025, 'overload_errors': 2344, 'operations': 4000, 'error_rate': 0.586}

After this change, the test sees a ~20% error rate:

Overloaded server: ('localhost', 27017)
{('localhost', 27017): 0.239, ('localhost', 27018): 0.761, 'overload_errors': 731, 'operations': 4000, 'error_rate': 0.18275}
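
As a standalone illustration of that rule (not the actual driver code; the function name here is made up):

    import random

    def bias_toward_lower_error(rate_a: float, rate_b: float) -> bool:
        """Return True when selection should prefer the lower-error server,
        with probability equal to the difference between the two rates."""
        # Equal rates never bias; 75% vs 25% biases 50% of the time,
        # leaving the rest to the existing operation-count logic.
        return random.random() < abs(rate_a - rate_b)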

Concerns:

  • Runtime performance cost of tracking the error rate: it should not slow down the happy path. We may want to cap the number of samples (not just their age) to limit memory usage, and we may also need to cache the error rate rather than recalculating it on every server selection call.
  • Some exceptions are normal "success" results that should not be counted as failures toward the error rate; duplicate key errors are one example. It would be safer to only track errors with the "SystemOverloadedError" label (see the sketch after this list).
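
A minimal sketch of that filtering, assuming the failure path has the raised exception in hand (PyMongoError.has_error_label is a real PyMongo API; where this check would hook into the pool is an assumption):

    from pymongo.errors import PyMongoError

    def counts_toward_error_rate(exc: Exception) -> bool:
        # Only overload-labeled errors count; benign "failures" such as
        # duplicate key errors are excluded from the rate.
        return isinstance(exc, PyMongoError) and exc.has_error_label("SystemOverloadedError")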

@ShaneHarvey (Owner Author) commented:

CC: @sleepyStick

# When server1 wins the operation count comparison, also compare
# error rates: with probability equal to the difference in rates,
# bias the selection toward the lower-error server.
if server1.pool.operation_count <= server2.pool.operation_count:
    error_rate1 = await server1.pool.get_error_rate()
    error_rate2 = await server2.pool.get_error_rate()
    if error_rate1 < error_rate2 and (random.random() < (error_rate2 - error_rate1)):  # noqa: S311


What's the purpose of random.random() < (error_rate2 - error_rate1)?

@ShaneHarvey (Owner Author) replied Nov 20, 2025:

The random choice, scaled by the difference, attempts to balance the load more evenly. Imagine a scenario with only two servers, A and B, where A's error rate is 0% and B's is 1%. A naive approach is to always pick the server with the lower error rate, but then we would never choose server B. That's a problem: as soon as any errors occur on one server, all requests would be rerouted away from it.

Introducing randomness means we instead bias only 1% of requests away from server B, which is clearly better than rerouting 100% of them.

Another example: A's error rate is 25% and B's is 75%. We want to bias some operations toward server A since it has a higher chance of success, but again not all of them. Scaling with the difference in error rates performed well in these scenarios (e.g., we observed fewer errors overall).

@ShaneHarvey (Owner Author) added:

Another way to consider it: when the error rates for two servers are about the same, we want to continue using operationCount-based selection. As the difference in error rates grows, we want to route more and more requests away from the higher-error server.


Makes sense. I should have read the PR description, sorry 😅


I cannot resolve this comment for some reason. (permissions?)

feel free to resolve.

# Record the failure as a timestamped sample before re-raising.
self.error_times.append((time.monotonic(), 1))
raise

# clear error rate info that is >10 seconds old
@baileympearson commented Nov 21, 2025:

We definitely don't need to figure this out now, because we're not certain we'll move forward with this approach.

But because this might be relevant to Iris' DSI workflow: in scenarios where one node is significantly more overloaded than the other, pruning stale error measurements only after connection checkout will make recovery slower than necessary, because server selection will continue to avoid the overloaded server even after it has started to recover. The higher the error rate on one node, the longer it takes on average to reach the error rate measurement pruning logic.

And as the error rate on one node approaches 100% while the other stays healthy, the likelihood of ever selecting that node approaches 0 (if the error rate ever hit 100%, we'd always pick the healthy server and have no chance of ever selecting this one again).

@ShaneHarvey (Owner Author) replied:

Good point, this logic would need to move into server selection to avoid that pitfall. There's also likely a more efficient algorithm for tracking the error rate. One option is 10 buckets that track the errors and total operations for each of the last 10 seconds; the rate is then a simple sum over <=10 buckets rather than a scan of an unbounded list.

Something roughly like this (ignoring the phasing-out of old data):

# One {errors, requests} bucket per elapsed second.
self.error_stats = {}
...
current_second = int(time.monotonic())
bucket = self.error_stats.setdefault(current_second, {"errors": 0, "requests": 0})

if error:
    bucket["errors"] += 1
bucket["requests"] += 1

Then:

    async def get_error_rate(self) -> float:
        current_second = int(time.monotonic())
        errors = 0
        requests = 0
        async with self.lock:
            # Sum only the buckets from the last 10 seconds.
            for sec, bucket in self.error_stats.items():
                if sec < current_second - 10:
                    continue
                errors += bucket["errors"]
                requests += bucket["requests"]
        # Require at least 10 samples to compute an error rate.
        if requests < 10:
            return 0.0
        return errors / requests
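
One way to combine the bucket idea with the recovery concern above, as a rough sketch (field names taken from the snippet above; the 10-second cutoff is the same assumption): prune stale buckets inside get_error_rate itself, which runs during server selection, so even a server we are currently avoiding ages out its old errors. Keeping at most ~10 live buckets also caps the memory usage noted in the PR description's concerns.

    async def get_error_rate(self) -> float:
        current_second = int(time.monotonic())
        cutoff = current_second - 10
        errors = 0
        requests = 0
        async with self.lock:
            # Prune stale buckets here rather than at connection checkout,
            # so an avoided server still recovers its error rate over time.
            for sec in list(self.error_stats):
                if sec < cutoff:
                    del self.error_stats[sec]
            for bucket in self.error_stats.values():
                errors += bucket["errors"]
                requests += bucket["requests"]
        if requests < 10:
            return 0.0
        return errors / requests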
