
Conversation

@ShaneHarvey (Owner) commented Nov 19, 2025

Changes:

  • DRIVERS-3217 Investigate error rate tracking for server selection
  • Add a test for overload using activationProbability (sync only, so it would need to be converted to async)

The driver tracks the pass/fail rate of every operation for each server. During server selection, rather than purely using "operation count", we bias toward the server with the lower error rate. So if server A has an error rate of 75% and server B has 25%, then 75% - 25% = 50% of the time we select server B, and the other 50% of the time we use the existing "operation count" logic. This appears to work well. Before this change, the test sees a ~60% error rate:

Overloaded server: ('localhost', 27017)
{('localhost', 27017): 0.77975, ('localhost', 27018): 0.22025, 'overload_errors': 2344, 'operations': 4000, 'error_rate': 0.586}

After this change, the test sees a ~20% error rate:

Overloaded server: ('localhost', 27017)
{('localhost', 27017): 0.239, ('localhost', 27018): 0.761, 'overload_errors': 731, 'operations': 4000, 'error_rate': 0.18275}
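
As a standalone illustration of that rule (not the actual driver code; the function name here is made up):

    import random

    def bias_toward_lower_error(rate_a: float, rate_b: float) -> bool:
        """Return True when selection should prefer the lower-error server,
        with probability equal to the difference between the two rates."""
        # Equal rates never bias; 75% vs 25% biases 50% of the time,
        # leaving the rest to the existing operation-count logic.
        return random.random() < abs(rate_a - rate_b)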

Concerns:

  • Runtime performance cost of tracking the error rate: it should not slow down the happy path. We may want to cap the number of samples (not just their age) to limit memory usage, and we may also need to cache the error rate rather than recalculating it on every server selection call.
  • Some exceptions are normal "success" results that should not be counted as failures toward the error rate; duplicate key errors are one example. It would be safer to only track errors with the "SystemOverloadedError" label (see the sketch after this list).
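
A minimal sketch of that filtering, assuming the failure path has the raised exception in hand (PyMongoError.has_error_label is a real PyMongo API; where this check would hook into the pool is an assumption):

    from pymongo.errors import PyMongoError

    def counts_toward_error_rate(exc: Exception) -> bool:
        # Only overload-labeled errors count; benign "failures" such as
        # duplicate key errors are excluded from the rate.
        return isinstance(exc, PyMongoError) and exc.has_error_label("SystemOverloadedError")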

@ShaneHarvey (Owner Author) commented:

CC: @sleepyStick

# When server1 wins the operation count comparison, also compare
# error rates: with probability equal to the difference in rates,
# bias the selection toward the lower-error server.
if server1.pool.operation_count <= server2.pool.operation_count:
    error_rate1 = await server1.pool.get_error_rate()
    error_rate2 = await server2.pool.get_error_rate()
    if error_rate1 < error_rate2 and (random.random() < (error_rate2 - error_rate1)):  # noqa: S311


What's the purpose of random.random() < (error_rate2 - error_rate1)?

@ShaneHarvey (Owner Author) replied Nov 20, 2025:

The random choice, scaled by the difference, attempts to balance the load more evenly. Imagine a scenario with only two servers, A and B, where A's error rate is 0% and B's is 1%. A naive approach is to always pick the server with the lower error rate, but then we would never choose server B. That's a problem: as soon as any errors occur on one server, all requests would be rerouted away from it.

Introducing randomness means we instead bias only 1% of requests away from server B, which is clearly better than rerouting 100% of them.

Another example: A's error rate is 25% and B's is 75%. We want to bias some operations toward server A since it has a higher chance of success, but again not all of them. Scaling with the difference in error rates performed well in these scenarios (e.g., we observed fewer errors overall).

@ShaneHarvey (Owner Author) added:

Another way to consider it: when the error rates for two servers are about the same, we want to continue using operationCount-based selection. As the difference in error rates grows, we want to route more and more requests away from the higher-error server.


Makes sense. I should have read the PR description, sorry 😅


I cannot resolve this comment for some reason. (permissions?)

feel free to resolve.

# Record the failure as a timestamped sample before re-raising.
self.error_times.append((time.monotonic(), 1))
raise

# clear error rate info that is >10 seconds old
@baileympearson commented Nov 21, 2025:

We definitely don't need to figure this out now, because we're not certain we'll move forward with this approach.

But because this might be relevant to Iris' DSI workflow: in scenarios where one node is significantly more overloaded than the other, pruning stale error measurements only after connection checkout will make recovery slower than necessary, because server selection will continue to avoid the overloaded server even after it has started to recover. The higher the error rate on one node, the longer it takes on average to reach the error rate measurement pruning logic.

And as the error rate on one node approaches 100% while the other stays healthy, the likelihood of ever selecting that node approaches 0 (if the error rate ever hit 100%, we'd always pick the healthy server and have no chance of ever selecting this one again).

@ShaneHarvey (Owner Author) replied:

Good point, this logic would need to move into server selection to avoid that pitfall. There's also likely a more efficient algorithm for tracking the error rate. One option is 10 buckets that track the errors and total operations for each of the last 10 seconds; the rate is then a simple sum over <=10 buckets rather than a scan of an unbounded list.

Something roughly like this (ignoring the phasing-out of old data):

# One {errors, requests} bucket per elapsed second.
self.error_stats = {}
...
current_second = int(time.monotonic())
bucket = self.error_stats.setdefault(current_second, {"errors": 0, "requests": 0})

if error:
    bucket["errors"] += 1
bucket["requests"] += 1

Then:

    async def get_error_rate(self) -> float:
        current_second = int(time.monotonic())
        errors = 0
        requests = 0
        async with self.lock:
            # Sum only the buckets from the last 10 seconds.
            for sec, bucket in self.error_stats.items():
                if sec < current_second - 10:
                    continue
                errors += bucket["errors"]
                requests += bucket["requests"]
        # Require at least 10 samples to compute an error rate.
        if requests < 10:
            return 0.0
        return errors / requests
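
One way to combine the bucket idea with the recovery concern above, as a rough sketch (field names taken from the snippet above; the 10-second cutoff is the same assumption): prune stale buckets inside get_error_rate itself, which runs during server selection, so even a server we are currently avoiding ages out its old errors. Keeping at most ~10 live buckets also caps the memory usage noted in the PR description's concerns.

    async def get_error_rate(self) -> float:
        current_second = int(time.monotonic())
        cutoff = current_second - 10
        errors = 0
        requests = 0
        async with self.lock:
            # Prune stale buckets here rather than at connection checkout,
            # so an avoided server still recovers its error rate over time.
            for sec in list(self.error_stats):
                if sec < cutoff:
                    del self.error_stats[sec]
            for bucket in self.error_stats.values():
                errors += bucket["errors"]
                requests += bucket["requests"]
        if requests < 10:
            return 0.0
        return errors / requests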
