DRIVERS-3217 Investigate error rate tracking for server selection #2
base: backpressure
Conversation
CC: @sleepyStick
    if server1.pool.operation_count <= server2.pool.operation_count:
        error_rate1 = await server1.pool.get_error_rate()
        error_rate2 = await server2.pool.get_error_rate()
        if error_rate1 < error_rate2 and (random.random() < (error_rate2 - error_rate1)):  # noqa: S311
What's the purpose of random.random() < (error_rate2 - error_rate1)?
The random choice, scaled by the difference in error rates, attempts to balance the load more evenly. Imagine a scenario where there are only 2 servers, A and B. A's error rate is 0% and B's is 1%. A naive approach is to always pick the server with the lower error rate, but that would mean we'd never choose server B. That's a problem because as soon as any errors occur on one server, all requests would be rerouted away from it.
Introducing randomness here means we instead bias only 1% of requests away from server B, which is clearly better than rerouting 100% of them.
Another example: A's error rate is 25% and B's is 75%. We want to bias some operations toward server A since it has a higher chance of success, but again not all of them. Scaling with the difference in error rates seemed to perform well in these scenarios (e.g. we observe fewer errors overall).
Another way to consider it: when the error rates for 2 servers are around the same, we want to continue using operationCount-based selection. As the difference in error rates increases, we want to route more and more requests away from the worse server.
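As a standalone illustration (not the driver's actual selection code), a quick simulation of this rule with made-up error rates shows how the bias scales with the difference; the operationCount fallback is replaced here by a hypothetical coin flip:

    import random

    def pick(rate_a: float, rate_b: float) -> str:
        # Bias toward the lower-error server with probability equal to the
        # difference in error rates; otherwise fall back to a coin flip as a
        # stand-in for the existing operationCount-based selection.
        if random.random() < abs(rate_a - rate_b):  # noqa: S311
            return "A" if rate_a < rate_b else "B"
        return random.choice(["A", "B"])  # noqa: S311

    counts = {"A": 0, "B": 0}
    for _ in range(100_000):
        counts[pick(0.25, 0.75)] += 1
    # With a 50-point difference, roughly 75% of selections land on A:
    # ~50% from the bias plus half of the remaining 50%.
    print(counts)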
Makes sense. I should have read the PR description, sorry 😅
I cannot resolve this comment for some reason. (permissions?)
feel free to resolve.
    self.error_times.append((time.monotonic(), 1))
    raise

    ...

    # clear old info from error rate >10 seconds old
We definitely don't need to figure this out now, because we're not certain we'll move forward with this approach.
But because this might be relevant to Iris' DSI workflow: in scenarios where one node is significantly more overloaded than the other, pruning stale error measurements after connection checkout will result in a slower recovery than necessary, because server selection will continue to avoid the overloaded server even after it has started to recover. The higher the error rate on one node, the longer it will take on average to reach the error-rate pruning logic.
And as the error rate approaches 100% on one node while the other stays healthy, the likelihood of ever selecting the unhealthy server approaches 0 (if its error rate ever hit 100%, we'd always pick the healthy server, so the unhealthy server's stale error data would never be pruned and we'd have no chance of ever selecting it again).
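To make the pitfall concrete, here is a rough, hypothetical sketch of the pattern being described (the names PoolSketch, op_times, and checkout are made up for illustration; only error_times and the 10-second window come from the diff above). Pruning lives in checkout, so a server that stops being selected also stops ageing out its old errors:

    import time

    class PoolSketch:
        def __init__(self) -> None:
            self.error_times = []  # (monotonic timestamp, 1) per failed operation
            self.op_times = []     # (monotonic timestamp, 1) per operation

        def _prune(self, now: float) -> None:
            # Clear info from the error rate that is >10 seconds old.
            self.error_times = [(t, n) for t, n in self.error_times if now - t <= 10]
            self.op_times = [(t, n) for t, n in self.op_times if now - t <= 10]

        async def checkout(self):
            # Pruning only runs when this pool is actually used, i.e. when
            # server selection has already picked this server.
            self._prune(time.monotonic())
            ...

        async def get_error_rate(self) -> float:
            # Server selection reads the rate without pruning, so a server that
            # is being avoided keeps reporting its stale, inflated error rate.
            errors = sum(n for _, n in self.error_times)
            ops = sum(n for _, n in self.op_times)
            return errors / ops if ops else 0.0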
Good point, this logic would need to be moved into server selection to avoid that pitfall. Also, there's likely a more efficient algorithm to track the error rate. One way would be to keep 10 buckets that track the errors and total operations for each of the last 10 seconds. The rate is then a simple sum over at most 10 buckets rather than over an unbounded list.
Something roughly like this (ignoring phasing out old data):
    self.error_stats = {}
    ...
    current_second = int(time.monotonic())
    bucket = self.error_stats.setdefault(current_second, {"errors": 0, "requests": 0})
    if error:
        bucket["errors"] += 1
    bucket["requests"] += 1

Then:

    async def get_error_rate(self) -> float:
        current_second = int(time.monotonic())
        errors = 0
        requests = 0
        async with self.lock:
            for sec in self.error_stats:
                if sec < current_second - 10:
                    continue
                bucket = self.error_stats[sec]
                errors += bucket["errors"]
                requests += bucket["requests"]
        # Require at least 10 samples to compute an error rate.
        if requests < 10:
            return 0.0
        return float(errors) / requests
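For the "phasing out old data" part that the snippet above skips, one option (a sketch of the same bucket idea, wrapped in a hypothetical ErrorRateTracker class with an asyncio.Lock) is to drop stale buckets whenever an operation is recorded:

    import asyncio
    import time

    class ErrorRateTracker:
        def __init__(self) -> None:
            self.error_stats: dict[int, dict[str, int]] = {}
            self.lock = asyncio.Lock()

        async def record(self, error: bool) -> None:
            current_second = int(time.monotonic())
            async with self.lock:
                bucket = self.error_stats.setdefault(
                    current_second, {"errors": 0, "requests": 0}
                )
                if error:
                    bucket["errors"] += 1
                bucket["requests"] += 1
                # Drop buckets older than the 10-second window so the dict
                # stays at roughly 10 entries instead of growing unbounded.
                for sec in list(self.error_stats):
                    if sec < current_second - 10:
                        del self.error_stats[sec]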
Changes:
The driver tracks the pass/fail rate of every operation for each server. In server selection, rather than purely using "operation count", we bias towards the server with the lower error rate. So if server A has an error rate of 75% and server B's is 25%, then 75 - 25 = 50% of the time we select server B, and the other 50% of the time we use the existing "operation count" logic. This appears to work well: before this change the test sees a ~60% error rate; after this change it drops to ~20%.
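Put together, the selection rule described above might look roughly like the following sketch (not the PR's exact code; the server1/server2 arguments and pool attributes follow the diff quoted earlier in the conversation):

    import random

    async def choose(server1, server2):
        # Bias toward the lower-error server with probability equal to the
        # difference in error rates.
        error_rate1 = await server1.pool.get_error_rate()
        error_rate2 = await server2.pool.get_error_rate()
        if random.random() < abs(error_rate1 - error_rate2):  # noqa: S311
            return server1 if error_rate1 < error_rate2 else server2
        # Otherwise keep the existing operationCount-based selection.
        if server1.pool.operation_count <= server2.pool.operation_count:
            return server1
        return server2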
Concerns: