(Continuing a discussion I had with @volovyks today.)
Problem
Too many concurrent active sign tasks can overwhelm the system. In practice, this is most likely to happen right after a reboot, when recovering from the backlog and catching up.
However, limiting signature tasks (bidirectional or not) is more challenging than limiting Presignature and Triple generating tasks.
The current architecture guarantees:
- Every node always has an active SignTask for a request that has been indexed until the response has been indexed
- A SignTask keeps timing out and retrying until a delivered signature is confirmed. Positor / deliberator roles may switch between retries.
This is great to ensure we don't drop requests. But it assumes infinite capacity for handling concurrent tasks.
In this architecture, if we decide to only have a fixed number of active SignTasks, it is unclear to me how we would coordinate among nodes which requests are currently active. If we don't coordinate that, whether a threshold of nodes is ready to work on the same request at the same time is left to chance.
Brainstorming
A possible resolution could be to only limit how many SignTasks are active with the node as the proposer. And allow an unlimited number of tasks where we are deliberator. Messages meant for deliberators can thus always be processed and reactivate a sleeping task. The total active tasks would still be limited, assuming all nodes behave well.
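To make the proposer-only limit concrete, here is a rough sketch. It is not what the code does today: `ActiveTasks`, `Role`, `SignId`, `try_activate` and `finish` are hypothetical names, and the real SignTask bookkeeping is more involved. The only point is that proposer activations are capped while deliberator activations always go through.

```rust
use std::collections::HashSet;

/// Hypothetical request identifier; the real code base has its own type.
type SignId = u64;

/// Role this node plays in the current attempt for a request.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Role {
    Proposer,
    Deliberator,
}

/// Bookkeeping for active tasks. Only proposer tasks count against the cap.
struct ActiveTasks {
    max_proposing: usize,
    proposing: HashSet<SignId>,
    deliberating: HashSet<SignId>,
}

impl ActiveTasks {
    fn new(max_proposing: usize) -> Self {
        Self {
            max_proposing,
            proposing: HashSet::new(),
            deliberating: HashSet::new(),
        }
    }

    /// Try to (re)activate the task for `id` in the given role.
    /// Returns true if the task is active afterwards.
    fn try_activate(&mut self, id: SignId, role: Role) -> bool {
        match role {
            // Deliberator work is never refused, so an incoming message can
            // always wake a sleeping task.
            Role::Deliberator => {
                self.deliberating.insert(id);
                true
            }
            Role::Proposer => {
                if self.proposing.contains(&id) {
                    return true; // already holds a proposer slot
                }
                if self.proposing.len() >= self.max_proposing {
                    return false; // stay asleep until a slot frees up
                }
                self.proposing.insert(id);
                true
            }
        }
    }

    /// Release whatever the task for `id` was holding, e.g. once the
    /// response has been indexed.
    fn finish(&mut self, id: SignId) {
        self.proposing.remove(&id);
        self.deliberating.remove(&id);
    }
}
```

With a cap of 2, a third proposer activation is refused until one of the first two finishes, while deliberator activations are accepted regardless of the cap.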
But just slapping that on top of what we have makes our posit handling even more complex... I don't like the direction we are going here. We have too many ifs inside the SignTask already.
Maybe there is a design that simplifies the current setup and allows limiting the concurrent tasks. This is really just brainstorming:
- Separate posit and generator tasks: SignatureSpawner maintains SignProposeTask and SignGeneratorTask instead of SignTask.
- Separate deliberator / proposer handling logic: A SignProposeTask only contains the posit logic we need to do when acting as proposer. (should simplify it quite a bit compared to what is now in SignTask)
- Incoming messages where we are deliberator can be handled statelessly, directly inside the SignatureSpawner. To decide on accept / reject, directly read the db / backlog and active tasks. This should be a side-effect-free function that doesn't need to maintain any state for sent posit messages. We might accept multiple proposals for the same signature but that shouldn't be a problem.
- The SignatureSpawner spawns a SignGeneratorTask when it receives a START message from another node. We may even have multiple parallel signatures ongoing for the same request. Not ideal but better than failing to produce a signature.
- The SignatureSpawner decides when it is time for a node to try (or retry) signing as a proposer. It can look at ongoing protocols where we are deliberators to make this decision. When proposing succeeds, it also spawns a SignGeneratorTask.
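The split listed above could look roughly like the sketch below. Only SignatureSpawner, SignProposeTask and SignGeneratorTask come from the proposal; `Incoming`, `Vote`, `SignId`, `NodeId`, the fields and all method names are assumptions for illustration, and the accept rule (request is in the backlog) is a stand-in for whatever the real db / backlog check would be.

```rust
use std::collections::HashSet;

// Hypothetical identifiers; the real code base has its own types.
type SignId = u64;
type NodeId = u32;

/// Hypothetical messages a node receives from a proposer.
enum Incoming {
    Proposal { id: SignId, from: NodeId },
    Start { id: SignId, from: NodeId },
}

#[derive(Debug, PartialEq, Eq)]
enum Vote {
    Accept,
    Reject,
}

/// Proposer-side posit logic only.
struct SignProposeTask {
    id: SignId,
}

/// Runs the signing protocol once a proposer has sent START.
struct SignGeneratorTask {
    id: SignId,
    proposer: NodeId,
}

struct SignatureSpawner {
    me: NodeId,
    /// Requests that were indexed but have no indexed response yet.
    backlog: HashSet<SignId>,
    propose_tasks: Vec<SignProposeTask>,
    generator_tasks: Vec<SignGeneratorTask>,
}

impl SignatureSpawner {
    fn new(me: NodeId, backlog: &[SignId]) -> Self {
        Self {
            me,
            backlog: backlog.iter().copied().collect(),
            propose_tasks: Vec::new(),
            generator_tasks: Vec::new(),
        }
    }

    /// Stateless deliberator decision: reads the backlog, records nothing.
    fn deliberate(&self, id: SignId) -> Vote {
        if self.backlog.contains(&id) {
            Vote::Accept
        } else {
            Vote::Reject
        }
    }

    /// Proposals get a vote back; START spawns a generator task, even if
    /// another generator for the same request is already running.
    fn on_message(&mut self, msg: Incoming) -> Option<Vote> {
        match msg {
            Incoming::Proposal { id, .. } => Some(self.deliberate(id)),
            Incoming::Start { id, from } => {
                self.generator_tasks
                    .push(SignGeneratorTask { id, proposer: from });
                None
            }
        }
    }

    /// Start acting as proposer for `id`.
    fn propose(&mut self, id: SignId) {
        self.propose_tasks.push(SignProposeTask { id });
    }

    /// Our own proposal went through: drop the propose task and start
    /// generating with ourselves as proposer.
    fn on_propose_success(&mut self, id: SignId) {
        self.propose_tasks.retain(|t| t.id != id);
        self.generator_tasks
            .push(SignGeneratorTask { id, proposer: self.me });
    }
}
```

Note that `deliberate` takes `&self`, so accepting two proposals for the same request from different proposers needs no extra state, and two START messages for the same request simply produce two generator tasks.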
I believe this would reduce the overall complexity and avoid many of the posit problems we face today.
Scheduling (= deciding which tasks we should be proposer for) becomes a local decision of the SignatureSpawner that does not need to be synchronized with other nodes.
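As an illustration of how local that decision could be, here is a minimal sketch of a scheduling function. `pick_to_propose`, its parameters and the selection rule (oldest backlog entries first, skipping requests that are already being proposed or generated) are assumptions, not an agreed policy.

```rust
use std::collections::HashSet;

/// Hypothetical request identifier; the real code base has its own type.
type SignId = u64;

/// Pick the requests this node should start proposing for.
///
/// * `backlog`: pending requests, oldest first.
/// * `proposing`: requests we are already proposing for.
/// * `generating`: requests with an ongoing generator task, including
///   ones started by other proposers where we are deliberator.
/// * `max_proposing`: cap on concurrent propose tasks on this node.
fn pick_to_propose(
    backlog: &[SignId],
    proposing: &HashSet<SignId>,
    generating: &HashSet<SignId>,
    max_proposing: usize,
) -> Vec<SignId> {
    // Slots still available under the cap (0 if we are at or over it).
    let free_slots = max_proposing.saturating_sub(proposing.len());
    backlog
        .iter()
        .copied()
        // Skip requests somebody is already working on.
        .filter(|id| !proposing.contains(id) && !generating.contains(id))
        .take(free_slots)
        .collect()
}
```

Everything the function reads is node-local state, so no message exchange is needed to decide what to propose next.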
Overlapping attempts from different proposers are now handled in parallel. This increases the chance that we produce too many signatures. But it completely removes the race conditions that we have seen on devnet which stopped posits from going through. We can even remove the per-round message buffer that I added in an attempt to deal with overlapping rounds in a brute-force way.