
[Dev]: Stale gathering detection #165

@TylerBloom

Unmet Need:

Currently, a Gathering lives forever once it is spawned. Because gatherings are never dropped, the server's memory use grows without bound and will eventually be exhausted.

Solution:

The Gathering actor is the only component in the system that can "know" if it should be removed. There are two primary times when a gathering should be removed: when it has not processed a message in a while (~a day) or when all of its outbound connections have been closed. The second case is a bit harder to detect and would eventually lead to the first case, so the first case is where we will focus.

Proper implementation of detection and removal of a gathering will require three steps: bookkeeping of the number of messages a gathering processes, communication between the gathering and the gathering hall, and a message being sent to all clients that the WebSocket connection is being closed.

The bookkeeping step should focus on precision. If a gathering does not process a message for 24 hours, it should be removed as close to that point in time as possible. In other words, the gathering should not rely on a simple check that runs at midnight to do bookkeeping. However, bookkeeping should not be too costly either. For example, the gathering should not queue a new termination message every time it processes another message.
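One way to get both precision and low cost is to put the idle limit on the receive itself, so the timer restarts with every message and nothing extra is queued. The following is a minimal sketch using only the standard library; `GatheringMessage`, `Shutdown`, and `run_gathering` are hypothetical names, and the real gathering presumably runs on an async runtime where the same shape would wrap the receive in a timeout.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical stand-in for the messages a gathering handles.
pub enum GatheringMessage {
    Op(String),
}

/// Why the gathering's run loop ended.
#[derive(Debug, PartialEq, Eq)]
pub enum Shutdown {
    /// No message arrived within the idle limit.
    Stale,
    /// Every sender was dropped, so no message can ever arrive.
    Disconnected,
}

/// Processes messages until the gathering goes stale.
/// Returns how many messages were processed and why the loop ended.
pub fn run_gathering(
    inbox: Receiver<GatheringMessage>,
    idle_limit: Duration,
) -> (usize, Shutdown) {
    let mut processed = 0;
    loop {
        // Each call waits at most `idle_limit`, measured from now, so the
        // deadline is always relative to the last processed message.
        match inbox.recv_timeout(idle_limit) {
            Ok(GatheringMessage::Op(_)) => processed += 1,
            Err(RecvTimeoutError::Timeout) => return (processed, Shutdown::Stale),
            Err(RecvTimeoutError::Disconnected) => {
                return (processed, Shutdown::Disconnected)
            }
        }
    }
}
```

In production the limit would be `Duration::from_secs(24 * 60 * 60)`. The loop returning is what makes the actor mortal, which is the design change the second step has to accommodate.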

The second step will require some slight reworking of how a gathering communicates with the gathering hall. Once a gathering has determined that it should be dispersed, it needs to communicate this to the gathering hall, which should then (somehow) drop the gathering. How to achieve this is somewhat unclear, as the current actor model assumes that an actor will run forever. Some design work on the actor model will be needed here.
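As a starting point for that design work, here is one possible shape for the handshake: each gathering gets a sender back to the hall and reports its own dispersal, and the hall drops its handle when it sees the notice. This is only a sketch with standard-library channels. `GatheringHall`, `HallMessage`, `GatheringId`, `register`, and `reap` are all hypothetical names and do not reflect the existing actor types.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical identifier for a gathering.
pub type GatheringId = u64;

/// Notices a gathering can send back to the hall.
pub enum HallMessage {
    Dispersed(GatheringId),
}

pub struct GatheringHall {
    /// Handles used to forward messages to live gatherings.
    gatherings: HashMap<GatheringId, Sender<String>>,
    /// Notices coming back from gatherings.
    inbox: Receiver<HallMessage>,
    /// Cloned into every gathering so it can reach the hall.
    to_hall: Sender<HallMessage>,
}

impl GatheringHall {
    pub fn new() -> Self {
        let (to_hall, inbox) = channel();
        Self { gatherings: HashMap::new(), inbox, to_hall }
    }

    /// Registers a gathering. Returns the inbox the gathering reads from
    /// and the sender it uses to report its own dispersal.
    pub fn register(&mut self, id: GatheringId) -> (Receiver<String>, Sender<HallMessage>) {
        let (tx, rx) = channel();
        self.gatherings.insert(id, tx);
        (rx, self.to_hall.clone())
    }

    /// Drains pending dispersal notices and drops the matching handles.
    /// Returns how many gatherings were removed.
    pub fn reap(&mut self) -> usize {
        let mut removed = 0;
        while let Ok(HallMessage::Dispersed(id)) = self.inbox.try_recv() {
            if self.gatherings.remove(&id).is_some() {
                removed += 1;
            }
        }
        removed
    }

    pub fn gathering_count(&self) -> usize {
        self.gatherings.len()
    }
}
```

A gathering that has gone stale would send `HallMessage::Dispersed(id)` and then return from its run loop. Dropping the hall's `Sender` for that gathering releases the last reference to its inbox.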

The last step is mostly a courtesy. We could simply drop the WebSocket connections and let the SquireClient figure things out. This is less than ideal because a connection can be terminated for any number of reasons, so the client could not tell a dispersed gathering apart from an ordinary network failure. So, we should explicitly communicate that the connection is being closed. Note that the server has a mechanism to retry messages that fail to send over WebSockets. The termination message does not need to be retried.
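The "no retry" rule can be expressed as a property of the message itself, so the existing retry path does not need a special case at each call site. The sketch below assumes a hypothetical `ServerMessage` enum and a `deliver` helper; the `send` closure stands in for the real WebSocket send, and none of these names come from the current server code.

```rust
/// Hypothetical outbound messages from the server to a client.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ServerMessage {
    Update(String),
    /// Sent once, right before the server closes the connection.
    Closing { reason: String },
}

impl ServerMessage {
    /// The closing notice is a courtesy, so it is never retried.
    pub fn should_retry(&self) -> bool {
        !matches!(self, ServerMessage::Closing { .. })
    }
}

/// Tries to send `msg`, retrying up to `max_attempts` times unless the
/// message opts out of retries. Returns true if any attempt succeeded.
pub fn deliver<F>(msg: &ServerMessage, max_attempts: u32, mut send: F) -> bool
where
    F: FnMut(&ServerMessage) -> Result<(), ()>,
{
    let attempts = if msg.should_retry() { max_attempts.max(1) } else { 1 };
    for _ in 0..attempts {
        if send(msg).is_ok() {
            return true;
        }
    }
    false
}
```

If the closing notice fails to send, the client falls back to the same behavior it would have had without the notice, so nothing is lost by giving up after one attempt.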

Challenges/Considerations:

Most of the considerations concern the second step. Because the code assumes that an actor never dies, there are unwraps in several places. We need to ensure that we are not unwrapping anything that is sent to or received from a gathering, since either side of that exchange can now fail once the gathering has been dropped.
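To illustrate what replacing an unwrap could look like, the sketch below forwards a message to a gathering and returns an error when the gathering is unknown or has already been dropped. `ForwardError` and `forward` are hypothetical, and a plain `HashMap` of standard-library senders stands in for however the hall actually tracks its gatherings.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

/// Hypothetical errors for forwarding a message to a gathering.
#[derive(Debug, PartialEq, Eq)]
pub enum ForwardError {
    /// The hall has no handle for this id.
    UnknownGathering,
    /// The handle exists but the gathering has already shut down.
    GatheringGone,
}

/// Forwards `msg` to gathering `id` without unwrapping anything.
pub fn forward(
    gatherings: &mut HashMap<u64, Sender<String>>,
    id: u64,
    msg: String,
) -> Result<(), ForwardError> {
    let sender = gatherings.get(&id).ok_or(ForwardError::UnknownGathering)?;
    if sender.send(msg).is_err() {
        // The receiver was dropped, so the gathering has dispersed.
        // Remove the stale handle so later lookups fail fast.
        gatherings.remove(&id);
        return Err(ForwardError::GatheringGone);
    }
    Ok(())
}
```

Callers can then decide what a missing gathering means for them, for example reporting an error to the client instead of panicking the server.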

Labels

SquireCore: Affects the SquireCore server
SquireSDK: Affects the SquireSDK library
in progess: Actively being worked on
requirement: A requirement for the next major release
todo: Will be resolved but work hasn't started
