Draft
151 commits
1b639d8
Adds Steve's redis queue
LukeButters Jul 24, 2025
31704d3
A queue in halibut
LukeButters Jul 28, 2025
0575999
The test
LukeButters Jul 28, 2025
178cb77
.
LukeButters Jul 28, 2025
e268a67
.
LukeButters Jul 28, 2025
9ca97e7
This works with octopus
LukeButters Jul 29, 2025
28e0fa3
Cancel works
LukeButters Jul 29, 2025
b084ae8
pick up test
LukeButters Jul 30, 2025
2ba97f8
Redis sub retry
LukeButters Jul 30, 2025
0e0461d
Ok this is better
LukeButters Jul 31, 2025
a245561
Add simple retries
LukeButters Aug 3, 2025
6cdec4f
Show that the reciever can re-connect itself
LukeButters Aug 3, 2025
760c315
Add failing test for dequeuer disconnect
LukeButters Aug 3, 2025
285a161
If the node processing the request disconnects we can detect that now
LukeButters Aug 4, 2025
b507feb
.
LukeButters Aug 4, 2025
80a6d48
The sender can breifly disconnect and still recieve the response
LukeButters Aug 5, 2025
7051ab2
well well that was hard
LukeButters Aug 5, 2025
0e86378
.
LukeButters Aug 5, 2025
edc2454
.
LukeButters Aug 5, 2025
839b65c
some clean up
LukeButters Aug 5, 2025
6699b9a
.
LukeButters Aug 5, 2025
e4d1077
Now support cancelling in flight requests when the sender is offline
LukeButters Aug 6, 2025
0d18bca
Fix bug where we kept sending pulses
LukeButters Aug 6, 2025
8ae84b6
Now we can detect a redis that drops all it data like its hot
LukeButters Aug 6, 2025
a5d8b93
Add support to detect redis loss
LukeButters Aug 7, 2025
a8aacd3
.
LukeButters Aug 7, 2025
ee7a794
.
LukeButters Aug 7, 2025
c3172e4
.
LukeButters Aug 7, 2025
9675eb1
Fix tests under load
LukeButters Aug 7, 2025
3ed16ca
Improve test
LukeButters Aug 7, 2025
1ea826b
More reliable timeout
LukeButters Aug 7, 2025
b31a64d
Start testing both queues at the same time
LukeButters Aug 7, 2025
c209355
.
LukeButters Aug 7, 2025
ee0fe39
Queue tests are now shared between in mem and redis queue
LukeButters Aug 11, 2025
7b46dc3
Fixes a bug where subscriptions would not be unsubscribed from
LukeButters Aug 12, 2025
b73eead
ignore some exceptions
LukeButters Aug 12, 2025
7ec738f
Fix big in detecting if redis lost its data
LukeButters Aug 12, 2025
2de5e73
Better logging, dispose extra queues created
LukeButters Aug 13, 2025
a85a5df
Never dispose CTS
LukeButters Aug 13, 2025
90157f1
Allow retrying when redis loses all of its data
LukeButters Aug 13, 2025
a03aed0
Account for requests being abandoned
LukeButters Aug 13, 2025
824e793
Update some TODOs
LukeButters Aug 13, 2025
f0b3d5b
When the receiever of the requests detects data lose it returns a ret…
LukeButters Aug 13, 2025
c837543
Set TTL, Use token, Add logging, Remove generic from PollAndSub class
LukeButters Aug 13, 2025
d2e2e9e
Redis Facade unit tests
LukeButters Aug 13, 2025
8ac0d0c
.
LukeButters Aug 14, 2025
b40616b
Test we wait for the request to be collected before timing out on hea…
LukeButters Aug 14, 2025
34f60c3
.
LukeButters Aug 14, 2025
cbdccb2
Dont wait forever when an error occurs reading the response
LukeButters Aug 14, 2025
1dc6145
Fix error types so that RPC will be retried
LukeButters Aug 14, 2025
c926903
Random delay on error
LukeButters Aug 14, 2025
1999a4a
.
LukeButters Aug 14, 2025
b7a3354
One must dispose CTS less mem leak, but also not while anything is st…
LukeButters Aug 14, 2025
647dd08
Fix issue where we dispose the PollAndSubscribeToResponse before we d…
LukeButters Aug 15, 2025
a8976b6
It compiles on windows
LukeButters Aug 15, 2025
a40e45c
Ignore most redis tests for now
LukeButters Aug 15, 2025
f6b9346
Start redis if not already started
LukeButters Aug 15, 2025
2f9964d
.
LukeButters Aug 15, 2025
606fcaf
Add redis test attribute
LukeButters Aug 16, 2025
bda4d4f
Log setup fixture to standard location
LukeButters Aug 16, 2025
04ac599
.
LukeButters Aug 16, 2025
eadfee5
.
LukeButters Aug 16, 2025
899c035
Support setting redis host
LukeButters Aug 16, 2025
d1a1527
Finally the host is respected
LukeButters Aug 17, 2025
d63c83d
Add back net48
LukeButters Aug 17, 2025
b049d97
.
LukeButters Aug 17, 2025
ec3bc12
Merge branch 'luke/redis-queue-reimagined' of github.com:OctopusDeplo…
LukeButters Aug 17, 2025
992a515
.
LukeButters Aug 18, 2025
a29f0d4
.
LukeButters Aug 18, 2025
ca4b5e2
.
LukeButters Aug 18, 2025
faabbeb
.
LukeButters Aug 18, 2025
0a82c2b
.
LukeButters Aug 18, 2025
9ddbb7d
.
LukeButters Aug 18, 2025
90cc276
Merge branch 'main' into luke/redis-queue-reimagined
LukeButters Aug 18, 2025
a7c742e
.
LukeButters Aug 18, 2025
2ea22a8
Merge branch 'main' into luke/redis-queue-reimagined
LukeButters Aug 18, 2025
e64b29f
.
LukeButters Aug 18, 2025
52748cb
.
LukeButters Aug 18, 2025
0c86570
.
LukeButters Aug 18, 2025
648f2ab
Cleanup HalibutRedisTransport
LukeButters Aug 18, 2025
38e95e5
Cleanup
LukeButters Aug 18, 2025
901cd70
Cleanup
LukeButters Aug 18, 2025
374b395
.
LukeButters Aug 18, 2025
d961a7f
.
LukeButters Aug 18, 2025
6b20d3e
.
LukeButters Aug 18, 2025
ec493fb
.
LukeButters Aug 18, 2025
b489a6c
Merge branch 'main' into luke/redis-queue-reimagined
LukeButters Aug 18, 2025
3b2df41
Merged in retryable
LukeButters Aug 18, 2025
dadb074
.
LukeButters Aug 19, 2025
20cf74e
Don't re-send response
rhysparry Aug 19, 2025
d9096ce
Fix typo in Redis queue doc
rhysparry Aug 19, 2025
377d690
fmt ExceptionReturnedByHalibutProxyExtensionMethod
rhysparry Aug 19, 2025
197924f
format Halibut.Util
rhysparry Aug 19, 2025
32bec0e
Cleanup trailing whitespace
rhysparry Aug 19, 2025
2c08d34
format HalibutRuntimeBuilder
rhysparry Aug 19, 2025
b87cf60
Minor nits in QueueMessageSerializer
rhysparry Aug 19, 2025
4158de2
Fix nits for RedisPendingRequest
rhysparry Aug 19, 2025
10c1108
More doco
LukeButters Aug 19, 2025
13f3c52
Fix typo in GetTokenForDataLossDetection
rhysparry Aug 19, 2025
a963fbd
Merge remote-tracking branch 'refs/remotes/origin/luke/redis-queue-re…
rhysparry Aug 19, 2025
33502f3
test cleanup
LukeButters Aug 19, 2025
722a450
Merge branch 'luke/redis-queue-reimagined' of github.com:OctopusDeplo…
LukeButters Aug 19, 2025
c0fb633
test cleanup
LukeButters Aug 19, 2025
c4c48fe
.
LukeButters Aug 19, 2025
2f90245
CancelOnDispose doco
LukeButters Aug 19, 2025
5a2715b
.
LukeButters Aug 19, 2025
6e88938
cleanup
LukeButters Aug 19, 2025
039ec90
cleanup
LukeButters Aug 19, 2025
4e0db6c
Rename request to requestId
rhysparry Aug 20, 2025
0c7aa17
Consolidate key generation in RedisFacade
rhysparry Aug 20, 2025
42f4952
Merge remote-tracking branch 'refs/remotes/origin/luke/redis-queue-re…
rhysparry Aug 20, 2025
2f14433
test cleanup
LukeButters Aug 20, 2025
e4c20a4
t stas app Merge branch 'luke/redis-queue-reimagined' of github.com:O…
LukeButters Aug 20, 2025
cb7a3b5
Tests for CancelOnDisposeCts
LukeButters Aug 20, 2025
986cd9b
Cleanup nits in HalibutRedisTransport
rhysparry Aug 20, 2025
ce3fd6d
.
LukeButters Aug 20, 2025
8c6928d
Renamed DataLoss exception
rhysparry Aug 20, 2025
4e8e74e
Merge remote-tracking branch 'refs/remotes/origin/luke/redis-queue-re…
rhysparry Aug 20, 2025
6729cc7
Rename other data loss exception
rhysparry Aug 20, 2025
cbc5ab8
Re arrange redis queue
LukeButters Aug 20, 2025
8afacfe
Merge branch 'luke/redis-queue-reimagined' of github.com:OctopusDeplo…
LukeButters Aug 20, 2025
76efa2e
clean redis PRQ builder
LukeButters Aug 20, 2025
145904c
clean up test
LukeButters Aug 20, 2025
3778dfd
.
LukeButters Aug 20, 2025
4784065
.
LukeButters Aug 20, 2025
11e6d18
Nit cleanup of NodeHeartBeatWatcher
rhysparry Aug 20, 2025
3643712
Rename data loss detection namespace
rhysparry Aug 20, 2025
81e7d1f
Merge remote-tracking branch 'refs/remotes/origin/luke/redis-queue-re…
rhysparry Aug 20, 2025
22a8c6d
.
LukeButters Aug 20, 2025
34c51a9
Merge branch 'luke/redis-queue-reimagined' of github.com:OctopusDeplo…
LukeButters Aug 20, 2025
4d5d7f2
Comment on QueueMessageSerializer origins
rhysparry Aug 20, 2025
c1b16db
erge branch 'luke/redis-queue-reimagined' of github.com:OctopusDeplo…
LukeButters Aug 20, 2025
9cbad98
Fix compilation on net48
LukeButters Aug 20, 2025
f50beaf
Remove redundant overrideCancellationReason calls
rhysparry Aug 20, 2025
8857add
Merge remote-tracking branch 'refs/remotes/origin/luke/redis-queue-re…
rhysparry Aug 20, 2025
78f20cb
.
LukeButters Aug 20, 2025
9a7fe20
More data loss renames
rhysparry Aug 20, 2025
9e2b086
Fix nits in RedisPendingRequestQueue
rhysparry Aug 20, 2025
fd1412e
fix formatting in MessageReaderWriterExtensionMethods
rhysparry Aug 20, 2025
80cb455
.
LukeButters Aug 21, 2025
14ae881
.
LukeButters Aug 21, 2025
8f1f96a
.
LukeButters Aug 21, 2025
129c1f3
.
LukeButters Aug 21, 2025
2c48c76
.
LukeButters Aug 21, 2025
dd82b2d
.
LukeButters Aug 21, 2025
d86653a
Geoff is willing to risk it
LukeButters Aug 21, 2025
96c4232
.
LukeButters Aug 21, 2025
c6d3fa5
Don't immediatly poll for a response
LukeButters Aug 21, 2025
dd989f2
Don't immediatly poll for request cancellation
LukeButters Aug 21, 2025
9349e64
Don't immediatly check the request was collected
LukeButters Aug 21, 2025
7713ea0
Data Store now has MetaData
LukeButters Aug 21, 2025
113 changes: 113 additions & 0 deletions docs/RedisQueue.md
@@ -0,0 +1,113 @@
# Redis Pending Request Queue Beta

Halibut provides a Redis-backed pending request queue for multi-node setups. This solves the problem where
a cluster of multiple clients needs to send commands to polling services, each of which connects to only one of the
clients.

For example, suppose there are two clients, ClientA and ClientB, and the Service connects to ClientB, yet ClientA wants
to execute an RPC. With only the in-memory queue that won't work: the request ends up in the in-memory queue for ClientA,
but it needs to be accessible to ClientB.

The Redis queue solves this, as the request is placed into Redis allowing ClientB to access the request and
so send it to the Service.

## How to run Redis for this queue.

Redis can be started by running the following command from the root of the repository:

```
docker run -v `pwd`/redis-conf:/usr/local/etc/redis -p 6379:6379 --name redis -d redis redis-server /usr/local/etc/redis/redis.conf
```

Note that Redis is configured to have no backup; everything is kept in memory. The queue relies on this assumption to function.

# Design

## Background
### What is a Pending Request Queue.

Halibut turns an RPC call into a `RequestMessage`, which is placed into the Pending Request Queue by calling `ResponseMessage QueueAndWait(RequestMessage)`. This is a blocking call that queues the `RequestMessage` and waits for the `ResponseMessage` before returning.

A polling service, e.g. Tentacle, calls the `Dequeue` method of the queue to get the next `RequestMessage` to process. It then responds by calling `ApplyResponse(ResponseMessage)`, which causes `QueueAndWait()` to return the `ResponseMessage`. This in turn completes the RPC call.
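The contract between the three methods can be sketched with a small in-memory model. This is only an illustration in Python, not Halibut's implementation: the class name, the snake_case method names, the timeouts, and the use of a plain string payload are all assumptions made for the sketch.

```python
import queue
import threading


class InMemoryPendingRequestQueue:
    """Toy stand-in for a pending request queue; not Halibut's implementation."""

    def __init__(self):
        self._pending = queue.Queue()
        self._waiters = {}  # request id -> (Event, one-slot list for the response)
        self._lock = threading.Lock()

    def queue_and_wait(self, request_id, payload, timeout=5.0):
        """Queue a request and block until apply_response delivers its response."""
        done, slot = threading.Event(), []
        with self._lock:
            self._waiters[request_id] = (done, slot)
        self._pending.put((request_id, payload))
        try:
            if not done.wait(timeout):
                raise TimeoutError(f"no response for {request_id}")
            return slot[0]
        finally:
            with self._lock:
                self._waiters.pop(request_id, None)

    def dequeue(self, timeout=1.0):
        """Return the next (request_id, payload), or None if there is no work."""
        try:
            return self._pending.get(timeout=timeout)
        except queue.Empty:
            return None

    def apply_response(self, request_id, response):
        """Hand the response to the caller blocked in queue_and_wait."""
        with self._lock:
            waiter = self._waiters.get(request_id)
        if waiter is not None:
            done, slot = waiter
            slot.append(response)
            done.set()


def polling_service(prq):
    """Plays the role of a polling service: dequeue, process, respond."""
    request = prq.dequeue()
    if request is not None:
        request_id, payload = request
        prq.apply_response(request_id, payload.upper())


prq = InMemoryPendingRequestQueue()
worker = threading.Thread(target=polling_service, args=(prq,))
worker.start()
result = prq.queue_and_wait("req-1", "hello")  # blocks until the worker responds
worker.join()
print(result)
```

The caller stays blocked in `queue_and_wait` until the worker calls `apply_response`, which mirrors the blocking behaviour described above.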

The Redis Pending Request Queue solves the problem where multiple clients wish to execute RPC calls against a single polling service that is connected to exactly one client. For example, Client A makes an RPC call, but the service is connected to Client B. The Redis Pending Request Queue is what moves the `RequestMessage` from Client A to Client B to be sent to the service.

### Redis specific details relevant to the queue.

First we need to understand a little about Redis and how we are using it:
- Redis may lose data.
- Pub/Sub does not have guaranteed delivery, so we can miss publications.
- Pub/Sub channels are not pets in Redis: they are created simply by "subscribing" and are "deleted" when there are no subscribers to that channel.
- Redis is reached over the network, which can be flaky, so we retry operations against Redis when we can.

## High Level design.

Setup:
- Client A is executing the RPC call
- Client B has the Polling service connected to it.

At a high level, the steps the Redis queue goes through to execute an RPC are:

1. Client B subscribes to the unique "RequestMessage Pulse Channel", as the polling service is connected to it. The channel is keyed by the polling client id, e.g. "poll://123".
2. Client A executes an RPC and so calls `QueueAndWait` with a `RequestMessage`. Each `RequestMessage` has a unique `GUID`.
3. Client A subscribes to the `ResponseMessage channel` keyed by `GUID` to be notified when a response is available.
4. Client A serialises the message and places it into a hash in Redis keyed by the `RequestMessage` `GUID`.
5. Client A adds the `GUID` to the polling client's unique Redis list (aka queue). The key is the polling client id, e.g. "poll://123".
6. Client A pulses the polling client's unique "RequestMessage Pulse Channel" to alert the subscriber (Client B) that there is work to do.
7. Client B receives the pulse message and tries to dequeue a `GUID` from the polling client's unique Redis list (aka queue).
8. Client B now has the `GUID` of the request and so atomically gets and deletes the `RequestMessage` from the Redis hash using that `GUID`.
9. Client B sends the request to the Service (e.g. Tentacle), waits for the response, and calls `ApplyResponse()` with the `ResponseMessage`.
10. Client B writes the `ResponseMessage` to Redis in a hash using the `GUID` as the key.
11. Client B pulses the `ResponseMessage channel` keyed by the `RequestMessage` `GUID` to signal that a response is available.
12. Client A receives a pulse on the `ResponseMessage channel` and so knows a response is available. It reads the response from Redis and returns from the `QueueAndWait()` method.
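The twelve steps can be traced end to end with a single-threaded simulation. The sketch below is a simplification written in Python: `FakeRedis` is a hypothetical in-memory stand-in (plain dictionaries instead of Redis hashes, synchronous callbacks instead of real Pub/Sub), and the key prefixes `request:`, `response:`, `queue:` and `pulse:` are invented for the example rather than taken from Halibut.

```python
import json
import uuid
from collections import defaultdict, deque


class FakeRedis:
    """In-memory stand-in for the few Redis features the steps rely on."""

    def __init__(self):
        self.values = {}
        self.lists = defaultdict(deque)
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message=""):
        # Delivered synchronously here; real Pub/Sub is asynchronous and lossy.
        for callback in list(self.subscribers[channel]):
            callback(message)

    def get_and_delete(self, key):
        return self.values.pop(key, None)


redis = FakeRedis()
endpoint = "poll://123"
responses = {}


def service(request):
    """Hypothetical polling service: reports which method it ran."""
    return {"result": "ran " + request["method"]}


def client_b_on_pulse(_message):
    # Step 7: try to dequeue a request id from the endpoint's list.
    if not redis.lists["queue:" + endpoint]:
        return
    request_id = redis.lists["queue:" + endpoint].popleft()
    # Step 8: get and delete the serialised request in one operation.
    raw = redis.get_and_delete("request:" + request_id)
    if raw is None:
        return  # already collected, cancelled, or expired
    # Step 9: send the request to the service and wait for its response.
    response = service(json.loads(raw))
    # Steps 10 and 11: store the response, then pulse the response channel.
    redis.values["response:" + request_id] = json.dumps(response)
    redis.publish("response:" + request_id)


def client_a_queue_and_wait(request):
    request_id = str(uuid.uuid4())  # Step 2: every request has a unique id.

    def on_response(_message):
        # Step 12: a pulse arrived, so read the response from the store.
        raw = redis.get_and_delete("response:" + request_id)
        if raw is not None:
            responses[request_id] = json.loads(raw)

    redis.subscribe("response:" + request_id, on_response)       # Step 3
    redis.values["request:" + request_id] = json.dumps(request)  # Step 4
    redis.lists["queue:" + endpoint].append(request_id)          # Step 5
    redis.publish("pulse:" + endpoint)                           # Step 6
    # In this synchronous sketch the response is already present here.
    return responses.pop(request_id)


redis.subscribe("pulse:" + endpoint, client_b_on_pulse)  # Step 1
answer = client_a_queue_and_wait({"method": "Ping"})
print(answer)
```

Note that Client A subscribes to the response channel before it publishes the request; doing it the other way round would leave a window in which the response pulse could be missed.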

## Cancellation support.

The Redis PRQ supports cancellation, even for collected requests. This is done by the RequestReceiverNode (i.e. the node connected to the Service) subscribing to the request cancellation channel and also polling for request cancellation.

## Dealing with minor network interruptions to Redis.

All operations against Redis are retried for up to 30s. This allows connections to Redis to go down briefly without impacting RPCs, even non-idempotent RPCs.
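A time-budgeted retry of this kind might look like the following Python sketch. The 30s budget comes from the text; the helper name `retry_for`, the fixed delay between attempts, and the choice of `ConnectionError` as the retryable failure are assumptions for illustration only.

```python
import time


def retry_for(operation, total_seconds=30.0, delay_seconds=0.5,
              clock=time.monotonic, sleep=time.sleep):
    """Call operation until it succeeds or the time budget is used up."""
    deadline = clock() + total_seconds
    while True:
        try:
            return operation()
        except ConnectionError:
            if clock() + delay_seconds > deadline:
                raise  # out of time: surface the last failure
            sleep(delay_seconds)


attempts = []


def flaky_set():
    """Hypothetical Redis write that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("redis unavailable")
    return "OK"


outcome = retry_for(flaky_set, total_seconds=30.0, delay_seconds=0.01)
print(outcome, len(attempts))
```

Because the retry happens around the Redis operation rather than around the RPC itself, the RPC is sent to the service at most once, which is why the retries are safe for non-idempotent calls.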

### Pub/Sub and Poll.

Since Pub/Sub does not have guaranteed delivery in Redis, in any place that we do Pub/Sub we must also have a form of polling. For example:
- When dequeuing work, not only are we subscribed, but when `Dequeue()` is called we also check for work on the queue anyway. (Note that `Dequeue()` returns every 30s if there is no work, and thus we have polling.)
- When waiting for a response, we are not only subscribed to the response channel, we also poll to see if the response has been sent back.
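The subscribe-and-poll pattern can be sketched as a wait loop that wakes either on a pulse or on a timer. This is an illustrative Python model only: `wait_for_value`, the `threading.Event` standing in for a Pub/Sub notification, and the interval values are assumptions, not Halibut's code.

```python
import threading
import time


def wait_for_value(read_value, pulse, poll_interval=30.0, timeout=120.0):
    """Wait for read_value() to return non-None, waking on a pulse or a poll tick."""
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()  # always check the store, pulse or no pulse
        if value is not None:
            return value
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("value never arrived")
        # Sleep until a pulse arrives or the poll interval elapses.
        pulse.wait(min(poll_interval, remaining))
        pulse.clear()


store = {}
pulse = threading.Event()


def writer_without_pulse():
    # The value is written but the publication is "lost": pulse is never set.
    store["response"] = "done"


threading.Timer(0.05, writer_without_pulse).start()
found = wait_for_value(lambda: store.get("response"), pulse,
                       poll_interval=0.02, timeout=2.0)
print(found)
```

Even though the pulse never fires in this run, the periodic check still finds the value, which is the guarantee polling adds on top of Pub/Sub.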

## Dealing with nodes that disappear mid request.

Either node could go offline at any time, including during execution of an RPC. For example:
- The node executing the RPC could go offline while the node with the Service connected is sending the request to the Service.
- The node sending the request to the Service could go offline.

To handle this case in a way that still allows large file transfers (i.e. requests that take a long time), we have a concept of "heart beats".

When executing an RPC, both nodes involved send heart beats to a unique channel keyed by the request ID AND the node's role in the RPC. For example:
- The node executing the RPC will pulse heart beats to a channel with a key such as `NodeSendingRequest:GUID`
- The node sending the request to the service will pulse heart beats to a channel with a key such as `NodeReceivingRequest:GUID`

Each node can now watch for heart beats from the other node. When heart beats stop arriving, it can assume the other node is offline and cancel/abandon the request.
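The watching side of the heart beat scheme reduces to tracking when the last beat was seen and comparing against a timeout. The Python sketch below uses explicit timestamps in seconds so it is deterministic; the class and function names are hypothetical, and the 60s timeout is just an example value.

```python
class HeartBeatWatcher:
    """Tracks the last heart beat seen from the other node in an RPC."""

    def __init__(self, timeout_seconds, now):
        self.timeout_seconds = timeout_seconds
        self.last_seen = now  # treat the start of watching as the first beat

    def on_heart_beat(self, now):
        self.last_seen = now

    def peer_is_offline(self, now):
        # The peer is presumed offline once no beat has arrived within the timeout.
        return now - self.last_seen > self.timeout_seconds


def heart_beat_channel(role, request_id):
    """Channel key built from the node's role and the request id."""
    return f"{role}:{request_id}"


watcher = HeartBeatWatcher(timeout_seconds=60, now=0)
watcher.on_heart_beat(now=15)
watcher.on_heart_beat(now=30)
still_online = not watcher.peer_is_offline(now=80)  # 50s since the last beat
gone_offline = watcher.peer_is_offline(now=100)     # 70s since the last beat
channel = heart_beat_channel("NodeSendingRequest", "1b4e28ba")
print(still_online, gone_offline, channel)
```

Keeping the timeout several times larger than the beat interval means a few missed publications do not cause a healthy request to be abandoned.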

## Dealing with Redis losing its data.

Since Redis can lose data at any time, the queue is able to detect data loss and cancel any in-flight requests when data loss occurs.
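The text does not describe the detection mechanism, so the following Python sketch shows one possible approach purely as an assumption: a random token is kept under a well-known key, each request remembers the token it saw when it started, and a missing or changed token later means the store was wiped. The key name and helper functions are hypothetical, and a plain dictionary stands in for Redis.

```python
import uuid

TOKEN_KEY = "data-loss-token"  # hypothetical key name


def get_or_create_token(store):
    """Return the current token, writing a fresh one if the key is missing.

    Against a real Redis this would need an atomic set-if-absent operation.
    """
    if TOKEN_KEY not in store:
        store[TOKEN_KEY] = str(uuid.uuid4())
    return store[TOKEN_KEY]


def data_was_lost(store, token_at_start):
    """True when the token seen at request start is no longer in the store."""
    return store.get(TOKEN_KEY) != token_at_start


store = {}
token = get_or_create_token(store)
before = data_was_lost(store, token)

store.clear()               # simulate Redis dropping all of its data
get_or_create_token(store)  # another node recreates the token
after = data_was_lost(store, token)
print(before, after)
```

An in-flight request that observes a different token can be cancelled, since its request or response data may have been among what was lost.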

## Message serialisation

Message serialisation is provided by re-using the serialiser Halibut uses for transferring requests/responses over the wire.

## Cleanup of old data in Redis.

All values in Redis have a TTL applied, so Redis will automatically clean up old keys if Halibut does not.

- Request message TTL: request pickup timeout + 2 minutes.
- Response TTL: default 20 minutes.
- Pending GUID list TTL: 1 day.
- Heartbeat rates: 15s; timeouts: sender 90s, processor 60s.
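As a worked example of these values, the Python sketch below computes the request message TTL for an assumed pickup timeout of 5 minutes and shows how many heart beat intervals fit into each timeout. The constant and function names are invented for the example; only the durations come from the list above.

```python
from datetime import timedelta

RESPONSE_TTL = timedelta(minutes=20)
PENDING_LIST_TTL = timedelta(days=1)
HEART_BEAT_INTERVAL = timedelta(seconds=15)
SENDER_TIMEOUT = timedelta(seconds=90)
PROCESSOR_TIMEOUT = timedelta(seconds=60)


def request_message_ttl(pickup_timeout):
    """Request TTL is the pickup timeout plus a 2 minute margin."""
    return pickup_timeout + timedelta(minutes=2)


# Assumed pickup timeout, chosen only for the example.
ttl = request_message_ttl(timedelta(minutes=5))

# Whole heart beat intervals that fit inside each timeout.
sender_beats = SENDER_TIMEOUT // HEART_BEAT_INTERVAL
processor_beats = PROCESSOR_TIMEOUT // HEART_BEAT_INTERVAL
print(ttl, sender_beats, processor_beats)
```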

### DataStream

DataStreams are not stored in the queue, instead an implementation of `IStoreDataStreamsForDistributedQueues` must be provided. It will be called with the DataStreams that are to be stored, and will be called again with the "husks" of a DataStream that needs to be re-hydrated. DataStreams have unique GUIDs which make it easier to find the data for re-hydration.

Subclassing DataStream is a useful technique for avoiding the storage of DataStream data when it is trivial to read the data from some known place. For example, a DataStream might be subclassed to hold the location of a file on disk that should be read when sending the data for the data stream. The Halibut serialiser has been updated to work with subclasses of DataStream: it ignores the subclass and sends just the DataStream across the wire. This makes it safe to subclass DataStream for efficient storage and have that work with both listening and polling clients.
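The store and re-hydrate round trip can be modelled as follows. This Python sketch is only an analogy for what an `IStoreDataStreamsForDistributedQueues` implementation does: the `DataStream` dataclass, the `store` and `rehydrate` method names, and the in-memory dictionary are all assumptions and do not reflect the real interface's members.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataStream:
    """Simplified model: an id plus optional data; no data means it is a husk."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    data: Optional[bytes] = None


class InMemoryDataStreamStore:
    """Hypothetical store keyed by the DataStream's unique id."""

    def __init__(self):
        self._blobs = {}

    def store(self, streams):
        # Called on the sending node with the streams that must be persisted.
        for stream in streams:
            self._blobs[stream.id] = stream.data

    def rehydrate(self, husks):
        # Called on the receiving node with husks that carry only an id.
        for husk in husks:
            husk.data = self._blobs.pop(husk.id)


data_stream_store = InMemoryDataStreamStore()
original = DataStream(data=b"payload")
data_stream_store.store([original])

husk = DataStream(id=original.id)  # what arrives after deserialisation
data_stream_store.rehydrate([husk])
print(husk.data)
```

A real implementation would write to storage shared by all nodes, since the node that stores the data and the node that re-hydrates it are different processes.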