feat: track allocations per fleet #4513
Conversation
Force-pushed from 5dea128 to 38ae676
/gcbrun

Build Failed 😭 Build Id: 0200e9ed-b8cb-4f9c-82a3-22a40ca1032e Status: FAILURE To get permission to view the Cloud Build view, join the agones-discuss Google Group.

/gcbrun
Build Succeeded 🥳 Build Id: 52edf034-8467-427b-90f9-765fd733d9c6 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
markmandel left a comment
Thanks for doing this -- sorry it took a bit to get to it.
We'll need some docs at https://agones.dev/site/docs/guides/metrics/ -- but also will need to be feature shortcoded -- https://agones.dev/site/docs/contribute/documentation-editing-contribution/
    gs.Status.State = agonesv1.GameServerStateAllocated
    ctrl.gsWatch.Modify(gs)

    require.Eventually(t, func() bool {
If you use https://pkg.go.dev/github.com/stretchr/testify@v1.11.1/require#EventuallyWithT you can check multiple assertions, and also get better error messages.
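For illustration, a minimal self-contained sketch of that pattern; `getAllocations` here is a hypothetical stand-in for whatever the real test polls:

```go
package example

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// getAllocations is a hypothetical stand-in for whatever the real test polls.
func getAllocations() (int64, error) { return 1, nil }

func TestAllocationsEventually(t *testing.T) {
	// Each tick runs every assertion against the *assert.CollectT; on
	// timeout the failure output lists each assertion that failed on the
	// last tick, instead of just "condition never satisfied".
	require.EventuallyWithT(t, func(c *assert.CollectT) {
		n, err := getAllocations()
		assert.NoError(c, err)
		assert.Equal(c, int64(1), n)
	}, 10*time.Second, 100*time.Millisecond)
}
```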
    ReservedReplicas int32 `json:"reservedReplicas"`
    // AllocatedReplicas are the number of Allocated GameServer replicas
    AllocatedReplicas int32 `json:"allocatedReplicas"`
    // Allocations is a counter of the number of allocations observed.
New docs will need feature shortcodes please:
https://agones.dev/site/docs/contribute/documentation-editing-contribution/
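For reference, a sketch of what the feature shortcode from that guide looks like wrapped around the new docs; the version number is a placeholder, not the actual release this lands in:

```
{{% feature publishVersion="X.Y.0" %}}
New documentation for the per-Fleet allocation counter goes here.
{{% /feature %}}
```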
    },
    })

    _, _ = gsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
You'll want to sync GameServers here, since you are now watching their events:
I thought about that. I decided not to in the end as it does not stop the controller from working. We do not handle Add on the EventHandler (just Update), so the sync would basically block for no reason. Will gladly add it if you still want though.
Yeah, I'd recommend it - I've seen some weirdness when not syncing -- it'll probably mean nothing much, but also means that if other people edit this later on, or do other controller things they aren't left wondering "when do I sync, or not sync" - better to always sync 👍🏻
Fair. My general rule is: if it's touched anywhere in the processNextItem loop, it gets synced. But I will add it in.
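A minimal sketch of the agreed change, assuming a typical client-go controller `Run` method; the `Controller` shape and the `gameServerSynced` field name are illustrative:

```go
package example

import (
	"context"
	"errors"

	"k8s.io/client-go/tools/cache"
)

// Controller holds the GameServer informer's HasSynced func; names here are
// illustrative, not the PR's actual structure.
type Controller struct {
	gameServerSynced cache.InformerSynced
}

// Run blocks until the GameServer informer cache is primed before starting
// workers, since the controller now consumes GameServer events.
func (c *Controller) Run(ctx context.Context) error {
	if !cache.WaitForCacheSync(ctx.Done(), c.gameServerSynced) {
		return errors.New("failed to wait for GameServer cache to sync")
	}
	// ... start the processNextItem worker loop ...
	return nil
}
```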
    defer cancel()

    c.allocsMu.Lock()
    defer c.allocsMu.Unlock()
Nice one on making sure it gets processed on shutdown.
Probably doesn't make a huge difference, but holding the whole mutex over the network calls may cause some interesting long-term locking over shutdown. Might be worth taking a snapshot, or at the very least moving to an RWMutex, to avoid some potential contention.
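A sketch of the snapshot idea, assuming the `allocsMu` and `allocs` fields visible in the diff (the counter's value type is not shown, so `int64` is an assumption):

```go
package example

import (
	"context"
	"sync"
)

// Controller mirrors the fields visible in the diff; everything else here
// is illustrative.
type Controller struct {
	allocsMu sync.Mutex
	allocs   map[string]int64
}

// flushOnShutdown swaps the counter map out while holding the lock, then
// performs the slow network calls with the lock released, so in-flight
// allocation events are never blocked behind the shutdown flush.
func (c *Controller) flushOnShutdown(ctx context.Context, update func(ctx context.Context, key string, n int64) error) {
	c.allocsMu.Lock()
	snapshot := c.allocs
	c.allocs = make(map[string]int64, len(snapshot))
	c.allocsMu.Unlock()

	for key, n := range snapshot {
		// update stands in for the UpdateStatus call; errors are ignored
		// here only to keep the sketch short.
		_ = update(ctx, key, n)
	}
}
```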
    count, ok := c.allocs[key]
    if !ok {
        c.allocs[key] = 1
        return
    }
    c.allocs[key] = count + 1
Suggested change, replacing the block above with a single line:

    c.allocs[key]++
Maps return the zero value for a missing key, so the increment alone is enough; just a suggestion.
        }
    }

    fCopy.Status.Allocations += c.getAllocations(fleet.ObjectMeta.Namespace, fCopy.ObjectMeta.Name)
If UpdateStatus fails, the worker queue re-enqueues the fleet, but on retry the counter is already 0 — those allocations are permanently lost.
My suggestion would be to capture the delta, attempt the update, and only zero out the counter on success
Good catch, had not seen that failure path.
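A sketch of that fix, reusing the same illustrative fields as the earlier sketch; `takeDelta`, `restoreDelta`, and `updateStatus` are hypothetical helpers, not the PR's actual API:

```go
package example

import (
	"context"
	"sync"
)

// Controller reuses the illustrative fields from the earlier sketch.
type Controller struct {
	allocsMu sync.Mutex
	allocs   map[string]int64
}

// takeDelta captures and zeroes the counter for key in one locked step.
func (c *Controller) takeDelta(key string) int64 {
	c.allocsMu.Lock()
	defer c.allocsMu.Unlock()
	d := c.allocs[key]
	delete(c.allocs, key)
	return d
}

// restoreDelta merges a failed delta back so the retry still sees it.
func (c *Controller) restoreDelta(key string, d int64) {
	c.allocsMu.Lock()
	defer c.allocsMu.Unlock()
	c.allocs[key] += d
}

// updateStatus is a hypothetical stand-in for the Fleets().UpdateStatus call.
func (c *Controller) updateStatus(ctx context.Context, key string, delta int64) error {
	return nil
}

// syncFleet zeroes the counter only on success: if UpdateStatus fails, the
// delta is restored and the error propagates so the workqueue retries with
// the allocations still counted.
func (c *Controller) syncFleet(ctx context.Context, key string) error {
	delta := c.takeDelta(key)
	if err := c.updateStatus(ctx, key, delta); err != nil {
		c.restoreDelta(key, delta)
		return err
	}
	return nil
}
```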
Signed-off-by: Nicholas Wiersma <nick@wiersma.co.za>
/gcbrun

Build Succeeded 🥳 Build Id: 6178bfd6-30e7-4060-ba43-0e3864f400b8 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Signed-off-by: Nicholas Wiersma <nick@wiersma.co.za>
/gcbrun
Build Succeeded 🥳 Build Id: c09603eb-c5f9-4bac-84cf-41668d315953 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
What type of PR is this?
What this PR does / Why we need it:
This PR adds allocation tracking at the Fleet level, to provide real allocation metrics to the Fleet Autoscaler. This gives custom autoscaling better metrics to base decisions on, while also providing a good metric for the system.
Which issue(s) this PR fixes:
Closes #4452
Special notes for your reviewer:
Initially I thought to use an `RWMutex` and `atomic.Int64`, but the locking logic got really verbose, and I don't think, even in active clusters, that contention on a `Mutex` is likely to be a real issue, given how short the actual lock time will be. If this proves to no longer be true, it is simple to change.

It also occurred to me that queuing the Fleet when an Allocation is observed is likely not needed, as this will always result in a change to either AllocatedReplicas, ReadyReplicas, or Replicas on the `GameServerSet`, which will itself queue the `Fleet`. I will run some manual tests to prove this.