Synchronous config refresh #31
Conversation
Force-pushed from 6a06c91 to 73ca723
pkg/config/config.go
```go
select {
case c.loadQueue <- struct{}{}:
	defer func() {
		<-c.loadQueue
	}()
case <-ctx.Done():
	return ctx.Err()
}
```
Are you attempting to reinvent sync.Mutex with an optional channel write/read to ensure that there are no concurrent loads? This is confusing; I would probably just replace this channel select write/deferred read with TryLock() / defer Unlock().
Yes, the idea is to ensure no concurrent loads to preserve the behavior with urls[index], but sync.Mutex doesn't take context.Context into consideration, so if another load is in progress and the context is canceled, using a mutex would delay the caller returning with a canceled context until the previous load completes.
Sure, but the context on both sides is a long-running controller context, not a client request context that is likely to be cancelled on a timeout. I'd say just try to get the lock, and if it can't be taken, return an error and let the caller retry.
Also note that if Rancher doesn't pass in a Wait and handles 100% of reloading, there shouldn't ever be multiple overlapping calls to this function in the first place.
Actually, we do call Refresh through a norman action with a client request context when the user wants to refresh outside the usual schedule.
But I've still changed it to TryLock, since a collision is unlikely even with on-demand refreshes. We might also change how this is handled anyway by having the norman action queue the object that triggers the refresh, to avoid some other problems.
Requested a couple changes.

I suspect this was done so that if you manage to get the data from the remote server once, you wouldn't want to go back to the fallback since it could be out of date. It's still a bit broken: when the Rancher pod fails, this will be reset in the new pod and the local fallback could still end up being used.
```diff
-	if index, err := c.loadConfig(ctx, subKey, channelServerVersion, appName, urls...); err != nil {
-		logrus.Fatalf("Failed to load initial config from %s: %v", urls[index].URL(), err)
+	if err := c.LoadConfig(ctx); err != nil {
+		logrus.Fatalf("Failed to load initial config for %s: %v", subKey, err)
```
I've removed the index from the return value since urls is now altered inside LoadConfig. But we won't print the URL here; hopefully that's not a big deal. If it was useful, I'll go back to returning the index for logging purposes.
I also changed all previous callers accordingly.

cc @jiaqiluo @kinarashah for additional review
Pull request overview
This PR introduces a public synchronous LoadConfig method to allow Rancher to know when configuration refreshes complete. Previously, the only way to trigger refreshes was through the Wait interface, which provided no feedback on completion.
Changes:
- Converted private `loadConfig` to a public `LoadConfig` method with synchronous semantics
- Added configuration parameters as struct fields (`subKey`, `channelServerVersion`, `appName`, `urls`) to support the new public API
- Introduced `loadMutex` to prevent concurrent configuration loads
- Made the `wait` parameter optional (nil check added) to support usage without automatic periodic refreshes
pkg/config/config.go
```go
locked := c.loadMutex.TryLock()
if !locked {
	return errors.New("configuration is already being loaded")
}
```
Using TryLock() and returning an error when a load is already in progress creates a race condition. If a caller wants to ensure they have the latest configuration, they may call LoadConfig() but receive an error because a load is already in progress. The caller has no way to wait for the in-progress load to complete and may proceed with stale configuration.
Consider either:
- Using `Lock()` instead of `TryLock()` to block until the load completes, ensuring callers always get up-to-date config
- Providing a separate method that indicates if a load is in progress, allowing callers to handle this case appropriately
Suggested change:
```diff
-	locked := c.loadMutex.TryLock()
-	if !locked {
-		return errors.New("configuration is already being loaded")
-	}
+	c.loadMutex.Lock()
```
This was discussed above: callers should retry or abort if they get an error here. Lock() would not handle context timeouts.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Rancher uses channelserver as a library to fetch KDM. It would be useful for Rancher if it could know when a refresh completed.
Currently, channelserver doesn't expose any way to know this, only a way to trigger refreshes (through `Wait`). This PR proposes a public function to allow synchronous refreshes.
Potentially needed for rancher/rancher#53204