
Synchronous config refresh #31

Merged
pedromfcarvalho merged 8 commits into rancher:master from pedromfcarvalho:load-sync
Jan 14, 2026

Conversation

Contributor

@pedromfcarvalho pedromfcarvalho commented Nov 21, 2025

Rancher uses channelserver as a library to fetch KDM. It would be useful for Rancher if it could know when a refresh completed.

Currently, channelserver doesn't expose any way to know this; it only lets callers trigger refreshes (through Wait).

This PR proposes a public function to allow synchronous refreshes.

Potentially needed for rancher/rancher#53204

Comment on lines 91 to 98

```go
select {
case c.loadQueue <- struct{}{}:
	defer func() {
		<-c.loadQueue
	}()
case <-ctx.Done():
	return ctx.Err()
}
```
Member

Are you attempting to reinvent sync.Mutex with an optional channel write/read to ensure that there are no concurrent loads? This is kind of confusing; I would probably just replace this channel select write/deferred read with TryLock() / defer Unlock().

Contributor Author

Yes, the idea is to ensure no concurrent loads, to preserve the behavior with urls[index]. But sync.Mutex doesn't take context.Context into consideration, so if another load is in progress and the context is canceled, a mutex would delay the caller returning with a canceled context until the previous load completes.

Member

Sure, but the context on both sides is a long-running controller context, not a client request context that is likely to be cancelled on a timeout. I'd say just try to get the lock, and if it can't be taken, return an error and let the caller retry.

Also note that if rancher doesn't pass in a Wait and handles 100% of reloading, there shouldn't ever be multiple overlapping calls to this function in the first place.

Contributor Author

@pedromfcarvalho pedromfcarvalho Jan 12, 2026

Actually, we do call Refresh through a norman action with a client request context when the user wants to refresh outside the usual schedule.

But I've still changed to TryLock since a collision is unlikely even with on-demand refreshes. We might also change how this is handled anyway by just having the norman action queue the object that triggers the refresh, to avoid some other problems.

@brandond
Member

brandond commented Jan 6, 2026

Requested a couple changes. The whole urls[index] thing creates a lot of extra noise in here, and it honestly seems a little broken since the list is mutated every time the config is loaded - so even if you pass in more than one URL, all the URLs after the first successful one are dropped and never used. And I don't think there are actually any cases where callers supply more than a single URL anyway.

@pedromfcarvalho
Contributor Author

pedromfcarvalho commented Jan 6, 2026

The urls[index] part is confusing, but Rancher does use it: it passes both the remote URL and the fallback "url", which is a path in the local filesystem.

I suspect this was done so that once you manage to get the data from the remote server, you never go back to the fallback, since it could be out of date. It's still a bit broken, because when the Rancher pod fails this state is reset in the new pod, and the local fallback could still end up being used.

@pedromfcarvalho pedromfcarvalho changed the title [DNM] Synchronous config refresh Synchronous config refresh Jan 12, 2026
@pedromfcarvalho pedromfcarvalho marked this pull request as ready for review January 12, 2026 16:04
```diff
-	if index, err := c.loadConfig(ctx, subKey, channelServerVersion, appName, urls...); err != nil {
-		logrus.Fatalf("Failed to load initial config from %s: %v", urls[index].URL(), err)
+	if err := c.LoadConfig(ctx); err != nil {
+		logrus.Fatalf("Failed to load initial config for %s: %v", subKey, err)
```
Contributor Author

I've removed the index from the return value since urls is now altered inside LoadConfig. But we no longer print the URL here; hopefully that's not a big deal. If it was useful, I'll go back to returning the index for logging purposes.

@pedromfcarvalho
Contributor Author

I also changed all previous fmt.Errorf to use %w instead of %v.

@brandond
Member

cc @jiaqiluo @kinarashah for additional review

Copilot AI left a comment

Pull request overview

This PR introduces a public synchronous LoadConfig method to allow Rancher to know when configuration refreshes complete. Previously, the only way to trigger refreshes was through the Wait interface, which provided no feedback on completion.

Changes:

  • Converted private loadConfig to public LoadConfig method with synchronous semantics
  • Added configuration parameters as struct fields (subKey, channelServerVersion, appName, urls) to support the new public API
  • Introduced loadMutex to prevent concurrent configuration loads
  • Made wait parameter optional (nil check added) to support usage without automatic periodic refreshes


Comment on lines 97 to 100

```go
locked := c.loadMutex.TryLock()
if !locked {
	return errors.New("configuration is already being loaded")
}
```
Copilot AI Jan 12, 2026

Using TryLock() and returning an error when a load is already in progress creates a race condition vulnerability. If a caller wants to ensure they have the latest configuration, they may call LoadConfig() but receive an error even though a load is in progress. The caller has no way to wait for the in-progress load to complete and may proceed with stale configuration.

Consider either:

  1. Using Lock() instead of TryLock() to block until the load completes, ensuring callers always get up-to-date config
  2. Providing a separate method that indicates if a load is in progress, allowing callers to handle this case appropriately
Suggested change

```diff
-	locked := c.loadMutex.TryLock()
-	if !locked {
-		return errors.New("configuration is already being loaded")
-	}
+	c.loadMutex.Lock()
```

Contributor Author

This was discussed above: callers should retry or abort if they get an error here. Lock() would not handle context timeouts.

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Member

@jiaqiluo jiaqiluo left a comment

LGTM

@pedromfcarvalho pedromfcarvalho merged commit b93c8ff into rancher:master Jan 14, 2026
1 check passed


5 participants