xrootd: fix new daemon being terminated immediately after restart on log-level change #3160
Open
brianaydemir wants to merge 6 commits into PelicanPlatform:main from
…level change

Fixes PelicanPlatform#3156 (PelicanPlatform/pelican).

Root cause: running `pelican server set-logging-level` with an XRootD-specific parameter (e.g., `Logging.Cache.PssSetOpt`) triggers a full XRootD restart. `logging/level_manager.go` applies the change via `param.Set`, which fires all registered `param` callbacks. The `"xrootd-logging"` callback detects the changed field and calls `RestartXrootd`.

The problem is in how `handleXrootdLoggingChange` creates its context:

    ctx, cancel := context.WithTimeout(context.Background(), param.Xrootd_ShutdownTimeout.GetDuration()+time.Minute)
    defer cancel()
    _, err := restartXrootdFn(ctx, nil)

This short-lived timeout context was passed all the way through `RestartXrootd` to `LaunchDaemons` to `daemon.LaunchDaemons` to `DaemonLauncher.Launch`, where it reached `exec.CommandContext` (and `cap.Launcher.Launch` on Linux). When `handleXrootdLoggingChange` returned and `defer cancel()` fired, the Go runtime sent SIGTERM to the freshly-started daemon via the cancelled context. The result was a second signal to a different PID seconds after the restart appeared to succeed.

Fix: store the long-lived server context in `restartInfo` (passed in from the launchers, which hold the true process-lifetime context). In `RestartXrootd`, use `info.ctx` for `LaunchDaemons` so the new daemon is bound to the server lifetime. The short-lived timeout context passed to `RestartXrootd` is now used only for the shutdown/wait phase, where a timeout is appropriate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Things I learn: The GitHub Copilot CLI is happy to create a PR, but can't set itself as the author of the PR. That said, I did look at all of the changes on this PR and go through the exercise of convincing myself that they're at least plausibly correct. (I make no claim to fully understanding "contexts" in Go….)
…og-level change

Addresses PelicanPlatform#3158 (PelicanPlatform/pelican).

When `pelican server set-logging-level` changes an XRootD-specific parameter, the server must restart XRootD to apply the new log level. Previously `RestartXrootd` sent SIGTERM to the old processes immediately, cutting off any in-flight client transfers without notice and without telling the Director to stop redirecting new requests to this server.

This mirrors the same issue the Web UI Restart button already handles via `handleGracefulShutdown` in `launchers/launcher.go`:

1. Set health status to `StatusShuttingDown`
2. Advertise to the Director (so it stops redirecting to this server)
3. Sleep for `Xrootd_ShutdownTimeout` (drain period for ongoing transfers)

Fix: add a `preRestartHook func(ctx context.Context)` field to `restartInfo`. `RestartXrootd` calls each hook (if set) before sending any signals. The `launchers` package injects closures at `StoreRestartInfo` call sites that call `handleGracefulShutdown` with the appropriate server and module type, the same path used by the Web UI Restart button. No new imports in the `xrootd` package are needed.

The hook uses `info.ctx` (the long-lived server context stored in `restartInfo` by the fix for PelicanPlatform#3156) so it is not affected by the short-lived timeout context created in `handleXrootdLoggingChange`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
context.Context must be the first parameter per Go conventions (triggers the revive/context-as-argument linter rule).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Who knew I wasn't the only one with idiosyncratic preferences regarding white space? =P :fingers-crossed:
Our preRestartHook calls handleGracefulShutdown, which sleeps for Xrootd_ShutdownTimeout (default 1 minute) before sending SIGTERM. E2e tests that wait <=15s for PID changes therefore timed out. Setting the timeout to 0 in fed test setup causes time.Sleep(0) to return immediately, preserving the advertise-before-restart behavior without adding meaningful latency in tests.

Fixes test failures in TestCLILoggingLevelChanges and TestXRootDRestart.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After a restart, the preRestartHook advertises StatusShuttingDown to the Director, which removes the server from the Director's routing table. The server was never re-advertised after the restart completed, leaving the Director unaware the server was back, so clients received 404 errors until the next periodic advertisement cycle.

Fix: add a postRestartHook (symmetric to preRestartHook) that is called after LaunchDaemons succeeds and metrics StatusOK is set. In origin_serve.go and cache_serve.go the hook calls launcher_utils.Advertise immediately, so the Director learns the server is available again before RestartXrootd returns.

This closes the window introduced by the PelicanPlatform#3158 fix and resolves the TestXRootDRestart e2e test failure where a post-restart DoGet returned 404.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
brianaydemir commented Feb 26, 2026

    // wait for in-flight transfers to drain) before sending signals.
    for _, info := range storedInfos {
        if info.preRestartHook != nil {
            info.preRestartHook(info.ctx)
Do we want the long-lived context from info, or the short-lived one that was passed to RestartXrootd?
brianaydemir commented Feb 26, 2026

    // clients can resume routing requests to this server immediately).
    for _, info := range updatedInfos {
        if info.postRestartHook != nil {
            info.postRestartHook(info.ctx)
Do we want the long-lived context from info, or the short-lived one that was passed to RestartXrootd?
brianaydemir commented Feb 26, 2026

Comment on lines +178 to +180

    // In tests, skip the drain-wait period before XRootD restarts so tests
    // don't time out waiting for PIDs to change.
    require.NoError(t, param.Set(param.Xrootd_ShutdownTimeout.GetName(), 0))
This sounds plausible to me, but will this interfere with tests outside of launchers and xrootd?
Closes #3156.
Addresses #3158.
Fix for #3156: new daemon terminated immediately after restart

Running `pelican server set-logging-level` with an XRootD-specific parameter (e.g., `Logging.Cache.PssSetOpt`) triggers a full XRootD restart via the `"xrootd-logging"` `param` callback. The problem was in how `handleXrootdLoggingChange` created its restart context: it derived a timeout context from `context.Background()` (with a deadline of `Xrootd_ShutdownTimeout` plus one minute), deferred `cancel()`, and passed that context to the restart function.

This short-lived timeout context flowed all the way through `RestartXrootd` → `LaunchDaemons` → `daemon.LaunchDaemons` → `DaemonLauncher.Launch`, where it was passed to `exec.CommandContext` (and `cap.Launcher.Launch` on Linux). When `handleXrootdLoggingChange` returned and `defer cancel()` fired, the Go runtime sent SIGTERM to the freshly-started daemon. The result was a second signal to a different PID seconds after the restart appeared to succeed, leaving XRootD dead.

Fix: store the long-lived server context in `restartInfo` at launch time. In `RestartXrootd`, use `info.ctx` for `LaunchDaemons` so the new daemon is bound to the server lifetime. The short-lived timeout context is now used only for the shutdown/wait phase.

Fix for #3158: no graceful drain before restart

Previously `RestartXrootd` sent SIGTERM to old processes immediately, cutting off in-flight client transfers without notice and without advertising the impending downtime to the Director.

The Web UI Restart button already handles this via `handleGracefulShutdown`:

1. Set health status to `StatusShuttingDown`
2. Advertise to the Director (so it stops redirecting to this server)
3. Sleep for `Xrootd_ShutdownTimeout` (drain period for ongoing transfers)

Fix: add a `preRestartHook func(ctx context.Context)` field to `restartInfo`. `RestartXrootd` calls each hook before sending any signals. The `launchers` package injects closures that call `handleGracefulShutdown` with the appropriate server, the same code path used by the Web UI Restart button. No new imports in the `xrootd` package are needed.

Fix for post-restart 404: server not re-advertised to Director after restart

The `preRestartHook` explicitly tells the Director the server is shutting down (via `launcher_utils.Advertise` with `StatusShuttingDown`), which causes the Director to remove the server from its routing table. Without a corresponding re-advertisement after the restart, the Director remained unaware the server was back, so clients received 404 errors until the next periodic advertisement cycle fired.

Fix: add a symmetric `postRestartHook func(ctx context.Context)` field to `restartInfo`. `RestartXrootd` calls each hook after metrics `StatusOK` is set and all new daemons are running. The `launchers` package injects closures that call `launcher_utils.Advertise` immediately, so the Director knows the server is available again before `RestartXrootd` returns.

Changes

- `xrootd/restart.go`: add `ctx`, `preRestartHook`, and `postRestartHook` fields to `restartInfo`; update `StoreRestartInfo`; use `info.ctx` for `LaunchDaemons`; call `preRestartHook` before SIGTERM loop; call `postRestartHook` after `StatusOK` is set
- `xrootd/restart_windows.go`: update stub signature
- `launchers/cache_serve.go`: pass `ctx`, `preRestartHook`, and `postRestartHook` to `StoreRestartInfo`
- `launchers/origin_serve.go`: same for origin
- `xrootd/restart_test.go`: update call sites
- `fed_test_utils/fed.go`: set `Xrootd_ShutdownTimeout` to 0 in tests to prevent 1-minute drain sleep

Analysis and implementation by GitHub Copilot (`copilot-swe-agent`).
Formatting fixes by brianaydemir.