Skip to content

xrootd: fix new daemon being terminated immediately after restart on log-level change#3160

Open
brianaydemir wants to merge 6 commits intoPelicanPlatform:mainfrom
brianaydemir:issue-3156/fix-double-xrootd-kill
Open

xrootd: fix new daemon being terminated immediately after restart on log-level change#3160
brianaydemir wants to merge 6 commits intoPelicanPlatform:mainfrom
brianaydemir:issue-3156/fix-double-xrootd-kill

Conversation

@brianaydemir
Copy link
Contributor

@brianaydemir brianaydemir commented Feb 25, 2026

Closes #3156.
Addresses #3158.

Fix for #3156 — new daemon terminated immediately after restart

Running pelican server set-logging-level with an XRootD-specific parameter (e.g., Logging.Cache.PssSetOpt) triggers a full XRootD restart via the "xrootd-logging" param callback. The problem was in how handleXrootdLoggingChange created its restart context:

ctx, cancel := context.WithTimeout(context.Background(),
    param.Xrootd_ShutdownTimeout.GetDuration()+time.Minute)
defer cancel()
_, err := restartXrootdFn(ctx, nil)

This short-lived timeout context flowed all the way through RestartXrootdLaunchDaemonsdaemon.LaunchDaemonsDaemonLauncher.Launch, where it was passed to exec.CommandContext (and cap.Launcher.Launch on Linux). When handleXrootdLoggingChange returned and defer cancel() fired, the Go runtime sent SIGTERM to the freshly-started daemon. The result was a second signal to a different PID seconds after the restart appeared to succeed, leaving XRootD dead.

Fix: store the long-lived server context in restartInfo at launch time. In RestartXrootd, use info.ctx for LaunchDaemons so the new daemon is bound to the server lifetime. The short-lived timeout context is now used only for the shutdown/wait phase.

Fix for #3158 — no graceful drain before restart

Previously RestartXrootd sent SIGTERM to old processes immediately, cutting off in-flight client transfers without notice and without advertising the impending downtime to the Director.

The Web UI Restart button already handles this via handleGracefulShutdown:

  1. Set health status to StatusShuttingDown
  2. Advertise to the Director (stops new redirects to this server)
  3. Sleep for Xrootd_ShutdownTimeout (drain period for ongoing transfers)

Fix: add a preRestartHook func(ctx context.Context) field to restartInfo. RestartXrootd calls each hook before sending any signals. The launchers package injects closures that call handleGracefulShutdown with the appropriate server — the same code path used by the Web UI Restart button. No new imports in the xrootd package are needed.

Fix for post-restart 404 — server not re-advertised to Director after restart

The preRestartHook explicitly tells the Director the server is shutting down (via launcher_utils.Advertise with StatusShuttingDown), which causes the Director to remove the server from its routing table. Without a corresponding re-advertisement after the restart, the Director remained unaware the server was back — causing clients to receive 404 errors until the next periodic advertisement cycle fired.

Fix: add a symmetric postRestartHook func(ctx context.Context) field to restartInfo. RestartXrootd calls each hook after metrics StatusOK is set and all new daemons are running. The launchers package injects closures that call launcher_utils.Advertise immediately, so the Director knows the server is available again before RestartXrootd returns.

Changes

  • xrootd/restart.go — add ctx, preRestartHook, and postRestartHook fields to restartInfo;
    update StoreRestartInfo; use info.ctx for LaunchDaemons; call
    preRestartHook before SIGTERM loop; call postRestartHook after StatusOK is set
  • xrootd/restart_windows.go — update stub signature
  • launchers/cache_serve.go — pass ctx, preRestartHook, and postRestartHook to StoreRestartInfo
  • launchers/origin_serve.go — same for origin
  • xrootd/restart_test.go — update call sites
  • fed_test_utils/fed.go — set Xrootd_ShutdownTimeout to 0 in tests to prevent 1-minute drain sleep

Analysis and implementation by GitHub Copilot (copilot-swe-agent).

Formatting fixes by brianaydemir.

…level change

Fixes PelicanPlatform#3156 (PelicanPlatform/pelican).

Root cause: running `pelican server set-logging-level` with an XRootD-specific
parameter (e.g., `Logging.Cache.PssSetOpt`) triggers a full XRootD restart.
`logging/level_manager.go` applies the change via `param.Set`, which fires all
registered `param` callbacks. The `"xrootd-logging"` callback detects the
changed field and calls `RestartXrootd`.

The problem is in how `handleXrootdLoggingChange` creates its context:

    ctx, cancel := context.WithTimeout(context.Background(),
        param.Xrootd_ShutdownTimeout.GetDuration()+time.Minute)
    defer cancel()
    _, err := restartXrootdFn(ctx, nil)

This short-lived timeout context was passed all the way through `RestartXrootd`
to `LaunchDaemons` to `daemon.LaunchDaemons` to `DaemonLauncher.Launch`, where
it reached `exec.CommandContext` (and `cap.Launcher.Launch` on Linux). When
`handleXrootdLoggingChange` returned and `defer cancel()` fired, the Go runtime
sent SIGTERM to the freshly-started daemon via the cancelled context. The result
was a second signal to a different PID seconds after the restart appeared to
succeed.

Fix: store the long-lived server context in `restartInfo` (passed in from the
launchers, which hold the true process-lifetime context). In `RestartXrootd`,
use `info.ctx` for `LaunchDaemons` so the new daemon is bound to the server
lifetime. The short-lived timeout context passed to `RestartXrootd` is now used
only for the shutdown/wait phase, where a timeout is appropriate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@brianaydemir brianaydemir added this to the v7.26 milestone Feb 25, 2026
@brianaydemir brianaydemir added bug Something isn't working cache Issue relating to the cache component origin Issue relating to the origin component configuration labels Feb 25, 2026
@brianaydemir brianaydemir self-assigned this Feb 25, 2026
@brianaydemir
Copy link
Contributor Author

brianaydemir commented Feb 25, 2026

Things I learn: The GitHub Copilot CLI is happy to create a PR, but can't set itself as the author of the PR.

That said, I did look at all of the changes on this PR and go through the exercise of convincing myself that they're at least plausibly correct. (I make no claim to fully understanding "contexts" in Go….)

…og-level change

Addresses PelicanPlatform#3158 (PelicanPlatform/pelican).

When `pelican server set-logging-level` changes an XRootD-specific
parameter, the server must restart XRootD to apply the new log level.
Previously `RestartXrootd` sent SIGTERM to the old processes immediately,
cutting off any in-flight client transfers without notice and without
telling the Director to stop redirecting new requests to this server.

This mirrors the same issue the Web UI Restart button already handles
via `handleGracefulShutdown` in `launchers/launcher.go`:
1. Set health status to `StatusShuttingDown`
2. Advertise to the Director (so it stops redirecting to this server)
3. Sleep for `Xrootd_ShutdownTimeout` (drain period for ongoing transfers)

Fix: add a `preRestartHook func(ctx context.Context)` field to
`restartInfo`. `RestartXrootd` calls each hook (if set) before sending
any signals. The `launchers` package injects closures at
`StoreRestartInfo` call sites that call `handleGracefulShutdown` with
the appropriate server and module type — the same path used by the Web
UI Restart button. No new imports in the `xrootd` package are needed.

The hook uses `info.ctx` (the long-lived server context stored in
`restartInfo` by the fix for PelicanPlatform#3156) so it is not affected by the
short-lived timeout context created in `handleXrootdLoggingChange`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot bot and others added 2 commits February 25, 2026 19:05
context.Context must be the first parameter per Go conventions
(triggers the revive/context-as-argument linter rule).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Who knew I wasn't the only one with idiosyncratic preferences
regarding white space? =P :fingers-crossed:
@brianaydemir brianaydemir modified the milestones: v7.26, v7.25 Feb 26, 2026
Copilot bot added 2 commits February 25, 2026 22:17
Our preRestartHook calls handleGracefulShutdown, which sleeps for
Xrootd_ShutdownTimeout (default 1 minute) before sending SIGTERM.
E2e tests that wait <=15s for PID changes therefore timed out.

Setting the timeout to 0 in fed test setup causes time.Sleep(0) to
return immediately, preserving the advertise-before-restart behavior
without adding meaningful latency in tests.

Fixes test failures in TestCLILoggingLevelChanges and TestXRootDRestart.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After a restart, the preRestartHook advertises StatusShuttingDown to the
Director, which removes the server from the Director's routing table. The
server was never re-advertised after the restart completed, leaving the
Director unaware the server was back — causing clients to receive 404 errors
until the next periodic advertisement cycle.

Fix: add a postRestartHook (symmetric to preRestartHook) that is called after
LaunchDaemons succeeds and metrics StatusOK is set. In origin_serve.go and
cache_serve.go the hook calls launcher_utils.Advertise immediately, so the
Director learns the server is available again before RestartXrootd returns.

This closes the window introduced by the PelicanPlatform#3158 fix and resolves the
TestXRootDRestart e2e test failure where a post-restart DoGet returned 404.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
// wait for in-flight transfers to drain) before sending signals.
for _, info := range storedInfos {
if info.preRestartHook != nil {
info.preRestartHook(info.ctx)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the long-lived context from info, or the short-lived one that was passed to RestartXrootd?

// clients can resume routing requests to this server immediately).
for _, info := range updatedInfos {
if info.postRestartHook != nil {
info.postRestartHook(info.ctx)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the long-lived context from info, or the short-lived one that was passed to RestartXrootd?

Comment on lines +178 to +180
// In tests, skip the drain-wait period before XRootD restarts so tests
// don't time out waiting for PIDs to change.
require.NoError(t, param.Set(param.Xrootd_ShutdownTimeout.GetName(), 0))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds plausible to me, but will this interfere with tests outside of launchers and xrootd?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cache Issue relating to the cache component configuration origin Issue relating to the origin component

Projects

None yet

2 participants