Addition of a new SSH-based backend for the origin #3077
bbockelm wants to merge 16 commits into PelicanPlatform:main
Permits the origin to launch a helper over SSH which connects back and allows the origin to serve out the helper's filesystem.
No real stress test of the code. Still need to try password auth via separate login.
…it to complete in <1s to succeed)
origin/advertise.go
This logic is inconsistent with origin_serve/handlers.go.
Here, we always advertise the data URL as /api/v1.0/origin/data. In origin_serve, we register handlers for that route only if the director is also enabled for the server.
If the federation's director and the origin are running as separate services, while a naïve curl to the origin works as expected
[root@dev app]# curl https://origin-0:8444/public/data/0.0
0.0.28177
the director sends the client to a non-working endpoint:
[root@dev app]# ./pelican object get --debug --direct pelican://director:8444/public/data/0.0 asdf
...
DEBUG[2026-02-16T21:13:53Z] Trying the object servers: [https://origin-0:8444/api/v1.0/origin/data/public/data/0.0]
...
DEBUG[2026-02-16T21:13:53Z] Failed to download from https://origin-0:8444/api/v1.0/origin/data/public/data/0.0 : request failed (HTTP status 404): 404 page not found: Specification.FileNotFound Error: Error code 5011: server returned 404 Not Found job=019c684d-782f-764b-a861-9c2bee9ec718 url="https://origin-0:8444/api/v1.0/origin/data/public/data/0.0"
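One way to keep the two code paths from drifting apart is to have both the advertisement and the handler registration ask the same function for the data prefix. The sketch below is only an illustration of that idea: `dataPathPrefix` and `advertisedDataURL` are hypothetical names, the boolean stands in for whatever condition origin_serve actually uses, and the real fix may just as well be to register the route unconditionally.

```go
package main

import (
	"fmt"
	"strings"
)

// originDataPrefix is the route quoted in the comment above.
const originDataPrefix = "/api/v1.0/origin/data"

// dataPathPrefix is a hypothetical helper: the single place that decides
// under which prefix the origin serves objects. Both the advertise code and
// the handler registration would call it.
func dataPathPrefix(prefixedRouteRegistered bool) string {
	if prefixedRouteRegistered {
		return originDataPrefix
	}
	return ""
}

// advertisedDataURL builds the data URL reported to the director from the
// same decision that controls handler registration (assumed design).
func advertisedDataURL(originURL string, prefixedRouteRegistered bool) string {
	return strings.TrimSuffix(originURL, "/") + dataPathPrefix(prefixedRouteRegistered)
}

func main() {
	fmt.Println(advertisedDataURL("https://origin-0:8444/", true))
	fmt.Println(advertisedDataURL("https://origin-0:8444", false))
}
```

With a shared helper like this, an origin that never registers the prefixed route would advertise its bare URL, which matches the path that the curl above shows working.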
// OA4MP is not XRootD specific - configure if enabled
if param.Origin_EnableIssuer.GetBool() {
	if err = oa4mp.ConfigureOA4MPProxy(engine); err != nil {
		return nil, err
	}
}

// Handle XRootD-specific initialization
if useXRootD {
While we configure OA4MP irrespective of the storage backend, we only actually launch OA4MP if the storage backend makes use of XRootD.
This feels unnecessarily restrictive in the presence of the posixv2 and ssh backends.
// getOriginURL returns the origin URL from the flag, address file, or config
func getOriginURL() (string, error) {
Both runSSHAuthLogin and runSSHAuthStatus rely on this function to return the origin's URL, but the function depends (indirectly) on Viper, and there is no guarantee that Viper has been configured by the time these commands call it.
Or in the words of Copilot:
The origin ssh-auth login command is calling config.ReadAddressFile() which uses getServerRuntimeDir().
This function reads from viper.GetString(param.RuntimeDir.GetName()), but viper is not initialized when running the CLI
command.
The problem is:
1. The ssh-auth login command runs as a standalone CLI command
2. It calls config.ReadAddressFile() at line 126 of /Users/baydemir/Ivalice/GitHub/pelican/cmd/origin_ssh_auth.go
3. ReadAddressFile() uses getServerRuntimeDir() which relies on viper configuration
4. But the CLI command hasn't initialized the configuration, so RuntimeDir is empty
5. This causes ReadAddressFile() to fail with "runtime directory is not configured"
The fix: The CLI command needs to initialize the configuration before trying to read the address file. You need to call
config.InitClient() or similar configuration initialization in the runSSHAuthLogin and runSSHAuthStatus functions before
calling getOriginURL().
Empirically, Copilot is not entirely wrong. Where I disagree with it: I think InitServer is more appropriate.
ssh_posixv2/websocket.go
// The websocket is under /api/v1.0/origin/ssh/auth for admin access
router.GET("/api/v1.0/origin/ssh/auth", handleWebSocket(ctx))
router.GET("/api/v1.0/origin/ssh/status", handleSSHStatus(ctx))
I'm concerned here that anyone can run pelican(-server) origin ssh-auth login --origin {url} without the origin authenticating the caller or otherwise enforcing some sort of constraint. (I don't see anything currently that enforces "admin access".)
I'd feel better if there was an obvious, defined policy for who or what can interact with SSH.
Or in the words of Copilot, while I experimented with what it could come up with:
I've updated /Users/baydemir/Ivalice/GitHub/pelican/ssh_posixv2/websocket.go to allow connections from the server's own IP address:
Changes made:
1. Added net import - needed for net.LookupHost()
2. Added param import - needed to access param.Server_Hostname
3. Created isLocalConnection() helper function that:
- Checks standard localhost addresses (127.0.0.1, ::1, localhost)
- Looks up the server's hostname using param.Server_Hostname.GetString()
- Resolves that hostname to IP addresses using net.LookupHost()
- Returns true if the client IP matches any of the server's IPs
4. Updated localhostOnlyMiddleware() to use the new helper function
Now administrators can connect to the SSH auth endpoints from:
- Standard localhost addresses (127.0.0.1, ::1, localhost)
- The server's own IP address(es) based on its configured hostname
This is useful when the origin is accessed via its actual IP address or hostname rather than localhost, while still maintaining security by not allowing arbitrary remote connections.
- Make sure we advertise correct data URLs when run separately.
- Make sure SSH auth websocket requires admin access
@brianaydemir - can you take another look at this?
patrickbrophy left a comment
Using @brianaydemir's Pelican test framework, I was able to set up an SSH-backed origin and pull a file. Note that I ran into issues with the helper installation due to Pelican's reliance on glibc. This was fixed when I switched my SSH storage server from an Alpine-based image to an Alma-based image.
I am now going to focus on the code itself. Given that the PR is quite large, I will pay closer attention to where it intersects with existing code.
Interesting! There should be no glibc dependency, right?
patrickbrophy left a comment
While running the docker compose setup, I noticed after a little while that the Origin panicked with the following:
origin-ssh-1 | panic: runtime error: invalid memory address or nil pointer dereference
origin-ssh-1 | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x123926c]
origin-ssh-1 |
origin-ssh-1 | goroutine 2975 [running]:
origin-ssh-1 | github.com/pelicanplatform/pelican/ssh_posixv2.(*SSHConnection).readHelperStdout(0x4000fecf70, {0x2996660, 0x4001332d70})
origin-ssh-1 | /pelican-build/ssh_posixv2/helper.go:231 +0xcc
origin-ssh-1 | github.com/pelicanplatform/pelican/ssh_posixv2.(*SSHConnection).StartHelper.func2()
origin-ssh-1 | /pelican-build/ssh_posixv2/helper.go:185 +0x24
origin-ssh-1 | golang.org/x/sync/errgroup.(*Group).Go.func1()
origin-ssh-1 | /root/go/pkg/mod/golang.org/x/sync@v0.18.0/errgroup/errgroup.go:93 +0x4c
origin-ssh-1 | created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 369
origin-ssh-1 | /root/go/pkg/mod/golang.org/x/sync@v0.18.0/errgroup/errgroup.go:78 +0x90
The linked comments describe how this likely happened.
ssh_posixv2/backend.go
sessionCtx, sessionCancel := context.WithTimeout(ctx, sessionEstablishTimeout)
conn := NewSSHConnection(sshConfig)
backend.AddConnection(sshConfig.Host, conn)

// Try to establish the connection
err := runConnection(sessionCtx, conn, exports, authCookie)
sessionCancel() // Cancel the session context when done
The sessionEstablishTimeout is meant to bound only the connection setup phase (connect, detect platform, transfer binary, start helper), but the timeout context is passed to the entire runConnection lifecycle, including the indefinite "wait for helper to exit" phase. This causes the timeout to fire every 5 minutes, killing a healthy, actively serving helper process. The retry loop then treats this as a failure and increments the failure counter toward the max retry limit. Instead, we should cancel the session establishment timeout after the helper is confirmed ready and use the parent context for the long-running wait phase.
// StopHelper stops the remote helper process.
// It first tries a clean shutdown via stdin message, then falls back to signals.
func (c *SSHConnection) StopHelper(ctx context.Context) error {
The bug described in ssh_posixv2/backend.go triggers a race condition in StopHelper.
- runConnection passes ctx (the expired session context) to StopHelper, so the cleanShutdownCtx derived from it is immediately expired. The 3-second grace period for clean shutdown never actually happens.
- After the SIGKILL path, StopHelper sets c.helperIO = nil without waiting for the errgroup goroutines to finish. The readHelperStdout goroutine, still running between its ctx.Done() check and the c.helperIO dereference, hits a nil pointer.