feat: add SBOM scanner sidecar for memory-isolated SBOM generation (#753)
Move Syft-based SBOM scanning into a separate gRPC sidecar container to prevent OOM kills from crashing the main node-agent process. The sidecar communicates via Unix domain socket and follows the existing ClamAV sidecar pattern.

- Define protobuf service (CreateSBOM, Health RPCs)
- Extract shared Syft helpers into syftutil subpackage
- Implement gRPC server with sequential scan processing
- Create sbom-scanner binary entrypoint
- Implement gRPC client with crash detection and backoff
- Add sidecar/not-ready/in-process fallback branching in SbomManager
- Add retry tracking with TooLarge-with-memory-limit annotations
- Add Prometheus metrics (scan_total, duration, restarts, ready)
- Update Dockerfile to build both node-agent and sbom-scanner binaries
- Add unit and integration tests

Made-with: Cursor
Signed-off-by: Ben <ben@armosec.io>
📝 Walkthrough

Adds an SBOM scanner sidecar (gRPC server + Unix-socket client), integrates it into the node-agent to delegate SBOM generation with an in-process Syft fallback, and introduces protobufs, metrics, syft-to-SyftDocument conversion utilities, tests, and a Docker build step that includes the sbom-scanner binary.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Agent as Node Agent
    participant ScannerClient as SBOMScannerClient
    participant GRPC as gRPC (Unix Socket)
    participant Scanner as SBOM Scanner Server
    participant Storage as Storage
    Agent->>ScannerClient: Ready()?
    alt Ready
        Agent->>ScannerClient: CreateSBOM(ScanRequest)
        ScannerClient->>GRPC: CreateSBOM RPC
        GRPC->>Scanner: Forward request
        Scanner->>Scanner: Build image source & run Syft
        Scanner-->>GRPC: Response (sbom JSON + size)
        GRPC-->>ScannerClient: Response
        ScannerClient->>Agent: ScanResult (SyftDocument, size)
        Agent->>Storage: Persist SBOM
    else Not Ready
        Agent->>Agent: Run in-process Syft generation
        Agent->>Storage: Persist SBOM
    else Crash
        Agent->>Agent: handleScannerCrash() -> retry or mark oversized
    end
```
```mermaid
sequenceDiagram
    participant Main as main.go
    participant ClientCtor as NewSBOMScannerClient
    participant GRPC as gRPC (Unix Socket)
    participant Server as Scanner Server
    Main->>ClientCtor: NewSBOMScannerClient(socketPath)
    ClientCtor->>GRPC: Dial unix socket
    ClientCtor->>Server: Health() with backoff retries
    alt Health OK
        Server-->>ClientCtor: Ready=true
        ClientCtor-->>Main: client ready
    else Health Failed
        ClientCtor-->>Main: error
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (8)
pkg/sbomscanner/v1/proto/scanner.proto (1)
1-4: Consider fixing the Buf lint directory mismatch.

The static analysis tool reports that files with package `sbomscanner.v1` should be in a `sbomscanner/v1` directory relative to root, but this file is in `pkg/sbomscanner/v1/proto`. Options to resolve:

- Add a `buf.yaml` with `lint.ignore` for `PACKAGE_DIRECTORY_MATCH`
- Move the proto to `proto/sbomscanner/v1/scanner.proto` and adjust `go_package`

This won't affect functionality, but it will cause Buf lint failures in CI if enforced.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `pkg/sbomscanner/v1/proto/scanner.proto` around lines 1-4: the Buf lint complaint arises because the proto package name `sbomscanner.v1` in `scanner.proto` does not match its directory layout. Fix by either adding a `buf.yaml` at the repo root that ignores `PACKAGE_DIRECTORY_MATCH` (`lint.ignore: - PACKAGE_DIRECTORY_MATCH`), or moving the proto into a directory matching the package (`proto/sbomscanner/v1/scanner.proto`) and updating the `option go_package` string (`github.com/kubescape/node-agent/pkg/sbomscanner/v1/proto`) if relocating, so Buf lint and Go codegen remain consistent.

pkg/sbommanager/v1/syftutil/helpers.go (1)
9-22: Consider replacing `fmt.Println` with structured logging.

The `diskUsage` function uses `fmt.Println` for error output (lines 13, 20), which bypasses structured logging and may not be captured in container logs. Since this code is being extracted into a shared utility, consider using a logger or returning errors.

♻️ Suggested improvement
```diff
-func diskUsage(path string) int64 {
+func diskUsage(path string) (int64, error) {
 	var s int64
 	dir, err := os.Open(path)
 	if err != nil {
-		fmt.Println(err)
-		return s
+		return 0, fmt.Errorf("open directory %s: %w", path, err)
 	}
 	defer dir.Close()
 	files, err := dir.Readdir(-1)
 	if err != nil {
-		fmt.Println(err)
-		return s
+		return 0, fmt.Errorf("read directory %s: %w", path, err)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `pkg/sbommanager/v1/syftutil/helpers.go` around lines 9-22: `diskUsage` currently prints errors with `fmt.Println`; change it to return errors and let callers handle logging. Update `func diskUsage(path string) int64` to `func diskUsage(path string) (int64, error)` (or add a logger parameter if preferred), replace the `fmt.Println(err)` calls with early returns of the zero size and the error, and update any callers to handle/log the returned error using the project's structured logger.

pkg/sbommanager/v1/sbom_manager.go (1)
99-102: Silent failure on invalid `SCANNER_MEMORY_LIMIT` value.

If the environment variable contains non-numeric data, `ParseInt` returns 0 and the error is silently ignored. Consider logging a warning so operators know their configuration is invalid.

♻️ Optional: Log warning on parse failure
```diff
 var scannerMemLimit int64
 if memStr, ok := os.LookupEnv("SCANNER_MEMORY_LIMIT"); ok {
-	scannerMemLimit, _ = strconv.ParseInt(memStr, 10, 64)
+	var parseErr error
+	scannerMemLimit, parseErr = strconv.ParseInt(memStr, 10, 64)
+	if parseErr != nil {
+		logger.L().Warning("invalid SCANNER_MEMORY_LIMIT value, using 0",
+			helpers.String("value", memStr),
+			helpers.Error(parseErr))
+	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `pkg/sbommanager/v1/sbom_manager.go` around lines 99-102: the code silently ignores `strconv.ParseInt` errors when reading `SCANNER_MEMORY_LIMIT` into `scannerMemLimit`; modify the `os.LookupEnv`/`ParseInt` block to check the error and log a warning (including the invalid value and the parse error) instead of discarding it, falling back to the existing zero/default behavior.

pkg/sbomscanner/v1/server_test.go (1)
74-89: Duplicate test helper: consider extracting to shared test utilities.
`makeImageStatusJSON` is nearly identical to `makeTestImageStatus` in `integration_test.go`. Consider extracting to a shared `testutil_test.go` file to reduce duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `pkg/sbomscanner/v1/server_test.go` around lines 74-89: consolidate the duplicate test helpers `makeImageStatusJSON` and `makeTestImageStatus`: create a shared test file (e.g., `testutil_test.go`), move the common helper into it, and replace the calls in `server_test.go` and `integration_test.go` with that single helper; keep the helper's signature and JSON-marshal behavior the same, and update imports so the tests compile.

cmd/main.go (1)
385-392: Consider creating the scanner client only when SBOM generation is enabled.

The scanner client is created unconditionally when `SBOM_SCANNER_SOCKET` is set, even if `cfg.EnableSbomGeneration` is false. While this doesn't cause issues (the client simply won't be used), it's slightly wasteful to establish a gRPC connection and perform health checks when SBOM generation is disabled.

♻️ Optional: Move client creation inside the SBOM-enabled block
```diff
-	// Create the SBOM scanner sidecar client (if configured)
-	var scannerClient sbomscanner.SBOMScannerClient
-	if socket, ok := os.LookupEnv("SBOM_SCANNER_SOCKET"); ok {
-		scannerClient, err = sbomscanner.NewSBOMScannerClient(socket)
-		if err != nil {
-			logger.L().Ctx(ctx).Warning("SBOM scanner sidecar not available, falling back to in-process scanning", helpers.Error(err))
-		}
-	}
-
 	// Create the SBOM manager
 	var sbomManager sbommanager.SbomManagerClient
 	if cfg.EnableSbomGeneration {
+		var scannerClient sbomscanner.SBOMScannerClient
+		if socket, ok := os.LookupEnv("SBOM_SCANNER_SOCKET"); ok {
+			scannerClient, err = sbomscanner.NewSBOMScannerClient(socket)
+			if err != nil {
+				logger.L().Ctx(ctx).Warning("SBOM scanner sidecar not available, falling back to in-process scanning", helpers.Error(err))
+			}
+		}
 		sbomManager, err = sbommanagerv1.CreateSbomManager(ctx, cfg, igK8sClient.RuntimeConfig.SocketPath, storageClient, k8sObjectCache, scannerClient)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `cmd/main.go` around lines 385-392: only create the SBOM scanner sidecar client when SBOM generation is enabled; guard the `sbomscanner.NewSBOMScannerClient` call (and the `SBOM_SCANNER_SOCKET` env check) with `cfg.EnableSbomGeneration` so the gRPC connection and health checks are skipped when disabled, moving the socket lookup, the constructor call, and the existing warning log inside the `cfg.EnableSbomGeneration` block to preserve behavior when enabled.

pkg/sbomscanner/v1/client.go (1)
43-59: Consider adding a maximum elapsed time to the startup health-check backoff.

The exponential backoff uses default settings, which allow retries for up to ~15 minutes. If the scanner sidecar fails to start, this could significantly delay node-agent startup before falling back to in-process scanning.

♻️ Suggested: Configure bounded backoff for faster failure detection
```diff
-	_, err = backoff.Retry(context.Background(), func() (struct{}, error) {
+	bo := backoff.NewExponentialBackOff()
+	bo.MaxElapsedTime = 30 * time.Second // Fail faster if scanner won't start
+	_, err = backoff.Retry(context.Background(), func() (struct{}, error) {
 		ctx, cancel := context.WithTimeout(context.Background(), healthCheckTimeout)
 		defer cancel()
 		resp, err := c.client.Health(ctx, &pb.HealthRequest{})
 		if err != nil {
 			return struct{}{}, fmt.Errorf("health check failed: %w", err)
 		}
 		if !resp.Ready {
 			return struct{}{}, fmt.Errorf("scanner not ready")
 		}
 		return struct{}{}, nil
-	}, backoff.WithBackOff(backoff.NewExponentialBackOff()))
+	}, backoff.WithBackOff(bo))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `pkg/sbomscanner/v1/client.go` around lines 43-59: the exponential backoff used in the startup health check is unbounded and can take ~15 minutes; create an `ExponentialBackOff` instance configured with a bounded `MaxElapsedTime` (e.g., a short configurable duration constant) and pass it to `backoff.WithBackOff` instead of calling `backoff.NewExponentialBackOff()` inline, keeping the existing `healthCheckTimeout` context so the health check fails fast when the sidecar doesn't start.

pkg/sbomscanner/v1/integration_test.go (1)
39-62: Consider using `t.Cleanup()` for automatic test cleanup.

The test helper returns cleanup responsibilities to callers. Using `t.Cleanup()` would ensure resources are cleaned up even if a test panics or forgets to defer cleanup.

♻️ Optional: Use t.Cleanup for automatic resource management
```diff
 func startIntegrationServer(t *testing.T) (SBOMScannerClient, *grpc.Server, string) {
 	t.Helper()
 	dir := t.TempDir()
 	sock := filepath.Join(dir, "scanner.sock")
 	lis, err := net.Listen("unix", sock)
 	require.NoError(t, err)
 	srv := grpc.NewServer()
 	pb.RegisterSBOMScannerServer(srv, NewScannerServer())
 	go srv.Serve(lis)
 	conn, err := grpc.NewClient("unix://"+sock,
 		grpc.WithTransportCredentials(insecure.NewCredentials()),
 	)
 	require.NoError(t, err)
 	client := &sbomScannerClient{
 		conn:   conn,
 		client: pb.NewSBOMScannerClient(conn),
 	}
+	t.Cleanup(func() {
+		client.Close()
+		srv.Stop()
+		os.Remove(sock)
+	})
+
-	return client, srv, sock
+	return client, srv, sock // callers can still use srv for specific test scenarios
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `pkg/sbomscanner/v1/integration_test.go` around lines 39-62: `startIntegrationServer` currently returns resources for callers to clean up; register automatic cleanup with `t.Cleanup` inside the helper instead: after creating `lis`, register `t.Cleanup` to close it; after starting `srv`, register `t.Cleanup` to stop it; and after creating `conn`, register `t.Cleanup` to close the client connection, so callers no longer need to remember to defer cleanup.

pkg/sbommanager/v1/syftutil/document.go (1)
18-19: Silently ignoring JSON marshal errors may hide issues.

Lines 18-19 discard marshal errors for `configuration` and `metadata`. While these fields are likely always valid, silently ignoring errors could make debugging difficult if malformed data ever appears. Consider logging at debug level if marshaling fails, or at minimum document why the errors are safely ignored.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/sbommanager/v1/syftutil/document.go` around lines 18 - 19, The code currently discards errors from json.Marshal for configuration and metadata (configuration, _ := json.Marshal(doc.Descriptor.Configuration); metadata, _ := json.Marshal(doc.Source.Metadata)); update this to capture the error variables, check them, and log a debug-level message (including which field and the error) instead of ignoring (e.g., configuration, err := json.Marshal(...); if err != nil { logger.Debugf("syft document: failed to marshal Descriptor.Configuration: %v", err); configuration = []byte("{}") } ), or otherwise handle the error consistently (return it or set a safe default) so malformed data is observable; reference doc.Descriptor.Configuration and doc.Source.Metadata when implementing the check and log.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/sbommanager/v1/sbom_manager.go`:
- Around line 72-74: The scanRetries map is accessed concurrently (e.g., in
handleScannerCrash and processContainer) and may race when worker pool
concurrency increases; protect it by replacing scanRetries map[string]int with a
concurrency-safe structure or guarding accesses with a mutex: either switch to
sync.Map (update uses Load, Store, Delete) or add a sync.Mutex (e.g.,
scanRetriesMu) and wrap all reads/writes in handleScannerCrash,
processContainer, and any other funcs that touch scanRetries to hold the lock;
update all references to use the chosen synchronization so map accesses are
safe.
- Around line 477-497: scanRetries entries are never removed on successful scans
or other terminal non-crash failures, causing a memory leak; remove the entry
when a scan reaches any terminal state. Specifically, in the success path where
sbomScanTotal.WithLabelValues("success").Inc() /
sbomScanDuration.WithLabelValues("success").Observe(...) and syftDoc =
result.SyftDocument are set, add delete(s.scanRetries, sbomName) to clear the
counter; likewise, ensure you delete(s.scanRetries, sbomName) in other
terminal/error branches (any code that marks the SBOM as finished/failed) so
only the handleScannerCrash path relies on its existing deletion when retryCount
>= maxScanRetries.
In `@pkg/sbomscanner/v1/server.go`:
- Around line 67-70: The error handling in the RPC handler currently maps any
ctx.Err() to DeadlineExceeded; update the post-error branch in the handler (the
function containing the shown block, e.g., CreateSBOM/Scan handler) to inspect
ctx.Err() and return status.Error(codes.Canceled, ...) when ctx.Err() ==
context.Canceled and status.Error(codes.DeadlineExceeded, ...) when ctx.Err() ==
context.DeadlineExceeded; use context.Canceled and context.DeadlineExceeded
comparisons to distinguish the two cases and keep the existing error
messages/context in the returned status.Error calls.
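The `scanRetries` synchronization requested in the first inline comment can be sketched as a mutex-guarded wrapper. The names `retryTracker`, `Bump`, and `Clear` are illustrative, not the manager's actual API; the point is that every map access holds the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// retryTracker is a minimal stand-in for SbomManager's scanRetries field:
// a counter map whose every access is guarded by a mutex.
type retryTracker struct {
	mu      sync.Mutex
	retries map[string]int
}

func newRetryTracker() *retryTracker {
	return &retryTracker{retries: make(map[string]int)}
}

// Bump increments and returns the retry count (the handleScannerCrash path).
func (t *retryTracker) Bump(name string) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.retries[name]++
	return t.retries[name]
}

// Clear drops the entry on any terminal state (success, TooLarge, hard failure),
// which also addresses the leak noted in the second inline comment.
func (t *retryTracker) Clear(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.retries, name)
}

func main() {
	tr := newRetryTracker()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); tr.Bump("sbom-abc") }() // safe under -race
	}
	wg.Wait()
	fmt.Println(tr.Bump("sbom-abc")) // → 101
	tr.Clear("sbom-abc")
	fmt.Println(tr.Bump("sbom-abc")) // → 1
}
```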
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2487321b-d149-4b24-a51c-d9f527930e73
⛔ Files ignored due to path filters (2)
- `pkg/sbomscanner/v1/proto/scanner.pb.go` is excluded by `!**/*.pb.go`
- `pkg/sbomscanner/v1/proto/scanner_grpc.pb.go` is excluded by `!**/*.pb.go`
📒 Files selected for processing (17)
- `build/Dockerfile`
- `cmd/main.go`
- `cmd/sbom-scanner/main.go`
- `pkg/sbommanager/v1/metrics.go`
- `pkg/sbommanager/v1/sbom_manager.go`
- `pkg/sbommanager/v1/syftutil/document.go`
- `pkg/sbommanager/v1/syftutil/helpers.go`
- `pkg/sbommanager/v1/syftutil/resolver.go`
- `pkg/sbommanager/v1/syftutil/source.go`
- `pkg/sbommanager/v1/syftutil/source_test.go`
- `pkg/sbomscanner/v1/client.go`
- `pkg/sbomscanner/v1/integration_test.go`
- `pkg/sbomscanner/v1/proto/buf.gen.yaml`
- `pkg/sbomscanner/v1/proto/scanner.proto`
- `pkg/sbomscanner/v1/server.go`
- `pkg/sbomscanner/v1/server_test.go`
- `pkg/sbomscanner/v1/types.go`
Signed-off-by: Ben <ben@armosec.io>
…or handling for canceled and timed-out scans Signed-off-by: Ben <ben@armosec.io>
Signed-off-by: Ben <ben@armosec.io>
Actionable comments posted: 2
♻️ Duplicate comments (2)
pkg/sbommanager/v1/sbom_manager.go (2)
371-379: ⚠️ Potential issue | 🟡 Minor

Clear stale crash-retry state on non-crash error returns.

After prior crashes, lines 371-379 return on non-crash errors without cleaning `scanRetries[sbomName]`, so entries can linger indefinitely for images that never reach success or max retries.

🧹 Proposed fix
```diff
 		sbomScanTotal.WithLabelValues("error").Inc()
 		sbomScanDuration.WithLabelValues("error").Observe(scanDuration)
+		delete(s.scanRetries, sbomName)
 		logger.L().Ctx(s.ctx).Error("SbomManager - sidecar scan failed",
 			helpers.Error(scanErr),
 			helpers.String("namespace", notif.Container.K8s.Namespace),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/sbommanager/v1/sbom_manager.go` around lines 371 - 379, The error-return path in SbomManager leaves stale entries in the scanRetries map for sbomName when a non-crash error occurs; before returning in the non-crash error block (the branch that logs "SbomManager - sidecar scan failed"), remove the retry state by deleting scanRetries[sbomName] (or resetting its counter) so retries do not linger for that image, and ensure you perform the deletion under the same synchronization used for scanRetries (the same lock/guard used elsewhere in the SbomManager methods).
99-102: ⚠️ Potential issue | 🟡 Minor

Handle invalid `SCANNER_MEMORY_LIMIT` explicitly instead of silently coercing to 0.

Lines 99-102 ignore `strconv.ParseInt` errors. A malformed env var silently disables memory-limit-based behavior and makes retries/annotations harder to reason about.

🔧 Proposed fix
```diff
 var scannerMemLimit int64
 if memStr, ok := os.LookupEnv("SCANNER_MEMORY_LIMIT"); ok {
-	scannerMemLimit, _ = strconv.ParseInt(memStr, 10, 64)
+	parsed, parseErr := strconv.ParseInt(memStr, 10, 64)
+	if parseErr != nil || parsed <= 0 {
+		logger.L().Ctx(ctx).Warning("SbomManager - invalid SCANNER_MEMORY_LIMIT, ignoring",
+			helpers.String("value", memStr),
+			helpers.Error(parseErr))
+	} else {
+		scannerMemLimit = parsed
+	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/sbommanager/v1/sbom_manager.go` around lines 99 - 102, The code silently ignores strconv.ParseInt errors when reading SCANNER_MEMORY_LIMIT into scannerMemLimit; change the os.LookupEnv/strconv.ParseInt block to check the parse error and handle it explicitly: call strconv.ParseInt into a temp value and if err != nil then return (or propagate) a descriptive error (or at minimum log it via the package logger and fail fast) instead of assigning zero, otherwise assign scannerMemLimit = parsed value; reference the scannerMemLimit variable, os.LookupEnv and strconv.ParseInt in your change so reviewers can find the fix.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 257b1544-5f22-402e-aa0e-23195a640dc5
📒 Files selected for processing (3)
- `cmd/sbom-scanner/main.go`
- `pkg/sbommanager/v1/sbom_manager.go`
- `pkg/sbomscanner/v1/server.go`
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/sbom-scanner/main.go
```go
	if retryCount >= maxScanRetries {
		delete(wipSbom.Annotations, NodeNameMetadataKey)
		wipSbom.Annotations[helpersv1.StatusMetadataKey] = helpersv1.TooLarge
		wipSbom.Annotations[ScannerMemoryLimitAnnotation] = fmt.Sprintf("%d", s.scannerMemLimit)
		wipSbom.Spec = v1beta1.SBOMSyftSpec{}
		_, _ = s.storageClient.ReplaceSBOM(wipSbom)
		delete(s.scanRetries, sbomName)
	}
```
Do not drop retry state when persisting TooLarge fails.
Lines 496-497 ignore `ReplaceSBOM` errors and still delete the retry state. If the storage write fails, the SBOM remains un-finalized but retry tracking is lost, causing repeated crash loops without a durable terminal status.
✅ Safer terminalization flow
```diff
 if retryCount >= maxScanRetries {
 	delete(wipSbom.Annotations, NodeNameMetadataKey)
 	wipSbom.Annotations[helpersv1.StatusMetadataKey] = helpersv1.TooLarge
 	wipSbom.Annotations[ScannerMemoryLimitAnnotation] = fmt.Sprintf("%d", s.scannerMemLimit)
 	wipSbom.Spec = v1beta1.SBOMSyftSpec{}
-	_, _ = s.storageClient.ReplaceSBOM(wipSbom)
-	delete(s.scanRetries, sbomName)
+	if _, err := s.storageClient.ReplaceSBOM(wipSbom); err != nil {
+		logger.L().Error("SbomManager - failed to persist TooLarge SBOM after scanner crashes",
+			helpers.Error(err),
+			helpers.String("namespace", notif.Container.K8s.Namespace),
+			helpers.String("pod", notif.Container.K8s.PodName),
+			helpers.String("container", notif.Container.K8s.ContainerName),
+			helpers.String("sbomName", sbomName))
+		return
+	}
+	delete(s.scanRetries, sbomName)
 }
```
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/sbommanager/v1/sbom_manager.go` around lines 491 - 498, The current
branch finalizes a TooLarge terminal status but ignores errors from
s.storageClient.ReplaceSBOM and unconditionally deletes retry state
(s.scanRetries[sbomName]); change this so you capture the error returned by
ReplaceSBOM when updating wipSbom (setting
Annotations[helpersv1.StatusMetadataKey]=helpersv1.TooLarge,
ScannerMemoryLimitAnnotation, and clearing Spec to v1beta1.SBOMSyftSpec{}), and
only remove the retry entry from s.scanRetries if ReplaceSBOM succeeds; if
ReplaceSBOM returns an error, log/propagate the error and leave s.scanRetries
intact so retry state persists.
```go
type scannerServer struct {
	pb.UnimplementedSBOMScannerServer
	mu      sync.Mutex
	version string
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
rg -n -C3 'mu\.Lock\(|WithTimeout\(|defer s\.mu\.Unlock' pkg/sbomscanner/v1/server.go
```

Repository: kubescape/node-agent
Length of output: 424
🏁 Script executed:

```shell
cat -n pkg/sbomscanner/v1/server.go
```

Repository: kubescape/node-agent
Length of output: 4052
Request timeout does not protect lock acquisition; the mutex wait is unbounded.
CreateSBOM calls s.mu.Lock() at line 37 before establishing the request timeout at lines 40-44. Under contention, requests block indefinitely on the mutex before timeout logic applies. The timeout only protects the scan execution (line 66), not the serialization gate.
Proposed refactor (context-aware sequential gate)
type scannerServer struct {
pb.UnimplementedSBOMScannerServer
- mu sync.Mutex
+ mu sync.Mutex
+ scanSem chan struct{}
version string
}
func NewScannerServer() pb.SBOMScannerServer {
return &scannerServer{
+ scanSem: make(chan struct{}, 1),
version: packageVersion("github.com/anchore/syft"),
}
}
func (s *scannerServer) CreateSBOM(ctx context.Context, req *pb.CreateSBOMRequest) (*pb.CreateSBOMResponse, error) {
- s.mu.Lock()
- defer s.mu.Unlock()
-
if req.TimeoutSeconds > 0 {
var cancel context.CancelFunc
ctx, cancel = context.WithTimeout(ctx, time.Duration(req.TimeoutSeconds)*time.Second)
defer cancel()
}
+
+ select {
+ case s.scanSem <- struct{}{}:
+ defer func() { <-s.scanSem }()
+ case <-ctx.Done():
+ if ctx.Err() == context.Canceled {
+ return nil, status.Error(codes.Canceled, "scan canceled")
+ }
+ if ctx.Err() == context.DeadlineExceeded {
+ return nil, status.Error(codes.DeadlineExceeded, "scan timed out")
+ }
+ return nil, status.Errorf(codes.Internal, "failed before scan start: %v", ctx.Err())
+ }
Also applies to: 30-33, 36-44
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/sbomscanner/v1/server.go` around lines 24 - 28, The CreateSBOM handler
currently calls s.mu.Lock() before creating a context with timeout, so mutex
wait is unbounded; replace the plain mutex with a context-aware semaphore and
acquire it after creating the request context: change scannerServer.mu from
sync.Mutex to a semaphore channel (e.g., sem chan struct{} initialized with one
token), in CreateSBOM call ctx, cancel := context.WithTimeout(...) before trying
to acquire the gate, then use a select to either receive the token from s.sem
(acquire) or return on ctx.Done() (timeout/cancel), and defer returning the
token (release); apply the same pattern for other methods that used
s.mu.Lock()/Unlock() so lock acquisition is protected by the request context and
respects timeouts.
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/sbommanager/v1/sbom_manager.go (1)
491-497: ⚠️ Potential issue | 🟠 Major
Preserve retry state when persisting `TooLarge` fails.
Line 496 ignores `ReplaceSBOM` errors and Line 497 always deletes retry state; that can drop recovery context without persisting terminal status.
🔧 Proposed fix
if retryCount >= maxScanRetries {
    delete(wipSbom.Annotations, NodeNameMetadataKey)
    wipSbom.Annotations[helpersv1.StatusMetadataKey] = helpersv1.TooLarge
    wipSbom.Annotations[ScannerMemoryLimitAnnotation] = fmt.Sprintf("%d", s.scannerMemLimit)
    wipSbom.Spec = v1beta1.SBOMSyftSpec{}
-   _, _ = s.storageClient.ReplaceSBOM(wipSbom)
+   if _, err := s.storageClient.ReplaceSBOM(wipSbom); err != nil {
+       logger.L().Error("SbomManager - failed to persist TooLarge SBOM after scanner crashes",
+           helpers.Error(err),
+           helpers.String("namespace", notif.Container.K8s.Namespace),
+           helpers.String("pod", notif.Container.K8s.PodName),
+           helpers.String("container", notif.Container.K8s.ContainerName),
+           helpers.String("sbomName", sbomName))
+       return
+   }
    delete(s.scanRetries, sbomName)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/sbommanager/v1/sbom_manager.go` around lines 491 - 497, The current block ignores the error returned by s.storageClient.ReplaceSBOM and unconditionally removes retry state via delete(s.scanRetries, sbomName), which can lose recovery context if persistence fails; update the code in the retry-finalization branch (references: retryCount, maxScanRetries, wipSbom, s.storageClient.ReplaceSBOM, delete(s.scanRetries, sbomName)) to check and handle the ReplaceSBOM error — attempt the ReplaceSBOM call, if it returns an error then log/return the error and do NOT delete s.scanRetries[sbomName] (preserve retry state); only remove the retry entry after ReplaceSBOM succeeds. Ensure any logging includes context (sbomName and the error) so operators can act.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/sbommanager/v1/sbom_manager.go`:
- Around line 351-360: The CreateSBOM RPC is called with s.ctx (a background
context) which has no deadline; wrap s.ctx with a client-side timeout using
context.WithTimeout before calling s.scannerClient.CreateSBOM (and defer cancel)
and pass that derived ctx into CreateSBOM instead of s.ctx so transport-level
stalls are bounded; keep the existing Timeout field in sbomscanner.ScanRequest
but enforce an overall network/transport deadline in the client.
---
Duplicate comments:
In `@pkg/sbommanager/v1/sbom_manager.go`:
- Around line 491-497: The current block ignores the error returned by
s.storageClient.ReplaceSBOM and unconditionally removes retry state via
delete(s.scanRetries, sbomName), which can lose recovery context if persistence
fails; update the code in the retry-finalization branch (references: retryCount,
maxScanRetries, wipSbom, s.storageClient.ReplaceSBOM, delete(s.scanRetries,
sbomName)) to check and handle the ReplaceSBOM error — attempt the ReplaceSBOM
call, if it returns an error then log/return the error and do NOT delete
s.scanRetries[sbomName] (preserve retry state); only remove the retry entry
after ReplaceSBOM succeeds. Ensure any logging includes context (sbomName and
the error) so operators can act.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 693f55df-2d30-4ab3-b1d7-01ed92347f75
📒 Files selected for processing (1)
pkg/sbommanager/v1/sbom_manager.go
result, scanErr := s.scannerClient.CreateSBOM(s.ctx, sbomscanner.ScanRequest{
	ImageID:             imageID,
	ImageTag:            sharedData.ImageTag,
	LayerPaths:          mounts,
	ImageStatus:         imageStatusBytes,
	MaxImageSize:        s.cfg.MaxImageSize,
	MaxSBOMSize:         int32(s.cfg.MaxSBOMSize),
	EnableEmbeddedSBOMs: s.cfg.EnableEmbeddedSboms,
	Timeout:             15 * time.Minute,
})
Add a client-side deadline around the scanner RPC call.
Line 351 passes s.ctx directly without a timeout wrapper. The context is context.Background() with no deadline. If the transport or network layer stalls, the RPC call will hang indefinitely, blocking the single-worker pool and preventing any SBOM processing for subsequent containers. The server-side Timeout in the payload is insufficient because it cannot protect against transport-level stalls before reaching the server.
🔧 Proposed fix
- result, scanErr := s.scannerClient.CreateSBOM(s.ctx, sbomscanner.ScanRequest{
+ scanCtx, cancel := context.WithTimeout(s.ctx, 16*time.Minute)
+ defer cancel()
+ result, scanErr := s.scannerClient.CreateSBOM(scanCtx, sbomscanner.ScanRequest{
ImageID: imageID,
ImageTag: sharedData.ImageTag,
LayerPaths: mounts,
ImageStatus: imageStatusBytes,
MaxImageSize: s.cfg.MaxImageSize,
MaxSBOMSize: int32(s.cfg.MaxSBOMSize),
EnableEmbeddedSBOMs: s.cfg.EnableEmbeddedSboms,
Timeout: 15 * time.Minute,
})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/sbommanager/v1/sbom_manager.go` around lines 351 - 360, The CreateSBOM
RPC is called with s.ctx (a background context) which has no deadline; wrap
s.ctx with a client-side timeout using context.WithTimeout before calling
s.scannerClient.CreateSBOM (and defer cancel) and pass that derived ctx into
CreateSBOM instead of s.ctx so transport-level stalls are bounded; keep the
existing Timeout field in sbomscanner.ScanRequest but enforce an overall
network/transport deadline in the client.
Summary
- `NodeSource`, `NodeResolver`, and `toSyftDocument` moved into a `syftutil` subpackage to avoid import cycles between the main agent and the sidecar binary

Key changes
- `pkg/sbomscanner/v1/proto/scanner.proto`: `CreateSBOM` and `Health` RPCs
- `pkg/sbomscanner/v1/server.go`: sequential processing, timeout, image-too-large handling
- `pkg/sbomscanner/v1/client.go`: reconnection backoff, health checks, crash detection
- `cmd/sbom-scanner/main.go`: Unix socket listener, graceful shutdown
- `pkg/sbommanager/v1/sbom_manager.go`: sidecar/not-ready/fallback branching, retry tracking, TooLarge-with-memory-limit annotations
- `pkg/sbommanager/v1/metrics.go`: `sbom_scan_total`, `sbom_scan_duration_seconds`, `sbom_scanner_restarts_total`, `sbom_scanner_ready`
- `pkg/sbommanager/v1/syftutil/`: extracted `source.go`, `resolver.go`, `document.go`, `helpers.go`
- `build/Dockerfile`: builds and copies both `node-agent` and `sbom-scanner` binaries
- `cmd/main.go`: creates scanner client from `SBOM_SCANNER_SOCKET` env var

Test plan
- `go build ./...` passes

Made with Cursor
Summary by CodeRabbit
New Features
Tests