
Conversation


@dray92 dray92 commented Jan 7, 2026

Summary by CodeRabbit

  • New Features

    • Network monitoring agent: conntrack-based flow collection, eBPF DNS tracing with enrichment, HTTP metrics/health endpoints, and deployment-mode support.
    • Helm chart and Kubernetes manifests for DaemonSet deployment; multi-arch container image and distroless runtime.
  • Infrastructure

    • Expanded verification: Kind, EKS, AKS, GKE, and Lima-based local flows; traffic generators and test server manifests.
  • CI / Delivery

    • GitHub Actions to build/push images and automate Helm chart releases; Makefile targets for image and verification workflows.
  • Tests

    • New unit/integration tests and end-to-end/local verification scripts.



coderabbitai bot commented Jan 7, 2026

📝 Walkthrough

Walkthrough

Adds zxporter-netmon: a Go network-monitor agent (conntrack + optional eBPF DNS tracer), a Helm chart and DaemonSet, multi‑arch image and release CI workflows, eBPF CO‑RE headers and programs, verification/provisioning scripts, monitoring/aggregation pipeline, and Dakr transport RPCs for network and DNS metrics.

Changes

  • CI / Release Workflows (.github/workflows/helm-release-netmon.yml, .github/workflows/push-zxporter-netmon-image.yml): New workflows: Helm chart release (version bump, package, push, PR) and multi-arch image build/push to Docker Hub & ECR with release-aware tagging, validations, and Helm values PR on release.
  • Build & Packaging (Dockerfile.netmon, Makefile, .gitignore, go.mod): New multi-stage distroless Dockerfile; Makefile targets for netmon image, Lima and multi-provider E2E flows; .gitignore additions; large dependency changes and many require additions in go.mod.
  • Main Binary & Deployment (cmd/zxporter-netmon/main.go, config/zxporter-netmon.yaml, helm-chart/zxporter-netmon/*): New main program (standalone/in-cluster), HTTP metrics/health endpoints, graceful shutdown; Kubernetes DaemonSet manifest and full Helm chart (Chart.yaml, values, helpers, templates, RBAC).
  • Conntrack & Cilium Clients (internal/networkmonitor/conntrack_client.go, internal/networkmonitor/cilium_client*.go, internal/networkmonitor/cilium_*): Netfilter conntrack client; Cilium client (linux implementation + non-linux stub); BPF map enumeration, kernel Hz/time utilities for Cilium integration.
  • eBPF Tracing & Types (internal/networkmonitor/ebpf/*, internal/networkmonitor/ebpf/c/*, internal/networkmonitor/ebpf/c/include/*): eBPF tracer package (linux/no-linux variants), DNSEvent types, generated bindings, c/ eBPF program capturing DNS, CO‑RE/libbpf headers and helper scripts; cgroup2 mount helpers and tracer runtime logic.
  • Monitoring Orchestrator (internal/networkmonitor/monitor.go, internal/networkmonitor/monitor_test.go): New Monitor: read/cleanup/flush tickers, delta aggregation, DNS & pod enrichment, HTTP metrics handler, Dakr flush path; tests for hashing and concurrency.
  • Pod Cache & Informers (internal/networkmonitor/pod_cache.go, internal/networkmonitor/pod_cache_test.go): PodCache mapping pod IP → Pod via informer or standalone mode; tests for add/update/delete and IP churn.
  • Transport / Dakr Integration (internal/transport/interface.go, internal/transport/dakr_client.go, internal/transport/sender.go, internal/transport/connect_compression_test.go): Added SendNetworkTrafficMetrics to DakrClient interface and implementations; RealDakrClient forwards headers/token; SimpleDakrClient and mock updated for tests.
  • Tests & Test Server (test/stats/types.go, test/testserver/main.go, internal/networkmonitor/ebpf/tracer_linux_test.go): New test types for network/DNS stats; test server handler SendNetworkTrafficMetrics and h2c HTTP/2 cleartext support; eBPF tracer test added.
  • Verification & Provisioning (verification/*: verify_e2e.sh, verify_local.sh, lima.yaml, kind-config.yaml, provision_*.sh, traffic-*.yaml): Local (Lima) and multi-cloud E2E verification scripts, Kind config with mounts, traffic generator/server manifests, provider provisioning scripts for EKS/AKS/GKE, and helper utilities.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User
    participant App as zxporter-netmon
    participant PodCache
    participant Conntrack
    participant eBPF
    participant IP2DNS
    participant Monitor
    participant Dakr
    participant HTTP

    User->>App: start (--standalone / in-cluster)
    App->>PodCache: init (if in-cluster)
    App->>Conntrack: init (netfilter or cilium)
    App->>eBPF: init tracer (if Kernel BTF available)
    eBPF->>IP2DNS: emit DNSEvent
    loop Read interval
      Monitor->>Conntrack: ListEntries()
      Conntrack-->>Monitor: entries
      Monitor->>PodCache: GetLocalPodIPs()
      Monitor->>Monitor: compute deltas & aggregate flows
    end
    loop Flush interval
      Monitor->>IP2DNS: Lookups()
      IP2DNS-->>Monitor: recent lookups
      Monitor->>Dakr: SendNetworkTrafficMetrics(payload)
      Dakr-->>Monitor: ACK
    end
    User->>HTTP: GET /metrics
    HTTP->>Monitor: GetMetricsHandler
    Monitor-->>HTTP: JSON metrics
    HTTP-->>User: response
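
For orientation, a minimal Go sketch of the read/flush loop the diagram describes. The two-ticker split comes from the diagram and the walkthrough above; the function signature and names here are illustrative, not the PR's actual Monitor API.

package netmonsketch

import (
	"context"
	"log"
	"time"
)

// monitorLoop sketches the cadence from the sequence diagram: the read ticker
// pulls conntrack entries and aggregates deltas, while the flush ticker ships
// the aggregate (plus recent DNS lookups) to the control plane.
func monitorLoop(ctx context.Context, readEvery, flushEvery time.Duration,
	readAndAggregate func() error, flush func(context.Context) error) {
	readTicker := time.NewTicker(readEvery)
	defer readTicker.Stop()
	flushTicker := time.NewTicker(flushEvery)
	defer flushTicker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-readTicker.C:
			if err := readAndAggregate(); err != nil {
				log.Printf("read/aggregate failed: %v", err)
			}
		case <-flushTicker.C:
			if err := flush(ctx); err != nil {
				log.Printf("flush failed: %v", err)
			}
		}
	}
}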

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • sandipanpanda

Poem

🐇 I hop through maps and kernel space,

I sniff DNS, and mark each trace;
From conntrack bytes to Helm's neat chart,
I bundle metrics, play my part.
Cheers — a rabbit for the network art!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 18.33%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately summarizes the main objective of the PR: introducing a daemonset for node-level network monitoring (zxporter-netmon). It is clear, concise, and directly reflects the primary purpose of the changeset.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
go.mod (1)

3-3: Upgrade Go version and review pseudo-version dependencies.

Go 1.24.0 had documented correctness and security issues post-release (including CVE-2025-4674). Upgrade to a stable 1.24.x release (e.g., 1.24.5 or later).

Additionally, review the two pseudo-version dependencies in the indirect requires block:

  • inet.af/netaddr pins to commit b8eac61e914a from May 2023—clarify if this is intentional (e.g., if no stable release exists) or if a newer released version should be preferred.
  • google.golang.org/genproto uses commit 9219d122eba9 from September 2025—verify that this cutting-edge commit remains stable and is necessary.
🤖 Fix all issues with AI agents
In @cmd/zxporter-netmon/main.go:
- Around line 193-202: The shutdown sequence never cancels the monitor's
context, so call the cancel function created with ctx, cancel :=
context.WithCancel(...) to trigger Monitor.Start() cleanup (which closes m.ct).
Update the shutdown flow around logger.Info("Shutting down...") and
server.Shutdown(shutdownCtx) to invoke cancel() (either before or immediately
after server.Shutdown) and ensure shutdownCancel() is still deferred; this will
let the monitor exit and release the conntrack client.
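
As a reference for this fix, a minimal sketch of the intended shutdown ordering, assuming the usual context.WithCancel / http.Server pattern; the helper and variable names here are illustrative, not the exact ones in main.go.

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runWithGracefulShutdown sketches the fix: tie the monitor's lifetime to a
// cancellable context and call cancel() during shutdown so Monitor.Start()
// exits and its conntrack client is closed.
func runWithGracefulShutdown(server *http.Server, startMonitor func(context.Context) error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() { _ = startMonitor(ctx) }() // blocks until ctx is cancelled

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	<-sigCh

	shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer shutdownCancel()
	_ = server.Shutdown(shutdownCtx)

	cancel() // lets the monitor loop exit and release the conntrack client
}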

In @config/zxporter-netmon.yaml:
- Around line 28-33: Remove the invalid pod-level securityContext block that
sets privileged: true and capabilities (NET_ADMIN, SYS_RESOURCE); those fields
are only valid at the container level and are already defined in the existing
container-level securityContext, so delete the pod-level securityContext section
(the block containing privileged and capabilities) to avoid validation errors or
ignored settings and rely on the container-level securityContext instead.

In @Dockerfile.netmon:
- Around line 26-33: The Dockerfile sets FROM gcr.io/distroless/static:nonroot
but then overrides it with USER 0:0, which runs the container as root; remove
the USER 0:0 line so the image runs as the nonroot user provided by distroless
(UID 65532), or if root is truly required, change the base to FROM
gcr.io/distroless/static:latest and add a short comment justifying why root is
necessary; ensure ENTRYPOINT ["/zxporter-netmon"] remains unchanged.

In @helm-chart/zxporter-netmon/templates/clusterrole.yaml:
- Line 4: The ClusterRole template is using the serviceAccountName helper
(include "zxporter-netmon.serviceAccountName" .) which is semantically wrong;
replace that include with include "zxporter-netmon.fullname" . to use the
chart's canonical resource name, or add a dedicated helper named
"zxporter-netmon.clusterRoleName" in templates/_helpers.tpl that returns the
intended name and use include "zxporter-netmon.clusterRoleName" . in the
clusterrole.yaml; update any references so ClusterRole naming no longer depends
on serviceAccount.create/serviceAccount.name helper logic.

In @internal/networkmonitor/ebpf/bpf_bpfel.go:
- Around line 107-114: _BpfClose currently returns on the first Close error
which can leak later resources; change it to always iterate and call Close on
every io.Closer in the closers slice, capturing either the first non-nil error
or aggregating errors (e.g., using errors.Join if available) and return that
after the loop; update the generated function _BpfClose to perform all closes
and then return the collected error, and if possible fix the bpf2go generator so
future outputs behave the same.
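
A sketch of the errors.Join aggregation mentioned in the prompt (Go 1.20+), written as a standalone helper rather than a patch to the generated file:

package ebpfsketch

import (
	"errors"
	"io"
)

// closeAll closes every closer instead of stopping at the first failure,
// then reports all errors (or nil) via errors.Join.
func closeAll(closers ...io.Closer) error {
	var errs []error
	for _, c := range closers {
		if err := c.Close(); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when no close failed
}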

In @internal/networkmonitor/ebpf/c/include/update.sh:
- Around line 18-19: The current inline download pipeline (curl -sL ... | tar
-xz --xform=... "${headers[@]}") should be made robust: download the release
tarball to a temporary file using a failing curl variant (e.g. enable -f and
preserve redirects) and check curl's exit code, validate the tarball integrity
against a provided checksum (and/or GPG signature from the libbpf release page)
before extracting, and avoid blind overwrite of existing headers by either
extracting into a temporary directory and atomically moving new files or using
tar options to skip/keep existing files; if any verification or extraction step
fails, emit a clear error and exit non‑zero. Ensure you reference LIBBPF_VERSION
and the headers array in the updated logic so the download, checksum URL, and
extraction targets are consistent.
- Line 4: The pinned LIBBPF_VERSION variable in update.sh is set to the
vulnerable 0.6.1; change LIBBPF_VERSION to a patched release (minimum 0.7.0,
preferably the current stable 1.6.2) so the build uses a non-vulnerable libbpf;
update the LIBBPF_VERSION value accordingly and verify the tag/format matches
the repository's expected version string elsewhere in the script.

In @internal/networkmonitor/ebpf/cgroup_linux.go:
- Around line 23-34: The mountCgroup2 function should be made idempotent: when
creating the directory (mountPoint) treat already-existing directory as
non-fatal (use os.Mkdir and ignore os.IsExist errors) and before calling
syscall.Mount check whether cgroup2 is already mounted at mountPoint by parsing
/proc/mounts (or using syscall.Statfs) and return nil if a cgroup2 mount is
present; only call syscall.Mount if the mountPoint exists and no cgroup2 is
currently mounted, and preserve existing error wrapping in the same fmt.Errorf
style if mount fails.

In @internal/networkmonitor/monitor.go:
- Around line 133-144: The select in the goroutine handling flushTicker has a
duplicated branch for "case <-flushTicker.C:" causing redundant execution paths;
remove the redundant second "case <-flushTicker.C:" block (the one that contains
the comment about goroutine vs synchronous flush and the duplicate call to
m.flush(ctx)) so only a single case handling flushTicker.C remains, ensuring the
select calls m.flush(ctx) exactly once and keeps the existing m.log.Error(err,
"Failed to flush metrics to control plane") behavior.
- Around line 327-343: The conntrackEntryKey function can panic using As4() for
IPv6 addresses; update it to use As16() so both IPv4 and IPv6 are handled
(replace calls to e.Src.IP().As4() and e.Dst.IP().As4() with As16() and write
the full 16-byte slices into conntrackEntryHash), keeping the existing port and
proto writes intact and ensuring conntrackEntryHash.Reset()/Sum64() usage
remains the same.
- Around line 324-359: The package-level maphash.Hash variables
(conntrackEntryHash, entryGroupHash) are not safe for concurrent use; replace
them by creating a local maphash.Hash instance inside conntrackEntryKey and
inside entryGroupKey (e.g. var h maphash.Hash or h := maphash.Hash{}), use
h.Reset()/h.Write(...)/h.Sum64() there, and remove or stop referencing the
package-level conntrackEntryHash and entryGroupHash to eliminate the data race.
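
A combined sketch of the last two fixes, assuming netaddr-typed Src/Dst fields as referenced above; a shared maphash.Seed keeps keys stable across calls while a per-call maphash.Hash removes the shared-state race, and As16() accepts both IPv4 and IPv6.

package networkmonitorsketch

import (
	"encoding/binary"
	"hash/maphash"

	"inet.af/netaddr"
)

// Shared seed so identical entries hash to the same key across calls.
var entryKeySeed = maphash.MakeSeed()

// conntrackEntryKeySketch hashes src/dst address (As16 covers IPv4 and IPv6),
// ports, and protocol using a fresh maphash.Hash per call. Parameter names
// mirror the Entry fields used above and are otherwise illustrative.
func conntrackEntryKeySketch(src, dst netaddr.IPPort, proto uint8) uint64 {
	var h maphash.Hash
	h.SetSeed(entryKeySeed)

	srcIP := src.IP().As16()
	dstIP := dst.IP().As16()
	h.Write(srcIP[:])
	h.Write(dstIP[:])

	var buf [2]byte
	binary.BigEndian.PutUint16(buf[:], src.Port())
	h.Write(buf[:])
	binary.BigEndian.PutUint16(buf[:], dst.Port())
	h.Write(buf[:])

	h.WriteByte(proto)
	return h.Sum64()
}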
🟡 Minor comments (13)
internal/networkmonitor/cilium_time.go-57-66 (1)

57-66: Fix grammatical error in error message.

Line 62 has a grammatical error: "Unable get time" should be "Unable to get time".

📝 Proposed fix
-		return 0, fmt.Errorf("Unable get time: %w", err)
+		return 0, fmt.Errorf("unable to get time: %w", err)

Note: Also lowercased the first letter to follow Go error message conventions.

test/stats/types.go-100-108 (1)

100-108: Reconsider timestamp field type for consistency.

The Timestamp field is string while other timestamp fields in this file use time.Time (SnapshotStats.Timestamp at line 38, Stats.FirstMessageTime at line 115). The string type is intentional—the code explicitly converts the timestamp using .AsTime().String() (testserver/main.go:394)—but this creates an unnecessary inconsistency.

If the intent is JSON serialization, using time.Time would be more consistent and eliminate the manual string conversion, letting JSON marshaling handle the formatting automatically. Align with other timestamp fields in the same file or document why this differs.
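
If the time.Time route is taken, a minimal sketch of the shape (type and field names are illustrative); encoding/json renders time.Time as RFC 3339, so the producer can populate it with ts.AsTime() instead of ts.AsTime().String().

package statssketch

import "time"

// NetworkTrafficStat sketches the suggested field type: time.Time marshals to
// an RFC 3339 string in JSON, matching the other timestamp fields in types.go.
type NetworkTrafficStat struct {
	Timestamp time.Time `json:"timestamp"`
}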

verification/provision_aks.sh-18-26 (1)

18-26: State file may not correctly track multiple CNI configurations.

The script includes CNI_TYPE in both the resource group and cluster names, but uses a single state file (aks.state). If a user changes CNI_TYPE between runs, the state file may reference the wrong cluster during deletion.

Consider one of these approaches:

🔧 Proposed fixes

Option 1: Include CNI_TYPE in state filename

 STATE_DIR="${STATE_DIR:-$HOME/.cache/zxporter-e2e}"
-STATE_FILE="${STATE_FILE:-${STATE_DIR}/aks.state}"
+STATE_FILE="${STATE_FILE:-${STATE_DIR}/aks-${CNI_TYPE}.state}"

Option 2: Store and validate CNI_TYPE in delete mode

In delete mode, after read_state:

 read_state
+if [[ -n "${CNI_TYPE:-}" && "${CNI_TYPE}" != "${CNI_TYPE_FROM_STATE:-kubenet}" ]]; then
+  echo "WARNING: Current CNI_TYPE (${CNI_TYPE}) differs from state (${CNI_TYPE_FROM_STATE})" >&2
+fi
internal/networkmonitor/ebpf/types.go-18-28 (1)

18-28: Bug: Headers are repeated for each item in the loop.

The "Questions:\n" and "Answers:\n" headers are written inside the loops, causing them to repeat for each entry. They should be printed once before each loop.

Proposed fix
 func (e DNSEvent) String() string {
 	var str strings.Builder
+	if len(e.Questions) > 0 {
+		str.WriteString("Questions:\n")
+	}
 	for _, v := range e.Questions {
-		str.WriteString("Questions:\n")
 		str.WriteString(fmt.Sprintf("%s %s %s \n", v.Class, v.Type, string(v.Name)))
 	}
+	if len(e.Answers) > 0 {
+		str.WriteString("Answers:\n")
+	}
 	for _, v := range e.Answers {
-		str.WriteString("Answers:\n")
 		str.WriteString(fmt.Sprintf("%s %s %s [%s] [%s] \n", v.Class, v.Type, string(v.Name), v.IP, v.CNAME))
 	}
 	return str.String()
 }
.github/workflows/helm-release-netmon.yml-94-125 (1)

94-125: Tag is created before Chart.yaml is updated in the repository.

The workflow creates and pushes the git tag (lines 94-99) before updating Chart.yaml (lines 122-125). This means the tagged commit doesn't reflect the new version in Chart.yaml. While the packaged chart will have the correct version (via --version flag), the source at the tagged commit will show the old version.

Consider reordering: update Chart.yaml first, commit, then tag the commit with the updated version.

verification/verify_e2e.sh-115-123 (1)

115-123: Duplicate --set image.pullPolicy flag.

Line 120 duplicates line 119. Remove the duplicate.

Proposed fix
     helm template zxporter-netmon helm-chart/zxporter-netmon \
         --namespace $NAMESPACE \
         --set image.repository=ttl.sh/zxporter-netmon \
         --set image.tag=$TAG \
         --set image.pullPolicy=$PULL_POLICY \
-        --set image.pullPolicy=$PULL_POLICY \
         --set config.exportInterval="10s" \
         | kubectl apply -f -
internal/networkmonitor/ebpf/tracer_linux.go-133-142 (1)

133-142: InitCgroupv2 silently ignores non-ErrCgroup2NotMounted errors from detectCgroupPath.

If detectCgroupPath returns an error other than ErrCgroup2NotMounted, it's not propagated. The function returns nil, potentially masking real issues.

Proposed fix
 func InitCgroupv2(log logr.Logger) error {
 	_, err := detectCgroupPath()
 	if errors.Is(err, ErrCgroup2NotMounted) {
 		log.Info("mounting cgroup v2")
 		if err := mountCgroup2(); err != nil {
 			return fmt.Errorf("cgroup2 not mounted and failed to mount manually: %w", err)
 		}
+		return nil
 	}
-	return nil
+	return err
 }
.github/workflows/push-zxporter-netmon-image.yml-137-137 (1)

137-137: Fix typo in error message.

"Definately" should be "Definitely".

✏️ Proposed fix
-                    # Definately have to fail if any non-latest tags exist
+                    # Definitely have to fail if any non-latest tags exist
                     if [ ${#NON_LATEST_EXISTING[@]} -gt 0 ]; then
cmd/zxporter-netmon/main.go-46-48 (1)

46-48: Handle or log error from logger initialization.

The error from zap.NewProduction() is silently discarded. If logger initialization fails, the program continues with a potentially invalid logger, which could lead to panics or silent failure of logging.

🛡️ Proposed fix
 	// Initialize Logger
-	zapLog, _ := zap.NewProduction()
+	zapLog, err := zap.NewProduction()
+	if err != nil {
+		fmt.Fprintf(os.Stderr, "failed to initialize logger: %v\n", err)
+		os.Exit(1)
+	}
 	logger := zapr.NewLogger(zapLog)
internal/networkmonitor/conntrack_client.go-123-126 (1)

123-126: Silent error suppression may hide invalid IP addresses.

The error from netaddr.FromStdIP is deliberately ignored. If the kernel returns an invalid IP format, this could silently produce zero-value IPs in entries, leading to incorrect filtering or metrics.

Consider logging when conversion fails:

🔍 Proposed improvement
-func ipFromStdIP(ip net.IP) netaddr.IP {
-	res, _ := netaddr.FromStdIP(ip)
-	return res
-}
+func (c *NetfilterClient) ipFromStdIP(ip net.IP) netaddr.IP {
+	res, ok := netaddr.FromStdIP(ip)
+	if !ok {
+		c.log.Info("Failed to convert IP address", "ip", ip.String())
+	}
+	return res
+}

Note: This would require making ipFromStdIP a method to access the logger.

Committable suggestion skipped: line range outside the PR's diff.

.github/workflows/push-zxporter-netmon-image.yml-237-241 (1)

237-241: Update checkout action to v4.

The workflow uses actions/checkout@v3 which is too old to run on current GitHub Actions runners.

🔄 Proposed fix
             - name: Checkout repo
-              uses: actions/checkout@v3
+              uses: actions/checkout@v4
               with:
                   repository: devzero-inc/zxporter
.github/workflows/push-zxporter-netmon-image.yml-306-319 (1)

306-319: Fix undefined workflow_name reference.

Lines 311 and 318 reference ${{ inputs.workflow_name }} but this input is not defined in the workflow's workflow_dispatch inputs (which only has push_latest). This will evaluate to an empty string.

🔧 Proposed fix
             - name: Notify Slack on failure
               uses: ./services/.github/actions/slack-notify-failure
               if: always()
               with:
                   slack_bot_token: ${{ secrets.SLACK_BOT_TOKEN }}
-                  workflow_name: ${{ inputs.workflow_name || 'Update zxporter-netmon helm chart' }}
+                  workflow_name: 'Update zxporter-netmon helm chart'
 
             - name: Notify Slack on success
               uses: ./services/.github/actions/slack-notify-success
               if: env.SHOULD_CREATE_PR == 'true'
               with:
                   slack_bot_token: ${{ secrets.SLACK_BOT_TOKEN }}
-                  workflow_name: ${{ inputs.workflow_name || 'Update zxporter-netmon helm chart' }}
+                  workflow_name: 'Update zxporter-netmon helm chart'
                   pr_url: ${{ env.PR_URL }}
internal/networkmonitor/ebpf/c/egressd.c-56-60 (1)

56-60: Unchecked return value when re-reading IPv4 header with options.

When ihl > 5, the re-read of the IPv4 header does not check the return value of bpf_skb_load_bytes_relative. If this fails, the program continues with potentially invalid header data.

🐛 Proposed fix
     u32 ihl = nethdrs->iphdrs.iphdr.ihl;
     if (ihl > 5) { // re-read IPv4 header if needed
         size -= get_type_size(struct iphdr);
         size += ihl * 4;
-        bpf_skb_load_bytes_relative(ctx, 0, dest, size, 1);
+        if (bpf_skb_load_bytes_relative(ctx, 0, dest, size, 1)) {
+            return 1;
+        }
     }
🧹 Nitpick comments (40)
internal/networkmonitor/cilium_time.go (1)

14-20: Consider logging or handling Hz detection failure.

The silent fallback to hertz = 1 on error may mask kernel Hz detection failures, potentially leading to incorrect jiffies-based timing calculations. Consider logging the error or documenting when hertz = 1 is an acceptable fallback.

📝 Consider adding error logging
 func init() {
 	var err error
 	hertz, err = getKernelHZ()
 	if err != nil {
+		// Log error or use a more appropriate fallback
+		// logger.Warnf("Failed to detect kernel HZ, using fallback: %v", err)
 		hertz = 1
 	}
 }
internal/networkmonitor/cilium_kernel_hz.go (1)

87-107: Consider refactoring for clarity (optional).

The unsigned subtraction pattern on lines 95-97 works correctly but relies on wrap-around behavior that may not be immediately obvious. Consider computing the absolute difference more explicitly for better readability.

♻️ Optional refactor for clarity
 func nearest(in uint16, values []uint16) (uint16, error) {
 	if len(values) == 0 {
 		return 0, errors.New("values cannot be empty")
 	}
 
 	var out uint16
 	min := ^uint16(0)
 	for _, v := range values {
-		d := uint16(in - v)
+		var d uint16
 		if in < v {
 			d = v - in
+		} else {
+			d = in - v
 		}
 
 		if d < min {
 			min = d
 			out = v
 		}
 	}
 
 	return out, nil
 }
verification/provision_gke.sh (2)

35-39: Consider renaming for consistency across provisioning scripts.

The function is named load_state here, while the EKS and AKS scripts use read_state for the same purpose. Using consistent naming across all three cloud provider scripts improves maintainability.

♻️ Proposed fix
-load_state() {
+read_state() {
   if [[ -f "${STATE_FILE}" ]]; then
     source "${STATE_FILE}"
   fi
 }

Then update the call site on Line 99:

 if [[ "${MODE}" == "delete" ]]; then
-  load_state
+  read_state

36-39: Suppress expected ShellCheck warning for dynamic source.

The ShellCheck warning SC1090 about non-constant source is expected here since the state file is dynamically sourced. Consider adding a directive to suppress this known false positive.

♻️ Proposed fix
 load_state() {
   if [[ -f "${STATE_FILE}" ]]; then
+    # shellcheck source=/dev/null
     source "${STATE_FILE}"
   fi
 }
verification/provision_eks.sh (2)

140-142: Track or address the autoModeConfig TODO.

The commented TODO mentions testing is still required for the auto mode config. Since eksctl is moving toward making this the default, this should be addressed soon.

Would you like me to open an issue to track testing autoModeConfig before it becomes the default in upcoming eksctl releases?


56-58: Suppress expected ShellCheck warning for dynamic source.

Same as the GKE script, add a directive to suppress the SC1090 warning for the dynamically sourced state file.

♻️ Proposed fix
 read_state() {
+  # shellcheck source=/dev/null
   [[ -f "${STATE_FILE}" ]] && source "${STATE_FILE}"
 }
verification/provision_aks.sh (2)

118-126: Consider prioritizing LoadBalancer service cleanup.

AKS LoadBalancer services create Azure Load Balancer and Public IP resources. Deleting these first helps ensure clean resource group deletion.

♻️ Proposed enhancement
     echo "Cleaning up k8s resources (best effort)..."
+    # Delete LoadBalancer services first to clean up Azure LB/IP resources
+    kubectl delete svc -A --field-selector spec.type=LoadBalancer --timeout=30s >/dev/null 2>&1 || true
     kubectl delete svc -A --all --timeout=10s >/dev/null 2>&1 || true
     kubectl delete pvc -A --all --timeout=10s >/dev/null 2>&1 || true

38-40: Suppress expected ShellCheck warning for dynamic source.

Same SC1090 warning as the other provisioning scripts. Add the suppression directive for consistency.

♻️ Proposed fix
 read_state() {
+  # shellcheck source=/dev/null
   [[ -f "${STATE_FILE}" ]] && source "${STATE_FILE}"
 }
internal/networkmonitor/ebpf/tracer_linux_test.go (1)

10-17: Expand test coverage with error cases and edge cases.

The test only covers one happy-path scenario. Consider adding tests for invalid payloads, truncated packets, and various packet types to ensure robust parsing.

📋 Suggested additional test cases
func TestParseEvent_ErrorCases(t *testing.T) {
	tests := []struct {
		name    string
		payload []byte
		wantErr bool
	}{
		{
			name:    "empty payload",
			payload: []byte{},
			wantErr: true,
		},
		{
			name:    "truncated IP header",
			payload: []byte{69, 0, 0, 80},
			wantErr: true,
		},
		{
			name:    "invalid IP version",
			payload: []byte{0, 0, 0, 80, 106, 126, 64, 0},
			wantErr: true,
		},
	}
	
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			_, err := parseEvent(tt.payload)
			if (err != nil) != tt.wantErr {
				t.Errorf("parseEvent() error = %v, wantErr %v", err, tt.wantErr)
			}
		})
	}
}
verification/kind-config.yaml (1)

1-27: Consider using YAML anchors to reduce duplication.

The extraMounts configuration is duplicated across all three nodes. While this works correctly, you can improve maintainability by using YAML anchors and aliases.

♻️ Proposed refactor using YAML anchors
 kind: Cluster
 apiVersion: kind.x-k8s.io/v1alpha4
+
+# Define the common mounts as an anchor
+x-common-mounts: &common-mounts
+  - hostPath: /lib/modules
+    containerPath: /lib/modules
+    readOnly: true
+  - hostPath: /sys/fs/bpf
+    containerPath: /sys/fs/bpf
+    readOnly: false
+
 nodes:
   - role: control-plane
-    extraMounts:
-      - hostPath: /lib/modules
-        containerPath: /lib/modules
-        readOnly: true
-      - hostPath: /sys/fs/bpf
-        containerPath: /sys/fs/bpf
-        readOnly: false
+    extraMounts: *common-mounts
   - role: worker
-    extraMounts:
-      - hostPath: /lib/modules
-        containerPath: /lib/modules
-        readOnly: true
-      - hostPath: /sys/fs/bpf
-        containerPath: /sys/fs/bpf
-        readOnly: false
+    extraMounts: *common-mounts
   - role: worker
-    extraMounts:
-      - hostPath: /lib/modules
-        containerPath: /lib/modules
-        readOnly: true
-      - hostPath: /sys/fs/bpf
-        containerPath: /sys/fs/bpf
-        readOnly: false
+    extraMounts: *common-mounts
verification/lima.yaml (1)

18-22: Consider adding error handling to the provisioning script.

The provisioning script could be made more robust by adding error handling and non-interactive mode for apt-get.

♻️ Proposed enhancement
 provision:
   - mode: system
     script: |
+      set -euo pipefail
+      export DEBIAN_FRONTEND=noninteractive
       apt-get update
       apt-get install -y conntrack iptables

This adds:

  • set -euo pipefail to fail fast on errors
  • DEBIAN_FRONTEND=noninteractive to prevent interactive prompts during installation
verification/verify_local.sh (2)

1-17: Add prerequisite checks and better error messages.

The script assumes Lima is installed and the instance is running, but doesn't verify these prerequisites. This could lead to cryptic errors for users.

📋 Suggested improvements
 #!/bin/bash
 set -e

+# Check prerequisites
+if ! command -v limactl &> /dev/null; then
+    echo "Error: limactl is not installed. Please install Lima first." >&2
+    exit 1
+fi
+
+INSTANCE_NAME=${1:-default}
+
+# Verify Lima instance exists and is running
+if ! limactl list "$INSTANCE_NAME" &> /dev/null; then
+    echo "Error: Lima instance '$INSTANCE_NAME' does not exist." >&2
+    echo "Run 'make setup-lima' to create it." >&2
+    exit 1
+fi
+
+if ! limactl list "$INSTANCE_NAME" | grep -q "Running"; then
+    echo "Error: Lima instance '$INSTANCE_NAME' is not running." >&2
+    exit 1
+fi
+
 # build zxporter-netmon for linux (matching host arch for Lima)
 ARCH=$(go env GOARCH)
 echo "Building zxporter-netmon for Linux/$ARCH..."
 GOOS=linux GOARCH=$ARCH go build -o bin/zxporter-netmon-linux ./cmd/zxporter-netmon/main.go
 
 echo "Build complete: bin/zxporter-netmon-linux"
 echo ""
 
-INSTANCE_NAME=${1:-default}
 echo "Running in Lima instance: $INSTANCE_NAME"

20-23: Consider capturing output and validating execution.

The script runs the binary but doesn't capture or validate its output. Consider adding basic validation to ensure the binary executed successfully.

💡 Suggested enhancement
 # Execute in VM
 # Using sudo to ensure permissions for conntrack/ebpf
 echo "Executing zxporter-netmon in VM..."
 limactl shell $INSTANCE_NAME sudo chmod +x /tmp/zxporter-netmon-linux
-limactl shell $INSTANCE_NAME sudo /tmp/zxporter-netmon-linux --standalone
+
+# Run with timeout and capture output
+echo "Running for 10 seconds to verify functionality..."
+if limactl shell $INSTANCE_NAME sudo timeout 10 /tmp/zxporter-netmon-linux --standalone 2>&1 | tee /tmp/netmon-output.log; then
+    echo "✓ Binary executed successfully"
+else
+    EXIT_CODE=$?
+    if [ $EXIT_CODE -eq 124 ]; then
+        echo "✓ Binary ran and was terminated after timeout (expected)"
+    else
+        echo "✗ Binary execution failed with exit code: $EXIT_CODE" >&2
+        exit 1
+    fi
+fi
verification/traffic-gen.yaml (1)

37-53: Consider adding security context hardening.

The container lacks explicit security constraints. While this is a verification workload, following security best practices improves the overall security posture. Consider adding:

🔒 Suggested security hardening
       containers:
         - name: traffic-gen
           image: curlimages/curl
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            runAsUser: 100
+            capabilities:
+              drop:
+                - ALL
           command: ["/bin/sh"]
           args:
verification/traffic-server.yaml (1)

37-55: Consider using a non-privileged port for the container.

The container runs Python's http.server on port 80, which requires root privileges or special capabilities. For a test workload, using a non-privileged port (≥1024) allows running as non-root and improves security posture.

🔒 Suggested changes for non-privileged port

Update the container to use port 8080 and add security context:

       containers:
         - name: traffic-server
           image: python:3-slim
-          command: ["python3", "-m", "http.server", "80"]
+          command: ["python3", "-m", "http.server", "8080"]
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            runAsUser: 1000
+            capabilities:
+              drop:
+                - ALL
           ports:
-            - containerPort: 80
+            - containerPort: 8080
 ---
 apiVersion: v1
 kind: Service
 metadata:
   name: traffic-server
   namespace: default
 spec:
   selector:
     app: traffic-server
   ports:
     - protocol: TCP
       port: 80
-      targetPort: 80
+      targetPort: 8080

This keeps the Service accessible on port 80 while the container runs securely on 8080.

internal/networkmonitor/ebpf/tracer.go (1)

10-19: Consider validating QueueSize for negative values.

If a negative QueueSize is passed, make(chan DNSEvent, cfg.QueueSize) will panic at runtime. Consider adding validation:

Suggested validation
 func NewTracer(log logr.Logger, cfg Config) *Tracer {
-	if cfg.QueueSize == 0 {
+	if cfg.QueueSize <= 0 {
 		cfg.QueueSize = 1000
 	}
 	return &Tracer{
helm-chart/zxporter-netmon/values.yaml (2)

4-5: Avoid using latest tag as default.

Using latest as the default image tag makes deployments non-reproducible and complicates rollbacks. Consider defaulting to the chart's appVersion or requiring an explicit tag.

Suggested fix
   # Overrides the image tag whose default is the chart appVersion.
-  tag: "latest"
+  tag: ""  # Defaults to chart appVersion if empty

21-22: Redundant AppArmor configuration.

The AppArmor annotation on line 22 (container.apparmor.security.beta.kubernetes.io/zxporter-netmon: "unconfined") duplicates the securityContext.appArmorProfile on lines 29-30. The annotation uses the deprecated beta API. Consider removing the annotation and relying solely on the securityContext.appArmorProfile for Kubernetes 1.30+.

Also applies to: 27-34

config/zxporter-netmon.yaml (1)

36-36: Avoid hardcoding latest tag in manifests.

Using latest makes deployments non-reproducible. Pin to a specific version or use the Helm chart which can template the version.

internal/networkmonitor/ebpf/bpf_bpfel.go (1)

15-24: Minor: Redundant error variable in return.

Line 23 returns spec, err where err is guaranteed to be nil after the check on lines 19-21.

Simplified return
 func loadBpf() (*ebpf.CollectionSpec, error) {
 	reader := bytes.NewReader(_BpfBytes)
 	spec, err := ebpf.LoadCollectionSpecFromReader(reader)
 	if err != nil {
 		return nil, fmt.Errorf("can't load bpf: %w", err)
 	}
 
-	return spec, err
+	return spec, nil
 }
.github/workflows/helm-release-netmon.yml (1)

56-66: Quote shell variables to prevent word splitting.

Static analysis flagged potential word splitting issues. While unlikely to cause problems with semantic version strings, quoting these variables is a defensive best practice.

Proposed fix
             - name: Get last chart version from tags
               id: get_version
               run: |
                   LAST_TAG=$(git tag --list '${{ env.CHART_TAG_PREFIX }}*' --sort=-v:refname | head -n 1)
                   if [ -z "$LAST_TAG" ]; then
                     LAST_VERSION="0.0.0"
                   else
-                    LAST_VERSION="${LAST_TAG#${{ env.CHART_TAG_PREFIX }}}"
+                    LAST_VERSION="${LAST_TAG#"${{ env.CHART_TAG_PREFIX }}"}"
                   fi
                   echo "Last version: $LAST_VERSION"
-                  echo "last_version=$LAST_VERSION" >> $GITHUB_OUTPUT
+                  echo "last_version=$LAST_VERSION" >> "$GITHUB_OUTPUT"
internal/networkmonitor/ebpf/bpf_bpfeb.go (1)

107-114: Early return on error may leave resources unclosed.

_BpfClose returns immediately on the first error, potentially leaving subsequent closers unclosed. While this is common in generated code and unlikely to cause issues in practice (eBPF Close rarely fails), a more robust pattern would collect all errors.

Since this is generated code, consider whether the generator (bpf2go) can be configured to use a different pattern, or accept this as a known limitation.

internal/networkmonitor/pod_cache.go (2)

15-15: Unused store field.

The store field is assigned on line 37 but never used. Either remove it or document its intended purpose for future use.

If unused, remove the field
 type PodCache struct {
 	mu    sync.RWMutex
 	ips   map[netaddr.IP]*corev1.Pod // Using netaddr.IP for efficient lookup
-	store cache.Store
 }

And in NewPodCache:

-		pc.store = informer.GetStore()

Also applies to: 37-37


26-36: Unchecked type assertions in event handlers could panic.

The AddFunc and UpdateFunc handlers perform unchecked type assertions. While the informer contract should guarantee Pod objects, adding defensive checks (like in deletePod) prevents panics from unexpected objects.

Add defensive type checks
 		informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
 			AddFunc: func(obj interface{}) {
-				pc.updatePod(obj.(*corev1.Pod))
+				if pod, ok := obj.(*corev1.Pod); ok {
+					pc.updatePod(pod)
+				}
 			},
 			UpdateFunc: func(old, new interface{}) {
-				pc.updatePod(new.(*corev1.Pod))
+				if pod, ok := new.(*corev1.Pod); ok {
+					pc.updatePod(pod)
+				}
 			},
 			DeleteFunc: func(obj interface{}) {
 				pc.deletePod(obj)
 			},
 		})
internal/networkmonitor/ebpf/tracer_linux.go (1)

191-207: Missing scanner error check in detectCgroupPath.

After the scan loop, scanner.Err() should be checked to catch any errors that occurred during scanning (e.g., I/O errors reading /proc/mounts).

Proposed fix
 func detectCgroupPath() (string, error) {
 	f, err := os.Open("/proc/mounts")
 	if err != nil {
 		return "", err
 	}
 	defer f.Close()

 	scanner := bufio.NewScanner(f)
 	for scanner.Scan() {
 		fields := strings.Split(scanner.Text(), " ")
 		if len(fields) >= 3 && fields[2] == "cgroup2" {
 			return fields[1], nil
 		}
 	}

+	if err := scanner.Err(); err != nil {
+		return "", fmt.Errorf("reading /proc/mounts: %w", err)
+	}
+
 	return "", ErrCgroup2NotMounted
 }
verification/verify_e2e.sh (2)

24-24: Unused variable ARCH.

The ARCH variable is set but never used in the script. Either remove it or use it (e.g., for multi-arch image builds).

Remove unused variable
-ARCH=$(go env GOARCH)
 TAG="dev-$(date +%s)"

283-297: Variable naming inconsistency and unquoted variables.

RETRY_Count uses mixed case while other variables use UPPER_SNAKE_CASE. Also, $STATS_JSON should be quoted in the echo and grep commands to handle special characters safely.

Proposed fix
-        RETRY_Count=0
+        RETRY_COUNT=0
         TOTAL_MSGS=0
         
-        while [ $RETRY_Count -lt $MAX_RETRIES ]; do
+        while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
             STATS_JSON=$(kubectl exec curl-validator -n $NAMESPACE -- curl -v -s http://testserver.${NAMESPACE}.svc.cluster.local:8080/stats)
-            TOTAL_MSGS=$(echo $STATS_JSON | grep -o '"total_messages": [0-9]*' | awk '{print $2}')
+            TOTAL_MSGS=$(echo "$STATS_JSON" | grep -o '"total_messages": [0-9]*' | awk '{print $2}')
             
             if [[ "$TOTAL_MSGS" -gt 0 ]]; then
                 break
             fi
             
-            echo " [INFO] TestServer has 0 messages. Waiting 5s for flush... ($((RETRY_Count+1))/$MAX_RETRIES)"
+            echo " [INFO] TestServer has 0 messages. Waiting 5s for flush... ($((RETRY_COUNT+1))/$MAX_RETRIES)"
             sleep 5
-            RETRY_Count=$((RETRY_Count+1))
+            RETRY_COUNT=$((RETRY_COUNT+1))
         done
internal/networkmonitor/cilium_client_linux.go (3)

39-52: defer m.Close() inside loop defers all closes until function return.

All defer m.Close() calls will execute when listRecords returns, not when each iteration completes. With many maps, this could hold file descriptors longer than necessary. Consider using an immediately-invoked function or explicit close.

Proposed fix using helper function
 	for _, m := range maps {
-		m := m.(ctmap.CtMap)
+		ctMap, ok := m.(ctmap.CtMap)
+		if !ok {
+			continue
+		}
+		if err := processMap(ctMap, now, timeDiff, filter, &entries); err != nil {
+			return nil, err
+		}
+	}
+	return entries, nil
+}
+
+func processMap(m ctmap.CtMap, now time.Time, timeDiff func(int64) int64, filter EntriesFilter, entries *[]*Entry) error {
 		_, err := m.Path()
 		if err == nil {
 			err = m.Open()
 		}
 		if err != nil {
 			if os.IsNotExist(err) {
-				continue
+				return nil
 			}
+			return err
 		}
-
 		defer m.Close()
+		// ... rest of callback and dump logic

Alternatively, close explicitly at end of each iteration without defer.


53-76: Unchecked type assertions in callback could cause panics.

Lines 54 and 61 perform type assertions without checking. While the ctmap API should guarantee these types, a panic inside the callback would be difficult to debug. Consider adding defensive checks or using comma-ok idiom.

Add defensive type checks
 		cb := func(key bpf.MapKey, v bpf.MapValue) {
-			k := key.(ctmap.CtKey).ToHost().(*ctmap.CtKey4Global)
+			ctKey, ok := key.(ctmap.CtKey)
+			if !ok {
+				return
+			}
+			k, ok := ctKey.ToHost().(*ctmap.CtKey4Global)
+			if !ok {
+				return // Skip non-global entries (e.g., CtKey6Global for IPv6)
+			}
 			if k.NextHeader == 0 {
 				return
 			}

 			srcIP := k.DestAddr.IP()
 			dstIP := k.SourceAddr.IP()
-			val := v.(*ctmap.CtEntry)
+			val, ok := v.(*ctmap.CtEntry)
+			if !ok {
+				return
+			}

84-93: Magic numbers in initMaps lack documentation.

The parameters to ctmap.InitMapInfo(2<<18, 2<<17, true, false, true) are unclear. Consider adding a comment explaining what these values represent (map sizes, flags, etc.) for maintainability.

Add explanatory comment
 func initMaps() []interface{} {
 	// We might need to handle failure here gracefully or ensure we have permissions
+	// InitMapInfo(maxTCPEntries=524288, maxAnyEntries=262144, ipv6=true, ipv4=false, nodePort=true)
 	ctmap.InitMapInfo(2<<18, 2<<17, true, false, true)
internal/networkmonitor/cilium_client.go (1)

28-35: Remove obsolete comment about logr adaptation.

The type alias logrusLogger = logr.Logger directly uses logr without adaptation. The comment on lines 33-34 suggests uncertainty but the implementation is straightforward.

🧹 Proposed cleanup
 type ciliumClient struct {
-	log         logrusLogger // wrapping logr to match potential interface or just using logr
+	log         logrusLogger
 	maps        []interface{}
 	clockSource ClockSource
 }
 
-// We need a helper to adapt logr to what we need or just use logr directly.
-// The code in linux implementation uses `c.log`
 type logrusLogger = logr.Logger
.github/workflows/push-zxporter-netmon-image.yml (1)

88-168: Shell script could benefit from quoting variables, but functionally correct.

The shellcheck warnings about unquoted variables (SC2086) and array handling (SC2145) are valid style improvements but don't affect functionality in this context since the variables don't contain spaces.

If you want to address the shellcheck warnings for cleaner shell hygiene, quote variables like "$tag", "$GITVERSION", etc., and use "${TAGS[@]}" instead of ${TAGS[@]} in echo statements.

cmd/zxporter-netmon/main.go (1)

32-32: Consider renaming flag to match its purpose.

The flag is named metrics-kubeconfig but it's used for general Kubernetes client configuration, not just metrics. Consider renaming to simply kubeconfig for clarity.

📝 Suggested improvement
-	kubeconfig      = flag.String("metrics-kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster.")
+	kubeconfig      = flag.String("kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster.")
internal/networkmonitor/ebpf/c/egressd.c (2)

19-21: Unused function get_pid().

This function is defined but never called anywhere in the file. Consider removing it to avoid dead code, or use it if PID information is needed in the event context.

🔧 Remove unused function
-statfunc int get_pid() {
-    return bpf_get_current_pid_tgid() >> 32;
-}
-

74-75: Unnecessary NULL check - dest cannot be NULL here.

The variable dest is assigned from &nethdrs->protohdrs.udphdr (line 72), which is a pointer to a struct member on the stack. This can never be NULL, making this check dead code.

🔧 Remove unnecessary check
     size = get_type_size(struct udphdr);
     dest = &nethdrs->protohdrs.udphdr;

-    if (!dest)
-        return 1;
-
     // Load L4 headers.
Makefile (1)

703-709: Lima setup is macOS-specific.

The setup-lima target uses brew install lima, which only works on macOS. Consider adding a fallback or documentation for Linux users.

🔧 Proposed improvement with cross-platform detection
 .PHONY: setup-lima
 setup-lima:
 	@echo "Checking for Lima..."
-	@which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
+	@which limactl >/dev/null || { \
+		if command -v brew >/dev/null 2>&1; then \
+			echo "Installing Lima via brew..." && brew install lima; \
+		else \
+			echo "Error: Lima (limactl) not found. Please install Lima manually: https://lima-vm.io/docs/installation/" && exit 1; \
+		fi \
+	}
 	@echo "Checking if instance $(LIMA_INSTANCE) exists..."
internal/networkmonitor/dns/ip2dns.go (2)

40-49: Exported Tracer field inconsistent with unexported interface type.

The Tracer field is exported (capital T) but uses an unexported interface type tracer. This allows external packages to set the field but not implement the interface, which is inconsistent. Consider making it unexported for encapsulation.

🔧 Proposed fix
 type IP2DNS struct {
-	Tracer tracer
+	tracer tracer
 	log    logr.Logger

 	mu      sync.Mutex
 	lookups []DNSLookup

 	// Keep cache for enrichment if needed, though mostly redundant with Lookups stream
 	ipToName *cache.Cache[string, string]
 }

Update the constructor and Start method accordingly:

 func NewIP2DNS(tracer tracer, log logr.Logger) *IP2DNS {
 	return &IP2DNS{
-		Tracer: tracer,
+		tracer: tracer,
 		log:    log,
 	}
 }

 func (d *IP2DNS) Start(ctx context.Context) error {
 	d.ipToName = cache.NewContext[string, string](ctx)

 	ctx, cancel := context.WithCancel(ctx)
 	defer cancel()
 	errch := make(chan error, 1)
 	go func() {
-		err := d.Tracer.Run(ctx)
+		err := d.tracer.Run(ctx)
 		errch <- err
 	}()

-	evCh := d.Tracer.Events()
+	evCh := d.tracer.Events()

12-12: Consider migrating from deprecated inet.af/netaddr to net/netip.

The inet.af/netaddr package has been deprecated in favor of Go's standard library net/netip (available since Go 1.18). Consider migrating for long-term maintainability.

🔧 Migration example
-	"inet.af/netaddr"
+	"net/netip"

And update usage:

-				ip, ok := netaddr.FromStdIP(answer.IP)
+				ip, ok := netip.AddrFromSlice(answer.IP)
internal/networkmonitor/monitor.go (2)

249-265: Side effect: mutates cached NetworkFlow during read.

Lines 254-255 and 261-262 mutate f.SrcPodName, f.SrcPodNamespace, etc. directly on the NetworkFlow stored in m.podMetrics. This modifies the cached state during what should be a read-only "peek" operation. Consider enriching only the copy that's appended to flows, not the original cached entry.

♻️ Proposed fix
 	for _, f := range m.podMetrics {
+		// Create a copy first to avoid mutating cached state
+		flowCopy := &NetworkFlow{
+			SrcIP:     f.SrcIP,
+			DstIP:     f.DstIP,
+			Protocol:  f.Protocol,
+			DstPort:   f.DstPort,
+			TxBytes:   f.TxBytes,
+			RxBytes:   f.RxBytes,
+			TxPackets: f.TxPackets,
+			RxPackets: f.RxPackets,
+			Timestamp: f.Timestamp,
+		}
+
 		// Enrich with Pod Metadata
 		if m.podCache != nil {
 			if srcIP, err := netaddr.ParseIP(f.SrcIP); err == nil {
 				if srcPod, ok := m.podCache.GetPodByIP(srcIP); ok {
-					f.SrcPodName = srcPod.Name
-					f.SrcPodNamespace = srcPod.Namespace
+					flowCopy.SrcPodName = srcPod.Name
+					flowCopy.SrcPodNamespace = srcPod.Namespace
 				}
 			}

 			if dstIP, err := netaddr.ParseIP(f.DstIP); err == nil {
 				if dstPod, ok := m.podCache.GetPodByIP(dstIP); ok {
-					f.DstPodName = dstPod.Name
-					f.DstPodNamespace = dstPod.Namespace
+					flowCopy.DstPodName = dstPod.Name
+					flowCopy.DstPodNamespace = dstPod.Namespace
 				}
 			}
 		}

-		// Create a copy to return (Peek behavior, no reset)
-		flows = append(flows, &NetworkFlow{
-			SrcIP:           f.SrcIP,
-			...
-		})
+		flows = append(flows, flowCopy)
 	}

316-320: Response corruption if JSON encoding fails mid-stream.

json.NewEncoder(w).Encode(resp) writes directly to the http.ResponseWriter. If encoding fails partway through, the response will be partially written and corrupted. Consider encoding to a buffer first, then writing the complete response.

♻️ Proposed fix
+	var buf bytes.Buffer
+	if err := json.NewEncoder(&buf).Encode(resp); err != nil {
+		m.log.Error(err, "Failed to encode metrics")
+		http.Error(w, "Internal Server Error", http.StatusInternalServerError)
+		return
+	}
+
 	w.Header().Set("Content-Type", "application/json")
-	if err := json.NewEncoder(w).Encode(resp); err != nil {
-		m.log.Error(err, "Failed to encode metrics")
-		http.Error(w, "Internal Server Error", http.StatusInternalServerError)
-	}
+	w.Write(buf.Bytes())

Note: Add "bytes" to imports.

Comment on lines +107 to +114
func _BpfClose(closers ...io.Closer) error {
	for _, closer := range closers {
		if err := closer.Close(); err != nil {
			return err
		}
	}
	return nil
}

⚠️ Potential issue | 🟠 Major

Potential resource leak in cleanup helper.

_BpfClose returns on the first error, leaving subsequent resources unclosed. If an early Close() fails, later file descriptors or maps may leak.

🔧 Proposed fix to close all resources
 func _BpfClose(closers ...io.Closer) error {
+	var firstErr error
 	for _, closer := range closers {
 		if err := closer.Close(); err != nil {
-			return err
+			if firstErr == nil {
+				firstErr = err
+			}
 		}
 	}
-	return nil
+	return firstErr
 }

Note: Since this is generated code (line 1), the fix should ideally be applied to the bpf2go tool itself. If that's not feasible, consider documenting this limitation or manually patching the generated file.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func _BpfClose(closers ...io.Closer) error {
	for _, closer := range closers {
		if err := closer.Close(); err != nil {
			return err
		}
	}
	return nil
}
func _BpfClose(closers ...io.Closer) error {
	var firstErr error
	for _, closer := range closers {
		if err := closer.Close(); err != nil {
			if firstErr == nil {
				firstErr = err
			}
		}
	}
	return firstErr
}
🤖 Prompt for AI Agents
In @internal/networkmonitor/ebpf/bpf_bpfel.go around lines 107 - 114, _BpfClose
currently returns on the first Close error which can leak later resources;
change it to always iterate and call Close on every io.Closer in the closers
slice, capturing either the first non-nil error or aggregating errors (e.g.,
using errors.Join if available) and return that after the loop; update the
generated function _BpfClose to perform all closes and then return the collected
error, and if possible fix the bpf2go generator so future outputs behave the
same.

#!/usr/bin/env bash

# Version of libbpf to fetch headers from
LIBBPF_VERSION=0.6.1

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

latest stable version of libbpf 2026

💡 Result:

Latest stable libbpf: v1.6.2 (release tag v1.6.2, released Aug 21, 2025). [1][2]

Sources:
[1] GitHub releases — libbpf v1.6.2.
[2] Debian package tracker — libbpf 1.6.2 packaging/news.


🌐 Web query:

libbpf 0.6.1 security vulnerabilities CVE

💡 Result:

libbpf 0.6.1 is affected by two publicly‑reported heap‑buffer‑overflow vulnerabilities:

  • CVE-2021-45940 — heap‑based buffer overflow (4 bytes) in bpf_object__open; affects libbpf 0.6.0 and 0.6.1. Recommended fix: upgrade to a fixed upstream release (fixed in libbpf 0.7.0). [1][2]

  • CVE-2021-45941 — heap‑based buffer overflow (8 bytes) in __bpf_object__open; affects libbpf 0.6.0 and 0.6.1. Recommended fix: upgrade to a fixed upstream release (fixed in 0.7.0 per downstream trackers). [2][3]

Impact: remote denial‑of‑service (availability) by exploiting specially‑crafted files; CVSS ~6.5 (medium). [1][2]

Sources:

  1. OSV / NVD entry for CVE‑2021‑45940. [OSV/NVD]. [1]
  2. Debian security tracker (CVE‑2021‑45940 / CVE‑2021‑45941) — notes affected versions and that fixes are in later packages (upstream 0.7.0). [2]
  3. CVE databases summarizing CVE‑2021‑45941. [3]

If you want, I can fetch the exact upstream commit or vendor advisories and provide upgrade/mitigation steps.


🏁 Script executed:

cat -n internal/networkmonitor/ebpf/c/include/update.sh | head -20

Repository: devzero-inc/zxporter

Length of output: 741


Upgrade libbpf from 0.6.1 to a patched version. Version 0.6.1 contains two heap-buffer-overflow vulnerabilities (CVE-2021-45940 and CVE-2021-45941) that can cause remote denial-of-service via specially-crafted files.

The pinned version was released in 2021 and is unpatched. Current stable is v1.6.2 (released August 2025). At minimum, upgrade to 0.7.0 where these CVEs were fixed.

🤖 Prompt for AI Agents
In @internal/networkmonitor/ebpf/c/include/update.sh at line 4, The pinned
LIBBPF_VERSION variable in update.sh is set to the vulnerable 0.6.1; change
LIBBPF_VERSION to a patched release (minimum 0.7.0, preferably the current
stable 1.6.2) so the build uses a non-vulnerable libbpf; update the
LIBBPF_VERSION value accordingly and verify the tag/format matches the
repository's expected version string elsewhere in the script.

Comment on lines +18 to +19
curl -sL "https://github.com/libbpf/libbpf/archive/refs/tags/v${LIBBPF_VERSION}.tar.gz" | \
tar -xz --xform='s#.*/##' "${headers[@]}"

⚠️ Potential issue | 🟠 Major

Add error handling and integrity verification for the download.

The current implementation has several reliability and security concerns:

  • curl -sL suppresses error output, making failures silent
  • No checksum or signature verification of the downloaded tarball
  • Will overwrite existing header files without warning
🔒 Proposed improvements
+# Exit on any error
+set -e
+
 # Fetch libbpf release and extract the desired headers
-curl -sL "https://github.com/libbpf/libbpf/archive/refs/tags/v${LIBBPF_VERSION}.tar.gz" | \
-    tar -xz --xform='s#.*/##' "${headers[@]}"
+echo "Downloading libbpf v${LIBBPF_VERSION}..."
+if ! curl -fsSL "https://github.com/libbpf/libbpf/archive/refs/tags/v${LIBBPF_VERSION}.tar.gz" | \
+    tar -xz --xform='s#.*/##' "${headers[@]}"; then
+    echo "Error: Failed to download or extract libbpf headers" >&2
+    exit 1
+fi
+echo "Successfully extracted libbpf headers"

Consider also verifying the tarball checksum against a known good value if available from the libbpf release page.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
curl -sL "https://github.com/libbpf/libbpf/archive/refs/tags/v${LIBBPF_VERSION}.tar.gz" | \
tar -xz --xform='s#.*/##' "${headers[@]}"
# Exit on any error
set -e
# Fetch libbpf release and extract the desired headers
echo "Downloading libbpf v${LIBBPF_VERSION}..."
if ! curl -fsSL "https://github.com/libbpf/libbpf/archive/refs/tags/v${LIBBPF_VERSION}.tar.gz" | \
tar -xz --xform='s#.*/##' "${headers[@]}"; then
echo "Error: Failed to download or extract libbpf headers" >&2
exit 1
fi
echo "Successfully extracted libbpf headers"
🤖 Prompt for AI Agents
In @internal/networkmonitor/ebpf/c/include/update.sh around lines 18 - 19, The
current inline download pipeline (curl -sL ... | tar -xz --xform=...
"${headers[@]}") should be made robust: download the release tarball to a
temporary file using a failing curl variant (e.g. enable -f and preserve
redirects) and check curl's exit code, validate the tarball integrity against a
provided checksum (and/or GPG signature from the libbpf release page) before
extracting, and avoid blind overwrite of existing headers by either extracting
into a temporary directory and atomically moving new files or using tar options
to skip/keep existing files; if any verification or extraction step fails, emit
a clear error and exit non‑zero. Ensure you reference LIBBPF_VERSION and the
headers array in the updated logic so the download, checksum URL, and extraction
targets are consistent.

Comment on lines +23 to +34
func mountCgroup2() error {
err := os.Mkdir(mountPoint, 0755)
if err != nil {
return fmt.Errorf("creating directory at %q: %w", mountPoint, err)
}
// https://docs.kernel.org/admin-guide/cgroup-v2.html#mounting
err = syscall.Mount("none", mountPoint, "cgroup2", 0, "")
if err != nil {
return fmt.Errorf("mounting cgroup2 at %q: %w", mountPoint, err)
}
return nil
}

⚠️ Potential issue | 🟠 Major

Ensure idempotent mounting behavior.

The current implementation will fail if the mount point already exists or cgroup2 is already mounted. This could cause issues during restarts or in environments where cgroup2 is pre-configured.

🔧 Proposed fix for idempotent mounting
 func mountCgroup2() error {
-	err := os.Mkdir(mountPoint, 0755)
+	err := os.MkdirAll(mountPoint, 0755)
 	if err != nil {
 		return fmt.Errorf("creating directory at %q: %w", mountPoint, err)
 	}
 	// https://docs.kernel.org/admin-guide/cgroup-v2.html#mounting
 	err = syscall.Mount("none", mountPoint, "cgroup2", 0, "")
-	if err != nil {
+	if err != nil && !os.IsExist(err) {
 		return fmt.Errorf("mounting cgroup2 at %q: %w", mountPoint, err)
 	}
 	return nil
 }

Alternatively, check /proc/mounts to verify if cgroup2 is already mounted before attempting the mount operation.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @internal/networkmonitor/ebpf/cgroup_linux.go around lines 23-34, the
mountCgroup2 function should be made idempotent: when creating the directory
(mountPoint) treat already-existing directory as non-fatal (use os.Mkdir and
ignore os.IsExist errors) and before calling syscall.Mount check whether cgroup2
is already mounted at mountPoint by parsing /proc/mounts (or using
syscall.Statfs) and return nil if a cgroup2 mount is present; only call
syscall.Mount if the mountPoint exists and no cgroup2 is currently mounted, and
preserve existing error wrapping in the same fmt.Errorf style if mount fails.
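
For reference, a minimal sketch of the /proc/mounts approach suggested above. It is only an illustration, not the project's implementation: mountPoint is assumed to be the existing package-level constant in cgroup_linux.go, and the package name is assumed to be ebpf.

// Sketch only: an idempotent variant of mountCgroup2 that consults /proc/mounts
// before mounting. mountPoint is assumed to be the existing package constant.
package ebpf

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"syscall"
)

// cgroup2MountedAt reports whether a cgroup2 filesystem is already mounted at path.
func cgroup2MountedAt(path string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		// /proc/mounts columns: source, mount point, fstype, options, dump, pass
		fields := strings.Fields(s.Text())
		if len(fields) >= 3 && fields[1] == path && fields[2] == "cgroup2" {
			return true, nil
		}
	}
	return false, s.Err()
}

// mountCgroup2Idempotent creates the mount point if needed and mounts cgroup2
// only when nothing is mounted there yet, so restarts are harmless.
func mountCgroup2Idempotent() error {
	// MkdirAll is a no-op when the directory already exists.
	if err := os.MkdirAll(mountPoint, 0o755); err != nil {
		return fmt.Errorf("creating directory at %q: %w", mountPoint, err)
	}
	mounted, err := cgroup2MountedAt(mountPoint)
	if err != nil {
		return fmt.Errorf("checking /proc/mounts: %w", err)
	}
	if mounted {
		return nil // already mounted, nothing to do
	}
	if err := syscall.Mount("none", mountPoint, "cgroup2", 0, ""); err != nil {
		return fmt.Errorf("mounting cgroup2 at %q: %w", mountPoint, err)
	}
	return nil
}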

Comment on lines +10 to +17
func TestTracer(t *testing.T) {
payload := []byte{69, 0, 0, 80, 106, 126, 64, 0, 62, 17, 36, 91, 192, 168, 65, 7, 172, 17, 0, 3, 0, 53, 224, 37, 0, 60, 174, 17, 191, 43, 129, 128, 0, 1, 0, 1, 0, 0, 0, 0, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 0, 0, 1, 121, 0, 4, 172, 217, 21, 163}
event, err := parseEvent(payload)
if err != nil {
t.Fatal(err)
}
fmt.Println(event.String())
}

⚠️ Potential issue | 🟠 Major

Test lacks assertions and validation.

This test only checks that parseEvent doesn't return an error, but provides no validation of the parsed event's content. Without assertions, the test cannot catch regressions in parsing logic or output formatting.

🧪 Proposed test improvements
 func TestTracer(t *testing.T) {
 	payload := []byte{69, 0, 0, 80, 106, 126, 64, 0, 62, 17, 36, 91, 192, 168, 65, 7, 172, 17, 0, 3, 0, 53, 224, 37, 0, 60, 174, 17, 191, 43, 129, 128, 0, 1, 0, 1, 0, 0, 0, 0, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 0, 0, 1, 121, 0, 4, 172, 217, 21, 163}
 	event, err := parseEvent(payload)
 	if err != nil {
 		t.Fatal(err)
 	}
-	fmt.Println(event.String())
+	
+	// Validate parsed event
+	if event == nil {
+		t.Fatal("parseEvent returned nil event")
+	}
+	
+	// Add specific assertions based on expected packet content
+	// For example, if this is a DNS query for google.lt:
+	eventStr := event.String()
+	if eventStr == "" {
+		t.Error("event.String() returned empty string")
+	}
+	t.Logf("Parsed event: %s", eventStr)
+	
+	// TODO: Add more specific assertions based on expected payload structure
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func TestTracer(t *testing.T) {
	payload := []byte{69, 0, 0, 80, 106, 126, 64, 0, 62, 17, 36, 91, 192, 168, 65, 7, 172, 17, 0, 3, 0, 53, 224, 37, 0, 60, 174, 17, 191, 43, 129, 128, 0, 1, 0, 1, 0, 0, 0, 0, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 0, 0, 1, 121, 0, 4, 172, 217, 21, 163}
	event, err := parseEvent(payload)
	if err != nil {
		t.Fatal(err)
	}
	fmt.Println(event.String())
}
func TestTracer(t *testing.T) {
	payload := []byte{69, 0, 0, 80, 106, 126, 64, 0, 62, 17, 36, 91, 192, 168, 65, 7, 172, 17, 0, 3, 0, 53, 224, 37, 0, 60, 174, 17, 191, 43, 129, 128, 0, 1, 0, 1, 0, 0, 0, 0, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 0, 0, 1, 121, 0, 4, 172, 217, 21, 163}
	event, err := parseEvent(payload)
	if err != nil {
		t.Fatal(err)
	}

	// Validate parsed event
	if event == nil {
		t.Fatal("parseEvent returned nil event")
	}

	// Add specific assertions based on expected packet content
	// For example, if this is a DNS query for google.lt:
	eventStr := event.String()
	if eventStr == "" {
		t.Error("event.String() returned empty string")
	}
	t.Logf("Parsed event: %s", eventStr)

	// TODO: Add more specific assertions based on expected payload structure
}
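
If the sample payload is indeed an A-record response for google.lt (the trailing bytes decode to 172.217.21.163), the assertions can be made concrete along these lines. This is a sketch only: the field names (Questions, Answers, Name, IP) are taken from the DNSEvent type referenced elsewhere in this review, the test reuses the file's existing fmt and testing imports, and the expected strings should be verified against the real parser output before adopting it.

func TestTracerParsesGoogleLtResponse(t *testing.T) {
	payload := []byte{69, 0, 0, 80, 106, 126, 64, 0, 62, 17, 36, 91, 192, 168, 65, 7, 172, 17, 0, 3, 0, 53, 224, 37, 0, 60, 174, 17, 191, 43, 129, 128, 0, 1, 0, 1, 0, 0, 0, 0, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 0, 0, 1, 121, 0, 4, 172, 217, 21, 163}
	event, err := parseEvent(payload)
	if err != nil {
		t.Fatal(err)
	}

	// The sample packet encodes exactly one question and one answer.
	if len(event.Questions) != 1 || len(event.Answers) != 1 {
		t.Fatalf("expected 1 question and 1 answer, got %d/%d", len(event.Questions), len(event.Answers))
	}
	// Field shapes are assumed; adjust to the actual DNSEvent definition.
	if got := string(event.Questions[0].Name); got != "google.lt" {
		t.Errorf("unexpected question name: %q", got)
	}
	if got := fmt.Sprintf("%s", event.Answers[0].IP); got != "172.217.21.163" {
		t.Errorf("unexpected answer IP: %q", got)
	}
}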


Comment on lines +60 to +62
# Note: T2A might not be in all zones of us-central1. us-central1-a,b,f usually.
# Let's target a specific zone to simplify.
ZONE="${REGION}-a"

⚠️ Potential issue | 🟠 Major

Hardcoded zone suffix may fail in some regions.

The zone is hardcoded as ${REGION}-a (Lines 62, 94, 107), which assumes all regions have an "a" zone and that T2A instances are available there. This assumption may not hold for all GCP regions.

Consider deriving available zones dynamically or allowing zone override:

🔧 Proposed fix
+# Allow zone override or detect available zones with T2A
+ZONE="${ZONE:-${REGION}-a}"
+
+# Verify zone supports the machine type
+if ! gcloud compute machine-types list \
+    --filter="name=${NODE_MACHINE_TYPE} AND zone:${ZONE}" \
+    --format="value(name)" --limit=1 | grep -q .; then
+  echo "ERROR: Machine type ${NODE_MACHINE_TYPE} not available in ${ZONE}" >&2
+  exit 1
+fi
+
-  ZONE="${REGION}-a"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ZONE="${REGION}-a"
# Allow zone override or detect available zones with T2A
ZONE="${ZONE:-${REGION}-a}"
# Verify zone supports the machine type
if ! gcloud compute machine-types list \
--filter="name=${NODE_MACHINE_TYPE} AND zone:${ZONE}" \
--format="value(name)" --limit=1 | grep -q .; then
echo "ERROR: Machine type ${NODE_MACHINE_TYPE} not available in ${ZONE}" >&2
exit 1
fi

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 17

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In @.github/workflows/push-zxporter-netmon-image.yml:
- Around line 311-318: The workflow references inputs.workflow_name in
expressions for workflow_name (e.g., "workflow_name: ${{ inputs.workflow_name ||
'Update zxporter-netmon helm chart' }}") but the workflow_dispatch inputs do not
declare workflow_name; either add a workflow_dispatch input named workflow_name
(type: string, optional, with a sensible default) so inputs.workflow_name is
defined, or remove the inputs reference and hardcode the default literal in both
places (replace the expression with 'Update zxporter-netmon helm chart') to
avoid undefined input errors; update every occurrence of inputs.workflow_name
used in the file (including the two workflow_name keys and the slack action
input) so they are consistent.

In @cmd/zxporter-netmon/main.go:
- Around line 193-202: The monitor started with a cancellable context (the
cancel returned when launching the monitor at main) is never invoked during
shutdown; call that cancel function immediately after receiving the OS signal
and before initiating the server shutdown so the monitor and its conntrack
client are signaled to stop; specifically, invoke cancel() (the monitor's
cancel) right after the signal receive/before server.Shutdown(shutdownCtx) and
keep the existing shutdownCtx/shutdownCancel for HTTP shutdown.

In @config/zxporter-netmon.yaml:
- Around line 28-33: The pod-level securityContext block containing "privileged"
and "capabilities" is invalid; remove the pod-level securityContext (the block
that sets privileged: true and capabilities:add: - NET_ADMIN - SYS_RESOURCE) and
ensure those fields remain only in the container-level securityContext for the
relevant container spec so validation succeeds and capabilities are applied
correctly.

In @Dockerfile.netmon:
- Line 31: The Dockerfile currently sets USER 0:0 which contradicts the
gcr.io/distroless/static:nonroot base image; remove the USER 0:0 directive to
run as the nonroot user (UID 65532) provided by that image, or if root
privileges are truly required for the DaemonSet, replace the base image with
gcr.io/distroless/static:latest and then keep or set USER 0:0 accordingly;
update the Dockerfile line containing "USER 0:0" and ensure runtime
capabilities/privileged settings remain configured in the DaemonSet rather than
forcing root in the image.

In @internal/networkmonitor/cilium_time.go:
- Around line 12-20: The init() sets a dangerous fallback hertz = 1 when
getKernelHZ() fails and swallows the error; update init() to choose a realistic
fallback (e.g., 250 or 100) instead of 1, log the getKernelHZ() error and the
chosen fallback via the package logger (or return/fail fast if that’s
preferred), and ensure callers using hertz (e.g., jiffies calculations) will see
correct magnitude; locate symbols init, hertz and getKernelHZ to implement the
change and add a processLogger or package log call to record the failure and
fallback.

In @internal/networkmonitor/conntrack_client.go:
- Around line 74-84: The first composite nil-check inside the sessions loop is
missing a check for sess.Origin.Dst, which is later dereferenced to build
entry.Dst; update the condition that currently tests sess.Origin.Src,
sess.Origin.Proto, sess.Reply.Dst, etc., to also include sess.Origin.Dst so the
loop continues when Origin.Dst is nil, preventing a nil dereference when
constructing entry.Dst.
- Around line 115-117: The code calls filter(entry) without checking for nil,
which will panic if filter is nil; update the loop that currently does "if
filter(entry) { res = append(res, entry) }" to guard against a nil filter (e.g.,
"if filter == nil || filter(entry) { res = append(res, entry) }") or explicitly
validate/document that the filter parameter must be non-nil at function entry;
refer to the filter variable in conntrack_client.go and add the nil-check or
parameter validation accordingly.

In @internal/networkmonitor/ebpf/c/include/vmlinux.h:
- Around line 67-130: The task_struct includes struct thread_info
unconditionally but thread_info is only defined for __TARGET_ARCH_x86 and
__TARGET_ARCH_arm64, so builds for other architectures fail; fix by either
adding architecture-specific thread_info definitions for the missing targets
(e.g., provide appropriate struct thread_info declarations for
__TARGET_ARCH_s390, __TARGET_ARCH_arm, __TARGET_ARCH_mips,
__TARGET_ARCH_powerpc, __TARGET_ARCH_sparc, __TARGET_ARCH_riscv) or guard the
task_struct definition with the same architecture checks, and add a clear
comment in vmlinux.h documenting which architectures are supported (reference
struct task_struct, struct thread_info, and the __TARGET_ARCH_* macros to locate
the affected code).

In @internal/networkmonitor/ebpf/tracer.go:
- Around line 21-25: Add a Close/Stop method to Tracer to close the events
channel and ensure callers don’t block: update the Tracer struct (type Tracer)
to implement a Close() (or Stop()) method that closes the events channel
returned/owned by NewTracer, and call that method from the platform-specific
implementations when their Run() exits (both tracer_linux.go and
tracer_nolinux.go) so the channel is closed once the context cancels; ensure
Close is safe to call multiple times (idempotent) and that any goroutines
reading from Tracer.events handle the closed channel correctly.

In @internal/networkmonitor/monitor.go:
- Around line 133-144: Duplicate select branch: there are two identical "case
<-flushTicker.C:" blocks calling m.flush(ctx). Remove the duplicate branch and
keep a single case that calls m.flush(ctx) and logs via m.log.Error(err, "Failed
to flush metrics to control plane"); if you intended the comment about running
flush in a goroutine or synchronous behavior, consolidate that comment into the
remaining case and implement the chosen behavior (synchronous call to
m.flush(ctx) recommended). Ensure you update only the select inside the monitor
loop so references to flushTicker.C and m.flush(ctx) remain correct.
- Around line 323-359: The current helpers conntrackEntryKey and entryGroupKey
use package-level maphash.Hash vars (conntrackEntryHash, entryGroupHash) and
As4(), which causes a data race and panics on IPv6; fix by removing the
package-level hashes and creating local maphash.Hash instances inside each
function (e.g., start each function with "var h maphash.Hash"), use
e.Src.IP().As16() and e.Dst.IP().As16() to get stable 16-byte slices for both
IPv4/IPv6, write those full slices into the local hash, write ports and proto as
before, and return h.Sum64(); update both conntrackEntryKey and entryGroupKey
accordingly (a sketch of this pattern is included just after this list).

In @internal/networkmonitor/pod_cache.go:
- Around line 27-29: The AddFunc handler currently does an unchecked type
assertion obj.(*corev1.Pod) which can panic; modify the AddFunc in the informer
registration to perform a safe type assertion (e.g., pod, ok :=
obj.(*corev1.Pod)) and only call pc.updatePod(pod) when ok is true, otherwise
log or ignore the unexpected type; ensure you reference the AddFunc closure and
the pc.updatePod method so the handler remains robust against non-Pod objects.
- Around line 30-32: The UpdateFunc currently does an unsafe cast
new.(*corev1.Pod) and never removes stale IP entries when a Pod's IP changes;
update the handler to defensively type-assert both old and new objects (use a
type switch or comma-ok) and return/log if types are unexpected, handle nil
pods, then compare oldPod.Status.PodIP and newPod.Status.PodIP: if old IP is
non-empty and different from the new IP remove the old IP mapping from the pod
cache (call the existing cache removal method or implement pc.removeByIP) before
calling pc.updatePod(newPod) so stale mappings are cleared.

In @Makefile:
- Around line 777-789: The verify-e2e-aks-lifecycle target can abort before
running deprovision-aks because the long chained commands use && and Make treats
failures as fatal; modify the target so deprovision-aks always runs regardless
of verify-e2e result—invoke the test with error-ignoring (e.g., prefix the
verify call with -$(MAKE) verify-e2e CLUSTER_CONTEXT="$${CONTEXT}"
NAMESPACE=devzero-zxporter or append || true to that call) and then
unconditionally call $(MAKE) deprovision-aks; ensure you apply this change to
the verify-e2e-aks-lifecycle target (symbols: verify-e2e-aks-lifecycle,
verify-e2e, deprovision-aks).
- Around line 752-763: The recipe verify-e2e-eks-lifecycle can abort before
deprovision-eks runs if $(MAKE) verify-e2e fails; change the shell sequence so
you always run deprovision-eks: invoke $(MAKE) verify-e2e ... and capture its
exit code (e.g. VERIFY_EXIT=$$?); then call $(MAKE) deprovision-eks; finally
exit with the original verify exit code (exit $$VERIFY_EXIT). Update the
verify-e2e-eks-lifecycle recipe to perform these steps in the same shell
invocation so deprovision-eks always executes even on verification failure.

In @verification/provision_gke.sh:
- Around line 60-62: The script currently hardcodes ZONE="${REGION}-a", which
breaks in regions without an "-a" zone or where T2A is unavailable; change ZONE
to be derived dynamically or overridable: add logic to respect an optional
environment variable (e.g., ZONE override) and if not set, query available zones
for the REGION (via gcloud compute zones list filtered by REGION and required
machine type/feature) and pick a suitable zone, then use that dynamic ZONE
wherever ZONE is referenced (the ZONE variable and all places that use it in
provisioning, checks, and instance creation).
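
To make the monitor.go hashing guidance above concrete, here is a rough sketch of the per-call maphash pattern. The Entry accessors (Src, Dst, Proto and their IP()/Port() methods) follow the conntrack Entry described in this review and may need adjusting to the actual types, and the shared maphash.Seed is an addition of this sketch: without it, two independent Hash values would pick different random seeds and keys would not be comparable across calls. It assumes hash/maphash is imported in monitor.go.

// keySeed is shared so identical flows hash to identical keys across calls.
// A maphash.Seed is safe for concurrent use; a shared maphash.Hash is not.
var keySeed = maphash.MakeSeed()

func conntrackEntryKey(e *Entry) uint64 {
	var h maphash.Hash // local instance: no data race between goroutines
	h.SetSeed(keySeed)

	src := e.Src.IP().As16() // 16-byte form covers both IPv4 and IPv6
	_, _ = h.Write(src[:])
	_ = h.WriteByte(byte(e.Src.Port() >> 8))
	_ = h.WriteByte(byte(e.Src.Port()))

	dst := e.Dst.IP().As16()
	_, _ = h.Write(dst[:])
	_ = h.WriteByte(byte(e.Dst.Port() >> 8))
	_ = h.WriteByte(byte(e.Dst.Port()))

	_ = h.WriteByte(byte(e.Proto))
	return h.Sum64()
}

// entryGroupKey would follow the same pattern over whichever fields define a group.
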
🟡 Minor comments (10)
verification/provision_aks.sh-111-115 (1)

111-115: Inconsistent fallback naming in delete mode.

The fallback values on lines 113-114 use zxporter-e2e-rg-${WHO} and zxporter-e2e-${WHO}, but WHO is computed at script start and the create mode uses aks-zxporter-e2e-rg-${WHO}-${CNI_TYPE} / aks-zxporter-e2e-${WHO}-${CNI_TYPE}. If the state file is missing, the fallback won't match the actual created resources.

Proposed fix
 read_state
-RESOURCE_GROUP="${RESOURCE_GROUP:-zxporter-e2e-rg-${WHO}}"
-CLUSTER_NAME="${CLUSTER_NAME:-zxporter-e2e-${WHO}}"
+RESOURCE_GROUP="${RESOURCE_GROUP:-aks-zxporter-e2e-rg-${WHO}-${CNI_TYPE}}"
+CLUSTER_NAME="${CLUSTER_NAME:-aks-zxporter-e2e-${WHO}-${CNI_TYPE}}"
verification/verify_e2e.sh-24-24 (1)

24-24: Remove unused ARCH variable.

The ARCH variable is defined but never used in the script.

🧹 Proposed fix
-ARCH=$(go env GOARCH)
 TAG="dev-$(date +%s)"
verification/verify_e2e.sh-115-122 (1)

115-122: Remove duplicate image.pullPolicy setting.

Line 120 sets image.pullPolicy twice in the same helm template command.

🐛 Proposed fix
     helm template zxporter-netmon helm-chart/zxporter-netmon \
         --namespace $NAMESPACE \
         --set image.repository=ttl.sh/zxporter-netmon \
         --set image.tag=$TAG \
         --set image.pullPolicy=$PULL_POLICY \
-        --set image.pullPolicy=$PULL_POLICY \
         --set config.exportInterval="10s" \
         | kubectl apply -f -
internal/networkmonitor/ebpf/types.go-18-29 (1)

18-29: Fix section headers printed inside loops.

The "Questions:" and "Answers:" headers are printed inside the loops (lines 21 and 25), causing them to appear once per item instead of once per section. This creates confusing output like:

Questions:
A IN example.com 
Questions:
A IN google.com 

Instead of:

Questions:
A IN example.com 
A IN google.com 
🐛 Proposed fix
 func (e DNSEvent) String() string {
 	var str strings.Builder
+	if len(e.Questions) > 0 {
+		str.WriteString("Questions:\n")
+	}
 	for _, v := range e.Questions {
-		str.WriteString("Questions:\n")
 		str.WriteString(fmt.Sprintf("%s %s %s \n", v.Class, v.Type, string(v.Name)))
 	}
+	if len(e.Answers) > 0 {
+		str.WriteString("Answers:\n")
+	}
 	for _, v := range e.Answers {
-		str.WriteString("Answers:\n")
 		str.WriteString(fmt.Sprintf("%s %s %s [%s] [%s] \n", v.Class, v.Type, string(v.Name), v.IP, v.CNAME))
 	}
 	return str.String()
 }
.github/workflows/push-zxporter-netmon-image.yml-237-241 (1)

237-241: Update deprecated GitHub Action version.

The workflow uses actions/checkout@v3 which is outdated. The newer @v4 version is already used elsewhere in this workflow (lines 29, 298).

♻️ Proposed fix
 - name: Checkout repo
-  uses: actions/checkout@v3
+  uses: actions/checkout@v4
   with:
       repository: devzero-inc/zxporter
       path: zxporter-tmp
       token: ${{ steps.get_workflow_token.outputs.token }}
internal/networkmonitor/ebpf/tracer_linux.go-191-207 (1)

191-207: Minor: Scanner error silently ignored.

If scanner.Scan() encounters an error, the function returns ErrCgroup2NotMounted without checking scanner.Err(). This is unlikely in practice, and the file is still closed by the deferred Close, but the underlying scanner error is silently ignored.

♻️ Proposed fix
 func detectCgroupPath() (string, error) {
   f, err := os.Open("/proc/mounts")
   if err != nil {
     return "", err
   }
   defer f.Close()

   scanner := bufio.NewScanner(f)
   for scanner.Scan() {
     fields := strings.Split(scanner.Text(), " ")
     if len(fields) >= 3 && fields[2] == "cgroup2" {
       return fields[1], nil
     }
   }

+  if err := scanner.Err(); err != nil {
+    return "", err
+  }
   return "", ErrCgroup2NotMounted
 }
.github/workflows/push-zxporter-netmon-image.yml-137-137 (1)

137-137: Fix typo in comment.

"Definately" should be "Definitely".

📝 Proposed fix
-    # Definately have to fail if any non-latest tags exist
+    # Definitely have to fail if any non-latest tags exist
.github/workflows/push-zxporter-netmon-image.yml-35-64 (1)

35-64: Quote shell variables to prevent word splitting.

Multiple shellcheck warnings indicate unquoted variable expansions that could cause issues with unexpected input values.

🔧 Recommended fixes for key variables
   GITVERSION=$(git describe --tags --match="v[0-9]*.[0-9]*.[0-9]*" --exclude="chart-*" --always || echo "v0.0.0-$(git rev-parse --short HEAD)")
   echo "GITVERSION=${GITVERSION}" >> $GITHUB_ENV
   if [[ "$GITVERSION" =~ ^v([0-9]+)\.([0-9]+)\.(.+)$ ]]; then
     MAJOR=${BASH_REMATCH[1]}
     MINOR=${BASH_REMATCH[2]}
     PATCH=${BASH_REMATCH[3]}
     PATCH_SHORT=${PATCH%%[-]*}
     FULLV=v${MAJOR}.${MINOR}.${PATCH_SHORT}

-    echo "MAJOR=${MAJOR}" >> $GITHUB_ENV
-    echo "MINOR=${MINOR}" >> $GITHUB_ENV
-    echo "PATCH=${PATCH}" >> $GITHUB_ENV
-    echo "PATCH_SHORT=${PATCH_SHORT}" >> $GITHUB_ENV
-    echo "FULLV=${FULLV}" >> $GITHUB_ENV
-    echo "IS_RELEASE=true" >> $GITHUB_ENV
+    echo "MAJOR=${MAJOR}" >> "$GITHUB_ENV"
+    echo "MINOR=${MINOR}" >> "$GITHUB_ENV"
+    echo "PATCH=${PATCH}" >> "$GITHUB_ENV"
+    echo "PATCH_SHORT=${PATCH_SHORT}" >> "$GITHUB_ENV"
+    echo "FULLV=${FULLV}" >> "$GITHUB_ENV"
+    echo "IS_RELEASE=true" >> "$GITHUB_ENV"

Similar fixes needed in other script blocks (lines 76-79, 88-167, 192-219, etc.)

internal/networkmonitor/cilium_time.go-57-66 (1)

57-66: Fix typo in error message.

Line 62 has a grammatical error: "Unable get time" should be "Unable to get time".

📝 Proposed fix
-		return 0, fmt.Errorf("Unable get time: %w", err)
+		return 0, fmt.Errorf("Unable to get time: %w", err)
internal/networkmonitor/cilium_time.go-46-51 (1)

46-51: Add documentation explaining the 256x scaling factor in jiffies conversion.

The jiffies calculation at line 48 uses diff << 8 (multiply by 256) before dividing by hertz. Standard kernel jiffies-to-seconds conversion is simply diff / hertz, so this additional scaling factor requires explanation:

  • Why is the value scaled by 256?
  • Is this for fixed-point arithmetic precision, a kernel ABI requirement, or Cilium-specific handling?
  • How does this relate to the expected return value in seconds?

Add an inline comment documenting the purpose of this operation to clarify the conversion logic.

🧹 Nitpick comments (21)
verification/provision_aks.sh (1)

38-40: Add shellcheck directive for dynamic source.

ShellCheck cannot follow the non-constant source path. Add a directive to suppress or document this intentional behavior.

Proposed fix
 read_state() {
+  # shellcheck source=/dev/null
   [[ -f "${STATE_FILE}" ]] && source "${STATE_FILE}"
 }
internal/networkmonitor/conntrack_client.go (2)

47-49: Consider logging if enabling conntrack accounting fails.

The error from WriteFile is silently ignored. While this is best-effort, a warning log would help debug issues when conntrack accounting doesn't work as expected.

Proposed fix
 func NewNetfilterClient(log logr.Logger) (Client, error) {
 	// Ensure accounting is enabled
-	_ = os.WriteFile("/proc/sys/net/netfilter/nf_conntrack_acct", []byte{'1'}, 0600)
+	if err := os.WriteFile("/proc/sys/net/netfilter/nf_conntrack_acct", []byte{'1'}, 0600); err != nil {
+		log.V(1).Info("failed to enable nf_conntrack_acct (may require CAP_NET_ADMIN)", "error", err)
+	}

123-126: Silent error suppression on IP conversion.

netaddr.FromStdIP returns (IP, bool) but the ok value is discarded. If conversion fails, a zero-value netaddr.IP is returned which could lead to confusing behavior downstream.

Proposed fix

Consider returning an error or skipping entries with invalid IPs in the caller. At minimum, log when conversion fails:

 func ipFromStdIP(ip net.IP) netaddr.IP {
-	res, _ := netaddr.FromStdIP(ip)
+	res, ok := netaddr.FromStdIP(ip)
+	if !ok {
+		return netaddr.IP{} // Returns zero value; caller should validate
+	}
 	return res
 }

Alternatively, check the returned IP validity in ListEntries before constructing the Entry.

internal/networkmonitor/dns/ip2dns.go (3)

40-42: Exported field with unexported interface type limits testability.

Tracer field is exported (uppercase) but its type tracer is unexported. This makes it impossible for external packages to construct an IP2DNS with a mock tracer. Consider either:

  1. Export the interface as Tracer and rename the field
  2. Keep field unexported and only expose via constructor
Proposed fix (option 1: export the interface)
-type tracer interface {
+// Tracer defines the interface for DNS event sources.
+type Tracer interface {
 	Run(ctx context.Context) error
 	Events() <-chan ebpf.DNSEvent
 }

 type IP2DNS struct {
-	Tracer tracer
+	tracer Tracer
 	log    logr.Logger
 	// ...
 }

-func NewIP2DNS(tracer tracer, log logr.Logger) *IP2DNS {
+func NewIP2DNS(tracer Tracer, log logr.Logger) *IP2DNS {
 	return &IP2DNS{
-		Tracer: tracer,
+		tracer: tracer,
 		log:    log,
 	}
 }

61-67: Redundant context wrapping.

The ctx parameter is already a context that can be cancelled. Creating a new derived context with context.WithCancel(ctx) and deferring its cancel is unnecessary here since the function returns when ctx.Done() fires anyway.

Proposed simplification
 func (d *IP2DNS) Start(ctx context.Context) error {
 	d.ipToName = cache.NewContext[string, string](ctx)

-	ctx, cancel := context.WithCancel(ctx)
-	defer cancel()
 	errch := make(chan error, 1)
 	go func() {
 		err := d.Tracer.Run(ctx)
 		errch <- err
 	}()

The derived context is only useful if you need to cancel the tracer independently of the parent context. If that's intentional, add a comment explaining why.


103-105: CNAME handling is a no-op placeholder.

The CNAME case is empty. If CNAME resolution enrichment is planned, consider adding a TODO comment or removing the case to avoid dead code.

Proposed fix
 			case layers.DNSTypeCNAME:
-				// CNAME handling if needed
+				// TODO: Handle CNAME chaining for alias resolution
+				continue
 			}
internal/networkmonitor/ebpf/c/include/vmlinux.h (1)

1-1219: Consider generating vmlinux.h with bpftool instead of maintaining it manually.

Manually maintaining vmlinux.h requires keeping type definitions in sync with kernel versions, which is error-prone. The standard practice is to generate it from the target kernel's BTF using bpftool btf dump file /sys/kernel/btf/vmlinux format c.

This approach would eliminate maintenance burden and ensure type accuracy across kernel versions.

verification/traffic-gen.yaml (1)

37-53: Add resource limits and explicit security context.

While this is a verification workload, adding resource constraints and an explicit security context prevents resource exhaustion in test environments and follows Kubernetes security best practices.

🔒 Suggested security and resource improvements
       containers:
         - name: traffic-gen
           image: curlimages/curl
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+            capabilities:
+              drop:
+                - ALL
+          resources:
+            requests:
+              cpu: 10m
+              memory: 32Mi
+            limits:
+              cpu: 100m
+              memory: 64Mi
           command: ["/bin/sh"]
           args:
             - "-c"
internal/networkmonitor/ebpf/generate.go (1)

3-6: Clarify the generation command and expected output files.

The commented go:generate directive appears incomplete for multi-architecture builds. Based on the PR context showing both bpf_bpfeb.go and bpf_bpfel.go files (big-endian and little-endian), the command should specify target architectures.

Additionally, Line 4 mentions checking in bpf_bpfel.o and bpf_bpfel.go, but bpf2go typically generates Go source files (.go) and may not require checking in object files (.o).

📝 Suggested clarification
 // To regenerate eBPF artifacts, install llvm (brew install llvm) and uncomment the line below.
-// Ensure you have the 'bpf_bpfel.o' and 'bpf_bpfel.go' files checked in.
+// This generates bpf_bpfeb.go and bpf_bpfel.go (big-endian and little-endian bindings).
 //
-// //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -no-strip -cc clang -cflags $BPF_CFLAGS bpf ./c/egressd.c -- -I./c/include
+// //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -no-strip -cc clang -cflags $BPF_CFLAGS -target bpfel,bpfeb bpf ./c/egressd.c -- -I./c/include
verification/verify_e2e.sh (1)

280-296: Fix inconsistent variable naming.

The variable is named RETRY_Count (line 283) but follows the pattern of MAX_RETRIES (line 282). Use consistent naming convention.

♻️ Suggested fix
         MAX_RETRIES=6
-        RETRY_Count=0
+        RETRY_COUNT=0
         TOTAL_MSGS=0
         
-        while [ $RETRY_Count -lt $MAX_RETRIES ]; do
+        while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
             STATS_JSON=$(kubectl exec curl-validator -n $NAMESPACE -- curl -v -s http://testserver.${NAMESPACE}.svc.cluster.local:8080/stats)
             TOTAL_MSGS=$(echo $STATS_JSON | grep -o '"total_messages": [0-9]*' | awk '{print $2}')
             
             if [[ "$TOTAL_MSGS" -gt 0 ]]; then
                 break
             fi
             
-            echo " [INFO] TestServer has 0 messages. Waiting 5s for flush... ($((RETRY_Count+1))/$MAX_RETRIES)"
+            echo " [INFO] TestServer has 0 messages. Waiting 5s for flush... ($((RETRY_COUNT+1))/$MAX_RETRIES)"
             sleep 5
-            RETRY_Count=$((RETRY_Count+1))
+            RETRY_COUNT=$((RETRY_COUNT+1))
         done
internal/networkmonitor/ebpf/tracer_nolinux.go (1)

11-16: Clarify the comment about error handling.

The comment on line 12 states "return error to be safe" but the function actually returns nil. Consider updating the comment to accurately reflect the implementation:

📝 Suggested comment update
 func (t *Tracer) Run(ctx context.Context) error {
-	// Don't panic, just return error to be safe
+	// No-op on non-Linux systems; waits for context cancellation
 	t.log.Info("eBPF tracer not implemented on non-linux systems")
 	<-ctx.Done()
 	return nil
 }
helm-chart/zxporter-netmon/templates/daemonset.yaml (1)

28-29: Consider documenting the hostNetwork security implications.

The DaemonSet uses hostNetwork: true, which grants pods access to the host's network namespace. While this is often necessary for network monitoring, it's a privileged setting with security implications. Consider adding a comment in the template or documenting this in the chart's README.

.github/workflows/helm-release-netmon.yml (1)

50-51: Quote the $GITHUB_ENV variable.

The $GITHUB_ENV variable should be quoted to prevent potential issues with word splitting or globbing.

🔧 Proposed fix
 - name: Enable Helm OCI support
-  run: echo "HELM_EXPERIMENTAL_OCI=1" >> $GITHUB_ENV
+  run: echo "HELM_EXPERIMENTAL_OCI=1" >> "$GITHUB_ENV"
verification/provision_eks.sh (2)

140-142: Address or remove the TODO comment.

The TODO comment indicates that autoModeConfig still requires testing. Since this configuration is commented out, either test and enable it, or remove the comment if it's intentionally deferred.

Would you like me to open a tracking issue for testing autoModeConfig, or should this TODO be removed?


149-152: Document NAT gateway implications.

Disabling the NAT gateway (line 151) reduces costs but requires nodes to have public IPs for internet access, which has security implications. The comment mentions this but consider documenting it more prominently or adding verification that the setup works as intended.

internal/networkmonitor/ebpf/tracer_linux.go (1)

103-106: Consider logging the full raw sample metadata for debugging.

The size check is correct, but when skipping small events, consider logging additional context like the record timestamp or CPU to aid debugging.

♻️ Optional enhancement
 if len(record.RawSample) < 4 {
-  t.log.Info("skipping too small event", "bytes", len(record.RawSample))
+  t.log.Info("skipping too small event", "bytes", len(record.RawSample), "cpu", record.CPU)
   continue
 }
Makefile (1)

703-709: Lima installation assumes macOS with Homebrew.

The brew install lima command will fail on Linux environments. Consider adding OS detection or documenting this limitation.

🔧 Proposed fix for cross-platform support
 .PHONY: setup-lima
 setup-lima:
 	@echo "Checking for Lima..."
-	@which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
+	@which limactl >/dev/null || { \
+		if command -v brew >/dev/null 2>&1; then \
+			echo "Installing Lima via Homebrew..." && brew install lima; \
+		else \
+			echo "Error: Lima not found and Homebrew not available. Please install Lima manually: https://lima-vm.io/docs/installation/"; \
+			exit 1; \
+		fi; \
+	}
 	@echo "Checking if instance $(LIMA_INSTANCE) exists..."
internal/networkmonitor/monitor.go (2)

90-102: Consider validating required dependencies.

The constructor accepts podCache, ct, and dnsCollector without validation. If ct (conntrack client) is nil, collect() will panic when calling m.ct.ListEntries(). Consider adding nil checks for critical dependencies.

🔧 Proposed fix
 func NewMonitor(cfg Config, log logr.Logger, podCache *PodCache, ct Client, dnsCollector dns.DNSCollector, dakrClient transport.DakrClient) (*Monitor, error) {
+	if ct == nil {
+		return nil, fmt.Errorf("conntrack client is required")
+	}
 	return &Monitor{
 		cfg:          cfg,
 		log:          log,

241-246: Misleading comment: handler does NOT reset counters.

The comment on line 242 says "This currently resets the counters after read (Delta mode)" but the implementation creates copies without resetting (line 267 comment correctly notes "Peek behavior, no reset"). Consider updating the header comment to match the actual behavior.

📝 Proposed fix
 // GetMetricsHandler serves the collected metrics
-// NOTE: This currently resets the counters after read (Delta mode), mimicking egressd's behavior
+// NOTE: This returns a snapshot of metrics without resetting counters (Peek mode).
+// Delta mode with counter reset is used only for flush() to the control plane.
 func (m *Monitor) GetMetricsHandler(w http.ResponseWriter, r *http.Request) {
internal/networkmonitor/cilium_kernel_hz.go (2)

46-60: Consider capturing time.Now() closer to the timestamp read.

The wall clock time is captured at Line 50 before scanning begins, while the kernel timestamp may be read several lines later. For maximum accuracy, consider capturing time.Now() immediately after successfully parsing the timestamp line.

⏱️ Proposed refactor for improved timing accuracy
 func readSchedstat(f io.ReadSeeker) (ktime, error) {
 	defer func() { _, _ = f.Seek(0, 0) }()
 
 	var j uint64
-	var t = time.Now()
 
 	s := bufio.NewScanner(f)
 	for s.Scan() {
 		if _, err := fmt.Sscanf(s.Text(), "timestamp %d", &j); err == nil {
+			t := time.Now()
 			return ktime{j, t}, nil
 		}
 	}
 
 	return ktime{}, errors.New("no kernel timestamp found")
 }

87-107: Consider simplifying distance calculation for clarity.

The unsigned arithmetic pattern at lines 95-98 is correct but could be clearer. The initial assignment d := uint16(in - v) would underflow when in < v, requiring the subsequent correction.

♻️ Proposed refactor for improved readability
 func nearest(in uint16, values []uint16) (uint16, error) {
 	if len(values) == 0 {
 		return 0, errors.New("values cannot be empty")
 	}
 
 	var out uint16
 	min := ^uint16(0)
 	for _, v := range values {
-		d := uint16(in - v)
+		var d uint16
 		if in < v {
 			d = v - in
+		} else {
+			d = in - v
 		}
 
 		if d < min {
 			min = d
 			out = v
 		}
 	}
 
 	return out, nil
 }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1bd9e3 and 8b91b7f.

⛔ Files ignored due to path filters (9)
  • gen/api/v1/apiv1connect/metrics_collector.connect.go is excluded by !**/gen/**
  • gen/api/v1/common.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/api/v1/k8s.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/api/v1/metrics_collector.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • gen/api/v1/metrics_collector_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • go.sum is excluded by !**/*.sum
  • internal/networkmonitor/ebpf/bpf_bpfeb.o is excluded by !**/*.o
  • internal/networkmonitor/ebpf/bpf_bpfel.o is excluded by !**/*.o
  • proto/dakr_proto_descriptor.bin is excluded by !**/*.bin
📒 Files selected for processing (63)
  • .github/workflows/helm-release-netmon.yml
  • .github/workflows/push-zxporter-netmon-image.yml
  • .gitignore
  • Dockerfile.netmon
  • Makefile
  • cmd/zxporter-netmon/main.go
  • config/zxporter-netmon.yaml
  • go.mod
  • helm-chart/zxporter-netmon/Chart.yaml
  • helm-chart/zxporter-netmon/templates/_helpers.tpl
  • helm-chart/zxporter-netmon/templates/clusterrole.yaml
  • helm-chart/zxporter-netmon/templates/clusterrolebinding.yaml
  • helm-chart/zxporter-netmon/templates/daemonset.yaml
  • helm-chart/zxporter-netmon/templates/serviceaccount.yaml
  • helm-chart/zxporter-netmon/values.yaml
  • internal/networkmonitor/cilium_client.go
  • internal/networkmonitor/cilium_client_linux.go
  • internal/networkmonitor/cilium_client_nolinux.go
  • internal/networkmonitor/cilium_kernel_hz.go
  • internal/networkmonitor/cilium_time.go
  • internal/networkmonitor/conntrack_client.go
  • internal/networkmonitor/dns/ip2dns.go
  • internal/networkmonitor/ebpf/bpf_bpfeb.go
  • internal/networkmonitor/ebpf/bpf_bpfel.go
  • internal/networkmonitor/ebpf/c/egressd.c
  • internal/networkmonitor/ebpf/c/include/LICENSE.BSD-2-Clause
  • internal/networkmonitor/ebpf/c/include/bpf_core_read.h
  • internal/networkmonitor/ebpf/c/include/bpf_endian.h
  • internal/networkmonitor/ebpf/c/include/bpf_helper_defs.h
  • internal/networkmonitor/ebpf/c/include/bpf_helpers.h
  • internal/networkmonitor/ebpf/c/include/bpf_tracing.h
  • internal/networkmonitor/ebpf/c/include/common.h
  • internal/networkmonitor/ebpf/c/include/maps.h
  • internal/networkmonitor/ebpf/c/include/types.h
  • internal/networkmonitor/ebpf/c/include/update.sh
  • internal/networkmonitor/ebpf/c/include/vmlinux.h
  • internal/networkmonitor/ebpf/c/include/vmlinux_flavors.h
  • internal/networkmonitor/ebpf/c/include/vmlinux_missing.h
  • internal/networkmonitor/ebpf/cgroup_linux.go
  • internal/networkmonitor/ebpf/cgroup_nolinux.go
  • internal/networkmonitor/ebpf/generate.go
  • internal/networkmonitor/ebpf/tracer.go
  • internal/networkmonitor/ebpf/tracer_linux.go
  • internal/networkmonitor/ebpf/tracer_linux_test.go
  • internal/networkmonitor/ebpf/tracer_nolinux.go
  • internal/networkmonitor/ebpf/types.go
  • internal/networkmonitor/monitor.go
  • internal/networkmonitor/pod_cache.go
  • internal/transport/connect_compression_test.go
  • internal/transport/dakr_client.go
  • internal/transport/interface.go
  • internal/transport/sender.go
  • test/stats/types.go
  • test/testserver/main.go
  • verification/kind-config.yaml
  • verification/lima.yaml
  • verification/provision_aks.sh
  • verification/provision_eks.sh
  • verification/provision_gke.sh
  • verification/traffic-gen.yaml
  • verification/traffic-server.yaml
  • verification/verify_e2e.sh
  • verification/verify_local.sh
✅ Files skipped from review due to trivial changes (1)
  • internal/networkmonitor/ebpf/cgroup_nolinux.go
🚧 Files skipped from review as they are similar to previous changes (20)
  • internal/networkmonitor/ebpf/cgroup_linux.go
  • internal/transport/dakr_client.go
  • internal/networkmonitor/ebpf/c/include/types.h
  • internal/transport/sender.go
  • helm-chart/zxporter-netmon/Chart.yaml
  • internal/networkmonitor/ebpf/c/include/LICENSE.BSD-2-Clause
  • internal/networkmonitor/ebpf/c/include/common.h
  • internal/networkmonitor/cilium_client.go
  • verification/lima.yaml
  • internal/networkmonitor/ebpf/tracer_linux_test.go
  • internal/networkmonitor/ebpf/bpf_bpfel.go
  • internal/networkmonitor/cilium_client_linux.go
  • helm-chart/zxporter-netmon/templates/_helpers.tpl
  • .gitignore
  • internal/transport/connect_compression_test.go
  • internal/transport/interface.go
  • verification/kind-config.yaml
  • internal/networkmonitor/ebpf/c/include/maps.h
  • internal/networkmonitor/cilium_client_nolinux.go
  • internal/networkmonitor/ebpf/c/include/update.sh
🧰 Additional context used
🧬 Code graph analysis (11)
cmd/zxporter-netmon/main.go (10)
internal/networkmonitor/pod_cache.go (2)
  • PodCache (12-16)
  • NewPodCache (20-40)
internal/networkmonitor/conntrack_client.go (2)
  • Client (36-39)
  • NewNetfilterClient (47-59)
internal/networkmonitor/cilium_client.go (2)
  • CiliumAvailable (20-25)
  • NewCiliumClient (9-18)
internal/networkmonitor/cilium_time.go (1)
  • ClockSourceKtime (25-25)
internal/networkmonitor/dns/ip2dns.go (2)
  • DNSCollector (24-29)
  • NewIP2DNS (51-56)
internal/networkmonitor/ebpf/tracer_linux.go (1)
  • IsKernelBTFAvailable (128-131)
internal/networkmonitor/ebpf/tracer_nolinux.go (1)
  • IsKernelBTFAvailable (22-24)
internal/networkmonitor/monitor.go (2)
  • Config (24-29)
  • NewMonitor (91-102)
internal/transport/interface.go (1)
  • DakrClient (15-31)
internal/transport/dakr_client.go (1)
  • NewDakrClient (53-168)
internal/networkmonitor/pod_cache.go (1)
internal/collector/interface.go (1)
  • Pod (94-94)
internal/networkmonitor/ebpf/tracer_nolinux.go (3)
internal/networkmonitor/ebpf/tracer.go (1)
  • Tracer (21-25)
internal/networkmonitor/ebpf/types.go (1)
  • DNSEvent (11-16)
internal/networkmonitor/ebpf/tracer_linux.go (2)
  • IsKernelBTFAvailable (128-131)
  • InitCgroupv2 (133-142)
internal/networkmonitor/ebpf/tracer_linux.go (3)
internal/networkmonitor/ebpf/tracer.go (1)
  • Tracer (21-25)
internal/networkmonitor/ebpf/types.go (1)
  • DNSEvent (11-16)
internal/networkmonitor/ebpf/tracer_nolinux.go (2)
  • IsKernelBTFAvailable (22-24)
  • InitCgroupv2 (26-28)
internal/networkmonitor/ebpf/tracer.go (3)
internal/logger/logger.go (1)
  • Logger (26-29)
internal/networkmonitor/monitor.go (1)
  • Config (24-29)
internal/networkmonitor/ebpf/types.go (1)
  • DNSEvent (11-16)
internal/networkmonitor/conntrack_client.go (3)
internal/logger/logger.go (1)
  • Logger (26-29)
internal/networkmonitor/ebpf/tracer.go (1)
  • Config (5-8)
internal/networkmonitor/monitor.go (1)
  • Config (24-29)
internal/networkmonitor/cilium_time.go (1)
internal/networkmonitor/cilium_client_nolinux.go (3)
  • ClockSource (11-11)
  • ClockSourceKtime (14-14)
  • ClockSourceJiffies (15-15)
internal/networkmonitor/dns/ip2dns.go (2)
internal/networkmonitor/ebpf/types.go (1)
  • DNSEvent (11-16)
internal/networkmonitor/ebpf/tracer.go (1)
  • Tracer (21-25)
test/testserver/main.go (2)
gen/api/v1/metrics_collector.pb.go (6)
  • SendNetworkTrafficMetricsRequest (2281-2292)
  • SendNetworkTrafficMetricsRequest (2307-2307)
  • SendNetworkTrafficMetricsRequest (2322-2324)
  • SendNetworkTrafficMetricsResponse (2368-2374)
  • SendNetworkTrafficMetricsResponse (2389-2389)
  • SendNetworkTrafficMetricsResponse (2404-2406)
test/stats/types.go (2)
  • NetworkTrafficStat (87-98)
  • DnsLookupStat (101-108)
internal/networkmonitor/monitor.go (5)
internal/networkmonitor/dns/ip2dns.go (2)
  • DNSLookup (17-22)
  • DNSCollector (24-29)
internal/networkmonitor/conntrack_client.go (3)
  • Client (36-39)
  • Entry (15-24)
  • FilterBySrcIP (28-33)
internal/networkmonitor/pod_cache.go (1)
  • PodCache (12-16)
internal/transport/interface.go (1)
  • DakrClient (15-31)
gen/api/v1/metrics_collector.pb.go (12)
  • Namespace (1469-1503)
  • Namespace (1518-1518)
  • Namespace (1533-1535)
  • DnsLookupItem (2194-2205)
  • DnsLookupItem (2220-2220)
  • DnsLookupItem (2235-2237)
  • NetworkTrafficItem (2067-2083)
  • NetworkTrafficItem (2098-2098)
  • NetworkTrafficItem (2113-2115)
  • SendNetworkTrafficMetricsRequest (2281-2292)
  • SendNetworkTrafficMetricsRequest (2307-2307)
  • SendNetworkTrafficMetricsRequest (2322-2324)
verification/provision_eks.sh (2)
verification/provision_aks.sh (4)
  • need (4-4)
  • cluster_exists (42-44)
  • write_state (29-36)
  • read_state (38-40)
verification/provision_gke.sh (3)
  • need (4-4)
  • cluster_exists (41-43)
  • write_state (28-33)
🪛 actionlint (1.7.10)
.github/workflows/push-zxporter-netmon-image.yml

35-35: shellcheck reported issue in this script: SC2086:info:12:28: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:13:28: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:14:28: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:15:40: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:16:28: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:17:29: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:21:38: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:23:21: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:24:21: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:25:21: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:26:26: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:27:30: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:28:30: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2086:info:4:36: Double quote to prevent globbing and word splitting

(shellcheck)


35-35: shellcheck reported issue in this script: SC2129:style:12:3: Consider using { cmd1; cmd2; } >> file instead of individual redirects

(shellcheck)


35-35: shellcheck reported issue in this script: SC2129:style:23:3: Consider using { cmd1; cmd2; } >> file instead of individual redirects

(shellcheck)


76-76: shellcheck reported issue in this script: SC2086:info:1:53: Double quote to prevent globbing and word splitting

(shellcheck)


76-76: shellcheck reported issue in this script: SC2086:info:2:32: Double quote to prevent globbing and word splitting

(shellcheck)


76-76: shellcheck reported issue in this script: SC2086:info:3:55: Double quote to prevent globbing and word splitting

(shellcheck)


76-76: shellcheck reported issue in this script: SC2129:style:1:1: Consider using { cmd1; cmd2; } >> file instead of individual redirects

(shellcheck)


88-88: shellcheck reported issue in this script: SC2086:info:78:41: Double quote to prevent globbing and word splitting

(shellcheck)


88-88: shellcheck reported issue in this script: SC2086:info:7:67: Double quote to prevent globbing and word splitting

(shellcheck)


88-88: shellcheck reported issue in this script: SC2145:error:27:29: Argument mixes string and array. Use * or separate argument

(shellcheck)


88-88: shellcheck reported issue in this script: SC2145:error:39:30: Argument mixes string and array. Use * or separate argument

(shellcheck)


192-192: shellcheck reported issue in this script: SC2086:info:12:45: Double quote to prevent globbing and word splitting

(shellcheck)


192-192: shellcheck reported issue in this script: SC2086:info:13:50: Double quote to prevent globbing and word splitting

(shellcheck)


192-192: shellcheck reported issue in this script: SC2086:info:17:23: Double quote to prevent globbing and word splitting

(shellcheck)


192-192: shellcheck reported issue in this script: SC2086:info:18:23: Double quote to prevent globbing and word splitting

(shellcheck)


192-192: shellcheck reported issue in this script: SC2086:info:19:23: Double quote to prevent globbing and word splitting

(shellcheck)


192-192: shellcheck reported issue in this script: SC2086:info:20:28: Double quote to prevent globbing and word splitting

(shellcheck)


192-192: shellcheck reported issue in this script: SC2086:info:21:29: Double quote to prevent globbing and word splitting

(shellcheck)


192-192: shellcheck reported issue in this script: SC2086:info:22:32: Double quote to prevent globbing and word splitting

(shellcheck)


192-192: shellcheck reported issue in this script: SC2086:info:23:28: Double quote to prevent globbing and word splitting

(shellcheck)


192-192: shellcheck reported issue in this script: SC2145:error:27:57: Argument mixes string and array. Use * or separate argument

(shellcheck)


192-192: shellcheck reported issue in this script: SC2145:error:4:44: Argument mixes string and array. Use * or separate argument

(shellcheck)


237-237: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)


252-252: shellcheck reported issue in this script: SC2086:info:4:27: Double quote to prevent globbing and word splitting

(shellcheck)


252-252: shellcheck reported issue in this script: SC2086:info:7:26: Double quote to prevent globbing and word splitting

(shellcheck)


266-266: shellcheck reported issue in this script: SC2086:info:11:36: Double quote to prevent globbing and word splitting

(shellcheck)


266-266: shellcheck reported issue in this script: SC2086:info:16:35: Double quote to prevent globbing and word splitting

(shellcheck)


289-289: shellcheck reported issue in this script: SC2086:info:6:28: Double quote to prevent globbing and word splitting

(shellcheck)


311-311: property "workflow_name" is not defined in object type {push_latest: bool}

(expression)


318-318: property "workflow_name" is not defined in object type {push_latest: bool}

(expression)

.github/workflows/helm-release-netmon.yml

51-51: shellcheck reported issue in this script: SC2086:info:1:35: Double quote to prevent globbing and word splitting

(shellcheck)


58-58: shellcheck reported issue in this script: SC2086:info:8:38: Double quote to prevent globbing and word splitting

(shellcheck)


70-70: shellcheck reported issue in this script: SC2086:info:22:36: Double quote to prevent globbing and word splitting

(shellcheck)

🪛 checkmake (0.2.2)
Makefile

[warning] 817-817: Missing required phony target "clean"

(minphony)

🪛 Checkov (3.2.334)
config/zxporter-netmon.yaml

[medium] 9-62: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 9-62: Container should not be privileged

(CKV_K8S_16)


[medium] 9-62: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 9-62: Containers should not share the host network namespace

(CKV_K8S_19)

verification/traffic-gen.yaml

[medium] 1-53: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-53: Minimize the admission of root containers

(CKV_K8S_23)

verification/traffic-server.yaml

[medium] 1-43: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-43: Minimize the admission of root containers

(CKV_K8S_23)

🪛 Clang (14.0.6)
internal/networkmonitor/ebpf/c/include/vmlinux.h

[error] 258-258: field has incomplete type 'struct thread_info'

(clang-diagnostic-error)

🪛 Cppcheck (2.19.0)
internal/networkmonitor/ebpf/c/egressd.c

[error] 13-13: There is an unknown macro here somewhere. Configuration is required. If SEC is a macro then please configure it.

(unknownMacro)

🪛 Shellcheck (0.11.0)
verification/verify_e2e.sh

[warning] 24-24: ARCH appears unused. Verify use (or export if used externally).

(SC2034)

verification/provision_aks.sh

[warning] 39-39: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

verification/provision_eks.sh

[warning] 57-57: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

verification/provision_gke.sh

[warning] 37-37: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

🪛 YAMLlint (1.37.1)
helm-chart/zxporter-netmon/templates/daemonset.yaml

[error] 7-7: syntax error: expected the node content, but found '-'

(syntax)

helm-chart/zxporter-netmon/templates/clusterrole.yaml

[error] 5-5: syntax error: expected the node content, but found '-'

(syntax)

helm-chart/zxporter-netmon/templates/serviceaccount.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

helm-chart/zxporter-netmon/templates/clusterrolebinding.yaml

[error] 5-5: syntax error: expected the node content, but found '-'

(syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Run make test
  • GitHub Check: Build Docker Image
  • GitHub Check: Build Docker Image
  • GitHub Check: Analyze (go)

Comment on lines +12 to +20
var hertz uint16

func init() {
var err error
hertz, err = getKernelHZ()
if err != nil {
hertz = 1
}
}

⚠️ Potential issue | 🟠 Major

Problematic fallback value could break jiffies calculations.

The fallback hertz = 1 on initialization error is concerning. Typical kernel Hz values range from 100 to 1000, so a value of 1 would cause jiffies-based time calculations (line 49) to be off by 2-3 orders of magnitude. Additionally, the error is silently swallowed with no logging.

Consider:

  • Using a more reasonable fallback (e.g., 250 or 100)
  • Logging the error and fallback
  • Or failing fast if Hz cannot be determined
🔧 Proposed fix with logging and safer fallback

Add the fmt and os imports if not already present, then:

+import (
+	"fmt"
+	"os"
+)
+
 var hertz uint16
 
 func init() {
 	var err error
 	hertz, err = getKernelHZ()
 	if err != nil {
-		hertz = 1
+		// Fallback to a more typical value (250 Hz is common)
+		hertz = 250
+		// Log the error if logger is available
+		fmt.Fprintf(os.Stderr, "Warning: failed to detect kernel HZ, using fallback %d: %v\n", hertz, err)
 	}
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @internal/networkmonitor/cilium_time.go around lines 12-20, the init() sets
a dangerous fallback hertz = 1 when getKernelHZ() fails and swallows the error;
update init() to choose a realistic fallback (e.g., 250 or 100) instead of 1,
log the getKernelHZ() error and the chosen fallback via the package logger (or
return/fail fast if that’s preferred), and ensure callers using hertz (e.g.,
jiffies calculations) will see correct magnitude; locate symbols init, hertz and
getKernelHZ to implement the change and add a processLogger or package log call
to record the failure and fallback.
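
For reference, a minimal self-contained sketch of the direction suggested above. getKernelHZ is stubbed here so the example compiles on its own; the constant defaultKernelHZ and the log wording are illustrative, not code from the PR.

package networkmonitor

import (
	"fmt"
	"os"
)

// defaultKernelHZ is an illustrative fallback; 250 Hz is a common kernel setting.
const defaultKernelHZ uint16 = 250

var hertz uint16

// getKernelHZ stands in for the package's real detection helper.
func getKernelHZ() (uint16, error) { return 0, fmt.Errorf("not detected") }

func init() {
	hz, err := getKernelHZ()
	if err != nil {
		// Fall back to a realistic value and report the failure, rather than
		// silently using 1, which skews jiffies math by 2-3 orders of magnitude.
		hertz = defaultKernelHZ
		fmt.Fprintf(os.Stderr, "netmon: failed to detect kernel HZ (%v), falling back to %d\n", err, defaultKernelHZ)
		return
	}
	hertz = hz
}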

Comment on lines +123 to +126
echo "Cleaning up k8s resources (best effort)..."
kubectl delete svc -A --all --timeout=10s >/dev/null 2>&1 || true
kubectl delete pvc -A --all --timeout=10s >/dev/null 2>&1 || true
fi

⚠️ Potential issue | 🟠 Major

Aggressive cleanup deletes ALL services and PVCs cluster-wide.

kubectl delete svc -A --all will delete all services across all namespaces, including critical kube-system services (like kube-dns). While this runs best-effort before resource group deletion, it could cause issues if the deletion is interrupted or fails.

Consider limiting cleanup to test-related namespaces or excluding kube-system:

Proposed fix
     echo "Cleaning up k8s resources (best effort)..."
-    kubectl delete svc -A --all --timeout=10s >/dev/null 2>&1 || true
-    kubectl delete pvc -A --all --timeout=10s >/dev/null 2>&1 || true
+    # Cleanup test namespaces only, excluding kube-system
+    for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | grep -v '^kube-'); do
+      kubectl delete svc -n "$ns" --all --timeout=10s >/dev/null 2>&1 || true
+      kubectl delete pvc -n "$ns" --all --timeout=10s >/dev/null 2>&1 || true
+    done
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
echo "Cleaning up k8s resources (best effort)..."
kubectl delete svc -A --all --timeout=10s >/dev/null 2>&1 || true
kubectl delete pvc -A --all --timeout=10s >/dev/null 2>&1 || true
fi
echo "Cleaning up k8s resources (best effort)..."
# Cleanup test namespaces only, excluding kube-system
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | grep -v '^kube-'); do
kubectl delete svc -n "$ns" --all --timeout=10s >/dev/null 2>&1 || true
kubectl delete pvc -n "$ns" --all --timeout=10s >/dev/null 2>&1 || true
done
fi

Comment on lines +60 to +62
  # Note: T2A might not be in all zones of us-central1. us-central1-a,b,f usually.
  # Let's target a specific zone to simplify.
  ZONE="${REGION}-a"

⚠️ Potential issue | 🟠 Major

Hardcoded zone suffix may fail in some regions.

The zone is hardcoded as ${REGION}-a (Lines 62, 94, 107), which assumes all regions have an "a" zone and that T2A instances are available there. This assumption may not hold for all GCP regions.

Consider deriving available zones dynamically or allowing zone override:

🔧 Proposed fix
+# Allow zone override or detect available zones with T2A
+ZONE="${ZONE:-${REGION}-a}"
+
+# Verify zone supports the machine type
+if ! gcloud compute machine-types list \
+    --filter="name=${NODE_MACHINE_TYPE} AND zone:${ZONE}" \
+    --format="value(name)" --limit=1 | grep -q .; then
+  echo "ERROR: Machine type ${NODE_MACHINE_TYPE} not available in ${ZONE}" >&2
+  exit 1
+fi
+
-  ZONE="${REGION}-a"

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @verification/provision_gke.sh around lines 60 - 62, The script currently
hardcodes ZONE="${REGION}-a", which breaks in regions without an "-a" zone or
where T2A is unavailable; change ZONE to be derived dynamically or overridable:
add logic to respect an optional environment variable (e.g., ZONE override) and
if not set, query available zones for the REGION (via gcloud compute zones list
filtered by REGION and required machine type/feature) and pick a suitable zone,
then use that dynamic ZONE wherever ZONE is referenced (the ZONE variable and
all places that use it in provisioning, checks, and instance creation).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @internal/networkmonitor/pod_cache_test.go:
- Around line 27-28: The test currently ignores the return value from
factory.WaitForCacheSync; capture its result (e.g., synced :=
factory.WaitForCacheSync(ctx.Done())), iterate the returned map and fail the
test if any informer did not sync (for example with t.Fatalf or t.Fatal when any
value is false). Reference the existing
factory.Start(ctx.Done())/factory.WaitForCacheSync(ctx.Done()) calls and add the
"k8s.io/client-go/tools/cache" import if needed for types; ensure you use
ctx.Done() as the stop channel when checking the sync result (a sketch appears after this list).

In @Makefile:
- Around line 703-709: The Makefile target setup-lima currently hardcodes
macOS-only installation via "brew install lima"; modify the installation step to
detect the OS and choose the appropriate installer for limactl: if uname
indicates Darwin use brew, if Linux detect distro (apt/dnf/yum) and run the
distro-appropriate package install or fallback to the official install script,
and if Windows/WSL print an error or instructions; update the block around the
existing which limactl check in the setup-lima target so the install command is
conditional and still preserves the existing instance start logic using
LIMA_INSTANCE and LIMA_CONFIG.
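
As a reference for the pod_cache_test.go item above, a hedged sketch of the WaitForCacheSync check against a fake clientset. The test name and the fake-client setup are illustrative; only the capture-and-check of the sync map reflects the suggested fix.

package networkmonitor

import (
	"context"
	"testing"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
)

func TestInformerCacheSync(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 0)
	// Register at least one informer so WaitForCacheSync has something to report.
	_ = factory.Core().V1().Pods().Informer()

	factory.Start(ctx.Done())

	// Capture the per-informer results instead of discarding them.
	synced := factory.WaitForCacheSync(ctx.Done())
	for informerType, ok := range synced {
		if !ok {
			t.Fatalf("informer %v failed to sync", informerType)
		}
	}
}
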
🧹 Nitpick comments (4)
internal/networkmonitor/monitor_test.go (1)

50-70: Consider verifying hash consistency for IPv6.

While the current test ensures IPv6 hashing doesn't panic and produces non-zero results, it could be more robust by verifying that hashing the same IPv6 entry multiple times produces consistent results.

♻️ Optional enhancement
 func TestIPv6Hashing(t *testing.T) {
 	// Verify that IPv6 addresses are hashed correctly and do not panic
 	srcIP := netaddr.MustParseIP("2001:db8::1")
 	dstIP := netaddr.MustParseIP("2001:db8::2")
 	entry := &Entry{
 		Src:   netaddr.IPPortFrom(srcIP, 12345),
 		Dst:   netaddr.IPPortFrom(dstIP, 80),
 		Proto: 6, // TCP
 	}
 
-	// Should not panic
+	// Should not panic and should be consistent
 	h1 := conntrackEntryKey(entry)
 	h2 := entryGroupKey(entry)
 
 	if h1 == 0 {
 		t.Error("conntrackEntryKey returned 0 for IPv6")
 	}
 	if h2 == 0 {
 		t.Error("entryGroupKey returned 0 for IPv6")
 	}
+
+	// Verify consistency
+	h1Again := conntrackEntryKey(entry)
+	h2Again := entryGroupKey(entry)
+	if h1 != h1Again {
+		t.Errorf("conntrackEntryKey inconsistent for IPv6: %d vs %d", h1, h1Again)
+	}
+	if h2 != h2Again {
+		t.Errorf("entryGroupKey inconsistent for IPv6: %d vs %d", h2, h2Again)
+	}
 }
cmd/zxporter-netmon/main.go (2)

28-35: Rename the kubeconfig flag for clarity.

The flag metrics-kubeconfig (line 32) is misleading—it's used for general Kubernetes client operations (e.g., creating clientsets, informers) throughout the application, not just for metrics. Renaming it to simply kubeconfig would better reflect its purpose and reduce confusion.

📝 Suggested refactor
-	kubeconfig      = flag.String("metrics-kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster.")
+	kubeconfig      = flag.String("kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster.")

46-48: Consider handling the logger initialization error.

The error from zap.NewProduction() is currently ignored (line 47). While this constructor rarely fails in practice, handling the error defensively would make the initialization more robust.

🛡️ Defensive error handling
-	zapLog, _ := zap.NewProduction()
+	zapLog, err := zap.NewProduction()
+	if err != nil {
+		fmt.Fprintf(os.Stderr, "failed to initialize logger: %v\n", err)
+		os.Exit(1)
+	}
	logger := zapr.NewLogger(zapLog)
Makefile (1)

821-823: LGTM: Clean metrics target is straightforward.

The clean-metrics target properly removes generated verification artifacts.

Optional: Consider adding a top-level clean target.

Makefile conventions typically include a clean target that orchestrates all cleanup operations. You could add:

.PHONY: clean
clean: clean-metrics helm-chart-clean ## Clean all build artifacts
	@rm -rf $(DIST_DIR)
	@rm -rf $(LOCALBIN)

Based on static analysis hints.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b91b7f and ebc8e69.

📒 Files selected for processing (9)
  • Dockerfile.netmon
  • Makefile
  • cmd/zxporter-netmon/main.go
  • helm-chart/zxporter-netmon/templates/clusterrole.yaml
  • helm-chart/zxporter-netmon/templates/clusterrolebinding.yaml
  • internal/networkmonitor/monitor.go
  • internal/networkmonitor/monitor_test.go
  • internal/networkmonitor/pod_cache.go
  • internal/networkmonitor/pod_cache_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • Dockerfile.netmon
  • internal/networkmonitor/pod_cache.go
  • internal/networkmonitor/monitor.go
🧰 Additional context used
🧬 Code graph analysis (3)
internal/networkmonitor/monitor_test.go (1)
internal/networkmonitor/conntrack_client.go (1)
  • Entry (15-24)
cmd/zxporter-netmon/main.go (6)
internal/networkmonitor/conntrack_client.go (1)
  • Client (36-39)
internal/networkmonitor/cilium_client_nolinux.go (2)
  • CiliumAvailable (22-24)
  • NewCiliumClient (18-20)
internal/networkmonitor/ebpf/tracer_linux.go (1)
  • IsKernelBTFAvailable (128-131)
internal/networkmonitor/ebpf/tracer_nolinux.go (1)
  • IsKernelBTFAvailable (22-24)
internal/networkmonitor/monitor.go (1)
  • Config (24-29)
internal/transport/interface.go (1)
  • DakrClient (15-31)
internal/networkmonitor/pod_cache_test.go (1)
internal/networkmonitor/pod_cache.go (2)
  • NewPodCache (20-57)
  • PodCache (12-16)
🪛 checkmake (0.2.2)
Makefile

[warning] 821-821: Missing required phony target "clean"

(minphony)

🪛 YAMLlint (1.37.1)
helm-chart/zxporter-netmon/templates/clusterrole.yaml

[error] 4-4: syntax error: expected <block end>, but found '<scalar>'

(syntax)

helm-chart/zxporter-netmon/templates/clusterrolebinding.yaml

[error] 4-4: syntax error: expected <block end>, but found '<scalar>'

(syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build Docker Image
  • GitHub Check: Run make test
  • GitHub Check: Build Docker Image
  • GitHub Check: Analyze (go)
🔇 Additional comments (13)
internal/networkmonitor/pod_cache_test.go (5)

30-47: LGTM!

Pod creation logic is correct, with appropriate error handling and cache synchronization wait.


49-68: LGTM!

The update logic correctly verifies both the new IP mapping and stale entry removal. The inline comment appropriately explains the timing assumptions.


70-77: LGTM!

Pod deletion and verification logic is correct.


80-98: LGTM!

The polling helper is well-designed with appropriate timeout, interval, and validation logic.


100-116: LGTM!

The deletion verification helper follows the same solid pattern as waitForPod.

helm-chart/zxporter-netmon/templates/clusterrole.yaml (1)

1-9: LGTM! Past issue has been resolved.

The ClusterRole now correctly uses the fullname helper for naming (line 4), which addresses the concern from the previous review. The RBAC permissions are appropriately scoped to read-only access for pods, following the principle of least privilege.

Note: The YAMLlint syntax error is a false positive—it doesn't recognize Helm template directives.

helm-chart/zxporter-netmon/templates/clusterrolebinding.yaml (1)

1-13: LGTM! Correct helper usage for RBAC binding.

The ClusterRoleBinding properly uses the fullname helper for both the binding name (line 4) and the ClusterRole reference (line 9), while using the serviceAccountName helper for the subject (line 12). This separation allows for flexible ServiceAccount name overrides via values.yaml while maintaining consistent ClusterRole naming.

Note: The YAMLlint syntax error is a false positive—it doesn't recognize Helm template directives.

Makefile (6)

253-262: LGTM: Netmon image targets follow established patterns.

The docker build and push targets for zxporter-netmon are well-structured and consistent with the existing testserver and stress image targets.


711-723: LGTM: Verification targets are well-structured.

The verify-local and verify-e2e targets are properly parameterized and include helpful usage comments. The chmod commands ensure the scripts are executable.


725-746: LGTM: Provisioning targets are properly structured.

The EKS and Kind provisioning targets are straightforward wrappers around shell scripts. The Kind cluster creation includes a helpful fallback for existing clusters.


748-765: LGTM: EKS lifecycle properly handles cleanup on verification failure.

The verify-e2e-eks-lifecycle target correctly implements error handling to ensure deprovision-eks always runs, even when verification fails. This addresses the concern raised in the previous review.

The pattern captures the verification exit code, guarantees cleanup execution, and then exits with the original verification result.


767-793: LGTM: AKS lifecycle properly handles cleanup on verification failure.

The verify-e2e-aks-lifecycle target correctly implements the same error handling pattern as EKS, ensuring deprovision-aks always runs regardless of verification outcome. This addresses the concern from the previous review.


134-134: Remove or clarify the purpose of go generate ./....

There are no active //go:generate directives in the codebase. The only directive in internal/networkmonitor/ebpf/generate.go is commented out and requires manual uncommentation plus LLVM installation to function. The eBPF artifacts are pre-generated and checked into the repository. As written, go generate ./... is a no-op and doesn't actually generate any code during the build.

Likely an incorrect or invalid review comment.

Comment on lines +703 to +709
.PHONY: setup-lima
setup-lima:
	@echo "Checking for Lima..."
	@which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
	@echo "Checking if instance $(LIMA_INSTANCE) exists..."
	@limactl list | grep -q $(LIMA_INSTANCE) || (echo "Creating instance $(LIMA_INSTANCE)..." && limactl start --tty=false --name=$(LIMA_INSTANCE) $(LIMA_CONFIG))
	@echo "Lima instance $(LIMA_INSTANCE) is ready."

⚠️ Potential issue | 🟠 Major

Add cross-platform support for Lima installation.

Line 706 uses brew install lima, which only works on macOS. This will fail on Linux systems.

🔧 Suggested improvements for cross-platform support

Consider checking the operating system and providing appropriate installation instructions:

 .PHONY: setup-lima
 setup-lima:
 	@echo "Checking for Lima..."
-	@which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
+	@which limactl >/dev/null || { \
+		echo "Lima not found. Please install it:"; \
+		echo "  macOS: brew install lima"; \
+		echo "  Linux: See https://github.com/lima-vm/lima#installation"; \
+		exit 1; \
+	}
 	@echo "Checking if instance $(LIMA_INSTANCE) exists..."
 	@limactl list | grep -q $(LIMA_INSTANCE) || (echo "Creating instance $(LIMA_INSTANCE)..." && limactl start --tty=false --name=$(LIMA_INSTANCE) $(LIMA_CONFIG))
 	@echo "Lima instance $(LIMA_INSTANCE) is ready."

Alternatively, add OS detection:

 .PHONY: setup-lima
 setup-lima:
 	@echo "Checking for Lima..."
-	@which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
+	@which limactl >/dev/null || { \
+		echo "Installing Lima..."; \
+		if [ "$$(uname)" = "Darwin" ]; then \
+			brew install lima; \
+		elif [ "$$(uname)" = "Linux" ]; then \
+			echo "Please install Lima: https://github.com/lima-vm/lima#installation"; \
+			exit 1; \
+		fi; \
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
.PHONY: setup-lima
setup-lima:
	@echo "Checking for Lima..."
	@which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
	@echo "Checking if instance $(LIMA_INSTANCE) exists..."
	@limactl list | grep -q $(LIMA_INSTANCE) || (echo "Creating instance $(LIMA_INSTANCE)..." && limactl start --tty=false --name=$(LIMA_INSTANCE) $(LIMA_CONFIG))
	@echo "Lima instance $(LIMA_INSTANCE) is ready."
.PHONY: setup-lima
setup-lima:
	@echo "Checking for Lima..."
	@which limactl >/dev/null || { \
		echo "Installing Lima..."; \
		if [ "$$(uname)" = "Darwin" ]; then \
			brew install lima; \
		elif [ "$$(uname)" = "Linux" ]; then \
			echo "Please install Lima: https://github.com/lima-vm/lima#installation"; \
			exit 1; \
		fi; \
	}
	@echo "Checking if instance $(LIMA_INSTANCE) exists..."
	@limactl list | grep -q $(LIMA_INSTANCE) || (echo "Creating instance $(LIMA_INSTANCE)..." && limactl start --tty=false --name=$(LIMA_INSTANCE) $(LIMA_CONFIG))
	@echo "Lima instance $(LIMA_INSTANCE) is ready."
🤖 Prompt for AI Agents
In @Makefile around lines 703 - 709, The Makefile target setup-lima currently
hardcodes macOS-only installation via "brew install lima"; modify the
installation step to detect the OS and choose the appropriate installer for
limactl: if uname indicates Darwin use brew, if Linux detect distro
(apt/dnf/yum) and run the distro-appropriate package install or fallback to the
official install script, and if Windows/WSL print an error or instructions;
update the block around the existing which limactl check in the setup-lima
target so the install command is conditional and still preserves the existing
instance start logic using LIMA_INSTANCE and LIMA_CONFIG.

Comment on lines +808 to +818
.PHONY: verify-e2e-gke-lifecycle
verify-e2e-gke-lifecycle: provision-gke ## Full GKE E2E (Provision -> Verify -> Deprovision)
	@echo "Detected User: $$(whoami)"
	@WHO="$$(whoami | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-' | sed 's/^-//; s/-$$//')" && \
	CLUSTER_NAME="gke-zxporter-e2e-$${WHO}" && \
	echo "Updating kubeconfig for GKE cluster $${CLUSTER_NAME}..." && \
	gcloud container clusters get-credentials $${CLUSTER_NAME} --zone us-central1-a && \
	CONTEXT=$$(kubectl config current-context) && \
	echo "Using Context: $${CONTEXT}" && \
	$(MAKE) verify-e2e CLUSTER_CONTEXT="$${CONTEXT}" NAMESPACE=devzero-zxporter
	$(MAKE) deprovision-gke

⚠️ Potential issue | 🔴 Critical

GKE resources may be orphaned if E2E verification fails.

Lines 817-818 are separate Make invocations. If verify-e2e fails on line 817, the Makefile will abort before executing deprovision-gke on line 818, potentially leaving GKE clusters running and incurring costs.

This is the same issue that was fixed for EKS (lines 762-765) and AKS (lines 790-793), but the fix was not applied to GKE.

🔧 Apply the same error handling pattern used for EKS and AKS
 .PHONY: verify-e2e-gke-lifecycle
 verify-e2e-gke-lifecycle: provision-gke ## Full GKE E2E (Provision -> Verify -> Deprovision)
 	@echo "Detected User: $$(whoami)"
 	@WHO="$$(whoami | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-' | sed 's/^-//; s/-$$//')" && \
 	CLUSTER_NAME="gke-zxporter-e2e-$${WHO}" && \
 	echo "Updating kubeconfig for GKE cluster $${CLUSTER_NAME}..." && \
 	gcloud container clusters get-credentials $${CLUSTER_NAME} --zone us-central1-a && \
 	CONTEXT=$$(kubectl config current-context) && \
 	echo "Using Context: $${CONTEXT}" && \
-	$(MAKE) verify-e2e CLUSTER_CONTEXT="$${CONTEXT}" NAMESPACE=devzero-zxporter
-	$(MAKE) deprovision-gke
+	VERIFY_EXIT=0; \
+	$(MAKE) verify-e2e CLUSTER_CONTEXT="$${CONTEXT}" NAMESPACE=devzero-zxporter || VERIFY_EXIT=$$?; \
+	$(MAKE) deprovision-gke; \
+	exit $$VERIFY_EXIT

This ensures that:

  1. The verification exit code is captured
  2. deprovision-gke always runs
  3. The original verification exit code is returned

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/push-zxporter-netmon-image.yml:
- Line 137: Update the inline comment that currently reads "Definately have to
fail if any non-latest tags exist" to correct the typo to "Definitely have to
fail if any non-latest tags exist"; locate and edit the comment in the
push-zxporter-netmon-image workflow file where that line appears.
- Around line 236-241: The workflow step "Checkout repo" is using
actions/checkout@v3 while another step already uses @v4; update the checkout
action reference in that step from actions/checkout@v3 to actions/checkout@v4 to
ensure consistency and runner compatibility—locate the step with name "Checkout
repo" and replace the version token in the uses field accordingly, keeping the
existing with: repository, path and token parameters unchanged.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebc8e69 and ec65434.

📒 Files selected for processing (6)
  • .github/workflows/push-zxporter-netmon-image.yml
  • internal/networkmonitor/conntrack_client.go
  • internal/networkmonitor/conntrack_client_test.go
  • internal/networkmonitor/monitor.go
  • internal/networkmonitor/monitor_test.go
  • internal/networkmonitor/pod_cache_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/networkmonitor/pod_cache_test.go
🧰 Additional context used
🧬 Code graph analysis (3)
internal/networkmonitor/monitor_test.go (1)
internal/networkmonitor/conntrack_client.go (1)
  • Entry (15-24)
internal/networkmonitor/conntrack_client_test.go (1)
internal/networkmonitor/conntrack_client.go (1)
  • Entry (15-24)
internal/networkmonitor/monitor.go (7)
internal/networkmonitor/dns/ip2dns.go (2)
  • DNSLookup (17-22)
  • DNSCollector (24-29)
internal/networkmonitor/ebpf/tracer.go (1)
  • Config (5-8)
internal/logger/logger.go (1)
  • Logger (26-29)
internal/networkmonitor/conntrack_client.go (3)
  • Client (36-39)
  • Entry (15-24)
  • FilterBySrcIP (28-33)
internal/networkmonitor/pod_cache.go (1)
  • PodCache (12-16)
internal/transport/interface.go (1)
  • DakrClient (15-31)
gen/api/v1/metrics_collector.pb.go (12)
  • Namespace (1469-1503)
  • Namespace (1518-1518)
  • Namespace (1533-1535)
  • DnsLookupItem (2194-2205)
  • DnsLookupItem (2220-2220)
  • DnsLookupItem (2235-2237)
  • NetworkTrafficItem (2067-2083)
  • NetworkTrafficItem (2098-2098)
  • NetworkTrafficItem (2113-2115)
  • SendNetworkTrafficMetricsRequest (2281-2292)
  • SendNetworkTrafficMetricsRequest (2307-2307)
  • SendNetworkTrafficMetricsRequest (2322-2324)
🪛 actionlint (1.7.10)
.github/workflows/push-zxporter-netmon-image.yml

237-237: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Analyze (go)
  • GitHub Check: Build Docker Image
  • GitHub Check: Build Docker Image
  • GitHub Check: Run make test
🔇 Additional comments (5)
internal/networkmonitor/conntrack_client_test.go (1)

10-94: LGTM! Comprehensive test coverage.

The test suite thoroughly covers valid sessions, nil field handling, and filter behavior. The helper functions and test structure are well-organized.

internal/networkmonitor/monitor_test.go (1)

11-71: LGTM! Concurrency and IPv6 tests are well-implemented.

The error channel pattern correctly handles concurrent testing, and the IPv6 test validates proper address handling. Previous concurrency safety issues have been resolved.
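
For readers unfamiliar with the pattern the comment refers to, a generic sketch of an error-channel concurrent test (illustrative names, not the PR's actual test): worker goroutines report failures over a buffered channel, and the test goroutine drains it after wg.Wait(), since t.Fatal must not be called from other goroutines.

package networkmonitor

import (
	"sync"
	"testing"
)

// doWork stands in for whatever the real test exercises concurrently.
func doWork(i int) error { return nil }

func TestConcurrentWorkers(t *testing.T) {
	const workers = 8
	errCh := make(chan error, workers) // buffered so senders never block

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if err := doWork(i); err != nil {
				errCh <- err // report instead of calling t.Fatal here
			}
		}(i)
	}
	wg.Wait()
	close(errCh)

	for err := range errCh {
		t.Errorf("worker failed: %v", err)
	}
}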

internal/networkmonitor/conntrack_client.go (1)

74-124: LGTM! Previous critical issues resolved.

The comprehensive nil checks for Origin.Dst (line 79) and the nil filter guard (line 119) correctly address previously identified risks. The session processing logic is robust.

internal/networkmonitor/monitor.go (2)

319-356: LGTM! Critical hashing issues resolved.

The use of local maphash.Hash instances (lines 323, 342) eliminates data races, and As16() (lines 325, 331, 344, 347) correctly handles both IPv4 and IPv6 addresses. The globalHashSeed ensures consistent hashing across calls.
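
A generic sketch of the technique being described, using net/netip and hash/maphash for illustration (the PR itself works with the netaddr types seen in the tests; flowKey and globalSeed are illustrative names):

package networkmonitor

import (
	"encoding/binary"
	"hash/maphash"
	"net/netip"
)

// A process-wide seed keeps hashes comparable across calls and goroutines.
var globalSeed = maphash.MakeSeed()

// flowKey hashes a connection key. Using a fresh maphash.Hash per call avoids
// sharing mutable hash state between goroutines, and As16 gives a uniform
// 16-byte representation for both IPv4 and IPv6 addresses.
func flowKey(src, dst netip.Addr, srcPort, dstPort uint16, proto uint8) uint64 {
	var h maphash.Hash
	h.SetSeed(globalSeed)

	srcBytes := src.As16()
	dstBytes := dst.As16()
	h.Write(srcBytes[:])
	h.Write(dstBytes[:])

	var tail [5]byte
	binary.BigEndian.PutUint16(tail[0:2], srcPort)
	binary.BigEndian.PutUint16(tail[2:4], dstPort)
	tail[4] = proto
	h.Write(tail[:])

	return h.Sum64()
}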


104-142: LGTM! Clean event loop implementation.

The monitor loop correctly handles collection, cleanup, and flush intervals. Previous duplicate flushTicker.C case has been resolved, and DNS collector integration is properly handled.
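
A stripped-down sketch of such a loop, with illustrative intervals and callback names rather than the PR's actual Monitor fields:

package networkmonitor

import (
	"context"
	"time"
)

// run drives collection, cleanup, and flushing off separate tickers and
// stops (with a final flush) when the context is cancelled.
func run(ctx context.Context, collect, cleanup, flush func()) {
	collectTicker := time.NewTicker(5 * time.Second)
	cleanupTicker := time.NewTicker(30 * time.Second)
	flushTicker := time.NewTicker(60 * time.Second)
	defer collectTicker.Stop()
	defer cleanupTicker.Stop()
	defer flushTicker.Stop()

	for {
		select {
		case <-ctx.Done():
			flush()
			return
		case <-collectTicker.C:
			collect()
		case <-cleanupTicker.C:
			cleanup()
		case <-flushTicker.C:
			flush()
		}
	}
}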

Comment on lines +236 to +241
      - name: Checkout repo
        uses: actions/checkout@v3
        with:
          repository: devzero-inc/zxporter
          path: zxporter-tmp
          token: ${{ steps.get_workflow_token.outputs.token }}

⚠️ Potential issue | 🟡 Minor

Update checkout action to v4.

Line 237 uses actions/checkout@v3, which is outdated. Line 29 correctly uses @v4. Update for consistency and to avoid runner compatibility issues.

🔄 Proposed fix
       - name: Checkout repo
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
         with:
           repository: devzero-inc/zxporter
           path: zxporter-tmp

Based on static analysis hints.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
      - name: Checkout repo
        uses: actions/checkout@v3
        with:
          repository: devzero-inc/zxporter
          path: zxporter-tmp
          token: ${{ steps.get_workflow_token.outputs.token }}
      - name: Checkout repo
        uses: actions/checkout@v4
        with:
          repository: devzero-inc/zxporter
          path: zxporter-tmp
          token: ${{ steps.get_workflow_token.outputs.token }}
🧰 Tools
🪛 actionlint (1.7.10)

237-237: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
In @.github/workflows/push-zxporter-netmon-image.yml around lines 236 - 241, The
workflow step "Checkout repo" is using actions/checkout@v3 while another step
already uses @v4; update the checkout action reference in that step from
actions/checkout@v3 to actions/checkout@v4 to ensure consistency and runner
compatibility—locate the step with name "Checkout repo" and replace the version
token in the uses field accordingly, keeping the existing with: repository, path
and token parameters unchanged.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/push-zxporter-netmon-image.yml:
- Around line 21-26: The package-and-push job currently sets FULLV only as an
environment variable so downstream jobs can't read it; modify the
package-and-push job to expose FULLV as a job output (add FULLV under outputs)
and update the existing version step (the step named "version") to emit FULLV
via the step outputs, then change the update-helm-values-image-tag-pr job to
consume it as needs.package-and-push.outputs.fullv (replace all uses of
env.FULLV in that job with that needs expression).
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec65434 and 9178fd4.

📒 Files selected for processing (2)
  • .github/workflows/push-zxporter-netmon-image.yml
  • internal/networkmonitor/pod_cache.go
🧰 Additional context used
🪛 actionlint (1.7.10)
.github/workflows/push-zxporter-netmon-image.yml

237-237: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build Docker Image
  • GitHub Check: Build Docker Image
  • GitHub Check: Run make test
  • GitHub Check: Analyze (go)
🔇 Additional comments (7)
internal/networkmonitor/pod_cache.go (4)

27-52: LGTM! Previous type safety and IP change issues resolved.

The event handlers now properly use safe type assertions and handle IP changes correctly:

  • AddFunc and UpdateFunc check type assertions with the comma-ok idiom
  • UpdateFunc removes stale IP mappings when a pod's IP changes
  • Logic correctly handles edge cases (empty IPs, parse failures); a generic sketch of the update path follows this list
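
A generic sketch of that update path (the ipCache type and field names are illustrative, not the PR's PodCache):

package networkmonitor

import (
	"net/netip"
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// ipCache is an illustrative stand-in for the PR's pod cache.
type ipCache struct {
	mu   sync.RWMutex
	byIP map[netip.Addr]*corev1.Pod
}

// onUpdate shows the comma-ok assertions plus stale-IP cleanup pattern.
func (c *ipCache) onUpdate(oldObj, newObj interface{}) {
	oldPod, ok := oldObj.(*corev1.Pod)
	if !ok {
		return
	}
	newPod, ok := newObj.(*corev1.Pod)
	if !ok {
		return
	}

	c.mu.Lock()
	defer c.mu.Unlock()

	// Drop the old mapping if the pod's IP changed.
	if oldPod.Status.PodIP != "" && oldPod.Status.PodIP != newPod.Status.PodIP {
		if ip, err := netip.ParseAddr(oldPod.Status.PodIP); err == nil {
			delete(c.byIP, ip)
		}
	}
	if newPod.Status.PodIP == "" {
		return
	}
	if ip, err := netip.ParseAddr(newPod.Status.PodIP); err == nil {
		c.byIP[ip] = newPod
	}
}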

63-76: LGTM! Clean implementation with proper validation.

The method correctly validates the pod IP before parsing and updating the cache, with appropriate error handling and locking.


78-104: LGTM! Robust deletion logic with tombstone handling.

The method properly handles both direct pod deletions and tombstones (DeletedFinalStateUnknown), ensuring reliable cleanup even when the informer provides wrapped objects.


106-124: LGTM! Thread-safe getters with appropriate locking.

Both methods use read locks correctly. GetLocalPodIPs defensively returns a copy of the IP set, preventing external modification of internal state.
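
A sketch of the defensive-copy getter, reusing the illustrative ipCache type from the sketch above; the method name is hypothetical:

// localIPs returns a copy of the cached IP set under a read lock so callers
// cannot mutate the cache's internal state.
func (c *ipCache) localIPs() map[netip.Addr]struct{} {
	c.mu.RLock()
	defer c.mu.RUnlock()

	out := make(map[netip.Addr]struct{}, len(c.byIP))
	for ip := range c.byIP {
		out[ip] = struct{}{}
	}
	return out
}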

.github/workflows/push-zxporter-netmon-image.yml (3)

28-64: LGTM!

The version extraction logic correctly handles both release tags (v*..) and non-release commits, properly excluding chart-* tags and setting appropriate fallback values.


182-220: LGTM!

The multi-arch build correctly uses validated tags and pushes to both Docker Hub and ECR with appropriate build arguments and platform targets.


297-319: LGTM!

The Slack notification logic correctly uses the always() condition for failure notifications and conditionally notifies on success only when a PR is created.

Comment on lines +21 to +26
  package-and-push:
    name: Package & Push Docker image
    runs-on: ubuntu-latest
    outputs:
      is_release: ${{ steps.version.outputs.is_release }}


⚠️ Potential issue | 🔴 Critical

Critical: Missing FULLV in job outputs breaks Helm update.

The second job update-helm-values-image-tag-pr uses ${{ env.FULLV }} (lines 245, 271, 280, 291-293), but FULLV is only set as an environment variable in the first job and not exposed as an output. GitHub Actions does not share environment variables across jobs, so env.FULLV will be empty in the second job, causing the Helm update to fail or use an incorrect tag.

🔧 Proposed fix

Add FULLV to the job outputs:

     runs-on: ubuntu-latest
     outputs:
       is_release: ${{ steps.version.outputs.is_release }}
+      fullv: ${{ steps.version.outputs.fullv }}

Then update the version step to output FULLV:

           echo "FULLV=${FULLV}" >> $GITHUB_ENV
           echo "IS_RELEASE=true" >> $GITHUB_ENV
 
           echo "Debug: MAJOR=${BASH_REMATCH[1]}, MINOR=${BASH_REMATCH[2]}, PATCH=${PATCH}, PATCH_SHORT=${PATCH_SHORT}"
           echo "Debug: FULLV=${FULLV}"
           echo "is_release=${IS_RELEASE}" >> $GITHUB_OUTPUT
+          echo "fullv=${FULLV}" >> $GITHUB_OUTPUT

And reference it in the second job as ${{ needs.package-and-push.outputs.fullv }} instead of ${{ env.FULLV }}.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @.github/workflows/push-zxporter-netmon-image.yml around lines 21 - 26, The
package-and-push job currently sets FULLV only as an environment variable so
downstream jobs can't read it; modify the package-and-push job to expose FULLV
as a job output (add FULLV under outputs) and update the existing version step
(the step named "version") to emit FULLV via the step outputs, then change the
update-helm-values-image-tag-pr job to consume it as
needs.package-and-push.outputs.fullv (replace all uses of env.FULLV in that job
with that needs expression).

@dray92 dray92 merged commit 6192315 into main Jan 8, 2026
25 checks passed
@dray92 dray92 deleted the debo/netmon branch January 8, 2026 02:01