[netmon] daemonset for node-level network monitoring #226
📝 Walkthrough

Adds zxporter-netmon: a Go network-monitor agent (conntrack + optional eBPF DNS tracer), a Helm chart and DaemonSet, multi-arch image and release CI workflows, eBPF CO-RE headers and programs, verification/provisioning scripts, a monitoring/aggregation pipeline, and Dakr transport RPCs for network and DNS metrics.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant User
participant App as zxporter-netmon
participant PodCache
participant Conntrack
participant eBPF
participant IP2DNS
participant Monitor
participant Dakr
participant HTTP
User->>App: start (--standalone / in-cluster)
App->>PodCache: init (if in-cluster)
App->>Conntrack: init (netfilter or cilium)
App->>eBPF: init tracer (if Kernel BTF available)
eBPF->>IP2DNS: emit DNSEvent
loop Read interval
Monitor->>Conntrack: ListEntries()
Conntrack-->>Monitor: entries
Monitor->>PodCache: GetLocalPodIPs()
Monitor->>Monitor: compute deltas & aggregate flows
end
loop Flush interval
Monitor->>IP2DNS: Lookups()
IP2DNS-->>Monitor: recent lookups
Monitor->>Dakr: SendNetworkTrafficMetrics(payload)
Dakr-->>Monitor: ACK
end
User->>HTTP: GET /metrics
HTTP->>Monitor: GetMetricsHandler
Monitor-->>HTTP: JSON metrics
HTTP-->>User: response
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ❌ 1 failed check (1 warning) | ✅ 2 passed checks
Actionable comments posted: 14
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
go.mod (1)
3-3: Upgrade Go version and review pseudo-version dependencies. Go 1.24.0 had documented correctness and security issues post-release (including CVE-2025-4674). Upgrade to a stable 1.24.x release (e.g., 1.24.5 or later).
Additionally, review the two pseudo-version dependencies in the indirect requires block:
- inet.af/netaddr pins to commit b8eac61e914a from May 2023; clarify if this is intentional (e.g., if no stable release exists) or if a newer released version should be preferred.
- google.golang.org/genproto uses commit 9219d122eba9 from September 2025; verify that this cutting-edge commit remains stable and is necessary.
🤖 Fix all issues with AI agents
In @cmd/zxporter-netmon/main.go:
- Around line 193-202: The shutdown sequence never cancels the monitor's
context, so call the cancel function created with ctx, cancel :=
context.WithCancel(...) to trigger Monitor.Start() cleanup (which closes m.ct).
Update the shutdown flow around logger.Info("Shutting down...") and
server.Shutdown(shutdownCtx) to invoke cancel() (either before or immediately
after server.Shutdown) and ensure shutdownCancel() is still deferred; this will
let the monitor exit and release the conntrack client.
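A minimal, self-contained sketch of the shutdown ordering this prompt describes; `runMonitor` is a hypothetical stand-in for the real `Monitor.Start()`, and the port and timeout values are assumptions rather than the actual configuration:

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Context watched by the monitor goroutine; cancelling it is what lets
	// Monitor.Start() return and release the conntrack client.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	server := &http.Server{Addr: ":8080"}
	go server.ListenAndServe() // error handling elided in this sketch
	go runMonitor(ctx)         // hypothetical stand-in for Monitor.Start()

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	<-sigCh

	// Drain HTTP first, then cancel the monitor; either order works as long
	// as cancel() is actually invoked during shutdown.
	shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer shutdownCancel()
	_ = server.Shutdown(shutdownCtx)
	cancel()
}

func runMonitor(ctx context.Context) {
	<-ctx.Done() // the real monitor would close its conntrack client here
}
```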
In @config/zxporter-netmon.yaml:
- Around line 28-33: Remove the invalid pod-level securityContext block that
sets privileged: true and capabilities (NET_ADMIN, SYS_RESOURCE); those fields
are only valid at the container level and are already defined in the existing
container-level securityContext, so delete the pod-level securityContext section
(the block containing privileged and capabilities) to avoid validation errors or
ignored settings and rely on the container-level securityContext instead.
In @Dockerfile.netmon:
- Around line 26-33: The Dockerfile sets FROM gcr.io/distroless/static:nonroot
but then overrides it with USER 0:0, which runs the container as root; remove
the USER 0:0 line so the image runs as the nonroot user provided by distroless
(UID 65532), or if root is truly required, change the base to FROM
gcr.io/distroless/static:latest and add a short comment justifying why root is
necessary; ensure ENTRYPOINT ["/zxporter-netmon"] remains unchanged.
In @helm-chart/zxporter-netmon/templates/clusterrole.yaml:
- Line 4: The ClusterRole template is using the serviceAccountName helper
(include "zxporter-netmon.serviceAccountName" .) which is semantically wrong;
replace that include with include "zxporter-netmon.fullname" . to use the
chart's canonical resource name, or add a dedicated helper named
"zxporter-netmon.clusterRoleName" in templates/_helpers.tpl that returns the
intended name and use include "zxporter-netmon.clusterRoleName" . in the
clusterrole.yaml; update any references so ClusterRole naming no longer depends
on serviceAccount.create/serviceAccount.name helper logic.
In @internal/networkmonitor/ebpf/bpf_bpfel.go:
- Around line 107-114: _BpfClose currently returns on the first Close error
which can leak later resources; change it to always iterate and call Close on
every io.Closer in the closers slice, capturing either the first non-nil error
or aggregating errors (e.g., using errors.Join if available) and return that
after the loop; update the generated function _BpfClose to perform all closes
and then return the collected error, and if possible fix the bpf2go generator so
future outputs behave the same.
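A hedged sketch of the close-everything behaviour using errors.Join (Go 1.20+); the name closeAll is hypothetical and only illustrates what the generated _BpfClose body would need to do:

```go
package bpfutil

import (
	"errors"
	"io"
)

// closeAll closes every closer even if an earlier Close fails, then returns
// the joined errors (nil when all closes succeeded).
func closeAll(closers ...io.Closer) error {
	var errs []error
	for _, c := range closers {
		if err := c.Close(); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```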
In @internal/networkmonitor/ebpf/c/include/update.sh:
- Around line 18-19: The current inline download pipeline (curl -sL ... | tar
-xz --xform=... "${headers[@]}") should be made robust: download the release
tarball to a temporary file using a failing curl variant (e.g. enable -f and
preserve redirects) and check curl's exit code, validate the tarball integrity
against a provided checksum (and/or GPG signature from the libbpf release page)
before extracting, and avoid blind overwrite of existing headers by either
extracting into a temporary directory and atomically moving new files or using
tar options to skip/keep existing files; if any verification or extraction step
fails, emit a clear error and exit non‑zero. Ensure you reference LIBBPF_VERSION
and the headers array in the updated logic so the download, checksum URL, and
extraction targets are consistent.
- Line 4: The pinned LIBBPF_VERSION variable in update.sh is set to the
vulnerable 0.6.1; change LIBBPF_VERSION to a patched release (minimum 0.7.0,
preferably the current stable 1.6.2) so the build uses a non-vulnerable libbpf;
update the LIBBPF_VERSION value accordingly and verify the tag/format matches
the repository's expected version string elsewhere in the script.
In @internal/networkmonitor/ebpf/cgroup_linux.go:
- Around line 23-34: The mountCgroup2 function should be made idempotent: when
creating the directory (mountPoint) treat already-existing directory as
non-fatal (use os.Mkdir and ignore os.IsExist errors) and before calling
syscall.Mount check whether cgroup2 is already mounted at mountPoint by parsing
/proc/mounts (or using syscall.Statfs) and return nil if a cgroup2 mount is
present; only call syscall.Mount if the mountPoint exists and no cgroup2 is
currently mounted, and preserve existing error wrapping in the same fmt.Errorf
style if mount fails.
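A sketch of the pre-mount check suggested above, assuming /proc/mounts is readable; the package and function names are illustrative, not the actual cgroup_linux.go code:

```go
package cgroup

import (
	"bufio"
	"os"
	"strings"
)

// cgroup2MountedAt reports whether a cgroup2 filesystem is already mounted at
// mountPoint, so the caller can skip syscall.Mount and stay idempotent.
func cgroup2MountedAt(mountPoint string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// /proc/mounts fields: device mountpoint fstype options dump pass
		fields := strings.Fields(sc.Text())
		if len(fields) >= 3 && fields[1] == mountPoint && fields[2] == "cgroup2" {
			return true, nil
		}
	}
	return false, sc.Err()
}
```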
In @internal/networkmonitor/monitor.go:
- Around line 133-144: The select in the goroutine handling flushTicker has a
duplicated branch for "case <-flushTicker.C:" causing redundant execution paths;
remove the redundant second "case <-flushTicker.C:" block (the one that contains
the comment about goroutine vs synchronous flush and the duplicate call to
m.flush(ctx)) so only a single case handling flushTicker.C remains, ensuring the
select calls m.flush(ctx) exactly once and keeps the existing m.log.Error(err,
"Failed to flush metrics to control plane") behavior.
- Around line 327-343: The conntrackEntryKey function can panic using As4() for
IPv6 addresses; update it to use As16() so both IPv4 and IPv6 are handled
(replace calls to e.Src.IP().As4() and e.Dst.IP().As4() with As16() and write
the full 16-byte slices into conntrackEntryHash), keeping the existing port and
proto writes intact and ensuring conntrackEntryHash.Reset()/Sum64() usage
remains the same.
- Around line 324-359: The package-level maphash.Hash variables
(conntrackEntryHash, entryGroupHash) are not safe for concurrent use; replace
them by creating a local maphash.Hash instance inside conntrackEntryKey and
inside entryGroupKey (e.g. var h maphash.Hash or h := maphash.Hash{}), use
h.Reset()/h.Write(...)/h.Sum64() there, and remove or stop referencing the
package-level conntrackEntryHash and entryGroupHash to eliminate the data race.
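For the duplicated flushTicker branch called out above, a self-contained sketch of the de-duplicated select loop; the monitor type, tickers, and method names here are simplified assumptions, not the actual monitor.go code:

```go
package netmon

import (
	"context"
	"time"

	"github.com/go-logr/logr"
)

type monitor struct {
	log logr.Logger
}

func (m *monitor) read(ctx context.Context)        { /* collect conntrack deltas */ }
func (m *monitor) flush(ctx context.Context) error { return nil /* send to control plane */ }

// run keeps a single case per ticker, so flush executes exactly once per
// flush interval instead of running through two redundant branches.
func (m *monitor) run(ctx context.Context, readEvery, flushEvery time.Duration) {
	readTicker := time.NewTicker(readEvery)
	defer readTicker.Stop()
	flushTicker := time.NewTicker(flushEvery)
	defer flushTicker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-readTicker.C:
			m.read(ctx)
		case <-flushTicker.C:
			if err := m.flush(ctx); err != nil {
				m.log.Error(err, "Failed to flush metrics to control plane")
			}
		}
	}
}
```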
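And for the two conntrack-key prompts, a sketch combining both fixes: full 16-byte addresses via As16 (so IPv6 entries cannot panic) and a per-call maphash.Hash seeded from a shared maphash.Seed (a Seed value is safe to share across goroutines; a Hash is not). The signature is a simplified assumption, not the real conntrackEntryKey:

```go
package netmon

import (
	"encoding/binary"
	"hash/maphash"

	"inet.af/netaddr"
)

// entryKey hashes a 5-tuple without shared mutable state. Each call uses its
// own maphash.Hash; only the Seed (created once with maphash.MakeSeed) is shared.
func entryKey(seed maphash.Seed, srcIP, dstIP netaddr.IP, srcPort, dstPort uint16, proto uint8) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)

	src := srcIP.As16() // As16 handles both IPv4 (as mapped) and IPv6 addresses
	dst := dstIP.As16()
	h.Write(src[:])
	h.Write(dst[:])

	var ports [4]byte
	binary.BigEndian.PutUint16(ports[0:2], srcPort)
	binary.BigEndian.PutUint16(ports[2:4], dstPort)
	h.Write(ports[:])
	h.Write([]byte{proto})

	return h.Sum64()
}
```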
🟡 Minor comments (13)
internal/networkmonitor/cilium_time.go-57-66 (1)
57-66: Fix grammatical error in error message. Line 62 has a grammatical error: "Unable get time" should be "Unable to get time".
📝 Proposed fix
```diff
-	return 0, fmt.Errorf("Unable get time: %w", err)
+	return 0, fmt.Errorf("unable to get time: %w", err)
```
Note: Also lowercased the first letter to follow Go error message conventions.
test/stats/types.go-100-108 (1)
100-108: Reconsider timestamp field type for consistency. The Timestamp field is string while other timestamp fields in this file use time.Time (SnapshotStats.Timestamp at line 38, Stats.FirstMessageTime at line 115). The string type is intentional: the code explicitly converts the timestamp using .AsTime().String() (testserver/main.go:394), but this creates an unnecessary inconsistency. If the intent is JSON serialization, using time.Time would be more consistent and eliminate the manual string conversion, letting JSON marshaling handle the formatting automatically. Align with other timestamp fields in the same file or document why this differs.
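A minimal sketch of the more consistent shape, assuming the field is only consumed through JSON; the struct and field names are illustrative, not the actual test/stats/types.go definitions:

```go
package stats

import (
	"encoding/json"
	"fmt"
	"time"
)

// message keeps the timestamp as time.Time and lets encoding/json format it
// (RFC 3339) instead of converting to a string by hand with AsTime().String().
type message struct {
	Timestamp time.Time `json:"timestamp"`
	Payload   string    `json:"payload"`
}

func example() {
	m := message{Timestamp: time.Now().UTC(), Payload: "demo"}
	b, _ := json.Marshal(m) // error ignored only for brevity in this sketch
	fmt.Println(string(b))
}
```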
verification/provision_aks.sh-18-26 (1)
18-26: State file may not correctly track multiple CNI configurations. The script includes CNI_TYPE in both the resource group and cluster names, but uses a single state file (aks.state). If a user changes CNI_TYPE between runs, the state file may reference the wrong cluster during deletion. Consider one of these approaches:
🔧 Proposed fixes
Option 1: Include CNI_TYPE in state filename
```diff
 STATE_DIR="${STATE_DIR:-$HOME/.cache/zxporter-e2e}"
-STATE_FILE="${STATE_FILE:-${STATE_DIR}/aks.state}"
+STATE_FILE="${STATE_FILE:-${STATE_DIR}/aks-${CNI_TYPE}.state}"
```
Option 2: Store and validate CNI_TYPE in delete mode
In delete mode, after read_state:
```diff
 read_state
+if [[ -n "${CNI_TYPE:-}" && "${CNI_TYPE}" != "${CNI_TYPE_FROM_STATE:-kubenet}" ]]; then
+  echo "WARNING: Current CNI_TYPE (${CNI_TYPE}) differs from state (${CNI_TYPE_FROM_STATE})" >&2
+fi
```
internal/networkmonitor/ebpf/types.go-18-28 (1)
18-28: Bug: Headers are repeated for each item in the loop. The "Questions:\n" and "Answers:\n" headers are written inside the loops, causing them to repeat for each entry. They should be printed once before each loop.
Proposed fix
```diff
 func (e DNSEvent) String() string {
 	var str strings.Builder
+	if len(e.Questions) > 0 {
+		str.WriteString("Questions:\n")
+	}
 	for _, v := range e.Questions {
-		str.WriteString("Questions:\n")
 		str.WriteString(fmt.Sprintf("%s %s %s \n", v.Class, v.Type, string(v.Name)))
 	}
+	if len(e.Answers) > 0 {
+		str.WriteString("Answers:\n")
+	}
 	for _, v := range e.Answers {
-		str.WriteString("Answers:\n")
 		str.WriteString(fmt.Sprintf("%s %s %s [%s] [%s] \n", v.Class, v.Type, string(v.Name), v.IP, v.CNAME))
 	}
 	return str.String()
 }
```
.github/workflows/helm-release-netmon.yml-94-125 (1)
94-125: Tag is created before Chart.yaml is updated in the repository. The workflow creates and pushes the git tag (lines 94-99) before updating Chart.yaml (lines 122-125). This means the tagged commit doesn't reflect the new version in Chart.yaml. While the packaged chart will have the correct version (via the --version flag), the source at the tagged commit will show the old version. Consider reordering: update Chart.yaml first, commit, then tag the commit with the updated version.
verification/verify_e2e.sh-115-123 (1)
115-123: Duplicate --set image.pullPolicy flag. Line 120 duplicates line 119. Remove the duplicate.
Proposed fix
```diff
 helm template zxporter-netmon helm-chart/zxporter-netmon \
   --namespace $NAMESPACE \
   --set image.repository=ttl.sh/zxporter-netmon \
   --set image.tag=$TAG \
   --set image.pullPolicy=$PULL_POLICY \
-  --set image.pullPolicy=$PULL_POLICY \
   --set config.exportInterval="10s" \
   | kubectl apply -f -
```
internal/networkmonitor/ebpf/tracer_linux.go-133-142 (1)
133-142: InitCgroupv2 silently ignores non-ErrCgroup2NotMounted errors from detectCgroupPath. If detectCgroupPath returns an error other than ErrCgroup2NotMounted, it's not propagated. The function returns nil, potentially masking real issues.
Proposed fix
```diff
 func InitCgroupv2(log logr.Logger) error {
 	_, err := detectCgroupPath()
 	if errors.Is(err, ErrCgroup2NotMounted) {
 		log.Info("mounting cgroup v2")
 		if err := mountCgroup2(); err != nil {
 			return fmt.Errorf("cgroup2 not mounted and failed to mount manually: %w", err)
 		}
+		return nil
 	}
-	return nil
+	return err
 }
```
.github/workflows/push-zxporter-netmon-image.yml-137-137 (1)
137-137: Fix typo in error message. "Definately" should be "Definitely".
✏️ Proposed fix
```diff
-  # Definately have to fail if any non-latest tags exist
+  # Definitely have to fail if any non-latest tags exist
   if [ ${#NON_LATEST_EXISTING[@]} -gt 0 ]; then
```
cmd/zxporter-netmon/main.go-46-48 (1)
46-48: Handle or log error from logger initialization. The error from zap.NewProduction() is silently discarded. If logger initialization fails, the program continues with a potentially invalid logger, which could lead to panics or silent failure of logging.
🛡️ Proposed fix
```diff
 // Initialize Logger
-zapLog, _ := zap.NewProduction()
+zapLog, err := zap.NewProduction()
+if err != nil {
+	fmt.Fprintf(os.Stderr, "failed to initialize logger: %v\n", err)
+	os.Exit(1)
+}
 logger := zapr.NewLogger(zapLog)
```
internal/networkmonitor/conntrack_client.go-123-126 (1)
123-126: Silent error suppression may hide invalid IP addresses. The error from netaddr.FromStdIP is deliberately ignored. If the kernel returns an invalid IP format, this could silently produce zero-value IPs in entries, leading to incorrect filtering or metrics. Consider logging when conversion fails:
🔍 Proposed improvement
```diff
-func ipFromStdIP(ip net.IP) netaddr.IP {
-	res, _ := netaddr.FromStdIP(ip)
-	return res
-}
+func (c *NetfilterClient) ipFromStdIP(ip net.IP) netaddr.IP {
+	res, ok := netaddr.FromStdIP(ip)
+	if !ok {
+		c.log.Info("Failed to convert IP address", "ip", ip.String())
+	}
+	return res
+}
```
Note: This would require making ipFromStdIP a method to access the logger.
Committable suggestion skipped: line range outside the PR's diff.
.github/workflows/push-zxporter-netmon-image.yml-237-241 (1)
237-241: Update checkout action to v4. The workflow uses actions/checkout@v3, which is too old to run on current GitHub Actions runners.
🔄 Proposed fix
```diff
 - name: Checkout repo
-  uses: actions/checkout@v3
+  uses: actions/checkout@v4
   with:
     repository: devzero-inc/zxporter
```
.github/workflows/push-zxporter-netmon-image.yml-306-319 (1)
306-319: Fix undefined workflow_name reference. Lines 311 and 318 reference ${{ inputs.workflow_name }}, but this input is not defined in the workflow's workflow_dispatch inputs (which only have push_latest). This will evaluate to an empty string.
🔧 Proposed fix
```diff
 - name: Notify Slack on failure
   uses: ./services/.github/actions/slack-notify-failure
   if: always()
   with:
     slack_bot_token: ${{ secrets.SLACK_BOT_TOKEN }}
-    workflow_name: ${{ inputs.workflow_name || 'Update zxporter-netmon helm chart' }}
+    workflow_name: 'Update zxporter-netmon helm chart'

 - name: Notify Slack on success
   uses: ./services/.github/actions/slack-notify-success
   if: env.SHOULD_CREATE_PR == 'true'
   with:
     slack_bot_token: ${{ secrets.SLACK_BOT_TOKEN }}
-    workflow_name: ${{ inputs.workflow_name || 'Update zxporter-netmon helm chart' }}
+    workflow_name: 'Update zxporter-netmon helm chart'
     pr_url: ${{ env.PR_URL }}
```
internal/networkmonitor/ebpf/c/egressd.c-56-60 (1)
56-60: Unchecked return value when re-reading IPv4 header with options. When ihl > 5, the re-read of the IPv4 header does not check the return value of bpf_skb_load_bytes_relative. If this fails, the program continues with potentially invalid header data.
🐛 Proposed fix
```diff
 u32 ihl = nethdrs->iphdrs.iphdr.ihl;
 if (ihl > 5) {
     // re-read IPv4 header if needed
     size -= get_type_size(struct iphdr);
     size += ihl * 4;
-    bpf_skb_load_bytes_relative(ctx, 0, dest, size, 1);
+    if (bpf_skb_load_bytes_relative(ctx, 0, dest, size, 1)) {
+        return 1;
+    }
 }
```
🧹 Nitpick comments (40)
internal/networkmonitor/cilium_time.go (1)
14-20: Consider logging or handling Hz detection failure.The silent fallback to
hertz = 1on error may mask kernel Hz detection failures, potentially leading to incorrect jiffies-based timing calculations. Consider logging the error or documenting whenhertz = 1is an acceptable fallback.📝 Consider adding error logging
func init() { var err error hertz, err = getKernelHZ() if err != nil { + // Log error or use a more appropriate fallback + // logger.Warnf("Failed to detect kernel HZ, using fallback: %v", err) hertz = 1 } }internal/networkmonitor/cilium_kernel_hz.go (1)
87-107: Consider refactoring for clarity (optional).The unsigned subtraction pattern on lines 95-97 works correctly but relies on wrap-around behavior that may not be immediately obvious. Consider computing the absolute difference more explicitly for better readability.
♻️ Optional refactor for clarity
func nearest(in uint16, values []uint16) (uint16, error) { if len(values) == 0 { return 0, errors.New("values cannot be empty") } var out uint16 min := ^uint16(0) for _, v := range values { - d := uint16(in - v) + var d uint16 if in < v { d = v - in + } else { + d = in - v } if d < min { min = d out = v } } return out, nil }verification/provision_gke.sh (2)
35-39: Consider renaming for consistency across provisioning scripts.The function is named
load_statehere, while the EKS and AKS scripts useread_statefor the same purpose. Using consistent naming across all three cloud provider scripts improves maintainability.♻️ Proposed fix
-load_state() { +read_state() { if [[ -f "${STATE_FILE}" ]]; then source "${STATE_FILE}" fi }Then update the call site on Line 99:
if [[ "${MODE}" == "delete" ]]; then - load_state + read_state
36-39: Suppress expected ShellCheck warning for dynamic source.The ShellCheck warning SC1090 about non-constant source is expected here since the state file is dynamically sourced. Consider adding a directive to suppress this known false positive.
♻️ Proposed fix
load_state() { if [[ -f "${STATE_FILE}" ]]; then + # shellcheck source=/dev/null source "${STATE_FILE}" fi }verification/provision_eks.sh (2)
140-142: Track or address the autoModeConfig TODO.The commented TODO mentions testing is still required for the auto mode config. Since eksctl is moving toward making this the default, this should be addressed soon.
Would you like me to open an issue to track testing autoModeConfig before it becomes the default in upcoming eksctl releases?
56-58: Suppress expected ShellCheck warning for dynamic source.Same as the GKE script, add a directive to suppress the SC1090 warning for the dynamically sourced state file.
♻️ Proposed fix
read_state() { + # shellcheck source=/dev/null [[ -f "${STATE_FILE}" ]] && source "${STATE_FILE}" }verification/provision_aks.sh (2)
118-126: Consider prioritizing LoadBalancer service cleanup.AKS LoadBalancer services create Azure Load Balancer and Public IP resources. Deleting these first helps ensure clean resource group deletion.
♻️ Proposed enhancement
echo "Cleaning up k8s resources (best effort)..." + # Delete LoadBalancer services first to clean up Azure LB/IP resources + kubectl delete svc -A --field-selector spec.type=LoadBalancer --timeout=30s >/dev/null 2>&1 || true kubectl delete svc -A --all --timeout=10s >/dev/null 2>&1 || true kubectl delete pvc -A --all --timeout=10s >/dev/null 2>&1 || true
38-40: Suppress expected ShellCheck warning for dynamic source.Same SC1090 warning as the other provisioning scripts. Add the suppression directive for consistency.
♻️ Proposed fix
read_state() { + # shellcheck source=/dev/null [[ -f "${STATE_FILE}" ]] && source "${STATE_FILE}" }internal/networkmonitor/ebpf/tracer_linux_test.go (1)
10-17: Expand test coverage with error cases and edge cases.The test only covers one happy-path scenario. Consider adding tests for invalid payloads, truncated packets, and various packet types to ensure robust parsing.
📋 Suggested additional test cases
func TestParseEvent_ErrorCases(t *testing.T) { tests := []struct { name string payload []byte wantErr bool }{ { name: "empty payload", payload: []byte{}, wantErr: true, }, { name: "truncated IP header", payload: []byte{69, 0, 0, 80}, wantErr: true, }, { name: "invalid IP version", payload: []byte{0, 0, 0, 80, 106, 126, 64, 0}, wantErr: true, }, } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { _, err := parseEvent(tt.payload) if (err != nil) != tt.wantErr { t.Errorf("parseEvent() error = %v, wantErr %v", err, tt.wantErr) } }) } }verification/kind-config.yaml (1)
1-27: Consider using YAML anchors to reduce duplication.The
extraMountsconfiguration is duplicated across all three nodes. While this works correctly, you can improve maintainability by using YAML anchors and aliases.♻️ Proposed refactor using YAML anchors
kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 + +# Define the common mounts as an anchor +x-common-mounts: &common-mounts + - hostPath: /lib/modules + containerPath: /lib/modules + readOnly: true + - hostPath: /sys/fs/bpf + containerPath: /sys/fs/bpf + readOnly: false + nodes: - role: control-plane - extraMounts: - - hostPath: /lib/modules - containerPath: /lib/modules - readOnly: true - - hostPath: /sys/fs/bpf - containerPath: /sys/fs/bpf - readOnly: false + extraMounts: *common-mounts - role: worker - extraMounts: - - hostPath: /lib/modules - containerPath: /lib/modules - readOnly: true - - hostPath: /sys/fs/bpf - containerPath: /sys/fs/bpf - readOnly: false + extraMounts: *common-mounts - role: worker - extraMounts: - - hostPath: /lib/modules - containerPath: /lib/modules - readOnly: true - - hostPath: /sys/fs/bpf - containerPath: /sys/fs/bpf - readOnly: false + extraMounts: *common-mountsverification/lima.yaml (1)
18-22: Consider adding error handling to the provisioning script.The provisioning script could be made more robust by adding error handling and non-interactive mode for apt-get.
♻️ Proposed enhancement
provision: - mode: system script: | + set -euo pipefail + export DEBIAN_FRONTEND=noninteractive apt-get update apt-get install -y conntrack iptablesThis adds:
set -euo pipefailto fail fast on errorsDEBIAN_FRONTEND=noninteractiveto prevent interactive prompts during installationverification/verify_local.sh (2)
1-17: Add prerequisite checks and better error messages.The script assumes Lima is installed and the instance is running, but doesn't verify these prerequisites. This could lead to cryptic errors for users.
📋 Suggested improvements
#!/bin/bash set -e +# Check prerequisites +if ! command -v limactl &> /dev/null; then + echo "Error: limactl is not installed. Please install Lima first." >&2 + exit 1 +fi + +INSTANCE_NAME=${1:-default} + +# Verify Lima instance exists and is running +if ! limactl list "$INSTANCE_NAME" &> /dev/null; then + echo "Error: Lima instance '$INSTANCE_NAME' does not exist." >&2 + echo "Run 'make setup-lima' to create it." >&2 + exit 1 +fi + +if ! limactl list "$INSTANCE_NAME" | grep -q "Running"; then + echo "Error: Lima instance '$INSTANCE_NAME' is not running." >&2 + exit 1 +fi + # build zxporter-netmon for linux (matching host arch for Lima) ARCH=$(go env GOARCH) echo "Building zxporter-netmon for Linux/$ARCH..." GOOS=linux GOARCH=$ARCH go build -o bin/zxporter-netmon-linux ./cmd/zxporter-netmon/main.go echo "Build complete: bin/zxporter-netmon-linux" echo "" -INSTANCE_NAME=${1:-default} echo "Running in Lima instance: $INSTANCE_NAME"
20-23: Consider capturing output and validating execution.The script runs the binary but doesn't capture or validate its output. Consider adding basic validation to ensure the binary executed successfully.
💡 Suggested enhancement
# Execute in VM # Using sudo to ensure permissions for conntrack/ebpf echo "Executing zxporter-netmon in VM..." limactl shell $INSTANCE_NAME sudo chmod +x /tmp/zxporter-netmon-linux -limactl shell $INSTANCE_NAME sudo /tmp/zxporter-netmon-linux --standalone + +# Run with timeout and capture output +echo "Running for 10 seconds to verify functionality..." +if limactl shell $INSTANCE_NAME sudo timeout 10 /tmp/zxporter-netmon-linux --standalone 2>&1 | tee /tmp/netmon-output.log; then + echo "✓ Binary executed successfully" +else + EXIT_CODE=$? + if [ $EXIT_CODE -eq 124 ]; then + echo "✓ Binary ran and was terminated after timeout (expected)" + else + echo "✗ Binary execution failed with exit code: $EXIT_CODE" >&2 + exit 1 + fi +fiverification/traffic-gen.yaml (1)
37-53: Consider adding security context hardening.The container lacks explicit security constraints. While this is a verification workload, following security best practices improves the overall security posture. Consider adding:
🔒 Suggested security hardening
containers: - name: traffic-gen image: curlimages/curl + securityContext: + allowPrivilegeEscalation: false + runAsNonRoot: true + runAsUser: 100 + capabilities: + drop: + - ALL command: ["/bin/sh"] args:verification/traffic-server.yaml (1)
37-55: Consider using a non-privileged port for the container.The container runs Python's http.server on port 80, which requires root privileges or special capabilities. For a test workload, using a non-privileged port (≥1024) allows running as non-root and improves security posture.
🔒 Suggested changes for non-privileged port
Update the container to use port 8080 and add security context:
containers: - name: traffic-server image: python:3-slim - command: ["python3", "-m", "http.server", "80"] + command: ["python3", "-m", "http.server", "8080"] + securityContext: + allowPrivilegeEscalation: false + runAsNonRoot: true + runAsUser: 1000 + capabilities: + drop: + - ALL ports: - - containerPort: 80 + - containerPort: 8080 --- apiVersion: v1 kind: Service metadata: name: traffic-server namespace: default spec: selector: app: traffic-server ports: - protocol: TCP port: 80 - targetPort: 80 + targetPort: 8080This keeps the Service accessible on port 80 while the container runs securely on 8080.
internal/networkmonitor/ebpf/tracer.go (1)
10-19: Consider validatingQueueSizefor negative values.If a negative
QueueSizeis passed,make(chan DNSEvent, cfg.QueueSize)will panic at runtime. Consider adding validation:Suggested validation
func NewTracer(log logr.Logger, cfg Config) *Tracer { - if cfg.QueueSize == 0 { + if cfg.QueueSize <= 0 { cfg.QueueSize = 1000 } return &Tracer{helm-chart/zxporter-netmon/values.yaml (2)
4-5: Avoid usinglatesttag as default.Using
latestas the default image tag makes deployments non-reproducible and complicates rollbacks. Consider defaulting to the chart'sappVersionor requiring an explicit tag.Suggested fix
# Overrides the image tag whose default is the chart appVersion. - tag: "latest" + tag: "" # Defaults to chart appVersion if empty
21-22: Redundant AppArmor configuration.The AppArmor annotation on line 22 (
container.apparmor.security.beta.kubernetes.io/zxporter-netmon: "unconfined") duplicates thesecurityContext.appArmorProfileon lines 29-30. The annotation uses the deprecated beta API. Consider removing the annotation and relying solely on thesecurityContext.appArmorProfilefor Kubernetes 1.30+.Also applies to: 27-34
config/zxporter-netmon.yaml (1)
36-36: Avoid hardcodinglatesttag in manifests.Using
latestmakes deployments non-reproducible. Pin to a specific version or use the Helm chart which can template the version.internal/networkmonitor/ebpf/bpf_bpfel.go (1)
15-24: Minor: Redundant error variable in return.Line 23 returns
spec, errwhereerris guaranteed to benilafter the check on lines 19-21.Simplified return
func loadBpf() (*ebpf.CollectionSpec, error) { reader := bytes.NewReader(_BpfBytes) spec, err := ebpf.LoadCollectionSpecFromReader(reader) if err != nil { return nil, fmt.Errorf("can't load bpf: %w", err) } - return spec, err + return spec, nil }.github/workflows/helm-release-netmon.yml (1)
56-66: Quote shell variables to prevent word splitting.Static analysis flagged potential word splitting issues. While unlikely to cause problems with semantic version strings, quoting these variables is a defensive best practice.
Proposed fix
- name: Get last chart version from tags id: get_version run: | LAST_TAG=$(git tag --list '${{ env.CHART_TAG_PREFIX }}*' --sort=-v:refname | head -n 1) if [ -z "$LAST_TAG" ]; then LAST_VERSION="0.0.0" else - LAST_VERSION="${LAST_TAG#${{ env.CHART_TAG_PREFIX }}}" + LAST_VERSION="${LAST_TAG#"${{ env.CHART_TAG_PREFIX }}"}" fi echo "Last version: $LAST_VERSION" - echo "last_version=$LAST_VERSION" >> $GITHUB_OUTPUT + echo "last_version=$LAST_VERSION" >> "$GITHUB_OUTPUT"internal/networkmonitor/ebpf/bpf_bpfeb.go (1)
107-114: Early return on error may leave resources unclosed.
_BpfClosereturns immediately on the first error, potentially leaving subsequent closers unclosed. While this is common in generated code and unlikely to cause issues in practice (eBPF Close rarely fails), a more robust pattern would collect all errors.Since this is generated code, consider whether the generator (
bpf2go) can be configured to use a different pattern, or accept this as a known limitation.internal/networkmonitor/pod_cache.go (2)
15-15: Unusedstorefield.The
storefield is assigned on line 37 but never used. Either remove it or document its intended purpose for future use.If unused, remove the field
type PodCache struct { mu sync.RWMutex ips map[netaddr.IP]*corev1.Pod // Using netaddr.IP for efficient lookup - store cache.Store }And in
NewPodCache:- pc.store = informer.GetStore()Also applies to: 37-37
26-36: Unchecked type assertions in event handlers could panic.The
AddFuncandUpdateFunchandlers perform unchecked type assertions. While the informer contract should guarantee Pod objects, adding defensive checks (like indeletePod) prevents panics from unexpected objects.Add defensive type checks
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: func(obj interface{}) { - pc.updatePod(obj.(*corev1.Pod)) + if pod, ok := obj.(*corev1.Pod); ok { + pc.updatePod(pod) + } }, UpdateFunc: func(old, new interface{}) { - pc.updatePod(new.(*corev1.Pod)) + if pod, ok := new.(*corev1.Pod); ok { + pc.updatePod(pod) + } }, DeleteFunc: func(obj interface{}) { pc.deletePod(obj) }, })internal/networkmonitor/ebpf/tracer_linux.go (1)
191-207: Missing scanner error check indetectCgroupPath.After the scan loop,
scanner.Err()should be checked to catch any errors that occurred during scanning (e.g., I/O errors reading/proc/mounts).Proposed fix
func detectCgroupPath() (string, error) { f, err := os.Open("/proc/mounts") if err != nil { return "", err } defer f.Close() scanner := bufio.NewScanner(f) for scanner.Scan() { fields := strings.Split(scanner.Text(), " ") if len(fields) >= 3 && fields[2] == "cgroup2" { return fields[1], nil } } + if err := scanner.Err(); err != nil { + return "", fmt.Errorf("reading /proc/mounts: %w", err) + } + return "", ErrCgroup2NotMounted }verification/verify_e2e.sh (2)
24-24: Unused variableARCH.The
ARCHvariable is set but never used in the script. Either remove it or use it (e.g., for multi-arch image builds).Remove unused variable
-ARCH=$(go env GOARCH) TAG="dev-$(date +%s)"
283-297: Variable naming inconsistency and unquoted variables.
RETRY_Countuses mixed case while other variables useUPPER_SNAKE_CASE. Also,$STATS_JSONshould be quoted in theechoandgrepcommands to handle special characters safely.Proposed fix
- RETRY_Count=0 + RETRY_COUNT=0 TOTAL_MSGS=0 - while [ $RETRY_Count -lt $MAX_RETRIES ]; do + while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do STATS_JSON=$(kubectl exec curl-validator -n $NAMESPACE -- curl -v -s http://testserver.${NAMESPACE}.svc.cluster.local:8080/stats) - TOTAL_MSGS=$(echo $STATS_JSON | grep -o '"total_messages": [0-9]*' | awk '{print $2}') + TOTAL_MSGS=$(echo "$STATS_JSON" | grep -o '"total_messages": [0-9]*' | awk '{print $2}') if [[ "$TOTAL_MSGS" -gt 0 ]]; then break fi - echo " [INFO] TestServer has 0 messages. Waiting 5s for flush... ($((RETRY_Count+1))/$MAX_RETRIES)" + echo " [INFO] TestServer has 0 messages. Waiting 5s for flush... ($((RETRY_COUNT+1))/$MAX_RETRIES)" sleep 5 - RETRY_Count=$((RETRY_Count+1)) + RETRY_COUNT=$((RETRY_COUNT+1)) doneinternal/networkmonitor/cilium_client_linux.go (3)
39-52:defer m.Close()inside loop defers all closes until function return.All
defer m.Close()calls will execute whenlistRecordsreturns, not when each iteration completes. With many maps, this could hold file descriptors longer than necessary. Consider using an immediately-invoked function or explicit close.Proposed fix using helper function
for _, m := range maps { - m := m.(ctmap.CtMap) + ctMap, ok := m.(ctmap.CtMap) + if !ok { + continue + } + if err := processMap(ctMap, now, timeDiff, filter, &entries); err != nil { + return nil, err + } + } + return entries, nil +} + +func processMap(m ctmap.CtMap, now time.Time, timeDiff func(int64) int64, filter EntriesFilter, entries *[]*Entry) error { _, err := m.Path() if err == nil { err = m.Open() } if err != nil { if os.IsNotExist(err) { - continue + return nil } + return err } - defer m.Close() + // ... rest of callback and dump logicAlternatively, close explicitly at end of each iteration without defer.
53-76: Unchecked type assertions in callback could cause panics.Lines 54 and 61 perform type assertions without checking. While the ctmap API should guarantee these types, a panic inside the callback would be difficult to debug. Consider adding defensive checks or using comma-ok idiom.
Add defensive type checks
cb := func(key bpf.MapKey, v bpf.MapValue) { - k := key.(ctmap.CtKey).ToHost().(*ctmap.CtKey4Global) + ctKey, ok := key.(ctmap.CtKey) + if !ok { + return + } + k, ok := ctKey.ToHost().(*ctmap.CtKey4Global) + if !ok { + return // Skip non-global entries (e.g., CtKey6Global for IPv6) + } if k.NextHeader == 0 { return } srcIP := k.DestAddr.IP() dstIP := k.SourceAddr.IP() - val := v.(*ctmap.CtEntry) + val, ok := v.(*ctmap.CtEntry) + if !ok { + return + }
84-93: Magic numbers ininitMapslack documentation.The parameters to
ctmap.InitMapInfo(2<<18, 2<<17, true, false, true)are unclear. Consider adding a comment explaining what these values represent (map sizes, flags, etc.) for maintainability.Add explanatory comment
func initMaps() []interface{} { // We might need to handle failure here gracefully or ensure we have permissions + // InitMapInfo(maxTCPEntries=524288, maxAnyEntries=262144, ipv6=true, ipv4=false, nodePort=true) ctmap.InitMapInfo(2<<18, 2<<17, true, false, true)internal/networkmonitor/cilium_client.go (1)
28-35: Remove obsolete comment about logr adaptation.The type alias
logrusLogger = logr.Loggerdirectly uses logr without adaptation. The comment on lines 33-34 suggests uncertainty but the implementation is straightforward.🧹 Proposed cleanup
type ciliumClient struct { - log logrusLogger // wrapping logr to match potential interface or just using logr + log logrusLogger maps []interface{} clockSource ClockSource } -// We need a helper to adapt logr to what we need or just use logr directly. -// The code in linux implementation uses `c.log` type logrusLogger = logr.Logger.github/workflows/push-zxporter-netmon-image.yml (1)
88-168: Shell script could benefit from quoting variables, but functionally correct.The shellcheck warnings about unquoted variables (SC2086) and array handling (SC2145) are valid style improvements but don't affect functionality in this context since the variables don't contain spaces.
If you want to address the shellcheck warnings for cleaner shell hygiene, quote variables like
"$tag","$GITVERSION", etc., and use"${TAGS[@]}"instead of${TAGS[@]}in echo statements.cmd/zxporter-netmon/main.go (1)
32-32: Consider renaming flag to match its purpose.The flag is named
metrics-kubeconfigbut it's used for general Kubernetes client configuration, not just metrics. Consider renaming to simplykubeconfigfor clarity.📝 Suggested improvement
- kubeconfig = flag.String("metrics-kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster.") + kubeconfig = flag.String("kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster.")internal/networkmonitor/ebpf/c/egressd.c (2)
19-21: Unused functionget_pid().This function is defined but never called anywhere in the file. Consider removing it to avoid dead code, or use it if PID information is needed in the event context.
🔧 Remove unused function
-statfunc int get_pid() { - return bpf_get_current_pid_tgid() >> 32; -} -
74-75: Unnecessary NULL check -destcannot be NULL here.The variable
destis assigned from&nethdrs->protohdrs.udphdr(line 72), which is a pointer to a struct member on the stack. This can never be NULL, making this check dead code.🔧 Remove unnecessary check
size = get_type_size(struct udphdr); dest = &nethdrs->protohdrs.udphdr; - if (!dest) - return 1; - // Load L4 headers.Makefile (1)
703-709: Lima setup is macOS-specific.The
setup-limatarget usesbrew install lima, which only works on macOS. Consider adding a fallback or documentation for Linux users.🔧 Proposed improvement with cross-platform detection
.PHONY: setup-lima setup-lima: @echo "Checking for Lima..." - @which limactl >/dev/null || (echo "Installing Lima..." && brew install lima) + @which limactl >/dev/null || { \ + if command -v brew >/dev/null 2>&1; then \ + echo "Installing Lima via brew..." && brew install lima; \ + else \ + echo "Error: Lima (limactl) not found. Please install Lima manually: https://lima-vm.io/docs/installation/" && exit 1; \ + fi \ + } @echo "Checking if instance $(LIMA_INSTANCE) exists..."internal/networkmonitor/dns/ip2dns.go (2)
40-49: ExportedTracerfield inconsistent with unexported interface type.The
Tracerfield is exported (capital T) but uses an unexported interface typetracer. This allows external packages to set the field but not implement the interface, which is inconsistent. Consider making it unexported for encapsulation.🔧 Proposed fix
type IP2DNS struct { - Tracer tracer + tracer tracer log logr.Logger mu sync.Mutex lookups []DNSLookup // Keep cache for enrichment if needed, though mostly redundant with Lookups stream ipToName *cache.Cache[string, string] }Update the constructor and Start method accordingly:
func NewIP2DNS(tracer tracer, log logr.Logger) *IP2DNS { return &IP2DNS{ - Tracer: tracer, + tracer: tracer, log: log, } } func (d *IP2DNS) Start(ctx context.Context) error { d.ipToName = cache.NewContext[string, string](ctx) ctx, cancel := context.WithCancel(ctx) defer cancel() errch := make(chan error, 1) go func() { - err := d.Tracer.Run(ctx) + err := d.tracer.Run(ctx) errch <- err }() - evCh := d.Tracer.Events() + evCh := d.tracer.Events()
12-12: Consider migrating from deprecatedinet.af/netaddrtonet/netip.The
inet.af/netaddrpackage has been deprecated in favor of Go's standard librarynet/netip(available since Go 1.18). Consider migrating for long-term maintainability.🔧 Migration example
- "inet.af/netaddr" + "net/netip"And update usage:
- ip, ok := netaddr.FromStdIP(answer.IP) + ip, ok := netip.AddrFromSlice(answer.IP)internal/networkmonitor/monitor.go (2)
249-265: Side effect: mutates cachedNetworkFlowduring read.Lines 254-255 and 261-262 mutate
f.SrcPodName,f.SrcPodNamespace, etc. directly on theNetworkFlowstored inm.podMetrics. This modifies the cached state during what should be a read-only "peek" operation. Consider enriching only the copy that's appended toflows, not the original cached entry.♻️ Proposed fix
for _, f := range m.podMetrics { + // Create a copy first to avoid mutating cached state + flowCopy := &NetworkFlow{ + SrcIP: f.SrcIP, + DstIP: f.DstIP, + Protocol: f.Protocol, + DstPort: f.DstPort, + TxBytes: f.TxBytes, + RxBytes: f.RxBytes, + TxPackets: f.TxPackets, + RxPackets: f.RxPackets, + Timestamp: f.Timestamp, + } + // Enrich with Pod Metadata if m.podCache != nil { if srcIP, err := netaddr.ParseIP(f.SrcIP); err == nil { if srcPod, ok := m.podCache.GetPodByIP(srcIP); ok { - f.SrcPodName = srcPod.Name - f.SrcPodNamespace = srcPod.Namespace + flowCopy.SrcPodName = srcPod.Name + flowCopy.SrcPodNamespace = srcPod.Namespace } } if dstIP, err := netaddr.ParseIP(f.DstIP); err == nil { if dstPod, ok := m.podCache.GetPodByIP(dstIP); ok { - f.DstPodName = dstPod.Name - f.DstPodNamespace = dstPod.Namespace + flowCopy.DstPodName = dstPod.Name + flowCopy.DstPodNamespace = dstPod.Namespace } } } - // Create a copy to return (Peek behavior, no reset) - flows = append(flows, &NetworkFlow{ - SrcIP: f.SrcIP, - ... - }) + flows = append(flows, flowCopy) }
316-320: Response corruption if JSON encoding fails mid-stream.
json.NewEncoder(w).Encode(resp)writes directly to thehttp.ResponseWriter. If encoding fails partway through, the response will be partially written and corrupted. Consider encoding to a buffer first, then writing the complete response.♻️ Proposed fix
+ var buf bytes.Buffer + if err := json.NewEncoder(&buf).Encode(resp); err != nil { + m.log.Error(err, "Failed to encode metrics") + http.Error(w, "Internal Server Error", http.StatusInternalServerError) + return + } + w.Header().Set("Content-Type", "application/json") - if err := json.NewEncoder(w).Encode(resp); err != nil { - m.log.Error(err, "Failed to encode metrics") - http.Error(w, "Internal Server Error", http.StatusInternalServerError) - } + w.Write(buf.Bytes())Note: Add
"bytes"to imports.
```go
func _BpfClose(closers ...io.Closer) error {
	for _, closer := range closers {
		if err := closer.Close(); err != nil {
			return err
		}
	}
	return nil
}
```
Potential resource leak in cleanup helper.
_BpfClose returns on the first error, leaving subsequent resources unclosed. If an early Close() fails, later file descriptors or maps may leak.
🔧 Proposed fix to close all resources
```diff
 func _BpfClose(closers ...io.Closer) error {
+	var firstErr error
 	for _, closer := range closers {
 		if err := closer.Close(); err != nil {
-			return err
+			if firstErr == nil {
+				firstErr = err
+			}
 		}
 	}
-	return nil
+	return firstErr
 }
```
Note: Since this is generated code (line 1), the fix should ideally be applied to the bpf2go tool itself. If that's not feasible, consider documenting this limitation or manually patching the generated file.
```bash
#!/usr/bin/env bash

# Version of libbpf to fetch headers from
LIBBPF_VERSION=0.6.1
```
🧩 Analysis chain
🌐 Web query:
latest stable version of libbpf 2026
💡 Result:
Latest stable libbpf: v1.6.2 (release tag v1.6.2, released Aug 21, 2025). [1][2]
Sources:
[1] GitHub releases — libbpf v1.6.2.
[2] Debian package tracker — libbpf 1.6.2 packaging/news.
🌐 Web query:
libbpf 0.6.1 security vulnerabilities CVE
💡 Result:
libbpf 0.6.1 is affected by two publicly‑reported heap‑buffer‑overflow vulnerabilities:
- CVE-2021-45940: heap-based buffer overflow (4 bytes) in bpf_object__open; affects libbpf 0.6.0 and 0.6.1. Recommended fix: upgrade to a fixed upstream release (fixed in libbpf 0.7.0). [1][2]
- CVE-2021-45941: heap-based buffer overflow (8 bytes) in __bpf_object__open; affects libbpf 0.6.0 and 0.6.1. Recommended fix: upgrade to a fixed upstream release (fixed in 0.7.0 per downstream trackers). [2][3]
Impact: remote denial‑of‑service (availability) by exploiting specially‑crafted files; CVSS ~6.5 (medium). [1][2]
Sources:
- OSV / NVD entry for CVE‑2021‑45940. [OSV/NVD]. [1]
- Debian security tracker (CVE‑2021‑45940 / CVE‑2021‑45941) — notes affected versions and that fixes are in later packages (upstream 0.7.0). [2]
- CVE databases summarizing CVE‑2021‑45941. [3]
Upgrade libbpf from 0.6.1 to a patched version. Version 0.6.1 contains two heap-buffer-overflow vulnerabilities (CVE-2021-45940 and CVE-2021-45941) that can cause remote denial-of-service via specially-crafted files.
The pinned version was released in 2021 and is unpatched. Current stable is v1.6.2 (released August 2025). At minimum, upgrade to 0.7.0 where these CVEs were fixed.
```bash
curl -sL "https://github.com/libbpf/libbpf/archive/refs/tags/v${LIBBPF_VERSION}.tar.gz" | \
  tar -xz --xform='s#.*/##' "${headers[@]}"
```
Add error handling and integrity verification for the download.
The current implementation has several reliability and security concerns:
- curl -sL suppresses error output, making failures silent
- No checksum or signature verification of the downloaded tarball
- Will overwrite existing header files without warning
🔒 Proposed improvements
```diff
+# Exit on any error
+set -e
+
 # Fetch libbpf release and extract the desired headers
-curl -sL "https://github.com/libbpf/libbpf/archive/refs/tags/v${LIBBPF_VERSION}.tar.gz" | \
-  tar -xz --xform='s#.*/##' "${headers[@]}"
+echo "Downloading libbpf v${LIBBPF_VERSION}..."
+if ! curl -fsSL "https://github.com/libbpf/libbpf/archive/refs/tags/v${LIBBPF_VERSION}.tar.gz" | \
+  tar -xz --xform='s#.*/##' "${headers[@]}"; then
+  echo "Error: Failed to download or extract libbpf headers" >&2
+  exit 1
+fi
+echo "Successfully extracted libbpf headers"
```
Consider also verifying the tarball checksum against a known good value if available from the libbpf release page.
```go
func mountCgroup2() error {
	err := os.Mkdir(mountPoint, 0755)
	if err != nil {
		return fmt.Errorf("creating directory at %q: %w", mountPoint, err)
	}
	// https://docs.kernel.org/admin-guide/cgroup-v2.html#mounting
	err = syscall.Mount("none", mountPoint, "cgroup2", 0, "")
	if err != nil {
		return fmt.Errorf("mounting cgroup2 at %q: %w", mountPoint, err)
	}
	return nil
}
```
Ensure idempotent mounting behavior.
The current implementation will fail if the mount point already exists or cgroup2 is already mounted. This could cause issues during restarts or in environments where cgroup2 is pre-configured.
🔧 Proposed fix for idempotent mounting
```diff
 func mountCgroup2() error {
-	err := os.Mkdir(mountPoint, 0755)
+	err := os.MkdirAll(mountPoint, 0755)
 	if err != nil {
 		return fmt.Errorf("creating directory at %q: %w", mountPoint, err)
 	}
 	// https://docs.kernel.org/admin-guide/cgroup-v2.html#mounting
 	err = syscall.Mount("none", mountPoint, "cgroup2", 0, "")
-	if err != nil {
+	if err != nil && !os.IsExist(err) {
 		return fmt.Errorf("mounting cgroup2 at %q: %w", mountPoint, err)
 	}
 	return nil
 }
```
Alternatively, check /proc/mounts to verify if cgroup2 is already mounted before attempting the mount operation.
Committable suggestion skipped: line range outside the PR's diff.
```go
func TestTracer(t *testing.T) {
	payload := []byte{69, 0, 0, 80, 106, 126, 64, 0, 62, 17, 36, 91, 192, 168, 65, 7, 172, 17, 0, 3, 0, 53, 224, 37, 0, 60, 174, 17, 191, 43, 129, 128, 0, 1, 0, 1, 0, 0, 0, 0, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 0, 0, 1, 121, 0, 4, 172, 217, 21, 163}
	event, err := parseEvent(payload)
	if err != nil {
		t.Fatal(err)
	}
	fmt.Println(event.String())
}
```
Test lacks assertions and validation.
This test only checks that parseEvent doesn't return an error, but provides no validation of the parsed event's content. Without assertions, the test cannot catch regressions in parsing logic or output formatting.
🧪 Proposed test improvements
func TestTracer(t *testing.T) {
payload := []byte{69, 0, 0, 80, 106, 126, 64, 0, 62, 17, 36, 91, 192, 168, 65, 7, 172, 17, 0, 3, 0, 53, 224, 37, 0, 60, 174, 17, 191, 43, 129, 128, 0, 1, 0, 1, 0, 0, 0, 0, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 6, 103, 111, 111, 103, 108, 101, 2, 108, 116, 0, 0, 1, 0, 1, 0, 0, 1, 121, 0, 4, 172, 217, 21, 163}
event, err := parseEvent(payload)
if err != nil {
t.Fatal(err)
}
- fmt.Println(event.String())
+
+ // Validate parsed event
+ if event == nil {
+ t.Fatal("parseEvent returned nil event")
+ }
+
+ // Add specific assertions based on expected packet content
+ // For example, if this is a DNS query for google.lt:
+ eventStr := event.String()
+ if eventStr == "" {
+ t.Error("event.String() returned empty string")
+ }
+ t.Logf("Parsed event: %s", eventStr)
+
+ // TODO: Add more specific assertions based on expected payload structure
 }
```bash
# Note: T2A might not be in all zones of us-central1. us-central1-a,b,f usually.
# Let's target a specific zone to simplify.
ZONE="${REGION}-a"
```
Hardcoded zone suffix may fail in some regions.
The zone is hardcoded as ${REGION}-a (Lines 62, 94, 107), which assumes all regions have an "a" zone and that T2A instances are available there. This assumption may not hold for all GCP regions.
Consider deriving available zones dynamically or allowing zone override:
🔧 Proposed fix
+# Allow zone override or detect available zones with T2A
+ZONE="${ZONE:-${REGION}-a}"
+
+# Verify zone supports the machine type
+if ! gcloud compute machine-types list \
+ --filter="name=${NODE_MACHINE_TYPE} AND zone:${ZONE}" \
+ --format="value(name)" --limit=1 | grep -q .; then
+ echo "ERROR: Machine type ${NODE_MACHINE_TYPE} not available in ${ZONE}" >&2
+ exit 1
+fi
+
- ZONE="${REGION}-a"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

# Allow zone override or detect available zones with T2A
ZONE="${ZONE:-${REGION}-a}"

# Verify zone supports the machine type
if ! gcloud compute machine-types list \
  --filter="name=${NODE_MACHINE_TYPE} AND zone:${ZONE}" \
  --format="value(name)" --limit=1 | grep -q .; then
  echo "ERROR: Machine type ${NODE_MACHINE_TYPE} not available in ${ZONE}" >&2
  exit 1
fi
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In @.github/workflows/push-zxporter-netmon-image.yml:
- Around line 311-318: The workflow references inputs.workflow_name in
expressions for workflow_name (e.g., "workflow_name: ${{ inputs.workflow_name ||
'Update zxporter-netmon helm chart' }}") but the workflow_dispatch inputs do not
declare workflow_name; either add a workflow_dispatch input named workflow_name
(type: string, optional, with a sensible default) so inputs.workflow_name is
defined, or remove the inputs reference and hardcode the default literal in both
places (replace the expression with 'Update zxporter-netmon helm chart') to
avoid undefined input errors; update every occurrence of inputs.workflow_name
used in the file (including the two workflow_name keys and the slack action
input) so they are consistent.
In @cmd/zxporter-netmon/main.go:
- Around line 193-202: The monitor started with a cancellable context (the
cancel returned when launching the monitor at main) is never invoked during
shutdown; call that cancel function immediately after receiving the OS signal
and before initiating the server shutdown so the monitor and its conntrack
client are signaled to stop; specifically, invoke cancel() (the monitor's
cancel) right after the signal receive/before server.Shutdown(shutdownCtx) and
keep the existing shutdownCtx/shutdownCancel for HTTP shutdown.
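For reference, a minimal sketch of that ordering; the names waitAndShutdown, srv, and monitorCancel are placeholders for whatever main.go actually uses:

package sketch

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// waitAndShutdown blocks until SIGINT/SIGTERM, stops the monitor by
// cancelling its context, then drains the HTTP server.
func waitAndShutdown(srv *http.Server, monitorCancel context.CancelFunc) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	<-sigCh

	// Signal the monitor (and its conntrack client) to stop first.
	monitorCancel()

	// Then give in-flight HTTP requests a bounded window to finish.
	shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer shutdownCancel()
	_ = srv.Shutdown(shutdownCtx)
}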
In @config/zxporter-netmon.yaml:
- Around line 28-33: The pod-level securityContext block containing "privileged"
and "capabilities" is invalid; remove the pod-level securityContext (the block
that sets privileged: true and capabilities:add: - NET_ADMIN - SYS_RESOURCE) and
ensure those fields remain only in the container-level securityContext for the
relevant container spec so validation succeeds and capabilities are applied
correctly.
In @Dockerfile.netmon:
- Line 31: The Dockerfile currently sets USER 0:0 which contradicts the
gcr.io/distroless/static:nonroot base image; remove the USER 0:0 directive to
run as the nonroot user (UID 65532) provided by that image, or if root
privileges are truly required for the DaemonSet, replace the base image with
gcr.io/distroless/static:latest and then keep or set USER 0:0 accordingly;
update the Dockerfile line containing "USER 0:0" and ensure runtime
capabilities/privileged settings remain configured in the DaemonSet rather than
forcing root in the image.
In @internal/networkmonitor/cilium_time.go:
- Around line 12-20: The init() sets a dangerous fallback hertz = 1 when
getKernelHZ() fails and swallows the error; update init() to choose a realistic
fallback (e.g., 250 or 100) instead of 1, log the getKernelHZ() error and the
chosen fallback via the package logger (or return/fail fast if that’s
preferred), and ensure callers using hertz (e.g., jiffies calculations) will see
correct magnitude; locate symbols init, hertz and getKernelHZ to implement the
change and add a processLogger or package log call to record the failure and
fallback.
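A small sketch of the suggested fallback behaviour. resolveHertz and the log call are illustrative only; in the real package the detection function is the existing getKernelHZ and the result would be assigned to hertz from init():

package sketch

import "log"

// resolveHertz wraps HZ detection with a sane fallback and a visible warning.
func resolveHertz(detect func() (uint16, error)) uint16 {
	hz, err := detect()
	if err != nil {
		// 100 and 250 are common CONFIG_HZ values; a fallback of 1 would skew
		// jiffies-based time math by two to three orders of magnitude.
		const fallback uint16 = 250
		log.Printf("netmon: kernel HZ detection failed, using fallback %d: %v", fallback, err)
		return fallback
	}
	return hz
}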
In @internal/networkmonitor/conntrack_client.go:
- Around line 74-84: The first composite nil-check inside the sessions loop is
missing a check for sess.Origin.Dst, which is later dereferenced to build
entry.Dst; update the condition that currently tests sess.Origin.Src,
sess.Origin.Proto, sess.Reply.Dst, etc., to also include sess.Origin.Dst so the
loop continues when Origin.Dst is nil, preventing a nil dereference when
constructing entry.Dst.
- Around line 115-117: The code calls filter(entry) without checking for nil,
which will panic if filter is nil; update the loop that currently does "if
filter(entry) { res = append(res, entry) }" to guard against a nil filter (e.g.,
"if filter == nil || filter(entry) { res = append(res, entry) }") or explicitly
validate/document that the filter parameter must be non-nil at function entry;
refer to the filter variable in conntrack_client.go and add the nil-check or
parameter validation accordingly.
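A hedged sketch of the loop shape both fixes point at; Entry and EntryFilter here are simplified stand-ins for the real conntrack types, and the same "check every pointer you later dereference" rule covers sess.Origin.Dst in ListEntries:

package sketch

import "inet.af/netaddr"

// Entry and EntryFilter are simplified stand-ins for the conntrack types.
type Entry struct {
	Src, Dst netaddr.IPPort
	Proto    uint8
}

type EntryFilter func(*Entry) bool

// filterEntries tolerates both nil entries and a nil filter.
func filterEntries(entries []*Entry, filter EntryFilter) []*Entry {
	res := make([]*Entry, 0, len(entries))
	for _, e := range entries {
		if e == nil {
			continue // skip incomplete records instead of dereferencing them
		}
		// A nil filter means "accept everything" rather than panicking.
		if filter == nil || filter(e) {
			res = append(res, e)
		}
	}
	return res
}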
In @internal/networkmonitor/ebpf/c/include/vmlinux.h:
- Around line 67-130: The task_struct includes struct thread_info
unconditionally but thread_info is only defined for __TARGET_ARCH_x86 and
__TARGET_ARCH_arm64, so builds for other architectures fail; fix by either
adding architecture-specific thread_info definitions for the missing targets
(e.g., provide appropriate struct thread_info declarations for
__TARGET_ARCH_s390, __TARGET_ARCH_arm, __TARGET_ARCH_mips,
__TARGET_ARCH_powerpc, __TARGET_ARCH_sparc, __TARGET_ARCH_riscv) or guard the
task_struct definition with the same architecture checks, and add a clear
comment in vmlinux.h documenting which architectures are supported (reference
struct task_struct, struct thread_info, and the __TARGET_ARCH_* macros to locate
the affected code).
In @internal/networkmonitor/ebpf/tracer.go:
- Around line 21-25: Add a Close/Stop method to Tracer to close the events
channel and ensure callers don’t block: update the Tracer struct (type Tracer)
to implement a Close() (or Stop()) method that closes the events channel
returned/owned by NewTracer, and call that method from the platform-specific
implementations when their Run() exits (both tracer_linux.go and
tracer_nolinux.go) so the channel is closed once the context cancels; ensure
Close is safe to call multiple times (idempotent) and that any goroutines
reading from Tracer.events handle the closed channel correctly.
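One possible shape for that, sketched with illustrative field names rather than the exact ones in tracer.go; sync.Once keeps Close idempotent:

package sketch

import "sync"

type DNSEvent struct{}

// Tracer mirrors the real type only loosely.
type Tracer struct {
	events    chan DNSEvent
	closeOnce sync.Once
}

func NewTracer() *Tracer {
	return &Tracer{events: make(chan DNSEvent, 128)}
}

func (t *Tracer) Events() <-chan DNSEvent { return t.events }

// Close is safe to call from every Run() exit path; readers see a closed
// channel instead of blocking forever.
func (t *Tracer) Close() {
	t.closeOnce.Do(func() { close(t.events) })
}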
In @internal/networkmonitor/monitor.go:
- Around line 133-144: Duplicate select branch: there are two identical "case
<-flushTicker.C:" blocks calling m.flush(ctx). Remove the duplicate branch and
keep a single case that calls m.flush(ctx) and logs via m.log.Error(err, "Failed
to flush metrics to control plane"); if you intended the comment about running
flush in a goroutine or synchronous behavior, consolidate that comment into the
remaining case and implement the chosen behavior (synchronous call to
m.flush(ctx) recommended). Ensure you update only the select inside the monitor
loop so references to flushTicker.C and m.flush(ctx) remain correct.
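A sketch of the consolidated loop with a single, synchronous flush branch; the monitor type, intervals, and log call are simplified stand-ins for the real Monitor:

package sketch

import (
	"context"
	"log"
	"time"
)

type monitor struct {
	readInterval  time.Duration
	flushInterval time.Duration
}

func (m *monitor) collect() error                  { return nil }
func (m *monitor) flush(ctx context.Context) error { return nil }

// run keeps exactly one flush branch and calls flush synchronously.
func (m *monitor) run(ctx context.Context) error {
	readTicker := time.NewTicker(m.readInterval)
	defer readTicker.Stop()
	flushTicker := time.NewTicker(m.flushInterval)
	defer flushTicker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-readTicker.C:
			_ = m.collect()
		case <-flushTicker.C: // single case instead of two identical ones
			if err := m.flush(ctx); err != nil {
				log.Printf("failed to flush metrics to control plane: %v", err)
			}
		}
	}
}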
- Around line 323-359: The current helpers conntrackEntryKey and entryGroupKey
use package-level maphash.Hash vars (conntrackEntryHash, entryGroupHash) and
As4(), which causes a data race and panics on IPv6; fix by removing the
package-level hashes and creating local maphash.Hash instances inside each
function (e.g., start each function with "var h maphash.Hash"), use
e.Src.IP().As16() and e.Dst.IP().As16() to get stable 16-byte slices for both
IPv4/IPv6, write those full slices into the local hash, write ports and proto as
before, and return h.Sum64(); update both conntrackEntryKey and entryGroupKey
accordingly.
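A sketch of that hashing approach. Note it adds one detail beyond the wording above: a process-wide maphash seed, so the same entry keeps hashing to the same key across calls while each call still builds its own Hash value; IPv6 is handled via As16:

package sketch

import (
	"encoding/binary"
	"hash/maphash"

	"inet.af/netaddr"
)

type Entry struct {
	Src, Dst netaddr.IPPort
	Proto    uint8
}

// One seed for the whole process keeps keys comparable across calls; each call
// still uses its own maphash.Hash, so there is no shared mutable state.
var entryKeySeed = maphash.MakeSeed()

func entryKey(e *Entry) uint64 {
	var h maphash.Hash
	h.SetSeed(entryKeySeed)

	src := e.Src.IP().As16() // As16 covers IPv4 (as mapped addresses) and IPv6 alike
	dst := e.Dst.IP().As16()
	_, _ = h.Write(src[:])
	_, _ = h.Write(dst[:])

	var ports [4]byte
	binary.BigEndian.PutUint16(ports[0:2], e.Src.Port())
	binary.BigEndian.PutUint16(ports[2:4], e.Dst.Port())
	_, _ = h.Write(ports[:])
	_ = h.WriteByte(e.Proto)

	return h.Sum64()
}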
In @internal/networkmonitor/pod_cache.go:
- Around line 27-29: The AddFunc handler currently does an unchecked type
assertion obj.(*corev1.Pod) which can panic; modify the AddFunc in the informer
registration to perform a safe type assertion (e.g., pod, ok :=
obj.(*corev1.Pod)) and only call pc.updatePod(pod) when ok is true, otherwise
log or ignore the unexpected type; ensure you reference the AddFunc closure and
the pc.updatePod method so the handler remains robust against non-Pod objects.
- Around line 30-32: The UpdateFunc currently does an unsafe cast
new.(*corev1.Pod) and never removes stale IP entries when a Pod's IP changes;
update the handler to defensively type-assert both old and new objects (use a
type switch or comma-ok) and return/log if types are unexpected, handle nil
pods, then compare oldPod.Status.PodIP and newPod.Status.PodIP: if old IP is
non-empty and different from the new IP remove the old IP mapping from the pod
cache (call the existing cache removal method or implement pc.removeByIP) before
calling pc.updatePod(newPod) so stale mappings are cleared.
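A combined sketch of both handler fixes; the cache fields and method names (ipToPod, removeByIP, onAdd, onUpdate) are illustrative and not necessarily those in pod_cache.go:

package sketch

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
)

type podCache struct {
	mu      sync.RWMutex
	ipToPod map[string]string // pod IP -> namespace/name
}

func (pc *podCache) updatePod(p *corev1.Pod) {
	if p.Status.PodIP == "" {
		return
	}
	pc.mu.Lock()
	defer pc.mu.Unlock()
	pc.ipToPod[p.Status.PodIP] = p.Namespace + "/" + p.Name
}

func (pc *podCache) removeByIP(ip string) {
	pc.mu.Lock()
	defer pc.mu.Unlock()
	delete(pc.ipToPod, ip)
}

func (pc *podCache) onAdd(obj interface{}) {
	pod, ok := obj.(*corev1.Pod)
	if !ok || pod == nil {
		return // ignore unexpected types instead of panicking
	}
	pc.updatePod(pod)
}

func (pc *podCache) onUpdate(oldObj, newObj interface{}) {
	oldPod, okOld := oldObj.(*corev1.Pod)
	newPod, okNew := newObj.(*corev1.Pod)
	if !okNew || newPod == nil {
		return
	}
	// Drop the stale mapping when the pod's IP changed.
	if okOld && oldPod != nil && oldPod.Status.PodIP != "" && oldPod.Status.PodIP != newPod.Status.PodIP {
		pc.removeByIP(oldPod.Status.PodIP)
	}
	pc.updatePod(newPod)
}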
In @Makefile:
- Around line 777-789: The verify-e2e-aks-lifecycle target can abort before
running deprovision-aks because the long chained commands use && and Make treats
failures as fatal; modify the target so deprovision-aks always runs regardless
of verify-e2e result—invoke the test with error-ignoring (e.g., prefix the
verify call with -$(MAKE) verify-e2e CLUSTER_CONTEXT="$${CONTEXT}"
NAMESPACE=devzero-zxporter or append || true to that call) and then
unconditionally call $(MAKE) deprovision-aks; ensure you apply this change to
the verify-e2e-aks-lifecycle target (symbols: verify-e2e-aks-lifecycle,
verify-e2e, deprovision-aks).
- Around line 752-763: The recipe verify-e2e-eks-lifecycle can abort before
deprovision-eks runs if $(MAKE) verify-e2e fails; change the shell sequence so
you always run deprovision-eks: invoke $(MAKE) verify-e2e ... and capture its
exit code (e.g. VERIFY_EXIT=$$?); then call $(MAKE) deprovision-eks; finally
exit with the original verify exit code (exit $$VERIFY_EXIT). Update the
verify-e2e-eks-lifecycle recipe to perform these steps in the same shell
invocation so deprovision-eks always executes even on verification failure.
In @verification/provision_gke.sh:
- Around line 60-62: The script currently hardcodes ZONE="${REGION}-a", which
breaks in regions without an "-a" zone or where T2A is unavailable; change ZONE
to be derived dynamically or overridable: add logic to respect an optional
environment variable (e.g., ZONE override) and if not set, query available zones
for the REGION (via gcloud compute zones list filtered by REGION and required
machine type/feature) and pick a suitable zone, then use that dynamic ZONE
wherever ZONE is referenced (the ZONE variable and all places that use it in
provisioning, checks, and instance creation).
🟡 Minor comments (10)
verification/provision_aks.sh-111-115 (1)
111-115: Inconsistent fallback naming in delete mode.The fallback values on lines 113-114 use
zxporter-e2e-rg-${WHO}andzxporter-e2e-${WHO}, butWHOis computed at script start and the create mode usesaks-zxporter-e2e-rg-${WHO}-${CNI_TYPE}/aks-zxporter-e2e-${WHO}-${CNI_TYPE}. If the state file is missing, the fallback won't match the actual created resources.Proposed fix
read_state -RESOURCE_GROUP="${RESOURCE_GROUP:-zxporter-e2e-rg-${WHO}}" -CLUSTER_NAME="${CLUSTER_NAME:-zxporter-e2e-${WHO}}" +RESOURCE_GROUP="${RESOURCE_GROUP:-aks-zxporter-e2e-rg-${WHO}-${CNI_TYPE}}" +CLUSTER_NAME="${CLUSTER_NAME:-aks-zxporter-e2e-${WHO}-${CNI_TYPE}}"verification/verify_e2e.sh-24-24 (1)
24-24: Remove unused ARCH variable.The
ARCHvariable is defined but never used in the script.🧹 Proposed fix
-ARCH=$(go env GOARCH) TAG="dev-$(date +%s)"verification/verify_e2e.sh-115-122 (1)
115-122: Remove duplicate image.pullPolicy setting.Line 120 sets
image.pullPolicytwice in the same helm template command.🐛 Proposed fix
helm template zxporter-netmon helm-chart/zxporter-netmon \ --namespace $NAMESPACE \ --set image.repository=ttl.sh/zxporter-netmon \ --set image.tag=$TAG \ --set image.pullPolicy=$PULL_POLICY \ - --set image.pullPolicy=$PULL_POLICY \ --set config.exportInterval="10s" \ | kubectl apply -f -internal/networkmonitor/ebpf/types.go-18-29 (1)
18-29: Fix section headers printed inside loops.The "Questions:" and "Answers:" headers are printed inside the loops (lines 21 and 25), causing them to appear once per item instead of once per section. This creates confusing output like:
Questions: A IN example.com Questions: A IN google.comInstead of:
Questions: A IN example.com A IN google.com🐛 Proposed fix
func (e DNSEvent) String() string { var str strings.Builder + if len(e.Questions) > 0 { + str.WriteString("Questions:\n") + } for _, v := range e.Questions { - str.WriteString("Questions:\n") str.WriteString(fmt.Sprintf("%s %s %s \n", v.Class, v.Type, string(v.Name))) } + if len(e.Answers) > 0 { + str.WriteString("Answers:\n") + } for _, v := range e.Answers { - str.WriteString("Answers:\n") str.WriteString(fmt.Sprintf("%s %s %s [%s] [%s] \n", v.Class, v.Type, string(v.Name), v.IP, v.CNAME)) } return str.String() }.github/workflows/push-zxporter-netmon-image.yml-237-241 (1)
237-241: Update deprecated GitHub Action version.The workflow uses
actions/checkout@v3which is outdated. The newer@v4version is already used elsewhere in this workflow (lines 29, 298).♻️ Proposed fix
- name: Checkout repo - uses: actions/checkout@v3 + uses: actions/checkout@v4 with: repository: devzero-inc/zxporter path: zxporter-tmp token: ${{ steps.get_workflow_token.outputs.token }}internal/networkmonitor/ebpf/tracer_linux.go-191-207 (1)
191-207: Minor: File handle leak on scanner error.If
scanner.Scan()encounters an error (whichscanner.Err()would return), the function returnsErrCgroup2NotMountedwithout checkingscanner.Err(). While unlikely in practice, the file is properly closed via defer, but the actual scanner error is silently ignored.♻️ Proposed fix
func detectCgroupPath() (string, error) { f, err := os.Open("/proc/mounts") if err != nil { return "", err } defer f.Close() scanner := bufio.NewScanner(f) for scanner.Scan() { fields := strings.Split(scanner.Text(), " ") if len(fields) >= 3 && fields[2] == "cgroup2" { return fields[1], nil } } + if err := scanner.Err(); err != nil { + return "", err + } return "", ErrCgroup2NotMounted }.github/workflows/push-zxporter-netmon-image.yml-137-137 (1)
137-137: Fix typo in error message."Definately" should be "Definitely".
📝 Proposed fix
- # Definately have to fail if any non-latest tags exist + # Definitely have to fail if any non-latest tags exist.github/workflows/push-zxporter-netmon-image.yml-35-64 (1)
35-64: Quote shell variables to prevent word splitting.Multiple shellcheck warnings indicate unquoted variable expansions that could cause issues with unexpected input values.
🔧 Recommended fixes for key variables
GITVERSION=$(git describe --tags --match="v[0-9]*.[0-9]*.[0-9]*" --exclude="chart-*" --always || echo "v0.0.0-$(git rev-parse --short HEAD)") echo "GITVERSION=${GITVERSION}" >> $GITHUB_ENV if [[ "$GITVERSION" =~ ^v([0-9]+)\.([0-9]+)\.(.+)$ ]]; then MAJOR=${BASH_REMATCH[1]} MINOR=${BASH_REMATCH[2]} PATCH=${BASH_REMATCH[3]} PATCH_SHORT=${PATCH%%[-]*} FULLV=v${MAJOR}.${MINOR}.${PATCH_SHORT} - echo "MAJOR=${MAJOR}" >> $GITHUB_ENV - echo "MINOR=${MINOR}" >> $GITHUB_ENV - echo "PATCH=${PATCH}" >> $GITHUB_ENV - echo "PATCH_SHORT=${PATCH_SHORT}" >> $GITHUB_ENV - echo "FULLV=${FULLV}" >> $GITHUB_ENV - echo "IS_RELEASE=true" >> $GITHUB_ENV + echo "MAJOR=${MAJOR}" >> "$GITHUB_ENV" + echo "MINOR=${MINOR}" >> "$GITHUB_ENV" + echo "PATCH=${PATCH}" >> "$GITHUB_ENV" + echo "PATCH_SHORT=${PATCH_SHORT}" >> "$GITHUB_ENV" + echo "FULLV=${FULLV}" >> "$GITHUB_ENV" + echo "IS_RELEASE=true" >> "$GITHUB_ENV"Similar fixes needed in other script blocks (lines 76-79, 88-167, 192-219, etc.)
internal/networkmonitor/cilium_time.go-57-66 (1)
57-66: Fix typo in error message.Line 62 has a grammatical error: "Unable get time" should be "Unable to get time".
📝 Proposed fix
- return 0, fmt.Errorf("Unable get time: %w", err) + return 0, fmt.Errorf("Unable to get time: %w", err)internal/networkmonitor/cilium_time.go-46-51 (1)
46-51: Add documentation explaining the 256x scaling factor in jiffies conversion.The jiffies calculation at line 48 uses
diff << 8(multiply by 256) before dividing by hertz. Standard kernel jiffies-to-seconds conversion is simplydiff / hertz, so this additional scaling factor requires explanation:
- Why is the value scaled by 256?
- Is this for fixed-point arithmetic precision, a kernel ABI requirement, or Cilium-specific handling?
- How does this relate to the expected return value in seconds?
Add an inline comment documenting the purpose of this operation to clarify the conversion logic.
🧹 Nitpick comments (21)
verification/provision_aks.sh (1)
38-40: Add shellcheck directive for dynamic source.ShellCheck cannot follow the non-constant source path. Add a directive to suppress or document this intentional behavior.
Proposed fix
read_state() { + # shellcheck source=/dev/null [[ -f "${STATE_FILE}" ]] && source "${STATE_FILE}" }internal/networkmonitor/conntrack_client.go (2)
47-49: Consider logging if enabling conntrack accounting fails.The error from
WriteFileis silently ignored. While this is best-effort, a warning log would help debug issues when conntrack accounting doesn't work as expected.Proposed fix
func NewNetfilterClient(log logr.Logger) (Client, error) { // Ensure accounting is enabled - _ = os.WriteFile("/proc/sys/net/netfilter/nf_conntrack_acct", []byte{'1'}, 0600) + if err := os.WriteFile("/proc/sys/net/netfilter/nf_conntrack_acct", []byte{'1'}, 0600); err != nil { + log.V(1).Info("failed to enable nf_conntrack_acct (may require CAP_NET_ADMIN)", "error", err) + }
123-126: Silent error suppression on IP conversion.
netaddr.FromStdIPreturns(IP, bool)but theokvalue is discarded. If conversion fails, a zero-valuenetaddr.IPis returned which could lead to confusing behavior downstream.Proposed fix
Consider returning an error or skipping entries with invalid IPs in the caller. At minimum, log when conversion fails:
func ipFromStdIP(ip net.IP) netaddr.IP { - res, _ := netaddr.FromStdIP(ip) + res, ok := netaddr.FromStdIP(ip) + if !ok { + return netaddr.IP{} // Returns zero value; caller should validate + } return res }Alternatively, check the returned IP validity in
ListEntriesbefore constructing the Entry.internal/networkmonitor/dns/ip2dns.go (3)
40-42: Exported field with unexported interface type limits testability.
Tracerfield is exported (uppercase) but its typetraceris unexported. This makes it impossible for external packages to construct anIP2DNSwith a mock tracer. Consider either:
- Export the interface as
Tracerand rename the field- Keep field unexported and only expose via constructor
Proposed fix (option 1: export the interface)
-type tracer interface { +// Tracer defines the interface for DNS event sources. +type Tracer interface { Run(ctx context.Context) error Events() <-chan ebpf.DNSEvent } type IP2DNS struct { - Tracer tracer + tracer Tracer log logr.Logger // ... } -func NewIP2DNS(tracer tracer, log logr.Logger) *IP2DNS { +func NewIP2DNS(tracer Tracer, log logr.Logger) *IP2DNS { return &IP2DNS{ - Tracer: tracer, + tracer: tracer, log: log, } }
61-67: Redundant context wrapping.The
ctxparameter is already a context that can be cancelled. Creating a new derived context withcontext.WithCancel(ctx)and deferring its cancel is unnecessary here since the function returns whenctx.Done()fires anyway.Proposed simplification
func (d *IP2DNS) Start(ctx context.Context) error { d.ipToName = cache.NewContext[string, string](ctx) - ctx, cancel := context.WithCancel(ctx) - defer cancel() errch := make(chan error, 1) go func() { - err := d.Tracer.Run(ctx) + err := d.Tracer.Run(ctx) errch <- err }()The derived context is only useful if you need to cancel the tracer independently of the parent context. If that's intentional, add a comment explaining why.
103-105: CNAME handling is a no-op placeholder.The CNAME case is empty. If CNAME resolution enrichment is planned, consider adding a TODO comment or removing the case to avoid dead code.
Proposed fix
case layers.DNSTypeCNAME: - // CNAME handling if needed + // TODO: Handle CNAME chaining for alias resolution + continue }internal/networkmonitor/ebpf/c/include/vmlinux.h (1)
1-1219: Consider generating vmlinux.h with bpftool instead of maintaining it manually.Manually maintaining vmlinux.h requires keeping type definitions in sync with kernel versions, which is error-prone. The standard practice is to generate it from the target kernel's BTF using
bpftool btf dump file /sys/kernel/btf/vmlinux format c.This approach would eliminate maintenance burden and ensure type accuracy across kernel versions.
verification/traffic-gen.yaml (1)
37-53: Add resource limits and explicit security context.While this is a verification workload, adding resource constraints and an explicit security context prevents resource exhaustion in test environments and follows Kubernetes security best practices.
🔒 Suggested security and resource improvements
containers: - name: traffic-gen image: curlimages/curl + securityContext: + allowPrivilegeEscalation: false + runAsNonRoot: true + capabilities: + drop: + - ALL + resources: + requests: + cpu: 10m + memory: 32Mi + limits: + cpu: 100m + memory: 64Mi command: ["/bin/sh"] args: - "-c"internal/networkmonitor/ebpf/generate.go (1)
3-6: Clarify the generation command and expected output files.The commented go:generate directive appears incomplete for multi-architecture builds. Based on the PR context showing both
bpf_bpfeb.goandbpf_bpfel.gofiles (big-endian and little-endian), the command should specify target architectures.Additionally, Line 4 mentions checking in
bpf_bpfel.oandbpf_bpfel.go, but bpf2go typically generates Go source files (.go) and may not require checking in object files (.o).📝 Suggested clarification
// To regenerate eBPF artifacts, install llvm (brew install llvm) and uncomment the line below. -// Ensure you have the 'bpf_bpfel.o' and 'bpf_bpfel.go' files checked in. +// This generates bpf_bpfeb.go and bpf_bpfel.go (big-endian and little-endian bindings). // -// //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -no-strip -cc clang -cflags $BPF_CFLAGS bpf ./c/egressd.c -- -I./c/include +// //go:generate go run github.com/cilium/ebpf/cmd/bpf2go -no-strip -cc clang -cflags $BPF_CFLAGS -target bpf -target bpfeb bpf ./c/egressd.c -- -I./c/includeverification/verify_e2e.sh (1)
280-296: Fix inconsistent variable naming.The variable is named
RETRY_Count(line 283) but follows the pattern ofMAX_RETRIES(line 282). Use consistent naming convention.♻️ Suggested fix
MAX_RETRIES=6 - RETRY_Count=0 + RETRY_COUNT=0 TOTAL_MSGS=0 - while [ $RETRY_Count -lt $MAX_RETRIES ]; do + while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do STATS_JSON=$(kubectl exec curl-validator -n $NAMESPACE -- curl -v -s http://testserver.${NAMESPACE}.svc.cluster.local:8080/stats) TOTAL_MSGS=$(echo $STATS_JSON | grep -o '"total_messages": [0-9]*' | awk '{print $2}') if [[ "$TOTAL_MSGS" -gt 0 ]]; then break fi - echo " [INFO] TestServer has 0 messages. Waiting 5s for flush... ($((RETRY_Count+1))/$MAX_RETRIES)" + echo " [INFO] TestServer has 0 messages. Waiting 5s for flush... ($((RETRY_COUNT+1))/$MAX_RETRIES)" sleep 5 - RETRY_Count=$((RETRY_Count+1)) + RETRY_COUNT=$((RETRY_COUNT+1)) doneinternal/networkmonitor/ebpf/tracer_nolinux.go (1)
11-16: Clarify the comment about error handling.The comment on line 12 states "return error to be safe" but the function actually returns
nil. Consider updating the comment to accurately reflect the implementation:📝 Suggested comment update
func (t *Tracer) Run(ctx context.Context) error { - // Don't panic, just return error to be safe + // No-op on non-Linux systems; waits for context cancellation t.log.Info("eBPF tracer not implemented on non-linux systems") <-ctx.Done() return nil }helm-chart/zxporter-netmon/templates/daemonset.yaml (1)
28-29: Consider documenting the hostNetwork security implications.The DaemonSet uses
hostNetwork: true, which grants pods access to the host's network namespace. While this is often necessary for network monitoring, it's a privileged setting with security implications. Consider adding a comment in the template or documenting this in the chart's README..github/workflows/helm-release-netmon.yml (1)
50-51: Quote the $GITHUB_ENV variable.The
$GITHUB_ENVvariable should be quoted to prevent potential issues with word splitting or globbing.🔧 Proposed fix
- name: Enable Helm OCI support - run: echo "HELM_EXPERIMENTAL_OCI=1" >> $GITHUB_ENV + run: echo "HELM_EXPERIMENTAL_OCI=1" >> "$GITHUB_ENV"verification/provision_eks.sh (2)
140-142: Address or remove the TODO comment.The TODO comment indicates that
autoModeConfigstill requires testing. Since this configuration is commented out, either test and enable it, or remove the comment if it's intentionally deferred.Would you like me to open a tracking issue for testing autoModeConfig, or should this TODO be removed?
149-152: Document NAT gateway implications.Disabling the NAT gateway (line 151) reduces costs but requires nodes to have public IPs for internet access, which has security implications. The comment mentions this but consider documenting it more prominently or adding verification that the setup works as intended.
internal/networkmonitor/ebpf/tracer_linux.go (1)
103-106: Consider logging the full raw sample metadata for debugging.The size check is correct, but when skipping small events, consider logging additional context like the record timestamp or CPU to aid debugging.
♻️ Optional enhancement
if len(record.RawSample) < 4 { - t.log.Info("skipping too small event", "bytes", len(record.RawSample)) + t.log.Info("skipping too small event", "bytes", len(record.RawSample), "cpu", record.CPU) continue }Makefile (1)
703-709: Lima installation assumes macOS with Homebrew.The
brew install limacommand will fail on Linux environments. Consider adding OS detection or documenting this limitation.🔧 Proposed fix for cross-platform support
.PHONY: setup-lima setup-lima: @echo "Checking for Lima..." - @which limactl >/dev/null || (echo "Installing Lima..." && brew install lima) + @which limactl >/dev/null || { \ + if command -v brew >/dev/null 2>&1; then \ + echo "Installing Lima via Homebrew..." && brew install lima; \ + else \ + echo "Error: Lima not found and Homebrew not available. Please install Lima manually: https://lima-vm.io/docs/installation/"; \ + exit 1; \ + fi; \ + } @echo "Checking if instance $(LIMA_INSTANCE) exists..."internal/networkmonitor/monitor.go (2)
90-102: Consider validating required dependencies.The constructor accepts
podCache,ct, anddnsCollectorwithout validation. Ifct(conntrack client) is nil,collect()will panic when callingm.ct.ListEntries(). Consider adding nil checks for critical dependencies.🔧 Proposed fix
func NewMonitor(cfg Config, log logr.Logger, podCache *PodCache, ct Client, dnsCollector dns.DNSCollector, dakrClient transport.DakrClient) (*Monitor, error) { + if ct == nil { + return nil, fmt.Errorf("conntrack client is required") + } return &Monitor{ cfg: cfg, log: log,
241-246: Misleading comment: handler does NOT reset counters.The comment on line 242 says "This currently resets the counters after read (Delta mode)" but the implementation creates copies without resetting (line 267 comment correctly notes "Peek behavior, no reset"). Consider updating the header comment to match the actual behavior.
📝 Proposed fix
// GetMetricsHandler serves the collected metrics -// NOTE: This currently resets the counters after read (Delta mode), mimicking egressd's behavior +// NOTE: This returns a snapshot of metrics without resetting counters (Peek mode). +// Delta mode with counter reset is used only for flush() to the control plane. func (m *Monitor) GetMetricsHandler(w http.ResponseWriter, r *http.Request) {internal/networkmonitor/cilium_kernel_hz.go (2)
46-60: Consider capturing time.Now() closer to the timestamp read.The wall clock time is captured at Line 50 before scanning begins, while the kernel timestamp may be read several lines later. For maximum accuracy, consider capturing
time.Now()immediately after successfully parsing the timestamp line.⏱️ Proposed refactor for improved timing accuracy
func readSchedstat(f io.ReadSeeker) (ktime, error) { defer func() { _, _ = f.Seek(0, 0) }() var j uint64 - var t = time.Now() s := bufio.NewScanner(f) for s.Scan() { if _, err := fmt.Sscanf(s.Text(), "timestamp %d", &j); err == nil { + t := time.Now() return ktime{j, t}, nil } } return ktime{}, errors.New("no kernel timestamp found") }
87-107: Consider simplifying distance calculation for clarity.The unsigned arithmetic pattern at lines 95-98 is correct but could be clearer. The initial assignment
d := uint16(in - v)would underflow whenin < v, requiring the subsequent correction.♻️ Proposed refactor for improved readability
func nearest(in uint16, values []uint16) (uint16, error) { if len(values) == 0 { return 0, errors.New("values cannot be empty") } var out uint16 min := ^uint16(0) for _, v := range values { - d := uint16(in - v) + var d uint16 if in < v { d = v - in + } else { + d = in - v } if d < min { min = d out = v } } return out, nil }
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (9)
- gen/api/v1/apiv1connect/metrics_collector.connect.go is excluded by !**/gen/**
- gen/api/v1/common.pb.go is excluded by !**/*.pb.go, !**/gen/**
- gen/api/v1/k8s.pb.go is excluded by !**/*.pb.go, !**/gen/**
- gen/api/v1/metrics_collector.pb.go is excluded by !**/*.pb.go, !**/gen/**
- gen/api/v1/metrics_collector_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
- go.sum is excluded by !**/*.sum
- internal/networkmonitor/ebpf/bpf_bpfeb.o is excluded by !**/*.o
- internal/networkmonitor/ebpf/bpf_bpfel.o is excluded by !**/*.o
- proto/dakr_proto_descriptor.bin is excluded by !**/*.bin
📒 Files selected for processing (63)
- .github/workflows/helm-release-netmon.yml
- .github/workflows/push-zxporter-netmon-image.yml
- .gitignore
- Dockerfile.netmon
- Makefile
- cmd/zxporter-netmon/main.go
- config/zxporter-netmon.yaml
- go.mod
- helm-chart/zxporter-netmon/Chart.yaml
- helm-chart/zxporter-netmon/templates/_helpers.tpl
- helm-chart/zxporter-netmon/templates/clusterrole.yaml
- helm-chart/zxporter-netmon/templates/clusterrolebinding.yaml
- helm-chart/zxporter-netmon/templates/daemonset.yaml
- helm-chart/zxporter-netmon/templates/serviceaccount.yaml
- helm-chart/zxporter-netmon/values.yaml
- internal/networkmonitor/cilium_client.go
- internal/networkmonitor/cilium_client_linux.go
- internal/networkmonitor/cilium_client_nolinux.go
- internal/networkmonitor/cilium_kernel_hz.go
- internal/networkmonitor/cilium_time.go
- internal/networkmonitor/conntrack_client.go
- internal/networkmonitor/dns/ip2dns.go
- internal/networkmonitor/ebpf/bpf_bpfeb.go
- internal/networkmonitor/ebpf/bpf_bpfel.go
- internal/networkmonitor/ebpf/c/egressd.c
- internal/networkmonitor/ebpf/c/include/LICENSE.BSD-2-Clause
- internal/networkmonitor/ebpf/c/include/bpf_core_read.h
- internal/networkmonitor/ebpf/c/include/bpf_endian.h
- internal/networkmonitor/ebpf/c/include/bpf_helper_defs.h
- internal/networkmonitor/ebpf/c/include/bpf_helpers.h
- internal/networkmonitor/ebpf/c/include/bpf_tracing.h
- internal/networkmonitor/ebpf/c/include/common.h
- internal/networkmonitor/ebpf/c/include/maps.h
- internal/networkmonitor/ebpf/c/include/types.h
- internal/networkmonitor/ebpf/c/include/update.sh
- internal/networkmonitor/ebpf/c/include/vmlinux.h
- internal/networkmonitor/ebpf/c/include/vmlinux_flavors.h
- internal/networkmonitor/ebpf/c/include/vmlinux_missing.h
- internal/networkmonitor/ebpf/cgroup_linux.go
- internal/networkmonitor/ebpf/cgroup_nolinux.go
- internal/networkmonitor/ebpf/generate.go
- internal/networkmonitor/ebpf/tracer.go
- internal/networkmonitor/ebpf/tracer_linux.go
- internal/networkmonitor/ebpf/tracer_linux_test.go
- internal/networkmonitor/ebpf/tracer_nolinux.go
- internal/networkmonitor/ebpf/types.go
- internal/networkmonitor/monitor.go
- internal/networkmonitor/pod_cache.go
- internal/transport/connect_compression_test.go
- internal/transport/dakr_client.go
- internal/transport/interface.go
- internal/transport/sender.go
- test/stats/types.go
- test/testserver/main.go
- verification/kind-config.yaml
- verification/lima.yaml
- verification/provision_aks.sh
- verification/provision_eks.sh
- verification/provision_gke.sh
- verification/traffic-gen.yaml
- verification/traffic-server.yaml
- verification/verify_e2e.sh
- verification/verify_local.sh
✅ Files skipped from review due to trivial changes (1)
- internal/networkmonitor/ebpf/cgroup_nolinux.go
🚧 Files skipped from review as they are similar to previous changes (20)
- internal/networkmonitor/ebpf/cgroup_linux.go
- internal/transport/dakr_client.go
- internal/networkmonitor/ebpf/c/include/types.h
- internal/transport/sender.go
- helm-chart/zxporter-netmon/Chart.yaml
- internal/networkmonitor/ebpf/c/include/LICENSE.BSD-2-Clause
- internal/networkmonitor/ebpf/c/include/common.h
- internal/networkmonitor/cilium_client.go
- verification/lima.yaml
- internal/networkmonitor/ebpf/tracer_linux_test.go
- internal/networkmonitor/ebpf/bpf_bpfel.go
- internal/networkmonitor/cilium_client_linux.go
- helm-chart/zxporter-netmon/templates/_helpers.tpl
- .gitignore
- internal/transport/connect_compression_test.go
- internal/transport/interface.go
- verification/kind-config.yaml
- internal/networkmonitor/ebpf/c/include/maps.h
- internal/networkmonitor/cilium_client_nolinux.go
- internal/networkmonitor/ebpf/c/include/update.sh
🧰 Additional context used
🧬 Code graph analysis (11)
cmd/zxporter-netmon/main.go (10)
internal/networkmonitor/pod_cache.go (2)
PodCache(12-16)NewPodCache(20-40)internal/networkmonitor/conntrack_client.go (2)
Client(36-39)NewNetfilterClient(47-59)internal/networkmonitor/cilium_client.go (2)
CiliumAvailable(20-25)NewCiliumClient(9-18)internal/networkmonitor/cilium_time.go (1)
ClockSourceKtime(25-25)internal/networkmonitor/dns/ip2dns.go (2)
DNSCollector(24-29)NewIP2DNS(51-56)internal/networkmonitor/ebpf/tracer_linux.go (1)
IsKernelBTFAvailable(128-131)internal/networkmonitor/ebpf/tracer_nolinux.go (1)
IsKernelBTFAvailable(22-24)internal/networkmonitor/monitor.go (2)
Config(24-29)NewMonitor(91-102)internal/transport/interface.go (1)
DakrClient(15-31)internal/transport/dakr_client.go (1)
NewDakrClient(53-168)
internal/networkmonitor/pod_cache.go (1)
internal/collector/interface.go (1)
Pod(94-94)
internal/networkmonitor/ebpf/tracer_nolinux.go (3)
internal/networkmonitor/ebpf/tracer.go (1)
Tracer(21-25)internal/networkmonitor/ebpf/types.go (1)
DNSEvent(11-16)internal/networkmonitor/ebpf/tracer_linux.go (2)
IsKernelBTFAvailable(128-131)InitCgroupv2(133-142)
internal/networkmonitor/ebpf/tracer_linux.go (3)
internal/networkmonitor/ebpf/tracer.go (1)
Tracer(21-25)internal/networkmonitor/ebpf/types.go (1)
DNSEvent(11-16)internal/networkmonitor/ebpf/tracer_nolinux.go (2)
IsKernelBTFAvailable(22-24)InitCgroupv2(26-28)
internal/networkmonitor/ebpf/tracer.go (3)
internal/logger/logger.go (1)
Logger(26-29)internal/networkmonitor/monitor.go (1)
Config(24-29)internal/networkmonitor/ebpf/types.go (1)
DNSEvent(11-16)
internal/networkmonitor/conntrack_client.go (3)
internal/logger/logger.go (1)
Logger(26-29)internal/networkmonitor/ebpf/tracer.go (1)
Config(5-8)internal/networkmonitor/monitor.go (1)
Config(24-29)
internal/networkmonitor/cilium_time.go (1)
internal/networkmonitor/cilium_client_nolinux.go (3)
ClockSource(11-11)ClockSourceKtime(14-14)ClockSourceJiffies(15-15)
internal/networkmonitor/dns/ip2dns.go (2)
internal/networkmonitor/ebpf/types.go (1)
DNSEvent(11-16)internal/networkmonitor/ebpf/tracer.go (1)
Tracer(21-25)
test/testserver/main.go (2)
gen/api/v1/metrics_collector.pb.go (6)
SendNetworkTrafficMetricsRequest(2281-2292)SendNetworkTrafficMetricsRequest(2307-2307)SendNetworkTrafficMetricsRequest(2322-2324)SendNetworkTrafficMetricsResponse(2368-2374)SendNetworkTrafficMetricsResponse(2389-2389)SendNetworkTrafficMetricsResponse(2404-2406)test/stats/types.go (2)
NetworkTrafficStat(87-98)DnsLookupStat(101-108)
internal/networkmonitor/monitor.go (5)
internal/networkmonitor/dns/ip2dns.go (2)
DNSLookup(17-22)DNSCollector(24-29)internal/networkmonitor/conntrack_client.go (3)
Client(36-39)Entry(15-24)FilterBySrcIP(28-33)internal/networkmonitor/pod_cache.go (1)
PodCache(12-16)internal/transport/interface.go (1)
DakrClient(15-31)gen/api/v1/metrics_collector.pb.go (12)
Namespace(1469-1503)Namespace(1518-1518)Namespace(1533-1535)DnsLookupItem(2194-2205)DnsLookupItem(2220-2220)DnsLookupItem(2235-2237)NetworkTrafficItem(2067-2083)NetworkTrafficItem(2098-2098)NetworkTrafficItem(2113-2115)SendNetworkTrafficMetricsRequest(2281-2292)SendNetworkTrafficMetricsRequest(2307-2307)SendNetworkTrafficMetricsRequest(2322-2324)
verification/provision_eks.sh (2)
verification/provision_aks.sh (4)
need(4-4)cluster_exists(42-44)write_state(29-36)read_state(38-40)verification/provision_gke.sh (3)
need(4-4)cluster_exists(41-43)write_state(28-33)
🪛 actionlint (1.7.10)
.github/workflows/push-zxporter-netmon-image.yml
35-35: shellcheck reported issue in this script: SC2086:info:12:28: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:13:28: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:14:28: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:15:40: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:16:28: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:17:29: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:21:38: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:23:21: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:24:21: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:25:21: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:26:26: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:27:30: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:28:30: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2086:info:4:36: Double quote to prevent globbing and word splitting
(shellcheck)
35-35: shellcheck reported issue in this script: SC2129:style:12:3: Consider using { cmd1; cmd2; } >> file instead of individual redirects
(shellcheck)
35-35: shellcheck reported issue in this script: SC2129:style:23:3: Consider using { cmd1; cmd2; } >> file instead of individual redirects
(shellcheck)
76-76: shellcheck reported issue in this script: SC2086:info:1:53: Double quote to prevent globbing and word splitting
(shellcheck)
76-76: shellcheck reported issue in this script: SC2086:info:2:32: Double quote to prevent globbing and word splitting
(shellcheck)
76-76: shellcheck reported issue in this script: SC2086:info:3:55: Double quote to prevent globbing and word splitting
(shellcheck)
76-76: shellcheck reported issue in this script: SC2129:style:1:1: Consider using { cmd1; cmd2; } >> file instead of individual redirects
(shellcheck)
88-88: shellcheck reported issue in this script: SC2086:info:78:41: Double quote to prevent globbing and word splitting
(shellcheck)
88-88: shellcheck reported issue in this script: SC2086:info:7:67: Double quote to prevent globbing and word splitting
(shellcheck)
88-88: shellcheck reported issue in this script: SC2145:error:27:29: Argument mixes string and array. Use * or separate argument
(shellcheck)
88-88: shellcheck reported issue in this script: SC2145:error:39:30: Argument mixes string and array. Use * or separate argument
(shellcheck)
192-192: shellcheck reported issue in this script: SC2086:info:12:45: Double quote to prevent globbing and word splitting
(shellcheck)
192-192: shellcheck reported issue in this script: SC2086:info:13:50: Double quote to prevent globbing and word splitting
(shellcheck)
192-192: shellcheck reported issue in this script: SC2086:info:17:23: Double quote to prevent globbing and word splitting
(shellcheck)
192-192: shellcheck reported issue in this script: SC2086:info:18:23: Double quote to prevent globbing and word splitting
(shellcheck)
192-192: shellcheck reported issue in this script: SC2086:info:19:23: Double quote to prevent globbing and word splitting
(shellcheck)
192-192: shellcheck reported issue in this script: SC2086:info:20:28: Double quote to prevent globbing and word splitting
(shellcheck)
192-192: shellcheck reported issue in this script: SC2086:info:21:29: Double quote to prevent globbing and word splitting
(shellcheck)
192-192: shellcheck reported issue in this script: SC2086:info:22:32: Double quote to prevent globbing and word splitting
(shellcheck)
192-192: shellcheck reported issue in this script: SC2086:info:23:28: Double quote to prevent globbing and word splitting
(shellcheck)
192-192: shellcheck reported issue in this script: SC2145:error:27:57: Argument mixes string and array. Use * or separate argument
(shellcheck)
192-192: shellcheck reported issue in this script: SC2145:error:4:44: Argument mixes string and array. Use * or separate argument
(shellcheck)
237-237: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
252-252: shellcheck reported issue in this script: SC2086:info:4:27: Double quote to prevent globbing and word splitting
(shellcheck)
252-252: shellcheck reported issue in this script: SC2086:info:7:26: Double quote to prevent globbing and word splitting
(shellcheck)
266-266: shellcheck reported issue in this script: SC2086:info:11:36: Double quote to prevent globbing and word splitting
(shellcheck)
266-266: shellcheck reported issue in this script: SC2086:info:16:35: Double quote to prevent globbing and word splitting
(shellcheck)
289-289: shellcheck reported issue in this script: SC2086:info:6:28: Double quote to prevent globbing and word splitting
(shellcheck)
311-311: property "workflow_name" is not defined in object type {push_latest: bool}
(expression)
318-318: property "workflow_name" is not defined in object type {push_latest: bool}
(expression)
.github/workflows/helm-release-netmon.yml
51-51: shellcheck reported issue in this script: SC2086:info:1:35: Double quote to prevent globbing and word splitting
(shellcheck)
58-58: shellcheck reported issue in this script: SC2086:info:8:38: Double quote to prevent globbing and word splitting
(shellcheck)
70-70: shellcheck reported issue in this script: SC2086:info:22:36: Double quote to prevent globbing and word splitting
(shellcheck)
🪛 checkmake (0.2.2)
Makefile
[warning] 817-817: Missing required phony target "clean"
(minphony)
🪛 Checkov (3.2.334)
config/zxporter-netmon.yaml
[medium] 9-62: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 9-62: Container should not be privileged
(CKV_K8S_16)
[medium] 9-62: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 9-62: Containers should not share the host network namespace
(CKV_K8S_19)
verification/traffic-gen.yaml
[medium] 1-53: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-53: Minimize the admission of root containers
(CKV_K8S_23)
verification/traffic-server.yaml
[medium] 1-43: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-43: Minimize the admission of root containers
(CKV_K8S_23)
🪛 Clang (14.0.6)
internal/networkmonitor/ebpf/c/include/vmlinux.h
[error] 258-258: field has incomplete type 'struct thread_info'
(clang-diagnostic-error)
🪛 Cppcheck (2.19.0)
internal/networkmonitor/ebpf/c/egressd.c
[error] 13-13: There is an unknown macro here somewhere. Configuration is required. If SEC is a macro then please configure it.
(unknownMacro)
🪛 Shellcheck (0.11.0)
verification/verify_e2e.sh
[warning] 24-24: ARCH appears unused. Verify use (or export if used externally).
(SC2034)
verification/provision_aks.sh
[warning] 39-39: ShellCheck can't follow non-constant source. Use a directive to specify location.
(SC1090)
verification/provision_eks.sh
[warning] 57-57: ShellCheck can't follow non-constant source. Use a directive to specify location.
(SC1090)
verification/provision_gke.sh
[warning] 37-37: ShellCheck can't follow non-constant source. Use a directive to specify location.
(SC1090)
🪛 YAMLlint (1.37.1)
helm-chart/zxporter-netmon/templates/daemonset.yaml
[error] 7-7: syntax error: expected the node content, but found '-'
(syntax)
helm-chart/zxporter-netmon/templates/clusterrole.yaml
[error] 5-5: syntax error: expected the node content, but found '-'
(syntax)
helm-chart/zxporter-netmon/templates/serviceaccount.yaml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
helm-chart/zxporter-netmon/templates/clusterrolebinding.yaml
[error] 5-5: syntax error: expected the node content, but found '-'
(syntax)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Run make test
- GitHub Check: Build Docker Image
- GitHub Check: Build Docker Image
- GitHub Check: Analyze (go)
var hertz uint16

func init() {
	var err error
	hertz, err = getKernelHZ()
	if err != nil {
		hertz = 1
	}
}
Problematic fallback value could break jiffies calculations.
The fallback hertz = 1 on initialization error is concerning. Typical kernel Hz values range from 100 to 1000, so a value of 1 would cause jiffies-based time calculations (line 49) to be off by 2-3 orders of magnitude. Additionally, the error is silently swallowed with no logging.
Consider:
- Using a more reasonable fallback (e.g., 250 or 100)
- Logging the error and fallback
- Or failing fast if Hz cannot be determined
🔧 Proposed fix with logging and safer fallback
Add logging import if not already present, then:
+import (
+ "fmt"
+ "github.com/go-logr/logr"
+)
+
var hertz uint16
func init() {
var err error
hertz, err = getKernelHZ()
if err != nil {
- hertz = 1
+ // Fallback to a more typical value (250 Hz is common)
+ hertz = 250
+ // Log the error if logger is available
+ fmt.Fprintf(os.Stderr, "Warning: failed to detect kernel HZ, using fallback %d: %v\n", hertz, err)
}
}

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In @internal/networkmonitor/cilium_time.go around lines 12 - 20, The init() sets
a dangerous fallback hertz = 1 when getKernelHZ() fails and swallows the error;
update init() to choose a realistic fallback (e.g., 250 or 100) instead of 1,
log the getKernelHZ() error and the chosen fallback via the package logger (or
return/fail fast if that’s preferred), and ensure callers using hertz (e.g.,
jiffies calculations) will see correct magnitude; locate symbols init, hertz and
getKernelHZ to implement the change and add a processLogger or package log call
to record the failure and fallback.
| echo "Cleaning up k8s resources (best effort)..." | ||
| kubectl delete svc -A --all --timeout=10s >/dev/null 2>&1 || true | ||
| kubectl delete pvc -A --all --timeout=10s >/dev/null 2>&1 || true | ||
| fi |
Aggressive cleanup deletes ALL services and PVCs cluster-wide.
kubectl delete svc -A --all will delete all services across all namespaces, including critical kube-system services (like kube-dns). While this runs best-effort before resource group deletion, it could cause issues if the deletion is interrupted or fails.
Consider limiting cleanup to test-related namespaces or excluding kube-system:
Proposed fix
echo "Cleaning up k8s resources (best effort)..."
- kubectl delete svc -A --all --timeout=10s >/dev/null 2>&1 || true
- kubectl delete pvc -A --all --timeout=10s >/dev/null 2>&1 || true
+ # Cleanup test namespaces only, excluding kube-system
+ for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | grep -v '^kube-'); do
+ kubectl delete svc -n "$ns" --all --timeout=10s >/dev/null 2>&1 || true
+ kubectl delete pvc -n "$ns" --all --timeout=10s >/dev/null 2>&1 || true
+ done

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

echo "Cleaning up k8s resources (best effort)..."
# Cleanup test namespaces only, excluding kube-system
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | grep -v '^kube-'); do
  kubectl delete svc -n "$ns" --all --timeout=10s >/dev/null 2>&1 || true
  kubectl delete pvc -n "$ns" --all --timeout=10s >/dev/null 2>&1 || true
done
fi
# Note: T2A might not be in all zones of us-central1. us-central1-a,b,f usually.
# Let's target a specific zone to simplify.
ZONE="${REGION}-a"
Hardcoded zone suffix may fail in some regions.
The zone is hardcoded as ${REGION}-a (Lines 62, 94, 107), which assumes all regions have an "a" zone and that T2A instances are available there. This assumption may not hold for all GCP regions.
Consider deriving available zones dynamically or allowing zone override:
🔧 Proposed fix
+# Allow zone override or detect available zones with T2A
+ZONE="${ZONE:-${REGION}-a}"
+
+# Verify zone supports the machine type
+if ! gcloud compute machine-types list \
+ --filter="name=${NODE_MACHINE_TYPE} AND zone:${ZONE}" \
+ --format="value(name)" --limit=1 | grep -q .; then
+ echo "ERROR: Machine type ${NODE_MACHINE_TYPE} not available in ${ZONE}" >&2
+ exit 1
+fi
+
- ZONE="${REGION}-a"

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In @verification/provision_gke.sh around lines 60 - 62, The script currently
hardcodes ZONE="${REGION}-a", which breaks in regions without an "-a" zone or
where T2A is unavailable; change ZONE to be derived dynamically or overridable:
add logic to respect an optional environment variable (e.g., ZONE override) and
if not set, query available zones for the REGION (via gcloud compute zones list
filtered by REGION and required machine type/feature) and pick a suitable zone,
then use that dynamic ZONE wherever ZONE is referenced (the ZONE variable and
all places that use it in provisioning, checks, and instance creation).
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @internal/networkmonitor/pod_cache_test.go:
- Around line 27-28: The test currently ignores the return value from
factory.WaitForCacheSync; capture its result (e.g., synced :=
factory.WaitForCacheSync(ctx.Done())), iterate the returned map and fail the
test if any informer did not sync (for example with t.Fatalf or t.Fatal when any
value is false). Reference the existing
factory.Start(ctx.Done())/factory.WaitForCacheSync(ctx.Done()) calls and add the
"k8s.io/client-go/tools/cache" import if needed for types; ensure you use
ctx.Done() as the stop channel when checking the sync result.
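A sketch of that check, assuming a fake clientset and a shared informer factory; the surrounding test in pod_cache_test.go will differ in setup details:

package sketch

import (
	"context"
	"testing"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
)

// startInformers fails the test if any registered informer never syncs.
func startInformers(t *testing.T) {
	t.Helper()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 0)
	_ = factory.Core().V1().Pods().Informer() // register before starting

	factory.Start(ctx.Done())
	for typ, ok := range factory.WaitForCacheSync(ctx.Done()) {
		if !ok {
			t.Fatalf("informer for %v failed to sync", typ)
		}
	}
}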
In @Makefile:
- Around line 703-709: The Makefile target setup-lima currently hardcodes
macOS-only installation via "brew install lima"; modify the installation step to
detect the OS and choose the appropriate installer for limactl: if uname
indicates Darwin use brew, if Linux detect distro (apt/dnf/yum) and run the
distro-appropriate package install or fallback to the official install script,
and if Windows/WSL print an error or instructions; update the block around the
existing which limactl check in the setup-lima target so the install command is
conditional and still preserves the existing instance start logic using
LIMA_INSTANCE and LIMA_CONFIG.
🧹 Nitpick comments (4)
internal/networkmonitor/monitor_test.go (1)
50-70: Consider verifying hash consistency for IPv6.While the current test ensures IPv6 hashing doesn't panic and produces non-zero results, it could be more robust by verifying that hashing the same IPv6 entry multiple times produces consistent results.
♻️ Optional enhancement
func TestIPv6Hashing(t *testing.T) { // Verify that IPv6 addresses are hashed correctly and do not panic srcIP := netaddr.MustParseIP("2001:db8::1") dstIP := netaddr.MustParseIP("2001:db8::2") entry := &Entry{ Src: netaddr.IPPortFrom(srcIP, 12345), Dst: netaddr.IPPortFrom(dstIP, 80), Proto: 6, // TCP } - // Should not panic + // Should not panic and should be consistent h1 := conntrackEntryKey(entry) h2 := entryGroupKey(entry) if h1 == 0 { t.Error("conntrackEntryKey returned 0 for IPv6") } if h2 == 0 { t.Error("entryGroupKey returned 0 for IPv6") } + + // Verify consistency + h1Again := conntrackEntryKey(entry) + h2Again := entryGroupKey(entry) + if h1 != h1Again { + t.Errorf("conntrackEntryKey inconsistent for IPv6: %d vs %d", h1, h1Again) + } + if h2 != h2Again { + t.Errorf("entryGroupKey inconsistent for IPv6: %d vs %d", h2, h2Again) + } }cmd/zxporter-netmon/main.go (2)
28-35: Rename the kubeconfig flag for clarity.The flag
metrics-kubeconfig(line 32) is misleading—it's used for general Kubernetes client operations (e.g., creating clientsets, informers) throughout the application, not just for metrics. Renaming it to simplykubeconfigwould better reflect its purpose and reduce confusion.📝 Suggested refactor
- kubeconfig = flag.String("metrics-kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster.") + kubeconfig = flag.String("kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster.")
46-48: Consider handling the logger initialization error.The error from
zap.NewProduction()is currently ignored (line 47). While this constructor rarely fails in practice, handling the error defensively would make the initialization more robust.🛡️ Defensive error handling
- zapLog, _ := zap.NewProduction() + zapLog, err := zap.NewProduction() + if err != nil { + fmt.Fprintf(os.Stderr, "failed to initialize logger: %v\n", err) + os.Exit(1) + } logger := zapr.NewLogger(zapLog)Makefile (1)
821-823: LGTM: Clean metrics target is straightforward.The
clean-metricstarget properly removes generated verification artifacts.Optional: Consider adding a top-level
cleantarget.Makefile conventions typically include a
cleantarget that orchestrates all cleanup operations. You could add:.PHONY: clean clean: clean-metrics helm-chart-clean ## Clean all build artifacts @rm -rf $(DIST_DIR) @rm -rf $(LOCALBIN)Based on static analysis hints.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
Dockerfile.netmonMakefilecmd/zxporter-netmon/main.gohelm-chart/zxporter-netmon/templates/clusterrole.yamlhelm-chart/zxporter-netmon/templates/clusterrolebinding.yamlinternal/networkmonitor/monitor.gointernal/networkmonitor/monitor_test.gointernal/networkmonitor/pod_cache.gointernal/networkmonitor/pod_cache_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- Dockerfile.netmon
- internal/networkmonitor/pod_cache.go
- internal/networkmonitor/monitor.go
🧰 Additional context used
🧬 Code graph analysis (3)
internal/networkmonitor/monitor_test.go (1)
internal/networkmonitor/conntrack_client.go (1)
Entry(15-24)
cmd/zxporter-netmon/main.go (6)
internal/networkmonitor/conntrack_client.go (1)
Client(36-39)internal/networkmonitor/cilium_client_nolinux.go (2)
CiliumAvailable(22-24)NewCiliumClient(18-20)internal/networkmonitor/ebpf/tracer_linux.go (1)
IsKernelBTFAvailable(128-131)internal/networkmonitor/ebpf/tracer_nolinux.go (1)
IsKernelBTFAvailable(22-24)internal/networkmonitor/monitor.go (1)
Config(24-29)internal/transport/interface.go (1)
DakrClient(15-31)
internal/networkmonitor/pod_cache_test.go (1)
internal/networkmonitor/pod_cache.go (2)
NewPodCache(20-57)
PodCache(12-16)
🪛 checkmake (0.2.2)
Makefile
[warning] 821-821: Missing required phony target "clean"
(minphony)
🪛 YAMLlint (1.37.1)
helm-chart/zxporter-netmon/templates/clusterrole.yaml
[error] 4-4: syntax error: expected , but found ''
(syntax)
helm-chart/zxporter-netmon/templates/clusterrolebinding.yaml
[error] 4-4: syntax error: expected , but found ''
(syntax)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build Docker Image
- GitHub Check: Run make test
- GitHub Check: Build Docker Image
- GitHub Check: Analyze (go)
🔇 Additional comments (13)
internal/networkmonitor/pod_cache_test.go (5)
30-47: LGTM! Pod creation logic is correct, with appropriate error handling and cache synchronization wait.
49-68: LGTM! The update logic correctly verifies both the new IP mapping and stale entry removal. The inline comment appropriately explains the timing assumptions.
70-77: LGTM! Pod deletion and verification logic is correct.
80-98: LGTM! The polling helper is well-designed with appropriate timeout, interval, and validation logic.
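As a point of reference, the general shape of such a polling helper is sketched below; the name, signature, and plain retry loop are illustrative assumptions rather than the exact helper in pod_cache_test.go.

package networkmonitor

import (
	"fmt"
	"time"
)

// waitFor retries check every interval until it returns true or the timeout
// elapses, mirroring the wait-then-assert pattern used by test helpers.
func waitFor(timeout, interval time.Duration, check func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %s", timeout)
		}
		time.Sleep(interval)
	}
}

A caller passes the timeout, the poll interval, and a closure that checks the cache, and fails the test if the returned error is non-nil.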
100-116: LGTM! The deletion verification helper follows the same solid pattern as waitForPod.
helm-chart/zxporter-netmon/templates/clusterrole.yaml (1)
1-9: LGTM! Past issue has been resolved.
The ClusterRole now correctly uses the fullname helper for naming (line 4), which addresses the concern from the previous review. The RBAC permissions are appropriately scoped to read-only access for pods, following the principle of least privilege.
Note: The YAMLlint syntax error is a false positive; it doesn't recognize Helm template directives.
helm-chart/zxporter-netmon/templates/clusterrolebinding.yaml (1)
1-13: LGTM! Correct helper usage for RBAC binding.
The ClusterRoleBinding properly uses the fullname helper for both the binding name (line 4) and the ClusterRole reference (line 9), while using the serviceAccountName helper for the subject (line 12). This separation allows for flexible ServiceAccount name overrides via values.yaml while maintaining consistent ClusterRole naming.
Note: The YAMLlint syntax error is a false positive; it doesn't recognize Helm template directives.
Makefile (6)
253-262: LGTM: Netmon image targets follow established patterns.
The docker build and push targets for zxporter-netmon are well-structured and consistent with the existing testserver and stress image targets.
711-723: LGTM: Verification targets are well-structured.
The verify-local and verify-e2e targets are properly parameterized and include helpful usage comments. The chmod commands ensure the scripts are executable.
725-746: LGTM: Provisioning targets are properly structured.
The EKS and Kind provisioning targets are straightforward wrappers around shell scripts. The Kind cluster creation includes a helpful fallback for existing clusters.
748-765: LGTM: EKS lifecycle properly handles cleanup on verification failure.
The verify-e2e-eks-lifecycle target correctly implements error handling to ensure deprovision-eks always runs, even when verification fails. This addresses the concern raised in the previous review. The pattern captures the verification exit code, guarantees cleanup execution, and then exits with the original verification result.
767-793: LGTM: AKS lifecycle properly handles cleanup on verification failure.
The verify-e2e-aks-lifecycle target correctly implements the same error handling pattern as EKS, ensuring deprovision-aks always runs regardless of verification outcome. This addresses the concern from the previous review.
134-134: Remove or clarify the purpose of go generate ./....
There are no active //go:generate directives in the codebase. The only directive in internal/networkmonitor/ebpf/generate.go is commented out and requires manual uncommentation plus LLVM installation to function. The eBPF artifacts are pre-generated and checked into the repository. As written, go generate ./... is a no-op and doesn't actually generate any code during the build.
Likely an incorrect or invalid review comment.
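For context, once uncommented, a bpf2go-style directive in a file like generate.go typically has the shape sketched below; the object identifier, C source path, and include directory are placeholders, not the repository's actual values.

// Illustrative sketch only: the identifier and paths below are assumed.
package ebpf

// Requires clang/LLVM on the build host; run via `go generate ./...`.
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang dnstracer ./bpf/dns_tracer.c -- -I./headers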
.PHONY: setup-lima
setup-lima:
	@echo "Checking for Lima..."
	@which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
	@echo "Checking if instance $(LIMA_INSTANCE) exists..."
	@limactl list | grep -q $(LIMA_INSTANCE) || (echo "Creating instance $(LIMA_INSTANCE)..." && limactl start --tty=false --name=$(LIMA_INSTANCE) $(LIMA_CONFIG))
	@echo "Lima instance $(LIMA_INSTANCE) is ready."
Add cross-platform support for Lima installation.
Line 706 uses brew install lima, which only works on macOS. This will fail on Linux systems.
🔧 Suggested improvements for cross-platform support
Consider checking the operating system and providing appropriate installation instructions:
.PHONY: setup-lima
setup-lima:
@echo "Checking for Lima..."
- @which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
+ @which limactl >/dev/null || { \
+ echo "Lima not found. Please install it:"; \
+ echo " macOS: brew install lima"; \
+ echo " Linux: See https://github.com/lima-vm/lima#installation"; \
+ exit 1; \
+ }
@echo "Checking if instance $(LIMA_INSTANCE) exists..."
@limactl list | grep -q $(LIMA_INSTANCE) || (echo "Creating instance $(LIMA_INSTANCE)..." && limactl start --tty=false --name=$(LIMA_INSTANCE) $(LIMA_CONFIG))
@echo "Lima instance $(LIMA_INSTANCE) is ready."Alternatively, add OS detection:
.PHONY: setup-lima
setup-lima:
@echo "Checking for Lima..."
- @which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
+ @which limactl >/dev/null || { \
+ echo "Installing Lima..."; \
+ if [ "$$(uname)" = "Darwin" ]; then \
+ brew install lima; \
+ elif [ "$$(uname)" = "Linux" ]; then \
+ echo "Please install Lima: https://github.com/lima-vm/lima#installation"; \
+ exit 1; \
+ fi; \
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
.PHONY: setup-lima
setup-lima:
	@echo "Checking for Lima..."
	@which limactl >/dev/null || (echo "Installing Lima..." && brew install lima)
	@echo "Checking if instance $(LIMA_INSTANCE) exists..."
	@limactl list | grep -q $(LIMA_INSTANCE) || (echo "Creating instance $(LIMA_INSTANCE)..." && limactl start --tty=false --name=$(LIMA_INSTANCE) $(LIMA_CONFIG))
	@echo "Lima instance $(LIMA_INSTANCE) is ready."
.PHONY: setup-lima
setup-lima:
	@echo "Checking for Lima..."
	@which limactl >/dev/null || { \
		echo "Installing Lima..."; \
		if [ "$$(uname)" = "Darwin" ]; then \
			brew install lima; \
		elif [ "$$(uname)" = "Linux" ]; then \
			echo "Please install Lima: https://github.com/lima-vm/lima#installation"; \
			exit 1; \
		fi; \
	}
	@echo "Checking if instance $(LIMA_INSTANCE) exists..."
	@limactl list | grep -q $(LIMA_INSTANCE) || (echo "Creating instance $(LIMA_INSTANCE)..." && limactl start --tty=false --name=$(LIMA_INSTANCE) $(LIMA_CONFIG))
	@echo "Lima instance $(LIMA_INSTANCE) is ready."
🤖 Prompt for AI Agents
In @Makefile around lines 703 - 709, The Makefile target setup-lima currently
hardcodes macOS-only installation via "brew install lima"; modify the
installation step to detect the OS and choose the appropriate installer for
limactl: if uname indicates Darwin use brew, if Linux detect distro
(apt/dnf/yum) and run the distro-appropriate package install or fallback to the
official install script, and if Windows/WSL print an error or instructions;
update the block around the existing which limactl check in the setup-lima
target so the install command is conditional and still preserves the existing
instance start logic using LIMA_INSTANCE and LIMA_CONFIG.
.PHONY: verify-e2e-gke-lifecycle
verify-e2e-gke-lifecycle: provision-gke ## Full GKE E2E (Provision -> Verify -> Deprovision)
	@echo "Detected User: $$(whoami)"
	@WHO="$$(whoami | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-' | sed 's/^-//; s/-$$//')" && \
	CLUSTER_NAME="gke-zxporter-e2e-$${WHO}" && \
	echo "Updating kubeconfig for GKE cluster $${CLUSTER_NAME}..." && \
	gcloud container clusters get-credentials $${CLUSTER_NAME} --zone us-central1-a && \
	CONTEXT=$$(kubectl config current-context) && \
	echo "Using Context: $${CONTEXT}" && \
	$(MAKE) verify-e2e CLUSTER_CONTEXT="$${CONTEXT}" NAMESPACE=devzero-zxporter
	$(MAKE) deprovision-gke
GKE resources may be orphaned if E2E verification fails.
Lines 817-818 are separate Make invocations. If verify-e2e fails on line 817, the Makefile will abort before executing deprovision-gke on line 818, potentially leaving GKE clusters running and incurring costs.
This is the same issue that was fixed for EKS (lines 762-765) and AKS (lines 790-793), but the fix was not applied to GKE.
🔧 Apply the same error handling pattern used for EKS and AKS
.PHONY: verify-e2e-gke-lifecycle
verify-e2e-gke-lifecycle: provision-gke ## Full GKE E2E (Provision -> Verify -> Deprovision)
@echo "Detected User: $$(whoami)"
@WHO="$$(whoami | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-' | sed 's/^-//; s/-$$//')" && \
CLUSTER_NAME="gke-zxporter-e2e-$${WHO}" && \
echo "Updating kubeconfig for GKE cluster $${CLUSTER_NAME}..." && \
gcloud container clusters get-credentials $${CLUSTER_NAME} --zone us-central1-a && \
CONTEXT=$$(kubectl config current-context) && \
echo "Using Context: $${CONTEXT}" && \
- $(MAKE) verify-e2e CLUSTER_CONTEXT="$${CONTEXT}" NAMESPACE=devzero-zxporter
- $(MAKE) deprovision-gke
+ VERIFY_EXIT=0; \
+ $(MAKE) verify-e2e CLUSTER_CONTEXT="$${CONTEXT}" NAMESPACE=devzero-zxporter || VERIFY_EXIT=$$?; \
+ $(MAKE) deprovision-gke; \
+ exit $$VERIFY_EXIT
This ensures that:
- The verification exit code is captured
- deprovision-gke always runs
- The original verification exit code is returned
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.github/workflows/push-zxporter-netmon-image.yml:
- Line 137: Update the inline comment that currently reads "Definately have to
fail if any non-latest tags exist" to correct the typo to "Definitely have to
fail if any non-latest tags exist"; locate and edit the comment in the
push-zxporter-netmon-image workflow file where that line appears.
- Around line 236-241: The workflow step "Checkout repo" is using
actions/checkout@v3 while another step already uses @v4; update the checkout
action reference in that step from actions/checkout@v3 to actions/checkout@v4 to
ensure consistency and runner compatibility—locate the step with name "Checkout
repo" and replace the version token in the uses field accordingly, keeping the
existing with: repository, path and token parameters unchanged.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .github/workflows/push-zxporter-netmon-image.yml
- internal/networkmonitor/conntrack_client.go
- internal/networkmonitor/conntrack_client_test.go
- internal/networkmonitor/monitor.go
- internal/networkmonitor/monitor_test.go
- internal/networkmonitor/pod_cache_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/networkmonitor/pod_cache_test.go
🧰 Additional context used
🧬 Code graph analysis (3)
internal/networkmonitor/monitor_test.go (1)
internal/networkmonitor/conntrack_client.go (1)
Entry(15-24)
internal/networkmonitor/conntrack_client_test.go (1)
internal/networkmonitor/conntrack_client.go (1)
Entry(15-24)
internal/networkmonitor/monitor.go (7)
internal/networkmonitor/dns/ip2dns.go (2)
DNSLookup(17-22)
DNSCollector(24-29)
internal/networkmonitor/ebpf/tracer.go (1)
Config(5-8)
internal/logger/logger.go (1)
Logger(26-29)
internal/networkmonitor/conntrack_client.go (3)
Client(36-39)
Entry(15-24)
FilterBySrcIP(28-33)
internal/networkmonitor/pod_cache.go (1)
PodCache(12-16)
internal/transport/interface.go (1)
DakrClient(15-31)
gen/api/v1/metrics_collector.pb.go (12)
Namespace(1469-1503)
Namespace(1518-1518)
Namespace(1533-1535)
DnsLookupItem(2194-2205)
DnsLookupItem(2220-2220)
DnsLookupItem(2235-2237)
NetworkTrafficItem(2067-2083)
NetworkTrafficItem(2098-2098)
NetworkTrafficItem(2113-2115)
SendNetworkTrafficMetricsRequest(2281-2292)
SendNetworkTrafficMetricsRequest(2307-2307)
SendNetworkTrafficMetricsRequest(2322-2324)
🪛 actionlint (1.7.10)
.github/workflows/push-zxporter-netmon-image.yml
237-237: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Analyze (go)
- GitHub Check: Build Docker Image
- GitHub Check: Build Docker Image
- GitHub Check: Run make test
🔇 Additional comments (5)
internal/networkmonitor/conntrack_client_test.go (1)
10-94: LGTM! Comprehensive test coverage.
The test suite thoroughly covers valid sessions, nil field handling, and filter behavior. The helper functions and test structure are well-organized.
internal/networkmonitor/monitor_test.go (1)
11-71: LGTM! Concurrency and IPv6 tests are well-implemented.
The error channel pattern correctly handles concurrent testing, and the IPv6 test validates proper address handling. Previous concurrency safety issues have been resolved.
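For readers unfamiliar with the idiom, the error-channel pattern has roughly the shape below; the worker body is a placeholder, not the actual monitor_test.go code.

package networkmonitor

import (
	"fmt"
	"sync"
	"testing"
)

// TestConcurrentSketch launches workers that report failures over a buffered
// channel; the main test goroutine drains the channel after wg.Wait(), so all
// errors are raised from the test goroutine itself.
func TestConcurrentSketch(t *testing.T) {
	const workers = 8
	errCh := make(chan error, workers) // buffered: a failing worker never blocks
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if id < 0 { // placeholder check; real workers exercise the code under test
				errCh <- fmt.Errorf("worker %d failed", id)
			}
		}(i)
	}
	wg.Wait()
	close(errCh)
	for err := range errCh {
		t.Error(err)
	}
}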
internal/networkmonitor/conntrack_client.go (1)
74-124: LGTM! Previous critical issues resolved.
The comprehensive nil checks for Origin.Dst (line 79) and the nil filter guard (line 119) correctly address previously identified risks. The session processing logic is robust.
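In outline, the guard style looks like the sketch below; the session type and field names are hypothetical stand-ins, not the netfilter client's real API.

package networkmonitor

import "inet.af/netaddr"

// rawSession mimics a conntrack record whose tuple pointers may be nil when
// the kernel returns a partial entry; flow is a simplified result type.
type rawSession struct {
	OriginSrc *netaddr.IPPort
	OriginDst *netaddr.IPPort
	Proto     uint8
}

type flow struct {
	Src, Dst netaddr.IPPort
	Proto    uint8
}

// toFlows drops incomplete records instead of dereferencing nil pointers.
func toFlows(sessions []rawSession) []flow {
	out := make([]flow, 0, len(sessions))
	for _, s := range sessions {
		if s.OriginSrc == nil || s.OriginDst == nil {
			continue // partial tuple: skip rather than panic
		}
		out = append(out, flow{Src: *s.OriginSrc, Dst: *s.OriginDst, Proto: s.Proto})
	}
	return out
}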
internal/networkmonitor/monitor.go (2)
319-356: LGTM! Critical hashing issues resolved.
The use of local maphash.Hash instances (lines 323, 342) eliminates data races, and As16() (lines 325, 331, 344, 347) correctly handles both IPv4 and IPv6 addresses. The globalHashSeed ensures consistent hashing across calls.
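A stripped-down sketch of that approach is shown below; the seed variable name and the decision to hash only the address are simplifications, not the exact key layout in monitor.go.

package networkmonitor

import (
	"hash/maphash"

	"inet.af/netaddr"
)

// hashSeed is created once so hashes stay comparable across calls.
var hashSeed = maphash.MakeSeed()

// ipKey uses a local maphash.Hash value, so concurrent callers never share
// mutable hash state, and feeds As16() so IPv4 and IPv6 addresses go through
// the same 16-byte representation.
func ipKey(ip netaddr.IP) uint64 {
	var h maphash.Hash
	h.SetSeed(hashSeed)
	b := ip.As16()
	h.Write(b[:])
	return h.Sum64()
}

Because each call builds its own maphash.Hash but shares the seed, the function is safe for concurrent use and still returns identical keys for identical inputs.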
104-142: LGTM! Clean event loop implementation.
The monitor loop correctly handles collection, cleanup, and flush intervals. The previous duplicate flushTicker.C case has been resolved, and DNS collector integration is properly handled.
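In outline, a loop driven by several intervals looks like the sketch below; the ticker names and the collect/flush stubs are placeholders rather than the monitor's actual configuration.

package networkmonitor

import (
	"context"
	"time"
)

func collectOnce() {} // placeholder: read conntrack and aggregate deltas
func flushOnce()   {} // placeholder: send the aggregated payload

// run multiplexes the collection and flush intervals with context
// cancellation, so shutdown stops both activities promptly.
func run(ctx context.Context, collectEvery, flushEvery time.Duration) {
	collectTicker := time.NewTicker(collectEvery)
	defer collectTicker.Stop()
	flushTicker := time.NewTicker(flushEvery)
	defer flushTicker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-collectTicker.C:
			collectOnce()
		case <-flushTicker.C:
			flushOnce()
		}
	}
}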
- name: Checkout repo
  uses: actions/checkout@v3
  with:
    repository: devzero-inc/zxporter
    path: zxporter-tmp
    token: ${{ steps.get_workflow_token.outputs.token }}
Update checkout action to v4.
Line 237 uses actions/checkout@v3, which is outdated. Line 29 correctly uses @v4. Update for consistency and to avoid runner compatibility issues.
🔄 Proposed fix
- name: Checkout repo
- uses: actions/checkout@v3
+ uses: actions/checkout@v4
with:
repository: devzero-inc/zxporter
  path: zxporter-tmp
Based on static analysis hints.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- name: Checkout repo
  uses: actions/checkout@v3
  with:
    repository: devzero-inc/zxporter
    path: zxporter-tmp
    token: ${{ steps.get_workflow_token.outputs.token }}
- name: Checkout repo
  uses: actions/checkout@v4
  with:
    repository: devzero-inc/zxporter
    path: zxporter-tmp
    token: ${{ steps.get_workflow_token.outputs.token }}
🧰 Tools
🪛 actionlint (1.7.10)
237-237: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🤖 Prompt for AI Agents
In @.github/workflows/push-zxporter-netmon-image.yml around lines 236 - 241, The
workflow step "Checkout repo" is using actions/checkout@v3 while another step
already uses @v4; update the checkout action reference in that step from
actions/checkout@v3 to actions/checkout@v4 to ensure consistency and runner
compatibility—locate the step with name "Checkout repo" and replace the version
token in the uses field accordingly, keeping the existing with: repository, path
and token parameters unchanged.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/push-zxporter-netmon-image.yml:
- Around line 21-26: The package-and-push job currently sets FULLV only as an
environment variable so downstream jobs can't read it; modify the
package-and-push job to expose FULLV as a job output (add FULLV under outputs)
and update the existing version step (the step named "version") to emit FULLV
via the step outputs, then change the update-helm-values-image-tag-pr job to
consume it as needs.package-and-push.outputs.fullv (replace all uses of
env.FULLV in that job with that needs expression).
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/push-zxporter-netmon-image.yml
- internal/networkmonitor/pod_cache.go
🧰 Additional context used
🪛 actionlint (1.7.10)
.github/workflows/push-zxporter-netmon-image.yml
237-237: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build Docker Image
- GitHub Check: Build Docker Image
- GitHub Check: Run make test
- GitHub Check: Analyze (go)
🔇 Additional comments (7)
internal/networkmonitor/pod_cache.go (4)
27-52: LGTM! Previous type safety and IP change issues resolved.
The event handlers now properly use safe type assertions and handle IP changes correctly (a minimal sketch of the handler shape follows the list below):
- AddFunc and UpdateFunc check type assertions with the comma-ok idiom
- UpdateFunc removes stale IP mappings when a pod's IP changes
- Logic correctly handles edge cases (empty IPs, parse failures)
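A minimal sketch of that comma-ok handler shape, with an unsynchronized map standing in for the real cache internals:

package networkmonitor

import corev1 "k8s.io/api/core/v1"

// onUpdate sketches an informer UpdateFunc: assert types with the comma-ok
// idiom and drop the stale IP mapping when a pod's IP changes. The real cache
// also takes a lock; that is omitted here for brevity.
func onUpdate(byIP map[string]string, oldObj, newObj interface{}) {
	oldPod, ok := oldObj.(*corev1.Pod)
	if !ok {
		return // unexpected type: ignore rather than panic
	}
	newPod, ok := newObj.(*corev1.Pod)
	if !ok {
		return
	}
	if oldPod.Status.PodIP != "" && oldPod.Status.PodIP != newPod.Status.PodIP {
		delete(byIP, oldPod.Status.PodIP) // IP changed: remove the stale entry
	}
	if newPod.Status.PodIP != "" {
		byIP[newPod.Status.PodIP] = newPod.Namespace + "/" + newPod.Name
	}
}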
63-76: LGTM! Clean implementation with proper validation.
The method correctly validates the pod IP before parsing and updating the cache, with appropriate error handling and locking.
78-104: LGTM! Robust deletion logic with tombstone handling.
The method properly handles both direct pod deletions and tombstones (DeletedFinalStateUnknown), ensuring reliable cleanup even when the informer provides wrapped objects.
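For reference, a tombstone-aware DeleteFunc generally has the shape below; the cache map is again a simplified stand-in for the actual PodCache state.

package networkmonitor

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// onDelete sketches an informer DeleteFunc that unwraps the
// DeletedFinalStateUnknown tombstone handed back when the watch missed the
// object's final state.
func onDelete(byIP map[string]string, obj interface{}) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			return // neither a Pod nor a tombstone: nothing to clean up
		}
		pod, ok = tombstone.Obj.(*corev1.Pod)
		if !ok {
			return
		}
	}
	if pod.Status.PodIP != "" {
		delete(byIP, pod.Status.PodIP)
	}
}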
106-124: LGTM! Thread-safe getters with appropriate locking.
Both methods use read locks correctly. GetLocalPodIPs defensively returns a copy of the IP set, preventing external modification of internal state.
.github/workflows/push-zxporter-netmon-image.yml (3)
28-64: LGTM! The version extraction logic correctly handles both release tags (v*..) and non-release commits, properly excluding chart-* tags and setting appropriate fallback values.
182-220: LGTM! The multi-arch build correctly uses validated tags and pushes to both Docker Hub and ECR with appropriate build arguments and platform targets.
297-319: LGTM! The Slack notification logic correctly uses the always() condition for failure notifications and conditionally notifies on success only when a PR is created.
package-and-push:
  name: Package & Push Docker image
  runs-on: ubuntu-latest
  outputs:
    is_release: ${{ steps.version.outputs.is_release }}
Critical: Missing FULLV in job outputs breaks Helm update.
The second job update-helm-values-image-tag-pr uses ${{ env.FULLV }} (lines 245, 271, 280, 291-293), but FULLV is only set as an environment variable in the first job and not exposed as an output. GitHub Actions does not share environment variables across jobs, so env.FULLV will be empty in the second job, causing the Helm update to fail or use an incorrect tag.
🔧 Proposed fix
Add FULLV to the job outputs:
runs-on: ubuntu-latest
outputs:
is_release: ${{ steps.version.outputs.is_release }}
+ fullv: ${{ env.FULLV }}
Then update the version step to output FULLV:
echo "FULLV=${FULLV}" >> $GITHUB_ENV
echo "IS_RELEASE=true" >> $GITHUB_ENV
echo "Debug: MAJOR=${BASH_REMATCH[1]}, MINOR=${BASH_REMATCH[2]}, PATCH=${PATCH}, PATCH_SHORT=${PATCH_SHORT}"
echo "Debug: FULLV=${FULLV}"
echo "is_release=${IS_RELEASE}" >> $GITHUB_OUTPUT
+ echo "fullv=${FULLV}" >> $GITHUB_OUTPUT
And reference it in the second job as ${{ needs.package-and-push.outputs.fullv }} instead of ${{ env.FULLV }}.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In @.github/workflows/push-zxporter-netmon-image.yml around lines 21 - 26, The
package-and-push job currently sets FULLV only as an environment variable so
downstream jobs can't read it; modify the package-and-push job to expose FULLV
as a job output (add FULLV under outputs) and update the existing version step
(the step named "version") to emit FULLV via the step outputs, then change the
update-helm-values-image-tag-pr job to consume it as
needs.package-and-push.outputs.fullv (replace all uses of env.FULLV in that job
with that needs expression).
Summary by CodeRabbit
New Features
Infrastructure
CI / Delivery
Tests