@nammn
Collaborator

@nammn nammn commented Dec 17, 2025

Summary

A set of fixes for e2e test reliability issues on IBM Power (ppc64le) and Z (s390x), which run on shared Evergreen CI machines.

Problem Description

Why do we run into so many flakes when using podman, and why don't we see them with docker?

Docker has a daemon (dockerd) that manages everything centrally:

  • When containers exit or crash, the daemon cleans up
  • State is managed in one place
  • Orphaned processes get reaped by the daemon

Podman is daemonless:

  • Each podman command is independent
  • When a test crashes/fails, there's no daemon to clean up
  • conmon processes (container monitors) become orphaned (PPID=1)
  • Lock files, volumes, networks are left behind
  • Next test run hits stale state

Therefore, the safe way to solve this is to clean up podman state before and after every run, as thoroughly as we can.
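
As a rough illustration of that cleanup step, here is a minimal sketch. The function name `cleanup_podman_state` is hypothetical and not taken from this PR. It assumes the CI host is dedicated to these tests, because it removes all containers and prunes unused volumes and networks. Every step is best effort, so a failure in one step does not block the next.

```shell
# Hypothetical helper (not from the PR): best-effort removal of leftover podman state.
# Assumption: no other workload on this host needs its containers, volumes, or networks.
cleanup_podman_state() {
  # Remove every container, running or stopped
  podman rm --all --force || true
  # Remove volumes that are no longer used by any container
  podman volume prune --force || true
  # Remove networks that are no longer used by any container
  podman network prune --force || true
}
```

One way to use it is to call `cleanup_podman_state` at the start of the setup script and register it with `trap cleanup_podman_state EXIT`, so it also runs when the script fails.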

Problems Fixed

1. Container Runtime Issues

  • Process namespace join errors: Orphaned conmon processes from previous runs caused "process namespace join" failures. We now clean all of them up on startup. There is an open issue that should fix this: Link
  • Stale container state: Previous test runs left stale volumes, networks, and lock files

2. Podman/Registry IPv6 Issues

  • Registry connection refused: podman push localhost:5000 tried IPv6 (::1) first, which failed because the registry was only bound to IPv4
  • Registry cleanup during build: Registry could be cleaned up by other processes while building the custom kicbase image
  • We also skip starting minikube if it is already running
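
To illustrate the registry item above, here is a minimal sketch of publishing the registry on the IPv4 loopback only and polling its health endpoint via `127.0.0.1` instead of `localhost`. The function names, the container name `registry`, and the image `docker.io/library/registry:2` are assumptions made for the example, not taken from this PR.

```shell
# Hypothetical helper: start a local registry published only on 127.0.0.1:5000.
# Container name and image are assumptions for illustration.
start_ipv4_registry() {
  podman run -d --name registry -p 127.0.0.1:5000:5000 docker.io/library/registry:2
}

# Hypothetical helper: poll a health URL until it answers or the attempts run out.
# $1 = URL, $2 = maximum number of attempts (default 30, one second apart)
wait_for_registry() {
  local url="$1"
  local attempts="${2:-30}"
  local i=1
  while [ "${i}" -le "${attempts}" ]; do
    # -f makes curl return non-zero on HTTP errors, -s silences progress output
    if curl -sf -o /dev/null "${url}"; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

A caller could run `start_ipv4_registry && wait_for_registry "http://127.0.0.1:5000/v2/"` before pushing. Using the literal address means name resolution never gets the chance to pick `::1`.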

3. Python Environment Issues

  • boto3/requests missing: SKIP_INSTALL_REQUIREMENTS flag caused missing dependencies in some code paths
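
A minimal sketch of the "always install" approach follows. The function name `install_python_requirements` is hypothetical, and the import check at the end is an addition for illustration rather than something this PR is known to do. It fails fast if `boto3` or `requests` is still missing after the install.

```shell
# Hypothetical helper: install requirements unconditionally, then verify key imports.
# $1 = path to the requirements file (defaults to requirements.txt)
install_python_requirements() {
  local req_file="${1:-requirements.txt}"
  python3 -m pip install -r "${req_file}" || return 1
  # Fail early with a clear ImportError if the dependencies are still missing
  python3 -c "import boto3, requests"
}
```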

Proof of Work

  • master merge should pass
  • 2 consecutive manual runs passed:

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket requires DOCSP changes?
  • Have you added a changelog file?

@github-actions

github-actions bot commented Dec 17, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.6.2 Release Notes

Bug Fixes

  • Persistent Volume Claim resize fix: Fixed an issue where the Operator ignored namespaces when listing PVCs, causing conflicts with resizing PVCs of the same name. Now, PVCs are filtered by both name and namespace for accurate resizing.

@nammn nammn added the skip-changelog label ("Use this label in Pull Request to not require new changelog entry file") Dec 17, 2025
fi
export XDG_RUNTIME_DIR="${runtime_dir}"

# Clean up stale podman state (fixes "cannot re-exec process to join the existing user namespace")
Collaborator Author

this still happens, but once evg agents properly clean up podman containers we should be able to remove this: https://jira.mongodb.org/browse/DEVPROD-25447

@nammn nammn changed the title from "use rootless podman" to "CLOUDP-362015 - use rootless podman" Dec 18, 2025
@nammn nammn force-pushed the fix-ibm-docker-2 branch 2 times, most recently from 96d10bf to 43349d7 Compare December 19, 2025 11:59
@nammn nammn changed the title from "CLOUDP-362015 - use rootless podman" to "CLOUDP-362015 - Fix IBM Power/Z (ppc64le/s390x) e2e test reliability" Dec 19, 2025

crun_path=$(which crun)
echo "Using crun path: ${crun_path}"
# Kill orphaned root conmon processes (PPID=1 means orphaned)
Collaborator Author

all of this should be handled by https://jira.mongodb.org/browse/DEVPROD-25447
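
For readers without the full diff, the orphan check could look roughly like the sketch below. It follows the PR's heuristic that PPID=1 means orphaned. The function names are hypothetical, and the sketch does not distinguish monitors of containers that are still wanted, so it is only suitable before a fresh run on a CI host.

```shell
# Hypothetical helper: print the PIDs of conmon processes whose parent is PID 1.
# Reads "PID PPID COMM" rows on stdin, e.g. from: ps -eo pid=,ppid=,comm=
find_orphaned_conmon() {
  awk '$2 == 1 && $3 == "conmon" { print $1 }'
}

# Hypothetical helper: terminate every orphaned conmon process that was found.
# Killing processes owned by root would additionally need elevated privileges.
cleanup_orphaned_conmon() {
  local pid
  ps -eo pid=,ppid=,comm= | find_orphaned_conmon | while read -r pid; do
    kill "${pid}" 2>/dev/null || true
  done
}
```

Splitting detection from killing keeps the filter easy to check by hand: piping `ps -eo pid=,ppid=,comm=` into `find_orphaned_conmon` shows what would be killed without touching anything.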

}

# Start minikube with podman driver
# Start minikube with podman driver (rootful mode for reliable networking)
Collaborator Author

minikube requires rootful podman to change networking and iptables rules

@nammn nammn marked this pull request as ready for review December 19, 2025 13:47
@nammn nammn requested a review from a team as a code owner December 19, 2025 13:47
Comprehensive fixes for IBM Power and Z e2e test reliability on shared
Evergreen CI machines.

Problems fixed:
- Process namespace join errors from orphaned conmon processes
- Stale crun lock files and container state from previous runs
- Registry IPv6 connection refused (podman tries ::1 before 127.0.0.1)
- Registry cleanup during kicbase image build
- Rootless vs rootful podman mode conflicts
- Non-idempotent minikube setup
- Missing boto3/requests due to skipped requirements install

Key changes:
- Clean orphaned conmon processes and stale lock files before setup
- Bind registry to 127.0.0.1:5000 to avoid IPv6 issues
- Use 127.0.0.1 for all curl health checks
- Add registry restart before podman push
- Use rootful podman (--rootless=false) for reliable CNI networking
- Add idempotent checks (skip if minikube already healthy)
- Always install requirements.txt
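
A minimal sketch of the idempotent start described in the key changes, assuming that `minikube status` exits non-zero when the cluster is not healthy. The function name `ensure_minikube` is hypothetical, and the start flags are taken from the commit message above rather than from the actual script.

```shell
# Hypothetical helper: start minikube only when it is not already healthy.
ensure_minikube() {
  # minikube status returns a non-zero exit code if any component is not running
  if minikube status >/dev/null 2>&1; then
    echo "minikube already running, skipping start"
    return 0
  fi
  # Rootful podman driver, as described in the commit message
  minikube start --driver=podman --rootless=false
}
```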
@nammn nammn requested a review from MaciejKaras December 19, 2025 13:54
Collaborator

@MaciejKaras MaciejKaras left a comment

LGTM!
