@tchap
Contributor

@tchap tchap commented Sep 9, 2025

Currently, cert-syncer can replace some of the secret/configmap files successfully and then fail. When these are, e.g., TLS cert/key files, the directory is left in an inconsistent state. This may seem transient, but if cert-syncer is terminated mid-sync, it can later fail to start because the whole kube-apiserver gets into a crash loop.

This introduces a new staticpod.SwapDirectoriesAtomic, which uses unix.Renameat2 with RENAME_EXCHANGE flag set.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 9, 2025
@openshift-ci openshift-ci bot requested review from p0lyn0mial and tkashem September 9, 2025 15:44
@tchap tchap force-pushed the atomic-certsync branch 4 times, most recently from 8dec327 to 60f05a8 Compare September 10, 2025 12:45
@tchap tchap changed the title WIP: certsyncpod: Swap secret/cm directories atomically OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically Sep 10, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 10, 2025
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Sep 10, 2025
@openshift-ci openshift-ci bot requested a review from wangke19 September 10, 2025 12:56
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Currently it can happen that cert-syncer replaces some of the secret/configmap files successfully and then fails. This can lead to problems when these are e.g. TLS cert/key files and the directory gets inconsistent. This may seem transient, but when cert-syncer is terminated in the middle, it can later fail to start as the whole kube-apiserver gets into a crash loop.

This introduces a new staticpod.SwapDirectoriesAtomic, which uses unix.Renameat2 with RENAME_EXCHANGE flag set. This should not be a problem as this call is supported since Linux 3.15 on all modern file systems.

@tchap tchap changed the title OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically WIP: OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically Sep 10, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.


@tchap tchap changed the title WIP: OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically OCPBUGS-33013: certsyncpod: Swap secret/cm directories atomically Sep 10, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2025
@tchap
Contributor Author

tchap commented Sep 10, 2025

I actually have to make sure this can be merged, as renameat2 is only supported on Linux 3.15 or later.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2025
@tchap
Contributor Author

tchap commented Sep 10, 2025

Based on https://access.redhat.com/articles/3078, this patch should be OK for RHEL 8 or later (RHEL 8 ships a 4.18 kernel).

The latest CI for OCP 4.21 actually uses RHEL 9.6.

@tchap
Contributor Author

tchap commented Sep 11, 2025

The PR using this change in cluster-kube-apiserver-operator seems to be passing CI, so I deem this ready.

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2025
@p0lyn0mial
Contributor

@tchap is there a must-gather from an incident I could take a look at?

@tchap
Contributor Author

tchap commented Sep 15, 2025

@p0lyn0mial
Contributor

@vrutkovs do you have time to take a look at this issue?

I think the issue might be real. I think it occurs when a two-file cert (certificate plus key) is replaced: the server can pick up the update midway, notice the public/private key mismatch, and crash. Is there a way to repro this issue?
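To illustrate the mismatch described here, the following sketch (an assumption about how a server might detect it, not code from the PR) generates two self-signed certificate/key pairs and shows that loading a new certificate together with the old key fails in `tls.X509KeyPair`. The helpers `newPair`, `pairLoads`, and `halfSyncedLoads` are hypothetical names introduced for this example.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// newPair generates a self-signed certificate and its private key,
// both PEM-encoded. Hypothetical helper for the demonstration.
func newPair(cn string) ([]byte, []byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// pairLoads reports whether the certificate and key can be loaded together.
func pairLoads(certPEM, keyPEM []byte) bool {
	_, err := tls.X509KeyPair(certPEM, keyPEM)
	return err == nil
}

// halfSyncedLoads simulates a sync interrupted after the certificate file
// was rewritten but before the key file was. It returns whether the pair
// loads before the sync and during the interrupted state.
func halfSyncedLoads() (bool, bool, error) {
	oldCert, oldKey, err := newPair("old")
	if err != nil {
		return false, false, err
	}
	newCert, _, err := newPair("new")
	if err != nil {
		return false, false, err
	}
	return pairLoads(oldCert, oldKey), pairLoads(newCert, oldKey), nil
}

func main() {
	before, during, err := halfSyncedLoads()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("consistent pair loads:", before)
	fmt.Println("half-synced pair loads:", during)
}
```

A process that treats this load error as fatal at startup would keep failing until both files are consistent again, which matches the crash-loop scenario from the PR description.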

filePerms := os.FileMode(0600)
if strings.HasSuffix(fullFilename, ".sh") {
	filePerms = 0755
}
Contributor Author

Hmm, didn't notice this check for custom permission setting...
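For reference, the quoted check can be wrapped into a small runnable function. This is only a sketch: the function name `filePermsFor` and the sample file names are hypothetical, and how the real code carries this rule over to the new write path is not shown in this conversation.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// filePermsFor mirrors the quoted check: files are written as 0600,
// except shell scripts (".sh" suffix), which need to be executable.
func filePermsFor(fullFilename string) os.FileMode {
	filePerms := os.FileMode(0600)
	if strings.HasSuffix(fullFilename, ".sh") {
		filePerms = 0755
	}
	return filePerms
}

func main() {
	// Sample names are illustrative only.
	for _, name := range []string{"tls.key", "ca-bundle.crt", "recover.sh"} {
		fmt.Printf("%-14s %v\n", name, filePermsFor(name))
	}
}
```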

strings.HasSuffix(path, "/staging/cert-sync/secrets") ||
strings.HasSuffix(path, "/staging/cert-sync/configmaps") ||
path == filepath.Join(controller.destinationDir, "configmaps") ||
path == filepath.Join(controller.destinationDir, "secrets") {
Contributor Author

This is not too pretty, but meh.

Contributor

@p0lyn0mial p0lyn0mial left a comment

Added a few more comments. Overall LGTM.
Please also test this PR with an operator, e.g. kas-o.

@tchap tchap force-pushed the atomic-certsync branch 2 times, most recently from 439e1f3 to 46d0eac Compare October 23, 2025 09:34
Use atomicdir.Sync to write the target secret/configmap directories so
that they stay synchronized with the relevant objects.

Added unit tests, but the coverage is not complete. In particular,
failing filesystem operations are not tested.
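One way to exercise a failing filesystem operation is to inject the swap step. The sketch below is not the `atomicdir.Sync` implementation, whose signature is not shown here; `syncDir`, `swapFunc`, and `targetAfterFailedSwap` are hypothetical names. It assumes a flow of "write everything to a staging directory, then exchange it with the target" and checks that the target is untouched when the exchange fails.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// swapFunc exchanges two directories. In real use this would be an atomic
// exchange (e.g. based on RENAME_EXCHANGE); tests can inject a failing one.
type swapFunc func(a, b string) error

// syncDir writes files into stagingDir and then exchanges it with targetDir.
// The target is only ever modified by the swap call itself.
func syncDir(targetDir, stagingDir string, files map[string][]byte, swap swapFunc) error {
	if err := os.RemoveAll(stagingDir); err != nil {
		return err
	}
	if err := os.MkdirAll(stagingDir, 0o755); err != nil {
		return err
	}
	for name, data := range files {
		if err := os.WriteFile(filepath.Join(stagingDir, name), data, 0o600); err != nil {
			return err
		}
	}
	if err := swap(targetDir, stagingDir); err != nil {
		return fmt.Errorf("swapping %s and %s: %w", targetDir, stagingDir, err)
	}
	// After a successful swap the staging path holds the previous content.
	return os.RemoveAll(stagingDir)
}

// targetAfterFailedSwap runs syncDir with a swap that always fails and
// returns the content still present in the target directory.
func targetAfterFailedSwap() (string, error) {
	root, err := os.MkdirTemp("", "sync-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	target := filepath.Join(root, "secrets")
	staging := filepath.Join(root, "staging")
	if err := os.Mkdir(target, 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(target, "tls.crt"), []byte("old"), 0o600); err != nil {
		return "", err
	}

	failing := func(a, b string) error { return errors.New("injected swap failure") }
	files := map[string][]byte{"tls.crt": []byte("new")}
	if err := syncDir(target, staging, files, failing); err == nil {
		return "", errors.New("expected syncDir to fail")
	}

	data, err := os.ReadFile(filepath.Join(target, "tls.crt"))
	return string(data), err
}

func main() {
	content, err := targetAfterFailedSwap()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("target content after failed swap:", content)
}
```

The same injection point could be used for the write step, for example by passing in a file-writing function that fails on a chosen file.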
@p0lyn0mial
Contributor

lgtm

@p0lyn0mial
Contributor

/lgtm

@p0lyn0mial p0lyn0mial added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. labels Nov 4, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, tchap, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tchap
Contributor Author

tchap commented Nov 4, 2025

/unhold

let's merge this.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 4, 2025

@tchap: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit e9c2485 into openshift:master Nov 4, 2025
4 checks passed
@openshift-ci-robot

@tchap: Jira Issue OCPBUGS-33013: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-33013 has been moved to the MODIFIED state.


@tchap tchap deleted the atomic-certsync branch November 5, 2025 12:07
@openshift-merge-robot
Contributor

Fix included in accepted release 4.21.0-0.nightly-2025-11-13-042845

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

5 participants