Support Zone-Aware Topologies in the KubeVirt CSI Driver #124

Open
moadqassem wants to merge 11 commits into kubevirt:main from moadqassem:support-pv-node-affinities

Conversation

@moadqassem
Contributor

What this PR does / why we need it:
This pull request introduces support for zone-aware topology in the KubeVirt CSI driver, allowing it to handle storage provisioning and attachment in multi-zone Kubernetes clusters. It integrates with Kubernetes topology-aware volume provisioning to ensure volumes are created in the same zone/region as the requesting node by:

  • Implementing support for TopologyKeys to enable zone/region-specific volume scheduling and attachment
  • Modifying the Controller plugin to handle topology constraints during volume provisioning
  • Updating the Node plugin to include zone information in NodeGetInfo responses

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

Support zone and region-aware topologies in the KubeVirt CSI Driver

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jan 14, 2025
@kubevirt-bot

Hi @moadqassem. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@moadqassem moadqassem force-pushed the support-pv-node-affinities branch from 1de4f1d to c37d1ae on January 14, 2025 14:23
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jan 14, 2025
@awels
Member

awels commented Jan 14, 2025

/test all

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2025
@moadqassem moadqassem force-pushed the support-pv-node-affinities branch from c37d1ae to 8812d3f on March 30, 2025 16:34
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 30, 2025
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mhenriks for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@moadqassem
Contributor Author

moadqassem commented Mar 30, 2025

@awels I have fixed some test files that were failing and updated the PR to resolve a few conflicts. However, there were some other issues which I believe are not related to my PR:

I0114 23:43:26.714262   37496 create-pvc_test.go:608] Event: AttachVolume.Attach failed for volume "tenant-pv" : rpc error: code = Unknown desc = invalid volume name [FailedAttachVolume]

And

 failed to provision volume with StorageClass "kubevirt": rpc error: code = InvalidArgument desc = non-block volume with RWX access mode is not supported
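For context, the second failure comes from the driver rejecting shared (RWX) access for filesystem-mode volumes. The check behind a message like this can be sketched as follows; the types, constants, and `validateCapability` here are simplified stand-ins, not the driver's real CSI request types:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the CSI access-mode and volume-mode enums.
type accessMode int
type volumeMode int

const (
	readWriteOnce accessMode = iota
	readWriteMany
)

const (
	filesystemVolume volumeMode = iota
	blockVolume
)

var errRWXFilesystem = errors.New("non-block volume with RWX access mode is not supported")

// validateCapability mirrors the kind of check behind the error above:
// shared (RWX) access is only allowed for raw block volumes.
func validateCapability(am accessMode, vm volumeMode) error {
	if am == readWriteMany && vm != blockVolume {
		return errRWXFilesystem
	}
	return nil
}

func main() {
	// RWX filesystem is rejected; RWX block is allowed.
	fmt.Println(validateCapability(readWriteMany, filesystemVolume))
	fmt.Println(validateCapability(readWriteMany, blockVolume))
}
```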

I also took a quick look into the testing process. In order to add a k8s e2e test to make sure that the zone and region are respected, I would have to change a few things in the kubevirtci cluster provider (or add a new provider; I haven't looked deeply, so I'm not sure). This is where it needs to be adjusted to add the labels that point to the allowed topology: https://github.com/kubevirt/kubevirtci/blob/a291de27bb596074c79729ea6f88533555f523fd/cluster-up/cluster/ephemeral-provider-common.sh#L90

Let me know what you think.

@awels
Member

awels commented Apr 2, 2025

/test all

@awels
Member

awels commented Apr 2, 2025

Sorry, I've been very busy with other stuff; I will try to take a look at this soon.

@awels
Member

awels commented Apr 2, 2025

If you look at the testing, we do actually exclude a few tests from the k8s test suite. In particular the RWX filesystem one, since we don't support that.

@moadqassem
Contributor Author

If you look at the testing, we do actually exclude a few tests from the k8s test suite. In particular the RWX filesystem one, since we don't support that.

I see. Alright let me check this filtering criteria.

@awels
Member

awels commented Apr 2, 2025

https://github.com/kubevirt/csi-driver/blob/main/hack/run-k8s-e2e.sh#L117-L128 is what we use, I believe the last skip is what skips the RWX filesystem test.

@awels
Member

awels commented Apr 3, 2025

Okay, so I think I know what happened here: we made a change in how we identify which VM the volume is hotplugged into. Before, we used the VMI, but that proved to be problematic; now we look at the VM directly. I am not sure if that affects how you look up VMs, but it likely does.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2025
…t requirements

Signed-off-by: moadqassem <moad.qassem@gmail.com>
Signed-off-by: moadqassem <moad.qassem@gmail.com>
Signed-off-by: moadqassem <moad.qassem@gmail.com>
Signed-off-by: moadqassem <moad.qassem@gmail.com>
Signed-off-by: moadqassem <moad.qassem@gmail.com>
Signed-off-by: moadqassem <moad.qassem@gmail.com>
Signed-off-by: moadqassem <moad.qassem@gmail.com>
@moadqassem moadqassem force-pushed the support-pv-node-affinities branch from 1701709 to 989e1ea on May 13, 2025 10:29
@awels
Member

awels commented May 13, 2025

PR looks good. One thing I forgot to ask about: did you enable the topology tests in the k8s test suite? You can modify https://github.com/kubevirt/csi-driver/blob/main/hack/test-driver.yaml and set topology: true; that should be all that is needed to run the topology tests.
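As a sketch, the toggle in that file would look something like this. The field layout follows the kubernetes-csi external e2e test-driver format; the file's other fields are omitted here and its actual contents may differ:

```yaml
# hack/test-driver.yaml (sketch; other fields omitted)
DriverInfo:
  Name: csi.kubevirt.io
  Capabilities:
    topology: true
```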

Signed-off-by: moadqassem <moad.qassem@gmail.com>
@awels
Member

awels commented May 13, 2025

/test all

Signed-off-by: moadqassem <moad.qassem@gmail.com>
Signed-off-by: moadqassem <moad.qassem@gmail.com>
@awels
Member

awels commented May 14, 2025

/test all

@awels
Member

awels commented Jun 5, 2025

Looks like we are having an issue with the tests when we enable topology here. I still have not had a chance to look at why.

@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2025
@awels
Member

awels commented Sep 3, 2025

/remove-lifecycle stale

@kubevirt-bot kubevirt-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2025
@kubevirt-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 2, 2025
@awels
Member

awels commented Dec 2, 2025

/remove-lifecycle stale

@kubevirt-bot kubevirt-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 2, 2025
@awels
Member

awels commented Jan 2, 2026

/test all

@awels
Member

awels commented Jan 2, 2026

My sincere apologies that it took me this long to have time to really take a look at this. I was able to easily reproduce the test failures locally. It appears to be two issues:
1. The storage class in the tenant is not configured to use WFFC, so changing the tenant storage class to look something like this will make it correct:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kubevirt
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.kubevirt.io
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - az-1
  - key: topology.kubernetes.io/region
    values:
    - eu-central
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  infraStorageClassName: rook-ceph-block-wffc
  bus: scsi

Basically the two changes are: the infra storage class now points to a WFFC one, and volumeBindingMode is set to WaitForFirstConsumer.
2. It appears the labels on the nodes are not being set correctly in the test cluster. When I looked at them, they looked like this:

    topology.kubernetes.io/region: ""
    topology.kubernetes.io/zone: ""

This of course caused the nodes not to be considered during scheduling, and it failed. Once I manually corrected the labels and the storage class, I was able to properly create topology-aware volumes in the tenant cluster without issues.
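For reference, a sketch of what those node labels should contain, mirroring the allowedTopologies values in the storage class example above (the rest of the node manifest is omitted; the empty strings were the problem):

```yaml
metadata:
  labels:
    topology.kubernetes.io/region: eu-central
    topology.kubernetes.io/zone: az-1
```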

So I don't think there is any issue with the code itself (as you concluded as well); it is just a matter of properly configuring the test cluster during test setup.

@kubevirt-bot

@moadqassem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-csi-driver-e2e-k8s 4b2711d link true /test pull-csi-driver-e2e-k8s
pull-csi-driver-split-e2e-k8s 4b2711d link true /test pull-csi-driver-split-e2e-k8s
pull-csi-driver-split-k8s-suite-k8s 4b2711d link true /test pull-csi-driver-split-k8s-suite-k8s

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@awels
Member

awels commented Jan 2, 2026

I created a branch here https://github.com/awels/kubevirt-csi/tree/topology_support with the changes I think are needed to complete this PR. I will also update one of the lanes to use the WFFC binding mode environment variable. If you could take the last commit on that branch and move it into your PR I would appreciate it. If you can't I will make a PR out of that branch and give you credit.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 17, 2026
@kubevirt-bot

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@awels
Member

awels commented Jan 30, 2026

Okay I guess you no longer have time to look at this. I created another PR based on this one with the fixes to the build and tests so that it all passes.

@moadqassem
Contributor Author

Okay I guess you no longer have time to look at this. I created another PR based on this one with the fixes to the build and tests so that it all passes.

Oh really sorry, haven't seen your comment. No worries we can merge your PR as well 😉.

@awels
Member

awels commented Jan 30, 2026

No worries, it is mostly my fault; I have been really busy and never got around to figuring out what was causing the CI to fail. I just want to make sure you get the credit for the PR, that is all.

@moadqassem
Contributor Author

No worries, it is mostly my fault; I have been really busy and never got around to figuring out what was causing the CI to fail. I just want to make sure you get the credit for the PR, that is all.

I see. Totally appreciated. By the way, I have been looking into disk expansion and snapshotting lately. Are those features actually working?

@awels
Member

awels commented Jan 30, 2026

Yes, they should be working if the infra cluster storage class supports it.

@moadqassem
Contributor Author

Yes, they should be working if the infra cluster storage class supports it.

Hmm, ok. I did try it out, though, and the controller didn't acknowledge that an object was created. Anyway, it might be an issue in my configs then.

@awels
Member

awels commented Jan 30, 2026

Yeah, we have the tests enabled in the kubernetes csi test suite, and my company's test suite also says it is working fine. So I suspect a configuration issue on your end.

Labels

dco-signoff: yes Indicates the PR's author has DCO signed all their commits. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L
