diff --git a/.custom_wordlist.txt b/.custom_wordlist.txt
index 76ba7d9..5ae70c1 100644
--- a/.custom_wordlist.txt
+++ b/.custom_wordlist.txt
@@ -68,6 +68,7 @@ MicroCeph
 MicroCluster
 MicroOVN
 Multipass
+MySQL
 NATed
 Netplan
 Numa
@@ -97,11 +98,13 @@ PV
 PVCs
 QCOW
 QEMU
+Rados
 RabbitMQ
 RefStack
 ReplicaSet
 RBAC
 RPC
+RTO
 SAML
 SEV
 SHUTOFF
@@ -125,6 +128,7 @@ apache
 api
 args
 auth
+aws
 balancers
 balancer's
 backend
@@ -134,11 +138,13 @@ backported
 br
 ceph
 cirros
+cli
 clusterctl
 clusterd
 coe
 config
 configs
+cmd
 cron
 daemonset
 datapath
@@ -200,6 +206,7 @@ mypassword
 myproject
 myrouter
 myspace
+mysql
 mystack
 mysubnet
 myuser
@@ -260,7 +267,9 @@ unschedulable
 untagged
 unallocated
 url
+uri
 userspace
+velero
 vfio
 vlan
 vGPU
diff --git a/how-to/operations/backup-and-restore.rst b/how-to/operations/backup-and-restore.rst
new file mode 100644
index 0000000..15d8f18
--- /dev/null
+++ b/how-to/operations/backup-and-restore.rst
@@ -0,0 +1,377 @@
+Backup and Restore
+==================
+
+Overview
+--------
+
+Regular backups of the Sunbeam cluster are a critical component of any robust disaster recovery
+plan, ensuring the resilience and continuity of the Canonical OpenStack Cluster deployment. The
+procedures described below focus on backing up the essential control-plane elements:
+application data (MySQL, Vault), the Kubernetes control plane, the Juju controller state,
+and sunbeam-clusterd.
+
+Unexpected hardware failures, human error, or data corruption can severely compromise the
+control plane, leading to extended outages and potential data loss. By maintaining up-to-date
+backups, administrators can significantly reduce the recovery time objective (RTO) and restore
+the core management services necessary for operating the cloud infrastructure.
+
+
+s3-integrator
+-------------
+By default, the Sunbeam cluster utilizes ceph-rgw within MicroCeph, which provides S3-compatible
+object storage capabilities. This built-in functionality can be used to create the S3 buckets
+necessary for the backup procedures described here. While this is convenient for initial setup
+and testing, it is recommended that, for production environments, all critical backups be
+stored in an S3-compatible service located outside of the Canonical OpenStack Cluster deployment
+itself. Storing backups externally ensures resilience against catastrophic failures that could
+affect the entire cloud environment, including the internal Ceph cluster.
+
+For demonstration purposes, the backup procedures outlined in this document use the internal
+Ceph Rados Gateway (RGW) provided by the ceph-rgw charm.
+
+.. code-block:: text
+
+   juju switch openstack-machines
+   juju exec -u microceph/leader -- microceph.radosgw-admin user create --uid my-user --display-name my-user
+   {
+       "user_id": "my-user",
+       "display_name": "my-user",
+       "email": "",
+       "suspended": 0,
+       "max_buckets": 1000,
+       "subusers": [],
+       "keys": [
+           {
+               "user": "my-user",
+               "access_key": "",  # save this access key
+               "secret_key": "",  # save this secret key
+               "active": true,
+               "create_date": "2026-02-26T20:40:18.959341Z"
+           }
+       ],
+   }
+
+   # get the endpoint of the ceph-rgw service on the openstack model
+   juju switch openstack
+   juju run traefik-rgw/leader show-external-endpoints
+   Running operation 316 with 1 task
+     - task 317 on unit-traefik-rgw-1
+
+   Waiting for task 317...
+   external-endpoints: '{"traefik-rgw": {"url": "http://"}}'
+
+Install a tool such as ``aws-cli`` or ``s3cmd`` and configure it with the access key and secret
+key obtained from the previous command to interact with the S3 storage provided by ceph-rgw.
+
+.. code-block:: text
+
+   sudo snap install aws-cli --classic
+   aws configure --profile ceph  # fill in the requested information
+   aws --profile ceph --endpoint-url http:// s3api create-bucket --bucket mysql
+   ...
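+
+   # Optional sanity check (a sketch; it assumes the same profile and the
+   # same elided endpoint URL as above): list the buckets to confirm the
+   # creation succeeded.
+   aws --profile ceph --endpoint-url http:// s3api list-buckets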
+   # repeat the previous command to create a bucket for each application you want to back up
+
+Deploy one s3-integrator application for each application that needs S3 integration, e.g.:
+
+.. code-block:: text
+
+   juju switch openstack
+   juju deploy s3-integrator --model openstack mysql-s3-integrator
+   juju integrate mysql-s3-integrator mysql
+   ...
+   # deploy and integrate for all necessary apps
+
+
+Run the sync-s3-credentials action to configure the charm:
+
+.. code-block:: text
+
+   juju run mysql-s3-integrator/leader sync-s3-credentials access-key= secret-key=
+   ...
+   # do the same for all necessary apps
+
+Configure each s3-integrator application to use the correct bucket for its application:
+
+.. code-block:: text
+
+   juju config mysql-s3-integrator bucket=mysql s3-uri-style=path endpoint=http:// path=mysql
+   ...
+   # do the same for all necessary apps
+
+MySQL
+-----
+
+Requirements
+~~~~~~~~~~~~
+* A deployed MySQL K8s cluster
+* Access to S3 storage
+* Configured settings for S3 storage
+* Units in active/idle state
+* Control-plane units paused to avoid use of the cluster during the **restore** procedure
+
+Backup
+~~~~~~
+Run backups on a secondary MySQL unit to avoid impacting the performance of the primary unit.
+To identify the secondary units, run the following command:
+
+.. code-block:: text
+
+   juju run mysql/leader get-cluster-status
+   Running operation 196 with 1 task
+     - task 197 on unit-mysql-2
+
+   Waiting for task 197...
+   status:
+     clustername: cluster-1e57de179fb5edd8c4e6392a25473b96
+     clusterrole: primary
+     defaultreplicaset:
+       name: default
+       primary: mysql-2.mysql-endpoints.openstack.svc.cluster.local.:3306
+       ssl: required
+       status: ok
+       statustext: cluster is online and can tolerate up to one failure.
+       topology:
+         mysql-0:
+           address: mysql-0.mysql-endpoints.openstack.svc.cluster.local.:3306
+           memberrole: secondary
+           mode: r/o
+           replicationlagfromimmediatesource: ""
+           replicationlagfromoriginalsource: ""
+           role: ha
+           status: online
+           version: 8.0.41
+         mysql-1:
+           address: mysql-1.mysql-endpoints.openstack.svc.cluster.local.:3306
+           memberrole: secondary
+           mode: r/o
+           replicationlagfromimmediatesource: ""
+           replicationlagfromoriginalsource: ""
+           role: ha
+           status: online
+           version: 8.0.41
+         mysql-2:
+           address: mysql-2.mysql-endpoints.openstack.svc.cluster.local.:3306
+           memberrole: primary
+           mode: r/w
+           role: ha
+           status: online
+           version: 8.0.41
+       topologymode: single-primary
+     domainname: cluster-set-1e57de179fb5edd8c4e6392a25473b96
+     groupinformationsourcemember: mysql-2.mysql-endpoints.openstack.svc.cluster.local.:3306
+   success: "True"
+
+In this example, mysql/0 and mysql/1 are secondaries and mysql/2 is the primary, so the backup
+should be run on unit 0 or 1:
+
+.. code-block:: text
+
+   juju run mysql/0 create-backup --wait 1m
+
+Restore
+~~~~~~~
+Before running the restore-backup action, stop all control-plane services that might be using
+the database. This avoids data corruption or inconsistencies during the restore process.
+
+At the moment there is no charm action to stop all control-plane services at once, so this must
+be done manually on all OpenStack API services:
+
+.. code-block:: bash
+
+   # get the container names of all OpenStack API services
+   kubectl get pods -n openstack -o json | jq -r '
+     .items[]
+     | select(
+         (.metadata.name | test("traefik|rabbitmq|mysql|modeloperator|ovn") | not)
+       )
+     | .metadata.name as $pod
+     | .spec.containers[]
+     | select(.name != "charm")
+     | "\($pod) => \(.name)"
+   '
+   ...
+
+   # get the pebble service names for all OpenStack API services
+   for i in {0..2}; do kubectl -n openstack exec keystone-$i -c keystone -- pebble services; done
+   # do the same for all necessary apps
+
+   # stop the containers of all OpenStack API services
+   for i in {0..2}; do kubectl -n openstack exec keystone-$i -c keystone -- pebble stop wsgi-keystone; done
+   # do the same for all necessary apps
+
+With all API services stopped, it is possible to run the restore-backup action on a MySQL unit.
+Before that, scale the MySQL cluster down to one replica to ensure data consistency during the
+restore process. See the `charmed MySQL documentation`_ for more details.
+
+.. code-block:: text
+
+   juju scale-application mysql 1
+
+Then run the restore-backup action on the unit where you want to restore the backup, e.g.:
+
+.. code-block:: text
+
+   juju run mysql/leader restore-backup backup-id=
+
+After restoring all databases, resume the OpenStack services and scale the MySQL units back up:
+
+.. code-block:: text
+
+   # start the containers of all OpenStack API services
+   for i in {0..2}; do kubectl -n openstack exec keystone-$i -c keystone -- pebble start wsgi-keystone; done
+   # do the same for all necessary apps
+
+   juju scale-application mysql 3
+
+If any mysql-router applications end up in a blocked state, relaunch them by scaling them down
+and back up:
+
+.. code-block:: text
+
+   juju scale-application keystone-mysql-router 0
+   juju scale-application keystone-mysql-router 3
+
+After the restoration, the MySQL application will be in a blocked state with the message
+"Move restored cluster to another S3 repository". To unblock it, create a new S3 bucket and
+configure the ``mysql-s3-integrator`` charm to use it:
+
+.. code-block:: text
+
+   juju config mysql-s3-integrator bucket=
+
+Vault
+-----
+
+Requirements
+~~~~~~~~~~~~
+* A Vault cluster enabled in Sunbeam
+* Units in active/idle state
+* Configured settings for S3 storage
+* Unseal keys and root token saved in a secure location of your choice
+
+Backup / Restore
+~~~~~~~~~~~~~~~~
+.. code-block:: text
+
+   # create a backup
+   juju run vault/leader create-backup
+
+   # list the available backups
+   juju run vault/leader list-backups
+
+   # restore a backup
+   juju run vault/leader restore-backup backup-id=
+
+K8s control plane backup
+------------------------
+
+Requirements
+~~~~~~~~~~~~
+* A deployed `velero-operator`_
+* A deployed `infra-backup-operator`_
+* Access to S3 storage
+* A configured s3-integrator
+
+Backup
+~~~~~~
+.. code-block:: text
+
+   # back up cluster-scoped resources
+   juju run velero-operator/0 create-backup \
+       target=infra-backup-operator:cluster-infra-backup
+
+   # back up namespaced resources
+   juju run velero-operator/0 create-backup \
+       target=infra-backup-operator:namespaced-infra-backup
+
+Restore
+~~~~~~~
+.. code-block:: text
+
+   # list the backups
+
+   juju run velero-operator/0 list-backups
+
+   backups:
+     83503892-a24a-409b-b0df-553dcc2465ec:
+       app: infra-backup-operator
+       completion-timestamp: "2025-08-08T20:00:28Z"
+       endpoint: cluster-infra-backup
+       model: test-charm-9f0e8dda
+       name: infra-backup-operator-cluster-infra-backup-pblz2
+       phase: Completed
+       start-timestamp: "2025-08-08T20:00:26Z"
+     85662948-8e5e-4922-8e1c-c5568eafa6e7:
+       app: infra-backup-operator
+       completion-timestamp: "2025-08-07T18:42:13Z"
+       endpoint: cluster-infra-backup
+       model: test-charm-9f0e8dda
+       name: infra-backup-operator-cluster-infra-backup-4bm7p
+       phase: Completed
+       start-timestamp: "2025-08-07T18:42:10Z"
+
+   # restore the backups
+
+   juju run velero-operator/0 restore backup-uid=85662948-8e5e-4922-8e1c-c5568eafa6e7
+
+   juju run velero-operator/0 restore backup-uid=83503892-a24a-409b-b0df-553dcc2465ec
+
+Juju
+----
+
+Backup
+~~~~~~
+.. code-block:: text
+
+   # export all models
+   juju export-bundle --model=cos --filename=cos-bundle.yaml
+   juju export-bundle --model=openstack --filename=openstack-bundle.yaml
+   ...
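+
+   # Alternatively, loop over every model on the controller (a sketch; it
+   # assumes jq is installed and that "juju models --format=json" exposes a
+   # "short-name" field for each model):
+   for m in $(juju models --format=json | jq -r '.models[]["short-name"]'); do
+       juju export-bundle --model=$m --filename=${m}-bundle.yaml
+   done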
+
+   # back up the controller
+   juju create-backup --model=${CONTROLLERS_MODEL} --filename=juju-ctrl-backup.tar.gz
+
+   # local client configuration
+   tar -czf juju-credentials.tar.gz ~/.local/share/juju/*
+
+Restore
+~~~~~~~
+To restore a controller backup, use the `juju-restore`_ tool.
+
+
+MAAS deployment access
+----------------------
+
+See the :doc:`Backup and Restore MAAS Deployment` for details.
+
+Sunbeam-clusterd
+----------------
+
+Backup
+~~~~~~
+It is recommended to create a backup of the sunbeam-clusterd data by running the following
+command:
+
+.. code-block:: text
+
+   juju exec -a sunbeam-clusterd -- tar -cvf /home/ubuntu/backup.tar /var/snap/openstack/common/state/database
+
+Note that the backup file is created in the home directory of the ubuntu user, so move it to a
+safe location after it is created.
+
+Restore
+~~~~~~~
+If a unit has a corrupted database, restore the backup by running the following commands:
+
+.. code-block:: text
+
+   # stop the clusterd service before restoring the backup
+   juju exec -a sunbeam-clusterd -- sudo systemctl stop snap.openstack.clusterd.service
+
+   # remove snapshot and segment database files from the corrupted unit
+   juju exec -u sunbeam-clusterd/{unit} -- rm /var/snap/openstack/common/state/database/snapshot*
+   juju exec -u sunbeam-clusterd/{unit} -- rm /var/snap/openstack/common/state/database/000000*
+
+   # restore the backup on the corrupted unit
+   juju exec -u sunbeam-clusterd/{unit} -- tar -xvf /home/ubuntu/backup.tar -C /
+
+   # start the clusterd service after restoring the backup
+   juju exec -a sunbeam-clusterd -- sudo systemctl start snap.openstack.clusterd.service
+
+.. LINKS
+.. _velero-operator: https://charmhub.io/velero-operator
+.. _infra-backup-operator: https://charmhub.io/infra-backup-operator/docs/tutorial
+.. _juju-restore: https://github.com/juju/juju-restore/
+.. _charmed MySQL documentation: https://canonical-charmed-mysql.readthedocs-hosted.com/8.0/how-to/back-up-and-restore/restore-a-backup/
diff --git a/how-to/operations/index.rst b/how-to/operations/index.rst
index debccd0..5eaa8a8 100644
--- a/how-to/operations/index.rst
+++ b/how-to/operations/index.rst
@@ -13,3 +13,4 @@ Operations
    removing-the-primary-node
    scaling-the-cluster-in
    scaling-the-cluster-out
+   backup-and-restore