feat: Custom backup and restoration #794
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -707,6 +707,61 @@ The `backup` procedure executes the following sequence of tasks: | |
| * make_descriptor | ||
| * pack | ||
|
|
||
| ### Periodic ETCD Backups | ||
|
|
||
| It's possible to set up periodic ETCD backups via a CronJob. The procedure config for that case is the following: | ||
|
|
||
| ```yaml | ||
| backup_location: '/tmp/tmp_folder' | ||
| backup_plan: | ||
| etcd: | ||
| cron_job: | ||
| enabled: true | ||
| storage_class: "local-path" | ||
| storage_name: "etcd-backup" | ||
| storage_size: "50Gi" | ||
| etcdctl_image: ghcr.io/netcracker/etcdctl:0.0.1 | ||
| busybox_image: busybox:1.37.0 | ||
| schedule: "*/5 * * * *" | ||
| storage_depth: 5 | ||
| ``` | ||
|
|
||
| * `enabled` is a switch to create or delete the CronJob | ||
| * `storage_class` is the StorageClass used to create a PersistentVolume for backups | ||
| * `storage_name` is the PersistentVolumeClaim name | ||
| * `storage_size` is the PersistentVolume size | ||
| * `etcdctl_image` is a Docker image with etcdctl and additional utilities on board | ||
| * `busybox_image` is a Docker image with a Linux shell | ||
| * `schedule` is a schedule in crontab notation | ||
| * `storage_depth` is the snapshot retention time in hours | ||
|
|
||
| **Warning**: Do not use a StorageClass with `reclaimPolicy: Delete` if you want to keep snapshots after disabling periodic backups. | ||
|
|
||
| After enabling, the CronJob is created in the `kube-system` Namespace: | ||
|
|
||
| ```shell | ||
| $ kubectl -n kube-system get cronjob | ||
| NAME SCHEDULE TIMEZONE SUSPEND ACTIVE LAST SCHEDULE AGE | ||
| etcd-backup */5 * * * * <none> False 0 <none> 35s | ||
| ``` | ||
|
|
||
| That CronJob runs two scripts periodically. The first one creates an ETCD snapshot with a name like `etcd-snapshot-20260311_114008_15743.db` on the PersistentVolume. The second one deletes snapshots older than `storage_depth` hours. | ||
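The two periodic steps can be sketched roughly as follows. This is an assumed illustration, not the actual scripts embedded in the CronJob; the mount point `/var/lib/etcd-backup` and the kubeadm-default certificate paths are hypothetical:

```shell
# Sketch of the two periodic backup steps (assumed; not the real CronJob scripts).
backup_dir="${BACKUP_DIR:-/var/lib/etcd-backup}"   # hypothetical PVC mount point
storage_depth="${STORAGE_DEPTH:-5}"                # retention in hours

take_snapshot() {
    # Requires a reachable etcd; certificate paths are the usual kubeadm defaults.
    ETCDCTL_API=3 etcdctl snapshot save \
        "${backup_dir}/etcd-snapshot-$(date +%Y%m%d_%H%M%S).db" \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key
}

prune_snapshots() {
    # Delete snapshots older than storage_depth hours.
    find "${backup_dir}" -name 'etcd-snapshot-*.db' \
        -mmin +$((storage_depth * 60)) -delete
}
```

With a 5-hour `storage_depth` and a `*/5 * * * *` schedule, roughly 60 snapshots would be kept at any time.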
|
|
||
| To disable the existing CronJob, the procedure config is the following: | ||
|
|
||
| ```yaml | ||
| backup_location: '/tmp/tmp_folder' | ||
| backup_plan: | ||
| etcd: | ||
| cron_job: | ||
| enabled: false | ||
| ``` | ||
|
|
||
| The procedure runs only the following tasks (the other tasks are skipped by default): | ||
|
|
||
| * verify_backup_location | ||
| * export | ||
| * etcd | ||
|
|
||
| ## Restore Procedure | ||
|
|
||
|
|
@@ -789,6 +844,22 @@ The `restore` procedure executes the following sequence of tasks: | |
| * etcd | ||
| * reboot | ||
|
|
||
| ### Restore From Periodic Backup | ||
|
|
||
| To restore from an existing periodic backup, the procedure config should be like the following: | ||
|
|
||
| ```yaml | ||
| backup_location: /tmp/backups | ||
|
|
||
| restore_plan: | ||
| etcd: | ||
| image: registry.k8s.io/etcd:3.6.6-0 | ||
| snapshot: /opt/local-path-provisioner/pvc-e3b0d6c5-495d-4887-90d9-000d6b3d4d00_kube-system_etcd-backup/etcd-snapshot-20260220_103000.db | ||
|
Collaborator
What if some other provisioner is used, e.g. a network-attached one like NFS? In that case there will be no directory on the host by default, since the volume is not mounted. Do we expect users to mount the volume on their own and find the mount point? Maybe we could use an additional "backup download" job, run only during restore, which would mount the backup volume and copy the latest backup to some well-known node directory. Then kubemarine would take the backup from this well-known location on that particular host.
||
| ``` | ||
|
|
||
| **Notices**: | ||
| * The image must be chosen according to the ETCD version originally used to create the backup. | ||
| * The path to the snapshot may also point to a folder; in that case the latest snapshot is used. | ||
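The folder-vs-file handling can be sketched as below, mirroring the selection logic in `import_etcd`. This is a non-authoritative illustration; `resolve_snapshot` is a hypothetical name:

```shell
# Sketch (assumed): resolve the snapshot file when restore_plan.etcd.snapshot
# points either to a concrete file or to a folder of snapshots.
resolve_snapshot() {
    snap_path="$1"
    if [ "$(file -b "$snap_path")" = "directory" ]; then
        # A folder was given: pick the most recently modified snapshot.
        printf '%s/%s\n' "$snap_path" "$(ls -1tr "$snap_path" | tail -n 1)"
    else
        # A concrete file was given: use it as is.
        printf '%s\n' "$snap_path"
    fi
}
```

For example, `resolve_snapshot /opt/local-path-provisioner/<pv-dir>` (placeholder path) would print the newest `etcd-snapshot-*.db` inside that folder.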
|
|
||
| ## Add Node Procedure | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -118,6 +118,8 @@ def import_nodes_data(cluster: KubernetesCluster) -> None: | |
|
|
||
|
|
||
| def restore_dns_resolv_conf(cluster: KubernetesCluster) -> None: | ||
| if cluster.procedure_inventory.get('restore_plan', {}).get('etcd', {}).get('snapshot', {}): | ||
| return | ||
| import_nodes_data(cluster) | ||
|
|
||
| unpack_cmd = "sudo tar xzvf /tmp/kubemarine-backup.tar.gz -C / --overwrite /etc/resolv.conf" | ||
|
|
@@ -130,10 +132,14 @@ def restore_dns_resolv_conf(cluster: KubernetesCluster) -> None: | |
|
|
||
|
|
||
| def restore_thirdparties(cluster: KubernetesCluster) -> None: | ||
| if cluster.procedure_inventory.get('restore_plan', {}).get('etcd', {}).get('snapshot', {}): | ||
| return | ||
| install.system_prepare_thirdparties(cluster) | ||
|
|
||
|
|
||
| def import_nodes(cluster: KubernetesCluster) -> None: | ||
| if cluster.procedure_inventory.get('restore_plan', {}).get('etcd', {}).get('snapshot', {}): | ||
| return | ||
| if not cluster.is_task_completed('restore.dns.resolv_conf'): | ||
| import_nodes_data(cluster) | ||
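The same `restore_plan.etcd.snapshot` guard is repeated in `restore_dns_resolv_conf`, `restore_thirdparties`, and `import_nodes`; as a sketch, it could be factored into a small helper (the helper name is hypothetical, not part of the PR):

```python
def etcd_snapshot_requested(procedure_inventory: dict) -> bool:
    """Hypothetical helper: return True if the procedure config requests a
    restore from an explicit periodic ETCD snapshot (restore_plan.etcd.snapshot)."""
    return bool(
        procedure_inventory
        .get('restore_plan', {})
        .get('etcd', {})
        .get('snapshot')
    )
```

Each guarded function could then begin with `if etcd_snapshot_requested(cluster.procedure_inventory): return`, keeping the skip condition in one place.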
|
|
||
|
|
@@ -163,6 +169,25 @@ def import_etcd(cluster: KubernetesCluster) -> None: | |
| cluster.log.verbose('ETCD will be restored from the following image: ' + etcd_image) | ||
|
|
||
| cluster.log.debug('Uploading ETCD snapshot...') | ||
| # Custom path to ETCD snapshot | ||
| if cluster.procedure_inventory.get('restore_plan', {}).get('etcd', {}).get('snapshot', {}): | ||
| cluster.log.debug('The particular snapshot will be used') | ||
| path_to_snap = cluster.procedure_inventory.get('restore_plan', {}).get('etcd', {}).get('snapshot', {}) | ||
| first_control_plane = cluster.nodes['control-plane'].get_first_member() | ||
|
Collaborator
We probably should not assume that the backup will be present on the first master. E.g. the local path provisioner could create the volume on another node. Maybe using a "download backup" job (and checking on which node it ran) would be better.
||
| result = first_control_plane.sudo(f'file -b {path_to_snap}').get_simple_out().split('\n')[0] | ||
| if result == "directory": | ||
| # Getting the latest snapshot | ||
| last_snapshot = first_control_plane.sudo(f'ls -1tr {path_to_snap} | tail -n 1').get_simple_out().split('\n')[0] | ||
| snapshot = f'{path_to_snap}/{last_snapshot}' | ||
| elif result == "data": | ||
| # Getting the particular snapshot | ||
| snapshot = path_to_snap | ||
| else: | ||
| raise Exception("ETCD snapshot is incorrect or doesn't exist") | ||
| # Copying snapshot from first control-plane node to backup_location | ||
| cluster.log.debug('Copying snapshot from first control-plane node to the backup folder') | ||
| first_control_plane.get(snapshot, os.path.join(cluster.context['backup_tmpdir'], 'etcd.db')) | ||
|
|
||
| snap_name = '/var/lib/etcd/etcd-snapshot%s.db' % int(round(time.time() * 1000)) | ||
| cluster.nodes['control-plane'].put(os.path.join(cluster.context['backup_tmpdir'], 'etcd.db'), snap_name, | ||
| sudo=True, compare_hashes=True) | ||
|
|
||
So, to enable periodic backups, we need to run a separate `backup` procedure. The problem here is that the `backup` procedure's original goal (as I understand it) is to actually back up data, not to install a backup job that will perform backups in the future. It seems strange to use `backup` to perform an `install`. If we need clusters to be installed with etcd fsync disabled, and we recommend that the backup job is enabled when fsync is disabled, it would be safer to install the backup job right in the `install` procedure (where we disable fsync). The provisioner/StorageClass need not be present immediately; it could be installed later, and once installed it can provision the volume for the job.