# Disaster Recovery

Back up and restore K3s clusters with etcd snapshots.

## Overview
PodWarden automatically configures etcd snapshots on all clusters, providing a recovery path if control plane nodes are lost. Snapshots capture the full cluster state — all Kubernetes resources, secrets, and CRDs.
## Automated Etcd Snapshots
Every cluster takes etcd snapshots every 6 hours, retaining the last 5. Snapshots are stored on each control plane node at:

```
/var/lib/rancher/k3s/server/db/snapshots/
```

### Verify snapshots exist

SSH to any control plane node and run:

```shell
k3s etcd-snapshot list
```

### Take a manual snapshot

```shell
k3s etcd-snapshot save --name manual-backup
```

Manual snapshots are kept in addition to the automatic rotation and are not subject to the 5-snapshot retention limit.
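PodWarden manages the snapshot cadence for you; for reference, the 6-hour schedule and 5-snapshot retention correspond to K3s's built-in `etcd-snapshot-schedule-cron` and `etcd-snapshot-retention` options. An equivalent hand-written K3s config would look roughly like this (illustrative only, not something you need to apply):

```yaml
# /etc/rancher/k3s/config.yaml (illustrative; PodWarden applies equivalent settings)
etcd-snapshot-schedule-cron: "0 */6 * * *"   # take a snapshot every 6 hours
etcd-snapshot-retention: 5                   # keep the last 5 automatic snapshots
```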
## Recovery Scenarios

### Scenario 1: Single CP lost (HA cluster)
If you have three control planes (CPs) and one goes down, the cluster continues operating because etcd still has quorum. Workers reconnect to the surviving CPs automatically.
To replace the lost CP:
- Remove the dead CP from the cluster detail page
- Add a new host as a control plane — PodWarden provisions it and joins it to the existing etcd cluster
No snapshot restore is required.
### Scenario 2: Quorum lost (HA cluster)
If 2 of 3 CPs go down, etcd loses quorum and the cluster becomes read-only: existing pods keep running, but no new scheduling occurs and the API server rejects writes.
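Quorum here means a strict majority of etcd members, floor(n/2) + 1, which is why a 3-CP cluster tolerates exactly one lost member. A quick sketch of the arithmetic:

```shell
# Majority quorum for an n-member etcd cluster: floor(n/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # prints 2: a 3-member cluster tolerates 1 failure
quorum 5   # prints 3: a 5-member cluster tolerates 2 failures
```

This is also why even member counts add little: a 4-member cluster needs a quorum of 3, so it still tolerates only one failure.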
To recover:
- Restart at least one of the failed CPs — if the host is still accessible, a simple reboot is often enough to restore quorum
- If the hosts are permanently gone, run `k3s server --cluster-reset` on the one surviving CP to rebuild a single-node etcd cluster from its local data:
```shell
# On the surviving CP: resets etcd to a single-node cluster
systemctl stop k3s
k3s server --cluster-reset
systemctl start k3s
```

After `--cluster-reset` completes, the cluster is functional as a single CP. You can then add new control planes to restore HA.
### Scenario 3: All CPs lost
If all control planes are permanently destroyed:
- Locate a snapshot: check any recoverable CP host at `/var/lib/rancher/k3s/server/db/snapshots/`, or an off-site copy if you have one
- Provision a new host as a control plane via PodWarden
- Restore the snapshot on the new host:
```shell
systemctl stop k3s
k3s server --cluster-reset --cluster-reset-restore-path=/path/to/snapshot.db
systemctl start k3s
```

- Use Adopt Existing Cluster in PodWarden to re-register the recovered cluster
- Re-join workers to the recovered cluster from the cluster detail page
Worker pods that were running at snapshot time will be rescheduled automatically once the workers reconnect.
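When more than one snapshot survives in step 1, the newest file by modification time is usually the one you want to pass to `--cluster-reset-restore-path`. A small sketch (the `newest_snapshot` helper is hypothetical, not a k3s subcommand):

```shell
# Print the newest file (by modification time) in a directory.
# Hypothetical helper, not part of k3s.
newest_snapshot() {
  ls -1t "$1" | head -n 1
}

# Example, using the default K3s snapshot directory:
# newest_snapshot /var/lib/rancher/k3s/server/db/snapshots
```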
## Snapshot Storage Recommendations
The default on-disk snapshots are sufficient for most failures but are lost if the CP host's disk is destroyed. For stronger recovery guarantees:
- Copy snapshots to an NFS share or S3 bucket on a regular schedule
- Use K3s's built-in S3 snapshot upload: set the `--etcd-s3` flags in the K3s service config
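The scheduled-copy option above can be as simple as a small script run from cron; this sketch copies any snapshot not yet present at the destination (the `offload_snapshots` name and both example paths are hypothetical):

```shell
# Hypothetical offload helper: copy new snapshots from src to dst.
offload_snapshots() {
  src="$1"; dst="$2"
  mkdir -p "$dst"
  for snap in "$src"/*; do
    [ -f "$snap" ] || continue
    base=$(basename "$snap")
    # Skip snapshots already copied on a previous run
    [ -f "$dst/$base" ] || cp "$snap" "$dst/$base"
  done
}

# Example (run hourly from cron; paths hypothetical):
# offload_snapshots /var/lib/rancher/k3s/server/db/snapshots /mnt/nfs/k3s-snapshots
```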
```yaml
# Example K3s S3 snapshot config (/etc/rancher/k3s/config.yaml)
etcd-s3: true
etcd-s3-endpoint: s3.example.com
etcd-s3-access-key: <key>
etcd-s3-secret-key: <secret>
etcd-s3-bucket: k3s-snapshots
```

## What Snapshots Contain
Etcd snapshots capture everything stored in the Kubernetes control plane:
- All workload resources (Deployments, StatefulSets, DaemonSets, Services, Ingresses)
- Secrets and ConfigMaps
- PodWarden CRDs and custom resources (BackupPolicies, BackupRuns, etc.)
- RBAC configuration
Snapshots do not capture persistent volume data. After a full restore, volumes that existed at snapshot time will reappear as PVC objects, but their data depends on whether the underlying storage survived. Pair etcd snapshots with volume backups for full data protection.
## Troubleshooting
**Snapshot list is empty:** the cluster may have been recently provisioned. Take a manual snapshot to confirm etcd is healthy: `k3s etcd-snapshot save --name test`.

**`cluster-reset` fails to start:** check the K3s logs for etcd errors: `journalctl -u k3s -n 100`. Ensure the snapshot file is not corrupted and matches the K3s version that wrote it.

**Workers won't reconnect after restore:** the cluster CA and node tokens are preserved in the snapshot. If tokens were rotated after the snapshot was taken, workers may need to be re-joined from the cluster detail page.
## Related Docs
- High Availability Clusters — Setting up multi-control-plane clusters
- Backups — Volume and database backup policies
- Node Management — Cordon, drain, and node-level operations