Disaster Recovery

Back up and restore K3s clusters with etcd snapshots

Overview

PodWarden automatically configures etcd snapshots on all clusters, providing a recovery path if control plane nodes are lost. Snapshots capture the full cluster state — all Kubernetes resources, secrets, and CRDs.

Automated Etcd Snapshots

Every cluster takes etcd snapshots every 6 hours, retaining the last 5. Snapshots are stored on each control plane node at:

/var/lib/rancher/k3s/server/db/snapshots/

Verify snapshots exist

SSH to any control plane and run:

k3s etcd-snapshot list
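
If you'd rather check from a script than read `k3s etcd-snapshot list` output by eye, a minimal sketch is below. It assumes the default snapshot directory from this guide; override SNAP_DIR if your install differs.

```shell
#!/bin/sh
# Hedged sketch: warn if no etcd snapshot has been written in the last 6 hours
# (the scheduled interval). SNAP_DIR defaults to the standard K3s location.
SNAP_DIR="${SNAP_DIR:-/var/lib/rancher/k3s/server/db/snapshots}"
if find "$SNAP_DIR" -type f -mmin -360 2>/dev/null | grep -q .; then
  echo "recent snapshot found in $SNAP_DIR"
else
  echo "WARNING: no snapshot in the last 6 hours in $SNAP_DIR" >&2
fi
```

This is handy as a cron job or monitoring probe, since it exits quietly when snapshots are current.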

Take a manual snapshot

k3s etcd-snapshot save --name manual-backup

Manual snapshots are kept in addition to the automatic rotation and are not subject to the 5-snapshot retention limit.

Recovery Scenarios

Scenario 1: Single CP lost (HA cluster)

If you have 3 CPs and one goes down, the cluster continues operating — etcd still has quorum. Workers reconnect to surviving CPs automatically.

To replace the lost CP:

  1. Remove the dead CP from the cluster detail page
  2. Add a new host as a control plane — PodWarden provisions it and joins it to the existing etcd cluster

No snapshot restore is required.

Scenario 2: Quorum lost (HA cluster)

If 2 of 3 CPs go down, etcd loses quorum and stops accepting changes: existing pods keep running, but the API server rejects writes, so no new scheduling or configuration changes can occur.
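
The arithmetic behind this: etcd requires a strict majority of members, floor(n/2) + 1, so a 3-CP cluster tolerates one failure but not two. A quick illustration:

```shell
# etcd quorum is a strict majority: floor(n/2) + 1 voting members must be up.
for n in 1 3 5; do
  echo "$n control planes -> quorum $((n / 2 + 1)), tolerates $((n - n / 2 - 1)) failure(s)"
done
```

This is also why even-sized control planes add no fault tolerance: 4 members still have a quorum of 3 and tolerate only one failure, same as 3 members.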

To recover:

  1. Restart at least one of the failed CPs — if the host is still accessible, a simple reboot is often enough to restore quorum
  2. If the hosts are permanently gone, run k3s server --cluster-reset on the one surviving CP to reset etcd to a single-node cluster from its existing on-disk data (add --cluster-reset-restore-path=<snapshot> if you instead want to restore from a specific snapshot)
# On the surviving CP — this resets etcd to a single node
systemctl stop k3s
k3s server --cluster-reset
systemctl start k3s

After --cluster-reset, the cluster is functional as a single CP. You can then add new control planes to restore HA.

Scenario 3: All CPs lost

If all control planes are permanently destroyed:

  1. Locate a snapshot — check any recoverable CP host at /var/lib/rancher/k3s/server/db/snapshots/, or an off-site copy if you have one
  2. Provision a new host as a control plane via PodWarden
  3. Restore the snapshot on the new host:
systemctl stop k3s
k3s server --cluster-reset --cluster-reset-restore-path=/path/to/snapshot.db
systemctl start k3s
  4. Use Adopt Existing Cluster in PodWarden to re-register the recovered cluster
  5. Re-join workers to the recovered cluster from the cluster detail page

Worker pods that were running at snapshot time will be rescheduled automatically once the workers reconnect.

Snapshot Storage Recommendations

The default on-disk snapshots are sufficient for most failures but are lost if the CP host's disk is destroyed. For stronger recovery guarantees:

  • Copy snapshots to an NFS share or S3 bucket on a regular schedule
  • Use K3s's built-in S3 snapshot upload: set the etcd-s3 options in /etc/rancher/k3s/config.yaml (or the equivalent --etcd-s3 command-line flags)
# Example K3s S3 snapshot config (/etc/rancher/k3s/config.yaml)
etcd-s3: true
etcd-s3-endpoint: s3.example.com
etcd-s3-access-key: <key>
etcd-s3-secret-key: <secret>
etcd-s3-bucket: k3s-snapshots
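
For the NFS option above, a small copy job on each control plane is enough. The sketch below is a hedged example: the destination path is an assumption (any mounted share works), and already-copied files are skipped so repeated runs are cheap.

```shell
#!/bin/sh
# Hedged sketch: copy new etcd snapshots to an off-host directory, e.g. an NFS
# mount. Default paths are assumptions; pass SRC and DEST to override.
copy_snapshots() {
  src="$1"; dest="$2"
  mkdir -p "$dest"
  for f in "$src"/*; do
    [ -f "$f" ] || continue                      # glob matched nothing
    [ -e "$dest/$(basename "$f")" ] && continue  # already copied on a prior run
    cp "$f" "$dest/"
  done
}
copy_snapshots "${SRC:-/var/lib/rancher/k3s/server/db/snapshots}" \
               "${DEST:-/mnt/nfs/k3s-snapshots/$(hostname)}"
```

Run it from cron (e.g. hourly) so copies land off-host shortly after each scheduled snapshot; using a per-hostname subdirectory keeps snapshots from different CPs from colliding.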

What Snapshots Contain

Etcd snapshots capture everything stored in the Kubernetes control plane:

  • All workload resources (Deployments, StatefulSets, DaemonSets, Services, Ingresses)
  • Secrets and ConfigMaps
  • PodWarden CRDs and custom resources (BackupPolicies, BackupRuns, etc.)
  • RBAC configuration

Snapshots do not capture persistent volume data. After a full restore, volumes that existed at snapshot time will reappear as PVC objects, but their data depends on whether the underlying storage survived. Pair etcd snapshots with volume backups for full data protection.

Troubleshooting

Snapshot list is empty: The cluster may have been recently provisioned. Take a manual snapshot to confirm etcd is healthy: k3s etcd-snapshot save --name test.

K3s fails to start after cluster-reset: Check the K3s logs for etcd errors: journalctl -u k3s -n 100. Ensure the snapshot file is not corrupted and was taken by a compatible K3s version.

Workers won't reconnect after restore: The cluster CA and node tokens are preserved in the snapshot. If tokens were rotated after the snapshot date, workers may need to be re-joined from the cluster detail page.

Related Docs