SumGuy's Ramblings

Self-Hoster's Disaster Recovery: When Everything Goes Wrong at Once

The Backup That Was Never Actually a Backup

Imagine your home server dies. Drive failure, power surge, theft, flood — pick your disaster. You have backups. Except when you go to restore them, you discover some variation of the classics: the backup job silently stopped running months ago, the encryption password only existed on the dead server, or the restore takes far longer than you can afford.

Backup is not the same as recovery. Having files somewhere is not a disaster recovery strategy. A disaster recovery plan is the full picture: what you back up, how you restore it, in what order, and how you verify it actually works.

The Disaster Scenarios You Need to Plan For

Different disasters require different recovery strategies. Don’t let perfect be the enemy of good — plan for the likely ones:

Single drive failure: Most common. RAID or ZFS RAIDZ handles this automatically. The “disaster” is replacing the failed drive and waiting for the rebuild.

Accidental deletion: “rm -rf /data” moments. You need file-level backups with point-in-time recovery, or the recycle-bin feature on your NAS.

Application/OS corruption: Botched update, config error, failed migration. You need VM snapshots, system-level backups, and the ability to roll back.

Full hardware failure: Server won’t boot, motherboard dead. Need to restore to new hardware. Your backup must be hardware-independent.

Ransomware: Backups need to be air-gapped or immutable. If the ransomware can reach your backup destination, it will encrypt that too.

Site-level disaster (fire, flood, theft): Requires off-site backups. Your backup NAS in the same room as your server doesn’t help here.

RTO and RPO: The Two Numbers That Define Your DR Plan

These terms come from enterprise DR planning but are useful at any scale.

RPO — Recovery Point Objective: How much data loss is acceptable? If your RPO is 4 hours, your backup needs to run at least every 4 hours. If your RPO is 0 (no data loss acceptable), you need real-time replication.

RTO — Recovery Time Objective: How long can you be down before it matters? If your RTO is 1 day, you have 24 hours to restore. If your RTO is 1 hour, your restore process needs to be fast and well-practiced.
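An RPO is easy to state and easy to silently violate. A minimal sketch of a freshness check, assuming GNU find and a directory of backup artifacts (paths and thresholds are examples, not prescriptions):

```shell
# Hypothetical check: is the newest file in a backup directory younger
# than the RPO? Wire the STALE branch to your alerting of choice.
backup_age_seconds() {
  # print the age in seconds of the newest file under $1 (GNU find)
  newest=$(find "$1" -type f -printf '%T@\n' 2>/dev/null | sort -rn | head -n 1)
  if [ -n "$newest" ]; then
    echo $(( $(date +%s) - ${newest%.*} ))
  else
    echo 999999999   # no backups found at all
  fi
}

check_rpo() {  # usage: check_rpo <backup-dir> <rpo-seconds>
  if [ "$(backup_age_seconds "$1")" -le "$2" ]; then
    echo "OK: $1 is within RPO"
  else
    echo "STALE: $1 violates RPO"
  fi
}

check_rpo /opt/backups/db 14400   # example: a 4-hour RPO
```

Run it from cron or a systemd timer so an RPO violation pages you instead of surprising you mid-disaster.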

For a home lab, be honest with yourself:

| Service | My RPO | My RTO | Why |
|---|---|---|---|
| Personal media library | 7 days | 3 days | I can re-download, it’s just annoying |
| Home automation config | 1 day | 4 hours | I like my automations |
| Password manager | 1 hour | 1 hour | This is important |
| Family photos | 0 (no loss) | 2 days | Irreplaceable |
| Minecraft server | 1 day | 2 days | The kids will complain |

Setting explicit RPO/RTO forces prioritization. You can’t treat everything as equally important, and the constraints help you decide backup frequency and restore process complexity.

The 3-2-1 Backup Rule

The foundational backup strategy: 3 copies, on 2 different media types, with 1 copy offsite.

3 copies: The original plus two backups. One backup means if the backup and the original both fail (they’re often on the same system), you have nothing.

2 media types: Don’t put both backups on the same type of storage. External hard drives + cloud, or NAS + tape, or NAS + cloud. Different media fails differently.

1 offsite: One backup needs to be physically somewhere else. Cloud storage counts. A hard drive at a family member’s house counts. Your backup NAS in the same rack does not count.

Some add a “1 offline” modifier — 3-2-1-1: one copy offline or air-gapped, protecting against ransomware and network-reachable failures.
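With restic (covered later in this post), the second and third copies can be two repositories: one on locally attached USB media, one in the cloud, kept in sync with `restic copy`. A sketch, assuming both repositories already exist and their passwords are exported; the repo paths are examples:

```shell
# Hypothetical 3-2-1 layout: /opt/appdata is the original, a USB-attached
# repo is copy 2 (different media), a B2 repo is copy 3 (offsite).
LOCAL_REPO="/mnt/usb-backup/restic"
REMOTE_REPO="b2:my-backup-bucket:/homelab"

sync_321() {
  # skip quietly on machines without restic or without the USB repo mounted
  command -v restic >/dev/null 2>&1 || { echo "restic not installed; skipping"; return 0; }
  [ -d "$LOCAL_REPO" ] || { echo "local repo not mounted; skipping"; return 0; }

  restic -r "$LOCAL_REPO" backup /opt/appdata --tag docker-data
  # replicate snapshots between repos without rereading the source files
  # (requires RESTIC_FROM_PASSWORD for the source repo)
  restic -r "$REMOTE_REPO" copy --from-repo "$LOCAL_REPO"
}

sync_321
```

Because `restic copy` pulls from the local repo rather than your live data, the offsite leg doesn't add load to the services being backed up.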

Proxmox VM Backups

Proxmox has excellent built-in backup capabilities that should be your first layer.

Using Proxmox Backup Server (PBS)

PBS is a dedicated backup application that pairs with Proxmox VE. Deduplication means incremental backups are efficient even for large VMs.

# On Proxmox VE host — add PBS as storage
# Datacenter → Storage → Add → Proxmox Backup Server
# Server: pbs.local, Datastore: your-datastore
# Fingerprint: (get from PBS → Dashboard → Show Fingerprint)

Configure automated backups in Proxmox VE:

Datacenter → Backup → Add: pick a schedule (e.g. daily at 02:00), a selection mode (all VMs, or an include/exclude list), the PBS storage as the target, snapshot mode, and a retention policy.

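The scheduled job ultimately runs vzdump, and the same backup can be triggered by hand before risky changes. A sketch, where VMID 100 and the storage ID "PBS" are assumptions carried over from the examples above; the command is guarded so the snippet is safe to run anywhere:

```shell
# Hypothetical one-off backup of VM 100 to the PBS storage, roughly
# equivalent to what the scheduled job does.
VMID=100
STORAGE=PBS   # storage ID from Datacenter → Storage

if command -v vzdump >/dev/null 2>&1; then
  vzdump "$VMID" --storage "$STORAGE" --mode snapshot --compress zstd
else
  echo "vzdump not found: run this on the Proxmox VE host"
fi
```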
Restore from PBS:

# Via web UI: VM → Backup tab → Restore
# Or CLI:
qmrestore PBS:backup/vm/100/2026-04-01T02:00:00Z 100 --storage local-lvm

Manual VM Snapshots vs Backups

Snapshots are for short-term protection (before risky updates). Backups are for disaster recovery. Don’t confuse them:

# Create snapshot before a risky operation
qm snapshot 100 pre-update --description "Before kernel update"

# Roll back if needed
qm rollback 100 pre-update

# Delete snapshot when done
qm delsnapshot 100 pre-update

Snapshots live on the same storage as the VM. They protect against oops moments, not hardware failures.

Docker Volume Backup Strategies

Docker volumes are a common backup blind spot. People back up their compose files but not the data directories.

Strategy 1: Bind Mounts to a Backed-Up Path

The simplest approach — use bind mounts that point to a path your backup tool already covers:

# docker-compose.yml
services:
  postgres:
    image: postgres:16
    volumes:
      - /opt/appdata/postgres:/var/lib/postgresql/data
  
  nextcloud:
    image: nextcloud:latest
    volumes:
      - /opt/appdata/nextcloud:/var/www/html

Now /opt/appdata/ is the single directory to back up for all your Docker data.
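One wrinkle: copying files while containers are writing them can capture a torn state. A sketch of a stop-archive-start wrapper, where the paths and the compose project location are assumptions:

```shell
#!/bin/sh
# Hypothetical cold backup of /opt/appdata: stop the stack, tar the data,
# restart. Acceptable when a few minutes of downtime fits your RTO.
set -eu

archive_dir() {  # usage: archive_dir <src-dir> <dest-tarball>
  tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

if command -v docker >/dev/null 2>&1 && [ -d /opt/compose ] && [ -d /opt/appdata ]; then
  ( cd /opt/compose && docker compose stop )
  mkdir -p /opt/backups
  archive_dir /opt/appdata "/opt/backups/appdata-$(date +%Y%m%d).tar.gz"
  ( cd /opt/compose && docker compose start )
fi
```

If downtime is unacceptable, the database-dump approach in the next strategy gets you consistency without stopping everything.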

Strategy 2: Database Dump Before Backup

For databases, file-level backup of a running database can produce inconsistent snapshots. Dump first:

#!/bin/bash
# /opt/scripts/backup-db.sh
set -euo pipefail

BACKUP_DIR="/opt/backups/db"
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "${BACKUP_DIR}"

# Dump PostgreSQL
docker exec postgres pg_dumpall -U postgres > "${BACKUP_DIR}/postgres-${DATE}.sql"

# Dump MySQL/MariaDB
docker exec mariadb mysqldump -u root -p"${MYSQL_ROOT_PASSWORD}" --all-databases > "${BACKUP_DIR}/mysql-${DATE}.sql"

# Keep last 7 days
find "${BACKUP_DIR}" -name "*.sql" -mtime +7 -delete

Run this before your file backup job.
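A dump that was written is not necessarily a dump that restores. The real test is loading it into a scratch container, but a cheap first check is confirming the newest dump is non-empty and looks like SQL. A sketch, assuming the dump directory from the script above:

```shell
# Hypothetical sanity check for the newest SQL dump. Catches empty or
# obviously broken files, not subtle corruption.
verify_sql_dump() {  # usage: verify_sql_dump <file>
  if [ ! -s "$1" ]; then
    echo "FAIL: $1 is empty or missing"; return 1
  fi
  # pg_dumpall and mysqldump output both begin with "--" comment lines
  if head -n 5 "$1" | grep -q '^--'; then
    echo "OK: $1"
  else
    echo "FAIL: $1 does not look like a SQL dump"; return 1
  fi
}

latest=$(ls -t /opt/backups/db/*.sql 2>/dev/null | head -n 1)
if [ -n "$latest" ]; then verify_sql_dump "$latest"; fi
```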

Strategy 3: docker-volume-backup

If you prefer named volumes, the offen/docker-volume-backup image runs as a sidecar container, archives the volumes you mount into it on a cron schedule, and can upload the archives to S3-compatible storage:

# docker-compose.yml — add alongside your services
services:
  backup:
    image: offen/docker-volume-backup:latest
    environment:
      BACKUP_CRON_EXPRESSION: "0 2 * * *"
      BACKUP_RETENTION_DAYS: "7"
      AWS_S3_BUCKET_NAME: my-backup-bucket
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    volumes:
      - myapp_data:/backup/myapp_data:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro

Offsite Backup with Restic and Backblaze B2

Restic is an excellent backup tool: deduplication, encryption at rest, incremental, supports many backends.

Installing Restic

# Debian/Ubuntu
sudo apt install restic

# Or install a release binary (substitute the current version number)
curl -L https://github.com/restic/restic/releases/download/v0.17.3/restic_0.17.3_linux_amd64.bz2 \
  | bunzip2 | sudo tee /usr/local/bin/restic > /dev/null
sudo chmod +x /usr/local/bin/restic

Configure Backblaze B2

Create a Backblaze account, create a private B2 bucket, and create an application key scoped to just that bucket.

# Environment variables (store in /etc/restic-env, chmod 600)
export B2_ACCOUNT_ID="your-account-id"
export B2_ACCOUNT_KEY="your-application-key"
export RESTIC_REPOSITORY="b2:my-backup-bucket:/homelab"
export RESTIC_PASSWORD="your-strong-encryption-password"
# Initialize repository (first time only)
source /etc/restic-env
restic init

# Backup your data directory
restic backup /opt/appdata --tag docker-data

# Backup with exclusions
restic backup /opt/appdata \
  --exclude /opt/appdata/*/cache \
  --exclude /opt/appdata/*/logs \
  --tag docker-data

# Check backup integrity
restic check

# List snapshots
restic snapshots

# Restore specific snapshot
restic restore latest --target /restore/test

# Restore specific path from latest snapshot
restic restore latest --target /tmp/restore --include /opt/appdata/nextcloud

Automated Backup with systemd

# /etc/systemd/system/restic-backup.service
[Unit]
Description=Restic backup to Backblaze B2
OnFailure=restic-backup-failure@%n.service

[Service]
Type=oneshot
EnvironmentFile=/etc/restic-env
ExecStart=/usr/local/bin/restic backup /opt/appdata --tag docker-data
ExecStart=/usr/local/bin/restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
ExecStart=/usr/local/bin/restic check --read-data-subset=10%%

# /etc/systemd/system/restic-backup.timer
[Unit]
Description=Run Restic backup daily

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true
RandomizedDelaySec=1800

[Install]
WantedBy=timers.target

Enable the timer:

sudo systemctl enable --now restic-backup.timer

Alert on Backup Failure

# /etc/systemd/system/restic-backup-failure@.service
[Unit]
Description=Notify on backup failure

[Service]
Type=oneshot
ExecStart=/opt/scripts/notify-backup-failure.sh

#!/bin/bash
# /opt/scripts/notify-backup-failure.sh
curl -X POST https://ntfy.sh/your-backup-alerts \
  -d "Restic backup failed on $(hostname) at $(date)" \
  -H "Title: Backup Failure" \
  -H "Priority: high"

Testing Your Backups — The Step Everyone Skips

A backup you’ve never restored is a backup of unknown quality. Test regularly:

Monthly: Spot check restore

# Restore a random file and verify it's readable
source /etc/restic-env
restic restore latest --target /tmp/restore-test --include /opt/appdata/nextcloud/config/config.php
diff /opt/appdata/nextcloud/config/config.php /tmp/restore-test/opt/appdata/nextcloud/config/config.php
rm -rf /tmp/restore-test

Quarterly: Full application restore test

Spin up a VM, restore your backup, start your application, verify it works. This is the only way to know your RTO is achievable.
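"Verify it works" deserves to be a script rather than a vibe. A minimal smoke-test sketch, where the URL and expected status code are per-service assumptions:

```shell
# Hypothetical smoke test: hit an endpoint on the restored service and
# compare the HTTP status code against what a healthy instance returns.
smoke_test() {  # usage: smoke_test <url> [expected-code]
  want="${2:-200}"
  command -v curl >/dev/null 2>&1 || { echo "SKIP: curl not installed"; return 0; }
  got=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1" 2>/dev/null) || true
  if [ "$got" = "$want" ]; then
    echo "PASS: $1 returned $got"
  else
    echo "FAIL: $1 returned ${got:-nothing}, wanted $want"
  fi
}

smoke_test "http://restored-vm:8080/login"   # example endpoint for the restored app
```

Keep one `smoke_test` line per service in your runbook so the quarterly drill ends with a pass/fail list instead of a shrug.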

Annually: Site-level disaster simulation

Assume your entire server is gone. Start from scratch using only your off-site backups and your documentation. What breaks? What’s missing? Update your documentation.

A Practical DR Runbook Template

# Disaster Recovery Runbook — [Service Name]
Last tested: [date]
Last updated: [date]

## Service Description
[What does this service do? What data does it hold?]

## RPO / RTO
- Recovery Point Objective: X hours
- Recovery Time Objective: X hours

## Backup Locations
- Primary: [PBS datastore / path]
- Off-site: [Backblaze B2 bucket / Restic repo]
- Database dumps: /opt/backups/db/

## Recovery Procedure

### Prerequisites
- [ ] New server/VM provisioned with [OS version]
- [ ] Docker and Docker Compose installed
- [ ] SSH access configured

### Step 1: Restore application data
restic -r b2:my-bucket:/homelab restore latest \
  --target /opt/appdata \
  --include /opt/appdata/[service-name]

### Step 2: Restore database
docker exec -i postgres psql -U postgres < /backups/postgres-latest.sql

### Step 3: Start application
cd /opt/compose/[service-name]
docker compose up -d

### Step 4: Verify
- [ ] Application responds at [URL]
- [ ] Check [specific data] is intact
- [ ] Run smoke test: [test command]

## Known Issues
[What might go wrong during recovery]

## Contact
[Who knows this service and can help]

Document it before you need it. The 3am disaster recovery session is not the time to be reading documentation for the first time — it’s when you want to be executing a checklist you’ve already validated works.

The backup that’s never been restored might as well not exist. The runbook that’s never been tested will fail when you need it most. Testing is not optional; it’s the only way to know you actually have disaster recovery, rather than the appearance of it.

