The Lie You Tell Yourself
You’ve got a solid backup strategy. Three copies of everything. Backblaze, local NAS, external USB drive. You’ve got a Grafana dashboard showing all your backups are green. You feel responsible and professional.
Then your database gets corrupted. You go to restore from backup. The backup is corrupt too. Or it’s missing the tables you need. Or the restore process is so broken that it takes three hours to figure out.
Now you’ve got a real disaster.
This happens because you’ve never actually tried to restore. You’ve just assumed it works.
RTO and RPO Are Meaningless Without Testing
RPO (Recovery Point Objective) is how much data you can afford to lose. “We back up daily, so we lose at most 24 hours of data.”
RTO (Recovery Time Objective) is how long you can afford to be down. “We can recover in 2 hours.”
Both are completely meaningless if you’ve never actually recovered.
You think your backup takes 2 hours to restore. You haven’t timed it. You think you’ll only lose 1 day of data. You haven’t verified the backup actually contains yesterday’s data. You think your recovery process is documented. It’s not — it’s in your head.
The Testing Strategy
Pick a frequency. Monthly is good. Once a quarter is the bare minimum.
Test 1: Restore a file
Pick a random file from your backups. Restore it to a test location. Verify it’s correct.
    # Example with Restic: --include limits the restore to a single file
    restic -r /backup/repo restore latest --target /tmp/restore-test --include /important/file.txt

    # Verify it matches
    diff /important/file.txt /tmp/restore-test/important/file.txt

Takes 5 minutes. Do this monthly.
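The comparison step is easy to script so that it runs the same way every month. The following is a minimal sketch, assuming GNU coreutils (`sha256sum`) is available; `verify_restored_file` is a hypothetical helper name, not part of Restic.

```shell
# Compare an original file with its restored copy by SHA-256 checksum.
# Returns 0 if identical, 1 if the contents differ, 2 if either file is missing.
verify_restored_file() {
  local original="$1" restored="$2"
  if [ ! -f "$original" ] || [ ! -f "$restored" ]; then
    echo "MISSING: $original or $restored" >&2
    return 2
  fi
  local sum_a sum_b
  # Reading from stdin keeps the file name out of the checksum output
  sum_a=$(sha256sum < "$original")
  sum_b=$(sha256sum < "$restored")
  if [ "$sum_a" = "$sum_b" ]; then
    echo "OK: $restored matches $original"
  else
    echo "MISMATCH: $restored differs from $original" >&2
    return 1
  fi
}
```

With the paths from the example above, the call would be `verify_restored_file /important/file.txt /tmp/restore-test/important/file.txt`. A mismatch is not always a bad backup: a file that changed after the snapshot was taken will legitimately differ, so pick files that are stable or compare against a known checksum.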
Test 2: Restore the entire system
Once a quarter, actually restore everything. Spin up a test VM or container. Restore your full backup. Boot it. Verify the system works.
This takes longer (maybe an hour), but it’s the only way you know your RTO is realistic.
    # Restore a full Proxmox backup to a new VM
    # (Proxmox example, adjust for your backup tool)
    qmrestore /backup/dump/vm1-2025-12-01.vma.gz 123 --storage local

Document Your Restore Process
Write down every step. Not in your head. In a document.
# Database Restore Procedure

## Prerequisites
- Access to backup server
- Test environment available
- Minimum 500GB free disk space

## Steps

1. Download latest backup

    restic -r s3://bucket/db restore latest --target /tmp/db-restore --path /var/lib/postgresql

2. Stop the database

    systemctl stop postgresql

3. Restore data

    rm -rf /var/lib/postgresql/data
    cp -r /tmp/db-restore/var/lib/postgresql/data /var/lib/postgresql/
    chown -R postgres:postgres /var/lib/postgresql/data

4. Start database and verify

    systemctl start postgresql
    psql -U postgres -c "SELECT COUNT(*) FROM users;"

5. Cleanup

    rm -rf /tmp/db-restore

Estimated time: 30 minutes
Last tested: 2025-12-11
Tested by: [your name]

Follow this document step-by-step during your test. If something’s wrong, fix the document and the process.
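The "Last tested" line only helps if someone notices when it goes stale. As a rough sketch, a check like the one below could flag a procedure that has not been exercised recently. It assumes GNU `date` (for `-d`), a `Last tested: YYYY-MM-DD` line at the start of a line in the document, and a 90-day default threshold; `days_since` and `check_last_tested` are hypothetical helper names.

```shell
# Days elapsed since a YYYY-MM-DD date (requires GNU date for -d).
days_since() {
  local then_s now_s
  then_s=$(date -d "$1" +%s) || return 1
  now_s=$(date +%s)
  echo $(( (now_s - then_s) / 86400 ))
}

# Read the "Last tested:" date from a procedure document and report whether
# it is older than max_days (default 90, roughly one quarter).
# Returns 0 if fresh, 1 if stale, 2 if the date is missing or unparseable.
check_last_tested() {
  local doc="$1" max_days="${2:-90}" tested age
  tested=$(sed -n 's/^Last tested:[[:space:]]*//p' "$doc" | head -n 1)
  if [ -z "$tested" ]; then
    echo "No 'Last tested:' line in $doc" >&2
    return 2
  fi
  age=$(days_since "$tested") || return 2
  if [ "$age" -gt "$max_days" ]; then
    echo "STALE: $doc last tested $age days ago"
    return 1
  fi
  echo "OK: $doc last tested $age days ago"
}
```

Calling `check_last_tested restore-procedure.md` (a placeholder file name) from the same weekly job that checks the backups is one way to keep the document honest.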
Test Failure Modes
What if your restore fails? Good. Now you know. Better now than when your data’s actually gone.
Common failures:
“Backup is corrupted”
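    restic check -r /backup/repo

The check command verifies backup integrity. By default it checks the repository structure; add --read-data to also read and verify every data pack, which is slower but catches damaged backup data. Run this monthly on each backup.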
“Backup is incomplete”
List what’s in the backup:
    restic list snapshots -r /backup/repo
    restic ls -r /backup/repo snapshot_id

Verify important files are there.
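Scanning a long listing by eye is error-prone, so the check for important files can be scripted. Here is a minimal sketch, assuming the listing has been saved to a file with one path per line; `missing_paths` is a hypothetical helper name, and the paths in the usage note are placeholders.

```shell
# Print every required path that does not appear in a snapshot listing.
# $1: file containing one path per line (for example, saved `restic ls` output)
# Remaining arguments: paths that must be present.
# Returns 0 if all paths are present, 1 if any are missing.
missing_paths() {
  local listing="$1"; shift
  local path status=0
  for path in "$@"; do
    # -F: fixed string, -x: the whole line must match, -q: no output
    if ! grep -Fxq -- "$path" "$listing"; then
      echo "MISSING: $path"
      status=1
    fi
  done
  return "$status"
}
```

A possible use is `restic ls -r /backup/repo latest > /tmp/listing.txt` followed by `missing_paths /tmp/listing.txt /important/file.txt /var/lib/postgresql/data`. Matching whole lines avoids false positives from paths that merely share a prefix.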
“Restore is slow”
Time it:
    time restic restore latest -r /backup/repo --target /tmp/test

If it takes 4 hours, your RTO of “2 hours” is a lie. Update it or optimize the restore.
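To turn the timing into a pass/fail check against your RTO, the command can be wrapped in a small function. This is a sketch under stated assumptions: `timed_within_budget` is a hypothetical helper name, the budget is given in whole seconds, and timing uses one-second resolution from `date +%s`.

```shell
# Run a command, report how long it took, and compare against a time budget.
# $1: budget in seconds; remaining arguments: the command to run.
# Returns 0 if the command succeeded within budget, 1 if the command failed,
# 2 if it succeeded but took longer than the budget.
timed_within_budget() {
  local budget="$1"; shift
  local start end elapsed rc=0
  start=$(date +%s)
  "$@" || rc=$?
  end=$(date +%s)
  elapsed=$(( end - start ))
  echo "Elapsed: ${elapsed}s (budget: ${budget}s)"
  if [ "$rc" -ne 0 ]; then
    echo "FAILED: command exited with status $rc" >&2
    return 1
  fi
  if [ "$elapsed" -gt "$budget" ]; then
    echo "OVER BUDGET by $(( elapsed - budget ))s" >&2
    return 2
  fi
}
```

For a 2-hour RTO, the call would look like `timed_within_budget 7200 restic restore latest -r /backup/repo --target /tmp/test`. The restore itself is only part of the RTO; time spent provisioning hardware, fetching credentials, and verifying the result counts too.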
“Restored data is old”
Check the backup’s timestamp:
    restic snapshots -r /backup/repo

You’ll see when each snapshot was taken. Make sure you’re restoring from the one you think you are.
Backup Verification Scripts
Make this automated. Check your backups weekly:
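    #!/bin/bash
    # Assumes each directory under /backups is a restic repository and that
    # RESTIC_PASSWORD (or RESTIC_PASSWORD_FILE) is set in the environment.
    set -e

    for backup_dir in /backups/*/; do
      echo "Checking $(basename "$backup_dir")..."

      # Check integrity (restic check exits non-zero when it finds errors)
      if ! restic -r "$backup_dir" check; then
        echo "ERROR: Backup corrupted in $backup_dir" | mail -s "Backup Failure" admin@example.com
        exit 1
      fi

      # Check age (warn if older than 24 hours); pick the newest snapshot
      # explicitly rather than relying on the order of the JSON array
      latest=$(restic -r "$backup_dir" snapshots --json | jq -r 'max_by(.time).time')
      age_hours=$(( ($(date +%s) - $(date -d "$latest" +%s)) / 3600 ))

      if [ "$age_hours" -gt 24 ]; then
        echo "WARNING: Latest backup in $backup_dir is $age_hours hours old" | mail -s "Backup Too Old" admin@example.com
      fi
    done

    echo "All backups OK"

Run it weekly (this entry uses the /etc/cron.d format, which includes a user field):

    0 2 * * 1 root /usr/local/bin/check_backups.sh

The Three Types of Restore Tests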
Cold Test — You actually restore and verify, but don’t cut over
- Most realistic
- Takes time
- Do this quarterly
Warm Test — You restore to a parallel environment and verify it matches
- Less disruptive
- Still thorough
- Do this monthly
Hot Test — You restore files and spot-check them without shutting down production
- Minimal risk
- Quick
- Do this monthly
What Gets You Fired
Not having backups: bad. Having backups you’ve never tested: worse. Restoring for the first time when you’re in an actual disaster: career-ending.
Spend 2 hours a month testing your backups. It’s insurance.
When the disk dies at 2 AM and you go to restore, you’ll be the one who saves the day instead of being the person who lost everything.