You Should Be Testing Your Restores

By SumGuy 5 min read
The Lie You Tell Yourself

You’ve got a solid backup strategy. Three copies of everything. Backblaze, local NAS, external USB drive. You’ve got a Grafana dashboard showing all your backups are green. You feel responsible and professional.

Then your database gets corrupted. You go to restore from backup. The backup is corrupt too. Or it’s missing the tables you need. Or the restore process is so broken that it takes three hours to figure out.

Now you’ve got a real disaster.

This happens because you’ve never actually tried to restore. You’ve just assumed it works.

RTO and RPO Are Meaningless Without Testing

RPO (Recovery Point Objective) is how much data you can afford to lose. “We back up daily, so we lose at most 24 hours of data.”

RTO (Recovery Time Objective) is how long you can afford to be down. “We can recover in 2 hours.”

Both are completely meaningless if you’ve never actually recovered.

You think your backup takes 2 hours to restore. You haven’t timed it. You think you’ll only lose 1 day of data. You haven’t verified the backup actually contains yesterday’s data. You think your recovery process is documented. It’s not — it’s in your head.

The Testing Strategy

Pick a frequency. Monthly is good. Once a quarter is the bare minimum.

Test 1: Restore a file

Pick a random file from your backups. Restore it to a test location. Verify it’s correct.

Terminal window
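# Example with Restic: --include limits the restore to a single file
# (--path only selects which snapshot counts as "latest"; it does not filter files)
restic -r /backup/repo restore latest --target /tmp/restore-test --include /important/file.txt
# Verify it matches (expect a difference if the live file changed after the backup ran)
diff /important/file.txt /tmp/restore-test/important/file.txt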

Takes 5 minutes. Do this monthly.
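If you want the "pick a random file" part to be actually random, the spot check can be scripted. This is a minimal sketch, assuming GNU coreutils and findutils; the helper names `pick_random_file` and `same_content` are made up for illustration, and it picks from the live directory you back up rather than from the repository itself.

```shell
#!/bin/bash
# Hypothetical helpers for a monthly spot check; names and paths are illustrative.

# Print one randomly chosen regular file under the given directory.
# NUL delimiters keep file names with spaces or newlines intact.
pick_random_file() {
    find "$1" -type f -print0 | shuf -z -n 1 | tr -d '\0'
}

# Succeed only if both files exist and are byte-for-byte identical.
same_content() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# Example usage (repository and paths are assumptions):
# file=$(pick_random_file /important)
# restic -r /backup/repo restore latest --target /tmp/restore-test --include "$file"
# same_content "$file" "/tmp/restore-test$file" && echo "OK: $file"
```

A mismatch is not automatically a bad backup: a file edited after the last snapshot will differ too, so check the modification time before you panic.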

Test 2: Restore the entire system

Once a quarter, actually restore everything. Spin up a test VM or container. Restore your full backup. Boot it. Verify the system works.

This takes longer (maybe an hour), but it’s the only way you know your RTO is realistic.

Terminal window
# Restore a full Proxmox backup to a new VM
# (Proxmox example, adjust for your backup tool)
qmrestore /backup/dump/vm1-2025-12-01.vma.gz 123 --storage local

Document Your Restore Process

Write down every step. Not in your head. In a document.

# Database Restore Procedure
## Prerequisites
- Access to backup server
- Test environment available
- Minimum 500GB free disk space
## Steps
1. Download latest backup

restic -r s3://bucket/db restore latest --target /tmp/db-restore --path /var/lib/postgresql

2. Stop the database

systemctl stop postgresql

3. Restore data

rm -rf /var/lib/postgresql/data
cp -r /tmp/db-restore/var/lib/postgresql/data /var/lib/postgresql/
chown -R postgres:postgres /var/lib/postgresql/data

4. Start database and verify

systemctl start postgresql
psql -U postgres -c "SELECT COUNT(*) FROM users;"

5. Cleanup

rm -rf /tmp/db-restore

Estimated time: 30 minutes
Last tested: 2025-12-11
Tested by: [your name]

Follow this document step-by-step during your test. If something’s wrong, fix the document and the process.
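The "Last tested" line in that document is only useful if someone notices when it goes stale. Here is a rough sketch of a staleness check, assuming GNU `date` and a `YYYY-MM-DD` date; the function names and the 31-day default are assumptions for illustration, not part of any tool.

```shell
#!/bin/bash
# Hypothetical helpers: flag a restore procedure whose last test is too old.

# Print whole days elapsed since a YYYY-MM-DD date (UTC, GNU date assumed).
days_since() {
    local past now
    past=$(date -u -d "$1" +%s) || return 2
    now=$(date -u +%s)
    echo $(( (now - past) / 86400 ))
}

# Succeed (exit 0) when the last test is older than max_days (default 31).
restore_test_overdue() {
    local last_tested="$1" max_days="${2:-31}"
    local age
    age=$(days_since "$last_tested") || return 2
    [ "$age" -gt "$max_days" ]
}

# Example usage, with the date copied from the procedure document:
# restore_test_overdue 2025-12-11 && echo "Restore test is overdue"
```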

Test Failure Modes

What if your restore fails? Good. Now you know. Better now than when your data’s actually gone.

Common failures:

“Backup is corrupted”

Terminal window
restic check -r /backup/repo

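By default, the check command verifies the repository's structure and metadata. Add --read-data (or --read-data-subset to sample part of the repository) to also read back and verify the stored data itself. Run this monthly on each backup.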

“Backup is incomplete”

List what’s in the backup:

Terminal window
restic list snapshots -r /backup/repo
restic ls -r /backup/repo snapshot_id

Verify important files are there.
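Eyeballing a listing gets old fast. One way to automate it is to keep a short list of paths that must exist and compare it against the listing. A minimal sketch, assuming the listing arrives on stdin with one path per line; `missing_paths` and the example paths are hypothetical.

```shell
#!/bin/bash
# Hypothetical check: print every required path that is absent from a
# listing read on stdin (one path per line). Returns 1 if any is missing.
missing_paths() {
    local listing path status=0
    listing=$(cat)
    for path in "$@"; do
        # -x: match the whole line, -F: fixed string, -q: no output
        if ! grep -qxF -- "$path" <<< "$listing"; then
            echo "MISSING: $path"
            status=1
        fi
    done
    return "$status"
}

# Example usage (repository and paths are assumptions):
# restic ls -r /backup/repo latest | missing_paths /etc/fstab /var/lib/postgresql
```

Matching whole lines means `/etc/fstab` will not be satisfied by `/etc/fstab.bak`.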

“Restore is slow”

Time it:

Terminal window
time restic restore latest -r /backup/repo --target /tmp/test

If it takes 4 hours, your RTO of “2 hours” is a lie. Update it or optimize the restore.
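To make that comparison automatic, you can wrap the restore in a small timer that fails when the budget is blown. This is only a sketch: `check_rto` is a made-up name, the budget is given in minutes, and it measures wall-clock seconds of the command alone, not the time you spend finding credentials at 2 AM.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command, then compare its wall-clock time
# against an RTO budget given in minutes.
# Returns 0 if within budget, 1 if over budget, 2 if the command failed.
check_rto() {
    local rto_minutes="$1"; shift
    local start end elapsed budget
    budget=$(( rto_minutes * 60 ))
    start=$(date +%s)
    "$@" || return 2
    end=$(date +%s)
    elapsed=$(( end - start ))
    echo "Restore took ${elapsed}s (budget: ${budget}s)"
    [ "$elapsed" -le "$budget" ]
}

# Example usage with a 2-hour RTO (repository path is an assumption):
# check_rto 120 restic restore latest -r /backup/repo --target /tmp/test \
#     || echo "RTO missed or restore failed"
```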

“Restored data is old”

Check the backup’s timestamp:

Terminal window
restic snapshots -r /backup/repo

You’ll see when each snapshot was taken. Make sure you’re restoring from the one you think you are.

Backup Verification Scripts

Make this automated. Check your backups weekly:

check_backups.sh
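#!/bin/bash
# Assumes RESTIC_PASSWORD_FILE (or RESTIC_PASSWORD) is set so restic can run unattended.
set -e
status=0
for backup_dir in /backups/*/; do
    echo "Checking $(basename "$backup_dir")..."
    # Check integrity. Use the exit code, not grep: a clean run prints
    # "no errors were found", which would match a search for "error".
    if ! restic -r "$backup_dir" check; then
        echo "ERROR: Backup corrupted in $backup_dir" \
            | mail -s "Backup Failure" admin@example.com
        status=1
        continue
    fi
    # Check age (warn if older than 24 hours). Pick the newest snapshot
    # by timestamp instead of relying on the order of the JSON list.
    latest=$(restic -r "$backup_dir" snapshots --json | jq -r 'max_by(.time).time')
    age_hours=$(( ($(date +%s) - $(date -d "$latest" +%s)) / 3600 ))
    if [ "$age_hours" -gt 24 ]; then
        echo "WARNING: Latest backup in $backup_dir is $age_hours hours old" \
            | mail -s "Backup Too Old" admin@example.com
        status=1
    fi
done
if [ "$status" -eq 0 ]; then
    echo "All backups OK"
fi
exit "$status"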

Run it weekly:

/etc/cron.d/verify-backups
0 2 * * 1 root /usr/local/bin/check_backups.sh

The Three Types of Restore Tests

Cold Test — You actually restore and verify, but don’t cut over

Warm Test — You restore to a parallel environment and verify it matches

Hot Test — You restore files and spot-check them without shutting down production

What Gets You Fired

Not having backups: bad. Having backups you’ve never tested: worse. Restoring for the first time when you’re in an actual disaster: career-ending.

Spend 2 hours a month testing your backups. It’s insurance.

When the disk dies at 2 AM and you go to restore, you’ll be the one who saves the day instead of being the person who lost everything.

