You Should Be Testing Your Restores

The Lie You Tell Yourself

You’ve got a solid backup strategy. Three copies of everything. Backblaze, local NAS, external USB drive. You’ve got a Grafana dashboard showing all your backups are green. You feel responsible and professional.

Then your database gets corrupted. You go to restore from backup. The backup is corrupt too. Or it’s missing the tables you need. Or the restore process is so broken that it takes three hours to figure out.

Now you’ve got a real disaster.

This happens because you’ve never actually tried to restore. You’ve just assumed it works.

RTO and RPO Are Meaningless Without Testing

RPO (Recovery Point Objective) is how much data you can afford to lose. “We back up daily, so we lose at most 24 hours of data.”

RTO (Recovery Time Objective) is how long you can afford to be down. “We can recover in 2 hours.”

Both are completely meaningless if you’ve never actually recovered.

You think your backup takes 2 hours to restore. You haven’t timed it. You think you’ll only lose 1 day of data. You haven’t verified the backup actually contains yesterday’s data. You think your recovery process is documented. It’s not — it’s in your head.
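One way to keep yourself honest is to put each stated objective next to a number you actually measured. The sketch below is a minimal illustration: `meets_objective` is a hypothetical helper name, and the values in the usage comments are placeholders you would replace with results from a real restore test.

```shell
# Compare a measured value against a stated objective.
# Usage: meets_objective LABEL MEASURED LIMIT UNIT
# Prints a one-line verdict and returns non-zero when the objective is missed.
meets_objective() {
    local label="$1" measured="$2" limit="$3" unit="$4"
    if [ "$measured" -le "$limit" ]; then
        echo "$label OK: measured $measured $unit, objective $limit $unit"
    else
        echo "$label MISSED: measured $measured $unit, objective $limit $unit"
        return 1
    fi
}

# Example with placeholder numbers: a 2-hour RTO and a 24-hour RPO.
# meets_objective RTO 95 120 minutes   # the test restore took 95 minutes
# meets_objective RPO 31 24 hours      # newest restored data was 31 hours old
```

If either line reports MISSED, the objective is a wish rather than a measurement.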

The Testing Strategy

Pick a frequency. Monthly is good. Once a quarter is the bare minimum.

Test 1: Restore a file

Pick a random file from your backups. Restore it to a test location. Verify it’s correct.

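# Example with Restic: restore one file from the latest snapshot
# (--include selects what to restore; --path only filters which snapshots are considered)
restic -r /backup/repo restore latest --target /tmp/restore-test --include /important/file.txt

# Verify it matches (expect differences if the live file changed after the snapshot)
diff /important/file.txt /tmp/restore-test/important/file.txt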

Takes 5 minutes. Do this monthly.
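If you want to spot-check several files in one pass, the same comparison can be wrapped in a small loop. This is a sketch, not a finished tool: `spot_check` is a hypothetical function name, and it assumes the restore recreates the full original path under the target directory, as in the restic example above.

```shell
# Spot-check restored files against their live originals.
# Assumes the restore tool recreates the full original path under TARGET
# (/important/file.txt -> TARGET/important/file.txt).
# Usage: spot_check TARGET PATH...
spot_check() {
    local target="$1" failed=0 path
    shift
    for path in "$@"; do
        # cmp -s compares byte by byte and stays silent; only the exit code matters
        if cmp -s "$path" "$target$path"; then
            echo "OK       $path"
        else
            echo "DIFFERS  $path"
            failed=1
        fi
    done
    return "$failed"
}

# Example (paths are placeholders):
# spot_check /tmp/restore-test /important/file.txt /etc/fstab
```

A DIFFERS line is not automatically a failure. It can simply mean the live file changed after the snapshot, so check the timestamps before you panic.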

Test 2: Restore the entire system

Once a quarter, actually restore everything. Spin up a test VM or container. Restore your full backup. Boot it. Verify the system works.

This takes longer (maybe an hour), but it’s the only way you know your RTO is realistic.

# Restore a full Proxmox backup to a new VM
# (Proxmox example, adjust for your backup tool)
qmrestore /backup/dump/vm1-2025-12-01.vma.gz 123 --storage local
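"Verify the system works" is easier to repeat when the checks are written down as commands. The sketch below assumes nothing about your stack: `run_smoke_checks` is a hypothetical helper, and the hostname and URL in the usage comments are placeholders for your restored VM.

```shell
# Run a list of smoke checks against a restored system and summarise the result.
# Each argument is one shell command line; a non-zero exit marks the check as failed.
# Usage: run_smoke_checks "CMD1" "CMD2" ...
run_smoke_checks() {
    local failed=0 check
    for check in "$@"; do
        if sh -c "$check" >/dev/null 2>&1; then
            echo "PASS  $check"
        else
            echo "FAIL  $check"
            failed=$((failed + 1))
        fi
    done
    echo "$failed of $# checks failed"
    [ "$failed" -eq 0 ]
}

# Example checks (host and URL are placeholders):
# run_smoke_checks \
#     "ping -c 1 restored-vm.example.internal" \
#     "ssh restored-vm.example.internal systemctl is-system-running" \
#     "curl -fsS http://restored-vm.example.internal/health"
```

Keeping the list of checks in the same file as your restore notes means the next test run verifies the same things as the last one.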

Document Your Restore Process

Write down every step. Not in your head. In a document.

# Database Restore Procedure

## Prerequisites
- Access to backup server
- Test environment available
- Minimum 500GB free disk space

## Steps

1. Download latest backup

restic -r s3:s3.amazonaws.com/bucket/db restore latest --target /tmp/db-restore --path /var/lib/postgresql

2. Stop the database

systemctl stop postgresql

3. Restore data

rm -rf /var/lib/postgresql/data
cp -r /tmp/db-restore/var/lib/postgresql/data /var/lib/postgresql/
chown -R postgres:postgres /var/lib/postgresql/data

4. Start database and verify

systemctl start postgresql
psql -U postgres -c "SELECT COUNT(*) FROM users;"

5. Cleanup

rm -rf /tmp/db-restore

Estimated time: 30 minutes
Last tested: 2025-12-11
Tested by: [your name]

Follow this document step-by-step during your test. If something’s wrong, fix the document and the process.
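A procedure document also goes stale quietly. As a rough sketch, the function below reads the "Last tested:" line from the template above and flags documents that have not been exercised recently. `runbook_is_stale` is a hypothetical name, the date format is assumed to be YYYY-MM-DD, and GNU `date` is assumed for the `-d` option.

```shell
# Report whether a runbook's "Last tested: YYYY-MM-DD" line is older than MAX_DAYS.
# Returns 0 (stale) when the date is too old, missing, or unparseable.
# Usage: runbook_is_stale FILE MAX_DAYS
runbook_is_stale() {
    local file="$1" max_days="$2" tested tested_ts now age_days
    tested=$(sed -n 's/^Last tested:[[:space:]]*//p' "$file" | head -n 1)
    if [ -z "$tested" ]; then
        echo "$file: no 'Last tested' date found"
        return 0
    fi
    tested_ts=$(date -d "$tested" +%s 2>/dev/null) || {
        echo "$file: cannot parse date '$tested'"
        return 0
    }
    now=$(date +%s)
    age_days=$(( (now - tested_ts) / 86400 ))
    if [ "$age_days" -gt "$max_days" ]; then
        echo "$file: last tested $age_days days ago (limit $max_days)"
        return 0
    fi
    echo "$file: tested $age_days days ago"
    return 1
}

# Example (file name is a placeholder): nag when a quarterly test is overdue.
# runbook_is_stale db-restore-procedure.md 90 && echo "Time to run the restore test"
```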

Test Failure Modes

What if your restore fails? Good. Now you know. Better now than when your data’s actually gone.

Common failures:

“Backup is corrupted”

restic check -r /backup/repo

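The check command verifies the repository’s structure and metadata. Add --read-data to also read back and verify the stored data itself, which is slower but catches damaged pack files. Run this monthly on each backup.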
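On a large repository, reading all the data back on every run may take too long. restic's `--read-data-subset=n/t` option reads only slice n of t, so a rotating slice covers the whole repository over several runs. The helper below is a sketch of that rotation; `read_data_subset` is a hypothetical name, and the choice of four slices is only an example.

```shell
# Pick which slice of the repository to read on this run, cycling through all slices.
# Usage: read_data_subset RUN_NUMBER TOTAL_SLICES   -> prints "n/t"
read_data_subset() {
    local run="$1" total="$2"
    echo "$(( run % total + 1 ))/$total"
}

# Example (not run here): verify one quarter of the data each week.
# Weeks since the epoch give a counter that never resets mid-cycle.
# week=$(( $(date +%s) / 604800 ))
# restic -r /backup/repo check --read-data-subset="$(read_data_subset "$week" 4)"
```

With four slices and a weekly run, every pack file gets read roughly once a month.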

“Backup is incomplete”

List what’s in the backup:

restic list snapshots -r /backup/repo
restic ls -r /backup/repo snapshot_id

Verify important files are there.
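Checking by eye works for one or two files. For a fixed list of must-have paths, a small comparison against the snapshot listing is less error-prone. This sketch assumes you have saved the listing to a file with one path per line; `missing_paths` is a hypothetical helper and the paths in the usage comments are placeholders.

```shell
# Report which required paths are absent from a snapshot listing.
# LISTING is a file holding one path per line, e.g. saved output of `restic ls`.
# Usage: missing_paths LISTING PATH...   (returns non-zero if anything is missing)
missing_paths() {
    local listing="$1" missing=0 path
    shift
    for path in "$@"; do
        # -x: match the whole line, -F: treat the path as a fixed string, -q: quiet
        if ! grep -qxF -- "$path" "$listing"; then
            echo "MISSING $path"
            missing=1
        fi
    done
    return "$missing"
}

# Example (paths are placeholders):
# restic -r /backup/repo ls latest > /tmp/snapshot-listing.txt
# missing_paths /tmp/snapshot-listing.txt /etc/fstab /var/lib/postgresql/data/PG_VERSION
```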

“Restore is slow”

Time it:

time restic restore latest -r /backup/repo --target /tmp/test

If it takes 4 hours, your RTO of “2 hours” is a lie. Update it or optimize the restore.
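To keep a record of restore times rather than reading them off the terminal, you can capture the duration as a number. The wrapper below is a minimal sketch: `elapsed_seconds` is a hypothetical name, and the 7200-second limit in the usage comments stands in for a 2-hour RTO.

```shell
# Run a command, then print how long it took in whole seconds.
# The command's own output is sent to stderr so stdout carries only the number.
# Usage: elapsed_seconds COMMAND [ARGS...]
elapsed_seconds() {
    local start end status=0
    start=$(date +%s)
    "$@" >&2 || status=$?
    end=$(date +%s)
    echo $(( end - start ))
    return "$status"
}

# Example (not run here): compare a real restore against a 2-hour RTO.
# secs=$(elapsed_seconds restic -r /backup/repo restore latest --target /tmp/test) \
#     || echo "Restore command failed"
# [ "$secs" -le 7200 ] || echo "Restore took ${secs}s, over the 7200s RTO"
```

Appending each result to a log file gives you a trend, which shows a restore getting slower as the repository grows.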

“Restored data is old”

Check the backup’s timestamp:

restic snapshots -r /backup/repo

You’ll see when each snapshot was taken. Make sure you’re restoring from the one you think you are.
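A snapshot timestamp tells you when the backup ran, not how fresh the data inside it is. As a complementary check, the sketch below looks at the newest modification time in a restored directory. It assumes your backup tool preserves modification times on restore and that GNU `find` is available; `newest_file_age_hours` is a hypothetical name.

```shell
# Print the age, in whole hours, of the most recently modified file under DIR.
# Returns non-zero if DIR contains no regular files.
# Usage: newest_file_age_hours DIR
newest_file_age_hours() {
    local dir="$1" newest now
    # %T@ is the modification time in seconds since the epoch (with a fraction)
    newest=$(find "$dir" -type f -printf '%T@\n' | sort -n | tail -n 1)
    [ -n "$newest" ] || return 1
    now=$(date +%s)
    # Strip the fractional part before doing integer arithmetic
    echo $(( (now - ${newest%.*}) / 3600 ))
}

# Example (path is a placeholder):
# age=$(newest_file_age_hours /tmp/test/var/lib/postgresql) && echo "Newest file is ${age}h old"
```

If a busy database directory restores with nothing newer than a week, the backup job may be capturing a stale path even though the snapshots look current.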

Backup Verification Scripts

Make this automated. Check your backups weekly:

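#!/bin/bash
# Assumes restic can read the repository password non-interactively,
# e.g. via RESTIC_PASSWORD_FILE in the environment.
set -u
status=0

for backup_dir in /backups/*/; do
    echo "Checking $(basename "$backup_dir")..."

    # Check integrity: rely on restic's exit code rather than grepping its output
    if ! restic -r "$backup_dir" check; then
        echo "ERROR: Backup corrupted in $backup_dir"
        echo "restic check failed for $backup_dir" | mail -s "Backup Failure" admin@example.com
        status=1
        continue   # keep checking the remaining repositories
    fi

    # Check age (warn if older than 24 hours); use the newest snapshot, not the first one listed
    latest=$(restic -r "$backup_dir" snapshots --json | jq -r 'map(.time) | max')
    age_hours=$(( ($(date +%s) - $(date -d "$latest" +%s)) / 3600 ))

    if [ "$age_hours" -gt 24 ]; then
        echo "WARNING: Latest backup is $age_hours hours old"
        echo "Latest snapshot in $backup_dir is $age_hours hours old" | mail -s "Backup Too Old" admin@example.com
        status=1
    fi
done

[ "$status" -eq 0 ] && echo "All backups OK"
exit "$status"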

Run it weekly (Mondays at 02:00) from a file in /etc/cron.d/, where the sixth field names the user:

0 2 * * 1 root /usr/local/bin/check_backups.sh

The Three Types of Restore Tests

Cold Test: You actually restore and verify, but don’t cut over

- Most realistic
- Takes time
- Do this quarterly

Warm Test: You restore to a parallel environment and verify it matches

- Less disruptive
- Still thorough
- Do this monthly

Hot Test: You restore files and spot-check them without shutting down production

- Minimal risk
- Quick
- Do this monthly

What Gets You Fired

Not having backups: bad. Having backups you’ve never tested: worse. Restoring for the first time when you’re in an actual disaster: career-ending.

Spend 2 hours a month testing your backups. It’s insurance.

When the disk dies at 2 AM and you go to restore, you’ll be the one who saves the day instead of being the person who lost everything.
