Dedicated Server Disaster Recovery: Building a Plan That Actually Works

HostingGuider uses affiliate links. We may earn a commission if you purchase through them, at no extra cost to you.

Most disaster recovery plans fail when they are needed most.

Not because they were written badly. Because they were never tested. A backup script that silently fails for three months leaves you with nothing when the disk dies. A recovery procedure that has never been run takes three times longer than expected under pressure at 2am.

Real disaster recovery is not a document. It is a tested, practiced system that you can execute while stressed, sleep-deprived, and under pressure.

This guide builds that system from the ground up. You will define what recovery actually means for your server, set up automated backups that verify themselves, write recovery procedures you can follow without thinking, and test the whole plan before you need it.

Key Takeaways

  • A backup you have never restored from is not a backup. It is an unverified assumption
  • RTO and RPO must be defined before you can build a meaningful plan. They drive every other decision
  • The 3-2-1 backup rule is the minimum baseline for any production server
  • Encrypted off-site backups with Restic cost almost nothing and survive scenarios that on-server backups cannot
  • Testing recovery procedures quarterly is not optional if you want them to work
  • Every disaster scenario needs its own recovery procedure. One generic procedure covers nothing adequately

Start Here: Define Your RTO and RPO

Before you touch a single tool, answer two questions. These answers drive every decision in your plan.

RTO (Recovery Time Objective) is the maximum acceptable time your server can be offline after a disaster.

For a personal blog, four hours of downtime is inconvenient. For a WooCommerce store processing orders continuously, four hours of downtime costs real money. For a SaaS application, four minutes may be unacceptable.

Write down your RTO honestly. Not aspirationally. The number you would tell your users if they asked.

RPO (Recovery Point Objective) is the maximum amount of data you can afford to lose.

If you back up once a day at midnight and the server fails at 11pm, you lose 23 hours of data. Is that acceptable? For a blog, probably yes. For a membership site collecting user registrations all day, probably not.

Write down your RPO honestly.

| Your Business Type | Typical RTO | Typical RPO |
|---|---|---|
| Personal blog | 4-24 hours | 24 hours |
| Business brochure site | 2-4 hours | 24 hours |
| Content site with ad revenue | 1-2 hours | 4-8 hours |
| WooCommerce store | 15-30 minutes | 1-2 hours |
| Membership or SaaS | Under 15 minutes | Under 1 hour |
| Critical business system | Under 5 minutes | Minutes |

Your RTO tells you what failover infrastructure you need. A 4-hour RTO means you can rebuild from scratch if needed. A 15-minute RTO means you need a warm standby server ready to take over immediately.

Your RPO tells you how frequently you need to back up. A 24-hour RPO means daily backups suffice. A 1-hour RPO means hourly backups or continuous replication.
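
An RPO only helps if something checks it. The following is a minimal sketch, assuming GNU stat and a single backup file per run, that flags a backup older than the RPO allows. The function name backup_within_rpo is a hypothetical helper, not part of any tool mentioned in this guide.

```shell
#!/bin/bash
# Sketch: flag a backup file that is older than the RPO allows.
# backup_within_rpo is a hypothetical helper, not part of any backup tool.

# Usage: backup_within_rpo <file> <rpo_hours>
# Returns 0 if the file was modified within the last <rpo_hours> hours,
# 1 if it is older, and 2 if the file does not exist.
backup_within_rpo() {
  local file="$1" rpo_hours="$2"
  [ -f "$file" ] || return 2
  local now mtime age_seconds
  now=$(date +%s)
  mtime=$(stat -c %Y "$file")          # GNU stat: modification time in epoch seconds
  age_seconds=$(( now - mtime ))
  [ "$age_seconds" -le $(( rpo_hours * 3600 )) ]
}
```

For example, `backup_within_rpo /path/to/latest-dump.sql.gz 24 || echo "RPO breached"` could run from a monitoring job. The path is a placeholder.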

Disaster Scenarios to Plan For

Every scenario needs its own procedure. A generic plan that says "restore from backup" is not a plan. It is a starting point.

Scenario 1: Single disk failure

The server is running. One disk in a RAID array fails. The server stays online but is degraded.

Your procedure: Replace the failed disk, rebuild the RAID array, verify array health.

Scenario 2: Catastrophic disk failure / data corruption

All data is lost or corrupted. The server cannot boot. You need to rebuild from scratch.

Your procedure: Provision a new server, restore from backup, verify application functionality, switch DNS.

Scenario 3: Accidental data deletion

A developer ran the wrong command. A plugin deleted content. A database table was dropped accidentally.

Your procedure: Identify what was deleted and when, restore that specific data from the most recent backup before the deletion, verify data integrity.

Scenario 4: Security breach or ransomware

The server is compromised. Data may be encrypted or exfiltrated. The server cannot be trusted.

Your procedure: Take the server offline immediately, provision a clean replacement, restore from a backup predating the compromise, audit what was accessed, patch the vulnerability.

Scenario 5: Data centre outage

Your hosting provider’s data centre has an outage. The server is unreachable through no fault of your own.

Your procedure: Wait for provider updates, activate a temporary server in a different region if RTO requires it, communicate status to users.

Scenario 6: Human error on live configuration

A configuration change broke something. The server is partly functional but the application is throwing errors.

Your procedure: Identify the last working configuration from version control or backups, roll back the specific change, verify recovery.

Write a specific procedure for each scenario before a disaster happens. A procedure written during an outage is slow, incomplete, and error-prone.

The 3-2-1 Backup Rule

The 3-2-1 rule is the minimum standard for any production server. It means:

3 copies of your data. The live data on the server plus two independent backups.

2 different storage media. Local disk plus something else. USB drive, secondary server, cloud storage.

1 copy off-site. At least one backup must be geographically separate from the server. A backup on the same server that suffers a catastrophic failure is useless.

Most servers that get built without a plan end up with zero verified backups or one on-server backup that disappears along with the server when disaster strikes.

The 3-2-1 rule ensures that any single failure leaves you with at least two remaining copies of your data.

For a practical implementation:

  • Copy 1: Live data on the server
  • Copy 2: Daily backup on a different local volume or secondary server
  • Copy 3: Encrypted off-site backup in cloud storage (Backblaze B2, Cloudflare R2, or similar)

The object storage guide explains why object storage is the right choice for off-site backup storage specifically.

Setting Up Automated Backups

A backup system has three requirements. It must run automatically. It must verify that it ran successfully. It must alert you when it fails.

Most backup scripts fulfill only the first requirement. Implementing all three is what separates a real backup system from a false sense of security.

File Backups with tar and rsync

Create the backup directory structure:

sudo mkdir -p /var/backups/server/{daily,weekly,monthly}
sudo chmod 700 /var/backups/server

Create the backup script:

sudo nano /usr/local/bin/backup-files.sh

Paste this:

#!/bin/bash

# Exit on any error, on unset variables, and on failures inside pipelines
set -euo pipefail

DATE=$(date +%Y%m%d-%H%M)
BACKUP_DIR="/var/backups/server/daily"
WEB_ROOT="/var/www"
LOG="/var/log/backup.log"
RETENTION_DAYS=7
ARCHIVE="$BACKUP_DIR/web-$DATE.tar.gz"

echo "[$DATE] Starting file backup" >> "$LOG"

# Archive the web root, skipping caches, nested backups, and log files
tar -czf "$ARCHIVE" \
  --exclude="$WEB_ROOT/*/wp-content/cache" \
  --exclude="$WEB_ROOT/*/wp-content/uploads/backups" \
  --exclude="*.log" \
  "$WEB_ROOT" 2>> "$LOG"

SIZE=$(du -sh "$ARCHIVE" | cut -f1)
echo "[$DATE] File backup complete. Size: $SIZE" >> "$LOG"

# Delete daily archives older than the retention window
find "$BACKUP_DIR" -name "web-*.tar.gz" -mtime +"$RETENTION_DAYS" -delete
echo "[$DATE] Cleaned backups older than $RETENTION_DAYS days" >> "$LOG"

The set -euo pipefail at the top is critical. It causes the script to exit immediately if any command fails. Without it, a failed backup step is silently ignored and the script reports success.
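
Exiting on failure is only half of the job. The log should also say where the script stopped. The following is a minimal sketch of an ERR trap that could be pasted below the set -euo pipefail line. The on_error name and the demo log path are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: record which line failed before set -e stops the script.
# Paste below the existing "set -euo pipefail" line of a backup script.
# on_error is a hypothetical helper name; LOG defaults to a demo path here.

LOG="${LOG:-/tmp/backup-demo.log}"

on_error() {
  local line="$1"
  echo "[$(date +%Y%m%d-%H%M)] BACKUP FAILED at line $line" >> "$LOG"
}

# ERR fires when a command returns non-zero outside of if/while/&&/|| tests.
trap 'on_error "$LINENO"' ERR
```

With this in place, a failed tar or find leaves a "BACKUP FAILED" line in the log instead of the log simply ending early.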

Make it executable:

sudo chmod +x /usr/local/bin/backup-files.sh

Database Backups with mysqldump

Create a separate database backup script:

sudo nano /usr/local/bin/backup-databases.sh

Paste:

#!/bin/bash

# Exit on any error, including a mysqldump failure inside the pipeline
set -euo pipefail

DATE=$(date +%Y%m%d-%H%M)
BACKUP_DIR="/var/backups/server/daily"
LOG="/var/log/backup.log"
RETENTION_DAYS=7
MYSQL_USER="backup_user"
MYSQL_PASS="your_backup_password"
DUMP_FILE="$BACKUP_DIR/databases-$DATE.sql.gz"

echo "[$DATE] Starting database backup" >> "$LOG"

# --single-transaction takes a consistent InnoDB snapshot without locking tables
mysqldump \
  -u "$MYSQL_USER" \
  -p"$MYSQL_PASS" \
  --all-databases \
  --single-transaction \
  --quick \
  --lock-tables=false \
  --routines \
  --triggers \
  --events \
  | gzip > "$DUMP_FILE"

SIZE=$(du -sh "$DUMP_FILE" | cut -f1)
echo "[$DATE] Database backup complete. Size: $SIZE" >> "$LOG"

# Delete daily dumps older than the retention window
find "$BACKUP_DIR" -name "databases-*.sql.gz" -mtime +"$RETENTION_DAYS" -delete
echo "[$DATE] Cleaned database backups older than $RETENTION_DAYS days" >> "$LOG"

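Create a dedicated read-only backup user in MySQL:

sudo mysql
CREATE USER 'backup_user'@'localhost' IDENTIFIED BY 'your_backup_password';
GRANT SELECT, RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT, SHOW VIEW, EVENT, TRIGGER ON *.* TO 'backup_user'@'localhost';
FLUSH PRIVILEGES;
EXIT;

This user has read-only access. Even if the backup script is compromised, it cannot modify or delete database data. PROCESS is included because mysqldump in MySQL 5.7.31 and 8.0.21 onward needs it to dump tablespace information. If you prefer not to grant it, add --no-tablespaces to the mysqldump command instead.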

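Make it executable. The script contains the database password, so restrict it to root. A plain chmod +x would leave it readable by every user on the server:

sudo chmod 700 /usr/local/bin/backup-databases.sh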
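
A password passed with -p on the command line can show up in the process list while mysqldump runs. One common alternative is a root-only option file read with --defaults-extra-file, a standard MySQL client option that must come first on the command line. The following is a minimal sketch. The write_backup_cnf helper and the file path are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: keep the MySQL backup password in a root-only option file
# instead of on the mysqldump command line.
# write_backup_cnf and the example path are illustrative assumptions.

# Usage: write_backup_cnf <path> <user> <password>
write_backup_cnf() {
  local path="$1" user="$2" pass="$3"
  # umask 077 makes a newly created file mode 600 before any secret is written.
  ( umask 077 && printf '[client]\nuser=%s\npassword="%s"\n' "$user" "$pass" > "$path" )
}

# Example (run as root):
#   write_backup_cnf /root/.backup-my.cnf backup_user 'your_backup_password'
#   mysqldump --defaults-extra-file=/root/.backup-my.cnf --all-databases ... | gzip > ...
```

If you adopt this, the -u and -p lines in the backup script become unnecessary, because the client reads both values from the file.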

Encrypted Off-Site Backups with Restic

Restic is a modern backup tool that provides deduplication, encryption, and direct upload to cloud storage. It is the right tool for off-site backups.

Install Restic:

sudo apt install restic -y

Create a Backblaze B2 bucket or a Cloudflare R2 bucket for your backup storage. Get the access key and secret key from the provider.

Initialise a Restic repository in your cloud bucket:

export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export RESTIC_REPOSITORY="s3:https://s3.us-west-000.backblazeb2.com/your-bucket-name"
export RESTIC_PASSWORD="strong_encryption_password_store_this_safely"

restic init

The repository is now initialised and encrypted. The password is required to restore from this backup. Store it somewhere safe and separate from the server. A password manager, an encrypted note, a physical printout in a secure location.

Create the off-site backup script:

sudo nano /usr/local/bin/backup-offsite.sh

Paste:

#!/bin/bash

set -euo pipefail

export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export RESTIC_REPOSITORY="s3:https://s3.us-west-000.backblazeb2.com/your-bucket-name"
export RESTIC_PASSWORD="your_encryption_password"

LOG="/var/log/backup.log"
DATE=$(date +%Y%m%d-%H%M)

echo "[$DATE] Starting off-site backup" >> "$LOG"

# Back up web files, the local dumps and archives, and system configuration
restic backup \
  /var/www \
  /var/backups/server/daily \
  /etc \
  --exclude="/var/www/*/wp-content/cache" \
  --exclude="*.log" \
  --tag "daily" \
  >> "$LOG" 2>&1

# Apply the retention policy and remove data no remaining snapshot references
restic forget \
  --keep-daily 14 \
  --keep-weekly 8 \
  --keep-monthly 6 \
  --prune \
  >> "$LOG" 2>&1

echo "[$DATE] Off-site backup complete" >> "$LOG"

The forget --prune command removes old backups according to the retention policy. 14 daily backups, 8 weekly, 6 monthly. This keeps backup storage costs controlled.

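Make it executable. The script contains the storage keys and the Restic password, so restrict it to root:

sudo chmod 700 /usr/local/bin/backup-offsite.sh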
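
The same four export lines are needed every time you run a Restic command by hand. One way to avoid repeating them is a single root-only environment file that the scripts and your shell both load. The following is a minimal sketch, assuming GNU stat. The load_restic_env helper and the example path are hypothetical.

```shell
#!/bin/bash
# Sketch: load Restic credentials from one root-only file.
# load_restic_env and /etc/restic/env are hypothetical names for illustration.

# Usage: load_restic_env <env_file>
# The file holds plain assignments such as RESTIC_REPOSITORY="..." (no "export").
load_restic_env() {
  local env_file="$1" mode
  [ -r "$env_file" ] || { echo "missing $env_file" >&2; return 1; }
  # Refuse files that group or others can read, since they hold secrets.
  mode=$(stat -c %a "$env_file")
  case "$mode" in
    600|400) ;;
    *) echo "refusing $env_file: mode $mode, expected 600" >&2; return 1 ;;
  esac
  set -a          # export every variable assigned while sourcing
  . "$env_file"
  set +a
}

# Example: load_restic_env /etc/restic/env && restic snapshots
```

Keeping the secrets in one file also means a key rotation touches one place rather than every script.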

Automate All Three with Cron

sudo crontab -e

Add:

# File and database backups: 1am every day
0 1 * * * /usr/local/bin/backup-files.sh
15 1 * * * /usr/local/bin/backup-databases.sh

# Off-site backup after local backups complete: 2am every day
0 2 * * * /usr/local/bin/backup-offsite.sh

# Weekly backup to separate retention folder: Sundays at 3am
# Note: % is special in crontab entries and must be escaped as \%
0 3 * * 0 tar -czf /var/backups/server/weekly/web-$(date +\%Y\%m\%d).tar.gz /var/www && mysqldump --all-databases -u backup_user -p'your_backup_password' | gzip > /var/backups/server/weekly/databases-$(date +\%Y\%m\%d).sql.gz
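
The weekly job writes to the weekly folder, but nothing in the schedule above removes old weekly archives, so that folder grows without limit. The following is a minimal sketch of a pruning helper, assuming GNU find. The prune_older_than name and the 56-day window (8 weeks) are assumptions, not values from the scripts above.

```shell
#!/bin/bash
# Sketch: delete backup files older than a given number of days.
# prune_older_than is a hypothetical helper name.

# Usage: prune_older_than <dir> <name_pattern> <days>
prune_older_than() {
  local dir="$1" pattern="$2" days="$3"
  # -maxdepth 1 keeps the deletion inside the given folder only.
  find "$dir" -maxdepth 1 -type f -name "$pattern" -mtime +"$days" -delete
}

# Example, keeping roughly 8 weeks of weekly archives:
#   prune_older_than /var/backups/server/weekly 'web-*.tar.gz' 56
#   prune_older_than /var/backups/server/weekly 'databases-*.sql.gz' 56
```

Moving the weekly commands into their own script that ends with these calls also avoids the long one-line cron entry.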

Add Backup Monitoring with Healthchecks

A backup that runs but fails silently is worthless. Healthchecks.io monitors your cron jobs by expecting a ping on a schedule. If the ping does not arrive, you receive an alert.

Create a free account. Add a check for your backup job. Set the expected period to 25 hours (slightly more than daily).

Update your off-site backup script to ping on success:

# Add this as the last line of backup-offsite.sh
curl -fsS "https://hc-ping.com/YOUR-HEALTHCHECK-UUID" > /dev/null

If the backup script fails at any point, set -euo pipefail makes it exit before reaching the ping. Healthchecks notices the missing ping and alerts you. Without an external check like this, a backup that has stopped running can go unnoticed for months.
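
Healthchecks also accepts a /start and a /fail suffix on the ping URL, which lets it measure run time and alert on failure immediately instead of waiting for the period to expire. The following is a minimal sketch. The hc_url and hc_ping helper names are assumptions, and the UUID is the same placeholder used above.

```shell
#!/bin/bash
# Sketch: start, success, and failure pings for Healthchecks.
# hc_url and hc_ping are hypothetical helper names.

HC_URL="https://hc-ping.com/YOUR-HEALTHCHECK-UUID"

# Build the ping URL for an event: "start", "fail", or empty for success.
hc_url() {
  local event="${1:-}"
  if [ -n "$event" ]; then
    echo "$HC_URL/$event"
  else
    echo "$HC_URL"
  fi
}

# Send a ping. A monitoring hiccup should not abort the backup itself.
hc_ping() {
  curl -fsS -m 10 --retry 3 -o /dev/null "$(hc_url "${1:-}")" || true
}

# In backup-offsite.sh:
#   hc_ping start
#   trap 'hc_ping fail' ERR
#   ... restic backup and restic forget ...
#   hc_ping
```

If the success ping itself cannot be delivered, the check still goes late and alerts, so swallowing the curl error does not hide a real outage.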

Verify That Your Backups Actually Work

A backup you have never restored is a backup you cannot trust.

Testing restores is not something you do once. It is something you do on a schedule. Quarterly is the minimum for production servers.

Verify Local Backups

Check that the backup files exist and are not empty:

ls -lh /var/backups/server/daily/

All files should have a non-zero size. A zero-byte file means the backup command ran but produced no output.

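Test that the archive is not corrupted:

tar -tzf /var/backups/server/daily/web-LATEST.tar.gz > /dev/null && echo "archive OK"

This reads the whole archive and lists its contents without extracting. Replace web-LATEST with the actual file name. If the archive is truncated or damaged, tar exits with an error and "archive OK" is not printed.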

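Test that the database dump is valid:

gunzip -c /var/backups/server/daily/databases-LATEST.sql.gz | head -30
gunzip -c /var/backups/server/daily/databases-LATEST.sql.gz | tail -5

The first command should show SQL statements beginning with -- MySQL dump or similar. The second should end with a line starting -- Dump completed, which mysqldump writes only when it finishes. An empty output, an error, or a missing completion line means the dump is corrupted or incomplete.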
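
These manual checks can be scripted so they run after every backup. The following is a minimal sketch, assuming gzip-compressed mysqldump output and gzip-compressed tar archives as produced by the scripts above. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch: scripted versions of the manual archive and dump checks.
# verify_sql_dump and verify_tar_archive are hypothetical helper names.

# Returns 0 only if the gzip stream is intact and the dump has
# mysqldump's completion marker near the end.
verify_sql_dump() {
  local file="$1"
  [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
  gzip -t "$file" || return 1                    # full-stream integrity test
  gunzip -c "$file" | tail -n 5 | grep -q '^-- Dump completed' \
    || { echo "no completion marker: $file" >&2; return 1; }
}

# Returns 0 only if tar can read the whole archive.
verify_tar_archive() {
  local file="$1"
  [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
  tar -tzf "$file" > /dev/null                   # non-zero exit on a damaged archive
}
```

Calling both at the end of the backup scripts, before the Healthchecks ping, means a corrupt archive stops the ping and triggers an alert.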

Verify Off-Site Backups with Restic

Restic includes a built-in verification command:

export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export RESTIC_REPOSITORY="s3:https://s3.us-west-000.backblazeb2.com/your-bucket-name"
export RESTIC_PASSWORD="your_encryption_password"

restic check

This verifies the repository integrity without downloading all data. It confirms that the structure is intact and readable.

For a deeper verification that downloads and decrypts a sample of data:

restic check --read-data-subset=5%

This reads and verifies 5% of your backed-up data. Run this monthly.

List your recent snapshots:

restic snapshots

Confirm snapshots are appearing at the expected frequency. If you see a gap of more than 2 days, something stopped the backup.

Practice a Full Restore

At least once per quarter, restore a complete backup to a test environment. This does not need to be your production server. A cheap temporary VPS that you provision, restore to, verify, and then terminate.

The restore procedure should be documented. Follow the documentation during the test. Note any step that is unclear or that fails. Fix the documentation before the next test.

Restore files from Restic:

restic restore latest --target /tmp/restore-test/

This restores the most recent snapshot to /tmp/restore-test/. Compare the directory structure with your live server.

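Restore the databases. The dump was created with --all-databases, so it contains CREATE DATABASE and USE statements and restores every database under its original name, including the mysql system database with its user accounts. A database name given on the command line does not change that. Run the import only on the test server, never on production:

gunzip -c /var/backups/server/daily/databases-LATEST.sql.gz | mysql -u root -p

After import, check row counts for the database you care about. Replace your_database_name with its name:

mysql -u root -p -e "SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema='your_database_name' ORDER BY table_rows DESC LIMIT 10;"

Compare against the same query on your production database. For InnoDB tables, table_rows is an estimate, so small differences are expected. Use SELECT COUNT(*) on a few key tables when you need exact numbers.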

Writing the Recovery Runbook

A runbook is a step-by-step procedure document that someone can follow without prior knowledge of the system. Write it for a competent sysadmin who does not know your specific setup.

The runbook lives somewhere accessible even when the server is down. Options:

  • A printed binder in your home or office
  • A document in a cloud-synced folder (Google Drive, Dropbox)
  • A private GitHub or GitLab repository
  • A self-hosted wiki on a different server

Never store your only copy of the runbook on the server it protects.

Runbook Structure for Each Scenario

Write a separate section for each disaster scenario you identified earlier. Each section follows this structure.

Scenario name: Catastrophic disk failure on primary server

Symptoms:

  • Server unresponsive
  • SSH connection refused
  • Hosting provider control panel shows server offline

Severity: Critical (RTO: 2 hours, RPO: 24 hours)

Step 1: Notify stakeholders. Message [contact name] via [contact method] immediately. Do not wait until recovery is complete.

Step 2: Confirm the failure with the provider.

# Call provider support: [phone number]
# Check provider status page: [URL]

Step 3: Provision replacement server. Log into [provider name] control panel at [URL]. Create new server with these specifications: [OS, CPU, RAM, storage].

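Step 4: Restore files from Restic. Install Restic on the new server first. Restic recreates each backed-up path underneath the target, so restore to / to put /var/www back in its original location. A target of /var/www/ would produce /var/www/var/www.

export AWS_ACCESS_KEY_ID="[key from password manager]"
export AWS_SECRET_ACCESS_KEY="[secret from password manager]"
export RESTIC_REPOSITORY="[repository URL]"
export RESTIC_PASSWORD="[password from password manager]"
restic restore latest --target / --include /var/www

Step 5: Restore databases. The database dumps are stored in the same snapshot, so restore that folder first, then import the most recent dump.

restic restore latest --target / --include /var/backups/server/daily
gunzip -c /var/backups/server/daily/databases-LATEST.sql.gz | mysql -u root -p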

Step 6: Restore configuration.

restic restore latest --target / --include /etc/nginx
restic restore latest --target / --include /etc/php
sudo nginx -t
sudo systemctl restart nginx php8.1-fpm mysql

Step 7: Verify sites are working. Add new server IP to local hosts file. Test each domain. Confirm no errors in logs.

Step 8: Switch DNS. Update A records at [registrar URL] to point to new server IP.

Step 9: Verify DNS propagation and remove hosts file entries.

Write this level of detail for every scenario. It looks tedious. During an incident at 3am, you will be grateful for every specific command.

The Contact List

Include in the runbook:

| Role | Name | Contact Method | Availability |
|---|---|---|---|
| Primary on-call | [Name] | [Phone] | 24/7 |
| Secondary on-call | [Name] | [Phone] | 24/7 |
| Hosting provider support | [Provider] | [Phone/URL] | [Hours] |
| Domain registrar support | [Registrar] | [Phone/URL] | [Hours] |
| DNS provider support | [Provider] | [Phone/URL] | [Hours] |
| Key stakeholder | [Name] | [Contact] | [Hours] |

Key Credentials Location

The runbook should reference where credentials are stored but not include them inline. A compromised runbook document should not give an attacker everything they need.

Restic encryption password: [stored in password manager under "Server Backup Restic"]
Hosting provider login: [stored in password manager under "Provider Name Admin"]
Database root password: [stored in password manager under "Production MySQL Root"]

Warm Standby for Aggressive RTO Targets

For RTO targets under 30 minutes, you need a warm standby server. You cannot build from scratch fast enough.

A warm standby is a second server in a different location that receives continuous or near-continuous replication of your primary server data. When the primary fails, you switch DNS to the standby. Users reconnect. Recovery time is roughly the DNS propagation window: about 5 minutes if the records already had a TTL of 300 seconds before the failure, though some resolvers cache for longer.

File replication with rsync and cron:

On the standby server, set up periodic rsync from the primary:

# On standby server, cron job every 15 minutes
*/15 * * * * rsync -avz --delete -e "ssh -i ~/.ssh/sync_key" root@PRIMARY_IP:/var/www/ /var/www/

Database replication with MySQL:

For near-real-time database replication, set up MySQL binary log replication. The primary server writes all changes to a binary log. The standby reads the log and applies the same changes.

On the primary, edit MySQL config:

sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf

Add under [mysqld]:

server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_expire_logs_seconds = 604800
max_binlog_size = 100M
binlog_do_db = your_database_name

On the standby, configure it as a replica:

sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf

Add under [mysqld]:

server-id = 2
relay_log = /var/log/mysql/mysql-relay-bin.log
log_bin = /var/log/mysql/mysql-bin.log

Set up replication by following the MySQL replication documentation for your version. Replication lag should stay under 1 second under normal conditions.

With file rsync every 15 minutes and MySQL replication in near-real-time, your effective RPO is 15 minutes for files and near-zero for database data.
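
That near-zero database RPO holds only while replication is running and keeping up, so it is worth checking the lag on a schedule. The following is a minimal sketch that extracts the lag from the status output. The replica_lag_seconds helper is hypothetical. The field is named Seconds_Behind_Source in MySQL 8.0.22 and later and Seconds_Behind_Master in older versions, and the sketch accepts both.

```shell
#!/bin/bash
# Sketch: extract replication lag from "SHOW REPLICA STATUS\G" output.
# replica_lag_seconds is a hypothetical helper name.

# Reads the status output on stdin and prints the lag in seconds.
# Returns 1 and prints nothing if the field is absent or NULL,
# which usually means replication is not running.
replica_lag_seconds() {
  local lag
  lag=$(awk -F': ' '/Seconds_Behind_(Source|Master):/ {print $2; exit}')
  case "$lag" in
    ''|NULL) return 1 ;;
    *) echo "$lag" ;;
  esac
}

# Example on the standby (older MySQL versions use SHOW SLAVE STATUS):
#   sudo mysql -e 'SHOW REPLICA STATUS\G' | replica_lag_seconds
```

A cron job on the standby could alert when the function returns 1 or prints a value above a threshold you choose.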

Testing Your Disaster Recovery Plan

The test schedule is as important as the plan itself.

Monthly test: Backup integrity

Run the backup verification steps. Confirm backup files exist, are non-corrupted, and that Restic reports repository health. Takes 10-15 minutes.

Quarterly test: Full restore to test environment

Provision a temporary server. Follow the runbook step by step as if it were a real disaster. Restore from backup. Bring the application online. Verify every major function works. Note the actual time taken. Compare against your RTO. If the test took longer than your RTO allows, either improve your procedures or revise your RTO.

After the test, record:

  • Date and duration of the test
  • Any procedure that failed or was unclear
  • Any missing credentials or access issues discovered
  • Actual time to recovery versus RTO target
  • What was fixed in the runbook afterward
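
Timing the drill by hand is easy to forget under pressure. The following is a minimal sketch that turns a measured duration into a one-line verdict for the test record. The rto_verdict helper is hypothetical, and the 120-minute RTO in the example is only an illustration.

```shell
#!/bin/bash
# Sketch: compare a measured restore time with the RTO target.
# rto_verdict is a hypothetical helper name.

# Usage: rto_verdict <elapsed_minutes> <rto_minutes>
rto_verdict() {
  local elapsed_min="$1" rto_min="$2"
  if [ "$elapsed_min" -le "$rto_min" ]; then
    echo "PASS: restored in ${elapsed_min} min (RTO ${rto_min} min)"
  else
    echo "FAIL: restored in ${elapsed_min} min, $(( elapsed_min - rto_min )) min over RTO"
  fi
}

# During a drill:
#   start=$(date +%s)
#   ... follow the runbook ...
#   end=$(date +%s)
#   rto_verdict $(( (end - start) / 60 )) 120
```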

Annual test: Data centre simulation

Pretend your primary server is gone and your main credentials are unavailable. Use only what is in your off-site backup and your runbook. This test finds dependencies you forgot to document.

The monitoring article covers what to watch after a DR event to confirm full recovery. Monitoring and disaster recovery are complementary practices.

What Managed Hosting Changes

If you run managed cloud hosting, some of this responsibility shifts to your provider.

Cloudways provides automated daily backups with up to 4 weeks retention and one-click restoration. Their infrastructure spans multiple data centres, reducing the risk of total data loss from a single data centre event.

Kinsta includes automated daily backups for every site, with backup download available from the dashboard. They also provide system-level monitoring and incident response.

ScalaHosting managed VPS includes daily automated backups with one-click restore from SPanel.

With managed hosting, your DR responsibility shifts from infrastructure backup to application-level recovery. You still need to test restores. You still need a runbook. You still need to define RTO and RPO. But the backup infrastructure is handled.

The principle remains the same: trust the provider’s backups but also maintain your own. The 3-2-1 rule applies even when someone else manages the server.

Frequently Asked Questions

How long should I keep backups?

For most web servers, a practical retention schedule is 14 daily backups, 8 weekly backups, and 6 monthly backups. This gives you two weeks of daily granularity for finding recently deleted data, two months of weekly snapshots for finding data deleted or corrupted longer ago, and six months of monthly archives for compliance or long-term reference. Restic handles this automatically with the forget command. Adjust the retention based on your RPO requirements and storage budget.

What should I back up beyond just the website files?

Most teams back up web files and databases and miss important configuration that is harder to rebuild than it is to back up. Back up /etc/nginx/, /etc/apache2/, /etc/php/, /etc/mysql/, all cron job definitions, SSL certificates (or just the Let’s Encrypt auto-renewal setup), Fail2Ban configuration, and the server firewall rules from ufw status verbose. A server rebuilt from scratch without these takes hours to reconfigure. A server rebuilt with these takes minutes.
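
Some of that state lives outside /etc or exists only as command output. The following is a minimal sketch that writes it to a folder a backup job can pick up. The capture_server_state helper and the example output path are assumptions, and each capture is best-effort so a missing tool does not stop the others.

```shell
#!/bin/bash
# Sketch: dump server state that is not a plain config file.
# capture_server_state is a hypothetical helper name.

# Usage: capture_server_state <output_dir>   (run as root)
capture_server_state() {
  local out_dir="$1"
  mkdir -p "$out_dir"
  # Root's cron jobs as currently installed.
  crontab -l > "$out_dir/root-crontab.txt" 2>/dev/null || true
  # Firewall rules in the form shown by ufw.
  if command -v ufw >/dev/null 2>&1; then
    ufw status verbose > "$out_dir/ufw-rules.txt" 2>/dev/null || true
  fi
  # Installed packages, useful when rebuilding on a fresh server.
  if command -v dpkg >/dev/null 2>&1; then
    dpkg --get-selections > "$out_dir/packages.txt" 2>/dev/null || true
  fi
  date -u +%Y-%m-%dT%H:%M:%SZ > "$out_dir/captured-at.txt"
}

# Example: capture_server_state /var/backups/server/daily/state
```

Writing into a folder that the off-site job already includes means the state travels with every snapshot.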

How do I know my off-site backup is actually restorable?

Run restic check weekly to verify repository integrity. Run restic restore to a temporary directory quarterly to verify actual data recovery. Restic stores data as encrypted, checksummed blobs, so a corrupted or partially uploaded backup is detected when the data is actually read, which is what restic check --read-data-subset and a real restore do. Do not trust a backup you have not restored from. The restore test is the only reliable proof that the backup works.

What is the difference between a backup and a snapshot?

A backup is a copy of your data stored separately from the primary. A snapshot is a point-in-time image of a storage volume. Snapshots are fast to create and restore but exist on the same infrastructure as the primary. A snapshot on the same server does not protect against hardware failure or data centre outage. Use snapshots as a fast rollback mechanism for changes (before a major upgrade). Use backups as the protection against catastrophic failure. Both serve different purposes and both are needed.

How should I handle encryption passwords for off-site backups?

The encryption password must be stored somewhere independent of the server and accessible even when the server is completely gone. A password manager on your personal devices is the most practical option. A printed copy in a physically secure location provides a fallback if digital access is unavailable. Sharing the password with a trusted secondary contact ensures you can recover even if your personal devices are unavailable. Never store the password only on the server being backed up. Never store it only in your head.

What if my RTO is under 15 minutes? Is that achievable?

Yes, with the right architecture. A warm standby server receiving continuous replication combined with a pre-configured DNS failover brings recovery time under 5 minutes. Cloud load balancers with automatic health checks can fail over to a standby instance in seconds. These architectures cost significantly more than single-server setups. Calculate the hourly revenue or cost of downtime for your business. If one hour of downtime costs more than a month of warm standby infrastructure, the standby pays for itself quickly.
