
Disaster Recovery

Procedures for backup, restore, and recovery of the Freeze Design webshop.

Overview

RPO and RTO Targets

| Metric | Target | Actual |
|---|---|---|
| RPO (Recovery Point Objective) | 1 hour | 1 hour (hourly automated backups via Celery Beat) |
| RTO (Recovery Time Objective) | 4 hours | 20-35 minutes (database restore from S3) |

Storage and Retention

| Property | Value |
|---|---|
| Backup storage | AWS S3 bucket |
| Encryption | Server-side AES256 on S3 |
| Storage class | S3 Standard-IA (Infrequent Access) |
| Media files | DigitalOcean Spaces (separate from DB backups) |
| Backup format | Full PostgreSQL pg_dump, gzip compressed, MD5/SHA-256 checksum |
| Retention | 24 hourly + 7 daily + 4 weekly (~35 backups maintained) |

Backup Architecture

Components

  1. Celery Beat Scheduler (backend/config/celery.py) -- triggers hourly backups at minute 0
  2. BackupService (backend/apps/core/services/backup_service.py) -- orchestrates the backup pipeline
  3. AWS S3 -- encrypted backup storage with lifecycle retention

Process Flow

┌──────────────────┐
│   Celery Beat    │
│ (crontab min=0)  │
└────────┬─────────┘
         │ dispatches task
         ▼
┌──────────────────┐
│  Celery Worker   │
│  backup.daily_   │
│  backup          │
└────────┬─────────┘
         │ calls BackupService
         ▼
┌──────────────────┐
│  BackupService   │
│                  │
│  1. pg_dump      │──▶ Full SQL dump of PostgreSQL database
│  2. gzip -9      │──▶ Compress with gzip level 9
│  3. Encrypt      │──▶ AES encryption (if key configured)
│  4. SHA-256 hash │──▶ Integrity checksum
│  5. Upload to S3 │──▶ AES256 server-side encryption, Standard-IA
│  6. BackupRecord │──▶ Django model tracks metadata
│  7. Cleanup      │──▶ Remove temporary files
│                  │
└────────┬─────────┘
         │ uploads to
         ▼
┌──────────────────┐
│  AWS S3 Bucket   │
│  (encrypted,     │
│   Standard-IA)   │
└──────────────────┘

Data Flow Summary

  1. pg_dump creates a plain-format SQL dump (~100-500 MB uncompressed)
  2. gzip level 9 compresses to ~50-100 MB
  3. SHA-256 checksum calculated for integrity verification (MD5 also computed for backward compatibility)
  4. Uploaded to S3 with metadata (timestamp, database name, size, checksum)
  5. Temporary files cleaned up
  6. Retention enforced by daily cleanup task
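Steps 2-3 of this pipeline can be sketched with the Python standard library. This is an illustration only; the function name, paths, and chunk size are not taken from BackupService:

```python
import gzip
import hashlib
import shutil

def compress_and_checksum(src_path: str, dest_path: str, chunk_size: int = 1 << 20) -> str:
    """Gzip-compress src_path to dest_path; return the SHA-256 of the compressed file."""
    # Stream the dump through gzip at level 9 (matches the gzip -9 step)
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst, chunk_size)

    # Hash the compressed artifact in chunks so large dumps don't load into memory
    sha256 = hashlib.sha256()
    with open(dest_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest()
```

The returned hex digest is what would be stored as metadata alongside the S3 object and re-checked during verification.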

Automated Schedule

All times in UTC. Defined in backend/config/celery.py.

Hourly Backup

  • Schedule: Every hour at minute 0 (crontab(minute=0))
  • Task: backup.daily_backup
  • Timeout: Expires after 55 minutes (must complete before next hourly run)
  • Action: Full database backup, compress, upload to S3
'hourly-database-backup': {
    'task': 'backup.daily_backup',
    'schedule': crontab(minute=0),
    'options': {'expires': 3300},
}

Daily Cleanup

  • Schedule: 03:00 UTC daily (crontab(hour=3, minute=0))
  • Task: backup.cleanup_old_backups
  • Action: Enforces tiered retention policy

Retention logic:

  • Keep last 24 hourly backups
  • Keep 1 per day for the last 7 days
  • Keep 1 per week for the last 4 weeks
  • Delete everything else
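The retention logic above can be sketched as a pure function over backup timestamps. This is an illustrative sketch; the real cleanup task would additionally delete the corresponding S3 objects and BackupRecord rows:

```python
from datetime import datetime, timedelta

def backups_to_keep(timestamps: list[datetime], now: datetime) -> set[datetime]:
    """Return the subset of backup timestamps the tiered policy retains."""
    ordered = sorted(timestamps, reverse=True)  # newest first
    keep = set(ordered[:24])                    # last 24 hourly backups

    # 1 per day for the last 7 days: newest backup of each calendar day
    for day in range(7):
        day_date = (now - timedelta(days=day)).date()
        daily = [t for t in ordered if t.date() == day_date]
        if daily:
            keep.add(daily[0])

    # 1 per week for the last 4 weeks: newest backup in each 7-day window
    for week in range(4):
        window_end = now - timedelta(weeks=week)
        window_start = now - timedelta(weeks=week + 1)
        weekly = [t for t in ordered if window_start < t <= window_end]
        if weekly:
            keep.add(weekly[0])
    return keep
```

Because the tiers overlap (the newest daily and weekly picks usually fall inside the hourly window), the retained set stays at or below the ~35 backups quoted above.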

Daily Verification

  • Schedule: 04:00 UTC daily (crontab(hour=4, minute=0))
  • Task: backup.verify_latest_backup
  • Timeout: Expires after 30 minutes

Verification checks:

  • Backup file exists in S3
  • File size is reasonable (> 1 MB)
  • SHA-256 checksum matches stored metadata
  • Backup timestamp is recent (< 2 hours old)
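These checks reduce to a small pure function. This is a sketch with an illustrative name and signature; the production task would first confirm the object actually exists in S3 before evaluating the metadata:

```python
from datetime import datetime, timedelta

def verify_backup(size_bytes: int, actual_sha256: str, stored_sha256: str,
                  last_modified: datetime, now: datetime) -> list[str]:
    """Return a list of failed checks; an empty list means the backup passes."""
    failures = []
    if size_bytes <= 1 * 1024 * 1024:
        failures.append("size: file is not larger than 1 MB")
    if actual_sha256 != stored_sha256:
        failures.append("checksum: SHA-256 does not match stored metadata")
    if now - last_modified >= timedelta(hours=2):
        failures.append("freshness: backup is 2 or more hours old")
    return failures
```

Returning the full list of failures (rather than failing fast) makes the verification log actionable in one pass.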

Monthly Restore Test

  • Schedule: 1st of each month at 05:00 UTC (crontab(day_of_month=1, hour=5, minute=0))
  • Task: backup.test_restore
  • Timeout: Expires after 2 hours

Test procedure:

  1. Download latest backup from S3
  2. Create temporary test database
  3. Restore backup to test database
  4. Verify record counts match
  5. Clean up test database
  6. Report success/failure
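Taken together, the schedules above would correspond to Celery Beat entries along these lines in backend/config/celery.py. Only the hourly entry is quoted verbatim earlier on this page; the other entry names are illustrative, with expiry values derived from the timeouts described:

```python
from celery.schedules import crontab

# `app` is the project's Celery application instance
app.conf.beat_schedule = {
    'hourly-database-backup': {
        'task': 'backup.daily_backup',
        'schedule': crontab(minute=0),                          # every hour at minute 0
        'options': {'expires': 3300},                           # 55 minutes
    },
    'daily-backup-cleanup': {
        'task': 'backup.cleanup_old_backups',
        'schedule': crontab(hour=3, minute=0),                  # 03:00 UTC
    },
    'daily-backup-verification': {
        'task': 'backup.verify_latest_backup',
        'schedule': crontab(hour=4, minute=0),                  # 04:00 UTC
        'options': {'expires': 1800},                           # 30 minutes
    },
    'monthly-restore-test': {
        'task': 'backup.test_restore',
        'schedule': crontab(day_of_month=1, hour=5, minute=0),  # 1st of month, 05:00 UTC
        'options': {'expires': 7200},                           # 2 hours
    },
}
```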

Weekly CI Backup Test

  • Schedule: Sunday at 05:00 UTC
  • Workflow: .github/workflows/backup-test.yml
  • Environment: Fresh PostgreSQL 15 + Redis in GitHub Actions

CI test steps:

  1. Create test database with seed data (create_test_fixture)
  2. Export backup using export_seed_data
  3. Create PostgreSQL dump using pg_dump -F c
  4. Verify backup file size and contents
  5. Flush database
  6. Restore from backup (import_seed_data)
  7. Verify data integrity (record counts match original)
  8. Restore pg_dump to a separate database and verify
  9. Upload backup artifacts for review (retained 7 days)

Manual Backup Procedure

When to Use

  • Before risky deployments
  • Before database migrations (especially irreversible ones)
  • Before major data operations (bulk updates, deletions)
  • As part of disaster recovery testing

Using Django Management Command

# SSH to VPS
ssh production-vps
cd /opt/webshop

# Trigger manual backup (creates dump, compresses, uploads to S3)
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py backup_database

# Verify backup was created (check logs for S3 URL and backup info)
docker compose -f docker-compose.prod.yml logs backend | grep -i backup

Expected output:

Backup completed successfully: {
  'filename': 'backup_20260201_143022.sql.gz',
  's3_url': 's3://freezedesign-backups/backup_20260201_143022.sql.gz',
  'size_mb': 87.43,
  'checksum_sha256': 'a1b2c3d4...',
  'timestamp': '2026-02-01T14:30:22Z'
}

Using pg_dump Directly

For more control or when the Django management command is unavailable:

ssh production-vps
cd /opt/webshop

# Load database credentials
source .env

# Custom format (recommended -- supports parallel restore)
docker compose -f docker-compose.prod.yml exec db \
  pg_dump -U $DB_USER -d $DB_NAME -F c -f /tmp/manual_backup.dump

# Or plain SQL format (human-readable, works with psql)
docker compose -f docker-compose.prod.yml exec db \
  pg_dump -U $DB_USER -d $DB_NAME -F p -f /tmp/manual_backup.sql

# Copy backup out of container
docker cp $(docker compose -f docker-compose.prod.yml ps -q db):/tmp/manual_backup.dump ./manual_backup.dump

# Compress manually
gzip manual_backup.dump

Backup format reference:

| Flag | Format | Use Case |
|---|---|---|
| -F c | Custom (binary) | Recommended. Supports parallel restore with pg_restore |
| -F p | Plain (SQL text) | Human-readable. Restore with psql |
| -F t | Tar archive | Alternative archive format |
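When scripting manual backups, the format flag maps directly onto a pg_dump argument list. A small helper (illustrative, not part of the project) makes the choice explicit:

```python
def pg_dump_command(db_user: str, db_name: str, fmt: str, out_path: str) -> list[str]:
    """Build a pg_dump argument list for the given output format ('c', 'p', or 't')."""
    if fmt not in {"c", "p", "t"}:
        raise ValueError(f"unsupported pg_dump format: {fmt}")
    return ["pg_dump", "-U", db_user, "-d", db_name, "-F", fmt, "-f", out_path]
```

The resulting list can be passed to subprocess.run(cmd, check=True) so a non-zero pg_dump exit raises instead of silently producing a partial dump.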

Restore Procedure

Prerequisites

  • SSH access to production VPS
  • AWS CLI installed and configured (for downloading from S3)
  • Database credentials (from .env file on VPS)
  • Backup filename or timestamp to restore

Step 1: Download Backup from S3

ssh production-vps
cd /opt/webshop

# Set AWS credentials (if not already configured via ~/.aws/credentials)
export AWS_ACCESS_KEY_ID="your-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

# List available backups (most recent last)
aws s3 ls s3://freezedesign-backups/ | grep backup_

# Download specific backup
aws s3 cp s3://freezedesign-backups/backup_20260201_143022.sql.gz /tmp/

# Verify download
ls -lh /tmp/backup_*.sql.gz

Step 2: Decompress Backup

# Decompress the gzip backup
gunzip /tmp/backup_20260201_143022.sql.gz

# Verify decompressed file
ls -lh /tmp/backup_20260201_143022.sql

For custom-format backups (.dump files), skip decompression -- use directly with pg_restore.

Step 3: Stop Application

Stop services that write to the database to prevent conflicts during restore:

docker compose -f docker-compose.prod.yml stop backend celery celery-beat

# Verify services are stopped (db and redis should still be running)
docker compose -f docker-compose.prod.yml ps

Step 4: Restore to Database

For custom-format backups (.dump):

# Copy the backup into the database container first
docker cp /tmp/backup.dump \
  $(docker compose -f docker-compose.prod.yml ps -q db):/tmp/backup.dump

# Restore using pg_restore
docker compose -f docker-compose.prod.yml exec db \
  pg_restore -U $DB_USER -d $DB_NAME --clean --if-exists /tmp/backup.dump

Flags:

  • --clean -- drops database objects before recreating them
  • --if-exists -- uses IF EXISTS when dropping to avoid errors on missing objects

For SQL-format backups (.sql):

# Copy backup into the database container
docker cp /tmp/backup_20260201_143022.sql \
  $(docker compose -f docker-compose.prod.yml ps -q db):/tmp/backup.sql

# Restore using psql
docker compose -f docker-compose.prod.yml exec db \
  psql -U $DB_USER -d $DB_NAME -f /tmp/backup.sql

Step 5: Run Pending Migrations

The restored backup may predate the schema expected by the current code, so apply any pending migrations:

docker compose -f docker-compose.prod.yml exec backend \
  python manage.py migrate

# Verify all migrations are applied
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py showmigrations

Step 6: Restart Services

docker compose -f docker-compose.prod.yml up -d

# Verify all services are healthy
docker compose -f docker-compose.prod.yml ps

Step 7: Verify Health

# Check backend health endpoint
curl -f http://127.0.0.1:8000/api/health/

# Check frontend
curl -f http://127.0.0.1:3000/

# Tail logs for errors
docker compose -f docker-compose.prod.yml logs -f --tail=50

Step 8: Verify Data Integrity

# Open a Django shell in the backend container
docker compose -f docker-compose.prod.yml exec backend python manage.py shell
from apps.products.models import Product, Color, Size
from apps.orders.models import Order
from django.contrib.auth import get_user_model

User = get_user_model()

print(f"Products: {Product.objects.count()}")
print(f"Colors: {Color.objects.count()}")
print(f"Sizes: {Size.objects.count()}")
print(f"Orders: {Order.objects.count()}")
print(f"Users: {User.objects.count()}")

Compare counts with expected values or pre-incident counts.
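A small helper (illustrative, not part of the project) can diff the counts gathered in the shell against pre-incident numbers, surfacing only the models that diverge:

```python
def compare_record_counts(expected: dict[str, int], actual: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {model: (expected, actual)} for every model whose counts differ."""
    return {
        model: (expected[model], actual.get(model, 0))
        for model in expected
        if expected[model] != actual.get(model, 0)
    }
```

An empty result means every tracked model matches; any entry names a model to investigate before declaring the restore complete.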

Disaster Recovery Scenarios

Scenario A: Database Corruption

Situation: Database tables corrupted due to disk failure, software bug, or hardware issue.

Symptoms:

  • Database queries return errors
  • Application crashes with database errors
  • Data appears inconsistent or missing

Estimated downtime: 20-35 minutes
Data loss: Up to 1 hour (RPO)

Procedure:

  1. Identify corruption scope -- determine which tables or data are affected
  2. Select restore point -- choose the most recent backup from before corruption occurred
  3. Follow the full Restore Procedure (steps 1-8)
  4. Verify data integrity in the Django shell
  5. Resume operations and monitor logs

Scenario B: Complete VPS Failure

Situation: VPS hardware failure, cloud provider outage, or catastrophic system failure.

Symptoms:

  • VPS unreachable via SSH
  • Application completely down
  • No response from any service

Estimated downtime: 2-4 hours (within RTO target)
Data loss: Up to 1 hour (RPO)

Procedure:

  1. Provision new VPS (30-60 min):

    • Spin up new VPS instance (same specs: 2 GB RAM, 2 CPU cores minimum)
    • Configure firewall rules (ports 22, 80, 443)
    • Install Docker and Docker Compose
  2. Deploy infrastructure (15-30 min):

    # Clone repository (use SSH deploy key)
    git clone git@github.com:your-org/webshop_freeze_design.git /opt/webshop
    cd /opt/webshop
    
    # Copy environment variables from secure backup / password manager
    # (keep a copy of .env in a secure location outside the VPS)
    nano .env
    
    # Pull Docker images from GHCR
    docker compose -f docker-compose.prod.yml pull
    
    # Start infrastructure services first
    docker compose -f docker-compose.prod.yml up -d db redis
    
  3. Create the database (the container does not auto-create it):

    docker compose -f docker-compose.prod.yml exec db \
      psql -U $DB_USER -d postgres -c "CREATE DATABASE $DB_NAME;"
    
  4. Restore database from S3 backup (20-35 min) -- follow Restore Procedure steps 1-5

  5. Start application services:

    docker compose -f docker-compose.prod.yml up -d
    
  6. Configure DNS (5-30 min): Point freezedesign.eu A record to new VPS IP. Check propagation:

    dig @8.8.8.8 freezedesign.eu A +short
    
  7. Configure SSL (5-10 min): Run Let's Encrypt certificate generation via certbot

  8. Verify functionality: Full smoke test -- health endpoints, product listing, checkout flow

Downtime breakdown:

| Step | Time |
|---|---|
| VPS provisioning | 30-60 min |
| Infrastructure deployment | 15-30 min |
| Database restore | 20-35 min |
| DNS propagation | 5-30 min |
| SSL certificate | 5-10 min |
| Verification | 10-15 min |
| Total | ~2-4 hours |

Scenario C: Accidental Data Deletion

Situation: User data, orders, or products accidentally deleted (bulk delete, admin error, script bug).

Symptoms:

  • Reports of missing data
  • Empty tables or reduced record counts
  • User complaints about lost orders or designs

Estimated downtime: 15-35 minutes
Data loss: Minimal (data between last backup and deletion)

Procedure:

  1. Identify deletion timestamp -- check admin logs, application logs, or user reports
  2. Stop application to prevent further changes:

    docker compose -f docker-compose.prod.yml stop backend celery celery-beat
    
  3. Select backup from before the deletion timestamp

  4. Choose restore approach:

    Option A -- Selective restore (if only specific tables affected):

    # Download and decompress the backup, then copy it into the db container
    docker cp /tmp/backup.dump \
      $(docker compose -f docker-compose.prod.yml ps -q db):/tmp/backup.dump

    # Restore only the affected tables. Note: --data-only assumes the target
    # tables are empty; truncate surviving rows first or expect duplicate-key errors
    docker compose -f docker-compose.prod.yml exec db \
      pg_restore -U $DB_USER -d $DB_NAME \
        -t orders_order -t orders_orderitem \
        --data-only \
        /tmp/backup.dump
    

    Option B -- Full restore (if deletion scope is unclear):

    Follow the complete Restore Procedure.

  5. Verify that deleted data is restored

  6. Restart services

Scenario D: Failed Database Migration

Situation: Database migration fails partway through or causes data corruption.

Symptoms:

  • migrate command fails with an error
  • Application errors after migration
  • Data inconsistencies after schema change

Estimated downtime: 20-35 minutes (restore) + time to fix migration
Data loss: Depends on approach

Procedure:

  1. Attempt migration rollback (if the migration is reversible):

    docker compose -f docker-compose.prod.yml exec backend \
      python manage.py migrate <app_name> <previous_migration_number>
    
  2. If rollback fails or migration is irreversible:

    • Stop the application
    • Restore from backup taken before the migration (this is why you take a manual backup before risky migrations)
    • Follow the full Restore Procedure
    • Verify schema and data integrity
  3. Fix the migration -- correct the migration code, test in a local or staging environment

  4. Re-apply the fixed migration

Prevention:

  • Always take a manual backup before irreversible migrations
  • Test migrations on staging first
  • Deploy migrations separately from application code changes
  • Write migrations to be reversible when possible

Monitoring and Alerting

Backup Task Logs

All backup tasks log to the apps.core.backup logger.

# View recent backup task logs
docker compose -f docker-compose.prod.yml logs celery | grep -i backup

# Follow backup logs in real time
docker compose -f docker-compose.prod.yml logs -f celery

Success output:

INFO apps.core.backup: Backup completed successfully: {
  'filename': 'backup_20260201_143022.sql.gz',
  's3_url': 's3://freezedesign-backups/backup_20260201_143022.sql.gz',
  'size_mb': 87.43, ...
}

Failure output:

ERROR apps.core.backup: Backup failed: pg_dump failed: ...
ERROR apps.core.backup: S3 upload failed: ...

Discord Notifications

Backup task failures trigger Discord notifications to the #backups channel via the notify-discord.sh script. Both the weekly CI backup test and production backup tasks send Discord alerts on failure.

Daily Verification Task

Runs at 04:00 UTC every day. Checks:

  • Latest backup exists in S3
  • File size > 1 MB
  • Checksum matches metadata
  • Backup is less than 2 hours old

Failures are logged and should trigger manual investigation.

Weekly CI Backup Test

Runs every Sunday at 05:00 UTC in GitHub Actions (.github/workflows/backup-test.yml). Performs a full backup/restore cycle and reports results via GitHub Actions summary and Discord notification.

Health Check Commands

# List recent backups in S3
aws s3 ls s3://freezedesign-backups/ | tail -5

# Check the latest backup timestamp
aws s3 ls s3://freezedesign-backups/ | tail -1

# Check Celery Beat is scheduling tasks
docker compose -f docker-compose.prod.yml logs celery-beat | tail -20

# Check recent backup task results
docker compose -f docker-compose.prod.yml logs celery | grep -i backup | tail -20

RTO/RPO Evidence

RPO: 1 Hour

  • Mechanism: Hourly automated backups via Celery Beat (crontab(minute=0))
  • Maximum data loss: 1 hour (worst case: failure occurs just before the next hourly backup runs)
  • Verification: Daily verification task confirms latest backup is < 2 hours old

RTO: 20-35 Minutes (Actual)

| Step | Time Estimate | Notes |
|---|---|---|
| Identify issue | 5 min | UptimeRobot alert + Discord notification |
| Download backup from S3 | 2-5 min | ~50-100 MB compressed |
| Decompress backup | 1-2 min | gzip decompression |
| Stop application | 1 min | docker compose stop |
| Restore database (pg_restore) | 5-15 min | Depends on database size (100-500 MB) |
| Run migrations | 1-2 min | Usually fast (idempotent) |
| Restart services | 2-3 min | docker compose up -d |
| Health verification | 2-5 min | Smoke tests, check logs |
| Total | 20-35 min | Well within the 4-hour target |

Continuous Verification

| Frequency | Check | Source |
|---|---|---|
| Hourly | Backup creation and S3 upload | Celery Beat task |
| Daily | Verification of latest backup integrity | backup.verify_latest_backup task |
| Weekly | Full backup/restore cycle in CI | .github/workflows/backup-test.yml |
| Monthly | Automated restore test to temporary database | backup.test_restore task |