Disaster Recovery¶
Procedures for backup, restore, and recovery of the Freeze Design webshop.
Overview¶
RPO and RTO Targets¶
| Metric | Target | Actual |
|---|---|---|
| RPO (Recovery Point Objective) | 1 hour | 1 hour (hourly automated backups via Celery Beat) |
| RTO (Recovery Time Objective) | 4 hours | 20-35 minutes (database restore from S3) |
Storage and Retention¶
| Property | Value |
|---|---|
| Backup storage | AWS S3 bucket |
| Encryption | Server-side AES256 on S3 |
| Storage class | S3 Standard-IA (Infrequent Access) |
| Media files | DigitalOcean Spaces (separate from DB backups) |
| Backup format | Full PostgreSQL pg_dump, gzip compressed, MD5/SHA-256 checksum |
| Retention | 24 hourly + 7 daily + 4 weekly (~35 backups maintained) |
Backup Architecture¶
Components¶
- Celery Beat Scheduler (`backend/config/celery.py`) -- triggers hourly backups at minute 0
- BackupService (`backend/apps/core/services/backup_service.py`) -- orchestrates the backup pipeline
- AWS S3 -- encrypted backup storage with lifecycle retention
Process Flow¶
```
┌──────────────────┐
│   Celery Beat    │
│ (crontab min=0)  │
└────────┬─────────┘
         │ dispatches task
         ▼
┌──────────────────┐
│  Celery Worker   │
│  backup.daily_   │
│      backup      │
└────────┬─────────┘
         │ calls BackupService
         ▼
┌──────────────────┐
│  BackupService   │
│                  │
│ 1. pg_dump       │──▶ Full SQL dump of PostgreSQL database
│ 2. gzip -9       │──▶ Compress (~9:1 ratio)
│ 3. Encrypt       │──▶ AES encryption (if key configured)
│ 4. SHA-256 hash  │──▶ Integrity checksum
│ 5. Upload to S3  │──▶ AES256 server-side encryption, Standard-IA
│ 6. BackupRecord  │──▶ Django model tracks metadata
│ 7. Cleanup       │──▶ Remove temporary files
│                  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  AWS S3 Bucket   │
│   (encrypted,    │
│   Standard-IA)   │
└──────────────────┘
```
Data Flow Summary¶
- `pg_dump` creates a plain-format SQL dump (~100-500 MB uncompressed)
- `gzip` level 9 compresses to ~50-100 MB
- SHA-256 checksum calculated for integrity verification (MD5 also computed for backward compatibility)
- Uploaded to S3 with metadata (timestamp, database name, size, checksum)
- Temporary files cleaned up
- Retention enforced by daily cleanup task
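Steps 2 and 4 of this flow can be sketched in shell. The snippet below is illustrative only: it generates a small stand-in file instead of real `pg_dump` output, and omits the dump and S3 upload steps, which need live infrastructure.

```shell
#!/bin/sh
# Illustrative sketch of pipeline steps 2 (gzip -9) and 4 (SHA-256 checksum).
# /tmp/sample_dump.sql is a stand-in for real pg_dump output.
set -eu
DUMP=/tmp/sample_dump.sql
printf 'SELECT 1;\n' > "$DUMP"

# Step 2: compress at maximum level
gzip -9 -c "$DUMP" > "$DUMP.gz"

# Step 4: record an integrity checksum alongside the archive
sha256sum "$DUMP.gz" | awk '{print $1}' > "$DUMP.gz.sha256"

echo "compressed size: $(stat -c%s "$DUMP.gz") bytes"
echo "sha256: $(cat "$DUMP.gz.sha256")"
```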
Automated Schedule¶
All times are in UTC. Schedules are defined in `backend/config/celery.py`.
Hourly Backup¶
- Schedule: Every hour at minute 0 (`crontab(minute=0)`)
- Task: `backup.daily_backup`
- Timeout: Expires after 55 minutes (must complete before the next hourly run)
- Action: Full database backup, compress, upload to S3
```python
'hourly-database-backup': {
    'task': 'backup.daily_backup',
    'schedule': crontab(minute=0),
    'options': {'expires': 3300},
}
```
Daily Cleanup¶
- Schedule: 03:00 UTC daily (`crontab(hour=3, minute=0)`)
- Task: `backup.cleanup_old_backups`
- Action: Enforces tiered retention policy
Retention logic:
- Keep last 24 hourly backups
- Keep 1 per day for the last 7 days
- Keep 1 per week for the last 4 weeks
- Delete everything else
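The tier selection can be sketched as a script over generated backup names. This is an illustration only, not the production task: `backup.cleanup_old_backups` applies the same rules to backup records in S3, and the name format is taken from the backup filenames shown later in this document.

```shell
#!/bin/sh
# Illustrative tier selection over 121 generated hourly backup names
# (backup_YYYYMMDD_HHMMSS). Requires GNU date; not the production task.
set -eu
for h in $(seq 0 120); do date -u -d "-$h hours" +backup_%Y%m%d_%H%M%S; done \
    | sort -r > /tmp/retention_all_desc.txt                 # newest first

head -n 24 /tmp/retention_all_desc.txt > /tmp/retention_tiers.txt  # last 24 hourly
awk -F_ '!seen[$2]++' /tmp/retention_all_desc.txt \
    | head -n 7 >> /tmp/retention_tiers.txt                 # newest per day, 7 days
while read -r b; do                                         # newest per ISO week, 4 weeks
    d=${b#backup_}; d=${d%_*}
    echo "$(date -u -d "$d" +%G%V) $b"
done < /tmp/retention_all_desc.txt \
    | awk '!seen[$1]++ {print $2}' | head -n 4 >> /tmp/retention_tiers.txt

# Union of the tiers is kept; everything else is deleted
sort -u /tmp/retention_tiers.txt > /tmp/retention_keep.txt
sort /tmp/retention_all_desc.txt | comm -23 - /tmp/retention_keep.txt > /tmp/retention_delete.txt

echo "keep $(wc -l < /tmp/retention_keep.txt), delete $(wc -l < /tmp/retention_delete.txt)"
```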
Daily Verification¶
- Schedule: 04:00 UTC daily (`crontab(hour=4, minute=0)`)
- Task: `backup.verify_latest_backup`
- Timeout: Expires after 30 minutes
Verification checks:
- Backup file exists in S3
- File size is reasonable (> 1 MB)
- SHA-256 checksum matches stored metadata
- Backup timestamp is recent (< 2 hours old)
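These checks can be sketched locally. The snippet below is an illustration against a generated stand-in file and a `.sha256` sidecar; the real task (`backup.verify_latest_backup`) reads the size, checksum, and timestamp from S3 object metadata instead.

```shell
#!/bin/sh
# Illustrative version of the four daily verification checks.
# The backup file and its checksum sidecar are generated stand-ins.
set -eu
FILE=/tmp/latest_backup.sql.gz
dd if=/dev/zero of="$FILE" bs=1024 count=2048 2>/dev/null   # ~2 MB stand-in
sha256sum "$FILE" | awk '{print $1}' > "$FILE.sha256"

[ -f "$FILE" ]                                              # 1. backup exists
[ "$(stat -c%s "$FILE")" -gt 1048576 ]                      # 2. size > 1 MB
[ "$(sha256sum "$FILE" | awk '{print $1}')" = "$(cat "$FILE.sha256")" ]  # 3. checksum matches
AGE=$(( $(date +%s) - $(stat -c%Y "$FILE") ))
[ "$AGE" -lt 7200 ]                                         # 4. less than 2 hours old
echo "verification passed"
```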
Monthly Restore Test¶
- Schedule: 1st of each month at 05:00 UTC (`crontab(day_of_month=1, hour=5, minute=0)`)
- Task: `backup.test_restore`
- Timeout: Expires after 2 hours
Test procedure:
- Download latest backup from S3
- Create temporary test database
- Restore backup to test database
- Verify record counts match
- Clean up test database
- Report success/failure
Weekly CI Backup Test¶
- Schedule: Sunday at 05:00 UTC
- Workflow: `.github/workflows/backup-test.yml`
- Environment: Fresh PostgreSQL 15 + Redis in GitHub Actions
CI test steps:
- Create test database with seed data (`create_test_fixture`)
- Export backup using `export_seed_data`
- Create PostgreSQL dump using `pg_dump -F c`
- Verify backup file size and contents
- Flush database
- Restore from backup (`import_seed_data`)
- Verify data integrity (record counts match original)
- Restore the `pg_dump` output to a separate database and verify
- Upload backup artifacts for review (retained 7 days)
Manual Backup Procedure¶
When to Use¶
- Before risky deployments
- Before database migrations (especially irreversible ones)
- Before major data operations (bulk updates, deletions)
- As part of disaster recovery testing
Using Django Management Command¶
```shell
# SSH to VPS
ssh production-vps
cd /opt/webshop

# Trigger manual backup (creates dump, compresses, uploads to S3)
docker compose -f docker-compose.prod.yml exec backend \
    python manage.py backup_database

# Verify backup was created (check logs for S3 URL and backup info)
docker compose -f docker-compose.prod.yml logs backend | grep -i backup
```
Expected output:
```
Backup completed successfully: {
    'filename': 'backup_20260201_143022.sql.gz',
    's3_url': 's3://freezedesign-backups/backup_20260201_143022.sql.gz',
    'size_mb': 87.43,
    'checksum_sha256': 'a1b2c3d4...',
    'timestamp': '2026-02-01T14:30:22Z'
}
```
Using pg_dump Directly¶
For more control or when the Django management command is unavailable:
```shell
ssh production-vps
cd /opt/webshop

# Load database credentials
source .env

# Custom format (recommended -- supports parallel restore)
docker compose -f docker-compose.prod.yml exec db \
    pg_dump -U $DB_USER -d $DB_NAME -F c -f /tmp/manual_backup.dump

# Or plain SQL format (human-readable, works with psql)
docker compose -f docker-compose.prod.yml exec db \
    pg_dump -U $DB_USER -d $DB_NAME -F p -f /tmp/manual_backup.sql

# Copy backup out of container
docker cp $(docker compose -f docker-compose.prod.yml ps -q db):/tmp/manual_backup.dump ./manual_backup.dump

# Compress manually
gzip manual_backup.dump
```
Backup format reference:
| Flag | Format | Use Case |
|---|---|---|
| `-F c` | Custom (binary) | Recommended. Supports parallel restore with `pg_restore` |
| `-F p` | Plain (SQL text) | Human-readable. Restore with `psql` |
| `-F t` | Tar archive | Alternative archive format |
Restore Procedure¶
Prerequisites¶
- SSH access to production VPS
- AWS CLI installed and configured (for downloading from S3)
- Database credentials (from the `.env` file on the VPS)
- Backup filename or timestamp to restore
Step 1: Download Backup from S3¶
```shell
ssh production-vps
cd /opt/webshop

# Set AWS credentials (if not already configured via ~/.aws/credentials)
export AWS_ACCESS_KEY_ID="your-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

# List available backups (most recent last)
aws s3 ls s3://freezedesign-backups/ | grep backup_

# Download specific backup
aws s3 cp s3://freezedesign-backups/backup_20260201_143022.sql.gz /tmp/

# Verify download
ls -lh /tmp/backup_*.sql.gz
```
Step 2: Decompress Backup¶
```shell
# Decompress the gzip backup
gunzip /tmp/backup_20260201_143022.sql.gz

# Verify decompressed file
ls -lh /tmp/backup_20260201_143022.sql
```
For custom-format backups (`.dump` files), skip decompression -- use them directly with `pg_restore`.
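As an optional extra check, `gzip -t` tests an archive's integrity without writing the decompressed file, which catches a corrupted or truncated download before a restore is attempted. A self-contained illustration (the archive here is generated, not a real backup):

```shell
#!/bin/sh
# gzip -t exits non-zero on a corrupt or truncated archive.
set -eu
printf 'SELECT 1;\n' | gzip -9 > /tmp/backup_check.sql.gz   # stand-in archive

if gzip -t /tmp/backup_check.sql.gz; then
    echo "archive OK"
else
    echo "archive corrupt" >&2
    exit 1
fi
```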
Step 3: Stop Application¶
Stop services that write to the database to prevent conflicts during restore:
```shell
docker compose -f docker-compose.prod.yml stop backend celery celery-beat

# Verify services are stopped (db and redis should still be running)
docker compose -f docker-compose.prod.yml ps
```
Step 4: Restore to Database¶
For custom-format backups (.dump):
```shell
docker compose -f docker-compose.prod.yml exec db \
    pg_restore -U $DB_USER -d $DB_NAME --clean --if-exists /tmp/backup.dump
```
Flags:
- `--clean` -- drops database objects before recreating them
- `--if-exists` -- uses `IF EXISTS` when dropping to avoid errors on missing objects
For SQL-format backups (.sql):
```shell
# Copy backup into the database container
docker cp /tmp/backup_20260201_143022.sql \
    $(docker compose -f docker-compose.prod.yml ps -q db):/tmp/backup.sql

# Restore using psql
docker compose -f docker-compose.prod.yml exec db \
    psql -U $DB_USER -d $DB_NAME -f /tmp/backup.sql
```
Step 5: Run Pending Migrations¶
After restoring, the backup may be from a schema version behind the current code:
```shell
docker compose -f docker-compose.prod.yml exec backend \
    python manage.py migrate

# Verify all migrations are applied
docker compose -f docker-compose.prod.yml exec backend \
    python manage.py showmigrations
```
Step 6: Restart Services¶
```shell
docker compose -f docker-compose.prod.yml up -d

# Verify all services are healthy
docker compose -f docker-compose.prod.yml ps
```
Step 7: Verify Health¶
```shell
# Check backend health endpoint
curl -f http://127.0.0.1:8000/api/health/

# Check frontend
curl -f http://127.0.0.1:3000/

# Tail logs for errors
docker compose -f docker-compose.prod.yml logs -f --tail=50
```
Step 8: Verify Data Integrity¶
Open a Django shell (`docker compose -f docker-compose.prod.yml exec backend python manage.py shell`) and check record counts:

```python
from apps.products.models import Product, Color, Size
from apps.orders.models import Order
from django.contrib.auth import get_user_model

User = get_user_model()
print(f"Products: {Product.objects.count()}")
print(f"Colors: {Color.objects.count()}")
print(f"Sizes: {Size.objects.count()}")
print(f"Orders: {Order.objects.count()}")
print(f"Users: {User.objects.count()}")
```
Compare counts with expected values or pre-incident counts.
Disaster Recovery Scenarios¶
Scenario A: Database Corruption¶
Situation: Database tables corrupted due to disk failure, software bug, or hardware issue.
Symptoms:
- Database queries return errors
- Application crashes with database errors
- Data appears inconsistent or missing
Estimated downtime: 20-35 minutes
Data loss: Up to 1 hour (RPO)
Procedure:
- Identify corruption scope -- determine which tables or data are affected
- Select restore point -- choose the most recent backup from before corruption occurred
- Follow the full Restore Procedure (steps 1-8)
- Verify data integrity in the Django shell
- Resume operations and monitor logs
Scenario B: Complete VPS Failure¶
Situation: VPS hardware failure, cloud provider outage, or catastrophic system failure.
Symptoms:
- VPS unreachable via SSH
- Application completely down
- No response from any service
Estimated downtime: 2-4 hours (within RTO target)
Data loss: Up to 1 hour (RPO)
Procedure:
1. Provision new VPS (30-60 min):
    - Spin up a new VPS instance (same specs: 2 GB RAM, 2 CPU cores minimum)
    - Configure firewall rules (ports 22, 80, 443)
    - Install Docker and Docker Compose
2. Deploy infrastructure (15-30 min):

    ```shell
    # Clone repository (use SSH deploy key)
    git clone git@github.com:your-org/webshop_freeze_design.git /opt/webshop
    cd /opt/webshop

    # Copy environment variables from secure backup / password manager
    # (keep a copy of .env in a secure location outside the VPS)
    nano .env

    # Pull Docker images from GHCR
    docker compose -f docker-compose.prod.yml pull

    # Start infrastructure services first
    docker compose -f docker-compose.prod.yml up -d db redis
    ```

3. Create the database (the container does not auto-create it), e.g. `docker compose -f docker-compose.prod.yml exec db createdb -U $DB_USER $DB_NAME`
4. Restore database from S3 backup (20-35 min) -- follow Restore Procedure steps 1-5
5. Start application services: `docker compose -f docker-compose.prod.yml up -d`
6. Configure DNS (5-30 min): point the `freezedesign.eu` A record to the new VPS IP and check propagation, e.g. with `dig +short freezedesign.eu`
7. Configure SSL (5-10 min): run Let's Encrypt certificate generation via certbot
8. Verify functionality: full smoke test -- health endpoints, product listing, checkout flow
Downtime breakdown:
| Step | Time |
|---|---|
| VPS provisioning | 30-60 min |
| Infrastructure deployment | 15-30 min |
| Database restore | 20-35 min |
| DNS propagation | 5-30 min |
| SSL certificate | 5-10 min |
| Verification | 10-15 min |
| Total | ~2-4 hours |
Scenario C: Accidental Data Deletion¶
Situation: User data, orders, or products accidentally deleted (bulk delete, admin error, script bug).
Symptoms:
- Reports of missing data
- Empty tables or reduced record counts
- User complaints about lost orders or designs
Estimated downtime: 15-35 minutes
Data loss: Minimal (data between last backup and deletion)
Procedure:
1. Identify deletion timestamp -- check admin logs, application logs, or user reports
2. Stop the application to prevent further changes: `docker compose -f docker-compose.prod.yml stop backend celery celery-beat`
3. Select a backup from before the deletion timestamp
4. Choose a restore approach:

    Option A -- Selective restore (if only specific tables are affected):

    ```shell
    # Download and decompress the backup, then restore specific tables only
    docker compose -f docker-compose.prod.yml exec db \
        pg_restore -U $DB_USER -d $DB_NAME \
        -t orders_order -t orders_orderitem \
        --data-only \
        /tmp/backup.dump
    ```

    Option B -- Full restore (if deletion scope is unclear): follow the complete Restore Procedure.

5. Verify that the deleted data is restored
6. Restart services
Scenario D: Failed Database Migration¶
Situation: Database migration fails partway through or causes data corruption.
Symptoms:
- `migrate` command fails with an error
- Application errors after migration
- Data inconsistencies after schema change
Estimated downtime: 20-35 minutes (restore) + time to fix the migration
Data loss: Depends on approach
Procedure:
1. Attempt a migration rollback (if the migration is reversible), e.g. `docker compose -f docker-compose.prod.yml exec backend python manage.py migrate <app_label> <previous_migration_name>`
2. If the rollback fails or the migration is irreversible:
    - Stop the application
    - Restore from the backup taken before the migration (this is why you take a manual backup before risky migrations)
    - Follow the full Restore Procedure
    - Verify schema and data integrity
3. Fix the migration -- correct the migration code and test it in a local or staging environment
4. Re-apply the fixed migration
Prevention:
- Always take a manual backup before irreversible migrations
- Test migrations on staging first
- Deploy migrations separately from application code changes
- Write migrations to be reversible when possible
Monitoring and Alerting¶
Backup Task Logs¶
All backup tasks log to the `apps.core.backup` logger.
```shell
# View recent backup task logs
docker compose -f docker-compose.prod.yml logs celery | grep -i backup

# Follow backup logs in real time
docker compose -f docker-compose.prod.yml logs -f celery
```
Success output:
```
INFO apps.core.backup: Backup completed successfully: {
    'filename': 'backup_20260201_143022.sql.gz',
    's3_url': 's3://freezedesign-backups/backup_20260201_143022.sql.gz',
    'size_mb': 87.43, ...
}
```
Failure output:
```
ERROR apps.core.backup: Backup failed: pg_dump failed: ...
ERROR apps.core.backup: S3 upload failed: ...
```
Discord Notifications¶
Backup task failures trigger Discord notifications to the #backups channel via the `notify-discord.sh` script. Both the weekly CI backup test and the production backup tasks send Discord alerts on failure.
Daily Verification Task¶
Runs at 04:00 UTC every day. Checks:
- Latest backup exists in S3
- File size > 1 MB
- Checksum matches metadata
- Backup is less than 2 hours old
Failures are logged and should trigger manual investigation.
Weekly CI Backup Test¶
Runs every Sunday at 05:00 UTC in GitHub Actions (.github/workflows/backup-test.yml). Performs a full backup/restore cycle and reports results via GitHub Actions summary and Discord notification.
Health Check Commands¶
```shell
# List recent backups in S3
aws s3 ls s3://freezedesign-backups/ | tail -5

# Check the latest backup timestamp
aws s3 ls s3://freezedesign-backups/ | tail -1

# Check Celery Beat is scheduling tasks
docker compose -f docker-compose.prod.yml logs celery-beat | tail -20

# Check recent backup task results
docker compose -f docker-compose.prod.yml logs celery | grep -i backup | tail -20
```
RTO/RPO Evidence¶
RPO: 1 Hour¶
- Mechanism: Hourly automated backups via Celery Beat (`crontab(minute=0)`)
- Maximum data loss: 1 hour (worst case: failure occurs 1 minute before the next hourly backup)
- Verification: Daily verification task confirms the latest backup is < 2 hours old
RTO: 20-35 Minutes (Actual)¶
| Step | Time Estimate | Notes |
|---|---|---|
| Identify issue | 5 min | UptimeRobot alert + Discord notification |
| Download backup from S3 | 2-5 min | ~50-100 MB compressed |
| Decompress backup | 1-2 min | gzip decompression |
| Stop application | 1 min | docker compose stop |
| Restore database (pg_restore) | 5-15 min | Depends on database size (100-500 MB) |
| Run migrations | 1-2 min | Usually fast (idempotent) |
| Restart services | 2-3 min | docker compose up -d |
| Health verification | 2-5 min | Smoke tests, check logs |
| Total | 20-35 min | Well within 4-hour target |
Continuous Verification¶
| Frequency | Check | Source |
|---|---|---|
| Hourly | Backup creation and S3 upload | Celery Beat task |
| Daily | Verification of latest backup integrity | backup.verify_latest_backup task |
| Weekly | Full backup/restore cycle in CI | .github/workflows/backup-test.yml |
| Monthly | Automated restore test to temporary database | backup.test_restore task |