Deployment Runbook

This document provides step-by-step procedures for deploying, rolling back, and managing the FreezeDesign webshop application in both staging and production environments.

Table of Contents

  • Overview
  • Staging Deployment
  • Production Deployment
  • Zero-Downtime Deployment Details
  • Automatic Rollback
  • Manual Rollback Procedure
  • Database Migration Rollback
  • Troubleshooting

Overview

Architecture

The FreezeDesign application is deployed using a Docker-based infrastructure with the following characteristics:

  • Two environments: Staging and Production
  • Container registry: GitHub Container Registry (GHCR)
  • Deployment strategy: Zero-downtime rolling deployment
  • VPS hosting: Single VPS per environment with Docker Compose
  • Automatic rollback: Health check failures trigger automatic rollback

Deployment Flow

  1. CI/CD Pipeline: GitHub Actions builds Docker images and pushes to GHCR
  2. Image Pull: VPS pulls latest images from GHCR
  3. Rolling Deployment: deploy-rolling.sh performs zero-downtime deployment
  4. Health Checks: Backend and frontend health endpoints are verified
  5. Migration: Database migrations run after successful deployment
  6. Notification: Discord notifications on success or failure

Key Scripts

  • scripts/deploy-rolling.sh - Zero-downtime rolling deployment
  • scripts/rollback.sh - Automatic rollback on failure
  • scripts/notify-discord.sh - Discord notifications
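As an illustration of the notification step, a minimal Discord notifier could look like the following. This is a hypothetical sketch, not the actual notify-discord.sh; the DISCORD_WEBHOOK_URL variable and the message format are assumptions.

```shell
# Hypothetical sketch of a Discord notifier; the real notify-discord.sh
# may differ. DISCORD_WEBHOOK_URL is assumed to be set in the environment.

# Build the minimal JSON payload the Discord webhook API accepts.
build_payload() {
  printf '{"content": "[%s] %s"}' "$1" "$2"
}

# POST the payload to the webhook URL.
notify_discord() {
  curl -fsS -H "Content-Type: application/json" \
    -d "$(build_payload "$1" "$2")" "$DISCORD_WEBHOOK_URL" >/dev/null
}

# Usage: notify_discord SUCCESS "Deployed v1.16.0 to production"
```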

Staging Deployment

Trigger: Automatic on push to main branch

Workflow: .github/workflows/deploy-staging.yml

Flow

  1. Push to main branch triggers the workflow
  2. CI tests pass (linting, type checking, unit tests)
  3. Build images: Backend and frontend images built with staging configuration
  4. Backend image tagged: sha-{commit}, main, staging
  5. Frontend image tagged: sha-{commit}, main, staging
  6. Frontend built with NEXT_PUBLIC_API_URL from staging secrets
  7. Push to GHCR: Images pushed to GitHub Container Registry
  8. Copy deployment scripts to staging VPS
  9. SSH to staging VPS and run deploy-rolling.sh
  10. Health checks verify backend and frontend are healthy
  11. Import seed data (staging only) for testing
  12. Discord notification on success or failure

Manual Intervention

Staging deployment is fully automatic. No manual steps required.

Monitoring

  • GitHub Actions workflow run status
  • Discord notifications
  • Health check logs in workflow output

Production Deployment

Trigger: Manual via GitHub Actions workflow dispatch

Workflow: .github/workflows/deploy-production.yml

Environment Protection: Requires manual approval via GitHub environment rules

Pre-Deployment Checklist

Before triggering a production deployment:

  • All CI checks pass on the tag
  • Staging tested with the same code
  • Database backup verified (check latest hourly backup in S3)
  • Team notified via Discord or Slack
  • Breaking changes documented
  • Rollback plan reviewed
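For the backup check, one convenient pattern is to pipe an `aws s3 ls` listing through a small helper that keeps the newest line. The bucket name and prefix below are placeholders, not the real ones.

```shell
# `aws s3 ls` prints one line per object, starting with the last-modified
# timestamp, so a plain sort orders the listing chronologically.
newest_line() {
  sort | tail -n 1
}

# Usage (bucket and prefix are placeholders for this sketch):
#   aws s3 ls s3://your-backup-bucket/hourly/ | newest_line
```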

Deployment Steps

  1. Tag the release:

    git tag -a v1.16.0 -m "Release v1.16.0: Production readiness"
    git push origin v1.16.0
    

  2. Trigger workflow via GitHub Actions UI:

     • Go to Actions → Deploy to Production
     • Click "Run workflow"
     • Enter tag: v1.16.0
     • Confirm

  3. Manual approval: GitHub will pause and wait for approval (if environment protection is enabled)

  4. Deployment proceeds:

     • Build images with production configuration
     • Push to GHCR with semantic version tags: v1.16.0, 1.16, latest
     • Copy deployment scripts to production VPS
     • SSH to production VPS
     • Run deploy-rolling.sh for zero-downtime deployment
     • Health checks with 5 retries (5-second intervals)
     • Automatic rollback if health checks fail
     • Discord notification on success or failure
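The semantic version tags above can be derived mechanically from the release tag. A sketch, assuming the v<major>.<minor>.<patch> format used in this runbook:

```shell
# Expand a release tag like v1.16.0 into the set of tags pushed to GHCR:
# the full tag, the major.minor tag, and latest.
version_tags() {
  version="${1#v}"        # v1.16.0 -> 1.16.0
  minor="${version%.*}"   # 1.16.0  -> 1.16
  printf '%s %s latest\n' "$1" "$minor"
}

# Example: version_tags v1.16.0 prints "v1.16.0 1.16 latest"
```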

Post-Deployment Verification

After successful deployment:

  1. Check application health:

     • Visit the production URL in a browser
     • Verify key pages load correctly
     • Test critical user flows (product catalog, designer, checkout)

  2. Monitor logs:

    ssh production-vps
    cd /opt/webshop
    docker compose -f docker-compose.prod.yml logs -f --tail=100

  3. Check error tracking: Review Sentry or the logging service for errors

Zero-Downtime Deployment Details

The deploy-rolling.sh script ensures at least one instance is always running during deployment.

How It Works

For each service (backend, frontend):

  1. Scale up: Start new container alongside old one (temporarily 2 instances)
  2. Health check: Verify new container is healthy (30 attempts, 5-second intervals = 150 seconds max)
  3. Scale down: Stop and remove old container
  4. Scale back: Return to 1 instance
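The four steps above can be sketched as a shell function. This is a simplification, not the actual deploy-rolling.sh; the compose invocation, attempt count, and sleep interval are parameterized so the pattern is easy to see (and to stub out).

```shell
# Sketch of the scale-up / health-check / scale-down pattern. COMPOSE,
# MAX_ATTEMPTS and SLEEP_INTERVAL are overridable; defaults match this
# runbook's setup.
COMPOSE="${COMPOSE:-docker compose -f docker-compose.prod.yml}"

roll_service() {
  service="$1"; health_cmd="$2"; i=0
  # 1. Scale up: start a new container alongside the old one.
  $COMPOSE up -d --no-deps --scale "$service=2" "$service" || return 1
  # 2. Health check: up to 30 attempts, 5 seconds apart (150 s max).
  until sh -c "$health_cmd" >/dev/null 2>&1; do
    i=$((i + 1))
    if [ "$i" -ge "${MAX_ATTEMPTS:-30}" ]; then
      echo "$service failed health check" >&2
      return 1
    fi
    sleep "${SLEEP_INTERVAL:-5}"
  done
  # 3 + 4. Scale down: remove the extra container, back to one instance.
  $COMPOSE up -d --no-deps --scale "$service=1" "$service"
}

# Usage:
#   roll_service backend  "curl -f http://localhost:8000/api/health/"
#   roll_service frontend "curl -f http://localhost:3000/"
```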

Resource Requirements

  • Temporarily 2x containers: ~600-800MB total memory for backend during deployment
  • VPS capacity: Works within 2GB VPS (1GB base + 1GB headroom)
  • Duration: ~2-5 minutes for full deployment

Health Check Details

  • Backend: http://localhost:8000/api/health/
  • Frontend: http://localhost:3000/
  • Max attempts: 30 attempts × 5 seconds = 150 seconds maximum wait
  • Timeout: Health check must pass before old container is removed

Celery Services

Celery workers and beat scheduler are restarted gracefully (not rolled) since they don't serve HTTP traffic.

Automatic Rollback

Trigger: Health check fails after 5 retries in deployment workflow

Script: scripts/rollback.sh

What Happens

  1. Health check fails after 5 attempts (25 seconds)
  2. Rollback script executes:

     • Pull the previous image tag from GHCR
     • Tag it as the version to use
     • Restart services with the previous image

  3. Discord notification sent with rollback status
  4. Workflow is marked failed, signaling the deployment failure
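The pull-and-retag part of this sequence can be sketched as follows. This is a hypothetical simplification of rollback.sh; the DOCKER variable exists only so the commands are easy to stub out.

```shell
# Pull the previous image from GHCR and retag it as the tag that
# docker-compose.prod.yml deploys, so restarting picks it up.
DOCKER="${DOCKER:-docker}"

rollback_image() {
  image="$1"; previous_tag="$2"; deploy_tag="$3"
  $DOCKER pull "$image:$previous_tag" &&
    $DOCKER tag "$image:$previous_tag" "$image:$deploy_tag"
}

# Usage (then restart services with the previous image):
#   rollback_image ghcr.io/your-org/webshop_freeze_design/backend "$BACKUP_TAG" latest
#   docker compose -f docker-compose.prod.yml up -d backend frontend
```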

Verification After Automatic Rollback

After automatic rollback, verify services are healthy:

ssh production-vps
curl -f http://localhost:8000/api/health/
curl -f http://localhost:3000/

Investigate Failure

  1. Check deployment logs in GitHub Actions workflow output
  2. Check application logs:
    docker compose -f docker-compose.prod.yml logs backend --tail=200
    docker compose -f docker-compose.prod.yml logs frontend --tail=200
    
  3. Identify root cause before attempting re-deployment

Manual Rollback Procedure

When to use:

  • Automatic rollback failed
  • Issues discovered after successful deployment
  • Emergency rollback needed outside of deployment workflow

Step-by-Step

  1. SSH into VPS:

    ssh production-user@production-vps
    

  2. Navigate to deployment directory:

    cd /opt/webshop
    

  3. Set environment variables:

    export COMPOSE_FILE=docker-compose.prod.yml
    export GITHUB_REPOSITORY=your-org/webshop_freeze_design
    export REGISTRY=ghcr.io
    export BACKUP_TAG=<previous-sha-or-version>
    

Find the previous tag:

  • For a semantic version: use the previous release tag (e.g., v1.15.0)
  • For a commit SHA: use the previous successful commit SHA
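Two hedged helpers for locating that tag (assumptions: release tags follow the v* scheme, and the sha- image tags are built from short commit SHAs, as this runbook's workflows suggest):

```shell
# Second-newest release tag when sorted by version, e.g. v1.15.0
# when v1.16.0 is the current release.
previous_release_tag() {
  git tag --sort=-v:refname | sed -n '2p'
}

# sha-<short-sha> of the commit before HEAD (assumes the image tags
# use short commit SHAs).
previous_sha_tag() {
  git log --format='sha-%h' -n 2 | sed -n '2p'
}

# Usage: export BACKUP_TAG="$(previous_release_tag)"
```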

  4. Run the rollback script:

    ./scripts/rollback.sh

  5. Verify health:

    curl -f http://localhost:8000/api/health/
    curl -f http://localhost:3000/

  6. Check logs:

    docker compose -f docker-compose.prod.yml logs -f --tail=50
    

Last Resort: Simple Restart

If rollback.sh fails completely:

docker compose -f docker-compose.prod.yml restart backend frontend celery celery-beat

This restarts services using whichever image tags are currently present on the VPS, so it may not revert the release if the new images were already pulled.

Database Migration Rollback

Django migrations can be reversible or irreversible.

Reversible Migrations

Most schema migrations are automatically reversible; migrating backwards undoes the operation:

  • AddField - removes the field
  • CreateModel - drops the table
  • AlterField - reverts the field changes
  • AddIndex - drops the index

To rollback a reversible migration:

# SSH to VPS
ssh production-vps
cd /opt/webshop

# Rollback to specific migration number
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py migrate <app_name> <previous_migration_number>

# Example: rollback products app to migration 0015
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py migrate products 0015

Verify rollback:

# Show current migration status
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py showmigrations <app_name>

Irreversible Migrations

Some migrations cannot be automatically reversed:

  • RemoveField - data is lost when field is dropped
  • DeleteModel - data is lost when table is dropped
  • RunPython - custom Python code without reverse_code
  • RunSQL - custom SQL without reverse SQL

For irreversible migrations:

  1. Do not attempt a migrate rollback (it will either fail or lose data)
  2. Restore from a database backup instead (see Disaster Recovery)
  3. Choose a backup timestamp from before the migration ran

Best Practices

  1. Deploy migrations separately:

     • Release 1: deploy the migration only (additive, safe)
     • Verify the migration succeeded
     • Release 2: deploy the code that uses the new schema

  2. Make migrations reversible:

     • Provide reverse_code for RunPython operations
     • Avoid RemoveField and DeleteModel when possible
     • Use AlterField with null=True before removing

  3. Test migrations on staging before production

  4. Take a backup before risky migrations:

    # Trigger manual backup
    docker compose exec backend python manage.py backup_database
    

Troubleshooting

Health Check Fails

Symptom: Deployment fails with "Health check failed after 5 attempts"

Diagnosis:

# Check logs
docker compose -f docker-compose.prod.yml logs backend --tail=200
docker compose -f docker-compose.prod.yml logs frontend --tail=200

# Test health endpoint directly
curl -v http://localhost:8000/api/health/
curl -v http://localhost:3000/

Common causes:

  • Database connection failure (check DB_HOST, DB_PASSWORD in .env)
  • Redis connection failure (check REDIS_URL)
  • Missing environment variables
  • Application startup errors (check logs for stack traces)
  • Port already in use (check docker compose ps)

Container Won't Start

Symptom: Container exits immediately or is stuck in a restart loop

Diagnosis:

# Check container status
docker compose -f docker-compose.prod.yml ps

# Check resource usage
docker stats

# Inspect container
docker compose -f docker-compose.prod.yml logs <service-name> --tail=100

Common causes:

  • Out of memory (check docker stats)
  • Syntax error in code (check logs for Python/JavaScript errors)
  • Missing dependencies (rebuild image)
  • Configuration error in docker-compose.prod.yml

Image Pull Fails

Symptom: "Error pulling image" during deployment

Diagnosis:

# Check GHCR authentication
docker login ghcr.io -u <username>

# Verify image exists
docker pull ghcr.io/<repo>/backend:<tag>

Common causes:

  • Invalid tag name
  • GHCR authentication expired
  • Image not pushed to registry
  • Network connectivity issues

Solution:

# Re-authenticate to GHCR
echo $GITHUB_TOKEN | docker login ghcr.io -u <username> --password-stdin

# Verify image tags
docker images | grep ghcr.io

Disk Space Issues

Symptom: "No space left on device"

Diagnosis:

# Check disk usage
df -h

# Check Docker disk usage
docker system df

Solution:

# Clean up old images and containers
docker system prune -f

# Remove unused volumes (careful!)
docker volume prune -f

# Remove specific old images
docker images | grep '<none>' | awk '{print $3}' | xargs docker rmi

Migration Fails

Symptom: "Migration failed" during deployment

Diagnosis:

# Check migration status
docker compose exec backend python manage.py showmigrations

# Check migration errors
docker compose logs backend | grep -i migration

Common causes:

  • Database schema conflict
  • Missing dependency migration
  • Custom SQL error in migration
  • Database connection interrupted

Solution:

# Fake migration if already applied manually
docker compose exec backend python manage.py migrate <app> <migration> --fake

# Or rollback and re-apply
docker compose exec backend python manage.py migrate <app> <previous-migration>
docker compose exec backend python manage.py migrate <app>

Performance Degradation After Deployment

Symptom: Application slower than before deployment

Diagnosis:

# Check resource usage
docker stats

# Check database connections
docker compose exec db psql -U $DB_USER -d $DB_NAME -c "SELECT count(*) FROM pg_stat_activity;"

# Check Redis memory
docker compose exec redis redis-cli INFO memory

Common causes:

  • Missing database indexes (new queries added)
  • Inefficient queries (check Django query logs)
  • Memory leak (check docker stats over time)
  • Cache not warming up (Redis empty after restart)

Solution:

  • Review and optimize new queries
  • Add database indexes for slow queries
  • Restart services if memory leak suspected
  • Warm up cache if needed