Deployment Runbook

This document provides step-by-step procedures for deploying, rolling back, and managing the FreezeDesign webshop application in both staging and production environments.

Table of Contents

  • Overview
  • Staging Deployment
  • Production Deployment
  • Zero-Downtime Deployment Details
  • Automatic Rollback
  • Manual Rollback Procedure
  • Database Migration Rollback
  • Troubleshooting

Overview

Architecture

The FreezeDesign application is deployed using a Docker-based infrastructure with the following characteristics:

  • Two environments: Staging and Production
  • Container registry: GitHub Container Registry (GHCR)
  • Deployment strategy: Zero-downtime rolling deployment
  • VPS hosting: Single VPS per environment with Docker Compose
  • Automatic rollback: Health check failures trigger automatic rollback

Deployment Flow

  1. CI/CD Pipeline: GitHub Actions builds Docker images and pushes to GHCR
  2. Image Pull: VPS pulls latest images from GHCR
  3. Rolling Deployment: deploy-rolling.sh performs zero-downtime deployment
  4. Health Checks: Backend and frontend health endpoints are verified
  5. Migration: Database migrations run after successful deployment
  6. Notification: Discord notifications on success or failure

Key Scripts

  • scripts/deploy-rolling.sh - Zero-downtime rolling deployment
  • scripts/rollback.sh - Automatic rollback on failure
  • scripts/notify-discord.sh - Discord notifications
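As an illustration of the notification step, a minimal Discord notifier could look like the following. This is a hypothetical sketch, not the actual notify-discord.sh; the DISCORD_WEBHOOK_URL variable and the message format are assumptions.

```shell
# Hypothetical sketch of a Discord notifier; the real notify-discord.sh
# may differ. DISCORD_WEBHOOK_URL is assumed to be set in the environment.

# Build the minimal JSON payload the Discord webhook API accepts.
build_payload() {
  printf '{"content": "[%s] %s"}' "$1" "$2"
}

# POST the payload to the webhook URL.
notify_discord() {
  curl -fsS -H "Content-Type: application/json" \
    -d "$(build_payload "$1" "$2")" "$DISCORD_WEBHOOK_URL" >/dev/null
}

# Usage: notify_discord SUCCESS "Deployed v1.16.0 to production"
```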

Staging Deployment

Trigger: Automatic on push to main branch

Workflow: .github/workflows/deploy-staging.yml

Flow

  1. Push to main branch triggers the workflow
  2. CI tests pass (linting, type checking, unit tests)
  3. Build images: Backend and frontend images built with staging configuration
  4. Backend image tagged: sha-{commit}, main, staging
  5. Frontend image tagged: sha-{commit}, main, staging
  6. Frontend built with NEXT_PUBLIC_API_URL from staging secrets
  7. Push to GHCR: Images pushed to GitHub Container Registry
  8. Copy deployment scripts to staging VPS
  9. SSH to staging VPS and run deploy-rolling.sh
  10. Health checks verify backend and frontend are healthy
  11. Import seed data (staging only) for testing
  12. Discord notification on success or failure

Manual Intervention

Staging deployment is fully automatic. No manual steps required.

Monitoring

  • GitHub Actions workflow run status
  • Discord notifications
  • Health check logs in workflow output

Production Deployment

Trigger: Manual via GitHub Actions workflow dispatch

Workflow: .github/workflows/deploy-production.yml

Environment Protection: Requires manual approval via GitHub environment rules

Pre-Deployment Checklist

Before triggering a production deployment:

  • All CI checks pass on the tag
  • Staging tested with the same code
  • Database backup verified (check latest hourly backup in S3)
  • Team notified via Discord or Slack
  • Breaking changes documented
  • Rollback plan reviewed
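For the backup check, one convenient pattern is to pipe an `aws s3 ls` listing through a small helper that keeps the newest line. The bucket name and prefix below are placeholders, not the real ones.

```shell
# `aws s3 ls` prints one line per object, starting with the last-modified
# timestamp, so a plain sort orders the listing chronologically.
newest_line() {
  sort | tail -n 1
}

# Usage (bucket and prefix are placeholders for this sketch):
#   aws s3 ls s3://your-backup-bucket/hourly/ | newest_line
```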

Deployment Steps

  1. Tag the release:

    git tag -a v1.16.0 -m "Release v1.16.0: Production readiness"
    git push origin v1.16.0
    

  2. Trigger workflow via GitHub Actions UI:

     • Go to Actions → Deploy to Production
     • Click "Run workflow"
     • Enter tag: v1.16.0
     • Confirm

  3. Manual approval: GitHub will pause and wait for approval (if environment protection is enabled)

  4. Deployment proceeds:

     • Build images with production configuration
     • Push to GHCR with semantic version tags: v1.16.0, 1.16, latest
     • Copy deployment scripts to production VPS
     • SSH to production VPS
     • Run deploy-rolling.sh for zero-downtime deployment
     • Health checks with 5 retries (5-second intervals)
     • Automatic rollback if health checks fail
     • Discord notification on success or failure
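The semantic version tags above can be derived mechanically from the release tag. A sketch, assuming the v<major>.<minor>.<patch> format used in this runbook:

```shell
# Expand a release tag like v1.16.0 into the set of tags pushed to GHCR:
# the full tag, the major.minor tag, and latest.
version_tags() {
  version="${1#v}"        # v1.16.0 -> 1.16.0
  minor="${version%.*}"   # 1.16.0  -> 1.16
  printf '%s %s latest\n' "$1" "$minor"
}

# Example: version_tags v1.16.0 prints "v1.16.0 1.16 latest"
```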

Post-Deployment Verification

After successful deployment:

  1. Check application health:

     • Visit the production URL in a browser
     • Verify key pages load correctly
     • Test critical user flows (product catalog, designer, checkout)

  2. Monitor logs:

    ssh production-vps
    cd /opt/webshop
    docker compose -f docker-compose.prod.yml logs -f --tail=100

  3. Check error tracking: Review Sentry or the logging service for errors

Zero-Downtime Deployment Details

The deploy-rolling.sh script ensures at least one instance is always running during deployment.

How It Works

For each service (backend, frontend):

  1. Scale up: Start new container alongside old one (temporarily 2 instances)
  2. Health check: Verify new container is healthy (30 attempts, 5-second intervals = 150 seconds max)
  3. Scale down: Stop and remove old container
  4. Scale back: Return to 1 instance
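The four steps above can be sketched as a shell function. This is a simplification, not the actual deploy-rolling.sh; the compose invocation, attempt count, and sleep interval are parameterized so the pattern is easy to see (and to stub out).

```shell
# Sketch of the scale-up / health-check / scale-down pattern. COMPOSE,
# MAX_ATTEMPTS and SLEEP_INTERVAL are overridable; defaults match this
# runbook's setup.
COMPOSE="${COMPOSE:-docker compose -f docker-compose.prod.yml}"

roll_service() {
  service="$1"; health_cmd="$2"; i=0
  # 1. Scale up: start a new container alongside the old one.
  $COMPOSE up -d --no-deps --scale "$service=2" "$service" || return 1
  # 2. Health check: up to 30 attempts, 5 seconds apart (150 s max).
  until sh -c "$health_cmd" >/dev/null 2>&1; do
    i=$((i + 1))
    if [ "$i" -ge "${MAX_ATTEMPTS:-30}" ]; then
      echo "$service failed health check" >&2
      return 1
    fi
    sleep "${SLEEP_INTERVAL:-5}"
  done
  # 3 + 4. Scale down: remove the extra container, back to one instance.
  $COMPOSE up -d --no-deps --scale "$service=1" "$service"
}

# Usage:
#   roll_service backend  "curl -f http://localhost:8000/api/health/"
#   roll_service frontend "curl -f http://localhost:3000/"
```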

Resource Requirements

  • Temporarily 2x containers: ~600-800MB total memory for backend during deployment
  • VPS capacity: Works within 2GB VPS (1GB base + 1GB headroom)
  • Duration: ~2-5 minutes for full deployment

Health Check Details

  • Backend: http://localhost:8000/api/health/
  • Frontend: http://localhost:3000/
  • Max attempts: 30 attempts × 5 seconds = 150 seconds maximum wait
  • Timeout: Health check must pass before old container is removed

Celery Services

Celery workers and beat scheduler are restarted gracefully (not rolled) since they don't serve HTTP traffic.

Automatic Rollback

Trigger: Health check fails after 5 retries in deployment workflow

Script: scripts/rollback.sh

What Happens

  1. Health check fails after 5 attempts (25 seconds)
  2. Rollback script executes:

     • Pull the previous image tag from GHCR
     • Tag it as the version to use
     • Restart services with the previous image

  3. Discord notification sent with rollback status
  4. Workflow is marked failed, signaling the deployment failure
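The pull-and-retag part of this sequence can be sketched as follows. This is a hypothetical simplification of rollback.sh; the DOCKER variable exists only so the commands are easy to stub out.

```shell
# Pull the previous image from GHCR and retag it as the tag that
# docker-compose.prod.yml deploys, so restarting picks it up.
DOCKER="${DOCKER:-docker}"

rollback_image() {
  image="$1"; previous_tag="$2"; deploy_tag="$3"
  $DOCKER pull "$image:$previous_tag" &&
    $DOCKER tag "$image:$previous_tag" "$image:$deploy_tag"
}

# Usage (then restart services with the previous image):
#   rollback_image ghcr.io/your-org/webshop_freeze_design/backend "$BACKUP_TAG" latest
#   docker compose -f docker-compose.prod.yml up -d backend frontend
```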

Verification After Automatic Rollback

After automatic rollback, verify services are healthy:

ssh production-vps
curl -f http://localhost:8000/api/health/
curl -f http://localhost:3000/

Investigate Failure

  1. Check deployment logs in GitHub Actions workflow output
  2. Check application logs:
    docker compose -f docker-compose.prod.yml logs backend --tail=200
    docker compose -f docker-compose.prod.yml logs frontend --tail=200
    
  3. Identify root cause before attempting re-deployment

Manual Rollback Procedure

When to use:

  • Automatic rollback failed
  • Issues discovered after successful deployment
  • Emergency rollback needed outside of deployment workflow

Step-by-Step

  1. SSH into VPS:

    ssh production-user@production-vps
    

  2. Navigate to deployment directory:

    cd /opt/webshop
    

  3. Set environment variables:

    export COMPOSE_FILE=docker-compose.prod.yml
    export GITHUB_REPOSITORY=your-org/webshop_freeze_design
    export REGISTRY=ghcr.io
    export BACKUP_TAG=<previous-sha-or-version>
    

Find the previous tag:

  • For a semantic version: use the previous release tag (e.g., v1.15.0)
  • For a commit SHA: use the previous successful commit SHA
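Two hedged helpers for locating that tag (assumptions: release tags follow the v* scheme, and the sha- image tags are built from short commit SHAs, as this runbook's workflows suggest):

```shell
# Second-newest release tag when sorted by version, e.g. v1.15.0
# when v1.16.0 is the current release.
previous_release_tag() {
  git tag --sort=-v:refname | sed -n '2p'
}

# sha-<short-sha> of the commit before HEAD (assumes the image tags
# use short commit SHAs).
previous_sha_tag() {
  git log --format='sha-%h' -n 2 | sed -n '2p'
}

# Usage: export BACKUP_TAG="$(previous_release_tag)"
```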

  4. Run the rollback script:

    ./scripts/rollback.sh

  5. Verify health:

    curl -f http://localhost:8000/api/health/
    curl -f http://localhost:3000/

  6. Check logs:

    docker compose -f docker-compose.prod.yml logs -f --tail=50
    

Last Resort: Simple Restart

If rollback.sh fails completely:

docker compose -f docker-compose.prod.yml restart backend frontend celery celery-beat

This restarts services using whichever image tags are currently present on the VPS, so it may not revert the release if the new images were already pulled.

Database Migration Rollback

Django migrations can be reversible or irreversible.

Reversible Migrations

Most schema migrations are automatically reversible; migrating backwards undoes the operation:

  • AddField - removes the field
  • CreateModel - drops the table
  • AlterField - reverts the field changes
  • AddIndex - drops the index

To rollback a reversible migration:

# SSH to VPS
ssh production-vps
cd /opt/webshop

# Rollback to specific migration number
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py migrate <app_name> <previous_migration_number>

# Example: rollback products app to migration 0015
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py migrate products 0015

Verify rollback:

# Show current migration status
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py showmigrations <app_name>

Irreversible Migrations

Some migrations cannot be automatically reversed:

  • RemoveField - data is lost when field is dropped
  • DeleteModel - data is lost when table is dropped
  • RunPython - custom Python code without reverse_code
  • RunSQL - custom SQL without reverse SQL

For irreversible migrations:

  1. Do not attempt a migrate rollback (it will either fail or lose data)
  2. Restore from a database backup instead (see Disaster Recovery)
  3. Choose a backup timestamp from before the migration ran

Best Practices

  1. Deploy migrations separately:

     • Release 1: deploy the migration only (additive, safe)
     • Verify the migration succeeded
     • Release 2: deploy the code that uses the new schema

  2. Make migrations reversible:

     • Provide reverse_code for RunPython operations
     • Avoid RemoveField and DeleteModel when possible
     • Use AlterField with null=True before removing

  3. Test migrations on staging before production

  4. Take a backup before risky migrations:

    # Trigger manual backup
    docker compose exec backend python manage.py backup_database
    

Troubleshooting

Health Check Fails

Symptom: Deployment fails with "Health check failed after 5 attempts"

Diagnosis:

# Check logs
docker compose -f docker-compose.prod.yml logs backend --tail=200
docker compose -f docker-compose.prod.yml logs frontend --tail=200

# Test health endpoint directly
curl -v http://localhost:8000/api/health/
curl -v http://localhost:3000/

Common causes:

  • Database connection failure (check DB_HOST, DB_PASSWORD in .env)
  • Redis connection failure (check REDIS_URL)
  • Missing environment variables
  • Application startup errors (check logs for stack traces)
  • Port already in use (check docker compose ps)

Container Won't Start

Symptom: Container exits immediately or is stuck in a restart loop

Diagnosis:

# Check container status
docker compose -f docker-compose.prod.yml ps

# Check resource usage
docker stats

# Inspect container
docker compose -f docker-compose.prod.yml logs <service-name> --tail=100

Common causes:

  • Out of memory (check docker stats)
  • Syntax error in code (check logs for Python/JavaScript errors)
  • Missing dependencies (rebuild image)
  • Configuration error in docker-compose.prod.yml

Image Pull Fails

Symptom: "Error pulling image" during deployment

Diagnosis:

# Check GHCR authentication
docker login ghcr.io -u <username>

# Verify image exists
docker pull ghcr.io/<repo>/backend:<tag>

Common causes:

  • Invalid tag name
  • GHCR authentication expired
  • Image not pushed to registry
  • Network connectivity issues

Solution:

# Re-authenticate to GHCR
echo $GITHUB_TOKEN | docker login ghcr.io -u <username> --password-stdin

# Verify image tags
docker images | grep ghcr.io

Disk Space Issues

Symptom: "No space left on device"

Diagnosis:

# Check disk usage
df -h

# Check Docker disk usage
docker system df

Solution:

# Clean up old images and containers
docker system prune -f

# Remove unused volumes (careful!)
docker volume prune -f

# Remove specific old images
docker images | grep '<none>' | awk '{print $3}' | xargs docker rmi

Migration Fails

Symptom: "Migration failed" during deployment

Diagnosis:

# Check migration status
docker compose exec backend python manage.py showmigrations

# Check migration errors
docker compose logs backend | grep -i migration

Common causes:

  • Database schema conflict
  • Missing dependency migration
  • Custom SQL error in migration
  • Database connection interrupted

Solution:

# Fake migration if already applied manually
docker compose exec backend python manage.py migrate <app> <migration> --fake

# Or rollback and re-apply
docker compose exec backend python manage.py migrate <app> <previous-migration>
docker compose exec backend python manage.py migrate <app>

Performance Degradation After Deployment

Symptom: Application slower than before deployment

Diagnosis:

# Check resource usage
docker stats

# Check database connections
docker compose exec db psql -U $DB_USER -d $DB_NAME -c "SELECT count(*) FROM pg_stat_activity;"

# Check Redis memory
docker compose exec redis redis-cli INFO memory

Common causes:

  • Missing database indexes (new queries added)
  • Inefficient queries (check Django query logs)
  • Memory leak (check docker stats over time)
  • Cache not warming up (Redis empty after restart)

Solution:

  • Review and optimize new queries
  • Add database indexes for slow queries
  • Restart services if memory leak suspected
  • Warm up cache if needed