Deployment Runbook

This document provides step-by-step procedures for deploying, rolling back, and managing the FreezeDesign webshop application.

Environment status

There is currently no production environment. The only live environment is staging (https://staging.freezedesign.eu). The Deploy to Production workflow is disabled (gh workflow disable "Deploy to Production") until a production VPS exists — the production sections below describe the future setup.

Overview

Architecture

The FreezeDesign application is deployed using a Docker-based infrastructure with the following characteristics:

  • One live environment: Staging (production planned, not yet provisioned)
  • Container registry: GitHub Container Registry (GHCR)
  • Deployment strategy: Zero-downtime rolling deployment
  • VPS hosting: Single VPS per environment with Docker Compose
  • Automatic rollback: Health check failures trigger automatic rollback

Deployment Flow

  1. CI/CD Pipeline: GitHub Actions builds Docker images and pushes to GHCR
  2. Image Pull: VPS pulls latest images from GHCR
  3. Rolling Deployment: deploy-rolling.sh performs zero-downtime deployment
  4. Migration: Database migrations run after successful deployment
  5. Health Checks: Docker health status of backend and frontend is verified
  6. Performance Gate: k6 homepage load test runs against the deployed staging URL
  7. Notification: Discord notifications on success or failure

Key Scripts

  • scripts/deploy-rolling.sh - Zero-downtime rolling deployment
  • scripts/rollback.sh - Automatic rollback on failure
  • scripts/health-check.sh - Shared Docker health-status check (sourced by the workflow)
  • scripts/notify-discord.sh - Discord notifications

Staging Deployment

Trigger: Automatic on push to the staging branch, as the final deploy job of the CI workflow (after all tests pass)

Workflows:

  • .github/workflows/ci.yml — contains the deploy job (uses: ./.github/workflows/deploy-staging.yml via workflow_call), gated on backend-test, frontend-test, frontend-build, e2e-smoke-test and e2e-visual-test
  • .github/workflows/deploy-staging.yml — the reusable deploy workflow itself (on: workflow_call + workflow_dispatch). A standalone "Deploy to Staging" run only exists for manual dispatch.

Flow

  1. Push to staging branch triggers the CI workflow
  2. CI tests pass (linting, type checking, unit tests, E2E smoke + visual tests)
  3. Build images (build-and-push job): Backend and frontend images built with staging configuration
  4. Images tagged: sha-{commit}, branch ref (staging), staging
  5. Frontend built with NEXT_PUBLIC_API_URL from staging secrets
  6. Push to GHCR: Images pushed to GitHub Container Registry
  7. Copy deployment scripts and compose file to staging VPS (appleboy/scp-action)
  8. SSH to staging VPS (appleboy/ssh-action) and run deploy-rolling.sh (rollback via rollback.sh on failure)
  9. Health checks verify backend and frontend via Docker health status (scripts/health-check.sh)
  10. k6 homepage performance gate runs against the staging URL
  11. Discord notification on success or failure

Manual Intervention

Staging deployment is fully automatic. To manually redeploy the current staging tip (without re-running CI):

gh workflow run "Deploy to Staging"
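If the shell should block until the manual redeploy finishes, the dispatch can be wrapped as below. This is a minimal sketch, assuming the GitHub CLI is installed and authenticated for this repository; the function name redeploy_staging and the SETTLE_SECONDS pause are illustrative and not part of the repository.

```shell
# Hypothetical helper: trigger "Deploy to Staging" and wait for the run to finish.
redeploy_staging() {
  workflow="Deploy to Staging"
  gh workflow run "$workflow" || return 1
  # Give GitHub a moment to register the new run (the pause length is a guess).
  sleep "${SETTLE_SECONDS:-5}"
  # Assumes the newest run of this workflow is the one just dispatched.
  run_id=$(gh run list --workflow "$workflow" --limit 1 \
    --json databaseId --jq '.[0].databaseId') || return 1
  [ -n "$run_id" ] || { echo "no run found for $workflow" >&2; return 1; }
  # --exit-status makes the command fail when the run itself fails.
  gh run watch "$run_id" --exit-status
}
```

Calling redeploy_staging returns non-zero when the dispatch, the lookup, or the deploy run fails, so it can be chained with other commands.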

Promotion to main

After staging verification, promote staging to main with:

scripts/release.sh --promote-only --no-uat

This runs fully automatically: deploy detection in the CI run, promotion PR, self-merge, staging-branch restore, and retarget after-check. The staging branch has a protection rule (deletion and force-push forbidden), so it no longer disappears after promotion merges; the restore steps in release.sh are safety nets.

Monitoring

  • GitHub Actions workflow run status (the deploy job inside the CI run, or a manual "Deploy to Staging" run)
  • Discord notifications
  • Health check logs in workflow output

Production Deployment

Disabled

deploy-production.yml is disabled (gh workflow disable "Deploy to Production") because there is no production VPS. Do not create release tags or trigger this workflow until production exists. Re-enable with gh workflow enable "Deploy to Production" once the production VPS is provisioned.

Trigger: Manual via GitHub Actions workflow dispatch (currently disabled)

Workflow: .github/workflows/deploy-production.yml

Environment Protection: Requires manual approval via GitHub environment rules

Pre-Deployment Checklist

Before triggering a production deployment:

  • All CI checks pass on the tag
  • Staging tested with the same code
  • Database backup verified (check latest hourly backup in S3)
  • Team notified via Discord or Slack
  • Breaking changes documented
  • Rollback plan reviewed

Deployment Steps

  1. Tag the release:

    git tag -a v1.16.0 -m "Release v1.16.0: Production readiness"
    git push origin v1.16.0
    

  2. Trigger workflow via GitHub Actions UI:
    1. Go to Actions → Deploy to Production
    2. Click "Run workflow"
    3. Enter tag: v1.16.0
    4. Confirm

  3. Manual approval: GitHub will pause and wait for approval (if environment protection is enabled)

  4. Deployment proceeds:
    1. Build images with production configuration
    2. Push to GHCR with semantic version tags: v1.16.0, 1.16, latest
    3. Copy deployment scripts to production VPS
    4. SSH to production VPS
    5. Run deploy-rolling.sh for zero-downtime deployment
    6. Health checks via Docker health status (scripts/health-check.sh, 10 attempts × 5 seconds per service)
    7. Automatic rollback if health checks fail
    8. Discord notification on success or failure
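
The mapping from a release tag to the three image tags can be illustrated with plain string handling. This is only a sketch of the naming rule; the workflow itself may derive the tags differently, and the function name semver_tags is hypothetical.

```shell
# Hypothetical helper: derive the image tags from a release tag.
# v1.16.0 yields v1.16.0, 1.16 and latest (one per line).
semver_tags() {
  tag=$1
  version=${tag#v}                 # strip the leading "v": 1.16.0
  case "$version" in
    *.*.*) ;;                      # require MAJOR.MINOR.PATCH
    *) echo "not a semantic version tag: $tag" >&2; return 1 ;;
  esac
  major_minor=${version%.*}        # drop the patch component: 1.16
  printf '%s\n' "$tag" "$major_minor" latest
}
```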

Post-Deployment Verification

After successful deployment:

  1. Check application health:
    • Visit production URL in browser
    • Verify key pages load correctly
    • Test critical user flows (product catalog, designer, checkout)

  2. Monitor logs:

    ssh production-vps
    cd /opt/webshop
    docker compose -f docker-compose.prod.yml logs -f --tail=100
    

  3. Check error tracking: Review Sentry or logging service for errors

Zero-Downtime Deployment Details

The deploy-rolling.sh script updates services one at a time with an automatic per-service rollback to the previous image.

How It Works

  1. Pull images: docker compose pull fetches the latest images from GHCR
  2. For each service (backend, then frontend):
    1. Backup image: The currently running image is tagged locally as predeploy
    2. Recreate: Container is recreated with the new image (up -d --no-deps --force-recreate)
    3. Health check: Docker health status is polled (30 attempts, 5-second intervals = 150 seconds max)
    4. Rollback on failure: If the health check fails, the predeploy image is restored and the container recreated again
  3. Celery services: celery and celery-beat are updated with a plain up -d (no health check)
  4. Migrations: python manage.py migrate runs inside the backend container
  5. Nginx restart: Nginx is restarted to refresh upstream DNS resolution (new container IPs)
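
The per-service part of this sequence (backup, recreate, health check, rollback) can be sketched as follows. This is not deploy-rolling.sh itself. It assumes COMPOSE_FILE is exported, each service runs a single container with a healthcheck, and the image reference ends in ":<tag>". The function names and the HEALTH_ATTEMPTS and HEALTH_INTERVAL variables are illustrative, and health is read with docker inspect here rather than docker compose ps --format json.

```shell
# Poll Docker's health status until the service is healthy or attempts run out.
wait_healthy() {
  service=$1; attempts=${2:-30}; interval=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    cid=$(docker compose ps -q "$service")
    status=$(docker inspect --format '{{.State.Health.Status}}' "$cid" 2>/dev/null)
    [ "$status" = "healthy" ] && return 0
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Roll one service; images are assumed to have been pulled already.
roll_service() {
  service=$1
  image=$2                           # e.g. ghcr.io/<repo>/backend:staging
  backup="${image%:*}:predeploy"     # same repository, tag "predeploy"

  # 1. Snapshot the image the running container was created from.
  cid=$(docker compose ps -q "$service")
  old_id=$(docker inspect --format '{{.Image}}' "$cid") || return 1
  docker tag "$old_id" "$backup" || return 1

  # 2. Recreate the container from the freshly pulled image.
  docker compose up -d --no-deps --force-recreate "$service"

  # 3. Wait for health; 4. on failure restore the snapshot and recreate again.
  if ! wait_healthy "$service" "${HEALTH_ATTEMPTS:-30}" "${HEALTH_INTERVAL:-5}"; then
    docker tag "$backup" "$image"
    docker compose up -d --no-deps --force-recreate "$service"
    return 1
  fi
}
```

With the defaults, wait_healthy gives up after 30 × 5 = 150 seconds, matching the limit described above.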

Resource Requirements

  • VPS capacity: Works within 2GB VPS (1GB base + 1GB headroom)
  • Duration: ~2-5 minutes for full deployment

Health Check Details

  • Mechanism: Docker's built-in health status (from the healthcheck definitions in the compose file), read via docker compose ps --format json
  • Backend healthcheck endpoint: http://127.0.0.1:8000/api/health/ (probed inside the container for debugging when unhealthy)
  • Max attempts: 30 attempts × 5 seconds = 150 seconds maximum wait per service
  • The workflow-level verification step afterwards (scripts/health-check.sh, check_docker_health) uses 10 attempts × 5 seconds per service

Celery Services

Celery workers and beat scheduler are restarted (not rolled, no health check) since they don't serve HTTP traffic.

Automatic Rollback

Trigger: Either deploy-rolling.sh fails on the VPS, or the workflow's health verification step (10 attempts × 5 seconds per service) fails afterwards

Script: scripts/rollback.sh

What Happens

  1. Deployment or health check fails
  2. Rollback script executes:
    1. Prefer the local predeploy backup image (snapshot taken by deploy-rolling.sh)
    2. Otherwise pull the remote backup tag (BACKUP_TAG, default previous) from GHCR
    3. Re-tag it as the deploy tag and recreate the services
  3. Discord notification sent with deployment failure status
  4. The workflow run is marked as failed so the deployment failure is visible
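
The choice between the local snapshot and the remote backup tag can be sketched as below. This is not the real rollback.sh. It assumes images are named $REGISTRY/<repository>/<service>, and it lowercases the repository path because Docker rejects uppercase characters in image names; whether the actual script does the same is an assumption. The function name rollback_source is hypothetical.

```shell
# Hypothetical helper: print the image reference a rollback should restore from.
rollback_source() {
  service=$1
  # Docker requires lowercase repository paths, so normalise the GitHub name.
  repo_path=$(printf '%s' "$GITHUB_REPOSITORY" | tr '[:upper:]' '[:lower:]')
  repo="${REGISTRY:-ghcr.io}/${repo_path}/${service}"

  # Prefer the local snapshot left behind by the rolling deploy.
  if docker image inspect "$repo:predeploy" >/dev/null 2>&1; then
    echo "$repo:predeploy"
    return 0
  fi

  # Otherwise fall back to the remote backup tag (default: previous).
  remote="$repo:${BACKUP_TAG:-previous}"
  docker pull "$remote" >/dev/null || return 1
  echo "$remote"
}
```

The printed reference would then be re-tagged as the deploy tag before the services are recreated.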

Verification After Automatic Rollback

After automatic rollback, verify services are healthy (on the staging VPS):

ssh <user>@staging-vps
cd /opt/webshop
docker compose -f docker-compose.staging.yml ps   # all services healthy?
curl -f https://staging.freezedesign.eu/api/health/

Investigate Failure

  1. Check deployment logs in GitHub Actions workflow output (the deploy job in the CI run)
  2. Check application logs:
    docker compose -f docker-compose.staging.yml logs backend --tail=200
    docker compose -f docker-compose.staging.yml logs frontend --tail=200
    
  3. Identify root cause before attempting re-deployment

Manual Rollback Procedure

When to use:

  • Automatic rollback failed
  • Issues discovered after successful deployment
  • Emergency rollback needed outside of deployment workflow

Step-by-Step

  1. SSH into VPS:

    ssh <user>@staging-vps
    

  2. Navigate to deployment directory:

    cd /opt/webshop
    

  3. Set environment variables:

    export COMPOSE_FILE=docker-compose.staging.yml
    export GITHUB_REPOSITORY=Voorman/webshop_freeze_design
    export REGISTRY=ghcr.io
    export BACKUP_TAG=<previous-sha-tag>
    

Find the previous tag:

  • Use the sha-{commit} tag of the previous successful staging deploy (visible in GHCR or the previous CI run)
  • If a local predeploy backup image still exists on the VPS, the script uses it automatically and BACKUP_TAG is not needed

  4. Run rollback script:

    ./scripts/rollback.sh
    

  5. Verify health:

    docker compose -f docker-compose.staging.yml ps
    curl -f https://staging.freezedesign.eu/api/health/
    

  6. Check logs:

    docker compose -f docker-compose.staging.yml logs -f --tail=50
    

Last Resort: Simple Restart

If rollback.sh fails completely:

docker compose -f docker-compose.staging.yml restart backend frontend celery celery-beat

This restarts the existing containers in place. It does not recreate them or switch images, so it can recover crashed or hung services but does not revert a deployment by itself.

Database Migration Rollback

Django migrations can be reversible or irreversible.

Reversible Migrations

Most migrations are automatically reversible:

  • AddField - removes the field
  • CreateModel - drops the table
  • AlterField - reverts field changes
  • AddIndex - drops the index

To roll back a reversible migration:

# SSH to VPS
ssh <user>@staging-vps
cd /opt/webshop

# Roll back to a specific migration number
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py migrate <app_name> <previous_migration_number>

# Example: roll back the products app to migration 0015
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py migrate products 0015

Verify rollback:

# Show current migration status
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py showmigrations <app_name>
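
To read the result at a glance, the showmigrations listing can be filtered down to the most recent applied migration. A small sketch, assuming the default list format in which applied migrations are marked "[X]"; the function name last_applied_migration is hypothetical.

```shell
# Hypothetical helper: print the last applied migration from a
# `showmigrations <app>` listing read on stdin.
last_applied_migration() {
  # Keep only "[X] name" lines, strip the marker, and take the final one.
  sed -n 's/^ *\[X\] *//p' | tail -n 1
}

# Usage on the VPS (-T disables the TTY so the output pipes cleanly):
#   docker compose -f docker-compose.staging.yml exec -T backend \
#     python manage.py showmigrations products | last_applied_migration
```

After rolling the products app back to 0015, the helper should print the full name of migration 0015.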

Irreversible Migrations

Some migrations cannot be reversed automatically, or can be reversed only at the schema level without restoring the data:

  • RemoveField - reversing re-adds the column, but the data it held is lost
  • DeleteModel - reversing recreates the table, but its rows are lost
  • RunPython - custom Python code without reverse_code
  • RunSQL - custom SQL without reverse SQL

For irreversible migrations:

  1. Do not attempt migrate rollback (will fail or lose data)
  2. Restore from database backup (see Disaster Recovery)
  3. Choose backup timestamp before the migration ran
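
Before deciding between a migrate rollback and a backup restore, the planned operations can be previewed with migrate --plan. The sketch below assumes that Django labels operations without a reverse (RunPython or RunSQL lacking reverse code) as IRREVERSIBLE in that output; it is worth confirming against the Django version in use. It does not detect data loss from schema-reversible operations such as RemoveField. The function name plan_is_reversible is hypothetical.

```shell
# Hypothetical helper: succeed only if a migration plan read from stdin
# contains no operation marked IRREVERSIBLE.
plan_is_reversible() {
  ! grep -q 'IRREVERSIBLE'
}

# Usage on the VPS (previews only, nothing is applied):
#   docker compose -f docker-compose.staging.yml exec -T backend \
#     python manage.py migrate products 0015 --plan | plan_is_reversible \
#     || echo "plan contains irreversible operations: restore from backup instead"
```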

Best Practices

  1. Deploy migrations separately:
    • Release 1: Deploy migration only (additive, safe)
    • Verify migration succeeded
    • Release 2: Deploy code that uses new schema

  2. Make migrations reversible:
    • Provide reverse_code for RunPython operations
    • Avoid RemoveField and DeleteModel when possible
    • Use AlterField with null=True before removing

  3. Test migrations on staging before production

  4. Take backup before risky migrations:

    # Trigger manual backup
    docker compose -f docker-compose.staging.yml exec backend python manage.py backup_database
    

Troubleshooting

Health Check Fails

Symptom: Deployment fails with "<service> failed health check after N attempts" or "<service> health check failed"

Diagnosis:

# Check logs
docker compose -f docker-compose.staging.yml logs backend --tail=200
docker compose -f docker-compose.staging.yml logs frontend --tail=200

# Check Docker health status
docker compose -f docker-compose.staging.yml ps

# Test health endpoint from inside the container
docker compose -f docker-compose.staging.yml exec backend \
  curl -v http://127.0.0.1:8000/api/health/

Common causes:

  • Database connection failure (check DB_HOST, DB_PASSWORD in .env)
  • Redis connection failure (check REDIS_URL)
  • Missing environment variables
  • Application startup errors (check logs for stack traces)
  • Port already in use (check docker compose ps)
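
Missing or empty variables can be spotted quickly with a check against the env file. A minimal sketch, assuming plain NAME=value lines without an export prefix; the function name missing_env_vars is hypothetical, and the variable names are the ones listed above.

```shell
# Hypothetical helper: report variables that are absent or empty in an env file.
# Returns 0 when every name is set, 1 otherwise.
missing_env_vars() {
  env_file=$1; shift
  missing=0
  for name in "$@"; do
    # A variable counts as set only if its line reads NAME=<non-empty value>.
    if ! grep -Eq "^${name}=.+" "$env_file"; then
      echo "missing or empty: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the VPS:
#   missing_env_vars /opt/webshop/.env DB_HOST DB_PASSWORD REDIS_URL
```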

Container Won't Start

Symptom: Container exits immediately or enters a restart loop

Diagnosis:

# Check container status
docker compose -f docker-compose.staging.yml ps

# Check resource usage
docker stats

# Check the service's logs
docker compose -f docker-compose.staging.yml logs <service-name> --tail=100

Common causes:

  • Out of memory (check docker stats)
  • Syntax error in code (check logs for Python/JavaScript errors)
  • Missing dependencies (rebuild image)
  • Configuration error in docker-compose.staging.yml
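
To tell an out-of-memory kill apart from an application crash, the stopped container's state can be read with docker inspect. A sketch, assuming COMPOSE_FILE is exported and the service has a single container; the function name exit_reason is hypothetical. Exit code 137 means the process received SIGKILL (128 + 9), which is what the kernel OOM killer sends.

```shell
# Hypothetical helper: summarise why a service's container stopped.
exit_reason() {
  # -a includes stopped containers, -q prints only the container ID.
  cid=$(docker compose ps -a -q "$1") || return 1
  state=$(docker inspect \
    --format '{{.State.OOMKilled}} {{.State.ExitCode}}' "$cid") || return 1
  oom=${state% *}      # first field: true or false
  code=${state#* }     # second field: numeric exit code
  if [ "$oom" = "true" ]; then
    echo "$1: killed by the kernel OOM killer (exit code $code)"
  else
    echo "$1: exit code $code, not OOM-killed"
  fi
}
```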

Image Pull Fails

Symptom: "Error pulling image" during deployment

Diagnosis:

# Check GHCR authentication
docker login ghcr.io -u <username>

# Verify image exists
docker pull ghcr.io/<repo>/backend:<tag>

Common causes:

  • Invalid tag name
  • GHCR authentication expired
  • Image not pushed to registry
  • Network connectivity issues

Solution:

# Re-authenticate to GHCR
echo "$GITHUB_TOKEN" | docker login ghcr.io -u <username> --password-stdin

# Verify image tags
docker images | grep ghcr.io
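
An invalid tag name can be ruled out before contacting the registry. The sketch below checks a tag against Docker's tag grammar (up to 128 characters from letters, digits, underscore, period and hyphen, not starting with a period or hyphen); the function name valid_tag is hypothetical.

```shell
# Hypothetical helper: succeed if the argument is a syntactically valid image tag.
valid_tag() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$'
}

# Usage: only query the registry when the tag is well-formed.
#   valid_tag "$BACKUP_TAG" \
#     && docker manifest inspect "ghcr.io/<repo>/backend:$BACKUP_TAG" >/dev/null
```

docker manifest inspect asks the registry for the image manifest without downloading the layers, so it is a cheap way to confirm that a tag was actually pushed.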

Disk Space Issues

Symptom: "No space left on device"

Diagnosis:

# Check disk usage
df -h

# Check Docker disk usage
docker system df

Solution:

# Clean up old images and containers
docker system prune -f

# Remove unused volumes (careful!)
docker volume prune -f

# Remove dangling (<none>) images
docker image prune -f
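
Disk pressure can also be checked before it causes a failed deploy. A minimal sketch that reads the used percentage from df; the function name disk_used_percent and the 85 percent threshold in the usage comment are illustrative assumptions, and /var/lib/docker is Docker's default data directory.

```shell
# Hypothetical helper: print the used percentage of the filesystem holding a path.
disk_used_percent() {
  # -P forces the portable format with one line per filesystem;
  # column 5 is the capacity, e.g. "42%".
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage: prune dangling images once usage crosses a threshold.
#   [ "$(disk_used_percent /var/lib/docker)" -ge 85 ] && docker image prune -f
```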

Migration Fails

Symptom: "Migration failed" during deployment

Diagnosis:

# Check migration status
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py showmigrations

# Check migration errors
docker compose -f docker-compose.staging.yml logs backend | grep -i migration

Common causes:

  • Database schema conflict
  • Missing dependency migration
  • Custom SQL error in migration
  • Database connection interrupted

Solution:

# Fake migration if already applied manually
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py migrate <app> <migration> --fake

# Or roll back and re-apply
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py migrate <app> <previous-migration>
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py migrate <app>

Performance Degradation After Deployment

Symptom: Application slower than before deployment

Diagnosis:

# Check resource usage
docker stats

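# Check database connections (DB_USER and DB_NAME must be set in the current shell)
docker compose -f docker-compose.staging.yml exec db \
  psql -U "$DB_USER" -d "$DB_NAME" -c "SELECT count(*) FROM pg_stat_activity;"

# Check Redis memory
docker compose -f docker-compose.staging.yml exec redis redis-cli INFO memory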

Common causes:

  • Missing database indexes (new queries added)
  • Inefficient queries (check Django query logs)
  • Memory leak (check docker stats over time)
  • Cache not warming up (Redis empty after restart)

Solution:

  • Review and optimize new queries
  • Add database indexes for slow queries
  • Restart services if memory leak suspected
  • Warm up cache if needed