Deployment Runbook¶

This document provides step-by-step procedures for deploying, rolling back, and managing the FreezeDesign webshop application.

Environment status

There is currently no production environment. The only live environment is staging (https://staging.freezedesign.eu). The Deploy to Production workflow is disabled (gh workflow disable "Deploy to Production") until a production VPS exists — the production sections below describe the future setup.

Overview¶

Architecture¶

The FreezeDesign application is deployed using a Docker-based infrastructure with the following characteristics:

One live environment: Staging (production planned, not yet provisioned)
Container registry: GitHub Container Registry (GHCR)
Deployment strategy: Zero-downtime rolling deployment
VPS hosting: Single VPS per environment with Docker Compose
Automatic rollback: Health check failures trigger automatic rollback

Deployment Flow¶

CI/CD Pipeline: GitHub Actions builds Docker images and pushes to GHCR
Image Pull: VPS pulls latest images from GHCR
Rolling Deployment: deploy-rolling.sh performs zero-downtime deployment
Migration: Database migrations run after successful deployment
Health Checks: Docker health status of backend and frontend is verified
Performance Gate: k6 homepage load test runs against the deployed staging URL
Notification: Discord notifications on success or failure

Key Scripts¶

scripts/deploy-rolling.sh - Zero-downtime rolling deployment
scripts/rollback.sh - Automatic rollback on failure
scripts/health-check.sh - Shared Docker health-status check (sourced by the workflow)
scripts/notify-discord.sh - Discord notifications

Staging Deployment¶

Trigger: Automatic on push to the staging branch, as the final deploy job of the CI workflow (after all tests pass)

Workflows:

.github/workflows/ci.yml — contains the deploy job (uses: ./.github/workflows/deploy-staging.yml via workflow_call), gated on backend-test, frontend-test, frontend-build, e2e-smoke-test and e2e-visual-test
.github/workflows/deploy-staging.yml — the reusable deploy workflow itself (on: workflow_call + workflow_dispatch). A standalone "Deploy to Staging" run only exists for manual dispatch.

Flow¶

Push to staging branch triggers the CI workflow
CI tests pass (linting, type checking, unit tests, E2E smoke + visual tests)
Build images (build-and-push job): Backend and frontend images built with staging configuration
Images tagged: sha-{commit}, branch ref (staging), staging
Frontend built with NEXT_PUBLIC_API_URL from staging secrets
Push to GHCR: Images pushed to GitHub Container Registry
Copy deployment scripts and compose file to staging VPS (appleboy/scp-action)
SSH to staging VPS (appleboy/ssh-action) and run deploy-rolling.sh (rollback via rollback.sh on failure)
Health checks verify backend and frontend via Docker health status (scripts/health-check.sh)
k6 homepage performance gate runs against the staging URL
Discord notification on success or failure

Manual Intervention¶

Staging deployment is fully automatic. To manually redeploy the current staging tip (without re-running CI):

gh workflow run "Deploy to Staging"
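To follow a manually dispatched run from the terminal, something like the sketch below may help. It assumes the gh CLI is authenticated for this repository. The function name redeploy_staging and the RUN_LOOKUP_DELAY variable are illustrative and not part of the repo. The short delay is there because a freshly dispatched run can take a few seconds to appear in gh run list.

```bash
# Hypothetical helper: dispatch "Deploy to Staging" and wait for its result.
redeploy_staging() {
  local workflow="Deploy to Staging" run_id

  gh workflow run "$workflow" || return 1

  # Give GitHub a moment to register the run (the delay value is a guess).
  sleep "${RUN_LOOKUP_DELAY:-5}"

  # Take the most recent run of this workflow; this assumes nobody else
  # dispatched another run in the meantime.
  run_id="$(gh run list --workflow "$workflow" --limit 1 \
    --json databaseId --jq '.[0].databaseId')" || return 1
  [ -n "$run_id" ] || { echo "no run found for $workflow" >&2; return 1; }

  # Streams progress and exits non-zero if the run fails.
  gh run watch "$run_id" --exit-status
}
```

Calling redeploy_staging returns a non-zero status when the deployment run fails, so it can be chained with other commands.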

Promotion to main¶

After staging verification, promote staging to main with:

scripts/release.sh --promote-only --no-uat

This runs fully automatically: deploy detection in the CI run, promotion PR, self-merge, staging-branch restore, and a follow-up retarget check. The staging branch has a protection rule (deletion and force-push forbidden), so it no longer disappears after promotion merges; the restore steps in release.sh are safety nets.

Monitoring¶

GitHub Actions workflow run status (the deploy job inside the CI run, or a manual "Deploy to Staging" run)
Discord notifications
Health check logs in workflow output

Production Deployment¶

Disabled

deploy-production.yml is disabled (gh workflow disable "Deploy to Production") because there is no production VPS. Do not create release tags or trigger this workflow until production exists. Re-enable with gh workflow enable "Deploy to Production" once the production VPS is provisioned.

Trigger: Manual via GitHub Actions workflow dispatch (currently disabled)

Workflow: .github/workflows/deploy-production.yml

Environment Protection: Requires manual approval via GitHub environment rules

Pre-Deployment Checklist¶

Before triggering a production deployment:

All CI checks pass on the tag
Staging tested with the same code
Database backup verified (check latest hourly backup in S3)
Team notified via Discord or Slack
Breaking changes documented
Rollback plan reviewed

Deployment Steps¶

1. Tag the release:

git tag -a v1.16.0 -m "Release v1.16.0: Production readiness"
git push origin v1.16.0

2. Trigger workflow via GitHub Actions UI:
   - Go to Actions → Deploy to Production
   - Click "Run workflow"
   - Enter tag: v1.16.0
   - Confirm
3. Manual approval: GitHub will pause and wait for approval (if environment protection is enabled)
4. Deployment proceeds:
   - Build images with production configuration
   - Push to GHCR with semantic version tags: v1.16.0, 1.16, latest
   - Copy deployment scripts to production VPS
   - SSH to production VPS
   - Run deploy-rolling.sh for zero-downtime deployment
   - Health checks via Docker health status (scripts/health-check.sh, 10 attempts × 5 seconds per service)
   - Automatic rollback if health checks fail
   - Discord notification on success or failure
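The set of image tags listed for the push step can be derived from the release tag. The following is a minimal sketch, assuming release tags always have the form vMAJOR.MINOR.PATCH. The function name release_tags is hypothetical, and the actual workflow may compute these tags differently.

```bash
# Hypothetical helper: print the image tags for a release tag, one per line.
release_tags() {
  local tag="$1" version
  if ! printf '%s\n' "$tag" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "not a vMAJOR.MINOR.PATCH tag: $tag" >&2
    return 1
  fi
  version="${tag#v}"   # strip the leading "v": 1.16.0
  # Full tag, MAJOR.MINOR (patch component removed), and the moving "latest".
  printf '%s\n' "$tag" "${version%.*}" "latest"
}

release_tags v1.16.0
# v1.16.0
# 1.16
# latest
```

Rejecting malformed input early also makes the function usable as a sanity check on the tag entered in the "Run workflow" dialog.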

Post-Deployment Verification¶

After successful deployment:

Check application health:
Visit production URL in browser
Verify key pages load correctly
Test critical user flows (product catalog, designer, checkout)

Monitor logs:

ssh production-vps
cd /opt/webshop
docker compose -f docker-compose.prod.yml logs -f --tail=100

Check error tracking: Review Sentry or logging service for errors

Zero-Downtime Deployment Details¶

The deploy-rolling.sh script updates services one at a time with an automatic per-service rollback to the previous image.

How It Works¶

Pull images: docker compose pull fetches the latest images from GHCR
For each service (backend, then frontend):
1. Backup image: The currently running image is tagged locally as predeploy
2. Recreate: Container is recreated with the new image (up -d --no-deps --force-recreate)
3. Health check: Docker health status is polled (30 attempts, 5-second intervals = 150 seconds max)
4. Rollback on failure: If the health check fails, the predeploy image is restored and the container recreated again
Celery services: celery and celery-beat are updated with a plain up -d (no health check)
Migrations: python manage.py migrate runs inside the backend container
Nginx restart: Nginx is restarted to refresh upstream DNS resolution (new container IPs)
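The per-service steps above can be sketched roughly as follows. This is not the contents of deploy-rolling.sh: roll_service, compose, and the argument layout are assumptions made for illustration. The health wait is passed in as a command so the sketch stays independent of how health is actually polled.

```bash
# Sketch only; the compose file name is taken from this runbook.
COMPOSE_FILE="${COMPOSE_FILE:-docker-compose.staging.yml}"

compose() { docker compose -f "$COMPOSE_FILE" "$@"; }

# Usage: roll_service <service> <image-repo> <deploy-tag> <health-command...>
roll_service() {
  local service="$1" repo="$2" deploy_tag="$3" cid old_image=""
  shift 3   # the remaining arguments form the health-wait command

  # 1. Backup: tag the image of the running container as "predeploy".
  #    The container's image ID is used because the deploy tag may already
  #    point at the freshly pulled image.
  cid="$(compose ps -q "$service")"
  if [ -n "$cid" ]; then
    old_image="$(docker inspect --format '{{.Image}}' "$cid")"
    docker tag "$old_image" "$repo:predeploy"
  fi

  # 2. Recreate the container with the new image.
  compose up -d --no-deps --force-recreate "$service"

  # 3. Health check, delegated to the caller-supplied command.
  if "$@" "$service"; then
    return 0
  fi

  # 4. Rollback: point the deploy tag back at the backup and recreate.
  echo "$service failed health check" >&2
  if [ -n "$old_image" ]; then
    docker tag "$repo:predeploy" "$repo:$deploy_tag"
    compose up -d --no-deps --force-recreate "$service"
  fi
  return 1
}
```

A call might look like roll_service backend "ghcr.io/<repo>/backend" staging some_health_wait, where some_health_wait is any command that accepts the service name and succeeds once the service is healthy.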

Resource Requirements¶

VPS capacity: Fits on a 2 GB VPS (1 GB base usage + 1 GB headroom)
Duration: ~2-5 minutes for full deployment

Health Check Details¶

Mechanism: Docker's built-in health status (from the healthcheck definitions in the compose file), read via docker compose ps --format json
Backend healthcheck endpoint: http://127.0.0.1:8000/api/health/ (probed inside the container for debugging when unhealthy)
Max attempts: 30 attempts × 5 seconds = 150 seconds maximum wait per service
The workflow-level verification step afterwards (scripts/health-check.sh, check_docker_health) uses 10 attempts × 5 seconds per service
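A polling loop with this attempt budget might look like the sketch below. It assumes the health state appears as "Health":"healthy" in the compact JSON printed by docker compose ps --format json, and it extracts the value with grep to avoid depending on a JSON parser. The names service_health and wait_healthy are illustrative and not necessarily those used in the scripts.

```bash
COMPOSE_FILE="${COMPOSE_FILE:-docker-compose.staging.yml}"

# Print the Docker health state of a service (e.g. starting, healthy, unhealthy).
service_health() {
  docker compose -f "$COMPOSE_FILE" ps --format json "$1" \
    | grep -o '"Health":"[^"]*"' | head -n 1 | cut -d '"' -f 4
}

# Usage: wait_healthy <service> [attempts] [interval-seconds]
# The defaults of 30 x 5 s give the 150-second maximum wait per service.
wait_healthy() {
  local service="$1" attempts="${2:-30}" interval="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    if [ "$(service_health "$service")" = "healthy" ]; then
      echo "$service healthy after $i attempt(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "$service failed health check after $attempts attempts" >&2
  return 1
}
```

With these parameters, wait_healthy backend waits up to 30 × 5 = 150 seconds, while wait_healthy backend 10 5 matches the shorter workflow-level budget of 10 × 5 = 50 seconds.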

Celery Services¶

Celery workers and beat scheduler are restarted (not rolled, no health check) since they don't serve HTTP traffic.

Automatic Rollback¶

Trigger: Either deploy-rolling.sh fails on the VPS, or the workflow's health verification step (10 attempts × 5 seconds per service) fails afterwards

Script: scripts/rollback.sh

What Happens¶

Deployment or health check fails
Rollback script executes:
Prefer the local predeploy backup image (snapshot taken by deploy-rolling.sh)
Otherwise pull the remote backup tag (BACKUP_TAG, default previous) from GHCR
Re-tag it as the deploy tag and recreate the services
Discord notification sent with deployment failure status
The workflow run is marked as failed to signal the deployment failure
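The image selection in the rollback steps can be sketched as below. This is a simplified illustration rather than the actual rollback.sh: the function name restore_image and its arguments are assumptions, while BACKUP_TAG and its default of previous come from this runbook.

```bash
BACKUP_TAG="${BACKUP_TAG:-previous}"

# Usage: restore_image <image-repo> <deploy-tag>
# Re-tags the chosen backup as the deploy tag and prints which source was used.
restore_image() {
  local repo="$1" deploy_tag="$2" src

  if docker image inspect "$repo:predeploy" >/dev/null 2>&1; then
    # Preferred: the local snapshot taken before the deployment.
    src="$repo:predeploy"
  else
    # Fallback: the remote backup tag from the registry.
    src="$repo:$BACKUP_TAG"
    docker pull "$src" >&2 || return 1
  fi

  docker tag "$src" "$repo:$deploy_tag" || return 1
  echo "$src"
}
```

After re-tagging, recreating the services with the compose file picks up the restored image under the deploy tag.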

Verification After Automatic Rollback¶

After automatic rollback, verify services are healthy (on the staging VPS):

ssh <user>@staging-vps
cd /opt/webshop
docker compose -f docker-compose.staging.yml ps   # all services healthy?
curl -f https://staging.freezedesign.eu/api/health/

Investigate Failure¶

Check deployment logs in GitHub Actions workflow output (the deploy job in the CI run)

Check application logs:

docker compose -f docker-compose.staging.yml logs backend --tail=200
docker compose -f docker-compose.staging.yml logs frontend --tail=200

Identify root cause before attempting re-deployment

Manual Rollback Procedure¶

When to use:

Automatic rollback failed
Issues discovered after successful deployment
Emergency rollback needed outside of deployment workflow

Step-by-Step¶

SSH into VPS:
```
ssh <user>@staging-vps
```
Navigate to deployment directory:
```
cd /opt/webshop
```

Set environment variables:

export COMPOSE_FILE=docker-compose.staging.yml
export GITHUB_REPOSITORY=Voorman/webshop_freeze_design
export REGISTRY=ghcr.io
export BACKUP_TAG=<previous-sha-tag>

Find the previous tag:
 - Use the sha-{commit} tag of the previous successful staging deploy (visible in GHCR or the previous CI run)
 - If a local predeploy backup image still exists on the VPS, the script uses it automatically and BACKUP_TAG is not needed

Run rollback script:
```
./scripts/rollback.sh
```

Verify health:

docker compose -f docker-compose.staging.yml ps
curl -f https://staging.freezedesign.eu/api/health/

Check logs:

docker compose -f docker-compose.staging.yml logs -f --tail=50

Last Resort: Simple Restart¶

If rollback.sh fails completely:

docker compose -f docker-compose.staging.yml restart backend frontend celery celery-beat

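This restarts the existing containers without recreating them, so each service keeps running the image its container was created from. It does not roll anything back; use it only to recover services that are stuck.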

Database Migration Rollback¶

Django migrations can be reversible or irreversible.

Reversible Migrations¶

Most migration operations are automatically reversible; reversing them undoes the change:

AddField - removes the field
CreateModel - drops the table
AlterField - reverts field changes
AddIndex - drops the index

To roll back a reversible migration:

# SSH to VPS
ssh <user>@staging-vps
cd /opt/webshop

# Roll back to a specific migration (the last one that should remain applied)
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py migrate <app_name> <previous_migration_number>

# Example: roll back the products app to migration 0015 (unapplies 0016 and later)
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py migrate products 0015

Verify rollback:

# Show current migration status
docker compose -f docker-compose.staging.yml exec backend \
  python manage.py showmigrations <app_name>
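To read the showmigrations output mechanically, a small filter such as the one below can list the migrations that are currently not applied. It relies on showmigrations marking applied migrations with [X] and unapplied ones with [ ]. The function name unapplied_migrations is hypothetical.

```bash
COMPOSE_FILE="${COMPOSE_FILE:-docker-compose.staging.yml}"

# Print the names of all unapplied migrations of an app, one per line.
unapplied_migrations() {
  local app="$1"
  # -T disables TTY allocation so the output can be piped cleanly.
  docker compose -f "$COMPOSE_FILE" exec -T backend \
    python manage.py showmigrations "$app" \
    | sed -n 's/^ *\[ \] *//p'
}
```

After rolling back the products app to 0015, for example, unapplied_migrations products should list 0016 and everything after it.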

Irreversible Migrations¶

Some migrations cannot be reversed automatically, or can only be reversed with data loss:

RemoveField - data is lost when field is dropped
DeleteModel - data is lost when table is dropped
RunPython - custom Python code without reverse_code
RunSQL - custom SQL without reverse SQL

For irreversible migrations:

Do not attempt a migrate rollback (it will fail or lose data)
Restore from database backup (see Disaster Recovery)
Choose backup timestamp before the migration ran
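Before deciding between a migrate rollback and a restore, the backward plan can be inspected without changing the database. The sketch below assumes that Django's migrate --plan output labels operations lacking a reverse (RunPython without reverse_code, RunSQL without reverse SQL) with the word IRREVERSIBLE, which is worth confirming on the project's Django version. The function name check_rollback_plan is hypothetical.

```bash
COMPOSE_FILE="${COMPOSE_FILE:-docker-compose.staging.yml}"

# Usage: check_rollback_plan <app> <target-migration>
# Prints the plan; returns 1 if it contains an irreversible operation,
# 2 if the plan could not be produced.
check_rollback_plan() {
  local app="$1" target="$2" plan
  plan="$(docker compose -f "$COMPOSE_FILE" exec -T backend \
    python manage.py migrate "$app" "$target" --plan)" || return 2
  printf '%s\n' "$plan"
  if printf '%s\n' "$plan" | grep -q 'IRREVERSIBLE'; then
    echo "plan contains irreversible operations; restore from backup instead" >&2
    return 1
  fi
}
```

A clean plan is not proof of safety: Django can reverse RemoveField and DeleteModel at the schema level by recreating the column or table, but the dropped data stays lost, so those cases still call for a backup restore.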

Best Practices¶

Deploy migrations separately:
 - Release 1: Deploy migration only (additive, safe)
 - Verify migration succeeded
 - Release 2: Deploy code that uses new schema
Make migrations reversible:
 - Provide reverse_code for RunPython operations
 - Avoid RemoveField and DeleteModel when possible
 - Make a field nullable first (AlterField with null=True) before removing it
Test migrations on staging before production

Take backup before risky migrations:

# Trigger manual backup
docker compose -f docker-compose.staging.yml exec backend python manage.py backup_database

Troubleshooting¶

Health Check Fails¶

Symptom: Deployment fails with "<service> failed health check after N attempts" or "<service> health check failed"

Diagnosis:

# Check logs
docker compose -f docker-compose.staging.yml logs backend --tail=200
docker compose -f docker-compose.staging.yml logs frontend --tail=200

# Check Docker health status
docker compose -f docker-compose.staging.yml ps

# Test health endpoint from inside the container
docker compose -f docker-compose.staging.yml exec backend \
  curl -v http://127.0.0.1:8000/api/health/

Common causes:

Database connection failure (check DB_HOST, DB_PASSWORD in .env)
Redis connection failure (check REDIS_URL)
Missing environment variables
Application startup errors (check logs for stack traces)
Port already in use (check docker compose ps)

Container Won't Start¶

Symptom: Container exits immediately or enters a restart loop

Diagnosis:

# Check container status
docker compose -f docker-compose.staging.yml ps

# Check resource usage
docker stats

# Inspect container logs
docker compose -f docker-compose.staging.yml logs <service-name> --tail=100

Common causes:

Out of memory (check docker stats)
Syntax error in code (check logs for Python/JavaScript errors)
Missing dependencies (rebuild image)
Configuration error in docker-compose.staging.yml

Image Pull Fails¶

Symptom: "Error pulling image" during deployment

Diagnosis:

# Check GHCR authentication
docker login ghcr.io -u <username>

# Verify image exists
docker pull ghcr.io/<repo>/backend:<tag>

Common causes:

Invalid tag name
GHCR authentication expired
Image not pushed to registry
Network connectivity issues

Solution:

# Re-authenticate to GHCR
echo "$GITHUB_TOKEN" | docker login ghcr.io -u <username> --password-stdin

# Verify image tags
docker images | grep ghcr.io

Disk Space Issues¶

Symptom: "No space left on device"

Diagnosis:

# Check disk usage
df -h

# Check Docker disk usage
docker system df

Solution:

# Clean up old images and containers
docker system prune -f

# Remove unused volumes (careful: this permanently deletes their data)
docker volume prune -f

# Remove dangling (<none>) images; -r skips docker rmi when there are none
docker images -f dangling=true -q | xargs -r docker rmi

Migration Fails¶

Symptom: "Migration failed" during deployment

Diagnosis:

# Check migration status
docker compose -f docker-compose.staging.yml exec backend python manage.py showmigrations

# Check migration errors
docker compose -f docker-compose.staging.yml logs backend | grep -i migration

Common causes:

Database schema conflict
Missing dependency migration
Custom SQL error in migration
Database connection interrupted

Solution:

# Mark the migration as applied without running it (only if it was already applied manually)
docker compose -f docker-compose.staging.yml exec backend python manage.py migrate <app> <migration> --fake

# Or roll back and re-apply
docker compose -f docker-compose.staging.yml exec backend python manage.py migrate <app> <previous-migration>
docker compose -f docker-compose.staging.yml exec backend python manage.py migrate <app>

Performance Degradation After Deployment¶

Symptom: Application slower than before deployment

Diagnosis:

# Check resource usage
docker stats

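# Check database connections (DB_USER and DB_NAME must be set in the current shell)
docker compose -f docker-compose.staging.yml exec db psql -U "$DB_USER" -d "$DB_NAME" -c "SELECT count(*) FROM pg_stat_activity;"

# Check Redis memory
docker compose -f docker-compose.staging.yml exec redis redis-cli INFO memory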

Common causes:

Missing database indexes (new queries added)
Inefficient queries (check Django query logs)
Memory leak (check docker stats over time)
Cache not warming up (Redis empty after restart)

Solution:

Review and optimize new queries
Add database indexes for slow queries
Restart services if memory leak suspected
Warm up cache if needed

Related Documentation¶

Disaster Recovery - Database backup, restore, RTO/RPO targets
Architecture Overview - System architecture overview