Deployment Runbook¶
This document provides step-by-step procedures for deploying, rolling back, and managing the FreezeDesign webshop application.
Environment status
There is currently no production environment. The only live environment is
staging (https://staging.freezedesign.eu). The Deploy to Production
workflow is disabled (gh workflow disable "Deploy to Production") until a
production VPS exists — the production sections below describe the future setup.
Table of Contents¶
- Overview
- Staging Deployment
- Production Deployment
- Zero-Downtime Deployment Details
- Automatic Rollback
- Manual Rollback Procedure
- Database Migration Rollback
- Troubleshooting
Overview¶
Architecture¶
The FreezeDesign application is deployed using a Docker-based infrastructure with the following characteristics:
- One live environment: Staging (production planned, not yet provisioned)
- Container registry: GitHub Container Registry (GHCR)
- Deployment strategy: Zero-downtime rolling deployment
- VPS hosting: Single VPS per environment with Docker Compose
- Automatic rollback: Health check failures trigger automatic rollback
Deployment Flow¶
- CI/CD Pipeline: GitHub Actions builds Docker images and pushes to GHCR
- Image Pull: VPS pulls latest images from GHCR
- Rolling Deployment: deploy-rolling.sh performs zero-downtime deployment
- Migration: Database migrations run after successful deployment
- Health Checks: Docker health status of backend and frontend is verified
- Performance Gate: k6 homepage load test runs against the deployed staging URL
- Notification: Discord notifications on success or failure
Key Scripts¶
- scripts/deploy-rolling.sh - Zero-downtime rolling deployment
- scripts/rollback.sh - Automatic rollback on failure
- scripts/health-check.sh - Shared Docker health-status check (sourced by the workflow)
- scripts/notify-discord.sh - Discord notifications
Staging Deployment¶
Trigger: Automatic on push to the staging branch, as the final deploy job of the CI workflow (after all tests pass)
Workflows:
- .github/workflows/ci.yml: contains the deploy job (uses: ./.github/workflows/deploy-staging.yml via workflow_call), gated on backend-test, frontend-test, frontend-build, e2e-smoke-test and e2e-visual-test
- .github/workflows/deploy-staging.yml: the reusable deploy workflow itself (on: workflow_call + workflow_dispatch). A standalone "Deploy to Staging" run only exists for manual dispatch.
Flow¶
- Push to staging branch triggers the CI workflow
- CI tests pass (linting, type checking, unit tests, E2E smoke + visual tests)
- Build images (build-and-push job): Backend and frontend images built with staging configuration
- Images tagged: sha-{commit}, branch ref (staging), staging
- Frontend built with NEXT_PUBLIC_API_URL from staging secrets
- Push to GHCR: Images pushed to GitHub Container Registry
- Copy deployment scripts and compose file to staging VPS (appleboy/scp-action)
- SSH to staging VPS (appleboy/ssh-action) and run deploy-rolling.sh (rollback via rollback.sh on failure)
- Health checks verify backend and frontend via Docker health status (scripts/health-check.sh)
- k6 homepage performance gate runs against the staging URL
- Discord notification on success or failure
Manual Intervention¶
Staging deployment is fully automatic. To manually redeploy the current staging tip (without re-running CI):
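One way to trigger that manual run is the GitHub CLI. The following is a minimal sketch, assuming gh is authenticated for this repository and that the workflow_dispatch trigger of deploy-staging.yml takes no required inputs; the function name is hypothetical.

```shell
# Hypothetical helper: dispatch the reusable staging deploy workflow by file name.
# Assumes `gh` is authenticated and deploy-staging.yml needs no required inputs.
redeploy_staging() {
  gh workflow run deploy-staging.yml --ref staging
}
```

After calling redeploy_staging, the new run can be found with gh run list --workflow=deploy-staging.yml --limit 1.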
Promotion to main¶
After staging verification, promote staging to main with:
This runs fully automatically: deploy detection in the CI run, promotion PR, self-merge, staging-branch restore, and retarget after-check. The staging branch has a protection rule (deletion and force-push forbidden), so it no longer disappears after promotion merges; the restore steps in release.sh are safety nets.
Monitoring¶
- GitHub Actions workflow run status (the deploy job inside the CI run, or a manual "Deploy to Staging" run)
- Discord notifications
- Health check logs in workflow output
Production Deployment¶
Disabled
deploy-production.yml is disabled (gh workflow disable "Deploy to Production")
because there is no production VPS. Do not create release tags or trigger this
workflow until production exists. Re-enable with
gh workflow enable "Deploy to Production" once the production VPS is provisioned.
Trigger: Manual via GitHub Actions workflow dispatch (currently disabled)
Workflow: .github/workflows/deploy-production.yml
Environment Protection: Requires manual approval via GitHub environment rules
Pre-Deployment Checklist¶
Before triggering a production deployment:
- All CI checks pass on the tag
- Staging tested with the same code
- Database backup verified (check latest hourly backup in S3)
- Team notified via Discord or Slack
- Breaking changes documented
- Rollback plan reviewed
Deployment Steps¶
- Tag the release:
- Trigger workflow via GitHub Actions UI:
  - Go to Actions → Deploy to Production
  - Click "Run workflow"
  - Enter tag: v1.16.0
  - Confirm
- Manual approval: GitHub will pause and wait for approval (if environment protection is enabled)
- Deployment proceeds:
  - Build images with production configuration
  - Push to GHCR with semantic version tags: v1.16.0, 1.16, latest
  - Copy deployment scripts to production VPS
  - SSH to production VPS
  - Run deploy-rolling.sh for zero-downtime deployment
  - Health checks via Docker health status (scripts/health-check.sh, 10 attempts × 5 seconds per service)
  - Automatic rollback if health checks fail
  - Discord notification on success or failure
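The semantic version tags listed in the steps above follow mechanically from the release tag. A minimal sketch of that derivation, assuming tags have the form vMAJOR.MINOR.PATCH; the function name is hypothetical and the real workflow may compute its tags differently.

```shell
# Hypothetical helper: print the image tags implied by a release tag.
# Assumes the vMAJOR.MINOR.PATCH format used in the example (v1.16.0).
release_image_tags() {
  local tag="$1" version
  if ! printf '%s\n' "$tag" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "error: expected a tag like v1.16.0, got '$tag'" >&2
    return 1
  fi
  version="${tag#v}"            # 1.16.0
  printf '%s\n' "$tag" "${version%.*}" "latest"   # full tag, MAJOR.MINOR, latest
}
```

Calling release_image_tags v1.16.0 prints the three tags one per line; a malformed tag is rejected before anything is built or pushed.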
Post-Deployment Verification¶
After successful deployment:
- Check application health:
  - Visit production URL in browser
  - Verify key pages load correctly
  - Test critical user flows (product catalog, designer, checkout)
- Monitor logs:
- Check error tracking: Review Sentry or logging service for errors
Zero-Downtime Deployment Details¶
The deploy-rolling.sh script updates services one at a time with an automatic per-service rollback to the previous image.
How It Works¶
- Pull images: docker compose pull fetches the latest images from GHCR
- For each service (backend, then frontend):
  - Backup image: The currently running image is tagged locally as predeploy
  - Recreate: Container is recreated with the new image (up -d --no-deps --force-recreate)
  - Health check: Docker health status is polled (30 attempts, 5-second intervals = 150 seconds max)
  - Rollback on failure: If the health check fails, the predeploy image is restored and the container recreated again
- Celery services: celery and celery-beat are updated with a plain up -d (no health check)
- Migrations: python manage.py migrate runs inside the backend container
- Nginx restart: Nginx is restarted to refresh upstream DNS resolution (new container IPs)
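The per-service part of this sequence can be sketched roughly as follows. This illustrates the steps above and is not the contents of deploy-rolling.sh: the function names, the DEPLOY_COMPOSE_FILE, HEALTH_ATTEMPTS and HEALTH_INTERVAL variables, and the use of docker inspect for the health status (the script itself reads docker compose ps --format json) are all assumptions.

```shell
# Sketch only: names below are hypothetical, not taken from deploy-rolling.sh.
DEPLOY_COMPOSE_FILE="${DEPLOY_COMPOSE_FILE:-docker-compose.staging.yml}"
HEALTH_ATTEMPTS="${HEALTH_ATTEMPTS:-30}"   # 30 attempts x 5 s = 150 s max
HEALTH_INTERVAL="${HEALTH_INTERVAL:-5}"

dc() { docker compose -f "$DEPLOY_COMPOSE_FILE" "$@"; }

# Poll Docker's health status for one service until healthy or out of attempts.
wait_healthy() {
  local service="$1" attempt=1 cid status
  while [ "$attempt" -le "$HEALTH_ATTEMPTS" ]; do
    cid="$(dc ps -q "$service")"
    status="$(docker inspect --format '{{.State.Health.Status}}' "$cid" 2>/dev/null)"
    [ "$status" = "healthy" ] && return 0
    sleep "$HEALTH_INTERVAL"
    attempt=$((attempt + 1))
  done
  return 1
}

# Update one service; restore the predeploy image if it never becomes healthy.
# $2 is the full image reference the compose file uses for this service.
roll_service() {
  local service="$1" image="$2" cid current
  cid="$(dc ps -q "$service")"
  current="$(docker inspect --format '{{.Image}}' "$cid")"
  docker tag "$current" "${image%:*}:predeploy"    # local backup of the old image
  dc up -d --no-deps --force-recreate "$service"   # start the newly pulled image
  if ! wait_healthy "$service"; then
    docker tag "${image%:*}:predeploy" "$image"    # point the deploy tag back
    dc up -d --no-deps --force-recreate "$service"
    return 1
  fi
}
```

Under these assumptions a deploy would call roll_service for backend first and only continue to frontend if it returns success, which keeps a broken backend image from taking the frontend down with it.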
Resource Requirements¶
- VPS capacity: Works within 2GB VPS (1GB base + 1GB headroom)
- Duration: ~2-5 minutes for full deployment
Health Check Details¶
- Mechanism: Docker's built-in health status (from the healthcheck definitions in the compose file), read via docker compose ps --format json
- Backend healthcheck endpoint: http://127.0.0.1:8000/api/health/ (probed inside the container for debugging when unhealthy)
- Max attempts: 30 attempts × 5 seconds = 150 seconds maximum wait per service
- The workflow-level verification step afterwards (scripts/health-check.sh, check_docker_health) uses 10 attempts × 5 seconds per service
Celery Services¶
Celery workers and beat scheduler are restarted (not rolled, no health check) since they don't serve HTTP traffic.
Automatic Rollback¶
Trigger: Either deploy-rolling.sh fails on the VPS, or the workflow's health verification step (10 attempts × 5 seconds per service) fails afterwards
Script: scripts/rollback.sh
What Happens¶
- Deployment or health check fails
- Rollback script executes:
  - Prefer the local predeploy backup image (snapshot taken by deploy-rolling.sh)
  - Otherwise pull the remote backup tag (BACKUP_TAG, default previous) from GHCR
  - Re-tag it as the deploy tag and recreate the services
- Discord notification sent with deployment failure status
- Workflow fails to signal deployment failure
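The image selection in the rollback step amounts to a local-first lookup. A minimal sketch, assuming a hypothetical function name and that the image repository is passed in without a tag; rollback.sh itself may be structured differently.

```shell
# Hypothetical helper: re-point the deploy tag at the best available backup.
# $1 = image repository without tag, $2 = deploy tag (for example "staging").
restore_backup_image() {
  local repo="$1" deploy_tag="$2"
  local backup_tag="${BACKUP_TAG:-previous}"   # default named in the runbook
  if docker image inspect "${repo}:predeploy" >/dev/null 2>&1; then
    # Local snapshot taken just before the failed deploy.
    docker tag "${repo}:predeploy" "${repo}:${deploy_tag}"
  else
    # Fall back to the remote backup tag in GHCR.
    docker pull "${repo}:${backup_tag}" || return 1
    docker tag "${repo}:${backup_tag}" "${repo}:${deploy_tag}"
  fi
}
```

Preferring the local image avoids a registry round trip and still works when GHCR is unreachable; the services then need to be recreated so they pick up the re-tagged image.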
Verification After Automatic Rollback¶
After automatic rollback, verify services are healthy (on the staging VPS):
ssh <user>@staging-vps
cd /opt/webshop
docker compose -f docker-compose.staging.yml ps # all services healthy?
curl -f https://staging.freezedesign.eu/api/health/
Investigate Failure¶
- Check deployment logs in GitHub Actions workflow output (the deploy job in the CI run)
- Check application logs:
- Identify root cause before attempting re-deployment
Manual Rollback Procedure¶
When to use:
- Automatic rollback failed
- Issues discovered after successful deployment
- Emergency rollback needed outside of deployment workflow
Step-by-Step¶
- SSH into VPS:
- Navigate to deployment directory:
- Set environment variables:
  Find the previous tag:
  - Use the sha-{commit} tag of the previous successful staging deploy (visible in GHCR or the previous CI run)
  - If a local predeploy backup image still exists on the VPS, the script uses it automatically and BACKUP_TAG is not needed
- Run rollback script:
- Verify health:
- Check logs:
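The steps above can be combined into one helper that is run on the VPS after logging in with ssh <user>@staging-vps. This is a sketch under stated assumptions: that the scripts were copied to a scripts/ directory under /opt/webshop, that rollback.sh reads BACKUP_TAG from the environment, and that DEPLOY_DIR and the function name are hypothetical. Check rollback.sh before relying on it.

```shell
# Sketch of the manual rollback steps; run on the VPS, not locally.
# Assumptions: scripts live in <deploy dir>/scripts and rollback.sh reads
# BACKUP_TAG from the environment. DEPLOY_DIR is a hypothetical override.
manual_rollback() {
  local backup_tag="$1"                    # e.g. sha-<commit>; may be empty
  cd "${DEPLOY_DIR:-/opt/webshop}" || return 1
  if [ -n "$backup_tag" ]; then
    export BACKUP_TAG="$backup_tag"        # only needed without a local predeploy image
  fi
  bash scripts/rollback.sh || return 1
  # Verify: all services should report healthy, and the API should answer.
  docker compose -f docker-compose.staging.yml ps
  curl -f https://staging.freezedesign.eu/api/health/ || return 1
  # Recent logs for a final check.
  docker compose -f docker-compose.staging.yml logs backend --tail=200
}
```

Passing no argument leaves BACKUP_TAG unset, which matches the case where the local predeploy image still exists.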
Last Resort: Simple Restart¶
If rollback.sh fails completely:
This restarts services with the currently tagged images (may not fully revert if images were already pulled).
Database Migration Rollback¶
Django migrations can be reversible or irreversible.
Reversible Migrations¶
Most migrations are automatically reversible:
- AddField - removes the field
- CreateModel - drops the table
- AlterField - reverts field changes
- AddIndex - drops the index
To roll back a reversible migration:
# SSH to VPS
ssh <user>@staging-vps
cd /opt/webshop
# Roll back to a specific migration number
docker compose -f docker-compose.staging.yml exec backend \
python manage.py migrate <app_name> <previous_migration_number>
# Example: roll back the products app to migration 0015
docker compose -f docker-compose.staging.yml exec backend \
python manage.py migrate products 0015
Verify rollback:
# Show current migration status
docker compose -f docker-compose.staging.yml exec backend \
python manage.py showmigrations <app_name>
Irreversible Migrations¶
Some migrations cannot be automatically reversed:
- RemoveField - data is lost when field is dropped
- DeleteModel - data is lost when table is dropped
- RunPython - custom Python code without reverse_code
- RunSQL - custom SQL without reverse SQL
For irreversible migrations:
- Do not attempt migrate rollback (will fail or lose data)
- Restore from database backup (see Disaster Recovery)
- Choose backup timestamp before the migration ran
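Whether a given rollback would hit an irreversible operation can be checked before running it. The sketch below relies on Django's migrate --plan option, which lists the planned operations without applying them and labels operations it cannot reverse as IRREVERSIBLE. The function name is hypothetical, and the compose file and backend service name are taken from the commands in this runbook.

```shell
# Hypothetical helper: show what a backwards migrate would do, without applying it.
# Returns 1 if the plan contains an operation Django marks as IRREVERSIBLE.
preview_migration_rollback() {
  local app="$1" target="$2" plan status
  plan="$(docker compose -f docker-compose.staging.yml exec -T backend \
    python manage.py migrate "$app" "$target" --plan)"
  status=$?
  printf '%s\n' "$plan"
  [ "$status" -eq 0 ] || return 2          # the plan itself could not be produced
  if printf '%s\n' "$plan" | grep -q 'IRREVERSIBLE'; then
    echo "Irreversible operation found: restore from backup instead." >&2
    return 1
  fi
}
```

For example, preview_migration_rollback products 0015 prints the plan for rolling the products app back to migration 0015. Note that RemoveField and DeleteModel are technically reversible from Django's point of view (the column or table is recreated empty), so a clean plan does not rule out data loss.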
Best Practices¶
- Deploy migrations separately:
  - Release 1: Deploy migration only (additive, safe)
  - Verify migration succeeded
  - Release 2: Deploy code that uses new schema
- Make migrations reversible:
  - Provide reverse_code for RunPython operations
  - Avoid RemoveField and DeleteModel when possible
  - Use AlterField with null=True before removing
- Test migrations on staging before production
- Take backup before risky migrations:
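An ad hoc dump taken on the VPS just before the migration could look like the sketch below. It assumes the database runs as the compose service db and that DB_USER and DB_NAME are set, as in the psql command under Troubleshooting; the function name and the file naming are hypothetical, and this does not replace the backups described in Disaster Recovery.

```shell
# Hypothetical helper: dump the database to a timestamped file before migrating.
# Assumes a compose service named "db" and DB_USER / DB_NAME in the environment.
backup_before_migration() {
  local out="${1:-predeploy-$(date +%Y%m%d-%H%M%S).sql}"
  # -T disables the pseudo-TTY so the dump can be redirected cleanly.
  docker compose -f docker-compose.staging.yml exec -T db \
    pg_dump -U "$DB_USER" -d "$DB_NAME" > "$out" || return 1
  # An empty file means the dump silently produced nothing.
  [ -s "$out" ] || { echo "backup file is empty: $out" >&2; return 1; }
  echo "$out"
}
```

The function prints the path of the dump on success, so the result can be recorded alongside the deployment.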
Troubleshooting¶
Health Check Fails¶
Symptom: Deployment fails with "<service> failed health check after N attempts" or "<service> health check failed"
Diagnosis:
# Check logs
docker compose -f docker-compose.staging.yml logs backend --tail=200
docker compose -f docker-compose.staging.yml logs frontend --tail=200
# Check Docker health status
docker compose -f docker-compose.staging.yml ps
# Test health endpoint from inside the container
docker compose -f docker-compose.staging.yml exec backend \
curl -v http://127.0.0.1:8000/api/health/
Common causes:
- Database connection failure (check DB_HOST, DB_PASSWORD in .env)
- Redis connection failure (check REDIS_URL)
- Missing environment variables
- Application startup errors (check logs for stack traces)
- Port already in use (check docker compose ps)
Container Won't Start¶
Symptom: Container exits immediately or enters a restart loop
Diagnosis:
# Check container status
docker compose -f docker-compose.staging.yml ps
# Check resource usage
docker stats
# Inspect container logs
docker compose -f docker-compose.staging.yml logs <service-name> --tail=100
Common causes:
- Out of memory (check docker stats)
- Syntax error in code (check logs for Python/JavaScript errors)
- Missing dependencies (rebuild image)
- Configuration error in docker-compose.staging.yml
Image Pull Fails¶
Symptom: "Error pulling image" during deployment
Diagnosis:
# Check GHCR authentication
docker login ghcr.io -u <username>
# Verify image exists
docker pull ghcr.io/<repo>/backend:<tag>
Common causes:
- Invalid tag name
- GHCR authentication expired
- Image not pushed to registry
- Network connectivity issues
Solution:
# Re-authenticate to GHCR
echo "$GITHUB_TOKEN" | docker login ghcr.io -u <username> --password-stdin
# Verify image tags
docker images | grep ghcr.io
Disk Space Issues¶
Symptom: "No space left on device"
Diagnosis:
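A typical check looks at overall filesystem usage first and then at Docker's share of it. The sketch below assumes Docker's data lives on the root filesystem; the function names, the DISK_WARN_PERCENT variable and the 90% default are illustrative.

```shell
# Hypothetical helper: print the usage percentage of the filesystem holding $1.
disk_usage_percent() {
  # POSIX output format: column 5 of the second line is the capacity in percent.
  df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Warn (and return 1) when usage reaches the threshold.
check_disk() {
  local used
  used="$(disk_usage_percent "${1:-/}")"
  echo "disk usage: ${used}%"
  if [ "$used" -ge "${DISK_WARN_PERCENT:-90}" ]; then
    echo "low disk space; inspect Docker usage with: docker system df" >&2
    return 1
  fi
}
```

On the VPS, check_disk / followed by docker system df shows whether images, containers, volumes or build cache are taking the space.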
Solution:
# Clean up old images and containers
docker system prune -f
# Remove unused volumes (careful!)
docker volume prune -f
# Remove specific old images
docker images -f dangling=true -q | xargs -r docker rmi
Migration Fails¶
Symptom: "Migration failed" during deployment
Diagnosis:
# Check migration status
docker compose -f docker-compose.staging.yml exec backend python manage.py showmigrations
# Check migration errors
docker compose -f docker-compose.staging.yml logs backend | grep -i migration
Common causes:
- Database schema conflict
- Missing dependency migration
- Custom SQL error in migration
- Database connection interrupted
Solution:
# Fake migration if already applied manually
docker compose -f docker-compose.staging.yml exec backend python manage.py migrate <app> <migration> --fake
# Or roll back and re-apply
docker compose -f docker-compose.staging.yml exec backend python manage.py migrate <app> <previous-migration>
docker compose -f docker-compose.staging.yml exec backend python manage.py migrate <app>
Performance Degradation After Deployment¶
Symptom: Application slower than before deployment
Diagnosis:
# Check resource usage
docker stats
# Check database connections
docker compose -f docker-compose.staging.yml exec db psql -U "$DB_USER" -d "$DB_NAME" -c "SELECT count(*) FROM pg_stat_activity;"
# Check Redis memory
docker compose -f docker-compose.staging.yml exec redis redis-cli INFO memory
Common causes:
- Missing database indexes (new queries added)
- Inefficient queries (check Django query logs)
- Memory leak (check docker stats over time)
- Cache not warming up (Redis empty after restart)
Solution:
- Review and optimize new queries
- Add database indexes for slow queries
- Restart services if memory leak suspected
- Warm up cache if needed
Related Documentation¶
- Disaster Recovery - Database backup, restore, RTO/RPO targets
- Architecture Overview - System architecture overview