# Deployment Runbook

This document provides step-by-step procedures for deploying, rolling back, and managing the FreezeDesign webshop application in both staging and production environments.
## Table of Contents

- Overview
- Staging Deployment
- Production Deployment
- Zero-Downtime Deployment Details
- Automatic Rollback
- Manual Rollback Procedure
- Database Migration Rollback
- Troubleshooting
## Overview

### Architecture

The FreezeDesign application is deployed using a Docker-based infrastructure with the following characteristics:

- Two environments: Staging and Production
- Container registry: GitHub Container Registry (GHCR)
- Deployment strategy: Zero-downtime rolling deployment
- VPS hosting: Single VPS per environment with Docker Compose
- Automatic rollback: Health check failures trigger automatic rollback
### Deployment Flow

1. CI/CD pipeline: GitHub Actions builds Docker images and pushes them to GHCR
2. Image pull: The VPS pulls the latest images from GHCR
3. Rolling deployment: `deploy-rolling.sh` performs a zero-downtime deployment
4. Health checks: Backend and frontend health endpoints are verified
5. Migration: Database migrations run after successful deployment
6. Notification: Discord notifications on success or failure

### Key Scripts

- `scripts/deploy-rolling.sh` - Zero-downtime rolling deployment
- `scripts/rollback.sh` - Automatic rollback on failure
- `scripts/notify-discord.sh` - Discord notifications
## Staging Deployment

Trigger: Automatic on push to `main` branch

Workflow: `.github/workflows/deploy-staging.yml`

### Flow

1. Push to `main` branch triggers the workflow
2. CI tests pass (linting, type checking, unit tests)
3. Build images: Backend and frontend images built with staging configuration
   - Backend image tagged: `sha-{commit}`, `main`, `staging`
   - Frontend image tagged: `sha-{commit}`, `main`, `staging`
   - Frontend built with `NEXT_PUBLIC_API_URL` from staging secrets
4. Push to GHCR: Images pushed to GitHub Container Registry
5. Copy deployment scripts to staging VPS
6. SSH to staging VPS and run `deploy-rolling.sh`
7. Health checks verify backend and frontend are healthy
8. Import seed data (staging only) for testing
9. Discord notification on success or failure
### Manual Intervention

Staging deployment is fully automatic. No manual steps required.

### Monitoring

- GitHub Actions workflow run status
- Discord notifications
- Health check logs in workflow output
## Production Deployment

Trigger: Manual via GitHub Actions workflow dispatch

Workflow: `.github/workflows/deploy-production.yml`

Environment Protection: Requires manual approval via GitHub environment rules

### Pre-Deployment Checklist

Before triggering a production deployment:

- All CI checks pass on the tag
- Staging tested with the same code
- Database backup verified (check latest hourly backup in S3)
- Team notified via Discord or Slack
- Breaking changes documented
- Rollback plan reviewed
### Deployment Steps

1. Tag the release
2. Trigger the workflow via the GitHub Actions UI:
   - Go to Actions → Deploy to Production
   - Click "Run workflow"
   - Enter tag: `v1.16.0`
   - Confirm
3. Manual approval: GitHub will pause and wait for approval (if environment protection is enabled)
4. Deployment proceeds:
   - Build images with production configuration
   - Push to GHCR with semantic version tags: `v1.16.0`, `1.16`, `latest`
   - Copy deployment scripts to production VPS
   - SSH to production VPS
   - Run `deploy-rolling.sh` for zero-downtime deployment
   - Health checks with 5 retries (5-second intervals)
   - Automatic rollback if health checks fail
   - Discord notification on success or failure
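Step 1 can be done from a local checkout. This is a minimal sketch, assuming the repository uses `v`-prefixed semantic version tags (the tag-format check is an assumption, not part of the real workflow):

```shell
# Check that a release tag is a v-prefixed semantic version before creating it.
valid_tag() {
  [[ "$1" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]
}

valid_tag "v1.16.0" && echo "tag ok"
# Once the tag looks right:
#   git tag v1.16.0 && git push origin v1.16.0
# then trigger "Deploy to Production" in the Actions UI with this tag.
```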
### Post-Deployment Verification

After successful deployment:

1. Check application health:
   - Visit the production URL in a browser
   - Verify key pages load correctly
2. Test critical user flows (product catalog, designer, checkout)
3. Monitor application logs on the VPS
4. Check error tracking: Review Sentry or logging service for errors
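A small helper can script the health-check part of this list (a sketch; the URL in the usage comment is a placeholder, not the real production domain):

```shell
# Print the HTTP status code for a URL; with -f, curl exits non-zero on 4xx/5xx.
check_url() {
  curl -fsS -o /dev/null -w "%{http_code}" "$1"
}

# Usage (placeholder domain):
#   check_url "https://shop.example.com/" && echo " ok"
```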
## Zero-Downtime Deployment Details

The `deploy-rolling.sh` script ensures at least one instance is always running during deployment.

### How It Works

For each service (backend, frontend):

1. Scale up: Start a new container alongside the old one (temporarily 2 instances)
2. Health check: Verify the new container is healthy (30 attempts, 5-second intervals = 150 seconds max)
3. Scale down: Stop and remove the old container
4. Scale back: Return to 1 instance
### Resource Requirements

- Temporarily 2x containers: ~600-800MB total memory for the backend during deployment
- VPS capacity: Works within a 2GB VPS (1GB base + 1GB headroom)
- Duration: ~2-5 minutes for a full deployment

### Health Check Details

- Backend: `http://localhost:8000/api/health/`
- Frontend: `http://localhost:3000/`
- Max attempts: 30 attempts × 5 seconds = 150 seconds maximum wait
- Timeout: The health check must pass before the old container is removed
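The polling policy above can be sketched as a small bash helper (this is not the real script; the function name and defaults are assumptions that mirror the 30 × 5s numbers):

```shell
# Poll a health endpoint until it responds or attempts are exhausted.
wait_healthy() {
  local url="$1" attempts="${2:-30}" interval="${3:-5}" i
  for ((i = 1; i <= attempts; i++)); do
    if curl -fsS "$url" > /dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$interval"
  done
  echo "unhealthy after $attempts attempts" >&2
  return 1
}

# Usage: wait_healthy "http://localhost:8000/api/health/"
```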
### Celery Services

Celery workers and the beat scheduler are restarted gracefully (not rolled) since they don't serve HTTP traffic.
## Automatic Rollback

Trigger: Health check fails after 5 retries in the deployment workflow

Script: `scripts/rollback.sh`

### What Happens

1. Health check fails after 5 attempts (25 seconds)
2. Rollback script executes:
   - Pull the previous image tag from GHCR
   - Tag it as the version to use
   - Restart services with the previous image
3. Discord notification sent with rollback status
4. Workflow fails to signal deployment failure
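The retag-and-restart step looks roughly like this (an illustrative sketch, not the real `rollback.sh`; the function name, repository path, and deploy tag are placeholders):

```shell
# Re-point the deploy tag at the previous image and restart the stack.
rollback_image() {
  local repo="$1" prev="$2" deploy_tag="${3:-latest}"
  docker pull "$repo:$prev" \
    && docker tag "$repo:$prev" "$repo:$deploy_tag" \
    && docker compose -f docker-compose.prod.yml up -d
}

# Usage: rollback_image ghcr.io/<repo>/backend v1.15.0
```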
### Verification After Automatic Rollback

After an automatic rollback, verify that the services are healthy; the endpoints listed under Health Check Details can be checked with `curl`.

### Investigate Failure

1. Check deployment logs in the GitHub Actions workflow output
2. Check application logs on the VPS
3. Identify the root cause before attempting re-deployment
## Manual Rollback Procedure

When to use:

- Automatic rollback failed
- Issues discovered after a successful deployment
- Emergency rollback needed outside of the deployment workflow

### Step-by-Step

1. SSH into the VPS
2. Navigate to the deployment directory
3. Set environment variables and find the previous tag:
   - For a semantic version: use the previous release tag (e.g., v1.15.0)
   - For a commit SHA: use the previous successful commit SHA
4. Run the rollback script
5. Verify health
6. Check logs
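Put together, steps 2-6 might look like the function below (a sketch under assumptions: the deployment directory, how `rollback.sh` receives the tag, and the log command are all illustrative; `DRY_RUN=1` only prints the commands, so it is safe to review first):

```shell
# Run each documented rollback step, or just echo it when DRY_RUN=1.
manual_rollback() {
  local tag="$1" c
  local cmds=(
    "cd /opt/webshop && ROLLBACK_TAG=$tag ./scripts/rollback.sh"
    "curl -fsS http://localhost:8000/api/health/"
    "docker compose -f docker-compose.prod.yml logs backend --tail=100"
  )
  for c in "${cmds[@]}"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$c"; else sh -c "$c" || return 1; fi
  done
}

# Usage (after `ssh` into the VPS): DRY_RUN=1 manual_rollback v1.15.0
```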
### Last Resort: Simple Restart

If `rollback.sh` fails completely:
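The restart itself is a single compose invocation (a sketch; the flags are assumptions, wrapped in a function here for clarity):

```shell
# Recreate all services from the images currently tagged on the host.
restart_stack() {
  docker compose -f docker-compose.prod.yml up -d --force-recreate
}
```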
This restarts services with the currently tagged images (may not fully revert if images were already pulled).
## Database Migration Rollback

Django migrations can be reversible or irreversible.

### Reversible Migrations

Most migrations are automatically reversible:

- `AddField` - reversing removes the field
- `CreateModel` - reversing drops the table
- `AlterField` - reversing reverts the field changes
- `AddIndex` - reversing drops the index
To roll back a reversible migration:

```shell
# SSH to VPS
ssh production-vps
cd /opt/webshop

# Roll back to a specific migration number
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py migrate <app_name> <previous_migration_number>

# Example: roll back the products app to migration 0015
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py migrate products 0015
```

Verify the rollback:

```shell
# Show current migration status
docker compose -f docker-compose.prod.yml exec backend \
  python manage.py showmigrations <app_name>
```
### Irreversible Migrations

Some migrations cannot be automatically reversed:

- `RemoveField` - data is lost when the field is dropped
- `DeleteModel` - data is lost when the table is dropped
- `RunPython` - custom Python code without `reverse_code`
- `RunSQL` - custom SQL without reverse SQL

For irreversible migrations:

1. Do not attempt a `migrate` rollback (it will fail or lose data)
2. Restore from a database backup (see Disaster Recovery)
3. Choose a backup timestamp from before the migration ran
### Best Practices

1. Deploy migrations separately:
   - Release 1: Deploy the migration only (additive, safe)
   - Verify the migration succeeded
   - Release 2: Deploy the code that uses the new schema
2. Make migrations reversible:
   - Provide `reverse_code` for `RunPython` operations
   - Avoid `RemoveField` and `DeleteModel` when possible
   - Use `AlterField` with `null=True` before removing
3. Test migrations on staging before production
4. Take a backup before risky migrations
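For the backup step, a sketch along these lines could be used (the `db` service name, the `DB_USER`/`DB_NAME` variables, and the custom-format dump are assumptions; see Disaster Recovery for the real backup procedure):

```shell
# Dump the database from the db container to a timestamped file on the host.
backup_db() {
  local out="backup-$(date +%Y%m%d-%H%M%S).dump"
  docker compose -f docker-compose.prod.yml exec -T db \
    pg_dump -U "$DB_USER" -d "$DB_NAME" -Fc > "$out" && echo "$out"
}

# Usage: DB_USER=webshop DB_NAME=webshop backup_db
```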
## Troubleshooting

### Health Check Fails

Symptom: Deployment fails with "Health check failed after 5 attempts"

Diagnosis:

```shell
# Check logs
docker compose -f docker-compose.prod.yml logs backend --tail=200
docker compose -f docker-compose.prod.yml logs frontend --tail=200

# Test health endpoints directly
curl -v http://localhost:8000/api/health/
curl -v http://localhost:3000/
```

Common causes:

- Database connection failure (check `DB_HOST`, `DB_PASSWORD` in `.env`)
- Redis connection failure (check `REDIS_URL`)
- Missing environment variables
- Application startup errors (check logs for stack traces)
- Port already in use (check `docker compose ps`)
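A quick way to rule out the first three causes is to confirm the expected keys exist in the env file (the helper and the exact variable list are assumptions):

```shell
# Report any required keys missing from an env file; returns non-zero if any are.
check_env() {
  local file="$1" var missing=0
  shift
  for var in "$@"; do
    grep -q "^${var}=" "$file" || { echo "missing: $var"; missing=1; }
  done
  return "$missing"
}

# Usage: check_env .env DB_HOST DB_PASSWORD REDIS_URL
```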
### Container Won't Start

Symptom: Container exits immediately or is stuck in a restart loop

Diagnosis:

```shell
# Check container status
docker compose -f docker-compose.prod.yml ps

# Check resource usage
docker stats

# Inspect container logs
docker compose -f docker-compose.prod.yml logs <service-name> --tail=100
```

Common causes:

- Out of memory (check `docker stats`)
- Syntax error in code (check logs for Python/JavaScript errors)
- Missing dependencies (rebuild the image)
- Configuration error in `docker-compose.prod.yml`
### Image Pull Fails

Symptom: "Error pulling image" during deployment

Diagnosis:

```shell
# Check GHCR authentication
docker login ghcr.io -u <username>

# Verify the image exists
docker pull ghcr.io/<repo>/backend:<tag>
```

Common causes:

- Invalid tag name
- GHCR authentication expired
- Image not pushed to the registry
- Network connectivity issues

Solution:

```shell
# Re-authenticate to GHCR
echo $GITHUB_TOKEN | docker login ghcr.io -u <username> --password-stdin

# Verify image tags
docker images | grep ghcr.io
```
### Disk Space Issues

Symptom: "No space left on device"

Diagnosis:
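Typical checks look like this (the `docker system df` line assumes Docker is installed and is skipped silently if not):

```shell
# Host-level disk usage for the root filesystem
df -h /

# Docker's own usage breakdown (images, containers, volumes, build cache)
docker system df 2>/dev/null || true
```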
Solution:

```shell
# Clean up old images and containers
docker system prune -f

# Remove unused volumes (careful!)
docker volume prune -f

# Remove dangling images
docker images | grep '<none>' | awk '{print $3}' | xargs docker rmi
```
### Migration Fails

Symptom: "Migration failed" during deployment

Diagnosis:

```shell
# Check migration status
docker compose exec backend python manage.py showmigrations

# Check migration errors
docker compose logs backend | grep -i migration
```

Common causes:

- Database schema conflict
- Missing dependency migration
- Custom SQL error in a migration
- Database connection interrupted

Solution:

```shell
# Fake the migration if it was already applied manually
docker compose exec backend python manage.py migrate <app> <migration> --fake

# Or roll back and re-apply
docker compose exec backend python manage.py migrate <app> <previous-migration>
docker compose exec backend python manage.py migrate <app>
```
### Performance Degradation After Deployment

Symptom: Application slower than before the deployment

Diagnosis:

```shell
# Check resource usage
docker stats

# Check database connections
docker compose exec db psql -U $DB_USER -d $DB_NAME -c "SELECT count(*) FROM pg_stat_activity;"

# Check Redis memory
docker compose exec redis redis-cli INFO memory
```

Common causes:

- Missing database indexes (new queries added)
- Inefficient queries (check Django query logs)
- Memory leak (check `docker stats` over time)
- Cache not warming up (Redis empty after restart)

Solution:

- Review and optimize new queries
- Add database indexes for slow queries
- Restart services if a memory leak is suspected
- Warm up the cache if needed
## Related Documentation

- Disaster Recovery - Database backup, restore, RTO/RPO targets
- Architecture Overview - System architecture overview