
Alerting Runbook

Step-by-step response procedures for every alert type in the Freeze Design monitoring system.

Overview

Alert Routing Diagram

Error occurs in Django
        |
        v
  Sentry captures it
        |
        +---> [New Issue]         ---> #errors    (Discord)
        +---> [Regression]        ---> #errors    (Discord)
        +---> [Error Spike]       ---> #errors    (Discord)
        +---> [Payment Error]     ---> #payments  (Discord)
        +---> [Payment Spike]     ---> #payments  (Discord)

Celery backup tasks
        |
        +---> [Backup Failure]    ---> #backups   (Discord webhook)
        +---> [Missed Backup]     ---> #backups   (Discord webhook)

Health check system
        |
        +---> [Disk Space Alert]  ---> #errors    (Discord webhook)

Admin audit mixin
        |
        +---> [Admin Delete]      ---> #admin     (Discord webhook)
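The backup, health-check, and admin alerts above are delivered as plain Discord webhooks (an HTTP POST with a JSON body). A minimal sketch of building such a payload using Discord's standard `content` and `embeds` fields -- the exact fields and colors this project sends live in its notifier code, so treat this as illustrative:

```python
import json

def build_discord_payload(title: str, message: str, severity: str) -> str:
    """Build the JSON body for a Discord webhook POST (Content-Type: application/json)."""
    # Discord renders `content` as the message text; an embed adds a colored card.
    colors = {"info": 0x3498DB, "high": 0xE67E22, "critical": 0xE74C3C}
    payload = {
        "content": f"**{title}**",
        "embeds": [{
            "description": message,
            "color": colors.get(severity, 0x95A5A6),  # grey fallback for unknown severity
        }],
    }
    return json.dumps(payload)

# Example: the body a Backup Failure alert might POST to the #backups webhook
body = build_discord_payload("Backup Failure", "pg_dump exited with code 1", "high")
```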

Quick Reference Table

| # | Alert | Channel | Severity | Response Time | Key Command |
|---|-------|---------|----------|---------------|-------------|
| 1 | New Issue | #errors | Medium | Within 4 hours | `docker compose logs backend --tail=100` |
| 2 | Regression | #errors | High | Within 1 hour | `docker compose logs backend --tail=200` |
| 3 | Error Spike | #errors | Critical | Within 30 min | `docker compose ps` |
| 4 | Payment Error | #payments | Critical | Within 15 min | `docker compose logs backend --tail=100 \| grep payment` |
| 5 | Payment Error Spike | #payments | Critical | Within 15 min | `docker compose logs backend --tail=200 \| grep payment` |
| 6 | Backup Failure | #backups | High | Within 2 hours | `docker compose exec backend python manage.py backup_database --check` |
| 7 | Missed Backup | #backups | High | Within 2 hours | `docker compose exec backend python manage.py shell -c "from apps.core.services.backup_monitor import BackupMonitor; print(BackupMonitor().get_metrics())"` |
| 8 | Disk Space Alert | #errors | Varies | Within 1 hour | `df -h` |
| 9 | Admin Delete | #admin | Info | Next business day | Check Django admin audit log |

Alert Type 1: New Issue Alert

Channel: #errors
Severity: Medium
Source: Sentry issue alert rule
What it means: Sentry detected an error it has never seen before in production.

Response Steps

  1. Click the Sentry link in the Discord notification to open the issue

  2. Assess the impact:

     • How many users are affected? (check the "Events" count in Sentry)
     • Is it a 500 error visible to customers?
     • What page/endpoint is affected?

  3. Check recent deployments:

    # Check when the last deployment happened
    docker compose exec backend python -c "import sentry_sdk; print(sentry_sdk.Hub.current.client.options.get('release'))"

    # Check recent git history
    git log --oneline -5

  4. Check application logs:

    docker compose logs backend --tail=100 --since=1h

  5. If the error is user-facing:

     • Determine whether a hotfix is needed
     • If it affects a critical path (checkout, payment), escalate to the Payment Error procedures

  6. Resolve or assign in Sentry:

     • One-off / transient: mark as Ignored with a note
     • Needs fixing: leave open and add to the backlog
     • Fixed by a deploy: mark as Resolved in the next release

Common Causes

  • New code deployment introduced a bug
  • External service API changed (Mollie, DigitalOcean Spaces)
  • Edge case in user input not previously encountered
  • Browser-specific JavaScript error (if frontend Sentry is added)

Alert Type 2: Regression Alert

Channel: #errors
Severity: High
Source: Sentry issue alert rule
What it means: An issue that was previously marked as resolved has reappeared, meaning a fix did not hold or was reverted.

Response Steps

  1. Click the Sentry link to open the regressed issue

  2. Check the issue history in Sentry:

     • When was it originally resolved?
     • What release resolved it?
     • What release caused the regression?

  3. Compare releases:

    # Find the commit that resolved the original issue
    git log --oneline --all | head -20

    # Check what changed between the resolving commit and now
    git diff <resolving-commit>..HEAD --stat

  4. Check application logs for the specific error:

    docker compose logs backend --tail=200 --since=2h | grep -i "error\|exception\|traceback"

  5. If the regression is in a critical path, consider rolling back to the previous release:

    # Check available images
    docker images | grep freeze

    # Rollback: pin the previous image tag in docker-compose (or your .env),
    # then pull and restart. WARNING: only do this if the regression is severe.
    docker compose pull backend
    docker compose up -d backend

  6. Create a fix:

     • A regression means the original fix was incomplete or a new change conflicted with it
     • Write a test that covers the regression scenario
     • Deploy the fix and verify in Sentry that the issue resolves

Common Causes

  • A new deployment overwrote or conflicted with the original fix
  • The original fix was incomplete (fixed one case but not edge cases)
  • Database migration changed data shape that the fix depended on
  • Dependency update changed behavior

Alert Type 3: Error Spike Alert

Channel: #errors
Severity: Warning (>10/hr) or Critical (>50/hr)
Source: Sentry metric alert rule
What it means: The error rate has crossed a threshold, indicating a systemic issue rather than isolated errors.

Response Steps

  1. Assess the severity:

     • Warning (>10/hr): elevated errors, investigate within 30 minutes
     • Critical (>50/hr): major incident, investigate immediately

  2. Check if the application is running:

    docker compose ps

    All services should show "Up" and "healthy".

  3. Check for a recent deployment:

    # When was the backend container last started?
    docker inspect --format='{{.State.StartedAt}}' $(docker compose ps -q backend)

  4. Check application logs for patterns:

    # Look for the most common errors in the last hour
    docker compose logs backend --tail=500 --since=1h | grep -c "ERROR"
    docker compose logs backend --tail=500 --since=1h | grep "ERROR" | sort | uniq -c | sort -rn | head -10

  5. Check external dependencies:

    # Database connection (Django 4.1+: pass client flags after --)
    docker compose exec backend python manage.py dbshell -- -c "SELECT 1;"

    # Redis connection
    docker compose exec redis redis-cli ping

    # Celery workers
    docker compose logs celery --tail=50 --since=1h

  6. Check Sentry for the top issues:

     • Go to Sentry > Issues > Sort by "Events" > Last hour
     • The top issue is likely the cause of the spike

  7. If caused by a deployment, consider a rollback (see the rollback steps under Alert Type 2)

  8. If caused by an external service failure:

     • Check Mollie status: https://status.mollie.com
     • Check DigitalOcean Spaces status: https://status.digitalocean.com
     • If the cause is external, the spike will resolve when the service recovers
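The alert thresholds above can be expressed as a small helper (the function name is illustrative; the real thresholds live in the Sentry metric alert rule):

```python
def spike_severity(errors_per_hour: int) -> str:
    """Map an hourly error count to the severity used by this alert."""
    if errors_per_hour > 50:
        return "critical"  # major incident: investigate immediately
    if errors_per_hour > 10:
        return "warning"   # elevated errors: investigate within 30 minutes
    return "ok"
```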

Common Causes

  • Bad deployment (new bug affecting many requests)
  • Database connection issues (pool exhausted, slow queries)
  • Redis connection issues (cache failures cascading)
  • External API outage (Mollie, DigitalOcean)
  • DNS or certificate issues (SSL expiry)

Alert Type 4: Payment Error

Channel: #payments
Severity: Critical
Source: Sentry issue alert rule (tag: domain=payment)
What it means: A new payment-related error occurred. Customers may be unable to complete purchases.

Response Steps

  1. Click the Sentry link immediately

  2. Determine the error type:

     • Payment creation failure (Mollie API error)
     • Webhook processing failure (payment status update failed)
     • Checkout flow error (cart/order creation)
     • Invoice generation error

  3. Check payment logs:

    docker compose logs backend --tail=100 --since=30m | grep -i "payment\|mollie\|checkout"

  4. Check the Mollie dashboard:

     • Log in to https://my.mollie.com
     • Go to Payments > check recent payment statuses
     • Are payments being created but failing at Mollie?

  5. Check Mollie API status:

     • Visit https://status.mollie.com
     • If Mollie is down, customers will see payment failures until the service recovers

  6. Verify webhook connectivity:

    # Check if Mollie can reach our webhook endpoint
    docker compose logs nginx --tail=50 | grep "webhook"

  7. If the error is in our code:

     • Check the traceback in Sentry for the exact failure point
     • Common locations: apps/orders/services/, apps/orders/views.py
     • If a quick fix is possible, deploy it immediately

  8. Customer communication: if payment failures are widespread, consider adding a banner to the site

  9. Check for stuck orders (orders with pending_payment status older than 1 hour):

    docker compose exec backend python manage.py shell -c "
    from apps.orders.models import Order
    from django.utils import timezone
    from datetime import timedelta
    stuck = Order.objects.filter(
        status='pending_payment',
        created_at__lt=timezone.now() - timedelta(hours=1)
    ).count()
    print(f'Stuck orders: {stuck}')
    "
    

Common Causes

  • Mollie API outage or degradation
  • Mollie API key expired or revoked
  • Webhook URL not reachable (nginx misconfiguration, SSL issue)
  • Code bug in payment flow (especially after deployment)
  • Currency/amount formatting error

Alert Type 5: Payment Error Spike

Channel: #payments
Severity: Critical
Source: Sentry metric alert rule (>3 payment errors/hour)
What it means: Multiple payment errors in a short period -- a systemic payment failure, not an isolated error.

Response Steps

  1. Follow all steps from Alert Type 4 (Payment Error)

  2. Check if ALL payments are failing:

    docker compose exec backend python manage.py shell -c "
    from apps.orders.models import Order
    from django.utils import timezone
    from datetime import timedelta
    one_hour_ago = timezone.now() - timedelta(hours=1)
    total = Order.objects.filter(created_at__gte=one_hour_ago).count()
    failed = Order.objects.filter(created_at__gte=one_hour_ago, status='payment_failed').count()
    paid = Order.objects.filter(created_at__gte=one_hour_ago, status='paid').count()
    print(f'Last hour: {total} orders, {paid} paid, {failed} failed')
    "

  3. If the failure rate is 100%:

     • Likely a Mollie outage or API key issue
     • Check MOLLIE_API_KEY is set: docker compose exec backend env | grep MOLLIE
     • Consider temporarily disabling checkout with a maintenance message

  4. If the failure rate is partial:

     • Likely a specific payment method or amount issue
     • Check Sentry for common attributes across the failing payments

  5. Monitor resolution:

     • The metric alert will auto-resolve when errors drop below the threshold
     • Verify in Sentry that the error rate is declining
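The full-vs-partial distinction in the steps above can be sketched as follows (the function and bucket names are illustrative, not part of the codebase):

```python
def classify_payment_failures(total: int, failed: int) -> str:
    """Bucket the last hour's orders into the triage categories used above."""
    if total == 0:
        return "no-traffic"       # nothing to judge; check whether checkout is reachable
    rate = failed / total
    if rate >= 1.0:
        return "full-outage"      # likely Mollie outage or bad API key
    if rate > 0:
        return "partial-failure"  # likely a specific payment method or amount issue
    return "healthy"
```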

Common Causes

  • Same as Alert Type 4, but systemic
  • API key rotation without updating env var
  • Mollie account suspension (check email)
  • Network issue between server and Mollie API

Alert Type 6: Backup Failure

Channel: #backups
Severity: High
Source: Django backup task (Celery) Discord notification
What it means: A scheduled database backup failed. Data protection is compromised until the next successful backup.

Response Steps

  1. Check the failure details in the Discord notification (includes error message and task ID)

  2. Check backup task logs:

    docker compose logs celery --tail=100 --since=2h | grep -i "backup"
    

  3. Try a manual backup:

    docker compose exec backend python manage.py backup_database
    
    If the manual backup succeeds, the failure was likely transient.

  4. Check disk space:

    docker compose exec backend df -h /backups
    
    Backups will fail if the disk is full.

  5. Check database connectivity:

    docker compose exec backend python manage.py dbshell -- -c "SELECT 1;"
    

  6. Check backup storage:

    # List recent backups
    docker compose exec backend ls -la /backups/ | tail -10
    

  7. Verify backup schedule is running:

    docker compose logs celery-beat --tail=20
    
    Look for the backup task being scheduled.

  8. If disk space is the issue:

    # Check retention policy - old backups should be cleaned up
    docker compose exec backend python manage.py backup_database --cleanup
    

Common Causes

  • Disk full (backup volume)
  • Database connection timeout during dump
  • PostgreSQL in recovery mode (after crash)
  • Celery worker crashed or restarted during backup
  • File permission issue on backup directory
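As a reference point for the cleanup step above, a typical hourly/daily/weekly retention selection can be sketched like this -- an illustrative policy, not the actual logic of `backup_database --cleanup`:

```python
from datetime import datetime, timedelta

def backups_to_keep(timestamps, now):
    """Select backups to keep under an illustrative 24h-hourly /
    7d-daily / 4w-weekly retention policy."""
    keep = set()
    newest_per_day = {}
    newest_per_week = {}
    for ts in sorted(timestamps):
        age = now - ts
        if age <= timedelta(hours=24):
            keep.add(ts)                                # hourly tier: keep everything
        elif age <= timedelta(days=7):
            newest_per_day[ts.date()] = ts              # daily tier: newest per calendar day
        elif age <= timedelta(weeks=4):
            newest_per_week[ts.isocalendar()[:2]] = ts  # weekly tier: newest per ISO week
        # older than 4 weeks: dropped
    keep.update(newest_per_day.values())
    keep.update(newest_per_week.values())
    return keep
```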

Alert Type 7: Missed Backup Warning

Channel: #backups
Severity: High
Source: Daily backup summary task (checks for stale backups)
What it means: No successful backup has been recorded in the last 25 hours. The backup system may be silently broken.

Response Steps

  1. Check if backups are running at all:

    docker compose logs celery-beat --tail=50 | grep -i "backup"
    

  2. Check Celery worker health:

    docker compose ps celery
    docker compose logs celery --tail=50
    
    If the Celery worker is not running, backups cannot execute.

  3. Check backup monitor metrics:

    docker compose exec backend python manage.py shell -c "
    from apps.core.services.backup_monitor import BackupMonitor
    monitor = BackupMonitor()
    metrics = monitor.get_metrics()
    print(f'Last backup: {metrics.get(\"last_backup_time\", \"Never\")}')
    print(f'Success rate: {metrics.get(\"success_rate\", \"Unknown\")}%')
    print(f'Total backups: {metrics.get(\"total_backups\", 0)}')
    "
    

  4. Check Redis (backup history is cached):

    docker compose exec redis redis-cli ping
    
    If Redis is down, backup history may be lost (but backups themselves still run).

  5. Run a manual backup to restore the schedule:

    docker compose exec backend python manage.py backup_database
    

  6. If Celery Beat is not scheduling tasks:

    docker compose restart celery-beat
    docker compose logs celery-beat --tail=20
    

Common Causes

  • Celery Beat crashed or was not started
  • Celery worker crashed (tasks scheduled but not executed)
  • Redis down (task results not stored, but tasks may still run)
  • Docker container restarted and lost the Beat schedule
  • Server reboot without docker compose up
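The 25-hour staleness check behind this alert amounts to a simple comparison (the function name is illustrative; the real check lives in the backup summary task):

```python
from datetime import datetime, timedelta

def backup_is_stale(last_backup, now, threshold_hours=25):
    """True when no successful backup exists within the threshold window.
    A last_backup of None (no backup ever recorded) also counts as stale."""
    if last_backup is None:
        return True
    return now - last_backup > timedelta(hours=threshold_hours)
```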

Alert Type 8: Disk Space Alert

Channel: #errors
Severity: Warning (>80%) or Critical (>90%)
Source: Health check system (Django)
What it means: Server disk usage has crossed a threshold. If it reaches 100%, the application will crash.

Response Steps

  1. Check current disk usage:

    df -h
    

  2. Identify what is consuming space:

    du -sh /var/lib/docker/volumes/* | sort -rh | head -10
    

  3. Identify common space consumers:

    # Docker images and containers
    docker system df
    # Clean up unused images, stopped containers, and networks
    # (volumes are NOT removed unless you add --volumes)
    docker system prune -f

    # Old backups
    docker compose exec backend ls -lah /backups/ | head -20
    # Run cleanup
    docker compose exec backend python manage.py backup_database --cleanup

    # Application (container) logs
    du -sh /var/lib/docker/containers/*/
    # Truncate large container log files (json-file driver)
    sudo truncate -s 0 /var/lib/docker/containers/*/*-json.log

    # PostgreSQL WAL files
    docker compose exec db du -sh /var/lib/postgresql/data/pg_wal/

  4. If above 90% (critical):

     • Delete old Docker images: docker image prune -a -f
     • Clean backup retention: docker compose exec backend python manage.py backup_database --cleanup
     • Consider increasing disk size at Hetzner

  5. If above 95% (emergency):

     • Delete the oldest backups manually
     • Truncate Django log files
     • Increase the disk immediately in the Hetzner Cloud console

  6. Set up prevention:

     • Ensure the backup retention policy is active (keeps 24h hourly, 7d daily, 4w weekly)
     • Consider adding log rotation to docker-compose
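Log rotation can be configured per service in docker-compose via Docker's json-file driver options; a minimal sketch (service name and sizes are illustrative):

```yaml
services:
  backend:
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate each log file at 10 MB
        max-file: "5"     # keep at most 5 rotated files per container
```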

Common Causes

  • Backup retention not running (accumulating hourly backups)
  • Docker images accumulating from deployments without cleanup
  • PostgreSQL WAL files growing (replication lag or missing checkpoint)
  • Application log files growing without rotation
  • Media uploads filling disk (should be on DigitalOcean Spaces, not local)
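The Warning (>80%) / Critical (>90%) thresholds map to a simple check; a sketch using only the standard library (the real health check lives in the Django app):

```python
import shutil

def disk_severity(percent_used: float) -> str:
    """Classify disk usage per the runbook thresholds."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"

# Check the root filesystem
usage = shutil.disk_usage("/")
level = disk_severity(usage.used / usage.total * 100)
```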

Alert Type 9: Admin Delete Action

Channel: #admin
Severity: Info
Source: Django admin audit trail mixin
What it means: An admin user performed a delete action in the Django admin interface. Deletes are irreversible and logged for audit purposes.

Response Steps

  1. Review the notification details:

     • Who performed the delete?
     • What model/object was deleted?
     • When did it happen?

  2. Verify the delete was intentional:

     • Check with the admin user whether this was planned
     • If unexpected, check the admin audit log for more context:

    docker compose exec backend python manage.py shell -c "
    from apps.core.models import AdminAuditLog
    recent = AdminAuditLog.objects.filter(action='delete').order_by('-timestamp')[:5]
    for log in recent:
        print(f'{log.timestamp}: {log.user} deleted {log.model_name} (ID: {log.object_id})')
    "

  3. If the delete was accidental:

     • Check if a backup exists from before the delete
     • Consult Disaster Recovery for restore instructions
     • Note: restoring a full backup will overwrite all changes since the backup

  4. No action needed if the delete was intentional and expected -- most delete notifications are informational and require no response

Common Causes

  • Normal admin operations (removing test data, cleaning up old records)
  • Accidental bulk delete (admin list view checkbox + delete action)
  • Admin cleanup of expired/orphaned data

Escalation Procedures

For a solo-operator setup, "escalation" means increasing your own response urgency.

Severity Levels

| Severity | Response Time | Description |
|----------|---------------|-------------|
| Info | Next business day | Informational, no action usually needed |
| Medium | Within 4 hours | Investigate; may need a fix in the next deployment |
| High | Within 1 hour | Needs prompt attention; potential data or service impact |
| Critical | Within 15-30 min | Active incident affecting customers or revenue |

When to Consider Downtime

Take the site into maintenance mode if:

  • Payment system is completely broken (100% failure rate for >30 min)
  • Error rate exceeds 100/hour and rising (cascading failure)
  • Database is corrupt or unreachable
  • Security breach suspected (unauthorized admin access)

Maintenance Mode

# Quick maintenance page via Nginx (if configured)
# Or: stop the frontend to prevent new orders
docker compose stop frontend
# Fix the issue
# Restart
docker compose up -d frontend

Log Locations

Quick reference for finding logs during incident response.

| Service | Log Command | What It Contains |
|---------|-------------|------------------|
| Backend (Django) | `docker compose logs backend --tail=N` | API requests, errors, application logs |
| Celery worker | `docker compose logs celery --tail=N` | Task execution, backup results |
| Celery Beat | `docker compose logs celery-beat --tail=N` | Task scheduling |
| Nginx | `docker compose logs nginx --tail=N` | HTTP requests, proxy errors |
| PostgreSQL | `docker compose logs db --tail=N` | Database queries, connections |
| Redis | `docker compose logs redis --tail=N` | Cache operations, connection issues |

Filtering Logs by Time

# Last hour
docker compose logs backend --since=1h

# Last 30 minutes
docker compose logs backend --since=30m

# Since specific time (UTC)
docker compose logs backend --since="2026-01-15T10:00:00"

Searching Logs

# Find all errors
docker compose logs backend --tail=500 | grep "ERROR"

# Find payment-related logs
docker compose logs backend --tail=500 | grep -i "payment\|mollie"

# Find backup-related logs
docker compose logs celery --tail=200 | grep -i "backup"