Alerting Runbook¶
Step-by-step response procedures for every alert type in the Freeze Design monitoring system.
Overview¶
Alert Routing Diagram¶
```
Error occurs in Django
        |
        v
Sentry captures it
        |
        +---> [New Issue] --------> #errors   (Discord)
        +---> [Regression] -------> #errors   (Discord)
        +---> [Error Spike] ------> #errors   (Discord)
        +---> [Payment Error] ----> #payments (Discord)
        +---> [Payment Spike] ----> #payments (Discord)

Celery backup tasks
        |
        +---> [Backup Failure] ---> #backups  (Discord webhook)
        +---> [Missed Backup] ----> #backups  (Discord webhook)

Health check system
        |
        +---> [Disk Space Alert] -> #errors   (Discord webhook)

Admin audit mixin
        |
        +---> [Admin Delete] -----> #admin    (Discord webhook)
```
Quick Reference Table¶
| # | Alert | Channel | Severity | Response Time | Key Command |
|---|---|---|---|---|---|
| 1 | New Issue | #errors | Medium | Within 4 hours | docker compose logs backend --tail=100 |
| 2 | Regression | #errors | High | Within 1 hour | docker compose logs backend --tail=200 |
| 3 | Error Spike | #errors | Warning/Critical | Within 30 min | docker compose ps |
| 4 | Payment Error | #payments | Critical | Within 15 min | docker compose logs backend --tail=100 \| grep payment |
| 5 | Payment Error Spike | #payments | Critical | Within 15 min | docker compose logs backend --tail=200 \| grep payment |
| 6 | Backup Failure | #backups | High | Within 2 hours | docker compose exec backend python manage.py backup_database --check |
| 7 | Missed Backup | #backups | High | Within 2 hours | docker compose exec backend python manage.py shell -c "from apps.core.services.backup_monitor import BackupMonitor; print(BackupMonitor().get_metrics())" |
| 8 | Disk Space Alert | #errors | Warning/Critical | Within 1 hour | df -h |
| 9 | Admin Delete | #admin | Info | Next business day | Check Django admin audit log |
Alert Type 1: New Issue Alert¶
Channel: #errors
Severity: Medium
Source: Sentry issue alert rule
What it means: Sentry detected an error it has never seen before in production.
Response Steps¶
1. Click the Sentry link in the Discord notification to open the issue.
2. Assess the impact:
   - How many users are affected? (check the "Events" count in Sentry)
   - Is it a 500 error visible to customers?
   - What page/endpoint is affected?
3. Check recent deployments: did a deploy land shortly before the error first appeared?
4. Check the application logs: docker compose logs backend --tail=100
5. If the error is user-facing:
   - Determine whether a hotfix is needed
   - If it is in a critical path (checkout, payment), escalate to the Payment Error procedures
6. Resolve or assign in Sentry:
   - If one-off / transient: mark as Ignored with a note
   - If it needs fixing: leave the issue open and add it to the backlog
   - If fixed by a deploy: mark as Resolved in the next release
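The impact assessment and the resolve-or-assign decision above can be sketched as a small triage helper. This is an illustrative sketch only: the event threshold and the action names are assumptions, not part of the monitoring system.

```python
def triage_new_issue(events: int, user_facing: bool, critical_path: bool) -> str:
    """Map a quick impact assessment to a Sentry action (illustrative rules)."""
    if critical_path:
        # Checkout/payment errors escalate to the Payment Error procedure
        return "escalate_payment"
    if user_facing and events > 10:
        # Many customers are seeing the error: consider a hotfix
        return "hotfix"
    if events <= 1 and not user_facing:
        # One-off / transient: mark as Ignored with a note
        return "ignore_with_note"
    # Otherwise leave the issue open and add it to the backlog
    return "backlog"
```

For example, a single non-user-facing event triages to `ignore_with_note`, matching the "one-off / transient" rule above.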
Common Causes¶
- New code deployment introduced a bug
- External service API changed (Mollie, DigitalOcean Spaces)
- Edge case in user input not previously encountered
- Browser-specific JavaScript error (if frontend Sentry is added)
Alert Type 2: Regression Alert¶
Channel: #errors
Severity: High
Source: Sentry issue alert rule
What it means: An issue that was previously marked as resolved has reappeared. This means a fix did not hold or was reverted.
Response Steps¶
1. Click the Sentry link to open the regressed issue.
2. Check the issue history in Sentry:
   - When was it originally resolved?
   - What release resolved it?
   - What release caused the regression?
3. Compare the two releases to find the change that conflicts with the original fix.
4. Check the application logs for the specific error: docker compose logs backend --tail=200
5. If the regression is in a critical path, consider rolling back to the previous release.
6. Create a fix:
   - A regression indicates the original fix was incomplete or a new change conflicted with it
   - Write a test that covers the regression scenario
   - Deploy the fix and verify in Sentry that the issue resolves
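A regression means the error reappeared in the release that shipped the fix, or a later one. A minimal comparison sketch, assuming releases are plain `major.minor.patch` strings (your tagging scheme may differ):

```python
def parse_release(release: str) -> tuple[int, ...]:
    """Parse a 'major.minor.patch' release string into a comparable tuple."""
    return tuple(int(part) for part in release.split("."))

def is_regression(resolved_in: str, seen_in: str) -> bool:
    """The issue regressed if it is seen in the release that shipped the fix, or later."""
    return parse_release(seen_in) >= parse_release(resolved_in)
```

Tuple comparison handles multi-digit components correctly (`1.10.0` sorts after `1.9.0`), which naive string comparison would get wrong.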
Common Causes¶
- A new deployment overwrote or conflicted with the original fix
- The original fix was incomplete (fixed one case but not edge cases)
- Database migration changed data shape that the fix depended on
- Dependency update changed behavior
Alert Type 3: Error Spike Alert¶
Channel: #errors
Severity: Warning (>10/hr) or Critical (>50/hr)
Source: Sentry metric alert rule
What it means: The error rate has crossed the threshold, indicating a systemic issue rather than isolated errors.
Response Steps¶
1. Assess the severity:
   - Warning (>10/hr): elevated errors, investigate within 30 minutes
   - Critical (>50/hr): major incident, investigate immediately
2. Check that the application is running: docker compose ps (all services should show "Up" and "healthy").
3. Check for a recent deployment.
4. Check the application logs for patterns.
5. Check external dependencies (database, Redis, Mollie, DigitalOcean Spaces).
6. Check Sentry for the top issues:
   - Go to Sentry > Issues > sort by "Events" > last hour
   - The top issue is likely the cause of the spike
7. If the spike was caused by a deployment, consider a rollback (see the rollback steps in Alert Type 2).
8. If it was caused by an external service failure:
   - Check Mollie status: https://status.mollie.com
   - Check DigitalOcean Spaces status: https://status.digitalocean.com
   - If the cause is external, the spike will resolve when the service recovers
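The severity thresholds above map directly to a small classifier. This mirrors the stated alert rule (>10/hr warning, >50/hr critical); the function name and the `"ok"` label are illustrative:

```python
def spike_severity(errors_per_hour: int) -> str:
    """Classify the error rate using the Sentry metric alert thresholds."""
    if errors_per_hour > 50:
        return "critical"  # major incident, investigate immediately
    if errors_per_hour > 10:
        return "warning"   # elevated errors, investigate within 30 minutes
    return "ok"            # below alert threshold
```

Note the strict `>` comparisons: exactly 10 errors/hour does not trip the warning, matching the ">10/hr" wording.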
Common Causes¶
- Bad deployment (new bug affecting many requests)
- Database connection issues (pool exhausted, slow queries)
- Redis connection issues (cache failures cascading)
- External API outage (Mollie, DigitalOcean)
- DNS or certificate issues (SSL expiry)
Alert Type 4: Payment Error¶
Channel: #payments
Severity: Critical
Source: Sentry issue alert rule (tag: domain=payment)
What it means: A new payment-related error occurred. Customers may be unable to complete purchases.
Response Steps¶
1. Click the Sentry link immediately.
2. Determine the error type:
   - Payment creation failure (Mollie API error)
   - Webhook processing failure (payment status update failed)
   - Checkout flow error (cart/order creation)
   - Invoice generation error
3. Check the payment logs: docker compose logs backend --tail=100 | grep payment
4. Check the Mollie dashboard:
   - Log in to https://my.mollie.com
   - Go to Payments > check recent payment statuses
   - Are payments being created but failing at Mollie?
5. Check Mollie API status:
   - Visit https://status.mollie.com
   - If Mollie is down, customers will see payment failures until the service recovers
6. Verify webhook connectivity.
7. If the error is in our code:
   - Check the traceback in Sentry for the exact failure point
   - Common locations: apps/orders/services/, apps/orders/views.py
   - If a quick fix is possible, deploy immediately
8. Customer communication: if payment failures are widespread, consider adding a banner to the site.
9. Check for stuck orders (orders with pending_payment status that are older than 1 hour):

   ```
   docker compose exec backend python manage.py shell -c "
   from apps.orders.models import Order
   from django.utils import timezone
   from datetime import timedelta
   stuck = Order.objects.filter(
       status='pending_payment',
       created_at__lt=timezone.now() - timedelta(hours=1),
   ).count()
   print(f'Stuck orders: {stuck}')
   "
   ```
Common Causes¶
- Mollie API outage or degradation
- Mollie API key expired or revoked
- Webhook URL not reachable (nginx misconfiguration, SSL issue)
- Code bug in payment flow (especially after deployment)
- Currency/amount formatting error
Alert Type 5: Payment Error Spike¶
Channel: #payments
Severity: Critical
Source: Sentry metric alert rule (>3 payment errors/hour)
What it means: Multiple payment errors in a short period. This is a systemic payment failure, not an isolated error.
Response Steps¶
1. Follow all steps from Alert Type 4 (Payment Error).
2. Additional step for a spike: check whether ALL payments are failing:

   ```
   docker compose exec backend python manage.py shell -c "
   from apps.orders.models import Order
   from django.utils import timezone
   from datetime import timedelta
   one_hour_ago = timezone.now() - timedelta(hours=1)
   total = Order.objects.filter(created_at__gte=one_hour_ago).count()
   failed = Order.objects.filter(created_at__gte=one_hour_ago, status='payment_failed').count()
   paid = Order.objects.filter(created_at__gte=one_hour_ago, status='paid').count()
   print(f'Last hour: {total} orders, {paid} paid, {failed} failed')
   "
   ```

3. If the failure rate is 100%:
   - Likely a Mollie outage or API key issue
   - Check that MOLLIE_API_KEY is set: docker compose exec backend env | grep MOLLIE
   - Consider temporarily disabling checkout with a maintenance message
4. If the failure rate is partial:
   - Likely a specific payment method or amount issue
   - Check Sentry for common attributes across the failing payments
5. Monitor resolution:
   - The metric alert will auto-resolve when errors drop below the threshold
   - Verify in Sentry that the error rate is declining
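The 100%-versus-partial decision can be sketched from the last-hour order counts printed by the shell check above. The state names and the `no_traffic` guard are illustrative assumptions:

```python
def classify_payment_spike(total: int, failed: int) -> str:
    """Decide the likely failure mode from last-hour order counts (illustrative labels)."""
    if total == 0:
        return "no_traffic"       # nothing to conclude from an empty hour
    failure_rate = failed / total
    if failure_rate >= 1.0:
        return "total_outage"     # likely a Mollie outage or API key issue
    if failure_rate > 0:
        return "partial_failure"  # likely a specific payment method or amount issue
    return "healthy"
```

A `total_outage` result points at the MOLLIE_API_KEY / Mollie status checks; `partial_failure` points at comparing attributes of the failing payments in Sentry.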
Common Causes¶
- Same as Alert Type 4, but systemic
- API key rotation without updating env var
- Mollie account suspension (check email)
- Network issue between server and Mollie API
Alert Type 6: Backup Failure¶
Channel: #backups
Severity: High
Source: Django backup task (Celery) Discord notification
What it means: A scheduled database backup failed. Data protection is compromised until the next successful backup.
Response Steps¶
1. Check the failure details in the Discord notification (it includes the error message and task ID).
2. Check the backup task logs (Celery worker; see Log Locations below).
3. Try a manual backup with the backup_database management command. If the manual backup succeeds, the failure was likely transient.
4. Check disk space: df -h (backups will fail if the disk is full).
5. Check database connectivity.
6. Check the backup storage directory.
7. Verify the backup schedule is running: look for the backup task being scheduled in the Celery Beat logs.
8. If disk space is the issue, follow Alert Type 8 (Disk Space Alert).
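The disk-space check above can be automated as a preflight before retrying the backup. `shutil.disk_usage` is standard library; the 1 GiB minimum free space is an assumed safety margin, not a documented requirement of the backup task:

```python
import shutil

def enough_disk_for_backup(path: str = "/", min_free_bytes: int = 1 << 30) -> bool:
    """Return True if the filesystem containing `path` has at least `min_free_bytes` free."""
    usage = shutil.disk_usage(path)  # total, used, free in bytes
    return usage.free >= min_free_bytes
```

Running this before a manual backup avoids producing a truncated dump on a nearly full disk.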
Common Causes¶
- Disk full (backup volume)
- Database connection timeout during dump
- PostgreSQL in recovery mode (after crash)
- Celery worker crashed or restarted during backup
- File permission issue on backup directory
Alert Type 7: Missed Backup Warning¶
Channel: #backups
Severity: High
Source: Daily backup summary task (checks for stale backups)
What it means: No successful backup has been recorded in the last 25 hours. The backup system may be silently broken.
Response Steps¶
1. Check whether backups are running at all (Celery worker logs; see Log Locations below).
2. Check Celery worker health: docker compose ps. If the Celery worker is not running, backups cannot execute.
3. Check the backup monitor metrics:

   ```
   docker compose exec backend python manage.py shell -c "
   from apps.core.services.backup_monitor import BackupMonitor
   monitor = BackupMonitor()
   metrics = monitor.get_metrics()
   print(f'Last backup: {metrics.get(\"last_backup_time\", \"Never\")}')
   print(f'Success rate: {metrics.get(\"success_rate\", \"Unknown\")}%')
   print(f'Total backups: {metrics.get(\"total_backups\", 0)}')
   "
   ```

4. Check Redis (backup history is cached there). If Redis is down, backup history may be lost, but backups themselves still run.
5. Run a manual backup to restore the schedule.
6. If Celery Beat is not scheduling tasks, check its logs (see Log Locations below) and restart the service: docker compose restart celery-beat
Common Causes¶
- Celery Beat crashed or was not started
- Celery worker crashed (tasks scheduled but not executed)
- Redis down (task results not stored, but tasks may still run)
- Docker container restarted and lost the Beat schedule
- Server reboot without docker compose up
Alert Type 8: Disk Space Alert¶
Channel: #errors
Severity: Warning (>80%) or Critical (>90%)
Source: Health check system (Django)
What it means: Server disk usage has crossed a threshold. If it reaches 100%, the application will crash.
Response Steps¶
1. Check current disk usage: df -h
2. Identify what is consuming space (for example with du -sh /* or docker system df).
3. Common space consumers:
   - Docker images and containers: docker system df
   - Old backups:

     ```
     docker compose exec backend ls -lah /backups/ | head -20
     # Run cleanup
     docker compose exec backend python manage.py backup_database --cleanup
     ```

   - Application logs:

     ```
     du -sh /var/lib/docker/containers/*/
     # Truncate large container log files
     truncate -s 0 /var/lib/docker/containers/*/*-json.log
     ```

   - PostgreSQL WAL files
4. If usage is above 90% (critical):
   - Delete old Docker images: docker image prune -a -f
   - Clean up backup retention: docker compose exec backend python manage.py backup_database --cleanup
   - Consider increasing the disk size at Hetzner
5. If usage is above 95% (emergency):
   - Delete the oldest backups manually
   - Truncate Django log files
   - Increase the disk immediately in the Hetzner Cloud console
6. Set up prevention:
   - Ensure the backup retention policy is active (keeps 24h hourly, 7d daily, 4w weekly backups)
   - Consider adding log rotation to docker-compose
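The retention policy above (24h of hourly, 7d of daily, 4w of weekly backups) can be illustrated as a pruning rule. This is a sketch of the stated policy, not the actual `backup_database --cleanup` implementation; the midnight and Monday anchors are assumptions:

```python
from datetime import datetime, timedelta

def keep_backup(created: datetime, now: datetime) -> bool:
    """Apply the stated retention tiers: 24h hourly, 7d daily, 4w weekly."""
    age = now - created
    if age <= timedelta(hours=24):
        return True  # keep every hourly backup from the last day
    if age <= timedelta(days=7):
        return created.hour == 0  # keep one per day (assume a midnight run)
    if age <= timedelta(weeks=4):
        # keep one per week (assume the Monday-midnight run)
        return created.weekday() == 0 and created.hour == 0
    return False  # older than 4 weeks: delete
```

Filtering a directory listing of backup timestamps through `keep_backup` yields exactly the files the policy says to retain; everything else is safe to delete.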
Common Causes¶
- Backup retention not running (accumulating hourly backups)
- Docker images accumulating from deployments without cleanup
- PostgreSQL WAL files growing (replication lag or missing checkpoint)
- Application log files growing without rotation
- Media uploads filling disk (should be on DigitalOcean Spaces, not local)
Alert Type 9: Admin Delete Action¶
Channel: #admin
Severity: Info
Source: Django admin audit trail mixin
What it means: An admin user performed a delete action in the Django admin interface. Deletes are irreversible and logged for audit purposes.
Response Steps¶
1. Review the notification details:
   - Who performed the delete?
   - What model/object was deleted?
   - When did it happen?
2. Verify the delete was intentional: check with the admin user whether it was planned.
3. If unexpected, check the Django admin audit log for more context.
4. If the delete was accidental:
   - Check whether a backup exists from before the delete
   - Consult the Disaster Recovery runbook for restore instructions
   - Note: restoring a full backup will overwrite all changes made since that backup
5. No action is needed if the delete was intentional and expected -- this alert is informational only, and most delete notifications require no response.
Common Causes¶
- Normal admin operations (removing test data, cleaning up old records)
- Accidental bulk delete (admin list view checkbox + delete action)
- Admin cleanup of expired/orphaned data
Escalation Procedures¶
For a solo-operator setup, "escalation" means increasing your own response urgency.
Severity Levels¶
| Severity | Response Time | Description |
|---|---|---|
| Info | Next business day | Informational, no action usually needed |
| Medium | Within 4 hours | Investigate, may need a fix in next deployment |
| High | Within 1 hour | Needs prompt attention, potential data or service impact |
| Critical | Within 15-30 min | Active incident affecting customers or revenue |
When to Consider Downtime¶
Take the site into maintenance mode if:
- Payment system is completely broken (100% failure rate for >30 min)
- Error rate exceeds 100/hour and rising (cascading failure)
- Database is corrupt or unreachable
- Security breach suspected (unauthorized admin access)
Maintenance Mode¶
```
# Quick maintenance page via Nginx (if configured)
# Or: stop the frontend to prevent new orders
docker compose stop frontend

# Fix the issue

# Restart
docker compose up -d frontend
```
Log Locations¶
Quick reference for finding logs during incident response.
| Service | Log Command | What It Contains |
|---|---|---|
| Backend (Django) | docker compose logs backend --tail=N | API requests, errors, application logs |
| Celery worker | docker compose logs celery --tail=N | Task execution, backup results |
| Celery Beat | docker compose logs celery-beat --tail=N | Task scheduling |
| Nginx | docker compose logs nginx --tail=N | HTTP requests, proxy errors |
| PostgreSQL | docker compose logs db --tail=N | Database queries, connections |
| Redis | docker compose logs redis --tail=N | Cache operations, connection issues |
Filtering Logs by Time¶
```
# Last hour
docker compose logs backend --since=1h

# Last 30 minutes
docker compose logs backend --since=30m

# Since a specific time (UTC)
docker compose logs backend --since="2026-01-15T10:00:00"
```