Alerting Runbook¶
Step-by-step response procedures for every alert type in the Freeze Design monitoring system.
Overview¶
Alert Routing Diagram¶
```
Error occurs in Django
        |
        v
Sentry captures it
        |
        +---> [New Issue] --------> #errors   (Discord)
        +---> [Regression] -------> #errors   (Discord)
        +---> [Error Spike] ------> #errors   (Discord)
        +---> [Payment Error] ----> #payments (Discord)
        +---> [Payment Spike] ----> #payments (Discord)

Celery backup tasks
        |
        +---> [Backup Failure] ---> #backups  (Discord webhook)
        +---> [Missed Backup] ----> #backups  (Discord webhook)

Health check system
        |
        +---> [Disk Space Alert] -> #errors   (Discord webhook)

Admin audit mixin
        |
        +---> [Admin Delete] -----> #admin    (Discord webhook)
```
Quick Reference Table¶
| # | Alert | Channel | Severity | Response Time | Key Command |
|---|---|---|---|---|---|
| 1 | New Issue | #errors | Medium | Within 4 hours | docker compose logs backend --tail=100 |
| 2 | Regression | #errors | High | Within 1 hour | docker compose logs backend --tail=200 |
| 3 | Error Spike | #errors | Warning/Critical | Within 30 min | docker compose ps |
| 4 | Payment Error | #payments | Critical | Within 15 min | docker compose logs backend --tail=100 \| grep payment |
| 5 | Payment Error Spike | #payments | Critical | Within 15 min | docker compose logs backend --tail=200 \| grep payment |
| 6 | Backup Failure | #backups | High | Within 2 hours | docker compose exec backend python manage.py backup_database --check |
| 7 | Missed Backup | #backups | High | Within 2 hours | docker compose exec backend python manage.py shell -c "from apps.core.services.backup_monitor import BackupMonitor; print(BackupMonitor().get_metrics())" |
| 8 | Disk Space Alert | #errors | Warning/Critical | Within 1 hour | df -h |
| 9 | Admin Delete | #admin | Info | Next business day | Check Django admin audit log |
Alert Type 1: New Issue Alert¶
Channel: #errors
Severity: Medium
Source: Sentry issue alert rule
What it means: Sentry detected an error it has never seen before in production.
Response Steps¶
1. Click the Sentry link in the Discord notification to open the issue.
2. Assess the impact:
   - How many users are affected? (check the "Events" count in Sentry)
   - Is it a 500 error visible to customers?
   - What page/endpoint is affected?
3. Check recent deployments: did a deploy land shortly before the error first appeared?
4. Check the application logs: docker compose logs backend --tail=100
5. If the error is user-facing:
   - Determine whether a hotfix is needed
   - If it is in a critical path (checkout, payment), escalate to the Payment Error procedures
6. Resolve or assign in Sentry:
   - If one-off / transient: mark as Ignored with a note
   - If it needs fixing: leave the issue open and add it to the backlog
   - If fixed by a deploy: mark as Resolved in the next release
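The impact assessment and the resolve-or-assign decision above can be sketched as a small triage helper. This is an illustrative sketch only: the event threshold and the action names are assumptions, not part of the monitoring system.

```python
def triage_new_issue(events: int, user_facing: bool, critical_path: bool) -> str:
    """Map a quick impact assessment to a Sentry action (illustrative rules)."""
    if critical_path:
        # Checkout/payment errors escalate to the Payment Error procedure
        return "escalate_payment"
    if user_facing and events > 10:
        # Many customers are seeing the error: consider a hotfix
        return "hotfix"
    if events <= 1 and not user_facing:
        # One-off / transient: mark as Ignored with a note
        return "ignore_with_note"
    # Otherwise leave the issue open and add it to the backlog
    return "backlog"
```

For example, a single non-user-facing event triages to `ignore_with_note`, matching the "one-off / transient" rule above.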
Common Causes¶
- New code deployment introduced a bug
- External service API changed (Mollie, DigitalOcean Spaces)
- Edge case in user input not previously encountered
- Browser-specific JavaScript error (if frontend Sentry is added)
Alert Type 2: Regression Alert¶
Channel: #errors
Severity: High
Source: Sentry issue alert rule
What it means: An issue that was previously marked as resolved has reappeared. This means a fix did not hold or was reverted.
Response Steps¶
1. Click the Sentry link to open the regressed issue.
2. Check the issue history in Sentry:
   - When was it originally resolved?
   - What release resolved it?
   - What release caused the regression?
3. Compare the two releases to find the change that conflicts with the original fix.
4. Check the application logs for the specific error: docker compose logs backend --tail=200
5. If the regression is in a critical path, consider rolling back to the previous release.
6. Create a fix:
   - A regression indicates the original fix was incomplete or a new change conflicted with it
   - Write a test that covers the regression scenario
   - Deploy the fix and verify in Sentry that the issue resolves
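A regression means the error reappeared in the release that shipped the fix, or a later one. A minimal comparison sketch, assuming releases are plain `major.minor.patch` strings (your tagging scheme may differ):

```python
def parse_release(release: str) -> tuple[int, ...]:
    """Parse a 'major.minor.patch' release string into a comparable tuple."""
    return tuple(int(part) for part in release.split("."))

def is_regression(resolved_in: str, seen_in: str) -> bool:
    """The issue regressed if it is seen in the release that shipped the fix, or later."""
    return parse_release(seen_in) >= parse_release(resolved_in)
```

Tuple comparison handles multi-digit components correctly (`1.10.0` sorts after `1.9.0`), which naive string comparison would get wrong.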
Common Causes¶
- A new deployment overwrote or conflicted with the original fix
- The original fix was incomplete (fixed one case but not edge cases)
- Database migration changed data shape that the fix depended on
- Dependency update changed behavior
Alert Type 3: Error Spike Alert¶
Channel: #errors
Severity: Warning (>10/hr) or Critical (>50/hr)
Source: Sentry metric alert rule
What it means: The error rate has crossed the threshold, indicating a systemic issue rather than isolated errors.
Response Steps¶
1. Assess the severity:
   - Warning (>10/hr): elevated errors, investigate within 30 minutes
   - Critical (>50/hr): major incident, investigate immediately
2. Check that the application is running: docker compose ps (all services should show "Up" and "healthy").
3. Check for a recent deployment.
4. Check the application logs for patterns.
5. Check external dependencies (database, Redis, Mollie, DigitalOcean Spaces).
6. Check Sentry for the top issues:
   - Go to Sentry > Issues > sort by "Events" > last hour
   - The top issue is likely the cause of the spike
7. If the spike was caused by a deployment, consider a rollback (see the rollback steps in Alert Type 2).
8. If it was caused by an external service failure:
   - Check Mollie status: https://status.mollie.com
   - Check DigitalOcean Spaces status: https://status.digitalocean.com
   - If the cause is external, the spike will resolve when the service recovers
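The severity thresholds above map directly to a small classifier. This mirrors the stated alert rule (>10/hr warning, >50/hr critical); the function name and the `"ok"` label are illustrative:

```python
def spike_severity(errors_per_hour: int) -> str:
    """Classify the error rate using the Sentry metric alert thresholds."""
    if errors_per_hour > 50:
        return "critical"  # major incident, investigate immediately
    if errors_per_hour > 10:
        return "warning"   # elevated errors, investigate within 30 minutes
    return "ok"            # below alert threshold
```

Note the strict `>` comparisons: exactly 10 errors/hour does not trip the warning, matching the ">10/hr" wording.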
Common Causes¶
- Bad deployment (new bug affecting many requests)
- Database connection issues (pool exhausted, slow queries)
- Redis connection issues (cache failures cascading)
- External API outage (Mollie, DigitalOcean)
- DNS or certificate issues (SSL expiry)
Alert Type 4: Payment Error¶
Channel: #payments
Severity: Critical
Source: Sentry issue alert rule (tag: domain=payment)
What it means: A new payment-related error occurred. Customers may be unable to complete purchases.
Response Steps¶
1. Click the Sentry link immediately.
2. Determine the error type:
   - Payment creation failure (Mollie API error)
   - Webhook processing failure (payment status update failed)
   - Checkout flow error (cart/order creation)
   - Invoice generation error
3. Check the payment logs: docker compose logs backend --tail=100 | grep payment
4. Check the Mollie dashboard:
   - Log in to https://my.mollie.com
   - Go to Payments > check recent payment statuses
   - Are payments being created but failing at Mollie?
5. Check Mollie API status:
   - Visit https://status.mollie.com
   - If Mollie is down, customers will see payment failures until the service recovers
6. Verify webhook connectivity.
7. If the error is in our code:
   - Check the traceback in Sentry for the exact failure point
   - Common locations: apps/orders/services/, apps/orders/views.py
   - If a quick fix is possible, deploy immediately
8. Customer communication: if payment failures are widespread, consider adding a banner to the site.
9. Check for stuck orders (orders with pending_payment status that are older than 1 hour):

   ```
   docker compose exec backend python manage.py shell -c "
   from apps.orders.models import Order
   from django.utils import timezone
   from datetime import timedelta
   stuck = Order.objects.filter(
       status='pending_payment',
       created_at__lt=timezone.now() - timedelta(hours=1),
   ).count()
   print(f'Stuck orders: {stuck}')
   "
   ```
Common Causes¶
- Mollie API outage or degradation
- Mollie API key expired or revoked
- Webhook URL not reachable (nginx misconfiguration, SSL issue)
- Code bug in payment flow (especially after deployment)
- Currency/amount formatting error
Alert Type 5: Payment Error Spike¶
Channel: #payments
Severity: Critical
Source: Sentry metric alert rule (>3 payment errors/hour)
What it means: Multiple payment errors in a short period. This is a systemic payment failure, not an isolated error.
Response Steps¶
1. Follow all steps from Alert Type 4 (Payment Error).
2. Additional step for a spike: check whether ALL payments are failing:

   ```
   docker compose exec backend python manage.py shell -c "
   from apps.orders.models import Order
   from django.utils import timezone
   from datetime import timedelta
   one_hour_ago = timezone.now() - timedelta(hours=1)
   total = Order.objects.filter(created_at__gte=one_hour_ago).count()
   failed = Order.objects.filter(created_at__gte=one_hour_ago, status='payment_failed').count()
   paid = Order.objects.filter(created_at__gte=one_hour_ago, status='paid').count()
   print(f'Last hour: {total} orders, {paid} paid, {failed} failed')
   "
   ```

3. If the failure rate is 100%:
   - Likely a Mollie outage or API key issue
   - Check that MOLLIE_API_KEY is set: docker compose exec backend env | grep MOLLIE
   - Consider temporarily disabling checkout with a maintenance message
4. If the failure rate is partial:
   - Likely a specific payment method or amount issue
   - Check Sentry for common attributes across the failing payments
5. Monitor resolution:
   - The metric alert will auto-resolve when errors drop below the threshold
   - Verify in Sentry that the error rate is declining
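The 100%-versus-partial decision can be sketched from the last-hour order counts printed by the shell check above. The state names and the `no_traffic` guard are illustrative assumptions:

```python
def classify_payment_spike(total: int, failed: int) -> str:
    """Decide the likely failure mode from last-hour order counts (illustrative labels)."""
    if total == 0:
        return "no_traffic"       # nothing to conclude from an empty hour
    failure_rate = failed / total
    if failure_rate >= 1.0:
        return "total_outage"     # likely a Mollie outage or API key issue
    if failure_rate > 0:
        return "partial_failure"  # likely a specific payment method or amount issue
    return "healthy"
```

A `total_outage` result points at the MOLLIE_API_KEY / Mollie status checks; `partial_failure` points at comparing attributes of the failing payments in Sentry.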
Common Causes¶
- Same as Alert Type 4, but systemic
- API key rotation without updating env var
- Mollie account suspension (check email)
- Network issue between server and Mollie API
Alert Type 6: Backup Failure¶
Channel: #backups
Severity: High
Source: Django backup task (Celery) Discord notification
What it means: A scheduled database backup failed. Data protection is compromised until the next successful backup.
Response Steps¶
1. Check the failure details in the Discord notification (it includes the error message and task ID).
2. Check the backup task logs (Celery worker; see Log Locations below).
3. Try a manual backup with the backup_database management command. If the manual backup succeeds, the failure was likely transient.
4. Check disk space: df -h (backups will fail if the disk is full).
5. Check database connectivity.
6. Check the backup storage directory.
7. Verify the backup schedule is running: look for the backup task being scheduled in the Celery Beat logs.
8. If disk space is the issue, follow Alert Type 8 (Disk Space Alert).
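The disk-space check above can be automated as a preflight before retrying the backup. `shutil.disk_usage` is standard library; the 1 GiB minimum free space is an assumed safety margin, not a documented requirement of the backup task:

```python
import shutil

def enough_disk_for_backup(path: str = "/", min_free_bytes: int = 1 << 30) -> bool:
    """Return True if the filesystem containing `path` has at least `min_free_bytes` free."""
    usage = shutil.disk_usage(path)  # total, used, free in bytes
    return usage.free >= min_free_bytes
```

Running this before a manual backup avoids producing a truncated dump on a nearly full disk.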
Common Causes¶
- Disk full (backup volume)
- Database connection timeout during dump
- PostgreSQL in recovery mode (after crash)
- Celery worker crashed or restarted during backup
- File permission issue on backup directory
Alert Type 7: Missed Backup Warning¶
Channel: #backups
Severity: High
Source: Daily backup summary task (checks for stale backups)
What it means: No successful backup has been recorded in the last 25 hours. The backup system may be silently broken.
Response Steps¶
1. Check whether backups are running at all (Celery worker logs; see Log Locations below).
2. Check Celery worker health: docker compose ps. If the Celery worker is not running, backups cannot execute.
3. Check the backup monitor metrics:

   ```
   docker compose exec backend python manage.py shell -c "
   from apps.core.services.backup_monitor import BackupMonitor
   monitor = BackupMonitor()
   metrics = monitor.get_metrics()
   print(f'Last backup: {metrics.get(\"last_backup_time\", \"Never\")}')
   print(f'Success rate: {metrics.get(\"success_rate\", \"Unknown\")}%')
   print(f'Total backups: {metrics.get(\"total_backups\", 0)}')
   "
   ```

4. Check Redis (backup history is cached there). If Redis is down, backup history may be lost, but backups themselves still run.
5. Run a manual backup to restore the schedule.
6. If Celery Beat is not scheduling tasks, check its logs (see Log Locations below) and restart the service: docker compose restart celery-beat
Common Causes¶
- Celery Beat crashed or was not started
- Celery worker crashed (tasks scheduled but not executed)
- Redis down (task results not stored, but tasks may still run)
- Docker container restarted and lost the Beat schedule
- Server reboot without docker compose up
Alert Type 8: Disk Space Alert¶
Channel: #errors
Severity: Warning (>80%) or Critical (>90%)
Source: Health check system (Django)
What it means: Server disk usage has crossed a threshold. If it reaches 100%, the application will crash.
Response Steps¶
1. Check current disk usage: df -h
2. Identify what is consuming space (for example with du -sh /* or docker system df).
3. Common space consumers:
   - Docker images and containers: docker system df
   - Old backups:

     ```
     docker compose exec backend ls -lah /backups/ | head -20
     # Run cleanup
     docker compose exec backend python manage.py backup_database --cleanup
     ```

   - Application logs:

     ```
     du -sh /var/lib/docker/containers/*/
     # Truncate large container log files
     truncate -s 0 /var/lib/docker/containers/*/*-json.log
     ```

   - PostgreSQL WAL files
4. If usage is above 90% (critical):
   - Delete old Docker images: docker image prune -a -f
   - Clean up backup retention: docker compose exec backend python manage.py backup_database --cleanup
   - Consider increasing the disk size at Hetzner
5. If usage is above 95% (emergency):
   - Delete the oldest backups manually
   - Truncate Django log files
   - Increase the disk immediately in the Hetzner Cloud console
6. Set up prevention:
   - Ensure the backup retention policy is active (keeps 24h hourly, 7d daily, 4w weekly backups)
   - Consider adding log rotation to docker-compose
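The retention policy above (24h of hourly, 7d of daily, 4w of weekly backups) can be illustrated as a pruning rule. This is a sketch of the stated policy, not the actual `backup_database --cleanup` implementation; the midnight and Monday anchors are assumptions:

```python
from datetime import datetime, timedelta

def keep_backup(created: datetime, now: datetime) -> bool:
    """Apply the stated retention tiers: 24h hourly, 7d daily, 4w weekly."""
    age = now - created
    if age <= timedelta(hours=24):
        return True  # keep every hourly backup from the last day
    if age <= timedelta(days=7):
        return created.hour == 0  # keep one per day (assume a midnight run)
    if age <= timedelta(weeks=4):
        # keep one per week (assume the Monday-midnight run)
        return created.weekday() == 0 and created.hour == 0
    return False  # older than 4 weeks: delete
```

Filtering a directory listing of backup timestamps through `keep_backup` yields exactly the files the policy says to retain; everything else is safe to delete.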
Common Causes¶
- Backup retention not running (accumulating hourly backups)
- Docker images accumulating from deployments without cleanup
- PostgreSQL WAL files growing (replication lag or missing checkpoint)
- Application log files growing without rotation
- Media uploads filling disk (should be on DigitalOcean Spaces, not local)
Alert Type 9: Admin Delete Action¶
Channel: #admin
Severity: Info
Source: Django admin audit trail mixin
What it means: An admin user performed a delete action in the Django admin interface. Deletes are irreversible and logged for audit purposes.
Response Steps¶
1. Review the notification details:
   - Who performed the delete?
   - What model/object was deleted?
   - When did it happen?
2. Verify the delete was intentional: check with the admin user whether it was planned.
3. If unexpected, check the Django admin audit log for more context.
4. If the delete was accidental:
   - Check whether a backup exists from before the delete
   - Consult the Disaster Recovery runbook for restore instructions
   - Note: restoring a full backup will overwrite all changes made since that backup
5. No action is needed if the delete was intentional and expected -- this alert is informational only, and most delete notifications require no response.
Common Causes¶
- Normal admin operations (removing test data, cleaning up old records)
- Accidental bulk delete (admin list view checkbox + delete action)
- Admin cleanup of expired/orphaned data
Escalation Procedures¶
For a solo-operator setup, "escalation" means increasing your own response urgency.
Severity Levels¶
| Severity | Response Time | Description |
|---|---|---|
| Info | Next business day | Informational, no action usually needed |
| Medium | Within 4 hours | Investigate, may need a fix in next deployment |
| High | Within 1 hour | Needs prompt attention, potential data or service impact |
| Critical | Within 15-30 min | Active incident affecting customers or revenue |
When to Consider Downtime¶
Take the site into maintenance mode if:
- Payment system is completely broken (100% failure rate for >30 min)
- Error rate exceeds 100/hour and rising (cascading failure)
- Database is corrupt or unreachable
- Security breach suspected (unauthorized admin access)
Maintenance Mode¶
```
# Quick maintenance page via Nginx (if configured)
# Or: stop the frontend to prevent new orders
docker compose stop frontend

# Fix the issue

# Restart
docker compose up -d frontend
```
Log Locations¶
Quick reference for finding logs during incident response.
| Service | Log Command | What It Contains |
|---|---|---|
| Backend (Django) | docker compose logs backend --tail=N | API requests, errors, application logs |
| Celery worker | docker compose logs celery --tail=N | Task execution, backup results |
| Celery Beat | docker compose logs celery-beat --tail=N | Task scheduling |
| Nginx | docker compose logs nginx --tail=N | HTTP requests, proxy errors |
| PostgreSQL | docker compose logs db --tail=N | Database queries, connections |
| Redis | docker compose logs redis --tail=N | Cache operations, connection issues |
Filtering Logs by Time¶
```
# Last hour
docker compose logs backend --since=1h

# Last 30 minutes
docker compose logs backend --since=30m

# Since a specific time (UTC)
docker compose logs backend --since="2026-01-15T10:00:00"
```