🦆 Ops Runbook

Space Duck · Incident response procedures

Operator Runbook

Common incident patterns, triage steps, and resolution commands for the Spaceduckling production stack.

Contents

  1. Dead Lambda — Function not responding
  2. Cert Pipeline Stall — No new certs issuing
  3. Agent Fleet All-Dead — agents.alive = 0
  4. Peck callback_error Spike
  5. SES Quota Exhaustion — Email delivery blocked

INC-001 Dead Lambda — Function Not Responding (P0)

The Lambda function powering /beak/* routes is not returning 200s. All Mission Control tiles show failure, and the API returns 5xx errors or timeouts.

Signals

  • Mission Control API card showing red / all module cards red
  • GET /beak/system/status returns 500, 502, or connection timeout
  • CloudWatch: Lambda error rate > 0 in last 5 minutes
  • Alias drift banner showing on Mission Control

1. Confirm the outage

Run a direct status probe:

curl -s -o /dev/null -w "%{http_code}" \
  https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status

Expected: 200. Anything else = Lambda down.

2. Check Lambda logs

Pull the latest CloudWatch errors:

aws logs filter-log-events \
  --log-group-name /aws/lambda/mission-control-api \
  --start-time $(date -d '15 minutes ago' +%s000) \
  --filter-pattern "ERROR" \
  --region us-east-1 \
  --query 'events[*].message' --output text | tail -20
3. Verify prod alias target
aws lambda get-alias \
  --function-name mission-control-api \
  --name prod \
  --region us-east-1

Confirm FunctionVersion points to the expected version number.
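To make the comparison mechanical, a minimal sketch that checks the get-alias output against the expected version. The payload shape matches `aws lambda get-alias` JSON; the version numbers are illustrative, and the expected value must come from DEPLOY-LOG.md.

```python
import json

def alias_drifted(get_alias_json: str, expected_version: str) -> bool:
    """True when the prod alias points at something other than the
    expected version recorded in DEPLOY-LOG.md."""
    return json.loads(get_alias_json).get("FunctionVersion") != expected_version

# Payload shaped like `aws lambda get-alias` output; values here are
# illustrative, not real deployment state.
sample = json.dumps({"Name": "prod", "FunctionVersion": "42"})
print(alias_drifted(sample, "42"))  # False -> no drift
print(alias_drifted(sample, "41"))  # True  -> roll back per step 4
```

Pipe the get-alias output into a check like this from a triage script so the drift decision is not made by eyeball.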

4. Roll back if the alias has drifted

If the alias points to a broken version, promote the last known-good version:

aws lambda update-alias \
  --function-name mission-control-api \
  --name prod \
  --function-version <PREV_VERSION> \
  --region us-east-1

Replace <PREV_VERSION> with the last working version from DEPLOY-LOG.md.

5. Verify recovery

Repeat the probe from step 1. Confirm 200. Reload Mission Control and confirm all tiles green.

Escalation: If rollback does not restore service within 5 minutes, escalate to Josh (T-JOSH). Document the incident in GOVERNANCE-LOG.md.

INC-002 Cert Pipeline Stall (P1)

New birth certificates are not being issued despite new duckling registrations. database.birth_certificates count is not growing relative to database.ducklings.

Signals

  • Cert Pipeline tile: issued count flat, pending backlog rising
  • Cert Issuance Latency tile: "Awaiting cert signal" persisting > 24h
  • Audit Activity: cert events absent from last 24h feed

1. Confirm cert counts from status
curl -s https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print('ducklings:', d['database']['ducklings'], 'certs:', d['database']['birth_certificates'])"
2. Check audit for last cert event
curl -s -X POST https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/audit \
  -H "Content-Type: application/json" -d '{}' \
  | python3 -c "import sys,json; entries=json.load(sys.stdin).get('entries',[]); certs=[e for e in entries if 'cert' in e.get('event_type','')]; print(certs[:3] if certs else 'No cert events in audit feed')"
3. Verify cert issuance Lambda path

Check Lambda logs for duck.cert_issued events or SES send errors:

aws logs filter-log-events \
  --log-group-name /aws/lambda/mission-control-api \
  --start-time $(date -d '1 hour ago' +%s000) \
  --filter-pattern "cert" \
  --region us-east-1 \
  --query 'events[*].message' --output text | head -20
4. Check SES sandbox status

If SES is in sandbox, cert delivery emails may be failing silently for unverified recipients. Confirm SES posture in the Sandbox Exit Readiness tile and review system/status.ses.
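A small sketch for reading the posture out of `aws sesv2 get-account` JSON. The `SendingEnabled` and `ProductionAccessEnabled` fields are part of the SESv2 GetAccount response; the sample values below are examples, not live account state.

```python
import json

def ses_posture(get_account_json: str) -> str:
    """Summarize SES posture from `aws sesv2 get-account` output."""
    acct = json.loads(get_account_json)
    if not acct.get("SendingEnabled", False):
        return "sending disabled"
    if not acct.get("ProductionAccessEnabled", False):
        # In sandbox, sends to unverified recipients fail.
        return "sandbox"
    return "production"

# Illustrative payload (example values, not live account state):
sample = json.dumps({
    "SendingEnabled": True,
    "ProductionAccessEnabled": False,
    "SendQuota": {"Max24HourSend": 200.0, "SentLast24Hours": 17.0},
})
print(ses_posture(sample))  # sandbox
```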

5. Escalate if it is a code issue

If certs are not writing to DynamoDB (not just email failures), this requires a Lambda code review. Do not redeploy without T-JOSH approval.

Escalation: If cert count delta is zero over 48h with active duckling registrations, escalate to Josh (T-JOSH) for code-path review.

INC-003 Agent Fleet All-Dead (P1)

All bonded spaceduck agents are reporting as dead. agents.alive = 0 and agents.dead > 0. Peck approval workflow may be impacted.

Signals

  • Agent Fleet tile: alive=0, dead=N (red pill)
  • Anomaly Summary banner: "All agents dead" alert
  • Peck requests may queue but not process if agent callbacks unreachable

1. Confirm agent state
curl -s https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status \
  | python3 -c "import sys,json; a=json.load(sys.stdin).get('agents',{}); print(a)"
2. Check last pulse events in audit
curl -s -X POST https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/audit \
  -H "Content-Type: application/json" -d '{}' \
  | python3 -c "import sys,json; entries=json.load(sys.stdin).get('entries',[]); pulses=[e for e in entries if 'pulse' in e.get('event_type','')]; print(pulses[:3] if pulses else 'No pulse events')"
3. Determine whether this is a data issue or a real outage
  • If all spaceduck agents registered but never pulsed → expected state on a fresh platform
  • If agents previously pulsed but stopped → investigate agent connectivity
  • If agents.total_bonded = 0 → no agents registered yet, not an incident
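The decision list above can be sketched as a classifier. Note `ever_pulsed` is an assumption for illustration: it stands for "any pulse events exist in the audit feed" (step 2) and is not a field in system/status.

```python
def triage_agents(total_bonded: int, alive: int, dead: int,
                  ever_pulsed: bool) -> str:
    """Classify agent-fleet state per the triage rules in step 3.

    ever_pulsed is an assumption, derived from the audit feed (step 2),
    not from the system/status payload itself.
    """
    if total_bonded == 0:
        return "not an incident: no agents registered yet"
    if alive == 0 and not ever_pulsed:
        return "expected state: registered but never pulsed"
    if alive == 0:
        return "real outage: agents pulsed then stopped, check connectivity"
    return "monitor: fleet at least partially alive"

print(triage_agents(total_bonded=3, alive=0, dead=3, ever_pulsed=True))
```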
4. Check peck impact

Dead agents may prevent peck callback delivery. Review peck_protocol.failure_breakdown.callback_error in Mission Control. If callback_error count is rising alongside agent deaths, the two are correlated.

5. No Lambda action required unless code is the cause

Agent death is typically an agent-side connectivity issue, not a Lambda/backend fault. Escalate to the spaceduck-bot team for agent restart procedures.

Escalation: If agents were alive within the last 24h and are now all dead, treat as P0 and escalate to Josh immediately.

INC-004 Peck callback_error Spike (P1)

The peck approval workflow is failing at the callback stage. peck_protocol.failure_breakdown.callback_error count is rising. Spaceducks are not receiving peck approval notifications.

Signals

  • Peck Failure Breakdown panel: callback_error chip showing red with rising count
  • Peck Failure Analysis: high failure rate, callback_error dominant
  • Agent Fleet: agents.dead > 0 (correlated)

1. Confirm callback_error count
curl -s https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status \
  | python3 -c "import sys,json; d=json.load(sys.stdin); pp=d.get('peck_protocol',{}); print('failure_breakdown:', pp.get('failure_breakdown',{}))"
2. Pull recent peck audit events
curl -s -X POST https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/audit \
  -H "Content-Type: application/json" -d '{}' \
  | python3 -c "import sys,json; entries=json.load(sys.stdin).get('entries',[]); pecks=[e for e in entries if 'peck' in e.get('event_type','').lower()]; [print(p.get('event_type'), p.get('timestamp','')) for p in pecks[:5]]"
3. Check Step Functions for FAILED executions
aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:us-east-1:121546003735:stateMachine:peck-approval-workflow \
  --status-filter FAILED \
  --region us-east-1 \
  --query 'executions[:5].{name:name,start:startDate,stop:stopDate}' --output table
4. Check whether agents are reachable

Callback errors are usually caused by the target spaceduck agent URL being unreachable. Confirm agent pulse recency (see INC-003). If all agents are dead, callback errors are expected until agents reconnect.

5. Clear stale peck requests if needed

If peck requests are stuck in RUNNING state and blocking new requests, escalate to Josh (T-JOSH) for SFN cleanup. Do not terminate executions without T-JOSH approval.

Escalation: If callback_error count exceeds 10 in any 24h window, escalate to Josh (T-JOSH). Document in GOVERNANCE-LOG.md.

INC-005 SES Quota Exhaustion (P0)

SES daily send quota has been exhausted. All outbound email (signup verification, cert delivery, password reset) is failing. While in sandbox mode, the limit is 200 emails/day.

Signals

  • SES daily quota progress bar > 95% in Mission Control
  • Sandbox Exit Readiness tile: quota at or near 200/day
  • User signups completing but verification emails not arriving
  • CloudWatch SES bounce/complaint rate rising

1. Confirm SES quota status
aws sesv2 get-account --region us-east-1 \
  --query '{SendingEnabled:SendingEnabled,DailyQuota:SendQuota.Max24HourSend,SentLast24h:SendQuota.SentLast24Hours}'
2. If quota exhausted, wait for reset

The SES sandbox quota is enforced over a rolling 24-hour window, so capacity returns gradually as older sends age out of the window. There is no manual override. Note the expected recovery time and communicate it to affected users.
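Whatever the exact reset semantics, remaining capacity is just the quota minus sends in the window. A quick sketch using the SendQuota numbers from step 1 (the values below are illustrative):

```python
def quota_remaining(max_24h: float, sent_24h: float) -> float:
    """Sends left in the current 24h window, floored at zero.

    Args mirror SendQuota.Max24HourSend and SendQuota.SentLast24Hours
    from `aws sesv2 get-account`.
    """
    return max(0.0, max_24h - sent_24h)

print(quota_remaining(200.0, 200.0))  # 0.0 -> exhausted, emails will fail
print(quota_remaining(200.0, 143.0))  # 57.0 sends left
```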

3. Request SES production access immediately

If this has happened, sandbox limits are a production blocker. Submit the AWS SES production access request now:

  • Go to: AWS SES Console → Account dashboard
  • Click "Request production access"
  • Use case: transactional email for user verification and certificate delivery
  • Expected volume: < 1,000/day initially
4. Check for bounce/complaint issues
aws cloudwatch get-metric-statistics \
  --namespace AWS/SES \
  --metric-name Bounces \
  --start-time $(date -d '24 hours ago' --utc +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date --utc +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 --statistics Sum \
  --region us-east-1

High bounce rates can trigger SES account suspension. Address immediately if elevated.
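To judge "elevated", convert the Bounces sum from the metric query above into a rate. The threshold comments are hedged from memory and should be verified against current SES documentation; the counts below are examples.

```python
def bounce_rate_pct(bounces: int, sends: int) -> float:
    """Bounce rate as a percentage of sends; 0.0 when nothing was sent."""
    return 0.0 if sends == 0 else 100.0 * bounces / sends

# Thresholds hedged from memory (verify in SES docs): AWS guidance is to
# keep bounce rate under roughly 5%; sustained double-digit rates risk a
# sending pause or account review.
print(round(bounce_rate_pct(3, 150), 1))  # 2.0 -> within guidance
```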

5. Document and escalate

Log the incident in GOVERNANCE-LOG.md with exact quota numbers and user impact estimate. Escalate to Josh (T-JOSH) for production access approval follow-up.

Escalation: SES quota exhaustion is a P0 incident if it blocks signup or cert delivery during active user onboarding. Escalate to Josh (T-JOSH) immediately. Do not wait for auto-reset.
Ops Runbook · DC-109 · Published 2026-03-22 UTC · spaceduckling.com