🦆 Ops Runbook

Space Duck · Incident response procedures

Operator Runbook

Common incident patterns, triage steps, and resolution commands for the Spaceduckling production stack.

Contents

  1. Dead Lambda — Function not responding
  2. Cert Pipeline Stall — No new certs issuing
  3. Agent Fleet All-Dead — agents.alive = 0
  4. Peck callback_error Spike
  5. SES Quota Exhaustion — Email delivery blocked

INC-001 Dead Lambda — Function Not Responding (P0)

The Lambda function powering /beak/* routes is not returning 200s. All Mission Control tiles show failure, and the API returns 5xx errors or timeouts.

Signals

  • Mission Control API card showing red / all module cards red
  • GET /beak/system/status returns 500, 502, or connection timeout
  • CloudWatch: Lambda error rate > 0 in last 5 minutes
  • Alias drift banner showing on Mission Control

1. Confirm the outage

Run a direct status probe:

curl -s -o /dev/null -w "%{http_code}" \
  https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status

Expected: 200. Anything else = Lambda down.

2. Check Lambda logs

Pull the latest CloudWatch errors:

aws logs filter-log-events \
  --log-group-name /aws/lambda/mission-control-api \
  --start-time $(date -d '15 minutes ago' +%s000) \
  --filter-pattern "ERROR" \
  --region us-east-1 \
  --query 'events[*].message' --output text | tail -20
3. Verify prod alias target
aws lambda get-alias \
  --function-name mission-control-api \
  --name prod \
  --region us-east-1

Confirm FunctionVersion points to the expected version number.
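To make the comparison mechanical, a minimal sketch that checks the get-alias output against the expected version. The payload shape matches `aws lambda get-alias` JSON; the version numbers are illustrative, and the expected value must come from DEPLOY-LOG.md.

```python
import json

def alias_drifted(get_alias_json: str, expected_version: str) -> bool:
    """True when the prod alias points at something other than the
    expected version recorded in DEPLOY-LOG.md."""
    return json.loads(get_alias_json).get("FunctionVersion") != expected_version

# Payload shaped like `aws lambda get-alias` output; values here are
# illustrative, not real deployment state.
sample = json.dumps({"Name": "prod", "FunctionVersion": "42"})
print(alias_drifted(sample, "42"))  # False -> no drift
print(alias_drifted(sample, "41"))  # True  -> roll back per step 4
```

Pipe the get-alias output into a check like this from a triage script so the drift decision is not made by eyeball.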

4. Roll back if the alias has drifted

If the alias points to a broken version, promote the last known-good version:

aws lambda update-alias \
  --function-name mission-control-api \
  --name prod \
  --function-version <PREV_VERSION> \
  --region us-east-1

Replace <PREV_VERSION> with the last working version from DEPLOY-LOG.md.

5. Verify recovery

Repeat the probe from step 1. Confirm 200. Reload Mission Control and confirm all tiles green.

Escalation: If rollback does not restore service within 5 minutes, escalate to Josh (T-JOSH). Document the incident in GOVERNANCE-LOG.md.

INC-002 Cert Pipeline Stall (P1)

New birth certificates are not being issued despite new duckling registrations. database.birth_certificates count is not growing relative to database.ducklings.

Signals

  • Cert Pipeline tile: issued count flat, pending backlog rising
  • Cert Issuance Latency tile: "Awaiting cert signal" persisting > 24h
  • Audit Activity: cert events absent from last 24h feed

1. Confirm cert counts from status
curl -s https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print('ducklings:', d['database']['ducklings'], 'certs:', d['database']['birth_certificates'])"
2. Check audit for last cert event
curl -s -X POST https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/audit \
  -H "Content-Type: application/json" -d '{}' \
  | python3 -c "import sys,json; entries=json.load(sys.stdin).get('entries',[]); certs=[e for e in entries if 'cert' in e.get('event_type','')]; print(certs[:3] if certs else 'No cert events in audit feed')"
3. Verify cert issuance Lambda path

Check Lambda logs for duck.cert_issued events or SES send errors:

aws logs filter-log-events \
  --log-group-name /aws/lambda/mission-control-api \
  --start-time $(date -d '1 hour ago' +%s000) \
  --filter-pattern "cert" \
  --region us-east-1 \
  --query 'events[*].message' --output text | head -20
4. Check SES sandbox status

If SES is in sandbox, cert delivery emails may be failing silently for unverified recipients. Confirm SES posture in the Sandbox Exit Readiness tile and review system/status.ses.
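A small sketch for reading the posture out of `aws sesv2 get-account` JSON. The `SendingEnabled` and `ProductionAccessEnabled` fields are part of the SESv2 GetAccount response; the sample values below are examples, not live account state.

```python
import json

def ses_posture(get_account_json: str) -> str:
    """Summarize SES posture from `aws sesv2 get-account` output."""
    acct = json.loads(get_account_json)
    if not acct.get("SendingEnabled", False):
        return "sending disabled"
    if not acct.get("ProductionAccessEnabled", False):
        # In sandbox, sends to unverified recipients fail.
        return "sandbox"
    return "production"

# Illustrative payload (example values, not live account state):
sample = json.dumps({
    "SendingEnabled": True,
    "ProductionAccessEnabled": False,
    "SendQuota": {"Max24HourSend": 200.0, "SentLast24Hours": 17.0},
})
print(ses_posture(sample))  # sandbox
```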

5. Escalate if it is a code issue

If certs are not writing to DynamoDB (not just email failures), this requires a Lambda code review. Do not redeploy without T-JOSH approval.

Escalation: If cert count delta is zero over 48h with active duckling registrations, escalate to Josh (T-JOSH) for code-path review.

INC-003 Agent Fleet All-Dead (P1)

All bonded spaceduck agents are reporting as dead. agents.alive = 0 and agents.dead > 0. Peck approval workflow may be impacted.

Signals

  • Agent Fleet tile: alive=0, dead=N (red pill)
  • Anomaly Summary banner: "All agents dead" alert
  • Peck requests may queue but not process if agent callbacks unreachable

1. Confirm agent state
curl -s https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status \
  | python3 -c "import sys,json; a=json.load(sys.stdin).get('agents',{}); print(a)"
2. Check last pulse events in audit
curl -s -X POST https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/audit \
  -H "Content-Type: application/json" -d '{}' \
  | python3 -c "import sys,json; entries=json.load(sys.stdin).get('entries',[]); pulses=[e for e in entries if 'pulse' in e.get('event_type','')]; print(pulses[:3] if pulses else 'No pulse events')"
3. Determine whether this is a data issue or a real outage
  • If all spaceduck agents registered but never pulsed → expected state on a fresh platform
  • If agents previously pulsed but stopped → investigate agent connectivity
  • If agents.total_bonded = 0 → no agents registered yet, not an incident
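The decision list above can be sketched as a classifier. Note `ever_pulsed` is an assumption for illustration: it stands for "any pulse events exist in the audit feed" (step 2) and is not a field in system/status.

```python
def triage_agents(total_bonded: int, alive: int, dead: int,
                  ever_pulsed: bool) -> str:
    """Classify agent-fleet state per the triage rules in step 3.

    ever_pulsed is an assumption, derived from the audit feed (step 2),
    not from the system/status payload itself.
    """
    if total_bonded == 0:
        return "not an incident: no agents registered yet"
    if alive == 0 and not ever_pulsed:
        return "expected state: registered but never pulsed"
    if alive == 0:
        return "real outage: agents pulsed then stopped, check connectivity"
    return "monitor: fleet at least partially alive"

print(triage_agents(total_bonded=3, alive=0, dead=3, ever_pulsed=True))
```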
4. Check peck impact

Dead agents may prevent peck callback delivery. Review peck_protocol.failure_breakdown.callback_error in Mission Control. If callback_error count is rising alongside agent deaths, the two are correlated.

5. No Lambda action required unless code is the cause

Agent death is typically an agent-side connectivity issue, not a Lambda/backend fault. Escalate to the spaceduck-bot team for agent restart procedures.

Escalation: If agents were alive within the last 24h and are now all dead, treat as P0 and escalate to Josh immediately.

INC-004 Peck callback_error Spike (P1)

The peck approval workflow is failing at the callback stage. peck_protocol.failure_breakdown.callback_error count is rising. Spaceducks are not receiving peck approval notifications.

Signals

  • Peck Failure Breakdown panel: callback_error chip showing red with rising count
  • Peck Failure Analysis: high failure rate, callback_error dominant
  • Agent Fleet: agents.dead > 0 (correlated)

1. Confirm callback_error count
curl -s https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status \
  | python3 -c "import sys,json; d=json.load(sys.stdin); pp=d.get('peck_protocol',{}); print('failure_breakdown:', pp.get('failure_breakdown',{}))"
2. Pull recent peck audit events
curl -s -X POST https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/audit \
  -H "Content-Type: application/json" -d '{}' \
  | python3 -c "import sys,json; entries=json.load(sys.stdin).get('entries',[]); pecks=[e for e in entries if 'peck' in e.get('event_type','').lower()]; [print(p.get('event_type'), p.get('timestamp','')) for p in pecks[:5]]"
3. Check Step Functions for FAILED executions
aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:us-east-1:121546003735:stateMachine:peck-approval-workflow \
  --status-filter FAILED \
  --region us-east-1 \
  --query 'executions[:5].{name:name,start:startDate,stop:stopDate}' --output table
4. Check whether agents are reachable

Callback errors are usually caused by the target spaceduck agent URL being unreachable. Confirm agent pulse recency (see INC-003). If all agents are dead, callback errors are expected until agents reconnect.

5. Clear stale peck requests if needed

If peck requests are stuck in RUNNING state and blocking new requests, escalate to Josh (T-JOSH) for SFN cleanup. Do not terminate executions without T-JOSH approval.

Escalation: If callback_error count exceeds 10 in any 24h window, escalate to Josh (T-JOSH). Document in GOVERNANCE-LOG.md.

INC-005 SES Quota Exhaustion (P0)

SES daily send quota has been exhausted. All outbound email (signup verification, cert delivery, password reset) is failing. While in sandbox mode, the limit is 200 emails/day.

Signals

  • SES daily quota progress bar > 95% in Mission Control
  • Sandbox Exit Readiness tile: quota at or near 200/day
  • User signups completing but verification emails not arriving
  • CloudWatch SES bounce/complaint rate rising

1. Confirm SES quota status
aws sesv2 get-account --region us-east-1 \
  --query '{SendingEnabled:SendingEnabled,DailyQuota:SendQuota.Max24HourSend,SentLast24h:SendQuota.SentLast24Hours}'
2. If quota exhausted, wait for reset

The SES sandbox quota is enforced over a rolling 24-hour window, so capacity returns gradually as older sends age out of the window. There is no manual override. Note the expected recovery time and communicate it to affected users.
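Whatever the exact reset semantics, remaining capacity is just the quota minus sends in the window. A quick sketch using the SendQuota numbers from step 1 (the values below are illustrative):

```python
def quota_remaining(max_24h: float, sent_24h: float) -> float:
    """Sends left in the current 24h window, floored at zero.

    Args mirror SendQuota.Max24HourSend and SendQuota.SentLast24Hours
    from `aws sesv2 get-account`.
    """
    return max(0.0, max_24h - sent_24h)

print(quota_remaining(200.0, 200.0))  # 0.0 -> exhausted, emails will fail
print(quota_remaining(200.0, 143.0))  # 57.0 sends left
```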

3. Request SES production access immediately

If this has happened, sandbox limits are a production blocker. Submit the AWS SES production access request now:

  • Go to: AWS SES Console → Account dashboard
  • Click "Request production access"
  • Use case: transactional email for user verification and certificate delivery
  • Expected volume: < 1,000/day initially
4. Check for bounce/complaint issues
aws cloudwatch get-metric-statistics \
  --namespace AWS/SES \
  --metric-name Bounces \
  --start-time $(date -d '24 hours ago' --utc +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date --utc +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 --statistics Sum \
  --region us-east-1

High bounce rates can trigger SES account suspension. Address immediately if elevated.
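To judge "elevated", convert the Bounces sum from the metric query above into a rate. The threshold comments are hedged from memory and should be verified against current SES documentation; the counts below are examples.

```python
def bounce_rate_pct(bounces: int, sends: int) -> float:
    """Bounce rate as a percentage of sends; 0.0 when nothing was sent."""
    return 0.0 if sends == 0 else 100.0 * bounces / sends

# Thresholds hedged from memory (verify in SES docs): AWS guidance is to
# keep bounce rate under roughly 5%; sustained double-digit rates risk a
# sending pause or account review.
print(round(bounce_rate_pct(3, 150), 1))  # 2.0 -> within guidance
```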

5. Document and escalate

Log the incident in GOVERNANCE-LOG.md with exact quota numbers and user impact estimate. Escalate to Josh (T-JOSH) for production access approval follow-up.

Escalation: SES quota exhaustion is a P0 incident if it blocks signup or cert delivery during active user onboarding. Escalate to Josh (T-JOSH) immediately. Do not wait for auto-reset.
Ops Runbook · DC-109 · Published 2026-03-22 UTC · spaceduckling.com