Common incident patterns, triage steps, and resolution commands for the Spaceduckling production stack.
## INC-001: /beak/* Lambda down

The Lambda function powering /beak/* routes is not returning 200s. All Mission Control tiles show failure; the API returns 5xx or timeouts.
GET /beak/system/status returns 500, 502, or a connection timeout.

Run a direct status probe:

```bash
curl -s -o /dev/null -w "%{http_code}" \
  https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status
```

Expected: 200. Anything else means the Lambda is down.
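For repeated checks during an incident, the probe can be scripted. A minimal sketch; the endpoint URL is the one above, while the `probe`/`lambda_is_up` helpers and the timeout value are illustrative:

```python
import urllib.request
import urllib.error

STATUS_URL = ("https://czt9d57q83.execute-api.us-east-1.amazonaws.com"
              "/prod/beak/system/status")

def probe(url=STATUS_URL, timeout=5):
    """Return the HTTP status code for one probe, or None on timeout/connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code          # 5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None            # DNS failure, refused connection, or timeout

def lambda_is_up(codes):
    """Healthy only if at least one probe ran and every probe returned 200."""
    return bool(codes) and all(c == 200 for c in codes)
```

Run `probe()` a few times and feed the results to `lambda_is_up` before declaring the incident resolved.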
Pull the latest CloudWatch errors:
```bash
aws logs filter-log-events \
  --log-group-name /aws/lambda/mission-control-api \
  --start-time $(date -d '15 minutes ago' +%s000) \
  --filter-pattern "ERROR" \
  --region us-east-1 \
  --query 'events[*].message' --output text | tail -20
```
Check the prod alias:

```bash
aws lambda get-alias \
  --function-name mission-control-api \
  --name prod \
  --region us-east-1
```
Confirm FunctionVersion points to the expected version number.
If the alias points to a broken version, promote the last known-good version:
```bash
aws lambda update-alias \
  --function-name mission-control-api \
  --name prod \
  --function-version <PREV_VERSION> \
  --region us-east-1
```
Replace <PREV_VERSION> with the last working version from DEPLOY-LOG.md.
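Picking `<PREV_VERSION>` can be automated if DEPLOY-LOG.md entries record a version and a status. This sketch assumes a hypothetical entry shape (`version`, `status` keys, newest entry last) — the actual DEPLOY-LOG.md layout may differ:

```python
def last_known_good(entries):
    """Return the most recent version marked 'ok', scanning newest-first.

    `entries` is assumed newest-last, each a dict like
    {"version": "42", "status": "ok"} -- a hypothetical format.
    """
    for entry in reversed(entries):
        if entry.get("status") == "ok":
            return entry["version"]
    return None

# Example: version 42 was the last healthy deploy before broken 43.
deploys = [
    {"version": "41", "status": "ok"},
    {"version": "42", "status": "ok"},
    {"version": "43", "status": "broken"},
]
```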
Repeat the direct status probe above and confirm a 200. Reload Mission Control and confirm all tiles are green.
Log the incident in GOVERNANCE-LOG.md.

## INC-002: Birth certificates not being issued

New birth certificates are not being issued despite new duckling registrations. The `database.birth_certificates` count is not growing relative to `database.ducklings`.
Compare the counts:

```bash
curl -s https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print('ducklings:', d['database']['ducklings'], 'certs:', d['database']['birth_certificates'])"
```

Pull recent cert events from the audit feed:

```bash
curl -s -X POST https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/audit \
  -H "Content-Type: application/json" -d '{}' \
  | python3 -c "import sys,json; entries=json.load(sys.stdin).get('entries',[]); certs=[e for e in entries if 'cert' in e.get('event_type','')]; print(certs[:3] if certs else 'No cert events in audit feed')"
```

Check Lambda logs for duck.cert_issued events or SES send errors:
```bash
aws logs filter-log-events \
  --log-group-name /aws/lambda/mission-control-api \
  --start-time $(date -d '1 hour ago' +%s000) \
  --filter-pattern "cert" \
  --region us-east-1 \
  --query 'events[*].message' --output text | head -20
```
If SES is in sandbox, cert delivery emails may be failing silently for unverified recipients. Confirm SES posture in the Sandbox Exit Readiness tile and review system/status.ses.
If certs are not writing to DynamoDB (not just email failures), this requires a Lambda code review. Do not redeploy without T-JOSH approval.
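The duckling/cert comparison above can be reduced to a helper that flags a growing gap. Field names follow the `system/status` payload used earlier; the gap threshold is an assumption:

```python
def cert_backlog(status, max_gap=0):
    """Return (gap, stalled) given a parsed /beak/system/status payload.

    gap = ducklings - birth_certificates; stalled when gap exceeds max_gap.
    """
    db = status.get("database", {})
    gap = db.get("ducklings", 0) - db.get("birth_certificates", 0)
    return gap, gap > max_gap

# Illustrative payload: 3 ducklings registered without certs.
status = {"database": {"ducklings": 120, "birth_certificates": 117}}
```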
## INC-003: All bonded agents reporting dead

All bonded spaceduck agents are reporting as dead: `agents.alive = 0` and `agents.dead > 0`. The peck approval workflow may be impacted.
Check agent state:

```bash
curl -s https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status \
  | python3 -c "import sys,json; a=json.load(sys.stdin).get('agents',{}); print(a)"
```

Pull recent pulse events from the audit feed:

```bash
curl -s -X POST https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/audit \
  -H "Content-Type: application/json" -d '{}' \
  | python3 -c "import sys,json; entries=json.load(sys.stdin).get('entries',[]); pulses=[e for e in entries if 'pulse' in e.get('event_type','')]; print(pulses[:3] if pulses else 'No pulse events')"
```

Note: `agents.total_bonded = 0` means no agents have registered yet; that is not an incident.

Dead agents may prevent peck callback delivery. Review `peck_protocol.failure_breakdown.callback_error` in Mission Control. If the callback_error count is rising alongside agent deaths, the two are correlated.
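Pulse recency can be checked directly from audit entries. The 10-minute liveness window and the ISO 8601 `timestamp` field format are assumptions, not confirmed payload details:

```python
from datetime import datetime, timedelta, timezone

PULSE_WINDOW = timedelta(minutes=10)  # assumed liveness window

def stale_pulse(entry, now, window=PULSE_WINDOW):
    """True if the pulse event's timestamp is older than the window."""
    ts = datetime.fromisoformat(entry["timestamp"])
    return now - ts > window

# Illustrative entries against a fixed clock.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = {"event_type": "agent.pulse", "timestamp": "2024-01-01T11:55:00+00:00"}
stale = {"event_type": "agent.pulse", "timestamp": "2024-01-01T11:30:00+00:00"}
```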
Agent death is typically an agent-side connectivity issue, not a Lambda/backend fault. Escalate to the spaceduck-bot team for agent restart procedures.
## INC-004: Peck callback errors rising

The peck approval workflow is failing at the callback stage. The `peck_protocol.failure_breakdown.callback_error` count is rising, and spaceducks are not receiving peck approval notifications.
Check the failure breakdown:

```bash
curl -s https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/system/status \
  | python3 -c "import sys,json; d=json.load(sys.stdin); pp=d.get('peck_protocol',{}); print('failure_breakdown:', pp.get('failure_breakdown',{}))"
```

List recent peck events from the audit feed:

```bash
curl -s -X POST https://czt9d57q83.execute-api.us-east-1.amazonaws.com/prod/beak/audit \
  -H "Content-Type: application/json" -d '{}' \
  | python3 -c "import sys,json; entries=json.load(sys.stdin).get('entries',[]); pecks=[e for e in entries if 'peck' in e.get('event_type','').lower()]; [print(p.get('event_type'), p.get('timestamp','')) for p in pecks[:5]]"
```

List recent failed Step Functions executions:

```bash
aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:us-east-1:121546003735:stateMachine:peck-approval-workflow \
  --status-filter FAILED \
  --region us-east-1 \
  --query 'executions[:5].{name:name,start:startDate,stop:stopDate}' --output table
```

Callback errors are usually caused by the target spaceduck agent URL being unreachable. Confirm agent pulse recency (see INC-003). If all agents are dead, callback errors are expected until agents reconnect.
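To see at a glance which failure mode dominates, the `failure_breakdown` map can be sorted by count. The counter values shown are illustrative:

```python
def dominant_failure(breakdown):
    """Return the (cause, count) pair with the highest count, or None if empty."""
    if not breakdown:
        return None
    return max(breakdown.items(), key=lambda kv: kv[1])

# Illustrative breakdown from system/status.
breakdown = {"callback_error": 14, "timeout": 2, "rejected": 1}
```

If `callback_error` dominates, follow the agent-connectivity path above before suspecting the workflow itself.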
If peck requests are stuck in RUNNING state and blocking new requests, escalate to Josh (T-JOSH) for SFN cleanup. Do not terminate executions without T-JOSH approval.
Log the incident in GOVERNANCE-LOG.md.

## INC-005: SES daily send quota exhausted

The SES daily send quota has been exhausted. All outbound email (signup verification, cert delivery, password reset) is failing. While in sandbox mode, the limit is 200 emails/day.
Check quota usage:

```bash
aws sesv2 get-account --region us-east-1 \
  --query '{SendingEnabled:SendingEnabled,DailyQuota:SendQuota.Max24HourSend,SentLast24h:SendQuota.SentLast24Hours}'
```

The SES sandbox quota resets every 24 hours; there is no manual override. Note the reset time and communicate the expected restoration window to affected users.
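Remaining quota can be computed from the `get-account` output; the field names here match the `--query` aliases used above:

```python
def quota_remaining(account):
    """Return (remaining, exhausted) from the aliased sesv2 get-account output."""
    remaining = account["DailyQuota"] - account["SentLast24h"]
    return remaining, remaining <= 0

# Illustrative output: sandbox cap of 200 fully consumed.
account = {"SendingEnabled": True, "DailyQuota": 200.0, "SentLast24h": 200.0}
```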
If this has happened, sandbox limits are a production blocker. Submit the AWS SES production access request now.

Separately, check bounce metrics for the last 24 hours:

```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/SES \
  --metric-name Bounces \
  --start-time $(date -d '24 hours ago' --utc +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date --utc +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 --statistics Sum \
  --region us-east-1
```
High bounce rates can trigger SES account suspension. Address immediately if elevated.
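The bounce rate can be computed from the CloudWatch sums alongside a send count. The 5% alarm threshold here is an assumption based on common SES reputation guidance, not a documented hard limit:

```python
def bounce_rate(bounces, sends):
    """Return bounces/sends as a fraction; 0.0 when nothing was sent."""
    return bounces / sends if sends else 0.0

def elevated(bounces, sends, threshold=0.05):
    """Flag a bounce rate above the (assumed) 5% threshold."""
    return bounce_rate(bounces, sends) > threshold
```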
Log the incident in GOVERNANCE-LOG.md with exact quota numbers and user impact estimate. Escalate to Josh (T-JOSH) for production access approval follow-up.