Operations Guide
Common issues, diagnostic procedures, and recovery steps for Synaps deployments on Kubernetes.
Quick Status Check
# Check all pods
kubectl get pods -n synaps-runtime
# Check control API health
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
wget -qO- http://localhost:8081/healthz
# Check queue depth
# (sh -c makes $DATABASE_URL expand inside the container, not in your local shell)
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
"SELECT state, COUNT(*) FROM queue_item GROUP BY state;"
# Check worker registrations
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
"SELECT worker_id, last_seen_at, queue_name \
FROM worker_registration ORDER BY last_seen_at DESC LIMIT 10;"
# Check recent failed runs
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
"SELECT id, status, objective, created_at FROM run \
WHERE status = 'failed' ORDER BY created_at DESC LIMIT 10;"
Common Issues
Deployment failures
Helm install fails, migration jobs error, or pods never start.
Check that the required secrets exist, then review migration job logs,
node resources, and image pull status. Roll back with helm rollback synaps <revision>.
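The checks above can be sketched as a small helper. This is an illustration rather than an official script: the namespace and release name come from this guide, the migration job name is a placeholder you need to look up, and kubectl top requires metrics-server in the cluster.

```shell
NS=synaps-runtime   # namespace used throughout this guide

diagnose_deploy() {
  # $1: migration job name (hypothetical; find it in the job listing below)
  kubectl get secrets -n "$NS"                 # are the expected secrets present?
  kubectl get jobs -n "$NS"                    # lists jobs, including migrations
  if [ -n "$1" ]; then
    kubectl logs "job/$1" -n "$NS" --tail=50   # migration output
  fi
  # Recent events surface scheduling failures and image pull errors.
  kubectl get events -n "$NS" --sort-by=.lastTimestamp | tail -n 20
  kubectl top nodes                            # node CPU and memory (needs metrics-server)
  helm history synaps -n "$NS"                 # pick the revision to roll back to
}
```

Run diagnose_deploy without arguments first to see the job list, then pass the migration job name to read its logs.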
Control API unresponsive
The API returns 500 errors or times out. Check for pod restarts (OOMKilled?), DB connection pool exhaustion, and lost NATS connections. Restart the control API deployment to clear stuck connections.
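A minimal sketch of those checks follows. It assumes the control API image ships psql and sh, as the commands in Quick Status Check imply, and that the control API logs mention NATS by name when the connection drops. The pg_stat_activity view is standard PostgreSQL.

```shell
NS=synaps-runtime   # namespace used throughout this guide

diagnose_control_api() {
  # Pod name and last termination reason (for example OOMKilled) per pod.
  kubectl get pods -n "$NS" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
  # Logs from the previous container instance; errors if there was no restart.
  kubectl logs deployment/synaps-control-api -n "$NS" --previous --tail=50
  # Connection counts by state; a large total near the server limit
  # suggests the pool is exhausted. sh -c expands $DATABASE_URL in the container.
  kubectl exec -n "$NS" deployment/synaps-control-api -- \
    sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
    "SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;"
  # Recent log lines that mention NATS (log wording is an assumption).
  kubectl logs deployment/synaps-control-api -n "$NS" --since=15m | grep -i nats || true
}
```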
Stale workers
Queue depth grows but nothing is processed. Check worker pod status
(crashed? OOMKilled?), DB connectivity, and NATS subscription state.
Restart workers with kubectl rollout restart deployment/synaps-demo-worker -n synaps-runtime, or scale the deployment up.
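One way to confirm that workers have gone stale is to look for registrations with an old last_seen_at. The sketch below assumes last_seen_at is a heartbeat timestamp. The 5-minute threshold is an arbitrary choice for illustration.

```shell
NS=synaps-runtime   # namespace used throughout this guide

diagnose_workers() {
  kubectl get pods -n "$NS"                                       # status and restart counts
  kubectl logs deployment/synaps-demo-worker -n "$NS" --tail=50   # recent worker output
  # Workers not seen in the last 5 minutes (threshold is an assumption).
  kubectl exec -n "$NS" deployment/synaps-control-api -- \
    sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
    "SELECT worker_id, queue_name, last_seen_at FROM worker_registration \
     WHERE last_seen_at < NOW() - INTERVAL '5 minutes' ORDER BY last_seen_at;"
}

scale_workers() {
  # $1: desired replica count
  kubectl scale deployment/synaps-demo-worker -n "$NS" --replicas="$1"
}
```

For example, scale_workers 3 sets the worker deployment to three replicas.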
Provider errors
OpenAI or Kimi returns 401, 429, or a region-rejection error. Verify the API
keys in the synaps-provider-env secret. Check worker logs for
rate-limit or region errors. Update credentials and restart workers.
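The steps above might look like the following sketch. The secret and deployment names come from this guide; the key name inside the secret (for example OPENAI_API_KEY) is hypothetical, so check the describe output for the real one. The patch builds JSON by string interpolation, so it only suits values without quotes or backslashes.

```shell
NS=synaps-runtime   # namespace used throughout this guide

check_provider_errors() {
  # Shows key names and sizes in the secret without printing the values.
  kubectl describe secret synaps-provider-env -n "$NS"
  # Recent worker log lines that look like auth, rate-limit, or region failures.
  kubectl logs deployment/synaps-demo-worker -n "$NS" --since=1h \
    | grep -Ei '401|429|rate.?limit|region' || true
}

rotate_provider_key() {
  # $1: key name in the secret (hypothetical, e.g. OPENAI_API_KEY)
  # $2: new credential value
  kubectl patch secret synaps-provider-env -n "$NS" --type merge \
    -p "{\"stringData\":{\"$1\":\"$2\"}}"
  # Restart workers so they pick up the new credential.
  kubectl rollout restart deployment/synaps-demo-worker -n "$NS"
}
```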
Queue depth buildup
Items accumulate in the queued state. Scale workers horizontally
or check for stuck leases. Reset items that have been leased for more
than 10 minutes if their worker is dead.
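Before resetting anything, it can help to see which workers hold the old leases and when each was last seen. The query below is a sketch that assumes queue_item.leased_by stores the same identifier as worker_registration.worker_id and that each worker has a single registration row. Neither is confirmed by this guide.

```shell
NS=synaps-runtime   # namespace used throughout this guide

list_stuck_leases() {
  # Groups leases older than 10 minutes by the worker holding them.
  # A NULL or old last_seen_at suggests the worker is dead.
  kubectl exec -n "$NS" deployment/synaps-control-api -- \
    sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
    "SELECT q.leased_by, COUNT(*) AS items, MIN(q.leased_at) AS oldest_lease, w.last_seen_at
       FROM queue_item q
       LEFT JOIN worker_registration w ON w.worker_id = q.leased_by
      WHERE q.state = 'leased' AND NOW() - q.leased_at > INTERVAL '10 minutes'
      GROUP BY q.leased_by, w.last_seen_at
      ORDER BY oldest_lease;"
}
```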
Recovery Procedures
Restart a component
# Control API
kubectl rollout restart deployment/synaps-control-api -n synaps-runtime
# Worker
kubectl rollout restart deployment/synaps-demo-worker -n synaps-runtime
# PostgreSQL (causes downtime)
kubectl rollout restart statefulset/synaps-postgresql -n synaps-runtime
# NATS (causes downtime)
kubectl rollout restart statefulset/synaps-nats -n synaps-runtime
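After a restart, kubectl rollout status blocks until the new pods are ready or the timeout expires. A small wrapper, with a timeout value chosen only for illustration:

```shell
NS=synaps-runtime   # namespace used throughout this guide

wait_for_rollout() {
  # $1: resource, e.g. deployment/synaps-control-api or statefulset/synaps-nats
  # Exits non-zero if the rollout is not complete within the timeout.
  kubectl rollout status "$1" -n "$NS" --timeout=180s
}
```

For example, wait_for_rollout deployment/synaps-control-api waits for the control API restart to finish.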
Reset stuck queue items
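# Reset all items leased more than 10 minutes ago.
# This does not check worker liveness, so confirm the leasing worker is dead first.
# (sh -c makes $DATABASE_URL expand inside the container, not in your local shell)
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
sh -c 'psql "$DATABASE_URL" -c "$1"' sh \
"UPDATE queue_item SET state = 'queued', leased_at = NULL, leased_by = NULL \
WHERE state = 'leased' AND NOW() - leased_at > INTERVAL '10 minutes';"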
For the complete operations runbook, see the full troubleshooting guide in the repository.