Operations Guide

Common issues, diagnostic procedures, and recovery steps for Synaps deployments on Kubernetes.

Quick Status Check

# Check all pods
kubectl get pods -n synaps-runtime

# Check control API health
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
  wget -qO- http://localhost:8081/healthz

# Check queue depth
# ("$DATABASE_URL" is expanded by your local shell before kubectl runs, so it must be set there)
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
  psql "$DATABASE_URL" -c \
  "SELECT state, COUNT(*) FROM queue_item GROUP BY state;"

# Check worker registrations
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
  psql "$DATABASE_URL" -c \
  "SELECT worker_id, last_seen_at, queue_name \
   FROM worker_registration ORDER BY last_seen_at DESC LIMIT 10;"

# Check recent failed runs
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
  psql "$DATABASE_URL" -c \
  "SELECT id, status, objective, created_at FROM run \
   WHERE status = 'failed' ORDER BY created_at DESC LIMIT 10;"
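
To narrow the pod listing to the pods that need attention, a field selector can hide the healthy ones. This is a minimal sketch using standard kubectl flags; the function name unhealthy_pods is illustrative and not part of Synaps.

```shell
# List pods in synaps-runtime that are neither Running nor Succeeded.
# (Succeeded covers completed one-shot pods such as finished jobs.)
# unhealthy_pods is a hypothetical helper name, not a Synaps command.
unhealthy_pods() {
  kubectl get pods -n synaps-runtime \
    --field-selector=status.phase!=Running,status.phase!=Succeeded
}
```

A crash-looping pod usually still reports the Running phase, so an empty result here does not replace a look at the RESTARTS column of the full listing.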

Common Issues

Deployment failures

Helm install fails, migration jobs error out, or pods never start. Check that the required secrets exist, then review migration job logs, node resources, and image pull status. Roll back with helm rollback synaps <revision>.
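
The checks above can be sketched as a single helper. It assumes the Helm release is named synaps (as in the rollback command) and was installed into the synaps-runtime namespace; diagnose_deployment is a hypothetical name used only for illustration.

```shell
# Hypothetical helper that walks the deployment checks in order.
# Assumption: release "synaps" lives in the synaps-runtime namespace.
diagnose_deployment() {
  # 1. Secrets the pods reference must exist before they can start.
  kubectl get secrets -n synaps-runtime
  # 2. Migration jobs and their completion status.
  kubectl get jobs -n synaps-runtime
  # 3. Recent events surface image pull errors and scheduling failures.
  kubectl get events -n synaps-runtime --sort-by=.lastTimestamp
  # 4. Node capacity, allocated resources, and pressure conditions.
  kubectl describe nodes
  # 5. Release history shows the revision number to roll back to.
  helm history synaps -n synaps-runtime
}
```

Logs for a specific migration job can then be read with kubectl logs job/<job-name> -n synaps-runtime, using a job name taken from step 2.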

Control API unresponsive

The API returns 500 errors or requests time out. Check for pod restarts (for example, OOMKilled), database connection pool exhaustion, and lost NATS connections. Restart the control API deployment to clear stuck connections.
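
One way to run these three checks is sketched below. The helper names are hypothetical, the psql call mirrors the invocation used in Quick Status Check, and the assumption that NATS problems appear in the control API logs with the word "nats" is a guess.

```shell
# Restart count and last termination reason (e.g. OOMKilled) per pod.
pod_restarts() {
  kubectl get pods -n synaps-runtime \
    -o 'custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason'
}

# Connection counts by state from PostgreSQL's pg_stat_activity view.
# A total close to the pool or server limit suggests pool exhaustion.
db_connections() {
  kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
    psql "$DATABASE_URL" -c \
    "SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;"
}

# Control API log lines from the last 15 minutes that mention NATS.
# Assumption: connection problems are logged with the word "nats".
nats_log_lines() {
  kubectl logs deployment/synaps-control-api -n synaps-runtime --since=15m \
    | grep -i nats
}
```

The first column pair only inspects the first container in each pod, which is enough for single-container pods.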

Stale workers

Queue depth grows but no items are processed. Check worker pod status (crashed or OOMKilled), DB connectivity, and NATS subscription state. Scale up or restart workers: kubectl rollout restart deployment/synaps-demo-worker -n synaps-runtime.
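
Stale registrations can be spotted by comparing last_seen_at against a cutoff. The sketch below reuses the worker_registration columns from Quick Status Check; the stale_workers name and the 5-minute default are assumptions, since the guide does not state a heartbeat interval.

```shell
# List workers whose last heartbeat is older than N minutes (default 5).
# stale_workers and the default threshold are hypothetical.
stale_workers() {
  minutes="${1:-5}"
  # Accept digits only, because the value is interpolated into SQL.
  case "$minutes" in
    ''|*[!0-9]*) echo "usage: stale_workers [minutes]" >&2; return 1 ;;
  esac
  kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
    psql "$DATABASE_URL" -c \
    "SELECT worker_id, queue_name, last_seen_at \
     FROM worker_registration \
     WHERE last_seen_at < NOW() - INTERVAL '${minutes} minutes' \
     ORDER BY last_seen_at;"
}
```

For example, stale_workers 15 lists workers that have not checked in for at least 15 minutes.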

Provider errors

OpenAI or Kimi returns 401 (unauthorized), 429 (rate limited), or a region rejection. Verify the API keys in the synaps-provider-env secret. Check worker logs for rate-limit or region errors. Update the credentials and restart the workers.
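
A sketch of the two inspection steps follows. The helper names are hypothetical, and the grep patterns are guesses at how the workers log provider failures, so expect to adjust them.

```shell
# Show the key names and sizes stored in the provider secret.
# kubectl describe does not print the secret values themselves.
provider_secret_keys() {
  kubectl describe secret synaps-provider-env -n synaps-runtime
}

# Filter the last 30 minutes of worker logs for auth, rate-limit,
# and region errors. Patterns are loose and may match unrelated lines.
# Note: "kubectl logs deployment/..." reads from a single pod only.
provider_errors() {
  kubectl logs deployment/synaps-demo-worker -n synaps-runtime --since=30m \
    | grep -iE '401|429|rate.?limit|region'
}
```

After updating the secret, the worker restart command from Recovery Procedures picks up the new values, assuming the workers read them at startup.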

Queue depth buildup

Items accumulate in the queued state. Scale workers horizontally, or check for a stuck lease. If the worker holding a lease is dead, reset its leased items once they are older than 10 minutes.
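
Both options can be sketched with standard kubectl and the queue_item columns used elsewhere in this guide. The helper names are hypothetical, and the replica count is left to the operator.

```shell
# Scale the worker deployment to the given number of replicas.
scale_workers() {
  case "$1" in
    ''|*[!0-9]*) echo "usage: scale_workers <replicas>" >&2; return 1 ;;
  esac
  kubectl scale deployment/synaps-demo-worker -n synaps-runtime --replicas="$1"
}

# Summarise leases older than 10 minutes, grouped by the holder.
# Read-only: use it to decide whether a reset is warranted.
stuck_leases() {
  kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
    psql "$DATABASE_URL" -c \
    "SELECT leased_by, COUNT(*) AS items, MIN(leased_at) AS oldest_lease \
     FROM queue_item \
     WHERE state = 'leased' AND NOW() - leased_at > INTERVAL '10 minutes' \
     GROUP BY leased_by;"
}
```

Comparing the leased_by values against the worker_registration listing shows whether the holders are still reporting in.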

Recovery Procedures

Restart a component

# Control API
kubectl rollout restart deployment/synaps-control-api -n synaps-runtime

# Worker
kubectl rollout restart deployment/synaps-demo-worker -n synaps-runtime

# PostgreSQL (causes downtime)
kubectl rollout restart statefulset/synaps-postgresql -n synaps-runtime

# NATS (causes downtime)
kubectl rollout restart statefulset/synaps-nats -n synaps-runtime
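
kubectl rollout restart returns immediately, so it can help to wait for the rollout to finish before running further checks. A minimal sketch, with restart_and_wait as a hypothetical helper name and the 180-second timeout as an arbitrary choice:

```shell
# Restart a workload and block until the rollout completes or times out.
# Usage: restart_and_wait deployment/synaps-control-api
restart_and_wait() {
  target="$1"
  if [ -z "$target" ]; then
    echo "usage: restart_and_wait <kind/name>" >&2
    return 1
  fi
  kubectl rollout restart "$target" -n synaps-runtime &&
    kubectl rollout status "$target" -n synaps-runtime --timeout=180s
}
```

The same helper works for the statefulset targets listed above, since kubectl rollout status supports both deployments and statefulsets.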

Reset stuck queue items

# Reset all leased items older than 10 minutes
kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
  psql "$DATABASE_URL" -c \
  "UPDATE queue_item SET state = 'queued', leased_at = NULL, leased_by = NULL \
   WHERE state = 'leased' AND NOW() - leased_at > INTERVAL '10 minutes';"
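
The statement above requeues every lease older than 10 minutes, including leases held by workers that are alive but slow. A narrower variant is sketched below; it assumes leased_by stores the same identifier as worker_registration.worker_id, which this guide does not confirm, and reset_worker_leases is a hypothetical name.

```shell
# Requeue aged leases held by one specific worker only.
# Assumption: queue_item.leased_by holds the worker's worker_id as text.
reset_worker_leases() {
  worker="$1"
  # Allow a conservative character set, since the value goes into SQL.
  case "$worker" in
    ''|*[!A-Za-z0-9._-]*)
      echo "usage: reset_worker_leases <worker-id>" >&2
      return 1 ;;
  esac
  kubectl exec -n synaps-runtime deployment/synaps-control-api -- \
    psql "$DATABASE_URL" -c \
    "UPDATE queue_item SET state = 'queued', leased_at = NULL, leased_by = NULL \
     WHERE state = 'leased' AND leased_by = '${worker}' \
       AND NOW() - leased_at > INTERVAL '10 minutes';"
}
```

psql prints an UPDATE <count> line on success, which gives a quick check of how many items were requeued.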

For the complete operations runbook, see the full troubleshooting guide in the repository.