SentinAI Docs

SentinAI Q1 Operations Runbook (Proposal 22-24)

guide/agentic-q1-operations-runbook.md

SentinAI Q1 Operations Runbook (Proposal 22-24)

Scope

Guardian Agent v2 cycle operation (observe -> detect -> analyze -> plan -> act -> verify)
Agent memory / decision trace inspection and replay
Adaptive model routing policy operation and rollback

1. Guardian Agent v2

1.1 Health Checks

Check Agent loop status

curl -s http://localhost:3002/api/agent-loop | jq '.scheduler, .lastCycle.phase, .lastCycle.decisionId'

Check phase trace

curl -s http://localhost:3002/api/agent-loop | jq '.lastCycle.phaseTrace'

Check verification

curl -s http://localhost:3002/api/agent-loop | jq '.lastCycle.verification'

1.2 Degraded Incident Handling

Check whether degraded

curl -s http://localhost:3002/api/agent-loop | jq '.lastCycle.degraded'

When observe-fallback:last-safe-metrics occurs

First, check the L2 RPC status.
If recently collected metrics are stale, check the seed/live data sources.

When act-failed:* occurs

Check orchestrator (k8s/docker) permissions/connectivity.
Compare the verify result with the actual vCPU state.

2. Agent Memory / Decision Trace

2.1 Retrieve recent memory

curl -s "http://localhost:3002/api/agent-memory?limit=20" | jq '.total, .entries[0]'

2.2 Retrieve Decision Trace

Recent trace list

curl -s "http://localhost:3002/api/agent-decisions?limit=20" | jq '.total, .traces[0].decisionId'

Retrieve a single trace

curl -s "http://localhost:3002/api/agent-decisions?decisionId=<decisionId>" | jq '.trace'

2.3 Incident Replay: standard procedure

Obtain the decisionId for a failed or high-risk cycle.
Review the trace fields: reasoningSummary, evidence, verification.
Compare the activity log and phase trace for the same time window.
If it recurs, query similar cases using memory query filters (component, severity, fromTs).

3. Adaptive Routing Policy

3.1 Check status

curl -s http://localhost:3002/api/ai-routing/status | jq '.policy, .budget, .circuitStates, .counters'

Inspection items:

budget.exceeded: Whether the daily budget is exceeded
circuitStates[].isOpen: Whether the provider circuit is open
counters.fallbackRecovered: Number of successful fallback recoveries

3.2 Change policy (admin)

Policy changes are allowed only when SENTINAI_API_KEY is set.

curl -s -X POST http://localhost:3002/api/ai-routing/policy \
  -H "content-type: application/json" \
  -H "x-api-key: ${SENTINAI_API_KEY}" \
  -d '{"name":"balanced","enabled":true,"abPercent":25,"budgetUsdDaily":80}' | jq

3.3 Rollback checklist

Immediately revert to static behavior

AI_ROUTING_ENABLED=false

Reset policy

AI_ROUTING_POLICY=balanced
AI_ROUTING_AB_PERCENT=0

Preserve status/traces for incident analysis

/api/ai-routing/status
/api/agent-decisions

4. Environment Defaults

AGENT_MEMORY_ENABLED=true
AGENT_MEMORY_RETENTION_DAYS=30
AI_ROUTING_ENABLED=true
AI_ROUTING_POLICY=balanced
AI_ROUTING_AB_PERCENT=10
AI_ROUTING_BUDGET_USD_DAILY=50