SentinAI Docs

RCA Engine (Root Cause Analysis) Guide

📋 Overview

The RCA Engine is an AI-based analysis system that traces the root cause of detected anomalies and suggests remediation.

File: src/lib/rca-engine.ts

3-step analysis process

1️⃣ Timeline composition
├─ Log parsing
├─ Anomalous metric conversion
└─ Sort by time

2️⃣ AI causality analysis
├─ Use the component dependency graph
├─ Cascading failure tracking
└─ Severity assessment

3️⃣ Provide recommended actions
├─ Immediate actions
└─ Preventive measures

🏗️ Optimism Rollup Architecture

Component relationship diagram

                    ┌─────────────────┐
                    │   L1 (Ethereum) │
                    │   or Sepolia    │
                    └────────┬────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │   op-node        │
                    │ (Derivation      │
                    │  Driver)         │
                    └────┬─────────────┘
                    ┌────┴─────┬──────────────┐
                    ▼          ▼              ▼
            ┌──────────────┐ ┌────────────┐ ┌──────────────┐
            │  op-geth     │ │ op-batcher │ │ op-proposer  │
            │  (Execution) │ │ (Batches)  │ │ (State Root) │
            └──────────────┘ └────────────┘ └──────────────┘
                    │
                    └─────→ L1 (Submit batches & roots)

Role of each component

| Component | Role | Depends on | Scope of impact |
|---|---|---|---|
| L1 | External chain (Ethereum/Sepolia) | None | All components |
| op-node | Receives L1 data → derives L2 state | L1 | All downstream components |
| op-geth | L2 block execution (transaction processing) | op-node | Transaction processing |
| op-batcher | Submits L2 transaction batches to L1 | op-node, L1 | Transaction compression |
| op-proposer | Submits L2 state roots to L1 | op-node, L1 | Withdrawals |

Dependency graph

const DEPENDENCY_GRAPH = {
  'l1': {
    dependsOn: [],
    feeds: ['op-node', 'op-batcher', 'op-proposer'],
  },
  'op-node': {
    dependsOn: ['l1'],
    feeds: ['op-geth', 'op-batcher', 'op-proposer'],
  },
  'op-geth': {
    dependsOn: ['op-node'],
    feeds: [],
  },
  'op-batcher': {
    dependsOn: ['op-node', 'l1'],
    feeds: [],
  },
  'op-proposer': {
    dependsOn: ['op-node', 'l1'],
    feeds: [],
  },
};

Important: If op-node fails, every downstream component (op-geth, op-batcher, op-proposer) is affected.


📊 Timeline construction

Data Source

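The timeline is built in three steps: events are extracted from logs, anomalies are converted to events, and the combined list is sorted by time.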

1. Log parsing (Log Events)

function parseLogsToEvents(logs: Record<string, string>): RCAEvent[]

Supported Formats:

  • ISO 8601: 2024-12-09T14:30:45.123Z
  • Geth format: [12-09|14:30:45.123]
  • General format: 2024-12-09 14:30:45

Extraction Conditions:

  • ERROR, ERR, FATAL level → type: error
  • WARN, WARNING level → type: warning

Example:

[12-09|14:30:45.123] ERROR [execution] block derivation failed: context deadline exceeded

→ {
  timestamp: 1733754645123,  // 2024-12-09T14:30:45.123Z, assuming year 2024 and UTC
  component: 'op-geth',      // mapped automatically
  type: 'error',
  description: 'block derivation failed: context deadline exceeded',
  severity: 'high'
}
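
The extraction conditions above can be illustrated with a small sketch that classifies one Geth-format line. The `classifyGethLine` name, the `ClassifiedLine` shape, and the regular expression are assumptions made for this example (the layout is taken from the sample line above); the real `parseLogsToEvents` may work differently.

```typescript
// Hypothetical result shape for this sketch (not the project's RCAEvent type).
interface ClassifiedLine {
  type: 'error' | 'warning';
  severity: 'high' | 'medium';
  description: string;
}

// Level names taken from the extraction conditions above.
const ERROR_LEVELS = ['ERROR', 'ERR', 'FATAL'];
const WARNING_LEVELS = ['WARN', 'WARNING'];

// Assumed layout: "[MM-DD|hh:mm:ss.mmm] LEVEL [optional-tag] message".
const GETH_LINE = /^\[[^\]]+\]\s+([A-Za-z]+)\s+(?:\[[^\]]*\]\s+)?(.*)$/;

function classifyGethLine(line: string): ClassifiedLine | null {
  const match = GETH_LINE.exec(line.trim());
  if (match === null) return null;
  const level = match[1].toUpperCase();
  const description = match[2];
  if (ERROR_LEVELS.indexOf(level) !== -1) {
    return { type: 'error', severity: 'high', description };
  }
  if (WARNING_LEVELS.indexOf(level) !== -1) {
    return { type: 'warning', severity: 'medium', description };
  }
  return null; // INFO, DEBUG, and other levels produce no event
}
```

Lines at other levels, or lines that do not match the assumed layout, return `null` and would simply be skipped by a caller.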

2. Anomalous metric conversion (Anomaly Events)

function anomaliesToEvents(anomalies: AnomalyResult[]): RCAEvent[]

Metric → Component Mapping:

| Metric | Component | Cause |
|---|---|---|
| cpuUsage | op-geth | CPU spike / load |
| txPoolPending | op-geth | Transaction accumulation |
| gasUsedRatio | op-geth | Block saturation |
| l2BlockHeight, l2BlockInterval | op-node | Block production stall |

Example:

Anomaly: CPU spike (Z-Score: 3.2)

→ {
  timestamp: 1733761900000,
  component: 'op-geth',
  type: 'metric_anomaly',
  description: 'CPU usage spike: 30% → 65%',
  severity: 'high'  // because |Z| > 2.5
}
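
A minimal sketch of this conversion follows. The `AnomalySketch` input shape, the `'medium'` severity for |Z| ≤ 2.5, and the `'system'` default for unmapped metrics are assumptions for illustration; only the metric mapping and the 2.5 threshold come from the text above.

```typescript
// Hypothetical input shape; the real AnomalyResult type may differ.
interface AnomalySketch {
  metric: string;
  zScore: number;
  description: string;
  timestamp: number; // epoch milliseconds
}

interface AnomalyEventSketch {
  timestamp: number;
  component: string;
  type: 'metric_anomaly';
  description: string;
  severity: 'high' | 'medium';
}

// Metric → component mapping from the table above.
const METRIC_COMPONENT: Record<string, string> = {
  cpuUsage: 'op-geth',
  txPoolPending: 'op-geth',
  gasUsedRatio: 'op-geth',
  l2BlockHeight: 'op-node',
  l2BlockInterval: 'op-node',
};

function anomaliesToEventsSketch(anomalies: AnomalySketch[]): AnomalyEventSketch[] {
  return anomalies.map((a): AnomalyEventSketch => ({
    timestamp: a.timestamp,
    // Assumption: metrics without a mapping are attributed to 'system'.
    component: Object.prototype.hasOwnProperty.call(METRIC_COMPONENT, a.metric)
      ? METRIC_COMPONENT[a.metric]
      : 'system',
    type: 'metric_anomaly',
    description: a.description,
    // |Z| > 2.5 is 'high' (from the example above); 'medium' otherwise is assumed.
    severity: Math.abs(a.zScore) > 2.5 ? 'high' : 'medium',
  }));
}
```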

3. Sort chronologically

function buildTimeline(
  anomalies: AnomalyResult[],
  logs: Record<string, string>,
  minutes: number = 5
): RCAEvent[]

Behavior:

  1. Merge log events and anomaly events
  2. Keep only events from the last N minutes (the minutes parameter, default 5)
  3. Sort by timestamp

Result:

[
  {
    "time": "2024-12-09T14:28:00Z",
    "component": "op-node",
    "type": "error",
    "description": "L1 reorg detected"
  },
  {
    "time": "2024-12-09T14:28:30Z",
    "component": "op-node",
    "type": "warning",
    "description": "Derivation stalled"
  },
  {
    "time": "2024-12-09T14:29:00Z",
    "component": "op-geth",
    "type": "metric_anomaly",
    "description": "TxPool: 1000 → 5000 (monotonic increase)"
  }
]
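
The merge, filter, and sort steps can be sketched as below. This version takes already-converted events rather than raw logs and anomalies, and adds a `now` parameter so the window is testable; both choices are assumptions of the sketch, not the signature of the real `buildTimeline`.

```typescript
// Hypothetical event shape for this sketch.
interface TimelineEventSketch {
  timestamp: number; // epoch milliseconds
  component: string;
  type: string;
  description: string;
}

function buildTimelineSketch(
  logEvents: TimelineEventSketch[],
  anomalyEvents: TimelineEventSketch[],
  minutes: number = 5,
  now: number = Date.now(),
): TimelineEventSketch[] {
  const cutoff = now - minutes * 60 * 1000;
  return logEvents
    .concat(anomalyEvents)                      // 1. combine both sources
    .filter((e) => e.timestamp >= cutoff)       // 2. keep only the recent window
    .sort((a, b) => a.timestamp - b.timestamp); // 3. oldest event first
}
```

Because `concat` and `filter` return new arrays, the input arrays are left unmodified.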

🧠 AI-based causal analysis

System Prompt Structure

The RCA Engine gives Claude a system prompt written from an SRE perspective. It contains:

1. Component Architecture (detailed description of 5 components)
2. Dependency Graph
3. Common Failure Patterns (5 typical failure patterns)
4. Analysis Guidelines (Analysis Methodology)

5 typical failure patterns

1️⃣ L1 Reorg (L1 chain reorganization)

Cause: A chain reorganization occurs on L1

┌─────────────────────────────────┐
│ L1 Reorg                        │
└────────────┬────────────────────┘
             │
             ▼
┌─────────────────────────────────┐
│ op-node Derivation Reset        │
│ (Derivation state is reset)     │
└────────────┬────────────────────┘
             │
             ▼
┌─────────────────────────────────┐
│ L2 Block Generation Stall       │
│ (Block production pauses)       │
└─────────────────────────────────┘

Symptoms:

  • Block height plateaus for 2 minutes or more
  • Synchronization stops temporarily
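
One way to check the plateau symptom is sketched here. It assumes block-height samples sorted by time; the `HeightSample` shape and function names are hypothetical, and only the 2-minute threshold comes from the text.

```typescript
// Hypothetical sample shape; timestamps are epoch milliseconds, oldest first.
interface HeightSample {
  timestamp: number;
  height: number;
}

// How long (in ms) the most recent block height has stayed unchanged.
function plateauDurationMs(samples: HeightSample[]): number {
  if (samples.length === 0) return 0;
  const last = samples[samples.length - 1];
  let start = last.timestamp;
  // Walk backwards while the height equals the latest height.
  for (let i = samples.length - 2; i >= 0 && samples[i].height === last.height; i--) {
    start = samples[i].timestamp;
  }
  return last.timestamp - start;
}

function isBlockHeightPlateau(samples: HeightSample[], minMs: number = 2 * 60 * 1000): boolean {
  return plateauDurationMs(samples) >= minMs;
}
```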

2️⃣ L1 Gas Spike

Cause: L1 network congestion

┌──────────────────────────┐
│ L1 Gas Price Surge       │
│ (Gas costs rise rapidly) │
└─────────┬────────────────┘
          │
    ┌─────┴─────┐
    ▼           ▼
 Batcher     Proposer
 Failed      Failed
    │           │
    └─────┬─────┘
          ▼
       TxPool
    Accumulation

Symptoms:

  • op-batcher: batch submission fails
  • TxPool: monotonic increase (for more than 5 minutes)
  • Log: "transaction underpriced" or "replacement transaction underpriced"
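
The "monotonic increase" symptom could be tested as in the following sketch. It assumes time-sorted TxPool samples and treats "monotonic" as strictly increasing, which is an interpretation; the `PoolSample` shape and function names are hypothetical.

```typescript
// Hypothetical sample shape; timestamps are epoch milliseconds, oldest first.
interface PoolSample {
  timestamp: number;
  pending: number;
}

// Duration (in ms) of the trailing run in which every sample is strictly
// greater than the one before it.
function risingRunMs(samples: PoolSample[]): number {
  if (samples.length < 2) return 0;
  let i = samples.length - 1;
  while (i > 0 && samples[i].pending > samples[i - 1].pending) i--;
  return samples[samples.length - 1].timestamp - samples[i].timestamp;
}

function isTxPoolMonotonicIncrease(samples: PoolSample[], minMs: number = 5 * 60 * 1000): boolean {
  return risingRunMs(samples) >= minMs;
}
```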

3️⃣ op-geth Crash

Cause: op-geth process crash (OOM, signal, etc.)

┌──────────────────────┐
│ op-geth Crash        │
│ (Process terminated) │
└────────┬─────────────┘
         │
         ▼
CPU: 100% → 0%
Memory: Peak → 0
Port: Open → Closed

Symptoms:

  • CPU suddenly drops to 0% (zero-drop detection)
  • All transaction processing stops
  • Log: "connection refused", "unexpected EOF"

4️⃣ Network Partition (P2P network disconnection)

Cause: P2P communication between nodes is disconnected

┌──────────────────────────────────┐
│ Network Partition                │
│ (P2P gossip disconnection)       │
└────────┬─────────────────────────┘
         │
         ▼
┌──────────────────────────────────┐
│ op-node Peer Loss                │
│ (Loss of peer node connectivity) │
└────────┬─────────────────────────┘
         │
         ▼
Unsafe Head Divergence
(Unsafe head diverges)

Symptoms:

  • op-node: "peer disconnected" logs
  • Block interval: increases
  • Unsafe head: differs from the expected value

5️⃣ Sequencer Stall (sequencer halt)

Cause: A problem in the sequencer node itself

┌────────────────────────────┐
│ Sequencer Stall            │
│ (Block production stops)   │
└──────────┬─────────────────┘
           │
    ┌──────┴──────┐
    ▼             ▼
Block Height   TxPool
Plateau        Growth
(2 minutes+)   (5 minutes+)

Symptoms:

  • Block height: no change
  • TxPool: continues to increase
  • Log: timeouts such as "context deadline exceeded"

AI analysis result format

The JSON returned by Claude:

{
  "rootCause": {
    "component": "op-node" | "op-geth" | "op-batcher" | "op-proposer" | "l1" | "system",
    "description": "Clear root cause description",
    "confidence": 0.0 - 1.0
  },
  "causalChain": [
    {
      "timestamp": 1733761800000,
      "component": "op-node",
      "type": "error" | "warning" | "metric_anomaly" | "state_change",
      "description": "What happened in this step"
    }
  ],
  "affectedComponents": ["op-geth", "op-batcher"],
  "remediation": {
    "immediate": ["Step 1", "Step 2"],
    "preventive": ["Measure 1", "Measure 2"]
  }
}
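
Since the model's JSON is untrusted input, a runtime check is useful before treating it as a result. The sketch below validates only the `rootCause` part of the format above; the `isRootCause` guard and the `RootCauseSketch` type are hypothetical names, not necessarily what `src/types/rca.ts` defines.

```typescript
// Component names listed in the result format above.
const KNOWN_COMPONENTS = ['op-node', 'op-geth', 'op-batcher', 'op-proposer', 'l1', 'system'] as const;
type KnownComponent = typeof KNOWN_COMPONENTS[number];

interface RootCauseSketch {
  component: KnownComponent;
  description: string;
  confidence: number; // 0.0 to 1.0
}

// Type guard: true only if `value` has the expected fields and ranges.
function isRootCause(value: unknown): value is RootCauseSketch {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const component = v['component'];
  const confidence = v['confidence'];
  return (
    typeof component === 'string' &&
    (KNOWN_COMPONENTS as readonly string[]).indexOf(component) !== -1 &&
    typeof v['description'] === 'string' &&
    typeof confidence === 'number' &&
    confidence >= 0 &&
    confidence <= 1
  );
}
```

The same pattern extends to `causalChain`, `affectedComponents`, and `remediation`.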

Confidence score

| Confidence | Meaning | Situation |
|---|---|---|
| 0.9~1.0 | Very high | Clear logs and anomalous metrics agree |
| 0.7~0.9 | High | Only one of logs or metrics is clear |
| 0.5~0.7 | Medium | Several plausible causes |
| 0.3~0.5 | Low | AI call failure → fallback |
| < 0.3 | Very low | Insufficient data |
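
The bands can be turned into a small lookup, sketched below. The table's ranges share their endpoints, so this sketch assumes each boundary value belongs to the higher band (0.9 is "very high", 0.3 is "low"); the `confidenceLabel` name is hypothetical.

```typescript
type ConfidenceLabel = 'very high' | 'high' | 'medium' | 'low' | 'very low';

// Maps a confidence score to the label from the table above.
// Assumption: boundary values fall into the higher band.
function confidenceLabel(score: number): ConfidenceLabel {
  if (score >= 0.9) return 'very high';
  if (score >= 0.7) return 'high';
  if (score >= 0.5) return 'medium';
  if (score >= 0.3) return 'low';
  return 'very low';
}
```

Under this assumption the fallback confidence of 0.3 maps to "low".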

🔀 Dependency tracking

Upstream dependency lookup

findUpstreamComponents(component: RCAComponent): RCAComponent[]

Example:

Upstream dependencies of op-geth:
  op-geth → op-node → l1

Upstream dependencies of op-batcher:
  op-batcher → [op-node, l1]
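
A breadth-first walk over the `dependsOn` edges reproduces the examples above. This is a sketch, not the engine's code: the `DEPENDS_ON` table restates the dependency graph from earlier in this guide, and `findUpstreamSketch` is a hypothetical name.

```typescript
type UpstreamComponent = 'l1' | 'op-node' | 'op-geth' | 'op-batcher' | 'op-proposer';

// dependsOn edges copied from DEPENDENCY_GRAPH.
const DEPENDS_ON: Record<UpstreamComponent, UpstreamComponent[]> = {
  'l1': [],
  'op-node': ['l1'],
  'op-geth': ['op-node'],
  'op-batcher': ['op-node', 'l1'],
  'op-proposer': ['op-node', 'l1'],
};

// Returns every direct and transitive dependency, nearest first.
// The starting component itself is not included.
function findUpstreamSketch(component: UpstreamComponent): UpstreamComponent[] {
  const seen = new Set<UpstreamComponent>();
  const queue: UpstreamComponent[] = DEPENDS_ON[component].slice();
  while (queue.length > 0) {
    const current = queue.shift() as UpstreamComponent;
    if (seen.has(current)) continue;
    seen.add(current);
    DEPENDS_ON[current].forEach((dep) => queue.push(dep));
  }
  return Array.from(seen);
}
```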

Track downstream impacts

findAffectedComponents(rootComponent: RCAComponent): RCAComponent[]

Example:

Components affected when op-node fails:
  op-node fails
├─ op-geth affected (op-geth requires op-node)
├─ op-batcher affected
└─ op-proposer affected

Components affected when op-geth fails:
  op-geth fails
└─ (None - op-geth does not feed any other component)
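
The downstream direction follows the `feeds` edges instead. The sketch below again restates the graph from earlier in this guide; `FEEDS` and `findAffectedSketch` are hypothetical names and the real `findAffectedComponents` may differ.

```typescript
type DownstreamComponent = 'l1' | 'op-node' | 'op-geth' | 'op-batcher' | 'op-proposer';

// feeds edges copied from DEPENDENCY_GRAPH.
const FEEDS: Record<DownstreamComponent, DownstreamComponent[]> = {
  'l1': ['op-node', 'op-batcher', 'op-proposer'],
  'op-node': ['op-geth', 'op-batcher', 'op-proposer'],
  'op-geth': [],
  'op-batcher': [],
  'op-proposer': [],
};

// Returns every component reachable from `root` via feeds edges,
// excluding `root` itself.
function findAffectedSketch(root: DownstreamComponent): DownstreamComponent[] {
  const seen = new Set<DownstreamComponent>();
  const queue: DownstreamComponent[] = FEEDS[root].slice();
  while (queue.length > 0) {
    const current = queue.shift() as DownstreamComponent;
    if (seen.has(current)) continue;
    seen.add(current);
    FEEDS[current].forEach((next) => queue.push(next));
  }
  return Array.from(seen);
}
```

An L1 failure reaches op-geth only transitively (through op-node), which is why a traversal is used rather than a single lookup.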

🛠️ Fallback analysis (AI call failure)

When an AI call fails, the engine automatically falls back to rule-based analysis.

Fallback logic

function generateFallbackAnalysis(
  timeline: RCAEvent[],
  anomalies: AnomalyResult[],
  lastError?: string
): RCAResult

Behavior:

  1. Find the first ERROR event in the timeline
  2. List all components affected by that event's component
  3. Provide basic recommended actions

Confidence: 0.3 (low - manual verification recommended)

Returned recommended actions:

{
  "immediate": [
    "Check component logs for detailed error messages",
    "Verify all pods are running: kubectl get pods -n <namespace>",
    "Check L1 connectivity and block sync status"
  ],
  "preventive": [
    "Set up automated alerting for critical metrics",
    "Implement health check endpoints for all components",
    "Document incident response procedures"
  ]
}
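
Steps 1 and 2 of the fallback, together with the fixed 0.3 confidence, might look like the sketch below. The types, the injected `findAffected` callback, and the `'system'` default when no error event exists are assumptions; the static remediation lists shown above are omitted for brevity.

```typescript
type FallbackComponent = 'l1' | 'op-node' | 'op-geth' | 'op-batcher' | 'op-proposer' | 'system';

interface FallbackEvent {
  timestamp: number;
  component: FallbackComponent;
  type: 'error' | 'warning' | 'metric_anomaly' | 'state_change';
  description: string;
}

interface FallbackResult {
  rootCause: { component: FallbackComponent; description: string; confidence: number };
  affectedComponents: FallbackComponent[];
}

function fallbackAnalysisSketch(
  timeline: FallbackEvent[], // assumed to be sorted oldest first
  findAffected: (root: FallbackComponent) => FallbackComponent[],
  lastError?: string,
): FallbackResult {
  // 1. The first ERROR event is treated as the suspected root cause.
  const firstError = timeline.filter((e) => e.type === 'error')[0] as FallbackEvent | undefined;
  // Assumption: with no error event, the incident is attributed to 'system'.
  const component: FallbackComponent = firstError ? firstError.component : 'system';
  const reason = firstError ? firstError.description : 'No error event found in timeline';
  const suffix = lastError ? ` (AI analysis unavailable: ${lastError})` : '';
  return {
    rootCause: { component, description: reason + suffix, confidence: 0.3 },
    // 2. Everything downstream of the suspected root.
    affectedComponents: firstError ? findAffected(component) : [],
  };
}
```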

📝 Log parsing details

Supported log formats

ISO 8601 format

2024-12-09T14:30:45.123Z ERROR [op-geth] failed to execute block
→ timestamp: 1733754645123

Geth Format

[12-09|14:30:45.123] op-geth ERROR block execution timeout
→ timestamp: <year>-12-09 14:30:45.123 (the format itself carries no year)

General format

2024-12-09 14:30:45 ERROR op-node derivation failed
→ timestamp: 2024-12-09 14:30:45
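
The three timestamp formats can be recognized with one regular expression each, as in this sketch. It assumes all times are UTC and takes the missing year of the Geth format as a parameter; `parseLogTimestamp` and `assumedYear` are hypothetical, and ISO strings with a numeric offset instead of `Z` are not handled.

```typescript
// Converts a captured group to a number; a missing group counts as 0.
const num = (s: string | undefined): number => Number(s === undefined ? '0' : s);

// Returns epoch milliseconds, or null if the line has no recognized timestamp.
function parseLogTimestamp(line: string, assumedYear: number): number | null {
  // ISO 8601, e.g. 2024-12-09T14:30:45.123Z
  let m = /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d{3}))?Z/.exec(line);
  if (m) {
    return Date.UTC(num(m[1]), num(m[2]) - 1, num(m[3]), num(m[4]), num(m[5]), num(m[6]), num(m[7]));
  }
  // Geth format, e.g. [12-09|14:30:45.123] (no year in the log line)
  m = /^\[(\d{2})-(\d{2})\|(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]/.exec(line);
  if (m) {
    return Date.UTC(assumedYear, num(m[1]) - 1, num(m[2]), num(m[3]), num(m[4]), num(m[5]), num(m[6]));
  }
  // General format, e.g. 2024-12-09 14:30:45 (second precision)
  m = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/.exec(line);
  if (m) {
    return Date.UTC(num(m[1]), num(m[2]) - 1, num(m[3]), num(m[4]), num(m[5]), num(m[6]));
  }
  return null;
}
```

Passing the year explicitly keeps the function deterministic; a caller could supply `new Date().getUTCFullYear()`.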

Component name normalization

const COMPONENT_NAME_MAP = {
  'op-geth': 'op-geth',
  'geth': 'op-geth',
  'op-node': 'op-node',
  'node': 'op-node',
  'op-batcher': 'op-batcher',
  'batcher': 'op-batcher',
  'op-proposer': 'op-proposer',
  'proposer': 'op-proposer',
};
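
A lookup against this map could be wrapped as follows. The trimming, lower-casing, and `undefined` return for unknown names are assumptions of the sketch; the map is repeated under a different name only to keep the example self-contained.

```typescript
// Same entries as COMPONENT_NAME_MAP above.
const NAME_MAP_SKETCH: Record<string, string> = {
  'op-geth': 'op-geth',
  'geth': 'op-geth',
  'op-node': 'op-node',
  'node': 'op-node',
  'op-batcher': 'op-batcher',
  'batcher': 'op-batcher',
  'op-proposer': 'op-proposer',
  'proposer': 'op-proposer',
};

// Returns the canonical component name, or undefined for unknown input.
function normalizeComponentName(raw: string): string | undefined {
  const key = raw.trim().toLowerCase();
  // hasOwnProperty guards against inherited keys such as "toString".
  return Object.prototype.hasOwnProperty.call(NAME_MAP_SKETCH, key)
    ? NAME_MAP_SKETCH[key]
    : undefined;
}
```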

Log level extraction

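const LOG_LEVEL_MAP = {
  ERROR:   { type: 'error',   severity: 'high' },
  ERR:     { type: 'error',   severity: 'high' },
  FATAL:   { type: 'error',   severity: 'high' },
  WARN:    { type: 'warning', severity: 'medium' },
  WARNING: { type: 'warning', severity: 'medium' },
};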

📊 Execution example

Step 1: Build the timeline

Timeline Events (within 5 minutes):
[14:28:00] op-node     ERROR   L1 reorg detected
[14:28:30] op-node     WARNING Derivation stalled
[14:29:00] op-geth     METRIC  TxPool: 1000 → 5000
[14:29:30] op-geth     ERROR   Connection refused
[14:30:00] op-batcher  ERROR   Batch submission failed

Step 2: AI Analysis

Prompt to be sent:

System: [RCA_SYSTEM_PROMPT includes architecture, patterns, etc.]

User:
== Event Timeline ==
[timeline JSON]

== Detected Anomalies ==
- txPoolPending: 5000 (z-score: 3.1, spike)

== Recent Metrics ==
[Metric Snapshot]

== Component Logs ==
[Log contents]

Analyze the above data and identify the root cause.
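
Assembling the user message is plain string concatenation, sketched below. The `PromptInput` shape and the `buildUserPromptSketch` name are hypothetical; the section headers and the closing instruction are taken from the prompt shown above.

```typescript
// Hypothetical input: each section is already rendered to text.
interface PromptInput {
  timelineJson: string;
  anomalies: string[]; // one line per anomaly, without the leading "- "
  metricsSnapshot: string;
  logs: string;
}

function buildUserPromptSketch(input: PromptInput): string {
  const anomalyLines = input.anomalies.map((a) => `- ${a}`);
  return [
    '== Event Timeline ==',
    input.timelineJson,
    '',
    '== Detected Anomalies ==',
  ]
    .concat(anomalyLines)
    .concat([
      '',
      '== Recent Metrics ==',
      input.metricsSnapshot,
      '',
      '== Component Logs ==',
      input.logs,
      '',
      'Analyze the above data and identify the root cause.',
    ])
    .join('\n');
}
```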

Claude responds:

{
  "rootCause": {
    "component": "op-node",
    "description": "A chain reorganization on L1 reset the derivation state of op-node. This delayed op-geth execution and caused transactions to accumulate in the TxPool.",
    "confidence": 0.85
  },
  "causalChain": [
    {
      "timestamp": 1733761680000,
      "component": "l1",
      "type": "error",
      "description": "L1 reorg detected"
    },
    {
      "timestamp": 1733761710000,
      "component": "op-node",
      "type": "error",
      "description": "Derivation reset due to L1 reorg"
    },
    {
      "timestamp": 1733761740000,
      "component": "op-geth",
      "type": "metric_anomaly",
      "description": "TxPool accumulation (1000 → 5000)"
    }
  ],
  "affectedComponents": ["op-geth", "op-batcher"],
  "remediation": {
    "immediate": [
      "Monitor L1 finality status",
      "Check op-node derivation progress",
      "Verify op-geth is catching up with pending transactions"
    ],
    "preventive": [
      "Increase watchdog timeout thresholds during L1 finality uncertainty",
      "Implement automated derivation state validation",
      "Set up alerts for L1 reorg patterns"
    ]
  }
}

Step 3: Save results

{
  "id": "rca-1733761845-abc123",
  "rootCause": { ... },
  "causalChain": [ ... ],
  "affectedComponents": ["op-geth", "op-batcher"],
  "timeline": [ ... ],
  "remediation": { ... },
  "generatedAt": "2024-12-09T14:30:45.678Z"
}

📞 API usage

RCA Analysis Request

curl -X POST "http://localhost:3002/api/rca" \
  -H "Content-Type: application/json" \
  -d '{
    "autoTriggered": false
  }'

Response:

{
  "success": true,
  "result": {
    "id": "rca-1733761845-abc123",
    "rootCause": { ... },
    "causalChain": [ ... ],
    "affectedComponents": ["op-geth", "op-batcher"],
    "timeline": [ ... ],
    "remediation": {
      "immediate": [ ... ],
      "preventive": [ ... ]
    },
    "generatedAt": "2024-12-09T14:30:45.678Z"
  }
}

# Most recent 10 RCA analysis results
curl -s "http://localhost:3002/api/rca?limit=10" | jq '.history'

# Specific RCA analysis results
curl -s "http://localhost:3002/api/rca/rca-1733761845-abc123" | jq '.result'
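
For callers written in TypeScript, the same requests can be prepared with small helper functions. This is a sketch: `buildRcaRequest` and `buildHistoryUrl` are hypothetical helpers that only construct the URL and options, mirroring the curl commands above.

```typescript
interface RcaRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

// Strips trailing slashes so that paths can be appended safely.
function trimBase(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, '');
}

// Mirrors: curl -X POST "<base>/api/rca" -d '{"autoTriggered": false}'
function buildRcaRequest(baseUrl: string, autoTriggered: boolean = false): RcaRequest {
  return {
    url: `${trimBase(baseUrl)}/api/rca`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ autoTriggered }),
    },
  };
}

// Mirrors: curl -s "<base>/api/rca?limit=10"
function buildHistoryUrl(baseUrl: string, limit: number = 10): string {
  return `${trimBase(baseUrl)}/api/rca?limit=${encodeURIComponent(String(limit))}`;
}
```

In an environment that provides `fetch`, the request would be sent with `const { url, init } = buildRcaRequest('http://localhost:3002'); const res = await fetch(url, init);` and the analysis read from the `result` field of the JSON body.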

⚙️ Performance optimization

Settings

/** Maximum number of history items */
const MAX_HISTORY_SIZE = 20;

/** AI call timeout */
const AI_TIMEOUT = 30000;  // 30 seconds

/** Number of retries */
const MAX_RETRIES = 2;

/** Retry wait time */
retry_delay = 1000 * (attempt + 1);  // linear backoff: grows by 1 second per attempt
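
How these settings might combine into a retry loop is sketched below. It assumes that `MAX_RETRIES = 2` means two retries after the first attempt and that `attempt` starts at 0; `withRetrySketch` and the injected `sleep` function are hypothetical, and the 30-second timeout is left to the caller's `task`.

```typescript
const MAX_RETRIES_SKETCH = 2;

// Delay before the next attempt; grows linearly: 1000 ms, 2000 ms, ...
function retryDelayMs(attempt: number): number {
  return 1000 * (attempt + 1);
}

// Runs `task`, retrying on rejection up to `maxRetries` times.
// `sleep` is injected so that tests can skip the real waiting time.
async function withRetrySketch<T>(
  task: () => Promise<T>,
  sleep: (ms: number) => Promise<void>,
  maxRetries: number = MAX_RETRIES_SKETCH,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No wait after the final attempt.
      if (attempt < maxRetries) await sleep(retryDelayMs(attempt));
    }
  }
  throw lastError;
}
```

In production code `sleep` would typically wrap a timer, for example `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`.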

Timeline period

/** By default, only the most recent 5 minutes of data is analyzed */
buildTimeline(anomalies, logs, minutes = 5)

🔍 Fallback trigger conditions

The fallback is triggered when RCA analysis fails for any of the following reasons:

  1. AI call failure (network error, timeout)
  2. JSON parsing failure
  3. AI response is not in the expected format

In these cases the engine automatically switches to rule-based analysis and reports a confidence of 0.3.
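
Conditions 2 and 3 can be handled in one place, as in this sketch. `parseOrFallback` is a hypothetical helper: it takes the raw model output, a validator, and a fallback producer. Condition 1 (the call itself failing) would be caught by the caller before this point.

```typescript
// Parses `raw` as JSON and validates it; any failure yields the fallback.
function parseOrFallback<T>(
  raw: string,
  isValid: (value: unknown) => value is T,
  fallback: () => T,
): T {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return fallback(); // condition 2: JSON parsing failure
  }
  // condition 3: parsed successfully but not in the expected format
  return isValid(parsed) ? parsed : fallback();
}
```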


Related files

| File | Role |
|---|---|
| src/lib/rca-engine.ts | Main RCA engine |
| src/types/rca.ts | Type definitions |
| src/app/api/rca/route.ts | API endpoint |
| src/lib/anomaly-detector.ts | Layer 1 anomaly detection |
| src/lib/ai-client.ts | AI calls (Claude) |

🎯 Summary of Key Features

✅ Component-centric analysis: based on the Optimism architecture
✅ Causal chain tracing: traces from root cause to final symptom
✅ Dependency graph: automatic calculation of component dependencies
✅ AI-powered: Claude-based semantic analysis
✅ Fallback support: rule-based analysis when the AI call fails
✅ Actionable advice: immediate actions plus preventive measures
✅ History management: keeps the last 20 analysis results