RCA Engine (Root Cause Analysis) Guide
Overview
RCA Engine is an AI-based analysis system that traces the root cause of a detected anomaly and recommends remediation steps.
File: src/lib/rca-engine.ts
3-step analysis process
1. Timeline composition
├─ Log parsing
├─ Anomaly metric conversion
└─ Sort by time
2. AI causality analysis
├─ Use the component dependency graph
├─ Cascading failure tracking
└─ Severity assessment
3. Recommended actions
├─ Immediate actions
└─ Preventive measures
Optimism Rollup Architecture
Component relationship diagram
      ┌─────────────────┐
      │  L1 (Ethereum)  │
      │   or Sepolia    │
      └────────┬────────┘
               │
               ▼
      ┌─────────────────┐
      │     op-node     │
      │   (Derivation   │
      │     Driver)     │
      └────────┬────────┘
       ┌───────┴───────┬───────────────┐
       ▼               ▼               ▼
┌──────────────┐ ┌────────────┐ ┌──────────────┐
│   op-geth    │ │ op-batcher │ │ op-proposer  │
│ (Execution)  │ │ (Batches)  │ │ (State Root) │
└──────────────┘ └────────────┘ └──────────────┘
                       │
                       └──────▶ L1 (Submit batches & roots)
Role of each component
| Component | Role | Dependency | Scope of influence |
|---|---|---|---|
| L1 | External chain (Ethereum/Sepolia) | None | All components |
| op-node | Receives L1 data → derives L2 state | L1 | All downstream components |
| op-geth | L2 block execution (transaction processing) | op-node | Transaction processing |
| op-batcher | Submits L2 transaction batches to L1 | op-node, L1 | Transaction compression |
| op-proposer | Submits L2 state roots to L1 | op-node, L1 | Withdrawals |
Dependency graph
const DEPENDENCY_GRAPH = {
'l1': {
dependsOn: [],
feeds: ['op-node', 'op-batcher', 'op-proposer'],
},
'op-node': {
dependsOn: ['l1'],
feeds: ['op-geth', 'op-batcher', 'op-proposer'],
},
'op-geth': {
dependsOn: ['op-node'],
feeds: [],
},
'op-batcher': {
dependsOn: ['op-node', 'l1'],
feeds: [],
},
'op-proposer': {
dependsOn: ['op-node', 'l1'],
feeds: [],
},
};
Important: if op-node fails, every downstream component (op-geth, op-batcher, op-proposer) is affected.
Timeline construction
Data sources
The timeline is built from two event sources, log events and anomaly events, which are then merged and sorted chronologically:
1. Log parsing (Log Events)
function parseLogsToEvents(logs: Record<string, string>): RCAEvent[]
Supported Formats:
- ISO 8601: 2024-12-09T14:30:45.123Z
- Geth format: [12-09|14:30:45.123]
- General format: 2024-12-09 14:30:45
Extraction Conditions:
- ERROR, ERR, FATAL level → type: error
- WARN, WARNING level → type: warning
Example:
[12-09|14:30:45.123] ERROR [execution] block derivation failed: context deadline exceeded
→ {
timestamp: 1733761845123,
component: 'op-geth', // mapped automatically
type: 'error',
description: 'block derivation failed: context deadline exceeded',
severity: 'high'
}
2. Anomalous metric conversion (Anomaly Events)
function anomaliesToEvents(anomalies: AnomalyResult[]): RCAEvent[]
Metric → Component Mapping:
| Metric | Component | Cause |
|---|---|---|
| cpuUsage | op-geth | CPU spike/load |
| txPoolPending | op-geth | Transaction accumulation |
| gasUsedRatio | op-geth | Block saturation |
| l2BlockHeight, l2BlockInterval | op-node | Block production stall |
Example:
Anomaly: CPU spike (Z-Score: 3.2)
→ {
timestamp: 1733761900000,
component: 'op-geth',
type: 'metric_anomaly',
description: 'CPU usage spike: 30% → 65%',
severity: 'high' // because |Z| > 2.5
}
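The conversion above can be sketched roughly as follows. This is a minimal illustration, not the code in src/lib/rca-engine.ts: the `AnomalyResult` field names (`metric`, `zScore`, `description`, `timestamp`), the `'medium'` severity below the threshold, and the `'system'` default for unmapped metrics are all assumptions. Only the metric-to-component table and the |Z| > 2.5 rule come from this guide.

```typescript
// Hypothetical shapes; the real types are defined in src/types/rca.ts.
type RCAComponent = 'op-geth' | 'op-node' | 'op-batcher' | 'op-proposer' | 'l1' | 'system';

interface AnomalyResult {
  metric: string;
  zScore: number;
  description: string;
  timestamp: number; // epoch milliseconds
}

interface RCAEvent {
  timestamp: number;
  component: RCAComponent;
  type: 'error' | 'warning' | 'metric_anomaly' | 'state_change';
  description: string;
  severity: 'high' | 'medium';
}

// Metric → component mapping taken from the table above.
function componentForMetric(metric: string): RCAComponent {
  switch (metric) {
    case 'cpuUsage':
    case 'txPoolPending':
    case 'gasUsedRatio':
      return 'op-geth';
    case 'l2BlockHeight':
    case 'l2BlockInterval':
      return 'op-node';
    default:
      return 'system'; // assumption: unmapped metrics are attributed to 'system'
  }
}

function anomaliesToEvents(anomalies: AnomalyResult[]): RCAEvent[] {
  return anomalies.map((a): RCAEvent => ({
    timestamp: a.timestamp,
    component: componentForMetric(a.metric),
    type: 'metric_anomaly',
    description: a.description,
    // |Z| > 2.5 is 'high' per the example; 'medium' otherwise is an assumption.
    severity: Math.abs(a.zScore) > 2.5 ? 'high' : 'medium',
  }));
}
```

Using a `switch` rather than an object lookup keeps unexpected metric names (for example `constructor`) from resolving to inherited object properties.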
3. Sort chronologically
function buildTimeline(
anomalies: AnomalyResult[],
logs: Record<string, string>,
minutes: number = 5
): RCAEvent[]
Behavior:
- Combine log events and anomaly events
- Keep only events within the time window (the last 5 minutes by default)
- Sort by timestamp
Result:
[
{
"time": "2024-12-09T14:28:00Z",
"component": "op-node",
"type": "error",
"description": "L1 reorg detected"
},
{
"time": "2024-12-09T14:28:30Z",
"component": "op-geth",
"type": "warning",
"description": "Derivation stalled"
},
{
"time": "2024-12-09T14:29:00Z",
"component": "op-geth",
"type": "metric_anomaly",
"description": "TxPool: 1000 → 5000 (monotonic increase)"
}
]
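The merge, filter, and sort steps could look roughly like the sketch below. It assumes the two sources have already been converted to events; `mergeTimeline` and its `now` parameter are hypothetical names introduced here so the window can be tested deterministically, and the real `buildTimeline` takes raw anomalies and logs instead.

```typescript
// Simplified, hypothetical event shape for this sketch.
interface RCAEvent {
  timestamp: number; // epoch milliseconds
  component: string;
  type: string;
  description: string;
}

// Merge both event sources, keep the last `minutes` minutes, sort oldest first.
function mergeTimeline(
  logEvents: RCAEvent[],
  anomalyEvents: RCAEvent[],
  minutes: number = 5,
  now: number = Date.now()
): RCAEvent[] {
  const cutoff = now - minutes * 60 * 1000;
  return [...logEvents, ...anomalyEvents]
    .filter((e) => e.timestamp >= cutoff)
    .sort((a, b) => a.timestamp - b.timestamp);
}
```

Sorting a fresh array (built by the spread) leaves the caller's input arrays untouched.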
AI-based causal analysis
System Prompt Structure
The system prompt gives Claude clear instructions from an SRE perspective. It contains:
1. Component Architecture (detailed description of the 5 components)
2. Dependency Graph
3. Common Failure Patterns (5 typical failure patterns)
4. Analysis Guidelines (analysis methodology)
Five typical failure patterns
1. L1 Reorg (L1 chain reorganization)
Cause: A chain reorganization occurs on L1
┌────────────────────────────────┐
│            L1 Reorg            │
└───────────────┬────────────────┘
                │
                ▼
┌────────────────────────────────┐
│    op-node Derivation Reset    │
│  (derivation state is reset)   │
└───────────────┬────────────────┘
                │
                ▼
┌────────────────────────────────┐
│   L2 Block Generation Stall    │
│   (block production pauses)    │
└────────────────────────────────┘
Symptoms:
- Block height plateaus for 2 minutes or more
- Synchronization pauses temporarily
2. L1 Gas Spike
Cause: L1 network congestion
┌──────────────────────────┐
│    L1 Gas Price Surge    │
│ (gas costs rise rapidly) │
└────────────┬─────────────┘
             │
      ┌──────┴──────┐
      ▼             ▼
   Batcher      Proposer
   Failed        Failed
      │             │
      └──────┬──────┘
             ▼
          TxPool
       Accumulation
Symptoms:
- op-batcher: batch submission fails
- TxPool: monotonic increase (over 5 minutes)
- Logs: "transaction underpriced" or "replacement transaction underpriced"
3. op-geth Crash
Cause: The op-geth process crashes (OOM, signal, etc.)
┌──────────────────┐
│  op-geth Crash   │
│ (process exits)  │
└────────┬─────────┘
         │
         ▼
CPU: 100% → 0%
Memory: Peak → 0
Port: Open → Closed
Symptoms:
- CPU suddenly drops to 0% (zero-drop detection)
- All transaction processing stops
- Logs: "connection refused", "unexpected EOF"
4. Network Partition (P2P network disconnection)
Cause: P2P communication between nodes is lost
┌──────────────────────────────────┐
│        Network Partition         │
│    (P2P gossip disconnected)     │
└────────────────┬─────────────────┘
                 │
                 ▼
┌──────────────────────────────────┐
│        op-node Peer Loss         │
│   (peer connectivity is lost)    │
└────────────────┬─────────────────┘
                 │
                 ▼
      Unsafe Head Divergence
   (diverges from the safe head)
Symptoms:
- op-node: "peer disconnected" logs
- Block interval: increases
- Unsafe head: differs from the expected value
5. Sequencer Stall (sequencer halted)
Cause: A problem with the sequencer node itself
┌────────────────────────────┐
│      Sequencer Stall       │
│  (block production stops)  │
└─────────────┬──────────────┘
              │
       ┌──────┴──────┐
       ▼             ▼
 Block Height      TxPool
    Plateau        Growth
 (2 minutes+)  (5 minutes+)
Symptoms:
- Block height: no change
- TxPool: keeps increasing
- Logs: timeouts such as "context deadline exceeded"
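Two symptoms recur across these patterns: a block height plateau (2 minutes or more) and a monotonic TxPool increase (over 5 minutes). The sketch below shows one way such checks could be written. The `Sample` shape and both function names are hypothetical, the samples are assumed to be sorted by time, and the guide does not say how the engine or the anomaly detector actually implements these checks.

```typescript
// Hypothetical metric sample; not a type from the source.
interface Sample {
  timestamp: number; // epoch milliseconds
  value: number;
}

// True when the most recent run of identical values spans at least minDurationMs.
function isPlateau(samples: Sample[], minDurationMs: number): boolean {
  let runStart: Sample | undefined;
  let last: Sample | undefined;
  for (const s of samples) {
    if (!runStart || s.value !== runStart.value) runStart = s; // value changed: new run
    last = s;
  }
  return !!runStart && !!last && last.timestamp - runStart.timestamp >= minDurationMs;
}

// True when the most recent run without any decrease shows net growth
// and spans at least minDurationMs.
function isMonotonicIncrease(samples: Sample[], minDurationMs: number): boolean {
  let runStart: Sample | undefined;
  let prev: Sample | undefined;
  for (const s of samples) {
    if (!prev || s.value < prev.value) runStart = s; // a drop restarts the run
    prev = s;
  }
  return (
    !!runStart &&
    !!prev &&
    prev.value > runStart.value &&
    prev.timestamp - runStart.timestamp >= minDurationMs
  );
}
```

With the thresholds from this section, a caller would pass `2 * 60 * 1000` for the block height check and `5 * 60 * 1000` for the TxPool check.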
AI analysis result format
The JSON returned by Claude:
{
"rootCause": {
"component": "op-node" | "op-geth" | "op-batcher" | "op-proposer" | "l1" | "system",
"description": "Clear root cause description",
"confidence": 0.0 - 1.0
},
"causalChain": [
{
"timestamp": 1733761800000,
"component": "op-node",
"type": "error" | "warning" | "metric_anomaly" | "state_change",
"description": "What happened in this step"
}
],
"affectedComponents": ["op-geth", "op-batcher"],
"remediation": {
"immediate": ["Step 1", "Step 2"],
"preventive": ["Measure 1", "Measure 2"]
}
}
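Because the model's reply is free text, it is worth validating it before use. The following is a minimal sketch of such a check against the format above. `parseAIAnalysis`, `AIAnalysis`, and `CausalStep` are hypothetical names, and the check is deliberately shallow (array elements are not inspected); how the engine itself validates responses is not described in this guide.

```typescript
type RCAComponent = 'op-node' | 'op-geth' | 'op-batcher' | 'op-proposer' | 'l1' | 'system';

interface CausalStep {
  timestamp: number;
  component: RCAComponent;
  type: 'error' | 'warning' | 'metric_anomaly' | 'state_change';
  description: string;
}

interface AIAnalysis {
  rootCause: { component: RCAComponent; description: string; confidence: number };
  causalChain: CausalStep[];
  affectedComponents: RCAComponent[];
  remediation: { immediate: string[]; preventive: string[] };
}

const COMPONENTS: string[] = ['op-node', 'op-geth', 'op-batcher', 'op-proposer', 'l1', 'system'];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v);
}

// Returns the analysis, or null when the text is not JSON or lacks required fields.
function parseAIAnalysis(raw: string): AIAnalysis | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // JSON parsing failure
  }
  if (!isRecord(data)) return null;
  const rootCause = data['rootCause'];
  const remediation = data['remediation'];
  if (!isRecord(rootCause) || !isRecord(remediation)) return null;
  const component = rootCause['component'];
  const confidence = rootCause['confidence'];
  const valid =
    typeof component === 'string' &&
    COMPONENTS.indexOf(component) !== -1 &&
    typeof rootCause['description'] === 'string' &&
    typeof confidence === 'number' &&
    confidence >= 0 &&
    confidence <= 1 &&
    Array.isArray(data['causalChain']) &&
    Array.isArray(data['affectedComponents']) &&
    Array.isArray(remediation['immediate']) &&
    Array.isArray(remediation['preventive']);
  return valid ? (data as unknown as AIAnalysis) : null;
}
```

A `null` result corresponds to the "JSON parsing failure" and "not in expected format" cases that trigger the rule-based fallback.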
Confidence score
| Confidence | Meaning | Situation |
|---|---|---|
| 0.9~1.0 | Very high | Clear logs and anomaly metrics agree |
| 0.7~0.9 | High | Only one of logs or metrics is clear |
| 0.5~0.7 | Medium | Several plausible causes |
| 0.3~0.5 | Low | AI call failed → fallback analysis |
| < 0.3 | Very low | Insufficient data |
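For display purposes, the bands in this table could be mapped to labels as in the sketch below. The table's ranges share their boundary values (0.9 appears in two rows), so this sketch assumes each band includes its lower bound; `confidenceLabel` is a hypothetical helper, not a function from the source.

```typescript
type ConfidenceLabel = 'very high' | 'high' | 'medium' | 'low' | 'very low';

// Assumption: each band includes its lower bound, so 0.9 is 'very high' and 0.3 is 'low'.
function confidenceLabel(confidence: number): ConfidenceLabel {
  if (confidence >= 0.9) return 'very high';
  if (confidence >= 0.7) return 'high';
  if (confidence >= 0.5) return 'medium';
  if (confidence >= 0.3) return 'low';
  return 'very low';
}
```

Under this assumption, the fallback confidence of 0.3 is labeled 'low', which matches the table's fallback row.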
Dependency tracking
Upstream dependency lookup
findUpstreamComponents(component: RCAComponent): RCAComponent[]
Example:
Upstream dependencies of op-geth:
op-geth → op-node → l1
Upstream dependencies of op-batcher:
op-batcher → [op-node, l1]
Track downstream impacts
findAffectedComponents(rootComponent: RCAComponent): RCAComponent[]
Example:
Components affected when op-node fails:
op-node fails
├─ op-geth is affected (op-geth requires op-node)
├─ op-batcher is affected
└─ op-proposer is affected
Components affected when op-geth fails:
op-geth fails
└─ (None: op-geth does not feed any other component)
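Both lookups amount to a graph traversal over the dependency graph shown earlier: upstream follows `dependsOn` edges, downstream follows `feeds` edges. The sketch below is one plausible breadth-first implementation under that assumption; the `walk` helper is hypothetical, and the `RCAComponent` type here omits `'system'` because it has no entry in the graph. The real functions in src/lib/rca-engine.ts may differ.

```typescript
type RCAComponent = 'l1' | 'op-node' | 'op-geth' | 'op-batcher' | 'op-proposer';

interface GraphNode {
  dependsOn: RCAComponent[];
  feeds: RCAComponent[];
}

// Same edges as the DEPENDENCY_GRAPH shown in this guide.
const DEPENDENCY_GRAPH: Record<RCAComponent, GraphNode> = {
  l1: { dependsOn: [], feeds: ['op-node', 'op-batcher', 'op-proposer'] },
  'op-node': { dependsOn: ['l1'], feeds: ['op-geth', 'op-batcher', 'op-proposer'] },
  'op-geth': { dependsOn: ['op-node'], feeds: [] },
  'op-batcher': { dependsOn: ['op-node', 'l1'], feeds: [] },
  'op-proposer': { dependsOn: ['op-node', 'l1'], feeds: [] },
};

// Breadth-first walk along one edge type; the start node itself is not returned.
function walk(start: RCAComponent, edge: 'dependsOn' | 'feeds'): RCAComponent[] {
  const found: RCAComponent[] = [];
  const queue: RCAComponent[] = [...DEPENDENCY_GRAPH[start][edge]];
  while (queue.length > 0) {
    const next = queue.shift();
    if (next === undefined || found.indexOf(next) !== -1) continue; // skip repeats
    found.push(next);
    queue.push(...DEPENDENCY_GRAPH[next][edge]);
  }
  return found;
}

function findUpstreamComponents(component: RCAComponent): RCAComponent[] {
  return walk(component, 'dependsOn');
}

function findAffectedComponents(rootComponent: RCAComponent): RCAComponent[] {
  return walk(rootComponent, 'feeds');
}
```

With this graph, `findUpstreamComponents('op-geth')` yields op-node then l1, and `findAffectedComponents('op-geth')` yields an empty list, consistent with the examples above.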
Fallback analysis (AI call failure)
When the AI call fails, the engine automatically falls back to rule-based analysis.
Fallback logic
function generateFallbackAnalysis(
timeline: RCAEvent[],
anomalies: AnomalyResult[],
lastError?: string
): RCAResult
Behavior:
- Find the first ERROR event in the timeline
- List all components affected by that event's component
- Provide basic recommended actions
Confidence: 0.3 (low; manual verification recommended)
Returned recommended actions:
{
"immediate": [
"Check component logs for detailed error messages",
"Verify all pods are running: kubectl get pods -n <namespace>",
"Check L1 connectivity and block sync status"
],
"preventive": [
"Set up automated alerting for critical metrics",
"Implement health check endpoints for all components",
"Document incident response procedures"
]
}
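Put together, the rule-based fallback could be sketched as below. This is an illustration under stated assumptions rather than the actual `generateFallbackAnalysis`: the sketch drops the `anomalies` parameter, attributes the root cause to `'system'` when the timeline has no ERROR event, and uses the hypothetical names `fallbackAnalysisSketch`, `downstreamOf`, and `FEEDS`.

```typescript
type Component = 'l1' | 'op-node' | 'op-geth' | 'op-batcher' | 'op-proposer' | 'system';

interface RCAEvent {
  timestamp: number; // epoch milliseconds
  component: Component;
  type: 'error' | 'warning' | 'metric_anomaly' | 'state_change';
  description: string;
}

// "feeds" edges from the dependency graph in this guide; 'system' has none.
const FEEDS: Record<Component, Component[]> = {
  l1: ['op-node', 'op-batcher', 'op-proposer'],
  'op-node': ['op-geth', 'op-batcher', 'op-proposer'],
  'op-geth': [],
  'op-batcher': [],
  'op-proposer': [],
  system: [],
};

function downstreamOf(root: Component): Component[] {
  const found: Component[] = [];
  const queue: Component[] = [...FEEDS[root]];
  while (queue.length > 0) {
    const next = queue.shift();
    if (next === undefined || found.indexOf(next) !== -1) continue;
    found.push(next);
    queue.push(...FEEDS[next]);
  }
  return found;
}

function fallbackAnalysisSketch(timeline: RCAEvent[], lastError?: string) {
  // Find the earliest ERROR event.
  const ordered = [...timeline].sort((a, b) => a.timestamp - b.timestamp);
  let firstError: RCAEvent | undefined;
  for (const event of ordered) {
    if (event.type === 'error') {
      firstError = event;
      break;
    }
  }
  // Assumption: with no ERROR event, the root cause is attributed to 'system'.
  const component: Component = firstError ? firstError.component : 'system';
  return {
    rootCause: {
      component,
      description: firstError
        ? firstError.description
        : (lastError ?? 'No error event in the timeline'),
      confidence: 0.3, // fixed low confidence for rule-based results
    },
    causalChain: firstError ? [firstError] : [],
    affectedComponents: downstreamOf(component),
    remediation: {
      immediate: ['Check component logs for detailed error messages'],
      preventive: ['Set up automated alerting for critical metrics'],
    },
  };
}
```

The remediation lists are shortened here; the full set of returned actions is the JSON shown above.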
Log parsing details
Supported log formats
ISO 8601 format
2024-12-09T14:30:45.123Z ERROR [op-geth] failed to execute block
→ timestamp: 1733754645123
Geth Format
[12-09|14:30:45.123] op-geth ERROR block execution timeout
→ timestamp: December 9, 14:30:45.123 (the year is not part of this format)
General format
2024-12-09 14:30:45 ERROR op-node derivation failed
→ timestamp: 2024-12-09 14:30:45 (the log line carries no time zone)
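A parser for the three formats might look like the sketch below. Two points are assumptions made only for this illustration: timestamps without a zone are interpreted as UTC, and the year for the Geth format is supplied by the caller through the hypothetical `assumedYear` parameter. The guide does not say which zone or year the engine uses.

```typescript
// Returns epoch milliseconds, or null when no supported timestamp is found.
function parseLogTimestamp(line: string, assumedYear: number): number | null {
  // ISO 8601 with an explicit Z, e.g. 2024-12-09T14:30:45.123Z
  const iso = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{3})?Z/.exec(line);
  if (iso) return Date.parse(iso[0]);

  // Geth format, e.g. [12-09|14:30:45.123] (month-day, no year, no zone)
  const geth = /\[(\d{2})-(\d{2})\|(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]/.exec(line);
  if (geth) {
    return Date.UTC(
      assumedYear,
      Number(geth[1]) - 1, // Date.UTC months are zero-based
      Number(geth[2]),
      Number(geth[3]),
      Number(geth[4]),
      Number(geth[5]),
      Number(geth[6])
    );
  }

  // General format, e.g. 2024-12-09 14:30:45 (no zone; treated as UTC here)
  const general = /(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/.exec(line);
  if (general) {
    return Date.UTC(
      Number(general[1]),
      Number(general[2]) - 1,
      Number(general[3]),
      Number(general[4]),
      Number(general[5]),
      Number(general[6])
    );
  }
  return null;
}
```

Trying the ISO pattern first matters only for clarity: the `T` separator keeps ISO lines from matching the general pattern, and the brackets keep Geth lines distinct from both.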
Component name normalization
const COMPONENT_NAME_MAP = {
'op-geth': 'op-geth',
'geth': 'op-geth',
'op-node': 'op-node',
'node': 'op-node',
'op-batcher': 'op-batcher',
'batcher': 'op-batcher',
'op-proposer': 'op-proposer',
'proposer': 'op-proposer',
};
Log level extraction
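const LOG_LEVEL_MAP = {
ERROR: { type: 'error', severity: 'high' },
ERR: { type: 'error', severity: 'high' },
FATAL: { type: 'error', severity: 'high' },
WARN: { type: 'warning', severity: 'medium' },
WARNING: { type: 'warning', severity: 'medium' },
};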
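The two maps above can be combined to classify a single log line. The sketch below is a simplified illustration: `classifyLogLine` and `levelOf` are hypothetical names, the line is split on whitespace and square brackets, and the first matching token wins, which means a component alias appearing in the message text could be picked up. The engine's actual matching rules are not described in this guide.

```typescript
const COMPONENT_NAME_MAP: { [name: string]: string } = {
  'op-geth': 'op-geth',
  geth: 'op-geth',
  'op-node': 'op-node',
  node: 'op-node',
  'op-batcher': 'op-batcher',
  batcher: 'op-batcher',
  'op-proposer': 'op-proposer',
  proposer: 'op-proposer',
};

interface Level {
  type: 'error' | 'warning';
  severity: 'high' | 'medium';
}

// Level mapping from this guide: ERROR/ERR/FATAL → error, WARN/WARNING → warning.
function levelOf(token: string): Level | null {
  switch (token) {
    case 'ERROR':
    case 'ERR':
    case 'FATAL':
      return { type: 'error', severity: 'high' };
    case 'WARN':
    case 'WARNING':
      return { type: 'warning', severity: 'medium' };
    default:
      return null;
  }
}

// Returns null for lines without an error or warning level (e.g. INFO).
function classifyLogLine(
  line: string
): { component: string | null; type: Level['type']; severity: Level['severity'] } | null {
  let level: Level | null = null;
  let component: string | null = null;
  for (const token of line.split(/[\s\[\]]+/)) {
    if (level === null) level = levelOf(token);
    const key = token.toLowerCase();
    // hasOwnProperty avoids matching inherited keys such as "constructor".
    if (component === null && Object.prototype.hasOwnProperty.call(COMPONENT_NAME_MAP, key)) {
      component = COMPONENT_NAME_MAP[key] ?? null;
    }
  }
  if (level === null) return null;
  return { component, type: level.type, severity: level.severity };
}
```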
Execution example
Step 1: Configure Timeline
Timeline Events (within 5 minutes):
[14:28:00] op-node ERROR L1 reorg detected
[14:28:30] op-node WARNING Derivation stalled
[14:29:00] op-geth METRIC TxPool: 1000 → 5000
[14:29:30] op-geth ERROR Connection refused
[14:30:00] op-batcher ERROR Batch submission failed
Step 2: AI Analysis
Prompt to be sent:
System: [RCA_SYSTEM_PROMPT includes architecture, patterns, etc.]
User:
== Event Timeline ==
[timeline JSON]
== Detected Anomalies ==
- txPoolPending: 5000 (z-score: 3.1, spike)
== Recent Metrics ==
[Metric Snapshot]
== Component Logs ==
[Log contents]
Analyze the above data and identify the root cause.
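Assembling the user message from its four sections could be sketched as follows. `buildUserPrompt`, the `AnomalySummary` fields, and the way logs are labeled per component are assumptions for illustration; only the section headers, the anomaly line format, and the closing instruction are taken from the prompt shown above.

```typescript
// Hypothetical summary shape for one detected anomaly.
interface AnomalySummary {
  metric: string;
  value: number;
  zScore: number;
  kind: string; // e.g. 'spike'
}

function buildUserPrompt(
  timeline: unknown[],
  anomalies: AnomalySummary[],
  metrics: unknown,
  logs: Record<string, string>
): string {
  const anomalyLines = anomalies.map(
    (a) => `- ${a.metric}: ${a.value} (z-score: ${a.zScore.toFixed(1)}, ${a.kind})`
  );
  // Assumption: each component's log text is prefixed with its name in brackets.
  const logLines = Object.keys(logs).map((name) => `[${name}]\n${logs[name]}`);
  return [
    '== Event Timeline ==',
    JSON.stringify(timeline, null, 2),
    '== Detected Anomalies ==',
    ...anomalyLines,
    '== Recent Metrics ==',
    JSON.stringify(metrics, null, 2),
    '== Component Logs ==',
    ...logLines,
    'Analyze the above data and identify the root cause.',
  ].join('\n');
}
```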
Claude responds:
{
"rootCause": {
"component": "op-node",
"description": "An L1 chain reorganization reset op-node's derivation state. This delayed op-geth execution and caused transactions to accumulate in the TxPool.",
"confidence": 0.85
},
"causalChain": [
{
"timestamp": 1733761680000,
"component": "l1",
"type": "error",
"description": "L1 reorg detected"
},
{
"timestamp": 1733761710000,
"component": "op-node",
"type": "error",
"description": "Derivation reset due to L1 reorg"
},
{
"timestamp": 1733761740000,
"component": "op-geth",
"type": "metric_anomaly",
"description": "TxPool accumulation (1000 → 5000)"
}
],
"affectedComponents": ["op-geth", "op-batcher"],
"remediation": {
"immediate": [
"Monitor L1 finality status",
"Check op-node derivation progress",
"Verify op-geth is catching up with pending transactions"
],
"preventive": [
"Increase watchdog timeout thresholds during L1 finality uncertainty",
"Implement automated derivation state validation",
"Set up alerts for L1 reorg patterns"
]
}
}
Step 3: Save results
{
"id": "rca-1733761845-abc123",
"rootCause": { ... },
"causalChain": [ ... ],
"affectedComponents": ["op-geth", "op-batcher"],
"timeline": [ ... ],
"remediation": { ... },
"generatedAt": "2024-12-09T14:30:45.678Z"
}
API usage
RCA Analysis Request
curl -X POST "http://localhost:3002/api/rca" \
-H "Content-Type: application/json" \
-d '{
"autoTriggered": false
}'
Response:
{
"success": true,
"result": {
"id": "rca-1733761845-abc123",
"rootCause": { ... },
"causalChain": [ ... ],
"affectedComponents": ["op-geth", "op-batcher"],
"timeline": [ ... ],
"remediation": {
"immediate": [ ... ],
"preventive": [ ... ]
},
"generatedAt": "2024-12-09T14:30:45.678Z"
}
}
RCA history search
# Recent 10 RCA analysis results
curl -s "http://localhost:3002/api/rca?limit=10" | jq '.history'
# Specific RCA analysis results
curl -s "http://localhost:3002/api/rca/rca-1733761845-abc123" | jq '.result'
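The same requests can be prepared from TypeScript. The sketch below only builds the URL and request options so that it stays independent of any HTTP client; `buildRcaRequest` and `historyUrl` are hypothetical helpers, and the endpoint paths and the `autoTriggered` body field are the ones used in the curl examples above.

```typescript
interface RcaRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

// Mirrors: curl -X POST "<baseUrl>/api/rca" -H "Content-Type: application/json" -d '{...}'
function buildRcaRequest(baseUrl: string, autoTriggered: boolean = false): RcaRequest {
  return {
    url: `${baseUrl}/api/rca`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ autoTriggered }),
    },
  };
}

// Mirrors: curl -s "<baseUrl>/api/rca?limit=<limit>"
function historyUrl(baseUrl: string, limit: number): string {
  return `${baseUrl}/api/rca?limit=${encodeURIComponent(String(limit))}`;
}
```

In an environment with a global `fetch` (browsers, Node 18 or later), the result can be passed on as `fetch(request.url, request.init)`.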
Performance optimization
Settings
/** Maximum number of history items */
const MAX_HISTORY_SIZE = 20;
/** AI call timeout */
const AI_TIMEOUT = 30000; // 30 seconds
/** Number of retries */
const MAX_RETRIES = 2;
/** Retry wait time */
retry_delay = 1000 * (attempt + 1); // linear backoff in milliseconds: 1 s, then 2 s
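How these settings interact can be sketched with a small retry wrapper. This is an illustration only: `withRetry`, `retryDelayMs`, and the injected `sleep` function are hypothetical, the sketch assumes `MAX_RETRIES = 2` means one initial attempt plus up to two retries, and the 30-second timeout is left out for brevity.

```typescript
const MAX_RETRIES = 2;

// Delay before the next attempt: 1000 ms after the first failure, 2000 ms after the second.
function retryDelayMs(attempt: number): number {
  return 1000 * (attempt + 1);
}

// Calls `call` until it succeeds or the retries are exhausted, then rethrows the last error.
// `sleep` is injected so callers (and tests) control how waiting is done.
async function withRetry<T>(
  call: () => Promise<T>,
  sleep: (ms: number) => Promise<void>
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      if (attempt < MAX_RETRIES) await sleep(retryDelayMs(attempt));
    }
  }
  throw lastError;
}
```

When every attempt fails, the thrown error is the point at which a caller would switch to the rule-based fallback.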
Timeline period
/** By default, only the most recent 5 minutes of data is analyzed */
buildTimeline(anomalies, logs, minutes = 5)
Fallback trigger conditions
The engine falls back when RCA analysis fails for any of these reasons:
- AI call failure (network error, timeout)
- JSON parsing failure
- AI response is not in the expected format
In these cases it automatically switches to rule-based analysis and reports a confidence of 0.3.
Related files
| File | Role |
|---|---|
| src/lib/rca-engine.ts | Main RCA Engine |
| src/types/rca.ts | Type definitions |
| src/app/api/rca/route.ts | API endpoint |
| src/lib/anomaly-detector.ts | Layer 1 anomaly detection |
| src/lib/ai-client.ts | AI calls (Claude) |
Summary of Key Features
- Component-centric Analysis: Based on the Optimism architecture
- Causal Chain Tracing: Traces from root cause to final symptom
- Dependency Graph: Automatic calculation of component dependencies
- AI-Powered: Claude-based semantic analysis
- Fallback Support: Rule-based analysis when the AI call fails
- Actionable Advice: Provides immediate and preventive actions
- History Management: Keeps the last 20 analysis results