SentinAI Docs

Anomaly Detection standards and operation method

spec/anomaly-detection-guide.md

πŸ“‹ Overview

SentinAI detects and responds to anomalies in Optimism L2 node metrics through a three-layer pipeline:

  1. Layer 1: Statistical-based anomaly detection (Z-Score + Rule-based)
  2. Layer 2: AI semantic analysis (Claude-based root cause analysis)
  3. Layer 3: Send notification (Slack/Webhook)

πŸ” Layer 1: Statistics-based anomaly detection

Overview

Layer 1 analyzes real-time metric data to detect immediate anomalies.

File: src/lib/anomaly-detector.ts

Detection Metrics

| Metric | Unit | Description |
| --- | --- | --- |
| cpuUsage | % | L2 node CPU utilization (0~100%) |
| txPoolPending | count | Number of pending transactions |
| gasUsedRatio | ratio | Block gas usage rate (0~1) |
| l2BlockHeight | number | Latest L2 block height |
| l2BlockInterval | seconds | Interval between consecutive blocks |
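The metrics above could be represented as a single collected sample, sketched below. Field names follow the table; the actual type definitions live in src/types/anomaly.ts and may differ.

```typescript
// Hypothetical shape of one collected metric sample; illustrative only.
interface MetricSample {
  timestamp: number;        // ms epoch of collection
  cpuUsage: number;         // percent, 0~100
  txPoolPending: number;    // count of pending transactions
  gasUsedRatio: number;     // ratio, 0~1
  l2BlockHeight: number;    // latest L2 block height
  l2BlockInterval: number;  // seconds between consecutive blocks
}

const sample: MetricSample = {
  timestamp: Date.now(),
  cpuUsage: 32,
  txPoolPending: 120,
  gasUsedRatio: 0.45,
  l2BlockHeight: 12340,
  l2BlockInterval: 2,
};
```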

Detection rules

1️⃣ Z-Score based detection (most common)

Criterion: the value deviates from the mean by more than 2.5 standard deviations

Z-Score = (Current value - Average) / Standard deviation

Detection Condition: |Z-Score| > 2.5

Example:

Average CPU utilization: 30%
Standard deviation: 5%
Current value: 50%

Z-Score = (50 - 30) / 5 = 4.0
β†’ Since 4.0 > 2.5, anomaly detected! (Spike)

Settings:

const Z_SCORE_THRESHOLD = 2.5;  // |Z| > 2.5 β‰ˆ 98.8% two-sided coverage for a normal distribution
const MIN_HISTORY_POINTS = 5;   // At least 5 historical data points required

Applies to:

  • CPU Usage (using Z-Score)
  • TxPool Pending (using Z-Score)
  • Gas Used Ratio (using Z-Score)
  • L2 Block Interval (using Z-Score)

2️⃣ CPU 0% Drop (Process Crash)

Criterion: CPU suddenly drops to 0%

Average CPU >= 10% over the last 3 data points
β†’ Current CPU < 1%
β†’ Suspected process crash

Settings:

if (currentCpu < 1 && recentMean >= 10) {
  // Judged as a process crash
}

Example:

Recent CPU readings: 35% β†’ 32% β†’ 38% (average 35%)
Current CPU: 0%

β†’ Anomaly detected! (Drop, rule: zero-drop)
β†’ Severity: Critical (process aborted)

3️⃣ L2 Block Height Plateau (Sequencer Stall)

Criterion: block height does not change for more than 2 minutes

All blocks have the same height for the last 2 minutes
β†’ Suspected sequencer stall

Settings:

const BLOCK_PLATEAU_SECONDS = 120;  // 2 minutes

// Check
if (allRecentHeightsSame && durationSeconds >= BLOCK_PLATEAU_SECONDS) {
  // Judged as a sequencer stall
}

Example:

| Time | Block height | Status |
| --- | --- | --- |
| 14:00 | 12340 | βœ“ |
| 14:01 | 12340 | βœ“ |
| 14:02 | 12340 | βœ“ ← no change for 2+ minutes |

β†’ Anomaly detected! (Plateau, rule: plateau)
β†’ Severity: High (sequencer stalled)

4️⃣ Monotonically increasing TxPool (Batcher failure)

Criterion: the transaction pool keeps growing for more than 5 minutes

All txPool values over the last 5 minutes increase or stay the same
β†’ Suspected batcher failure (transaction batches not being processed)

Settings:

const TXPOOL_MONOTONIC_SECONDS = 300;  // 5 minutes

// Check
let isMonotonic = true;
for (let i = 1; i < history.length; i++) {
  if (history[i] < history[i - 1]) {
    isMonotonic = false;  // A single decrease means normal
  }
}

if (isMonotonic && increase > 0) {
  // Judged as a batcher failure
}

Example:

| Time | TxPool | Status |
| --- | --- | --- |
| 00:00 | 100 | βœ“ |
| 01:00 | 150 | βœ“ (increase) |
| 02:00 | 180 | βœ“ (increase) |
| 03:00 | 190 | βœ“ (increase) |
| 04:00 | 195 | βœ“ (increase) |
| 05:00 | 200 | βœ“ (increase) ← growing for 5 minutes |

β†’ Anomaly detection! (Spike, rule: monotonic-increase)
β†’ Severity: High (Batcher batch not processed)

Detection priority

Detection order (to avoid duplicate detections):

1. CPU 0% Drop (most severe)
2. L2 Block Height Plateau
3. TxPool Monotonic Increase
4. Z-Score based detection (excluding metrics already detected in the rules above)
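A minimal sketch of this ordering, assuming each rule is a callback run in priority order with earlier rules claiming a metric before later ones. The names and types are hypothetical stand-ins for the real detector code.

```typescript
// Illustrative priority ordering with duplicate avoidance.
interface RuleAnomaly {
  metric: string;   // which metric fired
  rule: string;     // which rule detected it
}

function runDetection(checks: Array<() => RuleAnomaly | null>): RuleAnomaly[] {
  const detected: RuleAnomaly[] = [];
  const flagged = new Set<string>();   // metrics already flagged by an earlier rule
  for (const check of checks) {        // checks are ordered by priority
    const anomaly = check();
    if (anomaly && !flagged.has(anomaly.metric)) {
      flagged.add(anomaly.metric);     // later rules skip this metric
      detected.push(anomaly);
    }
  }
  return detected;
}
```

Because the zero-drop check runs first, a CPU crash is reported by the zero-drop rule even if the same CPU history would also trip the Z-Score rule.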

Exception handling

| Condition | Action |
| --- | --- |
| Historical data points < 5 | Skip detection (insufficient data) |
| Standard deviation = 0 | Z-Score = 0 (normal, no variation) |
| Metric already detected | Skip to avoid duplicate detection |

🧠 Layer 2: AI semantic analysis

Overview

When an anomaly is detected at Layer 1, Layer 2 uses Claude AI to analyze the root cause.

File: src/lib/anomaly-ai-analyzer.ts

Prompt Structure

System Prompt:
β”œβ”€ SRE role definition
β”œβ”€ Optimism component relationship diagram
β”œβ”€ Common failure patterns (5 types)
└─ Analysis guidelines

User Prompt:
β”œβ”€ List of detected abnormalities
β”œβ”€ Current metric data
└─ Related logs (op-geth, op-node, op-batcher, op-proposer)

Optimism component failure patterns

| Pattern | Cause | Symptom | Impact |
| --- | --- | --- | --- |
| L1 Reorg | L1 chain reorganization | op-node state reset β†’ temporary sync halt | Block height stagnation |
| L1 Gas Spike | L1 gas price surge | Batcher fails to submit batches to L1 | TxPool increase |
| op-geth Crash | op-geth process crash | CPU plummets to 0% | All downstream components affected |
| Network Partition | P2P network disconnection | Cannot communicate with peer nodes | Unsafe head divergence |
| Sequencer Stall | Sequencer stops | Block production halted | Block height stagnates, TxPool grows |

AI analysis results

interface DeepAnalysisResult {
  severity: 'low' | 'medium' | 'high' | 'critical';
  anomalyType: 'performance' | 'security' | 'consensus' | 'liveness';
  correlations: string[];           // Related symptoms
  predictedImpact: string;          // Expected impact
  suggestedActions: string[];       // Recommended actions
  relatedComponents: string[];      // Affected components
}

Example:

{
  "severity": "critical",
  "anomalyType": "liveness",
  "correlations": [
    "CPU 0% drop detected",
    "TxPool started increasing monotonically (batches unprocessed)"
  ],
  "predictedImpact": "op-geth is down, so all transaction processing is halted. User traffic is impacted.",
  "suggestedActions": [
    "Restart the op-geth process",
    "Check memory/disk space",
    "Review recent logs"
  ],
  "relatedComponents": [
    "op-geth",
    "op-node",
    "op-batcher"
  ]
}

Performance optimization

Caching:

const ANALYSIS_CACHE_TTL_MS = 5 * 60 * 1000;  // 5 minutes

// Do not re-analyze the same anomaly within 5 minutes

Rate Limiting:

const MIN_AI_CALL_INTERVAL_MS = 60 * 1000;  // 1 minute

// At most one AI call per minute
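The cache TTL and rate limit could be combined into a single gating check, sketched below. The function names, cache shape, and state variables are illustrative assumptions, not the actual anomaly-ai-analyzer.ts code.

```typescript
// Illustrative gating of AI calls by cache freshness and call interval.
const ANALYSIS_CACHE_TTL_MS = 5 * 60 * 1000;   // 5 minutes
const MIN_AI_CALL_INTERVAL_MS = 60 * 1000;     // 1 minute

const analysisCache = new Map<string, { at: number }>();
let lastAiCallAt = -Infinity;                  // no call has happened yet

function shouldCallAI(anomalyKey: string, now: number): boolean {
  const hit = analysisCache.get(anomalyKey);
  if (hit && now - hit.at < ANALYSIS_CACHE_TTL_MS) {
    return false;  // cached analysis of this anomaly is still fresh
  }
  if (now - lastAiCallAt < MIN_AI_CALL_INTERVAL_MS) {
    return false;  // global rate limit: at most one call per minute
  }
  return true;
}

function recordAiCall(anomalyKey: string, now: number): void {
  analysisCache.set(anomalyKey, { at: now });
  lastAiCallAt = now;
}
```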

πŸ“’ Layer 3: Sending notifications

Notification filtering

Conditions:

  1. AI analysis severity >= configured threshold
  2. Cooldown elapsed since the last notification

Settings:

interface AlertConfig {
  enabled: boolean;                    // Whether to enable notifications
  webhookUrl?: string;                 // Slack/Discord URL
  thresholds: {
    notifyOn: AISeverity[];            // Notification target severity (low/medium/high/critical)
    cooldownMinutes: number;           // Prevent duplicate notifications (minutes)
  };
}

Default:

notifyOn: ['high', 'critical']  // Notify only for high or critical
cooldownMinutes: 10             // 10-minute cooldown
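The two filtering conditions can be sketched as one predicate. The function name and parameters are illustrative, not the actual alert-dispatcher.ts API.

```typescript
// Illustrative notification filter: severity threshold + cooldown.
type AISeverity = 'low' | 'medium' | 'high' | 'critical';

function shouldNotify(
  severity: AISeverity,
  notifyOn: AISeverity[],
  lastNotifiedAt: number | null,   // ms epoch of the previous notification
  cooldownMinutes: number,
  now: number,
): boolean {
  if (!notifyOn.includes(severity)) {
    return false;                  // condition 1: severity not in notify list
  }
  if (
    lastNotifiedAt !== null &&
    now - lastNotifiedAt < cooldownMinutes * 60_000
  ) {
    return false;                  // condition 2: still within cooldown
  }
  return true;
}
```

With the defaults above, a `medium` anomaly never notifies, and a second `high` anomaly within 10 minutes of the first is suppressed.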

Notification Channel

| Channel | Use | Setting |
| --- | --- | --- |
| Slack | Operations team notifications | ALERT_WEBHOOK_URL |
| Webhook | External system integration | Custom URL |
| Dashboard | Dashboard display | Automatic recording |

πŸ“Š Entire pipeline

Metric collection (1-minute interval)
    ↓
Layer 1: Statistical detection (immediate)
    β”œβ”€ Z-Score check
    β”œβ”€ CPU 0% drop check
    β”œβ”€ Block plateau check
    └─ TxPool monotonic check
    ↓
[Anomaly detected?]
    β”‚
    β”œβ”€ YES β†’ Layer 2: AI analysis (at most once per minute)
    β”‚          β”œβ”€ Root cause analysis
    β”‚          β”œβ”€ Severity assessment
    β”‚          └─ Recommended actions
    β”‚          ↓
    β”‚        Layer 3: Notification dispatch (per settings)
    β”‚          └─ Severity >= threshold and cooldown elapsed
    β”‚
    └─ NO β†’ Normal (continue monitoring)

πŸ§ͺ Test example

Quick Test: Z-Score detection

# 1. Create Mock Data (Rising Trend)
curl -X POST "http://localhost:3002/api/metrics/seed?scenario=rising"

# 2. Check for anomalies
curl -s "http://localhost:3002/api/metrics" | jq '.anomalies'

# Expected results:
# [
#   {
#     "isAnomaly": true,
#     "metric": "cpuUsage",
#     "direction": "spike",
#     "zScore": 3.2,
#     "rule": "z-score"
#   }
# ]

Deep Test: AI Analytics

# 1. Query anomaly events
curl -s "http://localhost:3002/api/anomalies" | jq '.events[0]'

# 2. Check Layer 2 AI analytics
curl -s "http://localhost:3002/api/anomalies" | jq '.events[0].deepAnalysis'

# Expected results:
# {
#   "severity": "high",
#   "anomalyType": "performance",
#   "correlations": ["CPU spikes persist"],
#   "predictedImpact": "Possible block creation delay",
#   "suggestedActions": ["..."],
#   "relatedComponents": ["op-geth", "op-node"]
# }

βš™οΈ Customize settings

Environment variables

# Can be set in .env.local

# Adjust the Z-Score threshold (default 2.5)
# Edit Z_SCORE_THRESHOLD in anomaly-detector.ts

# Block Plateau Time (default 120 seconds)
# BLOCK_PLATEAU_SECONDS = 120

# TxPool Monotonic time (default 300 seconds)
# TXPOOL_MONOTONIC_SECONDS = 300

# Notification settings
# Can be configured in /api/anomalies/config

Change notification settings with API

curl -X PUT "http://localhost:3002/api/anomalies/config" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "webhookUrl": "https://hooks.slack.com/services/...",
    "thresholds": {
      "notifyOn": ["high", "critical"],
      "cooldownMinutes": 10
    }
  }'

πŸ“ˆ Reference values per metric

CPU Usage

| Status | CPU % | Description |
| --- | --- | --- |
| Normal | 20~40 | Typical L2 node |
| Load | 40~70 | High traffic |
| Danger | 70~99 | Imminent overload |
| Crash | 0~1 | Process aborted |

Block Interval

| Status | Interval | Description |
| --- | --- | --- |
| Normal | 2~4 seconds | Optimism standard |
| Slow | 4~10 seconds | Network delay |
| Very slow | 10~60 seconds | Severe congestion |
| Stopped | 60+ seconds | Sequencer stall |

TxPool Pending

| Status | Count | Description |
| --- | --- | --- |
| Normal | 0~1,000 | Normal load |
| High | 1,000~10,000 | Batcher delay |
| Very high | 10,000+ | Batcher failure |

| File | Role |
| --- | --- |
| src/lib/anomaly-detector.ts | Layer 1 statistical detection |
| src/lib/anomaly-ai-analyzer.ts | Layer 2 AI analysis |
| src/lib/alert-dispatcher.ts | Layer 3 notification dispatch |
| src/types/anomaly.ts | Type definitions |
| src/app/api/anomalies/route.ts | API endpoints |

πŸ“š Additional Resources