Server data from the Official MCP Registry
Multi-agent observability: cascade failure detection, heartbeats, and forensic replay
Valid MCP server (2 strong, 4 medium validity signals). No known CVEs in dependencies. Package registry verified. Imported from the Official MCP Registry.
Set these up before or after installing:
Environment variable: AGENTWATCH_DB
Add this to your MCP configuration file:
{
  "mcpServers": {
    "io-github-nicofains1-agentwatch": {
      "env": {
        "AGENTWATCH_DB": "your-agentwatch-db-here"
      },
      "args": [
        "-y",
        "pocketflow-monitor"
      ],
      "command": "npx"
    }
  }
}
From the project's GitHub README.
Your agent swarm crashed at 2am. You have logs from 10 agents and no idea which one started the cascade. AgentWatch tells you.
It tracks heartbeats, links actions across agents, walks backward from any failure to the root cause, and replays the full sequence. Works with any agent framework (CrewAI, AutoGen, LangGraph, PocketFlow, custom). Stores everything in a local SQLite file.
Early stage. Issues and feedback welcome: https://github.com/nicofains1/agentwatch/issues
No install needed:
npx @nicofains1/agentwatch demo
This seeds a 5-agent fleet, triggers a cascade failure, and shows you the full trace:
AgentWatch Fleet Dashboard
============================================================
Agents: 5 total | 3 healthy | 1 degraded | 1 error | 0 offline
Cascade Failure (4 steps, root cause: scheduler/dispatch-batch)
============================================================
[ROOT] scheduler/dispatch-batch [ok] 15ms
{"assigned_to": "fetcher"}
|
[ 1 ] fetcher/call-api [error] 30000ms
TIMEOUT after 30000ms
|
[ 2 ] processor/transform [error] 120ms
Error: input is null - expected array from fetcher
|
[FAIL] notifier/send-alert [error] 8ms
Error: no processed data to report
npm install @nicofains1/agentwatch
Requires Node 18+. Uses better-sqlite3 (native bindings, no external database needed).
import { AgentWatch } from '@nicofains1/agentwatch';

const aw = new AgentWatch(); // creates agentwatch.db in the current directory

// Report heartbeats from your agents
aw.report('agent-a', 'healthy');
aw.report('agent-b', 'healthy');

// Trace an action in agent-a
const traceId = aw.createTraceId();
const e1 = aw.trace(traceId, 'agent-a', 'fetch-data',
  'url=https://api.example.com', 'rows=150');

// Trace a dependent action in agent-b that fails
const e2 = aw.trace(traceId, 'agent-b', 'process',
  JSON.stringify({ rows: 150 }), 'Error: out of memory', {
    parentEventId: e1.id,
    status: 'error',
    durationMs: 4200,
  });

// Walk back to the root cause
const chain = aw.correlate(e2.id);
console.log(chain?.root_cause);
// -> { agent: 'agent-a', action: 'fetch-data', ... }

// Print fleet status
console.log(aw.dashboardText());
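In a real agent, the status, output, and duration passed to aw.trace usually come from running a step. The sketch below shows one possible way to wrap a synchronous step so those fields are filled in automatically. It is an illustration only: `tracedStep`, `TraceFn`, `TraceOptions`, and `TraceEventLike` are hypothetical names declared locally to mirror the trace call shown above, and the helper is not part of AgentWatch.

```typescript
// Local types mirroring the trace call shown above (assumed shapes, not the library's exports).
interface TraceOptions {
  parentEventId?: number;
  status?: 'ok' | 'error';
  durationMs?: number;
}
interface TraceEventLike {
  id: number;
}
type TraceFn = (
  traceId: string,
  agent: string,
  action: string,
  input: string,
  output: string,
  options?: TraceOptions,
) => TraceEventLike;

interface StepContext {
  traceId: string;
  agent: string;
  action: string;
  input: string;
  parentEventId?: number;
}

// Hypothetical helper: run a step, time it, and record one trace event for it.
function tracedStep<T>(
  trace: TraceFn,
  ctx: StepContext,
  step: () => T,
): { event: TraceEventLike; result?: T } {
  const started = Date.now();
  // Only include parentEventId when the caller supplied one.
  const link: TraceOptions =
    ctx.parentEventId === undefined ? {} : { parentEventId: ctx.parentEventId };
  try {
    const result = step();
    const event = trace(ctx.traceId, ctx.agent, ctx.action, ctx.input,
      JSON.stringify(result) ?? '',
      { ...link, status: 'ok', durationMs: Date.now() - started });
    return { event, result };
  } catch (err) {
    // Record the failure instead of rethrowing, so the caller can link follow-up events.
    const message = err instanceof Error ? err.message : String(err);
    const event = trace(ctx.traceId, ctx.agent, ctx.action, ctx.input,
      `Error: ${message}`,
      { ...link, status: 'error', durationMs: Date.now() - started });
    return { event };
  }
}
```

Under these assumptions, a call might look like `tracedStep((...args) => aw.trace(...args), { traceId, agent: 'agent-b', action: 'process', input, parentEventId: e1.id }, () => transform(rows))`, where `transform` stands in for your own code. An async variant would await the step before recording the event.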
Heartbeats - Each agent calls aw.report(name, status) on a schedule. AgentWatch tracks health over time and marks agents as stale or offline based on configurable thresholds.
Cross-agent tracing - Actions are linked by trace ID and optional parent event ID. When agent-c fails because agent-b sent bad data that came from agent-a, the full chain is queryable.
Cascade detection - correlate(failureEventId) walks backward from any failure to the root cause, returning the full chain with timing and output at each step.
Alert de-duplication - The same alert type from the same agent within a time window collapses into one entry with an incrementing count. Severity auto-escalates: info (1x) -> warning (3x) -> critical (10x).
Forensic replay - replay(traceId) returns all cascade chains within a trace. Useful for post-mortem analysis when a single trace touched multiple agents.
OpenTelemetry export - Export traces as OTEL spans (GenAI semantic conventions). Works with Jaeger, Grafana, or any OTEL-compatible backend. Requires optional peer deps.
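The backward walk described under cascade detection can be pictured as following parent links from the failing event until an event with no parent is reached. The sketch below is a standalone illustration of that idea, not AgentWatch's implementation: the `SketchEvent` shape and the `walkToRoot` function are hypothetical, and the sample events mirror the demo output shown earlier.

```typescript
// Hypothetical event shape for illustration; not AgentWatch's actual schema.
interface SketchEvent {
  id: number;
  agent: string;
  action: string;
  status: 'ok' | 'error';
  parentEventId?: number;
}

// Follow parent links from a failure back to the event that has no parent.
// Returns the chain ordered root first, failure last (empty if the id is unknown).
function walkToRoot(events: SketchEvent[], failureId: number): SketchEvent[] {
  const byId = new Map<number, SketchEvent>();
  for (const e of events) byId.set(e.id, e);

  const chain: SketchEvent[] = [];
  const seen = new Set<number>();
  let current = byId.get(failureId);
  while (current !== undefined && !seen.has(current.id)) {
    seen.add(current.id);   // guard against accidental cycles in the links
    chain.unshift(current); // prepend so the root ends up first
    current = current.parentEventId === undefined
      ? undefined
      : byId.get(current.parentEventId);
  }
  return chain;
}

const demoEvents: SketchEvent[] = [
  { id: 1, agent: 'scheduler', action: 'dispatch-batch', status: 'ok' },
  { id: 2, agent: 'fetcher', action: 'call-api', status: 'error', parentEventId: 1 },
  { id: 3, agent: 'processor', action: 'transform', status: 'error', parentEventId: 2 },
  { id: 4, agent: 'notifier', action: 'send-alert', status: 'error', parentEventId: 3 },
];

const demoChain = walkToRoot(demoEvents, 4);
console.log(demoChain.map((e) => `${e.agent}/${e.action}`).join(' -> '));
```

Note that the root of the chain can have status ok, as in the demo: the scheduler step succeeded, yet it is where the failing path began.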
npx @nicofains1/agentwatch demo # run the demo
npx @nicofains1/agentwatch dashboard # fleet health overview
npx @nicofains1/agentwatch cascade <event-id> # trace cascade from a failure
npx @nicofains1/agentwatch failures [agent] # list recent failures
npx @nicofains1/agentwatch alerts [agent] # list active alerts
npx @nicofains1/agentwatch replay <trace-id> # replay all cascades in a trace
npx @nicofains1/agentwatch mcp # start MCP server (stdio)
Set AGENTWATCH_DB to point to your database file. Default: agentwatch.db in the current directory.
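Because the default is relative to the current directory, the file that gets opened depends on where the process was started, which is presumably why the MCP configurations below use an absolute path. The sketch below illustrates that lookup order. It is a hypothetical helper, not AgentWatch's actual resolution code: `resolveDbPath` is an invented name, and the path joining assumes POSIX-style separators.

```typescript
// Hypothetical helper; AgentWatch's own lookup may differ in detail.
// Takes the environment and working directory as arguments so it is easy to test.
function resolveDbPath(
  env: Record<string, string | undefined>,
  cwd: string,
): string {
  const fromEnv = env['AGENTWATCH_DB'];
  if (fromEnv !== undefined && fromEnv.trim() !== '') {
    return fromEnv; // explicit setting wins
  }
  // Fall back to agentwatch.db in the working directory (POSIX separators assumed).
  const base = cwd.endsWith('/') ? cwd.slice(0, -1) : cwd;
  return `${base}/agentwatch.db`;
}

console.log(resolveDbPath({}, '/srv/agents'));
console.log(resolveDbPath({ AGENTWATCH_DB: '/data/aw.db' }, '/srv/agents'));
```

In Node, the same function could be called as `resolveDbPath(process.env, process.cwd())`.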
AgentWatch runs as an MCP server. Add it to your Claude Code or Cursor config:
Claude Code (~/.claude/claude_desktop_config.json or .claude/settings.json):
{
  "mcpServers": {
    "agentwatch": {
      "command": "npx",
      "args": ["@nicofains1/agentwatch", "mcp"],
      "env": {
        "AGENTWATCH_DB": "/absolute/path/to/agentwatch.db"
      }
    }
  }
}
Cursor (.cursor/mcp.json):
{
  "mcpServers": {
    "agentwatch": {
      "command": "npx",
      "args": ["@nicofains1/agentwatch", "mcp"],
      "env": {
        "AGENTWATCH_DB": "/absolute/path/to/agentwatch.db"
      }
    }
  }
}
This exposes 13 tools: agentwatch_dashboard, agentwatch_report_heartbeat, agentwatch_trace, agentwatch_cascade, agentwatch_replay, agentwatch_get_alerts, agentwatch_get_failures, agentwatch_get_trace, agentwatch_fleet_health, agentwatch_create_trace_id, agentwatch_alert, agentwatch_resolve_alert, agentwatch_dashboard_text.
const aw = new AgentWatch({
  db_path: 'agentwatch.db',     // SQLite file path
  alert_window_minutes: 30,     // de-dup window for alerts
  heartbeat_stale_minutes: 30,  // when to mark agents as offline
});
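To make the heartbeat threshold concrete, the sketch below shows one plausible reading of heartbeat_stale_minutes: an agent whose last heartbeat is older than the threshold is treated as offline, whatever status it last reported. This is an assumption for illustration, not the library's actual rule, and `effectiveStatus` is a hypothetical function.

```typescript
type ReportedStatus = 'healthy' | 'degraded' | 'error' | 'offline';

// Hypothetical rule: a heartbeat older than the threshold counts as offline.
// Times are epoch milliseconds, passed in so the function stays deterministic.
function effectiveStatus(
  lastStatus: ReportedStatus,
  lastSeenMs: number,
  nowMs: number,
  staleMinutes: number,
): ReportedStatus {
  const ageMs = nowMs - lastSeenMs;
  return ageMs > staleMinutes * 60000 ? 'offline' : lastStatus;
}

const lastSeen = Date.UTC(2024, 0, 1, 2, 0, 0);  // heartbeat at 02:00
const checkAt = Date.UTC(2024, 0, 1, 2, 45, 0);  // dashboard viewed at 02:45
console.log(effectiveStatus('healthy', lastSeen, checkAt, 30));
```

With a 30-minute threshold, an agent that reports every few minutes stays at its reported status, while one that has been silent for 45 minutes is shown as offline under this reading.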
aw.report(agent, status, context?) // status: 'healthy' | 'degraded' | 'error' | 'offline'
aw.getLatestHeartbeat(agent) // -> Heartbeat | undefined
aw.getFleetHealth() // -> AgentHealth[]
aw.createTraceId() // -> string (UUID)
aw.trace(traceId, agent, action, input, output, {
  parentEventId?: number,
  status?: 'ok' | 'error', // default: 'ok'
  durationMs?: number,
}) // -> TraceEvent
aw.getTraceEvents(traceId) // -> TraceEvent[]
aw.getRecentFailures(agent?, limit?) // -> TraceEvent[]
aw.correlate(failureEventId) // -> CascadeChain | null
aw.replay(traceId) // -> CascadeChain[]
aw.alert(agent, alertType, message)
aw.resolveAlert(alertId)
aw.activeAlerts(agent?) // -> Alert[]
aw.dashboard() // -> DashboardOutput (structured)
aw.dashboardText() // -> string (formatted for terminal)
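The alert de-duplication behind aw.alert (the same alert type from the same agent within a window collapses into one entry, with severity escalating at 1x, 3x, and 10x) could be modeled roughly as follows. This is a sketch under stated assumptions, not AgentWatch's code: `createAlertDeduper`, `SketchAlert`, and `severityFor` are hypothetical names, and measuring the window from the first occurrence is a guess.

```typescript
type Severity = 'info' | 'warning' | 'critical';

// Hypothetical alert shape for illustration only.
interface SketchAlert {
  agent: string;
  alertType: string;
  message: string;
  count: number;
  severity: Severity;
  firstSeenMs: number;
}

// Escalation thresholds taken from the description: info (1x), warning (3x), critical (10x).
function severityFor(count: number): Severity {
  if (count >= 10) return 'critical';
  if (count >= 3) return 'warning';
  return 'info';
}

// Returns a function that raises alerts, collapsing repeats inside the window.
function createAlertDeduper(windowMs: number) {
  const open = new Map<string, SketchAlert>();
  return function raise(
    agent: string,
    alertType: string,
    message: string,
    nowMs: number,
  ): SketchAlert {
    const key = `${agent}\u0000${alertType}`;
    const existing = open.get(key);
    // Assumption: the window is measured from the first occurrence.
    if (existing !== undefined && nowMs - existing.firstSeenMs <= windowMs) {
      existing.count += 1;
      existing.message = message;
      existing.severity = severityFor(existing.count);
      return existing;
    }
    const fresh: SketchAlert = {
      agent, alertType, message, count: 1, severity: 'info', firstSeenMs: nowMs,
    };
    open.set(key, fresh);
    return fresh;
  };
}

const raiseAlert = createAlertDeduper(30 * 60000); // 30-minute window
raiseAlert('fetcher', 'timeout', 'TIMEOUT after 30000ms', 0);
raiseAlert('fetcher', 'timeout', 'TIMEOUT after 30000ms', 1000);
const third = raiseAlert('fetcher', 'timeout', 'TIMEOUT after 30000ms', 2000);
console.log(third.count, third.severity); // logs: 3 warning
```

Keying on agent plus alert type means a different alert type from the same agent, or the same type from another agent, starts its own entry at count 1.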
Requires optional peer deps @opentelemetry/api and @opentelemetry/sdk-trace-base.
await aw.exportTraceToOtel(traceId, { serviceName: 'my-agents' });
await aw.exportRecentToOtel(1); // last 1 hour
SQLite via better-sqlite3. The database file is created automatically on first use. WAL mode is on for concurrent reads.
Tables: heartbeats, trace_events, alerts.
License: MIT