Debugging and evaluation for agentic AI systems

Observability startup Raindrop AI has released “Workshop,” an open-source, MIT-licensed debugging and evaluation tool aimed squarely at the emerging era of agentic AI systems.

Positioned as a local debugger for AI agents, Workshop gives developers a unified way to inspect everything their agent does in real time—from token generation to tool calls and decision pathways. All activity is streamed into a lightweight, local SQLite database (.db file) and visualized through a dashboard typically hosted at localhost:5899.

At its core, the system functions as a local daemon and UI layer that captures full execution traces as they happen. Developers can replay an agent’s behavior, inspect failures, and pinpoint where reasoning or execution went off track. According to Raindrop co-founder and CTO Ben Hylak, a former Apple and SpaceX engineer, the entire trace is stored in a single compact database file, minimizing overhead while preserving full fidelity. Unlike cloud-based observability stacks, Workshop keeps all telemetry local, addressing growing concerns around privacy and the risks of shipping sensitive agent traces to external services.

Installation is designed to be frictionless. Workshop can be deployed via a one-line shell command that configures binaries and updates PATH settings for bash, zsh, and fish shells. For developers who prefer transparency or customization, the project is fully open source on GitHub and built using the Bun runtime. The tool is available across macOS, Linux, and Windows.

Beyond observability, Workshop introduces a concept Raindrop calls a “self-healing eval loop.” In this workflow, coding agents such as Claude Code can read execution traces, generate evaluation tests, and iteratively correct failures in the underlying code or prompts. For example, if an AI veterinary assistant fails to ask critical follow-up questions, Workshop captures the full interaction. An agent can then analyze the trace, write an evaluation that flags the missing logic, and re-run the system until the behavior conforms to expected outcomes.

The tool is designed to integrate with a broad AI development ecosystem, including TypeScript, Python, Rust, and Go, as well as frameworks such as the Vercel AI SDK, OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI. It also supports agentic coding environments including Claude Code, Cursor, Devin, and OpenCode. Released under the permissive MIT license, Workshop is intended to encourage broad adoption, community contributions, and enterprise deployment without vendor lock-in. Hylak described the project as an effort to provide a “sane” local debugging layer for autonomous systems, fundamentally changing how developers build and validate agent behavior.

One way to unpack Workshop’s “self-healing eval loop” is to view it as an attempt to close the feedback gap between agent execution and agent evaluation. In traditional LLM applications, traces are collected, reviewed asynchronously, and manually converted into test cases or prompt fixes. Workshop compresses this into a tighter loop: execution telemetry → structured trace → LLM-generated evaluation → automated retry. Architecturally, this resembles an always-on “test harness generator” sitting alongside the agent runtime. The key technical dependency is not the logging layer itself, but the reliability of the intermediate representation: if traces are incomplete, overly verbose, or semantically ambiguous, the downstream evals become unstable. In effect, Workshop is betting that agent behavior can be faithfully reconstructed as deterministic enough artifacts to allow another model to write meaningful assertions about it.

The more contentious question is whether “self-healing” holds up beyond controlled environments. In production systems, agent failures are often non-deterministic, context-dependent, and influenced by external tools, APIs, and real-world state changes. This introduces a fundamental tension: eval loops are strongest when failure modes are repeatable and clearly specifiable, but weakest when errors are emergent or stochastic. There is also a risk of recursive optimization-agents may learn to satisfy the eval rather than the underlying business intent, producing brittle “overfitted” behavior that passes tests but degrades in unseen scenarios. In that sense, Workshop’s loop may be less a true self-healing mechanism and more a rapid hypothesis-testing engine for agent behavior. Its production value will likely depend on how rigorously teams design eval boundaries, rather than on whether the loop itself can autonomously “fix” systems at scale.