Benchmarking AI Agents for Observability: The o11y-bench Approach

2026-05-03 05:01:37

The Challenge of Evaluating AI in Observability

Assessing the performance of AI agents is inherently difficult. When these agents are applied to observability workflows, the complexity multiplies. While modern AI models have demonstrated impressive gains in coding and tool use, observability presents unique hurdles. In a real incident, the critical skills are not limited to writing a correct query. They involve determining which signal matters, distinguishing a real anomaly from background noise, correlating metrics, logs, and traces, and modifying dashboards without disrupting other engineers' work.

Why General Benchmarks Fall Short

Standard benchmarks often measure straightforward tool-calling abilities, but observability tasks are far from simple. Root-cause investigations and dashboard creation rely on the interaction between vast datasets, time ranges, and saved application states. This complexity makes it hard to verify whether an AI agent has truly completed a task. For instance, a query might be syntactically correct yet select the wrong data series; a dashboard may render but be saved with errors. These subtle mistakes can have serious consequences in production environments.
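To make this concrete, consider a small illustration (not part of o11y-bench itself): two PromQL queries that both parse cleanly yet return different data. A syntax-only check passes both; only comparing the data each query actually returns reveals the mistake. In the minimal Python sketch below, the Prometheus endpoint, metric, and label values are assumptions for illustration.

import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed local Prometheus-compatible endpoint

def query_result(query: str) -> dict:
    # Run an instant query and map each matched series' label set to its sample value.
    resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
    resp.raise_for_status()
    return {
        tuple(sorted(s["metric"].items())): s["value"][1]
        for s in resp.json()["data"]["result"]
    }

# Both queries are valid PromQL, but the first measures idle CPU time
# while the task asks for busy CPU time.
wrong = 'avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'
right = 'avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))'

# A grader that only validates syntax accepts both; comparing the returned
# data shows they answer different questions.
print(query_result(wrong) == query_result(right))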

Introducing o11y-bench

To help the observability community navigate this new era of AI assistance, we are open-sourcing o11y-bench (available at github.com/grafana/o11y-bench). This benchmark is designed specifically to evaluate AI agents on real-world observability workflows. It runs agents against a live Grafana stack that includes the Grafana MCP server, and grades them on a curated set of tasks within that environment.

What It Tests

The benchmark focuses on the tasks that matter most in practice: writing queries against metrics, logs, and traces; investigating the root cause of incidents; and creating or modifying dashboards without breaking what other engineers rely on.

Each task is scored against precise criteria that reflect real operational needs, not just syntactic accuracy.
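To give a flavor of what outcome-based scoring can look like, here is a minimal sketch in Python. It is not the actual o11y-bench grading code: it fetches a dashboard the agent saved via the Grafana HTTP API and checks that some panel actually queries the metric the task asked for. The URL, token, dashboard UID, and metric name are placeholders.

import requests

GRAFANA_URL = "http://localhost:3000"   # assumed sandboxed Grafana instance
API_TOKEN = "..."                        # service-account token for that instance
DASHBOARD_UID = "cpu-usage"              # hypothetical UID named in the task

def dashboard_queries(uid: str) -> list:
    # Return the query expression of every panel target in the saved dashboard.
    resp = requests.get(
        f"{GRAFANA_URL}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    panels = resp.json()["dashboard"].get("panels", [])
    return [t.get("expr", "") for p in panels for t in p.get("targets", [])]

# Pass only if some panel queries the metric the task asked for,
# regardless of how the agent phrased the query.
ok = any("node_cpu_seconds_total" in expr for expr in dashboard_queries(DASHBOARD_UID))
print("task passed:", ok)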

Built on Harbor

o11y-bench is built on Harbor, an open-source framework that standardizes environments for benchmarking agents on focused task sets. By leveraging Harbor, o11y-bench provides a sandboxed environment where models and agent harnesses can be run alongside a Grafana Docker container pre-loaded with synthetic metrics, logs, and traces. Getting started is as simple as running a single command:

mise run bench:job -- --model openai/gpt-5.4-nano --task-name query-cpu-metrics --agent opencode

Open Source and Reproducible

We believe transparency is essential for building trust in AI systems. By open-sourcing the tasks, environment, grading logic, and results, we enable the community to inspect, reproduce, and challenge the benchmark. This approach also helps model developers improve their systems by providing clear feedback on observability-specific skills.

Why This Matters for Grafana Users

Standardized measurement offers critical insights. It helps users distinguish between an agent that looks good in a demo and one that can be trusted in real workflows. In observability, dangerous mistakes are often subtle—choosing the wrong metric, misinterpreting a trace, or corrupting a shared dashboard. o11y-bench exposes these weaknesses in a controlled setting.

Getting Involved

We invite the community to use o11y-bench, contribute new tasks, and share feedback. This benchmark is just the beginning of a collaborative effort to make AI-assisted observability both powerful and trustworthy.
