
Scaling Code Review with AI: Cloudflare's Multi-Agent Orchestration

2026-05-03 23:15:42

Introduction

Code review is a cornerstone of modern software development, catching bugs early and spreading knowledge across teams. Yet it can also become a bottleneck, with merge requests languishing in queues as reviewers struggle to context-switch. At Cloudflare, the median wait for a first review often stretched into hours. To address this, we built an AI-powered code review system that uses a coordinated team of specialized agents, dramatically reducing review times while maintaining high quality. This article details our journey from experimentation to production, sharing the architecture and lessons learned.

Source: blog.cloudflare.com

The Problem with Traditional Code Review

Merge requests can stall for many reasons: reviewer availability, cognitive load from context-switching, and an endless cycle of nitpicks and revisions. We saw this firsthand across thousands of internal projects. While automated tools like linters help, they only catch surface-level issues. We needed something that could understand code semantics, flag real bugs, and scale across our diverse codebases.

Early Attempts: From Off-the-Shelf Tools to Naive Prompts

Our first step was evaluating existing AI code review tools. Many worked well and offered customization, but none provided the flexibility needed for an organization of Cloudflare's size. So we pivoted to a DIY approach: feeding git diffs into a large language model with a generic prompt. The results were noisy—vague suggestions, hallucinated syntax errors, and irrelevant advice like “consider adding error handling” on functions that already had it. Clearly, a naive approach wouldn't work for complex codebases.
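That first prototype was not much more than the following. This is a minimal illustrative sketch, not our actual prompt or pipeline: `call_llm` is a hypothetical stand-in for whatever model API is used, and the prompt text is invented for illustration.

```python
# Naive single-prompt review: dump the entire git diff into one generic prompt.
# `call_llm` is a hypothetical stand-in for a real model API call.

GENERIC_PROMPT = (
    "You are a code reviewer. Review the following git diff and list any "
    "problems you find:\n\n{diff}"
)

def naive_review(diff: str, call_llm) -> str:
    """Feed the raw diff to a single model with no per-domain context."""
    return call_llm(GENERIC_PROMPT.format(diff=diff))

# With no specialization, the model gets no signal about the language,
# project conventions, or which issue classes matter, so its output is noisy.
```

The lack of any structure around the model is exactly what made the output unusable at scale.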

The Solution: Multi-Agent Orchestration

Instead of building a monolithic reviewer, we created a CI-native orchestration system atop OpenCode, an open-source coding agent. Now, when a Cloudflare engineer opens a merge request, it gets an initial pass from a coordinated team of up to seven specialized AI agents, each focused on a specific concern, with their findings merged by a coordinator agent.

How the Coordinator Works

The coordinator agent is the linchpin. It collects outputs from all specialists, removes duplicates, evaluates the true severity of each issue (e.g., blocking vs. advisory), and compiles a single, readable comment. This prevents the noise of multiple overlapping suggestions and gives engineers a clear action list. The system can automatically approve clean code, flag real bugs, and even block merges when it detects serious problems or security vulnerabilities.
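The coordinator's pipeline can be sketched roughly as follows. This is an illustrative sketch, not the production implementation: the `Finding` shape, the severity labels, and the verdict names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    """One issue reported by a specialist agent (assumed shape)."""
    file: str
    line: int
    message: str
    severity: str  # "advisory" or "blocking" (assumed labels)

def coordinate(findings: list[Finding]) -> tuple[str, str]:
    """Dedupe specialist findings, judge severity, emit one verdict + comment."""
    # Deduplicate: several specialists may flag the same issue at the same spot.
    unique = list({(f.file, f.line, f.message): f for f in findings}.values())
    # Put blocking issues first so the action list reads top-down by urgency.
    unique.sort(key=lambda f: (f.severity != "blocking", f.file, f.line))
    if any(f.severity == "blocking" for f in unique):
        verdict = "block"
    elif unique:
        verdict = "comment"
    else:
        verdict = "approve"  # clean code is auto-approved
    comment = "\n".join(
        f"[{f.severity}] {f.file}:{f.line}: {f.message}" for f in unique
    )
    return verdict, comment
```

The key design point is that engineers only ever see the single merged comment, never the raw per-specialist output.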


Results and Impact

We've run this system internally across tens of thousands of merge requests, where it has automatically approved clean changes and surfaced real bugs before a human reviewer ever looked at the code.

This system is part of our broader Code Orange: Fail Small initiative, aimed at improving engineering resiliency.

Architecture Deep Dive

Building an LLM-powered system at the heart of CI/CD presented unique challenges. We had to handle model latency, API failures, and varying output formats. Our architecture uses a plugin-based design: each specialist is a modular plugin with a specific prompt and context. The coordinator uses a lightweight LLM call to merge results. This modularity lets us add or swap agents without rebuilding the whole system. We also implemented guardrails to prevent the system from becoming a blocker—for example, if the coordinator times out, the review defaults to a human-friendly summary.

Lessons Learned

We discovered that:

  1. Specialization beats generalization. A single model with a massive prompt produced worse results than multiple targeted models.
  2. Deduplication is critical. Without it, engineers would ignore the output as noise.
  3. Severity estimation requires careful tuning. Overly aggressive blocking erodes trust.
  4. The system must be fast. Engineers won't wait minutes for an AI review during a hotfix.

Conclusion

AI-assisted code review can be both scalable and reliable when built as an orchestration of specialized agents rather than a monolithic black box. At Cloudflare, this system has cut review wait times, caught real bugs, and become a trusted part of our development workflow. We're excited to continue refining it and sharing our findings with the community.

For more details, see the sections above on our early attempts and the architecture deep dive.
