Debugging a CUBIC Congestion Window Stall in QUIC: A Step-by-Step Guide
Introduction
When a QUIC connection using CUBIC as its congestion controller experiences congestion collapse and its congestion window (cwnd) stays pinned at the minimum value indefinitely, throughput is crippled. This guide walks through the real-world debugging process that uncovered a subtle bug in the app-limited exclusion logic—a port of a Linux kernel fix that went wrong in QUIC. By following these steps, you'll learn how to identify the symptom, trace it to the misbehaving code, apply a minimal fix, and verify recovery. This process applies to any CCA (Congestion Control Algorithm) implementation where the interaction between sender limitation and window growth can create a logical dead end.

What You Need
- Knowledge of CUBIC congestion control (RFC 9438)
- Familiarity with QUIC transport protocol and quiche (Cloudflare's QUIC implementation)
- Access to a QUIC integration test environment that can simulate early heavy loss
- Source code of the QUIC stack you're investigating (e.g., quiche on GitHub)
- Ability to inspect congestion window (cwnd) and state transitions during a connection
- Understanding of the concept of "app-limited" vs "network-limited" flows
Step 1: Recognize the Symptom – Test Failures in High-Loss Scenarios
Start by observing your test suite. In the original case, an integration test failed approximately 61% of the time. These failures only appeared when early congestion (heavy packet loss) drove the CUBIC window down to its minimum. Look for these characteristics in your own test logs:
- The connection recovers from loss but then never allows the cwnd to grow again.
- Throughput remains flat even after congestion clears.
- The issue is not present in steady-state or growth-phase tests.
If your tests show these symptoms, you likely have a bug in the recovery logic of your CUBIC implementation after a congestion collapse.
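As a quick triage aid, a detector along these lines can flag a pinned cwnd from logged samples. This is a hypothetical helper, not part of quiche; it simply encodes the symptom described above (the window never rises off the floor after a settling period):

```python
def cwnd_is_pinned(samples, min_cwnd, settle_rounds=10):
    """Return True if cwnd never rises above min_cwnd after an
    initial settling period -- the signature of the stall.

    samples: cwnd values, one per round trip, oldest first.
    min_cwnd: the implementation's floor (in the same units as samples).
    """
    tail = samples[settle_rounds:]
    return bool(tail) and all(cwnd <= min_cwnd for cwnd in tail)

# A healthy trace recovers after the loss event; a stalled one stays at the floor.
healthy = [40, 10, 2, 2, 3, 4, 6, 9, 13, 19, 28, 40]
stalled = [40, 10] + [2] * 10

print(cwnd_is_pinned(healthy, min_cwnd=2))  # False
print(cwnd_is_pinned(stalled, min_cwnd=2))  # True
```

Feed it per-round cwnd samples exported from your qlog or debug traces; the exact sampling mechanism depends on your stack.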
Step 2: Reproduce the Failure Under Controlled Conditions
To isolate the bug, create a minimal test that forces early heavy loss followed by a period of no loss. For example:
- Set up a QUIC connection between a client and server using CUBIC as the default congestion controller.
- Inject a burst of packet drops (say 50% loss) during the first few round trips.
- After the loss event, stop all further drops so the network is perfectly clean.
- Measure the cwnd over time. Expect it to recover slowly to a higher value. If it remains at the post-loss minimum, you've reproduced the bug.
Using a fixed, scriptable network emulator (e.g., tc netem or a custom delay/loss module) makes the reproduction deterministic rather than flaky.
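One way to keep the reproduction deterministic is to precompute which packet numbers get dropped instead of rolling dice per packet. The sketch below (an illustrative helper, not tied to any particular emulator API) produces the "heavy early loss, then quiet" shape described above:

```python
def make_loss_schedule(total_packets, lossy_prefix, drop_rate=0.5):
    """Deterministically decide which packet numbers to drop.

    Packets 0..lossy_prefix-1 are subject to loss (every Nth packet,
    where N = 1/drop_rate); everything after the prefix is delivered
    cleanly, so any lingering cwnd stall is the sender's fault alone.
    """
    period = round(1 / drop_rate)
    return {
        pn for pn in range(total_packets)
        if pn < lossy_prefix and pn % period == 0
    }

drops = make_loss_schedule(total_packets=100, lossy_prefix=20)
print(len(drops))            # 10 -- half of the first 20 packets
print(max(drops) < 20)       # True -- no losses after the burst
```

Because the schedule is a pure function of packet number, every run of the test sees exactly the same loss pattern.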
Step 3: Understand CUBIC's Idle/App-Limited Exclusion Logic
Read the part of RFC 9438 that covers behavior for application-limited flows. The idea: when a connection is app-limited (i.e., the sender has no more data to send, so it's not fully utilizing the window), CUBIC should not count that round trip for window growth. This prevents the cwnd from inflating artificially when the application simply has nothing to send.
In Linux kernel TCP, a fix was added to ensure that after an app-limited period, CUBIC resets its growth state. This fix worked for TCP because TCP's stack had additional checks. When this same logic was ported naively to QUIC (as in quiche), it introduced a deadlock: after a congestion collapse, the cwnd drops to its minimum (e.g., 2 packets). The connection becomes app-limited because the tiny window can't fill the pipe, so the app-limited exclusion constantly triggers, preventing any growth. As a result, the window never leaves the minimum.
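The deadlock described above can be captured in a toy model. This is deliberately not quiche's real code—just the feedback loop in miniature: while the window sits at the floor the sender registers as app-limited, and an unconditional exclusion then skips the growth step every round:

```python
MIN_CWND = 2  # packets; the post-collapse floor

def run_rounds(cwnd, rounds, exclusion_active):
    """Toy model of the post-collapse loop. Each round, the sender is
    treated as app-limited while the window sits at the floor (it can't
    fill the pipe), and an active exclusion then skips growth entirely."""
    for _ in range(rounds):
        app_limited = cwnd <= MIN_CWND
        if exclusion_active and app_limited:
            continue  # exclusion fires: no growth this round
        cwnd += 1  # stand-in for CUBIC's actual window increase
    return cwnd

print(run_rounds(MIN_CWND, rounds=50, exclusion_active=True))   # 2: pinned forever
print(run_rounds(MIN_CWND, rounds=50, exclusion_active=False))  # 52: growth resumes
```

The point of the model is the fixed point: with the exclusion active, the state that triggers the exclusion is also the state the exclusion prevents the sender from leaving.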
Step 4: Inspect the Code – Find Where App-Limited Exclusion Is Applied
In your QUIC implementation, locate the portion of the CUBIC code that checks whether the current round is app-limited. Typically, this involves a flag set when the number of bytes sent in a round is less than the cwnd. Search for something like:

if (app_limited) { skip_cwnd_growth(); }
In quiche's CUBIC, the problematic code reset a counter or state variable when an app-limited round was detected. That variable was also used to decide whether to grow after recovery. The fix must ensure that after a congestion event, this state is not reset in a way that permanently inhibits growth.
Step 5: Craft the One-Line Fix
The original fix was a nearly one-line change: do not apply the app-limited exclusion during the final part of recovery. Specifically, in the code path handling the end of the recovery phase (when the window should start increasing again), remove the check that was causing the reset. For example:
// Before: after loss recovery, re-enter app-limited check
if (app_limited) return;
// After: in recovery exit, always allow at least one growth step
// (omit the app_limited early return)
This change broke the deadlock by letting CUBIC increase its window just enough to exit the app-limited state, after which normal growth resumes.
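The shape of the fix can be sketched as a single extra condition. The names below are illustrative, not quiche's actual identifiers: the app-limited early return is bypassed on the recovery-exit path, guaranteeing at least one growth step that breaks the deadlock while normal rounds still honor the exclusion:

```python
def grow_after_recovery(cwnd, app_limited, exiting_recovery):
    """Sketch of the fix's shape (hypothetical names): skip the
    app-limited early return only when exiting recovery, so the
    window is guaranteed one step off the floor."""
    if app_limited and not exiting_recovery:
        return cwnd  # normal rounds still honor the exclusion
    return cwnd + 1  # recovery exit always gets one growth step

# With the buggy unconditional exclusion, the first call would return 2;
# with the fix, the window takes its first step off the floor.
print(grow_after_recovery(2, app_limited=True, exiting_recovery=True))   # 3
print(grow_after_recovery(5, app_limited=True, exiting_recovery=False))  # 5
```

Once the window has taken that first step, it can fill more of the pipe, the app-limited condition clears on its own, and ordinary CUBIC growth takes over.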
Step 6: Verify the Fix and Perform Regression Testing
After applying the change:
- Rerun the reproduction test from Step 2. The cwnd should now gradually increase after the loss event.
- Run the full integration test suite. The previously failing test (61% failure) should now pass consistently.
- Check that the fix does not break other CUBIC behaviors: steady-state growth, loss responses, and idle periods should still work correctly.
- Perform performance benchmarking to confirm throughput is no longer artificially capped.
If all tests pass, the fix is safe to deploy.
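To keep this corner case covered permanently, the Step 2 reproduction can be turned into a regression assertion. A minimal sketch (hypothetical helper; the trace format and recovery threshold are assumptions you'd tune to your stack):

```python
def assert_cwnd_recovers(trace, min_cwnd, factor=4):
    """Regression check: after the window bottoms out, it must climb
    well clear of the floor (here, at least factor * min_cwnd).

    trace: per-round cwnd samples from the loss reproduction.
    """
    bottom = trace.index(min(trace))
    peak_after_loss = max(trace[bottom:])
    assert peak_after_loss >= factor * min_cwnd, (
        f"cwnd stalled: peaked at {peak_after_loss}, "
        f"expected >= {factor * min_cwnd}"
    )

# Passes on a recovering trace; raises AssertionError on a pinned one.
assert_cwnd_recovers([40, 10, 2, 3, 5, 8, 12, 20], min_cwnd=2)
print("recovery check passed")
```

Wiring this into CI means any future change that reintroduces the stall fails loudly instead of showing up as an intermittent throughput regression.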
Tips for a Clean Resolution
- Always test recovery paths: Congestion collapse near minimum cwnd is a corner case many tests skip. Include it in your validation suite.
- Understand the origin of patches: Porting kernel fixes to userspace QUIC demands extra care because assumptions differ (e.g., TCP vs QUIC pacing, application handling).
- Keep the fix minimal: A one-line change is less likely to introduce new bugs. Isolate the exact condition that causes the loop.
- Monitor after deployment: Even with thorough testing, monitor metrics like cwnd evolution and loss recovery rates to catch regressions in production.
- Document the edge case: Your future self (and colleagues) will thank you for noting why the app-limited exclusion cannot apply during the early post-recovery phase.
This guide is based on a real bug found in Cloudflare's quiche. The root cause was a misunderstanding of how CUBIC's app-limited exclusion interacts with the minimum window after a collapse. Applying these steps will help you solve similar issues in your own CCA implementations.