Overcoming the Zigzag: How Momentum Accelerates Gradient Descent


The Curvature Conundrum in Gradient Descent

Gradient descent is the workhorse of optimization, yet it harbors a fundamental limitation that becomes painfully obvious on real-world loss surfaces. These surfaces rarely have perfectly spherical contours; instead, they exhibit uneven curvature—steep in one direction and nearly flat in another. This anisotropic condition forces the algorithm into an inefficient pattern known as “zigzagging.”

[Image source: www.marktechpost.com]

When the learning rate is set high, the optimizer makes rapid progress along the flat direction but overshoots and oscillates wildly along the steep one. Conversely, a low learning rate stabilizes the steep direction but slows convergence to a crawl. This trade-off is not an edge case—it is the norm for standard gradient descent. The underlying culprit is the condition number of the Hessian matrix, which quantifies the ratio of the largest to smallest curvature. A high condition number (e.g., 100) means the surface is 100 times more curved in one direction than another, creating the perfect storm for inefficiency.
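The condition number described above is easy to compute directly. The snippet below builds the Hessian of the quadratic surface used later in this article, L(x, y) = 0.05x² + 5y², and confirms its condition number is 100:

```python
import numpy as np

# Hessian of L(x, y) = 0.05*x**2 + 5*y**2 (the surface used in the
# simulation section): second derivatives are 0.1 and 10.
H = np.array([[0.1, 0.0],
              [0.0, 10.0]])

eigenvalues = np.linalg.eigvalsh(H)
condition_number = eigenvalues.max() / eigenvalues.min()
print(condition_number)  # 100.0
```

A condition number this large means no single learning rate suits both axes: the steep axis caps the step size while the flat axis starves for progress.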

How Momentum Breaks the Trade-Off

Momentum addresses this challenge by incorporating historical gradient information. Instead of relying solely on the current gradient to update the parameters, it maintains a running average—often called the velocity—and uses this accumulated direction for the step. The effect is twofold:

  - Gradient components that point consistently in the same direction accumulate, accelerating progress along flat, low-curvature axes.
  - Components that flip sign from step to step—the oscillations along steep axes—average out and largely cancel.

This mechanism effectively smooths the optimization trajectory, enabling the solver to take larger effective steps without diverging. Many practitioners view momentum as a “memory” that helps the optimizer retain past directions, much like a ball rolling down a hill gains inertia.

The Update Equations in Action

The formal difference lies in the update rule. For vanilla gradient descent, the parameter update is simply the negative gradient scaled by the learning rate. With momentum, we introduce a velocity variable v that decays over time (controlled by a coefficient β typically around 0.9):

  1. Velocity update: v = β·v + (1−β)·∇L (where ∇L is the gradient of the loss)
  2. Parameter update: θ = θ − α·v (where α is the learning rate)

The β coefficient determines how much past gradients influence the current direction. A high β (e.g., 0.99) preserves a long memory, but can lead to overshooting if not carefully tuned.
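The two-line update rule above can be sketched directly in code. This is a minimal implementation of the EMA-style momentum update from the article, applied once to the stretched-bowl loss; the values α = 0.18 and β = 0.9 match those used in the simulation section:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.18, beta=0.9):
    """One momentum update: v <- beta*v + (1-beta)*grad, theta <- theta - alpha*v."""
    v = beta * v + (1.0 - beta) * grad
    theta = theta - alpha * v
    return theta, v

def grad_L(theta):
    # Gradient of L(x, y) = 0.05*x**2 + 5*y**2
    return np.array([0.1 * theta[0], 10.0 * theta[1]])

theta = np.array([5.0, 5.0])   # starting point used in the simulation
v = np.zeros(2)                # velocity starts at zero
theta, v = momentum_step(theta, v, grad_L(theta))
print(theta)  # [4.991, 4.1]
```

Note that on the first step the velocity is just (1−β) times the gradient, so momentum starts cautiously and only builds up speed as gradients keep agreeing.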

A Controlled Simulation: An Anisotropic Surface

To visualize the difference, consider a stretched bowl loss surface defined by the function L(x,y) = 0.05x² + 5y². Here, the x-direction is nearly flat (curvature 0.1), while the y-direction is steep (curvature 10). The Hessian matrix is diagonal with eigenvalues 0.1 and 10, giving a condition number of exactly 100—a classic example that forces gradient descent into zigzagging.

The learning rate α is chosen deliberately at 0.18. The stability limit for vanilla GD is 2/λ_max = 2/10 = 0.2; any higher and the optimizer diverges outright. At 0.18, the steep axis update factor is |1 − 10×0.18| = 0.8, meaning the optimizer overshoots and reverses direction every step. For the flat axis, the factor is |1 − 0.1×0.18| = 0.982, recovering only 1.8% of the remaining distance per step. This combination—oscillation in one direction, near-stagnation in the other—is the worst-case scenario that momentum is designed to fix.
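The per-axis arithmetic above follows from a standard fact about quadratics: on an axis with curvature λ, each vanilla GD step multiplies that coordinate by (1 − αλ). A few lines verify the factors quoted above:

```python
alpha = 0.18
lambda_steep, lambda_flat = 10.0, 0.1

# On a quadratic axis with curvature lam, vanilla GD multiplies the
# coordinate by (1 - alpha*lam) each step; |factor| < 1 means convergence.
factor_steep = 1.0 - alpha * lambda_steep  # about -0.8: sign flips each step (oscillation)
factor_flat = 1.0 - alpha * lambda_flat    # about 0.982: only 1.8% progress per step

print(abs(factor_steep), factor_flat)
```

The negative sign on the steep-axis factor is the zigzag itself: the coordinate overshoots zero and reverses every iteration, shrinking by only 20% each time.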

Simulation Results: Steps to Convergence

Both optimizers start from the same initial point (x = 5, y = 5), and we count the number of steps each needs to reach a low-loss region.

The improvement with β=0.9 is modest but consistent—a 14% reduction in steps. However, the failure at β=0.99 highlights a key caveat: too much momentum can cause the optimizer to overshoot the minimum, especially on steep surfaces, leading to divergence or endless oscillation.
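A minimal sketch of the experiment is below. Setting β = 0 recovers vanilla gradient descent (the velocity then equals the raw gradient), so one routine covers both optimizers. The exact step counts depend on the convergence threshold and momentum parametrization chosen, so treat the 14% figure above as illustrative rather than exact:

```python
import numpy as np

def run(beta, alpha=0.18, tol=1e-3, max_steps=10_000):
    """Count steps until L = 0.05*x**2 + 5*y**2 drops below tol.

    beta=0.0 reduces to vanilla gradient descent, since the velocity
    then equals the current gradient.
    """
    theta = np.array([5.0, 5.0])
    v = np.zeros(2)
    for step in range(1, max_steps + 1):
        grad = np.array([0.1 * theta[0], 10.0 * theta[1]])
        v = beta * v + (1.0 - beta) * grad
        theta = theta - alpha * v
        if 0.05 * theta[0] ** 2 + 5.0 * theta[1] ** 2 < tol:
            return step
    return None  # did not converge within max_steps

for beta in (0.0, 0.9, 0.99):
    print(f"beta={beta}: {run(beta)} steps")
```

Because the flat axis contracts by only about 0.982 per step for vanilla GD, the β = 0 baseline needs a couple of hundred steps even on this tiny 2D problem.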

Practical Implications and Tuning Tips

Understanding why gradient descent zigzags and how momentum fixes it helps practitioners make informed choices:

  - A β around 0.9 is a sound default: it preserves enough history to damp oscillations without risking runaway overshoot.
  - If training oscillates or diverges after adding momentum, reduce β or the learning rate α first; as the β = 0.99 case shows, too long a memory backfires on steep surfaces.
  - On badly conditioned problems (high condition number), momentum is often the cheapest remedy to try before reaching for a different optimizer.

Momentum does not eliminate the fundamental trade-off, but it shifts the boundary, allowing faster progress without sacrificing stability. The zigzag becomes a gentle sway, and convergence becomes more reliable.

Conclusion

Gradient descent’s inefficiency on anisotropic loss surfaces stems from the conflicting demands of steep and flat directions. Momentum elegantly resolves this by accumulating gradient history, reinforcing consistent directions while canceling oscillations. As the controlled simulation shows, this can cut the number of steps by roughly 14% when tuned correctly. However, excessive momentum (e.g., β = 0.99) can backfire, underscoring the need for careful hyperparameter selection. By understanding the mechanics behind the zigzag and its fix, you can optimize your models faster and with fewer headaches.
