Overcoming Gradient Descent Oscillations with Momentum


Gradient descent is a cornerstone optimization algorithm, but it often struggles on real-world loss surfaces that have uneven curvature—steep in one direction and flat in another. This imbalance forces a trade-off between stability and speed, leading to inefficient zigzagging. Momentum offers a clever fix by incorporating past gradients to smooth out updates. Below, we answer key questions about why this zigzagging happens and how momentum resolves it.

What causes gradient descent to zigzag on loss surfaces?

Gradient descent zigzags when the loss surface has significantly different curvatures along different axes, a situation known as anisotropy. On such a surface, the loss curves steeply in one direction (e.g., along the y-axis) and is nearly flat in another (e.g., along the x-axis). Standard gradient descent updates parameters using only the current gradient, which is large in the steep direction and tiny in the flat direction. With a fixed learning rate, the optimizer overshoots in the steep direction, reversing direction each step, while barely moving in the flat direction. This creates a back-and-forth oscillation that slows convergence and wastes steps. The issue is quantified by the condition number: the ratio of the largest to smallest curvature eigenvalues. A high condition number (e.g., 100) means the surface is 100 times more curved in one direction than another, forcing the algorithm into this inefficient zigzag pattern.
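The zigzag is easy to reproduce on a toy quadratic. The sketch below (illustrative code, not from the article) uses curvatures 0.1 and 10, matching the 100:1 surface discussed throughout:

```python
# Minimal sketch of vanilla GD on an anisotropic quadratic
# f(x, y) = 0.5 * (0.1 * x**2 + 10 * y**2); the curvatures 0.1 and 10
# are illustrative values for the 100:1 surface in the text.

def grad(x, y):
    return 0.1 * x, 10.0 * y

x, y = 10.0, 1.0
lr = 0.18  # close to the stability limit 2 / 10 = 0.2
for step in range(5):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy
    # y flips sign every step (the zigzag); x shrinks by only 1.8% per step
    print(f"step {step}: x = {x:.3f}, y = {y:+.3f}")
```

The printed trace shows y alternating between positive and negative values while x crawls toward zero, which is exactly the oscillation described above.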

Source: www.marktechpost.com

Why does a high learning rate cause oscillations?

A high learning rate amplifies the overshooting problem in steep directions. On an anisotropic surface, the usable learning rate is limited by the steepest curvature: stability requires η < 2/λmax, where λmax is the largest eigenvalue of the Hessian. If you choose a learning rate close to this limit (e.g., η = 0.18 when λmax = 10), the per-step update factor 1 − λmax·η in the steep direction becomes negative (1 − 10 × 0.18 = −0.8), so each step overshoots the minimum and reverses direction while shrinking the error by only a factor of 0.8. In the flat direction, the same learning rate yields a factor near 1 (1 − 0.1 × 0.18 = 0.982), so progress is glacial: only 1.8% of the remaining distance is covered per step. Thus a high learning rate trades speed in the flat direction for violent oscillation in the steep direction, while lowering it stabilizes the steep direction but makes convergence agonizingly slow everywhere. This trade-off is inherent to vanilla gradient descent on ill-conditioned surfaces.
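The trade-off can be checked numerically. This sketch evaluates the per-direction contraction factor for the eigenvalues from the text (0.1 and 10) at a few illustrative learning rates:

```python
# Per-direction contraction factor (1 - lambda * eta) for the surface
# described in the text (eigenvalues 0.1 and 10). A negative steep factor
# means overshoot-and-reverse; a flat factor near 1 means crawling progress.
for eta in (0.01, 0.1, 0.18):
    steep = 1 - 10.0 * eta
    flat = 1 - 0.1 * eta
    print(f"eta={eta}: steep factor {steep:+.3f}, flat factor {flat:.4f}")
```

No single learning rate makes both factors small: shrinking the flat factor pushes the steep factor toward and past zero into oscillation.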

How does momentum help reduce zigzagging?

Momentum mitigates zigzagging by maintaining a velocity—a running average of past gradients—rather than relying solely on the current gradient. When gradients consistently point in the same direction (as in the flat region), the velocity builds up, allowing larger, more decisive steps. In contrast, oscillating gradients (as in the steep direction) tend to cancel each other out when averaged, dampening the back-and-forth motion. This dual effect accelerates progress along flat directions while stabilizing updates along steep ones. The velocity is typically controlled by a hyperparameter β (momentum coefficient), which determines how much past gradients influence the current update. A typical value is 0.9, giving a smooth trade-off. By effectively smoothing the gradient signal, momentum lets the optimizer escape the narrow stability constraints of vanilla gradient descent, often converging faster and with less erratic behavior.
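A quick way to see both effects is to feed a running average two synthetic gradient streams, one consistent and one oscillating (illustrative values, not the article's code):

```python
# Velocity as a running average: consistent gradients accumulate,
# oscillating gradients largely cancel. beta = 0.9 as in the text.
beta = 0.9
v_consistent = v_oscillating = 0.0
for t in range(50):
    v_consistent = beta * v_consistent + 1.0            # gradient always +1
    v_oscillating = beta * v_oscillating + (-1.0) ** t  # gradient alternates +/-1
print(f"consistent:  {v_consistent:.2f}")   # approaches 1 / (1 - beta) = 10
print(f"oscillating: {v_oscillating:.2f}")  # stays near 1 / (1 + beta), about 0.53 in magnitude
```

The consistent stream builds a velocity roughly ten times a single gradient, while the alternating stream's velocity never exceeds about half a gradient: acceleration in flat directions, damping in steep ones.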

What are the update equations for momentum?

Momentum introduces an intermediate variable, often called velocity (v), that accumulates past gradients. The update equations at each step t are:

v_t = β · v_{t−1} + ∇f(θ_{t−1})
θ_t = θ_{t−1} − η · v_t

Here, η is the learning rate, ∇f is the gradient of the loss function, and β (between 0 and 1) controls how much past gradients are retained. A β of 0 recovers vanilla gradient descent. Standard practice uses β ≈ 0.9. The velocity is initialized as zero. This formulation means that if gradients consistently point in the same direction, the velocity grows, accelerating movement. If gradients oscillate, the velocity remains small, reducing overshoot. The net effect is a more directed update that cuts through oscillations.
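Translated into code, the velocity update looks like the sketch below, again on the illustrative quadratic with curvatures 0.1 and 10 (assumed values, matching the surface discussed earlier):

```python
# Momentum on the illustrative anisotropic quadratic (curvatures 0.1 and 10).
def grad(x, y):
    return 0.1 * x, 10.0 * y

eta, beta = 0.18, 0.9
x, y = 10.0, 1.0
vx = vy = 0.0  # velocity initialized to zero
for _ in range(200):
    gx, gy = grad(x, y)
    vx = beta * vx + gx                 # v_t = beta * v_{t-1} + grad
    vy = beta * vy + gy
    x, y = x - eta * vx, y - eta * vy   # theta_t = theta_{t-1} - eta * v_t
print(f"after 200 steps: x = {x:.5f}, y = {y:.5f}")  # both near the minimum at 0
```

Setting beta = 0.0 in this loop recovers vanilla gradient descent exactly, since the velocity then equals the current gradient.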

How does the condition number affect gradient descent performance?

The condition number, defined as the ratio of the largest to smallest eigenvalue of the Hessian matrix, quantifies how ill-conditioned a loss surface is. For a surface that is 100 times more curved in one direction than another, the condition number is 100. A high condition number forces vanilla gradient descent into a narrow stability region: the learning rate must stay below 2/λmax to avoid divergence, yet that rate is far too small for the flat direction, causing slow progress. The steep direction sees overshooting and oscillation because its update factor 1 − λmax·η is negative with magnitude near 1, while the flat direction has an update factor extremely close to 1, barely decreasing the loss. The condition number directly dictates how many steps are needed to converge; on a 100:1 surface, vanilla GD requires hundreds of steps. Methods like momentum improve the effective conditioning by making the optimization trajectory less sensitive to this ratio.
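These quantities are straightforward to compute. The sketch below uses the eigenvalues from the text and an assumed 1000x error-reduction target to estimate the step count in the flat direction:

```python
import math

# Condition number and the vanilla-GD stability limit (values from the text).
lam_max, lam_min = 10.0, 0.1
kappa = lam_max / lam_min    # condition number = 100
eta_limit = 2.0 / lam_max    # vanilla GD diverges above this (0.2)

# Steps for the flat direction to shrink its error 1000x at eta = 0.18
# (the 1000x target is an illustrative assumption):
eta = 0.18
steps = math.log(1e-3) / math.log(1 - lam_min * eta)
print(kappa, eta_limit, round(steps))  # hundreds of steps in the flat direction
```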

What were the results of the simulation comparing vanilla GD and momentum?

In a simulation on a controlled anisotropic loss surface (with eigenvalues 0.1 and 10, condition number 100), vanilla gradient descent with a learning rate of 0.18 required 185 steps to converge. Momentum with β = 0.9 achieved convergence in 159 steps, demonstrating a notable speedup. However, a momentum coefficient of 0.99 failed to converge entirely—the algorithm became unstable because the high momentum caused excessive overshooting in the steep direction, leading to divergence. This highlights that while momentum can improve convergence, an excessively high β can reintroduce instability, especially on surfaces where the gradient in the steep direction is large. The simulation used a fixed step count and tracked position changes, confirming that momentum's smoothing effect works best with moderate β values. The result underscores the importance of tuning the momentum parameter to balance acceleration and stability.
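The article does not include its simulation code or exact stopping rule, so the sketch below uses an assumed criterion (distance to the minimum below 1e-3) on the same toy surface. Absolute step counts therefore will not match the article's 185 and 159; the relative ordering is the point:

```python
# Hedged re-creation of the comparison. The convergence criterion here
# (distance to the minimum below 1e-3) is an assumption, so absolute
# step counts will differ from the article's reported 185 vs. 159.
def run(beta, eta=0.18, max_steps=1000, tol=1e-3):
    x, y, vx, vy = 10.0, 1.0, 0.0, 0.0
    for step in range(1, max_steps + 1):
        gx, gy = 0.1 * x, 10.0 * y
        vx, vy = beta * vx + gx, beta * vy + gy
        x, y = x - eta * vx, y - eta * vy
        if (x * x + y * y) ** 0.5 < tol:
            return step
    return None  # no convergence within the step budget

for beta in (0.0, 0.9, 0.99):
    print(f"beta = {beta}: converged at step {run(beta)}")
```

With this setup, beta = 0.9 reaches the threshold in far fewer steps than vanilla GD (beta = 0.0), while beta = 0.99 oscillates for so long that it misses the step budget entirely.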

Why did a momentum coefficient of 0.99 fail to converge?

A momentum coefficient β of 0.99 assigns extremely high weight to past gradients, making the velocity very persistent. On the anisotropic surface, the gradient in the steep direction is consistently large and oscillatory (due to overshooting). With β close to 1, these large past gradients dominate the velocity, causing the update to overshoot even more severely. Instead of canceling out, the oscillations amplify because the velocity retains a strong memory of the previous overshoot. Over a few steps, the velocity grows unbounded, leading to parameter updates that leap out of the region of interest. The algorithm diverges because the effective step size becomes too large for stability. In contrast, a moderate β (like 0.9) allows the velocity to respond more quickly to current gradients, balancing momentum with damping. This failure case illustrates that momentum is not a panacea—it must be carefully tuned to the surface's curvature to avoid catastrophic accumulation of past gradient information.
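The persistence is visible in the steep coordinate alone. This illustrative sketch (same toy surface as above, curvature 10, eta = 0.18) compares the worst late-stage overshoot for the two momentum coefficients:

```python
# Oscillation of the steep coordinate (curvature 10) under momentum.
# With beta = 0.99 the velocity's long memory keeps the overshoot alive;
# with beta = 0.9 the oscillation dies out. eta = 0.18 as in the text.
def peak_late_oscillation(beta, eta=0.18, steps=60, window=10):
    y, v = 1.0, 0.0
    history = []
    for _ in range(steps):
        v = beta * v + 10.0 * y   # gradient in the steep direction is 10 * y
        y = y - eta * v
        history.append(abs(y))
    return max(history[-window:])  # worst overshoot in the last few steps

print(f"beta=0.90: {peak_late_oscillation(0.90):.4f}")  # nearly damped out
print(f"beta=0.99: {peak_late_oscillation(0.99):.4f}")  # still overshooting hard
```

After 60 steps the beta = 0.9 run has shrunk its overshoot to a few percent of the starting error, while the beta = 0.99 run is still swinging at nearly full amplitude, which is why a long run at that setting never settles.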
