
Why Traditional Weather Models Still Outperform AI for Extreme Events: A Technical Guide

2026-05-03 09:29:42

Overview

Extreme weather events—such as record-breaking heatwaves, cold snaps, and windstorms—cost hundreds of billions of dollars annually in damages. Accurate forecasting of these events is critical for early warning systems that save lives and reduce economic losses. While artificial intelligence (AI)-based weather models have shown impressive skill for routine forecasts, a recent study published in Science Advances reveals a critical limitation: AI models systematically underestimate both the frequency and intensity of record-breaking extreme weather compared to traditional physics-based models.

Source: www.carbonbrief.org

This guide explores the reasons behind this underperformance, explains the methodology used in the study, and provides step-by-step instructions for evaluating model performance on extreme events. It also highlights common pitfalls to avoid when interpreting or deploying AI weather models for extreme-event forecasting. By the end, you will understand why traditional models remain indispensable for capturing the most unusual and dangerous weather.

Jump to: Prerequisites | Step-by-step Instructions | Common Mistakes | Summary

Prerequisites

Before working through the steps, you will need:

  1. Access to ERA5 reanalysis data to serve as the ground truth.
  2. Forecast output from at least one physics-based model (e.g., ECMWF's IFS) and one AI model (e.g., GraphCast or Pangu-Weather).
  3. A scientific computing environment (e.g., Python with NumPy and SciPy) for computing verification metrics and fitting extreme value distributions.
  4. Basic familiarity with verification metrics (bias, RMSE) and with extreme value theory.

Step-by-step Instructions for Evaluating Model Performance on Extreme Weather

Step 1: Understand the Core Difference Between AI and Physics-Based Models

Physics-based models (e.g., ECMWF's IFS) solve the fundamental equations of atmospheric dynamics—conservation of mass, momentum, and energy—on a three-dimensional grid, representing sub-grid processes such as convection and turbulence through parameterizations. AI models (e.g., GraphCast, Pangu-Weather) instead learn statistical patterns from historical data using neural networks. They do not enforce physical laws; they rely solely on correlations in their training data. This distinction is crucial: physics-based models can simulate conditions never seen before (e.g., unprecedented heat), while AI models are constrained to the range of patterns present in their training dataset.
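The extrapolation limitation can be illustrated with a toy example (not the study's method): a purely pattern-matching predictor, here a simple nearest-analog average over a synthetic "historical" record, can never forecast a value beyond the maximum it was trained on.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(20, 5, size=10_000)   # synthetic "historical" daily temperatures, degC
train_max = train.max()

def analog_forecast(x, k=50):
    """Predict by averaging the k closest historical values (pure pattern matching)."""
    nearest = train[np.argsort(np.abs(train - x))[:k]]
    return nearest.mean()

# An unprecedented heat anomaly well above anything in the training record:
unprecedented = train_max + 8.0
forecast = analog_forecast(unprecedented)
print(f"truth={unprecedented:.1f}  forecast={forecast:.1f}  "
      f"shortfall={unprecedented - forecast:.1f}")
# The analog forecast is an average of training values, so it can never
# exceed the historical maximum and under-predicts record-breaking values.
```

Real AI weather models are far more sophisticated than this analog scheme, but the same structural constraint applies: their outputs are shaped by the distribution of the training data.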

Step 2: Select Benchmark Extreme Events for Testing

The study focused on record-breaking events from 2018 and 2020—two years with numerous extreme records globally. For each event, define the variable (temperature, wind speed) and location. Use reanalysis data (ERA5) as the ground truth. Identify the top 10 hottest days, coldest days, and windiest days per location. Record the actual observed magnitude and date.
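As a minimal sketch of the selection step, the following uses a synthetic daily temperature series standing in for an ERA5 grid point (a real workflow would load ERA5, e.g., via xarray) and extracts the ten hottest days with their dates and magnitudes:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for an ERA5 daily 2m-temperature series at one grid point (degC),
# with a crude seasonal cycle plus noise.
dates = np.arange("2018-01-01", "2021-01-01", dtype="datetime64[D]")
doy = np.arange(dates.size) % 365.25
temps = 15 + 10 * np.sin(2 * np.pi * doy / 365.25) + rng.normal(0, 3, dates.size)

k = 10
hottest_idx = np.argsort(temps)[-k:][::-1]     # indices of the 10 hottest days
benchmark = list(zip(dates[hottest_idx], temps[hottest_idx]))
for d, t in benchmark[:3]:                      # record date and observed magnitude
    print(d, round(float(t), 1))
```

The same pattern, with `np.argsort` applied to the negated series, yields the coldest days; applying it to wind speed yields the windiest.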

Step 3: Obtain Forecasts from Both Model Types

For each benchmark event, retrieve forecasts from both model classes initialized at matching times. Use operational or reforecast output from a physics-based system such as ECMWF's IFS, and run openly available AI models such as GraphCast or Pangu-Weather from the same initial conditions. Collect several lead times (e.g., 1, 3, 5, and 10 days) and regrid all output to a common resolution so that the comparison is like-for-like.

Step 4: Compute Bias and Error Metrics for Extreme Thresholds

For each event, compare the model's forecasted value (e.g., 2-meter temperature) to the ERA5 truth, focusing on values exceeding the 99th percentile (record-breaking). Calculate:

  1. Bias: the mean of forecast minus observed over the extreme cases; a negative value indicates systematic underestimation.
  2. RMSE restricted to the extreme subset.
  3. Hit rate: the fraction of observed threshold exceedances that the forecast also places above the threshold.
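These tail-restricted metrics can be computed in a few lines. The sketch below uses synthetic data, with a surrogate "model" that damps extremes toward the mean, to show the shape of the computation; in practice `truth` and `forecast` would be matched ERA5 and model values:

```python
import numpy as np

rng = np.random.default_rng(2)
truth = rng.normal(20, 5, 5_000)                     # stand-in for ERA5 2m temperature
forecast = 0.9 * truth + rng.normal(0, 1.5, 5_000)   # surrogate model that damps extremes

thr = np.quantile(truth, 0.99)                       # 99th-percentile threshold
extreme = truth >= thr                               # observed record-range cases

bias = np.mean(forecast[extreme] - truth[extreme])   # negative => underestimation
rmse = np.sqrt(np.mean((forecast[extreme] - truth[extreme]) ** 2))
hit_rate = np.mean(forecast[extreme] >= thr)         # exceedances the model also forecasts

print(f"bias={bias:.2f}  rmse={rmse:.2f}  hit_rate={hit_rate:.2f}")
```

Note that the metrics are conditioned on the *observed* extremes; conditioning on forecast extremes instead would answer a different question (false-alarm behavior).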

Step 5: Apply Extreme Value Theory (EVT) for Statistical Rigor

Standard skill scores (RMSE, CRPS) are dominated by common weather. Use EVT to assess tail behavior. Fit a generalized Pareto distribution (GPD) to the exceedances above a high threshold (e.g., 95th percentile). Compare the scale and shape parameters between models and truth. A physics-based model should have a thicker tail (more realistic extreme magnitude) than an AI model, which tends to underestimate tail heaviness.
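A peaks-over-threshold GPD fit is available in SciPy. The sketch below compares the fitted tail of a synthetic "truth" series with that of a tail-damping surrogate (standing in for an AI model); the surrogate's smaller GPD scale parameter reflects its thinner tail:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
truth = rng.normal(20, 5, 50_000)
ai = 0.85 * truth + rng.normal(0, 0.5, 50_000)       # tail-damping "AI" surrogate

def gpd_tail(x, q=0.95):
    """Fit a generalized Pareto distribution to exceedances above the q-quantile."""
    u = np.quantile(x, q)
    exceed = x[x > u] - u                            # peaks over threshold
    shape, loc, scale = stats.genpareto.fit(exceed, floc=0)
    return shape, scale

xi_truth, sigma_truth = gpd_tail(truth)
xi_ai, sigma_ai = gpd_tail(ai)
print(f"truth: shape={xi_truth:.3f} scale={sigma_truth:.2f}")
print(f"ai   : shape={xi_ai:.3f} scale={sigma_ai:.2f}")
```

Fixing the GPD location at zero (`floc=0`) is the standard choice for exceedance data, since the threshold has already been subtracted.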


Step 6: Visualize the Results

Create a scatterplot of observed vs. forecasted extreme values. Overlay a 1:1 line. Physics-based models usually cluster near the line for extremes, while AI models show a systematic downward bias for the highest values. Also plot return level curves (e.g., 10-year return period) to see how much each model under- or overestimates the magnitude of rare events.
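A minimal version of the scatterplot, again on synthetic data with a near-unbiased "physics" surrogate and a tail-damped "AI" surrogate, can be produced with Matplotlib:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
obs = np.sort(rng.normal(20, 5, 2_000))[-100:]       # the 100 most extreme observed values
physics = obs + rng.normal(0, 1.0, 100)              # near-unbiased at the tail
ai = 0.9 * obs + rng.normal(0, 1.0, 100)             # systematic low bias at the tail

fig, ax = plt.subplots()
ax.scatter(obs, physics, s=12, label="physics-based")
ax.scatter(obs, ai, s=12, label="AI")
lims = [obs.min(), obs.max()]
ax.plot(lims, lims, "k--", label="1:1 line")          # perfect-forecast reference
ax.set_xlabel("observed extreme (degC)")
ax.set_ylabel("forecast (degC)")
ax.legend()
fig.savefig("extremes_scatter.png", dpi=120)
print("AI mean bias at tail:", round(float(np.mean(ai - obs)), 2))
```

In a plot like this, the downward bias shows up as the AI points falling increasingly below the 1:1 line as the observed values grow.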

Step 7: Interpret the Findings

The study found that AI models underestimated both the frequency (fewer predicted extreme events) and intensity (weaker extreme values) of record-breaking hot, cold, and windy events. For example, during the 2018 heatwave over northern Europe, GraphCast predicted temperatures up to 3°C cooler than observed. This is because the training data (1979–2017) did not include such extreme anomalies; the AI model “regressed toward the mean.” Physics-based models, constrained by energy conservation, can generate physically consistent extremes even if unseen.

Common Mistakes

  1. Assuming AI models are always better because they win on average skill scores: Typical verification metrics (RMSE, CRPS) weight every grid point equally, even non-extreme weather. A model can have excellent average performance yet completely miss the tail. Always test extremes separately.
  2. Using the same training data for decades without updating extremes: AI models trained on historical data may become outdated as climate change shifts distributions. Regular retraining with recent extremes can partially mitigate this, but physics-based models remain robust.
  3. Ignoring lead time dependence: AI models may perform better at short leads (1–3 days) but degrade faster for extremes at longer leads. Compare performance across multiple lead times.
  4. Treating all AI models as identical: Some newer architectures (e.g., FourCastNet) incorporate equivariance to spatial symmetries, but the fundamental limitation on unprecedented events remains due to statistical learning.
  5. Overlooking verification bias: If the training data and verification data come from the same reanalysis, AI models may appear artificially good. Use independent observations (e.g., station data) for validation.
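Mistake 1 is easy to demonstrate numerically. In this synthetic sketch, model B beats model A on overall RMSE yet is clearly worse on the 99th-percentile tail, because it pulls record values back toward the climatological mean:

```python
import numpy as np

rng = np.random.default_rng(5)
truth = rng.normal(20, 5, 50_000)
# Model A: unbiased but noisy everywhere.
a = truth + rng.normal(0, 2.0, 50_000)
# Model B: sharp on average, but damps the tail toward the mean (~20 degC).
b = np.where(truth > np.quantile(truth, 0.99),
             0.8 * truth + 4.0,                      # fixed point at 20 degC
             truth + rng.normal(0, 1.0, 50_000))

def rmse(f, o):
    return float(np.sqrt(np.mean((f - o) ** 2)))

thr = np.quantile(truth, 0.99)
tail = truth >= thr
print("overall RMSE:", rmse(a, truth), rmse(b, truth))   # B wins on average
print("tail RMSE   :", rmse(a[tail], truth[tail]), rmse(b[tail], truth[tail]))
```

Because only 1% of points are extreme, the overall score barely registers B's tail failure, which is exactly why extremes must be verified separately.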

Summary

AI weather models have revolutionized routine forecasting by offering speed and strong average accuracy, but this guide demonstrates that traditional physics-based models remain essential for record-breaking extreme events. The key reason is that AI models rely on historical data and cannot reliably extrapolate beyond the range of their training set, while physics-based models encode fundamental laws that permit unprecedented extremes. When evaluating or deploying weather models, always test performance on the extreme tails using extreme value theory. A hybrid approach—using AI for initial guidance and physics-based models for extreme-event verification—is recommended. The study is a "warning shot" against a hasty replacement of established systems. For the most dangerous weather, physics still wins.
