Please Don’t Tell Me You’re Still Using MAPE!
- Yvonne Badulescu
- Feb 16
- 7 min read
Provocative title, I know... but over the past year I have learned a great deal about how this forecast error measure is misused, and I would like to share what I've learned and what you can use instead of MAPE.

The Problem
Most companies rely on forecast accuracy metrics such as MAPE (Mean Absolute Percentage Error) and RMSE (Root Mean Squared Error) to evaluate how well their demand forecasts perform. These measures are intuitive and easy to communicate, but they become misleading when applied across products with very different scales or sales patterns.
A forecasting team in a retail or manufacturing company might be responsible for hundreds or even thousands of SKUs. Some products sell thousands of units per week, while others sell only a few each month. When management asks, “How accurate are our forecasts overall?” the natural instinct is to calculate an average MAPE or RMSE across products. However, these numbers do not mean what people think they do.
RMSE is expressed in the same unit as the data (units sold, litres, tonnes). It reflects the magnitude of forecast errors, but cannot be meaningfully compared across products. An RMSE of 100 might be acceptable for a high-volume product but disastrous for a low-volume one. Averaging RMSEs across products mixes different scales and produces a meaningless figure.
MAPE, while scale-free, has its own issues. It is asymmetric: because the error is divided by the actual value, it penalises over-forecasts (positive bias) more heavily than under-forecasts (negative bias). When actual demand is small, even a slight overestimate can produce an exaggeratedly large percentage error. Moreover, for items with volumes close to zero, or with intermittent demand, MAPE can show huge errors even when the forecasts are reasonable.
The result is that managers often draw the wrong conclusions about forecast performance. A product family may appear to perform worse simply because it has smaller volumes, or a forecasting improvement might be invisible when aggregated using incompatible metrics. In practice, the company ends up managing “forecast accuracy” through distorted numbers rather than real forecasting value.
To assess forecasting quality fairly across products, we need measures that account for differences in scale and provide a consistent basis for comparison.
The Core Issue: Scale Dependence
When you compare forecast errors across products, scale (how big or small the numbers are) quietly distorts the picture. Here’s a concrete example with two SKUs from the same product family:
SKU A (high volume): weekly sales in the ~1,000 range
SKU B (low/intermittent): weekly sales in the 0–5 range

SKU A actuals: 1,000, 1,100, 900, 1,000; absolute errors: 50, 100, 50, 50
MAE(A) = (50+100+50+50)/4 = 62.5
RMSE(A) = √[(50²+100²+50²+50²)/4] = √[(2,500+10,000+2,500+2,500)/4] = √(17,500/4) = √4,375 ≈ 66.0
MAPE(A) = average(|error|/actual) = (0.050 + 0.0909 + 0.0556 + 0.050)/4 ≈ 6.16%
SKU B actuals: 5, 0, 2, 3; absolute errors: 1, 1, 1, 1
MAE(B) = 1
RMSE(B) = √[(1²+1²+1²+1²)/4] = √(4/4) = 1
MAPE(B) = average(|error|/actual) → undefined at Week 2 (actual=0). If you (improperly) drop zeros: (1/5 + 1/2 + 1/3)/3 = (0.20 + 0.50 + 0.333)/3 ≈ 34.4%
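The calculations above can be reproduced in a few lines of Python. This is a minimal sketch: the weekly actuals below are the values implied by the percentage errors in the example (hypothetical data), and the point is to show MAPE blowing up on SKU B’s zero-demand week.

```python
import numpy as np

# SKU A: high volume. Actuals are the values implied by the example's percentages.
actuals_a = np.array([1000., 1100., 900., 1000.])
errors_a = np.array([50., 100., 50., 50.])      # absolute errors

# SKU B: intermittent, with a zero in week 2.
actuals_b = np.array([5., 0., 2., 3.])
errors_b = np.array([1., 1., 1., 1.])

def mae(errors):
    return errors.mean()

def rmse(errors):
    return np.sqrt((errors ** 2).mean())

def mape(errors, actuals):
    # Divides by the actuals, so any zero actual makes the result infinite.
    return (errors / actuals).mean() * 100

print(mae(errors_a), rmse(errors_a), mape(errors_a, actuals_a))  # 62.5, ~66.1, ~6.16
print(mae(errors_b), rmse(errors_b))                             # 1.0, 1.0
print(mape(errors_b, actuals_b))                                 # inf: undefined at week 2
```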
What goes wrong?
You can’t average RMSE across products. RMSE is in units. An RMSE of 66 units for SKU A vs 1 unit for SKU B tells you nothing about relative accuracy. Taking an average at product family level like (66.0 + 1)/2 = 33.5 is meaningless; it mixes litres with teaspoons.
MAPE breaks on zeros and exaggerates small volumes.
Same absolute error, very different business impact. A 1-unit error on SKU B might be operationally trivial, but percentage metrics can make it look catastrophic. Conversely, a 50-unit error on SKU A may be operationally fine (5%), but it dominates RMSE in absolute terms.
Bottom line:
RMSE/MAE can’t be aggregated across products because of units/scale.
MAPE is unstable with zeros and biased for low volumes. To compare performance fairly across many SKUs, you need scale-free, benchmarked measures (relative ratios or scaled errors) and the right way to average them. This is where relative-to-benchmark errors and geometric means come in (next section).
The Better Way: Relative & Scaled Errors
…the next question becomes: what should we use instead?
Two families of measures solve the problem: relative error ratios and scaled errors. Both are scale-free, interpretable, and can be aggregated fairly across products, which makes them ideal for portfolio-level reporting and model benchmarking in supply chain contexts.
1. Relative Error Ratios (Forecast Value Added approach)
The first approach is to evaluate how much better (or worse) a forecasting method performs compared to a benchmark model, such as a Seasonal Naïve forecast. Instead of measuring accuracy in units or percentages, we express it as a ratio of errors:

Ratio(i) = E_model(i) / E_benchmark(i)

where E can be the MAE (Mean Absolute Error) or RMSE for product i.
A ratio below 1 means the model performs better than the benchmark (e.g., 0.8 = 20% improvement).
A ratio above 1 means it performs worse (e.g., 1.2 = 20% deterioration).
This ratio is a direct expression of Forecast Value Added (FVA), i.e. how much value your forecasting process contributes beyond a simple, transparent baseline.
Example: Suppose for SKU A and SKU B we have the following error ratios (model error divided by Seasonal Naïve error):

SKU A: ratio = 0.50
SKU B: ratio = 0.38
Both SKUs perform better than the benchmark (Naïve), but the scale no longer matters. A ratio of 0.5 means SKU A’s model reduces error by 50%; the same interpretation holds for any other product, regardless of units sold.
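As a sketch, the ratio itself is a one-liner. The benchmark MAE values below are hypothetical, chosen only so that the resulting ratios match the example (0.50 for SKU A, 0.38 for SKU B).

```python
def error_ratio(model_error, benchmark_error):
    """Relative error ratio: < 1 means the model beats the benchmark."""
    return model_error / benchmark_error

# SKU A: model MAE of 62.5 vs a hypothetical Seasonal Naive MAE of 125.
ratio_a = error_ratio(62.5, 125.0)
print(ratio_a)                     # 0.5 -> 50% error reduction

# SKU B: model MAE of 1.0 vs a hypothetical Seasonal Naive MAE of 2.63.
ratio_b = error_ratio(1.0, 2.63)
print(round(ratio_b, 2))           # 0.38
```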
Averaging across products: When summarizing performance across products, we can’t just take a regular (arithmetic) average of the ratios because ratios are multiplicative, not additive.
Here’s what that means in plain language:
If one product performs twice as well as the benchmark (ratio = 0.5), and another performs twice as poorly (ratio = 2.0), the overall effect is balanced — no real gain or loss.
But if you took a normal average (0.5 + 2.0) / 2 = 1.25, it would misleadingly suggest a 25% deterioration.
To handle ratios correctly, we use the geometric mean, which multiplies the ratios and then takes the n-th root:

GM = (Ratio(1) × Ratio(2) × … × Ratio(n))^(1/n)
This method keeps the proportional relationships intact and avoids one extreme product dominating the result.
Using our example: SKU A ratio = 0.50, SKU B ratio = 0.38; GM = √(0.50 × 0.38) ≈ 0.44
That means that, on average, the model performs about 56% (1 − 0.44) better than the benchmark across these two SKUs, which is a fairer and more realistic summary than a simple average.
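The same logic in a short sketch; computing in log space keeps the product stable even across a large portfolio of SKUs.

```python
import math

def geometric_mean(ratios):
    # Multiply the ratios and take the n-th root; summing logs and
    # exponentiating is the numerically stable equivalent.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# One SKU twice as good (0.5), one twice as bad (2.0): the effects cancel,
# whereas the arithmetic mean would misleadingly report 1.25.
print(geometric_mean([0.5, 2.0]))      # 1.0
# The two SKUs from the example:
print(geometric_mean([0.50, 0.38]))    # ~0.436
```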
Where the Seasonal Naïve benchmark fits in: All of these comparisons, whether relative ratios or scaled errors, depend on having a meaningful baseline. In supply chain data, most products have some kind of seasonal pattern: ice cream sells more in summer, heaters in winter, sunscreen before holidays. If we benchmark against a simple Naïve model (which just repeats the last observation), we risk underestimating performance for seasonal products. A Seasonal Naïve model, on the other hand, predicts each period using the value from the same season last year (for instance, “next July = last July”). That makes it a much fairer and more realistic baseline for operational forecasting, and it’s the standard choice when computing Forecast Value Added (FVA), MASE, or RMSSE.
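A Seasonal Naïve forecast is simple enough to sketch in one function; the quarterly sales figures below are purely illustrative.

```python
def seasonal_naive(history, m, horizon):
    """Forecast by repeating the last full seasonal cycle of `history`."""
    return [history[-m + (h % m)] for h in range(horizon)]

# Illustrative quarterly sales (m=4), two years of history.
sales = [100, 120, 90, 150,    # year 1
         110, 125, 95, 160]    # year 2
print(seasonal_naive(sales, m=4, horizon=4))   # [110, 125, 95, 160]
```

Each forecast period simply reuses the value from the same season one cycle earlier, which is what makes the benchmark transparent and hard to beat unfairly.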
2. Scaled Errors: MASE and RMSSE
The second family of measures, MASE (Mean Absolute Scaled Error) and RMSSE (Root Mean Squared Scaled Error), builds this logic directly into the metric. Each product’s forecast error is scaled by the in-sample error of a benchmark model (again, ideally Seasonal Naïve). This produces a dimensionless, interpretable value that works even when demand levels differ or include zeros.
Formulas

MASE = mean(|Yₜ − Ŷₜ|) / mean(|Yₜ − Yₜ₋ₘ|)
RMSSE = √[ mean((Yₜ − Ŷₜ)²) / mean((Yₜ − Yₜ₋ₘ)²) ]

where the denominators are computed on the in-sample (training) data.
Here m represents the seasonal period (e.g., m=12 for monthly data with yearly seasonality). Applying MASE to our two SKUs gives:

SKU A: MASE ≈ 0.47
SKU B: MASE ≈ 0.38

Both SKUs perform better than the benchmark. You’ll notice the MASE values (0.47 and 0.38) are very close to the ratios obtained earlier through the Forecast Value Added approach (0.50 and 0.38). The small difference arises because MASE scales by the average in-sample change (rather than the full Naïve forecast error). In practice, both methods tell the same story: the model reduces error by roughly half for SKU A and by about 60% for SKU B, a consistent improvement over a simple baseline.
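Both metrics can be sketched in a few lines. The quarterly data below is illustrative (seasonal period m=4), and the scaling denominator is computed on the in-sample history, as described above.

```python
import numpy as np

def mase(actuals, forecasts, insample, m=1):
    # Scale by the in-sample MAE of the Seasonal Naive benchmark.
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(actuals - forecasts)) / scale

def rmsse(actuals, forecasts, insample, m=1):
    # Scale by the in-sample MSE of the Seasonal Naive benchmark.
    scale = np.mean((insample[m:] - insample[:-m]) ** 2)
    return np.sqrt(np.mean((actuals - forecasts) ** 2) / scale)

# Illustrative quarterly series: two years in-sample, one year held out.
insample = np.array([100., 120., 90., 150., 110., 125., 95., 160.])
actuals = np.array([115., 130., 100., 165.])
forecasts = np.array([112., 128., 97., 158.])
print(mase(actuals, forecasts, insample, m=4))    # 0.5: half the benchmark's error
print(rmsse(actuals, forecasts, insample, m=4))   # ~0.53
```

Values below 1 read the same way as the FVA ratios: the model beats the Seasonal Naïve benchmark.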
3. What This Means in Practice
In real forecasting teams, whether at a manufacturer, retailer, or CPG company, these relative or scaled errors offer a more meaningful way to evaluate performance across hundreds of SKUs:
They eliminate scale distortion, allowing fair comparison between high- and low-volume products.
They link performance to a baseline, making it easier to communicate value added by advanced models.
They can be aggregated portfolio-wide using geometric means to report a single, interpretable figure.
This approach shifts the conversation from arbitrary percentages to true forecasting value. It helps data scientists, planners, and executives talk about accuracy in the same language, one that reflects real improvement, not statistical illusion.
4. Don’t Forget Bias: Accuracy Isn’t Everything
Forecast accuracy metrics like MAE, RMSE, or even MASE tell us how far off forecasts are on average, but not in which direction. A model can appear accurate overall while systematically over- or under-forecasting. That’s where bias comes in.
What forecast bias measures
Bias captures the directional tendency of forecast errors, whether forecasts are, on average, too high or too low.
Positive bias → forecasts are too high (over-forecasting).
Negative bias → forecasts are too low (under-forecasting).
A forecast can have a small RMSE but still exhibit strong bias, meaning errors don’t cancel out over time, and planners may consistently overstock or understock.
How to calculate bias
The simplest form of bias is based on the mean error (ME):

ME = (1/n) Σₜ (Ŷₜ − Yₜ)

where Yₜ = actual value at time t, and Ŷₜ = forecast at time t.
A bias of 0 means forecasts are, on average, balanced. A negative bias means the model tends to under-forecast (actuals > forecasts). A positive bias means it tends to over-forecast (forecasts > actuals).
To make this metric comparable across products, you can also express it in percentage terms, similar to MAPE but keeping the sign:

PBIAS = 100 × Σₜ (Ŷₜ − Yₜ) / Σₜ Yₜ
Interpretation:
PBIAS = 0% → unbiased forecasts.
PBIAS > 0% → systematic over-forecasting.
PBIAS < 0% → systematic under-forecasting.
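Both measures can be sketched as follows, using the sign convention from the text (positive = over-forecasting) and illustrative data where the forecasts run consistently high.

```python
import numpy as np

def mean_error(actuals, forecasts):
    # Positive when forecasts exceed actuals on average (over-forecasting).
    return np.mean(forecasts - actuals)

def pbias(actuals, forecasts):
    # Percentage bias: signed total error relative to total actual demand.
    return 100 * np.sum(forecasts - actuals) / np.sum(actuals)

# Illustrative data: every forecast is above the actual.
actuals = np.array([100., 120., 90., 150.])
forecasts = np.array([110., 125., 95., 160.])
print(mean_error(actuals, forecasts))   # 7.5 units too high on average
print(pbias(actuals, forecasts))        # ~+6.5% systematic over-forecast
```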
Why bias matters operationally
Even a small systematic bias can have large operational consequences.
A +10% bias may inflate safety stocks, working capital, and storage costs.
A –10% bias increases the risk of stockouts, missed service levels, and reactive ordering.
Bias is therefore not a secondary metric; it’s a necessary complement to accuracy. FVA analysis often uses bias alongside scale-free accuracy measures to assess whether process interventions (like demand reviews or machine-learning adjustments) are genuinely improving forecasts or simply shifting the bias.
In short: Accuracy tells you how much you’re wrong. Bias tells you which way you’re wrong, and that’s often where the real cost lies.
Metrics shape how we see our models and, more importantly, how we make supply chain decisions. When the metric is biased, so is our interpretation of performance. Moving beyond MAPE isn’t just a statistical preference; it’s a step toward managing forecast value, not forecast error.



