
Which Forecast Accuracy Measures Should You Use (and When Does It Actually Matter)

  • Writer: Yvonne Badulescu
  • Jan 12
  • 9 min read

Forecast accuracy is one of those areas that seems straightforward until you try to apply it consistently. Most people start with a basic metric, often MAPE or RMSE, because that’s what their software provides, or because it’s easy to explain. But over time, the limitations start to show. The same model performs well under one metric and poorly under another. Different teams prefer different measures. Some products look accurate, while others clearly aren’t, even though they all score “well.”


So the question becomes: which error metrics are actually meaningful, and when do they matter?


I’ve been working through this in various forecasting projects, and the short answer is: it depends. But there are consistent patterns, and once you understand what each metric emphasizes, it becomes much easier to choose the right one, or at least to explain why different results are emerging.


How Different Metrics Lead to Different Conclusions

Each forecast accuracy metric highlights a different aspect of model performance. Some focus on large errors, some treat all errors equally, and others scale errors by volume or actual demand. This means your choice of metric can change which model appears to perform better, especially when evaluating products with very different demand patterns.


In our recent study, we looked at this more closely. The goal wasn’t to introduce a new metric but to understand how metric choice affects the conclusions we draw. We compared several types of forecasting models, including those with judgmental adjustments, and tested them using different accuracy measures. It became clear that depending on which metric you used (MAE, MAPE, RMSE, or others), you could arrive at a different recommendation for which model yields the most accurate forecasts. That matters when you’re trying to select a forecast model for a particular product or category. A model that looks accurate using MAPE might not perform as well using RMSE, especially if the product has volatile demand. And if you're comparing products with different volumes or units, some metrics will exaggerate or downplay performance differences simply due to scale.


This is particularly relevant if you're managing a portfolio of products and need to choose the right forecasting method for each one. It’s also a key issue if you’re reporting forecast accuracy across teams or departments. Using the wrong metric can make some models or teams appear worse, or better, than they actually are, simply because of how the numbers are calculated.


The bigger point is this: forecast accuracy is not a single, objective truth. It depends on what you’re measuring, what matters for your business, and what kind of patterns your data shows. Choosing the right error metric is part of the modeling process. It’s not just a reporting step.


How Forecast Accuracy Is Built

One helpful way to understand the behavior of different metrics is to break them down into parts. Most error metrics are built from three components:


  1. How error is calculated for each data point: This might be a simple difference (actual minus forecast), the absolute value of the difference, the square of the difference, or the log of a ratio. Each choice affects what kind of errors are emphasized.

  2. How the error is normalized: Normalized means the error is adjusted by dividing it by a reference value, so it can be compared across different scales. Some metrics use no normalization at all. Others divide by the actual value (as in MAPE), by the sum of actual and forecast (as in sMAPE), or by a reference forecast (as in MASE). This affects how the metric handles different scales.

  3. How the errors are aggregated: This could be a mean, a median, a geometric mean, or another method. Mean aggregation is sensitive to outliers, while median aggregation is more robust. This choice influences how much individual large errors dominate the overall score.


These components combine to form the final metric. Once you understand that structure, it becomes easier to compare metrics or to build your own version that suits your specific goals.
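
To make that structure concrete, here is a minimal sketch in Python (using NumPy) of how the three building blocks compose into a metric. The function and parameter names are purely illustrative, not a standard library API.

```python
# Illustrative sketch: error -> normalization -> aggregation.
import numpy as np

def pointwise_error(actual, forecast, kind="absolute"):
    """Step 1: how error is calculated for each data point."""
    diff = np.asarray(actual, float) - np.asarray(forecast, float)
    if kind == "absolute":
        return np.abs(diff)
    if kind == "squared":
        return diff ** 2
    return diff  # plain signed difference

def normalize(errors, actual, kind="none"):
    """Step 2: how the error is scaled so different series are comparable."""
    if kind == "by_actual":            # as in MAPE
        return errors / np.asarray(actual, float)
    return errors                      # no normalization, as in MAE or RMSE

def aggregate(errors, kind="mean"):
    """Step 3: how pointwise errors are summarized into one number."""
    return np.median(errors) if kind == "median" else np.mean(errors)

# Example: MAPE = absolute error, normalized by actuals, mean-aggregated.
actual, forecast = [100, 120, 80], [90, 130, 85]
mape = aggregate(normalize(pointwise_error(actual, forecast), actual, "by_actual"))
print(f"MAPE: {mape:.1%}")
```

Swapping any one of the three choices (squared instead of absolute error, median instead of mean) gives you a different metric with different behavior, which is exactly why the results diverge.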


A Guide to Forecast Accuracy Metric Selection

The key is to match the metric to what the forecast is supposed to support. If your forecast feeds into inventory planning, then absolute error might be more relevant than percentage error. If your goal is sales performance reporting, then percentage error might make more sense. If your users care about large misses (such as stockouts or lost sales), then a metric that penalizes outliers (like RMSE) may be appropriate. But if you’re evaluating models across product groups, you probably want something that accounts for differences in volume or scale.


Here’s a quick overview of common metrics and what they capture:


RMSE (Root Mean Squared Error)


  • Emphasizes large errors: RMSE squares the error before averaging, so large mistakes have more impact. A single big error can disproportionately affect the final result.

  • Penalizes volatility: If your forecast swings too much compared to actual values, RMSE will increase. It's good at detecting models that are unstable.

  • Sensitive to outliers: Since outliers are squared, even one bad forecast can distort the result. This makes RMSE less reliable if your data has extreme fluctuations.

  • Useful when large misses are costly: Best used in contexts like inventory management for high-value items, where overstocking or stockouts create major financial risk (e.g. electronics or luxury goods).

  • Not good for: Products with highly irregular demand where a few extreme errors would unfairly dominate model evaluation (e.g. promotional spikes).


MAE (Mean Absolute Error)


  • Treats all errors equally: Every error counts the same, regardless of whether it’s large or small. No extra weight is given to outliers.

  • Easier to interpret: The result is in the same unit as the data (e.g. units, dollars), making it more intuitive to understand and communicate.

  • More robust to outliers than RMSE: Because it doesn't square the errors, it’s less affected by extreme values in the data.

  • Works well when consistent performance matters more than extreme cases: Appropriate when you're forecasting stable demand for everyday products like household goods or dry groceries.

  • Not good for: Situations where large errors have serious consequences and need to be weighted more heavily (e.g. critical inventory shortages).
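
To make the contrast between the two metrics above concrete, here is a small sketch with made-up numbers showing how a single large miss moves RMSE far more than MAE.

```python
# One big miss: MAE rises modestly, RMSE jumps because errors are squared.
import numpy as np

actual   = np.array([100, 105,  98, 102, 100], dtype=float)
forecast = np.array([ 98, 107, 100, 101,  60], dtype=float)  # one 40-unit miss

errors = actual - forecast
mae  = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE:  {mae:.1f} units")   # every error weighted equally
print(f"RMSE: {rmse:.1f} units")  # the single large miss dominates
```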


MAPE (Mean Absolute Percentage Error)


  • Scales errors to actuals: Errors are shown as percentages, which helps compare performance across products with different sales volumes.

  • Common in business reporting: Widely used in dashboards and executive reports due to its percentage format, which feels more relatable.

  • Cannot handle zeros: If any actual values are zero, MAPE becomes undefined or spikes, which makes it unreliable for sparse or intermittent demand.

  • Biased toward under-forecasting: Because over-forecasting is penalized more than under-forecasting in percentage terms, it can favor conservative models.

  • Best for: Communicating forecast performance in sales-driven environments where products have steady demand and positive volumes (e.g. apparel, packaged foods).

  • Not good for: Low-volume or intermittent demand where actual values can be zero or near-zero (e.g. spare parts, new product launches).
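
As an illustration, here is a minimal MAPE sketch with invented numbers. Masking out zero-demand periods is one common workaround, but note that it silently drops exactly the periods that make intermittent demand hard to forecast; the helper name is just for this example.

```python
# MAPE is undefined wherever the actual is zero, so those periods are masked out here.
import numpy as np

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    mask = actual != 0
    return np.mean(np.abs(actual[mask] - forecast[mask]) / actual[mask])

print(f"{mape([200, 180, 0, 210], [190, 185, 15, 200]):.1%}")  # the zero period is ignored
```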


sMAPE (Symmetric MAPE)


  • Attempts to balance MAPE’s asymmetry: It divides the error by the average of actual and forecast, rather than just actual, to reduce bias.

  • Still has known issues, especially when actual and forecast values are both small: If both values are close to zero, sMAPE can produce extreme results or instability.

  • Best for: Comparing percentage errors when your data contains small forecast values, and you want a more balanced view than MAPE offers (e.g. marketing response forecasts or niche products).

  • Not good for: Very low demand series where both forecast and actual values are close to zero, creating distorted percentage errors.
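
For reference, here is a minimal sMAPE sketch using the common formulation that divides by the average of actual and forecast; the second call shows how a single near-zero actual inflates the score, which is the instability described above. The numbers are invented.

```python
# sMAPE: absolute error divided by the average of actual and forecast, then averaged.
import numpy as np

def smape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    return np.mean(np.abs(actual - forecast) / denom)

print(f"{smape([50, 55, 48], [48, 57, 50]):.1%}")   # well-behaved series
print(f"{smape([50, 55, 0.5], [48, 57, 6]):.1%}")   # one tiny actual blows the score up
```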


MdAE (Median Absolute Error)


  • More stable in the presence of outliers or skewed distributions: It uses the median instead of the mean, which prevents extreme values from skewing the result.

  • Good for intermittent demand or categories with inconsistent patterns: Works well when most forecasts are close to actuals but occasionally include large misses.

  • Best for: Spare parts, maintenance supplies, or seasonal items where the majority of periods have low or zero demand, but some periods see spikes.

  • Not good for: Evaluating performance across a full portfolio if you need average-based summaries rather than medians.
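
A small sketch with invented intermittent-demand numbers shows why the median helps here: the occasional spike that would pull a mean-based score up barely registers.

```python
# MdAE ignores the single 20-unit miss that dominates the mean-based MAE.
import numpy as np

def mdae(actual, forecast):
    return np.median(np.abs(np.asarray(actual, float) - np.asarray(forecast, float)))

actual, forecast = [0, 0, 3, 0, 25], [0, 1, 2, 1, 5]
print(f"MdAE: {mdae(actual, forecast):.1f} units")
print(f"MAE:  {np.mean(np.abs(np.array(actual) - np.array(forecast))):.1f} units")
```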


RMSPE (Root Mean Squared Percentage Error)


  • Like RMSE, but in percentage terms: It takes the root of the mean of squared percentage errors, combining relative (scale-free) error with RMSE’s sensitivity to variability.

  • Amplifies large percentage errors: Small actual values paired with moderate forecast errors can produce large RMSPE scores.

  • Can be problematic with low-volume series: Not ideal when actual sales values are small or near zero, which inflates the error.

  • Best for: High-margin, low-volume product categories where relative accuracy matters more than raw units, like specialty pharmaceuticals or B2B equipment.

  • Not good for: Categories where actual sales are often low, because small denominators exaggerate the percentage error.
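
For completeness, a minimal RMSPE sketch on invented numbers; as with MAPE, zero actuals are excluded here purely for illustration.

```python
# RMSPE: squared percentage errors, averaged, then rooted.
import numpy as np

def rmspe(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    mask = actual != 0
    pct_err = (actual[mask] - forecast[mask]) / actual[mask]
    return np.sqrt(np.mean(pct_err ** 2))

print(f"{rmspe([120, 80, 150], [110, 95, 140]):.1%}")
```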


Relative Measures such as GMRAE (Geometric Mean Relative Absolute Error), RelMAE (Relative Mean Absolute Error), and Theil’s U


  • Compare your model to a benchmark: These metrics assess whether your forecast is better than a simple alternative, such as a naïve model that forecasts this period to equal last period’s actual.

  • Useful for model selection across categories: Helps evaluate which forecasting method outperforms a standard baseline when product behavior varies.

  • Sensitive to poor benchmark performance: If the benchmark performs poorly, your model might look better than it really is.

  • Best for: Academic evaluations, model competitions, or when testing new forecasting approaches against legacy methods.

  • Not good for: Communicating results to non-technical stakeholders, since the meaning depends on the benchmark and is not always intuitive.
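
As a quick illustration, here is a relative MAE computed against a naïve last-value benchmark on invented numbers; a value below 1 means the model beats the benchmark.

```python
# RelMAE: the model's MAE divided by the MAE of a naive "repeat last actual" forecast.
import numpy as np

actual   = np.array([100, 110, 105, 120, 115], dtype=float)
forecast = np.array([102, 108, 107, 116, 118], dtype=float)

naive = actual[:-1]                                   # forecast for t is the actual at t-1
model_mae = np.mean(np.abs(actual[1:] - forecast[1:]))
naive_mae = np.mean(np.abs(actual[1:] - naive))

print(f"RelMAE: {model_mae / naive_mae:.2f}")         # below 1: better than naive
```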



Scaled Measures such as MASE (Mean Absolute Scaled Error), and RMSSE (Root Mean Squared Scaled Error)


  • Normalize error against a naïve forecast: These scale your model's error relative to a simple forecast, making it more meaningful across series.

  • Allow comparisons across time series: You can use them to assess performance across products with different volumes and units.

  • Less intuitive but statistically consistent: The values are not always easy to explain, but they behave well mathematically across use cases.

  • Best for: System-wide evaluations, automated model selection pipelines, or organizations managing large product portfolios.

  • Not good for: Business reporting or stakeholder presentations where metrics need to be easy to understand at a glance.
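
Here is a minimal MASE sketch: the model’s mean absolute error on the evaluation window, scaled by the in-sample MAE of a one-step naïve forecast. The numbers are invented, and values below 1 mean the model beats the naïve benchmark regardless of the series’ units.

```python
# MASE: out-of-sample MAE scaled by the in-sample naive MAE.
import numpy as np

def mase(train, actual, forecast):
    train, actual, forecast = (np.asarray(x, float) for x in (train, actual, forecast))
    naive_scale = np.mean(np.abs(np.diff(train)))     # in-sample one-step naive MAE
    return np.mean(np.abs(actual - forecast)) / naive_scale

train    = [120, 130, 125, 140, 135, 150]
actual   = [155, 160, 158]
forecast = [150, 158, 162]
print(f"MASE: {mase(train, actual, forecast):.2f}")
```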


Each of these metrics answers a different question. The important thing is not to treat them as interchangeable.



What Happens with Machine Learning Forecasts

In traditional forecasting, models are typically evaluated after training using a clear and consistent metric. With machine learning, the process can be more complex. Most ML models optimize a loss function during training. For regression approaches like decision tree ensembles (e.g. XGBoost, LightGBM) or neural networks predicting continuous outputs, that loss function is often the squared error (MSE), chosen because it simplifies the math and supports efficient optimization.


But when it comes time to evaluate the forecast, the model is often judged using metrics like MAPE or MAE, which capture different aspects of error. This creates a gap between how the model is trained and how it is assessed. In some cases, this mismatch has little impact. But when you're comparing models or building automated model selection systems, it becomes more important. A model optimized for RMSE may not perform best under MAE or MAPE. If your business selects models using one metric and reports results using another, it can lead to inconsistent decisions and misaligned expectations.
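
To illustrate the training-versus-evaluation gap, here is a small sketch using scikit-learn’s gradient boosting on synthetic data (recent scikit-learn versions accept the loss names used below); the same idea applies to XGBoost or LightGBM by switching the objective. The data and parameters are purely illustrative, not a recommendation.

```python
# Two models trained on the same data but optimized for different losses can
# rank differently depending on the evaluation metric you report.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 20 + 3 * X[:, 0] + rng.gamma(1.0, 5.0, size=500)   # skewed noise
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

for loss in ("squared_error", "absolute_error"):        # train under two different losses
    model = GradientBoostingRegressor(loss=loss, random_state=0).fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{loss:>14}: MAE={mean_absolute_error(y_test, pred):.2f}, "
          f"RMSE={np.sqrt(mean_squared_error(y_test, pred)):.2f}")
```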


Another challenge with ML forecasting is the variety of data involved. Models are often trained and tested across thousands of series with different characteristics. This means the metrics used need to handle differences in scale, volume, and behavior. Scale-independent or benchmark-relative metrics tend to be more useful here. MASE and GMRAE are two common choices. In these contexts, it’s also more common to report multiple metrics. For example, you might include RMSE to capture large errors, MASE for comparability, and bias to detect consistent under- or over-forecasting. This gives a fuller picture of model behavior.


I wrote a recent article on XAI (explainable AI), and the methods used to explain machine learning forecasts can help make sense of these differences. When models become more complex, it’s not enough to just report accuracy. You also need to understand why the model made a certain prediction and how that ties to business expectations.


A Few Practical Guidelines

There’s no formula that will tell you which metric to use, but a few patterns are worth keeping in mind:


  • If your actuals can be close to zero, avoid MAPE. It will produce extreme values or fail entirely.

  • If you are concerned about large errors, consider RMSE or RMSPE.

  • If your series are short or intermittent, median-based metrics tend to be more stable.

  • If you are comparing across categories, use scaled or relative metrics.

  • If you are communicating to a non-technical audience, MAE or percentage-based metrics are usually easier to explain.

  • If you are training ML models, try to align your loss function with the evaluation metric, especially if accuracy influences decisions.

  • Always check for bias: a model that consistently under-forecasts or over-forecasts can have good average accuracy but still cause problems operationally.
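
As a small illustration of that last point, here is a bias check on invented numbers: MAE alone hides the fact that every error has the same sign.

```python
# Reporting the signed mean error alongside MAE reveals systematic under-forecasting.
import numpy as np

actual   = np.array([100, 105,  98, 110, 102], dtype=float)
forecast = np.array([ 96, 100,  95, 104,  99], dtype=float)   # consistently low

errors = actual - forecast
print(f"MAE:  {np.mean(np.abs(errors)):.1f} units")
print(f"Bias: {np.mean(errors):+.1f} units")   # positive: the model under-forecasts
```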


Forecast accuracy is not a single number. It is a set of trade-offs, shaped by what you measure and why. Different metrics highlight different risks. Some reward consistency. Others reward responsiveness. Some work better for large volumes. Others are more forgiving on erratic data. None of them tell the full story on their own. If your models are scoring differently under different metrics, that is not a problem to solve; it is a signal to explore. Understanding why rankings shift can lead to better model selection, clearer communication, and more confident decisions. You do not need to use every metric. But using only one is rarely enough.

