Builder · June 2026
Cut forecast error from 40% to 24% without retraining the model.
An XGBoost model was deployed in late 2025 for hourly call volume forecasting, trained on data from April 2025. For the first two to three months it tracked well enough: forecasts were close, staffing plans held, no one was complaining.
Then volume started moving.
A 65% ramp in five months. The model, trained on April 2025 patterns, had no way to track it. Adding more features did not close the gap. The issue was not what the model was learning; it was that what it had already learned was stale. It kept predicting around 1,300 calls a day. By May, 2,000 were arriving.
The model was short. Not randomly, in a pattern. Every weekday. Every week. By 17% on average, with Fridays reaching 22%. The staffing plans built on those numbers were consistently too lean.
When you plan for 1,400 calls and 1,700 arrive, that is not a close miss. That is a planning failure, and it was happening on repeat.
The model was not broken. It was doing exactly what it was trained to do: predicting the volume it had seen before. The problem was that volume had grown, and the model had not.
A 17% systematic underprediction is not a model error. It is a structural gap between when the model was trained and when it is being asked to forecast.
The Gap
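XGBoost works by learning leaf values from training data. Every terminal node in every tree holds a value fitted to the training samples that landed there; for squared-error loss, that is essentially a shrunken average of their residuals. When the model predicts, it routes your input down every tree and returns the sum of the leaf values it reaches. Those values are fixed at training time.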
If your training data comes from a period when Tuesday 8pm averaged 1,200 calls, that is what the leaf holds. Come April, when the same slot is hitting 1,700, the leaf has not moved. Features that could signal recent growth exist (rolling averages, lag features), but they do not outweigh three months of training data anchored to a lower volume regime.
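To make the staleness concrete, here is a minimal sketch. It is not XGBoost: it is a hypothetical lookup with one leaf per (day, hour) slot, each storing the mean of its training targets, which simplifies what a boosted ensemble learns. The function name and every number are invented for illustration.

```python
from statistics import mean


def fit_leaf_means(rows):
    """Group training targets by slot and store one mean per 'leaf'.

    rows: iterable of ((day_of_week, hour), calls).
    This is a simplification of a tree model, not the XGBoost algorithm.
    """
    leaves = {}
    for slot, calls in rows:
        leaves.setdefault(slot, []).append(calls)
    return {slot: mean(values) for slot, values in leaves.items()}


# Hypothetical training data from the lower-volume regime.
train = [(("Tue", 20), 1180), (("Tue", 20), 1200), (("Tue", 20), 1220)]
model = fit_leaf_means(train)

# Later the same slot is busier, but the stored leaf value has not moved.
prediction = model[("Tue", 20)]
actual_now = 1700
shortfall = (actual_now - prediction) / actual_now
print(f"predicted={prediction:.0f} actual={actual_now} shortfall={shortfall:.1%}")
```

However the inputs change, the prediction for that slot can only be a value that was stored at training time.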
The question was not how to build a better model. It was how to fix the output.
The Mechanism
While testing approaches, two pieces of research reframed the problem.
Gradient boosting was consistently strong at learning stable hourly shape patterns from rich features. Pure ML models that tried to learn both level and shape simultaneously consistently underperformed. The architectures that won decomposed the problem: one model for level, one for shape.
For intraday call volume specifically, the right architecture separates daily level from hourly shape. The model that tries to do both at once either gets the shape wrong or the level wrong. Decompose first, then combine.
Both papers were describing the exact problem in production. The XGBoost model was excellent at shape: the relative pattern of calls across hours within a day was accurate. What it could not do was track the absolute level as volume grew.
You do not need to replace the model. You need to add a layer above it.
11 approaches evaluated against an April 2026 holdout. Four that mattered.
DOW Scale Factor, 28-day median
Per day of week, compute the median ratio of actual to model over the last 28 days. Clip to [0.85, 1.50]. Multiply every future prediction for that DOW by its scale. Recomputes on every run, no intervention needed.
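The rule above can be sketched in a few lines. This is an illustration under assumptions, not the production code: the function name, the tuple layout of `history`, and the use of daily totals for the ratio are all hypothetical. Only the 28-day window, the median, and the [0.85, 1.50] clip come from the text.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median


def dow_scale_factors(history, as_of, window_days=28, lo=0.85, hi=1.50):
    """Median actual/model ratio per day of week over a trailing window.

    history: iterable of (date, actual_daily_total, model_daily_total).
    Returns {weekday_index: scale} with Monday=0, clipped to [lo, hi].
    """
    start = as_of - timedelta(days=window_days)
    ratios = defaultdict(list)
    for day, actual, predicted in history:
        if start <= day < as_of and predicted > 0:
            ratios[day.weekday()].append(actual / predicted)
    return {dow: min(hi, max(lo, median(r))) for dow, r in ratios.items()}


# Hypothetical history: the model predicts 1,000 calls every day.
as_of = date(2026, 6, 8)
history = []
for offset in range(1, 29):
    day = as_of - timedelta(days=offset)
    if day.weekday() == 4:      # Fridays run far above the model
        actual = 1600
    elif day.weekday() < 5:     # other weekdays run 20% above
        actual = 1200
    else:                       # weekends run slightly below
        actual = 950
    history.append((day, actual, 1000))

scales = dow_scale_factors(history, as_of)
print({dow: round(s, 2) for dow, s in sorted(scales.items())})
```

In this toy data the raw Friday ratio is 1.6, so the clip caps it at 1.50; the other days pass through unchanged.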
ETS Level × XGBoost Shape (M5 architecture)
ETS Holt-Winters forecasts the daily total. XGBoost predicts hourly proportions. Final = ETS daily × shape. The architecture that won M5, tested here. Failed: ETS applied a single growth trend to all days. Weekday volume grew 15-20%. Weekend volume was flat. ETS projected both at weekday rates. Saturday overprediction hit 60-90%.
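A minimal sketch of the level × shape combination, with the ETS and XGBoost outputs replaced by hard-coded hypothetical values. The function name and the renormalization step are assumptions, and the numbers are invented: they show the mechanism of the failure, not the 60-90% figure reported above.

```python
def combine_level_and_shape(daily_total, hourly_shares):
    """Spread a daily-level forecast across intraday buckets.

    Shares are renormalized so the bucket forecasts sum to the daily total.
    """
    total_share = sum(hourly_shares)
    if total_share <= 0:
        raise ValueError("hourly shares must sum to a positive number")
    return [daily_total * share / total_share for share in hourly_shares]


# Hypothetical inputs: a daily-level forecast and four intraday buckets
# (a real hourly shape would have 24 entries).
daily_level = 1500.0
shape = [0.10, 0.40, 0.30, 0.20]
hourly = combine_level_and_shape(daily_level, shape)
print([round(h) for h in hourly])

# The failure mode in miniature: one growth factor applied to every day
# overpredicts a day type whose volume stayed flat.
single_growth = 1.18
flat_saturday_actual = 600.0
saturday_forecast = flat_saturday_actual * single_growth
overprediction = saturday_forecast / flat_saturday_actual - 1
print(f"Saturday overprediction: {overprediction:.0%}")
```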
Aggregation method: sum vs average vs median vs hourly
There are only 4 observations per DOW in 28 days. One outlier day from the March AWS outage moved the sum substantially; the median ignored it. Wednesday gained 7.8pp. Hourly scales (one per DOW × hour slot) overfitted with only 4 observations per slot. Median kept: a one-line change.
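The outlier sensitivity is easy to reproduce with four hypothetical Wednesdays. The sketch below assumes "sum" means the ratio of summed actuals to summed predictions, which the text does not spell out; the function names and numbers are illustrative.

```python
from statistics import median


def scale_from_sums(actuals, predictions):
    """Ratio of total actuals to total predictions over the window."""
    return sum(actuals) / sum(predictions)


def scale_from_median(actuals, predictions):
    """Median of the per-day actual/prediction ratios."""
    return median(a / p for a, p in zip(actuals, predictions))


# Four hypothetical Wednesdays; the third is an outage day with depressed volume.
predictions = [1000, 1000, 1000, 1000]
actuals = [1300, 1320, 400, 1310]

sum_scale = scale_from_sums(actuals, predictions)
median_scale = scale_from_median(actuals, predictions)
print(f"sum-based scale={sum_scale:.4f} median scale={median_scale:.4f}")
```

With only four observations, one bad day drags the sum-based scale far below the level of the three normal days, while the median stays with them.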
Window size: 14d through 56d
56 days appeared to win at 12.0% daily WAPE. A sanity check on per-DOW ratio drift between February and March showed that pre-ramp observations were dragging scales down on the exact days that needed the most correction. The 56d result was a data artifact. 28d kept.
The 56-day window looked clean. 0.20pp better on hourly WAPE, 0.70pp better on daily. Easy to ship.
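For reference, a sketch of WAPE at both granularities, using the common definition (sum of absolute errors divided by sum of actuals). The article does not state its exact formula, so treat this as an assumption; the helper names and the two-bucket "days" are hypothetical.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error: sum(|a - f|) / sum(a)."""
    total_actual = sum(actuals)
    if total_actual == 0:
        raise ValueError("WAPE is undefined when actuals sum to zero")
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / total_actual


def daily_totals(hourly, buckets_per_day=24):
    """Collapse a flat hourly series into per-day totals."""
    return [
        sum(hourly[i:i + buckets_per_day])
        for i in range(0, len(hourly), buckets_per_day)
    ]


# Toy data: two "days" of two buckets each, to keep the arithmetic visible.
hourly_actual = [100, 200, 150, 250]
hourly_forecast = [120, 180, 140, 230]

hourly_wape = wape(hourly_actual, hourly_forecast)
daily_wape = wape(
    daily_totals(hourly_actual, buckets_per_day=2),
    daily_totals(hourly_forecast, buckets_per_day=2),
)
print(f"hourly WAPE={hourly_wape:.3f} daily WAPE={daily_wape:.3f}")
```

Daily WAPE is never higher than hourly WAPE on the same data, because over- and under-predictions within a day cancel in the daily total. That is why the two metrics are reported separately.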
The data showed something different. Between February and March, the actual-to-model ratio shifted by 0.13 to 0.19 for Monday, Friday, and Thursday, the three highest-volume weekdays. Volume had ramped, concentrated in weekdays. Weekend volume barely moved.
Including February in a 56-day window meant the median was being taken across two different volume regimes. On the days that most needed a high correction (high-volume weekdays during a ramp), the 56d median was pulling the scale down.
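A small sketch of that effect for a single weekday. The eight ratios are hypothetical, chosen so that the pre-ramp and post-ramp regimes differ by roughly the 0.13 to 0.19 shift described above; the function name is illustrative.

```python
from statistics import median


def trailing_median(ratios, n_weeks):
    """Median of the most recent n_weeks actual/model ratios for one DOW."""
    return median(ratios[-n_weeks:])


# Eight consecutive Fridays, oldest first: four pre-ramp, four post-ramp.
friday_ratios = [1.05, 1.06, 1.04, 1.07, 1.22, 1.24, 1.23, 1.25]

scale_28d = trailing_median(friday_ratios, 4)   # post-ramp regime only
scale_56d = trailing_median(friday_ratios, 8)   # straddles both regimes
print(f"28d scale={scale_28d:.3f} 56d scale={scale_56d:.3f}")
```

With an even split between regimes, the 56-day median lands in the gap between them, below what current volume calls for.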
There was a second problem. The validation CSV still contained March 18-19 AWS outage rows. The live pipeline already excluded those dates. Once removed, 28d performed equivalently. The 56d "win" was fully explained by two artifacts, neither of which existed in production.
A result that can only be reproduced by keeping the data wrong is not a result.
The Trap
Every time the model runs, the adjuster reads the last 28 days of actual vs model, computes the median ratio per day of week, clips to [0.85, 1.50], and applies it to every hourly prediction for that DOW in the forward forecast.
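The application step might look like the sketch below. The function name, the (timestamp, prediction) layout, and the choice to leave a day unscaled when no factor is available are assumptions, and the scale values are hypothetical.

```python
from datetime import datetime


def apply_dow_scales(hourly_forecast, scales):
    """Multiply each hourly prediction by the scale for its day of week.

    hourly_forecast: list of (datetime, predicted_calls).
    scales: {weekday_index: factor} with Monday=0; missing days default to 1.0.
    """
    return [
        (ts, pred * scales.get(ts.weekday(), 1.0))
        for ts, pred in hourly_forecast
    ]


# Hypothetical scales and a three-row forward forecast.
scales = {0: 1.30, 6: 0.97}
forecast = [
    (datetime(2026, 6, 7, 9), 100.0),   # Sunday
    (datetime(2026, 6, 8, 9), 100.0),   # Monday
    (datetime(2026, 6, 9, 9), 100.0),   # Tuesday, no scale supplied
]

adjusted = apply_dow_scales(forecast, scales)
for ts, calls in adjusted:
    print(ts.strftime("%a %H:%M"), round(calls, 1))
```

Because one factor applies to every hour of a given day, the model's intraday shape is preserved and only the level moves.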
The current live scale factors show the shape of the correction:
Every weekday gets a 23-39% lift. Sunday gets a minor downward nudge: XGBoost was slightly over-predicting Sundays, and the scale reflects that. The 1.50 clip prevents overcorrection in weeks with data-quality problems.
Scales auto-update on every main.py run as new actuals arrive. No manual steps, no calendar reminders. It runs itself.
That was the method. Here is Week 23 in production. The Hybrid is the long-range baseline forecast, the number the business would have used before Model Boost. Model Boost won every single day.
A long-range forecast anchored to older volume is the baseline. Model Boost, updating every run from the last 28 days, closes the gap between what the business planned and what actually arrived.
Week of June 7–13, 2026
That is the part that took a while to accept. XGBoost learned the shape of call volume accurately: when demand peaks within a day, which hours carry the most weight, how Friday afternoons differ from Monday mornings. All of that was right.
What it could not do was track a sustained volume ramp without being retrained on it. That is not a model failure. It is a structural characteristic of tree-based methods on monotonically growing series. The research had a name for it. The M5 competition had mapped it across 42,000 real-world series. Shen and Huang had mapped the call center version of it in 2007.
The correction layer does not fix the model. It fixes the output. And because it uses the most recent 28 days of actuals, it adjusts automatically as volume changes: faster than any retraining cycle would, with no manual work.
There is a whole tradition of decomposing forecasting problems this way, from supply chain to call centers to Walmart's demand planning at scale. Separate what the model is good at from what it is not. Correct the rest with simpler tools that can track what the model cannot.
That is what Model Boost does. It works.
Numbers in this story are indicative. Volume figures, error rates, and improvement metrics reflect real patterns but have been adjusted and are not exact operational data.