Model Boost: Vishal's Blog

00 · The Setup

THE MODEL
WAS WORKING

XGBoost deployed in late 2025 for hourly call volume forecasting. Trained on data from April 2025. For the first two to three months it tracked well enough: forecasts were close, staffing plans held, no one was complaining.

Then volume started moving.

Daily call volume, 2026

January 2026 1,214 / day

February 2026 1,297 / day

March 2026 1,300 / day

April 2026 1,500 / day

May 2026 2,007 / day

A 65% ramp in five months. The model, trained on April 2025 patterns, had no way to track it. Adding more features did not close the gap. The issue was not what the model was learning; it was that what it had already learned was stale. It kept predicting around 1,300 calls a day. By May, 2,000 were arriving.

01 · The Problem

THE GAP

The model was short. Not randomly, in a pattern. Every weekday. Every week. By 17% on average, with Fridays reaching 22%. The staffing plans built on those numbers were consistently too lean.

When you plan for 1,400 calls and 1,700 arrive, that is not a close miss. That is a planning failure, and it was happening on repeat.

Tuesday 8pm, April 2026

Model learned (Oct–Dec training) 1,200

Model predicted 1,350

Actual calls arrived 1,700

The model was not broken. It was doing exactly what it was trained to do: predicting the volume it had seen before. The problem was that volume had grown, and the model had not.

A 17% systematic underprediction is not a model error. It is a structural gap between when the model was trained and when it is being asked to forecast.

02 · The Mechanism

WHY TREES GET STUCK

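XGBoost works by learning leaf values from training data. Every terminal node in every tree holds a constant fitted to the training samples that landed there; for a squared-error objective that is essentially a shrunken average of what the earlier trees left unexplained. When the model predicts, it routes your input down each tree and returns the sum of those leaf values, so the output stays anchored to the levels seen in training.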

If your training data comes from a period when Tuesday 8pm averaged 1,200 calls, that is what the leaf holds. Come April, when the same slot is hitting 1,700, the leaf has not moved. The features that could signal recent growth (rolling averages, lag features) exist, but they do not outweigh three months of training data anchored to a lower volume regime.

What it learned 1,200 Leaf mean from Oct–Dec training. This is the model's anchor for that slot.

What it predicted 1,350 Recency features pushed it slightly higher. Not enough to close the gap.

What arrived 1,700 350 calls short on a single hour, with no mechanism to self-correct.
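The stuck-leaf behaviour can be illustrated with a toy model. This is a minimal sketch, not XGBoost: `SlotMeanModel` is a hypothetical stand-in that stores one mean per (day-of-week, hour) slot, the way a single leaf would, and the training rows are invented to average 1,200. It ignores the recency features that nudged the real model to 1,350.

```python
from collections import defaultdict
from statistics import mean


class SlotMeanModel:
    """Toy stand-in for a tree leaf: one stored constant per (dow, hour) slot."""

    def __init__(self):
        self.leaf = {}

    def fit(self, rows):
        # rows: iterable of (dow, hour, calls) from the training window
        buckets = defaultdict(list)
        for dow, hour, calls in rows:
            buckets[(dow, hour)].append(calls)
        self.leaf = {slot: mean(vals) for slot, vals in buckets.items()}
        return self

    def predict(self, dow, hour):
        # The stored constant comes back unchanged, however much
        # volume has moved since the training window closed.
        return self.leaf[(dow, hour)]


# Hypothetical training rows: Tuesday (dow=1) at 20:00, averaging 1,200 calls.
train = [(1, 20, 1180), (1, 20, 1200), (1, 20, 1220)]
model = SlotMeanModel().fit(train)

actual_april = 1700                  # figure quoted in the post
predicted = model.predict(1, 20)     # still the training-period mean
shortfall = actual_april - predicted
print(predicted, shortfall)
```

Nothing in `predict` looks at recent actuals, which is the whole problem: the only way to move the stored value is to refit on newer data.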

The question was not how to build a better model. It was how to fix the output.

03 · The Research

TWO PAPERS,
ONE INSIGHT

While testing approaches, two pieces of research reframed the problem.

M5 Forecasting Competition / Walmart, 42,000 series

Gradient boosting was consistently strong at learning stable hourly shape patterns from rich features. Pure ML models that tried to learn level and shape simultaneously underperformed. The architectures that won decomposed the problem: one model for level, one for shape.

Shen and Huang / Call Center Intraday Forecasting

For intraday call volume specifically, the right architecture separates daily level from hourly shape. The model that tries to do both at once either gets the shape wrong or the level wrong. Decompose first, then combine.

Both were describing the exact problem in production. The XGBoost model was excellent at shape: the relative pattern of calls across hours within a day was accurate. What it could not do was track the absolute level as volume grew.
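The level and shape split can be made concrete with a short sketch. It assumes "level" means the daily total and "shape" means each hour's share of that total; the function names and the four-hour day are hypothetical and chosen only for readability.

```python
def decompose(hourly_calls):
    """Split one day of hourly counts into level (daily total) and shape (hourly shares)."""
    level = sum(hourly_calls)
    if level == 0:
        raise ValueError("cannot compute a shape for a day with zero calls")
    shape = [h / level for h in hourly_calls]
    return level, shape


def recombine(level, shape):
    """Rebuild hourly counts from a daily level and a set of hourly shares."""
    return [level * s for s in shape]


# Hypothetical four-hour "day" from the training period.
old_day = [100, 300, 400, 200]
level, shape = decompose(old_day)

# Same intraday shape, daily level 50% higher.
new_day = recombine(level * 1.5, shape)
print(level, shape, new_day)
```

If the shape is stable and only the level moves, correcting the level alone is enough to bring every hour back in line.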

You do not need to replace the model. You need to add a layer above it.

04 · The Test

FOUR
EXPERIMENTS

11 approaches evaluated against an April 2026 holdout. Four that mattered.
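The results below are quoted as WAPE. The post does not spell out its formula, so this sketch assumes the common definition (total absolute error divided by total actual volume) and assumes that daily WAPE is computed on daily totals while hourly WAPE is computed on individual hours. The helper names and numbers are hypothetical.

```python
def wape(actuals, forecasts):
    """Sum of absolute errors divided by sum of actuals (assumed definition)."""
    if len(actuals) != len(forecasts):
        raise ValueError("actuals and forecasts must have the same length")
    total_actual = sum(actuals)
    if total_actual <= 0:
        raise ValueError("total actual volume must be positive")
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / total_actual


def daily_totals(hourly, hours_per_day=24):
    """Collapse a flat list of hourly values into per-day sums."""
    return [sum(hourly[i:i + hours_per_day])
            for i in range(0, len(hourly), hours_per_day)]


# Hypothetical two-day example with two "hours" per day to keep it readable.
actual = [100, 200, 150, 150]
forecast = [120, 180, 100, 150]

hourly_wape = wape(actual, forecast)
daily_wape = wape(daily_totals(actual, 2), daily_totals(forecast, 2))
print(f"hourly WAPE {hourly_wape:.1%}, daily WAPE {daily_wape:.1%}")
```

Hourly misses in opposite directions cancel when summed to a day, so under these assumptions daily WAPE is never higher than hourly WAPE for the same forecast.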

DOW Scale Factor, 28-day median

Per day of week, compute the median ratio of actual to model over the last 28 days. Clip to [0.85, 1.50]. Multiply every future prediction for that DOW by its scale. Recomputes on every run, no intervention needed.

17.9% → 13.7% daily WAPE Shipped
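A minimal sketch of the scale computation described above. It assumes the ratio is taken on daily totals (which matches the four observations per DOW mentioned later) and that the history has already been trimmed to the last 28 days; `dow_scales` and the sample rows are hypothetical, not the production code.

```python
from collections import defaultdict
from statistics import median


def dow_scales(history, lo=0.85, hi=1.50):
    """history: iterable of (dow, actual_daily, model_daily) for the last 28 days.

    Returns {dow: clipped median of actual / model}.
    """
    ratios = defaultdict(list)
    for dow, actual, model in history:
        if model > 0:  # skip days with no usable model output
            ratios[dow].append(actual / model)
    return {dow: min(hi, max(lo, median(r))) for dow, r in ratios.items()}


# Hypothetical 28-day history for three days of the week (Mon=0 ... Sun=6).
history = (
    [(4, a, 1000) for a in (1200, 1250, 1150, 1300)]  # Fridays
    + [(5, 2000, 1000)] * 4                            # Saturdays: ratio 2.0, hits the upper clip
    + [(6, 800, 1000)] * 4                             # Sundays: ratio 0.8, hits the lower clip
)
scales = dow_scales(history)
print(scales)
```

Each future prediction for a given DOW is then multiplied by `scales[dow]`.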

ETS Level × XGBoost Shape (M5 architecture)

ETS Holt-Winters forecasts the daily total. XGBoost predicts hourly proportions. Final = ETS daily × shape. The architecture that won M5, tested here. Failed: ETS applied a single growth trend to all days. Weekday volume grew 15-20%. Weekend volume was flat. ETS projected both at weekday rates. Saturday overprediction hit 60-90%.

30–32% hourly WAPE Rejected

Aggregation method: sum vs average vs median vs hourly

A 28-day window gives only 4 observations per DOW. One outlier day from the March AWS outage moved the sum substantially; the median ignored it. Wednesday gained 7.8pp. Hourly scales (one per DOW × hour slot) overfitted with only 4 observations per slot. Switching to the median was a one-line change.

Median: 0.83pp better than sum Shipped
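The outlier sensitivity can be sketched with four observations. This assumes "sum" means the ratio of summed actuals to summed model output, and it invents an outage day with depressed actuals; the post does not state which direction the real outlier pushed, so treat the numbers as illustrative only.

```python
from statistics import median


def scale_by_sum(pairs):
    """Ratio of total actuals to total model output (assumed meaning of 'sum')."""
    return sum(a for a, _ in pairs) / sum(m for _, m in pairs)


def scale_by_median(pairs):
    """Median of the per-day actual / model ratios."""
    return median(a / m for a, m in pairs)


# Four hypothetical Wednesdays as (actual, model); the third is an outage day.
wednesdays = [(1200, 1000), (1200, 1000), (400, 1000), (1200, 1000)]

sum_scale = scale_by_sum(wednesdays)
median_scale = scale_by_median(wednesdays)
print(sum_scale, median_scale)
```

With only four points per DOW, one bad day carries a quarter of the weight in the sum-based ratio, while the median lands between two normal days and ignores it.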

Window size: 14d through 56d

56 days appeared to win at 12.0% daily WAPE. A sanity check on per-DOW ratio drift between February and March showed that pre-ramp observations were dragging scales down on the exact days that needed the most correction. The 56d result was a data artifact. 28d kept.

28d wins once artifact removed 28d kept

05 · The Trap

THE 56-DAY TRAP

The 56-day window looked clean. 0.20pp better on hourly WAPE, 0.70pp better on daily. Easy to ship.

The data showed something different. Between February and March, the actual-to-model ratio shifted by 0.13 to 0.19 for Monday, Friday, and Thursday, the three highest-volume weekdays. Volume had ramped, concentrated in weekdays. Weekend volume barely moved.

Per-DOW ratio drift: Feb vs Mar

Monday 1.018 → 1.165 +0.147

Friday 1.037 → 1.223 +0.186

Thursday 1.103 → 1.236 +0.133

Saturday 1.178 → 1.080 −0.098 (no upward drift)

Including February in a 56-day window meant the median was taken across two different volume regimes. On the days that most needed a high correction (high-volume weekdays during a ramp), the 56d median was pulling the scale down.
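A small sketch of the regime-mixing effect, using the Friday figures from the drift table. As a simplifying assumption, every February week is set to the February ratio (1.037) and every March week to the March ratio (1.223); the real weekly values are not given in the post.

```python
from statistics import median


def window_scale(weekly_ratios, weeks):
    """Median of the most recent `weeks` weekly ratios for one day of week."""
    return median(weekly_ratios[-weeks:])


# Eight Fridays, oldest first: four in the February regime, four in the March regime.
friday = [1.037] * 4 + [1.223] * 4

scale_28d = window_scale(friday, 4)   # March only
scale_56d = window_scale(friday, 8)   # February and March mixed
print(scale_28d, scale_56d)
```

With an even split between regimes, the 56-day median sits halfway between them, well below the correction that the most recent four Fridays call for.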

There was a second problem. The validation CSV still contained March 18-19 AWS outage rows. The live pipeline already excluded those dates. Once removed, 28d performed equivalently. The 56d "win" was fully explained by two artifacts, neither of which existed in production.

A result that can only be reproduced by keeping the data wrong is not a result.

06 · The System

HOW IT
RUNS NOW

Every time the model runs, the adjuster reads the last 28 days of actual vs model, computes the median ratio per day of week, clips to [0.85, 1.50], and applies it to every hourly prediction for that DOW in the forward forecast.
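The final step, applying a per-DOW scale to each hourly prediction, might look like the sketch below. The `apply_scales` helper, the scale values, and the forecast rows are all hypothetical; the only assumption carried over from the post is that the factor is looked up by the day of week of each forecast hour.

```python
from datetime import datetime


def apply_scales(forecast, scales):
    """forecast: list of (timestamp, predicted_calls); scales: {weekday: factor}.

    Days of week without a scale are left unchanged (factor 1.0).
    """
    return [(ts, pred * scales.get(ts.weekday(), 1.0)) for ts, pred in forecast]


# Hypothetical scales (Mon=0 ... Sun=6) and a three-row forward forecast.
scales = {4: 1.30, 6: 0.95}
forecast = [
    (datetime(2026, 6, 12, 20), 100.0),  # Friday 8pm
    (datetime(2026, 6, 13, 9), 150.0),   # Saturday 9am, no scale available
    (datetime(2026, 6, 14, 9), 200.0),   # Sunday 9am
]

adjusted = apply_scales(forecast, scales)
for ts, calls in adjusted:
    print(ts.strftime("%a %H:%M"), round(calls, 1))
```

Because the adjustment is a plain multiplication per day, the hourly shape the model learned is left untouched; only the level moves.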

The current live scale factors show the shape of the correction. Every weekday gets a 23-39% lift. Sunday gets a minor downward nudge: XGBoost was slightly over-predicting Sundays, and the scale reflects that. The 1.50 clip prevents overcorrection in weeks with data-quality problems.

Scales auto-update on every main.py run as new actuals arrive. No manual steps, no calendar reminders. It runs itself.

07 · Live Data

Live · Jun 7–13, 2026

WEEK 23

That was the method. Here is Week 23 in production. The Hybrid is the long-range baseline forecast, the number the business would have used before Model Boost. Model Boost won every single day.

Chart legend: Model Boost vs Hybrid (long-range)

9.4% Model Boost avg daily error

22.1% Hybrid avg daily error

A long-range forecast anchored to older volume is the baseline. Model Boost, updating every run from the last 28 days, closes the gap between what the business planned and what actually arrived.

The Point of It

THE MODEL
WAS RIGHT

That is the part that took a while to accept. XGBoost learned the shape of call volume accurately: when demand peaks within a day, which hours carry the most weight, how Friday afternoons differ from Monday mornings. All of that was right.

What it could not do was track a sustained volume ramp without being retrained on it. That is not a model failure. It is a structural characteristic of tree-based methods on monotonically growing series. The research had a name for it. The M5 competition had mapped it across 42,000 real-world series. Shen and Huang had mapped the call center version of it in 2007.

The correction layer does not fix the model. It fixes the output. And because it uses the most recent 28 days of actuals, it adjusts automatically as volume changes: faster than any retraining cycle would, with no manual work.

There is a whole tradition of decomposing forecasting problems this way. From supply chain to call centers to Walmart's demand planning at scale. Separate what the model is good at from what it is not. Correct the rest with simpler tools that can track what the model cannot.

That is what Model Boost does. It works.
June 2026

MAY–JUN 2026
78,034 CALLS