Hierarchical forecasting

💡
“…there is a lot of potential in this particular area of forecasting and that developing new hierarchical forecasting methods, capable of reconciling forecasts so that the forecast error is minimized, not only on average but separately at each aggregation level, could provide substantial accuracy improvements.”
Makridakis et al., “M5 accuracy competition: Results, findings, and conclusions”, International Journal of Forecasting, 2022

Why consistent forecasts? Some examples

In many applications, more than one signal must be forecast. If we know that the signals are subject to an algebraic relation, we would like their forecasts to respect it as well. Why?
Sales at different aggregation levels
Using consistent forecasts for sales at different aggregation levels is important in logistics planning. Inconsistent forecasts lead to unnecessary stock accumulation or to lack of availability.
Water distribution network
Valves and pumps must be jointly controlled to keep the water pressure within the network, as a function of the predicted demand. If the forecasted demands of the pipes drawing from the same vessel don’t sum up to the forecasted demand of the vessel, our control actions will lead to a pressure change.
Distributed demand side management
Consider a district with a battery at the feeder trying to perform peak shaving, and residential batteries trying to maximize self-consumption. If the forecasts are not consistent, the battery at the feeder will try to compensate for the actions of the privately owned batteries, leading to an economic loss.
PV power production
Consider a region with a high penetration of PV power plants. Modeling the power produced by the plants with independent CDFs results in an overly narrow CDF for the total power, since independent errors tend to compensate for each other. In practice, the predictions of PV power plants in the same region have highly correlated errors, since they all depend heavily on the same irradiance and temperature forecasts. Probabilistic reconciliation can produce a more accurate CDF of the total produced power.
Another reason to work on reconciliation techniques is that, when the number of time series to forecast is high, it is hard to directly produce a set of consistent-by-design forecasts. It is often easier to use well-known forecasting techniques and then reconcile the forecasts in a post-processing step.

Background

Given a set of $n_d$ signals $y = [y_i]_{i=1}^{n_d}$, with $y_i \in \mathbb{R}^{T}$, forecast reconciliation refers to the following two-step procedure:
produce the base forecasts $\hat{y} = [\hat{y}_i]_{i=1}^{n_d}$
produce a set of reconciled forecasts $\tilde{y} = [\tilde{y}_i]_{i=1}^{n_d}$ which respects some linear relation:
$$\tilde{y}=f\left(\left[\hat{y}_i\right]_{i \in \tau}\right) \quad \text{s.t.} \quad \tilde{y} \in \mathcal{S}, \text{ where } \mathcal{S} \text{ is a linear subspace}$$

Hierarchical setting - A simple example

The most commonly used linear subspace is the one generated by partial summations of a group of base time series.
Let’s call $y_b = [y_i]_{i=1}^{n_b}$ the set of $n_b$ bottom time series whose aggregation generates the set of $n_u$ upper-level time series $y_u$, for a total of $n = n_u + n_b$ time series:
$$y=\left[y_i\right]_{i=1}^{n = 3}=[\underbrace{y_{tot}}_{y_u}, \underbrace{y_y, y_x}_{y_b}]$$
The aggregation structure of this hierarchy ($y_{tot} = y_y + y_x$) can be encoded in a summation matrix $S$:
$$S =\left[\begin{array}{ll}1 & 1 \\1 & 0 \\0 & 1\end{array}\right]= \left[\begin{array}{ll}s_{1, z} & s_{2, z} \\s_{1, y} & s_{2, y} \\s_{1, x} & s_{2, x}\end{array}\right]=\left[S_1, S_2\right]$$
All the observed samples of $y$ lie on the plane defined by $S$, that is, the span of the column vectors $S_1, S_2$.
The matrix $S$ can be used to generate the whole set of time series $y$ starting only from the bottom time series $y_b$. This is possible since the upper time series are produced by aggregation. You can also see this as a decoding step: the bottom time series are a compressed representation of the whole hierarchy.
$$\begin{aligned}y &= S y_b\\ \left[\begin{array}{l}y_{tot}\\y_y\\y_x\end{array}\right]&=\left[\begin{array}{ll}1 & 1 \\1 & 0 \\0 & 1\end{array}\right] \left[\begin{array}{l}y_y\\y_x\end{array}\right] \end{aligned}$$
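As a minimal sketch of this decoding step, the following numpy snippet builds the summation matrix of the toy hierarchy above and recovers the full hierarchy from two made-up bottom series (all values are illustrative):

```python
import numpy as np

# Summation matrix of the toy hierarchy: rows are [y_tot, y_y, y_x],
# columns are the bottom series [y_y, y_x].
S = np.array([[1, 1],
              [1, 0],
              [0, 1]])

# Two made-up bottom series with T = 4 samples each.
y_b = np.array([[10.0, 12.0, 11.0, 13.0],   # y_y
                [ 4.0,  5.0,  4.5,  6.0]])  # y_x

# Decoding: the whole hierarchy is generated from the bottom series alone.
y = S @ y_b
print(y[0])  # y_tot = y_y + y_x at every time step: [14. 17. 15.5 19.]
```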

Linear reconciliation

Usually, the reconciliation of hierarchical time series is obtained through a linear transformation.
$$\tilde{y}=f(\hat{y})=S G \hat{y}$$
This linear transformation can be split into two (conceptual) parts:
$G: \mathbb{R}^n\rightarrow\mathbb{R}^{n_b}$ maps the whole forecast vector into the reconciled set of bottom time series: $\tilde{y}_b=G \hat{y}$
$S:\mathbb{R}^{n_b}\rightarrow\mathbb{R}^n$ obtains the aggregate-consistent forecasts for the whole hierarchy: $\tilde{y}=S \tilde{y}_b$
BOTTOM-UP
If $G$ has zero entries for all the aggregated time series, the linear reconciliation is equivalent to a bottom-up approach: we just keep $\hat{y}_b$, discarding the forecasts produced for the aggregations, and produce aggregate-consistent forecasts for the aggregated quantities by summation:
$$G=\left[\begin{array}{lll}0 & 1 & 0 \\0 & 0 & 1\end{array}\right]$$
TOP-DOWN
The top-down approach forecasts just the top-level time series and repartitions it among the bottom time series using some heuristic, e.g. proportions $P_1, P_2$ with $P_1 + P_2 = 1$:
$$G=\left[\begin{array}{lll}P_1 & 0 & 0 \\P_2 & 0 & 0\end{array}\right]$$
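A minimal numpy sketch of the two $G$ matrices in action, reusing the toy hierarchy from above (the base forecasts and the proportions $P_1, P_2$ are made up):

```python
import numpy as np

S = np.array([[1, 1],
              [1, 0],
              [0, 1]])

# Base forecasts for [y_tot, y_y, y_x]; deliberately incoherent (14 != 9 + 4).
y_hat = np.array([14.0, 9.0, 4.0])

# Bottom-up: keep the bottom forecasts, discard the aggregate one.
G_bu = np.array([[0, 1, 0],
                 [0, 0, 1]])

# Top-down: split the top forecast with fixed proportions P1 + P2 = 1.
P1, P2 = 0.7, 0.3
G_td = np.array([[P1, 0, 0],
                 [P2, 0, 0]])

print(S @ G_bu @ y_hat)  # [13.  9.  4. ] -> coherent
print(S @ G_td @ y_hat)  # [14.  9.8 4.2] -> coherent
```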

Orthogonal projection

The bottom-up and top-down approaches don’t use all the information contained in the base forecasts $\hat{y}$, since they only use $\hat{y}_b$ or $\hat{y}_u$ to produce the corrected forecasts for the bottom time series $\tilde{y}_b$.
The information contained in the whole set of forecasts can be used by applying orthogonal projections. This guarantees that:
The correction displacement $\tilde{y} -\hat{y}$ is minimal
The correction never increases the total sum of squared errors of the forecasts: $\Vert \tilde{y} -y\Vert_2^2 \leq \Vert \hat{y} -y\Vert_2^2$
We can write point 1 as the following optimization problem:
$$\tilde{y} = S\beta^*, \quad \beta^* = \argmin_{\beta} \Vert \hat{y} -S \beta\Vert_2^2$$
A sketch of the proof of point 2 is a simple application of the Pythagorean theorem:
🧾
No matter what our original prediction $\hat{y}$ was, its distance $\hat{d}$ from the ground truth $y$ is always greater than or equal to the distance $\tilde{d}$ of the projected forecast $\tilde{y}$ from $y$: since $y \in \mathcal{S}$ and $\tilde{y}$ is the orthogonal projection of $\hat{y}$ onto $\mathcal{S}$, the triangle with vertices $y$, $\tilde{y}$, $\hat{y}$ has a right angle at $\tilde{y}$, so $\hat{d}^2 = \tilde{d}^2 + \Vert \hat{y}-\tilde{y}\Vert_2^2 \geq \tilde{d}^2$.
$$\color{orange}{\hat{d} \geq \tilde{d}}$$
In practice, the orthogonal projection onto $\mathcal{S}$ can be obtained by linear regression, where the features are the columns of the summation matrix $S$:
$$\tilde{y} = S \underbrace{(S^T S)^{-1}S^T}_{G}\hat{y}$$
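A minimal numpy sketch of this projection, again on the toy hierarchy (the incoherent base forecasts are made up):

```python
import numpy as np

S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
y_hat = np.array([14.0, 9.0, 4.0])  # incoherent base forecasts

# Orthogonal projection onto span(S): G = (S^T S)^{-1} S^T.
G = np.linalg.solve(S.T @ S, S.T)   # avoids forming the inverse explicitly
y_tilde = S @ G @ y_hat

print(y_tilde)  # ~[13.67  9.33  4.33]
print(np.allclose(y_tilde[0], y_tilde[1] + y_tilde[2]))  # True -> coherent
```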

Oblique projection

Not all the series in the hierarchy should be treated equally (aggregated series have larger magnitudes, and thus typically larger errors)
Correlations between forecast errors may be present
These factors can be taken into account by applying generalized least squares, using the covariance matrix $W$ of the base forecasts’ errors:
$$\tilde{y} = S\underbrace{(S^TW^{-1}S)^{-1}S^TW^{-1}}_{G}\hat{y}$$
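A minimal numpy sketch of the oblique projection, with a made-up diagonal $W$ that assigns a larger error variance to the top series; in practice $W$ is estimated from past forecast errors (when $W$ is the covariance of the base forecast errors, this is the MinT reconciliation of Wickramasuriya et al.):

```python
import numpy as np

S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
y_hat = np.array([14.0, 9.0, 4.0])

# W: covariance of the base-forecast errors. Here a made-up diagonal matrix
# saying the top-level forecast is four times noisier than the bottom ones.
W = np.diag([4.0, 1.0, 1.0])

Winv = np.linalg.inv(W)
G = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)
y_tilde = S @ G @ y_hat

print(y_tilde)  # coherent; the correction now trusts the top level less
```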
GEOMETRIC INTERPRETATION
Left: orthogonal projection onto the coherent subspace. Right: oblique projection using the observed, more likely direction of the error. From A. Panagiotelis et al., “Forecast reconciliation: A geometric view with new insights on bias correction”
MATHEMATICAL INTERPRETATION
If we pre-transform the summation matrix columns and the base forecasts as:
$$s^*_i = W^{-1/2}s_i \qquad \hat{y}^* =W^{-1/2}\hat{y}$$
The above expression reduces to an orthogonal projection, i.e., the oblique projection can be seen as an orthogonal projection in a transformed space.
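To see this, write $S^* = W^{-1/2}S$, so that $S^{*T}S^{*} = S^TW^{-1}S$; the orthogonal projection in the transformed space is then
$$S^*\left(S^{*T} S^{*}\right)^{-1} S^{*T} \hat{y}^{*} = W^{-1/2}\, S\left(S^T W^{-1} S\right)^{-1} S^T W^{-1} \hat{y} = W^{-1/2}\,\tilde{y}$$
that is, exactly the oblique reconciliation $\tilde{y}$, mapped into the transformed space by $W^{-1/2}$.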