ARIMA and advanced forecasting models

ARIMA models

AR(1) model

$$y_t=c+\phi_1 y_{t-1}+\varepsilon_t$$
$$\phi = \begin{cases} 0 & \text{WN}\\ 1,\ c=0 & \text{RW}\\ 1,\ c\neq 0 & \text{RW with drift}\\ <0 & \text{oscillates} \end{cases}$$
Long term behaviour for AR(1)
Recursively substituting the expression for the previous steps:
$$\begin{aligned}y_t&=c+\phi y_{t-1}+\epsilon_t\\ &=c+\phi \underbrace{\left(c+\phi y_{t-2}+\epsilon_{t-1}\right)}_{y_{t-1}}+\epsilon_t\\ &=c \sum_{i=0}^{t-1} \phi^i+\phi^t y_0+\sum_{j=0}^{t-1} \phi^j \epsilon_{t-j}\\ &=\underbrace{c \sum_{i=0}^{t-1} \phi^i}_{\to\ \frac{c}{1-\phi}}+\underbrace{\phi^t y_0}_{\to\ 0}+\underbrace{\sum_{j=0}^{t-1} \phi^j \epsilon_{t-j}}_{E=0} \end{aligned}$$
Recalling that
$$\lim _{n \rightarrow \infty} \sum_{i=0}^{n} x^i=\frac{1}{1-x}, \qquad |x|<1$$
and that $\epsilon \sim \mathcal{N}(0, \sigma)$.
Figure: simulation of an AR(1) process with $c=8,\ \phi=0.8,\ \frac{c}{1-\phi}=40$.
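A minimal simulation sketch (assuming numpy; the values $c=8$, $\phi=0.8$ match the figure above) showing that the sample path fluctuates around $\frac{c}{1-\phi}=40$:

```python
import numpy as np

rng = np.random.default_rng(0)
c, phi, sigma, T = 8.0, 0.8, 1.0, 2000

y = np.zeros(T)
for t in range(1, T):
    # AR(1) recursion: y_t = c + phi * y_{t-1} + eps_t
    y[t] = c + phi * y[t - 1] + rng.normal(0.0, sigma)

print(y[500:].mean())   # close to c / (1 - phi) = 40
print(c / (1 - phi))    # theoretical long-term mean
```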

MA models

An MA(q) model is a multiple regression with past noise terms (forecast errors) as predictors.
$$y_t=c+\varepsilon_t+\theta_1 \varepsilon_{t-1}+\theta_2 \varepsilon_{t-2}+\cdots+\theta_q \varepsilon_{t-q}$$
Any MA(q) process can be written as an AR(∞) process if we impose some constraints on the MA parameters (invertibility, $|\theta|<1$ for MA(1)). This can be seen for the MA(1) process (with $c=0$) by recursively substituting $\varepsilon_{t-1}$ into the equation:
$$\begin{aligned}y_t &= \theta\varepsilon_{t-1} + \varepsilon_t\\ &= \theta\left(y_{t-1}-\theta\varepsilon_{t-2}\right) + \varepsilon_t\\ &= \sum_{i=1}^{\infty} (-1)^{i+1}\theta^i y_{t-i} + \varepsilon_t \end{aligned}$$
The converse is also true: any stationary AR(p) process can be written as an MA(∞) process. For the AR(1) case:
$$\begin{aligned}y_t &= \phi y_{t-1} + \varepsilon_t\\ &= \phi\left(\phi y_{t-2}+\varepsilon_{t-1}\right) + \varepsilon_t\\ &= \sum_{i=0}^{\infty} \phi^i \varepsilon_{t-i} \end{aligned}$$
❓ Why do we need both AR and MA models if they are equivalent?
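A hedged numerical check of the MA(1) ↔ AR(∞) equivalence above (numpy only; the parameter values and truncation order are chosen purely for illustration): the one-step prediction from a truncated AR(∞) recovers the innovation $\varepsilon_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, T, k = 0.6, 5000, 30          # |theta| < 1 (invertibility), k = truncation order

eps = rng.normal(size=T)
y = eps.copy()
y[1:] += theta * eps[:-1]            # MA(1): y_t = eps_t + theta * eps_{t-1}

# Truncated AR(inf) one-step prediction: sum_{i=1}^k (-1)^(i+1) theta^i y_{t-i}
coeffs = np.array([(-1) ** (i + 1) * theta ** i for i in range(1, k + 1)])
pred = np.array([coeffs @ y[t - k:t][::-1] for t in range(k, T)])

resid = y[k:] - pred                 # should approximately recover eps_t
print(np.corrcoef(resid, eps[k:])[0, 1])   # close to 1
```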

ARMA models

extended form
$$y_t=c+\phi_1 y_{t-1}+\phi_2 y_{t-2}+\cdots+\phi_p y_{t-p}+\theta_1 \varepsilon_{t-1}+\theta_2 \varepsilon_{t-2}+\cdots+\theta_q \varepsilon_{t-q}+\varepsilon_t$$
compact form
$$y_t=\sum_{i=1}^p \phi_i y_{t-i}+\sum_{i=1}^q \theta_i \varepsilon_{t-i}+\varepsilon_t$$
lag operator polynomial form
$$\begin{aligned}\Phi(B)\, y_t &= \Theta(B)\, \varepsilon_t \\ y_t &= \Phi^{-1}(B)\, \Theta(B)\, \varepsilon_t\end{aligned}$$
where the lag (backshift) operator is defined as $B(y_t) = y_{t-1}$. If the roots of the $\Phi(B)$ polynomial are all outside the unit circle in the complex plane, $\Phi(B)$ can be inverted and the ARMA process can be written as in the second equation above. This makes clear that a stationary ARMA model has an equivalent MA(∞) representation.
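A small sketch of the root check (numpy; the AR coefficients are illustrative), computing the roots of $\Phi(B)=1-\phi_1 B-\phi_2 B^2$ and verifying they lie outside the unit circle:

```python
import numpy as np

phi = np.array([0.5, -0.3])                  # illustrative AR coefficients
# Phi(B) = 1 - phi_1 B - phi_2 B^2; np.roots expects the highest degree first
poly = np.concatenate(([1.0], -phi))[::-1]   # [-phi_2, -phi_1, 1]
roots = np.roots(poly)
print(roots, np.all(np.abs(roots) > 1))      # True -> stationary AR part
```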

Equivalent State Space representations

There are several ways to transform ARMA models into state-space form. One easy way is to define the vectors $x_t=[y_{t-1},\dots,y_{t-p}]^\top$ and $u_t=[\varepsilon_{t-1},\dots,\varepsilon_{t-q}]^\top$, together with the row vectors $a=[\phi_1,\dots,\phi_p]$ and $b=[\theta_1,\dots,\theta_q]$, so that:
$$y_t=\sum_{i=1}^p \phi_i y_{t-i}+\sum_{i=1}^q \theta_i \varepsilon_{t-i}+\varepsilon_t = a\, x_t+b\, u_t+\varepsilon_t$$
Since $y_t$ is the first element of $x_{t+1}$, one can write an update equation for $x$:
$$\begin{aligned} x_{t+1} &= \left[\begin{array}{cccc} & a & & \\ 1 & & & \\ & \ddots & & \\ & & 1 & 0 \end{array}\right] x_t + \left[\begin{array}{c} b \\ 0 \\ \vdots \\ 0 \end{array}\right] u_t + \left[\begin{array}{c} 1 \\ 0 \\ \vdots \\ 0 \end{array}\right] \varepsilon_t = A x_t + B u_t + K \varepsilon_t \\ y_t &= \left[\begin{array}{cccc} 1 & 0 & \cdots & 0 \end{array}\right] x_{t+1} = C x_{t+1} \end{aligned}$$
ARMA processes can thus be seen as structured SS models (see the sketch below).
As we have seen in the State Space models lesson, the stability of an SS system can be linked to the eigenvalues of the matrix $\tilde{A} = A-KC$.
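A hedged numerical check of the construction above (numpy, ARMA(2,1) with illustrative coefficients): the companion-form recursion reproduces the direct ARMA recursion, and the eigenvalues of $A$ are inside the unit circle when the AR part is stationary.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.array([0.5, -0.3])        # a = [phi_1, phi_2]  (illustrative)
theta = np.array([0.4])            # b = [theta_1]       (illustrative)
p, q, T = len(phi), len(theta), 200

# Companion-form matrices as in the equations above
A = np.zeros((p, p)); A[0, :] = phi; A[1:, :-1] = np.eye(p - 1)
B = np.zeros((p, q)); B[0, :] = theta
K = np.zeros((p, 1)); K[0, 0] = 1.0
C = np.zeros((1, p)); C[0, 0] = 1.0

eps = rng.normal(size=T)
y_arma = np.zeros(T)
for t in range(T):
    ar = sum(phi[i] * y_arma[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
    ma = sum(theta[i] * eps[t - 1 - i] for i in range(q) if t - 1 - i >= 0)
    y_arma[t] = ar + ma + eps[t]

x = np.zeros((p, 1))
y_ss = np.zeros(T)
for t in range(T):
    u = np.array([[eps[t - 1 - i]] if t - 1 - i >= 0 else [0.0] for i in range(q)])
    y_ss[t] = (phi @ x[:, 0]) + (theta @ u[:, 0]) + eps[t]  # y_t = a x_t + b u_t + eps_t
    x = A @ x + B @ u + K * eps[t]                          # x_{t+1} = A x_t + B u_t + K eps_t

print(np.allclose(y_arma, y_ss))          # True: the two recursions coincide
print(np.abs(np.linalg.eigvals(A)))       # all < 1 -> stationary AR part
```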

ARIMA models

Appropriate for non-stationary signals
$$\Phi(B)(1-B)^d\, y_t= \Theta(B)\, \varepsilon_t$$
This can be seen as taking the d-th discrete derivative of the $y_t$ signal before applying an ARMA model.
$$\Phi(B)\, y_t= \frac{\Theta(B)\, \varepsilon_t}{(1-B)^d}$$
This can be seen as taking the d-th discrete integral of the noise, hence the name of the model (the I in ARIMA stands for integrated).
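A small sketch of the two views above (numpy; $d=1$ for brevity, higher orders just repeat the operations): integrating noise with $1/(1-B)$ is a cumulative sum, and differencing with $(1-B)$ undoes it.

```python
import numpy as np

rng = np.random.default_rng(3)
eps = rng.normal(size=200)

# "Integrated" noise: applying 1/(1-B)^d to eps_t amounts to d cumulative sums
d = 1
y = np.cumsum(eps)               # random walk = ARIMA(0,1,0)

# Differencing undoes the integration: (1-B)^d y_t recovers a stationary series
dy = np.diff(y, n=d)
print(np.allclose(dy, eps[d:]))  # True: the differenced series is the original noise
```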

Model fitting

The auto-ARIMA procedure implemented in major forecasting libraries takes care of steps 3-5 of the following modelling process:
1. Plot the data and identify any unusual observations.
2. If necessary, transform the data (using a Box-Cox transformation) to stabilise the variance.
3. If the data are non-stationary, take first differences of the data until the data are stationary.
4. Examine the ACF/PACF: is an ARIMA(p,d,0) or ARIMA(0,d,q) model appropriate?
5. Try your chosen model(s), and use the AICc to search for a better model.
6. Check the residuals from your chosen model by plotting the ACF of the residuals, and doing a portmanteau test of the residuals. If they do not look like white noise, try a modified model.
7. Once the residuals look like white noise, calculate forecasts.
This is done by exploring the space of ARIMA models in terms of the p and q parameters, using the AICc as model selection criterion:
$$\mathrm{AIC}=-2 \log (L)+2(p+q+k+1)$$
$$\mathrm{AICc}=\mathrm{AIC}+\frac{2(p+q+k+1)(p+q+k+2)}{T-p-q-k-2}$$
where $L$ is the maximized likelihood, $T$ the number of observations, and $k=1$ if $c\neq0$ (and $k=0$ otherwise).
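A direct transcription of the two formulas (plain Python; the inputs are assumed to be the maximized log-likelihood $\log L$, the orders p and q, k, and the number of observations T):

```python
def aic_aicc(log_lik, p, q, k, T):
    """AIC and AICc for an ARIMA(p, d, q) model, as in the formulas above.

    log_lik: maximized log-likelihood; k: 1 if the constant c != 0, else 0;
    T: number of observations used for estimation.
    """
    m = p + q + k + 1                          # number of estimated parameters (incl. noise variance)
    aic = -2.0 * log_lik + 2.0 * m
    aicc = aic + 2.0 * m * (m + 1) / (T - m - 1)
    return aic, aicc

print(aic_aicc(log_lik=-512.3, p=2, q=1, k=1, T=200))   # illustrative numbers
```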

Forecasting

ARIMA models produce forecasts recursively, feeding the forecasts already produced back in as inputs. We look at one example for an ARIMA(3,1,1) model:
$$\left(1-\hat{\phi}_1 B-\hat{\phi}_2 B^2-\hat{\phi}_3 B^3\right)(1-B) y_t=\left(1+\hat{\theta}_1 B\right) \varepsilon_t$$
Then we expand the left hand side to obtain:
$$\left[1-\left(1+\hat{\phi}_1\right) B+\left(\hat{\phi}_1-\hat{\phi}_2\right) B^2+\left(\hat{\phi}_2-\hat{\phi}_3\right) B^3+\hat{\phi}_3 B^4\right] y_t=\left(1+\hat{\theta}_1 B\right) \varepsilon_t$$
and applying the backshift operator gives
$$y_t-\left(1+\hat{\phi}_1\right) y_{t-1}+\left(\hat{\phi}_1-\hat{\phi}_2\right) y_{t-2}+\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{t-3}+\hat{\phi}_3 y_{t-4}=\varepsilon_t+\hat{\theta}_1 \varepsilon_{t-1}$$
Finally, we move all terms other than y to the right hand side:
$$y_t=\left(1+\hat{\phi}_1\right) y_{t-1}-\left(\hat{\phi}_1-\hat{\phi}_2\right) y_{t-2}-\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{t-3}-\hat{\phi}_3 y_{t-4}+\varepsilon_t+\hat{\theta}_1 \varepsilon_{t-1}$$
The expression at time $T+1$ can be written as:
$$y_{T+1}=\left(1+\hat{\phi}_1\right) y_T-\left(\hat{\phi}_1-\hat{\phi}_2\right) y_{T-1}-\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{T-2}-\hat{\phi}_3 y_{T-3}+\varepsilon_{T+1}+\hat{\theta}_1 \varepsilon_T$$
Since we cannot observe the error at the next time step, we set $\varepsilon_{T+1}$ to zero and replace $\varepsilon_T$ with the last observed residual $e_T$:
$$\hat{y}_{T+1 \mid T}=\left(1+\hat{\phi}_1\right) y_T-\left(\hat{\phi}_1-\hat{\phi}_2\right) y_{T-1}-\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{T-2}-\hat{\phi}_3 y_{T-3}+\hat{\theta}_1 e_T$$
The successive forecast is obtained as
$$\hat{y}_{T+2 \mid T}=\left(1+\hat{\phi}_1\right) \hat{y}_{T+1 \mid T}-\left(\hat{\phi}_1-\hat{\phi}_2\right) y_T-\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{T-1}-\hat{\phi}_3 y_{T-2}$$
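A minimal sketch of this recursion (numpy; the fitted coefficients, the last observations, and the last residual $e_T$ are placeholders, not estimates from real data):

```python
import numpy as np

# Illustrative fitted parameters of an ARIMA(3,1,1) model and recent data
phi = np.array([0.4, -0.2, 0.1])          # phi_hat_1..3 (placeholders)
theta1 = 0.3                              # theta_hat_1 (placeholder)
y_hist = [101.0, 103.0, 102.0, 104.0]     # y_{T-3}, y_{T-2}, y_{T-1}, y_T (placeholders)
e_T = 0.5                                 # last observed residual (placeholder)

# Coefficients of the expanded polynomial applied to the undifferenced series
c1, c2, c3, c4 = 1 + phi[0], phi[0] - phi[1], phi[1] - phi[2], phi[2]

def step(y4, y3, y2, y1, err):
    # y_hat = (1+phi1) y_{t-1} - (phi1-phi2) y_{t-2} - (phi2-phi3) y_{t-3} - phi3 y_{t-4} + theta1 * err
    return c1 * y1 - c2 * y2 - c3 * y3 - c4 * y4 + theta1 * err

y_t3, y_t2, y_t1, y_T = y_hist
y_T1 = step(y_t3, y_t2, y_t1, y_T, e_T)   # y_hat_{T+1|T}: uses the last residual
y_T2 = step(y_t2, y_t1, y_T, y_T1, 0.0)   # y_hat_{T+2|T}: future errors set to zero
print(y_T1, y_T2)
```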

Probabilistic forecasts

Analytic expressions for the prediction intervals are more complicated than for simple models
Analytic form holds under the assumption of uncorrelated and normally distributed errors
If the residuals are uncorrelated but not normally distributed, then bootstrapped intervals can be obtained instead (see the sketch below)
ARIMA-based intervals tend to be too narrow. This occurs because only the variation in the errors has been accounted for. There is also variation in the parameter estimates, and in the model order, that has not been included in the calculation. In addition, the calculation assumes that the historical patterns that have been modelled will continue into the forecast period.
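A hedged sketch of bootstrapped intervals (numpy; an AR(1) with placeholder coefficients and synthetic stand-in residuals is used purely for illustration): simulate many future paths by resampling the residuals instead of assuming normality, then take quantiles per horizon.

```python
import numpy as np

rng = np.random.default_rng(4)

# Placeholder fitted AR(1) and a stand-in for its in-sample residuals
phi, c = 0.7, 2.0
resid = rng.normal(0, 1, size=300) * 1.3
y_T, H, n_paths = 10.0, 12, 1000

paths = np.empty((n_paths, H))
for b in range(n_paths):
    y_prev = y_T
    for h in range(H):
        e = rng.choice(resid)            # draw a future error from the empirical residuals
        y_prev = c + phi * y_prev + e    # simulate one step of the fitted model
        paths[b, h] = y_prev

lower, upper = np.quantile(paths, [0.05, 0.95], axis=0)   # 90% bootstrapped interval per horizon
print(lower[:3], upper[:3])
```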

ARIMAX

Advanced models

Why forecasting is important at Amazon:
Maintaining surplus inventory levels for every product is cost prohibitive
Historical patterns can be leveraged to make decisions on inventory levels for products with predictable consumption patterns
2007: standard textbook time series forecasting methods. Unable to produce accurate forecasts for situations such as new products with no prior history or products with highly seasonal sales patterns
2009-2013: Insight: products across multiple categories behave similarly → trained a global sparse QRF (SQRF) model on different products together, gaining statistical strength across multiple categories
2015-2017: RNNs and feed-forward NNs didn't outperform the SQRF in their first iteration. The breakthrough happened when training a FFNN on a quantile loss.
2017: Getting rid of handcrafted features: the Multi-horizon Quantile Recurrent Forecaster, developed starting from DeepAR
2020: Transformer-based architecture

General regressors for forecasting

General-purpose regressors, like kNN, xgboost, or QRFs, can be used as an input-output map.
The future values are embedded in a target matrix.
If a recurrent strategy is used, as in SS models, they are usually referred to as MISO models (multiple input, single output).
If multiple future values are learned in one shot, they are usually called MIMO (multiple input, multiple output) models.
The covariate set usually includes moving averages or past embeddings (lags) of the signals; see the sketch below.
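A hedged MIMO sketch (numpy + scikit-learn on a synthetic signal; window length and horizon are arbitrary): past lags of the signal form the covariate matrix, the next H values form the target matrix, and a random forest learns all horizons in one shot.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
y = np.sin(np.arange(1000) * 2 * np.pi / 24) + 0.1 * rng.normal(size=1000)  # synthetic hourly-like signal

n_lags, H = 48, 24                       # past embedding length and forecast horizon
X, Y = [], []
for t in range(n_lags, len(y) - H):
    X.append(y[t - n_lags:t])            # covariates: past embedding (lags) of the signal
    Y.append(y[t:t + H])                 # target matrix: the next H values (MIMO)
X, Y = np.array(X), np.array(Y)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:-100], Y[:-100])
Y_hat = model.predict(X[-100:])          # one-shot multi-step forecasts
print(Y_hat.shape)                       # (100, 24)
```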

Neural network building blocks for TS

RNN cell:
$$\begin{aligned}h_t & =\sigma_h\left(W_h x_t+U_h h_{t-1}+b_h\right) \\ y_t & =\sigma_y\left(W_y h_t+b_y\right)\end{aligned}$$
GRU cell:
$$\begin{aligned}z_t & =\sigma_g\left(W_z x_t+U_z h_{t-1}+b_z\right) \\ r_t & =\sigma_g\left(W_r x_t+U_r h_{t-1}+b_r\right) \\ \hat{h}_t & =\phi_h\left(W_h x_t+U_h\left(r_t \odot h_{t-1}\right)+b_h\right) \\ h_t & =z_t \odot h_{t-1}+\left(1-z_t\right) \odot \hat{h}_t\end{aligned}$$
LSTM cell:
$$\begin{aligned}f_t & =\sigma_g\left(W_f x_t+U_f h_{t-1}+b_f\right) \\ i_t & =\sigma_g\left(W_i x_t+U_i h_{t-1}+b_i\right) \\ o_t & =\sigma_g\left(W_o x_t+U_o h_{t-1}+b_o\right) \\ \tilde{c}_t & =\sigma_c\left(W_c x_t+U_c h_{t-1}+b_c\right) \\ c_t & =f_t \odot c_{t-1}+i_t \odot \tilde{c}_t \\ h_t & =o_t \odot \sigma_h\left(c_t\right)\end{aligned}$$
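A line-by-line numpy transcription of the GRU equations above (random weights and arbitrary sizes, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_h = 3, 5
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Random parameters, one weight matrix and bias per gate, as in the equations
Wz, Wr, Wh = (rng.normal(size=(n_h, n_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(n_h, n_h)) for _ in range(3))
bz, br, bh = (np.zeros(n_h) for _ in range(3))

def gru_cell(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate r_t
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate state
    return z * h_prev + (1 - z) * h_tilde                  # convex combination, as above

h = np.zeros(n_h)
for x_t in rng.normal(size=(10, n_in)):                    # run the cell over a short sequence
    h = gru_cell(x_t, h)
print(h)
```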

Naive NN approaches

The predictions are obtained by transforming the hidden states into contexts $\mathbf{c}_{[t+1: t+H]}$ that are decoded and adapted into the quantile forecasts $\hat{\mathbf{y}}_{[t+1: t+H],[q]}$:
$$\begin{aligned}\mathbf{h}_t & =\operatorname{ENC}\left(\left[\mathbf{y}_t, \mathbf{x}_t^{(h)}, \mathbf{x}^{(s)}\right], \mathbf{h}_{t-1}\right) \\ \mathbf{c}_{[t+1: t+H]} & =\operatorname{Linear}\left(\left[\mathbf{h}_t, \mathbf{x}_{[: t+H]}^{(f)}\right]\right) \\ \hat{y}_{\tau,[q]} & =\operatorname{MLP}\left(\left[\mathbf{c}_\tau, \mathbf{x}_\tau^{(f)}\right]\right)\end{aligned}$$
$$\operatorname{ENC}=\begin{cases}\text{RNN}\\ \text{LSTM}\\ \text{GRU}\end{cases}$$
where $\mathbf{h}_t, \mathbf{y}_t, \mathbf{x}^{(s)}, \mathbf{x}_t^{(h)}, \mathbf{x}_{[: t+H]}^{(f)}$ are the hidden state, the input, and the static, past, and future exogenous signals available at prediction time.
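A hedged PyTorch sketch of this encoder/context/decoder scheme (all layer sizes, the class name, and the way static and historical inputs are concatenated are illustrative assumptions, not a reference implementation):

```python
import torch
import torch.nn as nn

class NaiveQuantileForecaster(nn.Module):
    """ENC (GRU) -> Linear context -> MLP decoder, one output per quantile."""
    def __init__(self, n_in, n_fut, hidden=32, horizon=24, quantiles=(0.1, 0.5, 0.9)):
        super().__init__()
        self.horizon, self.nq = horizon, len(quantiles)
        self.enc = nn.GRU(n_in, hidden, batch_first=True)                 # ENC
        self.ctx = nn.Linear(hidden + horizon * n_fut, horizon * hidden)  # c_{[t+1:t+H]}
        self.mlp = nn.Sequential(nn.Linear(hidden + n_fut, 64), nn.ReLU(),
                                 nn.Linear(64, self.nq))                  # y_hat_{tau,[q]}

    def forward(self, past, future):
        # past: (B, T, n_in) = [y_t, x_t^(h), x^(s)]; future: (B, H, n_fut) = x^(f)
        _, h_t = self.enc(past)
        h_t = h_t[-1]                                                     # (B, hidden)
        c = self.ctx(torch.cat([h_t, future.flatten(1)], dim=1))
        c = c.view(-1, self.horizon, c.shape[1] // self.horizon)          # (B, H, hidden)
        return self.mlp(torch.cat([c, future], dim=2))                    # (B, H, n_quantiles)

model = NaiveQuantileForecaster(n_in=4, n_fut=2)
print(model(torch.randn(8, 96, 4), torch.randn(8, 24, 2)).shape)          # torch.Size([8, 24, 3])
```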

DeepAR model

It is basically the naive approach above, coupled with a model for the likelihood of the observations.
Figure: Training (left): at each time step $t$, the inputs to the network are the covariates $x_{i,t}$, the target value at the previous time step $z_{i,t-1}$, as well as the previous network output $\mathbf{h}_{i,t-1}$. The network output $\mathbf{h}_{i,t}=h\left(\mathbf{h}_{i,t-1}, z_{i,t-1}, \mathbf{x}_{i,t}, \Theta\right)$ is then used to compute the parameters $\theta_{i,t}=\theta\left(\mathbf{h}_{i,t}, \Theta\right)$ of the likelihood $\ell(z \mid \theta)$, which is used for training the model parameters. Prediction (right): the history of the time series $z_{i,t}$ is fed in for $t<t_0$; then, in the prediction range $t \geq t_0$, a sample $\hat{z}_{i,t} \sim \ell\left(\cdot \mid \theta_{i,t}\right)$ is drawn and fed back for the next point until the end of the prediction range $t=t_0+T$, generating one sample trace. Repeating this prediction process yields many traces representing the joint predicted distribution.

Training

The model's parameters are fitted by maximizing the log-likelihood. The authors proposed two forms for $\ell$: one for continuous and one for count variables.
$$\mathcal{L}=\sum_{i=1}^N \sum_{t=t_0}^T \log \ell\left(z_{i, t} \mid \theta\left(\mathbf{h}_{i, t}\right)\right)$$
Gaussian distribution
$$\begin{aligned}\ell_{\mathrm{G}}(z \mid \mu, \sigma) & =\left(2 \pi \sigma^2\right)^{-\frac{1}{2}} \exp \left(-(z-\mu)^2 /\left(2 \sigma^2\right)\right) \\ \mu\left(\mathbf{h}_{i, t}\right) & =\mathbf{w}_\mu^T \mathbf{h}_{i, t}+b_\mu \quad \text{and} \quad \sigma\left(\mathbf{h}_{i, t}\right)=\log \left(1+\exp \left(\mathbf{w}_\sigma^T \mathbf{h}_{i, t}+b_\sigma\right)\right)\end{aligned}$$
Negative binomial distribution
$$\begin{aligned}\ell_{\mathrm{NB}}(z \mid \mu, \alpha) & =\frac{\Gamma\left(z+\frac{1}{\alpha}\right)}{\Gamma(z+1) \Gamma\left(\frac{1}{\alpha}\right)}\left(\frac{1}{1+\alpha \mu}\right)^{\frac{1}{\alpha}}\left(\frac{\alpha \mu}{1+\alpha \mu}\right)^z \\ \mu\left(\mathbf{h}_{i, t}\right) & =\log \left(1+\exp \left(\mathbf{w}_\mu^T \mathbf{h}_{i, t}+b_\mu\right)\right) \quad \text{and} \quad \alpha\left(\mathbf{h}_{i, t}\right)=\log \left(1+\exp \left(\mathbf{w}_\alpha^T \mathbf{h}_{i, t}+b_\alpha\right)\right)\end{aligned}$$
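A hedged PyTorch sketch of the Gaussian case (shapes and names are illustrative): $\mu$ is a linear projection of the hidden state, $\sigma$ goes through $\log(1+\exp(\cdot))$, i.e. the softplus, to stay positive, and the loss is the negative log-likelihood summed as in the equation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 32
w_mu, w_sigma = nn.Linear(hidden, 1), nn.Linear(hidden, 1)   # projection heads for mu and sigma

def gaussian_nll(h, z):
    # h: (B, T, hidden) network outputs; z: (B, T) observed targets
    mu = w_mu(h).squeeze(-1)                      # mu(h_t) = w_mu^T h_t + b_mu
    sigma = F.softplus(w_sigma(h)).squeeze(-1)    # sigma(h_t) = log(1 + exp(w_sigma^T h_t + b_sigma))
    ll = torch.distributions.Normal(mu, sigma).log_prob(z)
    return -ll.sum()                              # minimize -sum log l(z | theta)

h = torch.randn(4, 24, hidden)
z = torch.randn(4, 24)
print(gaussian_nll(h, z))
```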

deepAR in practice

The model is usually augmented with hand-crafted features, such as the hour of the day and past lags of the target (a small feature-engineering sketch follows the figures below).
Figure: the target $z$ and additional exogenous inputs $x$.
Figure: an additional time feature $u$, the hour of the day.
Figure: additional hand-crafted features: daily-lagged, 1-hour embeddings of $z$.
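A small pandas sketch of such hand-crafted features (column names and the synthetic series are illustrative): the hour of the day and daily-lagged features of the target.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
df = pd.DataFrame({"z": np.random.default_rng(7).normal(size=len(idx))}, index=idx)

df["hour"] = df.index.hour                     # time feature u: hour of the day
df["z_lag_24"] = df["z"].shift(24)             # daily-lagged value of the target
df["z_roll_24"] = df["z"].rolling(24).mean()   # a 24-hour embedding (moving average) of z
print(df.dropna().head())
```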