ARIMA and advanced forecasting models

ARIMA models

AR(1) model

$$y_t=c+\phi_1 y_{t-1}+\varepsilon_t$$
$$\phi = \begin{cases} 0 & \text{WN}\\ 1,\ c=0 & \text{RW}\\ 1,\ c\neq 0 & \text{RW with drift}\\ <0 & \text{oscillates} \end{cases}$$
Long term behaviour for AR(1)
Recursively substituting the expression for the previous steps:
$$\begin{aligned}y_t&=c+\phi y_{t-1}+\epsilon_t\\ &=c+\phi \underbrace{\left(c+\phi y_{t-2}+\epsilon_{t-1}\right)}_{y_{t-1}}+\epsilon_t\\ &=c \sum_{i=0}^{t-1} \phi^i+\phi^t y_0+\sum_{j=0}^{t-1} \phi^j \epsilon_{t-j}\\ &=\underbrace{c \sum_{i=0}^{t-1} \phi^i}_{\to\ \frac{c}{1-\phi}}+\underbrace{\phi^t y_0}_{\to\ 0}+\underbrace{\sum_{j=0}^{t-1} \phi^j \epsilon_{t-j}}_{E=0} \end{aligned}$$
Recalling that
$$\lim _{n \rightarrow \infty} \sum_{i=0}^{n} x^i=\frac{1}{1-x}, \qquad |x|<1$$
and that $\epsilon \sim \mathcal{N}(0, \sigma)$.
Figure: simulation of an AR(1) process with $c=8,\ \phi=0.8,\ \frac{c}{1-\phi}=40$.
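A minimal simulation sketch (assuming numpy; the values $c=8$, $\phi=0.8$ match the figure above) showing that the sample path fluctuates around $\frac{c}{1-\phi}=40$:

```python
import numpy as np

rng = np.random.default_rng(0)
c, phi, sigma, T = 8.0, 0.8, 1.0, 2000

y = np.zeros(T)
for t in range(1, T):
    # AR(1) recursion: y_t = c + phi * y_{t-1} + eps_t
    y[t] = c + phi * y[t - 1] + rng.normal(0.0, sigma)

print(y[500:].mean())   # close to c / (1 - phi) = 40
print(c / (1 - phi))    # theoretical long-term mean
```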

MA models

An MA(q) model is a multiple regression with past noise terms (forecast errors) as predictors.
$$y_t=c+\varepsilon_t+\theta_1 \varepsilon_{t-1}+\theta_2 \varepsilon_{t-2}+\cdots+\theta_q \varepsilon_{t-q}$$
Any MA(q) process can be written as an AR(∞) process if we impose some constraints on the MA parameters (invertibility, $|\theta|<1$ for MA(1)). This can be seen for the MA(1) process (with $c=0$) by recursively substituting $\varepsilon_{t-1}$ into the equation:
$$\begin{aligned}y_t &= \theta\varepsilon_{t-1} + \varepsilon_t\\ &= \theta\left(y_{t-1}-\theta\varepsilon_{t-2}\right) + \varepsilon_t\\ &= \sum_{i=1}^{\infty} (-1)^{i+1}\theta^i y_{t-i} + \varepsilon_t \end{aligned}$$
The converse is also true: any stationary AR(p) process can be written as an MA(∞) process. For the AR(1) case:
$$\begin{aligned}y_t &= \phi y_{t-1} + \varepsilon_t\\ &= \phi\left(\phi y_{t-2}+\varepsilon_{t-1}\right) + \varepsilon_t\\ &= \sum_{i=0}^{\infty} \phi^i \varepsilon_{t-i} \end{aligned}$$
❓ Why do we need both AR and MA models if they are equivalent?
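A hedged numerical check of the MA(1) ↔ AR(∞) equivalence above (numpy only; the parameter values and truncation order are chosen purely for illustration): the one-step prediction from a truncated AR(∞) recovers the innovation $\varepsilon_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, T, k = 0.6, 5000, 30          # |theta| < 1 (invertibility), k = truncation order

eps = rng.normal(size=T)
y = eps.copy()
y[1:] += theta * eps[:-1]            # MA(1): y_t = eps_t + theta * eps_{t-1}

# Truncated AR(inf) one-step prediction: sum_{i=1}^k (-1)^(i+1) theta^i y_{t-i}
coeffs = np.array([(-1) ** (i + 1) * theta ** i for i in range(1, k + 1)])
pred = np.array([coeffs @ y[t - k:t][::-1] for t in range(k, T)])

resid = y[k:] - pred                 # should approximately recover eps_t
print(np.corrcoef(resid, eps[k:])[0, 1])   # close to 1
```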

ARMA models

extended form
$$y_t=c+\phi_1 y_{t-1}+\phi_2 y_{t-2}+\cdots+\phi_p y_{t-p}+\theta_1 \varepsilon_{t-1}+\theta_2 \varepsilon_{t-2}+\cdots+\theta_q \varepsilon_{t-q}+\varepsilon_t$$
compact form
$$y_t=\sum_{i=1}^p \phi_i y_{t-i}+\sum_{i=1}^q \theta_i \varepsilon_{t-i}+\varepsilon_t$$
lag operator polynomial form
$$\begin{aligned}\Phi(B)\, y_t &= \Theta(B)\, \varepsilon_t \\ y_t &= \Phi^{-1}(B)\, \Theta(B)\, \varepsilon_t\end{aligned}$$
where the lag (backshift) operator is defined as $B(y_t) = y_{t-1}$. If the roots of the $\Phi(B)$ polynomial are all outside the unit circle in the complex plane, $\Phi(B)$ can be inverted and the ARMA process can be written as in the second equation above. This makes clear that a stationary ARMA model has an equivalent MA(∞) representation.
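A small sketch of the root check (numpy; the AR coefficients are illustrative), computing the roots of $\Phi(B)=1-\phi_1 B-\phi_2 B^2$ and verifying they lie outside the unit circle:

```python
import numpy as np

phi = np.array([0.5, -0.3])                  # illustrative AR coefficients
# Phi(B) = 1 - phi_1 B - phi_2 B^2; np.roots expects the highest degree first
poly = np.concatenate(([1.0], -phi))[::-1]   # [-phi_2, -phi_1, 1]
roots = np.roots(poly)
print(roots, np.all(np.abs(roots) > 1))      # True -> stationary AR part
```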

Equivalent State Space representations

There are several ways to transform ARMA models into state-space form. One easy way is to define the vectors $x_t=[y_{t-1},\dots,y_{t-p}]^\top$ and $u_t=[\varepsilon_{t-1},\dots,\varepsilon_{t-q}]^\top$, together with the row vectors $a=[\phi_1,\dots,\phi_p]$ and $b=[\theta_1,\dots,\theta_q]$, so that:
$$y_t=\sum_{i=1}^p \phi_i y_{t-i}+\sum_{i=1}^q \theta_i \varepsilon_{t-i}+\varepsilon_t = a\, x_t+b\, u_t+\varepsilon_t$$
Since $y_t$ is the first element of $x_{t+1}$, one can write an update equation for $x$:
$$\begin{aligned} x_{t+1} &= \left[\begin{array}{cccc} & a & & \\ 1 & & & \\ & \ddots & & \\ & & 1 & 0 \end{array}\right] x_t + \left[\begin{array}{c} b \\ 0 \\ \vdots \\ 0 \end{array}\right] u_t + \left[\begin{array}{c} 1 \\ 0 \\ \vdots \\ 0 \end{array}\right] \varepsilon_t = A x_t + B u_t + K \varepsilon_t \\ y_t &= \left[\begin{array}{cccc} 1 & 0 & \cdots & 0 \end{array}\right] x_{t+1} = C x_{t+1} \end{aligned}$$
ARMA processes can thus be seen as structured SS models (see the sketch below).
As we have seen in the State Space models lesson, the stability of an SS system can be linked to the eigenvalues of the matrix $\tilde{A} = A-KC$.
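A hedged numerical check of the construction above (numpy, ARMA(2,1) with illustrative coefficients): the companion-form recursion reproduces the direct ARMA recursion, and the eigenvalues of $A$ are inside the unit circle when the AR part is stationary.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.array([0.5, -0.3])        # a = [phi_1, phi_2]  (illustrative)
theta = np.array([0.4])            # b = [theta_1]       (illustrative)
p, q, T = len(phi), len(theta), 200

# Companion-form matrices as in the equations above
A = np.zeros((p, p)); A[0, :] = phi; A[1:, :-1] = np.eye(p - 1)
B = np.zeros((p, q)); B[0, :] = theta
K = np.zeros((p, 1)); K[0, 0] = 1.0
C = np.zeros((1, p)); C[0, 0] = 1.0

eps = rng.normal(size=T)
y_arma = np.zeros(T)
for t in range(T):
    ar = sum(phi[i] * y_arma[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
    ma = sum(theta[i] * eps[t - 1 - i] for i in range(q) if t - 1 - i >= 0)
    y_arma[t] = ar + ma + eps[t]

x = np.zeros((p, 1))
y_ss = np.zeros(T)
for t in range(T):
    u = np.array([[eps[t - 1 - i]] if t - 1 - i >= 0 else [0.0] for i in range(q)])
    y_ss[t] = (phi @ x[:, 0]) + (theta @ u[:, 0]) + eps[t]  # y_t = a x_t + b u_t + eps_t
    x = A @ x + B @ u + K * eps[t]                          # x_{t+1} = A x_t + B u_t + K eps_t

print(np.allclose(y_arma, y_ss))          # True: the two recursions coincide
print(np.abs(np.linalg.eigvals(A)))       # all < 1 -> stationary AR part
```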

ARIMA models

Appropriate for non-stationary signals
$$\Phi(B)(1-B)^d\, y_t= \Theta(B)\, \varepsilon_t$$
This can be seen as taking the d-th discrete derivative of the $y_t$ signal before applying an ARMA model.
$$\Phi(B)\, y_t= \frac{\Theta(B)\, \varepsilon_t}{(1-B)^d}$$
This can be seen as taking the d-th discrete integral of the noise, hence the name of the model (the I in ARIMA stands for integrated).
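A small sketch of the two views above (numpy; $d=1$ for brevity, higher orders just repeat the operations): integrating noise with $1/(1-B)$ is a cumulative sum, and differencing with $(1-B)$ undoes it.

```python
import numpy as np

rng = np.random.default_rng(3)
eps = rng.normal(size=200)

# "Integrated" noise: applying 1/(1-B)^d to eps_t amounts to d cumulative sums
d = 1
y = np.cumsum(eps)               # random walk = ARIMA(0,1,0)

# Differencing undoes the integration: (1-B)^d y_t recovers a stationary series
dy = np.diff(y, n=d)
print(np.allclose(dy, eps[d:]))  # True: the differenced series is the original noise
```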

Model fitting

The auto-ARIMA procedure implemented in major forecasting libraries takes care of steps 3-5 of the following modelling process:
1. Plot the data and identify any unusual observations.
2. If necessary, transform the data (using a Box-Cox transformation) to stabilise the variance.
3. If the data are non-stationary, take first differences of the data until the data are stationary.
4. Examine the ACF/PACF: is an ARIMA(p,d,0) or ARIMA(0,d,q) model appropriate?
5. Try your chosen model(s), and use the AICc to search for a better model.
6. Check the residuals from your chosen model by plotting the ACF of the residuals, and doing a portmanteau test of the residuals. If they do not look like white noise, try a modified model.
7. Once the residuals look like white noise, calculate forecasts.
This is done by exploring the space of ARIMA models in terms of the p and q parameters, using the AICc as model selection criterion:
$$\mathrm{AIC}=-2 \log (L)+2(p+q+k+1)$$
$$\mathrm{AICc}=\mathrm{AIC}+\frac{2(p+q+k+1)(p+q+k+2)}{T-p-q-k-2}$$
where $L$ is the maximized likelihood, $T$ the number of observations, and $k=1$ if $c\neq0$ (and $k=0$ otherwise).
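A direct transcription of the two formulas (plain Python; the inputs are assumed to be the maximized log-likelihood $\log L$, the orders p and q, k, and the number of observations T):

```python
def aic_aicc(log_lik, p, q, k, T):
    """AIC and AICc for an ARIMA(p, d, q) model, as in the formulas above.

    log_lik: maximized log-likelihood; k: 1 if the constant c != 0, else 0;
    T: number of observations used for estimation.
    """
    m = p + q + k + 1                          # number of estimated parameters (incl. noise variance)
    aic = -2.0 * log_lik + 2.0 * m
    aicc = aic + 2.0 * m * (m + 1) / (T - m - 1)
    return aic, aicc

print(aic_aicc(log_lik=-512.3, p=2, q=1, k=1, T=200))   # illustrative numbers
```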

Forecasting

ARIMA models produce forecasts recursively, feeding the forecasts already produced back in as inputs. We look at one example for an ARIMA(3,1,1) model:
$$\left(1-\hat{\phi}_1 B-\hat{\phi}_2 B^2-\hat{\phi}_3 B^3\right)(1-B) y_t=\left(1+\hat{\theta}_1 B\right) \varepsilon_t$$
Then we expand the left hand side to obtain:
$$\left[1-\left(1+\hat{\phi}_1\right) B+\left(\hat{\phi}_1-\hat{\phi}_2\right) B^2+\left(\hat{\phi}_2-\hat{\phi}_3\right) B^3+\hat{\phi}_3 B^4\right] y_t=\left(1+\hat{\theta}_1 B\right) \varepsilon_t$$
and applying the backshift operator gives
$$y_t-\left(1+\hat{\phi}_1\right) y_{t-1}+\left(\hat{\phi}_1-\hat{\phi}_2\right) y_{t-2}+\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{t-3}+\hat{\phi}_3 y_{t-4}=\varepsilon_t+\hat{\theta}_1 \varepsilon_{t-1}$$
Finally, we move all terms other than y to the right hand side:
$$y_t=\left(1+\hat{\phi}_1\right) y_{t-1}-\left(\hat{\phi}_1-\hat{\phi}_2\right) y_{t-2}-\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{t-3}-\hat{\phi}_3 y_{t-4}+\varepsilon_t+\hat{\theta}_1 \varepsilon_{t-1}$$
The expression at time $T+1$ can be written as:
$$y_{T+1}=\left(1+\hat{\phi}_1\right) y_T-\left(\hat{\phi}_1-\hat{\phi}_2\right) y_{T-1}-\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{T-2}-\hat{\phi}_3 y_{T-3}+\varepsilon_{T+1}+\hat{\theta}_1 \varepsilon_T$$
Since we cannot observe the error at the next time step, we set $\varepsilon_{T+1}$ to zero and replace $\varepsilon_T$ with the last observed residual $e_T$:
$$\hat{y}_{T+1 \mid T}=\left(1+\hat{\phi}_1\right) y_T-\left(\hat{\phi}_1-\hat{\phi}_2\right) y_{T-1}-\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{T-2}-\hat{\phi}_3 y_{T-3}+\hat{\theta}_1 e_T$$
The successive forecast is obtained as
$$\hat{y}_{T+2 \mid T}=\left(1+\hat{\phi}_1\right) \hat{y}_{T+1 \mid T}-\left(\hat{\phi}_1-\hat{\phi}_2\right) y_T-\left(\hat{\phi}_2-\hat{\phi}_3\right) y_{T-1}-\hat{\phi}_3 y_{T-2}$$
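A minimal sketch of this recursion (numpy; the fitted coefficients, the last observations, and the last residual $e_T$ are placeholders, not estimates from real data):

```python
import numpy as np

# Illustrative fitted parameters of an ARIMA(3,1,1) model and recent data
phi = np.array([0.4, -0.2, 0.1])          # phi_hat_1..3 (placeholders)
theta1 = 0.3                              # theta_hat_1 (placeholder)
y_hist = [101.0, 103.0, 102.0, 104.0]     # y_{T-3}, y_{T-2}, y_{T-1}, y_T (placeholders)
e_T = 0.5                                 # last observed residual (placeholder)

# Coefficients of the expanded polynomial applied to the undifferenced series
c1, c2, c3, c4 = 1 + phi[0], phi[0] - phi[1], phi[1] - phi[2], phi[2]

def step(y4, y3, y2, y1, err):
    # y_hat = (1+phi1) y_{t-1} - (phi1-phi2) y_{t-2} - (phi2-phi3) y_{t-3} - phi3 y_{t-4} + theta1 * err
    return c1 * y1 - c2 * y2 - c3 * y3 - c4 * y4 + theta1 * err

y_t3, y_t2, y_t1, y_T = y_hist
y_T1 = step(y_t3, y_t2, y_t1, y_T, e_T)   # y_hat_{T+1|T}: uses the last residual
y_T2 = step(y_t2, y_t1, y_T, y_T1, 0.0)   # y_hat_{T+2|T}: future errors set to zero
print(y_T1, y_T2)
```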

Probabilistic forecasts

Analytic expressions for the prediction intervals are more complicated than for simple models
Analytic form holds under the assumption of uncorrelated and normally distributed errors
If the residuals are uncorrelated but not normally distributed, then bootstrapped intervals can be obtained instead (see the sketch below)
ARIMA-based intervals tend to be too narrow. This occurs because only the variation in the errors has been accounted for. There is also variation in the parameter estimates, and in the model order, that has not been included in the calculation. In addition, the calculation assumes that the historical patterns that have been modelled will continue into the forecast period.
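A hedged sketch of bootstrapped intervals (numpy; an AR(1) with placeholder coefficients and synthetic stand-in residuals is used purely for illustration): simulate many future paths by resampling the residuals instead of assuming normality, then take quantiles per horizon.

```python
import numpy as np

rng = np.random.default_rng(4)

# Placeholder fitted AR(1) and a stand-in for its in-sample residuals
phi, c = 0.7, 2.0
resid = rng.normal(0, 1, size=300) * 1.3
y_T, H, n_paths = 10.0, 12, 1000

paths = np.empty((n_paths, H))
for b in range(n_paths):
    y_prev = y_T
    for h in range(H):
        e = rng.choice(resid)            # draw a future error from the empirical residuals
        y_prev = c + phi * y_prev + e    # simulate one step of the fitted model
        paths[b, h] = y_prev

lower, upper = np.quantile(paths, [0.05, 0.95], axis=0)   # 90% bootstrapped interval per horizon
print(lower[:3], upper[:3])
```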

ARIMAX

Advanced models

Why forecasting is important at Amazon:
Maintaining surplus inventory levels for every product is cost prohibitive
Historical patterns can be leveraged to make decisions on inventory levels for products with predictable consumption patterns
2007: standard textbook time series forecasting methods. Unable to produce accurate forecasts for situations such as new products with no prior history or products with highly seasonal sales patterns
2009-2013: Insight: products across multiple categories behave similarly → trained a global sparse QRF (SQRF) model on different products together, gaining statistical strength across multiple categories
2015-2017: RNNs and feed-forward NNs didn't outperform the SQRF in their first iteration. The breakthrough happened when training a FFNN on a quantile loss.
2017: Getting rid of handcrafted features: the Multi-horizon Quantile Recurrent Forecaster, developed starting from DeepAR
2020: Transformer-based architecture

General regressors for forecasting

General-purpose regressors, like kNN, xgboost, or QRFs, can be used as an input-output map.
The future values are embedded in a target matrix.
If a recurrent strategy is used, as in SS models, they are usually referred to as MISO models (multiple input, single output).
If multiple future values are learned in one shot, they are usually called MIMO (multiple input, multiple output) models.
The covariate set usually includes moving averages or past embeddings (lags) of the signals; see the sketch below.
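A hedged MIMO sketch (numpy + scikit-learn on a synthetic signal; window length and horizon are arbitrary): past lags of the signal form the covariate matrix, the next H values form the target matrix, and a random forest learns all horizons in one shot.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
y = np.sin(np.arange(1000) * 2 * np.pi / 24) + 0.1 * rng.normal(size=1000)  # synthetic hourly-like signal

n_lags, H = 48, 24                       # past embedding length and forecast horizon
X, Y = [], []
for t in range(n_lags, len(y) - H):
    X.append(y[t - n_lags:t])            # covariates: past embedding (lags) of the signal
    Y.append(y[t:t + H])                 # target matrix: the next H values (MIMO)
X, Y = np.array(X), np.array(Y)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:-100], Y[:-100])
Y_hat = model.predict(X[-100:])          # one-shot multi-step forecasts
print(Y_hat.shape)                       # (100, 24)
```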

Neural network building blocks for TS

RNN cell:
$$\begin{aligned}h_t & =\sigma_h\left(W_h x_t+U_h h_{t-1}+b_h\right) \\ y_t & =\sigma_y\left(W_y h_t+b_y\right)\end{aligned}$$
GRU cell:
$$\begin{aligned}z_t & =\sigma_g\left(W_z x_t+U_z h_{t-1}+b_z\right) \\ r_t & =\sigma_g\left(W_r x_t+U_r h_{t-1}+b_r\right) \\ \hat{h}_t & =\phi_h\left(W_h x_t+U_h\left(r_t \odot h_{t-1}\right)+b_h\right) \\ h_t & =z_t \odot h_{t-1}+\left(1-z_t\right) \odot \hat{h}_t\end{aligned}$$
LSTM cell:
$$\begin{aligned}f_t & =\sigma_g\left(W_f x_t+U_f h_{t-1}+b_f\right) \\ i_t & =\sigma_g\left(W_i x_t+U_i h_{t-1}+b_i\right) \\ o_t & =\sigma_g\left(W_o x_t+U_o h_{t-1}+b_o\right) \\ \tilde{c}_t & =\sigma_c\left(W_c x_t+U_c h_{t-1}+b_c\right) \\ c_t & =f_t \odot c_{t-1}+i_t \odot \tilde{c}_t \\ h_t & =o_t \odot \sigma_h\left(c_t\right)\end{aligned}$$
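A line-by-line numpy transcription of the GRU equations above (random weights and arbitrary sizes, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_h = 3, 5
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Random parameters, one weight matrix and bias per gate, as in the equations
Wz, Wr, Wh = (rng.normal(size=(n_h, n_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(n_h, n_h)) for _ in range(3))
bz, br, bh = (np.zeros(n_h) for _ in range(3))

def gru_cell(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate r_t
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate state
    return z * h_prev + (1 - z) * h_tilde                  # convex combination, as above

h = np.zeros(n_h)
for x_t in rng.normal(size=(10, n_in)):                    # run the cell over a short sequence
    h = gru_cell(x_t, h)
print(h)
```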

Naive NN approaches

The predictions are obtained by transforming the hidden states into contexts $\mathbf{c}_{[t+1: t+H]}$ that are decoded and adapted into the quantile forecasts $\hat{\mathbf{y}}_{[t+1: t+H],[q]}$:
$$\begin{aligned}\mathbf{h}_t & =\operatorname{ENC}\left(\left[\mathbf{y}_t, \mathbf{x}_t^{(h)}, \mathbf{x}^{(s)}\right], \mathbf{h}_{t-1}\right) \\ \mathbf{c}_{[t+1: t+H]} & =\operatorname{Linear}\left(\left[\mathbf{h}_t, \mathbf{x}_{[: t+H]}^{(f)}\right]\right) \\ \hat{y}_{\tau,[q]} & =\operatorname{MLP}\left(\left[\mathbf{c}_\tau, \mathbf{x}_\tau^{(f)}\right]\right)\end{aligned}$$
$$\operatorname{ENC}=\begin{cases}\text{RNN}\\ \text{LSTM}\\ \text{GRU}\end{cases}$$
where $\mathbf{h}_t, \mathbf{y}_t, \mathbf{x}^{(s)}, \mathbf{x}_t^{(h)}, \mathbf{x}_{[: t+H]}^{(f)}$ are the hidden state, the input, and the static, past, and future exogenous signals available at prediction time.
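A hedged PyTorch sketch of this encoder/context/decoder scheme (all layer sizes, the class name, and the way static and historical inputs are concatenated are illustrative assumptions, not a reference implementation):

```python
import torch
import torch.nn as nn

class NaiveQuantileForecaster(nn.Module):
    """ENC (GRU) -> Linear context -> MLP decoder, one output per quantile."""
    def __init__(self, n_in, n_fut, hidden=32, horizon=24, quantiles=(0.1, 0.5, 0.9)):
        super().__init__()
        self.horizon, self.nq = horizon, len(quantiles)
        self.enc = nn.GRU(n_in, hidden, batch_first=True)                 # ENC
        self.ctx = nn.Linear(hidden + horizon * n_fut, horizon * hidden)  # c_{[t+1:t+H]}
        self.mlp = nn.Sequential(nn.Linear(hidden + n_fut, 64), nn.ReLU(),
                                 nn.Linear(64, self.nq))                  # y_hat_{tau,[q]}

    def forward(self, past, future):
        # past: (B, T, n_in) = [y_t, x_t^(h), x^(s)]; future: (B, H, n_fut) = x^(f)
        _, h_t = self.enc(past)
        h_t = h_t[-1]                                                     # (B, hidden)
        c = self.ctx(torch.cat([h_t, future.flatten(1)], dim=1))
        c = c.view(-1, self.horizon, c.shape[1] // self.horizon)          # (B, H, hidden)
        return self.mlp(torch.cat([c, future], dim=2))                    # (B, H, n_quantiles)

model = NaiveQuantileForecaster(n_in=4, n_fut=2)
print(model(torch.randn(8, 96, 4), torch.randn(8, 24, 2)).shape)          # torch.Size([8, 24, 3])
```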

DeepAR model

It is basically the naive approach above, coupled with a model for the likelihood of the observations.
Figure: Training (left): at each time step $t$, the inputs to the network are the covariates $x_{i,t}$, the target value at the previous time step $z_{i,t-1}$, as well as the previous network output $\mathbf{h}_{i,t-1}$. The network output $\mathbf{h}_{i,t}=h\left(\mathbf{h}_{i,t-1}, z_{i,t-1}, \mathbf{x}_{i,t}, \Theta\right)$ is then used to compute the parameters $\theta_{i,t}=\theta\left(\mathbf{h}_{i,t}, \Theta\right)$ of the likelihood $\ell(z \mid \theta)$, which is used for training the model parameters. Prediction (right): the history of the time series $z_{i,t}$ is fed in for $t<t_0$; then, in the prediction range $t \geq t_0$, a sample $\hat{z}_{i,t} \sim \ell\left(\cdot \mid \theta_{i,t}\right)$ is drawn and fed back for the next point until the end of the prediction range $t=t_0+T$, generating one sample trace. Repeating this prediction process yields many traces representing the joint predicted distribution.

Training

The model's parameters are fitted by maximizing the log-likelihood. The authors proposed two forms for $\ell$: one for continuous and one for count variables.
$$\mathcal{L}=\sum_{i=1}^N \sum_{t=t_0}^T \log \ell\left(z_{i, t} \mid \theta\left(\mathbf{h}_{i, t}\right)\right)$$
Gaussian distribution
$$\begin{aligned}\ell_{\mathrm{G}}(z \mid \mu, \sigma) & =\left(2 \pi \sigma^2\right)^{-\frac{1}{2}} \exp \left(-(z-\mu)^2 /\left(2 \sigma^2\right)\right) \\ \mu\left(\mathbf{h}_{i, t}\right) & =\mathbf{w}_\mu^T \mathbf{h}_{i, t}+b_\mu \quad \text{and} \quad \sigma\left(\mathbf{h}_{i, t}\right)=\log \left(1+\exp \left(\mathbf{w}_\sigma^T \mathbf{h}_{i, t}+b_\sigma\right)\right)\end{aligned}$$
Negative binomial distribution
$$\begin{aligned}\ell_{\mathrm{NB}}(z \mid \mu, \alpha) & =\frac{\Gamma\left(z+\frac{1}{\alpha}\right)}{\Gamma(z+1) \Gamma\left(\frac{1}{\alpha}\right)}\left(\frac{1}{1+\alpha \mu}\right)^{\frac{1}{\alpha}}\left(\frac{\alpha \mu}{1+\alpha \mu}\right)^z \\ \mu\left(\mathbf{h}_{i, t}\right) & =\log \left(1+\exp \left(\mathbf{w}_\mu^T \mathbf{h}_{i, t}+b_\mu\right)\right) \quad \text{and} \quad \alpha\left(\mathbf{h}_{i, t}\right)=\log \left(1+\exp \left(\mathbf{w}_\alpha^T \mathbf{h}_{i, t}+b_\alpha\right)\right)\end{aligned}$$
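A hedged PyTorch sketch of the Gaussian case (shapes and names are illustrative): $\mu$ is a linear projection of the hidden state, $\sigma$ goes through $\log(1+\exp(\cdot))$, i.e. the softplus, to stay positive, and the loss is the negative log-likelihood summed as in the equation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 32
w_mu, w_sigma = nn.Linear(hidden, 1), nn.Linear(hidden, 1)   # projection heads for mu and sigma

def gaussian_nll(h, z):
    # h: (B, T, hidden) network outputs; z: (B, T) observed targets
    mu = w_mu(h).squeeze(-1)                      # mu(h_t) = w_mu^T h_t + b_mu
    sigma = F.softplus(w_sigma(h)).squeeze(-1)    # sigma(h_t) = log(1 + exp(w_sigma^T h_t + b_sigma))
    ll = torch.distributions.Normal(mu, sigma).log_prob(z)
    return -ll.sum()                              # minimize -sum log l(z | theta)

h = torch.randn(4, 24, hidden)
z = torch.randn(4, 24)
print(gaussian_nll(h, z))
```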

deepAR in practice

The model is usually augmented with hand-crafted features, such as the hour of the day and past lags of the target (a small feature-engineering sketch follows the figures below).
Figure: the target $z$ and additional exogenous inputs $x$.
Figure: an additional time feature $u$, the hour of the day.
Figure: additional hand-crafted features: daily-lagged, 1-hour embeddings of $z$.
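A small pandas sketch of such hand-crafted features (column names and the synthetic series are illustrative): the hour of the day and daily-lagged features of the target.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
df = pd.DataFrame({"z": np.random.default_rng(7).normal(size=len(idx))}, index=idx)

df["hour"] = df.index.hour                     # time feature u: hour of the day
df["z_lag_24"] = df["z"].shift(24)             # daily-lagged value of the target
df["z_roll_24"] = df["z"].rolling(24).mean()   # a 24-hour embedding (moving average) of z
print(df.dropna().head())
```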