What can be forecasted

Time series

A time series is a collection of values of a variable $x$ over time:
$$\left\{x_t \,;\, t \in \mathcal{T}\right\}$$
where $\mathcal{T}$ can be regularly or irregularly spaced. Most of the literature assumes discrete, regularly sampled time series.

Time series patterns

Time series can be characterized by:
Trends: a tendency, a long-term change in the mean level
Seasonalities: patterns with constant period lengths
Cyclicities: cyclic patterns with variable lengths
Regime switches / turning points: different trends within a series
Can you spot some of the above features in the following TS plots?
Figure: Australian clay brick production
Figure: Google closing stock prices

Seek and you will find


What can be forecasted?

Given a pool of signals, which ones can be easily forecasted?
Dual question: which signals should I give up forecasting?
The single most important condition in forecasting a time series:
💡
The future is similar to the past
Counterexample: CISCO stock prices between 2000 and 2002
Good news!
In many cases, this condition holds. If it doesn’t, we can try to force it:
remove cycles and trends
remove the effect of the forecast on the signal
transform the data
Good news pt. II
We have some tools that help understand whether a given time series is forecastable. These tools are based on a stochastic process representation of time series.

Stochastic processes

A probability space is a triple $(\Omega, \mathcal{A}, P)$, where
$\Omega$ is a nonempty set, which we call the sample space.
$\mathcal{A}$ is a σ-algebra of subsets of $\Omega$, i.e. a family of subsets closed under countable union and under complement with respect to $\Omega$.
$P$ is a probability measure defined for all members of $\mathcal{A}$, that is, a function $P: \mathcal{A} \rightarrow [0,1]$ such that $P(A)\geq 0$ for all $A \in \mathcal{A}$, $P(\Omega)=1$, and $P\left(\cup_{i=1}^{\infty} A_i\right)=\sum_{i=1}^{\infty} P\left(A_i\right)$ for all sequences of pairwise disjoint sets $A_i \in \mathcal{A}$.
A stochastic process is a family of random variables defined on the same probability space $(\Omega, \mathcal{A}, P)$:
$$\left\{x_t(\omega) \,;\, t \in \mathcal{T}\right\}$$

Examples

$$\left\{x_t(\omega) \,;\, t \in \mathbb{N}\right\} \qquad x_t(\omega) \sim N(0,1) \qquad x_t(\omega) \perp x_s(\omega) \text{ for } t \neq s$$
$$\left\{x_t(\omega) \,;\, t \in \mathbb{N}\right\} \qquad x_t(\omega)-x_{t-1}(\omega) \sim N(0,1)$$
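The two example processes above can be simulated in a few lines (a sketch using NumPy; the seed and series length are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200

# First example: white noise, x_t ~ N(0, 1), independent across time
white_noise = rng.standard_normal(T)

# Second example: random walk, x_t - x_{t-1} ~ N(0, 1),
# i.e. the cumulative sum of independent Gaussian increments
random_walk = np.cumsum(rng.standard_normal(T))
```

Plotting several independent draws of each process makes the difference visible: white-noise paths stay around zero, while random-walk paths drift apart over time.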

Joint distribution

The joint distribution function of $\left(x_{t_1}(\omega), x_{t_2}(\omega),\dots, x_{t_s}(\omega)\right)$ is defined as:
$$F_{t_1, t_2, \ldots, t_s}\left(b_1, b_2, \ldots, b_s\right)=P\left(x_{t_1}(\omega) \leq b_1, x_{t_2}(\omega) \leq b_2, \ldots, x_{t_s}(\omega) \leq b_s\right)$$
The above expression is a cumulative distribution function over a multivariate space; when its marginals are uniform, such a function is known as a copula.
If $F$ were known for a particular stochastic process $x_t(\omega)$, we could answer questions such as:
What is the probability that the process $\left\{x_t(\omega) \,;\, t \in \mathbb{Z}\right\}$ passes through $[a, b]$ at time $t$?
What is the probability that the process $\left\{x_t(\omega) \,;\, t \in \mathbb{Z}\right\}$ passes through $[a, b]$ at time $t_1$ and through $[c, d]$ at time $t_2$?
It turns out that this ability is pretty much all we need to know about the stochastic process.
💡
Theorem. Under general conditions the probabilistic structure of the stochastic process $\left\{x_t(\omega) \,;\, t \in \mathbb{Z}\right\}$ is completely specified if we are given the joint probability distribution of $\left(x_{t_1}(\omega), x_{t_2}(\omega),\dots, x_{t_s}(\omega)\right)$ for all values of $s$ and for all choices of $\left\{t_1, t_2, \ldots, t_s\right\}$ (Priestley 1981, p. 104).
In fact, once $F$ is known, we can generate an infinite set of realizations of the stochastic process, $x_t(\omega_i)$ for $i=1, 2,\ldots$, also called scenarios.
Figure: 7 days ahead forecast of the power consumption of a distribution system operator. Blue: observed. Violet: point forecast and possible scenarios.
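As a toy illustration (not the model behind the figure), suppose the process were known to be a Gaussian random walk; scenarios are then just independent draws of its sample paths:

```python
import numpy as np

rng = np.random.default_rng(3)
horizon, n_scenarios = 48, 20  # arbitrary choices for the sketch

# Each omega_i gives one realization (scenario) of the process over the horizon
scenarios = np.cumsum(rng.normal(0.0, 1.0, size=(n_scenarios, horizon)), axis=1)

# A point forecast can be recovered e.g. as the average over scenarios
point_forecast = scenarios.mean(axis=0)
```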
If the future is not similar to the past, we cannot estimate $F_{t_1, t_2, \ldots, t_s}$ for arbitrary values of $s$. Even for a univariate Gaussian stochastic process $\left\{x_t(\omega) \,;\, t \in \mathbb{Z}\right\}$, where $x_t \sim N\left(\mu_t, \sigma_t^2\right)$, this requires estimating
$$F_t(b)=\int_{-\infty}^b \frac{1}{\sqrt{2 \pi \sigma_t^2}} \exp \left\{-\frac{1}{2}\left(\frac{v-\mu_t}{\sigma_t}\right)^2\right\} d v \quad \text{for } t=0, \pm 1, \ldots$$
To reduce the number of parameters to be learned (in the example above, $\{(\mu_t,\sigma_t)_{t=1}^{t_s}\}$), some regularity conditions must hold:
the TS must be somehow homogeneous in time
the process must have limited memory
Let’s take a step back and ask a somewhat simpler question:
💡
Is it possible to estimate statistical moments of the time series?

Mean, covariance and where to find them

We define the mean and the (auto)covariance of a stochastic process as:
$$\mu_x(t)=\mathbb{E}\left(x_t\right)$$
$$\gamma_x(t_1, t_2)=\gamma_x(t, t-k)=\gamma_x(k)=\mathbb{E}\left[\left(x_{t}-\mu\right)\left(x_{t-k}-\mu\right)\right]$$
Since we can only observe one realization of the stochastic process in time, the mean and covariance can only be estimated by time averaging (as opposed to averaging over sample paths).

Stationary processes and ergodicity

A process is said to be stationary (in a given window) if the joint CDF does not change when sliding along the time axis:
$$F_{t_1+k,\, t_2+k,\, \ldots,\, t_s+k} = F_{t_1,\, t_2,\, \ldots,\, t_s}$$
Additionally, a process is said to be ergodic w.r.t. one of its moments if we can estimate that moment by observing a sufficiently long realization of the process, with the estimate converging quadratically to the true value.
In other words, imagine observing a set of sample paths $x_j = \{x_{j,t_1},x_{j,t_2},x_{j,t_3},\dots, x_{j,T}\}$. The process is ergodic if averaging over different $x_j$ gives the same result as averaging over time.
❓ Which one of the two examples of stochastic processes above is ergodic?
💡
Slutsky’s Theorem. Let $x_t$ be a stationary process with mean $\mu$ and autocovariance $\gamma_k$. If $\lim_{k\rightarrow \infty }\gamma_k = 0$ or $\sum_{k=0}^{\infty}\left|\gamma_k\right|<\infty$, then $x_t$ is mean-ergodic.
Intuitively, this result tells us that if any two random variables positioned far apart in the sequence are almost uncorrelated, then new information is continually added, so that the sample mean approaches the population mean.
By the law of large numbers, good estimates of the mean and covariance of an ergodic process are:
$$\hat{\mu}_T=\frac{1}{T}\sum_{t=1}^{T}x_t$$
$$\hat{\gamma}_x(k)=\frac{1}{T}\sum_{t=k+1}^{T}\left(x_t-\hat{\mu}_T\right)\left(x_{t-k}-\hat{\mu}_T\right)$$
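These time-average estimators are straightforward to implement (a sketch; the function names are mine, and white noise is used as a sanity check):

```python
import numpy as np

def sample_mean(x):
    """Time-average estimate of the process mean."""
    return np.mean(x)

def sample_autocov(x, k):
    """Sample autocovariance at lag k (T in the denominator, so slightly biased)."""
    T = len(x)
    mu = np.mean(x)
    return np.sum((x[k:] - mu) * (x[:T - k] - mu)) / T

# Sanity check on white noise, where gamma_0 = 1 and gamma_k = 0 for k > 0
rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
```

On a long enough white-noise realization, `sample_autocov(x, 0)` lands near 1 and the lagged autocovariances near 0, as the theory predicts.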
Beware! In some cases these estimates do not converge quickly.
Consider the case in which the stochastic process has distribution:
$$F_X(x)= \begin{cases}1-\left(\frac{x_{\mathrm{m}}}{x}\right)^\alpha & x \geq x_{\mathrm{m}} \\ 0 & x<x_{\mathrm{m}}\end{cases}$$
This is known as the Pareto distribution, a power-law distribution. Such distributions are also known as fat-tailed distributions. Two notable examples of fat-tailed natural variables are earthquakes and pandemics.
Figure: Top: pdfs for the normal and Pareto distributions with α = 1.12. Bottom: convergence of the sample means.
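The slow convergence can be reproduced numerically (a sketch; note that NumPy's `pareto` draws the Lomax form, so we shift and scale it to obtain the Pareto I distribution above):

```python
import numpy as np

rng = np.random.default_rng(42)
n, alpha, x_m = 100_000, 1.12, 1.0

normal = rng.standard_normal(n)
pareto = x_m * (1.0 + rng.pareto(alpha, n))  # Pareto I with minimum x_m

steps = np.arange(1, n + 1)
running_mean_normal = np.cumsum(normal) / steps
running_mean_pareto = np.cumsum(pareto) / steps

# True Pareto mean is alpha * x_m / (alpha - 1), about 9.33 for alpha = 1.12;
# the running mean is typically still far from it after 100k samples,
# while the normal running mean has long settled near 0.
```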

The ACF

The autocorrelation function is defined as:
$$\rho_k = \frac{\gamma_k}{\gamma_0}=\frac{\operatorname{cov}(x_t, x_{t-k})}{\sigma(x_t)\,\sigma(x_{t-k})} \stackrel{\text{stationary}}{=} \frac{\operatorname{cov}(x_t, x_{t-k})}{\sigma^2(x)}$$
If we only have $T$ timesteps to evaluate $\rho_k$, we must be careful to keep $T-k$ large enough for the estimate to be meaningful.
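A direct estimator, using the $T-k$ available pairs at each lag (a sketch; `acf` is a hypothetical helper, not a library function):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation rho_k = gamma_hat_k / gamma_hat_0 for k = 0..max_lag."""
    T = len(x)
    mu = np.mean(x)
    gamma0 = np.sum((x - mu) ** 2) / T
    return np.array([
        np.sum((x[k:] - mu) * (x[:T - k] - mu)) / T / gamma0
        for k in range(max_lag + 1)
    ])
```

For white noise every $\hat{\rho}_k$ with $k>0$ stays near zero, while for a persistent signal the ACF decays slowly with the lag.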
Figure: When evaluating the ACF for lag 9, we only get 2 observations to estimate the correlation in a 20-step time series.
Which is which?

ACF as an IID test

The ACF can be used to construct the following hypothesis test:
$$H_0: x_t \sim \text{i.i.d.}(0, \sigma^2)$$
If a process is i.i.d., we have that $\rho_k = 0$ for all $k \neq 0$.
A decision rule for the hypothesis test could be:
🎲
reject $H_0$ if $\left|\hat{\rho}_k\right|>c$
that is, for a given lag $k$ we require the magnitude of the correlation to be less than a constant value.
How do we choose c? We can request a 95% confidence level:
$$\mathbb{P}\left(\left|\hat{\rho}_k\right|<c \mid H_0\right) = \mathbb{P}\left(-c\leq\hat{\rho}_k\leq c \mid H_0\right) = 0.95$$
We know by the CLT that $\hat{\rho}_k\sim \mathcal{N}(0,1/n)$ for i.i.d. processes, where $n$ is the sample size.
This means that $\mathbb{P}\left(-c\leq\hat{\rho}_k\leq c \mid H_0\right) = 0.95$ iff $c=\frac{1.96}{\sqrt n}$.
Why is $\hat{\rho}_k$ normally distributed with zero mean and variance $1/n$?
$$\begin{aligned} \mathbb{E}\,\hat{\rho}_k \propto\ & \mathbb{E}\,\hat{\gamma}_x(k)=\\ &\mathbb{E}\,\frac{1}{T}\sum_{t}\left(x_t-\hat{\mu}_T\right)\left(x_{t-k}-\hat{\mu}_T\right)\stackrel{\mu=0}{=}\\ &\mathbb{E}\,\frac{1}{T}\sum_{t}x_t x_{t-k}\stackrel{\perp}{=}0 \end{aligned}$$
Furthermore, if we call $x_t x_{t-k}=y_t$, the $y_t$ are still i.i.d.; the CLT applied to their average then gives the asymptotic variance $1/n$.
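Putting the decision rule together (a sketch; `iid_test` is a made-up name for this check):

```python
import numpy as np

def iid_test(x, max_lag=20, z=1.96):
    """Return the lags at which H0 (i.i.d.) is rejected: |rho_hat_k| > z / sqrt(n)."""
    n = len(x)
    mu = np.mean(x)
    gamma0 = np.sum((x - mu) ** 2) / n
    c = z / np.sqrt(n)
    rejected = []
    for k in range(1, max_lag + 1):
        rho_k = np.sum((x[k:] - mu) * (x[:n - k] - mu)) / n / gamma0
        if abs(rho_k) > c:
            rejected.append(k)
    return rejected
```

Note that under $H_0$ about 5% of the tested lags are still expected to be rejected by chance, which is one motivation for testing sets of lags jointly.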

Portmanteau tests

Instead of testing a given lag $k$, we can test a set of lags jointly. For an i.i.d. process, the following random variables are approximately chi-square distributed with $K$ degrees of freedom:
$$Q_K=T \sum_{k=1}^K \hat{\rho}_k^2$$
Box-Pierce test
$$Q^*_K=T(T+2) \sum_{k=1}^K (T-k)^{-1}\hat{\rho}_k^2$$
Ljung-Box test
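Both statistics are easy to compute from the sample ACF (a sketch; the result is compared against the chi-square critical value, e.g. about 31.4 for K = 20 at the 5% level):

```python
import numpy as np

def portmanteau(x, K):
    """Box-Pierce Q_K and Ljung-Box Q*_K over lags 1..K."""
    T = len(x)
    mu = np.mean(x)
    gamma0 = np.sum((x - mu) ** 2) / T
    rho2 = np.array([
        (np.sum((x[k:] - mu) * (x[:T - k] - mu)) / T / gamma0) ** 2
        for k in range(1, K + 1)
    ])
    Q = T * np.sum(rho2)
    Q_star = T * (T + 2) * np.sum(rho2 / (T - np.arange(1, K + 1)))
    return Q, Q_star
```

Since $(T+2)/(T-k) > 1$, the Ljung-Box statistic is always at least as large as the Box-Pierce one; the correction improves the chi-square approximation for small samples.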

Spectral and approximate entropy

Entropy can be used as a measure of complexity and forecastability. Entropy is a measure of the information contained in a random event. Since the information contained in two independent events $x \perp y$ must sum, while their joint probability is $p(x, y) = p(x)p(y)$, the only possible form of the information is:
$$-\log _b p(x)$$
Taking the expected value (so that events with low probability count as more informative) gives the entropy:
$$-\sum_{x \in \mathcal{X}} p(x) \log _b p(x)$$
Entropy gives an idea of the complexity of the distribution of the signal. However, the entropy of these two signals is the same:
[0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,…]
[0,1,1,1,0,0,0,1,0,1,0,0,0,1,0,1,1,…]
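A quick check of this point (the second sequence below is a hypothetical scrambled sequence of mine, chosen to have the same number of zeros and ones as the first):

```python
import math
from collections import Counter

def shannon_entropy(seq, b=2):
    """Empirical entropy -sum p log_b p of the value distribution of seq."""
    n = len(seq)
    return -sum(c / n * math.log(c / n, b) for c in Counter(seq).values())

regular = [0, 1] * 8                                    # perfectly predictable
scrambled = [0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]  # irregular

# Both contain eight 0s and eight 1s, so both have entropy of 1 bit,
# even though only the first is trivially forecastable.
```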
A better measure of forecastability is approximate entropy, which looks at segments of length m of a time series (the embedded sequences) and at how regularly they recur:
$$\phi^m(r)=\frac{1}{n} \sum_{i=1}^n \log C_i^m(r)$$
where $C_i^m(r)$ is the fraction of embedded sequences whose maximum discrepancy from the $i$-th one is below $r$; $\phi^m(r)$ is thus an entropy on the distribution of “r-close neighbors” of the embedded sequences. Approximate entropy is then defined as $\phi^m(r)-\phi^{m+1}(r)$.
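A compact implementation under the standard definition $\mathrm{ApEn}=\phi^m(r)-\phi^{m+1}(r)$, using the Chebyshev (max-abs) distance between embedded segments (a sketch; defaults are illustrative):

```python
import numpy as np

def approx_entropy(x, m=2, r=0.2):
    """ApEn(m, r) = phi^m(r) - phi^{m+1}(r)."""
    x = np.asarray(x, dtype=float)

    def phi(mm):
        n = len(x) - mm + 1
        emb = np.array([x[i:i + mm] for i in range(n)])  # embedded sequences
        # Chebyshev distance between every pair of segments
        dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
        C = np.mean(dist <= r, axis=1)   # fraction of r-close neighbors per segment
        return np.mean(np.log(C))        # log-average over segments

    return phi(m) - phi(m + 1)
```

A perfectly periodic signal such as [0,1,0,1,…] gets an ApEn close to zero, while an irregular binary signal of the same entropy gets a much larger value: exactly the distinction the plain entropy misses.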

Benchmark forecasting methods

Some forecasting methods are extremely simple and yet surprisingly effective. They are stateless methods.
Naïve method
This method performs well for $h=1$, or at scales which are smaller than the characteristic time scale of the signal:
$$\hat{y}_{t+h \mid t}=y_t$$
Mean method
We just take the mean value of the time series observed so far
$$\hat{y}_{t+h \mid t}=\bar{y}=\left(y_1+\cdots+y_t\right) / t$$
Conditional Mean method
We take the mean of the observations that were registered in similar conditions $x$:
$$\hat{y}_{t+h \mid t}=\frac{1}{|\mathcal{T}_x|}\sum_{t'\in \mathcal{T}_x}y_{t'} \qquad \mathcal{T}_x=\{t': \operatorname{dist}(x_{t'}, x_{t+h})<\epsilon \}$$
Example: I’m predicting sales for a given product on an online platform. The platform applies time-varying discounts based on different market conditions. We can forecast the product’s sales by taking the mean of the sales registered on the same day of the week when the same discount was applied.
Seasonal Naïve method
$$\hat{y}_{t+h \mid t}=y_{t+h-m(k+1)}$$
where $m$ is the period of the seasonality and $k$ is the integer part of $(h-1)/m$, i.e. the number of complete seasonal cycles in the forecast period.
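The unconditional benchmarks fit in a few lines each (a sketch; the conditional mean is omitted since its similarity features are problem-specific, and the function names are mine):

```python
import numpy as np

def naive(y, h):
    """Repeat the last observation for all h steps."""
    return np.full(h, y[-1])

def mean_method(y, h):
    """Forecast every step with the historical mean."""
    return np.full(h, np.mean(y))

def seasonal_naive(y, h, m):
    """y_{t+i-m(k+1)} with k = floor((i-1)/m): repeat the last full seasonal cycle."""
    t = len(y)
    return np.array([y[t + i - m * ((i - 1) // m + 1) - 1] for i in range(1, h + 1)])
```

For example, with two years of monthly data and m = 12, `seasonal_naive(y, 12, 12)` simply returns last year's twelve values.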

Confidence intervals

Assuming residuals are normal and uncorrelated, the forecast variances of the benchmark methods are (see the derivations in the bonus section):
Mean method: $\hat\sigma_h^2=\left(1+\frac{1}{T}\right)\sigma^2$
Naïve method: $\hat\sigma_h^2=h\,\sigma^2$
Seasonal naïve method: $\hat\sigma_h^2=(k+1)\,\sigma^2$
Note that when $h=1$ and $T$ is large, these all give approximately the same forecast variance.

Time series scores

Standard ML metrics, such as the mean absolute error (MAE), can be used directly.
For time series, it is usually better to normalize these scores.
Example: the mean absolute scaled error, $\mathrm{MASE}=\operatorname{mean}\left(\vert e_{T+h}\vert / Q\right)$. Here $Q$ is a normalizing factor.
For non-seasonal time series, using
$$Q=(T-1)^{-1} \sum_{t=2}^T\left|y_t-y_{t-1}\right|$$
makes the MASE equivalent to the MAE relative to a naïve method.
For seasonal time series, using
$$Q=(T-m)^{-1} \sum_{t=m+1}^T\left|y_t-y_{t-m}\right|$$
makes the MASE equivalent to the MAE relative to a seasonal naïve method.
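A minimal MASE implementation (a sketch; `m=1` reproduces the non-seasonal scaling):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean absolute scaled error: test-set MAE divided by the in-sample MAE
    of the (seasonal) naive method."""
    y_train = np.asarray(y_train, dtype=float)
    Q = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / Q
```

Values below 1 mean the forecast beats the (in-sample) naïve benchmark; values above 1 mean it does worse.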

Bonus: Confidence intervals derivation

Mean method

The mean model is
$$y_{t} = c + \varepsilon_t$$
where $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$. For this method, we need to estimate $c$, which is done via the empirical mean:
$$\hat{c} = \frac{1}{T}\sum_{t=1}^{T}y_t$$
Since the model treats the predicted variable as constant, the prediction error is independent of the step ahead $h$. The predictive distribution is Gaussian, with variance:
$$Var(\hat{y}_{t+h})=Var(\hat{y}_{t})=Var(\hat{c}) + Var(\varepsilon_t) = \frac{\sigma^2}{T} + \sigma^2 = \left(1 + \frac{1}{T}\right)\sigma^2$$
The first quantity is the variance associated with the estimation of the parameter $c$, and stems directly from the central limit theorem: the error in estimating the distribution mean through the empirical mean is normally distributed with variance $\sigma^2/T$.

Naive method

The naive method assumes that the true data generation process has the following form:
$$y_{t+1} = y_{t} + \varepsilon_{t+1}$$
We can obtain the formula for the $h$-th step ahead prediction as:
$$\begin{aligned}y_{t+1} & =y_t+\varepsilon_{t+1} \\ y_{t+2} & =y_{t+1}+\varepsilon_{t+2}=y_t+\varepsilon_{t+1}+\varepsilon_{t+2} \\ \vdots & \\ y_{t+h} & =y_t+\sum_{i=1}^h \varepsilon_{t+i}\end{aligned}$$
Since the errors are zero mean, the point prediction and its variance are:
$$\begin{aligned}\hat{y}_{T+h \mid T} & =E\left(y_{T+h \mid T}\right)=y_T \\ \hat{\sigma}_h^2 & =V\left(y_{T+h \mid T}\right)=h \sigma^2\end{aligned}$$
The last expression follows from the fact that a sum of independent Gaussian random variables is still Gaussian, with variance equal to the sum of the individual variances.
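The $h\sigma^2$ result can be verified with a quick Monte Carlo experiment (a sketch; the path count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, h, n_paths = 1.0, 5, 200_000

# Simulate the h future shocks of the random walk for many paths;
# y_{t+h} - y_t is just the sum of the h shocks.
shocks = rng.normal(0.0, sigma, size=(n_paths, h))
y_h = shocks.sum(axis=1)

# The empirical variance should come out close to h * sigma^2 = 5
print(round(float(y_h.var()), 2))
```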

Seasonal method

The seasonal naive method assumes the following data generation process: $y_{t} = y_{t-m}+\varepsilon_{t}$, which yields the forecast $\hat{y}_{t+h \mid t}=y_{t+h-m(k+1)}$.
Iterating the process:
$$\begin{aligned}y_{t+1} & =y_{t+1-m}+\varepsilon_{t+1} \\ y_{t+2} & =y_{t+2-m}+\varepsilon_{t+2} \\ \vdots & \\ y_{t+m} & =y_t+\varepsilon_{t+m} \\ y_{t+m+1} & =y_{t+1}+\varepsilon_{t+m+1}=y_{t+1-m}+\varepsilon_{t+1}+\varepsilon_{t+m+1} \\ \vdots & \\ y_{t+2 m} & =y_{t+m}+\varepsilon_{t+2 m}=y_t+\varepsilon_{t+m}+\varepsilon_{t+2 m} \\ y_{t+2 m+1} & =y_{t+m+1}+\varepsilon_{t+2 m+1}=y_{t+1-m}+\varepsilon_{t+1}+\varepsilon_{t+m+1}+\varepsilon_{t+2 m+1} \\ \vdots & \\ y_{t+h} & =y_{t+h-m(k+1)}+\sum_{i=0}^k \varepsilon_{t+h+m(i-k)}\end{aligned}$$
Thus:
$$\begin{gathered}\hat{y}_{T+h \mid T}=E\left(y_{T+h \mid T}\right)=y_{T+h-m(k+1)} \\ \hat{\sigma}_h^2=V\left(y_{T+h \mid T}\right)=(k+1) \sigma^2\end{gathered}$$