What can be forecasted

Time series

A time series is a collection of values of a variable $x$ over time:
$$\left\{x_t \,;\, t \in \mathcal{T}\right\}$$
where $\mathcal{T}$ can be regularly or irregularly spaced. Most of the literature assumes discrete, regularly sampled time series.

Time series patterns

Time series can be characterized by:
Trends: a tendency, a long-term change in the mean level
Seasonalities: patterns with constant period lengths
Cyclicities: cyclic patterns with variable lengths
Regime switches / turning points: different trends within a series
Can you spot some of the above features in the following TS plots?
Figure: Australian clay brick production
Figure: Google closing stock prices

Seek and you will find


What can be forecasted?

Given a pool of signals, which ones can be easily forecasted?
Dual question: which signals should I give up forecasting?
The single most important condition in forecasting a time series:
💡
The future is similar to the past
Counterexample: CISCO stock prices between 2000 and 2002
Good news!
In many cases, this condition holds. If it doesn’t, we can try to force it:
remove cycles and trends
remove the effect of the forecast on the signal
transform the data
Good news pt. II
We have some tools that help understand whether a given time series is forecastable. These tools are based on a stochastic process representation of time series.

Stochastic processes

A probability space is a triple $(\Omega, \mathcal{A}, P)$, where
$\Omega$ is a nonempty set, which we call the sample space.
$\mathcal{A}$ is a σ-algebra of subsets of $\Omega$, i.e. a family of subsets closed under countable union and under complement with respect to $\Omega$.
$P$ is a probability measure defined for all members of $\mathcal{A}$, that is, a function $P: \mathcal{A} \rightarrow [0,1]$ such that $P(A)\geq 0$ for all $A \in \mathcal{A}$, $P(\Omega)=1$, and $P\left(\cup_{i=1}^{\infty} A_i\right)=\sum_{i=1}^{\infty} P\left(A_i\right)$ for all sequences of pairwise disjoint sets $A_i \in \mathcal{A}$.
A stochastic process is a family of random variables defined on the same probability space $(\Omega, \mathcal{A}, P)$:
$$\left\{x_t(\omega) \,;\, t \in \mathcal{T}\right\}$$

Examples

$$\left\{x_t(\omega) \,;\, t \in \mathbb{N}\right\} \qquad x_t(\omega) \sim N(0,1) \qquad x_t(\omega) \perp x_s(\omega) \text{ for } t \neq s$$
$$\left\{x_t(\omega) \,;\, t \in \mathbb{N}\right\} \qquad x_t(\omega)-x_{t-1}(\omega) \sim N(0,1)$$
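The two example processes above can be simulated in a few lines (a sketch using NumPy; the seed and series length are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200

# First example: white noise, x_t ~ N(0, 1), independent across time
white_noise = rng.standard_normal(T)

# Second example: random walk, x_t - x_{t-1} ~ N(0, 1),
# i.e. the cumulative sum of independent Gaussian increments
random_walk = np.cumsum(rng.standard_normal(T))
```

Plotting several independent draws of each process makes the difference visible: white-noise paths stay around zero, while random-walk paths drift apart over time.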

Joint distribution

The joint distribution function of $\left(x_{t_1}(\omega), x_{t_2}(\omega),\dots, x_{t_s}(\omega)\right)$ is defined as:
$$F_{t_1, t_2, \ldots, t_s}\left(b_1, b_2, \ldots, b_s\right)=P\left(x_{t_1}(\omega) \leq b_1, x_{t_2}(\omega) \leq b_2, \ldots, x_{t_s}(\omega) \leq b_s\right)$$
The above expression is a cumulative distribution function over a multivariate space; when its marginals are uniform, such a function is known as a copula.
If $F$ were known for a particular stochastic process $x_t(\omega)$, we could answer questions such as:
What is the probability that the process $\left\{x_t(\omega) \,;\, t \in \mathbb{Z}\right\}$ passes through $[a, b]$ at time $t$?
What is the probability that the process $\left\{x_t(\omega) \,;\, t \in \mathbb{Z}\right\}$ passes through $[a, b]$ at time $t_1$ and through $[c, d]$ at time $t_2$?
It turns out that this ability is pretty much all we need to know about the stochastic process.
💡
Theorem. Under general conditions the probabilistic structure of the stochastic process $\left\{x_t(\omega) \,;\, t \in \mathbb{Z}\right\}$ is completely specified if we are given the joint probability distribution of $\left(x_{t_1}(\omega), x_{t_2}(\omega),\dots, x_{t_s}(\omega)\right)$ for all values of $s$ and for all choices of $\left\{t_1, t_2, \ldots, t_s\right\}$ (Priestley 1981, p. 104).
In fact, once $F$ is known, we can generate an infinite set of realizations of the stochastic process, $x_t(\omega_i)$ for $i=1, 2,\ldots$, also called scenarios.
Figure: 7 days ahead forecast of the power consumption of a distribution system operator. Blue: observed. Violet: point forecast and possible scenarios.
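As a toy illustration (not the model behind the figure), suppose the process were known to be a Gaussian random walk; scenarios are then just independent draws of its sample paths:

```python
import numpy as np

rng = np.random.default_rng(3)
horizon, n_scenarios = 48, 20  # arbitrary choices for the sketch

# Each omega_i gives one realization (scenario) of the process over the horizon
scenarios = np.cumsum(rng.normal(0.0, 1.0, size=(n_scenarios, horizon)), axis=1)

# A point forecast can be recovered e.g. as the average over scenarios
point_forecast = scenarios.mean(axis=0)
```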
If the future is not similar to the past, we cannot estimate $F_{t_1, t_2, \ldots, t_s}$ for arbitrary values of $s$. Even for a univariate Gaussian stochastic process $\left\{x_t(\omega) \,;\, t \in \mathbb{Z}\right\}$, where $x_t \sim N\left(\mu_t, \sigma_t^2\right)$, this requires estimating
$$F_t(b)=\int_{-\infty}^b \frac{1}{\sqrt{2 \pi \sigma_t^2}} \exp \left\{-\frac{1}{2}\left(\frac{v-\mu_t}{\sigma_t}\right)^2\right\} d v \quad \text{for } t=0, \pm 1, \ldots$$
To reduce the number of parameters to be learned (in the example above, $\{(\mu_t,\sigma_t)_{t=1}^{t_s}\}$), some regularity conditions must hold:
the TS must be somehow homogeneous in time
the process must have limited memory
Let’s take a step back and ask a somewhat simpler question:
💡
Is it possible to estimate statistical moments of the time series?

Mean, covariance and where to find them

We define the mean and the (auto)covariance of a stochastic process as:
$$\mu_x(t)=\mathbb{E}\left(x_t\right)$$
$$\gamma_x(t_1, t_2)=\gamma_x(t, t-k)=\gamma_x(k)=\mathbb{E}\left[\left(x_{t}-\mu\right)\left(x_{t-k}-\mu\right)\right]$$
Since we can only observe one realization of the stochastic process in time, the mean and covariance can only be estimated by time averaging (as opposed to averaging over sample paths).

Stationary processes and ergodicity

A process is said to be stationary (in a given window) if the joint CDF does not change when sliding along the time axis:
$$F_{t_1+k,\, t_2+k,\, \ldots,\, t_s+k} = F_{t_1,\, t_2,\, \ldots,\, t_s}$$
Additionally, a process is said to be ergodic w.r.t. one of its moments if we can estimate that moment by observing a sufficiently long realization of the process, with the estimate converging quadratically to the true value.
In other words, imagine observing a set of sample paths $x_j = \{x_{j,t_1},x_{j,t_2},x_{j,t_3},\dots, x_{j,T}\}$. The process is ergodic if averaging over different $x_j$ gives the same result as averaging over time.
❓ Which one of the two examples of stochastic processes above is ergodic?
💡
Slutsky’s Theorem. Let $x_t$ be a stationary process with mean $\mu$ and autocovariance $\gamma_k$. If $\lim_{k\rightarrow \infty }\gamma_k = 0$ or $\sum_{k=0}^{\infty}\left|\gamma_k\right|<\infty$, then $x_t$ is mean-ergodic.
Intuitively, this result tells us that if any two random variables positioned far apart in the sequence are almost uncorrelated, then new information is continually added, so that the sample mean approaches the population mean.
By the law of large numbers, good estimates of the mean and covariance of an ergodic process are:
$$\hat{\mu}_T=\frac{1}{T}\sum_{t=1}^{T}x_t$$
$$\hat{\gamma}_x(k)=\frac{1}{T}\sum_{t=k+1}^{T}\left(x_t-\hat{\mu}_T\right)\left(x_{t-k}-\hat{\mu}_T\right)$$
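These time-average estimators are straightforward to implement (a sketch; the function names are mine, and white noise is used as a sanity check):

```python
import numpy as np

def sample_mean(x):
    """Time-average estimate of the process mean."""
    return np.mean(x)

def sample_autocov(x, k):
    """Sample autocovariance at lag k (T in the denominator, so slightly biased)."""
    T = len(x)
    mu = np.mean(x)
    return np.sum((x[k:] - mu) * (x[:T - k] - mu)) / T

# Sanity check on white noise, where gamma_0 = 1 and gamma_k = 0 for k > 0
rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
```

On a long enough white-noise realization, `sample_autocov(x, 0)` lands near 1 and the lagged autocovariances near 0, as the theory predicts.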
Beware! In some cases these estimates do not converge quickly.
Consider the case in which the stochastic process has distribution:
$$F_X(x)= \begin{cases}1-\left(\frac{x_{\mathrm{m}}}{x}\right)^\alpha & x \geq x_{\mathrm{m}} \\ 0 & x<x_{\mathrm{m}}\end{cases}$$
This is known as the Pareto distribution, a power-law distribution. Such distributions are also known as fat-tailed distributions. Two notable examples of fat-tailed natural variables are earthquakes and pandemics.
Figure: Top: pdfs for the normal and Pareto distributions with α = 1.12. Bottom: convergence of the sample means.
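The slow convergence can be reproduced numerically (a sketch; note that NumPy's `pareto` draws the Lomax form, so we shift and scale it to obtain the Pareto I distribution above):

```python
import numpy as np

rng = np.random.default_rng(42)
n, alpha, x_m = 100_000, 1.12, 1.0

normal = rng.standard_normal(n)
pareto = x_m * (1.0 + rng.pareto(alpha, n))  # Pareto I with minimum x_m

steps = np.arange(1, n + 1)
running_mean_normal = np.cumsum(normal) / steps
running_mean_pareto = np.cumsum(pareto) / steps

# True Pareto mean is alpha * x_m / (alpha - 1), about 9.33 for alpha = 1.12;
# the running mean is typically still far from it after 100k samples,
# while the normal running mean has long settled near 0.
```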

The ACF

The autocorrelation function is defined as:
$$\rho_k = \frac{\gamma_k}{\gamma_0}=\frac{\operatorname{cov}(x_t, x_{t-k})}{\sigma(x_t)\,\sigma(x_{t-k})} \stackrel{\text{stationary}}{=} \frac{\operatorname{cov}(x_t, x_{t-k})}{\sigma^2(x)}$$
If we only have $T$ timesteps to evaluate $\rho_k$, we must be careful to keep $T-k$ large enough for the estimate to be meaningful.
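A direct estimator, using the $T-k$ available pairs at each lag (a sketch; `acf` is a hypothetical helper, not a library function):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation rho_k = gamma_hat_k / gamma_hat_0 for k = 0..max_lag."""
    T = len(x)
    mu = np.mean(x)
    gamma0 = np.sum((x - mu) ** 2) / T
    return np.array([
        np.sum((x[k:] - mu) * (x[:T - k] - mu)) / T / gamma0
        for k in range(max_lag + 1)
    ])
```

For white noise every $\hat{\rho}_k$ with $k>0$ stays near zero, while for a persistent signal the ACF decays slowly with the lag.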
Figure: When evaluating the ACF for lag 9, we only get 2 observations to estimate the correlation in a 20-step time series.
Which is which?

ACF as an IID test

The ACF can be used to construct the following hypothesis test:
$$H_0: x_t \sim \text{i.i.d.}(0, \sigma^2)$$
If a process is i.i.d., we have that $\rho_k = 0$ for all $k \neq 0$.
A decision rule for the hypothesis test could be:
🎲
reject $H_0$ if $\left|\hat{\rho}_k\right|>c$
that is, for a given lag $k$ we require the magnitude of the correlation to be less than a constant value.
How do we choose c? We can request a 95% confidence level:
$$\mathbb{P}\left(\left|\hat{\rho}_k\right|<c \mid H_0\right) = \mathbb{P}\left(-c\leq\hat{\rho}_k\leq c \mid H_0\right) = 0.95$$
We know by the CLT that $\hat{\rho}_k\sim \mathcal{N}(0,1/n)$ for i.i.d. processes, where $n$ is the sample size.
This means that $\mathbb{P}\left(-c\leq\hat{\rho}_k\leq c \mid H_0\right) = 0.95$ iff $c=\frac{1.96}{\sqrt n}$.
Why is $\hat{\rho}_k$ normally distributed with zero mean and variance $1/n$?
$$\begin{aligned} \mathbb{E}\,\hat{\rho}_k \propto\ & \mathbb{E}\,\hat{\gamma}_x(k)=\\ &\mathbb{E}\,\frac{1}{T}\sum_{t}\left(x_t-\hat{\mu}_T\right)\left(x_{t-k}-\hat{\mu}_T\right)\stackrel{\mu=0}{=}\\ &\mathbb{E}\,\frac{1}{T}\sum_{t}x_t x_{t-k}\stackrel{\perp}{=}0 \end{aligned}$$
Furthermore, if we call $x_t x_{t-k}=y_t$, the $y_t$ are still i.i.d.; the CLT applied to their average then gives the asymptotic variance $1/n$.
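Putting the decision rule together (a sketch; `iid_test` is a made-up name for this check):

```python
import numpy as np

def iid_test(x, max_lag=20, z=1.96):
    """Return the lags at which H0 (i.i.d.) is rejected: |rho_hat_k| > z / sqrt(n)."""
    n = len(x)
    mu = np.mean(x)
    gamma0 = np.sum((x - mu) ** 2) / n
    c = z / np.sqrt(n)
    rejected = []
    for k in range(1, max_lag + 1):
        rho_k = np.sum((x[k:] - mu) * (x[:n - k] - mu)) / n / gamma0
        if abs(rho_k) > c:
            rejected.append(k)
    return rejected
```

Note that under $H_0$ about 5% of the tested lags are still expected to be rejected by chance, which is one motivation for testing sets of lags jointly.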

Portmanteau tests

Instead of testing a given lag $k$, we can test a set of lags jointly. For an i.i.d. process, the following random variables are approximately chi-square distributed with $K$ degrees of freedom:
$$Q_K=T \sum_{k=1}^K \hat{\rho}_k^2$$
Box-Pierce test
$$Q^*_K=T(T+2) \sum_{k=1}^K (T-k)^{-1}\hat{\rho}_k^2$$
Ljung-Box test
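Both statistics are easy to compute from the sample ACF (a sketch; the result is compared against the chi-square critical value, e.g. about 31.4 for K = 20 at the 5% level):

```python
import numpy as np

def portmanteau(x, K):
    """Box-Pierce Q_K and Ljung-Box Q*_K over lags 1..K."""
    T = len(x)
    mu = np.mean(x)
    gamma0 = np.sum((x - mu) ** 2) / T
    rho2 = np.array([
        (np.sum((x[k:] - mu) * (x[:T - k] - mu)) / T / gamma0) ** 2
        for k in range(1, K + 1)
    ])
    Q = T * np.sum(rho2)
    Q_star = T * (T + 2) * np.sum(rho2 / (T - np.arange(1, K + 1)))
    return Q, Q_star
```

Since $(T+2)/(T-k) > 1$, the Ljung-Box statistic is always at least as large as the Box-Pierce one; the correction improves the chi-square approximation for small samples.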

Spectral and approximate entropy

Entropy can be used as a measure of complexity and forecastability. Entropy is a measure of the information contained in a random event. Since the information contained in two independent events $x \perp y$ must sum, while their joint probability is $p(x, y) = p(x)p(y)$, the only possible form of the information is:
$$-\log _b p(x)$$
Taking the expected value (so that events with low probability count as more informative) gives the entropy:
$$-\sum_{x \in \mathcal{X}} p(x) \log _b p(x)$$
Entropy gives an idea of the complexity of the distribution of the signal. However, the entropy of these two signals is the same:
[0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,…]
[0,1,1,1,0,0,0,1,0,1,0,0,0,1,0,1,1,…]
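A quick check of this point (the second sequence below is a hypothetical scrambled sequence of mine, chosen to have the same number of zeros and ones as the first):

```python
import math
from collections import Counter

def shannon_entropy(seq, b=2):
    """Empirical entropy -sum p log_b p of the value distribution of seq."""
    n = len(seq)
    return -sum(c / n * math.log(c / n, b) for c in Counter(seq).values())

regular = [0, 1] * 8                                    # perfectly predictable
scrambled = [0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]  # irregular

# Both contain eight 0s and eight 1s, so both have entropy of 1 bit,
# even though only the first is trivially forecastable.
```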
A better measure of forecastability is approximate entropy, which looks at segments of length m of a time series (the embedded sequences) and at how regularly they recur:
$$\phi^m(r)=\frac{1}{n} \sum_{i=1}^n \log C_i^m(r)$$
where $C_i^m(r)$ is the fraction of embedded sequences whose maximum discrepancy from the $i$-th one is below $r$; $\phi^m(r)$ is thus an entropy on the distribution of “r-close neighbors” of the embedded sequences. Approximate entropy is then defined as $\phi^m(r)-\phi^{m+1}(r)$.
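A compact implementation under the standard definition $\mathrm{ApEn}=\phi^m(r)-\phi^{m+1}(r)$, using the Chebyshev (max-abs) distance between embedded segments (a sketch; defaults are illustrative):

```python
import numpy as np

def approx_entropy(x, m=2, r=0.2):
    """ApEn(m, r) = phi^m(r) - phi^{m+1}(r)."""
    x = np.asarray(x, dtype=float)

    def phi(mm):
        n = len(x) - mm + 1
        emb = np.array([x[i:i + mm] for i in range(n)])  # embedded sequences
        # Chebyshev distance between every pair of segments
        dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
        C = np.mean(dist <= r, axis=1)   # fraction of r-close neighbors per segment
        return np.mean(np.log(C))        # log-average over segments

    return phi(m) - phi(m + 1)
```

A perfectly periodic signal such as [0,1,0,1,…] gets an ApEn close to zero, while an irregular binary signal of the same entropy gets a much larger value: exactly the distinction the plain entropy misses.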

Benchmark forecasting methods

Some forecasting methods are extremely simple and yet surprisingly effective. They are stateless methods.
Naïve method
This method performs well for $h=1$, or at scales which are smaller than the characteristic time scale of the signal:
$$\hat{y}_{t+h \mid t}=y_t$$
Mean method
We just take the mean value of the time series observed so far
$$\hat{y}_{t+h \mid t}=\bar{y}=\left(y_1+\cdots+y_t\right) / t$$
Conditional Mean method
We take the mean of the observations that were registered in similar conditions $x$:
$$\hat{y}_{t+h \mid t}=\frac{1}{|\mathcal{T}_x|}\sum_{t'\in \mathcal{T}_x}y_{t'} \qquad \mathcal{T}_x=\{t': \operatorname{dist}(x_{t'}, x_{t+h})<\epsilon \}$$
Example: I’m predicting sales for a given product on an online platform. The platform applies time-varying discounts based on different market conditions. We can forecast the product’s sales by taking the mean of the sales registered on the same day of the week when the same discount was applied.
Seasonal Naïve method
$$\hat{y}_{t+h \mid t}=y_{t+h-m(k+1)}$$
where $m$ is the period of the seasonality and $k$ is the integer part of $(h-1)/m$, i.e. the number of complete seasonal cycles in the forecast period.
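The unconditional benchmarks fit in a few lines each (a sketch; the conditional mean is omitted since its similarity features are problem-specific, and the function names are mine):

```python
import numpy as np

def naive(y, h):
    """Repeat the last observation for all h steps."""
    return np.full(h, y[-1])

def mean_method(y, h):
    """Forecast every step with the historical mean."""
    return np.full(h, np.mean(y))

def seasonal_naive(y, h, m):
    """y_{t+i-m(k+1)} with k = floor((i-1)/m): repeat the last full seasonal cycle."""
    t = len(y)
    return np.array([y[t + i - m * ((i - 1) // m + 1) - 1] for i in range(1, h + 1)])
```

For example, with two years of monthly data and m = 12, `seasonal_naive(y, 12, 12)` simply returns last year's twelve values.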

Confidence intervals

Assuming residuals are normal and uncorrelated, the forecast variances of the benchmark methods are (see the derivations in the bonus section):
Mean method: $\hat\sigma_h^2=\left(1+\frac{1}{T}\right)\sigma^2$
Naïve method: $\hat\sigma_h^2=h\,\sigma^2$
Seasonal naïve method: $\hat\sigma_h^2=(k+1)\,\sigma^2$
Note that when $h=1$ and $T$ is large, these all give approximately the same forecast variance.

Time series scores

Standard ML metrics, such as the mean absolute error (MAE), can be used directly.
For time series, it is usually better to normalize these scores.
Example: the mean absolute scaled error, $\mathrm{MASE}=\operatorname{mean}\left(\vert e_{T+h}\vert / Q\right)$. Here $Q$ is a normalizing factor.
For non-seasonal time series, using
$$Q=(T-1)^{-1} \sum_{t=2}^T\left|y_t-y_{t-1}\right|$$
makes the MASE equivalent to the MAE relative to a naïve method.
For seasonal time series, using
$$Q=(T-m)^{-1} \sum_{t=m+1}^T\left|y_t-y_{t-m}\right|$$
makes the MASE equivalent to the MAE relative to a seasonal naïve method.
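A minimal MASE implementation (a sketch; `m=1` reproduces the non-seasonal scaling):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean absolute scaled error: test-set MAE divided by the in-sample MAE
    of the (seasonal) naive method."""
    y_train = np.asarray(y_train, dtype=float)
    Q = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / Q
```

Values below 1 mean the forecast beats the (in-sample) naïve benchmark; values above 1 mean it does worse.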

Bonus: Confidence intervals derivation

Mean method

The mean model is
$$y_{t} = c + \varepsilon_t$$
where $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$. For this method, we need to estimate $c$, which is done via the empirical mean:
$$\hat{c} = \frac{1}{T}\sum_{t=1}^{T}y_t$$
Since the model treats the predicted variable as constant, the prediction error is independent of the step ahead $h$. The predictive distribution is Gaussian, with variance:
$$Var(\hat{y}_{t+h})=Var(\hat{y}_{t})=Var(\hat{c}) + Var(\varepsilon_t) = \frac{\sigma^2}{T} + \sigma^2 = \left(1 + \frac{1}{T}\right)\sigma^2$$
The first quantity is the variance associated with the estimation of the parameter $c$, and stems directly from the central limit theorem: the error in estimating the distribution mean through the empirical mean is normally distributed with variance $\sigma^2/T$.

Naive method

The naive method assumes that the true data generation process has the following form:
$$y_{t+1} = y_{t} + \varepsilon_{t+1}$$
We can obtain the formula for the $h$-th step ahead prediction as:
$$\begin{aligned}y_{t+1} & =y_t+\varepsilon_{t+1} \\ y_{t+2} & =y_{t+1}+\varepsilon_{t+2}=y_t+\varepsilon_{t+1}+\varepsilon_{t+2} \\ \vdots & \\ y_{t+h} & =y_t+\sum_{i=1}^h \varepsilon_{t+i}\end{aligned}$$
Since the errors are zero mean, the point prediction and its variance are:
$$\begin{aligned}\hat{y}_{T+h \mid T} & =E\left(y_{T+h \mid T}\right)=y_T \\ \hat{\sigma}_h^2 & =V\left(y_{T+h \mid T}\right)=h \sigma^2\end{aligned}$$
The last expression follows from the fact that a sum of independent Gaussian random variables is still Gaussian, with variance equal to the sum of the individual variances.
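The $h\sigma^2$ result can be verified with a quick Monte Carlo experiment (a sketch; the path count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, h, n_paths = 1.0, 5, 200_000

# Simulate the h future shocks of the random walk for many paths;
# y_{t+h} - y_t is just the sum of the h shocks.
shocks = rng.normal(0.0, sigma, size=(n_paths, h))
y_h = shocks.sum(axis=1)

# The empirical variance should come out close to h * sigma^2 = 5
print(round(float(y_h.var()), 2))
```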

Seasonal method

The seasonal naive method assumes the following data generation process: $y_{t} = y_{t-m}+\varepsilon_{t}$, which yields the forecast $\hat{y}_{t+h \mid t}=y_{t+h-m(k+1)}$.
Iterating the process:
$$\begin{aligned}y_{t+1} & =y_{t+1-m}+\varepsilon_{t+1} \\ y_{t+2} & =y_{t+2-m}+\varepsilon_{t+2} \\ \vdots & \\ y_{t+m} & =y_t+\varepsilon_{t+m} \\ y_{t+m+1} & =y_{t+1}+\varepsilon_{t+m+1}=y_{t+1-m}+\varepsilon_{t+1}+\varepsilon_{t+m+1} \\ \vdots & \\ y_{t+2 m} & =y_{t+m}+\varepsilon_{t+2 m}=y_t+\varepsilon_{t+m}+\varepsilon_{t+2 m} \\ y_{t+2 m+1} & =y_{t+m+1}+\varepsilon_{t+2 m+1}=y_{t+1-m}+\varepsilon_{t+1}+\varepsilon_{t+m+1}+\varepsilon_{t+2 m+1} \\ \vdots & \\ y_{t+h} & =y_{t+h-m(k+1)}+\sum_{i=0}^k \varepsilon_{t+h+m(i-k)}\end{aligned}$$
Thus:
$$\begin{gathered}\hat{y}_{T+h \mid T}=E\left(y_{T+h \mid T}\right)=y_{T+h-m(k+1)} \\ \hat{\sigma}_h^2=V\left(y_{T+h \mid T}\right)=(k+1) \sigma^2\end{gathered}$$