Consider the random process $x_t = x_{t-1} + c + \varepsilon_t$, where $c$ is a constant drift and $\varepsilon_t$ is white noise, also known as a random walk with drift.
What’s the mean of the process at time t?
Is the process ergodic?
Can you make it stationary and ergodic by taking first differences?
Enforcing stationarity
We have seen that the random walk process $x_t = x_{t-1} + c + \varepsilon_t$
can be made mean-ergodic and stationary by taking first differences. Are there other methods to “regularize” a time series, so that we can better learn to predict it?
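As a quick check of this claim, here is a minimal NumPy sketch on simulated data (the drift c and noise level are arbitrary choices) showing that the differenced walk has a stable mean while the walk itself does not:

```python
import numpy as np

rng = np.random.default_rng(0)
c, sigma, T = 0.5, 1.0, 10_000          # drift, noise std, series length

# Random walk with drift: x_t = x_{t-1} + c + eps_t
eps = rng.normal(0.0, sigma, size=T)
x = np.cumsum(c + eps)

# First differences: d_t = x_t - x_{t-1} = c + eps_t  (i.i.d. around the drift)
d = np.diff(x)

# The walk's sample mean depends on where you stop; the differenced series does not.
print("mean of first half of x :", x[: T // 2].mean())
print("mean of second half of x:", x[T // 2:].mean())
print("mean of diff(x)         :", d.mean(), "(close to the drift c =", c, ")")
```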
Transformations
Simple transformations are easy to explain and often work well enough.
Transformations can have a very large effect on prediction intervals (PIs).
log1p() can be useful for data with zeros.
Choosing logs is a simple way to force forecasts to be positive.
[Figure: the same series plotted under four transformations: original, square root, cube root, and log.]
In general, transformations can be used to:
normalize the data and make it more suitable for modeling and analysis;
stabilize the variance of the data and reduce the impact of outliers;
improve the accuracy of statistical models and predictions by reducing the impact of non-normality in the data.
Box-Cox transform
The Box-Cox transform is $w_t = (y_t^\lambda - 1)/\lambda$ for $\lambda \neq 0$ and $w_t = \log y_t$ for $\lambda = 0$. For different choices of λ we get different transformations:
λ = 1: (No substantive transformation)
λ = 1/2 : (Square root plus linear transformation)
λ = 0: (Natural logarithm)
λ = −1: (Inverse plus 1)
Yeo–Johnson transform
Also allows for zero and negative values.
λ = 1: Identity
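A minimal sketch of both transforms using scipy.stats on simulated data; when λ is not supplied, both functions estimate it by maximum likelihood (the series below are only illustrative):

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(1)
y_pos = rng.lognormal(mean=0.0, sigma=1.0, size=500)            # strictly positive, right-skewed
y_any = np.concatenate([rng.normal(size=400), np.zeros(100)])   # includes zeros

# Box-Cox: only defined for strictly positive data; lambda estimated by maximum likelihood
y_bc, lam_bc = stats.boxcox(y_pos)
print("Box-Cox lambda:", lam_bc)

# Forecasts made on the transformed scale must be mapped back before use
y_restored = inv_boxcox(y_bc, lam_bc)
assert np.allclose(y_restored, y_pos)

# Yeo-Johnson: also defined for zero and negative values
y_yj, lam_yj = stats.yeojohnson(y_any)
print("Yeo-Johnson lambda:", lam_yj)
```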
General patterns - embeddings and sliding windows
A general pattern in univariate time series forecasting is to try to learn the time series dynamics from an embedding of past values. Time embedding is the fundamental technique used by
exponential smoothing
state space models
AR, ARMA, ARIMAX models
stateless models (generic regressors)
Time embedding for forecasting applications is mostly based on ideas from chaotic systems and system identification.
Takens' theorem [1]: the state of some chaotic systems can be deterministically reconstructed from lags of the time series of their states. Furthermore, the system can be linearly embedded.
[1] Huke, J. P. Embedding Nonlinear Dynamical Systems: A Guide to Takens’ Theorem
Takens' theorem ensures we can predict the next values of x by considering just the last 7 lagged values of x.
Subspace system identification [2], one of the most widely used techniques for identifying linear systems, creates a map between Hankel matrices of the inputs and outputs.
[2] S. Joe Qin, An overview of subspace identification
Example: Lorenz system equations:
$$\dot{x} = \sigma(y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z$$
A linear state-space system can be described as a linear combination of states, inputs and noise:
$$x_{t+1} = A x_t + B u_t + w_t, \qquad y_t = C x_t + D u_t + v_t$$
Subspace techniques identify the dynamic matrices $A, B, C, D$ starting from Hankel matrices built from the observed inputs and outputs.
Takens' and Hankel matrices
A univariate time series $x_1, \dots, x_T$ can be time-embedded in a $d$-dimensional matrix using an embedding lag $\tau$:
$$X = \begin{bmatrix} x_1 & x_{1+\tau} & \cdots & x_{1+(d-1)\tau} \\ x_2 & x_{2+\tau} & \cdots & x_{2+(d-1)\tau} \\ \vdots & \vdots & & \vdots \\ x_{T-(d-1)\tau} & x_{T-(d-2)\tau} & \cdots & x_T \end{bmatrix}$$
If $\tau > 1$ we get a Page or Takens matrix; when $\tau = 1$ (maximally overlapping rows) we call it a Hankel matrix.
The common idea behind these techniques is that the dynamics of an unknown system producing an observable time series can be learned from the data. Since, with no loss of generality, a time series can be thought of as the output of a data-generating process, we can try to apply these techniques to the forecasting task.
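As an illustration, a minimal NumPy sketch of such an embedding; the function name time_embed and the choices of d and τ are illustrative:

```python
import numpy as np

def time_embed(x, d, tau=1):
    """Embed a 1-D series into a matrix whose rows are
    [x_t, x_{t+tau}, ..., x_{t+(d-1)tau}]. With tau=1 the result is a Hankel
    matrix (constant anti-diagonals)."""
    x = np.asarray(x)
    n_rows = len(x) - (d - 1) * tau
    if n_rows <= 0:
        raise ValueError("series too short for this (d, tau)")
    return np.stack([x[i : i + (d - 1) * tau + 1 : tau] for i in range(n_rows)])

x = np.arange(10)
H = time_embed(x, d=3, tau=1)   # Hankel matrix of lagged values
E = time_embed(x, d=3, tau=2)   # delay embedding with lag 2
print(H)
print(E)
```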
Moving averages and nonanticipativity
Besides considering the last lagged values of the time series, we can also apply transformations to the embedding matrix $X$. For $\tau = 1$, taking the row-wise mean of $X$ computes the backward moving average of the signal.
Compare the centered (orange) and backward (green) 1-day moving average of the following signal. Which one can be used to forecast it?
Centered (orange) and lagged (green) 2-day moving averages of the power consumption of a Swiss DSO.
If the time series shows trends, subtracting a simple moving average can help make the series stationary, effectively decorrelating it from its past and future lags → we can better estimate the moments of its distribution and its joint CDF.
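A minimal pandas sketch of the two moving averages on simulated hourly data (the 24-sample window is an illustrative stand-in for a 1-day window): the centered average uses future samples and is therefore anticipative, so only the backward one can be fed to a forecaster.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=200, freq="h")
rng = np.random.default_rng(2)
y = pd.Series(np.sin(np.arange(200) * 2 * np.pi / 24) + rng.normal(0, 0.2, 200), index=idx)

# Backward (causal) moving average: uses only past and current samples
ma_backward = y.rolling(window=24).mean()

# Centered moving average: uses future samples too -> not usable as a forecasting feature
ma_centered = y.rolling(window=24, center=True).mean()

# Detrending with the backward average keeps the pipeline non-anticipative
y_detrended = y - ma_backward
```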
Effect of daily detrend on the ACF: the process becomes more stationary as lagged steps decorrelate
STL decompositions
STL decomposition is a popular method for decomposing a time series into its seasonal, trend, and remainder components. The acronym stands for "Seasonal and Trend decomposition using Loess," which refers to the technique used for estimating the components.
Seasonal time series can be decomposed either additively or multiplicatively:
$$y_t = T_t + S_t + R_t \qquad \text{or} \qquad y_t = T_t \times S_t \times R_t$$
where $T_t$ is the trend-cycle, $S_t$ the seasonal component and $R_t$ the remainder.
Since applying the log to both sides of the multiplicative form leads to $\log y_t = \log T_t + \log S_t + \log R_t$, a multiplicative decomposition can be handled by simply taking the logarithm before applying the additive decomposition.
We just need to estimate trend and seasonal components
The seasonal component is obtained by averaging over all seasons
Trend-cycle component is obtained by fitting a loess curve to the data
Pros:
Smoothness of the trend-cycle is also controlled by the user.
Robust to outliers.
Very versatile:
the trend window controls the loess bandwidth applied to the de-seasonalised values;
the season window controls the loess bandwidth applied to the detrended subseries.
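A minimal sketch of an STL decomposition with statsmodels on simulated hourly data; the seasonal window length and the robust flag are illustrative choices:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Hourly series with a daily (period=24) seasonality plus trend and noise
idx = pd.date_range("2024-01-01", periods=24 * 30, freq="h")
rng = np.random.default_rng(3)
y = pd.Series(
    0.01 * np.arange(len(idx))                              # trend
    + 2.0 * np.sin(np.arange(len(idx)) * 2 * np.pi / 24)    # daily seasonality
    + rng.normal(0, 0.3, len(idx)),                         # remainder
    index=idx,
)

# 'seasonal' is the loess window applied to the seasonal subseries (must be odd)
res = STL(y, period=24, seasonal=25, robust=True).fit()
trend, seasonal, remainder = res.trend, res.seasonal, res.resid
```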
Anomaly detection in TS
Why?
Detecting outliers and anomalies and replacing or discarding them before fitting a forecasting model can improve the forecaster’s performance.
How?
Standard outlier filters can be adapted to time series data by applying them to sliding windows of the original signal
Sigma filter and Hampel filter
Sigma filter: mark $x_t$ as an outlier if $|x_t - \mu| > 3\sigma$. Under a normal distribution of the data, this filter corresponds to keeping 99.7% of the observations.
The Hampel filter works by first calculating the median and a robust estimate of the standard deviation over a sliding window of data points.
Centers a window of odd length $2k+1$ at the current sample $x_t$.
Computes the local median, $m_t$, and a standard deviation estimate, $\sigma_t$ (from the median absolute deviation), over the current window of data.
Compares the current sample with $n_\sigma \sigma_t$, where $n_\sigma$ is the threshold value. If $|x_t - m_t| > n_\sigma \sigma_t$, the filter identifies the current sample, $x_t$, as an outlier and replaces it with the median value, $m_t$.
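A minimal NumPy sketch of a rolling Hampel filter following the steps above; the window length and threshold n_sigma are illustrative, and 1.4826 is the usual MAD-to-standard-deviation scaling for Gaussian data:

```python
import numpy as np

def hampel_filter(x, window=11, n_sigma=3.0):
    """Flag x_t as an outlier when it deviates from the local median by more than
    n_sigma robust standard deviations, and replace it with that median.
    Returns (filtered series, boolean outlier mask)."""
    x = np.asarray(x, dtype=float)
    half = window // 2
    y = x.copy()
    is_outlier = np.zeros(len(x), dtype=bool)
    k = 1.4826                                   # scales MAD to a std estimate
    for t in range(len(x)):
        lo, hi = max(0, t - half), min(len(x), t + half + 1)
        w = x[lo:hi]
        med = np.median(w)
        sigma = k * np.median(np.abs(w - med))
        if np.abs(x[t] - med) > n_sigma * sigma:
            is_outlier[t] = True
            y[t] = med
    return y, is_outlier

rng = np.random.default_rng(4)
y = rng.normal(0, 1, 300)
y[[50, 150, 250]] += 12                          # inject spikes
y_clean, mask = hampel_filter(y, window=11, n_sigma=3)
print("detected outliers at:", np.flatnonzero(mask))
```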
Boxplot Filter
The boxplot filter doesn’t assume a particular distribution. An observation is kept if:
$$Q_1 - 1.5\,\mathrm{IQR} \leq x_t \leq Q_3 + 1.5\,\mathrm{IQR}$$
where IQR is the interquartile range $= Q_3 - Q_1$. We can replace the sigma rule with the boxplot rule in the Hampel filter to adapt it to time series data.
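A minimal sketch of the boxplot rule applied on sliding windows, assuming the common 1.5 × IQR whiskers; the window length and function name are illustrative:

```python
import numpy as np

def rolling_iqr_filter(x, window=25, k=1.5):
    """Flag x_t as an outlier if it falls outside [Q1 - k*IQR, Q3 + k*IQR]
    computed on the window centred at t."""
    x = np.asarray(x, dtype=float)
    half = window // 2
    is_outlier = np.zeros(len(x), dtype=bool)
    for t in range(len(x)):
        w = x[max(0, t - half): t + half + 1]
        q1, q3 = np.percentile(w, [25, 75])
        iqr = q3 - q1
        is_outlier[t] = not (q1 - k * iqr <= x[t] <= q3 + k * iqr)
    return is_outlier
```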
Do you think these filters can be combined with STL decomposition for seasonal time series? How?
Spectral Residual Technique
The SR technique was proposed by researchers at Microsoft to spot anomalies in time series. It exploits information in the frequency domain by taking the Fourier transform of the time series, and consists of the following steps:
Take the log of the absolute value of the Fourier transform of the time series.
Smooth the log values using a rolling average.
Compute the spectral residuals by subtracting the smoothed log values from the log values.
Inverse Fourier transform the spectral residuals to obtain a saliency map or anomaly score
The anomaly score can then be used to identify regions of the time series that contain unusual activity. The SR technique has the advantage of being computationally efficient and easy to implement, but it requires some parameter tuning to obtain good results.
H. Ren et al., Time-Series Anomaly Detection Service at Microsoft, 2019.
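A minimal NumPy sketch of the core SR steps (the smoothing window is an illustrative choice, and the original service adds further heuristics on top of this basic recipe):

```python
import numpy as np

def spectral_residual_saliency(x, smooth_window=3):
    """Spectral Residual saliency map: log-amplitude of the FFT minus its moving
    average, recombined with the original phase and mapped back to the time
    domain. Higher values = more anomalous."""
    x = np.asarray(x, dtype=float)
    spec = np.fft.fft(x)
    amplitude = np.abs(spec)
    phase = np.angle(spec)

    log_amp = np.log(amplitude + 1e-8)
    kernel = np.ones(smooth_window) / smooth_window
    smoothed = np.convolve(log_amp, kernel, mode="same")   # rolling average of the log spectrum
    residual = log_amp - smoothed                           # spectral residual

    saliency = np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))
    return saliency

rng = np.random.default_rng(5)
y = np.sin(np.arange(500) * 2 * np.pi / 50) + rng.normal(0, 0.1, 500)
y[300] += 4                                                 # inject an anomaly
s = spectral_residual_saliency(y)
print("most salient index:", int(np.argmax(s)))             # expected near 300
```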
Autoencoders
Autoencoders can be used to generate saliency maps and detect anomalies in TS.
Train an autoencoder on “good” data to reconstruct sliding windows of the original signal, $x_{t-d+1}, \dots, x_t$. The training is done by minimizing the reconstruction error.
At inference time, run the autoencoder on new observed data and use the reconstruction error as a saliency map.
Mark the observations as outliers if the reconstruction error is above a threshold
Schematics of a feed-forward autoencoder. The original sliding window is encoded in a vector with (usually) smaller dimensionality (latent representation), which is then used to reconstruct the original time series.
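A minimal sketch of the idea using scikit-learn's MLPRegressor as a small stand-in autoencoder; the window length, layer sizes and threshold are illustrative, and a real application would typically use a deep-learning framework:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_windows(x, d):
    return np.stack([x[i : i + d] for i in range(len(x) - d + 1)])

rng = np.random.default_rng(6)
train = np.sin(np.arange(2000) * 2 * np.pi / 50) + rng.normal(0, 0.05, 2000)   # "good" data
test = np.sin(np.arange(500) * 2 * np.pi / 50) + rng.normal(0, 0.05, 500)
test[200:210] += 3.0                                                            # injected anomaly

d = 50
X_train, X_test = make_windows(train, d), make_windows(test, d)

# A small MLP with a narrow middle layer acts as the encoder/decoder pair
ae = MLPRegressor(hidden_layer_sizes=(32, 4, 32), max_iter=2000, random_state=0)
ae.fit(X_train, X_train)                       # learn to reconstruct the input windows

recon_err = np.mean((ae.predict(X_test) - X_test) ** 2, axis=1)   # saliency per window
threshold = recon_err.mean() + 3 * recon_err.std()
print("anomalous windows:", np.flatnonzero(recon_err > threshold))
```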
Forecast-based
We can combine ideas from the previous slides and define an anomaly detection strategy which is based on a forecasting model.
Given a forecasting model producing a prediction $\hat{x}_{t+1}$, we can write $x_{t+1} = \hat{x}_{t+1} + e_{t+1}$, where $e_{t+1}$ is the forecast error.
If the observed future point shows too extreme a value of $e_{t+1}$, we can mark it as an outlier.
The threshold is usually chosen based on the assumed distribution of the error, for example a Gaussian. In this case we can use, for example, $|e_{t+1}| > 3\sigma_e$.
from: Blazquez-Garcia et al., A review on outlier/anomaly detection in time series data, 2020, ACM
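A minimal sketch of this strategy, using statsmodels' AutoReg as a stand-in forecaster and the 3σ rule on the in-sample one-step-ahead errors (the lag order and threshold are illustrative):

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(7)
y = np.sin(np.arange(600) * 2 * np.pi / 24) + rng.normal(0, 0.2, 600)
y[400] += 3.0                                     # inject an anomaly

# Any forecaster works here; an autoregressive model is a simple stand-in
model = AutoReg(y, lags=24).fit()
pred = model.predict(start=24, end=len(y) - 1)    # one-step-ahead in-sample predictions
err = y[24:] - pred
sigma_e = err.std()

# Flag points whose forecast error is too extreme under the assumed Gaussian error
outliers = np.flatnonzero(np.abs(err) > 3 * sigma_e) + 24
print("flagged indices:", outliers)
```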
Data imputation in TS
Multivariate Imputation by Chained Equation
Multivariate Imputation by Chained Equation (MICE) is a popular method for imputing missing data in time series analysis.
The method involves using regression models to estimate the missing values of each variable in the time series, based on the other variables in the series. The process is repeated multiple times, with each iteration using the updated estimates of the missing values from the previous iterations to improve the accuracy of the imputations. MICE has the advantage of being able to handle missing data in multiple variables simultaneously, and can provide accurate imputations even when the missing data is non-random or correlated with other variables in the series.
Multivariate Imputation by Chained Equation (MICE) Method:
Create an imputation model for each variable with missing data.
Initialize the imputed values for each variable with missing data, e.g. with the mean.
Iteratively update the imputed values for each variable with missing data, using the other variables in the time series as predictors.
Repeat the previous step until convergence or a maximum number of iterations is reached.
Use the imputed values to perform subsequent analyses.
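A minimal sketch with scikit-learn's IterativeImputer, which implements a MICE-style chained-equation scheme; the simulated correlated series and the fraction of missing entries are illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(8)
t = np.arange(500)
# Three correlated series (e.g. sensors of the same system), stacked as columns
X = np.column_stack([
    np.sin(t * 2 * np.pi / 50) + rng.normal(0, 0.1, t.size),
    np.cos(t * 2 * np.pi / 50) + rng.normal(0, 0.1, t.size),
    np.sin(t * 2 * np.pi / 50 + 0.5) + rng.normal(0, 0.1, t.size),
])
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan        # knock out 10% of the entries

# Each column with missing values is regressed on the others, and the cycle is
# repeated until convergence or max_iter is reached
imputer = IterativeImputer(max_iter=10, initial_strategy="mean", random_state=0)
X_imputed = imputer.fit_transform(X_missing)
print("mean absolute imputation error:",
      np.abs(X_imputed - X)[np.isnan(X_missing)].mean())
```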
Expectation Maximization SVD
The EM-SVD algorithm is particularly well suited to seasonal data.
Normalize the daily embedded data by subtracting the daily mean and dividing by the daily standard deviation.
Replace missing values with the mean values of the corresponding minute of the day.
Compute a truncated SVD decomposition of the matrix, using the first $k$ principal components.
Replace the missing data with the SVD approximation.
Repeat from point 3 until the algorithm converges. Convergence is reached when the mean change in missing data is below 1e-3.
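A minimal NumPy sketch of the loop above, assuming the data has already been embedded as a (days × samples-per-day) matrix with NaNs for missing values; k, the tolerance and the function name are illustrative:

```python
import numpy as np

def em_svd_impute(X, k=3, tol=1e-3, max_iter=100):
    """EM-SVD imputation on a (days x samples-per-day) matrix X with NaNs.
    Missing entries are initialized with the per-time-of-day mean, then repeatedly
    replaced by their rank-k SVD approximation until the mean change of the
    imputed values falls below tol."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Normalize per day (row): subtract daily mean, divide by daily std
    mu = np.nanmean(X, axis=1, keepdims=True)
    sd = np.nanstd(X, axis=1, keepdims=True)
    Z = (X - mu) / sd

    # Initialize missing values with the mean of the corresponding time of day
    col_mean = np.nanmean(Z, axis=0)
    Z[missing] = np.take(col_mean, np.where(missing)[1])

    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z_hat = (U[:, :k] * s[:k]) @ Vt[:k]          # rank-k approximation
        change = np.mean(np.abs(Z_hat[missing] - Z[missing]))
        Z[missing] = Z_hat[missing]
        if change < tol:                              # convergence criterion
            break

    return Z * sd + mu                                # undo the normalization
```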