Stationarity and assumptions

Consider the random process $x_t = c + x_{t-1} + \epsilon$, also known as a random walk with drift.
What’s the mean of the process at time t?
Is the process ergodic?
Can you make it stationary and ergodic by taking first differences?

Enforcing stationarity

We have seen that the random walk process
x_t = x_{t-1} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1), \quad \epsilon_k \perp \epsilon_j
can be made mean-ergodic and stationary by taking first differences. Are there other methods to “regularize” a time series, so that we can better learn to predict it?
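As a quick numerical check of the claim above, here is a minimal sketch (NumPy only; the seed and series length are arbitrary assumptions) showing that first differences of a simulated random walk recover an approximately stationary, i.i.d. series:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(0, 1, size=5000)
x = np.cumsum(eps)       # random walk: x_t = x_{t-1} + eps_t
dx = np.diff(x)          # first differences recover the i.i.d. noise

# The walk's sample variance keeps changing over time, while the differenced
# series has approximately constant mean and variance.
print("var of first vs second half of x :", x[:2500].var(), x[2500:].var())
print("var of first vs second half of dx:", dx[:2500].var(), dx[2500:].var())
```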

Transformations

Simple transformations are easier to explain and often work well enough. Transformations can have a very large effect on prediction intervals (PI). log1p() can be useful for data with zeros. Choosing logs is a simple way to force forecasts to be positive.
Figure: the original series compared with its square-root, cube-root, and log transformations.
In general, transformations can
normalize the data and make it more suitable for modeling and analysis.
stabilize the variance of the data and reduce the impact of outliers.
improve the accuracy of statistical models and predictions by reducing the impact of non-normality in the data.

Box-Cox transform

w_t = \begin{cases} \log(y_t), & \lambda = 0 \\ \left(\operatorname{sign}(y_t)\,\vert y_t\vert^{\lambda} - 1\right)/\lambda, & \lambda \neq 0 \end{cases}
For different choices of λ we get different transformations:
λ = 1: no substantive transformation
λ = 1/2: square root plus linear transformation
λ = 0: natural logarithm
λ = −1: inverse plus 1
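A minimal sketch of the Box-Cox transform using scipy.stats.boxcox, which implements the classical transform for strictly positive data (the signed variant in the formula above extends it to negative values); the generated series and the fixed λ choices are just illustrations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # strictly positive, skewed series

# lambda estimated by maximum likelihood; scipy's boxcox requires y > 0
w, lam = stats.boxcox(y)
print(f"estimated lambda: {lam:.3f}")

# Fixed choices of lambda reproduce the special cases listed above
w_log  = stats.boxcox(y, lmbda=0.0)    # natural logarithm
w_sqrt = stats.boxcox(y, lmbda=0.5)    # square root plus linear transformation
```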

Yeo–Johnson transform

y_i^{(\lambda)} = \begin{cases} \left((y_i+1)^{\lambda}-1\right)/\lambda & \text{if } \lambda \neq 0,\ y_i \geq 0 \\ \log(y_i+1) & \text{if } \lambda = 0,\ y_i \geq 0 \\ -\left((-y_i+1)^{2-\lambda}-1\right)/(2-\lambda) & \text{if } \lambda \neq 2,\ y_i < 0 \\ -\log(-y_i+1) & \text{if } \lambda = 2,\ y_i < 0 \end{cases}
It also allows for zero and negative values.
λ = 1: Identity
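A minimal sketch of the Yeo–Johnson transform using scikit-learn's PowerTransformer, which accepts zero and negative values; the toy mixed-sign data is an assumption, and λ is fitted by maximum likelihood:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(-2, 1, 200), rng.exponential(3, 300)])  # mixed-sign data

# Yeo-Johnson handles zero and negative values; lambda is fitted by maximum likelihood
pt = PowerTransformer(method="yeo-johnson", standardize=False)
w = pt.fit_transform(y.reshape(-1, 1)).ravel()
print("fitted lambda:", pt.lambdas_[0])
```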

General patterns - embeddings and sliding windows

A general pattern in univariate time series forecasting is to try to learn the time series dynamics from an embedding of past values. Time embedding is the fundamental technique used by
exponential smoothing
state space models
AR, ARMA, ARIMAX models
stateless models (generic regressors)
Time embedding for forecasting applications is mostly based on ideas from chaotic systems and system identification.
Takens’ theorem [1]: some chaotic systems can be deterministically reconstructed from lags of the time series of their states. Furthermore, the system can be linearly embedded.
🧾
[1] Huke, J. P. Embedding Nonlinear Dynamical Systems: A Guide to Takens’ Theorem
Takens’ theorem ensures we can predict the next values of x by considering just the last 7 lagged values of x.
Subspace system identification [2], one of the most widely used techniques for identifying linear systems, creates a map between Hankel matrices of the inputs and outputs.
🧾
[2] S. Joe Qin, An overview of subspace identification
Example: Lorenz system equations:
\left\{\begin{array}{l}\frac{dx}{dt}=\sigma y-\sigma x \\ \frac{dy}{dt}=x\rho-y-xz \\ \frac{dz}{dt}=xy-\beta z\end{array}\right.
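A minimal simulation sketch of the Lorenz system above using SciPy; the classic parameter values (σ = 10, ρ = 28, β = 8/3), the initial condition, and the time grid are assumptions made for illustration:

```python
import numpy as np
from scipy.integrate import solve_ivp

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0   # classic chaotic regime (an assumption)

def lorenz(t, state):
    x, y, z = state
    return [sigma * (y - x), x * rho - y - x * z, x * y - beta * z]

t_eval = np.linspace(0, 50, 5000)
sol = solve_ivp(lorenz, (0, 50), y0=[1.0, 1.0, 1.0], t_eval=t_eval)
x = sol.y[0]   # a scalar observable that can be fed to the embeddings discussed below
```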
A linear state-space system can be described as a linear combination of states, inputs and noise
\begin{aligned} x_{k+1} &= A x_k + B u_k + w_k \\ y_k &= C x_k + D u_k + v_k \end{aligned}
Subspace techniques identify the dynamic matrices $A, B, C, D$ starting from $[y_t, y_{t-1}, \dots, y_{t-m}]_{t=1}^T$ and $[u_t, u_{t-1}, \dots, u_{t-m}]_{t=1}^T$.

Takens’ and Hankel matrices

A univariate time series $[x_t]_{t=1}^T,\ x \in \mathbb{R}$, can be time-embedded in an $(m, d)$-dimensional matrix using an embedding lag $d$:
E(m, d) = \begin{pmatrix} x(t) & x(t-d) & \cdots & x(t-(m-1)d) \\ x(t+1) & x(t-d+1) & \cdots & x(t-(m-1)d+1) \\ x(t+2) & x(t-d+2) & \cdots & x(t-(m-1)d+2) \\ \vdots & \vdots & & \vdots \end{pmatrix}
For general $d \geq 1$ we get a Page or Takens matrix; when $d = 1$ we call it a Hankel matrix.
The common idea behind these techniques is that the dynamics of an unknown system producing an observable time series $y_t$ can be learned from the data. Since, with no loss of generality, a time series can be thought of as the output of a data generation process, we can try to apply these techniques to the forecasting task.
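A minimal sketch of such an embedding in NumPy: the helper time_embed is a hypothetical name, the toy noisy sine series is an assumption, and the 7-lag Hankel case (d = 1) is set up as a one-step-ahead regression problem in the spirit of the Takens example above.

```python
import numpy as np

def time_embed(x, m, d=1):
    """(m, d)-embedding: row t is [x(t), x(t-d), ..., x(t-(m-1)d)]."""
    x = np.asarray(x)
    lags = np.arange(m) * d
    return np.stack([x[t - lags] for t in range((m - 1) * d, len(x))])

# Toy usage: 7 lagged values as features, next value as regression target
rng = np.random.default_rng(0)
x = np.sin(0.1 * np.arange(300)) + 0.05 * rng.normal(size=300)

E = time_embed(x[:-1], m=7, d=1)   # Hankel-style lag matrix (d = 1)
y = x[7:]                          # one-step-ahead targets aligned with the rows of E
```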

Moving averages and nonanticipativity

Besides considering the last $m$ $d$-lagged values of the time series, we can also apply transformations to the embedding matrix $E(m, d)$. For $d = 1$, taking the row-wise mean of $E(m, 1)$ computes the backward moving average of the signal.
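A minimal sketch, with pandas, of the backward versus centered moving average on an assumed toy trending series; the window length of 24 samples is also an assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=500).cumsum())   # toy trending signal

w = 24  # window length, e.g. one day of hourly samples
backward_ma = s.rolling(window=w).mean()               # right-aligned: uses only past samples
centered_ma = s.rolling(window=w, center=True).mean()  # centered: uses samples from both past and future
```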
❓ Compare the centered (orange) and backward (green) 1-day moving average of the following signal. Which one can be used to forecast it?
Centered (orange) and lagged (green) 2-day moving averages for the power consumption of a Swiss DSO.
If the time series shows trends, subtracting a simple moving average (detrending) can help make the series stationary, effectively de-correlating $x_t$ from its past and future lags → we can better estimate the moments of the distribution of $x_t$, and its joint CDF
F_{t_1, t_2, \dots, t_s}(b_1, b_2, \dots, b_s) = \mathrm{P}\left(x_{t_1}(\omega) \leq b_1,\ x_{t_2}(\omega) \leq b_2,\ \dots,\ x_{t_s}(\omega) \leq b_s\right).
Effect of daily detrend on the ACF: the process becomes more stationary as lagged steps decorrelate

STL decompositions

STL decomposition is a popular method for decomposing a time series into its seasonal, trend, and remainder components. The acronym stands for "Seasonal and Trend decomposition using Loess," which refers to the technique used for estimating the components.
Seasonal time series can be decomposed as:
y_t = f(S_t, T_t, R_t) = \begin{cases} S_t + T_t + R_t & \text{additive} \\ S_t \, T_t \, R_t & \text{multiplicative} \end{cases}
Since applying the log to both sides of the multiplicative decomposition leads to
\log(y_t) = \log(S_t) + \log(T_t) + \log(R_t)
multiplicative decomposition → simply take the logarithm before applying the additive decomposition
We just need to estimate the trend $T_t$ and seasonal component $S_t$:
The seasonal component is obtained by averaging over all seasons
Trend-cycle component is obtained by fitting a loess curve to the data
Pros:
Smoothness of trend-cycle also controlled by user.
Robust to outliers
Very versatile:
trend window controls loess bandwidth applied to de-seasonalised values.
season window controls loess bandwidth applied to detrended subseries
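A minimal sketch of an STL decomposition with statsmodels; the toy hourly series with daily period, the seasonal/trend window lengths, and the robust flag are all illustrative assumptions:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Toy hourly series with a daily (period = 24) seasonality and a slow trend
rng = np.random.default_rng(0)
t = np.arange(24 * 60)
y = pd.Series(10 + 0.01 * t + 2 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.5, t.size))

# seasonal/trend are the loess window lengths (must be odd); robust down-weights outliers
res = STL(y, period=24, seasonal=25, trend=51, robust=True).fit()
seasonal, trend, remainder = res.seasonal, res.trend, res.resid
```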

Anomaly detection in TS

Why?
Detecting outliers and anomalies and replacing or discarding them before fitting a forecasting model can improve the forecaster’s performance.
How?
Standard outlier filters can be adapted to time series data by applying them to sliding windows of the original signal

Sigma filter and Hampel filter

Sigma filter: mark $x_t$ as an outlier if $\vert x_t - \bar{x} \vert > 3\sigma$. Under a normal distribution of the data, this filter corresponds to keeping 99.7% of the observations.
The Hampel filter works by calculating the local median and a robust estimate of the standard deviation over a sliding window of data points. It:
Centers the window of odd length at the current sample $x_s$.
Computes the local median, $\bar{x}_s$, and standard deviation, $\sigma_s$, over the current window of data.
Compares the current sample with $n_\sigma \sigma_s$, where $n_\sigma$ is the threshold value. If $\vert x_s - \bar{x}_s \vert > n_\sigma \sigma_s$, the filter identifies the current sample $x_s$ as an outlier and replaces it with the median value $\bar{x}_s$.
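A minimal sketch of a rolling Hampel filter in NumPy/pandas, using a MAD-based robust estimate of the local standard deviation; hampel_filter is a hypothetical helper, and the window length and threshold are assumptions:

```python
import numpy as np
import pandas as pd

def hampel_filter(x, window=11, n_sigma=3.0):
    """Rolling Hampel filter: replace samples deviating from the local median
    by more than n_sigma robust standard deviations (1.4826 * MAD)."""
    s = pd.Series(x)
    med = s.rolling(window, center=True, min_periods=1).median()
    mad = (s - med).abs().rolling(window, center=True, min_periods=1).median()
    sigma = 1.4826 * mad                      # MAD-based estimate of the local std
    outliers = (s - med).abs() > n_sigma * sigma
    cleaned = s.where(~outliers, med)         # replace flagged samples with the local median
    return cleaned.to_numpy(), outliers.to_numpy()

# Toy usage: a single spike in a flat signal is flagged and replaced
cleaned, is_outlier = hampel_filter(np.r_[np.zeros(50), 8.0, np.zeros(50)])
```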

Boxplot Filter

The boxplot filter doesn’t assume a particular distribution. An observation is kept if:
q_{0.25} - 1.5\,\mathrm{IQR} \leq x \leq q_{0.75} + 1.5\,\mathrm{IQR}
where IQR is the interquartile range $q_{0.75} - q_{0.25}$. We can replace the sigma rule with the boxplot rule in the Hampel filter to adapt it to time series data.
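A minimal sketch of the boxplot rule applied on sliding windows, using pandas rolling quantiles; rolling_boxplot_filter is a hypothetical helper and the window length is an assumption:

```python
import numpy as np
import pandas as pd

def rolling_boxplot_filter(x, window=25):
    """Flag samples outside [q25 - 1.5 IQR, q75 + 1.5 IQR] computed on a centered window."""
    s = pd.Series(x)
    q25 = s.rolling(window, center=True, min_periods=1).quantile(0.25)
    q75 = s.rolling(window, center=True, min_periods=1).quantile(0.75)
    iqr = q75 - q25
    return ((s < q25 - 1.5 * iqr) | (s > q75 + 1.5 * iqr)).to_numpy()
```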
❓ Do you think these filters can be combined with STL decomposition for seasonal time series? How?

Spectral Residual Technique

The SR technique was proposed by researchers at Microsoft to spot anomalies in time series. It exploits information in the frequency domain by taking the Fourier transform of the time series, and consists of the following steps:
Take the log of the absolute value of the Fourier transform of the time series.
Smooth the log values using a rolling average.
Compute the spectral residuals by subtracting the smoothed log values from the log values.
Inverse Fourier transform the spectral residuals to obtain a saliency map or anomaly score
The anomaly score can then be used to identify regions of the time series that contain unusual activity. The SR technique has the advantage of being computationally efficient and easy to implement, but it requires some parameter tuning to obtain good results.
\begin{aligned}
& A(f), P(f) = \text{Amplitude}(\mathfrak{F}(\mathbf{x})),\ \text{Phase}(\mathfrak{F}(\mathbf{x})) \\
& L(f) = \log(A(f)) \\
& AL(f) = h_q(f) \cdot L(f) \\
& R(f) = L(f) - AL(f) \\
& S(\mathbf{x}) = \left\Vert \mathfrak{F}^{-1}\left(\exp(R(f) + i P(f))\right) \right\Vert
\end{aligned}
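A minimal NumPy sketch of the computation above; spectral_residual_saliency is a hypothetical helper, and the moving-average length q and the small constant inside the log are assumptions:

```python
import numpy as np

def spectral_residual_saliency(x, q=3):
    """Saliency map S(x) from the spectral residual of the log-amplitude spectrum.
    q is the length of the moving-average filter h_q (a tuning parameter)."""
    x = np.asarray(x, dtype=float)
    F = np.fft.fft(x)
    A, P = np.abs(F), np.angle(F)
    L = np.log(A + 1e-8)                                 # log amplitude (avoid log(0))
    AL = np.convolve(L, np.ones(q) / q, mode="same")     # smoothed log amplitude h_q * L
    R = L - AL                                           # spectral residual
    return np.abs(np.fft.ifft(np.exp(R + 1j * P)))       # back to time domain: saliency map
```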
Time-Series Anomaly Detection Service at Microsoft, H. Ren et al., 2019

Autoencoders

Autoencoders can be used to generate saliency maps and detect anomalies in TS.
Train an autoencoder on “good” data to reconstruct sliding windows of the original signal, $\hat{x}_s = [\hat{x}_t, \hat{x}_{t+1}, \dots, \hat{x}_{t+h}]$. The training is done by minimizing the reconstruction error $\Vert x_s - \hat{x}_s \Vert^2$.
At inference time, run the autoencoder on new observed data, and use the reconstruction error $\Vert x_s - \hat{x}_s \Vert^2$ as a saliency map.
Mark the observations as outliers if the reconstruction error is above a threshold
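A minimal sketch of this recipe using scikit-learn's MLPRegressor as a small feed-forward autoencoder (trained to reproduce its own input); the toy series, injected anomaly, window length, bottleneck size, and threshold rule are all assumptions:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
x = np.sin(0.2 * np.arange(2000)) + 0.05 * rng.normal(size=2000)
x[1500:1505] += 3.0                       # inject a small anomaly for illustration

h = 32                                    # window length
W = sliding_window_view(x, h)             # shape (n_windows, h)

# An MLP trained to reproduce its input acts as an autoencoder; the 8-unit layer is the bottleneck
ae = MLPRegressor(hidden_layer_sizes=(16, 8, 16), max_iter=2000, random_state=0)
ae.fit(W[:1000], W[:1000])                # train only on the "good" first part of the series

err = ((ae.predict(W) - W) ** 2).mean(axis=1)        # per-window reconstruction error (saliency)
outlier_windows = err > err[:1000].mean() + 3 * err[:1000].std()
```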
Schematics of a feed-forward autoencoder. The original sliding window is encoded in a vector with (usually) smaller dimensionality (latent representation), which is then used to reconstruct the original time series.

Forecast-based

We can combine ideas from the previous slides and define an anomaly detection strategy which is based on a forecasting model.
Given a forecasting model $f(x_t)$ we can write:
y_{t+k} = f(x_t) + \epsilon = \hat{y}_{t+k} + \epsilon
If the future point shows too extreme a value of $\epsilon$, we can mark it as an outlier:
\left| y_{t+k} - \hat{y}_{t+k} \right| > \tau
Usually $\tau$ is chosen based on the assumed distribution of the error, for example a Gaussian; in this case we can use $3\sigma$, for example.
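A minimal sketch of the forecast-based rule using an autoregressive model from statsmodels; here the in-sample one-step-ahead residuals stand in for forecast errors, and the toy series, lag order, and 3σ threshold are assumptions:

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(7)
y = np.sin(0.3 * np.arange(600)) + 0.1 * rng.normal(size=600)
y[400] += 1.5                                     # inject an outlier

res = AutoReg(y, lags=10).fit()
eps = res.resid                                   # one-step-ahead in-sample errors
tau = 3 * eps.std()                               # 3-sigma threshold, assuming Gaussian errors
outlier_idx = np.where(np.abs(eps) > tau)[0] + 10 # shift by the number of lags
```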
From: Blazquez-Garcia et al., A review on outlier/anomaly detection in time series data, ACM, 2020

Data imputation in TS

Multivariate Imputation by Chained Equations

Multivariate Imputation by Chained Equations (MICE) is a popular method for imputing missing data in time series analysis.
The method involves using regression models to estimate the missing values of each variable in the time series, based on the other variables in the series. The process is repeated multiple times, with each iteration using the updated estimates of the missing values from the previous iterations to improve the accuracy of the imputations. MICE has the advantage of being able to handle missing data in multiple variables simultaneously, and can provide accurate imputations even when the missing data is non-random or correlated with other variables in the series.
Multivariate Imputation by Chained Equations (MICE) method:
Create an imputation model for each variable with missing data.
Initialize the imputed values for each variable with missing data, e.g., with the mean.
Iteratively update the imputed values for each variable with missing data, using the other variables in the time series as predictors.
Repeat the previous step until convergence or a maximum number of iterations is reached.
Use the imputed values to perform subsequent analyses.
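A minimal sketch of chained-equation imputation using scikit-learn's IterativeImputer (a MICE-style estimator); the toy correlated data and missingness rate are assumptions:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the estimator)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 3))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=500)   # correlated channels help the imputation
X[rng.random(X.shape) < 0.1] = np.nan                  # knock out ~10% of the entries

# Each variable with missing data is regressed on the others, iterating until convergence
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```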

Expectation Maximization SVD

The EM-SVD algorithm is particularly well suited to seasonal data.
Normalize the daily embedded data by subtracting daily mean and dividing by the daily standard deviation
replace missing values with the mean values of the corresponding minute of the day
compute a truncated SVD decomposition of the matrix, using the first $n_p$ principal components
replace the missing data with the SVD approximation
repeat from point 3 until the algorithm converges. Convergence is reached when the mean change in missing data is below 1e-3.
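A minimal sketch of the iterative truncated-SVD fill (steps 2-5 above; the per-day normalization of step 1 is omitted for brevity); em_svd_impute is a hypothetical helper and the toy matrix is an assumption:

```python
import numpy as np

def em_svd_impute(M, n_p=3, tol=1e-3, max_iter=100):
    """Fill missing entries of a (days x samples-per-day) matrix M with a rank-n_p
    truncated-SVD reconstruction, starting from the column (time-of-day) means."""
    M = M.copy()
    missing = np.isnan(M)
    col_means = np.nanmean(M, axis=0)
    M[missing] = np.take(col_means, np.where(missing)[1])   # initial guess: per-column mean
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        approx = (U[:, :n_p] * s[:n_p]) @ Vt[:n_p]           # rank-n_p approximation
        delta = np.abs(approx[missing] - M[missing]).mean()  # mean change in the missing entries
        M[missing] = approx[missing]
        if delta < tol:
            break
    return M

# Toy usage: a daily-seasonal matrix with ~10% missing entries
rng = np.random.default_rng(0)
M = 5 + np.sin(2 * np.pi * np.arange(24) / 24) + 0.3 * rng.normal(size=(60, 24))
M[rng.random(M.shape) < 0.1] = np.nan
M_filled = em_svd_impute(M, n_p=3)
```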