Bias-variance decomposition

The bias-variance tradeoff arises from a frequentist viewpoint on model complexity (as opposed to the Bayesian one). The decomposition helps us understand how complexity affects generalization. Suppose we have a model $\hat{y}(x)$ and a dataset of inputs $x_i$ and outputs $y_i$, $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$. We would like to find a meaningful decomposition of the expected squared loss when we repeatedly train our model on different datasets $\mathcal{D}$: $\mathbb{E}_{\mathcal{D}}\mathcal{L}(\hat{y}(x), y) = \mathbb{E}_{\mathcal{D}}\mathbb{E}_x\big((\hat{y}(x) - y)^2\big) = \mathbb{E}_x\mathbb{E}_{\mathcal{D}}\big((\hat{y}(x) - y)^2\big)$. We further assume that the observations $y = t + \epsilon$ are the sum of a true target $t$ and intrinsic noise $\epsilon$ with zero mean and variance $\sigma^2$.
Let’s start by considering the standard variance decomposition of a random variable $x$:
\begin{align}
\mathbb{E}(x-\mathbb{E}x)^2 &= \mathbb{E}x^2 - 2\,\mathbb{E}(x\,\mathbb{E}x) + (\mathbb{E}x)^2\\
&= \mathbb{E}x^2 - (\mathbb{E}x)^2
\end{align}
where the expectation is taken over the probability distribution $p(x)$ of $x$. Rearranging the terms, we get an expression for the expected square $\mathbb{E}x^2$:
\begin{equation}
\mathbb{E}x^2 = (\mathbb{E}x)^2 + \mathbb{E}(x-\mathbb{E}x)^2
\end{equation}
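As a quick sanity check of (3), sample moments satisfy the same identity exactly; here is a minimal NumPy sketch (the distribution and sample size are arbitrary choices):

```python
import numpy as np

# Numerical check of identity (3): E[x^2] = (E[x])^2 + E[(x - E[x])^2].
# With sample moments (1/N normalization) the identity holds exactly,
# up to floating-point rounding.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)

lhs = np.mean(x ** 2)
rhs = np.mean(x) ** 2 + np.mean((x - np.mean(x)) ** 2)
print(lhs, rhs)                  # the two values coincide up to rounding
assert np.isclose(lhs, rhs)
```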
Incidentally, since the variance term is non-negative, this identity yields $\mathbb{E}x^2 \geq (\mathbb{E}x)^2$, a special case of Jensen’s inequality for convex functions ($\mathbb{E}f(x) \geq f(\mathbb{E}x)$). Now, we can replace $x$ with the model error $y - \hat{y} = t + \epsilon - \hat{y}$ and use the above expression to obtain the expectation over different datasets $\mathcal{D}$ of the squared loss, while keeping the input $x_i$ fixed:
\begin{aligned}
\mathbb{E}_{\mathcal{D}}(t_i + \epsilon -\hat{y}(x_i))^2 &= \mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))^2 + \mathbb{E}_{\mathcal{D}}\epsilon^2 + 2\,\mathbb{E}_{\mathcal{D}}\big((t_i-\hat{y}(x_i))\,\epsilon\big) \\
&= \mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))^2 + \sigma^2
\end{aligned}
The last step holds because $\epsilon$ has zero mean and is independent of the training set, so the cross term vanishes and $\mathbb{E}\epsilon^2 = \sigma^2$.
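To illustrate why only $\sigma^2$ survives, here is a tiny simulation (the distribution used for the model error $t_i - \hat{y}(x_i)$ across datasets is an arbitrary stand-in): because the noise is zero-mean and independent of the trained model, the cross term averages to zero.

```python
import numpy as np

# With eps zero-mean and independent of the trained model, the cross term
# 2*E[(t - y_hat) * eps] averages to zero and only sigma^2 is added to the
# expected squared model error.
rng = np.random.default_rng(1)
sigma = 0.5
n_trials = 200_000

model_error = rng.normal(0.3, 1.0, size=n_trials)  # stand-in for t_i - y_hat(x_i) across datasets
eps = rng.normal(0.0, sigma, size=n_trials)        # intrinsic observation noise

lhs = np.mean((model_error + eps) ** 2)
rhs = np.mean(model_error ** 2) + sigma ** 2
print(lhs, rhs)                                    # agree up to Monte Carlo error
```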
Now, applying (3) to the first term:
\begin{align}
\mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))^2 &= \big(\mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))\big)^2 + \mathbb{E}_{\mathcal{D}}\big(\hat{y}(x_i) - t_i - \mathbb{E}_{\mathcal{D}}(\hat{y}(x_i) - t_i)\big)^2\\
&= \big(\mathbb{E}_{\mathcal{D}}(t_i-\hat{y}(x_i))\big)^2 + \mathbb{E}_{\mathcal{D}}\big(\hat{y}(x_i) - \mathbb{E}_{\mathcal{D}}\hat{y}(x_i)\big)^2\\
&= \mathrm{bias}^2 + \mathrm{variance}
\end{align}
where the $t_i$ terms cancel in the second line because the true target does not depend on the dataset.
Putting the two equations together, we get:
\mathbb{E}_{\mathcal{D}}(y_i-\hat{y}(x_i))^2 = \mathrm{bias}^2 + \mathrm{variance} + \sigma^2
The first term, the squared bias, measures how much the average prediction over all datasets differs from the desired regression function. The second term, the variance, measures how much the solutions for individual datasets vary around their average, i.e. how sensitive $\hat{y}(x_i)$ is to the particular choice of training set. The last term, $\sigma^2$, is the intrinsic noise of the observations and cannot be reduced by any choice of model.
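Finally, here is a minimal Monte Carlo sketch of the whole decomposition (the true function $t(x) = \sin(2\pi x)$, the noise level, the polynomial model and all sizes are assumed purely for illustration): it fits the same model class on many independently drawn training sets and compares the estimated $\mathrm{bias}^2 + \mathrm{variance} + \sigma^2$ with the expected squared error on noisy observations.

```python
import numpy as np

# Monte Carlo illustration of the decomposition: fit the same model class on
# many independently drawn datasets D and estimate bias^2, variance and noise
# at a grid of test inputs.
rng = np.random.default_rng(42)

def true_fn(x):
    return np.sin(2 * np.pi * x)       # assumed true regression function t(x)

sigma, n_train, n_datasets, degree = 0.3, 25, 500, 3
x_test = np.linspace(0.0, 1.0, 50)
t_test = true_fn(x_test)

# Train on n_datasets independent datasets and store the predictions on x_test.
preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x_train = rng.uniform(0.0, 1.0, size=n_train)
    y_train = true_fn(x_train) + rng.normal(0.0, sigma, size=n_train)
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    preds[d] = np.polyval(coeffs, x_test)

avg_pred = preds.mean(axis=0)                   # E_D[y_hat(x)]
bias2 = np.mean((t_test - avg_pred) ** 2)       # squared bias, averaged over x
variance = np.mean(preds.var(axis=0))           # variance, averaged over x

# Expected squared error against noisy test observations y = t + eps.
y_test = t_test + rng.normal(0.0, sigma, size=(n_datasets, x_test.size))
expected_error = np.mean((y_test - preds) ** 2)

print(f"bias^2 + variance + sigma^2 = {bias2 + variance + sigma**2:.4f}")
print(f"expected squared error      = {expected_error:.4f}")   # roughly equal
```

In this sketch, increasing the polynomial degree shifts the balance from bias to variance while their sum (plus the fixed $\sigma^2$) tracks the expected error, which is exactly the tradeoff the decomposition makes explicit.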